Core API

from PFASGroups import parse_smiles, generate_fingerprint

# Parse — returns PFASEmbeddingSet
results = parse_smiles(["CCCC(F)(F)F", "ClCCCl"])

# Fingerprint — returns (numpy.ndarray, dict)
fps, info = generate_fingerprint(["CCCC(F)(F)F", "ClCCCl"])
print(fps.shape)             # (2, 116)   — 116 groups, F only, binary
print(info['group_names'][:2])

Parsing

parse_smiles

PFASGroups.parse_smiles(smiles, output_format='list', limit_effective_graph_resistance=None, compute_component_metrics=True, halogens='F', form=None, saturation=None, progress=False, verbose=False, **kwargs)

Parse SMILES string(s) and return halogen group information.

Parameters:

smiles : str or list of str

Single SMILES string or list of SMILES strings

output_format : str, default 'list'

Output format:

  • 'list': returns nested lists of tuples (default)
  • 'dataframe': returns a pandas DataFrame with one row per match
  • 'csv': returns a CSV string

limit_effective_graph_resistance : int or None, default None

Maximum component size for computing effective graph resistance:

  • None: compute for all components (default; may be slow for large molecules)
  • int > 0: only compute for components with fewer atoms than this limit
  • 0: skip computation for all components (values are set to NaN)

compute_component_metrics : bool, default True

Whether to compute graph metrics (diameter, radius, etc.) for components:

  • True: compute all metrics (default)
  • False: only compute component size and skip all other metrics

halogens : str or list of str or None, default 'F'

Filter components by halogen element:

  • 'F' (default): fluorine only
  • str (e.g. 'Cl'): restrict to that single halogen
  • list (e.g. ['F', 'Cl']): restrict to those halogens
  • None: no filter (include all halogens)

form : str or list of str, optional

Filter components by form type (e.g., 'alkyl', ['alkyl', 'cyclic'], or None for all)

saturation : str or list of str, optional

Filter components by saturation (e.g., 'per', 'poly', or None for all)

progress : bool, default False

If True, display a tqdm progress bar during parsing.

verbose : bool, default False

If True, collect fragmentation events from fragment_until_valence_is_correct() for each SMILES that could not be sanitised directly. When True, the function returns a 2-tuple (result, verbose_info), where verbose_info is a dict with keys:

  • fragmented: list of per-SMILES dicts {'smiles': str, 'events': list, 'n_fragments': int}, one for every SMILES that triggered fragmentation. Each event dict contains atom_idx, atom_symbol, error, n_fragments, and smiles.
  • n_invalid: count of SMILES that could not be converted to a valid RDKit molecule even after attempted fragmentation.

**kwargs : dict

Additional parameters (pfas_groups, componentSmartss, etc.)

Returns:

list, pandas.DataFrame, or str

Depends on output_format parameter (or 2-tuple when verbose=True)

Parameters:

| Parameter | Default | Description |
|---|---|---|
| smiles | required | Single SMILES string or list of SMILES strings |
| halogens | 'F' | Halogen(s) to detect; string or list of strings |
| saturation | None | 'per', 'poly', or None (all) |
| form | None | Structural form filter; None = all forms |
| compute_component_metrics | True | Compute effective graph resistance and related per-component metrics |
| limit_effective_graph_resistance | None | Max atoms for which EGR is computed; None = unlimited |
| include_PFAS_definitions | False | Classify each molecule against the five PFAS definitions |
| halogen_groups | None | Custom list of HalogenGroup instances |

Returns: PFASEmbeddingSet

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "ClCCCl"])
for mol in results:
    print(mol.smiles, len(mol.matches))
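
When verbose=True, the description above says parse_smiles returns a 2-tuple (result, verbose_info). The sketch below shows one way such a dict could be summarised. It does not call the library: sample_info is a hand-written stand-in that only follows the documented keys (fragmented, n_invalid), and summarise_verbose_info is a hypothetical helper, not part of PFASGroups.

```python
def summarise_verbose_info(verbose_info):
    """Return (n_fragmented, n_events, n_invalid) from a verbose_info-style dict."""
    fragmented = verbose_info.get("fragmented", [])
    # Each entry lists the fragmentation events recorded for one input SMILES.
    n_events = sum(len(entry.get("events", [])) for entry in fragmented)
    return len(fragmented), n_events, verbose_info.get("n_invalid", 0)


# Hand-written sample mimicking the documented structure (not real library output).
sample_info = {
    "fragmented": [
        {
            "smiles": "<input SMILES>",
            "events": [
                {
                    "atom_idx": 3,
                    "atom_symbol": "N",
                    "error": "<sanitisation message>",
                    "n_fragments": 2,
                    "smiles": "<fragment SMILES>",
                },
            ],
            "n_fragments": 2,
        },
    ],
    "n_invalid": 1,
}

n_fragmented, n_events, n_invalid = summarise_verbose_info(sample_info)
print(n_fragmented, n_events, n_invalid)  # 1 1 1
```

In real use the dict would come from something like `result, verbose_info = parse_smiles(smiles, verbose=True)`, as described in the verbose parameter entry.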

parse_mols

PFASGroups.parse_mols(mols, output_format='list', include_PFAS_definitions=True, limit_effective_graph_resistance=None, compute_component_metrics=True, halogens='F', form=None, saturation=None, progress=False, _smiles_list=None, **kwargs)

Parse RDKit molecule(s) and return halogen group information.

Parameters:

mols : list of rdkit.Chem.Mol

Single RDKit molecule or list of molecules

output_format : str, default 'list'

Output format:

  • 'list': returns nested lists of tuples (default)
  • 'dataframe': returns a pandas DataFrame with one row per match
  • 'csv': returns a CSV string

limit_effective_graph_resistance : int or None, default None

Maximum component size for computing effective graph resistance:

  • None: compute for all components (default; may be slow for large molecules)
  • int > 0: only compute for components with fewer atoms than this limit
  • 0: skip computation for all components (values are set to NaN)

compute_component_metrics : bool, default True

Whether to compute graph metrics (diameter, radius, etc.) for components:

  • True: compute all metrics (default)
  • False: only compute component size and skip all other metrics

halogens : str or list of str or None, default 'F'

Filter components by halogen element:

  • 'F' (default): fluorine only
  • str (e.g. 'Cl'): restrict to that single halogen
  • list (e.g. ['F', 'Cl']): restrict to those halogens
  • None: no filter (include all halogens)

form : str or list of str, optional

Filter components by form type (e.g., 'alkyl', ['alkyl', 'cyclic'], or None for all)

saturation : str or list of str, optional

Filter components by saturation (e.g., 'per', 'poly', or None for all)

progress : bool, default False

If True, display a tqdm progress bar during parsing.

**kwargs : dict

Additional parameters (halogen_groups, componentSmartss, etc.)

Returns:

list, pandas.DataFrame, or str

Depends on output_format parameter

Like parse_smiles() but accepts RDKit Mol objects directly:

from rdkit import Chem
from PFASGroups import parse_mols

mols = [Chem.MolFromSmiles(s) for s in ["CCCC(F)(F)F", "ClCCCl"]]
results = parse_mols(mols)
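
The three cases documented for limit_effective_graph_resistance (None, a positive int, and 0) can be read as a single predicate on component size. The sketch below restates that rule in plain Python. should_compute_egr is a hypothetical name used only for illustration, and the library's internal check may be organised differently.

```python
def should_compute_egr(n_atoms, limit=None):
    """Decide whether effective graph resistance is computed for a component.

    Mirrors the documented rule:
      - limit is None -> always compute
      - limit == 0    -> never compute (the value is reported as NaN)
      - limit > 0     -> compute only if the component has fewer atoms than limit
    """
    if limit is None:
        return True
    if limit == 0:
        return False
    return n_atoms < limit


for size in (5, 20, 50):
    print(size, should_compute_egr(size, limit=20))
# 5 True
# 20 False
# 50 False
```

Note that the comparison is strict: a component with exactly `limit` atoms is skipped, following the "fewer atoms than this limit" wording.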

parse_mol

PFASGroups.parse_mol(mol, progress=False, **kwargs)

Wrapper for parse_mols to handle single molecule input.

Returns a single PFASEmbedding when output_format='list' (default), preserving backwards-compatible dict behaviour while enabling richer navigation helpers.

Parse a single RDKit molecule. Returns a MoleculeResult.

Fingerprinting

generate_fingerprint

PFASGroups.generate_fingerprint(smiles, *, selected_groups=None, representation='vector', component_metrics=None, halogens='F', saturation='per', count_mode=None, **kwargs)

Generate a fingerprint vector for one or more SMILES strings.

Parameters:
  • smiles (str or list of str) – Input SMILES.

  • halogens (str or list of str, default 'F') – Halogens to include in the fingerprint.

  • saturation (str or None, default 'per') – Saturation filter.

  • count_mode (str or None) – Fingerprint mode, e.g. ‘binary’ or ‘count’. When provided, overrides component_metrics.

  • component_metrics (list of str or None) – Explicit list of per-component metrics.

Returns:

array : EmbeddingArray, 1-D for a single SMILES, 2-D for a list.

column_names : list of str.

Return type:

tuple (array, column_names)

Parameters:

| Parameter | Default | Description |
|---|---|---|
| smiles | required | Single SMILES string or list of SMILES strings |
| selected_groups | None | 0-based indices of groups to include (list, range, or None = all) |
| representation | 'vector' | 'vector', 'dict', 'sparse', 'detailed', or 'int' |
| component_metrics | ['binary'] | List of metrics: count modes ('binary', 'count', 'max_component', 'total_component') and/or graph metrics (e.g. 'effective_graph_resistance') |
| halogens | 'F' | Halogen(s) to include; string or list of strings |
| saturation | 'per' | 'per', 'poly', or None |

Returns: (numpy.ndarray of shape (n_mols, n_cols), dict)

The second return value is a dict with keys: group_names (list), group_ids (list), selected_indices (list), halogens (list), saturation (str or None).

from PFASGroups import generate_fingerprint

# Default: 116-col binary, F only
fps, info = generate_fingerprint(["CCCC(F)(F)F", "ClCCCl"])
print(fps.shape)                   # (2, 116)
print(info['group_names'][:2])     # list of group name strings
print(info['halogens'])            # ['F']

# OECD groups only (indices 0–27)
fps_oecd, _ = generate_fingerprint(["CCCC(F)(F)F"], selected_groups=range(0, 28))
print(fps_oecd.shape)              # (1, 28)

# Multi-halogen: 4 × 116 = 464 columns
fps_all, _ = generate_fingerprint(["CCCC(F)(F)F"],
                                   halogens=['F', 'Cl', 'Br', 'I'])
print(fps_all.shape)               # (1, 464)
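
The column counts in the examples above (116, 28 and 464) follow from simple arithmetic: the number of selected groups times the number of halogens. The sketch below reproduces that arithmetic without calling the library. expected_n_columns is a hypothetical helper, the default of 116 groups is taken from the examples above, and the sketch assumes a single metric per group (the default binary mode); passing several component_metrics would presumably add further columns.

```python
def expected_n_columns(selected_groups=None, halogens="F", n_groups=116):
    """Estimate the fingerprint width for a single-metric fingerprint.

    Assumption: one column per (halogen, selected group) pair.
    """
    # A single halogen may be given as a plain string, as in the documented API.
    if isinstance(halogens, str):
        halogens = [halogens]
    n_selected = n_groups if selected_groups is None else len(list(selected_groups))
    return n_selected * len(halogens)


print(expected_n_columns())                                   # 116
print(expected_n_columns(selected_groups=range(0, 28)))       # 28
print(expected_n_columns(halogens=["F", "Cl", "Br", "I"]))    # 464
```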

Group library

get_compiled_HalogenGroups

PFASGroups.get_compiled_HalogenGroups(**kwargs)

Return compiled HalogenGroup instances (compute=True groups only).

Unlike get_HalogenGroups() which returns raw JSON dicts, this function returns ready-to-use HalogenGroup instances that can be directly passed to parse_smiles() or extended with custom groups.

Returns:

All compiled groups, suitable for passing as pfas_groups to parse_smiles() or PFASFingerprint.

Return type:

list of HalogenGroup

Examples

>>> from PFASGroups import get_compiled_HalogenGroups, HalogenGroup, parse_smiles
>>> groups = get_compiled_HalogenGroups()
>>> groups.append(HalogenGroup(
...     id=200, name="Perfluoroalkyl nitrates",
...     smarts={"[C$(C[ON+](=O)[O-])]": 1},
...     componentSaturation="per", componentHalogens="F",
...     componentForm="alkyl",
...     constraints={"eq": {"N": 1}, "gte": {"F": 1}},
... ))
>>> results = parse_smiles(["FC(F)(F)C(F)(F)ON(=O)=O"], pfas_groups=groups)

Returns a list of compiled HalogenGroup instances (116 groups with compute=True):

from PFASGroups import get_compiled_HalogenGroups

groups = get_compiled_HalogenGroups()
print(len(groups))      # 116
print(groups[0].name)

get_HalogenGroups

PFASGroups.get_HalogenGroups(**kwargs)

Returns raw JSON-like dicts (internal format). Prefer get_compiled_HalogenGroups() for most uses.

load_HalogenGroups

PFASGroups.load_HalogenGroups()

Adds default HalogenGroups to function

Compile a list of raw group dicts into HalogenGroup instances.

get_PFASDefinitions

PFASGroups.get_PFASDefinitions(**kwargs)

Returns the list of PFASDefinition objects.

get_componentSMARTSs

PFASGroups.get_componentSMARTSs()

Returns component-level SMARTS patterns used for saturation filtering.

Molecule prioritization

prioritise_molecules

PFASGroups.prioritise_molecules(molecules: List[str] | List[Mol] | PFASEmbeddingSet, reference: List[str] | List[Mol] | PFASEmbeddingSet | None = None, group_selection: str = 'all', count_mode: str = 'max_component', halogens: str | List[str] = 'F', saturation: str | None = None, a: float = 1.0, b: float = 1.0, percentile: float = 90.0, return_scores: bool = True, ascending: bool = False, progress: bool = False) -> PFASEmbeddingSet | Tuple[PFASEmbeddingSet, ndarray]

Prioritize PFAS molecules based on similarity to a reference or intrinsic properties.

Parameters:
  • molecules (list of str, list of rdkit.Chem.Mol, or PFASEmbeddingSet) – Molecules to prioritize. Can be:

      - List of SMILES strings
      - List of RDKit molecule objects
      - PFASEmbeddingSet object (pre-computed results)

  • reference (list of str, list of rdkit.Chem.Mol, PFASEmbeddingSet, or None) – Reference molecules for similarity comparison. If provided, molecules are prioritized by distributional similarity (lower KL divergence = higher priority). If None, prioritization is based on intrinsic fluorinated component properties.

  • group_selection (str, default 'all') – PFAS group selection for fingerprint generation when using reference:

      - 'all': All 115 groups (OECD + generic)
      - 'oecd': OECD-defined groups (1-28)
      - 'generic': Generic functional groups (29-115)
      - 'telomers': Telomer-related groups
      - 'generic+telomers': Combined selection

  • count_mode (str, default 'max_component') – Fingerprint encoding mode when using reference:

      - 'binary': 1 if present, 0 if absent
      - 'count': Number of matches
      - 'max_component': Maximum component size

  • halogens (str or list of str, default 'F') – Which halogen(s) to include when generating fingerprints for reference comparison. Passed directly to PFASEmbeddingSet.to_fingerprint.

  • saturation (str or None, default None) – Saturation filter applied to component SMARTS when generating fingerprints. None (default) includes both per- and polyfluorinated / polyhalogenated components, which gives the broadest coverage and avoids zero scores for candidates that only contain polyfluorinated chains. Pass 'per' or 'poly' to restrict.

  • a (float, default 1.0) – Weight for total fluorinated component size (sum of all component sizes). Used when reference is None. Higher values prioritize molecules with more total fluorination.

  • b (float, default 1.0) – Weight for component size percentile. Used when reference is None. Higher values prioritize molecules with larger individual fluorinated components.

  • percentile (float, default 90.0) – Percentile value (0-100) for component size distribution. Used when reference is None. Common values:

      - 90.0: Focus on largest 10% of components
      - 75.0: Focus on largest 25% of components
      - 50.0: Median component size

  • return_scores (bool, default True) – If True, returns tuple of (prioritized_results, scores). If False, returns only prioritized_results.

  • ascending (bool, default False) – Sort order. If False (default), highest priority first. If True, lowest priority first.

Returns:

If return_scores=True: (prioritized_results, scores). If return_scores=False: prioritized_results only.

prioritized_results : PFASEmbeddingSet

Molecules sorted by priority

scores : np.ndarray

Priority scores for each molecule

Return type:

PFASEmbeddingSet or tuple

Examples

# Priority by similarity to reference list
>>> from PFASGroups import prioritise_molecules
>>> inventory = ["FC(F)(F)C(F)(F)C(=O)O", "FC(F)(F)C(F)(F)C(F)(F)C(=O)O"]
>>> reference = ["FC(F)(F)C(F)(F)C(=O)O"]  # Known priority compounds
>>> results, scores = prioritise_molecules(inventory, reference=reference)
>>> print(f"Most similar: {results[0]['smiles']}")

# Priority by fluorination characteristics
>>> results, scores = prioritise_molecules(
...     inventory,
...     a=1.0,  # Weight for total fluorination
...     b=2.0,  # Weight for largest components
...     percentile=90
... )
>>> print(f"Highest priority: {results[0]['smiles']}")

# Emphasize total fluorine content
>>> results = prioritise_molecules(inventory, a=2.0, b=0.5, return_scores=False)

# Focus on molecules with longest chains (return_scores defaults to True, so unpack the tuple)
>>> results, scores = prioritise_molecules(inventory, a=0.5, b=2.0, percentile=95)

Notes

Reference-based prioritization:

Uses cosine similarity between each candidate’s fingerprint vector and the mean fingerprint of the reference set:

score_i = (fp_i · mean_ref) / (||fp_i|| × ||mean_ref||)

  • Higher cosine similarity = more similar group profile to reference = higher priority

  • Molecules that activate the same PFAS groups as the reference rank highest

  • Molecules with no group matches receive score 0
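
The cosine-similarity rule above can be illustrated on toy binary fingerprints. This is a minimal sketch in plain Python, not the library's implementation: the four-column vectors, the candidate names A, B and C, and the helpers cosine_similarity and mean_vector are all made up for illustration, and the zero-norm guard reflects the statement that molecules with no group matches score 0.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy binary fingerprints over four hypothetical groups.
reference = [[1, 1, 0, 0], [1, 0, 0, 0]]
candidates = {
    "A": [1, 1, 0, 0],  # activates the same groups as the reference
    "B": [0, 0, 1, 0],  # activates a group absent from the reference
    "C": [0, 0, 0, 0],  # no group matches at all
}

mean_ref = mean_vector(reference)
scores = {name: cosine_similarity(fp, mean_ref) for name, fp in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # highest score first

print(ranking[0], round(scores["A"], 4))  # A 0.9487
```

Candidate A ranks first because it shares both reference groups, while B and C both score 0: B has no overlap with the reference and C has an all-zero fingerprint.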

Intrinsic prioritization (no reference):

Score = a × Σ(component_sizes) + b × percentile(component_sizes, p)

Where:

  • Σ(component_sizes): Total number of fluorinated carbons across all components
  • percentile(component_sizes, p): Size of fluorinated components at the pth percentile

This approach prioritizes molecules based on:

  • Total fluorination burden (a parameter)
  • Presence of long perfluorinated chains (b and percentile parameters)
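
As a worked example of the intrinsic score, the sketch below evaluates a × Σ(component_sizes) + b × percentile(component_sizes, p) for a made-up list of component sizes. Two points are assumptions rather than documented behaviour: the percentile uses linear interpolation between ranks (the default of numpy.percentile), and a molecule with no fluorinated components scores 0. intrinsic_score and percentile_linear are hypothetical names.

```python
import math


def percentile_linear(values, p):
    """p-th percentile (0-100) with linear interpolation between sorted ranks."""
    data = sorted(values)
    rank = (len(data) - 1) * p / 100.0
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (rank - lower)


def intrinsic_score(component_sizes, a=1.0, b=1.0, p=90.0):
    """Score = a * sum(sizes) + b * percentile(sizes, p); 0.0 when there are no components."""
    if not component_sizes:
        return 0.0
    return a * sum(component_sizes) + b * percentile_linear(component_sizes, p)


sizes = [2, 4, 8]  # made-up fluorinated component sizes for one molecule

# sum = 14; the 90th percentile interpolates between 4 and 8: 4 + 0.8 * 4 = 7.2
print(round(intrinsic_score(sizes), 2))                # 21.2
# Doubling b puts more weight on the largest components: 14 + 2 * 7.2
print(round(intrinsic_score(sizes, a=1.0, b=2.0), 2))  # 28.4
```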

Tuning guidelines:

  • For environmental persistence concerns: high b, high percentile (e.g., b=2.0, p=90) to target long-chain compounds

  • For bioaccumulation potential: balanced a and b (e.g., a=1.0, b=1.0) to weigh both total fluorination and chain length

  • For screening priority: high a, moderate b (e.g., a=2.0, b=1.0, p=75) to target total fluorine load

See also

PFASEmbeddingSet.to_array : Convert results to fingerprints

PFASEmbedding.compare_kld : Compare embedding distributions

See Prioritization for detailed documentation.