Core API
from PFASGroups import parse_smiles, generate_fingerprint
# Parse — returns PFASEmbeddingSet
results = parse_smiles(["CCCC(F)(F)F", "ClCCCl"])
# Fingerprint — returns (numpy.ndarray, dict)
fps, info = generate_fingerprint(["CCCC(F)(F)F", "ClCCCl"])
print(fps.shape) # (2, 116) — 116 groups, F only, binary
print(info['group_names'][:2])
Parsing
parse_smiles
- PFASGroups.parse_smiles(smiles, output_format='list', limit_effective_graph_resistance=None, compute_component_metrics=True, halogens='F', form=None, saturation=None, progress=False, verbose=False, **kwargs)
Parse SMILES string(s) and return halogen group information.
Parameters:
- smiles : str or list of str
  Single SMILES string or list of SMILES strings.
- output_format : str, default 'list'
  Output format:
  - 'list': returns nested lists of tuples (default)
  - 'dataframe': returns a pandas DataFrame with one row per match
  - 'csv': returns a CSV string
- limit_effective_graph_resistance : int or None, default None
  Maximum component size for computing effective graph resistance:
  - None: compute for all components (default; may be slow for large molecules)
  - int > 0: compute only for components with fewer atoms than this limit
  - 0: skip the computation for all components (values are set to NaN)
- compute_component_metrics : bool, default True
  Whether to compute graph metrics (diameter, radius, etc.) for components:
  - True: compute all metrics (default)
  - False: compute only the component size and skip all other metrics
- halogens : str or list of str or None, default 'F'
  Filter components by halogen element:
  - 'F' (default): fluorine only
  - str (e.g. 'Cl'): restrict to that single halogen
  - list (e.g. ['F', 'Cl']): restrict to those halogens
  - None: no filter (include all halogens)
- form : str or list of str, optional
  Filter components by form type (e.g., 'alkyl', ['alkyl', 'cyclic'], or None for all).
- saturation : str or list of str, optional
  Filter components by saturation (e.g., 'per', 'poly', or None for all).
- progress : bool, default False
  If True, display a tqdm progress bar during parsing.
- verbose : bool, default False
  If True, collect fragmentation events from fragment_until_valence_is_correct() for each SMILES that could not be sanitised directly. The function then returns a 2-tuple (result, verbose_info), where verbose_info is a dict with keys:
  - fragmented: list of per-SMILES dicts {'smiles': str, 'events': list, 'n_fragments': int}, one for every SMILES that triggered fragmentation. Each event dict contains atom_idx, atom_symbol, error, n_fragments, and smiles.
  - n_invalid: count of SMILES that could not be converted to a valid RDKit molecule even after attempted fragmentation.
- **kwargs : dict
  Additional parameters (pfas_groups, componentSmartss, etc.)
Returns:
- list, pandas.DataFrame, or str
Depends on output_format parameter (or 2-tuple when verbose=True)
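When verbose=True, the second element of the returned tuple can be inspected with plain dictionary access. The sketch below runs on a hand-built verbose_info dict that follows the keys documented above; the sample values and the helper name summarise_verbose_info are illustrative assumptions, not output produced by the library.

```python
def summarise_verbose_info(verbose_info):
    """Return (n_fragmented, n_invalid, per_smiles) from a verbose_info dict."""
    fragmented = verbose_info.get("fragmented", [])
    # Map each fragmented SMILES to the (atom index, atom symbol) of its events.
    per_smiles = {
        entry["smiles"]: [(ev["atom_idx"], ev["atom_symbol"]) for ev in entry["events"]]
        for entry in fragmented
    }
    return len(fragmented), verbose_info.get("n_invalid", 0), per_smiles


# Hand-built sample shaped like the documented structure (values are made up).
sample_info = {
    "fragmented": [
        {
            "smiles": "C[N](C)(C)(C)C",
            "n_fragments": 2,
            "events": [
                {"atom_idx": 1, "atom_symbol": "N", "error": "valence",
                 "n_fragments": 2, "smiles": "C[N](C)(C)(C)C"},
            ],
        }
    ],
    "n_invalid": 0,
}

n_fragmented, n_invalid, per_smiles = summarise_verbose_info(sample_info)
print(n_fragmented, n_invalid, per_smiles)
```

With the real API the dict would come from a call such as `result, verbose_info = parse_smiles(smiles, verbose=True)`.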
Parameters:
| Parameter | Default | Description |
|---|---|---|
| smiles | required | Single SMILES string or list of SMILES strings |
| halogens | 'F' | Halogen(s) to detect; string or list of strings |
| form | None | Structural form filter; |
| compute_component_metrics | True | Compute effective graph resistance and related per-component metrics |
| limit_effective_graph_resistance | None | Max atoms for which EGR is computed; |
| include_PFAS_definitions | True | Classify each molecule against the five PFAS definitions |
| pfas_groups | | Custom list of |
Returns: PFASEmbeddingSet
from PFASGroups import parse_smiles
results = parse_smiles(["CCCC(F)(F)F", "ClCCCl"])
for mol in results:
    print(mol.smiles, len(mol.matches))
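The halogens argument accepts a string, a list, or None. A minimal sketch of how those three forms could be normalised into a list of element symbols is shown below. The helper normalise_halogens and the constant ALL_HALOGENS are hypothetical names, and treating "all halogens" as F, Cl, Br and I is an assumption based on the multi-halogen fingerprint example in this reference.

```python
# Assumption: "all halogens" means the four elements used elsewhere in these docs.
ALL_HALOGENS = ["F", "Cl", "Br", "I"]


def normalise_halogens(halogens="F"):
    """Map the documented `halogens` argument forms onto a list of element symbols."""
    if halogens is None:
        return list(ALL_HALOGENS)  # None: no filter, include every halogen
    if isinstance(halogens, str):
        return [halogens]          # single symbol such as 'F' or 'Cl'
    return list(halogens)          # already a list of symbols


default_case = normalise_halogens()
single_case = normalise_halogens("Cl")
list_case = normalise_halogens(["F", "Cl"])
none_case = normalise_halogens(None)
print(default_case, single_case, list_case, none_case)
```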
parse_mols
- PFASGroups.parse_mols(mols, output_format='list', include_PFAS_definitions=True, limit_effective_graph_resistance=None, compute_component_metrics=True, halogens='F', form=None, saturation=None, progress=False, _smiles_list=None, **kwargs)
Parse RDKit molecule(s) and return halogen group information.
Parameters:
- mols : list of rdkit.Chem.Mol
  Single RDKit molecule or list of molecules.
- output_format : str, default 'list'
  Output format:
  - 'list': returns nested lists of tuples (default)
  - 'dataframe': returns a pandas DataFrame with one row per match
  - 'csv': returns a CSV string
- limit_effective_graph_resistance : int or None, default None
  Maximum component size for computing effective graph resistance:
  - None: compute for all components (default; may be slow for large molecules)
  - int > 0: compute only for components with fewer atoms than this limit
  - 0: skip the computation for all components (values are set to NaN)
- compute_component_metrics : bool, default True
  Whether to compute graph metrics (diameter, radius, etc.) for components:
  - True: compute all metrics (default)
  - False: compute only the component size and skip all other metrics
- halogens : str or list of str or None, default 'F'
  Filter components by halogen element:
  - 'F' (default): fluorine only
  - str (e.g. 'Cl'): restrict to that single halogen
  - list (e.g. ['F', 'Cl']): restrict to those halogens
  - None: no filter (include all halogens)
- form : str or list of str, optional
  Filter components by form type (e.g., 'alkyl', ['alkyl', 'cyclic'], or None for all).
- saturation : str or list of str, optional
  Filter components by saturation (e.g., 'per', 'poly', or None for all).
- progress : bool, default False
  If True, display a tqdm progress bar during parsing.
- **kwargs : dict
  Additional parameters (halogen_groups, componentSmartss, etc.)
Returns:
- list, pandas.DataFrame, or str
Depends on output_format parameter
Like parse_smiles() but accepts RDKit Mol objects directly:
from rdkit import Chem
from PFASGroups import parse_mols
mols = [Chem.MolFromSmiles(s) for s in ["CCCC(F)(F)F", "ClCCCl"]]
results = parse_mols(mols)
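The three documented settings of limit_effective_graph_resistance (None, a positive integer, or 0) can be read as a single gate on component size. The following sketch illustrates that reading only; should_compute_egr and gated_metric are hypothetical helpers, and the metric passed in is a stand-in rather than the library's effective graph resistance routine.

```python
import math


def should_compute_egr(n_atoms, limit=None):
    """Decide whether the metric is computed for a component with n_atoms atoms."""
    if limit is None:
        return True          # None: compute for every component
    return n_atoms < limit   # limit == 0 is never satisfied, so everything is skipped


def gated_metric(n_atoms, limit, metric_fn):
    """Return metric_fn(n_atoms) when allowed, otherwise NaN as described above."""
    return metric_fn(n_atoms) if should_compute_egr(n_atoms, limit) else math.nan


# Stand-in metric: simply echoes the component size as a float.
small = gated_metric(10, 20, float)
large = gated_metric(30, 20, float)
print(small, large)
```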
parse_mol
- PFASGroups.parse_mol(mol, progress=False, **kwargs)
Wrapper for parse_mols to handle single molecule input.
Returns a single PFASEmbedding when output_format='list' (default), preserving backwards-compatible dict behaviour while enabling richer navigation helpers.
Parse a single RDKit molecule. Returns a MoleculeResult.
Fingerprinting
generate_fingerprint
- PFASGroups.generate_fingerprint(smiles, *, selected_groups=None, representation='vector', component_metrics=None, halogens='F', saturation='per', count_mode=None, **kwargs)
Generate a fingerprint vector for one or more SMILES strings.
- Parameters:
halogens (str or list of str, default 'F') – Halogens to include in the fingerprint.
saturation (str or None, default 'per') – Saturation filter.
count_mode (str or None) – Fingerprint mode, e.g. ‘binary’ or ‘count’. When provided, overrides component_metrics.
component_metrics (list of str or None) – Explicit list of per-component metrics.
- Returns:
array : EmbeddingArray, 1-D for a single SMILES, 2-D for a list.
column_names : list of str.
- Return type:
tuple (array, column_names)
Parameters:
| Parameter | Default | Description |
|---|---|---|
| smiles | required | Single SMILES string or list of SMILES strings |
| selected_groups | None | 0-based indices of groups to include (list, range, or |
| component_metrics | None | List of metrics: count modes ( |
| halogens | 'F' | Halogen(s) to include; string or list of strings |
Returns: (numpy.ndarray of shape (n_mols, n_cols), dict)
The second return value is a dict with keys:
group_names (list), group_ids (list), selected_indices (list),
halogens (list), saturation (str or None).
from PFASGroups import generate_fingerprint
# Default: 116-col binary, F only
fps, info = generate_fingerprint(["CCCC(F)(F)F", "ClCCCl"])
print(fps.shape) # (2, 116)
print(info['group_names'][:2]) # list of group name strings
print(info['halogens']) # ['F']
# OECD groups only (indices 0–27)
fps_oecd, _ = generate_fingerprint(["CCCC(F)(F)F"], selected_groups=range(0, 28))
print(fps_oecd.shape) # (1, 28)
# Multi-halogen: 4 × 116 = 464 columns
fps_all, _ = generate_fingerprint(["CCCC(F)(F)F"],
                                  halogens=['F', 'Cl', 'Br', 'I'])
print(fps_all.shape) # (1, 464)
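A common follow-up step is to turn one binary fingerprint row back into readable group names. The sketch below does this with toy data; active_groups is a hypothetical helper, and it assumes the list of names has exactly one entry per fingerprint column, which may not hold for multi-halogen fingerprints.

```python
def active_groups(fingerprint_row, group_names):
    """Return the names whose fingerprint entry is non-zero."""
    if len(fingerprint_row) != len(group_names):
        raise ValueError("fingerprint row and group_names must have the same length")
    return [name for value, name in zip(fingerprint_row, group_names) if value]


# Toy data standing in for one fingerprint row and its column names.
toy_names = ["group A", "group B", "group C"]
toy_row = [1, 0, 1]

matched = active_groups(toy_row, toy_names)
print(matched)
```

With the default single-halogen fingerprint this would be called as `active_groups(fps[0], info['group_names'])`.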
Group library
get_compiled_HalogenGroups
- PFASGroups.get_compiled_HalogenGroups(**kwargs)
Return compiled HalogenGroup instances (compute=True groups only).
Unlike get_HalogenGroups(), which returns raw JSON dicts, this function returns ready-to-use HalogenGroup instances that can be passed directly to parse_smiles() or extended with custom groups.
- Returns:
  All compiled groups, suitable for passing as pfas_groups to parse_smiles() or PFASFingerprint.
- Return type:
  list of HalogenGroup
Examples
>>> from PFASGroups import get_compiled_HalogenGroups, HalogenGroup, parse_smiles
>>> groups = get_compiled_HalogenGroups()
>>> groups.append(HalogenGroup(
...     id=200, name="Perfluoroalkyl nitrates",
...     smarts={"[C$(C[ON+](=O)[O-])]": 1},
...     componentSaturation="per", componentHalogens="F",
...     componentForm="alkyl",
...     constraints={"eq": {"N": 1}, "gte": {"F": 1}},
... ))
>>> results = parse_smiles(["FC(F)(F)C(F)(F)ON(=O)=O"], pfas_groups=groups)
Returns a list of compiled HalogenGroup instances
(116 groups with compute=True):
from PFASGroups import get_compiled_HalogenGroups
groups = get_compiled_HalogenGroups()
print(len(groups)) # 116
print(groups[0].name)
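The docstring example for custom groups passes a constraints dict of the form {"eq": {"N": 1}, "gte": {"F": 1}}. One plausible reading is that each key names a comparison applied to per-element atom counts. The sketch below implements that reading only; it is an assumption inferred from the key names, not the library's documented semantics, and satisfies is a hypothetical helper.

```python
import operator

# Assumed meaning of the constraint keys: "eq" is equality, "gte" is greater-or-equal.
_OPERATORS = {"eq": operator.eq, "gte": operator.ge}


def satisfies(constraints, element_counts):
    """Check every (operator, element, threshold) constraint against atom counts."""
    for op_name, requirements in constraints.items():
        compare = _OPERATORS[op_name]
        for element, threshold in requirements.items():
            if not compare(element_counts.get(element, 0), threshold):
                return False
    return True


constraints = {"eq": {"N": 1}, "gte": {"F": 1}}

# Atom counts for FC(F)(F)C(F)(F)ON(=O)=O, tallied by hand.
nitrate_counts = {"C": 2, "F": 5, "O": 3, "N": 1}
# A molecule with fluorine but no nitrogen fails the "eq" constraint on N.
no_nitrogen_counts = {"C": 2, "F": 5}

nitrate_ok = satisfies(constraints, nitrate_counts)
no_nitrogen_ok = satisfies(constraints, no_nitrogen_counts)
print(nitrate_ok, no_nitrogen_ok)
```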
get_HalogenGroups
Returns raw JSON-like dicts (internal format). Prefer
get_compiled_HalogenGroups() for most uses.
load_HalogenGroups
Compile a list of raw group dicts into HalogenGroup instances.
get_PFASDefinitions
Returns the list of PFASDefinition objects.
get_componentSMARTSs
Returns component-level SMARTS patterns used for saturation filtering.
Molecule prioritization
prioritise_molecules
- PFASGroups.prioritise_molecules(molecules: List[str] | List[Mol] | PFASEmbeddingSet, reference: List[str] | List[Mol] | PFASEmbeddingSet | None = None, group_selection: str = 'all', count_mode: str = 'max_component', halogens: str | List[str] = 'F', saturation: str | None = None, a: float = 1.0, b: float = 1.0, percentile: float = 90.0, return_scores: bool = True, ascending: bool = False, progress: bool = False) -> PFASEmbeddingSet | Tuple[PFASEmbeddingSet, ndarray]
Prioritize PFAS molecules based on similarity to a reference or intrinsic properties.
- Parameters:
- molecules (list of str, list of rdkit.Chem.Mol, or PFASEmbeddingSet) – Molecules to prioritize. Can be:
  - List of SMILES strings
  - List of RDKit molecule objects
  - PFASEmbeddingSet object (pre-computed results)
- reference (list of str, list of rdkit.Chem.Mol, PFASEmbeddingSet, or None) – Reference molecules for similarity comparison. If provided, molecules are prioritized by distributional similarity (lower KL divergence = higher priority). If None, prioritization is based on intrinsic fluorinated component properties.
- group_selection (str, default 'all') – PFAS group selection for fingerprint generation when using reference:
  - 'all': All 115 groups (OECD + generic)
  - 'oecd': OECD-defined groups (1-28)
  - 'generic': Generic functional groups (29-115)
  - 'telomers': Telomer-related groups
  - 'generic+telomers': Combined selection
- count_mode (str, default 'max_component') – Fingerprint encoding mode when using reference:
  - 'binary': 1 if present, 0 if absent
  - 'count': Number of matches
  - 'max_component': Maximum component size
- halogens (str or list of str, default 'F') – Which halogen(s) to include when generating fingerprints for reference comparison. Passed directly to PFASEmbeddingSet.to_fingerprint.
- saturation (str or None, default None) – Saturation filter applied to component SMARTS when generating fingerprints. None (default) includes both per- and polyfluorinated / polyhalogenated components, which gives the broadest coverage and avoids zero scores for candidates that only contain polyfluorinated chains. Pass 'per' or 'poly' to restrict.
- a (float, default 1.0) – Weight for total fluorinated component size (sum of all component sizes). Used when reference is None. Higher values prioritize molecules with more total fluorination.
- b (float, default 1.0) – Weight for component size percentile. Used when reference is None. Higher values prioritize molecules with larger individual fluorinated components.
- percentile (float, default 90.0) – Percentile value (0-100) for component size distribution. Used when reference is None. Common values:
  - 90.0: Focus on largest 10% of components
  - 75.0: Focus on largest 25% of components
  - 50.0: Median component size
- return_scores (bool, default True) – If True, returns tuple of (prioritized_results, scores). If False, returns only prioritized_results.
- ascending (bool, default False) – Sort order. If False (default), highest priority first. If True, lowest priority first.
- Returns:
  If return_scores=True: (prioritized_results, scores). If return_scores=False: prioritized_results only.
  - prioritized_results : PFASEmbeddingSet
    Molecules sorted by priority
  - scores : np.ndarray
    Priority scores for each molecule
- Return type:
  PFASEmbeddingSet or tuple of (PFASEmbeddingSet, numpy.ndarray)
Examples
# Priority by similarity to reference list
>>> from PFASGroups import prioritise_molecules
>>> inventory = ["FC(F)(F)C(F)(F)C(=O)O", "FC(F)(F)C(F)(F)C(F)(F)C(=O)O"]
>>> reference = ["FC(F)(F)C(F)(F)C(=O)O"]  # Known priority compounds
>>> results, scores = prioritise_molecules(inventory, reference=reference)
>>> print(f"Most similar: {results[0]['smiles']}")
# Priority by fluorination characteristics
>>> results, scores = prioritise_molecules(
...     inventory,
...     a=1.0,  # Weight for total fluorination
...     b=2.0,  # Weight for largest components
...     percentile=90
... )
>>> print(f"Highest priority: {results[0]['smiles']}")
# Emphasize total fluorine content
>>> results = prioritise_molecules(inventory, a=2.0, b=0.5, return_scores=False)
# Focus on molecules with longest chains
>>> results, scores = prioritise_molecules(inventory, a=0.5, b=2.0, percentile=95)
Notes
Reference-based prioritization:
Uses cosine similarity between each candidate’s fingerprint vector and the mean fingerprint of the reference set:
score_i = (fp_i · mean_ref) / (||fp_i|| × ||mean_ref||)
Higher cosine similarity = more similar group profile to reference = higher priority
Molecules that activate the same PFAS groups as the reference rank highest
Molecules with no group matches receive score 0
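The cosine formula above can be checked by hand on small vectors. The following is a minimal pure-Python sketch of that formula, not the library's implementation; the toy fingerprints and the helper names mean_vector and cosine_score are illustrative assumptions.

```python
import math


def mean_vector(vectors):
    """Column-wise mean of equally long fingerprint vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_score(fp, mean_ref):
    """Cosine similarity between fp and mean_ref; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(fp, mean_ref))
    norm = math.sqrt(sum(x * x for x in fp)) * math.sqrt(sum(y * y for y in mean_ref))
    return dot / norm if norm > 0 else 0.0


# Toy binary fingerprints over four hypothetical groups.
reference = [[1, 1, 0, 0], [1, 0, 0, 0]]
candidates = {
    "same groups": [1, 1, 0, 0],
    "other groups": [0, 0, 1, 1],
    "no matches": [0, 0, 0, 0],
}

mean_ref = mean_vector(reference)
scores = {name: cosine_score(fp, mean_ref) for name, fp in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

The candidate that activates the same groups as the reference ranks first, and the all-zero fingerprint receives a score of 0, matching the behaviour described above.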
Intrinsic prioritization (no reference):
Score = a × Σ(component_sizes) + b × percentile(component_sizes, p)
Where:
- Σ(component_sizes): total number of fluorinated carbons across all components
- percentile(component_sizes, p): size of fluorinated components at the p-th percentile
This approach prioritizes molecules based on:
- Total fluorination burden (a parameter)
- Presence of long perfluorinated chains (b and percentile parameters)
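The intrinsic score can be reproduced on toy component sizes to see how a and b interact. The sketch below is an illustration under stated assumptions: the percentile uses linear interpolation between sorted values (NumPy's default convention, which the library may or may not follow), molecules without components are assumed to score 0, and the component sizes are made up.

```python
import math


def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between sorted values."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def intrinsic_score(component_sizes, a=1.0, b=1.0, p=90.0):
    """Score = a * sum(sizes) + b * percentile(sizes, p)."""
    if not component_sizes:
        return 0.0  # assumption: no fluorinated components means a zero score
    return a * sum(component_sizes) + b * percentile(component_sizes, p)


# Hypothetical component sizes (fluorinated carbons per component).
two_short_chains = [4, 4]
one_long_chain = [8]

balanced = (intrinsic_score(two_short_chains), intrinsic_score(one_long_chain))
chain_weighted = (
    intrinsic_score(two_short_chains, b=2.0),
    intrinsic_score(one_long_chain, b=2.0),
)
print(balanced, chain_weighted)
```

Both toy molecules carry the same total fluorination, so the a term is identical; the single long chain scores higher through the percentile term, and raising b widens that gap.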
Tuning guidelines:
For environmental persistence concerns:
- High b, high percentile (e.g., b=2.0, p=90): Long-chain compounds
For bioaccumulation potential:
- Balanced a and b (e.g., a=1.0, b=1.0): Both total and chain length
For screening priority:
- High a, moderate b (e.g., a=2.0, b=1.0, p=75): Total fluorine load
See also
PFASEmbeddingSet.to_array: Convert results to fingerprints
PFASEmbedding.compare_kld: Compare embedding distributions
See Prioritization for detailed documentation.