Data Models
from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "OCCOCCO"])
mol = results[0]  # MoleculeResult
print(mol.smiles, mol.is_PFAS)
for match in mol.matches:  # GroupMatch objects
    print(match.group_name, match.group_id)
    for comp in match.components:  # MatchComponent objects
        print(" atoms:", comp.atoms)
arr = results.to_array()  # EmbeddingArray
print(arr.shape)  # (2, 116)
PFASEmbeddingSet
- class PFASGroups.PFASEmbeddingSet(iterable: Iterable[Dict[str, Any]] = ())[source]
Bases: list
List-like container for multiple PFASEmbedding results.
Subclasses list so existing code that iterates over results continues to work. Call to_array() to produce a (n_molecules, n_columns) matrix from all stored results.
- property matches: List[MatchView]
Flattened list of all MatchView objects across all molecules.
Some older code expects a matches attribute on a ResultsModel instance. Provide a read-only aggregated view by concatenating the per-molecule match lists.
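The flattening described above can be sketched without the library. This is a minimal illustration, assuming each result is a plain dict with a "matches" list; the class name `ResultList` and the dict layout are hypothetical and are not the actual PFASEmbeddingSet implementation.

```python
from typing import Any, Dict, List


class ResultList(list):
    """Hypothetical stand-in for a list-like result container."""

    @property
    def matches(self) -> List[Dict[str, Any]]:
        # Concatenate the per-molecule match lists into one flat, read-only view.
        return [m for result in self for m in result.get("matches", [])]


results = ResultList([
    {"smiles": "CCCC(F)(F)F", "matches": [{"group_name": "A"}, {"group_name": "B"}]},
    {"smiles": "OCCOCCO", "matches": []},
])

# Iteration and len() still behave like a normal list.
print(len(results), len(results.matches))
```

Because the container subclasses list, code that only iterates over results is unaffected by the extra property.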
- classmethod from_raw(results: Iterable[Dict[str, Any]]) PFASEmbeddingSet[source]
Wrap an existing list of result dicts without changing them.
- classmethod from_smiles(smiles: str | List[str], **kwargs) PFASEmbeddingSet[source]
Parse SMILES string(s) and return a PFASEmbeddingSet.
- classmethod from_mols(mols, **kwargs) PFASEmbeddingSet[source]
Parse RDKit molecules and return a PFASEmbeddingSet.
- Parameters:
mols (list of rdkit.Chem.Mol) – List of RDKit molecule objects.
**kwargs – Forwarded to parse_mols().
- classmethod from_inchis(inchis: List[str], **kwargs) PFASEmbeddingSet[source]
Parse InChI strings and return a PFASEmbeddingSet.
- reorder(indices: list | None = None, key: Callable[[PFASEmbedding], Any] = None, reverse: bool = False) PFASEmbeddingSet[source]
Return a new PFASEmbeddingSet with results reordered by a key function.
- Parameters:
- iter_group_matches(group_id: int | None = None, group_name: str | None = None) Iterator[Tuple[PFASEmbedding, MatchView]][source]
Iterate over all PFAS group matches across all molecules.
- plot_components_for_group(group_id: int | None = None, group_name: str | None = None, max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]
Plot all components for a specific PFAS group across molecules.
Either group_id or group_name (or both) can be provided to select the target group. Each panel corresponds to one molecule, with all its components for that group highlighted together.
- show(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image[source]
Show all component combinations in a grid plot.
Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.
Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.
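The panel-merging rule (components with the same highlighted atoms share one panel) can be pictured with a small grouping sketch. The function name `merge_panels` and the `(group_name, atoms)` tuple layout are assumptions made for illustration; the library's internal data structures may differ.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple


def merge_panels(
    components: List[Tuple[str, List[int]]],
) -> Dict[FrozenSet[int], List[str]]:
    """Group component entries by their set of highlighted atom indices."""
    panels: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for group_name, atoms in components:
        # frozenset ignores atom order, so [3, 4] and [4, 3] land in one panel.
        panels[frozenset(atoms)].append(group_name)
    return dict(panels)


components = [
    ("Perfluoroalkyl", [3, 4, 5, 6]),
    ("Polyfluoroalkyl", [6, 5, 4, 3]),
    ("Carboxylic acid", [0, 1, 2]),
]
panels = merge_panels(components)
print(len(panels))
```

In this sketch the first entry of each panel's list would be the one whose metadata picks the highlight colour, mirroring the description above.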
- plot(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image
Show all component combinations in a grid plot.
Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.
Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.
- to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]
Export this molecule result to a SQL database.
Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).
- Parameters:
filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.
dbname (str, optional) – Database name (for PostgreSQL/MySQL).
user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.
password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.
host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').
port (int, optional) – Database port. Defaults to os.environ['DB_PORT'] if set; otherwise 5432 for PostgreSQL.
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: ‘fail’, ‘replace’, or ‘append’.
- svg(filename: str, subwidth: int = 350, subheight: int = 350, ncols: int = 4) str[source]
Export all component combinations to an SVG file (vector graphics).
Components that share the same highlighted atoms within a molecule are merged into a single panel with a bullet-point legend.
- Parameters:
- Returns:
Path to the created SVG file.
- Return type:
- summarise() str[source]
Return a coloured text summary of the results.
The summary includes:
  - number of molecules
  - counts of PFAS group and definition matches
  - total number of components across all group matches
  - the most frequent PFAS groups (colour-coded by halogen)
- table() str[source]
Return a more detailed text table with one row per molecule.
The TSV table has the following columns: index (1-based), smiles, group_matches (count), definition_matches (count), and groups (per-molecule PFAS groups with counts, e.g. "Perfluoroalkyl (2); Polyfluoroalkyl (1)").
- classify() DataFrame[source]
Return a classification DataFrame with one row per molecule.
Each molecule is classified by MoleculeResult.classify().
- Returns:
Columns:
  - smiles: molecule SMILES.
  - category: classification label. OECD group name(s) if matched, otherwise "per-"/"poly-" + generic/telomeric group names (comma-separated).
  - total_component_size: sum of C-atom counts across all matched group components.
- Return type:
- summary() None[source]
Print a detailed coloured summary of matched groups across all molecules.
For each group, shows the component SMARTS type and, per component, the graph metrics: size (C-atom count), branching and mean eccentricity. Component size statistics (min, max, mean) are also shown.
- plot_all_components_with_group_colours(max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]
Plot all matched components, coloured by PFAS group.
Each panel corresponds to one molecule; atoms are highlighted with colours assigned per PFAS group. The legend lists the groups found in that molecule.
- to_sql_all(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]
Export all molecule results to a SQL database.
This method efficiently batches all molecules into the database in a single operation.
- Parameters:
conn (str or sqlalchemy.engine.Engine, optional) – Database connection. Can be:
  - SQLAlchemy Engine object
  - Connection string (e.g., 'postgresql://user:pass@host:port/db')
  - SQLite path with 'sqlite:///' prefix
filename (str, optional) – Path to SQLite database file (legacy parameter, use conn instead).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: ‘fail’, ‘replace’, or ‘append’.
Examples
>>> # Using connection string
>>> results.to_sql_all(conn='postgresql://user:pass@localhost/pfas_db')
>>>
>>> # Using SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///pfas.db')
>>> results.to_sql_all(conn=engine)
>>>
>>> # Using filename (legacy)
>>> results.to_sql_all(filename='pfas.db')
- to_fingerprint(group_selection: str = 'all', component_metrics: List[str] | None = None, selected_group_ids: List[int] | None = None, halogens: str | List[str] = 'F', saturation: str | None = 'per', molecule_metrics: List[str] | None = None, pfas_groups: List[Dict] | None = None, preset: str | None = None, count_mode: str | None = None, graph_metrics: List[str] | None = None, progress: bool = False, **kwargs) ndarray[source]
Deprecated. Use to_array() instead.
- property match_cache: PFASEmbeddingSet
Alias for the set itself (backward compat with PFASFingerprint API).
- get_embedding(**kwargs) EmbeddingArray[source]
Alias for to_array() (backward compat with PFASFingerprint API).
- to_array(component_metrics=<object object>, molecule_metrics=<object object>, group_selection=<object object>, selected_group_ids=<object object>, aggregation=<object object>, preset=<object object>, pfas_groups=<object object>, halogens=<object object>, progress: bool = True) EmbeddingArray[source]
Stack per-molecule embedding rows into a (n_mols, n_cols) matrix.
When called with no arguments, returns the last cached embedding (or the binary embedding by default on the first call). Pass explicit arguments to override and update the cache.
Parameters match those of PFASEmbedding.to_array(), plus:
- progress (bool, default True) – Show a tqdm progress bar while computing embeddings.
- compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') float[source]
Compare two sets using KL divergence on group-occurrence frequencies.
- Parameters:
other (PFASEmbeddingSet) – Second set to compare against.
method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).
- Returns:
KL divergence value (lower = more similar).
- Return type:
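To make the comparison concrete, the following sketch computes forward, reverse and symmetric KL divergence on group-occurrence frequencies. It is an illustration under stated assumptions, not the library's implementation: the helper names, the additive smoothing constant `eps`, and the choice of averaging the two directions for the symmetric variant are all hypothetical, and the 'minmax' normalisation is not reproduced because its definition is not given here.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def group_frequencies(
    group_ids: Iterable[int], vocabulary: List[int], eps: float = 1e-9
) -> Dict[int, float]:
    """Turn observed group IDs into a smoothed probability distribution."""
    counts = Counter(group_ids)
    # eps (an assumption) keeps every probability non-zero so log() is defined.
    total = sum(counts.values()) + eps * len(vocabulary)
    return {g: (counts.get(g, 0) + eps) / total for g in vocabulary}


def kld(p: Dict[int, float], q: Dict[int, float]) -> float:
    """KL divergence D(p || q) over a shared vocabulary."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)


def symmetric_kld(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Average of forward and reverse divergence (assumed definition)."""
    return 0.5 * (kld(p, q) + kld(q, p))


vocabulary = [1, 2, 3]
p = group_frequencies([1, 1, 2, 3], vocabulary)
q = group_frequencies([1, 2, 2, 3], vocabulary)
print(kld(p, q), kld(q, p), symmetric_kld(p, q))
```

Identical distributions give a divergence of zero, which matches the "lower = more similar" reading of the return value.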
- perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform PCA on the embedding matrix.
- Parameters:
n_components (int, default 2)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).
- Return type:
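The 'top_group' option amounts to deriving one label per molecule. A rough sketch of that idea follows, assuming each molecule is represented by a list with one group name per match; the helper `top_group` and this input layout are hypothetical, and the library may count matches or break ties differently.

```python
from collections import Counter
from typing import List, Optional


def top_group(group_names: List[str]) -> Optional[str]:
    """Return the most frequent group name, or None when nothing matched."""
    if not group_names:
        return None
    # most_common(1) yields [(name, count)] for the highest count.
    return Counter(group_names).most_common(1)[0][0]


per_molecule_matches = [
    ["Perfluoroalkyl", "Polyfluoroalkyl", "Perfluoroalkyl"],
    [],
    ["Polyfluoroalkyl"],
]
labels = [top_group(names) for names in per_molecule_matches]
print(labels)
```

A list like `labels` is also the shape of input accepted when passing explicit per-molecule label strings to color_by.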
- perform_kernel_pca(n_components: int = 2, kernel: str = 'rbf', gamma: float | None = None, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform kernel PCA on the embedding matrix.
- Parameters:
n_components (int, default 2)
kernel (str, default 'rbf')
gamma (float, optional)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'kpca_model', 'scaler', 'kernel', 'gamma', 'labels' (if color_by is set).
- Return type:
- perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform t-SNE dimensionality reduction on the embedding matrix.
- Parameters:
n_components (int, default 2)
perplexity (float, default 30.0)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).
- Return type:
- perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform UMAP dimensionality reduction on the embedding matrix.
- Parameters:
n_components (int, default 2)
n_neighbors (int, default 15)
min_dist (float, default 0.1)
metric (str, default 'euclidean')
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).
- Return type:
- column_names(component_metrics: List[str] | None = None, molecule_metrics: List[str] | None = None, group_selection: str = 'all', selected_group_ids: List[int] | None = None, preset: str | None = None, pfas_groups=None, halogens=None) List[str][source]
Return column labels (delegates to first element).
- classmethod from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) PFASEmbeddingSet[source]
Load results from SQL database.
- Parameters:
conn (str or SQLAlchemy Engine, optional) – Database connection string or engine
filename (str, optional) – SQLite database filename (alternative to conn)
components_table (str, default "components") – Name of the components table
groups_table (str, default "pfas_groups_in_compound") – Name of the groups table
limit (int, optional) – Limit number of molecules to load
- Returns:
Loaded results
- Return type:
PFASEmbeddingSet
PFASEmbeddingSet is a list-like container of MoleculeResult objects.
Its length equals the number of input SMILES.
Key methods:
| Method | Description |
|---|---|
| | Access the i-th |
| | Flatten all matches to a |
| | Convert to an |
| | Persist to a SQLite or PostgreSQL database |
| | Load from a previously saved database |
to_array options:
Note
The default value of halogens depends on which module the PFASEmbeddingSet came from:
  - from PFASGroups import parse_smiles → halogens='F' (116 columns)
  - from HalogenGroups import parse_smiles → halogens=['F','Cl','Br','I'] (464 columns)
Always pass halogens explicitly in reusable helpers or notebook functions to avoid silent fingerprint-width changes when the import source changes.
# PFASGroups default: all 116 groups, F only, binary → (n, 116)
arr = results.to_array() # (n, 116)
# Always-explicit (safe in any import context)
arr = results.to_array(halogens='F') # (n, 116)
# OECD groups only
arr = results.to_array(group_selection='oecd', halogens='F') # (n, 28)
# Count or max-component encoding
arr = results.to_array(component_metrics=['count'], halogens='F')
arr = results.to_array(component_metrics=['max_component'], halogens='F')
# Multi-halogen (advanced) — see halogengroups page
arr = results.to_array(halogens=['F', 'Cl', 'Br', 'I']) # (n, 464)
MoleculeResult
- PFASGroups.MoleculeResult
alias of PFASEmbedding
Represents parsing results for a single molecule.
Attributes:
| Attribute | Description |
|---|---|
| | Canonical SMILES string |
| | InChI string |
| | InChIKey |
| | List of |
| | List of definition matches (populated when |
| | Number of group matches |
| | |
GroupMatch
Represents a single group detected in a molecule.
Attributes:
| Attribute | Description |
|---|---|
| | Human-readable group name |
| | Integer group ID |
| | |
| | Whether this group qualifies as PFAS |
| | Halogen symbol matched ( |
| | List of |
| | Number of components |
MatchComponent
A single structural component of a group match.
Attributes:
| Attribute | Description |
|---|---|
| | List of atom indices (0-based) in the RDKit molecule |
| | Number of atoms in the component |
| | Number of halogen atoms |
| | Ratio of halogen atoms to total heavy atoms |
| | Kirchhoff index of the component graph ( |
EmbeddingArray
See Embedding Analysis for full documentation.
EmbeddingArray is a numpy array subclass returned by
PFASEmbeddingSet.to_array(). It carries molecule identity metadata:
arr = results.to_array()
print(arr.shape) # (n_mols, n_groups)
print(arr.smiles) # list of input SMILES strings
HalogenGroup
- class PFASGroups.HalogenGroup(**kwargs)[source]
Bases: object
Model class representing a specific halogenated functional group with structural patterns.
A HalogenGroup defines a specific halogenated functional group using SMARTS patterns, component path types, and molecular formula constraints. Groups are used to classify molecules into specific categories (e.g., “Perfluoroalkyl carboxylic acid”).
- smarts
SMARTS patterns (compiled RDKit molecule) for functional group detection. None if group is defined by componentSmarts alone.
- Type:
Chem.Mol or None
- max_dist_from_comp
Maximum graph distance (number of bonds) from fluorinated component to functional group. When > 0, extends component search radius to find nearby functional groups.
- Type:
- linker_smarts
Compiled SMARTS pattern for validating linker atoms between fluorinated component and functional group. When None (default), no restriction is applied to linker atoms. Only used when max_dist_from_comp > 0.
- Type:
Chem.Mol or None
- constraints
Molecular formula constraints with keys:
  - 'only': Elements that must be present exclusively (e.g., ['C', 'F', 'O'])
  - 'gte': Minimum element counts (e.g., {'C': 2})
  - 'lte': Maximum element counts (e.g., {'O': 2})
  - 'eq': Exact element counts (e.g., {'N': 1})
  - 'rel': Relational constraints (e.g., {'O': {'atoms': ['C'], 'div': 2, 'add': 0}})
- Type:
Examples
>>> # Perfluoroalkyl carboxylic acid: R_F-COOH
>>> pfaa = HalogenGroup(
...     id=1,
...     name="Perfluoroalkyl carboxylic acid",
...     smarts={"C(=O)O":1},  # Carboxylic acid group
...     componentSmarts="Perfluoroalkyl",
...     constraints={"only": ["C", "F", "O", "H"]},
...     max_dist_from_comp=0,
...     linker_smarts=None
... )
Notes
SMARTS patterns are compiled on initialization for efficient matching
Constraints are validated when checking if a molecule belongs to this group
max_dist_from_comp allows finding functional groups connected via non-fluorinated linkers
linker_smarts restricts which atoms can be in the path between component and functional group
- __init__(**kwargs)
- set_component_smarts(componentSmartss)[source]
Infers componentSmarts based on componentSmarts, componentSaturation, componentForm and componentHalogen
- constraint_gte(formula_dict)[source]
Check ‘greater than or equal’ constraints on element counts.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary
- Returns:
True if all ‘gte’ constraints are satisfied, False otherwise
- Return type:
Examples
>>> # Requires at least 2 carbons and 3 fluorines
>>> group.constraints = {'gte': {'C': 2, 'F': 3}}
>>> group.constraint_gte({'C': 3, 'F': 5, 'O': 1})  # True
>>> group.constraint_gte({'C': 1, 'F': 5, 'O': 1})  # False (C < 2)
- constraint_lte(formula_dict)[source]
Check ‘less than or equal’ constraints on element counts.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary
- Returns:
True if all ‘lte’ constraints are satisfied, False otherwise
- Return type:
Examples
>>> # Requires at most 2 oxygens
>>> group.constraints = {'lte': {'O': 2}}
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 2})  # True
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 3})  # False (O > 2)
- constraint_eq(formula_dict)[source]
Check ‘equal to’ constraints on element counts.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary
- Returns:
True if all ‘eq’ constraints are satisfied, False otherwise
- Return type:
Examples
>>> # Requires exactly 1 nitrogen
>>> group.constraints = {'eq': {'N': 1}}
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 1})  # True
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 2})  # False (N != 1)
- constraint_only(formula_dict)[source]
Check ‘only’ constraint - molecule must contain only specified elements.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary
- Returns:
True if molecule contains only the allowed elements, False otherwise
- Return type:
Examples
>>> # Molecule must contain only C, F, O, H
>>> group.constraints = {'only': ['C', 'F', 'O', 'H']}
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'H': 1})  # True
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'S': 1})  # False (S not allowed)
Notes
Checks that sum of allowed elements equals total atoms in molecule.
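The note's counting rule can be written out directly: if the atoms belonging to allowed elements add up to the total atom count, no disallowed element is present. The function below is a standalone sketch of that rule with a hypothetical name, not the method's actual source.

```python
from typing import Dict, Iterable


def satisfies_only(formula: Dict[str, int], allowed: Iterable[str]) -> bool:
    """True when every atom in the formula belongs to an allowed element."""
    allowed_set = set(allowed)
    # Sum the counts of allowed elements and compare with the total atom count.
    allowed_total = sum(n for el, n in formula.items() if el in allowed_set)
    return allowed_total == sum(formula.values())


allowed = ["C", "F", "O", "H"]
print(satisfies_only({"C": 8, "F": 15, "O": 2, "H": 1}, allowed))
print(satisfies_only({"C": 8, "F": 15, "O": 2, "S": 1}, allowed))
```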
- constraint_rel(formula_dict)[source]
Check relational constraints between element counts.
Validates relationships of the form: count(element) = f(other_elements) where f can include division, addition, and summing other element counts.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary
- Returns:
True if all relational constraints are satisfied, False otherwise
- Return type:
Notes
Constraint Format:
'rel': {
    'ElementA': {
        'atoms': ['ElementB', 'ElementC'],  # Elements to sum
        'div': int,                         # Divisor (default 1)
        'add': int,                         # Additive constant (default 0)
        'add_atoms': ['ElementD']           # Additional elements to add
    }
}
Formula: count(ElementA) = (sum(atoms) / div) + add + sum(add_atoms)
Examples
>>> # Carbon count must equal half the fluorine count
>>> group.constraints = {'rel': {'C': {'atoms': ['F'], 'div': 2, 'add': 0}}}
>>> group.constraint_rel({'C': 4, 'F': 8, 'O': 2})  # True (4 == 8/2)
>>> group.constraint_rel({'C': 3, 'F': 8, 'O': 2})  # False (3 != 8/2)
>>> # Oxygen count must equal carbon count plus 1
>>> group.constraints = {'rel': {'O': {'atoms': ['C'], 'div': 1, 'add': 1}}}
>>> group.constraint_rel({'C': 3, 'F': 7, 'O': 4})  # True (4 == 3 + 1)
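The relational formula, count(ElementA) = (sum(atoms) / div) + add + sum(add_atoms), can be checked by hand with a short standalone function. This is a sketch of the stated formula only; the name `satisfies_rel` is hypothetical, and the use of true division (rather than integer division) is an assumption about how the library treats non-divisible counts.

```python
from typing import Any, Dict


def satisfies_rel(formula: Dict[str, int], rel: Dict[str, Dict[str, Any]]) -> bool:
    """Evaluate count(A) == sum(atoms) / div + add + sum(add_atoms) for each A."""
    for element, spec in rel.items():
        base = sum(formula.get(a, 0) for a in spec.get("atoms", []))
        extra = sum(formula.get(a, 0) for a in spec.get("add_atoms", []))
        # True division is an assumption; defaults follow the documented format.
        expected = base / spec.get("div", 1) + spec.get("add", 0) + extra
        if formula.get(element, 0) != expected:
            return False
    return True


half_fluorine = {"C": {"atoms": ["F"], "div": 2, "add": 0}}
carbon_plus_one = {"O": {"atoms": ["C"], "div": 1, "add": 1}}
print(satisfies_rel({"C": 4, "F": 8, "O": 2}, half_fluorine))
print(satisfies_rel({"C": 3, "F": 7, "O": 4}, carbon_plus_one))
```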
- formula_dict_satisfies_constraints(formula_dict)[source]
Check if a molecular formula satisfies all constraints for this PFAS group.
Evaluates all constraint types in order: relational → only → equal → lte → gte. Stops evaluation at first failure for efficiency.
- Parameters:
formula_dict (dict) – Molecular formula as {element: count} dictionary (e.g., {‘C’: 8, ‘F’: 17, ‘O’: 2})
- Returns:
bool – True if all constraints are satisfied, False if any constraint fails
Constraint evaluation order:
1. Relational constraints (‘rel’) - element count relationships
2. ‘Only’ constraints - allowed elements
3. Equality constraints (‘eq’) - exact element counts
4. Upper bound constraints (‘lte’) - maximum element counts
5. Lower bound constraints (‘gte’) - minimum element counts
Examples
>>> # Perfluoroalkyl carboxylic acid constraints
>>> group.constraints = {
...     'only': ['C', 'F', 'O', 'H'],  # No other elements
...     'gte': {'C': 2, 'F': 3},       # At least 2 carbons, 3 fluorines
...     'eq': {'O': 2}                 # Exactly 2 oxygens
... }
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 2, 'H': 1})
True
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 3, 'H': 1})
False  # Fails 'eq': {'O': 2}
Notes
Returns True immediately if no constraints are defined
Short-circuits on first constraint failure for performance
Constraint evaluation order is fixed for consistency
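The fixed order and the short-circuit behaviour described in these notes can be sketched as a loop over named checks. Everything below is illustrative: the check functions are simplified re-implementations of the documented rules, and `evaluate` returns the list of checks it ran purely so the early exit is visible; the real method returns only a bool.

```python
from typing import Any, Callable, Dict, List, Tuple

Formula = Dict[str, int]
Constraints = Dict[str, Any]


def _rel(f: Formula, c: Constraints) -> bool:
    return all(
        f.get(el, 0) == sum(f.get(a, 0) for a in s.get("atoms", [])) / s.get("div", 1)
        + s.get("add", 0) + sum(f.get(a, 0) for a in s.get("add_atoms", []))
        for el, s in c.get("rel", {}).items()
    )


def _only(f: Formula, c: Constraints) -> bool:
    return "only" not in c or all(el in c["only"] for el, n in f.items() if n > 0)


def _eq(f: Formula, c: Constraints) -> bool:
    return all(f.get(el, 0) == n for el, n in c.get("eq", {}).items())


def _lte(f: Formula, c: Constraints) -> bool:
    return all(f.get(el, 0) <= n for el, n in c.get("lte", {}).items())


def _gte(f: Formula, c: Constraints) -> bool:
    return all(f.get(el, 0) >= n for el, n in c.get("gte", {}).items())


# Documented order: relational, only, equal, lte, gte.
CHECKS: List[Tuple[str, Callable[[Formula, Constraints], bool]]] = [
    ("rel", _rel), ("only", _only), ("eq", _eq), ("lte", _lte), ("gte", _gte),
]


def evaluate(formula: Formula, constraints: Constraints) -> Tuple[bool, List[str]]:
    """Run checks in order; stop at the first failure."""
    ran: List[str] = []
    if not constraints:
        return True, ran  # no constraints defined: satisfied immediately
    for name, check in CHECKS:
        ran.append(name)
        if not check(formula, constraints):
            return False, ran
    return True, ran


constraints = {"only": ["C", "F", "O", "H"], "gte": {"C": 2, "F": 3}, "eq": {"O": 2}}
print(evaluate({"C": 8, "F": 15, "O": 2, "H": 1}, constraints))
print(evaluate({"C": 8, "F": 15, "O": 3, "H": 1}, constraints))
```

In the failing case the loop stops at the 'eq' check, so the 'lte' and 'gte' checks are never evaluated.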
- find_matched_atoms(mol)[source]
Find all substructure matches of this PFAS group’s SMARTS patterns in a molecule.
- Parameters:
mol (Chem.Mol) – RDKit molecule object to search for matches
- Returns:
List of matches, where each match is a list of atom indices in the molecule
- Return type:
List[List[int]]
Notes
If no SMARTS patterns are defined, returns an empty list.
Each SMARTS pattern is searched independently; matches from all patterns are combined.
- component_satisfies_all_smarts(component)[source]
Check if a fluorinated component matches all SMARTS patterns of this PFAS group.
- Parameters:
component (PFASComponent) – PFASComponent object representing a fluorinated component in the molecule
- Returns:
True if the component matches all SMARTS patterns, False otherwise
- Return type:
Notes
If no SMARTS patterns are defined for this group, returns True.
Each SMARTS pattern must have at least one match that includes the component’s atom.
- find_alkyl_components(mol, component_solver, **kwargs)[source]
Find fluorinated components in a molecule that match this PFAS group’s criteria.
- Parameters:
mol (Chem.Mol) – RDKit molecule object to search
components (List[PFASComponent]) – List of PFASComponent objects representing fluorinated components in the molecule
- Returns:
List of PFASComponent objects that match this PFAS group’s criteria
- Return type:
List[PFASComponent]
Notes
Matches are determined based on componentSmarts and max_dist_from_comp attributes.
If componentSmarts is None, all components are considered.
max_dist_from_comp allows extending the search radius for functional groups.
- find_aryl_components(mol, component_solver=None, **kwargs)[source]
Find aryl components in a molecule with comprehensive metrics.
- find_components(mol, fd, component_solver, **kwargs)[source]
Find fluorinated components in a molecule that match this PFAS group’s criteria.
- test(test_data=None)[source]
Test this PFAS group against test molecules from metadata.
Validates that the group correctly identifies positive examples and rejects negative examples based on test metadata in PFAS_groups_smarts.json.
- Parameters:
test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the group’s entry in PFAS_groups_smarts.json. Expected keys: category, examples, generate.
- Returns:
Test results with keys: passed (bool), total_tests (int), failures (list of dicts), category (str).
- Return type:
Notes
For OECD groups: Tests against curated positive examples
For telomer groups: Tests generated molecules based on smiles patterns
For generic groups: Tests both positive and negative examples
Returns detailed failure information for debugging
Defines a single halogen structural group (SMARTS pattern + metadata). Used to build custom group libraries. See Customization.
PFASDefinition
- class PFASGroups.PFASDefinition(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]
Bases: object
Model class representing a PFAS definition based on structural criteria.
A PFAS definition identifies molecules using SMARTS patterns and/or fluorine ratio thresholds. Unlike HalogenGroup which focuses on specific functional groups, PFASDefinition provides broader chemical definitions (e.g., “contains at least one perfluoroalkyl moiety”).
- fluorineRatio
Minimum ratio of fluorine atoms required (None if not applicable)
- Type:
Optional[float]
- smarts_patterns
Compiled SMARTS molecule objects for efficient matching
- Type:
List[Chem.Mol]
- requireBoth
If True, requires both SMARTS match AND fluorine ratio. If False, requires SMARTS match OR fluorine ratio.
- Type:
Examples
>>> # Definition requiring perfluoroalkyl chain OR high fluorine ratio
>>> pfas_def = PFASDefinition(
...     id=1,
...     name="PFAS (OECD definition)",
...     smarts=["[CX4][CX4]([F])([F])[F]"],
...     fluorineRatio=0.4,
...     description="Contains perfluoroalkyl moiety with ≥2 carbons",
...     requireBoth=False
... )
- __init__(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]
- applies_to_molecule(mol_or_smiles: Mol | str, formula: Dict[str, int] | None = None, **kwargs) bool[source]
Check if this PFAS definition applies to a given molecule.
This method evaluates whether a molecule meets the structural and/or compositional criteria defined by this PFASDefinition. The evaluation logic depends on the requireBoth flag:
If requireBoth=False (default): Returns True if EITHER SMARTS matches OR fluorine ratio is met (logical OR)
If requireBoth=True: Returns True only if BOTH SMARTS matches AND fluorine ratio are met (logical AND)
- Parameters:
mol_or_smiles (Union[Chem.Mol, str]) – Input molecule as RDKit Mol object or SMILES string
formula (Optional[Dict[str, int]], default=None) – Pre-computed molecular formula as {element: count} dictionary. If None, will be computed from the molecule.
**kwargs (dict) –
Additional parameters:
include_hydrogen (bool): Whether to include H in fluorine ratio calculation. Defaults to self.includeHydrogen
require_both (bool): Override the instance’s requireBoth setting
- Returns:
True if the molecule meets the definition criteria, False otherwise
- Return type:
Examples
>>> pfas_def = PFASDefinition(
...     id=1, name="Test", smarts=["[CX4]F"],
...     fluorineRatio=0.3, description="Test"
... )
>>> pfas_def.applies_to_molecule("FC(F)(F)C(F)(F)F")  # hexafluoroethane
True
>>> pfas_def.applies_to_molecule("CCCCCC")  # No fluorine
False
Notes
SMARTS patterns are checked using substructure matching (HasSubstructMatch)
Fluorine ratio is calculated as: F_count / total_atom_count
Invalid SMILES strings return False
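The OR/AND logic controlled by requireBoth, together with the fluorine-ratio formula from the notes, can be sketched without RDKit by passing the SMARTS outcome in as a boolean. The function names, the `>=` comparison against the threshold, and the choice to fall back to the SMARTS result alone when no ratio is defined are assumptions for illustration, not confirmed library behaviour.

```python
from typing import Dict, Optional


def fluorine_ratio(formula: Dict[str, int], include_hydrogen: bool = True) -> float:
    """F_count / total_atom_count, optionally leaving H out of the total."""
    total = sum(n for el, n in formula.items() if include_hydrogen or el != "H")
    return formula.get("F", 0) / total if total else 0.0


def definition_applies(
    smarts_matched: bool,
    formula: Dict[str, int],
    min_ratio: Optional[float],
    require_both: bool = False,
    include_hydrogen: bool = True,
) -> bool:
    """Combine the SMARTS result and the ratio test with OR (default) or AND."""
    if min_ratio is None:
        # Assumption: with no ratio threshold, only the SMARTS result counts.
        return smarts_matched
    ratio_ok = fluorine_ratio(formula, include_hydrogen) >= min_ratio
    if require_both:
        return smarts_matched and ratio_ok
    return smarts_matched or ratio_ok


hexafluoroethane = {"C": 2, "F": 6}
hexane = {"C": 6, "H": 14}
print(definition_applies(True, hexafluoroethane, 0.3))
print(definition_applies(False, hexane, 0.3))
```

Passing `require_both=True` mirrors the per-call override mentioned in the parameter list: a SMARTS match alone is then no longer sufficient.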
- test(test_data=None)[source]
Test this PFAS definition against test molecules from metadata.
Validates that the definition correctly classifies true positives, true negatives, false positives, and false negatives based on test metadata in PFAS_definitions_smarts.json.
- Parameters:
test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the definition’s entry in PFAS_definitions_smarts.json. Expected keys: category, examples (dict with keys true_positives, true_negatives, false_positives, false_negatives, each a list of dicts).
- Returns:
Test results with keys: passed (bool), total_tests (int), failures (list), category (str), stats (dict with counts for true/false positives/negatives).
- Return type:
Notes
Tests against benchmark test compounds with known PFAS/non-PFAS labels
Validates both SMARTS patterns and fluorine ratio criteria
Returns detailed failure information for debugging
Encapsulates a regulatory PFAS definition and its matching logic. See PFAS Definitions for descriptions of the five built-in definitions.