Data Models

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "OCCOCCO"])
mol = results[0]                      # MoleculeResult

print(mol.smiles, mol.is_PFAS)
for match in mol.matches:             # GroupMatch objects
    print(match.group_name, match.group_id)
    for comp in match.components:     # MatchComponent objects
        print("  atoms:", comp.atoms)

arr = results.to_array()             # EmbeddingArray
print(arr.shape)                      # (2, 116)

PFASEmbeddingSet 

class PFASGroups.PFASEmbeddingSet(iterable: Iterable[Dict[str, Any]] = ())[source]

Bases: list

List-like container for multiple PFASEmbedding results.

Subclasses list so existing code that iterates over results continues to work. Call to_array() to produce a (n_molecules, n_columns) matrix from all stored results.

__init__(iterable: Iterable[Dict[str, Any]] = ())[source]

property matches: List[MatchView]

Flattened list of all MatchView objects across all molecules.

Some older code expects a matches attribute on a ResultsModel instance. This property provides a read-only aggregated view by concatenating the per-molecule match lists.

classmethod from_raw(results: Iterable[Dict[str, Any]]) → PFASEmbeddingSet[source]: Wrap an existing list of result dicts without changing them.

classmethod from_smiles(smiles: str | List[str], **kwargs) → PFASEmbeddingSet[source]

Parse SMILES string(s) and return a PFASEmbeddingSet.

Parameters:

smiles (str or list of str) – One or more SMILES strings.
**kwargs – Forwarded to parse_smiles() (e.g. halogens, saturation, progress).

classmethod from_mols(mols, **kwargs) → PFASEmbeddingSet[source]

Parse RDKit molecules and return a PFASEmbeddingSet.

Parameters:

mols (list of rdkit.Chem.Mol) – List of RDKit molecule objects.
**kwargs – Forwarded to parse_mols().

classmethod from_inchis(inchis: List[str], **kwargs) → PFASEmbeddingSet[source]

Parse InChI strings and return a PFASEmbeddingSet.

Parameters:

inchis (list of str) – List of InChI strings.
**kwargs – Forwarded to parse_mols().

reorder(indices: list | None = None, key: Callable[[PFASEmbedding], Any] = None, reverse: bool = False) → PFASEmbeddingSet[source]

Return a new PFASEmbeddingSet with results reordered by a key function.

Parameters:

indices (list of int, optional) – Explicit list of indices defining the new order. If provided, this takes precedence over the key function.
key (callable) – Function that takes a PFASEmbedding and returns a value to sort by.
reverse (bool, default False) – Whether to sort in descending order.
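The precedence rule described above (explicit indices win over the key function) can be sketched without the library. The helper name reorder_sketch is hypothetical. Two behaviours are assumptions made for illustration, not confirmed behaviour: reverse is ignored when indices are given, and the input is copied unchanged when neither argument is passed.

```python
from typing import Any, Callable, List, Optional, Sequence


def reorder_sketch(items: Sequence[Any],
                   indices: Optional[List[int]] = None,
                   key: Optional[Callable[[Any], Any]] = None,
                   reverse: bool = False) -> List[Any]:
    """Return a new list; explicit indices take precedence over key."""
    if indices is not None:
        # Assumption: key and reverse are ignored when indices are given.
        return [items[i] for i in indices]
    if key is not None:
        return sorted(items, key=key, reverse=reverse)
    # Assumption: with neither argument, return an unchanged copy.
    return list(items)


smiles = ["CCCC(F)(F)F", "FC(F)F", "OCCOCCO"]
by_length = reorder_sketch(smiles, key=len)
explicit = reorder_sketch(smiles, indices=[2, 0, 1], key=len)
print(by_length)
print(explicit)
```

In both cases the input sequence is left untouched, which matches the documented contract of returning a new set.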

iter_group_matches(group_id: int | None = None, group_name: str | None = None) → Iterator[Tuple[PFASEmbedding, MatchView]][source]: Iterate over all PFAS group matches across all molecules.

plot_components_for_group(group_id: int | None = None, group_name: str | None = None, max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) → Tuple[Image, int, int][source]

Plot all components for a specific PFAS group across molecules.

Either group_id or group_name (or both) can be provided to select the target group. Each panel corresponds to one molecule, with all its components for that group highlighted together.

show(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → Image[source]

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.
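The panel-merging rule can be illustrated with a small, library-independent sketch: components are grouped by the set of highlighted atom indices, so two components covering the same atoms end up in one panel. The record layout and the group names below are hypothetical placeholders, not the library's internal data structures.

```python
from collections import defaultdict

# Hypothetical component records: (group name, highlighted atom indices).
components = [
    ("Perfluoroalkyl", [3, 4, 5, 6]),
    ("Polyfluoroalkyl", [6, 5, 4, 3]),  # same atoms, listed in another order
    ("Alkyl fluoride", [3, 4]),
]


def merge_panels(records):
    """Group records whose highlighted atoms form the same set."""
    panels = defaultdict(list)
    for group_name, atoms in records:
        # frozenset ignores atom order and is hashable, so it can be a dict key.
        panels[frozenset(atoms)].append(group_name)
    return dict(panels)


panels = merge_panels(components)
for atoms, names in panels.items():
    print(sorted(atoms), names)
```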

plot(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → Image

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.

to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') → None[source]

Export this molecule result to a SQL database.

Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).

Parameters:

filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.
dbname (str, optional) – Database name (for PostgreSQL/MySQL).
user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.
password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.
host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').
port (int, optional) – Database port. Defaults to os.environ.get('DB_PORT', 5432); 5432 is the PostgreSQL default port.
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: ‘fail’, ‘replace’, or ‘append’.
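A minimal sketch of how the environment-variable defaults listed above could be resolved. The function name resolve_db_params and its env argument are hypothetical (env exists only so the example can run without touching the real environment); the actual resolution code in the library may differ.

```python
import os
from typing import Mapping, Optional


def resolve_db_params(user: Optional[str] = None,
                      password: Optional[str] = None,
                      host: Optional[str] = None,
                      port: Optional[int] = None,
                      env: Mapping[str, str] = os.environ) -> dict:
    """Fill missing connection parameters from environment variables."""
    return {
        # DB_USER and DB_PASSWORD have no fallback: a missing variable raises KeyError.
        "user": user if user is not None else env["DB_USER"],
        "password": password if password is not None else env["DB_PASSWORD"],
        "host": host if host is not None else env.get("DB_HOST", "localhost"),
        # Environment values are strings, so the port is converted to int.
        "port": port if port is not None else int(env.get("DB_PORT", 5432)),
    }


params = resolve_db_params(user="alice", env={"DB_PASSWORD": "s3cret"})
print(params)
```

Explicit arguments always win over the environment, which keeps scripts reproducible when they are run on machines with different variables set.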

svg(filename: str, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → str[source]

Export all component combinations to an SVG file (vector graphics).

Components that share the same highlighted atoms within a molecule are merged into a single panel with a bullet-point legend.

Parameters:

filename (str) – Path to the output SVG file.
subwidth (int, default 350) – Width of each sub-image in pixels.
subheight (int, default 350) – Minimum height of each sub-image in pixels.
ncols (int, default 4) – Number of columns in the grid.

Returns:

Path to the created SVG file.

Return type:

str

summarise() → str[source]

Return a coloured text summary of the results.

The summary includes:

- number of molecules
- counts of PFAS group and definition matches
- total number of components across all group matches
- the most frequent PFAS groups (colour-coded by halogen)

table() → str[source]

Return a more detailed text table with one row per molecule.

The TSV table has the following columns: index (1-based), smiles, group_matches (count), definition_matches (count), and groups (per-molecule PFAS groups with counts, e.g. "Perfluoroalkyl (2); Polyfluoroalkyl (1)").
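The row layout described above can be reproduced with a short sketch. The helper names groups_column and tsv_row are hypothetical, and it is an assumption that group_matches equals the number of matched group names and that groups are ordered by descending count.

```python
from collections import Counter


def groups_column(group_names):
    """Format per-molecule groups as 'Name (count); Name (count)'."""
    counts = Counter(group_names)
    # most_common() sorts by descending count; ties keep first-seen order.
    return "; ".join(f"{name} ({n})" for name, n in counts.most_common())


def tsv_row(index, smiles, group_names, n_definitions):
    """Build one tab-separated row with a 1-based index."""
    return "\t".join([
        str(index),
        smiles,
        str(len(group_names)),
        str(n_definitions),
        groups_column(group_names),
    ])


names = ["Perfluoroalkyl", "Polyfluoroalkyl", "Perfluoroalkyl"]
row = tsv_row(1, "CCCC(F)(F)F", names, 0)
print(row)
```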

classify() → DataFrame[source]

Return a classification DataFrame with one row per molecule.

Each molecule is classified by MoleculeResult.classify().

Returns:

Columns:

smiles — molecule SMILES.
category — classification label: OECD group name(s) if matched, otherwise "per-"/"poly-" + generic/telomeric group names (comma-separated).
total_component_size — sum of C-atom counts across all matched group components.

Return type:

pandas.DataFrame

summary() → None[source]

Print a detailed coloured summary of matched groups across all molecules.

For each group, shows the component SMARTS type and, per component, the graph metrics: size (C-atom count), branching and mean eccentricity. Component size statistics (min, max, mean) are also shown.

plot_all_components_with_group_colours(max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) → Tuple[Image, int, int][source]

Plot all matched components, coloured by PFAS group.

Each panel corresponds to one molecule; atoms are highlighted with colours assigned per PFAS group. The legend lists the groups found in that molecule.

to_sql_all(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') → None[source]

Export all molecule results to a SQL database.

This method efficiently batches all molecules into the database in a single operation.

Parameters:

conn (str or sqlalchemy.engine.Engine, optional) – Database connection. Can be:
- SQLAlchemy Engine object
- Connection string (e.g., 'postgresql://user:pass@host:port/db')
- SQLite path with 'sqlite:///' prefix
filename (str, optional) – Path to SQLite database file (legacy parameter, use conn instead).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: ‘fail’, ‘replace’, or ‘append’.

Examples

>>> # Using connection string
>>> results.to_sql_all(conn='postgresql://user:pass@localhost/pfas_db')
>>>
>>> # Using SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///pfas.db')
>>> results.to_sql_all(conn=engine)
>>>
>>> # Using filename (legacy)
>>> results.to_sql_all(filename='pfas.db')

to_fingerprint(group_selection: str = 'all', component_metrics: List[str] | None = None, selected_group_ids: List[int] | None = None, halogens: str | List[str] = 'F', saturation: str | None = 'per', molecule_metrics: List[str] | None = None, pfas_groups: List[Dict] | None = None, preset: str | None = None, count_mode: str | None = None, graph_metrics: List[str] | None = None, progress: bool = False, **kwargs) → ndarray[source]: Deprecated. Use to_array() instead.

property n_molecules: int: Number of molecules in this set.

property has_cache: bool: Always True — PFASEmbeddingSet stores pre-parsed results.

property match_cache: PFASEmbeddingSet: Alias for the set itself (backward compat with PFASFingerprint API).

get_embedding(**kwargs) → EmbeddingArray[source]: Alias for to_array() (backward compat with PFASFingerprint API).

to_array(component_metrics=<object object>, molecule_metrics=<object object>, group_selection=<object object>, selected_group_ids=<object object>, aggregation=<object object>, preset=<object object>, pfas_groups=<object object>, halogens=<object object>, progress: bool = True) → EmbeddingArray[source]

Stack per-molecule embedding rows into a (n_mols, n_cols) matrix.

When called with no arguments, returns the last cached embedding; on the first call, when nothing is cached yet, it returns the default binary embedding. Pass explicit arguments to override the defaults and update the cache.

Parameters match those of PFASEmbedding.to_array(), plus:

progress (bool, default True) – Show a tqdm progress bar while computing embeddings.

compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') → float[source]

Compare two sets using KL divergence on group-occurrence frequencies.

Parameters:

other (PFASEmbeddingSet) – Second set to compare against.
method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).

Returns:

KL divergence value (lower = more similar).

Return type:

float
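To make the comparison concrete, here is a library-independent sketch of KL divergence on group-occurrence frequencies. The smoothing constant eps, the helper names, and the reading of 'symmetric' as the mean of both directions are assumptions. The 'minmax' normalisation is not reproduced because its formula is not stated here.

```python
import math
from collections import Counter


def group_frequencies(group_ids, vocabulary, eps=1e-9):
    """Smoothed relative frequency of each group id over a shared vocabulary."""
    counts = Counter(group_ids)
    total = sum(counts.values()) + eps * len(vocabulary)
    return [(counts[g] + eps) / total for g in vocabulary]


def kld(p, q):
    """Forward KL divergence D(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def compare_sketch(ids_a, ids_b, method="symmetric"):
    vocab = sorted(set(ids_a) | set(ids_b))
    p = group_frequencies(ids_a, vocab)
    q = group_frequencies(ids_b, vocab)
    if method == "forward":
        return kld(p, q)
    if method == "reverse":
        return kld(q, p)
    if method == "symmetric":
        # Assumption: symmetric means the mean of both directions.
        return 0.5 * (kld(p, q) + kld(q, p))
    raise ValueError(f"unsupported method in this sketch: {method}")


same = compare_sketch([1, 1, 2], [1, 1, 2])
different = compare_sketch([1, 1, 2], [2, 3, 3])
print(same, different)
```

Identical frequency profiles give a divergence of zero, and the value grows as the two sets share fewer groups, consistent with "lower = more similar".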

perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform PCA on the embedding matrix.

Parameters:

n_components (int, default 2)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).

Return type:

dict

perform_kernel_pca(n_components: int = 2, kernel: str = 'rbf', gamma: float | None = None, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform kernel PCA on the embedding matrix.

Parameters:

n_components (int, default 2)
kernel (str, default 'rbf')
gamma (float, optional)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'kpca_model', 'scaler', 'kernel', 'gamma', 'labels' (if color_by is set).

Return type:

dict

perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform t-SNE dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
perplexity (float, default 30.0)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).

Return type:

dict

perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform UMAP dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
n_neighbors (int, default 15)
min_dist (float, default 0.1)
metric (str, default 'euclidean')
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).

Return type:

dict

column_names(component_metrics: List[str] | None = None, molecule_metrics: List[str] | None = None, group_selection: str = 'all', selected_group_ids: List[int] | None = None, preset: str | None = None, pfas_groups=None, halogens=None) → List[str][source]: Return column labels (delegates to first element).

classmethod from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) → PFASEmbeddingSet[source]

Load results from SQL database.

Parameters:

conn (str or SQLAlchemy Engine, optional) – Database connection string or engine
filename (str, optional) – SQLite database filename (alternative to conn)
components_table (str, default "components") – Name of the components table
groups_table (str, default "pfas_groups_in_compound") – Name of the groups table
limit (int, optional) – Limit number of molecules to load

Returns:

Loaded results

Return type:

PFASEmbeddingSet

PFASEmbeddingSet is a list-like container of MoleculeResult objects. Its length equals the number of input SMILES.

Key methods:

Method	Description
`results[i]`	Access the i-th `MoleculeResult`
`results.to_dataframe()`	Flatten all matches to a `pandas.DataFrame`
`results.to_array(group_selection='all', component_metrics=['binary'], halogens='F', saturation='per')`	Convert to an `EmbeddingArray` (PFASGroups default: `halogens='F'`, 116 cols; HalogenGroups subclass default: `halogens=['F','Cl','Br','I']`, 464 cols)
`results.to_sql(filename)`	Persist to a SQLite or PostgreSQL database
`PFASEmbeddingSet.from_sql(filename)`	Load from a previously saved database

to_array options:

Note

The default value of halogens depends on which module the PFASEmbeddingSet came from:

from PFASGroups import parse_smiles → halogens='F' (116 columns)
from HalogenGroups import parse_smiles → halogens=['F','Cl','Br','I'] (464 columns)

Always pass halogens explicitly in reusable helpers or notebook functions to avoid silent fingerprint-width changes when the import source changes.

# PFASGroups default: all 116 groups, F only, binary → (n, 116)
arr = results.to_array()                                        # (n, 116)

# Always-explicit (safe in any import context)
arr = results.to_array(halogens='F')                            # (n, 116)

# OECD groups only
arr = results.to_array(group_selection='oecd', halogens='F')   # (n, 28)

# Count or max-component encoding
arr = results.to_array(component_metrics=['count'], halogens='F')
arr = results.to_array(component_metrics=['max_component'], halogens='F')

# Multi-halogen (advanced) — see halogengroups page
arr = results.to_array(halogens=['F', 'Cl', 'Br', 'I'])       # (n, 464)

MoleculeResult 

PFASGroups.MoleculeResult: alias of PFASEmbedding

Represents parsing results for a single molecule.

Attributes:

Attribute	Description
`smiles`	Canonical SMILES string
`inchi`	InChI string
`inchikey`	InChIKey
`matches`	List of `GroupMatch` objects
`pfas_definition_matches`	List of definition matches (populated when `include_PFAS_definitions=True`)
`n_matches`	Number of group matches
`is_PFAS`	`True` if any match has `is_PFAS=True`
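The relationship between per-match flags and the molecule-level is_PFAS attribute can be sketched with plain dataclasses. MatchSketch and MoleculeSketch are hypothetical stand-ins for illustration only; they are not the library's GroupMatch or MoleculeResult classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MatchSketch:
    group_name: str
    is_PFAS: bool


@dataclass
class MoleculeSketch:
    smiles: str
    matches: List[MatchSketch] = field(default_factory=list)

    @property
    def n_matches(self) -> int:
        return len(self.matches)

    @property
    def is_PFAS(self) -> bool:
        # True as soon as one match is flagged; False for an empty match list.
        return any(m.is_PFAS for m in self.matches)


mol = MoleculeSketch("CCCC(F)(F)F", [MatchSketch("A", False), MatchSketch("B", True)])
empty = MoleculeSketch("OCCOCCO")
print(mol.n_matches, mol.is_PFAS, empty.is_PFAS)
```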

GroupMatch 

Represents a single group detected in a molecule.

Attributes:

Attribute	Description
`group_name`	Human-readable group name
`group_id`	Integer group ID
`group_category`	`'OECD'`, `'Generic'`, or `'Fluorotelomer'`
`is_PFAS`	Whether this group qualifies as PFAS
`halogen`	Halogen symbol matched (`'F'`, `'Cl'`, etc.)
`components`	List of `MatchComponent` objects
`n_components`	Number of components

MatchComponent 

A single structural component of a group match.

Attributes:

Attribute	Description
`atoms`	List of atom indices (0-based) in the RDKit molecule
`n_atoms`	Number of atoms in the component
`n_halogens`	Number of halogen atoms
`halogen_fraction`	Ratio of halogen atoms to total heavy atoms
`effective_graph_resistance`	Kirchhoff index of the component graph (`None` if not computed)
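Two of the metrics above can be illustrated with a small sketch. It assumes unit edge weights and covers only acyclic component graphs, where the resistance distance between two atoms equals the number of bonds on the path between them. How the library builds the component graph, and how it handles rings, is not specified here, so the function names and inputs are hypothetical.

```python
from collections import deque
from itertools import combinations


def halogen_fraction(symbols, halogens=("F", "Cl", "Br", "I")):
    """Halogen atoms divided by heavy (non-hydrogen) atoms."""
    heavy = [s for s in symbols if s != "H"]
    return sum(s in halogens for s in heavy) / len(heavy)


def kirchhoff_index_tree(n_nodes, edges):
    """Kirchhoff index of an acyclic graph with unit edge weights.

    In a tree, resistance distance equals path length, so summing BFS
    distances over all node pairs is enough. Not valid for graphs with rings.
    """
    adjacency = {i: [] for i in range(n_nodes)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    def distances_from(start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    all_dist = {i: distances_from(i) for i in range(n_nodes)}
    return sum(all_dist[i][j] for i, j in combinations(range(n_nodes), 2))


# Hypothetical atom list: 3 carbons and 7 fluorines.
fraction = halogen_fraction(["C", "C", "C"] + ["F"] * 7)
# Carbon backbone only, as a 3-node path: 0-1-2.
kf_path3 = kirchhoff_index_tree(3, [(0, 1), (1, 2)])
print(fraction, kf_path3)
```

For a fixed number of atoms, a linear chain gives a larger index than a branched one, which is why the metric is useful as a compactness descriptor.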

EmbeddingArray 

See Embedding Analysis for full documentation.

EmbeddingArray is a numpy array subclass returned by PFASEmbeddingSet.to_array(). It carries molecule identity metadata:

arr = results.to_array()
print(arr.shape)        # (n_mols, n_groups)
print(arr.smiles)       # list of input SMILES strings

HalogenGroup 

class PFASGroups.HalogenGroup(**kwargs)[source]

Bases: object

Model class representing a specific halogenated functional group with structural patterns.

A HalogenGroup defines a specific halogenated functional group using SMARTS patterns, component path types, and molecular formula constraints. Groups are used to classify molecules into specific categories (e.g., “Perfluoroalkyl carboxylic acid”).

id

Unique identifier for this Halogen group

Type: int

name

Human-readable group name (e.g., “Perfluoroalkyl carboxylic acid”)

Type: str

smarts

SMARTS patterns (compiled RDKit molecule) for functional group detection. None if group is defined by componentSmarts alone.

Type: Chem.Mol or None

componentSmarts

Type: list, str or None

componentForm

Type: str or None

componentHalogens

Type: list, str or None

componentSaturation

Type: str or None (-> both)

max_dist_from_comp

Maximum graph distance (number of bonds) from fluorinated component to functional group. When > 0, extends component search radius to find nearby functional groups.

Type: int

linker_smarts

Compiled SMARTS pattern for validating linker atoms between fluorinated component and functional group. When None (default), no restriction is applied to linker atoms. Only used when max_dist_from_comp > 0.

Type: Chem.Mol or None

constraints

Molecular formula constraints with keys:

- 'only': Elements that must be present exclusively (e.g., ['C', 'F', 'O'])
- 'gte': Minimum element counts (e.g., {'C': 2})
- 'lte': Maximum element counts (e.g., {'O': 2})
- 'eq': Exact element counts (e.g., {'N': 1})
- 'rel': Relational constraints (e.g., {'O': {'atoms': ['C'], 'div': 2, 'add': 0}})

Type: dict

Examples

>>> # Perfluoroalkyl carboxylic acid: R_F-COOH
>>> pfaa = HalogenGroup(
...     id=1,
...     name="Perfluoroalkyl carboxylic acid",
...     smarts={"C(=O)O":1},  # Carboxylic acid group
...     componentSmarts="Perfluoroalkyl",
...     constraints={"only": ["C", "F", "O", "H"]},
...     max_dist_from_comp=0,
...     linker_smarts=None
... )

Notes

SMARTS patterns are compiled on initialization for efficient matching
Constraints are validated when checking if a molecule belongs to this group
max_dist_from_comp allows finding functional groups connected via non-fluorinated linkers
linker_smarts restricts which atoms can be in the path between component and functional group

__init__(**kwargs)

set_component_smarts(componentSmartss)[source]: Infers componentSmarts based on componentSmarts, componentSaturation, componentForm and componentHalogen

set_componentSmarts(componentSmartss)[source]: Backward-compatible alias for set_component_smarts.

constraint_gte(formula_dict)[source]

Check ‘greater than or equal’ constraints on element counts.

Parameters: formula_dict (dict) – Molecular formula as {element: count} dictionary
Returns: True if all 'gte' constraints are satisfied, False otherwise
Return type: bool

Examples

>>> # Requires at least 2 carbons and 3 fluorines
>>> group.constraints = {'gte': {'C': 2, 'F': 3}}
>>> group.constraint_gte({'C': 3, 'F': 5, 'O': 1})  # True
>>> group.constraint_gte({'C': 1, 'F': 5, 'O': 1})  # False (C < 2)

constraint_lte(formula_dict)[source]

Check ‘less than or equal’ constraints on element counts.

Parameters: formula_dict (dict) – Molecular formula as {element: count} dictionary
Returns: True if all 'lte' constraints are satisfied, False otherwise
Return type: bool

Examples

>>> # Requires at most 2 oxygens
>>> group.constraints = {'lte': {'O': 2}}
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 2})  # True
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 3})  # False (O > 2)

constraint_eq(formula_dict)[source]

Check ‘equal to’ constraints on element counts.

Parameters: formula_dict (dict) – Molecular formula as {element: count} dictionary
Returns: True if all 'eq' constraints are satisfied, False otherwise
Return type: bool

Examples

>>> # Requires exactly 1 nitrogen
>>> group.constraints = {'eq': {'N': 1}}
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 1})  # True
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 2})  # False (N != 1)
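The three bound checks (gte, lte, eq) share one shape, which the following library-independent sketch makes explicit. The function names are hypothetical, and treating elements absent from the formula as a count of 0 is an assumption; the actual methods may handle missing elements differently.

```python
def check_gte(constraints, formula):
    """Every element listed under 'gte' must reach its minimum count."""
    return all(formula.get(el, 0) >= n
               for el, n in constraints.get("gte", {}).items())


def check_lte(constraints, formula):
    """Every element listed under 'lte' must stay at or below its maximum."""
    return all(formula.get(el, 0) <= n
               for el, n in constraints.get("lte", {}).items())


def check_eq(constraints, formula):
    """Every element listed under 'eq' must match its count exactly."""
    return all(formula.get(el, 0) == n
               for el, n in constraints.get("eq", {}).items())


# Same inputs as the doctest examples for the three methods.
gte_ok = check_gte({"gte": {"C": 2, "F": 3}}, {"C": 3, "F": 5, "O": 1})
gte_bad = check_gte({"gte": {"C": 2, "F": 3}}, {"C": 1, "F": 5, "O": 1})
lte_bad = check_lte({"lte": {"O": 2}}, {"C": 8, "F": 15, "O": 3})
eq_ok = check_eq({"eq": {"N": 1}}, {"C": 8, "F": 15, "N": 1})
print(gte_ok, gte_bad, lte_bad, eq_ok)
```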

constraint_only(formula_dict)[source]

Check ‘only’ constraint - molecule must contain only specified elements.

Parameters: formula_dict (dict) – Molecular formula as {element: count} dictionary
Returns: True if molecule contains only the allowed elements, False otherwise
Return type: bool

Examples

>>> # Molecule must contain only C, F, O, H
>>> group.constraints = {'only': ['C', 'F', 'O', 'H']}
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'H': 1})  # True
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'S': 1})  # False (S not allowed)

Notes

Checks that sum of allowed elements equals total atoms in molecule.
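The counting approach in the note above can be written out as a short sketch. The function name check_only is hypothetical, and returning True when no 'only' key is present is an assumption.

```python
def check_only(constraints, formula):
    """True if atoms of allowed elements account for every atom in the formula."""
    allowed = constraints.get("only")
    if allowed is None:
        return True  # assumption: no 'only' key means no restriction
    allowed_total = sum(formula.get(el, 0) for el in allowed)
    return allowed_total == sum(formula.values())


rule = {"only": ["C", "F", "O", "H"]}
only_ok = check_only(rule, {"C": 8, "F": 15, "O": 2, "H": 1})
only_bad = check_only(rule, {"C": 8, "F": 15, "O": 2, "S": 1})
print(only_ok, only_bad)
```

Comparing totals avoids building a set of disallowed elements: any atom outside the allowed list makes the two sums differ.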

constraint_rel(formula_dict)[source]

Check relational constraints between element counts.

Validates relationships of the form: count(element) = f(other_elements) where f can include division, addition, and summing other element counts.

Parameters: formula_dict (dict) – Molecular formula as {element: count} dictionary
Returns: True if all relational constraints are satisfied, False otherwise
Return type: bool

Notes

Constraint Format:

'rel': {
    'ElementA': {
        'atoms': ['ElementB', 'ElementC'],  # Elements to sum
        'div': int,  # Divisor (default 1)
        'add': int,  # Additive constant (default 0)
        'add_atoms': ['ElementD']  # Additional elements to add
    }
}

Formula: count(ElementA) = (sum(atoms) / div) + add + sum(add_atoms)

Examples

>>> # Carbon count must equal half the fluorine count
>>> group.constraints = {'rel': {'C': {'atoms': ['F'], 'div': 2, 'add': 0}}}
>>> group.constraint_rel({'C': 4, 'F': 8, 'O': 2})  # True (4 == 8/2)
>>> group.constraint_rel({'C': 3, 'F': 8, 'O': 2})  # False (3 != 8/2)

>>> # Oxygen count must equal carbon count plus 1
>>> group.constraints = {'rel': {'O': {'atoms': ['C'], 'div': 1, 'add': 1}}}
>>> group.constraint_rel({'C': 3, 'F': 7, 'O': 4})  # True (4 == 3 + 1)
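The relational formula, count(ElementA) = (sum(atoms) / div) + add + sum(add_atoms), can be evaluated directly as in this sketch. The function name check_rel is hypothetical, missing elements are assumed to count as 0, and the comparison is exact, so a non-integer right-hand side never matches an integer count.

```python
def check_rel(constraints, formula):
    """Check count(A) == sum(atoms) / div + add + sum(add_atoms) for each rule."""
    for element, rule in constraints.get("rel", {}).items():
        base = sum(formula.get(a, 0) for a in rule.get("atoms", []))
        extra = sum(formula.get(a, 0) for a in rule.get("add_atoms", []))
        expected = base / rule.get("div", 1) + rule.get("add", 0) + extra
        if formula.get(element, 0) != expected:
            return False
    return True


half_f = {"rel": {"C": {"atoms": ["F"], "div": 2, "add": 0}}}
rel_ok = check_rel(half_f, {"C": 4, "F": 8, "O": 2})
rel_bad = check_rel(half_f, {"C": 3, "F": 8, "O": 2})

# Illustrative rule using add_atoms: F = C + 2 + C, i.e. F = 2C + 2.
perfluoroalkane = {"rel": {"F": {"atoms": ["C"], "div": 1, "add": 2, "add_atoms": ["C"]}}}
c2f6_ok = check_rel(perfluoroalkane, {"C": 2, "F": 6})
print(rel_ok, rel_bad, c2f6_ok)
```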

formula_dict_satisfies_constraints(formula_dict)[source]

Check if a molecular formula satisfies all constraints for this PFAS group.

Evaluates all constraint types in order: relational → only → equal → lte → gte. Stops evaluation at first failure for efficiency.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary (e.g., {‘C’: 8, ‘F’: 17, ‘O’: 2})

Returns:

True if all constraints are satisfied, False if any constraint fails

Return type:

bool

Constraint evaluation order:

1. Relational constraints ('rel') - element count relationships
2. 'Only' constraints - allowed elements
3. Equality constraints ('eq') - exact element counts
4. Upper bound constraints ('lte') - maximum element counts
5. Lower bound constraints ('gte') - minimum element counts

Examples

>>> # Perfluoroalkyl carboxylic acid constraints
>>> group.constraints = {
...     'only': ['C', 'F', 'O', 'H'],  # No other elements
...     'gte': {'C': 2, 'F': 3},  # At least 2 carbons, 3 fluorines
...     'eq': {'O': 2}  # Exactly 2 oxygens
... }
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 2, 'H': 1})
True
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 3, 'H': 1})
False  # Fails 'eq': {'O': 2}

Notes

Returns True immediately if no constraints are defined
Short-circuits on first constraint failure for performance
Constraint evaluation order is fixed for consistency
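The fixed order and the short-circuit behaviour can be sketched in one self-contained function. The name satisfies_constraints and the treatment of missing elements as 0 are assumptions; the sketch only mirrors the documented order (rel, only, eq, lte, gte) and is not the library's code.

```python
def satisfies_constraints(constraints, formula):
    """Run rel, only, eq, lte, gte in that order and stop at the first failure."""
    if not constraints:
        return True  # nothing to check

    def count(element):
        return formula.get(element, 0)  # assumption: absent elements count as 0

    def rel():
        for element, rule in constraints.get("rel", {}).items():
            expected = (sum(count(a) for a in rule.get("atoms", [])) / rule.get("div", 1)
                        + rule.get("add", 0)
                        + sum(count(a) for a in rule.get("add_atoms", [])))
            if count(element) != expected:
                return False
        return True

    def only():
        allowed = constraints.get("only")
        return allowed is None or sum(count(e) for e in allowed) == sum(formula.values())

    def bounds(key, compare):
        return all(compare(count(e), n) for e, n in constraints.get(key, {}).items())

    checks = [
        rel,
        only,
        lambda: bounds("eq", lambda have, want: have == want),
        lambda: bounds("lte", lambda have, want: have <= want),
        lambda: bounds("gte", lambda have, want: have >= want),
    ]
    # all() over a generator stops calling checks after the first False.
    return all(check() for check in checks)


pfca = {"only": ["C", "F", "O", "H"], "gte": {"C": 2, "F": 3}, "eq": {"O": 2}}
ok = satisfies_constraints(pfca, {"C": 8, "F": 15, "O": 2, "H": 1})
bad = satisfies_constraints(pfca, {"C": 8, "F": 15, "O": 3, "H": 1})
print(ok, bad)
```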

find_matched_atoms(mol)[source]

Find all substructure matches of this PFAS group’s SMARTS patterns in a molecule.

Parameters: mol (Chem.Mol) – RDKit molecule object to search for matches
Returns: List of matches, where each match is a list of atom indices in the molecule
Return type: List[List[int]]

Notes

If no SMARTS patterns are defined, returns an empty list.
Each SMARTS pattern is searched independently; matches from all patterns are combined.

component_satisfies_all_smarts(component)[source]

Check if a fluorinated component matches all SMARTS patterns of this PFAS group.

Parameters: component (PFASComponent) – PFASComponent object representing a fluorinated component in the molecule
Returns: True if the component matches all SMARTS patterns, False otherwise
Return type: bool

Notes

If no SMARTS patterns are defined for this group, returns True.
Each SMARTS pattern must have at least one match that includes the component’s atom.
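The rule in the notes above can be sketched with plain sets of atom indices, without RDKit. It is an assumption that "includes the component's atom" means the match shares at least one atom index with the component. The function name and the input layout (one list of matches per SMARTS pattern) are hypothetical.

```python
def component_satisfies_all(pattern_matches, component_atoms):
    """pattern_matches holds one list of matches (tuples of atom indices) per pattern."""
    component = set(component_atoms)
    # all() over an empty list is True, mirroring "no SMARTS patterns defined".
    return all(
        any(component & set(match) for match in matches)
        for matches in pattern_matches
    )


matches_per_pattern = [
    [(0, 1, 2)],          # pattern 1: one match
    [(5, 6), (2, 3)],     # pattern 2: two matches
]
touching = component_satisfies_all(matches_per_pattern, [2, 3, 4])
distant = component_satisfies_all(matches_per_pattern, [7, 8])
print(touching, distant)
```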

find_alkyl_components(mol, component_solver, **kwargs)[source]

Find fluorinated components in a molecule that match this PFAS group’s criteria.

Parameters:

mol (Chem.Mol) – RDKit molecule object to search
components (List[PFASComponent]) – List of PFASComponent objects representing fluorinated components in the molecule

Returns:

List of PFASComponent objects that match this PFAS group’s criteria

Return type:

List[PFASComponent]

Notes

Matches are determined based on componentSmarts and max_dist_from_comp attributes.
If componentSmarts is None, all components are considered.
max_dist_from_comp allows extending the search radius for functional groups.

find_aryl_components(mol, component_solver=None, **kwargs)[source]: Find aryl components in a molecule with comprehensive metrics.

find_components(mol, fd, component_solver, **kwargs)[source]: Find fluorinated components in a molecule that match this PFAS group’s criteria.

test(test_data=None)[source]

Test this PFAS group against test molecules from metadata.

Validates that the group correctly identifies positive examples and rejects negative examples based on test metadata in PFAS_groups_smarts.json.

Parameters: test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the group’s entry in PFAS_groups_smarts.json. Expected keys: category, examples, generate.
Returns: Test results with keys: passed (bool), total_tests (int), failures (list of dicts), category (str).
Return type: dict

Notes

For OECD groups: Tests against curated positive examples
For telomer groups: Tests generated molecules based on smiles patterns
For generic groups: Tests both positive and negative examples
Returns detailed failure information for debugging

Defines a single halogen structural group (SMARTS pattern + metadata). Used to build custom group libraries. See Customization.

PFASDefinition 

class PFASGroups.PFASDefinition(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]

Bases: object

Model class representing a PFAS definition based on structural criteria.

A PFAS definition identifies molecules using SMARTS patterns and/or fluorine ratio thresholds. Unlike HalogenGroup which focuses on specific functional groups, PFASDefinition provides broader chemical definitions (e.g., “contains at least one perfluoroalkyl moiety”).

id

Unique identifier for this PFAS definition

Type: int

name

Human-readable name (e.g., “Per- and polyfluoroalkyl substances”)

Type: str

description

Detailed description of what this definition represents

Type: str

fluorineRatio

Minimum ratio of fluorine atoms required (None if not applicable)

Type: Optional[float]

smarts_strings

Original SMARTS pattern strings for structural matching

Type: List[str]

smarts_patterns

Compiled SMARTS molecule objects for efficient matching

Type: List[Chem.Mol]

includeHydrogen

Whether to include hydrogen atoms in fluorine ratio calculations

Type: bool

requireBoth

If True, requires both SMARTS match AND fluorine ratio. If False, requires SMARTS match OR fluorine ratio.

Type: bool

Examples

>>> # Definition requiring perfluoroalkyl chain OR high fluorine ratio
>>> pfas_def = PFASDefinition(
...     id=1,
...     name="PFAS (OECD definition)",
...     smarts=["[CX4][CX4]([F])([F])[F]"],
...     fluorineRatio=0.4,
...     description="Contains perfluoroalkyl moiety with ≥2 carbons",
...     requireBoth=False
... )

__init__(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]

applies_to_molecule(mol_or_smiles: Mol | str, formula: Dict[str, int] | None = None, **kwargs) → bool[source]

Check if this PFAS definition applies to a given molecule.

This method evaluates whether a molecule meets the structural and/or compositional criteria defined by this PFASDefinition. The evaluation logic depends on the requireBoth flag:

If requireBoth=False (default): Returns True if EITHER SMARTS matches OR fluorine ratio is met (logical OR)
If requireBoth=True: Returns True only if BOTH SMARTS matches AND fluorine ratio are met (logical AND)

Parameters:

mol_or_smiles (Union[Chem.Mol, str]) – Input molecule as RDKit Mol object or SMILES string
formula (Optional[Dict[str, int]], default=None) – Pre-computed molecular formula as {element: count} dictionary. If None, will be computed from the molecule.
**kwargs (dict) –
Additional parameters:
- include_hydrogen (bool): Whether to include H in fluorine ratio calculation. Defaults to self.includeHydrogen
- require_both (bool): Override the instance’s requireBoth setting

Returns:

True if the molecule meets the definition criteria, False otherwise

Return type:

bool

Examples

>>> pfas_def = PFASDefinition(
...     id=1, name="Test", smarts=["[CX4]F"],
...     fluorineRatio=0.3, description="Test"
... )
>>> pfas_def.applies_to_molecule("FC(F)(F)C(F)(F)F")  # hexafluoroethane
True
>>> pfas_def.applies_to_molecule("CCCCCC")  # No fluorine
False

Notes

SMARTS patterns are checked using substructure matching (HasSubstructMatch)
Fluorine ratio is calculated as: F_count / total_atom_count
Invalid SMILES strings return False
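The OR/AND combination controlled by requireBoth can be sketched without RDKit by passing the SMARTS result in as a boolean. The function names are hypothetical. It is also an assumption that the threshold is inclusive (ratio >= minimum), that include_hydrogen=False removes H from the denominator, and that a threshold is always provided (fluorineRatio=None is not handled).

```python
def fluorine_ratio(formula, include_hydrogen=True):
    """F_count / total_atom_count, optionally leaving hydrogens out of the total."""
    total = sum(n for el, n in formula.items() if include_hydrogen or el != "H")
    return formula.get("F", 0) / total if total else 0.0


def meets_definition(smarts_hit, formula, min_ratio, require_both=False,
                     include_hydrogen=True):
    """Combine the SMARTS result and the ratio test with OR (default) or AND."""
    ratio_hit = fluorine_ratio(formula, include_hydrogen) >= min_ratio
    return (smarts_hit and ratio_hit) if require_both else (smarts_hit or ratio_hit)


# 1,1,1-trifluorobutane, CCCC(F)(F)F: C4H7F3, ratio 3/14 with H, 3/7 without.
tfb = {"C": 4, "H": 7, "F": 3}
either = meets_definition(True, tfb, 0.3)
both = meets_definition(True, tfb, 0.3, require_both=True)
both_no_h = meets_definition(True, tfb, 0.3, require_both=True, include_hydrogen=False)
print(either, both, both_no_h)
```

The same molecule passes or fails depending on the flags, which is why require_both and include_hydrogen are worth setting explicitly when comparing definitions.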

test(test_data=None)[source]

Test this PFAS definition against test molecules from metadata.

Validates that the definition correctly classifies true positives, true negatives, false positives, and false negatives based on test metadata in PFAS_definitions_smarts.json.

Parameters: test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the definition’s entry in PFAS_definitions_smarts.json. Expected keys: category, examples (dict with keys true_positives, true_negatives, false_positives, false_negatives each a list of dicts).
Returns: Test results with keys: passed (bool), total_tests (int), failures (list), category (str), stats (dict with counts for true/false positives/negatives).
Return type: dict

Notes

Tests against benchmark test compounds with known PFAS/non-PFAS labels
Validates both SMARTS patterns and fluorine ratio criteria
Returns detailed failure information for debugging

Encapsulates a regulatory PFAS definition and its matching logic. See PFAS Definitions for descriptions of the five built-in definitions.
