Data Models

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "OCCOCCO"])
mol = results[0]                      # MoleculeResult

print(mol.smiles, mol.is_PFAS)
for match in mol.matches:             # GroupMatch objects
    print(match.group_name, match.group_id)
    for comp in match.components:     # MatchComponent objects
        print("  atoms:", comp.atoms)

arr = results.to_array()             # EmbeddingArray
print(arr.shape)                      # (2, 116)

PFASEmbeddingSet

class PFASGroups.PFASEmbeddingSet(iterable: Iterable[Dict[str, Any]] = ())[source]

Bases: list

List-like container for multiple PFASEmbedding results.

Subclasses list so existing code that iterates over results continues to work. Call to_array() to produce a (n_molecules, n_columns) matrix from all stored results.

__init__(iterable: Iterable[Dict[str, Any]] = ())[source]
property matches: List[MatchView]

Flattened list of all MatchView objects across all molecules.

Some older code expects a matches attribute on a ResultsModel instance. This property provides a read-only aggregated view by concatenating the per-molecule match lists.

classmethod from_raw(results: Iterable[Dict[str, Any]]) PFASEmbeddingSet[source]

Wrap an existing list of result dicts without changing them.

classmethod from_smiles(smiles: str | List[str], **kwargs) PFASEmbeddingSet[source]

Parse SMILES string(s) and return a PFASEmbeddingSet.

Parameters:
  • smiles (str or list of str) – One or more SMILES strings.

  • **kwargs – Forwarded to parse_smiles() (e.g. halogens, saturation, progress).

classmethod from_mols(mols, **kwargs) PFASEmbeddingSet[source]

Parse RDKit molecules and return a PFASEmbeddingSet.

Parameters:
  • mols (list of rdkit.Chem.Mol) – List of RDKit molecule objects.

  • **kwargs – Forwarded to parse_mols().

classmethod from_inchis(inchis: List[str], **kwargs) PFASEmbeddingSet[source]

Parse InChI strings and return a PFASEmbeddingSet.

Parameters:
  • inchis (list of str) – List of InChI strings.

  • **kwargs – Forwarded to parse_mols().

reorder(indices: list | None = None, key: Callable[[PFASEmbedding], Any] = None, reverse: bool = False) PFASEmbeddingSet[source]

Return a new PFASEmbeddingSet with results reordered by explicit indices or by a key function.

Parameters:
  • indices (list of int, optional) – Explicit list of indices defining the new order. If provided, this takes precedence over the key function.

  • key (callable) – Function that takes a PFASEmbedding and returns a value to sort by.

  • reverse (bool, default False) – Whether to sort in descending order.
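
The precedence rule described above (explicit indices win over the key function) can be illustrated with a minimal sketch. ReorderableList is a hypothetical stand-in written for this example, not the library class, and it assumes that reverse only affects key-based sorting.

```python
from typing import Any, Callable, List, Optional


class ReorderableList(list):
    """Hypothetical list subclass mirroring the documented reorder() contract."""

    def reorder(
        self,
        indices: Optional[List[int]] = None,
        key: Optional[Callable[[Any], Any]] = None,
        reverse: bool = False,
    ) -> "ReorderableList":
        if indices is not None:
            # Explicit indices take precedence over the key function.
            return ReorderableList(self[i] for i in indices)
        # sorted() builds a new list, so the original order is left untouched.
        return ReorderableList(sorted(self, key=key, reverse=reverse))


items = ReorderableList(["ccc", "a", "bb"])
by_length = items.reorder(key=len)
explicit = items.reorder(indices=[2, 0, 1], key=len)
```

Here by_length is sorted by string length, while explicit ignores the key because indices were supplied. In both cases items itself is unchanged.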

iter_group_matches(group_id: int | None = None, group_name: str | None = None) Iterator[Tuple[PFASEmbedding, MatchView]][source]

Iterate over all PFAS group matches across all molecules.
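
As a rough illustration of this kind of filtered iteration, the sketch below walks plain dictionaries rather than PFASEmbedding and MatchView objects. The dictionary keys and the rule that both filters must match when both are given are assumptions made for this example.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-ins for PFASEmbedding / MatchView: plain dicts.
Molecule = Dict[str, Any]
Match = Dict[str, Any]


def iter_group_matches(
    molecules: List[Molecule],
    group_id: Optional[int] = None,
    group_name: Optional[str] = None,
) -> Iterator[Tuple[Molecule, Match]]:
    """Yield (molecule, match) pairs, optionally filtered by id and/or name."""
    for mol in molecules:
        for match in mol.get("matches", []):
            if group_id is not None and match.get("group_id") != group_id:
                continue
            if group_name is not None and match.get("group_name") != group_name:
                continue
            yield mol, match


molecules = [
    {"smiles": "CCCC(F)(F)F", "matches": [{"group_id": 1, "group_name": "A"}]},
    {"smiles": "OCCOCCO", "matches": []},
    {"smiles": "FC(F)(F)C(F)(F)F", "matches": [
        {"group_id": 1, "group_name": "A"},
        {"group_id": 2, "group_name": "B"},
    ]},
]

pairs = list(iter_group_matches(molecules, group_id=1))
```

With no filter, every match of every molecule is yielded; molecules without matches contribute nothing.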

plot_components_for_group(group_id: int | None = None, group_name: str | None = None, max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]

Plot all components for a specific PFAS group across molecules.

Either group_id or group_name (or both) can be provided to select the target group. Each panel corresponds to one molecule, with all its components for that group highlighted together.

show(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image[source]

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.

plot(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.

to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]

Export this molecule result to a SQL database.

Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).

Parameters:
  • filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.

  • dbname (str, optional) – Database name (for PostgreSQL/MySQL).

  • user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.

  • password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.

  • host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').

  • port (int, optional) – Database port. Defaults to os.environ.get('DB_PORT'), falling back to 5432 for PostgreSQL.

  • components_table (str, default "components") – Name of the table to store component-level data.

  • groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.

  • if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.
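
The environment-variable fallbacks listed above could be resolved along the following lines. resolve_db_params is a hypothetical helper written for this sketch and is not part of PFASGroups; only the variable names and default values come from the parameter descriptions.

```python
import os
from typing import Dict, Optional, Union


def resolve_db_params(
    user: Optional[str] = None,
    password: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> Dict[str, Union[str, int]]:
    """Hypothetical helper: fill missing connection settings from the environment."""
    return {
        # DB_USER and DB_PASSWORD have no fallback, so a missing variable raises KeyError.
        "user": user if user is not None else os.environ["DB_USER"],
        "password": password if password is not None else os.environ["DB_PASSWORD"],
        "host": host if host is not None else os.environ.get("DB_HOST", "localhost"),
        "port": port if port is not None else int(os.environ.get("DB_PORT", 5432)),
    }


# Demo environment for the example only.
os.environ.update({"DB_USER": "demo", "DB_PASSWORD": "secret"})
os.environ.pop("DB_HOST", None)
os.environ.pop("DB_PORT", None)
params = resolve_db_params(port=5433)
```

Explicit arguments always win; the environment is consulted only for values left as None.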

svg(filename: str, subwidth: int = 350, subheight: int = 350, ncols: int = 4) str[source]

Export all component combinations to an SVG file (vector graphics).

Components that share the same highlighted atoms within a molecule are merged into a single panel with a bullet-point legend.

Parameters:
  • filename (str) – Path to the output SVG file.

  • subwidth (int, default 350) – Width of each sub-image in pixels.

  • subheight (int, default 350) – Minimum height of each sub-image in pixels.

  • ncols (int, default 4) – Number of columns in the grid.

Returns:

Path to the created SVG file.

Return type:

str

summarise() str[source]

Return a coloured text summary of the results.

The summary includes:

  • number of molecules

  • counts of PFAS group and definition matches

  • total number of components across all group matches

  • the most frequent PFAS groups (colour-coded by halogen)

table() str[source]

Return a more detailed text table with one row per molecule.

The TSV table has the following columns: index (1-based), smiles, group_matches (count), definition_matches (count), and groups (per-molecule PFAS groups with counts, e.g. "Perfluoroalkyl (2); Polyfluoroalkyl (1)").
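
The column layout described above can be mocked up in a few lines. This sketch is not the library's implementation: the input dictionaries are hypothetical, and it assumes that the number in parentheses is how often a group name occurs for the molecule and that groups are listed by descending count.

```python
from collections import Counter
from typing import Dict, List


def format_groups(group_names: List[str]) -> str:
    """Render per-molecule group counts, e.g. 'Perfluoroalkyl (2); Polyfluoroalkyl (1)'."""
    counts = Counter(group_names)
    # most_common() orders by descending count; ties keep first-seen order.
    return "; ".join(f"{name} ({n})" for name, n in counts.most_common())


def tsv_table(rows: List[Dict]) -> str:
    """Build a tab-separated table with the documented columns (1-based index)."""
    header = ["index", "smiles", "group_matches", "definition_matches", "groups"]
    lines = ["\t".join(header)]
    for i, row in enumerate(rows, start=1):
        lines.append("\t".join([
            str(i),
            row["smiles"],
            str(len(row["groups"])),
            str(row["definition_matches"]),
            format_groups(row["groups"]),
        ]))
    return "\n".join(lines)


rows = [
    {"smiles": "CCCC(F)(F)F", "definition_matches": 1,
     "groups": ["Perfluoroalkyl", "Polyfluoroalkyl", "Perfluoroalkyl"]},
    {"smiles": "OCCOCCO", "definition_matches": 0, "groups": []},
]
table = tsv_table(rows)
```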

classify() DataFrame[source]

Return a classification DataFrame with one row per molecule.

Each molecule is classified by MoleculeResult.classify().

Returns:

Columns:

  • smiles — molecule SMILES.

  • category — classification label: OECD group name(s) if matched, otherwise "per-"/"poly-" + generic/telomeric group names (comma-separated).

  • total_component_size — sum of C-atom counts across all matched group components.

Return type:

pandas.DataFrame

summary() None[source]

Print a detailed coloured summary of matched groups across all molecules.

For each group, shows the component SMARTS type and, per component, the graph metrics: size (C-atom count), branching and mean eccentricity. Component size statistics (min, max, mean) are also shown.

plot_all_components_with_group_colours(max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]

Plot all matched components, coloured by PFAS group.

Each panel corresponds to one molecule; atoms are highlighted with colours assigned per PFAS group. The legend lists the groups found in that molecule.

to_sql_all(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]

Export all molecule results to a SQL database.

This method efficiently batches all molecules into the database in a single operation.

Parameters:
  • conn (str or sqlalchemy.engine.Engine, optional) – Database connection. Can be a SQLAlchemy Engine object, a connection string (e.g., 'postgresql://user:pass@host:port/db'), or a SQLite path with the 'sqlite:///' prefix.

  • filename (str, optional) – Path to SQLite database file (legacy parameter, use conn instead).

  • components_table (str, default "components") – Name of the table to store component-level data.

  • groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.

  • if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.

Examples

>>> # Using connection string
>>> results.to_sql_all(conn='postgresql://user:pass@localhost/pfas_db')
>>>
>>> # Using SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///pfas.db')
>>> results.to_sql_all(conn=engine)
>>>
>>> # Using filename (legacy)
>>> results.to_sql_all(filename='pfas.db')
to_fingerprint(group_selection: str = 'all', component_metrics: List[str] | None = None, selected_group_ids: List[int] | None = None, halogens: str | List[str] = 'F', saturation: str | None = 'per', molecule_metrics: List[str] | None = None, pfas_groups: List[Dict] | None = None, preset: str | None = None, count_mode: str | None = None, graph_metrics: List[str] | None = None, progress: bool = False, **kwargs) ndarray[source]

Deprecated. Use to_array() instead.

property n_molecules: int

Number of molecules in this set.

property has_cache: bool

Always True — PFASEmbeddingSet stores pre-parsed results.

property match_cache: PFASEmbeddingSet

Alias for the set itself (backward compat with PFASFingerprint API).

get_embedding(**kwargs) EmbeddingArray[source]

Alias for to_array() (backward compat with PFASFingerprint API).

to_array(component_metrics=<object object>, molecule_metrics=<object object>, group_selection=<object object>, selected_group_ids=<object object>, aggregation=<object object>, preset=<object object>, pfas_groups=<object object>, halogens=<object object>, progress: bool = True) EmbeddingArray[source]

Stack per-molecule embedding rows into a (n_mols, n_cols) matrix.

When called with no arguments, returns the last cached embedding (or binary by default on the first call). Pass explicit arguments to override and update the cache.

Parameters match those of PFASEmbedding.to_array(), plus:

  • progress (bool, default True) – Show a tqdm progress bar while computing embeddings.
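
The <object object> defaults in the signature suggest a sentinel value that lets the method tell "argument not passed" apart from an explicit None. The sketch below shows that general pattern together with a last-used cache. CachedOptions, its two options, and the merge behaviour are assumptions for illustration and do not reproduce the library's internals.

```python
from typing import Any, Dict

_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None


class CachedOptions:
    """Hypothetical holder that remembers the last explicitly passed options."""

    def __init__(self) -> None:
        # Assumed first-call defaults, following the text ("binary by default").
        self._last: Dict[str, Any] = {"component_metrics": ["binary"], "halogens": "F"}

    def resolve(self, component_metrics: Any = _UNSET, halogens: Any = _UNSET) -> Dict[str, Any]:
        passed = {"component_metrics": component_metrics, "halogens": halogens}
        # Only arguments the caller actually supplied override the cache.
        self._last.update({k: v for k, v in passed.items() if v is not _UNSET})
        return dict(self._last)


opts = CachedOptions()
first = opts.resolve()                              # defaults
second = opts.resolve(component_metrics=["count"])  # override and update the cache
third = opts.resolve()                              # no arguments: cached settings reused
```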

compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') float[source]

Compare two sets using KL divergence on group-occurrence frequencies.

Parameters:
  • other (PFASEmbeddingSet) – Second set to compare against.

  • method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).

Returns:

KL divergence value (lower = more similar).

Return type:

float
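
For orientation, the forward, reverse and symmetric variants can be sketched on plain group-count dictionaries. This is not the library's code: the smoothing constant, the use of the mean for the symmetric form, and the function names are assumptions, and the 'minmax' normalisation is left out because its formula is not given here.

```python
import math
from typing import Dict, Iterable


def to_distribution(counts: Dict[str, int], keys: Iterable[str], eps: float = 1e-9) -> Dict[str, float]:
    """Turn group-occurrence counts into smoothed relative frequencies."""
    keys = list(keys)
    total = sum(counts.get(k, 0) for k in keys) + eps * len(keys)
    return {k: (counts.get(k, 0) + eps) / total for k in keys}


def kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Forward KL divergence D(P || Q) in nats."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def compare(counts_a: Dict[str, int], counts_b: Dict[str, int], method: str = "symmetric") -> float:
    keys = sorted(set(counts_a) | set(counts_b))
    p, q = to_distribution(counts_a, keys), to_distribution(counts_b, keys)
    if method == "forward":
        return kld(p, q)
    if method == "reverse":
        return kld(q, p)
    if method == "symmetric":
        # Assumption: symmetric form taken as the mean of both directions.
        return 0.5 * (kld(p, q) + kld(q, p))
    raise ValueError(f"unsupported method: {method}")


a = {"Perfluoroalkyl": 8, "Polyfluoroalkyl": 2}
b = {"Perfluoroalkyl": 5, "Polyfluoroalkyl": 5}
same = compare(a, a)
different = compare(a, b)
```

Identical frequency profiles give a divergence of zero, and the value grows as the profiles drift apart, which matches the "lower = more similar" reading above.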

perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]

Perform PCA on the embedding matrix.

Parameters:
  • n_components (int, default 2)

  • plot (bool, default True)

  • output_file (str, optional)

  • color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).

Return type:

dict

perform_kernel_pca(n_components: int = 2, kernel: str = 'rbf', gamma: float | None = None, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]

Perform kernel PCA on the embedding matrix.

Parameters:
  • n_components (int, default 2)

  • kernel (str, default 'rbf')

  • gamma (float, optional)

  • plot (bool, default True)

  • output_file (str, optional)

  • color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'kpca_model', 'scaler', 'kernel', 'gamma', 'labels' (if color_by is set).

Return type:

dict

perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]

Perform t-SNE dimensionality reduction on the embedding matrix.

Parameters:
  • n_components (int, default 2)

  • perplexity (float, default 30.0)

  • learning_rate (float, default 200.0)

  • max_iter (int, default 1000)

  • plot (bool, default True)

  • output_file (str, optional)

  • color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).

Return type:

dict

perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]

Perform UMAP dimensionality reduction on the embedding matrix.

Parameters:
  • n_components (int, default 2)

  • n_neighbors (int, default 15)

  • min_dist (float, default 0.1)

  • metric (str, default 'euclidean')

  • plot (bool, default True)

  • output_file (str, optional)

  • color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).

Return type:

dict

column_names(component_metrics: List[str] | None = None, molecule_metrics: List[str] | None = None, group_selection: str = 'all', selected_group_ids: List[int] | None = None, preset: str | None = None, pfas_groups=None, halogens=None) List[str][source]

Return column labels (delegates to first element).

classmethod from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) PFASEmbeddingSet[source]

Load results from SQL database.

Parameters:
  • conn (str or SQLAlchemy Engine, optional) – Database connection string or engine

  • filename (str, optional) – SQLite database filename (alternative to conn)

  • components_table (str, default "components") – Name of the components table

  • groups_table (str, default "pfas_groups_in_compound") – Name of the groups table

  • limit (int, optional) – Limit number of molecules to load

Returns:

Loaded results

Return type:

PFASEmbeddingSet

PFASEmbeddingSet is a list-like container of MoleculeResult objects. Its length equals the number of input SMILES.

Key methods (method – description):

  • results[i] – Access the i-th MoleculeResult

  • results.to_dataframe() – Flatten all matches to a pandas.DataFrame

  • results.to_array(group_selection='all', component_metrics=['binary'], halogens='F', saturation='per') – Convert to an EmbeddingArray (PFASGroups default: halogens='F', 116 cols; HalogenGroups subclass default: halogens=['F','Cl','Br','I'], 464 cols)

  • results.to_sql(filename) – Persist to a SQLite or PostgreSQL database

  • PFASEmbeddingSet.from_sql(filename) – Load from a previously saved database

to_array options:

Note

The default value of halogens depends on which module the PFASEmbeddingSet came from:

  • from PFASGroups import parse_smiles → halogens='F' (116 columns)

  • from HalogenGroups import parse_smiles → halogens=['F','Cl','Br','I'] (464 columns)

Always pass halogens explicitly in reusable helpers or notebook functions to avoid silent fingerprint-width changes when the import source changes.

# PFASGroups default: all 116 groups, F only, binary → (n, 116)
arr = results.to_array()                                        # (n, 116)

# Always-explicit (safe in any import context)
arr = results.to_array(halogens='F')                            # (n, 116)

# OECD groups only
arr = results.to_array(group_selection='oecd', halogens='F')   # (n, 28)

# Count or max-component encoding
arr = results.to_array(component_metrics=['count'], halogens='F')
arr = results.to_array(component_metrics=['max_component'], halogens='F')

# Multi-halogen (advanced) — see halogengroups page
arr = results.to_array(halogens=['F', 'Cl', 'Br', 'I'])       # (n, 464)

MoleculeResult

PFASGroups.MoleculeResult

alias of PFASEmbedding

Represents parsing results for a single molecule.

Attributes (attribute – description):

  • smiles – Canonical SMILES string

  • inchi – InChI string

  • inchikey – InChIKey

  • matches – List of GroupMatch objects

  • pfas_definition_matches – List of definition matches (populated when include_PFAS_definitions=True)

  • n_matches – Number of group matches

  • is_PFAS – True if any match has is_PFAS=True

GroupMatch

Represents a single group detected in a molecule.

Attributes (attribute – description):

  • group_name – Human-readable group name

  • group_id – Integer group ID

  • group_category – 'OECD', 'Generic', or 'Fluorotelomer'

  • is_PFAS – Whether this group qualifies as PFAS

  • halogen – Halogen symbol matched ('F', 'Cl', etc.)

  • components – List of MatchComponent objects

  • n_components – Number of components

MatchComponent

A single structural component of a group match.

Attributes (attribute – description):

  • atoms – List of atom indices (0-based) in the RDKit molecule

  • n_atoms – Number of atoms in the component

  • n_halogens – Number of halogen atoms

  • halogen_fraction – Ratio of halogen atoms to total heavy atoms

  • effective_graph_resistance – Kirchhoff index of the component graph (None if not computed)

EmbeddingArray

See Embedding Analysis for full documentation.

EmbeddingArray is a numpy array subclass returned by PFASEmbeddingSet.to_array(). It carries molecule identity metadata:

arr = results.to_array()
print(arr.shape)        # (n_mols, n_groups)
print(arr.smiles)       # list of input SMILES strings

HalogenGroup

class PFASGroups.HalogenGroup(**kwargs)[source]

Bases: object

Model class representing a specific halogenated functional group with structural patterns.

A HalogenGroup defines a specific halogenated functional group using SMARTS patterns, component path types, and molecular formula constraints. Groups are used to classify molecules into specific categories (e.g., “Perfluoroalkyl carboxylic acid”).

id

Unique identifier for this Halogen group

Type:

int

name

Human-readable group name (e.g., “Perfluoroalkyl carboxylic acid”)

Type:

str

smarts

SMARTS patterns (compiled RDKit molecule) for functional group detection. None if group is defined by componentSmarts alone.

Type:

Chem.Mol or None

componentSmarts
Type:

list, str or None

componentForm
Type:

str or None

componentHalogens
Type:

list, str or None

componentSaturation
Type:

str or None (-> both)

max_dist_from_comp

Maximum graph distance (number of bonds) from fluorinated component to functional group. When > 0, extends component search radius to find nearby functional groups.

Type:

int

linker_smarts

Compiled SMARTS pattern for validating linker atoms between fluorinated component and functional group. When None (default), no restriction is applied to linker atoms. Only used when max_dist_from_comp > 0.

Type:

Chem.Mol or None

constraints

Molecular formula constraints with keys:

  • 'only': Elements that must be present exclusively (e.g., ['C', 'F', 'O'])

  • 'gte': Minimum element counts (e.g., {'C': 2})

  • 'lte': Maximum element counts (e.g., {'O': 2})

  • 'eq': Exact element counts (e.g., {'N': 1})

  • 'rel': Relational constraints (e.g., {'O': {'atoms': ['C'], 'div': 2, 'add': 0}})

Type:

dict

Examples

>>> # Perfluoroalkyl carboxylic acid: R_F-COOH
>>> pfaa = HalogenGroup(
...     id=1,
...     name="Perfluoroalkyl carboxylic acid",
...     smarts={"C(=O)O":1},  # Carboxylic acid group
...     componentSmarts="Perfluoroalkyl",
...     constraints={"only": ["C", "F", "O", "H"]},
...     max_dist_from_comp=0,
...     linker_smarts=None
... )

Notes

  • SMARTS patterns are compiled on initialization for efficient matching

  • Constraints are validated when checking if a molecule belongs to this group

  • max_dist_from_comp allows finding functional groups connected via non-fluorinated linkers

  • linker_smarts restricts which atoms can be in the path between component and functional group

__init__(**kwargs)
set_component_smarts(componentSmartss)[source]

Infers componentSmarts based on componentSmarts, componentSaturation, componentForm and componentHalogens.

set_componentSmarts(componentSmartss)[source]

Backward-compatible alias for set_component_smarts.

constraint_gte(formula_dict)[source]

Check ‘greater than or equal’ constraints on element counts.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary

Returns:

True if all ‘gte’ constraints are satisfied, False otherwise

Return type:

bool

Examples

>>> # Requires at least 2 carbons and 3 fluorines
>>> group.constraints = {'gte': {'C': 2, 'F': 3}}
>>> group.constraint_gte({'C': 3, 'F': 5, 'O': 1})  # True
>>> group.constraint_gte({'C': 1, 'F': 5, 'O': 1})  # False (C < 2)
constraint_lte(formula_dict)[source]

Check ‘less than or equal’ constraints on element counts.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary

Returns:

True if all ‘lte’ constraints are satisfied, False otherwise

Return type:

bool

Examples

>>> # Requires at most 2 oxygens
>>> group.constraints = {'lte': {'O': 2}}
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 2})  # True
>>> group.constraint_lte({'C': 8, 'F': 15, 'O': 3})  # False (O > 2)
constraint_eq(formula_dict)[source]

Check ‘equal to’ constraints on element counts.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary

Returns:

True if all ‘eq’ constraints are satisfied, False otherwise

Return type:

bool

Examples

>>> # Requires exactly 1 nitrogen
>>> group.constraints = {'eq': {'N': 1}}
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 1})  # True
>>> group.constraint_eq({'C': 8, 'F': 15, 'N': 2})  # False (N != 1)
constraint_only(formula_dict)[source]

Check ‘only’ constraint - molecule must contain only specified elements.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary

Returns:

True if molecule contains only the allowed elements, False otherwise

Return type:

bool

Examples

>>> # Molecule must contain only C, F, O, H
>>> group.constraints = {'only': ['C', 'F', 'O', 'H']}
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'H': 1})  # True
>>> group.constraint_only({'C': 8, 'F': 15, 'O': 2, 'S': 1})  # False (S not allowed)

Notes

Checks that the summed counts of the allowed elements equal the total number of atoms in the molecule.

constraint_rel(formula_dict)[source]

Check relational constraints between element counts.

Validates relationships of the form: count(element) = f(other_elements) where f can include division, addition, and summing other element counts.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary

Returns:

True if all relational constraints are satisfied, False otherwise

Return type:

bool

Notes

Constraint Format:

'rel': {
    'ElementA': {
        'atoms': ['ElementB', 'ElementC'],  # Elements to sum
        'div': int,  # Divisor (default 1)
        'add': int,  # Additive constant (default 0)
        'add_atoms': ['ElementD']  # Additional elements to add
    }
}

Formula: count(ElementA) = (sum(atoms) / div) + add + sum(add_atoms)

Examples

>>> # Carbon count must equal half the fluorine count
>>> group.constraints = {'rel': {'C': {'atoms': ['F'], 'div': 2, 'add': 0}}}
>>> group.constraint_rel({'C': 4, 'F': 8, 'O': 2})  # True (4 == 8/2)
>>> group.constraint_rel({'C': 3, 'F': 8, 'O': 2})  # False (3 != 8/2)
>>> # Oxygen count must equal carbon count plus 1
>>> group.constraints = {'rel': {'O': {'atoms': ['C'], 'div': 1, 'add': 1}}}
>>> group.constraint_rel({'C': 3, 'F': 7, 'O': 4})  # True (4 == 3 + 1)
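
The formula above can be checked by hand with a small standalone function. check_rel is a hypothetical name for this sketch, not a library function; it uses true division, so it assumes a non-integer quotient simply fails the equality test.

```python
from typing import Dict


def check_rel(rel: Dict[str, dict], formula: Dict[str, int]) -> bool:
    """Evaluate count(A) == sum(atoms) / div + add + sum(add_atoms) for every entry."""
    for element, rule in rel.items():
        base = sum(formula.get(a, 0) for a in rule.get("atoms", []))
        extra = sum(formula.get(a, 0) for a in rule.get("add_atoms", []))
        expected = base / rule.get("div", 1) + rule.get("add", 0) + extra
        if formula.get(element, 0) != expected:
            return False
    return True


# Carbon count must equal half the fluorine count.
half_f = {"C": {"atoms": ["F"], "div": 2, "add": 0}}
ok = check_rel(half_f, {"C": 4, "F": 8, "O": 2})
bad = check_rel(half_f, {"C": 3, "F": 8, "O": 2})

# Oxygen count must equal carbon count plus 1.
c_plus_one = {"O": {"atoms": ["C"], "div": 1, "add": 1}}
ok_plus = check_rel(c_plus_one, {"C": 3, "F": 7, "O": 4})
```
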
formula_dict_satisfies_constraints(formula_dict)[source]

Check if a molecular formula satisfies all constraints for this PFAS group.

Evaluates all constraint types in order: relational → only → equal → lte → gte. Stops evaluation at first failure for efficiency.

Parameters:

formula_dict (dict) – Molecular formula as {element: count} dictionary (e.g., {'C': 8, 'F': 17, 'O': 2})

Returns:

True if all constraints are satisfied, False if any constraint fails

Return type:

bool

Constraint evaluation order:

  1. Relational constraints ('rel') - element count relationships

  2. 'Only' constraints - allowed elements

  3. Equality constraints ('eq') - exact element counts

  4. Upper bound constraints ('lte') - maximum element counts

  5. Lower bound constraints ('gte') - minimum element counts

Examples

>>> # Perfluoroalkyl carboxylic acid constraints
>>> group.constraints = {
...     'only': ['C', 'F', 'O', 'H'],  # No other elements
...     'gte': {'C': 2, 'F': 3},  # At least 2 carbons, 3 fluorines
...     'eq': {'O': 2}  # Exactly 2 oxygens
... }
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 2, 'H': 1})
True
>>> group.formula_dict_satisfies_constraints({'C': 8, 'F': 15, 'O': 3, 'H': 1})
False  # Fails 'eq': {'O': 2}

Notes

  • Returns True immediately if no constraints are defined

  • Short-circuits on first constraint failure for performance

  • Constraint evaluation order is fixed for consistency
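
The fixed order and the short-circuit behaviour noted above can be sketched as a list of checks consumed by all(). The helper names are hypothetical, the 'rel' step is omitted for brevity, and the 'only' check is written as a membership test, which is assumed to be equivalent to the atom-sum comparison used by the library.

```python
from typing import Callable, Dict, List, Tuple

Formula = Dict[str, int]


def only_ok(allowed: List[str], formula: Formula) -> bool:
    # Every element actually present must be in the allowed list.
    return all(el in allowed for el, n in formula.items() if n > 0)


def eq_ok(spec: Formula, formula: Formula) -> bool:
    return all(formula.get(el, 0) == n for el, n in spec.items())


def lte_ok(spec: Formula, formula: Formula) -> bool:
    return all(formula.get(el, 0) <= n for el, n in spec.items())


def gte_ok(spec: Formula, formula: Formula) -> bool:
    return all(formula.get(el, 0) >= n for el, n in spec.items())


# Documented order after 'rel' (left out here to keep the sketch short).
CHECKS: List[Tuple[str, Callable]] = [
    ("only", only_ok), ("eq", eq_ok), ("lte", lte_ok), ("gte", gte_ok),
]


def satisfies(constraints: dict, formula: Formula) -> bool:
    # all() over a generator stops at the first failing check (short-circuit),
    # and returns True when no constraints are defined.
    return all(check(constraints[key], formula) for key, check in CHECKS if key in constraints)


constraints = {"only": ["C", "F", "O", "H"], "gte": {"C": 2, "F": 3}, "eq": {"O": 2}}
passes = satisfies(constraints, {"C": 8, "F": 15, "O": 2, "H": 1})
fails = satisfies(constraints, {"C": 8, "F": 15, "O": 3, "H": 1})
```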

find_matched_atoms(mol)[source]

Find all substructure matches of this PFAS group’s SMARTS patterns in a molecule.

Parameters:

mol (Chem.Mol) – RDKit molecule object to search for matches

Returns:

List of matches, where each match is a list of atom indices in the molecule

Return type:

List[List[int]]

Notes

  • If no SMARTS patterns are defined, returns an empty list.

  • Each SMARTS pattern is searched independently; matches from all patterns are combined.

component_satisfies_all_smarts(component)[source]

Check if a fluorinated component matches all SMARTS patterns of this PFAS group.

Parameters:

component (PFASComponent) – PFASComponent object representing a fluorinated component in the molecule

Returns:

True if the component matches all SMARTS patterns, False otherwise

Return type:

bool

Notes

  • If no SMARTS patterns are defined for this group, returns True.

  • Each SMARTS pattern must have at least one match that includes the component’s atom.

find_alkyl_components(mol, component_solver, **kwargs)[source]

Find fluorinated components in a molecule that match this PFAS group’s criteria.

Parameters:
  • mol (Chem.Mol) – RDKit molecule object to search

  • components (List[PFASComponent]) – List of PFASComponent objects representing fluorinated components in the molecule

Returns:

List of PFASComponent objects that match this PFAS group’s criteria

Return type:

List[PFASComponent]

Notes

  • Matches are determined based on componentSmarts and max_dist_from_comp attributes.

  • If componentSmarts is None, all components are considered.

  • max_dist_from_comp allows extending the search radius for functional groups.

find_aryl_components(mol, component_solver=None, **kwargs)[source]

Find aryl components in a molecule with comprehensive metrics.

find_components(mol, fd, component_solver, **kwargs)[source]

Find fluorinated components in a molecule that match this PFAS group’s criteria.

test(test_data=None)[source]

Test this PFAS group against test molecules from metadata.

Validates that the group correctly identifies positive examples and rejects negative examples based on test metadata in PFAS_groups_smarts.json.

Parameters:

test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the group’s entry in PFAS_groups_smarts.json. Expected keys: category, examples, generate.

Returns:

Test results with keys: passed (bool), total_tests (int), failures (list of dicts), category (str).

Return type:

dict

Notes

  • For OECD groups: Tests against curated positive examples

  • For telomer groups: Tests generated molecules based on smiles patterns

  • For generic groups: Tests both positive and negative examples

  • Returns detailed failure information for debugging

Defines a single halogen structural group (SMARTS pattern + metadata). Used to build custom group libraries. See Customization.

PFASDefinition

class PFASGroups.PFASDefinition(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]

Bases: object

Model class representing a PFAS definition based on structural criteria.

A PFAS definition identifies molecules using SMARTS patterns and/or fluorine ratio thresholds. Unlike HalogenGroup which focuses on specific functional groups, PFASDefinition provides broader chemical definitions (e.g., “contains at least one perfluoroalkyl moiety”).

id

Unique identifier for this PFAS definition

Type:

int

name

Human-readable name (e.g., “Per- and polyfluoroalkyl substances”)

Type:

str

description

Detailed description of what this definition represents

Type:

str

fluorineRatio

Minimum ratio of fluorine atoms required (None if not applicable)

Type:

Optional[float]

smarts_strings

Original SMARTS pattern strings for structural matching

Type:

List[str]

smarts_patterns

Compiled SMARTS molecule objects for efficient matching

Type:

List[Chem.Mol]

includeHydrogen

Whether to include hydrogen atoms in fluorine ratio calculations

Type:

bool

requireBoth

If True, requires both SMARTS match AND fluorine ratio. If False, requires SMARTS match OR fluorine ratio.

Type:

bool

Examples

>>> # Definition requiring perfluoroalkyl chain OR high fluorine ratio
>>> pfas_def = PFASDefinition(
...     id=1,
...     name="PFAS (OECD definition)",
...     smarts=["[CX4][CX4]([F])([F])[F]"],
...     fluorineRatio=0.4,
...     description="Contains perfluoroalkyl moiety with ≥2 carbons",
...     requireBoth=False
... )
__init__(id: int, name: str, smarts: List[str], fluorineRatio: float | None, description: str, **kwargs)[source]
applies_to_molecule(mol_or_smiles: Mol | str, formula: Dict[str, int] | None = None, **kwargs) bool[source]

Check if this PFAS definition applies to a given molecule.

This method evaluates whether a molecule meets the structural and/or compositional criteria defined by this PFASDefinition. The evaluation logic depends on the requireBoth flag:

  • If requireBoth=False (default): Returns True if EITHER SMARTS matches OR fluorine ratio is met (logical OR)

  • If requireBoth=True: Returns True only if BOTH SMARTS matches AND fluorine ratio are met (logical AND)

Parameters:
  • mol_or_smiles (Union[Chem.Mol, str]) – Input molecule as RDKit Mol object or SMILES string

  • formula (Optional[Dict[str, int]], default=None) – Pre-computed molecular formula as {element: count} dictionary. If None, will be computed from the molecule.

  • **kwargs (dict) –

    Additional parameters:

    • include_hydrogen (bool): Whether to include H in fluorine ratio calculation. Defaults to self.includeHydrogen

    • require_both (bool): Override the instance’s requireBoth setting

Returns:

True if the molecule meets the definition criteria, False otherwise

Return type:

bool

Examples

>>> pfas_def = PFASDefinition(
...     id=1, name="Test", smarts=["[CX4]F"],
...     fluorineRatio=0.3, description="Test"
... )
>>> pfas_def.applies_to_molecule("FC(F)(F)C(F)(F)F")  # hexafluoroethane
True
>>> pfas_def.applies_to_molecule("CCCCCC")  # No fluorine
False

Notes

  • SMARTS patterns are checked using substructure matching (HasSubstructMatch)

  • Fluorine ratio is calculated as: F_count / total_atom_count

  • Invalid SMILES strings return False
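
The OR/AND logic controlled by requireBoth, together with the fluorine ratio from the notes above, can be sketched without RDKit by passing in a precomputed SMARTS result. Both function names are hypothetical, and two behaviours are assumptions: the threshold comparison uses >=, and a definition without a ratio falls back to the SMARTS result alone.

```python
from typing import Dict, Optional


def fluorine_ratio(formula: Dict[str, int], include_hydrogen: bool = True) -> float:
    """F_count / total_atom_count, optionally leaving H out of the total."""
    total = sum(n for el, n in formula.items() if include_hydrogen or el != "H")
    return formula.get("F", 0) / total if total else 0.0


def definition_applies(
    smarts_match: bool,
    formula: Dict[str, int],
    min_ratio: Optional[float],
    require_both: bool = False,
    include_hydrogen: bool = True,
) -> bool:
    """Combine a precomputed SMARTS result with the fluorine-ratio test."""
    if min_ratio is None:
        # Assumption: without a ratio threshold only the SMARTS result counts.
        return smarts_match
    ratio_ok = fluorine_ratio(formula, include_hydrogen) >= min_ratio
    return (smarts_match and ratio_ok) if require_both else (smarts_match or ratio_ok)


hexafluoroethane = {"C": 2, "F": 6}          # ratio 6/8 = 0.75
hexane = {"C": 6, "H": 14}                   # no fluorine
trifluorobutane = {"C": 4, "H": 7, "F": 3}   # ratio 3/14, below 0.3

either = definition_applies(True, trifluorobutane, 0.3, require_both=False)
both = definition_applies(True, trifluorobutane, 0.3, require_both=True)
```

With the default OR logic a SMARTS hit is enough, so either is True; with require_both=True the low fluorine ratio makes both False.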

test(test_data=None)[source]

Test this PFAS definition against test molecules from metadata.

Validates that the definition correctly classifies true positives, true negatives, false positives, and false negatives based on test metadata in PFAS_definitions_smarts.json.

Parameters:

test_data (dict, optional) – Test metadata dictionary. If None, will be loaded from the definition’s entry in PFAS_definitions_smarts.json. Expected keys: category, examples (dict with keys true_positives, true_negatives, false_positives, false_negatives each a list of dicts).

Returns:

Test results with keys: passed (bool), total_tests (int), failures (list), category (str), stats (dict with counts for true/false positives/negatives).

Return type:

dict

Notes

  • Tests against benchmark test compounds with known PFAS/non-PFAS labels

  • Validates both SMARTS patterns and fluorine ratio criteria

  • Returns detailed failure information for debugging

Encapsulates a regulatory PFAS definition and its matching logic. See PFAS Definitions for descriptions of the five built-in definitions.