Algorithm
from PFASGroups import parse_smiles
results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
# results[0].matches contains the detected halogen groups
for mol in results:
print(mol.smiles, [m.group_name for m in mol.matches])
This page describes how PFASGroups detects and classifies halogenated structural groups in a molecule.
Overview
The algorithm processes a SMILES string in four stages:
Parse — convert SMILES to an RDKit molecule
Match — apply each HalogenGroup SMARTS pattern to the molecule
Deduplicate — resolve overlapping and redundant matches
Score — compute per-component structural metrics
Group library
The library contains 119 halogen groups organised into four categories:
Category |
Count |
Description |
|---|---|---|
OECD |
27 |
Groups adopted from the 2021 OECD PFAS definition framework |
Generic |
48 |
Broader halogenated structural motifs (alkyl, aryl, acyl, sulfonyl, …) |
Fluorotelomer |
43 |
Fluorotelomer groups including telomer alcohols, sulfonates, amides |
Aggregate |
3 |
Pattern-matching groups (e.g. |
When fluorine-only mode is used (the default), 114 groups are compiled (the 3 aggregate groups are always included in fingerprint headers but not directly matched).
SMARTS-based matching
Each group is defined by one or more SMARTS patterns. The core matcher calls
GetSubstructMatches() for each pattern. If a
group has multiple SMARTS, the union of all matched atom-sets is collected.
Halogen filtering
After matching, each hit is filtered to keep only components that actually contain the requested halogen(s). For fluorine-only mode only C-F bonds are retained; for multi-halogen mode any C-X bond where X ∈ target set is kept.
This filtering happens at the component level, so a group match can be partially retained (some components kept, others discarded).
Saturation filtering
If saturation='saturated', only components whose carbon scaffold is fully
saturated (no sp2/sp3d carbons) are kept. If saturation='unsaturated',
the complement is kept. The default None retains all.
Overlap deduplication
The algorithm uses a priority-ordered merge to resolve overlapping SMARTS matches:
Sort candidate groups by specificity (more specific first).
For each candidate group, remove any already-claimed atom from its matched components.
If a component shrinks below the minimum size threshold, discard it.
Record surviving components as confirmed matches.
This ensures that a carbon chain is attributed to the most specific group it belongs to, and prevents double-counting.
Component graph metrics
For each matched component a molecular graph is constructed and the following
metrics are computed (unless compute_component_metrics=False):
Effective graph resistance (BDE-weighted Kirchhoff index)
Each bond \((u,v)\) with bond order \(b\) is assigned a conductance proportional to the bond’s dissociation energy:
where:
\(\text{BDE}(Z_u, Z_v, 1)\) — single-bond dissociation energy (kcal/mol) for the element pair \((Z_u, Z_v)\), looked up from
PFASGroups/data/diatomic_bonds_dict.json(the same reference table used bymolecular_quantum_graph).\(\text{BDE}(Z_u, Z_v, b) = \text{BDE}(Z_u, Z_v, 1) \cdot f(b)\) — scaled by the bond-order model \(f(b)\) described below.
\(\text{BDE}_\text{ref}\) — the C-C single-bond BDE (~83 kcal/mol), used as normalisation so that \(c_{CC,\text{single}} = 1\).
Higher BDE → higher conductance → shorter effective resistance path. This means C-F bonds (BDE ≈ 130 kcal/mol, \(c \approx 1.56\)) contribute less resistance than C-C bonds (BDE ≈ 83 kcal/mol, \(c = 1.0\)).
Bond-order model
For bonds with order \(b > 1\) (double, triple, aromatic), the single-bond
BDE is scaled by \(f(b)\) where \(f(1) = 1\) by construction.
PFASGroups attempts to load the calibrated model produced by
molecular_quantum_graph’s bond_order_calibration.py from
molecular_quantum_graph/data/bond_order_model.json. Five functional forms
are supported:
Model |
Formula \(f(b)\) |
Default params |
|---|---|---|
|
\(1 + \alpha\,(b-1)\) |
\(\alpha = 0.3\) |
|
\(b^{\,\beta}\) |
\(\beta = 0.6\) |
|
\(1 + a\,\ln b\) |
\(a = 1.0\) |
|
\(1 + a\,(b-1) + c\,(b-1)^2\) |
— |
|
\(1 + a\,(b-1) + b_2\,(b-1)^2 + c\,(b-1)^3\) |
— |
When bond_order_model.json is absent (e.g. molecular_quantum_graph is
not installed), the linear model with \(\alpha = 0.3\) is used
automatically.
Resistance distance and Kirchhoff index
The weighted Laplacian \(L\) is assembled from the per-bond conductances. The exact resistance distance between atoms \(u\) and \(v\) is then:
where \(L^+\) is the Moore-Penrose pseudoinverse of \(L\) (computed
via numpy.linalg.pinv).
The Kirchhoff index (reported as effective_graph_resistance) satisfies
Physical properties of \(R(u,v)\):
Symmetry: \(R(u,v) = R(v,u)\)
Positivity: \(R(u,v) > 0\) for \(u \neq v\) in a connected graph
Triangle inequality: \(R(i,k) \leq R(i,j) + R(j,k)\)
Monotone with chain length: for a linear homologous PFCA series, \(K_f\) increases strictly with \(n\) (fluorinated carbons)
Branching reduces \(K_f\): branched isomers have lower \(K_f\) than the linear chain of the same carbon count, because branching shortens the maximum pairwise path
Accessing resistance metrics
All resistance values are available per-component in the parse results:
from PFASGroups import parse_smiles
results = parse_smiles('OC(=O)C(F)(F)C(F)(F)C(F)(F)F', halogens='F')
comp = results[0]['matches'][0]['components'][0]
# BDE-weighted Kirchhoff index
print(comp['effective_graph_resistance'])
# BDE-weighted resistance from functional group to structural landmarks
print(comp['min_resistance_dist_to_barycentre'])
print(comp['min_resistance_dist_to_centre'])
print(comp['max_resistance_dist_to_periphery'])
To limit computation time on very large components:
# Only compute resistance for components with < 200 atoms
results = parse_smiles(smiles, limit_effective_graph_resistance=200)
# Skip resistance entirely
results = parse_smiles(smiles, limit_effective_graph_resistance=0)
Atom count — number of heavy atoms in the component.
Halogen count — number of halogen atoms in the component.
Halogen fraction — ratio of halogen atoms to total heavy atoms.
Embedding generation
The primary embedding API is to_array(),
called on a pre-parsed PFASEmbeddingSet (avoids re-parsing):
results = parse_smiles(smiles)
arr = results.to_array() # (n_mols, 114) binary, fluorine-only
generate_fingerprint() is a convenience wrapper that parses
and embeds in a single call, returning (array, info_dict):
fps, info = generate_fingerprint(smiles) # (n_mols, 114), {'group_names': …}
Both functions share the same component_metrics / group_selection /
halogens parameters. The default mode is fluorine-only (halogens='F'),
producing 114 compiled columns — one per group. The column layout for
multi-halogen mode is:
[group_0_F, group_0_Cl, group_0_Br, group_0_I,
`` group_1_F, …,``
`` group_113_F, group_113_Cl, group_113_Br, group_113_I]``
Default F-only column count: 114 × 1 = 114. All-halogen column count: 114 × 4 = 456 (see Multi-Halogen Fingerprints).
Four count encoding values are available as items in component_metrics:
component_metrics value |
Cell value |
|---|---|
|
1 if any match exists, 0 otherwise |
|
Number of matching components |
|
Size of the largest matching component (atom count) |
|
Sum of all matching component sizes (atom count) |
PFAS definition classification
When include_PFAS_definitions=True, each molecule is additionally
evaluated against five regulatory PFAS definitions. Each definition is
encoded as a set of logical rules operating on the group matches already
computed. No additional SMARTS matching is performed.
See PFAS Definitions for the rule logic of each definition.