Algorithm

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
# results[0].matches contains the detected halogen groups
for mol in results:
    print(mol.smiles, [m.group_name for m in mol.matches])

This page describes how PFASGroups detects and classifies halogenated structural groups in a molecule.

Overview 

The algorithm processes a SMILES string in four stages:

Parse — convert SMILES to an RDKit molecule
Match — apply each HalogenGroup SMARTS pattern to the molecule
Deduplicate — resolve overlapping and redundant matches
Score — compute per-component structural metrics

Group library 

The library contains 119 halogen groups organised into four categories:

Category	Count	Description
OECD	27	Groups adopted from the 2021 OECD PFAS definition framework
Generic	48	Broader halogenated structural motifs (alkyl, aryl, acyl, sulfonyl, …)
Fluorotelomer	43	Fluorotelomer groups including telomer alcohols, sulfonates, amides
Aggregate	3	Pattern-matching groups (e.g. `Telomers`) with `compute=False`; included in fingerprint headers but not matched directly

When fluorine-only mode is used (the default), 114 groups are compiled (the 3 aggregate groups are always included in fingerprint headers but not directly matched).

SMARTS-based matching 

Each group is defined by one or more SMARTS patterns. The core matcher calls GetSubstructMatches() for each pattern. If a group has multiple SMARTS, the union of all matched atom-sets is collected.

After matching, each hit is filtered to keep only components that actually contain the requested halogen(s). For fluorine-only mode only C–F bonds are retained; for multi-halogen mode any C–X bond where X ∈ target set is kept.

This filtering happens at the component level, so a group match can be partially retained (some components kept, others discarded).

Saturation filtering 

If saturation='saturated', only components whose carbon scaffold is fully saturated (no sp2/sp3d carbons) are kept. If saturation='unsaturated', the complement is kept. The default None retains all.

Overlap deduplication 

The algorithm uses a priority-ordered merge to resolve overlapping SMARTS matches:

Sort candidate groups by specificity (more specific first).
For each candidate group, remove any already-claimed atom from its matched components.
If a component shrinks below the minimum size threshold, discard it.
Record surviving components as confirmed matches.

This ensures that a carbon chain is attributed to the most specific group it belongs to, and prevents double-counting.

Component graph metrics 

For each matched component a molecular graph is constructed and the following metrics are computed (unless compute_component_metrics=False):

Effective graph resistance (BDE-weighted Kirchhoff index)

Each bond \((u,v)\) with bond order \(b\) is assigned a conductance proportional to the bond’s dissociation energy:

\[c_{uv} = \frac{\text{BDE}(Z_u,\, Z_v,\, b)}{\text{BDE}_\text{ref}}\]

where:

\(\text{BDE}(Z_u, Z_v, 1)\) — single-bond dissociation energy (kcal/mol) for the element pair \((Z_u, Z_v)\), looked up from PFASGroups/data/diatomic_bonds_dict.json (the same reference table used by molecular_quantum_graph).
\(\text{BDE}(Z_u, Z_v, b) = \text{BDE}(Z_u, Z_v, 1) \cdot f(b)\) — scaled by the bond-order model \(f(b)\) described below.
\(\text{BDE}_\text{ref}\) — the C–C single-bond BDE (~83 kcal/mol), used as normalisation so that \(c_{CC,\text{single}} = 1\).

Higher BDE → higher conductance → shorter effective resistance path. This means C–F bonds (BDE ≈ 130 kcal/mol, \(c \approx 1.56\)) contribute less resistance than C–C bonds (BDE ≈ 83 kcal/mol, \(c = 1.0\)).

Bond-order model 

For bonds with order \(b > 1\) (double, triple, aromatic), the single-bond BDE is scaled by \(f(b)\) where \(f(1) = 1\) by construction. PFASGroups attempts to load the calibrated model produced by molecular_quantum_graph’s bond_order_calibration.py from molecular_quantum_graph/data/bond_order_model.json. Five functional forms are supported:

Model	Formula \(f(b)\)	Default params
`linear` (fallback)	\(1 + \alpha\,(b-1)\)	\(\alpha = 0.3\)
`power`	\(b^{\,\beta}\)	\(\beta = 0.6\)
`log`	\(1 + a\,\ln b\)	\(a = 1.0\)
`poly2`	\(1 + a\,(b-1) + c\,(b-1)^2\)	—
`poly3`	\(1 + a\,(b-1) + b_2\,(b-1)^2 + c\,(b-1)^3\)	—

When bond_order_model.json is absent (e.g. molecular_quantum_graph is not installed), the linear model with \(\alpha = 0.3\) is used automatically.

Resistance distance and Kirchhoff index 

The weighted Laplacian \(L\) is assembled from the per-bond conductances. The exact resistance distance between atoms \(u\) and \(v\) is then:

\[R(u,v) = L^+_{uu} + L^+_{vv} - 2\,L^+_{uv}\]

where \(L^+\) is the Moore–Penrose pseudoinverse of \(L\) (computed via numpy.linalg.pinv).

The Kirchhoff index (reported as effective_graph_resistance) satisfies

\[K_f = \sum_{u < v} R(u, v) = n \sum_{i=2}^{n} \frac{1}{\lambda_i(L)}\]

Physical properties of \(R(u,v)\):

Symmetry: \(R(u,v) = R(v,u)\)
Positivity: \(R(u,v) > 0\) for \(u \neq v\) in a connected graph
Triangle inequality: \(R(i,k) \leq R(i,j) + R(j,k)\)
Monotone with chain length: for a linear homologous PFCA series, \(K_f\) increases strictly with \(n\) (fluorinated carbons)
Branching reduces \(K_f\): branched isomers have lower \(K_f\) than the linear chain of the same carbon count, because branching shortens the maximum pairwise path

Accessing resistance metrics 

All resistance values are available per-component in the parse results:

from PFASGroups import parse_smiles

results = parse_smiles('OC(=O)C(F)(F)C(F)(F)C(F)(F)F', halogens='F')
comp = results[0]['matches'][0]['components'][0]

# BDE-weighted Kirchhoff index
print(comp['effective_graph_resistance'])

# BDE-weighted resistance from functional group to structural landmarks
print(comp['min_resistance_dist_to_barycentre'])
print(comp['min_resistance_dist_to_centre'])
print(comp['max_resistance_dist_to_periphery'])

To limit computation time on very large components:

# Only compute resistance for components with < 200 atoms
results = parse_smiles(smiles, limit_effective_graph_resistance=200)

# Skip resistance entirely
results = parse_smiles(smiles, limit_effective_graph_resistance=0)

Atom count — number of heavy atoms in the component.

Halogen count — number of halogen atoms in the component.

Halogen fraction — ratio of halogen atoms to total heavy atoms.

Embedding generation 

The primary embedding API is to_array(), called on a pre-parsed PFASEmbeddingSet (avoids re-parsing):

results = parse_smiles(smiles)
arr = results.to_array()          # (n_mols, 114) binary, fluorine-only

generate_fingerprint() is a convenience wrapper that parses and embeds in a single call, returning (array, info_dict):

fps, info = generate_fingerprint(smiles)   # (n_mols, 114), {'group_names': …}

Both functions share the same component_metrics / group_selection / halogens parameters. The default mode is fluorine-only (halogens='F'), producing 114 compiled columns — one per group. The column layout for multi-halogen mode is:

[group_0_F, group_0_Cl, group_0_Br, group_0_I, `` group_1_F, …,`` `` group_113_F, group_113_Cl, group_113_Br, group_113_I]``

Default F-only column count: 114 × 1 = 114. All-halogen column count: 114 × 4 = 456 (see Multi-Halogen Fingerprints).

Four count encoding values are available as items in component_metrics:

component_metrics value	Cell value
`'binary'` (default)	1 if any match exists, 0 otherwise
`'count'`	Number of matching components
`'max_component'`	Size of the largest matching component (atom count)
`'total_component'`	Sum of all matching component sizes (atom count)

PFAS definition classification 

When include_PFAS_definitions=True, each molecule is additionally evaluated against five regulatory PFAS definitions. Each definition is encoded as a set of logical rules operating on the group matches already computed. No additional SMARTS matching is performed.

See PFAS Definitions for the rule logic of each definition.