Algorithm

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
# results[0].matches contains the detected halogen groups
for mol in results:
    print(mol.smiles, [m.group_name for m in mol.matches])

This page describes how PFASGroups detects and classifies halogenated structural groups in a molecule.

Overview

The algorithm processes a SMILES string in four stages:

  1. Parse — convert SMILES to an RDKit molecule

  2. Match — apply each HalogenGroup SMARTS pattern to the molecule

  3. Deduplicate — resolve overlapping and redundant matches

  4. Score — compute per-component structural metrics

Group library

The library contains 119 halogen groups organised into four categories:

Category

Count

Description

OECD

27

Groups adopted from the 2021 OECD PFAS definition framework

Generic

48

Broader halogenated structural motifs (alkyl, aryl, acyl, sulfonyl, …)

Fluorotelomer

43

Fluorotelomer groups including telomer alcohols, sulfonates, amides

Aggregate

3

Pattern-matching groups (e.g. Telomers) with compute=False; included in fingerprint headers but not matched directly

When fluorine-only mode is used (the default), 114 groups are compiled (the 3 aggregate groups are always included in fingerprint headers but not directly matched).

SMARTS-based matching

Each group is defined by one or more SMARTS patterns. The core matcher calls GetSubstructMatches() for each pattern. If a group has multiple SMARTS, the union of all matched atom-sets is collected.

Halogen filtering

After matching, each hit is filtered to keep only components that actually contain the requested halogen(s). For fluorine-only mode only C–F bonds are retained; for multi-halogen mode any C–X bond where X ∈ target set is kept.

This filtering happens at the component level, so a group match can be partially retained (some components kept, others discarded).

Saturation filtering

If saturation='saturated', only components whose carbon scaffold is fully saturated (no sp2/sp3d carbons) are kept. If saturation='unsaturated', the complement is kept. The default None retains all.

Overlap deduplication

The algorithm uses a priority-ordered merge to resolve overlapping SMARTS matches:

  1. Sort candidate groups by specificity (more specific first).

  2. For each candidate group, remove any already-claimed atom from its matched components.

  3. If a component shrinks below the minimum size threshold, discard it.

  4. Record surviving components as confirmed matches.

This ensures that a carbon chain is attributed to the most specific group it belongs to, and prevents double-counting.

Component graph metrics

For each matched component a molecular graph is constructed and the following metrics are computed (unless compute_component_metrics=False):

Effective graph resistance (BDE-weighted Kirchhoff index)

Each bond \((u,v)\) with bond order \(b\) is assigned a conductance proportional to the bond’s dissociation energy:

\[c_{uv} = \frac{\text{BDE}(Z_u,\, Z_v,\, b)}{\text{BDE}_\text{ref}}\]

where:

  • \(\text{BDE}(Z_u, Z_v, 1)\) — single-bond dissociation energy (kcal/mol) for the element pair \((Z_u, Z_v)\), looked up from PFASGroups/data/diatomic_bonds_dict.json (the same reference table used by molecular_quantum_graph).

  • \(\text{BDE}(Z_u, Z_v, b) = \text{BDE}(Z_u, Z_v, 1) \cdot f(b)\) — scaled by the bond-order model \(f(b)\) described below.

  • \(\text{BDE}_\text{ref}\) — the C–C single-bond BDE (~83 kcal/mol), used as normalisation so that \(c_{CC,\text{single}} = 1\).

Higher BDE → higher conductance → shorter effective resistance path. This means C–F bonds (BDE ≈ 130 kcal/mol, \(c \approx 1.56\)) contribute less resistance than C–C bonds (BDE ≈ 83 kcal/mol, \(c = 1.0\)).

Bond-order model

For bonds with order \(b > 1\) (double, triple, aromatic), the single-bond BDE is scaled by \(f(b)\) where \(f(1) = 1\) by construction. PFASGroups attempts to load the calibrated model produced by molecular_quantum_graph’s bond_order_calibration.py from molecular_quantum_graph/data/bond_order_model.json. Five functional forms are supported:

Model

Formula \(f(b)\)

Default params

linear (fallback)

\(1 + \alpha\,(b-1)\)

\(\alpha = 0.3\)

power

\(b^{\,\beta}\)

\(\beta = 0.6\)

log

\(1 + a\,\ln b\)

\(a = 1.0\)

poly2

\(1 + a\,(b-1) + c\,(b-1)^2\)

poly3

\(1 + a\,(b-1) + b_2\,(b-1)^2 + c\,(b-1)^3\)

When bond_order_model.json is absent (e.g. molecular_quantum_graph is not installed), the linear model with \(\alpha = 0.3\) is used automatically.

Resistance distance and Kirchhoff index

The weighted Laplacian \(L\) is assembled from the per-bond conductances. The exact resistance distance between atoms \(u\) and \(v\) is then:

\[R(u,v) = L^+_{uu} + L^+_{vv} - 2\,L^+_{uv}\]

where \(L^+\) is the Moore–Penrose pseudoinverse of \(L\) (computed via numpy.linalg.pinv).

The Kirchhoff index (reported as effective_graph_resistance) satisfies

\[K_f = \sum_{u < v} R(u, v) = n \sum_{i=2}^{n} \frac{1}{\lambda_i(L)}\]

Physical properties of \(R(u,v)\):

  • Symmetry: \(R(u,v) = R(v,u)\)

  • Positivity: \(R(u,v) > 0\) for \(u \neq v\) in a connected graph

  • Triangle inequality: \(R(i,k) \leq R(i,j) + R(j,k)\)

  • Monotone with chain length: for a linear homologous PFCA series, \(K_f\) increases strictly with \(n\) (fluorinated carbons)

  • Branching reduces \(K_f\): branched isomers have lower \(K_f\) than the linear chain of the same carbon count, because branching shortens the maximum pairwise path

Accessing resistance metrics

All resistance values are available per-component in the parse results:

from PFASGroups import parse_smiles

results = parse_smiles('OC(=O)C(F)(F)C(F)(F)C(F)(F)F', halogens='F')
comp = results[0]['matches'][0]['components'][0]

# BDE-weighted Kirchhoff index
print(comp['effective_graph_resistance'])

# BDE-weighted resistance from functional group to structural landmarks
print(comp['min_resistance_dist_to_barycentre'])
print(comp['min_resistance_dist_to_centre'])
print(comp['max_resistance_dist_to_periphery'])

To limit computation time on very large components:

# Only compute resistance for components with < 200 atoms
results = parse_smiles(smiles, limit_effective_graph_resistance=200)

# Skip resistance entirely
results = parse_smiles(smiles, limit_effective_graph_resistance=0)

Atom count — number of heavy atoms in the component.

Halogen count — number of halogen atoms in the component.

Halogen fraction — ratio of halogen atoms to total heavy atoms.

Embedding generation

The primary embedding API is to_array(), called on a pre-parsed PFASEmbeddingSet (avoids re-parsing):

results = parse_smiles(smiles)
arr = results.to_array()          # (n_mols, 114) binary, fluorine-only

generate_fingerprint() is a convenience wrapper that parses and embeds in a single call, returning (array, info_dict):

fps, info = generate_fingerprint(smiles)   # (n_mols, 114), {'group_names': …}

Both functions share the same component_metrics / group_selection / halogens parameters. The default mode is fluorine-only (halogens='F'), producing 114 compiled columns — one per group. The column layout for multi-halogen mode is:

[group_0_F, group_0_Cl, group_0_Br, group_0_I, `` group_1_F, …,`` `` group_113_F, group_113_Cl, group_113_Br, group_113_I]``

Default F-only column count: 114 × 1 = 114. All-halogen column count: 114 × 4 = 456 (see Multi-Halogen Fingerprints).

Four count encoding values are available as items in component_metrics:

component_metrics value

Cell value

'binary' (default)

1 if any match exists, 0 otherwise

'count'

Number of matching components

'max_component'

Size of the largest matching component (atom count)

'total_component'

Sum of all matching component sizes (atom count)

PFAS definition classification

When include_PFAS_definitions=True, each molecule is additionally evaluated against five regulatory PFAS definitions. Each definition is encoded as a set of logical rules operating on the group matches already computed. No additional SMARTS matching is performed.

See PFAS Definitions for the rule logic of each definition.