Algorithm
=========

This page describes how PFASGroups detects and classifies halogenated
structural groups in a molecule.  A minimal call looks like this:

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
   # results[0].matches contains the detected halogen groups
   for mol in results:
       print(mol.smiles, [m.group_name for m in mol.matches])

.. contents:: Contents
   :local:
   :depth: 2

Overview
--------

The algorithm processes a SMILES string in four stages:

1. **Parse** — convert SMILES to an RDKit molecule
2. **Match** — apply each HalogenGroup SMARTS pattern to the molecule
3. **Deduplicate** — resolve overlapping and redundant matches
4. **Score** — compute per-component structural metrics

Group library
-------------

The library contains **119 halogen groups** organised into four categories:

.. list-table::
   :header-rows: 1
   :widths: 15 15 70

   * - Category
     - Count
     - Description
   * - OECD
     - 27
     - Groups adopted from the 2021 OECD PFAS definition framework
   * - Generic
     - 48
     - Broader halogenated structural motifs (alkyl, aryl, acyl, sulfonyl, …)
   * - Fluorotelomer
     - 43
     - Fluorotelomer groups including telomer alcohols, sulfonates, amides
   * - Aggregate
     - 3
     - Pattern-matching groups (e.g. ``Telomers``) with ``compute=False``;
       included in fingerprint headers but not matched directly

In the default fluorine-only mode, **114 groups** are compiled.  The 3
aggregate groups are not matched directly, but they are always included in
fingerprint headers.

SMARTS-based matching
----------------------

Each group is defined by one or more SMARTS patterns.  The core matcher calls
:meth:`~rdkit.Chem.rdchem.Mol.GetSubstructMatches` for each pattern.  If a
group has multiple SMARTS, the union of all matched atom sets is collected.

Halogen filtering
~~~~~~~~~~~~~~~~~

After matching, each hit is filtered to keep only components that actually
contain the requested halogen(s).  In fluorine-only mode, only components with
C–F bonds are retained; in multi-halogen mode, any component with a C–X bond,
where X is in the target set, is kept.

This filtering happens at the component level, so a group match can be
partially retained (some components kept, others discarded).
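
The component-level filter can be sketched in plain Python.  This is an
illustration under stated assumptions, not the PFASGroups implementation: the
function name ``filter_components`` and the representation of a component as a
list of ``(element, element)`` bond pairs are hypothetical.

.. code-block:: python

   def filter_components(components, target_halogens=("F",)):
       """Keep components that contain at least one C-X bond, X in target_halogens.

       Each component is a list of bonds given as (element_u, element_v) pairs
       (an assumed, simplified representation).
       """
       targets = set(target_halogens)
       kept = []
       for bonds in components:
           has_cx = any(
               (a == "C" and b in targets) or (b == "C" and a in targets)
               for a, b in bonds
           )
           if has_cx:
               kept.append(bonds)
       return kept


   # One group match with two components
   match = [
       [("C", "F"), ("C", "F"), ("C", "C")],   # fluorinated component
       [("C", "Cl"), ("C", "C")],              # chlorinated component
   ]
   kept_f = filter_components(match)                    # fluorine-only: first component
   kept_multi = filter_components(match, ("F", "Cl"))   # multi-halogen: both components

In the fluorine-only call the match is partially retained: one component
survives and the other is discarded.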

Saturation filtering
~~~~~~~~~~~~~~~~~~~~~

If ``saturation='saturated'``, only components whose carbon scaffold is fully
saturated (no sp or sp2 carbons) are kept.  If ``saturation='unsaturated'``,
the complement is kept.  The default ``None`` retains all components.

Overlap deduplication
-----------------------

The algorithm uses a priority-ordered merge to resolve overlapping SMARTS
matches:

1. Sort candidate groups by *specificity* (more specific first).
2. For each candidate group, remove any already-claimed atom from its
   matched components.
3. If a component shrinks below the minimum size threshold, discard it.
4. Record surviving components as confirmed matches.

This ensures that a carbon chain is attributed to the most specific group
it belongs to, and prevents double-counting.
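
The four steps above can be sketched in plain Python.  This is an illustration
only, not the PFASGroups implementation: the function name ``deduplicate``,
the ``(name, specificity, components)`` candidate tuples, the ``min_size``
parameter, and the group names are all assumptions made for the example.

.. code-block:: python

   def deduplicate(candidates, min_size=1):
       """Priority-ordered merge of overlapping matches (illustrative sketch).

       candidates: list of (group_name, specificity, components); each
       component is a set of atom indices.  Higher specificity wins.
       """
       claimed = set()
       confirmed = []
       # 1. Most specific groups claim atoms first.
       for name, _spec, components in sorted(candidates, key=lambda c: -c[1]):
           newly_claimed = set()
           for comp in components:
               # 2. Drop atoms already claimed by a more specific group.
               remaining = set(comp) - claimed
               # 3. Discard components that shrank below the threshold.
               if len(remaining) < min_size:
                   continue
               # 4. Record the survivor.
               confirmed.append((name, remaining))
               newly_claimed |= remaining
           # Assumption: components of the same group do not compete with
           # each other, so atoms are claimed once the group is finished.
           claimed |= newly_claimed
       return confirmed


   candidates = [
       ("group_generic", 1, [{0, 1, 2, 3, 4}]),
       ("group_specific", 5, [{3, 4, 5}]),
   ]
   result = deduplicate(candidates, min_size=2)
   # The specific group keeps atoms 3-5; the generic group keeps atoms 0-2.

Each atom ends up in at most one confirmed component, which is what prevents
double-counting.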

Component graph metrics
------------------------

For each matched component a molecular graph is constructed and the following
metrics are computed (unless ``compute_component_metrics=False``):

**Effective graph resistance** (BDE-weighted Kirchhoff index)

Each bond :math:`(u,v)` with bond order :math:`b` is assigned a *conductance*
proportional to the bond's dissociation energy:

.. math::

   c_{uv} = \frac{\text{BDE}(Z_u,\, Z_v,\, b)}{\text{BDE}_\text{ref}}

where:

* :math:`\text{BDE}(Z_u, Z_v, 1)` — single-bond dissociation energy (kcal/mol)
  for the element pair :math:`(Z_u, Z_v)`, looked up from
  ``PFASGroups/data/diatomic_bonds_dict.json`` (the same reference table used
  by ``molecular_quantum_graph``).
* :math:`\text{BDE}(Z_u, Z_v, b) = \text{BDE}(Z_u, Z_v, 1) \cdot f(b)` —
  scaled by the **bond-order model** :math:`f(b)` described below.
* :math:`\text{BDE}_\text{ref}` — the C–C single-bond BDE (~83 kcal/mol),
  used as normalisation so that :math:`c_{CC,\text{single}} = 1`.

Higher BDE gives higher conductance and therefore *lower* resistance along
that bond (:math:`r_{uv} = 1/c_{uv}`).  C–F bonds (BDE ≈ 130 kcal/mol,
:math:`c \approx 1.57`) thus contribute less resistance than C–C bonds
(BDE ≈ 83 kcal/mol, :math:`c = 1.0`).
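
As a worked illustration of the conductance formula, the sketch below uses
only the two BDE values quoted on this page.  The lookup table ``BDE_SINGLE``
and the function ``conductance`` are hypothetical; the schema of
``diatomic_bonds_dict.json`` is not reproduced here.

.. code-block:: python

   # Single-bond BDEs in kcal/mol, limited to the two values quoted in the text.
   # Keys are unordered element pairs; frozenset(["C", "C"]) collapses to {"C"}.
   BDE_SINGLE = {
       frozenset(["C"]): 83.0,        # C-C
       frozenset(["C", "F"]): 130.0,  # C-F
   }
   BDE_REF = BDE_SINGLE[frozenset(["C"])]


   def conductance(elem_u, elem_v, f_b=1.0):
       """c_uv = BDE(Z_u, Z_v, 1) * f(b) / BDE_ref."""
       bde_single = BDE_SINGLE[frozenset([elem_u, elem_v])]
       return bde_single * f_b / BDE_REF


   c_cc = conductance("C", "C")   # reference bond, equals 1 by construction
   c_cf = conductance("C", "F")   # 130 / 83
   r_cf = 1.0 / c_cf              # per-bond resistance is the reciprocal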

Bond-order model
~~~~~~~~~~~~~~~~

For bonds with order :math:`b > 1` (double, triple, aromatic), the single-bond
BDE is scaled by :math:`f(b)` where :math:`f(1) = 1` by construction.
PFASGroups attempts to load the calibrated model produced by
``molecular_quantum_graph``'s ``bond_order_calibration.py`` from
``molecular_quantum_graph/data/bond_order_model.json``.  Five functional forms
are supported:

.. list-table::
   :header-rows: 1
   :widths: 12 35 18

   * - Model
     - Formula :math:`f(b)`
     - Default params
   * - ``linear``  *(fallback)*
     - :math:`1 + \alpha\,(b-1)`
     - :math:`\alpha = 0.3`
   * - ``power``
     - :math:`b^{\,\beta}`
     - :math:`\beta = 0.6`
   * - ``log``
     - :math:`1 + a\,\ln b`
     - :math:`a = 1.0`
   * - ``poly2``
     - :math:`1 + a\,(b-1) + c\,(b-1)^2`
     - —
   * - ``poly3``
     - :math:`1 + a\,(b-1) + b_2\,(b-1)^2 + c\,(b-1)^3`
     - —

When ``bond_order_model.json`` is absent (e.g. ``molecular_quantum_graph`` is
not installed), the **linear model with** :math:`\alpha = 0.3` is used
automatically.
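
The tabulated forms can be written as a single dispatch function.  This is a
minimal sketch, not the loader used by PFASGroups: the function name
``bond_order_factor`` and the keyword names (``alpha``, ``beta``, ``a``,
``b2``, ``c``) are assumptions that mirror the symbols in the table, and
treating aromatic bonds as :math:`b = 1.5` is likewise an assumption.

.. code-block:: python

   import math


   def bond_order_factor(b, model="linear", **params):
       """Return f(b) for one of the tabulated forms; f(1) == 1 for all of them."""
       x = b - 1.0
       if model == "linear":
           return 1.0 + params.get("alpha", 0.3) * x
       if model == "power":
           return b ** params.get("beta", 0.6)
       if model == "log":
           return 1.0 + params.get("a", 1.0) * math.log(b)
       if model == "poly2":
           # No defaults are tabulated, so the coefficients must be supplied.
           return 1.0 + params["a"] * x + params["c"] * x**2
       if model == "poly3":
           return 1.0 + params["a"] * x + params["b2"] * x**2 + params["c"] * x**3
       raise ValueError(f"unknown bond-order model: {model!r}")


   f_double = bond_order_factor(2)        # linear fallback: 1 + 0.3 * 1
   f_aromatic = bond_order_factor(1.5)    # linear fallback: 1 + 0.3 * 0.5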

Resistance distance and Kirchhoff index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The **weighted Laplacian** :math:`L` is assembled from the per-bond conductances.
The exact resistance distance between atoms :math:`u` and :math:`v` is then:

.. math::

   R(u,v) = L^+_{uu} + L^+_{vv} - 2\,L^+_{uv}

where :math:`L^+` is the **Moore–Penrose pseudoinverse** of :math:`L` (computed
via ``numpy.linalg.pinv``).

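The **Kirchhoff index** (reported as ``effective_graph_resistance``) of a
connected component with :math:`n` atoms, where
:math:`0 = \lambda_1 < \lambda_2 \leq \dots \leq \lambda_n` are the eigenvalues
of :math:`L`, satisfies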

.. math::

   K_f = \sum_{u < v} R(u, v) = n \sum_{i=2}^{n} \frac{1}{\lambda_i(L)}

Properties of :math:`R(u,v)` and :math:`K_f`:

* **Symmetry**: :math:`R(u,v) = R(v,u)`
* **Positivity**: :math:`R(u,v) > 0` for :math:`u \neq v` in a connected graph
* **Triangle inequality**: :math:`R(i,k) \leq R(i,j) + R(j,k)`
* **Monotone with chain length**: for a linear homologous PFCA series,
  :math:`K_f` increases strictly with :math:`n` (fluorinated carbons)
* **Branching reduces** :math:`K_f`: branched isomers have lower :math:`K_f`
  than the linear chain of the same carbon count, because branching shortens
  the pairwise distances between atoms
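
The formulas above can be checked numerically on small graphs.  The sketch
below is not the PFASGroups code; the helper names are hypothetical, and unit
conductances are used so that the expected values are easy to verify by hand.

.. code-block:: python

   import numpy as np


   def weighted_laplacian(n_atoms, bonds):
       """Assemble L from (u, v, conductance) triples."""
       lap = np.zeros((n_atoms, n_atoms))
       for u, v, c in bonds:
           lap[u, u] += c
           lap[v, v] += c
           lap[u, v] -= c
           lap[v, u] -= c
       return lap


   def resistance_matrix(lap):
       """R(u, v) = L+_uu + L+_vv - 2 L+_uv for all pairs."""
       lap_pinv = np.linalg.pinv(lap)
       diag = np.diag(lap_pinv)
       return diag[:, None] + diag[None, :] - 2.0 * lap_pinv


   def kirchhoff_index(lap):
       """Sum of R(u, v) over all pairs u < v."""
       return np.triu(resistance_matrix(lap), k=1).sum()


   # Three-atom chain: R(0,1) = R(1,2) = 1 and R(0,2) = 2
   lap = weighted_laplacian(3, [(0, 1, 1.0), (1, 2, 1.0)])
   kf_pairs = kirchhoff_index(lap)

   # Spectral form: n times the sum of reciprocal non-zero eigenvalues
   eigvals = np.linalg.eigvalsh(lap)            # ascending, eigvals[0] is ~0
   kf_spectral = lap.shape[0] * np.sum(1.0 / eigvals[1:])

   # Branching: a four-atom star has a lower index than a four-atom chain
   chain4 = weighted_laplacian(4, [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)])
   star4 = weighted_laplacian(4, [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)])

Both routes give the same value for the chain, and the star scores lower than
the chain with the same number of atoms.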

Accessing resistance metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All resistance values are available per-component in the parse results:

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles('OC(=O)C(F)(F)C(F)(F)C(F)(F)F', halogens='F')
   comp = results[0]['matches'][0]['components'][0]

   # BDE-weighted Kirchhoff index
   print(comp['effective_graph_resistance'])

   # BDE-weighted resistance from functional group to structural landmarks
   print(comp['min_resistance_dist_to_barycentre'])
   print(comp['min_resistance_dist_to_centre'])
   print(comp['max_resistance_dist_to_periphery'])

To limit computation time on very large components:

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = ['OC(=O)C(F)(F)C(F)(F)C(F)(F)F']

   # Only compute resistance for components with < 200 atoms
   results = parse_smiles(smiles, limit_effective_graph_resistance=200)

   # Skip resistance entirely
   results = parse_smiles(smiles, limit_effective_graph_resistance=0)

Besides the resistance metrics, each component also reports the following
size metrics:

* **Atom count**: number of heavy atoms in the component.
* **Halogen count**: number of halogen atoms in the component.
* **Halogen fraction**: ratio of halogen atoms to total heavy atoms.

Embedding generation
----------------------

The primary embedding API is :meth:`~PFASGroups.PFASEmbeddingSet.to_array`,
called on a pre-parsed :class:`~PFASGroups.PFASEmbeddingSet` (avoids re-parsing):

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(smiles)
   arr = results.to_array()          # (n_mols, 114) binary, fluorine-only

:func:`~PFASGroups.generate_fingerprint` is a convenience wrapper that parses
and embeds in a single call, returning ``(array, info_dict)``:

.. code-block:: python

   from PFASGroups import generate_fingerprint
   fps, info = generate_fingerprint(smiles)   # (n_mols, 114), {'group_names': …}

Both functions share the same ``component_metrics`` / ``group_selection`` /
``halogens`` parameters.  The default mode is fluorine-only (``halogens='F'``),
producing **114 compiled columns** — one per group.  The column layout for
multi-halogen mode is:

``[group_0_F, group_0_Cl, group_0_Br, group_0_I,``
``  group_1_F, …,``
``  group_113_F, group_113_Cl, group_113_Br, group_113_I]``

Default F-only column count: 114 × 1 = **114**.
All-halogen column count: 114 × 4 = **456** (see :ref:`multi_halogen_fingerprint`).
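
The group-major layout above implies a simple index formula.  The helper
below is a hypothetical illustration (``column_index`` is not a PFASGroups
function) and assumes the halogen order F, Cl, Br, I shown in the layout.

.. code-block:: python

   HALOGENS = ("F", "Cl", "Br", "I")


   def column_index(group_idx, halogen, halogens=HALOGENS):
       """Column of (group, halogen) when halogens vary fastest."""
       return group_idx * len(halogens) + halogens.index(halogen)


   n_groups = 114
   n_columns = n_groups * len(HALOGENS)    # 114 * 4
   first_cl = column_index(0, "Cl")        # second column
   last = column_index(113, "I")           # final column, n_columns - 1

   # Fluorine-only mode reduces to one column per group
   f_only = column_index(113, "F", halogens=("F",))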

Four count encoding values are available as items in ``component_metrics``:

.. list-table::
   :header-rows: 1

   * - component_metrics value
     - Cell value
   * - ``'binary'`` (default)
     - 1 if any match exists, 0 otherwise
   * - ``'count'``
     - Number of matching components
   * - ``'max_component'``
     - Size of the largest matching component (atom count)
   * - ``'total_component'``
     - Sum of all matching component sizes (atom count)
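
The four encodings differ only in how the matched components of one group are
reduced to a single cell.  A minimal sketch, assuming the component sizes are
already known; ``encode_cell`` is a hypothetical helper, and returning 0 for
an unmatched group in every mode is an assumption.

.. code-block:: python

   def encode_cell(component_sizes, metric="binary"):
       """Reduce the atom counts of one group's components to a cell value."""
       if metric == "binary":
           return int(bool(component_sizes))
       if metric == "count":
           return len(component_sizes)
       if metric == "max_component":
           return max(component_sizes, default=0)
       if metric == "total_component":
           return sum(component_sizes)
       raise ValueError(f"unknown metric: {metric!r}")


   # A group matched by two components of 7 and 4 atoms
   sizes = [7, 4]
   cells = {
       m: encode_cell(sizes, m)
       for m in ("binary", "count", "max_component", "total_component")
   }
   # cells == {"binary": 1, "count": 2, "max_component": 7, "total_component": 11}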

PFAS definition classification
--------------------------------

When ``include_PFAS_definitions=True``, each molecule is additionally
evaluated against five regulatory PFAS definitions.  Each definition is
encoded as a set of logical rules operating on the group matches already
computed.  No additional SMARTS matching is performed.

See :doc:`pfas_definitions` for the rule logic of each definition.