Prioritization
==============

:func:`~PFASGroups.prioritise_molecules` ranks a set of molecules by
their structural novelty.  It is useful for selecting a structurally diverse
subset from a large chemical library, or for identifying which molecules in a
test set are most different from a reference set.

.. contents:: Contents
   :local:
   :depth: 2

Basic usage
-----------

.. code-block:: python

   from PFASGroups import prioritise_molecules

   molecules = [
       "CCCC(F)(F)F",
       "FC(F)(F)C(=O)O",
       "FCCC(F)(F)F",
       "ClCCCl",
       "BrCCBr",
   ]

   ranking = prioritise_molecules(molecules)

``ranking`` is a list of ``(smiles, score)`` tuples, sorted from highest to
lowest novelty score.

.. code-block:: python

   for smiles, score in ranking:
       print(f"{smiles:40s}  {score:.4f}")

Novelty against a reference set
---------------------------------

To measure novelty *relative to* a set of known molecules, supply the
``reference`` parameter:

.. code-block:: python

   known = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O"]
   new_candidates = ["FCCC(F)(F)F", "ClCCCl", "BrCCBr"]

   ranking = prioritise_molecules(new_candidates, reference=known)

Molecules most dissimilar from the reference set receive the highest scores.

Scoring formula
----------------

The score for molecule :math:`i` is computed from two components:

**Intra-set distance** :math:`d_\text{self}(i)`:

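The average Tanimoto distance from molecule :math:`i` to all other molecules
in the query set, where :math:`T_{ij}` is the Tanimoto similarity between the
fingerprints of molecules :math:`i` and :math:`j` and :math:`N` is the number
of molecules in the query set: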

.. math::

   d_\text{self}(i) = \frac{1}{N-1} \sum_{j \neq i} \left(1 - T_{ij}\right)
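
As an illustration of the formula above, the following minimal sketch computes
:math:`d_\text{self}` for toy fingerprints represented as Python sets of
made-up feature labels.  The set representation and the helpers ``tanimoto``
and ``intra_set_distance`` are assumptions made for this example; they are not
the fingerprints or internals used by PFASGroups.

.. code-block:: python

   def tanimoto(a, b):
       """Tanimoto (Jaccard) similarity of two feature sets."""
       union = len(a | b)
       # Assumption: two empty fingerprints are treated as identical.
       return len(a & b) / union if union else 1.0

   def intra_set_distance(fingerprints):
       """Mean Tanimoto distance from each fingerprint to all the others."""
       n = len(fingerprints)
       if n < 2:
           # Assumption: with no other molecule to compare to, use 0.
           return [0.0] * n
       distances = []
       for i, fp in enumerate(fingerprints):
           total = sum(
               1.0 - tanimoto(fp, other)
               for j, other in enumerate(fingerprints)
               if j != i
           )
           distances.append(total / (n - 1))
       return distances

   # Hypothetical feature sets, for illustration only.
   toy = [{"CF3", "CH2"}, {"CF3", "COOH"}, {"CCl"}]
   d_self = intra_set_distance(toy)

Here the first two fingerprints share one of three features (similarity 1/3),
so each has :math:`d_\text{self} = 5/6`, while the third shares none and
reaches the maximum of 1.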

**Reference distance** :math:`d_\text{ref}(i)` (when a reference set is given):

The average Tanimoto distance from molecule :math:`i` to :math:`R_p`, the top
``percentile`` percent of the reference set (ranked by distance to
molecule :math:`i`):

.. math::

   d_\text{ref}(i) = \frac{1}{\lvert R_p \rvert}
       \sum_{r \in R_p} \left(1 - T_{ir}\right)

**Combined score**:

.. math::

   \text{score}(i) = a \cdot d_\text{self}(i) + b \cdot d_\text{ref}(i)

where :math:`a` and :math:`b` are weight parameters (both default to 1.0).
When no reference set is given, only the intra-set term contributes and
:math:`b` is ignored.
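
The combined score can be sketched in a few lines of plain Python.  The
example below is only an illustration under stated assumptions: fingerprints
are modelled as sets of made-up feature labels, the helpers ``tanimoto``,
``mean_distance`` and ``combined_scores`` are hypothetical, and the reference
subset :math:`R_p` is passed in directly rather than derived from
``percentile``.  It does not reproduce the fingerprints or internals of
PFASGroups.

.. code-block:: python

   def tanimoto(a, b):
       """Tanimoto (Jaccard) similarity of two feature sets."""
       union = len(a | b)
       # Assumption: two empty fingerprints are treated as identical.
       return len(a & b) / union if union else 1.0

   def mean_distance(fp, others):
       """Mean Tanimoto distance from ``fp`` to each fingerprint in ``others``."""
       if not others:
           return 0.0
       return sum(1.0 - tanimoto(fp, o) for o in others) / len(others)

   def combined_scores(query, ref_subset=None, a=1.0, b=1.0):
       """score(i) = a * d_self(i) + b * d_ref(i) for every query fingerprint."""
       scores = []
       for i, fp in enumerate(query):
           d_self = mean_distance(fp, query[:i] + query[i + 1:])
           # Without a reference subset only the intra-set term contributes.
           d_ref = mean_distance(fp, ref_subset) if ref_subset else 0.0
           scores.append(a * d_self + b * d_ref)
       return scores

   # Hypothetical feature sets, for illustration only.
   query = [{"CF3", "CH2"}, {"CF3", "COOH"}, {"CCl"}]
   ref_subset = [{"CF3", "CH2"}]  # stands in for R_p

   scores = combined_scores(query, ref_subset)
   # Indices of the query fingerprints, highest score first.
   order = sorted(range(len(query)), key=lambda i: scores[i], reverse=True)

With these toy fingerprints the third entry shares no features with the other
query entries or with the reference, so it receives the highest score and
comes first in ``order``.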

Parameters
----------

.. code-block:: python

   prioritise_molecules(
       molecules,
       reference=None,    # optional list of reference SMILES
       a=1.0,             # weight for intra-set distance
       b=1.0,             # weight for reference distance
       percentile=90.0,   # percentile cut-off for reference set
       halogens='F',      # halogens to include in fingerprint
       saturation=None,   # 'saturated', 'unsaturated', or None
       component_metrics=['max_component'],  # fingerprint component metrics
       return_scores=True,                   # if False, return sorted smiles only
       ascending=False,   # if True, lowest score first
   )

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Parameter
     - Description
   * - ``molecules``
     - List of SMILES strings to rank
   * - ``reference``
     - Optional reference SMILES list.  If ``None``, only the intra-set
       distance is used (``a`` still scales it; ``b`` is ignored).
   * - ``a``
     - Weight applied to the intra-set distance component (default 1.0)
   * - ``b``
     - Weight applied to the reference distance component (default 1.0).
       Has no effect when ``reference=None``.
   * - ``percentile``
     - Only molecules in the top ``percentile`` percent of the reference set
       (by distance to the query molecule) are used in :math:`d_\text{ref}`
       (default 90.0)
   * - ``halogens``
     - Halogen(s) used to build fingerprints for distance computation
       (default ``'F'``)
   * - ``saturation``
     - Saturation filter passed to :func:`~PFASGroups.generate_fingerprint`:
       ``'saturated'``, ``'unsaturated'``, or ``None`` (default)
   * - ``component_metrics``
     - List of metrics for fingerprints: ``['max_component']``, ``['count']``,
       ``['binary']``, or combined e.g. ``['binary', 'effective_graph_resistance']``
   * - ``return_scores``
     - If ``True`` (default), return ``[(smiles, score), …]``.  If ``False``,
       return ``[smiles, …]``.
   * - ``ascending``
     - Sort order.  Default ``False`` = highest novelty first.

Example: selecting a diverse subset
-------------------------------------

.. code-block:: python

   import pandas as pd
   from PFASGroups import prioritise_molecules

   df = pd.read_csv("pfas_candidates.csv")
   smiles = df["SMILES"].tolist()

   ranking = prioritise_molecules(smiles, halogens='F')

   # Take the top-50 most novel structures
   top50 = [s for s, _ in ranking[:50]]
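
To keep the other columns of the input table alongside the scores, the ranking
can be joined back onto the DataFrame.  The sketch below uses small placeholder
data in place of the CSV file and of the real output of
``prioritise_molecules``: the ``source`` column and the score values are made
up, and it assumes the returned SMILES strings match the input strings
exactly.

.. code-block:: python

   import pandas as pd

   # Placeholder input table; in practice this comes from pd.read_csv(...).
   df = pd.DataFrame(
       {
           "SMILES": ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "FCCC(F)(F)F"],
           "source": ["vendor_a", "vendor_b", "vendor_c"],  # hypothetical column
       }
   )

   # Placeholder ranking with made-up scores, already sorted highest first,
   # in the (smiles, score) shape described above.
   ranking = [
       ("FC(F)(F)C(=O)O", 0.8),
       ("CCCC(F)(F)F", 0.5),
       ("FCCC(F)(F)F", 0.4),
   ]

   scores = pd.DataFrame(ranking, columns=["SMILES", "novelty_score"])

   # Keep the top-N rows of the ranking and attach the original columns.
   # A left merge preserves the ranking order of the left-hand table.
   top_n = 2
   subset = scores.head(top_n).merge(df, on="SMILES", how="left")

If the input table contains duplicate SMILES, the merge returns one row per
duplicate, so deduplicating ``df`` first may be preferable.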