Prioritization
==============

:func:`~PFASGroups.prioritise_molecules` ranks a set of molecules by their
structural novelty. It is useful for selecting a structurally diverse subset
from a large chemical library, or for identifying which molecules in a test set
are most different from a reference set.

.. contents:: Contents
   :local:
   :depth: 2

Basic usage
-----------

.. code-block:: python

   from PFASGroups import prioritise_molecules

   molecules = [
       "CCCC(F)(F)F",
       "FC(F)(F)C(=O)O",
       "FCCC(F)(F)F",
       "ClCCCl",
       "BrCCBr",
   ]

   ranking = prioritise_molecules(molecules)

``ranking`` is a list of ``(smiles, score)`` tuples, sorted from highest to
lowest novelty score.

.. code-block:: python

   for smiles, score in ranking:
       print(f"{smiles:40s} {score:.4f}")

Novelty against a reference set
---------------------------------

To measure novelty *relative to* a set of known molecules supply the
``reference`` parameter:

.. code-block:: python

   known = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O"]
   new_candidates = ["FCCC(F)(F)F", "ClCCCl", "BrCCBr"]

   ranking = prioritise_molecules(new_candidates, reference=known)

Molecules most dissimilar from the reference set receive the highest scores.

Scoring formula
----------------

The score for molecule :math:`i` is computed from two components:

**Intra-set distance** :math:`d_\text{self}(i)`:
The average Tanimoto distance from molecule :math:`i` to all other molecules
in the query set:

.. math::

   d_\text{self}(i) = \frac{1}{N-1} \sum_{j \neq i} \left(1 - T_{ij}\right)

**Reference distance** :math:`d_\text{ref}(i)` (when a reference set is given):
The average Tanimoto distance from molecule :math:`i` to the top
``percentile`` percent of the reference set:

.. math::

   d_\text{ref}(i) = \frac{1}{\lvert R_p \rvert} \sum_{r \in R_p} \left(1 - T_{ir}\right)

**Combined score**:

.. math::

   \text{score}(i) = a \cdot d_\text{self}(i) + b \cdot d_\text{ref}(i)

where :math:`a` and :math:`b` are weight parameters (both default to 1.0).

Parameters
----------

.. code-block:: python

   prioritise_molecules(
       molecules,
       reference=None,                       # optional list of reference SMILES
       a=1.0,                                # weight for intra-set distance
       b=1.0,                                # weight for reference distance
       percentile=90.0,                      # percentile cut-off for reference set
       halogens='F',                         # halogens to include in fingerprint
       saturation=None,                      # 'saturated', 'unsaturated', or None
       component_metrics=['max_component'],  # fingerprint component metrics
       return_scores=True,                   # if False, return sorted smiles only
       ascending=False,                      # if True, lowest score first
   )

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Parameter
     - Description
   * - ``molecules``
     - List of SMILES strings to rank
   * - ``reference``
     - Optional reference SMILES list. If ``None`` only intra-set distance is
       used (``a`` controls weighting, ``b`` is ignored).
   * - ``a``
     - Weight applied to the intra-set distance component (default 1.0)
   * - ``b``
     - Weight applied to the reference distance component (default 1.0). Has
       no effect when ``reference=None``.
   * - ``percentile``
     - Only molecules in the top ``percentile`` percent of the reference set
       (by distance to the query molecule) are used in :math:`d_\text{ref}`
       (default 90.0)
   * - ``halogens``
     - Halogen(s) used to build fingerprints for distance computation
   * - ``saturation``
     - Saturation filter passed to :func:`~PFASGroups.generate_fingerprint`
   * - ``component_metrics``
     - List of metrics for fingerprints: ``['max_component']``, ``['count']``,
       ``['binary']``, or combined e.g.
       ``['binary', 'effective_graph_resistance']``
   * - ``return_scores``
     - If ``True`` (default), return ``[(smiles, score), …]``. If ``False``,
       return ``[smiles, …]``.
   * - ``ascending``
     - Sort order. Default ``False`` = highest novelty first.

Example: selecting a diverse subset
-------------------------------------

.. code-block:: python

   import pandas as pd
   from PFASGroups import prioritise_molecules

   df = pd.read_csv("pfas_candidates.csv")
   smiles = df["SMILES"].tolist()

   ranking = prioritise_molecules(smiles, halogens='F')

   # Take the top-50 most novel structures
   top50 = [s for s, _ in ranking[:50]]
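
Example: computing the score by hand
--------------------------------------

To make the scoring formula concrete, the sketch below recomputes
:math:`d_\text{self}`, :math:`d_\text{ref}` and the combined score with plain
NumPy on toy binary fingerprints. It is an illustration only, not the library's
implementation: the helper names ``tanimoto`` and ``novelty_scores`` are
hypothetical, the fingerprints are made up, and the percentile rule used here
(keep the reference molecules whose distance to the query molecule is at or
below the ``percentile``-th percentile) is one assumed reading of the
``percentile`` parameter. The actual selection in
:func:`~PFASGroups.prioritise_molecules` may differ.

.. code-block:: python

   import numpy as np

   def tanimoto(fp_a, fp_b):
       """Tanimoto similarity between two binary fingerprint vectors."""
       fp_a = np.asarray(fp_a, dtype=bool)
       fp_b = np.asarray(fp_b, dtype=bool)
       union = np.logical_or(fp_a, fp_b).sum()
       if union == 0:
           return 1.0  # convention used in this sketch: two empty fingerprints match
       return np.logical_and(fp_a, fp_b).sum() / union

   def novelty_scores(query_fps, ref_fps=None, a=1.0, b=1.0, percentile=90.0):
       """Hypothetical helper: score(i) = a * d_self(i) + b * d_ref(i)."""
       n = len(query_fps)
       scores = np.zeros(n)
       for i in range(n):
           # d_self: mean Tanimoto distance to the other query molecules
           if n > 1:
               d_self = sum(
                   1.0 - tanimoto(query_fps[i], query_fps[j])
                   for j in range(n) if j != i
               ) / (n - 1)
           else:
               d_self = 0.0
           scores[i] = a * d_self

           # d_ref: mean distance to the selected part of the reference set
           if ref_fps is not None and len(ref_fps) > 0:
               d = np.array([1.0 - tanimoto(query_fps[i], r) for r in ref_fps])
               cutoff = np.percentile(d, percentile)  # assumed selection rule
               scores[i] += b * d[d <= cutoff].mean()
       return scores

   # Toy fingerprints: the first two molecules are identical, the third is disjoint
   query = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
   reference = [[1, 1, 0, 0]]

   print(novelty_scores(query))             # intra-set distance only
   print(novelty_scores(query, reference))  # intra-set plus reference distance

In this toy set the third molecule shares no bits with the others or with the
reference, so it receives the highest score in both calls, which matches the
"highest novelty first" ordering described above.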