Prioritization

prioritise_molecules() ranks a set of molecules by their structural novelty. It is useful for selecting a structurally diverse subset from a large chemical library, or for identifying which molecules in a test set are most different from a reference set.

Basic usage 

from PFASGroups import prioritise_molecules

molecules = [
    "CCCC(F)(F)F",
    "FC(F)(F)C(=O)O",
    "FCCC(F)(F)F",
    "ClCCCl",
    "BrCCBr",
]

ranking = prioritise_molecules(molecules)

ranking is a list of (smiles, score) tuples, sorted from highest to lowest novelty score.

for smiles, score in ranking:
    print(f"{smiles:40s}  {score:.4f}")

Novelty against a reference set 

To measure novelty relative to a set of known molecules, pass them via the reference parameter:

known = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O"]
new_candidates = ["FCCC(F)(F)F", "ClCCCl", "BrCCBr"]

ranking = prioritise_molecules(new_candidates, reference=known)

Molecules most dissimilar from the reference set receive the highest scores.

Scoring formula 

The score for molecule \(i\) is computed from two components:

Intra-set distance \(d_\text{self}(i)\):

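The average Tanimoto distance from molecule \(i\) to all other molecules in the query set, where \(T_{ij}\) is the Tanimoto similarity between the fingerprints of molecules \(i\) and \(j\), and \(N\) is the number of molecules in the query set: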

\[d_\text{self}(i) = \frac{1}{N-1} \sum_{j \neq i} \left(1 - T_{ij}\right)\]

Reference distance \(d_\text{ref}(i)\) (when a reference set is given):

The average Tanimoto distance from molecule \(i\) to \(R_p\), the molecules in the top `percentile` percent of the reference set (ranked by distance to molecule \(i\)):

\[d_\text{ref}(i) = \frac{1}{\lvert R_p \rvert} \sum_{r \in R_p} \left(1 - T_{ir}\right)\]

Combined score:

\[\text{score}(i) = a \cdot d_\text{self}(i) + b \cdot d_\text{ref}(i)\]

where \(a\) and \(b\) are weight parameters (both default to 1.0). When no reference set is given, only the intra-set term \(a \cdot d_\text{self}(i)\) contributes.
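The formulas above can be sketched in plain Python to make the arithmetic concrete. This is a minimal illustration, not the library's implementation: fingerprints are stood in for by small integer sets, the helper names (`tanimoto`, `d_self`, `d_ref`, `score`) are hypothetical, and the selection of \(R_p\) assumes that "top" means the reference molecules with the largest distance to the query molecule.

```python
from math import ceil


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two feature sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def d_self(i: int, fps: list) -> float:
    """Mean Tanimoto distance from molecule i to every other query molecule."""
    others = [1.0 - tanimoto(fps[i], fp) for j, fp in enumerate(fps) if j != i]
    # With a single query molecule there is nothing to compare against.
    return sum(others) / len(others) if others else 0.0


def d_ref(fp: set, ref_fps: list, percentile: float = 90.0) -> float:
    """Mean distance to R_p. ASSUMPTION: R_p holds the farthest references."""
    dists = sorted((1.0 - tanimoto(fp, r) for r in ref_fps), reverse=True)
    keep = max(1, ceil(len(dists) * percentile / 100.0))
    return sum(dists[:keep]) / keep


def score(i, fps, ref_fps=None, a=1.0, b=1.0, percentile=90.0) -> float:
    """score(i) = a * d_self(i) + b * d_ref(i); the second term needs a reference."""
    total = a * d_self(i, fps)
    if ref_fps:
        total += b * d_ref(fps[i], ref_fps, percentile)
    return total


# Toy "fingerprints": sets of feature indices, not real molecular fingerprints.
query = [{1, 2, 3}, {1, 2, 4}, {7, 8}]
reference = [{1, 2, 3}, {1, 2}]

scores = [score(i, query, reference) for i in range(len(query))]
for fp, s in zip(query, scores):
    print(sorted(fp), round(s, 4))
```

The third toy fingerprint shares no features with any other set, so both of its distance terms are 1.0 and it receives the highest score, which matches the intent that the most dissimilar molecules rank first.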

Parameters 

prioritise_molecules(
    molecules,
    reference=None,    # optional list of reference SMILES
    a=1.0,             # weight for intra-set distance
    b=1.0,             # weight for reference distance
    percentile=90.0,   # percentile cut-off for reference set
    halogens='F',      # halogens to include in fingerprint
    saturation=None,   # 'saturated', 'unsaturated', or None
    component_metrics=['max_component'],  # fingerprint component metrics
    return_scores=True,                   # if False, return sorted smiles only
    ascending=False,   # if True, lowest score first
)

Parameter	Description
`molecules`	List of SMILES strings to rank
`reference`	Optional reference SMILES list. If `None`, only intra-set distance is used (`a` controls weighting, `b` is ignored).
`a`	Weight applied to the intra-set distance component (default 1.0)
`b`	Weight applied to the reference distance component (default 1.0). Has no effect when `reference=None`.
`percentile`	Only molecules in the top `percentile` percent of the reference set (by distance to the query molecule) are used in \(d_\text{ref}\) (default 90.0)
`halogens`	Halogen(s) used to build fingerprints for distance computation
`saturation`	Saturation filter passed to `generate_fingerprint()`
`component_metrics`	List of metrics for fingerprints: `['max_component']`, `['count']`, `['binary']`, or combined e.g. `['binary', 'effective_graph_resistance']`
`return_scores`	If `True` (default), return `[(smiles, score), …]`. If `False`, return `[smiles, …]`.
`ascending`	Sort order. Default `False` = highest novelty first.
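The effect of `return_scores` and `ascending` on the returned value can be mimicked with a few lines of Python. The helper `format_ranking` below is hypothetical and only mirrors the behaviour described in the table; the scores are placeholder values, not output from the library.

```python
def format_ranking(scored, return_scores=True, ascending=False):
    """Hypothetical helper mirroring the documented return_scores / ascending options."""
    # ascending=False puts the highest score first.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=not ascending)
    if return_scores:
        return ordered
    return [smiles for smiles, _ in ordered]


# Placeholder scores for illustration only.
scored = [("ClCCCl", 0.4), ("FCCC(F)(F)F", 0.9), ("BrCCBr", 0.6)]

print(format_ranking(scored))                       # (smiles, score) tuples, highest first
print(format_ranking(scored, return_scores=False))  # SMILES strings only
print(format_ranking(scored, ascending=True))       # lowest score first
```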

Example: selecting a diverse subset 

import pandas as pd
from PFASGroups import prioritise_molecules

df = pd.read_csv("pfas_candidates.csv")
smiles = df["SMILES"].tolist()

ranking = prioritise_molecules(smiles, halogens='F')

# Take the top-50 most novel structures
top50 = [s for s, _ in ranking[:50]]
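If the selected subset should keep the other columns of the input table, the scores can be joined back onto the original rows. The sketch below uses only the standard library and stands in for the real data: the inline CSV, its `name` column, and the scores in `ranking` are made up for illustration, and `ranking` simply follows the `[(smiles, score), …]` shape described above.

```python
import csv
import io

# Stand-in for pfas_candidates.csv; the "name" column is hypothetical.
csv_text = """name,SMILES
A,CCCC(F)(F)F
B,FC(F)(F)C(=O)O
C,FCCC(F)(F)F
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Placeholder ranking in the documented shape; these scores are invented.
ranking = [("FCCC(F)(F)F", 0.8), ("FC(F)(F)C(=O)O", 0.7), ("CCCC(F)(F)F", 0.5)]

top_n = 2
# Map SMILES -> score for the top_n entries (duplicate SMILES would collapse here).
score_by_smiles = dict(ranking[:top_n])

# Keep only the rows whose SMILES made the cut and attach the novelty score.
selected = [
    {**row, "novelty": score_by_smiles[row["SMILES"]]}
    for row in rows
    if row["SMILES"] in score_by_smiles
]
selected.sort(key=lambda row: row["novelty"], reverse=True)

for row in selected:
    print(row["name"], row["SMILES"], row["novelty"])
```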
