Prioritization

prioritise_molecules() ranks a set of molecules by their structural novelty. It is useful for selecting a structurally diverse subset from a large chemical library, or for identifying which molecules in a test set are most different from a reference set.

Basic usage

from PFASGroups import prioritise_molecules

molecules = [
    "CCCC(F)(F)F",
    "FC(F)(F)C(=O)O",
    "FCCC(F)(F)F",
    "ClCCCl",
    "BrCCBr",
]

ranking = prioritise_molecules(molecules)

ranking is a list of (smiles, score) tuples, sorted from highest to lowest novelty score.

for smiles, score in ranking:
    print(f"{smiles:40s}  {score:.4f}")

Novelty against a reference set

To measure novelty relative to a set of known molecules, supply the reference parameter:

known = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O"]
new_candidates = ["FCCC(F)(F)F", "ClCCCl", "BrCCBr"]

ranking = prioritise_molecules(new_candidates, reference=known)

Molecules most dissimilar from the reference set receive the highest scores.

Scoring formula

The score for molecule \(i\) is computed from two components:

Intra-set distance \(d_\text{self}(i)\):

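The average Tanimoto distance from molecule \(i\) to all other molecules in the query set, where \(N\) is the number of query molecules and \(T_{ij}\) is the Tanimoto similarity between the fingerprints of molecules \(i\) and \(j\):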

\[d_\text{self}(i) = \frac{1}{N-1} \sum_{j \neq i} \left(1 - T_{ij}\right)\]
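The intra-set distance can be sketched in plain Python. This is only an illustration, not the PFASGroups implementation: it assumes binary fingerprints stored as sets of on-bit indices, and the helper names tanimoto and intra_set_distance are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    # Two empty fingerprints are treated as identical (an assumption of this sketch).
    return len(fp_a & fp_b) / union if union else 1.0


def intra_set_distance(fingerprints):
    """Mean Tanimoto distance (1 - T_ij) from each fingerprint to all others."""
    n = len(fingerprints)
    if n < 2:
        # With a single molecule there is nothing to compare against.
        return [0.0] * n
    distances = []
    for i, fp_i in enumerate(fingerprints):
        total = sum(
            1.0 - tanimoto(fp_i, fp_j)
            for j, fp_j in enumerate(fingerprints)
            if j != i
        )
        distances.append(total / (n - 1))
    return distances


# Toy fingerprints: the first two share two bits, the third shares none.
toy_fingerprints = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
toy_d_self = intra_set_distance(toy_fingerprints)
print(toy_d_self)
```

In this toy set, the third fingerprint has no bits in common with the others, so it receives the largest intra-set distance.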

Reference distance \(d_\text{ref}(i)\) (when a reference set is given):

The average Tanimoto distance from molecule \(i\) to \(R_p\), the top percentile percent of the reference set ranked by distance to molecule \(i\):

\[d_\text{ref}(i) = \frac{1}{\lvert R_p \rvert} \sum_{r \in R_p} \left(1 - T_{ir}\right)\]
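A minimal sketch of the reference distance follows. It takes precomputed Tanimoto distances from one query molecule to every reference molecule. Two points are assumptions of this sketch rather than documented behaviour: "top" is read as the largest distances, and the subset size is rounded up. The function name reference_distance is hypothetical.

```python
import math


def reference_distance(distances, percentile=90.0):
    """Mean of the largest `percentile` percent of the given distances.

    Assumption: R_p holds the reference molecules farthest from the query,
    and its size is ceil(len(distances) * percentile / 100).
    """
    if not distances:
        return 0.0
    subset_size = max(1, math.ceil(len(distances) * percentile / 100.0))
    selected = sorted(distances, reverse=True)[:subset_size]
    return sum(selected) / len(selected)


# Distances from one query molecule to four reference molecules (made-up values).
toy_distances = [0.25, 0.5, 0.75, 1.0]
half = reference_distance(toy_distances, percentile=50.0)  # uses 2 of 4 distances
default = reference_distance(toy_distances)                # uses all 4 at 90 percent
print(half, default)
```

Under this reading, lowering percentile restricts the average to fewer and more distant reference molecules, which raises the reference distance.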

Combined score:

\[\text{score}(i) = a \cdot d_\text{self}(i) + b \cdot d_\text{ref}(i)\]

where \(a\) and \(b\) are weight parameters (both default to 1.0).
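The effect of the weights can be illustrated with a short sketch. The distance values below are invented for illustration, and combined_score is a hypothetical helper, not part of PFASGroups. When no reference distance is available, only the intra-set term contributes, matching the documented behaviour for reference=None.

```python
def combined_score(d_self, d_ref=None, a=1.0, b=1.0):
    """Weighted sum a * d_self + b * d_ref; b is ignored when d_ref is None."""
    if d_ref is None:
        return a * d_self
    return a * d_self + b * d_ref


# Made-up component distances for three molecules labelled A, B and C.
d_self_values = {"A": 0.75, "B": 0.75, "C": 1.0}
d_ref_values = {"A": 0.5, "B": 0.25, "C": 1.0}

# Doubling b makes distance to the reference set count twice as much.
scores = {
    name: combined_score(d_self_values[name], d_ref_values[name], a=1.0, b=2.0)
    for name in d_self_values
}
weighted_ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(weighted_ranking)
```

A and B have the same intra-set distance, so with these values their order is decided entirely by the reference term.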

Parameters

prioritise_molecules(
    molecules,
    reference=None,    # optional list of reference SMILES
    a=1.0,             # weight for intra-set distance
    b=1.0,             # weight for reference distance
    percentile=90.0,   # percentile cut-off for reference set
    halogens='F',      # halogens to include in fingerprint
    saturation=None,   # 'saturated', 'unsaturated', or None
    component_metrics=['max_component'],  # fingerprint component metrics
    return_scores=True,                   # if False, return sorted smiles only
    ascending=False,   # if True, lowest score first
)

| Parameter | Description |
| --- | --- |
| molecules | List of SMILES strings to rank |
| reference | Optional reference SMILES list. If None, only intra-set distance is used (a controls weighting, b is ignored). |
| a | Weight applied to the intra-set distance component (default 1.0) |
| b | Weight applied to the reference distance component (default 1.0). Has no effect when reference=None. |
| percentile | Only the top percentile percent of the reference set (ranked by distance to the query molecule) is used in \(d_\text{ref}\) (default 90.0) |
| halogens | Halogen(s) used to build fingerprints for distance computation |
| saturation | Saturation filter passed to generate_fingerprint() |
| component_metrics | List of metrics for fingerprints: ['max_component'], ['count'], ['binary'], or a combination, e.g. ['binary', 'effective_graph_resistance'] |
| return_scores | If True (default), return [(smiles, score), …]. If False, return [smiles, …]. |
| ascending | Sort order. Default False = highest novelty first. |
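The two output options, return_scores and ascending, can be mimicked on an already scored list. The sketch below is a stand-alone illustration with invented scores. format_ranking is a hypothetical helper that reproduces the documented output shapes and does not call PFASGroups.

```python
def format_ranking(scored, return_scores=True, ascending=False):
    """Sort (smiles, score) pairs and optionally drop the scores."""
    # reverse=True puts the highest score first, the documented default order.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=not ascending)
    if return_scores:
        return ordered
    return [smiles for smiles, _ in ordered]


# Invented scores, for illustration only.
scored = [("ClCCCl", 0.9), ("FCCC(F)(F)F", 0.4), ("BrCCBr", 0.7)]

with_scores = format_ranking(scored)
smiles_only_ascending = format_ranking(scored, return_scores=False, ascending=True)
print(with_scores)
print(smiles_only_ascending)
```

With return_scores=False the result is a plain list of SMILES strings, so it can be sliced directly without unpacking tuples.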

Example: selecting a diverse subset

import pandas as pd
from PFASGroups import prioritise_molecules

df = pd.read_csv("pfas_candidates.csv")
smiles = df["SMILES"].tolist()

ranking = prioritise_molecules(smiles, halogens='F')

# Take the top-50 most novel structures
top50 = [s for s, _ in ranking[:50]]