Prioritization
prioritise_molecules() ranks a set of molecules by
their structural novelty. It is useful for selecting a structurally diverse
subset from a large chemical library, or for identifying which molecules in a
test set are most different from a reference set.
Basic usage
from PFASGroups import prioritise_molecules
molecules = [
    "CCCC(F)(F)F",
    "FC(F)(F)C(=O)O",
    "FCCC(F)(F)F",
    "ClCCCl",
    "BrCCBr",
]
ranking = prioritise_molecules(molecules)
ranking is a list of (smiles, score) tuples, sorted from highest to
lowest novelty score.
for smiles, score in ranking:
    print(f"{smiles:40s} {score:.4f}")
Novelty against a reference set
To measure novelty relative to a set of known molecules, supply the
reference parameter:
known = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O"]
new_candidates = ["FCCC(F)(F)F", "ClCCCl", "BrCCBr"]
ranking = prioritise_molecules(new_candidates, reference=known)
Molecules most dissimilar from the reference set receive the highest scores.
Scoring formula
The score for molecule \(i\) is computed from two components:
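Intra-set distance \(d_\text{self}(i)\):
The average Tanimoto distance from molecule \(i\) to all other molecules in the query set:
\[ d_\text{self}(i) = \frac{1}{N - 1} \sum_{j \neq i} \bigl(1 - T(i, j)\bigr) \]
Here \(N\) is the number of molecules in the query set and \(T(i, j)\) is the Tanimoto similarity between the fingerprints of molecules \(i\) and \(j\), so \(1 - T(i, j)\) is their Tanimoto distance.
Reference distance \(d_\text{ref}(i)\) (when a reference set is given):
The average Tanimoto distance from molecule \(i\) to the top
percentile percent of the reference set:
\[ d_\text{ref}(i) = \frac{1}{|R_p|} \sum_{r \in R_p} \bigl(1 - T(i, r)\bigr) \]
Here \(R_p\) is the subset of the reference set selected by the percentile cut-off.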
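Combined score:
\[ \text{score}(i) = a \, d_\text{self}(i) + b \, d_\text{ref}(i) \]
where \(a\) and \(b\) are weight parameters (both default to 1.0). When no reference set is given, the \(d_\text{ref}\) term does not contribute.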
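The two components and their weighted combination can be sketched in plain Python. This is a minimal illustration, not the library's implementation: fingerprints are modelled as sets of integer bit indices (an assumption), the helper names tanimoto_distance, mean_distance and novelty_scores are hypothetical, the score is assumed to be a weighted sum, and the percentile selection of the reference set is left out, so every reference fingerprint contributes.

```python
from typing import FrozenSet, List, Optional, Sequence

# Toy fingerprint: the set of "on" bit indices (an assumption for this sketch).
Fingerprint = FrozenSet[int]


def tanimoto_distance(x: Fingerprint, y: Fingerprint) -> float:
    """Return 1 - |x & y| / |x | y|; two empty fingerprints count as identical."""
    union = len(x | y)
    if union == 0:
        return 0.0
    return 1.0 - len(x & y) / union


def mean_distance(fp: Fingerprint, others: Sequence[Fingerprint]) -> float:
    """Average Tanimoto distance from fp to others (0.0 if others is empty)."""
    if not others:
        return 0.0
    return sum(tanimoto_distance(fp, o) for o in others) / len(others)


def novelty_scores(
    query: Sequence[Fingerprint],
    reference: Optional[Sequence[Fingerprint]] = None,
    a: float = 1.0,
    b: float = 1.0,
) -> List[float]:
    """Weighted sum of intra-set and reference distances (assumed form)."""
    scores = []
    for i, fp in enumerate(query):
        # Intra-set term: average distance to every other query molecule.
        others = [q for j, q in enumerate(query) if j != i]
        score = a * mean_distance(fp, others)
        if reference:
            # Reference term: all reference fingerprints are used here;
            # the percentile cut-off is not modelled in this sketch.
            score += b * mean_distance(fp, reference)
        scores.append(score)
    return scores


query = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({7, 8})]
reference = [frozenset({1, 2, 3})]

self_only = novelty_scores(query)
with_ref = novelty_scores(query, reference)
```

In this toy data the third fingerprint shares no bits with the others or with the reference, so it receives the highest score in both runs. The first fingerprint is identical to the reference entry, so adding the reference term leaves its score unchanged.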
Parameters
prioritise_molecules(
    molecules,
    reference=None,                       # optional list of reference SMILES
    a=1.0,                                # weight for intra-set distance
    b=1.0,                                # weight for reference distance
    percentile=90.0,                      # percentile cut-off for reference set
    halogens='F',                         # halogens to include in fingerprint
    saturation=None,                      # 'saturated', 'unsaturated', or None
    component_metrics=['max_component'],  # fingerprint component metrics
    return_scores=True,                   # if False, return sorted smiles only
    ascending=False,                      # if True, lowest score first
)
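| Parameter | Description |
|---|---|
| `molecules` | List of SMILES strings to rank |
| `reference` | Optional reference SMILES list. If `None` (the default), no reference distance is computed |
| `a` | Weight applied to the intra-set distance component (default 1.0) |
| `b` | Weight applied to the reference distance component (default 1.0). Has no effect when `reference` is `None` |
| `percentile` | Only molecules in the top `percentile` percent of the reference set contribute to the reference distance (default 90.0) |
| `halogens` | Halogen(s) used to build fingerprints for distance computation (default `'F'`) |
| `saturation` | Saturation filter: `'saturated'`, `'unsaturated'`, or `None` (default) |
| `component_metrics` | List of fingerprint component metrics (default `['max_component']`) |
| `return_scores` | If `False`, return the sorted SMILES only (default `True`) |
| `ascending` | Sort order. Default `False` (highest score first); if `True`, lowest score first |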
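The effect of return_scores and ascending on the shape of the result can be illustrated with a small stand-alone sketch. The helper shape_ranking is hypothetical and only mimics the documented behaviour on precomputed scores; the scores are placeholders chosen for illustration, not output from prioritise_molecules().

```python
from typing import List, Sequence, Tuple, Union

Scored = Tuple[str, float]


def shape_ranking(
    scored: Sequence[Scored],
    return_scores: bool = True,
    ascending: bool = False,
) -> Union[List[Scored], List[str]]:
    """Sort (smiles, score) pairs and optionally drop the scores."""
    # ascending=False puts the highest score first, matching the default.
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=not ascending)
    if return_scores:
        return ordered
    return [smiles for smiles, _ in ordered]


# Placeholder scores for illustration only.
scored = [("ClCCCl", 0.2), ("FCCC(F)(F)F", 0.9), ("BrCCBr", 0.5)]

default_view = shape_ranking(scored)                      # (smiles, score), highest first
smiles_only = shape_ranking(scored, return_scores=False)  # SMILES strings only
lowest_first = shape_ranking(scored, ascending=True)      # (smiles, score), lowest first
```

Code that unpacks the result as (smiles, score) pairs only works with return_scores=True, so the two options are worth setting explicitly when the ranking is passed on to other functions.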
Example: selecting a diverse subset
import pandas as pd
from PFASGroups import prioritise_molecules
df = pd.read_csv("pfas_candidates.csv")
smiles = df["SMILES"].tolist()
ranking = prioritise_molecules(smiles, halogens='F')
# Take the top-50 most novel structures
top50 = [s for s, _ in ranking[:50]]
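After selecting a subset it is often useful to carry the scores back to the original records. The sketch below does this with a lookup keyed by SMILES, using plain dictionaries in place of a DataFrame so that it runs without extra dependencies. The rows and scores are placeholders, and it assumes that the SMILES strings in the ranking are returned unchanged from the input; if they are canonicalised, the lookup keys would need the same treatment.

```python
# Placeholder records standing in for rows of the candidate table.
rows = [
    {"name": "cand-1", "SMILES": "FCCC(F)(F)F"},
    {"name": "cand-2", "SMILES": "ClCCCl"},
    {"name": "cand-3", "SMILES": "BrCCBr"},
]

# Placeholder ranking in the (smiles, score) form, highest score first.
ranking = [("FCCC(F)(F)F", 0.9), ("BrCCBr", 0.5)]

# Build a SMILES -> score lookup; duplicate SMILES keep the last score seen.
score_by_smiles = dict(ranking)

# Attach the score to each record; records without a score get None.
annotated = [
    {**row, "novelty": score_by_smiles.get(row["SMILES"])} for row in rows
]

# Names of the scored records, most novel first.
scored_rows = [row for row in annotated if row["novelty"] is not None]
top_names = [
    row["name"]
    for row in sorted(scored_rows, key=lambda row: row["novelty"], reverse=True)
]
```

With the DataFrame from the example above, the same lookup can be applied with df["SMILES"].map(dict(ranking)), which leaves missing values for rows that are not in the ranking.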