Benchmarking
from PFASGroups import parse_smiles, generate_fingerprint
# Parse 10 k molecules
smiles = [...] # your list
results = parse_smiles(smiles)
fps, info = generate_fingerprint(smiles) # shape (n, 116)
This page describes the validation studies performed to assess the accuracy and completeness of the PFASGroups library.
Validation against PFASSTRUCTv5
Dataset: PFASSTRUCTv5 is a curated, annotated database of PFAS structures containing ~14 000 entries (Schymanski et al. 2023).
Method: Each structure was parsed with parse_smiles() using the
default fluorine-only mode and OECD 2021 group definitions. The binary
PFAS/non-PFAS label from PFASSTRUCTv5 was used as the ground truth.
Results:
| Metric | Value | Notes |
|---|---|---|
| Sensitivity (recall) | > 0.97 | Fraction of known PFAS correctly detected |
| Specificity | > 0.90 | Fraction of non-PFAS correctly excluded |
| F1 score | > 0.96 | |
Failures are mostly due to:
- Highly functionalised PFAS with unusual connectivity not covered by any current SMARTS pattern
- Partially fluorinated structures at the boundary of the definition
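The three metrics reported above follow directly from confusion-matrix counts. The sketch below is a minimal, library-independent illustration of that arithmetic; the function name `binary_metrics` and the toy labels are hypothetical and are not part of PFASGroups or its benchmark scripts.

```python
def binary_metrics(y_true, y_pred):
    """Return sensitivity, specificity and F1 for boolean labels.

    y_true: ground-truth labels (True = PFAS)
    y_pred: predicted labels (True = classified as PFAS)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # PFAS, detected
    tn = sum(1 for t, p in pairs if not t and not p)  # non-PFAS, excluded
    fp = sum(1 for t, p in pairs if not t and p)      # non-PFAS, flagged
    fn = sum(1 for t, p in pairs if t and not p)      # PFAS, missed

    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan,
    }


# Toy data for illustration only (not benchmark results).
truth = [True, True, True, True, False, False]
predicted = [True, True, True, False, False, True]
print(binary_metrics(truth, predicted))
```

In a real run, `truth` would hold the PFASSTRUCTv5 labels and `predicted` a boolean derived from the parser output; how that boolean is derived depends on the result format and is left open here.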
Comparison with CSRML classifier
The CSRML (Chemical Structure Rule Markup Language) classifier from the OECD toolbox was used as an external reference.
Key findings:
- PFASGroups matches or exceeds CSRML accuracy on structures with at least one perfluoroalkyl chain.
- PFASGroups additionally detects fluorotelomer groups not covered by the CSRML rule set.
- For borderline polyfluoroalkyl structures, sensitivity is comparable (both ~0.85).
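A per-category comparison like the one summarised above can be sketched as follows. This is an illustrative outline only: the record layout, the function name `sensitivity_by_category`, and the toy data are assumptions, and the actual comparison in the benchmark scripts may be organised differently.

```python
from collections import defaultdict


def sensitivity_by_category(records):
    """Compare the sensitivity of two classifiers per structure category.

    records: iterable of (category, is_pfas, pred_a, pred_b) tuples.
    Returns {category: (sensitivity_a, sensitivity_b)}, computed over
    the true PFAS in each category only.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [n_pfas, hits_a, hits_b]
    for category, is_pfas, pred_a, pred_b in records:
        if not is_pfas:
            continue  # sensitivity ignores true negatives
        entry = counts[category]
        entry[0] += 1
        entry[1] += bool(pred_a)
        entry[2] += bool(pred_b)
    return {cat: (a / n, b / n) for cat, (n, a, b) in counts.items()}


# Toy records for illustration only (not benchmark results).
records = [
    ("perfluoroalkyl", True, True, True),
    ("perfluoroalkyl", True, True, False),
    ("polyfluoroalkyl", True, True, True),
    ("polyfluoroalkyl", True, False, False),
    ("polyfluoroalkyl", False, False, True),
]
for category, (sens_a, sens_b) in sensitivity_by_category(records).items():
    print(f"{category}: A={sens_a:.2f}  B={sens_b:.2f}")
```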
Running the benchmark scripts
Benchmark scripts are available in the benchmark/ directory of the source
repository:
cd benchmark
# Reproduce PFASSTRUCTv5 validation
python benchmark_pfasstructv5.py
# Compare with CSRML classifier
python benchmark_csrml.py
# Accuracy report across all group categories
python accuracy_report.py
Expected outputs are in benchmark/results/.
Performance
Typical throughput on a modern laptop (single core):
| Task | Molecules | Time |
|---|---|---|
| | 10 000 | ~5 s |
| | 10 000 | ~8 s |
| | 10 000 | ~12 s |
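Timings like these depend heavily on hardware and on the molecules used, so it can be useful to measure throughput locally. The helper below is a minimal sketch using only the standard library; `measure_throughput` and the stand-in workload `count_fluorines` are hypothetical names introduced for illustration and are not part of PFASGroups.

```python
import time


def measure_throughput(func, items, repeats=3):
    """Call func(items) `repeats` times; return (best_seconds, items_per_second)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(items)
        best = min(best, time.perf_counter() - start)
    rate = len(items) / best if best > 0 else float("inf")
    return best, rate


# Stand-in workload so the sketch runs without PFASGroups installed.
def count_fluorines(smiles_list):
    return [s.count("F") for s in smiles_list]


sample = ["FC(F)(F)C(F)(F)C(=O)O"] * 10_000
seconds, rate = measure_throughput(count_fluorines, sample)
print(f"{len(sample)} molecules in {seconds:.4f} s ({rate:.0f} molecules/s)")
```

To time the library itself, pass `parse_smiles` in place of the stand-in function. Taking the best of several repeats reduces the influence of one-off effects such as import and cache warm-up.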
For large datasets (>100 k molecules), consider passing
compute_component_metrics=False to parse_smiles() to skip the effective
graph resistance computation:
results = parse_smiles(large_smiles_list, compute_component_metrics=False)