Benchmarking

from PFASGroups import parse_smiles, generate_fingerprint

# Parse 10 k molecules
smiles = [...]   # your list
results = parse_smiles(smiles)
fps, info = generate_fingerprint(smiles)   # shape (n, 116)

This page describes the validation studies performed to assess the accuracy and completeness of the PFASGroups library.

Validation against PFASSTRUCTv5 

Dataset: PFASSTRUCTv5 is a curated, annotated database of PFAS structures containing ~14 000 entries (Schymanski et al. 2023).

Method: Each structure was parsed with parse_smiles() using the default fluorine-only mode and OECD 2021 group definitions. The binary PFAS/non-PFAS label from PFASSTRUCTv5 was used as the ground truth.

Results:

Metric	Value	Notes
Sensitivity (recall)	> 0.97	Fraction of known PFAS correctly detected
Specificity	> 0.90	Fraction of non-PFAS correctly excluded
F1 score	> 0.96	Harmonic mean of precision and recall
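The metrics in the table follow the standard confusion-matrix definitions. The sketch below shows how they can be computed from a list of ground-truth labels and a list of predicted labels. It is an illustration only, not the code used in the benchmark scripts; the function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and F1 from binary labels.

    Hypothetical helper for illustration; True means "is a PFAS".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # PFAS detected
    fn = sum(1 for t, p in pairs if t and not p)      # PFAS missed
    tn = sum(1 for t, p in pairs if not t and not p)  # non-PFAS excluded
    fp = sum(1 for t, p in pairs if not t and p)      # non-PFAS flagged

    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else nan,
    }


# Toy labels, not taken from PFASSTRUCTv5
y_true = [True, True, True, False, False]
y_pred = [True, True, False, False, True]
metrics = classification_metrics(y_true, y_pred)
```

In practice, `y_pred` would be derived from the parser output for each structure; how a detection is read from that output depends on the library's result format, which is not described on this page.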

Failures are mostly due to:

Highly functionalised PFAS with unusual connectivity not covered by any current SMARTS pattern
Partially fluorinated structures at the boundary of the definition

Comparison with CSRML classifier 

The CSRML (Chemical Structure Rule Markup Language) classifier from the OECD toolbox was used as an external reference.

Key findings:

PFASGroups matches or exceeds CSRML accuracy on structures with at least one perfluoroalkyl chain.
PFASGroups additionally detects fluorotelomer groups not covered by the CSRML rule set.
For borderline polyfluoroalkyl structures, the sensitivity of the two tools is comparable (both ~0.85).
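A head-to-head comparison of this kind can be summarised by counting, for each structure, whether each classifier agreed with the ground truth. The following is a minimal sketch under that assumption; `compare_classifiers` and the toy labels are hypothetical and do not reproduce `benchmark_csrml.py`.

```python
from collections import Counter


def compare_classifiers(truth, pred_a, pred_b):
    """Tabulate joint correctness of two binary classifiers.

    Hypothetical helper: returns a Counter keyed by
    (classifier A outcome, classifier B outcome).
    """
    counts = Counter()
    for t, a, b in zip(truth, pred_a, pred_b):
        key = (
            "a_correct" if a == t else "a_wrong",
            "b_correct" if b == t else "b_wrong",
        )
        counts[key] += 1
    return counts


# Toy labels (1 = PFAS, 0 = non-PFAS), not real benchmark data
truth = [1, 1, 1, 0]
pred_a = [1, 1, 0, 0]
pred_b = [1, 0, 0, 0]
table = compare_classifiers(truth, pred_a, pred_b)
```

The cells where only one classifier is correct are the most informative ones, since they list the structures on which the two rule sets actually differ.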

Running the benchmark scripts 

Benchmark scripts are available in the benchmark/ directory of the source repository:

cd benchmark
# Reproduce PFASSTRUCTv5 validation
python benchmark_pfasstructv5.py

# Compare with CSRML classifier
python benchmark_csrml.py

# Accuracy report across all group categories
python accuracy_report.py

Expected outputs are in benchmark/results/.

Performance 

Typical throughput on a modern laptop (single core):

Task	Molecules	Time
`parse_smiles` with defaults	10 000	~5 s
`generate_fingerprint`	10 000	~8 s
`parse_smiles` with `compute_component_metrics=True`	10 000	~12 s
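Throughput figures like these depend on hardware and input composition, so it can be useful to measure them locally. The sketch below times an arbitrary callable with `time.perf_counter` and reports the best of several runs. The helper `measure_throughput` and the stand-in `dummy_parser` are hypothetical; replace the stand-in with the real call you want to time.

```python
import time


def measure_throughput(func, items, repeats=3):
    """Return (best elapsed seconds, items per second) over several runs.

    Hypothetical helper: `func` is any callable that takes the full list.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(items)
        best = min(best, time.perf_counter() - start)
    rate = len(items) / best if best > 0 else float("inf")
    return best, rate


def dummy_parser(smiles_list):
    """Stand-in workload so the example runs without the library installed."""
    return [len(s) for s in smiles_list]


elapsed, rate = measure_throughput(dummy_parser, ["C(F)(F)F"] * 1000)
print(f"{elapsed:.4f} s, {rate:.0f} molecules/s")
```

Taking the minimum over a few repeats reduces noise from other processes running on the machine.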

For large datasets (>100 k molecules), consider passing compute_component_metrics=False to skip the effective graph resistance computation:

results = parse_smiles(large_smiles_list, compute_component_metrics=False)
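For very large inputs it may also help to process the list in chunks, so that results can be written out incrementally instead of being held in memory all at once. This is a generic sketch, not a feature of the library: `iter_chunks` and `map_in_chunks` are hypothetical helpers, and the sketch assumes the processing function accepts a list and returns one result per input.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items` with at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def map_in_chunks(func, items, chunk_size=10_000):
    """Apply `func` to each chunk and yield the per-chunk results lazily."""
    for chunk in iter_chunks(items, chunk_size):
        yield func(chunk)


# Stand-in for the real parsing call (assumption: list in, list out)
def fake_parse(smiles_chunk):
    return [s.upper() for s in smiles_chunk]


collected = []
for chunk_result in map_in_chunks(fake_parse, ["a", "b", "c", "d", "e"], chunk_size=2):
    collected.extend(chunk_result)  # in practice, write each chunk to disk here
```

With the real library, the stand-in could be replaced by a small wrapper that calls `parse_smiles` with `compute_component_metrics=False` on each chunk.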
