Quickstart
Five-minute overview of the most common PFASGroups workflows.
from PFASGroups import parse_smiles
results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
arr, cols = results.to_array(), results.column_names()
print(arr.shape) # (2, 114) — 114 groups compiled by default for fluorine-only
Parsing SMILES
from PFASGroups import parse_smiles
smiles = [
    "CCCC(F)(F)F",      # 1,1,1-trifluorobutane (terminal CF3 group)
    "FC(F)(F)C(=O)O",   # trifluoroacetic acid (TFA)
    "OCCOCCO",          # diethylene glycol: no halogen, so no groups match
]
results = parse_smiles(smiles)
results is a PFASEmbeddingSet: a list-like container of PFASEmbedding objects (each a dict subclass), one per input SMILES.
Accessing matches
mol = results[0]                      # first molecule (PFASEmbedding)
print(mol.smiles)                     # canonical SMILES
print(bool(mol.matches))              # True if any group matched
for match in mol.matches:             # iterate over MatchView objects
    if match.is_group:
        print(match.group_name)       # e.g. "Perfluoroalkyl"
        print(match.group_id)         # integer group ID
        for comp in match.components:
            print(comp.atoms)         # list of atom indices
Loop over only molecules that have at least one match:
for mol in results:
    if mol.matches:
        print(mol.smiles, "—", len(mol.matches), "match(es)")
Converting to a DataFrame
df = results.to_dataframe()
print(df.columns.tolist())
# ['smiles', 'inchi', 'group_name', 'group_id', ...]
Generating embeddings
Embeddings encode group matches as a fixed-length numeric vector suitable for machine learning. By default PFASGroups produces a binary vector with one column per group (fluorine only):
from PFASGroups import parse_smiles
smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]
# Parse once, then build the matrix and its column labels from the same set
results = parse_smiles(smiles)
arr = results.to_array()        # (3, n_groups) matrix, one row per molecule
cols = results.column_names()   # column labels, in the same order as arr's columns
print(arr.shape)                # (3, n_groups)
print(type(arr))                # numpy.ndarray
print(cols[:2])                 # ['Perfluoromethyl [binary]', 'Perfluoroalkyl [binary]']
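To make the binary encoding concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the three names in GROUPS and the helper binary_embedding are hypothetical stand-ins, and the real column set is whatever results.column_names() reports.

```python
# Illustrative only: a fixed, ordered list of group names defines the columns.
# These three names are placeholders, not the library's full group list.
GROUPS = ["Perfluoromethyl", "Perfluoroalkyl", "Carboxylic acid"]

def binary_embedding(matched_groups, groups=GROUPS):
    """Return one 0/1 entry per group, in column order."""
    matched = set(matched_groups)
    return [1 if name in matched else 0 for name in groups]

# One row per molecule; a molecule with no matches becomes an all-zero row.
rows = [
    binary_embedding(["Perfluoromethyl"]),
    binary_embedding(["Perfluoromethyl", "Carboxylic acid"]),
    binary_embedding([]),
]
print(rows)
```

Because every row has the same length and column order, rows from different molecules can be stacked directly into a feature matrix.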
Group selection — restrict to a subset of groups:
# OECD groups only
arr_oecd = results.to_array(group_selection='oecd')
cols_oecd = results.column_names(group_selection='oecd')
component_metrics — control how matches are encoded:
# binary (default): 1 = present, 0 = absent
arr_bin = results.to_array(component_metrics=['binary'])
# count: number of independent matches
arr_cnt = results.to_array(component_metrics=['count'])
# max_component: size (atom count) of the largest matching component
arr_max = results.to_array(component_metrics=['max_component'])
# Preset combining binary + effective graph resistance ('best')
arr_best = results.to_array(preset='best')
# Column labels take the same arguments, e.g. results.column_names(preset='best')
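The three simplest metrics differ only in how they summarise the matched components of one group. The sketch below illustrates that difference under the descriptions given above; encode_components is a hypothetical helper, not a PFASGroups function, and the component sizes are made up.

```python
def encode_components(component_sizes, metric):
    """Encode one group's matched components as a single number.

    component_sizes: atom counts of each matched component (empty if no match).
    """
    if metric == "binary":
        return 1 if component_sizes else 0
    if metric == "count":
        return len(component_sizes)
    if metric == "max_component":
        return max(component_sizes, default=0)
    raise ValueError(f"unknown metric: {metric!r}")

# Hypothetical input: three matched components of 4, 7 and 3 atoms.
sizes = [4, 7, 3]
encoded = {m: encode_components(sizes, m) for m in ("binary", "count", "max_component")}
print(encoded)
```

In this sketch all three metrics are 0 for a group with no match, so unmatched groups stay zero whichever metric is used.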
n_spacer: telomer CH2 spacer length (the second number in n:m notation, such as the 2 in 4:2 FTOH)
# n_spacer is 0 for non-telomers; for fluorotelomers it encodes the length
# of the CH2 linker (2 for both 4:2 FTOH and 6:2 FTOH)
arr_ns = results.to_array(component_metrics=['n_spacer'])
# Non-zero entries appear only in telomer group columns
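In fluorotelomer shorthand such as 4:2 FTOH, the first number counts the perfluorinated carbons and the second counts the non-fluorinated CH2 carbons that form the spacer. The following sketch just splits such a label; parse_telomer_label is a hypothetical helper for illustration and is not part of PFASGroups.

```python
def parse_telomer_label(label):
    """Split an 'n:m' fluorotelomer label into (fluorinated carbons, spacer carbons)."""
    fluorinated, spacer = label.split(":")
    return int(fluorinated), int(spacer)

# "4:2" as in 4:2 FTOH: four perfluorinated carbons, two CH2 spacer carbons.
n_fluorinated, n_spacer = parse_telomer_label("4:2")
print(n_fluorinated, n_spacer)
```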
ring_size — smallest ring containing the matched component:
# ring_size is 0 for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
arr_rs = results.to_array(component_metrics=['ring_size'])
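As a rough illustration of what a ring-size metric computes, the sketch below takes the atom indices of one matched component and a list of rings, and returns the size of the smallest ring that shares an atom with the component. Both the helper smallest_ring_size and the "shares at least one atom" rule are assumptions made for this example; the library may define ring membership differently.

```python
def smallest_ring_size(component_atoms, rings):
    """Size of the smallest ring sharing an atom with the component, or 0 if none.

    rings: iterable of rings, each a collection of atom indices.
    """
    atoms = set(component_atoms)
    sizes = [len(ring) for ring in rings if atoms & set(ring)]
    return min(sizes, default=0)

# Hypothetical fused bicyclic system: a 6-membered and a 5-membered ring
# that share atoms 4 and 5.
rings = [(0, 1, 2, 3, 4, 5), (4, 5, 6, 7, 8)]

print(smallest_ring_size([5, 9], rings))    # component touches both rings
print(smallest_ring_size([0, 1], rings))    # component touches the 6-ring only
print(smallest_ring_size([10, 11], rings))  # acyclic component
```

In practice the ring lists would come from a cheminformatics toolkit; with RDKit, for example, mol.GetRingInfo().AtomRings() returns rings as tuples of atom indices.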
Combined embedding with multiple metrics and molecule-wide descriptors:
arr_combined = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance',
                       'n_spacer', 'ring_size'],
    molecule_metrics=['n_components', 'max_size',
                      'mean_branching', 'max_component_fraction'],
)
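When several metrics are combined, the width of the matrix grows accordingly. The sketch below shows one plausible layout, using the "Group [metric]" label style seen in the column names earlier. The ordering (all groups for one metric, then the next metric, then molecule-wide columns) and the plain labels for molecule-wide descriptors are assumptions; check results.column_names(...) with the same arguments for the actual layout.

```python
def sketch_column_names(groups, component_metrics, molecule_metrics):
    """Build labels in an assumed order: per-group metric columns, then molecule-wide ones."""
    columns = [f"{group} [{metric}]" for metric in component_metrics for group in groups]
    columns += list(molecule_metrics)
    return columns

cols = sketch_column_names(
    groups=["Perfluoromethyl", "Perfluoroalkyl"],   # placeholder subset of groups
    component_metrics=["binary", "n_spacer"],
    molecule_metrics=["n_components", "max_size"],
)
print(len(cols))   # 2 groups * 2 component metrics + 2 molecule metrics = 6
print(cols)
```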
Note
For multi-halogen embeddings covering F, Cl, Br and I, see multi-halogen fingerprinting in Multi-Halogen Analysis (Advanced).
PFAS definition screening
from PFASGroups import parse_smiles
results = parse_smiles(
    ["CCCC(F)(F)F", "OCCOCCO"],
    include_PFAS_definitions=True,
)
for mol in results:
    for match in mol.matches:
        if match.is_definition:
            print(mol.smiles, "matches", match.get("definition_name"))
Saturation filter
# Only perfluorinated (fully saturated C–F) groups
results = parse_smiles(smiles, saturation='per')
# Polyfluorinated groups (partially substituted)
results = parse_smiles(smiles, saturation='poly')
# No filter — all groups (default: saturation=None for parse_smiles)
results = parse_smiles(smiles, saturation=None)
Multi-halogen parsing (advanced)
To detect Cl, Br and I groups in addition to fluorine, use the halogens argument or import from HalogenGroups:
from PFASGroups import parse_smiles
results = parse_smiles(["ClCCCl", "BrCCBr"], halogens=['F', 'Cl', 'Br', 'I'])
See Multi-Halogen Analysis (Advanced) for full multi-halogen documentation.
Command-line usage
# Parse a CSV of SMILES
pfasgroups parse input.csv --output results.json
# Generate fingerprints
pfasgroups fingerprint input.csv --output fps.csv
# List all 119 group names (114 compiled by default for fluorine-only)
pfasgroups list-groups
See Command-Line Interface for the full CLI reference.