Quickstart

Five-minute overview of the most common PFASGroups workflows.

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
arr, cols = results.to_array(), results.column_names()
print(arr.shape)   # (2, 114) — 114 groups compiled by default for fluorine-only

Parsing SMILES

from PFASGroups import parse_smiles

smiles = [
    "CCCC(F)(F)F",       # 1,1,1-trifluorobutane (terminal CF3 group)
    "FC(F)(F)C(=O)O",    # trifluoroacetic acid (TFA)
    "OCCOCCO",           # no halogen — returns no matches
]

results = parse_smiles(smiles)

results is a PFASEmbeddingSet — a list-like container of PFASEmbedding objects (dict subclass), one per input SMILES.

Accessing matches

mol = results[0]                             # first molecule (PFASEmbedding)
print(mol.smiles)                            # canonical SMILES
print(bool(mol.matches))                     # True if any group matched

for match in mol.matches:                    # iterate over MatchView objects
    if match.is_group:
        print(match.group_name)              # e.g. "Perfluoroalkyl"
        print(match.group_id)               # integer group ID
        for comp in match.components:
            print(comp.atoms)               # list of atom indices

Loop over only molecules that have at least one match:

for mol in results:
    if mol.matches:
        print(mol.smiles, "—", len(mol.matches), "match(es)")

Converting to a DataFrame

df = results.to_dataframe()
print(df.columns.tolist())
# ['smiles', 'inchi', 'group_name', 'group_id', ...]

Generating embeddings

Embeddings encode group matches as a fixed-length numeric vector suitable for machine learning. By default PFASGroups produces a binary vector with one column per group (fluorine only):

from PFASGroups import parse_smiles

smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]

# Parse once, then build the matrix and its column labels
results = parse_smiles(smiles)
arr = results.to_array()        # (3, n_groups) matrix, one row per molecule
cols = results.column_names()   # column labels, in the same order as the columns of arr
print(arr.shape)   # (3, n_groups)
print(type(arr))   # <class 'numpy.ndarray'>
print(cols[:2])    # ['Perfluoromethyl [binary]', 'Perfluoroalkyl [binary]']
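
Because the matrix and its labels line up column by column, the groups present in each molecule can be read off by pairing a row with the column names. The sketch below uses a small hand-written stand-in for arr and cols so that it runs without PFASGroups installed. The helper name present_groups, the third column label, and all values are hypothetical and are not real PFASGroups output:

```python
def present_groups(row, column_names):
    """Return the labels of all columns with a non-zero entry in this row."""
    return [name for value, name in zip(row, column_names) if value != 0]


# Stand-in data shaped like the output of to_array() / column_names().
# The values are illustrative only, not real PFASGroups results.
cols = [
    "Perfluoromethyl [binary]",
    "Perfluoroalkyl [binary]",
    "Example group [binary]",   # hypothetical label
]
arr = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]

for i, row in enumerate(arr):
    print(i, present_groups(row, cols))
```

The function only iterates over the values, so it should work equally on a row of the NumPy array returned by to_array().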

Group selection — restrict to a subset of groups:

# OECD groups only
arr_oecd = results.to_array(group_selection='oecd')
cols_oecd = results.column_names(group_selection='oecd')

component_metrics — control how matches are encoded:

# binary (default): 1 = present, 0 = absent
arr_bin = results.to_array(component_metrics=['binary'])

# count: number of independent matches
arr_cnt = results.to_array(component_metrics=['count'])

# max_component: size (atom count) of the largest matching component
arr_max = results.to_array(component_metrics=['max_component'])

# 'best' preset: binary + effective graph resistance
arr_best = results.to_array(preset='best')
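
To make the first three encodings concrete, the sketch below computes them for a single group from a list of matched components, each given as a list of atom indices. It only illustrates the definitions in the comments above and is not how PFASGroups implements them. The function name encode_group and the input format are assumptions, as is treating each component as one independent match:

```python
def encode_group(components):
    """Encode one group's matches under three metrics.

    components: list of matched components, each a list of atom indices
    (an assumed input format for this illustration).
    """
    return {
        # 1 if the group matched at least once, else 0
        "binary": 1 if components else 0,
        # number of matched components
        "count": len(components),
        # atom count of the largest component, 0 when there is no match
        "max_component": max((len(c) for c in components), default=0),
    }


# A group matched twice (components of 4 and 7 atoms), then an unmatched group.
print(encode_group([[0, 1, 2, 3], [5, 6, 7, 8, 9, 10, 11]]))
print(encode_group([]))
```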

n_spacer, the telomer CH2 spacer length (the second number in X:Y notation, e.g. the 2 in 6:2 FTOH):

# n_spacer is 0 for non-telomers; for fluorotelomers it encodes the
# CH2 linker length (2 for both 4:2 FTOH and 6:2 FTOH)
arr_ns = results.to_array(component_metrics=['n_spacer'])
# Non-zero entries appear only in telomer group columns
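
To try this metric on fluorotelomers, test inputs can be generated from the X:Y shorthand. A minimal sketch, assuming the usual convention that X counts perfluorinated carbons and Y counts the CH2 carbons before the hydroxyl group. The helper ftoh_smiles is hypothetical and not part of PFASGroups:

```python
def ftoh_smiles(n_fluorinated, n_ch2):
    """Build the SMILES of an X:Y fluorotelomer alcohol, F(CF2)x(CH2)yOH."""
    if n_fluorinated < 1 or n_ch2 < 1:
        raise ValueError("both carbon counts must be at least 1")
    # Terminal F, then X CF2 units, then Y CH2 units, then the hydroxyl O.
    return "F" + "C(F)(F)" * n_fluorinated + "C" * n_ch2 + "O"


print(ftoh_smiles(4, 2))  # 4:2 FTOH
print(ftoh_smiles(6, 2))  # 6:2 FTOH
```

The resulting strings can be passed to parse_smiles like any other SMILES.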

ring_size — smallest ring containing the matched component:

# ring_size is 0 for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
arr_rs = results.to_array(component_metrics=['ring_size'])

Combined embedding with multiple metrics and molecule-wide descriptors:

arr_combined = results.to_array(
    component_metrics=['binary', 'effective_graph_resistance',
                       'n_spacer', 'ring_size'],
    molecule_metrics=['n_components', 'max_size',
                      'mean_branching', 'max_component_fraction'],
)

Note

For multi-halogen embeddings covering F, Cl, Br and I, see multi-halogen fingerprinting in Multi-Halogen Analysis (Advanced).

PFAS definition screening

from PFASGroups import parse_smiles

results = parse_smiles(
    ["CCCC(F)(F)F", "OCCOCCO"],
    include_PFAS_definitions=True,
)

for mol in results:
    for match in mol.matches:
        if match.is_definition:
            print(mol.smiles, "matches", match.get("definition_name"))

Saturation filter

# Perfluorinated groups only (all C–H bonds replaced by C–F)
results = parse_smiles(smiles, saturation='per')

# Polyfluorinated groups (partially substituted)
results = parse_smiles(smiles, saturation='poly')

# No filter — all groups (default: saturation=None for parse_smiles)
results = parse_smiles(smiles, saturation=None)

Multi-halogen parsing (advanced)

To detect Cl, Br and I groups in addition to fluorine, use the halogens argument or import from HalogenGroups:

from PFASGroups import parse_smiles

results = parse_smiles(["ClCCCl", "BrCCBr"], halogens=['F', 'Cl', 'Br', 'I'])

See Multi-Halogen Analysis (Advanced) for full multi-halogen documentation.

Command-line usage

# Parse a CSV of SMILES
pfasgroups parse input.csv --output results.json

# Generate fingerprints
pfasgroups fingerprint input.csv --output fps.csv

# List all 119 group names (114 compiled by default for fluorine-only)
pfasgroups list-groups
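
The input file for the parse and fingerprint commands can be produced with Python's csv module. A minimal sketch, assuming the CLI reads SMILES from a single column with the header smiles. That header and the helper name write_smiles_csv are assumptions, so the expected layout should be checked against the CLI reference:

```python
import csv

smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]


def write_smiles_csv(path, smiles_list, header="smiles"):
    """Write one SMILES per row under a single header column."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([header])                    # assumed column name
        writer.writerows([s] for s in smiles_list)   # one SMILES per row


write_smiles_csv("input.csv", smiles)
```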

See Command-Line Interface for the full CLI reference.