Command-Line Interface
PFASGroups ships a command-line tool available as both halogengroups and
pfasgroups (aliases for the same entry point).
Installing the package makes the command available in your shell:
pip install PFASGroups
# or: pip install -e .
Quick example:
# Parse a SMILES and list detected groups
pfasgroups parse "CCCC(F)(F)F"
# Generate a 116-column fingerprint (F only) and print as CSV
pfasgroups fingerprint --output-format csv "CCCC(F)(F)F" "FC(F)(F)C(=O)O"
# List all 116 built-in groups
pfasgroups list-groups
Synopsis
halogengroups <command> [options] [smiles ...]
Commands:
parse         Parse SMILES and identify halogen groups
fingerprint   Generate group fingerprints
list-groups   List all available halogen groups
list-paths    List available component SMARTS types
parse
Identify halogen groups in one or more SMILES strings.
halogengroups parse [options] [smiles ...]
Positional arguments
smiles One or more SMILES strings (quote each one so the shell does not interpret parentheses and other special characters)
Options
-i, --input FILE              Read SMILES from file (one per line)
-o, --output FILE             Write results to FILE (default: stdout)
--format {json,csv}           Output format (default: json)
--pretty                      Pretty-print JSON
--halogens H [H ...]          Filter components by halogen(s): F Cl Br I
--saturation {per,poly}       Filter by saturation level
--form {alkyl,cyclic}         Filter by molecular form
--no-component-metrics        Skip all graph-theory metrics (fastest mode)
--limit-effective-graph-resistance N
                              Only compute effective resistance for components
                              with fewer than N atoms (0 = disable; omit = always)
--groups-file FILE            Custom halogen groups JSON file
--component_smarts-file FILE  Custom component SMARTS JSON file
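The three cases of --limit-effective-graph-resistance (omitted, 0, and a positive N) can be read as the small decision rule sketched below. The function name should_compute_resistance is hypothetical and only restates the option text; it is not taken from the PFASGroups source.

```python
from typing import Optional


def should_compute_resistance(n_atoms: int, limit: Optional[int]) -> bool:
    """Hypothetical helper restating the documented flag semantics."""
    if limit is None:
        # Flag omitted: the metric is always computed.
        return True
    if limit == 0:
        # 0 disables the metric entirely.
        return False
    # Otherwise only components with fewer than N atoms qualify.
    return n_atoms < limit


decisions = {
    "omitted": should_compute_resistance(120, None),
    "zero": should_compute_resistance(5, 0),
    "below": should_compute_resistance(49, 50),
    "at_limit": should_compute_resistance(50, 50),
}
print(decisions)
```

Under this reading, a component with exactly N atoms is skipped, because the option text says "fewer than N".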
Examples
# Single SMILES
halogengroups parse "FC(F)(F)C(F)(F)C(=O)O"
# Multiple SMILES, pretty-printed JSON
halogengroups parse --pretty \
"FC(F)(F)C(F)(F)C(=O)O" \
"FC(F)(F)C(F)(F)S(=O)(=O)O"
# Fluorine-only, perfluorinated alkyl components
halogengroups parse --halogens F --saturation per --form alkyl \
"FC(F)(F)C(F)(F)C(=O)O"
# All four halogens
halogengroups parse --halogens F Cl Br I \
"ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"
# From file, CSV output
halogengroups parse --input smiles.txt --output results.csv --format csv
# Skip graph metrics (fastest)
halogengroups parse --no-component-metrics "FC(F)(F)C(F)(F)C(=O)O"
# Compute resistance only for small components
halogengroups parse --limit-effective-graph-resistance 100 \
"FC(F)(F)C(F)(F)C(=O)O"
# Custom group file
halogengroups parse --groups-file my_groups.json "FC(F)(F)C(F)(F)C(=O)O"
JSON output structure
[
{
"smiles": "FC(F)(F)C(F)(F)C(=O)O",
"matches": [
{
"group_id": 1,
"group_name": "Perfluoroalkyl carboxylic acids",
"match_count": 1,
"components": [
{
"size": 3,
"SMARTS": "Perfluoroalkyl",
"branching": 1.0,
"smarts_centrality": 0.5
}
]
}
]
}
]
CSV output columns
smiles, group_id, group_name, match_count, component_sizes
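For downstream processing, the JSON output can be flattened into one row per matched group. The sketch below uses only the standard library and the sample record shown above; the flatten helper is hypothetical, and joining component sizes with ";" is an assumption, since the actual encoding of the component_sizes column is not specified here.

```python
import json

# Sample copied from the JSON output structure above.
raw = """
[
  {
    "smiles": "FC(F)(F)C(F)(F)C(=O)O",
    "matches": [
      {
        "group_id": 1,
        "group_name": "Perfluoroalkyl carboxylic acids",
        "match_count": 1,
        "components": [
          {"size": 3, "SMARTS": "Perfluoroalkyl",
           "branching": 1.0, "smarts_centrality": 0.5}
        ]
      }
    ]
  }
]
"""


def flatten(results):
    """Yield one dict per (molecule, matched group) pair."""
    for record in results:
        for match in record.get("matches", []):
            sizes = [str(c["size"]) for c in match.get("components", [])]
            yield {
                "smiles": record["smiles"],
                "group_id": match["group_id"],
                "group_name": match["group_name"],
                "match_count": match["match_count"],
                # Assumption: sizes joined with ';' (the real CSV may differ).
                "component_sizes": ";".join(sizes),
            }


rows = list(flatten(json.loads(raw)))
print(rows)
```

The resulting dictionaries use the same keys as the CSV columns, so they can be passed to csv.DictWriter or loaded into a data frame.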
fingerprint
Generate halogen-group fingerprints suitable for machine learning.
halogengroups fingerprint [options] [smiles ...]
Positional arguments
smiles One or more SMILES strings
Options
-i, --input FILE              Read SMILES from file
-o, --output FILE             Write output to FILE (default: stdout)
-g, --groups SPEC             Group selection as range "1-28" or
                              comma-separated "1,2,3" (default: all)
-f, --format {vector,dict,sparse,detailed,int}
                              Fingerprint representation (default: vector)
--count-mode {binary,count,max_chain}
                              Encoding mode (default: binary)
--halogens H [H ...]          Halogens to include (default: F)
--output-format {json,csv}    File format (default: json)
--pretty                      Pretty-print JSON
Examples
# Binary vector (default)
halogengroups fingerprint "FC(F)(F)C(F)(F)C(=O)O"
# Dictionary representation
halogengroups fingerprint --format dict "FC(F)(F)C(F)(F)C(=O)O"
# OECD groups only (IDs 1-28)
halogengroups fingerprint --groups 1-28 "FC(F)(F)C(F)(F)C(=O)O"
# Count mode
halogengroups fingerprint --count-mode count "FC(F)(F)C(F)(F)C(=O)O"
# Multi-halogen stacked fingerprint (116 × 4 = 464 columns)
halogengroups fingerprint --halogens F Cl Br I "FC(F)(F)C(F)(F)C(=O)O"
# From file, save to CSV
halogengroups fingerprint --input smiles.txt \
--output fps.csv --output-format csv
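The --groups option accepts a range such as "1-28" or a comma-separated list such as "1,2,3". The sketch below shows one way such a specification could be expanded into group IDs, and reproduces the 116 × 4 column count of the stacked fingerprint. The parse_group_spec helper is hypothetical and is not the CLI's own parser; accepting a mix of both forms in one string is an extension beyond what the option text describes.

```python
from typing import List


def parse_group_spec(spec: str) -> List[int]:
    """Expand "1-28" or "1,2,3" into a sorted list of unique group IDs.

    Hypothetical helper for illustration; mixed forms such as "1-3,7"
    are accepted here but are not documented CLI behavior.
    """
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


oecd_ids = parse_group_spec("1-28")
picked = parse_group_spec("1,2,3")

# Stacked fingerprint width: one block of group columns per halogen.
n_groups = 116
halogens = ["F", "Cl", "Br", "I"]
stacked_width = n_groups * len(halogens)
print(len(oecd_ids), picked, stacked_width)
```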
list-groups
List all available halogen groups and their definitions.
halogengroups list-groups [options]
Options
-o, --output FILE Write to FILE (default: stdout)
--pretty Pretty-print JSON (default: true)
Example output (excerpt)
[
{
"id": 1,
"name": "Perfluoroalkyl carboxylic acids",
"alias": "PFCA",
"componentSmarts": "Perfluoroalkyl",
"componentSaturation": "per",
"componentHalogens": "F",
"componentForm": "alkyl"
},
{
"id": 6,
"name": "Perfluoroalkyl sulfonic acids",
"alias": "PFSA"
}
]
Examples
# Print to console
halogengroups list-groups
# Save to file
halogengroups list-groups --output groups.json
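The saved group list can be used to look up group IDs by alias, for example to build a value for fingerprint --groups. The sketch below works on the excerpt shown above and assumes the real output follows the same shape; entries without an "alias" key are skipped.

```python
import json

# Excerpt copied from the example output above.
raw = """[
  {"id": 1, "name": "Perfluoroalkyl carboxylic acids", "alias": "PFCA",
   "componentSmarts": "Perfluoroalkyl", "componentSaturation": "per",
   "componentHalogens": "F", "componentForm": "alkyl"},
  {"id": 6, "name": "Perfluoroalkyl sulfonic acids", "alias": "PFSA"}
]"""

groups = json.loads(raw)

# Map alias -> numeric ID, ignoring entries that carry no alias.
alias_to_id = {g["alias"]: g["id"] for g in groups if "alias" in g}

# Comma-separated form accepted by `fingerprint --groups`.
spec = ",".join(str(alias_to_id[a]) for a in ("PFCA", "PFSA"))
print(spec)
```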
list-paths
List available component SMARTS types (path/component definitions).
halogengroups list-paths [options]
Options
-o, --output FILE Write to FILE (default: stdout)
Example output (excerpt)
{
"Perfluoroalkyl": {
"component": "[C;X4;H0](F)(F)!@!=!#[C;X4;H0](F)(F)",
"end": "[C;X4;H0](F)(F)F",
"halogen": "F",
"form": "alkyl",
"saturation": "per"
},
"Polyfluoroalkyl": {
"component": "[C;X4;H1](F)!@!=!#[C;X4](F)",
"end": "[C;X4;H1](F)F",
"halogen": "F",
"form": "alkyl",
"saturation": "poly"
}
}
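Each entry in the list-paths output carries "halogen", "form", and "saturation" fields, which correspond to the values accepted by the parse filters of the same names. The sketch below selects entries by those fields using the excerpt above; select_paths is a hypothetical helper, and whether the CLI applies its filters in exactly this way is an assumption.

```python
import json

# Excerpt copied from the example output above.
raw = """
{
  "Perfluoroalkyl": {
    "component": "[C;X4;H0](F)(F)!@!=!#[C;X4;H0](F)(F)",
    "end": "[C;X4;H0](F)(F)F",
    "halogen": "F",
    "form": "alkyl",
    "saturation": "per"
  },
  "Polyfluoroalkyl": {
    "component": "[C;X4;H1](F)!@!=!#[C;X4](F)",
    "end": "[C;X4;H1](F)F",
    "halogen": "F",
    "form": "alkyl",
    "saturation": "poly"
  }
}
"""


def select_paths(paths, halogen=None, saturation=None, form=None):
    """Return the names of entries matching every filter that is set."""
    wanted = {"halogen": halogen, "saturation": saturation, "form": form}
    return [
        name
        for name, spec in paths.items()
        if all(v is None or spec.get(k) == v for k, v in wanted.items())
    ]


paths = json.loads(raw)
per_only = select_paths(paths, saturation="per")
fluoro_alkyl = select_paths(paths, halogen="F", form="alkyl")
print(per_only, fluoro_alkyl)
```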
Global Options
These options can be placed before any sub-command:
--groups-file FILE Custom halogen groups JSON
--component_smarts-file FILE Custom component SMARTS JSON
Environment
The CLI runs in the Python environment in which PFASGroups is installed. To use a specific environment, activate it first or invoke the module through that environment's Python:
# conda
conda activate chem
halogengroups parse "FC(F)(F)C(F)(F)C(=O)O"
# direct Python invocation
python -m PFASGroups.cli parse "FC(F)(F)C(F)(F)C(=O)O"
Performance Tips
For large input files:
# Skip graph metrics (5-10× faster for large molecules)
halogengroups parse --no-component-metrics --input big_file.smi \
--output results.json
# Skip effective graph resistance for components with 50 or more atoms
halogengroups parse --limit-effective-graph-resistance 50 \
--input big_file.smi --output results.json
See the Benchmarking page for timing data.
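When these runs are driven from a Python script, the command can be assembled as an argument list, which avoids shell quoting issues. The sketch below uses only flags documented on this page; build_parse_command and run_parse are hypothetical helper names, and run_parse only works where the halogengroups command is installed.

```python
import subprocess
from typing import List, Optional


def build_parse_command(
    input_path: str,
    output_path: str,
    no_component_metrics: bool = False,
    resistance_limit: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list from flags documented on this page."""
    cmd = ["halogengroups", "parse", "--input", input_path, "--output", output_path]
    if no_component_metrics:
        cmd.append("--no-component-metrics")
    if resistance_limit is not None:
        cmd += ["--limit-effective-graph-resistance", str(resistance_limit)]
    return cmd


def run_parse(cmd: List[str]) -> None:
    """Run the command; requires PFASGroups in the active environment."""
    subprocess.run(cmd, check=True)


fast_cmd = build_parse_command("big_file.smi", "results.json", no_component_metrics=True)
limited_cmd = build_parse_command("big_file.smi", "results.json", resistance_limit=50)
print(" ".join(fast_cmd))
print(" ".join(limited_cmd))
```

Calling run_parse(fast_cmd) would then execute the first command from the tips above.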
See Also
Quickstart: Python API quick start
Core API: parse_smiles, generate_fingerprint reference
Customization: using custom groups and path files with the CLI