Multi-Halogen Analysis (Advanced)
Note
This is an advanced topic. For standard PFAS / fluorine-only work,
see the Quickstart, which focuses on the PFASGroups import.
This page describes how to extend detection to Cl, Br and I groups.
PFASGroups focuses on fluorinated substances (halogens='F' by default).
The same codebase can also detect chlorinated, brominated and iodinated
structural groups, either when you pass halogens explicitly or when you
import from HalogenGroups, which sets all four halogens as the default.
# Option A — explicit halogens argument with PFASGroups (recommended)
from PFASGroups import parse_smiles
results = parse_smiles(["ClC(Cl)(Cl)Cl", "BrCCBr"], halogens=['F','Cl','Br','I'])
# Option B — import HalogenGroups, all-halogens is the default
from HalogenGroups import parse_smiles
results = parse_smiles(["ClC(Cl)(Cl)Cl", "BrCCBr"])
| Import | Default | Typical use |
|---|---|---|
| HalogenGroups | halogens=['F', 'Cl', 'Br', 'I'] | Broad multi-halogen screening |
| PFASGroups | halogens='F' | PFAS / fluorine-focused analysis |
Either import can be forced to any halogen set by passing the halogens
argument explicitly; for the same halogens value, the results are identical
regardless of which module the function was imported from.
Quick Comparison
from HalogenGroups import parse_smiles as hal_parse
from PFASGroups import parse_smiles as pfas_parse
smiles = ["ClC(Cl)(Cl)C(Cl)(Cl)Cl", # perchlorinated
"BrC(Br)(Br)CBr", # brominated
"FC(F)(F)C(F)(F)C(=O)O"] # PFBA
# HalogenGroups: all halogens included by default
results_all = hal_parse(smiles)
# PFASGroups: fluorine only by default → Cl/Br compounds show 0 matches
results_f = pfas_parse(smiles)
# Equivalent — pass halogens explicitly to PFASGroups
results_eq = pfas_parse(smiles, halogens=['F', 'Cl', 'Br', 'I'])
Warning
The default embedding width depends on which module you import from.
parse_smiles from HalogenGroups returns an instance of
HalogenGroups.PFASEmbeddingSet, a subclass of the PFASGroups class of the
same name. Calling results.to_array()
on that object without an explicit halogens argument produces a
464-column array (116 groups × 4 halogens), not the standard
116-column fluorine-only array. To guarantee fluorine-only output,
always pass halogens='F' explicitly:
arr = results.to_array(halogens='F') # always 116 columns
The same applies to compute_config, _txp_fingerprint, and any similar
helpers in notebooks or scripts that pass **kwargs to
to_array(): the halogens key must be present in those kwargs.
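One way to make such helpers safe is to pin the halogens key before forwarding the keyword arguments. The following is a minimal sketch, not part of the package: to_array_fluorine_only is a hypothetical helper, and _StubEmbeddingSet is a stand-in that only mimics the all-halogens default and returns a column count instead of a real array.

```python
def to_array_fluorine_only(results, **kwargs):
    """Forward kwargs to results.to_array(), pinning halogens='F' unless the caller set it."""
    kwargs.setdefault("halogens", "F")
    return results.to_array(**kwargs)


class _StubEmbeddingSet:
    """Hypothetical stand-in mimicking the HalogenGroups default (all four halogens)."""

    N_GROUPS = 116  # group count quoted in this page

    def to_array(self, halogens=("F", "Cl", "Br", "I")):
        if isinstance(halogens, str):
            halogens = [halogens]
        # Return only the column count; that is all this sketch needs.
        return self.N_GROUPS * len(halogens)


stub = _StubEmbeddingSet()
width_default = stub.to_array()                    # no halogens key: 4 halogens wide
width_pinned = to_array_fluorine_only(stub)        # helper pins halogens='F'
width_override = to_array_fluorine_only(stub, halogens=["F", "Cl"])  # caller still wins
```

Using setdefault rather than plain assignment keeps an explicit halogens value from the caller intact, so the helper only changes behaviour when the key is missing.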
Functions with Altered Defaults
The following functions have their halogens default overridden to
['F', 'Cl', 'Br', 'I'] when imported from HalogenGroups:
| Function | Default change |
|---|---|
| | |
| | |
| | |
| | |
All other functions (parse_mol, parse_groups_in_mol, get_HalogenGroups,
prioritise_molecules, …) are re-exported unchanged.
Parsing Multi-Halogen Molecules
from HalogenGroups import parse_smiles
smiles = [
"FC(F)(F)C(F)(F)C(=O)O", # PFBA — fluorinated
"ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O", # perchlorinated carboxylic acid
"BrC(Br)(Br)CBr", # brominated
]
# All four halogens considered — no extra argument needed
results = parse_smiles(smiles)
for mol in results:
print(mol.smiles)
for match in mol.matches:
comps = match.components
print(f" {match.group_name}: {len(comps)} component(s)")
To restrict to a subset of halogens with HalogenGroups:
# Override the default to fluorine + chlorine only
results = parse_smiles(smiles, halogens=['F', 'Cl'])
Multi-Halogen Fingerprints
generate_fingerprint() with multiple halogens stacks per-halogen vectors
horizontally. With 116 groups and 4 halogens the resulting fingerprint has
116 × 4 = 464 columns. Group names are suffixed [F], [Cl], [Br],
[I].
from HalogenGroups import generate_fingerprint
smiles = ["FC(F)(F)C(F)(F)C(=O)O",
"ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"]
# Default: all 4 halogens, per-saturation → shape (2, 464)
fps, info = generate_fingerprint(smiles)
print(fps.shape) # (2, 464)
print(info['halogens']) # ['F', 'Cl', 'Br', 'I']
print(info['group_names'][:3]) # ['... [F]', '... [F]', '... [F]']
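The horizontal stacking can be illustrated without the library. The sketch below is a pure-Python illustration that assumes halogen-major column order (all [F] columns first, which is consistent with the group_names[:3] output above). stack_fingerprints and the three group names are hypothetical and not part of the package.

```python
HALOGENS = ["F", "Cl", "Br", "I"]


def stack_fingerprints(per_halogen, group_names, halogens=HALOGENS):
    """Concatenate per-halogen vectors and build suffixed column names.

    per_halogen maps a halogen symbol to a list with one value per group.
    """
    row, names = [], []
    for hal in halogens:
        vec = per_halogen[hal]
        if len(vec) != len(group_names):
            raise ValueError(f"vector for {hal} has the wrong length")
        row.extend(vec)
        names.extend(f"{group} [{hal}]" for group in group_names)
    return row, names


# Hypothetical three-group example (the real fingerprint uses 116 groups).
groups = ["Group A", "Group B", "Group C"]
per_halogen = {
    "F": [1, 0, 1],
    "Cl": [0, 0, 0],
    "Br": [0, 1, 0],
    "I": [0, 0, 0],
}
row, names = stack_fingerprints(per_halogen, groups)
print(len(row))   # 3 groups x 4 halogens
print(names[:4])
```

With 116 groups in place of three, the same concatenation yields the 464 columns described above.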
Via PFASEmbeddingSet
from HalogenGroups import parse_smiles
results = parse_smiles(smiles)
# to_array() defaults to all halogens in HalogenGroups
arr_all = results.to_array() # shape (n, 464)
arr_f = results.to_array(halogens='F') # shape (n, 116)
arr_fc = results.to_array(halogens=['F', 'Cl']) # shape (n, 232)
arr_oecd = results.to_array(
group_selection='oecd', halogens=['F', 'Cl', 'Br', 'I']) # shape (n, 112)
# Best preset (binary + effective_graph_resistance) with F only
arr_best = results.to_array(preset='best', halogens='F') # shape (n, 232)
# Explicit component_metrics: binary mode + graph metric with all halogens
arr_cm = results.to_array(
component_metrics=['binary', 'effective_graph_resistance'],
halogens=['F', 'Cl', 'Br', 'I']) # shape (n, 928) — 2 × 116 × 4
Each entry in component_metrics adds one block of n_groups × n_halogens
columns. The column naming follows the pattern "GroupName [halogen] [metric]"
(e.g. "Perfluoroalkyl [F] [binary]", "Perfluoroalkyl [F] [effective_graph_resistance]").
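The width arithmetic and the naming pattern can be checked with a few lines of plain Python. This is a sketch under stated assumptions: expected_width and column_names are hypothetical helpers, and the block order (metric outermost, then halogen, then group) is assumed for illustration rather than confirmed by the library.

```python
from itertools import product


def expected_width(n_groups, halogens, component_metrics):
    """Each metric contributes one block of n_groups x n_halogens columns."""
    return len(component_metrics) * n_groups * len(halogens)


def column_names(group_names, halogens, component_metrics):
    """Build names following the 'GroupName [halogen] [metric]' pattern.

    Assumed ordering: metric blocks outermost, then halogen, then group.
    """
    return [
        f"{group} [{hal}] [{metric}]"
        for metric, hal, group in product(component_metrics, halogens, group_names)
    ]


metrics = ["binary", "effective_graph_resistance"]
width_all = expected_width(116, ["F", "Cl", "Br", "I"], metrics)
width_f = expected_width(116, ["F"], metrics)
names = column_names(["Perfluoroalkyl"], ["F"], metrics)
print(width_all, width_f)
print(names)
```

A check like this is useful as an assertion in notebooks, where a silently wider array is easy to miss.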
Combining with Saturation Filters
The saturation parameter ('per', 'poly', or None) applies to all
halogens simultaneously and controls which component SMARTS are used for groups
that have halogenated-chain components (OECD groups 1–28).
from HalogenGroups import parse_smiles
# Perhalogenated components only (default when using HalogenGroups fingerprinting)
r_per = parse_smiles(smiles, saturation='per')
# Polyhalogenated components only
r_poly = parse_smiles(smiles, saturation='poly')
# All components (per + poly)
r_all = parse_smiles(smiles, saturation=None)
CLI with Multiple Halogens
The CLI always requires an explicit --halogens flag:
# All four halogens
halogengroups parse --halogens F Cl Br I "ClC(Cl)(Cl)C(Cl)(Cl)C(=O)O"
# Per-saturation, F + Cl, OECD groups
halogengroups parse --halogens F Cl --saturation per "FC(F)(F)C(F)(F)C(=O)O"
# Stacked fingerprint with F + Cl
halogengroups fingerprint --halogens F Cl "FC(F)(F)C(F)(F)C(=O)O"
See Command-Line Interface for the full CLI reference.
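When the CLI is driven from a script, building the argument list in one place makes it hard to forget the --halogens flag. The sketch below only assembles the arguments shown in the examples above and does not run anything. build_cli_args is a hypothetical helper, and it assumes the argument order used in those examples.

```python
import shlex


def build_cli_args(subcommand, smiles, halogens, saturation=None):
    """Assemble a halogengroups command line with an explicit --halogens flag."""
    if not halogens:
        raise ValueError("at least one halogen is required")
    args = ["halogengroups", subcommand, "--halogens", *halogens]
    if saturation is not None:
        args += ["--saturation", saturation]
    args.append(smiles)
    return args


parse_args = build_cli_args(
    "parse", "FC(F)(F)C(F)(F)C(=O)O", ["F", "Cl"], saturation="per"
)
fingerprint_args = build_cli_args(
    "fingerprint", "FC(F)(F)C(F)(F)C(=O)O", ["F", "Cl"]
)
# shlex.join quotes the SMILES so the line can be pasted into a shell.
print(shlex.join(parse_args))
```

The list form can be passed to subprocess.run directly, which avoids shell quoting problems with the parentheses in SMILES strings.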
Implementation Details
HalogenGroups/__init__.py uses functools.wraps()-compatible wrappers
that inject halogens=['F', 'Cl', 'Br', 'I'] as a default keyword argument.
Because Python keyword defaults can always be overridden at call time, every
explicit halogens=... argument takes precedence.
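The wrapper pattern described above can be sketched as follows. This is an illustration of the general technique, not the package's actual source: with_all_halogens_default and base_parse are hypothetical names, and the sketch assumes halogens is always passed by keyword.

```python
import functools

ALL_HALOGENS = ["F", "Cl", "Br", "I"]


def with_all_halogens_default(func):
    """Wrap func so that halogens defaults to all four halogens."""

    @functools.wraps(func)  # keeps the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # Only fills in the default; an explicit halogens=... is left untouched.
        kwargs.setdefault("halogens", list(ALL_HALOGENS))
        return func(*args, **kwargs)

    return wrapper


def base_parse(smiles, halogens="F"):
    """Hypothetical stand-in for a fluorine-default parsing function."""
    return {"n_molecules": len(smiles), "halogens": halogens}


wrapped_parse = with_all_halogens_default(base_parse)

default_call = wrapped_parse(["ClC(Cl)(Cl)Cl"])
explicit_call = wrapped_parse(["ClC(Cl)(Cl)Cl"], halogens="F")
print(default_call["halogens"], explicit_call["halogens"])
```

Because the default is injected with setdefault, the original function is never modified, so importing the fluorine-default and the all-halogens variants side by side stays safe.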
The PFASEmbeddingSet subclass in HalogenGroups overrides only
to_array(); all other methods (show(), summary(), to_sql(),
etc.) are inherited unchanged from PFASGroups.PFASEmbeddings.PFASEmbeddingSet.
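A subclass that changes only one default looks roughly like the sketch below. The class names, the list base class, and the zero-filled rows are hypothetical stand-ins chosen so the example runs without the package. Only the idea of overriding to_array() while inheriting everything else comes from the description above.

```python
ALL_HALOGENS = ("F", "Cl", "Br", "I")
N_GROUPS = 116  # group count quoted in this page


class BaseEmbeddingSet(list):
    """Hypothetical stand-in for the fluorine-default embedding set."""

    def to_array(self, halogens="F"):
        selected = [halogens] if isinstance(halogens, str) else list(halogens)
        # One zero-filled row per molecule; only the width matters here.
        return [[0] * (N_GROUPS * len(selected)) for _ in self]

    def summary(self):
        return f"{len(self)} molecule(s)"


class MultiHalogenEmbeddingSet(BaseEmbeddingSet):
    """Overrides only to_array(); summary() is inherited unchanged."""

    def to_array(self, halogens=ALL_HALOGENS):
        return super().to_array(halogens=halogens)


results = MultiHalogenEmbeddingSet(["mol-1", "mol-2"])
width_default = len(results.to_array()[0])
width_f = len(results.to_array(halogens="F")[0])
print(width_default, width_f, results.summary())
```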
When to Use Which Import
Use HalogenGroups when:
Your compounds include Cl-, Br- or I-containing structures
You want to compare fluorination vs. chlorination patterns side by side
You are building a generic halogenated-substance screening workflow
Use PFASGroups (halogens='F') when:
You are working exclusively with PFAS (fluorine-only)
You need strict compatibility with published PFAS fingerprint benchmarks
You want smaller fingerprints (116-column vs. 464-column)
You mix fluorine-only and multi-halogen calls in the same script
See Also
Quickstart: first steps with the package
Core API: full parse_smiles / generate_fingerprint reference
Data Models: PFASEmbeddingSet and EmbeddingArray details
Customization: adding custom halogen groups