Advanced Features: H-Components and Wildcard Groups
This guide covers two advanced features in PFASGroups:
H-Components: Pseudo-halogenated hydrocarbon chains for expanding homologous series to non-fluorinated organic compounds
Wildcard Groups: Generic functional group detection beyond OECD PFAS groups
H-Components (Hydrocarbon Analysis)
Overview
H-Components allow PFASGroups to treat ordinary hydrocarbon chains (CH₂ units) as if they were halogenated components. This feature enables:
Homologue series exploration for non-fluorinated organic compounds
Validation of component detection logic on simpler test cases
Extended chemical space mapping to broader alkyl and aliphatic systems
Research into structural variation of neutral hydrocarbons
While PFASGroups is primarily designed for halogenated substances, the H-component framework demonstrates the generality of the underlying component-based architecture.
Usage in Homologue Generation
The most common use of H-components is in the generate_homologues() function, where you can specify halogen='H' to generate shorter alkyl chain variants.
Basic Example
from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem
# Simple hydrocarbon with a carboxylic acid head group
smiles = 'OC(=O)CCCCCC' # heptanoic acid: 6-carbon alkyl chain on a carboxylic acid head
mol = Chem.MolFromSmiles(smiles)
# Generate homologues by removing CH2 units
homologues = generate_homologues(mol, halogen='H')
print(f"Parent: {Chem.MolToSmiles(mol)}")
print(f"Halogen mode: {homologues.halogen}")
print(f"Number of homologues: {len(homologues)}")
# Inspect each homologue
for inchikey, inner_dict in homologues.items():
    for formula, h_mol in inner_dict.items():
        print(f" {formula}: {Chem.MolToSmiles(h_mol)}")
Expected Output:
For this parent (a 6-carbon alkyl chain on the acid head group), the output lists five shorter homologues:
Parent: OC(=O)CCCCCC
Halogen mode: H
Number of homologues: 5
C6H12O2: OC(=O)CCCCC
C5H10O2: OC(=O)CCCC
C4H8O2: OC(=O)CCC
C3H6O2: OC(=O)CC
C2H4O2: OC(=O)C
How It Works
When halogen='H':
PFASGroups detects “Alkyl” components by matching CH₂-rich backbones
Instead of looking for C–F bonds, it identifies C–H bonds in repeating units
Homologues are generated by systematically removing CH₂ units
The n_removed field in results tracks how many CH₂ units were removed
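The chain-shortening idea in the steps above can be illustrated without PFASGroups at all. The sketch below is a toy, string-based stand-in and not the library's algorithm: it assumes a linear alkyl chain written directly after a fixed head group, and the helper name toy_alkyl_homologues is hypothetical.

```python
def toy_alkyl_homologues(head: str, chain_length: int) -> dict[int, str]:
    """Map n_removed -> SMILES for every shorter linear-chain homologue.

    Toy stand-in only: assumes the SMILES is `head` followed by
    `chain_length` aliphatic carbons, and always keeps at least one carbon.
    """
    homologues = {}
    for n_removed in range(1, chain_length):
        # Dropping one chain carbon is equivalent to removing one CH2 unit
        homologues[n_removed] = head + "C" * (chain_length - n_removed)
    return homologues


# Parent from the basic example: OC(=O)CCCCCC (six carbons after the head group)
result = toy_alkyl_homologues("OC(=O)", 6)
for n_removed, smiles in result.items():
    print(f"n_removed={n_removed}: {smiles}")
```

For the parent used in the basic example, this yields five shorter chains, from OC(=O)CCCCC down to OC(=O)C. The real function works on RDKit molecules and detected components instead of raw strings.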
Comparing Halogen Modes
from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem
# Same parent SMILES, analyzed with different halogens
smiles = 'OC(=O)' + 'C(F)(F)' * 4 + 'F' # Perfluoroalkyl acid
mol = Chem.MolFromSmiles(smiles)
# Fluorine (default) — removes CF2 units
hom_f = generate_homologues(mol, halogen='F')
print(f"F-mode homologues: {len(hom_f)}")
# What if we treat it as hydrogen? (not typical, but demonstrates flexibility)
# Note: H mode looks for CH2 backbone patterns, which this fully fluorinated chain does not contain
# hom_h = generate_homologues(mol, halogen='H')
# Chlorine — removes CCl2 units (if present)
# hom_cl = generate_homologues(mol, halogen='Cl')
Component Detection with H
To inspect H-components through the standard parser, define a custom H-constrained
group and run parse_smiles in H mode:
from PFASGroups import parse_smiles, HalogenGroup
smiles = 'CCCCO'
h_group = HalogenGroup(
    id=9990,
    name='Hydrocarbon alcohol via H-alkyl component',
    smarts={'[#6$([#6!$([#6]=O)][OH1,Oh1,O-])]': 1},
    componentSmarts='Alkyl',
    componentSaturation='per',
    componentHalogens='H',
    componentForm='alkyl',
    constraints={},
    max_dist_from_comp=1,
)
results = parse_smiles(smiles, halogens='H', pfas_groups=[h_group], bycomponent=True)
h_matches = [m for m in results[0].matches if m.get('id') == 9990 and m.get('type') == 'HalogenGroup']
print(f"Found {len(h_matches)} H-component match(es)")
if h_matches:
    print(f"Component count: {h_matches[0]['num_components']}")
Limitations
H-components are not true halogenated components — they use CH₂ patterns as stand-ins
No graph metrics are computed for H-components (only for fluorinated components with sufficient size)
Limited validation: fewer test cases exist for hydrocarbon analysis
Use case specificity: H-mode is mainly for research and validation, not production PFAS analysis
Wildcard Groups
Overview
Wildcard groups provide generic functional group detection beyond the 27 OECD PFAS groups. They enable:
Broader organic chemistry coverage (esters, ethers, alcohols, aldehydes, etc.)
Complementary analysis to PFAS-specific patterns
Non-halogenated compound screening (identifying functional groups in any molecule)
Cross-framework validation (comparing wildcard matches across different halogens)
Wildcard groups are assigned group IDs in the range 29–76 (plus some higher special groups), and their matches are tagged with a 'W' prefix in match IDs (e.g., W-001, W-042).
Enabling Wildcard Detection
Basic Toggle
Wildcard groups are disabled by default. Enable them by including '*' in halogens:
from PFASGroups import parse_smiles
smiles = "CCO" # ethanol
# Without wildcards
results_no_wc = parse_smiles(smiles, halogens='F')
print(f"Matches (no wildcards): {len(results_no_wc[0].matches)}")
# With wildcards
results_with_wc = parse_smiles(smiles, halogens='*')
print(f"Matches (with wildcards): {len(results_with_wc[0].matches)}")
# Wildcard matches are present in the second set
for match in results_with_wc[0].matches:
    print(f" {match.group_name} (ID {match.group_id})")
Example Output:
Matches (no wildcards): 0
Matches (with wildcards): 1
Alcohol (ID 30)
Wildcard vs. Halogen Groups
When both are enabled, you can distinguish them by match_id prefix:
from PFASGroups import parse_smiles
# Molecule with both PFAS and functional group interest
smiles = "FC(F)(F)C(F)(F)C(=O)O" # Perfluoropropanoic acid (PFPrA): C2F5 chain with a carboxylic acid head
results = parse_smiles(smiles, halogens=['F', '*'])
mol = results[0]
# Separate by match type
halogen_matches = [m for m in mol.matches if not m.match_id.startswith('W')]
wildcard_matches = [m for m in mol.matches if m.match_id.startswith('W')]
print(f"Halogen group matches ({len(halogen_matches)}):")
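for m in halogen_matches:
    # PFAS definitions carry 'id' / 'definition_name'; groups carry group_id / group_name
    is_definition = m['type'] == 'PFASdefinition'
    ident = m['id'] if is_definition else m.group_id
    name = m['definition_name'] if is_definition else m.group_name
    print(f" {ident} - {name} - {m['type']}")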
print(f"\nWildcard group matches ({len(wildcard_matches)}):")
for m in wildcard_matches:
    print(f" {m.group_id} - {m.group_name} - {m['type']}")
Common Wildcard Groups
The following are frequently matched wildcard groups (IDs 29–76):
| ID | Group Name | Pattern | Example |
|---|---|---|---|
| 30 | Alcohol | [OH] on C | |
| 31 | Ether | C-O-C linkage | |
| 32 | Aldehyde | [CH1]=O | |
| 33 | Alkene | C=C double bond | |
| 36 | Alkyne | C≡C triple bond | |
| 42 | Carboxylic acid | C(=O)OH | |
| 46 | Ester | C(=O)O-C | |
| 47 | Ether | C-O-C (again) | |
For a complete list, see PFASGroups/data/Halogen_groups_smarts.json (Groups 29+).
Filtering by Functional Group
You can filter matches to only those in specific wildcard groups:
from PFASGroups import parse_smiles
test_molecules = [
    ("CCO", "alcohol"),
    ("CC(=O)O", "carboxylic acid"),
    ("CC(=O)OC", "ester"),
    ("CCOC", "ether"),
    ("C=C", "alkene"),
    ("C#C", "alkyne"),
]
results = parse_smiles([smi for smi, _ in test_molecules], halogens='*')
for (smi, desc), mol in zip(test_molecules, results):
    print(f"{desc:20} ({smi:15}): ", end="")
    if mol.matches:
        names = [m.group_name for m in mol.matches]
        print(", ".join(names))
    else:
        print("(no match)")
Expected Output:
alcohol              (CCO            ): Alcohol
carboxylic acid      (CC(=O)O        ): Carboxylic acid
ester                (CC(=O)OC       ): Ester
ether                (CCOC           ): Ether
alkene               (C=C            ): Alkene
alkyne               (C#C            ): Alkyne
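The example above prints every match; keeping only selected groups takes one further filter on the group ID. The sketch below uses a hypothetical StubMatch dataclass in place of real match objects so that it runs without PFASGroups. It assumes only the group_id and group_name attributes used elsewhere in this guide, and the IDs come from the table of common wildcard groups.

```python
from dataclasses import dataclass


@dataclass
class StubMatch:
    """Hypothetical stand-in for a match object (not a PFASGroups class)."""
    group_id: int
    group_name: str


def filter_by_group(matches, wanted_ids):
    """Return only the matches whose group_id is in wanted_ids."""
    wanted = set(wanted_ids)
    return [m for m in matches if m.group_id in wanted]


# Made-up match list for illustration; IDs follow the table of common groups
matches = [
    StubMatch(30, "Alcohol"),
    StubMatch(42, "Carboxylic acid"),
    StubMatch(46, "Ester"),
]

# Keep only carboxylic acids (42) and esters (46)
kept = filter_by_group(matches, {42, 46})
print([m.group_name for m in kept])
```

The same list comprehension should work on real results as long as the match objects expose group_id as shown in the earlier examples.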
Multi-Halogen Wildcard Analysis
Wildcards work across all halogen modes. Compare wildcard matches across halogens:
from PFASGroups import parse_smiles
# Molecule with one CF2 unit and several functional groups (acid, amine, alcohol, ester)
smiles = "O=C(O)C(F)(F)CCN(CC(C)O)CC(=O)OC"
# Analyze with each halogen (wildcards enabled)
halogens = ['F', 'Cl', 'Br', 'I', 'H', "*"]
results = parse_smiles(smiles, halogens=halogens)
for result in results:
    print(f"Results for {result.smiles}")
    for match in result.matches:
        print(f" - {match.group_name} (ID {match.group_id}) under '{match['halogen']}' mode")
Understanding the halogen field: The match['halogen'] field shows which halogens are
actually present in that match’s components (e.g., 'F', 'Cl', ['F', 'H'], '*'
for wildcards). For HalogenGroup matches, this reflects the real halogens bonded to the
matched carbon components. For WildcardGroup matches, it is always '*'. For PFASdefinition
matches, it is always 'F'.
Note: To run both H-component and wildcard matching in one call, use
halogens=['H', '*'].
Note: Wildcard patterns don’t depend on the halogen parameter, so the wildcard matches are the same whichever halogens are listed alongside '*'. This molecule does contain a CF₂ unit, so halogen-specific matches may still differ between modes.
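Because the halogen field described above can hold either a single string or a list, grouping matches by it is easier after normalizing the value. The following is a minimal sketch on stand-in dictionaries, not real PFASGroups output: the helper halogen_key and the group names are made up for illustration, and only the 'halogen' key is taken from this guide.

```python
from collections import defaultdict


def halogen_key(value):
    """Normalize a 'halogen' field (a str or a list of str) to a sorted tuple."""
    if isinstance(value, str):
        return (value,)
    return tuple(sorted(value))


# Stand-in matches: plain dicts with made-up names, not real parser output
stub_matches = [
    {"group_name": "example group A", "halogen": "F"},
    {"group_name": "example group B", "halogen": ["H", "F"]},
    {"group_name": "example group C", "halogen": "*"},
    {"group_name": "example group D", "halogen": ["F", "H"]},
]

# Group match names by their normalized halogen key
by_halogen = defaultdict(list)
for match in stub_matches:
    by_halogen[halogen_key(match["halogen"])].append(match["group_name"])

for key, names in by_halogen.items():
    print(key, names)
```

Sorting the list form means ['H', 'F'] and ['F', 'H'] end up under the same key, and wildcard matches collect under ('*',).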
Advanced: Combining H-Components and Wildcards
You can analyze molecules using both features simultaneously:
from PFASGroups import parse_smiles
from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem
# Mixed molecule: perfluorobutyl chain joined to a carboxylic acid through a (CH2)5 spacer
smiles = "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)CCCCCC(=O)O"
mol = Chem.MolFromSmiles(smiles)
# 1. Detect all functional groups and PFAS patterns
results = parse_smiles(smiles, halogens=['F', '*'])
print("Detected groups:")
for match in results[0].matches:
    prefix = "PFAS" if not match.match_id.startswith('W') else "Generic"
    print(f" [{prefix}] {match.group_name}")
# 2. Generate PFAS (F-based) homologues
pfas_homologues = generate_homologues(mol, halogen='F')
print(f"\nFluorinated homologues: {len(pfas_homologues)}")
# 3. Analyze the non-halogenated portion as H-component
# (Extract a fragment to demonstrate)
non_halo_smiles = "CCCCCC(=O)O" # Hydrocarbon portion only
non_halo_mol = Chem.MolFromSmiles(non_halo_smiles)
h_homologues = generate_homologues(non_halo_mol, halogen='H')
print(f"Hydrocarbon homologues: {len(h_homologues)}")
Custom Wildcard Definitions
If you need to detect additional functional groups, you can extend the wildcard definitions by modifying PFASGroups/data/Halogen_groups_smarts.json.
To create a custom wildcard group:
from PFASGroups import HalogenGroup, parse_smiles
# Define a custom wildcard group (e.g., for a specific ketone pattern)
custom_group = HalogenGroup(
    id=9999, # Use a high ID to avoid conflicts
    name="Methyl ketone",
    # Note: this SMARTS matches any ketone (carbonyl carbon flanked by two carbons), not only methyl ketones
    smarts={"[#6]C(=O)[#6]": 1},
    alias="Methyl ketone",
    # Wildcard groups typically have no halogen-specific constraints:
    componentSmarts=None,
    componentSaturation=None,
    linker_smarts=None,
    constraints={},
)
# Use in parsing
results = parse_smiles(["CC(=O)C", "CC(=O)CC"], pfas_groups=[custom_group], halogens='*')
for mol in results:
    if mol.matches:
        for m in mol.matches:
            if m.group_id == 9999:
                print(f"Matched custom group: {m.group_name}")
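Before registering a custom pattern, it can help to check the SMARTS directly with RDKit against a few molecules that should and should not match. The sketch below is independent of PFASGroups and uses only RDKit's substructure matching. The helper smarts_hits is hypothetical, and the stricter pattern [CH3]C(=O)[#6] is shown as one possible way to restrict the match to methyl ketones, not as the library's definition.

```python
from rdkit import Chem


def smarts_hits(smarts: str, smiles_list: list[str]) -> dict[str, bool]:
    """Report, per SMILES, whether the SMARTS pattern matches."""
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f"Invalid SMARTS: {smarts}")
    hits = {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        hits[smi] = mol is not None and mol.HasSubstructMatch(pattern)
    return hits


# Acetone, butanone, 3-pentanone (no methyl next to C=O), acetic acid
probes = ["CC(=O)C", "CC(=O)CC", "CCC(=O)CC", "CC(=O)O"]

generic = smarts_hits("[#6]C(=O)[#6]", probes)       # any ketone
methyl_only = smarts_hits("[CH3]C(=O)[#6]", probes)  # methyl ketones only

for smi in probes:
    print(f"{smi:12} generic={generic[smi]} methyl_only={methyl_only[smi]}")
```

The generic pattern also matches 3-pentanone, which has no methyl group on the carbonyl carbon, so a group named "Methyl ketone" may need the stricter pattern. Neither pattern matches acetic acid.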
Best Practices
For H-Components
Use for research/validation only — not for production PFAS screening
Expect fewer homologues than fluorinated analogues (fewer repeated units in hydrocarbons)
Check component detection manually if results are unexpected
Combine with standard PFAS analysis for comprehensive coverage
For Wildcard Groups
Enable selectively: only when needed, to reduce parse overhead
Filter by prefix: distinguish wildcards (W-*) from halogen groups
Validate against chemistry: wildcard patterns are generic and may match unintended substructures
Telomer behavior: telomer groups are not assessed in H-only or wildcard-only flows
PFAS definitions gate: definitions are assessed only when 'F' is included in halogens
Document custom groups: if you extend the definitions, add comments to PFASGroups/data/Halogen_groups_smarts.json
Test on reference sets: verify that your chosen patterns work for your use case
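The gating rules in this list can be summarized as a small lookup on the halogens argument. The sketch below only restates the rules from this guide and does not call PFASGroups; the helper planned_passes is hypothetical. One point is an assumption, not something the guide states: telomer groups are treated as assessed whenever at least one of F, Cl, Br or I is requested, and the combined ['H', '*'] case is treated like the H-only and wildcard-only cases.

```python
def planned_passes(halogens) -> dict[str, bool]:
    """Summarize which kinds of matching a given halogens argument enables."""
    # Accept either a single string ('F') or a list (['F', '*'])
    selected = {halogens} if isinstance(halogens, str) else set(halogens)
    return {
        # Wildcard groups are enabled by including '*'
        "wildcard_groups": "*" in selected,
        # PFAS definitions are assessed only when 'F' is included
        "pfas_definitions": "F" in selected,
        # Assumption: telomer groups need at least one real halogen
        "telomer_groups": bool(selected & {"F", "Cl", "Br", "I"}),
    }


for arg in ("F", "H", "*", ["F", "*"], ["H", "*"]):
    print(arg, planned_passes(arg))
```

A helper like this can act as a quick sanity check when a result set is missing definitions or telomer groups that were expected.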
Summary
H-Components and Wildcard Groups extend PFASGroups beyond PFAS-specific analysis:
H-Components: Model hydrocarbon chains as pseudo-halogenated for homologue generation and component validation
Wildcards: Detect generic functional groups (alcohols, ethers, esters, etc.) complementary to PFAS patterns
Both features maintain the core architecture of PFASGroups while enabling broader applications in organic chemistry and structural validation.
See Also
Quickstart — Quick reference for common workflows
Algorithm — Detailed explanation of SMARTS matching and component detection
Customization — How to define custom PFAS groups and component patterns