Advanced Features: H-Components and Wildcard Groups

This guide covers two advanced features in PFASGroups:

H-Components: Pseudo-halogenated hydrocarbon chains for expanding homologous series to non-fluorinated organic compounds
Wildcard Groups: Generic functional group detection beyond OECD PFAS groups

H-Components (Hydrocarbon Analysis)

Overview

H-Components allow PFASGroups to treat ordinary hydrocarbon chains (CH₂ units) as if they were halogenated components. This feature enables:

Homologue series exploration for non-fluorinated organic compounds
Validation of component detection logic on simpler test cases
Extended chemical space mapping to broader alkyl and aliphatic systems
Research into structural variation of neutral hydrocarbons

While PFASGroups is primarily designed for halogenated substances, the H-component framework demonstrates the generality of the underlying component-based architecture.

Usage in Homologue Generation

The most common use of H-components is in the generate_homologues() function, where you can specify halogen='H' to generate shorter alkyl chain variants.

Basic Example

from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Simple hydrocarbon with a carboxylic acid head group
smiles = 'OC(=O)CCCCCC'  # heptanoic acid: carboxyl group plus a 6-carbon alkyl chain
mol = Chem.MolFromSmiles(smiles)

# Generate homologues by removing CH2 units
homologues = generate_homologues(mol, halogen='H')

print(f"Parent: {Chem.MolToSmiles(mol)}")
print(f"Halogen mode: {homologues.halogen}")
print(f"Number of homologues: {len(homologues)}")

# Inspect each homologue
for inchikey, inner_dict in homologues.items():
    for formula, h_mol in inner_dict.items():
        print(f"  {formula}: {Chem.MolToSmiles(h_mol)}")

Expected Output:

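For the 6-carbon alkyl chain above, you would typically generate 4–5 shorter homologues. The SMILES are shown as written for readability; RDKit's canonical output may order atoms differently (for example, CCCCCCC(=O)O for the parent):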

Parent: OC(=O)CCCCCC
Halogen mode: H
Number of homologues: 5
  C6H12O2: OC(=O)CCCCC
  C5H10O2: OC(=O)CCCC
  C4H8O2: OC(=O)CCC
  C3H6O2: OC(=O)CC
  C2H4O2: OC(=O)C

How It Works

When halogen='H':

PFASGroups detects “Alkyl” components by matching CH₂-rich backbones
Instead of looking for C–F bonds, it identifies C–H bonds in repeating units
Homologues are generated by systematically removing CH₂ units
The n_removed field in results tracks how many CH₂ units were removed
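
The CH₂-removal idea can be illustrated without PFASGroups or RDKit. The following is a minimal string-based sketch for the special case of a linear alkyl carboxylic acid. The helper name linear_acid_homologues is hypothetical and not part of the library, whose actual implementation works on detected components rather than on SMILES strings.

```python
def linear_acid_homologues(n_chain):
    """Toy model of CH2 removal for a linear saturated carboxylic acid.

    n_chain is the number of alkyl carbons attached to the carboxyl carbon.
    Returns (n_removed, formula, smiles) tuples for every shorter chain.
    Illustration only; this is not the PFASGroups algorithm.
    """
    homologues = []
    for n_removed in range(1, n_chain):
        remaining = n_chain - n_removed
        total_c = remaining + 1  # alkyl carbons plus the carboxyl carbon
        formula = f"C{total_c}H{2 * total_c}O2"  # CnH2nO2 for saturated acids
        smiles = "OC(=O)" + "C" * remaining
        homologues.append((n_removed, formula, smiles))
    return homologues


for n_removed, formula, smiles in linear_acid_homologues(6):
    print(f"removed {n_removed} CH2: {formula}  {smiles}")
```

For a 6-carbon alkyl chain this yields five shorter acids, from C6H12O2 down to C2H4O2.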

Comparing Halogen Modes

from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Same parent SMILES, analyzed with different halogens
smiles = 'OC(=O)' + 'C(F)(F)' * 4 + 'F'  # Perfluoroalkyl acid
mol = Chem.MolFromSmiles(smiles)

# Fluorine (default) — removes CF2 units
hom_f = generate_homologues(mol, halogen='F')
print(f"F-mode homologues: {len(hom_f)}")

# Hydrogen mode on the same molecule (not typical, but demonstrates flexibility).
# Note: this molecule contains no CH2 units, so H mode is not expected to find
# any alkyl backbone to shorten.
# hom_h = generate_homologues(mol, halogen='H')

# Chlorine: removes CCl2 units (if present)
# hom_cl = generate_homologues(mol, halogen='Cl')

Component Detection with H

To inspect H-components through the standard parser, define a custom H-constrained group and run parse_smiles in H mode:

from PFASGroups import parse_smiles, HalogenGroup

smiles = 'CCCCO'

h_group = HalogenGroup(
  id=9990,
  name='Hydrocarbon alcohol via H-alkyl component',
  smarts={'[#6$([#6!$([#6]=O)][OH1,Oh1,O-])]': 1},
  componentSmarts='Alkyl',
  componentSaturation='per',
  componentHalogens='H',
  componentForm='alkyl',
  constraints={},
  max_dist_from_comp=1,
)

results = parse_smiles(smiles, halogens='H', pfas_groups=[h_group], bycomponent=True)
h_matches = [m for m in results[0].matches if m.get('id') == 9990 and m.get('type') == 'HalogenGroup']

print(f"Found {len(h_matches)} H-component match(es)")
if h_matches:
  print(f"Component count: {h_matches[0]['num_components']}")

Limitations

H-components are not true halogenated components — they use CH₂ patterns as stand-ins
No graph metrics are computed for H-components (only for fluorinated components with sufficient size)
Limited validation: fewer test cases exist for hydrocarbon analysis
Use case specificity: H-mode is mainly for research and validation, not production PFAS analysis

Wildcard Groups

Overview

Wildcard groups provide generic functional group detection beyond the 27 OECD PFAS groups. They enable:

Broader organic chemistry coverage (esters, ethers, alcohols, aldehydes, etc.)
Complementary analysis to PFAS-specific patterns
Non-halogenated compound screening (identifying functional groups in any molecule)
Cross-framework validation (comparing wildcard matches across different halogens)

Wildcard groups are assigned group IDs in the range 29–76 (plus some higher-numbered special groups), and their matches are tagged with a 'W' prefix in match IDs (e.g., W-001, W-042).

Enabling Wildcard Detection

Basic Toggle

Wildcard groups are disabled by default. Enable them by including '*' in halogens:

from PFASGroups import parse_smiles

smiles = "CCO"  # ethanol

# Without wildcards
results_no_wc = parse_smiles(smiles, halogens='F')
print(f"Matches (no wildcards): {len(results_no_wc[0].matches)}")

# With wildcards
results_with_wc = parse_smiles(smiles, halogens='*')
print(f"Matches (with wildcards): {len(results_with_wc[0].matches)}")

# Wildcard matches are present in the second set
for match in results_with_wc[0].matches:
    print(f"  {match.group_name} (ID {match.group_id})")

Example Output:

Matches (no wildcards): 0
Matches (with wildcards): 1
  Alcohol (ID 30)

Wildcard vs. Halogen Groups

When both are enabled, you can distinguish them by match_id prefix:

from PFASGroups import parse_smiles

# Molecule with both PFAS and functional group interest
smiles = "FC(F)(F)C(F)(F)C(=O)O"  # perfluoropropanoic acid (PFPrA)

results = parse_smiles(smiles, halogens=['F', '*'])
mol = results[0]

# Separate by match type
halogen_matches = [m for m in mol.matches if not m.match_id.startswith('W')]
wildcard_matches = [m for m in mol.matches if m.match_id.startswith('W')]

print(f"Halogen group matches ({len(halogen_matches)}):")
for m in halogen_matches:
    # PFAS definition matches expose different fields than group matches
    if m['type'] == 'PFASdefinition':
        print(f"       {m['id']} - {m['definition_name']} - {m['type']}")
    else:
        print(f"       {m.group_id} - {m.group_name} - {m['type']}")

print(f"\nWildcard group matches ({len(wildcard_matches)}):")
for m in wildcard_matches:
    print(f"       {m.group_id} - {m.group_name} - {m['type']}")

Common Wildcard Groups

The following are frequently matched wildcard groups (IDs 29–76):

ID	Group Name	Pattern	Example
30	Alcohol	[OH] on C	`CCO`, `CC(C)O`
31	Ether	C-O-C linkage	`CCOC`, `CCOc1ccccc1`
32	Aldehyde	[CH1]=O	`CC=O`
33	Alkene	C=C double bond	`C=C`, `CC=CC`
36	Alkyne	C≡C triple bond	`C#C`
42	Carboxylic acid	C(=O)OH	`CC(=O)O`, `C(=O)O`
46	Ester	C(=O)O-C	`CC(=O)OC`
47	Ether	C-O-C (again)	`CCOC`

For a complete list, see PFASGroups/data/Halogen_groups_smarts.json (Groups 29+).

Filtering by Functional Group

You can filter matches to only those in specific wildcard groups:

from PFASGroups import parse_smiles

test_molecules = [
    ("CCO", "alcohol"),
    ("CC(=O)O", "carboxylic acid"),
    ("CC(=O)OC", "ester"),
    ("CCOC", "ether"),
    ("C=C", "alkene"),
    ("C#C", "alkyne"),
]

results = parse_smiles([smi for smi, _ in test_molecules], halogens='*')

for (smi, desc), mol in zip(test_molecules, results):
    print(f"{desc:20} ({smi:15}): ", end="")
    if mol.matches:
        names = [m.group_name for m in mol.matches]
        print(", ".join(names))
    else:
        print("(no match)")

Expected Output:

alcohol              (CCO            ): Alcohol
carboxylic acid      (CC(=O)O        ): Carboxylic acid
ester                (CC(=O)OC       ): Ester
ether                (CCOC           ): Ether
alkene               (C=C            ): Alkene
alkyne               (C#C            ): Alkyne
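
To keep only selected groups, a plain filter over the match records is usually enough. The sketch below uses hypothetical stand-in records (simple dictionaries with group_id and group_name keys) rather than real parse_smiles output, so the field access may need adapting to the actual match objects. The IDs are taken from the Common Wildcard Groups table.

```python
# Hypothetical stand-ins for match records; real matches come from parse_smiles().
matches = [
    {"group_id": 30, "group_name": "Alcohol"},
    {"group_id": 46, "group_name": "Ester"},
    {"group_id": 33, "group_name": "Alkene"},
]

# Wildcard group IDs of interest (values from the Common Wildcard Groups table).
wanted_ids = {30, 42, 46}


def filter_by_group(records, group_ids):
    """Return only the records whose group_id is in group_ids, preserving order."""
    return [r for r in records if r["group_id"] in group_ids]


selected = filter_by_group(matches, wanted_ids)
print([r["group_name"] for r in selected])
```

Using a set for the wanted IDs keeps the membership test cheap when many molecules are screened.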

Multi-Halogen Wildcard Analysis

Wildcards work across all halogen modes. Compare wildcard matches across halogens:

from PFASGroups import parse_smiles

# Molecule with a CF2 unit and several functional groups
smiles = "O=C(O)C(F)(F)CCN(CC(C)O)CC(=O)OC"

# Analyze with all halogen modes and wildcards enabled in one call
halogens = ['F', 'Cl', 'Br', 'I', 'H', '*']
results = parse_smiles(smiles, halogens=halogens)
for result in results:
    print(f"Results for {result.smiles}")
    for match in result.matches:
        print(f"   - {match.group_name} (ID {match.group_id}) under '{match['halogen']}' mode")

Understanding the halogen field: The match['halogen'] field shows which halogens are actually present in that match’s components (e.g., 'F', 'Cl', ['F', 'H'], '*' for wildcards). For HalogenGroup matches, this reflects the real halogens bonded to the matched carbon components. For WildcardGroup matches, it is always '*'. For PFASdefinition matches, it is always 'F'.

Note: To run both H-component and wildcard matching in one call, use halogens=['H', '*'].

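Note: Wildcard patterns don't depend on the halogen parameter, so the wildcard matches for a molecule are the same whichever halogens are listed alongside '*'. Halogen-group matches do depend on it: this molecule contains a CF₂ unit, so fluorine-based matches would be expected only when 'F' is included.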
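
Because the halogen field may be a single string or a list, it can help to normalize it before grouping matches. The following sketch assumes records that expose a halogen entry as described above and uses hypothetical dictionaries in place of real results. normalize_halogens and group_by_halogen are illustrative helpers, not library functions.

```python
from collections import defaultdict


def normalize_halogens(value):
    """Return the halogen field as a sorted tuple, whether given a str or a list."""
    if isinstance(value, str):
        return (value,)
    return tuple(sorted(value))


def group_by_halogen(records):
    """Map each normalized halogen tuple to the names of the matching groups."""
    grouped = defaultdict(list)
    for record in records:
        grouped[normalize_halogens(record["halogen"])].append(record["name"])
    return dict(grouped)


# Hypothetical records mimicking the three match types described above.
records = [
    {"name": "Carboxylic acid", "halogen": "*"},               # wildcard match
    {"name": "example halogen group", "halogen": ["H", "F"]},  # mixed component
    {"name": "example PFAS definition", "halogen": "F"},       # definition match
]

for halogens, names in group_by_halogen(records).items():
    print(halogens, names)
```

Sorting the list form means ['H', 'F'] and ['F', 'H'] end up under the same key.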

Advanced: Combining H-Components and Wildcards

You can analyze molecules using both features simultaneously:

from PFASGroups import parse_smiles
from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Complex molecule: perfluorobutyl chain joined to a carboxylic acid by a non-fluorinated alkyl spacer
smiles = "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)CCCCCC(=O)O"
mol = Chem.MolFromSmiles(smiles)

# 1. Detect all functional groups and PFAS patterns
results = parse_smiles(smiles, halogens=['F', '*'])
print("Detected groups:")
for match in results[0].matches:
    prefix = "PFAS" if not match.match_id.startswith('W') else "Generic"
    print(f"  [{prefix}] {match.group_name}")

# 2. Generate PFAS (F-based) homologues
pfas_homologues = generate_homologues(mol, halogen='F')
print(f"\nFluorinated homologues: {len(pfas_homologues)}")

# 3. Analyze the non-halogenated portion as H-component
# (Extract a fragment to demonstrate)
non_halo_smiles = "CCCCCC(=O)O"  # Hydrocarbon portion only
non_halo_mol = Chem.MolFromSmiles(non_halo_smiles)
h_homologues = generate_homologues(non_halo_mol, halogen='H')
print(f"Hydrocarbon homologues: {len(h_homologues)}")

Custom Wildcard Definitions

If you need to detect additional functional groups, you can extend the wildcard definitions by modifying PFASGroups/data/Halogen_groups_smarts.json.

Alternatively, define a custom group in code and pass it to parse_smiles:

from PFASGroups import HalogenGroup, parse_smiles

# Define a custom wildcard group (e.g., for a specific ketone pattern)
custom_group = HalogenGroup(
    id=9999,  # Use a high ID to avoid conflicts
    name="Methyl ketone",
    smarts={"[CH3]C(=O)[#6]": 1},  # Methyl on a carbonyl carbon that also bears a second carbon
    alias="Methyl ketone",
    # Wildcard groups typically have no halogen-specific constraints:
    componentSmarts=None,
    componentSaturation=None,
    linker_smarts=None,
    constraints={},
)

# Use in parsing
results = parse_smiles(["CC(=O)C", "CC(=O)CC"], pfas_groups=[custom_group], halogens='*')
for mol in results:
    if mol.matches:
        for m in mol.matches:
            if m.group_id == 9999:
                print(f"Matched custom group: {m.group_name}")

Best Practices

For H-Components

Use for research/validation only — not for production PFAS screening
Expect fewer homologues than fluorinated analogues (fewer repeated units in hydrocarbons)
Check component detection manually if results are unexpected
Combine with standard PFAS analysis for comprehensive coverage

For Wildcard Groups

Enable selectively — only when needed to reduce parse overhead
Filter by prefix — distinguish wildcards (W-*) from halogen groups
Validate against chemistry — wildcard patterns are generic and may match unintended substructures
Telomer behavior — telomer groups are not assessed in H-only or wildcard-only flows
PFAS definitions gate — definitions are assessed only when 'F' is included in halogens
Document custom groups — if you extend the definitions, add comments to PFASGroups/data/Halogen_groups_smarts.json
Test on reference sets — verify that your chosen patterns work for your use case
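
A reference-set check can be as simple as comparing detected group names against hand-curated expectations. This sketch takes the detection step as a callable so that it stays independent of any particular API. fake_detect is a hypothetical stand-in and should be replaced by a wrapper around your own parse_smiles call.

```python
def check_reference_set(detect, reference):
    """Compare detected group names with expected ones.

    detect: callable taking a SMILES string and returning a set of group names.
    reference: dict mapping SMILES to the set of expected group names.
    Returns a list of (smiles, missing, unexpected) for every mismatch.
    """
    failures = []
    for smiles, expected in reference.items():
        found = set(detect(smiles))
        if found != expected:
            failures.append((smiles, expected - found, found - expected))
    return failures


# Hypothetical stand-in detector; swap in a wrapper around the real parser.
def fake_detect(smiles):
    lookup = {"CCO": {"Alcohol"}, "CC(=O)OC": {"Ester", "Ether"}}
    return lookup.get(smiles, set())


reference = {"CCO": {"Alcohol"}, "CC(=O)OC": {"Ester"}}
for smiles, missing, unexpected in check_reference_set(fake_detect, reference):
    print(f"{smiles}: missing={sorted(missing)} unexpected={sorted(unexpected)}")
```

Reporting unexpected groups separately from missing ones makes it easier to spot generic patterns that match unintended substructures.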

Summary

H-Components and Wildcard Groups extend PFASGroups beyond PFAS-specific analysis:

H-Components: Model hydrocarbon chains as pseudo-halogenated for homologue generation and component validation
Wildcards: Detect generic functional groups (alcohols, ethers, esters, etc.) complementary to PFAS patterns

Both features maintain the core architecture of PFASGroups while enabling broader applications in organic chemistry and structural validation.
