Advanced Features: H-Components and Wildcard Groups

This guide covers two advanced features in PFASGroups:

  1. H-Components: Pseudo-halogenated hydrocarbon chains for expanding homologous series to non-fluorinated organic compounds

  2. Wildcard Groups: Generic functional group detection beyond OECD PFAS groups

H-Components (Hydrocarbon Analysis)

Overview

H-Components allow PFASGroups to treat ordinary hydrocarbon chains (CH₂ units) as if they were halogenated components. This feature enables:

  • Homologue series exploration for non-fluorinated organic compounds

  • Validation of component detection logic on simpler test cases

  • Extended chemical space mapping to broader alkyl and aliphatic systems

  • Research into structural variation of neutral hydrocarbons

While PFASGroups is primarily designed for halogenated substances, the H-component framework demonstrates the generality of the underlying component-based architecture.

Usage in Homologue Generation

The most common use of H-components is in the generate_homologues() function, where you can specify halogen='H' to generate shorter alkyl chain variants.

Basic Example

from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Simple hydrocarbon with a carboxylic acid head group
smiles = 'OC(=O)CCCCCC'  # heptanoic acid: 6-carbon alkyl chain plus the carboxyl carbon
mol = Chem.MolFromSmiles(smiles)

# Generate homologues by removing CH2 units
homologues = generate_homologues(mol, halogen='H')

print(f"Parent: {Chem.MolToSmiles(mol)}")
print(f"Halogen mode: {homologues.halogen}")
print(f"Number of homologues: {len(homologues)}")

# Inspect each homologue
for inchikey, inner_dict in homologues.items():
    for formula, h_mol in inner_dict.items():
        print(f"  {formula}: {Chem.MolToSmiles(h_mol)}")

Expected Output:

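For a 6-carbon alkyl chain, you would typically generate 4–5 shorter homologues. The listing below is illustrative: RDKit writes canonical SMILES, so the atom order in the printed strings may differ.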

Parent: OC(=O)CCCCCC
Halogen mode: H
Number of homologues: 5
  C6H12O2: OC(=O)CCCCC
  C5H10O2: OC(=O)CCCC
  C4H8O2: OC(=O)CCC
  C3H6O2: OC(=O)CC
  C2H4O2: OC(=O)C
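
The loop in the basic example implies a nested mapping of the form {inchikey: {formula: mol}}. Assuming that shape holds, the sketch below flattens such a mapping into rows that are easier to sort or export. The helper flatten_homologues is hypothetical and not part of PFASGroups, and the keys and SMILES strings are placeholders standing in for real InChIKeys and RDKit molecules.

```python
def flatten_homologues(nested):
    """Flatten {inchikey: {formula: mol}} into (inchikey, formula, mol) rows."""
    rows = []
    for inchikey, by_formula in nested.items():
        for formula, mol in by_formula.items():
            rows.append((inchikey, formula, mol))
    return rows


# Placeholder data: "KEY-A"/"KEY-B" are not real InChIKeys, and the values
# are SMILES strings rather than RDKit Mol objects.
example = {
    "KEY-A": {"C6H12O2": "OC(=O)CCCCC"},
    "KEY-B": {"C5H10O2": "OC(=O)CCCC"},
}

rows = flatten_homologues(example)
for inchikey, formula, mol in rows:
    print(f"{inchikey}  {formula}  {mol}")
```

The same function should work unchanged on a real result if generate_homologues() returns a dict-like object with this nesting.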

How It Works

When halogen='H':

  1. PFASGroups detects “Alkyl” components by matching CH₂-rich backbones

  2. Instead of looking for C–F bonds, it identifies C–H bonds in repeating units

  3. Homologues are generated by systematically removing CH₂ units

  4. The n_removed field in results tracks how many CH₂ units were removed
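
The formula bookkeeping behind step 3 can be sketched without the library: each removed CH₂ unit lowers the carbon count by one and the hydrogen count by two. The helpers parse_formula and remove_ch2 below are hypothetical illustrations of that arithmetic, not PFASGroups functions, and they only handle simple formulas without brackets or charges.

```python
import re


def parse_formula(formula):
    """Parse a simple formula such as 'C7H14O2' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def remove_ch2(formula, n_removed):
    """Return the formula left after removing n_removed CH2 units."""
    counts = parse_formula(formula)
    counts["C"] = counts.get("C", 0) - n_removed
    counts["H"] = counts.get("H", 0) - 2 * n_removed
    if counts["C"] < 1 or counts["H"] < 0:
        raise ValueError("cannot remove that many CH2 units")
    # Keep the element order of the input; omit a count of 1.
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in counts.items() if n > 0)


parent = "C7H14O2"  # heptanoic acid, the parent in the basic example
series = [remove_ch2(parent, k) for k in range(1, 6)]
print(series)
# ['C6H12O2', 'C5H10O2', 'C4H8O2', 'C3H6O2', 'C2H4O2']
```

The resulting series matches the formulas listed in the expected output of the basic example.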

Comparing Halogen Modes

from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Same parent SMILES, analyzed with different halogens
smiles = 'OC(=O)' + 'C(F)(F)' * 4 + 'F'  # Perfluoroalkyl acid
mol = Chem.MolFromSmiles(smiles)

# Fluorine (default) — removes CF2 units
hom_f = generate_homologues(mol, halogen='F')
print(f"F-mode homologues: {len(hom_f)}")

# What if we treat it as hydrogen? (not typical, but demonstrates flexibility)
# Note: H-mode looks for CH2 backbone patterns; this perfluorinated acid has none,
# so H-mode is only meaningful for molecules that also contain a CH2 segment
# hom_h = generate_homologues(mol, halogen='H')

# Chlorine — removes CCl2 units (if present)
# hom_cl = generate_homologues(mol, halogen='Cl')
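
To compare modes on like-for-like inputs, it can help to build the parent SMILES from the repeat unit that each mode removes (CF2, CCl2 or CH2). The sketch below does this with plain string operations, in the same way the example above builds the perfluoroalkyl acid. The names REPEAT_UNIT, CHAIN_CAP and acid_smiles are hypothetical and not part of PFASGroups.

```python
# Repeat unit removed in each halogen mode, written as a SMILES fragment.
REPEAT_UNIT = {"F": "C(F)(F)", "Cl": "C(Cl)(Cl)", "H": "C"}
# Atom that closes the chain; for 'H' the terminal hydrogens are implicit.
CHAIN_CAP = {"F": "F", "Cl": "Cl", "H": ""}


def acid_smiles(halogen, n_units):
    """Build a linear carboxylic acid SMILES with n_units repeat units."""
    if halogen not in REPEAT_UNIT:
        raise ValueError(f"unsupported halogen: {halogen!r}")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return "OC(=O)" + REPEAT_UNIT[halogen] * n_units + CHAIN_CAP[halogen]


for halogen in ("F", "Cl", "H"):
    print(halogen, acid_smiles(halogen, 4))
```

Each generated string can then be parsed with Chem.MolFromSmiles and passed to generate_homologues() with the matching halogen argument.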

Component Detection with H

To inspect H-components through the standard parser, define a custom H-constrained group and run parse_smiles in H mode:

from PFASGroups import parse_smiles, HalogenGroup

smiles = 'CCCCO'

h_group = HalogenGroup(
    id=9990,
    name='Hydrocarbon alcohol via H-alkyl component',
    smarts={'[#6$([#6!$([#6]=O)][OH1,Oh1,O-])]': 1},
    componentSmarts='Alkyl',
    componentSaturation='per',
    componentHalogens='H',
    componentForm='alkyl',
    constraints={},
    max_dist_from_comp=1,
)

results = parse_smiles(smiles, halogens='H', pfas_groups=[h_group], bycomponent=True)
h_matches = [m for m in results[0].matches if m.get('id') == 9990 and m.get('type') == 'HalogenGroup']

print(f"Found {len(h_matches)} H-component match(es)")
if h_matches:
    print(f"Component count: {h_matches[0]['num_components']}")

Limitations

  • H-components are not true halogenated components — they use CH₂ patterns as stand-ins

  • No graph metrics are computed for H-components (only for fluorinated components with sufficient size)

  • Limited validation: fewer test cases exist for hydrocarbon analysis

  • Use case specificity: H-mode is mainly for research and validation, not production PFAS analysis

Wildcard Groups

Overview

Wildcard groups provide generic functional group detection beyond the 27 OECD PFAS groups. They enable:

  • Broader organic chemistry coverage (esters, ethers, alcohols, aldehydes, etc.)

  • Complementary analysis to PFAS-specific patterns

  • Non-halogenated compound screening (identifying functional groups in any molecule)

  • Cross-framework validation (comparing wildcard matches across different halogens)

Wildcard groups are assigned group IDs in the range 29–76 (plus some higher special groups), and their matches are tagged with a 'W' prefix in match IDs (e.g., W-001, W-042).

Enabling Wildcard Detection

Basic Toggle

Wildcard groups are disabled by default. Enable them by including '*' in halogens:

from PFASGroups import parse_smiles

smiles = "CCO"  # ethanol

# Without wildcards
results_no_wc = parse_smiles(smiles, halogens='F')
print(f"Matches (no wildcards): {len(results_no_wc[0].matches)}")

# With wildcards
results_with_wc = parse_smiles(smiles, halogens='*')
print(f"Matches (with wildcards): {len(results_with_wc[0].matches)}")

# Wildcard matches are present in the second set
for match in results_with_wc[0].matches:
    print(f"  {match.group_name} (ID {match.group_id})")

Example Output:

Matches (no wildcards): 0
Matches (with wildcards): 1
  Alcohol (ID 30)

Wildcard vs. Halogen Groups

When both are enabled, you can distinguish them by match_id prefix:

from PFASGroups import parse_smiles

# Molecule with both PFAS and functional group interest
smiles = "FC(F)(F)C(F)(F)C(=O)O"  # Perfluoropropanoic acid (PFPrA): C2F5 chain with a carboxylic acid

results = parse_smiles(smiles, halogens=['F', '*'])
mol = results[0]

# Separate by match type
halogen_matches = [m for m in mol.matches if not m.match_id.startswith('W')]
wildcard_matches = [m for m in mol.matches if m.match_id.startswith('W')]

print(f"Halogen group matches ({len(halogen_matches)}):")
for m in halogen_matches:
    # PFASdefinition matches are read via 'id' and 'definition_name' rather than group_id/group_name
    if m['type'] == 'PFASdefinition':
        print(f"       {m['id']} - {m['definition_name']} - {m['type']}")
    else:
        print(f"       {m.group_id} - {m.group_name} - {m['type']}")

print(f"\nWildcard group matches ({len(wildcard_matches)}):")
for m in wildcard_matches:
    print(f"       {m.group_id} - {m.group_name} - {m['type']}")

Common Wildcard Groups

The following are frequently matched wildcard groups (IDs 29–76):

ID   Group Name        Pattern            Example
30   Alcohol           [OH] on C          CCO, CC(C)O
31   Ether             C-O-C linkage      CCOC, CCOc1ccccc1
32   Aldehyde          [CH1]=O            CC=O
33   Alkene            C=C double bond    C=C, CC=CC
36   Alkyne            C≡C triple bond    C#C
42   Carboxylic acid   C(=O)OH            CC(=O)O, C(=O)O
46   Ester             C(=O)O-C           CC(=O)OC
47   Ether             C-O-C (again)      CCOC

For a complete list, see PFASGroups/data/Halogen_groups_smarts.json (Groups 29+).

Filtering by Functional Group

You can filter matches to only those in specific wildcard groups:

from PFASGroups import parse_smiles

test_molecules = [
    ("CCO", "alcohol"),
    ("CC(=O)O", "carboxylic acid"),
    ("CC(=O)OC", "ester"),
    ("CCOC", "ether"),
    ("C=C", "alkene"),
    ("C#C", "alkyne"),
]

results = parse_smiles([smi for smi, _ in test_molecules], halogens='*')

for (smi, desc), mol in zip(test_molecules, results):
    print(f"{desc:20} ({smi:15}): ", end="")
    if mol.matches:
        names = [m.group_name for m in mol.matches]
        print(", ".join(names))
    else:
        print("(no match)")

Expected Output:

alcohol              (CCO            ): Alcohol
carboxylic acid      (CC(=O)O        ): Carboxylic acid
ester                (CC(=O)OC       ): Ester
ether                (CCOC           ): Ether
alkene               (C=C            ): Alkene
alkyne               (C#C            ): Alkyne
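
To keep only the matches from particular wildcard groups, one option is to filter on group_id after parsing. The sketch below shows the idea on stand-in records. StubMatch and filter_by_group are hypothetical names used for illustration; they assume only that real matches expose the group_id and group_name attributes used in the examples above, and the IDs are taken from the table of common wildcard groups.

```python
from dataclasses import dataclass


@dataclass
class StubMatch:
    """Stand-in for a match object; not a PFASGroups class."""
    group_id: int
    group_name: str


def filter_by_group(matches, wanted_ids):
    """Keep only the matches whose group_id is in wanted_ids."""
    wanted = set(wanted_ids)
    return [m for m in matches if m.group_id in wanted]


matches = [
    StubMatch(30, "Alcohol"),
    StubMatch(42, "Carboxylic acid"),
    StubMatch(46, "Ester"),
]

selected = filter_by_group(matches, {42, 46})
print([m.group_name for m in selected])
# ['Carboxylic acid', 'Ester']
```

With real results, the same call would be applied to mol.matches for each parsed molecule.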

Multi-Halogen Wildcard Analysis

Wildcards work across all halogen modes. Compare wildcard matches across halogens:

from PFASGroups import parse_smiles

# Molecule with a CF2 unit and several functional groups (acid, amine, alcohol, ester)
smiles = "O=C(O)C(F)(F)CCN(CC(C)O)CC(=O)OC"

# Analyze with all halogen modes in one call (wildcards enabled via '*')
halogens = ['F', 'Cl', 'Br', 'I', 'H', "*"]
results = parse_smiles(smiles, halogens=halogens)
for result in results:
    print(f"Results for {result.smiles}")
    for match in result.matches:
        print(f"   - {match.group_name} (ID {match.group_id}) under '{match['halogen']}' mode")

Understanding the halogen field: The match['halogen'] field shows which halogens are actually present in that match’s components (e.g., 'F', 'Cl', ['F', 'H'], '*' for wildcards). For HalogenGroup matches, this reflects the real halogens bonded to the matched carbon components. For WildcardGroup matches, it is always '*'. For PFASdefinition matches, it is always 'F'.
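
Because the halogen field may hold either a single string or a list, downstream code may want to normalize it before testing membership. The sketch below is one way to do that. The helper as_halogen_set and the stand-in records are hypothetical, and the only assumption about real matches is the match['halogen'] access described above.

```python
def as_halogen_set(value):
    """Normalize a halogen field ('F', ['F', 'H'], '*') to a frozenset."""
    if isinstance(value, str):
        return frozenset([value])
    return frozenset(value)


# Stand-in records; real matches come from parse_smiles.
stub_matches = [
    {"name": "stub halogen group", "halogen": ["F", "H"]},
    {"name": "stub wildcard group", "halogen": "*"},
    {"name": "stub definition", "halogen": "F"},
]

with_fluorine = [
    m["name"] for m in stub_matches if "F" in as_halogen_set(m["halogen"])
]
print(with_fluorine)
# ['stub halogen group', 'stub definition']
```

Normalizing first avoids a subtle pitfall: on a raw string, Python's in operator is a substring test, so "C" in "Cl" evaluates to True.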

Note: To run both H-component and wildcard matching in one call, use halogens=['H', '*'].

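Note: Wildcard patterns do not depend on the halogen parameter, so the wildcard matches for this molecule are the same whichever halogens are listed. Only halogen-group matches can differ between modes; here the molecule's sole halogen is fluorine (one CF₂ unit).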

Advanced: Combining H-Components and Wildcards

You can analyze molecules using both features simultaneously:

from PFASGroups import parse_smiles
from PFASGroups.generate_homologues import generate_homologues
from rdkit import Chem

# Complex molecule: PFOA-like with a non-fluorinated tail
smiles = "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)CCCCCC(=O)O"
mol = Chem.MolFromSmiles(smiles)

# 1. Detect all functional groups and PFAS patterns
results = parse_smiles(smiles, halogens=['F', '*'])
print("Detected groups:")
for match in results[0].matches:
    prefix = "PFAS" if not match.match_id.startswith('W') else "Generic"
    print(f"  [{prefix}] {match.group_name}")

# 2. Generate PFAS (F-based) homologues
pfas_homologues = generate_homologues(mol, halogen='F')
print(f"\nFluorinated homologues: {len(pfas_homologues)}")

# 3. Analyze the non-halogenated portion as H-component
# (Extract a fragment to demonstrate)
non_halo_smiles = "CCCCCC(=O)O"  # Hydrocarbon portion only
non_halo_mol = Chem.MolFromSmiles(non_halo_smiles)
h_homologues = generate_homologues(non_halo_mol, halogen='H')
print(f"Hydrocarbon homologues: {len(h_homologues)}")

Custom Wildcard Definitions

If you need to detect additional functional groups, you can extend the wildcard definitions by modifying PFASGroups/data/Halogen_groups_smarts.json.

Alternatively, define a custom wildcard group in code and pass it to parse_smiles through pfas_groups:

from PFASGroups import HalogenGroup, parse_smiles

# Define a custom wildcard group (e.g., for a specific ketone pattern)
custom_group = HalogenGroup(
    id=9999,  # Use a high ID to avoid conflicts
    name="Methyl ketone",
    smarts={"[CH3]C(=O)[#6]": 1},  # Carbonyl carbon bonded to a CH3 and another carbon
    alias="Methyl ketone",
    # Wildcard groups typically have no halogen-specific constraints:
    componentSmarts=None,
    componentSaturation=None,
    linker_smarts=None,
    constraints={},
)

# Use in parsing
results = parse_smiles(["CC(=O)C", "CC(=O)CC"], pfas_groups=[custom_group], halogens='*')
for mol in results:
    if mol.matches:
        for m in mol.matches:
            if m.group_id == 9999:
                print(f"Matched custom group: {m.group_name}")

Best Practices

For H-Components

  1. Use for research/validation only — not for production PFAS screening

  2. Expect fewer homologues than fluorinated analogues (fewer repeated units in hydrocarbons)

  3. Check component detection manually if results are unexpected

  4. Combine with standard PFAS analysis for comprehensive coverage

For Wildcard Groups

  1. Enable selectively — only when needed to reduce parse overhead

  2. Filter by prefix — distinguish wildcards (W-*) from halogen groups

  3. Validate against chemistry — wildcard patterns are generic and may match unintended substructures

  4. Telomer behavior — telomer groups are not assessed in H-only or wildcard-only flows

  5. PFAS definitions gate — definitions are assessed only when 'F' is included in halogens

  6. Document custom groups — if you extend the definitions, add comments to PFASGroups/data/Halogen_groups_smarts.json

  7. Test on reference sets — verify that your chosen patterns work for your use case
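
The gating rules in items 4 and 5 can be checked before a call is made. The sketch below summarizes what a given halogens argument would enable, based only on the statements in this guide. The helper describe_flow is hypothetical and not part of PFASGroups, and it assumes that telomer groups are assessed in every flow other than H-only and wildcard-only, which the guide does not state explicitly.

```python
def describe_flow(halogens):
    """Summarize which analyses a `halogens` argument would enable."""
    # Accept both a single string ('F', '*') and a list (['F', '*']).
    modes = {halogens} if isinstance(halogens, str) else set(halogens)
    return {
        "wildcards": "*" in modes,
        "pfas_definitions": "F" in modes,
        # Assumption: telomers are skipped only when nothing but 'H'
        # and/or '*' is requested.
        "telomers": bool(modes - {"H", "*"}),
    }


for arg in ("F", "*", ["H", "*"], ["F", "*"]):
    print(arg, describe_flow(arg))
```

A check like this makes it easier to notice that, for example, halogens=['H', '*'] would skip both PFAS definitions and telomer groups.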

Summary

H-Components and Wildcard Groups extend PFASGroups beyond PFAS-specific analysis:

  • H-Components: Model hydrocarbon chains as pseudo-halogenated for homologue generation and component validation

  • Wildcards: Detect generic functional groups (alcohols, ethers, esters, etc.) complementary to PFAS patterns

Both features maintain the core architecture of PFASGroups while enabling broader applications in organic chemistry and structural validation.

See Also

  • Quickstart — Quick reference for common workflows

  • Algorithm — Detailed explanation of SMARTS matching and component detection

  • Customization — How to define custom PFAS groups and component patterns