Advanced Features: H-Components and Wildcard Groups
=====================================================

This guide covers two advanced features in PFASGroups:

1. **H-Components**: Pseudo-halogenated hydrocarbon chains for expanding homologous series to non-fluorinated organic compounds
2. **Wildcard Groups**: Generic functional group detection beyond OECD PFAS groups

.. contents:: Contents
   :local:
   :depth: 2

H-Components (Hydrocarbon Analysis)
====================================

Overview
--------

H-Components allow PFASGroups to treat ordinary hydrocarbon chains (CH₂ units) as if they were halogenated components. This feature enables:

- **Homologue series exploration** for non-fluorinated organic compounds
- **Validation** of component detection logic on simpler test cases
- **Extended chemical space mapping** to broader alkyl and aliphatic systems
- **Research** into structural variation of neutral hydrocarbons

While PFASGroups is primarily designed for halogenated substances, the H-component framework demonstrates the generality of the underlying component-based architecture.

Usage in Homologue Generation
------------------------------

The most common use of H-components is in the ``generate_homologues()`` function, where you can specify ``halogen='H'`` to generate shorter alkyl chain variants.

Basic Example
~~~~~~~~~~~~~

.. code-block:: python

   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Simple hydrocarbon with a carboxylic acid head group
   smiles = 'OC(=O)CCCCCC'  # heptanoic acid: 6-carbon alkyl chain on the acid head
   mol = Chem.MolFromSmiles(smiles)

   # Generate homologues by removing CH2 units
   homologues = generate_homologues(mol, halogen='H')

   print(f"Parent: {Chem.MolToSmiles(mol)}")
   print(f"Halogen mode: {homologues.halogen}")
   print(f"Number of homologues: {len(homologues)}")

   # Inspect each homologue
   for inchikey, inner_dict in homologues.items():
       for formula, h_mol in inner_dict.items():
           print(f"  {formula}: {Chem.MolToSmiles(h_mol)}")

Expected Output:
~~~~~~~~~~~~~~~~~

For this 6-carbon alkyl chain, you would typically generate 4–5 shorter homologues. SMILES are shown in the input orientation for readability; RDKit's canonical output may order the atoms differently (for example ``CCCCCCC(=O)O`` for the parent).

.. code-block:: text

   Parent: OC(=O)CCCCCC
   Halogen mode: H
   Number of homologues: 5
     C6H12O2: OC(=O)CCCCC
     C5H10O2: OC(=O)CCCC
     C4H8O2: OC(=O)CCC
     C3H6O2: OC(=O)CC
     C2H4O2: OC(=O)C

How It Works
~~~~~~~~~~~~

When ``halogen='H'``:

1. PFASGroups detects "Alkyl" components by matching CH₂-rich backbones
2. Instead of looking for C–F bonds, it identifies C–H bonds in repeating units
3. Homologues are generated by systematically removing CH₂ units
4. The ``n_removed`` field in results tracks how many CH₂ units were removed

Comparing Halogen Modes
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Same parent SMILES, analyzed with different halogens
   smiles = 'OC(=O)' + 'C(F)(F)' * 4 + 'F'  # Perfluoroalkyl acid
   mol = Chem.MolFromSmiles(smiles)

   # Fluorine (default): removes CF2 units
   hom_f = generate_homologues(mol, halogen='F')
   print(f"F-mode homologues: {len(hom_f)}")

   # What if we treat it as hydrogen? (not typical, but demonstrates flexibility)
   # Note: H mode looks for CH2 backbone patterns (for example in a linker
   # region); this fully fluorinated chain contains none.
   # hom_h = generate_homologues(mol, halogen='H')

   # Chlorine: removes CCl2 units (if present)
   # hom_cl = generate_homologues(mol, halogen='Cl')

Component Detection with H
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To inspect H-components through the standard parser, define a custom H-constrained group and run ``parse_smiles`` in H mode:

.. code-block:: python

   from PFASGroups import parse_smiles, HalogenGroup

   smiles = 'CCCCO'
   h_group = HalogenGroup(
       id=9990,
       name='Hydrocarbon alcohol via H-alkyl component',
       smarts={'[#6$([#6!$([#6]=O)][OH1,Oh1,O-])]': 1},
       componentSmarts='Alkyl',
       componentSaturation='per',
       componentHalogens='H',
       componentForm='alkyl',
       constraints={},
       max_dist_from_comp=1,
   )

   results = parse_smiles(smiles, halogens='H', pfas_groups=[h_group], bycomponent=True)

   # Keep only the matches produced by the custom group defined above
   h_matches = [
       m for m in results[0].matches
       if m.get('id') == 9990 and m.get('type') == 'HalogenGroup'
   ]
   print(f"Found {len(h_matches)} H-component match(es)")
   if h_matches:
       print(f"Component count: {h_matches[0]['num_components']}")

Limitations
~~~~~~~~~~~

- H-components are **not true halogenated components**: they use CH₂ patterns as stand-ins
- **No graph metrics** are computed for H-components (only for fluorinated components with sufficient size)
- **Limited validation**: fewer test cases exist for hydrocarbon analysis
- **Use case specificity**: H-mode is mainly for research and validation, not production PFAS analysis

Wildcard Groups
===============

Overview
--------

Wildcard groups provide **generic functional group detection** beyond the 27 OECD PFAS groups. They enable:

- **Broader organic chemistry coverage** (esters, ethers, alcohols, aldehydes, etc.)
- **Complementary analysis** to PFAS-specific patterns
- **Non-halogenated compound screening** (identifying functional groups in any molecule)
- **Cross-framework validation** (comparing wildcard matches across different halogens)

Wildcard groups are assigned group IDs in the range 29–76 (plus some higher special groups), and their matches are tagged with a ``W`` prefix in match IDs (e.g., ``W-001``, ``W-042``).

Enabling Wildcard Detection
----------------------------

Basic Toggle
~~~~~~~~~~~~

Wildcard groups are **disabled by default**. Enable them by including ``'*'`` in ``halogens``:

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = "CCO"  # ethanol

   # Without wildcards
   results_no_wc = parse_smiles(smiles, halogens='F')
   print(f"Matches (no wildcards): {len(results_no_wc[0].matches)}")

   # With wildcards
   results_with_wc = parse_smiles(smiles, halogens='*')
   print(f"Matches (with wildcards): {len(results_with_wc[0].matches)}")

   # Wildcard matches are present in the second set
   for match in results_with_wc[0].matches:
       print(f"  {match.group_name} (ID {match.group_id})")

Example Output:
~~~~~~~~~~~~~~~

.. code-block:: text

   Matches (no wildcards): 0
   Matches (with wildcards): 1
     Alcohol (ID 30)

Wildcard vs. Halogen Groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When both are enabled, you can distinguish them by the ``match_id`` prefix:

.. code-block:: python

   from PFASGroups import parse_smiles

   # Molecule of both PFAS and functional group interest
   smiles = "FC(F)(F)C(F)(F)C(=O)O"  # perfluoropropanoic acid (PFPrA)

   results = parse_smiles(smiles, halogens=['F', '*'])
   mol = results[0]

   # Separate by match type
   halogen_matches = [m for m in mol.matches if not m.match_id.startswith('W')]
   wildcard_matches = [m for m in mol.matches if m.match_id.startswith('W')]

   print(f"Halogen group matches ({len(halogen_matches)}):")
   for m in halogen_matches:
       # PFAS definition matches expose 'id' and 'definition_name' instead of
       # group_id and group_name
       is_definition = m['type'] == 'PFASdefinition'
       ident = m['id'] if is_definition else m.group_id
       name = m['definition_name'] if is_definition else m.group_name
       print(f"  {ident} - {name} - {m['type']}")

   print(f"\nWildcard group matches ({len(wildcard_matches)}):")
   for m in wildcard_matches:
       print(f"  {m.group_id} - {m.group_name} - {m['type']}")

Common Wildcard Groups
----------------------

The following are frequently matched wildcard groups (IDs 29–76):

.. list-table:: Common Wildcard Groups
   :header-rows: 1

   * - ID
     - Group Name
     - Pattern
     - Example
   * - 30
     - Alcohol
     - [OH] on C
     - ``CCO``, ``CC(C)O``
   * - 31
     - Ether
     - C-O-C linkage
     - ``CCOC``, ``CCOc1ccccc1``
   * - 32
     - Aldehyde
     - [CH1]=O
     - ``CC=O``
   * - 33
     - Alkene
     - C=C double bond
     - ``C=C``, ``CC=CC``
   * - 36
     - Alkyne
     - C≡C triple bond
     - ``C#C``
   * - 42
     - Carboxylic acid
     - C(=O)OH
     - ``CC(=O)O``, ``C(=O)O``
   * - 46
     - Ester
     - C(=O)O-C
     - ``CC(=O)OC``
   * - 47
     - Ether
     - C-O-C (second ether entry)
     - ``CCOC``

For a complete list, see ``PFASGroups/data/Halogen_groups_smarts.json`` (Groups 29+).

Filtering by Functional Group
------------------------------

You can list the wildcard groups each molecule matches and filter on the group names:

.. code-block:: python

   from PFASGroups import parse_smiles

   test_molecules = [
       ("CCO", "alcohol"),
       ("CC(=O)O", "carboxylic acid"),
       ("CC(=O)OC", "ester"),
       ("CCOC", "ether"),
       ("C=C", "alkene"),
       ("C#C", "alkyne"),
   ]

   results = parse_smiles([smi for smi, _ in test_molecules], halogens='*')

   for (smi, desc), mol in zip(test_molecules, results):
       print(f"{desc:20} ({smi:15}): ", end="")
       if mol.matches:
           names = [m.group_name for m in mol.matches]
           print(", ".join(names))
       else:
           print("(no match)")

Expected Output:
~~~~~~~~~~~~~~~~

.. code-block:: text

   alcohol              (CCO            ): Alcohol
   carboxylic acid      (CC(=O)O        ): Carboxylic acid
   ester                (CC(=O)OC       ): Ester
   ether                (CCOC           ): Ether
   alkene               (C=C            ): Alkene
   alkyne               (C#C            ): Alkyne

Multi-Halogen Wildcard Analysis
--------------------------------

Wildcards work across all halogen modes. You can request several halogens and wildcards in a single call and compare the resulting matches:

.. code-block:: python

   from PFASGroups import parse_smiles

   # Molecule with one CF2 unit and multiple functional groups
   smiles = "O=C(O)C(F)(F)CCN(CC(C)O)CC(=O)OC"

   # Analyze with all halogens plus wildcards in one call
   halogens = ['F', 'Cl', 'Br', 'I', 'H', "*"]
   results = parse_smiles(smiles, halogens=halogens)

   for result in results:
       print(f"Results for {result.smiles}")
       for match in result.matches:
           print(f"  - {match.group_name} (ID {match.group_id}) under '{match['halogen']}' mode")

**Understanding the halogen field**: The ``match['halogen']`` field shows which halogens are *actually present* in that match's components (e.g., ``'F'``, ``'Cl'``, ``['F', 'H']``, ``'*'`` for wildcards). For HalogenGroup matches, this reflects the real halogens bonded to the matched carbon components. For WildcardGroup matches, it is always ``'*'``. For PFASdefinition matches, it is always ``'F'``.

Note: To run both H-component and wildcard matching in one call, use ``halogens=['H', '*']``.

Note: Wildcard patterns do not depend on the halogen parameter, so the wildcard matches for a molecule are the same whichever halogens are listed alongside ``'*'``.

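
As a minimal sketch of how the ``halogen`` field described above might be used downstream, the following groups match records by type and halogen. The records are hypothetical plain dictionaries: the ``name`` key and all values are assumptions made for illustration, not real PFASGroups output, so the snippet runs without the library installed.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical match records; values are illustrative only.
   mock_matches = [
       {"type": "HalogenGroup", "name": "Example halogen group", "halogen": ["F", "H"]},
       {"type": "WildcardGroup", "name": "Alcohol", "halogen": "*"},
       {"type": "WildcardGroup", "name": "Ester", "halogen": "*"},
       {"type": "PFASdefinition", "name": "Example definition", "halogen": "F"},
   ]

   def halogen_key(value):
       """Normalise the halogen field (a string or a list) to a sorted tuple."""
       if isinstance(value, str):
           return (value,)
       return tuple(sorted(value))

   def bucket_by_halogen(matches):
       """Group match names by (match type, halogen tuple)."""
       buckets = defaultdict(list)
       for m in matches:
           buckets[(m["type"], halogen_key(m["halogen"]))].append(m["name"])
       return dict(buckets)

   buckets = bucket_by_halogen(mock_matches)
   for (mtype, hal), names in sorted(buckets.items()):
       print(f"{mtype:15} {'/'.join(hal):5} {', '.join(names)}")

Normalising the field to a tuple first means string values such as ``'F'`` and list values such as ``['F', 'H']`` can be used as dictionary keys in the same way.
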
Advanced: Combining H-Components and Wildcards
===============================================

You can analyze molecules using both features simultaneously:

.. code-block:: python

   from PFASGroups import parse_smiles
   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Complex molecule: perfluorobutyl chain with a non-fluorinated alkyl acid tail
   smiles = "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)CCCCCC(=O)O"
   mol = Chem.MolFromSmiles(smiles)

   # 1. Detect all functional groups and PFAS patterns
   results = parse_smiles(smiles, halogens=['F', '*'])
   print("Detected groups:")
   for match in results[0].matches:
       prefix = "PFAS" if not match.match_id.startswith('W') else "Generic"
       print(f"  [{prefix}] {match.group_name}")

   # 2. Generate PFAS (F-based) homologues
   pfas_homologues = generate_homologues(mol, halogen='F')
   print(f"\nFluorinated homologues: {len(pfas_homologues)}")

   # 3. Analyze the non-halogenated portion as H-component
   #    (Extract a fragment to demonstrate)
   non_halo_smiles = "CCCCCC(=O)O"  # Hydrocarbon portion only, capped as hexanoic acid
   non_halo_mol = Chem.MolFromSmiles(non_halo_smiles)
   h_homologues = generate_homologues(non_halo_mol, halogen='H')
   print(f"Hydrocarbon homologues: {len(h_homologues)}")

Custom Wildcard Definitions
============================

If you need to detect additional functional groups, you can extend the wildcard definitions by modifying ``PFASGroups/data/Halogen_groups_smarts.json``. To define a custom wildcard group in code and pass it to ``parse_smiles``:

.. code-block:: python

   from PFASGroups import HalogenGroup, parse_smiles

   # Define a custom wildcard group (e.g., for a specific ketone pattern)
   custom_group = HalogenGroup(
       id=9999,  # Use a high ID to avoid conflicts
       name="Methyl ketone",
       smarts={"[CH3]C(=O)[#6]": 1},  # Carbonyl flanked by a methyl group and another carbon
       alias="Methyl ketone",
       # Wildcard groups typically have no halogen-specific constraints:
       componentSmarts=None,
       componentSaturation=None,
       linker_smarts=None,
       constraints={},
   )

   # Use in parsing
   results = parse_smiles(["CC(=O)C", "CC(=O)CC"], pfas_groups=[custom_group], halogens='*')

   for mol in results:
       if mol.matches:
           for m in mol.matches:
               if m.group_id == 9999:
                   print(f"Matched custom group: {m.group_name}")

Best Practices
==============

For H-Components
----------------

1. **Use for research/validation only**: not for production PFAS screening
2. **Expect fewer homologues** than fluorinated analogues (fewer repeated units in hydrocarbons)
3. **Check component detection** manually if results are unexpected
4. **Combine with standard PFAS analysis** for comprehensive coverage

For Wildcard Groups
--------------------

1. **Enable selectively**: only when needed, to reduce parse overhead
2. **Filter by prefix**: distinguish wildcards (``W-*``) from halogen groups
3. **Validate against chemistry**: wildcard patterns are generic and may match unintended substructures
4. **Telomer behavior**: telomer groups are not assessed in H-only or wildcard-only flows
5. **PFAS definitions gate**: definitions are assessed only when ``'F'`` is included in ``halogens``
6. **Document custom groups**: if you extend the definitions, add comments to ``PFASGroups/data/Halogen_groups_smarts.json``
7. **Test on reference sets**: verify that your chosen patterns work for your use case

Summary
=======

**H-Components** and **Wildcard Groups** extend PFASGroups beyond PFAS-specific analysis:

- **H-Components**: Model hydrocarbon chains as pseudo-halogenated for homologue generation and component validation
- **Wildcards**: Detect generic functional groups (alcohols, ethers, esters, etc.) complementary to PFAS patterns

Both features maintain the core architecture of PFASGroups while enabling broader applications in organic chemistry and structural validation.

See Also
========

- :doc:`quickstart`: Quick reference for common workflows
- :doc:`algorithm`: Detailed explanation of SMARTS matching and component detection
- :doc:`customization`: How to define custom PFAS groups and component patterns