Advanced Features: H-Components and Wildcard Groups
=====================================================

This guide covers two advanced features in PFASGroups:

1. **H-Components**: Pseudo-halogenated hydrocarbon chains for expanding homologous series to non-fluorinated organic compounds
2. **Wildcard Groups**: Generic functional group detection beyond OECD PFAS groups

.. contents:: Contents
   :local:
   :depth: 2

H-Components (Hydrocarbon Analysis)
====================================

Overview
--------

H-Components allow PFASGroups to treat ordinary hydrocarbon chains (CH₂ units) as if they were halogenated components. This feature enables:

- **Homologue series exploration** for non-fluorinated organic compounds
- **Validation** of component detection logic on simpler test cases
- **Extended chemical space mapping** to broader alkyl and aliphatic systems
- **Research** into structural variation of neutral hydrocarbons

While PFASGroups is primarily designed for halogenated substances, the H-component framework demonstrates the generality of the underlying component-based architecture.

Usage in Homologue Generation
------------------------------

The most common use of H-components is in the ``generate_homologues()`` function, where you can specify ``halogen='H'`` to generate shorter alkyl chain variants.

Basic Example
~~~~~~~~~~~~~

.. code-block:: python

   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Simple hydrocarbon with a carboxylic acid head group
   smiles = 'OC(=O)CCCCCC'  # heptanoic acid: carboxyl head plus a 6-carbon alkyl chain
   mol = Chem.MolFromSmiles(smiles)

   # Generate homologues by removing CH2 units
   homologues = generate_homologues(mol, halogen='H')

   print(f"Parent: {Chem.MolToSmiles(mol)}")
   print(f"Halogen mode: {homologues.halogen}")
   print(f"Number of homologues: {len(homologues)}")

   # Inspect each homologue
   for inchikey, inner_dict in homologues.items():
       for formula, h_mol in inner_dict.items():
           print(f"  {formula}: {Chem.MolToSmiles(h_mol)}")

Expected Output:
~~~~~~~~~~~~~~~~~

For a 6-carbon chain, you would typically generate 4–5 shorter homologues:

.. code-block:: text

   Parent: OC(=O)CCCCCC
   Halogen mode: H
   Number of homologues: 5
     C6H12O2: OC(=O)CCCCC
     C5H10O2: OC(=O)CCCC
     C4H8O2: OC(=O)CCC
     C3H6O2: OC(=O)CC
     C2H4O2: OC(=O)C

How It Works
~~~~~~~~~~~~

When ``halogen='H'``:

1. PFASGroups detects "Alkyl" components by matching CH₂-rich backbones
2. Instead of looking for C–F bonds, it identifies C–H bonds in repeating units
3. Homologues are generated by systematically removing CH₂ units
4. The ``n_removed`` field in results tracks how many CH₂ units were removed
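
The removal step can be illustrated without the library. The sketch below is a string-level illustration only, not the PFASGroups algorithm: it assumes a straight-chain carboxylic acid written as ``OC(=O)`` followed by alkyl carbons, and ``shorter_acid_homologues`` is a hypothetical helper name.

.. code-block:: python

   def shorter_acid_homologues(n_alkyl):
       """Return (n_removed, formula, smiles) for each shorter straight-chain acid.

       n_alkyl is the number of alkyl carbons attached to the carboxyl group.
       """
       homologues = []
       for n_removed in range(1, n_alkyl):
           remaining = n_alkyl - n_removed   # alkyl carbons left after removal
           carbons = remaining + 1           # add the carboxyl carbon
           formula = f"C{carbons}H{2 * carbons}O2"  # saturated acid: CnH2nO2
           homologues.append((n_removed, formula, "OC(=O)" + "C" * remaining))
       return homologues


   # Parent with a 6-carbon alkyl chain, as in the basic example
   for n_removed, formula, smiles in shorter_acid_homologues(6):
       print(f"n_removed={n_removed}  {formula}: {smiles}")

For a 6-carbon alkyl chain this yields five entries, from ``C6H12O2`` down to ``C2H4O2``, with ``n_removed`` counting the CH₂ units taken out of the parent.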

Comparing Halogen Modes
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Same parent SMILES, analyzed with different halogens
   smiles = 'OC(=O)' + 'C(F)(F)' * 4 + 'F'  # Perfluoroalkyl acid
   mol = Chem.MolFromSmiles(smiles)

   # Fluorine (default) — removes CF2 units
   hom_f = generate_homologues(mol, halogen='F')
   print(f"F-mode homologues: {len(hom_f)}")

   # Hydrogen mode looks for CH2 backbone units instead (not typical here:
   # this perfluorinated chain contains no CH2 units to remove)
   # hom_h = generate_homologues(mol, halogen='H')

   # Chlorine — removes CCl2 units (if present)
   # hom_cl = generate_homologues(mol, halogen='Cl')

Component Detection with H
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To inspect H-components through the standard parser, define a custom H-constrained
group and run ``parse_smiles`` in H mode:

.. code-block:: python

   from PFASGroups import parse_smiles, HalogenGroup

   smiles = 'CCCCO'

   h_group = HalogenGroup(
       id=9990,
       name='Hydrocarbon alcohol via H-alkyl component',
       smarts={'[#6$([#6!$([#6]=O)][OH1,Oh1,O-])]': 1},
       componentSmarts='Alkyl',
       componentSaturation='per',
       componentHalogens='H',
       componentForm='alkyl',
       constraints={},
       max_dist_from_comp=1,
   )

   results = parse_smiles(smiles, halogens='H', pfas_groups=[h_group], bycomponent=True)
   h_matches = [m for m in results[0].matches if m.get('id') == 9990 and m.get('type') == 'HalogenGroup']

   print(f"Found {len(h_matches)} H-component match(es)")
   if h_matches:
       print(f"Component count: {h_matches[0]['num_components']}")

Limitations
~~~~~~~~~~~

- H-components are **not true halogenated components** — they use CH₂ patterns as stand-ins
- **No graph metrics** are computed for H-components (only for fluorinated components with sufficient size)
- **Limited validation**: fewer test cases exist for hydrocarbon analysis
- **Use case specificity**: H-mode is mainly for research and validation, not production PFAS analysis

Wildcard Groups
===============

Overview
--------

Wildcard groups provide **generic functional group detection** beyond the 27 OECD PFAS groups. They enable:

- **Broader organic chemistry coverage** (esters, ethers, alcohols, aldehydes, etc.)
- **Complementary analysis** to PFAS-specific patterns
- **Non-halogenated compound screening** (identifying functional groups in any molecule)
- **Cross-framework validation** (comparing wildcard matches across different halogens)

Wildcard groups are assigned group IDs in the range 29–76 (plus some higher special groups), and their matches are tagged with a ``W`` prefix in match IDs (e.g., ``W-001``, ``W-042``).

Enabling Wildcard Detection
----------------------------

Basic Toggle
~~~~~~~~~~~~

Wildcard groups are **disabled by default**. Enable them by including ``'*'`` in ``halogens``:

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = "CCO"  # ethanol

   # Without wildcards
   results_no_wc = parse_smiles(smiles, halogens='F')
   print(f"Matches (no wildcards): {len(results_no_wc[0].matches)}")

   # With wildcards
   results_with_wc = parse_smiles(smiles, halogens='*')
   print(f"Matches (with wildcards): {len(results_with_wc[0].matches)}")

   # Wildcard matches are present in the second set
   for match in results_with_wc[0].matches:
       print(f"  {match.group_name} (ID {match.group_id})")

Example Output:
~~~~~~~~~~~~~~~

.. code-block:: text

   Matches (no wildcards): 0
   Matches (with wildcards): 1
     Alcohol (ID 30)

Wildcard vs. Halogen Groups
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When both are enabled, you can distinguish them by the ``match_id`` prefix:

.. code-block:: python

   from PFASGroups import parse_smiles

   # Fluorinated molecule that also carries a generic functional group
   smiles = "FC(F)(F)C(F)(F)C(=O)O"  # perfluoropropanoic acid (PFPrA)

   results = parse_smiles(smiles, halogens=['F', '*'])
   mol = results[0]

   # Separate by match type
   halogen_matches = [m for m in mol.matches if not m.match_id.startswith('W')]
   wildcard_matches = [m for m in mol.matches if m.match_id.startswith('W')]

   print(f"Halogen group matches ({len(halogen_matches)}):")
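   for m in halogen_matches:
       # PFASdefinition matches expose 'id' and 'definition_name' instead of the group fields
       is_definition = m['type'] == 'PFASdefinition'
       ident = m['id'] if is_definition else m.group_id
       name = m['definition_name'] if is_definition else m.group_name
       print(f"       {ident} - {name} - {m['type']}")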

   print(f"\nWildcard group matches ({len(wildcard_matches)}):")
   for m in wildcard_matches:
       print(f"       {m.group_id} - {m.group_name} - {m['type']}")

Common Wildcard Groups
----------------------

The following are frequently matched wildcard groups (IDs 29–76):

.. list-table:: Common Wildcard Groups
   :header-rows: 1

   * - ID
     - Group Name
     - Pattern
     - Example
   * - 30
     - Alcohol
     - [OH] on C
     - ``CCO``, ``CC(C)O``
   * - 31
     - Ether
     - C-O-C linkage
     - ``CCOC``, ``CCOc1ccccc1``
   * - 32
     - Aldehyde
     - [CH1]=O
     - ``CC=O``
   * - 33
     - Alkene
     - C=C double bond
     - ``C=C``, ``CC=CC``
   * - 36
     - Alkyne
     - C≡C triple bond
     - ``C#C``
   * - 42
     - Carboxylic acid
     - C(=O)OH
     - ``CC(=O)O``, ``C(=O)O``
   * - 46
     - Ester
     - C(=O)O-C
     - ``CC(=O)OC``
   * - 47
     - Ether
     - C-O-C (again)
     - ``CCOC``

For a complete list, see ``PFASGroups/data/Halogen_groups_smarts.json`` (Groups 29+).

Filtering by Functional Group
------------------------------

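The example below lists the wildcard groups matched for a set of simple molecules; to keep only specific groups, filter the matches on ``group_name`` or ``group_id``: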

.. code-block:: python

   from PFASGroups import parse_smiles

   test_molecules = [
       ("CCO", "alcohol"),
       ("CC(=O)O", "carboxylic acid"),
       ("CC(=O)OC", "ester"),
       ("CCOC", "ether"),
       ("C=C", "alkene"),
       ("C#C", "alkyne"),
   ]

   results = parse_smiles([smi for smi, _ in test_molecules], halogens='*')

   for (smi, desc), mol in zip(test_molecules, results):
       print(f"{desc:20} ({smi:15}): ", end="")
       if mol.matches:
           names = [m.group_name for m in mol.matches]
           print(", ".join(names))
       else:
           print("(no match)")

Expected Output:
~~~~~~~~~~~~~~~~

.. code-block:: text

   alcohol              (CCO            ): Alcohol
   carboxylic acid      (CC(=O)O        ): Carboxylic acid
   ester                (CC(=O)OC       ): Ester
   ether                (CCOC           ): Ether
   alkene               (C=C            ): Alkene
   alkyne               (C#C            ): Alkyne
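
The listing above prints every matched group. To keep only selected groups, one option is to filter on the group ID. The sketch below is a minimal illustration that uses ``namedtuple`` stand-ins rather than real ``parse_smiles`` output; ``Match`` and ``keep_groups`` are hypothetical names, and the IDs are taken from the table of common wildcard groups.

.. code-block:: python

   from collections import namedtuple

   # Stand-in for a match record; real match objects come from parse_smiles
   Match = namedtuple("Match", ["group_id", "group_name"])


   def keep_groups(matches, wanted_ids):
       """Return only the matches whose group_id is in wanted_ids."""
       wanted = set(wanted_ids)
       return [m for m in matches if m.group_id in wanted]


   matches = [Match(30, "Alcohol"), Match(42, "Carboxylic acid"), Match(46, "Ester")]
   acids_and_esters = keep_groups(matches, wanted_ids={42, 46})
   print([m.group_name for m in acids_and_esters])

Filtering on IDs rather than names avoids ambiguity when two groups share a display name.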

Multi-Halogen Wildcard Analysis
--------------------------------

Wildcards work across all halogen modes. Compare wildcard matches across halogens:

.. code-block:: python

   from PFASGroups import parse_smiles

   # Molecule with a CF2 unit and multiple functional groups
   smiles = "O=C(O)C(F)(F)CCN(CC(C)O)CC(=O)OC"

   # Analyze with all halogen modes and wildcards in a single call
   halogens = ['F', 'Cl', 'Br', 'I', 'H', '*']
   results = parse_smiles(smiles, halogens=halogens)
   for result in results:
       print(f"Results for {result.smiles}")
       for match in result.matches:
           print(f"   - {match.group_name} (ID {match.group_id}) under '{match['halogen']}' mode")

**Understanding the halogen field**: The ``match['halogen']`` field shows which halogens are
*actually present* in that match's components (e.g., ``'F'``, ``'Cl'``, ``['F', 'H']``, ``'*'``
for wildcards). For HalogenGroup matches, this reflects the real halogens bonded to the
matched carbon components. For WildcardGroup matches, it is always ``'*'``. For PFASdefinition
matches, it is always ``'F'``.
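
Because the field may hold either a single string or a list, it helps to normalise it before grouping or comparing matches. The sketch below shows one way to do that; it assumes plain dictionaries with ``name`` and ``halogen`` keys as stand-ins for match objects, and ``halogen_key`` and ``names_by_halogen`` are hypothetical helper names.

.. code-block:: python

   from collections import defaultdict


   def halogen_key(value):
       """Normalise a halogen field (string or list of strings) to a sorted tuple."""
       if isinstance(value, str):
           return (value,)
       return tuple(sorted(value))


   def names_by_halogen(matches):
       """Group match names by their normalised halogen key."""
       grouped = defaultdict(list)
       for match in matches:
           grouped[halogen_key(match["halogen"])].append(match["name"])
       return dict(grouped)


   # Hypothetical stand-in records, not real parse_smiles output
   sample = [
       {"name": "Carboxylic acid", "halogen": "*"},
       {"name": "Ester", "halogen": "*"},
       {"name": "example halogen group", "halogen": ["H", "F"]},
   ]
   grouped = names_by_halogen(sample)
   for key, names in grouped.items():
       print(key, names)

Sorting the list form means ``['H', 'F']`` and ``['F', 'H']`` end up under the same key.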

Note: To run both H-component and wildcard matching in one call, use
``halogens=['H', '*']``.

Note: Wildcard patterns don't depend on the halogen parameter, so the wildcard matches for this
molecule are the same whichever halogens are listed alongside ``'*'``.

Advanced: Combining H-Components and Wildcards
===============================================

You can analyze molecules using both features simultaneously:

.. code-block:: python

   from PFASGroups import parse_smiles
   from PFASGroups.generate_homologues import generate_homologues
   from rdkit import Chem

   # Mixed molecule: perfluorobutyl chain joined to a carboxylic acid by a CH2 spacer
   smiles = "FC(F)(F)C(F)(F)C(F)(F)C(F)(F)CCCCCC(=O)O"
   mol = Chem.MolFromSmiles(smiles)

   # 1. Detect all functional groups and PFAS patterns
   results = parse_smiles(smiles, halogens=['F', '*'])
   print("Detected groups:")
   for match in results[0].matches:
       prefix = "PFAS" if not match.match_id.startswith('W') else "Generic"
       print(f"  [{prefix}] {match.group_name}")

   # 2. Generate PFAS (F-based) homologues
   pfas_homologues = generate_homologues(mol, halogen='F')
   print(f"\nFluorinated homologues: {len(pfas_homologues)}")

   # 3. Analyze the non-halogenated portion as H-component
   # (Extract a fragment to demonstrate)
   non_halo_smiles = "CCCCCC(=O)O"  # Hydrocarbon portion only
   non_halo_mol = Chem.MolFromSmiles(non_halo_smiles)
   h_homologues = generate_homologues(non_halo_mol, halogen='H')
   print(f"Hydrocarbon homologues: {len(h_homologues)}")

Custom Wildcard Definitions
============================

If you need to detect additional functional groups, you can either extend the wildcard definitions in ``PFASGroups/data/Halogen_groups_smarts.json`` or define a group in code and pass it to ``parse_smiles``.

To create a custom wildcard group in code:

.. code-block:: python

   from PFASGroups import HalogenGroup, parse_smiles

   # Define a custom wildcard group (e.g., for a specific ketone pattern)
   custom_group = HalogenGroup(
       id=9999,  # Use a high ID to avoid conflicts
       name="Methyl ketone",
       smarts={"[CH3]C(=O)[#6]": 1},  # CH3 bonded to a ketone carbonyl carbon
       alias="Methyl ketone",
       # Wildcard groups typically have no halogen-specific constraints:
       componentSmarts=None,
       componentSaturation=None,
       linker_smarts=None,
       constraints={},
   )

   # Use in parsing
   results = parse_smiles(["CC(=O)C", "CC(=O)CC"], pfas_groups=[custom_group], halogens='*')
   for mol in results:
       if mol.matches:
           for m in mol.matches:
               if m.group_id == 9999:
                   print(f"Matched custom group: {m.group_name}")

Best Practices
==============

For H-Components
----------------

1. **Use for research/validation only** — not for production PFAS screening
2. **Expect fewer homologues** than fluorinated analogues (fewer repeated units in hydrocarbons)
3. **Check component detection** manually if results are unexpected
4. **Combine with standard PFAS analysis** for comprehensive coverage

For Wildcard Groups
--------------------

1. **Enable selectively** — only when needed to reduce parse overhead
2. **Filter by prefix** — distinguish wildcards (``W-*``) from halogen groups
3. **Validate against chemistry** — wildcard patterns are generic and may match unintended substructures
4. **Telomer behavior** — telomer groups are not assessed in H-only or wildcard-only flows
5. **PFAS definitions gate** — definitions are assessed only when ``'F'`` is included in ``halogens``
6. **Document custom groups** — if you extend ``PFASGroups/data/Halogen_groups_smarts.json``, record the purpose of each new group in a descriptive ``name`` or in accompanying documentation, since JSON has no comment syntax
7. **Test on reference sets** — verify that your chosen patterns work for your use case
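
The gating rules in items 4 and 5, together with the ``'*'`` toggle, can be summarised in a small helper. This is only a sketch of the behaviour described in this guide, not the library's implementation: ``enabled_checks`` is a hypothetical name, and the telomer rule assumes that any entry other than ``'H'`` or ``'*'`` is enough for telomer groups to be assessed.

.. code-block:: python

   def enabled_checks(halogens):
       """Summarise which checks a given halogens argument would switch on."""
       # Accept either a single string ('F') or a list (['F', '*'])
       selected = {halogens} if isinstance(halogens, str) else set(halogens)
       return {
           "wildcard_groups": "*" in selected,
           "pfas_definitions": "F" in selected,
           "h_components": "H" in selected,
           # Assumption: telomers need at least one entry other than 'H' or '*'
           "telomer_groups": bool(selected - {"H", "*"}),
       }


   for choice in ("*", ["H", "*"], ["F", "*"]):
       print(choice, enabled_checks(choice))

A helper like this is useful as a quick check before a long run, for example to confirm that PFAS definitions will not be silently skipped because ``'F'`` was left out.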

Summary
=======

**H-Components** and **Wildcard Groups** extend PFASGroups beyond PFAS-specific analysis:

- **H-Components**: Model hydrocarbon chains as pseudo-halogenated for homologue generation and component validation
- **Wildcards**: Detect generic functional groups (alcohols, ethers, esters, etc.) complementary to PFAS patterns

Both features maintain the core architecture of PFASGroups while enabling broader applications in organic chemistry and structural validation.

See Also
========

- :doc:`quickstart` — Quick reference for common workflows
- :doc:`algorithm` — Detailed explanation of SMARTS matching and component detection
- :doc:`customization` — How to define custom PFAS groups and component patterns