Quickstart
==========

A five-minute overview of the most common PFASGroups workflows.

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
   arr, cols = results.to_array(), results.column_names()
   print(arr.shape)   # (2, 114) — 114 groups compiled by default for fluorine-only

.. contents:: Contents
   :local:
   :depth: 1

Parsing SMILES
--------------

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = [
       "CCCC(F)(F)F",       # 1,1,1-trifluorobutane (terminal CF3 group)
       "FC(F)(F)C(=O)O",    # trifluoroacetic acid (TFA)
       "OCCOCCO",           # no halogen — returns no matches
   ]

   results = parse_smiles(smiles)

``results`` is a :class:`~PFASGroups.PFASEmbeddingSet` — a list-like container
of :class:`~PFASGroups.PFASEmbedding` objects (dict subclass), one per input SMILES.

Accessing matches
-----------------

.. code-block:: python

   mol = results[0]                             # first molecule (PFASEmbedding)
   print(mol.smiles)                            # canonical SMILES
   print(bool(mol.matches))                     # True if any group matched

   for match in mol.matches:                    # iterate over MatchView objects
       if match.is_group:
           print(match.group_name)              # e.g. "Perfluoroalkyl"
           print(match.group_id)                # integer group ID
           for comp in match.components:
               print(comp.atoms)                # list of atom indices

Loop over only molecules that have at least one match:

.. code-block:: python

   for mol in results:
       if mol.matches:
           print(mol.smiles, "—", len(mol.matches), "match(es)")
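
To collect the matched group names for every molecule in one pass, the loop
above can be wrapped in a small helper.  The sketch below is not part of
PFASGroups: ``group_names_by_smiles`` is a hypothetical name, and the
``SimpleNamespace`` objects are stand-ins that only mimic the ``smiles``,
``matches``, ``is_group`` and ``group_name`` attributes used earlier, so the
example runs without the library installed.  The data are illustrative, not
real output.

.. code-block:: python

   from types import SimpleNamespace

   def group_names_by_smiles(results):
       """Map each SMILES with at least one group match to its group names."""
       summary = {}
       for mol in results:
           # Keep group matches only; other match kinds are skipped.
           names = [m.group_name for m in mol.matches if m.is_group]
           if names:
               summary[mol.smiles] = names
       return summary

   # Hypothetical stand-ins for PFASEmbedding / MatchView objects.
   demo = [
       SimpleNamespace(
           smiles="OC(=O)C(F)(F)F",
           matches=[SimpleNamespace(is_group=True, group_name="Perfluoromethyl")],
       ),
       SimpleNamespace(smiles="OCCOCCO", matches=[]),
   ]

   print(group_names_by_smiles(demo))
   # {'OC(=O)C(F)(F)F': ['Perfluoromethyl']}

With a real result set, the object returned by ``parse_smiles`` would be
passed in place of ``demo``, assuming its items expose the same attributes.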

Converting to a DataFrame
-------------------------

.. code-block:: python

   df = results.to_dataframe()
   print(df.columns.tolist())
   # ['smiles', 'inchi', 'group_name', 'group_id', ...]

Generating embeddings
---------------------

Embeddings encode group matches as a fixed-length numeric vector suitable for
machine learning.  By default PFASGroups produces a **binary vector** with one
column per group (fluorine only):

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]

   # Parse once, then build the matrix and its column labels
   results = parse_smiles(smiles)
   arr = results.to_array()        # (3, n_groups) matrix, one row per molecule
   cols = results.column_names()   # matching column labels
   print(arr.shape)   # (3, n_groups)
   print(type(arr))   # <class 'numpy.ndarray'>
   print(cols[:2])    # e.g. ['Perfluoromethyl [binary]', 'Perfluoroalkyl [binary]']
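
Because ``column_names()`` returns labels that match the columns of
``to_array()``, pairing the two shows which features are set for a given
molecule.  A minimal sketch, assuming label ``i`` describes column ``i``;
``active_features`` is a hypothetical helper, and ``demo_row`` and
``demo_cols`` are made-up stand-ins rather than real output:

.. code-block:: python

   def active_features(row, cols):
       """Return {column label: value} for the non-zero entries of one row."""
       return {name: value for name, value in zip(cols, row) if value != 0}

   # Illustrative stand-ins for one row of the array and its labels.
   demo_cols = ["Perfluoromethyl [binary]", "Perfluoroalkyl [binary]"]
   demo_row = [1, 0]

   print(active_features(demo_row, demo_cols))
   # {'Perfluoromethyl [binary]': 1}

The helper only iterates over its arguments, so it should work equally with a
row of a NumPy array.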

**Group selection** — restrict to a subset of groups:

.. code-block:: python

   # OECD groups only
   arr_oecd = results.to_array(group_selection='oecd')
   cols_oecd = results.column_names(group_selection='oecd')

**component_metrics** — control how matches are encoded:

.. code-block:: python

   # binary (default): 1 = present, 0 = absent
   arr_bin = results.to_array(component_metrics=['binary'])

   # count: number of independent matches
   arr_cnt = results.to_array(component_metrics=['count'])

   # max_component: size (atom count) of the largest matching component
   arr_max = results.to_array(component_metrics=['max_component'])

   # 'best' preset: binary + effective graph resistance
   arr_best = results.to_array(preset='best')

   # column_names() takes the same arguments and returns the matching labels
   cols_best = results.column_names(preset='best')
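
One common use of a binary embedding is comparing molecules by the groups
they share.  The sketch below computes the Tanimoto (Jaccard) similarity
between two binary rows with NumPy.  It is an illustration, not a PFASGroups
feature: ``tanimoto`` is a hypothetical helper and ``demo`` is a hand-written
stand-in for a binary array.

.. code-block:: python

   import numpy as np

   def tanimoto(a, b):
       """Tanimoto (Jaccard) similarity of two binary vectors."""
       a = np.asarray(a, dtype=bool)
       b = np.asarray(b, dtype=bool)
       union = np.logical_or(a, b).sum()
       if union == 0:
           return 0.0  # convention chosen here for two all-zero vectors
       return float(np.logical_and(a, b).sum() / union)

   # Stand-in binary embedding: three molecules, four group columns.
   demo = np.array([[1, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 0]])

   print(tanimoto(demo[0], demo[1]))  # 0.5
   print(tanimoto(demo[0], demo[2]))  # 0.0

Returning 0.0 for two all-zero rows is a choice made for this sketch; some
applications prefer to treat that case as undefined.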

**n_spacer** — telomer CH\ :sub:`2` spacer length (the second number in ``n:m`` notation, e.g. the 2 in 6:2 FTOH):

.. code-block:: python

   # n_spacer is 0 for non-telomers; for fluorotelomers it encodes the
   # CH2 linker length (2 for both 4:2 FTOH and 6:2 FTOH)
   arr_ns = results.to_array(component_metrics=['n_spacer'])
   # Non-zero entries only appear in telomer group columns

**ring_size** — smallest ring containing the matched component:

.. code-block:: python

   # ring_size is 0 for acyclic groups; 5 for azoles/furans; 6 for benzene/cyclohexane
   arr_rs = results.to_array(component_metrics=['ring_size'])

**Combined embedding** with multiple metrics and molecule-wide descriptors:

.. code-block:: python

   arr_combined = results.to_array(
       component_metrics=['binary', 'effective_graph_resistance',
                          'n_spacer', 'ring_size'],
       molecule_metrics=['n_components', 'max_size',
                         'mean_branching', 'max_component_fraction'],
   )

.. note::

   For multi-halogen embeddings covering F, Cl, Br and I, see
   :ref:`multi-halogen fingerprinting <multi_halogen_fingerprint>` in
   :doc:`halogengroups`.

PFAS definition screening
--------------------------

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(
       ["CCCC(F)(F)F", "OCCOCCO"],
       include_PFAS_definitions=True,
   )

   for mol in results:
       for match in mol.matches:
           if match.is_definition:
               print(mol.smiles, "matches", match.get("definition_name"))

Saturation filter
-----------------

.. code-block:: python

   # Only perfluorinated groups (every C-H bond replaced by C-F)
   results = parse_smiles(smiles, saturation='per')

   # Polyfluorinated groups (only some C-H bonds replaced by C-F)
   results = parse_smiles(smiles, saturation='poly')

   # No filter — all groups (default: saturation=None for parse_smiles)
   results = parse_smiles(smiles, saturation=None)

Multi-halogen parsing (advanced)
---------------------------------

To detect Cl, Br and I groups in addition to fluorine, use the
``halogens`` argument or import from ``HalogenGroups``:

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(["ClCCCl", "BrCCBr"], halogens=['F', 'Cl', 'Br', 'I'])

See :doc:`halogengroups` for full multi-halogen documentation.

Command-line usage
------------------

.. code-block:: bash

   # Parse a CSV of SMILES
   pfasgroups parse input.csv --output results.json

   # Generate fingerprints
   pfasgroups fingerprint input.csv --output fps.csv

   # List all 119 group names (114 compiled by default for fluorine-only)
   pfasgroups list-groups

See :doc:`cli` for the full CLI reference.