Embedding Analysis
==================

.. currentmodule:: PFASGroups

.. code-block:: python

   from PFASGroups import parse_smiles

   results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
   arr = results.to_array()
   print(arr.shape)                 # (2, 112): 112 groups, F only, binary
   cols = results.column_names()    # list of 112 column label strings

   # PCA in two lines
   pca_result = results.perform_pca(n_components=2)
   print(pca_result['transformed'].shape)   # (2, 2)

:func:`parse_smiles` returns a :class:`PFASEmbeddingSet`, a list of
:class:`PFASEmbedding` objects, one per molecule. Call
:meth:`~PFASEmbeddingSet.to_array` on the set to get a ``(n_mols, n_cols)``
numpy matrix, or call it on a single :class:`PFASEmbedding` for a 1-D vector.

.. seealso::

   :doc:`/ResultsFingerprint_Guide` for the complete metric reference, with
   formulas, preset benchmarks, and group selection tables.

.. contents:: Contents
   :local:
   :depth: 2

PFASEmbeddingSet
----------------

.. autoclass:: PFASEmbeddingSet
   :members:
   :undoc-members:
   :show-inheritance:

Generating an embedding array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Via :meth:`~PFASGroups.PFASEmbeddingSet.to_array`:

.. code-block:: python

   from PFASGroups import parse_smiles

   smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]
   results = parse_smiles(smiles)

   # Default: all 112 groups, fluorine only, binary encoding
   arr = results.to_array()
   print(arr.shape)                 # (3, 112)
   cols = results.column_names()    # list of 112 column label strings

   # OECD groups only, count mode
   arr_oecd = results.to_array(group_selection='oecd', component_metrics=['count'])
   print(arr_oecd.shape)            # (3, 28)

   # Best-performing preset (binary + effective_graph_resistance)
   arr_best = results.to_array(preset='best')
   cols_best = results.column_names(preset='best')
   print(arr_best.shape)            # (3, 224): 112 groups × 2 metrics

   # Single-molecule embedding
   vec = results[0].to_array(preset='best')   # 1-D array, length 224

Key attributes
~~~~~~~~~~~~~~

:class:`PFASEmbeddingSet` is a plain list of :class:`PFASEmbedding` dicts.
The embedding matrix is computed on demand; no matrix is stored on the object.

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Attribute / method
     - Description
   * - ``results.to_array(...)``
     - ``numpy.ndarray`` of shape ``(n_mols, n_cols)``
   * - ``results.column_names(...)``
     - List of column label strings; same arguments as ``to_array()``
   * - ``results[i]``
     - :class:`PFASEmbedding` for molecule *i* (dict subclass)
   * - ``results[i].smiles``
     - SMILES string of molecule *i*
   * - ``results[i].to_array(...)``
     - 1-D embedding vector for one molecule
   * - ``results[i].summarise()``
     - Returns a formatted string summary of matched PFAS groups
   * - ``results[i].summary()``
     - Prints the formatted summary to stdout

Dimensionality reduction
-------------------------

perform_pca
~~~~~~~~~~~

.. automethod:: PFASEmbeddingSet.perform_pca
   :no-index:

.. code-block:: python

   result = results.perform_pca(n_components=2, plot=True)
   coords = result['transformed']        # numpy array (n_mols, 2)
   evr = result['explained_variance']    # variance ratios per component

**Parameters:**

- ``n_components`` (int, default 2): Number of PCA components
- ``plot`` (bool, default True): Generate and display a scatter plot and a scree plot
- ``output_file`` (str, optional): File path to save the plot

**Returns:** dict with keys ``'transformed'``, ``'explained_variance'``,
``'components'``, ``'pca_model'``, ``'scaler'``.

*Requires*: scikit-learn, matplotlib

perform_tsne
~~~~~~~~~~~~

.. automethod:: PFASEmbeddingSet.perform_tsne
   :no-index:

.. code-block:: python

   # perplexity must be less than the number of molecules in results
   result = results.perform_tsne(n_components=2, perplexity=5, plot=True)
   coords = result['transformed']

**Parameters:**

- ``n_components`` (int, default 2)
- ``perplexity`` (float, default 30.0): t-SNE perplexity (typically 5–50; must
  be less than the number of molecules)
- ``learning_rate`` (float, default 200.0)
- ``max_iter`` (int, default 1000)
- ``plot`` (bool, default True)

**Returns:** dict with keys ``'transformed'``, ``'tsne_model'``, ``'scaler'``,
``'perplexity'``.

*Requires*: scikit-learn, matplotlib

perform_umap
~~~~~~~~~~~~

.. automethod:: PFASEmbeddingSet.perform_umap
   :no-index:

.. code-block:: python

   result = results.perform_umap(n_components=2, n_neighbors=15, plot=True)
   coords = result['transformed']

**Parameters:**

- ``n_components`` (int, default 2)
- ``n_neighbors`` (int, default 15): UMAP local neighborhood size
- ``min_dist`` (float, default 0.1): Minimum distance between embedded points
- ``metric`` (str, default ``'euclidean'``)
- ``plot`` (bool, default True)

**Returns:** dict with keys ``'transformed'``, ``'umap_model'``, ``'scaler'``,
``'n_neighbors'``, ``'min_dist'``.

*Requires*: umap-learn (``pip install umap-learn``), matplotlib

Statistical comparison
-----------------------

compare_kld
~~~~~~~~~~~

.. automethod:: PFASEmbeddingSet.compare_kld
   :no-index:

Compute the KL divergence between the group-occurrence frequencies of two sets:

.. code-block:: python

   from PFASGroups import parse_smiles

   # set_a and set_b are lists of SMILES strings
   results_a = parse_smiles(set_a)
   results_b = parse_smiles(set_b)
   kld = results_a.compare_kld(results_b, method='minmax')
   # kld: float, normalised symmetric KLD (lower = more similar distributions)

**Parameters:**

- ``other`` (:class:`PFASEmbeddingSet`): The comparison set
- ``method`` (str, default ``'minmax'``):

  - ``'minmax'``: normalised symmetric KLD ∈ [0, 1]
  - ``'forward'``: KL(self ‖ other)
  - ``'reverse'``: KL(other ‖ self)
  - ``'symmetric'``: average of the forward and reverse divergences

**Returns:** ``float``

Database I/O
------------

to_sql / from_sql
~~~~~~~~~~~~~~~~~

.. automethod:: PFASEmbeddingSet.to_sql
   :no-index:

.. automethod:: PFASEmbeddingSet.from_sql
   :no-index:

.. code-block:: python

   # Save to SQLite
   results.to_sql(filename="pfas_results.db")

   # Load back
   from PFASGroups import PFASEmbeddingSet
   results2 = PFASEmbeddingSet.from_sql(filename="pfas_results.db")

   # PostgreSQL
   results.to_sql(dbname="mydb", user="alice", password="secret", host="localhost")

Component and group-level data are stored in two tables (``components`` and
``pfas_groups_in_compound`` by default). Pass ``if_exists='replace'`` to
overwrite existing tables.
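
To check what ``to_sql`` wrote to a SQLite file, the tables can be listed with
Python's standard ``sqlite3`` module. The sketch below is not part of
PFASGroups: ``table_row_counts`` is a hypothetical helper, and the file name
``pfas_results.db`` is simply the one used in the SQLite example above. It reads
only SQLite's own catalogue (``sqlite_master``), so it assumes nothing about the
column layout of the stored tables.

.. code-block:: python

   import os
   import sqlite3
   from contextlib import closing


   def table_row_counts(db_path):
       """Return {table_name: row_count} for every user table in a SQLite file.

       Hypothetical helper for illustration; not a PFASGroups function.
       """
       counts = {}
       with closing(sqlite3.connect(db_path)) as conn:
           rows = conn.execute(
               "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
           ).fetchall()
           for (name,) in rows:
               # Skip SQLite's internal bookkeeping tables.
               if name.startswith("sqlite_"):
                   continue
               # Quote the identifier, since it is interpolated into the query.
               quoted = '"' + name.replace('"', '""') + '"'
               counts[name] = conn.execute(
                   f"SELECT COUNT(*) FROM {quoted}"
               ).fetchone()[0]
       return counts


   # Only inspect the file if an earlier to_sql() call has created it.
   DB_FILE = "pfas_results.db"
   if os.path.exists(DB_FILE):
       print(table_row_counts(DB_FILE))

With the default table names, the returned dictionary would be expected to
contain the keys ``components`` and ``pfas_groups_in_compound``.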