Embedding Analysis

from PFASGroups import parse_smiles

results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
arr = results.to_array()

print(arr.shape)                        # (2, 112)  — 112 groups, F only, binary
cols = results.column_names()           # list of 112 column label strings

# PCA in two lines
pca_result = results.perform_pca(n_components=2)
print(pca_result['transformed'].shape)  # (2, 2)

parse_smiles() returns a PFASEmbeddingSet — a list of PFASEmbedding objects, one per molecule. Call to_array() on the set to get a (n_mols, n_cols) numpy matrix, or call it on a single PFASEmbedding for a 1-D vector.

PFASEmbeddingSet 

class PFASGroups.PFASEmbeddingSet(iterable: Iterable[Dict[str, Any]] = ())[source]

Bases: list

List-like container for multiple PFASEmbedding results.

Subclasses list so existing code that iterates over results continues to work. Call to_array() to produce a (n_molecules, n_columns) matrix from all stored results.
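The list-subclass design can be illustrated with a minimal sketch. `EmbeddingSetSketch`, its `to_matrix()` method and the `"groups"` key are hypothetical names chosen for illustration; they are not part of PFASGroups and only mimic the idea of a container that still behaves like a plain list.

```python
from typing import Any, Dict, Iterable, List


class EmbeddingSetSketch(list):
    """Hypothetical stand-in for a list-like result container."""

    def __init__(self, iterable: Iterable[Dict[str, Any]] = ()):
        super().__init__(iterable)

    def to_matrix(self, columns: List[str]) -> List[List[int]]:
        # One row per stored result, one binary column per requested group label.
        return [[int(col in item["groups"]) for col in columns] for item in self]


results = EmbeddingSetSketch([
    {"smiles": "CCCC(F)(F)F", "groups": {"g1"}},
    {"smiles": "FC(F)(F)C(=O)O", "groups": {"g1", "g2"}},
])

# Plain list behaviour is preserved: len(), indexing and iteration all work.
smiles = [item["smiles"] for item in results]
matrix = results.to_matrix(["g1", "g2"])
```

Because the container is still a list, code written against a bare list of result dicts keeps working, while the extra method adds the matrix view.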

__init__(iterable: Iterable[Dict[str, Any]] = ())[source]

property matches: List[MatchView]

Flattened list of all MatchView objects across all molecules.

Some older code expects a matches attribute on a ResultsModel instance. This property provides a read-only aggregated view by concatenating the per-molecule match lists.

classmethod from_raw(results: Iterable[Dict[str, Any]]) → PFASEmbeddingSet[source]

Wrap an existing list of result dicts without changing them.

classmethod from_smiles(smiles: str | List[str], **kwargs) → PFASEmbeddingSet[source]

Parse SMILES string(s) and return a PFASEmbeddingSet.

Parameters:

smiles (str or list of str) – One or more SMILES strings.
**kwargs – Forwarded to parse_smiles() (e.g. halogens, saturation, progress).

classmethod from_mols(mols, **kwargs) → PFASEmbeddingSet[source]

Parse RDKit molecules and return a PFASEmbeddingSet.

Parameters:

mols (list of rdkit.Chem.Mol) – List of RDKit molecule objects.
**kwargs – Forwarded to parse_mols().

classmethod from_inchis(inchis: List[str], **kwargs) → PFASEmbeddingSet[source]

Parse InChI strings and return a PFASEmbeddingSet.

Parameters:

inchis (list of str) – List of InChI strings.
**kwargs – Forwarded to parse_mols().

reorder(indices: list | None = None, key: Callable[[PFASEmbedding], Any] = None, reverse: bool = False) → PFASEmbeddingSet[source]

Return a new PFASEmbeddingSet with results reordered by a key function.

Parameters:

indices (list of int, optional) – Explicit list of indices defining the new order. If provided, this takes precedence over the key function.
key (callable) – Function that takes a PFASEmbedding and returns a value to sort by.
reverse (bool, default False) – Whether to sort in descending order.
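The precedence rule between indices and key can be sketched on a plain list. `reorder_sketch` is a hypothetical helper, not the library's implementation. It assumes that reverse only affects key-based sorting and that the order is unchanged when neither argument is given; neither point is stated in the documentation above.

```python
from typing import Any, Callable, List, Optional, Sequence


def reorder_sketch(items: Sequence[Any],
                   indices: Optional[List[int]] = None,
                   key: Optional[Callable[[Any], Any]] = None,
                   reverse: bool = False) -> List[Any]:
    # Explicit indices win over the key function, as the parameter text states.
    if indices is not None:
        return [items[i] for i in indices]
    if key is not None:
        return sorted(items, key=key, reverse=reverse)
    # Assumption: with neither argument the order is left unchanged.
    return list(items)


smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]
by_length = reorder_sketch(smiles, key=len)
explicit = reorder_sketch(smiles, indices=[1, 2, 0], key=len)  # key is ignored here
```

A new list is returned in both cases, mirroring the documented behaviour of returning a new set rather than sorting in place.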

iter_group_matches(group_id: int | None = None, group_name: str | None = None) → Iterator[Tuple[PFASEmbedding, MatchView]][source]

Iterate over all PFAS group matches across all molecules.

plot_components_for_group(group_id: int | None = None, group_name: str | None = None, max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) → Tuple[Image, int, int][source]

Plot all components for a specific PFAS group across molecules.

Either group_id or group_name (or both) can be provided to select the target group. Each panel corresponds to one molecule, with all its components for that group highlighted together.

show(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → Image[source]

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.

plot(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → Image

Show all component combinations in a grid plot.

Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.

Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.

to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') → None[source]

Export this molecule result to a SQL database.

Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).

Parameters:

filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.
dbname (str, optional) – Database name (for PostgreSQL/MySQL).
user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.
password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.
host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').
port (int, optional) – Database port. Defaults to os.environ.get('DB_PORT', 5432 for PostgreSQL).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.

svg(filename: str, subwidth: int = 350, subheight: int = 350, ncols: int = 4) → str[source]

Export all component combinations to an SVG file (vector graphics).

Components that share the same highlighted atoms within a molecule are merged into a single panel with a bullet-point legend.

Parameters:

filename (str) – Path to the output SVG file.
subwidth (int, default 350) – Width of each sub-image in pixels.
subheight (int, default 350) – Minimum height of each sub-image in pixels.
ncols (int, default 4) – Number of columns in the grid.

Returns:

Path to the created SVG file.

Return type:

str

summarise() → str[source]

Return a coloured text summary of the results.

The summary includes:
- number of molecules
- counts of PFAS group and definition matches
- total number of components across all group matches
- the most frequent PFAS groups (colour-coded by halogen)

table() → str[source]

Return a more detailed text table with one row per molecule.

The TSV table has the following columns: index (1-based), smiles, group_matches (count), definition_matches (count), and groups (per-molecule PFAS groups with counts, e.g. "Perfluoroalkyl (2); Polyfluoroalkyl (1)").
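The column layout described above can be sketched as follows. `groups_cell` and `tsv_row` are hypothetical helpers that only reproduce the documented row shape. Treating group_matches as the total number of matches (rather than the number of distinct groups) is an assumption.

```python
from collections import Counter
from typing import List


def groups_cell(group_names: List[str]) -> str:
    # Collapse repeated group names into "Name (count)" entries joined by "; ".
    counts = Counter(group_names)
    return "; ".join(f"{name} ({n})" for name, n in counts.items())


def tsv_row(index: int, smiles: str, group_names: List[str], n_definitions: int) -> str:
    # Columns: index (1-based), smiles, group_matches, definition_matches, groups.
    cells = [
        str(index),
        smiles,
        str(len(group_names)),  # assumption: total matches, not distinct groups
        str(n_definitions),
        groups_cell(group_names),
    ]
    return "\t".join(cells)


row = tsv_row(1, "FC(F)(F)C(=O)O",
              ["Perfluoroalkyl", "Perfluoroalkyl", "Polyfluoroalkyl"], 2)
```

A string in this shape can be split on tab characters or read with any TSV reader.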

classify() → DataFrame[source]

Return a classification DataFrame with one row per molecule.

Each molecule is classified by MoleculeResult.classify().

Returns:

Columns:

smiles — molecule SMILES.
category — classification label: OECD group name(s) if matched, otherwise "per-"/"poly-" + generic/telomeric group names (comma-separated).
total_component_size — sum of C-atom counts across all matched group components.

Return type:

pandas.DataFrame

summary() → None[source]

Print a detailed coloured summary of matched groups across all molecules.

For each group, shows the component SMARTS type and, per component, the graph metrics: size (C-atom count), branching and mean eccentricity. Component size statistics (min, max, mean) are also shown.

plot_all_components_with_group_colours(max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) → Tuple[Image, int, int][source]

Plot all matched components, coloured by PFAS group.

Each panel corresponds to one molecule; atoms are highlighted with colours assigned per PFAS group. The legend lists the groups found in that molecule.

to_sql_all(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') → None[source]

Export all molecule results to a SQL database.

This method efficiently batches all molecules into the database in a single operation.

Parameters:

conn (str or sqlalchemy.engine.Engine, optional) – Database connection. Can be:
- SQLAlchemy Engine object
- Connection string (e.g., 'postgresql://user:pass@host:port/db')
- SQLite path with 'sqlite:///' prefix
filename (str, optional) – Path to SQLite database file (legacy parameter, use conn instead).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.

Examples

>>> # Using connection string
>>> results.to_sql_all(conn='postgresql://user:pass@localhost/pfas_db')
>>>
>>> # Using SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///pfas.db')
>>> results.to_sql_all(conn=engine)
>>>
>>> # Using filename (legacy)
>>> results.to_sql_all(filename='pfas.db')

to_fingerprint(group_selection: str = 'all', component_metrics: List[str] | None = None, selected_group_ids: List[int] | None = None, halogens: str | List[str] = 'F', saturation: str | None = 'per', molecule_metrics: List[str] | None = None, pfas_groups: List[Dict] | None = None, preset: str | None = None, count_mode: str | None = None, graph_metrics: List[str] | None = None, progress: bool = False, **kwargs) → ndarray[source]

Deprecated. Use to_array() instead.

property n_molecules: int

Number of molecules in this set.

property has_cache: bool

Always True, because PFASEmbeddingSet stores pre-parsed results.

property match_cache: PFASEmbeddingSet

Alias for the set itself (kept for backward compatibility with the PFASFingerprint API).

get_embedding(**kwargs) → EmbeddingArray[source]

Alias for to_array() (kept for backward compatibility with the PFASFingerprint API).

to_array(component_metrics=<object object>, molecule_metrics=<object object>, group_selection=<object object>, selected_group_ids=<object object>, aggregation=<object object>, preset=<object object>, pfas_groups=<object object>, halogens=<object object>, progress: bool = True) → EmbeddingArray[source]

Stack per-molecule embedding rows into a (n_mols, n_cols) matrix.

When called with no arguments, returns the last cached embedding (or binary by default on the first call). Pass explicit arguments to override and update the cache.

Parameters match those of PFASEmbedding.to_array(), plus:

progress (bool, default True) – Show a tqdm progress bar while computing embeddings.
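The `<object object>` defaults in the signature suggest a sentinel that separates "argument not passed" from an explicit None, which is one common way to get the "reuse the last settings unless overridden" behaviour described above. The sketch below shows that general pattern only. `CachedSettingsSketch`, `_UNSET` and `resolve()` are hypothetical names, and whether PFASGroups works this way internally is an assumption.

```python
from typing import Any, Dict

_UNSET = object()  # unique marker meaning "argument not supplied"


class CachedSettingsSketch:
    """Hypothetical illustration of argument caching with a sentinel default."""

    def __init__(self) -> None:
        # Initial values stand in for the defaults used on the first call.
        self._last: Dict[str, Any] = {"preset": None, "group_selection": "all"}

    def resolve(self, preset: Any = _UNSET, group_selection: Any = _UNSET) -> Dict[str, Any]:
        # Only arguments that were actually passed overwrite the cached settings,
        # so an explicit None stays distinguishable from "not passed".
        if preset is not _UNSET:
            self._last["preset"] = preset
        if group_selection is not _UNSET:
            self._last["group_selection"] = group_selection
        return dict(self._last)


settings = CachedSettingsSketch()
first = settings.resolve()                          # defaults on the first call
second = settings.resolve(group_selection="oecd")   # override and remember
third = settings.resolve()                          # reuses the remembered value
```

With this pattern a bare call after an explicit one returns the same configuration, which matches the documented "last cached embedding" behaviour of to_array().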

compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') → float[source]

Compare two sets using KL divergence on group-occurrence frequencies.

Parameters:

other (PFASEmbeddingSet) – Second set to compare against.
method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).

Returns:

KL divergence value (lower = more similar).

Return type:

float

perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform PCA on the embedding matrix.

Parameters:

n_components (int, default 2)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).

Return type:

dict
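The 'top_group' colouring rule (the group with the highest match count per molecule) can be sketched with plain dictionaries. `top_group_labels` is a hypothetical helper and the group names are illustrative. The placeholder label for molecules without matches is an assumption, since the documentation does not say how such molecules are labelled.

```python
from typing import Dict, List


def top_group_labels(per_molecule_counts: List[Dict[str, int]],
                     fallback: str = "none") -> List[str]:
    labels = []
    for counts in per_molecule_counts:
        if counts:
            # Pick the group name with the largest match count for this molecule.
            labels.append(max(counts, key=counts.get))
        else:
            # Assumption: molecules without any match get a placeholder label.
            labels.append(fallback)
    return labels


counts = [
    {"Perfluoroalkyl": 2, "Polyfluoroalkyl": 1},
    {"Perfluoroalkyl carboxylic acids": 1},
    {},
]
labels = top_group_labels(counts)
```

A list like `labels` has one string per molecule, which is the shape color_by accepts when it is given explicit labels instead of 'top_group'.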

perform_kernel_pca(n_components: int = 2, kernel: str = 'rbf', gamma: float | None = None, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform kernel PCA on the embedding matrix.

Parameters:

n_components (int, default 2)
kernel (str, default 'rbf')
gamma (float, optional)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'kpca_model', 'scaler', 'kernel', 'gamma', 'labels' (if color_by is set).

Return type:

dict

perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform t-SNE dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
perplexity (float, default 30.0)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).

Return type:

dict

perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform UMAP dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
n_neighbors (int, default 15)
min_dist (float, default 0.1)
metric (str, default 'euclidean')
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).

Return type:

dict

column_names(component_metrics: List[str] | None = None, molecule_metrics: List[str] | None = None, group_selection: str = 'all', selected_group_ids: List[int] | None = None, preset: str | None = None, pfas_groups=None, halogens=None) → List[str][source]

Return column labels (delegates to the first element).

classmethod from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) → PFASEmbeddingSet[source]

Load results from SQL database.

Parameters:

conn (str or SQLAlchemy Engine, optional) – Database connection string or engine
filename (str, optional) – SQLite database filename (alternative to conn)
components_table (str, default "components") – Name of the components table
groups_table (str, default "pfas_groups_in_compound") – Name of the groups table
limit (int, optional) – Limit number of molecules to load

Returns:

Loaded results

Return type:

PFASEmbeddingSet

Generating an embedding array 

Via to_array():

from PFASGroups import parse_smiles

smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]
results = parse_smiles(smiles)

# Default: all 112 groups, fluorine only, binary encoding
arr = results.to_array()
print(arr.shape)                      # (3, 112)
cols = results.column_names()         # list of 112 column label strings

# OECD groups only, count mode
arr_oecd = results.to_array(group_selection='oecd', component_metrics=['count'])
print(arr_oecd.shape)                 # (3, 28)

# Best-performing preset (binary + effective_graph_resistance)
arr_best = results.to_array(preset='best')
cols_best = results.column_names(preset='best')
print(arr_best.shape)                 # (3, 224)  — 112 groups × 2 metrics

# Single-molecule embedding
vec = results[0].to_array(preset='best')   # 1-D array, length 224

Key attributes 

PFASEmbeddingSet is a list of PFASEmbedding objects (dict subclasses). The embedding matrix is computed on demand; no matrix is stored on the object.

Attribute / method	Description
`results.to_array(...)`	`numpy.ndarray` of shape `(n_mols, n_cols)`
`results.column_names(...)`	List of column label strings; same arguments as `to_array()`
`results[i]`	`PFASEmbedding` for molecule i (dict subclass)
`results[i].smiles`	SMILES string of molecule i
`results[i].to_array(...)`	1-D embedding vector for one molecule
`results[i].summarise()`	Returns a formatted string summary of matched PFAS groups
`results[i].summary()`	Prints the formatted summary to stdout

Dimensionality reduction 

perform_pca 

PFASEmbeddingSet.perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform PCA on the embedding matrix.

Parameters:

n_components (int, default 2)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).

Return type:

dict

result = results.perform_pca(n_components=2, plot=True)
coords = result['transformed']    # numpy array (n_mols, 2)
evr   = result['explained_variance']   # variance ratios per component

Parameters:

n_components (int, default 2): Number of PCA components
plot (bool, default True): Generate and display a scatter plot + scree plot
output_file (str, optional): File path to save the plot

Returns: dict with keys 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler'.

Requires: scikit-learn, matplotlib

perform_tsne 

PFASEmbeddingSet.perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform t-SNE dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
perplexity (float, default 30.0)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).

Return type:

dict

# perplexity must be smaller than the number of molecules in results
result = results.perform_tsne(n_components=2, perplexity=5, plot=True)
coords = result['transformed']

Parameters:

n_components (int, default 2)
perplexity (float, default 30.0): t-SNE perplexity (typically 5-50; must be less than the number of molecules)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)

Returns: dict with keys 'transformed', 'tsne_model', 'scaler', 'perplexity'.

Requires: scikit-learn, matplotlib
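Since perplexity has to stay below the number of molecules, small sets need a lower value than the default of 30. The helper below is a hypothetical convenience and not part of PFASGroups. It caps the requested value at one less than the set size, which is one simple way to satisfy the constraint.

```python
def safe_perplexity(n_molecules: int, requested: float = 30.0) -> float:
    """Return a perplexity strictly below the number of molecules."""
    if n_molecules < 2:
        raise ValueError("t-SNE needs at least two molecules")
    # Cap the requested value at n_molecules - 1.
    return min(requested, float(n_molecules - 1))


small = safe_perplexity(3)     # three molecules: capped at 2.0
large = safe_perplexity(500)   # large set: the requested 30.0 is kept
```

The result could then be passed on, for example as results.perform_tsne(perplexity=safe_perplexity(len(results))).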

perform_umap 

PFASEmbeddingSet.perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) → Dict[source]

Perform UMAP dimensionality reduction on the embedding matrix.

Parameters:

n_components (int, default 2)
n_neighbors (int, default 15)
min_dist (float, default 0.1)
metric (str, default 'euclidean')
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.

Returns:

Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).

Return type:

dict

result = results.perform_umap(n_components=2, n_neighbors=15, plot=True)
coords = result['transformed']

Parameters:

n_components (int, default 2)
n_neighbors (int, default 15): UMAP local neighborhood size
min_dist (float, default 0.1): Minimum distance between embedded points
metric (str, default 'euclidean')
plot (bool, default True)

Returns: dict with keys 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist'.

Requires: umap-learn (pip install umap-learn), matplotlib

Statistical comparison 

compare_kld 

PFASEmbeddingSet.compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') → float[source]

Compare two sets using KL divergence on group-occurrence frequencies.

Parameters:

other (PFASEmbeddingSet) – Second set to compare against.
method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).

Returns:

KL divergence value (lower = more similar).

Return type:

float

Compute KL divergence between the group-occurrence frequencies of two sets:

results_a = parse_smiles(set_a)
results_b = parse_smiles(set_b)

kld = results_a.compare_kld(results_b, method='minmax')
# kld: float — normalised symmetric KLD (lower = more similar distributions)

Parameters:

other (PFASEmbeddingSet): The comparison set
method (str, default 'minmax'):
- 'minmax': normalised symmetric KLD ∈ [0, 1]
- 'forward': KL(self ‖ other)
- 'reverse': KL(other ‖ self)
- 'symmetric': average of forward and reverse

Returns: float
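The forward, reverse and symmetric variants can be sketched on two vectors of group-occurrence counts. This is an illustration of the textbook definitions, not the library's implementation. `normalise`, `kl` and `kld_sketch` are hypothetical names, and the additive smoothing constant is an assumption. The 'minmax' variant is left out because its normalisation is not specified here.

```python
import math
from typing import List


def normalise(counts: List[float], eps: float = 1e-9) -> List[float]:
    # Add a small constant so that empty bins do not produce log(0).
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def kl(p: List[float], q: List[float]) -> float:
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kld_sketch(a: List[float], b: List[float], method: str = "symmetric") -> float:
    p, q = normalise(a), normalise(b)
    if method == "forward":
        return kl(p, q)
    if method == "reverse":
        return kl(q, p)
    if method == "symmetric":
        return 0.5 * (kl(p, q) + kl(q, p))
    raise ValueError(f"unsupported method: {method}")


freq_a = [10, 5, 0, 1]   # illustrative per-group occurrence counts for set A
freq_b = [8, 6, 1, 1]    # illustrative per-group occurrence counts for set B
same = kld_sketch(freq_a, freq_a)
sym = kld_sketch(freq_a, freq_b)
```

Identical distributions give a divergence of zero, and the symmetric variant returns the same value regardless of argument order, unlike the forward and reverse variants.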

Database I/O 

to_sql / from_sql 

PFASEmbeddingSet.to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') → None[source]

Export this molecule result to a SQL database.

Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).

Parameters:

filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.
dbname (str, optional) – Database name (for PostgreSQL/MySQL).
user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.
password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.
host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').
port (int, optional) – Database port. Defaults to os.environ.get('DB_PORT', 5432 for PostgreSQL).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.

classmethod PFASEmbeddingSet.from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) → PFASEmbeddingSet[source]

Load results from SQL database.

Parameters:

conn (str or SQLAlchemy Engine, optional) – Database connection string or engine
filename (str, optional) – SQLite database filename (alternative to conn)
components_table (str, default "components") – Name of the components table
groups_table (str, default "pfas_groups_in_compound") – Name of the groups table
limit (int, optional) – Limit number of molecules to load

Returns:

Loaded results

Return type:

PFASEmbeddingSet

# Save to SQLite
results.to_sql(filename="pfas_results.db")

# Load back
from PFASGroups import PFASEmbeddingSet
results2 = PFASEmbeddingSet.from_sql(filename="pfas_results.db")

# PostgreSQL
results.to_sql(dbname="mydb", user="alice", password="secret", host="localhost")

Component and group-level data are stored in two tables (components and pfas_groups_in_compound by default). Pass if_exists='replace' to overwrite existing tables.
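A written SQLite file can be inspected with the standard-library sqlite3 module. The sketch below uses an in-memory database so that it runs on its own. Only the two table names come from the documentation; the column layouts are hypothetical placeholders, since the real schema is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical column layouts; only the two table names are documented defaults.
conn.execute("CREATE TABLE components (smiles TEXT, group_id INTEGER, size INTEGER)")
conn.execute(
    "CREATE TABLE pfas_groups_in_compound (smiles TEXT, group_id INTEGER, match_count INTEGER)"
)
conn.execute("INSERT INTO components VALUES ('FC(F)(F)C(=O)O', 1, 1)")
conn.commit()

# List the tables that exist in the database, then count rows in one of them.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
n_components = conn.execute("SELECT COUNT(*) FROM components").fetchone()[0]
conn.close()
```

To inspect a real export, replace ":memory:" with the file path passed to to_sql (for example "pfas_results.db") and skip the CREATE and INSERT statements.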