Embedding Analysis
from PFASGroups import parse_smiles
results = parse_smiles(["CCCC(F)(F)F", "FC(F)(F)C(=O)O"])
arr = results.to_array()
print(arr.shape) # (2, 112) — 112 groups, F only, binary
cols = results.column_names() # list of 112 column label strings
# PCA in two lines
pca_result = results.perform_pca(n_components=2)
print(pca_result['transformed'].shape) # (2, 2)
parse_smiles() returns a PFASEmbeddingSet — a list of
PFASEmbedding objects, one per molecule. Call to_array()
on the set to get a (n_mols, n_cols) numpy matrix, or call it on a single
PFASEmbedding for a 1-D vector.
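The relationship between the 1-D per-molecule vector and the (n_mols, n_cols) matrix can be sketched in plain Python. This is only an illustration of the shapes involved, not the library's implementation: the group names, the column order, and the helpers binary_row and stack_rows are all hypothetical.

```python
# Hypothetical column order; the real embedding has its own group list.
GROUPS = ["perfluoroalkyl", "polyfluoroalkyl", "carboxylic acid"]


def binary_row(matches, groups=GROUPS):
    """Return a 1-D presence/absence vector for one molecule."""
    return [1 if matches.get(g, 0) > 0 else 0 for g in groups]


def stack_rows(all_matches, groups=GROUPS):
    """Stack per-molecule vectors into an (n_mols, n_cols) nested list."""
    return [binary_row(m, groups) for m in all_matches]


# Made-up match counts for two molecules (group name -> number of matches).
molecules = [
    {"polyfluoroalkyl": 1},
    {"perfluoroalkyl": 2, "carboxylic acid": 1},
]

matrix = stack_rows(molecules)
shape = (len(matrix), len(matrix[0]))
print(shape)      # (2, 3)
print(matrix[1])  # [1, 0, 1]
```

Each row of the matrix is exactly what the single-molecule call would return, so row i and the vector for molecule i should agree when the same settings are used.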
See also
/ResultsFingerprint_Guide — complete metric reference with formulas, preset benchmarks, and group selection tables.
PFASEmbeddingSet
- class PFASGroups.PFASEmbeddingSet(iterable: Iterable[Dict[str, Any]] = ())[source]
Bases: list
List-like container for multiple PFASEmbedding results.
Subclasses list so existing code that iterates over results continues to work. Call to_array() to produce a (n_molecules, n_columns) matrix from all stored results.
- property matches: List[MatchView]
Flattened list of all MatchView objects across all molecules.
Some older code expects a matches attribute on a ResultsModel instance. This property provides a read-only aggregated view by concatenating the per-molecule match lists.
- classmethod from_raw(results: Iterable[Dict[str, Any]]) PFASEmbeddingSet[source]
Wrap an existing list of result dicts without changing them.
- classmethod from_smiles(smiles: str | List[str], **kwargs) PFASEmbeddingSet[source]
Parse SMILES string(s) and return a PFASEmbeddingSet.
- classmethod from_mols(mols, **kwargs) PFASEmbeddingSet[source]
Parse RDKit molecules and return a PFASEmbeddingSet.
- Parameters:
mols (list of rdkit.Chem.Mol) – List of RDKit molecule objects.
**kwargs – Forwarded to parse_mols().
- classmethod from_inchis(inchis: List[str], **kwargs) PFASEmbeddingSet[source]
Parse InChI strings and return a PFASEmbeddingSet.
- reorder(indices: list | None = None, key: Callable[[PFASEmbedding], Any] = None, reverse: bool = False) PFASEmbeddingSet[source]
Return a new PFASEmbeddingSet with results reordered by a key function.
- Parameters:
- iter_group_matches(group_id: int | None = None, group_name: str | None = None) Iterator[Tuple[PFASEmbedding, MatchView]][source]
Iterate over all PFAS group matches across all molecules.
- plot_components_for_group(group_id: int | None = None, group_name: str | None = None, max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]
Plot all components for a specific PFAS group across molecules.
Either group_id or group_name (or both) can be provided to select the target group. Each panel corresponds to one molecule, with all its components for that group highlighted together.
- show(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image[source]
Show all component combinations in a grid plot.
Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.
Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.
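The panel-merging rule described for show() (components with the same highlighted atoms share one panel) can be sketched as a grouping by atom set. This is a minimal illustration under assumed data: the (group_name, atom_indices) pair format and the helper merge_panels are hypothetical and do not reflect the library's internal structures.

```python
from collections import defaultdict


def merge_panels(components):
    """Group components whose highlighted atom sets are identical.

    `components` is a list of (group_name, atom_indices) pairs for ONE
    molecule. Returns (sorted_atoms, [group_names]) panels in first-seen order.
    """
    panels = defaultdict(list)
    for group_name, atoms in components:
        # frozenset ignores atom order, so [1, 2, 3] and [3, 2, 1] collide.
        panels[frozenset(atoms)].append(group_name)
    return [(sorted(atoms), names) for atoms, names in panels.items()]


# Made-up components for a single molecule.
components = [
    ("Perfluoroalkyl", [1, 2, 3]),
    ("Perfluoroalkyl carboxylic acid", [3, 2, 1]),
    ("Polyfluoroalkyl", [5, 6]),
]

panels = merge_panels(components)
for atoms, names in panels:
    print(atoms, names)
```

In this sketch the first entry of each panel is the first component seen, which mirrors the statement that the highlight colour is taken from the first entry in the panel.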
- plot(display: bool = True, subwidth: int = 350, subheight: int = 350, ncols: int = 4) Image
Show all component combinations in a grid plot.
Components that share the same highlighted atoms within a molecule are merged into a single panel. The table below each structure lists the matched PFAS group, the component SMARTS type, and three graph metrics: size (C-atom count), branching (1.0 = linear) and mean eccentricity.
Atoms are highlighted with the colour derived from the component SMARTS metadata (halogen / form / saturation) of the first entry in each panel.
- to_sql(filename: str | None = None, dbname: str | None = None, user: str | None = None, password: str | None = None, host: str | None = None, port: int | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]
Export this molecule result to a SQL database.
Can write to either SQLite (via filename) or PostgreSQL/MySQL (via connection parameters).
- Parameters:
filename (str, optional) – Path to SQLite database file. If provided, uses SQLite.
dbname (str, optional) – Database name (for PostgreSQL/MySQL).
user (str, optional) – Database username. Defaults to os.environ['DB_USER'] if not provided.
password (str, optional) – Database password. Defaults to os.environ['DB_PASSWORD'] if not provided.
host (str, optional) – Database host. Defaults to os.environ.get('DB_HOST', 'localhost').
port (int, optional) – Database port. Defaults to os.environ.get('DB_PORT', 5432 for PostgreSQL).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.
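The environment-variable fallbacks listed above can be sketched as follows. The helper resolve_db_params is hypothetical and only mirrors the documented defaults; it assumes the PostgreSQL port 5432 as the fallback and is not the library's own code.

```python
import os


def resolve_db_params(dbname, user=None, password=None, host=None, port=None):
    """Fill missing connection parameters from the environment.

    DB_USER and DB_PASSWORD are looked up with os.environ[...], so a missing
    variable raises KeyError; DB_HOST and DB_PORT have fallbacks.
    """
    return {
        "dbname": dbname,
        "user": user if user is not None else os.environ["DB_USER"],
        "password": password if password is not None else os.environ["DB_PASSWORD"],
        "host": host if host is not None else os.environ.get("DB_HOST", "localhost"),
        "port": port if port is not None else int(os.environ.get("DB_PORT", 5432)),
    }


# Demo environment: credentials set, host and port left to their fallbacks.
os.environ.update({"DB_USER": "alice", "DB_PASSWORD": "secret"})
os.environ.pop("DB_HOST", None)
os.environ.pop("DB_PORT", None)

params = resolve_db_params("mydb")
explicit = resolve_db_params("mydb", host="db.example.org", port=6543)
print(params["host"], params["port"])
```

Explicit arguments always win over the environment, which keeps credentials out of source code while still allowing per-call overrides.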
- svg(filename: str, subwidth: int = 350, subheight: int = 350, ncols: int = 4) str[source]
Export all component combinations to an SVG file (vector graphics).
Components that share the same highlighted atoms within a molecule are merged into a single panel with a bullet-point legend.
- Returns:
Path to the created SVG file.
- Return type:
str
- summarise() str[source]
Return a coloured text summary of the results.
The summary includes:
- number of molecules
- counts of PFAS group and definition matches
- total number of components across all group matches
- the most frequent PFAS groups (colour-coded by halogen)
- table() str[source]
Return a more detailed text table with one row per molecule.
The TSV table has the following columns:
index (1-based), smiles, group_matches (count), definition_matches (count), and groups (per-molecule PFAS groups with counts, e.g. "Perfluoroalkyl (2); Polyfluoroalkyl (1)").
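The groups column packs several group names and counts into one cell. A minimal sketch of turning such a cell back into a mapping is shown below; it assumes the "Name (count); Name (count)" layout from the example above, and the helper parse_groups_cell is hypothetical.

```python
import re

# One entry looks like "Perfluoroalkyl (2)": a name, then a count in parentheses.
_ENTRY = re.compile(r"\s*(.+?)\s*\((\d+)\)\s*$")


def parse_groups_cell(cell):
    """Convert 'Name (n); Other (m)' into {'Name': n, 'Other': m}."""
    counts = {}
    for part in cell.split(";"):
        match = _ENTRY.match(part)
        if match:  # skips empty fragments, e.g. from an empty cell
            counts[match.group(1)] = int(match.group(2))
    return counts


parsed = parse_groups_cell("Perfluoroalkyl (2); Polyfluoroalkyl (1)")
print(parsed)  # {'Perfluoroalkyl': 2, 'Polyfluoroalkyl': 1}
```

The lazy name pattern lets the final parenthesised number be read as the count even when the group name itself contains parentheses.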
- classify() DataFrame[source]
Return a classification DataFrame with one row per molecule.
Each molecule is classified by MoleculeResult.classify().
- Returns:
Columns:
smiles – molecule SMILES.
category – classification label: OECD group name(s) if matched, otherwise "per-" / "poly-" + generic/telomeric group names (comma-separated).
total_component_size – sum of C-atom counts across all matched group components.
- Return type:
DataFrame
- summary() None[source]
Print a detailed coloured summary of matched groups across all molecules.
For each group, shows the component SMARTS type and, per component, the graph metrics: size (C-atom count), branching and mean eccentricity. Component size statistics (min, max, mean) are also shown.
- plot_all_components_with_group_colours(max_molecules: int | None = None, subwidth: int = 300, subheight: int = 300, ncols: int = 3) Tuple[Image, int, int][source]
Plot all matched components, coloured by PFAS group.
Each panel corresponds to one molecule; atoms are highlighted with colours assigned per PFAS group. The legend lists the groups found in that molecule.
- to_sql_all(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', if_exists: str = 'append') None[source]
Export all molecule results to a SQL database.
This method efficiently batches all molecules into the database in a single operation.
- Parameters:
conn (str or sqlalchemy.engine.Engine, optional) – Database connection. Can be:
- SQLAlchemy Engine object
- Connection string (e.g., 'postgresql://user:pass@host:port/db')
- SQLite path with 'sqlite:///' prefix
filename (str, optional) – Path to SQLite database file (legacy parameter, use conn instead).
components_table (str, default "components") – Name of the table to store component-level data.
groups_table (str, default "pfas_groups_in_compound") – Name of the table to store PFAS group matches.
if_exists (str, default "append") – How to behave if tables exist: 'fail', 'replace', or 'append'.
Examples
>>> # Using connection string
>>> results.to_sql_all(conn='postgresql://user:pass@localhost/pfas_db')
>>>
>>> # Using SQLAlchemy engine
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///pfas.db')
>>> results.to_sql_all(conn=engine)
>>>
>>> # Using filename (legacy)
>>> results.to_sql_all(filename='pfas.db')
- to_fingerprint(group_selection: str = 'all', component_metrics: List[str] | None = None, selected_group_ids: List[int] | None = None, halogens: str | List[str] = 'F', saturation: str | None = 'per', molecule_metrics: List[str] | None = None, pfas_groups: List[Dict] | None = None, preset: str | None = None, count_mode: str | None = None, graph_metrics: List[str] | None = None, progress: bool = False, **kwargs) ndarray[source]
Deprecated. Use to_array() instead.
- property match_cache: PFASEmbeddingSet
Alias for the set itself (backward compat with PFASFingerprint API).
- get_embedding(**kwargs) EmbeddingArray[source]
Alias for to_array() (backward compat with PFASFingerprint API).
- to_array(component_metrics=<object object>, molecule_metrics=<object object>, group_selection=<object object>, selected_group_ids=<object object>, aggregation=<object object>, preset=<object object>, pfas_groups=<object object>, halogens=<object object>, progress: bool = True) EmbeddingArray[source]
Stack per-molecule embedding rows into a (n_mols, n_cols) matrix.
When called with no arguments, returns the last cached embedding (or binary by default on the first call). Pass explicit arguments to override and update the cache.
Parameters match those of PFASEmbedding.to_array(), plus:
- progress (bool, default True) – Show a tqdm progress bar while computing embeddings.
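The `<object object>` defaults in the signature suggest a sentinel that separates "argument not passed" from an explicit None. The sketch below shows that general pattern combined with remembering the last explicit settings. It is an assumption about the design, not the library's code: the class EmbeddingSettings, its two example options, and the choice to reset unspecified options to their defaults on an explicit call are all hypothetical.

```python
_UNSET = object()  # sentinel: distinguishes "not passed" from an explicit None


class EmbeddingSettings:
    """Hypothetical holder that remembers the last explicitly passed settings."""

    DEFAULTS = {"preset": None, "group_selection": "all"}

    def __init__(self):
        self._last = dict(self.DEFAULTS)

    def resolve(self, preset=_UNSET, group_selection=_UNSET):
        passed = {"preset": preset, "group_selection": group_selection}
        overrides = {k: v for k, v in passed.items() if v is not _UNSET}
        if overrides:
            # Explicit arguments start from the defaults and replace the cache.
            self._last = {**self.DEFAULTS, **overrides}
        # With no arguments, the cached settings are reused unchanged.
        return dict(self._last)


settings = EmbeddingSettings()
first = settings.resolve()                  # defaults on the first call
best = settings.resolve(preset="best")      # explicit override updates the cache
again = settings.resolve()                  # no arguments: cached settings
reset = settings.resolve(preset=None)       # explicit None counts as an override
print(first, best, again, reset, sep="\n")
```

A plain None default could not express the last case, which is why a dedicated sentinel object is a common choice for this kind of API.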
- compare_kld(other: PFASEmbeddingSet, method: str = 'minmax') float[source]
Compare two sets using KL divergence on group-occurrence frequencies.
- Parameters:
other (PFASEmbeddingSet) – Second set to compare against.
method (str, default 'minmax') – 'forward', 'reverse', 'symmetric', or 'minmax' (normalised symmetric KLD).
- Returns:
KL divergence value (lower = more similar).
- Return type:
float
- perform_pca(n_components: int = 2, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform PCA on the embedding matrix.
- Parameters:
n_components (int, default 2)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'explained_variance', 'components', 'pca_model', 'scaler', 'labels' (if color_by is set).
- Return type:
Dict
- perform_kernel_pca(n_components: int = 2, kernel: str = 'rbf', gamma: float | None = None, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform kernel PCA on the embedding matrix.
- Parameters:
n_components (int, default 2)
kernel (str, default 'rbf')
gamma (float, optional)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'kpca_model', 'scaler', 'kernel', 'gamma', 'labels' (if color_by is set).
- Return type:
Dict
- perform_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, max_iter: int = 1000, plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform t-SNE dimensionality reduction on the embedding matrix.
- Parameters:
n_components (int, default 2)
perplexity (float, default 30.0)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'tsne_model', 'scaler', 'perplexity', 'labels' (if color_by is set).
- Return type:
Dict
- perform_umap(n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', plot: bool = True, output_file: str | None = None, color_by=None) Dict[source]
Perform UMAP dimensionality reduction on the embedding matrix.
- Parameters:
n_components (int, default 2)
n_neighbors (int, default 15)
min_dist (float, default 0.1)
metric (str, default 'euclidean')
plot (bool, default True)
output_file (str, optional)
color_by (None | 'top_group' | list of str, default None) – Colour scatter-plot points. Pass 'top_group' to colour by the PFAS group with the highest match count per molecule, or pass a list of per-molecule label strings. None uses a single colour.
- Returns:
Keys: 'transformed', 'umap_model', 'scaler', 'n_neighbors', 'min_dist', 'labels' (if color_by is set).
- Return type:
Dict
- column_names(component_metrics: List[str] | None = None, molecule_metrics: List[str] | None = None, group_selection: str = 'all', selected_group_ids: List[int] | None = None, preset: str | None = None, pfas_groups=None, halogens=None) List[str][source]
Return column labels (delegates to first element).
- classmethod from_sql(conn: str | 'sqlalchemy.engine.Engine' | None = None, filename: str | None = None, components_table: str = 'components', groups_table: str = 'pfas_groups_in_compound', limit: int | None = None) PFASEmbeddingSet[source]
Load results from SQL database.
- Parameters:
conn (str or SQLAlchemy Engine, optional) – Database connection string or engine
filename (str, optional) – SQLite database filename (alternative to conn)
components_table (str, default "components") – Name of the components table
groups_table (str, default "pfas_groups_in_compound") – Name of the groups table
limit (int, optional) – Limit number of molecules to load
- Returns:
Loaded results
- Return type:
PFASEmbeddingSet
Generating an embedding array
Via to_array():
from PFASGroups import parse_smiles
smiles = ["CCCC(F)(F)F", "FC(F)(F)C(=O)O", "OCCOCCO"]
results = parse_smiles(smiles)
# Default: all 112 groups, fluorine only, binary encoding
arr = results.to_array()
print(arr.shape) # (3, 112)
cols = results.column_names() # list of 112 column label strings
# OECD groups only, count mode
arr_oecd = results.to_array(group_selection='oecd', component_metrics=['count'])
print(arr_oecd.shape) # (3, 28)
# Best-performing preset (binary + effective_graph_resistance)
arr_best = results.to_array(preset='best')
cols_best = results.column_names(preset='best')
print(arr_best.shape) # (3, 224) — 112 groups × 2 metrics
# Single-molecule embedding
vec = results[0].to_array(preset='best') # 1-D array, length 224
Key attributes
PFASEmbeddingSet is a plain list of PFASEmbedding dicts.
The embedding matrix is computed on demand; no matrix is stored on the object.
| Attribute / method | Description |
|---|---|
| results.column_names(...) | List of column label strings; same arguments as to_array() |
| | SMILES string of molecule i |
| results[i].to_array(...) | 1-D embedding vector for one molecule |
| results.summarise() | Returns a formatted string summary of matched PFAS groups |
| results.summary() | Prints the formatted summary to stdout |
Dimensionality reduction
perform_pca
result = results.perform_pca(n_components=2, plot=True)
coords = result['transformed'] # numpy array (n_mols, 2)
evr = result['explained_variance'] # variance ratios per component
Parameters:
n_components (int, default 2): Number of PCA components
plot (bool, default True): Generate and display a scatter plot + scree plot
output_file (str, optional): File path to save the plot
Returns: dict with keys 'transformed', 'explained_variance',
'components', 'pca_model', 'scaler'.
Requires: scikit-learn, matplotlib
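The 'explained_variance' entry is described above as variance ratios per component. A short sketch of how such ratios are typically read is given below: accumulate them and find how many components reach a target share. The ratio values and the helper components_needed are made up for illustration; with n_components=2 the real list would hold only two entries.

```python
from itertools import accumulate


def components_needed(variance_ratios, threshold=0.9):
    """Return the smallest k whose first k ratios sum to at least `threshold`.

    Falls back to the total number of components if the threshold is never met.
    """
    for k, total in enumerate(accumulate(variance_ratios), start=1):
        if total >= threshold:
            return k
    return len(variance_ratios)


# Made-up ratios, standing in for result['explained_variance'].
evr = [0.5, 0.25, 0.125, 0.125]
cumulative = list(accumulate(evr))

print(cumulative)                   # [0.5, 0.75, 0.875, 1.0]
print(components_needed(evr, 0.8))  # 3
```

If the first two components cover only a small share of the variance, a 2-D scatter plot hides most of the structure and a larger n_components may be worth inspecting.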
perform_tsne
result = results.perform_tsne(n_components=2, perplexity=5, plot=True)
coords = result['transformed']
Parameters:
n_components (int, default 2)
perplexity (float, default 30.0): t-SNE perplexity (typically 5–50; must be less than the number of molecules)
learning_rate (float, default 200.0)
max_iter (int, default 1000)
plot (bool, default True)
Returns: dict with keys 'transformed', 'tsne_model', 'scaler',
'perplexity'.
Requires: scikit-learn, matplotlib
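Because the perplexity must be less than the number of molecules, small sets cannot use the default of 30.0. One way to guard against this is to cap the requested value before the call, as sketched here. The helper clamp_perplexity is hypothetical and not part of the library.

```python
def clamp_perplexity(requested, n_molecules):
    """Cap `requested` so it stays strictly below the number of molecules."""
    if n_molecules < 2:
        raise ValueError("t-SNE needs at least two molecules")
    upper = n_molecules - 1  # largest whole value that is still < n_molecules
    return float(min(requested, upper))


small_set = clamp_perplexity(30.0, n_molecules=10)
large_set = clamp_perplexity(30.0, n_molecules=500)
print(small_set, large_set)  # 9.0 30.0
```

The capped value could then be passed as perplexity= to perform_tsne. A value right at the limit is valid but may still give a poor layout, so the typical 5–50 range remains a sensible target when the set is large enough.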
perform_umap
result = results.perform_umap(n_components=2, n_neighbors=15, plot=True)
coords = result['transformed']
Parameters:
n_components (int, default 2)
n_neighbors (int, default 15): UMAP local neighborhood size
min_dist (float, default 0.1): Minimum distance between embedded points
metric (str, default 'euclidean')
plot (bool, default True)
Returns: dict with keys 'transformed', 'umap_model', 'scaler',
'n_neighbors', 'min_dist'.
Requires: umap-learn (pip install umap-learn), matplotlib
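The reduction methods accept color_by='top_group', which labels each molecule by the PFAS group with the highest match count, or a list of per-molecule label strings. A sketch of building such a list by hand follows. The per-molecule count dictionaries, the fallback label, and the helper top_group_labels are hypothetical; how the library breaks ties or labels molecules without matches is not specified here.

```python
def top_group_labels(per_molecule_counts, fallback="none"):
    """Return one label per molecule: the group with the highest count.

    Ties go to the group listed first; molecules without matches get `fallback`.
    """
    labels = []
    for counts in per_molecule_counts:
        if counts:
            labels.append(max(counts, key=counts.get))
        else:
            labels.append(fallback)
    return labels


# Made-up match counts (group name -> count) for three molecules.
counts = [
    {"Perfluoroalkyl": 2, "Polyfluoroalkyl": 1},
    {},
    {"Polyfluoroalkyl": 3, "Perfluoroalkyl": 3},
]

labels = top_group_labels(counts)
print(labels)  # ['Perfluoroalkyl', 'none', 'Polyfluoroalkyl']
```

A list built this way has one string per molecule, which is the shape the color_by parameter expects when it is not 'top_group' or None.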
Statistical comparison
compare_kld
Compute KL divergence between the group-occurrence frequencies of two sets:
results_a = parse_smiles(set_a)
results_b = parse_smiles(set_b)
kld = results_a.compare_kld(results_b, method='minmax')
# kld: float — normalised symmetric KLD (lower = more similar distributions)
Parameters:
other (PFASEmbeddingSet): The comparison set
method (str, default 'minmax'):
- 'minmax': normalised symmetric KLD ∈ [0, 1]
- 'forward': KL(self ‖ other)
- 'reverse': KL(other ‖ self)
- 'symmetric': average of forward + reverse
Returns: float
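The 'forward', 'reverse', and 'symmetric' variants can be sketched with the textbook definition of KL divergence on group-occurrence frequencies. This is an illustration only: the smoothing constant, the frequency definition (share of molecules containing each group, renormalised), and the helpers below are assumptions, and the 'minmax' normalisation is not reproduced because its formula is not given here.

```python
import math


def group_frequencies(count_rows, groups, eps=1e-9):
    """Smoothed, normalised occurrence frequency of each group in a set."""
    totals = [sum(1 for row in count_rows if row.get(g, 0) > 0) + eps for g in groups]
    norm = sum(totals)
    return [t / norm for t in totals]


def kl(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def compare(p, q, method="symmetric"):
    forward, reverse = kl(p, q), kl(q, p)
    if method == "forward":
        return forward
    if method == "reverse":
        return reverse
    if method == "symmetric":
        return 0.5 * (forward + reverse)
    raise ValueError(f"unsupported method: {method!r}")


# Made-up match counts for two small sets over three hypothetical groups.
groups = ["A", "B", "C"]
set_a = [{"A": 1}, {"A": 2, "B": 1}, {"B": 1}]
set_b = [{"C": 1}, {"B": 1, "C": 2}, {"C": 1}]

p = group_frequencies(set_a, groups)
q = group_frequencies(set_b, groups)
print(compare(p, q, "forward"), compare(p, q, "reverse"), compare(p, q))
```

Forward and reverse generally differ because KL divergence is asymmetric; the symmetric variant averages the two, and identical distributions give exactly zero.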
Database I/O
to_sql / from_sql
# Save to SQLite
results.to_sql(filename="pfas_results.db")
# Load back
from PFASGroups import PFASEmbeddingSet
results2 = PFASEmbeddingSet.from_sql(filename="pfas_results.db")
# PostgreSQL
results.to_sql(dbname="mydb", user="alice", password="secret", host="localhost")
Component and group-level data are stored in two tables
(components and pfas_groups_in_compound by default).
Pass if_exists='replace' to overwrite existing tables.
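The three if_exists modes follow the usual 'fail' / 'replace' / 'append' convention. The sketch below reproduces that behaviour with the standard sqlite3 module on an in-memory database. The two-column schema and the helpers write_rows and count_rows are hypothetical; the real tables hold more columns and are written by the library itself.

```python
import sqlite3

VALID_MODES = ("fail", "replace", "append")


def write_rows(conn, table, rows, if_exists="append"):
    """Insert rows, handling an existing table according to `if_exists`."""
    if if_exists not in VALID_MODES:
        raise ValueError(f"if_exists must be one of {VALID_MODES}")
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone() is not None
    if exists and if_exists == "fail":
        raise ValueError(f"table {table!r} already exists")
    if exists and if_exists == "replace":
        conn.execute(f'DROP TABLE "{table}"')  # discard old rows and schema
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (smiles TEXT, group_name TEXT)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    conn.commit()


def count_rows(conn, table):
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


TABLE = "pfas_groups_in_compound"
rows = [("FC(F)(F)C(=O)O", "hypothetical group")]
conn = sqlite3.connect(":memory:")

write_rows(conn, TABLE, rows)                       # table created, 1 row
write_rows(conn, TABLE, rows)                       # default 'append': 2 rows
after_append = count_rows(conn, TABLE)
write_rows(conn, TABLE, rows, if_exists="replace")  # table rebuilt: 1 row
after_replace = count_rows(conn, TABLE)
print(after_append, after_replace)  # 2 1
```

With the default 'append', exporting the same results twice duplicates rows, so 'replace' is the safer choice when re-running an analysis against the same database file.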