Group Feature Extraction
extract_group_features() provides a single entry point for
extracting structured numerical features from halogen-group detection. It
is designed for machine-learning pipelines that require a fixed-length,
named feature vector from the six component groups and the generic
functional groups of PFASGroups.
Overview
Two parse_mol() calls are made per molecule:
- halogens=['H', 'F', 'Cl', 'Br', 'I'], restricted to the six component groups (ids 34, 35, 37, 38, 44, 45): captures structured halogen-chain information, including alkyl chains detected via the 'H' pseudo-halogen.
- halogens=['*'] (wildcard): captures generic functional group matches (ids 29-76). The six component group ids are automatically excluded from wildcard matching.
The result is a GroupFeatureResult dataclass with four
named dictionaries and a to_array()
method that returns a fixed-length float32 array of shape (66,).
The six component groups
These groups describe the halogenation pattern of carbon-chain or ring
components. They are parameterised per halogen type and are the basis for
the poly_counts and
per_halogen_sizes feature sets.
| ID | Category | Name | Description |
|---|---|---|---|
| 34 | Perhalogenated | perhalogenated alkyl | All halogen-bearing carbons in the alkyl chain carry one single halogen type (F, Cl, Br, I, or the H pseudo-halogen for un-substituted chains). Match is attributed to that halogen. |
| 35 | Polyhalogenated | polyhalogenated alkyl | The alkyl chain has halogen-bearing carbons but carries a mix of halogens or only partial halogenation. Match is counted regardless of which halogen(s) are present. |
| 37 | Perhalogenated | perhalogenated aryl compounds | Like group 34, but the halogenated component is aromatic. |
| 38 | Polyhalogenated | polyhalogenated aryl compounds | Like group 35, but the halogenated component is aromatic. |
| 44 | Perhalogenated | perhalogenated cyclic compounds | Like group 34, but the halogenated component is cyclic (non-aromatic). |
| 45 | Polyhalogenated | polyhalogenated cyclic compounds | Like group 35, but the halogenated component is cyclic (non-aromatic). |
Note
The six component group ids (34, 35, 37, 38, 44, 45) are excluded from wildcard matching by PFASGroups design. Their contribution to the generic-group feature vector is therefore always zero.
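Because these ids still occupy slots in the 48-entry generic block, the corresponding columns are constant and can be dropped before model fitting. The following is a minimal sketch, assuming the layout documented under "Array layout" (generic block starting at index 18, ordered by ascending id). The constants are re-declared locally and `generic_index` is a hypothetical helper, not part of PFASGroups.

```python
# Constants re-declared locally for illustration; values are taken from this page.
COMPONENT_IDS = [34, 35, 37, 38, 44, 45]
GENERIC_OFFSET = 18      # 3 poly counts + 15 per-halogen sizes come first
FIRST_GENERIC_ID = 29
N_FEATURES = 66


def generic_index(group_id):
    """Array index of generic group `group_id` (hypothetical helper)."""
    return GENERIC_OFFSET + (group_id - FIRST_GENERIC_ID)


# Columns that are zero by construction, and the columns that remain.
always_zero = [generic_index(g) for g in COMPONENT_IDS]
keep = [i for i in range(N_FEATURES) if i not in always_zero]

print(always_zero)  # [23, 24, 26, 27, 33, 34]
print(len(keep))    # 60
```

With a NumPy feature matrix `X` of shape `(n, 66)`, `X[:, keep]` would select the remaining 60 columns.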
The H pseudo-halogen
Group 34 (perhalogenated alkyl) is compiled with componentHalogens
inferred from the available component SMARTS, which includes the 'Alkyl'
component type. This allows PFASGroups to detect un-substituted (all-H)
alkyl chains and attribute them to the 'H' pseudo-halogen.
The h_chain_sizes dictionary in GroupFeatureResult
is a convenience view of the 'H' column of per_halogen_sizes:
r.h_chain_sizes['alkyl_H'] # == r.per_halogen_sizes['g34_H']
r.h_chain_sizes['aryl_H'] # == r.per_halogen_sizes['g37_H']
r.h_chain_sizes['cyclic_H'] # == r.per_halogen_sizes['g44_H']
Note that h_chain_sizes is not included in
to_array(); it is purely for convenient
named access.
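The relationship between the two dictionaries can be sketched with plain dicts. This is an illustration only: `per_halogen_sizes` below is a hand-built stand-in for `r.per_halogen_sizes` (the 6.0 mirrors the octane example on this page), and `H_VIEW` is a hypothetical name for the key mapping shown above.

```python
# Hand-built stand-in for r.per_halogen_sizes; not produced by PFASGroups.
per_halogen_sizes = {
    f"g{gid}_{hal}": 0.0
    for gid in (34, 37, 44)
    for hal in ("F", "Cl", "Br", "I", "H")
}
per_halogen_sizes["g34_H"] = 6.0  # value mirrors the octane example

# Convenience key -> underlying per-halogen key (hypothetical name).
H_VIEW = {"alkyl_H": "g34_H", "aryl_H": "g37_H", "cyclic_H": "g44_H"}

# The view only re-labels three of the fifteen entries; it adds no new data.
h_chain_sizes = {name: per_halogen_sizes[key] for name, key in H_VIEW.items()}

print(h_chain_sizes)  # {'alkyl_H': 6.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
```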
Generic functional groups (ids 29-76)
The wildcard call covers the 48 group ids from 29 to 76 inclusive; the six component ids keep their slots but are always zero. Group names and their IDs:
| ID | Name | ID | Name |
|---|---|---|---|
| 29 | acrylate | 30 | acyl halide |
| 31 | alcohol | 32 | aldehyde |
| 33 | alkene | 34 | perhalogenated alkyl (always 0) |
| 35 | polyhalogenated alkyl (always 0) | 36 | alkyne |
| 37 | perhalogenated aryl compounds (always 0) | 38 | polyhalogenated aryl compounds (always 0) |
| 39 | benzodioxole | 40 | benzoyl peroxydes |
| 41 | bromide | 42 | carboxylic acid |
| 43 | chloride | 44 | perhalogenated cyclic compounds (always 0) |
| 45 | polyhalogenated cyclic compounds (always 0) | 46 | ester |
| 47 | ether | 48 | fluoride |
| 49 | glucuronate | 50 | iodide |
| 51 | ketone | 52 | methacrylate |
| 53 | peroxydes | 54 | side-chain aromatics |
| 55 | sulfenic acid | 56 | sulfenyl halide |
| 57 | sulfinic acid | 58 | sulfinyl amido sulfonic acid |
| 59 | sulfonamide | 60 | sulfonamidoethanol |
| 61 | sulfonic acid | 62 | sulfonyl halide |
| 63 | sulfonyl propanoic acid | 64 | sulfuric acid |
| 65 | thioester keto dicarboxylic acid | 66 | thiocyanic acid |
| 67 | phosphinic acid | 68 | phosphonic acid |
| 69 | amide | 70 | amine |
| 71 | heterocyclic azine | 72 | heterocyclic azole |
| 73 | betaine | 74 | glycine |
| 75 | trichlorosilane | 76 | silane |
Array layout
to_array() returns 66 float32 values:
[0] poly_alkyl group 35 match count
[1] poly_aryl group 38 match count
[2] poly_cyclic group 45 match count
[3] g34_F max perfluoroalkyl component size
[4] g34_Cl max perchloroalkyl component size
[5] g34_Br max perbromoalkyl component size
[6] g34_I max periodoalkyl component size
[7] g34_H max alkyl (H chain) component size
[8] g37_F max perfluorinated aryl component size
[9] g37_Cl max perchlorinated aryl component size
[10] g37_Br max perbrominated aryl component size
[11] g37_I max periodinated aryl component size
[12] g37_H max aryl (H) component size
[13] g44_F max perfluorinated cyclic component size
[14] g44_Cl max perchlorinated cyclic component size
[15] g44_Br max perbrominated cyclic component size
[16] g44_I max periodinated cyclic component size
[17] g44_H max cyclic (H) component size
[18] g29 acrylate count
[19] g30 acyl halide count
...
[65] g76 silane count
Feature name labels are available from
feature_names().
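The layout above can be reproduced from the module constants, which is handy for looking up a column by name or slicing out one block. This is a sketch under the assumption that labels follow the `g{id}_{hal}` and `g{id}` patterns shown on this page. The constants are re-declared locally, and `index_of` and `BLOCKS` are hypothetical names; in real code, prefer `feature_names()`.

```python
# Re-declared locally so the snippet runs without PFASGroups installed.
PER_GROUP_IDS = [34, 37, 44]
HALOGENS_ORDER = ["F", "Cl", "Br", "I", "H"]
GENERIC_GROUP_VOCAB = list(range(29, 77))  # ids 29..76 inclusive

# Same ordering as the layout listing: poly counts, per-halogen sizes, generic.
names = (
    ["poly_alkyl", "poly_aryl", "poly_cyclic"]
    + [f"g{gid}_{hal}" for gid in PER_GROUP_IDS for hal in HALOGENS_ORDER]
    + [f"g{gid}" for gid in GENERIC_GROUP_VOCAB]
)
index_of = {name: i for i, name in enumerate(names)}

# Slices for the three blocks of the 66-element vector.
BLOCKS = {
    "poly_counts": slice(0, 3),
    "per_halogen_sizes": slice(3, 18),
    "generic_groups": slice(18, 66),
}

print(len(names))                                           # 66
print(index_of["g34_H"], index_of["g42"], index_of["g76"])  # 7 31 65
print(names[BLOCKS["per_halogen_sizes"]][:2])               # ['g34_F', 'g34_Cl']
```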
Quick start
Basic usage with an RDKit molecule or a SMILES string:
from rdkit import Chem
from PFASGroups import extract_group_features
# PFOA-like molecule (this SMILES is the C4 analogue, perfluorobutanoic acid)
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(=O)O")
r = extract_group_features(mol)
print(r)
# GroupFeatureResult(poly_counts=1 nonzero, per_halogen_sizes=1 nonzero, generic_groups=2 nonzero)
print(r.poly_counts)
# {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
print(r.per_halogen_sizes['g34_F']) # largest perfluoroalkyl chain
# 4.0
arr = r.to_array()
print(arr.shape, arr.dtype)
# (66,) float32
SMILES strings are also accepted directly:
r = extract_group_features("CCCCCCCC") # octane
print(r.h_chain_sizes)
# {'alkyl_H': 6.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
Worked examples
from rdkit import Chem
from PFASGroups import extract_group_features, GENERIC_GROUP_NAMES
# Labels follow the original example; the "PFOA" and "PFOS" SMILES are short-chain analogues.
molecules = {
    "PFOA": "FC(F)(F)C(F)(F)C(F)(F)C(=O)O",
    "PFOS": "FC(F)(F)C(F)(F)C(F)(F)S(=O)(=O)O",
    "mixed": "CCCCC(F)(F)C(F)(F)C(=O)O",  # CF2 segment + alkyl tail
    "octane": "CCCCCCCC",
    "hexafluorobenzene": "Fc1c(F)c(F)c(F)c(F)c1F",
    "perchloroalkyl": "ClC(Cl)(Cl)C(Cl)(Cl)C(Cl)(Cl)Cl",
}
for name, smi in molecules.items():
    r = extract_group_features(smi)
    print(f"{name}")
    print(f"  poly: {r.poly_counts}")
    print(f"  g34(alkyl): F={r.per_halogen_sizes['g34_F']}, H={r.per_halogen_sizes['g34_H']}")
    print(f"  g37(aryl): F={r.per_halogen_sizes['g37_F']}")
    # Map generic keys such as 'g42' back to group names, keeping non-zero counts only
    nz_gen = {GENERIC_GROUP_NAMES[int(k[1:])]: v for k, v in r.generic_groups.items() if v}
    if nz_gen:
        print(f"  generic: {nz_gen}")
    print()
Expected output:
PFOA
poly: {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
g34(alkyl): F=4.0, H=0.0
g37(aryl): F=0.0
generic: {'carboxylic acid': 1.0, 'fluoride': 1.0}
PFOS
poly: {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
g34(alkyl): F=3.0, H=0.0
g37(aryl): F=0.0
generic: {'fluoride': 1.0, 'sulfonic acid': 1.0}
mixed
poly: {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
g34(alkyl): F=2.0, H=3.0
g37(aryl): F=0.0
generic: {'carboxylic acid': 1.0, 'fluoride': 1.0}
octane
poly: {'poly_alkyl': 0.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
g34(alkyl): F=0.0, H=6.0
g37(aryl): F=0.0
hexafluorobenzene
poly: {'poly_alkyl': 0.0, 'poly_aryl': 1.0, 'poly_cyclic': 0.0}
g34(alkyl): F=0.0, H=0.0
g37(aryl): F=6.0
generic: {'fluoride': 1.0}
perchloroalkyl
poly: {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
g34(alkyl): F=0.0, H=0.0
g37(aryl): F=0.0
generic: {'chloride': 1.0}
Using feature names
from PFASGroups import extract_group_features
from rdkit import Chem
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(=O)O")
r = extract_group_features(mol)
names = r.feature_names() # list of 66 labels
arr = r.to_array()
# Non-zero features with their labels
for name, val in zip(names, arr):
    if val:
        print(f"{name}: {val}")
# poly_alkyl: 1.0
# g34_F: 3.0
# g42: 1.0 (carboxylic acid)
# g48: 1.0 (fluoride)
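The group names in parentheses above are annotations rather than part of the labels. One way to attach them programmatically is sketched below. Note that only labels of the form `g<id>` refer to generic groups; labels such as `g34_F` or `poly_alkyl` must be left untouched. `describe` and `NAMES_SUBSET` are hypothetical; in real code the full `GENERIC_GROUP_NAMES` mapping would be passed instead of the two-entry subset used here.

```python
import re

# Matches generic-group labels such as 'g42', but not 'g34_F' or 'poly_alkyl'.
GENERIC_LABEL = re.compile(r"^g(\d+)$")

# Two entries copied from GENERIC_GROUP_NAMES, enough for the demonstration.
NAMES_SUBSET = {42: "carboxylic acid", 48: "fluoride"}


def describe(label, group_names):
    """Append the group name to a generic label; return other labels unchanged."""
    match = GENERIC_LABEL.match(label)
    if match is None:
        return label
    group_id = int(match.group(1))
    return f"{label} ({group_names.get(group_id, 'unknown')})"


for label in ["poly_alkyl", "g34_F", "g42", "g48"]:
    print(describe(label, NAMES_SUBSET))
# poly_alkyl
# g34_F
# g42 (carboxylic acid)
# g48 (fluoride)
```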
API Reference
- PFASGroups.extract_group_features(mol: Mol | str) -> GroupFeatureResult
Extract structured halogen-group features for a single molecule.
Two parse_mol() calls are made internally:
  - halogens=['H','F','Cl','Br','I'] with the six component groups (ids 34, 35, 37, 38, 44, 45): populates poly_counts and per_halogen_sizes.
  - halogens=['*'] (wildcard): populates generic_groups.
- Parameters:
mol (rdkit.Chem.Mol or str) – RDKit molecule object or a SMILES string.
- Returns:
Populated result object. Call to_array() to obtain a fixed-length float32 array of shape (66,).
- Return type:
GroupFeatureResult
- Raises:
ValueError – If a SMILES string cannot be parsed by RDKit.
RuntimeError – If an unexpected error occurs during
parse_mol().
Examples
>>> from rdkit import Chem
>>> from PFASGroups import extract_group_features
>>> pfoa = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(=O)O")
>>> r = extract_group_features(pfoa)
>>> r.poly_counts
{'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
>>> r.per_halogen_sizes['g34_F']
4.0
>>> r.h_chain_sizes
{'alkyl_H': 0.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
>>> arr = r.to_array()
>>> arr.shape
(66,)
>>> # Octane: only the H alkyl chain is detected
>>> octane = Chem.MolFromSmiles("CCCCCCCC")
>>> r2 = extract_group_features(octane)
>>> r2.h_chain_sizes['alkyl_H']
6.0
- class PFASGroups.GroupFeatureResult(poly_counts: dict[str, float]=<factory>, per_halogen_sizes: dict[str, float]=<factory>, h_chain_sizes: dict[str, float]=<factory>, generic_groups: dict[str, float]=<factory>)
Structured feature extraction result for a single molecule.
This dataclass groups features from two parse_mol() calls into four semantically distinct dictionaries. It is returned by extract_group_features().
- poly_counts
Match counts for polyhalogenated groups (ids 35, 38, 45), aggregated across all halogens (F, Cl, Br, I, H). Keys: 'poly_alkyl', 'poly_aryl', 'poly_cyclic'.
- per_halogen_sizes
Maximum carbon-component size for perhalogenated groups (ids 34, 37, 44), resolved per halogen. Keys follow the pattern 'g{id}_{hal}', where id is in {34, 37, 44} and hal is in ['F', 'Cl', 'Br', 'I', 'H']. Zero when no match is found.
- h_chain_sizes
Convenience view of the 'H' column of per_halogen_sizes: the largest un-substituted alkyl / aryl / cyclic component detected via the H pseudo-halogen mechanism. Keys: 'alkyl_H', 'aryl_H', 'cyclic_H'. Not included in to_array().
- generic_groups
Wildcard functional-group match counts for group ids 29-76. Keys: 'g29' … 'g76'. Ids {34, 35, 37, 38, 44, 45} are excluded from wildcard matching and will always be zero.
- to_array() -> ndarray
Return a fixed-length float32 array of shape (66,).
Layout:
[0:3] poly_counts (poly_alkyl, poly_aryl, poly_cyclic)
[3:18] per_halogen_sizes (g34_F…g44_H, 3 groups × 5 halogens)
[18:66] generic_groups (g29…g76, 48 entries)
Note
h_chain_sizes is excluded from this array (it is a strict subset of per_halogen_sizes).
- feature_names() -> list[str]
Return the 66 feature names in the same order as
to_array().
- PFASGroups.PER_GROUP_IDS = [34, 37, 44]
- PFASGroups.POLY_GROUP_IDS = [35, 38, 45]
- PFASGroups.HALOGENS_ORDER = ['F', 'Cl', 'Br', 'I', 'H']
- PFASGroups.GENERIC_GROUP_VOCAB = [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76]
- PFASGroups.GENERIC_GROUP_NAMES = {29: 'acrylate', 30: 'acyl halide', 31: 'alcohol', 32: 'aldehyde', 33: 'alkene', 34: 'perhalogenated alkyl', 35: 'polyhalogenated alkyl', 36: 'alkyne', 37: 'perhalogenated aryl compounds', 38: 'polyhalogenated aryl compounds', 39: 'benzodioxole', 40: 'benzoyl peroxydes', 41: 'bromide', 42: 'carboxylic acid', 43: 'chloride', 44: 'perhalogenated cyclic compounds', 45: 'polyhalogenated cyclic compounds', 46: 'ester', 47: 'ether', 48: 'fluoride', 49: 'glucuronate', 50: 'iodide', 51: 'ketone', 52: 'methacrylate', 53: 'peroxydes', 54: 'side-chain aromatics', 55: 'sulfenic acid', 56: 'sulfenyl halide', 57: 'sulfinic acid', 58: 'sulfinyl amido sulfonic acid', 59: 'sulfonamide', 60: 'sulfonamidoethanol', 61: 'sulfonic acid', 62: 'sulfonyl halide', 63: 'sulfonyl propanoic acid', 64: 'sulfuric acid', 65: 'thioester keto dicarboxylic acid', 66: 'thiocyanic acid', 67: 'phosphinic acid', 68: 'phosphonic acid', 69: 'amide', 70: 'amine', 71: 'heterocyclic azine', 72: 'heterocyclic azole', 73: 'betaine', 74: 'glycine', 75: 'trichlorosilane', 76: 'silane'}