Group Feature Extraction

extract_group_features() provides a single entry point for extracting structured numerical features from halogen-group detection. It is designed for machine-learning pipelines that require a fixed-length, named feature vector from the six component groups and the generic functional groups of PFASGroups.

Overview

Two parse_mol() calls are made per molecule:

  1. halogens=['H', 'F', 'Cl', 'Br', 'I'] restricted to the six component groups (ids 34, 35, 37, 38, 44, 45) — captures structured halogen-chain information including alkyl chains detected via the 'H' pseudo-halogen.

  2. halogens=['*'] (wildcard) — captures generic functional group matches (ids 29-76). The six component group ids are automatically excluded from wildcard matching.

The result is a GroupFeatureResult dataclass with four named dictionaries and a to_array() method that returns a fixed-length float32 array of shape (66,).

The six component groups

These groups describe the halogenation pattern of carbon-chain or ring components. They are parameterised per halogen type and are the basis for the poly_counts and per_halogen_sizes feature sets.

ID   Category          Name                               Description
34   Perhalogenated    perhalogenated alkyl               All halogen-bearing carbons in the alkyl chain carry one single halogen type (F, Cl, Br, I, or the H pseudo-halogen for un-substituted chains). Match is attributed to that halogen.
35   Polyhalogenated   polyhalogenated alkyl              The alkyl chain has halogen-bearing carbons but carries a mix of halogens or only partial halogenation. Match is counted regardless of which halogen(s) are present.
37   Perhalogenated    perhalogenated aryl compounds      Like group 34, but the halogenated component is aromatic.
38   Polyhalogenated   polyhalogenated aryl compounds     Like group 35, but the halogenated component is aromatic.
44   Perhalogenated    perhalogenated cyclic compounds    Like group 34, but the halogenated component is cyclic (non-aromatic).
45   Polyhalogenated   polyhalogenated cyclic compounds   Like group 35, but the halogenated component is cyclic (non-aromatic).

Note

The six component group ids (34, 35, 37, 38, 44, 45) are excluded from wildcard matching by PFASGroups design. Their contribution to the generic-group feature vector is therefore always zero.
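
For downstream models it can help to know which columns of the feature vector are constant by construction. The following is a minimal sketch, assuming the generic block starts at index 18 and follows ids 29 to 76 in order, as described in the array layout on this page. The variable names are local to the example and are not part of PFASGroups.

```python
# Ids and offsets are copied from this page; the variable names are illustrative only.
COMPONENT_IDS = [34, 35, 37, 38, 44, 45]
GENERIC_START = 18      # first index of the generic block in to_array()
FIRST_GENERIC_ID = 29   # group id stored at GENERIC_START

# Position of each component id inside the 66-element vector.
always_zero = [GENERIC_START + (gid - FIRST_GENERIC_ID) for gid in COMPONENT_IDS]
print(always_zero)  # [23, 24, 26, 27, 33, 34]
```

These six positions could be dropped before training without losing information, since the wildcard call never populates them.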

The H pseudo-halogen

Group 34 (perhalogenated alkyl) is compiled with componentHalogens inferred from the available component SMARTS, which includes the 'Alkyl' component type. This allows PFASGroups to detect un-substituted (all-H) alkyl chains and attribute them to the 'H' pseudo-halogen.

The h_chain_sizes dictionary in GroupFeatureResult is a convenience view of the 'H' column of per_halogen_sizes:

r.h_chain_sizes['alkyl_H']   # == r.per_halogen_sizes['g34_H']
r.h_chain_sizes['aryl_H']    # == r.per_halogen_sizes['g37_H']
r.h_chain_sizes['cyclic_H']  # == r.per_halogen_sizes['g44_H']

Note that h_chain_sizes is not included in to_array(); it is purely for convenient named access.
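
The relationship between the two dictionaries can be sketched in plain Python. The values below are placeholders rather than real PFASGroups output, and H_VIEW is a hypothetical name introduced only for this illustration:

```python
# Placeholder per_halogen_sizes values; not real PFASGroups output.
per_halogen_sizes = {"g34_H": 6.0, "g37_H": 0.0, "g44_H": 0.0, "g34_F": 0.0}

# Component group id behind each h_chain_sizes key, as listed above.
H_VIEW = {"alkyl_H": 34, "aryl_H": 37, "cyclic_H": 44}

# Pick the 'H' entry of each perhalogenated group; default to 0.0 if absent.
h_chain_sizes = {
    key: per_halogen_sizes.get(f"g{gid}_H", 0.0) for key, gid in H_VIEW.items()
}
print(h_chain_sizes)
# {'alkyl_H': 6.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
```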

Generic functional groups (ids 29-76)

The wildcard call captures all 48 functional group ids between 29 and 76. Group names and their IDs:

ID   Name                                          ID   Name
29   acrylate                                      30   acyl halide
31   alcohol                                       32   aldehyde
33   alkene                                        34   perhalogenated alkyl (always 0)
35   polyhalogenated alkyl (always 0)              36   alkyne
37   perhalogenated aryl compounds (always 0)      38   polyhalogenated aryl compounds (always 0)
39   benzodioxole                                  40   benzoyl peroxydes
41   bromide                                       42   carboxylic acid
43   chloride                                      44   perhalogenated cyclic compounds (always 0)
45   polyhalogenated cyclic compounds (always 0)   46   ester
47   ether                                         48   fluoride
49   glucuronate                                   50   iodide
51   ketone                                        52   methacrylate
53   peroxydes                                     54   side-chain aromatics
55   sulfenic acid                                 56   sulfenyl halide
57   sulfinic acid                                 58   sulfinyl amido sulfonic acid
59   sulfonamide                                   60   sulfonamidoethanol
61   sulfonic acid                                 62   sulfonyl halide
63   sulfonyl propanoic acid                       64   sulfuric acid
65   thioester keto dicarboxylic acid              66   thiocyanic acid
67   phosphinic acid                               68   phosphonic acid
69   amide                                         70   amine
71   heterocyclic azine                            72   heterocyclic azole
73   betaine                                       74   glycine
75   trichlorosilane                               76   silane

Array layout

to_array() returns 66 float32 values:

[0]    poly_alkyl           group 35 match count
[1]    poly_aryl            group 38 match count
[2]    poly_cyclic          group 45 match count
[3]    g34_F                max perfluoroalkyl component size
[4]    g34_Cl               max perchloroalkyl component size
[5]    g34_Br               max perbromoalkyl component size
[6]    g34_I                max periodoalkyl component size
[7]    g34_H                max alkyl (H chain) component size
[8]    g37_F                max perfluorinated aryl component size
[9]    g37_Cl               max perchlorinated aryl component size
[10]   g37_Br               max perbrominated aryl component size
[11]   g37_I                max periodinated aryl component size
[12]   g37_H                max aryl (H) component size
[13]   g44_F                max perfluorinated cyclic component size
[14]   g44_Cl               max perchlorinated cyclic component size
[15]   g44_Br               max perbrominated cyclic component size
[16]   g44_I                max periodinated cyclic component size
[17]   g44_H                max cyclic (H) component size
[18]   g29                  acrylate count
[19]   g30                  acyl halide count
...
[65]   g76                  silane count

Feature name labels are available from feature_names().
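
The layout above can be reproduced from the module constants documented on this page. The sketch below rebuilds the 66 labels in plain Python. build_feature_names is a hypothetical helper written for illustration, not the library's implementation of feature_names():

```python
# Constants copied from the values documented on this page.
POLY_NAMES = ["poly_alkyl", "poly_aryl", "poly_cyclic"]
PER_GROUP_IDS = [34, 37, 44]
HALOGENS_ORDER = ["F", "Cl", "Br", "I", "H"]
GENERIC_GROUP_VOCAB = list(range(29, 77))  # ids 29..76 inclusive


def build_feature_names():
    """Rebuild the 66 labels: 3 poly counts, 15 per-halogen sizes, 48 generic ids."""
    names = list(POLY_NAMES)
    names += [f"g{gid}_{hal}" for gid in PER_GROUP_IDS for hal in HALOGENS_ORDER]
    names += [f"g{gid}" for gid in GENERIC_GROUP_VOCAB]
    return names


names = build_feature_names()
print(len(names), names[3], names[17], names[18], names[65])
# 66 g34_F g44_H g29 g76
```

The count follows directly from the three blocks: 3 + (3 × 5) + 48 = 66.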

Quick start

Basic usage with an RDKit molecule or a SMILES string:

from rdkit import Chem
from PFASGroups import extract_group_features

# PFOA-like molecule (short-chain analogue: perfluorobutanoic acid, C4)
mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(=O)O")
r = extract_group_features(mol)

print(r)
# GroupFeatureResult(poly_counts=1 nonzero, per_halogen_sizes=1 nonzero, generic_groups=2 nonzero)

print(r.poly_counts)
# {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}

print(r.per_halogen_sizes['g34_F'])  # largest perfluoroalkyl chain
# 4.0

arr = r.to_array()
print(arr.shape, arr.dtype)
# (66,) float32

SMILES strings are also accepted directly:

r = extract_group_features("CCCCCCCC")  # octane
print(r.h_chain_sizes)
# {'alkyl_H': 6.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
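
For a machine-learning pipeline, the per-molecule arrays are typically stacked into one matrix. The helper below is a sketch under stated assumptions: featurize is a hypothetical name, and the extractor is passed in as an argument (in practice extract_group_features) so that the example runs without PFASGroups installed. The stand-in class only mimics to_array() and returns zeros.

```python
def featurize(smiles_list, extract):
    """Return one row of features per SMILES string.

    `extract` is any callable returning an object with to_array(),
    for example PFASGroups.extract_group_features.
    """
    return [list(extract(smi).to_array()) for smi in smiles_list]


class _FakeResult:
    """Stand-in for GroupFeatureResult, used only to make this sketch runnable."""

    def to_array(self):
        return [0.0] * 66  # same length as the documented feature vector


rows = featurize(["CCCCCCCC", "FC(F)(F)C(=O)O"], lambda smi: _FakeResult())
print(len(rows), len(rows[0]))  # 2 66
```

With PFASGroups and NumPy available, passing extract_group_features as the extractor and wrapping the rows with numpy.stack should give a matrix of shape (n_molecules, 66), because every call returns a vector of the same length.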

Worked examples

from rdkit import Chem
from PFASGroups import extract_group_features, GENERIC_GROUP_NAMES

molecules = {
    # The "PFOA" and "PFOS" entries are short-chain analogues (C4 acid, C3 sulfonic acid).
    "PFOA":            "FC(F)(F)C(F)(F)C(F)(F)C(=O)O",
    "PFOS":            "FC(F)(F)C(F)(F)C(F)(F)S(=O)(=O)O",
    "mixed":           "CCCCC(F)(F)C(F)(F)C(=O)O",  # CF2 segment + alkyl tail
    "octane":          "CCCCCCCC",
    "hexafluorobenzene": "Fc1c(F)c(F)c(F)c(F)c1F",
    "perchloroalkyl":  "ClC(Cl)(Cl)C(Cl)(Cl)C(Cl)(Cl)Cl",
}

for name, smi in molecules.items():
    r = extract_group_features(smi)
    print(name)
    print(f"  poly:       {r.poly_counts}")
    print(f"  g34(alkyl): F={r.per_halogen_sizes['g34_F']}, H={r.per_halogen_sizes['g34_H']}")
    print(f"  g37(aryl):  F={r.per_halogen_sizes['g37_F']}")
    nz_gen = {GENERIC_GROUP_NAMES[int(k[1:])]: v for k, v in r.generic_groups.items() if v}
    if nz_gen:
        print(f"  generic:    {nz_gen}")
    print()

Expected output:

PFOA
  poly:       {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=4.0, H=0.0
  g37(aryl):  F=0.0
  generic:    {'carboxylic acid': 1.0, 'fluoride': 1.0}

PFOS
  poly:       {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=3.0, H=0.0
  g37(aryl):  F=0.0
  generic:    {'fluoride': 1.0, 'sulfonic acid': 1.0}

mixed
  poly:       {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=2.0, H=3.0
  g37(aryl):  F=0.0
  generic:    {'carboxylic acid': 1.0, 'fluoride': 1.0}

octane
  poly:       {'poly_alkyl': 0.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=0.0, H=6.0
  g37(aryl):  F=0.0

hexafluorobenzene
  poly:       {'poly_alkyl': 0.0, 'poly_aryl': 1.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=0.0, H=0.0
  g37(aryl):  F=6.0
  generic:    {'fluoride': 1.0}

perchloroalkyl
  poly:       {'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
  g34(alkyl): F=0.0, H=0.0
  g37(aryl):  F=0.0
  generic:    {'chloride': 1.0}

Using feature names

from PFASGroups import extract_group_features
from rdkit import Chem

mol = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(=O)O")
r = extract_group_features(mol)
names = r.feature_names()   # list of 66 labels
arr = r.to_array()

# Non-zero features with their labels
for name, val in zip(names, arr):
    if val:
        print(f"{name}: {val}")
# poly_alkyl: 1.0
# g34_F: 3.0
# g42: 1.0   (carboxylic acid)
# g48: 1.0   (fluoride)
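
Generic labels such as g42 are easier to read next to their group names. A minimal sketch, using a two-entry excerpt of GENERIC_GROUP_NAMES so that it runs standalone; readable is a hypothetical helper, not part of PFASGroups:

```python
# Excerpt of GENERIC_GROUP_NAMES from this page; the real dict covers ids 29-76.
GENERIC_GROUP_NAMES = {42: "carboxylic acid", 48: "fluoride"}


def readable(label):
    """Append the group name to generic labels such as 'g42'; leave others unchanged."""
    if label.startswith("g") and label[1:].isdigit():
        name = GENERIC_GROUP_NAMES.get(int(label[1:]))
        return f"{label} ({name})" if name else label
    return label


for label in ["poly_alkyl", "g34_F", "g42", "g48"]:
    print(readable(label))
# poly_alkyl
# g34_F
# g42 (carboxylic acid)
# g48 (fluoride)
```

Per-halogen labels such as g34_F pass through untouched because the part after the leading g is not purely numeric.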

API Reference

PFASGroups.extract_group_features(mol: Mol | str) -> GroupFeatureResult

Extract structured halogen-group features for a single molecule.

Two parse_mol() calls are made internally:

  1. halogens=['H','F','Cl','Br','I'] with the six component groups (ids 34, 35, 37, 38, 44, 45) — populates poly_counts and per_halogen_sizes.

  2. halogens=['*'] (wildcard) — populates generic_groups.

Parameters:

mol (rdkit.Chem.Mol or str) – RDKit molecule object or a SMILES string.

Returns:

Populated result object. Call to_array() to obtain a fixed-length float32 array of shape (66,).

Return type:

GroupFeatureResult

Raises:

Examples

>>> from rdkit import Chem
>>> from PFASGroups import extract_group_features
>>> pfoa = Chem.MolFromSmiles("FC(F)(F)C(F)(F)C(F)(F)C(=O)O")
>>> r = extract_group_features(pfoa)
>>> r.poly_counts
{'poly_alkyl': 1.0, 'poly_aryl': 0.0, 'poly_cyclic': 0.0}
>>> r.per_halogen_sizes['g34_F']
4.0
>>> r.h_chain_sizes
{'alkyl_H': 0.0, 'aryl_H': 0.0, 'cyclic_H': 0.0}
>>> arr = r.to_array()
>>> arr.shape
(66,)
>>> # Octane — only H alkyl chain detected
>>> octane = Chem.MolFromSmiles("CCCCCCCC")
>>> r2 = extract_group_features(octane)
>>> r2.h_chain_sizes['alkyl_H']
6.0

class PFASGroups.GroupFeatureResult(poly_counts: dict[str, float]=<factory>, per_halogen_sizes: dict[str, float]=<factory>, h_chain_sizes: dict[str, float]=<factory>, generic_groups: dict[str, float]=<factory>)

Structured feature extraction result for a single molecule.

This dataclass groups features from two parse_mol() calls into four semantically distinct dictionaries. It is returned by extract_group_features().

poly_counts

Match counts for polyhalogenated groups (ids 35, 38, 45), aggregated across all halogens (F, Cl, Br, I, H). Keys: 'poly_alkyl', 'poly_aryl', 'poly_cyclic'.

Type:

dict[str, float]

per_halogen_sizes

Maximum carbon-component size for perhalogenated groups (ids 34, 37, 44) resolved per halogen. Keys follow the pattern 'g{id}_{hal}' where id is in {34, 37, 44} and hal is in ['F', 'Cl', 'Br', 'I', 'H']. Zero when no match is found.

Type:

dict[str, float]
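
The key pattern 'g{id}_{hal}' can be taken apart again when a group id and halogen are needed separately, for instance to group features by halogen. A small sketch; split_size_key is a hypothetical helper introduced here for illustration:

```python
def split_size_key(key):
    """Split a per_halogen_sizes key such as 'g34_Cl' into (34, 'Cl')."""
    group, halogen = key.split("_", 1)
    return int(group[1:]), halogen  # drop the leading 'g' before converting


print(split_size_key("g34_Cl"))  # (34, 'Cl')
print(split_size_key("g44_H"))   # (44, 'H')
```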

h_chain_sizes

Convenience view of the 'H' column of per_halogen_sizes: the largest un-substituted alkyl / aryl / cyclic component detected via the H pseudo-halogen mechanism. Keys: 'alkyl_H', 'aryl_H', 'cyclic_H'. Not included in to_array().

Type:

dict[str, float]

generic_groups

Wildcard functional-group match counts for group ids 29-76. Keys: 'g29' … 'g76'. Ids {34, 35, 37, 38, 44, 45} are excluded from wildcard matching and will always be zero.

Type:

dict[str, float]

poly_counts: dict[str, float]
per_halogen_sizes: dict[str, float]
h_chain_sizes: dict[str, float]
generic_groups: dict[str, float]

to_array() -> ndarray

Return a fixed-length float32 array of shape (66,).

Layout:

[0:3]   poly_counts    (poly_alkyl, poly_aryl, poly_cyclic)
[3:18]  per_halogen_sizes  (g34_F…g44_H, 3 groups × 5 halogens)
[18:66] generic_groups  (g29…g76, 48 entries)

Note

h_chain_sizes is excluded from this array (it is a strict subset of per_halogen_sizes).
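
Given the slice boundaries above, a feature vector can be split back into its three blocks. The sketch below uses a plain list as dummy data in place of a real to_array() result. BLOCKS and split_blocks are hypothetical names, not part of the library:

```python
# Slice boundaries taken from the layout above.
BLOCKS = {
    "poly_counts": slice(0, 3),
    "per_halogen_sizes": slice(3, 18),
    "generic_groups": slice(18, 66),
}


def split_blocks(vector):
    """Split a 66-element feature vector into its three documented blocks."""
    if len(vector) != 66:
        raise ValueError(f"expected 66 values, got {len(vector)}")
    return {name: vector[s] for name, s in BLOCKS.items()}


parts = split_blocks(list(range(66)))  # dummy data standing in for to_array()
print({name: len(block) for name, block in parts.items()})
# {'poly_counts': 3, 'per_halogen_sizes': 15, 'generic_groups': 48}
```

Because the function only relies on len() and slicing, the same code should work on a NumPy array as well as on a list.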

feature_names() -> list[str]

Return the 66 feature names in the same order as to_array().

Returns:

Labels: ['poly_alkyl', 'poly_aryl', 'poly_cyclic', 'g34_F', 'g34_Cl', …, 'g44_H', 'g29', 'g30', …, 'g76'].

Return type:

list[str]

__init__(poly_counts: dict[str, float]=<factory>, per_halogen_sizes: dict[str, float]=<factory>, h_chain_sizes: dict[str, float]=<factory>, generic_groups: dict[str, float]=<factory>) -> None

PFASGroups.PER_GROUP_IDS = [34, 37, 44]

Ids of the perhalogenated component groups (alkyl, aryl, cyclic). These are the groups resolved per halogen in per_halogen_sizes.

PFASGroups.POLY_GROUP_IDS = [35, 38, 45]

Ids of the polyhalogenated component groups (alkyl, aryl, cyclic). These are the groups counted in poly_counts.

PFASGroups.HALOGENS_ORDER = ['F', 'Cl', 'Br', 'I', 'H']

Halogen order, including the 'H' pseudo-halogen, matching the order of the per_halogen_sizes entries shown in the array layout.

PFASGroups.GENERIC_GROUP_VOCAB = [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76]

Group ids 29-76 that make up generic_groups, in the order shown in the array layout.

PFASGroups.GENERIC_GROUP_NAMES = {29: 'acrylate', 30: 'acyl halide', 31: 'alcohol', 32: 'aldehyde', 33: 'alkene', 34: 'perhalogenated alkyl', 35: 'polyhalogenated alkyl', 36: 'alkyne', 37: 'perhalogenated aryl compounds', 38: 'polyhalogenated aryl compounds', 39: 'benzodioxole', 40: 'benzoyl peroxydes', 41: 'bromide', 42: 'carboxylic acid', 43: 'chloride', 44: 'perhalogenated cyclic compounds', 45: 'polyhalogenated cyclic compounds', 46: 'ester', 47: 'ether', 48: 'fluoride', 49: 'glucuronate', 50: 'iodide', 51: 'ketone', 52: 'methacrylate', 53: 'peroxydes', 54: 'side-chain aromatics', 55: 'sulfenic acid', 56: 'sulfenyl halide', 57: 'sulfinic acid', 58: 'sulfinyl amido sulfonic acid', 59: 'sulfonamide', 60: 'sulfonamidoethanol', 61: 'sulfonic acid', 62: 'sulfonyl halide', 63: 'sulfonyl propanoic acid', 64: 'sulfuric acid', 65: 'thioester keto dicarboxylic acid', 66: 'thiocyanic acid', 67: 'phosphinic acid', 68: 'phosphonic acid', 69: 'amide', 70: 'amine', 71: 'heterocyclic azine', 72: 'heterocyclic azole', 73: 'betaine', 74: 'glycine', 75: 'trichlorosilane', 76: 'silane'}

Mapping from group id to group name for the generic group ids 29-76.