Deep Dive into Cheminformatics Data Structures
Exploring molecular representations, fingerprints, and advanced structural analysis with RDKit.
In this post, we will explore the core data structures that power modern cheminformatics. Understanding these is crucial for building robust chemical databases and analysis pipelines.
Molecular Representations
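Representing a molecular structure in a computer-readable format is a non-trivial task, and several standards have emerged over the decades. The line notations covered here encode atoms, bonds, and stereochemistry rather than 3D coordinates.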
SMILES and SMARTS
SMILES (Simplified Molecular Input Line Entry System) is the most common way to represent molecular structures as strings.
Canonical SMILES
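A molecule can have multiple valid SMILES strings, depending on which atom you start from and how you traverse the rings. Canonicalization picks one of them deterministically, so the same molecule always yields the same string. Canonical forms are specific to a toolkit (and sometimes to its version), so compare only strings produced by the same software.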
from rdkit import Chem
# Two different SMILES for the same molecule (Toluene)
s1 = "Cc1ccccc1"
s2 = "c1ccccc1C"
m1 = Chem.MolFromSmiles(s1)
m2 = Chem.MolFromSmiles(s2)
# Canonical SMILES will be identical
print(Chem.MolToSmiles(m1)) # "Cc1ccccc1"
print(Chem.MolToSmiles(m2)) # "Cc1ccccc1"
SMARTS for Substructure Search
SMARTS is an extension of SMILES used for specifying substructure patterns. It allows for “wildcards” and complex logical operations.
# Match an aromatic atom bonded to a hydroxyl oxygen.
# Hydrogens must be written inside brackets in SMARTS; "a-OH" does not parse.
pattern = Chem.MolFromSmarts("a[OX2H]")
molecule = Chem.MolFromSmiles("c1ccccc1O")
if molecule.HasSubstructMatch(pattern):
    print("Phenol-like substructure found!")
InChI and InChIKey
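The International Chemical Identifier (InChI) is a standardized, layered representation (formula, connectivity, hydrogens, charge, stereochemistry). Because it is defined independently of any one toolkit, it is more dependable than toolkit-specific canonical SMILES for identity checking across systems. The InChIKey is a fixed-length, 27-character hash of the InChI, ideal for database indexing and exact-match lookups.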
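As a rough illustration, the sketch below derives both identifiers from the two toluene SMILES used earlier. It assumes an RDKit build with InChI support; the helper name `identifiers` is made up for this example.

```python
from rdkit import Chem
from rdkit.Chem.inchi import MolToInchi, InchiToInchiKey


def identifiers(smiles):
    """Return (InChI, InChIKey) for a SMILES string, or None if it does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    inchi = MolToInchi(mol)
    # The key is computed from the InChI string itself
    return inchi, InchiToInchiKey(inchi)


# Two different SMILES for toluene should map to the same identifiers
a = identifiers("Cc1ccccc1")
b = identifiers("c1ccccc1C")
print(a[1] == b[1])
```

Since the key has a fixed length, it fits a fixed-width indexed column, which is what makes it convenient for exact-match lookups.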
Fingerprints and Similarity
Molecular fingerprints are bit-strings that represent the presence or absence of specific structural features.
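Similarity between two fingerprints is most often measured with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. Here is a minimal sketch that represents each fingerprint as a Python set of on-bit indices. That representation is chosen for clarity (RDKit uses its own bit-vector types), and the bit indices below are arbitrary illustrative values.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.0
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


fp_query = {1, 5, 9, 42}
fp_hit = {1, 5, 9, 77}
# 3 shared bits out of 5 distinct bits overall
print(tanimoto(fp_query, fp_hit))  # 0.6
```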
Topological Fingerprints (RDKit)
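The RDKit fingerprint is a Daylight-like topological fingerprint. It enumerates paths in the molecular graph up to a maximum number of bonds (7 by default), hashes each one, and sets the corresponding bits in a fixed-length vector (2048 bits by default).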
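from rdkit import Chem
from rdkit.Chem import RDKFingerprint
# Cyclohexane
mol = Chem.MolFromSmiles("C1CCCCC1")
fp = RDKFingerprint(mol)
# len(fp) is the vector length, not the number of bits that are set
print(f"Fingerprint length: {len(fp)} bits, {fp.GetNumOnBits()} set")
Morgan Fingerprints (Circular)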
Morgan fingerprints (analogous to ECFP) are based on the neighborhood of each atom. They are widely used for similarity searching and QSAR modeling.
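The neighborhood idea can be illustrated without any toolkit. What follows is a simplified, toy version of the iterative update behind circular fingerprints, not RDKit's implementation: it ignores bond orders, charges, and the duplicate-environment removal that real ECFP performs. The function names and the atom/bond list input format are invented for this sketch.

```python
import hashlib


def _hash(*parts):
    """Deterministic integer hash (Python's built-in hash() is salted for strings)."""
    digest = hashlib.sha1(repr(parts).encode()).hexdigest()
    return int(digest[:8], 16)


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Return the set of on-bits for a toy Morgan-style fingerprint.

    atoms: list of element symbols; bonds: list of (i, j) atom-index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Round 0: an atom's identifier depends only on the atom itself
    ids = [_hash(symbol, len(neighbors[i])) for i, symbol in enumerate(atoms)]
    features = set(ids)

    for _ in range(radius):
        # Each round mixes in the neighbors' current identifiers, so the
        # identifier describes an environment one bond larger than before
        ids = [
            _hash(ids[i], sorted(ids[n] for n in neighbors[i]))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold the identifiers into a fixed number of bit positions
    return {f % n_bits for f in features}


# Heavy atoms of ethanol, written in two different atom orders
bits_1 = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
bits_2 = toy_circular_fingerprint(["O", "C", "C"], [(0, 1), (1, 2)])
print(bits_1 == bits_2)  # True
```

Because each identifier depends only on an atom and its neighbors, reordering the atoms leaves the bit set unchanged. With RDKit, the real thing is a single call: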
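from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles("c1ccccc1")
# Radius 2 is roughly equivalent to ECFP4 (the 4 is a diameter)
# Recent RDKit releases point users to rdFingerprintGenerator.GetMorganGenerator instead
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
Database Integration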
Modern cheminformatics often involves storing millions of structures in relational databases with specialized extensions such as pgchem or the RDKit PostgreSQL cartridge.
SQL Query for Similarity Search
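Here is how you might perform a Tanimoto similarity search in PostgreSQL with the RDKit cartridge, assuming a molecules table whose mol column uses the cartridge's mol type. The % operator keeps only rows whose Tanimoto similarity to the query meets the rdkit.tanimoto_threshold setting: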
-- Find molecules similar to a query molecule
SELECT
    id,
    smiles,
    tanimoto_sml(morganbv_fp(mol), morganbv_fp('c1ccccc1'::mol)) AS similarity
FROM
    molecules
WHERE
    morganbv_fp(mol) % morganbv_fp('c1ccccc1'::mol)
ORDER BY
    similarity DESC
LIMIT 10;
JSON Data Structures
When passing chemical data between microservices, JSON is often used to wrap molecular properties.
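On the consuming side, it usually pays to check a record before trusting it. Below is a minimal sketch using only the standard library. The function name `parse_molecule_record` and the validation rules are assumptions made for illustration; the field names and values follow the aspirin record used as this section's example.

```python
import json

NUMERIC_PROPERTIES = ("mw", "logp")  # any real number
COUNT_PROPERTIES = ("hbd", "hba")    # non-negative integers


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_molecule_record(payload):
    """Decode a JSON molecule record and validate the fields a consumer relies on."""
    record = json.loads(payload)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for key in ("id", "smiles"):
        if not isinstance(record.get(key), str) or not record[key]:
            raise ValueError(f"missing or empty field: {key}")
    props = record.get("properties")
    if not isinstance(props, dict):
        raise ValueError("missing properties object")
    for name in NUMERIC_PROPERTIES:
        if not _is_number(props.get(name)):
            raise ValueError(f"property {name} must be a number")
    for name in COUNT_PROPERTIES:
        value = props.get(name)
        if not _is_number(value) or not isinstance(value, int) or value < 0:
            raise ValueError(f"property {name} must be a non-negative integer")
    return record


payload = json.dumps({
    "id": "MOL-001",
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "properties": {"mw": 180.158, "logp": 1.19, "hbd": 1, "hba": 4},
})
record = parse_molecule_record(payload)
print(record["id"], record["properties"]["mw"])
```

A full record in this shape looks like this: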
{
    "id": "MOL-001",
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "properties": {
        "mw": 180.158,
        "logp": 1.19,
        "hbd": 1,
        "hba": 4
    },
    "metadata": {
        "source": "Internal Database",
        "quality": "Validated"
    }
}
Conclusion
We’ve covered the basics of molecular representations and fingerprints, and how they surface in database queries and service payloads. These form the foundation of more advanced tasks like machine learning on chemical data and virtual screening.
Lead developer at cheminfo.dev, passionate about the intersection of chemistry and software engineering.