Anton · Education · 3 min read

Deep Dive into Cheminformatics Data Structures

Exploring molecular representations, fingerprints, and advanced structural analysis with RDKit.

In this post, we will explore the core data structures that power modern cheminformatics. Understanding these is crucial for building robust chemical databases and analysis pipelines.

Molecular Representations

Representing a 3D molecule in a computer-readable format is a non-trivial task. Several standards have emerged over the decades.

SMILES and SMARTS

SMILES (Simplified Molecular Input Line Entry System) is the most common way to represent molecular structures as strings.

Canonical SMILES

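A molecule can have multiple valid SMILES strings. Canonicalization ensures that a given toolkit always writes the same molecule as the same string, regardless of how the input was written. Canonical SMILES are not guaranteed to match across toolkits, or even across versions of one toolkit, so only compare strings generated by the same software.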

examples/canonical.py
from rdkit import Chem

# Two different SMILES for the same molecule (Toluene)
s1 = "Cc1ccccc1"
s2 = "c1ccccc1C"

m1 = Chem.MolFromSmiles(s1)
m2 = Chem.MolFromSmiles(s2)

# Canonical SMILES will be identical
print(Chem.MolToSmiles(m1)) # "Cc1ccccc1"
print(Chem.MolToSmiles(m2)) # "Cc1ccccc1"

SMARTS is an extension of SMILES used for specifying substructure patterns. It supports wildcard atoms (for example, "*" for any atom and "a" for any aromatic atom) and logical operators for combining atom and bond properties.

from rdkit import Chem

# Match an aromatic atom bonded to a hydroxyl group.
# The hydrogen count must go inside brackets: "[OH]", not "OH".
pattern = Chem.MolFromSmarts("a-[OH]")
molecule = Chem.MolFromSmiles("c1ccccc1O")

if molecule.HasSubstructMatch(pattern):
    print("Phenol-like substructure found!")

InChI and InChIKey

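The International Chemical Identifier (InChI) is a layered, standardized identifier developed under IUPAC. Because it is generated by a single reference algorithm, the same structure yields the same InChI no matter which toolkit calls that algorithm, which makes it more dependable than SMILES for identity checks across systems. The InChIKey is a fixed-length, 27-character hash of the InChI, well suited to database indexing and exact-match lookups.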
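A standard InChIKey has three hyphen-separated blocks: 14 letters hashed from the connectivity, 10 characters covering the remaining layers plus a flag and a version character, and a final protonation character. The following is a minimal, toolkit-free sketch of splitting a key into those parts. The function name split_inchikey and the field names are hypothetical, chosen for illustration; the example key is the widely published one for aspirin.

```python
import re
from typing import NamedTuple

# Three blocks of uppercase letters: 14-10-1 (27 characters in total).
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity: str   # hash of the molecular skeleton
    other_layers: str   # hash of stereochemistry, isotopes, and so on
    is_standard: bool   # True when the flag character is "S"
    version: str        # "A" denotes InChI version 1
    protonation: str    # "N" denotes a neutral species


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"Not a well-formed InChIKey: {key!r}")
    skeleton, layers, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, layers, flag == "S", version, protonation)


parts = split_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
print(parts.connectivity)
print(parts.is_standard)
```

Indexing the connectivity block separately is one possible design choice: stereoisomers share that block, so a lookup on it groups all records with the same skeleton.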

Fingerprints and Similarity

Molecular fingerprints are bit-strings that represent the presence or absence of specific structural features.
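Similarity between two fingerprints is most often scored with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number of bits set in either. Here is a minimal sketch in plain Python, assuming each fingerprint is given as the set of its on-bit positions. The bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention picked for this example.

```python
from typing import AbstractSet


def tanimoto(on_bits_a: AbstractSet[int], on_bits_b: AbstractSet[int]) -> float:
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    union = len(on_bits_a | on_bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(on_bits_a & on_bits_b) / union


# Made-up bit positions, not output from any real fingerprinting routine.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 9, 30, 41}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 distinct bits = 0.5
```

The score ranges from 0.0 (no shared bits) to 1.0 (identical bit sets). Identical fingerprints do not guarantee identical molecules, since different structures can set the same bits.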

Topological Fingerprints (RDKit)

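The default RDKit fingerprint enumerates subgraphs of the molecular graph (linear paths and, by default, branched ones) between a minimum and maximum number of bonds, 1 to 7 by default, and hashes each one into a fixed-length bit vector of 2048 bits by default.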

examples/fingerprints.py
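from rdkit import Chem
from rdkit.Chem import RDKFingerprint

# Cyclohexane
mol = Chem.MolFromSmiles("C1CCCCC1")
fp = RDKFingerprint(mol)

# len() gives the vector length, which is 2048 with default settings
print(f"Fingerprint bits: {len(fp)}")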

Morgan Fingerprints (Circular)

Morgan fingerprints (analogous to ECFP) encode the circular neighborhood around each atom out to a chosen radius. They are widely used for similarity searching and QSAR modeling.

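from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1")
# Radius 2 is roughly equivalent to ECFP4 (ECFP names use the diameter).
# Recent RDKit releases deprecate this function in favor of
# rdFingerprintGenerator.GetMorganGenerator.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)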
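To make the "neighborhood of each atom" idea concrete, here is a deliberately simplified, toolkit-free sketch of circular hashing. It is not RDKit's algorithm: real Morgan/ECFP fingerprints use richer atom invariants, account for bond orders, and remove duplicate environments. The function name toy_circular_fingerprint and the input format (a list of element symbols plus a list of bonded index pairs, hydrogens omitted) are assumptions made for this illustration.

```python
import zlib


def toy_circular_fingerprint(symbols, bonds, radius=2, n_bits=64):
    """Return the set of on-bit positions for a simplified circular fingerprint."""
    neighbors = [[] for _ in symbols]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def stable_hash(text):
        # zlib.crc32 is deterministic across runs, unlike built-in hash() on str.
        return zlib.crc32(text.encode("utf-8"))

    # Radius 0: each atom starts from an identifier built from element and degree.
    ids = [stable_hash(f"{sym}:{len(neighbors[i])}") for i, sym in enumerate(symbols)]
    features = set(ids)

    for _ in range(radius):
        # Fold the neighbours' current identifiers into each atom's own, so the
        # new identifier describes an environment one bond larger than before.
        new_ids = []
        for i in range(len(symbols)):
            around = ",".join(str(n) for n in sorted(ids[j] for j in neighbors[i]))
            new_ids.append(stable_hash(f"{ids[i]}|{around}"))
        ids = new_ids
        features.update(ids)

    # Fold the large identifiers into a fixed number of bit positions.
    return {feature % n_bits for feature in features}


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
benzene_bits = toy_circular_fingerprint(["C"] * 6, ring)
# Toluene: atom 6 is the methyl carbon attached to ring atom 0.
toluene_bits = toy_circular_fingerprint(["C"] * 7, ring + [(0, 6)])

print(sorted(benzene_bits))
print(benzene_bits <= toluene_bits)  # True
```

In this simplified scheme every benzene bit also appears in toluene, because the carbon opposite the methyl group has a radius-2 environment that is indistinguishable from a benzene carbon. Toluene then adds further bits for the environments that do see the substituent.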

Database Integration

Modern cheminformatics often involves storing millions of structures in relational databases with specialized extensions like pgchem or rdkit-postgresql.

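Here is how you might perform a Tanimoto similarity search in PostgreSQL with the RDKit cartridge. The % operator keeps only rows whose Tanimoto similarity to the query meets the cartridge's configured threshold (rdkit.tanimoto_threshold, 0.5 by default), and tanimoto_sml returns the actual score: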

queries/search.sql
-- Find molecules similar to a query molecule
SELECT
    id,
    smiles,
    tanimoto_sml(morganbv_fp(mol), morganbv_fp('c1ccccc1'::mol)) as similarity
FROM
    molecules
WHERE
    morganbv_fp(mol) % morganbv_fp('c1ccccc1'::mol)
ORDER BY
    similarity DESC
LIMIT 10;

JSON Data Structures

When passing chemical data between microservices, JSON is often used to wrap molecular properties.

data/molecule_data.json
{
  "id": "MOL-001",
  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "properties": {
    "mw": 180.158,
    "logp": 1.19,
    "hbd": 1,
    "hba": 4
  },
  "metadata": {
    "source": "Internal Database",
    "quality": "Validated"
  }
}
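On the receiving side, a service will usually want to turn this payload into a typed object rather than pass raw dictionaries around. Here is a minimal sketch using only the standard library. The class name MoleculeRecord and the function parse_molecule are hypothetical, and the sketch reads only the fields shown in the example payload.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeRecord:
    id: str
    smiles: str
    mw: float
    logp: float
    hbd: int
    hba: int


def parse_molecule(payload: str) -> MoleculeRecord:
    """Parse a JSON payload shaped like the example above."""
    data = json.loads(payload)
    props = data["properties"]  # a KeyError here signals a malformed payload
    return MoleculeRecord(
        id=data["id"],
        smiles=data["smiles"],
        mw=float(props["mw"]),
        logp=float(props["logp"]),
        hbd=int(props["hbd"]),
        hba=int(props["hba"]),
    )


payload = """
{
  "id": "MOL-001",
  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "properties": {"mw": 180.158, "logp": 1.19, "hbd": 1, "hba": 4}
}
"""
record = parse_molecule(payload)
print(record.smiles, record.mw)
```

Failing fast on a missing key keeps malformed records out of downstream pipelines. A production service would likely also check that the SMILES string parses before accepting the record.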

Conclusion

We’ve covered the basics of molecular representations and fingerprints. These form the foundation of more advanced tasks like machine learning on chemical data and virtual screening.

Anton

Lead developer at cheminfo.dev, passionate about the intersection of chemistry and software engineering.
