Deep Dive into Cheminformatics Data Structures

In this post, we will explore the core data structures that power modern cheminformatics. Understanding these is crucial for building robust chemical databases and analysis pipelines.

Molecular Representations

Representing a 3D molecule in a computer-readable format is a non-trivial task. Several standards have emerged over the decades.

SMILES and SMARTS

SMILES (Simplified Molecular Input Line Entry System) is the most common way to represent molecular structures as strings.

Canonical SMILES

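A molecule can have multiple valid SMILES strings. Canonicalization ensures that a given toolkit always produces the same string for the same molecule. Canonical SMILES from different toolkits, or from different versions of one toolkit, are not guaranteed to match.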

examples/canonical.py

from rdkit import Chem

# Two different SMILES for the same molecule (Toluene)
s1 = "Cc1ccccc1"
s2 = "c1ccccc1C"

m1 = Chem.MolFromSmiles(s1)
m2 = Chem.MolFromSmiles(s2)

# Canonical SMILES will be identical
print(Chem.MolToSmiles(m1)) # "Cc1ccccc1"
print(Chem.MolToSmiles(m2)) # "Cc1ccccc1"

SMARTS for Substructure Search

SMARTS is an extension of SMILES used for specifying substructure patterns. It allows for “wildcards” and complex logical operations.

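from rdkit import Chem

# Match an aromatic atom bonded to a hydroxyl oxygen.
# Hydrogen counts must be written inside brackets:
# [OX2H] is a two-connected oxygen carrying one hydrogen.
pattern = Chem.MolFromSmarts("a-[OX2H]")
molecule = Chem.MolFromSmiles("c1ccccc1O")  # phenol

if molecule.HasSubstructMatch(pattern):
    print("Phenol-like substructure found!")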

InChI and InChIKey

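The International Chemical Identifier (InChI) is a layered, standardized representation. Because it is generated by a single reference algorithm, the same structure yields the same InChI regardless of the software that calls it, which makes it more robust than SMILES for identity checking. The InChIKey is a fixed-length, 27-character hash of the InChI, ideal for database indexing.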
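As a rough illustration of identity checking, the sketch below derives the InChI and InChIKey for the two toluene SMILES strings used earlier. It assumes an RDKit build with InChI support, which exposes `Chem.MolToInchi` and `Chem.MolToInchiKey`; the helper name `inchi_pair` is hypothetical.

```python
from rdkit import Chem


def inchi_pair(smiles: str) -> tuple[str, str]:
    """Return (InChI, InChIKey) for a SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return Chem.MolToInchi(mol), Chem.MolToInchiKey(mol)


inchi_a, key_a = inchi_pair("Cc1ccccc1")
inchi_b, key_b = inchi_pair("c1ccccc1C")

# Both inputs describe toluene, so the identifiers should agree.
print(inchi_a)
print(key_a, key_a == key_b)
```

Comparing InChIKeys rather than raw SMILES is one way to deduplicate records that arrive from different sources.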

Fingerprints and Similarity

Molecular fingerprints are bit-strings that represent the presence or absence of specific structural features.

Topological Fingerprints (RDKit)

The default RDKit fingerprint enumerates paths in the molecular graph up to a maximum length (7 bonds by default) and hashes each one into a fixed-length bit vector (2048 bits by default).

examples/fingerprints.py

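from rdkit import Chem
from rdkit.Chem import RDKFingerprint

mol = Chem.MolFromSmiles("C1CCCCC1")  # cyclohexane
fp = RDKFingerprint(mol)

# len(fp) is the vector size, not the number of bits that are set
print(f"Fingerprint bits: {len(fp)}")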

Morgan Fingerprints (Circular)

Morgan fingerprints (analogous to ECFP) encode the circular neighborhood of each atom out to a chosen radius. They are widely used for similarity searching and QSAR modeling.

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1")  # benzene
# Radius 2 is roughly equivalent to ECFP4 (ECFP names use the diameter)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
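Similarity searching with fingerprints usually comes down to the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The sketch below computes it on plain Python sets of on-bit indices. The function name `tanimoto` and the toy bit sets are illustrative assumptions, not RDKit output. With real RDKit bit vectors, `DataStructs.TanimotoSimilarity` does the same job.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints: define similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(bits_a & bits_b) / union


# Toy fingerprints (made-up bit indices, for illustration only)
fp_query = {1, 5, 9, 42}
fp_hit = {1, 5, 9, 77}

print(tanimoto(fp_query, fp_hit))  # 3 shared bits / 5 distinct bits = 0.6
```

A value of 1.0 means the two fingerprints are identical, and 0.0 means they share no set bits.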

Database Integration

Modern cheminformatics often involves storing millions of structures in relational databases with specialized extensions like pgchem or rdkit-postgresql.

SQL Query for Similarity Search

Here is how you might perform a Tanimoto similarity search in PostgreSQL:

queries/search.sql

-- Find molecules similar to a query molecule
SELECT
    id,
    smiles,
    tanimoto_sml(morganbv_fp(mol), morganbv_fp('c1ccccc1'::mol)) as similarity
FROM
    molecules
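WHERE
    -- % is the RDKit cartridge's Tanimoto operator; its cutoff is the
    -- rdkit.tanimoto_threshold setting
    morganbv_fp(mol) % morganbv_fp('c1ccccc1'::mol)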
ORDER BY
    similarity DESC
LIMIT 10;

JSON Data Structures

When passing chemical data between microservices, JSON is often used to wrap molecular properties.

data/molecule_data.json

{
  "id": "MOL-001",
  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "properties": {
    "mw": 180.158,
    "logp": 1.19,
    "hbd": 1,
    "hba": 4
  },
  "metadata": {
    "source": "Internal Database",
    "quality": "Validated"
  }
}
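A receiving service typically parses such a payload into a typed object before using it. The sketch below does this with the standard library only. The `MoleculeRecord` class and `parse_molecule` function are hypothetical names, and the field set simply mirrors the example payload above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeRecord:
    """Typed view of the example payload (hypothetical class)."""
    id: str
    smiles: str
    mw: float
    logp: float
    hbd: int
    hba: int


def parse_molecule(payload: str) -> MoleculeRecord:
    """Parse a JSON string; raises KeyError if a required field is missing."""
    data = json.loads(payload)
    props = data["properties"]
    return MoleculeRecord(
        id=data["id"],
        smiles=data["smiles"],
        mw=float(props["mw"]),
        logp=float(props["logp"]),
        hbd=int(props["hbd"]),
        hba=int(props["hba"]),
    )


payload = """
{
  "id": "MOL-001",
  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "properties": {"mw": 180.158, "logp": 1.19, "hbd": 1, "hba": 4}
}
"""

record = parse_molecule(payload)
print(record.id, record.mw)
```

Failing fast on a missing field keeps malformed records from propagating silently into downstream services.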

Conclusion

We’ve covered the basics of molecular representations and fingerprints, along with how they surface in database queries and JSON payloads. These form the foundation of more advanced tasks like machine learning on chemical data and virtual screening.

Anton
