Computational Chemistry, Demystified

A plain-language tour of the key tools in modern cheminformatics — from building 3D structures to predicting toxicity with machine learning.

Key Software & Libraries

Think of these as the instruments in a computational chemist's orchestra — each plays a different part.

RDKit
Cheminformatics

RDKit is like a GPS for molecules. Give it a SMILES string — a line of text that describes a molecule — and it plots the full 3D structure. It can optimise geometry, compute molecular fingerprints, calculate properties such as molecular weight and logP, and generate publication-ready images. It's the starting point for almost every drug-discovery pipeline.

DeepChem
Deep Learning for Chemistry

DeepChem is a digital chemist's brain. It ships with pre-trained models and curated datasets (MoleculeNet) covering toxicity, solubility, binding affinity, and more. Feed it a molecule; it predicts whether the compound might be toxic, penetrate cell membranes, or bind to a target — in milliseconds. No wet lab required for the initial screen.

Open Babel
Chemical Format Converter

Open Babel is the Swiss Army knife for chemists. It translates molecules between over 100 file formats (SDF, PDB, MOL2, SMILES, …), generates 3D conformers from flat structures, applies force-field minimisation, and computes basic descriptors. If two pieces of software speak different "chemical languages", Open Babel is the interpreter.

ASE
Atomistic Simulation Environment

ASE (Atomic Simulation Environment) is a Python toolkit for setting up, running, and analysing atomistic simulations. Think of it as the conductor: it orchestrates geometry optimisations, molecular dynamics runs, and transition-state searches — delegating the actual quantum calculations to backends like GAUSSIAN or PySCF.

PySCF
Quantum Chemistry in Python

PySCF is a modern quantum chemistry library written almost entirely in Python. It implements DFT, Hartree-Fock, MP2, CCSD, and more. Researchers use it to compute the ground-state energy, electron density, and forces of a molecule — the "ground truth" that machine-learned force fields learn from.

PyTorch
Neural Network Framework

PyTorch is the engine behind most modern ML force fields and graph neural networks for property prediction. Its automatic differentiation engine (autograd) lets you derive forces from energies with a single line of code — exactly what you need when training a neural network potential to replace expensive DFT calculations.

Virtual Screening Pipeline

How a library of thousands of molecules is narrowed down to a handful of promising drug candidates — all before stepping into a lab.

1
Build 3D Structures
RDKit

Read thousands of SMILES strings from a compound library. RDKit converts each flat SMILES into a 3D molecular geometry using its ETKDG algorithm, then minimises the energy with a molecular mechanics force field.

2
Filter Drug-Likeness
RDKit & Lipinski Rules

Apply Lipinski's Rule of Five (MW < 500, logP < 5, H-bond donors < 5, acceptors < 10) to retain only orally bioavailable candidates. RDKit computes all descriptors in seconds.

3
Toxicity Screening
DeepChem (Tox21 model)

Run each molecule through a pre-trained graph neural network trained on the Tox21 dataset (12 toxicity endpoints). Molecules flagged as "likely toxic" are removed — before any synthesis is attempted.

4
Molecular Docking
AutoDock / OpenMM

Dock surviving molecules into the active site of a target protein and compute a binding-affinity score. Compounds with the best scores advance to experimental validation in the wet lab.

5
ADME Prediction & Hit Selection
SwissADME / pkCSM

Predict absorption, distribution, metabolism, and excretion (ADME) to rank the final hits. The top compounds are then synthesised and tested in cell-based and enzyme assays.

Python — pipeline_screen.py
# Simplified virtual screening pipeline
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors
import deepchem as dc

results = []

for smiles in candidate_library:

    # Step 1 — build 3D structure
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())
    AllChem.MMFFOptimizeMolecule(mol)

    # Step 2 — Lipinski filter
    mw  = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    if mw > 500 or logp > 5:
        continue

    # Step 3 — toxicity screen (DeepChem Tox21)
    X = featurizer.featurize([smiles])
    tox = tox_model.predict(X)[0]
    if tox.max() > 0.5:
        continue

    # Step 4 — docking score
    score = dock(mol, target_protein)
    results.append({"smiles": smiles,
                    "score": score})

# Rank by binding affinity
results.sort(key=lambda x: x["score"])
Hypothetical Docking Scores — Top 8 Candidates

Negative values indicate stronger predicted binding affinity (kcal/mol). Compounds with scores below −8.0 typically advance to synthesis.

Machine-Learned Force Fields

Teaching a neural network to predict quantum energies — making DFT-quality calculations millions of times faster.

The Idea

DFT calculations are accurate but slow — a single-point energy for a drug-sized molecule can take hours. Machine-learned force fields (ML-FFs) solve this: we train a neural network on thousands of DFT calculations of slightly perturbed geometries. The network learns to predict energies and forces from atomic positions — with near-DFT accuracy but at classical-simulation speed.

It's like teaching a student by showing them worked examples: once trained, they can solve new problems in seconds rather than starting from scratch every time.

Workflow

  • ASE — perturb the molecular geometry ~1000 times (small random displacements)
  • PySCF — compute DFT energy + forces for each perturbed geometry (ground truth)
  • PyTorch — train a neural network on {geometry → energy, forces} pairs
  • Forces are the gradient of the predicted energy w.r.t. atom positions (autograd)
  • The trained model runs molecular dynamics millions of times faster than DFT
Python — ml_forcefield.py
import torch
import torch.nn as nn
from ase import Atoms
from pyscf import dft

# Step 1 — generate training data via ASE + PySCF
molecule = Atoms('H2O', positions=[...])
geometries, energies, forces = [], [], []

for _ in range(1000):
    geo = perturb(molecule)          # small random shift
    e, f = pyscf_dft(geo)           # DFT energy + forces
    geometries.append(geo)
    energies.append(e)
    forces.append(f)

# Step 2 — define neural network
class NeuralFF(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feat, 128), nn.Tanh(),
            nn.Linear(128, 64),  nn.Tanh(),
            nn.Linear(64, 1)
        )
    def forward(self, x):
        return self.net(x)

model = NeuralFF()
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step 3 — training loop
for epoch in range(500):
    E_pred = model(X_train)
    F_pred = -torch.autograd.grad(   # forces = -dE/dr
        E_pred.sum(), X_train, create_graph=True)[0]
    loss = mse(E_pred, E_train) + mse(F_pred, F_train)
    opt.zero_grad(); loss.backward(); opt.step()
ML-FF vs DFT Energy Parity Plot (Training Data)

Perfect prediction lies on the dashed diagonal. MAE < 1 kcal/mol is considered "chemical accuracy".