The world of bioinformatics is fun to explore and if you know a little programming, the journey gets even better.


How are diseases caused ?
How do drugs work in our body ?
Let’s try to answer that and take baby steps in how we can design medicinal drugs.

For now, we care about only two things: proteins and ligands. They play crucial roles in various biological processes and we are interested in their interaction. In simpler terms, think of protein like a big thing that is responsible for biological processes in our body. A ligand can bind with the protein and change the working of that protein. Due to that, we can get a new disease or we can be cured of new diseases.

So, we are interested in how a protein and a ligand interact with each other. But what is a protein and what is a ligand, structurally ?

Protein: Proteins are long chains of amino acids linked together. Our body needs 20 different amino acids. Think of them like blocks. In a protein, those amino acids are connected together. That gives different properties to those proteins they form and help in well functioning of our body.

Ligands: Explaining ligands can be a little difficult. They can be anything as long as they enter inside a protein, bind with it and change it’s properties. So, they can be small molecules, ions, peptides or even other proteins. Below, I’ll be focusing on small molecules.

Visualizing proteins and ligands is not that complex. Initially, their structure is found out with X-ray Crystallagrophy which gives cartesian coordinates of all the elements. The next step is to just plot those cartesian coordinates like in the above image and a little bit of coloring.

Sample Line In a PDB file.
ATOM 1 N GLU A 3 -27.949 -78.370 -65.845 1.00 72.12 N

The green highlighted part is the cartesian coordinate and at the end is the element that is in that coordinate. In this line, it gives the position of a Nitrogen atom(N) in Glutamic Acid residue of the protein (GLU) which is in the ‘A’ chain of that protein. RCSB has a detailed guide on understanding PDB files.

Now that we have accurate representation of Proteins and Ligands, we need to identify two important things:

  1. Will the protein and the ligand interact ? (We’ll be doing this below)
  2. If they do interact, will the interacted complex be stable ?

Let’s say there is an epidemic going on and you are tasked with finding a drug that will save humanity. You have identified the target protein that is causing problems. If you can somehow inhibit that protein the humanity will be saved. You ask your super intelligent AI to give you a molecule that will save all of humanity. The AI gives a molecular formula.

Now, we need to virtually see whether the protein and the ligand will interact. For that we will perform molecular docking.

Molecular Docking:

  1. Initially, we need to find the places in the protein where a ligand can go and bind. They are called as binding pockets. Ligands can only bind with the protein in the proteins binding pockets. Conversely, if a ligand binds then we call that place a binding pocket inside a protein.
  2. After that, we take that ligand and place it inside the target protein.
  3. Now, we develop an equation that can simulate the forces of nature. The equation should take into account the chemical properties of both protein and ligands. We integrate over that equation and get a number that says how likely is the interaction between the ligand and the protein (binding affinity).

To put very simply, ligand is like a key and protein is like a lock. And we have to see if the ligand (key) fits inside the lock (protein). And the keyhole is the binding pocket.

The problem with this is that, we have a 3D grid. A single orientation or pose of a ligand is not enough to determine whether the ligand binds with the protein or not. We have to check for the multiple orientations of the ligand and also in multiple places inside the protein.

Let’s implement that. I’ve created a simple API for molecular docking and I’ll be using code samples from there. Dockster Link.

Get a protein PDB File and a Ligand PDB File.

We need accurate representations of both the protein and the ligand. A PDB file has exact coordinates. Therefore we use them. Inside a protein PDB, there can already be a co-crystallized ligand. We can select that as a ligand or use our own ligand. For now, we’ll ignore that. Co-crystallized ligands start with HETATM and we’ll only use the protein part.

import prody 
pdb = prody.parsePDB("filepath.pdb")
protein ="protein")
if protein is None:
 protein = protein_pdb

Add missing residues, missing atoms and minimize hydrogen coordinates.

The PDB file that we have, can have some missing atoms or even residues. As we know that there are only 20 amino acids, and we know the composition of an amino acid, we can fix them.

import pdbfixer

fixer = pdbfixer.PDBFixer("protein_only.pdb")
fixer.addMissingHydrogens(7.0) # pH of the hydrogen that we are adding

Minimize Energy

The PDB file that we got can have some coordinates that are slightly misplaced or have small deviations in them. So, if we integrate them over a Force-Field, and minimize the energy, the positions will be corrected.

from openmm import app 
import openmm 
from openmm import unit

def minimize_energy(fixer):
    forcefield = app.ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
    system = forcefield.createSystem(
        nonbondedCutoff=0.9 * unit.nanometer,
    atom_elements = [ for atom in fixer.topology.atoms()]
    for i in range(system.getNumParticles()):
        if atom_elements[i] != "hydrogen":
            system.setParticleMass(i, 0.0)

    integrator = openmm.LangevinIntegrator(
        298 * unit.kelvin, 1 / unit.picosecond, 1 * unit.femtosecond
    platform = openmm.Platform.getPlatformByName("CPU")

    simulation = app.Simulation(fixer.topology, system, integrator, platform)

    positions = simulation.context.getState(getPositions=True).getPositions()

        fixer.topology, positions, open(f"{temporary_file_dir}/proteinH.pdb", "w")
    ) # Saving the file as our work here is done.
    return True

File Conversion

After minimizing and we save our file, we need to do a small step. We need to convert the PDB file to a PDBQT file because apparently docking softwares require additional things like Atomic Charges and other few things. To do that, I use a tool called OpenBabel.

FROM ubuntu:18.04

RUN apt-get update && apt-get -y install --no-install-recommends  build-essential  ca-certificates  cmake  git  zlib1g-dev  libcairo2-dev  libboost-dev  libboost-program-options-dev  libboost-iostreams-dev  libboost-regex-dev  rapidjson-dev  python3-dev  libbz2-dev  libeigen3-dev  libxml2-dev  swig3.0  lzma  wget &&  apt-get clean -y


RUN git clone &&  cd openbabel 

WORKDIR /root/openbabel/build 

RUN cmake ..  -DPYTHON_EXECUTABLE=/usr/bin/python3  -DPYTHON_BINDINGS=ON  -DRUN_SWIG=ON  -DWITH_MAEPARSER=off &&  nproc=$(getconf _NPROCESSORS_ONLN) &&  make -j $(( nproc > 2 ? nproc - 2 : 1 )) &&  make install &&  cd /root/openbabel &&  rm -rf build/*

WORKDIR /home/obabel
$ docker build -t myobabel .

$ docker run -v `pwd`:/home/obabel -it --rm myobabel

# For Ligand: 
$ obabel <ligand_name.pdb> -O <ligand_name.pdbqt> 

# For Protein: 
# -xr is an option that removes all records except ATOM and HETATM from PDB files
$ obabel <protein_name.pdb> -xr -O <protein_name.pdbqt>

Perform Docking

Now, we need to send this to a docking software called AutoDock Vina. I’ll explain how it works below, but for now let’s only see how to run the software.

a. Create a Vina Config File:

receptor = proteinH.pdbqt 
ligand = ligandH.pdbqt

center_x = 7.4418864
center_y = 22.105509
center_z = -12.090041

size_x = 100
size_y = 100
size_z = 100

exhaustiveness = 8 
num_modes = 20 
energy_range = 3

Here, receptor and ligand are the path to pdbqt files that we created in our previous steps. The center(x,y,z) is the center of the grid box that we will create in the protein. The size(x,y,z) is the size of the grid box that we want to make inside which we will check different poses of the ligand. Exhaustiveness determines how exhaustive/thorough we want the docking search to be i.e how many different conformations and orientations to explore. Number of modes set how many binding poses that vina should generate. Finally, energy range defines the energy range within which vina will consider the binding modes. It specifies maximum energy difference between the best binding mode and worst binding mode to be included in the output.

b. Perform Docking:

# Running vina with the config we created.
$ docker run -v $PWD:/data --rm vina --config config.txt

Now, you’ll get the docking score of the protein and ligand. Let’s see how we can interpret the docking score. Also, that was only step 1 which showed whether the protein and the ligand would interact or not. We also need to see whether they will be stable after interacting or not too.

mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
   1       -7.442          0          0
   2       -7.377      49.59      53.93
   3       -7.214      49.68      55.42
   4       -7.075      10.28      13.45
   5       -6.917      10.42      16.78
   6        -6.67      8.263      11.52
   7       -6.594      49.11      55.18
   8       -6.537      29.72      33.17
   9       -6.439       17.5      21.68
  10       -6.431      12.44      16.56
  11       -6.315       16.4       21.1
  12       -6.272      44.38      48.89
  13       -6.271      10.91      15.64
  14       -6.254      13.72      15.62
  15        -6.25      22.98      25.37
  16       -6.123      51.29      56.33
  17       -6.084      31.61      34.94
  18       -5.994      45.28      49.17
  19       -5.971      32.36      36.77
  20       -5.942       4.88       9.99

Let’s try to interpret the results.

Mode: This represents the total binding poses of the ligand within the receptor protein. Each mode represents a different binding configuration.

Affinity(kcal/mol): The binding affinity is a measure of the predicted binding affinity between the ligand and the protein. (Lower the better)

RMSD lb: It is a measure of the structural deviation between the best predicted and the current structure. It is calculated by aligning the two structures as closely as possible and measuring the average distance between corresponding atoms. It represents the minimum structural deviation achievable through optimal alignment.

RMSD ub: It is also a measure of structural deviation between best predicted and the current structure. It is calculated by aligning the two structures in a way that minimizes the structural deviation but allows for a more flexible alignment. A higher RMSD ub value indicates that the structures are less structurally similar.

Now that we have good enough results from molecular docking. Let’s do molecular dynamics simulation and see whether the protein and the ligand complex is stable or not.

That, I’ll try to write in the next post.