Title: An Introduction to Open Small-molecule Resources of High Utility for Systems Biologists
1An Introduction to Open Small-molecule Resources
of High Utility for Systems Biologists
- Tutorial for the International Conference on Systems Biology, Göteborg, August 2008
- Christopher Southan, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK
2Context
- Medicinal chemistry has a long history of providing a bridge between biology and chemistry by identifying compounds that produce biological effects
- It is increasingly recognised that bioactive compounds are an essential part of the perturbation toolbox for systems biology
- Advancing biological knowledge via a broad spectrum of small-molecule investigations can lead to improved understanding not only of systems biology but also of disease mechanisms and new opportunities for therapeutic intervention
3Systems Chemical Biology
- Oprea et al. Nat Chem Biol. 2007;3(8):447-50. PMID 17637771
- "The increasing availability of data related to genes, proteins and their modulation by small molecules has provided a vast amount of biological information leading to the emergence of systems biology and the broad use of simulation tools for data analysis. However, there is a critical need to develop cheminformatics tools that can integrate chemical knowledge with these biological databases and simulation approaches, with the goal of creating systems chemical biology."
4Chemical Biology Goes Back a Long Way
5So Does Bioactive Compound Structure Representation
6But Times Have Changed for Chemical Information
7Strophanthidin from 1952 to 2008: Now Just a Click to Hinxton
8Or Bethesda
9The times have also changed for Chemical Biology
10And the Union of Chemistry and Biology
11November 2004: The Seeds of Revolution
16PubChem and ChEBI: Revolutionary Consequences
- Arrival of the missing entity of formal and linked chemical structure representation within the global web of bioinformatic relationships
- Ability to search across links between biochemical data, biological effects and chemical structure information
- Deposition not just of HTS results but of a wide range of other types of screening data directly linked to chemical structure information in public repositories
- Proliferation of cheminformatics tools, databases, nomenclatures and ontologies in the public domain
- A quantum jump in the global enablement of chemical biology and medicinal chemistry
17Post-Revolution: How Many Compounds Are Out There?
- Chemical Structure Lookup Service: 36 million, 100 sources
- ChemSpider: 21.5 million, 150 sources
- PubChem: 19,296,269, 70 sources
- SureChem: 9 million from US, European and WO patents
But how many are verified as bioactive?
18Relationships in Bioactive Chemical Space
[Diagram of overlapping sets: metabolomes; natural products; drugs; chemical genomics and systems biology probes; assay data; drug-like compounds from literature and patents; protein sequences]
19Searchable Chemical Structure Designations and
Representations in Databases
- SD/MOL files
- IUPAC standard name
- Sketched Image
- SMILES
- InChI codes
- InChI strings
- Experimental 3D structure
- Code names (CID 121880)
- Generic, trade and MeSH names
- CAS numbers
- Database accession numbers, e.g. PubChem CID, SID, ChEBI ID, ChemSpider ID
All can be exact-match searched, some allow similarity searching, and some also inter-convert
20SD/MOL File
The basic MDL chemical table files of atoms, bonds, connectivity and 3D coordinates

benzene
  ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
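To make the layout of the benzene record concrete, here is a minimal Python sketch that reads the counts line, the atom block and the bond block. It is an illustration only, not a full CTfile parser: it assumes whitespace-separated fields (real V2000 files are fixed-width), the function name `parse_molfile` is hypothetical, and the closing `M  END` line is added here because complete molfiles end with it, although the slide does not show it.

```python
# Minimal V2000 molfile reader: a sketch, not a complete CTfile parser.
# Assumption: fields are whitespace-separated, which holds for this small
# example; real V2000 files are fixed-width and should be read by column.

BENZENE_MOL = """benzene
  ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (name, atoms, bonds) from a small V2000 molfile string."""
    lines = text.splitlines()
    name = lines[0].strip()          # line 1: molecule name
    counts = lines[3].split()        # line 4: counts line
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(BENZENE_MOL)
print(name, len(atoms), "atoms,", len(bonds), "bonds")
```

The three single and three double bonds in the bond block are the alternating (Kekulé) form of the benzene ring.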
21Experimental 3D Structures
Cn3D view of PDB 1I7G on the left; PubChem tesaglitazar (CID 208901) on the right
22SMILES: simplified molecular input line entry system, a notation for encoding molecular structures
- Interconverts with 2D sketchers
- Can then be searched
- Human readable
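As a rough illustration of what "human readable" means in practice, the heavy atoms of a molecule can be tallied directly from its SMILES string. The sketch below is not a SMILES parser: it only recognises bracket atoms and the organic-subset symbols, ignores hydrogens, bonds and ring closures, and the function name `heavy_atom_counts` is hypothetical. The caffeine SMILES used is one commonly quoted form and is an assumption, not taken from the slides.

```python
import re
from collections import Counter

# Bracket atoms first, then two-letter halogens, then one-letter symbols
# of the organic subset (upper case = aliphatic, lower case = aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string (sketch)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:        # e.g. the wildcard atom [*]
                continue
            symbol = match.group(1).capitalize()
        else:
            symbol = token.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts


caffeine = heavy_atom_counts("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
print(dict(caffeine))
```

For real work a cheminformatics toolkit should do the parsing; the point here is only that the atoms and their order are visible in the string itself.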
23Structure Sketchers/Converters
24IUPAC Systematic Naming of Organic Chemical Compounds
- International Union of Pure and Applied Chemistry (IUPAC)
- Should be human readable and allow an unambiguous structural formula to be drawn
- Usable for automated text-to-structure conversion
- Taxol: (2aR,4S,4aS,6R,9S,11S,12S,12aR,12bS)-1,2a,3,4,4a,6,9,10,11,12,12a,12b-Dodecahydro-4,6,9,11,12,12b-hexahydroxy-4a,8,13,13-tetramethyl-7,11-methano-5H-cyclodeca(3,4)benz(1,2-b)oxet-5-one 6,12b-diacetate, 12-benzoate, 9-ester with (2R,3S)-N-benzoyl-3-phenylisoserine
25IUPAC International Chemical Identifier (InChI): Textual Identifier for Chemical Substances
- A formalized string representation of chemical structure defined by IUPAC, but not readily human readable
- Expresses more information than the simpler SMILES notation and differs in that every structure has a unique InChI string
- The InChI algorithm converts structural information in a three-step process: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom) and serialization (to give a string of characters), but without explicit 3D information
- The 25-character InChIKey is a hashed version of the full InChI, designed to allow easy web searches for chemical compounds (e.g. Google)
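The serialized InChI string is a sequence of layers separated by "/". A minimal sketch of splitting one into its layers follows. The benzene string is in the later standard form (version "1S"), which postdates these slides; the function name `split_inchi` is hypothetical, and the sketch assumes a formula layer is present and that no layer prefix letter repeats, which does not hold for every InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into its layers (simplified sketch).

    Assumptions: the second field is the formula layer, and each later
    layer is keyed by its one-letter prefix; a repeated prefix would
    overwrite the earlier layer.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        layers[layer[0]] = layer[1:]
    return layers


benzene_layers = split_inchi("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")
for key, value in benzene_layers.items():
    print(key, value)
```

Here the "c" layer lists the connections between the canonically numbered atoms and the "h" layer lists the hydrogens attached to them.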
26CAS Registry Number
- Unique numeric identifier containing up to 10 digits, divided by hyphens into three parts, e.g. 58-08-2 for caffeine (Google it)
- Has no chemical significance
- Widely used but not open-access, because the source chemical information links to the CAS commercial databases, e.g. SciFinder
- Consequently, the consistency of mappings to open identifiers cannot be verified
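Although a CAS number carries no chemical meaning, its format can be checked locally: the final single digit is a check digit computed from the other digits. This detail is not on the slide. The sketch below tests only the format and the checksum, not whether the number is actually assigned to a compound, and the function name `is_valid_cas` is hypothetical.

```python
import re

# Three hyphen-separated parts: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Check the format and check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left and sum them;
    # the check digit is that sum modulo 10.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(is_valid_cas("58-08-2"))   # caffeine, the example on the slide
```

For caffeine the weighted sum is 8×1 + 0×2 + 8×3 + 5×4 = 52, and 52 mod 10 gives the final digit 2.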
27PubChem Identifiers: CIDs and SIDs
- PubChem is the NCBI informatics backbone for the NIH Molecular Libraries Initiative (MLI)
- A suite of three databases: PubChem Compound (unique structures with computed properties, identified by CIDs), PubChem BioAssay (results supplied by depositors) and PubChem Substance (deposited compound structures, identified by SIDs)
- The ten MLI-funded screening centers run cellular and target-based HTSs using a compound collection of 250 K and submit the results to PubChem
28PubChem is now a Global Hub, Including Bioinformatic dbs with In-links
- ChEBI, enzyme ligands: 8K
- MMDB, PDB ligands: 55K
- ZINC, ready-to-dock: 3.8 million
- KEGG, drugs and metabolites: 14K
- ChemBank, chemical genomics: 0.4 million
- Human Metabolite db: 2K
- ChemIDplus, NIH tox data: 383K
- MEROPS protease inhibitors
- ChemSpider: 20 million
- DrugBank, drugs and targets: 4K
- Drugs of the Future: 3.4K
- GPCR-Ligand Database
- Nature Chemical Biology: 0.8K
- LIPID MAPS, metabolism: 8.8K
29Searchable Measures of Chemical Similarity
- 1D: measured or computed molecular properties, e.g. molecular weight, number of rings, molecular surface area or volume, pKa, logP, etc.
- 3D: map of the molecular surface, chemical graphs, spectral descriptors, distribution of electrostatic charge around a molecule
- 2D: fingerprints are by far the most common, based on a bit-string encoding of substructural occurrences
30Molecular Fingerprints for Similarity Searching
- Each bit in the fingerprint (or fragment bit-string) represents one molecular fragment. Typical length is 1000 bits
- The bit string for a molecule records the presence (1) or absence (0) of each fragment in the molecule
- Compare fingerprints of two molecules to identify common bits and hence common substructures (and hence overall structural resemblance)
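The bit-string idea can be sketched with a toy fragment dictionary. Everything here is hypothetical: the eight fragment names, the function names `fingerprint` and `common_bits`, and the hand-assigned fragment sets for aspirin and salicylic acid. Real fingerprints use hundreds to thousands of bits and detect the fragments automatically from the structure.

```python
# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENT_KEYS = [
    "aromatic ring", "carboxylic acid", "ester", "amide",
    "hydroxyl", "halogen", "primary amine", "ether",
]


def fingerprint(fragments_present):
    """Return a list of 0/1 bits, one per entry in FRAGMENT_KEYS."""
    return [1 if key in fragments_present else 0 for key in FRAGMENT_KEYS]


def common_bits(fp_a, fp_b):
    """Names of the fragments that are set (1) in both fingerprints."""
    return [key for key, bit_a, bit_b in zip(FRAGMENT_KEYS, fp_a, fp_b)
            if bit_a and bit_b]


# Fragment sets assigned by hand for illustration.
aspirin = fingerprint({"aromatic ring", "carboxylic acid", "ester"})
salicylic_acid = fingerprint({"aromatic ring", "carboxylic acid", "hydroxyl"})

print(aspirin)
print(salicylic_acid)
print(common_bits(aspirin, salicylic_acid))
```

The shared bits name the substructures the two molecules have in common, which is the basis for the overall resemblance score.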
31Tanimoto Chemical Similarity
- Tally features for molecules A and B
- Unique to A (a), unique to B (b)
- Both on (c)
- Both off (d)
- Similarity formula
- Tanimoto = c / (a + b + c)
Beware: chemical similarity searches are not standardised between databases
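A minimal sketch of the Tanimoto tally on two equal-length bit lists, using the slide's symbols a, b and c. The example fingerprints are made up, and returning 0.0 when neither molecule has any bit set is an assumed convention, since the formula is undefined in that case.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity c / (a + b + c) for two 0/1 bit lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)  # on in A only
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)  # on in B only
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)      # on in both
    if a + b + c == 0:
        return 0.0  # assumed convention: no bits set in either molecule
    return c / (a + b + c)


fp_a = [1, 1, 1, 0, 0, 0, 0, 0]
fp_b = [1, 1, 0, 0, 1, 0, 0, 0]
score = tanimoto(fp_a, fp_b)
print(score)
```

Here a = 1, b = 1 and c = 2, so the score is 2 / 4 = 0.5. Bits that are off in both molecules (d) do not enter the formula.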
32PubChem Chemical Searching
33Bio-Chem Data Joins
34A Pharmaceutical Portfolio from PubChem
35Disambiguation
From Wells et al., "Reaching for high-hanging fruit in drug discovery at protein-protein interfaces"
1R6N
1Y2F
36OSRA Optical Structure Recognition
37Checking Chemical Patents
- Taking Nutlin-3 as an example, the SMILES entry from PubChem, CC(C)OC1=C(C=CC(=C1)OC)C2=NC(C(N2C(=O)N3CCNC(=O)C3)C4=CC=C(C=C4)Cl)C5=CC=C(C=C5)Cl, was pasted into the SureChem search box
- There are nine exact matches, including the granted patent application from Roche shown below
38Exploring Relationships in Entrez
[Diagram: Entrez record types and the similarity relationships that link them]
- Protein sequences: BLAST sequence similarity
- Literature (PubMed): MeSH-indexed biological terms
- Protein 3D structures: VAST structure similarity
- Small-molecule structures: 2D chemical structure similarity (3D soon)
- Bioactivity assay results: activity profile similarity
39 Linkage between Swiss-Prot-DrugBank-PubChem-MMDB
(411) (15728) 181 (2501)
see these marketed target links