Title: Open chemical dictionaries and ontologies for biosciences
1Open chemical dictionaries and ontologies for biosciences
Kirill Degtyarenko, EMBL-EBI
2The team
- Rafael Alcántara
- Michael Ashburner
- Volker Ast
- Sergio Contrino
- Michael Darsow
- Paula de Matos
- Marcus Ennis
- Janna Hastings
- Alan McNaught
- Martin Zbinden
3Thanks
- EU funding
- Tamara Kulikova photo
4What is EBI?
5EMBL-EBI: The European Bioinformatics Institute
- We develop and provide
- EMBL Nucleotide Sequence Database
- UniProt (Swiss-Prot/TrEMBL/PIR)
- InterPro
- Macromolecular Structure Database
- ENSEMBL
- ArrayExpress
6EMBL-EBI: The European Bioinformatics Institute
- Is also home to
- Gene Ontology editorial office: http://www.geneontology.org/
7What is an ontology?
8What is an ontology?
9Ontology definitions
- Ontology: the theory or study of being as such; i.e., of the basic characteristics of all reality (Encyclopædia Britannica)
- Ontology, or the science of something and of nothing, of being and not-being, of the thing and the mode of the thing, of substance and accident (Gottfried Wilhelm Leibniz)
- Ontology: a formal definition of concepts (entities, relationships) of a given area of knowledge, described in a standardized form (Carugo & Pongor, 2002)
- An ontology is a specification of a conceptualization (Tom Gruber)
- More cracking definitions from http://www.formalontology.it/ !
10Working definition
Ontology: an explicit specification of some topic which includes a vocabulary of terms (names) with defined logical relationships to each other. (Jane Lomax, EBI)
11NCBI Taxonomy
Eukaryota Metazoa Chordata Craniata
Vertebrata Euteleostomi Archosauria
Aves Neognathae
Passeriformes Hirundinidae
Hirundo Hirundo rustica
Phylum ? Subphylum ?
Class ? Order ?
Family ? Genus ? Species ?
12Enzyme Taxonomy
EC 2 Transferases
EC 2.8 Transferring sulfur-containing groups
EC 2.8.2 Sulfotransferases
EC 2.8.2.25 Flavonol 3-sulfotransferase
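The EC hierarchy above is encoded in the number itself: each dot-separated prefix names a broader class. A minimal Python sketch of that idea follows; the function name `ec_lineage` is hypothetical and the code only handles fully numeric EC numbers.

```python
def ec_lineage(ec_number: str) -> list[str]:
    """Return an EC number together with its ancestors, most general first."""
    parts = ec_number.split(".")
    # Assumption: only fully numeric EC numbers with 1 to 4 levels are accepted.
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a numeric EC number: {ec_number!r}")
    return ["EC " + ".".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(ec_lineage("2.8.2.25"))
```

For the flavonol 3-sulfotransferase example this walks through EC 2, EC 2.8 and EC 2.8.2 before reaching EC 2.8.2.25.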
13OBO
Open Biomedical Ontologies is an umbrella web address for well-structured controlled vocabularies for shared use across different biological and medical domains: http://obo.sourceforge.net/
14ChEBI What is it?
Chemical Entities of Biological Interest: an EBI database/dictionary of biochemical compounds
15What are the biochemical compounds?
Can be defined as consisting of molecules not directly encoded by the genome ... that are either the products of nature or are synthetic products used ... to intervene in the processes of living organisms (Michael Ashburner)
16Molecular entity
Any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer etc., identifiable as a separately distinguishable entity (IUPAC Gold Book)
17In fact, ChEBI contains
- Molecular entities
- trans-vaccenic acid
- Groups
- trans-vaccenoyl group
- Classes
- fatty acids
18Small molecules?
- Yes, but big molecules as well!
- alumina
- amylose
- metaborate
- poly(vinyl alcohol)
19 1-D ChEBI
- Numeric ID
- Carefully checked terminology
- Unambiguous ChEBI name
- IUPAC names
- Cross-references to free resources
20Unambiguous ChEBI name
- CHEBI:28918
- L-adrenaline
- not just adrenaline
21IUPAC name
- 4-[(1R)-1-hydroxy-2-(methylamino)ethyl]benzene-1,2-diol
22The Unpronounceables
CHEBI:32902 gibberellin A4
IUPAC name: (1R,2R,5R,8R,9S,10R,11S,12S)-12-hydroxy-11-methyl-6-methylidene-16-oxo-15-oxapentacyclo[9.3.2.1^{5,8}.0^{1,10}.0^{2,8}]heptadecane-9-carboxylic acid
23Need for 2-D
- Better to see the face than to hear the name (Zen proverb)
- Structures and identifiers based on structures offer new ways of crosslinking to other databases
- Our users desperately want it!
24Connection table
ChEBI
  9 10  0  0  0  0            999 V2000
   11.8219   -7.2713    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   11.8219   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6074   -7.0165    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -6.8574    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   12.6039   -8.3505    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.1072   -8.5027    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   13.0886   -7.6818    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.3923   -7.2713    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   10.3888   -8.0922    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2  5  1  0  0  0  0
  2  6  1  0  0  0  0
  3  7  1  0  0  0  0
  4  8  2  0  0  0  0
  6  9  2  0  0  0  0
  5  7  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
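A connection table like the one on this slide can be read with a few lines of Python. The sketch below is a simplification: it splits on whitespace, whereas the real V2000 format is fixed-width, and it ignores every column beyond coordinates, element symbol, bond partners and bond order. The function name `parse_v2000` is hypothetical.

```python
from collections import Counter

# The nine-atom, ten-bond table from the slide, trimmed to the columns read below.
MOLBLOCK = """\
ChEBI
  9 10  0  0  0  0            999 V2000
   11.8219   -7.2713    0.0000 C   0  0
   11.8219   -8.0922    0.0000 C   0  0
   12.6074   -7.0165    0.0000 N   0  0
   11.1072   -6.8574    0.0000 C   0  0
   12.6039   -8.3505    0.0000 N   0  0
   11.1072   -8.5027    0.0000 N   0  0
   13.0886   -7.6818    0.0000 C   0  0
   10.3923   -7.2713    0.0000 N   0  0
   10.3888   -8.0922    0.0000 C   0  0
  1  2  2  0
  1  3  1  0
  1  4  1  0
  2  5  1  0
  2  6  1  0
  3  7  1  0
  4  8  2  0
  6  9  2  0
  5  7  2  0
  8  9  1  0
M  END
"""


def parse_v2000(text):
    """Return (atoms, bonds) from a V2000-style connection table."""
    lines = text.splitlines()
    # The counts line is the one that ends with the version tag.
    start = next(i for i, line in enumerate(lines) if line.rstrip().endswith("V2000"))
    counts = lines[start].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    atoms = []
    for line in lines[start + 1 : start + 1 + n_atoms]:
        x, y, z, symbol = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for line in lines[start + 1 + n_atoms : start + 1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in line.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLBLOCK)
print(Counter(symbol for symbol, *_ in atoms))  # heavy-atom composition
print(sum(1 for *_, order in bonds if order == 2), "double bonds")
```

Hydrogens are implicit in this table, so only the five carbons and four nitrogens appear as atoms.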
25 2-D ChEBI
- One or more 2-D (or 3-D) connection tables
- One is default
- Autogenerated images (PNG)
- Default diagrams should be unambiguous
26Art of chemical drawing
(R)-camphor
ambiguous
unambiguous
27From 2-D back to 1-D
28SMILES (1)
- Simplified Molecular Input Line Entry Specification
- Developed by David Weininger in 1988
- Extended by others (e.g. Daylight)
- String of standard ASCII characters
- A number of valid SMILES can be produced for the same molecule
29SMILES (2)
- N1CNC2C1CNCN2
- c1ncc2ncnc2n1
- C1N\CN/C\2N/CN\C1/2
- c1ncnc2/NC\Nc12
- n1cc2c(nc1)ncn2
- [H]c1nc([H])c2n([H])c([H])nc2n1
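That different SMILES strings can describe the same molecule can be partly checked with the standard library alone. The sketch below compares only heavy-atom composition for two of the aromatic strings on this slide. It is not canonicalisation (that would need a cheminformatics toolkit), and it assumes organic-subset atoms written without brackets; the name `heavy_atom_counts` is hypothetical.

```python
import re
from collections import Counter

# Organic-subset atom symbols only; two-letter halogens are tried first.
# Assumption: bracket atoms such as [Pt] or [nH] are not handled correctly.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count heavy atoms in a bracket-free SMILES string."""
    return Counter(symbol.capitalize() for symbol in ATOM.findall(smiles))


first = heavy_atom_counts("c1ncc2ncnc2n1")
second = heavy_atom_counts("n1cc2c(nc1)ncn2")
print(first, second, first == second)
```

Equal composition is necessary but not sufficient for identity, which is one reason a canonical identifier is useful.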
30InChI (1)
- IUPAC International Chemical Identifier or InChI
- Open source
- Developed by Stein, Heller, Tchekhovskoi and McNaught
- Used by NIST, PubChem, CML and ChEBI
31InChI (2)
InChI=1/C5H4N4/c1-4-5(8-2-6-1)9-3-7-4/h1-3H,(H,6,7,8,9)
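An InChI string is a sequence of slash-separated layers: a version, a formula, then layers tagged by a one-letter prefix (here `c` for connections and `h` for hydrogens). The following Python sketch splits the string from this slide into those layers. The function name `inchi_layers` is hypothetical, and the code assumes each prefix occurs only once, which does not hold for every InChI.

```python
def inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into its version, formula and prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        if layer:
            # Assumption: a repeated prefix would overwrite the earlier layer.
            layers[layer[0]] = layer[1:]
    return layers


purine = "InChI=1/C5H4N4/c1-4-5(8-2-6-1)9-3-7-4/h1-3H,(H,6,7,8,9)"
for name, value in inchi_layers(purine).items():
    print(name, value)
```

The `(H,6,7,8,9)` part of the hydrogen layer records a mobile hydrogen shared among four nitrogens, which is how one string covers the tautomers.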
32Limitations (1)
- Stereochemistry other than sp3 tetrahedral and sp2 trigonal planar
- Polymers
- Conformers
- Radicals/different spin state
- Topological isomers
- Mixtures
- Markush structures
33Limitations (2)
cisplatin
transplatin
InChI=1/2ClH.2H3N.Pt/h2*1H;2*1H3;/q;;;;+2/p-2
34 3-D ChEBI
cisplatin
35ChEBI ontology
- Molecular structure ontology
- Subatomic particle ontology
- Biological role ontology
- Application ontology
36L-adrenaline
- Molecular structure ontology
- catecholamines
- Biological role ontology
- hormone
- Application ontology
- antiglaucoma
- bronchodilator
- cardiostimulant
37The family relations
L-cystein-S-yl
L-cysteine()
L-cysteine zwitterion
cysteine
D-cysteine
L-cysteino
L-cysteine
L-cysteinium
L-cysteinyl
L-cysteinate(1-)
L-cysteine residue
L-cysteinate(2-)
L-cysteinate residue
38Relationships in ChEBI
39Is A relationship
L-cysteine is a cysteine
40Is Enantiomer Of
L-cysteine is enantiomer of D-cysteine
41Is Tautomer Of
L-cysteine is tautomer of L-cysteine zwitterion
42Is Conjugate Acid Of
L-cysteinium is conjugate acid of L-cysteine
L-cysteine is conjugate acid of L-cysteinate(1-)
L-cysteinate(1-) is conjugate acid of L-cysteinate(2-)
43Is Conjugate Base Of
L-cysteinate(2-) is conjugate base of L-cysteinate(1-)
L-cysteinate(1-) is conjugate base of L-cysteine
L-cysteine is conjugate base of L-cysteinium
44Acid/base relationships
L-cysteinium
L-cysteine
L-cysteinate(1-)
L-cysteinate(2-)
Each entity is conjugate acid of the next and conjugate base of the previous.
45Is Part Of
L-cysteinium is part of L-cysteine hydrochloride
46Is Substituent Group From
L-cysteinyl, L-cysteino and L-cysteine residue: each is substituent group from L-cysteine
47Has Parent Hydride
benzene is parent hydride of 1,2,3-trichlorobenzene
1,2,3-trichlorobenzene has parent hydride benzene
48Has Functional Parent
L-cysteine is functional parent of S-(4-bromophenyl)-L-cysteine
S-(4-bromophenyl)-L-cysteine has functional parent L-cysteine
49The family relations
L-cysteine()
L-cysteinium
L-cystein-S-yl
cysteine
L-cysteine zwitterion
L-cysteine
D-cysteine
L-cysteino
L-cysteinyl
L-cysteinate(1-)
L-cysteine residue
L-cysteinate(2-)
L-cysteinate residue
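The typed relationships among the cysteine family can be held as subject, relation, object triples. The Python sketch below is only an illustration of that data structure, not how ChEBI stores its ontology: the triples are the examples from the preceding slides, and the `INVERSE` table is an assumption about which relations are symmetric or come in inverse pairs.

```python
RELATIONS = [
    ("L-cysteine", "is a", "cysteine"),
    ("L-cysteine", "is enantiomer of", "D-cysteine"),
    ("L-cysteine", "is tautomer of", "L-cysteine zwitterion"),
    ("L-cysteinium", "is conjugate acid of", "L-cysteine"),
    ("L-cysteinium", "is part of", "L-cysteine hydrochloride"),
    ("L-cysteinyl", "is substituent group from", "L-cysteine"),
    ("S-(4-bromophenyl)-L-cysteine", "has functional parent", "L-cysteine"),
]

# Assumption: these relations are symmetric or have a named inverse.
INVERSE = {
    "is conjugate acid of": "is conjugate base of",
    "is conjugate base of": "is conjugate acid of",
    "is enantiomer of": "is enantiomer of",
    "is tautomer of": "is tautomer of",
}


def with_inverses(triples):
    """Add the inverse triple for every relation listed in INVERSE."""
    closed = set(triples)
    for subject, relation, obj in triples:
        if relation in INVERSE:
            closed.add((obj, INVERSE[relation], subject))
    return closed


def related(entity, triples):
    """List (relation, object) pairs whose subject is the given entity."""
    return sorted((rel, obj) for subj, rel, obj in triples if subj == entity)


for relation, target in related("L-cysteine", with_inverses(RELATIONS)):
    print("L-cysteine", relation, target)
```

Adding inverses automatically means a curator only has to assert each acid/base or enantiomer pair once.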
50Ontology of L-cysteine
51Ontology of L-cysteine (1)
52Ontology of L-cysteine (2)
53Current status (25.04.07)
54Users of ChEBI
- ArrayExpress
- BIND
- BioModels
- ChemIDplus
- Human Metabolite Database
- KEGG COMPOUND
- Reactome
- Industry (Chenomx, Lion, etc.)
55http://www.ebi.ac.uk/come/
- Italian word come (how)
- English word come (not GO)
- Classification Of Metalloproteins
- COfactors and Metals
- COMplex proteins, etc.
- Co-Ordination of Metals in proteins
- Contrino and me
56COMe version 5.1
- Controlled vocabulary
- 1376 protein classes (PRX)
- 524 bioinorganic motifs (BIM)
- 179 small molecules (MOL)
- organised as
- XML version (master)
- Oracle version
57COMe top of hierarchy
Complex proteins belong to at least one of three
groups
- Metalloprotein
- Organic prosthetic group protein
- Modified amino acid protein
58COMe entry PRX000552
59Path to PRX000552
complex protein PRX000001 includes
  metalloprotein PRX000002 includes
    iron protein PRX000004 includes
      iron-sulphur protein PRX000007 includes
        Fe(34)S4 protein PRX000054 includes
          Fe4S4Cys4 protein PRX000088 includes
            Fe4S4/DMSO reductase-like PRX000546 includes
              formate dehydrogenase catalytic subunit PRX000557 includes
                molybdenum formate dehydrogenase catalytic subunit PRX000733 includes
                  formate dehydrogenase N, catalytic subunit PRX000552 includes
Instance: formate dehydrogenase, nitrate-inducible, major subunit, Escherichia coli, UniProt:P24183
60Molecule (MOL)
- Controlled vocabulary of small molecular entities bound to complex proteins
- Cross-references to (bio)chemical resources: chemPDB, NIST Chemistry Webbook, LIGAND, RESID
- In future: ChEBI
61COMe entry MOL000015
62Bioinorganic motif (BIM)
- A common structural feature of a class of functionally related, but not necessarily homologous, proteins, that includes the metal atom(s) and first coordination shell ligands
- Degtyarenko (2000) Bioinformatics 16, 851-864
63Example BIM000027
T-4
Fe(SG.Cys)4
64Example BIM000056
(Fe2S2)(ND.His)2(SG.Cys)2
65Example BIM000061
Fe4(µ3-S)4(OD.Asp)(SG.Cys)3
66Fe4S4sirohaem centre (1)
MOL000131 Fe4S4
67Fe4S4sirohaem centre (2)
BIM000008 Fe4(µ3-S)4(SG.Cys)4
68Fe4S4sirohaem centre (3)
BIM000026 (Fe4S4)(SG.Cys)3Fe(por)µ-(SG.Cys)
69Relationships in COMe
- IsA: inherits all attributes
- PRX to PRX
- cytochrome c IsA cytochrome
- Is_Part_Of: no inheritance
- BIM to BIM; MOL to MOL; MOL to BIM; BIM to PRX
- Fe(por)(NE.His)2 Is_Part_Of cytochrome b5
- Is_Bound_To: no inheritance
- MOL to PRX
- haem b Is_Bound_To cytochrome c
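The difference between IsA (attributes are inherited) and Is_Part_Of (no inheritance) can be sketched in Python as follows. The term names come from this slide; the attribute values are placeholders invented for illustration, the function name is hypothetical, and single inheritance is assumed for brevity.

```python
# child -> parent along IsA; assumption: at most one parent per term.
IS_A = {"cytochrome c": "cytochrome"}
# Recorded for completeness but deliberately never consulted for inheritance.
IS_PART_OF = {"Fe(por)(NE.His)2": "cytochrome b5"}

# Placeholder attribute values, invented for this example only.
ATTRIBUTES = {
    "cytochrome": {"note": "set on cytochrome"},
    "cytochrome b5": {"note": "set on cytochrome b5"},
}


def inherited_attributes(term, is_a, attributes):
    """Merge attributes along the IsA chain; more specific terms override."""
    chain, seen = [], set()
    while term is not None and term not in seen:  # guard against cycles
        seen.add(term)
        chain.append(term)
        term = is_a.get(term)
    merged = {}
    for ancestor in reversed(chain):  # most general first
        merged.update(attributes.get(ancestor, {}))
    return merged


print(inherited_attributes("cytochrome c", IS_A, ATTRIBUTES))
print(inherited_attributes("Fe(por)(NE.His)2", IS_A, ATTRIBUTES))
```

The motif that is part of cytochrome b5 picks up nothing from it, whereas cytochrome c receives whatever is set on cytochrome.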
70Paths to PRX000552
71Physico-chemical ontology ???
- Physico-chemical property
- Physico-chemical method
- Available at OBO web site (http://obo.sourceforge.net/)
72Molecular entity has
- Mass (molecular weight)
- Size
- Shape
- Charge
- Structure
- One can derive many properties from known complete structure
- Spectra
73Relationships in FIX
- IsA
- Raman spectroscopy IsA vibrational spectroscopy
- Is_Part_Of
- phasing method Is_Part_Of crystallography
- Can_Be_Determined_By
- molecular structure Can_Be_Determined_By
crystallography
74Molecular Property vs Method
Properties: Heat capacity, Mass, Net charge, Shape, Size, Structure, Geometry, Connectivity, Topography
Methods: Calorimetry, Centrifugation, Crystallography, Electrophoresis, Isotope method, Mass spectrometry, Microscopy, Spectroscopy
75A snapshot of FIX (1)
76A snapshot of FIX (2)
77A snapshot of FIX (3)
78Physico-chemical process (REX)
- IUPAC definitions (if available)
- Macroscopic and microscopic processes
- Available at OBO web site
79Biochemical reactions (1)
- Enzymatic reactions
- Non-enzymatic reactions
80Biochemical reactions (2)
- Catalytic (reaction type: catalyst)
- Enzymatic: protein
- Abzymatic: antibody
- Deoxyribozymatic: DNA
- Ribozymatic: RNA
- Heterogeneous: surface (e.g. metal)
- Homogeneous: solute (e.g. metal)
- Non-catalytic
- Photoinduced
- Spontaneous
81Biochemical reactions (3)
- Biotransformation
- A + B → C + D (A, B, C, D: small molecules)
- Binding
- A + M → AM (M: macromolecule)
- Molecular transport
- A(compartment X) → A(compartment Y)
- Electron and exciton transfer reactions
- Conformation change (e.g. folding)
82Relationships in REX
- IsA
- redox reaction IsA chemical reaction
- Is_Part_Of
- photoexcitation Is_Part_Of photoabsorption
- Is_Reverse_Of
- associative desorption Is_Reverse_Of dissociative adsorption
- Not DAG!
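One plausible reading of "Not DAG!" is that a symmetric relation such as Is_Reverse_Of, stored in both directions, puts a cycle into the graph. The Python sketch below checks this with a depth-first search over the examples from this slide. The edge lists and the function name `has_cycle` are illustrative and not taken from REX itself.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])
    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)

    def visit(node):
        state[node] = in_progress
        for neighbour in graph[node]:
            if state[neighbour] == in_progress:
                return True  # back edge: we returned to a node on the current path
            if state[neighbour] == unvisited and visit(neighbour):
                return True
        state[node] = done
        return False

    return any(state[node] == unvisited and visit(node) for node in graph)


hierarchy = [
    ("redox reaction", "chemical reaction"),  # IsA
    ("photoexcitation", "photoabsorption"),   # Is_Part_Of
]
reverse_pair = [
    ("associative desorption", "dissociative adsorption"),  # Is_Reverse_Of
    ("dissociative adsorption", "associative desorption"),  # stored symmetrically
]

print(has_cycle(hierarchy))
print(has_cycle(hierarchy + reverse_pair))
```

Tools that assume a DAG, for example when computing ancestors, would need to skip such symmetric relations.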
83A snapshot of REX (1)
84A snapshot of REX (2)
85A snapshot of REX (3)
86Users
- ChEBI, FIX, REX
- Oscar3 (University of Cambridge)
- ProjectProspect (Royal Society of Chemistry)
- Gene Ontology
- ChEBI REX
- Coming soon IntEnz
- COMe
- InterPro
87Summary
- Ontologies provide controlled vocabulary organised as a directed graph
- ChEBI: standard terminology and structure of (bio)chemical compounds
- COMe: ontology for bioinorganic proteins
- FIX: controlled vocabulary for physico-chemical properties and methods
- REX: controlled vocabulary for physico-chemical processes
88Links to remember
http://www.ebi.ac.uk/chebi/
http://www.ebi.ac.uk/come/
http://www.ebi.ac.uk/kirill/FIX/
http://www.ebi.ac.uk/kirill/REX/
89Grazie / Thank you
90Future plans
- Diversity of biochemical reactions and mechanistic aspects of enzymatic catalysis
- Development of database for quantitative properties of functional centres in metalloproteins and other complex proteins
- Further development of ontology of complex proteins based on the concept of bioinorganic motif
91Diversity of biochemical reactions
- Unambiguous chemical representation of reactions
- Further development of REX
- Collaboration with IntEnz, MACiE
92Quantitative properties of functional centres
- From qualitative to quantitative annotation
- Proteins utilise and modulate properties of non-peptide groups
- Redox potentials
- Absorption maxima (better, spectra)
- Dissociation constants
- Collaboration with experimentalists (here?)
93Metalloproteins
- Further development of COMe
- Annotation and prediction of metal-binding sites
- Selectivity and specificity of metalloprotein cofactors
- Collaborations (University of Edinburgh, Academia Sinica)
94Metal-binding sites in proteins
- Position-specific annotation of experimentally determined and inferrable metal-binding sites in UniProt and closely connected resources (e.g. InterPro)
- Automation of major annotation steps to ensure the sustainability of this feature with minimum maintenance/curation in the future
- Development of prediction methodology for protein metal-binding sites from multiple sequence alignments
95Selectivity and specificity of metalloprotein cofactors
- Two separate problems: binding selectivity and catalytic efficiency
- The database of experimentally defined qualitative and quantitative data for protein-metal interactions (coordination geometry, binding constants, redox potentials)
- Density functional theory (DFT)
- Continuum dielectric methods (CDM)
96Computational metallomics
- Metallomics: comprehensive analysis of the entirety of metal and metalloid species within a cell or tissue type
- A branch of metabolomics
- Enzymatic reactions, transport phenomena, metal targeting (metallochaperones) and non-catalytic reactions
- The database of metal metabolism and transport pathways