Title: Guest Lecture
1 Bio-Chemical databases
Guest Lecture, Graduate level course MCB221b - Mechanistic Enzymology
Tobias Kind, November 2007
- Database concepts: what is a good database (DB)?
- How is data stored, queried and curated?
- Enzyme DBs, protein and peptide DBs, small molecule DBs
2 Databases: a very short primer (*)
- Database interface: what you see
- Database queries: what you ask the database
- Database objects: where the data is stored (indexes and tables)
- Database types: relational databases, object-oriented databases, flat-file DBs
- Database brands: Oracle, MySQL, Apache, IBM DB2, PostgreSQL, MS SQL
- Database query language: how a database can be programmed (SQL)
- Database dump file: the whole database in a single (.dmp) file
- Database ontology: the database vocabulary and the relationships used
- Database semantics: capture meaning by grammar or logical analysis
(*) You can study this for several years and get a PhD in computer and database sciences.
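The primer terms can be made concrete with a minimal sketch using Python's built-in sqlite3 module. SQLite is not one of the brands named on the slide, and the table name `enzyme`, its columns and its three rows are assumptions chosen only for illustration. The sketch shows database objects (a table and an index), a query, and a dump of the whole database.

```python
import sqlite3

# In-memory relational database; SQLite is used only because it ships with Python.
conn = sqlite3.connect(":memory:")

# Database objects: a table holds the data, an index speeds up lookups.
# Table name and columns are hypothetical, chosen for illustration.
conn.execute("CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT, organism TEXT)")
conn.execute("CREATE INDEX idx_enzyme_organism ON enzyme (organism)")

rows = [
    ("1.1.1.1", "alcohol dehydrogenase", "Saccharomyces cerevisiae"),
    ("2.7.1.1", "hexokinase", "Homo sapiens"),
    ("3.4.21.4", "trypsin", "Bos taurus"),
]
conn.executemany("INSERT INTO enzyme VALUES (?, ?, ?)", rows)
conn.commit()

# Database query: what you ask the database, written in SQL.
human = conn.execute(
    "SELECT ec_number, name FROM enzyme WHERE organism = ?", ("Homo sapiens",)
).fetchall()
print(human)

# Database dump: the whole database as one stream of SQL statements.
dump_sql = "\n".join(conn.iterdump())
print(dump_sql)
```

The dump is plain SQL text here. Other database systems use their own dump formats, such as the .dmp files mentioned above.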
3 What is a good database? As in normal life, it's important to distinguish between good and evil.
- Good DB
  - allows multiple input queries
  - exports in multiple output formats
  - connects to other DBs
  - is curated (meaning checked for errors by humans or machines)
  - is regularly updated (daily, yearly)
  - costs money (your money or taxpayers' money) or time
  - allows bulk download (millions of data sets can be downloaded)
  - has open interfaces (APIs) for query requests
- Bad DB
  - allows only single requests (which have to be typed manually)
  - is not a database but just a list or table
  - has no link-out and no link-in
  - allows no bulk download
  - is not curated
Source: wikimedia.org
4 Exchange formats: SBML, XML, BioPAX
- XML format: general purpose data format (CML for storing chemical data). Example (methane):
    <?xml version="1.0"?>
    <molecule id="m1">
      <atomArray>
        <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264"/>
      </atomArray>
      <bondArray>
      </bondArray>
    </molecule>
- BioPAX format: used for representing pathway data (data exchange format)
- SBML format: represents models of biochemical reaction networks
- SDF format: general purpose chemical structure format (small molecules)
- RDF format: format for storing chemical reactions (small molecules)
- PDB format: general purpose chemical structure format (proteins)
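Because CML is plain XML, any XML parser can read it. The following is a minimal sketch using Python's standard xml.etree.ElementTree on the methane fragment from this slide. The helper name `read_atoms` is an assumption, and real CML files usually declare an XML namespace that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# CML fragment from the slide: a single carbon atom (methane, hydrogens implicit).
CML = """<?xml version="1.0"?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C" x2="-3.0333333015441895" y2="2.9166667461395264"/>
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""

def read_atoms(cml_text):
    """Return (atom id, element, x, y) tuples for every <atom> in a CML string."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter("atom"):
        atoms.append((
            atom.get("id"),
            atom.get("elementType"),
            float(atom.get("x2")),  # 2D drawing coordinates
            float(atom.get("y2")),
        ))
    return atoms

atoms = read_atoms(CML)
print(atoms)
```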
5 SBML (Systems Biology Markup Language)
Source: Akira Funahashi, CellDesigner Tutorial
- List of supported SBML programs (more than 200) from sbml.org
- List of curated and published SBML models (around 200) from BioModels DB
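To give an idea of what an SBML model contains, here is a minimal sketch that reads reactants and products from a tiny hand-written SBML-like document. The model (a single hexokinase step), all identifiers and the namespace URI are assumptions for illustration, and the document has not been validated against the SBML schema. Dedicated libraries such as libSBML are the usual choice for real models.

```python
import xml.etree.ElementTree as ET

# Hand-written, unvalidated example; the namespace URI is assumed for SBML Level 2.
SBML = """<?xml version="1.0"?>
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="hexokinase_step">
    <listOfCompartments>
      <compartment id="cytosol"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytosol"/>
      <species id="ATP" compartment="cytosol"/>
      <species id="G6P" compartment="cytosol"/>
      <species id="ADP" compartment="cytosol"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="HK" reversible="false">
        <listOfReactants>
          <speciesReference species="glucose"/>
          <speciesReference species="ATP"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="G6P"/>
          <speciesReference species="ADP"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def read_reactions(sbml_text):
    """Map each reaction id to a (reactants, products) pair of species id lists."""
    root = ET.fromstring(sbml_text)
    reactions = {}
    for elem in root.iter():
        if local(elem.tag) != "reaction":
            continue
        sides = {"listOfReactants": [], "listOfProducts": []}
        for child in elem:
            name = local(child.tag)
            if name in sides:
                sides[name] = [ref.get("species") for ref in child]
        reactions[elem.get("id")] = (sides["listOfReactants"], sides["listOfProducts"])
    return reactions

reactions = read_reactions(SBML)
print(reactions)
```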
6 APIs, Mashups, SQL
- Application programming interfaces (APIs) are important to connect and automate data exchange between local programs and databases. Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program
- Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage
- SQL is used for programming databases
Large database table (nobel):
yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach

SQL query:
SELECT yr, subject, winner FROM nobel WHERE yr = 1909 AND subject = 'Chemistry'

Result:
yr    subject    winner
1909  Chemistry  Wilhelm Ostwald

Visit the SQL Zoo (http://sqlzoo.net/)
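The Nobel query from this slide can be tried locally. The sketch below loads the ten rows shown above into an in-memory SQLite database through Python's built-in sqlite3 module. The choice of SQLite and the column types are assumptions, since the slide does not name a database system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")

# The ten rows shown on the slide.
laureates = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]
conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", laureates)

# Same query as on the slide; the "?" placeholders keep values out of the SQL text.
query = "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?"
result = conn.execute(query, (1909, "Chemistry")).fetchall()
print(result)
```

SQLite compares text case-sensitively by default, so the subject has to be spelled 'Chemistry' exactly as it is stored. Other database systems may treat 'chemistry' as equal.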
7 Database front-ends (a good one): Enhanced NCI Database Browser Release 2 (CACTVS DB)
- Small molecule DB with revolutionary web front-end (2001)
- Multiple input and output (export) methods
- Allows matching of molecule lists against the DB (as SMILES, CAS, NCI number)
- Links to other services
- Visualization modes (2D, 3D)
- 20 different molecular output formats (SDF, CML, SMILES)
- Export to other (computational) services
- 30 different query modes
8 Database visualization
- Visualize complex networks; uses plug-in technology from different sources
- Map your own compound data (proteins, genes, molecules) onto networks
- Perform literature search with enzymes, genes, small molecules
Source: Cytoscape.org
Start Cytoscape via Java Web Start
9 Uber-portals (NCBI Entrez)
10 Database and tools integration: Gaggle
Source: Wikimedia
- Frameworks
- Portals
- Mashups
Source: http://gaggle.systemsbiology.org/docs/geese/
11 Gaggle: integration of tools and database services
Source: Wikimedia
The Gaggle: an open-source software system for integrating bioinformatics software and data sources. Shannon PT, Reiss DJ, Bonneau R, Baliga NS. BMC Bioinformatics. 2006 Mar 28;7:176.
Use Gaggle
12 Use or build your own local database. Example: LipidMaps DB with Instant-JChem
- Download the whole LipidMaps DB (10,000 lipids) as an SDF file
- Use Instant-JChem as data DB, molecule DB, reaction DB
- Perform data and molecule queries on your laptop (PC, Linux, Mac)
(also works with KEGG/Biometa DB)
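A downloaded SDF file can also be queried with a few lines of code, without a chemistry toolkit. The sketch below reads only the data fields of each SDF record and skips the structure block. The two records, the field names (`LM_ID`, `COMMON_NAME`, `EXACT_MASS`) and all values are hypothetical placeholders rather than real LipidMaps entries, and each mol block is reduced to a single dummy atom.

```python
# Two made-up SDF records; structure blocks are single-atom placeholders.
SDF_TEXT = """example_lipid_A
  hypothetical

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <LM_ID>
EX0001

> <COMMON_NAME>
example lipid A

> <EXACT_MASS>
256.24

$$$$
example_lipid_B
  hypothetical

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <LM_ID>
EX0002

> <COMMON_NAME>
example lipid B

> <EXACT_MASS>
760.59

$$$$
"""

def parse_sdf_data(sdf_text):
    """Return one dict of data fields per SDF record (the mol block is skipped)."""
    records = []
    for block in sdf_text.split("$$$$"):  # "$$$$" terminates each record
        if not block.strip():
            continue
        fields = {}
        current = None
        for line in block.splitlines():
            if line.startswith(">"):
                # Data header such as "> <COMMON_NAME>": the tag sits between < and >.
                start, end = line.find("<"), line.rfind(">")
                current = line[start + 1:end] if 0 <= start < end else None
                if current is not None:
                    fields[current] = []
            elif current is not None:
                if line.strip():
                    fields[current].append(line.strip())
                else:
                    current = None  # a blank line ends the data item
        records.append({key: "\n".join(value) for key, value in fields.items()})
    return records

records = parse_sdf_data(SDF_TEXT)

# A simple local "query": names of all entries lighter than 300 Da.
light = [r["COMMON_NAME"] for r in records if float(r["EXACT_MASS"]) < 300.0]
print(light)
```

For structure-aware queries such as substructure search, a tool like Instant-JChem or a cheminformatics library is still needed. This sketch only covers the text fields.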
13 Welcome to the (database) jungle!
- ChemBioGrid: collection of most chemistry databases (current number: 156)
- Pathguide.org: collection of pathway, enzyme, metabolite DBs (current number: 231)
- Chemistry related (big players): PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing)
- Important for chemistry/metabolomics: spectral databases (NMR, mass spectral databases), compound property DBs
- Pathway, enzyme related: KEGG, BRENDA, Reactome, ExPASy, MetaCyc
14 Pathguide.org
Pathguide is a meta-database: a comprehensive collection of pathway, small molecule, enzyme and protein interaction databases.
15 Enzyme and kinetics related databases
- KDBI - Kinetic Data of Bio-molecular Interactions database: http://bidd.nus.edu.sg/group/kdbi/
- SABIO-RK - SABIO-Reaction Kinetics Database: http://sabio.villa-bosch.de/SABIORK/
- BRENDA - Comprehensive Enzyme Information System: http://www.brenda.uni-koeln.de/
- EMP - Enzymes and Metabolic Pathways Database: http://www.empproject.com/
- ENZYME - Enzyme nomenclature database (ExPASy): http://www.expasy.ch/enzyme/
- IntEnz - Integrated relational Enzyme database: http://www.ebi.ac.uk/intenz/index.html
- TECR - Thermodynamics of Enzyme-Catalyzed Reactions: http://xpdb.nist.gov/enzyme_thermodynamics/
- REBASE - Restriction Enzyme Database: http://rebase.neb.com/
- Precise - Predicted and Consensus Interaction Sites in Enzymes: http://precise.bu.edu/
Source: Pathguide; own search
16 PubChem
- Most important small molecule DB
- There was no large open chemistry DB until 10 years ago (!)
- All records can be downloaded via FTP
- All other small molecule DBs link to PubChem
- PubChem Compounds (true chemicals)
- PubChem Substances (formulations, mixtures)
- Substructure search and multiple other options
Go to PubChem
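Besides FTP bulk download, PubChem can be queried programmatically. The sketch below assumes PubChem's PUG REST interface, an HTTP interface that is newer than these slides, and the URL layout used here should be checked against the current PubChem documentation. The function names are hypothetical, the network call is defined but not executed, and the sample reply contains placeholder values only.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PUG REST (assumption: layout as in the current PubChem docs).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def property_url(name, properties):
    """Build a PUG REST URL asking for compound properties by chemical name."""
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, quote(name, safe=""), ",".join(properties)
    )

def parse_properties(json_text):
    """Extract the list of property records from a PUG REST JSON reply."""
    return json.loads(json_text)["PropertyTable"]["Properties"]

def fetch_properties(name, properties):
    """Query PubChem over the network (defined here, but not called in this sketch)."""
    with urlopen(property_url(name, properties), timeout=30) as reply:
        return parse_properties(reply.read().decode("utf-8"))

url = property_url("adenosine triphosphate", ["MolecularFormula", "TPSA"])
print(url)

# Made-up reply with placeholder values, only to show the expected layout.
sample = '{"PropertyTable": {"Properties": [{"CID": 0, "TPSA": 0.0}]}}'
print(parse_properties(sample))
```

Calling `fetch_properties` with a compound name and a list of property names would perform the actual request. Automated clients should respect PubChem's usage limits.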
17 CAS SciFinder
- 33 million molecules and 60 million peptides/proteins
- Largest reaction DB (14 million reactions) and literature DB
- A must for chemists and biochemists/biologists
- No bulk download, no good import/export, no link-outs
- Only a proprietary Windows interface (no plugins)
- No text mining (requires ANAVIST)
Download SciFinder
18 BRENDA - Comprehensive Enzyme Information System
19 BRENDA 3D model output with Jmol
Example: BRENDA connection to the RCSB Protein Data Bank
Visit BRENDA
20 KEGG Pathway DB
- KEGG ID: C00002 (ATP)
- KEGG pathway map ID: map00195 (Photosynthesis)
- KEGG reaction ID: R05668 (ATP NAD reaction)
Visit KEGG
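The three identifier types on this slide follow recognizable patterns, so a regular expression can tell them apart when parsing search results. The patterns below are inferred only from the three examples shown (a prefix followed by five digits) and are an assumption. KEGG uses further identifier types that this sketch does not cover.

```python
import re

# Patterns inferred from the slide's examples: prefix plus five digits.
KEGG_PATTERNS = {
    "compound": re.compile(r"C\d{5}"),
    "reaction": re.compile(r"R\d{5}"),
    "pathway map": re.compile(r"map\d{5}"),
}

def classify_kegg_id(identifier):
    """Return the KEGG entry type for an identifier, or None if it matches no pattern."""
    for kind, pattern in KEGG_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return None

for kegg_id in ["C00002", "map00195", "R05668", "ATP"]:
    print(kegg_id, "->", classify_kegg_id(kegg_id))
```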
21 Reactome: curated pathway maps
Example: Skypainter, map your given KEGG IDs to pathways
Visit Reactome
22 Outlook for the database lesson
- Curation, curation, curation (costs money)
- Inhale the good DB and bad DB scheme and apply it when you enter a DB portal
- Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and chemoinformatics approaches is crucial for research
- Learn how to import, store and handle database search results on your local computer (simply parse important data with regular expressions)
- Don't be overwhelmed by the database jungle; take some time to play around. Finally, automation and clever use of DB tools will innovate your research
- The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist
- The systems biology and chemistry database worlds are still different in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable, hence can't be automatically harvested by software robots.
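As a small illustration of parsing with regular expressions, the sketch below pulls CAS numbers out of free text and checks each candidate with the CAS check digit rule: the last digit equals the weighted sum of the other digits, counted from the right with weights 1, 2, 3 and so on, modulo 10. The function names and the sample sentence are assumptions for illustration.

```python
import re

# CAS number layout: 2 to 7 digits, 2 digits, 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def cas_is_valid(cas):
    """Check the layout and the check digit of a CAS number string."""
    match = CAS_PATTERN.fullmatch(cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

def find_cas_numbers(text):
    """Return all substrings of text that look like CAS numbers and pass the check."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text) if cas_is_valid(m.group(0))]

text = "ATP (CAS 56-65-5) and water (7732-18-5); order no. 12-34-5 is not a CAS number."
found = find_cas_numbers(text)
print(found)
```

The check digit only catches typing errors. It does not tell you whether a CAS number, a KEGG ID and a PubChem ID refer to the same molecule, which is the harder part of the identifier problem.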
23 Reading list: databases
- The Gaggle: an open-source software system for integrating bioinformatics software and data sources
- Correcting ligands, metabolites, and pathways
- Large-scale annotation of small-molecule libraries using public databases
24 Homework for homework discussion III (30 min)
- Find three bad or evil databases in the biochemistry/chemistry world; please give a reason in a short sentence.
- Find the year in which most papers about enzyme kinetics were published using SciFinder (use Explore, enter the search term, then Analyze year)
- Find the molecules which were analyzed most in papers regarding "enzyme kinetics" and "crickets" using SciFinder (use Explore, then Analyze CAS Number)
- Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use locate substance, then use the Erlenmeyer icon for price info)
- Go to BRENDA and find out how many coronavirus types are in the DB (use TaxExplorer and query)
- Go to BRENDA and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to BRENDA, Advanced search)
- Go to the KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP
- Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, press compute)
- Go to PubChem: what is the PubChem compound ID (CID) and the topological polar surface area for Tobias acid?
25 Pathways and enzymes: http://www.biocarta.com/pathfiles/h_etcPathway.asp
SQL learning: http://sqlzoo.net/
Databases: http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search
SQL for biologists: "I'm a biologist Jim, not a programmer"
SQL for biologists: SciView part 5, interview with Alexei Drummond
Thank you! Thanks to all Wikimedia.org contributors for pictures! Thanks to Dinesh Kumar (FiehnLab) for discussions.