Guest Lecture - PowerPoint PPT Presentation

Title: Guest Lecture


1
Bio-Chemical databases
Guest Lecture, graduate-level course MCB221b - Mechanistic Enzymology
Tobias Kind, November 2007
  • Database concepts: what is a "good" database (DB)?
  • How is data stored, queried and curated?
  • Enzyme DBs, protein and peptide DBs, small molecule DBs

This document is hyperlinked (pictures and green
text). To use WWW links in this PPT switch to
slide show mode.
2
Databases: a very short primer (*)
  • Database interface: what you see
  • Database queries: what you ask the database
  • Database objects: where the data is stored (index and tables)
  • Database types: relational databases, object-oriented databases, flat-file DBs
  • Database brands: Oracle, MySQL, Apache, IBM DB2, PostgreSQL, MS SQL
  • Database query language: how a database can be programmed (SQL)
  • Database dump file: the whole database in a single (.dmp) file
  • Database ontology: database vocabulary and the relationships used
  • Database semantics: capture meaning by grammar or logical analysis

(*) You can study this for several years and get a PhD in computer and database sciences.
3
What is a good database? As in normal life, it's important to distinguish between good and evil.
Good DB:
  • allows multiple input queries
  • exports in multiple output formats
  • connects to other DBs
  • is curated (meaning checked for errors by humans or machines)
  • is regularly updated (daily, yearly)
  • costs money (your money or taxpayers' money) or time
  • allows bulk download (millions of data sets can be downloaded)
  • has open interfaces (APIs) for query requests

Source: wikimedia.org
Bad DB:
  • allows only single requests (which have to be typed manually)
  • is not a database but just a list or table
  • has no link-out and no link-in
  • allows no bulk download
  • is not curated

Source: wikimedia.org
4
Exchange formats: SBML, XML, BioPAX
  • XML format: general-purpose data format (CML for storing chemical data)
  • BioPAX format: used for representing pathway data (data exchange format)
  • SBML format: represents models of biochemical reaction networks
  • SDF format: general-purpose chemical structure format (small molecules)
  • RDF format: format for storing chemical reactions (small molecules)
  • PDB format: general-purpose chemical structure format (proteins)

CML example (methane):
    <?xml version="1.0"?>
    <molecule id="m1">
      <atomArray>
        <atom id="a1" elementType="C"
              x2="-3.0333333015441895" y2="2.9166667461395264"/>
      </atomArray>
      <bondArray>
      </bondArray>
    </molecule>
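A minimal sketch of how the CML fragment on this slide could be read by a program, assuming the angle brackets and quotes lost in the transcript are restored. It uses only Python's standard library; the function name `read_atoms` is chosen here for illustration and is not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# The methane CML fragment from the slide, with markup restored.
CML = """<?xml version="1.0"?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C"
          x2="-3.0333333015441895" y2="2.9166667461395264"/>
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""


def read_atoms(cml_text):
    """Return the molecule id and a list of (atom id, element, x, y) tuples."""
    root = ET.fromstring(cml_text)
    atoms = []
    for atom in root.iter("atom"):
        atoms.append((
            atom.get("id"),
            atom.get("elementType"),
            float(atom.get("x2")),  # 2D x coordinate
            float(atom.get("y2")),  # 2D y coordinate
        ))
    return root.get("id"), atoms


molecule_id, atoms = read_atoms(CML)
print(molecule_id, atoms)
```

Because the format is plain XML, any XML parser can extract the data; no chemistry-specific software is needed for a first look.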
5
SBML (Systems Biology Markup Language)
Source: Akira Funahashi, CellDesigner tutorial
  • List of supported SBML programs (more than 200)
    from sbml.org
  • List of curated and published SBML models
    (around 200) from biomodels DB

6
APIs, Mashups, SQL
  • Application programming interfaces (APIs) are important to connect and automate data exchange between local programs and databases. Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to download certain data via the web to another service or to a local program
  • Mashups and integration services use new web technology (RDF, Yahoo Pipes) to combine data sources and create new knowledge or enhance usage
  • SQL is used for programming databases
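A rough sketch of what "downloading data via an API to a local program" can look like. It assumes PubChem's PUG REST interface, a URL-based service that is newer than the PUG/SOAP interfaces named on this slide. The helper names `property_url` and `fetch_properties` are made up for this example, and only the URL builder is called here so the sketch runs without network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem PUG REST (assumed here; not mentioned on the slide).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL asking for properties of a compound given by name."""
    return "{}/compound/name/{}/property/{}/JSON".format(
        PUG_REST, quote(name, safe=""), ",".join(properties)
    )


def fetch_properties(name):
    """Send the request (needs network access) and return the property records."""
    with urlopen(property_url(name), timeout=30) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]


url = property_url("adenosine triphosphate")
print(url)
```

Calling `fetch_properties("adenosine triphosphate")` on a machine with internet access would return the matching records as Python dictionaries, which a script can then store or pass to another tool.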

Large database table:
    yr    subject    winner
    1901  Chemistry  Jacobus H. van 't Hoff
    1902  Chemistry  Emil Fischer
    1903  Chemistry  Svante Arrhenius
    1904  Chemistry  Sir William Ramsay
    1905  Chemistry  Adolf von Baeyer
    1906  Chemistry  Henri Moissan
    1907  Chemistry  Eduard Buchner
    1908  Chemistry  Ernest Rutherford
    1909  Chemistry  Wilhelm Ostwald
    1910  Chemistry  Otto Wallach

SQL query:
    SELECT yr, subject, winner FROM nobel WHERE yr = 1909 AND subject = 'Chemistry'

Result:
    yr    subject    winner
    1909  Chemistry  Wilhelm Ostwald

Visit the SQL Zoo
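The query on this slide can be tried locally without installing a database server. The sketch below assumes Python's built-in sqlite3 module and an in-memory table named `nobel` filled with the ten rows shown above; it is one possible setup, not the one used by SQL Zoo.

```python
import sqlite3

# The ten table rows shown on the slide.
ROWS = [
    (1901, "Chemistry", "Jacobus H. van 't Hoff"),
    (1902, "Chemistry", "Emil Fischer"),
    (1903, "Chemistry", "Svante Arrhenius"),
    (1904, "Chemistry", "Sir William Ramsay"),
    (1905, "Chemistry", "Adolf von Baeyer"),
    (1906, "Chemistry", "Henri Moissan"),
    (1907, "Chemistry", "Eduard Buchner"),
    (1908, "Chemistry", "Ernest Rutherford"),
    (1909, "Chemistry", "Wilhelm Ostwald"),
    (1910, "Chemistry", "Otto Wallach"),
]

conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")
conn.executemany("INSERT INTO nobel VALUES (?, ?, ?)", ROWS)

# Same query as on the slide; values are passed as parameters.
result = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
    (1909, "Chemistry"),
).fetchall()
conn.close()

print(result)  # [(1909, 'Chemistry', 'Wilhelm Ostwald')]
```

Note that SQLite compares text case-sensitively by default, so the subject has to be written exactly as stored ('Chemistry').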
7
Database front-ends (a good one)
Enhanced NCI Database Browser Release 2 (CACTVS DB)
  • Small molecule DB with revolutionary web front-end (2001)
  • Multiple input and output (export) methods
  • Allows matching of molecule lists against the DB (as SMILES, CAS, NCI number)
  • Links to other services
  • Visualization modes (2D, 3D)
  • 20 different molecular output formats (SDF, CML, SMILES)
  • Export to other (computational) services
  • 30 different query modes

8
Database visualization
  • Visualize complex networks; uses plug-in technology from different sources
  • Map your own compound data (proteins, genes, molecules) onto networks
  • Perform literature searches with enzymes, genes, small molecules

Source: Cytoscape.org
Start Cytoscape via Java Web Start
9
Uber-portals (NCBI ENTREZ)
10
Database and tools integration: Gaggle
Source: Wikimedia
  • Frameworks
  • Portals
  • Mashups

Source: http://gaggle.systemsbiology.org/docs/geese/
11
Gaggle: integration of tools and database services
Source: Wikimedia
The Gaggle: an open-source software system for integrating bioinformatics software and data sources. Shannon PT, Reiss DJ, Bonneau R, Baliga NS. BMC Bioinformatics. 2006 Mar 28;7:176.
Use Gaggle
12
Use or build your own local database
Example: LipidMaps DB with Instant-JChem
  • Download the whole LipidMaps DB (10,000 lipids) as an SDF file
  • Use Instant-JChem as data DB, molecule DB, reaction DB
  • Perform data and molecule queries on your laptop (PC, Linux, Mac)

(also works with KEGG/Biometa DB)
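For a quick look at a downloaded SDF file without a dedicated tool, the data fields can be pulled out with a few lines of code. The sketch below assumes the usual SDF layout (records end with a "$$$$" line; data items start with a "> <FIELD>" header followed by value lines and a blank line). The two records and the field names DEMO_ID and DEMO_NAME are invented for illustration and are not real LipidMaps entries; the structure blocks are left empty.

```python
# Two made-up SDF records; only the data fields matter for this sketch.
SAMPLE = """demo-record-1
  hypothetical entry

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DEMO_ID>
X0001

> <DEMO_NAME>
example lipid

$$$$
demo-record-2
  hypothetical entry

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <DEMO_ID>
X0002

> <DEMO_NAME>
example sugar

$$$$
"""


def read_sdf_fields(text):
    """Return one dict of data fields per SDF record (structure block ignored)."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty remainder after the last record
        fields = {}
        name = None
        for line in block.splitlines():
            if line.startswith(">") and "<" in line:
                # Header line such as "> <DEMO_ID>": take the text in <...>.
                name = line[line.index("<") + 1:line.rindex(">")]
                fields[name] = []
            elif name is not None:
                if line.strip():
                    fields[name].append(line.strip())
                else:
                    name = None  # a blank line closes the data item
        records.append({key: "\n".join(value) for key, value in fields.items()})
    return records


records = read_sdf_fields(SAMPLE)
lipids = [r["DEMO_ID"] for r in records if "lipid" in r["DEMO_NAME"]]
print(lipids)
```

A tool such as Instant-JChem adds structure and substructure search on top of this; the sketch only covers simple text queries on the data fields.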
13
Welcome to the (database) jungle!
  • ChemBioGrid: collection of most chemistry databases, current number 156
  • Pathguide.org: collection of pathway, enzyme, metabolite DBs, current number 231
  • Chemistry related (big players): PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing)
  • Important for chemistry/metabolomics: spectral databases (NMR, mass spectral databases), compound property DBs
  • Pathway, enzyme related: KEGG, BRENDA, Reactome, ExPASy, MetaCyc
14
Pathguide.org
Pathguide is a meta-database: a comprehensive collection of pathway, small molecule, enzyme and protein interaction databases
15
Enzyme and kinetics related databases
  • KDBI - Kinetic Data of Bio-molecular Interactions database: http://bidd.nus.edu.sg/group/kdbi/
  • SABIO-RK - SABIO-Reaction Kinetics Database: http://sabio.villa-bosch.de/SABIORK/
  • BRENDA - Comprehensive Enzyme Information System: http://www.brenda.uni-koeln.de/
  • EMP - Enzymes and Metabolic Pathways Database: http://www.empproject.com/
  • ENZYME - Enzyme nomenclature database (ExPASy): http://www.expasy.ch/enzyme/
  • IntEnz - Integrated relational Enzyme database: http://www.ebi.ac.uk/intenz/index.html
  • TECR - Thermodynamics of Enzyme-Catalyzed Reactions: http://xpdb.nist.gov/enzyme_thermodynamics/
  • REBASE - Restriction Enzyme Database: http://rebase.neb.com/
  • Precise - Predicted and Consensus Interaction Sites in Enzymes: http://precise.bu.edu/

Source: Pathguide, own search
16
PubChem
  • Most important small molecule DB
  • There was no large open chemistry DB until 10 years ago (!)
  • All records can be downloaded via FTP
  • All other small molecule DBs link to PubChem
  • PubChem Compounds (true chemicals)
  • PubChem Substances (formulations, mixtures)
  • Substructure search and multiple other options

Go to PubChem
17
CAS SciFinder
  • 33 million molecules and 60 million peptides/proteins
  • Largest reaction DB (14 million reactions) and literature DB
  • A must for chemists and biochemists/biologists
  • No bulk download, no good import/export, no link-outs
  • Only a proprietary Windows interface (no plug-ins)
  • No text mining (requires AnaVist)

Download SciFinder
18
BRENDA - Comprehensive Enzyme Information System
19
BRENDA 3D model output with Jmol
Example: BRENDA connection to the RCSB Protein Data Bank
Visit BRENDA
20
KEGG Pathway DB
  • KEGG ID: C00002 (ATP)
  • KEGG pathway map ID: map00195 (Photosynthesis)
  • KEGG reaction ID: R05668 (ATP NAD reaction)

Visit KEGG
21
Reactome: curated pathway maps
Example: SkyPainter, map your given KEGG IDs to pathways
Visit Reactome
22
Outlook for the database lesson
  • Curation, curation, curation (costs money)
  • Inhale the good DB and bad DB scheme and apply it when you enter a DB portal
  • Learn some basic database programming (Ruby on Rails, Java, SQL); using bioinformatics and chemoinformatics approaches is crucial for research
  • Learn how to import, store and handle database search results on your local computer (simply parse important data with regular expressions)
  • Don't be overwhelmed by the database jungle; take some time to play around. Finally, automation and clever use of DB tools will innovate your research
  • The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number) and the biological naming problem still exist
  • The systems biology and chemistry database worlds still differ in terms of re-use. Most of the chemistry data published (including molecules) is not machine readable, hence can't be automatically harvested by software robots.
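As a small illustration of parsing search results with regular expressions, the sketch below pulls KEGG identifiers out of free text. The patterns are an assumption based only on the three IDs shown in the KEGG slide (C, R or "map" followed by five digits); other KEGG ID types are not covered, and the function name `find_kegg_ids` is made up here.

```python
import re

# Assumed ID shapes, taken from the examples C00002, R05668 and map00195.
KEGG_ID = re.compile(r"\b(?:C\d{5}|R\d{5}|map\d{5})\b")


def find_kegg_ids(text):
    """Sort every KEGG-style identifier found in text into its category."""
    found = {"compound": [], "reaction": [], "pathway": []}
    for match in KEGG_ID.finditer(text):
        ident = match.group()
        if ident.startswith("C"):
            found["compound"].append(ident)
        elif ident.startswith("R"):
            found["reaction"].append(ident)
        else:
            found["pathway"].append(ident)
    return found


text = ("KEGG ID C00002 (ATP) KEGG pathway map ID map00195 (Photosynthesis) "
        "KEGG reaction ID R05668 (ATP NAD reaction)")
ids = find_kegg_ids(text)
print(ids)
```

The same approach works for other identifier types once their pattern is known, which is one practical way to deal with the multiple-identifier problem in downloaded result files.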

23
Reading list: databases
  • The Gaggle: An open-source software system for integrating bioinformatics software and data sources
  • Correcting ligands, metabolites, and pathways
  • Large-Scale Annotation of Small-Molecule Libraries Using Public Databases
24
Homework for homework discussion III (30 min)
  • Find three bad or evil databases in the biochemistry/chemistry world; please give a reason in a short sentence.
  • Find the year in which most papers about enzyme kinetics were published using SciFinder (use Explore, enter search term, then Analyze year)
  • Find the molecules which were analyzed most in papers regarding "enzyme kinetics" and "crickets" using SciFinder (use Explore, then Analyze CAS Number)
  • Find the price for 1 g ATP from Pfaltz & Bauer (in SciFinder use locate substance, then use the Erlenmeyer icon for price info)
  • Go to BRENDA and find out how many coronavirus types are in the DB (use TaxExplorer and query)
  • Go to BRENDA and find out how many enzymes are listed as resistant against perchloric acid; report the publication title (go to BRENDA, Advanced search)
  • Go to KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP
  • Go to KEGG Reaction Prediction (e-zyme): how many similar reactions occur between D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, press compute)
  • Go to PubChem: what is the PubChem compound ID (CID) and the topological surface area for Tobias acid?

Source: Wikimedia
25
  • Pathways and enzymes: http://www.biocarta.com/pathfiles/h_etcPathway.asp
  • SQL learning: http://sqlzoo.net/
  • Databases: http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search
  • SQL biologists: I'm a biologist Jim, not a programmer
  • SQL biologists: SciView part 5 interview with Alexei Drummond

Thank you!
Thanks to all Wikimedia.org contributors for pictures! Thanks to Dinesh Kumar (FiehnLab) for discussions.