Title: Bioinformatic Databases
1Bioinformatic Databases
- Microbiology 343
- David Wishart
- david.wishart_at_ualberta.ca
2Objectives Outline
- The biological data explosion: appreciating the size and scope
- Different types of database: knowing which database is right for the given job
- The web is your friend (on-line DBs)
- EcoCyc, PDB, Bind, BacMap, etc.
- Using PubMed
3The U of A Library
- 10.02 million total holdings
- 4.81 million books and journals
- 3.69 million microfilms
- 1.39 million maps
- 39,141 journal subs
- 8,000 PhD dissertations
- 12 library buildings
- 2 Terabytes of data
4The Library of Congress
- 120 million items in storage
- 54 million manuscripts
- 18 million books
- 12 million photographs
- 4.5 million maps
- 4.4 million technical reports
- 1.1 million PhD dissertations
- 20 Terabytes of data
5Some Numbers
- 3 scientific journals in 1750
- 120,000 scientific journals today
- 500,000 medical articles/year
- 4,000,000 scientific articles/year
- 14,000,000 abstracts in PubMed derived from 4600 journals
- 10 billion web pages on Google
- 800,000,000,000,000 bytes on the Web
6Some Numbers
- A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer.
- Baasiri, R.A., Glasser, S.R., Steffen, D.L., Wheeler, D.A. Oncogene 18, 7958-7965 (1999)
7Some Graphs
8A Tidal Wave of Data
9Different Databases in Bioinformatics
- Sequence Databases
- SNP/mutation DBs
- Structure Databases
- Expression Databases
- Spectral Databases
- Metabolism Databases
- Drug Databases
- Cell/Strain Databases
- Organism Databases
- Interaction Databases
- Function/Ontology DBs
- Bibliographic DBs
- Disease Databases
- Databases of Databases
10Why So Many Databases?
- To collect and preserve valuable data
- To make data accessible and easily searched
- To standardize data representation or data formats
- To organize data into knowledge
11Different Types of Databases
- Public (web-accessible, downloadable)
- Local or Private (restricted to registered users or not web-accessible)
- Archival (anything goes)
- Curated (managed data submission)
- Specialty databases (special interest)
- General databases (wide interest)
12Where To Go For More Info?
Nucleic Acids Research Database Issue (every Jan.)
http://nar.oupjournals.org/
13Sequence Databases
- GenBank
- www.ncbi.nlm.nih.gov/
- EMBL/trEMBL/UniProt
- www.ebi.ac.uk/trembl/
- DDBJ
- www.nig.ac.jp/
- PIR
- http://pir.georgetown.edu/
- SwissProt
- www.expasy.ch/sprot/
14Sequence Databases
- Some specialize in DNA sequence data (GenBank, DDBJ)
- Some specialize in protein sequence data (PIR, Swiss-Prot, UniProt)
- Some are specific for organisms or classes of organisms (yeast, fruit flies, human, certain bacteria)
15Sequence Annotation
- Most sequence databases include more information than just the raw sequence data
- This additional information is called annotation, and it may be done either automatically or manually
- Annotations include information about gene/protein name, length, position, references, corrections, etc.
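One way to picture annotation is as structured fields stored alongside the raw sequence. The sketch below is a hypothetical record layout, not the schema of GenBank or any other real database; the class name, field names and example values are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnnotatedSequence:
    """Hypothetical record: raw sequence plus annotation fields."""
    sequence: str                                          # the raw sequence data
    name: str = ""                                         # gene/protein name
    position: Tuple[int, int] = (0, 0)                     # start and end coordinates
    references: List[str] = field(default_factory=list)    # literature citations
    corrections: List[str] = field(default_factory=list)   # notes on revisions
    curated: bool = False                                  # True if annotated manually

    @property
    def length(self) -> int:
        # Length is derived from the sequence rather than stored separately.
        return len(self.sequence)


# Example values are placeholders, not a real database entry.
record = AnnotatedSequence(
    sequence="ATGGCTAAATAA",
    name="exampleGene",
    position=(101, 112),
    references=["Author A. et al. (placeholder citation)"],
)
print(record.name, record.length, record.position)
```

A sparsely annotated archive would fill in few of these fields, while a heavily curated database would fill in most of them and set `curated` to `True`.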
16Different Levels of Annotation
- Sparse: typical of most archival DNA sequence databases (GenBank, DDBJ)
- Moderate: typical of more curated databases or protein-specific databases (PIR, trEMBL)
- Detailed: typical of organism-specific databases or databases with a very high level of curation (Swiss-Prot, EcoCyc, BacMap)
17Different Levels of Database Annotation
- GenBank (large number of sequences, minimal annotation)
- PIR (large number of sequences, slightly better annotation)
- SwissProt (small number of sequences, even better annotation)
- Organism-specific DB (very small number of sequences, best annotation)
18GenBank Annotation
19Swiss-Prot Annotation
20Organism-specific Databases
- Flybase
- http://flybase.harvard.edu/
- ENSEMBL (human)
- http://www.ensembl.org/
- CYGD (yeast)
- http://www.mips.biochem.mpg.de/proj/yeast/
- BacMap (all bacteria)
- http://wishart.biology.ualberta.ca/BacMap/
- EcoCyc (E. coli)
- http://ecocyc.org
21The EcoCyc Database
22The EcoCyc Database
- EcoCyc is a scientific database for the bacterium E. coli K12 MG1655
- The EcoCyc project assembles its data via literature-based curation of the entire genome, along with information on transcriptional regulation, transporters, and metabolic pathways
- Actively developed since 1999
23The EcoCyc Database
- Users may query the database for E. coli sequences, pathways, genes, reactions, compounds, metabolic charts and gene expression data
- Supports sequence searches and interactive visualization of key genomic and proteomic features
24The EcoCyc Database
- EcoCyc supports the visualization of:
- gene layouts within the E. coli chromosome
- individual biochemical reactions
- a complete biochemical pathway (with compound structures)
- Users may display an enzyme, the reaction that the enzyme catalyzes, or the gene that encodes the enzyme
25EcoCyc Statistics
26Structure Databases
- RCSB-PDB
- http://www.rcsb.org/pdb/
- MSD
- http://www.ebi.ac.uk/msd/index.html
- CATH
- www.biochem.ucl.ac.uk/bsm/cath/
- SCOP
- www.scop.mrc-lmb.cam.ac.uk/scop/
27Sequences vs. Structures
(Chart comparing the number of Sequences and Structures; vertical axis from 0 to 200000)
28Protein DataBank (PDB)
http://www.rcsb.org/pdb/Welcome.do
29Looking at E. coli Trx (2trx)
30Homework
- Explore the PDB, click on some of the links, try the visualization tools
- Use the advanced search mode to find out how many proteins with between 200 and 350 residues have been solved by NMR
- Find out how many structures are similar (< 3 Å RMSD) to 2trx
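For reference, the RMSD in the last item measures the typical distance between matched atoms of two structures. A minimal sketch, assuming the two coordinate lists are already superimposed and paired atom-for-atom (real structure comparison tools also search for the best superposition first); the function name and the coordinates are invented for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-square deviation between two equal-length lists of 3D
    points, assumed to be already superimposed and matched in order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Made-up coordinates in angstroms, not taken from any PDB entry.
a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(a, b))  # every atom is shifted by exactly 1.0, so the RMSD is 1.0
```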
31Metabolism Databases
- KEGG
- http://www.genome.ad.jp/kegg/metabolism.html
- Roche/Boehringer
- http://www.expasy.org/cgi-bin/search-biochem-index
- EcoCyc
- http://ecocyc.org/
- WIT
- http://wit.mcs.anl.gov/WIT2/
32Interaction Databases
- BIND
- http://www.bind.ca/
- DIP
- http://dip.doe-mbi.ucla.edu/
- PIM
- http://www.hybrigenics.fr/
- PathCalling
- http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast
33Bibliographic Databases
- PubMed Medline
- http://www.ncbi.nlm.nih.gov/PubMed/
- Science Citation Index
- http://isi4.isiknowledge.com/portal.cgi
- Your Local eLibrary
- www.XXXX.ca
- Current Contents
- http://www.isinet.com/isi/
34Bibliographic Databases
- Research paper from author X
- Sequence from gene X in organism Y
- All information about organelle W in model organism Y
- All information about disease X caused by microbe Z
- Orthologs of disease gene Y in other model organisms
35Where to go? PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
36PubMed
- Allows users to search by journal, key words, titles, etc.
- Uses MeSH (Medical Subject Headings) to allow automated search of synonyms (e.g. renal transplant = kidney transplantation)
- API available to query PubMed automatically and remotely
- Few users know how to use PubMed properly or to its full extent
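The remote API for PubMed is, at present, NCBI's E-utilities. The sketch below builds an ESearch request with the Python standard library only; the helper names are illustrative rather than part of any official client, and NCBI's usage guidelines (request rate limits, supplying a tool name and e-mail address) should be checked before sending queries in bulk.

```python
import json
import urllib.parse
import urllib.request

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Return an ESearch URL asking PubMed for record IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def search_pubmed(term: str, retmax: int = 20) -> list:
    """Fetch the list of PubMed IDs for `term` (requires network access)."""
    with urllib.request.urlopen(build_esearch_url(term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Only the URL is printed here, so nothing is sent over the network.
print(build_esearch_url("ouellette bf[au] AND yeast"))
```

Calling `search_pubmed("ouellette bf[au] AND yeast")` would send the same author-plus-keyword query that the following slides enter by hand.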
37ouellette bf [au] AND yeast
Details
39MeSH: Medical Subject Heading
("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR
"saccharomyces cerevisiae"[MeSH Terms]) OR
yeast[Text Word]))
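The expanded query above can be read as the user's keyword OR-ed with its MeSH synonyms, then AND-ed with the author restriction. The sketch below mimics that expansion with a tiny hand-written synonym table; the table and function names are hypothetical, and PubMed's real term mapping is far more extensive.

```python
# Hypothetical synonym table standing in for PubMed's MeSH mapping.
MESH_SYNONYMS = {
    "yeast": ["yeasts", "saccharomyces cerevisiae"],
}


def expand_term(word: str) -> str:
    """OR together the MeSH synonyms of `word` and the word as free text."""
    mesh_terms = MESH_SYNONYMS.get(word.lower(), [])
    text_part = f"{word}[Text Word]"
    if not mesh_terms:
        # No known synonyms: search the word as plain text only.
        return text_part
    mesh_part = " OR ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)
    return f"(({mesh_part}) OR {text_part})"


def build_query(author: str, word: str) -> str:
    """Combine an author restriction with an expanded keyword."""
    return f'("{author}"[au] AND {expand_term(word)})'


print(build_query("ouellette bf", "yeast"))
```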
40Integrated Text/Sequence Searching with Entrez
Wishart DS [au] AND coli
41Results
Click Here!
43PubMed Icons
44Results
Click Here!
46How Do You Search Sequence Databases?
- Most kinds of non-biological databases are designed to be searched using exact matches of dates, keywords or numbers
- Sequence databases are different in that we only want to, or only can, query them with approximate or inexact matches; this is a challenging problem
- Next Lecture: Sequence Searching
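To make the contrast concrete, the sketch below compares an exact substring lookup with a naive inexact search that slides the query along each database sequence and tolerates a fixed number of mismatches. It is only an illustration: the toy sequences and function names are made up, and real sequence search tools use far more sophisticated alignment methods that also handle insertions and deletions.

```python
from typing import Dict, List, Tuple


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def exact_search(query: str, database: Dict[str, str]) -> List[str]:
    """Return names of entries that contain the query exactly."""
    return [name for name, seq in database.items() if query in seq]


def inexact_search(query: str, database: Dict[str, str],
                   max_mismatches: int) -> List[Tuple[str, int, int]]:
    """Return (name, offset, mismatch count) for every window of each
    sequence that differs from the query at no more than max_mismatches
    positions. Gaps are not considered."""
    hits = []
    k = len(query)
    for name, seq in database.items():
        for offset in range(len(seq) - k + 1):
            d = mismatches(query, seq[offset:offset + k])
            if d <= max_mismatches:
                hits.append((name, offset, d))
    return hits


# Toy database; the sequences are invented for illustration.
toy_db = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",
    "seqC": "GGSGGSGGSG",
}

print(exact_search("AYIAK", toy_db))                      # only seqA matches exactly
print(inexact_search("AYIAK", toy_db, max_mismatches=1))  # seqA and seqB both match
```

The exact search misses seqB even though it differs from seqA at a single position, which is the kind of near match a sequence search needs to report.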
47Sample Exam Question
- Which database(s) would you use to find out about E. coli's pyruvate synthase:
- A) structure
- B) reactions
- C) interacting partners
- D) enzymology
- E) sequence variations