Bioinformatic Databases - PowerPoint PPT Presentation

Provided by: Comp632

Transcript and Presenter's Notes

Title: Bioinformatic Databases


1
Bioinformatic Databases
  • Microbiology 343
  • David Wishart
  • david.wishart_at_ualberta.ca

2
Objectives / Outline
  • The biological data explosion: appreciating
    the size and scope
  • Different types of databases: knowing which
    database is right for the given job
  • The web is your friend (on-line DBs)
  • EcoCyc, PDB, Bind, BacMap, etc.
  • Using PubMed

3
The U of A Library
  • 10.02 million total holdings
  • 4.81 million books and journals
  • 3.69 million microfilms
  • 1.39 million maps
  • 39,141 journal subs
  • 8,000 PhD dissertations
  • 12 library buildings
  • 2 Terabytes of data

4
The Library of Congress
  • 120 million items in storage
  • 54 million manuscripts
  • 18 million books
  • 12 million photographs
  • 4.5 million maps
  • 4.4 million technical reports
  • 1.1 million PhD dissertations
  • 20 Terabytes of data

5
Some Numbers
  • 3 scientific journals in 1750
  • 120,000 scientific journals today
  • 500,000 medical articles/year
  • 4,000,000 scientific articles/year
  • 14,000,000 abstracts in PubMed derived from 4600
    journals
  • 10 billion web pages on Google
  • 800,000,000,000,000 bytes on the Web

6
Some Numbers
  • A researcher would have to scan 130 different
    journals and read 27 papers per day to follow a
    single disease, such as breast cancer.
  • Baasiri, R.A., Glasser, S.R., Steffen, D.L. &
    Wheeler, D.A. Oncogene 18, 7958-7965 (1999)

7
Some Graphs
8
A Tidal Wave of Data
9
Different Databases in Bioinformatics
  • Sequence Databases
  • SNP/mutation DBs
  • Structure Databases
  • Expression Databases
  • Spectral Databases
  • Metabolism Databases
  • Drug Databases
  • Cell/Strain Databases
  • Organism Databases
  • Interaction Databases
  • Function/Ontology DBs
  • Bibliographic DBs
  • Disease Databases
  • Databases of Databases

10
Why So Many Databases?
  • To collect and preserve valuable data
  • To make data accessible and easily searched
  • To standardize data representation or data
    formats
  • To organize data into knowledge

11
Different Types of Databases
  • Public (web-accessible, downloadable)
  • Local or Private (restricted to registered users
    or not web-accessible)
  • Archival (anything goes)
  • Curated (managed data submission)
  • Specialty databases (special interest)
  • General databases (wide interest)

12
Where To Go For More Info?
Nucleic Acids Research Database Issue (every
Jan.)
http://nar.oupjournals.org/
13
Sequence Databases
  • GenBank
  • www.ncbi.nlm.nih.gov/
  • EMBL/trEMBL/UniProt
  • www.ebi.ac.uk/trembl/
  • DDBJ
  • www.nig.ac.jp/
  • PIR
  • http://pir.georgetown.edu/
  • SwissProt
  • www.expasy.ch/sprot/

14
Sequence Databases
  • Some specialize in DNA sequence data (GenBank,
    DDBJ)
  • Some specialize in protein sequence data (PIR,
    Swiss-Prot, UniProt)
  • Some are specific for organisms or classes of
    organisms (yeast, fruitflies, human, certain
    bacteria)

15
Sequence Annotation
  • Most sequence databases include more
    information than just the raw sequence data
  • This additional information is called
    annotation, and it may be done either
    automatically or manually
  • Annotations include information about
    gene/protein name, length, position, references,
    corrections, etc.
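
One way to picture annotation is as extra fields stored next to the raw sequence. The sketch below is only an illustration: the class name, field names and the accession "DEMO0001" are hypothetical and do not follow the record format of GenBank, Swiss-Prot or any other database.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SequenceRecord:
    """Raw sequence plus annotation. All field names are illustrative."""
    accession: str                                # hypothetical identifier
    sequence: str                                 # the raw sequence data
    name: str = ""                                # gene/protein name
    position: Optional[Tuple[int, int]] = None    # (start, end) on the genome
    references: List[str] = field(default_factory=list)
    corrections: List[str] = field(default_factory=list)

    @property
    def length(self) -> int:
        # Length can be derived from the sequence rather than stored.
        return len(self.sequence)


# A sparsely annotated record: only the name has been filled in.
record = SequenceRecord(accession="DEMO0001", sequence="ATGGCCAAG", name="demo gene")
print(record.name, record.length)
```

A more heavily curated database would fill in more of these fields (and many others) for each entry.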

16
Different Levels of Annotation
  • Sparse: typical of most archival DNA sequence
    databases (GenBank, DDBJ)
  • Moderate: typical of more curated databases or
    protein-specific databases (PIR, trEMBL)
  • Detailed: typical of organism-specific databases
    or databases with a very high level of curation
    (Swiss-Prot, EcoCyc, BacMap)

17
Different Levels of Database Annotation
  • GenBank (large number of sequences, minimal
    annotation)
  • PIR (large number of sequences, slightly better
    annotation)
  • SwissProt (small number of sequences, even better
    annotation)
  • Organism-specific DB (very small number of
    sequences, best annotation)

18
GenBank Annotation
19
Swiss-Prot Annotation
20
Organism-specific Databases
  • Flybase
  • http://flybase.harvard.edu/
  • ENSEMBL (human)
  • http://www.ensembl.org/
  • CYGD (yeast)
  • http://www.mips.biochem.mpg.de/proj/yeast/
  • BacMap (all bacteria)
  • http://wishart.biology.ualberta.ca/BacMap/
  • EcoCyc (E. coli)
  • http://ecocyc.org

21
The EcoCyc Database
22
The EcoCyc Database
  • EcoCyc is a scientific database for the bacterium
    E. coli K12 MG1655
  • The EcoCyc project assembles its data via
    literature-based curation of the entire genome
    along with information on transcriptional
    regulation, transporters, and metabolic pathways
  • Actively developed since 1999

23
The EcoCyc Database
  • Users may query the database for E. coli
    sequences, pathways, genes, reactions, compounds,
    metabolic charts and gene expression data
  • Supports sequence searches and interactive
    visualization of key genomic and proteomic
    features

24
The EcoCyc Database
  • EcoCyc supports the visualization of:
    • gene layouts within the E. coli chromosome
    • individual biochemical reactions
    • a complete biochemical pathway (with compound
      structures)
  • Users may display an enzyme, the reaction that
    the enzyme catalyzes, or the gene that encodes
    the enzyme

25
EcoCyc Statistics
26
Structure Databases
  • RCSB-PDB
  • http://www.rcsb.org/pdb/
  • MSD
  • http://www.ebi.ac.uk/msd/index.html
  • CATH
  • www.biochem.ucl.ac.uk/bsm/cath/
  • SCOP
  • www.scop.mrc-lmb.cam.ac.uk/scop/

27
Sequences vs. Structures
[Chart: number of sequences vs. number of structures; vertical axis from 0 to 200,000]
28
Protein DataBank (PDB)
http://www.rcsb.org/pdb/Welcome.do
29
Looking at E. coli Trx (2trx)
30
Homework
  • Explore the PDB, click on some of the links, try
    the visualization tools
  • Use the advanced search mode to find out how many
    proteins have been solved by NMR that have
    between 200 and 350 residues
  • Find out how many structures are similar (< 3 Å
    RMSD) to 2trx
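
For reference, RMSD (root-mean-square deviation) is the square root of the average squared distance between matched atoms in two structures. The sketch below is a minimal illustration, assuming the two coordinate lists are already superimposed and paired atom by atom. The function name and toy coordinates are made up, and the PDB's own similarity tools perform an alignment step that is not shown here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom is displaced by 3 units along z.
structure_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_2 = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(rmsd(structure_1, structure_2))
```

With coordinates in Ångströms, a result under 3 would meet the similarity cutoff used in the homework question.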

31
Metabolism Databases
  • KEGG
  • http://www.genome.ad.jp/kegg/metabolism.html
  • Roche/Boehringer
  • http://www.expasy.org/cgi-bin/search-biochem-index
  • EcoCyc
  • http://ecocyc.org/
  • WIT
  • http://wit.mcs.anl.gov/WIT2/

32
Interaction Databases
  • BIND
  • http://www.bind.ca/
  • DIP
  • http://dip.doe-mbi.ucla.edu/
  • PIM
  • http://www.hybrigenics.fr/
  • PathCalling
  • http://portal.curagen.com/extpc/com.curagen.portal.servlet.Yeast

33
Bibliographic Databases
  • PubMed/Medline
  • http://www.ncbi.nlm.nih.gov/PubMed/
  • Science Citation Index
  • http://isi4.isiknowledge.com/portal.cgi
  • Your Local eLibrary
  • www.XXXX.ca
  • Current Contents
  • http://www.isinet.com/isi/

34
Bibliographic Databases
  • Research paper from author X
  • Sequence from gene X in organism Y
  • All information about organelle W in model
    organism Y
  • All information about disease X caused by microbe
    Z
  • Orthologs of disease gene Y in other model
    organisms

35
Where to go? PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
36
PubMed
  • Allows users to search by journal, key words,
    titles etc.
  • Uses MeSH (Medical Subject Headings) to allow
    automated search of synonyms (e.g., "renal
    transplant" also finds "kidney transplantation")
  • API available to query PubMed automatically and
    remotely
  • Few users know how to use PubMed properly or to
    its full extent
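
As a rough illustration of remote, automated querying, the sketch below builds a search URL for NCBI's E-utilities ESearch endpoint and pulls the PubMed IDs out of a JSON reply. The helper names build_esearch_url and parse_id_list are made up for this example, and the reply shown is a hand-written stand-in, not real search output. No network request is made.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL that searches PubMed for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def parse_id_list(payload):
    """Extract the list of PubMed IDs from a decoded ESearch JSON reply."""
    return payload["esearchresult"]["idlist"]


url = build_esearch_url("ouellette bf[au] AND yeast")
print(url)

# Hand-written stand-in for a decoded JSON reply (the IDs are placeholders).
fake_reply = {"esearchresult": {"idlist": ["111", "222"]}}
print(parse_id_list(fake_reply))
```

To run a live search, the URL could be fetched with urllib.request.urlopen and the body decoded with json.loads before calling parse_id_list.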

37
ouellette bf [au] AND yeast
Details
38
(No Transcript)
39
MeSH: Medical Subject Heading
("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR
"saccharomyces cerevisiae"[MeSH Terms]) OR
yeast[Text Word]))
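
PubMed field tags are written in square brackets after a term, and clauses are combined with AND and OR. The sketch below shows how such an expanded query could be assembled programmatically. The helpers tag, any_of and all_of are hypothetical names for this example and are not part of any PubMed library.

```python
def tag(term, field_name, quote=True):
    """Attach a PubMed field tag, e.g. "yeasts"[MeSH Terms]."""
    text = '"' + term + '"' if quote else term
    return text + "[" + field_name + "]"


def any_of(*clauses):
    """Join clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND inside parentheses."""
    return "(" + " AND ".join(clauses) + ")"


# Author search combined with MeSH synonyms and a free-text fallback.
query = all_of(
    tag("ouellette bf", "au"),
    any_of(
        any_of(tag("yeasts", "MeSH Terms"),
               tag("saccharomyces cerevisiae", "MeSH Terms")),
        tag("yeast", "Text Word", quote=False),
    ),
)
print(query)
```

Building the string from small pieces keeps the parentheses balanced, which is easy to get wrong when typing long Boolean queries by hand.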
40
Integrated Text/Sequence Searching with Entrez
Wishart DS [au] AND coli
41
Results
Click Here!
42
(No Transcript)
43
PubMed Icons
44
Results
Click Here!
45
(No Transcript)
46
How Do You Search Sequence Databases?
  • Most kinds of non-biological databases are
    designed to be searched using exact matches of
    dates, keywords or numbers
  • Sequence databases are different in that we only
    want to, or only can, query them with approximate
    or inexact matches; this is a challenging
    problem
  • Next Lecture: Sequence Searching
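
To make the contrast concrete, the toy sketch below compares an exact lookup with an approximate one based on edit distance (the number of single-letter insertions, deletions or substitutions needed to turn one string into another). The sequences and names are invented, and edit distance is used here only as a simple stand-in for inexact matching; it is not presented as the method used by real sequence search tools.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def closest_match(query, database):
    """Return (name, distance) of the database entry nearest to the query."""
    return min(((name, edit_distance(query, seq)) for name, seq in database.items()),
               key=lambda pair: pair[1])


# Made-up toy database and query.
toy_db = {"seq1": "GATTACA", "seq2": "GACTATA", "seq3": "TTTTTTT"}
query = "GATTAGA"

print(query in toy_db.values())      # exact match: nothing found
print(closest_match(query, toy_db))  # approximate match: nearest entry
```

An exact-match lookup returns nothing for this query, while the approximate search still reports the nearest entry and how far away it is.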

47
Sample Exam Question
  • Which database(s) would you use to find out about
    E. coli's pyruvate synthase:
  • A) structure
  • B) reactions
  • C) interacting partners
  • D) enzymology
  • E) sequence variations