Title: NCBI Highlights
1Lecture 3 NCBI Highlights and Text Search
Tools (Entrez, PubMed, OMIM)
3Haaretz Publication, June 2000
4The Central Paradigm of Bio-informatics
Molecular structure
Biochemical function
Genetic information
Symptoms
5http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
6- Haemophilus influenzae (2 Mb).
- First eukaryote genome: Saccharomyces cerevisiae (12 Mb).
- First multicellular eukaryote: Caenorhabditis elegans (100 Mb).
- A model organism for the animal kingdom: Drosophila melanogaster.
- A model organism for the plant kingdom: Arabidopsis thaliana.
7What is a Biological Database?
A biological database is a large, organized body
of data, usually associated with computer
software designed to update, query, and
retrieve stored components of the data.
Example: a nucleotide sequence database will
contain information such as a contact name, the
sequence and a description of the molecule, the
scientific name of the source organism, and
literature citations.
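The fields listed in the example above could be sketched as a simple record type. This is only an illustration: the class name `NucleotideRecord`, its field names, and the sample values are hypothetical and do not correspond to the schema of any real database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NucleotideRecord:
    """One entry in a hypothetical nucleotide sequence database."""
    contact_name: str   # submitter or contact person
    sequence: str       # the nucleotide sequence itself
    description: str    # free-text description of the molecule
    organism: str       # scientific name of the source organism
    citations: List[str] = field(default_factory=list)  # literature citations

    def length(self) -> int:
        """Number of bases in the stored sequence."""
        return len(self.sequence)


# All values below are made up for illustration.
record = NucleotideRecord(
    contact_name="A. Researcher",
    sequence="ATGGCGTACGTTAG",
    description="example coding fragment",
    organism="Saccharomyces cerevisiae",
    citations=["Example et al., unpublished"],
)
print(record.organism, record.length())
```

Software attached to a real database would add the update, query, and retrieval operations on top of records like this one.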
8Challenge of Data Retrieval
- Store large amounts of data.
- Update data on a regular basis.
- Fast retrieval of information.
- Extract as much updated information as possible.
- Cross multiple databanks.
- Many data sources in different locations.
- Use connected databases.
- Systematic classification of the data.
- Search multiple terms in one query.
- Use multiple search fields.
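The last two points (several terms in one query, several search fields) can be sketched as a small query builder. The square-bracket field tags follow the style used by Entrez text searches; the helper name `build_query` and the sample terms are assumptions made for illustration, and the exact field names accepted by a given database should be checked against its documentation.

```python
from typing import List, Tuple


def build_query(terms: List[Tuple[str, str]], operator: str = "AND") -> str:
    """Join (text, field) pairs into one fielded Boolean query string."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    parts = []
    for text, field_name in terms:
        # Quote multi-word phrases so they are searched as a unit.
        phrase = f'"{text}"' if " " in text else text
        # An empty field name means "search all fields".
        parts.append(f"{phrase}[{field_name}]" if field_name else phrase)
    return f" {operator} ".join(parts)


# Hypothetical search: a title phrase combined with an author name.
query = build_query([("cystic fibrosis", "Title"), ("Smith", "Author")])
print(query)
```

The function only builds the query text; sending it to a search service is left out on purpose.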
9International collaboration by NCBI, DDBJ, EMBL
11NCBI
ENTREZ - PubMed
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/Sitemap/index.html
12At NCBI, many of the databases are linked
through a unique search and retrieval system
called Entrez. Entrez allows a user not only to
access and retrieve specific information from
a single database, but also to access integrated
information from many NCBI databases. Example: the
protein database is cross-linked to the
taxonomy database.
http://www.ncbi.nlm.nih.gov/Entrez/
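The protein-to-taxonomy cross-link can also be requested programmatically. The sketch below only builds a request URL for NCBI's E-utilities ELink service and does not contact the server. E-utilities is not covered in this lecture, so treat the base address and parameter names (`dbfrom`, `db`, `id`) as assumptions to verify against the current NCBI documentation; the helper `cross_link_url` and the identifier are hypothetical.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities; stated here as an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def cross_link_url(source_db: str, target_db: str, record_id: str) -> str:
    """Build (but do not send) an ELink request URL between two databases."""
    params = {"dbfrom": source_db, "db": target_db, "id": record_id}
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params)}"


# "12345" is a placeholder identifier, not a real record.
url = cross_link_url("protein", "taxonomy", "12345")
print(url)
```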
13Information Flow
- PubMed: the biomedical literature.
- GenBank: nucleotide sequences.
- Protein sequence databases.
- 3D macromolecular structures.
- Complete genome maps.
- Taxonomy: organisms in GenBank.
- OMIM: genetic diseases.
http://www.ncbi.nlm.nih.gov/Tour/tour.html
14Exponential growth of biological information
Efficient storage and management tools are most
important.
15Types of Data (Databases)
Primary (raw) databases: genomic, DNA, protein.
Secondary (analyzed) databases.
Figure labels: Ribbons, Publications, Cylinders.
16Types of Primary Databases
DNA sequences
- GenBank http://www.ncbi.nlm.nih.gov/GenBank/GenBankOverview.html
- EMBL http://www.ebi.ac.uk/embl/index.html
- DDBJ (DNA Data Bank of Japan) http://www.ddbj.nig.ac.jp/
Protein sequences
- Swiss-Prot and TrEMBL http://www.expasy.ch/sprot/sprot-top.html
- Protein Identification Resource (PIR) http://www-nbrf.georgetown.edu/pirwww/pirhome.shtml
Genomic databases
- Whole genomes (NCBI) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genome
- Whole microbial genomes (TIGR) http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.sp1
- Human gene mutations http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html
Others
17Analysis and interpretation of data may reveal
patterns and trends in biology
- Common sequences can be identified by multiple alignment.
- Sequence families or neighborhoods can be defined.
- Motifs can provide clues for biochemical function.
- Clustering sequences into trees reflects the degree of similarity
  between species and their evolutionary relationships.
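As a small illustration of how a motif can hint at function, the sketch below scans a protein sequence for the N-glycosylation site pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The function name `find_motif` and the test sequence are invented for this example, and a match only suggests a candidate site, not a confirmed one.

```python
import re
from typing import List

# PROSITE-style pattern N-{P}-[ST]-{P} rewritten as a regular expression.
# The lookahead lets overlapping matches be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence: str, pattern: re.Pattern = N_GLYCOSYLATION) -> List[int]:
    """Return 1-based start positions of every motif match in the sequence."""
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


protein = "MKNASLLNPTQNGSA"  # invented sequence, for illustration only
print(find_motif(protein))  # -> [3, 12]
```

The asparagine at position 8 is skipped because it is followed by a proline, which the pattern excludes.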
18Types of Database Search
- Text-based search.
- Sequence-based database search (based on sequence similarities).
- Structure-based database search (based on structure similarities).
- Motif/domain-based database search (based on domain similarities).
- Other.
19Biological Databases
- DNA databanks: GenBank, DDBJ, EMBL
- Protein databases: PIR, Swiss-Prot, GenPept, PDB, TrEMBL
- EST databases: dbEST, DOTS, UniGene, STACK
- Structure databases: MMDB, PDB
- Pathway databases: KEGG, BRITE, TRANSPATH
- Motif databases: Prosite, Pfam, BLOCKS, TransFac, PRINTS, URLs
- Gene, protein, and disease databases: GeneCards, OMIM, OMIA
- Taxonomy databases
- Literature databases: PubMed, Medline
- Patent databases: Apipa, CA-STN, IPN, USPTO, EPO, Beilstein
- Others: RNA databases, SNP, microarray
20Gene finding
Design tools data collection
Over 2 x 10^9 bp (mainly human)
Whole genome approach
- Huge data explosion.
- Management of biological information is crucial but becomes harder.
- Most biological experiments require bio-informatics.
http://www.ncbi.nlm.nih.gov/Web/Newsltr/Summer99/decade.html
21Growth of GenBank
GenBank Divisions
PLN - Plant sequences
PRI - Primates
ROD - Rodents
MAM - Other mammals
VRT - Other vertebrates
INV - Invertebrates
BCT - Bacterial
PHG - Phage
VRL - Viral
SYN - Synthetic
UNA - Un-annotated
PAT - Patent
NEW - New
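The division codes above appear on the LOCUS line of a GenBank flat-file record. A minimal sketch of looking one up follows, assuming the division code is the field immediately before the date; the helper `division_of` is hypothetical and the sample line is included only to show the layout.

```python
# Division codes and descriptions as listed on the slide.
GENBANK_DIVISIONS = {
    "PLN": "Plant sequences",
    "PRI": "Primates",
    "ROD": "Rodents",
    "MAM": "Other mammals",
    "VRT": "Other vertebrates",
    "INV": "Invertebrates",
    "BCT": "Bacterial",
    "PHG": "Phage",
    "VRL": "Viral",
    "SYN": "Synthetic",
    "UNA": "Un-annotated",
    "PAT": "Patent",
    "NEW": "New",
}


def division_of(locus_line: str) -> str:
    """Return the description of the division code found on a LOCUS line."""
    fields = locus_line.split()
    if len(fields) < 2:
        return "Unknown division"
    # Assumption: the division code is the field just before the date.
    return GENBANK_DIVISIONS.get(fields[-2], "Unknown division")


sample = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
print(division_of(sample))  # -> Plant sequences
```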
22EMBL: European Molecular Biology Laboratory
http://www1.embl-heidelberg.de/
EMBL Database Entries by species