National Center for Biotechnology Information - PowerPoint PPT Presentation

Slides: 128
Provided by: wayn109

Transcript and Presenter's Notes

Title: National Center for Biotechnology Information


1
National Center for Biotechnology Information
  • A Field Guide to GenBank and NCBI's Molecular Biology Resources

University of Colorado Health Sciences Center
August 30, 2005
2
Topics
  • About NCBI
  • GenBank overview
  • Primary vs derivative databases
  • The Reference Sequence (RefSeq) project
  • Entrez databases
  • Genome resources
  • Bookshelf
  • -break-
  • Entrez text searching
  • BLAST sequence searching
  • VAST structure searching
  • An integrated example

3
The National Institutes of Health
4
The National Center for Biotechnology Information
  • Accepts submissions of primary data
  • Develops tools to analyze these data
  • Creates derivative databases based on the primary
    data
  • Provides free search, link, and retrieval of
    these data, primarily through the Entrez system

5
NCBI WWW Users per Day
6
Number of Users Per Day (chart, by year: 1997 to 2003)
7
Homepage - accessing the data
all[filter]
8
all[filter]
1/11/2005
3/15/2005
8/15/2005
9
Entrez Nucleotide
  • Primary Data
  • GenBank / DDBJ / EMBL: 57.3 million records (97.4%)
  • Derivative Data
  • RefSeq: 1.47 million records (2.5%)
  • RefSeq reviewed: 60,000 records
  • PDB (structures): 5,973 records
  • Total: 59 million records

GenBank
10
GenBank: NCBI's Primary Sequence Database
Over 100 billion bases!
  • full release every two months
  • incremental and cumulative updates daily
  • available only through the Internet
  • release notes: gbrel.txt

ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
11
What is GenBank?
  • Nucleotide-only sequence database
  • Archival in nature
  • GenBank Data
  • Direct submissions (traditional records)
  • Batch submissions (EST, GSS, STS)
  • ftp accounts (genome data)
  • Three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL)
    Database

12
GenBank Divisions
Organismal
PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated
  • Organized by taxonomy (sort of)
  • Direct submissions (Sequin/Bankit)
  • Accurate (1 error per 10,000 bp)
  • Well characterized

Functional
EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual
  • Organized by sequence type
  • Batch submissions (ftp/email)
  • Inaccurate
  • Poorly characterized
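
The two groups of division codes on this slide can be held in a small lookup table. The following is a minimal Python sketch, not an NCBI tool. It assumes that the division code is the field just before the date on a GenBank flat-file LOCUS line; the function names and the sample LOCUS line are hypothetical and made up for illustration.

```python
# Division codes and names as listed on the slide, grouped by its two columns.
ORGANISMAL = {
    "PRI": "Primate", "ROD": "Rodent", "PLN": "Plant and Fungal",
    "BCT": "Bacterial/Archaeal", "INV": "Invertebrate",
    "VRT": "Other Vertebrate", "VRL": "Viral", "MAM": "Mammalian",
    "PHG": "Phage", "SYN": "Synthetic", "UNA": "Unannotated",
}
FUNCTIONAL = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "PAT": "Patent",
    "STS": "Sequence Tagged Site", "CON": "Contigs, virtual",
}


def describe_division(code):
    """Return (group, name) for a three-letter division code."""
    code = code.upper()
    if code in ORGANISMAL:
        return ("organismal", ORGANISMAL[code])
    if code in FUNCTIONAL:
        return ("functional", FUNCTIONAL[code])
    return ("unknown", code)


def division_from_locus(locus_line):
    """Pull the division code out of a LOCUS line.

    Assumption: the code is the second-to-last whitespace-separated field,
    directly before the date.
    """
    fields = locus_line.split()
    if len(fields) < 3 or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return fields[-2]


# Hypothetical LOCUS line, invented for this example only.
sample = "LOCUS       EXAMPLE01    450 bp    mRNA    linear   EST 30-AUG-2005"
print(describe_division(division_from_locus(sample)))
```

A plain dictionary is enough here because the set of divisions is small and fixed for a given release.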

13
GenBank Functional (Bulk) Divisions
  • Expressed Sequence Tag
  • 1st pass single read cDNA
  • Genome Survey Sequence
  • 1st pass single read gDNA
  • High Throughput Genomic
  • incomplete sequences of genomic clones
  • Sequence Tagged Site
  • PCR-based mapping reagents
  • Whole Genome Shotgun

14
EST Division: Expressed Sequence Tags
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAG
CAGAGAATGGAAAGTCAA TTCCTGAATTGCTATGTGTCTGGGTTTCATC
CATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA GAATTGAAAAAG
TGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTG
TACTAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTT
GAACCATGTNGACTTTGTCACAGNC AAGTTNAGTTTAAGTGGGNATCGA
GACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN TTGGA
TTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATA
TGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
  • isolate unique clones
  • sequence once from each end

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTT
CATTCATTATAACAAATTT AATAATCCTGTCAATNATATTTCTAAATTT
TCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT CTTATGCACGC
TTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCT
GACCAAG GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANG
TTGCCAGCCCTC
RNA gene products
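
Single-pass EST reads like the ones on this slide contain N wherever a base could not be called. As a rough illustration of why the bulk divisions are described as less accurate, the sketch below computes the fraction of N positions in a read. It uses the two short lowercase reads shown on the slide; the function name is hypothetical.

```python
def ambiguity_fraction(read):
    """Fraction of positions in a read that are N (uncalled bases).

    Whitespace is ignored so that wrapped sequence text can be passed in.
    """
    bases = [b for b in read.upper() if not b.isspace()]
    if not bases:
        return 0.0
    return sum(1 for b in bases if b == "N") / len(bases)


# The two short reads shown on the slide.
reads = ["gatccantgccatacg", "ctcgccaattcnntcg"]
for read in reads:
    print(read, ambiguity_fraction(read))
```

A real quality check would use per-base quality scores rather than N counts alone; this is only the simplest measure available from the sequence text.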
15
GSS, WGS, HTG
Whole BAC insert (or genome)
shred
isolate clones
sequence
16
HTG Example: Honeybee Draft Sequences
  • Unfinished sequences of BACs
  • Gaps and unordered pieces
  • Finished sequences (Phase 3) move to
    traditional GenBank division

17
Whole Genome Shotgun Projects
  • 351 projects
  • Bacteria (251)
  • Environmental sequences (6)
  • Archaea (6)
  • Eukaryotes (88), including
  • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
  • Pufferfish (2)
  • Honeybee, Anopheles, Fruit Flies (3), Silkworm
  • Nematode (2)
  • Yeasts (8), Aspergillus (2)
  • Rice (2)

18
Whole Genome Shotgun (WGS) Projects
wgs master[properties]
19
Derivative Databases
(Diagram) Sequencing centers and labs submit to GenBank, which is updated ONLY by submitters.
GenBank divisions shown: EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT
Derivative databases, updated by NCBI: UniGene, UniSTS, RefSeq
RefSeq: Entrez Gene and annotation pipelines
20
Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title]
21
Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title]
22
human[organism] AND lipase[title] AND endothelial[title]
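
The queries on these slides combine terms of the form text[field] with AND. A minimal Python sketch of assembling such a query and turning it into a search URL follows. The slides do not cover programmatic access, so the use of the NCBI E-utilities esearch endpoint, the database name "nucleotide", and the helper names are assumptions made for illustration; the code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint; not mentioned in the slides.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses):
    """Join (text, field) pairs into an Entrez query such as a[x] AND b[y]."""
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Return a URL-encoded esearch request for the given database and term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


term = build_term([
    ("human", "organism"),
    ("lipase", "title"),
    ("endothelial", "title"),
])
print(term)
print(esearch_url("nucleotide", term))
```

Adding the third clause narrows the result set in the same way the slides narrow the lipase search by adding endothelial[title].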
23
RefSeq Benefits
genomes
transcripts
proteins
  • non-redundant best representative
  • updates to reflect current sequence data and
    biology
  • distinct, stable accession series

24
Reference Sequence: RefSeq
Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
(blue = curated)
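
Because each RefSeq accession series has its own prefix, the sequence type can be read off the first three characters. The sketch below encodes the prefix table from this slide as a Python dictionary. The function names are hypothetical, and treating every X-prefixed accession as a model record follows the model/curated split shown on the Annotation Process slide.

```python
# Prefix-to-type table as given on the slide.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein, from NM_",
    "NR_": "non-coding RNA",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
    "XR_": "predicted non-coding RNA",
    "ZP_": "predicted from NZ_",
    "NC_": "genomic, e.g., chromosomes",
    "NG_": "genomic, incomplete region",
    "NT_": "genomic, BAC assembly",
    "NW_": "genomic, WGS assembly",
    "NZ_": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the sequence type for a RefSeq accession, or None if the
    prefix is not in the table (for example, a GenBank accession)."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


def is_model(accession):
    """True for the XM_, XR_ and XP_ series (model mRNA, RNA and protein)."""
    return accession[:3].upper() in ("XM_", "XR_", "XP_")


for acc in ["NM_123456789", "XP_123456", "NZ_ABCD12345678", "AB123456"]:
    print(acc, refseq_type(acc), is_model(acc))
```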
25
Annotation Process
Genomic DNA (NC, NT, NW)
Scanning....
Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Curated Protein (NP)
RefSeq
GenBank Sequences
26
Creating NM_ Records
Genome annotation
NMs must have cDNA support
transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA
27
Where is RefSeq?
28
The Entrez System
Gene
UniGene
Cancer Chromosomes
UniSTS
Homologene
SNP
PopSet
Genome
Nucleotide
GEO
Books
Entrez
Taxonomy
PubMed
MeSH
OMIM
Protein
PMC
Journals
Domains
3D Domains
Structure
29
A Few Entrez Databases
  • UniGene: clusters of ESTs and mRNAs
  • dbSNP: Single Nucleotide Polymorphisms
  • GEO: Gene Expression Omnibus
  • microarray and other expression data
  • CDD: Conserved Domain Database
  • protein families (COGs and KOGs)
  • single domains (PFAM, SMART, CD)

30
UniGene
Gene-oriented clusters of expressed sequences
  • Automatic clustering using MegaBlast
  • Each cluster represents a unique gene
  • Informed by genome hits
  • Information on tissue types and map locations
  • Useful for gene discovery and selection of
    mapping reagents

unique gene
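
The idea of grouping expressed sequences into gene-oriented clusters can be sketched as single-linkage clustering: any two sequences joined by a significant hit end up in the same cluster. The Python example below uses a union-find structure for this. It is a simplified illustration, not UniGene's actual pipeline; the sequence identifiers and the list of pairwise hits are hypothetical, and in practice the hits would come from a sequence comparison such as MegaBlast.

```python
def cluster_sequences(sequences, hits):
    """Group sequences into clusters connected by pairwise hits.

    sequences: list of sequence identifiers
    hits: list of (id_a, id_b) pairs judged similar enough to link
    """
    parent = {seq: seq for seq in sequences}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for seq in sequences:
        clusters.setdefault(find(seq), []).append(seq)
    return list(clusters.values())


# Hypothetical identifiers and hits, for illustration only.
sequences = ["mrnaA", "est1", "est2", "est3", "est4", "est5"]
hits = [("mrnaA", "est1"), ("est1", "est2"), ("est3", "est4")]
for members in cluster_sequences(sequences, hits):
    print(members)
```

Here est5 has no hits and stays in a cluster of its own, which corresponds to a singleton in the histogram of cluster sizes.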
31
A Cluster of ESTs
query
5' EST hits
3' EST hits
32
UniGene Collections
33
Example UniGene Cluster
34
Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build 186)
35
UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES
36
UniGene Cluster Hs.95351
GENE EXPRESSION
37
UniGene Cluster Hs.95351 expression
38
UniGene Cluster Hs.95351 seqs
39
Download sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
40
Entrez GEO
41
NCBI's SNP Database
  • Primary and derivative (RefSNP)
  • Single nucleotide polymorphisms
  • Repeat polymorphisms
  • Insertion-deletion polymorphisms
  • Over 19 million refSNPs (rsXXXXXXX)
  • (August, 2005)

42
Searching dbSNP
43
RefSNP
44
RefSNP
45
RefSNP
46
RefSNP
Search Mouse SNP between strains
47
RefSNP
48
RefSNP
49
Entrez GEO
50
Submitted by Experimentalists
Curated by NCBI
Submitted by Manufacturer
GDS: Grouping of experiments
GPL: Platform descriptions
GSE: Grouping of slide/chip data, "a single experiment"
GSM: Raw/processed spot intensities from a single slide/chip
GEO SEries: set of related samples
GEO SaMple: experimental conditions
Entrez GEO Datasets
Entrez GEO
51
What's a DataSet?
52
Gene Expression Omnibus (GEO)
Dataset browser
53
GEO Dataset Browser
54
GEO Dataset Report
55
GEO Profiles
of 12625
56
Entrez CDD
57
Conserved Domain Database
  • Multiple sequence alignments
  • Position-specific scoring matrices (PSSM)
  • Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
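
A position-specific scoring matrix is derived from a multiple sequence alignment by scoring each residue at each column according to how often it occurs there. The sketch below shows one simplified way to do this in Python, using log-odds scores against a uniform background with a pseudocount. It is not the method CDD itself uses; the function names, the pseudocount, and the uniform background are assumptions. The two aligned segments are short stretches taken from the ATP7A example sequence shown on the CDD slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length aligned sequences.

    Assumptions: uniform background frequencies and a flat pseudocount.
    Returns one dict per column mapping residue -> score.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the per-column scores for a segment as long as the PSSM."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


# Two short segments from the ATP7A example sequence in this deck.
alignment = ["GMHCNSCV", "GMTCNSCV"]
pssm = build_pssm(alignment)
print(score_segment(pssm, "GMHCNSCV"))
print(score_segment(pssm, "AAAAAAAA"))
```

A segment that matches the conserved columns scores higher than an unrelated one, which is the property a domain search relies on.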

58
CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAID
IERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALST
LHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEV
ESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKK
AGVKSIQVSLANRNGTVEYDP LLTSPEILRE
59
CDD
Click on a colored bar to align your sequence to
the CD
CD
Pfam
COG
60
Conserved Domain Database cd00371.1, HMA
61
CDD
62
CDART: Conserved Domain Architecture Retrieval Tool
63
cdd
Linking from Entrez Protein
64
Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
65
Genomic Biology
66
Gen Biol Gen Resources
67
Gen Biol Gen Resources
68
Gen Biol Gen Resources
69
Genome Projects microb
70
Gen Biol Gen Resources
71
Gen Biol Gen Resources
72
Gen Biol Gen Resources
73
Gen Biol Gen Resources
74
Gen Biol Gen Resources
75
Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
76
Entrez Gene
  • A single query interface to
  • Sequences
  • - RefSeqs
  • - GenBank
  • - Homologene
  • Maps: MapViewer
  • Entrez links
  • Linkouts
  • More organisms, 3000
  • Entrez integration

77
Global Entrez NADH2
78
Entrez Gene NADH2
79
Gene Record for Pongo NADH2
Not found with nadh2
80
A Record With More Data: Human HFE
81
Human HFE Transcripts
Transcripts with experimental evidence
82
Gene Table
83
Introns/Exons Gene Table
links to sequence
84
Human HFE Links
85
Genotype
86
Genotype
87
Human HFE Links
88
GeneView in dbSNP
89
SNP in Structure
90
SNP in Structure
91
SNP in Structure
H41
S43
C260
92
Another Variation Source OMIM
93
Variants in OMIM
94
Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
95
The New Homologene
Automated detection of homologs among the
annotated genes of completely sequenced
eukaryotic genomes.
  • No longer UniGene based
  • Protein similarities first
  • Guided by taxonomic tree
  • Includes orthologs and paralogs

96
The New Homologene
97
RAG1 → Homologene
98
RAG1 → Homologene
RAG1
99
RAG1
RING-finger
100
RAG1 → Homologene
RAG1
101
RAG1
Sugar_tr
102
Homologene alignment scores
103
BLASTP bl2seq
104
Genome Resources
Gene database
Map Viewer
105
List View
106
Human MapViewer
107
MapViewer Human ADAR
108
MV Hs ADAR
109
Maps Options
--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST
--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps--
deCODE, Genethon, Marshfield
--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
110
MapViewer
Component
Gene
UniGene
Repeats
111
Phenotype
Variation
Gene
112
Maps Options
113
Genome Resources
Trace Archive
114
Trace Archive Page
115
Macaca Mulatta Traces
116
(No Transcript)
117
Trace Archive BLAST Page
Access to sequences NOT in GenBank
118
Literature Links
119
BOOKS Database
120
BOOKS Database hyperlinked
121
BOOKS Database
122
BOOKS Database
123
BOOKS Database
124
Genes Dis
125
Genes Dis
126
For More Information
127
Intermission