Title: National Center for Biotechnology Information
1National Center for Biotechnology Information
- A Field Guide to GenBank and NCBI's Molecular Biology Resources
University of Colorado Health Sciences Center
August 30, 2005
2Topics
- About NCBI
- GenBank overview
- Primary vs derivative databases
- The Reference Sequence (RefSeq) project
- Entrez databases
- Genome resources
- Bookshelf
- -break-
- Entrez text searching
- BLAST sequence searching
- VAST structure searching
- An integrated example
3The National Institutes of Health
4The National Center for Biotechnology Information
- Accepts submissions of primary data
- Develops tools to analyze these data
- Creates derivative databases based on the primary data
- Provides free search, link, and retrieval of these data, primarily through the Entrez system
5NCBI WWW Users per Day
6Number of Users Per Day
[Chart: number of users per day, 1997 through 2003]
7Homepage - accessing the data
all[filter]
8all[filter]
1/11/2005
3/15/2005
8/15/2005
9Entrez Nucleotide
- Primary Data
- GenBank / DDBJ / EMBL: 57.3 million (97.4%)
- Derivative Data
- RefSeq: 1.47 million (2.5%)
- RefSeq reviewed: 60,000
- PDB (structures): 5,973
- Total: 59 million records
GenBank
10GenBank: NCBI's Primary Sequence Database
Over 100 billion bases!
- full release every two months
- incremental and cumulative updates daily
- available only through internet
- release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
11What is GenBank?
- Nucleotide only sequence database
- Archival in nature
- GenBank Data
- Direct submissions (traditional records)
- Batch submissions (EST, GSS, STS)
- ftp accounts (genome data)
- Three collaborating databases
- GenBank
- DNA Database of Japan (DDBJ)
- European Molecular Biology Laboratory (EMBL) Database
12GenBank Divisions
Organismal
PRI (28) Primate
ROD (15) Rodent
PLN (13) Plant and Fungal
BCT (11) Bacterial/Archaeal
INV (7) Invertebrate
VRT (7) Other Vertebrate
VRL (4) Viral
MAM (2) Mammalian
PHG (1) Phage
SYN (1) Synthetic
UNA (1) Unannotated
- Organized by taxonomy (sort of)
- Direct submissions (Sequin/Bankit)
- Accurate (1 error per 10,000 bp)
- Well characterized
Functional
EST (377) Expressed Sequence Tag
GSS (138) Genome Survey Sequence
HTG (63) High Throughput Genomic
PAT (17) Patent
STS (9) Sequence Tagged Site
CON (1) Contigs, virtual
- Organized by sequence type
- Batch submissions (ftp/email)
- Inaccurate
- Poorly characterized
13GenBank Functional (Bulk) Divisions
- Expressed Sequence Tag
- 1st pass single read cDNA
- Genome Survey Sequence
- 1st pass single read gDNA
- High Throughput Genomic
- incomplete sequences of genomic clones
- Sequence Tagged Site
- PCR-based mapping reagents
- Whole Genome Shotgun
14EST Division Expressed Sequence Tags
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAG
CAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATC
CATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAG
TGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTG
TACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTT
GAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGA
GACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGA
TTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATA
TGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
- - isolate unique clones
- sequence once from each end
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTT
CATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTT
TCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGC
TTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCT
GACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANG
TTGCCAGCCCTC
RNA gene products
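The 5' and 3' reads on this slide contain N characters, which mark positions the single-pass read could not resolve. As a rough illustration of how one might measure that, the sketch below parses FASTA text and reports the fraction of ambiguous positions per read. The function names and the toy reads are hypothetical, chosen for this example; they are not part of any NCBI tool.

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def ambiguity_fraction(seq: str) -> float:
    """Fraction of positions that are not an unambiguous A, C, G, or T."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base not in "ACGT") / len(seq)


# Hypothetical toy reads, not the records shown on the slide.
EXAMPLE = """>read_5prime
ACGTNACGTA
ACGTACGTNN
>read_3prime
NNACGTAC
"""

stats: Dict[str, float] = {
    header: ambiguity_fraction(seq) for header, seq in parse_fasta(EXAMPLE)
}
for header, fraction in stats.items():
    print(f"{header}: {fraction:.1%} ambiguous")
```

Running the same two functions over a file of real EST records would give a quick sense of read quality before the sequences are used for anything downstream.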
15GSS, WGS, HTG
Whole BAC insert (or genome)
shred
isolate clones
sequence
16HTG Example: Honeybee Draft Sequences
- Unfinished sequences of BACs
- Gaps and unordered pieces
- Finished sequences (Phase 3) move to traditional GenBank division
17Whole Genome Shotgun Projects
- 351 projects
- Bacteria (251)
- Environmental sequences (6)
- Archaea (6)
- Eukaryotes (88), including
- Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
- Pufferfish (2)
- Honeybee, Anopheles, Fruit Flies (3), Silkworm
- Nematode (2)
- Yeasts (8), Aspergillus (2)
- Rice (2)
18Whole Genome Shotgun (WGS) Projects
wgs master[properties]
19Derivative Databases
Sequencing Centers
UniGene
UniSTS
Updated by NCBI
EST
GenBank
STS
Updated ONLY by submitters
RefSeq
RefSeq Entrez Gene and annotation pipelines
HTG
GSS
INV
VRT
PHG
VRL
PRI
ROD
PLN
MAM
BCT
Labs
20Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title]
21Why Make Reference Sequences?
Entrez Nucleotide query: human[organism] AND lipase[title]
22human[organism] AND lipase[title] AND endothelial[title]
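The fielded queries on these slides combine value[field] terms with AND. The sketch below shows one way to assemble such a query and turn it into a search URL. It assumes NCBI's E-utilities ESearch endpoint and the database name "nucleotide", neither of which appears in the slides; the helper names are made up for this example. It only builds the URL and does not contact NCBI.

```python
from typing import Iterable, Tuple
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not named in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(terms: Iterable[Tuple[str, str]]) -> str:
    """Join (value, field) pairs into an Entrez query such as 'human[organism] AND ...'."""
    return " AND ".join(f"{value}[{field}]" for value, field in terms)


def esearch_url(db: str, query: str, retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode escapes the brackets and spaces."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


query = fielded_query(
    [("human", "organism"), ("lipase", "title"), ("endothelial", "title")]
)
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Adding the third term narrows the result set in the same way the slide does: each extra AND clause can only keep or reduce the number of matching records.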
23RefSeq Benefits
genomes
transcripts
proteins
- non-redundant best representative
- updates to reflect current sequence data and biology
- distinct, stable accession series
24Reference Sequence RefSeq
Accession          Sequence Type
NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_
NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region
NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
(blue = curated)
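Because every RefSeq accession on this slide starts with a two-letter prefix and an underscore, the sequence type can be read off the accession itself. A minimal sketch follows, with the mapping copied from the slide; the function name is hypothetical, and prefixes not listed on the slide are rejected.

```python
# Prefix-to-type mapping taken from the slide's accession list.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession: str) -> str:
    """Return the sequence type for a RefSeq accession, based on its prefix."""
    prefix, sep, rest = accession.strip().partition("_")
    if not sep or not rest:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    try:
        return REFSEQ_PREFIXES[prefix.upper()]
    except KeyError:
        raise ValueError(f"prefix not listed on the slide: {prefix!r}") from None


for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678"):
    print(acc, "->", refseq_type(acc))
```

The underscore is also what separates RefSeq accessions from GenBank ones, which have no underscore, so the same check doubles as a quick test for whether a record is a reference sequence.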
25Annotation Process
Genomic DNA (NC, NT, NW)
Scanning....
Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Curated Protein (NP)
RefSeq
GenBank Sequences
26Creating NM_ Records
Genome annotation
NMs must have cDNA support
transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA
27Where is RefSeq?
28The Entrez System
Gene
UniGene
CancerChromosomes
UniSTS
Homologene
SNP
PopSet
Genome
Nucleotide
GEO
Books
Entrez
Taxonomy
PubMed
MeSH
OMIM
Protein
PMC
Journals
Domains
3D Domains
Structure
29A Few Entrez Databases
- UniGene: clusters of ESTs, mRNAs
- dbSNP: Single Nucleotide Polymorphisms
- GEO: Gene Expression Omnibus
- microarray and other expression data
- CDD: Conserved Domain Database
- protein families (COGs and KOGs)
- single domains (PFAM, SMART, CD)
30UniGene
Gene-oriented clusters of expressed sequences
- Automatic clustering using MegaBlast
- Each cluster represents a unique gene
- Informed by genome hits
- Information on tissue types and map locations
- Useful for gene discovery and selection of mapping reagents
unique gene
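To make the idea of gene-oriented clustering concrete, the sketch below groups sequences by single linkage: any two sequences joined by a similarity hit end up in the same cluster. This is a simplified illustration using a union-find structure over hypothetical hit pairs. It is not UniGene's actual procedure, which the slide says also uses genome hits, and the sequence identifiers are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_hits(
    ids: Iterable[str], hits: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Single-linkage clustering: sequences connected by any chain of hits share a cluster."""
    hits = list(hits)
    # Every sequence starts as its own cluster root.
    parent: Dict[str, str] = {i: i for i in ids}
    for a, b in hits:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, List[str]] = defaultdict(list)
    for seq_id in parent:
        groups[find(seq_id)].append(seq_id)
    return sorted((sorted(members) for members in groups.values()), key=lambda g: g[0])


# Hypothetical identifiers and hit pairs (e.g. pairs passing a similarity threshold).
sequences = ["est1", "est2", "est3", "est4", "mrna1"]
pairwise_hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster_by_hits(sequences, pairwise_hits)
print(clusters)
```

Here est1 and est2 never hit each other directly, but both hit mrna1, so all three land in one cluster. Single linkage has the well-known drawback that one spurious hit can merge two unrelated clusters.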
31A Cluster of ESTs
query
5 EST hits
3 EST hits
32UniGene Collections
33Example UniGene Cluster
34Histogram of cluster sizes for UniGene Hs Build 177
(Now at Build 186)
35UniGene Cluster Hs.95351
SELECTED PROTEIN SIMILARITIES
36UniGene Cluster Hs.95351
GENE EXPRESSION
37UniGene Cluster Hs.95351 expression
38UniGene Cluster Hs.95351 seqs
39Download sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
40Entrez GEO
41NCBI's SNP Database
- Primary and derivative (RefSNP)
- Single nucleotide polymorphisms
- Repeat polymorphisms
- Insertion-deletion polymorphisms
- Over 19 million refSNPs (rsXXXXXXX) as of August 2005
42Searching dbSNP
43RefSNP
44RefSNP
45RefSNP
46RefSNP
Search Mouse SNP between strains
47RefSNP
48RefSNP
49Entrez GEO
50Submitted by Experimentalists
Curated by NCBI
Submitted by Manufacturer
GDS: Grouping of experiments
GPL: Platform descriptions
GSE: Grouping of slide/chip data, "a single experiment"
GSM: Raw/processed spot intensities from a single slide/chip
GEO SEries: set of related samples
GEO SaMple: experimental conditions
Entrez GEO Datasets
Entrez GEO
51What's a DataSet?
52Gene Expression Omnibus (GEO)
Dataset browser
53GEO Dataset Browser
54GEO Dataset Report
55GEO Profiles
of 12625
56Entrez CDD
57Conserved Domain Database
- Multiple sequence alignments
- Position-specific scoring matrices (PSSM)
- Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
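A position-specific scoring matrix turns a multiple alignment into per-column scores, so a residue that is common in a column scores high there and a rare one scores low. The sketch below builds a simple log-odds matrix with pseudocounts and a uniform background. It is a textbook-style simplification, not the method CDD uses, and the three-row alignment is a hypothetical toy example.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(
    alignment: Sequence[str],
    alphabet: str = AMINO_ACIDS,
    background: Optional[Dict[str, float]] = None,
    pseudocount: float = 1.0,
) -> List[Dict[str, float]]:
    """Return one dict per alignment column mapping residue -> log2 odds score."""
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if background is None:
        # Assumption: uniform background frequencies, for simplicity.
        background = {residue: 1.0 / len(alphabet) for residue in alphabet}

    pssm = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(row[col] for row in alignment if row[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append(
            {
                residue: math.log2(
                    ((counts[residue] + pseudocount) / total) / background[residue]
                )
                for residue in alphabet
            }
        )
    return pssm


def score(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment aligned to the start of the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical toy alignment of a four-residue motif.
toy_alignment = ["GMTC", "GMHC", "GLTC"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "GMTC"), 2), round(score(pssm, "WWWW"), 2))
```

A segment that matches the conserved columns scores well above one built from residues never seen in the alignment, which is the property a domain search relies on.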
58CDD
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAID
IERLKNIPVKSSEGSQQMSPSSTNDSKVTLTIDGMHCNSCVSNIESALST
LHYVSSIVVSLQNKSAIIKYNANSVTPEILKKAIEAISPGQYRVSITSEV
ESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNSCVQSIEGVMSKK
AGVKSIQVSLANRNGTVEYDPLLTSPEILRE
59CDD
Click on a colored bar to align your sequence to the CD
CD
Pfam
COG
60Conserved Domain Database cd00371.1, HMA
61CDD
62CDART: Conserved Domain Architecture Retrieval Tool
63cdd
Linking from Entrez Protein
64Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
65Genomic Biology
66Gen Biol Gen Resources
67Gen Biol Gen Resources
68Gen Biol Gen Resources
69Genome Projects microb
70Gen Biol Gen Resources
71Gen Biol Gen Resources
72Gen Biol Gen Resources
73Gen Biol Gen Resources
74Gen Biol Gen Resources
75Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
76Entrez Gene
- A single query interface to
- Sequences
- - RefSeqs
- - GenBank
- - Homologene
- Maps MapViewer
- Entrez links
- Linkouts
- More organisms, 3000
- Entrez integration
77Global Entrez NADH2
78Entrez Gene NADH2
79Gene Record for Pongo NADH2
Not found with nadh2
80A Record With More Data Human HFE
81Human HFE Transcripts
Transcripts with experimental evidence
82Gene Table
83Introns/Exons Gene Table
links to sequence
84Human HFE Links
85Genotype
86Genotype
87Human HFE Links
88GeneView in dbSNP
89SNP in Structure
90SNP in Structure
91SNP in Structure
H41
S43
C260
92Another Variation Source OMIM
93Variants in OMIM
94Genome Resources
Genomic Biology
Gene database
Homologene
Map Viewer
Trace Archive
95The New Homologene
Automated detection of homologs among the
annotated genes of completely sequenced
eukaryotic genomes.
- No longer UniGene based
- Protein similarities first
- Guided by taxonomic tree
- Includes orthologs and paralogs
96The New Homologene
97RAG1 → Homologene
98RAG1 → Homologene
RAG1
99RAG1
RING-finger
100RAG1 → Homologene
RAG1
101RAG1
Sugar_tr
102Homologene alignment scores
103BLASTP bl2seq
104Genome Resources
Gene database
Map Viewer
105List View
106Human MapViewer
107MapViewer Human ADAR
108MV Hs ADAR
109Maps Options
Maps Options
--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST
--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps--
deCODE, Genethon, Marshfield
--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
110MapViewer
Component
Gene
UniGene
Repeats
111Phenotype
Variation
Gene
112Maps Options
Maps Options
113Genome Resources
Trace Archive
114Trace Archive Page
115Macaca Mulatta Traces
117Trace Archive BLAST Page
Access to sequences NOT in GenBank
118Literature Links
119BOOKS Database
120BOOKS Database hyperlinked
121BOOKS Database
122BOOKS Database
123BOOKS Database
124Genes Dis
125Genes Dis
126For More Information
127Intermission