A Field Guide to GenBank and NCBI Molecular Biology Resources


Title: A Field Guide to GenBank and NCBI Molecular Biology Resources


1
A Field Guide to GenBank and NCBI Molecular
Biology Resources
  • slightly modified from
  • Peter Cooper
  • ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/
  • Eric Sayers
  • ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

2
NCBI Resources
  • About NCBI
  • NCBI Sequence Databases
  • Primary Database: GenBank
  • Derivative Databases - RefSeq
  • Entrez Databases and Text Searching
  • BLAST Services
  • Genomic Resources

3
The National Center for Biotechnology
Information (NCBI)
  • Created as a part of NLM in 1988
  • Establish public databases
  • Perform research in computational biology
  • Develop software tools for sequence analysis
  • Disseminate biomedical information
  • Tools: BLAST (1990), Entrez (1992)
  • GenBank (1992)
  • Free MEDLINE (PubMed, 1997)
  • Human genome (2001)

4
NCBI Home Page: http://www.ncbi.nlm.nih.gov
To learn more, visit the Site Map and About
NCBI web pages
5
About NCBI
6
Some NCBI Statistics.
7
Users per day, 1997 to 2001 [chart]
8
Molecular Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Database staff organize but don't add additional
    information
  • Example: GenBank
  • Derivative Databases
  • Human curated
  • Compilation and correction of data
  • Example: SWISS-PROT, NCBI RefSeq mRNA
  • Computationally Derived
  • Example: UniGene
  • Combinations
  • Example: NCBI Genome Assembly

9
What is GenBank? NCBI's Primary Sequence Database
  • Nucleotide-only sequence database
  • GenBank Data
  • Direct submissions: individual records (BankIt,
    Sequin)
  • Batch submissions: via email (EST, GSS, STS)
  • ftp accounts established for sequencing centers
  • Data shared amongst three collaborating
    databases
  • GenBank
  • DNA Data Bank of Japan (DDBJ)
  • European Molecular Biology Laboratory Database
    (EMBL)

10
The International Nucleotide Sequence Database
Collaboration
[Diagram: three collaborating databases, each receiving submissions and updates]
  • GenBank (NCBI, NIH): submissions via Sequin, BankIt, ftp; retrieval via Entrez
  • EMBL (EBI): retrieval via SRS
  • DDBJ (CIB, NIG): retrieval via getentry
11
GenBank: NCBI's Primary Sequence Database
>90 Gigabytes of data
12
Entrez Nucleotide
23,464,770 records
  • GenBank 71%
  • DDBJ 19%
  • EMBL 9%
  • RefSeq 1%
13
Primary vs. Derivative Databases
[Diagram: sequence fragments from Labs and Sequencing Centers flow into GenBank; Curators produce RefSeq, Algorithms produce UniGene, and the Genome Assembly combines the data]
14
Traditional GenBank Divisions
  • Direct Submissions (Sequin and BankIt)
  • Accurate
  • Well characterized

BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (excluding ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate
15
A Traditional GenBank Record
Locus Field
Molecule Type
Modification Date
GenBank Division
Definition Line
Accession Number
Version
GI (GenInfo)
Keywords
Taxonomy
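
The fields labeled on this slide sit in the header of a GenBank flat-file record. The following is a minimal sketch of reading them from the text, assuming the usual flat-file layout (keyword at the start of the line, value after it). The sample record is a made-up placeholder rather than a real entry, and `parse_header` is a hypothetical helper name.

```python
# Made-up placeholder record for illustration only; not a real GenBank entry.
SAMPLE = """\
LOCUS       XX000001                 120 bp    mRNA    linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example mRNA, partial cds.
ACCESSION   XX000001
VERSION     XX000001.1  GI:0000000
KEYWORDS    .
SOURCE      Example organism
  ORGANISM  Example organism
            Eukaryota; Viridiplantae.
"""


def parse_header(record_text):
    """Collect a few header fields from GenBank flat-file text."""
    fields = {}
    for line in record_text.splitlines():
        if line.startswith("LOCUS"):
            tokens = line.split()
            fields["locus"] = tokens[1]
            fields["length_bp"] = int(tokens[2])
            fields["molecule_type"] = tokens[4]
            # Division and date are the last two tokens, whether or not
            # a topology token (linear/circular) is present.
            fields["division"] = tokens[-2]
            fields["modification_date"] = tokens[-1]
        elif line.startswith("DEFINITION"):
            fields["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            fields["accession"] = line.split()[1]
        elif line.startswith("VERSION"):
            tokens = line.split()
            fields["version"] = tokens[1]
            for token in tokens[2:]:
                if token.startswith("GI:"):
                    fields["gi"] = token[3:]
        elif line.startswith("KEYWORDS"):
            fields["keywords"] = line[len("KEYWORDS"):].strip()
        elif line.startswith("  ORGANISM"):
            fields["organism"] = line[len("  ORGANISM"):].strip()
    return fields


if __name__ == "__main__":
    for name, value in parse_header(SAMPLE).items():
        print(f"{name}: {value}")
```

The sketch reads only the first line of each field, so a definition or keyword list that wraps onto continuation lines would be truncated.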
16
A Traditional GenBank Record
17
Bulk Sequence Divisions of GenBank
  • Batch Submissions (email and ftp)
  • Inaccurate
  • Poorly Characterized

EST  Expressed Sequence Tag
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
HTC  High Throughput cDNA
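
As a small illustration of the traditional/bulk split, the division codes listed on slides 14 and 17 can be held in two lookup tables. This is only a sketch: `describe_division` is a hypothetical helper, and codes not named on those two slides are reported as unknown.

```python
# Division codes as listed on the "Traditional" and "Bulk" division slides.
TRADITIONAL_DIVISIONS = {
    "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate",
    "MAM": "Mammalian (excluding ROD and PRI)",
    "PHG": "Phage",
    "PLN": "Plant and Fungal",
    "PRI": "Primate",
    "ROD": "Rodent",
    "SYN": "Synthetic (cloning vectors)",
    "VRL": "Viral",
    "VRT": "Other Vertebrate",
}

BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic",
    "HTC": "High Throughput cDNA",
}


def describe_division(code):
    """Return (group, description) for a three-letter division code."""
    code = code.strip().upper()
    if code in TRADITIONAL_DIVISIONS:
        return ("traditional", TRADITIONAL_DIVISIONS[code])
    if code in BULK_DIVISIONS:
        return ("bulk", BULK_DIVISIONS[code])
    return ("unknown", None)


if __name__ == "__main__":
    for code in ("PLN", "est", "XYZ"):
        print(code, describe_division(code))
```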
18
Organization of GenBank
23,087,196 records
  • 11 Traditional Divisions: 8%
  • 1 Patent Division (PAT): 4%
  • 5 Bulk Divisions: EST 67%, GSS 19%, STS/HTG/HTC 2%
19
What is UniGene?
A gene-oriented view of sequence entries
  • MegaBlast-based automated sequence clustering
  • Nonredundant set of gene-oriented clusters
  • Each cluster represents a unique gene
  • Provides information on tissue-specific
    expression and map locations
  • Includes well-characterized genes and novel ESTs
  • Useful for gene discovery and selection of
    mapping reagents

20
Organisms Represented in UniGene
21
Genome Sequencing
Whole BAC insert (or genome)
shredding
sequencing
cloning isolating
GSS division or trace archive
assembly
Draft Sequence (HTG division)
22
Working Draft Sequence
23
HTG Division: High Throughput Genome
24
HTG Division: High Throughput Genome
25
NCBI's Third Party Annotation (TPA) Database
NEW
  • NCBI now accepts the submission of new
    annotations of existing GenBank sequences
  • Facilitates the annotation of genomes by experts

26
A Sample TPA record
27
RefSeq: NCBI's Derivative Sequence Database
  • Curated transcripts and proteins
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish,
    Arabidopsis
  • Human model transcripts and proteins
  • Assembled Genomic Regions (contigs)
  • draft human genome
  • mouse genome
  • Chromosome records
  • Microbial
  • viral
  • organelle

28
The RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456  Curated mRNA
  NP_123456  Curated Protein
  NR_123456  Curated non-coding RNA
  XM_123456  Predicted Transcript (human, mouse)
  XP_123456  Predicted Protein (human, mouse)
  XR_123456  Predicted non-coding RNA
Gene Records
  NG_123456  Reference Genomic Sequence (human)
Assemblies
  NT_123456  Contig (Mouse and Human)
  NW_123456  Supercontig (Mouse)
  NC_123456  Chromosome (Microbial, Viral, Arabidopsis)
  NR_123456  Interim Identifier for Microbial Chromosomes
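
Because the two-letter prefix carries the record type, an accession can be classified by splitting at the underscore. The sketch below uses only the prefixes listed on this slide. `classify_refseq` is a hypothetical helper, and since the slide lists NR_ under two headings, the table keeps both descriptions for that prefix.

```python
# Prefix meanings as listed on the slide; NR_ appears there twice.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA "
          "(also listed as interim identifier for microbial chromosomes)",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NT": "Contig",
    "NW": "Supercontig",
    "NC": "Chromosome",
}


def classify_refseq(accession):
    """Return the slide's description for a RefSeq-style accession, or None."""
    prefix, underscore, _rest = accession.strip().partition("_")
    if not underscore:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_123456", "XP_123456", "NC_123456", "U18238"):
        print(acc, "->", classify_refseq(acc))
```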
29
Curated RefSeq Records: NM_, NP_
30
Entrez: Linking and Neighboring
31
The Entrez Databases
32
The (ever) Expanding Entrez System
[Diagram of the Entrez databases: Journals, UniGene, Books, SNP, PubMed, UniSTS, PubMed Central, Nucleotide, PopSet, Protein, ProbeSet, Genome, Structure, Taxonomy, CDD, OMIM, 3D Domains]
33
Entrez Nucleotides
glucose 6 phosphate dehydrogenase
34
Document Summaries
glucose 6 phosphate dehydrogenase[All Fields]
748 hits
35
Entrez Nucleotides: Limits
Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter,
Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism,
Page Number, Primary Accession, Properties, Protein Name,
Publication Date, SeqID String, Sequence Length, Substance Name,
Text Word, Title Word, Uid, Volume
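
A field limit is written by appending the field name in square brackets to a term, and limited terms can be joined with AND. The sketch below builds such a query string and, as an assumption that goes beyond these slides (which show the web interface), wraps it in a URL for NCBI's E-utilities ESearch service. The helper names and the `Viridiplantae[Organism]` restriction are illustrative, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (not shown in the slides).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def limit(term, field):
    """Attach an Entrez field limit, e.g. limit('G6PD', 'Gene Name')."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Require every clause by joining with the AND operator."""
    return " AND ".join(clauses)


def esearch_url(query, db="nucleotide", retmax=20):
    """Build (but do not fetch) an ESearch URL for the query."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    query = all_of(
        limit("glucose 6 phosphate dehydrogenase", "All Fields"),
        limit("Viridiplantae", "Organism"),  # illustrative restriction
    )
    print(query)
    print(esearch_url(query))
```

Hit counts change as the databases grow, so a search run today will not reproduce the numbers shown in the slides.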
36
Entrez Nucleotides: Preview/Index
37
Adding Terms: Preview/Index
38
Plant G6PD mRNAs
39
Display Formats, Links, and Neighbors
Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list, LinkOut,
Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links,
PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links,
Taxonomy Links, UniSTS Links
40
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA
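
The first line of a FASTA record (the definition line) packs several identifiers ahead of the free-text description. Here is a minimal sketch of splitting it apart, assuming the pipe-delimited `>gi|number|database|accession.version|locus` layout used for this record. `parse_defline` is a hypothetical helper, and the truncated description is kept exactly as it appears in the slide.

```python
def parse_defline(defline):
    """Split a '>gi|...|db|acc.ver|locus description' line into its parts."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    identifiers, _space, description = defline[1:].partition(" ")
    parts = identifiers.split("|")
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("expected gi|number|db|accession.version|locus")
    accession, _dot, version = parts[3].partition(".")
    return {
        "gi": parts[1],
        "database": parts[2],  # 'gb' marks a GenBank record
        "accession": accession,
        "version": version,
        "locus": parts[4],
        "description": description,
    }


if __name__ == "__main__":
    line = (">gi|603218|gb|U18238.1|MSU18238 "
            "Medicago sativa glucose-6-phosphate dehyd")
    for key, value in parse_defline(line).items():
        print(f"{key}: {value}")
```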
41
Entrez Genome
42
Organism Pages
43
The Map Viewer: a common platform for integrated
display
44
The Map Viewer
45
Entrez PubMed
46
Online Books
47
Entrez Specialized Databases
  • Taxonomy: searchable taxonomic tree having nodes for all
    species with records in an Entrez database
  • OMIM: Online Mendelian Inheritance in Man, a database
    of genetically linked human diseases
  • ProbeSet: expression data (GEO) and microarray datasets
48
Entrez Taxonomy
49
Entrez OMIM
50
Entrez ProbeSet
51
Trace Archive
52
Entrez Structure
53
Structure Summary
Cn3D viewer
Related Structures
Conserved Domains
54
Cn3D: Displaying Structures
55
Structural Alignment
56
(No Transcript)