Molecular Biology Resources at NCBI - PowerPoint PPT Presentation

Slides: 139
Provided by: stevep49
Transcript and Presenter's Notes

1
Molecular Biology Resources at NCBI
  • Or
  • NCBI: It's a BLAST!

June 22, 2004
Washington State University - Tri-Cities
2
NCBI Resources
  • About NCBI
  • NCBI Sequence Databases
  • Other NCBI Databases
  • Entrez Databases and Text Searching
  • Genomic Resources
  • BLAST Services
  • Educational Tools

3
The National Center for Biotechnology Information
  • Created in 1988 as a part of the
  • National Library of Medicine at NIH
  • Establish public databases
  • Research in computational biology
  • Develop software tools for sequence analysis
  • Disseminate biomedical information

4
Web Access: http://www.ncbi.nlm.nih.gov
5
http://www.ncbi.nlm.nih.gov/About/index.html
6
NCBI At A Glance
7
A Science Primer
8
(No Transcript)
9
What is a Cell?
10
(No Transcript)
11
The Global Entrez Search Engine
12
(No Transcript)
13
The (ever expanding) Entrez System
Literature
Organism
Expression
14
Examples of Database Integration in Entrez
Word weight
Phylogeny
VAST
Protein sequences
BLASTn
BLASTp
15
(No Transcript)
16
Entrez Nucleotides
  • Primary
  • GenBank / EMBL / DDBJ 29,800,460
  • Derivative
  • RefSeq 304,828
  • Third Party Annotation 4,266
  • PDB 5,062

  • Total
    30,114,616

17
Entrez Protein
  • GenPept (GB,EMBL, DDBJ) 3,333,837
  • RefSeq 1,011,056
  • Third Party Annotation 4,685
  • Swiss Prot 154,397
  • PIR 282,821
  • PRF
    12,079

  • Total
    4,798,875
  • BLAST nr
    1,859,644

18
(No Transcript)
19
Other integrated NCBI websites
  • Taxonomy Browser
  • Database links sorted by lineage
  • Gene
  • Database links based on genetic loci

20
potato
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
Gene
Links to Everywhere (almost)
25
NCBI Databases and Tools
26
(No Transcript)
27
(No Transcript)
28
Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support
A part of the NCBI Bookshelf
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
Types of Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP, GEO
  • Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI
    Protein, Structure, Conserved Domain

35
Primary vs. Derivative Sequence Databases
Labs
Sequencing Centers
Updated continually by NCBI
Updated ONLY by submitters
36
1º Sequence Database
GenBank
  • Nucleotide only sequence database
  • Archival in nature
  • Submission of GenBank Data to NCBI
  • Direct submissions of individual records via Web
    (BankIt, Sequin)
  • Batch submissions of bulk sequences via Email
  • (EST, GSS, STS)
  • FTP accounts for Sequencing Centers

Sequence
37
The International Sequence Database Collaboration
Sequence
38
The Growth of GenBank
Currently: 40.1 million records, 37.9 billion nucleotides
Average doubling time: 12 months
[Chart: sequence records (millions) and total base pairs (billions) by year, 1983 to 2004]
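The doubling-time figure on this slide can be turned into a quick projection. The sketch below is only arithmetic under the assumption that growth stays exponential with a fixed 12-month doubling time; the function names are hypothetical and not part of any NCBI tool.

```python
import math


def projected_size(current, months_ahead, doubling_time_months=12.0):
    """Project a size forward, assuming steady exponential growth."""
    return current * 2 ** (months_ahead / doubling_time_months)


def months_to_reach(current, target, doubling_time_months=12.0):
    """Months needed to grow from current to target at the same rate."""
    return doubling_time_months * math.log2(target / current)


# Figures quoted on the slide: 40.1 million records, 37.9 billion nucleotides.
records_now = 40.1e6
bases_now = 37.9e9

print(f"records after 24 months: {projected_size(records_now, 24):.3g}")
print(f"months until 100 billion bases: {months_to_reach(bases_now, 100e9):.1f}")
```

Real growth has not followed a constant doubling time, so this is a back-of-the-envelope estimate rather than a forecast.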
39
Number of Users and Hits Per Day
[Chart: number of users and hits per day by year, 1997 to 2003]
Currently averaging 10,000,000 to 35,000,000 hits per day!
40
GenBank
  • full release every two months
  • incremental and cumulative updates daily
  • available only through internet

ftp://ftp.ncbi.nih.gov/genbank/
Sequence
41
Organization of GenBank: GenBank Divisions (gbdiv)
  • Records are divided into 17 Divisions.
  • 1 Patent (11 files)
  • 5 High Throughput
  • 11 Traditional

EST (288) Expressed Sequence Tag
GSS (98) Genome Survey Sequence
HTG (61) High Throughput Genomic
STS (3) Sequence Tagged Site
HTC (3) High Throughput cDNA

PRI (27) Primate
PLN (10) Plant and Fungal
BCT (8) Bacterial and Archeal
INV (6) Invertebrate
ROD (11) Rodent
VRL (3) Viral
VRT (4) Other Vertebrate
MAM (1) Mammalian (ex. ROD and PRI)
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
  • Traditional Divisions
  • Direct Submissions
  • (Sequin and BankIt)
  • Accurate
  • Well characterized
  • BULK Divisions
  • Batch Submission
  • (Email and FTP)
  • Inaccurate
  • Poorly characterized

Sequence
42
File Formats of the Sequence Databases
Each sequence is represented by a text record
called a flat file.
  • GenBank/GenPept (useful for scientists)
  • FASTA (the simplest format)
  • ASN.1 and XML (useful for programmers)

Sequence
43
A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Definition Title
References
NCBI's Taxonomy
44
Lower down in the GenBank Record
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Feature Table
GenPept Protein ID
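The CDS feature in this record spans bases 258 to 3302. As a small worked example, the sketch below computes the length of such a simple location. It assumes the GenBank convention that coordinates are 1-based and inclusive, and it handles only the plain start..end form (not joins or complements); the function name is hypothetical.

```python
import re


def simple_location_length(location):
    """Length in bases of a plain 'start..end' feature location."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"not a simple start..end location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        raise ValueError("end must not be smaller than start")
    # Both endpoints are part of the feature, hence the + 1.
    return end - start + 1


cds_bases = simple_location_length("258..3302")
cds_codons = cds_bases // 3  # for a complete CDS this count includes the stop codon
print(cds_bases, cds_codons)  # 3045 1015
```

Because the definition line says "complete cds", one of those codons would be the stop codon, so the full protein would be one residue shorter than the codon count.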
45
FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
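Since FASTA is described as the simplest format, a short parser makes the layout concrete. This is a minimal sketch assuming the usual convention of a ">" header line followed by one or more sequence lines; the function name and the toy record are hypothetical, and production code would more likely use an established library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>toy_record_1 hypothetical example
MSAIQAAWPSGTECIAKYNF
HGTAEQDLPF
>toy_record_2
ACGT
"""
for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```

The header is kept as one string; splitting it further (for example on "|") depends on the identifier style used by the source database.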
46
GenBank GenPept Files
47
Bulk Divisions
  • Batch Submission and htg (email and ftp)
  • Inaccurate
  • Poorly Characterized
  • Expressed Sequence Tag
  • 1st pass single read cDNA
  • Genome Survey Sequence
  • 1st pass single read gDNA
  • High Throughput Genomic
  • incomplete sequences of genomic clones
  • Sequence Tagged Site
  • PCR-based mapping reagents

48
Types of Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP, GEO
  • Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI
    Protein, Structure, Conserved Domain

49
EST Division: Expressed Sequence Tags
gbdiv_est[Properties]
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAG
CAGAGAATGGAAAGTCAA TTCCTGAATTGCTATGTGTCTGGGTTTCATC
CATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA GAATTGAAAAAG
TGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTG
TACTAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTT
GAACCATGTNGACTTTGTCACAGNC AAGTTNAGTTTAAGTGGGNATCGA
GACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN TTGGA
TTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATA
TGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
  • - isolate unique clones
  • sequence once
  • from each end

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTT
CATTCATTATAACAAATTT AATAATCCTGTCAATNATATTTCTAAATTT
TCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT CTTATGCACGC
TTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCT
GACCAAG GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANG
TTGCCAGCCCTC
RNA gene products
50
Sea Urchin ESTs in Entrez
51
UniGene: Sets of expressed sequences clustered
by BLAST similarity
  • Summary pages of curated information
  • about expressed gene transcripts.

Sequence Expression
52
A Cluster of ESTs: Arabidopsis serine protease
query
5 EST hits
3 EST hits
53
UniGene Collections as of February 2004
Sequence Expression
54
Genome Sequencing - HTG, GSS,(WGS)
Whole BAC insert (or genome)
shredding
sequencing
cloning isolating
GSS division or trace archive
whole genome shotgun assemblies (traditional
division)
assembly
Draft Sequence (HTG division)
55
HTG Division: Honeybee Draft Sequences
  • Unfinished sequences of BACs
  • Gaps and unordered pieces
  • Finished sequences move to traditional GenBank
    division

56
Maize Genome Survey Sequences
  • Surveys of BAC Libraries
  • BAC end sequences
  • More than 100K per project

57
Other Genome Sequencing Products
  • Trace Archive
  • Whole Genome Shotgun

58
Trace Archive
  • Primary reads from WGS and EST projects
  • Many not available in GenBank
  • Earliest access to genome data

59
Trace Archive Page
60
Short-tailed opossum traces
61
Whole Genome Shotgun Projects
  • Traditional GenBank Divisions
  • 120 projects
  • 1 Virus
  • 78 Bacteria
  • 5 Archaea
  • 37 Eukaryotes featuring
  • Rat, Mouse, Dog, Chimpanzee, Human
  • Honeybee, Anopheles, Fruit Flies (2)
  • Nematode (C. briggsae)
  • Yeasts (8), Aspergillus (2)
  • Rice

wgs_master[Properties]
62
Countries of Origin
63
Submitted by Experimentalists
Curated by NCBI
GDS: Grouping of experiments
GSE: Grouping of slide/chip data (a single experiment)
GSM: Raw/processed spot intensities from a single slide/chip
Entrez GEO Datasets
Entrez GEO
64
  • Submit and update data
  • Query the database
  • gene identifiers
  • field information
  • sequence
  • Browse datasets
  • Download data

65
(No Transcript)
66
mRNAs
RELEASE 4 IS NOW AVAILABLE ON THE FTP SITE!
  • Forming the best representative sequence
  • Standardizing nomenclature and record structure
  • Adding annotation (references, sequence features)

Genomes
Proteins
Sequence Genome
67
RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Scanning....
Curated Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Protein (NP)
Sequence Genome
68
Curated RefSeq Records
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
            The reference sequence was derived from X66503.1.
            Summary: Adenylosuccinate synthetase catalyzes the first
            committed step in the conversion of IMP to AMP.
X records: Genome Annotation, Inferred or Predicted
vs.
N records: Provisional, Reviewed or Validated
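The accession prefixes on the two RefSeq slides lend themselves to a small lookup. The sketch below only encodes the groupings shown on the slides (NC, NT, NW for genomic DNA; NM and NR for curated RNA; NP for protein; XM, XR, XP for model records). The description strings, function names, and the example accessions are illustrative assumptions, not NCBI terminology or real records.

```python
# Prefix groupings follow the slides; the description wording is illustrative.
REFSEQ_PREFIXES = {
    "NC": "curated genomic DNA",
    "NT": "curated genomic DNA",
    "NW": "curated genomic DNA",
    "NM": "curated mRNA",
    "NR": "curated RNA",  # the slide lists NR alongside NM
    "NP": "curated protein",
    "XM": "model mRNA",
    "XR": "model RNA",  # the slide lists XR alongside XM
    "XP": "model protein",
}


def describe_refseq(accession):
    """Return a description for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


def is_model_record(accession):
    """True for the 'X' records that come from genome annotation."""
    return accession.upper().startswith(("XM_", "XR_", "XP_"))


# Hypothetical accessions, used only to exercise the lookup.
for acc in ("NM_000000.0", "XP_000000.0", "AF062069"):
    print(acc, describe_refseq(acc), is_model_record(acc))
```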
Sequence
69
Intermission
  • To come in Part 2
  • Protein Databases
  • Genomes and Genomic Resources
  • Searching Sequences with Entrez and BLAST
  • Educational Resources

70
Protein Sequences and Structures
71
Linking Protein Sequence, Structure and Function
CDD: Conserved functional domains in
proteins (Conserved Domain Database)

72
  • GenPept: a derivative database, i.e., there are no
    directly sequenced proteins.
  • Translations of nucleic acid sequences provided
    by submitters
  • SWISS-PROT, PDB, PIR, DDBJ: curation provided by
    these databases.

73
How Many Protein Records?
74
(No Transcript)
75
(No Transcript)
76
Patatin: an abundant tuber protein
77
Protein Links to Related Databases
Precomputed BLAST searches
Conserved functional domains
78
BLink: Precomputed BLAST Searches
79
BLink: Best Hits
80
CDD: Conserved Domain Database
81
(No Transcript)
82
(No Transcript)
83
Sequence-based Neighbors: Domain Relatives
  • Modular Architecture of Domains
  • Cartoon descriptions of protein domain
    organization on the primary
    sequence
  • Allows for comparison with other proteins with
    the same Domain

Conserved Domain Architecture Retrieval Tool
(CDART)
84
Entrez Structure: Molecular Modeling Database
  • Derived from experimentally determined PDB
    records
  • Data is added to PDB records including
  • Addition of explicit chemical bonding information
  • Validation and indexing of sequence
  • Inclusion of Taxonomy, Citation, and other
    information
  • Conversion to ASN.1 data description language
  • Searching the Structure Databases
  • Keyword search by Entrez
  • Sequence search by BLAST or BLink
  • Domain search by RPS-BLAST (CDD Search)
  • Structure search by VAST

Structure
85
Structure Summary Page
to get the Cn3D viewer
Get Cn3D 4.1
Sequence-based Neighbors Conserved Domains
(CDD/RPS-BLAST)
86
Complex Genomes
  • Sequences are provided complete or we help
    assemble
  • Heavy annotation: genes, transcript regions,
    ORFs, sequence variations, markers, clones,
    ESTs, etc.
  • The annotation can be shown graphically and
    linked to other databases using the MapViewer

87
(No Transcript)
88
Higher Genome MapViews
adss
build 34
build 34
89
Examples of Maps and Mapped Data
Sequence maps: Ab initio (model), Assembly, BES_Clone, Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, TCAG_RNA, Transcript (RNA), UniGene, EST, Variation
Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
Genetic maps: deCODE, Genethon, Marshfield
RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
90
Our annotation of the Gene
Gnomon prediction of Genes
Clones used in contig assembly
RefSeq mRNAs based on GB records
91
Mapviewer Arabidopsis patatin gene
92
Other Databases and Genomic Mapping
  • SNP Database (Variation Map):
    Single Nucleotide Polymorphisms, et al.
  • Identified differences in sequences found within
    gene loci (L), transcript (T) or coding (C)
    regions
  • STS Division and UniSTS (STS Map):
    Sequence Tagged Sites
  • Physically mapped segments of genes, ESTs, mRNAs
    or genomic DNAs with known position
  • PCR with STS primers gives one product per genome.
    Related resource: Electronic PCR
  • EST Division and UniGene (UniGene Map)
  • Histogram of expressed regions

93
Mapping Data on the Genome
94
Searching the NCBI Databases
95
How to Query a Particular Database
term1 term2
(term1[tag] op term2[tag] op ...)
op = AND, OR, NOT
  • Boolean operators MUST be in ALL CAPS!

[tag] = Entrez indexing field, given in square brackets
Organism Journal User compounds Author
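The query pattern on this slide (terms, optional field tags in square brackets, and upper-case Boolean operators) can be assembled programmatically. The following is a minimal sketch that only builds the query string; the helper names `term` and `combine` are hypothetical, and nothing here contacts NCBI.

```python
ALLOWED_OPERATORS = ("AND", "OR", "NOT")


def term(value, field=None):
    """Format one search term, optionally restricted to an indexing field."""
    return f"{value}[{field}]" if field else str(value)


def combine(operator, *terms):
    """Join terms with an upper-case Boolean operator and parenthesize."""
    operator = operator.upper()  # Entrez expects AND, OR, NOT in capitals
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    term("human", "Organism"),
    combine("OR", term("kinase"), term("phosphatase")),
)
print(query)  # (human[Organism] AND (kinase OR phosphatase))
```

Parenthesizing every combination makes the grouping explicit instead of relying on the search engine's own evaluation order.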
96
Sample Query
Brauninger a c-src kinase
Organism Journal User compounds Author
97
Using Fields to Find Records
Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
  • Most useful search field: Organism
  • human[orgn] or bacteria[orgn]
  • Useful search terms in Properties field
  • srcdb: source database ( srcdb_genbank[prop] )
  • gbdiv: GenBank division ( gbdiv_est[prop] )
  • biomol: biomolecular type ( biomol_mrna[prop] )

98
Complex searches you can do with Preview/Index
Terms used (and indexed) in Entrez fields can be
searched to gain useful information!
How many rat Unigene clusters contain at least
one mRNA?
  • Select the UniGene database.
  • Find all the rat records.
  • Find those that have at least 1 mRNA (not 0).

NOT
rat[organism]
99
Complex Queries with Preview/Index
NOT 0[mRNA Count]
100
Other Advanced Queries
Nucleotide: Non-genomic sequences from the PLN division of GenBank
gbdiv_pln[properties] NOT biomol_genomic[properties]
Protein: RefSeq sequences with molecular weights of 80 to 100 kDa
srcdb_refseq[properties] AND 080000:100000[Molecular Weight]
SNP: True SNPs that are uniquely mapped on the mouse genome
snp[SNP Class] AND 1[Map Weight] AND mouse[organism]
UniSTS: Markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: Structures of bacterial kinases with resolutions below 2 Å
Bacteria[organism] AND kinase AND 000.00:002.00[resolution]
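The molecular-weight query on this slide uses zero-padded bounds (080000 and 100000) so that both ends of the range have the same number of digits. A minimal sketch of building such a range term follows; the function name is hypothetical, the low:high[Field] range form is assumed from Entrez range-search syntax, and the padding width of 6 is simply taken from the slide's example.

```python
def padded_range_term(low, high, field, width):
    """Build a 'low:high[field]' term with both bounds zero-padded."""
    if low > high:
        raise ValueError("low must not exceed high")
    # The nested format spec pads each integer with leading zeros to `width`.
    return f"{low:0{width}d}:{high:0{width}d}[{field}]"


# 80 to 100 kDa expressed in daltons, padded to six digits as on the slide.
print(padded_range_term(80000, 100000, "Molecular Weight", 6))
```

Padding matters because such fields appear to be compared as text rather than as numbers, so "80000" would otherwise sort after "100000".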
101
Searching the NCBI Databases
102
http://www.ncbi.nlm.nih.gov/blast
103
Why do we need sequence similarity searching?
Searching with Sequences
  • To identify and annotate sequences with
  • incomplete (or no) annotations (GenBank)
  • incorrect annotations
  • To assemble genomes
  • To explore evolutionary relationships by
  • finding homologous molecules
  • developing phylogenetic trees
  • NOTE: Similar sequences may NOT have similar
    function!

104
Basic Local Alignment Search Tool
BLAST
  • Calculates similarity for biological sequences
  • Finds best local alignments
  • A Heuristic approach based on the
    Smith-Waterman algorithm
  • Searches for matching words rather than
    individual residues
  • Uses statistical theory to determine if a match
    might have occurred by chance
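For reference, the exhaustive method that BLAST approximates can be written in a few lines. The sketch below computes only the best Smith-Waterman local alignment score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for illustration; protein searches use substitution matrices and affine gap costs instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero is what makes the alignment local:
            # a bad prefix is dropped instead of dragging the score down.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6, from the shared "ACG"
```

Every cell of the table is filled, so the cost grows with the product of the two lengths. BLAST avoids that cost by first looking for short word matches and extending only from those.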

105
Local vs. Global Alignment
BLASTp protein-protein comparison (a local alignment protocol)
Human  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH  84
Worm   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

Human  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

Human 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

Human 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

Human 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

Human 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

Human 424 LDAAMRPSFLQLREQLEHI 443
Worm  472 SDPDKRPTFETLQWKLEDL 492

Align program (Lipman and Pearson) (a global alignment protocol)
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
      1                 20                  40                  60

       440               450
human REQLEHI--------KTHELHL
worm  QWKLEDLFNLDSSEYKEASINF
                  500
106
Nucleotide Words
GTACTGGACAT TACTGGACATG ACTGGACATGG
CTGGACATGGA TGGACATGGAC GGACATGGACC
GACATGGACCC ACATGGACCCT
...........
Minimum word size: 7; blastn default: 11; megablast default: 28
Make a lookup table of words
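The lookup table mentioned here can be sketched as a dictionary from each word to the positions where it starts. This is a simplified illustration of the indexing idea, not BLAST's actual data structure; the function name is hypothetical, and the 18-base query is reconstructed from the eight overlapping 11-mers listed on the slide.

```python
from collections import defaultdict


def build_word_table(sequence, word_size=11):
    """Map every overlapping word of length word_size to its start positions."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        table[sequence[start:start + word_size]].append(start)
    return dict(table)


# Query reconstructed from the overlapping words shown on the slide.
query = "GTACTGGACATGGACCCT"
table = build_word_table(query, word_size=11)
for word, positions in table.items():
    print(word, positions)
```

With the table in hand, a database sequence can be scanned word by word, and each dictionary hit gives a candidate position from which to start extending an alignment.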
107
An alignment that BLAST can't find
1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
No words longer than 6 (exact matches) ...for
nucleotides there must be at least 7.
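The slide's claim is about runs of consecutive identical bases in an ungapped alignment. A minimal sketch for measuring the longest such run is shown below, using the first 60-base row of the alignment as input. The function name is hypothetical, and the check covers only the alignment as drawn; a real word search would also find words shared at different offsets.

```python
def longest_identical_run(a, b):
    """Longest run of identical characters at the same positions of a and b."""
    best = run = 0
    for x, y in zip(a, b):  # compares position by position, no gaps
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


row1_top = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
row1_bottom = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"

run = longest_identical_run(row1_top, row1_bottom)
print(run, "=> seed possible" if run >= 7 else "=> no 7-base seed in this row")
```

The slide states that no exact match in this alignment is longer than 6 bases, which is below the minimum nucleotide word size of 7, so no seed is ever generated.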
108
Protein Words
GTQ TQI QIT ITV TVE VED
EDL DLF ...
Word Size can be 2 or 3 (default 3)
Make a lookup table of words
109
Minimum Requirements for a Hit
ATCGCCATGCTTAATTGGGCTT CATGCTTAATT
exact word match
one match
  • Nucleotide BLAST requires one exact match
  • Protein BLAST requires two neighboring matches
    within 40 residues

GTQITVEDLFYNI SEI YYN
neighborhood words
two matches
110
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
Human aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
      N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
            V        V  L  G     Q  W  G  D  E  G
      S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
A.th. agtcaagtatctggtgtactcggttgccaatggggagatgaaggt

BLASTn finds no match, because there are no 7 bp
words
BLASTp finds three matching words
Protein searches are generally more sensitive
than nucleotide searches.
111
Some WWW-BLAST Databases
Nucleotide
  • nr (nt)
  • Traditional gb divisions
  • NM_ and XM_ RefSeqs
  • dbest
  • EST Division
  • htgs
  • HTG division
  • gss
  • GSS division
  • chromosome
  • NC_ RefSeqs

Protein
  • nr (non-redundant sequences)
  • GenBank CDS translations
  • NP_ RefSeqs
  • PIR, Swiss-Prot, PRF
  • PDB (sequences from structures)
  • swissprot
  • pat - patents
  • pdb - sequences with 3D structures
  • month - sequences updated within the
    past 30 days

112
Local Alignment Statistics
High scores of local alignments between two
random sequences follow the Extreme Value
Distribution.
Expect Value (E): number of database hits you
expect to find by chance
[Plot: number of alignments vs. score; labels: size of database, your score, expected number of random hits]
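The relationship between score, database size, and the Expect value is usually written with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, and S the raw score. The sketch below evaluates it directly. The lambda and K values are placeholders chosen for illustration; the real values depend on the scoring matrix and gap costs and are reported in BLAST output.

```python
import math


def expect_value(raw_score, query_length, database_length, lam, k):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


# Placeholder statistical parameters, for illustration only.
LAMBDA = 0.3
K = 0.1

for score in (50, 100, 150):
    e = expect_value(score, query_length=450, database_length=1e9, lam=LAMBDA, k=K)
    print(f"score {score}: E = {e:.3g}")
```

Two properties follow directly from the formula: E falls exponentially as the score rises, and it scales linearly with database size, so the same alignment becomes less significant as the database grows.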
113
Protein BLAST Page
Accession, GI, or sequence
Choose your database
114
BLAST Formatting Page
Link to CDD
115
BLAST Results Page
116
BLAST Results Page
117
Genomic BLAST
  • The BLAST homepage links to the Genome BLAST
    pages, which provide customized nucleotide and
    protein databases for each genome.
  • If a Map Viewer is available, the BLAST hits can
    be viewed on the maps.

118
Locate an A. thaliana Gene with BLAST
AB004798
A. thaliana mRNA
119
Hits to A. thaliana Clones
120
Related Sequences: Precomputed BLASTn and BLASTp
Lists
Nucleotide
Protein
The Related Sequences Entrez link retrieves a list
of sequences sorted by BLAST score, but with no
alignment details.
121
Searching the NCBI Databases
122
Structure-based Neighbors: Vector Alignment
Search Tool (VAST)
For each protein chain, locate secondary structure elements,
represent them as individual vectors,
and compare these with precomputed vectors of
database structures.
[Figure: Human IL-4, with elements numbered 1 to 6]
123
NCBI Educational Resources
  • Tutorials
  • Practice exercises
  • About NCBI
  • Bookshelf
  • NCBI Handbook

124
(No Transcript)
125
(No Transcript)
126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
Other educational Sites
  • Geospiza
  • University of Wisconsin BioTrek
  • USDA National Agricultural Library
  • SWBIC Southwest Biotechnology and Information
    Center
  • Biologica
  • Howard Hughes Medical Institute

132
(No Transcript)
133
(No Transcript)
134
University of Wisconsin BIOTREK
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
For More Information
E-mail addresses
  • General Help: info@ncbi.nlm.nih.gov
  • BLAST Help: blast-help@ncbi.nlm.nih.gov

The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page under
Hot Spots
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html