Molecular Biology Resources at NCBI

Slides: 139

Provided by: stevep49

Transcript and Presenter's Notes


1
Molecular Biology Resources at NCBI

Or
NCBI: It's a BLAST!

June 22, 2004
Washington State University - Tri-Cities
2
NCBI Resources

About NCBI
NCBI Sequence Databases
Other NCBI Databases
Entrez Databases and Text Searching
Genomic Resources
BLAST Services
Educational Tools

3
The National Center for Biotechnology Information

Created in 1988 as a part of the
National Library of Medicine at NIH
Establish public databases
Research in computational biology
Develop software tools for sequence analysis
Disseminate biomedical information

4
Web Access: http://www.ncbi.nlm.nih.gov
5
http://www.ncbi.nlm.nih.gov/About/index.html
6
NCBI At A Glance
7
A Science Primer
8
(No Transcript)
9
What is a Cell?
10
(No Transcript)
11
The Global Entrez Search Engine
12
(No Transcript)
13
The (ever expanding) Entrez System
Literature
Organism
Expression
14
Examples of Database Integration in Entrez
Word weight
Phylogeny
VAST
Protein sequences
BLASTn
BLASTp
15
(No Transcript)
16
Entrez Nucleotides

Primary
GenBank / EMBL / DDBJ 29,800,460
Derivative
RefSeq 304,828
Third Party Annotation 4,266
PDB 5,062
Total 30,114,616

17
Entrez Protein

GenPept (GB,EMBL, DDBJ) 3,333,837
RefSeq 1,011,056
Third Party Annotation 4,685
Swiss Prot 154,397
PIR 282,821
PRF 12,079
Total 4,798,875
BLAST nr 1,859,644

18
(No Transcript)
19
Other integrated NCBI websites

Taxonomy Browser
Database links sorted by lineage
Gene
Database links based on genetic loci

20
potato
21
(No Transcript)
22
(No Transcript)
23
(No Transcript)
24
Gene
Links to Everywhere (almost)
25
NCBI Databases and Tools
26
(No Transcript)
27
(No Transcript)
28
Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support
A part of the NCBI Bookshelf
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
Types of Databases

Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
Built from primary data
Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

35
Primary vs. Derivative Sequence Databases
Labs
Sequencing Centers
Updated continually by NCBI
Updated ONLY by submitters
36
1º Sequence Database
GenBank

Nucleotide only sequence database
Archival in nature
Submission of GenBank Data to NCBI
Direct submissions of individual records via Web
(BankIt, Sequin)
Batch submissions of bulk sequences via Email
(EST, GSS, STS)
FTP accounts for Sequencing Centers

Sequence
37
The International Sequence Database Collaboration
Sequence
38
The Growth of GenBank
Currently: 40.1 million records, 37.9 billion nucleotides
Average doubling time: 12 months
Chart: Sequence Records (millions) and Total Base Pairs (billions) by year, 1983 to 2004
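The doubling-time figure on this slide can be turned into a rough projection. The sketch below assumes simple exponential growth at a constant 12-month doubling time; the function name and the choice of projection horizons are illustrative, not part of the presentation.

```python
def projected_size(current, months_ahead, doubling_months=12.0):
    """Project a quantity forward assuming a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


# Starting point taken from the slide: 37.9 billion nucleotides.
current_billion_nt = 37.9
for months in (12, 24, 36):
    size = projected_size(current_billion_nt, months)
    print(f"after {months} months: about {size:.1f} billion nucleotides")
```

Real growth is not perfectly exponential, so this is only a back-of-the-envelope estimate.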
39
Number of Users and Hits Per Day
Chart: users and hits per day by year, 1997 to 2003
Currently averaging 10,000,000 to 35,000,000 hits per day!
40
GenBank

full release every two months
incremental and cumulative updates daily
available only through internet

ftp://ftp.ncbi.nih.gov/genbank/
Sequence
41
Organization of GenBank: GenBank Divisions (gbdiv)

Records are divided into 17 Divisions.
1 Patent (11 files)
5 High Throughput
11 Traditional

EST (288) Expressed Sequence Tag
GSS (98) Genome Survey Sequence
HTG (61) High Throughput Genomic
STS (3) Sequence Tagged Site
HTC (3) High Throughput cDNA

PRI (27) Primate
PLN (10) Plant and Fungal
BCT (8) Bacterial and Archaeal
INV (6) Invertebrate
ROD (11) Rodent
VRL (3) Viral
VRT (4) Other Vertebrate
MAM (1) Mammalian (ex. ROD and PRI)
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated

Traditional Divisions
Direct Submissions
(Sequin and BankIt)
Accurate
Well characterized

BULK Divisions
Batch Submission
(Email and FTP)
Inaccurate
Poorly characterized

Sequence
42
File Formats of the Sequence Databases
Each sequence is represented by a text record
called a flat file.

GenBank/GenPept (useful for scientists)
FASTA (the simplest format)
ASN.1 and XML (useful for programmers)

Sequence
43
A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Definition Title
References
NCBI's Taxonomy
44
Lower down in the GenBank Record
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Feature Table
GenPept Protein ID
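The CDS feature location on this slide (258..3302) can be checked with a little arithmetic. The sketch below handles only simple "start..end" locations (no joins, complements, or partial markers); the function names are hypothetical helpers, not part of any NCBI tool.

```python
def parse_simple_location(location):
    """Parse a simple 'start..end' feature location into 1-based integers."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start after end in location {location!r}")
    return start, end


def cds_summary(location):
    """Return length in bases, number of codons, and whether the span is in frame."""
    start, end = parse_simple_location(location)
    length = end - start + 1          # both coordinates are inclusive
    codons, remainder = divmod(length, 3)
    return {"length": length, "codons": codons, "in_frame": remainder == 0}


summary = cds_summary("258..3302")
print(summary)
```

For this record the span is a whole number of codons; the final codon of a complete CDS is the stop codon, so the encoded protein is one residue shorter than the codon count.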
45
FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
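Because FASTA is the simplest of the formats, it is also the easiest to read programmatically. The following is a minimal sketch, assuming the pipe-delimited NCBI definition line style (gi|number|database|accession|) used for this record; the function names are illustrative, and only the first sequence line of the slide's record is included to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) tuples from FASTA-formatted text."""
    records = []
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


def split_defline(defline):
    """Split 'gi|number|db|accession| title' into identifier fields and title."""
    identifiers, _, title = defline.partition(" ")
    return identifiers.strip("|").split("|"), title


# Truncated excerpt of the record on the slide (first sequence line only).
example = """>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
"""

records = parse_fasta(example)
defline, sequence = records[0]
fields, title = split_defline(defline)
print(fields, title, len(sequence))
```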
46
GenBank GenPept Files
47
Bulk Divisions

Batch Submission and htg (email and ftp)
Inaccurate
Poorly Characterized

Expressed Sequence Tag
1st pass single read cDNA
Genome Survey Sequence
1st pass single read gDNA
High Throughput Genomic
incomplete sequences of genomic clones
Sequence Tagged Site
PCR-based mapping reagents

48
Types of Databases

Primary Databases
Original submissions by experimentalists
Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
Built from primary data
Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain

49
EST Division: Expressed Sequence Tags
gbdiv_est[Properties]
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAG
CAGAGAATGGAAAGTCAA TTCCTGAATTGCTATGTGTCTGGGTTTCATC
CATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA GAATTGAAAAAG
TGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTG
TACTAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTT
GAACCATGTNGACTTTGTCACAGNC AAGTTNAGTTTAAGTGGGNATCGA
GACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN TTGGA
TTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATA
TGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg

- isolate unique clones
sequence once
from each end

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTT
CATTCATTATAACAAATTT AATAATCCTGTCAATNATATTTCTAAATTT
TCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT CTTATGCACGC
TTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCT
GACCAAG GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANG
TTGCCAGCCCTC
RNA gene products
50
Sea Urchin ESTs in Entrez
51
UniGene: Sets of expressed sequences clustered by BLAST similarity

Summary pages of curated information
about expressed gene transcripts.

Sequence Expression
52
A Cluster of ESTs: Arabidopsis serine protease
query
5' EST hits
3' EST hits
53
UniGene Collections as of February 2004
Sequence Expression
54
Genome Sequencing - HTG, GSS, (WGS)
Whole BAC insert (or genome)
shredding
sequencing
cloning isolating
GSS division or trace archive
whole genome shotgun assemblies (traditional
division)
assembly
Draft Sequence (HTG division)
55
HTG Division Honeybee Draft Sequences

Unfinished sequences of BACs
Gaps and unordered pieces
Finished sequences move to traditional GenBank
division

56
Maize Genome Survey Sequences

Surveys of BAC Libraries
BAC end sequences
More than 100K per project

57
Other Genome Sequencing Products

Trace Archive
Whole Genome Shotgun

58
Trace Archive

Primary reads from WGS and EST projects
Many not available in GenBank
Earliest access to genome data

59
Trace Archive Page
60
Short-tailed opossum traces
61
Whole Genome Shotgun Projects

Traditional GenBank Divisions
120 projects
1 Virus
78 Bacteria
5 Archaea
37 Eukaryotes featuring
Rat, Mouse, Dog, Chimpanzee, Human
Honeybee, Anopheles, Fruit Flies (2)
Nematode (C. briggsae)
Yeasts (8), Aspergillus (2)
Rice

wgs_master[Properties]
62
Countries of Origin
63
Submitted by Experimentalists
Curated by NCBI
GDS: Grouping of experiments
GSE: Grouping of slide/chip data (a single experiment)
GSM: Raw/processed spot intensities from a single slide/chip
Entrez GEO Datasets
Entrez GEO
64

Submit and update data
Query the database
gene identifiers
field information
sequence
Browse datasets
Download data

65
(No Transcript)
66
mRNAs
RELEASE 4 IS NOW AVAILABLE ON THE FTP SITE!

Forming the best representative sequence
Standardizing nomenclature and record structure
Adding annotation (references, sequence features)

Genomes
Proteins
Sequence Genome
67
RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Scanning....
Curated Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Protein (NP)
Sequence Genome
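The two-letter prefixes on this slide make it possible to tell what kind of RefSeq record an accession refers to. A minimal sketch of that lookup follows; the category wording paraphrases the slide, the function name is hypothetical, and the RefSeq-style accession strings in the usage lines are made-up placeholders rather than real records.

```python
# Prefix meanings as grouped on the slide.
REFSEQ_PREFIXES = {
    "NC": "genomic DNA",
    "NT": "genomic DNA",
    "NW": "genomic DNA",
    "NM": "curated mRNA",
    "NR": "curated RNA",
    "NP": "protein",
    "XM": "model mRNA",
    "XR": "model RNA",
    "XP": "model protein",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if not RefSeq-style."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # RefSeq accessions always contain an underscore
    return REFSEQ_PREFIXES.get(prefix.upper())


# Placeholder accessions for illustration only.
print(classify_refseq("NM_000000.0"))   # curated mRNA
print(classify_refseq("XP_000000.0"))   # model protein
print(classify_refseq("AF062069.2"))    # None: a GenBank accession, not RefSeq
```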
68
Curated RefSeq Records
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from X66503.1.
Summary: Adenylosuccinate synthetase catalyzes the first committed step in the conversion of IMP to AMP.
X records: Genome Annotation, Inferred or Predicted
vs.
N records: Provisional, Reviewed or Validated
Sequence
69
Intermission

To come in Part 2
Protein Databases
Genomes and Genomic Resources
Searching Sequences with Entrez and BLAST
Educational Resources

70
Protein Sequences and Structures
71
Linking Protein Sequence, Structure and Function
CDD Conserved functional domains in
proteins (Conserved Domain Database)

72

GenPept: a derivative database, i.e. there are no directly sequenced proteins.
Translations of nucleic acid sequences provided by submitters.
SWISS-PROT, PDB, PIR, DDBJ: curation provided by these databases.

73
How Many Protein Records?
74
(No Transcript)
75
(No Transcript)
76
Patatin: an abundant tuber protein
77
Protein Links to Related Databases
Precomputed BLAST searches
Conserved functional domains
78
BLink: precomputed BLAST Searches
79
BLink Best Hits
80
CDD: Conserved Domain Database
81
(No Transcript)
82
(No Transcript)
83
Sequence-based Neighbors: Domain Relatives

Modular Architecture of Domains
Cartoon descriptions of protein domain
organization on the primary
sequence
Allows for comparison with other proteins with
the same Domain

Conserved Domain Architecture Retrieval Tool
(CDART)
84
Entrez Structure: Molecular Modeling Database

Derived from experimentally determined PDB
records
Data is added to PDB records including
Addition of explicit chemical bonding information
Validation and indexing of sequence
Inclusion of Taxonomy, Citation, and other
information
Conversion to ASN.1 data description language

Searching the Structure Databases
Keyword search by Entrez
Sequence search by BLAST or BLink
Domain search by RPS-BLAST (CDD Search)
Structure search by VAST

Structure
85
Structure Summary Page
to get the Cn3D viewer
Get Cn3D 4.1
Sequence-based Neighbors Conserved Domains
(CDD/RPS-BLAST)
86
Complex Genomes

Sequences are provided complete, or we help assemble
Heavy annotation: genes, transcript regions, ORFs, sequence variations, markers, clones, ESTs, etc.
The annotation can be shown graphically and
linked to other databases using the MapViewer

87
(No Transcript)
88
Higher Genome MapViews
adss
build 34
build 34
89
Examples of Maps and Mapped Data
Sequence maps: Ab initio (model), Assembly, BES_Clone, Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, TCAG_RNA, Transcript (RNA), UniGene, EST, Variation
Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
Genetic maps: deCODE, Genethon, Marshfield
RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
90
Our annotation of the Gene
Gnomon prediction of Genes
Clones used in contig assembly
RefSeq mRNAs based on GB records
91
Map Viewer: Arabidopsis patatin gene
92
Other Databases and Genomic Mapping

SNP Database (Variation Map)
Single Nucleotide Polymorphisms, et al.
Identified differences in sequences found within
gene loci (L), transcript (T) or coding (C)
regions
STS Division UniSTS (STS Map)
Sequence Tagged Sites
Physically mapped segments of genes, ESTs, mRNAs
or genomic DNAs with known position
PCR with STS primers gives one product per genome
Related resource Electronic PCR
EST Division UniGene (UniGene Map)
Histogram of expressed regions

93
Mapping Data on the Genome
94
Searching the NCBI Databases
95
How to Query a Particular Database
term1[tag] op term2[tag] op ...
op = AND, OR, NOT

Boolean operators MUST be in ALL CAPS!

tag = Entrez indexing field, delimited by square brackets
[Organism] [Journal] [User compounds] [Author]
96
Sample Query
Brauninger a c-src kinase
Organism Journal User compounds Author
97
Using Fields to Find Records
Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume

Most useful search field: Organism
human[orgn] or bacteria[orgn]
Useful search terms in the Properties field:
srcdb: source database (srcdb_genbank[prop])
gbdiv: GenBank division (gbdiv_est[prop])
biomol: biomolecular type (biomol_mrna[prop])

98
Complex searches you can do with Preview/Index
Terms used (and indexed) in Entrez fields can be
searched to gain useful information!
How many rat UniGene clusters contain at least one mRNA?

Select the UniGene database.
Find all the rat records.
Find those that have at least 1 mRNA (not 0).

NOT
rat[organism]
99
Complex Queries with Preview/Index
NOT 0[mRNA Count]
100
Other Advanced Queries
Nucleotide: Non-genomic sequences from the PLN division of GenBank
gbdiv_pln[properties] NOT biomol_genomic[properties]
Protein: RefSeq sequences with molecular weights of 80 to 100 kDa
srcdb_refseq[properties] AND 080000:100000[Molecular Weight]
SNP: True SNPs that are uniquely mapped on the mouse genome
Snp[SNP Class] AND 1[Map Weight] AND mouse[organism]
UniSTS: Markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: Structures of bacterial kinases with resolutions below 2 Å
Bacteria[organism] AND kinase AND 000.00:002.00[resolution]
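Fielded queries like these follow one pattern: each term is followed by its field name in square brackets, and terms are joined by upper-case Boolean operators. The helper below is a small sketch of that pattern for building query strings; the function names are hypothetical and nothing here contacts NCBI.

```python
OPERATORS = {"AND", "OR", "NOT"}


def fielded(term, field=None):
    """Attach an Entrez field tag in square brackets, e.g. human[organism]."""
    return f"{term}[{field}]" if field else term


def combine(first, *rest):
    """Join terms left to right; each item in rest is an (operator, term) pair."""
    query = first
    for operator, term in rest:
        operator = operator.upper()   # Entrez expects operators in ALL CAPS
        if operator not in OPERATORS:
            raise ValueError(f"unknown operator: {operator}")
        query = f"{query} {operator} {term}"
    return query


plant_non_genomic = combine(
    fielded("gbdiv_pln", "properties"),
    ("not", fielded("biomol_genomic", "properties")),
)
human_chr12_sts = combine(
    fielded("Genethon", "Map Name"),
    ("and", fielded("human", "organism")),
    ("and", fielded("12", "chromosome")),
)
print(plant_non_genomic)
print(human_chr12_sts)
```

The resulting strings can be pasted into the Entrez search box for the matching database.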
101
Searching the NCBI Databases
102
http://www.ncbi.nlm.nih.gov/blast
103
Why do we need sequence similarity searching?
Searching with Sequences

To identify and annotate sequences with
incomplete (or no) annotations (GenBank)
incorrect annotations
To assemble genomes
To explore evolutionary relationships by
finding homologous molecules
developing phylogenetic trees
NOTE Similar sequences may NOT have similar
function!

104
Basic Local Alignment Search Tool
BLAST

Calculates similarity for biological sequences
Finds best local alignments
A Heuristic approach based on the
Smith-Waterman algorithm
Searches for matching words rather than
individual residues
Uses statistical theory to determine if a match
might have occurred by chance

105
Local vs. Global Alignment
Figure: two alignments of the same human and worm protein sequences
Align program (Lipman and Pearson): a global alignment protocol
BLASTp protein-protein comparison: a local alignment protocol
106
Nucleotide Words
GTACTGGACAT TACTGGACATG ACTGGACATGG
CTGGACATGGA TGGACATGGAC GGACATGGACC
GACATGGACCC ACATGGACCCT
...........
Minimum word size: 7
blastn default: 11
megablast default: 28
Make a lookup table of words
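The word list on this slide is every overlapping 11-letter substring of the query. As a rough sketch of the lookup-table idea (not BLAST's actual implementation), the code below maps each word to the positions where it starts. The query string is the fragment implied by the words shown on the slide, and the function name is hypothetical.

```python
from collections import defaultdict


def build_word_table(sequence, word_size):
    """Map each overlapping word of length word_size to its start positions."""
    table = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        table[sequence[start:start + word_size]].append(start)
    return dict(table)


query = "GTACTGGACATGGACCCT"          # fragment reconstructed from the slide's words
table = build_word_table(query, 11)   # 11 is the blastn default word size

# List the words in query order.
words_in_order = sorted(table, key=lambda word: table[word][0])
for word in words_in_order:
    print(table[word][0], word)
```

The same function works for protein words by passing a protein sequence and a word size of 2 or 3.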
107
An alignment that BLAST can't find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
No words longer than 6 (exact matches); for nucleotides there must be at least 7.
108
Protein Words
GTQ TQI QIT ITV TVE VED
EDL DLF ...
Word Size can be 2 or 3 (default 3)
Make a lookup table of words
109
Minimum Requirements for a Hit
ATCGCCATGCTTAATTGGGCTT CATGCTTAATT
exact word match
one match

Nucleotide BLAST requires one exact match
Protein BLAST requires two neighboring matches
within 40 residues

GTQITVEDLFYNI SEI YYN
neighborhood words
two matches
110
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana

       aac cgg gtg acg gtg gtg ctc ggt gcg cag tgg ggc gac gaa ggc
Human   N   R   V   T   V   V   L   G   A   Q   W   G   D   E   G
                V           V   L   G       Q   W   G   D   E   G
A.th.   S   Q   V   S   G   V   L   G   C   Q   W   G   D   E   G
       agt caa gta tct ggt gta ctc ggt tgc caa tgg gga gat gaa ggt

BLASTn finds no match, because there are no 7 bp words
BLASTp finds three matching words
Protein searches are generally more sensitive than nucleotide searches.
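The comparison on this slide can be reproduced with the sequences shown. The sketch below looks for the longest exact word shared by the two nucleotide fragments and lists the identical 3-letter words shared by the two protein fragments. It is a simplification: real BLASTp also scores non-identical "neighborhood" words, so the count of identical words printed here need not match the number quoted on the slide. The function names are hypothetical.

```python
def words(sequence, size):
    """Return the set of all overlapping words of the given size."""
    return {sequence[i:i + size] for i in range(len(sequence) - size + 1)}


def longest_shared_word(seq_a, seq_b):
    """Length of the longest exact substring present in both sequences."""
    best = 0
    for size in range(1, min(len(seq_a), len(seq_b)) + 1):
        if words(seq_a, size) & words(seq_b, size):
            best = size
        else:
            break  # no shared word of this size means none of any larger size
    return best


# ADSS fragments as shown on the slide.
human_nt = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ath_nt = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_aa = "NRVTVVLGAQWGDEG"
ath_aa = "SQVSGVLGCQWGDEG"

longest_nt = longest_shared_word(human_nt, ath_nt)
shared_protein_words = sorted(words(human_aa, 3) & words(ath_aa, 3))

print("longest shared nucleotide word:", longest_nt)
print("nucleotide word of size 7 possible:", longest_nt >= 7)
print("identical protein words of size 3:", shared_protein_words)
```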
111
Some WWW-BLAST Databases
Nucleotide
Protein

nr (nt)
Traditional gb divisions
NM_ and XM_ RefSeqs
dbest
EST Division
htgs
HTG division
gss
GSS division
chromosome
NC_ RefSeqs

nr (non-redundant sequences)
GenBank CDS translations
NP_ RefSeqs
PIR, Swiss-Prot, PRF
PDB (sequences from structures)
swissprot
pat - patents
pdb - sequences with 3D structures
month - sequences updated within the
past 30 days

112
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect Value (E): number of database hits you expect to find by chance
size of database
your score
Alignments
expected number of random hits
Score
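The relationship sketched on this slide (E depends on database size and on your score) is usually written as the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, and S the raw alignment score. The code below is a minimal illustration; the default lambda and K are commonly quoted values for ungapped BLOSUM62 scoring and are used here only as assumptions, and real BLAST also applies length corrections that this sketch ignores.

```python
import math


def expect_value(score, query_length, database_length, lam=0.318, k=0.134):
    """Expected number of chance hits scoring at least `score`.

    lam and k are illustrative defaults (assumed, not taken from the slides).
    """
    return k * query_length * database_length * math.exp(-lam * score)


# Hypothetical search: a 450-residue query against two database sizes.
for database_length in (1_000_000, 2_000_000):
    for score in (40, 60, 80):
        e = expect_value(score, 450, database_length)
        print(f"db={database_length:>9,} score={score} E={e:.3g}")
```

Two properties are worth noticing: doubling the database size doubles E for the same score, and each added score unit shrinks E by a constant factor.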
113
Protein BLAST Page
Accession, GI, or sequence
Choose your database
114
BLAST Formatting Page
Link to CDD
115
BLAST Results Page
116
BLAST Results Page
117
Genomic BLAST

The BLAST homepage links to the Genome BLAST pages, which provide customized nucleotide and protein databases for each genome.
If a Map Viewer is available, the BLAST hits can
be viewed on the maps.

118
Locate an A. thaliana Gene with BLAST
AB004798
A. thaliana mRNA
119
Hits to A. thaliana Clones
120
Related Sequences: Precomputed BLASTn and BLASTp Lists
Nucleotide
Protein
The Related Sequences Entrez link retrieves a list of sequences sorted by BLAST score, but with no alignment details.
121
Searching the NCBI Databases
122
Structure-based Neighbors: Vector Alignment Search Tool
For each protein chain, locate secondary structure elements, represent them as individual vectors, and compare these with precomputed vectors of database structures.
Figure: Human IL-4, with secondary structure elements numbered 1 to 6
123
NCBI Educational Resources

Tutorials
Practice exercises
About NCBI
Bookshelf
NCBI Handbook

124
(No Transcript)
125
(No Transcript)
126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
Other Educational Sites

Geospiza
University of Wisconsin BioTrek
USDA National Agricultural Library
SWBIC: Southwest Biotechnology and Information Center
Biologica
Howard Hughes Medical Institute

132
(No Transcript)
133
(No Transcript)
134
University of Wisconsin BIOTREK
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
For More Information
E-mail addresses

General Help: info@ncbi.nlm.nih.gov
BLAST Help: blast-help@ncbi.nlm.nih.gov

The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page under Hot Spots
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html
