Title: Molecular Biology Resources at NCBI
1Molecular Biology Resources at NCBI
June 22, 2004
Washington State University - Tri-Cities
2NCBI Resources
- About NCBI
- NCBI Sequence Databases
- Other NCBI Databases
- Entrez Databases and Text Searching
- Genomic Resources
- BLAST Services
- Educational Tools
3The National Center for Biotechnology Information
- Created in 1988 as a part of the National Library of Medicine at NIH
- Establish public databases
- Research in computational biology
- Develop software tools for sequence analysis
- Disseminate biomedical information
4Web Access: http://www.ncbi.nlm.nih.gov
5http://www.ncbi.nlm.nih.gov/About/index.html
6NCBI At A Glance
7A Science Primer
9What is a Cell?
11The Global Entrez Search Engine
13The (ever expanding) Entrez System
Literature
Organism
Expression
14Examples of Database Integration in Entrez
Word weight
Phylogeny
VAST
Protein sequences
BLASTn
BLASTp
16Entrez Nucleotides
- Primary
- GenBank / EMBL / DDBJ: 29,800,460
- Derivative
- RefSeq: 304,828
- Third Party Annotation: 4,266
- PDB: 5,062
- Total: 30,114,616
17Entrez Protein
- GenPept (GB, EMBL, DDBJ): 3,333,837
- RefSeq: 1,011,056
- Third Party Annotation: 4,685
- Swiss-Prot: 154,397
- PIR: 282,821
- PRF: 12,079
- Total: 4,798,875
- BLAST nr: 1,859,644
19Other integrated NCBI websites
- Taxonomy Browser
- Database links sorted by lineage
- Gene
- Database links based on genetic loci
20potato
24Gene
- Links to Everywhere (almost)
25NCBI Databases and Tools
28Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support
A part of the NCBI Bookshelf
34Types of Databases
- Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP, GEO
- Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
35Primary vs. Derivative Sequence Databases
Labs
Sequencing Centers
Updated continually by NCBI
Updated ONLY by submitters
36Primary Sequence Database
GenBank
- Nucleotide only sequence database
- Archival in nature
- Submission of GenBank Data to NCBI
- Direct submissions of individual records via Web (BankIt, Sequin)
- Batch submissions of bulk sequences via Email (EST, GSS, STS)
- FTP accounts for Sequencing Centers
Sequence
37The International Sequence Database Collaboration
Sequence
38The Growth of GenBank
Currently 40.1 million records and 37.9 billion nucleotides. Average doubling time: 12 months.
[Chart: Sequence Records (millions) and Total Base Pairs (billions) by year, 1983 to 2004]
39Number of Users and Hits Per Day
[Chart: users and hits per day by year, 1997 to 2003]
Currently averaging 10,000,000 to 35,000,000 hits per day!
40GenBank
- full release every two months
- incremental and cumulative updates daily
- available only through the Internet
ftp://ftp.ncbi.nih.gov/genbank/
Sequence
41Organization of GenBank: GenBank Divisions (gbdiv)
- Records are divided into 17 Divisions.
- 1 Patent (11 files)
- 5 High Throughput
- 11 Traditional
High Throughput Divisions
- EST (288) Expressed Sequence Tag
- GSS (98) Genome Survey Sequence
- HTG (61) High Throughput Genomic
- STS (3) Sequence Tagged Site
- HTC (3) High Throughput cDNA
Traditional Divisions
- PRI (27) Primate
- PLN (10) Plant and Fungal
- BCT (8) Bacterial and Archaeal
- INV (6) Invertebrate
- ROD (11) Rodent
- VRL (3) Viral
- VRT (4) Other Vertebrate
- MAM (1) Mammalian (ex. ROD and PRI)
- PHG (1) Phage
- SYN (1) Synthetic (cloning vectors)
- UNA (1) Unannotated
- Traditional Divisions
- Direct Submissions
- (Sequin and BankIt)
- Accurate
- Well characterized
- BULK Divisions
- Batch Submission
- (Email and FTP)
- Inaccurate
- Poorly characterized
Sequence
42File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
- GenBank/GenPept (useful for scientists)
- FASTA (the simplest format)
- ASN.1 and XML (useful for programmers)
Sequence
43A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Definition = Title
References
NCBI's Taxonomy
44Lower down in the GenBank Record
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
Feature Table
GenPept Protein ID
45 FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
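A FASTA record is simple enough to read with a few lines of code. The sketch below is one minimal way to do it in Python, assuming the NCBI-style definition line uses pipe-delimited identifiers (`>gi|number|db|accession| description`). The function names `parse_fasta` and `split_defline` are hypothetical, and the sample record is shortened to the first 70 residues of the c-src kinase sequence shown above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_defline(header):
    """Split 'gi|number|db|accession| description' into (id fields, description).

    Assumes the pipe-delimited identifier block contains no spaces.
    """
    ids, _, description = header.partition(" ")
    return ids.strip("|").split("|"), description


# Shortened sample: only the first 70 residues are kept here.
example = """>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAV
TKDPNWYKAKNKVGREGIIPANYVQKREG
"""

records = parse_fasta(example)
header, sequence = records[0]
fields, description = split_defline(header)
print(fields)
print(description)
print(len(sequence))
```

Sequence lines may be wrapped at any width; the parser simply joins them, which is why the two short lines above yield one 70-residue sequence.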
46GenBank GenPept Files
47Bulk Divisions
- Batch Submission and htg (email and ftp)
- Inaccurate
- Poorly Characterized
- Expressed Sequence Tag
- 1st pass single read cDNA
- Genome Survey Sequence
- 1st pass single read gDNA
- High Throughput Genomic
- incomplete sequences of genomic clones
- Sequence Tagged Site
- PCR-based mapping reagents
48Types of Databases
- Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP, GEO
- Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
49EST Division: Expressed Sequence Tags
gbdiv_est[Properties]
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAG
CAGAGAATGGAAAGTCAATTCCTGAATTGCTATGTGTCTGGGTTTCATC
CATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAATTGAAAAAG
TGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTG
TACTACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTT
GAACCATGTNGACTTTGTCACAGNCAAGTTNAGTTTAAGTGGGNATCGA
GACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTGGA
TTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATA
TGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
- isolate unique clones
- sequence once from each end
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTT
CATTCATTATAACAAATTTAATAATCCTGTCAATNATATTTCTAAATTT
TCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTCTTATGCACGC
TTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCT
GACCAAGGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANG
TTGCCAGCCCTC
RNA gene products
50Sea Urchin ESTs in Entrez
51UniGene: Sets of expressed sequences clustered by BLAST similarity
- Summary pages of curated information about expressed gene transcripts.
Sequence Expression
52A Cluster of ESTs: Arabidopsis serine protease
query
5' EST hits
3' EST hits
53UniGene Collections as of February 2004
Sequence Expression
54Genome Sequencing - HTG, GSS, (WGS)
Whole BAC insert (or genome)
- shredding
- cloning, isolating
- sequencing (GSS division or trace archive)
- assembly
- Draft Sequence (HTG division)
- whole genome shotgun assemblies (traditional division)
55HTG Division: Honeybee Draft Sequences
- Unfinished sequences of BACs
- Gaps and unordered pieces
- Finished sequences move to traditional GenBank division
56Maize Genome Survey Sequences
- Surveys of BAC Libraries
- BAC end sequences
- More than 100K per project
57Other Genome Sequencing Products
- Trace Archive
- Whole Genome Shotgun
58Trace Archive
- Primary reads from WGS and EST projects
- Many not available in GenBank
- Earliest access to genome data
59Trace Archive Page
60 Short-tailed opossum traces
61Whole Genome Shotgun Projects
- Traditional GenBank Divisions
- 120 projects
- 1 Virus
- 78 Bacteria
- 5 Archaea
- 37 Eukaryotes featuring
- Rat, Mouse, Dog, Chimpanzee, Human
- Honeybee, Anopheles, Fruit Flies (2)
- Nematode (C. briggsae)
- Yeasts (8), Aspergillus (2)
- Rice
wgs_master[Properties]
62Countries of Origin
63Submitted by Experimentalists
Curated by NCBI
GDS: Grouping of experiments
GSE: Grouping of slide/chip data (a single experiment)
GSM: Raw/processed spot intensities from a single slide/chip
Entrez GEO Datasets
Entrez GEO
64- Submit and update data
- Query the database
- gene identifiers
- field information
- sequence
- Browse datasets
- Download data
66mRNAs
RELEASE 4 IS NOW AVAILABLE ON THE FTP SITE!
- Forming the best representative sequence
- Standardizing nomenclature and record structure
- Adding annotation (references, sequence features)
Genomes
Proteins
Sequence Genome
67RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Scanning....
Curated Model mRNA (XM) (XR)
Model protein (XP)
Curated mRNA (NM) (NR)
Protein (NP)
Sequence Genome
68Curated RefSeq Records
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from X66503.1.
            Summary: Adenylosuccinate synthetase catalyzes the first committed
            step in the conversion of IMP to AMP.
X records: Genome Annotation, Inferred or Predicted
vs.
N records: Provisional, Reviewed or Validated
Sequence
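The two-letter RefSeq accession prefixes on the curation slide carry the molecule type and whether the record is curated (N) or a model (X). A minimal lookup sketch in Python follows; the table reflects the prefixes listed on the slide, the function name `classify_refseq` is hypothetical, and the accession numbers used in the example calls are made-up placeholders rather than real records.

```python
# Prefix -> (molecule type, record class), following the curation slide.
REFSEQ_PREFIXES = {
    "NC": ("genomic DNA", "curated"),
    "NT": ("genomic DNA", "curated"),
    "NW": ("genomic DNA", "curated"),
    "NM": ("mRNA", "curated"),
    "NR": ("non-coding RNA", "curated"),
    "NP": ("protein", "curated"),
    "XM": ("mRNA", "model"),
    "XR": ("non-coding RNA", "model"),
    "XP": ("protein", "model"),
}


def classify_refseq(accession):
    """Return (molecule type, record class) for a RefSeq-style accession.

    Returns None when the accession has no underscore or an unknown prefix,
    for example a GenBank accession such as X66503.1.
    """
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None
    return REFSEQ_PREFIXES.get(prefix.upper())


# Placeholder accession numbers, used only to exercise the prefixes.
for acc in ("NM_000000", "XP_000000", "NC_000000", "X66503.1"):
    print(acc, classify_refseq(acc))
```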
69Intermission
- To come in Part 2
- Protein Databases
- Genomes and Genomic Resources
- Searching Sequences with Entrez and BLAST
- Educational Resources
70Protein Sequences and Structures
71Linking Protein Sequence, Structure and Function
CDD Conserved functional domains in
proteins (Conserved Domain Database)
72- GenPept: a derivative database, i.e. there are no directly sequenced proteins.
- Translations of nucleic acid sequences provided by submitters
- SWISS-PROT, PDB, PIR, DDBJ: curation provided by these databases.
73How Many Protein Records?
76Patatin an abundant tuber protein
77Protein Links to Related Databases
Precomputed BLAST searches
Conserved functional domains
78BLink precomputed BLAST Searches
79BLink Best Hits
80CDD Conserved Domain Database
83Sequence-based Neighbors: Domain Relatives
- Modular Architecture of Domains
- Cartoon descriptions of protein domain organization on the primary sequence
- Allows for comparison with other proteins with the same Domain
Conserved Domain Architecture Retrieval Tool (CDART)
84Entrez Structure: Molecular Modeling Database
- Derived from experimentally determined PDB records
- Data is added to PDB records including
- Addition of explicit chemical bonding information
- Validation and indexing of sequence
- Inclusion of Taxonomy, Citation, and other information
- Conversion to ASN.1 data description language
- Searching the Structure Databases
- Keyword search by Entrez
- Sequence search by BLAST or BLink
- Domain search by RPS-BLAST (CDD Search)
- Structure search by VAST
Structure
85Structure Summary Page
to get the Cn3D viewer
Get Cn3D 4.1
Sequence-based Neighbors Conserved Domains
(CDD/RPS-BLAST)
86Complex Genomes
- Sequences are provided complete or we help assemble
- Heavy annotation: genes, transcript regions, ORFs, sequence variations, markers, clones, ESTs, etc.
- The annotation can be shown graphically and linked to other databases using the MapViewer
88Higher Genome MapViews
adss
build 34
build 34
89Examples of Maps and Mapped Data
Sequence maps: Ab initio (model), Assembly, BES_Clone, Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, TCAG_RNA, Transcript (RNA), UniGene, EST, Variation
Cytogenetic maps: Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
Genetic maps: deCODE, Genethon, Marshfield
RH maps: GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
90Our annotation of the Gene
Gnomon prediction of Genes
Clones used in contig assembly
RefSeq mRNAs based on GB records
91Mapviewer Arabidopsis patatin gene
92Other Databases and Genomic Mapping
- SNP Database (Variation Map): Single Nucleotide Polymorphisms, et al.
- Identified differences in sequences found within gene loci (L), transcript (T) or coding (C) regions
- STS Division / UniSTS (STS Map): Sequence Tagged Sites
- Physically mapped segments of genes, ESTs, mRNAs or genomic DNAs with known position
- PCR with STS primers gives one product per genome
- Related resource: Electronic PCR
- EST Division / UniGene (UniGene Map)
- Histogram of expressed regions
93Mapping Data on the Genome
94Searching the NCBI Databases
95How to Query a Particular Database
term1 term2
(term1[tag] op term2[tag] op ...)
op = AND, OR, NOT
- Boolean operators MUST be in ALL CAPS!
[tag] = Entrez indexing field
Organism Journal User compounds Author
96Sample Query
Brauninger a c-src kinase
Organism Journal User compounds Author
97Using Fields to Find Records
Accession, All Fields, Author, EC/RN Number, Feature Key, Filter, Gene Name, Issue, Journal, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Volume
- Most useful search field: Organism
- human[orgn] or bacteria[orgn]
- Useful search terms in Properties field
- srcdb: source database ( srcdb genbank[prop] )
- gbdiv: genbank division ( gbdiv est[prop] )
- biomol: biomolecular type ( biomol mrna[prop] )
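Fielded Entrez queries follow one pattern: each term is followed by its field tag in square brackets, and terms are joined by upper-case Boolean operators. The sketch below assembles such query strings in Python. It only builds text and does not contact Entrez; the helper names `field_term` and `combine` are hypothetical, and the tags `orgn` and `prop` are the abbreviations used on the slide.

```python
def field_term(term, field=None):
    """Format one search term, optionally restricted to an Entrez field tag."""
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with a Boolean operator, forcing the required upper case."""
    operator = operator.upper()
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


# Human EST records that are mRNA, using the Properties terms from the slide.
query = combine(
    "and",
    field_term("human", "orgn"),
    field_term("gbdiv est", "prop"),
    field_term("biomol mrna", "prop"),
)
print(query)

# Either organism, then exclude one database source.
organisms = combine("OR", field_term("human", "orgn"), field_term("bacteria", "orgn"))
print(combine("NOT", f"({organisms})", field_term("srcdb genbank", "prop")))
```

Normalizing the operator inside `combine` is a small guard against the lower-case `and` mistake the slide warns about.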
98Complex searches you can do with Preview/Index
Terms used (and indexed) in Entrez fields can be
searched to gain useful information!
How many rat Unigene clusters contain at least
one mRNA?
- Select the UniGene database.
- Find all the rat records.
- Find those that have at least 1 mRNA (not 0).
NOT
rat[organism]
99Complex Queries with Preview/Index
NOT 0[mRNA Count]
100Other Advanced Queries
Nucleotide: Non-genomic sequences from the PLN division of GenBank
gbdiv_pln[properties] NOT biomol_genomic[properties]
Protein: RefSeq sequences with molecular weights of 80 to 100 kDa
srcdb_refseq[properties] AND 080000:100000[Molecular Weight]
SNP: True SNPs that are uniquely mapped on the mouse genome
snp[SNP Class] AND 1[Map Weight] AND mouse[organism]
UniSTS: Markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: Structures of bacterial kinases with resolutions below 2 Å
Bacteria[organism] AND kinase AND 000.00:002.00[resolution]
101Searching the NCBI Databases
102http://www.ncbi.nlm.nih.gov/blast
103Why do we need sequence similarity searching?
Searching with Sequences
- To identify and annotate sequences with
- incomplete (or no) annotations (GenBank)
- incorrect annotations
- To assemble genomes
- To explore evolutionary relationships by
- finding homologous molecules
- developing phylogenetic trees
- NOTE: Similar sequences may NOT have similar function!
104Basic Local Alignment Search Tool
BLAST
- Calculates similarity for biological sequences
- Finds best local alignments
- A heuristic approach based on the Smith-Waterman algorithm
- Searches for matching words rather than individual residues
- Uses statistical theory to determine if a match might have occurred by chance
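For reference, the exhaustive method that BLAST approximates can be written in a few lines. The sketch below computes the Smith-Waterman best local alignment score with dynamic programming. It is a minimal illustration, not BLAST itself: the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability rather than BLAST defaults, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Every cell holds the best score of an alignment ending at that pair of
    positions; scores are floored at 0 so an alignment can start anywhere.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
print(smith_waterman_score("AAAA", "TTTT"))
```

The cost is proportional to the product of the two lengths, which is why a database search needs a word-based heuristic to decide where to look first.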
105Local vs. Global Alignment
BLASTp protein-protein comparison (a local alignment protocol)
Human  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH  84
Worm   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
Human  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
Human 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
Human 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332
Human 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
Human 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
Human 424 LDAAMRPSFLQLREQLEHI 443
Worm  472 SDPDKRPTFETLQWKLEDL 492
Align program (Lipman and Pearson) (a global alignment protocol), first and last rows
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
(position scale: 1, 20, 40, 60)
human REQLEHI--------KTHELHL
worm  QWKLEDLFNLDSSEYKEASINF
(position scale: 440, 450 for human; 500 for worm)
106Nucleotide Words
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........
Minimum word size: 7; blastn default: 11; megablast default: 28
Make a lookup table of words
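The word list is a sliding window over the query, and the lookup table maps each word to where it occurs. A minimal Python sketch of that step follows. The 18-base query is reconstructed by overlapping the eight 11-mers listed on the slide, and `build_word_table` is a hypothetical name rather than a BLAST function.

```python
from collections import defaultdict


def build_word_table(query, word_size):
    """Map every overlapping word of length word_size to its start positions."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


# Query reconstructed from the overlapping words shown on the slide.
query = "GTACTGGACATGGACCCT"
table = build_word_table(query, word_size=11)  # 11 is the blastn default
for word, positions in table.items():
    print(word, positions)

# The same routine works for protein words of size 3.
protein_table = build_word_table("GTQITVEDLF", word_size=3)
print(list(protein_table))
```

A database scan then only has to test whether each database word is a key in this table, which is far cheaper than aligning residue by residue.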
107An alignment that BLAST can't find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
No words longer than 6 (exact matches); for nucleotides there must be at least 7.
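The claim that this alignment contains no exact word longer than 6 can be checked directly by scanning the aligned columns for the longest run of identical bases. The sketch below does that in Python for the two sequences on the slide, joined across the three rows; `longest_exact_run` is a hypothetical helper, and the check covers only this ungapped column-by-column pairing.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


# Upper and lower sequences from the slide, rows 1, 61 and 121 joined.
top = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
bottom = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)

run = longest_exact_run(top, bottom)
print(run)
print(run >= 7)  # would a minimum-size nucleotide word fit inside any run?
```

Many columns match, yet mismatches are spread evenly enough that no seed of the minimum size exists, so a word-based search never starts extending here.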
108Protein Words
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Word size can be 2 or 3 (default: 3)
Make a lookup table of words
109Minimum Requirements for a Hit
- Nucleotide BLAST requires one exact match
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT (exact word match: one match)
- Protein BLAST requires two neighboring matches within 40 residues
GTQITVEDLFYNI
SEI YYN (neighborhood words: two matches)
110Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
        aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human    N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
               V        V  L  G     Q  W  G  D  E  G
A.th.    S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
        agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
BLASTn finds no match, because there are no 7 bp words
BLASTp finds three matching words
Protein searches are generally more sensitive than nucleotide searches.
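The ADSS comparison can be reproduced by collecting the words from each sequence and intersecting the sets. The Python sketch below does this for the nucleotide and protein fragments on the slide. It counts exact, overlapping shared words only, so its protein count need not equal the slide's figure of three, which may count non-overlapping words or rely on neighborhood scoring. The function name `words` is hypothetical.

```python
def words(sequence, size):
    """Return the set of all overlapping words of the given size."""
    return {sequence[i:i + size] for i in range(len(sequence) - size + 1)}


human_nt = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
athal_nt = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_aa = "NRVTVVLGAQWGDEG"
athal_aa = "SQVSGVLGCQWGDEG"

# Minimum nucleotide word size is 7; default protein word size is 3.
shared_nt = words(human_nt, 7) & words(athal_nt, 7)
shared_aa = words(human_aa, 3) & words(athal_aa, 3)

print(sorted(shared_nt))
print(sorted(shared_aa))
```

Synonymous codon differences break up the nucleotide identity every few bases, while the encoded residues stay the same, which is the reason protein searches pick up relationships that nucleotide searches miss.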
111Some WWW-BLAST Databases
Nucleotide
- nr (nt)
- Traditional gb divisions
- NM_ and XM_ RefSeqs
- dbest
- EST Division
- htgs
- HTG division
- gss
- GSS division
- chromosome
- NC_ RefSeqs
Protein
- nr (non-redundant sequences)
- GenBank CDS translations
- NP_ RefSeqs
- PIR, Swiss-Prot, PRF
- PDB (sequences from structures)
- swissprot
- pat - patents
- pdb - sequences with 3D structures
- month - sequences updated within the past 30 days
112Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect Value: E = number of database hits you expect to find by chance
E depends on the size of the database and on your score.
[Plot: expected number of random hits (Alignments) vs. Score]
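The dependence of E on database size and score can be made concrete with the bit-score form of the Karlin-Altschul relation, E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The Python sketch below evaluates it. The numbers are illustrative assumptions, not values from the slide, and real BLAST uses edge-corrected effective lengths, so reported E values differ somewhat.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m = query_length, n = database_length
    and S' = bit_score (raw lengths, no edge-effect correction).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 1 billion residues.
query_length = 300
database_length = 1_000_000_000

for bit_score in (30, 40, 50, 60):
    print(bit_score, expect_value(bit_score, query_length, database_length))

# Doubling the database doubles E; one extra bit of score halves it.
print(expect_value(40, query_length, 2 * database_length))
print(expect_value(41, query_length, database_length))
```

This is why the same alignment score becomes less significant as the databases grow.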
113Protein BLAST Page
Accession, GI, or sequence
Choose your database
114BLAST Formatting Page
Link to CDD
115BLAST Results Page
116BLAST Results Page
117Genomic BLAST
- The BLAST homepage links to the Genome BLAST pages, which provide customized nucleotide and protein databases for each genome.
- If a Map Viewer is available, the BLAST hits can be viewed on the maps.
118Locate an A. thaliana Gene with BLAST
AB004798
A. thaliana mRNA
119Hits to A. thaliana Clones
120Related Sequences: Precomputed BLASTn and BLASTp Lists
Nucleotide
Protein
The Related Sequences Entrez link retrieves a list of sequences sorted by BLAST score, but with no alignment details.
121Searching the NCBI Databases
122Structure-based Neighbors: Vector Alignment Search Tool
For each protein chain, locate secondary structure elements, represent them as individual vectors, and compare these with precomputed vectors of database structures.
[Figure: Human IL-4, with secondary structure elements numbered 1 to 6]
123NCBI Educational Resources
- Tutorials
- Practice exercises
- About NCBI
- Bookshelf
- NCBI Handbook
131Other educational Sites
- Geospiza
- University of Wisconsin BioTrek
- USDA National Agricultural Library
- SWBIC Southwest Biotechnology and Information Center
- Biologica
- Howard Hughes Medical Institute
134University of Wisconsin BIOTREK
138For More Information
E-mail addresses
- General Help: info@ncbi.nlm.nih.gov
- BLAST Help: blast-help@ncbi.nlm.nih.gov
The (free!) NCBI Newsletter
http://www.ncbi.nih.gov/About/newsletter.html
The NCBI Handbook
Follow the link from the NCBI Home Page under Hot Spots
The NCBI Education Page
http://www.ncbi.nih.gov/Education/index.html