Title: A Field Guide part 2
1A Field Guide part 2
National Center for Biotechnology Information
UT-Health Science Center
February 14, 2006
2GenBank Records
The Flatfile Format
3A Typical GenBank Record
LOCUS       NM_019570    4279 bp    mRNA    linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
4GenBank Record Feature Table
5GenBank Record Feature Table, cont.
GenPept identifier
6GenBank Record sequence
7Indexing for Nucleotide UID 59958365
Field               Indexed Terms
primary accession   NM_001012399
title               Bos taurus hemochromatosis (hfe), mRNA.
organism            Bos taurus
sequence length     1168
modification date   2005/02/19
properties          biomol mrna; gbdiv mam; srcdb refseq
8Global Entrez Search HFE
HFE
9Entrez Nucleotide HFE
137 records
Not HFE
10Smarter Query
hfe[title] AND human[orgn]
11hfe[title] AND human[orgn] (cont)
Primary data
12Preview/Index: Gateway to Advanced Searches
13Preview/Index
14Preview/Index Properties, srcdb
Properties
15Preview/Index Properties, srcdb
AND srcdb refseq[Properties]
16Preview/Index Properties, srcdb
AND srcdb ddbj/embl/genbank[Properties]
17Database Queries
#1  hfe                                        137
#2  hfe[title] AND human[orgn]                  42
#3  #2 AND srcdb refseq[prop]                   11
#4  #2 AND srcdb ddbj/embl/genbank[prop]        31
#5  #4 AND gbdiv pri[prop]                      29
#6  #4 AND gbdiv est[prop]                       2
18Molecule Queries
#1  hfe                                116
#2  hfe[title] AND human[orgn]          42
#3  #2 AND biomol mrna[prop]            29
#4  #2 AND biomol genomic[prop]         13
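The fielded queries on the two preceding slides can also be composed programmatically. The sketch below assembles the same query strings and builds, without sending, a request URL for NCBI's E-utilities ESearch service. E-utilities is not mentioned on the slides and is brought in here as an assumption; the helper names `tagged`, `all_of`, and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not shown on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach an Entrez field tag: tagged('hfe', 'title') gives 'hfe[title]'."""
    return f"{term}[{field}]"


def all_of(*clauses):
    """Join clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def esearch_url(db, term):
    """Build (but do not send) an ESearch request URL for a query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Query #2 from the slides, then narrowed to RefSeq records as in query #3.
base = all_of(tagged("hfe", "title"), tagged("human", "orgn"))
refseq_only = all_of(base, tagged("srcdb refseq", "prop"))

print(base)
print(esearch_url("nucleotide", refseq_only))
```

Building each narrower query from `base` mirrors how the history numbers (#2, #3, ...) are reused on the slides.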
19More Queries
Fields are database-specific
20More Queries
Fields are database-specific
21Other Entrez Databases
UniGene: rat clusters that have at least one mRNA
    rat[organism] NOT 0[mrna count]
SNP: uniquely mapped microsatellites on human chr2
    microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
UniSTS: markers on the Genethon map of human chromosome 12
    Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: structures of bacterial kinases with resolutions below 2 Å
    bacteria[organism] AND kinase AND 000.00:002.00[resolution]
22Genome Resources
Genomic Biology
23Genomic Biology
24Gen Biol Gen Resources
25Map Viewer Genome Annotation Updates
26Gen Biol Gen Resources
27Genome Projects microb
28Genome Projects microb
13 Eukaryotic Genome Sequencing Projects
Selected: Complete 0, Assembly 2, In Progress 11
29Genome Projects microb
13 Eukaryotic Genome Sequencing Projects
Selected: Complete 0, Assembly 2, In Progress 11
30Gen Biol Gen Resources
31Gen Biol Gen Resources
32Gen Biol Gen Resources
33Gen Biol Gen Resources
34Gen Biol Gen Resources
35Genome Resources
Genomic Biology
UniGene
36UniGene
Gene-oriented clusters of expressed sequences
- Automatic clustering using MegaBlast
- Each cluster represents a unique gene
- Informed by genome hits
- Information on tissue types and map locations
- Useful for gene discovery and selection of
mapping reagents
37A Cluster of ESTs
query
5 EST hits
3 EST hits
38UniGene Collections
39UniGene Collections
Species UniGene
40UniGene Hs build 188
41UniGene Cluster Hs.95351: Lipase, hormone-sensitive (LIPE)
42UniGene Cluster Hs.95351
43UniGene Cluster Hs.95351 expression
44UniGene Cluster Hs.95351 seqs
45Get Sequences
web page
46Genome Resources
Genomic Biology
47E-PCR
Genomic sequence here
48Options
49Results
50reverse e-pcr
51reverse e-pcr
52reverse e-pcr
53reverse e-pcr
Gene
STS
LY6G6D lymphocyte antigen 6 complex, locus G6D
54Genome Resources
Genomic Biology
55List View
56Human MapViewer
57MapViewer Human ADAR
58MV Hs ADAR
59Maps Options
Maps Options
--Sequence maps--
Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST, Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
--Cytogenetic maps--
Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps--
deCODE, Genethon, Marshfield
--RH maps--
GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
60MapViewer
Component
Gene
UniGene
Repeats
61Phenotype
Variation
Gene
62Maps Options
Maps Options
63Chimp ADAR
Human ADAR
Mouse ADAR
64Genome Resources
Genomic Biology
Trace Archive
65Trace Archive Page
66Ciona savignyi Traces
68Trace Archive BLAST Page
Potential access to sequences NOT yet in GenBank
69Basic Local Alignment Search Tool
70BLAST Web Searches, 2005
200,000
71Precomputed BLAST Services
- Nucleotide or protein: Related Sequences
- BLAST link: BLink
- Transcript clusters: UniGene
- Protein homologs: HomoloGene
72Link to Related Sequences
73Related Sequences
Most similar
Least similar
74BLink (BLAST Link)
75BLink Output
76Why Is BLAST So Popular?
- Fast
- - heuristic approach based on Smith-Waterman
- Local alignments
- Statistical significance
- - Expect value
- Versatile
- - blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
- - www, standalone, and network clients
77Global vs Local Alignment
78Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
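To make the contrast concrete, here is a minimal sketch that scores the two example sequences both ways: a Needleman-Wunsch global score (every residue of both sequences must be aligned) and a Smith-Waterman local score (the best-matching pair of substrings). The scoring scheme (+1 match, -1 mismatch, -1 per gap position) is an assumption chosen for readability; it is not the matrix or gap model BLAST uses.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings (never below 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

The local score can never be lower than the global one, because a full-length alignment is just one of the candidates a local search considers.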
79How BLAST Works
- Make lookup table of words for query
- Scan database for hits
- Extend alignment both directions
- Ungapped extensions of hits (initial HSPs)
- Gapped extensions (no traceback)
- Gapped extensions (traceback - alignment details)
80Protein Words
GTQ TQI QIT ITV TVE VED
EDL DLF ...
Make a lookup table of words
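The word list on this slide is what you get by sliding a window of three residues along a query. A minimal sketch of the first two steps of the procedure (build the lookup table, scan a subject for hits) follows. The query `GTQITVEDLF` is inferred from the listed words, and the sketch matches exact words only; it does not model the scored neighborhood words that BLASTP also places in its table.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def lookup_table(query, w=3):
    """Map each query word to the list of positions where it starts."""
    table = defaultdict(list)
    for pos, word in enumerate(words(query, w)):
        table[word].append(pos)
    return table


def scan(subject, table, w=3):
    """Return (query_pos, subject_pos) for every exact word hit."""
    hits = []
    for spos, word in enumerate(words(subject, w)):
        for qpos in table.get(word, []):
            hits.append((qpos, spos))
    return hits


query = "GTQITVEDLF"
print(words(query))
print(scan("AAITVEAA", lookup_table(query)))
```

Each hit is a seed: a pair of positions from which the extension steps start.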
81BLASTP Summary
Extension plot labels: drop-off score, highest score, current score
-X: X dropoff value for gapped alignment (in bits)
blastn 30, megablast 20, tblastx 0, all others 15
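The drop-off rule can be sketched as follows: extend from a hit, track the highest score seen, and stop once the current score has fallen more than X below it. For brevity the sketch applies the rule to an ungapped extension and works in raw score units rather than bits; both simplifications, the function name `extend_right`, and the +1/-3 scoring function are assumptions made for illustration.

```python
def extend_right(query, subject, qpos, spos, score_fn, x_drop):
    """Extend a hit to the right without gaps.

    Stops once the running score is more than x_drop below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    best = running = 0
    best_len = 0
    i = 0
    while qpos + i < len(query) and spos + i < len(subject):
        running += score_fn(query[qpos + i], subject[spos + i])
        if running > best:
            best, best_len = running, i + 1
        elif best - running > x_drop:
            break  # fallen too far below the highest score: give up
        i += 1
    return best, best_len


def nucleotide_score(a, b):
    """Hypothetical scoring for the example: +1 match, -3 mismatch."""
    return 1 if a == b else -3


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, nucleotide_score, x_drop=5))
```

A larger X lets the extension ride through a mismatch and recover; a smaller X stops sooner and runs faster.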
82BLASTP Summary
High-scoring pair (HSP)
83Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A    1  -3  -3  -3
G   -3   1  -3  -3
C   -3  -3   1  -3
T   -3  -3  -3   1
-r 1 -q -3
CAGGTAGCAAGCTTGCATGTCA
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
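The raw score on this slide can be reproduced by walking the two aligned rows column by column. The sketch uses the slide's reward (+1) and mismatch penalty (-3); charging the single gap column -3 as well is an assumption inferred from the slide's arithmetic (19 matches minus 9), not a statement about BLAST's actual gap costs.

```python
def raw_score(top, bottom, reward=1, penalty=-3, gap=-3):
    """Sum per-column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # assumed cost for a gap column
        elif a == b:
            score += reward     # -r 1
        else:
            score += penalty    # -q -3
    return score


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 19 matches, 2 mismatches, 1 gap column -> 10
```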
84Scoring Systems - Proteins
- Position Independent Matrices
- - PAM Matrices (Percent Accepted Mutation)
- - - Derived from observation; small dataset of alignments
- - - Implicit model of evolution
- - - All calculated from PAM1
- - - PAM250 widely used
- - BLOSUM Matrices (BLOck SUbstitution Matrices)
- - - Derived from observation; large dataset of highly conserved blocks
- - - Each matrix derived separately from blocks with a defined percent identity cutoff
- - - BLOSUM62: default matrix for BLAST
- Position Specific Score Matrices (PSSMs)
- - PSI- and RPS-BLAST
85BLOSUM62
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
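As a small illustration of how a position-independent matrix is used, the sketch below scores pairs of three-letter words with a handful of BLOSUM62 entries copied from the matrix on this slide. Only the entries needed for the examples are included; the dictionary layout and function names are hypothetical.

```python
# A few BLOSUM62 entries copied from the slide (the matrix is symmetric,
# so each pair is stored once).
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("T", "T"): 5, ("Q", "Q"): 5,
    ("S", "T"): 1, ("E", "Q"): 2, ("A", "G"): 0,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a residue pair in either orientation."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(w1, w2):
    """Score two equal-length words position by position."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


print(word_score("GTQ", "GTQ"))  # 6 + 5 + 5 = 16
print(word_score("GTQ", "GSQ"))  # 6 + 1 + 5 = 12
print(word_score("GTQ", "ATE"))  # 0 + 5 + 2 = 7
```

The same residue pair always receives the same score wherever it occurs, which is the property the position-specific matrices on the following slides relax.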
86Position-Specific Score Matrix
Serine/Threonine protein kinases catalytic loop
DAF-1
87Position-Specific Score Matrix
        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K  -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E   0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S   0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N  -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K  -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P  -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A   3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M  -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A   4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H  -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R  -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D  -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I  -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K   0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S   0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K   0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N  -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I  -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M  -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V  -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K  -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N   1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D  -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L  -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3
catalytic loop
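Scoring against a PSSM differs from scoring with BLOSUM62 in that the score for a residue depends on the position it is aligned to. The sketch below uses three rows copied from the slide (positions 444 to 446, residues H, R, D) and scores short peptides against them; the data layout and function name are hypothetical.

```python
# Column order as on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Three PSSM rows copied from the slide, keyed by position.
PSSM_ROWS = {
    444: [-4, -2, -1, -3, -5, -2, -2, -4, 10, -6, -5, -3, -4, -3, -2, -3, -4, -5, 0, -5],
    445: [-4, 8, -3, -4, 0, -1, -2, -3, -2, -5, -4, 0, -3, -2, -4, -3, -3, 0, -4, -5],
    446: [-4, -4, -1, 8, -6, -2, 0, -3, -3, -5, -6, -3, -5, -6, -4, -2, -3, -7, -5, -5],
}


def pssm_score(peptide, start=444):
    """Score a peptide aligned without gaps to consecutive PSSM positions."""
    return sum(
        PSSM_ROWS[start + i][COLUMNS.index(aa)]
        for i, aa in enumerate(peptide)
    )


print(pssm_score("HRD"))  # 10 + 8 + 8 = 26
print(pssm_score("HKD"))  # 10 + 0 + 8 = 18
print(pssm_score("ARD"))  # -4 + 8 + 8 = 12
```

At these three positions the matrix rewards the conserved residues much more strongly than BLOSUM62 would (H-H 8, R-R 5, D-D 6 in the position-independent matrix).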
88Local Alignment Statistics
Expect Value: E = number of database hits you expect to find by chance with a score of at least S
More info: The Statistics of Sequence Similarity Scores
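The relationship behind the Expect value can be sketched with the standard Karlin-Altschul formulas: E = K·m·n·e^(-λS) for a raw score S, or equivalently E = m·n·2^(-S') once S is converted to a bit score S' = (λS - ln K) / ln 2. The slide does not show these formulas, so they are supplied here as background; the λ and K values in the example are placeholders, and the sketch ignores the effective-length corrections that BLAST applies to m and n.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """E = m * n * 2^(-S') for query length m and database length n."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder parameters, for illustration only.
bits = bit_score(raw=50, lam=0.3, k=0.1)
print(expect_value(bits, query_len=100, db_len=1000))
```

Each additional bit of score halves E, and doubling the database size doubles it.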
89An Alignment BLAST Cannot Make
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
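The reason given on this slide can be checked directly. The sketch below measures the longest run of identical bases along the alignment as displayed, and separately tests whether two sequences share any exact word of a given length at any offset, which is what a word-based seeding step needs. The sequences are transcribed from the slide; the function names are hypothetical.

```python
# The two sequences from the slide, joined from their wrapped pieces.
SEQ1 = ("GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACC" "ACGCTATTCTTGCTGTTG"
        "GTTACGGAACCGAGAATGGTAAAGACTACTGGA" "TCATTAAGAACTCCTGGGGAGCCAGTT"
        "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTG" "GTAAAAAC")
SEQ2 = ("GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTA" "CTCGTTGTCG"
        "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGG" "GCTGAATCCT"
        "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAAC" "AAC")


def longest_aligned_run(a, b):
    """Longest stretch of identical letters in an ungapped, same-offset alignment."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def shares_word(a, b, w):
    """True if a and b contain a common exact word of length w at any offsets."""
    words_a = {a[i:i + w] for i in range(len(a) - w + 1)}
    return any(b[j:j + w] in words_a for j in range(len(b) - w + 1))


print(longest_aligned_run(SEQ1, SEQ2))
print(shares_word(SEQ1, SEQ2, 7))
```

When no shared word of the minimum size exists, the scan step has no seed to extend, however similar the sequences are overall.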
90An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
91Other BLAST Algorithms
- Megablast
- Discontiguous Megablast
- PSI-BLAST
- PHI-BLAST
92Megablast: NCBI's Genome Annotator
- Long alignments of similar DNA sequences
- Greedy algorithm
- Concatenation of query sequences
- Faster than blastn; less sensitive
93MegaBLAST Word Size
Trade-off: sensitivity vs speed
94Discontiguous Megablast
- Uses discontiguous word matches
- Better for cross-species comparisons
95Templates for Discontiguous Words
W    t    type         template
11   16   coding       1101101101101101
11   16   non-coding   1110010110110111
12   16   coding       1111101101101101
12   16   non-coding   1110110110110111
11   18   coding       101101100101101101
11   18   non-coding   111010010110010111
12   18   coding       101101101101101101
12   18   non-coding   111010110010110111
11   21   coding       100101100101100101101
11   21   non-coding   111010010100010010111
12   21   coding       100101101101100101101
12   21   non-coding   111010010110010010111
W = word size (number of matches in template); t = template length
Reference: Ma, B., Tromp, J., Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002; 18(3):440-5
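A template is applied by comparing two windows only at the positions marked 1 and ignoring the positions marked 0. The sketch below does this for the W 11, t 16 coding template from the table, which skips every third position, so differences at those positions (for example codon wobble positions) do not prevent a hit. The function names are hypothetical and this is an illustration of the idea, not megablast's implementation.

```python
CODING_W11_T16 = "1101101101101101"  # copied from the template table


def masked(window, template):
    """Keep only the letters at positions where the template has a 1."""
    return "".join(base for base, bit in zip(window, template) if bit == "1")


def discontiguous_hit(q_window, s_window, template=CODING_W11_T16):
    """True if the two windows agree at every '1' position of the template."""
    if len(q_window) != len(template) or len(s_window) != len(template):
        raise ValueError("windows must be as long as the template")
    return masked(q_window, template) == masked(s_window, template)


query_window = "ACGTACGTACGTACGT"
# Same window with every third base changed (the template's 0 positions).
subject_window = "ACTTAAGTCCGGACAT"
print(discontiguous_hit(query_window, subject_window))
```

An exact-word search of the same length would reject this pair of windows, since they differ at five of sixteen positions.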
97Discontiguous (Cross-species) MegaBLAST
98Discontiguous Word Options
99Disco. Megablast Example . . .
Query: NM_078651 Drosophila melanogaster CG18582-PA (mbt) mRNA (3244 bp)
/note="mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2"
Database: nr (nt), Mammalia[orgn]
- MegaBLAST: No significant similarity found.
- Discontiguous megaBLAST: numerous hits . . .
100Ex Discontiguous MegaBLAST
101Ex BLASTN
102PSI-BLAST
Position-specific Iterated BLAST
- Example: confirming relationships of purine nucleotide metabolism proteins
103PSI-BLAST
E value cutoff for PSSM
104RESULTS Initial BLASTP
Same results as protein-protein BLAST; different format
105Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not
found by ordinary BLAST
106Tenth PSSM Search Convergence
107PHI-BLAST
108What's New?
109BLAST Databases
- Nucleotide
- - refseq_rna: NM_, XM_
- - refseq_genomic: NC_, NG_
- - env_nt
- - - environmental sample[filter], e.g., 16S rRNA
- Protein
- - refseq: NP_, XP_
- - env_nr
110New Formatter
Select lower case
Select red
111BLAST Output Alignments Filter
low complexity sequence filtered
112BLAST Output CDS Feature
113Advanced Options
Limit to Organism
all[filter] NOT ma
Example Entrez Queries
    all[Filter] NOT mammalia[Organism]
    ray finned fishes[Organism]
    srcdb refseq[Properties]
Nucleotide only
    biomol mrna[Properties]
    biomol genomic[Properties]
Other Advanced
    -e 10000   expect value
    -v 2000    descriptions
    -b 2000    alignments
-e 10000 -v 2000
114Genome BLAST
115Genome BLAST via Map Viewer
116Example Human Genome BLAST
117Human Genome BLAST Results
118Human Genome BLAST MapViewer
119Example Mapping Oligos Onto a Genome
?
>forward
CCATGGCGACCCTGGAAAAGC
>reverse
CAGCAGCGGCTGTGCCTGCGG
?
?
120Map Oligos Onto Genome
>
CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
-W 7 -e 1000
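The single query on this slide is the two oligos joined by a run of N characters, so both can be searched at once while the spacer keeps them from matching as one contiguous word. A minimal sketch of that construction follows; the function name and the default spacer length are illustrative.

```python
def join_primers(forward, reverse, spacer_len=10):
    """Concatenate two oligos with a spacer of N characters between them."""
    return forward + "N" * spacer_len + reverse


forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
print(join_primers(forward, reverse))
```

Short queries like these produce few high-scoring hits, which is why the slide pairs the query with a small word size and a relaxed expect threshold.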
121Genome BLAST Results
122Primer Alignments
reverse primer
forward primer
123MapViewer
124MapViewer
125Sequence View (sv)
forward
reverse
126Service Addresses
- BLAST: blast-help_at_ncbi.nlm.nih.gov
- General Help: info_at_ncbi.nlm.nih.gov
- Wayne Matten: matten_at_ncbi.nlm.nih.gov