Title: Outline
2Outline
- The fundamental importance of Alignment and Statistics
- The Basic Sequence Similarity Algorithm
- Heuristics: BLAST, FASTA, SIM4
- ESTomics
3A story
The GOLD-BUG
By Edgar A. Poe
4The Gold-Bug, By Edgar A. Poe
Mr. William Legrand left New Orleans and took up
residence on Sullivan's Island, near Charleston,
South Carolina.
His servant was Jupiter, an old negro. He calls
Mr. Legrand "Massa Will".
One day, Massa Will found a bug, a scarabaeus,
which he believed to be totally new.
5 Jupiter describes the bug in his own words: "de
bug is a gole-bug, solid, ebery bit of him,
inside and all, sep him wing -- neber feel half
so hebby a bug in my life."
The design on the bug's back resembled a
death's-head. The story continues with the
search for a great treasure hidden by the
famous pirate Captain Kidd.
6Captain Kidd's Code
53305))64826)4.)4)806 48860))851(
883(88)5 46(8896?8)(485)52 (495
62(5_4)884069285)) 68)41(948081881
48854) 48552880681(948(884(?34 48)4161
188?
7E. A. Poe
Circumstances, and a certain bias of mind,
have led me to take interest in such riddles,
and it may well be doubted whether human
ingenuity can construct an enigma of the
kind which human ingenuity may not, by proper
application, resolve.
8What Language ?
In the present case -- indeed in all cases of
secret writing -- the first question regards the
language of the cipher; for the principles of
solution, so far, especially, as the more simple
ciphers are concerned, depend upon, and are
varied by, the genius of a particular idiom.
9E. A. Poe
In general, there is no alternative but
experiment (directed by probabilities) of every
tongue known to him who attempts the solution,
until the true one be attained. But for this
consideration, I should have begun my attempts
with the Spanish and French, as the tongues in
which a secret of this kind would most naturally
have been written by a pirate of the Spanish
main. As it was, I assumed the cryptograph to be
English.
10Statistics
No division between words
Statistics of the characters:
Character        Count
8                33
(symbol lost)    26
4                19
)                16
(symbol lost)    13
5                12
6                11
1                8
0                6
9, 2             5
3                4
?                3
(symbol lost)    2
_                1
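The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `symbol_frequencies` and the short sample string are assumptions made for the example, not the full cryptogram.

```python
from collections import Counter

def symbol_frequencies(ciphertext):
    """Count every non-space symbol and list them from most to least frequent."""
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    return counts.most_common()

# Hypothetical short fragment, used only to show the call.
sample = "48860))851( 883(88)5"
freqs = symbol_frequencies(sample)
print(freqs[:2])
```

The most frequent symbol in the output is the first candidate for the letter e.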
11We found our first letter!
In English the letter which most frequently
occurs is e.
Afterwards, the succession is a o i d h n r s t
u y c f g l m w b k p q x z
12Captain Kidd's Code: 8 is e
53305))64826)4.)4)806 48860))851(
883(88)5 46(8896?8)(485)52 (495
62(5_4)884069285)) 68)41(948081881
48854) 48552880681(948(884(?34 48)4161
188?
The doubled 88 supports this: "ee" occurs in English
in words like speed, seen, been, agree
8 is e
13Captain Kidd's Code: the symbol before 48 is t, and 4 is h
53305))64826)4.)4)806 48860))851(
883(88)5 46(8896?8)(485)52 (495
62(5_4)884069285)) 68)41(948081881
48854) 48552880681(948(884(?34 48)4161
188?
The recurring three-symbol group ending in 48
must be the most frequent English word, "the"
14 The Solution
A good glass in the bishop's hostel in the
devil's seat forty-one degrees and thirteen
minutes northeast and by north main branch
seventh limb east side shoot from the left eye
of the death's-head a bee-line from the tree
through the shot fifty feet out.
15THE ALGORITHMS of GENOMICS
The Programming Language of Genomics is
BLASTALL
16Sequence Comparison
- Biomolecular sequences
  - DNA sequences (strings over the 4-letter alphabet A, C, G, T)
  - RNA sequences (strings over the 4-letter alphabet A, C, G, U)
  - Protein sequences (strings over the 20-letter alphabet of amino acids)
- Sequence similarity helps in the discovery of genes, and in the prediction of the structure and function of proteins.
17The Basic Similarity Analysis Algorithm
- Global Similarity
- Scoring Schemes
- Edit Graphs
- Alignment Path in the Edit Graph
- The Principle of Optimality
- The Dynamic Programming Algorithm
- The Traceback
18Sequence Alignment
- Input: two sequences over the same alphabet
- Output: an alignment of the two sequences
- Example:
  GCGCATTTGAGCGA
  TGCGTTAGGGTGACCA
- A possible alignment:
  -GCGCATTTGAGCGA--
  TGCG--TTAGGGTGACC
- Each column is a match, a mismatch, or an indel
19Consider two sequences X and Y
whose letters xi and yj belong to the same alphabet
20Scoring Schemes
Unit-score
       A   C   G   T   -
   A   1   0   0   0   0
   C   0   1   0   0   0
   G   0   0   1   0   0
   T   0   0   0   1   0
   -   0   0   0   0   0
21ALIGNMENT
ACG
AGG
A is aligned with A
C is aligned with G
G is aligned with G
Unit-score:
Score = s(A,A) + s(C,G) + s(G,G) = 1 + 0 + 1 = 2
22GAPS
"-" is the gap symbol
ACATGGAAT
ACAGGAAAT

ACATGG-AAT
ACA-GGAAAT

OPTIMAL ALIGNMENTS
AAAGGG
GGGAAA

---AAAGGG
GGGAAA---
23 s(x,y): the score for aligning x with y
s(x,-): the score for aligning x with -
s(-,y): the score for aligning - with y
24Alignment
A-CG-G
ATCGTG
Score = s(A,A) + s(-,T) + s(C,C) + s(G,G) + s(-,T) + s(G,G)
The score of an alignment is the sum of the scores of
the pairwise aligned symbols
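As a small illustration of "sum of the scores of the pairwise aligned symbols", here is a Python sketch using the unit score from the earlier slide. The names `unit_score` and `alignment_score` are chosen for this example only.

```python
def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 otherwise (any pair with a gap scores 0)."""
    return 1 if a == b and a != "-" else 0

def alignment_score(top, bottom, s=unit_score):
    """Sum s over the columns of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(s(a, b) for a, b in zip(top, bottom))

print(alignment_score("ACG", "AGG"))        # columns (A,A), (C,G), (G,G)
print(alignment_score("A-CG-G", "ATCGTG"))  # four matching columns, two gap columns
```

Any other scoring scheme can be plugged in by passing a different function `s`.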
25Scoring Scheme
Dayhoff score (excerpt)
Column order: - A R N D C Q E G H I L K M F P S T W Y V
Gap row (-): -8 against every residue
Row A (against A ... V): 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0
Rows R, N, D, ...: diagonal entries 6 (R,R), 4 (N,N), ...
PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout
somatotropin proteins
26Scoring Functions
Mutations: substitutions, insertions, deletions
Scoring function: a sum of terms, one for each
pair of aligned residues and one for each gap
Meaning: the log of the relative likelihood that
the sequences are related, compared to being
unrelated
Identities and conservative substitutions are
positive terms
Non-conservative substitutions are
negative terms
27The Edit Graph
Suppose that we want to align
AGT with AT
We are going to construct a graph
where alignments between the two
sequences correspond to paths between the Begin
and End nodes of the graph.
This is the Edit Graph
28The sequence AGT
AGT has length 3, so its positions run from 0 to 3
The sequence AT
AT has length 2, so its positions run from 0 to 2
The Edit graph has (3+1)(2+1) = 12 nodes
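29[Figure: grid of the edit graph. Columns 0 to 3 are labeled A, G, T; rows 0 to 2 are labeled A, T. Begin is the top-left node and End is the bottom-right node.]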
AGT indexes the columns, and AT indexes the rows
of this table
30Begin
End
The Graph is directed. The nodes (i,j) will hold
values.
31[Figure: the edit graph for AGT (columns 0 to 3) and AT (rows 0 to 2), drawn from Begin to End.]
32Directed edges are labeled with pairs of aligned
letters.
33Alignment Path in the Edit Graph
Example of an alignment, read along one path from Begin to End:
AGT
A-T
Every path from Begin to End corresponds to an
alignment
Every alignment corresponds to a path between
Begin and End
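The correspondence between paths and alignments can be made concrete with a short Python sketch. It assumes, for illustration only, that a node (i, j) means "i letters of the first sequence and j letters of the second have been consumed"; the function name `path_to_alignment` is hypothetical.

```python
def path_to_alignment(x, y, path):
    """Turn a Begin-to-End path, given as a list of nodes (i, j), into an alignment.

    Assumed convention: a diagonal step aligns two letters, a step in i aligns
    a letter of x with a gap, and a step in j aligns a gap with a letter of y.
    """
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):
            top.append(x[i0])
            bottom.append(y[j0])
        elif step == (1, 0):
            top.append(x[i0])
            bottom.append("-")
        elif step == (0, 1):
            top.append("-")
            bottom.append(y[j0])
        else:
            raise ValueError(f"not an edge of the edit graph: {step}")
    return "".join(top), "".join(bottom)

print(path_to_alignment("AGT", "AT", [(0, 0), (1, 1), (2, 1), (3, 2)]))
```

Each edge contributes exactly one column, which is why every path yields one alignment and every alignment yields one path.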
34The Principle of Optimality
The optimal answer to a problem is expressed in
terms of the optimal answers to its subproblems
35Dynamic Programming
Given: two sequences X and Y
Find: an optimal alignment of X with Y
Part 1: first compute the optimal alignment score
Part 2: construct an optimal alignment
We are looking for the optimal alignment, that is,
a maximal-score path in the Edit Graph from the
Begin vertex to the End vertex
36The DP Matrix S(i,j)
S(1,0)
S(2,1)
37The DP Matrix
Matrix S = [S(i,j)]
S(i,j): the score of the maximal-score path from
the Begin vertex to the vertex (i,j)
Optimal path to (i,j):
The optimal path to (i,j) must pass through
one of the vertices (i-1,j-1), (i-1,j), (i,j-1)
38Optimal path
Case: the optimal path to (i,j) passes through (i-1,j)
It is the optimal path to (i-1,j) followed by the edge labeled (xi, -)
Score: S(i-1,j) + s(xi, -)
39Optimal path
Case: the optimal path to (i,j) passes through (i-1,j-1)
It is the optimal path to (i-1,j-1) followed by the edge labeled (xi, yj)
Score: S(i-1,j-1) + s(xi, yj)
40Optimal path
Case: the optimal path to (i,j) passes through (i,j-1)
It is the optimal path to (i,j-1) followed by the edge labeled (-, yj)
Score: S(i,j-1) + s(-, yj)
41The Basic ALGORITHM
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + s(x_i, y_j) \\
S(i-1,j) + s(x_i, -) \\
S(i,j-1) + s(-, y_j)
\end{cases}
$$
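A minimal Python sketch of Part 1 (computing only the optimal score) follows. It assumes the unit score from the earlier slide as the default scoring scheme and the boundary value S(0,0) = 0; the function names are illustrative.

```python
def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 otherwise (gaps included)."""
    return 1 if a == b and a != "-" else 0

def global_score(x, y, s=unit_score):
    """Return S(len(x), len(y)), keeping only one line of the matrix S in memory."""
    prev = [0] * (len(y) + 1)                  # S(0, j), starting from S(0, 0) = 0
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + s("-", y[j - 1])
    for i in range(1, len(x) + 1):
        cur = [prev[0] + s(x[i - 1], "-")]     # S(i, 0)
        for j in range(1, len(y) + 1):
            cur.append(max(
                prev[j - 1] + s(x[i - 1], y[j - 1]),  # S(i-1, j-1) + s(xi, yj)
                prev[j] + s(x[i - 1], "-"),           # S(i-1, j)   + s(xi, -)
                cur[j - 1] + s("-", y[j - 1]),        # S(i, j-1)   + s(-, yj)
            ))
        prev = cur
    return prev[len(y)]

print(global_score("AGT", "AT"))
print(global_score("ACATGGAAT", "ACAGGAAAT"))
```

Keeping a single line of S is enough for the score; reconstructing the alignment itself needs the full matrix (or extra bookkeeping).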
42OPTIMAL ALIGNMENT and TRACEBACK
DP matrix for AGT (columns) and AT (rows), unit score:
           A   G   T
       0   0   0   0
   A   0   1   1   1
   T   0   1   1   2
Optimal Alignment
AGT
A-T
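The traceback can be sketched in Python as follows. This is one possible implementation, not necessarily the one behind the slide: it stores the full matrix, assumes integer scores (so equality tests are exact), and breaks ties by preferring the diagonal edge; all names are illustrative.

```python
def unit_score(a, b):
    """Unit score: 1 for two identical letters, 0 otherwise (gaps included)."""
    return 1 if a == b and a != "-" else 0

def global_alignment(x, y, s=unit_score):
    """Fill the whole matrix S, then walk back from End to Begin."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + s(x[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + s("-", y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] + s(x[i - 1], "-"),
                          S[i][j - 1] + s("-", y[j - 1]))
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        # Follow any edge that explains the value S[i][j]; the diagonal is tried first.
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + s(x[i - 1], "-"):
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))

print(global_alignment("AGT", "AT"))
```

Several optimal alignments may exist; a different tie-breaking order can return a different one with the same score.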
43The Basic ALGORITHM Local Similarity
We add the 0 term:
$$
S(i,j) = \max
\begin{cases}
0 \\
S(i-1,j-1) + s(x_i, y_j) \\
S(i-1,j) + s(x_i, -) \\
S(i,j-1) + s(-, y_j)
\end{cases}
$$
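A Python sketch of the local variant is given below. The scoring values (match +1, mismatch -1, gap -1) are assumptions for the example; local similarity needs some negative scores, otherwise the 0 term never matters.

```python
def local_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best local similarity score: the maximum of S(i, j) over the whole matrix."""
    best = 0
    prev = [0] * (len(y) + 1)        # first line of S is all zeros
    for i in range(1, len(x) + 1):
        cur = [0]                    # S(i, 0) = 0
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(0,                    # restart the alignment here
                           prev[j - 1] + sub,    # S(i-1, j-1) + s(xi, yj)
                           prev[j] + gap,        # S(i-1, j)   + s(xi, -)
                           cur[j - 1] + gap))    # S(i, j-1)   + s(-, yj)
            best = max(best, cur[j])
        prev = cur
    return best

print(local_score("AAAGGG", "GGGAAA"))
```

The answer is read from the best cell anywhere in the matrix rather than from the End vertex.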
44General Scoring Schemes
Assumptions
1. Independence of mutations at different sites:
additive scoring scheme
2. Gaps of any length are considered one mutation
All of the efficient alignment algorithms that
employ the dynamic programming method
are based fundamentally on the fact
that the scoring function is additive.
45Substitution Matrices
Consider an ungapped alignment of two equal-length
sequences
Compute the probability that the two sequences
are related
Compute the probability that the two sequences
are not related
Compute the ratio of the two probabilities
46Random Model R
Every letter z occurs independently with
probability q_z
47Match Model M
Aligned pairs of residues a, b occur with joint
probability p_ab
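48s(a,b) the substitution matrix
Log-odds ratio
$$
\log \frac{P(X, Y \mid M)}{P(X, Y \mid R)}
= \log \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}
= \sum_i s(x_i, y_i)
$$
where
$$
s(a,b) = \log \frac{p_{ab}}{q_a q_b}
$$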
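To make the log-odds idea concrete, here is a small Python sketch. The two-letter alphabet and all probabilities are toy values invented for the example, and base-2 logarithms are an arbitrary choice; the function names are illustrative.

```python
import math

def log_odds(p_ab, q_a, q_b, base=2):
    """s(a, b) = log( p_ab / (q_a * q_b) ): match model versus random model."""
    return math.log(p_ab / (q_a * q_b), base)

def ungapped_log_odds(x, y, p, q, base=2):
    """Log-odds score of an ungapped alignment of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(log_odds(p[a, b], q[a], q[b], base) for a, b in zip(x, y))

# Toy models over a two-letter alphabet (assumed values, not real frequencies).
q = {"A": 0.5, "C": 0.5}
p = {("A", "A"): 0.4, ("C", "C"): 0.4, ("A", "C"): 0.1, ("C", "A"): 0.1}

print(log_odds(p["A", "A"], q["A"], q["A"]))   # positive: identity is favored
print(log_odds(p["A", "C"], q["A"], q["C"]))   # negative: substitution is disfavored
print(ungapped_log_odds("AAC", "ACC", p, q))
```

Pairs that occur more often in related sequences than by chance get positive scores, and the rest get negative scores, which is the sign pattern described on the Scoring Functions slide.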
49HEURISTICS: BLAST, FASTA, and SIM4
50BLAST (Basic Local Alignment Search Tool)
- A suite of sequence comparison algorithms, optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query
Altschul, Gish, Miller, Myers, Lipman. Basic Local Alignment Search Tool. J. Mol. Biol. 215(3):403-10 (1990).
Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. NAR 25(17):3389-402 (1997) (and references therein).
51The BLAST algorithm
- Detect all word hits (exact, or nearly identical matches) of a given length between the two sequences
  - k=10 for nucleotide sequences (exact word matches)
  - k=3 for protein sequences (nearly identical word matches)
- Extend the word hits in both directions to high-scoring gap-free segment pairs (HSPs)
  - retain only HSPs that score above a threshold
  - start from the center of the HSP (original BLAST, 1990), or from the center of a pair of HSPs located close to each other on the same diagonal (gapped BLAST, 1997)
- Extend the HSPs in both directions allowing for gaps
  - use dynamic programming, and stop when the alignment score falls more than a threshold X below the best score yet seen
- Report all statistically significant local alignments
  - E-value (starting with BLAST 2.0) is used to measure the statistical significance
  - E-value: the number of alignments with score equal to or higher than s one would expect to find by chance when searching the database
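The first two stages (exact word hits, then gap-free extension with a drop-off threshold) can be sketched in Python as below. This is a simplified illustration and not the BLAST source code: the scores (match +1, mismatch -1), the value of `x_drop`, and the function names are assumptions, and the drop-off rule is applied here to the ungapped extension for simplicity.

```python
from collections import defaultdict

def word_hits(query, subject, k):
    """All (i, j) with query[i:i+k] == subject[j:j+k], found through a word index."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]

def extend(query, subject, i, j, step, match=1, mismatch=-1, x_drop=2):
    """Gap-free extension from (i, j), moving by step (+1 right, -1 left).

    Stops when the running score falls more than x_drop below the best score
    seen; returns (best score gain, number of positions kept).
    """
    score = best = best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        i += step
        j += step
    return best, best_len

def ungapped_hsp(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit of length k at (i, j) in both directions."""
    right, r_len = extend(query, subject, i + k, j + k, +1, match, mismatch, x_drop)
    left, l_len = extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = k * match + left + right
    return score, (i - l_len, j - l_len, k + l_len + r_len)

query, subject = "TTACGTACGG", "CCACGTACAA"
hits = word_hits(query, subject, 4)
print(hits)
print(ungapped_hsp(query, subject, *hits[0], 4))
```

The returned triple is (start in query, start in subject, length) of the segment pair; a real search would then keep only the HSPs above a score threshold.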
52FASTA
- A program for rapid alignment of pairs of protein and DNA sequences, building a local alignment from matching sequence patterns, or words
- Algorithm for comparing a query to a database of sequences. For each database sequence:
  - Identify the 10 diagonal regions having the largest number of perfect word matches of a given length
    - word size k=1,2 for protein, and k=6-10 for nucleotide searches
  - Re-score these regions using a given scoring matrix (e.g., PAM250), and trim them to form (gap-free) maximal scoring initial regions
  - Join (non-overlapping) initial regions from adjacent diagonals to generate longer regions, allowing for gaps
    - Re-score these based on the initial regions' scores, assessing a penalty for each joining
- Align the query sequence to each of the sequences in the search set having the highest overall scores
Pearson and Lipman. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
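The first FASTA step relies on the fact that a word match at query position i and database position j lies on diagonal i - j. The Python sketch below counts word matches per whole diagonal, which is a simplification of the "diagonal regions" used by the program; the function name `top_diagonals` is hypothetical.

```python
from collections import Counter, defaultdict

def top_diagonals(query, subject, k, n_best=10):
    """Count exact k-word matches on each diagonal (i - j) and return the n_best fullest."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals.most_common(n_best)

print(top_diagonals("ACGTACGT", "TTACGTACGA", 4))
```

Diagonals with many word matches are the candidates that later steps re-score with a substitution matrix.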
53Sim4
- Aligns an expressed DNA (EST, cDNA, mRNA) sequence with a genomic sequence for that gene, allowing for introns and sequencing errors
Florea, Hartzell, Zhang, Rubin, Miller. A computer program for aligning expressed DNA and genomic sequences. Genome Res 8(9):967-74 (1998).
54Stages and algorithmic techniques
- Detect basic homology blocks
  - Determine gap-free matches (HSPs) using a blast-like homology search
    - Detect all exact word matches of length k (e.g., k=12)
    - Extend the word hits in both directions, by substitutions, to gap-free high-scoring segment pairs (HSPs)
    - Retain only HSPs scoring above a threshold
  - Connect the HSPs to form larger blocks (exon cores) using sparse dynamic programming
- Extend or trim the exon cores to eliminate gaps or overlaps in the cDNA sequence
  - Extend the similarity blocks using fast greedy sequence comparison algorithms
  - Detect new exon cores with the blast-like homology search tuned for higher sensitivity
- Refine the introns
  - Predict the locations of splice junctions using a combined measure of the accuracy of alignment and the intensity of splice signals at the ends of each intron
- Generate the spliced alignment
  - Align the sequences within individual exons using greedy alignment algorithms
  - Connect the chain of exon alignments by gaps (introns)
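The step that connects HSPs into a chain of exon cores can be illustrated with a simple quadratic-time dynamic program in Python. This is a stand-in for the sparse dynamic programming named above, not the sim4 implementation; the HSP tuple layout (cDNA start, genome start, length, score) and the function name are assumptions for the example.

```python
def chain_hsps(hsps):
    """Best-scoring chain of HSPs that are colinear and non-overlapping in both sequences.

    hsps: list of (cdna_start, genome_start, length, score) with length >= 1.
    Returns (total score, chain as a list of HSPs in order).
    """
    hsps = sorted(hsps)                      # by cDNA start, so predecessors come first
    n = len(hsps)
    if n == 0:
        return 0, []
    best = [h[3] for h in hsps]              # best chain score ending at each HSP
    prev = [-1] * n
    for b in range(n):
        cb, gb, _, sb = hsps[b]
        for a in range(b):
            ca, ga, la, _ = hsps[a]
            # HSP a may precede b only if it ends before b starts in both sequences.
            if ca + la <= cb and ga + la <= gb and best[a] + sb > best[b]:
                best[b] = best[a] + sb
                prev[b] = a
    end = max(range(n), key=best.__getitem__)
    total = best[end]
    chain = []
    while end != -1:
        chain.append(hsps[end])
        end = prev[end]
    return total, chain[::-1]

# Hypothetical HSPs: three colinear blocks and one out-of-order block.
example = [(0, 100, 20, 20), (25, 500, 30, 30), (10, 50, 10, 10), (60, 900, 15, 15)]
print(chain_hsps(example))
```

The large jumps in genome coordinates between consecutive chained blocks are what the later stages interpret as introns.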