Outline - PowerPoint PPT Presentation

Slides: 55

Provided by: CeleraE7

Transcript and Presenter's Notes

Title: Outline

1
(No Transcript)
2
Outline

The fundamental importance of Alignment and
Statistics
The Basic Sequence Similarity Algorithm
Heuristics: BLAST, FASTA, SIM4
ESTomics

3
A story
The GOLD-BUG
By Edgar A. Poe
4
The Gold-Bug, By Edgar A. Poe
Mr. William Legrand left New Orleans and took up
residence on Sullivan's Island, near Charleston,
South Carolina.
His servant was Jupiter, an old negro, who calls
Mr. Legrand "Massa Will".
One day, Massa Will found a bug, a scarabaeus
which he believed to be totally new.
5

Jupiter describes the bug in his own words: "de
bug is a gole-bug, solid, ebery bit of him,
inside and all, sep him wing -- neber feel half
so hebby a bug in my life."
The design on the bug's back resembled a
death's-head. The story continues with the
search for a great treasure hidden by the
famous pirate Captain Kidd.
6
Captain Kidd's Code
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*_4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
7
E. A. Poe
Circumstances, and a certain bias of mind,
have led me to take interest in such riddles,
and it may well be doubted whether human
ingenuity can construct an enigma of the
kind which human ingenuity may not, by proper
application, resolve.
8
What Language ?
In the present case -- indeed in all cases of
secret writing -- the first question regards the
language of the cipher; for the principles of
solution, so far, especially, as the more simple
ciphers are concerned, depend upon, and are
varied by, the genius of a particular idiom.
9
E. A. Poe
In general, there is no alternative but
experiment (directed by probabilities) of every
tongue known to him who attempts the solution,
until the true one be attained. But for this
consideration, I should have begun my attempts
with the Spanish and French, as the tongues in
which a secret of this kind would most naturally
have been written by a pirate of the Spanish
main. As it was, I assumed the cryptograph to be
English.
10
Statistics
No division between words
Count of each character:
8          33
;          26
4          19
‡ )        16
*          13
5          12
6          11
† 1        8
0          6
9 2        5
: 3        4
?          3
¶          2
_ .        1
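The counting step on this slide can be sketched in a few lines of Python. This is only an illustration: the function name symbol_frequencies and the short sample string are assumptions made for the example, and the sample is not the full cryptogram.

```python
from collections import Counter


def symbol_frequencies(ciphertext: str) -> list[tuple[str, int]]:
    """Return (symbol, count) pairs, most frequent first; whitespace is ignored."""
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    # Sort by descending count, then by symbol, so that ties have a fixed order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical short fragment, used only to show the shape of the output.
sample = "88;4 8;(88"
for symbol, count in symbol_frequencies(sample):
    print(symbol, count)
```

Run on the whole cryptogram, the top of this list is the candidate for the letter e.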
11
We found our first letter!
In English the letter which most frequently
occurs is e.
Afterwards, the succession is a o i d h n r s t
u y c f g l m w b k p q x z
12
Captain Kidd's Code: 8 is e
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*_4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
The pair 88 occurs often, just as ee does in
English words like speed, seen, been, agree,
so 8 is e
13
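Captain Kidd's Code: ; is t and 4 is h
53‡‡†305))6*;4826)4‡.)4‡);806*;48†8¶60))85;1‡(
;:‡*8†83(88)5*†;46(;88*96*?;8)*‡(;485);5*†2:*‡(;495
6*2(5*_4)8¶8*;4069285);)6†8)4‡‡;1(‡9;48081;8:8‡1
;48†85;4)485†528806*81(‡9;48;(88;4(‡?34;48)4‡;161
;:188;‡?;
The most frequent three-symbol group, ;48,
must be the most frequent word: the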
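Once a few symbols are known, they can be substituted back into the cipher to expose partial words. The sketch below assumes the three assignments from the slides (8 is e, 4 is h, and the semicolon, as in Poe's story, is t); the function name apply_partial_key and the placeholder character are illustrative choices.

```python
def apply_partial_key(ciphertext: str, key: dict[str, str], unknown: str = ".") -> str:
    """Replace every known cipher symbol by its letter and mark the rest as unknown."""
    return "".join(key.get(symbol, unknown) for symbol in ciphertext)


# Partial key: only the three symbols identified so far.
partial_key = {"8": "e", ";": "t", "4": "h"}

# A short fragment of cipher symbols (assumed here for illustration).
print(apply_partial_key(";48;(88", partial_key))  # thet.ee
```

The remaining dot sits where a single unknown letter completes "the t.ee", which is how the next symbol gets identified.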
14

The Solution
"A good glass in the bishop's hostel in the
devil's seat; forty-one degrees and thirteen
minutes; northeast and by north; main branch,
seventh limb, east side; shoot from the left eye
of the death's-head; a bee-line from the tree
through the shot fifty feet out."
15
THE ALGORITHMS of GENOMICS
The Programming Language of Genomics is
BLASTALL
16
Sequence Comparison

Biomolecular sequences
DNA sequences (strings over the 4-letter alphabet
A, C, G, T)
RNA sequences (strings over the 4-letter alphabet
A, C, G, U)
Protein sequences (strings over the 20-letter
alphabet of amino acids)
Sequence similarity helps in the discovery of
genes, and the prediction of structure and
function of proteins.

17
The Basic Similarity Analysis Algorithm

Global Similarity
Scoring Schemes
Edit Graphs
Alignment Path in the Edit Graph
The Principle of Optimality
The Dynamic Programming Algorithm
The Traceback

18
Sequence Alignment

Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
GCGCATTTGAGCGA
TGCGTTAGGGTGACCA
A possible alignment:
-GCGCATTTGAGCGA---
TGCG--TTAGGGTGACCA

Each column is a match, a mismatch, or an indel
19
Consider two sequences
X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m
over the same alphabet; every x_i and y_j belongs to that alphabet
20
Scoring Schemes
Unit-score
     A  C  G  T  -
A    1  0  0  0  0
C    0  1  0  0  0
G    0  0  1  0  0
T    0  0  0  1  0
-    0  0  0  0  0
21
ALIGNMENT
ACG
AGG
A is aligned with A
C is aligned with G
G is aligned with G
Unit-score
Score = s(A,A) + s(C,G) + s(G,G) = 1 + 0 + 1 = 2
22
GAPS
- is the gap symbol
Sequences: ACATGGAAT and ACAGGAAAT
ACATGG-AAT
ACA-GGAAAT
OPTIMAL ALIGNMENTS
Sequences: AAAGGG and GGGAAA
---AAAGGG
GGGAAA---
23
s(x,y): the score for aligning x with y
s(x,-): the score for aligning x with -
s(-,y): the score for aligning - with y
24
Alignment
A-CG-G
ATCGTG
Score = s(A,A) + s(-,T) + s(C,C) + s(G,G) + s(-,T) + s(G,G)
THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED
SYMBOLS
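As a small illustration of "the sum of the scores of the pairwise aligned symbols", the sketch below scores two equal-length alignment rows column by column. The names unit_score and alignment_score are assumptions for this example; the unit score gives 1 for identical letters and 0 otherwise, including gaps.

```python
def unit_score(a: str, b: str) -> int:
    """Unit score: 1 for two identical letters, 0 for mismatches and gaps."""
    return 1 if a == b and a != "-" else 0


def alignment_score(row_x: str, row_y: str, score=unit_score) -> int:
    """Add up the score of every aligned column of the two rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    return sum(score(a, b) for a, b in zip(row_x, row_y))


print(alignment_score("ACG", "AGG"))        # 1 + 0 + 1 = 2
print(alignment_score("A-CG-G", "ATCGTG"))  # 1 + 0 + 1 + 1 + 0 + 1 = 4
```

Any other scoring scheme can be passed in as the score argument, as long as it accepts the gap symbol.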
25
Scoring Scheme
Dayhoff score (excerpt)
    -   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
-  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8
A  -8   3  -3   0   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0
R           6
N               4
...
PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout
somatotropin proteins
26
Scoring Functions
Mutations: Substitutions, Insertions, Deletions
Scoring function: a sum of terms, one for each
pair of aligned residues and one for each gap
The meaning: the log of the relative likelihood
that the sequences are related, compared to being
unrelated
Identities and conservative substitutions are
positive terms
Non-conservative substitutions are
negative terms
27
The Edit Graph
Suppose that we want to align
AGT with AT
We are going to construct a graph
in which alignments between the two
sequences correspond to paths between the Begin
and End nodes of the graph.
This is the Edit Graph
28
The sequence AGT has length 3 (positions 0, 1, 2, 3)
The sequence AT has length 2 (positions 0, 1, 2)
The Edit Graph has (3+1)(2+1) = 12 nodes
29
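[Figure: grid of nodes with columns 0 to 3 labeled A, G, T and rows 0 to 2 labeled A, T; Begin is the node with both indices 0 and End is the opposite corner]
AGT indexes the columns, and AT indexes the rows
of this table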
30
[Figure: the edit graph, with the Begin and End nodes marked]
The graph is directed. The nodes (i,j) will hold
values.
31
[Figure: the edit graph grid; columns 0 to 3 labeled A, G, T, rows 0 to 2 labeled A, T, with Begin and End marked]
32
Each directed edge is labeled with a pair of
aligned letters.
[Figure: the edit graph with labeled edges between Begin and End]
33
Alignment Path in the Edit Graph
AGT
A-T
[Figure: the path from Begin to End that spells this alignment]
Every path from Begin to End corresponds to an
alignment
Every alignment corresponds to a path between
Begin and End
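The correspondence between alignments and paths can be made concrete with a short sketch. It assumes node coordinates (i, j), where i counts the letters consumed from the first row and j the letters consumed from the second row; the function name alignment_to_path is an illustrative choice.

```python
def alignment_to_path(row_x: str, row_y: str) -> list[tuple[int, int]]:
    """Return the edit-graph nodes visited by an alignment, from Begin to End."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    i = j = 0
    path = [(i, j)]  # Begin node
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gap symbols")
        if a != "-":
            i += 1  # one more letter of the first sequence is consumed
        if b != "-":
            j += 1  # one more letter of the second sequence is consumed
        path.append((i, j))
    return path


print(alignment_to_path("AGT", "A-T"))  # [(0, 0), (1, 1), (2, 1), (3, 2)]
```

The last node is (3, 2), the End node for sequences of lengths 3 and 2.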
34
The Principle of Optimality
The optimal answer to a problem is expressed in
terms of the optimal answers to its subproblems
35
Dynamic Programming
Given: two sequences X and Y
Find: an optimal alignment of X with Y
Part 1: Compute first the optimal alignment score
Part 2: Construct an optimal alignment
We are looking for the optimal alignment: the
maximal-score path in the Edit Graph from the
Begin vertex to the End vertex
36
The DP Matrix S(i,j)
[Figure: the DP matrix, with the cells S(1,0) and S(2,1) marked]
37
The DP Matrix
Matrix S = S(i,j)
S(i,j): the score of the maximal-score path from
the Begin vertex to the vertex (i,j)
Optimal Path to (i,j)
The optimal path to (i,j) must pass through
one of the vertices
(i-1,j-1)
(i-1,j)
(i,j-1)
38
Optimal path through (i-1,j)
x_i is aligned with the gap symbol -
Optimal path to (i-1,j), followed by the edge (x_i, -)
S(i,j) = S(i-1,j) plus the score s(x_i, -)
39
Optimal path through (i-1,j-1)
x_i is aligned with y_j
Optimal path to (i-1,j-1), followed by the edge (x_i, y_j)
S(i,j) = S(i-1,j-1) plus the score s(x_i, y_j)
40
Optimal path through (i,j-1)
y_j is aligned with the gap symbol -
Optimal path to (i,j-1), followed by the edge (-, y_j)
S(i,j) = S(i,j-1) plus the score s(-, y_j)
41
The Basic ALGORITHM
S(i,j) = MAX of
  S(i-1, j-1) + s(x_i, y_j)
  S(i-1, j) + s(x_i, -)
  S(i, j-1) + s(-, y_j)
where s(a,b) is the score for aligning a with b
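A minimal Python sketch of this recurrence follows. The boundary row and column (prefixes aligned entirely against gaps) are not shown on the slide and are an assumption here, as are the names global_score_matrix and unit_score. The matrix is indexed S[i][j] with i running over X and j over Y.

```python
def unit_score(a: str, b: str) -> int:
    """Unit score: 1 for two identical letters, 0 otherwise (gaps included)."""
    return 1 if a == b and a != "-" else 0


def global_score_matrix(x: str, y: str, score=unit_score) -> list[list[int]]:
    """Fill S so that S[i][j] is the best score of aligning x[:i] with y[:j]."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary conditions: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + score(x[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + score("-", y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # x_i aligned with y_j
                S[i - 1][j] + score(x[i - 1], "-"),           # x_i aligned with a gap
                S[i][j - 1] + score("-", y[j - 1]),           # y_j aligned with a gap
            )
    return S


S = global_score_matrix("AGT", "AT")
print(S[3][2])  # optimal global score: 2
```

The table takes time and space proportional to the product of the two sequence lengths.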
42
OPTIMAL ALIGNMENT and TRACEBACK
        A  G  T
     0  0  0  0
A    0  1  1  1
T    0  1  1  2
Optimal Alignment
AGT
A-T
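The traceback can be sketched as a walk from the End cell back to the Begin cell, at each step choosing a predecessor that explains the stored value. The matrix below is hard-coded for AGT and AT under the unit score and is indexed S[i][j] with i over AGT, so it is the transpose of the table as drawn with AGT on the columns. The names traceback and unit_score are assumptions for this example.

```python
def unit_score(a: str, b: str) -> int:
    """Unit score: 1 for two identical letters, 0 otherwise (gaps included)."""
    return 1 if a == b and a != "-" else 0


def traceback(S: list[list[int]], x: str, y: str, score=unit_score) -> tuple[str, str]:
    """Recover one optimal alignment from a filled global DP matrix S[i][j]."""
    i, j = len(x), len(y)
    row_x, row_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1])  # diagonal edge
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + score(x[i - 1], "-"):
            row_x.append(x[i - 1]); row_y.append("-")       # x_i against a gap
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])       # y_j against a gap
            j -= 1
    return "".join(reversed(row_x)), "".join(reversed(row_y))


# DP matrix for x = AGT, y = AT under the unit score.
S = [[0, 0, 0], [0, 1, 1], [0, 1, 1], [0, 1, 2]]
print(traceback(S, "AGT", "AT"))  # ('AGT', 'A-T')
```

When several predecessors tie, this sketch prefers the diagonal edge; other tie-breaking rules give other alignments with the same score.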
43
The Basic ALGORITHM: Local Similarity
S(i,j) = MAX of
  0    (we add this)
  S(i-1, j-1) + s(x_i, y_j)
  S(i-1, j) + s(x_i, -)
  S(i, j-1) + s(-, y_j)
where s(a,b) is the score for aligning a with b
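The effect of the added 0 can be seen in a short sketch. The scoring scheme used here (match +1, mismatch -1, gap -1) is a hypothetical choice, since local alignment needs negative terms to be meaningful; the names simple_score and local_best_score are also assumptions.

```python
def simple_score(a: str, b: str) -> int:
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def local_best_score(x: str, y: str, score=simple_score) -> tuple[int, tuple[int, int]]:
    """Return the best local alignment score and the cell (i, j) where it ends."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                0,                                            # start a new local alignment
                S[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                S[i - 1][j] + score(x[i - 1], "-"),
                S[i][j - 1] + score("-", y[j - 1]),
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return best, best_cell


print(local_best_score("TTAGTCC", "GGAGTAA"))  # (3, (5, 5)): the shared AGT
```

A traceback for the local case starts at the best cell and stops at the first cell holding 0.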
44
General Scoring Schemes
Assumptions
1. Independence of mutations at different sites:
additive scoring scheme
2. Gaps of any length are considered one mutation
All of the efficient alignment algorithms
employing the dynamic programming method
are based fundamentally on the fact
that the scoring function is additive.
45
Substitution Matrices
Consider an ungapped alignment of two equal-length
sequences
Compute the probability that the two sequences
are related
Compute the probability that the two sequences
are not related
Compute the ratio of the two probabilities
46
Random Model R
Every letter z occurs independently with
probability q_z
47
Match Model M
Aligned pairs of residues a, b occur with joint
probability p_ab
48
s(a,b): the substitution matrix
Log-odds ratio
log [ P(x,y | M) / P(x,y | R) ] = sum over i of s(x_i, y_i)
where
s(a,b) = log [ p_ab / (q_a q_b) ]
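To make the log-odds definition concrete, the sketch below builds s(a,b) = log(p_ab / (q_a q_b)) for a toy two-letter alphabet. The probabilities are invented for illustration and the function name log_odds_matrix is an assumption; real matrices are estimated from large sets of trusted alignments.

```python
import math


def log_odds_matrix(p: dict[tuple[str, str], float], q: dict[str, float]) -> dict[tuple[str, str], float]:
    """Return s(a,b) = log(p_ab / (q_a * q_b)) for every pair present in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b])) for (a, b), p_ab in p.items()}


# Toy random model R: both letters are equally likely.
q = {"A": 0.5, "B": 0.5}
# Toy match model M: identical pairs are more likely than under R.
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

s = log_odds_matrix(p, q)
print(round(s[("A", "A")], 3))  # positive: the pair is more likely under M
print(round(s[("A", "B")], 3))  # negative: the pair is more likely under R

# The score of an ungapped alignment is the sum of its per-column terms.
print(round(sum(s[pair] for pair in zip("AAB", "ABB")), 3))
```

Pairs that are more likely under the match model than under the random model get positive scores, which matches the sign convention for identities and conservative substitutions.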
49
HEURISTICS: BLAST, FASTA, and SIM4
50
BLAST (Basic Local Alignment Search Tool)

A suite of sequence comparison algorithms
optimized for speed used to search sequence
databases for optimal local alignments to a
protein or nucleotide query

Altschul, Gish, Miller, Myers, Lipman: Basic
Local Alignment Search Tool, J. Mol. Biol.
215(3):403-10 (1990)
Altschul, Madden, Schaffer,
Zhang, Zhang, Miller, Lipman: Gapped BLAST and
PSI-BLAST: a new generation of protein database
search programs, NAR 25(17):3389-402 (1997) (and
references therein)
51
The BLAST algorithm

Detect all word hits (exact, or nearly identical
matches) of a given length between the two
sequences
k = 10 for nucleotide sequences (exact word
matches)
k = 3 for protein sequences (nearly identical word
matches)
Extend the word hits in both directions to
high-scoring gap-free segment pairs (HSPs)
retain only HSPs that score above a threshold
start from the center of the HSP (original BLAST,
1990), or from the center of a pair of HSPs
located close to each other on the same diagonal
(gapped BLAST, 1997)
Extend the HSPs in both directions allowing for
gaps
use dynamic programming, and stop when the
alignment score falls more than a threshold X
below the best score yet seen
Report all statistically significant local
alignments
E-value (starting with BLAST 2.0) is used to
measure the statistical significance
E-value: the number of alignments with score
equal to or higher than s that one would expect to
find by chance when searching the database
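The first two stages (word hits, then gap-free extension with a drop-off threshold X) can be sketched as follows. This is not the BLAST implementation: the scores (+1 match, -1 mismatch), the toy word size k = 4, and the names word_hits and extend_ungapped are assumptions made to keep the example small.

```python
def word_hits(query: str, subject: str, k: int) -> list[tuple[int, int]]:
    """Return (i, j) pairs where query[i:i+k] equals subject[j:j+k] exactly."""
    index: dict[str, list[int]] = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j) for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_ungapped(query, subject, i, j, k, x_drop=2, match=1, mismatch=-1):
    """Extend an exact word hit along its diagonal; return (score, q_start, q_end)."""
    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # fell more than X below the best score seen so far
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    return k * match + left_score + right_score, i - left_len, i + k + right_len


query, subject = "AACGTACC", "TACGTAGG"
print(word_hits(query, subject, 4))              # [(1, 1), (2, 2)]
print(extend_ungapped(query, subject, 1, 1, 4))  # (5, 1, 6)
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches never lower the reported HSP score.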

52
FASTA

A program for rapid alignment of pairs of protein
and DNA sequences, building a local alignment
from matching sequence patterns, or words
Algorithm for comparing a query to a database of
sequences. For each database sequence:
Identify the 10 diagonal regions having the
largest number of perfect word matches of a given
length
word size k = 1, 2 for protein, and k = 6-10 for
nucleotide searches
Re-score these regions using a given scoring
matrix (e.g., PAM250), and trim them to form
(gap-free) maximal scoring initial regions
Join (non-overlapping) initial regions from
adjacent diagonals to generate longer regions,
allowing for gaps
Re-score these based on the initial regions'
scores, assessing a penalty for each joining
Align the query sequence to each of the sequences
in the search set having the highest overall
scores
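The first FASTA step, finding the diagonals with the most exact word matches, can be sketched as below. A diagonal is identified here by the offset j - i between the subject position j and the query position i; this convention, the toy word size, and the name top_diagonals are assumptions for the example.

```python
from collections import Counter


def top_diagonals(query: str, subject: str, k: int, top: int = 10) -> list[tuple[int, int]]:
    """Return up to `top` pairs (diagonal, number of exact k-word matches)."""
    positions: dict[str, list[int]] = {}
    for i in range(len(query) - k + 1):
        positions.setdefault(query[i:i + k], []).append(i)
    diagonals: Counter = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], []):
            diagonals[j - i] += 1  # matches on one diagonal share the same offset
    return diagonals.most_common(top)


print(top_diagonals("ACGTAC", "TTACGTAC", k=2, top=2))  # [(2, 5), (-2, 2)]
```

The later steps then re-score and join only these few regions, which is what keeps the search fast.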

Pearson and Lipman, Improved tools for
biological sequence comparison, Proc. Natl.
Acad. Sci. USA 85:2444-2448 (1988).
53
Sim4

Aligns an expressed DNA (EST, cDNA, mRNA)
sequence with a genomic sequence for that gene,
allowing for introns and sequencing errors

Florea, Hartzell, Zhang, Rubin, Miller, A
computer program for aligning expressed DNA and
genomic sequences, Genome Res 8(9):967-74 (1998)
54
Stages and algorithmic techniques

Detect basic homology blocks
Determine gap-free matches (HSPs) using a
blast-like homology search
Detect all exact word matches of length k (e.g.,
k = 12)
Extend the word hits in both directions, by
substitutions, to gap-free high-scoring segment
pairs (HSPs)
Retain only HSPs scoring above a threshold
Connect the HSPs to form larger blocks (exon
cores) using sparse dynamic programming
Extend or trim the exon cores to eliminate gaps
or overlaps in the cDNA sequence
Extend the similarity blocks using fast greedy
sequence comparison algorithms
Detect new exon cores with the blast-like
homology search tuned for higher sensitivity
Refine the introns
Predict the locations of splice junctions using a
combined measure of the accuracy of alignment and
the intensity of splice signals at the ends of
each intron
Generate the spliced alignment
Align the sequences within individual exons using
greedy alignment algorithms
Connect the chain of exon alignments by gaps
(introns)
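The step that connects HSPs into exon cores can be illustrated with a simple chaining sketch. It uses a plain quadratic dynamic program rather than the sparse dynamic programming named on the slide, and it is not the Sim4 code. The HSP tuple layout (cDNA start, cDNA end, genomic start, genomic end, score), with exclusive ends, and the name chain_hsps are assumptions.

```python
def chain_hsps(hsps: list[tuple[int, int, int, int, int]]) -> list[tuple[int, int, int, int, int]]:
    """Return the highest-scoring chain of HSPs that are colinear in both sequences."""
    if not hsps:
        return []
    hsps = sorted(hsps)                    # by start position in the cDNA
    best = [h[4] for h in hsps]            # best chain score ending at each HSP
    prev = [None] * len(hsps)
    for b in range(len(hsps)):
        for a in range(b):
            # a may precede b only if it ends before b starts in both sequences.
            if hsps[a][1] <= hsps[b][0] and hsps[a][3] <= hsps[b][2]:
                if best[a] + hsps[b][4] > best[b]:
                    best[b] = best[a] + hsps[b][4]
                    prev[b] = a
    end = max(range(len(hsps)), key=best.__getitem__)
    chain = []
    while end is not None:
        chain.append(hsps[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical HSPs: three colinear exon-like blocks and one spurious hit.
hsps = [(0, 50, 1000, 1050, 50), (50, 120, 3000, 3070, 70),
        (40, 60, 500, 520, 20), (120, 200, 5000, 5080, 80)]
print(chain_hsps(hsps))
```

In this picture, the genomic gaps between consecutive blocks of the chain are the candidate introns.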