Title: Overview
1 Overview (http://www.stats.ox.ac.uk/people/hein/lectures.htm)
Pairwise Alignment Again
Triple, Quadruple, Many
Similarity-Distance Conversion
Local Alignment
Statistical Alignment: Pairwise, Multiple
Conclusion
2 Approaches to Data Analysis
Data: GTCAT, GTTGGT, GTCA, CTCA
Parsimony, similarity, optimisation give an alignment:
GT-CAT
GTTGGT
GT-CA-
CT-CA-
Statistics are then applied.
Ideal practice: 1-phase analysis.
Actual practice: 2-phase analysis.
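3 Parsimony Alignment of two strings.
Sequences: s1 = CTAGG, s2 = TTGT.
Basic operations: transitions 2 (C-T, A-G), transversions 5, indels (g) 10.
An alignment of (CTAG, TTG) ends in one of three ways:
  alignment of (CTA, TT) followed by the column G/G
  alignment of (CTA, TTG) followed by the column G/-
  alignment of (CTAG, TT) followed by the column -/G
Initial condition: D_{0,0} = 0, where D_{i,j} = D(s1[1..i], s2[1..j]).
D_{i,j} = min{ D_{i-1,j-1} + d(s1[i], s2[j]), D_{i,j-1} + g, D_{i-1,j} + g }
Example:
D_{4,3} = D(CTAG, TTG) = min{ D(CTA, TT) + w(G,G) = 12 + 0 = 12,
                              D(CTA, TTG) + w(G,-) = 4 + 10 = 14,
                              D(CTAG, TT) + w(-,G) = 22 + 10 = 32 } = 12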
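4 (matrix of D_{i,j}; rows are s2 = TTGT read from the bottom up, columns are s1 = CTAGG)
T | 40  32  22  14   9  17
G | 30  22  12   4  12  22
T | 20  12   2  12  22  32
T | 10   2  10  20  30  40
  |  0  10  20  30  40  50
         C   T   A   G   G
Alignment:
CTAGG
TT-GT
Cost 17 = 2 (transition C-T) + 10 (indel) + 5 (transversion G-T)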
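The recursion for D_{i,j} can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function names `subst_cost` and `parsimony_align` are made up for this example, and ties in the traceback are broken in favour of the diagonal move.

```python
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def subst_cost(x, y, transition=2, transversion=5):
    """Cost d(x, y) of aligning nucleotide x with nucleotide y."""
    if x == y:
        return 0
    return transition if frozenset((x, y)) in TRANSITIONS else transversion

def parsimony_align(s1, s2, gap=10):
    """Return (cost, aligned s1, aligned s2) for a linear indel cost `gap`."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # s1 prefix against the empty string
        D[i][0] = i * gap
    for j in range(1, m + 1):          # empty string against s2 prefix
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + subst_cost(s1[i - 1], s2[j - 1]),
                          D[i][j - 1] + gap,
                          D[i - 1][j] + gap)
    # Trace one optimal path back from the top-right corner.
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + subst_cost(s1[i - 1], s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a1)), "".join(reversed(a2))

cost, top, bottom = parsimony_align("CTAGG", "TTGT")
print(cost)     # 17
print(top)      # CTAGG
print(bottom)   # TT-GT
```

Running time and memory are both proportional to len(s1) * len(s2).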
5 Alignment of three sequences.
s1 = ATCG, s2 = ATGCC, s3 = CTCC
Alignment:
AT-CG
ATGCC
CT-CC
Consensus sequence: ATCC
Configurations in an alignment column (n = nucleotide, - = gap; the all-gap column is excluded, leaving 2^3 - 1 = 7):
- - n n n - n -
- n - n - n n -
n - - - n n n -
Recursion: D_{i,j,k} = min{ D_{i-i',j-j',k-k'} + d(i,i',j,j',k,k') }, with i', j', k' each 0 or 1 and not all 0.
Initial condition: D_{0,0,0} = 0.
Running time: l1 l2 l3 (2^3 - 1). Memory requirement: l1 l2 l3.
New phenomenon: ancestral sequence.
6 Parsimony Alignment of four sequences
s1 = ATCG, s2 = ATGCC, s3 = CTCC, s4 = ACGCG
Alignment:
AT-CG
ATGCC
CT-CC
ACGCG
Configurations in alignment columns (n = nucleotide, - = gap):
- - - n - - - n n n - n n n n -
- - n - n n - n - - n - n n n -
- n - - n - n - n - n n - n n -
n - - - - n n - - n n n n - n -
Recursion: D_i = min{ D_{i-Δ} + d(i, Δ) }, Δ in {0,1}^4 \ {0}^4
Initial condition: D_0 = 0.
Computation time: l1 l2 l3 l4 2^4. Memory: l1 l2 l3 l4.
7 Alignment of many sequences.
s1 = ATCG, s2 = ATGCC, ......., sn = ACGCG
Alignment:
AT-CG
ATGCC
.....
.....
ACGCG
[Figure: tree relating s1, s2, s3, s4, s5]
Configurations in an alignment column: 2^n - 1
Recursion: D_i = min{ D_{i-Δ} + d(i, Δ) }, Δ in {0,1}^n \ {0}^n
Initial condition: D_{0,0,..,0} = 0.
Computation time: l^n (2^n - 1) n. Memory requirement: l^n
(l = sequence length, n = number of sequences)
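The recursion over column configurations can be sketched for a handful of short sequences. The slides do not spell out the column cost d(i, Δ), so the sketch below assumes a star tree: each column is charged the cheapest total cost to a single ancestral symbol (A, C, G, T or a gap). That choice, and the names `configurations`, `column_cost` and `multi_align_cost`, are assumptions made for illustration only.

```python
from functools import lru_cache
from itertools import product

ALPHABET = "ACGT-"
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def d(x, y):
    """Pair cost: 0 identical, 10 indel, 2 transition, 5 transversion."""
    if x == y:
        return 0
    if "-" in (x, y):
        return 10
    return 2 if frozenset((x, y)) in TRANSITIONS else 5

def configurations(n):
    """All 2^n - 1 column configurations: 1 = nucleotide, 0 = gap, not all 0."""
    return [c for c in product((0, 1), repeat=n) if any(c)]

def column_cost(column):
    """Assumed star-tree cost: best single ancestral symbol for the column."""
    return min(sum(d(anc, x) for x in column) for anc in ALPHABET)

def multi_align_cost(seqs):
    """Minimum alignment cost of the sequences under the assumed column cost."""
    moves = configurations(len(seqs))

    @lru_cache(maxsize=None)
    def D(pos):
        if not any(pos):                       # initial condition D_{0,...,0} = 0
            return 0
        best = float("inf")
        for delta in moves:
            if any(dl > p for dl, p in zip(delta, pos)):
                continue                       # would step before the start of a sequence
            column = tuple(s[p - 1] if dl else "-" for s, p, dl in zip(seqs, pos, delta))
            prev = tuple(p - dl for p, dl in zip(pos, delta))
            best = min(best, D(prev) + column_cost(column))
        return best

    return D(tuple(len(s) for s in seqs))

print(len(configurations(3)), len(configurations(4)))   # 7 15
print(multi_align_cost(("CTAGG", "TTGT")))              # 17
```

The table has one cell per combination of prefix lengths and each cell tries every configuration, which is where the l^n (2^n - 1) growth comes from; the sketch is only practical for very small n.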
8 Fitch-Hartigan-Sankoff Algorithm
Costs: transition 2, transversion 5, indel 10.
[Figure: tree with a cost vector over (A,C,G,T) at every node. Root: (9,7,7,7). Internal node: (10,2,10,2). Each of the three leaves has cost 0 at its observed nucleotide.]
Indel constraint: the nucleotides form a connected set.
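The cost vectors on this slide can be reproduced with a short recursive sketch. The leaf nucleotides are not legible in the transcript; the leaves C, T and G used below are an assumption, chosen because they give exactly the vectors (10,2,10,2) and (9,7,7,7). The function name `sankoff` and the nested-tuple tree encoding are also illustrative, and indels are left out.

```python
NUCS = "ACGT"
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def d(x, y):
    """Substitution cost: 0 identical, 2 transition, 5 transversion."""
    if x == y:
        return 0
    return 2 if frozenset((x, y)) in TRANSITIONS else 5

def sankoff(tree):
    """Return {nucleotide: minimal cost of the subtree given that state at its root}.

    A tree is either a one-letter string (a leaf) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {x: (0 if x == tree else float("inf")) for x in NUCS}
    left, right = (sankoff(sub) for sub in tree)
    return {x: min(left[y] + d(x, y) for y in NUCS)
               + min(right[y] + d(x, y) for y in NUCS)
            for x in NUCS}

inner = sankoff(("C", "T"))
root = sankoff((("C", "T"), "G"))
print([inner[x] for x in NUCS])   # [10, 2, 10, 2]
print([root[x] for x in NUCS])    # [9, 7, 7, 7]
```

The parsimony cost of the tree is the smallest entry of the root vector, here 7.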
9 Longer Indels
TCATGGTACCGTTAGCGT
GCA-----------GCAT
g_k: cost of an indel of length k.
Initial condition: D_{0,0} = 0
D_{i,j} = min{ D_{i-1,j-1} + d(s1[i], s2[j]),
               D_{i,j-1} + g_1, D_{i,j-2} + g_2, D_{i,j-3} + g_3, ...,
               D_{i-1,j} + g_1, D_{i-2,j} + g_2, D_{i-3,j} + g_3, ... }
Cubic running time. Quadratic memory.
10 If g_k = a + bk, then quadratic running time. Gotoh (1982)
D_{i,j} is split into 3 types:
1. D0_{i,j}: as D_{i,j}, except s1[i] must match s2[j].
2. D1_{i,j}: as D_{i,j}, except s2[j] is matched with "-".
3. D2_{i,j}: as D_{i,j}, except s1[i] is matched with "-".
Then:
D0_{i,j} = min(D0_{i-1,j-1}, D1_{i-1,j-1}, D2_{i-1,j-1}) + d(s1[i], s2[j])
D1_{i,j} = min(D1_{i,j-1} + b, D0_{i,j-1} + a + b)
D2_{i,j} = min(D2_{i-1,j} + b, D0_{i-1,j} + a + b)
Comment 1. Evolutionary Consistency Condition: g_i + g_j ≥ g_{i+j}
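The three-table recursion can be sketched as follows. The sketch follows the recursions as written on the slide, so D1 consumes a letter of s2 and D2 a letter of s1, and an indel cannot directly follow an indel in the other sequence. The function name `gotoh` and the boundary values are assumptions for illustration.

```python
INF = float("inf")
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def d(x, y):
    """Substitution cost: 0 identical, 2 transition, 5 transversion."""
    if x == y:
        return 0
    return 2 if frozenset((x, y)) in TRANSITIONS else 5

def gotoh(s1, s2, a, b):
    """Minimum alignment cost when an indel of length k costs a + b*k."""
    n, m = len(s1), len(s2)
    D0 = [[INF] * (m + 1) for _ in range(n + 1)]   # last column pairs s1[i] with s2[j]
    D1 = [[INF] * (m + 1) for _ in range(n + 1)]   # last column pairs "-" with s2[j]
    D2 = [[INF] * (m + 1) for _ in range(n + 1)]   # last column pairs s1[i] with "-"
    D0[0][0] = 0
    for j in range(1, m + 1):
        D1[0][j] = a + b * j                       # one leading indel of length j
    for i in range(1, n + 1):
        D2[i][0] = a + b * i                       # one leading indel of length i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D0[i][j] = min(D0[i - 1][j - 1], D1[i - 1][j - 1], D2[i - 1][j - 1]) \
                       + d(s1[i - 1], s2[j - 1])
            D1[i][j] = min(D1[i][j - 1] + b, D0[i][j - 1] + a + b)
            D2[i][j] = min(D2[i - 1][j] + b, D0[i - 1][j] + a + b)
    return min(D0[n][m], D1[n][m], D2[n][m])

print(gotoh("CTAGG", "TTGT", a=0, b=10))   # 17
```

With a = 0 the cost g_k = bk is linear and the result agrees with the simple recursion (17 for CTAGG against TTGT). Every cell is filled in constant time, which gives the quadratic running time.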
11 Distance-Similarity. (Smith-Waterman-Fitch, 1982)
D_{i,j} = min{ D_{i-1,j-1} + d(s1[i], s2[j]), D_{i,j-1} + g, D_{i-1,j} + g }
S_{i,j} = max{ S_{i-1,j-1} + s(s1[i], s2[j]), S_{i,j-1} - w, S_{i-1,j} - w }
Distance: transitions 2, transversions 5, indels 10.
M = largest distance between two nucleotides (5).
Similarity: s(n1, n2) = M - d(n1, n2); w_k = k/(2M) + g_k; w = 1/(2M) + g.
Similarity parameters: transversions 0, transitions 3, identity 5, indels 10 + 1/10.
12 (each cell shows distance / similarity; rows are s2 = TTGT read from the bottom up, columns are s1 = CTAGG)
T | 40/-40.4   32/-27.3   22/-12.2   14/0.9     9/11.0    17/2.9
G | 30/-30.3   22/-17.2   12/-2.1     4/11.0   12/2.9     22/-7.2
T | 20/-20.2   12/-7.1     2/8.0     12/-2.1   22/-12.2   32/-22.3
T | 10/-10.1    2/3.0     10/-7.1    20/-17.2  30/-27.3   40/-37.4
  |  0/0       10/-10.1   20/-20.2   30/-30.3  40/-40.4   50/-50.5
                C          T          A         G          G
Comments:
1. The switch from Dist to Sim is highly analogous to maximizing -f(x) instead of minimizing f(x).
2. Dist will be based on a metric: i. d(x,x) = 0, ii. d(x,y) ≥ 0, iii. d(x,y) = d(y,x), iv. d(x,z) + d(z,y) ≥ d(x,y). There are no analogous restrictions on Sim, giving it a larger parameter space.
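Because the distance and similarity recursions differ only in min versus max and in the sign of the indel term, one routine can serve both. The sketch below is illustrative (the name `global_dp` and its argument layout are assumptions); it uses the parameters stated in the slides, d with M = 5 and w = 10 + 1/10, and checks the final cell for CTAGG against TTGT.

```python
TRANSITIONS = {frozenset("CT"), frozenset("AG")}

def d(x, y):
    """Distance: 0 identical, 2 transition, 5 transversion."""
    if x == y:
        return 0
    return 2 if frozenset((x, y)) in TRANSITIONS else 5

def global_dp(s1, s2, pair, indel, best):
    """Global alignment value; `best` is min (distance) or max (similarity).

    `indel` is added for every gap column, so pass +g for distance and -w for similarity.
    """
    n, m = len(s1), len(s2)
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = T[i - 1][0] + indel
    for j in range(1, m + 1):
        T[0][j] = T[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = best(T[i - 1][j - 1] + pair(s1[i - 1], s2[j - 1]),
                           T[i][j - 1] + indel,
                           T[i - 1][j] + indel)
    return T[n][m]

M, g = 5, 10
w = g + 1 / (2 * M)                                  # 10.1

def s(x, y):
    return M - d(x, y)                               # identity 5, transition 3, transversion 0

distance = global_dp("CTAGG", "TTGT", d, g, min)
similarity = global_dp("CTAGG", "TTGT", s, -w, max)
print(distance, round(similarity, 1))                # 17.0 2.9
```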
13 Local alignment, Smith, Waterman (1981)
Global alignment: S_{i,j} = max{ S_{i-1,j-1} + s(s1[i], s2[j]), S_{i,j-1} - w, S_{i-1,j} - w }
Local alignment: S_{i,j} = max{ S_{i-1,j-1} + s(s1[i], s2[j]), S_{i,j-1} - w, S_{i-1,j} - w, 0 }
Score parameters: match 1, mismatch -1/3, gap of length k: 1 + k/3.
[Figure: local-alignment score matrix with traceback arrows; columns labelled C A G C C U C G C U U; largest entry 3.3.]
Best local alignment:
GCC-UCG
GCCAUUG
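The local recursion with the slide's score parameters can be sketched as below, using exact fractions so that 1/3 causes no rounding. Only the column sequence CAGCCUCGCUU is legible in the transcript; the second sequence used here, AAUGCCAUUGAC, is an assumption chosen to contain the segment GCCAUUG from the alignment shown. The function name and the three-table layout for the gap cost 1 + k/3 are also illustrative.

```python
from fractions import Fraction

def smith_waterman(s1, s2, match=Fraction(1), mismatch=Fraction(-1, 3),
                   gap_open=Fraction(1), gap_extend=Fraction(1, 3)):
    """Best local alignment score when a gap of length k costs gap_open + k * gap_extend."""
    n, m = len(s1), len(s2)
    low = Fraction(-10**9)                              # stands in for minus infinity
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[low] * (m + 1) for _ in range(n + 1)]          # ... ending with a gap in s1
    F = [[low] * (m + 1) for _ in range(n + 1)]          # ... ending with a gap in s2
    best = Fraction(0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_extend,
                          H[i][j - 1] - gap_open - gap_extend)
            F[i][j] = max(F[i - 1][j] - gap_extend,
                          H[i - 1][j] - gap_open - gap_extend)
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            # The 0 lets a local alignment start afresh at any cell.
            H[i][j] = max(Fraction(0), H[i - 1][j - 1] + pair, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

score = smith_waterman("CAGCCUCGCUU", "AAUGCCAUUGAC")
print(score, float(score))
```

The alignment GCC-UCG / GCCAUUG scores 3 matches, one gap of length 1 (1 + 1/3), one more match, one mismatch (1/3) and a final match, that is 10/3, which the matrix on the slide shows truncated as 3.3.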
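14 Progressive Alignment (Feng-Doolittle 1987, J. Mol. Evol.)
Can align alignments and, given a tree, make a multiple alignment.
Alignment 1:       Alignment 2:
alkmny-trwq        acdeqrt
akkmdyftrwq        acdehrt
kkkmemftrwq
Score for aligning the column (n, d, e) with the column (q, h):
[P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)] / 6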
Sodh  atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodb  atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodl  atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sddm  atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sdmz  atkavcvlkgdgpqvq infeqkesdgpvkvwgsikglteglhgfhvhqfg----ndtagct sagphfnp Lsrk
Sods  vatkavcvlkgdgpqvq infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk
Sdpb  datkavcvlkgdgpqvq-infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk
[Figure: tree with leaves sddm, Sodb, Sodl, Sodh, Sdmz, sods, Sdpb]
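The averaged column score used when aligning two alignments, [P(n,q) + ... + P(e,h)] / 6, is the mean of the pair scores over all cross pairs of the two columns. A minimal sketch follows; `column_score` is an illustrative name, and the `identity` scoring function is a toy stand-in for a real amino-acid score table P, which the slides do not give.

```python
from itertools import product

def column_score(col1, col2, pair_score):
    """Average of pair_score(x, y) over every x in col1 and y in col2."""
    pairs = list(product(col1, col2))
    return sum(pair_score(x, y) for x, y in pairs) / len(pairs)

def identity(x, y):
    """Toy score (assumption): 1 for identical residues, 0 otherwise."""
    return 1.0 if x == y else 0.0

aln1 = ["alkmny-trwq", "akkmdyftrwq", "kkkmemftrwq"]
aln2 = ["acdeqrt", "acdehrt"]

col1 = [row[4] for row in aln1]        # ['n', 'd', 'e']
col2 = [row[4] for row in aln2]        # ['q', 'h']
print(col1, col2, column_score(col1, col2, identity))   # 3 * 2 = 6 pairs, none identical: 0.0
```

With this score in place of d(s1[i], s2[j]), the pairwise recursion can be run on columns instead of single letters.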
15 Thorne-Kishino-Felsenstein Process
λ < μ
P(s) = (1 - λ/μ) (λ/μ)^l π_A^{#A} ... π_T^{#T}, where l is the length of s and #X is the number of X in s.
Time reversible
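The equilibrium probability P(s) is a geometric length distribution times independent letter frequencies, which is easy to check numerically. The sketch below is illustrative: the function name and the example values of λ, μ and π are assumptions, not values from the lecture.

```python
import math
from collections import Counter
from itertools import product

def tkf_equilibrium_logprob(seq, lam, mu, pi):
    """log P(s) = log(1 - λ/μ) + l log(λ/μ) + Σ_x #x log π_x; requires λ < μ."""
    if not 0 < lam < mu:
        raise ValueError("the process needs 0 < lam < mu")
    r = lam / mu
    counts = Counter(seq)
    return (math.log(1 - r) + len(seq) * math.log(r)
            + sum(c * math.log(pi[x]) for x, c in counts.items()))

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}   # hypothetical frequencies
lam, mu = 0.03, 0.04                             # hypothetical rates
print(math.exp(tkf_equilibrium_logprob("", lam, mu, pi)))   # probability of the empty sequence, 1 - λ/μ

# The probabilities of all sequences of length 2 add up to (1 - λ/μ)(λ/μ)^2.
total = sum(math.exp(tkf_equilibrium_logprob("".join(p), lam, mu, pi))
            for p in product("ACGT", repeat=2))
print(total)
```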
16 Time reversibility
P_{i,j}(t) = probability that i has evolved into j after time t.
π(i) = probability of i after infinitely long time, i.e. the equilibrium distribution.
π(i) P_{i,j}(t) = π(j) P_{j,i}(t)
t1-----------t2---------t3
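17 Diff. Equations for p-functions
[Figure: link diagrams for the three cases below]
p_k: the link survives and has k descendants, itself included.
Δp_k = Δt [ λ(k-1) p_{k-1} + μ k p_{k+1} - (λ+μ) k p_k ]
p'_k: the link dies and leaves k descendants.
Δp'_k = Δt [ λ(k-1) p'_{k-1} + μ(k+1) p'_{k+1} - (λ+μ) k p'_k + μ p_{k+1} ]
p''_k: the immortal link has k descendants.
Δp''_k = Δt [ λ k p''_{k-1} + μ(k+1) p''_{k+1} - ((k+1)λ + μk) p''_k ]
Initial conditions: p_k(0) = p'_k(0) = p''_k(0) = 0 for k > 1; p_1(0) = p''_0(0) = 1; p'_0(0) = 0.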
18 λ and μ into Alignment Blocks
A. Amino acids ignored
[Figure: the three block types, each with k descendants]
p_k(t) = e^{-μt} [1 - λβ(t)] (λβ(t))^{k-1}
p'_k(t) = [1 - e^{-μt} - μβ(t)] [1 - λβ(t)] (λβ(t))^{k-1}
p''_k(t) = [1 - λβ(t)] (λβ(t))^k
p'_0(t) = μβ(t)
where β(t) = [1 - e^{(λ-μ)t}] / [μ - λ e^{(λ-μ)t}]
B. Amino acids considered
T - - -
R Q S W        P_t(T→R) π_Q .. π_W p_4(t)       (k = 4)

T - - - -
- R Q S W      π_R π_Q .. π_W p'_4(t)           (k = 4)
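19 Basic Pairwise Recursion (O(length^3))
[Figure: the cases of the recursion for the last link of s1, labelled "Survives" and "Dies", drawn over the s1 prefixes i-1, i and the s2 prefixes j-2, j-1, j.]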
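20 Fundamental Pairwise Recursion.
P(s1_i → s2_j) = p'_0 P(s1_{i-1} → s2_j) + Σ_{k=1..j} [p_k f(s1[i], s2[j-k+1]) + p'_k π_{s2[j-k+1]}] π_{s2[j-k+2..j]} P(s1_{i-1} → s2_{j-k})
Initial condition: P(s1_0 → s2_j) = p''_j π_{s2[1..j]}
Simplification:
R_{i,j} = (p_1 f(s1[i], s2[j]) + p'_1 π_{s2[j]}) P(s1_{i-1} → s2_{j-1})
P(s1_i → s2_j) = R_{i,j} + p'_0 P(s1_{i-1} → s2_j) + λβ π_{s2[j]} [P(s1_i → s2_{j-1}) - p'_0 P(s1_{i-1} → s2_{j-1})]
              = p'_0 P(s1_{i-1} → s2_j) + λβ π_{s2[j]} P(s1_i → s2_{j-1}) + (p_1 f(s1[i], s2[j]) + p'_1 π_{s2[j]} - λβ π_{s2[j]} p'_0) P(s1_{i-1} → s2_{j-1})
Here s1_i and s2_j are the prefixes of length i and j, f is the substitution probability, and π_x of a string is the product of its letter frequencies.
Probability of observation: P(s1, s2) = P(s1) P(s1 → s2)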
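The simplified recursion fills one cell in constant time, so the whole table costs time proportional to len(s1) * len(s2). The sketch below is a minimal illustration under stated assumptions: the substitution probability f is not specified on the slide, so a simple reversible model f(a, b) = e^{-st} [a = b] + (1 - e^{-st}) π_b is assumed, and the function name, argument names and example rates are hypothetical.

```python
import math

def tkf_joint_probability(s1, s2, lam, mu, t, pi, s):
    """P(s1, s2) = P(s1) * P(s1 -> s2) under the TKF process (requires lam < mu)."""
    e_mu = math.exp(-mu * t)
    e_diff = math.exp((lam - mu) * t)
    beta = (1 - e_diff) / (mu - lam * e_diff)
    lb = lam * beta                              # λβ(t)
    p1 = e_mu * (1 - lb)                         # p_1(t)
    pd0 = mu * beta                              # p'_0(t)
    pd1 = (1 - e_mu - mu * beta) * (1 - lb)      # p'_1(t)
    keep = math.exp(-s * t)

    def f(a, b):
        # Assumed substitution model (not from the slides): keep the letter with
        # probability e^{-st}, otherwise redraw it from the frequencies pi.
        return keep * (a == b) + (1 - keep) * pi[b]

    n, m = len(s1), len(s2)
    P = [[0.0] * (m + 1) for _ in range(n + 1)]
    P[0][0] = 1 - lb                             # p''_0(t)
    for j in range(1, m + 1):                    # initial condition p''_j π(s2[1..j])
        P[0][j] = P[0][j - 1] * lb * pi[s2[j - 1]]
    for i in range(1, n + 1):
        P[i][0] = pd0 * P[i - 1][0]              # every link of s1 died childless
        for j in range(1, m + 1):
            a, b = s1[i - 1], s2[j - 1]
            P[i][j] = (pd0 * P[i - 1][j]
                       + lb * pi[b] * P[i][j - 1]
                       + (p1 * f(a, b) + pd1 * pi[b] - lb * pi[b] * pd0) * P[i - 1][j - 1])
    r = lam / mu
    p_s1 = (1 - r) * r ** n * math.prod(pi[x] for x in s1)   # equilibrium P(s1)
    return p_s1 * P[n][m]

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}    # hypothetical frequencies
args = dict(lam=0.03, mu=0.04, t=1.5, pi=pi, s=0.7)
print(tkf_joint_probability("ACGT", "AGGTTC", **args))
print(tkf_joint_probability("AGGTTC", "ACGT", **args))
```

Because the process is time reversible, swapping the two sequences should leave the joint probability unchanged up to rounding, which gives a convenient sanity check on an implementation.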
21 α-globin (141) and β-globin (146)
430.108 : -log(α-globin)
327.320 : -log(α-globin → β-globin)
730.428 : -log(l(sumalign))
λt = 0.0371805 ± 0.0135899
μt = 0.0374396 ± 0.0136846
st = 0.91701 ± 0.119556
E(Length) = 143.499, E(Insertions, Deletions) = 5.37255, E(Substitutions) = 131.59
Maximum contributing alignment:
α  V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
β  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

α  TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
β  SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
22 Probability for substitution: 0.46
children   p              p'             p''
0          0.0            3.60 × 10^-2   9.64 × 10^-1
1          9.28 × 10^-1   6.30 × 10^-4   3.45 × 10^-2
2          3.32 × 10^-2   2.26 × 10^-5   1.23 × 10^-3
3          1.19 × 10^-3   8.10 × 10^-7   4.43 × 10^-5
4          4.27 × 10^-5   2.90 × 10^-8   1.59 × 10^-6
5          1.53 × 10^-6   1.04 × 10^-9   5.70 × 10^-8
β(t) = 9.64 × 10^-1
λβ(t) = 3.46 × 10^-2
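The p, p' and p'' columns can be recomputed from the closed forms for the p-functions and the globin estimates λt = 0.0371805 and μt = 0.0374396 quoted earlier. The sketch below is illustrative: the function name is made up, t is set to 1 so that the quoted products are used directly, and p'' is indexed so that k = 0 means no extra links.

```python
import math

def p_functions(lam, mu, t, kmax):
    """Return (beta, p, p_dead, p_immortal), each list indexed by k = 0..kmax."""
    e_mu = math.exp(-mu * t)
    e_diff = math.exp((lam - mu) * t)
    beta = (1 - e_diff) / (mu - lam * e_diff)
    lb = lam * beta
    # p_k: the link survives with k descendants (k >= 1, itself included).
    p = [0.0] + [e_mu * (1 - lb) * lb ** (k - 1) for k in range(1, kmax + 1)]
    # p'_k: the link dies and leaves k descendants.
    p_dead = [mu * beta] + [(1 - e_mu - mu * beta) * (1 - lb) * lb ** (k - 1)
                            for k in range(1, kmax + 1)]
    # p''_k: the immortal link with k extra links.
    p_immortal = [(1 - lb) * lb ** k for k in range(kmax + 1)]
    return beta, p, p_dead, p_immortal

beta, p, p_dead, p_immortal = p_functions(0.0371805, 0.0374396, 1.0, 5)
print(f"beta = {beta:.3g}")
for k in range(6):
    print(k, f"{p[k]:.3g}", f"{p_dead[k]:.3g}", f"{p_immortal[k]:.3g}")
```

The printed values should agree with the table to about two significant digits; small differences in the last digit are expected because the slide values are rounded.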
23 Length Evolution: Immortal Link
24 Accelerations of pairwise algorithm
1
2 - Better numerical search
3 - Simpler recursion
4 - Better computers
1991 → 2000: a 10^6 acceleration for 2 proteins 1500 long.
25 Likelihood Surface
26 Homology Test
W_{i,j} = -ln(π_i P^{2.5}_{i,j} / (π_i π_j))
D(s1, s2) for the real sequences is evaluated against D(s1, s2) for randomised sequences.
Real:    s1 = ATWYFCAK-AC      Random:  s1 = ATWYFC-AKAC
         s2 = ETWYKCALLAD               s2 = LTAYKADCWLE
This test:
1. Tests the competing hypotheses that 2 sequences are 2.5 events apart versus infinitely far apart.
2. Only handles substitutions correctly. The rationale for indel costs is more arbitrary.
3. Samples in (π_i π_j) by permuting the order of amino acids in the second sequence, i.e. it draws without replacement (a hypergeometric distribution).
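The permutation step of the test can be sketched as follows. The cost table W is not given in the slides, so a plain unit-cost edit distance stands in for D; that substitution, the function names, the number of permutations and the seed are all assumptions for illustration.

```python
import random

def edit_distance(a, b):
    """Unit-cost edit distance, a stand-in for the distance D built from W."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def permutation_test(s1, s2, distance, n_perm=199, seed=0):
    """Return (observed distance, p-value) from shuffling the letters of s2."""
    rng = random.Random(seed)
    observed = distance(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)               # sampling without replacement
        if distance(s1, "".join(letters)) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# The two "real" sequences from the slide, with the gap removed from s1.
print(permutation_test("ATWYFCAKAC", "ETWYKCALLAD", edit_distance))
```

A small p-value says that the observed distance is unusually low compared with shuffled versions of the second sequence, which keeps its composition fixed while destroying its order.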
28 Steel-Hein Algorithm
[Figure: tree with ancestor a and leaves s1, s2, s3; the leaf sequences shown are TTGT, ACGC and ACGGT.]
29 Binary Tree Problem
[Figure: binary tree with internal nodes a1, a2 and leaves s1, s2, s3, s4; the leaf sequences shown are TGA, ACCT, GTT and ACG.]
30 Markov Chains Generating the p-functions.
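31 Generating Ancestral Sequences
1 sequence (E = end):
  continue with probability λ/μ, go to E with probability 1 - λ/μ
2 sequences (a1, a2; "-" = gap, E = end). Transition probabilities as printed, one row per current state:
  λβ    (λ/μ)(1 - λβ) e^{-μ}    (λ/μ)(1 - λβ)(1 - e^{-μ})    (1 - λ/μ)(1 - λβ)
  λβ    (λ/μ)(1 - λβ) e^{-μ}    (λ/μ)(1 - λβ)(1 - e^{-μ})    (1 - λ/μ)(1 - λβ)
  λβ    (λ/μ)(1 - λβ) e^{-μ}    (λ/μ)(1 - λβ)(1 - e^{-μ})    (1 - λ/μ)(1 - λβ)
  λβ    (λ/μ)(1 - λβ) e^{-μ}    (1 - λ/μ)(1 - λβ)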
32 Fundamental Multiple Recursion I
s1:  - C G C T A
s2:  A G A A T T
a1:  - a ---> ? e . . . . . . .
a2:
s3:  A G C G G
s4:  G - C C T G
Sum over all: string partitions; anc. state survivals; anc. state MC jumps.
33 Fundamental Multiple Recursion II
P_a(S_k): probability of the epifixes (S_1[k_1+1..l_1], ...), given that the MC starts in state a.
[Equation: recursion for P_a(S_k), with terms P(kS_i, H | ?) and F(kS_i, H)]
34 Fundamental 4 sequence Recursion
Not a proper recursion!
Initialisation: P_EE(Ø) = 1 and P_a(Ø) are directly calculable.
O(l^{2k}) algorithm shown; O(l^k) algorithm possible.
Toy 4-sequence program almost ready. This approach could analyse up to 6-7 sequences.
Jens Ledet and others are working on a Gibbs sampler approach.
35 Statistical Alignment Summary
Motivation for statistical alignment: data is sequences, not alignment!
Problems ahead:
Longer insertion-deletion process
Position-heterogeneous process
Simultaneous comparative gene finding and alignment
Making a TKF process with a given HMM as stationary distribution
Explore non-TKF processes