Title: Multiple Sequence Alignment Based on Compact Set
1Multiple Sequence Alignment Based on Compact Set
- Department of Computer Science
- National Tsing Hua University
- Chuan Yi Tang
2Multiple Sequence Alignment
- Given a set of sequences, the MSA problem is to
find an alignment of the sequences such that some
objective function is minimized, e.g., the Sum-of-Pairs (SP) score
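As a rough illustration of the SP objective, the sketch below sums a pairwise column cost over every pair of rows in an alignment. The unit costs (0 for identical symbols, 1 for a mismatch or a residue against a gap) and the names `pair_cost` and `sp_score` are assumptions made for this example only; the slides do not fix a cost scheme at this point.

```python
from itertools import combinations


def pair_cost(x: str, y: str) -> int:
    """Hypothetical unit cost: 0 if the two symbols agree, 1 otherwise."""
    return 0 if x == y else 1


def sp_score(alignment: list[str]) -> int:
    """Sum-of-pairs cost: add the column costs over every pair of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        pair_cost(x, y)
        for row_a, row_b in combinations(alignment, 2)
        for x, y in zip(row_a, row_b)
    )


example = ["AC-", "A-G", "ACG"]
print(sp_score(example))  # prints 4 under the unit costs assumed above
```

With a similarity-style scheme (positive for matches, negative for gaps) the same sum would be maximized instead of minimized.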
3MSA with SP-Score: Exact Algorithm and Heuristics
- k: number of sequences; n: length of the sequences
- Exactly (using Dynamic Programming)
- O((2n)^k): D. Sankoff, Simultaneous solution of RNA folding, alignment and protosequence problems, SIAM J. Appl. Math., (1985)
- Heuristics
- D.F. Feng, R.F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25, 351-360, (1987)
- S.F. Altschul, D.J. Lipman, Trees, stars, and multiple biological sequence alignment, SIAM J. Appl. Math., (1989)
- D.J. Lipman, S.F. Altschul, A tool for multiple sequence alignment, Proc. Nat. Acad. Sci. U.S.A., (1989)
- S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of multiple sequence comparison methods, Bull. Math. Biol., (1992)
4MSA with SP-Score: Complexity
- J Comput Biol 1994 Winter; 1(4): 337-48
- On the complexity of multiple sequence alignment.
- Wang L., Jiang T.
- McMaster University, Hamilton, Ontario, Canada.
- We study the computational complexity of two popular problems in multiple sequence alignment
- 1. multiple alignment with SP-score => NP-complete (non-metric)
- 2. multiple tree alignment => MAX SNP-hard
- Theoretical Computer Science 259 (2001) 63-79
- The complexity of multiple sequence alignment with SP-score that is a metric
- Paola Bonizzoni, Gianluca Della Vedova
- 1. multiple alignment with SP-score => NP-complete (metric)
5MSA with SP-Score: Approximation
- Approximation Algorithm
- Performance ratio of 2 - 2/k: D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds, Bull. Math. Biol., (1993)
- Performance ratio of 2 - 3/k: P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM J. Appl. Math., (1992)
- Performance ratio of 2 - l/k (assembling l-way alignments, l < k): V. Bafna, E.L. Lawler and P. Pevzner, Approximation algorithms for multiple sequence alignment, Theor. Comput. Sci., (1997)
- Polynomial Time Approximation Scheme (PTAS)
- MSA within a constant band, allowing only a constant number of insertion and deletion gaps of arbitrary length per sequence on average: M. Li, B. Ma and L. Wang, Near optimal alignment within a band in polynomial time, STOC 2000.
6Compact Set Definition
- Let S be the set of n objects S1, S2, S3, ..., Sn and D(Si, Sj) denote the distance between Si and Sj in the distance matrix D.
- Consider any C which is a subset of S: if every distance between an element in C and an element not in C is larger than the longest distance within C, then C is called a compact set.
- Property
- The entire set S is a compact set.
- Each set consisting of a single object is also a compact set.
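The definition can be checked directly on a distance matrix. The following is a minimal sketch, assuming a symmetric matrix with positive off-diagonal distances stored as a list of lists; the name `is_compact` and the sample matrix are illustrative, not taken from the slides.

```python
def is_compact(dist: list[list[float]], members: set[int]) -> bool:
    """Return True if `members` is a compact set under the matrix `dist`.

    The longest distance inside the set must be strictly smaller than the
    shortest distance from the set to any object outside it.
    """
    outside = [j for j in range(len(dist)) if j not in members]
    # A single object has no inside edge, so its longest inside distance is 0.
    max_inside = max(
        (dist[i][j] for i in members for j in members if i < j), default=0
    )
    # The entire set has no border edge, so it is compact by convention.
    min_border = min(
        (dist[i][j] for i in members for j in outside), default=float("inf")
    )
    return max_inside < min_border


# Hypothetical 4-object distance matrix.
dist = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
print(is_compact(dist, {0, 1}))  # True: inside 1 < border 5
print(is_compact(dist, {0, 2}))  # False: inside 5 > border 1
```

The two `default` values encode the two properties listed above: singletons and the entire set S always pass the test.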
7Compact Set Example
[Figure: distance matrix and graph over S1-S6 showing Compact Set 1, Compact Set 2 and Compact Set 3. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
8Compact Set Example(cont)
- Compact sets are hierarchical: any two compact sets are either disjoint or one contains the other
9MSA Compact Set
- Consider an example with 12 protein sequences
- S1 MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDANKYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGVI
- S2 SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDVTVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
- S3 SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYDVTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
- S4 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
- S5 MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIKKAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI
- S6 MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQLFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
- S7 MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVFIVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALDVANKIGII
- S8 MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIKQAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI
- S9 MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNNTLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEVDALDIANKIGFI
- S10 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTLVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDALDVANKIGII
- S11 APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIKKAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI
- S12 MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAMKKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAYIRLSASYDALDTANKMGLV
Original sequences
10MSA Compact Set(cont)
Original distance matrix
Original Compact Set Tree
A good MSA should preserve the compact sets as well
11MSA Compact Set(cont)
- S1 -----------------MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA
- S2 ---------------------------------------------------------------------------------SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDV
- S3 --------------------------------------------------------------------------------SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEV
- S4 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES
- S5 ----------------------MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAMKK
- S6 ------------------------------------------------------------------------------MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEA
- S7 ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE
- S8 ----------------------MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAMKK
- S9 ------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN
- S10 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE
- S11 -----------------------APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSETAMKK
- S12 MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP
MSA by MSA1 (rows truncated as on the slide)
12MSA Compact Set(cont)
- S1 ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQHPL
- S2 ---------------------------------------------------------------------------------------S--SIIDYPLVTEKAMDEMDFQNKLQFIVDID-AAK
- S3 ---------------------------------------------------------------------------------------SW-DVIKHPHVTEKAMNDMDFQNKLQFAVD-DRA
- S4 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKFP
- S5 -----------------MAPST-KATAAKKAVVKGT-NG--K--KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIEQPITSET
- S6 -------------------------------------------------------------------------------------MDAF-DVIKTPIVSEKTMKLIEEENRLVFYVER-KATK
- S7 MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILKYP
- S8 -----------------MAPST-KAASAKKAVVKGS-NG--S--KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIVAPIASET
- S9 MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR
- S10 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIKF
- S11 ------------------APSA-KATAAKKAVVKGT-NG--K--KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VIEQPITSET
- S12 ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR
MSA by MSA2 (rows truncated as on the slide)
13MSA Compact Set(cont)
Compact Set Tree by MSA1
Distance Matrix by MSA1
14MSA Compact Set(cont)
Compact Set Tree by MSA2
Distance Matrix by MSA2
15Measure of Compact Set Preservation
- How can we measure Compact Set Preservation quantitatively?
- N1 = number of the original Compact Set relations
- N2 = number of the relations preserved after MSA
- Estimate by Compact Set Preservation = N2 / N1
16Measure of Compact Set Preservation(cont)
Original Compact Set relations:
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
[Figure: distance matrix]
N1 = 10
17Measure of Compact Set Preservation(cont)
Compact Set relations after MSA:
1 2 4, 1 2 5, 1 4 3, 3 5 1, 2 4 3, 3 5 2, 1 2 3, 1 4 5, 2 4 5, 3 5 4
Original Compact Set relations:
1 2 4, 1 2 5, 1 3 4, 1 3 5, 2 3 4, 2 3 5, 1 2 3, 4 5 1, 4 5 2, 4 5 3
[Figures: distance matrix after MSA; Compact Set Tree after MSA]
N2 = 10 - 7 = 3
Estimate by Compact Set Preservation = 3/10
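The count above can be reproduced with a short sketch. It rests on two assumptions that the slides do not state explicitly: a relation "a b c" is read as "a and b share a compact set that excludes c", and the two trees are taken to be ((1,2),3),(4,5) before MSA and ((1,2),4),(3,5) after MSA, which is what the listed triples imply. The function names are hypothetical.

```python
from itertools import combinations


def relations(compact_sets: list[set[int]], universe: set[int]) -> set[tuple[int, int, int]]:
    """Collect triples (a, b, c): a and b lie in a compact set that excludes c."""
    found = set()
    for members in compact_sets:
        for a, b in combinations(sorted(members), 2):
            for c in universe - members:
                found.add((a, b, c))
    return found


def preservation(original, after, universe):
    """Return (N1, N2): original relations and those still present after MSA."""
    r_original = relations(original, universe)
    r_after = relations(after, universe)
    return len(r_original), len(r_original & r_after)


universe = {1, 2, 3, 4, 5}
original_sets = [{1, 2}, {1, 2, 3}, {4, 5}]  # assumed tree ((1,2),3),(4,5)
after_sets = [{1, 2}, {1, 2, 4}, {3, 5}]     # assumed tree ((1,2),4),(3,5)

n1, n2 = preservation(original_sets, after_sets, universe)
print(n1, n2, n2 / n1)  # prints: 10 3 0.3
```

Storing relations in a Python `set` plays the role of a hash table: membership tests for each original relation take expected constant time.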
18Why Pair Wise Compact Set?
- The evolutionary tree is the real judge
- The evolutionary tree tends to minimize the total length of evolutionary edges (the tree size) derived from the pairwise distances, which seems to be compact
- This holds in experiments
19Compact Set Relation Preserved Rate for Evolutionary Tree
Number of relations preserved in the Evolutionary Tree / number of Compact Set relations of the Pair Wise Distance
Larger is better
20Compact Set Evaluation Algorithm
- Step 1: Construct the original Compact Set Tree T and the Compact Set Tree after MSA T' [1].
- Step 2: Preorder-traverse T' to generate the Compact Set relations after MSA, R', and mark the entries in the hash table H according to R'.
- Step 3: Preorder-traverse T to generate the original Compact Set relations R, and check whether each entry of R is marked in the hash table H.
- Total Time Complexity O( ), where n is the number of sequences
- Reference
- [1] E. Dekel, J. Hu and W. Ouyang, An optimal algorithm for finding compact sets, Inform. Process. Lett. 44 (1992) 285-289
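Step 1 needs the compact sets of a distance matrix. The sketch below is not the optimal algorithm of the cited reference; it is a simpler, slower alternative that relies on one observation: since every inside edge of a compact set is shorter than every border edge, Kruskal's procedure joins the whole set into one component before it touches any border edge. It is therefore enough to test each component formed during Kruskal merging. The function names and the sample matrix are assumptions for illustration.

```python
def is_compact(dist, members):
    """Longest inside distance must be smaller than the shortest border distance."""
    outside = [j for j in range(len(dist)) if j not in members]
    max_inside = max((dist[i][j] for i in members for j in members if i < j), default=0)
    min_border = min((dist[i][j] for i in members for j in outside), default=float("inf"))
    return max_inside < min_border


def compact_sets(dist):
    """Return all compact sets, testing every component built by Kruskal merging."""
    n = len(dist)
    rep = list(range(n))                 # rep[v] = representative of v's component
    comp = {v: {v} for v in range(n)}    # representative -> members
    found = [frozenset({v}) for v in range(n)]  # singletons are always compact
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for _, i, j in edges:
        ri, rj = rep[i], rep[j]
        if ri == rj:
            continue                     # edge lies inside an existing component
        merged = comp[ri] | comp.pop(rj)
        comp[ri] = merged
        for v in merged:
            rep[v] = ri
        if is_compact(dist, merged):
            found.append(frozenset(merged))
    return found


# Hypothetical matrix whose compact sets form the tree ((0,1),2),(3,4).
dist = [
    [0, 1, 3, 10, 11],
    [1, 0, 4, 12, 13],
    [3, 4, 0, 14, 15],
    [10, 12, 14, 0, 2],
    [11, 13, 15, 2, 0],
]
for members in compact_sets(dist):
    print(sorted(members))
```

Because compact sets are nested or disjoint, sorting the result by size and linking each set to the smallest set that contains it would yield the Compact Set Tree.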
21Our Strategy for MSA
- Progressive alignment (Feng and Doolittle, 1987)
- with neighbor first (using the Minimal Spanning Tree (MST) Kruskal merging order)
- Set-to-set alignment. Once a gap, always a gap.
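[Figure: Kruskal merging order tree over S1, S2, S3, S4 with merge steps 1, 2, 3 and the two sets set1 and set2]
Pairwise alignments:
S1 AACAGACTTA-
S2 -ACAGACTTAA
S3 ----ACAGACTCCA
S4 TTTAAAAGTC----
Set-to-set alignment:
S1 ---AACAGACTT-A-
S2 ----ACAGACTT-AA
S3 ----ACAGACTCCA-
S4 TTTAAAAGTC-----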
22Q: Why do we use the MST Kruskal order?
A1: It has a structure similar to that of the compact set tree
[Figures: MST Order Merge Tree; Compact Tree]
A2: The MST Kruskal order is easy to obtain
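To show how little work the Kruskal merging order requires, here is a minimal sketch: sort all pairwise distances and merge the two groups joined by each edge that connects different groups. The function name `kruskal_merge_order`, the returned tuple layout, and the sample matrix are assumptions for this example.

```python
def kruskal_merge_order(dist):
    """Return the merges (group_a, group_b, distance) in Kruskal order."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    groups = {v: [v] for v in range(n)}  # root -> members of that group
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both ends already aligned together
        merges.append((sorted(groups[ri]), sorted(groups[rj]), d))
        parent[rj] = ri
        groups[ri] = groups[ri] + groups.pop(rj)
    return merges


# Hypothetical distances for four sequences (indices 0..3).
dist = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
for step, (a, b, d) in enumerate(kruskal_merge_order(dist), start=1):
    print(step, a, b, d)
```

In a progressive aligner each reported merge would trigger one alignment: sequence-to-sequence when both groups are singletons, set-to-set otherwise.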
23Score function
Score categories: Match, Mismatch, Gap-open, Gap-extended, Begin-gap, End-gap
[Figure: the categories annotated on the alignment below]
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
24Strategy of set-to-set alignment
Score(8, 8) = max{ Score(7, 7) + (α8, β8), Score(7, 8) + (α8, G3), Score(8, 7) + (G2, β8) }
Here G2 and G3 denote all-gap columns of 2 and 3 rows.
(α8, β8) = (G,C) + (G,-) + (G,G) + (-,C) + (-,-) + (-,G)
= (-10) + (-15) + (10) + (-15) + (0) + (-15) = -45
Time complexity of set α to set β alignment: O(sα sβ lα lβ), here (2·3·8·8), where sα, sβ are the
numbers of sequences in set α and set β
respectively, and lα, lβ are the lengths of the
resulting sequences in set α and set β respectively.
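The column-against-column score and the recurrence can be sketched as follows. The values +10 (match), -10 (mismatch), -15 (residue against gap) and 0 (gap against gap) are read off the worked example on this slide; treating every gap with the same linear penalty, without separate gap-open, gap-extended, begin-gap or end-gap scores, is a simplifying assumption of this sketch, as are the function names.

```python
MATCH, MISMATCH, GAP = 10, -10, -15  # values taken from the worked example


def pair_score(a: str, b: str) -> int:
    """Score one residue (or gap) of set alpha against one of set beta."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def column_score(col_a, col_b) -> int:
    """Sum the pair scores over all s_alpha * s_beta cross pairs of two columns."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)


def set_to_set_score(set_a: list[str], set_b: list[str]) -> int:
    """Best set-to-set alignment score by dynamic programming over columns."""
    la, lb = len(set_a[0]), len(set_b[0])
    cols_a = [[row[i] for row in set_a] for i in range(la)]
    cols_b = [[row[j] for row in set_b] for j in range(lb)]
    gaps_a, gaps_b = ["-"] * len(set_a), ["-"] * len(set_b)
    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gaps_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gaps_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gaps_b),
                score[i][j - 1] + column_score(gaps_a, cols_b[j - 1]),
            )
    return score[la][lb]


print(column_score(["G", "-"], ["C", "-", "G"]))  # prints -45, as on the slide
```

Each of the lα·lβ table cells costs sα·sβ pair lookups, which matches the stated O(sα sβ lα lβ) bound. Existing columns are only ever paired with another column or an all-gap column, so gaps already present inside a set are never removed.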
25Time Complexity of our strategy
- The worst case happens when the binary tree is balanced.
- Total set-to-set time complexity is bounded in terms of l, the length of the resulting sequences, and n, the number of sequences.
- The worst-case time complexity is O(n^2 l^2)
26MSA Useful tools
- GCG (Genetics Computer Group) PileUp
- http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
- ClustalW
- http://clustalw.genome.ad.jp/
27Clustal W
- Pairwise alignment
- Calculate distance matrix
- Construct the unrooted Neighbor-Joining (NJ) tree
- Construct the rooted NJ tree
- rooted at mid-point
- Progressive alignment
- Align following the rooted NJ tree
- set-to-set alignment
28Experiment
29SP Score Result
ClustalW's results and ours are better than GCG's
Larger is better
30Compact Set Relation Failure rate Result
Number of relations not preserved / number of source compact set relations
Smaller is better
31Three-point Relative Scale Preserved Rate
For every three species A, B, C, we evaluate whether their
relative distance relations in the original
distance matrix and in the MSA distance matrix are
identical.
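One possible reading of this measure is sketched below: for each triple, rank the three pairwise distances d(A,B), d(A,C), d(B,C) in both matrices and count the triple as preserved when the two rankings agree. This interpretation of "relative distance relation", the tie handling (ties keep their listing order), and the function names are assumptions; the slides do not spell out the exact comparison.

```python
from itertools import combinations


def distance_order(dist, a, b, c):
    """Rank the three pairwise distances of a triple, smallest first."""
    pairs = [(a, b), (a, c), (b, c)]
    return tuple(sorted(range(3), key=lambda k: dist[pairs[k][0]][pairs[k][1]]))


def three_point_preserved_rate(original, after):
    """Fraction of triples whose distance ranking is the same in both matrices."""
    triples = list(combinations(range(len(original)), 3))
    same = sum(
        distance_order(original, *t) == distance_order(after, *t) for t in triples
    )
    return same / len(triples)


# Hypothetical matrices: only the distance between objects 2 and 3 changes.
original = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
after = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 9],
    [6, 8, 9, 0],
]
print(three_point_preserved_rate(original, after))  # prints 0.5
```

Two of the four triples contain both objects 2 and 3, and in both of them the changed distance moves from shortest to longest, so half of the triples keep their ranking.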
32I Believe Tree Only
- One might still not believe that the original pairwise distance is a good judge
- One believes the true evolutionary tree only
33Compact Set Relation Failure Rate
Take the 12-protein example
Number of relations not preserved / number of source Compact Set relations
[Table: rows by Distance, columns by MSA_Method]
Smaller is better
34Future Work
- Are our measurement and algorithms really good?
- Simulations and Web service
- Does our MSA by set-to-set alignment satisfy some approximation property?
- Theoretical proof
- How can we reduce the time?
- Hardwired Dynamic Programming
- e.g., PARACEL http://www.paracel.com/