Multiple Sequence Alignment Based on Compact Set

1
Multiple Sequence Alignment Based on Compact Set
  • Department of Computer Science
  • National Tsing Hua University
  • Chuan Yi Tang

2
Multiple Sequence Alignment
  • Given a set of sequences, the MSA problem is to
    find an alignment of the sequences such that some
    objective function is minimized
  • e.g. the Sum-of-Pairs (SP) score
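
As a rough illustration of the objective above, the sum-of-pairs score adds up a pairwise score over every pair of rows in every column of the alignment. The Python sketch below is not taken from the slides: the function names are hypothetical, and the score values (match 10, mismatch -10, residue against gap -15, gap against gap 0) are borrowed from the column example on slide 24. With similarity values like these a larger score is better; with a distance-style scoring the same sum would be minimized instead.

```python
from itertools import combinations

# Assumed score values, borrowed from the column example on slide 24.
MATCH, MISMATCH, GAP = 10, -10, -15


def pair_score(x, y):
    """Score one pair of aligned symbols ('-' denotes a gap)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def sp_score(alignment):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    total = 0
    for column in zip(*alignment):            # walk the alignment column by column
        for x, y in combinations(column, 2):  # every pair of rows in this column
            total += pair_score(x, y)
    return total


example = ["AC-", "ACG", "A-G"]  # hypothetical three-row alignment
print(sp_score(example))
```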

3
MSA with SP-Score: Exact Algorithm and Heuristics
  • k: number of sequences; n: length of the sequences
  • Exactly (using Dynamic Programming)
  • O((2n)^k): D. Sankoff, Simultaneous solution of
    the RNA folding, alignment and protosequence
    problems, SIAM J. Appl. Math. (1985)
  • Heuristics
  • D.F. Feng, R.F. Doolittle, Progressive sequence
    alignment as a prerequisite to correct
    phylogenetic trees, J. Mol. Evol. 25, 351-360
    (1987)
  • S.F. Altschul, D.J. Lipman, Trees, stars, and
    multiple biological sequence alignment, SIAM J.
    Appl. Math. (1989)
  • D.J. Lipman, S.F. Altschul, A tool for multiple
    sequence alignment, Proc. Natl. Acad. Sci.
    U.S.A. (1989)
  • S.C. Chan, A.K.C. Wong, D.K.Y. Chiu, A survey of
    multiple sequence comparison methods, Bull. Math.
    Biol. (1992)

4
MSA with SP-Score: Complexity
  • J. Comput. Biol. 1994 Winter; 1(4): 337-48
  • On the complexity of multiple sequence
    alignment.
  • Wang L., Jiang T.
  • McMaster University, Hamilton, Ontario, Canada.
  • We study the computational complexity of two
    popular problems in multiple sequence alignment
  • 1. multiple alignment with SP-Score =>
    NP-complete (non-metric)
  • 2. multiple tree alignment => MAX SNP-hard
  • Theoretical Computer Science 259 (2001) 63-79
  • The complexity of multiple sequence alignment
    with SP-score that is a metric
  • Paola Bonizzoni, Gianluca Della Vedova
  • 1. multiple alignment with SP-Score =>
    NP-complete (metric)

5
MSA with SP-Score: Approximation
  • Approximation Algorithm
  • Performance ratio of 2-2/k: D. Gusfield, Efficient
    methods for multiple sequence alignment with
    guaranteed error bounds, Bull. Math. Biol. (1993)
  • Performance ratio of 2-3/k: P. Pevzner, Multiple
    alignment, communication cost, and graph
    matching, SIAM J. Appl. Math. (1992)
  • Performance ratio of 2-l/k (assembling l-way
    alignments, l < k): V. Bafna, E.L. Lawler and
    P. Pevzner, Approximation algorithms for multiple
    sequence alignment, Theor. Comput. Sci. (1997)
  • Polynomial Time Approximation Scheme (PTAS)
  • MSA within a constant band, allowing only a
    constant number of insertion and deletion gaps of
    arbitrary length per sequence on average: M. Li,
    B. Ma, and L. Wang, Near optimal alignment
    within a band in polynomial time, STOC 2000.

6
Compact Set Definition
  • Let S be the set of n objects S1, S2, S3, ..., Sn
    and let D(Si,Sj) denote the distance between Si
    and Sj in the distance matrix D.
  • Consider any C which is a subset of S. If every
    distance between an element in C and an element
    not in C is larger than the longest distance
    within C, then C is called a compact set.
  • Property
  • The entire set S is a compact set.
  • Each set consisting of a single object is also a
    compact set.
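
The definition can be checked directly on a small distance matrix. The Python sketch below is a brute-force illustration only: it tests every subset, so it is exponential in n and is not the efficient algorithm of Dekel, Hu and Ouyang cited later in the deck. The function names and the 5x5 matrix are hypothetical, and objects are indexed from 0.

```python
from itertools import combinations


def is_compact(dist, subset):
    """True if every border distance exceeds the longest distance inside subset."""
    inside = set(subset)
    outside = set(range(len(dist))) - inside
    longest_inside = max(
        (dist[a][b] for a, b in combinations(sorted(inside), 2)),
        default=float("-inf"),  # a single object has no inside edge
    )
    shortest_border = min(
        (dist[a][x] for a in inside for x in outside),
        default=float("inf"),   # the entire set has no border edge
    )
    return longest_inside < shortest_border


def compact_sets(dist):
    """Enumerate all compact sets by brute force (small n only)."""
    n = len(dist)
    return [
        set(c)
        for size in range(1, n + 1)
        for c in combinations(range(n), size)
        if is_compact(dist, c)
    ]


# Hypothetical symmetric distance matrix for five objects (indices 0..4).
D = [
    [0, 1, 3, 6, 7],
    [1, 0, 4, 8, 9],
    [3, 4, 0, 6, 7],
    [6, 8, 6, 0, 2],
    [7, 9, 7, 2, 0],
]

for s in compact_sets(D):
    print(sorted(s))
```

Under this definition the single objects and the entire set come out as compact sets automatically, which matches the two properties listed on the slide.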

7
Compact Set Example
[Figure: distance matrix over S1-S6 with Compact Set 1, Compact Set 2 and Compact Set 3 marked. For Compact Set 3, the maximal inside edge is 10 and the minimal border edge is 11.]
8
Compact Set Example (cont.)
  • Compact Set is hierarchical

9
MSA Compact Set
  • Consider an example of 12 protein sequences
  • S1 MAPSAPAKTAKALDAKKKVVKGKRTTHRRQVRTSVHFRRPVTLKTA
    RQARFPRKSAPKTSKMDHFRIIQHPLTTESAMKKIEEHNTLVFIVSNDAN
    KYQIKDAVHKLYNVQALKVNTLITPLQQKKAYVRLTADYDALDVANKIGV
    I
  • S2 SSIIDYPLVTEKAMDEMDFQNKLQFIVDIDAAKPEIRDVVESEYDV
    TVVDVNTQITPEAEKKATVKLSAEDDAQDVASRIGVF
  • S3 SWDVIKHPHVTEKAMNDMDFQNKLQFAVDDRASKGEVADAVEEQYD
    VTVEQVNTQNTMDGEKKAVVRLSEDDDAQEVASRIGVF
  • S4 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPTF
    RRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNTL
    VFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYDA
    LDVANKIGII
  • S5 MAPSTKATAAKKAVVKGTNGKKALKVRTSASFRLPKTLKLARSPKY
    ATKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNTLVFKVSLKANKYQIK
    KAVKELYEVDVLSVNTLVRPNGTKKAYVRLTADFDALDIANRIGYI
  • S6 MDAFDVIKTPIVSEKTMKLIEEENRLVFYVERKATKEDIKEAIKQ
    LFNAEVAEVNTNITPKGQKKAYIKLKDEYNAGEVAASLGIY
  • S7 MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKKSQKIRTKVTFHR
    PKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTESAMKKIEDNNTLVF
    IVDIKADKKKIKDAVKKMYDIQTKKVNTLIRPDGTKKAYVRLTPDYDALD
    VANKIGII
  • S8 MAPSTKAASAKKAVVKGSNGSKALKVRTSTTFRLPKTLKLTRAPKY
    ARKAVPHYQRLDNYKVIVAPIASETAMKKVEDGNTLVFQVDIKANKHQIK
    QAVKDLYEVDVLAVNTLIRPNGTKKAYVRLTADHDALDIANKIGYI
  • S9 MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQRRIRTSV
    TFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLNTESAMKKIEDNN
    TLLFIVDLKANKRQIADAVKKLYDVTPLRVNTLIRPDGKKKAFVRLTPEV
    DALDIANKIGFI
  • S10 MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKKKKIRTSPT
    FRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTESAMKKIEDNNT
    LVFIVDVKANKHQIKQAVKKLYDIDVAKVNTLIRPDGEKKAYVRLAPDYD
    ALDVANKIGII
  • S11 APSAKATAAKKAVVKGTNGKKALKVRTSATFRLPKTLKLARAPKY
    ASKAVPHYNRLDSYKVIEQPITSETAMKKVEDGNILVFQVSMKANKYQIK
    KAVKELYEVDVLKVNTLVRPNGTKKAYVRLTADYDALDIANRIGYI
  • S12 MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVTK
    RKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYPLTTDKAM
    KKIEENNTLTFIVDSRANKTEIKKAIRKLYQVKTVKVNTLIRPDGLKKAY
    IRLSASYDALDTANKMGLV

Original sequence
10
MSA Compact Set (cont.)
[Figures: original distance matrix; original compact set tree]
A good MSA should preserve the compact sets as well
11
MSA Compact Set (cont.)
  • S1 -----------------MAPSAPAKTAKALDAKKKVVKGKRTTHR
    RQVRTSVHFRRPVTLKTARQARFPRKSAPKTSKMDHFRIIQHPLTTESA
  • S2 ---------------------------------------------
    ------------------------------------SSIIDYPLVTEKAM
    DEMDFQNKLQFIVDIDAAKPEIRDV
  • S3 ---------------------------------------------
    -----------------------------------SWDVIKHPHVTEKAM
    NDMDFQNKLQFAVDDRASKGEV
  • S4 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHKK
    KKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTES
  • S5 ----------------------MAPSTKATAAKKAVVKGTNGKKA
    LKVRTSASFRLPKTLKLARSPKYATKAVPHYNRLDSYKVIEQPITSETAM
    KK
  • S6 ---------------------------------------------
    ---------------------------------MDAFDVIKTPIVSEKTM
    KLIEEENRLVFYVERKATKEDIKEA
  • S7 ----------MAPAKADPSKKSDPKAQAAKVAKAVKSGSTLKKK
    SQKIRTKVTFHRPKTLKKDRNPKYPRISAPGRNKLDQYGILKYPLTTE
  • S8 ----------------------MAPSTKAASAKKAVVKGSNGSKA
    LKVRTSTTFRLPKTLKLTRAPKYARKAVPHYQRLDNYKVIVAPIASETAM
    KK
  • S9 ------MPPKSSTKAEPKASSAKTQVAKAKSAKKAVVKGTSSKTQ
    RRIRTSVTFRRPKTLRLSRKPKYPRTSVPHAPRMDAYRTLVRPLN
  • S10 --------MAPKAKKEAPAPPKAEAKAKALKAKKAVLKGVHSHK
    KKKIRTSPTFRRPKTLRLRRQPKYPRKSAPRRNKLDHYAIIKFPLTTE
  • S11 -----------------------APSAKATAAKKAVVKGTNGK
    KALKVRTSATFRLPKTLKLARAPKYASKAVPHYNRLDSYKVIEQPITSET
    AMKK
  • S12 MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKPTKVVKVT
    KRKAYTRPQFRRPHTYRRPATVKPSSNVSAIKNKWDAFRIIRYP

MSA by MSA1
12
MSA Compact Set (cont.)
  • S1 ------------MAPSAPAKTA-KALDAKKKVVKGK-RTTHR--
    R--QV--R---TSVHFRRPVTLKTARQARFPRKSAPK-TSKMDHFR-IIQ
    HPL
  • S2 --------------------------------------------
    -------------------------------------------S--SIID
    YPLVTEKAMDEMDFQNKLQFIVDID- AAK
  • S3 --------------------------------------------
    -------------------------------------------SW-DVIK
    HPHVTEKAMNDMDFQNKLQFAVD-DRA
  • S4 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK--
    K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AIIK
    FP
  • S5 -----------------MAPST-KATAAKKAVVKGT-NG--K--
    KALKV--R---TSASFRLPKTLKLARSPKYATKAVPH-YNRLDSYK-VIE
    QPITSET
  • S6 --------------------------------------------
    -----------------------------------------MDAF-DVIK
    TPIVSEKTMKLIEEENRLVFYVER-KATK
  • S7 MAP-A--KAD-PS-KKSDPK-A-QAAKVAKAVKSG--STLKK--
    KSQKI--R---TKVTFHRPKTLKKDRNPKYPRISAPG-RNKLDQY-GILK
    YP
  • S8 -----------------MAPST-KAASAKKAVVKGS-NG--S--
    KALKV--R---TSTTFRLPKTLKLTRAPKYARKAVPH-YQRLDNYK-VIV
    APIASET
  • S9 MPPKSSTKAE-PKASSAKTQVA-KAKSAKKAVVKGT-SS--K--
    TQRRI--R---TSVTFRRPKTLRLSRKPKYPRTSVPH-APRMDAYRTLVR
  • S10 MAPKA--KKEAPAPPKAEAK-A-KALKAKKAVLKGV-HSHKK-
    -K--KI--R---TSPTFRRPKTLRLRRQPKYPRKSAPR-RNKLDHY-AII
    KF
  • S11 ------------------APSA-KATAAKKAVVKGT-NG--K-
    -KALKV--R---TSATFRLPKTLKLARAPKYASKAVPH-YNRLDSYK-VI
    EQPITSET
  • S12 ------MPAKAASAAASKKNSAPKSAVSKKVAKKGAPAAAAKP
    TKVVKVTKRKAYTRPQFRRPHTYRRPATVK-PSSNVSAIKNKWDAFR

MSA by MSA2
13
MSA Compact Set (cont.)
[Figures: compact set tree by MSA1; distance matrix by MSA1]
14
MSA Compact Set (cont.)
[Figures: compact set tree by MSA2; distance matrix by MSA2]
15
Measure of Compact Set Preservation
  • How can we measure compact set preservation
    quantitatively?
  • N1 = # of the original compact set relations
  • N2 = # of the relations preserved after MSA
  • Estimate by Compact Set Preservation = N2 / N1

16
Measure of Compact Set Preservation (cont.)
Original compact set relations:
(1,2,4) (1,2,5) (1,3,4) (1,3,5) (2,3,4) (2,3,5) (1,2,3)
(4,5,1) (4,5,2) (4,5,3)
[Figure: distance matrix]
N1 = 10
17
Measure of Compact Set Preservation (cont.)
The relations preserved after MSA
Relations after MSA:
(1,2,4) (1,2,5) (1,4,3) (3,5,1) (2,4,3) (3,5,2) (1,2,3)
(1,4,5) (2,4,5) (3,5,4)

Original relations:
(1,2,4) (1,2,5) (1,3,4) (1,3,5) (2,3,4) (2,3,5) (1,2,3)
(4,5,1) (4,5,2) (4,5,3)
Only (1,2,4), (1,2,5) and (1,2,3) appear in both lists.
[Figures: distance matrix after MSA; compact set tree after MSA]
N2 = 10 - 7 = 3
Estimate by Compact Set Preservation = 3/10
18
Why Pairwise Compact Sets?
  • The evolutionary tree is the real judge
  • An evolutionary tree tends to minimize the total
    length of its evolutionary edges (the tree size)
    computed from the pairwise distances, which makes
    it appear compact
  • This holds in experiments

19
Compact Set Relation Preserved Rate for
Evolutionary Tree
(# of relations preserved in the evolutionary tree) /
(# of compact set relations of the pairwise distance)
Larger is better
20
Compact Set Evaluation Algorithm
  • Step 1: Construct the original compact set tree T
    and the compact set tree after MSA, T'.
  • Step 2: Traverse T' in preorder to generate the
    compact set relations after MSA, R', and mark the
    entries in the hash table H according to R'.
  • Step 3: Traverse T in preorder to generate the
    original compact set relations R, and check
    whether each entry of R is also marked in the
    hash table H.
  • Total time complexity: O( ), where n is the
    number of sequences
  • Reference
  • 1. E. Dekel, J. Hu and W. Ouyang, An optimal
    algorithm for finding compact sets, Inform.
    Process. Lett. 44 (1992) 285-289
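
The three steps above can be sketched as follows in Python. This is an illustration under stated assumptions, not the authors' implementation: the nested-tuple tree encoding, the function names, and the two example trees are hypothetical. The trees were chosen so that they reproduce the ten relations listed on slides 16 and 17, because the slide figures themselves are not available. A relation (a, b, c) is read here as "a and b share a compact set that excludes c", and a Python set stands in for the hash table H.

```python
from itertools import combinations


def leaves(node):
    """Return the set of sequence ids below a node (an int is a leaf)."""
    if isinstance(node, int):
        return {node}
    return set().union(*(leaves(child) for child in node))


def relations(tree):
    """Preorder traversal collecting (a, b, c): a, b inside a compact set, c outside."""
    universe = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, int):
            return
        inside = leaves(node)
        for a, b in combinations(sorted(inside), 2):
            for c in universe - inside:
                found.add((a, b, c))
        for child in node:
            visit(child)

    visit(tree)
    return found


# Assumed trees, chosen to match the relation lists on slides 16 and 17.
T_original = (((1, 2), 3), (4, 5))
T_after_msa = (((1, 2), 4), (3, 5))

H = relations(T_after_msa)               # Step 2: mark relations after MSA
R = relations(T_original)                # Step 3: original relations
preserved = {r for r in R if r in H}     # look each one up in H

print(len(R), len(preserved), len(preserved) / len(R))
```

For clarity this sketch recomputes the leaf sets at every node; a version aiming at the slide's complexity bound would cache them.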

21
Our Strategy for MSA
  • Progressive alignment (Feng and Doolittle,
    1987)
  • Neighbor first (using the Minimal Spanning
    Tree (MST) Kruskal merging order)
  • Set-to-set alignment: once a gap, always a gap.

[Figure: Kruskal merging order tree over S1, S2, S3, S4, with merge steps 1, 2, 3 and intermediate sets set1 and set2]
Alignment of S1 and S2:
S1 AACAGACTTA-
S2 -ACAGACTTAA
Alignment of S3 and S4:
S3 ----ACAGACTCCA
S4 TTTAAAAGTC----
Set-to-set alignment of the two sets:
S1 ---AACAGACTT-A-
S2 ----ACAGACTT-AA
S3 ----ACAGACTCCA-
S4 TTTAAAAGTC-----
22
Q: Why do we use the MST Kruskal order?
A1: It has a structure similar to the compact set tree
[Figures: MST order merge tree; compact set tree]
A2: The MST Kruskal order is easy to obtain
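
A minimal sketch of how a Kruskal merging order can be obtained from a pairwise distance matrix, in Python. The function name, the union-find details, and the 4x4 distance matrix are hypothetical. The distances were picked so that S1/S2 and S3/S4 are paired before the two resulting sets are joined, which is the shape of the example on the previous slide (indices 0..3 stand for S1..S4).

```python
def kruskal_merge_order(dist):
    """Return the list of (set_a, set_b) merges in Kruskal (MST) order."""
    n = len(dist)
    parent = list(range(n))
    members = {i: frozenset([i]) for i in range(n)}  # keyed by component root

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, shortest first.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))

    merges = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # both ends already in the same set: not an MST edge
        merges.append((members[ri], members[rj]))
        parent[rj] = ri
        members[ri] = members[ri] | members.pop(rj)
    return merges


# Hypothetical distances between four sequences.
dist = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]

for a, b in kruskal_merge_order(dist):
    print(sorted(a), "+", sorted(b))
```

Each merge in the returned list corresponds to one set-to-set alignment in the progressive strategy.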
23
Score function
[Figure: example alignment annotated with the score terms match, mismatch, gap-open, gap-extended, begin-gap and end-gap]
---AACAGACTT-A-
----ACAGAC---AA
----ACAGACTCCA-
TTTAAAAGTC-C---
24
Strategy of set-to-set alignment
  • Score(8, 8) = Max {
      Score(7, 7) + (α8, β8),
      Score(7, 8) + (α8, G3),
      Score(8, 7) + (G2, β8) }
    where Gk denotes a column of k gaps

(α8, β8) = (G,C) + (G,-) + (G,G) + (-,C) + (-,-) + (-,G)
         = (-10) + (-15) + (10) + (-15) + (0) + (-15) = -45
Time complexity of set α to set β alignment:
O(sα sβ lα lβ) (here 2·3·8·8), where sα, sβ are the
numbers of sequences in set α and set β
respectively, and lα, lβ are the lengths of the
resulting sequences in set α and set β respectively.
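
The recurrence and the column score above can be sketched in Python as follows. This is a simplified illustration, not the authors' scoring: the function names are hypothetical, the values 10 / -10 / -15 / 0 are read off the (α8, β8) example, and a single gap penalty is used throughout, so the separate gap-open, gap-extended, begin-gap and end-gap terms of the score function slide are not modelled.

```python
# Score values read off the (α8, β8) column example; a single gap
# penalty is an assumption of this sketch.
MATCH, MISMATCH, GAP = 10, -10, -15


def pair_score(x, y):
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def column_score(col_a, col_b):
    """Sum of pair scores between a column of set α and a column of set β."""
    return sum(pair_score(x, y) for x in col_a for y in col_b)


def set_to_set_score(set_a, set_b):
    """Best score for aligning two already-aligned sets of sequences."""
    la, lb = len(set_a[0]), len(set_b[0])
    cols_a = ["".join(s[i] for s in set_a) for i in range(la)]
    cols_b = ["".join(s[j] for s in set_b) for j in range(lb)]
    gap_a = "-" * len(set_a)  # all-gap column inserted into set α
    gap_b = "-" * len(set_b)  # all-gap column inserted into set β

    score = [[0] * (lb + 1) for _ in range(la + 1)]
    for i in range(1, la + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, lb + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    return score[la][lb]


print(column_score("G-", "C-G"))  # the (α8, β8) column from the slide
```

The table has lα·lβ cells and each cell compares sα·sβ symbol pairs, which gives the sα sβ lα lβ cost stated on the slide. Existing gap columns inside each set are never removed, in line with "once a gap, always a gap".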
25
Time Complexity of our strategy
  • The worst case occurs when the binary tree is
    balanced.
  • Total set-to-set time complexity is bounded by
  • where l is the length of the resulting sequences
    and n is the number of sequences.
  • The worst-case time complexity is O(n^2 l^2)
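
One way to arrive at a bound of this form is sketched below. It is not taken from the slide and rests on assumptions: n is a power of two, the merge tree is perfectly balanced, every intermediate alignment has length at most l, and each merge costs s_α s_β l_α l_β as on the previous slide. At depth i of the tree there are 2^(i-1) merges, each joining two sets of n/2^i sequences.

```latex
\sum_{i=1}^{\log_2 n} 2^{\,i-1} \left(\frac{n}{2^{i}}\right)^{2} l^{2}
  \;=\; n^{2} l^{2} \sum_{i=1}^{\log_2 n} 2^{-(i+1)}
  \;=\; \frac{n(n-1)}{2}\, l^{2}
  \;=\; O(n^{2} l^{2})
```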

26
MSA: Useful Tools
  • GCG (Genetics Computer Group) PileUp
  • http://gcg.nhri.org.tw:8003/gcg-bin/seqweb.cgi
  • ClustalW
  • http://clustalw.genome.ad.jp/

27
Clustal W
  • Pairwise alignment
  • Calculate distance matrix
  • Construct the unrooted Neighbor-Joining (NJ) tree
  • Construct the rooted NJ tree
  • rooted at mid-point
  • Progressive alignment
  • Align following the rooted NJ tree
  • set-to-set alignment

28
Experiment
29
SP Score Result
ClustalW's and our results are better than GCG's
Larger is better
30
Compact Set Relation Failure Rate Result
(# of relations not preserved) / (# of source
compact set relations)
Smaller is better
31
Three-point Relative Scale Preserved Rate
For every three species A, B, C, we check whether
their relative distance relation is identical in
the original distance matrix and in the MSA
distance matrix.
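
A possible reading of this measure, sketched in Python. The slide does not define "relative distance relation" precisely, so this sketch assumes it means the ordering of the three pairwise distances within a triple. The function names and both 4x4 matrices are hypothetical.

```python
from itertools import combinations


def three_point_order(dist, a, b, c):
    """Order the three pairs of a triple from closest to farthest."""
    pairs = [(a, b), (a, c), (b, c)]
    return tuple(sorted(pairs, key=lambda p: dist[p[0]][p[1]]))


def three_point_preserved_rate(d_orig, d_msa):
    """Fraction of triples whose pair ordering is the same in both matrices."""
    triples = list(combinations(range(len(d_orig)), 3))
    same = sum(
        three_point_order(d_orig, *t) == three_point_order(d_msa, *t)
        for t in triples
    )
    return same / len(triples)


# Hypothetical distance matrices for four species.
d_orig = [
    [0, 1, 5, 6],
    [1, 0, 7, 8],
    [5, 7, 0, 2],
    [6, 8, 2, 0],
]
d_msa = [
    [0, 2, 4, 6],
    [2, 0, 7, 8],
    [4, 7, 0, 5],
    [6, 8, 5, 0],
]

print(three_point_preserved_rate(d_orig, d_msa))
```

Ties between distances are broken by pair position here; a stricter variant could treat tied triples separately.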
32
I Believe the Tree Only
  • One might still not believe that the original
    pairwise distance is a good judge
  • One might believe only the true evolutionary tree

33
Compact Set Relation Failure Rate
Take the 12-protein example
(# of relations not preserved) / (# of source
compact set relations)
[Figure: results by Distance and MSA_Method]
Smaller is better
34
Future Work
  • Are our measurement and algorithms really good?
  • Simulations and Web service
  • Does our MSA by set-to-set alignment satisfy some
    approximation property?
  • Theoretical proof
  • How can we reduce the time?
  • Hardwired Dynamic Programming
  • e.g. PARACEL http://www.paracel.com/