
Title: An Approx. Algo. For Alignment MSA using Motif Discovery


1
An Approx. Algo. For Alignment MSA using Motif
Discovery
  • B.J. Chen
  • M.W. Chang
  • C.H. Tsai
  • H.Y. Chuang
  • NTU

2
MSA
  • Multiple Sequence Alignment
  • Given N sequences, align these sequences,
    possibly with gaps, that brings out the best
    commonality of these N sequences.
  • Usually measure the alignment by penalizing the
    mis-matches and gaps, and rewarding the matches.

3
Previous works
  • Alt89, CHM94, CL88, GKS95, etc. work best
    for small values of N (about 2-6).
  • ZZ96 and Vih98 handle the case of N > 2 by
    first applying pairwise alignment techniques.

4
Motivation of this work
  • For larger values of N, we need additional
    constraints to give biologically meaningful
    alignments.
  • Motif: a common pattern across sequences.

5
Motivation of this work
  • Alignment number K ($2 \le K \le N$)
  • A user controlled parameter constrains the
    alignment to have at least K sequences agree on a
    character, whenever possible, in the alignment.
  • The commonality across most sequences is required
    to be detected.

6
Problem Description
  • Align multiple strings with gaps
  • Goal: bring out the best commonality
  • Parameter K, the alignment number
  • The alignment should have at least K sequences
    agree on a character, whenever possible, in the
    alignment.
  • K = 1 has no meaning (every character trivially
    agrees with itself)

7
Problem Definition
  • Given N sequences, where sequence i has length $n_i$
  • Let matrix A be an alignment; the size of A is
    $N \times T$ (T is the length after alignment, $T \ge n_i$)
  • Let $A_i$ be the i-th row of A; removing all gaps
    from $A_i$ gives back the original sequence i
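
The definition of an alignment matrix can be checked mechanically. A minimal sketch, assuming gaps are written as `-` and that rows and sequences are plain strings; the name `is_valid_alignment` is illustrative and not from the paper:

```python
GAP = "-"  # assumed gap symbol


def is_valid_alignment(rows, sequences):
    """Return True if rows form an N x T alignment of the given sequences."""
    if not rows or len(rows) != len(sequences):
        return False
    T = len(rows[0])
    # Every row of the matrix A must have the same length T.
    if any(len(row) != T for row in rows):
        return False
    # Removing the gaps from row i must give back sequence i.
    return all(row.replace(GAP, "") == seq for row, seq in zip(rows, sequences))


sequences = ["GFPCQFSAG", "GFPCQFSGG", "GPCQSAGK"]
rows = ["GFPCQFSAG", "GFPCQFSGG", "-GPCQSAGK"]
print(is_valid_alignment(rows, sequences))
```

The sequences used here are the three sequences of the worked example later in the talk.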

8
Problem Definition(2)
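  • Define $E_A(a,i,j) = 1$ if $A_{i,j} = a$
  • $E_A(a,i,j) = 0$ otherwise
  • $c_A(a,j) = \sum_i E_A(a,i,j)$, the number of rows with
    character a in column j (the slide's own symbol for this
    count did not survive; $c_A$ is a stand-in)
  • a in column j is bad if $c_A(a,j) < K$
  • Define the badness of column j as
    $\sum_{a:\ 0 < c_A(a,j) < K} c_A(a,j)$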

9
Problem Definition(3)
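  • Example: given K = 4, take the column
    A A B B B B C C
    X X O O O O X X
  • A and C each occur 2 < 4 times (bad, marked X);
    B occurs 4 times (marked O)
  • Badness = 2 + 2 = 4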
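
The column A A B B B B C C with K = 4 can be scored with a few lines of code. This is a sketch under two assumptions: that the badness of a column is the total count of characters occurring fewer than K times, and that gap symbols (`-`) are not counted. The function names are illustrative.

```python
from collections import Counter


def column_badness(column, K, gap="-"):
    """Sum the counts of every character that occurs fewer than K times."""
    counts = Counter(ch for ch in column if ch != gap)
    return sum(c for c in counts.values() if 0 < c < K)


def total_badness(rows, K, gap="-"):
    """Add up the badness of every column of an alignment given as rows."""
    return sum(column_badness(col, K, gap) for col in zip(*rows))


print(column_badness("AABBBBCC", 4))
```

Under these assumptions A and C contribute 2 each and B contributes nothing, which agrees with the value 4 shown on the slide.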

10
Problem Definition(4)
  • Minimize the total badness over all columns
  • This is the K-MSA problem
  • The paper gives a proof that this problem is
    MAX SNP-hard (I'll report this part later)

11
Stage 1 Motif Discovery
  • We begin by defining a motif in a sequence
  • Given a string s on alphabet $\Sigma$ and an integer K,
    $2 \le K \le |s|$
  • Definition 1: K-motif
  • A string m on $\Sigma \cup \{.\}$ is a K-motif with location list
    $L_m = (l_1, l_2, \ldots, l_p)$ if all of the following
    hold
  • $m[0]$ and $m[|m|-1]$ belong to $\Sigma$
  • $p \ge K$
  • For every don't-care character at position j in m,
    there exist at least two distinct occurrences $l_{i_1}$
    and $l_{i_2}$, $1 \le i_1, i_2 \le p$, such that
    $s[l_{i_1}+j] \ne s[l_{i_2}+j]$
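
The three conditions of Definition 1 translate directly into a check. A minimal sketch, assuming `.` is the don't-care character and that the location list is taken to be every offset at which m matches s (the slide leaves this implicit); the helper names are hypothetical:

```python
DONT_CARE = "."


def occurs_at(m, s, loc):
    """True if motif m matches s at offset loc ('.' matches any character)."""
    if loc < 0 or loc + len(m) > len(s):
        return False
    return all(mc == DONT_CARE or mc == s[loc + j] for j, mc in enumerate(m))


def location_list(m, s):
    """All offsets of s at which m occurs."""
    return [l for l in range(len(s) - len(m) + 1) if occurs_at(m, s, l)]


def is_k_motif(m, s, K):
    # First and last characters must be solid (from the alphabet).
    if not m or m[0] == DONT_CARE or m[-1] == DONT_CARE:
        return False
    locs = location_list(m, s)
    # The motif must occur at least K times.
    if len(locs) < K:
        return False
    # Each don't-care position must see two different characters.
    for j, mc in enumerate(m):
        if mc == DONT_CARE and len({s[l + j] for l in locs}) < 2:
            return False
    return True


print(is_k_motif("ab.d", "abcdabed", 2))
```

In the usage line, `ab.d` occurs at offsets 0 and 4 and the don't-care position sees `c` and `e`, so all three conditions hold.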

12
Definition 2
  • Maximal Motif
  • Let p1, p2, ..., pk be all the motifs in the
    sequence s. A motif pi is maximal if and only if
  • there is no pj ($j \ne i$) such that pi is a
    substring of pj, or
  • if pi is a substring of pj, then there exists at
    least one occurrence of pi in s that is not
    covered by pj in s

13
Definition 3
  • Redundant motif
  • A maximal motif m, with location list $L_m$, is
    redundant if there exist maximal motifs $m_i$, $1 \le i \le p$,
    such that $L_m = L_{m_1} \cup L_{m_2} \cup \cdots \cup L_{m_p}$
  • A motif that is not redundant is called an
    irredundant motif
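
Redundancy can be tested by comparing location lists. The sketch below assumes that "redundant" means the location list of m equals the union of the location lists of some other motifs, and it encodes the characters c1, c2, c3 of Example 1 as the digits 1, 2, 3. It also assumes the first motif of that example is `a...b` (a, three don't-cares, b). Helper names are illustrative.

```python
def location_list(m, s):
    """Offsets at which motif m occurs in s ('.' matches any character)."""
    return [
        l for l in range(len(s) - len(m) + 1)
        if all(mc == "." or mc == s[l + j] for j, mc in enumerate(m))
    ]


def is_redundant(m, others, s):
    """True if the location list of m is a union of other motifs' lists."""
    target = set(location_list(m, s))
    covered = set()
    for other in others:
        locs = set(location_list(other, s))
        # Only lists contained in the target can take part in such a union.
        if other != m and locs <= target:
            covered |= locs
    return covered == target


s = "a123baX23bYa1X3bYYa12Xb"  # Example 1 with c1, c2, c3 written as 1, 2, 3
print(location_list("a...b", s))
print(is_redundant("a...b", ["a..3b", "a.2.b", "a1..b"], s))
```

Taking all lists that are contained in the target is enough: if any sub-collection has the target as its union, so does the collection of all contained lists.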

14
Example 1
  • Let the input string s have the following form:
  • ac1c2c3baXc2c3bYac1Xc3bYYac1c2Xb
  • Split at each occurrence of a:
    ac1c2c3b | aXc2c3bY | ac1Xc3bYY | ac1c2Xb
  • a...b
  • a..c3b
  • a.c2.b
  • ac1..b
  • a.c2c3b
  • ac1.c3b
  • ac1c2.b

15
Motif Discovery
  • The motifs of interest to the sequence alignment
    problem are the irredundant motifs (Lemma 4)
  • There exists a polynomial time algorithm to
    extract them from the input
  • In the rest of the paper we use motifs that occur
    in multiple sequences, i.e., occur in each
    sequence exactly once.
  • If there exists a motif that occurs more than
    once in a sequence, then for the arguments that
    follow, each occurrence is treated as a distinct
    motif

16
Lemma 4
  • Lemma 4: if p is a redundant motif, then using
    the motif p does not improve the cost of the
    K-MSA optimization problem
  • Proof. Let p be rendered redundant by motifs p1,
    p2, ..., pn, $n \ge 1$. By definition, motif p has
    fewer solid characters than each pi, $1 \le i \le n$.
    Thus if an alignment can use motif p, it can
    certainly use all the motifs p1, p2, ..., pn, giving
    a larger number of solid characters and thus a
    higher cost for the K-MSA optimization problem

17
Stage 2 Sequence Alignment
  • Definition 4
  • Consider two irredundant motifs pi and pj such that
    some sequence s contains both of them
  • Let ni and nj be the sizes of the motifs pi and
    pj and let li and lj be their locations (offsets)
    in the sequence s, respectively
  • The motifs overlap if the intervals $[l_i, l_i + n_i]$
    and $[l_j, l_j + n_j]$ have a non-empty intersection

18
Definition 5
  • Pairwise compatible motifs: two motifs, p1 and
    p2, are pairwise feasible if there exists an
    alignment of the sequences that does not
    introduce gaps in the motifs p1 and p2

19
Lemma 1
  • Two irredundant motifs pi and pj are pairwise
    feasible if and only if neither of the following
    mismatches occurs
  • (domain crossing mismatch) pi and pj do not
    overlap in any sequence, yet pi is to the left of
    pj in one sequence and to the right in another
  • (overlap mismatch) pi and pj overlap in some
    sequence, yet pi is not at the same fixed distance
    d to the left of pj in every sequence containing both
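
Lemma 1 can be read as a test on the offsets of the two motifs in the sequences they share. A minimal sketch, assuming each motif is described by a dict from sequence id to 0-based offset plus its length, that overlap is tested on half-open intervals [l, l + n), and that the lemma means "same relative order everywhere when the motifs never overlap, same fixed distance everywhere when they overlap somewhere". The function names are hypothetical.

```python
def overlaps(li, ni, lj, nj):
    """True if [li, li + ni) and [lj, lj + nj) intersect."""
    return li < lj + nj and lj < li + ni


def pairwise_feasible(pos_i, n_i, pos_j, n_j):
    """pos_i, pos_j: dicts mapping a sequence id to the motif's offset."""
    shared = set(pos_i) & set(pos_j)
    if not shared:
        return True  # the motifs never meet, nothing can go wrong
    offsets = {pos_j[s] - pos_i[s] for s in shared}
    if any(overlaps(pos_i[s], n_i, pos_j[s], n_j) for s in shared):
        # Overlapping motifs must sit at one fixed distance everywhere.
        return len(offsets) == 1
    # Non-overlapping motifs must keep the same left-to-right order.
    return all(d > 0 for d in offsets) or all(d < 0 for d in offsets)


# Motifs PCQ and SAG from the three-sequence example later in the talk.
print(pairwise_feasible({1: 2, 2: 2, 3: 1}, 3, {1: 6, 3: 4}, 3))
```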

20
Mismatch
[Figure: Motif A and Motif B, panels (i) and (ii), showing overlap mismatches and domain-crossing mismatches]
21
Definition 6
  • Motif alignment: given a set M of motifs, a motif
    alignment of the sequences s1, s2, ..., sm is an
    alignment in which the motifs in M are aligned in
    all the sequences they appear in, without breaking
    any motif
  • Feasible set: if such an alignment exists, the
    set is called a feasible set

22
Definition 7
  • Linear ordering of motifs given a set of
    feasible motifs, a consistent ordering of the
    motifs such that in every sequence, the set of
    motifs that are present in the sequence appear in
    the left to right order is called the linear
    ordering

23
Example 2
  • H I A J G L B
  • A M C N B Q D
  • C O P E D R F
  • H I S J G E T U F
  • H I V J G
  • (i) A...B in sequences 1 and 2
  • (ii) C...D in sequences 2 and 3
  • (iii) E..F in sequences 3 and 4
  • (iv) HI.JG in sequences 1, 4 and 5

24
Example 2
  • (iv) (i) (ii) (iii)
  • H I A J G L B -- -- -- --
  • -- -- A M C N B Q D -- --
  • -- -- -- -- C O P E D R F
  • H I S J G -- -- E T U F
  • H I V J G -- -- -- -- -- --
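
One way to look for a linear ordering is a topological sort over "is to the left of" constraints. A minimal sketch, assuming each motif is given with its 0-based offset in every sequence that contains it (the offsets below were read off the five sequences of Example 2, with the first two motifs taken to be A...B and C...D). `linear_ordering` is a hypothetical helper, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def linear_ordering(positions):
    """positions: motif -> {sequence id: offset}. Return an order or None."""
    ts = TopologicalSorter()
    motifs = list(positions)
    for m in motifs:
        ts.add(m)
    for a in motifs:
        for b in motifs:
            if a == b:
                continue
            shared = set(positions[a]) & set(positions[b])
            # If a starts before b in some shared sequence, a must precede b.
            if any(positions[a][s] < positions[b][s] for s in shared):
                ts.add(b, a)
    try:
        return list(ts.static_order())
    except CycleError:
        return None  # contradictory constraints: no linear ordering


example2 = {
    "A...B": {1: 2, 2: 0},
    "C...D": {2: 2, 3: 0},
    "E..F": {3: 3, 4: 5},
    "HI.JG": {1: 0, 4: 0, 5: 0},
}
print(linear_ordering(example2))
```

For this input the constraints form a single chain, so the ordering is unique and starts with HI.JG, the same left-to-right order as the blocks in the alignment above.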

25
Example 3
  • H I A J G L B -- -- -- --
  • -- -- A M C N B Q D -- --
  • -- -- -- -- C O P E D R F
  • -- -- -- -- -- H I E J G F
  • -- -- -- -- -- H I K J G --
  • No alignment due to overlap mismatch
  • There is no linear ordering of the motifs

26
Example 4
  • A G H D X Y
  • A I C D J F
  • X Y C D K F
  • A..D in sequences 1 and 2
  • CD.F in sequences 2 and 3
  • XY in sequences 1 and 3
  • linear order of the motifs is (i), (iii), (ii)
  • no alignment due to crossing mismatch

27
Definition 8
  • Domain crossing error: given a set of motifs m1,
    m2, ..., mn, a domain crossing error is said to
    occur if there exists a linear ordering of the
    motifs mi1, mi2, ..., min, yet there exists no
    alignment that respects all the n motifs

28
Lemma 2
  • A set of irredundant motifs p1, p2, ..., pn is
    feasible if and only if none of the following
    holds
  • There exist distinct motifs pi and pj such that
    pi and pj are pairwise infeasible
  • There exists a non-empty subset of the motifs
    without a linear ordering
  • There exists a non-empty subset of the motifs
    that demonstrate domain crossing error

29
The Graph-theoretic Formulation
  • Construct a directed graph G = (V, E) where every
    motif pi corresponds to a vertex vi, thus n = |V|.
    The directed edges are introduced as follows
  • There is no edge between two vertices whose
    corresponding motifs never occur together in
    any sequence
  • If pi is to the left of pj in every sequence in
    which both motifs are present, then a directed edge
    is placed from vi to vj
  • The edges are labeled as follows
  • Forbidden if the motifs are not pairwise
    feasible
  • Overlap if the motifs corresponding to v1 and v2
    overlap
  • Non-overlap if the motifs do not overlap
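
The construction can be sketched in code. The version below rests on several assumptions that go beyond the slide: each motif occurs at most once per sequence, overlap is tested on half-open intervals, the edge weight is the minimum offset difference over the shared sequences, and a pair whose order differs between sequences gets a forbidden edge. The data are the three sequences and motifs of Example 4; function names are illustrative.

```python
def find_motif(motif, seq):
    """Offset of the first occurrence of motif in seq ('.' matches anything)."""
    for l in range(len(seq) - len(motif) + 1):
        if all(m == "." or m == seq[l + j] for j, m in enumerate(motif)):
            return l
    return None


def build_graph(motifs, sequences):
    """Return {(u, v): (label, weight)} for motifs that share a sequence."""
    pos = {m: {} for m in motifs}
    for m in motifs:
        for i, seq in enumerate(sequences):
            l = find_motif(m, seq)
            if l is not None:
                pos[m][i] = l
    edges = {}
    for x, a in enumerate(motifs):
        for b in motifs[x + 1:]:
            shared = sorted(set(pos[a]) & set(pos[b]))
            if not shared:
                continue  # the two motifs never occur together: no edge
            u, v = a, b
            offsets = [pos[b][i] - pos[a][i] for i in shared]
            if all(d <= 0 for d in offsets) and any(d < 0 for d in offsets):
                u, v = b, a  # b is the left motif, so the edge goes b -> a
                offsets = [-d for d in offsets]
            overlap = any(
                pos[a][i] < pos[b][i] + len(b) and pos[b][i] < pos[a][i] + len(a)
                for i in shared
            )
            crossing = any(d < 0 for d in offsets)  # order differs by sequence
            if crossing or (overlap and len(set(offsets)) > 1):
                label = "forbidden"
            elif overlap:
                label = "overlap"
            else:
                label = "non-overlap"
            edges[(u, v)] = (label, min(offsets))
    return edges


sequences = ["AGHDXY", "AICDJF", "XYCDKF"]
for edge, info in build_graph(["A..D", "CD.F", "XY"], sequences).items():
    print(edge, info)
```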

30
Handling Domain Crossing Mismatches - intro
  • Introduce the notion of a graph that is consistent
    w.r.t. a vertex.
  • First, define distance on edges.
  • D(v1, v2) = minimum distance between v1 and v2
    over every sequence in which both of them appear.

31
Consistent Graph w.r.t. a Vertex
  • G = (V, E); each edge (u,v) has a label in
    {forbidden, overlap, non-overlap} and a weight D(u,v)
  • Definition
  • Valid path: contains no forbidden edge
  • Overlap-path: all edges in the path are
    overlap
  • Weight of a path p: $D(p) = \sum_{e \in p} D(e)$

32
Consistent Graph w.r.t. a Vertex
  • Let $p \in V$. The graph is consistent w.r.t. p if
  • $\forall q \in V$, for every pair of vertex-disjoint valid
    paths from p to q, P1 and P2,
  • 1. $D(P_1) = D(P_2)$, if P1 and P2 are both
    overlap-paths, or,
  • 2. $D(P_1) \ge D(P_2)$, if P1 is an overlap-path and P2 is
    not.

33
An Example
  • Consider a previous example.
  • AGHDXY
  • AICDJF
  • XYCDKF
  • A..D
  • CD.F
  • XY

P1: v1 -> v2, D = 2
P2: v1 -> v3 -> v2, D = 6
Not consistent!
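
The check on this example can be sketched as follows. Two assumptions are made here: the individual weights of v1 -> v3 (4) and v3 -> v2 (2) are derived from the three sequences (the slide only gives their sum, 6), and an overlap-path is taken to conflict with a non-overlap path when its weight is smaller, since an overlap fixes the distance while gaps can only stretch the other path. Function names are illustrative.

```python
from itertools import combinations

EDGES = {
    ("v1", "v2"): ("overlap", 2),       # A..D -> CD.F
    ("v1", "v3"): ("non-overlap", 4),   # A..D -> XY (derived weight)
    ("v3", "v2"): ("non-overlap", 2),   # XY -> CD.F (derived weight)
}


def simple_paths(edges, src, dst, visited=()):
    """Yield each simple valid path from src to dst as a list of edges."""
    if src == dst:
        yield []
        return
    visited = visited + (src,)
    for (u, v), (label, _) in edges.items():
        if u == src and v not in visited and label != "forbidden":
            for rest in simple_paths(edges, v, dst, visited):
                yield [(u, v)] + rest


def consistent_wrt(edges, p):
    vertices = {x for e in edges for x in e}
    for q in vertices - {p}:
        for P1, P2 in combinations(list(simple_paths(edges, p, q)), 2):
            # Skip pairs that share an internal vertex.
            if {v for _, v in P1[:-1]} & {v for _, v in P2[:-1]}:
                continue
            d1 = sum(edges[e][1] for e in P1)
            d2 = sum(edges[e][1] for e in P2)
            o1 = all(edges[e][0] == "overlap" for e in P1)
            o2 = all(edges[e][0] == "overlap" for e in P2)
            if o1 and o2 and d1 != d2:
                return False
            if o1 and not o2 and d1 < d2:
                return False
            if o2 and not o1 and d2 < d1:
                return False
    return True


print(consistent_wrt(EDGES, "v1"))
```

With the weights above, the overlap-path has weight 2 and the other path weight 6, so the graph is reported as not consistent w.r.t. v1, in line with the slide.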
34
Graph Consistent v.s. Domain Crossing
Mismatches
  • Use the graph consistency property to deal with the
    domain crossing mismatch problem.
  • If the induced graph is consistent w.r.t. every
    vertex, then there are no domain crossing
    mismatches.
  • Moreover, it may rule out overlap mismatches.

35
Lemma Rewritten
  • Given a subset of vertices (motifs) v1, ..., vn,
    construct the graph as previously defined. The
    induced subgraph on v1, ..., vn is feasible if the
    following hold
  • 1. there is no edge labeled forbidden in the
    subgraph,
  • 2. the induced subgraph is acyclic, and,
  • 3. the subgraph is consistent w.r.t. every vi.
36
Algorithm
  • Idea: turn an infeasible set into a feasible set

37
Algorithm
  • Detect the basic infeasible subsets.
  • Eliminate motifs to obtain a feasible set that
    maximizes the cost.
  • Render the alignment.

38
Algorithm
  • Before we construct the basic infeasible sets, we
    have one more definition
  • Given a graph G and two cycles C1 and C2 on G,
    if all vertices defining C1 are also in C2, then
    C2 is redundant with respect to C1.

39
Algorithm Step1
  • Compute the following sets
  • 1. Fi contains the 2 endpoints of the i-th
    forbidden edge.
  • 2. Cj contains the vertices that form a directed
    non-redundant cycle in the graph.
  • 3. Pk contains the vertices that form a
    non-redundant closed path in the graph.

40
Algorithm Step1
  • The total number of basic infeasible sets
    obtained in Step 1 is bounded.
  • More precisely, there are no more than $(nN)^6$.
  • It is easy to prove, since the number is
    bounded by the number of cycles or closed paths.

41
Algorithm Step2
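  • Mapping to the SETCOVER problem.
  • $\{v_1, v_2, \ldots, v_n\} = F_1 \cup \cdots \cup F_{n_f} \cup C_1 \cup \cdots \cup C_{n_c} \cup P_1 \cup \cdots \cup P_{n_p}$
  • $U = \{F_1, \ldots, F_{n_f}, C_1, \ldots, C_{n_c}, P_1, \ldots, P_{n_p}\}$
  • $S_i = \{F_l \mid v_i \in F_l,\ 1 \le l \le n_f\} \cup
    \{C_l \mid v_i \in C_l,\ 1 \le l \le n_c\} \cup
    \{P_l \mid v_i \in P_l,\ 1 \le l \le n_p\}$
  • cost_i is determined by the number of solid
    characters in motif_i and the number of sequences
    containing motif_i
  • e.g. AB..CD has 4 solid characters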

42
Algorithm Step3
  • Find feasible subgraph containing only overlap
    edges.
  • For each sequence, find an ordering of motifs in
    that subgraph, say p1, p2, ..., pj.
  • Align sequences.

43
Example
  • A set of 3 sequences
  • GFPCQFSAG
  • GFPCQFSGG
  • GPCQSAGK
  • K = 2

44
Example
  • Irredundant motifs (By Teiresias)
  • PCQ in Seq. 1,2,3
  • GFPCQFS.G in Seq. 1,2
  • PCQ..G in Seq. 2,3
  • SAG in Seq. 1,3
  • GFPCQFSAG
  • GFPCQFSGG
  • GPCQSAGK

45
Example
  • Sequences to Graph

[Figure: graph on motif vertices 1, 2, 3, 4 with edge labels 2 (O.V.), 0 (O.V.), 2 (O.V.), 3 (non.), 3 (O.V.), 6 (O.V.); O.V. = overlap, non. = non-overlap]
46
Example
  • Basic infeasible sets
  • No F
  • No C
  • Closed path sets P:
  • P1 = {2, 1, 4}
  • P2 = {1, 2, 3}
  • P3 = {1, 3, 4}
  • P4 = {2, 3, 4}

47
Example
  • Make a feasible set via the set cover (S.C.) problem
  • G = {P1, P2, P3, P4}
  • S1 = {P1, P2, P3}, w1 = 3
  • S2 = {P1, P2, P4}, w2 = 9
  • S3 = {P2, P3, P4}, w3 = 6
  • S4 = {P1, P3, P4}, w4 = 3
  • Solution: {S1, S4}
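
The instance on this slide is small enough to run through the classical greedy heuristic for weighted set cover (pick the set with the lowest weight per newly covered element until everything is covered). A minimal sketch, assuming ties are broken by dictionary order; the function name is illustrative, and this is the textbook greedy rule, not necessarily the exact procedure used in the paper:

```python
def greedy_set_cover(universe, subsets, weights):
    """Return the names of the chosen subsets, in the order they were picked."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Cheapest set per newly covered element; raises ValueError if
        # the remaining elements cannot be covered at all.
        best = min(
            (name for name in subsets if subsets[name] & uncovered),
            key=lambda name: weights[name] / len(subsets[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


ground = {"P1", "P2", "P3", "P4"}
subsets = {
    "S1": {"P1", "P2", "P3"},
    "S2": {"P1", "P2", "P4"},
    "S3": {"P2", "P3", "P4"},
    "S4": {"P1", "P3", "P4"},
}
weights = {"S1": 3, "S2": 9, "S3": 6, "S4": 3}
print(greedy_set_cover(ground, subsets, weights))
```

On this input the heuristic picks S1 and then S4, for a total weight of 6, which matches the solution shown on the slide. Choosing S1 and S4 corresponds to eliminating motifs 1 and 4.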

48
Example
  • Get aligned blocks from overlap graphs
  • GFPCQFS.G
  • PCQ..G

[Figure: overlap graph with vertices 2 and 3 joined by an edge labeled 2 (O.V.)]
49
Example
  • Result ---
  • G F P C Q F S a G
  • G F P C Q F S G G
  • - g P C Q s a G K

50
Demonstration
  • IBM Bioinformatics Group
  • http://cbcsrv.watson.ibm.com/Tmsa.html

51
Approximation Analysis
  • Q1 Is it polynomial time?
  • Recall this algorithm---
  • 1. Convert to graph.
  • 2. Detect the basic infeasible subsets.
  • 3. Eliminate motifs to obtain a feasible set that
    maximizes the cost.
  • 4. Render the alignment.

52
AnalysisQ1
  • Step 1: convert to graph
  • one irredundant motif -> one vertex
  • one relation between two irredundant motifs ->
    one edge
  • The number of irredundant motifs is linear in the
    input size (by Par98)

53
AnalysisQ1
  • Step2--- Compute the following sets
  • 2.1 Fi contains the 2 endpoints of i-th
    forbidden edge.
  • Simply scan all the edges and collect the
    end points of the edges labeled Forbidden.
  • In G = (V, E), $|V| = n$
  • $|E| \le n(n-1)/2 = O(n^2)$

54
AnalysisQ1
  • 2.2 Cj contains vertices that form a
    directed non-redundant cycle in the graph.
  • Do DFS of the graph to capture all the
    cycles. (If you do a DFS, you have a cycle iff
    you have a back edge. )
  • Don't need to traverse along a whole path.
  • DFS is polynomial.
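
The back-edge criterion mentioned above can be sketched with the usual three-colour DFS. This version only reports whether a directed cycle exists; collecting the non-redundant cycles themselves, as Step 2.2 requires, would need extra bookkeeping that is not shown. The graph is assumed to be a dict from vertex to a list of successors, and `has_cycle` is an illustrative name.

```python
def has_cycle(graph):
    """Detect a directed cycle: DFS finds one iff it meets a back edge."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {v: WHITE for v in graph}

    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY:
                return True  # back edge to a vertex still on the stack
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in list(graph))


print(has_cycle({1: [2], 2: [3], 3: [1]}))
print(has_cycle({1: [2, 3], 2: [3], 3: []}))
```

Each vertex and edge is handled once, so the running time is O(|V| + |E|).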

55
AnalysisQ1
  • 2.3 Pk contains the vertices that form a
    non-redundant closed path in the graph.
  • Do a BFS rooted at each vertex. This search is
    also partial, as in the last step.
  • n BFS runs are still polynomial.

56
AnalysisQ1
  • Step3--- Mapping to SETCOVER problem.
  • Use V. Chvátal, "A greedy heuristic for the
    set-covering problem," Math. Oper. Res.,
    4:233-235, 1979

57
AnalysisQ1
  • Step4---
  • Find a polynomial algorithm for each block.
  • After summing up these steps, it's still
    polynomial.

58
Approximation Analysis
  • Q2: Is it feasible?
  • Yes. Because of the blocking alignment, the
    solution is an alignment which doesn't break any
    needed motifs.

59
Approximation Analysis
  • Q3: What is the approximation factor?
  • Using the result of Chv79, the set cover
    problem is approximable within $1 + \ln X$, where X
    is the number of elements of the ground set.
    We have shown that the maximum number of
    non-redundant basic infeasible sets is no more
    than $(nN)^6$. Thus $X \le (nN)^6$. The approximation
    factor is $1 + 6\ln(nN)$.

60
Conclusion
  • Two stages: one identifies the local
    similarities and the other aligns the
    similarities appropriately.
  • We don't think it's a good approximation.

61
MAX SNP
  • Definition
  • The class of problems having constant-factor
    approximation algorithms, but no approximation
    schemes unless P = NP.
  • In other words, no PTAS exists.
  • Fact Maximum Cut is MAX SNP hard

62
Proof Procedure
  • First, the paper proves that the max version of
    K-MSA is MAX SNP-hard.
  • Second, the paper proves that K-MSA is MAX
    SNP-hard

63
MAX Version K-MSA
  • Define
  • K-MSA (max)

64
Proof Procedure(2)
  • The procedure to prove K-MSA(max) is MAX SNP hard
  • Give a reduction from Maximum Cut to Bipartite
    Maximum Cut
  • Then give another reduction from BMC to K-MSA(max)

65
Maximum Cut
  • Given: a graph (V, E)
  • Output: two vertex sets V1, V2
  • Maximize: the cardinality of the cut, i.e., the
    number of edges with one endpoint in V1 and one
    endpoint in V2.
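
The objective can be made concrete on tiny graphs with an exhaustive search. This is only an illustration of the problem statement: the search is exponential in the number of vertices and has nothing to do with the reduction discussed in the talk. Function names are illustrative.

```python
from itertools import product


def cut_size(edges, side):
    """Number of edges whose two endpoints lie on different sides."""
    return sum(1 for u, v in edges if side[u] != side[v])


def max_cut_brute_force(vertices, edges):
    """Try every assignment of vertices to V1 (0) or V2 (1)."""
    best_size, best_side = -1, None
    for bits in product((0, 1), repeat=len(vertices)):
        side = dict(zip(vertices, bits))
        size = cut_size(edges, side)
        if size > best_size:
            best_size, best_side = size, side
    return best_size, best_side


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(max_cut_brute_force(["a", "b", "c"], triangle)[0])
```

For a triangle at most two of the three edges can cross the cut, while an even cycle can have all of its edges cut.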

66
Bipartite Maximum Cut
  • Given: a graph (Vtop, Vdown, E),
  • every e in E runs between Vtop and Vdown
  • value(e) = 1 or -1
  • Output: two vertex sets V1, V2
  • Maximize: the cardinality of the cut, i.e., the
    number of edges with one endpoint in V1 and one
    endpoint in V2.

67
Proof Procedure(3)
Any instance of the Maximum Cut problem
-> some instance of the Bipartite Maximum Cut problem
-> some instance of the K-MSA(max) problem
  • Whole procedure from MC to K-MSA(max)
  • some trouble in the middle stage

68
Procedure Overview(1)
  • MC to BMC
  • Given a cut of BMC , a cut of MC is constructed

69
Procedure Overview(2)
  • BMC to K-MSA(max)

70
Procedure Overview(3)
  • If we have those two reductions, then
  • assuming that we have a PTAS for K-MSA(max)
  • =>
  • we will have a PTAS for Maximum Cut!

71
(No Transcript)
72
MC to BMC
  • A basic idea for MC to BMC
  • Split a vertex into top and bottom
  • Much more complex because of K-MSA
  • Given a graph, construct an instance of BMC. For
    each vi, construct $3d_i + 6$ vertices.

73
MC to BMC (2)
74
MC to BMC(3)
75
MC to BMC(4)
76
MC to BMC(5)
  • So complicated? Don't worry
  • Main vertex
  • only these points connect to the outside
  • only negative edges
  • only positive edges
  • Balance to zero

77
Construct the Cut of MC
  • Define

78
Construct the Cut of MC(2)
  • In any cut of BMC , we can modify the cut such
    that
  • without decreasing the cost.
  • Then we assign vi ( in MC) according to
    ffff .In other words, if
    then
  • vi belongs to S1(in MC) ,too

79
Construct the Cut of MC(3)
  • Cost(BMC) = cost_in_square + cost_out_square
  • Total cost_in_square = $\sum_i (3d_i + 2)$,
    summed over each vertex $v_i$ in MC,
    $= 6e + 2n$

80
Construct the Cut of MC(4)
  • Cost_out_square is the sum of two costs
  • Assume we will have cost(MC) = K
  • the first cost totals 2K
  • the second cost totals $\sum_i (1 - (d_i - c_i))$
  • $c_i$ is the number of cut edges incident to $v_i$

81
Construct the Cut of MC(5)
  • Cost(BMC) = $6e + 2n + 2K + \sum_i (1 - (d_i - c_i)) = 3n + 4e + 4K$
    $= 3n + 4e + 4\,\mathrm{cost(MC)}$
  • $4\,\mathrm{cost(MC)} = \mathrm{cost(BMC)} - 3n - 4e$
  • Claim: MC has solution K iff BMC has solution
    $4K + 4e + 3n$
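
The arithmetic behind this claim can be checked step by step. The derivation below is a sketch under stated assumptions: $d_i$ is the degree of $v_i$ (so $\sum_i d_i = 2e$), $c_i$ is the number of cut edges at $v_i$ (so $\sum_i c_i = 2K$), and the signs that the transcript dropped are chosen so that both ends of the chain agree.

```latex
\begin{align*}
\sum_i \bigl(1 - (d_i - c_i)\bigr) &= n - 2e + 2K \\
\mathrm{cost(BMC)} &= (6e + 2n) + 2K + (n - 2e + 2K) \\
                   &= 3n + 4e + 4K
\end{align*}
```

Under these assumptions a cut of size K in MC corresponds to a BMC cost of $3n + 4e + 4K$, which is the relation used in the claim.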

82
BMC to K-MSA(max)
  • I can't understand this part very well.
  • The paper lacks definitions of some terms.
  • There may be an error in this part?
  • In the thesis, BSC problem

83
BMC to K-MSA(max)
  • Make a reduction from BMC to a restricted MSA
    problem.
  • Every sequence has the same length.
  • A gap can only be inserted at the left end or the
    right end.
  • Only one gap can be inserted.

84
Construct KSA problem
85
Construct KSA problem
86
Construct the cut of BMC
  • Key observation
  • Every edge link to ui is ????
  • If ui changes from S1 to S2, then
  • the cost will shift!

87
Construct the cut of BMC
  • If row i is right-aligned, then the
    corresponding ui belongs to S1, else S2.
  • If one column j has more one than column j 1,
    then corresponding vi belong to S1, else S2 (not
    sure)
  • (??)Cluster negative edge in one Set.

88
Complain
  • The structure of the paper is hard to read and
    understand.
  • Some proofs are skipped.
  • Some information is missing.
  • But to appear in JCO?
  • But the reduction is interesting.