Title: An Approx. Algo. For Alignment MSA using Motif Discovery
1An Approx. Algo. For Alignment MSA using Motif
Discovery
- B.J. Chen
- M.W. Chang
- C.H. Tsai
- H.Y. Chuang
- NTU
2MSA
- Multiple Sequence Alignment
- Given N sequences, align them, possibly with gaps, so as to bring out the best commonality of the N sequences.
- An alignment is usually scored by penalizing mismatches and gaps and rewarding matches.
3Previous works
- Alt89, CHM94, CL88, GKS95, etc. work best for small values of N (about 2-6).
- ZZ96 and Vih98 handle the case N > 2 by first applying pairwise alignment techniques.
4Motivation of this work
- For larger values of N, we need additional constraints to obtain a biologically meaningful alignment.
- MOTIF: a common pattern across sequences.
5Motivation of this work
- Alignment number K (2 ≤ K ≤ N)
- A user-controlled parameter that constrains the alignment to have at least K sequences agree on a character, whenever possible.
- The commonality across most of the sequences must be detected.
6Problem Description
- Align multiple strings with gaps
- Goal: bring out the best commonality
- Parameter: K, the alignment number
- The alignment should have at least K sequences agree on a character, whenever possible.
- K = 1 has no meaning
7Problem Definition
- Given N sequences; sequence i has length ni
- Let the matrix A be an alignment; the size of A is N x T (T is the length after alignment, T ≥ ni for every i)
- Let Ai be the i-th row of A; removing all gaps from Ai gives back the original sequence i
8Problem Definition(2)
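- Define EA(a,i,j) = 1 if A[i,j] = a
- EA(a,i,j) = 0 otherwise
- count(a,j) = Σ_i EA(a,i,j), the number of rows that have a in column j
- a in column j is bad if count(a,j) < K
- Define the badness of column j as Σ count(a,j), summed over all characters a with count(a,j) > 0 and count(a,j) < K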
9Problem Definition(3)
A A B B B B C C
X X O O O O X X
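A minimal sketch of the badness measure, assuming gaps are written as "-" and are ignored when counting (the slides do not say how gaps are treated). The function names are hypothetical.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; not fixed by the slides


def column_badness(column, k):
    """Sum the counts of the characters that occur fewer than k times in a column."""
    counts = Counter(ch for ch in column if ch != GAP)  # gaps ignored (assumption)
    return sum(c for c in counts.values() if 0 < c < k)


def alignment_badness(rows, k):
    """Total badness of an alignment given as equal-length strings, one per sequence."""
    return sum(column_badness(col, k) for col in zip(*rows))


# The column A A B B B B C C with K = 3: A and C are bad, B is fine.
example = column_badness("AABBBBCC", 3)
```

With K = 3 the two A's and the two C's are bad and the four B's are not, which matches the X/O marks in the column above.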
10Problem Definition(4)
- Minimize the total badness
- K-MSA
- The paper gives a proof that this problem is MAX SNP-hard (I'll report on this part later)
11Stage 1 Motif Discovery
- We begin by defining a motif in a sequence
- Given a string s on alphabet Σ and an integer K, 2 ≤ K ≤ |s|
- Definition 1: K-motif
- A string m on Σ ∪ {'.'} is a K-motif with location list Lm = (l1, l2, ..., lp) if all of the following hold:
- m[0] and m[|m|-1] belong to Σ
- p ≥ K
- For every don't-care character at position j in m, there exist at least two distinct occurrences li1 and li2, 1 ≤ i1, i2 ≤ p, such that s[li1 + j] ≠ s[li2 + j]
12Definition 2
- Maximal Motif
- Let p1, p2, ..., pk be all the motifs in the sequence s. A motif pi is maximal if and only if
- there is no pj (j ≠ i) such that pi is a substring of pj, or
- if pi is a substring of pj, then there exists at least one occurrence of pi in s that is not covered by pj in s
13Definition 3
- Redundant motif
- A maximal motif m, with location list Lm, is redundant if there exist maximal motifs mi, 1 ≤ i ≤ p, such that Lm = Lm1 ∪ Lm2 ∪ ... ∪ Lmp
- A motif that is not redundant is called an irredundant motif
14Example 1
- Let the input string s have the following form
- ac1c2c3baXc2c3bYac1Xc3bYYac1c2Xb
- split into segments: ac1c2c3b aXc2c3bY ac1Xc3bYY ac1c2Xb
- a...b
- a..c3b
- a.c2.b
- ac1..b
- a.c2c3b
- ac1.c3b
- ac1c2.b
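The motifs listed in Example 1 can be checked mechanically. The sketch below assumes the location list of a motif is the list of all offsets where it matches, and treats c1, c2, c3 as single symbols by tokenizing the string. The helper names are hypothetical.

```python
DONT_CARE = "."


def occurrences(motif, s):
    """Return every offset at which motif matches s; '.' matches any symbol."""
    m = len(motif)
    return [
        l for l in range(len(s) - m + 1)
        if all(p == DONT_CARE or p == s[l + j] for j, p in enumerate(motif))
    ]


def is_k_motif(motif, s, k):
    """Check the three K-motif conditions against all occurrences of motif in s."""
    if motif[0] == DONT_CARE or motif[-1] == DONT_CARE:
        return False  # first and last positions must be solid characters
    locs = occurrences(motif, s)
    if len(locs) < k:
        return False  # at least k occurrences are required
    # every don't-care position must see at least two different symbols
    return all(
        len({s[l + j] for l in locs}) >= 2
        for j, p in enumerate(motif) if p == DONT_CARE
    )


# Example 1 string, tokenized so that c1, c2, c3 are single symbols
s = "a c1 c2 c3 b a X c2 c3 b Y a c1 X c3 b Y Y a c1 c2 X b".split()
locs_long = occurrences("a . . . b".split(), s)
locs_c3 = occurrences("a . . c3 b".split(), s)
```

Here the motif with three don't-cares matches in all four segments, while a..c3b misses the last segment, where X replaces c3.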
15Motif Discovery
- The motifs of interest to the sequence alignment problem are the irredundant motifs (Lemma 4)
- There exists a polynomial-time algorithm to extract them from the input
- In the rest of the paper we use motifs that occur in multiple sequences, i.e., occur exactly once in each sequence that contains them.
- If a motif occurs more than once in a sequence, then for the arguments that follow, each occurrence is treated as a distinct motif
16Lemma 4
- Lemma 4: if p is a redundant motif, then using the motif p does not improve the cost of the K-MSA optimization problem
- Proof. Let p be rendered redundant by motifs p1, p2, ..., pn, n ≥ 1. By definition, motif p has fewer solid characters than each pi, 1 ≤ i ≤ n. Thus if an alignment can use motif p, it can certainly use all the motifs p1, p2, ..., pn, giving a larger number of solid characters and thus a higher cost for the K-MSA optimization problem
17Stage 2 Sequence Alignment
- Definition 4
- Consider two irredundant motifs pi and pj such that there exists a sequence s containing both motifs
- Let ni and nj be the sizes of the motifs pi and pj, and let li and lj be their locations (offsets) in the sequence s, respectively
- The motifs overlap if the intervals [li, li + ni] and [lj, lj + nj] have a non-empty intersection
p1
p2
18Definition 5
- Pairwise compatible motifs: two motifs, p1 and p2, are pairwise feasible if there exists an alignment of the sequences that does not introduce gaps in the motifs p1 and p2
p1
p2
19Lemma 1
- Two irredundant motifs pi and pj are pairwise feasible if and only if both of the following hold (a violation of the first is a domain crossing mismatch, a violation of the second is an overlap mismatch)
- if pi and pj do not overlap in any sequence, then pi is to the left of pj in every sequence containing both, without loss of generality
- if pi and pj overlap in any sequence, then pi is at the same fixed distance d to the left of pj in every sequence containing both, without loss of generality
20Mismatch
- Figure: Motif A and Motif B, showing overlap mismatches, cases (i) and (ii), and domain-crossing mismatches
21Definition 6
- Motif alignment: given a set M of motifs, a motif alignment of the sequences s1, s2, ..., sm is an alignment in which every motif in M is aligned, without being broken, across all the sequences in which it appears
- Feasible set: if such an alignment exists, the set M is called a feasible set
22Definition 7
- Linear ordering of motifs: given a set of feasible motifs, a consistent ordering of the motifs such that, in every sequence, the motifs present in that sequence appear in this left-to-right order is called a linear ordering
23Example 2
- H I A J G L B
- A M C N B Q D
- C O P E D R F
- H I S J G E T U F
- H I V J G
- (i) A...B in sequences 1 and 2
- (ii) C...D in sequences 2 and 3
- (iii) E..F in sequences 3 and 4
- (iv) HI.JG in sequences 1, 4 and 5
24Example 2
- (iv) (i) (ii) (iii)
- H I A J G L B -- -- -- --
- -- -- A M C N B Q D -- --
- -- -- -- -- C O P E D R F
- H I S J G -- -- E T U F
- H I V J G -- -- -- -- -- --
25Example 3
- H I A J G L B -- -- -- --
- -- -- A M C N B Q D -- --
- -- -- -- -- C O P E D R F
- -- -- -- -- -- H I E J G F
- -- -- -- -- -- H I K J G --
- No alignment due to overlap mismatch
- There is no linear ordering of the motifs
26Example 4
- A G H D X Y
- A I C D J F
- X Y C D K F
- (i) A..D in sequences 1 and 2
- (ii) CD.F in sequences 2 and 3
- (iii) XY in sequences 1 and 3
- a linear order of the motifs exists: (i), (iii), (ii)
- yet there is no alignment, due to a domain crossing mismatch
27Definition 8
- Domain crossing error: given a set of motifs m1, m2, ..., mn, a domain crossing error is said to occur if there exists a linear ordering of the motifs, mi1, mi2, ..., min, yet there exists no alignment that respects all n motifs
28Lemma 2
- A set of irredundant motifs p1, p2, ..., pn is feasible if and only if none of the following holds:
- There exist distinct motifs pi and pj such that pi and pj are pairwise infeasible
- There exists a non-empty subset of the motifs without a linear ordering
- There exists a non-empty subset of the motifs that exhibits a domain crossing error
29The Graph-theoretic Formulation
- Construct a directed graph G = (V, E) where every motif pi corresponds to a vertex vi, thus n = |V|. The directed edges are introduced as follows:
- There is no edge between two vertices whose corresponding motifs do not occur together in any sequence
- If pi is to the left of pj in every sequence in which both motifs are present, then a directed edge is placed from vi to vj
- The edges are labeled as follows:
- Forbidden: if the motifs are not pairwise feasible
- Overlap: if the motifs corresponding to the two endpoints overlap
- Non-overlap: if the motifs do not overlap
30Handling Domain Crossing Mismatches - intro
- Introduce the notion of a graph that is consistent w.r.t. a vertex.
- First, define a distance on edges.
- D(v1,v2) = the minimum distance between v1 and v2 over all sequences in which both of them appear.
31Consistent Graph w.r.t. a Vertex
- G = (V,E); each edge (u,v) has a label ∈ {forbidden, overlap, non-overlap} and a weight D(u,v)
- Definition
- Valid path: contains no forbidden edge
- Overlap-path: all edges in the path are overlap edges
- Weight of a path p: D(p) = Σ D(e), summed over the edges e in p
32Consistent Graph w.r.t. a Vertex
- Let p ∈ V. The graph is consistent w.r.t. p if
- ∀ q ∈ V, for every pair of vertex-disjoint valid paths P1 and P2 from p to q:
- 1. D(P1) = D(P2), if P1 and P2 are both overlap-paths, or
- 2. D(P1) ≥ D(P2), if P1 is an overlap-path and P2 is not.
33An Example
- Consider a previous example.
- AGHDXY
- AICDJF
- XYCDKF
- v1: A..D
- v2: CD.F
- v3: XY
- P1: v1 -> v2, D = 2
- P2: v1 -> v3 -> v2, D = 6
- Not consistent!
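The two path weights in this example can be recomputed from the motif offsets. The sketch below assumes v1 = A..D, v2 = CD.F, v3 = XY and that the overlap-path must be at least as long as the other path (the comparison symbol is lost on the slide, so this reading is an assumption). The function names are hypothetical.

```python
def min_distance(m1, m2, placements):
    """Smallest offset difference (m2 minus m1) over the sequences containing both motifs."""
    gaps = [pos[m2] - pos[m1] for pos in placements if m1 in pos and m2 in pos]
    return min(gaps)


def path_weight(path, placements):
    """D(P): sum of the edge distances along a path given as a list of motifs."""
    return sum(min_distance(u, v, placements) for u, v in zip(path, path[1:]))


# 0-based offsets of each motif in AGHDXY, AICDJF, XYCDKF
placements = [
    {"A..D": 0, "XY": 4},
    {"A..D": 0, "CD.F": 2},
    {"XY": 0, "CD.F": 2},
]

d_p1 = path_weight(["A..D", "CD.F"], placements)        # overlap-path, fixed distance
d_p2 = path_weight(["A..D", "XY", "CD.F"], placements)  # uses non-overlap edges
consistent = d_p1 >= d_p2  # assumed reading of the consistency condition
```

The intuition is that gaps can stretch the non-overlap path but not the overlap-path, so a fixed distance of 2 cannot accommodate a path that needs at least 6.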
34Graph Consistent v.s. Domain Crossing
Mismatches
- Use the graph-consistency property to deal with the domain crossing mismatch problem.
- If the induced graph is consistent w.r.t. every vertex, then there are no domain crossing mismatches.
- Moreover, it may rule out overlap mismatches.
35Lemma Rewritten
- Given a subset of vertices (motifs) v1, ..., vn, construct the graph as previously defined. The induced subgraph on v1, ..., vn is feasible if the following hold:
- 1. there is no edge labeled forbidden in the subgraph,
- 2. the induced subgraph is acyclic, and
- 3. the subgraph is consistent w.r.t. every vi.
36Algorithm
- Idea: turn an infeasible set into a feasible set
37Algorithm
- Detect the basic infeasible subsets.
- Eliminate motifs to obtain a feasible set that maximizes the cost.
- Render the alignment.
38Algorithm
- Before we construct the basic infeasible sets, we need one more definition
- Given a graph G and two cycles C1 and C2 on G, if all vertices defining C1 are also in C2, then C2 is redundant with respect to C1.
39Algorithm Step1
- Compute the following sets
- 1. Fi: contains the 2 endpoints of the i-th forbidden edge.
- 2. Cj: contains the vertices that form a directed non-redundant cycle in the graph.
- 3. Pk: contains the vertices that form a non-redundant closed path in the graph.
40Algorithm Step1
- The total number of basic infeasible sets obtained in Step 1 is bounded.
- More precisely, there are no more than (nN)^6 of them.
- This is easy to prove, since the number is bounded by the number of cycles or closed paths.
41Algorithm Step2
- Mapping to SETCOVER problem.
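- {v1, v2, ..., vn} = F1 ∪ ... ∪ Fnf ∪ C1 ∪ ... ∪ Cnc ∪ P1 ∪ ... ∪ Pnp
- U = {F1, ..., Fnf, C1, ..., Cnc, P1, ..., Pnp}
- Si = {Fl | vi ∈ Fl, 1 ≤ l ≤ nf} ∪ {Cl | vi ∈ Cl, 1 ≤ l ≤ nc} ∪ {Pl | vi ∈ Pl, 1 ≤ l ≤ np}
- cost_i is computed from the number of solid characters in motif i and the number of sequences containing motif i
- e.g. AB..CD -> 4 solid characters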
42Algorithm Step3
- Find the feasible subgraphs containing only overlap edges.
- For each sequence, find an ordering of the motifs in that subgraph, say p1, p2, ..., pj.
- Align the sequences.
- Figure: sequences with the motif blocks p1, p2 and p4
43Example
- A set of 3 sequences
- GFPCQFSAG
- GFPCQFSGG
- GPCQSAGK
- K = 2
44Example
- Irredundant motifs (By Teiresias)
- 1: PCQ in Seq. 1, 2, 3
- 2: GFPCQFS.G in Seq. 1, 2
- 3: PCQ..G in Seq. 2, 3
- 4: SAG in Seq. 1, 3
- GFPCQFSAG
- GFPCQFSGG
- GPCQSAGK
45Example
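- Figure: graph on the motif vertices 1, 2, 3, 4; edge labels 2 (O.V.), 0 (O.V.), 2 (O.V.), 3 (non.), 3 (O.V.), 6 (O.V.), where O.V. = overlap and non. = non-overlap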
46Example
- Basic infeasible sets
- No F
- No C
- Closed path sets P:
- P1 = {2, 1, 4}
- P2 = {1, 2, 3}
- P3 = {1, 3, 4}
- P4 = {2, 3, 4}
47Example
- Make a feasible set by solving the set cover (S.C.) problem
- G = {P1, P2, P3, P4}
- S1 = {P1, P2, P3}, w1 = 3
- S2 = {P1, P2, P4}, w2 = 9
- S3 = {P2, P3, P4}, w3 = 6
- S4 = {P1, P3, P4}, w4 = 3
- Solution: {S1, S4}
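The set cover instance above is small enough to run through a greedy heuristic of the kind the talk cites (pick the subset with the lowest weight per newly covered element). This is a minimal sketch, not the authors' implementation. The function name and the tie-breaking rule (first subset wins) are assumptions.

```python
def greedy_set_cover(universe, subsets, weights):
    """Repeatedly pick the subset with the lowest weight per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(uncovered & members)
            if gain == 0:
                continue  # this subset covers nothing new
            ratio = weights[name] / gain
            if best_ratio is None or ratio < best_ratio:  # ties: first subset wins
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


universe = {"P1", "P2", "P3", "P4"}
subsets = {
    "S1": {"P1", "P2", "P3"},
    "S2": {"P1", "P2", "P4"},
    "S3": {"P2", "P3", "P4"},
    "S4": {"P1", "P3", "P4"},
}
weights = {"S1": 3, "S2": 9, "S3": 6, "S4": 3}
solution = greedy_set_cover(universe, subsets, weights)
```

Choosing S1 and S4 means eliminating motifs 1 and 4 at a total weight of 6, which leaves motifs 2 and 3 for the alignment step.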
48Example
- Get aligned blocks from overlap graphs
- GFPCQFS.G
- PCQ..G
- Figure: the remaining vertices 2 and 3, joined by an edge labeled 2 (O.V.)
49Example
- Result ---
- G F P C Q F S a G
- G F P C Q F S G G
- - g P C Q s a G K
50Demonstration
- IBM Bioinformatics Group
- http://cbcsrv.watson.ibm.com/Tmsa.html
51Approximation Analysis
- Q1: Is it polynomial time?
- Recall the algorithm:
- 1. Convert to graph.
- 2. Detect the basic infeasible subsets.
- 3. Eliminate motifs to obtain a feasible set that maximizes the cost.
- 4. Render the alignment.
52AnalysisQ1
- Step 1: Convert to graph
- one irredundant motif -> one vertex
- one relation between two irredundant motifs -> one edge
- The number of irredundant motifs is linear in the input size (by Par98).
53AnalysisQ1
- Step 2: Compute the following sets
- 2.1 Fi: contains the 2 endpoints of the i-th forbidden edge.
- Simply scan all the edges and collect the endpoints of the edges labeled Forbidden.
- In G(V,E), |V| = n
- |E| ≤ n(n-1)/2 = O(n^2)
54AnalysisQ1
- 2.2 Cj: contains the vertices that form a directed non-redundant cycle in the graph.
- Do a DFS of the graph to capture all the cycles. (In a DFS, there is a cycle iff there is a back edge.)
- There is no need to traverse a whole path.
- DFS is polynomial.
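The back-edge test mentioned here can be sketched with the usual three-color DFS. This only reports whether a directed cycle exists and does not enumerate the non-redundant cycles the algorithm needs. The adjacency-dict representation and the function name are assumptions.

```python
def has_cycle(graph):
    """Return True iff the directed graph (dict: vertex -> successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {v: WHITE for v in graph}

    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY:
                return True  # back edge to a vertex on the current path
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in list(graph))


cyclic = has_cycle({1: [2], 2: [3], 3: [1]})
acyclic = has_cycle({1: [2, 3], 2: [3], 3: []})
```

Each vertex and edge is handled once, so the test runs in O(|V| + |E|) time.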
55AnalysisQ1
- 2.3 Pk: contains the vertices that form a non-redundant closed path in the graph.
- Do a BFS rooted at each vertex. This search is also partial, as in the last step.
- n runs of BFS are still polynomial.
56AnalysisQ1
- Step 3: Mapping to the SETCOVER problem.
- Use V. Chvatal, "A greedy heuristic for the set-covering problem", Math. Oper. Res. 4:233-235, 1979.
57AnalysisQ1
- Step 4: Render the alignment
- Find a polynomial algorithm for each block.
- Summing up these steps, the total is still polynomial.
58Approximation Analysis
- Q2: Is it feasible?
- Yes. Because the alignment is built block by block, the solution is an alignment that doesn't break any of the needed motifs.
59Approximation Analysis
- Q3: What is the approximation factor?
- Using the result of Chv79, the set cover problem is approximable within 1 + ln X, where X is the number of elements of the ground set. We have shown that the number of non-redundant basic infeasible sets is no more than (nN)^6. Thus X ≤ (nN)^6, and the approximation factor is 1 + 6 ln(nN).
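The last step is a logarithm identity. Written out, with X the size of the ground set and the bound of (nN) to the sixth power taken from the slide (the plus signs are assumed to have been dropped in extraction):

```latex
\[
1 + \ln X \;\le\; 1 + \ln\bigl((nN)^{6}\bigr) \;=\; 1 + 6\ln(nN).
\]
```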
60Conclusion
- Two stages: one identifies the local similarities and the other aligns the similarities appropriately.
- We don't think it's a good approximation.
61MAX SNP
- Definition
- The class of problems having constant-factor approximation algorithms, but no approximation schemes unless P = NP.
- In other words, no PTAS exists.
- Fact: Maximum Cut is MAX SNP-hard
62Proof Procedure
- First, the paper proves that the max version of K-MSA is MAX SNP-hard.
- Second, the paper proves that K-MSA is MAX SNP-hard.
63MAX Version K-MSA
64Proof Procedure(2)
- The procedure to prove that K-MSA(max) is MAX SNP-hard:
- Give a reduction from Maximum Cut to Bipartite Maximum Cut (BMC)
- Then give another reduction from BMC to K-MSA(max)
65Maximum Cut
- Given: a graph (V,E)
- Output: two vertex sets V1, V2
- Maximize: the cardinality of the cut, i.e., the number of edges with one endpoint in V1 and one endpoint in V2.
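As a small illustration of the objective, the sketch below counts the edges crossing a given partition. It only evaluates a cut and does not search for a maximum one. The function name and the edge-list representation are assumptions.

```python
def cut_size(edges, v1):
    """Number of edges with exactly one endpoint in v1 (the rest of V is V2)."""
    side = set(v1)
    return sum((u in side) != (v in side) for u, v in edges)


triangle = [(1, 2), (2, 3), (1, 3)]
square = [(1, 2), (2, 3), (3, 4), (4, 1)]
triangle_cut = cut_size(triangle, {1})
square_cut = cut_size(square, {1, 3})
```

In a triangle no partition can cut more than two of the three edges, while a 4-cycle is bipartite, so all four of its edges can be cut.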
66Bipartite Maximum Cut
- Given: a graph (Vtop, Vdown, E)
- every e in E runs between Vtop and Vdown
- value(e) = 1 or -1
- Output: two vertex sets V1, V2
- Maximize: the cardinality of the cut, i.e., the number of edges with one endpoint in V1 and one endpoint in V2.
67Proof Procedure(3)
- Any instance of the Maximum Cut problem -> some instance of the Bipartite Maximum Cut problem -> some instance of the K-MSA(max) problem
- Whole procedure: from MC to K-MSA(max)
- some trouble in the middle stage
68Procedure Overview(1)
- MC to BMC
- Given a cut of BMC , a cut of MC is constructed
69Procedure Overview(2)
70Procedure Overview(3)
- If we have those two reductions, then
- assume that we have a PTAS for K-MSA(max)
- =>
- we will have a PTAS for Maximum Cut!
72MC to BMC
- Basic idea for MC to BMC
- Split a vertex into top and bottom
- Much more complex because of K-MSA
- Given a graph, construct an instance of BMC: for each vi, construct 3di + 6 vertices (di is the degree of vi).
73MC to BMC (2)
74MC to BMC(3)
75MC to BMC(4)
76MC to BMC(5)
- So complicated? Don't worry
- Main vertex
- only these points connect to the outside
- only negative edges
- only positive edges
- Balance to zero
77Construct the Cut of MC
78Construct the Cut of MC(2)
- In any cut of BMC, we can modify the cut such that [...] without decreasing the cost.
- Then we assign vi (in MC) according to [...]. In other words, if [...] then vi belongs to S1 (in MC), too
79Construct the Cut of MC(3)
- Cost(BMC) = cost_in_square + cost_out_square
- Total cost_in_square = Σ (3di + 2), summed over each vertex vi in MC
- = 6e + 2n
80Construct the Cut of MC(4)
- Cost_out_square = cost from [...] + cost from [...]
- Assume we have cost(MC) = K
- total cost from the first [...] = 2K
- total cost from the second [...] = Σ (1 - (di - ci))
- ci is the number of cut edges at vi
81Construct the Cut of MC(5)
- Cost(BMC) = 6e + 2n + 2K + Σ (1 - (di - ci)) = 3n + 4e + 4K = 3n + 4e + 4 cost(MC)
- 4 cost(MC) = cost(BMC) - 3n - 4e
- Claim: MC has a solution K iff BMC has a solution 4K + 4e + 3n
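The middle step of this cost computation can be spelled out. The derivation below assumes that the degrees sum to 2e, that the per-vertex cut counts c_i sum to 2K (each cut edge is counted at both endpoints), and that the sign lost inside the sum is a minus. Under those assumptions the totals agree with the claim.

```latex
\begin{align*}
\mathrm{cost}(\mathrm{BMC})
  &= \sum_i (3d_i + 2) \;+\; 2K \;+\; \sum_i \bigl(1 - (d_i - c_i)\bigr) \\
  &= (6e + 2n) + 2K + (n - 2e + 2K) \\
  &= 3n + 4e + 4K .
\end{align*}
```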
82BMC to K-MSA(max)
- I can't understand this part very well.
- The paper lacks definitions of some terms.
- There may be an error in this part?
- In the thesis: BSC problem
83BMC to K-MSA(max)
- Make a reduction from BMC to a restricted MSA problem.
- Every sequence has the same length.
- A gap can only be inserted at the left end or the right end.
- Only one gap can be inserted.
84Construct KSA problem
85Construct KSA problem
86Construct the cut of BMC
- Key observation
- Every edge linked to ui is ????
- If ui changes from S1 to S2, then the cost will shift!
87Construct the cut of BMC
- If row i is right-aligned, then the corresponding ui belongs to S1, else S2.
- If column j has more 1s than the adjacent column, then the corresponding vi belongs to S1, else S2 (not sure)
- (??) Cluster the negative edges in one set.
88Complain
- The structure of the paper is hard to read and understand.
- Some proofs are skipped.
- Some information is lacking.
- But it is to appear in JCO?
- But the reduction is interesting.