Title: PowerPoint-Pr
1Chaining Algorithms Simplified
Mohamed Ibrahim Abouelhoda University of Ulm and
Cairo University 2007
2The Genome
The Genome
- The total DNA content in a cell ? a string over
an alphabet of 4 characters A, C, G, T - Encodes the necessary information for the
existence and reproduction of an organism
..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG.
.
Gene
Gene
3Computational Comparative Genomics
What about?
Genome of Organism B
Genome of Organism A
4Sequence Alignment
Traditional solutions
TCACAA
TACAATCAA
S1
TCACAA
S1
CAAATCA
TCACTCAC
S2
CAAATCA
S2
Local Sequence Alignment
Global Sequence Alignment
Sequence Alignment is not suitable for comparing
genomic sequences
- Dynamic programming algorithms take
time (knumber of genomes, Naverage genome
length)
5The Anchor-based Strategy
- Computation of fragments (similar regions) among
genomes - Computation of an optimal global chain or chains
of colinear non-overlapping fragments - Detailed alignment of the regions between the
fragments of the chain
Different characters
Genome 1
GACCGCGCA
CACCGCGCT
Genome 2
Exact Fragments (e.g., maximal exact matches)
6Fragment Representation
- Box-Line Representation
- Geometric Representation Each fragment is
represented by a hyper-rectangle in kD space,
each axis corresponds to one sequence
S2
T C A C T C A C
S1
T A C A A T C A A
T C A C T C A C
S2
T A C A A T C A A
S1
Box-line Representation
Geometric Representation
7The Anchor-based Strategy
- Computation of fragments (similar regions) among
genomes - Computation of an optimal global chain or chains
of colinear non-overlapping fragments - Detailed alignment of the regions between the
fragments of the chain
First Genome G1
Second Genome G2
8The Anchor-based Strategy
- Computation of fragments (similar regions) among
genomes - Computation of an optimal global chain or chains
of colinear non-overlapping fragments - Detailed alignment of the regions between the
fragments of the chain
First Genome G1
Second Genome G2
anchors
9Our Contribution
Enhanced Suffix Array
- A novel and efficient data structure ? more
space efficient than the suffix tree - Efficient for computing various kinds of
fragments - Not limited to fragment generation, but also
applicable for various string processing tasks
Chaining Algorithms
- Computing chains of fragments efficiently
- Handling fragments of two or multiple genomes
- Introducing a number of variations for versatile
comparative genomics tasks
Not heuristic
Extends pair-wise alignment methods
Novel extensions of the chaining algorithms
10Chaining Algorithms
11The Global Chaining Problem
Given n weighted fragments from k genomes, find
the chain C of colinear non-overlapping
fragments such that its total score is maximum
over all other chains.
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi1, fi) is the gap cost of connecting
fi1 to fi
- The weight of a fragment is for example its
length
12The Global Chaining Problem
First Genome G1
Second Genome G2
Third Genome G3
fi1
fi
Given n weighted fragments from k genomes, find
the chain C of colinear non-overlapping
fragments such that its total score is maximum
over all other chains.
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi1, fi) is the gap cost of connecting
fi1 to fi
- The weight of a fragment is for example its
length
13Notions
- A fragment fi is represented as a
hyper-rectangle in a k-dimensional space. - A fragment fi is identified with its start and
end points start(fi) and end( fi). - We add two imaginary fragments O and t with
weight zero. - Any two fragments fi and fi1 in the chain must
be colinear and non-overlapping
filtlt fi1 end( fi).xr lt start(fi1).xr for all
r, 0 lt r lt k
14Types of Gap Costs
- The gap costs g can be described geometrically
L1
L8
ACC _XX ACC
ACC XX _ _ _ _ _ ACC
ACC YYY ACC
ACC _ _ YYY _ _ ACC
ACC _ ZZ ACC
ACC _ _ _ _ _ ZZ ACC
15A Graph-based Solution
- The score of a chain C is
score(C) ?i fi .weight - g(fi, fi-1)
- An optimal chain is a chain of maximum score
- A highest-scoring path in the graph is an
optimal chain
- The maximum score can be computed by the
recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
- A graph based solution takes O(n2) time.
16Sparse Dynamic Programming
- Chaining algorithms are sparse dynamic programming
D. Eppstein, R. Giancarlo, Z. Galil, and G.F.
Italiano, 1992
- The maximum score can be computed by the
recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
where filtlt fj end( fi).xr lt start(fj).xr for
all r, 0 lt r lt k
g( fi , fj ) is the gap cost of connecting fi
to fj
1
j
T C G C C C C G T T
A C G T C C G C A T
i
17Sparse Dynamic Programming
- Chaining algorithms are sparse dynamic programming
D. Eppstein, R. Giancarlo, Z. Galil, and G.F.
Italiano, 1992
- The maximum score can be computed by the
recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
where filtlt fj end( fi).xr lt start(fj).xr for
all r, 0 lt r lt k
g( fi , fj ) is the gap cost of connecting fi
to fj
1
j
Y Y Y Y Y Y Y Y Y Y
- The string characters are not given, only
positions - In extreme cases, you can enumerate all matches
and consider others as gaps ? - sparse dynamic programming (chaining) is used
to compute alignment directly ? - selecting gap cost function is critical
X X X X X X X X X
X
i
18A Geometric-based Solution
- The max function in the recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
can be replaced by range maximum query (RMQ)
fj.scorefj.weightRMQO, start(fj)
- RMQ (Range Maximum Query)
Retrieves the fragment fi whose end point Iies
in the hyper-rectangle bounded by start(fj) and O
such that fi.score-g( fi , fj ) is maximum.
- If the gap cost is zero, a RMQ returns the end
point of the fragment fi such that
is maximum.
If all the fragments have the same weight
(length) and no gap cost ? we are solving the
LCS problem
19The Algorithm without gap cost
1. Sort the start and end points of the fragments
w.r.t. x1 2. If a start point of a fragment, say
fj, is scanned apply the RMQ(O, (start(fj).x1,
start(fj).x2, , start(fj).xk)) to the
set of active end points and update the score
of the end point of fragment fj. 3. Otherwise,
add the end point to the set of active end points
(already scanned end points).
- Becaue of the sorting step, the dimension of the
RMQ can be reduced to k-1 ? - we can use RMQ(O, (start(fj).x2, ,
start(fj).xk))
- For comparing two sequences, the RMQ dimension
is 1 ? - we can use priority queues to find an optimal
fragment in O(log log m)
- But the complexity is dominated by the sorting,
unless the fragments are computed in order.
- Priority queue is complicated to implement
20The Complexity of the Algorithm
- The algorithm complexity depends on the data
structure supporting RMQ
Semi-dynamic data structure
Dynamic data structure
- Constructed point by point - Points are
explicitly inserted, deleted- Less space,
because some covered fragments can be deleted-
Very difficult to implement- Works for on-line
chaining
- Constructed for all point at once - Points are
not inserted/deleted, rather activated/inactivate
d- More space, all fragments remain in
memory- Easier to implement- Works for
off-line chaining
21The Complexity of the Algorithm
RMQ using semi-dynamic range tree
- supported by fractional cascading.
- enhanced with priority queues.
Willard, 1985
- D is implemented as a range tree
Johnson, 1982
van Emde Boas, 1977
- For n fragments and dimension d, the RMQ and
activation takes
O(n log d-1 n log log n) time and O(n log d-1 n)
space
- Since d k-1gt1, the complexity of the algorithm
is
O(n log k-2 n log log n) time and O(n log k-2 n)
space
22The Complexity of the Algorithm
RMQ using semi-dynamic kd-tree
- For n fragments and dimension dgt1, the RMQ and
activation takes
Lee-Wong 1977
- Since d k-1gt1, the complexity of the algorithm
is
The running time can be speeded-up in practice
using some programming tricks
Bently, 1990
23kd-trees
24kd-trees vs. Range Trees
- d stands for dimension
- C stands for construction
- Q stands for query and activation time
- For 4 strains E. coli, the range tree did not
fit in memory estimated space consumption is 7.1
Gb
25Including Gap Costs
Recall the recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
The gap cost should be included in the RMQ,
otherwise the algorithm would be quadratic.
fj.scorefj.weightRMQO, start(fj)
26Including Gap Costs in L1
- We define the geometric cost of a fragment f as
follows
gc( f) d1(t, end(f))
f
where d1(t, end(f) is the distance in the L1
metric between t and end(f).
f 2
f 1.score - g( f 1 , f ) gt f 2.score - g( f 2 , f
) iff f 1.score - gc( f 1) gt f 2.score - gc( f 2)
f 1
- gc( f) is a constant that can be precomputed and
attached to the fragments weight - We activate fragment with f .score - gc( f )
instead of f.score
The inclusion of gap cost can be done with no
extra cost ? the same complexity as the
algorithm with no gap cost
27The Local Chaining Problem
Given n weighted fragments from k genomes, a
chain C of colinear non-overlapping fragments has
score
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting
fi to fi-1
- The weight of a fragment is for example its
length or its statistical significance
- A local chain C is called optimal if its score
is maximum over all other chains.
28The Local Chaining Problem
Given n weighted fragments from k genomes, a
chain C of colinear non-overlapping fragments has
score
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting
fi to fi-1
- The weight of a fragment is for example its
length or its statistical significance
- A local chain C is called optimal if its score
is maximum over all other chains.
29Geometric Solution
fj
The recurrence
fj.scorefj.weightmax0, fi.score-g( fi , fj )
filtltfj
can be written as
fj.scorefj.weightRMQO, start(fj)
fj.scorefj.weightf.score gt 0,
fRMQO, start(fj)
then
Connect f to fj
else
Start a new chain, starting with fj
30Comparing two bacterial genomes
C. trachmoatish
Red points Forward fragments Green points
Reverse fragments
C. pneumonia
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)
- Fragments of the type maximal exact matches of
minimum length 12 - Total number of fragments 288,899
31Comparing two bacterial genomes
Chains
C. trachmoatis
Termini of Replication
C. pneumonia
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)
- Fragments of the type maximal multiple exact
matches of minimum length 12 - Total number of fragments 288,899
- CoCoNUT is fast it takes minutes to compute
fragments and local chains a task that took
hours by previous methods
32Conclusions
- Chaining Algorithms are efficient for
comparative genomics - More variations needed for real applications in
biology, i.e., limiting range search, considering
overlaps - CoCoNUT is a system for comparative genomics
containing various variations of the chaining
algorithms - Global and local chaining are analogous to
global and local sequence alignment - kd-tree is superior to range tree in practice
33More on Chaining Algorithms
1 E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. Handbook of Computational Molecular Biology (Chapter 20), 2005, in press.
2 M. I. Abouelhoda. Algorithms and a Software System for Comparative Genome Analysis. PhD Thesis, Ulm University, 2005.
34Thanks for attention