PowerPoint-Pr - PowerPoint PPT Presentation

About This Presentation

Title:

PowerPoint-Pr

Description:

Encodes the necessary information for the existence and ... The running time can be speeded-up in practice using some programming tricks. 23. kd-trees ... – PowerPoint PPT presentation

Number of Views:23

Avg rating:3.0/5.0

Slides: 34

Provided by: mibr5

Category:

more less

Transcript and Presenter's Notes

Title: PowerPoint-Pr

1
Chaining Algorithms Simplified
Mohamed Ibrahim Abouelhoda University of Ulm and
Cairo University 2007
2
The Genome
The Genome

The total DNA content in a cell ? a string over
an alphabet of 4 characters A, C, G, T
Encodes the necessary information for the
existence and reproduction of an organism

..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG.
.
Gene
Gene
3
Computational Comparative Genomics
What about?
Genome of Organism B
Genome of Organism A
4
Sequence Alignment
Traditional solutions
TCACAA
TACAATCAA
S1
TCACAA
S1
CAAATCA
TCACTCAC
S2
CAAATCA
S2
Local Sequence Alignment
Global Sequence Alignment
Sequence Alignment is not suitable for comparing
genomic sequences

Dynamic programming algorithms take
time (knumber of genomes, Naverage genome
length)

5
The Anchor-based Strategy

Composed of three phases

Computation of fragments (similar regions) among
genomes
Computation of an optimal global chain or chains
of colinear non-overlapping fragments
Detailed alignment of the regions between the
fragments of the chain

A fragment

Different characters
Genome 1
GACCGCGCA
CACCGCGCT
Genome 2
Exact Fragments (e.g., maximal exact matches)
6
Fragment Representation

Box-Line Representation
Geometric Representation Each fragment is
represented by a hyper-rectangle in kD space,
each axis corresponds to one sequence

S2
T C A C T C A C
S1
T A C A A T C A A
T C A C T C A C
S2
T A C A A T C A A
S1
Box-line Representation
Geometric Representation
7
The Anchor-based Strategy

Composed of three phases

Computation of fragments (similar regions) among
genomes
Computation of an optimal global chain or chains
of colinear non-overlapping fragments
Detailed alignment of the regions between the
fragments of the chain

First Genome G1
Second Genome G2
8
The Anchor-based Strategy

Composed of three phases

Computation of fragments (similar regions) among
genomes
Computation of an optimal global chain or chains
of colinear non-overlapping fragments
Detailed alignment of the regions between the
fragments of the chain

First Genome G1
Second Genome G2
anchors
9
Our Contribution
Enhanced Suffix Array

A novel and efficient data structure ? more
space efficient than the suffix tree
Efficient for computing various kinds of
fragments
Not limited to fragment generation, but also
applicable for various string processing tasks

Chaining Algorithms

Computing chains of fragments efficiently
Handling fragments of two or multiple genomes
Introducing a number of variations for versatile
comparative genomics tasks

Not heuristic
Extends pair-wise alignment methods
Novel extensions of the chaining algorithms
10
Chaining Algorithms
11
The Global Chaining Problem
Given n weighted fragments from k genomes, find
the chain C of colinear non-overlapping
fragments such that its total score is maximum
over all other chains.
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi1, fi) is the gap cost of connecting
fi1 to fi

The weight of a fragment is for example its
length

12
The Global Chaining Problem
First Genome G1
Second Genome G2
Third Genome G3
fi1
fi
Given n weighted fragments from k genomes, find
the chain C of colinear non-overlapping
fragments such that its total score is maximum
over all other chains.
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi1, fi) is the gap cost of connecting
fi1 to fi

The weight of a fragment is for example its
length

13
Notions

A fragment fi is represented as a
hyper-rectangle in a k-dimensional space.
A fragment fi is identified with its start and
end points start(fi) and end( fi).
We add two imaginary fragments O and t with
weight zero.
Any two fragments fi and fi1 in the chain must
be colinear and non-overlapping

filtlt fi1 end( fi).xr lt start(fi1).xr for all
r, 0 lt r lt k
14
Types of Gap Costs

The gap costs g can be described geometrically

L1
L8
ACC _XX ACC
ACC XX _ _ _ _ _ ACC
ACC YYY ACC
ACC _ _ YYY _ _ ACC
ACC _ ZZ ACC
ACC _ _ _ _ _ ZZ ACC
15
A Graph-based Solution

The score of a chain C is

score(C) ?i fi .weight - g(fi, fi-1)

An optimal chain is a chain of maximum score
A highest-scoring path in the graph is an
optimal chain

The maximum score can be computed by the
recurrence

fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj

A graph based solution takes O(n2) time.

16
Sparse Dynamic Programming

Chaining algorithms are sparse dynamic programming

D. Eppstein, R. Giancarlo, Z. Galil, and G.F.
Italiano, 1992

The maximum score can be computed by the
recurrence

fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
where filtlt fj end( fi).xr lt start(fj).xr for
all r, 0 lt r lt k
g( fi , fj ) is the gap cost of connecting fi
to fj

1
j
T C G C C C C G T T
A C G T C C G C A T
i
17
Sparse Dynamic Programming

Chaining algorithms are sparse dynamic programming

D. Eppstein, R. Giancarlo, Z. Galil, and G.F.
Italiano, 1992

The maximum score can be computed by the
recurrence

fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
where filtlt fj end( fi).xr lt start(fj).xr for
all r, 0 lt r lt k
g( fi , fj ) is the gap cost of connecting fi
to fj

1
j
Y Y Y Y Y Y Y Y Y Y

The string characters are not given, only
positions
In extreme cases, you can enumerate all matches
and consider others as gaps ?
sparse dynamic programming (chaining) is used
to compute alignment directly ?
selecting gap cost function is critical

X X X X X X X X X
X
i
18
A Geometric-based Solution

The max function in the recurrence

fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
can be replaced by range maximum query (RMQ)
fj.scorefj.weightRMQO, start(fj)

RMQ (Range Maximum Query)

Retrieves the fragment fi whose end point Iies
in the hyper-rectangle bounded by start(fj) and O
such that fi.score-g( fi , fj ) is maximum.

If the gap cost is zero, a RMQ returns the end
point of the fragment fi such that
is maximum.

If all the fragments have the same weight
(length) and no gap cost ? we are solving the
LCS problem
19
The Algorithm without gap cost

Line-sweep algorithm

1. Sort the start and end points of the fragments
w.r.t. x1 2. If a start point of a fragment, say
fj, is scanned apply the RMQ(O, (start(fj).x1,
start(fj).x2, , start(fj).xk)) to the
set of active end points and update the score
of the end point of fragment fj. 3. Otherwise,
add the end point to the set of active end points
(already scanned end points).

Becaue of the sorting step, the dimension of the
RMQ can be reduced to k-1 ?
we can use RMQ(O, (start(fj).x2, ,
start(fj).xk))

For comparing two sequences, the RMQ dimension
is 1 ?
we can use priority queues to find an optimal
fragment in O(log log m)

But the complexity is dominated by the sorting,
unless the fragments are computed in order.

Priority queue is complicated to implement

20
The Complexity of the Algorithm

The algorithm complexity depends on the data
structure supporting RMQ

Semi-dynamic data structure
Dynamic data structure

Constructed point by point - Points are
explicitly inserted, deleted- Less space,
because some covered fragments can be deleted-
Very difficult to implement- Works for on-line
chaining

Constructed for all point at once - Points are
not inserted/deleted, rather activated/inactivate
d- More space, all fragments remain in
memory- Easier to implement- Works for
off-line chaining

21
The Complexity of the Algorithm
RMQ using semi-dynamic range tree

supported by fractional cascading.
enhanced with priority queues.

Willard, 1985

D is implemented as a range tree

Johnson, 1982
van Emde Boas, 1977

For n fragments and dimension d, the RMQ and
activation takes

O(n log d-1 n log log n) time and O(n log d-1 n)
space

Since d k-1gt1, the complexity of the algorithm
is

O(n log k-2 n log log n) time and O(n log k-2 n)
space
22
The Complexity of the Algorithm
RMQ using semi-dynamic kd-tree

For n fragments and dimension dgt1, the RMQ and
activation takes

Lee-Wong 1977

Since d k-1gt1, the complexity of the algorithm
is

The running time can be speeded-up in practice
using some programming tricks
Bently, 1990
23
kd-trees
24
kd-trees vs. Range Trees

d stands for dimension
C stands for construction
Q stands for query and activation time
For 4 strains E. coli, the range tree did not
fit in memory estimated space consumption is 7.1
Gb

25
Including Gap Costs
Recall the recurrence
fj.scorefj.weightmaxfi.score-g( fi , fj )
filtltfj
The gap cost should be included in the RMQ,
otherwise the algorithm would be quadratic.
fj.scorefj.weightRMQO, start(fj)
26
Including Gap Costs in L1

We define the geometric cost of a fragment f as
follows

gc( f) d1(t, end(f))
f
where d1(t, end(f) is the distance in the L1
metric between t and end(f).
f 2
f 1.score - g( f 1 , f ) gt f 2.score - g( f 2 , f
) iff f 1.score - gc( f 1) gt f 2.score - gc( f 2)
f 1

gc( f) is a constant that can be precomputed and
attached to the fragments weight
We activate fragment with f .score - gc( f )
instead of f.score

The inclusion of gap cost can be done with no
extra cost ? the same complexity as the
algorithm with no gap cost
27
The Local Chaining Problem
Given n weighted fragments from k genomes, a
chain C of colinear non-overlapping fragments has
score
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting
fi to fi-1

The weight of a fragment is for example its
length or its statistical significance

A local chain C is called optimal if its score
is maximum over all other chains.

28
The Local Chaining Problem
Given n weighted fragments from k genomes, a
chain C of colinear non-overlapping fragments has
score
score(C) ?i fi .weight - ?i g(fi, fi-1)
where g(fi, fi-1) is the gap cost of connecting
fi to fi-1

The weight of a fragment is for example its
length or its statistical significance

A local chain C is called optimal if its score
is maximum over all other chains.

29
Geometric Solution
fj
The recurrence
fj.scorefj.weightmax0, fi.score-g( fi , fj )
filtltfj
can be written as
fj.scorefj.weightRMQO, start(fj)

But we have to check
if

fj.scorefj.weightf.score gt 0,
fRMQO, start(fj)
then
Connect f to fj
else
Start a new chain, starting with fj
30
Comparing two bacterial genomes
C. trachmoatish
Red points Forward fragments Green points
Reverse fragments
C. pneumonia
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)

Fragments of the type maximal exact matches of
minimum length 12
Total number of fragments 288,899

31
Comparing two bacterial genomes
Chains
C. trachmoatis
Termini of Replication
C. pneumonia
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)

Fragments of the type maximal multiple exact
matches of minimum length 12
Total number of fragments 288,899

CoCoNUT is fast it takes minutes to compute
fragments and local chains a task that took
hours by previous methods

32
Conclusions

Chaining Algorithms are efficient for
comparative genomics
More variations needed for real applications in
biology, i.e., limiting range search, considering
overlaps
CoCoNUT is a system for comparative genomics
containing various variations of the chaining
algorithms
Global and local chaining are analogous to
global and local sequence alignment
kd-tree is superior to range tree in practice

33
More on Chaining Algorithms
1 E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. Handbook of Computational Molecular Biology (Chapter 20), 2005, in press.
2 M. I. Abouelhoda. Algorithms and a Software System for Comparative Genome Analysis. PhD Thesis, Ulm University, 2005.
34
Thanks for attention

Write a Comment

User Comments (0)