PowerPoint-Pr - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Chaining Algorithms Simplified
Mohamed Ibrahim Abouelhoda
University of Ulm and Cairo University, 2007
2
The Genome
  • The total DNA content in a cell → a string over
    an alphabet of 4 characters: A, C, G, T
  • Encodes the necessary information for the
    existence and reproduction of an organism

..GCGGGGCGGTTCACGCGGCCGCAATCAACTGCGTGGGGGGGGGGGGG..
[Figure: two regions of the sequence, each labeled "Gene"]
3
Computational Comparative Genomics
What about?
Genome of Organism B
Genome of Organism A
4
Sequence Alignment
Traditional solutions
[Figure: Local Sequence Alignment and Global Sequence Alignment of two sequences S1 and S2]
Sequence Alignment is not suitable for comparing
genomic sequences
  • Dynamic programming algorithms take time exponential in
    the number of genomes, on the order of $N^k$
    ($k$ = number of genomes, $N$ = average genome length)

5
The Anchor-based Strategy
  • Composed of three phases
  1. Computation of fragments (similar regions) among
    genomes
  2. Computation of an optimal global chain or chains
    of colinear non-overlapping fragments
  3. Detailed alignment of the regions between the
    fragments of the chain
  • A fragment

[Figure: Genome 1 (GACCGCGCA) and Genome 2 (CACCGCGCT) share the exact fragment ACCGCGC, which is flanked by different characters on both sides]
Exact Fragments (e.g., maximal exact matches)
6
Fragment Representation
  • Box-Line Representation
  • Geometric Representation: each fragment is
    represented by a hyper-rectangle in k-dimensional space;
    each axis corresponds to one sequence

[Figure: Box-line Representation and Geometric Representation of the fragments of S1 = TACAATCAA and S2 = TCACTCAC]
7
The Anchor-based Strategy
First Genome G1
Second Genome G2
8
The Anchor-based Strategy
First Genome G1
Second Genome G2
anchors
9
Our Contribution
Enhanced Suffix Array
  • A novel and efficient data structure → more
    space efficient than the suffix tree
  • Efficient for computing various kinds of
    fragments
  • Not limited to fragment generation, but also
    applicable for various string processing tasks

Chaining Algorithms
  • Computing chains of fragments efficiently
  • Handling fragments of two or multiple genomes
  • Introducing a number of variations for versatile
    comparative genomics tasks

Not heuristic
Extends pair-wise alignment methods
Novel extensions of the chaining algorithms
10
Chaining Algorithms
11
The Global Chaining Problem
Given n weighted fragments from k genomes, find
the chain C of colinear non-overlapping
fragments such that its total score is maximum
over all other chains.
$\text{score}(C) = \sum_i f_i.\text{weight} - \sum_i g(f_i, f_{i-1})$
where $g(f_i, f_{i-1})$ is the gap cost of connecting
$f_i$ to $f_{i-1}$
  • The weight of a fragment is for example its
    length

12
The Global Chaining Problem
[Figure: fragments $f_i$ and $f_{i+1}$ of a chain across the First Genome G1, Second Genome G2, and Third Genome G3]
13
Notions
  • A fragment $f_i$ is represented as a
    hyper-rectangle in a k-dimensional space.
  • A fragment $f_i$ is identified with its start and
    end points $\text{start}(f_i)$ and $\text{end}(f_i)$.
  • We add two imaginary fragments O (the origin) and t (the terminus) with
    weight zero.
  • Any two fragments $f_i$ and $f_{i+1}$ in the chain must
    be colinear and non-overlapping:

$f_i \ll f_{i+1}$: $\text{end}(f_i).x_r < \text{start}(f_{i+1}).x_r$ for all
$r$, $1 \le r \le k$
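The colinearity relation can be checked one coordinate at a time. The following is a minimal Python sketch, assuming each fragment stores its start and end points as coordinate tuples with one entry per genome; the `Fragment` class and the `precedes` function are illustrative names, not part of the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    start: Tuple[int, ...]  # start point, one coordinate per genome
    end: Tuple[int, ...]    # end point, one coordinate per genome
    weight: float = 0.0


def precedes(f: Fragment, g: Fragment) -> bool:
    """Return True if f << g, i.e. end(f).x_r < start(g).x_r for every r."""
    return all(e < s for e, s in zip(f.end, g.start))


a = Fragment(start=(1, 2), end=(4, 5), weight=4)
b = Fragment(start=(6, 8), end=(9, 11), weight=4)
c = Fragment(start=(5, 3), end=(8, 6), weight=4)

print(precedes(a, b))  # True: a ends before b starts in both genomes
print(precedes(a, c))  # False: in the second genome, c starts before a ends
```

Only pairs for which `precedes` holds may be adjacent in a chain.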
14
Types of Gap Costs
  • The gap costs g can be described geometrically

$L_1$ metric
$L_\infty$ metric
ACC _XX ACC
ACC XX _ _ _ _ _ ACC
ACC YYY ACC
ACC _ _ YYY _ _ ACC
ACC _ ZZ ACC
ACC _ _ _ _ _ ZZ ACC
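The slide names the two metrics without giving formulas. As a hedged sketch, assume the gap cost between two colinear fragments is the distance between the end point of the first and the start point of the second, measured in the chosen metric. The function names `gap_l1` and `gap_linf` are illustrative.

```python
from typing import Tuple

Point = Tuple[int, ...]


def gap_l1(end_prev: Point, start_next: Point) -> int:
    """L1 gap cost (assumed definition): sum of the coordinate differences."""
    return sum(s - e for e, s in zip(end_prev, start_next))


def gap_linf(end_prev: Point, start_next: Point) -> int:
    """L-infinity gap cost (assumed definition): largest coordinate difference."""
    return max(s - e for e, s in zip(end_prev, start_next))


print(gap_l1((3, 3), (6, 7)))    # 7
print(gap_linf((3, 3), (6, 7)))  # 4
```

Both functions assume the first fragment precedes the second in every dimension, so all differences are positive.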
15
A Graph-based Solution
  • The score of a chain C is

$\text{score}(C) = \sum_i f_i.\text{weight} - \sum_i g(f_i, f_{i-1})$
  • An optimal chain is a chain of maximum score
  • A highest-scoring path in the graph is an
    optimal chain
  • The maximum score can be computed by the
    recurrence

$f_j.\text{score} = f_j.\text{weight} + \max_{f_i \ll f_j} \{ f_i.\text{score} - g(f_i, f_j) \}$
  • A graph-based solution takes $O(n^2)$ time.
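The recurrence can be evaluated directly by trying every predecessor of every fragment, which gives the quadratic running time. The sketch below is one possible Python rendering under stated assumptions: fragments are `(start, end, weight)` tuples with positive coordinates below the terminus, the imaginary fragments O and t are added explicitly, and the gap cost is passed in as a function (zero by default). The name `global_chain` is illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start point, end point, weight)


def precedes(f: Frag, g: Frag) -> bool:
    """f << g: end(f) is strictly smaller than start(g) in every dimension."""
    return all(e < s for e, s in zip(f[1], g[0]))


def no_gap(f: Frag, g: Frag) -> float:
    return 0.0


def global_chain(frags: Sequence[Frag], terminus: Point,
                 gap: Callable[[Frag, Frag], float] = no_gap
                 ) -> Tuple[float, List[Frag]]:
    k = len(terminus)
    origin: Frag = ((0,) * k, (0,) * k, 0.0)           # imaginary fragment O
    last: Frag = (tuple(terminus), tuple(terminus), 0.0)  # imaginary fragment t
    # Sorting by start point yields an order compatible with <<.
    nodes = [origin] + sorted(frags) + [last]
    score = [float("-inf")] * len(nodes)
    prev = [-1] * len(nodes)
    score[0] = 0.0
    for j in range(1, len(nodes)):
        for i in range(j):  # try every earlier fragment as predecessor
            if score[i] == float("-inf") or not precedes(nodes[i], nodes[j]):
                continue
            cand = score[i] - gap(nodes[i], nodes[j])
            if cand > score[j]:
                score[j], prev[j] = cand, i
        score[j] += nodes[j][2]
    chain, j = [], prev[-1]
    while j > 0:  # walk back from t, stopping at O
        chain.append(nodes[j])
        j = prev[j]
    return score[-1], chain[::-1]


frags = [((1, 1), (3, 3), 3), ((5, 6), (8, 9), 4),
         ((4, 2), (6, 4), 3), ((9, 10), (10, 11), 2)]
best, chain = global_chain(frags, terminus=(12, 12))
print(best, [f[0] for f in chain])  # 9.0 [(1, 1), (5, 6), (9, 10)]
```

The two nested loops make the quadratic cost visible; the geometric approach replaces the inner loop with a range maximum query.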

16
Sparse Dynamic Programming
  • Chaining algorithms are sparse dynamic programming

D. Eppstein, R. Giancarlo, Z. Galil, and G.F.
Italiano, 1992
  • The maximum score can be computed by the
    recurrence

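$f_j.\text{score} = f_j.\text{weight} + \max_{f_i \ll f_j} \{ f_i.\text{score} - g(f_i, f_j) \}$
where $f_i \ll f_j$: $\text{end}(f_i).x_r < \text{start}(f_j).x_r$ for
all $r$, $1 \le r \le k$
$g(f_i, f_j)$ is the gap cost of connecting $f_i$
to $f_j$

[Figure: matrix over the sequences TCGCCCCGTT and ACGTCCGCAT, with indices i and j]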
17
Sparse Dynamic Programming
[Figure: matrix over two sequences whose characters are hidden (shown as X and Y), with indices i and j]
  • The string characters are not given, only
    positions
  • In extreme cases, you can enumerate all matches
    and consider others as gaps →
  • sparse dynamic programming (chaining) is used
    to compute the alignment directly →
  • selecting the gap cost function is critical
18
A Geometric-based Solution
  • The max function in the recurrence

$f_j.\text{score} = f_j.\text{weight} + \max_{f_i \ll f_j} \{ f_i.\text{score} - g(f_i, f_j) \}$
can be replaced by a range maximum query (RMQ):
$f_j.\text{score} = f_j.\text{weight} + \text{RMQ}(O, \text{start}(f_j))$
  • RMQ (Range Maximum Query)

Retrieves the fragment $f_i$ whose end point lies
in the hyper-rectangle bounded by $O$ and $\text{start}(f_j)$
such that $f_i.\text{score} - g(f_i, f_j)$ is maximum.
  • If the gap cost is zero, an RMQ returns the end
    point of the fragment $f_i$ such that
    $f_i.\text{score}$ is maximum.

If all the fragments have the same weight
(length) and there is no gap cost → we are solving the
LCS problem
19
The Algorithm without gap cost
  • Line-sweep algorithm

  1. Sort the start and end points of the fragments
    w.r.t. $x_1$.
  2. If a start point of a fragment, say $f_j$, is
    scanned, apply $\text{RMQ}(O, (\text{start}(f_j).x_1, \text{start}(f_j).x_2, \ldots, \text{start}(f_j).x_k))$
    to the set of active end points and update the score
    of the end point of fragment $f_j$.
  3. Otherwise, add the end point to the set of
    active end points (already scanned end points).
  • Because of the sorting step, the dimension of the
    RMQ can be reduced to $k-1$ →
  • we can use $\text{RMQ}(O, (\text{start}(f_j).x_2, \ldots, \text{start}(f_j).x_k))$
  • For comparing two sequences, the RMQ dimension
    is 1 →
  • we can use priority queues to find an optimal
    fragment in $O(\log \log m)$ time
  • But the complexity is dominated by the sorting,
    unless the fragments are computed in order.
  • The priority queue is complicated to implement
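For two genomes and zero gap cost, the line sweep can be sketched in a few lines of Python. This sketch swaps in a Fenwick tree (binary indexed tree) over prefix maxima for the one-dimensional RMQ instead of the priority queue mentioned above, so each query costs $O(\log n)$ rather than $O(\log \log m)$, but it is much simpler to write. The names `PrefixMax` and `chain_2d` are illustrative, and fragments are assumed to be `(start, end, weight)` tuples with start ≤ end in each coordinate.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Frag = Tuple[Tuple[int, int], Tuple[int, int], float]  # (start, end, weight)


class PrefixMax:
    """Fenwick tree: maximum over ranks 1..i, with point updates."""

    def __init__(self, size: int) -> None:
        # Each entry is (score, fragment index); index -1 stands for the origin O.
        self.tree: List[Tuple[float, int]] = [(0.0, -1)] * (size + 1)

    def update(self, i: int, item: Tuple[float, int]) -> None:
        while i < len(self.tree):
            if item > self.tree[i]:
                self.tree[i] = item
            i += i & -i

    def query(self, i: int) -> Tuple[float, int]:
        best = (0.0, -1)
        while i > 0:
            if self.tree[i] > best:
                best = self.tree[i]
            i -= i & -i
        return best


def chain_2d(frags: Sequence[Frag]) -> Tuple[float, List[int]]:
    if not frags:
        return 0.0, []
    ys = sorted({f[1][1] for f in frags})  # distinct x2 values of end points
    rmq = PrefixMax(len(ys))
    events = []
    for j, (start, end, _) in enumerate(frags):
        events.append((start[0], 0, j))  # start points sort before end points
        events.append((end[0], 1, j))    # that share the same x1 value
    events.sort()
    score = [0.0] * len(frags)
    prev = [-1] * len(frags)
    for _, kind, j in events:
        start, end, weight = frags[j]
        if kind == 0:
            # RMQ over active end points whose x2 is below start(f_j).x2
            best, prev[j] = rmq.query(bisect_left(ys, start[1]))
            score[j] = weight + best
        else:
            # activate the end point of f_j with its final score
            rmq.update(bisect_left(ys, end[1]) + 1, (score[j], j))
    j = max(range(len(frags)), key=score.__getitem__)
    total, chain = score[j], []
    while j != -1:
        chain.append(j)
        j = prev[j]
    return total, chain[::-1]


frags = [((1, 1), (3, 3), 3), ((5, 6), (8, 9), 4),
         ((4, 2), (6, 4), 3), ((9, 10), (10, 11), 2)]
print(chain_2d(frags))  # (9.0, [0, 1, 3])
```

Sorting the events handles the first dimension, so the tree only has to answer queries over the second one, which is the dimension reduction described in the bullets.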

20
The Complexity of the Algorithm
  • The algorithm complexity depends on the data
    structure supporting RMQ

Dynamic data structure
  • Constructed point by point
  • Points are explicitly inserted and deleted
  • Less space, because some covered fragments can be deleted
  • Very difficult to implement
  • Works for on-line chaining

Semi-dynamic data structure
  • Constructed for all points at once
  • Points are not inserted/deleted, rather activated/inactivated
  • More space, all fragments remain in memory
  • Easier to implement
  • Works for off-line chaining

21
The Complexity of the Algorithm
RMQ using semi-dynamic range tree
  • D is implemented as a range tree
  1. supported by fractional cascading [Willard, 1985].
  2. enhanced with priority queues [Johnson, 1982; van Emde Boas, 1977].
  • For n fragments and dimension d, the RMQ and
    activation take

$O(n \log^{d-1} n \log \log n)$ time and $O(n \log^{d-1} n)$ space
  • Since $d = k-1 > 1$, the complexity of the algorithm
    is

$O(n \log^{k-2} n \log \log n)$ time and $O(n \log^{k-2} n)$ space
22
The Complexity of the Algorithm
RMQ using semi-dynamic kd-tree
  • For n fragments and dimension $d > 1$, the RMQ and
    activation take

[formula not preserved in the transcript] [Lee-Wong, 1977]
  • Since $d = k-1 > 1$, the complexity of the algorithm
    is

[formula not preserved in the transcript]
The running time can be sped up in practice
using some programming tricks [Bentley, 1990]
23
kd-trees
24
kd-trees vs. Range Trees
  • d stands for dimension
  • C stands for construction
  • Q stands for query and activation time
  • For 4 strains of E. coli, the range tree did not
    fit in memory; the estimated space consumption is 7.1
    Gb

25
Including Gap Costs
Recall the recurrence
$f_j.\text{score} = f_j.\text{weight} + \max_{f_i \ll f_j} \{ f_i.\text{score} - g(f_i, f_j) \}$
The gap cost should be included in the RMQ,
otherwise the algorithm would be quadratic.
$f_j.\text{score} = f_j.\text{weight} + \text{RMQ}(O, \text{start}(f_j))$
26
Including Gap Costs in L1
  • We define the geometric cost of a fragment $f$ as
    follows

$gc(f) = d_1(t, \text{end}(f))$
where $d_1(t, \text{end}(f))$ is the distance in the $L_1$
metric between $t$ and $\text{end}(f)$.
[Figure: fragments $f_1$ and $f_2$ preceding fragment $f$]
$f_1.\text{score} - g(f_1, f) > f_2.\text{score} - g(f_2, f)$
iff $f_1.\text{score} - gc(f_1) > f_2.\text{score} - gc(f_2)$
  • $gc(f)$ is a constant that can be precomputed and
    attached to the fragment's weight
  • We activate a fragment with $f.\text{score} - gc(f)$
    instead of $f.\text{score}$

The inclusion of the gap cost can be done at no
extra cost → the same complexity as the
algorithm with no gap cost
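The equivalence can be derived in a few lines. The derivation below assumes that the $L_1$ gap cost is $g(f_i, f) = d_1(\text{start}(f), \text{end}(f_i))$ for $f_i \ll f$, and that no coordinate exceeds the corresponding coordinate of $t$, so every difference is non-negative and the absolute values can be dropped.

```latex
\begin{align*}
gc(f_i) &= d_1(t, \text{end}(f_i))
         = \sum_{r=1}^{k} \bigl(t.x_r - \text{end}(f_i).x_r\bigr) \\
        &= \sum_{r=1}^{k} \bigl(t.x_r - \text{start}(f).x_r\bigr)
         + \sum_{r=1}^{k} \bigl(\text{start}(f).x_r - \text{end}(f_i).x_r\bigr) \\
        &= d_1(t, \text{start}(f)) + g(f_i, f), \\[4pt]
f_i.\text{score} - g(f_i, f)
        &= \bigl(f_i.\text{score} - gc(f_i)\bigr) + d_1(t, \text{start}(f)).
\end{align*}
```

The term $d_1(t, \text{start}(f))$ is the same for every candidate predecessor $f_i$, so ranking candidates by $f_i.\text{score} - gc(f_i)$ gives the same winner as ranking them by $f_i.\text{score} - g(f_i, f)$.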
27
The Local Chaining Problem
Given n weighted fragments from k genomes, a
chain C of colinear non-overlapping fragments has
score
$\text{score}(C) = \sum_i f_i.\text{weight} - \sum_i g(f_i, f_{i-1})$
where $g(f_i, f_{i-1})$ is the gap cost of connecting
$f_i$ to $f_{i-1}$
  • The weight of a fragment is for example its
    length or its statistical significance
  • A local chain C is called optimal if its score
    is maximum over all other chains.

29
Geometric Solution
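The recurrence
$f_j.\text{score} = f_j.\text{weight} + \max_{f_i \ll f_j} \{ 0, f_i.\text{score} - g(f_i, f_j) \}$
can be written as
$f_j.\text{score} = f_j.\text{weight} + \text{RMQ}(O, \text{start}(f_j))$
  • But we have to check:

if $f_j.\text{score} = f_j.\text{weight} + f.\text{score} > 0$, where $f = \text{RMQ}(O, \text{start}(f_j))$,
then connect $f$ to $f_j$;
else start a new chain, starting with $f_j$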
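The local recurrence differs from the global one only in the 0 inside the maximum: a fragment is connected to a predecessor only when that raises its score, and otherwise it starts a new chain. The sketch below evaluates the recurrence by brute force instead of with an RMQ, and it assumes an $L_1$ gap cost measured between the end point of one fragment and the start point of the next. The names `local_chain` and `l1_gap` are illustrative.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[int, ...]
Frag = Tuple[Point, Point, float]  # (start point, end point, weight)


def precedes(f: Frag, g: Frag) -> bool:
    return all(e < s for e, s in zip(f[1], g[0]))


def l1_gap(f: Frag, g: Frag) -> float:
    """Assumed L1 gap cost between end(f) and start(g), for f << g."""
    return float(sum(s - e for e, s in zip(f[1], g[0])))


def local_chain(frags: Sequence[Frag]) -> Tuple[float, List[int]]:
    if not frags:
        return 0.0, []
    order = sorted(range(len(frags)), key=lambda j: frags[j][0])
    score: Dict[int, float] = {}
    prev: Dict[int, Optional[int]] = {}
    for pos, j in enumerate(order):
        best, arg = 0.0, None  # 0 means: start a new chain at f_j
        for i in order[:pos]:
            if precedes(frags[i], frags[j]):
                cand = score[i] - l1_gap(frags[i], frags[j])
                if cand > best:
                    best, arg = cand, i  # connecting f_i to f_j pays off
        score[j] = frags[j][2] + best
        prev[j] = arg
    j: Optional[int] = max(score, key=score.get)
    total, chain = score[j], []
    while j is not None:
        chain.append(j)
        j = prev[j]
    return total, chain[::-1]


frags = [((1, 1), (5, 5), 5), ((7, 6), (10, 9), 4), ((40, 50), (46, 56), 5)]
print(local_chain(frags))  # (6.0, [0, 1])
```

In this example the third fragment lies too far from the first two for a connection to pay off, so it starts its own chain with score 5, while the first two form the best local chain with score 5 + 4 - 3 = 6.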
30
Comparing two bacterial genomes
[Figure: fragments of C. trachomatis versus C. pneumoniae. Red points: forward fragments. Green points: reverse fragments]
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)
  • Fragments of the type maximal exact matches of
    minimum length 12
  • Total number of fragments 288,899

31
Comparing two bacterial genomes
[Figure: chains between C. trachomatis and C. pneumoniae, with the termini of replication marked]
The two genomes
1- C. trachomatis (1.2 Mbp) 2- C. pneumoniae
(1.2 Mbp)
  • Fragments of the type maximal multiple exact
    matches of minimum length 12
  • Total number of fragments 288,899
  • CoCoNUT is fast: it takes minutes to compute
    fragments and local chains, a task that took
    hours with previous methods

32
Conclusions
  • Chaining Algorithms are efficient for
    comparative genomics
  • More variations are needed for real applications in
    biology, e.g., limiting the range search and considering
    overlaps
  • CoCoNUT is a system for comparative genomics
    containing various variations of the chaining
    algorithms
  • Global and local chaining are analogous to
    global and local sequence alignment
  • kd-tree is superior to range tree in practice

33
More on Chaining Algorithms
[1] E. Ohlebusch, M. I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. Handbook of Computational Molecular Biology (Chapter 20), 2005, in press.
[2] M. I. Abouelhoda. Algorithms and a Software System for Comparative Genome Analysis. PhD Thesis, Ulm University, 2005.
34
Thanks for attention