Title: Genome Rearrangement SORTING BY REVERSALS
1Genome Rearrangement: Sorting by Reversals
Ankur Jain, Hoda Mokhtar. CS290I, Spring 2003
2Comparative Genomics
The practice of analyzing and comparing the
genetic material of different species for the
purpose of studying evolution, the function of
genes and inherited diseases. Chromosome
breakage and mistakes in repair, along with a
number of other processes, give rise to changes
in gene order. These have important
consequences for the evolution of species.
3Problem Definition
- During biological evolution, inter- and intra-chromosomal exchanges of chromosomal fragments disrupt the order of genes on a chromosome.
- The genome rearrangement approach uses combinatorial optimization techniques to infer a sequence of rearrangement events that accounts for the differences among genomes.
4Outline
- Problem definition
- Genome Comparison
- Possible chromosomal changes
- Sorting by reversals
  - Previous work
  - Definitions
  - Duality Theorem
- Our technique
  - Bit Vector Method
- Experimental results
  - Synthetic datasets
  - Real datasets
  - Breakpoints Technique
- Conclusions and Future work
5Genome Comparison
- In the late 1980s, a remarkable and novel pattern of evolutionary change was discovered in plant organelles.
- Jeffrey Palmer and his colleagues compared the mitochondrial genomes of cabbage and turnip, which are very closely related. Molecules that are almost identical in gene sequence differ dramatically in gene order (Sridhar, Pevzner 1995).
- This discovery and many other studies proved that genome rearrangements represent a common mode of molecular evolution.
6Cabbage and Turnip
Gene orientation
7Single Chromosome Operations
- Reversal: A section of a chromosome is excised, reversed in orientation, and re-inserted. (a b c1 c2 c3 c4 d e -> a b -c4 -c3 -c2 -c1 d e)
- Transposition: A section of a chromosome is excised and inserted at a new position in the chromosome, without changing orientation. (a b c d -> c d a b)
- Inverted transposition: Exactly like transposition, except that the transposed segment changes orientation. (a b c d -> -d -c a b)
- Gene duplication: A section of a chromosome is duplicated, so that multiple copies exist of every gene in that section. (a b c -> a b c b, a b c -> a b b c)
- Gene loss: A section of a chromosome is excised and lost. (a b c -> a c)
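The first three operations can be sketched on a list of gene labels. This is only an illustration: the representation (a string per gene, with a leading "-" for reversed orientation) and the function names are assumptions, not part of the slides. It also assumes that an inverted segment has both its order and its signs flipped, as in the reversal example.

```python
from typing import List

# Hypothetical representation: one string per gene, "-g" means gene g reversed.
Chromosome = List[str]


def flip(gene: str) -> str:
    """Toggle the orientation sign of a single gene."""
    return gene[1:] if gene.startswith("-") else "-" + gene


def reversal(chrom: Chromosome, start: int, end: int) -> Chromosome:
    """Excise chrom[start:end], reverse its order and orientation, re-insert it."""
    segment = [flip(g) for g in reversed(chrom[start:end])]
    return chrom[:start] + segment + chrom[end:]


def transposition(chrom: Chromosome, start: int, end: int, target: int) -> Chromosome:
    """Excise chrom[start:end] and insert it at index `target` of the remainder."""
    segment = chrom[start:end]
    rest = chrom[:start] + chrom[end:]
    return rest[:target] + segment + rest[target:]


def inverted_transposition(chrom: Chromosome, start: int, end: int, target: int) -> Chromosome:
    """Like transposition, but the moved segment is also reversed and sign-flipped."""
    segment = [flip(g) for g in reversed(chrom[start:end])]
    rest = chrom[:start] + chrom[end:]
    return rest[:target] + segment + rest[target:]


print(reversal(["a", "b", "c1", "c2", "c3", "c4", "d", "e"], 2, 6))
# ['a', 'b', '-c4', '-c3', '-c2', '-c1', 'd', 'e']
print(transposition(list("abcd"), 2, 4, 0))           # ['c', 'd', 'a', 'b']
print(inverted_transposition(list("abcd"), 2, 4, 0))  # ['-d', '-c', 'a', 'b']
```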
8Operations on 2 Chromosomes
- Translocation: The end of one chromosome is broken and attached to the end of another chromosome.
- Fusion: Two chromosomes merge.
- Fission: One chromosome splits up into two chromosomes.
9Genomic Sorting Problem
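- Given genomes $\Pi$ and $\Gamma$, the genomic sorting problem is to find a series of reversals $\rho_1, \rho_2, \ldots, \rho_t$ such that $\Pi \cdot \rho_1 \cdot \rho_2 \cdots \rho_t = \Gamma$ and $t$ is minimal.
- We call $t$ the genomic distance between $\Pi$ and $\Gamma$.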
10Sorting by Reversals
- Genome rearrangements can be modelled by the combinatorial problem of sorting by reversals.
11Sorting by Reversals (Cont.)
Minimum Sorting by Reversals: Given a permutation $\pi$, what is the shortest sequence $(\rho_1, \rho_2, \ldots, \rho_t)$ of reversals that sorts $\pi$?
-> NP-hard (Caprara 97); the complexity was open before that result.
Minimum Signed Sorting by Reversals: Given a signed permutation $\pi$, what is the shortest sequence $(\rho_1, \rho_2, \ldots, \rho_t)$ of reversals that sorts $\pi$?
-> Solvable in polynomial time.
12Sorting of Signed Permutations
- Transforming Cabbage into Turnip, Hannenhalli, S., and Pevzner, P. 95: polynomial algorithm for sorting signed permutations by reversals.
- A Very Elementary Presentation of the Hannenhalli-Pevzner Theory, A. Bergeron 95: polynomial algorithm for sorting signed permutations, efficiently implemented using bit vectors.
- Experiments in Computing Sequences of Reversals, A. Bergeron and F. Strasbourg 95: polynomial algorithm for sorting signed permutations.
- Fast Sorting by Reversal, Berman, P., and Hannenhalli, S. 96: exploits a few combinatorial properties of the cycle graph of a permutation to provide a polynomial algorithm.
- A Faster and Simpler Algorithm for Sorting Signed Permutations by Reversals, Kaplan, H., Shamir, R., and Tarjan, R. 99: $O(n^2)$, using hurdles, cycles and fortresses.
- A Linear-Time Algorithm for Computing Inversion Distance between Signed Permutations with an Experimental Study, Moret and Yan 00: computes the reversal distance (without actually sorting) in $O(n)$ time. Computes the connected components using a stack rather than Union-Find. Hannenhalli-Pevzner 96 (GRAPPA program).
14What is a Permutation?
- Permutation ($\pi$): an ordered arrangement of the set {1, 2, ..., n}.
- Signed permutation ($\pi$): a permutation whose elements are oriented; a reversal switches element orientation.
- 3 -4 7 -6 1 -5 2
- After the reversal $\rho(7, -5)$: 3 -4 5 -1 6 -7 2
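The example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `apply_reversal` and the choice to address the segment by its first and last elements (rather than by positions) are assumptions made for illustration.

```python
def apply_reversal(perm, first, last):
    """Reverse the segment running from element `first` to element `last`
    (inclusive) and flip the sign of every element in it."""
    i, j = perm.index(first), perm.index(last)
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


pi = [3, -4, 7, -6, 1, -5, 2]
print(apply_reversal(pi, 7, -5))  # [3, -4, 5, -1, 6, -7, 2]
```

Applying the same reversal to the result (now delimited by 5 and -7) restores the original permutation, since a reversal is its own inverse.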
15BreakPoint
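- Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1 \pi_2 \ldots \pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$.
- We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.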
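As a small illustration of this definition, the sketch below counts breakpoints of an unsigned permutation, assuming that two neighbours form an adjacency exactly when they differ by 1 and that the permutation is framed by 0 and n + 1. The function name and the sample permutation are chosen here for illustration only.

```python
def count_breakpoints(perm):
    """Count neighbouring pairs of the framed permutation that are not
    consecutive integers (unsigned definition)."""
    n = len(perm)
    extended = [0] + list(perm) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


# Framed: 0 3 1 6 5 2 4 7; only the pair (6, 5) is an adjacency.
print(count_breakpoints([3, 1, 6, 5, 2, 4]))  # 6
print(count_breakpoints([1, 2, 3, 4, 5, 6]))  # 0
```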
16What is breakpoint graph?
The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with $2n + 2$ vertices.
We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge.
We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.
17Breakpoint graph signed case
Straight edges: every other pair of consecutive elements.
Curved edges: every other pair of consecutive integers.
Every connected component of the graph is a cycle.
18Correlation between the breakpoints and reversal distance
- A correlation exists between the reversal distance and the number of breakpoints.
- Sorting by reversals corresponds to eliminating breakpoints.
- Every reversal can eliminate at most 2 breakpoints.
Shamir, 95
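The last point gives a simple lower bound on the reversal distance. As a worked step, writing $b(\pi)$ for the number of breakpoints and $d(\pi)$ for the reversal distance (symbols assumed here for illustration), and noting that the sorted permutation has no breakpoints:

```latex
b(\pi\rho) \;\ge\; b(\pi) - 2 \quad \text{for every reversal } \rho
\qquad\Longrightarrow\qquad
d(\pi) \;\ge\; \frac{b(\pi)}{2}
```

For instance, a permutation with 6 breakpoints needs at least 3 reversals; the bound says nothing about whether 3 are actually enough.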
20Hurdle
Hurdle: an unoriented component whose elements are consecutive.
Simple hurdle: a hurdle whose deletion decreases the number of hurdles.
Super hurdle: a hurdle that is not simple.
21Duality Theorem for Sorting Signed Permutations
Hannenhalli and Pevzner, 1995.
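For every signed permutation $\pi$:
$$d(\pi) = \begin{cases} b(\pi) - c(\pi) + h(\pi) + 1, & \text{if } \pi \text{ is a fortress} \\ b(\pi) - c(\pi) + h(\pi), & \text{otherwise} \end{cases}$$
where $b(\pi)$ is the number of breakpoints, $c(\pi)$ the number of cycles in the breakpoint graph, and $h(\pi)$ the number of hurdles.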
22Safe reversal
c = 3, h = 1
c = 5, h = 2
24Our Approach
- Finding hurdles and fortresses in a graph is difficult and expensive (Kaplan, H., Shamir, R., and Tarjan, R. 99).
- Use oriented sort to remove the oriented components in a graph, and then apply the breakpoint approach to perform the remaining reversals.
- We used the bit-vector approach to perform the oriented sort.
25Oriented Sort
- Choose, among the several candidates, a safe reversal, that is, a reversal that decreases the reversal distance.
- Theorem: The reversal that maximizes the number of oriented vertices is safe (A. Bergeron 95).
26Basic Sorting oriented pair
- An oriented pair $(\pi_i, \pi_j)$ is a pair of consecutive integers, that is, $|\pi_i| - |\pi_j| = \pm 1$, with opposite signs.
- Example: (0 3 1 6 5 -2 4 7)
- Oriented pairs are (1, -2) and (3, -2).
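A brute-force way to list oriented pairs is to test every pair of elements. The sketch below assumes the permutation is already framed with 0 and n + 1, treats 0 as positive, and uses a function name chosen here for illustration.

```python
def oriented_pairs(perm):
    """Return all pairs (perm[i], perm[j]) with i < j whose absolute values
    are consecutive integers and whose signs differ."""
    pairs = []
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            a, b = perm[i], perm[j]
            consecutive = abs(abs(a) - abs(b)) == 1
            opposite_signs = (a < 0) != (b < 0)
            if consecutive and opposite_signs:
                pairs.append((a, b))
    return pairs


print(oriented_pairs([0, 3, 1, 6, 5, -2, 4, 7]))  # [(3, -2), (1, -2)]
```

The quadratic scan is fine for slide-sized examples; it is not meant to reflect the bit-vector implementation.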
27Reversal score
The score of a reversal is the number of oriented pairs in the permutation that results from it.
Example: (0 3 1 6 5 -2 4 7)
- Pair (1, -2): (0 3 1 6 5 -2 4 7) -> (0 3 1 2 -5 -6 4 7), score 2
- Pair (3, -2): (0 3 1 6 5 -2 4 7) -> (0 -5 -6 -1 -3 -2 4 7), score 4
28Algorithm
- As long as $\pi$ has an oriented pair, choose the oriented reversal that has maximal score. Each line shows the current permutation followed by the oriented pair whose reversal is applied next.
(0 3 1 6 5 -2 4 7) (3, -2)
(0 -5 -6 -1 -3 -2 4 7) (-3, 4)
(0 -5 -6 -1 2 3 4 7) (-1, 2)
(0 -5 -6 1 2 3 4 7) (-6, 7)
(0 -5 -4 -3 -2 -1 6 7) (-5, 6)
(0 1 2 3 4 5 6 7)
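The greedy loop can be sketched as follows. This is a minimal illustration, not the bit-vector implementation: it assumes a framed permutation, recomputes scores by brute force, and breaks ties by taking the first candidate of maximal score, so intermediate permutations may differ from the trace above. All function names are chosen here for illustration.

```python
def oriented_pairs(perm):
    """Index pairs (i, j), i < j, of consecutive integers with opposite signs."""
    return [(i, j)
            for i in range(len(perm)) for j in range(i + 1, len(perm))
            if abs(abs(perm[i]) - abs(perm[j])) == 1
            and (perm[i] < 0) != (perm[j] < 0)]


def induced_reversal(perm, i, j):
    """Apply the reversal induced by the oriented pair at positions i < j."""
    lo, hi = (i, j - 1) if perm[i] + perm[j] == 1 else (i + 1, j)
    return perm[:lo] + [-x for x in reversed(perm[lo:hi + 1])] + perm[hi + 1:]


def score(perm):
    """Number of oriented pairs in a permutation."""
    return len(oriented_pairs(perm))


def oriented_sort(perm):
    """While oriented pairs exist, apply an induced reversal of maximal score.
    Returns every intermediate permutation, starting with the input."""
    steps = [perm]
    while oriented_pairs(perm):
        candidates = [induced_reversal(perm, i, j) for i, j in oriented_pairs(perm)]
        perm = max(candidates, key=score)  # first candidate wins ties
        steps.append(perm)
    return steps


steps = oriented_sort([0, 3, 1, 6, 5, -2, 4, 7])
print(len(steps) - 1)  # 5 reversals
print(steps[-1])       # [0, 1, 2, 3, 4, 5, 6, 7]
```

The loop stops as soon as no oriented pair remains, so the caller still has to check whether the last permutation is sorted; if it is not, another technique has to finish the job.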
29Oriented edge
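Let $(\pi_i, \pi_j)$ be a gray edge incident to black edges $(\pi_k, \pi_i)$ and $(\pi_j, \pi_l)$. Then $(\pi_i, \pi_j)$ is oriented if and only if $i - k = j - l$.
Example: edge 20-21 is oriented (it contains 3, an odd number, of vertices). With i = 20, j = 21, k = 22, l = 23: i - k = -2 and j - l = -2.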
Bergeron
Pevzner
30Oriented reversals
- The reversal induced by an oriented pair $(\pi_i, \pi_j)$ is $\rho(i, j-1)$ if $\pi_i + \pi_j = +1$, and $\rho(i+1, j)$ if $\pi_i + \pi_j = -1$.
Reversals that create consecutive integers are always induced by oriented pairs. Such reversals are called oriented reversals.
Example: The pair (1, -2) induces the reversal (0 3 1 6 5 -2 4 7) -> (0 3 1 2 -5 -6 4 7).
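The rule for the induced reversal can be written out directly. The sketch below assumes the usual convention that the reversal covers positions i..j-1 when the pair sums to +1 and positions i+1..j when it sums to -1, with the first element of the pair occurring before the second; the function name is an assumption.

```python
def induced_reversal(perm, a, b):
    """Apply the reversal induced by the oriented pair (a, b),
    where a occurs before b in perm."""
    i, j = perm.index(a), perm.index(b)
    if i >= j:
        raise ValueError("a must occur before b")
    if a + b == 1:
        lo, hi = i, j - 1      # reverse positions i .. j-1
    elif a + b == -1:
        lo, hi = i + 1, j      # reverse positions i+1 .. j
    else:
        raise ValueError("not an oriented pair")
    return perm[:lo] + [-x for x in reversed(perm[lo:hi + 1])] + perm[hi + 1:]


pi = [0, 3, 1, 6, 5, -2, 4, 7]
print(induced_reversal(pi, 1, -2))  # [0, 3, 1, 2, -5, -6, 4, 7]
print(induced_reversal(pi, 3, -2))  # [0, -5, -6, -1, -3, -2, 4, 7]
```

In both outputs the pair ends up as neighbouring consecutive integers (1 2 and -3 -2), which is what makes the reversal "oriented".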
31Interleaving Graph
Two components are adjacent if they overlap but neither of them contains the other.
32Constructing the Bit Matrix
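Consider the sequence P = 3 1 6 5 -2 4 7.
Represent each element $P_i$ with $|P_i| = x$ by the pair $2x-1, 2x$ if $P_i$ is positive, and by $2x, 2x-1$ if $P_i$ is negative, then frame the result with 0 and $2n+1$:
3 1 6 5 -2 4 7
0 | 5 6 | 1 2 | 11 12 | 9 10 | 4 3 | 7 8 | 13 14 | 15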
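The mapping from the signed sequence to its unsigned image can be sketched in a few lines. Only this transformation step is shown, not the bit matrix itself; the function name `unsigned_image` is an assumption.

```python
def unsigned_image(signed):
    """Map each signed element x to (2|x|-1, 2|x|) if x > 0 and to
    (2|x|, 2|x|-1) if x < 0, then frame the result with 0 and 2n+1."""
    n = len(signed)
    image = [0]
    for x in signed:
        a = abs(x)
        image += [2 * a - 1, 2 * a] if x > 0 else [2 * a, 2 * a - 1]
    image.append(2 * n + 1)
    return image


print(unsigned_image([3, 1, 6, 5, -2, 4, 7]))
# [0, 5, 6, 1, 2, 11, 12, 9, 10, 4, 3, 7, 8, 13, 14, 15]
```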
33The Algorithm
Step 1. Select the vertex $v_i$ with the maximum score and perform these operations until the parity of all the vertices is zero.
Step 2. If the sequence is not completely sorted, apply the breakpoint technique to complete the sorting.
35Experimental Settings
- 1- Synthetic datasets
  - Generated random signed permutations of different lengths and evolution rates using the GRIMM permutation generation module.
- 2- Real datasets
  - Used GRAPPA test sets for different species of Campanulaceae (flowering plants).
  - MGR (multiple genome rearrangement) human-mouse gene order data.
  - Genome.org: herpes viruses that affect humans.
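One plausible way to produce such synthetic inputs is to start from the identity and apply a number of random reversals. The sketch below is not the GRIMM generator; it assumes, for illustration only, that the "evolution rate" corresponds to the number of random reversals applied, and the function and parameter names are hypothetical.

```python
import random


def random_signed_permutation(n, num_reversals, seed=None):
    """Start from the identity 1..n and apply `num_reversals` random signed
    reversals (assumed stand-in for the 'evolution rate')."""
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    for _ in range(num_reversals):
        i = rng.randint(0, n - 1)   # left end of the reversed segment
        j = rng.randint(i, n - 1)   # right end, inclusive
        perm[i:j + 1] = [-x for x in reversed(perm[i:j + 1])]
    return perm


sample = random_signed_permutation(50, 20, seed=1)
print(len(sample))  # 50
```

A useful property of this construction is that the true reversal distance is at most `num_reversals`, which gives a sanity check on any sorter's output.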
36Experiment 1 - Synthetic
1- Generated files of random permutations of different lengths (50, 100, 200, 400, 800, 1600), each file with 50 permutations.
2- We computed the number of correctly sorted permutations.
3- Evolution rate varies: 20, 30, 40.
37Experiment 2 - Synthetic
1- Generated files of random permutations of different lengths (50, 100, 200, 400, 800, 1600), each file with 50 permutations.
2- We computed the time needed to obtain the correctly sorted permutations.
3- Evolution rate varies: 20, 30, 40.
38Experiment 3 - Synthetic
1- Generated files of random permutations of length 1000.
2- We computed the time needed to obtain the correctly sorted permutations.
3- Evolution rate varies in increments of 100.
Observation: A saturation state is reached as the evolution rate approaches 1000.
39Experiment 1 - Real
Considered Herpes simplex virus (HSV), Epstein-Barr virus (EBV), and Cytomegalovirus (CMV) gene orders (Hannenhalli et al. 1995), as well as the identity gene order (A).
Observation: Our reversal results matched those obtained in the optimal evolutionary scenario recovered by MGR-MEDIAN.
40Experiment 2 - Real
1- Considered Campanulaceae species.
2- Obtained reversals for Cyananthus (11 reversals), Triodanis (13 reversals), and Symphyandra (12 reversals) versus Tobacco, but failed to sort Platycodon, Legousia and Codonopsis.
Observation: The ones we sorted were sorted with the same number of reversals as GRIMM.
41Experiment 3 - Real
1- Considered Human-Mouse gene order from MGR
12 13 14 15 -9 -8 -7 -6 47 48 -46 -45 -44 -11 -10
-58 -57 -56 92 93 -95 -94 -21 -20 -5 -4 -3 -2 -1
34 35 41 42 43 36 37 38 -64 -63 61 62 65 66 67 68
90 91 -55 -54 51 52 53 39 40 -60 -59 -77 -76 -19
-18 16 17 -97 -96 -75 -74 -73 24 25 78 79 -83 -82
-81 -80 84 85 86 87 -28 -27 -26 22 23 98 99 69 70
-72 -71 -33 -32 -31 -30 -29 88 89 -50 -49 -105
-104 106 107 108 114 115 -117 -116 -103 -102 109
110 111 112 113 -101 -100 118 119 120 121 122 123
(mouse genome; human is the identity)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
35 36 37 38 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 39 40 41 42 43
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69 70 94 95 96 97 98
99 100 101 102 103 104 105 106 107 108 109 110
111 112 113 114 115 116 117 118 119 120 121 122
123 124
Identity
GRIMM sorts the permutation in 41 reversals
42Conclusions
- We implemented a technique that integrates the bit-matrix oriented sorting technique with the greedy breakpoint reversal technique.
- The proposed technique was tested on both real and synthetic data and was able to sort the signed permutations in a fair number of the test cases.
- We think that such integration can yield good results, besides being a simple and relatively fast technique.
- However, the oriented sort algorithm fails to sort permutations that have hurdles; in those cases we have to apply the breakpoint approach.
43Future Work
- We think that the technique we implemented can provide good results, and that further experiments can strengthen our claim.
- We started implementing the algorithm proposed in Kaplan, H., Shamir, R., and Tarjan, R. 99 but did not manage to complete the implementation. We think that having this technique implemented under the same conditions as ours can provide a good source of comparative results and give better confidence in what we propose.
- Applying the technique to different datasets, including exon order rather than gene order.
- Considering different species and trying to compute the reversal distance and use it to confirm phylogenetic trees.
44Oriented Pairs
- An oriented pair $(\pi_i, \pi_j)$ is a pair of consecutive integers, that is, $|\pi_i| - |\pi_j| = \pm 1$, with opposite signs.
- Example: (0 3 1 6 5 -2 4 7)
- Oriented pairs are (1, -2) and (3, -2).
45Reversal Distance Estimation
This reversal distance estimate is very inaccurate. Bafna and Pevzner, 1996 showed that another hidden parameter, hurdles, estimates reversal distance with much greater accuracy.
46Proper reversal
For every permutation
and reversal
Given an arbitrary reversal
denote
(increase in the size of cycle decomposition)
Then for every permutation
and reversal
We call reversal proper if
1
47Oriented pairs
- Oriented pairs are useful because they indicate reversals that create consecutive elements of the permutation.
- Example: The pair (1, -2) induces the reversal (0 3 1 6 5 -2 4 7) -> (0 3 1 2 -5 -6 4 7).