Combinatorial and Statistical Approaches in Gene Rearrangement Analysis - PowerPoint PPT Presentation

Description:

Title: Department of Computer Science and Engineering and the South Carolina Information Technology Institute Author: buell Last modified by: jtang – PowerPoint PPT presentation

Slides: 58

Provided by: bue96

Learn more at: https://people.computing.clemson.edu

1
Combinatorial and Statistical Approaches in Gene
Rearrangement Analysis

Jijun Tang
Computer Science and Engineering
University of South Carolina
jtang_at_cse.sc.edu
(803) 777-8923

2
Outline

Background
Branch-and-Bound Algorithms for the Median
Problem
Maximum Likelihood Methods for Phylogenetic
Reconstruction
Post-Analysis
Conclusions

3
(No Transcript)
4
Simple Rearrangements
5
Phylogenetic Reconstruction
6
Rearrangement Phylogeny
7
(No Transcript)
8
(No Transcript)
9
Median Problem
Goal: find M such that $D_{AM} + D_{BM} + D_{CM}$ is minimized
NP-hard for most metric distances
10
Multichromosomal Reversal Median problem

To find a median genome that minimizes the sum of
the multichromosomal HP distances on the three edges
Events considered: reversal, translocation,
fusion, fission
Exact and heuristic solvers exist for the
Unichromosomal Reversal Median Problem (reversals
are the only events)

11
Capless Breakpoint Graph

Genome A → non-perfect matching M(A)
Let a, b be adjacent genes in A. Then $(a^t, b^h)$ is
an edge in M(A)
A genome is composed of a set of edges and ends.
Matchings naturally correspond to undirected
genomes (flipping a chromosome does not alter
the matching)

12
Example

Example Genomes
A = {(-5, 1, 6, 3), (2, 4)}
B = {(1, 6), (-5, -4, -3, -2)}

Adjacency Graph
13
Capless Breakpoint Graph
(Figure: capless breakpoint graph G(A,B) of the example genomes, with A-ends and B-ends marked)

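Denote by C(A,B) the number of cycles, by AB the
number of AB-paths, by AA the number of AA-paths,
and by BB the number of BB-paths in G(A,B), and by
n the number of genes
$n = 6$, $C(A,B) = 1$, $AB = 4$,
$d_{HP} = 6 - 1 - 4/2 = 3$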
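
The counts above can be reproduced with a short sketch. It assumes that the example genomes split into chromosomes as A = {(-5, 1, 6, 3), (2, 4)} and B = {(1, 6), (-5, -4, -3, -2)}, that chromosomes are linear, and that a path is named after the genome in which its endpoints are chromosome ends. The function names are illustrative only and are not taken from the presentation.

```python
def extremities(gene):
    """Left and right extremities of a signed gene.

    Assumes the slide's convention that adjacent genes a, b give the
    edge (a^t, b^h), i.e. a positive gene reads head then tail.
    """
    g = abs(gene)
    return ((g, "h"), (g, "t")) if gene > 0 else ((g, "t"), (g, "h"))


def matching(genome):
    """Map each extremity to its partner across every adjacency of the genome."""
    partner = {}
    for chromosome in genome:
        for a, b in zip(chromosome, chromosome[1:]):
            u, v = extremities(a)[1], extremities(b)[0]
            partner[u], partner[v] = v, u
    return partner


def count_components(genome_a, genome_b):
    """Count cycles and AB-, AA-, BB-paths in the union of M(A) and M(B)."""
    m_a, m_b = matching(genome_a), matching(genome_b)
    nodes = [(abs(g), side) for chrom in genome_a for g in chrom for side in "ht"]
    seen, counts = set(), {"cycles": 0, "AB": 0, "AA": 0, "BB": 0}
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # collect the connected component of `start`
            u = stack.pop()
            component.append(u)
            for m in (m_a, m_b):
                v = m.get(u)
                if v is not None and v not in seen:
                    seen.add(v)
                    stack.append(v)
        a_ends = sum(u not in m_a for u in component)  # chromosome ends in A
        b_ends = sum(u not in m_b for u in component)  # chromosome ends in B
        if a_ends == 0 and b_ends == 0:
            counts["cycles"] += 1
        elif a_ends == 1 and b_ends == 1:
            counts["AB"] += 1
        elif a_ends == 2:
            counts["AA"] += 1
        else:
            counts["BB"] += 1
    return counts


A = [[-5, 1, 6, 3], [2, 4]]
B = [[1, 6], [-5, -4, -3, -2]]
counts = count_components(A, B)
n = sum(len(chrom) for chrom in A)
# Arithmetic from the slide; the example has no AA- or BB-paths.
d = n - counts["cycles"] - counts["AB"] // 2
print(counts, d)
```

Under these assumptions the sketch finds one cycle and four AB-paths, which gives the value 3 computed on the slide.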

14
A Lower Bound of the HP Distance

A simpler lower bound uses only the numbers of genes,
cycles, and paths.
Derived from Hannenhalli and Pevzner (1995)
dHP (A,B)n C(A,B) - AB/2 AA - BB
Pseudo-cycle of A and B

15
Pseudo-cycle distance Median Problem

Pseudo-cycle distance
Pseudo-cycle distance Median Problem (PMP): to
find a median genome that minimizes the sum
of the pseudo-cycle distances on the three edges
We use the pseudo-cycle distance as a lower bound
for the HP distance to derive an RMP solver

16
Branch-and-Bound Algorithm

Enumerate the solution genomes gene by gene
(Genome Enumeration)
After enumerating a gene, compute an upper bound
based on the partial solution genome
Bound: check whether the upper bound of the
partial solution is less than a criterion
Branch:
If it is, the partial genome is discarded and
another gene is enumerated
Otherwise, update the criterion and continue the
enumeration
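
The bound-and-branch loop above can be sketched on a simpler stand-in problem. The code below is not the pseudo-cycle solver from the talk: it assumes unichromosomal linear genomes and swaps in a breakpoint-style score (the number of adjacencies the candidate shares with the input genomes), so that the upper bound is easy to state. All function names are hypothetical.

```python
def adjacencies(genome):
    """Signed adjacencies of a linear genome, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))
    return adj


def median_branch_and_bound(genomes):
    """Find a genome sharing the most adjacencies with the input genomes."""
    n = len(genomes[0])
    adj_sets = [adjacencies(g) for g in genomes]
    best = {"score": -1, "genome": None}

    def extend(partial, used, score):
        k = len(partial)
        if k == n:
            if score > best["score"]:  # update the criterion
                best["score"], best["genome"] = score, list(partial)
            return
        # Upper bound: every adjacency still to be placed is shared by all inputs.
        remaining = n - k if k else n - 1
        if score + len(genomes) * remaining <= best["score"]:
            return  # bound: this partial genome cannot beat the best one
        for gene in range(1, n + 1):  # branch: enumerate the next gene
            if gene in used:
                continue
            for signed in (gene, -gene):
                gain = 0
                if partial:
                    gain = sum((partial[-1], signed) in s for s in adj_sets)
                partial.append(signed)
                used.add(gene)
                extend(partial, used, score + gain)
                partial.pop()
                used.discard(gene)

    extend([], set(), 0)
    return best["genome"], best["score"]


median, shared = median_branch_and_bound(
    [[1, 2, 3, 4], [1, 2, -3, 4], [1, 2, 3, 4]]
)
print(median, shared)
```

The pruning test mirrors the slide: a partial genome is dropped as soon as its upper bound cannot improve on the best complete genome found so far.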

17
Genome Enumeration for Multichromosomal Genomes
(Figure: enumeration tree for genomes on genes 1, 2, 3, with branches labeled 2 and -2)
18
Features

Main Components
Contraction Operation
Upper Bound on the number of pseudo-cycles
Genome enumeration
Extension of Caprara's method for unichromosomal
genomes (1999)

19
Contraction Operation

Contraction of $e = (a^t, b^h)$ on M(A): M(A)/e

20
Upper Bound on the Number of Pseudo-cycles

Let S be a genome and $Z = \{G_1, G_2, G_3\}$ a set of
three input genomes

The maximal $\gamma(S,Z)$ is denoted by $\gamma^*$
Based on the triangle inequality, an upper bound on
the number of pseudo-cycles can be derived

21
Notes

$qn - \gamma^*$ is the lower bound of the sum of
pseudo-cycle distances between any S and each
genome in $Z = \{G_1, G_2, G_3\}$
Given an edge e, assume genome S contains e and
maximizes $\gamma(S,Z)$; let $Z/e = \{G_1/e, G_2/e, G_3/e\}$, and
assume $S'$ maximizes $\gamma(S', Z/e)$; then $S = S' \cup \{e\}$

22
Upper Bound Test

In a step of the algorithm, the current partial
solution is $S_i = \{e_1, e_2, \ldots, e_i\}$
The upper bound of $\gamma(S,Z)$ for genomes containing $S_i$
is the following

Let UB be the current upper bound
If $UB_{S_i} < UB$, then the best upper bound of the
genomes containing $S_i$ is worse than UB

23
Branch-and-Bound Algorithm for Multichromosomal
Genomes

Compute an initial Upper Bound (UB) from the
input genomes.
In each step, either an end or an edge is fixed
in the solution.
End fixing: mark a node as an end of a
chromosome.
Edge fixing: add an edge e to the current partial
solution genome $S_i$.

24
Genome Enumeration for Multichromosomal Genomes
(Figure: enumeration tree for genomes on genes 1, 2, 3, with branches labeled 2 and -2)

Red line: end fixing
Black line: edge fixing

25
Properties

Can be extended to compute a given tree using
iterative or progressive approaches
However, median computation is still difficult
Large nuclear genomes
Complex events
We also need to search for the best tree in the
large tree space
N species
20 species
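
To give a sense of how large the tree space is, the sketch below uses the standard count of unrooted binary tree topologies on N labeled species, $(2N-5)!!$. Whether the slide counted rooted or unrooted trees is an assumption here, and the function name is illustrative.

```python
def num_unrooted_trees(n_species):
    """Number of unrooted binary topologies on n labeled leaves: (2n - 5)!!."""
    if n_species < 3:
        raise ValueError("need at least 3 species")
    count = 1
    for factor in range(3, 2 * n_species - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= factor
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

For 20 species the count is already on the order of $10^{20}$, which is why exhaustive search over trees is not an option.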

26
Statistical Approaches

Combinatorial approaches are the focus of genome
rearrangement research
Only one MCMC method exists
Maximum Likelihood methods have been very popular
in sequence phylogenetic analysis
Bootstrapping (data resampling) is a popular
method to assess quality of obtained trees
Hard to directly apply ML and bootstrapping to
gene order

27
Sequence ML Phylogeny

For each position, generate all possible tree
structures
Based on the evolutionary model, calculate
likelihood of these trees and sum them to get the
column likelihood
Calculate tree likelihood by multiplying the
likelihood for each position
Choose tree with the greatest likelihood
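
The column-by-column computation can be sketched by brute force for four taxa. The Jukes-Cantor model, the rooted tree shape ((w,x),(y,z)), and the common branch length of 0.1 are assumptions chosen for illustration; the presentation does not name a model or branch lengths. The sequences are the ones used in the example that follows.

```python
import itertools
import math

STATES = "acgt"


def jc69(x, y, t):
    """Jukes-Cantor probability of state x becoming y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def column_likelihood(column, t=0.1):
    """Likelihood of one column on the rooted tree ((w,x),(y,z)).

    Sums over every assignment of states to the root and the two
    internal nodes (4**3 paths), as in the "sum of all paths" step.
    """
    w, x, y, z = column
    total = 0.0
    for root, u, v in itertools.product(STATES, repeat=3):
        total += (0.25 * jc69(root, u, t) * jc69(root, v, t)
                  * jc69(u, w, t) * jc69(u, x, t)
                  * jc69(v, y, t) * jc69(v, z, t))
    return total


def log_likelihood(seqs, t=0.1):
    """Log of the product of the column likelihoods over all positions."""
    return sum(math.log(column_likelihood(col, t)) for col in zip(*seqs))


seqs = {"A": "acgcaa", "B": "acataa", "C": "atgtca", "D": "gcgtta"}
# The three ways to pair four taxa; each tuple reads as ((w,x),(y,z)).
topologies = [("A", "B", "C", "D"), ("A", "C", "B", "D"), ("A", "D", "B", "C")]
scores = {order: log_likelihood([seqs[k] for k in order]) for order in topologies}
best = max(scores, key=scores.get)
print(scores, best)
```

Real ML programs also optimize branch lengths and use Felsenstein's pruning algorithm instead of enumerating every path; the enumeration here only mirrors the slides.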

28
Example
A acgcaa
B acataa
C atgtca
D gcgtta
29
All Possible Evolutionary Paths (Column 1)
(Figure: each of the three internal nodes can take any of the states a, c, g, t)
30
Likelihood for One Path
a
a
a
g
31
Sum of All Paths (Column 1)
(Figure: each of the three internal nodes can take any of the states a, c, g, t)
32
Whole Sequence
33
MLBE

Convert the gene orders into binary sequences
based on adjacencies
Convert the binary sequences into protein or DNA
sequences
Use RAxML to compute an ML tree on the sequences
Binary encoding was used before for parsimony
analysis, with reasonable results
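
The first two conversion steps can be sketched as follows, assuming unichromosomal linear genomes and treating an adjacency and its reverse complement as the same character. The mapping of 0 and 1 to the nucleotides A and C is a placeholder chosen for illustration; the transcript does not say which mapping MLBE uses. The RAxML step is not shown.

```python
def canonical(a, b):
    """One representative for the adjacency (a, b) and its reverse (-b, -a)."""
    return min((a, b), (-b, -a))


def adjacency_set(genome):
    """All adjacencies of a linear signed gene order."""
    return {canonical(a, b) for a, b in zip(genome, genome[1:])}


def binary_encode(genomes):
    """One binary character per adjacency seen in any genome (1 = present)."""
    sets = {name: adjacency_set(order) for name, order in genomes.items()}
    columns = sorted(set().union(*sets.values()))
    sequences = {
        name: "".join("1" if col in present else "0" for col in columns)
        for name, present in sets.items()
    }
    return columns, sequences


def to_dna(binary_sequence):
    """Hypothetical mapping of binary characters to nucleotides (0 -> A, 1 -> C)."""
    return binary_sequence.translate(str.maketrans("01", "AC"))


columns, encoded = binary_encode({"G1": [1, 2, 3, 4], "G2": [1, -3, -2, 4]})
print(columns)
print(encoded, {name: to_dna(seq) for name, seq in encoded.items()})
```

Each column is one adjacency, so two genomes agree in a column exactly when both contain, or both lack, that adjacency.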

34
Binary Encoding
35
MLBE Sequences
36
Experimental Setup

Generate random trees of N taxa
Each tree is equally likely
Birth-death model is preferred
Starting from the root, apply r events along each
edge
r is the expected number of events
Actual number is sampled between 1 and 2r
Compare the inferred tree with the true tree
using the RF rate
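
The last step, comparing an inferred tree with the true tree, can be sketched as a Robinson-Foulds computation on bipartitions. Trees are written as nested tuples of leaf names, and the rate is normalized by 2(N - 3), the maximum for two binary unrooted trees; both the representation and the normalization are assumptions made for this sketch.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial splits of the leaf set induced by the internal edges."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # each split is stored as the side without it
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = below if reference not in below else all_leaves - below
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    walk(tree)
    return splits


def rf_rate(tree_a, tree_b):
    """Robinson-Foulds distance divided by 2(N - 3)."""
    n = len(leaf_set(tree_a))
    differing = bipartitions(tree_a) ^ bipartitions(tree_b)
    return len(differing) / (2 * (n - 3))


true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(rf_rate(true_tree, inferred))
```

An RF rate of 0 means the two trees share every internal edge; under this normalization 1 means they share none.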

37
(No Transcript)
38
Experimental Results (Equal Content 1)
80% inversion, 20% transposition
39
Experimental Results (Equal Content 2)
80% inversion, 20% transposition
40
Experimental Results (Unequal 1)
90% inversion, 10% del/ins/dup, 5-30 genes per
segment
41
Experimental Results (Unequal 2)
90% inversion, 10% del/ins/dup, 5-30 genes per
segment
42
Multistate Encoding
43
MLME Results (200 genes, 20 genomes)
44
MLME Results (1000 genes, 20 genomes)
45
Post Analysis

Bootstrapping has been widely used to assess the
quality of sequence phylogenies
The same procedure is impossible for gene order
data since there is only one character
We tested jackknifing on simulated data to
determine:
Whether jackknifing is useful
The best jackknifing rate
The threshold for the support values

46
DNA bootstrapping
47
Bootstrapping Results
48
Jackknifing Procedure

Generate a new dataset by removing half of the
genes from the original genomes (orders are
preserved)
Compute a tree on the new dataset
Repeat K times and obtain K replicates
Obtain a consensus tree with support values
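
The resampling step can be sketched as below. Tree inference and consensus building are left out; `rate` is taken to mean the fraction of genes removed, and the function names are hypothetical.

```python
import random


def restrict(genomes, kept_genes):
    """Drop every gene not in kept_genes, preserving order and signs."""
    return [[g for g in genome if abs(g) in kept_genes] for genome in genomes]


def jackknife_replicates(genomes, k, rate=0.5, seed=None):
    """Generate k jackknifed datasets, each removing `rate` of the genes."""
    rng = random.Random(seed)
    genes = sorted({abs(g) for g in genomes[0]})
    n_keep = round(len(genes) * (1 - rate))
    replicates = []
    for _ in range(k):
        kept = set(rng.sample(genes, n_keep))  # same genes kept in every genome
        replicates.append(restrict(genomes, kept))
    return replicates


genomes = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, -4, 5, 2, 8, 10, 9, -7, -6, 3],
]
print(restrict(genomes, {1, 3, 5, 7, 9}))
print(jackknife_replicates(genomes, k=2, rate=0.5, seed=0))
```

Keeping the same gene subset in every genome of a replicate is what makes the reduced genomes comparable; a tree would then be computed on each replicate and the replicate trees summarized in a consensus tree.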

49
An Example: New Genomes

1 2 3 4 5 6 7 8 9 10
1 -4 5 2 8 10 9 -7 -6 3

1 3 5 7 9
1 5 9 -7 3
50
Jackknifing Rate
51
Support Value Threshold - FP
Up to 90% of FPs can be identified with 85% as the
threshold
52
Trees with FP
53
Support Value Threshold - FN
54
Low Support Branches
55
Jackknife Properties

Jackknifing is necessary and useful for gene
order phylogeny, and a large number of errors can
be identified
A 40% jackknifing rate is reasonable
85% is a conservative threshold; 75% can also be
used
Low-support branches should be examined in detail

56
Conclusions