Title: Algorithmic Problems Related to Sequences and Phylogenetic Trees
1- Algorithmic Problems Related to Sequences and Phylogenetic Trees
- Bhaskar DasGupta
- Department of Computer Science
- University of Illinois at Chicago
- Chicago, IL 60607-7053
- Email dasgupta_at_cs.uic.edu
2- Outline
- Introduction
- Substructure Comparison Problems
- Sequences
- Nonoverlapping local alignment
- Proteins
- Transformation Based Distances
- Phylogenetic Trees
- Why compare?
- A few distances
- Genomes
- Syntenic Distance
- Conclusions
3- Computational Molecular Biology
- A Computer Scientist's Participation
- Get to know the computational problems
- Talk to biologists
- State the computational problems as precisely as possible
- Investigate computational aspects of the problems
- exact solutions
- difficult/easy?
- time/space efficient solutions?
- approximate solutions (if exact solution is hard or not time/space efficient)
- guaranteed quality of approximation? (tradeoff with space/time?)
- deterministic vs. randomized algorithms
- implementation aspects
- programming cleverness to reduce space/time
- algorithmic engineering techniques to reduce space/time
- interaction with the biologists
- are the solutions biologically meaningful?
4- A Few Pieces of Computer Science Jargon
- When we say: What we really mean
- Maximization/minimization problem: A problem in which we maximize/minimize some objective function
- Problem is NP-complete/hard: An exact solution for large instances will most likely require too much time
- Polynomial-time solution: Solvable in reasonable time on a reasonably fast computer
- Approximation algorithm: An approximate solution computed in reasonable time
- with approximation ratio r: with an objective function value at least 1/r (at most r) times the optimum (for maximization/minimization)
5- Substructure Similarity (or, equivalently, Dissimilarity)
- [Figure: two structures whose substructures labeled a, b and c are matched pairwise]
- a matches a with similarity 10
- b matches b with similarity 15
- c matches c with similarity 11
- total similarity = 36
- Goal: match disjoint substructures to maximize total similarity
6- A Few Complications
- Many short vs. fewer long substructures
- Measure of similarity between substructures
- Examples
- rmsd (root-mean-square distance) between 3D substructures
- edit distance between subsequences
- syntenic distance between multi-chromosome genomes
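As one concrete instance of a similarity measure from the list above, the following is a minimal sketch of edit distance between two sequences. It assumes unit costs for insertion, deletion and substitution (the classical Levenshtein variant); the slides do not say which cost model is intended, and the function name `edit_distance` is illustrative only.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance between s and t.

    Assumption: insert, delete and substitute each cost 1.
    Uses a rolling row, so memory is O(len(t)).
    """
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, a in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to ""
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

A smaller distance means more similar fragments, so a similarity weight for a fragment pair could be derived from it, for example by negation or by subtracting it from a fixed bonus.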
7- Sequences
- Non-overlapping local alignment
- total similarity = 10 + 15 + 25
8- The problem
- Input: pairs of fragments, one from each sequence (or, equivalently, a set of rectangles)
- the weight of each pair (rectangle) is its similarity measure
- Output: a set of pairs (rectangles) such that
- no two rectangles overlap on the x-axis (i.e., matched fragments of the first sequence are disjoint)
- no two rectangles overlap on the y-axis (i.e., matched fragments of the second sequence are disjoint)
- total similarity of selected fragment pairs is maximized
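To make the problem statement concrete, here is a small brute-force sketch that tries every subset of rectangles and keeps the heaviest one with no overlap on either axis. It runs in exponential time and is meant only for checking tiny instances. The `Rect` type, its field names and the use of closed integer intervals are assumptions made for this illustration.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """A fragment pair: [x1, x2] in sequence 1, [y1, y2] in sequence 2 (hypothetical layout)."""
    x1: int
    x2: int
    y1: int
    y2: int
    w: float  # similarity of the two fragments


def conflicts(r: Rect, s: Rect) -> bool:
    """True if the projections of r and s overlap on the x-axis or on the y-axis."""
    x_overlap = r.x1 <= s.x2 and s.x1 <= r.x2
    y_overlap = r.y1 <= s.y2 and s.y1 <= r.y2
    return x_overlap or y_overlap


def best_subset(rects: List[Rect]) -> Tuple[List[Rect], float]:
    """Exhaustively find a maximum-weight set of pairwise non-conflicting rectangles."""
    best: Tuple[Rect, ...] = ()
    best_w = 0.0
    for k in range(1, len(rects) + 1):
        for subset in combinations(rects, k):
            if all(not conflicts(a, b) for a, b in combinations(subset, 2)):
                w = sum(r.w for r in subset)
                if w > best_w:
                    best, best_w = subset, w
    return list(best), best_w
```

For example, with rectangles `Rect(0, 2, 0, 2, 10)`, `Rect(3, 5, 3, 5, 15)` and `Rect(1, 4, 6, 8, 12)`, the third overlaps both others on the x-axis, so the best choice is the first two.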
9- Further assumption
- We can preprocess input data (rectangles or fragment pairs) to ensure that
- for any two rectangles, the projection of one on the y-axis does not enclose that of another
- for any two rectangles, the projection of one on the x-axis does not enclose that of another
- [Figure: a rectangle whose projection encloses that of another; not allowed in the input data]
10- [Figure: example input over two DNA sequences, with rectangles (fragment pairs) of weights 15, 2, 1 and 10]
- An optimal solution of total similarity 25
11- Previous results
- (n = number of rectangles (fragment pairs))
- Bafna, Narayanan and Ravi (WADS95)
- NP-complete
- O(n^2) time approximation algorithm with approximation ratio 3.25
- converts to a problem of finding maximum-weight independent set in a 5-clawfree graph
- gives approximation algorithm for (d+1)-clawfree graphs with approximation ratio of d - 1 + 1/d
- Halldórsson (SODA95)
- approximation algorithm with approximation ratio of about 2.5 when all weights are one
- again uses clawfree graphs
- Berman (SWAT00)
- O(n^4) time algorithm with approximation ratio of about 2.5
- via clawfree graphs again
12- Our recent results
- (Berman, DasGupta and Muthukrishnan, SODA02)
- O(n log n) time approximation algorithm with approximation ratio 3
- very simple to implement
- uses a 2-phase approach (or, equivalently, the local-ratio technique)
- Extensions to d dimensions (d > 2)
- Inputs are similarity measures of d fragments, one from each of given d sequences
- Motivation: multiple sequence comparison problems
- Generalization of our above approach
- O(n d log n) time approximation algorithm with approximation ratio of 2d-1
- current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira, SODA02)
- polynomial time algorithm with approximation ratio 2d
- uses repeated linear programming and continuous version of local-ratio techniques
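The 2-phase (local-ratio) idea mentioned above can be sketched roughly as follows for the two-dimensional case. This is a simplified quadratic-time illustration, not the O(n log n) implementation from the cited work, and it makes no claim about reproducing that paper's exact details. The processing order (by right endpoint on the x-axis), the `Rect` layout and the closed-interval conflict test are assumptions made here.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """A fragment pair: [x1, x2] in sequence 1, [y1, y2] in sequence 2 (hypothetical layout)."""
    x1: int
    x2: int
    y1: int
    y2: int
    w: float


def conflicts(r: Rect, s: Rect) -> bool:
    """True if r and s overlap on the x-axis or on the y-axis."""
    return (r.x1 <= s.x2 and s.x1 <= r.x2) or (r.y1 <= s.y2 and s.y1 <= r.y2)


def two_phase(rects: List[Rect]) -> List[Rect]:
    """Two-phase sketch: evaluate rectangles onto a stack, then select greedily."""
    # Phase 1 (evaluation): scan by right x-endpoint (assumed order). Each
    # rectangle's weight is reduced by the values of conflicting rectangles
    # already on the stack; it is pushed only if something positive remains.
    stack: List[Tuple[Rect, float]] = []
    for r in sorted(rects, key=lambda q: q.x2):
        value = r.w - sum(v for s, v in stack if conflicts(r, s))
        if value > 0:
            stack.append((r, value))

    # Phase 2 (selection): pop in reverse order and keep every rectangle
    # that does not conflict with one already kept.
    chosen: List[Rect] = []
    while stack:
        r, _ = stack.pop()
        if all(not conflicts(r, c) for c in chosen):
            chosen.append(r)
    return chosen
```

On the rectangles `Rect(0, 2, 0, 2, 10)`, `Rect(1, 4, 6, 8, 12)` and `Rect(3, 5, 3, 5, 15)`, the evaluation phase pushes all three with values 10, 2 and 13, and the selection phase keeps the first and the last, for a total similarity of 25.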
13- Common substructure between protein structures
- (work in progress, with Jie Liang and Andrew Binkowski)
- Comparison of two 4-helix bundles that differ by topological rearrangement: ROP and cytochrome b56
- Topological cartoons of 1ROP and 256B. Helices are drawn as cylinders and loops as lines. Residue numbers of structurally equivalent segments are indicated on the cylinders.
- The alignment is non-sequential.
14- Motivation
- discovering similar substructures from different proteins is essential for recognizing remote evolutionary relationships at the level of protein fragments
- A few interesting points
- it is not easy to characterize topological structures such as voids, pockets, or tunnels where ligands and other molecules bind
- Current computational tools do not perform very well on discovering similar substructures
- For example
- (a) protein structures are typically represented by distance matrices or contact maps, which record pairwise inter-distances between selected atoms (typically Cα atoms) on the primary sequences
- (b) finding common substructures becomes matching submatrices of the two contact maps
- (c) heuristic algorithms have been developed and have proven to be useful, but they are time consuming (typically O(n^6)) and cannot be used for more demanding tasks such as identifying spatial functional motifs
15- Our approach (work in progress)
- reduce the problem to various constrained rectangle-packing problems
- use combinatorial methods (such as the local-ratio technique) to design approximation algorithms for these problems
- Our final goal
- identification of the most discriminating geometric and chemical features and their combinations for various proteins
- development of a robust method to compute the similarity/dissimilarity of two shape distributions of these features
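16- Transformation based distances
- Objects
- Transformation rules (with costs)
- [Figure: objects connected by transformation rules with costs 15, 12, 10 and 9]
- Goal: find distance between two specified objects
- one sequence of transformations: cost = 10 + 15 + 9 = 34
- another sequence of transformations: cost = 10 + 12 = 22
- distance between the two objects is 22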
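When the set of objects is small enough to enumerate, a transformation-based distance is a cheapest-path computation, which the sketch below illustrates with Dijkstra's algorithm. The object names `P`, `Q`, `R`, `T` and the exact wiring of the rules are hypothetical; only the costs 10, 15, 9 and 12 and the resulting path costs 34 and 22 come from the slide. Rules are assumed to be reversible at the same cost.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Rule = Tuple[Hashable, Hashable, float]  # (object, object, cost), assumed reversible


def transformation_distance(rules: Iterable[Rule], source, target) -> Optional[float]:
    """Cheapest total cost of rules turning source into target, or None if unreachable."""
    graph: Dict[Hashable, List[Tuple[Hashable, float]]] = {}
    for a, b, cost in rules:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in graph.get(u, []):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


# Hypothetical wiring reproducing the two path costs on the slide:
# P -> Q -> R -> T costs 10 + 15 + 9 = 34, P -> Q -> T costs 10 + 12 = 22.
rules = [("P", "Q", 10), ("Q", "R", 15), ("R", "T", 9), ("Q", "T", 12)]
```

For phylogenetic trees or genomes the space of objects is far too large to enumerate this way, which is one reason the exact distances discussed later turn out to be hard to compute.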
17- Distances between Phylogenetic trees
- Objects
- Evolutionary trees (phylogenies) on n nodes
- Transformation Rules
- How to modify trees locally, consistent with biological applications?
18- Why compute distances between phylogenies?
- First motivation
- input data
- different methods for inferring phylogenies
- parsimony method
- compatibility method
- maximum-likelihood method
- distance matrix method
- compare them for similarity and discrepancy
19- Why compute distances between phylogenies?
- Second motivation
- To find out information about rare genetic events such as recombination or gene conversion
- [Figure: recombination and gene conversion]
20- A few distances that we have looked at
- Nearest neighbor interchange (nni) distance
- Linear cost subtree transfer distances
- Synopsis of our work on these distances
- proving that exact solution is NP-hard
- providing fast approximate solutions
- investigating fixed-parameter tractability
- some implementation work
21- Genomic Distance
- Syntenic distance between multi-chromosome genomes
- (Ferretti, Nadeau and Sankoff, 1996)
- treats genomes at a higher level of abstraction
- [Figure: a genome drawn as chromosomes, each containing a set of genes, with genes numbered 1 to 10]
- order of genes in any chromosome is unknown or ignored
- intra-chromosomal events (e.g., reversal, transposition) do not affect chromosomal assignment
- inter-chromosomal events are important
22- Inter-chromosomal events
- Fission
- Fusion
- (Reciprocal) translocation
- [Figure: examples of fission, fusion and reciprocal translocation on chromosomes with numbered genes]
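The three inter-chromosomal events can be illustrated on a simple set-based model in which a genome is a list of chromosomes and each chromosome is an unordered set of gene labels, matching the abstraction that gene order is ignored. This is only a sketch under that assumption; the function names and argument conventions are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Chromosome = FrozenSet[int]
Genome = List[Chromosome]


def _others(genome: Genome, *skip: int) -> Genome:
    """All chromosomes except those at the given indices."""
    return [c for k, c in enumerate(genome) if k not in skip]


def fusion(genome: Genome, i: int, j: int) -> Genome:
    """Merge chromosomes i and j into one."""
    if i == j:
        raise ValueError("fusion needs two distinct chromosomes")
    return _others(genome, i, j) + [genome[i] | genome[j]]


def fission(genome: Genome, i: int, part: Iterable[int]) -> Genome:
    """Split chromosome i into `part` and the remaining genes."""
    piece = frozenset(part)
    rest = genome[i] - piece
    if not piece or not rest or not piece <= genome[i]:
        raise ValueError("part must be a nonempty proper subset of the chromosome")
    return _others(genome, i) + [piece, rest]


def translocation(genome: Genome, i: int, j: int, new_i: Iterable[int]) -> Genome:
    """Redistribute the genes of chromosomes i and j into two new chromosomes."""
    if i == j:
        raise ValueError("translocation needs two distinct chromosomes")
    pool = genome[i] | genome[j]
    first = frozenset(new_i)
    second = pool - first
    if not first or not second or not first <= pool:
        raise ValueError("new_i must be a nonempty proper subset of the pooled genes")
    return _others(genome, i, j) + [first, second]
```

For example, starting from `[{1, 2, 3}, {4, 5}]`, a translocation with `new_i = {1, 4}` yields `[{1, 4}, {2, 3, 5}]`, while a fusion yields the single chromosome `{1, 2, 3, 4, 5}`.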
23- Syntenic distance between two genomes
- minimum number of fissions, fusions and translocations necessary to transform one genome to another
- Other related problems
- finding the median of 3 genomes for the syntenic distance metric
- (useful for phylogenetic tree inference problem from synteny data)
- Synopsis of our work on these problems
- showing NP-hardness of exact computation
- giving efficient approximation algorithms
- exhibiting fixed-parameter tractability
24- Other problems
- Genome partitioning with applications to DNA microarray chip design
- Consensus sequence reconstruction problems