Algorithmic Problems Related to Sequences and Phylogenetic Trees

Slides: 26

Provided by: bhas7

Learn more at: https://www.cs.uic.edu

Algorithmic Problems Related to Sequences and
Phylogenetic Trees

Bhaskar DasGupta
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607-7053
Email dasgupta_at_cs.uic.edu

Outline
Introduction
Substructure Comparison Problems
Sequences
Nonoverlapping local alignment
Proteins
Transformation Based Distances
Phylogenetic Trees
Why compare?
A few distances
Genomes
Syntenic Distance
Conclusions

Computational Molecular Biology
A Computer Scientist's Participation
Get to know the computational problems
Talk to biologists
State the computational problems as precisely as
possible
Investigate computational aspects of the problems
exact solutions
difficult/easy ?
time/space efficient solutions ?
approximate solutions (if exact solution is hard
or not time/space efficient)
guaranteed quality of approximation ? (tradeoff
with space/time?)
deterministic vs. randomized algorithms
implementation aspects
programming cleverness to reduce space/time
algorithmic engineering techniques to reduce
space/time
interaction with the biologists
are the solutions biologically meaningful ?

A Little Computer Science Jargon
When we say: What we really mean
Maximization/minimization problem: a problem in which we maximize/minimize some objective function
Problem is NP-complete/hard: an exact solution for large instances will most likely require too much time
Polynomial-time solution: solvable in reasonable time on a reasonably fast computer
Approximation algorithm: an approximate solution computed in reasonable time
... with approximation ratio r: with an objective function value at least 1/r times the optimum (for maximization) or at most r times the optimum (for minimization)

Substructure Similarity (or, equivalently, Dissimilarity)

[Figure: two structures whose substructures a, b and c are matched pairwise]
a matches to a with similarity 10
b matches to b with similarity 15
c matches to c with similarity 11
total similarity: 10 + 15 + 11 = 36
Goal: match disjoint substructures to maximize total similarity

A Few Complications
Many short vs. fewer long substructures

Measure of similarity between substructures
Examples
rmsd (root-mean-square distance) between 3D
substructures
edit distance between subsequences
syntenic distance between multi-chromosome
genomes
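
As one concrete instance of the measures listed above, the following is a minimal sketch of unit-cost edit distance between two sequences. The function name `edit_distance` and the unit costs for insertion, deletion and substitution are assumptions made for illustration; the slides do not fix a cost model.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit (Levenshtein) distance, computed row by row."""
    # prev[j] = distance between the empty prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance between s[:i] and the empty string
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
```

A smaller distance means more similar fragments, so a similarity score for the matching problems in this talk would have to be derived from it (for example by negation or by a scoring matrix), which is a modeling choice left open here.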

Sequences
Non-overlapping local alignment

total similarity: 10 + 15 = 25

The problem
Input: pairs of fragments, one from each sequence (or, equivalently, a set of rectangles)
the weight of each pair (rectangle) is its similarity measure
Output: a set of pairs (rectangles) such that
no two rectangles overlap on the x-axis (i.e., matched fragments of the first sequence are disjoint)
no two rectangles overlap on the y-axis (i.e., matched fragments of the second sequence are disjoint)
total similarity of the selected fragment pairs is maximized
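
To make the problem statement concrete, here is a small brute-force sketch that enumerates every subset of rectangles and keeps the heaviest feasible one. The `Rect` type, the half-open interval convention and the four sample rectangles are assumptions chosen for illustration (the coordinates are hypothetical, not taken from the slides). It runs in exponential time, so it only serves as a reference for tiny inputs.

```python
from itertools import combinations
from typing import NamedTuple


class Rect(NamedTuple):
    x1: int  # fragment of the first sequence is [x1, x2)
    x2: int
    y1: int  # fragment of the second sequence is [y1, y2)
    y2: int
    w: int   # similarity of the fragment pair


def overlaps(a_lo, a_hi, b_lo, b_hi):
    """True if half-open intervals [a_lo, a_hi) and [b_lo, b_hi) intersect."""
    return a_lo < b_hi and b_lo < a_hi


def conflict(r, s):
    """Two rectangles conflict if their x- or y-projections overlap."""
    return (overlaps(r.x1, r.x2, s.x1, s.x2)
            or overlaps(r.y1, r.y2, s.y1, s.y2))


def best_subset(rects):
    """Exhaustive search: returns (best total weight, chosen rectangles)."""
    best, best_w = (), 0
    for k in range(1, len(rects) + 1):
        for subset in combinations(rects, k):
            if all(not conflict(r, s) for r, s in combinations(subset, 2)):
                total = sum(r.w for r in subset)
                if total > best_w:
                    best, best_w = subset, total
    return best_w, list(best)


# Hypothetical instance with weights 15, 2, 1 and 10.
RECTS = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 4, 6, 2),
    Rect(4, 7, 2, 5, 1),
    Rect(6, 9, 6, 9, 10),
]

if __name__ == "__main__":
    print(best_subset(RECTS))
```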

Further assumption
We can preprocess the input data (rectangles or fragment pairs) to ensure that
for any two rectangles, the projection of one on the y-axis does not enclose that of another
for any two rectangles, the projection of one on the x-axis does not enclose that of another
[Figure: an enclosing pair of projections, not allowed in the input data]

An illustration
[Figure: an input instance on two DNA sequences; the candidate fragment pairs (rectangles) are labeled with similarities 15, 2, 1 and 10]
An optimal solution has total similarity 25

Previous results
(n number of rectangles (fragment pairs))
Bafna, Narayanan and Ravi (WADS95)
NP-complete
O(n^2) time approximation algorithm with approximation ratio 3.25
converts to a problem of finding a maximum-weight independent set in a 5-claw-free graph
gives an approximation algorithm for (d+1)-claw-free graphs with approximation ratio of d - 1 + 1/d (which is 3.25 for d = 4)

Halldórsson (SODA95)
approximation algorithm with approximation ratio
of about 2.5 when all weights are one
again uses clawfree graphs
Berman (SWAT00)
O(n^4) time algorithm with approximation ratio of about 2.5
via clawfree graphs again

Our recent results
(Berman, DasGupta and Muthukrishnan, SODA02)
O(n log n) time approximation algorithm with approximation ratio 3
very simple to implement
uses a 2-phase approach (or, equivalently, the local-ratio technique)
Extensions to d dimensions (d > 2)
Inputs are similarity measures of d fragments, one from each of the given d sequences
Motivation: multiple sequence comparison problems
Generalization of our above approach
O(n d log n) time approximation algorithm with approximation ratio of 2^d - 1
current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira, SODA02)
polynomial time algorithm with approximation ratio 2d
uses repeated linear programming and a continuous version of the local-ratio technique
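
The 2-phase idea mentioned above can be sketched as follows. This is a simplified quadratic-time illustration, not the O(n log n) implementation from the cited paper: the processing order (by right endpoint of the x-projection), the `Rect` type and the sample coordinates are assumptions made for this sketch. The first phase assigns each rectangle a residual value and stacks those with positive value; the second phase pops the stack and keeps every rectangle that does not conflict with one already kept.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x1: int  # half-open x-projection [x1, x2)
    x2: int
    y1: int  # half-open y-projection [y1, y2)
    y2: int
    w: int   # similarity (weight)


def conflict(r, s):
    """Rectangles conflict if their x- or y-projections overlap."""
    x_overlap = r.x1 < s.x2 and s.x1 < r.x2
    y_overlap = r.y1 < s.y2 and s.y1 < r.y2
    return x_overlap or y_overlap


def two_phase(rects):
    # Phase 1 (evaluation): scan by right endpoint on the x-axis and
    # charge each rectangle for the stacked rectangles it conflicts with.
    stack = []  # entries are (rectangle, residual value)
    for r in sorted(rects, key=lambda r: r.x2):
        value = r.w - sum(v for s, v in stack if conflict(r, s))
        if value > 0:
            stack.append((r, value))
    # Phase 2 (selection): pop in reverse order, keep what still fits.
    chosen = []
    while stack:
        r, _ = stack.pop()
        if all(not conflict(r, s) for s in chosen):
            chosen.append(r)
    return chosen


# Hypothetical instance with weights 15, 2, 1 and 10.
RECTS = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 4, 6, 2),
    Rect(4, 7, 2, 5, 1),
    Rect(6, 9, 6, 9, 10),
]

if __name__ == "__main__":
    picked = two_phase(RECTS)
    print(sum(r.w for r in picked), picked)
```

The selection phase guarantees a feasible output by construction; the approximation guarantee depends on the analysis in the cited work and is not re-derived here.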

Common substructure between protein structures
(work in progress.......with Jie Liang and Andrew
Binkowski)

Comparison of two 4-helix bundles that differ by topological rearrangement: ROP and cytochrome b562
Topological cartoons of 1ROP and 256B. Helices are drawn as cylinders and loops as lines.
Residue numbers of structurally equivalent segments are indicated on the cylinders.
The alignment is non-sequential.

Motivation
Discovering similar substructures in different proteins is essential for recognizing remote evolutionary relationships at the level of protein fragments.
A few interesting points
It is not easy to characterize topological structures such as voids, pockets, or tunnels where ligands and other molecules bind.
Current computational tools do not perform very well at discovering similar substructures.
For example:
(a) protein structures are typically represented by distance matrices or contact maps, which record pairwise distances between selected atoms (typically Cα atoms) along the primary sequence
(b) finding common substructures becomes matching submatrices of the two contact maps
(c) heuristic algorithms have been developed and have proven useful, but they are time consuming (typically O(n^6)) and cannot be used for more demanding tasks such as identifying spatial functional motifs
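
The representation in point (a) can be illustrated with a short sketch that builds a distance matrix and a contact map from a list of 3D coordinates, one per selected atom. The function names and the 8.0 cutoff are assumptions for illustration only; the slides do not specify a threshold.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between the selected atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(coords, cutoff=8.0):
    """Boolean matrix: True where two atoms lie within the cutoff.

    The cutoff value is an assumed parameter, not taken from the slides.
    """
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]


if __name__ == "__main__":
    # Three hypothetical atom positions.
    atoms = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (30.0, 0.0, 0.0)]
    for row in contact_map(atoms):
        print(row)
```

With this representation, a common substructure corresponds to a pair of similar submatrices, one in each protein's map, which is the matching problem described in point (b).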

Our approach (work in progress)
reduce the problem to various constrained
rectangle-packing problems
use combinatorial methods (such as the
local-ratio technique) to design approximation
algorithms for these problems
Our final goal
identification of the most discriminating
geometric and chemical features and their
combinations for various proteins
development of a robust method to compute the
similarity/dissimilarity of two shape
distributions of these features

Transformation based distances
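Objects
[Figure: a collection of objects]
Transformation rules (with costs)
[Figure: rules between objects, with costs 15, 12, 10 and 9]
Goal: find the distance between two specified objects
One sequence of transformations: cost 10 + 15 + 9 = 34
Another sequence of transformations: cost 10 + 12 = 22
The distance between the two specified objects is 22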
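
When the set of objects is small enough to enumerate, a transformation-based distance is a shortest-path computation: objects are nodes, rules are weighted edges, and the distance is the cheapest sequence of rules. The sketch below uses Dijkstra's algorithm; the object names `P`, `Q`, `R`, `T` and the way the four costs are wired together are hypothetical, chosen only so that the two paths cost 34 and 22 as in the example above. Rules are treated as one-directional here, which is also an assumption.

```python
import heapq


def transformation_distance(rules, source, target):
    """Cheapest total cost of rules turning source into target, or None."""
    graph = {}
    for before, after, cost in rules:
        graph.setdefault(before, []).append((after, cost))

    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, obj = heapq.heappop(heap)
        if obj == target:
            return dist
        if dist > best.get(obj, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(obj, []):
            new_dist = dist + cost
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(heap, (new_dist, nxt))
    return None  # target cannot be reached


# Hypothetical rules: (object before, object after, cost).
RULES = [
    ("P", "Q", 10),
    ("Q", "R", 15),
    ("R", "T", 9),
    ("Q", "T", 12),
]

if __name__ == "__main__":
    print(transformation_distance(RULES, "P", "T"))
```

For phylogenetic trees and genomes the number of objects grows far too quickly for explicit enumeration, which is why the problems discussed later in the talk are hard.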

Distances between Phylogenetic trees
Objects
Evolutionary trees (phylogenies) on n nodes

Transformation rules: how to modify trees locally, consistent with biological applications?

Why compute distances between phylogenies ?
First motivation
[Figure: the same input data is given to different methods for inferring phylogenies (parsimony method, compatibility method, maximum-likelihood method, distance matrix method); the resulting trees are compared for similarity and discrepancy]

Why compute distances between phylogenies ?
Second motivation
To find out information about rare genetic events
such as recombination or gene conversion

[Figure: illustrations of recombination and gene conversion]

A few distances that we have looked at......
Nearest neighbor interchange (nni) distance
Linear cost subtree transfer distances
Synopsis of our works on these distances
proving that exact solution is NP-hard
providing fast approximate solutions
investigating fixed-parameter tractability
some implementation works .....

Genomic Distance
Syntenic distance between multi-chromosome
genomes
(Ferretti, Nadeau and Sankoff, 1996)
treats genomes at a higher level of abstraction

[Figure: a genome drawn as a set of chromosomes, each containing some of the genes numbered 1 to 10]

order of genes in any chromosome is unknown or
ignored
intra-chromosomal events (e.g., reversal,
transposition) do not affect
chromosomal assignment
inter-chromosomal events are important

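Inter-chromosomal events
Fission: one chromosome splits into two
Fusion: two chromosomes merge into one
(Reciprocal) translocation: two chromosomes exchange parts of their gene content
[Figure: examples of each event on chromosomes with numbered genes]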

Syntenic distance between two genomes
minimum number of fission, fusion and
translocations necessary to transform one genome
to another
Other related problems
finding the median of 3 genomes for the syntenic distance metric
(useful for the phylogenetic tree inference problem from synteny data)
Synopsis of our work on these problems
showing NP-hardness of exact computation
giving efficient approximation algorithms
exhibiting fixed-parameter tractability
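
The definition of syntenic distance can be illustrated with a brute-force breadth-first search over genomes, where each genome is a set of chromosomes and each chromosome is an unordered set of genes. This is only a sketch for tiny inputs, under several assumptions that are not stated on the slides: genes are distinct, both genomes contain the same genes, and a translocation may redistribute the genes of two chromosomes into any two non-empty chromosomes. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def canon(chromosomes):
    """Genome as a frozenset of frozensets (gene order is ignored)."""
    return frozenset(frozenset(c) for c in chromosomes)


def two_way_splits(genes):
    """All ways to split a gene set into two non-empty unordered parts."""
    genes = sorted(genes)
    first, rest = genes[0], genes[1:]
    for k in range(len(rest)):  # the part holding `first` never takes all genes
        for extra in combinations(rest, k):
            part = frozenset((first,) + extra)
            yield part, frozenset(genes) - part


def neighbors(genome):
    """Genomes reachable by one fission, fusion or translocation."""
    chroms = list(genome)
    for c in chroms:  # fission
        for a, b in two_way_splits(c):
            yield (genome - {c}) | {a, b}
    for c1, c2 in combinations(chroms, 2):
        others = genome - {c1, c2}
        merged = c1 | c2
        yield others | {merged}  # fusion
        for a, b in two_way_splits(merged):  # translocation
            if {a, b} != {c1, c2}:
                yield others | {a, b}


def syntenic_distance(source, target):
    start, goal = canon(source), canon(target)
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        genome, dist = queue.popleft()
        for nxt in neighbors(genome):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("genomes must contain the same set of genes")


if __name__ == "__main__":
    g1 = [{1, 2}, {3, 4}, {5}]
    g2 = [{1, 3}, {2, 4, 5}]
    print(syntenic_distance(g1, g2))
```

In the sample, one fusion followed by one translocation transforms the first genome into the second. The search space is the set of all partitions of the genes, so this approach is exponential and is consistent with the NP-hardness noted above rather than a way around it.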

Other problems......