Algorithmic Problems Related to Sequences and Phylogenetic Trees

Slides: 26
Provided by: bhas7
Learn more at: https://www.cs.uic.edu

1
Algorithmic Problems Related to Sequences and
Phylogenetic Trees
  • Bhaskar DasGupta
  • Department of Computer Science
  • University of Illinois at Chicago
  • Chicago, IL 60607-7053
  • Email dasgupta_at_cs.uic.edu

2
  • Outline
  • Introduction
  • Substructure Comparison Problems
  • Sequences
  • Nonoverlapping local alignment
  • Proteins
  • Transformation Based Distances
  • Phylogenetic Trees
  • Why compare?
  • A few distances
  • Genomes
  • Syntenic Distance
  • Conclusions

3
  • Computational Molecular Biology
  • A Computer Scientist's Participation
  • Get to know the computational problems
  • Talk to biologists
  • State the computational problems as precisely as
    possible
  • Investigate computational aspects of the problems
  • exact solutions
  • difficult/easy ?
  • time/space efficient solutions ?
  • approximate solutions (if exact solution is hard
    or not time/space efficient)
  • guaranteed quality of approximation ? (tradeoff
    with space/time?)
  • deterministic vs. randomized algorithms
  • implementation aspects
  • programming cleverness to reduce space/time
  • algorithmic engineering techniques to reduce
    space/time
  • interaction with the biologists
  • are the solutions biologically meaningful ?

4
  • A Few Computer Science Terms
  • When we say: what we really mean
  • Maximization/minimization problem: a problem in
    which we maximize/minimize some objective
    function
  • Problem is NP-complete/hard: an exact solution
    for large instances will most likely require too
    much time
  • Polynomial-time solution: solvable in reasonable
    time on a reasonably fast computer
  • Approximation algorithm: an approximate solution
    computed in reasonable time
  • with approximation ratio r: with an objective
    function value of at least 1/r (for
    maximization) or at most r (for minimization)
    times the optimum
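
The approximation-ratio entry can be stated symbolically. In this sketch, A(I) denotes the objective value the algorithm returns on an instance I and OPT(I) the optimal value; both symbols are notation assumed here, not taken from the slides, and the convention r ≥ 1 is also an assumption.

```latex
\[
  \text{maximization: } A(I) \;\ge\; \frac{1}{r}\,\mathrm{OPT}(I),
  \qquad
  \text{minimization: } A(I) \;\le\; r \cdot \mathrm{OPT}(I),
  \qquad r \ge 1 .
\]
```

Under this convention, a ratio-3 algorithm for a maximization problem always returns at least one third of the optimal total similarity.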

5
  • Substructure Similarity (or, equivalently,
    Dissimilarity)

[Figure: two structures whose substructures a, b and c are matched pairwise]
  • a matches a with similarity 10
  • b matches b with similarity 15
  • c matches c with similarity 11
  • total similarity: 10 + 15 + 11 = 36
  • Goal: match disjoint substructures to maximize
    total similarity

6
  • A Few Complications
  • Many short vs. fewer long substructures
  • Measure of similarity between substructures
  • Examples
  • rmsd (root-mean-square distance) between 3D
    substructures
  • edit distance between subsequences
  • syntenic distance between multi-chromosome
    genomes
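
As one concrete instance of the similarity measures listed above, edit distance can be computed with the standard dynamic program. This is a minimal sketch assuming unit costs for insertion, deletion and substitution; the slides do not specify a cost model, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions (unit costs, an assumption) that turn s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string costs i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]
```

A smaller distance means more similar fragments, so a similarity weight would have to be derived from it, for example by negation or by subtracting it from a constant.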

7
  • Sequences
  • Non-overlapping local alignment

total similarity: 10 + 15 = 25
8
  • The problem
  • Input: pairs of fragments, one from each sequence
    (or, equivalently, a set of rectangles)
  • the weight of each pair (rectangle) is its
    similarity measure
  • Output: a set of pairs (rectangles) such that
  • no two rectangles overlap on the x-axis (i.e.,
    matched fragments of the first sequence are
    disjoint)
  • no two rectangles overlap on the y-axis (i.e.,
    matched fragments of the second sequence are
    disjoint)
  • the total similarity of the selected fragment
    pairs is maximized
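
The problem statement above can be made concrete with a small sketch. Everything named here is an assumption for illustration: the `Rect` type, the half-open integer intervals, and the `example` instance, which only reuses the weights 15, 2, 1 and 10 that appear in the illustration slide. The exhaustive search is exponential and is meant only to pin down what "optimal" means on tiny inputs.

```python
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Rect(NamedTuple):
    x1: int  # fragment of the first sequence, half-open interval [x1, x2)
    x2: int
    y1: int  # fragment of the second sequence, half-open interval [y1, y2)
    y2: int
    weight: float  # similarity of the fragment pair


def overlap(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> bool:
    """True if half-open intervals [a_lo, a_hi) and [b_lo, b_hi) intersect."""
    return a_lo < b_hi and b_lo < a_hi


def conflict(r: Rect, s: Rect) -> bool:
    """Two rectangles conflict if their x- or y-projections overlap."""
    return overlap(r.x1, r.x2, s.x1, s.x2) or overlap(r.y1, r.y2, s.y1, s.y2)


def feasible(rects: Sequence[Rect]) -> bool:
    return not any(conflict(r, s) for r, s in combinations(rects, 2))


def brute_force_optimum(rects: Sequence[Rect]) -> Tuple[List[Rect], float]:
    """Try every subset; exponential time, for tiny instances only."""
    best, best_weight = [], 0
    for k in range(1, len(rects) + 1):
        for subset in combinations(rects, k):
            w = sum(r.weight for r in subset)
            if w > best_weight and feasible(subset):
                best, best_weight = list(subset), w
    return best, best_weight


# Hypothetical instance: weights 15, 2, 1 and 10.
example = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 2, 4, 2),
    Rect(4, 6, 3, 5, 1),
    Rect(5, 8, 5, 8, 10),
]
```

On `example`, the only feasible pairs are (15, 1), (15, 10) and (2, 10), so the optimum picks the rectangles of weight 15 and 10.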

9
  • Further assumption
  • We can preprocess input data (rectangles or
    fragment pairs) to ensure that
  • for any two rectangles, the projection of one on
    the y-axis does not enclose that of another
  • for any two rectangles, the projection of one on
    the x-axis does not enclose that of another
[Figure: a pair of rectangles where one projection encloses the other; not allowed in the input data]

10
  • An illustration
  • Input

[Figure: fragment pairs of two DNA sequences drawn as rectangles with weights 15, 2, 1 and 10]
  • An optimal solution of total similarity 25

11
  • Previous results
  • (n = number of rectangles (fragment pairs))
  • Bafna, Narayanan and Ravi (WADS95)
  • NP-complete
  • O(n^2) time approximation algorithm with
    approximation ratio 3.25
  • converts to a problem of finding a maximum-weight
    independent set in a 5-claw-free graph
  • gives an approximation algorithm for
    (d+1)-claw-free graphs with approximation ratio
    of d - 1 + 1/d
  • Halldórsson (SODA95)
  • approximation algorithm with approximation ratio
    of about 2.5 when all weights are one
  • again uses claw-free graphs
  • Berman (SWAT00)
  • O(n^4) time algorithm with approximation ratio
    of about 2.5
  • via claw-free graphs again

12
  • Our recent results
  • (Berman, DasGupta and Muthukrishnan, SODA02)
  • O(n log n) time approximation algorithm with
    approximation ratio 3
  • very simple to implement
  • uses a 2-phase approach (or, equivalently, the
    local-ratio technique)
  • Extensions to d dimensions (d > 2)
  • Inputs are similarity measures of d fragments,
    one from each of the d given sequences
  • Motivation: multiple sequence comparison problems
  • Generalization of our above approach
  • O(n d log n) time approximation algorithm with
    approximation ratio of 2d-1
  • current best (Bar-Yehuda, Halldórsson, Naor,
    Shachnai and Shapira, SODA02)
  • polynomial time algorithm with approximation
    ratio 2d
  • uses repeated linear programming and continuous
    version of local-ratio techniques
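
A minimal sketch of a two-phase (stack-based local-ratio) scheme of the kind mentioned above, for the 2-dimensional case. Several things are assumptions rather than details from the slides: rectangles are half-open integer intervals, they are scanned in order of their right x-endpoint, and conflicts are found by a plain quadratic scan instead of the data structures an O(n log n) bound would need. The `Rect` type, the `example` instance and all function names are hypothetical, and no approximation guarantee is claimed for this sketch.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Rect(NamedTuple):
    x1: int  # half-open interval [x1, x2) on the first sequence
    x2: int
    y1: int  # half-open interval [y1, y2) on the second sequence
    y2: int
    weight: float


def conflict(r: Rect, s: Rect) -> bool:
    """True if the x-projections or the y-projections overlap."""
    x_overlap = r.x1 < s.x2 and s.x1 < r.x2
    y_overlap = r.y1 < s.y2 and s.y1 < r.y2
    return x_overlap or y_overlap


def two_phase(rects: Sequence[Rect]) -> List[Rect]:
    # Phase 1 (evaluation): scan by right x-endpoint. A rectangle's value is
    # its weight minus the values of conflicting rectangles already stacked;
    # it is pushed only if that value is positive.
    stack: List[Tuple[Rect, float]] = []
    for r in sorted(rects, key=lambda rect: rect.x2):
        value = r.weight - sum(v for s, v in stack if conflict(r, s))
        if value > 0:
            stack.append((r, value))

    # Phase 2 (selection): pop in reverse order and keep every rectangle
    # that does not conflict with one already chosen.
    chosen: List[Rect] = []
    while stack:
        r, _ = stack.pop()
        if not any(conflict(r, s) for s in chosen):
            chosen.append(r)
    return chosen


# Hypothetical instance with weights 15, 2, 1 and 10.
example = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 2, 4, 2),
    Rect(4, 6, 3, 5, 1),
    Rect(5, 8, 5, 8, 10),
]
```

On `example` the evaluation phase stacks the rectangles of weight 15, 1 and 10 (with values 15, 1 and 9), and the selection phase keeps the ones of weight 10 and 15, for a total similarity of 25.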

13
  • Common substructure between protein structures
  • (work in progress.......with Jie Liang and Andrew
    Binkowski)
  • Comparison of two 4-helix bundles that differ by
    topological rearrangement, ROP and cytochrome b562
  • Topological cartoons of 1ROP and 256B. Helices
    are drawn as cylinders and loops as lines.
    Residue numbers of structurally equivalent
    segments are indicated on the cylinders.
  • The alignment is non-sequential.

14
  • Motivation
  • discovering similar substructures in different
    proteins is essential for recognizing remote
    evolutionary relationships at the level of
    protein fragments
  • Few interesting points
  • it is not easy to characterize topological
    structures such as void, pocket, or tunnel where
    ligand and other molecules bind.
  • Current computational tools do not perform very
    well on discovering similar substructures.
  • For example
  • (a) protein structures are typically
    represented by distance matrices or contact maps,
    which record pairwise distances between
    selected atoms (typically Cα atoms) along the
    primary sequence
  • (b) finding common substructures then becomes
    matching submatrices of the two contact maps
  • (c) heuristic algorithms have been developed
    and have proven useful, but they are time
    consuming (typically O(n^6)) and cannot be used
    for more demanding tasks such as identifying
    spatial functional motifs
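
The representation in point (a) can be sketched in a few lines. This is an illustration under stated assumptions: coordinates are given as (x, y, z) tuples in ångströms, one per selected atom, and the 8.0 cutoff for declaring a contact is a commonly used value chosen here, not one given in the slides. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_matrix(coords: Sequence[Point]) -> List[List[float]]:
    """Pairwise Euclidean distances between the selected atoms."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def contact_map(coords: Sequence[Point], cutoff: float = 8.0) -> List[List[bool]]:
    """Entry [i][j] is True when atoms i and j are within `cutoff` of each other.
    The default cutoff is an assumption, not a value from the slides."""
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]


# Hypothetical coordinates of four atoms placed on a line.
atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(atoms)
```

Both matrices are symmetric, so a common substructure of two proteins corresponds to a pair of similar submatrices, one from each map.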

15
  • Our approach (work in progress)
  • reduce the problem to various constrained
    rectangle-packing problems
  • use combinatorial methods (such as the
    local-ratio technique) to design approximation
    algorithms for these problems
  • Our final goal
  • identification of the most discriminating
    geometric and chemical features and their
    combinations for various proteins
  • development of a robust method to compute the
    similarity/dissimilarity of two shape
    distributions of these features

16
  • Transformation based distances
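  • Objects
  • Transformation rules (with costs)
[Figure: objects linked by transformation rules of cost 15, 12, 10 and 9]
  • Goal: find the distance between two specified
    objects
  • one sequence of transformations: cost
    10 + 15 + 9 = 34
  • another sequence of transformations: cost
    10 + 12 = 22
  • the distance between the two objects is 22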
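
Viewed as a graph problem, a transformation-based distance is the cost of a cheapest path between two objects. The sketch below uses Dijkstra's algorithm on a hypothetical four-object instance that reuses the costs 15, 12, 10 and 9 from this slide; the object names, the exact topology, and the assumption that every rule can be applied in both directions are illustrative guesses, since the original figure is not preserved. For trees or genomes the set of objects is far too large to enumerate this way, which is why such distances are hard to compute in general.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Rule = Tuple[Hashable, Hashable, float]  # (object, object, cost)


def transformation_distance(rules: Iterable[Rule],
                            source: Hashable,
                            target: Hashable) -> Optional[float]:
    """Cheapest total cost of rules turning `source` into `target`,
    or None if `target` cannot be reached."""
    graph: Dict[Hashable, List[Tuple[Hashable, float]]] = {}
    for a, b, cost in rules:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))  # assumed reversible

    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None


# Hypothetical objects A..D with the four rule costs from the slide.
rules = [("A", "B", 10), ("B", "C", 15), ("C", "D", 9), ("B", "D", 12)]
```

With these rules, the path A, B, C, D costs 10 + 15 + 9 = 34 and the path A, B, D costs 10 + 12 = 22, so the distance between A and D is 22.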
17
  • Distances between Phylogenetic trees
  • Objects
  • Evolutionary trees (phylogenies) on n nodes

  • Transformation rules: how to modify trees
    locally, consistent with biological applications?
18
  • Why compute distances between phylogenies ?
  • First motivation

  • input data
  • different methods for inferring phylogenies:
    parsimony method, compatibility method,
    maximum-likelihood method, distance matrix method
  • compare them for similarity and discrepancy
19
  • Why compute distances between phylogenies ?
  • Second motivation
  • To find out information about rare genetic events
    such as recombination or gene conversion

[Figure: recombination and gene conversion]
20
  • A few distances that we have looked at
  • Nearest neighbor interchange (nni) distance
  • Linear cost subtree transfer distances
  • Synopsis of our work on these distances
  • proving that exact solution is NP-hard
  • providing fast approximate solutions
  • investigating fixed-parameter tractability
  • some implementation work

21
  • Genomic Distance
  • Syntenic distance between multi-chromosome
    genomes
  • (Ferretti, Nadeau and Sankoff, 1996)
  • treats genomes at a higher level of abstraction

[Figure: a genome whose genes, numbered 1 to 10, are distributed among chromosomes]
  • order of genes in any chromosome is unknown or
    ignored
  • intra-chromosomal events (e.g., reversal,
    transposition) do not affect chromosomal
    assignment
  • inter-chromosomal events are important

22
  • Inter-chromosomal events
  • Fission
  • Fusion
  • (Reciprocal) translocation
[Figure: an example of each event on chromosomes of numbered genes]

23
  • Syntenic distance between two genomes
  • minimum number of fissions, fusions and
    translocations necessary to transform one genome
    into another
  • Other related problems
  • finding the median of 3 genomes for the syntenic
    distance metric
  • (useful for the phylogenetic tree inference
    problem from synteny data)
  • Synopsis of our work on these problems
  • showing NP-hardness of exact computation
  • giving efficient approximation algorithms
  • exhibiting fixed-parameter tractability
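
The three inter-chromosomal events counted by the syntenic distance can be sketched as set operations, since gene order within a chromosome is ignored. This is an illustrative model under stated assumptions: a genome is a list of disjoint, non-empty `frozenset`s of gene numbers, and the function names and argument conventions are hypothetical. Computing the minimum number of such moves is the hard part and is not attempted here.

```python
from typing import FrozenSet, Iterable, List

Genome = List[FrozenSet[int]]  # one unordered set of genes per chromosome


def fusion(genome: Genome, i: int, j: int) -> Genome:
    """Merge chromosomes i and j into a single chromosome."""
    if i == j:
        raise ValueError("fusion needs two distinct chromosomes")
    rest = [c for k, c in enumerate(genome) if k not in (i, j)]
    return rest + [genome[i] | genome[j]]


def fission(genome: Genome, i: int, part: Iterable[int]) -> Genome:
    """Split chromosome i into `part` and the remaining genes."""
    part = frozenset(part)
    if not part or not part < genome[i]:
        raise ValueError("part must be a non-empty proper subset")
    rest = [c for k, c in enumerate(genome) if k != i]
    return rest + [part, genome[i] - part]


def translocation(genome: Genome, i: int, j: int,
                  from_i: Iterable[int], from_j: Iterable[int]) -> Genome:
    """Chromosomes i and j exchange the gene sets from_i and from_j."""
    from_i, from_j = frozenset(from_i), frozenset(from_j)
    if i == j or not from_i <= genome[i] or not from_j <= genome[j]:
        raise ValueError("exchanged genes must come from two distinct chromosomes")
    new_i = (genome[i] - from_i) | from_j
    new_j = (genome[j] - from_j) | from_i
    if not new_i or not new_j:
        raise ValueError("a translocation must leave two non-empty chromosomes")
    rest = [c for k, c in enumerate(genome) if k not in (i, j)]
    return rest + [new_i, new_j]


# Hypothetical genome with two chromosomes.
genome = [frozenset({1, 2, 3}), frozenset({4, 5})]
```

For example, `translocation(genome, 0, 1, {3}, {4})` yields the chromosomes {1, 2, 4} and {3, 5}, while `fusion(genome, 0, 1)` yields the single chromosome {1, 2, 3, 4, 5}.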

24
Other problems......
  • Genome partitioning with applications to DNA
    microarray chip design
  • Consensus sequence reconstruction problems

25
  • THE END