A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search

1
A Combinatorial Approach to Genome-Wide
Ortholog Assignment: Beyond Sequence Similarity
Search
  • Tao Jiang
  • Department of Computer Science
  • University of California, Riverside

Joint work with X. Chen, Z. Fu, J. Zheng, V.
Vacic, P. Nan, Y. Zhong, and S. Lonardi
2
Outline
  • An introduction to orthology
  • Existing ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Minimum Common Substring Partition
  • Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

4
Orthology
  • Homolog
  • Gene family
  • Paralog
  • Duplication
  • Ortholog
  • Speciation

5
Orthology
  • Homolog
  • Gene family
  • Paralog
  • Duplication
  • Ortholog
  • Speciation

[Figure: genes a and b across mouse, chicken, and frog]
7
Orthology: the more complicated picture
[Figure: evolutionary tree labeled with gene A, Speciation 1, Gene duplication 1, genes B and C, Speciation 2 (twice), Gene duplication 2, leaf genes A1, B1, C1, B2, C2, C3, and genomes G1, G2, G3]
8
Significance
  • Orthologous genes in different species are
    evolutionary and functional counterparts.
  • Many methods use orthologs in a critical way
  • Function inference
  • Protein structure prediction
  • Motif finding
  • Phylogenetic analysis
  • Pathway reconstruction
  • and more ...
  • Identification of orthologs, especially exemplar
    genes, is a fundamental and challenging problem.

9
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Minimum Common Substring Partition
  • Maximum Cycle Decomposition
  • Experimental Results
  • Summary and future directions

10
Existing Methods
  • Methods based on sequence similarity
  • BBH
  • Inparanoid/Multiparanoid
  • PhiGs
  • COG/KOG
  • OrthoMCL
  • MGD
  • TOGA/EGO
  • KEGG
  • HomoloGene
  • Methods based on phylogenetic trees
  • Reconciled tree
  • Orthostrapper
  • OrthologID
  • RAP
  • RIO
  • PhyOP
  • TreeFam
  • Methods that take into account gene
    locations
  • Shared genomic synteny

11
Observations
  • Sequence similarity-based methods assume that the
    evolutionary rates of all genes in a homologous
    family are equal and thus the divergence time
    could be estimated by comparing the sequence of
    genes.
  • Tree-based methods critically rely on the
    correctness of reconstructed gene and species
    trees.
  • Global genome rearrangements are not considered
    in gene location-based methods.

12
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • NP-hard
  • A lower bound
  • Minimum Common Substring Partition
  • Maximum Cycle Decomposition
  • Experimental Results
  • Summary and future directions

13
Molecular Evolution
  • Global rearrangement and duplication
  • Inversion/Reversal
  • Translocation
  • Transposition
  • Fusion/Fission
  • Duplication/Loss
  • Local mutation
  • Base substitution
  • Base insertion
  • Base deletion

A complete ortholog assignment system should make
use of information from both levels of molecular
evolution.
14
Genome Rearrangement Operations
15
Example
Given the evolutionary scenario, main ortholog
pairs and inparalogs could be identified in a
straightforward way.
16
The Parsimony Approach
  • Identify homologs using sequence similarity
    search (e.g., BLASTp).
  • Reconstruct the evolutionary scenario on the
    basis of the parsimony principle: postulate the
    minimum possible number of rearrangement events
    and duplication events in the evolution of two
    closely related genomes since their splitting, so
    as to assign orthologs.
  • The ortholog assignment problem can then be
    formulated as the problem of finding a most
    parsimonious transformation from one genome into
    the other, without explicitly inferring their
    ancestral genome.

17
RD (Rearrangement-Duplication) Distance
  • RD distance: $RD = R + D$
  • $R$ denotes the number of rearrangement
    events in a most parsimonious transformation
  • $D$ denotes the number of gene duplications
    in a most parsimonious transformation

18
The key algorithmic problem: SRDD
  • Two related (unichromosomal) genomes
  • No inparalogs, i.e. no post-speciation
    duplications
  • No gene losses, and thus equal gene content
  • Only reversals have occurred
  • Signed Reversal Distance with Duplicates
  • How to find a shortest sequence of reversals
  • Almost untouched in the literature
  • Duplicated genes are present
  • Generalizes the problem of sorting by reversal

19
When there are no (post-speciation) duplications
The most parsimonious rearrangement scenario may
suggest the true orthology.
20
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • NP-hard
  • A lower bound
  • Minimum Common Substring Partition
  • Maximum Cycle Decomposition
  • Experimental Results
  • Summary and future directions

21
Sorting by reversal
  • Sorting a permutation into the identity by
    reversals
  • Distinct genes only
  • Signed vs. unsigned version
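
To make the signed version concrete, here is a minimal sketch of a signed reversal (reverse a segment and flip every sign in it) together with an exhaustive breadth-first search for the reversal distance to the identity. The function names are illustrative and not from the talk, and the search is exponential, so it is only meant for very small permutations.

```python
from collections import deque


def reverse_segment(perm, i, j):
    """Apply a signed reversal to perm[i..j] (inclusive): reverse the order, flip the signs."""
    return perm[:i] + tuple(-g for g in reversed(perm[i:j + 1])) + perm[j + 1:]


def reversal_distance(perm):
    """Exact signed reversal distance to the identity (+1, +2, ..., +n) by BFS.

    Exponential in len(perm); intended for toy examples only.
    """
    start = tuple(perm)
    target = tuple(range(1, len(start) + 1))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, dist = queue.popleft()
        if current == target:
            return dist
        n = len(current)
        for i in range(n):
            for j in range(i, n):
                nxt = reverse_segment(current, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    raise ValueError("input is not a signed permutation of 1..n")


print(reverse_segment((1, 2, 3, 4), 1, 2))  # (1, -3, -2, 4)
print(reversal_distance((-3, -2, -1)))      # 1
print(reversal_distance((2, 1)))            # 3
```

The last call shows why signs matter: swapping two genes that both keep their orientation takes three signed reversals.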

22
Sorting signed permutation
  • Hannenhalli-Pevzner (HP) theory
  • Polynomial-time solvable
  • Breakpoint graph
  • Breakpoint, cycle, hurdle, fortress
  • HP formula

Hannenhalli and Pevzner, STOC, 178-187, 1995
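
For reference, the standard statement of the HP formula for a signed permutation $\pi$ of $n$ genes is given below. The symbol names are the ones commonly used in the literature and are an assumption here, since the slide's own notation is not shown.

```latex
d(\pi) = n + 1 - c(\pi) + h(\pi) + f(\pi)
```

Here $c(\pi)$ is the number of cycles in the breakpoint graph of $\pi$ (framed by $0$ and $n+1$), $h(\pi)$ is the number of hurdles, and $f(\pi)$ is 1 if $\pi$ is a fortress and 0 otherwise.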
23
Sorting unsigned permutation
  • NP-hard (Caprara, 1997)
  • Breakpoint graph
  • Maximum alternating cycle decomposition (NP-hard)
  • 1.375-approximation (Berman, et al. 2002)

Caprara, RECOMB, 75-83, 1997
24
A brief history
The work has also been extended to genomes with
multiple chromosomes (Hannenhalli and Pevzner,
1995; Tesler, 2002; Ozery-Flato and Shamir, 2003).
25
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Computing Minimum Common Substring Partition
  • Computing Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

26
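SRDD: the exhaustive method
  • Given two genomes, take the minimum reversal
    distance over all possible ortholog assignments.
  • The search space is the set of all the possible
    ortholog assignments
  • Each assignment yields the genome after orthologs
    have been assigned
  • Assume one family with ten duplicated genes in
    each genome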
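
The size of the exhaustive search space can be checked with a few lines. This sketch assumes that every gene family has the same number of copies in both genomes, so that a family of size k admits k! one-to-one assignments; `count_assignments` is a hypothetical helper name.

```python
from math import factorial, prod


def count_assignments(family_sizes):
    """Number of one-to-one ortholog assignments, given the size of each gene family.

    Assumes each family has equally many copies in both genomes.
    """
    return prod(factorial(k) for k in family_sizes)


# One family with ten duplicated genes in each genome.
print(count_assignments([10]))  # 3628800
```

A single family of size ten already gives millions of candidate assignments, each of which would need its own reversal-distance computation.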

27
SRDD: hardness
  • SRDD is NP-hard, even when the maximum size of a
    gene family is limited to two.
  • Reduction from the problem of sorting an unsigned
    permutation by reversals

[Figure: reduction example with two "No breakpoint" labels]
28
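SRDD: a lower bound
  • Partial graph
  • For each pair of gene labels, the number of edges
    linking two nodes carrying those labels
  • The number of breakpoints
  • For a pair of related genomes, the reversal
    distance is lower bounded by half the number of
    breakpoints, rounded up, because one reversal
    removes at most two breakpoints.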
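
As an illustration of a breakpoint-based bound, the sketch below handles only the simpler duplicate-free case, not the partial-graph construction on this slide. It counts adjacencies of one signed genome that are not conserved in the other and halves the count, using the fact that a single reversal changes at most two adjacencies. The function names are illustrative.

```python
from math import ceil


def breakpoints(a, b):
    """Count adjacencies of signed genome a that are not conserved in b.

    Both genomes are framed by sentinel genes so that the two ends count too.
    Assumes both contain the same distinct genes (no duplicates).
    """
    cap = max((abs(g) for g in list(a) + list(b)), default=0) + 1
    framed_a = [0] + list(a) + [cap]
    framed_b = [0] + list(b) + [cap]
    adjacent_in_b = set()
    for x, y in zip(framed_b, framed_b[1:]):
        adjacent_in_b.add((x, y))
        adjacent_in_b.add((-y, -x))  # the same adjacency read on the other strand
    return sum((x, y) not in adjacent_in_b for x, y in zip(framed_a, framed_a[1:]))


def reversal_lower_bound(a, b):
    """Each reversal removes at most two breakpoints, hence ceil(b / 2)."""
    return ceil(breakpoints(a, b) / 2)


print(breakpoints((1, -3, -2, 4), (1, 2, 3, 4)))           # 2
print(reversal_lower_bound((1, -3, -2, 4), (1, 2, 3, 4)))  # 1
```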

29
(Sub)optimal assignment rules
  • Rule one
  • Rule two

30
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Computing Minimum Common Substring Partition
  • Computing Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

31
The MCSP problem
  • Minimum Common Substring Partition
  • This may help eliminate many duplicates, but is
    different from syntenic blocks.
  • Given two related genomes, the size of their
    minimum common substring partition bounds their
    reversal distance.

32
MCSP - Hardness
  • Let k-MCSP denote the version of MCSP where each
    gene family is of size at most k. The problem
    k-MCSP is NP-hard for any k > 1.

Petr Kolman gave a linear-time O(k^2)-approximation
algorithm for k-MCSP (MFCS 2005), and thus for
k-SRDD. The approximation ratio was recently
improved to O(k).
Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
33
A special case of MCSP: MCIP
  • Minimum common integer partition (MCIP): given
    two multisets with equal sum, partition the
    numbers to obtain a smallest common multiset.
  • For example, given A = {3, 5, 7} and B =
    {2, 4, 4, 5}, the answer is C = {1, 2, 3, 4, 5}.
  • The problem is NP-hard. A 1.25-approximation
    algorithm based on set packing has been given in
    Chen et al., 2006.
  • It can be extended to k input multisets
    (Woodruff, 2006; Zhao et al., 2006).

Chen, Liu, Liu, and Jiang, CIAC, 2006.
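
The example on this slide can be checked mechanically. The sketch below is a small backtracking verifier, not the approximation algorithm from the cited paper: `refines(parts, targets)` tests whether a multiset of positive integers can be grouped so that the group sums are exactly the target numbers. Both function names are illustrative.

```python
def refines(parts, targets):
    """True if `parts` can be split into groups whose sums are exactly `targets`.

    Every part is used once; all numbers are assumed to be positive integers.
    """
    parts = sorted(parts, reverse=True)
    remaining = list(targets)

    def place(idx):
        if idx == len(parts):
            return all(room == 0 for room in remaining)
        tried = set()
        for k, room in enumerate(remaining):
            if room >= parts[idx] and room not in tried:
                tried.add(room)  # equal remainders are interchangeable
                remaining[k] -= parts[idx]
                if place(idx + 1):
                    return True
                remaining[k] += parts[idx]
        return False

    return place(0)


def is_common_partition(c, a, b):
    """True if multiset c is an integer partition of both a and b."""
    return refines(c, a) and refines(c, b)


A, B, C = [3, 5, 7], [2, 4, 4, 5], [1, 2, 3, 4, 5]
print(is_common_partition(C, A, B))  # True
print(refines(B, A))                 # False
```

The second call shows why five parts are needed here: a common partition with only four parts would have to be B itself, and B cannot be regrouped into A.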
34
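MCSP: pair-match graph
  • A pair-match graph
  • Single match vs. pair match
  • Incompatible pair-matches
  • The maximum independent set problem on the
    pair-match graph is equivalent to the minimum
    common substring partition problem, i.e., the
    minimum partition size equals the genome size
    minus the size of a maximum independent set.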

Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
35
MCSP: approximation
  • Algorithm APPROX-MCSP (input: a pair of related
    genomes)
  • Construct the pair-match graph for the two
    genomes
  • Find an approximation of the minimum vertex cover
    of the graph
  • Identify segments based on the pair-matches left
    outside the vertex cover
  • Output all the segments as a common substring
    partition
  • The size of the common substring partition found
    by APPROX-MCSP is bounded in terms of the ratio
    of the approximation algorithm for vertex cover
    and the genome size. In particular for 2-MCSP,
    the algorithm achieves an approximation ratio of
    1.5.
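
For very small inputs, the quantity that APPROX-MCSP approximates can be computed exactly by exhaustive search, which is handy as a reference when testing a heuristic. The sketch below is not APPROX-MCSP and does not build a pair-match graph. It assumes genomes are lists of signed integers where the absolute value names the gene family, and that a block may match either as-is or reversed with all signs flipped. `min_common_partition` is a hypothetical name.

```python
def min_common_partition(a, b):
    """Size of a minimum common substring partition of two signed genomes.

    Exhaustive search, exponential time; returns None if no common partition exists.
    """
    a, b = list(a), list(b)
    n = len(a)
    if len(b) != n:
        raise ValueError("genomes must have the same length")
    used = [False] * n  # positions of b already covered by a block
    best = n + 1

    def free_matches(block):
        """Start positions in b where the block fits on unused positions."""
        flipped = [-g for g in reversed(block)]
        k = len(block)
        return [s for s in range(n - k + 1)
                if not any(used[s:s + k]) and b[s:s + k] in (block, flipped)]

    def search(i, blocks):
        nonlocal best
        if blocks >= best:
            return  # cannot improve on the best partition found so far
        if i == n:
            best = blocks
            return
        for j in range(n, i, -1):  # try longer blocks first
            block = a[i:j]
            for s in free_matches(block):
                used[s:s + len(block)] = [True] * len(block)
                search(j, blocks + 1)
                used[s:s + len(block)] = [False] * len(block)

    search(0, 0)
    return best if best <= n else None


print(min_common_partition([1, 2, 3, 4], [3, 4, 1, 2]))  # 2
print(min_common_partition([1, 2, 1, 3], [1, 3, 1, 2]))  # 2
```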

36
MCSP for multi-chromosomal genomes
  • For any two genomes and
  • is the number of unmatched genes in a
    maximum gene matching between two genomes
    is the number of chromosomes in each genome
    is the minimum common partition between the
    two genomes.

37
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Computing Minimum Common Substring Partition
  • Computing Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

38
Maximum cycle decomposition
What if there still are some duplicates?
  • Given any two genomes without duplicated genes,
    the (revised) HP formula for computing the
    rearrangement distance between the two genomes is
    as follows

Genome rearrangement distance
(Hannenhalli and Pevzner, 1995; Tesler, 2002;
Ozery-Flato and Shamir, 2003)
39
Generalizing breakpoint graphs to
multi-chromosomal genomes with duplicates
  • Maximum Cycle Decomposition
  • Complete-breakpoint graph
  • Greedy cycle/path decomposition (to maximize
    )
  • Vertex disjoint
  • Pairing condition
  • Edge alternating
  • Example
    and

40
MSOAR
  • MSOAR is a high-throughput system for ortholog
    assignment between closely related genomes.
  • MSOAR employs a heuristic algorithm to calculate
    the rearrangement/duplication (RD) distance
    between two genomes using the sub-optimal
    assignment rules, MCSP and MCD, which can be used
    to reconstruct a most parsimonious evolutionary
    scenario.
  • MSOAR extends SOAR by allowing for
    multi-chromosomal genomes and the detection of
    inparalogs.

41
Noise gene pair detection
  • The previous steps determine a one-to-one gene
    matching between two genomes.
  • Unmatched genes are removed and marked as
    inparalogs.
  • Remove gene pairs whose deletion decreases the
    rearrangement distance by at least
    two. Since each pair incurs two duplications, the
    RD distance will not increase.
  • These deleted genes form inparalogs.
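
The argument behind the removal rule can be written out in one line. The notation is an assumption: $R$ and $D$ are the rearrangement and duplication counts before a pair is removed, $R'$ and $D'$ the counts afterwards, and the RD distance is taken to be their sum.

```latex
R' \le R - 2, \quad D' = D + 2
\;\Longrightarrow\;
R' + D' \le (R - 2) + (D + 2) = R + D
```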

42
An outline of MSOAR
43
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Computing Minimum Common Substring Partition
  • Computing Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

44
Simulated data test
  • First simulated genome: 100 distinct genes
  • Second simulated genome: randomly perform
    reversals on the first genome to obtain another
    genome
  • Experiments
  • One: randomly copy some genes and insert them
    back into the first genome
  • Two: randomly copy some genes and insert them
    back into both genomes
  • (Inserted genes are inparalogs by
    definition.)
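
A minimal sketch of this simulation setup follows. The number of reversals, the number of inserted copies, and the choice to keep each copy's orientation are all assumptions, since the slide does not give them; `simulate_pair` is a hypothetical name.

```python
import random


def simulate_pair(n_genes=100, n_reversals=20, n_copies=5, both=False, seed=0):
    """Generate a pair of related signed genomes.

    The first genome starts as n_genes distinct genes; the second is derived from
    it by random signed reversals. Random gene copies (inparalogs) are then
    inserted into the first genome, or into both when `both` is True.
    """
    rng = random.Random(seed)
    first = list(range(1, n_genes + 1))
    second = list(first)
    for _ in range(n_reversals):
        i, j = sorted(rng.sample(range(n_genes), 2))
        second[i:j + 1] = [-g for g in reversed(second[i:j + 1])]

    def add_inparalogs(genome):
        genome = list(genome)
        for _ in range(n_copies):
            gene = rng.choice(genome)  # copy an existing gene, keeping its sign
            genome.insert(rng.randrange(len(genome) + 1), gene)
        return genome

    first = add_inparalogs(first)
    if both:
        second = add_inparalogs(second)
    return first, second


g, h = simulate_pair()            # experiment one: copies in the first genome only
print(len(g), len(h))             # 105 100
g2, h2 = simulate_pair(both=True) # experiment two: copies in both genomes
print(len(g2), len(h2))           # 105 105
```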

45
Simulated data test
  • Randomly generate two genomes under four
    simulation parameters
  • Average over 20 random instances for each
    parameter set
  • Our heuristic algorithm vs. the iterated
    exemplar algorithm (Sankoff,
    Bioinformatics, 1999)

46
Exemplar algorithm
  • Look for the true exemplar gene of a family
  • Direct descendants of a gene in the most recent
    common ancestral genome
  • Delete all but one member of every gene family on
    each genome
  • Return one with the smallest rearrangement
    distance
  • By iterating the exemplar algorithm, we can
    assign orthology for all duplicated genes.

Sankoff, Bioinformatics, 1999
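
The "delete all but one member of every family, keep the best" step can be sketched by plain enumeration. This is not Sankoff's algorithm as published, which is more refined than brute force. It also swaps in a simple breakpoint count as the rearrangement distance, which is an assumption made to keep the example short. All function names are illustrative.

```python
from itertools import product


def breakpoints(a, b):
    """Adjacencies of signed genome a not conserved in b (duplicate-free genomes)."""
    cap = max((abs(g) for g in list(a) + list(b)), default=0) + 1
    framed_a, framed_b = [0] + list(a) + [cap], [0] + list(b) + [cap]
    adjacent = set()
    for x, y in zip(framed_b, framed_b[1:]):
        adjacent.update({(x, y), (-y, -x)})
    return sum((x, y) not in adjacent for x, y in zip(framed_a, framed_a[1:]))


def exemplar_candidates(genome):
    """Yield every genome obtained by keeping exactly one copy per gene family.

    A family is identified by the absolute value of the signed gene.
    """
    positions = {}
    for idx, gene in enumerate(genome):
        positions.setdefault(abs(gene), []).append(idx)
    for keep in product(*positions.values()):
        kept = set(keep)
        yield [g for i, g in enumerate(genome) if i in kept]


def exemplar_distance(a, b, distance=breakpoints):
    """Smallest distance over all choices of exemplars in both genomes."""
    return min(distance(x, y)
               for x in exemplar_candidates(a)
               for y in exemplar_candidates(b))


print(list(exemplar_candidates([1, 2, 1, 3])))     # [[1, 2, 3], [2, 1, 3]]
print(exemplar_distance([1, 2, 1, 3], [1, 2, 3]))  # 0
```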
47
Real data
  • Homo sapiens
  • Build 36.1 human genome assembly (UCSC hg18,
    March 2006)
  • 20161 protein sequences in total
  • Mus musculus
  • Build 36 mouse genome assembly (UCSC mm8,
    February 2006)
  • 19199 protein sequences in total

48
MSOAR vs. Inparanoid
  • Validation: official gene symbols extracted from
    the UniProt release 6.0 (September 2005)
  • For 20161 human protein sequences and 19199 mouse
    protein sequences, MSOAR assigned 14362 orthologs
    between human and mouse, among which 11050 are
    true positives, 1748 are unknown pairs and 1508
    are false positives, resulting in a sensitivity
    of 92.26% and a specificity of 87.99%.
  • The comparison between MSOAR and Inparanoid (Remm
    et al., J. Mol. Biol., 2001)
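
The specificity figure can be reproduced from the counts on this slide. The slide does not define the measure; the sketch assumes specificity here means TP / (TP + FP), which matches the reported value (this ratio is often called precision elsewhere). Sensitivity is not recomputed because the number of false negatives is not given.

```python
def specificity(true_positives, false_positives):
    """Fraction of assigned pairs with known status that are correct (assumed definition)."""
    return true_positives / (true_positives + false_positives)


value = specificity(true_positives=11050, false_positives=1508)
print(f"{100 * value:.2f}%")  # 87.99%
```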

49
MSOAR vs. Inparanoid
The ortholog pair SNRPB (human) and Snrpb (mouse)
is not a bi-directional best hit, so it could be
missed by sequence-similarity-based ortholog
assignment methods like Inparanoid.
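
To show how a pair can fail the bi-directional best hit test, here is a minimal sketch of that criterion. The gene names and similarity scores are invented toy values, not data from the talk, and the function names are illustrative.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


# Toy scores (made up): h1's best hit is m1, but m1's best hit is h2.
human_vs_mouse = {("h1", "m1"): 90, ("h1", "m2"): 80,
                  ("h2", "m1"): 95, ("h2", "m2"): 70}
mouse_vs_human = {("m1", "h1"): 90, ("m1", "h2"): 95,
                  ("m2", "h1"): 80, ("m2", "h2"): 70}

print(bidirectional_best_hits(human_vs_mouse, mouse_vs_human))  # {('h2', 'm1')}
```

In this toy case h1 ends up with no partner at all, which is the kind of miss the slide describes for methods that rely on sequence similarity alone.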
50
Number of main ortholog pairs assigned by MSOAR
across the chromosome pairs
51
An alignment between syntenic blocks and MSOAR
blocks
52
Validation by HCOP
  • The HGNC Comparison of Orthology Predictions
    (HCOP) is a tool that integrates and displays the
    human-mouse orthology assertions made by Ensembl,
    Homologene, Inparanoid, PhIGS, MGD and HGNC.
    (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)

53
Other validations
  • By PANTHER protein sequence classification
    (ftp://ftp.pantherdb.org/sequence_classifications/)
  • MSOAR identified 14083 ortholog pairs with
    valid Geneid between human and mouse, among which
    11887 pairs have both orthologous genes in the
    same protein subfamily.

54
Orthology mapping human/mouse
55
Orthology mapping human/rat
56
Orthology mapping mouse/rat
57
Sequence-based mapping
Pevzner and Tesler, Genome Research, 2003
58
Outline
  • An introduction to orthology
  • Previous ortholog assignment methods
  • Ortholog assignment via genome rearrangement
  • An introduction to genome rearrangement
  • Computing signed reversal distance with
    duplicates
  • Computing Minimum Common Substring Partition
  • Computing Maximum Cycle Decomposition
  • Experimental results
  • Summary and future directions

59
Summary and future work
  • Presented a novel approach to assign orthologs
    between two genomes via genome rearrangement and
    gene duplication
  • Introduced a rearrangement/duplication (RD)
    distance for genome comparisons
  • Proposed a heuristic algorithm for assigning
    orthologs under maximum parsimony
  • Developed a high-throughput system for ortholog
    assignment (MSOAR)
  • Tested the system on simulated data and real
    genomic data of human and mouse
  • MSOAR vs. Iterated exemplar algorithm
  • MSOAR vs. Inparanoid
  • Various validation methods
  • Future directions
  • More efficient algorithms for MCSP and MCD
  • Refine the evolutionary model for MSOAR
    (transposition, tandem duplication, gene loss,
    etc.)
  • Ortholog assignment for multiple genome
    comparison
  • More explicit treatment of one-to-many and
    many-to-many orthology relationship

60
References
  • X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S.
    Lonardi, and T. Jiang. Computing the assignment
    of orthologous genes via genome rearrangement.
    Proc. 3rd Asia-Pacific Bioinformatics Conference
    (APBC), 2005, pp. 363-378.
  • X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S.
    Lonardi, and T. Jiang. Assignment of orthologous
    genes via genome rearrangement. IEEE/ACM
    Transactions on Computational Biology and
    Bioinformatics (TCBB) 2-4, pp. 302-315, 2005.
  • Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and
    T. Jiang. A parsimony approach to
    genome-wide ortholog assignment. Proc.
    10th Annual International Conference on Research
    in Computational Molecular Biology (RECOMB),
    2006, pp. 578-594.
  • Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and
    T. Jiang. MSOAR: a high-throughput ortholog
    assignment system based on genome rearrangement.
    Submitted, 2007.
  • Z. Fu and T. Jiang. Clustering of main orthologs
    for multiple genomes. To be presented at LSI
    Conference on Computational Systems Biology
    (CSB), 2007.

61
Acknowledgement
  • NSF
  • DoE Genomes to Life (GtL) program
  • National Key Project for Basic Research
  • NSFC
  • Changjiang Visiting Professorship, Tsinghua Univ.
  • Discussion with Marek Chrobak, Petr Kolman, and
    Lan Liu on MCSP and MCIP