Title: A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search
1A Combinatorial Approach to Genome-Wide Ortholog Assignment: Beyond Sequence Similarity Search
- Tao Jiang
- Department of Computer Science
- University of California, Riverside
Joint work with X. Chen, Z. Fu, J. Zheng, V.
Vacic, P. Nan, Y. Zhong, and S. Lonardi
2Outline
- An introduction to orthology
- Existing ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Minimum Common Substring Partition
- Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
4Orthology
- Homolog
- Gene family
- Paralog
- Duplication
- Ortholog
- Speciation
5Orthology
- Homolog
- Gene family
- Paralog
- Duplication
- Ortholog
- Speciation
[Figure labels: genes a, b; species mouse, chicken, frog]
7Orthology: the more complicated picture
[Figure labels: nodes A, B, C; events Speciation 1, Gene duplication 1, Speciation 2 (twice), Gene duplication 2; genes A1, B1, C1, B2, C2, C3; genomes G1, G2, G3]
8Significance
- Orthologous genes in different species are evolutionary and functional counterparts.
- Many methods use orthologs in a critical way
- Function inference
- Protein structure prediction
- Motif finding
- Phylogenetic analysis
- Pathway reconstruction
- and more ...
- Identification of orthologs, especially exemplar
genes, is a fundamental and challenging problem.
9Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Minimum Common Substring Partition
- Maximum Cycle Decomposition
- Experimental Results
- Summary and future directions
10Existing Methods
- Methods based on sequence similarity
- BBH
- Inparanoid/Multiparanoid
- PhIGs
- Methods based on phylogenetic trees
- Reconciled tree
- Orthostrapper
- OrthologID
- Methods that take gene locations into account
- Shared genomic synteny
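As an illustration of the first family of methods, here is a minimal Python sketch of the bidirectional best hit (BBH) rule. The score table, the gene names, and the function name are hypothetical; in practice the scores would come from a similarity search such as BLASTp.

```python
def bidirectional_best_hits(scores):
    """Return BBH pairs from a dict {(gene_a, gene_b): similarity score}.

    A pair (a, b) is reported only if b is the best hit of a AND a is the
    best hit of b. Higher scores mean more similar (an assumption here).
    """
    best_for_a = {}  # gene in genome A -> (score, best gene in genome B)
    best_for_b = {}  # gene in genome B -> (score, best gene in genome A)
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][0]:
            best_for_a[a] = (s, b)
        if b not in best_for_b or s > best_for_b[b][0]:
            best_for_b[b] = (s, a)
    return sorted(
        (a, b) for a, (_, b) in best_for_a.items() if best_for_b[b][1] == a
    )


# Hypothetical toy scores: a2 and b2 have no reciprocal best partner.
toy_scores = {
    ("a1", "b1"): 90, ("a1", "b2"): 60,
    ("a2", "b1"): 85, ("a2", "b2"): 40,
}
bbh_pairs = bidirectional_best_hits(toy_scores)
print(bbh_pairs)
```

In this toy input only a1/b1 is reported; a2 and b2 are left unassigned because their best hits are not reciprocal.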
11Observations
- Sequence similarity-based methods assume that all genes in a homologous family evolve at equal rates, so that divergence time can be estimated by comparing gene sequences.
- Tree-based methods rely critically on the correctness of the reconstructed gene and species trees.
- Gene location-based methods do not consider global genome rearrangements.
12Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- NP-hard
- A lower bound
- Minimum Common Substring Partition
- Maximum Cycle Decomposition
- Experimental Results
- Summary and future directions
13Molecular Evolution
- Global rearrangement and duplication
- Inversion/Reversal
- Translocation
- Transposition
- Fusion/Fission
- Duplication/Loss
- Local mutation
- Base substitution
- Base insertion
- Base deletion
A complete ortholog assignment system should make
use of information from both levels of molecular
evolution.
14Genome Rearrangement Operations
15Example
Given the evolutionary scenario, main ortholog
pairs and inparalogs could be identified in a
straightforward way.
16The Parsimony Approach
- Identify homologs using sequence similarity search (e.g., BLASTp).
- Reconstruct the evolutionary scenario on the basis of the parsimony principle: postulate the minimum possible number of rearrangement and duplication events in the evolution of two closely related genomes since their split, and assign orthologs accordingly.
- The ortholog assignment problem can then be formulated as finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.
17RD (Rearrangement-Duplication) Distance
- RD distance: the sum of the following two quantities
- the number of rearrangement events in a most parsimonious transformation
- the number of gene duplications in a most parsimonious transformation
18The key algorithmic problem: SRDD
- Two related (unichromosomal) genomes
- No inparalogs, i.e., no post-speciation duplications
- No gene losses, and thus equal gene content
- Only reversals have occurred
- Signed Reversal Distance with Duplicates
- How to find a shortest sequence of reversals
- Almost untouched in the literature
- Duplicated genes are present
- Generalizes the problem of sorting by reversal
19When there are no (post-speciation) duplications
The most parsimonious rearrangement scenario may
suggest the true orthology.
20Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- NP-hard
- A lower bound
- Minimum Common Substring Partition
- Maximum Cycle Decomposition
- Experimental Results
- Summary and future directions
21Sorting by reversal
- Sorting a permutation into the identity by reversals
- Distinct genes only
- Signed vs. unsigned version
22Sorting signed permutation
- Hannenhalli-Pevzner (HP) theory
- Polynomial-time solvable
- Breakpoint graph
- Breakpoint, cycle, hurdle, fortress
- HP formula
Hannenhalli and Pevzner, STOC, 178-187, 1995
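The quantities named on this slide can be explored on small examples. The sketch below is not the Hannenhalli-Pevzner algorithm: it computes the exact signed reversal distance by brute-force breadth-first search (feasible only for a handful of genes) and, separately, counts the cycles c of the breakpoint graph, which gives the standard lower bound n + 1 - c. The full HP formula adds hurdle and fortress terms, d = n + 1 - c + h + f. All function names are illustrative.

```python
from collections import deque


def reversals(perm):
    """Yield every signed permutation that is one reversal away from perm."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            # Reverse the segment i..j and flip the sign of each gene in it.
            middle = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + middle + perm[j + 1:]


def reversal_distance_bfs(perm):
    """Exact signed reversal distance to the identity (exponential time)."""
    start, target = tuple(perm), tuple(range(1, len(perm) + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        for nxt in reversals(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("input is not a signed permutation of 1..n")


def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of a signed permutation."""
    n = len(perm)
    seq = [0]
    for g in perm:
        # +x becomes (2x-1, 2x); -x becomes (2x, 2x-1).
        seq += [2 * g - 1, 2 * g] if g > 0 else [-2 * g, -2 * g - 1]
    seq.append(2 * n + 1)
    black, gray = {}, {}
    for i in range(0, 2 * n + 2, 2):
        u, v = seq[i], seq[i + 1]
        black[u], black[v] = v, u          # adjacent in the permutation
        gray[i], gray[i + 1] = i + 1, i    # adjacent in the identity
    seen, cycles = set(), 0
    for start in range(2 * n + 2):
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = gray[w]
    return cycles


for p in [(-2, -1), (2, 1), (3, -1, 2)]:
    bound = len(p) + 1 - breakpoint_graph_cycles(p)
    print(p, "lower bound", bound, "exact", reversal_distance_bfs(p))
```

For (+2 +1) the cycle bound is 2 while the exact distance is 3; the gap of one is the hurdle term of the HP formula.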
23Sorting unsigned permutation
- NP-hard (Caprara, 1997)
- Breakpoint graph
- Maximum alternating cycle decomposition (NP-hard)
- 1.375-approximation (Berman, et al. 2002)
Caprara, RECOMB, 75-83, 1997
24A brief history
The work has also been extended to genomes with multiple chromosomes (Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003)
25Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Computing Minimum Common Substring Partition
- Computing Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
26SRDD: The exhaustive method
- Given genomes and ,
.
- the set of all the possible ortholog assignments
- the genome after orthologs have been assigned
- Assume one family with ten duplicated genes in each genome
-
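To see why exhaustive enumeration is impractical, the number of candidate one-to-one assignments can be counted directly. This is a small sketch under the SRDD assumption of equal gene content, where a family with k copies in each genome admits k! matchings; representing a genome as a list of signed family identifiers is an assumption made for the example.

```python
from collections import Counter
from math import factorial, prod


def count_assignments(genome_a, genome_b):
    """Number of one-to-one ortholog assignments between two genomes.

    Genes are signed integers; abs(gene) identifies the gene family.
    Requires equal gene content, as assumed in the SRDD setting.
    """
    families_a = Counter(abs(g) for g in genome_a)
    families_b = Counter(abs(g) for g in genome_b)
    if families_a != families_b:
        raise ValueError("genomes must have equal gene content")
    # Each family of size k can be matched in k! ways, independently.
    return prod(factorial(k) for k in families_a.values())


# One family with ten duplicated genes in each genome.
print(count_assignments([1] * 10, [-1] * 10))
```

Each candidate assignment would still require its own reversal-distance computation, so the total cost grows with the product of the factorials of the family sizes.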
27SRDD: Hardness
- SRDD is NP-hard, even when the maximum size of a gene family is limited to two.
- Reduction from the problem of sorting an unsigned permutation by reversals
[Figure labels: "No breakpoint" (twice)]
28SRDD: A lower bound
- Partial graph
-
-
- the number of edges linking two nodes labeled by and , respectively
- The number of breakpoints
- Let and be a pair of related genomes. Their reversal distance is lower bounded by
-
29(Sub)optimal assignment rules
30Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Computing Minimum Common Substring Partition
- Computing Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
31The MCSP problem
- Minimum Common Substring Partition
- This may help eliminate many duplicates, but is different from syntenic blocks.
-
- Given two related genomes and , we have
32MCSP - Hardness
- Let k-MCSP denote the version of MCSP where each gene family is of size at most k. The problem k-MCSP is NP-hard for any k > 1.
Petr Kolman gave a linear-time O(k^2)-approximation algorithm for k-MCSP (MFCS 2005), and thus for k-SRDD. The approximation ratio was recently improved to O(k).
Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
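The MCSP definition can be made concrete with a brute-force search on short inputs. This is a sketch for unsigned strings only (the genome version also allows reversed blocks with flipped signs), it runs in exponential time, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def blocks(seq, cuts):
    """Split seq at the given cut positions and return the list of blocks."""
    bounds = [0, *cuts, len(seq)]
    return [seq[i:j] for i, j in zip(bounds, bounds[1:])]


def min_common_substring_partition(a, b):
    """Return a minimum common substring partition, as the blocks of b."""
    if Counter(a) != Counter(b):
        raise ValueError("inputs must contain the same multiset of symbols")
    n = len(a)
    for k in range(1, n + 1):  # try partitions with k blocks, smallest first
        # Every multiset of blocks obtainable by cutting a into k blocks.
        partitions_of_a = {
            frozenset(Counter(blocks(a, cuts)).items())
            for cuts in combinations(range(1, n), k - 1)
        }
        for cuts in combinations(range(1, n), k - 1):
            candidate = blocks(b, cuts)
            if frozenset(Counter(candidate).items()) in partitions_of_a:
                return candidate
    return []


print(min_common_substring_partition("abcab", "ababc"))
```

Here "abcab" splits as abc|ab and "ababc" as ab|abc, so two blocks suffice even though the symbol "a" and the symbol "b" each occur twice.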
33A special case of MCSP: MCIP
- Minimum common integer partition (MCIP): Given two multisets with equal sum, partition the numbers to result in a smallest common multiset.
- For example, given A = {3, 5, 7} and B = {2, 4, 4, 5}, the answer is C = {1, 2, 3, 4, 5}.
- The problem is NP-hard. A 1.25-approximation algorithm based on set packing has been given in Chen et al., 2006.
- It can be extended to k input multisets (Woodruff, 2006; Zhao et al., 2006).
Chen, Liu, Liu, and Jiang, CIAC2006.
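The example on this slide can be checked by exhaustive search. The sketch below is not the set-packing approximation; it enumerates integer partitions of the common sum by increasing size and returns the first one that refines both input multisets. Function names are illustrative, and the search is only practical for small sums.

```python
def refines(parts, targets):
    """True if parts can be grouped so that the group sums equal targets."""
    parts = sorted(parts, reverse=True)
    remaining = list(targets)

    def place(i):
        if i == len(parts):
            return all(r == 0 for r in remaining)
        tried = set()  # skip bins with identical remaining capacity
        for j, r in enumerate(remaining):
            if r >= parts[i] and r not in tried:
                tried.add(r)
                remaining[j] -= parts[i]
                if place(i + 1):
                    return True
                remaining[j] += parts[i]
        return False

    return sum(parts) == sum(targets) and place(0)


def partitions(n, k, max_part=None):
    """Yield partitions of n into exactly k positive parts, non-increasing."""
    if max_part is None:
        max_part = n
    if k == 0:
        if n == 0:
            yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, k - 1, first):
            yield (first,) + rest


def min_common_integer_partition(a, b):
    """Smallest multiset that is a partition of both a and b (brute force)."""
    if sum(a) != sum(b):
        raise ValueError("multisets must have equal sum")
    total = sum(a)
    for k in range(max(len(a), len(b)), total + 1):
        for c in partitions(total, k):
            if refines(c, a) and refines(c, b):
                return sorted(c)
    return []


print(min_common_integer_partition([3, 5, 7], [2, 4, 4, 5]))
```

For the slide's input, 7 = 1 + 2 + 4 on the A side and 4 = 1 + 3 on the B side, which yields the common multiset of size five.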
34MCSP: Pair-match graph
- A pair-match graph
- Single match vs. pair match
- Incompatible pair-matches
- The maximum independent set problem on is equivalent to the minimum common substring partition problem, i.e.,
.
Goldstein, Kolman, and Zheng, ISAAC, 473-484, 2004
35MCSP: Approximation
- Algorithm APPROX-MCSP( , )
- /* and are a pair of related genomes */
- Construct the pair-match graph for and
- Find an approximation of the vertex cover of
- Identify segments based on the pair-matches in
- Output all the segments as a common substring partition
- If the common substring partition found by the above algorithm APPROX-MCSP is , then where is the ratio of the approximation algorithm for vertex cover and is the genome size. In particular, for 2-MCSP, the algorithm achieves an approximation ratio of 1.5.
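The vertex-cover step of APPROX-MCSP can use any approximation algorithm. As one possibility, here is the textbook factor-2 approximation based on a maximal matching; it is a generic stand-in and not necessarily the routine used in the original work. Construction of the pair-match graph is not shown, and the edge list below is a hypothetical conflict graph.

```python
def vertex_cover_2approx(edges):
    """Greedy maximal-matching vertex cover (at most twice the optimum)."""
    cover = set()
    for u, v in edges:
        # If neither endpoint is covered, the edge joins the matching and
        # both of its endpoints go into the cover.
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


# Hypothetical conflicts between pair-matches p1..p4 (a path graph).
conflicts = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
cover = vertex_cover_2approx(conflicts)
kept = {"p1", "p2", "p3", "p4"} - cover  # vertices outside the cover are mutually compatible
print(sorted(cover), sorted(kept))
```

Vertices left outside a vertex cover always form an independent set, which is why removing a cover leaves a set of mutually compatible pair-matches.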
36MCSP for multi-chromosomal genomes
- For any two genomes and
-
- is the number of unmatched genes in a maximum gene matching between two genomes
- is the number of chromosomes in each genome
- is the minimum common partition between the two genomes.
37Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Computing Minimum Common Substring Partition
- Computing Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
38Maximum cycle decomposition
What if there still are some duplicates?
- Given any two genomes without duplicated genes, the (revised) HP formula for computing the rearrangement distance between the two genomes is as follows
Genome rearrangement distance
(Hannenhalli and Pevzner, 1995; Tesler, 2002; Ozery-Flato and Shamir, 2003)
39Generalizing breakpoint graphs to
multi-chromosomal genomes with duplicates
- Maximum Cycle Decomposition
- Complete-breakpoint graph
- Greedy cycle/path decomposition (to maximize )
- Vertex disjoint
- Pairing condition
- Edge alternating
- Example
and
40MSOAR
- MSOAR is a high-throughput system for ortholog assignment between closely related genomes.
- MSOAR employs a heuristic algorithm to calculate the rearrangement/duplication (RD) distance between two genomes using the sub-optimal assignment rules, MCSP and MCD, which can be used to reconstruct a most parsimonious evolutionary scenario.
- MSOAR extends SOAR by allowing for multi-chromosomal genomes and the detection of inparalogs.
41Noise gene pair detection
- The previous steps determine a one-to-one gene matching between two genomes.
- Unmatched genes are removed and marked as inparalogs.
- Remove gene pairs whose deletion decreases the rearrangement distance by at least two. Since each pair incurs two duplications, the RD distance will not increase.
- These deleted genes form inparalogs.
42An outline of MSOAR
43Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Computing Minimum Common Substring Partition
- Computing Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
44Simulated data test
- Simulated genome : 100 distinct genes
- Simulated genome : Randomly perform reversals on to obtain another genome
- Experiments
- One: Randomly copy some genes and insert them back into
- Two: Randomly copy some genes and insert them back into and
- (Inserted genes are inparalogs by definition.)
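A simulation of this kind can be sketched in a few lines. The number of reversals, the number of inserted copies, the random seed, and the function names below are all assumptions made for illustration; the values used in the actual experiments are not recoverable from the slide.

```python
import random


def random_reversals(genome, num_reversals, rng):
    """Apply random signed reversals to a copy of the genome."""
    genome = list(genome)
    for _ in range(num_reversals):
        # Pick two distinct cut points; the segment between them is non-empty.
        i, j = sorted(rng.sample(range(len(genome) + 1), 2))
        genome[i:j] = [-g for g in reversed(genome[i:j])]
    return genome


def insert_duplicates(genome, num_copies, rng):
    """Copy randomly chosen genes and insert the copies at random positions."""
    genome = list(genome)
    for _ in range(num_copies):
        gene = rng.choice(genome)
        genome.insert(rng.randrange(len(genome) + 1), gene)
    return genome


rng = random.Random(0)                        # fixed seed for reproducibility
ancestor = list(range(1, 101))                # 100 distinct genes
derived = random_reversals(ancestor, 20, rng)       # 20 is a hypothetical value
derived_dup = insert_duplicates(derived, 5, rng)    # 5 is a hypothetical value
print(len(derived), len(derived_dup))
```

Because the inserted copies are created after the two genomes diverge, they are inparalogs by construction, so the true orthology is known and predictions can be scored against it.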
45Simulated data test
- Randomly generate two genomes ( , , , )
- Average over 20 random instances for each parameter set
- Our heuristic algorithm vs. the iterated exemplar algorithm (Sankoff, Bioinformatics, 1999)
46Exemplar algorithm
- Look for the true exemplar gene of a family
- Direct descendants of a gene in the most recent common ancestral genome
- Delete all but one member of every gene family on each genome
- Return one with the smallest rearrangement distance
- By iterating the exemplar algorithm, we can assign orthology for all duplicated genes.
Sankoff, Bioinformatics, 1999
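The exemplar idea can be illustrated by brute force on tiny genomes. In this sketch the breakpoint count serves as a simple stand-in for the rearrangement distance, every combination of kept copies is tried, and all names are illustrative; it is not Sankoff's branch-and-bound procedure.

```python
from itertools import product


def breakpoints(g, h):
    """Adjacencies of g (with end caps) that are not conserved in h."""
    cap = max(abs(x) for x in g) + 1

    def framed(seq):
        return [0, *seq, cap]

    conserved = set()
    fh = framed(h)
    for x, y in zip(fh, fh[1:]):
        conserved.add((x, y))
        conserved.add((-y, -x))  # the same adjacency read on the other strand
    fg = framed(g)
    return sum((x, y) not in conserved for x, y in zip(fg, fg[1:]))


def exemplar_choices(genome):
    """Yield every genome obtained by keeping one copy per gene family."""
    positions = {}
    for idx, gene in enumerate(genome):
        positions.setdefault(abs(gene), []).append(idx)
    for keep in product(*positions.values()):
        kept = set(keep)
        yield [gene for idx, gene in enumerate(genome) if idx in kept]


def exemplar_breakpoint_distance(g, h):
    """Return (distance, exemplar of g, exemplar of h) minimizing breakpoints."""
    return min(
        ((breakpoints(eg, eh), eg, eh)
         for eg in exemplar_choices(g) for eh in exemplar_choices(h)),
        key=lambda t: t[0],
    )


best = exemplar_breakpoint_distance([1, 2, 3, 2], [1, 2, 3])
print(best)
```

In the toy input, keeping the first copy of gene 2 reproduces the other genome exactly, so that copy is selected as the exemplar and the second copy is treated as the duplicate.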
47Real data
- Homo sapiens
- Build 36.1 human genome assembly (UCSC hg18, March 2006)
- 20161 protein sequences in total
- Mus musculus
- Build 36 mouse genome assembly (UCSC mm8, February 2006)
- 19199 protein sequences in total
48MSOAR vs. Inparanoid
- Validation: Official gene symbols extracted from UniProt release 6.0 (September 2005)
- For 20161 human protein sequences and 19199 mouse protein sequences, MSOAR assigned 14362 orthologs between human and mouse, among which 11050 are true positives, 1748 are unknown pairs and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%.
- The comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001)
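The reported specificity is consistent with computing true positives over true plus false positives, with unknown pairs left out. The short check below assumes that definition; the sensitivity cannot be recomputed the same way because the total number of true ortholog pairs is not given on the slide.

```python
true_positives = 11050
false_positives = 1508
unknown_pairs = 1748  # excluded from the ratio under the assumed definition

# Specificity as used here: TP / (TP + FP), expressed as a percentage.
specificity = 100 * true_positives / (true_positives + false_positives)
print(f"specificity = {specificity:.2f}%")
```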
49MSOAR vs. Inparanoid
The ortholog pair SNRPB (Human) and Snrpb (Mouse)
are not bi-directional best hits, which could be
missed by the sequence-similarity based ortholog
assignment methods like Inparanoid.
50Number of main ortholog pairs assigned by MSOAR
across the chromosome pairs
51An alignment between syntenic blocks and MSOAR
blocks
52Validation by HCOP
- The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGs, MGD and HGNC. (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)
53Other validations
- By PANTHER protein sequence classification (ftp://ftp.pantherdb.org/sequence_classifications/)
- MSOAR identified 14083 ortholog pairs with valid Geneid between human and mouse, among which 11887 pairs have both orthologous genes in the same protein subfamily.
54Orthology mapping human/mouse
55Orthology mapping human/rat
56Orthology mapping mouse/rat
57Sequence-based mapping
Pevzner and Tesler, Genome Research, 2003
58Outline
- An introduction to orthology
- Previous ortholog assignment methods
- Ortholog assignment via genome rearrangement
- An introduction to genome rearrangement
- Computing signed reversal distance with duplicates
- Computing Minimum Common Substring Partition
- Computing Maximum Cycle Decomposition
- Experimental results
- Summary and future directions
59Summary and future work
- Presented a novel approach to assign orthologs between two genomes via genome rearrangement and gene duplication
- Introduced a rearrangement/duplication (RD) distance for genome comparisons
- Proposed a heuristic algorithm for assigning orthologs under maximum parsimony
- Developed a high-throughput system for ortholog assignment (MSOAR)
- Tested the system on simulated data and real genomic data of human and mouse
- MSOAR vs. iterated exemplar algorithm
- MSOAR vs. Inparanoid
- Various validation methods
- Future directions
- More efficient algorithms for MCSP and MCD
- Refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.)
- Ortholog assignment for multiple genome comparison
- More explicit treatment of one-to-many and many-to-many orthology relationships
60References
- X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Computing the assignment of orthologous genes via genome rearrangement. Proc. 3rd Asia-Pacific Bioinformatics Conference (APBC), 2005, pp. 363-378.
- X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Assignment of orthologous genes via genome rearrangement. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2-4, pp. 302-315, 2005.
- Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. A parsimony approach to genome-wide ortholog assignment. Proc. 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2006, pp. 578-594.
- Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang. MSOAR: A high-throughput ortholog assignment system based on genome rearrangement. Submitted, 2007.
- Z. Fu and T. Jiang. Clustering of main orthologs for multiple genomes. To be presented at LSI Conference on Computational Systems Biology (CSB), 2007.
61Acknowledgement
- NSF
- DoE Genomes to Life (GtL) program
- National Key Project for Basic Research
- NSFC
- Changjiang Visiting Professorship, Tsinghua Univ.
- Discussion with Marek Chrobak, Petr Kolman, and
Lan Liu on MCSP and MCIP