Title: Information%20Theoretic%20Approach%20to%20Whole%20Genome%20Phylogenies
1Information Theoretic Approach to Whole Genome
Phylogenies
- David Burstein Igor Ulitsky Tamir Tuller
Benny Chor
School Of Computer Science Tel Aviv University
2Tree of Life
- I believe it has been with the tree of life,
which fills with its dead and broken branches the
crust of the earth, and covers the surface with
its ever branching and beautiful
ramifications"... - Charles Darwin, 1859
3Accepted Evolutionary Model Trees
- Initial period Primordial soup, where you are
what you eat. Recombination events. Horizontal
transfers. - Formation of distinct
- taxa. Speciation events
- induce a tree-like
- evolution.
4Phylogenetic Trees Based on What?
- Morphology
- Single genes
- Whole genomes
5Whole Genome Phylogenies Motivation
- Cons for single genes trees
- Require preprocessing
- Gene duplications
- Often too sensitive
- Pros for whole genomes trees
- Fully automatic
- More information
- Seems essential in viruses
- What about proteomes trees?
- Less noise, but do require preprocessing
6Whole Genome Phylogenies Challenges
- Very large inputs Up to 5G bp long
- Extreme length variability (5G to 1M bp)
- No meaningful alignment
- Different segments experienced different
evolutionary processes
7Previous Approaches
- Genome rearrangements (Hannanelly Pevzner
1995,) - Gene/domain contents (Snel et al. 1999,)
- Li et al (2001) Kolmogorov complexity
- Otu et al (2003) Lempel Ziv compression
- Qi et al (2004) Composition vectors
- Common approach (ours too)
- Compute pairwise distances
- Build a tree from distance matrix (e.g. using
Neighbor Joining, Saitou and Nei 1987)
8Genome Rearrangements
- Emphasis on finding best sequence of
rearrangements - Drawbacks
- Requires manual definition of blocks
- Disregards changes within the block
9Gene/Domain Content
- Genome ? equi length Boolean vector
- Various tree construction methods
- The drawback
- Requires gene/domain definition/knowledge
- Disregards most of the genetic information
10Ming Li et al.- Kolomogorov Complexity
- Kolmogorov Complexity is a wonderful measure
- But it is not computable
- Approximate KC by compression
- Drawbacks
- Justification of the approximation
- Compression of one human chromosome
- reportedly took 24 hours (sloooow).
11Otu et al. Lempel-Ziv Distance
- Run LZ compression on genome A.
- Use Genome A dictionary to compress Genome B.
- Log compression ratio (B given A vs. B given B)
- distance (B, A)
- Easy to implement
- Linear running time
- Drawback
- Dictionary size effects
12Qi et al. Composition Vector
- Calculate distributions of the K-tuples.
- For K1 nucleotide/amino acid frequencies.
- For K5 45 (205) possible 5-tuples
- Various methods for scoring distances
- Report K5 as seemingly optimal
13Our Approach Average Common Substring (ACS)
- For every position in Genome A, find the
- longest common substring in Genome B.
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
14Our Approach ACS (cont.)
- For every position in Genome A, find the
- longest common substring in Genome B.
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
15Our Approach ACS (cont.)
- For every position in Genome A, find the
- longest common substring in Genome B.
-
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
16Our Approach ACS (cont.)
- For every position in Genome A, find the
- longest common substring in Genome B.
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
17Our Approach ACS (cont.)
- For every position in Genome A, find the
- longest common substring in Genome B.
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
18Our Approach ACS (cont.)
- For every position in Genome A, find the
length - of longest common substring in Genome B.
- In this case, l( )5.
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
19Our Approach ACS (cont.)
- For every position in Genome A, find the
length - of longest common substring in Genome B.
- In this case, l( )5.
- ACS average l( ) L(Genome A, Genome B)
Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
20From ACS to Our Distance Intuition
- High L( A , B ) indicates higher similarity.
- Should normalize to account for length of B.
21From ACS to Our Distance Intuition
- High L( A , B ) indicates higher similarity.
- Should normalize to account for length of B.
- Still, we want distance rather than similarity.
22From ACS to Our Distance Intuition
- High L( A , B ) indicates higher similarity.
- Should normalize to account for length of B.
- Still, we want distance rather than similarity.
23From ACS to Our Distance Intuition
- High L( A , B ) indicates higher similarity.
- Should normalize to account for length of B.
- Still, we want distance rather than similarity.
- And want to have D( A , A ) 0 .
24From ACS to Our Distance Intuition
- High L( A , B ) indicates higher similarity.
- Should normalize to account for length of B.
- Still, we want distance rather than similarity.
- And want to have D( A , A ) 0 .
- Finally, we want to ensure symmetry.
25Comparison to Human (H)
26What Good is this Weird Measure?
1) Our ACS distance is related to an
information theoretic measure that is close to
Kullback Leibler relative entropy between two
distributions. 2) The proof of the pudding is in
the eating Will show this weird measure is
empirically good.
27An Info Theoretic Measure
- Define number of bits required
- to describe distribution p, given q.
- is closely related to Kullback
Leibler - relative entropy
28An Info Theoretic Measure
- Both and are
common - distance measures between two probability
- distributions p and q.
- Both distances are neither symmetric, nor
- satisfy triangle inequality.
-
29Relations Between ACS and
- Suppose p and q are Markovian probability
- distributions on strings, and A, B are
- generated by them.
- Abraham Wyner (1993) showed that w.h.p
-
30Computing Our Distance
- Average number of bits for compression a bit from
one genome by the other vice versa. - Practically achieved better results than the sum
of relative entropies.
31Implementation and Complexity
- Computation distance of two k long genomes
- Naïve implementation requires O(k2)
- (disaster on billion letters long genomes)
- With suffix trees/arrays Total time for
- computing is O(k) (much
nicer).
32Results and Comparisons
- Many genomes and proteomes
- Small ribosomal subunit ML tree
- Compare to other whole-genome methods
- Quantitative and qualitative evaluation
33Four Datasets Used
- Benchmark dataset 75 species
- 191 species (all non-viral proteomes in NCBI)
- 1,865 viral genomes
- 34 mitochondrial DNA of
- mammals (same as Li et al.)
34Benchmark Dataset 75 Species
- Genomes and proteomes of archaea, bacteria and
eukarya - Tree topologies reconstructed from distance
matrix using Neighbor Joining (Saitou and Nei
1987) - Reference tree and distance matrix obtained from
the RDP (ribosomal database)
35Results Quantitative Evaluations
- Benchmark dataset
- Genomes/Proteomes of 75 species from archaea,
bacteria and eukarya. - Methods tested
- ACS (Ours)
- Lempel Ziv complexity (Otu and Sayhood)
- K-mers composition vectors (Qi et al.).
36Results Quantitative Evaluations
- Tree evaluation
- Reference tree Accepted tree obtained from
ribosomal database project (Cole et al. 2003) - Tree Distance Robinson-Foulds (1981)
37Robinson-Foulds Distance
- Each tree edge partitions species into 2 sets.
- Search which partitions exist only in one of the
trees.
A
C
A
E
Common Partition
x
A,B
C,D,E
A,B
C,D,E
y
B
B
D
E
D
C
Tree A
Tree B
38Robinson-Foulds Distance
- Each tree edge partitions species into 2 sets.
- Search which partitions exist only in one of the
trees.
A
C
A
E
A,B,C
Partition Not in B
x
y
B
B
D,E
D
E
D
C
Tree A
Tree B
39Robinson-Foulds Distance
- Distance number of edges inducing partitions
existing only in one of the trees. - For n leaves, distance ranges from 0 through
2n-6.
A
C
A
E
A,B,C
Partition Not in B
x
y
B
B
D,E
D
E
D
C
Tree A
Tree B
40Robinson-Foulds Distance - Results
Benchmark set has n75 species, so max distance
is 144.
41All Proteomes Dataset
- 191 proteomes from NCBI Genome
- 11 Eukarya, 19 Archaea, 161 Bacteria
- Compared to NCBI Taxonomy
42All Proteomes Dataset
- 191 proteomes from NCBI Genome
- 11 Eukarya, 19 Archaea, 161 Bacteria
- Compared to NCBI Taxonomy
43All Proteomes Dataset
- 191 proteomes from NCBI Genome
- 11 Eukarya, 19 Archaea, 161 Bacteria
- Compared to NCBI Taxonomy
44Viral Forest
- 1865 viral genomes from EBI
- Split into super-families
- dsDNA
- ssDNA
- dsRNA
- ssRNA positive
- ssRNA negative
- Retroids
- Satellite nucleic acid
45Retroid Tree
- 83 Reverse-transcriptases
- Hepatitis B viruses
- Circular dsDNA
- ssRNA
46ssRNA Negative Tree
- Each segment treated separately
- 174 segments of 74 viruses.
47Mammalian mtDNA Tree
48Throwing Branch Lengths In
49General Insights
- Proteomes vs. Genomes
- Overlapping vs. Non-overlapping
- Triangle inequality held in all cases
50Additional Directions attempted
- Naïve introduction of mismatches
- Division into segments
- Weighted combinations of genome and proteome data
- Bottom line (subject to change)
- Simple is beautiful.
51Summary
- Whole genome phylogeny based on ACS method
- Effective algorithm
- Information theoretic justification
- Successful reconstruction of known phylogenies.
52Future work
- Additional datasets
- Statistical significance
- Improved branch lengths estimation
- Better time and space complexities
53Questions ?