Information Theoretic Approach to Whole Genome Phylogenies presentation

Transcript and Presenter's Notes

1
Information Theoretic Approach to Whole Genome
Phylogenies

David Burstein, Igor Ulitsky, Tamir Tuller, Benny Chor

School of Computer Science, Tel Aviv University
2
Tree of Life

"I believe it has been with the tree of life,
which fills with its dead and broken branches the
crust of the earth, and covers the surface with
its ever branching and beautiful ramifications."
Charles Darwin, 1859

3
Accepted Evolutionary Model: Trees

Initial period: primordial soup, where "you are what you eat". Recombination events. Horizontal transfers.
Formation of distinct taxa. Speciation events induce a tree-like evolution.

4
Phylogenetic Trees Based on What?

Morphology
Single genes
Whole genomes

5
Whole Genome Phylogenies: Motivation

Cons of single-gene trees:
Require preprocessing
Gene duplications
Often too sensitive
Pros of whole-genome trees:
Fully automatic
More information
Seem essential for viruses
What about proteome trees?
Less noise, but they do require preprocessing

6
Whole Genome Phylogenies: Challenges

Very large inputs: up to 5 Gbp long
Extreme length variability (5 Gbp to 1 Mbp)
No meaningful alignment
Different segments experienced different evolutionary processes

7
Previous Approaches

Genome rearrangements (Hannenhalli and Pevzner 1995, ...)
Gene/domain contents (Snel et al. 1999, ...)
Li et al. (2001): Kolmogorov complexity
Otu et al. (2003): Lempel-Ziv compression
Qi et al. (2004): Composition vectors
Common approach (ours too):
Compute pairwise distances
Build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987)
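
The two-step pipeline (pairwise distances, then Neighbor Joining) can be sketched roughly as follows. This is a minimal, topology-only illustration: the names `pairwise_distances` and `neighbor_joining`, the dict-of-dicts matrix layout, and the nested-tuple tree output are assumptions made for this sketch, not the authors' code, and branch lengths are not estimated.

```python
from itertools import combinations


def pairwise_distances(genomes, distance):
    """Build a symmetric dict-of-dicts matrix from a {name: sequence} dict.

    `distance` is any function taking two sequences and returning a number.
    """
    names = list(genomes)
    dist = {a: {} for a in names}
    for a, b in combinations(names, 2):
        dist[a][b] = dist[b][a] = distance(genomes[a], genomes[b])
    return names, dist


def neighbor_joining(names, dist):
    """Return an unrooted tree topology as nested tuples (Saitou and Nei style).

    Joined clusters are represented by the tuple (left, right), which also
    serves as their key in the working distance table.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums over the currently active nodes only.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = (a, b)
        d[new] = {}
        for c in nodes:
            if c != a and c != b:
                # Distance from the new internal node to every other node.
                d_new_c = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[new][c] = d_new_c
                d[c][new] = d_new_c
        nodes = [c for c in nodes if c != a and c != b] + [new]
    return tuple(nodes)
```

Any of the distances discussed in this talk could be passed as `distance`; the tree builder itself does not depend on how the matrix was produced.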

8
Genome Rearrangements

Emphasis on finding the best sequence of rearrangements
Drawbacks:
Requires manual definition of blocks
Disregards changes within a block

9
Gene/Domain Content

Genome → equal-length Boolean vector
Various tree construction methods
Drawbacks:
Requires gene/domain definition/knowledge
Disregards most of the genetic information

10
Ming Li et al.: Kolmogorov Complexity

Kolmogorov complexity (KC) is a wonderful measure
But it is not computable
Approximate KC by compression
Drawbacks:
Justification of the approximation
Compression of one human chromosome reportedly took 24 hours (sloooow).

11
Otu et al.: Lempel-Ziv Distance

Run LZ compression on Genome A.
Use the Genome A dictionary to compress Genome B.
Log compression ratio (B given A vs. B given B) gives distance(B, A)
Easy to implement
Linear running time
Drawback:
Dictionary size effects

12
Qi et al.: Composition Vector

Calculate distributions of the K-tuples.
For K = 1: nucleotide/amino acid frequencies.
For K = 5: 4^5 (20^5) possible 5-tuples
Various methods for scoring distances
Report K = 5 as seemingly optimal
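
As a rough illustration of the K-tuple distributions described above, the following sketch counts overlapping K-tuples and turns them into frequencies. The function names are hypothetical, and this covers only the counting step; the scoring of distances between composition vectors is not shown.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-tuple that occurs in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def possible_kmers(alphabet_size, k):
    """Number of distinct k-tuples over an alphabet of the given size."""
    return alphabet_size ** k


# K = 1 gives plain letter frequencies.
print(kmer_frequencies("AACA", 1))
# Size of the 5-tuple space for DNA (4 letters) and proteins (20 letters).
print(possible_kmers(4, 5), possible_kmers(20, 5))
```

For K = 5 the vector has 1,024 entries for nucleotides and 3,200,000 for amino acids, which is why a sparse dictionary is used here instead of a dense array.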

13
Our Approach: Average Common Substring (ACS)

For every position in Genome A, find the longest substring starting at that position that also occurs in Genome B.

Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
19
Our Approach: ACS (cont.)

For every position i in Genome A, find the length l(i) of the longest substring starting at i that also occurs in Genome B.
In this case, l(i) = 5.
ACS = average of l(i) over all positions i = L(Genome A, Genome B)

Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
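
The per-position match length and its average L(A, B) can be sketched directly from the definition. This is a deliberately naive version meant only to make the definition concrete; the function names are illustrative, and it is far too slow for real genomes.

```python
def match_length(a, i, b):
    """Length of the longest prefix of a[i:] that occurs somewhere in b."""
    length = 0
    while i + length < len(a) and a[i:i + length + 1] in b:
        length += 1
    return length


def acs(a, b):
    """Average common substring length L(a, b); assumes a is non-empty."""
    return sum(match_length(a, i, b) for i in range(len(a))) / len(a)


genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"

# The first position of Genome A matches "AGGCT" in Genome B.
print(match_length(genome_a, 0, genome_b))  # 5
print(acs(genome_a, genome_b))
```

With the two sequences from the slide, the first position of Genome A yields a match of length 5, which agrees with the value quoted above, although the transcript does not preserve which position was highlighted.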
24
From ACS to Our Distance: Intuition

High L(A, B) indicates higher similarity.
Should normalize to account for the length of B.
Still, we want a distance rather than a similarity.
And we want D(A, A) = 0.
Finally, we want to ensure symmetry.
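
One way to turn these five requirements into a formula is sketched below. The slide's own formulas did not survive in this transcript, so the specific choices here are assumptions: normalizing by log|B|, inverting to get a distance, subtracting the self-comparison term log|A| / L(A, A) so that D(A, A) = 0, and averaging the two directions for symmetry. All function names are illustrative.

```python
import math


def acs(a, b):
    """Naive average common substring length L(a, b); assumes a is non-empty."""
    total = 0
    for i in range(len(a)):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)


def directed_distance(a, b):
    """Assumed form: log|b| / L(a, b) minus the self term log|a| / L(a, a)."""
    l_ab = acs(a, b)
    if l_ab == 0:
        # No shared letters at all: treat the sequences as infinitely far apart.
        return math.inf
    return math.log(len(b)) / l_ab - math.log(len(a)) / acs(a, a)


def acs_distance(a, b):
    """Symmetrized distance: the average of the two directed distances."""
    return 0.5 * (directed_distance(a, b) + directed_distance(b, a))


print(acs_distance("ACGT", "ACGT"))  # 0.0
print(acs_distance("ACGT", "CGTA") == acs_distance("CGTA", "ACGT"))  # True
```

Each line of the intuition maps to one step: a larger L(A, B) lowers the first term, log|B| compensates for longer targets offering more chance matches, the subtraction enforces a zero self-distance, and the average makes the result independent of argument order.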

25
Comparison to Human (H)
26
What Good is this Weird Measure?
1) Our ACS distance is related to an information theoretic measure that is close to the Kullback-Leibler relative entropy between two distributions.
2) The proof of the pudding is in the eating: we will show that this weird measure is empirically good.
27
An Info Theoretic Measure

Define a measure: the number of bits required to describe distribution p, given q.
It is closely related to the Kullback-Leibler relative entropy.

28
An Info Theoretic Measure

Both this measure and the Kullback-Leibler relative entropy are common distance measures between two probability distributions p and q.
Neither distance is symmetric, and neither satisfies the triangle inequality.
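
The lack of symmetry is easy to check numerically. The sketch below assumes that the "bits to describe p given q" measure is the cross entropy, which equals the entropy of p plus the Kullback-Leibler relative entropy; the slide's formula was not preserved, so that identification is an assumption. Distributions are plain dicts, and q must be positive wherever p is.

```python
import math


def entropy(p):
    """Shannon entropy of p in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy(p, q):
    """Expected bits to encode samples from p with a code built for q."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)


def kl_divergence(p, q):
    """Kullback-Leibler relative entropy D(p || q) in bits."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


p = {"A": 0.5, "C": 0.5}
q = {"A": 0.25, "C": 0.75}

# The two directions differ, so neither quantity is a metric.
print(kl_divergence(p, q), kl_divergence(q, p))
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```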

29
Relation Between ACS and the Information Theoretic Measure

Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them.
Abraham Wyner (1993) showed that the two are related with high probability (w.h.p.).

30
Computing Our Distance

Average number of bits needed to compress a bit of one genome using the other, and vice versa.
In practice, this achieved better results than the sum of relative entropies.

31
Implementation and Complexity

Computing the distance between two genomes of length k:
Naïve implementation requires O(k^2) time (a disaster for genomes billions of letters long).
With suffix trees/arrays, the total computation time is O(k) (much nicer).
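
To show that linear time is attainable, here is a sketch that computes all per-position match lengths in one pass. It uses a suffix automaton rather than the suffix trees/arrays named on the slide, purely because it is shorter to write in Python; it is a swapped-in technique, not the authors' implementation, and the function names are illustrative. Both strings are reversed so that "longest match ending here" becomes "longest match starting here".

```python
def build_suffix_automaton(s):
    """Standard online suffix automaton; returns (transitions, links, lengths)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter maximal length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def match_lengths(a, b):
    """For each i, length of the longest substring of a starting at i found in b."""
    nxt, link, length = build_suffix_automaton(b[::-1])
    state, cur_len, ends = 0, 0, []
    for ch in a[::-1]:
        # Drop to shorter suffixes until the next letter can be matched.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0
        ends.append(cur_len)
    return ends[::-1]


def acs_linear(a, b):
    """Average common substring length L(a, b); assumes a is non-empty."""
    return sum(match_lengths(a, b)) / len(a)
```

For a fixed alphabet (4 nucleotides or 20 amino acids) the construction and the scan are both linear in the input lengths.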

32
Results and Comparisons

Many genomes and proteomes
Small ribosomal subunit ML tree
Compare to other whole-genome methods
Quantitative and qualitative evaluation

33
Four Datasets Used

Benchmark dataset: 75 species
191 species (all non-viral proteomes in NCBI)
1,865 viral genomes
Mitochondrial DNA of 34 mammals (same as Li et al.)

34
Benchmark Dataset: 75 Species

Genomes and proteomes of archaea, bacteria and eukarya
Tree topologies reconstructed from the distance matrix using Neighbor Joining (Saitou and Nei 1987)
Reference tree and distance matrix obtained from the RDP (Ribosomal Database Project)

35
Results: Quantitative Evaluations

Benchmark dataset:
Genomes/proteomes of 75 species from archaea, bacteria and eukarya.
Methods tested:
ACS (ours)
Lempel-Ziv complexity (Otu and Sayood)
K-mer composition vectors (Qi et al.)

36
Results: Quantitative Evaluations

Tree evaluation:
Reference tree: accepted tree obtained from the Ribosomal Database Project (Cole et al. 2003)
Tree distance: Robinson-Foulds (1981)

37
Robinson-Foulds Distance

Each tree edge partitions species into 2 sets.
Search which partitions exist only in one of the
trees.

(Figure: Tree A and Tree B over species A, B, C, D, E, with edges x and y marked. Common partition: A,B | C,D,E.)
38
Robinson-Foulds Distance

Each tree edge partitions species into 2 sets.
Search which partitions exist only in one of the
trees.

(Figure: Tree A and Tree B over species A, B, C, D, E, with edges x and y marked. Partition A,B,C | D,E of Tree A is not in Tree B.)
39
Robinson-Foulds Distance

Distance = number of edges inducing partitions that exist in only one of the trees.
For n leaves, the distance ranges from 0 through 2n - 6.
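
The partition counting can be sketched as follows. Trees are written as nested tuples of leaf names, which is a representation chosen for this sketch only. Tree A below matches the partitions named in the figure; the exact shape of Tree B is not recoverable from the transcript, so the one used here is a hypothetical tree that shares A,B | C,D,E but lacks A,B,C | D,E.

```python
def splits(tree):
    """Non-trivial bipartitions induced by the internal edges of a tree.

    Each split is stored as the side that does not contain a fixed reference
    leaf, so the same partition always gets the same representation.
    """
    found = set()

    def leaves_of(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_of(child) for child in node))

    all_leaves = leaves_of(tree)
    reference = min(all_leaves)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def max_rf_distance(n_leaves):
    """Upper bound 2n - 6 for two unrooted binary trees on n leaves."""
    return 2 * n_leaves - 6


tree_a = (("A", "B"), "C", ("D", "E"))
tree_b = (("A", "B"), ("C", "D"), "E")  # hypothetical stand-in for Tree B
print(robinson_foulds(tree_a, tree_b))  # 2
print(max_rf_distance(75))              # 144
```

Storing each split as a `frozenset` makes the comparison a plain set symmetric difference, and a binary root that describes the same edge twice is deduplicated automatically.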

40
Robinson-Foulds Distance: Results
The benchmark set has n = 75 species, so the maximum distance is 2n - 6 = 144.
41
All Proteomes Dataset

191 proteomes from NCBI Genome
11 Eukarya, 19 Archaea, 161 Bacteria
Compared to NCBI Taxonomy

44
Viral Forest

1,865 viral genomes from EBI
Split into super-families:
dsDNA
ssDNA
dsRNA
ssRNA positive
ssRNA negative
Retroids
Satellite nucleic acid

45
Retroid Tree

83 Reverse-transcriptases
Hepatitis B viruses
Circular dsDNA
ssRNA

46
ssRNA Negative Tree

Each segment treated separately
174 segments of 74 viruses.

47
Mammalian mtDNA Tree
48
Throwing Branch Lengths In
49
General Insights

Proteomes vs. Genomes
Overlapping vs. Non-overlapping
Triangle inequality held in all cases

50
Additional Directions Attempted

Naïve introduction of mismatches
Division into segments
Weighted combinations of genome and proteome data
Bottom line (subject to change): simple is beautiful.

51
Summary

Whole genome phylogeny based on ACS method
Effective algorithm
Information theoretic justification
Successful reconstruction of known phylogenies.

52
Future work

Additional datasets
Statistical significance
Improved branch lengths estimation
Better time and space complexities

53
Questions?
