1
Information Theoretic Approach to Whole Genome
Phylogenies
  • David Burstein, Igor Ulitsky, Tamir Tuller,
    Benny Chor

School of Computer Science, Tel Aviv University
2
Tree of Life
  • "I believe it has been with the tree of life,
    which fills with its dead and broken branches the
    crust of the earth, and covers the surface with
    its ever branching and beautiful
    ramifications"...
  • Charles Darwin, 1859

3
Accepted Evolutionary Model: Trees
  • Initial period: primordial soup, where you are
    what you eat. Recombination events. Horizontal
    transfers.
  • Formation of distinct taxa. Speciation events
    induce a tree-like evolution.

4
Phylogenetic Trees Based on What?
  1. Morphology
  2. Single genes
  3. Whole genomes

5
Whole Genome Phylogenies: Motivation
  • Cons of single-gene trees:
    • Require preprocessing
    • Gene duplications
    • Often too sensitive
  • Pros of whole-genome trees:
    • Fully automatic
    • More information
    • Seem essential for viruses
  • What about proteome trees?
    • Less noise, but they do require preprocessing

6
Whole Genome Phylogenies: Challenges
  • Very large inputs: up to 5G bp long
  • Extreme length variability (5G to 1M bp)
  • No meaningful alignment
  • Different segments experienced different
    evolutionary processes

7
Previous Approaches
  • Genome rearrangements (Hannenhalli and Pevzner
    1995, ...)
  • Gene/domain contents (Snel et al. 1999, ...)
  • Li et al. (2001): Kolmogorov complexity
  • Otu et al. (2003): Lempel-Ziv compression
  • Qi et al. (2004): composition vectors
  • Common approach (ours too):
    • Compute pairwise distances
    • Build a tree from the distance matrix (e.g. using
      Neighbor Joining, Saitou and Nei 1987)
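
The second step of the common approach can be sketched as follows. This is a minimal Neighbor Joining sketch that returns only the tree topology (no branch lengths), using the standard Q-criterion. The function name, the frozenset-keyed distance dictionary, and the nested-tuple tree encoding are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def neighbor_joining(labels, dist):
    """Return an unrooted topology as nested tuples.

    labels: list of taxon names.
    dist:   dict mapping frozenset({x, y}) -> distance (hypothetical encoding).
    """
    d = dict(dist)
    nodes = list(labels)
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all others.
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - total(i) - total(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)]
                   - total[p[0]] - total[p[1]])
        new = (i, j)
        rest = [k for k in nodes if k != i and k != j]
        # Distance from the new internal node to every remaining node.
        for k in rest:
            d[frozenset((new, k))] = 0.5 * (d[frozenset((i, k))]
                                            + d[frozenset((j, k))]
                                            - d[frozenset((i, j))])
        nodes = rest + [new]
    return tuple(nodes)

# Toy additive distances for a tree that groups A with B and C with D.
pairs = {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 4,
         ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4}
toy = {frozenset(p): v for p, v in pairs.items()}
tree = neighbor_joining(["A", "B", "C", "D"], toy)
```

On real data the distance dictionary would be filled with the pairwise genome distances computed in the first step.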

8
Genome Rearrangements
  • Emphasis on finding best sequence of
    rearrangements
  • Drawbacks:
    • Requires manual definition of blocks
    • Disregards changes within a block

9
Gene/Domain Content
  • Genome → equal-length Boolean vector
  • Various tree construction methods
  • Drawbacks:
    • Requires gene/domain definition/knowledge
    • Disregards most of the genetic information

10
Ming Li et al.: Kolmogorov Complexity
  • Kolmogorov complexity (KC) is a wonderful measure
  • But it is not computable
  • Approximate KC by compression
  • Drawbacks:
    • Justification of the approximation
    • Compression of one human chromosome
      reportedly took 24 hours (sloooow).

11
Otu et al.: Lempel-Ziv Distance
  • Run LZ compression on Genome A.
  • Use Genome A's dictionary to compress Genome B.
  • Log compression ratio (B given A vs. B given B)
    gives distance(B, A)
  • Easy to implement
  • Linear running time
  • Drawback:
    • Dictionary size effects
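
The idea of compressing B with A's dictionary can be illustrated with a stand-in. The sketch below is not Otu et al.'s algorithm: it assumes DEFLATE with a preset dictionary (Python's zlib) as a rough substitute for a Lempel-Ziv dictionary, and the function names are hypothetical. zlib's window is limited to 32 KB, which is itself an instance of the dictionary size effect noted above.

```python
import math
import zlib

def compressed_size(data: bytes, dictionary: bytes) -> int:
    """Bytes needed to DEFLATE `data` with `dictionary` preloaded."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return len(compressor.compress(data) + compressor.flush())

def lz_style_distance(b: str, a: str) -> float:
    """Log of (size of B given A) / (size of B given B).

    An assumed reading of the slide's "log compression ratio";
    the original formula is not in the transcript.
    """
    b_bytes, a_bytes = b.encode(), a.encode()
    given_a = compressed_size(b_bytes, a_bytes)
    given_b = compressed_size(b_bytes, b_bytes)
    return math.log(given_a / given_b)

genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
d_ba = lz_style_distance(genome_b, genome_a)
```

By construction, comparing a sequence with itself gives a ratio of 1 and therefore a distance of 0.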

12
Qi et al.: Composition Vector
  • Calculate distributions of the K-tuples.
  • For K = 1: nucleotide/amino acid frequencies.
  • For K = 5: 4^5 (20^5) possible 5-tuples
  • Various methods for scoring distances
  • Report K = 5 as seemingly optimal
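
A K-tuple distribution can be computed in a few lines. This sketch only counts overlapping K-tuples and normalizes them to frequencies. It does not reproduce the scoring methods of Qi et al., and the function name is hypothetical.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of every overlapping k-tuple in `sequence`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Sizes of the 5-tuple spaces mentioned above.
dna_5_tuples = 4 ** 5        # 1,024 possible nucleotide 5-tuples
protein_5_tuples = 20 ** 5   # 3,200,000 possible amino-acid 5-tuples

freqs = kmer_frequencies("AAAC", 2)  # windows: AA, AA, AC
```

For K = 1 the same function returns plain nucleotide or amino acid frequencies.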

13
Our Approach: Average Common Substring (ACS)
  • For every position in Genome A, find the longest
    substring starting there that also occurs in
    Genome B.

Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT
19
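Our Approach: ACS (cont.)
  • For every position i in Genome A, find the
    length l(i) of the longest substring starting
    at i that also occurs in Genome B.
  • In this case, l(i) = 5 for the position
    highlighted on the slide.
  • ACS = average of l(i) over all positions
    = L(Genome A, Genome B)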

Genome A
AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B
AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT
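
The definition above can be checked directly on the two example genomes. This is a naive quadratic sketch written for readability, not the efficient implementation discussed later. The function names are hypothetical.

```python
def match_lengths(a, b):
    """l(i) for every position i of a: length of the longest substring
    of a starting at i that also occurs somewhere in b."""
    lengths = []
    for i in range(len(a)):
        length = 0
        # Extend the match one letter at a time while it still occurs in b.
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        lengths.append(length)
    return lengths

def acs(a, b):
    """L(a, b): the average of l(i) over all positions of a."""
    lengths = match_lengths(a, b)
    return sum(lengths) / len(lengths)

genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"

print(match_lengths(genome_a, genome_b)[0])  # prints 5 (AGGCT occurs in B, AGGCTT does not)
print(acs(genome_a, genome_b))
```

Note that L(A, B) and L(B, A) generally differ, since the positions averaged over belong to the first argument.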
24
From ACS to Our Distance: Intuition
  • High L(A, B) indicates higher similarity.
  • Should normalize to account for the length of B.
  • Still, we want a distance rather than a similarity.
  • And we want to have D(A, A) = 0.
  • Finally, we want to ensure symmetry.
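
The formulas for these steps did not survive in the transcript. The sketch below follows the listed steps one by one under an assumed form: log|B| / L(A, B) for normalization and inversion, a correction term log|A| / L(A, A) so that D(A, A) = 0, and an average of both directions for symmetry. The natural logarithm and all function names are assumptions.

```python
import math

def acs(a, b):
    """L(a, b): average over positions i of a of the longest match of a[i:] in b."""
    total = 0
    for i in range(len(a)):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        total += length
    return total / len(a)

def directed_distance(a, b):
    """Assumed form: log|B| / L(A, B) - log|A| / L(A, A).

    Dividing log|B| by L(A, B) normalizes for the length of B and turns
    similarity into distance; the second term makes d(A, A) = 0.
    Requires that a and b share at least one letter, so L(A, B) > 0.
    """
    return math.log(len(b)) / acs(a, b) - math.log(len(a)) / acs(a, a)

def acs_distance(a, b):
    """Symmetrized distance: the mean of both directions."""
    return 0.5 * (directed_distance(a, b) + directed_distance(b, a))

genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
d_ab = acs_distance(genome_a, genome_b)
```

With this form, acs_distance(genome_a, genome_a) is exactly 0 and the result does not depend on argument order.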

25
Comparison to Human (H)
26
What Good is this Weird Measure?
  1. Our ACS distance is related to an information
     theoretic measure that is close to the
     Kullback-Leibler relative entropy between two
     distributions.
  2. The proof of the pudding is in the eating: we
     will show that this weird measure is empirically
     good.
27
An Info Theoretic Measure
  • Define a measure: the number of bits required
    to describe distribution p, given q.
  • This measure is closely related to the
    Kullback-Leibler relative entropy.

28
An Info Theoretic Measure
  • Both this measure and the Kullback-Leibler
    relative entropy are common distance measures
    between two probability distributions p and q.
  • Neither distance is symmetric, and neither
    satisfies the triangle inequality.
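
The lack of symmetry is easy to see numerically. The sketch below assumes that "bits required to describe p, given q" can be read as the cross-entropy of p relative to q. That reading is an assumption, since the slide's symbol was lost. The distributions p and q are made-up examples.

```python
import math

def cross_entropy(p, q):
    """Expected bits to encode samples from p with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Expected bits to encode samples from p with its own optimal code."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """Kullback-Leibler relative entropy D(p || q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # about 0.737 bits
backward = kl_divergence(q, p)  # about 0.531 bits
```

Under this reading the two measures differ only by the entropy of p: cross_entropy(p, q) equals entropy(p) + kl_divergence(p, q).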

29
Relations Between ACS and the Information Theoretic Measure
  • Suppose p and q are Markovian probability
    distributions on strings, and A, B are
    generated by them.
  • Abraham Wyner (1993) showed that w.h.p

30
Computing Our Distance
  • Average number of bits for compressing a bit from
    one genome using the other, and vice versa.
  • In practice, this achieved better results than the
    sum of relative entropies.

31
Implementation and Complexity
  • Computing the distance between two genomes of length k:
    • Naïve implementation requires O(k^2) time
      (a disaster on billion-letter genomes)
    • With suffix trees/arrays, the total time for
      computing the distance is O(k) (much nicer).
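
A linear-time computation of all match lengths can be sketched with a suffix automaton, a related index swapped in here for the suffix trees/arrays named above because it is short to write. This is an illustration, not the authors' implementation. It builds the automaton of reversed B and scans reversed A, so that "longest match ending here" in the reversed strings becomes "longest match starting here" in the originals.

```python
def build_suffix_automaton(s):
    """Standard online construction; returns (transitions, links, lengths)."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length

def match_lengths(a, b):
    """l(i) for each position i of a: longest substring of a starting at i
    that occurs in b, in O(|a| + |b|) time for a fixed alphabet."""
    nxt, link, length = build_suffix_automaton(b[::-1])
    state, cur_len, ends = 0, 0, []
    for ch in reversed(a):
        # Shorten the current match until it can be extended by ch.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0
        ends.append(cur_len)
    return ends[::-1]

lengths = match_lengths("ACGT", "CGTA")  # [1, 3, 2, 1]
```

Averaging the returned list gives L(A, B) without the quadratic scan.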

32
Results and Comparisons
  • Many genomes and proteomes
  • Small ribosomal subunit ML tree
  • Compare to other whole-genome methods
  • Quantitative and qualitative evaluation

33
Four Datasets Used
  • Benchmark dataset: 75 species
  • 191 species (all non-viral proteomes in NCBI)
  • 1,865 viral genomes
  • Mitochondrial DNA of 34 mammals (same as Li et al.)

34
Benchmark Dataset: 75 Species
  • Genomes and proteomes of archaea, bacteria and
    eukarya
  • Tree topologies reconstructed from the distance
    matrix using Neighbor Joining (Saitou and Nei
    1987)
  • Reference tree and distance matrix obtained from
    the RDP (Ribosomal Database Project)

35
Results: Quantitative Evaluations
  • Benchmark dataset:
    • Genomes/proteomes of 75 species from archaea,
      bacteria and eukarya.
  • Methods tested:
    • ACS (ours)
    • Lempel-Ziv complexity (Otu and Sayood)
    • K-mer composition vectors (Qi et al.).

36
Results: Quantitative Evaluations
  • Tree evaluation:
    • Reference tree: accepted tree obtained from the
      Ribosomal Database Project (Cole et al. 2003)
    • Tree distance: Robinson-Foulds (1981)

37
Robinson-Foulds Distance
  • Each tree edge partitions species into 2 sets.
  • Search which partitions exist only in one of the
    trees.

[Figure: Tree A and Tree B over leaves A, B, C, D, E. Common partition: A,B | C,D,E]
38
Robinson-Foulds Distance
  • Each tree edge partitions species into 2 sets.
  • Search which partitions exist only in one of the
    trees.

[Figure: Tree A and Tree B over leaves A, B, C, D, E. Partition A,B,C | D,E exists in Tree A but not in Tree B]
39
Robinson-Foulds Distance
  • Distance = number of edges inducing partitions
    that exist in only one of the trees.
  • For n leaves, the distance ranges from 0 through
    2n - 6.
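
The definition can be sketched as a comparison of partition sets. Trees are encoded as nested tuples, which is a hypothetical encoding. Tree A below matches the partitions named in the figures, while Tree B is a guess that is merely consistent with them.

```python
def splits(tree):
    """Non-trivial leaf partitions induced by the internal edges of `tree`.

    Each partition is stored as the side that does not contain a fixed
    reference leaf, so equal partitions compare equal across trees.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    reference = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if reference in below else below
        # Skip splits that separate a single leaf (or nothing at all).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found

def robinson_foulds(tree_1, tree_2):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(tree_1) ^ splits(tree_2))

tree_a = (("A", "B"), "C", ("D", "E"))  # partitions A,B | C,D,E and A,B,C | D,E
tree_b = (("A", "B"), "E", ("C", "D"))  # assumed shape: shares A,B | C,D,E only

print(robinson_foulds(tree_a, tree_b))  # prints 2
```

For these five-leaf trees the maximum possible value is 2·5 - 6 = 4.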

40
Robinson-Foulds Distance: Results
Benchmark set has n = 75 species, so the max distance
is 2·75 - 6 = 144.
41
All Proteomes Dataset
  • 191 proteomes from NCBI Genome
  • 11 Eukarya, 19 Archaea, 161 Bacteria
  • Compared to NCBI Taxonomy

44
Viral Forest
  • 1,865 viral genomes from EBI
  • Split into super-families:
    • dsDNA
    • ssDNA
    • dsRNA
    • ssRNA positive
    • ssRNA negative
    • Retroids
    • Satellite nucleic acid

45
Retroid Tree
  • 83 Reverse-transcriptases
  • Hepatitis B viruses
  • Circular dsDNA
  • ssRNA

46
ssRNA Negative Tree
  • Each segment treated separately
  • 174 segments of 74 viruses.

47
Mammalian mtDNA Tree
48
Throwing Branch Lengths In
49
General Insights
  • Proteomes vs. Genomes
  • Overlapping vs. Non-overlapping
  • Triangle inequality held in all cases

50
Additional Directions Attempted
  • Naïve introduction of mismatches
  • Division into segments
  • Weighted combinations of genome and proteome data
  • Bottom line (subject to change):
    • Simple is beautiful.

51
Summary
  • Whole genome phylogeny based on ACS method
  • Effective algorithm
  • Information theoretic justification
  • Successful reconstruction of known phylogenies.

52
Future work
  • Additional datasets
  • Statistical significance
  • Improved branch lengths estimation
  • Better time and space complexities

53
Questions?