Title: Whole Genome Phylogenetic Analysis

1
Whole Genome Phylogenetic Analysis

Yifeng Liu and Reihaneh Rabbanyk Khorasgani
April 8th, 2009

2
Agenda

Introduction
Our method proposals
Datasets and experiments
Results
Discussion
Future work
Conclusion

3
Whole Genome Phylogeny Motivations

Currently, the dominant method for phylogenetic
analysis is based on a single gene or protein.
However, different genes tell different stories.
Recently, more genomic sequences have become available.
We hope to resolve the above inconsistency by
using the entire genome (or proteome) to
reconstruct the phylogenetic tree.

4
Whole Genome Phylogeny Methods

Major categories of methods are based on
Shared gene (ortholog) content
Nucleotide and amino acid (string) composition
Genome Compression
Gene order
In our study, we focus on string composition and
compression methods

5
Complete Composition Vector (CCV)

The observed occurrence probability for a
k-string
The estimated background occurrence probability
based on the Markov assumption is

6
Complete Composition Vector (CCV)

The occurrence probability due to selective
pressure
The k-th composition vector
The Complete Composition Vector (CCV)
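
The definitions on these two slides can be sketched in Python. The slide formulas are not reproduced in this transcript, so the sketch assumes the usual composition-vector formulation: observed probability p(a1...ak) = f(a1...ak) / (L - k + 1), Markov estimate p0(a1...ak) = p(a1...a(k-1)) p(a2...ak) / p(a2...a(k-1)), and S = (p - p0) / p0 when p0 is nonzero, otherwise 0. All function names are illustrative, not the authors' code.

```python
from collections import Counter

def kmer_probs(seq: str, k: int) -> dict[str, float]:
    """Observed probability p = f / (L - k + 1) for each k-string in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

def composition_vector(seq: str, k: int) -> dict[str, float]:
    """Sparse k-th composition vector {k-string: S}.

    Entries whose Markov estimate p0 is zero have S = 0 and are omitted.
    """
    if k < 2:
        raise ValueError("the Markov estimate needs k >= 2")
    p_k, p_k1, p_k2 = (kmer_probs(seq, n) for n in (k, k - 1, k - 2))
    alphabet = sorted(set(seq))
    vec = {}
    for prefix in p_k1:
        for letter in alphabet:
            kmer = prefix + letter
            suffix = kmer[1:]
            if suffix not in p_k1:
                continue  # p0 = 0, so S = 0 by definition
            # Assumed Markov background estimate (see lead-in).
            p0 = p_k1[prefix] * p_k1[suffix] / p_k2[kmer[1:-1]]
            vec[kmer] = (p_k.get(kmer, 0.0) - p0) / p0
    return vec

def complete_composition_vector(seq: str, k_min: int, k_max: int) -> dict[str, float]:
    """Concatenate the composition vectors for k = k_min..k_max."""
    ccv = {}
    for k in range(k_min, k_max + 1):
        ccv.update(composition_vector(seq, k))  # keys of different length never collide
    return ccv
```

A k-string that is expected under the background model but never observed gets S = -1, so it still contributes to the vector.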

7
Compression Methods

Kolmogorov Complexity
Lempel-Ziv complexity
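
As a rough illustration of the second item: Kolmogorov complexity itself is not computable, so in practice it is approximated by a compressor, whereas Lempel-Ziv complexity can be counted directly. The sketch below assumes the exhaustive-history parsing of Lempel and Ziv (1976) and a normalized distance in the style of Otu and Sayood [6]; the slides do not say which distance variant was used, so the formula in `lz_distance` is an assumption.

```python
def lz_complexity(s: str) -> int:
    """Number of components in the Lempel-Ziv (1976) exhaustive history of s."""
    i, components, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text
        # (overlap with the phrase itself is allowed, except its last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        components += 1
        i += length
    return components

def lz_distance(s: str, q: str) -> float:
    """Assumed normalized distance: max(c(sq) - c(s), c(qs) - c(q)) / max(c(s), c(q)).

    Both sequences must be non-empty.
    """
    cs, cq = lz_complexity(s), lz_complexity(q)
    return max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq) / max(cs, cq)
```

The substring search makes this sketch far too slow for whole genomes; it is meant only to make the definition concrete.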

9
A new term weighting scheme

CCV uses S() to weight each k-string, which
Utilizes only local information available within
a single sequence
Estimates random background based on Markov
model
Can we have a measure that uses both local and
global information without making the Markov
assumption?

10
Term and Document Frequency

Genomes are documents written in a four-letter
alphabet (A, T, C, G); similarly, proteomes
are documents written in a twenty-letter
alphabet.
Each k-string can be viewed as a word within a
genome (or proteome) document.
The collection of all genomes in the dataset is
therefore a corpus.

11
Term and Document Frequency

In statistical Natural Language Processing, a
well-known term weighting scheme TF-IDF combines
both term frequency and document frequency into a
single weight.
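
A minimal sketch of TF-IDF applied to k-strings, treating each genome as a document. It assumes the common variant weight = tf * log(N / df), where tf is the raw count in one genome, df is the number of genomes containing the k-string, and N is the number of genomes; the slides do not state which variant was used. Names are illustrative.

```python
import math
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Term frequency: raw count of each k-string in one genome."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tf_idf_vectors(genomes: dict[str, str], k: int) -> dict[str, dict[str, float]]:
    """Return {genome name: {k-string: tf * log(N / df)}}."""
    tf = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # each genome counts once per k-string
    n = len(genomes)
    return {
        name: {kmer: count * math.log(n / df[kmer]) for kmer, count in counts.items()}
        for name, counts in tf.items()
    }
```

A k-string present in every genome gets weight 0, which is the sense in which document frequency adds global information.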

12
CCV meets Document Frequency

We can also combine the occurrence probability
due to selection, S(), with the inverse document
frequency into a single weight called CCV-IDF.
S() provides local information and df_i (the
document frequency of k-string i) provides
global information.
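
One way this combination could look, as a sketch only: the slide's exact formula is not in the transcript, so the product S * log(N / df_i) below is an assumption modelled on TF-IDF. The selection scores are taken as precomputed input; function and parameter names are hypothetical.

```python
import math
from collections import Counter

def document_frequency(genomes: dict[str, str], k: int) -> Counter:
    """df_i: number of genomes in which k-string i occurs at least once."""
    df = Counter()
    for seq in genomes.values():
        df.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return df

def ccv_idf(selection: dict[str, dict[str, float]], df: Counter, n_genomes: int) -> dict[str, dict[str, float]]:
    """Weight each selection score S by log(N / df_i) (assumed form).

    k-strings that occur in no genome (df = 0) have no defined IDF and are skipped.
    """
    return {
        name: {
            kmer: s * math.log(n_genomes / df[kmer])
            for kmer, s in vec.items()
            if df[kmer] > 0
        }
        for name, vec in selection.items()
    }
```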

13
Ensemble Measures

Normalizing distances to same range
Combining distance matrices
These parameters should be adjusted
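
A small sketch of the two steps, assuming that "same range" means scaling each matrix to [0, 1] by its largest entry and that the adjustable parameters are per-measure weights in a weighted average. Both choices are assumptions; the slides do not give the formula.

```python
def normalize(matrix: list[list[float]]) -> list[list[float]]:
    """Scale a distance matrix to the range [0, 1] by its largest entry."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[d / largest for d in row] for row in matrix]

def combine(matrices: list[list[list[float]]], weights: list[float]) -> list[list[float]]:
    """Weighted average of normalized distance matrices (same taxa, same order)."""
    normalized = [normalize(m) for m in matrices]
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, normalized)) / total for j in range(n)]
        for i in range(n)
    ]
```

With weights such as [1, 1] both measures contribute equally; tuning them shifts the ensemble toward one measure.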

14
Tree Evaluation

We propose a new evaluation method for evaluating
phylogenetic trees
A numeric measure
Shows how compatible the tree is with the given
taxonomy

15
Tree Evaluation (Cont.)

Labeling the inner nodes in the tree
For each species
A path in the tree
-> sequence of inner node labels
A taxonomy description
-> taxonomy sequence
There should be a many-to-many alignment between
these two sequences

16
Tree Evaluation (Cont.)

Finding alignments between these sequences for all
the species
Using a Bayesian network
Finding the most probable alignments
Measuring the log-likelihood of these alignments
How probable is this tree given this taxonomy?

17
Tree Evaluation (Example)

Taxonomy
A: T1 T2
B: T1 T3
C: T1 T3
D: T1 T3

Phylogenetic tree (inner nodes labelled 1, 2, 3)
Path  Inner node labels  Alignment
P1    1                  <T1T2, 1>
P2    1 2                <T1, 1> <T3, 2>
P3    1 2 3              <T1, 1> <T3, 23>
P4    1 2 3              <T1, 1> <T3, 23>

19
Dataset: influenza virus

Influenza virus genomes (flu)
44 influenza A genomes (3 for H1-H13, 2 for H16)
3 influenza B genomes
1 influenza C genome (out group)
Coding gene sequences only
Collected and joined from individual gene
sequences in the following order: HA,
NA, NP, M, NS, PA, PB1, PB2

20
Dataset: Prokaryotes

Prokaryote genomes (bac)
88 bacterial genomes
11 archaeal genomes
Uses Nanoarchaeum equitans as the out group.
Collected from NCBI according to the accession
numbers provided in the CCV paper.
Genomic DNA sequences, including intergenic
regions.

21
Dataset: Mammal mitochondria

Mammal mitochondria (mito)
425 mammal mitochondria
1 Arabidopsis mitochondrion (out group)
Collected from the Organelle Genome
Megasequencing Program website.
Converted from NCBI format to FASTA format.
Contains many duplicated entries for
Bos taurus (cattle)
Sus scrofa (wild boar)
Mus musculus (mouse)
Rattus norvegicus (rat)

22
Experiments

We built a multiple sequence alignment (MSA) tree
for flu
We ran CCV, TF-IDF and CCV-IDF on all three
datasets with the following k-string lengths (we
fixed K1 = 1 and varied only K2, so
L = K2 - K1 + 1 = K2)
Flu: L = 7 and L = 15
Bac and mito: L = 7 and L = 9
Each run generates a pairwise distance matrix.
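
A sketch of how sparse weight vectors could be turned into a pairwise distance matrix. The "cos" labels on later slides suggest a cosine-based distance; the specific form d = (1 - cos) / 2 used here, which maps to [0, 1], is an assumption, as are the function names.

```python
import math

def cosine_distance(u: dict[str, float], v: dict[str, float]) -> float:
    """Assumed distance (1 - cos(u, v)) / 2 between two sparse vectors."""
    dot = sum(w * v[kmer] for kmer, w in u.items() if kmer in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # treat an all-zero vector as uncorrelated
    return (1 - dot / (norm_u * norm_v)) / 2

def pairwise_distances(vectors: dict[str, dict[str, float]]) -> tuple[list[str], list[list[float]]]:
    """Return taxon names (sorted) and the symmetric distance matrix."""
    names = sorted(vectors)
    matrix = [
        [0.0 if a == b else cosine_distance(vectors[a], vectors[b]) for b in names]
        for a in names
    ]
    return names, matrix
```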

23
Experiments

We ran the GenCompress and LZ compression programs
on flu and mito and calculated pairwise distances
We tried ensembling different measures (Reihaneh)

24
Experiments

We converted pairwise distance matrices into
phylogenetic trees using the Neighbor-Joining
program in PHYLIP
We visualized resulting trees using DRAWGRAM and
TreeView.
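
PHYLIP's distance programs read a plain-text square matrix: the first line holds the number of taxa, and each following line holds a name padded to 10 characters and then that row's distances. A minimal writer for that layout is sketched below; the function name is hypothetical, and names longer than 10 characters are simply truncated, which can cause collisions.

```python
def format_phylip(names: list[str], matrix: list[list[float]]) -> str:
    """Format a square distance matrix in PHYLIP's input layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # PHYLIP expects a 10-character name field
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file named `infile` gives the Neighbor-Joining program its default input.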

26
MSA trees versus HA tree

HA tree by Suzuki and Nei [8]

27
MSA versus Compression

Figure: GenCompress tree, clades labelled
H1, 2, 3; H10, 12, 8; H7; H13, 16;
H15, 4, 5, 6, 9; B

28
MSA versus CCV

Figure: MSA tree, clades labelled
H1, 2, 3; H5, 6, 9; H4, 15, 16, 13;
H7, 8, 10, 12; B

29
MSA vs TF-IDF

Figure: MSA tree, clades labelled
H1, 2, 3; H5, 6, 9; H4, 15, 16, 13;
H7, 10, 12, 8; B

30
MSA vs CCV-IDF

Figure: MSA tree, clades labelled
H1, 2, 3; H5, 6, 9; H4, 15, 16, 13;
H7, 8, 10, 12; B
Figure: CCV-IDF tree (L = 15, cosine distance), clades labelled
H1, 2, 3; H8, 10, 12; H7; H4, 5, 6, 9;
H15; H13, 16; B

31
CCV vs TF-IDF

32
CCV vs CCV-IDF

33
Observations

All methods (MSA, CCV, GenCompress, TF-IDF,
CCV-IDF) generate similar results.
Our results are significantly different from
previous studies.
Most clades are intact while some are scattered
around.
Most clades are pure while some are mixed with
species from nearby clades.
CCV and CCV-IDF results are highly similar.

34
AA versus DNA

CCV k1 = 3, k2 = 7, protein
CCV k1 = 1, k2 = 7, DNA

35
CCV L7 and L9

CCV k1 = 1, k2 = 7, DNA
CCV k1 = 1, k2 = 9, DNA

36
Observations

Most clades are intact.
For similar CCV lengths, the DNA tree is worse
than the protein tree and is unable to recognize
Archaea as a distinct clade.
CCV trees are similar for length 7 and length 9.
Similarly, the L = 7, L = 15 and L = 21 trees for
flu are almost identical.

37
Mito results

For the mito dataset, we have similar
observations.
All methods failed to resolve the fine branches of
the tree, mixing in distant species.

38
Mito primates
TF-IDF L9 cos

CCV L9 cos

40
DNA versus AA Sequence

For the same length k, there are more possible
k-strings for protein sequences (20^k) than for
DNA sequences (4^k).
We need longer k-strings for DNA to achieve the
same resolution as amino acid (AA) sequences.
Due to the redundant nature of the genetic code,
different DNA k-strings may correspond to the
same AA k-string.
AA k-strings can share information even though
their DNA sequences might be different.
DNA sequences may contain intergenic regions, which
do not respond to selection pressure.
Intergenic regions may not contribute much to the
resolution of the tree; they might even reduce
it.
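
The first two points can be checked with simple counting. The sketch below compares the number of possible k-strings over the two alphabets and finds the shortest DNA length whose vocabulary is at least as large as that of protein k-strings; "resolution" is equated with vocabulary size here, which is a simplification. The function name is illustrative.

```python
def dna_length_matching(k_aa: int) -> int:
    """Smallest DNA k-string length n with 4**n >= 20**k_aa."""
    n = 0
    while 4 ** n < 20 ** k_aa:
        n += 1
    return n

for k in (3, 5, 7):
    print(k, 4 ** k, 20 ** k, dna_length_matching(k))
```

For k = 7 there are 16,384 possible DNA strings but 1,280,000,000 possible protein strings, and DNA strings of length 16 are needed to reach a vocabulary of that size.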

41
Thoughts on Document Frequency

We did not observe significant performance
difference by adding in document frequency
information.
For longer genomes (e.g. bac), we need longer
k-strings to see the effect of DF.
All bac genomes share 87.9% of 9-strings and only
0.8% of 11-strings

42
Compression programs

Current compression programs are problematic
LZ could not handle large datasets
Kolmogorov is not applicable for large sequences
These methods should be reimplemented

44
Future work

Run the same experiments on protein sequences
To investigate the effect of using AA versus DNA
sequences.
We expect to see better results with protein
sequences
New results may reveal subtle differences between
different methods.

45
Future work

Speed up the implementation of TF-IDF and run
it on longer k-strings
Computational complexity is the bottleneck for
achieving high resolution in a reasonable amount
of time.
Initially, the calculations for TF and IDF were
separate, which was slow
We achieved a significant speedup by integrating
the calculation of TF and IDF into a two-pass
algorithm
We may drop k-strings with low TF-IDF values to
further speed up the program.

46
Future work

Perform bootstrapping analysis
We were unable to perform bootstrapping analysis
due to time and computational resource constraints

47
Future work

Our proposed evaluation method needs a
many-to-many alignment, which is not a trivial task
It is well studied in machine translation and
Natural Language Processing, and those techniques
could help here
This measure could also be used as a measure of
similarity between trees

49
Conclusion

All string composition methods (CCV, TF-IDF,
CCV-IDF) somewhat group most similar species
together and produce consistent results.
However, they all failed to resolve some big
branches as well as fine branches.
We did not observe significant improvement by
adding document frequency.
But we will need further experiments (with longer
k-strings on AA sequences) to fully understand
the effect.

50
Major Contributions

We proposed a novel term weighting scheme which
achieves performance similar to CCV in our
experiments
We proposed the notion of adding in global
information in the form of document frequency
We discovered that using protein sequence may
significantly improve performance for all methods
We proposed a novel evaluation method for
phylogenetic trees

51
Author Contributions

Yifeng
Collected all three data sets
Performed CCV experiments
Implemented TF-IDF and CCV-IDF
Reihaneh
Built MSA tree for flu
Performed Compression experiments
Implemented ensemble and evaluation methods.

52
Special thanks to

Professor Guohui Lin
Dr. Zhipeng Cai
Proteome Analyst Research Group

53
References

[1] Xin Chen, Sam Kwong, and Ming Li. A
compression algorithm for DNA sequences and its
applications in genome comparison. In Genome
Informatics, pages 52-61, 1999.
[2] Joseph Felsenstein. PHYLIP - phylogeny
inference package (version 3.2). Cladistics,
5:164-166, 1989.
[3] M. Li, J. H. Badger, X. Chen, S. Kwong, P.
Kearney, and H. Zhang. An information-based
sequence distance and its application to whole
mitochondrial genome phylogeny. Bioinformatics,
17(2):149-154, 2001.
[4] Yifeng Liu and Reihaneh Rabbanyk Khorasgani.
A survey on whole genome phylogenetic analysis.
CMPUT 606 course survey, February 2009.
[5] Christopher D. Manning and Hinrich Schütze.
Foundations of Statistical Natural Language
Processing. The MIT Press, June 1999.
[6] Hasan H. Otu and Khalid Sayood. A new
sequence distance measure for phylogenetic tree
construction. Bioinformatics, 19(16):2122-2130,
2003.
[7] N. Saitou and M. Nei. The neighbor-joining
method: a new method for reconstructing
phylogenetic trees. Mol Biol Evol, 4(4):406-425,
July 1987.
[8] Yoshiyuki Suzuki and Masatoshi Nei. Origin
and evolution of influenza virus hemagglutinin
genes. Mol Biol Evol, 19(4):501-509, April 2002.
[9] Xiaomeng Wu, Xiufeng Wan, Gang Wu, Dong Xu,
and Guohui Lin. Phylogenetic analysis using
complete signature information of whole genomes
and clustered neighbor-joining method.
International Journal of Bioinformatics Research
and Applications, 2(3):219-248, 2006.