Whole Genome Phylogenetic Analysis

1
Whole Genome Phylogenetic Analysis
  • Yifeng Liu and Reihaneh Rabbanyk Khorasgani
  • April 8th, 2009

2
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

3
Whole Genome Phylogeny: Motivations
  • Currently, the dominant method for phylogenetic
    analysis is based on a single gene or protein.
  • However, different genes tell different stories.
  • Recently, more genomic sequences have become available.
  • We hope to resolve the above inconsistency by
    using the entire genome (or proteome) to
    reconstruct the phylogenetic tree.

4
Whole Genome Phylogeny: Methods
  • Major categories of methods are based on:
  • Shared gene (ortholog) content
  • Nucleotide and amino acid (string) composition
  • Genome compression
  • Gene order
  • In our study, we focus on string composition and
    compression methods

5
Complete Composition Vector (CCV)
  • The observed occurrence probability for a
    k-string
  • The estimated background occurrence probability
    based on the Markov assumption is
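
The formulas on this slide did not survive extraction. As a hedged reconstruction, the standard composition vector definitions (as used in the CCV paper by Wu et al. listed in the references) can be written as below. The symbols are assumptions introduced here: c(α) is the number of occurrences of the k-string α = a_1 a_2 … a_k, n is the sequence length, p is the observed probability and q is the Markov-based background estimate (defined when the denominator is non-zero).

```latex
p(\alpha) = \frac{c(\alpha)}{n - k + 1},
\qquad
q(a_1 a_2 \cdots a_k) =
  \frac{p(a_1 \cdots a_{k-1})\; p(a_2 \cdots a_k)}{p(a_2 \cdots a_{k-1})}
```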

6
Complete Composition Vector (CCV)
  • The occurrence probability due to selective
    pressure
  • The k-th composition vector
  • The Complete Composition Vector (CCV)
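
The three quantities above can be illustrated with a minimal Python sketch. It assumes the usual composition vector convention S(α) = (p − q) / q, where p is the observed probability of a k-string and q is its Markov-based background estimate; the function names are hypothetical and this is not the authors' implementation. The sketch starts at k = 2, because the background estimate is undefined for k = 1, and how the slides handle that case is not stated.

```python
from collections import Counter


def kstring_probs(seq, k):
    """Observed occurrence probability p of every k-string in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kstring: c / total for kstring, c in counts.items()}


def selection_scores(seq, k):
    """S(alpha) = (p - q) / q for every observed k-string (k >= 2).

    q is estimated from the (k-1)- and (k-2)-string probabilities.
    For k = 2 the empty (k-2)-string has probability 1, so q reduces
    to the product of the two single-letter probabilities.
    """
    if k < 2:
        raise ValueError("the background estimate needs k >= 2")
    p_k = kstring_probs(seq, k)
    p_k1 = kstring_probs(seq, k - 1)
    p_k2 = kstring_probs(seq, k - 2)
    scores = {}
    for kstring, p in p_k.items():
        q = p_k1[kstring[:-1]] * p_k1[kstring[1:]] / p_k2[kstring[1:-1]]
        scores[kstring] = (p - q) / q
    return scores


def complete_composition_vector(seq, k1, k2):
    """Sparse CCV: the k-th composition vectors for k = k1..k2, merged.

    k-strings of different lengths never collide, so one dict is enough.
    """
    ccv = {}
    for k in range(k1, k2 + 1):
        ccv.update(selection_scores(seq, k))
    return ccv
```

A sparse dictionary is used because most of the 4^k (or 20^k) possible k-strings never occur in a given sequence.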

7
Compression Methods
  • Kolmogorov Complexity
  • Lempel-Ziv complexity
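
Kolmogorov complexity is not computable, so in practice it is approximated by the output size of a real compressor. The sketch below illustrates the idea with the normalized compression distance and Python's built-in zlib. Both choices are stand-ins made for illustration: the slides use GenCompress and an LZ-complexity program, and their exact distance formula is not shown here.

```python
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed input (a complexity proxy)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Values near 0 mean the sequences
    share most of their information; values near 1 mean they share little.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

zlib only looks back 32 KB, so this stand-in is unsuitable for whole genomes; a DNA-specific compressor would be needed at that scale.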

8
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

9
A new term weighting scheme
  • CCV uses S() to weight each k-string, which:
  • Utilizes only local information available within
    a single sequence
  • Estimates random background based on a Markov
    model
  • Can we have a measure that uses both local and
    global information without making the Markov
    assumption?

10
Term and Document Frequency
  • Genomes are documents written in a language with
    a four-letter alphabet (A, T, C, G); similarly,
    proteomes are documents written in a language
    with a twenty-letter alphabet.
  • Each k-string can be viewed as a word within a
    genome (or proteome) document.
  • The collection of all genomes in the dataset is
    therefore a corpus.

11
Term and Document Frequency
  • In statistical Natural Language Processing, a
    well-known term weighting scheme TF-IDF combines
    both term frequency and document frequency into a
    single weight.
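
The slide's formula was lost in extraction. A common textbook form of the TF-IDF weight is given below as a reference point; which variant the authors used is not stated. The symbols are assumptions: tf_{i,j} is the number of occurrences of k-string i in genome j, df_i is the number of genomes that contain k-string i, and N is the number of genomes in the corpus.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log \frac{N}{\mathrm{df}_i}
```

A k-string that occurs in every genome has df_i = N and therefore weight 0, so it does not help to distinguish genomes.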

12
CCV meets Document Frequency
  • We can also combine the occurrence probability
    due to selection S() with the inverse document
    frequency into a single weight called CCV-IDF.
  • S() provides local information and df_i provides
    global information.
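
The exact CCV-IDF formula is not recoverable from the slide. One plausible form, assumed here purely by analogy with TF-IDF, multiplies each selection score S by log(N / df). The sketch below takes precomputed S values per genome; the function name and the input layout are hypothetical.

```python
import math
from collections import Counter


def ccv_idf(selection_scores_per_genome):
    """Weight each S value by the inverse document frequency.

    selection_scores_per_genome: one dict per genome, mapping a
    k-string to its selection score S.
    Assumed weight (not confirmed by the slides): S * log(N / df),
    where df is the number of genomes in which the k-string occurs.
    """
    n = len(selection_scores_per_genome)
    df = Counter()
    for scores in selection_scores_per_genome:
        df.update(scores.keys())
    return [
        {kstring: s * math.log(n / df[kstring]) for kstring, s in scores.items()}
        for scores in selection_scores_per_genome
    ]
```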

13
Ensemble Measures
  • Normalizing distances to the same range
  • Combining distance matrices
  • The combination parameters should be adjusted
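
The two steps can be sketched as follows. Min-max scaling and a weighted average are assumptions chosen for illustration, since the slide's formula did not survive; the weights play the role of the parameters to be adjusted.

```python
def min_max_normalize(matrix):
    """Rescale the off-diagonal distances of a square matrix to [0, 1]."""
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(off_diagonal), max(off_diagonal)
    span = hi - lo
    return [
        [0.0 if i == j or span == 0 else (matrix[i][j] - lo) / span
         for j in range(n)]
        for i in range(n)
    ]


def combine(matrices, weights):
    """Weighted average of several normalized distance matrices."""
    if len(matrices) != len(weights):
        raise ValueError("need exactly one weight per matrix")
    normalized = [min_max_normalize(m) for m in matrices]
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, normalized)) / total
         for j in range(n)]
        for i in range(n)
    ]
```

Note that min-max scaling maps the closest pair in each matrix to distance 0; dividing by the maximum alone would avoid that if it is undesirable.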

14
Tree Evaluation
  • We propose a new evaluation method for evaluating
    phylogenetic trees
  • A numeric measure
  • Shows how compatible the tree is with the given
    taxonomy

15
Tree Evaluation (Cont.)
  • Labeling the inner nodes in the tree
  • For each species:
  • A path in the tree
  • → sequence of inner node labels
  • A taxonomy description
  • → taxonomy sequence
  • There should be a many-to-many alignment between
    these two sequences

16
Tree Evaluation (Cont.)
  • Finding an alignment between these sequences for
    all the species
  • Using a Bayesian network
  • Finding the most probable alignments
  • Measuring the log-likelihood of these alignments
  • How probable is this tree given this taxonomy?

17
Tree Evaluation (Example)
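  • Phylogenetic tree with inner nodes labeled 1, 2, 3
  • Taxonomy
  • A: T1 T2
  • B: T1 T3
  • C: T1 T3
  • D: T1 T3
  • Paths and their alignments
  • P1: path 1, alignment <T1 T2, 1>
  • P2: path 1 2, alignment <T1, 1> <T3, 2>
  • P3: path 1 2 3, alignment <T1, 1> <T3, 2 3>
  • P4: path 1 2 3, alignment <T1, 1> <T3, 2 3>
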
18
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

19
Dataset: influenza virus
  • Influenza virus genomes (flu)
  • 44 influenza A genomes (3 for H1-H13, 2 for H16)
  • 3 influenza B genomes
  • 1 influenza C genome (outgroup)
  • Coding gene sequences only
  • Collected and joined from individual gene
    sequences in the following order: HA,
    NA, NP, M, NS, PA, PB1, PB2
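
Joining the gene sequences in a fixed order can be sketched as below. The function name and the dictionary input are hypothetical; only the segment order comes from the slide.

```python
from typing import Dict

# Segment order as listed on the slide.
SEGMENT_ORDER = ["HA", "NA", "NP", "M", "NS", "PA", "PB1", "PB2"]


def join_segments(genes: Dict[str, str]) -> str:
    """Concatenate the coding sequences of one strain in the fixed order.

    genes maps a segment name (e.g. "HA") to its coding sequence.
    A missing segment raises an error instead of being skipped, so
    every joined genome has the same layout.
    """
    missing = [name for name in SEGMENT_ORDER if name not in genes]
    if missing:
        raise KeyError("missing segments: " + ", ".join(missing))
    return "".join(genes[name] for name in SEGMENT_ORDER)
```

Using the same order for every strain keeps k-strings that span a segment boundary comparable across genomes.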

20
Dataset: prokaryotes
  • Prokaryote genomes (bac)
  • 88 bacterial genomes
  • 11 archaeal genomes
  • Nanoarchaeum equitans is used as the outgroup.
  • Collected from NCBI according to the accession
    numbers provided in the CCV paper.
  • Genomic DNA sequences, including intergenic
    regions.

21
Dataset: mammal mitochondria
  • Mammal mitochondria (mito)
  • 425 mammal mitochondria
  • 1 Arabidopsis mitochondrion (outgroup)
  • Collected from the Organelle Genome
    Megasequencing Program website.
  • Converted from NCBI format to FASTA format.
  • Contains many duplicated entries for:
  • Bos taurus (cattle)
  • Sus scrofa (wild boar)
  • Mus musculus (mouse)
  • Rattus norvegicus (rat)

22
Experiments
  • We built a multiple sequence alignment (MSA) tree
    for flu
  • We ran CCV, TF-IDF and CCV-IDF on all three
    datasets with the following k-string lengths (we
    fixed K1 = 1 and varied only K2, so
    L = K2 - K1 + 1 = K2)
  • Flu: L = 7 and L = 15
  • Bac and mito: L = 7 and L = 9
  • Each run generates a pairwise distance matrix.
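
Turning per-genome weight vectors into a pairwise distance matrix can be sketched as follows. The figure labels elsewhere in the deck mention "cos", so a cosine-based distance is assumed; the specific form D = (1 - cos) / 2 follows the usual composition vector convention and is an assumption, as are the function names.

```python
import math


def cosine_distance(u, v):
    """Distance (1 - cos) / 2 between two sparse vectors stored as dicts.

    The result lies in [0, 1] even when the weights can be negative.
    """
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.5  # treat an empty vector as orthogonal to everything
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def distance_matrix(vectors):
    """Symmetric pairwise distance matrix for a list of sparse vectors."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = cosine_distance(vectors[i], vectors[j])
    return d
```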

23
Experiments
  • We ran the GenCompress and LZ compression programs
    on flu and mito and calculated pairwise distances
  • We tried ensembling different measures (Reihaneh)

24
Experiments
  • We converted pairwise distance matrices into
    phylogenetic trees using the Neighbor-Joining
    program in PHYLIP
  • We visualized resulting trees using DRAWGRAM and
    TreeView.
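
PHYLIP's distance programs read a plain-text matrix: the number of taxa on the first line, then one row per taxon with the name padded to 10 characters. A minimal sketch of producing that text from a square matrix is shown below; the function name is hypothetical.

```python
def format_phylip_distances(names, matrix):
    """Return a square distance matrix in PHYLIP text format.

    PHYLIP reads the first 10 characters of each row as the taxon
    name, so names are truncated and padded to exactly 10 characters.
    """
    short = [name[:10] for name in names]
    if len(set(short)) != len(short):
        raise ValueError("names are not unique after truncation to 10 characters")
    lines = ["%5d" % len(names)]
    for name, row in zip(short, matrix):
        lines.append(name.ljust(10) + " " + " ".join("%.6f" % d for d in row))
    return "\n".join(lines) + "\n"
```

By default the PHYLIP programs read a file named `infile` in the working directory, and the Neighbor-Joining program writes its tree to `outtree`, so the returned text would be saved as `infile` before running it.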

25
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

26
MSA trees versus HA tree
  • Figure: MSA trees compared with the HA tree by
    Suzuki et al.

27
MSA versus Compression
  • Figure: GenCompress tree with clades H1, 2, 3;
    H10, 12, 8; H7; H13, 16; H15, 4, 5, 6, 9; B

28
MSA versus CCV
  • Figure: MSA tree with clades H1, 2, 3; H5, 6, 9;
    H4, 15, 16, 13; H7, 8, 10, 12; B

29
MSA vs TF-IDF
  • Figure: MSA tree with clades H1, 2, 3; H5, 6, 9;
    H4, 15, 16, 13; H7, 10, 12, 8; B

30
MSA vs CCV-IDF
  • Figure: MSA tree with clades H1, 2, 3; H5, 6, 9;
    H4, 15, 16, 13; H7, 8, 10, 12; B
  • Figure: CCV-IDF tree (L = 15, cos) with clades
    H1, 2, 3; H8, 10, 12; H7; H4, 5, 6, 9; H15;
    H13, 16; B

31
CCV vs TF-IDF

32
CCV vs CCV-IDF

33
Observations
  • All methods (MSA, CCV, GenCompress, TF-IDF,
    CCV-IDF) generate similar results.
  • Our results are significantly different from
    previous studies.
  • Most clades are intact while some are scattered
    around.
  • Most clades are pure while some are mixed with
    species from nearby clades.
  • CCV and CCV-IDF results are highly similar.

34
AA versus DNA
  • Figure: CCV tree, k1 = 3, k2 = 7, protein
  • Figure: CCV tree, k1 = 1, k2 = 7, DNA

35
CCV L = 7 and L = 9
  • Figure: CCV tree, k1 = 1, k2 = 7, DNA
  • Figure: CCV tree, k1 = 1, k2 = 9, DNA

36
Observations
  • Most clades are intact.
  • For similar CCV lengths, the DNA tree is worse
    than the protein tree and unable to recognize
    Archaea as a distinct clade.
  • CCV trees are similar for length 7 and length 9.
  • Similarly, the L = 7, L = 15 and L = 21 trees for
    flu are almost identical.

37
Mito results
  • For the mito dataset, we have similar
    observations.
  • All methods failed to resolve fine branches of
    the tree by mixing in distant species.

38
Mito primates
  • Figure: TF-IDF tree (L = 9, cos)
  • Figure: CCV tree (L = 9, cos)

39
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

40
DNA versus AA Sequence
  • There are more possible k-strings for protein
    sequences than for DNA sequences of the same
    length.
  • We need longer k-strings for DNA to achieve the
    same resolution as amino acid (AA) sequences.
  • Due to the redundant nature of the genetic code,
    different DNA k-strings may correspond to the
    same AA k-string.
  • AA k-strings can share information even though
    their DNA sequences might be different.
  • DNA sequences may contain intergenic regions,
    which do not respond to selection pressure.
  • Intergenic regions may not contribute much to the
    resolution of the tree; they might even reduce
    such resolution.
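
The first two points can be checked with a short calculation. It counts all possible strings over each alphabet, not the strings actually observed in a genome; k' is a symbol introduced here for the DNA k-string length needed to match AA k-strings of length k.

```latex
\begin{gather*}
\#\{\text{DNA } k\text{-strings}\} = 4^{k},
\qquad
\#\{\text{AA } k\text{-strings}\} = 20^{k} \\
4^{7} = 16{,}384,
\qquad
20^{7} = 1.28 \times 10^{9} \\
4^{k'} \ge 20^{k}
\iff
k' \ge k \log_{4} 20 \approx 2.16\,k
\end{gather*}
```

For k = 7 this gives k' ≥ 16, since 4^15 ≈ 1.07 × 10^9 is still below 20^7 while 4^16 ≈ 4.29 × 10^9 exceeds it.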

41
Thoughts on Document Frequency
  • We did not observe significant performance
    difference by adding in document frequency
    information.
  • For longer genomes (e.g. bac), we need longer
    k-strings to see the effect of DF.
  • All bac genomes share 87.9% of 9-strings and only
    0.8% of 11-strings

42
Compression programs
  • Current compression programs are problematic
  • LZ could not handle large datasets
  • Kolmogorov is not applicable for large sequences
  • These methods should be reimplemented

43
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

44
Future work
  • Run the same experiments on protein sequences
  • Investigate the effect of using AA versus DNA
    sequences.
  • We expect to see better results with protein
    sequences
  • New results may reveal subtle differences between
    different methods.

45
Future work
  • Speed up the implementation of TF-IDF and run
    it on longer k-strings
  • Computational complexity is the bottleneck for
    achieving high resolution in a reasonable amount
    of time.
  • Initially, the calculations for TF and IDF were
    separate, which was slow
  • We achieved a significant speedup by integrating
    the calculation of TF and IDF into a two-pass
    algorithm
  • We may drop k-strings with low TF-IDF values to
    further speed up the program.
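
One way to read the two-pass idea is sketched below: the first pass counts term and document frequencies together, and the second pass turns the counts into weights and drops low-weight k-strings. This is an illustration under assumptions, not the authors' code; the weight tf × log(N / df), the function names and the `min_weight` threshold are all hypothetical.

```python
import math
from collections import Counter


def kstring_counts(seq, k):
    """Number of occurrences of every k-string in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tf_idf_two_pass(genomes, k, min_weight=0.0):
    """Sparse TF-IDF vectors for a list of genome sequences.

    Pass 1: count term frequencies per genome and, in the same scan,
            the document frequency of every k-string.
    Pass 2: compute tf * log(N / df) and keep only weights above
            min_weight.
    """
    n = len(genomes)
    term_freqs = []
    doc_freq = Counter()
    for seq in genomes:                      # pass 1
        counts = kstring_counts(seq, k)
        term_freqs.append(counts)
        doc_freq.update(counts.keys())
    vectors = []
    for counts in term_freqs:                # pass 2
        vector = {}
        for kstring, tf in counts.items():
            weight = tf * math.log(n / doc_freq[kstring])
            if weight > min_weight:
                vector[kstring] = weight
        vectors.append(vector)
    return vectors
```

With the default threshold, only k-strings shared by every genome (weight 0) are dropped, which changes neither dot products nor vector norms; a higher threshold trades accuracy for speed.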

46
Future work
  • Perform bootstrapping analysis
  • We were unable to perform bootstrapping analysis
    due to time and computational resource constraints

47
Future work
  • Our proposed evaluation method needs a
    many-to-many alignment, which is not a trivial task
  • It is well studied in machine translation and
    natural language processing, and those techniques
    could help here
  • This measure could also be used as a measure of
    similarity between trees

48
Agenda
  • Introduction
  • Our method proposals
  • Datasets and experiments
  • Results
  • Discussion
  • Future work
  • Conclusion

49
Conclusion
  • All string composition methods (CCV, TF-IDF,
    CCV-IDF) somewhat group most similar species
    together and produce consistent results.
  • However, they all failed to resolve big branches
    as well as fine branches.
  • We did not observe significant improvement by
    adding document frequency.
  • But we will need further experiments (with longer
    k-strings on AA sequences) to fully understand
    the effect.

50
Major Contributions
  • We proposed a novel term weighting scheme which
    achieves performance similar to CCV in our
    experiments
  • We proposed the notion of adding in global
    information in the form of document frequency
  • We discovered that using protein sequence may
    significantly improve performance for all methods
  • We proposed a novel evaluation method for
    phylogenetic trees

51
Author Contributions
  • Yifeng
  • Collected all three data sets
  • Performed CCV experiments
  • Implemented TF-IDF and CCV-IDF
  • Reihaneh
  • Built MSA tree for flu
  • Performed Compression experiments
  • Implemented ensemble and evaluation methods.

52
Special thanks to
  • Professor Guohui Lin
  • Dr. Zhipeng Cai
  • Proteome Analyst Research Group

53
References
  • [1] Xin Chen, Sam Kwong, and Ming Li. A
    compression algorithm for DNA sequences and its
    applications in genome comparison. In Genome
    Informatics, pages 52-61, 1999.
  • [2] Joseph Felsenstein. PHYLIP - Phylogeny
    Inference Package (version 3.2). Cladistics,
    5:164-166, 1989.
  • [3] M. Li, J. H. Badger, X. Chen, S. Kwong, P.
    Kearney, and H. Zhang. An information-based
    sequence distance and its application to whole
    mitochondrial genome phylogeny. Bioinformatics,
    17(2):149-154, 2001.
  • [4] Yifeng Liu and Reihaneh Rabbanyk Khorasgani.
    A survey on whole genome phylogenetic analysis.
    CMPUT 606 course survey, February 2009.
  • [5] Christopher D. Manning and Hinrich Schütze.
    Foundations of Statistical Natural Language
    Processing. The MIT Press, June 1999.
  • [6] Hasan H. Otu and Khalid Sayood. A new
    sequence distance measure for phylogenetic tree
    construction. Bioinformatics, 19(16):2122-2130,
    2003.
  • [7] N. Saitou and M. Nei. The neighbor-joining
    method: a new method for reconstructing
    phylogenetic trees. Mol Biol Evol, 4(4):406-425,
    July 1987.
  • [8] Yoshiyuki Suzuki and Masatoshi Nei. Origin
    and evolution of influenza virus hemagglutinin
    genes. Mol Biol Evol, 19(4):501-509, April 2002.
  • [9] Xiaomeng Wu, Xiufeng Wan, Gang Wu, Dong Xu,
    and Guohui Lin. Phylogenetic analysis using
    complete signature information of whole genomes
    and clustered neighbor-joining method.
    International Journal of Bioinformatics Research
    and Applications, 2(3):219-248, 2006.