Title: PHYLOGENETIC TREES
1PHYLOGENETIC TREES
- Introduction to Computational Biology CIS 786
With Dr. Barry Cohen - Tuesday, May 7, 2001
- Paul Wood
- Yanchun Song
- Chaowei Sun
2Introduction
Paul Wood
Chaowei Sun
Yanchun Song
3What is a Phylogenetic Tree?
- Phylogenetic trees are representations of the
similarity or dissimilarity among both existing
and extinct organisms across a set of
characteristics or features.
- Similarity of molecular and physical systems
provides compelling evidence that all life on
earth arose from a common ancestry.
Carl R. Woese, Interpreting the universal
phylogenetic tree, Proc. Natl. Acad. Sci. USA,
Vol. 97, Issue 15, 8392-8396, July 18, 2000.
http://www.pnas.org/cgi/content/full/97/15/8392
4Why do we study Phylogenetic Trees?
Because humans need to fill in the blanks
and understand in our own language.
COMPARE
- "Shall I compare thee to a summer's day?" - W. Shakespeare, Sonnet 18
SIMILARITY
- "There is a similarity between Homer and Hesiod, between Æschylus and Euripides" - P. Shelley, Prometheus Unbound
PATTERNS
- "Life all around me / All in the loom, and oh what patterns! Woodlands, meadows," - E. L. Masters, Spoon River Anthology
CLASSIFY
- "If the foolish call them flowers / Need the wiser tell? // If the savants classify them / It is just as well." - E. Dickinson, Part 1 Life, XCIV
5What are some applications of phylogenetic
trees?
- Computational Linguistics
- Manning, Christopher D. and Heinrich Schütze,
Foundations of Statistical Natural Language
Processing, MIT Press, Cambridge, Massachusetts,
1999. http://www.aclweb.org/archive/fsnlp-ch1.pdf
- Archaeological Statistics
- Archaeological Statistics Brief Bibliography
http://ad.trafficmp.com/tmpad/banner/itrack.asp?rv3.0id16nojs1
- Broad Historical and Technical Overview
- Discriminant Analysis and Clustering, Panel on
Discriminant Analysis, Classification, and
Clustering, Committee on Applied and Theoretical
Statistics, Board on Mathematical Sciences,
Commission on Physical Sciences, Mathematics, and
Resources, National Research Council, National
Academy Press, Washington, D.C., 1988.
http://www.ulib.org/webRoot/Books/National_Academy_Press_Books/discrim_analysis/discr001.htm
6Phylogenetic trees are used to study locations,
migrations, lives, health, and cultures of
populations.
Xenia
Katrina
Helena
Tara
Ursula
Velda
Jasmine
http://www.oxfordancestors.com/daughters.html
7Phylogenetic trees are used to study physical and
genetic variability and the evolution of species.
http://www.oxfordancestors.com/daughters.html
8Which areas of the genome provide mutation data to
create phylogenetic trees?
Autosomes
Mitochondrial Control Region
Y-Chromosome
9How do we get data for computational biology?
STEP 1 Eukaryotic biochemical protocol is kind
of like washing greasy dishes!
[Figure: extracting DNA from tissue. Labels: Tissue; Homogenize; Detergent (Sodium Dodecyl Sulphate, SDS); Phenol; Remove Upper Phase; Cesium Chloride (Cs) concentration gradient; SPIN 40 hrs at 40,000 RPM; High Weight, Medium Weight, Low Weight; DNA; RNA; Genetic Material; Insoluble Protein.]
10How do we get sequence data?
STEP 2 Cut up DNA using one of two methods
- 2a Sanger (Dideoxy): 4 reactions
- 2b Maxam-Gilbert: 4 reactions
STEP 3 Label fragments using one of two
methods
- 3a Fluorescent dye: fluorescence spectroscopy
- 3b 32-Phosphate: autoradiography
[Figure: sequencing workflow. Labels: DNA; RNA; Cs; EtOH; Restriction Enzymes; Gel Electrophoresis; example read atcgagtcc.]
11What is the rate of evolutionary change, or how
many mutations can we expect?
- Estimates vary depending upon assessment method
and location within the genome.
- A study of 134 independent mtDNA lineages spanning 327
generations found 2.5 mutations per site per
million years.
- A high observed substitution rate in the human
mitochondrial DNA control region. Parsons TJ,
Muniec DS, Sullivan K, Woodyatt N,
Alliston-Greiner R, Wilson MR, Berry DL, Holland
KA, Weedn VW, Gill P, Holland MM. Nat Genet 1997
Apr;15(4):363-8. Armed Forces DNA
Identification Laboratory, Armed Forces Institute
of Pathology, Rockville, Maryland 20850, USA.
http://www.mhrc.net/mitochondria.htm
- M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt
(1978). A model of evolutionary change in
proteins. In Atlas of Protein Sequence and
Structure, M. O. Dayhoff (Ed.), National
Biomedical Research Foundation, Vol. 5, Suppl. 3,
chapter 22, 345-352.
12What do sequence data and input files typically
look like?
PHYLIP INPUT FILE (SEQUENCE)
- 282
- 1 AY053096 cacgggagct variable region... 282
- 2 AY053097 cacgggagct variable region... 282
- 3 AY053098 cacgggagct variable region... 282
- .
MEGA INPUT FILE (SEQUENCE)
- 263
- !Domain=Data property=Coding CodonStart=1
- W._Pygmy_(1)_African TTC TTT CAT GGG
- W._Pygmy_(6)_African ... ... ... ...
- Kung_(7)_African ... .C. ... ... .T.
- Kung_(9)_African ... ... ... ... ...
- Kung_(10)_African ... ... ... ... ...
- Kung_(13)_African ... ... .G. ... ...
DISTANCE MATRIX
13What are some of the major classifications of
algorithms and software applications?
PHYLIP, PAUP, and MEGA are represented across most
categories. PHYLIP is the most widely
distributed and used. PAUP is most frequently
cited in publications. MEGA has a nice GUI and
is user friendly.
http://evolution.genetics.washington.edu/phylip/software.html
14Yanchun Song
15Two Types of Data
- Distance-based
- The input is a matrix of distances between the
species (e.g., the alignment score between them
or the fraction of residues they agree on).
- Character-based
- Examine each character (e.g., a base in a
specific position in the DNA) separately.
16Pairwise Distance
- Model of Jukes and Cantor
- Each base in the DNA sequence has an equal chance
of mutating, and when it does, it is replaced by
one of the other three nucleotides with equal probability.
- Distance $d_{ij}$
- The fraction $f$ of sites $u$ where residues $x_u^i$ and
$x_u^j$ differ (presupposing an alignment of the two
sequences).
T. H. Jukes and C. Cantor, "Evolution of protein
molecules," in Mammalian Protein Metabolism,
pages 21-132, Academic Press, New
York, 1969.
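As a minimal sketch of the distance on this slide, the Python below computes the fraction f of differing sites for two aligned sequences and then applies the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4f/3), which the slide does not spell out. The function names and the two short reads are hypothetical, chosen only for illustration.

```python
import math


def p_distance(seq_i: str, seq_j: str) -> float:
    """Fraction f of aligned sites at which the two sequences differ."""
    if len(seq_i) != len(seq_j) or not seq_i:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    mismatches = sum(1 for a, b in zip(seq_i, seq_j) if a != b)
    return mismatches / len(seq_i)


def jukes_cantor_distance(seq_i: str, seq_j: str) -> float:
    """Jukes-Cantor corrected distance, d = -(3/4) * ln(1 - 4f/3)."""
    f = p_distance(seq_i, seq_j)
    if f >= 0.75:
        # The logarithm's argument is non-positive: the sequences are saturated.
        raise ValueError("f >= 3/4: the Jukes-Cantor distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


# Hypothetical aligned reads that differ at one site out of nine.
read_1 = "atcgagtcc"
read_2 = "atcgagtca"
f = p_distance(read_1, read_2)
d = jukes_cantor_distance(read_1, read_2)
print(f"f = {f:.4f}, Jukes-Cantor d = {d:.4f}")
```

The corrected distance is always at least as large as f, because it accounts for sites that may have mutated more than once.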
17How to Make a Tree?
- Clustering methods
- UPGMA
- Neighbor-joining
- Parsimony
18Clustering Method UPGMA
- UPGMA: Unweighted Pair Group Method with
Arithmetic Mean
- Distance $D_{i,j}$ between two clusters of species $C_i$ and $C_j$:
- $D_{i,j} = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- $d(p, q)$: distance function between species,
- $n_i = |C_i|$ and $n_j = |C_j|$.
http://www.math.tau.ac.il/rshamir/algmb/00/scribe00/html/lec08/node21.html
19Algorithm
- Initialization
- Initialize n clusters with the given species, one
species per cluster.
- Set the size of each cluster $n_i \leftarrow 1$ and assign a leaf for
each species.
- Iteration
- Find the pair i, j with minimal $D_{i,j}$.
- Create a new cluster (ij), which has $n_{(ij)} = n_i + n_j$
members.
- Connect i and j to the new node (ij), which is placed at
height $D_{i,j}/2$ (so a branch from a leaf has length $D_{i,j}/2$).
- Compute the distance from (ij) to all other
clusters as a weighted average of the distances
from its components: $D_{(ij),k} = \frac{n_i}{n_i + n_j} D_{i,k} + \frac{n_j}{n_i + n_j} D_{j,k}$.
- Replace the columns and rows of clusters i and j in
D with cluster (ij), with $D_{(ij),k}$ computed as
above.
- Termination
- Stop when there is only one cluster left.
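The steps on this slide can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the code used by any package named in this talk: clusters are represented as nested tuples, the distance table is a dictionary keyed by unordered pairs, and the six-species matrix is a hypothetical one chosen so that its first merge gives averages of 4, 6, 6 and 8.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, merges): tree is a nested tuple of labels, and merges lists
    (cluster_i, cluster_j, height_of_new_node) in the order clusters were joined.
    """
    D = dict(dist)                       # working copy of the distance table
    size = {name: 1 for name in names}   # n_i = 1 for every initial cluster
    merges = []

    while len(size) > 1:
        # Find the pair of clusters with minimal D_ij (first pair wins ties).
        i, j = min(combinations(size, 2), key=lambda pair: D[frozenset(pair)])
        d_ij = D[frozenset((i, j))]
        n_i, n_j = size.pop(i), size.pop(j)
        new = (i, j)                     # the merged cluster (ij)

        # Distance from (ij) to every other cluster k: size-weighted average.
        for k in size:
            D[frozenset((new, k))] = (
                n_i * D[frozenset((i, k))] + n_j * D[frozenset((j, k))]
            ) / (n_i + n_j)

        size[new] = n_i + n_j
        merges.append((i, j, d_ij / 2))  # the new node sits at height D_ij / 2

    (tree,) = size                       # the single remaining cluster
    return tree, merges


# Hypothetical lower-triangular distance matrix for six species A..F.
leaves = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((a, b)): d for b, row in rows.items() for a, d in zip(leaves, row)}

tree, merges = upgma(leaves, dist)
for i, j, height in merges:
    print(f"join {i} and {j} at height {height}")
```

Scanning all pairs on every iteration keeps the sketch short; it is not the fastest way to find the minimum.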
20UPGMA Example
http://www.icp.ucl.ac.be/opperd/private/upgma.html
21UPGMA Example (contd)
$D_{(A,B),C} = (D_{AC} + D_{BC}) / 2 = 4$
$D_{(A,B),D} = (D_{AD} + D_{BD}) / 2 = 6$
$D_{(A,B),E} = (D_{AE} + D_{BE}) / 2 = 6$
$D_{(A,B),F} = (D_{AF} + D_{BF}) / 2 = 8$
http://www.icp.ucl.ac.be/opperd/private/upgma.html
22UPGMA Example (contd)
http://www.icp.ucl.ac.be/opperd/private/upgma.html
23Additivity
- Given a tree, its edge lengths are said to be
additive if the distance between any pair of
leaves is the sum of the lengths of the edges on
the path connecting them.
24Additivity
- $D_{im} = D_{ik} + D_{km}$
- $D_{jm} = D_{jk} + D_{km}$
- $D_{ij} = D_{ik} + D_{jk}$
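A short worked step may help here, assuming k is the internal node that joins leaves i and j and m is any other leaf (the slide's figure is not preserved, so this reading is an assumption). Adding the first two relations on this slide and subtracting the third isolates the distance from k to m, using only distances between leaves:

```latex
\begin{align*}
D_{im} + D_{jm} - D_{ij}
  &= (D_{ik} + D_{km}) + (D_{jk} + D_{km}) - (D_{ik} + D_{jk}) \\
  &= 2\,D_{km} \\
D_{km} &= \tfrac{1}{2}\,\bigl(D_{im} + D_{jm} - D_{ij}\bigr)
\end{align*}
```

This is the update that neighbor-joining uses for the distance from a newly created node to the remaining nodes.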
25The idea of Neighbor-joining
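- Distance of i from the rest of the tree: $u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2}$, where n is the current number of nodes
- To find neighboring nodes i and j, take the pair that attains
- $\min_{i,j} \left( D_{i,j} - (u_i + u_j) \right)$
R. Durbin, et al., Additivity and
neighbour-joining, Biological Sequence Analysis,
p. 169-173, Cambridge Univ. Press, 1999.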
26Algorithm Neighbor-Joining
- Initialization
- Define T to be the set of leaf nodes, one for
each given sequence, and put $n = |T|$.
- Iteration
- For each species i, compute $u_i = \sum_{k \neq i} \frac{D_{i,k}}{n - 2}$, where n is the current size of T.
- Choose a pair i, j in T for which $D_{i,j} - (u_i + u_j)$
is minimal.
- Join i and j to a new cluster k = (ij). Calculate
the branch lengths from i and j to the new node k
as
- $D_{i,k} = \frac{1}{2}(D_{i,j} + u_i - u_j)$, $D_{j,k} = \frac{1}{2}(D_{i,j} + u_j - u_i)$
- Compute the distances between k and each other
cluster:
- $D_{k,m} = \frac{1}{2}(D_{i,m} + D_{j,m} - D_{i,j})$, $m \in T$
- Remove i and j from T and add k.
- Termination
- When T consists of only two nodes i and j,
connect the remaining nodes by a branch of length
$D_{i,j}$.
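The neighbor-joining loop on this slide can be sketched in Python as below. It is a hedged illustration, not any package's implementation: new nodes are labeled with tuples, the distance table is a dictionary keyed by unordered pairs, and the four-leaf matrix is a hypothetical additive example (branches A:1, B:4, C:1, D:4, internal branch 1) in which A and C are closest in raw distance even though they are not neighbors.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of edges (node, other_node, branch_length).
    """
    D = dict(dist)
    T = list(names)
    edges = []

    def d(a, b):
        return D[frozenset((a, b))]

    while len(T) > 2:
        n = len(T)
        # u_i: summed distance from i to all other nodes, divided by n - 2.
        u = {i: sum(d(i, k) for k in T if k != i) / (n - 2) for i in T}
        # Pick the pair minimizing D_ij - (u_i + u_j).
        i, j = min(combinations(T, 2), key=lambda p: d(*p) - u[p[0]] - u[p[1]])
        d_ij = d(i, j)
        k = (i, j)  # label for the new internal node
        edges.append((i, k, 0.5 * (d_ij + u[i] - u[j])))
        edges.append((j, k, 0.5 * (d_ij + u[j] - u[i])))
        T.remove(i)
        T.remove(j)
        # Distance from the new node k to every remaining node m.
        for m in T:
            D[frozenset((k, m))] = 0.5 * (d(i, m) + d(j, m) - d_ij)
        T.append(k)

    i, j = T
    edges.append((i, j, d(i, j)))  # connect the last two nodes
    return edges


# Hypothetical additive distances for leaves A, B, C, D.
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
dist = {frozenset(p): v for p, v in pairs.items()}

edges = neighbor_joining(["A", "B", "C", "D"], dist)
for a, b, length in edges:
    print(a, "--", b, "length", length)
```

On this matrix the sketch joins A with B and C with D and recovers the branch lengths used to build it, whereas picking the smallest raw distance would have joined A and C first.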
27Chaowei Sun
28MEGA 2
- Molecular Evolutionary Genetics Analysis
- Provides tools for exploring and analyzing DNA
and protein sequences from evolutionary
perspectives
29History of MEGA
- MEGA 1
- DOS-Based
- MEGA 2
- User-friendly interface
- Windows
- Macintosh
- Sun Workstation
- Linux
30Input
- Character Sequence
- - DNA/RNA
- - Protein
- Distance Matrix
- Import data from other formats, PHYLIP, XML, etc.
31Character Sequence
32Distance Matrix
33Methods and Algorithms
- Methods for constructing phylogenetic trees from
molecular data:
- 1. UPGMA Method
- 2. Neighbor-Joining (NJ) Method
- 3. Minimum Evolution (ME) Method
- 4. Maximum Parsimony (MP) Method
34Unweighted Pair Group Method with Arithmetic Mean
- UPGMA
- Assumes a constant rate of evolution
- sequential clustering method
- Produces a rooted tree
- edge lengths - time measured by a molecular clock
35Neighbor-Joining - NJ
- No assumption of a constant rate of evolution
- Finds neighbors sequentially that may minimize
the total length of the tree
- Produces an unrooted tree
- Root: midpoint of the longest route connecting
two taxa in the tree
36Minimum Evolution - ME
- Finds a topology with the smallest sum of branch
lengths
- Time-consuming: the sum of branch lengths for all
topologies has to be evaluated
37Maximum Parsimony - MP
- Finds a topology that requires the smallest
number of changes (substitutions)
- For each topology, sums up the total number of
substitutions
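One common way to count the substitutions a given topology requires is Fitch's algorithm, which the slides do not name; the Python below is a small sketch of it under that assumption. Trees are nested tuples of leaf names, and the four-taxon alignment is hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the minimum number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AAG", "B": "AAA", "C": "GGA", "D": "GGA"}

topology_1 = (("A", "B"), ("C", "D"))
topology_2 = (("A", "C"), ("B", "D"))
length_1 = parsimony_length(topology_1, alignment)
length_2 = parsimony_length(topology_2, alignment)
print("topology 1:", length_1, "changes; topology 2:", length_2, "changes")
```

Maximum parsimony would prefer the topology with the smaller total; the costly part in practice is searching over topologies, not scoring one of them.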
38Output - UPGMA
39Unrooted Tree - NJ
40Output - NJ
41Output - ME
42Comparison
Computational method:   Optimality criterion    Clustering algorithm
Characters              Parsimony               -
Distance                Minimum Evolution       UPGMA, Neighbor-Joining
43Comparison Contd
- UPGMA, Neighbor-Joining
- Fast, O(n^2); suitable for large datasets
- Result depends upon the order in which we add
sequences to the tree
- Minimum Evolution, Maximum Parsimony
- Time consuming, NP-complete
- Use an explicit function relating the trees to
the data
44The End
Thank you and enjoy the finals