IE68 Biological databases Phylogenetic analysis presentation

Title: IE68 Biological databases Phylogenetic analysis

1
IE68 - Biological databases: Phylogenetic analysis
2
Phylogenetic analysis

Phylogeny:
a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes or proteins from biological data
organisms: populations, species, genera, ... -> taxa -> operational taxonomic units (OTUs)
data: molecular, morphological, archaeological, ... -> characters
Phylogenetic tree:
the graphical reconstruction of a phylogeny
tree structure: phylogram, cladogram

3
Phylogenetic tree
A tree consists of nodes connected by branches
Terminal nodes (A B C D E) -> OTUs for which we have data
Polytomy: a node from which more than two branches descend
Root (placed by outgroup or midpoint) -> ancestor of all the taxa that comprise the tree
Notation: ((A,B),(C,D,E))
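
The bracket notation on this slide can be read mechanically. The following is a minimal sketch, assuming a label-only string without branch lengths; the function names `parse_newick` and `polytomies` are illustrative and not taken from any phylogenetics package.

```python
def parse_newick(text):
    """Parse a label-only bracket string into nested lists of taxon names."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return children
        start = pos                       # a leaf: read the taxon label
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return node()


def polytomies(tree):
    """Return every internal node that has more than two children."""
    if isinstance(tree, str):
        return []
    found = [tree] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


tree = parse_newick("((A,B),(C,D,E))")
print(tree)              # nested lists of taxa
print(polytomies(tree))  # the (C,D,E) node is a polytomy
```

In this sketch the group (C,D,E) is reported as a polytomy because three branches leave the same node.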
4
Phylogenetics <-> Phenetics

Phenetics: method of grouping taxa that is based on overall (dis)similarities of characters -> with no reference to evolution!
Phylogenetics: method of grouping taxa that is based on shared derived characters (synapomorphies) or a model of evolution

5
Why do we need phylogenies?

Intrinsic interest in the tree -> tree of life, origin of organisms

6
Why do we need phylogenies?

Phylogenies can also be used as tools for
investigating other problems
e.g. biogeography
phylogeny reflects the order of separation of
the areas the different taxa occupy

7
Why do we need phylogenies?

Phylogenies can also be used as tools for
investigating other problems
e.g. forensic science

9
Phylogenetic analysis

Molecular Phylogenetics
reconstruction of the evolutionary (genealogical)
history of a group of organisms from molecular
data, i.e. DNA or protein sequences
In this lecture, we will focus on phylogenetic
analysis of organisms based on DNA sequence data

10
Molecular phylogenetics approach

Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis

11
PCR and DNA sequencing

Which loci?
DNA sequence information, primers, variability, single or low-copy, orthologous, neutral, recombination, ...
Gene trees versus organismal trees:
phylogenies for genes do not always match those of their corresponding organisms -> analyse more than one gene

12
Confounding influence of gene duplication
Two types of homology: orthology (speciation) and paralogy (gene duplication)
13
Lineage sorting and coalescence
(figure labels: species, alleles)
14
Molecular phylogenetics approach

Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis

15
Multiple DNA sequence alignment

Problem: alternative alignments
It is possible to align any two sequences by postulating some combination of gaps (insertions/deletions, indels) and substitutions -> which one to choose?
The basic task of sequence alignment is to find the alignment with the highest similarity, smallest distance, or lowest overall cost

16
Multiple DNA sequence alignment

2 sequences + scoring scheme -> optimal alignment
Scoring scheme:
- scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g. transition/transversion matrix
      A  T  G  C
   A  0  5  1  5
   T  5  0  5  1
   G  1  5  0  5
   C  5  1  5  0
- gap weight, cost or penalty
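
How a scoring scheme leads to an optimal alignment can be sketched with the standard dynamic-programming recurrence for global alignment (Needleman-Wunsch style). This is an illustration under stated assumptions, not the procedure of any particular program: substitution costs follow the transition/transversion matrix above, the gap cost is a single linear penalty `w` whose value is chosen arbitrarily here, and the function names are hypothetical.

```python
PURINES = {"A", "G"}


def sub_cost(x, y):
    """Cost from the slide's matrix: 0 for a match, 1 for a transition, 5 for a transversion."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 5


def min_alignment_cost(a, b, w):
    """Lowest total cost over all global alignments of a and b with linear gap cost w."""
    prev = [j * w for j in range(len(b) + 1)]       # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        cur = [i * w]                               # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            cur.append(min(
                prev[j - 1] + sub_cost(a[i - 1], b[j - 1]),  # align two bases
                prev[j] + w,                                 # gap in b
                cur[j - 1] + w,                              # gap in a
            ))
        prev = cur
    return prev[-1]


# With an expensive gap the single C/G mismatch is kept as a transversion (cost 5);
# with a cheap gap, two gaps (cost 2) are preferred instead.
print(min_alignment_cost("AGCGTCT", "AGCGTGT", w=5))
print(min_alignment_cost("AGCGTCT", "AGCGTGT", w=1))
```

The two calls show that the choice of gap penalty changes which alignment counts as optimal.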

17
Multiple DNA sequence alignment

Cost of an alignment: $D = s + w g$
s = number of substitutions, g = total length of gaps
w = gap penalty = cost of a gap relative to a substitution
The gap penalty w makes implicit assumptions about how the sequences have evolved:
if indels are thought to be rare, then w should be large (and vice versa)
-> have to use knowledge of biology, e.g. translation (3 bp indel, position), transition <-> transversion, ...
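
The cost formula can be tried on two alternative alignments of the same pair of sequences. This is a small sketch with made-up sequences; `alignment_cost` is a hypothetical helper that counts every substitution as one step, as in the formula on this slide.

```python
def alignment_cost(row_a, row_b, w):
    """D = s + w*g for one pairwise alignment given as two gapped rows of equal length."""
    columns = list(zip(row_a, row_b))
    s = sum(1 for x, y in columns if "-" not in (x, y) and x != y)  # substitutions
    g = sum(1 for x, y in columns if "-" in (x, y))                 # total gap length
    return s + w * g


# Two ways of aligning ACGT with AGCT (illustrative sequences)
ungapped = ("ACGT", "AGCT")     # s = 2, g = 0
gapped = ("A-CGT", "AGC-T")     # s = 0, g = 2

for w in (0.5, 3):
    print(w, alignment_cost(*ungapped, w), alignment_cost(*gapped, w))
```

With a small w the gapped alignment has the lower cost, and with a large w the ungapped one does, which is the sense in which w encodes an assumption about how rare indels are.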

18
Multiple DNA sequence alignment

Software programs, e.g. CLUSTALW (global alignment): http://www.ebi.ac.uk/clustalw/index.html
The optimal alignment is not always the true alignment -> new developments: phylogenetic analysis without the multiple DNA sequence alignment step

19
Molecular phylogenetics approach

Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
Step 2: Multiple DNA sequence alignment
Step 3: Phylogenetic analysis

20
Inferring phylogenies from DNA sequences
Sequence alignment (rows: taxa; columns: characters)
A ..AGCGTCT..
B ..AGCGTGT..
C ..AGGAGT..
-> Phylogenetic methods -> unrooted tree (A, B, C) or rooted tree (A, B, C)
21
Phylogenetic methods
                                                    | Character-based methods    | Non character-based methods
Methods based on an explicit model of evolution     | Maximum-likelihood methods | Pairwise distance methods
Methods not based on an explicit model of evolution | Maximum parsimony methods  |
22
Pairwise distance methods
3 taxa, 3 sequences

Dissimilarity matrix: count the number of differences between all possible pairs of sequences
      1     2     3
   1
   2  0.26
   3  0.20  0.33
Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
      1     2     3
   1
   2  0.32
   3  0.23  0.44
Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm or optimality criterion -> tree
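
The first step, building the dissimilarity matrix, can be sketched as follows. The three sequences are invented for illustration (they are not the data behind the numbers on the slide), and columns containing a gap are simply skipped, which is one common convention among several.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, ignoring gap columns."""
    pairs = [(x, y) for x, y in zip(a, b) if "-" not in (x, y)]
    return sum(x != y for x, y in pairs) / len(pairs)


def dissimilarity_matrix(alignment):
    """Pairwise p-distances for every pair of taxa in an alignment (dict: name -> sequence)."""
    return {(s, t): p_distance(alignment[s], alignment[t])
            for s, t in combinations(alignment, 2)}


alignment = {                # hypothetical aligned sequences
    "1": "AGCGTCTAGC",
    "2": "AGCGTGTAGC",
    "3": "AGGATCTAGA",
}
for pair, d in dissimilarity_matrix(alignment).items():
    print(pair, round(d, 2))
```

Each value is an uncorrected proportion of differences; the correction for multiple events per site is a separate step.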
23
Models of sequence evolution
Expected difference ≠ observed difference -> correction
(expected: linear; observed: not linear)
Apply a substitution model that tries to estimate the correct number of substitutions
24
Models of sequence evolution

Distance correction methods: convert observed distances into measures that correspond to the ACTUAL distance
Several methods have been proposed, all with different assumptions about the nature of the evolutionary process
Essentially they differ by the number of parameters they include
We can use a general framework to show how these models are inter-related

25
Substitution models: general framework
26
Substitution models: general framework
27
e.g. Jukes-Cantor model (JC)

One of the first proposed and perhaps the simplest model of evolution
Assumes that all four bases have equal frequency and that all substitutions are equally likely
Under this model, the distance between any two sequences is given by $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the proportion of nucleotides that are different in the two sequences
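
The JC correction is a one-line function. The sketch below applies it to the observed proportions from the 3-taxon example (0.26, 0.20, 0.33); the results are close to the corrected values in that example (0.32, 0.23, 0.44), which suggests, but does not confirm, that a JC-type correction was used there. The function name is illustrative.

```python
import math


def jukes_cantor(p):
    """JC distance d = -(3/4) ln(1 - (4/3) p) for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.26, 0.20, 0.33):
    print(p, round(jukes_cantor(p), 3))
```

The correction grows faster than p itself: the larger the observed difference, the more hidden multiple substitutions the model assumes.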

28
e.g. Kimura 2 parameter model (K2P)

incorporates the observation that transitions accumulate more rapidly than transversions
assumes all four bases have equal frequencies but that there are 2 rate classes for substitutions
Under this model, the distance between any two sequences is given by $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$, where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively
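
A minimal sketch of the K2P distance, assuming two ungapped aligned sequences of equal length containing only A, C, G and T; the example sequences are invented and the function name is illustrative.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(a, b):
    """K2P distance from the proportions of transitions (P) and transversions (Q)."""
    n = len(a)
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    # A<->G and C<->T stay within purines or within pyrimidines: transitions
    transitions = sum(1 for x, y in diffs if (x in PURINES) == (y in PURINES))
    transversions = len(diffs) - transitions
    P, Q = transitions / n, transversions / n
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# One transition (A->G) and one transversion (A->C) over 10 sites: P = Q = 0.1
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAC"), 4))
```

Here the observed proportion of differences is 0.2, and the corrected distance is somewhat larger.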

29
Substitution models

Other models: adding more parameters
Felsenstein model (F81):
variation in base composition -> base frequencies f = (πA, πC, πG, πT) may vary
Hasegawa-Kishino-Yano (HKY) model:
unequal base frequencies, transition/transversion
General reversible model (REV):
unequal base frequencies, all six pairs of substitutions have different rates
-> ideally, we want the simplest model we can get away with that still yields a reasonable estimate

30
Substitution models

Assumptions of these models:
all nucleotide sites change independently
base composition is at equilibrium
substitution rate is constant over time and in different lineages
each site in a sequence is equally likely to undergo substitution
-> the gamma distribution has a parameter that specifies the range of rate variation among sites (model + Γ)

31
Pairwise distance methods

3 taxa, 3 sequences

Dissimilarity matrix: count the number of differences between all possible pairs of sequences
      1     2     3
   1
   2  0.26
   3  0.20  0.33
Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
      1     2     3
   1
   2  0.32
   3  0.23  0.44
Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm -> tree
32
Clustering methods

Clustering methods follow a set of steps (an algorithm) and arrive at a tree
UPGMA (Unweighted Pair Group Method using Arithmetic Averages): results in a rooted and additive tree with a molecular clock
Neighbor-joining: results in an unrooted and additive tree
Other approaches: least-squares, Fitch, Kitsch, ...

33
UPGMA clustering
      A   B   C
   B  2            <- least difference
   C  4   4
   D  6   6   6

Join A and B: branch lengths A = 1, B = 1
Compute new distances between (AB) and other OTUs:
d(AB)C = (dAC + dBC)/2 = 4
d(AB)D = (dAD + dBD)/2 = 6
34
UPGMA clustering
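      AB  C
   C  4
   D  6   6

Join (AB) and C: branch lengths A = 1, B = 1, (AB) = 1, C = 2
Compute new distances between (ABC) and other OTUs:
d(ABC)D = (d(AB)D + dCD)/2 = 6
Join (ABC) and D: branch lengths A = 1, B = 1, (AB) = 1, C = 2, (ABC) = 1, D = 3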
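
The clustering steps can be reproduced with a short UPGMA sketch on the four-OTU matrix from these slides (dAB = 2, dAC = dBC = 4, dAD = dBD = dCD = 6). One assumption to note: the code weights each merged cluster by its number of OTUs, which is the usual definition of UPGMA; the slide averages the two cluster distances directly, and on this matrix both give 6. The function name and the bracket-style output are illustrative choices.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA on a distance table (frozenset({a, b}) -> distance); returns (tree string, root height)."""
    clusters = {n: (n, 1, 0.0) for n in names}   # label -> (subtree text, size, node height)
    d = dict(dist)
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (ta, sa, ha), (tb, sb, hb) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2        # the new node sits halfway
        new = f"({a},{b})"
        for c in clusters:                       # size-weighted average distance to the rest
            d[frozenset((new, c))] = (sa * d[frozenset((a, c))]
                                      + sb * d[frozenset((b, c))]) / (sa + sb)
        clusters[new] = (f"({ta}:{height - ha:g},{tb}:{height - hb:g})", sa + sb, height)
    (tree, _, height), = clusters.values()
    return tree + ";", height


pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

The branch lengths in the output match the slide: 1 for A and B, 1 for the (AB) stem, 2 for C, 1 for the (ABC) stem and 3 for D, so every tip is at distance 3 from the root.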
35
Clustering methods

UPGMA: additive and ultrametric distances -> assumes a molecular clock -> very sensitive to unequal rates of evolution! -> relative-rate test
Use other clustering methods for phylogeny, e.g. Neighbor-joining
Goodness-of-fit statistics to select the metric tree that best accounts for the observed distances
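
Whether a distance matrix is ultrametric (consistent with a molecular clock) can be checked with the three-point condition: in every triple of taxa, the two largest distances must be equal. The sketch below is illustrative; both matrices are typed in by hand, the first being the UPGMA example (2, 4, 4, 6, 6, 6) and the second the corrected 3-taxon distances (0.32, 0.23, 0.44).

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for trio in combinations(names, 3):
        _, mid, largest = sorted(d[frozenset(p)] for p in combinations(trio, 2))
        if abs(largest - mid) > tol:
            return False
    return True


clock_like = {frozenset(k): v for k, v in {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}.items()}
not_clock_like = {frozenset(k): v for k, v in {
    ("1", "2"): 0.32, ("1", "3"): 0.23, ("2", "3"): 0.44}.items()}

print(is_ultrametric("ABCD", clock_like))
print(is_ultrametric("123", not_clock_like))
```

A matrix that fails this check is one where UPGMA's clock assumption is violated and a method such as neighbor-joining is the safer choice.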

36
Pairwise distance methods

3 taxa, 3 sequences

Dissimilarity matrix: count the number of differences between all possible pairs of sequences
      1     2     3
   1
   2  0.26
   3  0.20  0.33
Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
      1     2     3
   1
   2  0.32
   3  0.23  0.44
Infer tree topology on the basis of the evolutionary distances by using an optimality criterion -> tree
37
Minimum evolution

Distance matrix -> unrooted metric trees
Each tree has a length L, which is the sum of all the branch lengths
Optimality criterion: the minimum evolution (ME) tree is the tree that minimizes L

38
Pairwise distance method

Advantages
very fast
based on a model of evolution
Disadvantages
sequence information is reduced to one number
branch lengths may not be biologically interpretable
most methods provide only one tree topology
dependent on the model of evolution used

39
Phylogenetic methods
                                                    | Character-based methods    | Non character-based methods
Methods based on an explicit model of evolution     | Maximum-likelihood methods | Pairwise distance methods
Methods not based on an explicit model of evolution | Maximum parsimony methods  |
40
Character-based methods

Character-based (discrete) methods operate directly on sequences, rather than on pairwise distances
Two major discrete methods:
Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
Maximum likelihood (ML): chooses the tree(s) most likely to have produced the observed data

41
Maximum parsimony

Maximum parsimony infers a phylogenetic tree by
minimizing the total number of evolutionary steps
Principle
Investigate all possible tree topologies
Reconstruct ancestral sequences
Choose topology with smallest number of steps

42
Maximum parsimony - principle
[Figure: the three possible tree topologies (A, B, C) for four taxa (1, 2, 3, 4)]
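
The principle (score every topology, keep the one with the fewest steps) can be sketched for four taxa with Fitch's counting algorithm. The assumptions here are that every change costs one step, that the alignment is invented for illustration, and that each unrooted four-taxon tree is written as an arbitrarily rooted nested tuple, which does not change its length under equal weights.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one character on a binary tree."""
    if isinstance(tree, str):                     # a leaf: its observed state, no steps
        return {states[tree]}, 0
    (left, left_steps), (right, right_steps) = fitch(tree[0], states), fitch(tree[1], states)
    common = left & right
    if common:                                    # children agree: no extra step
        return common, left_steps + right_steps
    return left | right, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Total number of steps over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {t: seq[i] for t, seq in alignment.items()})[1]
               for i in range(n_sites))


alignment = {"1": "AGCA", "2": "AGCT", "3": "CGTA", "4": "CGTT"}   # hypothetical data
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
for name, tree in topologies.items():
    print(name, tree_length(tree, alignment))
```

For this made-up alignment the topology grouping 1 with 2 needs the fewest steps and would be the most parsimonious tree.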
43
Maximum parsimony - principle
44
Maximum parsimony - principle
45
Maximum parsimony - principle
46
Maximum parsimony - generalized

In the previous example, the cost of each substitution was one step -> equal weight
Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes -> weighted parsimony
according to Dollo, Wagner, Fitch, ...

47
Maximum parsimony - characters
48
Maximum parsimony search methods

Number of unrooted tree topologies: $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
i.e., 3 sequences: 1 tree, 4 seq: 3 trees, 5 seq: 15, 6: 105, ... -> the more sequences (= taxa), the more trees -> computationally expensive
Finding optimal trees:
Exhaustive search: limited number of taxa (< 10); find the minimum tree of all possible trees
Branch and bound: small number of taxa (< 18); find the minimum tree without evaluating all trees, by discarding families of trees during tree construction that cannot be shorter than the shortest tree found so far
Heuristic search: large number of taxa
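
The growth in the number of topologies is easy to tabulate. A small sketch of the formula for unrooted bifurcating trees with n taxa; the function name is illustrative.

```python
from math import factorial


def unrooted_topologies(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        raise ValueError("the formula applies to n >= 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 20):
    print(n, unrooted_topologies(n))
```

Already at 10 taxa there are about two million trees, which is why exhaustive search is limited to small data sets.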

49
Maximum parsimony search methods
- Heuristic search: explore a subset of all possible trees, by using stepwise addition of taxa plus a rearrangement process (branch swapping), but not guaranteed to find the minimal tree
Global optimum vs. local optimum: the search can end at a local optimum
50
Maximum parsimony - output

Consensus tree: MP can yield multiple equally most parsimonious (optimal) trees -> relationships common to all the optimal trees are summarized with a consensus tree
Strict consensus: includes splits found in all trees
Majority-rule consensus: includes splits found in the majority of the trees (> 50%)
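
Both consensus rules reduce to counting how often each grouping occurs. In this sketch each tree is simplified to a set of clades (each clade a frozenset of taxon names), which stands in for the splits of an unrooted tree; the three example trees and the function names are made up.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade appears."""
    return Counter(clade for tree in trees for clade in tree)


def strict_consensus(trees):
    """Clades found in all trees."""
    return {c for c, k in clade_counts(trees).items() if k == len(trees)}


def majority_rule_consensus(trees):
    """Clades found in more than 50% of the trees."""
    return {c for c, k in clade_counts(trees).items() if k / len(trees) > 0.5}


AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
trees = [{AB, ABC}, {AB, ABD}, {AB, ABC}]    # three equally parsimonious trees (hypothetical)

print(strict_consensus(trees))          # only the A+B grouping is in every tree
print(majority_rule_consensus(trees))   # A+B and A+B+C are each in more than half
```

The strict consensus is never more resolved than the majority-rule consensus, since every grouping found in all trees is also found in most of them.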

51
Maximum parsimony - output

Consistency index (CI) and retention index (RI):
measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
more specifically: an index of how much homoplasy the constructed tree has
Value from 0 to 1
higher value -> less homoplasy

53
Parsimony branch support and tree stability

Bootstrap analysis:
is a resampling technique used to measure sampling error
gives an idea about the reliability of branches and clusters
original dataset -> resample -> construct trees -> compare trees to original trees
> 70%: quite confident of tree topology
Decay index (Bremer support):
gives us a sense of how many steps would be required before a grouping collapses
higher value -> better branch support
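
The resampling step of the bootstrap can be sketched as drawing alignment columns with replacement. The tree-building step is left out; replicate trees are represented as sets of clades supplied by the caller, and both function names as well as the example alignment are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}


def clade_support(clade, replicate_trees):
    """Fraction of replicate trees (sets of clades) that contain the given clade."""
    return sum(clade in tree for tree in replicate_trees) / len(replicate_trees)


alignment = {"1": "AGCGTCTAGC", "2": "AGCGTGTAGC", "3": "AGGATCTAGA"}  # hypothetical data
rng = random.Random(0)            # fixed seed so that the run is repeatable
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

Each replicate would then go through the same tree-building method as the original data, and `clade_support` gives the proportion of replicates that recover a grouping, which is the value compared against the 70% guideline.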

54
Maximum parsimony

Advantages
based on shared derived characters
evaluates different tree topologies
does not reduce the information
Disadvantages
computationally intensive for large datasets
no correction for multiple mutations
sensitive to unequal rates of evolution (long
branch attraction)

55
Phylogenetic methods
                                                    | Character-based methods    | Non character-based methods
Methods based on an explicit model of evolution     | Maximum-likelihood methods | Pairwise distance methods
Methods not based on an explicit model of evolution | Maximum parsimony methods  |
56
Maximum likelihood

Statistical method
Given some data D and a hypothesis H, the likelihood of the data is given by $L_D = \Pr(D \mid H)$, which is the probability of D given H

57
Maximum likelihood

In the context of molecular phylogenetics
D is the set of sequences being compared
H is a phylogenetic tree
We want to find the likelihood of obtaining the
observed data given the tree
The tree that makes the data the most probable
evolutionary outcome is the Maximum Likelihood
estimate of the phylogeny

58
Maximum likelihood

In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
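
The idea can be illustrated in the smallest possible case: two sequences, where the only "tree" is a single branch and the hypothesis H is its length d. The sketch below assumes the JC model, made-up counts (26 differing sites out of 100) and a simple grid search; real ML programs optimise over topologies and many branch lengths at once.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two sequences separated by distance d under the JC model."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e      # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e      # probability of one specific different base
    # each site pattern also carries the base frequency 1/4 of the first sequence
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


n_sites, n_diff = 100, 26
grid = [i / 1000 for i in range(1, 2001)]           # candidate distances 0.001 .. 2.000
best = max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))
print(best)
```

The distance that maximises the likelihood agrees, up to the grid spacing, with the JC distance formula evaluated at p = 0.26, so the distance correction can be seen as an ML estimate for a pair of sequences.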

59
Maximum likelihood

Advantages
Statistically well founded
Based on a model of evolution
Evaluates different topologies
Uses all sequence information
Often yields estimates that have lower variance
than other methods
Disadvantages
Very slow (computationally intensive)
Dependent on the model of evolution used

60
Software programs for phylogenetic analysis

Overview: http://evolution.genetics.washington.edu/phylip/software.html
Most widely used software programs:
PHYLIP: freely available (downloadable or online: http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
PAUP: user-friendly but not freely available

61
Phylogenetic information on the internet

http://tolweb.org/tree/phylogeny.html
http://www.treebase.org/treebase/
....

62
If you need more information

Jacqueline Vander Stappen
K.U.Leuven
Laboratory of Gene Technology
Kasteelpark Arenberg 21
B-3001 Leuven
Jacqueline.vanderstappen_at_agr.kuleuven.ac.be
