Phylogenetic Analysis Unit 16 - PowerPoint PPT Presentation

Provided by: irenegab
Transcript and Presenter's Notes

1
Phylogenetic Analysis: Unit 16
  • BIOL221T Advanced Bioinformatics for
    Biotechnology

Irene Gabashvili, PhD
2
BO, chapter 14
  • Bioinformatics analyses should be interpreted in
    evolutionary context
  • Good-quality sequence alignments are important for
    evolutionary analysis
  • Common phylogenetic methods and software differ;
    be cautious when using them and interpreting
    your results

3
Terminology and the Basics
  • Phylogenetics is sometimes called cladistics
  • Clade: a set of descendants from a single
    ancestor (from the Greek for branch).
  • 3 basic assumptions:
  • Any group of organisms is descended from a common
    ancestor
  • Bifurcating pattern of cladogenesis
  • Change in characteristics occurs in lineages over
    time

4
Brief Introduction to the Theory of Evolution
5
Classification: Linnaeus
Carl Linnaeus 1707-1778
6
Classification: Linnaeus
  • Hierarchical system
  • Kingdom
  • Phylum
  • Class
  • Order
  • Family
  • Genus
  • Species

7
Classification depicted as a tree
8
Classification depicted as a tree
Species
Genus
Family
Order
Class
9
Theory of evolution
Charles Darwin 1809-1882
10
Phylogenetic basis of systematics
  • Linnaeus
  • Ordering principle is God.
  • Darwin
  • Ordering principle is shared descent from common
    ancestors.
  • Today, systematics is explicitly based on
    phylogeny.

11
Darwin's four postulates
  • More young are produced each generation than can
    survive to reproduce.
  • Individuals in a population vary in their
    characteristics.
  • Some differences among individuals are based on
    genetic differences.
  • Individuals with favorable characteristics have
    higher rates of survival and reproduction.
  • Evolution by means of natural selection
  • Presence of design-like features in organisms
  • quite often features are there for a reason

12
Theory of evolution as the basis of biological
understanding
Nothing in biology makes sense, except in the
light of evolution. Without that light it
becomes a pile of sundry facts - some of them
interesting or curious but making no meaningful
picture as a whole
T. Dobzhansky
13
Phylogenetic Reconstruction: Distance Matrix
Methods
14
Trees: terminology
15
Terminology
  • Clades: monophyletic taxa
  • Taxa: any named groups of organisms
  • Branches: divergence (length may indicate the
    degree)
  • Nodes: any bifurcating branch point

16
Trees: terminology
17
Trees: representations
Three different representations of the same tree
18
Trees: rooted vs. unrooted
  • A rooted tree has a single node (the root) that
    represents a point in time that is earlier than
    any other node in the tree.
  • A rooted tree has directionality (nodes can be
    ordered in terms of earlier or later).
  • In the rooted tree, distance between two nodes is
    represented along the time-axis only (the second
    axis just helps spread out the leaves)

[Figure: rooted tree drawn along a time axis running from Early to Late]
21
Trees: rooted vs. unrooted
  • In unrooted trees there is no directionality: we
    do not know whether a node is earlier or later
    than another node
  • Distance along branches directly represents node
    distance

23
Reconstructing a tree using non-contemporaneous
data
24
Reconstructing a tree using present-day data
25
Data: molecular phylogeny
  • DNA sequences
  • genomic DNA
  • mitochondrial DNA
  • chloroplast DNA
  • Protein sequences
  • Restriction site polymorphisms
  • DNA/DNA hybridization
  • Immunological cross-reaction

26
Morphology vs. molecular data
African white-backed vulture (Old World vulture)
Andean condor (New World vulture)
New and Old World vultures seem to be closely
related based on morphology. Molecular data
indicate that Old World vultures are related to
birds of prey (falcons, hawks, etc.), while New
World vultures are more closely related to
storks. The similar features are presumably the
result of convergent evolution.
27
Molecular data: single-celled organisms
Molecular data are useful for analyzing
single-celled organisms (which have only a few
prominent morphological features).
28
Distance Matrix Methods
Gorilla    ACGTCGTA
Human      ACGTTCCT
Chimpanzee ACGTTTCG
  • Construct multiple alignment of sequences
  • Construct table listing all pairwise differences
    (distance matrix)
  • Construct tree from pairwise distances

[Figure: tree with tips Ch, Hu and Go; branch lengths 1, 1, 1 and 2]
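The first two steps above can be sketched in a few lines. This is only an illustration under simple assumptions: the three sequences are taken as already aligned, distance is the plain count of differing positions, and the function names are made up for this example.

```python
from itertools import combinations

def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(alignment):
    """Pairwise differences for every pair of sequences in the alignment."""
    return {
        (p, q): count_differences(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }

alignment = {
    "Gorilla": "ACGTCGTA",
    "Human": "ACGTTCCT",
    "Chimpanzee": "ACGTTTCG",
}
matrix = distance_matrix(alignment)
for pair, d in matrix.items():
    print(pair, d)
```

Human and Chimpanzee differ at 2 positions, while Gorilla differs from each of them at 4, so a tree built from these distances joins Human and Chimpanzee first.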
29
Finding Optimal Branch Lengths
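[Figure: unrooted tree of four sequences. S1 (branch a) and S3 (branch d) meet at one internal node, S2 (branch c) and S4 (branch e) at the other; b is the internal branch.]
Observed distance $D_{ij}$ vs. distance along tree $d_{ij}$:
$D_{12} \approx d_{12} = a + b + c$
$D_{13} \approx d_{13} = a + d$
$D_{14} \approx d_{14} = a + b + e$
$D_{23} \approx d_{23} = d + b + c$
$D_{24} \approx d_{24} = c + e$
$D_{34} \approx d_{34} = d + b + e$
Goal: branch lengths for which each distance along the tree is as close as possible to the observed distance.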
30
Optimal Branch Lengths: Least Squares
  • Fit between given tree and observed distances can
    be expressed as sum of squared differences
  • $Q = \sum_{j>i} (D_{ij} - d_{ij})^2$
  • Find branch lengths that minimize Q - this is the
    optimal set of branch lengths for this tree.

[Figure: the four-sequence tree with branches a to e and the relations $D_{ij} \approx d_{ij}$ between observed distances and distances along the tree, as on the previous slide]
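As a small illustration of the criterion, the sketch below evaluates Q for the four-sequence tree, where S1 and S3 are joined on one side of the internal branch b and S2 and S4 on the other. The observed distances and the two candidate sets of branch lengths are invented for the example; the code only scores a given set of branch lengths and does not search for the minimum.

```python
def tree_distances(a, b, c, d, e):
    """Distances along the four-sequence tree, one entry per pair (i, j)."""
    return {
        (1, 2): a + b + c,
        (1, 3): a + d,
        (1, 4): a + b + e,
        (2, 3): d + b + c,
        (2, 4): c + e,
        (3, 4): d + b + e,
    }

def sum_of_squares(observed, lengths):
    """Q = sum over pairs j > i of (D_ij - d_ij)^2."""
    along_tree = tree_distances(**lengths)
    return sum((observed[pair] - along_tree[pair]) ** 2 for pair in along_tree)

# Hypothetical observed distances D_ij (not taken from the slides).
observed = {(1, 2): 8, (1, 3): 3, (1, 4): 9, (2, 3): 9, (2, 4): 9, (3, 4): 10}

perfect_fit = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 5}
worse_fit = {"a": 1, "b": 3, "c": 4, "d": 2, "e": 4}

print(sum_of_squares(observed, perfect_fit))  # 0
print(sum_of_squares(observed, worse_fit))    # 3
```

Shortening branch e by one unit changes three of the six path lengths by one each, which is why Q rises from 0 to 3.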
31
Least Squares Optimality Criterion
  • Search through all (or many) tree topologies
  • For each investigated tree, find best branch
    lengths using least squares criterion
  • Among all investigated trees, the best tree is
    the one with the smallest sum of squared errors.

32
Exhaustive search impossible for large data sets
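To give a feel for the scale, the sketch below counts the distinct unrooted bifurcating trees for n tips using the standard formula 1 x 3 x 5 x ... x (2n-5). The formula is a well-known result that is not stated on the slide, and the function name is made up for this example.

```python
def count_unrooted_trees(n_tips):
    """Number of distinct unrooted bifurcating trees on n_tips labeled tips."""
    if n_tips < 3:
        raise ValueError("need at least 3 tips")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four tips give 3 trees and ten tips already give 2,027,025, so scoring every topology quickly becomes impractical.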
33
Heuristic search
  • Construct an initial tree; determine its sum of
    squares
  • Construct a set of neighboring trees by making
    small rearrangements of the initial tree;
    determine the sum of squares for each neighbor
  • If any of the neighboring trees are better than
    the initial tree, then select it/them and use as
    starting point for a new round of rearrangements.
    (Possibly several neighbors are equally good)
  • Repeat steps 2 and 3 until you have found a tree
    that is better than all of its neighbors.
  • This tree is a local optimum (not necessarily a
    global optimum!)
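
The search loop can be sketched generically. In the sketch below, a toy one-dimensional "landscape" of scores stands in for tree space: each position plays the role of a tree, its value the sum of squares, and its neighbors are the adjacent positions. A real implementation would generate neighbors by rearranging trees; the names `hill_climb`, `neighbors` and `score` are hypothetical.

```python
def hill_climb(start, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor is strictly better."""
    current, current_score = start, score(start)
    while True:
        best, best_score = current, current_score
        for candidate in neighbors(current):
            s = score(candidate)
            if s < best_score:  # lower sum of squares is better
                best, best_score = candidate, s
        if best == current:
            return current, current_score  # local optimum reached
        current, current_score = best, best_score

# Toy landscape: index = "tree", value = its sum of squares.
landscape = [5, 3, 4, 6, 2, 1, 7]

def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

def score(i):
    return landscape[i]

print(hill_climb(0, neighbors, score))  # (1, 3): stuck in a local optimum
print(hill_climb(3, neighbors, score))  # (5, 1): reaches the global optimum
```

The two runs differ only in their starting point, which is why heuristic searches are often repeated from several initial trees.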

34
Clustering Algorithms
  • Starting point: distance matrix
  • Cluster the least different pair of sequences
  • Tree: connect the pair to a common ancestral node
    and compute branch lengths from the ancestral
    node to both descendants
  • Distance matrix: combine the two entries into one.
    Compute a new distance matrix by finding the
    distance from the new node to all other nodes
  • Repeat until all nodes are linked
  • Results in only one tree; there is no measure of
    tree-goodness.

35
Neighbor Joining Algorithm
  • For each tip compute $u_i = \sum_j D_{ij} / (n-2)$
  • (this is essentially the average distance to all
    other tips, except the denominator is n-2 instead
    of n-1)
  • Find the pair of tips, i and j, where
    $D_{ij} - u_i - u_j$ is smallest
  • Connect the tips i and j, forming a new ancestral
    node. The branch lengths from the ancestral node
    to i and j are
  • $v_i = 0.5 D_{ij} + 0.5 (u_i - u_j)$
  • $v_j = 0.5 D_{ij} + 0.5 (u_j - u_i)$
  • Update the distance matrix: compute the distance
    between the new node and each remaining tip k as
    follows
  • $D_{(ij),k} = (D_{ik} + D_{jk} - D_{ij}) / 2$
  • Replace tips i and j by the new node, which is now
    treated as a tip
  • Repeat until only two nodes remain.
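
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: ties are broken by taking the first pair found, internal nodes get made-up names (`node0`, `node1`, ...), and the input distances are invented so that they fit a four-tip tree exactly (S1 and S3 joined with branch lengths 1 and 2, S2 and S4 joined with branch lengths 4 and 5, internal branch 3).

```python
def neighbor_joining(tips, pair_distances):
    """Return a list of (ancestor, descendant, branch_length) edges."""
    dist = {t: {} for t in tips}
    for (i, j), d in pair_distances.items():
        dist[i][j] = dist[j][i] = d
    nodes = list(tips)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # u_i = sum_j D_ij / (n - 2)
        u = {i: sum(dist[i][j] for j in nodes if j != i) / (n - 2) for i in nodes}
        # Pair (i, j) with the smallest D_ij - u_i - u_j.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = nodes[x], nodes[y]
                q = dist[i][j] - u[i] - u[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        new = f"node{len(edges) // 2}"
        # Branch lengths from the new ancestral node to i and j.
        edges.append((new, i, 0.5 * dist[i][j] + 0.5 * (u[i] - u[j])))
        edges.append((new, j, 0.5 * dist[i][j] + 0.5 * (u[j] - u[i])))
        # Distance from the new node to every remaining tip k.
        dist[new] = {}
        for k in nodes:
            if k not in (i, j):
                d = (dist[i][k] + dist[j][k] - dist[i][j]) / 2
                dist[new][k] = dist[k][new] = d
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # Connect the last two nodes directly.
    edges.append((nodes[0], nodes[1], dist[nodes[0]][nodes[1]]))
    return edges

# Hypothetical distances that fit the four-tip tree described above exactly.
distances = {
    ("S1", "S2"): 8, ("S1", "S3"): 3, ("S1", "S4"): 9,
    ("S2", "S3"): 9, ("S2", "S4"): 9, ("S3", "S4"): 10,
}
nj_edges = neighbor_joining(["S1", "S2", "S3", "S4"], distances)
for edge in nj_edges:
    print(edge)
```

Because the example distances fit a tree exactly, the recovered branch lengths equal the ones used to construct them; with real data the fit is only approximate.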

36
Superimposed Substitutions
  • Actual number of evolutionary events: 5
  • Observed number of differences: 2
  • Distance is (almost) always underestimated

[Figure: sequences ACGGTGC and GCGGTGA, with intermediate states C and T]
37
Model-based correction for superimposed
substitutions
  • Goal: try to infer the real number of
    evolutionary events (the real distance) based on
  • Observed data (sequence alignment)
  • A model of how evolution occurs

38
Jukes and Cantor Model
  • Four nucleotides assumed to be equally frequent
    (f = 0.25)
  • All 12 substitution rates assumed to be equal
  • Under this model the corrected distance is
  • $D_{JC} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D_{OBS}\right)$
  • For instance
  • $D_{OBS} = 0.43 \Rightarrow D_{JC} = 0.64$
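
The correction is easy to check numerically. The sketch below assumes that the observed distance is the proportion of differing sites (a value between 0 and 0.75); the function name is made up for this example.

```python
import math

def jukes_cantor(d_obs):
    """Jukes-Cantor corrected distance from an observed proportion of differences."""
    if not 0 <= d_obs < 0.75:
        raise ValueError("observed distance must be in [0, 0.75)")
    return -0.75 * math.log(1 - (4 / 3) * d_obs)

print(round(jukes_cantor(0.43), 2))  # 0.64
```

The corrected value is always at least as large as the observed one, and the gap widens as the observed distance approaches 0.75.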

39
Other models of evolution
40
Homologs
  • Orthologs - speciation
  • Paralogs - duplication
  • Xenologs - horizontal transfer

41
Clustering Algorithms
  • Clustering algorithms use distances to calculate
    phylogenetic trees. These trees are based solely
    on the relative numbers of similarities and
    differences between a set of sequences.
  • Start with a matrix of pairwise distances
  • Cluster methods construct a tree by linking the
    least distant pairs of taxa, followed by
    successively more distant taxa.

42
From Multiple Sequence Alignment
  • Best cluster: ATCC, ATGC
  • Best cluster: TTCG, TCGG

43
Example
A Cladogram or a Phylogram?
[Figure: tree over ATCC, ATGC, TTCG and TCGG; branch labels 1.5, 1.5, 0.5, 0.5, 1, 1]
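The clustering in this example can be reproduced with a short sketch. It assumes simple average linkage (UPGMA) on counts of differing positions, which the slides do not state explicitly for this example; each merge is reported with the height of the new node, taken as half the average distance between the two clusters.

```python
from itertools import combinations

def count_differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def average_distance(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(count_differences(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def upgma(sequences):
    """Return the merges as (cluster_a, cluster_b, node_height) tuples."""
    clusters = [(s,) for s in sequences]  # each cluster is a tuple of members
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2, average_distance(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma(["ATCC", "ATGC", "TTCG", "TCGG"])
for merge in merges:
    print(merge)
```

Under these assumptions ATCC and ATGC are joined first (height 0.5), then TTCG and TCGG (height 1.0), and the two pairs are joined last (height 1.5).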
44
Cladistic Methods
  • Evolutionary relationships are documented by
    creating a branching structure, termed a
    phylogeny or tree, that illustrates the
    relationships between the sequences.
  • Cladistic methods construct a tree (cladogram) by
    considering the various possible pathways of
    evolution and choosing from among these the best
    possible tree.
  • A phylogram is a tree with branches that are
    proportional to evolutionary distances.

46
Hamming distance
  • (Between two strings of equal length) the
    number of positions at which the corresponding
    symbols are different. Equivalently, the number of
    substitutions required to change one into the
    other, or the number of errors that transformed
    one string into the other.

47
Hamming distance
  • The Hamming distance between 1011101 and 1001001
    is 2.
  • The Hamming distance between 2143896 and 2233796
    is 3.
  • The Hamming distance between "toned" and "roses"
    is 3.
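
The three examples can be verified with a direct implementation of the definition. This is a minimal sketch; the function name is made up for this example.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("2143896", "2233796"))  # 3
print(hamming_distance("toned", "roses"))      # 3
```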

48
Levenshtein Distance
  • A measure of the difference between two strings:
    the minimum number of deletions, insertions, or
    substitutions needed to turn one into the other
  • For example,
  • If s is "test" and t is "test", then LD(s,t) = 0,
  • If s is "test" and t is "tent", then LD(s,t) = 1,
    because one substitution (change "s" to "n") is
    sufficient to transform s into t.
  • If s is "test" and t is "attempt", then
    LD(s,t) = 4.
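
A common way to compute this distance is dynamic programming over prefixes of the two strings. The sketch below uses that approach with unit cost for every insertion, deletion and substitution; the slides do not specify an algorithm, so this is one reasonable choice rather than the method used in the course.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # previous[j] = distance between the processed prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        current = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("test", "test"))     # 0
print(levenshtein("test", "tent"))     # 1
print(levenshtein("test", "attempt"))  # 4
```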

49
Levenshtein distance
  • The Levenshtein distance algorithm has been used
    in
  • Spell checking
  • Speech recognition
  • DNA analysis
  • Plagiarism detection

50
DNA Distances
  • Distances between pairs of DNA sequences are
    usually computed as the sum of all base pair
    differences between the two sequences.
  • If sequences are similar enough to be aligned
  • Generally all base changes are considered equal
  • Insertion/deletions are generally given a larger
    weight than replacements (gap penalties).
  • It is also possible to correct for multiple
    substitutions at a single site, which is common
    in distant relationships and for rapidly evolving
    sites.

51
Phylogenetic methods
(1) Distance matrix/cluster (UPGMA, NJ)
Bacterial taxonomy based on morphological,
chemical, biochemical and physiological
characters did not allow natural relationships
to be deduced. Numerical taxonomy (Sneath and
Sokal, 1963, 1973)
Parsimony (maximum parsimony)
The taxonomy of animals shall reflect their
natural relationships. Phylogenetic Systematics
(Willi Hennig 1950, 1966)
Without direction (e.g. Wiley 1980)
52
UPGMA
  • The simplest of the distance methods is the UPGMA
    (Unweighted Pair Group Method using Arithmetic
    averages)
  • The PHYLIP programs DNADIST and PROTDIST
    calculate absolute pairwise distances between a
    group of sequences. Then the GCG program GROWTREE
    uses UPGMA to build a tree.
  • Many multiple alignment programs such as PILEUP
    use a variant of UPGMA to create a dendrogram of
    DNA sequences which is then used to guide the
    multiple alignment algorithm.

53
Neighbor Joining
  • The Neighbor Joining method is the most popular
    way to build trees from distance measurements
  • (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
  • Neighbor Joining corrects the UPGMA method for
    its (frequently invalid) assumption that the same
    rate of evolution applies to each branch of a
    tree.
  • The distance matrix is adjusted for differences
    in the rate of evolution of each taxon (branch).
  • Neighbor Joining has given the best results in
    simulation studies and it is the most
    computationally efficient of the distance
    algorithms (N. Saitou and T. Imanishi, Mol.
    Biol. Evol. 6:514 (1989))

54
Cladistic methods
  • Cladistic methods are based on the assumption
    that a set of sequences evolved from a common
    ancestor by a process of mutation and selection
    without mixing (hybridization or other horizontal
    gene transfers).
  • These methods work best if a specific tree, or at
    least an ancestral sequence, is already known so
    that comparisons can be made between a finite
    number of alternate trees rather than calculating
    all possible trees for a given set of sequences.

55
Parsimony
  • Parsimony is the most popular method for
    reconstructing ancestral relationships.
  • Parsimony allows the use of all known
    evolutionary information in building a tree
  • In contrast, distance methods compress all of the
    differences between pairs of sequences into a
    single number

56
Building Trees with Parsimony
  • Parsimony involves evaluating all possible trees
    and giving each a score based on the number of
    evolutionary changes that are needed to explain
    the observed data.
  • The best tree is the one that requires the fewest
    base changes for all sequences to derive from a
    common ancestor.
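
Scoring a single candidate tree can be sketched as below. The slides do not say how the changes are counted; this sketch uses Fitch's algorithm, a standard way to get the minimum number of changes on a fixed tree, and applies it to the four sequences from the clustering example. The tip names A to D and the nested-tuple tree encoding are made up for this illustration.

```python
def site_changes(node, sequences, site):
    """Fitch pass for one site: return (possible states, changes below node)."""
    if isinstance(node, str):  # a tip: its observed base, no changes
        return {sequences[node][site]}, 0
    left, right = node
    left_states, left_changes = site_changes(left, sequences, site)
    right_states, right_changes = site_changes(right, sequences, site)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No shared state: one extra change is needed at this node.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, sequences):
    """Total minimum number of base changes over all sites for a fixed tree."""
    n_sites = len(next(iter(sequences.values())))
    return sum(site_changes(tree, sequences, s)[1] for s in range(n_sites))

sequences = {"A": "ATCC", "B": "ATGC", "C": "TTCG", "D": "TCGG"}
candidate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = [parsimony_score(tree, sequences) for tree in candidate_trees]
print(scores)  # [5, 6, 7]
```

With four tips there are only three possible groupings, and the one that pairs A with B and C with D needs the fewest changes.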

57
Methods
  • Distance-based: UPGMA, NJ, FM, ME
  • Other: Maximum Parsimony, ML, etc.
  • Neighbor Joining methods generally produce just
    one tree, which can help to validate a tree built
    with the parsimony or maximum likelihood method

58
Phylogenetic methods: Maximum likelihood methods
Phylogenies should be formulated in a
probabilistic framework and be statistically
testable. Protein and DNA sequence data are
extraordinarily good for phylogenetic
interpretation and can withstand such treatment.
Cavalli-Sforza and Edwards 1967 (theory)
Felsenstein 1981: first practically useful
algorithms.
59
Phylogenetic analysis. Comparison of phylogenetic
methods
Consistency: a phylogenetic method is consistent
for an evolutionary model if the method converges
on the correct tree as the amount of data
approaches infinity.
Efficiency: a phylogenetic method has high
efficiency if it quickly converges on the correct
solution as more data are applied to the problem.
Robustness: a phylogenetic method is robust if it
still converges on the correct solution when the
assumptions of the evolutionary model are
violated.
Hillis 1995. Syst. Biol. 44, 3-16.
60
  • Phylogenetic analysis. Test of robustness:
    Bootstrap
  • Purpose. To show how well supported the nodes are
    by the data.
  • Procedure. The original data are resampled by
    drawing columns randomly with replacement 100 or
    1000 times. The phylogenetic analysis is repeated
    on each replicate, and the number of replicate
    trees in which each node appears is summarized.
  • Example.

                Original data   Replicate 1    Replicate 2
    Species 1   AGGA            AAGA           GGAA
    Species 2   ACGT            AACT           CGTT
    Species 3   ACGT            AACT           CGTT
    Species 4   ACTT            AACT           CTTT
    Species 5   CCGT            CCCT           CGTT
    Linear form (((2,3)4)5)1    (((2,3)4)5)1   (((2,3)5)4)1
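
The resampling step can be sketched as follows. The sketch covers only the drawing of columns; building a tree from each replicate and counting how often each node recurs is left out. The function names are made up, and the column choices for the two replicates were inferred from the example table (0-based indices).

```python
import random

def resample_columns(alignment, columns):
    """Build a replicate alignment from the given column indices."""
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, randomly with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return resample_columns(alignment, columns)

original = {
    "Species 1": "AGGA",
    "Species 2": "ACGT",
    "Species 3": "ACGT",
    "Species 4": "ACTT",
    "Species 5": "CCGT",
}

# Column choices that reproduce the two replicates in the example table.
replicate_1 = resample_columns(original, [0, 0, 1, 3])
replicate_2 = resample_columns(original, [1, 2, 3, 3])
print(replicate_1["Species 1"], replicate_2["Species 1"])  # AAGA GGAA

# A random replicate; the seed is fixed only to make the run repeatable.
random_replicate = bootstrap_replicate(original, random.Random(1))
```

Because whole columns are drawn, every species keeps the same site pattern within a replicate; only the mix of sites changes.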

61
Are there Correct trees??
  • Despite all of these caveats, it is actually
    quite simple to use computer programs to calculate
    phylogenetic trees for data sets.
  • Provided the data are clean, outgroups are
    correctly specified, appropriate algorithms are
    chosen, no assumptions are violated, etc., can
    the true, correct tree be found and proven to be
    scientifically valid?
  • Unfortunately, it is impossible to ever
    conclusively state what is the "true" tree for a
    group of sequences (or a group of organisms);
    taxonomy is constantly under revision as new data
    are gathered (for example, the 1980s revision of
    the seals and sea lions tree).