Likelihood surface (2 dimensions)

Slides: 32
Provided by: jamesh78
Transcript and Presenter's Notes

1
Likelihood surface (2-dimensions)
For multiple parameters, imagine extending to
multiple dimensions.
2
Defining what a tree means
unrooted tree (used when the root isn't known)
rooted tree (all real trees are rooted)
ancestral sequence
time vaguely radiates out from somewhere near the
center
divergence time is the sum of (horizontal)
branch lengths
11
Tom Nicholas' favorite tree (vertebrate Cyp450)
12
A tree has topology and distances
Note - topologically, these are the SAME tree. In
general, two trees are the same if they can be
inter-converted by branch rotations.
13
The number of tree topologies grows extremely fast
3 leaves: 3 branches, 1 internal node, 1 topology (3
insertion points for the next leaf)
4 leaves: 5 branches, 2 internal nodes, 3 topologies
(x3) (5 insertion points)
5 leaves: 7 branches, 3 internal nodes, 15
topologies (x5) (7 insertion points)
In general, an unrooted tree with N leaves
has 2N - 3 branches, N - 2 internal nodes, and
3 x 5 x ... x (2N - 5) = (2N - 5)!! possible topologies,
which grows even faster than N!
14
There are many rooted trees for each unrooted tree
For each unrooted tree, there are 2N - 3 times as
many rooted trees, where N is the number of
leaves (the root can be placed on any of the 2N - 3 branches).
15
Trees can be enormously complex
This tree has 203 7-pass chemosensory protein
sequences and could adopt >>> 10^100 topologies.
Even a tree of 20 sequences could adopt 2 x 10^20
topologies.
Note that this does NOT include branch length
differences!
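
The counts on the last three slides can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative (not from the slides), and it assumes strictly bifurcating trees, built up by inserting each new leaf on any existing branch.

```python
def unrooted_topologies(n_leaves):
    """Number of unrooted bifurcating topologies: 3 x 5 x ... x (2N - 5)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Leaf k can be inserted on any of the 2(k - 1) - 3 = 2k - 5 branches
    # of a tree that already holds k - 1 leaves.
    for k in range(4, n_leaves + 1):
        count *= 2 * k - 5
    return count


def rooted_topologies(n_leaves):
    """Each unrooted tree can be rooted on any of its 2N - 3 branches."""
    return unrooted_topologies(n_leaves) * (2 * n_leaves - 3)


for n in (3, 4, 5, 20):
    print(n, unrooted_topologies(n), rooted_topologies(n))

# Number of decimal digits in the count for the 203-sequence tree.
print(len(str(unrooted_topologies(203))))
```

Python integers have arbitrary precision, so the 203-leaf count is computed exactly rather than overflowing.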
16
Duplication and Divergence
  • Full gene duplication can occur in two patterns:
      1) speciation (orthology)
      2) duplication within species (paralogy)
  • Gene duplication within a species probably
    occurs only during aberrant DNA rearrangements.
  • Immediately after duplication, the two copies of
    a gene are fully redundant; one copy often mutates
    to non-functionality, and that loss fixes in the
    population (pseudogene).
  • Some of the time (unclear how often) the two
    copies diverge to become functionally distinct
    enough that each function is under negative
    selection and both are retained.

17
Orthology and paralogy trees
(duplication)
18
How duplications can arise
1) genome duplication: 2N → 4N (e.g. failure in
mitosis)
2) unequal crossing-over (non-allelic
recombination)
3) replication slippage (seems to happen only
with very short sequences)
4) transposon hops run amok
5) events involving nonhomology-based repair of
double strand breaks (often called nonhomologous
end joining, NHEJ)
19
Main mechanism in animals is probably non-allelic
recombination (NAR)
mispairing of existing duplication (red box)
duplication product
deletion product
20
To reconstruct species phylogeny
  • sequence orthologs from each species
  • construct the gene tree; it probably corresponds
    to the species tree (question: assuming you
    construct the gene tree correctly, what could
    violate this?)
  • do this for more than one gene to be sure; you
    should get (nearly) the same tree every time.

21
Species phylogeny (and a footnote on
humanocentrism)
This tree is amazingly misleading
correct tree
Not to mention 1) the straight line to humans,
2) leaving out >90% of all species, including
three hominoids (bonobo, siamang, gibbon), and 3)
the fact that WE are an African ape.
22
notice that small cold things are implicitly
less evolved
notice large size (not to mention white male)
23
How do you make a tree from data?
  • data typically takes the form of either a
    multiple sequence alignment or a set of all
    pairwise sequence alignments.
  • common methods for going from the data to a tree
    include:
      - Distance matrix methods (e.g.
        Neighbor-Joining)
      - Parsimony
      - Maximum likelihood
  • in theory, the tree-enumerating methods
    (especially maximum likelihood) are superior, but
    they are not always practical for large trees.

24
Distance matrix methods
  • methods based on a set of pairwise distances (no
    multiple alignment needed, though pairwise
    distances usually come from one).
  • all methods (in essence) build a tree that tries
    to best match the distances.
  • usual standard for best match is the least
    squares of the tree distances compared to the
    real pairwise distances

Let Dm be the matrix distances and Dt be the
tree distances. Find the tree (an internally
consistent set of Dt values) that minimizes
$\sum_{i<j} \left( D^{m}_{ij} - D^{t}_{ij} \right)^{2}$
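
The least-squares criterion can be illustrated on the smallest possible case. The sketch below assumes an unweighted sum of squares and a 3-leaf tree (one internal node, three branches), using three of the Eag-family distances from the example matrix in this deck; the function names and the two sets of branch lengths are illustrative choices, not from the slides.

```python
from itertools import combinations


def least_squares_score(dm, dt):
    """Sum of squared differences between matrix (dm) and tree (dt) distances."""
    return sum((dm[pair] - dt[pair]) ** 2 for pair in dm)


def three_leaf_tree_distances(branch):
    """Tree distance between two leaves = sum of their two branch lengths."""
    return {frozenset((a, b)): branch[a] + branch[b]
            for a, b in combinations(branch, 2)}


# Matrix distances (fraction amino acid divergence).
dm = {
    frozenset(("EGL-2", "dEag")): 0.276,
    frozenset(("EGL-2", "rEag")): 0.353,
    frozenset(("dEag", "rEag")): 0.305,
}

# Two candidate sets of branch lengths for the same 3-leaf topology.
good = {"EGL-2": 0.162, "dEag": 0.114, "rEag": 0.191}
poor = {"EGL-2": 0.150, "dEag": 0.150, "rEag": 0.150}

good_score = least_squares_score(dm, three_leaf_tree_distances(good))
poor_score = least_squares_score(dm, three_leaf_tree_distances(poor))
print(good_score, poor_score)
```

With three leaves there are three distances and three branch lengths, so an exact fit exists. With more leaves the tree distances are constrained by the topology, and the score is generally greater than zero.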
25
Example distance matrix (fraction amino acid
divergence)
         EGL-2  dEag   rEag   UNC-103  dErg   HERG
EGL-2    0.000  0.276  0.353  0.542    0.547  0.525
dEag     0.276  0.000  0.305  0.512    0.501  0.508
rEag     0.353  0.305  0.000  0.533    0.515  0.510
UNC-103  0.542  0.512  0.533  0.000    0.274  0.255
dErg     0.547  0.501  0.515  0.274    0.000  0.263
HERG     0.525  0.508  0.510  0.255    0.263  0.000
26
Least-squares solution
  • There are methods that directly solve the
    least-squares problem
  • However, they are computationally slow and
    rarely used (maximum likelihood is better if
    feasible).
  • Fortunately, there are more direct
    approximations that work remarkably well, most
    notably Neighbor-Joining.
  • The direct methods use various types of
    sequential clustering, of which the simplest is
    UPGMA (a horrid acronym that stands for something
    I can never remember).

27
Sequential clustering approach (UPGMA)
repeat until all clustered
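
The clustering loop can be sketched as follows. This is a minimal illustration of average-linkage (UPGMA) clustering applied to the example distance matrix from this deck; the function name `upgma`, the nested-tuple tree output, and the omission of branch lengths are simplifications made here, not details taken from the slides.

```python
from itertools import combinations

NAMES = ["EGL-2", "dEag", "rEag", "UNC-103", "dErg", "HERG"]
D = [
    [0.000, 0.276, 0.353, 0.542, 0.547, 0.525],
    [0.276, 0.000, 0.305, 0.512, 0.501, 0.508],
    [0.353, 0.305, 0.000, 0.533, 0.515, 0.510],
    [0.542, 0.512, 0.533, 0.000, 0.274, 0.255],
    [0.547, 0.501, 0.515, 0.274, 0.000, 0.263],
    [0.525, 0.508, 0.510, 0.255, 0.263, 0.000],
]


def upgma(names, matrix):
    """Return (nested-tuple tree, list of merges) by average-linkage clustering."""
    index = {name: i for i, name in enumerate(names)}
    # Each cluster is (subtree, list of leaf names it contains).
    clusters = [(name, [name]) for name in names]
    merges = []

    def avg_dist(a, b):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(matrix[index[x]][index[y]] for x in a[1] for y in b[1])
        return total / (len(a[1]) * len(b[1]))

    # Repeat until all clustered: join the closest pair, then re-measure.
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        merges.append((a[0], b[0], avg_dist(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0], merges


tree, merges = upgma(NAMES, D)
print(tree)
for left, right, dist in merges:
    print(left, right, round(dist, 4))
```

Recomputing each cluster-to-cluster average from the original leaf distances is slower than updating a reduced matrix, but it keeps the sketch short and gives the same merges.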
28
Neighbor-Joining Algorithm (NJ)
Essentially as on the previous slide, but a correction
for distance to other leaves is made.
Specifically, for two leaves i and j, we denote
the set of all other leaves as L, and the size of
that set as |L|, and we compute the corrected
distance Dij as
(equation not transcribed)
Here's an intuitive rationale (consider
clustering the first two leaves)
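
As a concrete illustration, the first Neighbor-Joining step can be computed for the example distance matrix from this deck. The sketch below uses the standard textbook form of the NJ criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), which may be written differently from the formula on the original slide; the function name `nj_first_pair` is an illustrative choice.

```python
from itertools import combinations

NAMES = ["EGL-2", "dEag", "rEag", "UNC-103", "dErg", "HERG"]
D = [
    [0.000, 0.276, 0.353, 0.542, 0.547, 0.525],
    [0.276, 0.000, 0.305, 0.512, 0.501, 0.508],
    [0.353, 0.305, 0.000, 0.533, 0.515, 0.510],
    [0.542, 0.512, 0.533, 0.000, 0.274, 0.255],
    [0.547, 0.501, 0.515, 0.274, 0.000, 0.263],
    [0.525, 0.508, 0.510, 0.255, 0.263, 0.000],
]


def nj_first_pair(names, d):
    """Return the pair of leaves with the smallest standard NJ Q value."""
    n = len(names)
    row_sums = [sum(row) for row in d]

    def q(i, j):
        # Raw distance, penalised less when i and j are both far from everything else.
        return (n - 2) * d[i][j] - row_sums[i] - row_sums[j]

    i, j = min(combinations(range(n), 2), key=lambda p: q(*p))
    return names[i], names[j], q(i, j)


def closest_raw_pair(names, d):
    """Return the pair with the smallest uncorrected distance (what UPGMA joins first)."""
    i, j = min(combinations(range(len(names)), 2), key=lambda p: d[p[0]][p[1]])
    return names[i], names[j], d[i][j]


nj_pair = nj_first_pair(NAMES, D)
raw_pair = closest_raw_pair(NAMES, D)
print("NJ joins first:", nj_pair)
print("smallest raw distance:", raw_pair)
```

On this matrix the two criteria pick different first pairs, which shows the correction changing the clustering order rather than just rescaling it.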
29
Neighbor-Joining corrects for different rates of
evolution on branches
these two branches have changed faster
30
Parsimony method
  • intuitively appealing: find the tree that can
    explain the observed sequences with the smallest
    number of changes.
  • BUT it requires enumeration of tree topologies, so
    it is not feasible with very large data sets (it is
    also inferior to ML methods, which are not much more
    computationally intensive).

1: AAG
2: AAA
3: GGA
4: AGA
Maximum-likelihood method (next class) is similar
except that it accounts for rates of change and
sums over all possible histories.
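
For the four example sequences on this slide, the parsimony score of each of the three possible unrooted topologies can be counted directly. The sketch below uses Fitch's algorithm, a standard way to count the minimum number of changes on a fixed topology; the slides do not say which counting method was used in class, and the function names and nested-tuple tree encoding are illustrative. Each unrooted tree is rooted arbitrarily on its internal branch, which does not change the count.

```python
SEQS = {"1": "AAG", "2": "AAA", "3": "GGA", "4": "AGA"}

# The three unrooted topologies for four leaves.
TOPOLOGIES = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no change needed at this node.
        return common, left_cost + right_cost
    # The children disagree: one change is required somewhere below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total minimum number of changes summed over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


scores = {tree: parsimony_score(tree, SEQS) for tree in TOPOLOGIES}
best = min(scores, key=scores.get)
for tree, score in scores.items():
    print(tree, score)
print("most parsimonious:", best)
```

Enumerating every topology like this is only practical for small trees, which is the feasibility limit noted in the bullets above.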
31
Assignment for Thursday
1) get the set of assigned sequences (course
website) and load them with Bonsai.
2) build a pairwise tree with them, then recompute
the tree using uncorrected identities (because it is
more intuitive than scores and corrections for
this assignment).
3) find the pairwise distance matrix values that
underlie the tree and describe what you think they
mean (it is simple).
4) use these distance matrix values to hand-compute
the first cluster by the N-J algorithm (you don't
have to give exact numbers, just indicate which
leaves are clustered and how the correction is
made in principle).
5) challenge: explain qualitatively what changes
about the tree when you toggle between corrected
identities and uncorrected identities.
Send in an image file of the tree, the tab-delimited
text version of the pairwise distance values, the
hand computation, and your explanation in part 5.
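
To make the distinction in parts 2 and 5 concrete, here is a minimal sketch of an uncorrected distance and one common correction. It assumes the uncorrected distance is the fraction of differing residues over aligned, non-gap columns, and it uses the Poisson correction d = -ln(1 - p) purely as a stand-in: the slides do not say which correction Bonsai applies, so this may differ from what the program does. The function names and the two toy sequences are made up for illustration.

```python
import math


def uncorrected_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    differing = sum(1 for x, y in columns if x != y)
    return differing / len(columns)


def poisson_corrected(p):
    """Poisson correction: estimates substitutions per site, allowing multiple hits."""
    if p >= 1.0:
        raise ValueError("correction is undefined for p >= 1")
    return -math.log(1.0 - p)


# Toy aligned pair (hypothetical sequences, not from the assignment).
p = uncorrected_distance("MKT-LV", "MRTALI")
d = poisson_corrected(p)
print(p, d)
```

Under this kind of correction, small distances barely change while large distances are stretched, so long branches grow proportionally more than short ones.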