MICA 8006 Protein Sequence Analysis

1
MICA 8006 Protein Sequence Analysis
Sept. 28 - alignment methods
Sept. 30 - phylogenetic methods
Dr. Steven Cannon
Feel free to email with questions: cann0010_at_umn.edu
2
Which sequences are most related in this
alignment?
3
Which sequences are most related in this
alignment?
Indels removed, sorted by pairwise ID
4
What is most related in this alignment?
Indels removed, sorted by pairwise ID -- average
distance tree, Jalview
5
Outline
  • Terms
  • Clustering
  • Distance methods
  • UPGMA
  • NJ
  • Parsimony
  • Maximum likelihood
  • Bayesian likelihood
  • Bootstrapping
  • Programs, data formats
  • Examples
6
Sequence -> tree

    5    10
Alpha     ABCDEFGHIK
Beta      AB--EFGHIK
Gamma     ?BCDSFG??
Delta     CIKDEFGHIK
Epsilon   DIKDEFGHIK

     +--------Gamma
     !
  +--2     +--Epsilon
  !  !  +--4
  !  +--3  +--Delta
  1     !
  !     +-----Beta
  !
  +-----------Alpha
7
Why phylogenetic trees?
Branch lengths may be used to indicate the number of changes or the amount of evolution.
[Figure: phylogram of the five example sequences, internal nodes 1-3, with branch lengths drawn to scale]
8
Some terms
  • Dendrogram or cladogram: no branch lengths; topology only
  • Phylogram, phylogeny: usually indicates branch lengths
  • Clade: subtree; group of sequences
9
Basic clustering methods
Calculate distances between each pair of sequences.
  • Single linkage: similarity between any 2 groups = minimum pairwise difference (or maximum similarity) between any 2 members of the 2 groups
  • Complete linkage: similarity between any 2 groups = maximum pairwise difference (or minimum similarity) between any 2 members of the 2 groups
  • Average linkage: takes the average of similarity between members of 2 clusters; UPGMA (Unweighted Pair-Group Method using Arithmetic averages)
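The three linkage rules differ only in how they summarize the member-to-member distances between two clusters. The following is a minimal sketch: the function name `linkage_distance`, the taxa A-D, and the distance values are all hypothetical, chosen only for illustration.

```python
from itertools import product

def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Distance between two clusters under single, complete, or average linkage."""
    # All member-to-member distances between the two clusters.
    pairs = [dist[frozenset((a, b))] for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(pairs)
    if method == "complete":    # farthest pair of members
        return max(pairs)
    if method == "average":     # arithmetic mean over all pairs (the UPGMA rule)
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown method: {method}")

# Hypothetical distances, not taken from the slides.
dist = {
    frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.6,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.7,
}

for method in ("single", "complete", "average"):
    print(method, linkage_distance(["A", "B"], ["C", "D"], dist, method))
```

Single linkage tends to chain clusters together, complete linkage favors compact clusters, and average linkage sits between the two.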
10
DISTANCE methods
  • Operate on distance matrices, not individual characters.
  • Loss of information
  • Branch lengths may be negative (non-interpretable)
  • A problem: usually the distance matrix generates conflicts. Trees are not usually additive.
11
DISTANCE methods: UPGMA, WPGMA
  • Calculate distances between all taxa.
  • Group the most similar.
  • Calculate the distance between the new node and the other taxa.
  • Similarity between any single taxon and some group is the average of the pairwise similarities between that taxon and each member of the group.
  • Repeat until all nodes and groups have been joined.
  • Assumes an additive tree and equal rates of evolution (a molecular clock).
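The UPGMA steps can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course: the function name `upgma` and the Newick-style output are assumptions, and the input is the three-taxon distance matrix (I, II, III) used as the example in these slides.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA; return (Newick-style string, root height)."""
    clusters = {n: (1, 0.0) for n in names}      # label -> (size, node height)
    d = {frozenset(k): v for k, v in dist.items()}
    while len(clusters) > 1:
        # Group the most similar pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((x, y))] / 2        # new node sits halfway
        (nx, hx), (ny, hy) = clusters.pop(x), clusters.pop(y)
        new = f"({x}:{height - hx:.2f},{y}:{height - hy:.2f})"
        # Distance from the new node to each remaining cluster:
        # size-weighted mean, i.e. the average over all member pairs.
        for z in clusters:
            d[frozenset((new, z))] = (
                nx * d[frozenset((x, z))] + ny * d[frozenset((y, z))]
            ) / (nx + ny)
        clusters[new] = (nx + ny, height)
    tree, (_, height) = next(iter(clusters.items()))
    return tree + ";", height

dist = {("I", "II"): 0.20, ("I", "III"): 0.50, ("II", "III"): 0.50}
tree, root_height = upgma(["I", "II", "III"], dist)
print(tree)         # (III:0.25,(I:0.10,II:0.10):0.15);
print(root_height)  # 0.25
```

Because every tip ends up at the same distance from the root, the result is only trustworthy when rates of evolution really are roughly equal. WPGMA would replace the size-weighted mean with a plain average of the two child distances.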
12
DISTANCE methods: UPGMA, WPGMA

I    aagtcatgct
II   aaatcaggct
III  cagacagtca

Distance matrix
       I      II     III
I      -
II     0.20   -
III    0.50   0.50   -

I  ---0.10---+
             +---0.40--- III
II ---0.10---+
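The distance matrix on this slide is the proportion of differing sites (the p-distance) between each pair of sequences. A short sketch that recomputes it; the function name `p_distance` is illustrative.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

seqs = {"I": "aagtcatgct", "II": "aaatcaggct", "III": "cagacagtca"}

matrix = {
    (m, n): p_distance(seqs[m], seqs[n]) for m, n in combinations(seqs, 2)
}
for pair, d in matrix.items():
    print(pair, f"{d:.2f}")
# ('I', 'II') 0.20
# ('I', 'III') 0.50
# ('II', 'III') 0.50
```

The p-distance ignores multiple substitutions at the same site, so it underestimates the true amount of change for more divergent sequences.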
13
DISTANCE methods: UPGMA, WPGMA
'Distortion' is the difference between the observed matrix and a matrix derived from the resulting tree.
How to minimize the distortion?
The Fitch and Margoliash method finds a tree with the least distortion in the t(t-1)/2 pairwise distances of t taxa. (Costly)
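For three taxa the distortion can be driven to zero, because three branch lengths can always be solved from three distances. The sketch below shows this for the I, II, III matrix and measures distortion as an unweighted sum of squared differences. That measure is a simplifying assumption: the Fitch and Margoliash criterion additionally weights each term, and the function names are illustrative.

```python
def three_taxon_branches(d12, d13, d23):
    """Branch lengths of the unrooted three-taxon tree fitting the distances."""
    a = (d12 + d13 - d23) / 2   # branch leading to taxon 1
    b = (d12 + d23 - d13) / 2   # branch leading to taxon 2
    c = (d13 + d23 - d12) / 2   # branch leading to taxon 3
    return a, b, c

def distortion(observed, from_tree):
    """Unweighted sum of squared differences between the two matrices."""
    return sum((observed[k] - from_tree[k]) ** 2 for k in observed)

observed = {("I", "II"): 0.20, ("I", "III"): 0.50, ("II", "III"): 0.50}
a, b, c = three_taxon_branches(0.20, 0.50, 0.50)

# Tree-derived distances: sum of the branches on the path between two tips.
from_tree = {("I", "II"): a + b, ("I", "III"): a + c, ("II", "III"): b + c}

print(f"branches: I={a:.2f} II={b:.2f} III={c:.2f}")
print(f"distortion: {distortion(observed, from_tree):.6f}")
```

With four or more taxa there are more pairwise distances than branches, so an exact fit is generally impossible and the search over topologies and branch lengths becomes the costly part.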
14
DISTANCE methods: Neighbor joining
  • Fast
  • Handles branch lengths badly
  • Not model-based
  • Performs poorly when there are many sequences

15
Parsimony
Parsimony = simple, stingy
The simplest theory is preferable.
The preferred tree requires the smallest number of changes to explain the differences among the sequences.

Site   1  2  3  4
1      A  G  G  A
2      A  G  G  G
3      A  A  C  A
4      A  A  C  G
16
Tree I
1 A G G A            A A C A 3
           \        /
            --------
           /        \
2 A G G G            A A C G 4

Tree II
1 A G G A            A G G G 2
           \        /
            --------
           /        \
3 A A C A            A A C G 4

Tree III
1 A G G A            A G G G 2
           \        /
            --------
           /        \
4 A A C G            A A C A 3

Parsimony
Site   1  2  3  4
1      A  G  G  A
2      A  G  G  G
3      A  A  C  A
4      A  A  C  G

Site 1: uninformative.
Site 2: G -> A, supports tree I.
Site 3: G -> C, supports tree I.
Site 4: A -> G, supports tree II.
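The site-by-site reasoning can be checked by counting the minimum number of changes each tree needs, using Fitch's set-based rule. This is a sketch under stated assumptions: the three topologies are written as read from the slide drawings (sequences on the same side of the internal branch are grouped), and the function names are illustrative.

```python
def fitch_quartet(left, right):
    """Minimum changes at one site on the four-taxon tree ((l1,l2),(r1,r2))."""
    def join(s, t):
        # Fitch rule: keep shared states at no cost, else take the union
        # and count one change.
        common = s & t
        return (common, 0) if common else (s | t, 1)

    lset, lcost = join({left[0]}, {left[1]})
    rset, rcost = join({right[0]}, {right[1]})
    _, mid_cost = join(lset, rset)      # across the internal branch
    return lcost + rcost + mid_cost

seqs = {1: "AGGA", 2: "AGGG", 3: "AACA", 4: "AACG"}
trees = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}

def tree_length(tree):
    """Total changes over all sites for one quartet topology."""
    (a, b), (c, d) = tree
    return sum(
        fitch_quartet((seqs[a][i], seqs[b][i]), (seqs[c][i], seqs[d][i]))
        for i in range(len(seqs[1]))
    )

scores = {name: tree_length(t) for name, t in trees.items()}
print(scores)  # {'I': 4, 'II': 5, 'III': 6}
```

Tree I needs the fewest changes (4), so it is the most parsimonious of the three, in line with two informative sites favoring tree I and one favoring tree II.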
17
Maximum likelihood
Allows for multiple character changes on a branch.
Likelihood of data = probability of the data (sequences), given the model (tree + sequence model):
L_D = P(D | H)
To evaluate a tree: for each sequence and character in the sequence, calculate the probability of observing that character, given a substitution model and an ancestral state in that tree. Probabilities for each aligned position are multiplied to get the tree likelihood.
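A minimal sketch of the calculation for the simplest possible tree: two sequences joined at a root. It assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, and reuses sequences I and II from the distance-method example. Per-site probabilities are summed over the unknown ancestral base, then combined across sites by adding logarithms, which is equivalent to multiplying the probabilities.

```python
import math

BASES = "acgt"

def jc69(x, y, t):
    """JC69 probability that base x is replaced by y along a branch of
    length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(x, y, t1, t2):
    """Probability of observing x and y at one site, summed over the
    four possible ancestral bases (each with prior frequency 0.25)."""
    return sum(0.25 * jc69(a, x, t1) * jc69(a, y, t2) for a in BASES)

def log_likelihood(seq1, seq2, t1, t2):
    """Log of the product of per-site likelihoods (logs avoid underflow)."""
    return sum(
        math.log(site_likelihood(x, y, t1, t2)) for x, y in zip(seq1, seq2)
    )

seq_I, seq_II = "aagtcatgct", "aaatcaggct"
for t in (0.01, 0.05, 0.10, 0.20, 0.50):
    lnl = log_likelihood(seq_I, seq_II, t, t)
    print(f"each branch {t:.2f}: lnL = {lnl:.3f}")
```

A real ML program repeats this kind of calculation over many candidate topologies and optimizes every branch length, which is why the method is slow on large data sets.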
18
Bayesian likelihood
A search method and refinement (with a twist) on maximum likelihood.
Likelihood of model (tree) given the data (seqs, matrix) = probability of the data given the tree x the tree probability, normalized over the possible trees for this number of taxa.
MCMC (Markov Chain Monte Carlo): walk through tree space, (usually) accepting more likely trees.
Our belief about the phylogeny changes during the search. We end with probabilities for each examined tree and clade.
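The number of possible trees grows so quickly with the number of taxa that they cannot all be evaluated, which is why the search samples tree space instead. A short sketch that counts unrooted, fully bifurcating topologies using the standard (2n - 5)!! formula. The flat prior shown (every topology equally probable) is one common choice, assumed here for illustration.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating topologies: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

for n in (4, 5, 10, 20):
    trees = num_unrooted_trees(n)
    # Under a flat prior, each topology starts with probability 1 / trees.
    print(f"{n:2d} taxa: {trees} trees, flat prior = {1 / trees:.3g}")
```

Four taxa give exactly three unrooted topologies, matching the three trees compared in the parsimony example.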
19
Bootstrapping likelihood
Generate 1000 pseudo-alignments by sampling alignment columns with replacement, and calculate a tree for each. Get the consensus; count clade frequencies.
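The resampling step can be sketched as follows. The function name `bootstrap_alignments` and the fixed random seed are illustrative assumptions, and the toy alignment is the three-sequence example from the distance-method slides. Tree building and clade counting for each replicate are left out.

```python
import random

def bootstrap_alignments(alignment, n_replicates=1000, seed=0):
    """Yield pseudo-alignments made by sampling columns with replacement."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    length = len(next(iter(alignment.values())))
    for _ in range(n_replicates):
        # Draw as many column indices as the alignment has columns.
        cols = [rng.randrange(length) for _ in range(length)]
        # Apply the same columns to every sequence so rows stay aligned.
        yield {
            name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()
        }

alignment = {"I": "aagtcatgct", "II": "aaatcaggct", "III": "cagacagtca"}
replicates = list(bootstrap_alignments(alignment, n_replicates=1000))
print(len(replicates), replicates[0])
```

Each replicate would then be passed to the same tree-building method used on the real alignment; the percentage of replicate trees containing a given clade is that clade's bootstrap support.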
20
Programs, data formats
  • PAUP: Nexus format
  • Phylip: Phylip / New Hampshire / Newick format
    (((a:0.2,b:0.2):0.4,(c:0.3,(d:0.5,((e:0.5,f:0.2):0.0,g:0.3):0.0):0.0):0.1):0.6,(h:0.6,i:0.3):0.3,j:0.5);
  • Mega2, etc.
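In Newick format, parentheses group the members of a clade, commas separate siblings, and the number after each colon is a branch length. The following small recursive parser is a sketch for well-formed input only (no quoted labels or comments), not a full implementation of the format. The names `parse_newick` and `leaves` are illustrative.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts: name, length, children."""
    text = "".join(text.split()).rstrip(";")    # drop whitespace and ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                            # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                        # consume ","
                children.append(node())
            pos += 1                            # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ":,()":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1                            # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

def leaves(n):
    """Leaf names in left-to-right order."""
    if not n["children"]:
        return [n["name"]]
    return [leaf for child in n["children"] for leaf in leaves(child)]

newick = ("(((a:0.2,b:0.2):0.4,(c:0.3,(d:0.5,((e:0.5,f:0.2):0.0,g:0.3):0.0)"
          ":0.0):0.1):0.6,(h:0.6,i:0.3):0.3,j:0.5);")
tree = parse_newick(newick)
print(leaves(tree))            # ['a', 'b', ..., 'j']
print(len(tree["children"]))   # 3 branches at the top level
```

The three-way split at the top level is the usual way of writing an unrooted tree in Newick.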
21
Neighbor joining on default Clustalw alignment
22
Neighbor joining on cleaned Clustalw alignment, with bootstrap
23
Parsimony on cleaned alignment
24
Parsimony on cleaned alignment, with ML branches
25
Parsimony on cleaned alignment, with ML branches; rooted(?)
What about branch lengths?
26
NBS seqs in legumes Medicago in red
27
NBS seqs non-TIR subfam. NJ, rooted
28
NBS seqs non-TIR subfam. ParsML, rooted
29
NBS seqs non-TIR subfam. ParsML, rooted
30
Some key points
  • Prepare a good alignment.
  • Think about the results.
  • Compare alternate methods.
  • Consider bootstrap, branch lengths, rooting.
  • Add additional sequences, species for context.
  • Consider gene duplication, loss, genomic context.