Title: Phylogenetic Analysis
1 Phylogenetic Analysis
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
2 Phylogenetics
What do I need to do?
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
3 Phylogenetics
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
4 So you have a sequence, now what?
MKILLLCIIFLYYVNAFKNTQKDGVSLQILKKKRSNQVNFLNRKNDYNLI
KNKNPSSSLKSTFDDIKKIISKQLSVEEDKIQMNSNFTKDLGADSLDLVE
LIMALEEKFNVTISDQDALKINTVQDAIDYIEKNNKQ
5 1. What is it?
Does the source organism have its own genome database?
Yes: BLAST at the genome database (GeneDB, PlasmoDB, etc.)
Unknown/No: BLAST at PubMed
6 Why start with a genome-specific database?
Genome location/structure
Strain variability
BLAST
Expression data
Pathway data
7 PubMed BLAST
8 Blastp
PubMed BLAST
9 Protein families: Conserved Domains
11 BLAST Hits
12 Downloading sequences: FASTA format
13 Getting sequences: FASTA format
14 Saving and editing FASTA files
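FASTA files are plain text, so they can also be read and rewritten with a few lines of code. The following is a minimal sketch in plain Python, not a replacement for a dedicated parser. The function names `parse_fasta` and `write_fasta` and the record header are hypothetical; the sequence is the first 100 residues of the example protein from slide 4.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text with wrapped lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical header; sequence taken from the slide 4 example.
example = """>query_protein hypothetical header
MKILLLCIIFLYYVNAFKNTQKDGVSLQILKKKRSNQVNFLNRKNDYNLI
KNKNPSSSLKSTFDDIKKIISKQLSVEEDKIQMNSNFTKDLGADSLDLVE
"""
records = parse_fasta(example)
print(write_fasta(records), end="")
```

Writing the records back out and parsing them again should give the same list, which is a quick check that an edit did not damage the file.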
15 Phylogenetics
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
16 Pair-wise sequence alignment
Smith-Waterman
17 Aligning 2 sequences globally
[Figure: dynamic-programming score matrix for a global alignment. Border values run from -4 to -36 in steps of 4; the layout of the remaining cell values is not recoverable from the transcript.]
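A score matrix like the one on this slide is filled in by dynamic programming. The sketch below shows global (Needleman-Wunsch style) alignment in plain Python. It assumes a linear gap penalty of -4, which is consistent with the border values on the slide, and simple match/mismatch scores of +1/-1 as hypothetical stand-ins for a real substitution matrix. The function name is also hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-4):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Borders: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


# Two of the title-slide sequences with their gaps removed.
result = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
print(result[1])
print(result[2])
```

A local (Smith-Waterman style) alignment differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell rather than the corner.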
18 Multiple sequence alignment
Progressive
Align the 2 closest sequences
Add in the next closest sequence
Continue adding.
Highly dependent on the initial matches.
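The first progressive step, finding the two closest sequences, can be sketched as a pairwise identity calculation. Real programs derive these distances from pairwise alignments and a guide tree; this simplified sketch assumes the rows are already aligned and uses the four sequences from the title slide. The labels `s1` to `s4` and the function name are hypothetical.

```python
from itertools import combinations

# The four aligned sequences shown on the title slide.
aligned = {
    "s1": "YTSLLLSRQ-",
    "s2": "YASLLW-RQA",
    "s3": "PASIILSRQA",
    "s4": "GRSIVLTRQM",
}


def fraction_identical(x, y):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a == b for a, b in pairs) / len(pairs)


identity = {
    (m, n): fraction_identical(aligned[m], aligned[n])
    for m, n in combinations(aligned, 2)
}
closest_pair = max(identity, key=identity.get)
print(closest_pair, identity[closest_pair])
```

A progressive aligner would align this pair first and then add the remaining sequences in order of decreasing similarity, which is why an error in the first match propagates to the final alignment.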
19 Multiple sequence alignment
Iterative
Initial MSA score (low)
Optimize MSA score
Probabilistic methods don't always generate the same answer
20 Multiple sequence alignment programs
Classified by pair-wise alignment type (global or local) and MSA alignment type (progressive or iterative):
Progressive, global: ClustalX, T-Coffee
Progressive, local: POA
Iterative: HMMs, GAs, Dialign
21 Multiple Sequence Alignments
POAVIZ: progressive, local
CLUSTAL: progressive, global
22 Multiple Sequence Alignments
POAVIZ: progressive, local
CLUSTAL: progressive, global
23 POAVIZ
24 POAVIZ
25 POAVIZ
26 Multiple Sequence Alignments
POAVIZ: progressive, local
CLUSTAL: progressive, global
27 CLUSTALX
Parameters
28 CLUSTALX
29 CLUSTALX: Protein Weight Matrices
- 1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
- 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
- 3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.
30 BLOSUM (BLOck SUbstitution Matrix)
BLOSUM62: Gather conserved protein blocks, cluster sequences that share at least 62% identity, and count the substitutions between clusters to obtain actual substitution rates for these proteins
Pros: Best bet for distantly divergent sequences
31 PAM (point accepted mutation)
Gather the substitution rates for PAM1 (sequences that are 99% identical), then extrapolate to larger distances, assuming that those substitution rates are constant over time
(PAM number = point mutations per 100 amino acids)
Pros: Very good for closely related sequences
Cons: Rare mutations are under-represented; substitution rates are not constant over time (both are problems for phylogenetic estimation)
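The extrapolation behind the PAM series can be illustrated with matrix powers: a PAM250 probability matrix is the PAM1 matrix multiplied by itself 250 times. The sketch below uses a made-up two-state matrix rather than the real 20 x 20 Dayhoff data, so the numbers are purely illustrative; `mat_mul`, `mat_pow` and `pam1_toy` are hypothetical names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each of two hypothetical residues has a 1% chance of changing.
pam1_toy = [[0.99, 0.01],
            [0.01, 0.99]]
pam250_toy = mat_pow(pam1_toy, 250)
print(pam250_toy[0])
```

Because every step reuses the same PAM1 rates, any error or gap in those rates (for example a rare substitution never observed in closely related sequences) is carried through the whole extrapolation.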
32 CLUSTALX
33 CLUSTALX - Aligning
34 CLUSTALX - Aligning
35 CLUSTALX: Alignment view
36 CLUSTAL vs POAVIZ
(global vs local)
POAVIZ
CLUSTAL
37 Phylogenetics
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
38 BioEdit: Alignment manipulation
Open the .aln file
39 BioEdit: Alignment manipulation
Back colored view gives more contrast
Select "Edit" from the Mode dropdown
40 BioEdit: Alignment manipulation
Select "Insert" so that you don't accidentally lose part of your sequence
Then select the unaligned beginning (or end) sequence and delete it.
41 BioEdit: Alignment manipulation
Now save as a different file (.fasta)
42 Phylogenetics
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
43 Tree terminology
root
outgroup
common ancestor (node, branch point)
lineage (branch, edge)
branch length
Operational taxonomic units (OTUs, leaves): A, B, C, D, E, F, G
44 monophyletic
paraphyletic
polyphyletic
45 Sequence homology: orthologues and paralogues
[Figure: an ancestral gene duplicates into genes A and B in the last common ancestor; speciation then yields Human A, Human B, Rat A and Rat B. Brackets in the figure mark gene pairs as orthologues or paralogues.]
46 Methods of estimating phylogenetic relationships
Character-based: Maximum Parsimony (MP)
Distance-based: Neighbor-Joining (NJ), Minimum Evolution (ME)
Probabilistic: Maximum likelihood (ML), Bayesian inference
47 Methods of estimating phylogenetic relationships
Maximum Parsimony (MP)
48 Methods of estimating phylogenetic relationships
Distance-based
Neighbor-Joining (NJ) Method: The NJ method involves clustering of neighbor species that are joined by one node. It does not evaluate all the possible tree topologies, so it is not guaranteed to obtain the optimal tree.
Minimum Evolution (ME) Method: Estimates the total branch length of each topology exhaustively, then chooses the topology with the least total branch length. Time intensive for large numbers of taxa.
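The clustering step of neighbor-joining can be sketched as follows: from a distance matrix, compute the Q criterion for every pair, join the pair with the smallest Q, and compute the branch lengths from that pair to the new node. The distances below are hypothetical and were built to be additive on the tree ((A:1,B:2):3,(C:4,D:5)); the function names are also hypothetical. A full implementation would then replace the joined pair with the new node and repeat.

```python
from itertools import combinations

names = ["A", "B", "C", "D"]
# Hypothetical pairwise distances, additive on ((A:1,B:2):3,(C:4,D:5)).
dist = {
    frozenset(("A", "B")): 3.0, frozenset(("A", "C")): 8.0,
    frozenset(("A", "D")): 9.0, frozenset(("B", "C")): 9.0,
    frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 9.0,
}


def row_sums(names, dist):
    """Sum of distances from each taxon to all others."""
    return {i: sum(dist[frozenset((i, k))] for k in names if k != i)
            for i in names}


def nj_pick_pair(names, dist):
    """Return the pair with the smallest Q value, and that value."""
    n = len(names)
    total = row_sums(names, dist)

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - total[i] - total[j]

    best = min(combinations(names, 2), key=q)
    return best, q(best)


def limb_lengths(pair, names, dist):
    """Branch lengths from each member of the joined pair to the new node."""
    i, j = pair
    n = len(names)
    total = row_sums(names, dist)
    d_ij = dist[frozenset(pair)]
    len_i = d_ij / 2 + (total[i] - total[j]) / (2 * (n - 2))
    return len_i, d_ij - len_i


pair, q_value = nj_pick_pair(names, dist)
print(pair, q_value, limb_lengths(pair, names, dist))
```

With four taxa the pairs (A, B) and (C, D) tie on Q, and `min` returns the first one it sees. Only one pair is evaluated per step, which is why NJ is fast but never compares complete topologies.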
49 Methods of estimating phylogenetic relationships
Probabilistic methods: Maximum likelihood (ML)
Prob(data | model, tree)
More likely topology found
Search all possible topologies to optimize probability
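The quantity Prob(data | model, tree) can be made concrete in the smallest possible case: two aligned DNA sequences joined by a single branch. The sketch below assumes the Jukes-Cantor (JC69) model, which the slides do not name, and two made-up sequences. It scans candidate branch lengths and keeps the one with the highest likelihood, then compares it with the known closed-form JC69 estimate. Real ML programs do the same kind of optimization over many branches and many topologies.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned DNA sequences at distance d under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific change
    log_l = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


# Hypothetical sequences: 20 sites, 4 differences.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "CCGTAGGTACTTACGAACGT"

# Grid search over branch lengths from 0.001 to 2.000 substitutions per site.
grid = [step / 1000.0 for step in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# Closed-form JC69 estimate from the proportion of differing sites.
p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
analytic_d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(best_d, analytic_d)
```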
50 Bayesian inference
Prior information
Model for selection
need both for everyone in the class
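How the prior and the model combine can be written with Bayes' theorem. In the notation below, which is an assumption and not taken from the slides, \tau is a tree (with its parameters), D is the aligned sequence data, P(\tau) is the prior, and P(D \mid \tau) is the likelihood under the chosen substitution model:

```latex
P(\tau \mid D) = \frac{P(D \mid \tau)\, P(\tau)}{\sum_{\tau'} P(D \mid \tau')\, P(\tau')}
```

The sum in the denominator runs over all candidate trees and is rarely computable directly, which is why programs of this kind sample from the posterior with MCMC instead of evaluating it exactly.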
51 Methods of estimating phylogenetic relationships
Character: Maximum Parsimony (MP)
Distance: Neighbor-Joining (NJ), Minimum Evolution (ME)
Probabilistic: Maximum likelihood (ML), Bayesian inference
52 Estimating Phylogenetic Relationships
MEGA
MrBayes
53 Estimating Phylogenetic Relationships
MEGA
MrBayes
54 MEGA: Molecular Evolutionary Genetics Analysis
First we have to make a MEGA formatted file.
Select "All Files" from the "Files of Type" dropdown menu. Then choose the .aln file you just made.
55 MEGA: making a MEGA formatted file
MEGA recognizes that you didn't enter a MEGA formatted file. Click OK.
Now click on the "Convert to MEGA format" button at the top left-hand side of the screen.
56 MEGA: making a MEGA formatted file
Make sure that the file is the right one and that the formatting is correct. Click OK.
Now we have to make sure that the file looks good before starting any analysis.
57 MEGA: making a MEGA formatted file
- Make sure all sequences are the same length
- Remove all traces of the consensus marks
When the file looks good, save it and close both text formatter windows. Now try activating the data file again, this time with the .meg file you just made.
58 MEGA: input a MEGA formatted file
Make sure that the correct sequence type is selected. Make sure that the correct characters are selected for missing data and gaps.
59 MEGA: input a MEGA formatted file
You should now see the Sequence Data Explorer. Minimize this window and you can begin analyzing your data.
60 MEGA: choose an algorithm
From the Phylogeny window you can choose an appropriate algorithm. In this case we'll use Minimum Evolution.
61 MEGA: set parameters
There are two major things to think about first: Model and Rates among Sites. In this example, I'll use the Poisson model with gamma (γ = 2.0) rate variation.
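What these two settings do to a distance can be sketched with the textbook formulas for the Poisson correction, d = -ln(1 - p), and its gamma-distributed variant, d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the gamma shape parameter. That MEGA's options correspond exactly to these formulas is an assumption here; the function names are hypothetical, and the sequences are two of the aligned rows from the title slide.

```python
import math


def p_distance(x, y):
    """Proportion of differing residues over columns without gaps."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson-corrected distance: allows for multiple hits at one site."""
    return -math.log(1.0 - p)


def gamma_distance(p, shape=2.0):
    """Poisson distance with gamma-distributed rates among sites."""
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)


p = p_distance("PASIILSRQA", "GRSIVLTRQM")
print(p, poisson_distance(p), gamma_distance(p, shape=2.0))
```

Both corrections give a distance larger than the raw proportion of differences, and the gamma version is larger still, because it assumes that some sites change much faster than others and have been hit repeatedly.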
62 Substitution models (nucleic acid)
Identity
Substitution rates: Equal
Base frequencies: Variable, Equal
Transition and/or transversion frequencies: Variable
Symmetrical substitution (G->A = A->G)
Kimura 2-parameter: B(E), si(V), sv(V)
Tamura-Nei: B(V), si(V), sv(E)
Kimura 3-parameter: B(V), si(E), sv(V)
General Time Reversible: B(V), Sym
Rate variation across sites
Gamma (Γ) distribution of rate variation among sites
Proportion of Invariable Sites (I)
Γ + I + GTR
63 Substitution models (amino acid)
Sophistication (highest first):
Mixture models: each site can choose its own substitution model, coupled with maximum likelihood probability estimations or MCMC/Bayesian methods
Site-specific residue frequencies: high-dimensional model, but requires a large dataset
Probabilistic substitution rates: Poisson, mtREV
Extrapolation of observed substitution rates: JTT, PAM
No model: Identity
64 MEGA: set parameters
There are two major things to think about first: Model and Rates among Sites. In this example, I'll use the Poisson model with gamma (γ = 2.0) rate variation.
65 MEGA: choose tree test options
Now switch over to the "Test of Phylogeny" tab. In order to determine the validity of your tree you'll need to bootstrap it. Since our sequence isn't very long, only a couple hundred replications are needed. Now click the check button, then click "Compute" in the main window.
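What a bootstrap replication does to the data can be sketched in a few lines: the columns of the alignment are resampled with replacement to make a pseudo-alignment of the same length. This is a sketch of the resampling step only, not of MEGA's implementation. The function name is hypothetical, the alignment is the one from the title slide, and the fixed random seed is there only to make the run repeatable.

```python
import random

alignment = [
    "YTSLLLSRQ-",
    "YASLLW-RQA",
    "PASIILSRQA",
    "GRSIVLTRQM",
]


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement; rows stay in register."""
    length = len(seqs[0])
    picked = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[col] for col in picked) for seq in seqs]


rng = random.Random(0)  # fixed seed, for repeatability only
replicates = [bootstrap_alignment(alignment, rng) for _ in range(200)]
print(replicates[0])
```

A tree is then built from each replicate with the same method and settings, and the support for a branch is the percentage of replicate trees that contain it.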
66 MEGA: edit your tree
Your tree should appear. Not a very good one in this case. Why? Because the sequences were too similar.
The icons on the left allow you to reroot, flip branches, etc.
You can also change the format of the tree.
But let's also compute a condensed tree (select that from the Compute menu) using a cutoff of 50%.
67 MEGA: interpret the tree
Four of the sequences cluster indistinguishably together, while a single other sequence stands out. If we look back at our alignments we could predict this.
68 Estimating Phylogenetic Relationships
MEGA
MrBayes
69 MrBayes: Making a NEXUS (.nex) file
70 MrBayes: Making a NEXUS (.nex) file
71 MrBayes: Running MrBayes
72 MrBayes: Running MrBayes
73 MrBayes: Running MrBayes
74 MrBayes: Running MrBayes
75 MrBayes: Running MrBayes
76 MrBayes: Running MrBayes
77 Phylogenetics
Get related sequences of interest
Perform multiple sequence alignments
Edit alignment
Estimate phylogenetic relationships
Interpret results correctly
78 Phylogenetics
Interpret results correctly
Quality of aligned sequences
One bad egg
Sequence similarity (think Goldilocks)
Use an appropriate model
Use an appropriate estimation method
Use appropriate parameters
Try different things and compare results wisely
Determine the validity of each part of your tree
Develop a model to explain your tree
How does it square with known information?
What can you learn from your sequences?
What can't you learn from your analysis?
79 The Intelligent Consumer
(You don't have to completely understand everything in order to use it properly, but it helps to have a rough idea)
BLAST
- stochastic processes
- random walks
Sequence alignments
- Markov processes
- dynamic programming
- Viterbi, Forward, and Backward algorithms
Bayesian phylogenetic inference
- Bayes' theorem
- Bayesian inference
- Metropolis algorithm
80 Many uses for multiple sequence analysis
81 Protein family analysis
multiple sequence alignment
profile
profile HMM (hidden Markov model)
Find new proteins with same domains
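The step from a multiple sequence alignment to a profile can be sketched as a table of residue frequencies per column. This is a simplified illustration, not a profile HMM: it ignores gaps, uses no pseudocounts, and has no insert or delete states. The function name is hypothetical and the alignment is the one from the title slide.

```python
from collections import Counter

alignment = [
    "YTSLLLSRQ-",
    "YASLLW-RQA",
    "PASIILSRQA",
    "GRSIVLTRQM",
]


def column_profile(aligned):
    """For each column, the relative frequency of each residue (gaps ignored)."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        if total == 0:
            profile.append({})  # column made up entirely of gaps
        else:
            profile.append({res: n / total for res, n in counts.items()})
    return profile


profile = column_profile(alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Columns where one residue dominates are the conserved positions that a profile-based search weights most heavily when looking for new family members.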
82 RNA secondary structure prediction
83 Protein secondary structure prediction
84 Protein structure prediction: homology modeling
Protein sequence with known structure
Aligned sequences with unknown structure
85 Comparative genomics