Title: MICa 8006 Protein Sequence Analysis
1 MICa 8006 Protein Sequence Analysis
Sept. 28 - alignment methods
Sept. 30 - phylogenetic methods
Dr. Steven Cannon
Feel free to email with questions: cann0010_at_umn.edu
2 What is important in this sequence?
>sequence1
GKT
TLATQIFNVKGEHFDRVIWVVVSKEFNVEKQQDILEKLLEKTEEEKAAEI
ENLFQLLEGKKFLLVLDDVWEKVDLDKIGVPFPDNGSKVLFTTRSESVAV
CGDMGVXMEVECLTPEEAWELFQKKVFENLKSDPEIEELAKEVVKKCGGL
P
4 Why alignments?
protein
- to find motifs, conserved regions, differences
- to describe proteins in terms of residue frequencies
- to calculate phylogenetic trees
nucleotide
- to find conserved noncoding elements, changes at synonymous and nonsynonymous sites
- to find differences between closely related sequences
genomic
- to find regulatory elements, repetitive sequences, genomic duplications, relationships among genomes, patterns of genomic remodeling and change
5 Outline
Pairwise alignments
- The basis of most multiple alignments and searches
- Dynamic programming
- Similarity matrices
Multiple alignment methods
- Combination methods: Clustalw, T-Coffee
- HMMs
Examples: nuts and bolts
- Compare methods and results
- Special concerns for nucleotide alignments
- Preparing alignments for phylogenetic work
6 Pairwise alignment: align two seqs, minimizing mismatches
Global alignment: use whole seqs. Penalize for mismatches at seq ends. Needleman-Wunsch.
    wxyzabcdef...j.
    ....abckefghijk
Local alignment: don't penalize for mismatches at seq ends.
Smith-Waterman: "optimal", slow. Blast, Fasta: not guaranteed optimal, faster.
    wxyzabcdef   j
        abckefghijk
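The global/local distinction above can be sketched in one small dynamic-programming routine. This is a minimal illustration, not the code inside any of the named programs: the function name `align` and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for the example, and real tools use substitution matrices and separate gap opening and extension penalties.

```python
def align(a, b, local=False, match=2, mismatch=-1, gap=-2):
    """Align strings a and b; global (Needleman-Wunsch style) by default,
    local (Smith-Waterman style) when local=True.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # H[i][j] = best score for a[:i] vs b[:j] (for local: best ending at i, j)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, n + 1):
            H[i][0] = i * gap
        for j in range(1, m + 1):
            H[0][j] = j * gap

    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if local:
                # Local alignment may restart anywhere, so never go below 0.
                H[i][j] = max(H[i][j], 0)
                if H[i][j] > best:
                    best, end = H[i][j], (i, j)
    if not local:
        end = (n, m)

    # Trace back from the end cell to recover one optimal alignment.
    i, j = end
    score = H[i][j]
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and H[i][j] == 0):
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


print(align("wxyzabcdef", "abckefghijk", local=True))
print(align("wxyzabcdef", "abckefghijk"))
```

With these scores, the local call on the two example strings returns the shared core, `abcdef` against `abckef`, with score 9. The global call must also account for the unmatched ends, which is where the end penalties come from.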
7 Optimal methods, dynamic programming
Needleman-Wunsch: global
Smith-Waterman: local
Both are slow, but "optimal".
For more detail, google "dynamic programming tutorial".
8 Similarity (or Substitution) Matrices
A matrix with scores for matches and mismatches between amino acid residues.
What residues are one mutation away from a given residue?
What residues have similar physical properties (are conservative changes in a specific protein context)?
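The first question above can be answered by brute force. The sketch below assumes the standard genetic code and treats "one mutation" as a single-nucleotide substitution in any codon for the residue; the function name `one_mutation_neighbors` is made up for this example, and stop codons are left out of the result.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def one_mutation_neighbors(residue):
    """Residues reachable from `residue` by one nucleotide substitution."""
    neighbors = set()
    for codon, aa in CODON_TABLE.items():
        if aa != residue:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                # Skip synonymous changes and stop codons.
                if new_aa not in (residue, "*"):
                    neighbors.add(new_aa)
    return neighbors


print(sorted(one_mutation_neighbors("W")))   # Trp has a single codon, TGG
print("E" in one_mutation_neighbors("D"))    # Asp -> Glu takes one change
```

Trp comes out with only five neighbors (C, G, L, R, S), while Asp and Glu are one change apart. The second question, about physical properties, is not captured by this sketch.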
9 PAM Matrices
PAM = Point Accepted Mutation (per 100 residues)
1 PAM means an average of 1 fixed mutation between two seqs: ~1% difference
250 PAM means an average of 250 fixed mutations (many overlying): ~80% difference
PAM 1 (per 10,000): 0.53% chance of Asp -> Glu
        Ala   Arg   Asn   Asp   Cys   Gln   Glu
          A     R     N     D     C     Q     E   (columns G H I L K M F P S T W Y V not shown)
Ala A  9867     2     9    10     3     8    17
Arg R     1  9913     1     0     1    10     0
Asn N     4     1  9822    36     0     4     6
Asp D     6     0    42  9859     0     6    53
Cys C     1     1     0     0  9973     0     0
Gln Q     3     9     4     5     0  9876    27
Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.
Updated versions:
- JTT (Jones, Taylor, Thornton, 1992)
- Gonnet (Gonnet, Cohen, Benner, 1992)
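Higher PAM matrices are obtained by multiplying the PAM 1 matrix by itself, which is why 250 fixed mutations do not mean 250% difference: many changes land on sites that have already changed. The sketch below shows the mechanics only. It assumes a toy PAM 1 in which every residue changes with probability 0.01 and all 19 replacements are equally likely. That is not the Dayhoff matrix, and the helper names are invented for this example.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


def toy_pam1(n_states=20, change=0.01):
    """Uniform toy model (an assumption): 1% chance of change per residue,
    spread evenly over the other residues."""
    off = change / (n_states - 1)
    return [[1.0 - change if i == j else off for j in range(n_states)]
            for i in range(n_states)]


pam1 = toy_pam1()
pam250 = mat_power(pam1, 250)
# Probability that a site shows a different residue after 250 PAMs.
expected_difference = 1.0 - pam250[0][0]
print(f"difference after 250 PAMs (toy model): {expected_difference:.0%}")
```

Under this uniform toy model the difference comes out near 88%, above the ~80% quoted for the real matrix, presumably because real substitution rates are far from uniform.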
10 BLOSUM (Blocks Substitution Matrix) Matrices
Align "blocks" of proteins that have different identity. Patterns which were 60% identical were used to make a substitution matrix called blosum60, etc.
Scores are log odds of (observed substitutions / expected substitutions)
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
BLOSUM (BLOSUM62) is the default blast matrix, because it performs better than PAM for similarity searches, but PAM is easier to interpret for calculating evolutionary "distance".
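The log-odds idea can be shown with a few lines of arithmetic. The sketch below assumes half-bit units (2 x log2, rounded to an integer), which is the usual BLOSUM convention; the pair frequencies are invented for illustration and do not come from any real block alignment.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected).

    observed: frequency of the residue pair in the aligned blocks
    expected: frequency expected by chance from residue composition
    scale=2.0 gives half-bit units (an assumption, see text above).
    """
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies, for illustration only.
conservative_pair = log_odds_score(observed=0.02, expected=0.005)   # 4x more than chance
rare_pair = log_odds_score(observed=0.001, expected=0.004)          # 4x less than chance
print(conservative_pair, rare_pair)   # prints: 4 -4
```

A positive score means the pair is seen more often than chance would predict, a negative score means less often, and zero means the pair is uninformative.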
11 Combination heuristic implementations - 1
Clustalw, T-coffee, Pileup, Lalign, Malign,...
Do pairwise alignments, then combine.
Pairwise:
    1 wxyzabcdef--i
    2     abckefghij

    1 wxyzabcdef
    3  x--abckffhij

    2  abckefghij
    3 xabckff-hij
Combined:
    1 wxyzabcdef--i-
    3  x--abckff-hij
    2     abckefghij
13 Combination heuristic implementations - 1
Clustalw
- Calculate pairwise similarities.
- Use pairwise scores to make a guide tree for re-alignment.
- Progressively align all sequences using the guide tree. Align more-similar sequences first.
T-Coffee, 3DCoffee
- Calculate pairwise local alignments (lalign). Store a library of pairwise alignment information.
- Calculate pairwise global alignments.
- Use pairwise scores to make a guide tree for re-alignment.
- Progressively align all sequences using the guide tree. Align more-similar sequences first.
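The "guide tree, then align more-similar sequences first" step can be sketched as a simple clustering loop. This is only an illustration of the ordering, not Clustalw's or T-Coffee's actual code: the distance is a stand-in (1 minus the `difflib` similarity ratio rather than a real alignment score), the joining rule is plain average linkage, and the three sequences are the ungapped toy strings from the earlier pairwise example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distance(a, b):
    """Stand-in distance (an assumption): 1 - difflib similarity ratio.
    A real pipeline would derive this from pairwise alignment scores."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def guide_tree_order(seqs):
    """Return the order in which clusters are joined (average linkage).

    seqs: dict mapping name -> unaligned sequence.
    Each merge is a pair of tuples of sequence names.
    """
    dist = {frozenset(pair): pairwise_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters; these would be aligned next.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


toy = {"seq1": "wxyzabcdefi", "seq2": "abckefghij", "seq3": "xabckffhij"}
for left, right in guide_tree_order(toy):
    print("align", left, "with", right)
```

On the toy input, seq2 and seq3 are joined first because they are the most similar pair, and seq1 is then added to that pair.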
14 Another kind of method: Hidden Markov Model alignments (psiblast, hmmer, wise2)
Not just for alignments: speech recognition, gene recognition, intron-exon boundaries, etc.
15 Another kind of method: Hidden Markov Model alignments (psiblast, hmmer, wise2)
Models residue frequencies well. Align all seqs to one statistical model.
- Not a good way to generate an alignment, but excellent for:
    representing the knowledge in an alignment
    maintaining alignments
    aligning lots of sequences
    handling indel regions
    making sensitive searches (Pfam)
16 Hidden Markov Models, HMMs
Represents match states, insertion states, deletion states
[Figure: HMM states; residue labels A, -, T, C, G, G, G]
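The "models residue frequencies" part of a profile HMM can be sketched from an alignment's columns. The code below is a deliberate simplification: it estimates match-state emission probabilities only and ignores insertion states, deletion states, and all transition probabilities, so it is closer to a position-specific scoring profile than to what hmmer builds. The toy nucleotide alignment, the pseudocount of 1, and the uniform background are all assumptions for illustration.

```python
import math
from collections import Counter


def column_emissions(alignment, alphabet="ACGT", pseudocount=1.0):
    """Per-column emission probabilities, one dict per alignment column.

    Gap characters ('-') are skipped; pseudocounts keep unseen residues
    from getting probability zero.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile


def profile_score(seq, profile, background=0.25):
    """Log-odds (bits) of an ungapped sequence against the column profile."""
    return sum(math.log2(col[res] / background)
               for res, col in zip(seq, profile))


# Hypothetical toy alignment, not taken from the figure above.
toy_alignment = ["ATCGG",
                 "A-CGG",
                 "ATCGA",
                 "GTCGG"]
profile = column_emissions(toy_alignment)
print(round(profile_score("ATCGG", profile), 2))   # resembles the family
print(round(profile_score("TGATC", profile), 2))   # does not
```

A sequence that follows the column preferences scores above zero and an unrelated one scores below zero. A full HMM adds position-specific costs for insertions and deletions, which is what makes it useful in indel regions.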
17 Clustalw: reticuline oxidase genes in Arabidopsis thaliana (5')
18 T-coffee (on a subset of the sequences)
19 T-coffee, then hmm, then hmmalign. Lower case = indel, upper case = model
20 T-coffee, then hmm, then hmmalign -- indels removed.
21 T-coffee nucleotide alignment
22 Clustalw nucleotide alignment
23 Align nucleotides to amino acids
24 Two portions of a nucleotide alignment of reticuline oxidase genes in Arabidopsis thaliana
25 NBS-LRR dN/dS
26 Some resources
There are many multiple alignment programs. The "quick and easy" ones are marked. A few of my favorites (use these terms as Google searches):
Alignment on the web
- Clustalw EBI (web server at EBI)
- Clustalw Des Higgins (web; nice docs)
- abs.cit.nih.gov/clustalw (download)
- T-Coffee (download, web)
Links to other alignment programs
- Multiple alignment programs: pbil.univ-lyon1.fr/alignment.html
27 Multiple alignment editing
- Jalview EBI (download). Or, on the web server: do a clustalw alignment at www.ebi.ac.uk/clustalw, then open Jalview from "results"
- Jalview CCGB (newer version, using Java WebStart)
- SeqPup (download)
- Se-Al (download)
- MacClade
- Word (use 'courier' font and option-drag; save as text)
- BBEdit (Mac general text editor)
- JEdit (Java general text editor)
- vim
- perl
28 Other alignment resources
Codon-based alignments
- TranslateAlign.pl by Dan Kortschak (ask me)
- TransAlign: life.anu.edu.au/molecular/software/transalign/
- BioPerl: scripts/align_on_codons.pl
- Wise2
HMM alignments: make initial alignment using clustalw etc., then realign or add new seqs to the model.
- HMMER: HMM alignments, searches, domain pred.
- SAM-T99: web server search, 2º struct. pred.
- HMMPro: commercial HMM program
- Wise2: codon-based alignments
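The idea behind a codon-based alignment can be sketched as follows: align the proteins first, then thread each coding sequence onto its aligned protein, three nucleotides per residue and three gap characters per protein gap. This is a minimal sketch of the general technique, not the method used by any of the tools listed above. The function name `thread_codons` is hypothetical, and the code assumes the coding sequence is in frame, starts at the first residue, and really encodes the protein (it does not check the translation).

```python
def thread_codons(aligned_protein, cds):
    """Insert gaps into an unaligned coding sequence so that it follows
    an already-aligned protein sequence.

    aligned_protein: protein row from a multiple alignment, '-' for gaps
    cds: the matching in-frame nucleotide sequence, without gaps
    """
    n_residues = len(aligned_protein.replace("-", ""))
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for this protein")
    out, pos = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")              # one protein gap = three nt gaps
        else:
            out.append(cds[pos:pos + 3])   # next codon for this residue
            pos += 3
    # Any nucleotides beyond the last residue (e.g. a stop codon) are dropped.
    return "".join(out)


print(thread_codons("M-KT", "ATGAAAACC"))   # prints: ATG---AAAACC
```

Because gaps can only fall between codons, the nucleotide alignment keeps its reading frame, which is what downstream synonymous/nonsynonymous analyses need.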
29 Some alignment nuts and bolts
Clustalw, t-coffee: default parameters are usually OK - but don't trust alignments in indel regions.
Run on the web, or download and run locally.
Can experiment with word size, gap opening and extension penalties.
Provide fasta-format sequences, push the button, wait...
Then look at the alignment - and edit it!
"Phylogenetic tree": trust only if indel regions have been manually removed, and you have a good alignment. Phylip, PAUP
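Removing indel regions before tree building can be roughed out automatically. The sketch below reads a FASTA-format alignment and drops every column that contains a gap in any sequence. It is a crude stand-in for the manual editing recommended above, since it cannot judge whether the remaining columns are well aligned. The function names and the toy alignment are assumptions for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence (insertion ordered)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def strip_gap_columns(aligned, gap_chars="-."):
    """Drop every alignment column in which any sequence has a gap."""
    names = list(aligned)
    columns = zip(*(aligned[n] for n in names))
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


toy = """>seq1
wxyzabcdef--i-
>seq3
-x--abckff-hij
>seq2
----abckefghij
"""
for name, seq in strip_gap_columns(read_fasta(toy)).items():
    print(name, seq)
```

On the toy alignment only the seven gap-free columns survive. The result would still need a look by eye before being handed to Phylip or PAUP.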
30 Some key points
- Be cautious about indels, alignment "ends", DNA alignments.
There is more power in protein alignments:
- distinct character of AAs
- "bogus" information in silent sites
- unknown frame, frame shifts
... but there is some "hidden" information in silent sites.
Alignments can be beauuutiful.