Title: Sequence information, logos and Hidden Markov Models
1Sequence information, logos and Hidden Markov
Models
- Morten Nielsen,
- CBS, BioCentrum,
- DTU
2Information content
- Information and entropy
- Conserved amino acid regions contain high degree
of information (high order low entropy) - Variable amino acid regions contain low degree of
information (low order high entropy) - Shannon information
- D log2(N) S pi log2 pi (for proteins
N20, DNA N4) - Conserved residue pA1, piltgtA0,
- D log2(N) ( 4.3 for proteins)
- Variable region pA0.05, pC0.05, ..,
- D 0
3Sequence logo
MHC class II Logo from 10 sequences
- Height of a column equal to D
- Relative height of a letter is pA
- Highly useful tool to visualize sequence motifs
High information positions
http//www.cbs.dtu.dk/gorodkin/appl/plogo.html
4Sequence information
- ALAKAAAAM
- ALAKAAAAN
- ALAKAAAAR
- ALAKAAAAT
- ALAKAAAAV
- GMNERPILT
- GILGFVFTM
- TLNAWVKVV
- KLNEPVLLL
- AVVPFIVSV
- Description of binding motif
- Example
- PA 6/10
- PG 2/10
- PT PK 1/10
- PC PD PV 0
- Problems
- Few data
- Data redundancy/duplication
5Sequence information Raw sequence counting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
6Sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
7Pseudo counts
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
- Sequence weighting and pseudo count
- Motif found on more data
8and now you
- cp files from /usr/opt/www/pub/CBS/researchgroups/
immunology/intro/HMM/exercise - Make weight matrix and logos using
- pep2mat -swt 2 -wlc 0 data gt mat
- mat2logo mat
- ghostview logo.ps
- Include sequence weighting
- pep2mat -swt 0 -wlc 0 data gt mat
- make and view logo
- Try the other sequence weighting scheme
(clustering) -swt 1. What difference does this
make? - Include pseudo counts
- pep2mat data gt mat
- make and view logo
9Weight matrices
- Estimate amino acid frequencies from alignment
including sequence weighting and pseudo counts - Construct a weight matrix as
- Wij log(pij/qj)
- Here i is a position in the motif, and j an amino
acid. qj is the prior frequency for amino acid j. - W is a L x 20 matrix, L is motif length
- Score sequences to weight matrix by looking up
and adding L values from matrix
10Weight matrix predictions
- Use the program seq2hmm to evaluate the
prediction accuracy of your weight matrix - seq2hmm -hmm mat -xs eval.set grep -v args
2,3 xycorr - What is going on here?
- By leaving out the -xs option you can generate
the scores at each position in the sequence. This
is often useful for Neural Network training
11MHC class II prediction
DRB10401 peptides
- Complexity of problem
- Peptides of different length
- Weak motif signal
- Alignment crucial
- Gibbs Monte Carlo sampler
- RFFGGDRGAPKRG
- YLDPLIRGLLARPAKLQV
- KPGQPPRLLIYDASNRATGIPA
- GSLFVYNITTNKYKAFLDKQ
- SALLSSDITASVNCAK
- PKYVHQNTLKLAT
- GFKGEQGPKGEP
- DVFKELKVHHANENI
- SRYWAIRTRSGGI
- TYSTNEIDLQLSQEDGQTIE
12Gibbs sample algorithm
Alignment by Gibbs sampler
RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRL
LIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK PKYVHQNTLKLAT
GFKGEQGPKGEP DVFKELKVHHANENI
SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI
E ?i,j pij log( pij/qi )
- Maximize E using MC
- Random change in offset
- Random shift on box position
- Accept moves to higher E always
- Accept moves to lower E with decreasing
probability
13Gibbs sampler exercise
- The file clasII.fsa is a FASTA file containing 50
classII epitopes - gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat
classII.fsa - The options -iw and -w 1,0,0,1,0,1,0,0,1 increase
matrix weight on important anchor positions in
binding motif - Make and view logo
- Use the matrix to predict classII epitopes
- cl2pred -mat gibbs.mat classII.eval.dat grep -v
args 4,5 xycorr - Do you understand what is going on in this
command?
14Hidden Markov Models
- Weight matrices do not deal with insertions and
deletions - In alignments, this is done in an ad-hoc manner
by optimization of the two gap penalties for
first gap and gap extension - HMM is a natural frame work where
insertions/deletions are dealt with explicitly
15HMM (a simple example)
- Example from A. Krogh
- Core region defines the number of states in the
HMM (red) - Insertion and deletion statistics are derived
from the non-core part of the alignment (black)
- ACA---ATG
- TCAACTATC
- ACAC--AGC
- AGA---ATC
- ACCG--ATC
Core of alignment
16HMM construction
- 5 matches. A, 2xC, T, G
- 5 transitions in gap region
- C out, G out
- A-C, C-T, T out
- Out transition 3/5
- Stay transition 2/5
.4
ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC
.2
A C G T
.4
.2
.2
.6
.6
.8
A C G T
A C G T
A C G T
A C G T
A C G T
A C G T
.8
1
1.
1.
1.
1.
.4
.8
.8
.2
.2
.2
.2
.8
.2
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2
3.3x10-2
17Align sequence to HMM
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
3.3x10-2 TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.
4x0.4x0.2x0.6x1x1x0.8x1x0.8 0.0075x10-2
ACAC--AGC 1.2x10-2 AGA---ATC
3.3x10-2 ACCG--ATC 0.59x10-2 Consensus ACAC--AT
C 4.7x10-2, ACA---ATC 13.1x10-2 Exceptional T
GCT--AGG 0.0023x10-2
18Align sequence to HMM - Null model
- Score depends strongly on length
- Null model is a random model. For length L the
score is 0.25L - Log-odds score for sequence S
- Log( P(S)/0.25L)
- Positive score means more likely than Null model
- ACA---ATG 4.9
- TCAACTATC 3.0
- ACAC--AGC 5.3
- AGA---ATC 4.9
- ACCG--ATC 4.6
- Consensus
- ACAC--ATC 6.7
- ACA---ATC 6.3
- Exceptional
- TGCT--AGG -0.97
Note!