Sequence information, logos and Hidden Markov Models - PowerPoint PPT Presentation

About This Presentation

Title:

Sequence information, logos and Hidden Markov Models

Description:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU ... cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr ... – PowerPoint PPT presentation

Slides: 19

Provided by: clauslun

Transcript and Presenter's Notes

Title: Sequence information, logos and Hidden Markov Models

1
Sequence information, logos and Hidden Markov
Models

Morten Nielsen,
CBS, BioCentrum,
DTU

2
Information content

Information and entropy
Conserved amino acid regions contain high degree
of information (high order low entropy)
Variable amino acid regions contain low degree of
information (low order high entropy)
Shannon information
$D = \log_2 N + \sum_i p_i \log_2 p_i$ (for proteins $N = 20$, for DNA $N = 4$)
Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, so $D = \log_2 N$ ($\approx 4.3$ for proteins)
Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., so $D = 0$
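
The two limiting cases above can be checked numerically. The following is a minimal sketch; the function name `information_content` is chosen here for illustration and does not come from the slides.

```python
import math

def information_content(freqs, n_symbols=20):
    """Return D = log2(N) + sum_i p_i * log2(p_i), in bits.

    Terms with p_i = 0 contribute nothing (the limit of p*log2(p) is 0).
    """
    entropy_term = sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(n_symbols) + entropy_term

conserved = [1.0] + [0.0] * 19   # p_A = 1, all other amino acids 0
variable = [0.05] * 20           # all 20 amino acids equally likely

print(f"conserved column: D = {information_content(conserved):.2f} bits")
print(f"variable column:  D = {abs(information_content(variable)):.2f} bits")
```

For DNA the same function can be called with `n_symbols=4`, which gives a maximum of 2 bits per position.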

3
Sequence logo
MHC class II Logo from 10 sequences

Height of a column equal to D
Relative height of a letter is pA
Highly useful tool to visualize sequence motifs

High information positions
http://www.cbs.dtu.dk/gorodkin/appl/plogo.html
4
Sequence information

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Description of binding motif
Example (first position of the motif)
$P_A = 6/10$
$P_G = 2/10$
$P_T = P_K = 1/10$
$P_C = P_D = \dots = P_V = 0$
Problems
Few data
Data redundancy/duplication
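
The raw frequencies in the example can be reproduced by counting the first column of the ten peptides. This is a small sketch; `column_frequencies` is an illustrative helper name, not a tool from the course.

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def column_frequencies(seqs, position):
    """Raw (unweighted) amino acid frequencies at one 0-based position."""
    counts = Counter(seq[position] for seq in seqs)
    return {aa: n / len(seqs) for aa, n in counts.items()}

first = column_frequencies(peptides, 0)
for aa, p in sorted(first.items(), key=lambda item: -item[1]):
    print(f"P_{aa} = {p:.1f}")
```

Amino acids that never occur in the column are simply absent from the result, which is the "few data" problem: their estimated probability is exactly zero.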

5
Sequence information: Raw sequence counting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
6
Sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
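
The slide does not say which weighting scheme is used. As one possible illustration, the sketch below uses Henikoff-style position-based weights (an assumption, not necessarily what pep2mat does): each column distributes a total weight of 1 so that rare residues earn more than common ones, which down-weights the five near-identical ALAKAAAA* peptides.

```python
from collections import Counter

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def position_based_weights(seqs):
    """Henikoff-style weights: w_k = sum over columns of 1 / (r * s).

    r = number of distinct residues in the column,
    s = number of sequences sharing sequence k's residue in that column.
    Weights are normalised to sum to 1.
    """
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):
        counts = Counter(column)
        r = len(counts)
        for k, residue in enumerate(column):
            weights[k] += 1.0 / (r * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

def weighted_frequency(seqs, weights, position, residue):
    """Weighted frequency of one residue at one 0-based position."""
    return sum(w for s, w in zip(seqs, weights) if s[position] == residue)

weights = position_based_weights(peptides)
p_a = weighted_frequency(peptides, weights, 0, "A")
print(f"weighted P_A at position 1: {p_a:.2f} (raw count gives 0.60)")
```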
7
Pseudo counts
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

Sequence weighting and pseudo count
Motif found on more data

8
and now you

cp files from /usr/opt/www/pub/CBS/researchgroups/immunology/intro/HMM/exercise
Make weight matrix and logos using
pep2mat -swt 2 -wlc 0 data > mat
mat2logo mat
ghostview logo.ps
Include sequence weighting
pep2mat -swt 0 -wlc 0 data > mat
make and view logo
Try the other sequence weighting scheme (clustering) -swt 1. What difference does this make?
Include pseudo counts
pep2mat data > mat
make and view logo

9
Weight matrices

Estimate amino acid frequencies from alignment
including sequence weighting and pseudo counts
Construct a weight matrix as
$W_{ij} = \log(p_{ij}/q_j)$
Here $i$ is a position in the motif and $j$ an amino
acid; $q_j$ is the prior frequency of amino acid $j$.
$W$ is an $L \times 20$ matrix, where $L$ is the motif length
Score sequences to weight matrix by looking up
and adding L values from matrix
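
A minimal sketch of building and using such a matrix follows. It assumes a flat prior q_j = 1/20 and a simple add-one pseudo count to avoid log(0); both are simplifications chosen here, and the function names are illustrative rather than taken from pep2mat.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]

def build_weight_matrix(seqs, pseudo=1.0):
    """Return W as a list of L dicts, W[i][j] = log(p_ij / q_j)."""
    q = 1.0 / len(AMINO_ACIDS)  # flat prior (assumption)
    matrix = []
    for column in zip(*seqs):
        counts = Counter(column)
        denom = len(column) + pseudo * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log(((counts[aa] + pseudo) / denom) / q)
            for aa in AMINO_ACIDS
        })
    return matrix

def score(matrix, peptide):
    """Look up and add one value per position (L values in total)."""
    return sum(row[aa] for row, aa in zip(matrix, peptide))

W = build_weight_matrix(peptides)
for pep in ("ALAKAAAAM", "GGGGGGGGG"):
    print(pep, round(score(W, pep), 2))
```

A peptide resembling the training data scores higher than one built from residues rarely seen at each position.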

10
Weight matrix predictions

Use the program seq2hmm to evaluate the
prediction accuracy of your weight matrix
seq2hmm -hmm mat -xs eval.set | grep -v # | args 2,3 | xycorr
What is going on here?
By leaving out the -xs option you can generate
the scores at each position in the sequence. This
is often useful for Neural Network training

11
MHC class II prediction
DRB1*0401 peptides

Complexity of problem
Peptides of different length
Weak motif signal
Alignment crucial
Gibbs Monte Carlo sampler

RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE

12
Gibbs sampler algorithm
Alignment by Gibbs sampler
RFFGGDRGAPKRG
YLDPLIRGLLARPAKLQV
KPGQPPRLLIYDASNRATGIPA
GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK
PKYVHQNTLKLAT
GFKGEQGPKGEP
DVFKELKVHHANENI
SRYWAIRTRSGGI
TYSTNEIDLQLSQEDGQTIE
$E = \sum_{i,j} p_{ij} \log(p_{ij}/q_j)$

Maximize E using MC
Random change in offset
Random shift on box position
Accept moves to higher E always
Accept moves to lower E with decreasing
probability
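
The Monte Carlo moves listed above can be sketched as follows. This is an illustration under stated assumptions, not the gibbss_mc implementation: it uses raw frequencies, a flat background q_j = 1/20, a geometric cooling schedule and a 10% chance of a whole-box shift, all of which are choices made here.

```python
import math
import random
from collections import Counter

PEPTIDES = [
    "RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQV", "KPGQPPRLLIYDASNRATGIPA",
    "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
    "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
    "TYSTNEIDLQLSQEDGQTIE",
]
Q = 1.0 / 20  # flat background frequency (assumption)

def energy(peptides, offsets, core_len):
    """E = sum_{i,j} p_ij * log(p_ij / q_j) over the aligned cores."""
    n = len(peptides)
    total = 0.0
    for i in range(core_len):
        counts = Counter(p[o + i] for p, o in zip(peptides, offsets))
        total += sum((c / n) * math.log((c / n) / Q) for c in counts.values())
    return total

def gibbs_align(peptides, core_len=9, steps=2000,
                t_start=0.5, t_end=0.01, seed=0):
    """Maximise E by Metropolis moves with a decreasing temperature."""
    rng = random.Random(seed)
    max_off = [len(p) - core_len for p in peptides]
    offsets = [rng.randint(0, m) for m in max_off]
    e = energy(peptides, offsets, core_len)
    best_e, best_offsets = e, list(offsets)
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        trial = list(offsets)
        if rng.random() < 0.1:
            # Random shift of the whole box by one position.
            d = rng.choice((-1, 1))
            trial = [o + d for o in trial]
            if any(o < 0 or o > m for o, m in zip(trial, max_off)):
                continue
        else:
            # Random change in the offset of a single peptide.
            k = rng.randrange(len(peptides))
            trial[k] = rng.randint(0, max_off[k])
        e_new = energy(peptides, trial, core_len)
        # Always accept uphill moves; accept downhill moves with a
        # probability that shrinks as the temperature t decreases.
        if e_new >= e or rng.random() < math.exp((e_new - e) / t):
            offsets, e = trial, e_new
            if e > best_e:
                best_e, best_offsets = e, list(offsets)
    return best_offsets, best_e

best_offsets, best_e = gibbs_align(PEPTIDES)
for pep, off in zip(PEPTIDES, best_offsets):
    print(pep[off:off + 9])
print(f"E = {best_e:.3f}")
```

The printed 9-mers are the cores selected by this toy run; with only ten peptides and no sequence weighting or pseudo counts they should not be read as a real binding motif.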

13
Gibbs sampler exercise

The file classII.fsa is a FASTA file containing 50
classII epitopes
gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat classII.fsa
The options -iw and -w 1,0,0,1,0,1,0,0,1 increase
matrix weight on important anchor positions in
binding motif
Make and view logo
Use the matrix to predict classII epitopes
cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr
Do you understand what is going on in this
command?

14
Hidden Markov Models

Weight matrices do not deal with insertions and
deletions
In alignments, this is done in an ad-hoc manner
by optimization of the two gap penalties for
first gap and gap extension
An HMM is a natural framework in which
insertions/deletions are dealt with explicitly

15
HMM (a simple example)

Example from A. Krogh
Core region defines the number of states in the
HMM (red)
Insertion and deletion statistics are derived
from the non-core part of the alignment (black)

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Core of alignment
16
HMM construction

5 matches. A, 2xC, T, G
5 transitions in gap region
C out, G out
A-C, C-T, T out
Out transition 3/5
Stay transition 2/5
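
The counts for the gap region can be reproduced directly from the alignment. The sketch below assumes the gap region is columns 4 to 6 of the five aligned sequences and that inserted residues are left-justified within it; the variable names are illustrative.

```python
alignment = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
GAP_REGION = slice(3, 6)  # 0-based columns of the non-core part

# Residues each sequence places in the insert state ("" if it skips it).
inserts = [row[GAP_REGION].replace("-", "") for row in alignment]
residues = "".join(inserts)

# Emission probabilities of the insert state.
emission = {base: residues.count(base) / len(residues) for base in "ACGT"}

# Each inserted residue is followed either by another insert (stay)
# or by the next core column (out).
stay = sum(len(s) - 1 for s in inserts if s)
out = sum(1 for s in inserts if s)

print("insert residues:", residues)
print("emission:", emission)
print(f"stay = {stay}/{stay + out}, out = {out}/{stay + out}")
```

Three of the five sequences enter the gap region at all, so the same data also give 3/5 for entering the insert state and 2/5 for bypassing it.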

[Figure: HMM built from the alignment ACA---ATG, TCAACTATC, ACAC--AGC, AGA---ATC, ACCG--ATC. Each state has an emission distribution over A, C, G, T; the emission and transition probabilities shown take the values 0.2, 0.4, 0.6, 0.8 and 1.]
ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
17
Align sequence to HMM
ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
TCAACTATC: 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2
ACAC--AGC: 1.2x10^-2
AGA---ATC: 3.3x10^-2
ACCG--ATC: 0.59x10^-2
Consensus: ACAC--ATC 4.7x10^-2, ACA---ATC 13.1x10^-2
Exceptional: TGCT--AGG 0.0023x10^-2
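
The products on this slide can be recomputed with a few lines of code. The parameter values below are inferred from the five-sequence alignment and from the factors written out above (match-to-match transitions are 1 and therefore omitted); the names `MATCH`, `INSERT` and `path_probability` are illustrative.

```python
# Emission probabilities of the six core (match) states.
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Emission probabilities of the insert state.
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER, SKIP = 0.6, 0.4   # third core state -> insert / -> fourth core state
STAY, LEAVE = 0.4, 0.6   # insert -> insert / insert -> fourth core state

def path_probability(aligned):
    """Probability of a 9-column aligned sequence along its HMM path."""
    head = aligned[:3]
    insert = aligned[3:6].replace("-", "")
    tail = aligned[6:]
    p = 1.0
    for state, base in zip(MATCH[:3], head):
        p *= state.get(base, 0.0)
    if insert:
        p *= ENTER
        for base in insert:
            p *= INSERT[base]
        p *= STAY ** (len(insert) - 1) * LEAVE
    else:
        p *= SKIP
    for state, base in zip(MATCH[3:], tail):
        p *= state.get(base, 0.0)
    return p

for seq in ("ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC",
            "ACCG--ATC", "ACAC--ATC", "ACA---ATC", "TGCT--AGG"):
    print(f"{seq}  {path_probability(seq) * 100:.4f} x 10^-2")
```

Longer sequences accumulate more factors below 1, so TCAACTATC scores far lower than ACA---ATG even though both are in the training alignment.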
18
Align sequence to HMM - Null model