Sequence information, logos and Hidden Markov Models - PowerPoint PPT Presentation

1 / 18
About This Presentation
Title:

Sequence information, logos and Hidden Markov Models

Description:

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU ... cl2pred -mat gibbs.mat classII.eval.dat | grep -v # | args 4,5 | xycorr ... – PowerPoint PPT presentation

Number of Views:95
Avg rating:3.0/5.0
Slides: 19
Provided by: clauslun
Category:

less

Transcript and Presenter's Notes

Title: Sequence information, logos and Hidden Markov Models


1
Sequence information, logos and Hidden Markov
Models
  • Morten Nielsen,
  • CBS, BioCentrum,
  • DTU

2
Information content
  • Information and entropy
  • Conserved amino acid regions contain high degree
    of information (high order low entropy)
  • Variable amino acid regions contain low degree of
    information (low order high entropy)
  • Shannon information
  • D log2(N) S pi log2 pi (for proteins
    N20, DNA N4)
  • Conserved residue pA1, piltgtA0,
  • D log2(N) ( 4.3 for proteins)
  • Variable region pA0.05, pC0.05, ..,
  • D 0

3
Sequence logo
MHC class II Logo from 10 sequences
  • Height of a column equal to D
  • Relative height of a letter is pA
  • Highly useful tool to visualize sequence motifs

High information positions
http//www.cbs.dtu.dk/gorodkin/appl/plogo.html
4
Sequence information
  • ALAKAAAAM
  • ALAKAAAAN
  • ALAKAAAAR
  • ALAKAAAAT
  • ALAKAAAAV
  • GMNERPILT
  • GILGFVFTM
  • TLNAWVKVV
  • KLNEPVLLL
  • AVVPFIVSV
  • Description of binding motif
  • Example
  • PA 6/10
  • PG 2/10
  • PT PK 1/10
  • PC PD PV 0
  • Problems
  • Few data
  • Data redundancy/duplication

5
Sequence information Raw sequence counting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
6
Sequence weighting
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
7
Pseudo counts
ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV
GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  • Sequence weighting and pseudo count
  • Motif found on more data

8
and now you
  • cp files from /usr/opt/www/pub/CBS/researchgroups/
    immunology/intro/HMM/exercise
  • Make weight matrix and logos using
  • pep2mat -swt 2 -wlc 0 data gt mat
  • mat2logo mat
  • ghostview logo.ps
  • Include sequence weighting
  • pep2mat -swt 0 -wlc 0 data gt mat
  • make and view logo
  • Try the other sequence weighting scheme
    (clustering) -swt 1. What difference does this
    make?
  • Include pseudo counts
  • pep2mat data gt mat
  • make and view logo

9
Weight matrices
  • Estimate amino acid frequencies from alignment
    including sequence weighting and pseudo counts
  • Construct a weight matrix as
  • Wij log(pij/qj)
  • Here i is a position in the motif, and j an amino
    acid. qj is the prior frequency for amino acid j.
  • W is a L x 20 matrix, L is motif length
  • Score sequences to weight matrix by looking up
    and adding L values from matrix

10
Weight matrix predictions
  • Use the program seq2hmm to evaluate the
    prediction accuracy of your weight matrix
  • seq2hmm -hmm mat -xs eval.set grep -v args
    2,3 xycorr
  • What is going on here?
  • By leaving out the -xs option you can generate
    the scores at each position in the sequence. This
    is often useful for Neural Network training

11
MHC class II prediction
DRB10401 peptides
  • Complexity of problem
  • Peptides of different length
  • Weak motif signal
  • Alignment crucial
  • Gibbs Monte Carlo sampler

  • RFFGGDRGAPKRG
  • YLDPLIRGLLARPAKLQV
  • KPGQPPRLLIYDASNRATGIPA
  • GSLFVYNITTNKYKAFLDKQ
  • SALLSSDITASVNCAK
  • PKYVHQNTLKLAT
  • GFKGEQGPKGEP
  • DVFKELKVHHANENI
  • SRYWAIRTRSGGI
  • TYSTNEIDLQLSQEDGQTIE

12
Gibbs sample algorithm
Alignment by Gibbs sampler
RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRL
LIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ
SALLSSDITASVNCAK PKYVHQNTLKLAT
GFKGEQGPKGEP DVFKELKVHHANENI
SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI
E ?i,j pij log( pij/qi )
  • Maximize E using MC
  • Random change in offset
  • Random shift on box position
  • Accept moves to higher E always
  • Accept moves to lower E with decreasing
    probability

13
Gibbs sampler exercise
  • The file clasII.fsa is a FASTA file containing 50
    classII epitopes
  • gibbss_mc -iw -w 1,0,0,1,0,1,0,0,1 -m gibbs.mat
    classII.fsa
  • The options -iw and -w 1,0,0,1,0,1,0,0,1 increase
    matrix weight on important anchor positions in
    binding motif
  • Make and view logo
  • Use the matrix to predict classII epitopes
  • cl2pred -mat gibbs.mat classII.eval.dat grep -v
    args 4,5 xycorr
  • Do you understand what is going on in this
    command?

14
Hidden Markov Models
  • Weight matrices do not deal with insertions and
    deletions
  • In alignments, this is done in an ad-hoc manner
    by optimization of the two gap penalties for
    first gap and gap extension
  • HMM is a natural frame work where
    insertions/deletions are dealt with explicitly

15
HMM (a simple example)
  • Example from A. Krogh
  • Core region defines the number of states in the
    HMM (red)
  • Insertion and deletion statistics are derived
    from the non-core part of the alignment (black)
  • ACA---ATG
  • TCAACTATC
  • ACAC--AGC
  • AGA---ATC
  • ACCG--ATC

Core of alignment
16
HMM construction
  • 5 matches. A, 2xC, T, G
  • 5 transitions in gap region
  • C out, G out
  • A-C, C-T, T out
  • Out transition 3/5
  • Stay transition 2/5

.4
ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC
.2
A C G T
.4
.2
.2
.6
.6
.8
A C G T
A C G T
A C G T
A C G T
A C G T
A C G T
.8
1
1.
1.
1.
1.
.4
.8
.8
.2
.2
.2
.2
.8
.2
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2
3.3x10-2
17
Align sequence to HMM
ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2
3.3x10-2 TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.
4x0.4x0.2x0.6x1x1x0.8x1x0.8 0.0075x10-2
ACAC--AGC 1.2x10-2 AGA---ATC
3.3x10-2 ACCG--ATC 0.59x10-2 Consensus ACAC--AT
C 4.7x10-2, ACA---ATC 13.1x10-2 Exceptional T
GCT--AGG 0.0023x10-2
18
Align sequence to HMM - Null model
  • Score depends strongly on length
  • Null model is a random model. For length L the
    score is 0.25L
  • Log-odds score for sequence S
  • Log( P(S)/0.25L)
  • Positive score means more likely than Null model
  • ACA---ATG 4.9
  • TCAACTATC 3.0
  • ACAC--AGC 5.3
  • AGA---ATC 4.9
  • ACCG--ATC 4.6
  • Consensus
  • ACAC--ATC 6.7
  • ACA---ATC 6.3
  • Exceptional
  • TGCT--AGG -0.97

Note!
Write a Comment
User Comments (0)
About PowerShow.com