Bayesian Segmentation of Protein Sequences

Transcript and Presenter's Notes
1
Bayesian Segmentation of Protein Sequences
Paper: "A Graphical Model for Protein Secondary
Structure Prediction" by Wei Chu, Zoubin
Ghahramani, and David L. Wild (ICML 2003)
  • Presentation by Jivko Sinapov

2
Outline
  • Introduction
  • Background
  • Naïve Bayes, HMM, HSMM
  • Bayesian Segmentation Model
  • Experiments and Results
  • Discussion

3
Introduction
  • Prediction of class labels for residues in a
    protein sequence

Secondary structure
Sequence VTSYTLSDVVSLKDVVPEWVRIGFSATTGAEYAAHEVLSWSFHSELS
Class    -EEEEEEEE--HHHH---EEEEEEEEE------EEEEEEEEEEEEE-
Protein-RNA interface
Sequence DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class    1111110011111110011111001011111100000001111101000000
Protein-Protein interface
Sequence AVTTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDA
Class    00000000000111111110100000000000001011110000000000
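Concretely, each labeled example pairs a residue string with an equally long class-label string. A minimal Python sketch of that representation, using the secondary-structure example above (the helper name is illustrative, not from the paper):

# One class label per residue: pair sequence and label strings of equal length.
SEQ = "VTSYTLSDVVSLKDVVPEWVRIGFSATTGAEYAAHEVLSWSFHSELS"
LAB = "-EEEEEEEE--HHHH---EEEEEEEEE------EEEEEEEEEEEEE-"

def labeled_residues(sequence: str, labels: str):
    """Pair each residue with its class label."""
    assert len(sequence) == len(labels), "one label per residue"
    return list(zip(sequence, labels))

print(labeled_residues(SEQ, LAB)[:5])
# [('V', '-'), ('T', 'E'), ('S', 'E'), ('Y', 'E'), ('T', 'E')]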
4
Background
  • Naïve assumption: the label of a residue is
    independent of the labels of neighboring
    residues.
  • e.g. Naïve Bayes on windows (see the sketch below)

[Diagram: class labels c1-c8 above sequence positions s1-s8, with no dependencies between neighboring class labels]
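A rough sketch of the window-based Naïve Bayes idea: classify each residue from the residues in a fixed window around it, treating window positions as conditionally independent given the label. The window size, pseudocount, padding symbol, and function names below are illustrative choices, not taken from the paper:

import math
from collections import defaultdict

WINDOW = 7          # illustrative half-window size
PSEUDO = 1.0        # Laplace pseudocount
PAD = "X"           # padding symbol for window positions outside the sequence

def windows(seq, w=WINDOW):
    """Return the window of residues centred on each position."""
    padded = PAD * w + seq + PAD * w
    return [padded[i:i + 2 * w + 1] for i in range(len(seq))]

def train(sequences, label_strings):
    """Estimate P(label) and P(residue at window offset | label) by counting."""
    label_counts = defaultdict(float)
    feat_counts = defaultdict(float)   # (label, offset, residue) -> count
    for seq, labs in zip(sequences, label_strings):
        for win, lab in zip(windows(seq), labs):
            label_counts[lab] += 1
            for off, res in enumerate(win):
                feat_counts[(lab, off, res)] += 1
    return label_counts, feat_counts

def predict(seq, label_counts, feat_counts):
    """Pick the most probable label for each residue under the Naive Bayes model."""
    total = sum(label_counts.values())
    preds = []
    for win in windows(seq):
        best, best_score = None, -math.inf
        for lab, lc in label_counts.items():
            score = math.log(lc / total)
            for off, res in enumerate(win):
                num = feat_counts.get((lab, off, res), 0.0) + PSEUDO
                den = lc + PSEUDO * 21   # 20 amino acids + padding symbol
                score += math.log(num / den)
            if score > best_score:
                best, best_score = lab, score
        preds.append(best)
    return "".join(preds)

# Example usage with the toy labeled sequence from the introduction slide:
# lc, fc = train([SEQ], [LAB]); print(predict(SEQ, lc, fc))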
5
Background
  • Hidden Markov Model
  • Models dependencies between adjacent class labels
  • Observation can be multi-dimensional (PSSM, or
    multiple sequence alignment)
  • Efficient algorithms for learning and prediction
    of class labels
  • Captures some local dependencies that are obvious
    from the data

[Diagram: HMM with class labels c1-c8 forming a chain, each emitting the corresponding sequence position s1-s8]
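Because the class labels are observed in the training data, the HMM's transition and emission tables can be estimated directly by counting. A minimal sketch assuming single-sequence (not profile) observations and Laplace smoothing; the label alphabet and function name are illustrative:

import numpy as np

STATES = "HE-"                      # helix, strand, coil ('-' in the example labels)
AMINO = "ACDEFGHIKLMNPQRSTVWY"      # 20 standard amino acids

def estimate_hmm(sequences, label_strings, pseudo=1.0):
    """Count-based estimates of the transition and emission matrices."""
    s_idx = {s: i for i, s in enumerate(STATES)}
    a_idx = {a: i for i, a in enumerate(AMINO)}
    trans = np.full((len(STATES), len(STATES)), pseudo)
    emit = np.full((len(STATES), len(AMINO)), pseudo)
    for seq, labs in zip(sequences, label_strings):
        for t in range(len(seq)):
            if seq[t] in a_idx:
                emit[s_idx[labs[t]], a_idx[seq[t]]] += 1
            if t > 0:
                trans[s_idx[labs[t - 1]], s_idx[labs[t]]] += 1
    trans /= trans.sum(axis=1, keepdims=True)   # rows: P(next label | current label)
    emit /= emit.sum(axis=1, keepdims=True)     # rows: P(amino acid | label)
    return trans, emit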
6
Previous methods
  • Hidden semi-Markov Model
  • Each hidden state corresponds to a segment of the
    sequence
  • Each state emits a subsequence of residues, rather
    than a single amino acid

Segments  coil | sheet    | coil | helix | coil
Sequence  V    | TSYTLSDV | VS   | LKDV  | VPE
More formally:
7
Sequence segmentation
  • T: secondary structural type of each segment (H, E,
    or L)
  • S: end positions of the individual structural
    segments
  • R: the known amino acid sequence
  • Example: T2 = E (ß-strand), S2 = 9, R[2] =
    (R_{S1+1}, ..., R_{S2})

From "Bayesian Segmentation of Protein Secondary
Structure", S. C. Schmidler, J. S. Liu, and D. L. Brutlag,
Journal of Computational Biology, 2000
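A small sketch of how a segmentation, i.e. the segment types T and end positions S, can be stored and expanded back into per-residue labels, using the toy example from the hidden semi-Markov slide (the class and field names are illustrative):

from dataclasses import dataclass
from typing import List

@dataclass
class Segmentation:
    """A segmentation: segment types T and end positions S (1-based, inclusive)."""
    types: List[str]   # T, one of 'H', 'E', 'L' per segment
    ends: List[int]    # S, position of the last residue in each segment

    @property
    def m(self) -> int:
        return len(self.types)

    def to_residue_labels(self) -> str:
        """Expand the segmentation into one label per residue."""
        labels, start = [], 1
        for t, end in zip(self.types, self.ends):
            labels.extend(t * (end - start + 1))
            start = end + 1
        return "".join(labels)

# Toy example from the HSMM slide: V | TSYTLSDV | VS | LKDV | VPE
seg = Segmentation(types=["L", "E", "L", "H", "L"], ends=[1, 9, 11, 15, 18])
print(seg.m, seg.to_residue_labels())   # 5 LEEEEEEEELLHHHHLLL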
8
Bayesian Segmentation of Protein Secondary
Structure
  • Introduces a model that utilizes multiple
    sequence alignments or PSSM
  • Efficient algorithms for learning and inference
  • Improves accuracy by 10% against window-based
    methods
  • Models long-range interactions formed by
    beta-strands

9
The Model
  • Observation O = (O1, ..., On), where Oi is a 20 x 1
    vector containing the occurrence counts for each
    amino acid at position i
  • Set of structure types {H, E, L}
  • Segment sequence of m segments with types
    T = (T1, ..., Tm) and end positions e = (e1, ..., em)
  • Segment locations are determined by the position of
    the last residue in each segment
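A sketch of how the observation O can be built from a multiple sequence alignment, assuming the alignment is given as equal-length strings over the 20 amino acids so that column i yields the 20 x 1 count vector Oi (function and variable names are illustrative):

import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids

def alignment_to_counts(alignment):
    """Build O as an (n x 20) matrix: row i holds the occurrence counts
    of each amino acid in column i of the multiple sequence alignment."""
    n = len(alignment[0])
    O = np.zeros((n, 20), dtype=int)
    idx = {a: j for j, a in enumerate(AMINO)}
    for row in alignment:
        for i, res in enumerate(row):
            if res in idx:            # skip gaps / unknown residues
                O[i, idx[res]] += 1
    return O

# Example: a toy 3-sequence alignment of length 5
O = alignment_to_counts(["VTSYT", "VTAYT", "ITSYT"])
print(O.shape, O[0])   # (5, 20); column 0 has counts for I and V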

10
  • The variables (m, e, T) describe the segmentation
  • Bayesian approach: compute the posterior
    P(m, e, T | O) ∝ P(O | m, e, T) P(m, e, T)

11
  • Prior over segmentations
    P(m, e, T) = Π_j P(Tj | Tj-1) P(ej | ej-1, Tj)
where P(Tj | Tj-1) is specified by a 3x3 transition
matrix.
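A sketch of evaluating the log of such a prior, assuming the 3x3 transition matrix is indexed over (H, E, L); the transition values, the uniform segment-length term, and the function names below are illustrative placeholders, not the paper's estimates:

import numpy as np

TYPES = {"H": 0, "E": 1, "L": 2}
MAX_LEN = 30   # illustrative cap on segment length for the toy length model

# Illustrative 3 x 3 transition matrix between segment types (rows sum to 1).
TRANS = np.array([[0.1, 0.3, 0.6],
                  [0.3, 0.1, 0.6],
                  [0.4, 0.4, 0.2]])

def log_length_prob(length, seg_type):
    """Placeholder length model P(ej | ej-1, Tj): uniform over 1..MAX_LEN."""
    return -np.log(MAX_LEN) if 1 <= length <= MAX_LEN else -np.inf

def log_prior(types, ends, init=np.array([1/3, 1/3, 1/3])):
    """log P(m, e, T) = sum_j [ log P(Tj | Tj-1) + log P(ej | ej-1, Tj) ]."""
    lp, prev_end, prev_t = 0.0, 0, None
    for t, end in zip(types, ends):
        ti = TYPES[t]
        lp += np.log(init[ti]) if prev_t is None else np.log(TRANS[TYPES[prev_t], ti])
        lp += log_length_prob(end - prev_end, t)
        prev_end, prev_t = end, t
    return lp

print(log_prior(["L", "E", "L", "H", "L"], [1, 9, 11, 15, 18]))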
12
  • Likelihood: P(O | m, e, T) = Π_j P(O[j] | e, Tj),
where O[j] = (O_{e_{j-1}+1}, ..., O_{e_j}) is the
observation for segment j
  • Likelihood function for each segment: P(O[j] | e, Tj),
    built from per-residue likelihoods
13
  • Likelihood function for each residue: a multinomial
    over the 20 amino acids applied to the count vector
    Oi, with parameters depending on the segment type
14
  • Dirichlet prior on the multinomial parameters,
    θ ~ Dirichlet(α1, ..., α20)
15
  • Likelihood function for each segment becomes a
    closed-form (Dirichlet-multinomial) expression after
    integrating out θ under the Dirichlet prior
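A sketch of that marginal segment likelihood, assuming it takes the standard Dirichlet-multinomial form obtained by integrating the multinomial parameters against a Dirichlet(α) prior and treating positions within a segment as exchangeable (a simplification); the symmetric α in the example is an arbitrary choice:

import numpy as np
from scipy.special import gammaln

def segment_log_likelihood(counts, alpha):
    """Dirichlet-multinomial marginal log-likelihood of a segment.

    counts: 20-vector of amino-acid occurrence counts summed over the
            positions of the segment (the column sums of Oi in the segment).
    alpha:  20-vector of Dirichlet hyperparameters (may depend on segment type).
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# Example: counts from a short toy segment, symmetric Dirichlet prior.
counts = np.zeros(20); counts[[17, 16, 15]] = [3, 2, 1]   # e.g. V, T, S occurrences
print(segment_log_likelihood(counts, np.full(20, 0.5)))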

16
Inference (i.e. classification)
  • Bayes rule gives the posterior P(m, e, T | O)
  • MAP estimate of the segmentation (computed with a
    Viterbi-style dynamic program)
  • Marginal posterior mode estimate (computed with
    forward-backward)
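A sketch of the MAP computation as a semi-Markov Viterbi recursion: for every position i and segment type, take the best combination of segment length and previous type. This simplified version uses only the type-transition prior explicitly and assumes any segment-length prior is folded into the segment likelihood; the maximum segment length and all names are illustrative, and the usage comment reuses the TRANS matrix and segment_log_likelihood sketched above:

import numpy as np

def map_segmentation(O, types, log_trans, log_init, seg_loglik, max_len=30):
    """Semi-Markov Viterbi: find (m, e, T) maximising
    sum_j [ log P(Tj | Tj-1) + seg_loglik(counts of segment j, Tj) ]."""
    n, K = len(O), len(types)
    delta = np.full((n + 1, K), -np.inf)   # delta[i, k]: best score for O[:i], last segment of type k
    back = {}                              # (i, k) -> (segment length, previous type index or None)
    for i in range(1, n + 1):
        for k in range(K):
            for l in range(1, min(max_len, i) + 1):
                emit = seg_loglik(O[i - l:i].sum(axis=0), k)
                if i - l == 0:
                    score, prev = log_init[k] + emit, None
                else:
                    j = int(np.argmax(delta[i - l] + log_trans[:, k]))
                    score, prev = delta[i - l, j] + log_trans[j, k] + emit, j
                if score > delta[i, k]:
                    delta[i, k] = score
                    back[(i, k)] = (l, prev)
    # Trace back the best segmentation from the end of the sequence.
    k = int(np.argmax(delta[n]))
    T, ends, i = [], [], n
    while i > 0:
        l, prev = back[(i, k)]
        T.append(types[k]); ends.append(i)
        i -= l
        if prev is not None:
            k = prev
    return list(reversed(T)), list(reversed(ends))

# Usage sketch: map_segmentation(O, ["H", "E", "L"], np.log(TRANS), np.log([1/3] * 3),
#     lambda counts, k: segment_log_likelihood(counts, np.full(20, 0.5)))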

17
Long-range interactions
18
  • Long range interactions are described by a set of r
    interacting pairs, where each element is a pair of
    interacting segments together with their alignment
    information
  • Incorporated into the model through
  • an extended prior over segmentations and interactions
  • a conditional distribution for the aligned residues
  • the final segmental likelihood for strands, which
    includes the interaction terms
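A rough sketch of how a pairwise term for two aligned beta-strand segments might enter the strand likelihood as an additive, weighted log-potential. The random pair-score table, the weight parameter, and the alignment-offset representation are purely illustrative assumptions, not the paper's parameterisation:

import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"
RNG = np.random.default_rng(0)
# Illustrative symmetric table of pairwise log-scores for aligned residue pairs.
PAIR_LOGSCORE = RNG.normal(scale=0.1, size=(20, 20))
PAIR_LOGSCORE = (PAIR_LOGSCORE + PAIR_LOGSCORE.T) / 2

def interaction_log_potential(strand_a, strand_b, offset, weight=1.0):
    """Additive log-potential for two interacting strands aligned at a given offset.

    strand_a, strand_b : amino acid strings of the two beta-strand segments
    offset             : alignment offset of strand_b relative to strand_a
    weight             : weight parameter for the long-range term
    """
    idx = {a: i for i, a in enumerate(AMINO)}
    total = 0.0
    for i, res_a in enumerate(strand_a):
        j = i + offset
        if 0 <= j < len(strand_b) and res_a in idx and strand_b[j] in idx:
            total += PAIR_LOGSCORE[idx[res_a], idx[strand_b[j]]]
    return weight * total

# The strand's segmental log-likelihood would then gain this term for each
# interacting partner, e.g.: seg_loglik + interaction_log_potential(a, b, 0)
print(interaction_log_potential("TSYTLSDV", "VRIGFSAT", offset=0))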

19
Parameter estimation and inference
  • Discrete parameters: segment lengths, occurrence
    counts, state transition probabilities
  • Estimated directly from the data
  • Weight parameters for the neighboring and
    long-distance dependencies
  • Estimated using a MAP estimate
  • Uses a variational method (!!!)
  • Markov chain Monte Carlo is used for inference
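A minimal sketch of MCMC over segmentations using a Metropolis-Hastings move that shifts one segment boundary by a single residue; the proposal, and keeping the segment types fixed, are simplifications for illustration rather than the samplers used in the paper:

import math
import random

def mcmc_segmentations(init_types, init_ends, log_posterior, n_iter=1000, seed=0):
    """Metropolis-Hastings over segmentations with a simple boundary-shift move.

    log_posterior(types, ends) should return log P(m, e, T | O) up to a constant,
    e.g. the log prior plus the sum of segment log-likelihoods.
    """
    rng = random.Random(seed)
    types, ends = list(init_types), list(init_ends)
    current = log_posterior(types, ends)
    samples = []
    for _ in range(n_iter):
        # Propose shifting one internal boundary left or right by one residue.
        j = rng.randrange(len(ends) - 1)            # boundary between segments j and j+1
        proposal = list(ends)
        proposal[j] += rng.choice([-1, 1])
        lo = ends[j - 1] if j > 0 else 0
        if lo < proposal[j] < ends[j + 1]:          # keep both segments non-empty
            cand = log_posterior(types, proposal)
            if math.log(rng.random()) < cand - current:
                ends, current = proposal, cand      # accept the move
        samples.append((list(types), list(ends)))
    return samples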

20
Experiments and Results
  • Dataset: CB513
  • 513 non-homologous protein chains with solved
    structure
  • Sequences shorter than 30 or longer than 550
    residues were removed
  • 7-fold cross-validation (see the sketch below)
  • Three types of features
  • Sequence only
  • Multiple sequence alignment profile
  • PSSM
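A sketch of the length filtering and 7-fold split described above; the chain representation is a placeholder and CB513 itself is not included:

import random

def filter_and_split(chains, n_folds=7, min_len=30, max_len=550, seed=0):
    """Drop chains outside [min_len, max_len] and assign the remainder to
    n_folds cross-validation folds at random.

    chains: list of (sequence, labels) tuples, e.g. loaded from CB513.
    """
    kept = [c for c in chains if min_len <= len(c[0]) <= max_len]
    rng = random.Random(seed)
    rng.shuffle(kept)
    return [kept[i::n_folds] for i in range(n_folds)]

def cross_validation_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [c for j, f in enumerate(folds) if j != i for c in f]
        yield train, test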

21
Experiments and Results
Window-based Naïve Bayes approach: 60.5% accuracy
22
Predicting long range interactions
  • A subset of 153 proteins with long-range contact
    maps between beta-strands
  • Good performance in predicting the contact map
    between a pair of beta-strands (AUC 0.89)
  • However, there is no clear difference between using
    MCMC with long-range interactions and exact
    inference without such interactions

23
Discussion
  • The model achieves performance comparable to the
    state of the art
  • Can be extended to other problems
  • Can model short-distance and long-distance
    interactions between segments
  • Using sequence-only features, the model achieves a
    5% improvement over Naïve Bayes