1
Protein Family Classification using Sparse Markov
Transducers
  • E. Eskin, W.N. Grundy, and Y. Singer
  • ISMB '00
  • Talk by Cho, Dong-Yeon

2
Abstract
  • Classifying proteins into families using sparse
    Markov transducers (SMTs)
  • Estimation of a probability distribution
    conditioned on an input sequence
  • Similar to probabilistic suffix trees (PSTs)
  • Allowing for wild-cards
  • Two models
  • Efficient data structures

3
Introduction
  • Protein Classification
  • Pairwise similarity
  • Creating profiles for protein families
  • Consensus patterns using motifs
  • HMM-based approaches
  • Probabilistic suffix trees (PSTs)
  • A PST is a model that predicts the next symbol in
    a sequence based on the previous symbols.
  • This approach is based on the presence of common
    short sequences (motifs) throughout the protein
    family.
  • One drawback of PSTs is that they rely on exact
    matches to the conditioning sequences (e.g., the
    3-hydroxyacyl-CoA dehydrogenase subsequences below;
    a sketch of this exact-match prediction follows the example).

VAVIGSGT
VGVLGLGT
V?V?G?GT (with wild cards)
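A minimal sketch, assuming a simple dictionary-of-counts layout (not the paper's implementation), of how a PST-style predictor conditions on exact suffix matches of the preceding symbols. The two sequences are the example subsequences above; everything else is illustrative.

from collections import defaultdict

# Illustrative PST-like predictor keyed on exact suffixes of the context.
# Counts of the next symbol are stored for every suffix up to max_depth;
# prediction backs off to the longest suffix that was seen in training.

def train_pst(sequences, max_depth=3):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(1, len(seq)):
            for d in range(1, min(max_depth, i) + 1):
                suffix = seq[i - d:i]          # exact conditioning context
                counts[suffix][seq[i]] += 1    # count of the next symbol
    return counts

def predict_next(counts, context, max_depth=3):
    # Back off to shorter suffixes until an exact match is found.
    for d in range(min(max_depth, len(context)), 0, -1):
        suffix = context[-d:]
        if suffix in counts:
            total = sum(counts[suffix].values())
            return {sym: c / total for sym, c in counts[suffix].items()}
    return {}

if __name__ == "__main__":
    model = train_pst(["VAVIGSGT", "VGVLGLGT"])
    print(predict_next(model, "VAVIG"))  # uses the longest exactly matching suffix ("VIG")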
4
  • Sparse Markov Transducers (SMTs)
  • A generalization of PSTs
  • An SMT can condition the probability model on a
    sequence that contains wild-cards.
  • In a transducer, the input symbol alphabet and
    output symbol alphabet can be different.
  • Two methods
  • One prediction model per family (output a single amino acid)
  • One classifier model for the entire database (output the protein family)
  • Efficient data structure
  • Experiments
  • Pfam database of protein families

5
Sparse Markov Transducers
  • A Markov Transducer of Order L
  • Conditional probability distribution of the output
    given the L most recent input symbols:
    P(Yk | Xk Xk-1 ... Xk-L+1)
  • Xk are random variables over an input alphabet
  • Yk is a random variable over an output alphabet
  • Sparse Markov Transducer
  • Conditional probability distribution whose conditioning
    sequence may contain wild cards:
    P(Yk | φ^n1 Xk1 φ^n2 Xk2 ... φ^nj Xkj)
  • φ (phi) is the wild card; φ^n stands for n consecutive
    wild-card positions (a small matching example follows this list)
  • Two approaches for SMT-based protein
    classification
  • A prediction model for each family (output a single
    amino acid)
  • A single model for the entire database (output the
    protein family)
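A minimal illustration of what conditioning on a sequence with wild cards buys: the pattern below, written with "?" standing in for φ, matches both example subsequences from the introduction, so they can share one conditional probability estimate instead of requiring exact matches. The pattern notation and function are assumptions of this sketch, not the paper's data structure.

# Illustrative wild-card matching: "?" matches any single amino acid,
# all other positions must match exactly.

def matches(pattern, subsequence):
    return len(pattern) == len(subsequence) and all(
        p == "?" or p == s for p, s in zip(pattern, subsequence))

if __name__ == "__main__":
    print(matches("V?V?G?GT", "VAVIGSGT"))  # True
    print(matches("V?V?G?GT", "VGVLGLGT"))  # True
    print(matches("V?V?G?GT", "VAVIGSAT"))  # False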

6
  • Sparse Markov Trees
  • Representationally equivalent to SMTs
  • The topology of a tree encodes the positions of
    the wild-cards in the conditioning sequence of
    the probability distribution (see the routing sketch below).
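A rough sketch of how a fixed topology routes a conditioning sequence to a leaf: contexts that differ only at the wild-card positions reach the same leaf and therefore share a predictor. The leaf_key function, the set of wild-card positions, and the tuple keys are illustrative assumptions, not the paper's representation.

# Illustrative routing under one fixed topology: wild-card positions are
# skipped, so contexts differing only there share a leaf (and a predictor).

def leaf_key(context, wildcard_positions, depth):
    """Return the leaf identifier for `context` under a fixed topology.

    context            -- most recent input symbols, newest first
    wildcard_positions -- 0-based positions treated as wild cards
    depth              -- number of conditioning positions (tree depth)
    """
    key = []
    for pos in range(depth):
        key.append("_" if pos in wildcard_positions else context[pos])
    return tuple(key)

if __name__ == "__main__":
    # With position 1 a wild card, AVI and AGI reach the same leaf.
    print(leaf_key("AVI", {1}, 3))  # ('A', '_', 'I')
    print(leaf_key("AGI", {1}, 3))  # ('A', '_', 'I')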

7
  • Training a Prediction Tree
  • A set of training examples
  • The input symbols are used to identify which leaf
    node is associated with that training example.
  • The output symbol is then used to update the
    count of the appropriate predictor.
  • Each leaf has a predictor that keeps counts of the
    output symbols it has seen.
  • We smooth each count by adding a constant value
    to the count of every output symbol (cf. a Dirichlet
    prior); a sketch follows the example below.

Example training pairs (input, output): (DACDADDDCAA, C), (CAAAACAD, D), (AACCAAA, ?);
smoothed prediction at the leaf: C → 0.5, D → 0.5
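A small sketch of the leaf update just described: per-symbol counts plus a constant pseudocount added to every output symbol. The class layout and the pseudocount value 0.5 are assumptions of this sketch, not values taken from the paper.

from collections import defaultdict

# Illustrative leaf predictor: counts per output symbol, smoothed with a
# constant pseudocount (a Dirichlet-style prior over the output alphabet).

OUTPUT_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 amino acids
PSEUDOCOUNT = 0.5                          # assumed smoothing constant

class LeafPredictor:
    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, output_symbol):
        self.counts[output_symbol] += 1

    def probability(self, output_symbol):
        total = sum(self.counts.values()) + PSEUDOCOUNT * len(OUTPUT_ALPHABET)
        return (self.counts[output_symbol] + PSEUDOCOUNT) / total

if __name__ == "__main__":
    leaf = LeafPredictor()
    for y in "CCD":
        leaf.update(y)
    print(round(leaf.probability("C"), 3))  # seen twice, highest probability
    print(round(leaf.probability("W"), 3))  # unseen, but non-zero after smoothing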
8
  • Mixture of Sparse Prediction Trees
  • We do not know which tree topology can best
    estimate the distribution.
  • A mixture technique employs a weighted sum of trees
    as the predictor.
  • The weight of each tree is updated for each input
    string in the data set based on how well the tree
    performed at predicting the output.
  • The prior probability of a tree is defined by the
    topology of the tree (a weight-update sketch follows this list).
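A sketch of one plausible Bayesian-style update consistent with the description above: each tree's weight starts at its prior and is multiplied by the probability the tree assigned to the observed output, so trees that predict well come to dominate the weighted sum. The toy trees, priors, and function names are assumptions.

# Illustrative mixture of prediction trees: weights start at the priors and
# are multiplied by each tree's probability for the observed output.

def mixture_predict(trees, weights, context, symbol):
    # Weighted sum of the per-tree predictions for one output symbol.
    total = sum(weights)
    return sum(w * t(context, symbol) for t, w in zip(trees, weights)) / total

def mixture_update(trees, weights, context, observed):
    # Scale each weight by how well that tree predicted the observed output.
    return [w * t(context, observed) for t, w in zip(trees, weights)]

if __name__ == "__main__":
    # Two toy "trees": callables returning P(output | context).
    tree_a = lambda ctx, y: 0.9 if y == "C" else 0.1
    tree_b = lambda ctx, y: 0.5
    weights = [0.5, 0.5]                      # priors from the topologies
    for y in "CCC":                           # tree_a keeps predicting well
        weights = mixture_update([tree_a, tree_b], weights, "AA", y)
    print(mixture_predict([tree_a, tree_b], weights, "AA", "C"))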

9
  • Implementation of SMTs
  • Two important parameters
  • MAX_DEPTH: the maximum depth of the tree
  • MAX_PHI: the maximum number of wild-cards at
    every node

Ten trees in the mixture if MAX_DEPTH = 2 and
MAX_PHI = 1
10
  • Template tree
  • We only store those nodes that are reached
    during training.

Example: only the nodes reached by the inputs AA, AC, and CD are stored
11
Efficient Data Structures
  • Performance of the SMT typically improves with
    higher MAX_PHI and MAX_DEPTH.
  • Memory usage becomes the bottleneck because it
    restricts these parameters to values that allow
    the tree to fit in memory.

12
  • Lazy Evaluation
  • We store the tails of the training sequences and
    recompute parts of the tree on demand when
    necessary (an expansion sketch follows the example below).
  • EXPAND_SEQUENCE_COUNT = 4

Example stored tails (output symbol in parentheses): ACDACAC(D);
ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D)
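A sketch of the lazy-evaluation idea under the stated threshold: a node stores unexpanded (input-tail, output) pairs and is only expanded into children, by replaying the stored tails, once it has seen more than EXPAND_SEQUENCE_COUNT examples. The LazyNode layout and the exact expansion rule are assumptions about details the slide leaves open.

# Illustrative lazy node: children are built only when enough training
# examples have reached the node; output counting at the leaves is omitted.

EXPAND_SEQUENCE_COUNT = 4   # threshold from the slide

class LazyNode:
    def __init__(self):
        self.pending = []     # stored (tail, output) pairs
        self.children = None  # created only on expansion

    def add(self, tail, output):
        if self.children is None:
            self.pending.append((tail, output))
            if len(self.pending) > EXPAND_SEQUENCE_COUNT:
                self._expand()
        else:
            self._route(tail, output)

    def _expand(self):
        self.children = {}
        for tail, output in self.pending:
            self._route(tail, output)
        self.pending = []

    def _route(self, tail, output):
        if not tail:
            return
        child = self.children.setdefault(tail[0], LazyNode())
        child.add(tail[1:], output)

if __name__ == "__main__":
    root = LazyNode()
    for tail, out in [("ACDACAC", "D"), ("ACDACAC", "A"), ("DACADAC", "C"),
                      ("DACAAAC", "D"), ("ACACDAC", "A"), ("ADCADAC", "D")]:
        root.add(tail, out)
    print(sorted(root.children))  # expanded after the fifth example: ['A', 'D']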
13
Methodology
  • Data
  • Two versions of the Pfam database
  • Version 1.0 for comparing results with previous
    work
  • Version 5.2, the latest version
  • 175 protein families
  • A total of 15,610 single-domain protein sequences
    containing a total of 3,560,959 residues
  • Training and test data split with a ratio of 4:1 for
    each family
  • e.g., transmembrane receptor: 530 protein sequences
    (424 training, 106 test)
  • The 424 sequences of the training set give 108,858
    subsequences that are used to train the model.

14
  • Building SMT Prediction Models
  • A prediction model for each protein family
  • A sliding window of size 11
  • Prediction of the middle symbol a6 using the
    neighboring symbols
  • The input symbols are a5, a7, a4, a8, a3, a9, a2, a10,
    a1, a11 (closest neighbors first); see the windowing
    sketch after this list.
  • MAX_DEPTH = 7 and MAX_PHI = 1
  • Classification of a Sequence using an SMT
    Prediction Model
  • Computation of the likelihood for an unknown
    sequence
  • A sequence is classified into a family by
    computing the likelihood of the fit for each of
    the 175 models.
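A sketch of the windowing just described, assuming 0-based indexing: each length-11 window yields one training example whose output is the middle symbol (a6) and whose input is the ten neighbors reordered with the closest first. Only the window size and the symbol ordering come from the slide; the function name and example sequence are illustrative.

# Illustrative window generator for the per-family prediction model.

WINDOW = 11
MIDDLE = WINDOW // 2                      # 0-based index 5 corresponds to a6
ORDER = [4, 6, 3, 7, 2, 8, 1, 9, 0, 10]   # 0-based indices of a5, a7, a4, a8, ...

def window_examples(sequence):
    for start in range(len(sequence) - WINDOW + 1):
        w = sequence[start:start + WINDOW]
        inputs = "".join(w[i] for i in ORDER)
        yield inputs, w[MIDDLE]           # (conditioning sequence, output symbol)

if __name__ == "__main__":
    for inputs, output in window_examples("MKVLAAGSGTSKV"):
        print(inputs, "->", output)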

15
  • Building the SMT Classifier Model
  • Estimation of the probability over protein
    families given a sequence of amino acids
  • Input sequence: an amino acid sequence from a
    protein family
  • Output symbol: the protein family name
  • A sliding window of 10 amino acids, a1, ..., a10
  • MAX_DEPTH = 5 and MAX_PHI = 1
  • Classification of a Sequence using an SMT
    Classifier
  • Each position of the sequence gives a probability
    over the 175 families, measuring how likely it is that
    the substring originated from each family (one way to
    combine these per-position probabilities is sketched below).
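A sketch of one way to turn the per-position family probabilities into a classification: sum the per-window log-probabilities for each family and pick the largest. The combination rule, function names, and toy model are assumptions; only the window size and the idea of per-window family probabilities come from the slides.

import math

# Illustrative classification with the SMT classifier model: each length-10
# window yields a distribution over families; log-probabilities are summed.

WINDOW = 10

def classify(sequence, window_family_probs, families):
    """window_family_probs(window) -> dict mapping family -> P(family | window)."""
    scores = {f: 0.0 for f in families}
    for start in range(len(sequence) - WINDOW + 1):
        probs = window_family_probs(sequence[start:start + WINDOW])
        for f in families:
            scores[f] += math.log(probs.get(f, 1e-12))
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    families = ["7tm_1", "globin"]          # two stand-in family names
    # Toy model: windows rich in 'L' lean towards the first family.
    toy = lambda w: {"7tm_1": 0.5 + 0.04 * w.count("L"),
                     "globin": 0.5 - 0.04 * w.count("L")}
    best, _ = classify("MLLLALLLLGVLLLAA", toy, families)
    print(best)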

16
  • Results
  • Time-Space-Performance tradeoffs

17
  • Results of Protein Classification using SMTs
  • The SMT models outperform the PST models.
  • SMT Classifier > SMT Prediction > PST Prediction

18
Discussion
  • Sparse Markov Transducers (SMTs)
  • We have presented two methods for protein
    classification using sparse Markov transducers
    (SMTs).
  • Future Work
  • Incorporating biological information, such as
    Dirichlet mixture priors, into the model
  • Combining a generative and discriminative model
  • Using both positive and negative examples in
    training