Evaluation of Techniques for Classifying Biological Sequences

1
Evaluation of Techniques for Classifying
Biological Sequences
  • Authors: Mukund Deshpande and George Karypis
  • Speaker: Sarah Chan
  • CSIS DB Seminar
  • May 31, 2002

2
Presentation Outline
  • Introduction
  • Traditional Approaches (kNN, Markov Models) to
    Sequence Classification
  • Feature Based Sequence Classification
  • Experimental Evaluation
  • Conclusions

3
Introduction
  • The number of biological sequences available in
    public databases is increasing exponentially
  • GenBank: 16 billion DNA base-pairs
  • PIR: over 230,000 protein sequences
  • Strong sequence similarity often translates to
    functional and structural relations
  • Classification algorithms applied to sequence
    data can be used to gain valuable insights into
    the functions and relations of sequences
  • E.g. to assign a protein sequence to a protein
    family

4
Introduction
  • K-nearest neighbor, Markov models and Hidden
    Markov models have been extensively used
  • They take into account the sequential
    constraints present in the datasets
  • Motivation: few attempts have been made to use
    traditional machine learning classification
    algorithms such as decision trees and support
    vector machines
  • These were thought to be unable to model the
    sequential nature of the datasets

5
Focus of This Paper
  • To evaluate some widely used sequence
    classification algorithms:
  • K-nearest neighbor
  • Markov models
  • To develop a framework to model sequences such
    that traditional machine learning algorithms can
    be easily applied
  • Represent each sequence as a vector in a derived
    feature space, and then use SVMs to build a
    sequence classifier

6
Problem Definition: Sequence Classification
  • A sequence Sr = (x1, x2, x3, ..., xl) is an
    ordered list of symbols
  • The alphabet Σ of symbols is known in advance
    and is of fixed size N
  • Each sequence Sr has a class label Cr
  • Assumption: two class labels only (C+, C-)
  • Goal: to correctly assign a class label to a
    test sequence

7
Approach 1: K Nearest Neighbor (KNN) Classifiers
  • To classify a test sequence Sr:
  • Locate the K training sequences most similar to
    Sr
  • Assign to Sr the class label which occurs most
    frequently among those K sequences
  • Key task: to compute the similarity between two
    sequences (a sketch of the procedure follows)
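
A minimal sketch of this procedure (in Python; the names and the
(sequence, label) training-data layout are illustrative, not from the
paper), with `similarity` standing in for any pairwise scoring
function such as the alignment scores on the next slides:

    from collections import Counter

    def knn_classify(test_seq, training_data, k, similarity):
        # Rank the training sequences by similarity to the test
        # sequence and keep the K most similar ones.
        neighbours = sorted(training_data,
                            key=lambda item: similarity(test_seq, item[0]),
                            reverse=True)[:k]
        # Majority vote over the labels of the K nearest neighbours.
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]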

8
Approach 1: K Nearest Neighbor (KNN) Classifiers
  • Alignment score as the similarity function:
  • Compute an optimal alignment between the two
    sequences (by dynamic programming, hence
    computationally expensive), and then
  • Score this alignment: the score is a function of
    the no. of matched and unmatched symbols in the
    alignment

9
Approach 1: K Nearest Neighbor (KNN) Classifiers
  • Two variations:
  • Global alignment score (sketched after this
    list)
  • Aligns the sequences across their entire length
  • Can capture position-specific patterns
  • Needs to be normalized due to varying sequence
    lengths
  • Local alignment score
  • Only portions of the two sequences are aligned
  • Can capture small substrings of symbols which
    are present in the two sequences but not
    necessarily at the same position
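
A sketch of a global alignment score computed with the standard
Needleman-Wunsch dynamic program; the match/mismatch/gap weights are
illustrative placeholders, not the scoring scheme used in the paper:

    def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
        # dp[i][j] = best score aligning the first i symbols of s
        # with the first j symbols of t.
        m, n = len(s), len(t)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = i * gap                       # s prefix vs. gaps
        for j in range(1, n + 1):
            dp[0][j] = j * gap                       # t prefix vs. gaps
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                step = match if s[i - 1] == t[j - 1] else mismatch
                dp[i][j] = max(dp[i - 1][j - 1] + step,  # (mis)match
                               dp[i - 1][j] + gap,       # gap in t
                               dp[i][j - 1] + gap)       # gap in s
        # Dividing by e.g. max(m, n) would give the length-normalized
        # variant mentioned above.
        return dp[m][n]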

10
Approach 2.1: Simple Markov Chain Classifiers
  • To build a simple Markov chain based
    classification model:
  • Partition the training sequences according to
    their class labels
  • Build a simple Markov chain (M) for each smaller
    dataset
  • To classify a test sequence Sr:
  • Compute the likelihood of Sr being generated by
    each Markov chain M, i.e. P(Sr | M)
  • Assign to Sr the class label associated with the
    Markov chain that gives the highest likelihood
    (a sketch follows)
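
Given one fitted chain per class, the decision rule is a one-liner; a
sketch assuming a hypothetical log_prob method (see the TPM sketch two
slides below), with log-likelihoods used to avoid numerical underflow:

    def classify(seq, models):
        # models: class label -> Markov chain fitted on that class.
        return max(models, key=lambda c: models[c].log_prob(seq))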

11
Approach 2.1: Simple Markov Chain Classifiers
  • Log-likelihood ratio (for two class problems):
    L(Sr) = log [ P(Sr | M+) / P(Sr | M-) ]
  • If L(Sr) ≥ 0, then Cr ← C+, else Cr ← C-
  • Markov principle (for a 1st order Markov chain):
    each symbol in a sequence depends only on its
    preceding symbol, so
    P(Sr | M) = P(x1) · P(x2 | x1) · ... · P(xl | xl-1)

12
Approach 2.1: Simple Markov Chain Classifiers
  • Transition probability for the pair (xi-1, xi):
    P(xi | xi-1)
  • Each symbol is associated with a state
  • A Transition Probability Matrix (TPM) is built
    for each class (estimation sketched below)
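
A minimal sketch of estimating a TPM for one class, assuming add-one
smoothing for unseen symbol pairs (a common choice; the paper's exact
estimator may differ):

    import math
    from collections import defaultdict

    class MarkovChain:
        def __init__(self, sequences, alphabet):
            # Count occurrences of each ordered symbol pair.
            counts = defaultdict(lambda: defaultdict(int))
            for seq in sequences:
                for prev, cur in zip(seq, seq[1:]):
                    counts[prev][cur] += 1
            # Row-normalise the counts into transition probabilities,
            # with add-one smoothing so unseen pairs keep a small
            # nonzero probability.
            self.tpm = {
                a: {b: (counts[a][b] + 1) /
                       (sum(counts[a].values()) + len(alphabet))
                    for b in alphabet}
                for a in alphabet
            }

        def log_prob(self, seq):
            # log P(seq | model) under the 1st order Markov
            # assumption; the initial term P(x1) is omitted for
            # brevity.
            return sum(math.log(self.tpm[prev][cur])
                       for prev, cur in zip(seq, seq[1:]))

Two such chains, fitted on the positive and negative training
sequences respectively, are what the classify sketch above compares.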

13
Approach 2.1: Simple Markov Chain Classifiers
  • Example

14
Approach 2.1: Simple Markov Chain Classifiers
  • Higher (kth) order Markov chains:
  • The transition probability for a symbol xl is
    computed by looking at its k preceding symbols
  • No. of states: N^k, each associated with a
    sequence of k symbols
  • Size of TPM: N^(k+1) entries (N^k rows x N
    columns)
  • Pros: better classification accuracy, since they
    capture longer ordering constraints
  • Cons: the no. of states grows exponentially with
    the order → many infrequent states → poor
    probability estimates (counting sketched below)
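
A sketch of the corresponding transition counting for a kth order
chain, where each state is a k-symbol context (illustrative helper,
not from the paper):

    from collections import defaultdict

    def kth_order_counts(sequences, k):
        # counts[(context, symbol)] = occurrences of symbol right
        # after the k-symbol context; up to N**k contexts x N symbols
        # = N**(k+1) entries in the TPM.
        counts = defaultdict(int)
        for seq in sequences:
            for i in range(k, len(seq)):
                context = tuple(seq[i - k:i])  # the k preceding symbols
                counts[(context, seq[i])] += 1
        return counts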

15
Approach 2.2: Interpolated Markov Models (IMM)
  • Build a series of Markov chains, starting from
    the 0th order up to the kth order
  • Transition probability for a symbol:
  • P(xi | xi-1, xi-2, ..., x1, IMMk) = weighted sum
    of the transition probabilities of the chains of
    different orders, from the 0th up to the kth
    (interpolation sketched below)
  • Weights: often based on the distribution of the
    different states in the various order Markov
    models
  • The right weighting method appears to be dataset
    dependent
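
A minimal sketch of the interpolation, assuming a hypothetical
chains[j].prob(context, symbol) interface for the order-j chain and
weights that sum to 1:

    def imm_prob(context, symbol, chains, weights):
        # Weighted sum of the order-0 .. order-k estimates; the
        # order-j chain conditions on only the last j context symbols.
        return sum(w * chains[j].prob(context[len(context) - j:], symbol)
                   for j, w in enumerate(weights))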

16
Approach 2.3: Selective Markov Models (SMM)
  • Build Markov chains of various orders
  • Prune non-discriminatory states from the higher
    order chains (explained on the next slide)
  • Conditional probability: P(xi | xi-1, xi-2, ...,
    x1, SMMk) is the probability given by the
    highest order chain among the remaining
    (non-pruned) states

17
Approach 2.3: Selective Markov Models (SMM)
  • Key task: to decide which states are
    non-discriminatory
  • Simplest way: use a frequency threshold and
    prune all states which occur less often than it
  • Method used in the experiments (pruning sketched
    below):
  • Specify the frequency threshold as a parameter φ
  • A state-transition pair is kept only if it
    occurs φ times more frequently than its expected
    frequency under a uniform distribution
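
A sketch of this pruning rule applied to the kth order counts from
the earlier sketch; the uniform-distribution baseline follows the
slide, the rest is illustrative:

    def prune_states(counts, alphabet, k, phi):
        # Expected count of each (context, symbol) pair if all
        # N**(k+1) possible pairs were equally likely.
        expected = sum(counts.values()) / (len(alphabet) ** (k + 1))
        # Keep a state-transition pair only if it occurs at least phi
        # times more often than that uniform expectation.
        return {pair: c for pair, c in counts.items()
                if c >= phi * expected}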

18
Approach 3: Feature Based Sequence Classification
  • Sequences are modeled into a form that can be
    used by traditional machine learning algorithms
  • Features are extracted that take the sequential
    nature of the sequences into account
  • Motivated by Markov models, support vector
    machines (SVMs) are used as the classifier

19
Approach 3: Feature Based Sequence Classification
  • SVM:
  • A relatively new learning algorithm, introduced
    by Vapnik (1995)
  • Objective: given a training set in a vector
    space, find the best hyperplane (the one with
    max. margin) that separates the two classes
  • Approach: formulate a constrained optimization
    problem, then solve it using constrained
    quadratic programming (QP)
  • Well suited to high dimensional data
  • Requires lots of memory and CPU time

20
Approach 3: Feature Based Sequence Classification
  • SVM: maximum margin

(a) A separating hyperplane with a small
margin. (b) A separating hyperplane with a larger
margin. A better generalization is expected from
(b).
21
Approach 3: Feature Based Sequence Classification
  • SVM: feature space mapping

Mapping data into a higher dimensional feature
space (by using kernel functions) where they are
linearly separable.
22
Approach 3: Feature Based Sequence Classification
  • Vector space view
  • The log-likelihood ratio of a simple 1st order
    Markov chain, L(Sr), is equivalent to the dot
    product u^T w
  • u and w are of length N^2; each dimension
    corresponds to a unique ordered pair of symbols
  • Element of u: frequency of a particular symbol
    pair in the sequence
  • Element of w: log-ratio of the conditional
    probabilities for the + and - classes (mapping
    sketched below)
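
A sketch of this mapping for the 1st order case; the dimension
ordering is arbitrary but must be fixed across all sequences:

    from itertools import product

    def feature_vector(seq, alphabet):
        # One dimension per ordered symbol pair (N**2 in total),
        # holding that pair's frequency in the sequence.
        dims = list(product(alphabet, repeat=2))
        pair_counts = {d: 0 for d in dims}
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
        return [pair_counts[d] for d in dims]

With w holding the per-pair log-ratios in the same dimension order,
the dot product of u and w reproduces L(Sr).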

23
Approach 3: Feature Based Sequence Classification
  • Vector space view: example
  • (simple 1st order Markov chain)

24
Approach 3: Feature Based Sequence Classification
  • Vector space view
  • All the variants of Markov chains described
    previously can be transformed in a similar
    manner
  • Dimensionality of the new space:
  • For higher order Markov chains: N^(k+1)
  • For IMM: N + N^2 + ... + N^(k+1)
  • For SMM: the no. of non-pruned states
  • Each sequence is viewed as a frequency vector
  • This allows the use of any traditional
    classifier that operates on objects represented
    as multi-dimensional vectors (a sketch follows)
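
A sketch of the resulting pipeline, using scikit-learn's LinearSVC as
a modern stand-in for the SVM implementation used in the paper:

    from sklearn.svm import LinearSVC

    def train_sequence_svm(train_seqs, train_labels, alphabet):
        # Map every sequence into the derived feature space (the
        # feature_vector sketch above), then fit a linear SVM on the
        # resulting frequency vectors.
        X = [feature_vector(s, alphabet) for s in train_seqs]
        return LinearSVC().fit(X, train_labels)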

25
Experimental Evaluation
  • 5 different datasets, each with 2-3 classes

Table 1
26
Experimental Evaluation
  • Methodology:
  • Performance of the algorithms was measured using
    classification accuracy
  • Ten-way cross validation was used (sketched
    below)
  • Experiments were restricted to two class
    problems
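
A sketch of the ten-way cross validation loop, with hypothetical
train_fn/predict_fn hooks standing in for any of the classifiers
above:

    def cross_validated_accuracy(seqs, labels, train_fn, predict_fn,
                                 folds=10):
        accuracies = []
        for f in range(folds):
            # Hold out every folds-th sequence as the test fold.
            test_idx = set(range(f, len(seqs), folds))
            train = [(s, l) for i, (s, l) in enumerate(zip(seqs, labels))
                     if i not in test_idx]
            test = [(s, l) for i, (s, l) in enumerate(zip(seqs, labels))
                    if i in test_idx]
            model = train_fn(train)
            correct = sum(predict_fn(model, s) == l for s, l in test)
            accuracies.append(correct / len(test))
        # Accuracy averaged over the folds.
        return sum(accuracies) / folds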

27
KNN Classifiers
Table 2
  • Cosine:
  • Sequence → frequency vector of the different
    symbols in it
  • Similarity b/w sequences: cosine of the two
    vectors (sketched below)
  • Does not take sequential constraints into
    account
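
A sketch of this similarity measure; note it uses per-symbol
frequencies only, so all ordering information is discarded:

    import math
    from collections import Counter

    def cosine_similarity(seq_a, seq_b):
        ca, cb = Counter(seq_a), Counter(seq_b)
        # Dot product over the symbols in seq_a (cb[s] is 0 for
        # symbols absent from seq_b).
        dot = sum(ca[s] * cb[s] for s in ca)
        norm = (math.sqrt(sum(v * v for v in ca.values())) *
                math.sqrt(sum(v * v for v in cb.values())))
        return dot / norm if norm else 0.0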

28
KNN Classifiers
Table 2
  • 1. Global outperforms the other two for all K
  • 2. For PS-HT and PS-TS, the performance of
    Cosine is comparable to that of Global, as only
    limited sequential info. can be exploited

29
KNN Classifiers
Table 2
  • 3. Local performs very poorly, esp. on protein
    sequences
  • → It is not good to base classification only on
    a single substring
  • 4. Accuracy decreases as K increases

30
Simple Markov Chains vs. Their Feature Spaces
Table 3
  • 1. Accuracy improves with the order of each
    model
  • Only exceptions: for PS-, accuracy peaks at the
    2nd/1st order; the sequences are very short, so
    higher order models and their feature spaces
    contain very few examples for calculating
    transition probabilities

31
Simple Markov Chains vs. Their Feature Spaces
Table 3
  • 2. SVM achieves higher accuracies than simple
    Markov chains (often a 5-10% improvement)

32
IMM vs. Their Feature Spaces
Table 4
  • 1. SVM achieves higher accuracies than IMM for
    most datasets
  • Exceptions: for P-, the higher order IMM models
    do considerably better (no explanation provided)

33
IMM vs. Their Feature Spaces
Table 4
  • 2. Simple Markov chain based classifiers usually
    outperform IMM
  • Only exceptions: PS-, since the sequences are
    comparatively short → greater benefit in using
    different order Markov states

34
IMM Based Classifiers vs. Simple Markov Chain
Based Classifiers
Table 4: IMM Based
Part of Table 3: Simple Markov Chain Based
35
SMM vs. Their Feature Spaces
Table 5a
  • φ: the parameter (frequency threshold) used in
    pruning states of the different order Markov
    chains

36
Table 5b
Table 5c
37
SMM vs. Their Feature Spaces
  • 1. SVM usually achieves higher accuracies than
    SMM
  • 2. For many problems SMM achieves higher
    accuracy as φ increases, but the gains are
    rather small
  • Maybe because the pruning strategy is too simple

38
Conclusions
  • 1. An SVM classifier used on the feature spaces
    of the different Markov chains (and their
    variants) achieves substantially better
    accuracies than the corresponding Markov chain
    classifier.
  • → The linear classification models learnt by
    SVM are better than those learnt by the Markov
    chain based approaches

39
Conclusions
  • 2. Proper feature selection can improve
    accuracy, but an increase in the amount of info.
    available does not necessarily guarantee it.
  • (Except for PS-) The max. accuracy attained by
    SVM on IMM's feature spaces is always lower than
    that attained on the feature spaces of the
    simple Markov chains.
  • Even with simple frequency based feature
    selection, as done in SMM, the overall accuracy
    is higher.

40
Conclusions
  • 3. KNN computed with global alignments can take
    advantage of the relative positions of symbols
    in the aligned sequences
  • Simple experiment: an SVM incorporating info.
    about the position of symbols was able to
    achieve an accuracy > 97%
  • → Position specific info. can be useful for
    building effective classifiers for biological
    sequences

41
References
  • Mukund Deshpande and George Karypis. Evaluation
    of Techniques for Classifying Biological
    Sequences. In Proceedings of the 6th
    Pacific-Asia Conference on Knowledge Discovery
    and Data Mining (PAKDD), 2002.
  • Ming-Hsuan Yang. Presentation entitled "Gentle
    Guide to Support Vector Machines".
  • Alexander Johannes Smola. Presentation entitled
    "Support Vector Learning: Concepts and
    Algorithms".