Title: Evaluation of Techniques for Classifying Biological Sequences
1. Evaluation of Techniques for Classifying Biological Sequences
- Authors: Mukund Deshpande and George Karypis
- Speaker: Sarah Chan
- CSIS DB Seminar
- May 31, 2002
2. Presentation Outline
- Introduction
- Traditional Approaches (kNN, Markov Models) to Sequence Classification
- Feature Based Sequence Classification
- Experimental Evaluation
- Conclusions
3. Introduction
- The amount of biological sequence data available in public databases is increasing exponentially
  - GenBank: 16 billion DNA base pairs
  - PIR: over 230,000 protein sequences
- Strong sequence similarity often translates to functional and structural relations
- Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences
  - E.g. to assign a protein sequence to a protein family
4. Introduction
- K-nearest neighbor, Markov models and Hidden Markov models have been used extensively
  - They take the sequential constraints present in the datasets into account
- Motivation: there have been few attempts to use traditional machine learning classification algorithms such as decision trees and support vector machines
  - These were thought to be unable to model the sequential nature of the datasets
5. Focus of This Paper
- To evaluate some widely used sequence classification algorithms
  - K-nearest neighbor
  - Markov models
- To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied
  - Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier
6. Problem Definition: Sequence Classification
- A sequence Sr = x1, x2, x3, ..., xl is an ordered list of symbols
- The alphabet Σ of symbols is known in advance and is of fixed size N
- Each sequence Sr has a class label Cr
- Assumption: two class labels only (C+, C−)
- Goal: to correctly assign a class label to a test sequence
7. Approach 1: K Nearest Neighbor (KNN) Classifiers
- To classify a test sequence Sr
  - Locate the K training sequences most similar to Sr
  - Assign to Sr the class label that occurs most often among those K sequences
- Key task: to compute the similarity between two sequences (a sketch of the decision rule follows)
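A minimal sketch of the kNN decision rule described above, in Python; the similarity function is passed in as a parameter (e.g. an alignment score), and the function and variable names are illustrative rather than taken from the paper.

from collections import Counter

def knn_classify(test_seq, training_data, similarity, k=3):
    """Assign test_seq the class label occurring most often among its k most
    similar training sequences. training_data is a list of (sequence, label)
    pairs and similarity is any callable returning a score (higher = closer)."""
    # Rank training sequences by similarity to the test sequence
    neighbours = sorted(training_data,
                        key=lambda pair: similarity(test_seq, pair[0]),
                        reverse=True)[:k]
    # Majority vote over the labels of the k nearest neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]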
8. Approach 1: K Nearest Neighbor (KNN) Classifiers
- Alignment score as the similarity function
  - Compute an optimal alignment between the two sequences (by dynamic programming, hence computationally expensive), and then
  - Score this alignment: the score is a function of the number of matched and unmatched symbols in the alignment
9. Approach 1: K Nearest Neighbor (KNN) Classifiers
- Two variations (a sketch of the global variant follows)
- Global alignment score
  - Align the sequences across their entire length
  - Can capture position-specific patterns
  - Needs to be normalized due to varying sequence lengths
- Local alignment score
  - Only portions of the two sequences are aligned
  - Can capture small substrings of symbols which are present in both sequences but not necessarily at the same position
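A compact sketch of a normalized global alignment score (Needleman-Wunsch style dynamic programming), assuming a simple match/mismatch/gap scoring scheme; the paper does not specify these parameters, and protein sequences would typically use a substitution matrix instead.

def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score, normalized by the longer length."""
    n, m = len(s), len(t)
    # dp[i][j] = best score for aligning s[:i] with t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    # Normalize, since raw global scores grow with sequence length
    return dp[n][m] / max(n, m, 1)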
10. Approach 2.1: Simple Markov Chain Classifiers
- To build a simple Markov chain based classification model
  - Partition the training sequences according to their class labels
  - Build a simple Markov chain (M) for each smaller dataset
- To classify a test sequence Sr
  - Compute the likelihood of Sr being generated by each Markov chain M, i.e. P(Sr | M)
  - Assign to Sr the class label associated with the Markov chain that gives the highest likelihood
11. Approach 2.1: Simple Markov Chain Classifiers
- Log-likelihood ratio (for two class problems)
  - If L(Sr) ≥ 0, then Cr = C+, else Cr = C−
- Markov principle (for a 1st order Markov chain)
  - Each symbol in a sequence depends only on its preceding symbol, so the likelihood factors as shown below
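The formulas on this slide were figures in the original deck; the standard forms consistent with the definitions above (M+ and M− denoting the per-class Markov chains) are:

P(S_r \mid M) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})

L(S_r) = \log \frac{P(S_r \mid M^{+})}{P(S_r \mid M^{-})}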
12. Approach 2.1: Simple Markov Chain Classifiers
- Transition probability: P(xi | xi−1), the probability of observing symbol xi given that the preceding symbol is xi−1
- Each symbol is associated with a state
- A Transition Probability Matrix (TPM) is built for each class (a sketch follows)
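A minimal sketch of the whole first-order Markov chain classifier described on slides 10-12, in Python; the Laplace smoothing and all names are illustrative assumptions, not details taken from the paper.

import math

def build_tpm(sequences, alphabet, pseudocount=1.0):
    """Estimate a first-order Transition Probability Matrix from training sequences.
    tpm[a][b] = P(next symbol is b | current symbol is a); a pseudocount avoids zeros."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
            for a in alphabet}

def log_likelihood_ratio(seq, tpm_pos, tpm_neg):
    """L(S) = log P(S | M+) - log P(S | M-) under the first-order Markov assumption
    (sequences are assumed to use only symbols from the alphabet)."""
    return sum(math.log(tpm_pos[p][c]) - math.log(tpm_neg[p][c])
               for p, c in zip(seq, seq[1:]))

def classify(seq, tpm_pos, tpm_neg):
    """Positive class if the log-likelihood ratio is non-negative, negative otherwise."""
    return '+' if log_likelihood_ratio(seq, tpm_pos, tpm_neg) >= 0 else '-'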
13. Approach 2.1: Simple Markov Chain Classifiers
14. Approach 2.1: Simple Markov Chain Classifiers
- Higher (kth) order Markov chains (in symbols below)
  - The transition probability for a symbol is computed by looking at its k preceding symbols
  - No. of states: N^k, each associated with a sequence of k symbols
  - Size of the TPM: N^(k+1) (N^k rows × N columns)
- Pros: better classification accuracy, since they capture longer ordering constraints
- Cons: the number of states grows exponentially with the order → many infrequent states → poor probability estimates
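In symbols, the kth-order transition probability and the resulting model size, as stated on the slide:

P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}), \qquad
\text{states} = N^{k}, \qquad
\text{TPM size} = N^{k} \times N = N^{k+1}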
15. Approach 2.2: Interpolated Markov Models (IMM)
- Build a series of Markov chains, from the 0th order up to the kth order
- Transition probability for a symbol
  - P(xi | xi−1, xi−2, ..., x1; IMMk) = the sum of the weighted transition probabilities of the different order chains, from the 0th order up to the kth order (written out below)
- Weights: often based on the distribution of the different states in the various order Markov models
  - The right weighting method appears to be dataset dependent
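Written out, one standard interpolation form (the per-state weights w_j are the weights referred to above; their normalization to 1 is an assumption here):

P(x_i \mid x_{i-1}, \ldots, x_1; \mathrm{IMM}_k)
  = \sum_{j=0}^{k} w_j \, P(x_i \mid x_{i-1}, \ldots, x_{i-j}),
  \qquad \sum_{j=0}^{k} w_j = 1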
16. Approach 2.3: Selective Markov Models (SMM)
- Build Markov chains of various orders
- Prune non-discriminatory states from the higher order chains (explained on the next slide)
- The conditional probability P(xi | xi−1, xi−2, ..., x1; SMMk) is the probability given by the highest order chain among the remaining (non-pruned) states
17. Approach 2.3: Selective Markov Models (SMM)
- Key task: to decide which states are non-discriminatory
- Simplest way: use a frequency threshold and prune all states which occur less often than it
- Method used in the experiments (one reading of the rule is written out below)
  - Specify the frequency threshold as a multiplier parameter
  - A state-transition pair is kept only if it occurs at least that many times more frequently than its expected frequency under a uniform distribution
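One natural reading of that pruning rule, writing the multiplier as φ (the original symbol did not survive conversion) and taking the expected frequency of a transition out of state s to be freq(s)/N under a uniform distribution:

\text{keep the pair } (s, x) \iff
\mathrm{freq}(s, x) \;\ge\; \varphi \cdot \frac{\mathrm{freq}(s)}{N}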
18. Approach 3: Feature Based Sequence Classification
- Sequences are modeled into a form that can be used by traditional machine learning algorithms
  - Extraction of features that take the sequential nature of the sequences into account
- The features are motivated by Markov models; support vector machines (SVMs) are used as the classifier
19. Approach 3: Feature Based Sequence Classification
- SVM
  - A relatively new learning algorithm by Vapnik (1995)
  - Objective: given a training set in a vector space, find the best hyperplane (the one with the maximum margin) that separates the two classes
  - Approach: formulate a constrained optimization problem (the standard primal form is written out after this list), then solve it using constrained quadratic programming (QP)
  - Well-suited for high dimensional data
  - Requires a lot of memory and CPU time
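The maximum-margin objective referred to above, in its standard hard-margin linear primal form (not spelled out on the slide):

\min_{w,\,b}\ \tfrac{1}{2}\|w\|^{2}
\quad \text{subject to} \quad
y_i\,(w^{T} x_i + b) \ge 1, \qquad i = 1, \ldots, n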
20. Approach 3: Feature Based Sequence Classification
(a) A separating hyperplane with a small margin. (b) A separating hyperplane with a larger margin. Better generalization is expected from (b).
21. Approach 3: Feature Based Sequence Classification
- SVM feature space mapping
  - Map the data into a higher dimensional feature space (using kernel functions) in which the classes are linearly separable
22. Approach 3: Feature Based Sequence Classification
- Vector space view (simple 1st order Markov chain)
  - The log-likelihood ratio is equivalent to the dot product L(Sr) = u^T w (derivation below)
  - u and w are of length N^2; each dimension corresponds to a unique pair of symbols
  - Element of u: the frequency of the corresponding pair of symbols in the sequence
  - Element of w: the log-ratio of the conditional probabilities for the + and − classes
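The equivalence follows by grouping the terms of the first-order log-likelihood ratio by symbol pair (a standard manipulation consistent with the slide; n_ab(Sr) denotes the number of times symbol b follows symbol a in Sr):

L(S_r) = \sum_{i=2}^{l} \log \frac{P(x_i \mid x_{i-1}, M^{+})}{P(x_i \mid x_{i-1}, M^{-})}
       = \sum_{(a,b) \in \Sigma^2} n_{ab}(S_r) \,
         \log \frac{P(b \mid a, M^{+})}{P(b \mid a, M^{-})}
       = u^{T} w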
23. Approach 3: Feature Based Sequence Classification
- Vector space view: example (simple 1st order Markov chain)
24. Approach 3: Feature Based Sequence Classification
- Vector space view (see the sketch after this list)
  - All the variants of Markov chains described previously can be transformed in a similar manner
  - Dimensionality of the new space
    - For higher order Markov chains: N^(k+1)
    - For IMM: N + N^2 + ... + N^(k+1)
    - For SMM: the number of non-pruned states
  - Each sequence is viewed as a frequency vector
  - This allows the use of any traditional classifier that operates on objects represented as multi-dimensional vectors
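A minimal sketch of the feature-based pipeline for the first-order case: map each sequence to its symbol-pair frequency vector and train a linear SVM. The toy data, the DNA alphabet, and the use of scikit-learn's LinearSVC are illustrative assumptions; the paper uses SVMs but does not prescribe this implementation.

import numpy as np
from itertools import product
from sklearn.svm import LinearSVC

def pair_frequency_vector(seq, alphabet):
    """First-order feature vector: one dimension per ordered symbol pair,
    holding the number of times that pair occurs consecutively in seq."""
    index = {pair: i for i, pair in enumerate(product(alphabet, repeat=2))}
    vec = np.zeros(len(index))
    for pair in zip(seq, seq[1:]):
        vec[index[pair]] += 1
    return vec

# Hypothetical toy dataset over the DNA alphabet; labels are +1 / -1
alphabet = "ACGT"
train_seqs = ["ACGTACGT", "AACCGGTT", "GTGTGTGA", "TTTACGTA"]
train_labels = [1, 1, -1, -1]

X = np.array([pair_frequency_vector(s, alphabet) for s in train_seqs])
clf = LinearSVC().fit(X, train_labels)
print(clf.predict([pair_frequency_vector("ACGTTTGA", alphabet)]))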
25. Experimental Evaluation
- 5 different datasets, each with 2-3 classes (Table 1)
26. Experimental Evaluation
- Methodology
  - The performance of the algorithms was measured using classification accuracy
  - Ten-way cross validation was used
  - Experiments were restricted to two class problems
27. KNN Classifiers
Table 2
- Cosine (formula below)
  - Sequence: frequency vector of the different symbols occurring in it
  - Similarity between two sequences: cosine of the two vectors
  - Does not take sequential constraints into account
28. KNN Classifiers
Table 2
- 1. Global outperforms the other two for all K
- 2. For PS-HT and PS-TS, the performance of Cosine is comparable to that of Global, as only limited sequential information can be exploited
29. KNN Classifiers
Table 2
- 3. Local performs very poorly, especially on the protein sequences
  → It is not good to base the classification on only a single substring
- 4. Accuracy decreases as K increases
30. Simple Markov Chains vs. Their Feature Spaces
Table 3
- 1. Accuracy improves with the order of each model
  - Only exceptions: for the PS- datasets, accuracy peaks at the 2nd/1st order, as the sequences are very short → in higher order models and their feature spaces there are very few examples for calculating the transition probabilities
31. Simple Markov Chains vs. Their Feature Spaces
Table 3
- 2. SVM achieves higher accuracies than the simple Markov chains (often a 5-10% improvement)
32. IMM vs. Their Feature Spaces
Table 4
- 1. SVM achieves higher accuracies than IMM for most datasets
  - Exceptions: for P-, the higher order IMM models do considerably better (no explanation provided)
33. IMM vs. Their Feature Spaces
Table 4
- 2. Simple Markov chain based classifiers usually outperform IMM
  - Only exceptions: the PS- datasets, since the sequences are comparatively short → there is a greater benefit in using Markov states of different orders
34. IMM Based Classifiers vs. Simple Markov Chain Based Classifiers
Table 4: IMM based
Part of Table 3: Simple Markov chain based
35. SMM vs. Their Feature Spaces
Table 5a
- The frequency-threshold multiplier parameter used in pruning the states of the different order Markov chains
36. SMM vs. Their Feature Spaces (continued)
Table 5b
Table 5c
37. SMM vs. Their Feature Spaces
- 1. SVM usually achieves higher accuracies than SMM
- 2. For many problems SMM achieves higher accuracy as the threshold multiplier increases, but the gains are rather small
  - Maybe because the pruning strategy is too simple
38. Conclusions
- 1. An SVM classifier used on the feature spaces of the different Markov chains (and their variants) achieves substantially better accuracies than the corresponding Markov chain classifier.
  → The linear classification models learnt by the SVM are better than those learnt by the Markov chain based approaches
39. Conclusions
- 2. Proper feature selection can improve accuracy, but an increase in the amount of available information does not necessarily guarantee it.
  - (Except for PS-) The maximum accuracy attained by SVM on the IMM feature spaces is always lower than that attained by it on the feature spaces of the simple Markov chains.
  - Even with the simple frequency based feature selection done in SMM, the overall accuracy is higher.
40. Conclusions
- 3. KNN using global alignments can take advantage of the relative positions of symbols in the aligned sequences
  - Simple experiment: an SVM incorporating information about the positions of symbols was able to achieve an accuracy > 97%.
  - Position specific information can be useful for building effective classifiers for biological sequences.
41. References
- Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.
- Ming-Hsuan Yang. Presentation entitled "Gentle Guide to Support Vector Machines".
- Alexander Johannes Smola. Presentation entitled "Support Vector Learning: Concepts and Algorithms".