Title: STUDENT RESEARCH SYMPOSIUM 2005
1. STUDENT RESEARCH SYMPOSIUM 2005
- Title: Strategically Using Pairwise Classification to Improve Category Prediction
- Presenter: Pinar Donmez
- Advisors: Carolyn Penstein Rosé, Jaime Carbonell
- LTI, SCS, Carnegie Mellon University
2. Outline
- Problem Definition
- Overview: Multi-label text classification methods
- Motivation for Ensemble Approaches
- Technical Details: Selective Concentration Classifiers
- Formal Evaluation
- Conclusions and Future Work
3. Problem Definition
- Multi-label Text Classification (TC): 1-1 mapping of documents to pre-defined categories
- Problems with TC:
  - Limited training data
  - Poor choice of features
  - Flaws in the learning process
- Goal: Improve the predictions on unseen data
4. Multi-label TC Methods
- ECOC
- Boosting
- Pairwise Coupling and Latent Variable Approach
5. ECOC
- Recall: Problems with TC
  - Limited training data
  - Poor choice of features
  - Flaws in the learning process
- Result: Poor classification performance
- ECOC:
  - Encode each class in a code vector
  - Encode each example in a code vector
  - Calculate the probability of each bit being 1 using, e.g., decision trees, neural networks, etc.
  - Combine these probabilities in a vector
  - To classify the given example, calculate the distance between this vector and each of the class codewords (a sketch follows below)
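The decoding step above can be illustrated with a minimal sketch. This is not the exact setup from the talk: the {0, 1} code matrix, integer class labels, and decision trees as the per-bit learners are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ecoc(X, y, code_matrix):
    """code_matrix[k, b] in {0, 1} is bit b of class k's codeword; y holds integer class indices."""
    bit_models = []
    for b in range(code_matrix.shape[1]):
        bit_labels = code_matrix[y, b]          # relabel each example by bit b of its class codeword
        bit_models.append(DecisionTreeClassifier().fit(X, bit_labels))
    return bit_models

def predict_ecoc(X, bit_models, code_matrix):
    # Probability that each bit is 1 (assumes every code-matrix column contains both 0s and 1s).
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in bit_models])
    # Assign each example to the class whose codeword is closest (L1 distance).
    dists = np.abs(probs[:, None, :] - code_matrix[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```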
6. Boosting
- Main idea: Evolve a set of weights over the training set (a sketch follows below)
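The weight-evolution idea can be sketched as a minimal AdaBoost-style loop; the binary ±1 labels and decision-stump weak learner are assumptions of this illustration, not details from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):              # y is a numpy array with values in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)                # start with uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                     # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)     # up-weight mistakes, down-weight correct predictions
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```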
7. Pairwise Coupling
- K classes, N observations
- X = (f1, f2, f3, ..., fp) is an observation with p features
- The K = 2 case is generally easier than the K > 2 case, since only one decision boundary has to be learned
- Friedman's rule for the K-class problem (K > 2): the max-wins rule, i.e., assign X to the class that wins the most pairwise comparisons (see the sketch below)
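A small sketch of the max-wins rule, assuming integer class labels and logistic regression as the pairwise base learner (both are assumptions of this illustration):

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pairwise(X, y, classes):
    # One binary classifier per class pair.
    models = {}
    for i, j in itertools.combinations(classes, 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return models

def predict_max_wins(X, models, classes):
    # Each pairwise classifier casts a vote; the class with the most votes wins.
    votes = np.zeros((len(X), len(classes)))
    index = {c: k for k, c in enumerate(classes)}
    for (i, j), m in models.items():
        pred = m.predict(X)                                   # each row votes for i or j
        votes[np.arange(len(X)), [index[p] for p in pred]] += 1
    return [classes[k] for k in votes.argmax(axis=1)]
```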
8. Latent Variable Approach
- Uses hidden variables that indicate whether the corresponding model is good at capturing particular patterns of the data
- The decision is based on the posterior probability
    P(y | x) = Σ_i P(m_i | x) · P(y | x, m_i)
  where P(m_i | x) is the likelihood that the ith model should be used for class prediction given input x, and P(y | x, m_i) is the probability of y given input x and the ith model (a sketch of this combination follows below)
Y. Liu, J. Carbonell, and R. Jin. A pairwise ensemble approach for accurate genre classification. In ECML '03, 2003.
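A minimal sketch of the combination formula above, assuming the gating probabilities P(m_i | x) and the per-model class posteriors P(y | x, m_i) have already been estimated (how Liu et al. estimate them is not shown here):

```python
import numpy as np

def combine(gate_probs, model_posteriors):
    """
    gate_probs:       (n_examples, n_models)            estimates of P(m_i | x)
    model_posteriors: (n_examples, n_models, n_classes) estimates of P(y | x, m_i)
    returns:          (n_examples, n_classes)           combined P(y | x)
    """
    return np.einsum('nm,nmk->nk', gate_probs, model_posteriors)

# Predicted class = argmax_y P(y | x):
# y_hat = combine(gate_probs, model_posteriors).argmax(axis=1)
```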
9. Latent Variable Approach (cont.)
- For each class Ci, Liu et al. build a structure like the following: [figure omitted]
- Compute the corresponding score of each test example
- Assign the example to the class with the highest score
10. Intuition Behind Our Method
- Multiple classes: a single decision boundary is not powerful enough
- Ensemble notion:
  - Partition the data into focused subsets
  - Learn a classification model on each subset
  - Combine the predictions of each model
- What is the problem with ensemble techniques?
  - When the category space is large, the time complexity of building models on subsets becomes intractable
- Our method addresses this problem. But how?
11. Technical Details
- Build one-vs-all classifiers iteratively
- At each iteration, choose which sub-classifiers to build based on an analysis of error distributions
- Idea: Focus on the classes that are highly confusable
- Similar to Boosting: Boosting modifies the weights of misclassified examples to penalize inaccurate models
- In the decision stage: if a confusable class is chosen for the prediction of a test example, the predictions of the sub-classifiers for that class are also taken into account
12. Technical Details Part II
- Train K one-vs-all classifiers (K categories)
- Analyze the confusion matrix
- Identify the problematic (hard) classes and their confusable sets
- Build sub-classifiers for each of the hard classes above: combine documents from one hard class with documents that belong to the confusable set of that class
- Continue to build sub-classifiers recursively until a stopping criterion is met (a sketch of one iteration follows below)
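A runnable sketch of one iteration of the procedure above, using scikit-learn one-vs-rest logistic regression as a stand-in base learner and a simplified "most false positives" criterion for hardness; the actual selection rule appears two slides later, and integer labels 0..K−1 are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix

def one_iteration(X_train, y_train, X_held, y_held, n_hard=3):
    # Step 1: train K one-vs-all classifiers.
    base = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

    # Step 2: analyze the confusion matrix on held-out data
    # (scikit-learn convention: rows = true class, columns = predicted class).
    C = confusion_matrix(y_held, base.predict(X_held))
    errors = C.copy()
    np.fill_diagonal(errors, 0)

    # Step 3: treat the classes drawing the most false positives as "hard"
    # (simplified stand-in for the slide-14 selection rule).
    hard = np.argsort(errors.sum(axis=0))[-n_hard:]

    # Step 4: one sub-classifier per hard class, trained on that class plus
    # the classes it is confused with.
    subclassifiers = {}
    for i in hard:
        confusable = np.flatnonzero(errors[:, i])      # true classes often predicted as i
        if confusable.size == 0:
            continue
        mask = np.isin(y_train, np.append(confusable, i))
        subclassifiers[int(i)] = LogisticRegression(max_iter=1000).fit(
            X_train[mask], y_train[mask])
    return base, subclassifiers
```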
13. Train K one-vs-all models
[Diagram: a first confusion matrix shows hard class A confused with B and hard class D confused with F and H; an A-vs-B classifier and a D-vs-{F, H} classifier are built, and the confusion matrix is recomputed.]
Note: Continue to build sub-classifiers until either there is no need or you cannot divide any further!
14. How to choose sub-classifiers?
- f_i(α) = α·µ_i + (1 − α)·σ_i²
- g_i(β) = µ_i + β·σ_i²
- where µ_i is the average number of false positives for class i, and σ_i² is the variance of the false positives for class i
- Focus on classes for which f_i(α) > T (T: a predefined threshold)
- For every i for which the above inequality is true, choose all classes j where C(i, j) > g_i(β)
- C(i, j): entry in the confusion matrix where i is the predicted class and j is the true class
- 3 parameters: α, β, and T, tuned on a held-out set (the rule is transcribed into code below)
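Read literally, the rule above transcribes into a few lines of numpy. This sketch assumes σ_i² denotes the variance of class i's false-positive counts and that α, β, and T arrive already tuned on the held-out set.

```python
import numpy as np

def choose_subclassifiers(conf, alpha, beta, T):
    """conf[i, j]: confusion-matrix count with i = predicted class, j = true class."""
    K = conf.shape[0]
    chosen = {}
    for i in range(K):
        row = np.delete(conf[i].astype(float), i)   # false positives of classifier i
        mu, var = row.mean(), row.var()             # mu_i and sigma_i^2 from the slide
        if alpha * mu + (1 - alpha) * var > T:      # f_i(alpha) > T: class i is "hard"
            g = mu + beta * var                     # g_i(beta)
            confusable = [j for j in range(K)
                          if j != i and conf[i, j] > g]
            if confusable:
                chosen[i] = confusable              # build an i-vs-confusable-set sub-classifier
    return chosen
```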
15. Analysis of error distribution for some classifiers I
- Analysis on the 20 Newsgroups dataset
- These errors are more uniformly distributed
- The average number of false positives is not very high
- Two criteria aren't met:
  - Skewed error distribution
  - Large number of errors
16. Analysis of error distribution for some classifiers II
- Common in all three:
  - Skewed distribution of errors (false positives)
  - These peaks will form the sub-classifiers
17. Implications of our method
- Objective: Obtain high accuracy by choosing a small set of sub-classifiers within a small number of iterations
- Pros:
  - Strategically choosing sub-classifiers reduces training time compared to building one-vs-one classifiers
  - O(n log n) classifiers on average
  - Sub-classifiers are trained on more focused sets, so they are likely to do a better job
- Cons:
  - We focus on the problematic classes when obtaining sub-classifiers; hence, performance might be hurt as we increase the number of iterations
18. Evaluation
- Dataset: 20 Newsgroups
- Evaluation on two versions:
  - Original 20 Newsgroups (19,997 documents evenly distributed across 20 classes)
  - Cleaned version (excludes headers, stopwords, and words that occur only once)
- Vocabulary size: 62,000
J. Rennie and R. Rifkin. Improving Multiclass Text Classification with the Support Vector Machine. MIT AI Memo AIM-2001-026, 2001.
19. Comparison of Results
- Our method gives results comparable to those of Liu et al.
- But Liu et al. build exactly 2K(K − 1) classifiers, whereas our method builds at most K² classifiers in the worst case (K classes)
- In the 20 Newsgroups experiment, we built 34 classifiers in total, whereas the latent variable approach builds 760 classifiers for the same dataset (2 × 20 × 19 = 760)
20. Comparison of Results I
- Results are based on the evaluation of the cleaned version of the 20 Newsgroups dataset
- Selective Concentration performed comparably to the Latent Variable Approach
- Selective Concentration uses O(n log n) classifiers on average, while the Latent Variable Approach uses O(n²) classifiers
21. Comparison of Results II
- Results are based on the original version of the 20 Newsgroups data
- The Selective Concentration method is significantly better than the baseline
- The difference between the number of classifiers in the two methods is not very large
22. Conclusion and Future Work
- We can achieve comparable accuracy with less training time by strategically selecting sub-classifiers
  - O(n log n) vs. O(n²)
- Continued formalization of how different error distributions affect the advantage of this approach
- Application to semantic role labeling