Title: STUDENT RESEARCH SYMPOSIUM 2005
1. STUDENT RESEARCH SYMPOSIUM 2005
- Title: Strategically Using Pairwise Classification to Improve Category Prediction
- Presenter: Pinar Donmez
- Advisors: Carolyn Penstein Rosé, Jaime Carbonell
- LTI, SCS, Carnegie Mellon University
2. Outline
- Problem Definition
- Overview: Multi-label text classification methods
- Motivation for Ensemble Approaches
- Technical Details: Selective Concentration Classifiers
- Formal Evaluation
- Conclusions and Future Work
3. Problem Definition
- Multi-label Text Classification (TC): 1-1 mapping of documents to pre-defined categories
- Problems with TC:
  - Limited training data
  - Poor choice of features
  - Flaws in the learning process
- Goal: Improve the predictions on unseen data
4. Multi-label TC Methods
- ECOC
- Boosting
- Pairwise Coupling and Latent Variable Approach
5. ECOC
- Recall: Problems with TC
  - Limited training data
  - Poor choice of features
  - Flaws in the learning process
- Result: Poor classification performance
- ECOC:
  - Encode each class in a code vector
  - Encode each example in a code vector
  - Calculate the probability of each bit being 1 using, e.g., decision trees, neural networks, etc.
  - Combine these probabilities in a vector
  - To classify the given example, calculate the distance between this vector and each of the class codewords (a sketch follows below)
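The decoding step above can be illustrated with a minimal sketch. This is not the exact setup from the talk: the {0, 1} code matrix, integer class labels, and decision trees as the per-bit learners are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ecoc(X, y, code_matrix):
    """code_matrix[k, b] in {0, 1} is bit b of class k's codeword; y holds integer class indices."""
    bit_models = []
    for b in range(code_matrix.shape[1]):
        bit_labels = code_matrix[y, b]          # relabel each example by bit b of its class codeword
        bit_models.append(DecisionTreeClassifier().fit(X, bit_labels))
    return bit_models

def predict_ecoc(X, bit_models, code_matrix):
    # Probability that each bit is 1 (assumes every code-matrix column contains both 0s and 1s).
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in bit_models])
    # Assign each example to the class whose codeword is closest (L1 distance).
    dists = np.abs(probs[:, None, :] - code_matrix[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```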
6. Boosting
- Main idea: Evolve a set of weights over the training set (a sketch follows below)
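The weight-evolution idea can be sketched as a minimal AdaBoost-style loop; the binary ±1 labels and decision-stump weak learner are assumptions of this illustration, not details from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):              # y is a numpy array with values in {-1, +1}
    n = len(y)
    w = np.full(n, 1.0 / n)                # start with uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                     # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)     # up-weight mistakes, down-weight correct predictions
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```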
7. Pairwise Coupling
- K classes, N observations
- X = (f1, f2, f3, ..., fp) is an observation with p features
- The K = 2 case is generally easier than the K > 2 case, since only one decision boundary has to be learned
- Friedman's rule for the K-class problem (K > 2): the max-wins rule, i.e., assign X to the class that wins the most pairwise comparisons (see the sketch below)
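A small sketch of the max-wins rule, assuming integer class labels and logistic regression as the pairwise base learner (both are assumptions of this illustration):

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pairwise(X, y, classes):
    # One binary classifier per class pair.
    models = {}
    for i, j in itertools.combinations(classes, 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
    return models

def predict_max_wins(X, models, classes):
    # Each pairwise classifier casts a vote; the class with the most votes wins.
    votes = np.zeros((len(X), len(classes)))
    index = {c: k for k, c in enumerate(classes)}
    for (i, j), m in models.items():
        pred = m.predict(X)                                   # each row votes for i or j
        votes[np.arange(len(X)), [index[p] for p in pred]] += 1
    return [classes[k] for k in votes.argmax(axis=1)]
```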
8. Latent Variable Approach
- Uses hidden variables that indicate whether the corresponding model is good at capturing particular patterns of the data
- The decision is based on the posterior probability
    P(y | x) = Σ_i P(m_i | x) · P(y | x, m_i)
  where P(m_i | x) is the likelihood that the ith model should be used for class prediction given input x, and P(y | x, m_i) is the probability of y given input x and the ith model (a sketch of this combination follows below)
Y. Liu, J. Carbonell, and R. Jin. A pairwise ensemble approach for accurate genre classification. In ECML '03, 2003.
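A minimal sketch of the combination formula above, assuming the gating probabilities P(m_i | x) and the per-model class posteriors P(y | x, m_i) have already been estimated (how Liu et al. estimate them is not shown here):

```python
import numpy as np

def combine(gate_probs, model_posteriors):
    """
    gate_probs:       (n_examples, n_models)            estimates of P(m_i | x)
    model_posteriors: (n_examples, n_models, n_classes) estimates of P(y | x, m_i)
    returns:          (n_examples, n_classes)           combined P(y | x)
    """
    return np.einsum('nm,nmk->nk', gate_probs, model_posteriors)

# Predicted class = argmax_y P(y | x):
# y_hat = combine(gate_probs, model_posteriors).argmax(axis=1)
```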
9. Latent Variable Approach (cont.)
- For each class Ci, Liu et al. build a structure like the following: [figure omitted]
- Compute the corresponding score of each test example
- Assign the example to the class with the highest score
10. Intuition Behind Our Method
- Multiple classes: a single decision boundary is not powerful enough
- Ensemble notion:
  - Partition the data into focused subsets
  - Learn a classification model on each subset
  - Combine the predictions of each model
- What is the problem with ensemble techniques?
  - When the category space is large, the time complexity of building models on subsets becomes intractable
- Our method addresses this problem. But how?
11. Technical Details
- Build one-vs-all classifiers iteratively
- At each iteration, choose which sub-classifiers to build based on an analysis of error distributions
- Idea: Focus on the classes that are highly confusable
- Similar to Boosting: Boosting modifies the weights of misclassified examples to penalize inaccurate models
- In the decision stage: if a confusable class is chosen for the prediction of a test example, the predictions of the sub-classifiers for that class are also taken into account
12. Technical Details Part II
- Train K one-vs-all classifiers (K categories)
- Analyze the confusion matrix
- Identify the problematic (hard) classes and their confusable sets
- Build sub-classifiers for each of the hard classes above: combine documents from one hard class with documents that belong to the confusable set of that class
- Continue to build sub-classifiers recursively until a stopping criterion is met (a sketch of one iteration follows below)
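A runnable sketch of one iteration of the procedure above, using scikit-learn one-vs-rest logistic regression as a stand-in base learner and a simplified "most false positives" criterion for hardness; the actual selection rule appears two slides later, and integer labels 0..K−1 are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix

def one_iteration(X_train, y_train, X_held, y_held, n_hard=3):
    # Step 1: train K one-vs-all classifiers.
    base = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

    # Step 2: analyze the confusion matrix on held-out data
    # (scikit-learn convention: rows = true class, columns = predicted class).
    C = confusion_matrix(y_held, base.predict(X_held))
    errors = C.copy()
    np.fill_diagonal(errors, 0)

    # Step 3: treat the classes drawing the most false positives as "hard"
    # (simplified stand-in for the slide-14 selection rule).
    hard = np.argsort(errors.sum(axis=0))[-n_hard:]

    # Step 4: one sub-classifier per hard class, trained on that class plus
    # the classes it is confused with.
    subclassifiers = {}
    for i in hard:
        confusable = np.flatnonzero(errors[:, i])      # true classes often predicted as i
        if confusable.size == 0:
            continue
        mask = np.isin(y_train, np.append(confusable, i))
        subclassifiers[int(i)] = LogisticRegression(max_iter=1000).fit(
            X_train[mask], y_train[mask])
    return base, subclassifiers
```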
13. Train K one-vs-all models
[Diagram: a first confusion matrix shows hard class A confused with B and hard class D confused with F and H; an A-vs-B classifier and a D-vs-{F, H} classifier are built, and the confusion matrix is recomputed.]
Note: Continue to build sub-classifiers until either there is no need or you cannot divide any further!
14. How to choose sub-classifiers?
- f_i(α) = α·µ_i + (1 − α)·σ_i²
- g_i(β) = µ_i + β·σ_i²
- where µ_i is the average number of false positives for class i, and σ_i² is the variance of the false positives for class i
- Focus on classes for which f_i(α) > T (T: a predefined threshold)
- For every i for which the above inequality is true, choose all classes j where C(i, j) > g_i(β)
- C(i, j): entry in the confusion matrix where i is the predicted class and j is the true class
- 3 parameters: α, β, and T, tuned on a held-out set (the rule is transcribed into code below)
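Read literally, the rule above transcribes into a few lines of numpy. This sketch assumes σ_i² denotes the variance of class i's false-positive counts and that α, β, and T arrive already tuned on the held-out set.

```python
import numpy as np

def choose_subclassifiers(conf, alpha, beta, T):
    """conf[i, j]: confusion-matrix count with i = predicted class, j = true class."""
    K = conf.shape[0]
    chosen = {}
    for i in range(K):
        row = np.delete(conf[i].astype(float), i)   # false positives of classifier i
        mu, var = row.mean(), row.var()             # mu_i and sigma_i^2 from the slide
        if alpha * mu + (1 - alpha) * var > T:      # f_i(alpha) > T: class i is "hard"
            g = mu + beta * var                     # g_i(beta)
            confusable = [j for j in range(K)
                          if j != i and conf[i, j] > g]
            if confusable:
                chosen[i] = confusable              # build an i-vs-confusable-set sub-classifier
    return chosen
```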
15. Analysis of error distribution for some classifiers I
- Analysis on the 20 Newsgroups dataset
- These errors are more uniformly distributed
- The average number of false positives is not very high
- Two criteria aren't met:
  - Skewed error distribution
  - Large number of errors
16. Analysis of error distribution for some classifiers II
- Common in all three:
  - Skewed distribution of errors (false positives)
  - These peaks will form the sub-classifiers
17. Implications of our method
- Objective: Obtain high accuracy by choosing a small set of sub-classifiers within a small number of iterations
- Pros:
  - Strategically choosing sub-classifiers reduces training time compared to building one-vs-one classifiers
  - O(n log n) classifiers on average
  - Sub-classifiers are trained on more focused sets, so they are likely to do a better job
- Cons:
  - We focus on the problematic classes when obtaining sub-classifiers; hence, performance might be hurt as we increase the number of iterations
18. Evaluation
- Dataset: 20 Newsgroups
- Evaluation on two versions:
  - Original 20 Newsgroups (19,997 documents evenly distributed across 20 classes)
  - Cleaned version (excludes headers, stopwords, and words that occur only once)
- Vocabulary size: 62,000
J. Rennie and R. Rifkin. Improving Multiclass Text Classification with the Support Vector Machine. MIT AI Memo AIM-2001-026, 2001.
19. Comparison of Results
- Our method gives results comparable to those of Liu et al.
- But Liu et al. build exactly 2K(K − 1) classifiers, whereas our method builds at most K² classifiers in the worst case (K classes)
- In the 20 Newsgroups experiment, we built 34 classifiers in total, whereas the latent variable approach builds 760 classifiers for the same dataset (2 × 20 × 19 = 760)
20. Comparison of Results I
- Results are based on the evaluation of the cleaned version of the 20 Newsgroups dataset
- Selective Concentration performed comparably to the Latent Variable Approach
- Selective Concentration uses O(n log n) classifiers on average, while the Latent Variable Approach uses O(n²) classifiers
21. Comparison of Results II
- Results are based on the original version of the 20 Newsgroups data
- The Selective Concentration method is significantly better than the baseline
- The difference between the number of classifiers in the two methods is not very large
22. Conclusion and Future Work
- We can achieve comparable accuracy with less training time by strategically selecting sub-classifiers
  - O(n log n) vs. O(n²)
- Continued formalization of how different error distributions affect the advantage of this approach
- Application to semantic role labeling