1
STUDENT RESEARCH SYMPOSIUM 2005
  • Title: Strategically Using Pairwise
    Classification to Improve Category Prediction
  • Presenter: Pinar Donmez
  • Advisors: Carolyn Penstein Rosé, Jaime Carbonell
  • LTI, SCS, Carnegie Mellon University

2
Outline
  • Problem Definition
  • Overview: Multi-label text classification methods
  • Motivation for Ensemble Approaches
  • Technical Details: Selective Concentration
    Classifiers
  • Formal Evaluation
  • Conclusions and Future Work

3
Problem Definition
  • Multi-label Text Classification (TC): 1-1 mapping
    of documents to pre-defined categories
  • Problems with TC:
  • Limited training data
  • Poor choice of features
  • Flaws in the learning process
  • Goal: Improve predictions on unseen data

4
Multi-label TC Methods
  • ECOC (Error-Correcting Output Codes)
  • Boosting
  • Pairwise Coupling and Latent Variable Approach

5
ECOC
  • Recall: Problems with TC
  • Limited training data
  • Poor choice of features
  • Flaws in the learning process
  • Result: Poor classification performance
  • ECOC:
  • Encode each class in a code vector
  • Encode each example in a code vector
  • Calculate the probability of each bit being 1
    using, e.g., decision trees, neural networks, etc.
  • Combine these probabilities into a vector
  • To classify a given example, calculate the
    distance between this vector and the codeword of
    each class (see the sketch below)
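A minimal Python sketch of this decoding step (illustrative only, not the presenters' code; the codewords and bit probabilities are hypothetical):

```python
import numpy as np

# Hypothetical 4-bit codewords for 3 classes.
codewords = {
    "comp": np.array([1, 0, 1, 0]),
    "sci":  np.array([0, 1, 1, 0]),
    "rec":  np.array([1, 1, 0, 1]),
}
# P(bit = 1 | x) for each code bit, e.g. from decision trees.
bit_probs = np.array([0.9, 0.2, 0.7, 0.1])

# Decode: pick the class whose codeword is closest (L1 distance)
# to the vector of bit probabilities.
pred = min(codewords, key=lambda c: np.abs(codewords[c] - bit_probs).sum())
print(pred)  # -> "comp"
```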

6
Boosting
  • Main idea: Evolve a set of weights over the
    training set, re-weighting misclassified examples
    each round (a standard sketch follows)
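The slide does not spell out the update rule; a minimal sketch of the standard AdaBoost-style weight update it alludes to (the AdaBoost variant is an assumption, not named on the slide):

```python
import numpy as np

def update_weights(w, y_true, y_pred):
    """One boosting round: up-weight misclassified examples, renormalize."""
    err = np.sum(w[y_true != y_pred])        # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)    # weight of the current model
    agree = np.where(y_true == y_pred, 1.0, -1.0)
    w = w * np.exp(-alpha * agree)           # shrink correct, grow incorrect
    return w / w.sum(), alpha

w = np.full(6, 1 / 6)                        # start with uniform weights
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, -1, 1, 1])     # two mistakes
w, alpha = update_weights(w, y_true, y_pred)
print(w)  # misclassified examples now carry more weight
```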

7
Pairwise Coupling
  • K classes, N observations
  • X = (f1, f2, f3, ..., fp) is an observation with
    p features
  • The K = 2 case is generally easier than the K > 2
    case, since only one decision boundary has to be
    learned
  • Friedman's rule for the K-class problem (K > 2):

Max-wins rule: assign X to the class that wins the
most pairwise comparisons (sketched below).
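A minimal sketch of the max-wins rule; the pairwise classifiers are stubbed with a hypothetical vote table:

```python
from collections import Counter
from itertools import combinations

def max_wins(classes, pairwise_vote):
    """pairwise_vote(i, j) returns the winner of the (i, j) classifier."""
    votes = Counter(pairwise_vote(i, j) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0]        # class with the most wins

# Hypothetical outcomes of the three pairwise classifiers for one example.
outcome = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "C"}
print(max_wins(["A", "B", "C"], lambda i, j: outcome[(i, j)]))  # -> "C"
```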
8
Latent Variable Approach
  • Uses hidden variables that indicate whether the
    corresponding model is good at capturing
    particular patterns of the data
  • Decision is based on the posterior probability
    P(y | x) = Σ_i P(m_i | x) · P(y | x, m_i)
    where P(m_i | x) is the likelihood that the ith
    model should be used for class prediction given
    input x, and P(y | x, m_i) is the probability of
    y given input x and the ith model (sketched below)
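A minimal sketch of this mixture decision rule; the model weights and per-model class probabilities below are hypothetical:

```python
import numpy as np

model_weights = np.array([0.5, 0.3, 0.2])   # P(m_i | x), one per model
class_probs = np.array([                    # P(y | x, m_i), rows = models
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

posterior = model_weights @ class_probs     # P(y | x) for each class y
print(posterior, posterior.argmax())        # pick the highest-posterior class
```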
Y. Liu, J. Carbonell, and R. Jin. A Pairwise
Ensemble Approach for Accurate Genre
Classification. In ECML '03, 2003.
9
  • For each class Ci, Liu et al. build a structure
    like the following (pairwise ensemble diagram
    omitted)
  • Compute the corresponding score for each test
    example
  • Assign the example to the class with the highest
    score

10
Intuition Behind Our Method
  • Multiple classes ⇒ a single decision boundary is
    not powerful enough
  • Ensemble notion
  • Partition the data into focused subsets
  • Learn a classification model on each subset
  • Combine the predictions of each model
  • What is the problem with ensemble techniques?
  • When the category space is large, time complexity
    to build models on subsets becomes intractable
  • Our method addresses this problem. But how?

11
Technical Details
  • Build one-vs-all classifiers iteratively
  • At each iteration choose which sub-classifiers to
    build based on an analysis of error distributions
  • Idea: Focus on the classes that are highly
    confusable
  • Similar to Boosting:
  • Boosting modifies the weights of misclassified
    examples to penalize inaccurate models
  • In the decision stage:
  • If a confusable class is chosen for the prediction
    of a test example, the predictions of the
    sub-classifiers for that class are also taken
    into account (see the sketch below)
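A minimal sketch of one way this decision stage could work; the slide does not specify how the predictions are combined, so the arbitration step and all names here are assumptions:

```python
def predict(x, ova_scores, confusable, sub_classifiers):
    """ova_scores: class -> one-vs-all score for x.
    confusable: classes that have sub-classifiers.
    sub_classifiers: class -> callable returning a refined label for x."""
    top = max(ova_scores, key=ova_scores.get)
    if top in confusable:
        # Let the focused sub-classifier arbitrate (assumed strategy).
        return sub_classifiers[top](x)
    return top

scores = {"A": 0.2, "B": 0.7}                 # B is the top one-vs-all class
subs = {"B": lambda x: "F"}                   # B's sub-classifier prefers F
print(predict("doc", scores, {"B"}, subs))    # -> "F"
```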

12
Technical Details Part II
  • Train K one-vs-all classifiers (K categories)
  • Analyze the confusion matrix
  • Identify the problematic (hard) classes and their
    confusable sets
  • Build sub-classifiers for each of the hard
    classes above
  • Combine documents from a hard class with
    documents that belong to that class's confusable
    set
  • Continue to build sub-classifiers recursively
    until a stopping criterion is met (a sketch of
    this loop follows)
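A minimal sketch of this training loop; the learner, confusion-matrix routine, and thresholding rules are passed in as hypothetical callables, since the slides do not fix them:

```python
def train_selective_concentration(classes, train_ova, confusion,
                                  find_hard, confusable_set, max_iters=3):
    """train_ova(c, focus) trains a one-vs-all model for class c, optionally
    restricted to the classes in focus; confusion(models) returns a confusion
    matrix on held-out data; find_hard / confusable_set implement the
    f_i / g_i rules from the next slide."""
    models = [train_ova(c, None) for c in classes]   # K one-vs-all models
    for _ in range(max_iters):
        cm = confusion(models)                       # analyze errors
        hard = find_hard(cm)                         # hard classes
        if not hard:                                 # stopping criterion
            break
        for c in hard:
            focus = {c} | confusable_set(cm, c)      # focused subset
            models.append(train_ova(c, focus))       # build sub-classifier
    return models
```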

13
[Diagram: train K one-vs-all models (e.g., classes A, B, D, F, H),
analyze the confusion matrix, build sub-classifiers such as A-vs-B and
D-vs-(F and H), then repeat on the new confusion matrix.]
Note: Continue to build sub-classifiers until either there is no need
or you cannot divide any further!
14
How to choose sub-classifiers?
  • f_i(α) = α·µ_i + (1 − α)·σ_i²
  • g_i(β) = µ_i + β·σ_i²
  • where µ_i = avg number of false positives for
    class i
  • σ_i² = stdev of false positives for class i
  • Focus on classes for which f_i(α) > T
  • (T = predefined threshold)
  • For every i for which the above inequality is
    true:
  • Choose all classes j where C(i,j) > g_i(β)
  • C(i,j) = entry in the confusion matrix where i is
    the predicted class and j is the true class
  • 3 parameters: α, β, and T
  • Tuned on a held-out set (see the sketch below)
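A minimal sketch of this selection rule on a toy confusion matrix; treating each off-diagonal row entry as a false-positive count and using the row mean and spread for µ_i and σ_i² are our assumptions:

```python
import numpy as np

def choose_subclassifiers(C, alpha, beta, T):
    """C: confusion matrix, rows = predicted class i, cols = true class j."""
    fp = C.astype(float).copy()
    np.fill_diagonal(fp, 0.0)                 # off-diagonal = false positives
    mu = fp.mean(axis=1)                      # avg false positives, class i
    sigma = fp.std(axis=1)                    # spread (sigma_i^2 on the slide)
    f = alpha * mu + (1.0 - alpha) * sigma    # f_i(alpha)
    g = mu + beta * sigma                     # g_i(beta)
    return {int(i): np.flatnonzero(fp[i] > g[i]).tolist()
            for i in np.flatnonzero(f > T)}   # hard class -> confusable set

C = np.array([[50, 1, 9],
              [2, 48, 0],
              [12, 1, 40]])
print(choose_subclassifiers(C, alpha=0.5, beta=1.0, T=2.0))
# -> {0: [2], 2: [0]}: classes 0 and 2 confuse each other
```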

15
Analysis of error distribution for some
classifiers I
  • Analysis on the 20newsgroups dataset
  • These errors are more uniformly distributed
  • The avg number of false positives is not very
    high
  • Two criteria aren't met:
  • Skewed error distribution
  • Large number of errors

16
Analysis of error distribution for some
classifiers II
  • Common in all three
  • Skewed distribution of errors (false positives)
  • These peaks will form the sub-classifiers

17
Implications of our method
  • Objective Obtain high accuracy by choosing a
    small set of sub-classifiers within a small
    number of iterations
  • Pros
  • Strategically choosing sub-classifiers reduces
    training time compared to building all
    one-vs-one classifiers
  • O(n log n) classifiers on average
  • Sub-classifiers are trained on more focused sets,
    so they are likely to do a better job
  • Cons
  • We obtain sub-classifiers by focusing on the
    classes that are hard to distinguish; hence,
    performance might be hurt as we increase the
    number of iterations

18
Evaluation
  • Dataset 20 newsgroups
  • Evaluation on two versions
  • Original 20 Newsgroups (19,997 documents evenly
    distributed across 20 classes)
  • Cleaned version (excludes headers, stopwords, and
    words that occur only once; see the sketch below)
  • Vocabulary size: 62,000
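A comparable cleaning step can be sketched with scikit-learn; this is an approximation of the cleaned version, not the presenters' script (min_df=2 drops words appearing in a single document, which is close to, but not exactly, dropping words that occur only once):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Strip message headers, as in the cleaned version of the dataset.
data = fetch_20newsgroups(subset="all", remove=("headers",))

# Drop English stopwords and very rare words.
vectorizer = CountVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(data.data)
print(X.shape)  # (number of documents, vocabulary size)
```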

J. Rennie and R. Rifkin. Improving Multiclass Text
Classification with the Support Vector Machine.
MIT AI Memo AIM-2001-026, 2001.
19
Comparison of Results
  • Our method gives results comparable to those of
    Liu et al.
  • But Liu et al. build exactly 2K² classifiers,
    whereas our method builds at most K² classifiers
    in the worst case (K classes)
  • In the 20 newsgroups experiment, we built 34
    classifiers in total, whereas the latent variable
    approach builds 760 classifiers for the same
    dataset

20
Comparison of Results I
  • Results are based on the evaluation of the
    cleaned version of the 20 newsgroups dataset
  • Selective Concentration performed comparably to
    the Latent Variable Approach
  • Selective Concentration uses O(n log n)
    classifiers on average, while the Latent Variable
    Approach uses O(n²) classifiers

21
Comparison of Results II
  • Results are based on the original version of the
    20 newsgroups data
  • The Selective Concentration method is
    significantly better than the baseline
  • The difference between the number of classifiers
    in the two methods is not very large

22
Conclusion and Future Work
  • We can achieve comparable accuracy with less
    training time by strategically selecting
    sub-classifiers
  • O(n log n) vs. O(n²)
  • Continued formalization of how different error
    distributions affect the advantage of this
    approach
  • Application to semantic role labeling