ECOC for Text Classification


1
Some Recent Work
  • ECOC for Text Classification
  • Hybrids of EM & Co-Training (with Kamal Nigam)
  • Learning to build a monolingual corpus from the
    web (with Rosie Jones)
  • Effect of Smoothing on Naive Bayes for text
    classification (with Tong Zhang)
  • Hypertext Categorization using link and extracted
    information (with Sean Slattery & Yiming Yang)

2
Using Error-Correcting Codes For Text
Classification
  • Rayid Ghani
  • Center for Automated Learning and Discovery
  • Carnegie Mellon University

This presentation can be accessed at
http://www.cs.cmu.edu/~rayid/talks/
3
Outline
  • Introduction to ECOC
  • Intuition & Motivation
  • Some Questions?
  • Experimental Results
  • Semi-Theoretical Model
  • Types of Codes
  • Drawbacks
  • Conclusions

4
Introduction
  • Decompose a multiclass classification problem
    into multiple binary problems
  • One-Per-Class Approach (moderately expensive)
  • All-Pairs (very expensive)
  • Distributed Output Code (efficient but what about
    performance?)
  • Error-Correcting Output Codes (?)

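As a rough cost comparison (a sketch we add here; the counting follows directly from the definitions above, with m classes and an n-bit output code):

    # Number of binary classifiers each decomposition must train,
    # for m classes and an n-bit output code.
    def num_binary_problems(m: int, n: int) -> dict:
        return {
            "one-per-class": m,             # one classifier per class
            "all-pairs": m * (m - 1) // 2,  # one per pair of classes
            "output-code": n,               # one per code column
        }

    print(num_binary_problems(m=105, n=63))
    # {'one-per-class': 105, 'all-pairs': 5460, 'output-code': 63}

With the 105-class Industry Sector data used later, all-pairs needs 5460 classifiers while a 63-bit code needs only 63.
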
6
Is it a good idea?
  • Larger margin for error since errors can now be
    corrected
  • One-per-class is a code with minimum Hamming
    distance (HD) of 2
  • Distributed codes have low HD
  • The individual binary problems can be harder than
    before
  • Useless unless the number of classes > 5

7
Training ECOC
       f1  f2  f3  f4  f5
  A     0   0   1   1   0
  B     1   0   1   0   0
  C     0   1   1   1   0
  D     0   1   0   0   1
  • Given m distinct classes

1. Create an m x n binary matrix M.
2. Each class is assigned ONE row of M.
3. Each column of the matrix divides the classes
into TWO groups.
4. Train the Base classifiers to learn the n
binary problems.
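
A minimal sketch of these four steps in Python (our illustration, not code from the talk; MultinomialNB stands in for the Naive Bayes base learner used later, and any classifier with fit/predict would do):

    import numpy as np
    from sklearn.base import clone
    from sklearn.naive_bayes import MultinomialNB

    def train_ecoc(X, y, M, base_learner=MultinomialNB()):
        """Train one binary classifier per column of the m x n code matrix M.

        y holds class indices 0..m-1; row M[c] is the codeword for class c.
        """
        classifiers = []
        for j in range(M.shape[1]):
            # Column j relabels each training example with the bit its
            # class was assigned, splitting the classes into two groups.
            bits = M[y, j]
            classifiers.append(clone(base_learner).fit(X, bits))
        return classifiers
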
9
Testing ECOC
  • To test a new instance
  • Apply each of the n classifiers to the new
    instance
  • Combine the predictions to obtain a binary
    string (codeword) for the new point
  • Classify to the class with the nearest codeword
    (usually Hamming distance is used as the distance
    measure)

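The decoding step, continuing the same sketch (Hamming distance from the predicted bit string to each row of M, nearest row wins):

    import numpy as np

    def predict_ecoc(X, classifiers, M):
        """Classify each instance to the class with the nearest codeword."""
        # One predicted bit per classifier: an (n_samples, n_bits) array.
        bits = np.column_stack([clf.predict(X) for clf in classifiers])
        # Hamming distance from every predicted codeword to every row of M.
        dist = (bits[:, None, :] != M[None, :, :]).sum(axis=2)
        return dist.argmin(axis=1)  # index of the nearest class codeword
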
13
ECOC - Picture

       f1  f2  f3  f4  f5
  A     0   0   1   1   0
  B     1   0   1   0   0
  C     0   1   1   1   0
  D     0   1   0   0   1
  X     1   1   1   1   0

X's predicted codeword (1 1 1 1 0) is nearest to C's.
14
  • Single classifier learns a complex boundary
    once
  • Ensemble learns a complex boundary multiple
    times
  • ECOC learns a simple boundary multiple times

15
Questions?
  • How well does it work?
  • How long should the code be?
  • Do we need a lot of training data?
  • What kind of codes can we use?
  • Are there intelligent ways of creating the code?

16
Previous Work
  • Combine with Boosting: ADABOOST.OC (Schapire,
    1997), (Guruswami & Sahai, 1999)
  • Local Learners (Ricci & Aha, 1997)
  • Text Classification (Berger, 1999)

17
Experimental Setup
  • Generate the code
    • BCH codes
  • Choose a base learner
    • Naive Bayes classifier, as used in text
      classification tasks (McCallum & Nigam, 1998)

18
Dataset
  • Industry Sector Dataset
  • Consists of company web pages classified into 105
    economic sectors
  • Standard stoplist
  • No Stemming
  • Skip all MIME headers and HTML tags
  • Experimental approach similar to McCallum et al.
    (1998) for comparison purposes.

19
Results
ECOC is 88% accurate!

Classification accuracies on five random 50-50
train-test splits of the Industry Sector dataset
with a vocabulary size of 10,000.
20
Results
Industry Sector Data Set

  Naïve Bayes   Shrinkage [1]   ME [2]   ME w/ Prior [3]   ECOC 63-bit
  66.1          76              79       81.1              88.5

ECOC reduces the error of the Naïve Bayes
classifier by 66%.

  [1] McCallum et al., 1998.  [2], [3] Nigam et al., 1999.

21
The Longer the Better!
Table 2: Average classification accuracy on 5
random 50-50 train-test splits of the Industry
Sector dataset with a vocabulary size of 10,000
words selected using Information Gain.
  • Longer codes mean larger codeword separation
  • The minimum Hamming distance of a code C is the
    smallest distance between any pair of distinct
    codewords in C
  • If the minimum Hamming distance is h, then the code
    can correct ⌊(h-1)/2⌋ errors

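To make the bound concrete, a few lines (ours) that compute the minimum Hamming distance of a code and how many errors it can correct; run on the one-per-class code, they confirm the HD-2 claim from the earlier slide:

    from itertools import combinations

    def min_hamming(code):
        """Smallest Hamming distance between any pair of distinct codewords."""
        return min(sum(x != y for x, y in zip(a, b))
                   for a, b in combinations(code, 2))

    opc = ["1000", "0100", "0010", "0001"]  # one-per-class code, 4 classes
    h = min_hamming(opc)
    print(h, (h - 1) // 2)  # -> 2 0: HD of 2, so zero errors correctable
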
22
Size Matters?
23
Size does NOT matter!
24
Semi-Theoretical Model
  • Model ECOC by a Binomial Distribution B(n, p)
  • n = length of the code
  • p = probability of each bit being classified
    incorrectly

25
Semi-Theoretical Model

  # of Bits   Hmin   Emax   Pave   Accuracy
  15           5      2     .85     .59
  15           5      2     .89     .80
  15           5      2     .91     .84
  31          11      5     .85     .67
  31          11      5     .89     .91
  31          11      5     .91     .94
  63          31     15     .89     .99
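
The accuracy column follows from the binomial model: classification survives as long as at most Emax of the n independent bits are wrong. A sketch (ours), reading Pave as the per-bit accuracy:

    from math import comb

    def ecoc_accuracy(n: int, e_max: int, p: float) -> float:
        """P(at most e_max of n bits are wrong), each bit correct w.p. p."""
        return sum(comb(n, i) * (1 - p)**i * p**(n - i)
                   for i in range(e_max + 1))

    print(round(ecoc_accuracy(15, 2, 0.85), 2))   # ~0.60, near the table's .59
    print(round(ecoc_accuracy(63, 15, 0.89), 2))  # ~0.99, matching the table
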
28
Talk.misc.religion Comp.sys.ibm.hardware
Comp.os.windows Comp.sys.ibm.hardware
Comp.os.windows Talk.misc.religion
Comp.os.windows Alt.atheism
Talk.misc.religion Alt.atheism
Comp.sys.ibm.hardware Alt.atheism
Alt.atheism
Talk.misc.religion
Comp.os.windows
Talk.misc.religion Comp.sys.ibm.hardware Comp.
os.windows
Alt.atheism Comp.sys.ibm.hardware Comp.os.win
dows
Alt.atheism Comp.sys.ibm.hardware Talk.misc.r
eligion
29
Types of Codes
  • Data-Independent
    • Algebraic
    • Random
    • Hand-Constructed
  • Data-Dependent
    • Adaptive
30
What is a Good Code?
  • Row Separation
  • Column Separation (Independence of errors for
    each binary classifier)
  • Efficiency (for long codes)

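Both criteria can be checked directly on a candidate matrix; a sketch (ours) that treats a column and its complement as the same binary problem when measuring column separation:

    import numpy as np
    from itertools import combinations

    def separation(M):
        """(min row distance, min column distance) of a 0/1 code matrix M."""
        m, n = M.shape
        # Row separation: codeword distance, which sets error correction.
        row_sep = min((M[i] != M[j]).sum()
                      for i, j in combinations(range(m), 2))
        # Column separation: a column and its complement induce the same
        # binary split, so take the smaller of d and m - d.
        col_sep = min(min((M[:, i] != M[:, j]).sum(),
                          m - (M[:, i] != M[:, j]).sum())
                      for i, j in combinations(range(n), 2))
        return row_sep, col_sep
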
31
Choosing Codes
              Random                        Algebraic
  Row Sep     On average, for long codes    Guaranteed
  Col Sep     On average, for long codes    Can be guaranteed
  Efficiency  No                            Yes
32
Experimental Results
  Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate (%)
  15-Bit BCH          5            15           49           64           20.6
  19-Bit Hybrid       5            18           15           69           22.3
  15-Bit Random     2 (1.5)        13           42           60           24.1
33
Interesting Questions?
  • NB does not give good probability estimates;
    does using ECOC result in better estimates?
  • How should codewords be assigned to classes?
  • Can decoding be posed as a supervised learning
    task?

34
Drawbacks
  • Can be computationally expensive
  • Random Codes throw away the real-world nature of
    the data by picking random partitions to create
    artificial binary problems

35
Current Work
  • Combine ECOC with Co-Training to use unlabeled
    data
  • Automatically construct optimal / adaptive codes

36
Conclusion
  • Performs well on text classification tasks
  • Can be used when training data is sparse
  • Algebraic codes perform better than random codes
    for a given code length
  • Hand-constructed codes may not be the answer

37
Background
  • Co-training seems to be the way to go when there
    is (and maybe even when there isn't) a feature
    split in the data
  • Reported results on co-training only deal with
    very small (toy) problems, mostly binary
    classification tasks (Blum & Mitchell, 1998;
    Nigam & Ghani, 2000)

38
Co-Training Challenge
  • Task: Apply co-training to a 65-class dataset
    containing 130,000 training examples
  • Result: Co-training fails!

39
Solution?
  • ECOC seems to work well when there are a large
    number of classes
  • ECOC decomposes a multiclass problem into
    several binary problems
  • Co-training works well with binary problems

Combine ECOC and Co-Training
40
Algorithm
  • Learn each bit for ECOC using a co-trained
    classifier

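In outline (our reading of the one-line algorithm above; cotrain stands for a standard two-view co-training routine in the style of Blum & Mitchell, returning one binary classifier, and is not code from the talk):

    def train_ecoc_cotrain(X1, X2, y, U1, U2, M, cotrain):
        """Learn each ECOC bit with a co-trained binary classifier.

        (X1, X2): two labeled feature views (e.g., title / description);
        (U1, U2): the same two views of the unlabeled pool;
        M: the m x n code matrix (NumPy array), y: class indices.
        """
        return [cotrain(X1, X2, M[y, j], U1, U2)  # binary labels for bit j
                for j in range(M.shape[1])]
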
41
Dataset (Job Descriptions)
  • 65 classes
  • 32000 examples
  • Two feature sets
    • Title
    • Description

43
Results
  • 10% train, 50% unlabeled, 40% test
  • NB: 40.3%
  • ECOC: 48.9%
  • EM: 30.83%
  • Co-Training
  • ECOC-EM
  • ECOC-Cotrain
  • ECOC-CoEM