Title: ECOC for Text Classification
1. Some Recent Work
- ECOC for Text Classification
- Hybrids of EM & Co-Training (with Kamal Nigam)
- Learning to build a monolingual corpus from the web (with Rosie Jones)
- Effect of Smoothing on Naive Bayes for text classification (with Tong Zhang)
- Hypertext Categorization using link and extracted information (with Sean Slattery & Yiming Yang)
2. Using Error-Correcting Codes For Text Classification
- Rayid Ghani
- Center for Automated Learning and Discovery
- Carnegie Mellon University
This presentation can be accessed at http://www.cs.cmu.edu/rayid/talks/
3. Outline
- Introduction to ECOC
- Intuition & Motivation
- Some Questions?
- Experimental Results
- Semi-Theoretical Model
- Types of Codes
- Drawbacks
- Conclusions
4. Introduction
- Decompose a multiclass classification problem into multiple binary problems
- One-Per-Class approach (moderately expensive)
- All-Pairs (very expensive)
- Distributed Output Code (efficient, but what about performance?)
- Error-Correcting Output Codes (?)
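For concreteness, the number of binary classifiers each decomposition trains for an m-class problem; plugging in the 105-class dataset and the 63-bit code used later in the talk:

$$
\text{one-per-class: } m = 105, \qquad
\text{all-pairs: } \binom{m}{2} = \frac{105 \cdot 104}{2} = 5460, \qquad
\text{output code of length } n: \; n = 63.
$$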
6. Is it a good idea?
- Larger margin for error, since errors can now be corrected
- One-per-class is a code with minimum Hamming distance (HD) of 2
- Distributed codes have low HD
- The individual binary problems can be harder than before
- Useless unless the number of classes > 5
7. Training ECOC

     f1  f2  f3  f4  f5
A     0   0   1   1   0
B     1   0   1   0   0
C     0   1   1   1   0
D     0   1   0   0   1

1. Create an m x n binary matrix M.
2. Each class is assigned ONE row of M.
3. Each column of the matrix divides the classes into TWO groups.
4. Train the base classifiers to learn the n binary problems.
8. Training ECOC
- Given m distinct classes
- Create an m x n binary matrix M.
- Each class is assigned ONE row of M.
- Each column of the matrix divides the classes into TWO groups.
- Train the base classifiers to learn the n binary problems (see the sketch below).
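A minimal sketch of these steps, assuming a document-term count matrix X and per-document class labels y. scikit-learn's MultinomialNB stands in for the talk's Naive Bayes base learner, a random matrix stands in for the BCH codes used in the experiments, and names such as train_ecoc are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, classes, n_bits, seed=0):
    """Steps 1-4 above: build an m x n binary code matrix M, assign each
    class one row, let each column split the classes into two groups, and
    train one base classifier per column.

    X: (num_docs, vocab_size) term-count matrix; y: class label per document.
    """
    rng = np.random.default_rng(seed)
    m = len(classes)
    M = rng.integers(0, 2, size=(m, n_bits))            # 1. m x n binary matrix
    row_of = {c: M[i] for i, c in enumerate(classes)}   # 2. one row per class
    learners = []
    for j in range(n_bits):                             # 3. column = two-way split
        bit_labels = np.array([row_of[label][j] for label in y])
        learners.append(MultinomialNB().fit(X, bit_labels))  # 4. base learner per column
    return M, learners
```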
9. Testing ECOC
- To test a new instance:
- Apply each of the n classifiers to the new instance
- Combine the predictions to obtain a binary string (codeword) for the new point
- Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure); see the sketch below
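A matching sketch of the decoding step, continuing the hypothetical train_ecoc example above (M, learners, and classes are its outputs):

```python
import numpy as np

def classify_ecoc(x, M, learners, classes):
    """Predict each bit with the n column classifiers, then assign the class
    whose codeword (row of M) is nearest in Hamming distance."""
    word = np.array([clf.predict(x)[0] for clf in learners])  # predicted codeword
    hamming = (M != word).sum(axis=1)        # distance to every class's codeword
    return classes[int(hamming.argmin())]
```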
10. ECOC - Picture

     f1  f2  f3  f4  f5
A     0   0   1   1   0
B     1   0   1   0   0
C     0   1   1   1   0
D     0   1   0   0   1

[Diagram: classes A-D alongside the code matrix and the binary classifiers f1-f5]
13. ECOC - Picture

     f1  f2  f3  f4  f5
A     0   0   1   1   0
B     1   0   1   0   0
C     0   1   1   1   0
D     0   1   0   0   1

A test instance X receives the predicted codeword 1 1 1 1 0 from the five classifiers and is assigned to the class with the nearest row of the matrix.
14.
- A single classifier learns a complex boundary once
- An ensemble learns a complex boundary multiple times
- ECOC learns a simple boundary multiple times
15. Questions?
- How well does it work?
- How long should the code be?
- Do we need a lot of training data?
- What kind of codes can we use?
- Are there intelligent ways of creating the code?
16. Previous Work
- Combine with Boosting: ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999)
- Local Learners (Ricci & Aha, 1997)
- Text Classification (Berger, 1999)
17. Experimental Setup
- Generate the code
- BCH Codes
- Choose a Base Learner
- Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam, 1998)
18. Dataset
- Industry Sector Dataset
- Consists of company web pages classified into 105 economic sectors
- Standard stoplist
- No stemming
- Skip all MIME headers and HTML tags
- Experimental approach similar to McCallum et al. (1998) for comparison purposes
19. Results
ECOC is 88% accurate!
Classification accuracies on five random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000.
20. Results

Industry Sector Data Set

Method         Naïve Bayes   Shrinkage [1]   ME [2]   ME w/ Prior [3]   ECOC (63-bit)
Accuracy (%)   66.1          76              79       81.1              88.5

ECOC reduces the error of the Naïve Bayes classifier by 66%.

[1] McCallum et al. (1998)   [2, 3] Nigam et al. (1999)
21. The Longer the Better!
Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain.
- Longer codes mean larger codeword separation
- The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C
- If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors (worked example below)
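For example, the 63-bit BCH code used in these experiments has minimum Hamming distance h = 31 (see the semi-theoretical model table later), so up to

$$
\left\lfloor \frac{h-1}{2} \right\rfloor
  = \left\lfloor \frac{31-1}{2} \right\rfloor
  = 15
$$

bit predictions can be wrong and the document is still decoded to the correct class.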
22. Size Matters?
23. Size does NOT matter!
24. Semi-Theoretical Model
- Model ECOC by a Binomial Distribution B(n, p)
- n = length of the code
- p = probability of each bit being classified incorrectly
25. Semi-Theoretical Model

# of Bits   Hmin   Emax   Pave   Accuracy
15          5      2      .85    .59
15          5      2      .89    .80
15          5      2      .91    .84
31          11     5      .85    .67
31          11     5      .89    .91
31          11     5      .91    .94
63          31     15     .89    .99

(Pave is the average per-bit accuracy of the binary classifiers, i.e. p = 1 - Pave.)
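A minimal sketch of this model, assuming the n bit predictions are independent and each is correct with probability Pave: the predicted multiclass accuracy is the probability that at most Emax = ⌊(Hmin - 1)/2⌋ bits are wrong. Function and variable names are mine; the computed values roughly track, but do not exactly reproduce, the Accuracy column above.

```python
from math import comb

def predicted_accuracy(n_bits, h_min, p_bit):
    """Binomial model: probability that at most Emax = floor((h_min - 1) / 2)
    of the n_bits independent bit predictions are wrong, given per-bit
    accuracy p_bit."""
    e_max = (h_min - 1) // 2
    p_err = 1.0 - p_bit
    return sum(comb(n_bits, k) * p_err ** k * (1.0 - p_err) ** (n_bits - k)
               for k in range(e_max + 1))

# Rows from the table above; output roughly tracks the Accuracy column (.59, .91, .99)
for n, h, p in [(15, 5, 0.85), (31, 11, 0.89), (63, 31, 0.89)]:
    print(n, h, p, round(predicted_accuracy(n, h, p), 2))
```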
28. [Diagram: groupings of the four newsgroup classes Alt.atheism, Comp.os.windows, Comp.sys.ibm.hardware, and Talk.misc.religion into singletons, pairs, and triples, illustrating the artificial "superclasses" created when the classes are partitioned into binary problems]
29. Types of Codes
[Diagram: taxonomy of codes: Types of Codes splits into Hand-Constructed (Algebraic, Random) and Adaptive]
30. What is a Good Code?
- Row Separation
- Column Separation (independence of errors for each binary classifier)
- Efficiency (for long codes)
31. Choosing Codes

             Random                        Algebraic
Row Sep      On average, for long codes    Guaranteed
Col Sep      On average, for long codes    Can be guaranteed
Efficiency   No                            Yes
32. Experimental Results

Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate (%)
15-bit BCH      5            15           49           64           20.6
19-bit Hybrid   5            18           15           69           22.3
15-bit Random   2 (1.5)      13           42           60           24.1
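A small sketch, not from the talk, of how the row and column separation statistics reported above can be computed for any code matrix; the toy 4 x 5 matrix simply reuses the style of the earlier picture slides.

```python
from itertools import combinations

def hd(u, v):
    """Hamming distance between two equal-length 0/1 sequences."""
    return sum(a != b for a, b in zip(u, v))

def separation(M):
    """Min/max pairwise Hamming distance over rows (codewords) and over
    columns (binary problems) of a code matrix M, given as a list of rows."""
    rows = M
    cols = list(zip(*M))
    row_d = [hd(u, v) for u, v in combinations(rows, 2)]
    col_d = [hd(u, v) for u, v in combinations(cols, 2)]
    return (min(row_d), max(row_d)), (min(col_d), max(col_d))

# Toy 4 x 5 matrix in the style of the earlier picture slides
M = [[0, 0, 1, 1, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 1, 0, 0, 1]]
print(separation(M))   # ((min, max) row HD, (min, max) column HD)
```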
33. Interesting Questions?
- NB does not give good probability estimates; does using ECOC result in better estimates?
- Assignment of codewords to classes?
- Can decoding be posed as a supervised learning task?
34. Drawbacks
- Can be computationally expensive
- Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
35. Current Work
- Combine ECOC with Co-Training to use unlabeled data
- Automatically construct optimal / adaptive codes
36. Conclusion
- Performs well on text classification tasks
- Can be used when training data is sparse
- Algebraic codes perform better than random codes for a given code length
- Hand-constructed codes may not be the answer
37. Background
- Co-training seems to be the way to go when there is (and maybe even when there isn't) a feature split in the data
- Reported results on co-training only deal with very small (toy) problems, mostly binary classification tasks (Blum & Mitchell 1998, Nigam & Ghani 2000)
38. Co-Training Challenge
- Task: Apply co-training to a 65-class dataset containing 130,000 training examples
- Result: Co-training fails!
39. Solution?
- ECOC seems to work well when there are a large number of classes
- ECOC decomposes a multiclass problem into several binary problems
- Co-training works well with binary problems
Combine ECOC and Co-Training
40. Algorithm
- Learn each bit for ECOC using a co-trained classifier (see the sketch below)
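A rough sketch of the idea, assuming the code matrix and the two feature views (title and description) from the dataset slides. The co_train_binary argument is a placeholder for a Blum & Mitchell-style co-training procedure, whose details the slides do not spell out; all names here are illustrative.

```python
def ecoc_cotrain(code_matrix, classes, labeled, unlabeled, co_train_binary):
    """Learn each bit of the ECOC codeword with a co-trained binary classifier.

    code_matrix[i][j]: bit j of the codeword for classes[i]
    labeled:   [((title_feats, desc_feats), class_label), ...]
    unlabeled: [(title_feats, desc_feats), ...]
    co_train_binary(labeled_bits, unlabeled) -> callable mapping x to 0/1
    """
    codeword = {c: code_matrix[i] for i, c in enumerate(classes)}
    bit_learners = [
        co_train_binary([(x, codeword[y][j]) for x, y in labeled], unlabeled)
        for j in range(len(code_matrix[0]))
    ]

    def classify(x):
        word = [learn(x) for learn in bit_learners]               # predicted codeword
        dist = lambda cw: sum(a != b for a, b in zip(cw, word))   # Hamming distance
        return min(classes, key=lambda c: dist(codeword[c]))      # nearest codeword

    return classify
```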
41. Dataset (Job Descriptions)
- 65 classes
- 32,000 examples
- Two feature sets
- Title
- Description
43. Results
- 10% train, 50% unlabeled, 40% test
- NB: 40.3%
- ECOC: 48.9%
- EM: 30.83%
- CoTraining
- ECOC-EM
- ECOC-Cotrain
- ECOC-CoEM