Title: Hierarchical Classification of Real Life Documents
1Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou Simon Fraser University
Yu He National University of Singapore
2Hierarchical Multi-classed Documents
- Topics are organized into a hierarchy of increasing specificity.
- A document is classified into all relevant classes.
- For example, a document on Dance could be reached from both the Arts/Performing_Arts and Recreation topics in Yahoo.
3New Issues
- Misclassification is non-symmetric
- Travel → Outdoor vs. Travel → Software
- Documents are multi-classed
- Traditional approach: only one class is attached to each document
- Class space is sparse
- $2^k - 1$ subsets of classes for $k$ classes
- Exploring the similarities between classes
4A New Classification Model
- The model of documents
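- Each document is a pair $(\{t_1, t_2, \ldots, t_n\}, \{C_1, \ldots, C_k\})$, where $t_1, t_2, \ldots, t_n$ are keywords and $C_1, \ldots, C_k$ are classes from a given class hierarchy
- $\{C_1, \ldots, C_k\}$ is called a classset (CS)
- Construct a classifier
- consisting of rules of the form $\{t_{i_1}, \ldots, t_{i_p}\} \to \{C_{i_1}, \ldots, C_{i_p}\}$ that assigns a good classset to a given new document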
5Class Similarity
- Two classsets are similar if they cover similar documents.
- $\mathrm{Anc}(CS)$: the set of classes in a classset $CS$ plus all their ancestor classes.
- $CS_1$ is more general than $CS_2$ if $\mathrm{Anc}(CS_1) \subseteq \mathrm{Anc}(CS_2)$
- {Dance} is more general than {Fast-Dance, Music} because $\mathrm{Anc}(\{\text{Dance}\}) \subseteq \mathrm{Anc}(\{\text{Fast-Dance}, \text{Music}\})$
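The "more general" relation can be checked mechanically. The following is a minimal sketch, assuming the hierarchy is stored as a child-to-parent map; the `PARENT` table and the function names `anc` and `more_general` are illustrative and not taken from the slides.

```python
# Toy hierarchy (hypothetical): each class maps to its parent, None for a root.
PARENT = {
    "Arts": None,
    "Dance": "Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        # Walk up to the root; stop early if this branch was already visited.
        while cls is not None and cls not in result:
            result.add(cls)
            cls = PARENT[cls]
    return result

def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)

print(more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))  # False
```

Under this toy hierarchy, Anc({Dance}) is {Dance, Arts}, which is contained in Anc({Fast-Dance, Music}), so the first check succeeds and the reverse check fails.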
6Class Similarity (Cont.)
- A document $d$ is covered by a classset $CS$ if $CS$ is more general than the classset of $d$
- $\mathrm{Cover}(CS)$ denotes the set of documents covered by $CS$
- $\mathrm{Cover}(CS_1) \cap \mathrm{Cover}(CS_2) = \mathrm{Cover}(CS_1 \cup CS_2)$
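The coverage identity can be verified on a small example. This sketch assumes a hypothetical four-class hierarchy and four hypothetical training documents; `PARENT`, `DOCS`, `anc`, and `cover` are names introduced here for illustration.

```python
# Hypothetical hierarchy: class -> parent (None for the root).
PARENT = {"Arts": None, "Dance": "Arts", "Fast-Dance": "Dance", "Music": "Arts"}

# Hypothetical training documents: document id -> its classset.
DOCS = {
    "d1": {"Dance"},
    "d2": {"Fast-Dance"},
    "d3": {"Fast-Dance", "Music"},
    "d4": {"Music"},
}

def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for cls in classset:
        while cls is not None:
            result.add(cls)
            cls = PARENT[cls]
    return result

def cover(classset):
    """Documents d such that `classset` is more general than the classset of d."""
    target = anc(classset)
    return {d for d, cs in DOCS.items() if target <= anc(cs)}

cs1, cs2 = {"Dance"}, {"Music"}
print(sorted(cover(cs1)))  # ['d1', 'd2', 'd3']
print(sorted(cover(cs2)))  # ['d3', 'd4']
print(cover(cs1) & cover(cs2) == cover(cs1 | cs2))  # True
```

Only `d3` is classified under both Dance and Music, so it is the only document in the intersection and the only one covered by the union classset.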
7Class Similarity (Cont.)
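- The dissimilarity of $CS_1$ and $CS_2$ is defined as the normalized difference of their coverage:
$$E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) \setminus \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) \setminus \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$$
- The similarity is defined as $1 - E(CS_1, CS_2)$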
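As a worked example of the dissimilarity, the sketch below operates directly on two coverage sets. The document ids are made up, and returning 0 for two empty coverages is an assumption, since the slides do not address that case.

```python
def dissimilarity(cover1, cover2):
    """E = (|C2 - C1| + |C1 - C2|) / |C1 union C2| for two coverage sets."""
    union = cover1 | cover2
    if not union:
        return 0.0  # assumption: two empty coverages count as identical
    return (len(cover2 - cover1) + len(cover1 - cover2)) / len(union)

# Hypothetical coverages of two classsets.
cover_dance = {"d1", "d2", "d3"}
cover_music = {"d3", "d4"}

e = dissimilarity(cover_dance, cover_music)
print(e)      # 0.75
print(1 - e)  # 0.25
```

Three of the four documents in the union are covered by only one of the two classsets, which gives E = 3/4 and a similarity of 1/4.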
8The Confidence
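- $\mathrm{Match}(T \to CS)$: the set of documents that contain all the terms in $T$.
- The confidence of $T \to CS$ is defined as
$$\mathrm{Conf}_g(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$
where $CS_d$ is the classset of document $d$.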
9What's behind $\mathrm{Conf}_g$?
- Intuitively, $\mathrm{Conf}_g(T \to CS)$ measures the average similarity between $CS$ and the classsets of the documents that match $T \to CS$.
- If $E(CS_d, CS)$ is binary, i.e., 1 or 0, $\mathrm{Conf}_g(T \to CS)$ degenerates to the standard confidence.
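A small numeric sketch makes both points concrete. It assumes the dissimilarities E(CS_d, CS) of the matching documents have already been computed; the function name `conf_g` and the sample values are illustrative only.

```python
def conf_g(errors):
    """Conf_g = (|Match| - sum of E(CS_d, CS)) / |Match|.

    `errors` holds E(CS_d, CS) for every document d that matches the rule.
    """
    if not errors:
        raise ValueError("the rule matches no documents")
    return (len(errors) - sum(errors)) / len(errors)

# Graded dissimilarities: the result is the average similarity 1 - E.
print(conf_g([0.0, 0.25, 0.75, 1.0]))  # 0.5

# Binary dissimilarities: 3 of 4 matching documents have exactly the
# classset CS, so the result is the standard confidence 3/4.
print(conf_g([0, 0, 0, 1]))  # 0.75
```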
12Construction of Classifier
- Step 1: find association rules
- Generate all association rules of the form $T \to CS$ that satisfy some user-specified minimum support and confidence.
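Step 1 can be illustrated with a deliberately simplified sketch. It is not the authors' mining procedure: it enumerates term sets by brute force rather than with an Apriori-style algorithm, and it uses the standard support and the binary (standard) confidence instead of the generalized confidence. The training data, the `max_terms` limit, and the function name `mine_rules` are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical training set: (keywords, classset) per document.
DOCS = [
    ({"ballet", "stage"}, frozenset({"Dance"})),
    ({"ballet", "tempo"}, frozenset({"Dance", "Music"})),
    ({"ballet", "stage"}, frozenset({"Dance"})),
    ({"tempo"}, frozenset({"Music"})),
]

def mine_rules(docs, min_support, min_confidence, max_terms=2):
    """Brute-force search for rules T -> CS (illustration only)."""
    vocabulary = sorted(set().union(*(terms for terms, _ in docs)))
    rules = []
    for size in range(1, max_terms + 1):
        for combo in combinations(vocabulary, size):
            # Classsets of the documents that contain every term in T.
            matched = [cs for terms, cs in docs if set(combo) <= terms]
            for cs in set(matched):
                hits = matched.count(cs)
                support = hits / len(docs)        # documents with T and CS
                confidence = hits / len(matched)  # binary-E confidence
                if support >= min_support and confidence >= min_confidence:
                    rules.append((combo, cs, confidence))
    return rules

rules = mine_rules(DOCS, min_support=0.5, min_confidence=0.6)
for terms, cs, conf in rules:
    print(terms, sorted(cs), round(conf, 2))
```

Only candidate classsets that actually occur among the matching documents are tried as rule consequents here; a fuller implementation would also consider more general classsets from the hierarchy.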
13Construction of Classifier (Cont.)
- Step 2: rank the rules
- A document is classified by the matching rule that has the highest confidence.
- This selection is called most confidence first (MCF)
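MCF selection can be sketched as follows, assuming each rule is stored as a (terms, classset, confidence) tuple and that a default classset is returned when no rule matches. The rule list, the default, and the function name `classify_mcf` are hypothetical.

```python
def classify_mcf(document_terms, rules, default_classset):
    """Return the classset of the matching rule with the highest confidence."""
    # A rule matches when all of its terms occur in the document.
    matching = [rule for rule in rules if rule[0] <= document_terms]
    if not matching:
        return default_classset
    # On ties, max() keeps the rule that appears first in the list.
    return max(matching, key=lambda rule: rule[2])[1]

# Hypothetical rules: (terms, classset, confidence).
RULES = [
    ({"ballet"}, {"Dance"}, 0.67),
    ({"ballet", "stage"}, {"Dance", "Theatre"}, 0.90),
    ({"tempo"}, {"Music"}, 0.80),
]

print(sorted(classify_mcf({"ballet", "stage", "review"}, RULES, {"Arts"})))
# ['Dance', 'Theatre']
print(sorted(classify_mcf({"recipe"}, RULES, {"Arts"})))
# ['Arts']
```

Two rules match the first document, and the one with confidence 0.90 wins even though the other rule appears earlier in the list.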
14Construction of Classifier (Cont.)
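- Step 3: remove rules of low accuracy
- Let $D$ be the set of training documents classified by rule $T \to CS$. The accuracy of $T \to CS$ is defined analogously to $\mathrm{Conf}_g$, with $D$ in place of the matching documents:
$$\mathrm{Accu}(T \to CS) = \frac{|D| - \sum_{d \in D} E(CS_d, CS)}{|D|}$$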
15Construction of Classifier (Cont.)
- $\mathrm{Conf}_g(T \to CS)$ is defined with respect to all the documents that match the rule, whereas $\mathrm{Accu}(T \to CS)$ is defined with respect to the documents classified by the rule.
- Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
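The difference between matching and classifying can be seen in a short sketch. It rests on two assumptions that are not confirmed by the transcript: accuracy is taken to be the average of 1 - E over the documents a rule actually classifies (mirroring the confidence definition), and a binary dissimilarity stands in for E. The rules, documents, 0.6 threshold, and function names are all hypothetical.

```python
def rule_accuracies(rules, docs, dissim):
    """Accuracy of each rule over the documents it actually classifies.

    `rules` must already be sorted by decreasing confidence (MCF order).
    Assumption: accuracy is the average of 1 - E over those documents.
    Rules that classify no document get None.
    """
    similarity_sums = [0.0] * len(rules)
    counts = [0] * len(rules)
    for terms, doc_cs in docs:
        for i, (rule_terms, rule_cs) in enumerate(rules):
            if rule_terms <= terms:  # first match in MCF order classifies d
                similarity_sums[i] += 1.0 - dissim(doc_cs, rule_cs)
                counts[i] += 1
                break
    return [s / n if n else None for s, n in zip(similarity_sums, counts)]

def binary_dissim(cs1, cs2):
    """Stand-in for E: 0 if the classsets are equal, 1 otherwise."""
    return 0.0 if cs1 == cs2 else 1.0

RULES = [({"ballet", "stage"}, {"Dance"}), ({"ballet"}, {"Music"})]
DOCS = [
    ({"ballet", "stage"}, {"Dance"}),
    ({"ballet", "stage", "tempo"}, {"Dance"}),
    ({"ballet", "tempo"}, {"Dance", "Music"}),
    ({"ballet"}, {"Music"}),
]

accuracies = rule_accuracies(RULES, DOCS, binary_dissim)
print(accuracies)  # [1.0, 0.5]

# Keep only rules whose accuracy reaches the (hypothetical) threshold.
kept = [r for r, a in zip(RULES, accuracies) if a is not None and a >= 0.6]
print(len(kept))  # 1
```

The second rule matches all four documents but classifies only the two that the first rule leaves unmatched, so its accuracy is computed over those two alone.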
16Construction of Classifier (Cont.)
- Step 4: cut off the ranked list
- If we cut off the list of rules $r_1, \ldots, r_m$ after the first $i$ rules, $r_1, \ldots, r_i$:
- Cutoff error = $\mathrm{PrefixError}(r_i) + \mathrm{DefaultError}(r_i)$
- $\mathrm{PrefixError}(r_i)$ is the sum of the rule errors $\mathrm{Error}(r_j)$ for all rules $r_j$, $1 \le j \le i$
- $\mathrm{DefaultError}(r_i)$ is the error caused by assigning the default classset to all the documents not classified by any rule $r_j$
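The cutoff computation is simple arithmetic once the per-rule and default errors are known. The sketch below assumes that the cutoff is placed where the cutoff error is smallest and that keeping no rules at all (i = 0) is allowed; neither choice is stated on the slide, and the error values are invented.

```python
def best_cutoff(rule_errors, default_errors):
    """Pick the prefix length i minimizing PrefixError(r_i) + DefaultError(r_i).

    rule_errors[j - 1] is Error(r_j) for j = 1..m.
    default_errors[i] is the default error when the first i rules are kept,
    for i = 0..m (so it has one more entry than rule_errors).
    """
    best_i, best_error = 0, default_errors[0]
    prefix = 0.0
    for i, err in enumerate(rule_errors, start=1):
        prefix += err  # running PrefixError(r_i)
        total = prefix + default_errors[i]
        if total < best_error:
            best_i, best_error = i, total
    return best_i, best_error

# Hypothetical errors for four ranked rules.
print(best_cutoff([1.0, 0.5, 2.0, 1.5], [10.0, 6.0, 4.0, 3.5, 3.0]))
# (2, 5.5)
```

With these numbers the cutoff errors for i = 0..4 are 10.0, 7.0, 5.5, 7.0, and 8.0, so the list is cut after the second rule.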
17Experiments
18Experimental Results
- The result on the IBM data set
- The error: Coverage beats the others.
- The size: Confidence is smaller.
- The time: Coverage takes longer.
20Classification Error
21Size and Execution Time