Hierarchical Classification of Real Life Documents

Slides: 22
Provided by: szh3
1
Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou (Simon Fraser University)
Yu He (National University of Singapore)
2
Hierarchical & Multi-classed Documents
  • Topics are organized into a hierarchy of
    increasing specificity
  • A document is classified into all relevant
    classes.
  • For example, a document on Dance could be reached
    from both the Arts:Performing_Arts and Recreation
    topics in Yahoo

3
New Issues
  • Misclassification is non-symmetric,
    e.g., Travel → Outdoor vs. Travel → Software
  • Documents are multi-classed;
    traditionally, only one class is attached to a document
  • Class space is sparse:
    2^k - 1 subsets of classes for k classes
  • Exploring the similarities between classes

4
A New Classification Model
  • The model of documents
  • A document has the form (t1, t2, ..., tn; C1, ..., Ck),
    where t1, t2, ..., tn are keywords and C1, ..., Ck are
    classes from a given class hierarchy
  • {C1, ..., Ck} is called a classset (CS)
  • Construct a classifier consisting of rules of the form
    {ti1, ..., tip} → {Ci1, ..., Cip} that assigns a good
    classset to a given new document

5
Class Similarity
  • Two classsets are similar if they "cover" similar
    documents.
  • Anc(CS): the set of classes in a classset CS plus
    all their ancestor classes.
  • CS1 is more general than CS2 if Anc(CS1) ⊆
    Anc(CS2)
  • {Dance} is more general than {Fast-Dance, Music}
    because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
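
The "more general" test above can be sketched in a few lines. This is only an illustration: the toy hierarchy (in particular, where Music and Fast-Dance sit) and the names `PARENT`, `anc`, and `more_general` are assumptions made for the example, not taken from the slides.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Performing_Arts",
}


def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        # Walk up the hierarchy; stop early once we reach a class already seen,
        # because its ancestors have been added too.
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result


def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)


print(more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))  # False
```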

6
Class Similarity (Cont.)
  • A document d is covered by a classset CS if CS is
    more general than the classset of d
  • Cover(CS) denotes the set of documents covered by
    CS
  • Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)
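
A small sketch of Cover(CS), checking the identity on this slide against toy data. The hierarchy, the four documents, and the helper names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Performing_Arts",
    "Recreation": None,
    "Travel": "Recreation",
}

# Hypothetical documents: document id -> classset of that document.
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Dance", "Music"},
    "d3": {"Music"},
    "d4": {"Travel"},
}


def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result


def cover(classset):
    """Documents d such that `classset` is more general than the classset of d."""
    target = anc(classset)
    return {d for d, cs_d in DOCS.items() if target <= anc(cs_d)}


print(sorted(cover({"Dance"})))           # ['d1', 'd2']
print(sorted(cover({"Music"})))           # ['d2', 'd3']
print(sorted(cover({"Dance", "Music"})))  # ['d2']
```

On this data, the intersection of the first two covers equals the cover of the union classset, as the identity states.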

7
Class Similarity (Cont.)
  • The dissimilarity of CS1 and CS2 is defined as
    the normalized difference of their coverage:
    $E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) \setminus \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) \setminus \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$
  • The similarity is defined as 1 - E(CS1, CS2)
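
As a worked example, the dissimilarity can be computed directly from two cover sets. The function name, the document ids, and the handling of two empty covers are assumptions made for this sketch.

```python
def dissimilarity(cover1, cover2):
    """E(CS1, CS2): size of the symmetric difference over size of the union."""
    union = cover1 | cover2
    if not union:
        # Assumption: two empty covers are treated as identical (E = 0);
        # the slides do not say how this case is handled.
        return 0.0
    return (len(cover2 - cover1) + len(cover1 - cover2)) / len(union)


# Hypothetical covers of two classsets.
cover_dance = {"d1", "d2"}
cover_music = {"d2", "d3"}

e = dissimilarity(cover_dance, cover_music)
print(e)      # 2/3: two documents differ, three are covered in total
print(1 - e)  # the similarity, 1/3
```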

8
The Confidence
  • Match(T → CS): the set of documents that contain
    all the terms in T.
  • The confidence of T → CS is defined as

$$\mathrm{Conf}_g(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$
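
A minimal sketch of this confidence for one fixed classset CS. The documents, their terms, and the precomputed dissimilarity values in `E_TO_CS` are hypothetical; in practice E(CS_d, CS) would come from the coverage-based definition on the previous slide.

```python
# Hypothetical data: document id -> set of terms the document contains.
DOC_TERMS = {
    "d1": {"dance", "music"},
    "d2": {"dance", "stage"},
    "d3": {"dance", "hotel"},
    "d4": {"hotel"},
}

# Hypothetical, precomputed dissimilarities E(CS_d, CS) for one fixed classset CS.
E_TO_CS = {"d1": 0.0, "d2": 0.5, "d3": 1.0, "d4": 1.0}


def match(terms):
    """Documents that contain all the terms in `terms`."""
    return {d for d, doc_terms in DOC_TERMS.items() if terms <= doc_terms}


def conf_g(terms):
    """(|Match| - sum of E(CS_d, CS) over matched documents) / |Match|."""
    matched = match(terms)
    if not matched:
        raise ValueError("no document matches the rule body")
    return (len(matched) - sum(E_TO_CS[d] for d in matched)) / len(matched)


print(sorted(match({"dance"})))  # ['d1', 'd2', 'd3']
print(conf_g({"dance"}))         # (3 - 1.5) / 3 = 0.5
```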
9
What's behind Confg?
  • Intuitively, Confg(T → CS) measures the average
    similarity between CS and the classsets of the
    documents that match T → CS.
  • If E(CSd, CS) is binary, i.e., 1 or 0, Confg(T → CS)
    degenerates to the standard confidence.
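
The binary case can be written out in one line. The sketch assumes that "binary" means E(CS_d, CS) = 0 when the classset of d agrees with CS and 1 otherwise, and it writes M for Match(T → CS); both are conventions chosen for this derivation.

```latex
\mathrm{Conf}_g(T \to CS)
  = \frac{|M| - \sum_{d \in M} E(CS_d, CS)}{|M|}
  = \frac{|M| - |\{d \in M : E(CS_d, CS) = 1\}|}{|M|}
  = \frac{|\{d \in M : E(CS_d, CS) = 0\}|}{|M|}
```

The last expression is the fraction of documents matching T whose classset agrees with CS, which is the standard confidence of an association rule.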

10
(No Transcript)
11
(No Transcript)
12
Construction of Classifier
  • Step 1: Find association rules
  • Generate all association rules of the form T → CS
    that satisfy some user-specified minimum support
    and confidence.
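
A brute-force sketch of Step 1 on toy data. Several simplifications here are assumptions, not the authors' method: it uses the standard confidence (binary E, exact classset match) rather than Confg, it defines support as the fraction of all documents that match T and carry exactly CS, and it enumerates term sets exhaustively instead of using an efficient miner. The names `mine_rules` and `max_terms` are hypothetical.

```python
from itertools import combinations


def mine_rules(docs, min_support, min_conf, max_terms=2):
    """Return (T, CS, support, confidence) for every rule passing both thresholds.

    `docs` is a list of (terms, classset) pairs, both frozensets.
    """
    vocab = sorted(set().union(*(terms for terms, _ in docs)))
    classsets = {cs for _, cs in docs}
    rules = []
    for size in range(1, max_terms + 1):
        for combo in combinations(vocab, size):
            body = frozenset(combo)
            # Classsets of the documents that contain every term of the body.
            matched = [cs for terms, cs in docs if body <= terms]
            for cs in classsets:
                hits = sum(1 for cs_d in matched if cs_d == cs)
                if hits == 0:
                    continue
                support = hits / len(docs)
                confidence = hits / len(matched)
                if support >= min_support and confidence >= min_conf:
                    rules.append((body, cs, support, confidence))
    return rules


DANCE, TRAVEL = frozenset({"Dance"}), frozenset({"Travel"})
docs = [
    (frozenset({"dance", "music"}), DANCE),
    (frozenset({"dance"}), DANCE),
    (frozenset({"hotel"}), TRAVEL),
    (frozenset({"dance", "hotel"}), TRAVEL),
]
for body, cs, support, confidence in mine_rules(docs, 0.5, 0.6):
    print(sorted(body), "->", sorted(cs), support, round(confidence, 3))
```

With these thresholds only two rules survive on the toy data: {dance} → {Dance} and {hotel} → {Travel}.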

13
Construction of Classifier (Cont.)
  • Step 2: Rank the rules
  • A document is classified by the matching rule
    that has the highest confidence.
  • This selection is called most confidence first
    (MCF)
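
The MCF selection can be sketched as follows. The `Rule` type, the example rules and their confidence values, the empty default classset, and the tie-breaking (the first rule with the maximal confidence wins) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    terms: frozenset      # rule body T
    classset: frozenset   # rule head CS
    confidence: float     # confidence of T -> CS


def classify_mcf(doc_terms, rules, default=frozenset()):
    """Return the classset of the matching rule with the highest confidence."""
    matching = [r for r in rules if r.terms <= doc_terms]
    if not matching:
        return default  # assumption: fall back to a default classset
    return max(matching, key=lambda r: r.confidence).classset


rules = [
    Rule(frozenset({"dance"}), frozenset({"Dance"}), 0.70),
    Rule(frozenset({"dance", "tempo"}), frozenset({"Fast-Dance", "Music"}), 0.85),
    Rule(frozenset({"hotel"}), frozenset({"Travel"}), 0.90),
]

print(sorted(classify_mcf({"dance", "tempo", "club"}, rules)))  # ['Fast-Dance', 'Music']
print(sorted(classify_mcf({"dance", "stage"}, rules)))          # ['Dance']
```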

14
Construction of Classifier (Cont.)
  • Step 3: Remove rules of low accuracy
  • Let D be the set of training documents
    classified by rule T → CS. The accuracy of T → CS,
    Accu(T → CS), is defined with respect to D.

15
Construction of Classifier (Cont.)
  • Confg(T → CS) is defined with respect to all the
    documents that match the rule, whereas
    Accu(T → CS) is defined with respect to the documents
    classified by the rule.
  • Remove the rules with accuracy below a certain
    threshold because they contribute negatively to
    overall accuracy.

16
Construction of Classifier (Cont.)
  • Step 4: Cut off the ranked list
  • If we cut off the list of rules r1, ..., rm after the
    first i rules, r1, ..., ri, the cutoff error is
    PrefixError(ri) + DefaultError(ri)
  • PrefixError(ri) is the sum of the rule errors
    Error(rj) for all rules rj, 1 ≤ j ≤ i
  • DefaultError(ri) is the error caused by assigning
    the default classset to all the documents not
    classified by any rule rj, 1 ≤ j ≤ i
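
A short sketch of the cutoff error for every prefix of the ranked list. The error values are made up, and choosing the prefix with the smallest cutoff error is an assumption about how the cutoff point is picked; the slide only defines the error.

```python
from itertools import accumulate


def cutoff_errors(rule_errors, default_errors):
    """Cutoff error for each prefix r1..ri.

    rule_errors[i - 1] is Error(ri); default_errors[i - 1] is DefaultError(ri).
    """
    prefix_errors = accumulate(rule_errors)  # PrefixError(ri) as a running sum
    return [p + d for p, d in zip(prefix_errors, default_errors)]


# Hypothetical per-rule errors and default errors for a ranked list of 4 rules.
rule_errors = [1.0, 0.5, 2.0, 3.0]
default_errors = [10.0, 6.0, 5.5, 5.0]

errors = cutoff_errors(rule_errors, default_errors)
print(errors)  # [11.0, 7.5, 9.0, 11.5]

# Assumption: keep the prefix with the smallest cutoff error (1-based index).
best_i = min(range(len(errors)), key=errors.__getitem__) + 1
print(best_i)  # 2
```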

17
Experiments
18
Experimental Results
  • Results on the IBM data set
  • Error: Coverage beats the others.
  • Size: Confidence gets smaller.
  • Time: Coverage takes longer.

19
(No Transcript)
20
Classification Error
21
Size & Execution Time