Hu Guan, Jingyu Zhou, Minyi Guo - PowerPoint PPT Presentation

1
A Class-Feature-Centroid Classifier for Text
Categorization
  • Hu Guan, Jingyu Zhou, Minyi Guo
  • cs.sjtu.edu.cn
  • WWW 2009

2
Centroid-Based TC
3
Centroid-Based TC
  • Classification: compare a document with each class centroid
    • Dot product
    • Pearson correlation coefficient
    • Euclidean-based similarity
  • Depends strongly on the quality of the centroids
  • Adjust term weights
    • 4 refs
    • Better performance!
  • Above approaches
    • Start from the same initial weights and adjust them during training.
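The classification step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the arithmetical-average centroid (AAC) and cosine similarity, and the class names and vectors are hypothetical toy data.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def average_centroid(docs):
    """Arithmetical average of a class's document vectors (AAC-style centroid)."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def cosine(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def classify(doc, centroids):
    """Return the label of the centroid most similar to doc."""
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))

# Hypothetical toy data: two classes over a three-term vocabulary.
training = {
    "sports": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "finance": [[0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],
}
centroids = {label: average_centroid(docs) for label, docs in training.items()}
predicted = classify([0.7, 0.3, 0.0], centroids)
```

Everything about the result depends on how the centroid vectors are built, which is why the quality of the centroids matters so much.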

4
Motivation
  • Try to obtain good centroids at the beginning.
  • Main difference between CFC and others

5
Design - Weight
Inter-Class
Inner-Class
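The weight formula on this slide did not survive in the transcript. The sketch below assumes the form w = b^(DF / |C_j|) × log(|C| / CF), where the first factor is the inner-class part (DF is the number of documents in class j containing the term, |C_j| the class size) and the second is the inter-class part (|C| is the number of classes, CF the number of classes containing the term). That form, the function name `cfc_centroids`, and the default `b=2.0` are assumptions for illustration; check them against the paper.

```python
import math
from collections import Counter

def cfc_centroids(docs_by_class, b=2.0):
    """Build one weighted centroid per class (assumed CFC-style weighting).

    docs_by_class maps a class label to a list of documents,
    each document being a list of terms. b is a constant > 1
    (2.0 is an arbitrary placeholder, not a tuned value).
    """
    num_classes = len(docs_by_class)
    # Inner-class statistic: per class, how many documents contain each term.
    df = {
        label: Counter(term for doc in docs for term in set(doc))
        for label, docs in docs_by_class.items()
    }
    # Inter-class statistic: how many classes contain each term.
    cf = Counter(term for counts in df.values() for term in counts)
    centroids = {}
    for label, docs in docs_by_class.items():
        size = len(docs)
        centroids[label] = {
            term: (b ** (count / size)) * math.log(num_classes / cf[term])
            for term, count in df[label].items()
        }
    return centroids

# Hypothetical toy corpus: "y" occurs in every class, so its weight is zero.
toy = {"a": [["x", "y"], ["x"]], "b": [["y", "z"]]}
weights = cfc_centroids(toy)
```

Under this assumed form, a term that appears in every class gets zero weight, and a term concentrated in one class gets a large one, with no training pass needed.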
6
Design - Measure
7
Denormalized Measure
  • Unfair?
  • Large norm -> more false positives.
  • This happens.
  • Not only in this way.
  • Also happens for AAC or CGC.
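To make the fairness concern concrete, here is a small comparison. It assumes the denormalized measure is the dot product with only the document vector normalized (the centroid keeps its norm); that definition and the two centroids are assumptions for illustration, and only the document vector (0.6, 0.8, 0) is borrowed from the example slides.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(doc, centroid):
    """Standard cosine: both vectors are normalized."""
    return dot(doc, centroid) / (norm(doc) * norm(centroid))

def denormalized(doc, centroid):
    """Assumed denormalized measure: only the document is normalized."""
    return dot(doc, centroid) / norm(doc)

doc = [0.6, 0.8, 0.0]
small_centroid = [0.0, 1.0, 0.0]  # hypothetical, norm 1
large_centroid = [3.0, 0.0, 0.0]  # hypothetical, norm 3

by_cosine = (cosine(doc, small_centroid), cosine(doc, large_centroid))
by_denormalized = (denormalized(doc, small_centroid), denormalized(doc, large_centroid))
```

With cosine the small centroid wins (about 0.8 versus 0.6); with the denormalized measure the large-norm centroid wins (about 1.8 versus 0.8). A centroid with a large norm can therefore attract documents it should not, which is the false-positive risk the slide raises.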

8
Denormalized Measure
9
Example
  • 14 documents (1 term)
  • 4 classes. 3 terms.

10
Example
  • 0.6, 0.8, 0

11
Example
  • 0.6, 0.8, 0

Classifier   C1    C2    C3    C4
AAC          0.8   0.8   0.8   0.2721
CFC          0     0     0     0.9267
12
Example
  • AAC should favor popular words.
  • CFC should favor rare terms.

Classifier   C1    C2    C3    C4
AAC          0.8   0.8   0.8   0.2721
CFC          0     0     0     0.9267
13
Datasets
Datasets             Reuters-21578   20-newsgroup
Categories           52              20
Unigram              11430           29557
Training             6495            13272
Testing              2557            6627
Class distribution   Skewed          Balanced
14
Evaluation
  • AAC
  • CGC
  • CFC
  • SVM
    • SVMLight
    • SVMTorch
    • LibSVM

15
Evaluation
  • Micro-averaged F1 (μ-F1)
  • Macro-averaged F1 (M-F1)
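For reference, the two averages differ in what they pool. The sketch below assumes single-label classification (one true label and one predicted label per document); the function names and the toy labels are illustrative only.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro-averaged F1, macro-averaged F1) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # guessed class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Hypothetical labels: class "a" is three times as frequent as class "b".
micro, macro = micro_macro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Micro-averaging is dominated by the large classes, while macro-averaging gives every class equal weight, so the two can diverge noticeably on a skewed dataset.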

16
Overall Performance
17
Overall Performance
Sparse Data
18
Study Sparse Data
19
Study Sufficient Data
20
Discussion
  • Why does SVM lose? Possibly:
    • Skewed data
    • No feature selection
    • Simple tokenizer
  • This also hints that CFC has discriminative ability on raw
    terms.

Denormalized Measure?
Why?
21
Study Denormalized Measure
22
Study Denormalized Measure
  • on CGC

23
Study Weight, Parameter
24
Conclusions
  • CFC outperforms SVM and centroid-based
    approaches.
  • When data is sparse, CFC is much better.
  • Novel centroid weighting: inner-class and inter-class.
  • Denormalized measure.
  • Source code address correction
    • http://202.120.40.15/jzhou/research/cfc