Hu Guan, Jingyu Zhou, Minyi Guo

1
A Class-Feature-Centroid Classifier for Text Categorization

Hu Guan, Jingyu Zhou, Minyi Guo
cs.sjtu.edu.cn
WWW 2009

2
Centroid-Based TC
3
Centroid-Based TC

Classification
Dot product
Pearson Correlation Coefficients
Euclidean-based similarity
Depends strongly on the quality of centroid
Adjust term weight
4 refs
Better performance!
Above approaches
Initialize all terms with the same weight, then adjust during training.
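
A minimal sketch of the centroid-based classification described on this slide: each class is represented by the average of its training vectors, and a document goes to the class whose centroid is most similar. The function names, the toy vectors, and the 1 / (1 + distance) form of the Euclidean-based similarity are assumptions for illustration, not taken from the slides.

```python
import math

def centroid(vectors):
    """Arithmetic average of equal-length term-weight vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    """Dot-product similarity."""
    return sum(a * b for a, b in zip(u, v))

def pearson(u, v):
    """Pearson correlation coefficient between two vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cu = [a - mu for a in u]
    cv = [b - mv for b in v]
    denom = math.sqrt(dot(cu, cu)) * math.sqrt(dot(cv, cv))
    return dot(cu, cv) / denom if denom else 0.0

def euclidean_similarity(u, v):
    """Assumed form: similarity shrinks as Euclidean distance grows."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)

def classify(doc, centroids, sim=dot):
    """Return the label whose centroid is most similar to doc."""
    return max(centroids, key=lambda label: sim(doc, centroids[label]))

# Hypothetical toy training data: two classes, three terms.
train = {
    "sports": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "tech": [[0.0, 1.0, 1.0], [0.0, 0.9, 0.7]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
query = [0.9, 0.1, 0.0]
predictions = {
    s.__name__: classify(query, centroids, s)
    for s in (dot, pearson, euclidean_similarity)
}
```

Whatever similarity is chosen, the prediction is only as good as the centroids themselves, which is the dependence this slide points out.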

4
Motivation

Try to obtain good centroids at the beginning.
Main difference between CFC and others

5
Design - Weight
Inter-Class
Inner-Class
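
The transcript names the two weight factors but not the formula. The sketch below assumes a weight of the form b^(DF / |C_j|) * log(|C| / CF), where DF is the number of documents in class j containing the term (inner-class factor), |C_j| is the number of documents in class j, |C| is the number of classes, and CF is the number of classes containing the term (inter-class factor). That form, the base b, and the name `cfc_weights` are assumptions and should be checked against the paper.

```python
import math

def cfc_weights(class_docs, b=math.e):
    """class_docs maps a label to a list of documents (each a set of terms).

    Returns {label: {term: weight}} under the assumed weighting
    b ** (DF / |C_j|) * log(|C| / CF).
    """
    num_classes = len(class_docs)

    # Inner-class statistic: document frequency of each term per class.
    df = {label: {} for label in class_docs}
    for label, docs in class_docs.items():
        for doc in docs:
            for term in set(doc):
                df[label][term] = df[label].get(term, 0) + 1

    # Inter-class statistic: number of classes that contain each term.
    cf = {}
    for label in df:
        for term in df[label]:
            cf[term] = cf.get(term, 0) + 1

    weights = {}
    for label, docs in class_docs.items():
        weights[label] = {
            term: (b ** (count / len(docs))) * math.log(num_classes / cf[term])
            for term, count in df[label].items()
        }
    return weights

# Hypothetical toy corpus: "x" occurs in both classes, "y" and "z" in one each.
toy = {
    "A": [{"x", "y"}, {"x"}],
    "B": [{"x"}, {"z"}],
}
toy_weights = cfc_weights(toy)
```

Under this assumed form, a term that appears in every class gets weight zero because log(|C| / CF) vanishes, while a term confined to one class keeps a positive weight.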
6
Design - Measure
7
Denormalized Measure

Unfair?
Large norm => more false positives.
This happens.
Not only in this way.
Also happens for AAC or CGC.
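
A small sketch of the concern raised here, assuming "denormalized" means that the centroid's norm is left out of the cosine denominator (only the document vector is normalized). The vectors and function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(doc, cen):
    """Standard cosine: both vectors are normalized."""
    return dot(doc, cen) / (norm(doc) * norm(cen))

def denormalized(doc, cen):
    """Assumed denormalized measure: only the document is normalized."""
    return dot(doc, cen) / norm(doc)

doc = [1.0, 1.0]
small_centroid = [1.0, 1.0]    # same direction as doc, small norm
large_centroid = [10.0, 0.0]   # different direction, large norm

cos_scores = (cosine(doc, small_centroid), cosine(doc, large_centroid))
den_scores = (denormalized(doc, small_centroid), denormalized(doc, large_centroid))
```

With these toy vectors the cosine prefers the small centroid pointing in the document's direction, while the denormalized score prefers the large-norm centroid. That is the kind of false positive the slide warns about.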

8
Denormalized Measure
9
Example

14 documents ( 1 term )
4 classes. 3 terms.

10
Example

0.6, 0.8, 0

11
Example

0.6, 0.8, 0

Classifier  C1   C2   C3   C4
AAC         0.8  0.8  0.8  0.2721
CFC         0    0    0    0.9267
12
Example

AAC should favor popular words.
CFC should favor rare terms.

Classifier  C1   C2   C3   C4
AAC         0.8  0.8  0.8  0.2721
CFC         0    0    0    0.9267
13
Datasets
Dataset       Reuters-21578  20-newsgroup
Categories    52             20
Unigrams      11430          29557
Training      6495           13272
Testing       2557           6627
Distribution  Skewed         Balanced
14
Evaluation

AAC
CGC
CFC
SVM
SVMLight
SVMTorch
LibSVM

15
Evaluation

Micro Averaging F1 (μ-F1)
Macro Averaging F1 (M-F1)
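
The two averages can be sketched from per-class counts of true positives, false positives, and false negatives. The counts below are made up for illustration and are not results from the presentation.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(counts):
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute F1 per class, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

# Hypothetical (tp, fp, fn) counts: one large class and one small class.
counts = {"big": (90, 10, 10), "small": (1, 4, 4)}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

Micro-averaging is dominated by the large class, while macro-averaging gives the small class equal weight, so the two can diverge sharply on a skewed collection.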

16
Overall Performance
17
Overall Performance
Sparse Data
18
Study Sparse Data
19
Study Sufficient Data
20
Discussion

Why does SVM lose? Possibly:
Skewed
No feature selection
Simple tokenizer
It also hints that CFC has discriminative ability on those raw terms.

Denormalized Measure?
Why?
21
Study Denormalized Measure
22
Study Denormalized Measure

on CGC

23
Study Weight, Parameter
24
Conclusions

CFC outperforms SVM and centroid-based approaches.
When data is sparse, CFC is much better.
Novel centroid: inner-class and inter-class.
Denormalized measure.
Source code (corrected address)
http://202.120.40.15/jzhou/research/cfc
