1
Experiments on Feature Selection and Document Classification
  • Hsin-Chen Chiao
  • 7/11/2002

2
Characteristics of KNN
  • Flat classification
  • Class determined by majority vote
  • e.g., blue neighbors vs. green neighbors
  • How many neighbors vote, and how much they count,
    is determined by K (see the sketch below)
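
A minimal sketch of flat KNN as characterized above, assuming a
precomputed similarity function; all names here are illustrative, not
taken from the original experiments.

from collections import Counter

def knn_classify(query, training_docs, labels, similarity, k):
    # Rank training documents by similarity to the query.
    ranked = sorted(range(len(training_docs)),
                    key=lambda i: similarity(query, training_docs[i]),
                    reverse=True)
    # Majority vote among the k nearest neighbors.
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]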

3
KNN Classification Puzzles


4
What We Want
  • Hierarchical classification
  • Do not decide by majority vote alone
  • A K value that suits each case

5
Hierarchical Classification for KNN
  • Assumptions:
  • The classification tree is well built.
  • Document similarities are well calculated.

6
Basic Calculation in HC for KNN: Bottom-Up Calculation

(figure: class A with subclasses A1 through A5)

Sim(A) = (1/n)(sim(A1) + sim(A2) + ... + sim(A5)), here with n = 5
  • An unbalanced classification tree should
    introduce an "Other" class.
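
A minimal sketch of the bottom-up calculation, assuming a simple tree
type; ClassNode and doc_sim are illustrative names, not from the slides.

from dataclasses import dataclass, field

@dataclass
class ClassNode:
    name: str
    children: list = field(default_factory=list)

def sim_bottom_up(node, doc_sim):
    # Leaf class: use its directly computed document similarity.
    if not node.children:
        return doc_sim[node.name]
    # Non-leaf class: Sim(A) = (1/n)(sim(A1) + ... + sim(An)).
    child_sims = [sim_bottom_up(child, doc_sim) for child in node.children]
    return sum(child_sims) / len(child_sims)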

7
How to Classify: Top-Down Classification
  • Use Level 3 similarities to determine Level 2
  • Recurse: use Level 4 similarities to
    determine Level 3
  • Classify and take the best branch at each level
    (see the sketch below)

(figure: a classification tree spanning Level 1 to Level 3)
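
A sketch of the top-down pass, reusing sim_bottom_up from the previous
sketch; it assumes "classify and take" means descending into the
best-scoring child at each level, which is one reading of the slide
rather than a quoted definition.

def classify_top_down(root, doc_sim):
    node = root
    # At each level, score the children bottom-up and take the best one.
    while node.children:
        node = max(node.children,
                   key=lambda child: sim_bottom_up(child, doc_sim))
    return node.name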
8
Determine the Value of K
  • Use double the size of the smallest child class
    at Level 2

(figure: three Level-2 classes with 100, 20, and 300 training documents)
In this case, K = 20 × 2 = 40
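
The rule above in code form, a one-line sketch using the slide's class
sizes as the worked example.

def choose_k(class_sizes):
    # K = 2 * size of the smallest Level-2 class.
    return 2 * min(class_sizes)

print(choose_k([100, 20, 300]))  # -> 40, matching the example above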
9
Experiment Results
  • Corpus: STIC (87);
  • 4118 training documents, 1364 test documents
  • Level 1 has four classes, Level 2 has forty
  • Flat KNN: top-one precision of about 58%
  • One-level hierarchical classification: top-one
    precision of about 61%

10
Crucial Point --- Feature Selection
  • In all experiments, features are weighted by
    Document Frequency (DF).
  • DF suffers from problems in hierarchical
    classification:
  • high-level classes paired with low-frequency features
  • low-level classes paired with high-frequency features
  • Because training data is scarce, bi-gram features
    make Chinese (Cns) outperform English (Eng) terms

11
Crucial Point --- Feature Selection (II)
  • When training data is insufficient, Chinese terms
    should be selected.
  • Article similarity becomes abnormally high when an
    article contains few features.
  • Removing stopwords is not easy.

12
Weight of Feature
  • DF
  • TF-IDF
  • TF-CF-IDF (also determined by majority)
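
A sketch of the first two weighting schemes using their standard
definitions; the slides do not spell out the formulas, so these are
assumptions. TF-CF-IDF is omitted because CF is not defined here.

import math
from collections import Counter

def document_frequency(corpus):
    # DF: the number of documents each term appears in.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return df

def tf_idf(doc, corpus):
    # TF-IDF with add-one smoothing so unseen terms don't divide by zero.
    df = document_frequency(corpus)
    n = len(corpus)
    tf = Counter(doc)
    return {term: tf[term] * math.log((1 + n) / (1 + df[term]))
            for term in tf}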

13
Statistical Report
  • Level 2: 29.7% of classes have fewer than 20 articles
  • Level 3: 77.8% of classes have fewer than 20 articles

14
Future Work
  • Use meaningful Chinese terms as features
  • Different classes with different feature weights
  • In flat KNN, we could also apply the
    classify-and-take method.