Text categorization using sentences

1
Text categorization using sentences

Ko, Y., Park, J., and Seo, J. 2002. Automatic text categorization using the importance of sentences. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1 (Taipei, Taiwan, August 24 - September 01, 2002). International Conference on Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, 1-7.

Presented by Praveenkumar Ponnusamy
EECS 800: Statistical Natural Language Processing
November 14, 2005
2
Text categorization using sentences

Introduction
Proposed system
Evaluation
Results
Summary

3
Introduction

Assign text documents to predefined categories based on their content
Application areas: classifying news, email, newsgroup posts, chat
Training phase: feature extraction and indexing
Classification algorithms: Naïve Bayes, Rocchio, k-Nearest Neighbor, Support Vector Machines

4
Introduction

The vector space model with TF and IDF weighting alone is not adequate
Sentences differ in how important they are for identifying the topic
Better results are possible by assigning term weights based on the importance of sentences

5
Proposed system

Two kinds of summarization techniques are used to identify important and unimportant sentences:
Similarity between the title and each sentence
Importance of the terms in each sentence
Sentence importance is derived from these two measures
Term weights in each sentence are modified in proportion to the sentence importance
English and Korean newsgroup data sets are used

6
Proposed system

Preprocessing
Contents: subject and body are used
Contents are segmented into sentences
Content words are extracted by POS tagging
English: Brill tagger
Korean: Sogang tagger
Each sentence is represented as a vector of content words
TF values serve as the term weights of content words
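
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the presented system: a regular-expression sentence splitter and a small stop-word list stand in for the Brill and Sogang POS taggers, and the names `STOP_WORDS`, `split_sentences`, `content_words`, and `sentence_vectors` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list stands in for POS-based content-word selection.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}

def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def content_words(sentence):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def sentence_vectors(subject, body):
    """Return the title TF vector and one TF vector per body sentence."""
    title_vec = Counter(content_words(subject))
    sent_vecs = [Counter(content_words(s)) for s in split_sentences(body)]
    return title_vec, sent_vecs

title_vec, sent_vecs = sentence_vectors(
    "Graphics card drivers",
    "The new drivers fix the graphics bug. Install the drivers on the card.",
)
```

Each `Counter` maps a content word to its frequency in that sentence, which matches the "TF values as term weights" bullet.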

7
Proposed system

Importance of sentences by title
The title summarizes the important content
Sentences similar to the title contain important terms
Title and sentences are represented as vectors of content words
Similarity is calculated as the inner product of the sentence and title vectors
Effectiveness depends on the quality of the title
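
A minimal sketch of the title-similarity measure described on this slide, assuming sentences and title are sparse TF vectors. Dividing by the largest inner product in the document is an assumption made here to keep scores in [0, 1]; the slide only states that an inner product is used. The function names are hypothetical.

```python
from collections import Counter

def inner_product(u, v):
    """Inner product of two sparse term-frequency vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)

def title_similarity(sent_vecs, title_vec):
    """Inner product of each sentence with the title, scaled by the largest value.

    Assumption: scaling by the document maximum; not stated on the slide.
    """
    raw = [inner_product(s, title_vec) for s in sent_vecs]
    top = max(raw, default=0)
    return [r / top if top else 0.0 for r in raw]

title = Counter({"graphics": 1, "card": 1, "drivers": 1})
sentences = [
    Counter({"new": 1, "drivers": 1, "fix": 1, "graphics": 1, "bug": 1}),
    Counter({"weather": 1, "nice": 1}),
    Counter({"install": 1, "drivers": 1}),
]
sim = title_similarity(sentences, title)
```

A sentence sharing no content words with the title gets a score of zero, which illustrates why a second, title-independent measure is useful.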

8
Proposed system

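Sentence importance by term importance
Sentences dissimilar to the title but containing important terms are also important
Sentence importance: the importance of the terms in it
Importance of terms: TF, IDF, chi-square values
Final score: combination of the title similarity Sim(S,T) and the term-based importance Cen(S), weighted by the constants k1 and k2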
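
One way the term-based measure and the final score could be computed is sketched below. Several details are assumptions, since the formulas are not preserved in the transcript: the three term statistics are multiplied, Cen(S) is scaled by the largest value in the document, and the final score is a linear combination with a hypothetical offset `base`. The IDF and chi-square numbers are made up for illustration.

```python
def term_importance(sentence_tf, idf, chi2):
    """Sum tf * idf * chi-square over the terms of one sentence.

    Assumption: the three statistics are multiplied; the slide only lists them.
    """
    return sum(tf * idf.get(t, 0.0) * chi2.get(t, 0.0) for t, tf in sentence_tf.items())

def cen_scores(sentences, idf, chi2):
    """Cen(S) for every sentence, scaled by the largest value in the document."""
    raw = [term_importance(s, idf, chi2) for s in sentences]
    top = max(raw, default=0.0)
    return [r / top if top else 0.0 for r in raw]

def final_scores(sim, cen, k1, k2, base=1.0):
    """Linear combination of both measures; `base` is a hypothetical offset."""
    return [base + k1 * s + k2 * c for s, c in zip(sim, cen)]

sentences = [{"drivers": 1, "graphics": 1}, {"weather": 2}, {"drivers": 1}]
idf = {"drivers": 1.0, "graphics": 2.0, "weather": 0.5}   # made-up values
chi2 = {"drivers": 4.0, "graphics": 3.0, "weather": 1.0}  # made-up values
cen = cen_scores(sentences, idf, chi2)
scores = final_scores([1.0, 0.0, 0.5], cen, k1=1.5, k2=0.4)
```

With `base` greater than zero, a sentence that scores zero on both measures still keeps some weight, so its terms are down-weighted rather than discarded.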

9
Proposed system

Indexing
The TF value of a term is recalculated using the final sentence importance scores
Modified TF value (WTF) of term t in document d
Weight of term t in document d
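
A minimal sketch of the indexing idea, assuming each sentence already has an importance score: a term's modified TF is taken to be the sum of its per-sentence frequencies, each multiplied by that sentence's score. This summation is an assumption (the slide's formula is not preserved), `weighted_tf` is a hypothetical name, and the final term weight step is not covered.

```python
from collections import Counter

def weighted_tf(sent_vecs, scores):
    """Sum each term's per-sentence frequency times that sentence's importance score."""
    wtf = Counter()
    for vec, score in zip(sent_vecs, scores):
        for term, tf in vec.items():
            wtf[term] += tf * score
    return wtf

sent_vecs = [
    Counter({"drivers": 1, "graphics": 1}),
    Counter({"weather": 2}),
    Counter({"drivers": 1}),
]
scores = [2.0, 1.0, 1.5]  # hypothetical sentence importance scores
wtf = weighted_tf(sent_vecs, scores)
```

In this toy document "weather" occurs twice and "graphics" once, yet both end up with the same modified TF because "graphics" sits in a higher-scoring sentence.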

10
Evaluation Data Sets

English UseNet
20,000 articles
20 UseNet groups
4,000 docs (20%) for test data
16,000 docs (80%) for training data
Vocabulary: 51,018 words

Korean UseNet
10,331 documents
15 categories
3,107 docs (30%) for test data
7,224 docs (70%) for training data
Vocabulary: 69,793 words

11
Evaluation

Baseline system: TF
Proposed system: WTF
Performance measures:
Recall, Precision
F1 measure
Micro-averaging method
Macro-averaging method
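
The difference between the two averaging methods can be illustrated with a short sketch. It assumes per-category counts of true positives, false positives, and false negatives, and it takes macro-averaged F1 to be the mean of per-category F1 values (one common definition). The counts are made up.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_macro_f1(counts):
    """counts maps category -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    # Macro: average the per-category F1 values, so every category counts equally.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so large categories dominate.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return f1(tp, fp, fn), macro

counts = {"big": (90, 10, 10), "small": (1, 4, 4)}  # made-up counts
micro, macro = micro_macro_f1(counts)
```

Here the poorly classified small category pulls the macro average well below the micro average, which is why both are usually reported.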

12
Evaluation

Setting the number of features
A validation set drawn from the training data is used
Performance curve using SVM is shown here (k1 = k2 = 1)

13
Evaluation

For SVM, the number of features is set to 7,000 based on performance and running time
Similarly for the other classifiers:
Naïve Bayes: 7,000
Rocchio: 10,000
k-NN: 9,000

14
Evaluation

Setting the constant weights k1 and k2 (range: 0.0 to 3.0)
Compared Sim(S,T), Cen(S) and the combination method
For the combination method, k1 = k2
Performance curve using SVM is shown (features: 7,000)
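
Tuning the two weights over the stated range can be sketched as a plain grid search. The step size of 0.1 is an assumption, `grid_search` is a hypothetical helper, and `toy_evaluate` is a stand-in with an arbitrarily placed peak; in practice the callback would train a classifier and return its score on the validation set.

```python
def grid_search(evaluate, lo=0.0, hi=3.0, step=0.1):
    """Return (best_score, k1, k2); `evaluate(k1, k2)` returns a validation score."""
    n = int(round((hi - lo) / step)) + 1
    # Round grid points so that values such as 0.30000000000000004 become 0.3.
    grid = [round(lo + i * step, 10) for i in range(n)]
    return max((evaluate(k1, k2), k1, k2) for k1 in grid for k2 in grid)

def toy_evaluate(k1, k2):
    """Toy objective peaking at (1.0, 2.0); a stand-in for real validation scoring."""
    return -((k1 - 1.0) ** 2 + (k2 - 2.0) ** 2)

best, k1, k2 = grid_search(toy_evaluate)
```

Passing the scoring function as a callback keeps the search independent of which classifier is being tuned.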

15
Evaluation

Best performance for
SVM at k1 = 1.5 and k2 = 0.4
Naïve Bayes at k1 = 1.9 and k2 = 3.0
Rocchio at k1 = 2.0 and k2 = 0.0
k-NN at k1 = 0.8 and k2 = 2.8

16
Results

English newsgroup data set

17
Results

Korean newsgroup data set

18
Results

A collection of small tightly clustered documents with wide separation between the clusters should produce the best performance (Salton et al., 1975)
Cohesion within a category: similarity between documents in the same category
Cohesion between categories: measure of similarity between categories
Desired: high cohesion within a category and low cohesion between categories

19
Results

D: total training document set
Ik: training document set in the k-th category
Ck: centroid vector of the k-th category
Cglob: centroid vector of the total training documents
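
Using the symbols defined above, the two cohesion measures could be computed along the following lines. The exact formulas are not preserved in the transcript, so this sketch assumes that cohesion within a category is the mean inner product of each document with its category centroid, and that cohesion between categories is the size-weighted mean inner product of each category centroid with the global centroid. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cohesion(docs, labels):
    """Return (within, between) cohesion under the assumptions stated above."""
    c_glob = centroid(docs)                      # Cglob
    by_cat = {}
    for d, k in zip(docs, labels):
        by_cat.setdefault(k, []).append(d)       # Ik
    within = between = 0.0
    for members in by_cat.values():
        c_k = centroid(members)                  # Ck
        within += sum(dot(d, c_k) for d in members)
        between += len(members) * dot(c_k, c_glob)
    return within / len(docs), between / len(docs)

docs = [[1, 0], [1, 0], [0, 1], [0, 1]]
within, between = cohesion(docs, labels=["a", "a", "b", "b"])
```

In this toy set the two categories are perfectly separated, so the within value is at its maximum while the between value stays lower.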

20
Results
21
Results
22
Results

Cen(S) had the highest cohesion value within a
category
Sim(S,T) showed the lowest cohesion value between
the categories

23
Results

Cohesion values for the combination method, both within and between categories, fell between the Cen(S) and Sim(S,T) curves
The proposed indexing method reformed the vector space for better performance

24
Summary