Text categorization using sentences

1
Text categorization using sentences
  • Ko, Y., Park, J., and Seo, J. 2002. Automatic
    text categorization using the importance of
    sentences. In Proceedings of the 19th
    International Conference on Computational
    Linguistics - Volume 1 (Taipei, Taiwan, August 24
    - September 01, 2002). International Conference
    On Computational Linguistics. Association for
    Computational Linguistics, Morristown, NJ, 1-7.

Presented by Praveenkumar Ponnusamy
EECS 800 Statistical Natural Language Processing
November 14, 2005

2
Text categorization using sentences
  • Introduction
  • Proposed system
  • Evaluation
  • Results
  • Summary

3
Introduction
  • Assign text documents to predefined categories
    based on their content
  • Application areas: classifying news, email,
    newsgroups, chat
  • Training phase: feature extraction and indexing
  • Classification algorithms: Naïve Bayes, Rocchio,
    k-Nearest Neighbor, Support Vector Machines

4
Introduction
  • Vector space model using TF and IDF is not adequate
  • Sentences differ in their importance for
    identifying the topic
  • Better results are possible by assigning term weights
    based on the importance of sentences

5
Proposed system
  • Two kinds of summarization techniques for
    identifying important and unimportant sentences
      • Similarity between title and sentences
      • Importance of terms in each sentence
  • Sentence importance derived from these two measures
  • Term weights in each sentence modified in
    proportion to the sentence importance
  • English and Korean newsgroup data sets used

6
Proposed system
  • Preprocessing
  • Contents: subject and body used
  • Contents segmented into sentences
  • Content words extracted by POS tagging
      • English - Brill tagger
      • Korean - Sogang tagger
  • Sentences represented as vectors of content words
  • TF values as term weights of content words
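
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the presented system: the regular-expression sentence splitter and the small `STOP_WORDS` set are stand-in assumptions for the POS taggers named on the slide, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list stands in for POS-based
# content-word selection (the slide uses the Brill and Sogang taggers).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to",
              "in", "and", "for", "on", "it"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def content_words(sentence):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def sentence_vectors(body):
    """One sparse TF vector (term -> count) per sentence."""
    return [Counter(content_words(s)) for s in split_sentences(body)]

# The subject line is treated as the title, the body as sentences.
title_vec = Counter(content_words("Graphics card drivers"))
vectors = sentence_vectors(
    "The new graphics card is fast. Drivers for the card are buggy!"
)
```

Here each sentence becomes a `Counter`, so the TF value of a content word is simply its count within that sentence.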

7
Proposed system
  • Importance of sentences by title
  • Title summarizes the important content
  • Sentences similar to the title contain important
    terms
  • Title and sentences represented as vectors of content words
  • Similarity calculated by inner product of
    sentence and title vectors
  • Effectiveness depends on the quality of the title
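
A small sketch of the title-based measure, assuming sentences and the title are sparse term-count vectors. The inner product comes from the slide; dividing by the largest value in the document so that scores fall in [0, 1] is an assumption made here, and the function names are hypothetical.

```python
from collections import Counter

def inner_product(u, v):
    """Inner product of two sparse term -> weight vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def title_similarity(sentences, title):
    """Sim(S, T) per sentence, scaled by the document maximum (assumed)."""
    raw = [inner_product(s, title) for s in sentences]
    top = max(raw, default=0)
    if top == 0:
        # A title sharing no terms with the body gives no signal at all.
        return [0.0] * len(raw)
    return [r / top for r in raw]

title = Counter({"graphics": 1, "card": 1, "drivers": 1})
sentences = [
    Counter({"new": 1, "graphics": 1, "card": 1, "fast": 1}),
    Counter({"drivers": 1, "buggy": 1}),
    Counter({"weather": 1}),
]
sims = title_similarity(sentences, title)
```

The all-zero branch illustrates why title quality matters: an uninformative title leaves every sentence with the same score.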

8
Proposed system
  • Sentence importance by term importance
  • Sentences dissimilar to the title but with
    important terms are still important
  • Sentence importance: measured by the importance of the terms in it
  • Importance of terms: TF, IDF, chi-square values
  • Final score
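
The formulas for this slide did not survive in the transcript, so the following is only a sketch under stated assumptions: term importance is taken as the product of TF, IDF and chi-square, Cen(S) is the sum of term importances in a sentence scaled by the document maximum, and the final score is a baseline of 1.0 plus the two measures weighted by k1 and k2. All names and the toy numbers are hypothetical; k1 = 1.5 and k2 = 0.4 are the SVM settings reported later in the deck.

```python
def term_importance(tf, idf, chi2):
    """Assumed combination: product of TF, IDF and chi-square per term."""
    return {t: tf[t] * idf.get(t, 0.0) * chi2.get(t, 0.0) for t in tf}

def centrality(sentences, importance):
    """Cen(S): summed importance of a sentence's distinct terms,
    scaled by the largest value in the document (assumed)."""
    raw = [sum(importance.get(t, 0.0) for t in s) for s in sentences]
    top = max(raw, default=0.0)
    return [r / top if top > 0 else 0.0 for r in raw]

def sentence_scores(sims, cens, k1, k2):
    """Assumed final score: 1.0 + k1 * Sim(S, T) + k2 * Cen(S)."""
    return [1.0 + k1 * s + k2 * c for s, c in zip(sims, cens)]

tf = {"graphics": 2, "card": 3, "weather": 1}
idf = {"graphics": 2.0, "card": 1.0, "weather": 0.5}
chi2 = {"graphics": 1.5, "card": 2.0, "weather": 0.5}
importance = term_importance(tf, idf, chi2)

sentences = [{"graphics", "card"}, {"weather"}]
cens = centrality(sentences, importance)
scores = sentence_scores([1.0, 0.0], cens, k1=1.5, k2=0.4)
```

With a baseline of 1.0, a sentence that scores zero on both measures keeps its plain TF values rather than being discarded.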

9
Proposed system
  • Indexing
  • TF value of a term calculated using the final
    importance score.
  • Modified TF value of term t in document d
  • Weight of a term t in document d
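
As a rough sketch of the indexing step, assuming the modified TF of a term is its per-sentence TF summed over the document with each sentence weighted by its importance score, and assuming the final term weight multiplies that value by IDF. Neither formula is legible in the transcript, so both are labeled assumptions, and the names are hypothetical.

```python
from collections import Counter, defaultdict

def weighted_tf(sentences, scores):
    """Assumed modified TF: sum over sentences of tf(S, t) * score(S)."""
    wtf = defaultdict(float)
    for sentence, score in zip(sentences, scores):
        for term, count in sentence.items():
            wtf[term] += count * score
    return dict(wtf)

def term_weights(wtf, idf):
    """Assumed document weight: modified TF times IDF."""
    return {t: v * idf.get(t, 0.0) for t, v in wtf.items()}

sentences = [
    Counter({"card": 1, "graphics": 1}),   # important sentence
    Counter({"card": 2, "weather": 1}),    # less important sentence
]
scores = [2.0, 1.0]                         # toy importance scores
wtf = weighted_tf(sentences, scores)
weights = term_weights(wtf, {"card": 0.5, "graphics": 2.0, "weather": 1.0})
```

In this toy document, "graphics" occurs once but ends up with twice the weight of "weather" because it sits in the more important sentence.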

10
Evaluation: Data Sets
  • English UseNet
      • 20,000 articles
      • 20 UseNet groups
      • 4,000 docs (20%) for test data
      • 16,000 docs (80%) for training data
      • Vocabulary: 51,018 words
  • Korean UseNet
      • 10,331 documents
      • 15 categories
      • 3,107 docs (30%) for test data
      • 7,224 docs (70%) for training data
      • Vocabulary: 69,793 words

11
Evaluation
  • Basis system used TF
  • Proposed system used WTF
  • Performance measures
      • Recall, Precision
      • F1 measure
      • Micro-averaging method
      • Macro-averaging method
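
The two averaging methods can be illustrated with per-category counts of true positives, false positives and false negatives. This sketch uses the common definitions: micro-averaging pools the counts before computing F1, and macro-averaging takes the mean of per-category F1 values. The deck does not say which macro variant was used, so that choice is an assumption, as are the toy counts.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision or recall is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_f1(counts):
    """Pool (tp, fp, fn) over all categories, then compute F1 once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

def macro_f1(counts):
    """Compute F1 per category, then take the unweighted mean."""
    return sum(f1(*c) for c in counts) / len(counts)

# Toy (tp, fp, fn) counts for a large and a small category.
counts = [(8, 2, 2), (1, 1, 3)]
micro = micro_f1(counts)
macro = macro_f1(counts)
```

Micro-averaging is dominated by large categories, while macro-averaging gives each category equal weight, which is why the macro value is lower in this example.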

12
Evaluation
  • Setting the number of features
  • Validation set from training data used
  • Performance curve using SVM shown here (k1 = k2 = 1)

13
Evaluation
  • For SVM, number of features is set to 7,000 based
    on performance and running time.
  • Similarly for other classifiers
      • Naïve Bayes: 7,000
      • Rocchio: 10,000
      • k-NN: 9,000

14
Evaluation
  • Setting the constant weights k1 and k2: range
    0.0 to 3.0
  • Compared Sim(S,T), Cen(S) and the combination
    method
  • For combination method, k1 = k2
  • Performance curve using SVM is shown (features =
    7,000)

15
Evaluation
  • Best performance for
      • SVM at k1 = 1.5 and k2 = 0.4
      • Naïve Bayes at k1 = 1.9 and k2 = 3.0
      • Rocchio at k1 = 2.0 and k2 = 0.0
      • k-NN at k1 = 0.8 and k2 = 2.8
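
One plausible way to arrive at such settings is an exhaustive search over the stated range on a validation set. The sketch below assumes a step of 0.1 and a caller-supplied `evaluate(k1, k2)` function returning a validation score; both are hypothetical, and the deck does not state that the search was done this way. The toy objective simply peaks at the SVM values above so the example is runnable.

```python
def grid_search(evaluate, lo=0.0, hi=3.0, step=0.1):
    """Try every (k1, k2) pair on a regular grid; return the best pair and score."""
    n = round((hi - lo) / step)
    # Rounding avoids float drift such as 0.30000000000000004.
    grid = [round(lo + i * step, 10) for i in range(n + 1)]
    best_score, best_k1, best_k2 = max(
        (evaluate(k1, k2), k1, k2) for k1 in grid for k2 in grid
    )
    return best_k1, best_k2, best_score

def toy_evaluate(k1, k2):
    """Stand-in for training a classifier and scoring it on validation data."""
    return -((k1 - 1.5) ** 2 + (k2 - 0.4) ** 2)

best_k1, best_k2, best_score = grid_search(toy_evaluate)
```

In practice `evaluate` would re-index the training documents with the given weights, train the classifier, and return a measure such as micro-averaged F1.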

16
Results
  • English newsgroup data set

17
Results
  • Korean newsgroup data set

18
Results
  • A collection of small tightly clustered documents
    with wide separation between the clusters should
    produce the best performance (Salton et al.,
    1975)
  • Cohesion within a category: similarity between
    documents in the same category
  • Cohesion between categories: measure of
    similarity between categories
  • Desired: high cohesion within a category and low
    cohesion between categories.

19
Results
  • D: total training document set
  • I_k: training document set in the k-th category
  • C_k: centroid vector of the k-th category
  • C_glob: centroid vector of the total training
    documents
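
The cohesion formulas themselves are missing from the transcript. A sketch consistent with the definitions above, assuming within-category cohesion is the average inner product of each document with its category centroid and between-category cohesion is the size-weighted average inner product of each category centroid with the global centroid, might look like this. The formulas, names and toy vectors are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length dense vectors."""
    return sum(a * b for a, b in zip(u, v))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cohesion(categories):
    """categories maps a label k to its document vectors (the set I_k).
    Returns (within, between) under the assumed formulas."""
    all_docs = [d for docs in categories.values() for d in docs]  # D
    c_glob = centroid(all_docs)
    within = 0.0
    between = 0.0
    for docs in categories.values():
        c_k = centroid(docs)
        within += sum(dot(d, c_k) for d in docs)
        between += len(docs) * dot(c_k, c_glob)
    return within / len(all_docs), between / len(all_docs)

# Two tight, well-separated toy categories.
categories = {
    "a": [[1.0, 0.0], [1.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 1.0]],
}
within, between = cohesion(categories)
```

For these toy vectors the within value is high and the between value is lower, which is the pattern the slides describe as desirable.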

20
Results

21
Results

22
Results
  • Cen(S) had the highest cohesion value within a
    category
  • Sim(S,T) showed the lowest cohesion value between
    the categories

23
Results
  • Cohesion values, within and between the
    categories, for the combination method showed
    middle values between the Cen(S) and Sim(S,T)
    curves
  • The proposed indexing method reshaped the vector
    space for better performance

24
Summary
  • New indexing method using two summarization
    techniques
  • Data sets from two languages
  • Four different classifiers
  • Better performance
  • Verified by cohesion analysis
  • Future work: try on newspapers and other types of documents
  • Future work: use more structural information in the document.