Classification Algorithms for NETNEWS Articles - PowerPoint PPT Presentation

Slides: 44
Provided by: hsu370
Learn more at: http://www.cs.ucf.edu

1
Classification Algorithms for NETNEWS Articles
Master's Thesis by Wen-Lin Hsu
Advisory Committee: Dr. Sheau-Dong Lang (Chairman),
Dr. Ronald Dutton, Dr. Mostafa Bassiouni
2
Overview
  • Introduction
  • Related work
  • Basic method
  • Improved methods and results
  • Conclusion
  • Future research

3
Introduction
  • Text categorization (classification)
  • definition
  • the process of deciding the appropriate
    categories for a given document
  • applications
  • determining the area of an essay
  • directing an email message to a proper folder
  • routing the NETNEWS articles to their newsgroups

4
Related Work
  • NewsWeeder project using user feedback
  • Classification of NETNEWS articles using AI
    learning techniques
  • Automatic web page categorization for IR systems
    using Knowledge Based (KB) techniques
  • Filtering junk mail using Bayesian approaches

5
Existing Techniques
  • Use batch updating in B-trees
  • Use of inverted lists to update database
    incrementally
  • Remove redundant words using corpus statistics
  • Select most relevant articles to train
  • Clustering

6
Baseline Algorithm
  • SMART (Salton, 1970s; still the best)
  • Vector space model
  • Term weighting scheme
  • Inverted file

7
Baseline Algorithm (contd)
  • Inverted file
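
As a rough illustration of the inverted file idea, the sketch below maps each term to the newsgroups whose training articles contain it, together with a term frequency. The function name, the data layout, and the toy articles are assumptions made for this example; they are not taken from the thesis.

```python
from collections import Counter, defaultdict

def build_inverted_file(group_docs):
    """Map each term to a postings dict {newsgroup: term frequency}.

    group_docs: newsgroup name -> list of tokenized training articles.
    (Hypothetical layout, chosen only for this sketch.)
    """
    inverted = defaultdict(dict)
    for group, articles in group_docs.items():
        # Count every term over all training articles of this newsgroup.
        counts = Counter(term for article in articles for term in article)
        for term, tf in counts.items():
            inverted[term][group] = tf
    return dict(inverted)

# Toy training data, made up for illustration.
training = {
    "rec/sport/volleyball": [["open", "court", "net"], ["court", "serve"]],
    "rec/games/video/nintendo": [["game", "box", "docs"]],
}
inverted_file = build_inverted_file(training)
print(inverted_file["court"])  # {'rec/sport/volleyball': 2}
```

Routing an article then only needs the postings of the terms that actually occur in it, rather than a scan over every newsgroup vector.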

8
Baseline Algorithm (contd)
  • Term list for each incoming article

9
Baseline Algorithm (contd)
  • Normalized similarity measure
  • tf-idf weighting scheme
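
A minimal sketch of tf-idf weighting with a normalized (cosine) similarity measure, assuming one term vector per newsgroup. The exact weighting formula used in the thesis is not given in the slides; the `log(n / df) + 1` variant and all names below are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(group_terms):
    """group_terms: newsgroup -> list of terms from its training articles."""
    n = len(group_terms)
    # Document frequency: in how many newsgroups does each term occur?
    df = Counter(t for terms in group_terms.values() for t in set(terms))
    # Assumed idf variant; the +1 keeps terms shared by all groups non-zero.
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
    vectors = {
        g: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for g, terms in group_terms.items()
    }
    return vectors, idf

def cosine(a, b):
    """Dot product divided by both vector lengths (0.0 for empty vectors)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route(article_terms, vectors, idf):
    """Return the newsgroup whose vector is most similar to the article."""
    counts = Counter(article_terms)
    # Terms never seen in training carry no weight.
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    return max(vectors, key=lambda g: cosine(query, vectors[g]))

# Toy data, made up for illustration.
groups = {
    "rec/sport/volleyball": ["court", "net", "serve", "court"],
    "rec/aviation/hang-gliding": ["fly", "wing", "glider", "fly"],
}
vectors, idf = tfidf_vectors(groups)
print(route(["looking", "open", "court"], vectors, idf))
```

Normalizing by vector length keeps long newsgroup vectors from winning simply because they contain more terms.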

10
Data Set
  • NETNEWS articles

11
Sample Articles
--------------------------------------------------
I found this number 1-800-IAM-RICH
rec/travel/cruise --> rec/arts/disney/animation
--------------------------------------------------
Please send email if interested. Includes box, docs and
shipping in the U.S.
rec/games/video/nintendo --> rec/radio/amateur/swap
--------------------------------------------------
No, you just fly around.
rec/games/video/nintendo --> rec/aviation/hang-gliding
--------------------------------------------------
Looking for open court, indoor or out in the
Wilsonville-Portland, Oregon area.
rec/sport/volleyball --> rec/sport/basketball/women
12
Statistics (June, July, and August 1997)
13
Experimental Results
  • show the competitiveness of our system
  • compare our results to those reported by Weiss at
    Johns Hopkins, 1996
  • achieve comparable results (88 vs. 89) using
    the same approach and similar data
  • results using our data

14
Methods Used to Improve the Baseline Algorithm
  • 1. Batch Routing
  • 2. Batch Updating
  • 3. Feature Reduction
  • 4. Top-k Approach
  • 5. Multi-level Routing
  • 6. Multiple Representatives

15
Improved Method 1
  • Batch routing
  • slightly improves efficiency
  • routing accuracy unchanged
  • slightly higher storage
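
One way to read batch routing is that a whole batch of incoming articles is scored together, so each term's postings list is fetched once per batch instead of once per article. The sketch below follows that reading; it is an assumption, not the thesis implementation, and it uses an unnormalized dot product for brevity.

```python
from collections import defaultdict

def route_batch(batch, postings):
    """batch: list of {term: weight}; postings: term -> {group: weight}.

    Returns the best-scoring group for each article (None if no term matches).
    """
    # Group the batch by term so each postings list is looked up only once.
    term_to_articles = defaultdict(list)
    for i, article in enumerate(batch):
        for term, weight in article.items():
            term_to_articles[term].append((i, weight))

    # One score table per buffered article: the extra storage of batching.
    scores = [defaultdict(float) for _ in batch]
    for term, uses in term_to_articles.items():
        plist = postings.get(term)
        if plist is None:
            continue
        for i, weight in uses:
            for group, group_weight in plist.items():
                scores[i][group] += weight * group_weight
    return [max(s, key=s.get) if s else None for s in scores]

# Toy postings and batch, made up for illustration.
postings = {
    "court": {"rec/sport/volleyball": 2.0, "rec/sport/basketball/women": 1.0},
    "fly": {"rec/aviation/hang-gliding": 3.0},
}
batch = [{"court": 1.0}, {"fly": 1.0, "court": 0.5}, {"unknown": 1.0}]
print(route_batch(batch, postings))
```

The scores are the same as when routing one article at a time, which matches the slide's note that accuracy is unchanged.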

16
Results of Batch Routing
  • Runtimes vs. batch size

17
Improved Method 2
  • Batch updating
  • 2.1: adding new terms and newsgroups
  • 2.2: 2.1 plus removing unwanted terms and
    newsgroups

18
Improved Method 2.1
  • Batch updating
  • adding new terms and newsgroups to the inverted
    file after n1 articles
  • improves accuracy
  • increases runtime and storage requirements
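
A small sketch of batch updating under stated assumptions: routed articles are buffered, and their terms and newsgroups are folded into the inverted file only once a fixed number of articles has accumulated. The class name, the `batch_size` parameter, and the choice of which newsgroup label to credit are all hypothetical.

```python
from collections import Counter, defaultdict

class UpdatingIndex:
    """Inverted file that absorbs new terms and newsgroups in batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.postings = defaultdict(Counter)  # term -> Counter{group: tf}
        self.pending = []                     # (group, terms) awaiting update

    def observe(self, group, terms):
        """Buffer one article; update the index when the batch is full."""
        self.pending.append((group, terms))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Fold every buffered article into the inverted file."""
        for group, terms in self.pending:
            for term in terms:
                # Unseen terms and unseen newsgroups are created on demand.
                self.postings[term][group] += 1
        self.pending.clear()

index = UpdatingIndex(batch_size=2)
index.observe("rec/travel/cruise", ["ship", "cabin"])
print("ship" in index.postings)  # False: the update has not run yet
index.observe("rec/sport/volleyball", ["court", "ship"])
print(dict(index.postings["ship"]))
```

A larger batch size means fewer updates and less runtime, at the cost of routing with a staler index between updates.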

19
Results of Batch Updating
  • Routing accuracy vs. updating with new terms and
    newsgroups

20
Increased time and storage when terms and new
groups are included in the updating scheme after
every 100 articles (1000 for rec).
21
Improved Method 2.2
  • Batch updating
  • 1. adding new terms and newsgroups
  • 2. removing unwanted terms and newsgroups
  • reduces the time and storage cost of step 1 without
    losing accuracy
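
The slides do not say which terms count as unwanted. As a sketch only, the function below prunes terms that are very rare overall or that appear in almost every newsgroup; both criteria and both thresholds are assumptions chosen for illustration.

```python
def prune_terms(postings, n_groups, min_total_tf=2, max_group_fraction=0.9):
    """Return a copy of the inverted file without 'unwanted' terms.

    postings: term -> {group: term frequency}
    A term is dropped (hypothetical criteria) when it occurs fewer than
    min_total_tf times in total, or when it appears in more than
    max_group_fraction of all newsgroups.
    """
    kept = {}
    for term, plist in postings.items():
        total_tf = sum(plist.values())
        spread = len(plist) / n_groups
        if total_tf >= min_total_tf and spread <= max_group_fraction:
            kept[term] = plist
    return kept

# Toy inverted file, made up for illustration.
postings = {
    "the": {"a": 5, "b": 7, "c": 4},  # in every group: little routing value
    "court": {"a": 3},                # kept
    "zzyzx": {"b": 1},                # seen once: likely noise
}
print(sorted(prune_terms(postings, n_groups=3)))  # ['court']
```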

22
Routing Accuracy vs. Updates and term removal
23
Overall Performance of Updating
24
Improved Method 3
  • Feature reduction
  • reduce the size of the training set
  • pre-manipulate the training data
  • select articles based on their similarity values
  • retrain using selected articles
  • improves efficiency, storage, and accuracy
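
The selection step can be sketched as follows, assuming each training article has already been routed once and carries its similarity value. Drawing the mean and standard deviation from the correctly routed articles, and using the population standard deviation, are assumptions of this sketch.

```python
import statistics

def select_training_sets(routed):
    """routed: list of (article_id, correctly_routed, similarity).

    Returns three candidate training sets:
      I   - all correctly routed articles
      II  - members of I with similarity greater than the mean
      III - members of I within one standard deviation of the mean
    """
    correct = [(a, s) for a, ok, s in routed if ok]
    sims = [s for _, s in correct]
    mean = statistics.mean(sims)
    std = statistics.pstdev(sims)
    set_i = [a for a, _ in correct]
    set_ii = [a for a, s in correct if s > mean]
    set_iii = [a for a, s in correct if abs(s - mean) <= std]
    return set_i, set_ii, set_iii

# Toy routing results, made up for illustration.
routed = [
    ("a1", True, 0.2),
    ("a2", True, 0.4),
    ("a3", True, 0.9),
    ("a4", False, 0.1),
    ("a5", True, 0.6),
]
set_i, set_ii, set_iii = select_training_sets(routed)
print(set_i, set_ii, set_iii)
```

The system would then be retrained on one of these smaller sets instead of on all articles.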

25
Results for Feature Reduction
  • BL: all articles in June
  • I: all correctly routed articles (64 in I)
  • II: articles with similarity greater than mean
    (32 in I)
  • III: articles with similarity within one standard
    deviation of mean (32 in I)

26
Feature Reduction Updating
  • BLU: all articles in June
  • I: all correctly routed articles
  • II: articles with similarity greater than mean
  • III: articles with similarity within one standard
    deviation
27
Improvement in Routing Accuracy with Feature
Reduction Updating
28
Improved Method 4
  • Top-k Approach
  • re-evaluate the system performance
  • give suggestions to users
  • get feedback from users
  • needs no extra storage
  • requires more time
  • improves accuracy
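
A sketch of how top-k routing could be evaluated: an article counts as correctly routed when its true newsgroup is among the k highest-scoring suggestions. The function names and the toy scores are assumptions for this example.

```python
def top_k_groups(scores, k):
    """Return the k newsgroups with the highest similarity scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def top_k_accuracy(scored_articles, k):
    """scored_articles: list of (true_group, {group: similarity})."""
    hits = sum(
        1 for true_group, scores in scored_articles
        if true_group in top_k_groups(scores, k)
    )
    return hits / len(scored_articles)

# Toy similarity scores, made up for illustration.
articles = [
    ("A", {"A": 0.9, "B": 0.5, "C": 0.1}),
    ("B", {"A": 0.8, "B": 0.6, "C": 0.2}),
    ("C", {"A": 0.7, "B": 0.6, "C": 0.1}),
]
for k in (1, 2, 3):
    print(k, top_k_accuracy(articles, k))
```

Ranking all groups costs extra time compared with keeping only the single best score, but it needs no additional index storage, which is consistent with the trade-offs listed above.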

29
Results for Top-k Approach
  • with updating

30
Routing accuracy using the top-k ranks. Each
figure shows the results w/o updates.
31
Improved Method 5
  • Multi-level Routing
  • accuracy ? efficiency? storage?

32
Results for Multi-level Routing
33
Improved Method 6
  • Multiple representatives
  • improves accuracy significantly
  • requires little extra storage
  • takes time to cluster articles before training
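
With several representatives per newsgroup, routing can be sketched as picking the group whose closest representative is most similar to the article. Taking the maximum over a group's representatives is an assumption here; the slides do not state how the representatives are combined.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route_multi(article, representatives):
    """representatives: group -> non-empty list of {term: weight} vectors."""
    best_group, best_sim = None, -1.0
    for group, reps in representatives.items():
        # A group is as good as its closest representative (assumption).
        sim = max(cosine(article, rep) for rep in reps)
        if sim > best_sim:
            best_group, best_sim = group, sim
    return best_group

# Toy representatives, made up for illustration: one group covers two topics.
representatives = {
    "rec/games/video/nintendo": [
        {"game": 1.0, "play": 1.0},
        {"sale": 1.0, "box": 1.0, "shipping": 1.0},
    ],
    "rec/radio/amateur/swap": [{"radio": 1.0, "antenna": 1.0}],
}
print(route_multi({"box": 1.0, "shipping": 1.0}, representatives))
```

A single averaged vector would blur the two topics of the first group together; separate representatives let either topic match on its own.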

34
One Representative per Newsgroup
35
Two Representatives per Newsgroup
36
K-means Clustering Algorithm
  • 1. Randomly select k articles as the first
    cluster centers
  • 2. Distribute the articles among the cluster
    centers
  • 3. Update the cluster center of the new clusters
  • 4. If any cluster center has changed in this
    iteration, then go back to 2.
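
The four steps above can be sketched in a few lines. This version works on small dense vectors with squared Euclidean distance, and it adds a fixed random seed and an iteration cap; those choices are assumptions for the example, since the slides do not name the distance measure used for articles.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly select k points as the first cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(max_iter):
        # Step 2: distribute the points among the nearest cluster centers.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Step 3: update the center of each new cluster.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # An empty cluster keeps its previous center.
            new_centers.append(mean_vector(members) if members else centers[c])
        # Step 4: stop once no center changed in this iteration.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment

# Toy points, made up for illustration: two well-separated pairs.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, assignment = kmeans(points, k=2)
print(sorted(centers))
```

In the setting of the slides, the points would be the article vectors of one newsgroup, and the resulting cluster centers would serve as that group's representatives.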

37
Results for Multiple Representatives
38
Improvements in Accuracy using Multiple
Representatives
39
Multiple Representatives Top-k Approach
  • Routing accuracy for the top-10 rank

40
Multiple Representatives Feature Reduction
41
Multiple Representatives Updating
42
Conclusion
43
Future Research
  • Updating
  • find optimal frequencies to update
  • Multiple representatives
  • find optimal number of representatives for each
    group
  • Updating Multiple representatives
  • find a suitable updating scheme