Text Classification With Labeled and Unlabeled Data

1
Text Classification With Labeled and Unlabeled Data
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
2
Overview
  • Text Classification: What and Why?
  • Text Clustering
  • Support Vector Machines (SVMs) with Cluster Features
  • Course of Project
  • Results
  • Conclusion
  • Future Work

3
Text Classification: What and Why?
  • Text classification: assigning documents to predefined classes (categories)
  • Labeling manually is time-consuming and sometimes impossible, so the process needs to be automated
  • To minimize labeling, automated text classifiers need to be able to utilize unlabeled data

4
How Does It Work?
  • Text documents are represented using feature vectors
  • Documents (both labeled and unlabeled) are clustered into similar groups
    - features representing each document's relationship to the created clusters are added to the feature vectors
  • The augmented feature vectors are then classified by a Support Vector Machine (SVM) (Raskutti et al., 2002)
  • This novel approach was the basis of this project

5
Representing Text
[Figure: the sample text "With paperless offices becoming more common, companies start using document databases with classification schemes" is mapped to a feature vector]
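The mapping from text to a feature vector can be sketched as below. This is a minimal illustration that assumes plain word counts over a vocabulary built from the documents themselves; the tokenizer and the function names (`tokenize`, `build_vocabulary`, `to_feature_vector`) are hypothetical and are not taken from the project.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct word to a fixed column index, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def to_feature_vector(text, vocabulary):
    """Return word counts in vocabulary order; unknown words are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


docs = [
    "Companies start using document databases",
    "Document databases with classification schemes",
]
vocab = build_vocabulary(docs)
vectors = [to_feature_vector(doc, vocab) for doc in docs]
```

Each document becomes one row of word counts with the same length as the vocabulary, so labeled and unlabeled documents can be compared in the same space.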
6
Clustering
[Figure: feature vectors grouped into clusters. Legend: labeled feature vectors, unlabeled feature vectors]

7
SVMs With Cluster Features
[Figure: SVM with added cluster features. Legend: labeled feature vectors (Class 1), labeled feature vectors (not Class 1), unlabeled feature vectors, support vectors, separating hyperplane]

8
Augmented Feature Vector
[Figure: augmented feature vector = original word frequencies + added features]
  • Examples of added features
    - binary closest-cluster indicator
    - similarity to cluster centroids, etc.
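The two example features listed above can be sketched as follows. The slide does not say which similarity measure was used, so cosine similarity is an assumption here, and the `augment` helper and the toy centroids are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment(vector, centroids):
    """Append a one-hot closest-cluster indicator and per-centroid similarities."""
    similarities = [cosine_similarity(vector, c) for c in centroids]
    closest = max(range(len(centroids)), key=lambda i: similarities[i])
    indicator = [1 if i == closest else 0 for i in range(len(centroids))]
    return list(vector) + indicator + similarities


# Toy data: three word counts and two cluster centroids (assumed values).
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
augmented = augment([0, 2, 1], centroids)
```

With two clusters, the three original word frequencies are followed by two indicator entries and two similarity entries, giving a seven-element augmented vector.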

9
Areas of Investigation
  • Investigated the following questions:
  • The value of added cluster features
  • Performance of different clustering algorithms
    - Single-Pass (Raskutti et al., 2002)
    - Snob (Wallace and Boulton, 1968)
  • Transductive Support Vector Machines (TSVMs) with clustering
  • Different factors influencing performance
    - type of features, number of clusters, etc.

10
Course of Project
  • Implemented Single-Pass clustering algorithm and tested with SVMs
    - using variations on number and type of features added
  • Combined Snob with SVMs
    - using different attribute types
  • Tested the approaches on two different data sets
    - with random splits containing 1%, 5% and 10% labeled data out of the whole training set
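For readers unfamiliar with single-pass clustering, a generic textbook form is sketched below. It is not necessarily the variant used in the project or in Raskutti et al. (2002): the cosine similarity measure, the `threshold` parameter, and the running-mean centroid update are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(vectors, threshold):
    """Visit each vector once: join the most similar cluster or start a new one."""
    centroids = []    # running mean of each cluster's members
    sizes = []        # number of members per cluster
    assignments = []  # cluster index chosen for each input vector
    for v in vectors:
        best, best_sim = None, -1.0
        for i, c in enumerate(centroids):
            sim = cosine_similarity(v, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < threshold:
            # Nothing is similar enough: this vector seeds a new cluster.
            centroids.append([float(x) for x in v])
            sizes.append(1)
            assignments.append(len(centroids) - 1)
        else:
            # Fold the vector into the running mean of the chosen cluster.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
            assignments.append(best)
    return assignments, centroids


vectors = [[1, 0], [2, 0], [0, 1], [0, 3]]
assignments, centroids = single_pass_cluster(vectors, threshold=0.5)
```

Because each document is visited only once, the result depends on the order of the input and on the threshold, which is one reason the number of clusters is a factor worth varying.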

11
Initial Results
  • Initial results showed that adding cluster features actually degrades SVM performance.
  • Various slightly modified versions of the Single-Pass clustering algorithm, as well as Snob, were tested, all giving negative results when combined with SVMs.
  • However, one approach showed an improvement ...

12
Partitioning the Data
  • Training set is divided into k partitions, with each partition being clustered separately
    - features are added to documents relative to the k sets of clusters
    - k partitions means k × number of cluster features
    - used k = 5 in experiments
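The partitioning idea can be sketched as below, under several assumptions: the slides do not say how partitions are formed (round-robin is used here), `toy_centroids` is a hypothetical stand-in for the real clustering step (Single-Pass or Snob), and only the binary closest-cluster indicator is added. The parameter `m` (clusters per partition) is also hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_centroids(partition, m):
    """Stand-in clusterer: seed with the first m vectors, assign once, average."""
    seeds = [list(v) for v in partition[:m]]
    groups = [[] for _ in seeds]
    for v in partition:
        nearest = min(range(len(seeds)), key=lambda i: squared_distance(v, seeds[i]))
        groups[nearest].append(v)
    # An empty group (possible with duplicate seeds) falls back to its seed.
    return [
        [sum(col) / len(group) for col in zip(*group)] if group else seed
        for group, seed in zip(groups, seeds)
    ]


def partitioned_cluster_features(vectors, k, m):
    """Cluster k partitions separately and add k * m indicator features per document."""
    partitions = [vectors[i::k] for i in range(k)]  # round-robin split (assumed)
    centroid_sets = [toy_centroids(p, m) for p in partitions]
    augmented = []
    for v in vectors:
        extra = []
        for centroids in centroid_sets:
            nearest = min(range(len(centroids)), key=lambda i: squared_distance(v, centroids[i]))
            extra.extend(1 if i == nearest else 0 for i in range(len(centroids)))
        augmented.append(list(v) + extra)
    return augmented


vectors = [[0, 0], [0, 1], [10, 10], [10, 11], [1, 0], [1, 1], [11, 10], [11, 11]]
augmented = partitioned_cluster_features(vectors, k=2, m=2)
```

With k = 2 partitions and m = 2 clusters each, every document gains 2 × 2 = 4 added features. The experiments used k = 5, which would give five sets of cluster features per document.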

13
Results
  • Average number of bits for test set of size 600

[Chart not reproduced; axis label: Labeled Data]
14
Results (cont.)
  • Average number of bits for test set of size 3299

[Chart not reproduced; axis label: Labeled Data]
15
Conclusion
  • Results suggest that performance of SVMs depends on
    - number of features
    - type of features
    - clustering method
  • Partitioning the data
    - improves the quality of the features
    - improves overall performance
  • Issues with the use of Snob, and clustering in general, in text classification

16
Future Work
  • Extending the SVMCluster approach to multi-labeled classification
  • Investigating new sets of cluster features
  • Determining
    - optimal number of clusters used for adding cluster features
    - optimal number of partitions
  • Investigating better methods of using Snob in text classification

17
References
  • Raskutti, B., Ferra, H. and Kowalczyk, A. (2002). Using Unlabeled Data for Text Classification through Addition of Cluster Parameters. In International Conference on Machine Learning (accepted).
  • Wallace, C.S. and Boulton, D.M. (1968). An Information Measure for Classification. Computer Journal, Vol. 11, No. 2, pp. 185-194.