Text Classification With Labeled and Unlabeled Data

1
Text Classification With Labeled and Unlabeled Data
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
2
Overview

Text Classification: What and Why?
Text Clustering
Support Vector Machines (SVMs) with Cluster Features
Course of Project
Results
Conclusion
Future Work

3
Text Classification: What and Why?

Text classification: assigning documents to predefined classes (categories)
Labeling manually is time-consuming and sometimes impossible, so the process needs to be automated
To minimize labeling, automated text classifiers need to be able to utilize unlabeled data

4
How Does It Work?

Text documents are represented using feature vectors
Documents (both labeled and unlabeled) are clustered into similar groups
- features representing the relationship to the created clusters are added to the feature vectors
The augmented feature vectors are then classified by a Support Vector Machine (SVM) (Raskutti et al., 2002)
This novel approach was the basis of this project
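
The steps above can be sketched end to end as follows. This is only an illustrative sketch, not the project's implementation: it assumes scikit-learn and NumPy are available, swaps in k-means for the clustering algorithms the project actually used (Single-Pass and Snob), uses distances to centroids as the added cluster features, and runs on four made-up documents with hypothetical class labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus (invented for illustration): two labeled, two unlabeled documents.
docs = [
    "stock market shares rise",       # labeled, class 0
    "football match final score",     # labeled, class 1
    "market shares fall on trading",  # unlabeled
    "team wins football final",       # unlabeled
]
labeled_idx = [0, 1]
unlabeled_idx = [2, 3]
labels = np.array([0, 1])

# Step 1: represent every document as a word-count feature vector.
X = CountVectorizer().fit_transform(docs).toarray()

# Step 2: cluster labeled and unlabeled documents together.
# Assumption: k-means stands in for the project's clustering algorithms.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 3: add one feature per cluster (distance to that cluster's centroid).
cluster_features = kmeans.transform(X)
X_aug = np.hstack([X, cluster_features])

# Step 4: train an SVM on the labeled rows only, then classify the rest.
clf = LinearSVC().fit(X_aug[labeled_idx], labels)
predictions = clf.predict(X_aug[unlabeled_idx])
```

The unlabeled documents never reach the SVM as training examples; they influence the classifier only through the cluster features computed in step 3.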

5
Representing Text
[Figure: the sample text "With paperless offices becoming more common, companies start using document databases with classification schemes" mapped to its feature vector]
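
A minimal sketch of how a text might be turned into a word-frequency feature vector. The tokenizer, the helper names (`tokenize`, `build_vocabulary`, `to_feature_vector`) and the two sample sentences are assumptions made for illustration; the slides do not say which preprocessing the project used.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct word in the corpus to a fixed vector index."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}

def to_feature_vector(text, vocabulary):
    """Return the word counts of `text`, ordered by vocabulary index."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector

docs = [
    "companies start using document databases",
    "document databases with classification schemes",
]
vocab = build_vocabulary(docs)
vectors = [to_feature_vector(d, vocab) for d in docs]
```

Each position in a vector corresponds to one vocabulary word, so documents that share words ("document", "databases") have non-zero counts in the same positions.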
6
Clustering
[Figure: feature vectors grouped into clusters. Legend: labeled feature vectors, unlabeled feature vectors]

7
SVMs With Cluster Features
[Figure: SVM separating hyperplane in a space that includes the added features. Legend: labeled feature vectors (Class 1), labeled feature vectors (Not Class 1), unlabeled feature vectors, support vectors, separating hyperplane]

8
Augmented Feature Vector
[Figure: an augmented feature vector, made up of the original word frequencies and the added features]

Examples of added features:
- binary closest cluster indicator
- similarity to cluster centroids, etc.
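
The two example features can be sketched as below. Cosine similarity is an assumption here (the slides do not name the similarity measure), and the function names, the centroids and the order in which features are appended are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def augment(vector, centroids):
    """Append cluster features to the original word-frequency vector."""
    sims = [cosine_similarity(vector, c) for c in centroids]
    closest = max(range(len(centroids)), key=lambda i: sims[i])
    # Binary closest-cluster indicator: 1 for the nearest cluster, else 0.
    indicator = [1 if i == closest else 0 for i in range(len(centroids))]
    # Layout: original word frequencies, then indicators, then similarities.
    return list(vector) + indicator + sims

# Two made-up centroids in a three-word vocabulary.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
augmented = augment([0, 2, 1], centroids)
```

With two clusters, a three-word vector grows to seven entries: three word counts, two indicator bits and two similarities.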

9
Areas of Investigation

Investigated the following questions:

The value of added cluster features
Performance of different clustering algorithms
- Single-Pass (Raskutti et al., 2002)
- Snob (Wallace and Boulton, 1968)
Transductive Support Vector Machines (TSVMs) with clustering
Different factors influencing performance
- type of features, number of clusters, etc.
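
For orientation, a generic single-pass clustering scheme can be sketched as follows. This is not necessarily the variant described by Raskutti et al. (2002): the cosine similarity measure, the running-mean centroid update, the `threshold` parameter and the function names are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def single_pass_cluster(vectors, threshold):
    """Visit each vector once: join the most similar cluster or start a new one."""
    centroids = []    # running mean of each cluster's members
    sizes = []        # number of members per cluster
    assignments = []  # cluster index chosen for each input vector
    for v in vectors:
        best, best_sim = None, -1.0
        for i, c in enumerate(centroids):
            sim = cosine_similarity(v, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: add to the cluster and update its mean.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], v)]
            sizes[best] = n + 1
            assignments.append(best)
        else:
            # No cluster is close enough: this vector seeds a new cluster.
            centroids.append([float(x) for x in v])
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments, centroids

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
assignments, centroids = single_pass_cluster(vectors, threshold=0.8)
```

In a scheme like this the number of clusters is not fixed in advance; it depends on the threshold and on the order in which documents are visited.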

10
Course of Project

Implemented the Single-Pass clustering algorithm and tested it with SVMs
- using variations on the number and type of features added
Combined Snob with SVMs
- using different attribute types
Tested the approaches on two different data sets
- with random splits containing 1, 5 and 10 labeled data out of the whole training set
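
A random labeled/unlabeled split could be produced along these lines. The slide gives no unit for "1, 5 and 10", so treating them as fractions of the training set (0.01, 0.05, 0.10) is an assumption, as are the function name, the seed and the training-set size of 1000.

```python
import random

def random_labeled_split(n_docs, labeled_fraction, seed):
    """Randomly pick which training documents keep their labels."""
    rng = random.Random(seed)  # seeded so that a split can be reproduced
    indices = list(range(n_docs))
    rng.shuffle(indices)
    # Keep at least one labeled document, even for very small fractions.
    n_labeled = max(1, round(n_docs * labeled_fraction))
    labeled = sorted(indices[:n_labeled])
    unlabeled = sorted(indices[n_labeled:])
    return labeled, unlabeled

# Assumed reading of the slide: 1, 5 and 10 as proportions of the training set.
splits = {
    fraction: random_labeled_split(1000, fraction, seed=0)
    for fraction in (0.01, 0.05, 0.10)
}
```

Repeating the split with several seeds and averaging the results would reduce the effect of any one lucky or unlucky choice of labeled documents.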

11
Initial Results