1
Text Classification with Support Vector Machine

Xuemei Li
Department of Computer Science
Old Dominion University
April 24, 2006

2
Outline

Problem and Approach
Support Vector Machine General
Architecture
Preliminary Results
Normalization and Parameter Selection
Parameter Selection Experiments
Conclusion

3
Objective

Problem
Given a collection of XML documents, each
containing an average of 2700 words, and
a list of subject areas
25 broad subject fields, divided into
251 narrower groups
Classification: assign each document to the most
pertinent subject area(s)

4
Input XML Document Sample

<?xml version="1.0" ?>
<!-- XML document generated using OCR technology from ScanSoft, Inc. -->
<document ssdoc-vers="SSDOC1.0" ocr-vers="OmniPage Pro 14"
    xmlns="x-schema:http://www.scansoft.com/omnipage/xml/ssdoc-schema2.xml">
  <page width="12240" height="15840" x-res="300" y-res="300" bpp="1"
      orientation="0" skew="0"
      filename="C:\testbed\OCR\firstlast\ADA424473.pdf" language="0">
    <region reg-type="horizontal">
      <rc l="620" t="14784" r="11572" b="15139" />
      <paragraph para-type="text" align="left" left-indent="0"
          right-indent="0" start-indent="0" line-spacing="180">
        <ln baseline="15029" ff="Times New Roman" fs="600" char-attr="bold">
          <wd l="715" t="14923" r="1094" b="15072" char-attr="non-bold">April</wd>
          <wd l="1142" t="14923" r="1478" b="15038" char-attr="non-bold">2004</wd>
          <wd l="9720" t="14904" r="10397" b="15082" fs="800" char-attr="italic">Defense</wd>
          <wd l="10430" t="14904" r="11160" b="15038" fs="1000">Horizons</wd>
          <wd l="11371" t="14856" r="11477" b="15034" fs="1100">1</wd>

5
Sample Subject Fields and Groups

01--Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    01 Helicopters
    02 Bombers
02--Agriculture
03--Astronomy and Astrophysics
04--Atmospheric Sciences
05--Behavioral and Social Sciences
06--Biological and Medical Sciences
07--Chemistry
08--Earth Sciences and Oceanography

6
Approach

Our approach: Support Vector Machine
Train one SVM for each subject area
Existing libraries
LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  different SVM formulations
  cross validation for model selection
  both C++ and Java sources
  Other projects in our research group have used
  LibSVM before
Rainbow: http://www.cs.umass.edu/~mccallum/bow/
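
The slide says one SVM is trained per subject area but does not say how the negative examples are chosen. A minimal sketch, assuming a one-vs-rest setup in which every document outside a subject area is a negative for that area (the function name and toy data are hypothetical):

```python
def one_vs_rest_labels(doc_subjects, subject):
    """Return 1 for documents tagged with `subject`, -1 for all others."""
    return [1 if subject in subjects else -1 for subjects in doc_subjects]

# Hypothetical toy data: each document carries one or more subject codes.
doc_subjects = [{"01"}, {"02"}, {"01", "07"}, {"07"}]
all_subjects = sorted(set().union(*doc_subjects))

# One binary training problem per subject area.
binary_problems = {s: one_vs_rest_labels(doc_subjects, s) for s in all_subjects}
```

Because each subject area gets its own classifier, a document can be assigned to more than one area, which matches the "subject area(s)" wording of the objective.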

7
Implementation Technology

Java JDK 1.4.2_06
IDE: Eclipse 3.0.2
Commons Digester 1.7
Lucene 1.4.3
LibSVM

8
Support Vector Machine - General

Overview
A learning method introduced by V. Vapnik
It is widely used in pattern recognition areas
such as face detection, isolated handwritten
digit recognition, gene classification, etc.
Key Ideas
Maximize Margins
Construct Kernels
A list of SVM applications is available at
http://www.clopinet.com/isabelle/Projects/SVM/applist.html

9
Support Vector Machine: Linear Separator

Many decision boundaries can separate these two
classes
Which one should we choose?

http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
10
Support Vector Machine: Maximize Margins
Basic idea: choose the boundary that separates the
two classes with the largest margin
http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
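
To make "largest margin" concrete, the sketch below measures the margin of two candidate linear boundaries on toy data. The points and the two boundaries are hypothetical and chosen only for illustration; an SVM trainer would search for the best boundary instead of comparing hand-picked ones.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A negative result means at least one point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical 2-D toy data: two positives and two negatives.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate boundaries that both separate the classes.
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)  # x1 + x2 = 2
margin_b = geometric_margin((1.0, 0.0), -1.5, points, labels)  # x1 = 1.5
```

Both boundaries classify every point correctly, but the first leaves more room between the boundary and the nearest point, so it is the one the maximum-margin criterion prefers.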
11
Non-linear Separator

A curve is needed to fully separate the green and
red objects

http://www.statsoft.com/textbook/stsvm.html
12
The Kernel Trick

General idea: the original feature space can
always be mapped to some higher-dimensional
feature space where the training set is separable

http://www.statsoft.com/textbook/stsvm.html
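
The "trick" is that the mapping never has to be computed explicitly: a kernel function returns the inner product in the higher-dimensional space directly from the original vectors. A small sketch using a degree-2 polynomial kernel on 2-D inputs (the example vectors are arbitrary):

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product in the 3-D mapped space
implicit = poly2_kernel(x, z)   # same value, computed in the original 2-D space
```

The two values agree, which is why an SVM can work in the mapped space while only ever evaluating the kernel on the original features.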
13
Architecture

[Diagram: training pipeline. Components: XML files, Lucene, terms, TF, DF, document representation, features/weights, label, LibSVM Train, SVM model, features]
[Diagram: prediction pipeline. Components: XML files, Lucene, terms, TF, DF, document representation, features/weights, LibSVM Predictor, prediction]
14
Number of Terms in Document

Original XML file: including duplicated terms
Before LibSVM: including only unique terms

15
Document Representation

Different kinds of documents contain different
terms, and these term occurrences can be viewed as
clues for document classification
TF (Term Frequency)
Each distinct word corresponds to a feature
The feature value is the number of times the word
occurs in the document
TFIDF (Term Frequency Inverse Document Frequency)
Each distinct word corresponds to a feature
The feature value is TF * (log(N/DF) + 1), where N
is the number of documents in the collection and
DF is the number of documents containing the word
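
A minimal sketch of both representations on a toy corpus. The tokenized documents are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base of the log.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """TF: number of times each distinct term occurs in one document."""
    return Counter(tokens)

def document_frequencies(corpus):
    """DF: number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df

def tfidf(tokens, df, n_docs):
    """Weight each term by TF * (log(N / DF) + 1); natural log assumed."""
    tf = term_frequencies(tokens)
    return {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}

# Hypothetical corpus of three already-tokenized documents.
corpus = [
    ["aircraft", "bomber", "aircraft"],
    ["aircraft", "crop"],
    ["crop", "soil", "soil"],
]
df = document_frequencies(corpus)
weights = tfidf(corpus[0], df, len(corpus))
```

A term that occurs in every document gets a factor of exactly 1, so its weight falls back to plain TF; rarer terms are weighted up.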

16
Lucene

Lucene is a high-performance text search engine
library, suitable for cross-platform full-text
search applications.
It provides a direct API for TF and DF.
Lucene's built-in standard analyzer
A JavaCC-based parser with rules for email
addresses, acronyms, hostnames, and floating
point numbers
Stop word removal

17
Data Format of LibSVM

Training Data File (.trn)
label featureId:featureValue
featureId:featureValue ...
Each document example starts with a label, 1 or
-1, followed by a list of featureId:featureValue
pairs.
Feature id numbers must be greater than 0
Test Data File (.lst)
Similar to the training data file
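
A small sketch of writing one example in this format. The helper name is hypothetical. It sorts the pairs by feature id, because LibSVM expects indices in ascending order, and it skips zero values, since the format is sparse.

```python
def to_libsvm_line(label, features):
    """Format one example as 'label id:value id:value ...'.

    `features` maps integer feature ids (> 0) to numeric values.
    """
    if any(fid <= 0 for fid in features):
        raise ValueError("feature ids must be greater than 0")
    pairs = " ".join(
        f"{fid}:{value:g}"
        for fid, value in sorted(features.items())  # ascending feature id
        if value != 0                                # sparse: omit zeros
    )
    return f"{label} {pairs}".rstrip()

positive = to_libsvm_line(1, {3: 0.5, 1: 2.0})
negative = to_libsvm_line(-1, {7: 1.25, 2: 0.0})
```

Each returned string is one line of a .trn or .lst file.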

18
LibSVM

4 SVM types
C-SVM classification
  The range of C is from zero to infinity
Nu-SVM classification
  Essentially the same as C-SVM, but with a
  different parameterization
  The range of nu is always within [0, 1]
  Nu is an upper bound on the fraction of training
  errors and a lower bound on the fraction of
  support vectors
Epsilon-SVM regression
Nu-SVM regression

19
LibSVM

4 kernels
Linear
  a special case of RBF
Polynomial
  If a high degree is used, numerical difficulties
  may occur
RBF
  a reasonable first choice
Sigmoid
  In general its accuracy is not better than RBF
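
For reference, the four kernels can be written as plain functions. The formulas follow the forms documented for LibSVM; the default values of gamma, coef0, and degree below are placeholders for illustration, not LibSVM's defaults.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear(u, v):
    """u . v"""
    return dot(u, v)

def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    """(gamma * u . v + coef0) ^ degree"""
    return (gamma * dot(u, v) + coef0) ** degree

def rbf(u, v, gamma=1.0):
    """exp(-gamma * |u - v|^2)"""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid(u, v, gamma=1.0, coef0=0.0):
    """tanh(gamma * u . v + coef0)"""
    return math.tanh(gamma * dot(u, v) + coef0)

# Two orthogonal unit vectors as a quick check.
u, v = (1.0, 0.0), (0.0, 1.0)
values = {
    "linear": linear(u, v),
    "polynomial": polynomial(u, v, coef0=1.0, degree=2),
    "rbf": rbf(u, v),
    "sigmoid": sigmoid(u, v),
}
```

The polynomial kernel raises its argument to the chosen degree, so large degrees can overflow or underflow quickly, which is the numerical difficulty the slide warns about. RBF values always stay in (0, 1].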

20
Testbed
21
Baseline SVM Evaluation

16 subject categories with more than 150 samples
Determine accuracy
Let P be the number of positives in the test set
and N the number of negatives in the test set.
Let NP be the number of correctly classified
positives and MN the number of correctly
classified negatives. Then
Accuracy = (NP + MN) / (P + N)
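
The accuracy measure can be sketched directly from these definitions. The labels below are hypothetical and assume the 1 / -1 encoding used in the LibSVM data files.

```python
def accuracy(true_labels, predicted_labels):
    """(NP + MN) / (P + N) for labels in {1, -1}."""
    pairs = list(zip(true_labels, predicted_labels))
    if not pairs:
        raise ValueError("test set is empty")
    np_ = sum(1 for t, p in pairs if t == 1 and p == 1)    # correctly classified positives
    mn = sum(1 for t, p in pairs if t == -1 and p == -1)   # correctly classified negatives
    return (np_ + mn) / len(pairs)                         # len(pairs) == P + N

# Hypothetical test set: 2 positives, 3 negatives.
true_labels = [1, 1, -1, -1, -1]
predicted = [1, -1, -1, -1, 1]
score = accuracy(true_labels, predicted)
```

In a one-classifier-per-subject setup the negatives usually far outnumber the positives, so a classifier that predicts -1 for everything can still score a high accuracy; the number is best read alongside the counts NP and MN themselves.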

22
Preliminary Results
23
Normalization and Parameter Selection

[Diagram: training pipeline. Components: XML files, Lucene, terms, TF, DF, document representation, features/weights, label, normalization, ranges, Grid.py tool, best C and gamma, LibSVM Train, SVM model, features]
[Diagram: prediction pipeline. Components: XML files, Lucene, terms, TF, DF, document representation, features/weights, normalization, SVM model, LibSVM Predictor]
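
The two steps added in this pipeline can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: the function names are hypothetical, the [0, 1] target range is assumed, and the ranges are computed only over explicitly stored values (a sparse pipeline would also have to account for implicit zeros). grid.py, which ships with LibSVM, evaluates each (C, gamma) pair by cross validation; that step is omitted here.

```python
def fit_ranges(vectors):
    """Record the (min, max) of every feature id seen in the training vectors."""
    ranges = {}
    for vec in vectors:
        for fid, value in vec.items():
            lo, hi = ranges.get(fid, (value, value))
            ranges[fid] = (min(lo, value), max(hi, value))
    return ranges

def scale(vec, ranges):
    """Linearly rescale features with the stored training ranges.

    Training values land in [0, 1]; test values outside the training
    range land outside it. Features never seen in training are dropped.
    """
    scaled = {}
    for fid, value in vec.items():
        if fid not in ranges:
            continue
        lo, hi = ranges[fid]
        scaled[fid] = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return scaled

def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on an exponential grid of powers of two."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]

# Hypothetical training vectors (feature id -> weight).
train = [{1: 2.0, 2: 10.0}, {1: 4.0, 2: 30.0}]
ranges = fit_ranges(train)
scaled_test = scale({1: 3.0, 2: 30.0, 9: 5.0}, ranges)
grid = parameter_grid(range(-1, 2), range(-1, 1))
```

The same ranges are applied to training and test data, which is why "Ranges" appears as an output of the training side of the diagram.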
24
A Practical Guide to Support Vector Classification