Text Classification with Support Vector Machine

1
Text Classification with Support Vector Machine

Xuemei Li
Department of Computer Science
Old Dominion University
April 24, 2006

2
Outline

Problem and Approach
Support Vector Machine General
Architecture
Preliminary Results
Normalization and Parameter Selection
Parameter Selection Experiments
Conclusion

3
Objective

Problem
A collection of XML documents
Each document contains an average of 2,700 words
A list of subject areas
25 broad subject fields, divided into
251 narrower groups
Classification: assign each document to the most
pertinent subject area(s)

4
Input XML Document Sample

<?xml version="1.0" ?>
<!-- XML document generated using OCR technology from ScanSoft, Inc. -->
<document ssdoc-vers="SSDOC1.0" ocr-vers="OmniPage Pro 14" xmlns="x-schema:http://www.scansoft.com/omnipage/xml/ssdoc-schema2.xml">
  <page width="12240" height="15840" x-res="300" y-res="300" bpp="1" orientation="0" skew="0" filename="C:\testbed\OCR\firstlast\ADA424473.pdf" language="0">
    <region reg-type="horizontal">
      <rc l="620" t="14784" r="11572" b="15139" />
      <paragraph para-type="text" align="left" left-indent="0" right-indent="0" start-indent="0" line-spacing="180">
        <ln baseline="15029" ff="Times New Roman" fs="600" char-attr="bold">
          <wd l="715" t="14923" r="1094" b="15072" char-attr="non-bold">April</wd>
          <wd l="1142" t="14923" r="1478" b="15038" char-attr="non-bold">2004</wd>
          <wd l="9720" t="14904" r="10397" b="15082" fs="800" char-attr="italic">Defense</wd>
          <wd l="10430" t="14904" r="11160" b="15038" fs="1000">Horizons</wd>
          <wd l="11371" t="14856" r="11477" b="15034" fs="1100">1</wd>

5
Sample Subject Fields and Groups

01--Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    01 Helicopters
    02 Bombers
02--Agriculture
03--Astronomy and Astrophysics
04--Atmospheric Sciences
05--Behavioral and Social Sciences
06--Biological and Medical Sciences
07--Chemistry
08--Earth Sciences and Oceanography

6
Approach

Our approach: Support Vector Machine
Train an SVM for each subject area
Existing libraries
LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Different SVM formulations
Cross-validation for model selection
Both C++ and Java sources
Other projects in our research group have used LibSVM
before
Rainbow: http://www.cs.umass.edu/~mccallum/bow/

7
Implementation Technology

Java: JDK 1.4.2_06
IDE: Eclipse 3.0.2
Commons Digester 1.7
Lucene 1.4.3
LibSVM

8
Support Vector Machine: General

Overview
A learning method introduced by V. Vapnik
It is widely used in pattern recognition areas
such as face detection, isolated handwritten
digit recognition, gene classification, etc.
Key Ideas
Maximize margins
Construct kernels
A list of SVM applications is available at
http://www.clopinet.com/isabelle/Projects/SVM/applist.html

9
Support Vector Machine: Linear Separator

Many decision boundaries can separate these two
classes
Which one should we choose?

http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
10
Support Vector Machine: Maximize Margins
Basic idea: choose the boundary that separates the two
classes with the largest margin
http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
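
As a rough illustration of the largest-margin idea, the sketch below compares two hand-picked linear boundaries on made-up 2-D points and keeps the one with the larger geometric margin. The data, the candidate boundaries, and the function name are assumptions for illustration only; a real SVM solves an optimization problem rather than comparing a fixed list of candidates.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive result means every point lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy data (made up for illustration): two classes in the plane.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

# Two candidate boundaries; both separate the classes.
candidates = {
    "x1 = 3": ((1.0, 0.0), -3.0),
    "x1 + x2 = 5.5": ((1.0, 1.0), -5.5),
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)  # boundary with the largest margin
```

Both boundaries classify every toy point correctly, so the margin is what distinguishes them.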
11
Non-linear Separator

A curve is needed to fully separate the green and
red objects

http://www.statsoft.com/textbook/stsvm.html
12
The Kernel Trick

General idea: the original space can always be
mapped to some higher-dimensional feature space
where the training set is separable

http://www.statsoft.com/textbook/stsvm.html
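
A minimal sketch of this idea, assuming a degree-2 feature map on XOR-like toy points (both are illustrative choices, not taken from the slides): the points cannot be split by a line in 2-D, but one coordinate of the mapped space separates them, and the kernel gives the same inner product without building the mapping explicitly.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = x
    return (x1 * x1, 2 ** 0.5 * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Equals dot(phi(x), phi(y)) but never constructs phi."""
    return dot(x, y) ** 2

# XOR-like toy data: not linearly separable in the original 2-D space.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [+1, +1, -1, -1]

# In the mapped space, the sign of the middle coordinate matches the label.
mapped_middle = [phi(p)[1] for p in points]
separable = all((m > 0) == (y > 0) for m, y in zip(mapped_middle, labels))

# The kernel and the explicit mapping agree on an arbitrary pair of points.
x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))
implicit = poly2_kernel(x, y)
```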
13
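Architecture
[Diagram] Training: XML files -> Lucene (terms, TF, DF) -> Document Representation (features/weights) + label -> LibSVM Train -> SVM Model
[Diagram] Prediction: XML files -> Lucene (terms, TF, DF) -> Document Representation (features/weights, using the features from training) -> LibSVM Predictor (with the SVM Model) -> Prediction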
14
Lucene

Lucene is a high-performance text search engine
library suitable for cross-platform full-text
search applications.
It provides a direct API for TF and DF.
Built-in standard analyzer of Lucene
This analyzer is a JavaCC-based parser
with rules for email addresses, acronyms,
hostnames, and floating-point numbers
It also performs stop word removal.
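
To make the roles of analysis, TF, and DF concrete, here is a simplified stand-in written in plain Python. It does not use Lucene and does not reproduce the standard analyzer's rules; the regular expression, the stop list, and the function names are all assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list; Lucene's built-in list is different.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def term_frequencies(text):
    """TF: how often each term occurs in one document."""
    return Counter(tokenize(text))

def document_frequencies(texts):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df

docs = [
    "The helicopter and the bomber",
    "Aerodynamics of the helicopter rotor",
]
tf_first = term_frequencies(docs[0])
df = document_frequencies(docs)
```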

15
Document Representation

Different kinds of documents contain different
terms, and these term occurrences can be viewed as
clues for document classification
TF (Term Frequency)
Each distinct word corresponds to a feature
The number of times the word occurs in the document
is its value
TFIDF (Term Frequency Inverse Document Frequency)
Each distinct word corresponds to a feature
Its value is TF * (log(N/DF) + 1), where N is the number
of documents and DF is the number of documents that
contain the word
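
The TFIDF weight can be sketched as follows. The counts are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math

def tfidf_weights(tf, df, n_docs):
    """Weight each term by TF * (log(N / DF) + 1).

    tf:     term -> count in this document
    df:     term -> number of documents containing the term
    n_docs: total number of documents (N)
    """
    return {
        term: count * (math.log(n_docs / df[term]) + 1.0)
        for term, count in tf.items()
    }

# Hypothetical counts for one document in a 100-document collection.
tf = {"helicopter": 3, "rotor": 1}
df = {"helicopter": 10, "rotor": 100}
weights = tfidf_weights(tf, df, n_docs=100)
```

A term that appears in every document (here "rotor") keeps its raw TF because log(N/DF) is zero, while rarer terms are boosted.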

16
Data Format of LibSVM

Training Data File (.trn)
label featureId:featureValue
featureId:featureValue ...
Each document is a line in the Training Data File
Each document example starts with a label, 0 or
1, followed by a list of feature id:value pairs.
Feature id numbers must be greater than 0
Test Data File (.lst)
Similar to the training data file
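
A small sketch of writing one document in this format. The helper name is hypothetical; skipping zero values and writing ids in ascending order follows LibSVM's sparse input convention.

```python
def to_libsvm_line(label, features):
    """Format one document as 'label id:value id:value ...'.

    features maps a feature id (integer greater than 0) to its value.
    Zero values are omitted and ids are written in ascending order.
    """
    if any(fid <= 0 for fid in features):
        raise ValueError("feature ids must be greater than 0")
    pairs = " ".join(
        f"{fid}:{value:g}"
        for fid, value in sorted(features.items())
        if value != 0
    )
    return f"{label} {pairs}".rstrip()

# One positive example with three features, one of which is zero.
line = to_libsvm_line(1, {7: 2.5, 3: 1, 12: 0})
```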

17
Testbed
18
Baseline SVM Evaluation

16 subject categories with more than 150 samples each
Determine accuracy
Let P be the number of positives in the test set and
N the number of negatives in the test set.
Let NP be the number of correctly classified
positives and MN the number of correctly classified
negatives; then
Accuracy = (NP + MN) / (P + N)
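
The accuracy measure can be sketched as below, assuming labels are 0 or 1 as in the data files; the example labels are made up.

```python
def accuracy(true_labels, predicted_labels):
    """(correct positives + correct negatives) / (positives + negatives)."""
    pairs = list(zip(true_labels, predicted_labels))
    correct_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    correct_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    return (correct_pos + correct_neg) / len(pairs)

# Made-up test set: 2 positives and 3 negatives.
truth = [1, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 1]
score = accuracy(truth, predicted)  # 1 correct positive, 2 correct negatives
```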

19
Preliminary Results
20
LibSVM Model Selection

C-SVM
C is the penalty parameter of the error term
Four kernels
Linear: a special case of RBF
Polynomial
If a high degree is used, numerical difficulties
may occur
RBF: exp(-γ ||x - y||²)
A reasonable first choice
Sigmoid
In general, its accuracy is no better than that of RBF
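
For reference, the four kernels can be written out in plain Python as below. The parameter names gamma, coef0, and degree follow LibSVM's usual naming; the functions themselves are an illustrative sketch, not LibSVM code.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    return dot(u, v)

def polynomial(u, v, gamma, coef0, degree):
    # High degrees can produce very large or very small values.
    return (gamma * dot(u, v) + coef0) ** degree

def rbf(u, v, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid(u, v, gamma, coef0):
    return math.tanh(gamma * dot(u, v) + coef0)

x, y = (1.0, 2.0), (3.0, 4.0)
k_linear = linear(x, y)
k_poly = polynomial(x, y, gamma=1.0, coef0=0.0, degree=2)
k_rbf_same = rbf(x, x, gamma=0.5)  # identical points give the maximum value
```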

21
A Practical Guide to Support Vector
Classification

Transform data to the format of an SVM software package
Conduct simple scaling/normalization on the data
Consider the RBF kernel
Use cross-validation to find the best parameters
C and gamma
grid.py
Use the best C and gamma to train on the whole
training set
Test
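
The scaling and parameter-search steps can be sketched as follows. This is not grid.py: `cv_accuracy` is a hypothetical callback standing in for a real cross-validation run, the scorer used in the demonstration is fake, and the exponent ranges are the ones suggested in the guide (C = 2^-5 ... 2^15, gamma = 2^-15 ... 2^3).

```python
import math

def fit_ranges(rows):
    """Per-feature (min, max), computed on the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(rows, ranges):
    """Map each feature linearly so the training range becomes [0, 1]."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]

def grid_search(cv_accuracy,
                c_exponents=range(-5, 16, 2),
                gamma_exponents=range(-15, 4, 2)):
    """Try C = 2^i and gamma = 2^j; keep the best-scoring pair.

    cv_accuracy(c, gamma) is a hypothetical callback that should return
    the cross-validation accuracy of an RBF C-SVM with those parameters.
    """
    best = None
    for i in c_exponents:
        for j in gamma_exponents:
            c, gamma = 2.0 ** i, 2.0 ** j
            score = cv_accuracy(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best

train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
ranges = fit_ranges(train)
train_scaled = scale(train, ranges)
test_scaled = scale([[2.0, 40.0]], ranges)  # reuses the training ranges

def fake_cv_accuracy(c, gamma):
    # Stand-in scorer for demonstration only; peaks at C = 2^3, gamma = 2^-5.
    return 1.0 / (1.0 + abs(math.log2(c) - 3) + abs(math.log2(gamma) + 5))

best_score, best_c, best_gamma = grid_search(fake_cv_accuracy)
```

Test data is scaled with the ranges learned from the training data, so a test value outside the training range can fall outside [0, 1].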

22
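Normalization and Parameter Selection
[Diagram] Training: XML files -> Lucene (terms, TF, DF) -> Document Representation (features/weights) -> Normalization -> grid.py tool (with label) -> best C and gamma -> LibSVM Train -> SVM Model
[Diagram] Prediction: XML files -> Lucene (terms, TF, DF) -> Document Representation (features/weights, using the features from training) -> Normalization (using the ranges from training) -> LibSVM Predictor (with the SVM Model)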
23
Experiment Results with Normalization and
Parameter Selection
24
Conclusion