Title: Text Classification with Support Vector Machine
1. Text Classification with Support Vector Machine
- Xuemei Li
- Department of Computer Science
- Old Dominion University
- April 24, 2006
2. Outline
- Problem and Approach
- Support Vector Machine General
- Architecture
- Preliminary Results
- Normalization and Parameter Selection
- Parameter Selection Experiments
- Conclusion
3. Objective
- Problem
- a collection of XML documents, and
- each document contains an average of 2700 words
- a list of subject areas
- 25 broad subject fields, divided into 251 narrower groups
- Classification: assign each document to the most pertinent subject area(s)
4. Input XML Document Sample

<?xml version="1.0" ?>
<!-- XML document generated using OCR technology from ScanSoft, Inc. -->
<document ssdoc-vers="SSDOC1.0" ocr-vers="OmniPage Pro 14"
          xmlns="x-schema:http://www.scansoft.com/omnipage/xml/ssdoc-schema2.xml">
  <page width="12240" height="15840" x-res="300" y-res="300" bpp="1"
        orientation="0" skew="0"
        filename="C:\testbed\OCR\firstlast\ADA424473.pdf" language="0">
    <region reg-type="horizontal">
      <rc l="620" t="14784" r="11572" b="15139" />
      <paragraph para-type="text" align="left" left-indent="0"
                 right-indent="0" start-indent="0" line-spacing="180">
        <ln baseline="15029" ff="Times New Roman" fs="600" char-attr="bold">
          <wd l="715" t="14923" r="1094" b="15072" char-attr="non-bold">April</wd>
          <wd l="1142" t="14923" r="1478" b="15038" char-attr="non-bold">2004</wd>
          <wd l="9720" t="14904" r="10397" b="15082" fs="800" char-attr="italic">Defense</wd>
          <wd l="10430" t="14904" r="11160" b="15038" fs="1000">Horizons</wd>
          <wd l="11371" t="14856" r="11477" b="15034" fs="1100">1</wd>
        </ln>
      </paragraph>
    </region>
  </page>
</document>
5. Sample Subject Fields / Groups
- 01--Aviation Technology
- 01 Aerodynamics
- 02 Military Aircraft Operations
- 03 Aircraft
- 01 Helicopters
- 02 Bombers
-
- 02--Agriculture
- 03--Astronomy and Astrophysics
- 04--Atmospheric Sciences
- 05--Behavioral and Social Sciences
- 06--Biological and Medical Sciences
- 07--Chemistry
- 08--Earth Sciences and Oceanography
6. Approach
- Our approach: Support Vector Machine
- train one SVM for each subject area
- Existing library
- LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- different SVM formulations
- cross-validation for model selection
- both C and Java sources
- Other projects in our research group have used LibSVM before
- Rainbow: http://www.cs.umass.edu/~mccallum/bow/
7. Implementation Technology
- Java JDK 1.4.2_06
- IDE: Eclipse 3.0.2
- Commons Digester 1.7
- Lucene 1.4.3
- LibSVM
8. Support Vector Machine - General
- Overview
- A learning method introduced by V. Vapnik
- Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.
- Key ideas
- Maximize margins
- Construct kernels
- A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html
9. Support Vector Machine: Linear Separator
- Many decision boundaries can separate these two classes
- Which one should we choose?
http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
10. Support Vector Machine: Maximize Margins
- Basic idea: choose the boundary that separates the two classes with the largest margin
http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
11. Non-linear Separator
- A curve is needed to fully separate the green and red objects
http://www.statsoft.com/textbook/stsvm.html
12. The Kernel Trick
- General idea: the original space can always be mapped to some higher-dimensional feature space where the training set is separable
http://www.statsoft.com/textbook/stsvm.html
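As a toy sketch of this idea (not from the slides): 1-D points of one class at -1 and +1 and of the other class at 0 cannot be split by any single threshold, but the feature map phi(x) = (x, x²) lifts them into 2-D, where the horizontal line x₂ = 0.5 separates them. All names here are illustrative.

```java
// Toy illustration of the kernel-trick idea: 1-D points that no single
// threshold can separate become linearly separable after the feature
// map phi(x) = (x, x^2). Names are illustrative, not from LibSVM.
public class FeatureMap {
    // phi maps a 1-D input into a 2-D feature space.
    public static double[] phi(double x) {
        return new double[] { x, x * x };
    }

    // In feature space, the line x2 = 0.5 separates the classes:
    // +1 if the mapped point lies above it, -1 if below.
    public static int classify(double x) {
        return phi(x)[1] > 0.5 ? +1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(classify(-1.0)); // class +1
        System.out.println(classify(0.0));  // class -1
        System.out.println(classify(1.0));  // class +1
    }
}
```

In practice the kernel function computes inner products in the feature space without ever materializing phi(x); the explicit map is shown here only to make the geometry concrete.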
13. Architecture

(Pipeline diagram.) Training path: XML files → Lucene → terms with TF and DF → document representation (features/weights) → features + label → LibSVM Train → SVM model.
Prediction path: XML files → Lucene → terms with TF and DF → document representation (features/weights) → features → LibSVM Predictor (with the SVM model) → prediction.
14. Lucene
- Lucene is a high-performance text search engine suitable for cross-platform full-text search applications.
- It provides direct APIs for TF and DF.
- We use Lucene's built-in standard analyzer
- a JavaCC-based parser
- with rules for email addresses, acronyms, hostnames, and floating-point numbers
- stop-word removal
15. Document Representation
- Different kinds of documents contain different terms, and these term occurrences can be viewed as clues for document classification
- TF (Term Frequency)
- Each distinct word corresponds to a feature
- The number of times the word occurs in the document is its value
- TFIDF (Term Frequency - Inverse Document Frequency)
- Each distinct word corresponds to a feature
- TF × (log(N/DF) + 1), where N is the number of documents in the collection
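The TFIDF weight above, tf × (log(N/df) + 1), can be written out directly. This is a minimal sketch with our own method names; in the actual system the TF and DF counts come from Lucene.

```java
// Sketch of the TFIDF weighting described on this slide:
//   weight = tf * (log(N / df) + 1)
// where tf is the term's count in the document, N the number of
// documents in the collection, and df the number of documents
// containing the term. Names are ours, not Lucene's or LibSVM's.
public class TfIdf {
    public static double weight(int tf, int numDocs, int df) {
        return tf * (Math.log((double) numDocs / df) + 1.0);
    }

    public static void main(String[] args) {
        // A term appearing in every document gets weight tf * 1
        // (log(N/N) = 0), so rare terms weigh more than common ones.
        System.out.println(weight(2, 1000, 1000));
        System.out.println(weight(2, 1000, 10));
    }
}
```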
16. Data Format of LibSVM
- Training data file (.trn)
- label featureId:featureValue featureId:featureValue ...
- Each document is one line in the training data file
- Each document example starts with a label, 0 or 1, followed by a list of feature id:value pairs
- Feature id numbers must be greater than 0
- Test data file (.lst)
- Similar to the training data file
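A sketch of emitting one document in this line format, assuming a sparse map from feature id to weight. The class name is ours; LibSVM additionally expects feature ids in ascending order, which a TreeMap guarantees.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Formats one document as a LibSVM line: "label id:value id:value ...".
// Feature ids must be positive and appear in ascending order; iterating
// a TreeMap yields them sorted. Class and method names are our own.
public class LibSvmLine {
    public static String format(int label, SortedMap<Integer, Double> features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label);
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SortedMap<Integer, Double> doc = new TreeMap<Integer, Double>();
        doc.put(1, 2.0);  // feature 1, e.g. a TF or TFIDF weight
        doc.put(3, 0.5);  // feature 3; absent features are simply omitted
        System.out.println(format(1, doc)); // "1 1:2.0 3:0.5"
    }
}
```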
17. Testbed
18. Baseline SVM Evaluation
- 16 subject categories with more than 150 samples each
- Determine accuracy
- Assume P is the number of positives in the test set and N is the number of negatives in the test set.
- Assume NP is the number of correctly classified positives and MN is the number of correctly classified negatives. Then:
- Accuracy = (NP + MN) / (P + N)
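The accuracy definition above is simple arithmetic; a one-method sketch (names ours) makes the formula unambiguous:

```java
// Accuracy as defined on this slide: correctly classified positives (NP)
// plus correctly classified negatives (MN), over all test examples (P + N).
public class Accuracy {
    public static double accuracy(int np, int mn, int p, int n) {
        return (double) (np + mn) / (p + n);
    }

    public static void main(String[] args) {
        // 80 of 100 positives and 60 of 100 negatives correct -> 0.7
        System.out.println(accuracy(80, 60, 100, 100));
    }
}
```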
19. Preliminary Results
20. LibSVM Model Selection
- C-SVM
- C is the penalty parameter of the error function
- 4 kernels
- Linear: a special case of RBF
- Polynomial
- If a high degree is used, numerical difficulties may occur
- RBF: K(x, y) = exp(-γ‖x − y‖²)
- a reasonable first choice
- Sigmoid
- In general its accuracy is not better than RBF
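The RBF kernel K(x, y) = exp(-γ‖x − y‖²) written out for dense vectors. LibSVM computes this internally; this standalone version (our own class name) only makes the formula concrete.

```java
// RBF kernel from this slide: K(x, y) = exp(-gamma * ||x - y||^2).
// K(x, x) = 1, and K decays toward 0 as x and y move apart; gamma
// controls how fast. Class name is ours, not LibSVM's.
public class RbfKernel {
    public static double k(double[] x, double[] y, double gamma) {
        double sq = 0.0; // squared Euclidean distance ||x - y||^2
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sq += d * d;
        }
        return Math.exp(-gamma * sq);
    }

    public static void main(String[] args) {
        System.out.println(k(new double[] {1, 2}, new double[] {1, 2}, 0.5)); // identical points -> 1.0
        System.out.println(k(new double[] {0, 0}, new double[] {3, 4}, 0.5)); // distant points -> near 0
    }
}
```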
21. A Practical Guide to Support Vector Classification
- Transform data to the format of an SVM software package
- Conduct simple scaling/normalization on the data
- Consider the RBF kernel
- Use cross-validation to find the best parameters C and gamma
- grid.py
- Use the best C and gamma to train on the training set
- Test
22. Normalization and Parameter Selection

(Pipeline diagram.) Training path: XML files → Lucene → terms with TF and DF → document representation (features/weights) → normalization → grid.py tool → best C and gamma → features + label → LibSVM Train → SVM model and feature ranges.
Prediction path: XML files → Lucene → terms with TF and DF → document representation (features/weights) → normalization (using the training ranges) → features → LibSVM Predictor (with the SVM model) → prediction.
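A minimal sketch of the normalization step, assuming simple min-max scaling of each feature to [0, 1] using ranges observed on the training set (LibSVM ships an svm-scale tool for this; the method below, with our own names, only illustrates the arithmetic):

```java
// Min-max normalization sketch: scale a feature value to [0, 1] using
// the min and max observed on the training set. The same training-set
// ranges must be reused at prediction time. Names are ours.
public class MinMaxScale {
    // Returns (v - min) / (max - min); a constant feature maps to 0
    // to avoid division by zero.
    public static double scale(double v, double min, double max) {
        if (max == min) return 0.0;
        return (v - min) / (max - min);
    }

    public static void main(String[] args) {
        System.out.println(scale(5.0, 0.0, 10.0)); // midpoint -> 0.5
        System.out.println(scale(3.0, 3.0, 3.0));  // constant feature -> 0.0
    }
}
```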
23. Experiment Results with Normalization and Parameter Selection
24. Conclusion
- Baseline results
- The average accuracy for half of the 16 categories is greater than 70%
- 6 of 16 categories are between 65% and 70%
- 2 of 16 are between 60% and 65%
- Normalization and parameter selection experiments
- Accuracy can be improved by 7-10% by normalization with the best C and gamma