Text Classification with Support Vector Machine

1
Text Classification with Support Vector Machine
  • Xuemei Li
  • Department of Computer Science
  • Old Dominion University
  • April 24, 2006

2
Outline
  • Problem and Approach
  • Support Vector Machine General
  • Architecture
  • Preliminary Results
  • Normalization and Parameter Selection
  • Parameter Selection Experiments
  • Conclusion

3
Objective
  • Problem: given
  • a collection of XML documents, each containing
    an average of 2700 words, and
  • a list of subject areas (25 broad subject fields,
    divided into 251 narrower groups)
  • Classification: assign each document to the most
    pertinent subject area(s)

4
Input XML Document Sample
  <?xml version="1.0" ?>
  <!-- XML document generated using OCR technology from ScanSoft, Inc. -->
  <document ssdoc-vers="SSDOC1.0" ocr-vers="OmniPage Pro 14"
      xmlns="x-schema:http://www.scansoft.com/omnipage/xml/ssdoc-schema2.xml">
    <page width="12240" height="15840" x-res="300" y-res="300" bpp="1"
        orientation="0" skew="0"
        filename="C:\testbed\OCR\firstlast\ADA424473.pdf" language="0">
      <region reg-type="horizontal">
        <rc l="620" t="14784" r="11572" b="15139" />
        <paragraph para-type="text" align="left" left-indent="0"
            right-indent="0" start-indent="0" line-spacing="180">
          <ln baseline="15029" ff="Times New Roman" fs="600" char-attr="bold">
            <wd l="715" t="14923" r="1094" b="15072" char-attr="non-bold">April</wd>
            <wd l="1142" t="14923" r="1478" b="15038" char-attr="non-bold">2004</wd>
            <wd l="9720" t="14904" r="10397" b="15082" fs="800" char-attr="italic">Defense</wd>
            <wd l="10430" t="14904" r="11160" b="15038" fs="1000">Horizons</wd>
            <wd l="11371" t="14856" r="11477" b="15034" fs="1100">1</wd>

5
Sample Subject Fields and Groups
  • 01--Aviation Technology
      • 01 Aerodynamics
      • 02 Military Aircraft Operations
      • 03 Aircraft
          • 01 Helicopters
          • 02 Bombers
  • 02--Agriculture
  • 03--Astronomy and Astrophysics
  • 04--Atmospheric Sciences
  • 05--Behavioral and Social Sciences
  • 06--Biological and Medical Sciences
  • 07--Chemistry
  • 08--Earth Sciences and Oceanography

6
Approach
  • Our approach: Support Vector Machine
  • Train one SVM for each subject area
  • Existing library
  • LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • Supports different SVM formulations
  • Cross-validation for model selection
  • Both C++ and Java sources
  • Another project in our research group has used
    LibSVM before
  • Rainbow: http://www.cs.umass.edu/~mccallum/bow/
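
Training one SVM per subject area means each classifier sees the same documents with a different 0/1 labeling. The sketch below shows one way to derive those labelings; the document tags and the helper name `binary_labels` are hypothetical and only illustrate the idea, they are not taken from the project's code.

```python
def binary_labels(doc_areas, target_area):
    """Return 1 for documents tagged with target_area, 0 otherwise."""
    return [1 if target_area in areas else 0 for areas in doc_areas]


# Hypothetical documents, each tagged with one or more subject areas.
doc_areas = [
    {"01--Aviation Technology"},
    {"07--Chemistry", "06--Biological and Medical Sciences"},
    {"07--Chemistry"},
]

# One label vector per subject area, i.e. one binary training task per SVM.
all_areas = sorted(set().union(*doc_areas))
labels_per_area = {area: binary_labels(doc_areas, area) for area in all_areas}

for area, labels in labels_per_area.items():
    print(area, labels)
```

Because a document may carry several tags, it can be a positive example for more than one classifier, which matches the goal of assigning the most pertinent subject area(s).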

7
Implementation Technology
  • Java JDK 1.4.2_06
  • IDE: Eclipse 3.0.2
  • Commons Digester 1.7
  • Lucene 1.4.3
  • LibSVM

8
Support Vector Machine: General
  • Overview
  • A learning method introduced by V. Vapnik
  • Widely used in pattern recognition, for example
    face detection, isolated handwritten digit
    recognition, and gene classification
  • Key ideas
  • Maximize margins
  • Construct kernels
  • A list of SVM applications is available at
    http://www.clopinet.com/isabelle/Projects/SVM/applist.html

9
Support Vector Machine: Linear Separator
  • Many decision boundaries can separate these two
    classes
  • Which one should we choose?

http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
10
Support Vector Machine: Maximize Margins
Basic idea: choose the decision boundary that
separates the two classes with the largest margin
http://www.cs.utexas.edu/~mooney/cs391L/svm.ppt
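
The largest-margin idea is usually written as an optimization problem. The formulation below is the standard hard-margin form for linearly separable data. The notation is an assumption introduced here and does not come from the slides: training points $x_i$ with labels $y_i \in \{-1, +1\}$, weight vector $w$, and offset $b$.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\lVert w\rVert^{2} \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1, \qquad i = 1,\dots,n
\end{aligned}
```

Under these constraints the margin between the two classes has width $2/\lVert w\rVert$, so minimizing $\lVert w\rVert$ is the same as maximizing the margin.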
11
Non-linear Separator
  • A curve is needed to fully separate the green and
    red objects

http://www.statsoft.com/textbook/stsvm.html
12
The Kernel Trick
  • General idea: the original space can always be
    mapped to some higher-dimensional feature space
    where the training set is separable

http://www.statsoft.com/textbook/stsvm.html
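
A toy illustration of the mapping idea, using made-up one-dimensional data: the positive class lies on both sides of the negative class, so no threshold on x separates them, but after the explicit map x -> (x, x^2) a threshold on the second coordinate does. A kernel achieves the same effect without computing the mapped points; the explicit map is used here only to keep the sketch easy to follow.

```python
def feature_map(x):
    """Map a 1-D point x to the 2-D point (x, x^2)."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """True if a single threshold puts one class entirely below the other."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return max(neg) < min(pos) or max(pos) < min(neg)


# Toy data (an assumption for illustration): positives surround negatives.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

mapped = [feature_map(x) for x in points]
in_original_space = separable_by_threshold(points, labels)
in_mapped_space = separable_by_threshold([m[1] for m in mapped], labels)

print(in_original_space)  # False
print(in_mapped_space)    # True
```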
13
Architecture
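[Figure: architecture diagram with two pipelines]
  • Training: XML files -> Lucene (terms, TF, DF) ->
    Document Representation (features/weights, label)
    -> LibSVM Train -> SVM Model
  • Prediction: XML files -> Lucene (terms, TF, DF) ->
    Document Representation (features/weights, using
    the features from training) -> LibSVM Predictor
    -> Prediction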
14
Lucene
  • Lucene is a high-performance text search engine
    library, suitable for cross-platform full-text
    search applications.
  • It provides a direct API for TF and DF.
  • Lucene's built-in standard analyzer
  • is a JavaCC-based parser
  • with rules for email addresses, acronyms,
    hostnames, and floating point numbers
  • and performs stop word removal.
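
To make the roles of TF and DF concrete, here is a minimal plain-Python stand-in for the analysis step. It does not use or imitate Lucene's API: the regular-expression tokenizer, the tiny stop-word list, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(text):
    """TF: how many times each term occurs in one document."""
    return Counter(tokenize(text))


def document_frequencies(texts):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


docs = [
    "The aircraft is a bomber",
    "Aircraft and helicopters",
    "Chemistry of the atmosphere",
]
tf_first = term_frequencies(docs[0])
df = document_frequencies(docs)
print(dict(tf_first))
print(df["aircraft"], df["chemistry"])
```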

15
Document Representation
  • Different kinds of documents contain different
    terms, and these term occurrences can be viewed as
    clues for document classification
  • TF (Term Frequency)
  • Each distinct word corresponds to a feature
  • The feature value is the number of times the word
    occurs in the document
  • TFIDF (Term Frequency Inverse Document Frequency)
  • Each distinct word corresponds to a feature
  • The feature value is
    $\mathrm{TF} \times (\log(N/\mathrm{DF}) + 1)$,
    where N is the number of documents in the collection
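
A small sketch of the TFIDF weighting above, computed as TF * (log(N / DF) + 1). The counts are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def tfidf_weights(tf, df, n_docs):
    """Weight each term of one document as TF * (log(N / DF) + 1)."""
    return {
        term: count * (math.log(n_docs / df[term]) + 1.0)
        for term, count in tf.items()
    }


# Hypothetical counts for one document in a collection of 4 documents.
tf = {"aircraft": 3, "bomber": 1}   # occurrences within this document
df = {"aircraft": 4, "bomber": 1}   # number of documents containing the term
weights = tfidf_weights(tf, df, n_docs=4)
print(weights)
```

A term that appears in every document ("aircraft" here) keeps only its raw TF, because log(N / DF) is zero, while a rarer term is boosted.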

16
Data Format of LibSVM
  • Training data file (.trn)
  • label featureId:featureValue
    featureId:featureValue ...
  • Each document is one line in the training data file
  • Each line starts with a label, 0 or 1, followed by
    a list of featureId:featureValue pairs
  • Feature ids must be greater than 0
  • Test data file (.lst)
  • Similar to the training data file
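
The line format can be produced with a few lines of code. This is a sketch with a hypothetical helper, `to_libsvm_line`; it writes feature ids in ascending order and skips zero values, which is how sparse LibSVM input is normally laid out.

```python
def to_libsvm_line(label, features):
    """Format one document as 'label id:value id:value ...'.

    `features` maps a feature id (integer >= 1) to a numeric value.
    Zero values are skipped and ids are written in ascending order.
    """
    if any(fid < 1 for fid in features):
        raise ValueError("feature ids must be greater than 0")
    pairs = [
        f"{fid}:{value:g}"
        for fid, value in sorted(features.items())
        if value != 0
    ]
    return " ".join([str(label)] + pairs)


line = to_libsvm_line(1, {7: 2.386, 3: 3.0, 12: 0.0})
print(line)  # 1 3:3 7:2.386
```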

17
Testbed
18
Baseline SVM Evaluation
  • 16 subject categories with more than 150 samples
  • Determine accuracy
  • Let P be the number of positives and N the number
    of negatives in the test set
  • Let NP be the number of correctly classified
    positives and MN the number of correctly
    classified negatives; then
  • $\mathrm{Accuracy} = (\mathit{NP} + \mathit{MN}) / (P + N)$
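
The accuracy measure can be checked on a small example. The label vectors below are made up; the function simply counts correctly classified positives (NP) and negatives (MN) and divides by the test set size (P + N).

```python
def accuracy(true_labels, predicted_labels):
    """Accuracy = (NP + MN) / (P + N) for 0/1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    np_correct = sum(1 for t, p in pairs if t == 1 and p == 1)  # NP
    mn_correct = sum(1 for t, p in pairs if t == 0 and p == 0)  # MN
    return (np_correct + mn_correct) / len(pairs)  # len(pairs) = P + N


true_labels = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
result = accuracy(true_labels, predicted)
print(result)  # 0.75
```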

19
Preliminary Results
20
LibSVM Model Selection
  • C-SVM
  • C is the penalty parameter of the error term
  • 4 kernels
  • Linear: a special case of RBF
  • Polynomial: if a high degree is used, numerical
    difficulties may occur
  • RBF: $\exp(-\gamma \lVert x - y \rVert^{2})$, a reasonable
    first choice
  • Sigmoid: in general its accuracy is not better
    than RBF
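
For reference, the RBF kernel value exp(-gamma * ||x - y||^2) is straightforward to compute directly. The vectors and the gamma value below are arbitrary illustrative choices.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x = [1.0, 0.0, 2.0]
y = [1.0, 1.0, 0.0]
same = rbf_kernel(x, x, gamma=0.5)   # identical vectors give 1.0
other = rbf_kernel(x, y, gamma=0.5)  # squared distance 5, so exp(-2.5)
print(same, other)
```

Larger gamma makes the kernel value fall off faster with distance, which is why gamma has to be tuned together with C.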

21
A Practical Guide to Support Vector Classification
  • Transform the data to the format of an SVM package
  • Conduct simple scaling/normalization on the data
  • Consider the RBF kernel
  • Use cross-validation to find the best parameters
    C and gamma
  • grid.py
  • Use the best C and gamma to train on the whole
    training set
  • Test
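
The cross-validation step amounts to trying (C, gamma) pairs on an exponentially spaced grid and keeping the pair with the best cross-validation accuracy. The sketch below uses the exponent ranges suggested in the practical guide (C from 2^-5 to 2^15, gamma from 2^-15 to 2^3, both in steps of 2 in the exponent). The scoring function `toy_cv_accuracy` is a made-up stand-in so the example runs on its own; in practice it would be replaced by a real cross-validation run, for example with LibSVM.

```python
import itertools
import math


def exponents(begin, end, step):
    """Inclusive range of exponents, ascending or descending."""
    values = []
    e = begin
    while (step > 0 and e <= end) or (step < 0 and e >= end):
        values.append(e)
        e += step
    return values


def parameter_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Yield (C, gamma) pairs on an exponential grid."""
    for c_exp, g_exp in itertools.product(exponents(*log2c), exponents(*log2g)):
        yield 2.0 ** c_exp, 2.0 ** g_exp


def grid_search(cv_accuracy, grid):
    """Return (best_accuracy, C, gamma) for the given scoring function."""
    return max((cv_accuracy(c, g), c, g) for c, g in grid)


def toy_cv_accuracy(c, gamma):
    """Stand-in score that peaks at C = 2^3, gamma = 2^-5 (illustration only)."""
    return 1.0 / (1.0 + abs(math.log2(c) - 3) + abs(math.log2(gamma) + 5))


best = grid_search(toy_cv_accuracy, parameter_grid())
print(best)  # (1.0, 8.0, 0.03125)
```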

22
Normalization and Parameter Selection
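[Figure: architecture diagram extended with
normalization and parameter selection]
  • Training: XML files -> Lucene (terms, TF, DF) ->
    Document Representation (features/weights, label)
    -> Normalization -> Grid.py tool (best C and
    gamma) -> LibSVM Train -> SVM Model
  • Prediction: XML files -> Lucene (terms, TF, DF) ->
    Document Representation (features/weights) ->
    Normalization (using the ranges from training)
    -> LibSVM Predictor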
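
The "Normalization" and "Ranges" labels in the diagram suggest that scaling ranges are computed on the training data and reused at prediction time. The slides do not say which normalization was used, so the sketch below assumes simple min-max scaling to [0, 1] on dense feature rows; the function names are hypothetical.

```python
def fit_ranges(rows):
    """Record (min, max) for every feature column of the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale(row, ranges, lower=0.0, upper=1.0):
    """Linearly map each feature using the ranges learned from training."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            scaled.append(lower)  # constant feature: map to the lower bound
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


train = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
test_row = [3.0, 40.0]

ranges = fit_ranges(train)                      # computed once, on training data
scaled_train = [scale(r, ranges) for r in train]
scaled_test = scale(test_row, ranges)           # same ranges reused for test data
print(scaled_train)
print(scaled_test)
```

Reusing the training ranges keeps training and test features on the same scale; as the second test value shows, a test feature can then fall outside [0, 1].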
23
Experiment Results with Normalization and
Parameter Selection
24
Conclusion
  • Baseline results
  • For half of the 16 categories, the average accuracy
    is greater than 70%
  • 6 of 16 categories are between 65% and 70%
  • 2 of 16 are between 60% and 65%
  • Normalization and parameter selection experiments
  • Accuracy can be improved by 7-10% through
    normalization with the best C and gamma

25
  • Thank you!