DOCUMENT CLASSIFICATION WITH SVM

1
DOCUMENT CLASSIFICATION WITH SVM
  • Studies in Computational Linguistics II: Opinion
    Mining and Sentiment Analysis
  • Hwang Inbeom

2
Overview
  • Considered only unigrams as features
  • Score of each feature is evaluated using
    frequency of corresponding unigram
  • Implemented with Ruby and MySQL

3
Implementation
  • Ruby
  • http://www.ruby-lang.org/ko/
  • Simple and productive
  • SQL
  • http://www.w3schools.com/sql/default.asp
  • Ruby/MySQL connector
  • http://www.tmtm.org/en/ruby/mysql/

4
Implementation (contd.)
  • Ruby examples

require 'mysql'

# check if "hello" exists in table tf
my = Mysql::new("localhost", "inbeom", "inbeom", "inbeom")
res = my.query("SELECT * FROM tf WHERE word = 'hello'")
if res.num_rows > 0
  puts "hello exists in the table!"
end

# print all words in a file
File.foreach(filename) do |l|
  words = l.split
  words.each do |w|
    puts w
  end
end

5
Dataset
  • Bo Pang's polarity dataset v2.0
  • 1000 positive and 1000 negative movie reviews
  • Plain text format

films adapted from comic books have had plenty of
success , whether they're about superheroes (
batman , superman , spawn ) , or geared toward
kids ( casper ) or the arthouse crowd ( ghost
world ) , but there's never really been a comic
book like from hell before . for starters , it
was created by alan moore ( and eddie campbell )
, who brought the medium to a whole new level in
the mid '80s with a 12-part series called the
watchmen .
6
Stop Word Elimination
  • A list of stop words can be found on the web
  • http://www.lextek.com/manuals/onix/stopwords1.html
  • Unigrams in this list are excluded from the
    feature evaluation process
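
The filtering step can be sketched as follows. This is a minimal illustration, not the presenter's code: the stop list below is a tiny hypothetical sample rather than the Onix list, and the helper name content_words is made up for this example.

```ruby
require 'set'

# Hypothetical sample; in practice the list would be loaded from a file
# holding one stop word per line.
STOP_WORDS = Set.new(%w[a an and the of to in is it was])

# Returns the tokens of a line that are not stop words.
# The reviews are already lowercased and whitespace-tokenized.
def content_words(line)
  line.split.reject { |w| STOP_WORDS.include?(w) }
end

puts content_words("it was created by alan moore").join(" ")
# prints: created by alan moore
```

A Set is used so that each membership test takes constant time, which matters when every token of 2000 reviews is checked.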

7
Unigram Frequency
  • Counted the number of appearances of each unigram
    in both the positive and negative document sets
  • Number of entries was over 45,000

8
Unigram Frequency Implementation
  • Algorithm
    • create an empty table as described in the previous slide
    • while there is an unprocessed document d
      • for every word w in d
        • insert w into the table if w is not inserted yet
        • update the table row whose word is w
          • pos = pos + 1 if d is in the positive set
          • neg = neg + 1 if d is in the negative set
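
The counting loop above can be sketched in Ruby. To keep the example self-contained, an in-memory Hash stands in for the MySQL table, which is an assumption of this sketch; the function name count_unigrams and the sample documents are hypothetical.

```ruby
# In-memory stand-in for the table: word => { pos: count, neg: count }.
def count_unigrams(docs)
  # The default block creates a zeroed row the first time a word is seen,
  # which plays the role of "insert w if not inserted yet".
  table = Hash.new { |h, w| h[w] = { pos: 0, neg: 0 } }
  docs.each do |text, label|        # label is :pos or :neg
    text.split.each do |w|
      table[w][label] += 1
    end
  end
  table
end

docs = [
  ["a great film", :pos],
  ["a dull film", :neg],
  ["great great cast", :pos]
]
table = count_unigrams(docs)
puts "great: pos=#{table['great'][:pos]} neg=#{table['great'][:neg]}"
# prints: great: pos=3 neg=0
```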

9
Distribution of Unigram Frequency
  • Excluded unigrams that occurred fewer than 5 times,
    that occurred only in the positive or only in the
    negative set, or whose total count exceeded 2000
  • 12,830 unigrams used as features
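
The three exclusion rules can be sketched as a single filter over the count table. The table layout (word => { pos:, neg: }), the helper name select_features, and the sample counts are assumptions for illustration. The boundary handling follows this slide's wording (fewer than 5, more than 2000), whereas slide 13 states the bounds as strict inequalities.

```ruby
MIN_COUNT = 5
MAX_COUNT = 2000

# table maps word => { pos: count, neg: count }.
# Keeps words that appear in both sets and whose total count
# lies within [MIN_COUNT, MAX_COUNT].
def select_features(table)
  kept = table.select do |_word, c|
    total = c[:pos] + c[:neg]
    total >= MIN_COUNT && total <= MAX_COUNT && c[:pos] > 0 && c[:neg] > 0
  end
  kept.keys
end

# Hypothetical counts, chosen to trigger each rule once.
table = {
  "film"   => { pos: 1500, neg: 1400 },  # too frequent
  "onlyp"  => { pos: 12,   neg: 0 },     # positive set only
  "dull"   => { pos: 3,    neg: 9 },     # kept
  "rarely" => { pos: 1,    neg: 1 }      # too rare
}
puts select_features(table).join(" ")
# prints: dull
```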

10
Distribution of Unigram Frequency
  • Unigrams that occurred more often in the positive set
  • Several personal names and movie titles ranked
    very highly
  • Mulan, Flynt, Lebowski, Jude, Winslet, Homer, ...
  • Other positive words
  • Hatred, whisperer, astounding, exotica,
    fascination, ...
  • Unigrams that occurred more often in the negative set
  • Seagal, Jawbreaker, Jakob, Hudson, magoo, ...

11
Method 1: Unigram Presence
  • Set the score(u, d) of unigram u to 1 if u is in
    document d
  • Implemented with a hash table
  • Algorithm: for each document d,
    • for every word w in document d
      • if w is not a stop word
        • h[d, w] = 1
    • print w1:h[w1], w2:h[w2], ...
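
A minimal sketch of the presence method for a single document follows. The stop list is a tiny hypothetical sample, the helper names are made up, and the "word:score" output format is an assumption based on the print step above.

```ruby
require 'set'

SAMPLE_STOP = Set.new(%w[a the of])   # hypothetical tiny stop list

# Builds h[w] = 1 for every non-stop word of one document.
def presence_features(text, stop_words)
  h = {}
  text.split.each do |w|
    h[w] = 1 unless stop_words.include?(w)
  end
  h
end

# Formats the hash as "w1:score1 w2:score2 ...".
def format_features(h)
  h.map { |w, score| "#{w}:#{score}" }.join(" ")
end

puts format_features(presence_features("the plot of the film", SAMPLE_STOP))
# prints: plot:1 film:1
```

Many SVM tools expect integer feature ids rather than words, so a word-to-id mapping would likely be needed before training; the slides do not show that step.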

12
Method 2: Unigram Frequency
  • Presence score of each feature is multiplied by
    the number of its occurrences in document d
  • h[d, w] = occurring[d, w]
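
The frequency method differs from presence only in that the hash stores a count. A minimal sketch, with a hypothetical helper name and sample input:

```ruby
# Builds h[w] = number of occurrences of w in one document,
# skipping any word found in stop_words.
def frequency_features(text, stop_words = [])
  h = Hash.new(0)
  text.split.each do |w|
    h[w] += 1 unless stop_words.include?(w)
  end
  h
end

h = frequency_features("good good film , good cast", [","])
puts h.map { |w, n| "#{w}:#{n}" }.join(" ")
# prints: good:3 film:1 cast:1
```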

13
Method 3: Thresholding
  • Cut off frequent and rare unigrams
  • 2000 > number of occurrences > 5
  • About 13,000 unigrams remained
  • Applied to the former two methods
  • Thresholding presence
  • Thresholding frequency

14
Method 4: Scoring
  • Assigned a base score to each unigram
  • Feature score is evaluated by multiplying the
    unigram frequency by the base score

15
Base Scoring Method
  • Base score function of a unigram

16
Feature Score
  • Base score is multiplied by number of appearances
    of each unigram in a document
  • Could be improved by applying a smoothing function
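
The multiplication step can be sketched as below. The base score function itself is not reproduced in this transcript, so the base scores here are hypothetical placeholder values, and log(1 + n) is just one possible smoothing function, chosen for illustration rather than taken from the slides.

```ruby
# counts: word => occurrences in one document
# base:   word => base score (placeholder values; the actual base score
#         function is not given in the transcript)
# smooth: function applied to the raw count; identity by default
def feature_scores(counts, base, smooth = ->(n) { n })
  counts.each_with_object({}) do |(w, n), scores|
    # Words without a base score are not features and are skipped.
    scores[w] = base[w] * smooth.call(n) if base.key?(w)
  end
end

base   = { "astounding" => 0.5, "dull" => -0.25 }   # hypothetical
counts = { "astounding" => 2, "dull" => 4, "film" => 7 }

raw      = feature_scores(counts, base)
smoothed = feature_scores(counts, base, ->(n) { Math.log(1 + n) })

raw.each { |w, s| puts "#{w}: #{s}" }
```

With the log smoothing, repeated occurrences of a word still raise its score, but each extra occurrence adds less than the one before.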

17
Evaluation Environment
  • Training set
  • 900 documents from each set
  • Unigram frequencies are counted in this set
  • Test set
  • 100 documents from each set
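
A sketch of the split, assuming the last 100 documents of each class are held out. The slides do not say how the test documents were chosen, so this choice, the helper name, and the file names are all hypothetical.

```ruby
# Splits one class's documents into [training, test] parts,
# holding out the last test_size documents.
def split_docs(docs, test_size)
  train = docs.first(docs.size - test_size)
  test  = docs.last(test_size)
  [train, test]
end

pos = (1..1000).map { |i| "pos_#{i}.txt" }   # hypothetical file names
neg = (1..1000).map { |i| "neg_#{i}.txt" }

pos_train, pos_test = split_docs(pos, 100)
neg_train, neg_test = split_docs(neg, 100)

puts "train: #{pos_train.size + neg_train.size}, test: #{pos_test.size + neg_test.size}"
# prints: train: 1800, test: 200
```

Unigram frequencies would then be counted over the training parts only, so that no statistics from the test documents leak into the features.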

18
Evaluation Results