Title: DOCUMENT CLASSIFICATION WITH SVM
1 DOCUMENT CLASSIFICATION WITH SVM
- Studies in Computational Linguistics II: Opinion Mining and Sentiment Analysis
- Hwang Inbeom
2 Overview
- Considered only unigrams as features
- The score of each feature is computed from the frequency of the corresponding unigram
- Implemented with Ruby and MySQL
3 Implementation
- Ruby
- http://www.ruby-lang.org/ko/
- Simple and productive
- SQL
- http://www.w3schools.com/sql/default.asp
- Ruby/MySQL connector
- http://www.tmtm.org/en/ruby/mysql/
4 Implementation (contd.)
# check if 'hello' exists in table tf
require 'mysql'
my = Mysql.new('localhost', 'inbeom', 'inbeom', 'inbeom')
res = my.query("SELECT * FROM tf WHERE word = 'hello'")
if res.num_rows > 0
  puts 'hello exists in the table!'
end

# print all words in a file
File.foreach(filename) do |l|
  words = l.split
  words.each do |w|
    puts w
  end
end
5 Dataset
- Bo Pang's polarity dataset v2.0
- 1000 positive and 1000 negative movie reviews
- Plain text format
films adapted from comic books have had plenty of
success , whether they're about superheroes (
batman , superman , spawn ) , or geared toward
kids ( casper ) or the arthouse crowd ( ghost
world ) , but there's never really been a comic
book like from hell before . for starters , it
was created by alan moore ( and eddie campbell )
, who brought the medium to a whole new level in
the mid '80s with a 12-part series called the
watchmen .
6 Stop Word Elimination
- A list of stop words can be found on the web
- http://www.lextek.com/manuals/onix/stopwords1.html
- Unigrams in this list are excluded from the feature evaluation process
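The filtering step can be sketched in Ruby roughly as follows. This is a minimal illustration, not the code used in the project: the abbreviated `STOP_WORDS` list and the helper name `remove_stop_words` are assumptions made for the example, and the real list would be loaded from the source cited above.

```ruby
require 'set'

# Hypothetical, abbreviated stop-word list (assumption for illustration);
# the full list would come from the stop-word source cited on the slide.
STOP_WORDS = Set.new(%w[a an and are is of the to was])

# Return the tokens of a line that are not stop words.
def remove_stop_words(line)
  line.split.reject { |w| STOP_WORDS.include?(w.downcase) }
end

puts remove_stop_words('the film was a success').join(' ')
```

A Set is used so that each membership check takes constant time, which matters when the check runs once per token over 2000 reviews.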
7 Unigram Frequency
- Counted the number of occurrences of each unigram in both the positive and negative document sets
- The number of entries was over 45,000
8 Unigram Frequency Implementation
- Algorithm
- create an empty table with the layout shown on the previous slide
- while there is an unprocessed document d
  - for every word w in d
    - insert w into the table if w has not been inserted yet
    - update the table row whose word is w
      - pos = pos + 1 if d is in the positive set
      - neg = neg + 1 if d is in the negative set
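The counting loop above might look like this in Ruby. To keep the sketch self-contained it uses an in-memory Hash as a stand-in for the MySQL table, which is an assumption; the method name `count_unigrams` and the two sample documents are likewise invented for illustration.

```ruby
# In-memory stand-in for the frequency table: word => { pos: n, neg: n }.
# (Assumption: the slides describe a MySQL table; a Hash keeps this runnable.)
def count_unigrams(documents)
  # A missing word is "inserted" with zero counts on first access.
  table = Hash.new { |h, w| h[w] = { pos: 0, neg: 0 } }
  documents.each do |text, label|
    text.split.each do |w|
      table[w][label] += 1 # label is :pos or :neg
    end
  end
  table
end

# Hypothetical two-document corpus.
docs = [
  ['a great great film', :pos],
  ['a dull film', :neg]
]
count_unigrams(docs).each do |w, c|
  puts "#{w}\t#{c[:pos]}\t#{c[:neg]}"
end
```

As in the algorithm, every occurrence of a word increments the counter, so a word repeated within one document is counted more than once.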
9 Distribution of Unigram Frequency
- Excluded unigrams that occurred fewer than 5 times, that occurred only in the positive set or only in the negative set, or that occurred more than 2000 times in total
- 12,830 unigrams were used as features
10 Distribution of Unigram Frequency
- Unigrams that occurred more in the positive set
- Several personal names and movie titles ranked very highly
- Mulan, Flynt, Lebowski, Jude, Winslet, Homer, ...
- Other positive words
- Hatred, whisperer, astounding, exotica, fascination, ...
- Unigrams that occurred more in the negative set
- Seagal, Jawbreaker, Jakob, Hudson, magoo, ...
11 Method 1: Unigram Presence
- Set the score(u, d) of unigram u to 1 if u is in the document d
- Implemented with a hash table
- Algorithm: for each document d,
  - for every word w in document d
    - if w is not a stop word
      - h[d, w] = 1
  - print w1:h[d, w1], w2:h[d, w2], ...
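A rough Ruby sketch of the presence method for a single document follows. The method names, the tiny stop-word list, and the space-separated `word:score` output format are assumptions made for illustration; the slides do not show the exact output layout.

```ruby
# Presence features for one document: h[w] = 1 for every unigram w
# that is not a stop word and appears at least once.
def presence_features(text, stop_words)
  h = {}
  text.split.each do |w|
    h[w] = 1 unless stop_words.include?(w)
  end
  h
end

# Format as "w1:score w2:score ..." (the separator is an assumption).
def format_features(h)
  h.map { |w, score| "#{w}:#{score}" }.join(' ')
end

stop_words = %w[a the was] # hypothetical, abbreviated list
h = presence_features('the film was a great great film', stop_words)
puts format_features(h)
```

Repeated words do not change the result, because assigning 1 a second time leaves the entry unchanged.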
12 Method 2: Unigram Frequency
- The presence score of each feature is multiplied by its number of occurrences in document d
- h[d, w] = occurring[d, w]
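Since the presence score is 1, the frequency score reduces to the per-document occurrence count. A minimal Ruby sketch, with the method name and the stop-word list as illustrative assumptions:

```ruby
# Frequency features for one document: h[w] = number of times the
# non-stop-word unigram w occurs in the document.
def frequency_features(text, stop_words)
  h = Hash.new(0) # unseen words start at a count of 0
  text.split.each do |w|
    h[w] += 1 unless stop_words.include?(w)
  end
  h
end

stop_words = %w[a the was] # hypothetical, abbreviated list
h = frequency_features('the film was a great great film', stop_words)
h.each { |w, n| puts "#{w}:#{n}" }
```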
13 Method 3: Thresholding
- Cut off frequent and rare unigrams
- 2000 > number of occurrences > 5
- About 13,000 unigrams remained
- Applied to the former two methods
- Thresholding presence
- Thresholding frequency
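The cut-off can be sketched as a filter over the corpus-level frequency table. The table layout (`word => { pos:, neg: }`), the method name `threshold`, and the sample counts are assumptions for illustration; the bounds follow the "2000 > number of occurrences > 5" rule stated above.

```ruby
# Keep only unigrams whose total corpus count lies strictly between
# min and max (both bounds exclusive).
def threshold(table, min: 5, max: 2000)
  table.select do |_word, c|
    total = c[:pos] + c[:neg]
    total > min && total < max
  end
end

# Hypothetical counts, invented for the example.
table = {
  'the'    => { pos: 1500, neg: 1400 }, # too frequent
  'superb' => { pos: 40,   neg: 6 },    # kept
  'zzyzx'  => { pos: 1,    neg: 0 }     # too rare
}
puts threshold(table).keys.inspect
```

The surviving keys can then be used as the vocabulary for either the presence or the frequency features.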
14 Method 4: Scoring
- Assigned a base score to each unigram
- The feature score is computed by multiplying the unigram frequency by the base score
15 Base Scoring Method
- Base score function of a unigram
16 Feature Score
- The base score is multiplied by the number of appearances of each unigram in a document
- Could be improved by applying a smoothing function
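The multiplication itself can be sketched as below. The base score function is not reproduced in this transcript, so the base scores are simply passed in as a Hash; the values shown are placeholders invented for the example, not scores from the project, and the method name is likewise hypothetical.

```ruby
# score(u, d) = base_score(u) * (number of times u appears in d).
# Only unigrams that have a base score are counted.
def feature_scores(text, base_scores)
  counts = Hash.new(0)
  text.split.each { |w| counts[w] += 1 if base_scores.key?(w) }
  counts.map { |w, n| [w, base_scores[w] * n] }.to_h
end

# Placeholder base scores (assumption; the real function is not shown here).
base = { 'great' => 0.5, 'dull' => -0.5 }
feature_scores('a great great film', base).each do |w, s|
  puts "#{w}:#{s}"
end
```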
17 Evaluation Environment
- Training set
- 900 documents from each of the positive and negative sets
- Unigram frequencies are counted in this set
- Test set
- 100 documents from each of the positive and negative sets
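One way to produce such a split is sketched below. The slides do not say how the 900 training documents were chosen, so taking the first 900 of each class is an assumption, as are the method name and the file names.

```ruby
# Split one class's documents into a training part and a test part.
# Taking the first train_size documents is an assumption; the source
# does not say how the split was chosen.
def split_set(docs, train_size: 900)
  [docs.first(train_size), docs.drop(train_size)]
end

pos_docs = (1..1000).map { |i| "pos_#{i}.txt" } # hypothetical file names
train, test = split_set(pos_docs)
puts "#{train.size} training, #{test.size} test"
```

The same call would be repeated for the negative set, and unigram frequencies would then be counted on the training part only.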
18 Evaluation Results