Blogvox2: A Modular Domain Independent Sentiment Analysis System

1
Blogvox2: A Modular Domain Independent Sentiment
Analysis System
  • Sandeep Balijepalli
  • Masters Thesis, 2007

2
Overview
  • Introduction / Motivation
  • Problem Statement & Contribution
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

3
Overview
  • Introduction / Motivation
  • Problem Statement & Contribution
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

4
Social Media & Blogs
Social media defines "the socialization of information as well as the tools to facilitate conversations." [1] (Examples: MySpace, YouTube, Wikipedia)
Blogs are popular because they let authors express opinions and critique topics.
We focus on political blogs since they carry many sentiment-laden words.
Examples: Hillary Clinton, Obama and Howard Dean are just some of the famous politicians who use blogs.
[1] http://en.wikipedia.org/wiki/Social_media
5
Motivation
  • Lack of a domain independent framework for sentiment analysis
  • Upcoming elections
  • A better tool for politicians
  • A better tool for the average American
  • Need for sentence level analysis in sentiment classification
  • Opinmind was proprietary

6
Overview
  • Introduction / Motivation
  • Problem Statement & Contribution
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

7
Problem Statement
  • Analyze sentiment detection at the sentence level.
  • Examine the performance of various techniques employed for classification.
  • Develop a sentiment analysis framework that is domain independent.

8
Contribution
  • Sentence level sentiment analysis framework.
  • Prototype applications to use the framework.
  • Performance analysis of different filter
    techniques.
  • Worked with Justin Martineau to develop trend
    analysis.
  • Akshay Java provided the political URL dataset.

9
Overview
  • Introduction / Motivation
  • Problem Statement & Contribution
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

10
Related Work
  • BlogVox1 [1]
  • Document level scoring module; sentence level analysis should be the focus.
  • Classification is based on the bag-of-words approach; other machine learning analysis would improve the results.
  • Turney (2002) [2]
  • Unsupervised review classification.
  • Works at the paragraph level, and it is difficult to classify blog sentences with their method.

[1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau. "BlogVox: Separating Blog Wheat from Blog Chaff." January 2007.
[2] Peter D. Turney. "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania.
11
Related Work cont.
  • Pang, Lee & Vaithyanathan (2002) [1]
  • Analyzed different techniques and showed that unigrams perform well in the movie domain.
  • But according to Engstrom [2], these techniques are domain dependent.
  • Soo-Min Kim and Eduard Hovy [3]
  • Use a seed wordlist and a unigram approach to identify sentence sentiment.
  • This is not sufficient, as the seed wordlist built from the WordNet dataset introduces a lot of noise. [4]

[1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up? Sentiment classification using machine learning techniques." 2002.
[2] Charlotte Engstrom. "Topic dependence in sentiment classification." Masters thesis, University of Cambridge, July 2004.
[3] Soo-Min Kim and Eduard Hovy. "Determining the Sentiment of Opinions." Proceedings of the 20th International Conference on Computational Linguistics (COLING), August 23-27, Geneva, Switzerland. 2004.
[4] Brian Eriksson. "Sentiment Classification of Movie Reviews using Linguistic Parsing." 2005.
12
Overview
  • Introduction / Motivation
  • Problem Statement & Contribution
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

13
Framework
[Framework diagram: posts from political blogs such as www.dailykos.com and www.mediamatters.com (e.g. http://www.dailykos.com/storyonly/2007/6/5/1211/30670) are split into sentences such as "Obama is good. I like Edwards." and "If President Bush and Vice President Cheney can blurt out vulgar language...", which are associated with the entities they mention (Bush, Clinton, Obama, Hillary Clinton).]
14
Datasets (Political URLs)
Datasets employed:
  • Lada A. Adamic Political Dataset: 3028 political URLs.
  • Lada A. Adamic Labeled Dataset: 1490 blogs.
  • Twitter Dataset [1]
  • Spinn3r Dataset: live feeds [2]
Experimental analysis: 109 feeds were used.
[1] www.twitter.com
[2] www.tailrank.com
15
Overview
  • Introduction / Motivation
  • Problem Statement
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

16
Overview of Filters in Sentiment Analysis
[Filter pipeline diagram: incoming sentences pass through three filters (pattern recognizer, naïve Bayes, parts of speech); a sentence that every filter rejects ("No") is discarded as objective, while a sentence any filter accepts ("Yes") is sent to the multiple indexer.]
17
Datasets (filter)
Pattern Matching Dataset (classified manually):
  • 92 positive patterns
  • 163 negative patterns
Training (Naïve Bayes):
  • Movie Dataset: 5331 negative, 5000 neutral, 5331 positive sentences
  • Political Dataset (classified manually): 273 negative, 320 neutral, 178 positive sentences
The political wordlist contains [1]:
  • 2712 negative words
  • 915 positive words
[1] Akshay Java, Pranam Kolari, Tim Finin, James Mayfield, Anupam Joshi, and Justin Martineau. "BlogVox: Separating Blog Wheat from Blog Chaff." January 2007.
18
Pattern Recognition Filter - Overview
The pattern recognizer is a custom developed, domain based filter for identifying sentiment-bearing patterns.
[Filter pipeline diagram, as on slide 16.]
19
Pattern Recognition Filter - Working Model
Sentences are chunked and matched against the patterns. Sample input sentences: "She is well respected and won many admirers for her staunch support for women." "I hate George Bush." "John Edwards is my least favorite."
  • No match (passed on): "I want to be like Hillary."
  • Match: "I admire Hillary." "Obama is annoying."
Matched sentences are sent to the multiple indexer. Sample index entries:
  • Sentence: "they hate Bush" | Date: Thu Apr 19, 2007 at 08:14:12 PM PDT | Url: www.mediamatters.com | Permalink: http://mediamatters.org/items/200508290005 | Polarity: negative | Strength: 1
  • Sentence: "I like Clinton." | Date: Thu Apr 19, 2007 at 08:14:12 PM PDT | Url: www.dailykos.com | Permalink: http://www.dailykos.com/story/2007/4/13/114310/235 | Polarity: positive | Strength: 1
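The filter's behavior can be sketched as a small regular-expression matcher. The patterns below are hypothetical stand-ins for illustration; the slides do not list the actual 92 positive / 163 negative hand-built patterns.

```python
import re

# Hypothetical stand-ins for the hand-classified pattern dataset.
POSITIVE_PATTERNS = [r"\bi (?:like|love|admire) (\w+)", r"\bwant to be like (\w+)"]
NEGATIVE_PATTERNS = [r"\bi hate (\w+)", r"\b(\w+) is annoying", r"\bleast favorite"]

def match_polarity(sentence):
    """Return 'positive' or 'negative' if a pattern fires, else None
    (a None sentence is routed to the next filter in the pipeline)."""
    s = sentence.lower()
    if any(re.search(p, s) for p in POSITIVE_PATTERNS):
        return "positive"
    if any(re.search(p, s) for p in NEGATIVE_PATTERNS):
        return "negative"
    return None
```

Matched sentences would then be written to the multiple indexer along with their date, URL, permalink, polarity, and strength.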
20
Naïve Bayes Filter - Overview
The naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions.
[Filter pipeline diagram, as on slide 16.]
21
Naïve Bayes Analysis - Outline
Each document d is represented by the document vector d = (n1(d), ..., nm(d)), where f1, f2, ..., fm is the set of predefined features and ni(d) is the number of times feature fi occurs in d.
If a sentence S consists of the words Wi, where i ranges over the words in the sentence, then:
Probability of the sentence being positive:
P(S_pos) = Σ(Wi_pos) / (Σ(Wi_neg) + Σ(Wi_pos) + Σ(Wi_neut))
Probability of the sentence being negative:
P(S_neg) = Σ(Wi_neg) / (Σ(Wi_neg) + Σ(Wi_pos) + Σ(Wi_neut))
This is a slight modification of the naïve Bayes method.
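The modified scoring rule above can be sketched in a few lines. The word counts here are toy values for illustration only; in the thesis they come from the movie-review and hand-labeled political training datasets.

```python
from collections import Counter

# Hypothetical toy training counts (real counts come from the training data).
POS_COUNTS = Counter({"like": 4, "exciting": 3, "great": 3, "leader": 2})
NEG_COUNTS = Counter({"hate": 4, "annoying": 3, "worst": 2})
NEUT_COUNTS = Counter({"the": 5, "is": 5, "announced": 2})

def sentence_scores(sentence):
    """Score a sentence with the modified naive Bayes rule from the slide:
    P(S_pos) is the summed positive counts of its words divided by the
    summed counts of those words across all three polarity classes."""
    words = sentence.lower().split()
    pos = sum(POS_COUNTS[w] for w in words)
    neg = sum(NEG_COUNTS[w] for w in words)
    neut = sum(NEUT_COUNTS[w] for w in words)
    total = pos + neg + neut
    if total == 0:
        return 0.0, 0.0  # no known words: treat as objective
    return pos / total, neg / total
```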
22
Naïve Bayes Analysis - Working Model
Example: unigram analysis of "Hillary is an exciting leader."
[Worked example: for each word, the probability of the word appearing in the negative training dataset is summed, and likewise for the positive training dataset; here the positive total exceeds the negative total.]
Hence, the sentence is positive.
Similarly, for bigrams we use two words together instead of one.
23
Threshold Analysis for the Naïve Bayes Filter
  • A threshold of .7 misses many subjective sentences, so a value of .7 will not capture the expected number of subjective sentences.
  • A threshold of .5 indexes many sentences that are objective as well as subjective; since indexing unwanted sentences must be avoided, we do not choose .5.
  • According to our experimental analysis, the optimal threshold value is .6.
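A minimal sketch of the thresholding step, using the .6 value the slide reports as optimal. The exact decision order when neither class clears the threshold is an assumption.

```python
def classify(p_pos, p_neg, threshold=0.6):
    """Index a sentence only when one polarity clears the threshold:
    0.6 was found optimal (0.5 admits objective noise, 0.7 drops too
    many subjective sentences)."""
    if p_pos >= threshold:
        return "positive"
    if p_neg >= threshold:
        return "negative"
    return "objective"  # below threshold: not indexed
```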

24
Parts of Speech Filter - Overview
[Filter pipeline diagram, as on slide 16.]
25
Parts of Speech Analysis - Outline
"Part-of-speech tagging, also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both its definition, as well as its context." - wiki
Example: Mr. Bill Clinton, the former president of the United States, will become personal advisor of Hillary, Clinton announced yesterday in New York.
Tagged: Mr./NNP Bill/NNP Clinton/NNP ,/, the/DT former/JJ president/NN of/IN the/DT United/NNP States/NNP ,/, will/MD become/VB personal/NN advisor/NN of/IN Hillary/NN ,/, Clinton/NN announced/VBD yesterday/RB in/IN New/NNP York/NNP ./.
NN: singular or mass noun; NNP: proper noun; DT: singular determiner; JJ: adjective; IN: preposition; MD: modal; VB: verb, base form; VBD: verb, past tense; RB: adverb
  • Working model
  • The unigrams and bigrams are tagged with parts of speech for analysis. [1]
  • Each sentence is passed through, and experiments are carried out against the tagged naïve Bayes for analysis.
  • The working is similar to the naïve Bayes filter.
[1] www.lingpipe.com
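One way the tagged naïve Bayes can be fed is with word/TAG unigram features. This is a sketch under assumptions: the thesis used LingPipe for the tagging itself, and both the feature format and the set of tags kept are illustrative choices, not the thesis's documented ones.

```python
def tagged_features(tagged, keep=frozenset({"JJ", "RB", "VB", "VBD", "NN", "NNP"})):
    """Build word/TAG unigram features from a pre-tagged sentence.
    `tagged` is a list of (word, tag) pairs; tags outside `keep`
    (e.g. determiners) are dropped as unlikely to carry sentiment."""
    return [f"{w.lower()}/{t}" for w, t in tagged if t in keep]

# Hand-tagged example (in the thesis, tagging is done by LingPipe).
tagged = [("Bush", "NNP"), ("is", "VBZ"), ("a", "DT"), ("great", "JJ"), ("guy", "NN")]
```

The resulting features (e.g. `great/JJ`) would then be counted exactly like the plain unigrams in the naïve Bayes filter.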
26
Named Entity - Overview
[Filter pipeline diagram, as on slide 16.]
27
Named Entity - Overview cont.
Problem: "I hate Bush, but I like Obama"
  • Current approaches discard such sentences.
  • Our solution: use named entity recognition to reduce the sentence's score instead.
  • Named entity recognition returns the number of entities (here, 2).
  • Whenever more than 1 entity is returned, the system reduces the score rather than removing the search result.
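The score-reduction idea can be sketched as below. The halving factor and the substring-based entity lookup are assumptions for illustration; the slides only say the score is reduced when more than one entity appears.

```python
def adjusted_score(sentence, base_score, entities):
    """Reduce, rather than discard, the score of a sentence that
    mentions more than one named entity (e.g. 'I hate Bush, but I
    like Obama'). Returns (score, entity_count)."""
    found = [e for e in entities if e in sentence]
    if len(found) > 1:
        return base_score / 2, len(found)  # assumed reduction factor
    return base_score, len(found)
```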

28
Overview
  • Introduction / Motivation
  • Problem Statement
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

29
Search and Trend Analysis
Search Analysis
  • Queries are boosted for performance enhancement.
Query: "George Bush". Example results: "George Bush is a great guy", "George's last name Bush is", "I dislike Bush", "I love George".
Ranking:
  • Terms together: high score
  • Terms spaced up to 10 words apart: medium score
  • Either one of the terms alone: low score
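The three-tier boosting above can be sketched as a proximity score. The concrete score values (3/2/1/0) and the whitespace tokenization are assumptions; the slide only fixes the high/medium/low ordering and the 10-word window.

```python
def proximity_score(sentence, terms, window=10):
    """Rank a result for a two-term query: terms adjacent -> high (3),
    both within `window` words -> medium (2), one term only -> low (1),
    neither -> 0."""
    words = sentence.lower().split()
    positions = {t: [i for i, w in enumerate(words) if w == t.lower()] for t in terms}
    hits = [t for t in terms if positions[t]]
    if len(hits) == len(terms) == 2:
        gap = min(abs(i - j) for i in positions[terms[0]] for j in positions[terms[1]])
        if gap == 1:
            return 3          # terms together: high score
        if gap <= window:
            return 2          # within the window: medium score
    return 1 if hits else 0   # a single term: low score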
30
Search & Trend Analysis - Search Screenshots
Two Panel View
31
Search & Trend Analysis - Search Screenshots cont.
Four Panel View
32
Search & Trend Analysis - Search Screenshots cont.
Polarity Distribution
33
Search and Trend Analysis cont
"Top topics" are terms that have always been a point of discussion in the blogosphere (e.g. Bush, Iraq, Bomb).
  • Terms are computed by analyzing the frequencies of words in the index.
  • The top 100 English words, dates, and numbers are screened out.
"Hot topics" are terms that are currently a point of discussion in the blogosphere (e.g. Virginia, Immigration).
  • Computed by employing K-L divergence:
D_KL(P || T) = Σ_i P(i) log(P(i) / T(i))
where D_KL is the Kullback-Leibler divergence, P is the true (observed) word distribution, and T is the target word value (background distribution).
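The K-L divergence used for hot-topic detection is a one-liner; a sketch, assuming both distributions are given as aligned probability lists:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)). Here P would be a
    recent word distribution and Q the long-run background distribution;
    a large divergence flags a 'hot' topic. Terms with P(i) = 0
    contribute nothing and are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```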
34
Search & Trend Analysis - Search Screenshots cont.
Top Term Analysis
35
Overview
  • Introduction / Motivation
  • Problem Statement
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

36
Effect of Pattern Matching Analysis
Pattern Matching analysis
Pattern matching misses most of the subjective sentences.
37
Effect of Pattern Matching Analysis cont.
  • Problem
  • Bloggers do not write in a formal manner.
  • Bloggers generally do not care about grammar, spelling and punctuation in their blogs.
  • Only a small pattern dataset was collected: 95 positive and 162 negative patterns.
  • Example that caused problems: "Bush suckz" (slang terms).
  • Possible Solutions
  • A spelling checker is one way to improve the results.
  • A larger pattern dataset is required to improve the analysis.

38
Effect of Pattern Matching Analysis cont.
Confusion Matrix
  • Accuracy: 58%
  • True Positive Rate (Recall): 18%
  • False Positive Rate (FP): 2%
  • True Negative Rate: 98%
  • False Negative Rate (FN): 82%
  • Precision: 92%
  • (Positive = Subjective, Negative = Objective Sentence)

39
Effect of Naive Bayes Analysis
Unigram analysis
  • Unigram captures most of the subjective
    sentences.

40
Unigram vs Patterns
  • The graph shows that unigrams perform better than
    pattern matching techniques.

41
Effect of Naive Bayes Analysis cont.
Confusion Matrix (Unigrams)
  • Accuracy: 77%
  • True Positive Rate (Recall): 63%
  • False Positive Rate (FP): 10%
  • True Negative Rate: 90%
  • False Negative Rate (FN): 37%
  • Precision (Positive): 86%
  • (Positive = Subjective, Negative = Objective Sentence)

42
Effect of Naive Bayes Analysis
Bigram analysis
  • Bigrams perform better than Pattern matching.
  • Bigrams do not perform as well as unigrams.
  • Lack of domain independent dataset affects the
    results.

43
Effect of Naive Bayes Analysis cont.
Confusion Matrix (Bigrams)
  • Accuracy: 70%
  • True Positive Rate (Recall): 50%
  • False Positive Rate (FP): 10%
  • True Negative Rate: 91%
  • False Negative Rate (FN): 50%
  • Precision (Positive): 83%
  • (Positive = Subjective, Negative = Objective Sentence)

44
Effect of Naive Bayes Analysis
Unigram + Bigram analysis
  • Results are similar to unigrams, which implies that the addition of bigrams does not make a significant difference.

45
Effect of Naive Bayes Analysis cont.
Confusion Matrix (Unigram + Bigram)
  • Accuracy: 77%
  • True Positive Rate (Recall): 64%
  • False Positive Rate (FP): 10%
  • True Negative Rate: 90%
  • False Negative Rate (FN): 36%
  • Precision (Positive): 86%
  • (Positive = Subjective, Negative = Objective Sentence)

46
Effect of Naive Bayes Analysis cont.
  • Problem
  • We used the movie training dataset [1] along with the custom developed political dataset.
  • Possible Solutions
  • A more domain specific dataset should be collected to improve this technique.
  • Analysis of trigrams would be useful for comparison.

[1] http://www.cs.cornell.edu/People/pabo/movie-review-data/
47
Effect of Parts of Speech Analysis
Parts of Speech analysis
  • Parts of speech does not perform as well as
    unigrams.

48
Effect of Parts of Speech Analysis Cont
  • Problem
  • Currently, the training data for this analysis is not blog specific; it is collected from news articles, which follow a standard format and procedure.
  • Possible Solutions
  • Develop or obtain a blog specific training dataset.
  • Combining this filter with others could improve the results.

49
Effect of Parts of Speech Analysis cont.
Confusion Matrix (Parts of Speech)
  • Accuracy: 73%
  • True Positive Rate (Recall): 60%
  • False Positive Rate (FP): 13%
  • True Negative Rate: 87%
  • False Negative Rate (FN): 40%
  • Precision (Positive): 82%
  • (Positive = Subjective, Negative = Objective Sentence)

50
Results
  • Unigram and Unigram + Bigram outperform all other filters.
  • Although parts of speech tagging performs well, its precision is lower than that of the other filters.
  • The pattern matching technique can be improved by obtaining a larger dataset, which is a non-trivial task.

51
Overview
  • Introduction / Motivation
  • Problem Statement
  • Related Work
  • Framework
  • Sentiment Filters
  • Search and Trend Analysis
  • Experiments & Results
  • Conclusion & Future Work

52
Conclusion
  • Analyzed the sentence level classification of
    sentiments.
  • Focused on pattern matching, naïve Bayes and
    parts of speech filters for opinion
    classification.
  • Analyzed and presented the performance for each
    sentiment filter.
  • Developed a robust framework which is domain
    independent.
  • Developed two different prototype applications
    (two-panel view & four-panel view).

53
Future Work
  • Incorporate other filters (such as SVMs), and add stemming and a spelling checker.
  • Identify ways to deal with sarcastic sentences.
  • Negations have to be captured.
  • Developing and improving the dataset should significantly improve the results.

54
Acknowledgements
  • Lada A. Adamic for her datasets.
  • Tailrank and Twitter for their datasets.
  • Justin Martineau.
  • Akshay Java and Pranam Kolari for their help in
    compiling the datasets.
  • Alark Joshi

55
Thank You !!
  • Questions?

56
Experimental Information
  • Confusion Matrix
  • A confusion matrix [1] contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. [2]

[1] Provost, F., Kohavi, R. (1998). On applied research in machine learning. Machine Learning, 30, 127--132.
[2] http://www2.cs.uregina.ca/dbd/cs831/notes/confusion_matrix/confusion_matrix.html
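The rates reported on the result slides can be derived directly from raw confusion-matrix counts. The counts below are illustrative, not the thesis's actual data; they are merely chosen so that the derived recall and false positive rate match the unigram slide's 63% and 10%.

```python
def metrics(tp, fp, fn, tn):
    """Derive the evaluation rates used on the result slides from
    confusion-matrix counts (tp = subjective correctly indexed,
    tn = objective correctly discarded)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": tp / (tp + fn),      # true positive rate
        "fpr": fp / (fp + tn),         # false positive rate
        "precision": tp / (tp + fp),
    }
```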