Combining Lexicon-based Methods to Detect Opinionated Blogs

1
Combining Lexicon-based Methods to Detect Opinionated Blogs

Kiduk Yang, Ning Yu, Hui Zhang
WIDIT Laboratory
School of Library and Information Science
Indiana University

2
Research Questions: Noise Reduction

Does noise in blog data affect IR performance?
What are the characteristics of blog noise?
Non-English (NE) blogs
Large proportion of NE tokens
High frequency of NE stopwords & low frequency of English stopwords
Non-blog content
Non-post/comment text generated by blogware
e.g., sidebar/navigation, header/footer,
advertisement, etc.
Spam postings
How can blog noise be identified?
NE blogs
Language tags in feeds
NE tokens in permalinks
Non-blog content
Mark-up tags in permalinks

3
Research Questions: Opinion Detection

How can blogs that are opinionated about a topic be retrieved?
What is the evidence of opinion?
Opinion Terminology
Words often used in expressing an opinion
e.g., "Skype sucks", "Skype rocks", "Skype is cool"
Opinion Collocations
Collocations that mark an opinion
e.g., "I think tomato is a fruit", "Tomato is a vegetable to me"
Opinion Morphology
Word morphing to emphasize an opinion
e.g., "Vista is soooo buggy", "Vista is metacool"
How can they be leveraged?
Opinion classification via Supervised Learning
Document scoring using Opinion Lexicons
How can they be combined to detect opinionated blogs?
Weighted sum optimized via Dynamic Tuning

4
WIDIT Blog System Architecture
[Architecture diagram. Components shown: Blog Data, IMDb Data, Wilson's Lexicons, Netlingo Terms, Noise Reduction, Blogs w/o Noise, Opinion Lexicons, Document Indexing, Inverted Index, Topics, Query Indexing, Short Query, Long Query, Expanded Query, Retrieval, Initial Results, Topic Reranking, OnTopic Results, Opinion Reranking, Opinion Results, Fusion, Fusion Result, Polarity Detection, Polarity Result, Dynamic Tuning]
5
WIDIT Approach: Noise Reduction

Non-English (NE) blog Identification
Language tags in Feeds
Not always present
Extract all unique tags from the feed data
Identify tags that indicate non-English language
Flag permalinks in feeds with non-English
language tag
NE tokens in Permalinks
NE blogs w/ English tokens & English blogs w/ NE tokens
For each permalink,
Identify tokens consisting of non-ASCII
characters (i.e., NE tokens)
Compute the NE Content Rate: NECR = NE_tokens / tokens
Compute the frequency & proportion of English stopwords: Est(f), Est(p)
Flag if large NE content or some NE content with few E-stopwords
NECR > 0.5 with Est(f) < 10
NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
NECR < 0.5 with dlen > 1000, Est(f) < 20 and Est(p) < 0.01
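
The three flagging rules above can be sketched in Python roughly as follows. This is a minimal illustration, not the WIDIT implementation: the whitespace tokenizer, the tiny stopword list, measuring dlen in tokens, and reading "some NE content" as NECR > 0 are all assumptions that the slides do not specify.

```python
# Tiny illustrative stopword list (an assumption; the slides do not give one).
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def is_non_english(text: str) -> bool:
    """Apply the slide's three NE flagging rules to one permalink's text."""
    tokens = text.split()                  # whitespace tokenization (assumption)
    if not tokens:
        return False
    dlen = len(tokens)                     # document length in tokens (assumption)
    # A token counts as NE if it contains any non-ASCII character (assumption).
    ne_tokens = sum(1 for t in tokens if not t.isascii())
    necr = ne_tokens / dlen                # NE Content Rate
    est_f = sum(1 for t in tokens
                if t.lower().strip(".,!?\"'") in ENGLISH_STOPWORDS)
    est_p = est_f / dlen                   # proportion of English stopwords

    if necr > 0.5 and est_f < 10:
        return True
    # "Some NE content" is read here as 0 < NECR < 0.5 (assumption).
    if 0 < necr < 0.5 and est_f < 10 and est_p < 0.02:
        return True
    if 0 < necr < 0.5 and dlen > 1000 and est_f < 20 and est_p < 0.01:
        return True
    return False
```

With such a short stopword list the second rule fires easily on mixed-language text, so a realistic run would need a full English stopword list.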

6
WIDIT Approach: Noise Reduction

Non-blog content Exclusion
1. Extract all unique tag patterns from the permalink data
2. Compile a list of content noise tags with high frequency
3. Construct regular expressions (regex) to identify content noise tags
4. Apply regex to the unique tag set
5. Modify regex based on the examination of regex results
6. Repeat steps 4 & 5 until regex correctly identifies content noise tags
For each permalink,
Extract blog segment using the content regex
Extract <post|comment> text
If no <post|comment> tags, extract <content> text
If no <content> tag, extract <body> text
Exclude noise text using the noise regex
e.g., <(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>
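
The per-permalink fallback and noise exclusion might look like the sketch below. It assumes that post, comment, content and body are literal element names and that noise blocks are not nested; both are simplifications. The keyword alternation in `NOISE_BLOCK` is a guess modeled on the slide's example, and the function names are hypothetical.

```python
import re

# Noise pattern modeled on the slide's example; the keyword list is an assumption.
# It removes a whole <div>/<span> block whose opening tag mentions a noise keyword.
NOISE_BLOCK = re.compile(
    r"<(div|span)\b[^>]*(?:footer|profile|side|nav|advertise|sponsor)[^>]*>.*?</\1>",
    re.S | re.I,
)

def extract_segment(html: str) -> str:
    """Try <post>/<comment>, then <content>, then <body>; fall back to the input."""
    for group in (("post", "comment"), ("content",), ("body",)):
        parts = []
        for tag in group:
            parts += re.findall(rf"<{tag}\b[^>]*>(.*?)</{tag}>", html, re.S | re.I)
        if parts:
            return "\n".join(parts)
    return html

def extract_blog_text(html: str) -> str:
    """Extract the blog segment, then drop blocks matched by the noise regex."""
    return NOISE_BLOCK.sub("", extract_segment(html))
```

The keyword match is deliberately crude (for example, "side" also matches "inside"), which is the kind of false positive the iterative regex refinement in steps 4 to 6 is meant to catch.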

7
WIDIT Approach: Noise Reduction

Noise Reduction (NR) Statistics
335,691 (11.7%) NE permalinks excluded
Blog length reduction
Over 50% of permalinks with length difference
Average length reduction: 551 bytes
74.3 million (7.4%) tokens excluded by NR
21,283 (0.6%) unique tokens excluded by NR

8
WIDIT Approach: Topic Reranking

Topic Reranking factors
Exact Match of query title text in document
Near Match of query title/description text in
document
All of the query terms occur in sequence near
each other
Noun Phrase Match
Non-Rel Match
Non-relevant phrases/nouns from the topic
narrative occur in document
Topic Reranking (TR) Method
Compute TR scores for each document
Document-length normalized frequency
Categorize initial retrieval into reranking
groups
g1: exact match (query title to document title & body)
g2: exact match (multi-term query title to document title only)
g3: exact match (query title to document body only)
g4: other
Rerank documents within groups using combined TR
score
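
As a rough illustration of the grouping step, the sketch below assigns each document to g1 through g4 and then sorts by TR score within each group. Treating "exact match" as a case-insensitive substring test, and the dictionary keys `title`, `body` and `tr_score`, are assumptions made for the example.

```python
def rerank_group(query_title: str, doc_title: str, doc_body: str) -> int:
    """Return the reranking group (1-4) for one document."""
    q = query_title.lower()
    in_title = q in doc_title.lower()      # substring test stands in for exact match
    in_body = q in doc_body.lower()
    if in_title and in_body:
        return 1                           # g1: title & body
    if in_title and len(q.split()) > 1:
        return 2                           # g2: multi-term query, title only
    if in_body:
        return 3                           # g3: body only
    return 4                               # g4: other

def topic_rerank(query_title: str, docs: list) -> list:
    """Order documents by group first, then by descending TR score within a group."""
    return sorted(
        docs,
        key=lambda d: (rerank_group(query_title, d["title"], d["body"]), -d["tr_score"]),
    )
```

Sorting on the pair (group, negated score) keeps every g1 document above every g2 document regardless of score, which is what reranking "within groups" implies.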

9
WIDIT Approach: Opinion Lexicons

Lexicon-based Opinion Detection
Construct Opinion Lexicons from multiple sources
of opinion evidence
Opinion Terminology, Opinion Collocations,
Opinion Morphology
Score documents using Opinion Lexicons
Opinion Terminology
Wilson's Lexicons
A subset of Wilson's subjectivity terms
4,747 strong & 2,190 weak subjective terms with
polarity
240 emphasis terms
High Frequency (HF) Lexicon
For each of IMDb movie & 2006 blog training data
Extract high frequency terms from positive
training data (e.g., movie review)
Exclude terms that occur in negative training
data (e.g., movie plot summary)
Select a set of opinion terms
Combine the IMDb & blog term sets
Assign polarity & strength to each term
Expand with synonyms & antonyms from WordNet
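
The first two HF steps (extract frequent terms from positive data, exclude terms seen in negative data) can be sketched as below. The tokenizer and the `top_n` cutoff are assumptions; the later selection, polarity assignment and WordNet expansion steps are manual or resource-dependent and are not shown.

```python
import re
from collections import Counter

def hf_candidates(positive_docs, negative_docs, top_n=1000):
    """Frequent terms in positive data that never occur in negative data."""
    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())   # simple tokenizer (assumption)

    pos_counts = Counter(t for doc in positive_docs for t in tokenize(doc))
    neg_vocab = {t for doc in negative_docs for t in tokenize(doc)}
    ranked = [t for t, _ in pos_counts.most_common() if t not in neg_vocab]
    return ranked[:top_n]
```

For example, with movie reviews as positive data and plot summaries as negative data, a word like "movie" drops out because it appears in both, while evaluative words survive as candidates.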

10
WIDIT Approach: Opinion Lexicons

Opinion Collocations
I-You (IU) Lexicon
For each of movie review & positive blog training data
Extract n-grams that begin/end with IU anchors
(e.g., I, You, my, your, me)
Select a set of opinion collocations
Combine the movie & blog term sets
Assign strength & polarity to each collocation
Add verb conjugations & noun plurals
Expand with HF & Wilson terms
Acronym Lexicon
Select opinion collocations from netlingo
acronyms
e.g., afaik (as far as I know), imho (in my
humble opinion)
Assign strength & polarity to each collocation
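
A minimal sketch of the IU n-gram extraction step follows. The anchor set is taken from the slide's examples; the n-gram sizes (2 and 3) and the tokenizer are assumptions.

```python
import re
from collections import Counter

IU_ANCHORS = {"i", "you", "my", "your", "me"}   # anchors listed on the slide

def iu_ngrams(text, n_values=(2, 3)):
    """Count n-grams that begin or end with an IU anchor."""
    tokens = re.findall(r"[a-z']+", text.lower())   # simple tokenizer (assumption)
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
                counts[" ".join(gram)] += 1
    return counts
```

Run over the training data, the resulting counts give the candidate pool from which opinion collocations such as "I think" or "to me" would then be selected by hand.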

11
WIDIT Approach: Opinion Lexicons

Opinion Morphology
When expressing opinion, people become creative
and tend to use uncommon/rare terms
(Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
LF Lexicon & LF Regex
Compile a set of Low Frequency (LF) terms in the
blog collection
Exclude terms that occur frequently in negative
training data
Construct regular expressions (LF regex) to
identify Opinion Morph (OM) terms
Based on examination of HF terms & LF patterns
Compound words (e.g., crazygood, ohmygod)
Repeat-character words (e.g., sooo, fantaaastic)
Morph-spelled words (e.g., luv, hizzarious)
Apply regex to LF term set
Iteratively refine regex based on the examination
of regex results
Exclude regex matches from LF term set
Select OM terms (LF lexicon) from the remaining
set
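
The split between regex-matched terms and the remaining LF candidates can be illustrated as follows. Both patterns are hypothetical stand-ins: the repeat-character pattern follows the slide's examples, while the compound pattern is built from a few made-up seed words and is not the WIDIT LF regex.

```python
import re

# Three or more of the same letter in a row, e.g. "sooo", "fantaaastic".
REPEAT_CHAR = re.compile(r"([a-z])\1{2,}")
# Hypothetical compound pattern from a few seed words, e.g. "crazygood", "ohmygod".
COMPOUND = re.compile(r"^(?:crazy|super|oh|my|so)+(?:good|bad|god|cool)$")

def split_lf_terms(lf_terms):
    """Separate terms caught by the LF regex from those left for manual selection."""
    regex_hits, remaining = [], []
    for term in lf_terms:
        t = term.lower()
        if REPEAT_CHAR.search(t) or COMPOUND.match(t):
            regex_hits.append(term)
        else:
            remaining.append(term)
    return regex_hits, remaining
```

Terms like "luv" fall through to the remaining set, which matches the slide's last step of selecting the LF lexicon from whatever the regex does not cover.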

12
WIDIT Approach: Opinion Reranking

Opinion Reranking factors
Opinion Terminology
Wilson's lexicon, HF lexicon
Opinion Collocations
AC lexicon, IU lexicon
Opinion Morphology
LF lexicon, LF regex
Opinion Reranking (OR) Method
Compute OR scores for each document
Document-length normalized frequency
Rerank topic-reranked documents using the
combined OR score & topic-reranking groups
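
One plausible reading of "document-length normalized frequency" for a single lexicon is sketched below: the summed weights of matched terms divided by the number of tokens. The per-term weights and the tokenizer are assumptions, and the slides do not say exactly how length normalization was done.

```python
import re

def lexicon_score(text, lexicon):
    """Sum of lexicon weights over matched tokens, divided by document length."""
    tokens = re.findall(r"[a-z']+", text.lower())   # simple tokenizer (assumption)
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)
```

For instance, with a hypothetical lexicon `{"sucks": 1.0, "cool": 0.5}`, the text "Skype sucks" has two tokens and one match of weight 1.0, so its score is 0.5.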

13
WIDIT Approach: Polarity Detection

For each opinion-reranked document,
Compute positive & negative polarity scores
Presence of valence shifters near opinion terms
reverse polarity
e.g., not, never, no, without, hardly, barely,
scarcely
Combine polarity scores using D-tuned formula:
fsc(p), fsc(n)
Apply polarity detection heuristic
Positive polarity if
most of opinion factors are positive,
fsc(p) - fsc(n) > threshold
fsc(p) / fsc(n) > threshold2
Negative polarity if
most of opinion factors are negative,
fsc(n) - fsc(p) > threshold
fsc(n) / fsc(p) > threshold2
Mixed polarity otherwise
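
The heuristic above can be sketched as two small functions. Several details are assumptions: the two-token window for valence shifters, treating the three conditions as a conjunction, the default threshold values, and rewriting the ratio test as a product so that a zero score does not cause a division error.

```python
SHIFTERS = {"not", "never", "no", "without", "hardly", "barely", "scarcely"}

def polarity_counts(tokens, lexicon, window=2):
    """Count positive/negative opinion terms; a nearby shifter flips polarity.

    lexicon maps term -> +1 or -1; window is how many preceding tokens to check.
    """
    pos = neg = 0
    for i, tok in enumerate(tokens):
        pol = lexicon.get(tok)
        if pol is None:
            continue
        if any(t in SHIFTERS for t in tokens[max(0, i - window):i]):
            pol = -pol
        if pol > 0:
            pos += 1
        else:
            neg += 1
    return pos, neg

def classify(fsc_p, fsc_n, n_pos_factors, n_neg_factors,
             threshold=0.1, threshold2=1.5):
    """Return 'positive', 'negative' or 'mixed' from combined polarity scores."""
    # fsc_p / fsc_n > threshold2 is written as a product (scores assumed >= 0).
    if (n_pos_factors > n_neg_factors and fsc_p - fsc_n > threshold
            and fsc_p > threshold2 * fsc_n):
        return "positive"
    if (n_neg_factors > n_pos_factors and fsc_n - fsc_p > threshold
            and fsc_n > threshold2 * fsc_p):
        return "negative"
    return "mixed"
```

In the slides the thresholds are set through Dynamic Tuning; the defaults here are placeholders only.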

14
WIDIT Approach: Dynamic Tuning

Reranking formula
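$RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
$w_i$ = weight of reranking factor $i$
$NS_i$ = normalized score of factor $i$
= $(S_i - S_{min}) / (S_{max} - S_{min})$
$\alpha$ = weight of original score
$\beta$ = weight of overall reranking score
How to determine $\alpha$, $\beta$, $w_i$?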
Too many parameters for exhaustive combinations
Linear combination may not suffice
Dynamic Tuning
Real-time display of parameter tuning effect on
performance
To guide the user towards local optimum
By combining human intelligence (pattern
recognition) w/ the computational power of the machine
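
The reranking formula with min-max normalization can be written out as below. This is only a sketch of the arithmetic: the parameter values in any real run would come from Dynamic Tuning, and mapping a constant score list to zeros is an assumption to avoid division by zero.

```python
def min_max(scores):
    """Min-max normalize: (S - S_min) / (S_max - S_min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]       # constant list: no spread (assumption)
    return [(s - lo) / (hi - lo) for s in scores]

def rerank_scores(orig, factors, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), one value per document.

    orig: original retrieval scores; factors: one score list per reranking factor.
    """
    ns_orig = min_max(orig)
    ns_factors = [min_max(f) for f in factors]
    return [
        alpha * ns_orig[d]
        + beta * sum(w * ns[d] for w, ns in zip(weights, ns_factors))
        for d in range(len(orig))
    ]
```

Because every factor is normalized to the same 0 to 1 range, the weights express relative importance directly, which is what makes interactive tuning of them meaningful.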

15
WIDIT Approach: Dynamic Tuning

Opinion Reranking

16
WIDIT Approach: Dynamic Tuning

Polarity Detection

17
Results Overview

Noise Reduction
Positive effect on retrieval performance
Topic Reranking
16% improvement (Qshort), 9% improvement (Qlong)
over initial result
Opinion Reranking
15% improvement (Qshort), 11% improvement (Qlong)
over TopicRR

18
Relative Performance: Opinion Reranking Effects
(short query)
Good OnTopic retrieval → good opinion retrieval
- but not necessarily due to opinion reranking
19
Relative Performance: Polarity Detection
Steeper slope → worse relative performance
20
Result Overview: 2006 vs. 2007
Topic MAP
Opinion MAP
21
Noise Reduction Effect
22
Reranking Effect
23
Opinion Reranking Factors
24
Concluding Remarks

Noise Reduction
Positive effect on retrieval performance
Reranking
Most influential component of the system
Next year
Improve baseline performance
Spam filtering?
Incorporate Machine Learning into the fusion pool

25
Questions?

Wilson's lexicon
http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
Movie Review Data
http://www.cs.cornell.edu/people/pabo/movie-review-data/
Movie Plot Summaries
http://www.imdb.com/Sections/Plots/
Netlingo Terms
http://www.netlingo.com/emailsh.cfm
WIDIT Lexicons
http://elvis.slis.indiana.edu/lexlist.htm

26
Result At a Glance
27
Reranking Effect: Dynamic Tuning
r1 = topic reranking; r2 = opinion reranking
s0R = reranking w/o Dynamic Tuning; s0R1 = reranking w/ Dynamic Tuning

Opinion reranking
- sacrifices topical performance for the sake
of opinion detection
Dynamic Tuning
- improves reranking performance across the
board

28
Query Length Effect
29
Term Weight Effect
30
Result At a Glance
31
Result At a Glance
36
WIDIT Approach: Fusion

Weighted Sum Fusion Formula
$FS = \sum_i (w_i \cdot NS_i)$
Fusion Type
Baseline (Min-Max) fusion: $w_i = 1$
MAP fusion: $w_i$ = MAP of training runs
Fusion Combinations
By Query Length
Short, Long, Long w/ nouns
By Term Weight
Okapi, SMART
Fusion Levels
Baseline results
Topic-reranked results
Opinion-reranked results

$w_i$ = weight of system $i$ (relative contribution of each system)
$NS_i$ = normalized score of a document by system $i$
= $(S_i - S_{min}) / (S_{max} - S_{min})$
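
A weighted-sum fusion of several result lists might be sketched as follows. Each run is represented as a hypothetical `{doc_id: score}` dictionary, and a document missing from a run is assumed to contribute nothing from that run; neither detail is stated on the slide.

```python
def fuse(runs, weights=None):
    """Weighted-sum fusion: FS(doc) = sum_i w_i * NS_i(doc).

    runs: list of {doc_id: score} dicts, one per system.
    weights: one weight per run; defaults to 1 for each (baseline fusion).
    """
    if weights is None:
        weights = [1.0] * len(runs)
    fused = {}
    for run, w in zip(runs, weights):
        if not run:
            continue
        lo, hi = min(run.values()), max(run.values())
        span = hi - lo
        for doc, score in run.items():
            ns = (score - lo) / span if span else 0.0   # min-max normalization
            fused[doc] = fused.get(doc, 0.0) + w * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Passing each run's MAP on training topics as its weight gives the MAP fusion variant; leaving `weights` unset gives the baseline.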
37
WIDIT Approach: Noise Reduction

Non-English (NE) blog Identification
Language tags in Feeds (FD)
16,121 permalinks flagged (1,473 not flagged by
PM)
NE tokens in Permalinks (PM)
334,219 permalinks flagged (11.6%)
NE blog Validation
OnTopic (rel > 0) NE blogs in 2006 qrels file
24 (PM only)
Suspected qrels error
NE blogs in 2006 qrels file
59 (FD only), 2043 (PM only), 203 (both); 2304 total
All but 3 manually validated as Non-English blogs
-- BLOG06-20051230-022-0008930772: short blog with no content
-- BLOG06-20060118-000-0008692678: short blog
-- BLOG06-20051208-020-0022945948: "Uhhhh.... (long) It sucks!"