Title: Combining Lexicon-based Methods to Detect Opinionated Blogs
1Combining Lexicon-based Methods to Detect Opinionated Blogs
- Kiduk Yang, Ning Yu, Hui Zhang
- WIDIT Laboratory
- School of Library and Information Science
- Indiana University
2Outline
- Research Questions
- WIDIT Approach
- Results
3Research Questions
- Does noise in blog data affect IR performance?
- Compare performance of runs with noise and without noise
- What are the characteristics of blog noise?
- How can blog noise be identified?
- How can opinionated blogs about a target be retrieved?
- Optimize the retrieval of blogs about a target (i.e., OnTopic)
- Boost the rankings of blogs with evidence of opinion
- What are the sources of opinion evidence?
- How can they be leveraged?
- How can they be combined to detect opinionated blogs?
4Research Questions: Noise Reduction
- What are the characteristics of blog noise?
- Non-English (NE) blogs
- Large proportion of NE tokens
- High-frequency NE stopwords and low-frequency English stopwords
- Non-blog content
- Non-post/comment text generated by blogware
- e.g., sidebar/navigation, header/footer, advertisement, etc.
- Spam postings
- How can blog noise be identified?
- NE blogs
- Language tags in feeds
- NE tokens in permalinks
- Non-blog content
- Mark-up tags in permalinks
5Research Questions: Opinion Detection
- What are the sources of opinion evidence?
- Opinion Terminology
- Words often used in expressing an opinion
- e.g., "Skype sucks", "Skype rocks", "Skype is cool"
- Opinion Collocations
- Collocations that mark an opinion
- e.g., "I think tomato is a fruit", "Tomato is a vegetable to me"
- Opinion Morphology
- Word morphing to emphasize an opinion
- e.g., "Vista is soooo buggy", "Vista is metacool"
- How can they be leveraged?
- Opinion classification via Supervised Learning
- Document scoring using Opinion Lexicons
- How can they be combined to detect opinionated blogs?
- Weighted sum optimized via Dynamic Tuning
6WIDIT Approach
Blog collection
- Noise Reduction
- Non-English blog elimination
- Non-blog content exclusion
- On-topic Retrieval
- Initial Retrieval
- On-Topic Reranking
- Opinion Detection
- Opinion Reranking
- Polarity Classification
(Diagram labels: Subset where target appears, On-topic)
7WIDIT Blog System Architecture
(Architecture diagram. Components: Blog Data, IMDb Data, Wilson's Lexicons, Netlingo Terms, Noise Reduction, Blogs w/o Noise, Opinion Lexicons, Document Indexing, Inverted Index, Topics, Query Indexing, Short Query, Long Query, Expanded Query, Retrieval, Initial Results, Topic Reranking, OnTopic Results, Opinion Reranking, Opinion Results, Dynamic Tuning, Fusion, Fusion Result, Polarity Detection, Polarity Result)
8WIDIT Approach Noise Reduction
- Non-English (NE) blog Identification
- Language tags in Feeds
- Not always present
- Extract all unique tags from the feed data
- Identify tags that indicate non-English language
- Flag permalinks in feeds with non-English language tag
- NE tokens in Permalinks
- NE blogs w/ English tokens and English blogs w/ NE tokens
- For each permalink,
- Identify tokens consisting of non-ASCII characters (i.e., NE tokens)
- Compute the NE Content Rate: NECR = NE_tokens / tokens
- Compute the frequency and proportion of English stopwords: Est(f), Est(p)
- Flag if large NE content or some NE content with few E-stopwords
- NECR > 0.5 with Est(f) < 10
- NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
- NECR < 0.5 with dlen > 1000, Est(f) < 20 and Est(p) < 0.01
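The flagging rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the WIDIT implementation: the stopword list, whitespace tokenization, the reading of "NE token" as any token containing a non-ASCII character, the measurement of dlen in tokens, and the requirement that NECR be above zero for the "some NE content" rules are all assumptions.

```python
# Hypothetical, tiny English stopword list; the list WIDIT used is not given.
ENGLISH_STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on that the to was with".split()
)


def is_ne_token(token: str) -> bool:
    # Assumption: a token is "NE" if it contains any non-ASCII character.
    return any(ord(ch) > 127 for ch in token)


def is_non_english(text: str) -> bool:
    """Apply the three NECR / Est(f) / Est(p) rules to one permalink's text."""
    tokens = text.split()  # assumption: whitespace tokenization
    if not tokens:
        return False
    necr = sum(is_ne_token(t) for t in tokens) / len(tokens)
    est_f = sum(t.lower() in ENGLISH_STOPWORDS for t in tokens)  # stopword frequency
    est_p = est_f / len(tokens)                                  # stopword proportion
    dlen = len(tokens)  # assumption: document length counted in tokens

    # Large NE content with few English stopwords.
    if necr > 0.5 and est_f < 10:
        return True
    # Some NE content (assumed to mean NECR > 0) with few English stopwords.
    if 0 < necr < 0.5:
        if est_f < 10 and est_p < 0.02:
            return True
        if dlen > 1000 and est_f < 20 and est_p < 0.01:
            return True
    return False
```

Requiring NECR above zero keeps short, purely ASCII posts from being flagged merely because they contain no stopwords; drop that bound if the rules are meant literally.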
9WIDIT Approach Noise Reduction
- Non-English (NE) blog Identification
- Language tags in Feeds (FD)
- 16,121 permalinks flagged (1,473 not flagged by PM)
- NE tokens in Permalinks (PM)
- 334,219 permalinks flagged (11.6%)
- NE blog Validation
- OnTopic (rel > 0) NE blogs in 2006 qrels file
- 24 (PM only)
- Suspected qrels error
- NE blogs in 2006 qrels file
- 59 (FD only), 2043 (PM only), 203 (both); 2304 total
- All but 3 manually validated as non-English blogs
- BLOG06-20051230-022-0008930772: short blog with no content
- BLOG06-20060118-000-0008692678: short blog
- BLOG06-20051208-020-0022945948: "Uhhhh.... (long) It sucks!"
10Suspected Non-English blogs in qrels.blog06
11WIDIT Approach Noise Reduction
- Non-blog content Exclusion
- Extract all unique tag patterns from the permalink data
- Compile a list of content noise tags with high frequency
- Construct regular expressions (regex) to identify content noise tags
- Apply regex to the unique tag set
- Modify regex based on the examination of regex results
- Repeat the previous two steps (apply, modify) until the regex correctly identifies content noise tags
- For each permalink,
- Extract blog segment using the content regex
- Extract <post|comment> text
- If no <post|comment> tags, extract <content> text
- If no <content> tag, extract <body> text
- Exclude noise text using the noise regex
- e.g., <(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>
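As a rough sketch of the "apply regex to the unique tag set" step, the snippet below assumes the example noise pattern reads as `<(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>`. That reading, the helper name `find_noise_tags`, and the sample tags are illustrative assumptions rather than the actual WIDIT regex or data.

```python
import re

# Assumed reading of the example noise pattern; not confirmed by the source.
NOISE_TAG = re.compile(
    r"<(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>",
    re.IGNORECASE,
)


def find_noise_tags(unique_tags):
    """Return the tags from the unique tag set that the noise regex matches."""
    return [tag for tag in unique_tags if NOISE_TAG.search(tag)]


# Hypothetical unique tag set extracted from permalinks.
sample_tags = [
    '<div class="footer">',
    '<div class="post-body">',
    '<span id="sidebar-links">',
    "<p>",
    '<div id="sponsor_box">',
]

for tag in find_noise_tags(sample_tags):
    print("noise:", tag)
```

Inspecting the matched and unmatched tags after each run is what drives the refinement loop described on the slide.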
12WIDIT Approach Noise Reduction
- Noise Reduction (NR) Statistics
- Blog length reduction
- Over 50% of blogs with length difference
- Average length reduction: 551 bytes
- 74.3 million (7.4%) tokens excluded by NR
- 21,283 (0.6%) unique tokens excluded by NR
13WIDIT Blog System Architecture
14WIDIT Approach Topic Reranking
- Topic Reranking factors
- Exact Match of query title text in document
- Near Match of query title/description text in document
- All of the query terms occur in sequence near each other
- Noun Phrase Match
- Non-Rel Match
- Non-relevant phrases/nouns from the topic narrative occur in document
- Topic Reranking (TR) Method
- Compute TR scores for each document
- Document-length normalized frequency
- Categorize initial retrieval into reranking groups
- g1: exact match (query title to document title and body)
- g2: exact match (multi-term query title to document title only)
- g3: exact match (query title to document body only)
- g4: other
- Rerank documents within groups using combined TR score
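The grouping and within-group reranking can be illustrated with a small sketch. It treats "exact match" as a case-insensitive substring match and takes a precomputed `tr_score` per document; both are assumptions, as are the function names and the document dictionary layout.

```python
def reranking_group(query_title: str, doc_title: str, doc_body: str) -> int:
    """Assign a document to reranking group 1-4 (lower is better)."""
    q = query_title.lower()
    in_title = q in doc_title.lower()  # assumption: substring match = exact match
    in_body = q in doc_body.lower()
    if in_title and in_body:
        return 1  # g1: title and body
    if in_title and len(q.split()) > 1:
        return 2  # g2: multi-term query title in document title only
    if in_body and not in_title:
        return 3  # g3: body only
    return 4      # g4: other


def topic_rerank(query_title: str, docs: list) -> list:
    """Order by group first, then by descending combined TR score within a group."""
    return sorted(
        docs,
        key=lambda d: (
            reranking_group(query_title, d["title"], d["body"]),
            -d["tr_score"],
        ),
    )
```

Because the group is the primary sort key, a document with a high TR score in g4 never overtakes a g1 document, which matches the "rerank within groups" wording.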
15WIDIT Blog System Architecture
16WIDIT Approach Opinion Lexicons
- Lexicon-based Opinion Detection
- Construct Opinion Lexicons from multiple sources of opinion evidence
- Opinion Terminology, Opinion Collocations, Opinion Morphology
- Score documents using Opinion Lexicons
- Opinion Terminology
- Wilson's Lexicons
- A subset of Wilson's subjectivity terms
- 4747 strong and 2190 weak subjective terms with polarity
- 240 emphasis terms, 88 negation n-grams
- High Frequency (HF) Lexicon
- For each of the IMDb movie and 2006 blog training data sets
- Extract high frequency terms from positive training data (e.g., movie review)
- Exclude terms that occur in negative training data (e.g., movie plot summary)
- Select a set of opinion terms
- Combine the IMDb and blog term sets
- Assign polarity and strength to each term
- Expand with synonyms and antonyms from WordNet
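The first two HF Lexicon steps (high-frequency terms from positive data, minus terms seen in negative data) might look like the sketch below. The frequency cutoff `min_freq`, the whitespace tokenization, and the function name are assumptions; the later selection of opinion terms and assignment of polarity and strength are not modeled here.

```python
from collections import Counter


def hf_candidates(positive_docs, negative_docs, min_freq=2):
    """Terms frequent in positive (opinionated) text that never occur in negative text."""
    pos_counts = Counter(
        tok for doc in positive_docs for tok in doc.lower().split()
    )
    neg_vocab = {tok for doc in negative_docs for tok in doc.lower().split()}
    return {
        term
        for term, count in pos_counts.items()
        if count >= min_freq and term not in neg_vocab
    }


# Hypothetical toy data standing in for movie reviews and plot summaries.
reviews = [
    "great movie truly great acting",
    "awful plot but great cast",
    "truly awful movie",
]
plots = ["a detective investigates a movie studio plot"]
print(sorted(hf_candidates(reviews, plots)))
```

Subtracting the plot-summary vocabulary is what removes topical words such as "movie" that are frequent in reviews but carry no opinion.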
17WIDIT Approach Opinion Lexicons
- Opinion Collocations
- I-You (IU) Lexicon
- For each of the movie review and positive blog training data sets
- Extract n-grams that begin/end with IU anchors (e.g., I, You, my, your, me)
- Select a set of opinion collocations
- Combine the movie and blog term sets
- Assign strength and polarity to each collocation
- Add verb conjugations and noun plurals
- Expand with HF and Wilson terms
- Acronym Lexicon
- Select opinion collocations from Netlingo acronyms
- e.g., afaik (as far as I know), imho (in my humble opinion)
- Assign strength and polarity to each collocation
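The n-gram extraction step for the IU Lexicon can be sketched as below. The anchor set is taken from the examples on the slide, while the n-gram length, whitespace tokenization, and function name are assumptions.

```python
# Anchors listed as examples on the slide; the full WIDIT anchor set may differ.
IU_ANCHORS = frozenset({"i", "you", "my", "your", "me"})


def iu_ngrams(text: str, n: int = 3) -> list:
    """Return the n-grams that begin or end with an IU anchor."""
    tokens = text.lower().split()
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
            grams.append(" ".join(gram))
    return grams


print(iu_ngrams("I think tomato is a fruit", n=2))
print(iu_ngrams("Tomato is a vegetable to me", n=3))
```

The output of this step is only a candidate pool; the slide's later steps (manual selection, strength and polarity assignment, expansion) are what turn it into a lexicon.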
18WIDIT Approach Opinion Lexicons
- Opinion Morphology
- When expressing opinion, people become creative and tend to use uncommon/rare terms (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
- LF Lexicon and LF Regex
- Compile a set of Low Frequency (LF) terms in the blog collection
- Exclude terms that occur frequently in negative training data
- Construct regular expressions (LF regex) to identify Opinion Morph (OM) terms
- Based on examination of HF terms and LF patterns
- Compound words (e.g., crazygood, ohmygod)
- Repeat-character words (e.g., sooo, fantaaastic)
- Morph-spelled words (e.g., luv, hizzarious)
- Apply regex to LF term set
- Iteratively refine regex based on the examination of regex results
- Exclude regex matches from LF term set
- Select OM terms (LF lexicon) from the remaining set
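Two of the Opinion Morph patterns lend themselves to a short sketch. The repeat-character regex (three or more of the same letter in a row) and the vocabulary-based compound check below are illustrative assumptions; the actual LF regexes are not given in the slides, and morph-spelled words such as "luv" would need a separate list.

```python
import re

# Assumption: "repeat-character" means the same letter three or more times in a row.
REPEAT_CHAR = re.compile(r"([a-z])\1{2,}")


def is_repeat_char_word(term: str) -> bool:
    return bool(REPEAT_CHAR.search(term.lower()))


def _splits_into_words(term: str, vocabulary) -> bool:
    """True if term can be segmented entirely into vocabulary words."""
    if not term:
        return True
    return any(
        term[:i] in vocabulary and _splits_into_words(term[i:], vocabulary)
        for i in range(1, len(term) + 1)
    )


def is_compound_word(term: str, vocabulary) -> bool:
    """True if term is not itself a known word but is a run of known words."""
    term = term.lower()
    return term not in vocabulary and _splits_into_words(term, vocabulary)


# Hypothetical vocabulary of common words.
vocab = {"crazy", "good", "oh", "my", "god"}
print(is_repeat_char_word("fantaaastic"), is_compound_word("ohmygod", vocab))
```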
19WIDIT Approach Opinion Reranking
- Opinion Reranking factors
- Opinion Terminology
- Wilson's lexicon, HF lexicon
- Opinion Collocations
- AC lexicon, IU lexicon
- Opinion Morphology
- LF lexicon, LF regex
- Opinion Reranking (OR) Method
- Compute OR scores for each document
- Document-length normalized frequency
- Rerank topic-reranked documents using the combined OR score and topic-reranking groups
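One plausible reading of "document-length normalized frequency" for a single lexicon is the summed strength of matched terms divided by the document length in tokens. The sketch below follows that reading; the lexicon contents, the use of strength as the weight, and the function name are assumptions.

```python
def lexicon_score(tokens, lexicon) -> float:
    """Sum of lexicon weights for matching tokens, divided by document length."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)


# Hypothetical lexicon mapping terms to opinion strength.
toy_lexicon = {"sucks": 1.0, "rocks": 1.0, "cool": 0.5}
doc_tokens = "skype rocks and skype is cool".split()
print(lexicon_score(doc_tokens, toy_lexicon))
```

Each lexicon (Wilson's, HF, AC, IU, LF) would yield one such score per document, and those per-factor scores are what a weighted combination then merges.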
20WIDIT Blog System Architecture
21WIDIT Approach Dynamic Tuning
- Reranking formula
- $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
- $w_i$: weight of reranking factor $i$
- $NS_i$: normalized score of factor $i$
- $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
- $\alpha$: weight of original score
- $\beta$: weight of overall reranking score
- How to determine $\alpha$, $\beta$, $w_i$?
- Too many parameters for exhaustive combinations
- Linear combination may not suffice
- Dynamic Tuning
- Real-time display of parameter tuning effect on performance
- To guide the user towards local optimum
- By harnessing both human intelligence (pattern recognition) and the computational power of the machine
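The reranking formula, a weighted sum of the min-max normalized original score and the min-max normalized factor scores, can be written out as a small sketch. The function names, the factor names in the example, and the choice to return 0 when all scores are equal are assumptions; the parameter values are arbitrary, since in WIDIT they are set interactively through Dynamic Tuning.

```python
def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # assumption: constant scores normalize to 0
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_scores(orig_scores, factor_scores, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), one value per document."""
    ns_orig = min_max(orig_scores)
    ns_factors = {name: min_max(vals) for name, vals in factor_scores.items()}
    return [
        alpha * ns_orig[d]
        + beta * sum(weights[name] * ns_factors[name][d] for name in factor_scores)
        for d in range(len(orig_scores))
    ]


# Hypothetical scores for three documents and two opinion factors.
rs = rerank_scores(
    orig_scores=[10, 20, 30],
    factor_scores={"hf": [2, 0, 1], "iu": [0, 4, 4]},
    weights={"hf": 0.5, "iu": 0.5},
    alpha=0.6,
    beta=0.4,
)
print(rs)
```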
22WIDIT Approach Dynamic Tuning
23WIDIT Approach Dynamic Tuning
24WIDIT Approach Fusion
- Weighted Sum Fusion Formula
- $FS = \sum_i (w_i \cdot NS_i)$
- Fusion Type
- Baseline (Min-Max) fusion: $w_i = 1$
- MAP fusion: $w_i$ = MAP of training runs
- Fusion Combinations
- By Query Length
- Short, Long, Long w/ nouns
- By Term Weight
- Okapi, SMART
- Fusion Levels
- Baseline results
- Topic-reranked results
- Opinion-reranked results
- $w_i$: weight of system $i$ (relative contribution of each system)
- $NS_i$: normalized score of a document by system $i$, $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
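A weighted-sum fusion over several runs could be sketched as follows. The run names, the dictionary layout, and the treatment of a document missing from a run (it contributes nothing from that system) are assumptions; weights of 1 correspond to the baseline fusion and per-system MAP values to the MAP fusion.

```python
def fuse(runs, weights=None):
    """Weighted-sum fusion: FS(doc) = sum_i w_i * NS_i(doc).

    runs: {system_name: {doc_id: raw_score}}
    weights: {system_name: w_i}; defaults to 1 for every system (baseline fusion).
    """
    if weights is None:
        weights = {name: 1.0 for name in runs}
    fused = {}
    for name, scores in runs.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for doc_id, s in scores.items():
            ns = (s - lo) / span if span else 0.0  # min-max normalization per system
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical runs from two term-weighting schemes.
runs = {
    "okapi": {"d1": 3.0, "d2": 1.0, "d3": 2.0},
    "smart": {"d1": 0.2, "d2": 0.8},
}
print(fuse(runs, weights={"okapi": 0.3, "smart": 0.2}))
```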
25WIDIT Approach Polarity Detection
- For each opinion-reranked document,
- Compute positive and negative polarity scores
- Combine polarity scores using D-tuned formula
- fsc(p), fsc(n)
- Apply polarity detection heuristic
- Positive polarity if
- most of opinion factors are positive,
- fsc(p) - fsc(n) > threshold
- fsc(p) >> fsc(n)
- Negative polarity if
- most of opinion factors are negative,
- fsc(n) - fsc(p) > threshold
- fsc(n) >> fsc(p)
- Mixed polarity otherwise
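The heuristic can be sketched under a few explicit assumptions: the three conditions for each polarity are treated as a conjunction, ">>" is modeled as exceeding the other score by a factor `ratio`, and the default `threshold` and `ratio` values are placeholders. None of these choices is stated in the slides.

```python
def detect_polarity(factor_polarities, fsc_p, fsc_n, threshold=0.1, ratio=2.0):
    """Return 'positive', 'negative', or 'mixed' for one document.

    factor_polarities: per-factor votes, each 'positive' or 'negative'.
    fsc_p, fsc_n: combined positive and negative polarity scores.
    """
    n_pos = sum(1 for v in factor_polarities if v == "positive")
    n_neg = sum(1 for v in factor_polarities if v == "negative")
    majority = len(factor_polarities) / 2

    # Assumption: all three conditions must hold together.
    if n_pos > majority and fsc_p - fsc_n > threshold and fsc_p > ratio * fsc_n:
        return "positive"
    if n_neg > majority and fsc_n - fsc_p > threshold and fsc_n > ratio * fsc_p:
        return "negative"
    return "mixed"


print(detect_polarity(["positive", "positive", "negative"], fsc_p=0.8, fsc_n=0.2))
```

Reading the conditions as a conjunction keeps the positive and negative branches mutually exclusive; treating them as alternatives would need a tie-breaking rule.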
26Result Overview
- Independent Variables
- Noise Reduction
- Query Length
- Topic Reranking
- Opinion Reranking
- Dynamic Tuning
- Fusion
- Topic Difficulty
- Failure analysis
27Results Summary
- Noise Reduction
- Adverse effect on retrieval performance
- Many relevant documents had contents excluded by the WIDIT Noise Reduction module
- Query Length
- The longer the query, the better the performance
- Topic Reranking
- 4% improvement (Qshort), 10% improvement (Qlong) over initial result
- Opinion Reranking
- 15% improvement (Qshort), 10% improvement (Qlong) over TopicRR
- Dynamic Tuning
- 4% improvement (Qshort), 9% improvement (Qlong) over no tuning
- Fusion
- 20% improvement (Qshort) over best non-fusion baseline
- Topic Difficulty
- Improvement by Opinion reranking not related to topic difficulty
28Concluding Remarks
- Noise Reduction
- Good idea, but faulty implementation
- Effect on retrieval is not yet clear
- Post-retrieval Reranking, Dynamic Tuning, and Fusion all improve retrieval performance
- Compound effect is even more beneficial
- Opinion Modules
- Need better training data
29Result At a Glance
31Query Length Effect
37Topic Difficulty
38Failure Analysis
- Possible reasons for failure (General)
- Sense Ambiguity
- 877 Sonic
- Game? Team (Sonics)? Software? Toothbrush?
- Usage Ambiguity
- 881 Fox News Report
- (non-rel) used more as a news source than the target of discussion
- Narrow Search
- 887 World Trade Organization
- time frame, reaction to the meeting but not WTO in general
- 900 MacDonald
- regarding the food only
- 890 Olympics
- overall appeal and impression of the Winter and Summer Olympics
39Failure Analysis
- Possible reasons for failure (WIDIT-specific): OnTopic
- Noise Effect
- 898 Business Intelligence Resources
- pages having sidebar with link to business intelligence resources
- Document Length Normalization
- 874 Coretta Scott King
- 866 Whole Foods
- (reldoc) long article with small portion of relevant information
- Exact Match Failure
- 869 Mohammad Cartoon
- Non-rel docs with exact Topic title
- Stopword Failure
- 866 Whole Foods
- stopword list contains "whole"
40Failure Analysis
- Possible reasons for failure (WIDIT-specific): Opinion
- Retrieved documents contain opinion but not on the target. (20)
- Document on topic but opinions are on non-topic portion 1(898) 2(879)
- Opinion about the original post (e.g., "good stuff")
- Inconsistent Assessment? (20)
- 891 Intel 1(1) 2(3) 3(0)
- 879 Hybrid cars 1(0) 2(3)
- 899 cholesterol 1(4) 2(1)
- 882 seahawks 1(1)
- Others
- Few relevant documents
- 898 Business Intelligence Resources has 10 relevant documents.
- IU module failed when there are lots of comments following a post.
- 899 1
41Questions?
- Wilson's lexicon
- http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
- Movie Review Data
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- Movie Plot Summaries
- http://www.imdb.com/Sections/Plots/
- Netlingo Terms
- http://www.netlingo.com/emailsh.cfm
- WIDIT Lexicons
- http://elvis.slis.indiana.edu/lexlist.htm