Title: Combining Lexicon-based Methods to Detect Opinionated Blogs
1Combining Lexicon-based Methods to Detect Opinionated Blogs
- Kiduk Yang, Ning Yu, Hui Zhang
- WIDIT Laboratory
- School of Library and Information Science
- Indiana University
2Research Questions: Noise Reduction
- Does noise in blog data affect IR performance?
- What are the characteristics of blog noise?
- Non-English (NE) blogs
- Large proportion of NE tokens
- High frequency of NE stopwords & low frequency of English stopwords
- Non-blog content
- Non-post/comment text generated by blogware
- e.g., sidebar/navigation, header/footer, advertisement, etc.
- Spam postings
- How can blog noise be identified?
- NE blogs
- Language tags in feeds
- NE tokens in permalinks
- Non-blog content
- Mark-up tags in permalinks
3Research Questions: Opinion Detection
- How can opinionated blogs about a given topic be retrieved?
- What is the evidence of opinion?
- Opinion Terminology
- Words often used in expressing an opinion
- e.g., Skype sucks, Skype rocks, Skype is cool
- Opinion Collocations
- Collocations that mark an opinion
- e.g., I think tomato is a fruit, Tomato is a vegetable to me
- Opinion Morphology
- Word morphing to emphasize an opinion
- e.g., Vista is soooo buggy, Vista is metacool
- How can they be leveraged?
- Opinion classification via Supervised Learning
- Document scoring using Opinion Lexicons
- How can they be combined to detect opinionated blogs?
- Weighted sum optimized via Dynamic Tuning
4WIDIT Blog System Architecture
[Figure: system architecture diagram. Labeled components: Blog Data, IMDb Data, Wilson's Lexicons, Netlingo Terms, Blogs, Noise Reduction, Blogs w/o Noise, Opinion Lexicons, Document Indexing, Inverted Index, Topics, Query Indexing, Short Query, Long Query, Expanded Query, Retrieval, Initial Results, Topic Reranking, OnTopic Results, Opinion Reranking, Opinion Results, Dynamic Tuning, Fusion, Fusion Result, Polarity Detection, Polarity Result]
5WIDIT Approach: Noise Reduction
- Non-English (NE) blog Identification
- Language tags in Feeds
- Not always present
- Extract all unique tags from the feed data
- Identify tags that indicate non-English language
- Flag permalinks in feeds with non-English language tag
- NE tokens in Permalinks
- NE blogs w/ English tokens & English blogs w/ NE tokens
- For each permalink,
- Identify tokens consisting of non-ASCII characters (i.e., NE tokens)
- Compute the NE Content Rate: NECR = NE_tokens / tokens
- Compute the frequency & proportion of English stopwords: Est(f), Est(p)
- Flag if large NE content or some NE content with few E-stopwords
- NECR > 0.5 with Est(f) < 10
- NECR < 0.5 with Est(f) < 10 and Est(p) < 0.02
- NECR < 0.5 with dlen > 1000, Est(f) < 20 and Est(p) < 0.01
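The flagging rules above can be sketched in Python as follows. This is a minimal illustration, not the WIDIT implementation: the stopword list is a tiny illustrative subset, whitespace tokenization is assumed, a token counts as NE if it contains any non-ASCII character, dlen is assumed to be the token count, and "some NE content" is read as NECR > 0.

```python
# Illustrative subset only; a real run would use a full English stopword list.
ENGLISH_STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def is_ne_token(token):
    """Assumption: a token is non-English if it has any non-ASCII character."""
    return any(ord(ch) > 127 for ch in token)


def flag_non_english(text):
    """Return True if the permalink text should be flagged as non-English."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    if not tokens:
        return False
    dlen = len(tokens)  # assumption: document length measured in tokens
    necr = sum(is_ne_token(t) for t in tokens) / dlen
    est_f = sum(t.lower() in ENGLISH_STOPWORDS for t in tokens)  # stopword frequency
    est_p = est_f / dlen  # stopword proportion

    # Large NE content with few English stopwords.
    if necr > 0.5 and est_f < 10:
        return True
    # Some NE content with few English stopwords.
    if 0 < necr < 0.5:
        if est_f < 10 and est_p < 0.02:
            return True
        if dlen > 1000 and est_f < 20 and est_p < 0.01:
            return True
    return False
```

For example, `flag_non_english("これは 日本語 の ブログ です")` is flagged, while a plain ASCII sentence is not because its NECR is 0.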
6WIDIT Approach: Noise Reduction
- Non-blog content Exclusion
- 1. Extract all unique tag patterns from the permalink data
- 2. Compile a list of content & noise tags with high frequency
- 3. Construct regular expressions (regex) to identify content & noise tags
- 4. Apply regex to the unique tag set
- 5. Modify regex based on the examination of regex results
- 6. Repeat steps 4 & 5 until regex correctly identifies content & noise tags
- For each permalink,
- Extract blog segment using the content regex
- Extract <post|comment> text
- If no <post|comment> tags, extract <content> text
- If no <content> tag, extract <body> text
- Exclude noise text using the noise regex
- e.g., <(div|span).*?(footer|profile|side|nav|advertise|sponsor).*?>
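As a rough sketch of the noise-exclusion step, the snippet below removes div/span elements whose opening tag mentions a noise keyword, then strips the remaining markup. The pattern is an assumption modeled on the example regex above, not the regex actually used by WIDIT, and it does not handle nested elements of the same tag name.

```python
import re

# Assumed noise pattern: a <div> or <span> whose attributes mention a noise
# keyword, together with its (non-nested) content and closing tag.
NOISE_RE = re.compile(
    r"<(div|span)\b[^>]*?(?:footer|profile|side|nav|advertise|sponsor)[^>]*>"
    r".*?</\1>",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining mark-up tag


def strip_noise(html):
    """Drop noise elements, then remove leftover tags and squeeze whitespace."""
    without_noise = NOISE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", without_noise)
    return " ".join(text.split())


sample = (
    '<body><div class="post">Skype rocks</div>'
    '<div id="sidebar">Archives</div>'
    '<span class="footer">Powered by X</span></body>'
)
cleaned = strip_noise(sample)
```

Short keywords such as "nav" also match inside unrelated attribute values, which is one reason the slides describe refining the regex iteratively against the observed tag set.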
7WIDIT Approach: Noise Reduction
- Noise Reduction (NR) Statistics
- 335,691 (11.7%) NE permalinks excluded
- Blog length reduction
- Over 50% of permalinks with length difference
- Average length reduction: 551 bytes
- 74.3 million (7.4%) tokens excluded by NR
- 21,283 (0.6%) unique tokens excluded by NR
8WIDIT Approach: Topic Reranking
- Topic Reranking factors
- Exact Match of query title text in document
- Near Match of query title/description text in document
- All of the query terms occur in sequence near each other
- Noun Phrase Match
- Non-Rel Match
- Non-relevant phrases/nouns from the topic narrative occur in document
- Topic Reranking (TR) Method
- Compute TR scores for each document
- Document-length normalized frequency
- Categorize initial retrieval into reranking groups
- g1: exact match (query title to document title & body)
- g2: exact match (multi-term query title to document title only)
- g3: exact match (query title to doc. body only)
- g4: other
- Rerank documents within groups using combined TR score
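The grouping step can be illustrated with a short Python sketch. It assumes that "exact match" means a case-insensitive substring match, that groups are ranked g1 before g2 before g3 before g4, and that each document carries a precomputed combined TR score; the field names `title`, `body`, and `tr_score` are hypothetical.

```python
def assign_group(query_title, doc_title, doc_body):
    """Map a document to reranking group 1-4 (assumed substring matching)."""
    q = query_title.lower()
    in_title = q in doc_title.lower()
    in_body = q in doc_body.lower()
    if in_title and in_body:
        return 1  # g1: query title found in both document title and body
    if in_title and len(q.split()) > 1:
        return 2  # g2: multi-term query title found in document title only
    if in_body and not in_title:
        return 3  # g3: query title found in document body only
    return 4      # g4: everything else


def topic_rerank(query_title, docs):
    """Order by group first, then by descending TR score within each group."""
    return sorted(
        docs,
        key=lambda d: (assign_group(query_title, d["title"], d["body"]),
                       -d["tr_score"]),
    )


docs = [
    {"id": "a", "title": "Random post", "body": "nothing here", "tr_score": 0.9},
    {"id": "b", "title": "March of the Penguins", "body": "I saw March of the Penguins", "tr_score": 0.2},
    {"id": "c", "title": "Weekend", "body": "march of the penguins was great", "tr_score": 0.5},
]
ranked_ids = [d["id"] for d in topic_rerank("March of the Penguins", docs)]
```

Sorting on the (group, score) pair keeps documents from a higher-priority group above all documents of lower groups, so the TR score only breaks ties inside a group.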
9WIDIT Approach: Opinion Lexicons
- Lexicon-based Opinion Detection
- Construct Opinion Lexicons from multiple sources of opinion evidence
- Opinion Terminology, Opinion Collocations, Opinion Morphology
- Score documents using Opinion Lexicons
- Opinion Terminology
- Wilson's Lexicons
- A subset of Wilson's subjectivity terms
- 4747 strong & 2190 weak subjective terms with polarity
- 240 emphasis terms
- High Frequency (HF) Lexicon
- For each of IMDb movie & 2006 blog training data
- Extract high frequency terms from positive training data (e.g., movie review)
- Exclude terms that occur in negative training data (e.g., movie plot summary)
- Select a set of opinion terms
- Combine the IMDb & blog term sets
- Assign polarity & strength to each term
- Expand with synonyms & antonyms from WordNet
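The first two HF lexicon steps (extract frequent terms from positive data, exclude terms seen in negative data) might look like the sketch below. The `min_count` cutoff and whitespace tokenization are assumptions; the slides do not state the frequency threshold, and the later selection, polarity assignment, and WordNet expansion steps are not shown.

```python
from collections import Counter


def hf_candidates(positive_docs, negative_docs, min_count=2):
    """Candidate opinion terms: frequent in positive data, absent from negative data.

    min_count is a hypothetical cutoff for "high frequency".
    """
    pos_counts = Counter(
        tok for doc in positive_docs for tok in doc.lower().split()
    )
    neg_vocab = {tok for doc in negative_docs for tok in doc.lower().split()}
    return sorted(
        term for term, count in pos_counts.items()
        if count >= min_count and term not in neg_vocab
    )


reviews = ["great movie great acting", "awful plot awful pacing"]
plots = ["the movie plot follows a detective"]
candidates = hf_candidates(reviews, plots)
```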
10WIDIT Approach: Opinion Lexicons
- Opinion Collocations
- I-You (IU) Lexicon
- For each of movie review & positive blog training data
- Extract n-grams that begin/end with IU anchors (e.g., I, You, my, your, me)
- Select a set of opinion collocations
- Combine the movie & blog term sets
- Assign strength & polarity to each collocation
- Add verb conjugations & noun plurals
- Expand with HF & Wilson terms
- Acronym Lexicon
- Select opinion collocations from netlingo acronyms
- e.g., afaik (as far as I know), imho (in my humble opinion)
- Assign strength & polarity to each collocation
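A minimal sketch of the IU n-gram extraction step is shown below. The anchor set is taken from the examples on the slide; the n-gram length and the whitespace tokenization (no punctuation handling) are assumptions.

```python
IU_ANCHORS = {"i", "you", "my", "your", "me"}  # anchors listed on the slide


def iu_ngrams(text, n=2):
    """Return n-grams that begin or end with an IU anchor."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
            grams.append(" ".join(gram))
    return grams


starts = iu_ngrams("I think tomato is a fruit")
ends = iu_ngrams("Tomato is a vegetable to me")
```

On the two example sentences from the research-question slide, this yields "i think" and "to me", the kind of collocation candidates that would then be reviewed and assigned strength and polarity.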
11WIDIT Approach: Opinion Lexicons
- Opinion Morphology
- When expressing opinion, people become creative and tend to use uncommon/rare terms (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
- LF Lexicon & LF Regex
- Compile a set of Low Frequency (LF) terms in the blog collection
- Exclude terms that occur frequently in negative training data
- Construct regular expressions (LF regex) to identify Opinion Morph (OM) terms
- Based on examination of HF terms & LF patterns
- Compound words (e.g., crazygood, ohmygod)
- Repeat-character words (e.g., sooo, fantaaastic)
- Morph-spelled words (e.g., luv, hizzarious)
- Apply regex to LF term set
- Iteratively refine regex based on the examination of regex results
- Exclude regex matches from LF term set
- Select OM terms (LF lexicon) from the remaining set
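The LF regex idea can be sketched with two illustrative patterns: one for repeat-character words and one for compound words. Both patterns are assumptions built from the slide's examples (the compound prefixes in particular are hypothetical), not the regexes used in WIDIT. Terms the patterns miss, such as morph-spelled words, stay in the remaining set for lexicon selection.

```python
import re

# Three or more of the same letter in a row, e.g. "sooo", "fantaaastic".
REPEAT_CHAR_RE = re.compile(r"([a-z])\1{2,}")
# Hypothetical compound pattern: a known prefix glued to another word.
COMPOUND_RE = re.compile(r"^(crazy|ohmy|super)[a-z]{3,}$")


def split_lf_terms(lf_terms):
    """Split LF terms into regex matches and the remaining set."""
    matched, remaining = [], []
    for term in lf_terms:
        t = term.lower()
        if REPEAT_CHAR_RE.search(t) or COMPOUND_RE.match(t):
            matched.append(term)
        else:
            remaining.append(term)
    return matched, remaining


matched, remaining = split_lf_terms(
    ["sooo", "fantaaastic", "crazygood", "ohmygod", "luv", "zymurgy"]
)
```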
12WIDIT Approach: Opinion Reranking
- Opinion Reranking factors
- Opinion Terminology
- Wilson's lexicon, HF lexicon
- Opinion Collocations
- AC lexicon, IU lexicon
- Opinion Morphology
- LF lexicon, LF regex
- Opinion Reranking (OR) Method
- Compute OR scores for each document
- Document-length normalized frequency
- Rerank topic-reranked documents using combined OR score & topic-reranking groups
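One plausible reading of "document-length normalized frequency" is the summed strength of lexicon hits divided by the number of tokens. The sketch below follows that reading for a single lexicon of single-word terms; the lexicon contents and weights are made up for illustration.

```python
def lexicon_score(tokens, lexicon):
    """Sum of lexicon strengths for matching tokens, divided by document length.

    lexicon maps a term to its strength weight (hypothetical values).
    """
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0.0) for tok in tokens) / len(tokens)


toy_lexicon = {"sucks": 1.0, "rocks": 1.0, "cool": 0.5}
score = lexicon_score("skype really sucks".split(), toy_lexicon)
```

One such score per lexicon (Wilson's, HF, AC, IU, LF) would give the set of opinion factors that the combined OR score is built from.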
13WIDIT Approach: Polarity Detection
- For each opinion-reranked document,
- Compute positive & negative polarity scores
- Presence of valence shifters near opinion terms reverses polarity
- e.g., not, never, no, without, hardly, barely, scarcely
- Combine polarity scores using D-tuned formula
- fsc(p), fsc(n)
- Apply polarity detection heuristic
- Positive polarity if
- most of opinion factors are positive,
- fsc(p) - fsc(n) > threshold
- fsc(p) / fsc(n) > threshold2
- Negative polarity if
- most of opinion factors are negative,
- fsc(n) - fsc(p) > threshold
- fsc(n) / fsc(p) > threshold2
- Mixed polarity otherwise
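The heuristic can be written down as a small decision function. This sketch assumes the three conditions for each polarity must all hold, reads "most of opinion factors" as a simple majority count, and uses placeholder threshold values; none of these details are stated on the slide.

```python
def ratio(a, b):
    """a / b, treating division by zero as an infinitely large ratio."""
    return float("inf") if b == 0 else a / b


def classify_polarity(fsc_p, fsc_n, pos_factors, neg_factors,
                      threshold=0.1, threshold2=1.5):
    """Return 'positive', 'negative', or 'mixed'.

    fsc_p / fsc_n: combined positive / negative polarity scores.
    pos_factors / neg_factors: number of opinion factors leaning each way.
    threshold and threshold2 are hypothetical values.
    """
    if (pos_factors > neg_factors
            and fsc_p - fsc_n > threshold
            and ratio(fsc_p, fsc_n) > threshold2):
        return "positive"
    if (neg_factors > pos_factors
            and fsc_n - fsc_p > threshold
            and ratio(fsc_n, fsc_p) > threshold2):
        return "negative"
    return "mixed"
```

For instance, `classify_polarity(0.8, 0.2, 4, 2)` returns "positive", while scores of 0.5 and 0.45 with evenly split factors fall through to "mixed".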
14WIDIT Approach: Dynamic Tuning
- Reranking formula
- $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
- $w_i$: weight of reranking factor i
- $NS_i$: normalized score of factor i
- $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
- $\alpha$: weight of original score
- $\beta$: weight of overall reranking score
- How to determine $\alpha$, $\beta$, $w_i$?
- Too many parameters for exhaustive combinations
- Linear combination may not suffice
- Dynamic Tuning
- Real-time display of parameter tuning effect on performance
- To guide the user towards local optimum
- By combining human intelligence (pattern recognition) w/ computational power of machine
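To make the reranking formula concrete, the sketch below computes RS for a small result list, with min-max normalization applied per score list. The function and argument names are illustrative; in the system described here, the weights would come from Dynamic Tuning rather than being hard-coded.

```python
def min_max(scores):
    """Min-max normalize a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)  # assumption: constant lists normalize to 0
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_scores(orig, factors, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), one value per document.

    orig: original retrieval scores, one per document.
    factors: one score list per reranking factor, aligned with orig.
    """
    ns_orig = min_max(orig)
    ns_factors = [min_max(f) for f in factors]
    return [
        alpha * ns_orig[d]
        + beta * sum(w * ns[d] for w, ns in zip(weights, ns_factors))
        for d in range(len(orig))
    ]


rs = rerank_scores(
    orig=[10, 20, 30],
    factors=[[3, 1, 2], [0, 4, 2]],
    weights=[0.5, 0.5],
    alpha=0.6,
    beta=0.4,
)
```

With these toy inputs the three documents score roughly 0.2, 0.5, and 0.8.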
15WIDIT Approach: Dynamic Tuning
16WIDIT Approach: Dynamic Tuning
17Results Overview
- Noise Reduction
- Positive effect on retrieval performance
- Topic Reranking
- 16% improvement (Qshort), 9% improvement (Qlong) over initial result
- Opinion Reranking
- 15% improvement (Qshort), 11% improvement (Qlong) over TopicRR
18Relative Performance: Opinion Reranking Effects (short query)
Good OnTopic retrieval → good opinion retrieval
- but not necessarily due to opinion reranking
19Relative Performance: Polarity Detection
Steeper slope → worse relative performance
20Result Overview: 2006 vs. 2007
[Figure panels: Topic MAP, Opinion MAP]
21Noise Reduction Effect
22Reranking Effect
23Opinion Reranking Factors
24Concluding Remarks
- Noise Reduction
- Positive effect on retrieval performance
- Reranking
- Most influential component of the system
- Next year
- Improve baseline performance
- Spam filtering?
- Incorporate Machine Learning into the fusion pool
25Questions?
- Wilson's lexicon
- http://www.cs.pitt.edu/mpqa/opinionfinderrelease/
- Movie Review Data
- http://www.cs.cornell.edu/people/pabo/movie-review-data/
- Movie Plot Summaries
- http://www.imdb.com/Sections/Plots/
- Netlingo Terms
- http://www.netlingo.com/emailsh.cfm
- WIDIT Lexicons
- http://elvis.slis.indiana.edu/lexlist.htm
26Result At a Glance
27Reranking Effect: Dynamic Tuning
r1: topic reranking; r2: opinion reranking
s0R: reranking w/o Dynamic Tuning; s0R1: reranking w/ Dynamic Tuning
- Opinion reranking
- sacrifices topical performance for the sake of opinion detection
- Dynamic Tuning
- improves reranking performance across the board
28Query Length Effect
29Term Weight Effect
30Result At a Glance
31Result At a Glance
36WIDIT Approach: Fusion
- Weighted Sum Fusion Formula
- $FS = \sum_i (w_i \cdot NS_i)$
- $w_i$: weight of system i (relative contribution of each system)
- $NS_i$: normalized score of a document by system i
- $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
- Fusion Type
- Baseline (Min-Max) fusion: $w_i = 1$
- MAP fusion: $w_i$ = MAP of training runs
- Fusion Combinations
- By Query Length
- Short, Long, Long w/ nouns
- By Term Weight
- Okapi, SMART
- Fusion Levels
- Baseline results
- Topic-reranked results
- Opinion-reranked results
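A weighted-sum fusion of several result lists can be sketched as below. Each run is assumed to be a dict from document id to raw score, and a document missing from a run is assumed to contribute 0 for that run; the example weights standing in for MAP values are made up.

```python
def fuse(runs, weights=None):
    """FS = sum_i(w_i * NS_i) over runs, with min-max normalization per run.

    runs: list of {doc_id: score} dicts, one per system.
    weights: one weight per run; defaults to 1 (baseline Min-Max fusion).
    """
    if weights is None:
        weights = [1.0] * len(runs)
    fused = {}
    for run, w in zip(runs, weights):
        lo, hi = min(run.values()), max(run.values())
        for doc, s in run.items():
            ns = 0.0 if hi == lo else (s - lo) / (hi - lo)
            fused[doc] = fused.get(doc, 0.0) + w * ns
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


runs = [
    {"d1": 3.0, "d2": 1.0, "d3": 2.0},  # e.g., a short-query run
    {"d2": 10.0, "d3": 5.0},            # e.g., a long-query run
]
baseline = fuse(runs)
map_weighted = fuse(runs, weights=[0.2, 0.3])  # hypothetical MAP values
```

Normalizing each run before summing keeps a system with a larger raw score range from dominating the fused ranking.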
37WIDIT Approach: Noise Reduction
- Non-English (NE) blog Identification
- Language tags in Feeds (FD)
- 16,121 permalinks flagged (1,473 not flagged by PM)
- NE tokens in Permalinks (PM)
- 334,219 permalinks flagged (11.6%)
- NE blog Validation
- OnTopic (rel > 0) NE blogs in 2006 qrels file
- 24 (PM only)
- Suspected qrels error
- NE blogs in 2006 qrels file
- 59 (FD only), 2043 (PM only), 203 (both); total 2304
- All but 3 manually validated as Non-English blog
- BLOG06-20051230-022-0008930772: short blog with no content
- BLOG06-20060118-000-0008692678: short blog
- BLOG06-20051208-020-0022945948: Uhhhh.... (long) It sucks!
38Suspected Non-English blogs in qrels.blog06