Title: WIDIT at TREC2006 Blog Track: Searching for Opinionated Posts about Entities


1
WIDIT at TREC-2006 Blog Track: Searching for
Opinionated Posts about Entities
  • Kiduk Yang, Ning Yu, Hui Zhang, Alejandro Valerio
  • WIDIT Laboratory
  • School of Library and Information Science
  • Indiana University

2
Research Questions
  • Does noise reduction affect blog retrieval
    performance?
  • Generate indexes with and without noise
  • Compare the retrieval performances of two indexes
  • How can a topical search system retrieve
    opinionated blogs?
  • Retrieve blogs about a target (i.e., OnTopic)
  • Optimize OnTopic retrieval
  • Leverage multiple sources of evidence of
    subjectivity/opinion to identify Opinion blogs

3
Research Questions
  • What is the evidence of subjectivity/opinion,
    and how can it be leveraged to retrieve
    opinionated blogs?
  • Opinion Lexicon
  • words often used in expressing an opinion
  • e.g., "Skype sucks", "Skype rocks", "Skype is
    cool"
  • Extract high frequency terms from training data.
  • Opinion Collocations
  • collocations that mark an opinion
  • e.g., "I believe God exists", "God is dead to me"
  • Extract IU (I, you, etc.) collocations from
    training data
  • Opinion Morphology
  • word morphing to emphasize an opinion
  • e.g., "Skype is soooo buggy", "Skype is
    bugfested"
  • Extract low frequency, non-dictionary terms from
    training data

4
WIDIT Approach
Blog collection
Subset where target appears
  • Noise Reduction
  • Non-English blog elimination
  • Non-post content exclusion
  • On-topic Retrieval
  • Initial Retrieval
  • On-Topic Reranking
  • Fusion
  • Opinion Retrieval
  • Opinion Reranking
  • HF, IU, LF evidence
  • Fusion

5
WIDIT Blog System Architecture
6
Noise Reduction
  • Non-English (NE) Blog Elimination Module
  • NE Blog Characteristics
  • Large proportion of NE terms
  • High frequency of NE stopwords and low frequency
    of E stopwords
  • NE Blog Identification Heuristic
  • Large NE content with NE stopwords or few E
    stopwords
  • Some NE content with few (relative to blog
    length) E stopwords
  • Non-post Content Exclusion Module
  • Module Construction
  • Extract all HTML tag patterns from the blog
    collection
  • Manually identify patterns of true blog content
    (post, comments) and noise
  • Construct regex from identified patterns
  • Noise Exclusion Heuristic
  • Exclude script and style text
  • Extract post and comments
  • If no post/comment tag, extract text between
    <content> tags
  • If none found, use the entire body text
  • Exclude form contents and blog noise (e.g.,
    sidebar, navigation, ads, etc.)
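
The Noise Exclusion Heuristic above can be sketched roughly as follows. This is a minimal illustration, not the WIDIT module: the regular expressions, the `post`/`comment` class names, and the function name `extract_post_text` are all assumptions, whereas the actual module used patterns identified manually from the blog collection.

```python
import re

# Hypothetical patterns; the real patterns came from manual inspection of the collection.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
FORM = re.compile(r"<form\b.*?</form\s*>", re.I | re.S)
POST = re.compile(
    r'<div[^>]*class="[^"]*\b(?:post|comment)s?\b[^"]*"[^>]*>(.*?)</div>',
    re.I | re.S,
)
CONTENT = re.compile(r"<content\b[^>]*>(.*?)</content\s*>", re.I | re.S)
BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def extract_post_text(html):
    """Return post/comment text, falling back to <content>, then <body>, then everything."""
    html = SCRIPT_STYLE.sub(" ", html)  # exclude script and style text
    html = FORM.sub(" ", html)          # exclude form contents
    for pattern in (POST, CONTENT, BODY):
        parts = pattern.findall(html)
        if parts:
            break
    else:
        parts = [html]                  # nothing matched: keep the whole document
    text = TAG.sub(" ", " ".join(parts))
    return " ".join(text.split())       # collapse whitespace
```

Sidebar, navigation, and ad blocks are dropped here only as a side effect of keeping post and comment containers; nested `div` elements would need a real HTML parser rather than a non-greedy regex.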

7
Indexing
  • Document Indexing Module
  • Stop and stem
  • Generate inverted indexes with and without noise
  • Compute Okapi and SMART term weights
  • Query Indexing Module
  • Identify non-relevance text in topic
  • A not relevant, but B relevant (A negative, B
    positive)
  • A not relevant unless B (A weak negative, B
    strong positive)
  • not relevant if/when A (A negative)
  • Identify nouns and noun phrases
  • Expand acronyms and abbreviations
  • Stop and stem
  • Formulate queries
  • Short, Long, Long w/ nouns

8
Topic Reranking
  • Topic Reranking Factors
  • Exact Match
  • Query title string in document
  • Proximity Match
  • Padded (between terms) query title/description
    string in document
  • Noun Phrase Match
  • Non-Rel Match
  • Non-relevance phrases/nouns in document
  • Topic Reranking Method
  • Compute Topic Reranking scores
  • Categorize initial retrieval into reranking
    groups
  • g1: exact match (query title to document title
    and body)
  • g2: exact match (multi-term query title to
    document title only)
  • g3: exact match (query title to document body only)
  • g4: other
  • Boost the rank of documents using reranking
    scores within groups
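
One possible reading of the grouping step is sketched below. It assumes that groups are strictly ordered (g1 before g2, and so on), that documents are sorted by reranking score inside each group, and that "exact match" means a case-insensitive substring match; the function names and the document dictionary layout are hypothetical.

```python
def rerank_group(query_title, doc_title, doc_body):
    """Assign a document to reranking group 1-4 (assumed interpretation of g1-g4)."""
    q = query_title.lower()
    in_title = q in doc_title.lower()
    in_body = q in doc_body.lower()
    multi_term = len(q.split()) > 1
    if in_title and in_body:
        return 1  # g1: exact match in both title and body
    if in_title and multi_term:
        return 2  # g2: multi-term query matched in title only
    if in_body and not in_title:
        return 3  # g3: exact match in body only
    return 4      # g4: everything else


def rerank(docs, query_title):
    """Order by group first, then by descending reranking score within each group."""
    return sorted(
        docs,
        key=lambda d: (rerank_group(query_title, d["title"], d["body"]), -d["score"]),
    )
```

Under this reading, a g4 document can never outrank a g1 document regardless of its score, which is what "within groups" appears to imply.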

9
Training Data
  • Purpose
  • To extract Opinion lexicons for Opinion Modules
  • Construction
  • Training topics were created from BlogPulse
    and Technorati search
  • A web interface was designed to display the
    initial results set, from which we evaluated and
    made assessments of topic and opinion relevance.
  • Data
  • 20 topics
  • 700 documents
  • 3-level relevance
  • On-topic only, Opinion only, On-topic and Opinion

10
Opinion Reranking
  • Opinion Reranking Factors
  • High Frequency (HF) terms
  • Terms that occur frequently in opinion blogs
  • Low Frequency (LF) patterns
  • Creative/rare term patterns used in expressing
    opinions
  • I-You (IU) collocations
  • N-gram collocation patterns with I, You, etc.
  • Adjective-Verb (AV) terms
  • Adjectives and verbs used to express opinions
  • Opinion Reranking Method
  • Compute Opinion Reranking scores
  • Boost the topic-reranked rank of documents using
    reranking scores within topic-reranking groups

11
Opinion Reranking
  • High Frequency (HF) Module
  • Strategy
  • Identify opinion blogs using HF terms
  • HF terms: terms that occur frequently only in
    opinion blogs
  • Method
  • Build a HF lexicon
  • Extract high frequency terms from positive
    training data
  • Exclude terms that occur in negative training
    data
  • Manually select a set of opinion terms
  • Compute HF scores
  • Document-length normalized frequency
  • HF terms in document
  • HF terms near query title string in document
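
The HF method above might look roughly like this in code. It is a sketch under assumptions: the tokenizer, the `min_count` threshold, and the `window` size are illustrative choices not given in the slides, and the manual selection step is left out (the function returns candidates only).

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def build_hf_candidates(positive_docs, negative_docs, min_count=2):
    """Frequent terms in positive (opinion) data that never occur in negative data."""
    pos = Counter(t for d in positive_docs for t in tokenize(d))
    neg = {t for d in negative_docs for t in tokenize(d)}
    return {t for t, c in pos.items() if c >= min_count and t not in neg}


def hf_score(doc, lexicon):
    """Document-length normalized frequency of HF terms."""
    tokens = tokenize(doc)
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)


def hf_near_score(doc, lexicon, query_title, window=5):
    """Same score, counting only HF terms within `window` tokens of the query title."""
    tokens, q = tokenize(doc), tokenize(query_title)
    if not tokens or not q:
        return 0.0
    near = set()
    for i in range(len(tokens) - len(q) + 1):
        if tokens[i:i + len(q)] == q:
            near.update(range(max(0, i - window), min(len(tokens), i + len(q) + window)))
    return sum(tokens[j] in lexicon for j in near) / len(tokens)
```

The two functions correspond to the two HF scores listed above: terms anywhere in the document, and terms near the query title string.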

12
Opinion Reranking
  • Low Frequency (LF) Module
  • Hypothesis
  • When expressing opinion, people become creative
    and tend to use uncommon/rare terms
    (Wiebe, Wilson, Bruce, Bell, & Martin, 2004)
  • e.g., "sooo good"
  • Method
  • Identify rare terms and patterns that are used to
    express an opinion
  • Extract low frequency terms from positive
    training data
  • Exclude terms that occur in negative training
    data and dictionary terms
  • Manually examine the remainder to construct an LF
    lexicon and LF regular expressions
  • Compute LF scores
  • Document-length normalized frequency
  • LF terms and patterns in document
  • LF terms and patterns near query title string in
    document
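
As an illustration of what an LF regular expression could look like, the sketch below flags letter elongation ("sooo", "soooo") and looks up a small hand-built LF lexicon. The specific pattern (three or more repeats of a letter) and the names `ELONGATED` and `lf_matches` are assumptions; the slides do not list the actual expressions.

```python
import re

# Assumed LF pattern: any letter repeated three or more times in a row.
ELONGATED = re.compile(r"([a-z])\1{2,}", re.I)


def lf_matches(text, lf_lexicon=frozenset()):
    """Return (matching tokens, document-length normalized frequency)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = [t for t in tokens if t.lower() in lf_lexicon or ELONGATED.search(t)]
    score = len(hits) / len(tokens) if tokens else 0.0
    return hits, score
```

A lexicon entry such as "bugfested" covers coined words that no general pattern would catch, which is presumably why the method pairs a lexicon with regular expressions.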

13
Opinion Reranking
  • I-You (IU) Module
  • Hypothesis
  • IU collocations tend to occur frequently in
    opinion blogs
  • IU term anchors: I, you, my, your, me, etc.
  • Method
  • Construct IU lexicon
  • Extract n-grams that begin/end with IU terms from
    positive training data
  • Examine the n-grams to compile an IU collocation
    lexicon
  • e.g., "I loathe", "I believe", etc.
  • Compute IU scores
  • Document-length normalized frequency
  • Padded IU collocation string within document
    sentence
  • Padded IU collocation string within document
    sentence near query title string
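
A rough sketch of the two IU steps follows: collecting candidate n-grams anchored on IU terms, and matching a "padded" collocation that tolerates a few intervening words. The anchor set is taken from the slide; the n-gram length, the `pad` limit, and both function names are assumptions.

```python
import re
from collections import Counter

IU_ANCHORS = {"i", "you", "my", "your", "me"}


def iu_ngrams(sentences, n=2):
    """Count n-grams that begin or end with an IU anchor term."""
    counts = Counter()
    for sentence in sentences:
        toks = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(toks) - n + 1):
            gram = toks[i:i + n]
            if gram[0] in IU_ANCHORS or gram[-1] in IU_ANCHORS:
                counts[" ".join(gram)] += 1
    return counts


def padded_pattern(collocation, pad=2):
    """Regex matching the collocation with up to `pad` extra words between its terms."""
    words = [re.escape(w) for w in collocation.split()]
    gap = r"\W+(?:\w+\W+){0,%d}" % pad
    return re.compile(r"\b" + gap.join(words) + r"\b", re.I)
```

For example, `padded_pattern("I believe")` matches "I really do believe" but not a sentence with three or more words between "I" and "believe". The candidate counts would still need the manual review step described above before becoming a lexicon.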

14
Opinion Reranking
  • Adjective-Verb (AV) Module
  • Hypothesis
  • Opinion blogs have a high density of Opinion
    Adjectives and Verbs
  • Method
  • Construct AV lexicons
  • Manually compile an AV seed set
  • e.g., good, bad, support, against, like, hate
  • Expand the seed set with synonyms and antonyms from
    lexical sources (AV1)
  • Expand AV1 with similar AV terms using
    Distributional Similarity (AV2)
  • Compute AV scores
  • AV1 score: document-length normalized frequency
  • AV1 terms near query title string in document
  • AV2 score: AV2 density in document
  • AV2 term frequency / total adjective and verb
    frequency
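
The AV2 density can be illustrated as below. The sketch assumes the document has already been part-of-speech tagged with Penn Treebank style tags (`JJ*` for adjectives, `VB*` for verbs); the tagger choice and the function name `av2_density` are not from the slides.

```python
def av2_density(tagged_tokens, av2_lexicon):
    """AV2 term frequency divided by total adjective and verb frequency."""
    # Keep only adjectives and verbs (assumed Penn Treebank tag prefixes).
    av = [word.lower() for word, tag in tagged_tokens if tag.startswith(("JJ", "VB"))]
    if not av:
        return 0.0
    return sum(word in av2_lexicon for word in av) / len(av)
```

Unlike the AV1 score, the denominator here is the number of adjectives and verbs rather than document length, so a long post with few adjectives and verbs is not penalized.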

15
Opinion Reranking
  • AV expansion by Distributional Similarity
  • Objective
  • Find a cluster of similar words given a seed set
    of Opinion AV
  • Hypothesis
  • Similar words have similar distributional
    (co-occurrence) patterns.
  • Learning Subjective Language (Wiebe et al.,
    2004)
  • Method
  • Split the training data into a training set and a
    validation set
  • Find terms that co-occur with seed set terms in
    the training set
  • Refine the expanded term set E(n)
  • Classify the validation set with E(1)..E(n)
  • Select E(k), which has the highest classification
    performance
  • Manually filter E(k) to create the final Opinion
    AV lexicon

16
Dynamic Tuning
17
Dynamic Tuning
18
Dynamic Tuning
19
Reranking Formula
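  • $RS = \alpha \cdot NS_{orig} + \beta \cdot \sum_i (w_i \cdot NS_i)$
  • $w_i$ = weight of reranking factor $i$
  • $NS_i$ = normalized score of factor $i$
  • $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
  • $\alpha$ = weight of original score
  • $\beta$ = weight of overall reranking score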
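  • Topic Reranking
  • Submission
  • $0.85 \cdot NS_{orig} + 0.15 \cdot (3 \cdot ex1 + 2 \cdot ex2 + 2 \cdot px1 + 1 \cdot px2 + 1 \cdot px3 + 1 \cdot ph) - 1 \cdot nr - 0.5 \cdot nr2$
  • Post-Submission (w/ Dynamic Tuning)
  • $0.85 \cdot NS_{orig} + 0.15 \cdot (4.5 \cdot ex1 + 3 \cdot ex2 + 3 \cdot px1 + 2 \cdot px2 + 1 \cdot px3 + 4 \cdot ph) - 3 \cdot nr - 10 \cdot nr2$
  • Opinion Reranking
  • Submission
  • $0.5 \cdot NS_{orig} + 0.5 \cdot (1 \cdot hf1 + 0.5 \cdot hf2 + 2 \cdot iu1 + 1 \cdot iu2 + 2 \cdot lf1 + 1 \cdot lf2 + 0.3 \cdot av + 0.2 \cdot av2)$
  • Post-Submission (w/ Dynamic Tuning)
  • $0.72 \cdot NS_{orig} + 0.28 \cdot (1 \cdot hf1 + 1 \cdot hf2 + 6 \cdot iu1 + 1 \cdot iu2 + 2 \cdot lf1 + 1 \cdot lf2 + 0.1 \cdot av + 0.1 \cdot av2)$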
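
To make the formula concrete, here is a small sketch that applies min-max normalization and the weighted sum to a handful of documents. The weights are the Opinion Reranking submission values as read from the slide; the function names, the data layout, and the choice to map a constant score list to zeros are assumptions.

```python
def min_max(scores):
    """Normalize scores to [0, 1]; a constant list maps to all zeros (assumed)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Opinion Reranking submission weights, as read from the slide.
OPINION_WEIGHTS = {"hf1": 1.0, "hf2": 0.5, "iu1": 2.0, "iu2": 1.0,
                   "lf1": 2.0, "lf2": 1.0, "av": 0.3, "av2": 0.2}


def rerank_scores(orig, factors, weights, alpha, beta):
    """RS = alpha * NS_orig + beta * sum_i(w_i * NS_i), computed per document."""
    ns_orig = min_max(orig)
    ns = {name: min_max(vals) for name, vals in factors.items()}
    return [
        alpha * ns_orig[d] + beta * sum(weights[name] * ns[name][d] for name in ns)
        for d in range(len(orig))
    ]
```

With original scores `[10, 20, 30]`, `hf1` raw scores `[3, 0, 0]`, `iu1` raw scores `[0, 0, 4]`, and `alpha = beta = 0.5`, the third document gets 0.5 from its original score plus 0.5 × 2 from `iu1`, for a total of 1.5.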

20
Fusion
  • Fusion Formula
  • Weighted Sum
  • $FS = \sum_i (w_i \cdot NS_i)$
  • Fusion Combinations
  • By Query Length
  • Short, Long, Long w/ nouns
  • By Term Weight
  • Okapi, SMART
  • Fusion Levels
  • Baseline results
  • Topic-reranked results
  • Opinion-reranked results

$w_i$ = weight of system $i$ (relative contribution of each system)
$NS_i$ = normalized score of a document by system $i$: $NS_i = (S_i - S_{min}) / (S_{max} - S_{min})$
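
A weighted-sum fusion of this kind could be sketched as follows. The system names, the weights, and the decision to treat a document missing from a run as contributing zero are illustrative assumptions, not values from the slides.

```python
def fuse(runs, weights):
    """Weighted sum of min-max normalized scores across systems.

    runs: {system_name: {doc_id: raw_score}}; weights: {system_name: w_i}.
    A document absent from a system's run contributes 0 for that system (assumed).
    """
    fused = {}
    for system, scores in runs.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for doc, s in scores.items():
            ns = (s - lo) / span if span else 0.0
            fused[doc] = fused.get(doc, 0.0) + weights[system] * ns
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing each run separately before summing keeps a system with a wide raw score range (for example Okapi versus SMART weights) from dominating the fused ranking.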
21
Result Overview
  • Independent Variables
  • Noise Reduction
  • Query Length
  • Topic Reranking
  • Opinion Reranking
  • Dynamic Tuning
  • Fusion
  • Topic Difficulty
  • Failure analysis

22
Results Summary
  • Noise Reduction
  • Adverse effect on retrieval performance
  • Many relevant documents had contents excluded by
    the WIDIT Noise Reduction module
  • Query Length
  • The longer the query, the better the performance
  • Topic Reranking
  • 4% improvement (Qshort), 10% improvement (Qlong)
    over initial result
  • Opinion Reranking
  • 15% improvement (Qshort), 10% improvement (Qlong)
    over TopicRR
  • Dynamic Tuning
  • 4% improvement (Qshort), 9% improvement (Qlong)
    over no tuning
  • Fusion
  • 20% improvement (Qshort) over best baseline
    non-fusion
  • Topic Difficulty
  • Improvement by Opinion reranking not related to
    topic difficulty

23
Concluding Remarks
  • Noise Reduction
  • Good idea, but faulty implementation
  • Effect on retrieval is not yet clear
  • Post-retrieval Reranking, Dynamic Tuning, and
    Fusion all improve retrieval performance
  • Compound effect is even more beneficial
  • Opinion Modules
  • Need better training data

24
Result At a Glance
  • Topic MAP
  • Opinion MAP

26
Query Length Effect
32
Topic Difficulty
33
Failure Analysis
  • Possible reasons for failure (General)
  • Sense Ambiguity
  • 877 Sonic
  • Game? Team (Sonics)? Software? Toothbrush?
  • Usage Ambiguity
  • 881 Fox News Report
  • (non-rel) used more as a news source than the
    target of discussion
  • Narrow Search
  • 887 World Trade Organization
  • time frame, reaction to the meeting but not WTO
    in general
  • 900 MacDonald
  • regarding the food only
  • 890 Olympics
  • overall appeal and impression of the Winter and
    Summer Olympics.

34
Failure Analysis
  • Possible reasons for failure (WIDIT-specific):
    OnTopic
  • Noise Effect
  • 898 Business Intelligence Resources
  • pages having sidebar with link to business
    intelligence resources
  • Document Length Normalization
  • 874 Coretta Scott King
  • 866 Whole Foods
  • (reldoc) long article with small portion of
    relevant information
  • Exact Match Failure
  • 869 Mohammad Cartoon
  • Non-rel docs with exact Topic title
  • Stopword Failure
  • 866 Whole Foods
  • stopword list contains "whole"

35
Failure Analysis
  • Possible reasons for failure (WIDIT-specific):
    Opinion
  • Retrieved documents contain opinion but not on
    the target. (20)
  • Document on topic but opinions are on non-topic
    portion 1(898) 2(879)
  • Opinion about the original post (e.g., "good
    stuff")
  • Inconsistent Assessment? (20)
  • 891 Intel 1(1) 2(3) 3(0)
  • 879 Hybrid cars 1(0) 2(3)
  • 899 cholesterol 1(4) 2(1)
  • 882 seahawks 1(1)
  • Others
  • Few relevant documents
  • 898 Business Intelligence Resources has 10
    relevant documents.
  • IU module failed when there are lots of comments
    following a post.
  • 899 1

36
Questions?