Title: Improving Search Results Quality by Customizing Summary Lengths
1Improving Search Results Quality by Customizing
Summary Lengths
- Michael Kaisser (University of Edinburgh), Marti Hearst (UC Berkeley)
- and John B. Lowe (Powerset, Inc.)
- ACL-08 HLT
2Talk Outline
- How best to display search results?
- Experiment 1: Is there a correlation between response type and response length?
- Experiment 2: Can humans predict the best response length?
- Summary and Outlook
3Motivation
- Web search result listings today are largely standardized: they display a document surrogate (Marchionini et al., 2008)
- Typically: one header line, two lines of text fragments, one line for the URL
- But: Is this the best way to present search results? In particular: Is this the optimal length for every query?
(Source: Yahoo!)
4Experiment 1 Research Question
- Do different types of queries require responses
of different lengths?
- (And if so, does the preferred response length depend on the expected semantic response type?)
5Experiment 1 Setup
- Data used
- 12,790 queries from Powerset's query database
- Contains search engine query logs and hand-crafted queries
- Disproportionately large number of natural language queries
6Experiment 1 Setup
- Disproportionately large number of natural language queries
- Examples
- date of next US election
- Hip Hop
- A synonym for material
- highest volcano
- What problems do federal regulations cause?
- I want to make my own candles
- industrial music
7Excursus Mechanical Turk
- Amazon Web Services API for computers to integrate "artificial artificial intelligence"
- Requesters can upload Human Intelligence Tasks (HITs)
- Workers work on these HITs and are paid small sums of money
- Examples
- Can you see a person in the photo?
- Is the document relevant to a query?
- Is the review of this product positive or negative?
8Excursus Mechanical Turk
- Amazon Web Services API for computers to integrate "artificial artificial intelligence"
- Requesters can upload Human Intelligence Tasks (HITs)
- Workers work on these HITs and are paid small sums of money
- → Mechanical Turk can also be seen as a platform for online experiments
9Experiment 1
- Turkers are asked to classify queries by
- Expected response type
- Best response length
- Each query is classified by three different subjects.
11Experiment 1 Results
- The distribution of length categories differs across the individual expected response categories.
- Some results are intuitive
- Queries for numbers want short results
- Advice queries want longer results
- Some results are more surprising
- Different length distributions for Person vs.
Organization
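The comparison of length distributions across response categories can be illustrated with a small cross-tabulation. This is only a sketch: the category names and the `short`/`long` labels below are made-up placeholders, not data from the experiment, and `length_distribution` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical (expected response type, preferred length) labels.
# These are illustrative placeholders, NOT judgments from the experiment.
labels = [
    ("Number", "short"),
    ("Number", "short"),
    ("Advice", "long"),
    ("Advice", "long"),
    ("Advice", "short"),
    ("Person", "short"),
]


def length_distribution(pairs):
    """Return {response_type: {length: fraction}} from (type, length) pairs."""
    counts = defaultdict(Counter)
    for response_type, length in pairs:
        counts[response_type][length] += 1
    return {
        rtype: {length: n / sum(c.values()) for length, n in c.items()}
        for rtype, c in counts.items()
    }


dist = length_distribution(labels)
for rtype, shares in sorted(dist.items()):
    print(rtype, shares)
```

Comparing the per-category rows of `dist` is one simple way to see whether, for example, number queries skew shorter than advice queries.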
12Experiment 2 Research Question
- Can human judges correctly predict the preferred
result length?
13Experiment 2 Setup
- Experiment 1 produced 1,099 high-confidence queries (where all three turkers agreed on semantic category and length)
- For 170 of these, turkers manually created snippets from Wikipedia of different lengths
- Phrase
- Sentence
- Paragraph
- Section
- Article (in this case a link to the article was displayed)
- Note: Categories differ slightly from the first experiment
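The high-confidence filter (all three turkers agree on both semantic category and length) could be sketched as follows. The judgments, worker ids, and category names are hypothetical examples, and `high_confidence` is an illustrative helper, not the authors' code.

```python
from collections import defaultdict

# Hypothetical judgments: (query, worker_id, response_type, length).
# The labels are invented for illustration only.
judgments = [
    ("highest volcano", "w1", "Location", "short"),
    ("highest volcano", "w2", "Location", "short"),
    ("highest volcano", "w3", "Location", "short"),
    ("Hip Hop", "w1", "General Information", "long"),
    ("Hip Hop", "w2", "General Information", "long"),
    ("Hip Hop", "w3", "Organization", "long"),
]


def high_confidence(rows, n_judges=3):
    """Keep queries where all n_judges gave the same (type, length) label."""
    by_query = defaultdict(list)
    for query, _worker, rtype, length in rows:
        by_query[query].append((rtype, length))
    return {
        query: labels[0]
        for query, labels in by_query.items()
        if len(labels) == n_judges and len(set(labels)) == 1
    }


agreed = high_confidence(judgments)
print(agreed)
```

Here "Hip Hop" is dropped because one worker chose a different category, even though all three agreed on length.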
14Experiment 2 Setup
Manually created snippets from Wikipedia of
different lengths
15Experiment 2 Setup
- Displayed
- Instructions
- Query
- One response from one length category
- Rating scale
- Each HIT was shown to ten turkers.
16Experiment 2 Setup
Instructions: Below you see a search engine
query and a possible response. We would like you
to give us your opinion about the response. We
are especially interested in the length of the
response. Is it suitable for the query? Is there
too much or not enough information? Please rate
the response on a scale from 0 (very bad
response) to 10 (very good response).
18Experiment 2 Significance
Significance results of unweighted linear
regression on the data for the second experiment,
which was separated into four groups based on the
predicted preferred length.
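As a rough illustration of an unweighted (ordinary least squares) linear regression, the sketch below fits rating against length category for one group. It assumes that the length categories are coded 1 (Phrase) to 5 (Article) and that the ratings are on the 0 to 10 scale; the coding and the ratings are hypothetical, and the slide does not state how the authors set up their regression. No significance test is computed here.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary (unweighted) least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mean ratings for a group of queries predicted to prefer
# short responses; x is the assumed ordinal code of the length category.
lengths = [1, 2, 3, 4, 5]
ratings = [9, 8, 6, 5, 2]

slope, intercept = ols_fit(lengths, ratings)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A negative slope in such a group would mean that ratings fall as responses get longer, which is the direction one would expect if the predicted preference for short responses holds.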
19Experiment 2 Details
- 146 queries
- 5 length categories per query
- 10 judgments per query and length category
- 7,300 judgments in total (146 × 5 × 10)
- 124 judges
- 16 judges did more than 146 HITs
- 2 of these 16 were excluded (scammers)
- $0.01 per judgment
- $73 paid to judges, plus $73 Amazon fees
- $146 total for Experiment 2 (excluding snippet generation)
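The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes the amounts are US dollars and that the ten judgments apply to each query and length category pair; the Amazon fee is taken as reported on the slide rather than derived from a fee rate.

```python
queries = 146
length_categories = 5
judgments_per_hit = 10
reward_cents = 1  # 0.01 per judgment, kept in cents to avoid float rounding

hits = queries * length_categories             # one HIT per query and length
judgments = hits * judgments_per_hit           # total judgments collected
worker_pay_cents = judgments * reward_cents    # amount paid to judges
amazon_fee_cents = 7300                        # fee as reported on the slide
total_cents = worker_pay_cents + amazon_fee_cents

print(f"HITs: {hits}, judgments: {judgments}")
print(f"Paid to judges: ${worker_pay_cents / 100:.2f}")
print(f"Total: ${total_cents / 100:.2f}")
```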
20Experiment 2 Results
- Results
- Human judges can predict the preferred result
lengths (at least for a subset of especially
clear queries)
21Experiment 2 Results
- Results
- Human judges can predict the preferred result lengths (at least for a subset of especially clear queries)
- → Standard result listings are often too short (and sometimes too long)
22Outlook
- Can queries be automatically classified according to their predicted result length?
- Initial Experiment
- Unigram word counts
- 805 training queries, 286 test queries
- Three length bins (long, short, other)
- Weka NaiveBayesMultinomial
- Initial Result
- 78% of queries correctly classified
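To make the initial experiment concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier over unigram counts with add-one smoothing. It is not Weka's NaiveBayesMultinomial and not the authors' setup: the class `MultinomialNB`, the whitespace tokenizer, and the length labels attached to the example queries are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(query):
    """Lowercase and split on whitespace to get unigram tokens."""
    return query.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes on unigram counts with add-one smoothing."""

    def fit(self, queries, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for query, label in zip(queries, labels):
            self.word_counts[label].update(tokenize(query))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, query):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(query):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Example queries from the slides; the length labels are HYPOTHETICAL.
train = [
    ("date of next US election", "short"),
    ("highest volcano", "short"),
    ("A synonym for material", "short"),
    ("I want to make my own candles", "long"),
    ("What problems do federal regulations cause?", "long"),
    ("industrial music", "other"),
    ("Hip Hop", "other"),
]
model = MultinomialNB().fit([q for q, _ in train], [y for _, y in train])
print(model.predict("date of highest volcano"))
print(model.predict("I want to make candles"))
```

With only seven training queries this toy model says nothing about real accuracy; it only shows the mechanics of scoring a query by its unigram counts under each length bin.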
24MT Demographics - Age
Survey, data and graphs from Panos Ipeirotis' blog: http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-demographics.html
25MT Demographics - Gender
26MT Demographics - Education
27MT Demographics - Income
28MT Demographics - Purpose
30Excursus Mechanical Turk
Example HIT (not ours)