Title: Why We Search
1. Why We Search
- Eytan Adar
- University of Washington
- May 12, 2007
- Dan Weld, Brian Bershad, and Steve Gribble
2. Power in prediction
- Based on blogs, can we figure out which ad words to buy?
- Based on an event on TV, can we gauge the online response?
- What kinds of news events do groups respond to? How do they respond?
- Integrate other behavioral data
- Purchase habits
- Brand awareness
- Etc.
3. Power in prediction
- Can we understand what events impact/predict/correlate with online behavior?
- Who responds to an event?
- When do they respond?
- How much?
- Why do they respond?
- Attention as a resource
- Indicator for other investments
4.
- Daily lives
- Information side effects
- Attention
- searches, mentions, news, votes, etc.
5. Searches about news vs. blog posts about news
[Figure: both series plotted over time]
Predictive, correlated
6. Suntan lotion sales vs. sunshine
[Figure: both series plotted over time]
Predictive, causal
7. Agenda
- Transform text behavioral data to a more useful form
- Infrastructure to compare different behavioral data
- Analysis/visualization technique to compare behaviors over time
- Some observations
8. iraq war
X 15M (MSN logs), X 12.2M (AOL logs), May 06
As % of all queries (in that period)
Query Event Stream (QES)
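A minimal sketch of how a Query Event Stream could be built, assuming the log is available as (period, query) pairs and that the stream is the share of all queries in each period that match the phrase. The function name, the log layout, and the sample data are hypothetical and are not taken from the MSN or AOL logs.

```python
from collections import Counter

def query_event_stream(log, phrase):
    """Return {period: share of that period's queries equal to phrase}.

    `log` is an iterable of (period, query) pairs (assumed layout).
    """
    totals = Counter()  # all queries seen per period
    hits = Counter()    # queries matching the phrase per period
    for period, query in log:
        totals[period] += 1
        if query == phrase:
            hits[period] += 1
    # Normalizing by the period total keeps busy days from looking like spikes.
    return {p: hits[p] / totals[p] for p in sorted(totals)}

# Hypothetical toy log, for illustration only.
sample_log = [
    ("2006-05-01", "iraq war"), ("2006-05-01", "weather"),
    ("2006-05-02", "iraq war"), ("2006-05-02", "news"),
    ("2006-05-02", "maps"), ("2006-05-02", "weather"),
]
qes = query_event_stream(sample_log, "iraq war")
```

Dividing by the per-period total is what makes streams from logs of different sizes comparable.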
9. X 14M posts
Number of blog posts that mention the phrase
10. X 13K articles from CNN/BBC
Number of news articles that mention the phrase
Number of inlinks
11. X 2.5K shows (TV.com)
Number of episodes that mention the phrase
Number of votes
12. Phrases/Queries → Topics
- We want to know that britney spears is the same as
- spears britney or just
- britney
- Solution: look at clicks and results
- 1M queries from MSN logs that appear 2 times
- Overlapping clicks/result sets indicate relatedness of queries (similarity measure)
- Naïve clustering
- Query Event Stream (QES) → Topic Event Stream (TES)
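The click-overlap idea on this slide can be sketched as follows. This assumes Jaccard overlap as the similarity measure and single-link grouping with a fixed threshold as the naïve clustering. The slides do not say which measure or clustering rule was actually used, and the URLs, threshold, and function names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two sets of clicked URLs."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_queries(clicks, threshold=0.5):
    """Group queries whose click sets overlap at or above `threshold`.

    `clicks` maps query -> set of clicked URLs. Single-link grouping is
    done with a small union-find structure.
    """
    parent = {q: q for q in clicks}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]  # path halving
            q = parent[q]
        return q

    for q1, q2 in combinations(clicks, 2):
        if jaccard(clicks[q1], clicks[q2]) >= threshold:
            parent[find(q1)] = find(q2)

    topics = {}
    for q in clicks:
        topics.setdefault(find(q), set()).add(q)
    return list(topics.values())

# Hypothetical click data, for illustration only.
clicks = {
    "britney spears": {"britneyspears.example", "fansite.example"},
    "spears britney": {"britneyspears.example", "fansite.example"},
    "britney": {"britneyspears.example"},
    "iraq war": {"news.example/iraq"},
}
topics = cluster_queries(clicks, threshold=0.5)
```

Summing the Query Event Streams of all queries in one group would then give that topic's Topic Event Stream.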
13. Experimental Set
- We take the 3638 most frequent queries from MSN
- AOL 3627 (99%)
- BLOG 1975 (54%)
- NEWS 1704 (47%)
- TV 1602 (44%)
- Compare topic A in one set to topic A in another
- Limits spurious correlations
14. Correlations
- Do we even have a chance?
- Equivalent to convolution
- Try for some delay range, d, find max value
- Negative/Positive correlations
[Figure: correlation r plotted against delay d, with d = 0 marked]
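The delay search described on this slide can be sketched as below, assuming two equally sampled series of the same length and Pearson correlation as the measure. The function names and the toy series are hypothetical, and the sign convention (positive d means the second series lags the first) is a choice made here, not one stated in the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def best_delay(x, y, max_delay):
    """Return (d, r) maximizing the correlation between x[t] and y[t + d]."""
    best_d, best_r = 0, -2.0
    for d in range(-max_delay, max_delay + 1):
        if d >= 0:
            xs, ys = x[:len(x) - d], y[d:]
        else:
            xs, ys = x[-d:], y[:len(y) + d]
        if len(xs) < 2:
            continue  # too little overlap to correlate
        r = pearson(xs, ys)
        if r > best_r:
            best_d, best_r = d, r
    return best_d, best_r

# Toy series: y repeats x's spike two steps later.
x = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]
d, r = best_delay(x, y, max_delay=3)
```

To surface strong negative correlations as well, the comparison could be made on `abs(r)` instead of `r`.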
15. Delays (high correlation)
38% are at 0
16-22. Explorer
23. Max-correlation delay: 3 hours
[Figure: curves plotted over time]
Same correlation delays, but very different shapes
24. How do we compare these? Visual summary of differences?
25. [Figure: magnitude plotted against time]
26. Capture not just delay or difference, but specific behaviors
[Figure labels: peak, fall, rise, run]
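27. Dynamic Time Warping (DTW)
$$\mathrm{DTW}_{i,j} = \min\left(\mathrm{DTW}_{i-1,j} + \mathrm{cost},\; \mathrm{DTW}_{i,j-1} + \mathrm{cost},\; \mathrm{DTW}_{i-1,j-1} + \mathrm{cost}\right)$$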
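A minimal sketch of the DTW recurrence on this slide, assuming the cost of matching two points is the absolute difference of their values and that all three moves carry the same cost. The slides do not define the cost function, so both are assumptions, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between sequences a and b.

    Assumed point cost: |a[i] - b[j]|, applied equally to all three moves.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # dtw[i][j] is the best cost of aligning a[:i] with b[:j].
    dtw = [[inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dtw[i][j] = cost + min(
                dtw[i - 1][j],      # a advances, b stays
                dtw[i][j - 1],      # b advances, a stays
                dtw[i - 1][j - 1],  # both advance
            )
    return dtw[n][m]

# Same shape, shifted by one step: warping absorbs the shift.
shifted = dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])
# Same timing, different height: the magnitude gap remains.
taller = dtw_distance([0, 1], [0, 2])
```

A pure time shift costs nothing under this measure, which is why the warp path itself, rather than the final cost, is the interesting output when comparing delays.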
28. [Figure: Curve 1 and Reference Curve]
29. DTWRadar: Summary of differences between two time series
[Figure: Curve 1 and Reference Curve]
Curve 1 has bigger response on average
Curve 1 lags on average
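One way to reach statements like "lags on average" and "bigger response on average" is to average the time and magnitude offsets along the DTW warp path. The sketch below does that under stated assumptions: absolute difference as the point cost, and a plain mean over matched pairs. The slides do not spell out how DTWRadar itself is computed, so this is an illustration of the idea, and all names and data are hypothetical.

```python
def warp_path(curve, ref):
    """Return the DTW alignment as a list of (curve_index, ref_index) pairs."""
    n, m = len(curve), len(ref)
    inf = float("inf")
    dtw = [[inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(curve[i - 1] - ref[j - 1])  # assumed point cost
            dtw[i][j] = cost + min(dtw[i - 1][j - 1], dtw[i - 1][j], dtw[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (dtw[i - 1][j - 1], i - 1, j - 1),
            (dtw[i - 1][j], i - 1, j),
            (dtw[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

def radar_summary(curve, ref):
    """Mean (time offset, magnitude offset) of curve relative to ref.

    Positive time offset: curve lags ref. Positive magnitude offset:
    curve has the bigger response.
    """
    path = warp_path(curve, ref)
    time_offset = sum(i - j for i, j in path) / len(path)
    magnitude_offset = sum(curve[i] - ref[j] for i, j in path) / len(path)
    return time_offset, magnitude_offset

# Toy data: curve_1 peaks two steps later and higher than the reference.
reference = [0, 1, 4, 1, 0, 0, 0]
curve_1 = [0, 0, 0, 2, 6, 2, 0]
time_offset, magnitude_offset = radar_summary(curve_1, reference)
```

Plotting the pair (time_offset, magnitude_offset) as a single point on a two-axis chart gives a compact visual summary of how one curve differs from the other.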
30. [Figure: Reference Curve and Curve 1]
31. Explorer with Radar
32. Some Findings
- Randomly selected some topics and labeled them
- People, places, events, news, etc.
- So why do we search? Or blog? Or react to news?
33. 1) News of the Weird
- Bloggers pick up on weird stories first
- igor vovkovinskiy
- uss oriskany
[Figures: Blog and Search (MSN) curves for each example; curves normalized to max value for readability]
34. Blogs lead versus lag in the news
35. 2) Anticipated Events
- Pressure to be new
- Bloggers don't talk about anticipated events
- TV Shows
[Figure: Search (MSN) and Blog curves]
36. 3) Familiarity Breeds Contempt
- We get tired of certain kinds of news
- Takes a really big spike for us to get excited
- enron trial
[Figure: Search (MSN) and News curves]
37. 4) Correlation vs. Causation
- poseidon
- Both respond to the movie release, but one to marketing and one to satire
- Need other, more specific, data streams to infer causation
[Figure: TV and Search (MSN) curves]
38. 5) Influence of the portal
- mothers day
- Demographics?
- Hypothesis: what's on the front page drives search
- Portals present news stories and information
- Users react to that information
- Different portals → different searches
39. Summary
Unstructured Source Data
Conversion from a number of different data sources
Conversion / Data Cleaning
Time Series
Explorer, DTWRadar
Model Building
Models
Number of findings indicating the relationship of
data
Time Series Analysis Algorithms
Predictions