1Agglomerative Clustering of a Search Engine Query Log
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Monday, August 21, 2000
- Doug Beeferman, Lycos Research Group, Lycos, Inc., Waltham, MA USA
- Adam Berger, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA USA
2Talk overview
- Motivation
- Approach
- Algorithmic details
- Sample application: Related searches
- Experimental results
- Other Applications
3Motivation
- Group together similar Internet entities
- Search queries: rare books, magazines, bookstores
- URLs: http://www.amazon.com, http://www.bn.com
- Users: Doug Beeferman, Adam Berger
- Applications
- Recommendation systems
- Automatic taxonomies
- Automatic communities
4Approach
- Exploit search result click log data, which
bridges all three.
[Figure: user DB submits the query "books" to Lycos Search, receives the Search Results list (1. www.bn.com 2. www.amazon.com 3. www.salon.com), and selects bn.com.]
- DB searched for books and then went to bn.com: (DB, books, bn.com)
5Approach
- Accumulate click log over time

Time  User  Query string   Selected URL
1     <DB>  <books>        <http://www.bn.com>
2     <AB>  <magazines>    <http://www.amazon.com>
3     <DB>  <biographies>  <http://www.bn.com>
4     <AB>  <books>        <http://www.amazon.com>

- DB searched for books and then went to bn.com
- AB searched for magazines and then went to amazon.com
- DB searched for biographies and then went to bn.com
- AB searched for books and then went to amazon.com
6Approach
- Factor data into bipartite graphs
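The factoring step can be sketched as follows, using the four click records from the example log. This is a minimal illustration, not the authors' implementation: the function name `bipartite_graph` and the field-position convention (0 = user, 1 = query, 2 = URL) are assumptions made for this example.

```python
from collections import defaultdict

# Click records from the example log: (user, query, url).
CLICK_LOG = [
    ("DB", "books", "http://www.bn.com"),
    ("AB", "magazines", "http://www.amazon.com"),
    ("DB", "biographies", "http://www.bn.com"),
    ("AB", "books", "http://www.amazon.com"),
]


def bipartite_graph(records, left, right):
    """Project click records onto two of their three fields.

    `left` and `right` are field positions (hypothetical convention:
    0 = user, 1 = query, 2 = URL). Returns two adjacency maps, one per
    side, each mapping a node to the set of its neighbors on the other side.
    """
    left_adj = defaultdict(set)
    right_adj = defaultdict(set)
    for record in records:
        l, r = record[left], record[right]
        left_adj[l].add(r)
        right_adj[r].add(l)
    return dict(left_adj), dict(right_adj)


# The query/URL graph; the user/query and user/URL graphs use (0, 1) and (0, 2).
query_to_urls, url_to_queries = bipartite_graph(CLICK_LOG, 1, 2)
```

Here "books" ends up adjacent to both bn.com and amazon.com, which is what later lets it link "biographies" and "magazines".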
7Approach
- Choose two of the three entities, say queries and URLs.
- Apply agglomerative clustering algorithm to the nodes in their bipartite graph
- Merge the two most similar nodes in the left-hand-side
- Merge the two most similar nodes in the right-hand-side
- Repeat until some termination condition applies.
- At termination, queries are clustered on the left-hand-side, and URLs are clustered on the right-hand-side.
8Distance metric
- Similarity of two vertices in graph defined as
the number of neighbors in common, normalized by
the total number of unique neighbors
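The definition above is the Jaccard coefficient of the two neighbor sets. A minimal sketch, assuming each vertex is represented by the set of its neighbors; the function name `similarity` and the choice to score two neighborless vertices as 0 are assumptions of this example.

```python
def similarity(neighbors_a, neighbors_b):
    """Neighbors in common divided by the number of unique neighbors."""
    union = neighbors_a | neighbors_b
    if not union:
        # Assumption: two vertices with no neighbors at all score 0.
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


# Neighbor sets taken from the example click log.
books = {"http://www.bn.com", "http://www.amazon.com"}
magazines = {"http://www.amazon.com"}
biographies = {"http://www.bn.com"}

books_vs_magazines = similarity(books, magazines)          # 1 shared / 2 unique
magazines_vs_biographies = similarity(magazines, biographies)  # nothing shared
```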
9Approach
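- Agglomerative clustering in action (iterations 0, 1, 2, ..., n)
[Figure: bipartite graph with lettered nodes A, B, C, D, E, F, G, H on one side and numbered nodes 1, 2, 3, 4, 5, 6, 7 on the other; no merges yet.]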
10Approach
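- Agglomerative clustering in action (iterations 0, 1, 2, ..., n)
[Figure: same graph after merging C and F; lettered nodes A, B, {C, F}, D, E, G, H; numbered nodes 1, 2, 3, 4, 5, 6, 7.]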
11Approach
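- Agglomerative clustering in action (iterations 0, 1, 2, ..., n)
[Figure: same graph after also merging 3 and 5; lettered nodes A, B, {C, F}, D, E, G, H; numbered nodes 1, 2, {3, 5}, 4, 6, 7.]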
12Approach
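- Agglomerative clustering in action (iterations 0, 1, 2, ..., n)
[Figure: graph after further merges; lettered nodes {A, D, E, H}, B, {C, F}, G; numbered nodes 1, {2, 6, 3, 5}, 4, 7.]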
13Algorithm
- 1. Score all pairs of query nodes according to distance metric D.
- 2. Merge the two query nodes q1 and q2 for which the distance D(q1, q2) is minimal.
- 3. Score all pairs of URL nodes according to distance metric D.
- 4. Merge the two URL nodes u1 and u2 for which the distance D(u1, u2) is minimal.
- 5. Unless a termination condition applies, go to step 1.
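The five steps can be sketched in a naive form that rescores every pair on each pass. Several things here are assumptions of the sketch rather than details from the slides: D is taken to be 1 minus the neighbor-overlap similarity, the termination condition is "no pair on either side shares a neighbor" (plus an iteration cap), ties are broken by sort order, and the "lyrics" records are hypothetical data added to show two separate clusters.

```python
from itertools import combinations


def distance(x, y):
    """Assumed D: 1 minus (shared neighbors / unique neighbors)."""
    union = x | y
    return 1.0 - (len(x & y) / len(union) if union else 0.0)


def closest_pair(adj):
    """Steps 1 and 3: score all pairs, return (distance, a, b) or None."""
    best = None
    for a, b in combinations(sorted(adj, key=sorted), 2):
        d = distance(adj[a], adj[b])
        if best is None or d < best[0]:
            best = (d, a, b)
    return best


def merge(adj, other_adj, a, b):
    """Steps 2 and 4: merge a and b, rewiring their neighbors' edges."""
    merged = a | b
    neighbors = adj.pop(a) | adj.pop(b)
    adj[merged] = neighbors
    for n in neighbors:
        other_adj[n] -= {a, b}
        other_adj[n].add(merged)


def cluster(edges, max_iterations=1000):
    """Alternate query-side and URL-side merges over (query, url) edges."""
    queries, urls = {}, {}
    for q, u in edges:
        qn, un = frozenset([q]), frozenset([u])
        queries.setdefault(qn, set()).add(un)
        urls.setdefault(un, set()).add(qn)
    for _ in range(max_iterations):
        merged_any = False
        for adj, other in ((queries, urls), (urls, queries)):
            best = closest_pair(adj)
            if best is not None and best[0] < 1.0:
                merge(adj, other, best[1], best[2])
                merged_any = True
        if not merged_any:  # step 5: assumed termination condition
            break
    return sorted(sorted(c) for c in queries), sorted(sorted(c) for c in urls)


EDGES = [
    ("books", "http://www.bn.com"),
    ("magazines", "http://www.amazon.com"),
    ("biographies", "http://www.bn.com"),
    ("books", "http://www.amazon.com"),
    ("lyrics", "http://lyrics.example"),       # hypothetical records
    ("song lyrics", "http://lyrics.example"),  # hypothetical records
]
query_clusters, url_clusters = cluster(EDGES)
```

Note that "magazines" and "biographies" share no URL initially; they end up together only because the URL-side merge of amazon.com and bn.com gives them a common neighbor, which is the point of alternating between the two sides.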
14Algorithm complexity
- Naïve implementation requires a distance metric computation per pair of nodes, or Θ(Q^2 + U^2) time per iteration (Q query nodes, U URL nodes).
- But only the neighbors of the affected nodes change during a merge, so the per-iteration cost is proportional to the maximum degree of any node.
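One way to picture the locality argument is to enumerate which pairs can have a changed score after a merge. The sketch below is an illustration under assumptions, not the authors' bookkeeping: `pairs_to_rescore` is a hypothetical helper, nodes are plain strings, and pairs that share no neighbor are ignored because their overlap is zero before and after.

```python
def pairs_to_rescore(adj, other_adj, merged):
    """Pairs whose score may differ after `merged` was created in `adj`.

    adj maps nodes on the merged node's side to their neighbor sets;
    other_adj is the opposite side. Only nodes within two hops of
    `merged` are visited, so untouched regions of the graph cost nothing.
    """
    same_side = set()
    other_side = set()
    for n in adj[merged]:
        for x in other_adj[n]:
            if x != merged:
                # x shares neighbor n with the merged node.
                same_side.add(frozenset((merged, x)))
            for y in adj[x]:
                if y != n:
                    # n's neighbor set changed, so its pairs need rescoring.
                    other_side.add(frozenset((n, y)))
    return same_side, other_side


# Small hypothetical graph: "m" is a freshly merged query node.
adj = {"m": {"u1", "u2"}, "x": {"u2", "u3"}, "z": {"u4"}}
other_adj = {"u1": {"m"}, "u2": {"m", "x"}, "u3": {"x"}, "u4": {"z"}}
same_side, other_side = pairs_to_rescore(adj, other_adj, "m")
```

In this example the node "z" and its URL "u4" never appear in the output, so their cached scores can be reused.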
15Application: Related searches
- Lycos.com related searches feature
- Baseline algorithm based on past audience overlap of individual query pairs
- Data-intensive; poor coverage of rare queries
16Application: Related searches
[Figure: Query, Suggestion]
- Current approach
- (Query, Suggestion) pairs must be evidenced explicitly in data.
- Alternate approach using query clusters
- Train a set of query clusters using click log data
- For an input query q, identify its cluster Q
- Output the top members of Q as suggestions
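The lookup side of the cluster-based approach can be sketched as below. The slides do not say how the "top" members of a cluster are chosen; ranking by a per-query popularity count is an assumption of this example, as are the names `build_cluster_index` and `suggest` and the sample data.

```python
def build_cluster_index(clusters):
    """Map each query to the cluster (list of queries) that contains it."""
    index = {}
    for members in clusters:
        for query in members:
            index[query] = members
    return index


def suggest(query, index, popularity, k=3):
    """Return up to k other members of the query's cluster.

    Assumption: "top" means most popular, with ties broken alphabetically.
    Unknown queries get no suggestions.
    """
    members = index.get(query, ())
    others = [q for q in members if q != query]
    others.sort(key=lambda q: (-popularity.get(q, 0), q))
    return others[:k]


# Hypothetical trained clusters and query counts.
clusters = [["books", "biographies", "magazines"], ["lyrics", "song lyrics"]]
popularity = {"books": 2, "biographies": 1, "magazines": 1}
index = build_cluster_index(clusters)
book_suggestions = suggest("books", index, popularity)
```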
17Application: Related searches
- Click log data
- About 500,000 click records from a portion of a day on Lycos.com
- Pornographic queries filtered out
- Queries canonicalized
- Downcased
- Whitespace collapsed
- No URL canonicalization
- 243,000 unique queries and 362,000 unique URLs remained
- Ran agglomerative clustering algorithm for 100,000 iterations
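The two query canonicalization steps listed above (downcasing and whitespace collapsing) can be sketched in one line of Python. The function name `canonicalize_query` is hypothetical, and treating leading and trailing whitespace as collapsible is an assumption.

```python
def canonicalize_query(query):
    """Lowercase the query and collapse runs of whitespace to one space."""
    return " ".join(query.lower().split())


raw_queries = ["Rare   Books", "  rare books ", "RARE\tBOOKS"]
unique_queries = {canonicalize_query(q) for q in raw_queries}
```

All three raw strings map to the same canonical query, so they count as a single node in the graph.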
18Examples of query clusters
casinos las vegas online casinos las vegas strip
lyrics guitar tabs tabs song lyrics
disney vacations disney world tickets
irs irs.gov internal revenue service forms
movies hollywood hbo bad movies
American airlines aadvantage aa.com
fitness muscle fitness magazines
19Related Searches: Experiments
- 1. Baseline
- 2. Full-replacement: Draw all suggestions for an input query from its cluster
- 3. Hybrid: Replace only weak suggestions
- Impact of experiments measured by clickthrough rate of the related searches feature
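The hybrid strategy can be sketched as follows. The slides do not define what makes a suggestion "weak"; this example assumes each baseline suggestion carries a score and that anything below a threshold is dropped and backfilled from the cluster. The function name, the threshold, and the sample data are all hypothetical.

```python
def hybrid_suggestions(baseline, cluster_suggestions, min_score, k=3):
    """Keep strong baseline suggestions, fill remaining slots from the cluster.

    baseline: list of (suggestion, score) pairs, best first.
    cluster_suggestions: cluster-derived suggestions, best first.
    Assumption: a baseline suggestion is "weak" if its score < min_score.
    """
    result = [s for s, score in baseline if score >= min_score][:k]
    for s in cluster_suggestions:
        if len(result) >= k:
            break
        if s not in result:
            result.append(s)
    return result


# Hypothetical inputs for the query "books".
baseline = [("bookstores", 0.9), ("boots", 0.1)]
from_cluster = ["biographies", "magazines"]
hybrid = hybrid_suggestions(baseline, from_cluster, min_score=0.5)
```

Setting `min_score` above every baseline score reproduces the full-replacement experiment, and setting it to 0 with enough baseline suggestions reproduces the baseline.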
20Related Searches: Results

Experiment           Clickthrough rate
1. Baseline          1.16
2. Full-replacement  1.03
3. Hybrid            1.31
21Related Searches: Conclusions
- Full-replacement strategy (preferring cluster-derived suggestions) is inferior to baseline
- Hybrid strategy is superior to both, overcoming data sparseness problems inherent in baseline query pair model.
22Other applications
- Cluster URLs with respect to queries
- Cluster URLs with respect to users
- Cluster users with respect to URLs, queries
- All these treat documents as mere URLs. Such content-ignorance has its advantages:
- Faster: less data manipulation
- Handles Web pages that lack text
- Handles Web pages that are restricted access
- Handles Web pages that are dynamic