1
Agglomerative Clustering of a Search Engine Query Log
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Monday, August 21, 2000
  • Doug Beeferman
  • Lycos Research Group
  • Lycos, Inc.
  • Waltham, MA USA

  • Adam Berger
  • School of Computer Science
  • Carnegie Mellon University
  • Pittsburgh, PA USA
2
Talk overview
  • Motivation
  • Approach
  • Algorithmic details
  • Sample application: Related searches
  • Experimental results
  • Other Applications

3
Motivation
  • Group together similar Internet entities
  • Search queries: rare books, magazines,
    bookstores
  • URLs: http://www.amazon.com, http://www.bn.com
  • Users: Doug Beeferman, Adam Berger
  • Applications
  • Recommendation systems
  • Automatic taxonomies
  • Automatic communities

4
Approach
  • Exploit search result click log data, which
    bridges all three.

[Figure: user DB enters the query "books" on Lycos Search; the Search Results page lists 1. www.bn.com, 2. www.amazon.com, 3. www.salon.com; DB selects bn.com.]
DB searched for books and then went to bn.com: (DB, books, bn.com)
5
Approach
  • Accumulate click log over time

    Time  User  Query string   Selected URL
    1     <DB>  <books>        <http://www.bn.com>
    2     <AB>  <magazines>    <http://www.amazon.com>
    3     <DB>  <biographies>  <http://www.bn.com>
    4     <AB>  <books>        <http://www.amazon.com>

  • DB searched for books and then went to bn.com
  • AB searched for magazines and then went to
    amazon.com
  • DB searched for biographies and then went to
    bn.com
  • AB searched for books and then went to
    amazon.com

6
Approach
  • Factor data into bipartite graphs
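
As a minimal sketch of this factoring step, the snippet below projects the four example click records onto two of their three fields and builds adjacency sets for both sides of the resulting bipartite graph. The function name `bipartite_graph` and the positional field encoding are assumptions made for illustration, not part of the talk.

```python
from collections import defaultdict

# The four click records from the example log: (user, query, selected URL).
click_log = [
    ("DB", "books", "http://www.bn.com"),
    ("AB", "magazines", "http://www.amazon.com"),
    ("DB", "biographies", "http://www.bn.com"),
    ("AB", "books", "http://www.amazon.com"),
]


def bipartite_graph(records, left, right):
    """Project records onto two fields and return adjacency sets for both sides.

    `left` and `right` are field positions (0 = user, 1 = query, 2 = URL).
    """
    left_adj = defaultdict(set)
    right_adj = defaultdict(set)
    for record in records:
        left_adj[record[left]].add(record[right])
        right_adj[record[right]].add(record[left])
    return dict(left_adj), dict(right_adj)


# Query/URL graph: each query points at the URLs selected for it, and vice versa.
query_to_urls, url_to_queries = bipartite_graph(click_log, left=1, right=2)
print(sorted(query_to_urls["books"]))
print(sorted(url_to_queries["http://www.bn.com"]))
```

Choosing other field positions, for example `left=0, right=1`, would give the user/query graph in the same way.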

7
Approach
  • Choose two of the three entities, say queries and
    URLs.
  • Apply agglomerative clustering algorithm to the
    nodes in their bipartite graph:
  • Merge the two most similar nodes in the
    left-hand-side
  • Merge the two most similar nodes in the
    right-hand-side
  • Repeat until some termination condition applies.
  • At termination, queries are clustered on the
    left-hand-side, and URLs are clustered on the
    right-hand-side.

8
Distance metric
  • Similarity of two vertices in graph defined as
    the number of neighbors in common, normalized by
    the total number of unique neighbors
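
The definition above can be sketched as a small function over neighbor sets (it is the Jaccard coefficient of the two sets). Two points are assumptions rather than statements from the talk: two nodes with no neighbors at all are given similarity 0, and a distance is derived as 1 minus the similarity so that "most similar" corresponds to "smallest distance".

```python
def similarity(neighbors_a, neighbors_b):
    """Number of shared neighbors divided by the number of distinct neighbors."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0  # assumption: nodes without any neighbors are not similar
    return len(neighbors_a & neighbors_b) / len(union)


def distance(neighbors_a, neighbors_b):
    # Assumption: distance is taken as 1 - similarity.
    return 1.0 - similarity(neighbors_a, neighbors_b)


# Neighbor sets of three queries, taken from the example click log.
books = {"bn.com", "amazon.com"}
magazines = {"amazon.com"}
biographies = {"bn.com"}

print(similarity(books, magazines))        # 0.5
print(similarity(magazines, biographies))  # 0.0
```

Here "books" and "magazines" share one of two distinct URLs, while "magazines" and "biographies" share none.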

9
Approach
  • Agglomerative clustering in action (iterations 0, 1, 2, ..., n)

[Figure: bipartite graph. Lettered nodes: A, B, C, D, E, F, G, H. Numbered nodes: 1, 2, 3, 4, 5, 6, 7.]
10
Approach
  • Agglomerative clustering in action (iterations 0, 1, 2, ..., n)

[Figure: bipartite graph. Lettered nodes: A, B, {C, F}, D, E, G, H. Numbered nodes: 1, 2, 3, 4, 5, 6, 7.]
11
Approach
  • Agglomerative clustering in action (iterations 0, 1, 2, ..., n)

[Figure: bipartite graph. Lettered nodes: A, B, {C, F}, D, E, G, H. Numbered nodes: 1, 2, {3, 5}, 4, 6, 7.]
12
Approach
  • Agglomerative clustering in action (iterations 0, 1, 2, ..., n)

[Figure: bipartite graph. Lettered nodes: {A, D, E, H}, B, {C, F}, G. Numbered nodes: 1, {2, 6, 3, 5}, 4, 7.]
13
Algorithm
  • 1. Score all pairs of query nodes according to
    distance metric D.
  • 2. Merge the two query nodes q1 and q2 for which
    the distance D(q1, q2) is minimal.
  • 3. Score all pairs of URL nodes according to
    distance metric D.
  • 4. Merge the two URL nodes u1 and u2 for which
    the distance D(u1, u2) is minimal.
  • 5. Unless a termination condition applies, go to
    step 1.
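
The five steps above can be sketched as a naive loop that rescores every pair on each pass. This is an illustration under stated assumptions, not the authors' implementation: similarity is shared neighbors over distinct neighbors (so the minimal-distance pair is the maximal-similarity pair), cluster nodes are represented as frozensets of their original members, and the loop stops when no pair on either side shares a neighbor or when a hypothetical `max_iterations` budget runs out.

```python
from itertools import combinations


def similarity(a, b):
    """Shared neighbors divided by distinct neighbors."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_most_similar(side, other):
    """Merge the most similar pair of nodes in `side` (steps 1-2 or 3-4).

    `side` and `other` map each node to its set of neighbors on the
    opposite side. Returns False when no pair shares a neighbor.
    """
    best = max(combinations(side, 2),
               key=lambda pair: similarity(side[pair[0]], side[pair[1]]),
               default=None)
    if best is None or similarity(side[best[0]], side[best[1]]) == 0.0:
        return False
    a, b = best
    merged = a | b
    side[merged] = side.pop(a) | side.pop(b)
    for neighbor in side[merged]:
        # Re-point the opposite side at the merged node.
        other[neighbor] = (other[neighbor] - {a, b}) | {merged}
    return True


def cluster(edges, max_iterations):
    """Alternate query merges and URL merges (step 5 loops back to step 1)."""
    queries, urls = {}, {}
    for query, url in edges:
        q, u = frozenset([query]), frozenset([url])
        queries.setdefault(q, set()).add(u)
        urls.setdefault(u, set()).add(q)
    for _ in range(max_iterations):
        merged_queries = merge_most_similar(queries, urls)
        merged_urls = merge_most_similar(urls, queries)
        # Assumed termination condition: nothing left that shares a neighbor.
        if not (merged_queries or merged_urls):
            break
    return set(queries), set(urls)


edges = [
    ("books", "bn.com"), ("magazines", "amazon.com"),
    ("biographies", "bn.com"), ("books", "amazon.com"),
    ("cars", "autos.com"),  # hypothetical record, unrelated to the others
]
query_clusters, url_clusters = cluster(edges, max_iterations=10)
for members in sorted(sorted(c) for c in query_clusters):
    print(members)
```

Because merging queries changes the neighbor sets of the URLs they point at (and vice versa), alternating the two sides lets each merge inform the next one on the opposite side.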

14
Algorithm complexity
  • Naïve implementation requires a distance metric
    computation per pair of nodes, or on the order of
    |Q|^2 + |U|^2 time per iteration, where |Q| and |U|
    are the numbers of query and URL nodes.
  • But only the neighbors of the affected nodes
    change during a merge, so the per-iteration cost
    is proportional to the maximum degree of any node.
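
One part of that bookkeeping can be sketched as follows: after a merge, the only same-side nodes that can have nonzero similarity to the merged node are those reachable in two hops, so they can be found by walking the merged node's neighbors instead of scanning every pair. The function name `nodes_to_rescore` and the sample state are hypothetical. The opposite-side neighbors of the merged node also have changed neighbor sets and would need rescoring in a full implementation, which this sketch leaves out.

```python
def nodes_to_rescore(merged, side, other):
    """Nodes on the merged node's side that share at least one neighbor with it."""
    candidates = set()
    for neighbor in side[merged]:
        candidates |= other[neighbor]
    candidates.discard(merged)
    return candidates


# Hypothetical state after merging the queries "books" and "magazines".
queries = {
    "books+magazines": {"bn.com", "amazon.com"},
    "biographies": {"bn.com"},
    "cars": {"autos.com"},
}
urls = {
    "bn.com": {"books+magazines", "biographies"},
    "amazon.com": {"books+magazines"},
    "autos.com": {"cars"},
}

print(nodes_to_rescore("books+magazines", queries, urls))  # {'biographies'}
```

The work done here is bounded by the degrees of the nodes visited rather than by the total number of node pairs.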

15
Application: Related searches
  • Lycos.com related searches feature
  • Baseline algorithm based on past audience overlap
    of individual query pairs
  • Data-intensive; poor coverage of rare queries

16
Application: Related searches
[Figure labels: Query, Suggestion]
  • Current approach
  • (Query, Suggestion) pairs must be evidenced
    explicitly in data.
  • Alternate approach using query clusters
  • Train a set of query clusters using click log
    data
  • For an input query q, identify its cluster Q
  • Output the top members of Q as suggestions
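
A minimal sketch of the cluster-based lookup follows. The talk does not say how the "top" members of a cluster are chosen, so ranking by query frequency is an assumption here, as are the function names, the example cluster, and the frequency counts.

```python
def build_index(clusters):
    """Map each query to the cluster (a list of queries) that contains it."""
    return {query: cluster for cluster in clusters for query in cluster}


def suggest(query, index, frequency, k=3):
    """Return up to k other members of the input query's cluster."""
    cluster = index.get(query, [])
    candidates = [q for q in cluster if q != query]
    # Assumption: "top" means most frequent; ties are broken alphabetically.
    return sorted(candidates, key=lambda q: (-frequency.get(q, 0), q))[:k]


# Hypothetical trained cluster and hypothetical query counts.
clusters = [["rare books", "magazines", "bookstores"]]
frequency = {"rare books": 5, "magazines": 40, "bookstores": 12}

index = build_index(clusters)
print(suggest("rare books", index, frequency))  # ['magazines', 'bookstores']
print(suggest("unseen query", index, frequency))  # []
```

Unlike the pair-based approach, a suggestion only needs to share a cluster with the input query; the pair itself never has to appear in the data.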

17
Application: Related searches
  • Click log data
  • About 500,000 click records from a portion of a
    day on Lycos.com
  • Pornographic queries filtered out
  • Queries canonicalized:
  • Downcased
  • Whitespace collapsed
  • No URL canonicalization
  • 243,000 unique queries and 362,000 unique URLs
    remained
  • Ran agglomerative clustering algorithm for
    100,000 iterations
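
The query canonicalization listed above (downcasing and collapsing whitespace) can be sketched in one line. Trimming leading and trailing whitespace is an assumption; the talk only mentions collapsing.

```python
def canonicalize(query):
    """Lowercase a query and collapse runs of whitespace into single spaces."""
    return " ".join(query.lower().split())


print(canonicalize("  Las   Vegas\tStrip "))  # las vegas strip
```

With this step, differently typed variants of the same query map to a single node in the graph.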

18
Examples of query clusters
casinos las vegas online casinos las vegas strip
lyrics guitar tabs tabs song lyrics
disney vacations disney world tickets
irs irs.gov internal revenue service forms
movies hollywood hbo bad movies
American airlines aadvantage aa.com
fitness muscle fitness magazines
19
Related Searches: Experiments
  • 1. Baseline
  • 2. Full-replacement: Draw all suggestions for an
    input query from its cluster
  • 3. Hybrid: Replace only weak suggestions
  • Impact of experiments measured by clickthrough
    rate of the related searches feature

20
Related Searches: Results
  • 1. Baseline: 1.16
  • 2. Full-replacement: 1.03
  • 3. Hybrid: 1.31

[Figure: clickthrough rate by experiment]
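
To put the three clickthrough rates in proportion, the short calculation below expresses each one as a relative change against the baseline. The unit of the reported rates is not stated in the transcript; the relative change does not depend on it.

```python
rates = {"Baseline": 1.16, "Full-replacement": 1.03, "Hybrid": 1.31}


def relative_change(rate, baseline):
    """Percentage change of `rate` relative to `baseline`."""
    return 100 * (rate - baseline) / baseline


for name, rate in rates.items():
    print(f"{name}: {relative_change(rate, rates['Baseline']):+.1f}%")
```

This works out to roughly 11% below the baseline for full replacement and roughly 13% above it for the hybrid.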
21
Related Searches: Conclusions
  • Full-replacement strategy (preferring
    cluster-derived suggestions) is inferior to
    baseline
  • Hybrid strategy is superior to both, overcoming
    data sparseness problems inherent in baseline
    query pair model.

22
Other applications
  • Cluster URLs with respect to queries
  • Cluster URLs with respect to users
  • Cluster users with respect to URLs, queries
  • All these treat documents as mere URLs. Such
    content-ignorance has its advantages:
  • Faster: less data manipulation
  • Handles Web pages that lack text
  • Handles Web pages with restricted access
  • Handles Web pages that are dynamic