Topical Query Decomposition - PowerPoint PPT Presentation

Topical Query Decomposition Francesco Bonchi Carlos Castillo Debora Donato Aristides Gionis Yahoo! Research Barcelona, Spain KDD 08 – PowerPoint PPT presentation

1
Topical Query Decomposition
Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis
Yahoo! Research, Barcelona, Spain
KDD 08
2
Abstract

Given a query and a document retrieval system, the goal is to
produce a small set of queries whose result documents, taken
together, approximately match those of the original query.
Two formulations:
Set cover problem, solved with a greedy algorithm
Clustering problem, solved with a two-phase algorithm based on
hierarchical agglomerative clustering and dynamic programming

3
Introduction

A query log L
A list of pairs <q, D(q)>
q: a query
D(q): its result, the set of documents that answer query q
Q(q): the maximal set of queries pi such that each D(pi) has
at least one document in common with D(q)
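
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the query log fits in memory as a dict from query string to a set of document ids. The function name candidate_queries and the toy log are hypothetical, and leaving q itself out of Q(q) is an assumption.

```python
def candidate_queries(log, q):
    """Return Q(q): every other logged query whose results overlap D(q).

    log maps each query string to D(query), a set of document ids.
    Excluding q itself is an assumption of this sketch.
    """
    d_q = log[q]
    return {p for p, d_p in log.items() if p != q and d_p & d_q}


# Toy query log (illustrative data only): query -> D(query)
log = {
    "jaguar": {1, 2, 3, 4},
    "jaguar car": {1, 2, 9},
    "jaguar animal": {3, 4},
    "python": {7, 8},
}
q_set = candidate_queries(log, "jaguar")
```

Here "python" is left out of q_set because its results share no document with D("jaguar").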

4
(No Transcript)
5

The goal is to compute a cover.
Select a subcollection C ⊆ Q(q7) that covers almost all of
D(q7)

6
Problem Statement 1/3

Red-Blue set cover problem
U = {b1, ..., bn, r1, ..., rm} (for a query q)
B = {b1, ..., bn} (i.e., document set)
R = {r1, ..., rm} (i.e., query set)
S = {S1, ..., Sk} is provided from L (query log L)
Si ⊆ U
SiB: blue points in Si (SiB = Si ∩ B)
SiR: red points in Si (SiR = Si ∩ R)
Goal: to find a subcollection C ⊆ S that covers many blue
points of U without covering too many red points.
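
A sketch of how such an instance might be built from a query log. It assumes one particular reading of the slide: blue points are the documents of D(q), and red points are documents returned by some candidate query but not by q. The function name red_blue_instance and the toy data are hypothetical.

```python
def red_blue_instance(log, q):
    """Build (blue, red, split) for query q.

    Assumed reading: blue = D(q); red = documents returned by a
    candidate query but not by q. split maps each candidate query
    to its pair (blue part, red part).
    """
    blue = set(log[q])
    # Candidate sets: results of other queries that overlap D(q).
    sets = {p: set(d) for p, d in log.items() if p != q and d & blue}
    universe = blue.union(*sets.values())
    red = universe - blue
    split = {p: (s & blue, s & red) for p, s in sets.items()}
    return blue, red, split


# Toy query log (illustrative data only)
log = {
    "jaguar": {1, 2, 3, 4},
    "jaguar car": {1, 2, 9},
    "jaguar animal": {3, 4},
    "python": {7, 8},
}
blue, red, split = red_blue_instance(log, "jaguar")
```

In this toy instance document 9 is the only red point, and it belongs to the set of "jaguar car".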

7
Problem Statement 2/3

For each query q, the candidate queries Q(q)
For each set Si, with its blue and red points, the weight is
its scatter sc(Si) (coherence is the opposite of scatter)

8
Problem Statement 3/3

Our goal is to find a subcollection C ⊆ S that covers almost
all the blue points of U and has large coherence.
More precisely, we want C to satisfy the following
properties:
Cover-blue
Not-cover-red
Small-overlap
Coherence
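
The four properties can be made concrete by measuring a candidate cover. The sketch below is only one possible way to quantify them, since the transcript gives no formulas: the measures, the function name cover_report, and the precomputed per-set scatter values are all assumptions. Any point outside the blue set is treated as red.

```python
def cover_report(cover, blue, scatter):
    """Summarise a candidate cover against the four properties.

    cover:   dict name -> set of points in that set
    blue:    set of blue points
    scatter: dict name -> scatter value (assumed precomputed)
    """
    covered = set().union(*cover.values())
    total_size = sum(len(s) for s in cover.values())
    return {
        # Cover-blue: fraction of blue points covered
        "blue_covered": len(covered & blue) / len(blue),
        # Not-cover-red: number of non-blue points covered
        "red_covered": len(covered - blue),
        # Small-overlap: how many times points are covered again
        "overlap": total_size - len(covered),
        # Coherence: lower total scatter means higher coherence
        "scatter": sum(scatter[name] for name in cover),
    }


# Illustrative data only
report = cover_report(
    {"a": {1, 2, 9}, "b": {2, 3}},
    blue={1, 2, 3, 4},
    scatter={"a": 0.5, "b": 0.25},
)
```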

9
Greedy Algorithm 1/2

At the i-th iteration, select the set S that minimizes
s(S, VB, VR)
λC, λR, λO are parameters that weight the relative
importance of the three terms.
VB: blue points already covered in previous iterations
VR: red points already covered in previous iterations
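
The score function itself did not survive in the transcript, so the sketch below uses an assumed form: a weighted sum of the set's scatter, its newly covered red points, and its overlap with already covered blue points, divided by the number of newly covered blue points. The names greedy_cover, lam_c, lam_r, lam_o and target are hypothetical, and the scatter values are assumed to be precomputed.

```python
def greedy_cover(sets, blue, scatter, lam_c=1.0, lam_r=1.0, lam_o=1.0,
                 target=1.0):
    """Greedy selection under an ASSUMED score function.

    sets:    dict name -> set of points
    scatter: dict name -> scatter value (assumed precomputed)
    target:  fraction of blue points to cover before stopping
    """
    v_b, v_r = set(), set()  # blue / red points covered so far
    chosen = []
    remaining = dict(sets)
    while len(v_b) < target * len(blue) and remaining:
        best, best_score = None, float("inf")
        for name, s in remaining.items():
            s_b, s_r = s & blue, s - blue
            new_blue = s_b - v_b
            if not new_blue:
                continue  # this set adds no blue coverage
            # Assumed score: cost per newly covered blue point.
            score = (lam_c * scatter[name]
                     + lam_r * len(s_r - v_r)
                     + lam_o * len(s_b & v_b)) / len(new_blue)
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining set covers a new blue point
        s = remaining.pop(best)
        chosen.append(best)
        v_b |= s & blue
        v_r |= s - blue
    return chosen, v_b, v_r


# Illustrative data only
sets = {"car": {1, 2, 9}, "animal": {3, 4}, "all": {1, 2, 3, 4, 10, 11, 12}}
scatter = {"car": 0.2, "animal": 0.1, "all": 2.0}
chosen, v_b, v_r = greedy_cover(sets, {1, 2, 3, 4}, scatter)
```

With these toy values the broad set "all" is never picked: it covers every blue point but pays for three red points and a high scatter.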

D. Peleg. Approximation algorithms for the
Label-CoverMAX and Red-Blue Set Cover problems.
Journal of Discrete Algorithms, 2007
10
Greedy Algorithm 2/2
11
Integer Programming

SiS2.Sl < 10
Si < 1

12
Clustering-Based Method

Two-phase approach
First phase: all points in set B are clustered using a
hierarchical agglomerative clustering algorithm (CLUTO
toolkit).
Second phase: match the clusters of the hierarchy produced by
the agglomerative algorithm with the sets of S.
The main idea is to match the sets of S to the clusters of G
Every node T ∈ G corresponds to a cluster
TB: the set of points of B in that cluster

13
Clustering-Based Method
Dendrogram G
14
Clustering-Based Method - Dynamic Programming 1/2

Complete Coverage
For each set S in the collection S and each node T ∈ G:
matching score m(T, S)
m(T): the score of the best-matching set in S
Optimal cost of covering the points of TB with
sets in S.

15
Clustering-Based Method - Dynamic Programming 2/2

Partial Coverage
λU weights the relative importance of the two terms: the
scatter cost of the selected sets S and the number of
uncovered points.
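
The recurrences were not transcribed, so the following is only a sketch of how a dynamic program over the dendrogram could trade scatter against uncovered points. Everything specific here is an assumption: the cost of matching a set to a cluster (its scatter plus lam_u per point of the cluster it misses), the tree encoding (a leaf is a point id that is not a tuple, an internal node is a pair), and the names leaves, best_cover and lam_u.

```python
def leaves(node):
    """Points of B under a node (leaf = point id, internal = (left, right))."""
    if isinstance(node, tuple):
        return leaves(node[0]) | leaves(node[1])
    return {node}


def best_cover(node, sets, scatter, lam_u):
    """Return (cost, chosen set names) for the cluster rooted at node.

    Simplification: the same set may be picked in more than one
    branch; a fuller version would deduplicate.
    """
    t_b = leaves(node)
    # Option 1: leave the whole cluster uncovered.
    best = (lam_u * len(t_b), [])
    # Option 2: match one set to this cluster, paying for misses.
    for name, s in sets.items():
        cost = scatter[name] + lam_u * len(t_b - s)
        if cost < best[0]:
            best = (cost, [name])
    # Option 3: split and solve the two children independently.
    if isinstance(node, tuple):
        left_cost, left_sets = best_cover(node[0], sets, scatter, lam_u)
        right_cost, right_sets = best_cover(node[1], sets, scatter, lam_u)
        if left_cost + right_cost < best[0]:
            best = (left_cost + right_cost, left_sets + right_sets)
    return best


# Illustrative data only
dendrogram = ((1, 2), (3, 4))
sets = {"car": {1, 2, 9}, "animal": {3, 4}, "all": {1, 2, 3, 4, 10}}
scatter = {"car": 0.2, "animal": 0.1, "all": 2.0}
cost, chosen = best_cover(dendrogram, sets, scatter, lam_u=1.0)
cheap_cost, cheap_chosen = best_cover(dendrogram, sets, scatter, lam_u=0.01)
```

With lam_u=1.0 the two tight sets win over the single broad one. With lam_u=0.01 leaving points uncovered is cheaper than any set, so nothing is chosen, which shows how the weight controls partial coverage.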

16
Application

Query log L: 2.9 million distinct queries
Most users look only at the first page of results; few
request further result pages.
D(q): the result documents of q up to the deepest result
page that any user asking for q in the query log navigated
to
24 million distinct documents seen by the users

17
Application - Candidate queries for the cover