Title: Online Clustering of Result Set for Adaptive Search
1. Online Clustering of Result Set for Adaptive Search
2. What Does it Mean to Do Adaptive Search?
- Different users may be looking for different information when issuing the same query string.
- Even the same user may have different information needs about the same topic at different points in time.
- Can we adapt the search algorithm to capture these differences and serve these different information needs?
3. How Do We Plan to Achieve Adaptation?
- The two main variables in tailored information retrieval are the identity of the user and the time the query was issued.
- Time is an indicator of the state the user is in (i.e. the state of her information need) and of the amount/diversity of information available.
- Our approach:
  - Learn user behavior models to capture individual needs (covered in our previous work).
  - Group the documents at query time to capture the current state of available (and accessible) information sources (this is where online clustering is used).
4. Online Clustering of Search Results
- Problem definition: group the search results at query time so that topically related documents are placed together.
- Motivation:
  - An offline clustering/classification of search results would not be query dependent. Two documents could be related for the query "machine learning" but not as relevant for a more specific one such as "machine learning for recommender systems".
  - The time of the query changes the set of available documents. A news search for the query "hurricane" today and the same query a week ago would return very different sets of documents.
5. Online Clustering of Search Results - Test Case
- To evaluate the performance of the algorithm on a frequently updated index (with deletions as well as additions), we chose to work on news articles.
- Due to the architecture of the underlying news search engine (Yahoo! News), the implemented solution had to satisfy the following conditions:
  - Access to at most 100 documents per query.
  - Work in a stateless/memoryless environment.
  - Fetch the original results, cluster them, and return the clustered results in under 300 milliseconds.
6. Online Clustering of Search Results - Clustering Algorithm
- Agglomerative Hierarchical Clustering, Mixture of Multinomials, Cluto, KNN, and K-means were implemented and compared. The Agglomerative Hierarchical Clustering algorithm outperformed the others in the news domain.
- The clustering algorithm works on the pairwise similarities of news articles.
- The pairwise similarity matrix is computed at query time.
- The similarity of one cluster to another is computed as a function of the pairwise document similarities.
- The similarity metric is an approximation of the cosine similarity measure.
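The pairwise similarity computation above can be sketched as follows. This is a minimal illustration assuming a bag-of-words representation with raw term counts; the slides only say the metric approximates cosine similarity, so the exact weighting is an assumption:

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents over raw term counts
    (the tokenization and weighting here are simplifying assumptions)."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Symmetric pairwise similarity matrix, computed at query time."""
    n = len(docs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine_similarity(docs[i], docs[j])
    return sim
```

Computing the full matrix is quadratic in the number of results, which is workable here because the system sees at most 100 documents per query.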
7. Online Clustering of Search Results - Clustering Algorithm
- Stopping criterion: a threshold on the similarity of the clusters. The optimal threshold value was found by an exhaustive search over a range of values on the hold-out dataset.
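Combining the pairwise matrix with the threshold-based stopping criterion, the merging loop can be sketched roughly as below. Average linkage is an assumption of this sketch; the slides say only that cluster similarity is a function of the pairwise document similarities:

```python
def agglomerative_cluster(sim, threshold):
    """Average-link agglomerative clustering over a pairwise similarity
    matrix `sim`; merging stops once no cluster pair exceeds `threshold`.
    Clusters are lists of document indices."""
    clusters = [[i] for i in range(len(sim))]

    def cluster_sim(a, b):
        # Cluster-to-cluster similarity as the average of the pairwise
        # document similarities (average linkage; an assumption here).
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:  # stopping criterion from the slides
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

In the described system the threshold itself is not hand-picked but tuned by exhaustive search on the hold-out dataset.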
8. Online Clustering of Search Results - Naming the Clusters
- Users may not realize how and why the documents were grouped together. Cluster names that serve as short descriptions of the clusters help users process and navigate the result set more easily.
- Common naming approaches:
  - Choose the most representative article in each cluster and use its title as the cluster label (Google News, http://news.google.com).
  - Find the most representative phrases within the documents belonging to each cluster (Clusty, http://news.clusty.com/).
9. Online Clustering of Search Results - Naming the Clusters
- Our approach:
  - News articles are usually formulated to answer more than one of the following questions: where, when, who, what, and how.
  - Articles belonging to the same cluster may share answers to only some of these questions.
  - We want to partition the titles into substrings, each corresponding to one of these questions, and then choose the most common among these substrings and their contiguous combinations.
10. Online Clustering of Search Results - Naming the Clusters
- Methodology:
  - Use POS taggers to find the boundaries between these parts.
  - Break the title into smaller parts at these points and generate a candidate set composed of these shorter substrings and their contiguous concatenations.
- Example:
  - Title: "Wilma forces Caribbean tourists to flee"
  - Substrings: "Wilma", "Caribbean tourists", "flee"
  - Candidates: "Wilma", "Wilma forces Caribbean tourists", "Caribbean tourists", "Wilma forces Caribbean tourists to flee", "flee", "Caribbean tourists to flee"
- Score the candidates based on coverage, frequency, length, or a combination of these metrics.
- Choose the candidate with the highest score as the cluster label.
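The candidate-generation step can be sketched as below. The segment boundaries and the connecting words between them are taken as given here; in the actual system they come from POS tagging, which is outside the scope of this sketch:

```python
def generate_candidates(segments, connectors):
    """Candidate labels: every segment plus every contiguous
    concatenation of segments, with the connecting words (e.g.
    "forces", "to") restored between adjacent segments."""
    assert len(connectors) == len(segments) - 1
    out = []
    for i in range(len(segments)):
        for j in range(i, len(segments)):
            parts = [segments[i]]
            for k in range(i, j):
                if connectors[k]:
                    parts.append(connectors[k])
                parts.append(segments[k + 1])
            out.append(" ".join(parts))
    return out
```

For the title on this slide, segments `["Wilma", "Caribbean tourists", "flee"]` with connectors `["forces", "to"]` reproduce the six candidates listed above.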
11. Online Clustering of Search Results - Naming the Clusters
- Scoring algorithms:
  - Coverage score: a metric measuring how well the candidate covers the information in all of the cluster's titles.
  - Frequency score: the frequency of the full candidate string across the titles.
  - Length-normalized frequency score: candidates with fewer words tend to have higher frequencies. To avoid this bias towards shorter candidates, we normalize the frequency scores by the number of words each candidate contains.
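A rough sketch of the three scoring functions, with coverage used as the tie-breaker as reported in the evaluation. The exact formulas (substring matching for frequency, multiplying by word count for length normalization, word overlap for coverage) are assumptions; the slides do not give precise definitions:

```python
def frequency_score(candidate, titles):
    """Number of cluster titles containing the candidate string
    (substring matching is an assumption of this sketch)."""
    return sum(candidate.lower() in t.lower() for t in titles)


def length_normalized_score(candidate, titles):
    """Frequency weighted by word count, so longer, more informative
    candidates are not penalized for occurring less often.  The
    multiplicative form is an assumption; the slides only say the
    frequency is normalized by the number of words."""
    return frequency_score(candidate, titles) * len(candidate.split())


def coverage_score(candidate, titles):
    """Fraction of all title words that the candidate's words cover,
    a rough proxy for how well it summarizes every title."""
    cand = set(candidate.lower().split())
    words = [w for t in titles for w in t.lower().split()]
    return sum(w in cand for w in words) / len(words) if words else 0.0


def best_label(candidates, titles):
    """Pick the highest-scoring candidate; coverage breaks ties."""
    return max(candidates,
               key=lambda c: (length_normalized_score(c, titles),
                              coverage_score(c, titles)))
```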
12. Online Clustering of Search Results - Evaluation
- Clustering algorithm:
  - 300 queries were selected at random from daily query logs.
  - 10 users were asked to compare the proposed clustering algorithm's results to Google's clusters; the results were presented side by side.
  - The proposed clustering algorithm was shown to outperform Google's clusters in both consistency and coverage.
13. Online Clustering of Search Results - Evaluation
- Naming experiments:
  - 160 queries were chosen at random from the daily query logs; their results were clustered and saved. 8 users were asked either to select the best label from the list of candidates or to type in a label of their own choice if they could not find an appropriate candidate. At least 3 judgments were collected per query.
  - Our experimental results showed that the length-normalized frequency score outperforms the others (60% match with user labels).
  - In the case of ties, coverage scores were found to be the best tie-breaker.