Title: Online Clustering of Result Set for Adaptive Search
1. Online Clustering of Result Set for Adaptive Search
2. What Does it Mean to Do Adaptive Search?
- Different users may be looking for different information when issuing the same query string.
- Even the same user may have different information needs about the same topic at different points in time.
- Can we adapt the search algorithm to capture these differences and serve these different information needs?
3. How Do We Plan to Achieve Adaptation?
- The two main variables in tailored information retrieval are the identity of the user and the time the query was issued.
- Time is an indicator of the state the user is in (i.e. the state of her information need) and of the amount/diversity of information available.
- Our approach:
  - Learn user behavior models to capture individual needs (covered in our previous work).
  - Group the documents at query time to capture the current state of available (and accessible) information sources (this is where online clustering is used).
4. Online Clustering of Search Results
- Problem definition: group the search results at query time so that topically related documents are placed together.
- Motivation:
  - An offline clustering/classification of search results would not be query dependent. Two documents could be related for the query "machine learning" but not as relevant for a more specific one such as "machine learning for recommender systems".
  - The time of the query changes the set of available documents. A news search for the query "hurricane" today and the same query a week ago would return very different sets of documents.
5. Online Clustering of Search Results - Test Case
- To evaluate the performance of the algorithm on a frequently updated index (with deletions as well as additions), we chose to work on news articles.
- Due to the architecture of the underlying news search engine (Yahoo! News), the implemented solution had to satisfy the following conditions:
  - Access to at most 100 documents per query.
  - Work in a stateless/memoryless environment.
  - Fetch the original results, cluster them, and return the clustered results in under 300 milliseconds.
6. Online Clustering of Search Results - Clustering Algorithm
- Agglomerative Hierarchical Clustering, Mixture of Multinomials, Cluto, KNN, and K-means were implemented and compared. The Agglomerative Hierarchical Clustering algorithm outperformed the others in the news domain.
- The clustering algorithm works on the pairwise similarities of news articles.
- The pairwise similarity matrix is computed at query time.
- The similarity of one cluster to another is computed as a function of the pairwise document similarities.
- The similarity metric is an approximation of the cosine similarity measure.
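The pairwise similarity computation above can be sketched as follows. This is a minimal illustration assuming a bag-of-words representation with raw term counts; the slides only say the metric approximates cosine similarity, so the exact weighting is an assumption:

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents over raw term counts
    (the tokenization and weighting here are simplifying assumptions)."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Symmetric pairwise similarity matrix, computed at query time."""
    n = len(docs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine_similarity(docs[i], docs[j])
    return sim
```

Computing the full matrix is quadratic in the number of results, which is workable here because the system sees at most 100 documents per query.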
7. Online Clustering of Search Results - Clustering Algorithm
- Stopping criterion: a threshold on the similarity of the clusters. The optimal threshold value was found by an exhaustive search over a range of values on the hold-out dataset.
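Combining the pairwise matrix with the threshold-based stopping criterion, the merging loop can be sketched roughly as below. Average linkage is an assumption of this sketch; the slides say only that cluster similarity is a function of the pairwise document similarities:

```python
def agglomerative_cluster(sim, threshold):
    """Average-link agglomerative clustering over a pairwise similarity
    matrix `sim`; merging stops once no cluster pair exceeds `threshold`.
    Clusters are lists of document indices."""
    clusters = [[i] for i in range(len(sim))]

    def cluster_sim(a, b):
        # Cluster-to-cluster similarity as the average of the pairwise
        # document similarities (average linkage; an assumption here).
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:  # stopping criterion from the slides
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

In the described system the threshold itself is not hand-picked but tuned by exhaustive search on the hold-out dataset.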
8. Online Clustering of Search Results - Naming the Clusters
- Users may not realize how and why the documents were grouped together. Cluster names that serve as short descriptions of the clusters help users process and navigate the result set more easily.
- Common naming approaches:
  - Choose the most representative article in each cluster and use its title as the cluster label (Google News, http://news.google.com).
  - Find the most representative phrases within the documents belonging to each cluster (Clusty, http://news.clusty.com/).
9. Online Clustering of Search Results - Naming the Clusters
- Our approach:
  - News articles are usually formulated to answer more than one of the following questions: where, when, who, what, and how.
  - Articles belonging to the same cluster may share answers to only some of these questions.
  - We want to partition the titles into substrings, each corresponding to one of these questions, and then choose the most common among these substrings and their contiguous combinations.
10. Online Clustering of Search Results - Naming the Clusters
- Methodology:
  - Use POS taggers to find the boundaries between these parts.
  - Break the title into smaller parts at these points and generate a candidate set composed of these shorter substrings and their contiguous concatenations.
- Example:
  - Title: "Wilma forces Caribbean tourists to flee"
  - Substrings: "Wilma", "Caribbean tourists", "flee"
  - Candidates: "Wilma", "Wilma forces Caribbean tourists", "Caribbean tourists", "Wilma forces Caribbean tourists to flee", "flee", "Caribbean tourists to flee"
- Score the candidates based on coverage, frequency, length, or a combination of these metrics.
- Choose the candidate with the highest score as the cluster label.
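The candidate-generation step can be sketched as below. The segment boundaries and the connecting words between them are taken as given here; in the actual system they come from POS tagging, which is outside the scope of this sketch:

```python
def generate_candidates(segments, connectors):
    """Candidate labels: every segment plus every contiguous
    concatenation of segments, with the connecting words (e.g.
    "forces", "to") restored between adjacent segments."""
    assert len(connectors) == len(segments) - 1
    out = []
    for i in range(len(segments)):
        for j in range(i, len(segments)):
            parts = [segments[i]]
            for k in range(i, j):
                if connectors[k]:
                    parts.append(connectors[k])
                parts.append(segments[k + 1])
            out.append(" ".join(parts))
    return out
```

For the title on this slide, segments `["Wilma", "Caribbean tourists", "flee"]` with connectors `["forces", "to"]` reproduce the six candidates listed above.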
11. Online Clustering of Search Results - Naming the Clusters
- Scoring algorithms:
  - Coverage score: a metric measuring how well the candidate covers the information in all of the cluster's titles.
  - Frequency score: the frequency of the full candidate string across the titles.
  - Length-normalized frequency score: candidates with fewer words tend to have higher frequencies. To avoid this bias towards shorter candidates, we normalize the frequency scores by the number of words each candidate contains.
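A rough sketch of the three scoring functions, with coverage used as the tie-breaker as reported in the evaluation. The exact formulas (substring matching for frequency, multiplying by word count for length normalization, word overlap for coverage) are assumptions; the slides do not give precise definitions:

```python
def frequency_score(candidate, titles):
    """Number of cluster titles containing the candidate string
    (substring matching is an assumption of this sketch)."""
    return sum(candidate.lower() in t.lower() for t in titles)


def length_normalized_score(candidate, titles):
    """Frequency weighted by word count, so longer, more informative
    candidates are not penalized for occurring less often.  The
    multiplicative form is an assumption; the slides only say the
    frequency is normalized by the number of words."""
    return frequency_score(candidate, titles) * len(candidate.split())


def coverage_score(candidate, titles):
    """Fraction of all title words that the candidate's words cover,
    a rough proxy for how well it summarizes every title."""
    cand = set(candidate.lower().split())
    words = [w for t in titles for w in t.lower().split()]
    return sum(w in cand for w in words) / len(words) if words else 0.0


def best_label(candidates, titles):
    """Pick the highest-scoring candidate; coverage breaks ties."""
    return max(candidates,
               key=lambda c: (length_normalized_score(c, titles),
                              coverage_score(c, titles)))
```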
12. Online Clustering of Search Results - Evaluation
- Clustering algorithm:
  - 300 queries were selected at random from daily query logs.
  - 10 users were asked to compare the proposed clustering algorithm's results to Google's clusters; the results were presented side by side.
  - The proposed clustering algorithm was shown to outperform Google's clusters in both consistency and coverage.
13. Online Clustering of Search Results - Evaluation
- Naming experiments:
  - 160 queries were chosen at random from the daily query logs; their results were clustered and saved. 8 users were asked either to select the best label from the list of candidates or to type in a label of their own choice if they could not find an appropriate candidate. At least 3 judgments were collected per query.
  - Our experimental results showed that the length-normalized frequency score outperforms the others (60% match with user labels).
  - In the case of ties, coverage scores were found to be the best tie-breaker.