1
Online Clustering of Result Set for Adaptive
Search
2
What Does It Mean to Do Adaptive Search?
  • Different users may be looking for different
    information by using the same query string.
  • Even the same user may have different information
    needs at different points in time about the same
    topic.
  • Can we adapt the search algorithm to capture
    these differences and adapt to different
    information needs?

3
How Do We Plan to Achieve Adaptation?
  • The two main variables in tailored information
    retrieval are the identity of the user and the
    time the query was issued.
  • Time is an indicator of the state the user (i.e.
    the state of her information need) is in and of
    the amount/diversity of information available.
  • Our approach
  • Learn user behavior models to capture
    individuals' needs (this part is covered in our
    previous work).
  • Group the documents at query time to capture the
    current state of available (and accessible)
    information sources (this is where online
    clustering is used).

4
Online Clustering of Search Results
  • Problem definition: group the search results at
    query time so that topically related documents
    are placed together.
  • Motivation
  • An offline clustering/classification of search
    results would not be query dependent. Two
    documents could be related for the query "machine
    learning" but may not be as closely related for a
    more specific query such as "machine learning for
    recommender systems".
  • The time of the query changes the set of
    available documents. A news search for the query
    "hurricane" today and the same query a week ago
    would return very different sets of documents.

5
Online Clustering of Search Results - Test Case
  • To evaluate the performance of the algorithm on a
    frequently updated index (with deletion as well
    as addition operations), we chose to work on news
    articles.
  • Due to the architecture of the underlying news
    search engine (Yahoo! News), the implemented
    solution had to satisfy the following conditions:
  • Access at most 100 documents per query
  • Work in a stateless/memoryless environment
  • Fetch the original results, cluster them, and
    return the clustered results in less than
    300 milliseconds

6
Online Clustering of Search Results - Clustering
Algorithm
  • Clustering Algorithm
  • Agglomerative Hierarchical Clustering, Mixture of
    Multinomials, Cluto, KNN, and K-means were
    implemented and compared. The Agglomerative
    Hierarchical Clustering algorithm outperformed
    the others in the news domain.
  • The clustering algorithm works on the pairwise
    similarities of news articles.
  • The pairwise similarity matrix is computed at the
    time of the query (see the sketch below).
  • The similarity of one cluster to another is
    computed as a function of the pairwise document
    similarities.
  • The similarity metric is an approximation of the
    cosine similarity measure.
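  A minimal sketch of the pairwise-similarity step, assuming simple
  bag-of-words term-frequency vectors over the article text; the helper
  names (tf_vector, cosine, pairwise_similarities) are illustrative,
  not the production implementation.

    import math
    from collections import Counter

    def tf_vector(text):
        # Bag-of-words term frequencies for one article.
        return Counter(text.lower().split())

    def cosine(a, b):
        # Cosine similarity between two sparse term-frequency vectors.
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def pairwise_similarities(docs):
        # Document-to-document similarity matrix, computed at query time.
        vecs = [tf_vector(d) for d in docs]
        return [[cosine(u, v) for v in vecs] for u in vecs]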

7
Online Clustering of Search Results - Clustering
Algorithm
  • Stopping criterion: a threshold on the similarity
    of the clusters. The optimal threshold value was
    found by an exhaustive search within a range on
    the hold-out dataset (a merging sketch follows
    below).
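  A minimal sketch of the merging loop with the similarity-threshold
  stopping criterion. Average linkage is an assumption here: the slides
  only state that cluster similarity is a function of the pairwise
  document similarities.

    def agglomerative_cluster(sim, threshold):
        # Greedy agglomerative merging over a pairwise similarity matrix.
        clusters = [[i] for i in range(len(sim))]

        def cluster_sim(a, b):
            # Assumed average linkage over pairwise document similarities.
            return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

        while len(clusters) > 1:
            score, i, j = max(
                (cluster_sim(clusters[i], clusters[j]), i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters)))
            if score < threshold:
                break  # stopping criterion: best possible merge is too dissimilar
            clusters[i].extend(clusters[j])
            del clusters[j]
        return clusters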

8
Online Clustering of Search Results - Naming the
Clusters
  • Users may not realize how and why the documents
    were grouped together. Cluster names that serve
    as short descriptions of the clusters would help
    users process and navigate the result set more
    easily.
  • Common naming approaches
  • Choose the most representative article for each
    cluster and use its title as the cluster label
    (Google News search, http://news.google.com)
  • Find the most representative phrases within the
    documents belonging to each cluster (Clusty,
    http://news.clusty.com/)

9
Online Clustering of Search Results - Naming the
Clusters
  • Our Approach
  • News articles are usually formulated to answer
    more than one of the following questions: where,
    when, who, what, and how.
  • Articles belonging to the same cluster may share
    answers to only some of these questions.
  • We want to partition the titles into substrings,
    each corresponding to one of these questions, and
    then choose the most common candidate among these
    substrings and their contiguous combinations.

10
Online Clustering of Search Results - Naming the
Clusters
  • Methodology
  • Use POS taggers to find the boundaries between
    these parts.
  • Break the title into smaller parts at these
    points and generate a candidate set composed of
    these shorter substrings and their contiguous
    concatenations (see the sketch after this list).
  • Example
  • Title: Wilma forces Caribbean tourists to flee
  • Substrings: Wilma, Caribbean tourists, flee
  • Candidates: Wilma, Wilma forces Caribbean
    tourists, Caribbean tourists, Wilma forces
    Caribbean tourists to flee, flee, Caribbean
    tourists to flee
  • Score the candidates based on coverage, frequency,
    length, or a combination of these metrics.
  • Choose the candidate with the highest score as
    the cluster label.
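  A minimal sketch of the candidate-generation step, assuming the
  POS-derived substrings are given as word-index spans over the title;
  connector words between spans (e.g. "forces", "to") are kept only when
  spans are concatenated. The function name and span representation are
  illustrative.

    def generate_candidates(title, spans):
        # spans: (start, end) word indices of the question-bearing
        # substrings, in title order. A candidate is either a single
        # substring or the contiguous stretch of the title from one
        # substring to a later one.
        words = title.split()
        candidates = []
        for i in range(len(spans)):
            for j in range(i, len(spans)):
                start = spans[i][0]
                end = spans[i][1] if i == j else spans[j][1]
                candidates.append(" ".join(words[start:end]))
        return candidates

    # generate_candidates("Wilma forces Caribbean tourists to flee",
    #                     [(0, 1), (2, 4), (5, 6)])
    # -> ['Wilma', 'Wilma forces Caribbean tourists',
    #     'Wilma forces Caribbean tourists to flee', 'Caribbean tourists',
    #     'Caribbean tourists to flee', 'flee']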

11
Online Clustering of Search Results - Naming the
Clusters
  • Scoring Algorithms
  • Coverage score: a metric measuring how well the
    candidate covers the information in all of the
    cluster's titles.
  • Frequency score: the frequency of the full
    candidate string within the cluster's titles.
  • Length-normalized frequency score: candidates
    with fewer words tend to have higher frequencies.
    To avoid this bias towards shorter candidates, we
    normalize the frequency scores by the number of
    words they contain (a scoring sketch follows
    below).
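  A minimal sketch of the frequency-based scores, assuming candidate
  frequency is counted over the cluster's titles and that the length
  normalization simply weights the raw frequency by the candidate's word
  count; the slides do not give the exact formulas, so both choices are
  assumptions.

    def frequency_score(candidate, titles):
        # Number of cluster titles containing the full candidate string.
        return sum(1 for t in titles if candidate.lower() in t.lower())

    def length_normalized_score(candidate, titles):
        # Assumed normalization: weight frequency by word count so that
        # longer candidates are not penalized for being less frequent.
        return frequency_score(candidate, titles) * len(candidate.split())

    def best_label(candidates, titles):
        # The cluster label is the highest-scoring candidate.
        return max(candidates,
                   key=lambda c: length_normalized_score(c, titles))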

12
Online Clustering of Search Results - Evaluation
  • Clustering Algorithm
  • 300 queries were selected at random from daily
    query logs.
  • 10 users were asked to compare the proposed
    clustering algorithm's results to Google
    clusters. The results were presented side by side
    to the users.
  • The proposed clustering algorithm was shown to
    outperform Google clusters in both consistency
    and coverage.

13
Online Clustering of Search Results - Evaluation
  • Naming Experiments
  • 160 queries were chosen at random from the daily
    query logs. The results of these queries were
    clustered and saved. 8 users were asked either to
    select the best label from the list of candidates
    or to type in a name of their own choice if they
    could not find an appropriate candidate. At least
    3 judgments were collected per query.
  • Our experimental results showed that the
    length-normalized frequency score outperforms the
    others (60% match with user labels).
  • In the case of ties, coverage scores were found
    to be the best tie-breaker.