Clustering Web-Search Results Using Transduction-Based Relevance Model

1
Clustering Web-Search Results Using
Transduction-Based Relevance Model
  • Lurong Xiao and Edward Hung
  • Hong Kong Polytechnic University
  • csehung@comp.polyu.edu.hk

2
Outline
  • Motivation
  • Transduction-based Clustering Algorithm (TCA)
  • Result preprocessing
  • Similarity measurement
  • Transduction-based Relevance Model (TRM)
  • Clustering Algorithm
  • Performance Evaluation
  • Conclusion

3
Motivation
  • Existing search engines return a long list of
    ranked results
  • Keyword: jaguar

6
Challenges
  • Search engines return titles, snippets and links
  • Time-consuming to download and process original
    results
  • Short texts → hard to extract reliable features
    and produce good clusters
  • Online → need fast response time
  • Clusters should have readable descriptions
  • No study on relationship between results and
    distributions

7
Our Approach
  • Transduction-based Relevance Model (TRM)
  • No assumption on distribution
  • Determine relevance (relationship) between two
    results

8
Our Approach
  • Clustering algorithm
  • Clustering based on relevance
  • Compute the importance value of each result (how
    closely are other results relevant to it?)
  • An important result → cluster representative
  • The importance of results relevant to a chosen
    representative is suppressed
  • Repeat to pick important results above the threshold
  • Assign other results to the most relevant clusters
  • Results with very low relevance → outliers (a
    special cluster with miscellaneous topics)
  • Adjust importance and relevance thresholds →
    determine cluster number and granularity of
    clustering

9
Transduction-based Clustering Algorithm
  • Result preprocessing
  • Similarity measurement
  • Transduction-based relevance model
  • Clustering algorithm

10
1. Result preprocessing
  • User query → a list of ranked results
  • First n results: $X = \{x_1, x_2, \ldots, x_n\}$
  • Term set of m terms
  • an m × 1 cell array of all terms in titles and
    snippets (except non-term tokens and frequent
    terms)
  • m × n histogram matrix $H = (h_{i,j})_{m \times n}$, where
    $H = \alpha H_t + (1 - \alpha) H_s$
  • $H(k,i)$: weighted occurrence of the k-th term in $x_i$
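
The preprocessing step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that H_t and H_s are the term histograms of the titles and the snippets respectively, and the tokenizer, the `stop_terms` argument and the function names are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and keep alphabetic tokens only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_histogram(titles, snippets, alpha=0.5, stop_terms=frozenset()):
    """Return (terms, H) with H[k][i] = alpha * Ht[k][i] + (1 - alpha) * Hs[k][i]."""
    title_tokens = [tokenize(t) for t in titles]
    snippet_tokens = [tokenize(s) for s in snippets]
    # Term set: every token seen in a title or snippet, minus frequent terms.
    seen = {tok for doc in title_tokens + snippet_tokens for tok in doc}
    terms = sorted(seen - set(stop_terms))
    index = {term: k for k, term in enumerate(terms)}
    n = len(titles)
    H = [[0.0] * n for _ in terms]
    for i in range(n):
        for tok in title_tokens[i]:
            if tok in index:
                H[index[tok]][i] += alpha          # title occurrence
        for tok in snippet_tokens[i]:
            if tok in index:
                H[index[tok]][i] += 1.0 - alpha    # snippet occurrence
    return terms, H

titles = ["Jaguar cars", "Jaguar the cat"]
snippets = ["Luxury cars by Jaguar", "The jaguar is a big cat"]
terms, H = build_histogram(titles, snippets, alpha=0.5,
                           stop_terms={"the", "a", "is", "by"})
```

Each row of `H` corresponds to one term and each column to one result, matching the m × n layout on the slide.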

11
2. Similarity measurement
  • Normalized tf-idf weighted vectors
  • Term frequency
  • Inverse document frequency
  • Weight of k-th term in i-th result
  • cosine similarity
  • Distance between results $x_i$, $x_j$
  • n × n distance matrix D
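
The formulas for this slide did not survive in the transcript, so the sketch below fills them in with textbook choices that are assumptions rather than the authors' definitions: tf is the term's share of the result's total weight, idf is log(n / df), and the distance is 1 minus the cosine similarity.

```python
import math

def tfidf_vectors(H):
    """Turn the columns of H (terms x results) into unit-length tf-idf vectors."""
    m, n = len(H), len(H[0])
    # Document frequency: number of results that contain term k.
    df = [sum(1 for i in range(n) if H[k][i] > 0) for k in range(m)]
    vectors = []
    for i in range(n):
        total = sum(H[k][i] for k in range(m)) or 1.0
        vec = [(H[k][i] / total) * math.log(n / df[k]) if df[k] else 0.0
               for k in range(m)]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors

def distance_matrix(vectors):
    """D[i][j] = 1 - cosine similarity; the vectors are already normalized."""
    n = len(vectors)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            D[i][j] = D[j][i] = 1.0 - sim
    return D

# Toy histogram: 3 terms x 3 results.
H = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
D = distance_matrix(tfidf_vectors(H))
```

In the toy input, results 0 and 1 share a term and end up closer to each other than either is to result 2, which shares no term with them.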

12
3. Transduction-based Relevance Model (TRM)
  • Construct a KNN graph based on distance matrix D
  • Result $x_i$ → node
  • $x_j$ is a KNN of $x_i$ → $x_j \in N(x_i)$ → edge $(x_i, x_j)$
  • Affinity weight matrix $W = (w_{i,j})_{n \times n}$
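
A minimal sketch of the graph construction follows. The transcript does not include the affinity formula, so the Gaussian kernel and its `sigma` width are assumptions chosen for illustration, and `knn_affinity` is a hypothetical name.

```python
import math

def knn_affinity(D, k, sigma=1.0):
    """Build W from D: W[i][j] > 0 only if x_j is one of the k nearest neighbours of x_i."""
    n = len(D)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other nodes by distance to x_i and keep the k closest.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: D[i][j])[:k]
        for j in neighbours:
            # Assumed kernel: closer neighbours get a weight nearer to 1.
            W[i][j] = math.exp(-D[i][j] ** 2 / (2.0 * sigma ** 2))
    return W

D = [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]]
W = knn_affinity(D, k=1)
```

Because neighbourhoods are not symmetric, the resulting graph is directed: here result 2 links to result 1, but result 1 links to result 0.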

13
3. Transduction-based Relevance Model (TRM)
  • Propagation coefficient matrix $Q = (q_{i,j})_{n \times n}$
  • Propagate relevance to get the relevance matrix
    $R = (r_{i,j})_{n \times n}$
  • $R = Q \times R$
  • $r_{i,i} = 1$
  • R ? I
  • $x_i$ has a high relevance to $x_j$ if $x_i$ is near
    nodes (e.g. $x_k$) with high relevance to $x_j$

14
3. Transduction-based Relevance Model (TRM)
  • Iterative propagation algorithm
  • Apply the update iteratively until R becomes stable
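
The iteration can be sketched as below, under several assumptions that the transcript does not confirm: the definition of Q, the identity starting point, and the hypothetical `leak` term in the denominator. The leak is there because, if every row of Q summed to exactly 1 with only the diagonal held at 1, the fixed point on a connected graph would drift toward all ones.

```python
def propagation_matrix(W, leak=1.0):
    """Assumed form: q[i][j] = w[i][j] / (leak + sum_k w[i][k])."""
    return [[w / (leak + sum(row)) for w in row] for row in W]

def propagate(Q, tol=1e-9, max_iter=10000):
    """Iterate R <- Q R, resetting the diagonal to 1, until R stops changing."""
    n = len(Q)
    # Assumed starting point: every result is relevant only to itself.
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        new = [[sum(Q[i][k] * R[k][j] for k in range(n)) for j in range(n)]
               for i in range(n)]
        for i in range(n):
            new[i][i] = 1.0  # constraint r_ii = 1
        change = max(abs(new[i][j] - R[i][j])
                     for i in range(n) for j in range(n))
        R = new
        if change < tol:
            break
    return R

# Three results in a chain: 0 - 1 - 2.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
R = propagate(propagation_matrix(W, leak=1.0))
```

In the chain example, relevance to result 0 decays with graph distance: result 1 ends up more relevant to it than result 2.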

15
3. Transduction-based Relevance Model (TRM)
  • Matrix solution
  • Apply matrix equations instead of iteratively
    propagating relevance values
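
The slide's own matrix equations are missing from the transcript. One closed form that follows from the two stated constraints (the propagation rule off the diagonal and a diagonal fixed at 1) is derived below. The block notation Q_{UU} and Q_{U,j} is introduced here for illustration, and the inverse exists only under the assumption that the spectral radius of Q_{UU} is below 1.

```latex
% Fix a target column j and let U be the set of all indices except j.
% Off the diagonal the fixed point satisfies r_{i,j} = \sum_k q_{i,k} r_{k,j},
% and the diagonal entry is held at r_{j,j} = 1.
\begin{aligned}
r_{U,j} &= Q_{UU}\, r_{U,j} + Q_{U,j}\, r_{j,j} \\
(I - Q_{UU})\, r_{U,j} &= Q_{U,j} \\
r_{U,j} &= (I - Q_{UU})^{-1}\, Q_{U,j}
\end{aligned}
```

Here Q_{UU} is Q with row and column j removed, and Q_{U,j} is column j of Q without its j-th entry. Solving one such linear system per column gives the same R that the iteration converges to.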

16
3. Transduction-based Relevance Model (TRM)
  • Importance of node $x_i$
  • A node is more important iff more other
    nodes are highly relevant to it
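
The importance formula itself is not in the transcript. A simple reading of the bullet above, offered only as an assumption, is to sum the relevance that all other nodes have to the node in question, that is, a column sum of R without the diagonal entry.

```python
def importance(R):
    """Assumed form: Im[i] = sum over j != i of R[j][i]."""
    n = len(R)
    return [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]

# R[j][i] is read as the relevance of result j to result i.
R = [[1.0, 0.5, 0.2],
     [0.4, 1.0, 0.4],
     [0.2, 0.5, 1.0]]
Im = importance(R)
```

With this toy matrix the middle result scores highest, because both of the other results are fairly relevant to it.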

17
4. Clustering Algorithm
  • Iteratively find $N_c$ or fewer cluster
    representatives with $Im_i >$ threshold
  • Pick the node $x_i$ with the highest importance value $Im_i$
  • Avoid picking nodes highly relevant to $x_i$
  • Suppression attenuation
  • Assign other nodes to the most relevant cluster
    representatives
  • If relevance < threshold → outlier or special
    cluster for miscellaneous topics
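
The steps above can be sketched as follows. The importance definition (column sum of R without the diagonal) and the suppression rule (multiply by 1 minus the relevance to the new representative) are assumptions, since neither formula appears in the transcript. The function and parameter names are hypothetical.

```python
def cluster(R, importance_threshold, relevance_threshold, max_clusters):
    """Pick representatives by importance, then assign the rest by relevance."""
    n = len(R)
    # Assumed importance: total relevance of the other nodes to node i.
    Im = [sum(R[j][i] for j in range(n) if j != i) for i in range(n)]
    reps = []
    while len(reps) < max_clusters:
        candidates = [i for i in range(n) if i not in reps]
        if not candidates:
            break
        best = max(candidates, key=lambda i: Im[i])
        if Im[best] <= importance_threshold:
            break
        reps.append(best)
        # Assumed suppression: attenuate nodes that are relevant to the new representative.
        for j in candidates:
            if j != best:
                Im[j] *= 1.0 - R[j][best]
    clusters = {rep: [rep] for rep in reps}
    outliers = []
    for i in range(n):
        if i in reps:
            continue
        rep = max(reps, key=lambda r: R[i][r]) if reps else None
        if rep is None or R[i][rep] < relevance_threshold:
            outliers.append(i)   # too weakly related to any representative
        else:
            clusters[rep].append(i)
    return clusters, outliers

# Two tight groups {0, 1, 2} and {3, 4}, plus an unrelated result 5.
R = [[1.0, 0.8, 0.8, 0.0, 0.0, 0.0],
     [0.8, 1.0, 0.8, 0.0, 0.0, 0.0],
     [0.8, 0.8, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0, 0.8, 0.0],
     [0.0, 0.0, 0.0, 0.8, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
clusters, outliers = cluster(R, importance_threshold=0.5,
                             relevance_threshold=0.01, max_clusters=5)
```

The loop stops on its own once no remaining node exceeds the importance threshold, which is how the number of clusters can end up smaller than `max_clusters`.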

18
Performance Evaluation
  • Query log of June 10, 2007 from Google
  • Keywords with multiple subtopics and commonly
    used in the literature
  • jaguar, iraq, java
  • $\alpha = 0.5$, $s = 1$, $w_0 = 0.001$, 20-NN graph
  • Search results: 50, 75, 100
  • Relevance threshold: 0.001, 0.01
  • Importance threshold: 2.5, 3, 3.5
  • 98.9% accuracy (higher than k-medoids clustering:
    88.3%)
  • Running time: 0.3 s to 0.7 s

19
Performance Evaluation
  • Higher relevance threshold → higher accuracy
  • More relevant results are usually more likely to
    be correct
  • Lower importance threshold → more clusters →
    higher accuracy
  • Members of new clusters were usually outliers or
    from clusters with low ranks (relevances)
  • More data → more clusters
  • Some results were once too small in quantity and
    relevance to form a cluster
  • They now become important enough to form their
    own cluster

20
Example of member movement across different numbers of
clusters
21
Conclusion
  • Transduction-based Clustering Algorithm (TCA)
  • Analyze inter-result relationship by propagating
    relevance through local interactions
  • Given importance threshold, TCA decides the
    number of clusters automatically
  • Search results clustering and outlier detection
  • Fast running time and high accuracy, suitable for
    online users