1
On Ranking the Effectiveness of Searches
  • Vishwa Vinay, Ingemar J. Cox
  • University College London
  • Natasa Milic-Frayling, Ken Wood
  • Microsoft Research Ltd., Cambridge
  • SIGIR 2006

2
Introduction
  • There is considerable interest in estimating
    the effectiveness of a search.
  • Such estimates are useful for
  • Providing feedback to the user
  • Providing feedback to the search engine
  • Providing feedback to the database creators
  • Optimizing information fusion for meta-search
    engines

3
Previous Work
  • Two classes of strategies have been proposed
  • Analysis of the query
  • E.g., query length, IDF values of query terms, and
    the clarity score, which depends on a language model
  • These have achieved limited success
  • Analysis of the retrieved document set
  • The best performance to date (0.439 in Kendall's
    τ statistic)

4
Four Proposed Measures
  • Focus on the geometry of the retrieved document
    set
  • The clustering tendency as measured by the
    Cox-Lewis statistic
  • The sensitivity to document perturbation
  • The sensitivity to query perturbation
  • The local intrinsic dimensionality

5
The Clustering Tendency
  • The cluster hypothesis
  • Documents relevant to a given query are likely to
    be similar to each other
  • A lack of clusters in the retrieved set
    implies that the set does not contain relevant
    documents
  • The approach: detect randomness in the retrieved
    set

6
Measuring the Clustering Tendency
  • The Cox-Lewis statistic (for pattern learning) is
    based on the ratio of two distances
  • The distance from a randomly generated sampling
    point to its nearest neighbor in the dataset,
    called the marked point
  • The distance between the marked point and its
    nearest neighbor
  • When the data contains inherent clusters, the
    distance Drand between the random sampling point
    and its marked point is likely to be much larger
    than the distance Dnn between the marked point
    and its nearest neighbor
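
  A minimal sketch of this ratio in Python (hypothetical code, not from
  the paper; it assumes dense document vectors, plain Euclidean distance,
  and uniform sampling inside the bounding hyper-rectangle described on
  the following slides):

```python
import numpy as np

def cox_lewis_ratio(docs, rng):
    """One-sample estimate of the Cox-Lewis ratio D_rand / D_nn.

    docs: (n, d) array of document vectors (the retrieved set).
    """
    lo, hi = docs.min(axis=0), docs.max(axis=0)   # bounding hyper-rectangle
    p = rng.uniform(lo, hi)                       # random sampling point
    dist_to_p = np.linalg.norm(docs - p, axis=1)
    mp = int(np.argmin(dist_to_p))                # marked point: nearest doc to p
    d_rand = dist_to_p[mp]
    dist_to_mp = np.linalg.norm(docs - docs[mp], axis=1)
    dist_to_mp[mp] = np.inf                       # exclude the marked point itself
    d_nn = dist_to_mp.min()                       # its nearest neighbor in the set
    return d_rand / d_nn                          # values >> 1 suggest clustering

# In practice the ratio would be averaged over many sampling points:
# rng = np.random.default_rng(0)
# ratio = np.mean([cox_lewis_ratio(docs, rng) for _ in range(100)])
```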

7
Approximation of the Cox-Lewis Statistic (1/2)
  • Sampling window
  • A region in the data representation space from
    which the random points are picked
  • The smallest hyper-rectangle that contains all
    the documents in the retrieved set
  • The computed Cox-Lewis ratio is normalized by the
    average length of the sides of the
    hyper-rectangle

8
Approximation of the Cox-Lewis Statistic (2/2)
  • Approximation process
  • Each random point is generated by starting with a
    point randomly selected from the retrieved set of
    documents
  • Each non-zero term weight is replaced with a
    value chosen uniformly from the range that
    corresponds to the side of the hyper-rectangle in
    that dimension
  • The sparsity of the documents is thus preserved
  • The clustering tendency of a given set of points
    depends on their sparsity
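
  A sketch of this sparsity-preserving sampling step, under the same
  dense-vector assumption as the earlier Cox-Lewis sketch:

```python
import numpy as np

def sparse_sampling_point(docs, rng):
    """Random sampling point that keeps a real document's sparsity pattern:
    start from a randomly chosen retrieved document and resample only its
    non-zero term weights, uniformly within the hyper-rectangle's side in
    the corresponding dimension."""
    lo, hi = docs.min(axis=0), docs.max(axis=0)
    p = docs[rng.integers(len(docs))].copy()      # random retrieved document
    nz = p != 0                                   # zero weights stay zero
    p[nz] = rng.uniform(lo[nz], hi[nz])
    return p
```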

9
Query-Dependent Extension of Cosine
  • sim(di, dj | q) = sqrt( cos(di, dj) · cos(c, q) )
  • where c is the vector of terms common to both di
    and dj, with weights ck being the average of dik
    and djk
  • Query-specific Cox-Lewis statistic
  • R = sim(dmp, dnn | q) / sim(psp, dmp | q)
  • where psp is the sampling point, dmp is the
    marked point, and dnn is the nearest neighbor of
    the marked point
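
  A sketch of the query-dependent similarity as reconstructed above (the
  square-root form follows Tombros and van Rijsbergen's query-sensitive
  measure, which this slide appears to use; treat that form as an
  assumption):

```python
import numpy as np

def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def sim_q(di, dj, q):
    """Query-dependent extension of cosine: the vector c of terms common to
    di and dj (weights averaged) is itself compared against the query q."""
    common = (di != 0) & (dj != 0)
    c = np.where(common, (di + dj) / 2.0, 0.0)
    return np.sqrt(cos(di, dj) * cos(c, q))
```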

10
Document Perturbation (1/2)
  • A document is randomly selected from the
    retrieved set and used as a pseudo-query over the
    retrieved set
  • The selected document will be ranked first in the
    new result list
  • Using a perturbed version of the selected
    document as the pseudo-query, the new rank of the
    selected document will increase
  • Plotting the new rank of the selected document
    vs. the level of introduced noise (α)
  • The slope of this curve is expected to reflect
    the clustering tendency

11
Document Perturbation (2/2)
  • For each query
    • Issue the query to the dataset
    • Collect 100 results
    • Calculate the variance v_j along each dimension j
      in the retrieved set
    • For each document d_i in this set
      • For α in {0.01, 0.1, 1, 10}
        • For s = 1 to 10 (noise samples)
          • For each term j present in this document
            • Weight = Original_weight + Gaussian(0, α·v_j)
          • Find the similarity of the noisy document
            to all 100 original documents
          • Find the rank of d_i in the list of
            similarities
        • Find the average rank over the multiple
          samples for document d_i and this α
12
Query Perturbation (1/2)
  • The original query is perturbed to retrieve
    documents from the collection
  • We attempt to measure how distant the original
    retrieved set is from the set retrieved by the
    perturbed query
  • The rationale
  • If the originally retrieved set forms a tight
    cluster that is significantly distant from other
    topical clusters in the collection, and if the
    magnitude of the added noise is small, then a
    noisy query will still retrieve most documents
    from the original set

13
Query Perturbation (2/2)
  • For each query
    • Issue the query to the dataset
    • Collect 100 results, called the original_set
    • Calculate the variance v_k along each dimension k
      in the entire collection
    • For α in {0.01, 0.1, 1, 10}
      • For s = 1 to 10 (noise samples)
        • For each term k present in this query
          • Weight = Original_weight + Gaussian(0, α·v_k)
        • Issue the noisy query to the entire dataset
        • Collect 100 results, called the noisy_set
        • Find the Levenshtein distance between
          original_set and noisy_set
      • Find the average distance over the multiple
        samples for this α
  • Plot the average distance vs. α over the range
    of alphas
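
  A sketch of the same loop in code (a hypothetical search(query, k)
  function returning a ranked list of document ids is assumed; the
  Levenshtein distance is computed over the two ranked lists as
  sequences):

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between two ranked lists of document ids."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = cur
    return prev[n]

def query_perturbation_curve(query, search, coll_var,
                             alphas=(0.01, 0.1, 1.0, 10.0),
                             n_samples=10, k=100, seed=0):
    """Average list distance between original and noisy results per alpha."""
    rng = np.random.default_rng(seed)
    original = search(query, k)
    curve = {}
    for alpha in alphas:
        dists = []
        for _ in range(n_samples):
            noisy = query.copy()
            nz = noisy != 0                     # perturb only terms in the query
            noisy[nz] += rng.normal(0.0, np.sqrt(alpha * coll_var[nz]))
            dists.append(levenshtein(original, search(noisy, k)))
        curve[alpha] = float(np.mean(dists))
    return curve
```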

14
Local Intrinsic Dimensionality (1/2)
  • The number of parameters required to represent a
    set of N points in a D-dimensional space is
    called the intrinsic dimensionality; it is at
    most min(N, D)
  • Methods for intrinsic dimensionality analysis
  • Latent semantic analysis (LSA)
  • Requires a threshold on the significance of the
    eigenvalues
  • Bayesian model selection using the Laplace
    criterion [Minka, 1999]
  • Suggests the optimal number of components for
    principal component analysis (PCA)

15
Local Intrinsic Dimensionality (2/2)
  • Given the set of retrieved documents, for each
    point in the set we identify its closest K
    neighbors within the retrieved set, where K
    ranges from 5 to 20 in steps of 5
  • The number of components suggested by the Laplace
    criterion for these K + 1 data points is the
    local intrinsic dimensionality
  • As we increase K, we can observe the rate of
    change in the intrinsic dimensionality
  • The underlying assumption
  • A high-dimensional dataset can be decomposed
    into a lower-dimensional component plus noise.
    If there is a large amount of noise in the data,
    the number of parameters required to model an
    essentially random set of points is small
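
  A sketch of this procedure (hypothetical code; as a stand-in for the
  Laplace criterion of [Minka, 1999], it uses the number of PCA components
  needed to retain 90% of the local variance, so absolute values will
  differ from the paper's):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def local_intrinsic_dims(docs, ks=(5, 10, 15, 20), var_kept=0.9):
    """For each K, average a local dimensionality estimate over all points:
    each point plus its K nearest neighbors is modelled with PCA, and the
    number of components needed to keep var_kept of the variance is taken
    as the local intrinsic dimensionality."""
    results = {}
    nn = NearestNeighbors(n_neighbors=max(ks) + 1).fit(docs)
    _, idx = nn.kneighbors(docs)                 # idx[:, 0] is the point itself
    for k in ks:
        dims = []
        for i in range(len(docs)):
            local = docs[idx[i, :k + 1]]         # the K + 1 local data points
            pca = PCA().fit(local)
            cum = np.cumsum(pca.explained_variance_ratio_)
            dims.append(int(np.searchsorted(cum, var_kept) + 1))
        results[k] = float(np.mean(dims))
    return results
```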

16
Experiments
  • Document collection: TREC disks 4 and 5
  • 200 TREC topics: 301-450 and 601-650
  • The description field is used to formulate the
    query
  • IR system: TF-IDF weighting in the Lemur toolkit
  • The average precision achieved by the IR system
    for each query provides a ground-truth ranking of
    queries by search effectiveness
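
  The evaluation compares each measure's ranking of the queries against
  the AP-based ground truth using Kendall's τ; a sketch with scipy
  (arrays of per-query scores assumed):

```python
from scipy.stats import kendalltau

# ap[i] is the average precision of query i; score[i] is a predictive
# measure's value for the same query. tau near 1 means the measure orders
# the queries much as average precision does.
def rank_correlation(ap, score):
    tau, p_value = kendalltau(ap, score)
    return tau
```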

17
Experiment Results (1/4)
  • Table 1: Correlation between each of the
    features and average precision
  • Note: the best performance to date on the same
    dataset is 0.439

18
Experiment Results (2/4)
  • Combining the predictive measures requires
    normalizing them onto a common scale
  • Same-mean normalization
  • Adjust x to x · mean(Y) / mean(X)
  • Min-max normalization
  • Adjust x to (x - min(X)) / (max(X) - min(X))
  • Inverse tan (arctan) normalization
  • Adjust x to 2 · arctan(x) / π, which lies
    between 0 and 1
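
  The three normalizations, sketched in code (X and Y are the arrays of
  scores being aligned; helper names are hypothetical):

```python
import numpy as np

def same_mean(x, X, Y):
    """Rescale x so that the mean of X is mapped onto the mean of Y."""
    return x * np.mean(Y) / np.mean(X)

def min_max(x, X):
    """Map x linearly so that min(X) -> 0 and max(X) -> 1."""
    return (x - np.min(X)) / (np.max(X) - np.min(X))

def arctan_norm(x):
    """Squash a non-negative x into [0, 1) via the inverse tangent."""
    return 2.0 * np.arctan(x) / np.pi
```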

19
Experiment Results (3/4)
  • Table 2: Combining the four search effectiveness
    measures

20
Experiment Results (4/4)
  • Table 3: Effectiveness of identifying poorly
    performing searches

21
Conclusions
  • Methods for estimating search effectiveness by
    examining properties of the retrieved set are
    presented
  • Four measures are investigated: the clustering
    tendency, the sensitivity to document
    perturbation, the sensitivity to query
    perturbation, and the rate of change in the local
    intrinsic dimensionality
  • The measures explored can also be used to compare
    the complexity of different document collections
    and the effects of different document
    representations on search