1
Approaches to Collection Selection and Results Merging for Distributed Information Retrieval.
  • By Y. Rasolofo, F. Abbaci, J. Savoy.

2
What's wrong with conventional IR systems?
  • Insufficient bandwidth,
  • Server overload,
  • Unacceptable response times.

3
(Figure 1: single-index IR vs. distributed IR; shown as an image, not transcribed.)
4
Collection Selection.
  • Two ways to select collections (sketched below):
  • Select the collections with the N highest scores.
  • Select the collections whose scores exceed some threshold.
  • Two main sources of evidence for collection selection:
  • Collection descriptions
  • Collection statistics
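
As a minimal sketch of these two selection strategies, in Python (the collection names and scores below are invented for illustration):

  def select_top_n(scores, n):
      # Keep the n collections with the highest selection scores.
      ranked = sorted(scores, key=scores.get, reverse=True)
      return ranked[:n]

  def select_above_threshold(scores, threshold):
      # Keep every collection whose score exceeds the threshold.
      return [c for c in scores if scores[c] > threshold]

  scores = {"news": 0.82, "web": 0.47, "patents": 0.15}
  print(select_top_n(scores, 2))              # ['news', 'web']
  print(select_above_threshold(scores, 0.4))  # ['news', 'web']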

5
Approach Proposed
  • Main evaluation criteria:
  • Number of terms included in each document surrogate
  • Distance between the number of terms and their frequencies

6
Approach Proposed
Document score and distance calculations.
(The score and distance formulas were shown as images and are not transcribed.)
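
Since those formulas are not recoverable, the Python fragment below is only a loose, hypothetical reading of the two criteria from the previous slide; in particular, the distance measure is an assumption, not the authors' actual formula:

  import math

  def surrogate_evidence(query_terms, surrogate_tf):
      # surrogate_tf maps a term to its frequency in the document surrogate.
      matched = [t for t in query_terms if t in surrogate_tf]
      n_terms = len(matched)  # criterion 1: query terms found in the surrogate
      # Criterion 2 (assumed form): a distance between the number of
      # matched terms and those terms' frequencies.
      distance = math.sqrt(sum((surrogate_tf[t] - n_terms) ** 2 for t in matched))
      return n_terms, distance
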
7
Result Set Merging - Prior Work
  • Round robin
  • Raw score merging
  • What if collection statistics are very different? idf?
  • Score merging with collection weights
  • Normalize by the maximum document score
  • CORI
  • w_i = 1 + C * (s_i - s_m) / s_m
  • s_m is the mean collection score, C is the number of collections (see the sketch below)
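
A Python sketch of two of these strategies, raw-score merging and CORI-style weighted merging (the result lists and collection scores are invented; only the w_i formula comes from the slide):

  def raw_score_merge(results):
      # Merge by raw document score, ignoring differences between collections.
      merged = [pair for res in results.values() for pair in res]
      return sorted(merged, key=lambda p: p[1], reverse=True)

  def cori_weights(collection_scores):
      # w_i = 1 + C * (s_i - s_m) / s_m, as on the slide.
      C = len(collection_scores)
      s_m = sum(collection_scores.values()) / C  # mean collection score
      return {i: 1 + C * (s - s_m) / s_m for i, s in collection_scores.items()}

  def weighted_merge(results, weights):
      # Scale each document score by its collection's weight, then merge.
      merged = [(doc, score * weights[i])
                for i, res in results.items() for doc, score in res]
      return sorted(merged, key=lambda p: p[1], reverse=True)

  results = {"A": [("d1", 9.1), ("d2", 7.4)], "B": [("d3", 12.0)]}
  print(weighted_merge(results, cori_weights({"A": 0.6, "B": 0.3})))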

8
Result Set Merging - LSM: Using Result Lengths to Calculate Merging Scores
  • No collection statistics necessary; the inputs are document scores and result lengths
  • Increase document scores if above the mean score, decrease them if below
  • Collections that return many documents are more likely to return relevant documents
  • s_i = log(1 + (l_i * K) / sum_{j=1..C} l_j)
  • w_i = 1 + (s_i - s_m) / s_m
  • K is a constant (600), s_i is the score for the i-th collection, l_i is the number of documents returned by the i-th collection (see the sketch below)
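
A Python sketch of the LSM weighting defined above (the collection names and result lengths are invented):

  import math

  K = 600  # constant from the slide

  def lsm_weights(result_lengths):
      # s_i = log(1 + (l_i * K) / sum_j l_j); w_i = 1 + (s_i - s_m) / s_m,
      # where result_lengths maps collection i to l_i and s_m is the
      # mean of the s_i.
      total = sum(result_lengths.values())
      s = {i: math.log(1 + l * K / total) for i, l in result_lengths.items()}
      s_m = sum(s.values()) / len(s)
      return {i: 1 + (si - s_m) / s_m for i, si in s.items()}

  # Collections that returned more documents get weights above 1
  # (scores increased); smaller result lists get weights below 1.
  print(lsm_weights({"A": 900, "B": 300, "C": 50}))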

9
Analysis of Experiments Results: Major Assumptions
  • Test documents came from the TREC-8 and TREC-9 conferences
  • Only topic titles, which contain two words on average, were used, to simulate the typical queries sent by search-engine users.

10
Analysis of Experiments Results: Interesting Findings
  • Spelling errors in several queries caused no documents to be returned
  • The authors used two methodologies to show that their approach outperformed the approaches it was compared against (sketched below):
  • TREC_EVAL was used to compute the average precision after retrieving 5, 10, 15, 20, 30, 100, 200, 500, and 1000 documents
  • A sign test was used to verify that the findings were statistically significant
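
A minimal sketch of both evaluation steps, assuming per-query relevance judgments and effectiveness scores are already available (the cutoff list comes from the slide; everything else is illustrative, and trec_eval itself is a separate tool):

  from scipy.stats import binomtest  # SciPy >= 1.7, for the sign test's p-value

  CUTOFFS = (5, 10, 15, 20, 30, 100, 200, 500, 1000)

  def precision_at_cutoffs(ranked_docs, relevant, cutoffs=CUTOFFS):
      # Precision after retrieving k documents, for each cutoff k.
      return {k: sum(1 for d in ranked_docs[:k] if d in relevant) / k
              for k in cutoffs}

  def sign_test(scores_a, scores_b):
      # Per-query sign test: count wins for each method, drop ties,
      # and test the win counts against a fair coin.
      wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
      wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
      return binomtest(wins_a, wins_a + wins_b, 0.5).pvalue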

11
Analysis of Experiments Results: TREC-8 Results
12
Analysis of Experiments Results: TREC-9 Results