Title: Result Merging in a Peer-to-Peer Web Search Engine MINERVA
1. Result Merging in a Peer-to-Peer Web Search Engine MINERVA
Master thesis project
Speaker: Sergey Chernov
Supervisor: Prof. Dr. Gerhard Weikum
Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group
2. Overview
- 1. Result merging problem in the MINERVA system
- 2. Selected merging strategies: GIDF, ICF, CORI, LM
- 3. Our approach: result merging with the preference-based language model
- 4. Summary and future work
3. Problems of present Web Search Engines
- Size of the indexable Web
  - The Web is huge; it is difficult to cover it all
  - Timely re-crawls are required
  - Deep Web
- Monopoly of Google
  - Controls 80% of web search requests
  - Sites may be censored by the engine
- Make use of Peer-to-Peer technology
  - Exploit previously unused CPU/memory/disk power
  - Keep up-to-date results for small portions of the Web
  - Conquer the Deep Web with specialized web crawlers
4. MINERVA project
- MINERVA is a Peer-to-Peer Web search engine
[Figure: Chord ring. Each peer P_i runs a local search engine with an index on its crawled pages (C_i) and posts representative statistics (S_i) into a distributed directory based on the Chord protocol.]
5. Query processing in distributed IR
[Figure: query processing pipeline. Selection: <P, q> -> P'. Retrieval: <P', q> -> R1, R2, R3, ... Merging: <<R1, R2, R3, ...>, q> -> Rm.]
- q: query
- P: set of peers; P': subset of peers most relevant for q
- Ri: ranked result list of documents from peer Pi
- Rm: merged result list of documents
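The selection, retrieval, and merging phases above can be sketched in a few lines of Python. This is an illustrative toy, not the MINERVA API: the peer records, quality estimates, and scores are invented, and the merge step is the naive score sort discussed on the next slide.

```python
# Hypothetical sketch of the distributed query pipeline: select the most
# promising peers P' from P, retrieve a ranked list R_i from each, merge into R_m.

def select_peers(peers, q, k=2):
    """Selection: rank peers by a quality estimate for q, keep the top k."""
    return sorted(peers, key=lambda p: p["quality"].get(q, 0.0), reverse=True)[:k]

def retrieve(peer, q):
    """Retrieval: each selected peer returns its local ranked list for q."""
    return peer["results"].get(q, [])

def merge(result_lists):
    """Merging: flatten the per-peer lists and sort by (raw) score."""
    all_hits = [hit for r in result_lists for hit in r]
    return sorted(all_hits, key=lambda hit: hit[1], reverse=True)

peers = [
    {"id": "P1", "quality": {"chord": 0.9}, "results": {"chord": [("d1", 0.8), ("d2", 0.5)]}},
    {"id": "P2", "quality": {"chord": 0.7}, "results": {"chord": [("d3", 0.9)]}},
    {"id": "P3", "quality": {"chord": 0.1}, "results": {"chord": []}},
]
selected = select_peers(peers, "chord")                 # P' = [P1, P2]
merged = merge([retrieve(p, "chord") for p in selected])
print(merged)                                           # [('d3', 0.9), ('d1', 0.8), ('d2', 0.5)]
```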
6. Naive merging approaches
- How can we combine results from peers?
- 1. Retrieve the k documents with the highest similarity scores
  - Problem: scores are incomparable across peers
- 2. Take the same number of documents from each peer
  - Problem: database quality differs across peers
- 3. Fetch the best documents from the peers, re-rank them, and select the top-k
  - Problem: a good solution, but how do we compute the final scores?
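The first two naive strategies can be contrasted directly (the peer lists and scores are made up for illustration): sorting by raw score lets a peer with inflated local statistics dominate, while round-robin ignores differences in collection quality.

```python
from itertools import chain, zip_longest

# Per-peer ranked lists of (doc_id, local score). The scores come from
# different local statistics, so comparing them directly is misleading.
r1 = [("a", 9.1), ("b", 8.7)]      # peer 1: large raw term weights
r2 = [("c", 0.93), ("d", 0.90)]    # peer 2: normalized scores in [0, 1]

# Approach 1: global sort by raw score -- peer 1 dominates regardless of quality.
by_score = sorted(chain(r1, r2), key=lambda hit: hit[1], reverse=True)

# Approach 2: round-robin, the same number of documents from each peer --
# ignores that one peer's collection may simply be better for this query.
round_robin = [hit for pair in zip_longest(r1, r2) for hit in pair if hit]

print([d for d, _ in by_score])     # ['a', 'b', 'c', 'd']
print([d for d, _ in round_robin])  # ['a', 'c', 'b', 'd']
```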
7. Result merging problem
- Objective: make scores completely comparable
- Solution: replace all collection-dependent statistics with global ones
- Baseline: estimate document scores as if the documents were placed in a single database
- Difficulty: overlapping collections influence the statistics used for score estimation
- Methods:
  - GIDF: Global Inverted Document Frequency
  - ICF: Inverted Collection Frequency
  - CORI: the merging scheme used in the CORI system
  - LM: Language Modeling
[Figure: a document's score computed at its peer (LeftScore) should match the score it would receive in the single combined database DB123 (RightScore).]
8. Selected result merging methods (1)
- GIDF: compute a Global Inverted Document Frequency
  GIDF(t) = log( sum_i |D_i| / sum_i DF_i(t) )
  - DF_i(t): number of documents containing term t on peer i
  - |D_i|: overall number of documents on peer i
- ICF: replace IDF with an Inverted Collection Frequency value
  ICF(t) = log( |C| / CF(t) )
  - CF(t): number of peers whose collection contains term t
  - |C|: number of collections (peers) in the system
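A sketch of how the two global statistics fall out of the per-peer counts, assuming the plain log-ratio forms above (the counts are invented; MINERVA's exact variants may add smoothing constants):

```python
import math

# Per-peer statistics for one term t:
# df[i] = number of documents on peer i containing t, n[i] = |D_i|.
df = {"P1": 20, "P2": 5, "P3": 0}
n  = {"P1": 1000, "P2": 400, "P3": 250}

# GIDF: a single global IDF, as if all documents sat in one database.
gidf = math.log(sum(n.values()) / sum(df.values()))

# ICF: collection-level analogue of IDF -- CF counts peers that have t at all.
cf = sum(1 for peer in df if df[peer] > 0)
icf = math.log(len(df) / cf)

print(round(gidf, 3))  # log(1650 / 25) ~ 4.190
print(round(icf, 3))   # log(3 / 2) ~ 0.405
```

Replacing each peer's local IDF with GIDF (or ICF) during scoring is what makes the per-peer scores comparable in a single merged ranking.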
9. Selected result merging methods (2)
- CORI: COllection Retrieval Inference network
  FinalScore = (LocalScore + 0.4 * LocalScore * DatabaseRank) / 1.4
  - DatabaseRank: obtained during the database selection step
  - LocalScore: score computed with local statistics
  - the constants are heuristics tuned for the INQUERY search engine
- LM: Language Modeling
  P(q | D) = lambda * P(q | document LM) + (1 - lambda) * P(q | global LM)
  - lambda: smoothing parameter, a heuristic tradeoff between the two models
  - global language model: estimated from all documents on all peers
  - document language model: estimated from the individual document
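Both combination rules fit in a few lines. The CORI constants 0.4 and 1.4 are the published INQUERY heuristics from the slide; the LM score uses the linear interpolation above with an illustrative lambda, computed in log space. The example inputs are invented.

```python
import math

def cori_merge(local_score, db_rank):
    """CORI result merging: scale the local score by the database rank."""
    return (local_score + 0.4 * local_score * db_rank) / 1.4

def lm_score(query_terms, doc_lm, global_lm, lam=0.5):
    """Smoothed LM query likelihood, in log space:
    log P(q|D) = sum_t log( lam * P(t|doc) + (1 - lam) * P(t|global) )."""
    return sum(
        math.log(lam * doc_lm.get(t, 0.0) + (1 - lam) * global_lm.get(t, 1e-9))
        for t in query_terms
    )

print(cori_merge(0.7, 0.5))  # (0.7 + 0.4*0.7*0.5) / 1.4 = 0.6
doc_lm = {"chord": 0.02, "ring": 0.01}
global_lm = {"chord": 0.001, "ring": 0.002}
print(lm_score(["chord", "ring"], doc_lm, global_lm))
```

Because the global language model is the same on every peer, the smoothed scores are directly comparable across peers, which is why LM merging needs no further normalization.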
10. Experimental setup
- TREC-2002, 2003, and 2004 Web Track datasets
- 4 topics
- 50 peers, 10-15 per topic
- documents are replicated twice
- 25 title queries from the topic distillation task of the 2002 and 2003 Web Tracks
- 3 database selection algorithms:
  - RANDOM: peers ranked in random order
  - CORI: the de-facto standard
  - IDEAL: a manually created ranking
11. Experiments: CORI database ranking, all merging methods
12. Experiments: all database rankings, the best merging method (LM)
13. Experiments: IDEAL database ranking, the best merging method (LM), limited statistics
14. Preference-based language model (1)
- 1. Execute the query on the best peer
- 2. Assume the first top-k results are relevant (pseudo-relevance feedback)
- 3. Estimate a preference-based LM on these top-k documents
- 4. Compute the cross-entropy between each document's LM and the preference-based LM
- 5. Combine this ranking with the LM merging method
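Steps 2-4 above amount to estimating a term distribution from the top-k documents and scoring every candidate document by cross-entropy against it. A minimal sketch, assuming maximum-likelihood unigram models and a tiny floor probability for unseen terms (all document contents are invented):

```python
import math
from collections import Counter

def language_model(docs):
    """Maximum-likelihood unigram LM over a set of tokenized documents."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cross_entropy(pref_lm, doc_lm, eps=1e-9):
    """H(pref, doc) = -sum_t P_pref(t) * log P_doc(t); lower = more similar."""
    return -sum(p * math.log(doc_lm.get(t, eps)) for t, p in pref_lm.items())

# Steps 2-3: assume the top-k results are relevant, build the preference LM.
top_k = [["p2p", "search", "merging"], ["p2p", "index"]]
pref = language_model(top_k)

# Step 4: rank candidate documents by cross-entropy to the preference LM.
cands = {"d1": ["p2p", "search", "p2p"], "d2": ["cooking", "recipes"]}
ranked = sorted(cands, key=lambda d: cross_entropy(pref, language_model([cands[d]])))
print(ranked)  # ['d1', 'd2'] -- d1 is closer to the preference model
```

Step 5, combining this preference ranking with the globally normalized LM merging score, is described on the next slide.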
15. Preference-based language model (2)
- a globally normalized similarity score
- a preference-based similarity score
- both are combined into the final result merging score
- where
  - Q: query
  - t_k: a term in Q
  - G: the entire document set over all peers
  - D_ij: a document
  - U: the set of pseudo-relevant documents
16. Experiments: IDEAL database ranking, preference-based language model merging
17. Conclusions
- All merging algorithms are very close in absolute retrieval effectiveness
- Language-modeling methods are more effective than TF-IDF-based methods
- Limited statistics are a reasonable choice in a peer-to-peer setting
- Pseudo-relevance feedback from the topically organized collections slightly improves retrieval quality