Title: Result Merging in a Peer-to-Peer Web Search Engine MINERVA
1. Result Merging in a Peer-to-Peer Web Search Engine MINERVA
Master thesis project
Speaker: Sergey Chernov
Supervisor: Prof. Dr. Gerhard Weikum
Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group
2. Overview
- 1. Result merging problem in the MINERVA system
- 2. Selected merging strategies: GIDF, ICF, CORI, LM
- 3. Our approach: result merging with the preference-based language model
- 4. Summary and future work
3. Problems of present Web Search Engines
- Size of the indexable Web
  - The Web is huge; it is difficult to cover it all
  - Timely re-crawls are required
  - Deep Web
- Monopoly of Google
  - Controls 80% of web search requests
  - Sites may be censored by the engine
- Make use of Peer-to-Peer technology
  - Exploit previously unused CPU/memory/disk power
  - Keep up-to-date results for small portions of the Web
  - Conquer the Deep Web with specialized web crawlers
4. MINERVA project
- MINERVA is a Peer-to-Peer Web search engine
[Figure: Chord ring. Each peer P_i runs a local search engine with an index on its crawled pages (C_i) and posts representative statistics (S_i) into a distributed directory based on the Chord protocol.]
5. Query processing in distributed IR
[Figure: query processing pipeline. Selection: <P, q> -> P'. Retrieval: <P', q> -> R1, R2, R3, ... Merging: <<R1, R2, R3, ...>, q> -> Rm.]
- q: query
- P: set of peers; P': subset of peers most relevant for q
- Ri: ranked result list of documents from peer Pi
- Rm: merged result list of documents
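The selection, retrieval, and merging phases above can be sketched in a few lines of Python. This is an illustrative toy, not the MINERVA API: the peer records, quality estimates, and scores are invented, and the merge step is the naive score sort discussed on the next slide.

```python
# Hypothetical sketch of the distributed query pipeline: select the most
# promising peers P' from P, retrieve a ranked list R_i from each, merge into R_m.

def select_peers(peers, q, k=2):
    """Selection: rank peers by a quality estimate for q, keep the top k."""
    return sorted(peers, key=lambda p: p["quality"].get(q, 0.0), reverse=True)[:k]

def retrieve(peer, q):
    """Retrieval: each selected peer returns its local ranked list for q."""
    return peer["results"].get(q, [])

def merge(result_lists):
    """Merging: flatten the per-peer lists and sort by (raw) score."""
    all_hits = [hit for r in result_lists for hit in r]
    return sorted(all_hits, key=lambda hit: hit[1], reverse=True)

peers = [
    {"id": "P1", "quality": {"chord": 0.9}, "results": {"chord": [("d1", 0.8), ("d2", 0.5)]}},
    {"id": "P2", "quality": {"chord": 0.7}, "results": {"chord": [("d3", 0.9)]}},
    {"id": "P3", "quality": {"chord": 0.1}, "results": {"chord": []}},
]
selected = select_peers(peers, "chord")                 # P' = [P1, P2]
merged = merge([retrieve(p, "chord") for p in selected])
print(merged)                                           # [('d3', 0.9), ('d1', 0.8), ('d2', 0.5)]
```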
6. Naive merging approaches
- How can we combine results from peers?
- 1. Retrieve the k documents with the highest similarity scores
  - Problem: scores are incomparable across peers
- 2. Take the same number of documents from each peer
  - Problem: database quality differs across peers
- 3. Fetch the best documents from the peers, re-rank them, and select the top-k
  - Problem: a good solution, but how do we compute the final scores?
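The first two naive strategies can be contrasted directly (the peer lists and scores are made up for illustration): sorting by raw score lets a peer with inflated local statistics dominate, while round-robin ignores differences in collection quality.

```python
from itertools import chain, zip_longest

# Per-peer ranked lists of (doc_id, local score). The scores come from
# different local statistics, so comparing them directly is misleading.
r1 = [("a", 9.1), ("b", 8.7)]      # peer 1: large raw term weights
r2 = [("c", 0.93), ("d", 0.90)]    # peer 2: normalized scores in [0, 1]

# Approach 1: global sort by raw score -- peer 1 dominates regardless of quality.
by_score = sorted(chain(r1, r2), key=lambda hit: hit[1], reverse=True)

# Approach 2: round-robin, the same number of documents from each peer --
# ignores that one peer's collection may simply be better for this query.
round_robin = [hit for pair in zip_longest(r1, r2) for hit in pair if hit]

print([d for d, _ in by_score])     # ['a', 'b', 'c', 'd']
print([d for d, _ in round_robin])  # ['a', 'c', 'b', 'd']
```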
7. Result merging problem
- Objective: make scores completely comparable
- Solution: replace all collection-dependent statistics with global ones
- Baseline: estimate document scores as if the documents were placed in a single database
- Difficulty: overlapping collections influence the statistics used for score estimation
- Methods:
  - GIDF: Global Inverted Document Frequency
  - ICF: Inverted Collection Frequency
  - CORI: the merging scheme used in the CORI system
  - LM: Language Modeling
[Figure: a document's score computed at its peer (LeftScore) should match the score it would receive in the single combined database DB123 (RightScore).]
8. Selected result merging methods (1)
- GIDF: compute a Global Inverted Document Frequency
  GIDF(t) = log( sum_i |D_i| / sum_i DF_i(t) )
  - DF_i(t): number of documents containing term t on peer i
  - |D_i|: overall number of documents on peer i
- ICF: replace IDF with an Inverted Collection Frequency value
  ICF(t) = log( |C| / CF(t) )
  - CF(t): number of peers whose collection contains term t
  - |C|: number of collections (peers) in the system
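A sketch of how the two global statistics fall out of the per-peer counts, assuming the plain log-ratio forms above (the counts are invented; MINERVA's exact variants may add smoothing constants):

```python
import math

# Per-peer statistics for one term t:
# df[i] = number of documents on peer i containing t, n[i] = |D_i|.
df = {"P1": 20, "P2": 5, "P3": 0}
n  = {"P1": 1000, "P2": 400, "P3": 250}

# GIDF: a single global IDF, as if all documents sat in one database.
gidf = math.log(sum(n.values()) / sum(df.values()))

# ICF: collection-level analogue of IDF -- CF counts peers that have t at all.
cf = sum(1 for peer in df if df[peer] > 0)
icf = math.log(len(df) / cf)

print(round(gidf, 3))  # log(1650 / 25) ~ 4.190
print(round(icf, 3))   # log(3 / 2) ~ 0.405
```

Replacing each peer's local IDF with GIDF (or ICF) during scoring is what makes the per-peer scores comparable in a single merged ranking.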
9. Selected result merging methods (2)
- CORI: COllection Retrieval Inference network
  FinalScore = (LocalScore + 0.4 * LocalScore * DatabaseRank) / 1.4
  - DatabaseRank: obtained during the database selection step
  - LocalScore: score computed with local statistics
  - the constants are heuristics tuned for the INQUERY search engine
- LM: Language Modeling
  P(q | D) = lambda * P(q | document LM) + (1 - lambda) * P(q | global LM)
  - lambda: smoothing parameter, a heuristic tradeoff between the two models
  - global language model: estimated from all documents on all peers
  - document language model: estimated from the individual document
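Both combination rules fit in a few lines. The CORI constants 0.4 and 1.4 are the published INQUERY heuristics from the slide; the LM score uses the linear interpolation above with an illustrative lambda, computed in log space. The example inputs are invented.

```python
import math

def cori_merge(local_score, db_rank):
    """CORI result merging: scale the local score by the database rank."""
    return (local_score + 0.4 * local_score * db_rank) / 1.4

def lm_score(query_terms, doc_lm, global_lm, lam=0.5):
    """Smoothed LM query likelihood, in log space:
    log P(q|D) = sum_t log( lam * P(t|doc) + (1 - lam) * P(t|global) )."""
    return sum(
        math.log(lam * doc_lm.get(t, 0.0) + (1 - lam) * global_lm.get(t, 1e-9))
        for t in query_terms
    )

print(cori_merge(0.7, 0.5))  # (0.7 + 0.4*0.7*0.5) / 1.4 = 0.6
doc_lm = {"chord": 0.02, "ring": 0.01}
global_lm = {"chord": 0.001, "ring": 0.002}
print(lm_score(["chord", "ring"], doc_lm, global_lm))
```

Because the global language model is the same on every peer, the smoothed scores are directly comparable across peers, which is why LM merging needs no further normalization.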
10. Experimental setup
- TREC-2002, 2003, and 2004 Web Track datasets
- 4 topics
- 50 peers, 10-15 per topic
- documents are replicated twice
- 25 title queries from the topic distillation task of the 2002 and 2003 Web Tracks
- 3 database selection algorithms:
  - RANDOM: peers ranked in random order
  - CORI: the de-facto standard
  - IDEAL: a manually created ranking
11. Experiments: CORI database ranking, all merging methods
12. Experiments: all database rankings, the best merging method (LM)
13. Experiments: IDEAL database ranking, the best merging method (LM), limited statistics
14. Preference-based language model (1)
- 1. Execute the query on the best peer
- 2. Assume the first top-k results are relevant (pseudo-relevance feedback)
- 3. Estimate a preference-based LM on these top-k documents
- 4. Compute the cross-entropy between each document's LM and the preference-based LM
- 5. Combine this ranking with the LM merging method
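Steps 2-4 above amount to estimating a term distribution from the top-k documents and scoring every candidate document by cross-entropy against it. A minimal sketch, assuming maximum-likelihood unigram models and a tiny floor probability for unseen terms (all document contents are invented):

```python
import math
from collections import Counter

def language_model(docs):
    """Maximum-likelihood unigram LM over a set of tokenized documents."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cross_entropy(pref_lm, doc_lm, eps=1e-9):
    """H(pref, doc) = -sum_t P_pref(t) * log P_doc(t); lower = more similar."""
    return -sum(p * math.log(doc_lm.get(t, eps)) for t, p in pref_lm.items())

# Steps 2-3: assume the top-k results are relevant, build the preference LM.
top_k = [["p2p", "search", "merging"], ["p2p", "index"]]
pref = language_model(top_k)

# Step 4: rank candidate documents by cross-entropy to the preference LM.
cands = {"d1": ["p2p", "search", "p2p"], "d2": ["cooking", "recipes"]}
ranked = sorted(cands, key=lambda d: cross_entropy(pref, language_model([cands[d]])))
print(ranked)  # ['d1', 'd2'] -- d1 is closer to the preference model
```

Step 5, combining this preference ranking with the globally normalized LM merging score, is described on the next slide.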
15. Preference-based language model (2)
- a globally normalized similarity score
- a preference-based similarity score
- both are combined into the final result merging score
- where
  - Q: query
  - t_k: a term in Q
  - G: the entire document set over all peers
  - D_ij: a document
  - U: the set of pseudo-relevant documents
16. Experiments: IDEAL database ranking, preference-based language model merging
17. Conclusions
- All merging algorithms are very close in absolute retrieval effectiveness
- Language-modeling methods are more effective than TF-IDF-based methods
- Limited statistics are a reasonable choice in a peer-to-peer setting
- Pseudo-relevance feedback from the topically organized collections slightly improves retrieval quality