1
Result Merging in a Peer-to-Peer Web Search Engine MINERVA
Master thesis project
Speaker: Sergey Chernov
Supervisor: Prof. Dr. Gerhard Weikum
Saarland University, Max Planck Institute for Computer Science, Database and Information Systems Group
2
Overview
  • 1. Result merging problem in the MINERVA system
  • 2. Selected merging strategies: GIDF, ICF, CORI, LM
  • 3. Our approach: result merging with a preference-based language model
  • 4. Summary and future work

3
Problems of present Web Search Engines
  • Size of the indexable Web
  • The Web is huge; it is difficult to cover all of it
  • Timely re-crawls are required
  • The Deep Web is not covered
  • Monopoly of Google
  • Controls about 80% of web search requests
  • Sites may be censored by the engine
  • Make use of Peer-to-Peer technology
  • Exploit previously unused CPU/memory/disk power
  • Keep up-to-date results for small portions of the Web
  • Conquer the Deep Web with specialized web crawlers

4
MINERVA project
  • MINERVA is a Peer-to-Peer Web search engine

[Figure: MINERVA architecture. Peers P1..P6 form a Chord ring; each peer Pi runs a local search engine, holds an index Si over its crawled pages Ci, and posts representative statistics to a distributed directory based on the Chord protocol.]
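Such a Chord-based directory assigns each key (for example, a term whose statistics a peer publishes) to its successor peer on the identifier ring. A minimal Python sketch, assuming SHA-1 hashing and 32-bit IDs; all names are illustrative, not MINERVA code:

    import hashlib

    def chord_id(key: str, bits: int = 32) -> int:
        # Hash a key (a term or a peer name) onto the Chord identifier ring.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** bits)

    def responsible_peer(term: str, peer_ids: list[int]) -> int:
        # A key is assigned to its successor: the first peer ID that is
        # greater than or equal to the key's ID, wrapping around the ring.
        tid = chord_id(term)
        candidates = sorted(peer_ids)
        for pid in candidates:
            if pid >= tid:
                return pid
        return candidates[0]  # wrap around

    # Example: six peers publish per-term statistics to the term's successor.
    peers = [chord_id("P%d" % i) for i in range(1, 7)]
    print(responsible_peer("merging", peers))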
5
Query processing in distributed IR
[Figure: query processing pipeline. Selection: from <P, q>, the full peer set and the query, pick the subset P' of peers most relevant for q. Retrieval: send <P', q> to the selected peers (e.g. P1, P2, P3); each returns a ranked list Ri. Merging: combine <<R1, R2, R3, ...>, q> into the single list RM.]
Legend: q = query, P = set of peers, P' = subset of peers most relevant for q, Ri = ranked result list of documents from peer Pi, RM = merged result list of documents
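A minimal self-contained sketch of this select / retrieve / merge loop; the data layout, the quality scores, and the naive merge by raw score are illustrative assumptions, not MINERVA code:

    def select_peers(q, peers, n=3):
        # Selection: rank peers by a per-query quality estimate, keep P'.
        return sorted(peers, key=lambda p: p["quality"].get(q, 0.0), reverse=True)[:n]

    def local_search(peer, q, k):
        # Retrieval: a selected peer returns its top-k (doc, local_score) pairs.
        hits = sorted(peer["index"].get(q, []), key=lambda h: h[1], reverse=True)
        return hits[:k]

    def process_query(q, peers, k=10):
        selected = select_peers(q, peers)
        result_lists = [local_search(p, q, k) for p in selected]
        # Merging: naive concatenation by raw local score -- exactly the
        # step the rest of this talk addresses, since raw scores from
        # different peers are not comparable.
        merged = [hit for lst in result_lists for hit in lst]
        return sorted(merged, key=lambda h: h[1], reverse=True)[:k]

    peers = [
        {"quality": {"minerva": 0.9}, "index": {"minerva": [("d1", 0.8), ("d2", 0.5)]}},
        {"quality": {"minerva": 0.4}, "index": {"minerva": [("d3", 0.7)]}},
    ]
    print(process_query("minerva", peers, k=2))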
6
Naive merging approaches
  • How can we combine results from the peers?
  • 1. Retrieve the k documents with the highest similarity scores
  • Problem: local scores are not comparable across peers
  • 2. Take the same number of documents from each peer
  • Problem: database quality differs between peers
  • 3. Fetch the best documents from the peers, re-rank them, and select the top-k (see the sketch below)
  • Problem: a good solution, but how do we compute the final scores?
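A minimal sketch of approach 3 for a single query term, assuming that comparable global statistics (global_df and n_docs below) are somehow already available; obtaining such statistics is exactly the problem the following slides address:

    import math

    def rerank_top_k(fetched, global_df, n_docs, k=10):
        # fetched: (doc_id, term_freq) pairs pooled from all peers.
        # Score every fetched document with one shared TF x IDF function,
        # so the final scores are comparable across peers.
        idf = math.log(n_docs / global_df)
        scored = [(doc, tf * idf) for doc, tf in fetched]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

    pool = [("d1", 3), ("d3", 5), ("d2", 1)]
    print(rerank_top_k(pool, global_df=120, n_docs=10000, k=2))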

7
Result merging problem
  • Objective: make scores completely comparable
  • Solution: replace all collection-dependent statistics with global ones
  • Baseline: estimate document scores as if all documents were placed in a single database
  • Difficulty: overlap between the peer collections distorts the statistics used for score estimation
  • Methods:
  • GIDF: Global Inverse Document Frequency
  • ICF: Inverse Collection Frequency
  • CORI: merging as used in the CORI system
  • LM: Language Modeling

[Figure: the goal is that a document's score computed on an individual peer P (LeftScore) equals its score as if computed in a single combined database DB123 (RightScore).]
8
Selected result merging methods (1)
  • GIDF: compute a Global Inverse Document Frequency
  • DFi: number of documents containing the term on peer i
  • Di: overall number of documents on peer i
  • ICF: replace IDF with an Inverse Collection Frequency value
  • CF: number of peers holding the term
  • C: number of collections (peers) in the system
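The formulas on this slide were images lost in the transcript; a plausible LaTeX reconstruction from the definitions above (a sketch, not necessarily the exact slide notation):

    \mathrm{GIDF}(t) = \log \frac{\sum_i |D_i|}{\sum_i \mathrm{DF}_i(t)}, \qquad
    \mathrm{ICF}(t) = \log \frac{|C|}{\mathrm{CF}(t)}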

9
Selected result merging methods (2)
  • CORI: COllection Retrieval Inference network
  • DatabaseRank: obtained during the database selection step
  • LocalScore: score computed with local statistics
  • the constants are heuristics tuned for the INQUERY search engine
  • LM: Language Modeling
  • λ: smoothing parameter, a heuristic trade-off between two models
  • P(q | G): global language model built from all documents on all peers
  • P(q | D): document language model
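The slide formulas were again images; hedged reconstructions, consistent with the published CORI merging heuristic and with standard linear (Jelinek-Mercer) smoothing, though the exact slide versions may differ:

    \mathrm{Score}_{\mathrm{CORI}}(d) = \frac{\mathrm{LocalScore}(d) + 0.4 \cdot \mathrm{LocalScore}(d) \cdot \mathrm{DatabaseRank}(C)}{1.4}

    P(q \mid D) = \prod_{t \in q} \bigl( \lambda \, P(t \mid D) + (1 - \lambda) \, P(t \mid G) \bigr)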
10
Experimental setup
  • TREC 2002, 2003, and 2004 Web Track datasets
  • 4 topics
  • 50 peers, 10-15 per topic
  • documents are replicated twice
  • 25 title queries from the topic distillation task (2002 and 2003 Web Track)
  • 3 database selection algorithms:
  • RANDOM: peers ranked randomly
  • CORI: the de-facto standard
  • IDEAL: a manually created ranking

11
Experiments: CORI database ranking, all merging methods
12
Experiments: all database rankings, the best merging method (LM)
13
Experiments: IDEAL database ranking, the best merging method (LM), limited statistics
14
Preference-based language model (1)
  • 1. Execute the query on the best peer
  • 2. Assume the first top-k results are relevant (pseudo-relevance feedback)
  • 3. Estimate a preference-based LM from these top-k documents
  • 4. Compute the cross-entropy between each document's LM and the preference-based LM
  • 5. Combine this ranking with the LM merging method (see the sketch below)
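A minimal self-contained sketch of these five steps over unigram models; the add-one smoothing, the mixing weight beta, and all names are illustrative assumptions, not the thesis implementation:

    import math
    from collections import Counter

    def unigram_lm(docs, vocab):
        # Maximum-likelihood unigram model over a set of tokenized docs,
        # with add-one smoothing so the cross-entropy stays finite.
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        return {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}

    def cross_entropy(pref_lm, doc_lm):
        # H(pref, doc) = -sum_t P(t | pref) * log P(t | doc); lower is better.
        return -sum(p * math.log(doc_lm[t]) for t, p in pref_lm.items())

    def preference_rerank(results, k=2, beta=0.5):
        # results: (doc_tokens, lm_merge_score) pairs from the best peer,
        # already ranked by the LM merging method (steps 1-2).
        vocab = {t for d, _ in results for t in d}
        pref_lm = unigram_lm([d for d, _ in results[:k]], vocab)      # step 3
        rescored = []
        for d, lm_score in results:
            ce = cross_entropy(pref_lm, unigram_lm([d], vocab))       # step 4
            rescored.append((d, beta * lm_score - (1 - beta) * ce))   # step 5
        return sorted(rescored, key=lambda x: x[1], reverse=True)

    docs = [(["p2p", "search", "merging"], 0.9), (["chord", "ring"], 0.6)]
    print(preference_rerank(docs, k=1))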

15
Preference-based language model (2)
  • a globally normalized similarity score
  • a preference-based similarity score
  • both are combined into the final result merging scores
  • where:
  • Q: query
  • tk: a term in Q
  • G: the entire document set over all peers
  • Dij: document j on peer i
  • U: the set of pseudo-relevant documents
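The score formulas on this slide were images; a plausible reconstruction from the legend above (the combination weight \beta is an assumption):

    \mathrm{sim}_{\mathrm{glob}}(Q, D_{ij}) = \prod_{t_k \in Q} \bigl( \lambda \, P(t_k \mid D_{ij}) + (1 - \lambda) \, P(t_k \mid G) \bigr)

    \mathrm{sim}_{\mathrm{pref}}(Q, D_{ij}) = \sum_{t_k \in Q} P(t_k \mid U) \, \log P(t_k \mid D_{ij})

    \mathrm{Score}(Q, D_{ij}) = \beta \cdot \mathrm{sim}_{\mathrm{glob}}(Q, D_{ij}) + (1 - \beta) \cdot \mathrm{sim}_{\mathrm{pref}}(Q, D_{ij})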


16
Experiments: IDEAL database ranking, preference-based language model merging
17
Conclusions
  • All merging algorithms are very close in absolute retrieval effectiveness
  • Language-modeling methods are more effective than TF-IDF-based methods
  • Limited statistics are a reasonable choice in a peer-to-peer setting
  • Pseudo-relevance feedback from topically organized collections slightly improves retrieval quality