Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys


1
Scalable Peer-to-Peer Web Retrieval with Highly
Discriminative Keys

Ivana Podnar Žarko, Martin Rajman, Toan Luu,
Fabius Klemm, Karl Aberer
School of Computer and Communication
Sciences, EPFL, Lausanne, Switzerland
FER, University of Zagreb, Croatia
Contact: karl.aberer_at_epfl.ch

2
Contents

Motivation
Indexing and Retrieval model (HDKs)
Scalability analysis
Experimental results
Conclusion

3
Motivation

Clustered retrieval engines are reaching
scalability limits
Fast growing public Web
Immense volume of privately owned content that
will never be indexed by search engines like
Google or Yahoo
Dynamically changing content
P2P retrieval as a scalable alternative
Involve large number of peer machines (millions)
Exploit scalable P2P search techniques
Support community-oriented search

4
P2P full text retrieval

Goals:
retrieval performance comparable to
state-of-the-art engines
scalable in terms of generated traffic (indexing
and retrieval)
Two basic approaches:
Document partitioning → unstructured overlay
network for search (e.g. Gnutella)
Term partitioning → structured overlay network
for search (e.g. Chord, P-Grid)
Problem: communication cost for search [Li et al.,
IPTPS 2003]
Document partitioning: broadcast search
Term partitioning: long posting lists transmitted
over the network, in particular when processing
multi-term queries

5
Approach

Some facts about web retrieval
queries are in general short (on average 2 to 3
terms)
users pose queries containing frequent terms
users are interested in a few high-precision
answers (fast)
Full-text information retrieval engine built over
a structured P2P network specifically considering
these observations
ALVIS PEERS
EU FP6 research project (2004-2006)

6
Contents

Motivation
Indexing and Retrieval model (HDKs)
Scalability analysis
Experimental results
Conclusion

7
P2PIR Architecture

Structured P2P network with N peers
logarithmic lookup cost for keys
Large document collection D
Each peer a) indexes part of the global
collection D (Pi) and b)
maintains part of the global index

[Figure: several IR peers connected through the P2P layer. Each IR peer comprises a Web service interface (IF), Ranking, HDK Indexing/Querying, a local single-term index (LI), and its part of the global key index (GKI), which stores pairs (k, postinglist(k)).]

8
Single-term P2P indexing

Key: single term
[Figure: a querying peer issues the query Q = {t1, t2}. Local indexes: Peer1: t1 → {d1, d2}, t2 → {d1, d3}; Peer2: t1 → {d4, d5}, t2 → {d6}; Peer3: t1 → {d7, d8}, t2 → {d7}.]

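
As a rough illustration of the single-term scheme on this slide, the following Python sketch merges the three local indexes into one posting list per term and answers Q = {t1, t2} by fetching both full lists. The conjunctive (AND) query semantics, the helper names, and the use of transferred postings as a traffic measure are assumptions made for the example, not details from the slides.

```python
from collections import defaultdict

# Local single-term indexes as read from the slide: peer -> term -> documents.
local_indexes = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge local postings into one global posting list per single-term key."""
    global_index = defaultdict(set)
    for postings in local_indexes.values():
        for term, docs in postings.items():
            global_index[term].update(docs)
    return global_index


def answer_query(global_index, query_terms):
    """Fetch one full posting list per query term and intersect them.

    Returns the matching documents and the number of postings transferred
    to the querying peer (a simple stand-in for retrieval traffic).
    """
    fetched = [global_index.get(term, set()) for term in query_terms]
    transferred = sum(len(postings) for postings in fetched)
    hits = set.intersection(*fetched) if fetched else set()
    return hits, transferred


global_index = build_global_index(local_indexes)
hits, transferred = answer_query(global_index, ["t1", "t2"])
print(sorted(hits), transferred)
```

In this toy setting every posting of t1 and t2 travels over the network even though only two documents match, which is the cost that grows with the collection in the single-term approach.
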
9
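HDK-based P2P indexing

Key: set of terms
[Figure: a querying peer issues the query Q = {t1, t2, t3}. Local indexes: Peer1: t1 → {d1, d2}, t2 → {d1, d3}; Peer2: t1 → {d4, d5}, t2 → {d6}, t3 → {d5, d6}; Peer3: t1 → {d7, d8}, t2 → {d8}, t3 → {d7}.]
Retrieval traffic is bounded by DFmax and query
size!
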
10
Single-term vs. HDK-based P2P indexing

comparable retrieval quality (extended vocabulary)
vocabulary size could grow exponentially!

11
Keys and key filtering

Non-Discriminative Keys (NDKs)
e.g. t1 is an NDK iff
t1 appears in more than DFmax collection
documents
posting lists truncated to top-DFmax documents
Highly-Discriminative Keys (HDKs)
e.g. (t1, t2) is an HDK iff
t1 and t2 appear together in fewer than DFmax
collection documents (discriminative w.r.t. the
document collection)
t1 and t2 are non-discriminative (redundancy
filter)
t1 and t2 are within a window of size w
(proximity filter)
the no. of terms comprising a key is limited by
smax (size filter)
posting lists by definition contain at most DFmax
documents

Key filtering enables scalable indexing!
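
The filters listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy collection, treating a document frequency equal to DFmax as discriminative, and counting a key only where its terms share one window of w tokens are all choices made for the example.

```python
from itertools import combinations


def df(key, docs, w):
    """Count documents in which all terms of `key` fall inside one window of w tokens."""
    terms = set(key)
    return sum(
        any(terms <= set(tokens[i:i + w]) for i in range(len(tokens)))
        for tokens in docs.values()
    )


def classify(key, docs, df_max, s_max, w):
    """Label a candidate key (a tuple of terms) using the slide's filters."""
    if len(key) > s_max:
        return "rejected: size filter"
    n = df(key, docs, w)
    if n == 0:
        return "rejected: no co-occurrence within window"
    if n > df_max:
        return "NDK"  # its posting list would be truncated to the top-DFmax documents
    # Redundancy filter: every sub-key of size s-1 must itself be non-discriminative.
    for sub in combinations(key, len(key) - 1):
        if sub and df(sub, docs, w) <= df_max:
            return "rejected: redundancy filter"
    return "HDK"


# Hypothetical toy collection: document id -> token sequence.
docs = {
    "d1": ["peer", "search", "index"],
    "d2": ["peer", "search", "network"],
    "d3": ["peer", "network", "search"],
    "d4": ["index", "peer", "overlay", "zipf", "search"],
}
DF_MAX, S_MAX, W = 2, 2, 2

for key in [("peer",), ("index",), ("peer", "search"),
            ("peer", "index"), ("peer", "search", "index")]:
    print(key, "->", classify(key, docs, DF_MAX, S_MAX, W))
```

Only sub-keys of size s-1 are checked because document frequency cannot increase when a term is added to a key, so a non-discriminative sub-key implies that all of its own sub-keys are non-discriminative too.
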
12
Contents

Motivation
Indexing and Retrieval model (HDKs)
Scalability analysis
Experimental results
Conclusion

13
Scalability analysis (indexing)

What is the upper bound on the index size for a
very large document collection?

D: collection size in no. of terms
s: no. of terms comprising a key
w: window size
IS_s: index size associated with keys of size s
P_f,(s-1): probability of NDK occurrences where
the NDK size is (s-1)

14
Scalability analysis (indexing)

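Zipf model

[Figure: Zipf curve z(r) plotted against rank r, with the frequency levels Ff and Fr marked (Fr is tied to DFmax) and the regions of NDKs and HDKs labelled.]
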
15
Scalability analysis (indexing)

[Figure: Zipf curve z(r) plotted against rank r, with the frequency levels Ff and Fr marked.]
C increases for an increasing collection size, a
remains constant.

Theorem: the probability P_f,(s-1) of NDK occurrence
remains constant!

16
Scalability analysis (retrieval)

Retrieval traffic is bounded by DFmax and the
number of keys a query is mapped to (constant)

Scalability is theoretically guaranteed, but what
are the constants? Experiments!
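
To make the bound concrete, here is a minimal Python sketch, assuming a query is mapped to every term subset of size at most smax and that each key found in the global key index returns at most DFmax postings. The lookup strategy, the function names, and the toy index are hypothetical; the slides do not specify how a query is mapped to keys.

```python
from itertools import combinations
from math import comb


def candidate_keys(query_terms, s_max):
    """All term subsets of size 1..s_max that the query could be mapped to."""
    terms = sorted(set(query_terms))
    largest = min(s_max, len(terms))
    return [frozenset(c) for s in range(1, largest + 1)
            for c in combinations(terms, s)]


def traffic_bound(query_size, s_max, df_max):
    """Upper bound on transferred postings: DFmax times the number of candidate keys."""
    largest = min(s_max, query_size)
    return df_max * sum(comb(query_size, s) for s in range(1, largest + 1))


def retrieve(global_key_index, query_terms, s_max):
    """Look up every candidate key and collect the postings of those that exist."""
    docs, transferred = set(), 0
    for key in candidate_keys(query_terms, s_max):
        postings = global_key_index.get(key, [])
        transferred += len(postings)
        docs.update(postings)
    return docs, transferred


# Hypothetical global key index with DFmax = 3 (NDK lists already truncated).
DF_MAX, S_MAX = 3, 3
global_key_index = {
    frozenset({"t1"}): ["d1", "d2", "d4"],
    frozenset({"t2"}): ["d1", "d3", "d6"],
    frozenset({"t1", "t2"}): ["d1", "d8"],
}

query = ["t1", "t2", "t3"]
docs, transferred = retrieve(global_key_index, query, S_MAX)
bound = traffic_bound(len(query), S_MAX, DF_MAX)
print(sorted(docs), transferred, bound)
```

The bound depends only on DFmax, smax, and the query length, none of which grows with the collection, which is the sense in which the retrieval traffic stays constant.
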
17
Contents

Motivation
Indexing and Retrieval model (HDKs)
Scalability analysis
Experimental results
Conclusion

18
Experiment

System fully implemented in Java (available on
request)
Document collection:
20,000, 40,000, ..., 140,000 documents from
Wikipedia (www.wikipedia.org)
Query log:
Wikipedia query log for 2 months (08/2004 and
09/2004)
3,000 randomly chosen queries from 2,000,000
unique queries with more than 20 hits
No. of peers: 4, 8, ..., 28
PCs running RedHat Linux with 1 GB memory
100 Mbit Ethernet
Each peer indexes 5,000 documents
DFmax: 400 or 500, smax: 3, w: 20

19
Indexing costs

HDK vs. single-term (ST) indexing
experimentally: HDK / ST ratio of 13.9 (for 140,000
documents)
theoretically: HDK / ST ratio of 40.7 (overestimated
upper bound!)

20
Retrieval costs

Retrieval traffic per query (Wikipedia query
log)
remains constant with a growing collection size
for the HDK approach (linear for single-term)

21
Estimated total generated traffic

Assumptions:
monthly indexing
no. of queries per month: 1.5 × 10^6 (true no. of
queries from the Wikipedia log, conservative
estimate)
For 1 billion documents, HDK generates 42 times
less overall traffic

22
Retrieval performance

Overlap on top 20 documents
comparable performance of the HDK-based approach
to the centralized single-term engine with BM25
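
A top-k overlap measure of this kind can be sketched as follows. The exact definition used in the experiments is not given on the slide, so the formula below (the share of the reference top-k that also appears in the candidate top-k) and the example rankings are assumptions.

```python
def top_k_overlap(reference_ranking, candidate_ranking, k=20):
    """Fraction of the reference top-k documents also present in the candidate top-k."""
    reference = set(reference_ranking[:k])
    candidate = set(candidate_ranking[:k])
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


# Hypothetical rankings for one query: centralized BM25 engine vs. HDK-based P2P engine.
centralized = ["d1", "d2", "d3", "d4", "d5"]
hdk_based = ["d2", "d1", "d9", "d4", "d7"]

print(top_k_overlap(centralized, hdk_based, k=5))
```

The measure ignores the order inside the top k, so it reports how many of the same documents are returned rather than whether they are ranked identically.
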

23
Conclusion

Novel indexing model based on indexing terms and
term sets
Theoretical scalability model proves the proposed
solution scales to large networks in terms of
generated traffic both for indexing and
retrieval
Running P2P prototype that exhibits retrieval
performance fully comparable to a centralized
term-based retrieval system
Associated resource requirements (storage,
bandwidth consumption) grow in a scalable way as
shown by experiments.

24
Ongoing work

Further reduce the number of indexing keys using
query-driven indexing to produce and store only
profitable keys for query answering

25
Acknowledgement

The work presented in this paper was carried out
in the framework of the EPFL Center for Global
Computing and supported by the Swiss National
Funding Agency OFES as part of the European FP 6
STREP project ALVIS (002068)
