Title: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys
1Scalable Peer-to-Peer Web Retrieval with Highly
Discriminative Keys
- Ivana Podnar Žarko, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
- School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
- FER, University of Zagreb, Croatia
- Contact: karl.aberer_at_epfl.ch
2Contents
- Motivation
- Indexing and Retrieval model (HDKs)
- Scalability analysis
- Experimental results
- Conclusion
3Motivation
- Clustered retrieval engines are reaching scalability limits
  - Fast-growing public Web
  - Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo
  - Dynamically changing content
- P2P retrieval as a scalable alternative
  - Involve a large number of peer machines (millions)
  - Exploit scalable P2P search techniques
  - Support community-oriented search
4P2P full text retrieval
- Goals
  - retrieval performance comparable to state-of-the-art engines
  - scalable in terms of generated traffic (indexing and retrieval)
- Two basic approaches
  - Document partitioning → unstructured overlay network for search (e.g. Gnutella)
  - Term partitioning → structured overlay network for search (e.g. Chord, P-Grid)
- Problem: communication cost for search [Li et al., IPTPS 2003]
  - Document partitioning: broadcast search
  - Term partitioning: long posting lists transmitted over the network, in particular when processing multi-term queries
5Approach
- Some facts about web retrieval
  - queries are in general short (on average 2 to 3 terms)
  - users pose queries containing frequent terms
  - users are interested in a few high-precision answers (fast)
- Full-text information retrieval engine built over a structured P2P network, specifically considering these observations
- ALVIS PEERS
  - EU FP6 research project (2004-2006)
7P2PIR Architecture
- Structured P2P network with N peers
  - logarithmic lookup cost for keys
- Large document collection D
- Each peer a) indexes part of the global collection D (Pi) and b) maintains part of the global index
[Figure: three IR peers connected through the P2P layer. Each IR peer consists of a Web service interface (IF), Ranking, HDK Indexing/Querying, LI, and GKI. LI = local single-term index; GKI = global key index, storing (k, postinglist(k)).]
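As a rough illustration of how the global key index could be spread over peers, the sketch below assigns each key to one responsible peer by hashing. It is not the ALVIS implementation: a plain modulo placement stands in for the routing of a real structured overlay, and the names `responsible_peer` and `GlobalKeyIndex` are hypothetical.

```python
import hashlib
from collections import defaultdict


def responsible_peer(key, num_peers):
    """Map a key (a tuple of terms) to a peer id via a stable hash.

    Assumption: modulo placement replaces the routing of a real
    structured overlay such as Chord or P-Grid.
    """
    canonical = " ".join(sorted(key))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class GlobalKeyIndex:
    """Toy global key index: one dictionary fragment per peer."""

    def __init__(self, num_peers):
        self.num_peers = num_peers
        self.fragments = [defaultdict(set) for _ in range(num_peers)]

    def publish(self, key, doc_ids):
        # Any peer can publish postings; they land on the responsible peer.
        peer = responsible_peer(key, self.num_peers)
        self.fragments[peer][tuple(sorted(key))].update(doc_ids)

    def lookup(self, key):
        # A lookup contacts only the single peer responsible for the key.
        peer = responsible_peer(key, self.num_peers)
        return self.fragments[peer].get(tuple(sorted(key)), set())
```

Sorting the terms before hashing makes a key independent of term order, so (t1, t2) and (t2, t1) resolve to the same peer.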
8Single-term P2P indexing
[Figure: single-term P2P indexing, where a key is a single term. A querying peer issues Q = {t1, t2}. Local indexes: Peer1: t1 → d1, d2; t2 → d1, d3. Peer2: t1 → d4, d5; t2 → d6. Peer3: t1 → d7, d8; t2 → d7.]
9HDK-based P2P indexing
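[Figure: HDK-based P2P indexing, where a key is a set of terms. A querying peer issues Q = {t1, t2, t3}. Local indexes: Peer1: t1 → d1, d2; t2 → d1, d3. Peer2: t1 → d4, d5; t2 → d6; t3 → d5, d6. Peer3: t1 → d7, d8; t2 → d8; t3 → d7.]
Retrieval traffic is bounded by DFmax and query size!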
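A minimal sketch of why the traffic stays bounded, assuming a query is mapped to every subset of its terms up to smax terms and that each key's posting list holds at most DFmax entries. The function names and the dictionary-based `global_index` are hypothetical stand-ins for lookups in the P2P network.

```python
from itertools import combinations


def query_keys(query_terms, s_max):
    """All term subsets of size 1..s_max, each as a sorted tuple."""
    terms = sorted(set(query_terms))
    return [combo
            for size in range(1, min(s_max, len(terms)) + 1)
            for combo in combinations(terms, size)]


def retrieve(query_terms, global_index, s_max, df_max):
    """Union the posting lists of all query keys found in the index.

    Returns the candidate documents and the number of postings that
    would have been transferred over the network.
    """
    results = set()
    transferred = 0
    for key in query_keys(query_terms, s_max):
        postings = global_index.get(key, [])[:df_max]  # never more than df_max
        transferred += len(postings)
        results.update(postings)
    return results, transferred
```

Because each lookup returns at most `df_max` postings, the transferred volume depends only on the number of keys generated from the query, not on the collection size.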
10Single-term vs. HDK-based P2P indexing
- Comparable retrieval quality (extended vocabulary)
- Vocabulary size could grow exponentially!
11Keys and key filtering
- Non-Discriminative Keys (NDKs)
  - e.g. t1 is an NDK iff
    - t1 appears in more than DFmax collection documents
  - posting lists truncated to top-DFmax documents
- Highly-Discriminative Keys (HDKs)
  - e.g. (t1, t2) is an HDK iff
    - t1 and t2 appear together in fewer than DFmax collection documents (discriminative w.r.t. the document collection)
    - t1 and t2 are non-discriminative (redundancy filter)
    - t1 and t2 are within a window of size w (proximity filter)
    - the no. of terms comprising a key is limited by smax (size filter)
  - posting lists by definition contain only ≤ DFmax documents
Key filtering enables scalable indexing!
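The filters above can be sketched as a level-wise key generation over a small in-memory collection. This is a centralized toy version under stated assumptions, not the distributed algorithm of the system: a key with exactly DFmax documents is treated as discriminative, NDK posting lists are truncated by document id rather than by rank, and `window_keys` and `build_keys` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def window_keys(terms, size, w):
    """Yield sorted term sets of `size` distinct terms that co-occur
    within w consecutive positions (proximity filter)."""
    seen = set()
    for start in range(len(terms)):
        window = sorted(set(terms[start:start + w]))
        for combo in combinations(window, size):
            if combo not in seen:
                seen.add(combo)
                yield combo


def build_keys(docs, df_max, s_max, w):
    """docs maps doc_id -> list of terms. Returns (hdks, ndks), each
    mapping a key (tuple of terms) to a sorted posting list."""
    hdks, ndks = {}, {}
    for size in range(1, s_max + 1):          # size filter
        postings = defaultdict(set)
        for doc_id, terms in docs.items():
            for key in window_keys(terms, size, w):
                # Redundancy filter: expand only keys whose sub-keys
                # are all non-discriminative.
                if size > 1 and not all(
                        sub in ndks for sub in combinations(key, size - 1)):
                    continue
                postings[key].add(doc_id)
        for key, doc_ids in postings.items():
            if len(doc_ids) > df_max:
                # NDK: keep a truncated list (by doc id here, an assumption).
                ndks[key] = sorted(doc_ids)[:df_max]
            else:
                hdks[key] = sorted(doc_ids)
    return hdks, ndks


docs = {1: ["a", "b", "c"], 2: ["a", "b"], 3: ["a", "c"], 4: ["b", "d"]}
hdks, ndks = build_keys(docs, df_max=2, s_max=2, w=3)
print(hdks)
print(ndks)
```

In this toy collection, "a" and "b" each occur in three documents and become NDKs, so only the pair ("a", "b") is considered at size 2. Pairs involving "c" or "d" are skipped because those terms are already discriminative on their own.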
13Scalability analysis (indexing)
- What is the upper bound on the index size for a very large document collection?
  - D: collection size in no. of terms
  - s: no. of terms comprising a key
  - w: window size
  - IS_s: index size associated with keys of size s
  - Pf,(s-1): probability of NDK occurrences where NDK size is (s-1)
14Scalability analysis (indexing)
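[Figure: rank-frequency curve z(r) plotted over rank r, with frequency levels Ff and Fr marked; the region with Fr ≤ DFmax is labeled HDKs, the remainder NDKs.]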
15Scalability analysis (indexing)
[Figure: rank-frequency curve z(r) plotted over rank r, with frequency levels Ff and Fr marked.]
- C increases with increasing collection size, while a remains constant.
Theorem: The probability Pf,(s-1) of NDK occurrence remains constant!
16Scalability analysis (retrieval)
- Retrieval traffic is bounded by DFmax and the number of keys a query is mapped to (constant)
Scalability is theoretically guaranteed, but what are the constants? Experiments!
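To get a feel for the size of these constants, the bound can be evaluated under one simplifying assumption that the slide does not state: a query with q distinct terms is mapped to at most every term subset of size 1 to smax. The helper names below are hypothetical.

```python
from math import comb


def max_query_keys(query_size, s_max):
    """Upper bound on the number of keys a query can be mapped to."""
    return sum(comb(query_size, s)
               for s in range(1, min(s_max, query_size) + 1))


def max_postings_per_query(query_size, s_max, df_max):
    """Upper bound on postings transferred for one query."""
    return df_max * max_query_keys(query_size, s_max)


print(max_query_keys(3, 3))               # 7
print(max_postings_per_query(3, 3, 400))  # 2800
```

For a three-term query with smax = 3 and DFmax = 400, this gives at most 7 keys and 2,800 postings, regardless of how many documents the collection contains.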
18Experiment
- System fully implemented in Java (available on request)
- Document collection
  - 20,000, 40,000, ..., 140,000 documents from Wikipedia (www.wikipedia.org)
- Query log
  - Wikipedia query log for 2 months (08/2004 and 09/2004)
  - 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits
- No. of peers: 4, 8, ..., 28
  - PCs running RedHat Linux with 1 GB memory
  - 100 Mbit Ethernet
  - Each peer indexes 5,000 documents
- DFmax = 400 or 500, smax = 3, w = 20
19Indexing costs
- HDK vs single-term (ST) indexing
  - experimentally: HDK / ST ratio of 13.9 (for 140,000 documents)
  - theoretically: HDK / ST ratio of 40.7 (overestimated upper bound!)
20Retrieval costs
- Retrieval traffic per query (Wikipedia query log)
  - remains constant with a growing collection size for the HDK approach (linear for single-term)
21Estimated total generated traffic
- Assumptions
  - monthly indexing
  - no. of queries per month: 1.5 × 10^6 (true no. of queries from the Wikipedia log, conservative estimate)
- for 1 billion documents, HDK generates 42 times less overall traffic
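One way to read these assumptions is as a simple monthly traffic model: the index is rebuilt once per month and every query adds its retrieval traffic. The sketch below encodes that reading. The model and the function names are an interpretation, not taken from the slides, and no measured values are filled in; only the query volume comes from the stated assumption.

```python
QUERIES_PER_MONTH = 1.5e6  # query volume stated in the assumptions


def monthly_traffic(indexing_traffic, traffic_per_query,
                    queries_per_month=QUERIES_PER_MONTH):
    """Total traffic per month: one indexing pass plus all queries.

    Both traffic arguments must use the same unit (e.g. postings or bytes).
    """
    return indexing_traffic + queries_per_month * traffic_per_query


def traffic_ratio(single_term, hdk):
    """Ratio of single-term to HDK monthly traffic.

    Each argument is an (indexing_traffic, traffic_per_query) pair.
    """
    return monthly_traffic(*single_term) / monthly_traffic(*hdk)
```

Under this model, a higher indexing cost for HDKs can be offset when the per-query traffic of the single-term approach grows with the collection while the HDK per-query traffic stays constant.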
22Retrieval performance
- Overlap on top 20 documents
  - comparable performance of the HDK-based approach to the centralized single-term engine with BM25
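The slide does not spell out how the overlap is computed. A common definition, assumed here, is the fraction of the reference engine's top-k documents that also appear in the other engine's top-k; `overlap_at_k` is a hypothetical helper.

```python
def overlap_at_k(reference_ranking, candidate_ranking, k=20):
    """Fraction of the reference top-k also present in the candidate top-k."""
    reference_top = set(reference_ranking[:k])
    candidate_top = set(candidate_ranking[:k])
    if not reference_top:
        return 0.0
    return len(reference_top & candidate_top) / len(reference_top)
```

Here `reference_ranking` would be the ranked result list of the centralized BM25 engine and `candidate_ranking` that of the HDK-based engine for the same query.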
23Conclusion
- Novel indexing model based on indexing terms and term sets
- Theoretical scalability model proves the proposed solution scales to large networks in terms of generated traffic, both for indexing and retrieval
- Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system
- Associated resource requirements (storage, bandwidth consumption) grow in a scalable way, as shown by experiments
24Ongoing work
- Further reduce the number of indexing keys using
query-driven indexing to produce and store only
profitable keys for query answering
25Acknowledgement
- The work presented in this paper was carried out
in the framework of the EPFL Center for Global
Computing and supported by the Swiss National
Funding Agency OFES as part of the European FP 6
STREP project ALVIS (002068)