Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

1
Scalable Peer-to-Peer Web Retrieval with Highly
Discriminative Keys
  • Ivana Podnar Žarko, Martin Rajman, Toan Luu,
    Fabius Klemm, Karl Aberer
  • School of Computer and Communication
    Sciences, EPFL, Lausanne, Switzerland
  • FER, University of Zagreb, Croatia
  • Contact: karl.aberer_at_epfl.ch

2
Contents
  • Motivation
  • Indexing and Retrieval model (HDKs)
  • Scalability analysis
  • Experimental results
  • Conclusion

3
Motivation
  • Clustered retrieval engines are reaching
    scalability limits
  • Fast growing public Web
  • Immense volume of privately owned content that
    will never be indexed by search engines like
    Google or Yahoo
  • Dynamically changing content
  • P2P retrieval as a scalable alternative
  • Involve large number of peer machines (millions)
  • Exploit scalable P2P search techniques
  • Support community-oriented search

4
P2P full text retrieval
  • Goals
  • retrieval performance comparable to
    state-of-the-art engines
  • scalable in terms of generated traffic (indexing
    and retrieval)
  • Two basic approaches
  • Document partitioning → unstructured overlay
    network for search (e.g. Gnutella)
  • Term partitioning → structured overlay network
    for search (e.g. Chord, P-Grid)
  • Problem: communication cost for search [Li et al.,
    IPTPS 2003]
  • Document partitioning: broadcast search
  • Term partitioning: long posting lists transmitted
    over the network, in particular when processing
    multi-term queries

5
Approach
  • Some facts about web retrieval
  • queries are in general short (on average 2 to 3
    terms)
  • users pose queries containing frequent terms
  • users are interested in a few high-precision
    answers (fast)
  • Full-text information retrieval engine built over
    a structured P2P network specifically considering
    these observations
  • ALVIS PEERS
  • EU FP6 research project (2004-2006)

6
Contents
  • Motivation
  • Indexing and Retrieval model (HDKs)
  • Scalability analysis
  • Experimental results
  • Conclusion

7
P2PIR Architecture
  • Structured P2P network with N peers
  • logarithmic lookup cost for keys
  • Large document collection D
  • Each peer Pi a) indexes its part D(Pi) of the
    global collection D and b)
    maintains part of the global index
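
One way to picture how index keys are spread over peers is a hash ring: each key is hashed, and the first peer whose position follows the key on the ring stores its posting list. The sketch below is illustrative only. `ring_position`, `Ring`, and the peer names are hypothetical, and the code finds the responsible peer using global knowledge of all peers, whereas overlays such as Chord or P-Grid reach that peer in a logarithmic number of routing hops.

```python
import hashlib
from bisect import bisect_left


def ring_position(name, bits=32):
    """Hash a string to an integer position on a ring of 2**bits slots."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


class Ring:
    """Toy key-to-peer assignment (hypothetical helper, not the ALVIS code)."""

    def __init__(self, peer_names):
        # Sorted (position, peer) pairs form the ring.
        self.positions = sorted((ring_position(p), p) for p in peer_names)

    def responsible_peer(self, key_terms):
        # A key is a set of terms; sort them so term order does not matter.
        key_id = ring_position(" ".join(sorted(key_terms)))
        idx = bisect_left(self.positions, (key_id, ""))
        # Wrap around to the first peer when the key hashes past the last one.
        return self.positions[idx % len(self.positions)][1]


ring = Ring(["peer1", "peer2", "peer3", "peer4"])
print(ring.responsible_peer({"t1", "t2"}))
```

Because the terms are sorted before hashing, the key (t1, t2) and the key (t2, t1) always land on the same peer.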

[Figure: several IR peers connected through the P2P layer. Each IR peer consists of a Web service interface (IF), Ranking, HDK Indexing/Querying, LI, and GKI.]
LI: Local single-term index
GKI: Global key index (k, postinglist(k))
8
Single-term P2P indexing
key = single term
[Figure: a querying peer issues Q = {t1, t2}. Local indexes of the three peers:]
Peer1: t1 → d1, d2; t2 → d1, d3
Peer2: t1 → d4, d5; t2 → d6
Peer3: t1 → d7, d8; t2 → d7
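
The cost of single-term indexing can be made concrete with the toy posting lists from this slide. The sketch below is a minimal illustration, assuming the global index simply concatenates the local posting lists per term and that a query transfers every full posting list before intersecting them. The function names and the assignment of posting lists to peers are assumptions.

```python
from collections import defaultdict

# Local indexes as read off the slide (peer assignment is assumed).
LOCAL_INDEXES = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge per-peer posting lists into one posting list per term."""
    global_index = defaultdict(list)
    for peer_index in local_indexes.values():
        for term, docs in peer_index.items():
            global_index[term].extend(docs)
    return dict(global_index)


def single_term_query(global_index, query_terms):
    """Fetch each term's full posting list and intersect them.

    Returns (matching documents, number of postings transferred).
    """
    fetched = [global_index.get(term, []) for term in query_terms]
    transferred = sum(len(postings) for postings in fetched)
    result = set(fetched[0]) if fetched else set()
    for postings in fetched[1:]:
        result &= set(postings)
    return sorted(result), transferred


index = build_global_index(LOCAL_INDEXES)
docs, traffic = single_term_query(index, ["t1", "t2"])
print(docs, traffic)
```

Only two documents contain both terms, yet all ten postings cross the network. With single-term keys the transferred volume grows with the posting lists, and therefore with the collection.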
9
HDK-based P2P indexing
key = set of terms
[Figure: a querying peer issues Q = {t1, t2, t3}. Local indexes of the three peers:]
Peer1: t1 → d1, d2; t2 → d1, d3
Peer2: t1 → d4, d5; t2 → d6; t3 → d5, d6
Peer3: t1 → d7, d8; t2 → d8; t3 → d7
Retrieval traffic is bounded by DFmax and query
size!
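
The traffic bound can be illustrated on the slide's toy data. The sketch below is a simplification under stated assumptions: every term subset of up to `S_MAX` terms is indexed (the real filters are ignored), posting lists are cut to `DF_MAX` entries in document-id order as a stand-in for "top-DFmax", and a query probes every subset of its terms. `DF_MAX = 3` and all function names are hypothetical.

```python
from itertools import combinations

DF_MAX = 3  # hypothetical, tiny so the toy data triggers truncation
S_MAX = 3   # maximum key size

# Term sets per document, derived from the posting lists on the slide.
DOCS = {
    "d1": {"t1", "t2"}, "d2": {"t1"}, "d3": {"t2"}, "d4": {"t1"},
    "d5": {"t1", "t3"}, "d6": {"t2", "t3"}, "d7": {"t1", "t3"},
    "d8": {"t1", "t2"},
}


def term_subsets(terms, s_max):
    """All non-empty subsets of `terms` with at most `s_max` elements."""
    terms = sorted(terms)
    for size in range(1, min(s_max, len(terms)) + 1):
        for combo in combinations(terms, size):
            yield frozenset(combo)


def build_key_index(docs, df_max, s_max):
    """Map each key (term set) to a posting list of at most df_max docs."""
    index = {}
    for doc_id in sorted(docs):
        for key in term_subsets(docs[doc_id], s_max):
            index.setdefault(key, []).append(doc_id)
    # Stand-in for "top-DFmax": keep the first df_max documents by id.
    return {key: postings[:df_max] for key, postings in index.items()}


def hdk_query(index, query_terms, s_max):
    """Probe every term subset of the query.

    Returns (candidate documents, keys probed, postings transferred).
    """
    keys = list(term_subsets(query_terms, s_max))
    results, transferred = set(), 0
    for key in keys:
        postings = index.get(key, [])
        transferred += len(postings)
        results.update(postings)
    return sorted(results), len(keys), transferred


key_index = build_key_index(DOCS, DF_MAX, S_MAX)
docs, n_keys, traffic = hdk_query(key_index, {"t1", "t2", "t3"}, S_MAX)
print(n_keys, traffic, traffic <= DF_MAX * n_keys)
```

However many documents are added, no posting list exceeds `DF_MAX` entries, so the transferred volume depends only on `DF_MAX` and the number of keys derived from the query.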
10
Single-term vs. HDK-based P2P indexing
comparable retrieval quality (extended vocabulary)
vocabulary size could grow exponentially!
11
Keys and key filtering
  • Non-Discriminative Keys (NDKs)
  • e.g. t1 is an NDK iff
  • t1 appears in more than DFmax collection
    documents
  • posting lists truncated to top-DFmax documents
  • Highly-Discriminative Keys (HDKs)
  • e.g. (t1, t2) is an HDK iff
  • t1 and t2 co-occur in fewer than DFmax collection
    documents (discriminative w.r.t. the document
    collection)
  • t1 and t2 are each non-discriminative (redundancy
    filter)
  • t1 and t2 occur within a window of size w
    (proximity filter)
  • the no. of terms comprising a key is limited by
    smax (size filter)
  • posting lists by definition contain only ≤ DFmax
    documents

Key filtering enables scalable indexing!
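
The filters above can be sketched as a level-wise key generation, growing keys one term at a time. This is a minimal sketch under stated assumptions, not the project's implementation: a key with document frequency above `df_max` is treated as an NDK and anything at or below it as an HDK (the slide leaves the boundary case open), NDK posting lists are not truncated here, and `window_term_sets`, `classify_keys`, and the toy documents are hypothetical.

```python
from itertools import combinations


def window_term_sets(tokens, size, w):
    """Sets of `size` distinct terms that co-occur within w consecutive tokens."""
    found = set()
    for start in range(len(tokens)):
        window = sorted(set(tokens[start:start + w]))  # proximity filter
        found.update(frozenset(c) for c in combinations(window, size))
    return found


def classify_keys(docs, df_max, s_max, w):
    """Return (hdks, ndks): dicts mapping key -> sorted list of doc ids."""
    hdks, ndks = {}, {}
    for size in range(1, s_max + 1):  # size filter
        postings = {}
        for doc_id, tokens in docs.items():
            for key in window_term_sets(tokens, size, w):
                # Redundancy filter: every (size-1)-subset must be an NDK.
                if size > 1 and not all(
                    frozenset(sub) in ndks
                    for sub in combinations(key, size - 1)
                ):
                    continue
                postings.setdefault(key, []).append(doc_id)
        for key, doc_ids in postings.items():
            target = ndks if len(doc_ids) > df_max else hdks
            target[key] = sorted(doc_ids)
    return hdks, ndks


DOCS = {
    "d1": ["p2p", "web", "search"],
    "d2": ["p2p", "web", "index"],
    "d3": ["p2p", "search", "engine"],
    "d4": ["web", "index"],
}
hdks, ndks = classify_keys(DOCS, df_max=2, s_max=2, w=3)
print(sorted(sorted(key) for key in hdks))
print(sorted(sorted(key) for key in ndks))
```

In this toy collection "p2p" and "web" each occur in three documents, so they are NDKs and their combination becomes a candidate key. A pair such as (p2p, search) is never generated, because "search" is already discriminative on its own.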
12
Contents
  • Motivation
  • Indexing and Retrieval model (HDKs)
  • Scalability analysis
  • Experimental results
  • Conclusion

13
Scalability analysis (indexing)
  • What is the upper bound on the index size for a
    very large document collection?
  • D: collection size in no. of terms
  • s: no. of terms comprising a key
  • w: window size
  • IS_s: index size associated with keys of size s
  • P_f,(s-1): probability of NDK occurrences where
    NDK size is (s-1)

14
Scalability analysis (indexing)
  • Zipf model

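[Figure: Zipf curve z(r) against rank r, with frequency levels F_f and F_r marked (F_r tied to DFmax), separating the NDK region from the HDK region.]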
15
Scalability analysis (indexing)
  • C increases for an increasing collection size, a
    remains constant

[Figure: Zipf curve z(r) against rank r, with frequency levels F_f and F_r marked.]
Theorem: the probability P_f,(s-1) of NDK occurrence
remains constant!
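
For reference, the constants C and a are those of a Zipf law. The exact parameterization used on the slide is not preserved in this transcript, so the standard form below is an assumption:

```latex
\[
  z(r) = \frac{C}{r^{a}}, \qquad r = 1, 2, \dots
\]
```

Here z(r) is the frequency of the key at rank r. In this reading, a larger collection scales the curve through C while the exponent a stays fixed.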
16
Scalability analysis (retrieval)
  • Retrieval traffic is bounded by DFmax and the
    number of keys a query is mapped to (constant)

Scalability theoretically guaranteed, but what
are the constants? Experiments!
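
As a worked bound, assume (this is not stated on the slide) that a query of q terms is mapped to every subset of its terms of size at most smax. The number of keys and the resulting traffic bound are then:

```latex
\[
  n_{\text{keys}}(q) = \sum_{s=1}^{\min(q,\, s_{\max})} \binom{q}{s},
  \qquad
  \text{traffic}(q) \le DF_{\max} \cdot n_{\text{keys}}(q)
\]
% Example: q = 3, s_max = 3, DF_max = 500
\[
  n_{\text{keys}}(3) = 3 + 3 + 1 = 7,
  \qquad
  \text{traffic}(3) \le 500 \cdot 7 = 3500 \text{ postings}
\]
```

The bound depends only on the query size, smax, and DFmax, not on the collection size.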
17
Contents
  • Motivation
  • Indexing and Retrieval model (HDKs)
  • Scalability analysis
  • Experimental results
  • Conclusion

18
Experiment
  • System fully implemented in Java (available on
    request)
  • Document collection
  • 20,000, 40,000, ..., 140,000 documents from
    Wikipedia (www.wikipedia.org)
  • Query log
  • Wikipedia query log for 2 months (08/2004 and
    09/2004)
  • 3,000 randomly chosen queries from 2,000,000
    unique queries with more than 20 hits
  • No. of peers: 4, 8, ..., 28
  • PCs running RedHat Linux with 1 GB memory
  • 100 Mbit Ethernet
  • Each peer indexes 5,000 documents
  • DFmax = 400 or 500, smax = 3, w = 20

19
Indexing costs
  • HDK vs single-term (ST) indexing
  • experimentally: HDK / ST = 13.9 (for 140,000
    documents)
  • theoretically: HDK / ST = 40.7 (overestimated
    upper bound!)

20
Retrieval costs
  • Retrieval traffic per query (Wikipedia query
    log)
  • remains constant with a growing collection size
    for the HDK approach (linear for single-term)

21
Estimated total generated traffic
  • Assumptions
  • monthly indexing
  • no. of queries per month: 1.5 × 10^6 (true no. of
    queries from the Wikipedia log, conservative
    estimate)
  • for 1 billion documents, HDK generates 42 times
    less overall traffic

22
Retrieval performance
  • Overlap on top 20 documents
  • comparable performance of the HDK-based approach
    to the centralized single-term engine with BM25
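
The overlap measure can be sketched as the share of the reference engine's top-k documents that also appear in the other engine's top-k. The slide does not give the exact definition, so `overlap_at_k`, the division by k, and the example rankings below are assumptions.

```python
def overlap_at_k(reference, candidate, k=20):
    """Fraction of the top-k of `reference` that is also in the top-k of `candidate`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_reference = set(reference[:k])
    top_candidate = set(candidate[:k])
    return len(top_reference & top_candidate) / k


# Hypothetical rankings for one query (document ids only).
centralized = ["d1", "d2", "d3", "d4", "d5"]
hdk_based = ["d2", "d1", "d9", "d4", "d7"]
print(overlap_at_k(centralized, hdk_based, k=5))
```

Averaging this value over all queries in a log gives a single figure for how closely two engines agree on their top results.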

23
Conclusion
  • Novel indexing model based on indexing terms and
    term sets
  • Theoretical scalability model proves the proposed
    solution scales to large networks in terms of
    generated traffic both for indexing and
    retrieval
  • Running P2P prototype that exhibits retrieval
    performance fully comparable to a centralized
    term-based retrieval system
  • Associated resource requirements (storage,
    bandwidth consumption) grow in a scalable way as
    shown by experiments.

24
Ongoing work
  • Further reduce the number of indexing keys using
    query-driven indexing to produce and store only
    profitable keys for query answering

25
Acknowledgement
  • The work presented in this paper was carried out
    in the framework of the EPFL Center for Global
    Computing and supported by the Swiss National
    Funding Agency OFES as part of the European FP6
    STREP project ALVIS (002068)