Query-Driven Indexing for Scalable P2P Text Retrieval

1
Query-Driven Indexing for Scalable P2P Text Retrieval
Infoscale07, June 6-8, 2007, Suzhou, China
  • Gleb Skobeltsyn
  • EPFL, Switzerland
  • June 6, 2007
  • Joint work with
  • Toan Luu
  • Ivana Podnar Žarko
  • Martin Rajman
  • Karl Aberer

2
Goal
  • Our goal is to achieve scalable full-text
    retrieval with structured P2P networks (DHTs)
  • Each peer
  • Provides resources (bandwidth, storage)
  • Searches the whole network
  • Publishes its own documents

3
Naïve (single-term) approach
  • ... is to distribute the global inverted index in
    a DHT

This slide was borrowed from the presentation
"Enhancing P2P File-Sharing with an Internet-Scale
Query Processor" by B. T. Loo, J. M. Hellerstein,
R. Huebsch, S. Shenker and I. Stoica.
4
Indexing with Highly Discriminative Keys
[1] I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer.
Scalable Peer-to-Peer Web Retrieval with Highly
Discriminative Keys. In ICDE07, Istanbul, Turkey
5
Indexing with HDKs: main properties
  • Distributed index contains (key, PL) pairs
  • Each key corresponds to a term or a set of terms
  • Each key is associated with a posting list (PL)
  • Each posting list stores at most DFmax top-ranked
    document references.
  • Data-driven key generation
  • Each time a new document is indexed, the posting
    list of some key k can reach the maximum size DFmax
  • This triggers the generation of new keys (k
    combined with other frequent keys)
  • Proximity filter: a document qualifies for a key
    t1t2 if t1 is close to t2 (as specified by a window
    size w).
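
A minimal sketch of the data-driven key generation described on this slide, written as a batch job over a toy in-memory collection rather than a DHT. Everything named here is an assumption made for illustration and is not taken from the ICDE07 paper: the function `build_toy_hdk_index`, term frequency as the ranking score, treating a term as frequent once its posting list reaches DFmax, and limiting expansion to two-term keys.

```python
from collections import Counter, defaultdict


def build_toy_hdk_index(docs, df_max, window):
    """docs maps doc_id -> list of terms; returns key (tuple of terms) -> top doc_ids."""
    scored = defaultdict(list)  # key -> [(score, doc_id), ...]

    # Pass 1: every single term becomes a key, scored by term frequency.
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            scored[(term,)].append((tf, doc_id))

    # A term counts as frequent once its posting list has reached df_max.
    frequent = {key[0] for key, pl in scored.items() if len(pl) >= df_max}

    # Pass 2: combine frequent terms into two-term keys (proximity filter:
    # t2 must occur within the next window-1 positions after t1).
    for doc_id, terms in docs.items():
        pairs = Counter()
        for i, t1 in enumerate(terms):
            if t1 not in frequent:
                continue
            for t2 in terms[i + 1:i + window]:
                if t2 in frequent and t2 != t1:
                    pairs[tuple(sorted((t1, t2)))] += 1
        for key, count in pairs.items():
            scored[key].append((count, doc_id))

    # Truncate every posting list to the df_max best-scored documents.
    index = {}
    for key, pl in scored.items():
        pl.sort(key=lambda p: (-p[0], p[1]))
        index[key] = [doc_id for _, doc_id in pl[:df_max]]
    return index


docs = {
    1: ["p2p", "text", "retrieval"],
    2: ["p2p", "text", "index"],
    3: ["text", "p2p", "search"],
    4: ["dht", "routing"],
}
index = build_toy_hdk_index(docs, df_max=2, window=2)
print(index[("p2p", "text")])
```

In this toy run only "p2p" and "text" reach DFmax, so the only multi-term key generated is their combination; rare terms such as "dht" never trigger an expansion.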

6
HDK: exhaustive data-driven indexing
  • Pros
  • The ICDE07 paper proves that the number of keys
    grows linearly with the number of peers
  • Elegant key generation mechanism
  • Low bandwidth consumption during query processing
    (PLs of limited size)
  • Cons
  • In practice the number of keys is LARGE: 68M for
    0.6M docs
  • High bandwidth consumption at indexing
  • Problem
  • Too many keys are superfluous (almost never used)

7
Query Driven Indexing
  • Let's index only what is queried!

8
Contents
  • Introduction
  • HDK approach for indexing
  • Query-driven approach for indexing/retrieval
  • Indexing structure
  • Example
  • ONM
  • Scalability
  • Evaluation
  • Conclusion

9
Query-Driven Index (QDI)
  • The Query-Driven Indexing strategy solves
    the Too-Many-Keys problem
  • Avoids maintenance of superfluous keys
  • Generates only those keys that are requested by
    users
  • Utilizes the query log to discover such keys
  • Problems
  • Indexing a new key requires a
    bandwidth-efficient mechanism to obtain the top-k
    posting list associated with the key
  • Solution: Opportunistic Notification Mechanism
    (smart broadcast)
  • An incomplete index degrades the quality of query
    results
  • We show that the degradation is low

10
Which keys to index?
  • Each single term found in the document collection
    has to be indexed.
  • We call the set of all single-term keys the basic
    single-term index.
  • The posting lists are truncated at DFmax.
  • A key k is non-superfluous and can be activated
    iff
  • k is popular: QF(k) ≥ QFmin, where QF(k) is the
    popularity of the key k derived from the
    available query log and QFmin is a parameter of
    our model (popularity filter).
  • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax,
    where smax is a parameter of our model (size
    filter).
  • all immediate sub-keys of k (of size |k|-1) are
    indexed and their associated posting lists are
    truncated (redundancy filter).
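
The three filters can be read as a single predicate over a candidate key. The sketch below is illustrative only: `can_activate`, `make_key` and the dictionaries and sets passed in are hypothetical names, keys are modeled as frozensets of terms, and the popularity comparison is taken to be "at least QFmin", which is an assumption.

```python
def make_key(*terms):
    """A key is modeled as an (unordered) set of terms."""
    return frozenset(terms)


def can_activate(key, query_freq, indexed, truncated, qf_min, s_max):
    """Return True if `key` passes the size, popularity and redundancy filters."""
    # Size filter: only multi-term keys with at most s_max terms.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: the key must have been queried often enough.
    if query_freq.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every immediate sub-key (one term removed) must be
    # indexed and its posting list must have been truncated at DFmax.
    sub_keys = (key - {term} for term in key)
    return all(s in indexed and s in truncated for s in sub_keys)


indexed = {make_key("a"), make_key("b"), make_key("c")}
truncated = {make_key("a"), make_key("c")}  # b is rare, its list is complete
query_freq = {make_key("a", "c"): 5, make_key("a", "b"): 5, make_key("a", "b", "c"): 9}

print(can_activate(make_key("a", "c"), query_freq, indexed, truncated, qf_min=3, s_max=3))
print(can_activate(make_key("a", "b"), query_freq, indexed, truncated, qf_min=3, s_max=3))
```

Here ac can be activated, while ab cannot: the posting list of b is not truncated, so the single-term entry for b already holds every document that could match ab.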

11
QDI Retrieval
  • Single term index is generated
  • Process abc
  • Probe Pabc
  • Probe Pab Pbc and Pac
  • Probe Pa Pb and Pc
  • Obtain top-DFmax results for a, b and c (ranked
    w.r.t a, b and c respectively)
  • Contact peers in the list, re-rank the obtained
    results w.r.t abc
  • Output top-10
  • Inc. the QF for ab, bc and ac
  • Activate (index) ac

12
QDI Retrieval 2
  • Assume the frequency of b is below DFmax
  • Note how the redundancy filter would simplify
    the lattice in such a case
  • (grayed nodes cannot be activated)

13
QDI Retrieval 3
  • Single-term index is generated and ac is indexed
  • Process abc
  • Probe Pabc
  • Probe Pab, Pbc and Pac: obtain the result for ac
  • Probe Pb and obtain the result for b
  • Contact all peers in the list to re-rank the
    obtained results w.r.t. abc
  • Output top-10
  • Increment the QF for ab, bc and ac
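
The probing order in the two retrieval examples can be sketched as a top-down walk over the lattice of term combinations. This is a simplified, single-process reading of the slides: `probe_lattice` and `candidate_docs` are hypothetical names, the index is a plain dictionary instead of a DHT, and the rule "skip a key that is contained in an already found key" is inferred from the example, where Pa and Pc are no longer probed once ac is found. Re-ranking and the QF updates are left out.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe term combinations from largest to smallest.

    Returns (found, probed): the indexed keys with their posting lists,
    and the keys that were actually looked up, in probing order.
    """
    terms = sorted(set(query_terms))
    found = {}
    probed = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # Skip keys that are proper subsets of a key found earlier.
            if any(key < larger for larger in found):
                continue
            probed.append(key)
            if key in index:
                found[key] = index[key]
    return found, probed


def candidate_docs(found):
    """Union of the retrieved posting lists; these documents get re-ranked."""
    docs = set()
    for posting_list in found.values():
        docs.update(posting_list)
    return docs


index = {
    frozenset("a"): [1, 2, 3],
    frozenset("b"): [2, 4],
    frozenset("c"): [3, 5],
    frozenset("ac"): [3, 6],  # the activated two-term key
}
found, probed = probe_lattice(["a", "b", "c"], index)
print(sorted("".join(sorted(k)) for k in found))  # ['ac', 'b']
print(sorted(candidate_docs(found)))              # [2, 3, 4, 6]
```

With only the single-term index, the same call probes all seven combinations and returns the lists for a, b and c; once ac is indexed, five probes suffice.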

14
Opportunistic Notification Mechanism
  • ONM is used to activate a new multi-term key
  • ONM is a smart broadcast with the following
    features
  • It is based on the shower multicast [2]: each
    peer within a specified range is contacted only
    once
  • Notifications are small and low-priority ->
    piggybacking
  • Broadcast is split into several multicast
    sessions, each time pruning low-score documents
  • It uses the high-performance DHT layer [3]
  • [2] A. Datta, M. Hauswirth, R. Schmidt, R. John,
    K. Aberer. Range Queries in Tree-Structured
    Overlays. In P2P05
  • [3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K.
    Aberer. Improving the Throughput of Distributed
    Hash Tables Using Congestion-Aware Routing. In
    IPTPS'07
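
The effect of splitting the broadcast into sessions can be illustrated with a toy simulation. This is only one plausible reading of the slide, not the ONM protocol itself: the shower multicast, piggybacking and the DHT layer are not modeled, and `collect_top_k`, the per-peer score dictionaries and the threshold rule are assumptions. Each session carries the current k-th best score, and peers reply only with documents that beat it.

```python
import heapq


def collect_top_k(peer_scores, k, sessions):
    """Collect the k best (score, doc_id) pairs from all peers, for k >= 1.

    peer_scores: one dict per peer, mapping doc_id -> score for the new key.
    Returns (top_k sorted best first, number of postings transferred).
    """
    top = []  # min-heap holding the best k postings seen so far
    transferred = 0
    batch = max(1, -(-len(peer_scores) // sessions))  # peers per session (ceil)
    for start in range(0, len(peer_scores), batch):
        # The threshold is fixed when the session starts.
        threshold = top[0][0] if len(top) == k else float("-inf")
        for local in peer_scores[start:start + batch]:
            for doc_id, score in local.items():
                if score <= threshold:
                    continue  # pruned at the peer, never sent
                transferred += 1
                item = (score, doc_id)
                if len(top) < k:
                    heapq.heappush(top, item)
                elif item > top[0]:
                    heapq.heapreplace(top, item)
    return sorted(top, reverse=True), transferred


peers = [
    {1: 0.9, 2: 0.1},
    {3: 0.8, 4: 0.2},
    {5: 0.7, 6: 0.05},
    {7: 0.3, 8: 0.95},
]
print(collect_top_k(peers, k=2, sessions=1))  # one broadcast: all 8 postings sent
print(collect_top_k(peers, k=2, sessions=2))  # two sessions: 5 postings sent
```

Both calls return the same top-2 list; the second transfers fewer postings because the second session already knows that nothing scoring 0.8 or less can enter the result.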

15
Scalability
  • The retrieval traffic is bounded by a constant
    due to truncated posting lists (the bound depends
    on DFmax and the query size)
  • The indexing traffic depends on the number of
    keys to be activated.
  • The number of keys in the HDK approach (UPPER
    BOUND) is proven to grow linearly with the number
    of peers, if each peer provides a limited number
    of documents
  • In QDI, the number of keys does not depend on the
    document collection size but only on the size of
    the query log
  • We can use the QFmin parameter to adjust the
    tradeoff: indexing traffic <-> retrieval
    quality
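
A back-of-the-envelope version of the first bullet, under the assumption that a query probes at most every term combination of size 1 to smax and that each probe returns at most DFmax postings. The function name and the example values are illustrative; traffic for re-ranking is not counted.

```python
from math import comb


def max_postings_per_query(query_size, df_max, s_max):
    """Upper bound on postings fetched from the index for one query."""
    largest = min(query_size, s_max)
    probes = sum(comb(query_size, size) for size in range(1, largest + 1))
    return probes * df_max


# A 3-term query has 7 non-empty term combinations, 6 of them with at most 2 terms.
print(max_postings_per_query(3, df_max=100, s_max=3))  # 700
print(max_postings_per_query(3, df_max=100, s_max=2))  # 600
```

The bound contains no term for the collection size or the number of peers, which is the sense in which the retrieval traffic is constant.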

16
Contents
  • Introduction
  • HDK approach for indexing
  • Query-driven approach for indexing/retrieval
  • Indexing structure
  • Example
  • ONM
  • Scalability
  • Evaluation
  • Conclusion

17
Overlap experiment
  • Use the Wikipedia query log (9M
    queries/9-10.2004) to build the index
  • Randomly choose 3K test queries
  • Answer each test query with Google and compare the
    result to the union of the top-DFmax Google results
    for each of its combinations that are indexed
    according to the logs.
  • This mimics our P2P IR system if Google's ranking
    is used.
  • Example

[Figure: results for the original query compared with
the results for its non-superfluous (indexed)
combinations; overlap_at_5 = 3/5 = 60%]
18
Overlap example
  • Cut-n-paste from the simulation log

>id=481, q=what did babe ruth do in the 1920
1920 babe ruth, qf=0 ----> Ov_at_100=100
1920 babe, qf=0 ---------> Ov_at_100=9
1920 ruth, qf=1 ---------> Ov_at_100=33
babe ruth, qf=495 -------> Ov_at_100=69
---1920, qf=716 ------------> Ov_at_100=1
---babe, qf=3196 -----------> Ov_at_100=2
---ruth, qf=1653 -----------> Ov_at_100=7
Size 192, Keys used 2, Overlap_at_100 94
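
The overlap figures in such a log can be computed along the following lines. This is a sketch with made-up document ids, not the simulation code: `overlap_at_k` is a hypothetical name, and the overlap is reported as a count, which coincides with a percentage only when k is 100.

```python
def overlap_at_k(reference_ranking, key_rankings, k, df_max):
    """Compare the top-k reference results with the union of truncated key results.

    reference_ranking: ranked results for the original query.
    key_rankings: ranked results for each indexed term combination.
    Returns (number of top-k reference results found in the union, union size).
    """
    top_k = set(reference_ranking[:k])
    union = set()
    for ranking in key_rankings:
        union.update(ranking[:df_max])  # posting lists are truncated at DFmax
    return len(top_k & union), len(union)


reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
indexed_keys = [[1, 2, 3, 11, 12], [3, 4, 20]]
print(overlap_at_k(reference, indexed_keys, k=5, df_max=3))  # (4, 5)
```

The second value corresponds to the "Size" field of the log line, the first to the overlap.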
19
Overlap with Google
20
Overlap with Yahoo
21
Overlap with Google (no/partial/full overlap)
22
P2P Index Simulations
  • Number of keys depends only on the query log size
    and QFmin!
  • Does not depend on the collection size!
  • Number of keys is much smaller than for the HDK
    approach: 68M keys for 650K docs

23
Real query logs?
  • Wikipedia queries are unrealistic (too skewed) as
    users know what they want.
  • Real web queries might perform worse?
  • Large-scale experiments with real web queries and
    the TREC collection in [4]
  • [4] G. Skobeltsyn, T. Luu, I. Podnar Žarko,
    M. Rajman, K. Aberer. Web Text Retrieval with a
    P2P Query-Driven Index. To appear in SIGIR07

24
Conclusions
  • We presented the query-driven indexing strategy
    for scalable web text retrieval with structured
    P2P networks
  • Stores posting lists in a DHT for terms and term
    combinations
  • Stores at most DFmax top document references in a
    posting list
  • Efficiently collects the query statistics in a
    distributed fashion
  • Based on these statistics, activates (indexes) only
    popular keys
  • Computes the result of a multi-term query based
    only on the index entries available at the
    moment: no costly intersections
  • We also showed that
  • With real query logs our approach achieves good
    retrieval quality
  • The QFmin parameter adjusts the traffic/quality
    tradeoff

25
Last slide
  • Thank you for your attention!
  • Questions?