Title: Query-Driven Indexing for Scalable P2P Text Retrieval
1Query-Driven Indexing for Scalable P2P Text
Retrieval
Infoscale'07, June 6-8, 2007, Suzhou, China
- Gleb Skobeltsyn
- EPFL, Switzerland
- June 6, 2007
- Joint work with
- Toan Luu
- Ivana Podnar Žarko
- Martin Rajman
- Karl Aberer
2Goal
- Our goal is to achieve scalable full-text
retrieval with structured P2P networks (DHTs)
- Each peer
- Provides resources (bandwidth, storage)
- Searches the whole network
- Publishes its own documents
3Naïve (single-term) approach
- ... is to distribute the global inverted index in
a DHT
This slide was borrowed from the presentation "Enhancing P2P File-Sharing with an
Internet-Scale Query Processor" by B. T. Loo, J. M. Hellerstein, R. Huebsch,
S. Shenker, I. Stoica
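A minimal sketch of the single-term idea, assuming a toy "DHT" in which a term is mapped to a peer by hashing. The helper names (responsible_peer, build_distributed_index) are hypothetical, and a real DHT would route the lookup through the overlay instead of taking a modulo.

```python
import hashlib
from collections import defaultdict

def responsible_peer(term: str, num_peers: int) -> int:
    """Map a term to a peer id by hashing (stand-in for a real DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers

def build_distributed_index(docs: dict, num_peers: int) -> dict:
    """Return {peer_id: {term: sorted list of doc ids}} for the given documents."""
    peers = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            # Each posting is stored at the peer responsible for the term.
            peers[responsible_peer(term, num_peers)][term].add(doc_id)
    return {p: {t: sorted(ids) for t, ids in idx.items()} for p, idx in peers.items()}

docs = {"d1": "babe ruth baseball", "d2": "ruth 1920 season"}
index = build_distributed_index(docs, num_peers=4)
peer = responsible_peer("ruth", 4)
ruth_postings = index[peer]["ruth"]  # complete, untruncated posting list for "ruth"
```

In this naive scheme the posting list of a frequent term is stored whole at a single peer, which is the cost the later slides address by truncating lists at DFmax.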
4Indexing with Highly Discriminative Keys
[1] I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer. Scalable Peer-to-Peer
Web Retrieval with Highly Discriminative Keys. In ICDE'07, Istanbul, Turkey
5Indexing with HDKs main properties
- Distributed index contains <key, PL> pairs
- Each key corresponds to a term or a set of terms
- Each key is assigned to a posting list (PL)
- Each posting list stores at most DFmax top-ranked document references
- Data-driven key generation
- Each time a new document is indexed, the posting list of some key k can reach
the maximum size DFmax
- This triggers the generation of new keys (k + other frequent keys)
- Proximity filter: a document qualifies for a key t1t2 if t1 is close to t2
(within a window of size w)
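A rough sketch of two of the properties above: the proximity filter and a posting list truncated at DFmax. The function names are hypothetical, and "close" is assumed here to mean that some pair of occurrences is at most w token positions apart; the slides do not spell out the exact window semantics.

```python
def within_window(tokens, t1, t2, w):
    """True if some occurrence of t1 and some occurrence of t2 are at most w positions apart."""
    pos1 = [i for i, t in enumerate(tokens) if t == t1]
    pos2 = [i for i, t in enumerate(tokens) if t == t2]
    return any(abs(i - j) <= w for i in pos1 for j in pos2)

def add_posting(posting_list, doc_id, score, df_max):
    """Add (score, doc_id), keep only the df_max best entries.

    Returns True if an entry had to be dropped, i.e. the list overflowed DFmax.
    In a data-driven scheme this is the event that would trigger key generation.
    """
    posting_list.append((score, doc_id))
    posting_list.sort(reverse=True)          # highest score first
    overflow = len(posting_list) > df_max
    del posting_list[df_max:]                # truncate to the top df_max entries
    return overflow

tokens = "the babe of ruth".split()
pl = []
flags = [add_posting(pl, d, s, df_max=2) for d, s in [("d1", 0.5), ("d2", 0.9), ("d3", 0.7)]]
```

Here the third insertion overflows the list, so only the two best-scored references are retained.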
6HDK exhaustive data driven indexing
- Pros
- The ICDE'07 paper proves that the number of keys grows linearly
- Elegant key generation mechanism
- Low bandwidth during query processing (PLs of limited size)
- Cons
- In practice the number of keys is LARGE: 68M for 0.6M docs
- High bandwidth consumption at indexing
- Problem
- Too many keys are superfluous (almost never used)
7Query Driven Indexing
- Let's index only what is queried!
8Contents
- Introduction
- HDK approach for indexing
- Query-driven approach for indexing/retrieval
- Indexing structure
- Example
- ONM
- Scalability
- Evaluation
- Conclusion
9Query-Driven Index (QDI)
- The Query-Driven Indexing strategy solves the Too-Many-Keys problem
- Avoids maintenance of superfluous keys
- Generates only the keys that are requested by users
- Uses the query log to discover such keys
- Problems
- Indexing of a new key requires a bandwidth-efficient mechanism to obtain the
top-k posting list associated with the key
- Opportunistic Notification Mechanism (smart broadcast)
- An incomplete index degrades the quality of query results
- We show that the degradation is low
10Which keys to index?
- Each single term found in the document collection has to be indexed.
- We call all single-term keys the basic single-term index.
- The posting lists are truncated at DFmax.
- A key k is non-superfluous and can be activated iff
- k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k
derived from the available query log and QFmin is a parameter of our model
(popularity filter).
- k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter
of our model (size filter).
- all immediate sub-keys of k (of size |k|-1) are indexed and their associated
posting lists are truncated (redundancy filter).
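The three filters can be sketched as a single predicate. This is an illustration under assumptions, not the authors' code: keys are modelled as frozensets of terms, qf and index are plain dictionaries, and a posting list is treated as "truncated" when it holds df_max entries.

```python
from itertools import combinations

def can_activate(key, qf, index, qf_min, s_max, df_max):
    """key: frozenset of terms; qf: {key: query frequency}; index: {key: posting list}."""
    # Size filter: 2 <= |k| <= s_max
    if not (2 <= len(key) <= s_max):
        return False
    # Popularity filter: QF(k) >= QFmin
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every immediate sub-key is indexed with a truncated list
    for sub in combinations(sorted(key), len(key) - 1):
        postings = index.get(frozenset(sub))
        if postings is None or len(postings) < df_max:
            return False
    return True

index = {
    frozenset({"a"}): [1, 2, 3],   # truncated at df_max = 3
    frozenset({"b"}): [7],         # short list, not truncated
    frozenset({"c"}): [4, 5, 6],   # truncated at df_max = 3
}
qf = {frozenset({"a", "c"}): 5, frozenset({"a", "b"}): 5}
ac_ok = can_activate(frozenset({"a", "c"}), qf, index, qf_min=3, s_max=3, df_max=3)
ab_ok = can_activate(frozenset({"a", "b"}), qf, index, qf_min=3, s_max=3, df_max=3)
```

In this toy setup ac passes all three filters, while ab is rejected by the redundancy filter because the complete posting list of b already answers any query containing b.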
11QDI Retrieval
- Single term index is generated
- Process abc
- Probe Pabc
- Probe Pab, Pbc and Pac
- Probe Pa, Pb and Pc
- Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
- Contact peers in the list, re-rank the obtained results w.r.t. abc
- Output top-10
- Increment the QF for ab, bc and ac
- Activate (index) ac, which has become popular
12QDI Retrieval 2
- Assume the frequency of b is below DFmax
- Note how the redundancy filter would simplify the lattice in such a case
- (grayed nodes cannot be activated)
13QDI Retrieval 3
- Single term index is generated and ac is indexed
- Process abc
- Probe Pabc
- Probe Pab, Pbc and Pac: obtain the result for ac
- Probe Pb and obtain the result for b
- Contact all peers in the list to re-rank the obtained results w.r.t. abc
- Output top-10
- Increment the QF for ab, bc and ac
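The probing order in the walkthrough (largest key first, then skipping single terms already covered by ac) can be sketched as a top-down traversal of the key lattice. The function name probe_lattice and the dictionary-based index are assumptions for illustration; re-ranking and QF updates are left out.

```python
from itertools import combinations

def probe_lattice(query_terms, index):
    """Probe keys from the largest to the smallest, skipping sub-keys of keys already found.

    Returns {key: posting list} for the keys whose results are used.
    """
    terms = sorted(set(query_terms))
    found = {}
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # A key strictly contained in an already found key is not probed.
            if any(key < hit for hit in found):
                continue
            if key in index:
                found[key] = index[key]
    return found

index = {
    frozenset({"a"}): ["d1", "d2"],
    frozenset({"b"}): ["d3"],
    frozenset({"c"}): ["d4"],
    frozenset({"a", "c"}): ["d2", "d4"],   # ac has been activated
}
used = probe_lattice(["a", "b", "c"], index)
candidates = sorted(set().union(*used.values()))  # documents to re-rank w.r.t. abc
```

With ac indexed, the sketch uses the lists for ac and b only, matching the slide: a and c are never probed.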
14Opportunistic Notification Mechanism
- ONM is used to activate a new multi-term key
- ONM is a smart broadcast with the following features
- It is based on the shower multicast [2]: each peer within a specified range
is contacted only once
- Notifications are small and low-priority => piggybacking
- Broadcast is split into several multicast sessions, each time pruning
low-score documents
- It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer. Range Queries in
Tree-Structured Overlays. In P2P'05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer. Improving the Throughput
of Distributed Hash Tables Using Congestion-Aware Routing. In IPTPS'07
15Scalability
- The retrieval traffic is bounded by a constant due to truncated posting lists
(depends on DFmax and the query size)
- The indexing traffic depends on the number of keys to be activated.
- The number of keys in the HDK approach (UPPER BOUND) is proven to grow
linearly with the number of peers, if each peer provides a limited number of
documents
- In QDI, the number of keys does not depend on the document collection size
but only on the size of the query log
- We can use the QFmin parameter to adjust the tradeoff: indexing traffic <->
retrieval quality
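A back-of-the-envelope sketch of why retrieval traffic is bounded by DFmax and the query size. It assumes the worst case in which every term combination of up to s_max terms is probed and each probe returns a full posting list of DFmax references; the function names are hypothetical and actual traffic is lower, since sub-keys of found keys are skipped.

```python
from math import comb

def max_probes(query_size, s_max):
    """Number of non-empty term combinations of at most s_max terms in a query."""
    return sum(comb(query_size, s) for s in range(1, min(query_size, s_max) + 1))

def max_postings_transferred(query_size, s_max, df_max):
    """Worst-case count of document references fetched for one query."""
    return max_probes(query_size, s_max) * df_max

three_term_probes = max_probes(3, 3)                       # a, b, c, ab, ac, bc, abc
three_term_bound = max_postings_transferred(3, 3, 100)
```

The bound involves only the query size, s_max and DFmax, not the collection size, which is the point of the first bullet above.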
16Contents
- Introduction
- HDK approach for indexing
- Query-driven approach for indexing/retrieval
- Indexing structure
- Example
- ONM
- Scalability
- Evaluation
- Conclusion
17Overlap experiment
- Use the Wikipedia query log (9M queries, 9-10.2004) to build the index
- Randomly choose 3K test queries
- Answer each test query with Google and compare to the union of top-DFmax
Google results for each of its combinations that are indexed according to the
logs.
- Mimics our P2P IR system if Google's ranking is used.
- Example
[Figure: result list of the original query next to the result lists of its
non-superfluous (indexed) combinations; overlap_at_5 = 3/5 = 60%]
18Overlap example
- Cut-n-paste from the simulation log
> id=481, q=what did babe ruth do in the 1920
1920 babe ruth, qf=0 ----> Ov_at_100 = 100
1920 babe, qf=0 ---------> Ov_at_100 = 9
1920 ruth, qf=1 ---------> Ov_at_100 = 33
babe ruth, qf=495 -------> Ov_at_100 = 69
---1920, qf=716 ------------> Ov_at_100 = 1
---babe, qf=3196 -----------> Ov_at_100 = 2
---ruth, qf=1653 -----------> Ov_at_100 = 7
Size = 192, Keys used = 2, Overlap_at_100 = 94
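A minimal sketch of how an overlap figure like the one in the log could be computed, assuming overlap at k is the percentage of the reference top-k (the search engine's answer to the full query) that also appears in the union of the result lists of the indexed keys. The function name and the toy data are hypothetical, not taken from the simulation.

```python
def overlap_at_k(reference_ranking, key_results, k):
    """Percentage of the reference top-k found in the union of the per-key result lists."""
    top_k = reference_ranking[:k]
    if not top_k:
        return 0.0
    union = set().union(*key_results)     # all documents reachable through indexed keys
    hits = sum(1 for doc in top_k if doc in union)
    return 100.0 * hits / len(top_k)

reference = ["u1", "u2", "u3", "u4"]                 # ranking for the original query
key_results = [["u1", "x"], ["u3", "u4", "y"]]       # result lists of two indexed keys
overlap = overlap_at_k(reference, key_results, k=4)  # three of the four reference docs are covered
```

Read this way, the last log line says that the union of 192 results from 2 keys covers 94 of the top-100 results for the original query.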
19Overlap with Google
20Overlap with Yahoo
21Overlap with Google (no/partial/full overlap)
22P2P Index Simulations
- Number of keys depends only on the query log size and QFmin!
- Does not depend on the collection size!
- Number of keys is much smaller than for the HDK approach (68M keys for 650K
docs)
23Real query logs?
- Wikipedia queries are unrealistic (too skewed) as users know what they want.
- Real web queries might perform worse?
- Large-scale experiments with real web queries and the TREC collection in [4]
[4] G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer. Web Text
Retrieval with a P2P Query-Driven Index. To appear in SIGIR'07
24Conclusions
- We presented the query-driven indexing strategy for scalable web text
retrieval with structured P2P networks
- Stores posting lists in a DHT for terms and term combinations
- Stores at most DFmax top document references in a posting list
- Efficiently collects the query statistics in a distributed fashion
- Based on these statistics, activates (indexes) only popular keys
- Computes the result of a multi-term query based only on the index entries
available at the moment: no costly intersections
- We also showed that
- With real query logs our approach achieves good retrieval quality
- The QFmin parameter adjusts the traffic/quality tradeoff
25Last slide
- Thank you for your attention!
- Questions?