Query-Driven Indexing for Scalable P2P Text Retrieval


1
Query-Driven Indexing for Scalable P2P Text
Retrieval
Infoscale'07, June 6-8, 2007, Suzhou, China

Gleb Skobeltsyn
EPFL, Switzerland
June 6, 2007

Joint work with
Toan Luu
Ivana Podnar Žarko
Martin Rajman
Karl Aberer

2
Goal

Our goal is to achieve scalable full-text
retrieval with structured P2P networks (DHTs)

Each peer
Provides resources (bandwidth, storage)
Searches the whole network
Publishes its own documents

3
Naïve (single-term) approach

... is to distribute the global inverted index in
a DHT

This slide was borrowed from the presentation
"Enhancing P2P File-Sharing with an Internet-Scale
Query Processor" by B. T. Loo, J. M. Hellerstein,
R. Huebsch, S. Shenker and I. Stoica
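
The naive single-term approach can be illustrated with a minimal sketch. It is not the system from the talk: the class `NaiveDhtIndex`, the function `responsible_peer` and the modulo-hash placement are hypothetical stand-ins for a real DHT lookup, and ranking is ignored.

```python
import hashlib
from collections import defaultdict


def responsible_peer(term, num_peers):
    """Map a term to a peer id by hashing (assumed stand-in for a DHT lookup)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class NaiveDhtIndex:
    """Global inverted index whose posting lists are spread over peers by term."""

    def __init__(self, num_peers=8):
        self.num_peers = num_peers
        # One term -> set(doc_id) map per peer.
        self.peers = [defaultdict(set) for _ in range(num_peers)]

    def publish(self, doc_id, text):
        # Each distinct term of the document is sent to the peer responsible for it.
        for term in set(text.lower().split()):
            peer = responsible_peer(term, self.num_peers)
            self.peers[peer][term].add(doc_id)

    def posting_list(self, term):
        peer = responsible_peer(term, self.num_peers)
        return self.peers[peer].get(term, set())

    def search(self, query):
        # A multi-term query fetches one full posting list per term and intersects them.
        terms = query.lower().split()
        if not terms:
            return set()
        return set.intersection(*(self.posting_list(t) for t in terms))
```

In this sketch every query term costs one complete, untruncated posting list, which is the bandwidth problem that the truncated lists of the following slides address.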
4
Indexing with Highly Discriminative Keys
[1] I. Podnar, M. Rajman, T. Luu, F. Klemm,
K. Aberer. Scalable Peer-to-Peer Web Retrieval
with Highly Discriminative Keys. In ICDE'07,
Istanbul, Turkey
5
Indexing with HDKs: main properties

The distributed index contains (key, PL) pairs
Each key corresponds to a term or a set of terms
Each key is assigned a posting list (PL)
Each posting list stores at most DFmax top-ranked
document references
Data-driven key generation
Each time a new document is indexed, the posting
list of some key k can reach the maximum size DFmax
This triggers the generation of new keys (k
combined with other frequent keys)
Proximity filter: a document qualifies for a key
t1t2 if t1 occurs close to t2 (within a window of
size w)
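
The two mechanisms on this slide, truncated posting lists and the proximity filter, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the HDK implementation: the function names are hypothetical, "close" is assumed to mean that both terms fit inside w consecutive token positions, and scores are assumed to be plain numbers.

```python
def within_window(tokens, t1, t2, w):
    """Proximity filter (assumed semantics): some occurrence of t1 and some
    occurrence of t2 are fewer than w positions apart."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


def add_posting(posting_list, doc_id, score, df_max):
    """Insert a document reference and keep only the df_max top-ranked ones.

    Returns True once the list has reached df_max entries, the point at
    which the slide says new, larger keys are generated."""
    posting_list.append((score, doc_id))
    posting_list.sort(reverse=True)  # best score first
    del posting_list[df_max:]        # truncate at DFmax
    return len(posting_list) >= df_max
```

A caller would check `within_window` before adding a document to the posting list of a two-term key, and react to a `True` return value of `add_posting` by generating keys that extend the full one.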

6
HDK: exhaustive data-driven indexing

Pros
The ICDE'07 paper [1] proves that the number of
keys grows linearly
Elegant key generation mechanism
Low bandwidth consumption during query processing
(PLs of limited size)
Cons
In practice the number of keys is large: 68M for
0.6M docs
High bandwidth consumption at indexing
Problem
Too many keys are superfluous (almost never used)

7
Query Driven Indexing

Let's index only what is queried!

8
Contents

Introduction
HDK approach for indexing
Query-driven approach for indexing/retrieval
Indexing structure
Example
ONM
Scalability
Evaluation
Conclusion

9
Query-Driven Index (QDI)

The Query-Driven Indexing strategy solves
the Too-Many-Keys problem
Avoids maintenance of superfluous keys
Generates only those keys that are requested by
users
Uses the query log to discover such keys
Problems
Indexing a new key requires a
bandwidth-efficient mechanism to obtain the top-k
posting list associated with the key
Solution: Opportunistic Notification Mechanism
(smart broadcast)
An incomplete index degrades the quality of query
results
We show that the degradation is low

10
Which keys to index?

Each single term found in the document collection
has to be indexed.
We call the set of all single-term keys the basic
single-term index.
Its posting lists are truncated at DFmax.
A key k is non-superfluous and can be activated
iff
k is popular: QF(k) ≥ QFmin, where QF(k) is the
popularity of the key k derived from the
available query log and QFmin is a parameter of
our model (popularity filter).
k contains from 2 to smax terms: 2 ≤ |k| ≤ smax,
where smax is a parameter of our model (size
filter).
all immediate sub-keys of k (of size |k| - 1) are
indexed and their associated posting lists are
truncated (redundancy filter).
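
The three filters can be combined into a single activation test. The sketch below is only an illustration: `can_activate` and its arguments are hypothetical names, keys are modelled as frozensets of terms, the popularity test is assumed to be QF(k) ≥ QFmin, and the index is reduced to a map from each indexed key to a flag saying whether its posting list is truncated.

```python
from itertools import combinations


def can_activate(key, qf, truncated, qf_min, s_max):
    """Return True if the multi-term key passes all three filters.

    key       -- frozenset of terms
    qf        -- dict: key -> query frequency taken from the query log
    truncated -- dict: indexed key -> True if its posting list is truncated
    """
    # Size filter: only keys with 2..s_max terms can be activated.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter (assumed to be a >= comparison).
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every immediate sub-key must be indexed and truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        if not truncated.get(frozenset(sub), False):
            return False
    return True


# Usage: b has a short (untruncated) posting list, so no key containing b qualifies.
qf = {frozenset("ac"): 4, frozenset("ab"): 4, frozenset("abc"): 4}
truncated = {frozenset("a"): True, frozenset("b"): False, frozenset("c"): True}
print(can_activate(frozenset("ac"), qf, truncated, qf_min=3, s_max=3))
print(can_activate(frozenset("ab"), qf, truncated, qf_min=3, s_max=3))
```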

11
QDI Retrieval

The single-term index is generated
Process the query abc
Probe Pabc
Probe Pab, Pbc and Pac
Probe Pa, Pb and Pc
Obtain the top-DFmax results for a, b and c (ranked
w.r.t. a, b and c respectively)
Contact the peers in the list and re-rank the
obtained results w.r.t. abc
Output the top-10
Increment the QF for ab, bc and ac
Activate (index) ac once it is popular
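
The probing order described on this slide can be sketched as a top-down walk over the lattice of term combinations. This is a hedged reading of the slides, not the authors' code: `probe_lattice` and `answer` are hypothetical names, the index is a plain dict from frozenset keys to lists of document ids, sub-keys of an already found key are assumed to be skipped, and the `score` callable stands in for contacting the peers that re-rank documents against the full query.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe term combinations from largest to smallest and return the
    indexed keys used to answer the query."""
    terms = sorted(set(query_terms))
    found = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # Skip keys already covered by a larger key that was found.
            if any(key <= bigger for bigger in found):
                continue
            if key in index:
                found.append(key)
    return found


def answer(query_terms, index, score, top_k=10):
    """Merge the posting lists of the found keys and re-rank the candidates."""
    candidates = set()
    for key in probe_lattice(query_terms, index):
        candidates.update(index[key])
    # Highest score first; ties broken by document id for determinism.
    return sorted(candidates, key=lambda d: (-score(d), d))[:top_k]
```

With only single-term keys indexed, the walk ends at a, b and c. Once ac is indexed, it stops at ac and b, and no posting lists are intersected in either case.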
12
QDI Retrieval 2

Assume the frequency of b is below DFmax
Note how the redundancy filter would simplify
the lattice in such a case: the posting list of b
is not truncated, so no key containing b can be
activated
(grayed nodes cannot be activated)

13
QDI Retrieval 3

The single-term index is generated and ac is indexed
Process the query abc
Probe Pabc
Probe Pab, Pbc and Pac: obtain the result for ac
Probe Pb and obtain the result for b
Contact all peers in the list to re-rank the
obtained results w.r.t. abc
Output the top-10
Increment the QF for ab, bc and ac

14
Opportunistic Notification Mechanism

ONM is used to activate a new multi-term key
ONM is a smart broadcast with the following
features
It is based on the shower multicast [2]: each
peer within a specified range is contacted only
once
Notifications are small and low-priority ->
piggybacking
The broadcast is split into several multicast
sessions, each time pruning low-score documents
It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John,
K. Aberer. Range Queries in Tree-Structured
Overlays. In P2P'05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic,
K. Aberer. Improving the Throughput of
Distributed Hash Tables Using Congestion-Aware
Routing. In IPTPS'07
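
The idea of splitting the broadcast into sessions that prune low-score documents can be sketched as follows. This is one possible interpretation, not the ONM protocol itself: the function `collect_top_k` is hypothetical, peers are simulated as local dicts of document scores, and each session is assumed to carry the current k-th best score as a threshold so that peers only reply with documents that can still enter the top-k.

```python
import heapq


def collect_top_k(peer_scores, k, sessions):
    """Collect the global top-k postings for a new key in several sessions.

    peer_scores -- list of dicts (one per peer): doc_id -> score for the key
    Returns (top-k list of (score, doc_id), number of postings transferred).
    """
    top = []          # min-heap holding the best k postings seen so far
    transferred = 0
    n = len(peer_scores)
    chunk = max(1, -(-n // sessions))  # peers per session, rounded up
    for start in range(0, n, chunk):
        # The notification of this session carries the current pruning threshold.
        threshold = top[0][0] if len(top) == k else float("-inf")
        replies = []
        for local in peer_scores[start:start + chunk]:
            # A peer sends at most its k best documents above the threshold.
            replies.extend(heapq.nlargest(
                k, ((s, d) for d, s in local.items() if s > threshold)))
        transferred += len(replies)
        for item in replies:
            if len(top) < k:
                heapq.heappush(top, item)
            elif item > top[0]:
                heapq.heapreplace(top, item)
    return sorted(top, reverse=True), transferred
```

In this simulation more sessions give later peers a tighter threshold, so fewer postings travel over the network, at the price of more rounds.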

15
Scalability

The retrieval traffic is bounded by a constant
due to truncated posting lists (the bound depends
on DFmax and the query size)
The indexing traffic depends on the number of
keys to be activated
The number of keys in the HDK approach (the upper
bound) is proven to grow linearly with the number
of peers, if each peer provides a limited number
of documents
With QDI, the number of keys does not depend on
the document collection size but only on the size
of the query log
We can use the QFmin parameter to adjust the
tradeoff: indexing traffic <-> retrieval
quality
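
The constant bound on retrieval traffic can be made concrete with a small worst-case calculation. The formula is an assumption derived from the slides rather than a result quoted from the talk: it supposes that every non-empty term combination of up to smax terms is probed once and that each probe returns at most DFmax postings. The function names are hypothetical.

```python
from math import comb


def max_probes(query_size, s_max):
    """Worst-case number of keys probed for a query with query_size terms."""
    return sum(comb(query_size, i)
               for i in range(1, min(query_size, s_max) + 1))


def max_postings(query_size, s_max, df_max):
    """Worst-case number of postings transferred to answer one query."""
    return max_probes(query_size, s_max) * df_max


print(max_probes(3, 3))         # 7 keys: abc, ab, ac, bc, a, b, c
print(max_postings(3, 3, 100))  # 700 postings
```

The bound depends only on the query size, smax and DFmax, and not on the number of documents or peers.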

16
Contents

Introduction
HDK approach for indexing
Query-driven approach for indexing/retrieval
Indexing structure
Example
ONM
Scalability
Evaluation
Conclusion

17
Overlap experiment

Use the Wikipedia query log (9M queries,
9-10.2004) to build the index
Randomly choose 3K test queries
Answer each test query with Google and compare to
the union of the top-DFmax Google results for each
of its combinations that are indexed according to
the logs.
This mimics our P2PIR system if Google's ranking
is used.
Example (figure): the original query and its
non-superfluous (indexed) combinations;
overlap@5 = 3/5 = 60%
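
The overlap measure used in this experiment can be sketched in a few lines. The sketch is illustrative: `overlap_at_k` is a hypothetical name, result lists are plain ranked lists of document ids, and the measure is taken to be the fraction of the reference top-k found in the union of the top-DFmax lists of the indexed combinations.

```python
def overlap_at_k(reference, key_results, k, df_max):
    """Fraction of the reference top-k covered by the indexed keys.

    reference   -- ranked results of the reference engine for the full query
    key_results -- ranked result lists, one per indexed term combination
    """
    top_ref = reference[:k]
    if not top_ref:
        return 0.0
    pool = set()
    for ranked in key_results:
        pool.update(ranked[:df_max])  # each posting list is truncated at DFmax
    hits = sum(1 for doc in top_ref if doc in pool)
    return hits / len(top_ref)


reference = ["d1", "d2", "d3", "d4", "d5"]
key_results = [["d1", "d9", "d3"], ["d5", "d8"]]
print(overlap_at_k(reference, key_results, k=5, df_max=3))  # 0.6
```

With DFmax = 3 the pool covers d1, d3 and d5, so three of the five reference results are found.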
18
Overlap example

Cut-n-paste from the simulation log

> id=481, q=what did babe ruth do in the 1920
1920 babe ruth, qf=0 ----> Ov@100 100
1920 babe, qf=0 ---------> Ov@100 9
1920 ruth, qf=1 ---------> Ov@100 33
babe ruth, qf=495 -------> Ov@100 69
---1920, qf=716 ------------> Ov@100 1
---babe, qf=3196 -----------> Ov@100 2
---ruth, qf=1653 -----------> Ov@100 7
Size 192, Keys used 2, Overlap@100 94
19
Overlap with Google
20
Overlap with Yahoo
21
Overlap with Google (no/partial/full overlap)
22
P2P Index Simulations

The number of keys depends only on the query log
size and QFmin!
It does not depend on the collection size!
The number of keys is much smaller than for the
HDK approach (68M keys for 650K docs)

23
Real query logs?

Wikipedia queries are unrealistic (too skewed) as
users know what they want.
Real web queries might perform worse?
Large-scale experiments with real web queries and
the TREC collection are reported in [4]
[4] G. Skobeltsyn, T. Luu, I. Podnar Žarko,
M. Rajman, K. Aberer. Web Text Retrieval with a
P2P Query-Driven Index. To appear in SIGIR'07

24
Conclusions

We presented the query-driven indexing strategy
for scalable web text retrieval with structured
P2P networks
Stores posting lists in a DHT for terms and term
combinations
Stores at most DFmax top document references in a
posting list
Efficiently collects the query statistics in a
distributed fashion
Based on these statistics, activates (indexes)
only popular keys
Computes the result of a multi-term query based
only on the index entries available at the
moment: no costly intersections
We also showed that
With real query-logs our approach achieves good
retrieval quality
The QFmin parameter adjusts the traffic/quality
tradeoff