Title: Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems
1. Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems
- Christian Zimmer, Srikanta Bedathur, Gerhard Weikum
- Max-Planck Institute for Informatics, Saarbrücken, Germany
- http://www.mpi-inf.mpg.de
2. Outline of the Talk
- Motivation
- System Architecture
- Caching Framework
  - Exact Caching (EC)
  - Approximate Caching (AC)
- Experimental Evaluation
- Conclusions / Open Issues
3. Motivation
- Basics
  - High potential of P2P-based Information Retrieval (P2P IR) systems
  - Benefits in general: scalable, efficient, resilient to failures and dynamics, democratic, privacy-preserving, and resilient to authoritarian controls
  - Benefits from the intellectual input of users: click streams, query logs, bookmarks, etc.
- Performance Challenges
  - Providing high-quality results (recall, precision)
  - Enabling high scalability (number of participating peers, huge amounts of data)
  - Unreliable networks: slow response times, intermittent loss of good results
  - Extra load on the network: many peers must be asked for good recall
4. Motivation (cont'd)
- Caching of Results
  - Traditional performance booster (using previous query executions to help in the future)
  - Remember popular items to avoid computing / fetching them again
- Typical Issues
  - What to cache? value of cached items; inverted lists / full results / partial results
  - Where to cache? on querying peers, on every node along the lookup path (UIC), spread to neighbors (DiCAS), on good nodes (View Trees)
  - How much to cache? buffer size
  - When to drop from the cache? buffering policy
  - Goals of caching? response-time improvement, query result-quality improvement
5. System Architecture
- Maintaining Metadata
  - Autonomous peers with a local index (local search engine)
  - Distributed global directory layered on top of a distributed hash table (DHT)
  - The DHT partitions the term space such that each peer is responsible for a subset of terms
  - Peers distribute per-term summaries (Posts) to the global directory (size of the index, number of documents containing the term, etc.)
  - The directory manages aggregated statistical information in compact form

[Figure: Minerva search architecture]
6. System Architecture
- Query Execution
  - Multi-term query "a b c"
  - Peerlist requests retrieve metadata from the directory (metadata retrieval)
  - Compute the most promising peers for the complete query (e.g., CORI, DTF)
  - The complete query is forwarded to these peers, which execute it locally (local result retrieval)
  - Local results are returned and merged into the global query result

[Figure: Minerva search architecture, query "a b c"]
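As a rough illustration, the two phases above can be sketched in Python. The directory contents, peer names, and the summed-document-count peer scoring below are made-up stand-ins for the DHT-based directory and the CORI/DTF ranking used in Minerva.

```python
# Hypothetical directory: term -> {peer: number of documents containing term}
DIRECTORY = {
    "a": {"p1": 120, "p2": 30, "p3": 5},
    "b": {"p1": 80,  "p2": 90},
    "c": {"p2": 40,  "p3": 200},
}

# Hypothetical local indexes: peer -> {doc: local score}
LOCAL_RESULTS = {
    "p1": {"d1": 0.9, "d2": 0.4},
    "p2": {"d2": 0.7, "d3": 0.6},
    "p3": {"d4": 0.8},
}

def route_query(terms, k=2):
    """Phase 1 (metadata retrieval): fetch peerlists from the directory and
    score peers for the complete query (here: summed document counts, a
    simplistic stand-in for CORI/DTF)."""
    scores = {}
    for t in terms:
        for peer, df in DIRECTORY.get(t, {}).items():
            scores[peer] = scores.get(peer, 0) + df
    return sorted(scores, key=scores.get, reverse=True)[:k]

def execute_query(terms, k=2, top=3):
    """Phase 2 (local result retrieval): forward the full query to the most
    promising peers and merge their local results into a global result."""
    merged = {}
    for peer in route_query(terms, k):
        for doc, s in LOCAL_RESULTS[peer].items():
            merged[doc] = max(merged.get(doc, 0.0), s)
    return sorted(merged, key=merged.get, reverse=True)[:top]
```

With the toy data above, `route_query(["a", "b", "c"])` picks p3 and p1 as the most promising peers, and `execute_query` merges their local results.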
7. Caching Framework
- Main Goals
  - Caching for result-quality improvement
  - Integration of result caching with query routing (reduces message traffic)
  - Cache placement for seamless reuse
  - Aggressive result-reuse under certain conditions
- Where and What to Cache?
  - Potential locations for caching:
    - Query initiator or additional overlays: limited utility to the network
    - Directory: choose one directory peer involved in query execution using a deterministic scheme (avoids load-balancing concerns)
  - Caching full results:
    - Metadata of the results (URL, statistics, etc.)
    - Set of source peers contributing to the cached results
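The deterministic placement scheme could look like the following sketch. The concrete function used in the paper is not spelled out on this slide, so the hash-based selection of one responsible term is an assumption; what matters is only that every peer computes the same directory peer for the same query.

```python
import hashlib

def cache_home(terms):
    """Deterministically pick one of the query's own directory peers to
    hold the cached result.  This sketch hashes the normalized query and
    uses the hash to select one of the responsible terms; the directory
    peer for that term (via the DHT) then stores the cache entry."""
    ordered = sorted(terms)                      # normalize the query
    key = " ".join(ordered)
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return ordered[h % len(ordered)]             # term whose directory peer caches
```

Because the choice depends only on the query itself, a later initiator of the same query contacts the same directory peer during metadata retrieval and finds the cached result without extra lookups.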
8. Caching Framework (cont'd)
- Extending Query Execution
  - Query routing:
    - The initiating peer sends the full query to all directory peers responsible for the query terms
    - The directory checks the availability of a cached result and, if available, returns it to the initiator
  - Adding / updating the cache:
    - The query initiator computes the full query result and a cached result for the top-k items
    - The initiator determines the directory peer responsible for maintaining the cached result
    - That directory peer incorporates the received cached result into its cache
- Two Caching Strategies based on the Caching Framework
  - Exact Caching (EC): P2P counterpart of traditional result caching
  - Approximate Caching (AC): aggressively reuses cached results of query subsets
9. Exact Caching (EC)
- Main Property
  - Only used if the stored result was generated by exactly the same query
- Caching Approach
  - After query execution, the cached result is stored at the directory (by selecting one directory peer)
  - Request for "a b c" by another peer:
    - Metadata retrieval additionally returns the cached result
    - Initiator satisfied: saves additional communication at the same result quality
    - Improving local result retrieval from additional peers
    - Updating the cached result
[Figure: EC query execution for query "a b c"]
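The directory side of Exact Caching can be sketched as a dictionary keyed by the exact (normalized) query. The class name and the sorted-tuple key are illustrative assumptions.

```python
class ExactCache:
    """Minimal sketch of a directory peer's exact result cache: only a
    request with exactly the same term set hits."""

    def __init__(self):
        self.store = {}   # query key -> (top-k result, contributing peers)

    @staticmethod
    def key(terms):
        return tuple(sorted(terms))   # normalize: term order is irrelevant

    def lookup(self, terms):
        """Checked during metadata retrieval; returned alongside peerlists."""
        return self.store.get(self.key(terms))

    def update(self, terms, topk, peers):
        """The initiator pushes the merged top-k result back to the directory."""
        self.store[self.key(terms)] = (topk, peers)
```

Note that a request for a strict subset or superset of a cached query misses by design; that gap is exactly what Approximate Caching addresses.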
10. Approximate Caching (AC)
- Limitation of Exact Caching
  - EC is only applicable when the exact query was executed before
  - Approximate Caching tries to overcome this issue when no cached result for the complete query is available
- Caching Approach
  - Aggressively retrieve and combine cached results of subsets of the requested query to approximate the full query
  - Avoids local result retrieval
- Metadata Retrieval
  - The querying peer requests peerlists for all query terms
  - Directory peers return all existing maximal cached results for subsets of the query term set
  - The querying peer only considers cached results for maximal subqueries received from the directory
- By Design
  - The directory peers for the query terms are responsible for all possible subqueries
  - If the AC strategy is not satisfying, the metadata retrieval has already been done
11. Approximate Caching (AC) (cont'd)
- An Example
  - Request for "a b c d"
  - No cached result for the full query, but the directory stores cached results for subqueries
  - Metadata retrieval additionally returns all cached results for maximal subqueries
  - To combine subquery results, the querying peer only considers the maximal ones
- Unsatisfactory Approximate Result
  - The querying peer retrieves local results from the top-ranked peers for the full query
[Figure: query "a b c d" with directory peers D(a), D(b), D(c), D(d)]
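Keeping only the maximal cached subqueries amounts to a subset check over the cached keys, as in this sketch (data layout is hypothetical):

```python
def maximal_subqueries(query, cached):
    """Among the cached subquery term sets that are subsets of the query,
    keep only the maximal ones, i.e. those not strictly contained in
    another cached subset of the query."""
    query = frozenset(query)
    subs = [frozenset(c) for c in cached if frozenset(c) <= query]
    return [s for s in subs if not any(s < other for other in subs)]
```

For the example query {a, b, c, d} with cached entries {a, b}, {a}, and {c, d}, the entry {a} is dropped because {a, b} subsumes it; cached entries containing terms outside the query are ignored entirely.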
12. Approximate Caching (AC) (cont'd)
- How to Combine Cached Results of Different Subqueries
  - Having determined the document set contained in all cached results for maximal subqueries, the documents need to be ranked to form the approximate result for the full query
  - Consider the document score score_{d,p,q} from the cached results for document d as the local result of peer p for (sub-)query q
- Final Score Computation
  - To rank the document set and obtain the approximate result:
    - score_d = max_{p,q} ( |q| · score_{d,p,q} )
  - This takes different query sizes into account: longer queries are more selective, and their results approximate the full query better
  - More than one cached result can include a document: only the maximal score is considered
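The scoring rule can be sketched as follows, treating each cached entry as a (subquery terms, document-score map) pair contributed by some peer. The input layout is an assumption for illustration.

```python
def approximate_scores(cached_results):
    """cached_results: list of (subquery_terms, {doc: score_dpq}) entries,
    one per (peer, maximal subquery).  Each document receives
    score_d = max over entries of |q| * score_dpq, so hits from longer
    (more selective) subqueries dominate, and a document appearing in
    several cached results keeps only its maximal weighted score."""
    scores = {}
    for q_terms, docs in cached_results:
        for d, s in docs.items():
            scores[d] = max(scores.get(d, 0.0), len(q_terms) * s)
    return scores
```

For example, a score of 0.5 for d1 from the two-term subquery {a, b} (weighted to 1.0) outranks a score of 0.9 for the same document from the one-term subquery {c}.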
13. Experimental Evaluation
- Experimental Setup
  - P2P IR benchmark recently proposed for P2P system evaluation (ExpDB 2006)
  - > 800,000 documents from Wikipedia
  - 99 Google Zeitgeist queries (1-3 query terms)
  - Documents distributed to 1,000 peers (with controlled overlap)
  - In addition: AOL query log (real-world log with time ordering)
  - Result retrieval returns the top-25 local results per peer; the final result comprises the top-50 documents for the full query
- Measurements
  - Relative recall: fraction of the ideal result documents included in the results of P2P query processing
  - Ideal result: the top-50 result documents of a centralized query execution over the combined document collection
  - Network resource consumption: total network traffic incurred during query processing, number of messages transferred across the network, and number of communication rounds
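The relative-recall measure is simple to state in code (a sketch; the cutoff k is implied by the length of the ideal result list):

```python
def relative_recall(p2p_result, ideal_result):
    """Fraction of the ideal (centralized) top-k documents that the
    P2P query processing also returned."""
    ideal = set(ideal_result)
    return len(ideal & set(p2p_result)) / len(ideal)
```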
14. Experimental Evaluation (cont'd)
- I. Improving Recall with Exact Caching (EC)
  - Focus on query-result improvement by asking additional peers
  - The updated cached result is stored in the directory
  - Initial query processing disseminates the query to 5% of the network; each improvement step considers up to 5% additional network peers
  - Relative recall averaged over all 99 Zeitgeist queries
15. Experimental Evaluation (cont'd)
16. Experimental Evaluation (cont'd)
- II. Cache Management Strategies
  - Assumes bounded cache space at the directory peers, so that the cache management policy influences recall for the Exact Caching strategy
  - The cache at each directory peer is restricted to three cached results
  - Synthetic query workload from the Zeitgeist queries:
    - All possible 9,180 one- and two-term queries from the single query terms
    - Assuming a power-law distribution (total of 102,158 requests)
  - Cache replacement strategies: LFU, LRU, FIFO, RAN, UNL (upper bound), and NOC (lower bound)
  - Measures: overall relative recall and cache-hit ratio
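Of these policies, LRU is easy to sketch for the three-entry caches used in the experiment; the class below is illustrative, not the authors' implementation, and the other policies differ only in how they pick the eviction victim.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded result cache (capacity 3, as in the experiment) with
    least-recently-used replacement, tracking the cache-hit ratio."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()   # query key -> cached result
        self.hits = self.requests = 0

    def get(self, key):
        self.requests += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key, result):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[key] = result

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0
```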
17. Experimental Evaluation (cont'd)
18. Experimental Evaluation (cont'd)
- III. Cost Analysis
  - Per-query network cost analysis: network traffic, number of messages, and communication rounds in three scenarios:
    - No Caching (NC): standard query processing (5% of the network)
    - EC Single-Step (EC-SS): Exact Caching without query-result improvement
    - EC Multi-Step (EC-MS): Exact Caching with query-result improvement, up to 50% of the network in 5% steps
  - For details (different phases, assumptions, etc.), see the paper!
19. Experimental Evaluation (cont'd)

                                NC            EC-SS                  EC-MS
                                (No Caching)  (EC Single-Step)       (EC Multi-Step)
  average relative recall       0.32          0.32                   0.71 (+122%)
  network traffic (per query)   55.3 KBytes   23.1 KBytes (-58.2%)   41.0 KBytes (-25.9%)
  messages (per query)          106           25.7 (-75.8%)          61.4 (-42.1%)
  response time (rounds)        2             1.19 (-40.3%)          1.60 (-20.0%)
20. Experimental Evaluation (cont'd)
- IV. Approximate Caching Scenarios
  - 4,000 randomly generated 3- and 4-term queries from the benchmark query set
  - Comparison of 5 scenarios against standard query routing (SQR)
  - Effectiveness of AC in terms of relative recall, depending on the number of peers that contributed to the cached subquery result
21. Experimental Evaluation (cont'd)
22. Experimental Evaluation (cont'd)
- V. Real-World Query Log
  - Using the AOL query log to obtain a time order of queries: overall 57,344 requests with 39,640 unique queries
  - Combination of EC and AC
  - Results: 25% hit rate; recall improvement from 0.45 to 0.52
23. Experimental Evaluation (cont'd)
- VI. Impact of Churn
  - On the benefits of EC-MS
  - Different churn rates: fraction of peers leaving the network
24. Experimental Evaluation (cont'd)
25. Conclusions / Open Issues
- Conclusions
  - Introduced a simple yet effective caching framework to take advantage of previous work of peers in a P2P network
  - Exact Caching (EC):
    - Possibility to improve recall, or to reduce response time / network cost
    - Experiments used the Wikipedia benchmark and a real-world query log
    - Investigated various cache replacement strategies and considered churn in the P2P network
  - Approximate Caching (AC):
    - Aggressive reuse of cached results of subqueries if no full query result is available
    - Places demands on the existing cached results for satisfying outcomes
- Open Issues
  - Proactive caching (anticipate interesting queries, e.g., from existing logs)
  - Maintaining cache freshness (when new or better results become available)
  - Replication (metadata and/or documents)
26. Thank You For Your Attention! Questions or Comments?
27. Distributed Hash Tables (DHTs)
- The Minerva search engine is based on a Distributed Hash Table (DHT) to achieve scalability, fault-tolerance, and robustness
- Second generation of structured overlay networks
- Minerva uses Chord with finger tables
- Data items are distributed to nodes using consistent hashing: id = Hash(key)
- A lookup method finds the location of the data item with a given key in O(log N) hops

[Figure: Chord ring example, lookup(54) resolving to key K54]
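A toy version of consistent hashing with a successor lookup is sketched below. Real Chord resolves the successor in O(log N) hops via finger tables; this sketch simply scans a sorted ring, and the 8-bit identifier space is an assumption for readability.

```python
import hashlib
from bisect import bisect_left

def chord_id(name, bits=8):
    """Map a node address or data key into the same identifier space
    (consistent hashing: id = Hash(key) mod 2^bits)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

def successor(node_ids, key_id):
    """A key is stored on its successor: the first node id >= key_id on
    the ring, wrapping around past the largest id."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]
```

For example, on a ring with nodes at ids 10, 80, and 200, key id 54 is stored at node 80, while key id 250 wraps around to node 10.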