Title: Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems
1. Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems
- Christian Zimmer, Srikanta Bedathur, Gerhard Weikum
- Max-Planck Institute for Informatics, Saarbrücken, Germany
- http://www.mpi-inf.mpg.de
2. Outline of the Talk
- Motivation
- System Architecture
- Caching Framework
  - Exact Caching (EC)
  - Approximate Caching (AC)
- Experimental Evaluation
- Conclusions / Open Issues
3. Motivation
- Basics
  - High potential of P2P-based Information Retrieval (P2P IR) systems
  - Benefits in general: scalable, efficient, resilient to failures and dynamics, democratic, privacy-preserving, and resilient to authoritarian controls
  - Benefits from the intellectual input of users: click streams, query logs, bookmarks, etc.
- Performance Challenges
  - Providing high-quality results (recall, precision)
  - Enabling high scalability (number of participating peers, huge amounts of data)
  - Unreliable networks: slow response times, intermittent loss of good results
  - Extra load on the network: many peers must be asked for good recall
4. Motivation (cont'd)
- Caching of Results
  - Traditional performance booster (using previous query executions to help in the future)
  - Remember popular items to avoid computing / fetching them again
- Typical Issues
  - What to cache? value of cached items; inverted lists / full results / partial results
  - Where to cache? on querying peers, on every node along the lookup path (UIC), spread to neighbors (DiCAS), on good nodes (View Trees)
  - How much to cache? buffer size
  - When to drop from the cache? buffering policy
  - Goals of caching? response-time improvement, query result-quality improvement
5. System Architecture
- Maintaining Metadata
  - Autonomous peers with a local index (local search engine)
  - Distributed global directory layered on top of a distributed hash table (DHT)
  - The DHT partitions the term space such that each peer is responsible for a subset of terms
  - Peers distribute per-term summaries (Posts) to the global directory (size of the index, number of documents containing the term, etc.)
  - The directory manages aggregated statistical information in compact form

[Figure: Minerva search architecture]
6. System Architecture
- Query Execution
  - Multi-term query "a b c"
  - Peerlist requests retrieve metadata from the directory (metadata retrieval)
  - Compute the most promising peers for the complete query (e.g., CORI, DTF)
  - The complete query is forwarded to these peers, which execute it locally (local result retrieval)
  - Local results are returned and merged into the global query result

[Figure: Minerva search architecture, query "a b c"]
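As a rough illustration, the two phases above can be sketched in Python. The directory contents, peer names, and the summed-document-count peer scoring below are made-up stand-ins for the DHT-based directory and the CORI/DTF ranking used in Minerva.

```python
# Hypothetical directory: term -> {peer: number of documents containing term}
DIRECTORY = {
    "a": {"p1": 120, "p2": 30, "p3": 5},
    "b": {"p1": 80,  "p2": 90},
    "c": {"p2": 40,  "p3": 200},
}

# Hypothetical local indexes: peer -> {doc: local score}
LOCAL_RESULTS = {
    "p1": {"d1": 0.9, "d2": 0.4},
    "p2": {"d2": 0.7, "d3": 0.6},
    "p3": {"d4": 0.8},
}

def route_query(terms, k=2):
    """Phase 1 (metadata retrieval): fetch peerlists from the directory and
    score peers for the complete query (here: summed document counts, a
    simplistic stand-in for CORI/DTF)."""
    scores = {}
    for t in terms:
        for peer, df in DIRECTORY.get(t, {}).items():
            scores[peer] = scores.get(peer, 0) + df
    return sorted(scores, key=scores.get, reverse=True)[:k]

def execute_query(terms, k=2, top=3):
    """Phase 2 (local result retrieval): forward the full query to the most
    promising peers and merge their local results into a global result."""
    merged = {}
    for peer in route_query(terms, k):
        for doc, s in LOCAL_RESULTS[peer].items():
            merged[doc] = max(merged.get(doc, 0.0), s)
    return sorted(merged, key=merged.get, reverse=True)[:top]
```

With the toy data above, `route_query(["a", "b", "c"])` picks p3 and p1 as the most promising peers, and `execute_query` merges their local results.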
7. Caching Framework
- Main Goals
  - Caching for result-quality improvement
  - Integration of result caching with query routing (reduces message traffic)
  - Cache placement for seamless reuse
  - Aggressive result-reuse under certain conditions
- Where and What to Cache?
  - Potential locations for caching:
    - Query initiator or additional overlays: limited utility to the network
    - Directory: choose one directory peer involved in query execution using a deterministic scheme (avoids load-balancing concerns)
  - Caching full results:
    - Metadata of the results (URL, statistics, etc.)
    - Set of source peers contributing to the cached results
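The deterministic placement scheme could look like the following sketch. The concrete function used in the paper is not spelled out on this slide, so the hash-based selection of one responsible term is an assumption; what matters is only that every peer computes the same directory peer for the same query.

```python
import hashlib

def cache_home(terms):
    """Deterministically pick one of the query's own directory peers to
    hold the cached result.  This sketch hashes the normalized query and
    uses the hash to select one of the responsible terms; the directory
    peer for that term (via the DHT) then stores the cache entry."""
    ordered = sorted(terms)                      # normalize the query
    key = " ".join(ordered)
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return ordered[h % len(ordered)]             # term whose directory peer caches
```

Because the choice depends only on the query itself, a later initiator of the same query contacts the same directory peer during metadata retrieval and finds the cached result without extra lookups.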
8. Caching Framework (cont'd)
- Extending Query Execution
  - Query routing:
    - The initiating peer sends the full query to all directory peers responsible for the query terms
    - The directory checks the availability of a cached result and, if available, returns it to the initiator
  - Adding / updating the cache:
    - The query initiator computes the full query result and a cached result for the top-k items
    - The initiator determines the directory peer responsible for maintaining the cached result
    - That directory peer incorporates the received cached result into its cache
- Two Caching Strategies based on the Caching Framework
  - Exact Caching (EC): P2P counterpart of traditional result caching
  - Approximate Caching (AC): aggressively reuses cached results of query subsets
9. Exact Caching (EC)
- Main Property
  - Only used if the stored result was generated by exactly the same query
- Caching Approach
  - After query execution, the cached result is stored at the directory (by selecting one directory peer)
  - Request for "a b c" by another peer:
    - Metadata retrieval additionally returns the cached result
    - Initiator satisfied: saves additional communication at the same result quality
    - Improving local result retrieval from additional peers
    - Updating the cached result
[Figure: EC query execution for query "a b c"]
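The directory side of Exact Caching can be sketched as a dictionary keyed by the exact (normalized) query. The class name and the sorted-tuple key are illustrative assumptions.

```python
class ExactCache:
    """Minimal sketch of a directory peer's exact result cache: only a
    request with exactly the same term set hits."""

    def __init__(self):
        self.store = {}   # query key -> (top-k result, contributing peers)

    @staticmethod
    def key(terms):
        return tuple(sorted(terms))   # normalize: term order is irrelevant

    def lookup(self, terms):
        """Checked during metadata retrieval; returned alongside peerlists."""
        return self.store.get(self.key(terms))

    def update(self, terms, topk, peers):
        """The initiator pushes the merged top-k result back to the directory."""
        self.store[self.key(terms)] = (topk, peers)
```

Note that a request for a strict subset or superset of a cached query misses by design; that gap is exactly what Approximate Caching addresses.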
10. Approximate Caching (AC)
- Limitation of Exact Caching
  - EC is only applicable when the exact query was executed before
  - Approximate Caching tries to overcome this issue when no cached result for the complete query is available
- Caching Approach
  - Aggressively retrieve and combine cached results of subsets of the requested query to approximate the full query
  - Avoids local result retrieval
- Metadata Retrieval
  - The querying peer requests peerlists for all query terms
  - Directory peers return all existing maximal cached results for subsets of the query term set
  - The querying peer only considers cached results for maximal subqueries received from the directory
- By Design
  - The directory peers for the query terms are responsible for all possible subqueries
  - If the AC strategy is not satisfying, the metadata retrieval has already been done
11. Approximate Caching (AC) (cont'd)
- An Example
  - Request for "a b c d"
  - No cached result for the full query, but the directory stores cached results for subqueries
  - Metadata retrieval additionally returns all cached results for maximal subqueries
  - To combine subquery results, the querying peer only considers the maximal ones
- Unsatisfactory Approximate Result
  - The querying peer retrieves local results from the top-ranked peers for the full query
[Figure: query "a b c d" with directory peers D(a), D(b), D(c), D(d)]
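Keeping only the maximal cached subqueries amounts to a subset check over the cached keys, as in this sketch (data layout is hypothetical):

```python
def maximal_subqueries(query, cached):
    """Among the cached subquery term sets that are subsets of the query,
    keep only the maximal ones, i.e. those not strictly contained in
    another cached subset of the query."""
    query = frozenset(query)
    subs = [frozenset(c) for c in cached if frozenset(c) <= query]
    return [s for s in subs if not any(s < other for other in subs)]
```

For the example query {a, b, c, d} with cached entries {a, b}, {a}, and {c, d}, the entry {a} is dropped because {a, b} subsumes it; cached entries containing terms outside the query are ignored entirely.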
12. Approximate Caching (AC) (cont'd)
- How to Combine Cached Results of Different Subqueries
  - Having determined the document set contained in all cached results for maximal subqueries, the documents need to be ranked to form the approximate result for the full query
  - Consider the document score score_{d,p,q} from the cached results for document d as the local result of peer p for (sub-)query q
- Final Score Computation
  - To rank the document set and obtain the approximate result:
    - score_d = max_{p,q} ( |q| · score_{d,p,q} )
  - This takes different query sizes into account: longer queries are more selective, and their results approximate the full query better
  - More than one cached result can include a document: only the maximal score is considered
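The scoring rule can be sketched as follows, treating each cached entry as a (subquery terms, document-score map) pair contributed by some peer. The input layout is an assumption for illustration.

```python
def approximate_scores(cached_results):
    """cached_results: list of (subquery_terms, {doc: score_dpq}) entries,
    one per (peer, maximal subquery).  Each document receives
    score_d = max over entries of |q| * score_dpq, so hits from longer
    (more selective) subqueries dominate, and a document appearing in
    several cached results keeps only its maximal weighted score."""
    scores = {}
    for q_terms, docs in cached_results:
        for d, s in docs.items():
            scores[d] = max(scores.get(d, 0.0), len(q_terms) * s)
    return scores
```

For example, a score of 0.5 for d1 from the two-term subquery {a, b} (weighted to 1.0) outranks a score of 0.9 for the same document from the one-term subquery {c}.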
13. Experimental Evaluation
- Experimental Setup
  - P2P IR benchmark recently proposed for P2P system evaluation (ExpDB 2006)
  - > 800,000 documents from Wikipedia
  - 99 Google Zeitgeist queries (1-3 query terms)
  - Documents distributed to 1,000 peers (with controlled overlap)
  - In addition: AOL query log (real-world log with time ordering)
  - Result retrieval returns the top-25 local results per peer; the final result comprises the top-50 documents for the full query
- Measurements
  - Relative recall: fraction of the ideal result documents included in the results of P2P query processing
  - Ideal result: the top-50 result documents of a centralized query execution over the combined document collection
  - Network resource consumption: total network traffic incurred during query processing, number of messages transferred across the network, and number of communication rounds
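The relative-recall measure is simple to state in code (a sketch; the cutoff k is implied by the length of the ideal result list):

```python
def relative_recall(p2p_result, ideal_result):
    """Fraction of the ideal (centralized) top-k documents that the
    P2P query processing also returned."""
    ideal = set(ideal_result)
    return len(ideal & set(p2p_result)) / len(ideal)
```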
14. Experimental Evaluation (cont'd)
- I. Improving Recall with Exact Caching (EC)
  - Focus on query-result improvement by asking additional peers
  - The updated cached result is stored in the directory
  - Initial query processing disseminates the query to 5% of the network; each improvement step considers up to 5% additional network peers
  - Relative recall averaged over all 99 Zeitgeist queries
15. Experimental Evaluation (cont'd)
16. Experimental Evaluation (cont'd)
- II. Cache Management Strategies
  - Assumes bounded cache space at the directory peers, so that the cache management policy influences recall for the Exact Caching strategy
  - The cache at each directory peer is restricted to three cached results
  - Synthetic query workload from the Zeitgeist queries:
    - All possible 9,180 one- and two-term queries from the single query terms
    - Assuming a power-law distribution (total of 102,158 requests)
  - Cache replacement strategies: LFU, LRU, FIFO, RAN, UNL (upper bound), and NOC (lower bound)
  - Measures: overall relative recall and cache-hit ratio
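Of these policies, LRU is easy to sketch for the three-entry caches used in the experiment; the class below is illustrative, not the authors' implementation, and the other policies differ only in how they pick the eviction victim.

```python
from collections import OrderedDict

class LRUCache:
    """Bounded result cache (capacity 3, as in the experiment) with
    least-recently-used replacement, tracking the cache-hit ratio."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.entries = OrderedDict()   # query key -> cached result
        self.hits = self.requests = 0

    def get(self, key):
        self.requests += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key, result):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[key] = result

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0
```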
17. Experimental Evaluation (cont'd)
18. Experimental Evaluation (cont'd)
- III. Cost Analysis
  - Per-query network cost analysis: network traffic, number of messages, and communication rounds in three scenarios:
    - No Caching (NC): standard query processing (5% of the network)
    - EC Single-Step (EC-SS): Exact Caching without query-result improvement
    - EC Multi-Step (EC-MS): Exact Caching with query-result improvement, up to 50% of the network in 5% steps
  - For details (different phases, assumptions, etc.), see the paper!
19. Experimental Evaluation (cont'd)

                                NC            EC-SS                  EC-MS
                                (No Caching)  (EC Single-Step)       (EC Multi-Step)
  average relative recall       0.32          0.32                   0.71 (+122%)
  network traffic (per query)   55.3 KBytes   23.1 KBytes (-58.2%)   41.0 KBytes (-25.9%)
  messages (per query)          106           25.7 (-75.8%)          61.4 (-42.1%)
  response time (rounds)        2             1.19 (-40.3%)          1.60 (-20.0%)
20. Experimental Evaluation (cont'd)
- IV. Approximate Caching Scenarios
  - 4,000 randomly generated 3- and 4-term queries from the benchmark query set
  - Comparison of 5 scenarios against standard query routing (SQR)
  - Effectiveness of AC in terms of relative recall, depending on the number of peers that contributed to the cached subquery result
21. Experimental Evaluation (cont'd)
22. Experimental Evaluation (cont'd)
- V. Real-World Query Log
  - Using the AOL query log to obtain a time order of queries: overall 57,344 requests with 39,640 unique queries
  - Combination of EC and AC
  - Results: 25% hit rate; recall improvement from 0.45 to 0.52
23. Experimental Evaluation (cont'd)
- VI. Impact of Churn
  - On the benefits of EC-MS
  - Different churn rates: fraction of peers leaving the network
24. Experimental Evaluation (cont'd)
25. Conclusions / Open Issues
- Conclusions
  - Introduced a simple yet effective caching framework to take advantage of previous work of peers in a P2P network
  - Exact Caching (EC):
    - Possibility to improve recall, or to reduce response time / network cost
    - Experiments used the Wikipedia benchmark and a real-world query log
    - Investigated various cache replacement strategies and considered churn in the P2P network
  - Approximate Caching (AC):
    - Aggressive reuse of cached results of subqueries if no full query result is available
    - Places demands on the existing cached results for satisfying outcomes
- Open Issues
  - Proactive caching (anticipate interesting queries, e.g., from existing logs)
  - Maintaining cache freshness (when new or better results become available)
  - Replication (metadata and/or documents)
26. Thank You For Your Attention! Questions or Comments?
27. Distributed Hash Tables (DHTs)
- The Minerva search engine is based on a Distributed Hash Table (DHT) to achieve scalability, fault-tolerance, and robustness
- Second generation of structured overlay networks
- Minerva uses Chord with finger tables
- Data items are distributed to nodes using consistent hashing: id = Hash(key)
- A lookup method finds the location of the data item with a given key in O(log N) hops

[Figure: Chord ring example, lookup(54) resolving to key K54]
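A toy version of consistent hashing with a successor lookup is sketched below. Real Chord resolves the successor in O(log N) hops via finger tables; this sketch simply scans a sorted ring, and the 8-bit identifier space is an assumption for readability.

```python
import hashlib
from bisect import bisect_left

def chord_id(name, bits=8):
    """Map a node address or data key into the same identifier space
    (consistent hashing: id = Hash(key) mod 2^bits)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

def successor(node_ids, key_id):
    """A key is stored on its successor: the first node id >= key_id on
    the ring, wrapping around past the largest id."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]
```

For example, on a ring with nodes at ids 10, 80, and 200, key id 54 is stored at node 80, while key id 250 wraps around to node 10.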