Flood Little, Cache More: Effective Result-Reuse in P2P IR Systems

1
Flood Little, Cache More: Effective Result-Reuse
in P2P IR Systems
  • Christian Zimmer, Srikanta Bedathur, Gerhard
    Weikum
  • Max-Planck Institute for Informatics,
    Saarbrücken, Germany
  • http://www.mpi-inf.mpg.de

2
Outline of the Talk
  • Motivation
  • System Architecture
  • Caching Framework
  • Exact Caching (EC)
  • Approximate Caching (AC)
  • Experimental Evaluation
  • Conclusions & Open Issues

3
Motivation
  • Basics
  • High Potential of P2P-based Information Retrieval
    (P2P IR) systems
  • benefits in general: scalable, efficient,
    resilient to failures and dynamics, democratic,
    privacy preserving, and resilient to
    authoritarian controls
  • benefits from intellectual input of users: click
    streams, query logs, bookmarks, etc.
  • Performance Challenges
  • providing high-quality results (recall &
    precision)
  • enabling high scalability (number of
    participating peers, huge amounts of data)
  • unreliable networks: slow response times,
    intermittent loss of good results
  • extra load on network: many peers needed for good
    recall

4
Motivation (cont'd)
  • Caching of Results
  • Traditional performance booster (using previous
    query executions to help in the future)
  • Remember popular items to avoid computing /
    fetching
  • Typical Issues
  • What to Cache?
  • value of cached items
  • inverted lists / full results / partial results
  • Where to Cache?
  • on querying peers, every node along lookup path
    (UIC), spread to neighbors (DiCAS), on good nodes
    (View Trees)
  • How much to Cache?
  • buffer size
  • When to drop from Cache?
  • buffering policy
  • Goals of Caching?
  • response time improvements, query result-quality
    improvement

5
System Architecture
  • Maintaining Metadata
  • Autonomous peers with local index (local search
    engine)
  • Distributed global directory layered on top of
    distributed hash table (DHT)
  • DHT partitions term space such that each peer is
    responsible for subset of terms
  • Peers distribute per-term summaries (Posts) to
    global directory (size of the index, number of
    documents containing this term, etc.)
  • Directory manages aggregated statistical
    information in compact form
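As a minimal sketch (in Python, with hypothetical names) of the directory layer described above: each peer publishes a per-term summary (Post) to the directory peer responsible for that term. The modulo hashing here is an illustrative assumption; Minerva actually layers the directory on a Chord-style DHT with consistent hashing.

```python
import hashlib

def directory_peer(term: str, num_peers: int) -> int:
    """Map a term to the directory peer responsible for it,
    mimicking DHT partitioning of the term space (illustrative
    modulo scheme, not Chord's actual consistent hashing)."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers

def publish_posts(peer_id: int, local_index: dict, num_peers: int) -> dict:
    """Distribute one peer's per-term summaries (Posts) to the
    directory. Returns {directory_peer: [(term, post), ...]}."""
    outbox = {}
    for term, postings in local_index.items():
        # A Post carries compact statistics, e.g. the number of
        # documents containing the term (df).
        post = {"peer": peer_id, "df": len(postings)}
        outbox.setdefault(directory_peer(term, num_peers), []).append((term, post))
    return outbox
```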

Minerva Search Architecture
6
System Architecture
  • Query Execution
  • Multi-term query: a b c
  • Peerlist requests to retrieve metadata from
    directory (metadata retrieval)
  • Compute most promising peers for the complete
    query (e.g., CORI, DTF)
  • Complete query forwarded to these peers, which
    execute it locally (local result retrieval)
  • Local results returned and merged into global
    query result

Minerva Search Architecture
query a b c
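The routing step above, scoring candidate peers from directory metadata and forwarding the query to the most promising ones, can be sketched as follows. The additive document-frequency score is a deliberately simplified stand-in (an assumption for illustration), not the actual CORI or DTF resource-selection measure:

```python
def route_query(query_terms, directory_posts, k=3):
    """Pick the k most promising peers for the complete query.
    directory_posts: {term: {peer_id: df}} as gathered by the
    peerlist requests. Scoring is a toy sum of document
    frequencies; Minerva would use CORI or DTF here."""
    scores = {}
    for term in query_terms:
        for peer, df in directory_posts.get(term, {}).items():
            scores[peer] = scores.get(peer, 0) + df
    # Highest-scoring peers receive the complete query.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```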
7
Caching Framework
  • Main Goals
  • Caching for result-quality improvement
  • Integration of result caching with query routing
    (reduces message traffic)
  • Cache placement for seamless reuse
  • Aggressive result-reuse under certain conditions
  • Where and What to Cache?
  • Potential locations for caching
  • Query initiator or additional overlays: limited
    utility to network
  • Directory: choose one directory peer involved in
    query execution using a deterministic scheme
    (avoids load-balancing concerns)
  • Caching full results
  • Metadata of results (URL, statistics, etc.)
  • Set of source peers contributing to cached
    results

8
Caching Framework (cont'd)
  • Extending Query Execution
  • Query Routing
  • initiating peer sends full query to all directory
    peers responsible for query terms
  • directory checks availability of cached result
    and if available returns it to initiator
  • Adding / Updating Cache
  • query initiator computes full query result and
    cached result for top-k items
  • initiator determines directory peer responsible
    for maintaining cached result
  • directory peer incorporates the received cached
    result into its cache
  • Two Caching Strategies based on Caching Framework
  • Exact Caching (EC)
  • P2P counterpart of traditional result caching
  • Approximate Caching (AC)
  • aggressively reuse cached results of query subsets

9
Exact Caching (EC)
  • Main Property
  • Only used if the stored result was generated by
    exactly the same query
  • Caching Approach
  • After query execution, cached results are stored
    at the directory (by selecting one directory peer)
  • Request for a b c by another peer
  • Metadata retrieval additionally returns the
    cached result
  • Initiator satisfied: saves additional
    communication at the same result quality
  • Improving local result retrieval from additional
    peers
  • Updating cached result

10
Approximate Caching (AC)
  • Limitation of Exact Caching
  • EC is only applicable when the exact query was
    executed before
  • Approximate Caching tries to overcome this issue
    when a cached result for the complete query is
    not available
  • Caching Approach
  • Aggressively retrieve and combine cached results
    of subsets of the requested query to approximate
    the full query
  • Avoids local result retrieval
  • Metadata retrieval
  • querying peer requests peerlists for all query
    terms
  • directory peers return all existing maximal
    cached results for subsets of the query term set
  • querying peer only considers cached results for
    maximal subqueries received from the directory
  • By Design
  • directory peers for the query terms are
    responsible for all possible subqueries
  • if the AC strategy is not satisfactory, metadata
    retrieval has already been done
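Keeping only the maximal cached subqueries (those not contained in any other cached subquery of the same query) can be sketched directly with set operations. This is a minimal illustration of the filtering step, not the paper's actual implementation:

```python
def maximal_subqueries(query_terms, cached_keys):
    """Among cached results whose term sets are subsets of the
    query, keep only the maximal ones, i.e. those not strictly
    contained in another cached subset of the query."""
    q = frozenset(query_terms)
    subsets = [frozenset(k) for k in cached_keys if frozenset(k) <= q]
    # s is maximal if no other cached subset strictly contains it.
    return [s for s in subsets if not any(s < other for other in subsets)]
```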

11
Approximate Caching (AC) (con't)
  • An Example
  • Request for a b c d
  • No cached result for the full query, but the
    directory stores cached results for subqueries
  • Metadata retrieval additionally returns all
    cached results for maximal subqueries
  • To combine subquery results, the querying peer
    only considers maximal ones
  • Unsatisfactory Approximate Result
  • Querying peer retrieves local results from
    top-ranked peers for the full query
  • Unsatisfactory Approximate Result
  • Querying peer retrieves local results from
    top-ranked peers for full query

[Figure: query a b c d routed to directory peers D(a), D(b), D(c), D(d)]
12
Approximate Caching (AC) (con't)
  • How to Combine Cached Results of Different
    Subqueries
  • Having determined the document set contained in
    all cached results for maximal subqueries, the
    documents need to be ranked to form an
    approximate result for the full query
  • Consider document scores score_{d,p,q} from
    cached results: the score of document d in the
    local result of peer p for (sub-)query q
  • Final Score Computation
  • To rank the document set and obtain the
    approximate result:
  • score_d = max_{p,q} (|q| * score_{d,p,q})
  • takes different query sizes into account: longer
    queries are more selective and better approximate
    the full query
  • more than one cached result can include a
    document: only the maximal score is considered
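The score combination above can be sketched as follows, assuming a hypothetical data layout of `(subquery_terms, {doc_id: {peer_id: score}})` per cached result. The weight `|q|` is the subquery length, so scores from longer (more selective) subqueries dominate:

```python
def combine_cached_scores(cached_results):
    """Rank the union of documents from the maximal cached
    subquery results, implementing
        score_d = max over (p, q) of |q| * score_{d,p,q}.
    cached_results: list of (subquery_terms, {doc: {peer: score}})."""
    final = {}
    for terms, docs in cached_results:
        weight = len(terms)  # |q|: longer subqueries weigh more
        for doc, peer_scores in docs.items():
            for score in peer_scores.values():
                # Only the maximal weighted score per document counts.
                final[doc] = max(final.get(doc, 0.0), weight * score)
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)
```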

13
Experimental Evaluation
  • Experimental Setup
  • P2P IR benchmark recently proposed for P2P system
    evaluation (ExpDB 2006)
  • > 800,000 documents from Wikipedia
  • 99 Google Zeitgeist queries (1-3 query terms)
  • Documents distributed to 1,000 peers (with
    controlled overlap)
  • In addition: AOL query-log (real-world log with
    time ordering)
  • Result retrieval returns top-25 local results per
    peer; the final result contains the top-50
    documents for the full query
  • Measurements
  • Relative Recall: fraction of ideal result
    documents included in the results of P2P query
    processing
  • Ideal result: top-50 result documents of
    centralized query execution over the combined
    document collection
  • Network Resource Consumption: total network
    traffic incurred during query processing
  • number of messages transferred across the network
  • number of communication rounds
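The relative recall measure defined above reduces to a one-line set computation (a straightforward rendering of the definition, not code from the paper):

```python
def relative_recall(p2p_result, ideal_result):
    """Fraction of the ideal result documents (top-50 of a
    centralized execution) that the P2P query processing
    retrieved."""
    ideal = set(ideal_result)
    return len(ideal & set(p2p_result)) / len(ideal)
```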

14
Experimental Evaluation (cont'd)
  • I. Improving Recall with Exact Caching (EC)
  • Focus on query result improvement by asking
    additional peers
  • Updated cached result stored in the directory
  • Initial query processing disseminates the query
    to 5% of the network; each improvement step
    considers up to 5% additional network peers
  • Relative recall averaged over all 99 Zeitgeist
    queries

15
Experimental Evaluation (cont'd)
16
Experimental Evaluation (cont'd)
  • II. Cache Management Strategies
  • Assumes bounded cache space at directory peers,
    so that the cache management policy influences
    recall for the Exact Caching strategy
  • Cache at each directory peer restricted to three
    cached results
  • Synthetic query workload from Zeitgeist queries
  • all 9,180 possible one- and two-term queries from
    the single query terms
  • assuming a power-law distribution (total of
    102,158 requests)
  • Cache replacement strategies: LFU, LRU, FIFO,
    RAN, UNL (upper bound), and NOC (lower bound)
  • Measures: overall relative recall and cache hit
    ratio
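One of the compared policies, a bounded cache with least-recently-used (LRU) eviction, can be sketched in a few lines; this is a generic illustration of the policy, not the paper's implementation:

```python
from collections import OrderedDict

class LRUResultCache:
    """Bounded per-directory-peer result cache with LRU eviction,
    one of the replacement policies compared (alongside LFU,
    FIFO, and RAN)."""
    def __init__(self, capacity=3):          # three cached results per peer
        self.capacity = capacity
        self.entries = OrderedDict()         # query -> cached result

    def get(self, query):
        if query not in self.entries:
            return None                      # cache miss
        self.entries.move_to_end(query)      # mark as recently used
        return self.entries[query]

    def put(self, query, result):
        self.entries[query] = result
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False) # evict least recently used
```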

17
Experimental Evaluation (cont'd)
18
Experimental Evaluation (cont'd)
  • III. Cost Analysis
  • Network cost analysis per query: network traffic,
    number of messages, and communication rounds in
    three scenarios
  • No Caching (NC): standard query processing (5% of
    network)
  • EC Single-Step (EC-SS): Exact Caching without
    query result improvement
  • EC Multi-Step (EC-MS): Exact Caching with query
    result improvement, up to 50% of the network in 5
    steps
  • For details (different phases, assumptions, etc.)
    see the paper!

19
Experimental Evaluation (cont'd)

                              NC           EC-SS                  EC-MS
average relative recall       0.32         0.32                   0.71 (+122%)
network traffic (per query)   55.3 Kbytes  23.1 Kbytes (-58.2%)   41.0 Kbytes (-25.9%)
messages (per query)          106          25.7 (-75.8%)          61.4 (-42.1%)
response time (rounds)        2            1.19 (-40.3%)          1.60 (-20.0%)

NC = No Caching, EC-SS = EC Single-Step, EC-MS = EC Multi-Step
20
Experimental Evaluation (cont'd)
  • IV. Approximate Caching Scenarios
  • 4,000 randomly generated 3- and 4-term queries
    from the benchmark query set
  • Comparison of 5 scenarios against standard query
    routing (SQR)
  • Effectiveness of AC in terms of relative recall,
    depending on the number of peers that contributed
    to the cached subquery result

21
Experimental Evaluation (cont'd)
22
Experimental Evaluation (cont'd)
  • V. Real-World Query-Log
  • Using the AOL query-log to obtain a time order of
    queries: overall 57,344 requests with 39,640
    unique queries
  • Combination of EC and AC
  • Results: 25% hit rate, recall improvement from
    0.45 to 0.52

23
Experimental Evaluation (cont'd)
  • VI. Impact of Churn
  • On the benefits of EC-MS
  • Different churn rates: fraction of peers leaving
    the network

24
Experimental Evaluation (cont'd)
25
Conclusions & Open Issues
  • Conclusions
  • Introduced a simple, yet effective, caching
    framework to take advantage of previous work of
    peers in a P2P network
  • Exact Caching (EC)
  • possibility to improve recall, or to reduce
    response time / network cost
  • experiments used a Wikipedia benchmark and a
    real-world query-log
  • investigated various cache replacement strategies
    and considered churn in the P2P network
  • Approximate Caching (AC)
  • aggressive reuse of cached results of subqueries
    if full query results are not available
  • satisfactory outcomes depend on the existing
    cached results
  • Open Issues
  • Proactive Caching (anticipate interesting
    queries, e.g., from existing logs)
  • Maintaining cache freshness (new or better
    results become available)
  • Replication (metadata and/or documents)

26
Thank You For Your Attention! Questions or
Comments?
27
Distributed Hash Tables (DHTs)
  • Distributed Hash Tables (DHTs)
  • Minerva search engine is based on a Distributed
    Hash Table (DHT) to achieve scalability,
    fault-tolerance, and robustness
  • Second generation of structured overlay networks
  • Minerva uses Chord with finger tables
  • Data items distributed to nodes using consistent
    hashing
  • id = hash(key)
  • Lookup method finds the location of a data item
    with a given key in O(log N) hops
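The consistent-hashing rule behind Chord, a key is stored at its successor, the first node identifier clockwise from the key on the ring, can be sketched as follows (a minimal illustration on a small 2^m identifier ring; Chord's finger tables, which give the O(log N) lookup, are omitted):

```python
import hashlib
from bisect import bisect_left

def node_id(name, m=6):
    """Hash a key or node name onto a 2^m identifier ring
    (m=6 gives ids 0..63, as in small Chord examples)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16) % (2 ** m)

def successor(key_id, ring):
    """Consistent hashing: the key is stored at the first node
    whose id follows it clockwise, wrapping around the ring."""
    ids = sorted(ring)
    i = bisect_left(ids, key_id)   # first node id >= key_id
    return ids[i % len(ids)]       # modulo implements the wrap-around
```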

[Figure: Chord ring, lookup(54) resolving to the node storing key K54]