The Case for a Hybrid P2P Search Infrastructure

1
The Case for a Hybrid P2P Search Infrastructure
  • Boon Thau Loo
  • UC Berkeley
  • (with Ryan Huebsch, Joseph Hellerstein, Ion Stoica)

IPTPS 2004 (26 Feb 2004)
2
Our Goal
  • P2P Search - What is the best design?
  • Target Environment
  • Items stored in P2P network, queried by keywords
  • Replicas of items follow a long-tailed
    distribution
  • Popular items at head of distribution
  • Rare items at tail of distribution
  • Typical of P2P file-sharing environments

3
Design Choices
  • Unstructured networks
  • Queries are flooded for bounded number of hops
  • E.g. Gnutella and Kazaa
  • No guarantees on recall
  • Structured Networks
  • Inverted Indexes on Distributed Hash tables
    (DHTs)
  • Inverted Lists indexed by keyword.
  • <Britney: Doc4, Doc5, Doc6, Doc9, Doc11, ...>
  • <Quantum: Doc1, Doc3>
  • Query execution
  • Routes query to all sites hosting keyword in
    query
  • Intersection of multiple Inverted Lists.
  • Guarantees perfect recall (in the absence of network
    failures)
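
The structured-network bullets above can be illustrated with a minimal sketch. It assumes a toy in-process stand-in for a DHT (the `ToyDHT` class, its node count, and the keyword "spears" with its postings are hypothetical and not from the talk); a real DHT would route each lookup over the network.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """In-process stand-in for a DHT: each keyword hashes to one node."""

    def __init__(self, num_nodes=8):
        # Each "node" stores the inverted lists for the keywords it owns.
        self.nodes = [defaultdict(set) for _ in range(num_nodes)]

    def _node_for(self, keyword):
        digest = hashlib.sha1(keyword.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def publish(self, keyword, doc_id):
        self._node_for(keyword)[keyword].add(doc_id)

    def lookup(self, keyword):
        # Return a copy of the inverted list so callers cannot mutate the index.
        return set(self._node_for(keyword).get(keyword, set()))


def search(dht, keywords):
    """Fetch one inverted list per keyword and intersect them."""
    lists = sorted((dht.lookup(k) for k in keywords), key=len)
    if not lists:
        return set()
    result = lists[0]
    for other in lists[1:]:
        result = result & other
    return result


dht = ToyDHT()
for doc in ("Doc4", "Doc5", "Doc6", "Doc9", "Doc11"):
    dht.publish("britney", doc)
for doc in ("Doc1", "Doc3"):
    dht.publish("quantum", doc)
for doc in ("Doc5", "Doc9"):          # hypothetical second keyword
    dht.publish("spears", doc)

two_term = search(dht, ["britney", "spears"])
no_overlap = search(dht, ["britney", "quantum"])
```

Starting the intersection from the shortest list keeps intermediate results small, which is one reason queries over rare keywords are cheap in this design.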

4
Our Proposal: Hybrid Solution
Flood-based Network (All items)
DHT (Index Rare Items)
5
Gnutella Network
Oct 2003 Crawl
  • Crawl of Gnutella Network
  • Based on multiple crawlers from 30 vantage points
    on PlanetLab
  • ~100,000 nodes, 20 million files
  • Ultrapeer-based Topology
  • Queries flooded among ultrapeers
  • Leaf nodes shielded from query traffic
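
A rough sketch of hop-bounded flooding among ultrapeers, to show why recall is not guaranteed. The four-node chain, the file names, and the `flood` function are assumptions made for illustration; leaf nodes and the actual Gnutella message format are left out.

```python
from collections import deque


def flood(graph, files, start, keyword, max_hops):
    """Breadth-first flood from `start`, visiting nodes at most `max_hops` away."""
    seen = {start}
    frontier = deque([(start, 0)])
    hits = []
    while frontier:
        node, hops = frontier.popleft()
        # Every reached node checks its own files against the keyword.
        hits.extend((node, f) for f in files.get(node, []) if keyword in f.lower())
        if hops == max_hops:
            continue  # hop budget exhausted: do not forward any further
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return hits


# Hypothetical overlay: a chain of ultrapeers A - B - C - D.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
files = {"B": ["popular_song.mp3"], "D": ["rare_song.mp3"]}

within_two_hops = flood(graph, files, "A", "rare", max_hops=2)
within_three_hops = flood(graph, files, "A", "rare", max_hops=3)
```

With a two-hop budget the rare file on D is never reached even though it exists, while a three-hop budget finds it.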

[Figure: Gnutella topology of ultrapeer nodes and leaf nodes; legend: >100 files, 0-100 files, 0 files]
6
Gnutella Measurements
  • Quality of Searches
  • Recall (% of all relevant items retrieved)
  • Response Time (Latency) to 1st result
  • Software utilized
  • Modified the LimeWire Gnutella Client
  • Run as leaf or ultrapeer
  • Sniff Gnutella traffic
  • Inject queries and gather results

7
Gnutella Search Quality
  • Reissue Gnutella queries
  • 30 LimeWire Ultrapeers on PlanetLab
  • 700 Gnutella queries at 3 different times
  • Compute Query Recall
  • Each query issued simultaneously from 30
    ultrapeers
  • Union of results from 30 ultrapeers
  • Union-of-30 is our approximation of perfect
    answer
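
The recall computation described above can be sketched as follows. The function name and the three small result sets are hypothetical; the measurement in the talk uses 30 vantage points.

```python
def query_recall(single_results, all_vantage_results):
    """Recall of one vantage point relative to the union over all vantage points."""
    union = set().union(*all_vantage_results)
    if not union:
        return None  # nothing found anywhere, so recall is undefined
    return len(set(single_results) & union) / len(union)


# Hypothetical results for one query seen from three ultrapeers.
per_ultrapeer = [{"a.mp3", "b.mp3"}, {"b.mp3", "c.mp3"}, set()]

recall_first = query_recall(per_ultrapeer[0], per_ultrapeer)
recall_third = query_recall(per_ultrapeer[2], per_ultrapeer)
```

The third ultrapeer illustrates the case the measurements highlight: it receives no results although matching items exist elsewhere in the network.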

8
Queries with Small Result Sets
9
Result Size CDF
10
Result Size CDF
[Figure: result size CDF; curves: Single Query, Union-of-30 Query]
Large fraction of queries return few or no
results even when they exist
11
Query Latency
Queries that return few results have poor
response times
12
Summary of Measurements
  • Searching on Gnutella
  • Highly effective for popular items
  • Less effective for rare items
  • Significant opportunity to do better
  • Large fraction of queries return few or no
    results even when they exist
  • Bad response times for queries on rare items

13
DHT-based Search
  • Advantages
  • Avoid flooding query in network
  • Guarantee recall (critical for small result sets)
  • Disadvantages
  • Constructing inverted lists is costly
  • So is intersecting inverted lists at query time
  • Back-of-envelope calculations show infeasibility
    for large datasets (IPTPS '03)
  • Feasible for querying rare items
  • Queries over rare items ship 7x fewer inverted
    list entries compared to the average query

14
Hybrid Search
  • Hybrid: Best of both worlds
  • Flooding techniques for searching popular items
  • DHT for rare items
  • Identifying rare items
  • Query snooping
  • Items from previous queries that return few
    results.
  • Other techniques (ongoing work)
  • Term Frequency Statistics (single term, pairs)
  • Sample items on neighboring nodes
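
One possible reading of the hybrid scheme, as a sketch only: flood first, fall back to the DHT index when flooding returns few results, and populate that index by snooping small result sets. The class name, the `RARE_THRESHOLD` value, the file-name tokenizer, and the local dictionary standing in for the DHT are all assumptions, not details given in the talk.

```python
import re

RARE_THRESHOLD = 3  # hypothetical cut-off for "few results"


def keywords_of(text):
    """Split a file name or query string into lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class HybridUltrapeer:
    def __init__(self, flood):
        self.flood = flood  # callable: query string -> iterable of file names
        self.dht = {}       # keyword -> set of file names (stand-in for the DHT)

    def snoop(self, results):
        """Query snooping: index items seen in small (non-empty) result sets."""
        if 0 < len(results) < RARE_THRESHOLD:
            for filename in results:
                for kw in keywords_of(filename):
                    self.dht.setdefault(kw, set()).add(filename)

    def search(self, query):
        results = set(self.flood(query))
        if len(results) >= RARE_THRESHOLD:
            return results  # popular item: flooding was enough
        # Rare item: intersect the inverted lists for every query term.
        lists = [self.dht.get(term, set()) for term in keywords_of(query)]
        dht_hits = set.intersection(*lists) if lists else set()
        return results | dht_hits


# Flooding finds nothing here, so any answer has to come from the DHT index.
peer = HybridUltrapeer(flood=lambda query: [])
peer.snoop(["quantum_lecture.avi"])
rare_hit = peer.search("quantum lecture")
miss = peer.search("britney")
```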

15
PlanetLab Deployment
[Figure: deployment topology showing Gnutella leaves (L), Gnutella ultrapeers (U1, U2), and hybrid ultrapeers (P1, P2, P3; PIER + Gnutella), connected by Gnutella links and PIER links, with the horizon of P1 marked]
16
Results
  • Improved Response Time
  • PIER returns first result in 10 seconds
  • 40 seconds in aggregate including 30 seconds
    timeout
  • Gnutella returns first result in 65
    seconds
  • 25 seconds (38%) reduction in latency
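
The latency figures on this slide fit together as simple arithmetic, checked below. The variable names are illustrative, and reading the 40-second aggregate as the 30-second timeout plus the PIER query time is an interpretation of the slide.

```python
pier_first_result_s = 10       # PIER returns its first result
flood_timeout_s = 30           # wait before falling back to PIER
gnutella_first_result_s = 65   # Gnutella alone

hybrid_latency_s = flood_timeout_s + pier_first_result_s
saving_s = gnutella_first_result_s - hybrid_latency_s
saving_pct = 100 * saving_s / gnutella_first_result_s

print(hybrid_latency_s, saving_s, round(saving_pct))
```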

17
Results
  • Improved Recall: 18% reduction in the number of
    queries that eventually receive no results from
    Gnutella
  • Lots of room for improvement: 66% potential
    reduction based on Union-of-30 query

18
Conclusion
  • Hybrid Search Infrastructure
  • Flooding for popular items, DHT for rare items
  • Simple idea that works in practice

19
Questions?
20
Backup Slides
21
Gnutella Optimizations
  • Ultrapeers and Leaf nodes
  • Dynamic Querying
  • More aggressive flooding for queries that return
    few results
  • Query Routing Protocol (QRP) tables
  • Bloom-filter-like structures
  • Cache at neighboring nodes for directed flooding.
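
To illustrate what a "Bloom-filter-like" keyword table buys, here is a generic Bloom filter sketch. It is not the actual QRP table format or hash function; the class name, bit count, and number of hashes are assumptions.

```python
import hashlib


class KeywordFilter:
    """Generic Bloom filter over keywords, stored as one Python integer."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, keyword):
        # Derive several bit positions by salting the keyword with an index.
        for i in range(self.num_hashes):
            data = f"{i}:{keyword.lower()}".encode("utf-8")
            digest = hashlib.sha1(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, keyword):
        for pos in self._positions(keyword):
            self.bits |= 1 << pos

    def might_contain(self, keyword):
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(keyword))


def should_forward(neighbour_filter, query_keywords):
    """Forward a query only if every keyword might be present at the neighbour."""
    return all(neighbour_filter.might_contain(k) for k in query_keywords)


leaf_filter = KeywordFilter()
for word in ("quantum", "lecture"):
    leaf_filter.add(word)

forward = should_forward(leaf_filter, ["quantum", "lecture"])
```

A filter never misses a keyword that was added, so skipping neighbours whose filter says "absent" cannot lose results; the cost is an occasional unnecessary forward on a false positive.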

22
Hybrid Implementation
  • PIER
  • Fully decentralized relational query processor
    over DHTs
  • Supports Select, Project, Joins, Group
    By/Aggregation.
  • Support for keyword searching
  • Intersection of Posting Lists -> Joins.
  • Hybrid Ultrapeer Node
  • LimeWire Ultrapeer (Gnutella)
  • PIER (DHT overlay)

23
For more details
  • http://pier.cs.berkeley.edu
  • LimeWire. http://www.limewire.org/
  • Gnutella Developer Forum
  • http://groups.yahoo.com/group/the_gdf

24
Handling Network Churn
  • Only ultrapeers participate in DHT overlay.
    Ultrapeers publish items for leaves.
  • Avoid publishing items from short-lived nodes.

25
Hybrid Implementation
26
Search Engine using PIER
Symmetric Hash Join
Join Index
Item(docID, filename, location, filesize, ...)
Posting(keyword, docID), using keyword as DHT storage key.
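
A minimal single-process sketch of a symmetric hash join over two Posting streams, joined on docID. This is a generic illustration of the operator named on the slide, not PIER's implementation; the keyword "spears" and the sample tuples are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key):
    """Join interleaved ("left" | "right", row) tuples on row[key].

    Each arriving row is inserted into its own side's hash table and then
    probed against the other side's table, so matches are emitted as soon
    as both halves have arrived, in whatever order they come.
    """
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, row in stream:
        other = "right" if side == "left" else "left"
        value = row[key]
        tables[side][value].append(row)
        for match in tables[other].get(value, []):
            yield (row, match) if side == "left" else (match, row)


# Posting(keyword, docID) tuples for two keywords, arriving interleaved.
stream = [
    ("left", {"keyword": "britney", "docID": "Doc4"}),
    ("right", {"keyword": "spears", "docID": "Doc5"}),
    ("left", {"keyword": "britney", "docID": "Doc5"}),
    ("left", {"keyword": "britney", "docID": "Doc9"}),
    ("right", {"keyword": "spears", "docID": "Doc9"}),
]

matched_docs = [left["docID"] for left, right in symmetric_hash_join(stream, "docID")]
```

The matching docIDs would then be joined with Item to fetch file names and locations.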
27
Related Work
  • Most file downloads are for popular items
    (SIGCOMM 2003)
  • Downloads are only half the story.
  • Rare items in aggregate can be substantial.
  • Build Gnutella on a DHT? (HOTNETS 2003)
  • Improved performance for floods and random walks
  • Reduced maintenance overheads.

28
Outline
  • Background
  • P2P Search Today
  • Unstructured Networks (Gnutella)
  • Structured Networks or DHTs
  • The Case for Hybrid
  • Hybrid Design and Implementation
  • Evaluation on PlanetLab