The Case for a Hybrid P2P Search Infrastructure

1
The Case for a Hybrid P2P Search Infrastructure

Boon Thau Loo
UC Berkeley
(with Ryan Huebsch,
Joseph Hellerstein, Ion Stoica)

IPTPS 2004 (26 Feb 2004)
2
Our Goal

P2P Search - What is the best design?
Target Environment
Items stored in P2P network, queried by keywords
Replicas of items follow a long-tailed
distribution
Popular items at head of distribution
Rare items at tail of distribution
Typical of P2P file-sharing environments

3
Design Choices

Unstructured networks
Queries are flooded for bounded number of hops
E.g. Gnutella and Kazaa
No guarantees on recall
Structured Networks
Inverted Indexes on Distributed Hash Tables (DHTs)
Inverted Lists indexed by keyword:
<Britney: Doc4, Doc5, Doc6, Doc9, Doc11, ...>
<Quantum: Doc1, Doc3>
Query execution
Routes query to all sites hosting a keyword in the query
Intersection of multiple Inverted Lists
Guarantees perfect recall (in the absence of network failures)
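
The structured design above can be sketched as follows. This is a minimal in-process illustration, not the deck's implementation: `ToyDHT`, `publish`, `lookup`, and `search` are hypothetical names, the "nodes" are plain dictionaries, and the `spears` keyword is an assumed example added so the intersection has something to match.

```python
import hashlib


class ToyDHT:
    """In-process stand-in for a DHT: each keyword hashes to one of n nodes."""

    def __init__(self, num_nodes=4):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, keyword):
        digest = hashlib.sha1(keyword.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def publish(self, keyword, doc_id):
        # Add doc_id to the inverted list stored at the keyword's home node.
        self._node_for(keyword).setdefault(keyword, set()).add(doc_id)

    def lookup(self, keyword):
        # Return a copy so callers cannot mutate the stored inverted list.
        return set(self._node_for(keyword).get(keyword, set()))


def search(dht, keywords):
    """Intersect the inverted lists of all keywords, smallest list first."""
    lists = sorted((dht.lookup(k) for k in keywords), key=len)
    if not lists:
        return set()
    result = lists[0]
    for other in lists[1:]:
        result &= other
    return result


dht = ToyDHT()
for doc in ("Doc4", "Doc5", "Doc6", "Doc9", "Doc11"):
    dht.publish("britney", doc)
for doc in ("Doc1", "Doc3"):
    dht.publish("quantum", doc)
for doc in ("Doc5", "Doc9"):
    dht.publish("spears", doc)  # assumed second keyword, not from the slides

hits = search(dht, ["britney", "spears"])
```

Every keyword has exactly one home node, so a lookup sees the full inverted list; that is where the recall guarantee comes from. The cost is shipping list entries between nodes for the intersection.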

4
Our Proposal: Hybrid Solution
Flood-based Network (All items)
DHT (Index Rare Items)
5
Gnutella Network
Oct 2003 Crawl

Crawl of Gnutella Network
Based on multiple crawlers from 30 vantage points
on PlanetLab
~100,000 nodes, 20 million files
Ultrapeer-based Topology
Queries flooded among ultrapeers
Leaf nodes shielded from query traffic

[Figure: ultrapeer and leaf nodes, grouped by number of shared files: 0 files, 0-100 files, >100 files]
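
Flooding among ultrapeers for a bounded number of hops can be sketched as a breadth-first search with a hop limit. This is a toy model under stated assumptions: the four-node chain, the file names, and the `flood` function are hypothetical, and real Gnutella adds message IDs, duplicate suppression, and leaf handling.

```python
from collections import deque


def flood(graph, files, start, keyword, max_hops):
    """Breadth-first flood from start, visiting nodes within max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    results = []
    while frontier:
        node, hops = frontier.popleft()
        # Match the keyword against the files shared by this node.
        results.extend(
            (node, name) for name in files.get(node, []) if keyword in name.lower()
        )
        if hops == max_hops:
            continue  # hop budget exhausted: do not forward further
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return results


# Assumed topology: a chain U1 - U2 - U3 - U4.
graph = {"U1": ["U2"], "U2": ["U1", "U3"], "U3": ["U2", "U4"], "U4": ["U3"]}
files = {"U2": ["popular_song.mp3"], "U4": ["rare_lecture.pdf"]}

near = flood(graph, files, "U1", "popular", max_hops=2)
far = flood(graph, files, "U1", "rare", max_hops=2)
far_wide = flood(graph, files, "U1", "rare", max_hops=3)
```

With a limit of two hops the rare file on U4 is never reached even though it exists, which is the recall problem the measurements in this deck quantify.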
6
Gnutella Measurements

Quality of Searches
Recall (% of all relevant items retrieved)
Response Time (Latency) to 1st result
Software utilized
Modified the LimeWire Gnutella Client
Run as leaf or ultrapeer
Sniff Gnutella traffic
Inject queries and gather results

7
Gnutella Search Quality

Reissue Gnutella queries
30 LimeWire Ultrapeers on PlanetLab
700 Gnutella queries at 3 different times
Compute Query Recall
Each query issued simultaneously from 30
ultrapeers
Union of results from 30 ultrapeers
Union-of-30 is our approximation of perfect
answer
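
The recall computation described above can be sketched as below: each vantage point's result set is compared with the union across all vantage points. The function name `query_recall` and the sample data are assumptions for illustration; the experiment used 30 ultrapeers, the sample uses three.

```python
def query_recall(results_per_ultrapeer):
    """Return each vantage point's recall against the union of all results."""
    union = set().union(*results_per_ultrapeer.values())
    if not union:
        return {}, union  # no vantage point found anything: recall undefined
    recall = {
        peer: len(found) / len(union)
        for peer, found in results_per_ultrapeer.items()
    }
    return recall, union


# Hypothetical result sets for one query seen from three ultrapeers.
sample = {"up1": {"a", "b"}, "up2": {"b", "c", "d"}, "up3": set()}
recall, union = query_recall(sample)
```

Because the union is only an approximation of the perfect answer, recall computed this way is an upper bound on true recall.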

8
Queries with Small Result Sets
9
Result Size CDF
10
Result Size CDF
[Figure: result size CDF for a single query vs. the Union-of-30 query]
Large fraction of queries return few or no
results even when they exist
11
Query Latency
Queries that return few results have poor
response times
12
Summary of Measurements

Searching on Gnutella
Highly effective for popular items
Less effective for rare items
Significant opportunity to do better
Large fraction of queries return few or no
results even when they exist
Bad response times for queries on rare items

13
DHT-based Search

Advantages
Avoid flooding query in network
Guarantee recall (critical for small result sets)
Disadvantages
Constructing inverted lists is costly
So is intersecting inverted lists at query time
Back-of-envelope calculations show infeasibility
for large datasets (IPTPS '03)
Feasible for querying rare items
Queries over rare items ship 7x fewer inverted
list entries compared to the average query

14
Hybrid Search

Hybrid: best of both worlds
Flooding techniques for searching popular items
DHT for rare items
Identifying rare items
Query snooping
Items from previous queries that return few
results.
Other techniques (ongoing work)
Term Frequency Statistics (single term, pairs)
Sample items on neighboring nodes
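
One way to picture the hybrid scheme with query snooping is the sketch below. It is an illustration under assumptions, not the deployed system: `HybridUltrapeer`, `RARE_THRESHOLD`, and the filename tokenization are hypothetical, the DHT is a plain dictionary, and the timeout before falling back is omitted.

```python
RARE_THRESHOLD = 20  # hypothetical cutoff for "few results"; not from the slides


class HybridUltrapeer:
    def __init__(self, flood_search, dht_index, rare_threshold=RARE_THRESHOLD):
        self.flood_search = flood_search  # callable: keywords -> list of filenames
        self.dht_index = dht_index        # dict: keyword -> set of filenames
        self.rare_threshold = rare_threshold

    def snoop(self, results):
        """Query snooping: index items from observed queries with few results."""
        if 0 < len(results) <= self.rare_threshold:
            for filename in results:
                words = filename.lower().replace(".", " ").replace("_", " ").split()
                for word in words:
                    self.dht_index.setdefault(word, set()).add(filename)

    def search(self, keywords):
        results = list(self.flood_search(keywords))
        if len(results) > self.rare_threshold:
            return results  # popular item: flooding was enough
        # Few or no results: re-issue the query against the DHT index.
        lists = [self.dht_index.get(k.lower(), set()) for k in keywords]
        dht_hits = set.intersection(*lists) if lists else set()
        return sorted(set(results) | dht_hits)


# The flood finds nothing, but a previously snooped rare item is in the index.
peer = HybridUltrapeer(flood_search=lambda kws: [], dht_index={}, rare_threshold=2)
peer.snoop(["quantum_lecture.pdf"])
hits = peer.search(["quantum", "lecture"])
```

Only items seen in small result sets get published, so the DHT index stays small and the costly inverted-list work is limited to rare items.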

15
PlanetLab Deployment
[Figure: PlanetLab deployment topology, showing the horizon of P1. Legend: L = Gnutella leaf, U = Gnutella ultrapeer, P = hybrid ultrapeer (PIER + Gnutella); Gnutella links and PIER links drawn separately]
16
Results

Improved Response Time
PIER returns first result in 10 seconds
40 seconds in aggregate, including the 30-second
timeout
Gnutella queries return first result in 65
seconds
25 seconds (38%) reduction in latency
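
The latency figures above fit together as a short calculation. The variable names are assumptions; the numbers are the ones on the slide.

```python
pier_first_result_s = 10      # PIER time to first result
flood_timeout_s = 30          # wait on Gnutella before falling back to PIER
gnutella_first_result_s = 65  # Gnutella time to first result for these queries

hybrid_s = pier_first_result_s + flood_timeout_s   # 40 seconds in aggregate
saving_s = gnutella_first_result_s - hybrid_s      # 25 seconds saved
reduction_pct = round(100 * saving_s / gnutella_first_result_s)  # 38 percent
```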

17
Results

Improved recall: 18% reduction in the number of
queries that eventually receive no results from
Gnutella
Lots of room for improvement: 66% potential
reduction based on Union-of-30 query

18
Conclusion

Hybrid Search Infrastructure
Flooding for popular items, DHT for rare items
Simple idea that works in practice

19
Questions?
20
Backup Slides
21
Gnutella Optimizations

Ultrapeers and Leaf nodes
Dynamic Querying
More aggressive flooding for queries that return
few results
Query Routing Protocol (QRP) tables
Bloom-filter-like structures
Cache at neighboring nodes for directed flooding.
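
The idea behind a Bloom-filter-like routing table can be sketched as follows. This is a generic Bloom filter, not the actual QRP table format: `KeywordFilter`, `should_forward`, the bit-array size, and the hash construction are all assumptions for illustration.

```python
import hashlib


class KeywordFilter:
    """Generic Bloom filter over keywords (an assumed stand-in for a QRP table)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, keyword):
        # Derive num_hashes bit positions by salting the keyword with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{keyword.lower()}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, keyword):
        for pos in self._positions(keyword):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, keyword):
        # False means definitely absent; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(keyword)
        )


def should_forward(neighbor_filter, keywords):
    """Forward a query only if the neighbor might match every keyword."""
    return all(neighbor_filter.might_contain(k) for k in keywords)


leaf_filter = KeywordFilter()
for word in ("quantum", "lecture"):
    leaf_filter.add(word)

forward = should_forward(leaf_filter, ["Quantum", "lecture"])
```

A neighbor caches the filter and skips forwarding queries it cannot match, which turns blind flooding into directed flooding at the cost of occasional false positives.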

22
Hybrid Implementation

PIER
Fully decentralized relational query processor
over DHTs
Supports Select, Project, Joins, Group
By/Aggregation.
Support for keyword searching
Intersection of Posting Lists -> Joins.
Hybrid Ultrapeer Node
LimeWire Ultrapeer (Gnutella)
PIER (DHT overlay)

23
For more details

http://pier.cs.berkeley.edu
LimeWire. http://www.limewire.org/
Gnutella Developer Forum.
http://groups.yahoo.com/group/the_gdf

24
Handling Network Churn

Only ultrapeers participate in DHT overlay.
Ultrapeers publish items for leaves.
Avoid publishing items from short-lived nodes.

25
Hybrid Implementation
26
Search Engine using PIER
Symmetric Hash Join
Join Index
Item(docID, filename, location, filesize, ...)
Posting(keyword, docID), using keyword as DHT storage key.
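
A symmetric hash join over Posting tuples can be sketched as below: each arriving tuple is inserted into its own side's hash table and immediately probed against the other side, so matches are emitted as soon as both halves have arrived. This is a single-process illustration, not PIER's operator; the function name, the `side_of` mapping, and the sample postings are assumptions.

```python
def symmetric_hash_join(arrivals):
    """Join two interleaved posting streams on docID.

    arrivals yields (side, doc_id) pairs, where side is 0 or 1.
    """
    seen = (set(), set())  # one hash table per input side
    for side, doc_id in arrivals:
        if doc_id in seen[side]:
            continue  # duplicate posting: already built and probed
        seen[side].add(doc_id)        # build into own table
        if doc_id in seen[1 - side]:  # probe the other table
            yield doc_id


# Hypothetical Posting(keyword, docID) tuples arriving in interleaved order.
postings = [
    ("britney", "Doc4"),
    ("spears", "Doc5"),
    ("britney", "Doc5"),
    ("spears", "Doc9"),
    ("britney", "Doc9"),
    ("britney", "Doc11"),
]
side_of = {"britney": 0, "spears": 1}
matches = list(symmetric_hash_join((side_of[k], d) for k, d in postings))
```

Neither input has to finish before results flow, which suits a network where posting lists arrive from different nodes at different times. The matching docIDs would then be looked up in Item to fetch filename and location.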
27
Related Work