Title: The Case for a Hybrid P2P Search Infrastructure
1. The Case for a Hybrid P2P Search Infrastructure
- Boon Thau Loo
- UC Berkeley
- (with Ryan Huebsch, Joseph Hellerstein, Ion Stoica)
IPTPS 2004 (26 Feb 2004)
2. Our Goal
- P2P search: what is the best design?
- Target Environment
- Items stored in P2P network, queried by keywords
- Replicas of items follow a long-tailed distribution
- Popular items at head of distribution
- Rare items at tail of distribution
- Typical of P2P file-sharing environments
3. Design Choices
- Unstructured networks
- Queries are flooded for bounded number of hops
- E.g. Gnutella and Kazaa
- No guarantees on recall
- Structured Networks
- Inverted Indexes on Distributed Hash Tables (DHTs)
- Inverted Lists indexed by keyword.
- <Britney: Doc4, Doc5, Doc6, Doc9, Doc11, ...>
- <Quantum: Doc1, Doc3>
- Query execution
- Routes query to all sites hosting a keyword in the query
- Intersection of multiple Inverted Lists.
- Guarantees perfect recall (in the absence of network failures)
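The structured design above can be sketched as a toy inverted index. This is a minimal illustration, not the system from the talk: `ToyDHT`, its fixed node count, and the keyword "Remix" are hypothetical, and a real DHT would route over a network rather than index into a local list.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """Hypothetical stand-in for a DHT: keywords are hashed onto a fixed set of nodes."""

    def __init__(self, num_nodes=4):
        # Each node stores its share of the inverted index: keyword -> set of doc IDs.
        self.nodes = [defaultdict(set) for _ in range(num_nodes)]

    def _node_for(self, keyword):
        digest = hashlib.sha1(keyword.lower().encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def publish(self, doc_id, keywords):
        # One inverted-list entry is written per keyword of the item.
        for kw in keywords:
            self._node_for(kw)[kw.lower()].add(doc_id)

    def search(self, keywords):
        # Fetch one inverted list per query keyword, then intersect them,
        # starting from the shortest list to keep intermediate results small.
        lists = [self._node_for(kw).get(kw.lower(), set()) for kw in keywords]
        if not lists:
            return set()
        lists.sort(key=len)
        result = set(lists[0])
        for other in lists[1:]:
            result &= other
        return result


dht = ToyDHT()
for doc in ["Doc4", "Doc5", "Doc6", "Doc9", "Doc11"]:
    dht.publish(doc, ["Britney"])
for doc in ["Doc1", "Doc3"]:
    dht.publish(doc, ["Quantum"])
# "Remix" is an invented second keyword, added only to show an intersection.
for doc in ["Doc5", "Doc9"]:
    dht.publish(doc, ["Remix"])

print(sorted(dht.search(["Britney", "Remix"])))  # ['Doc5', 'Doc9']
```

Every item matching all keywords is found because each keyword maps to exactly one node, which is where the recall guarantee comes from.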
4. Our Proposal: Hybrid Solution
Flood-based Network (All items)
DHT (Index Rare Items)
5. Gnutella Network
Oct 2003 Crawl
- Crawl of Gnutella Network
- Based on multiple crawlers from 30 vantage points on PlanetLab
- Approximately 100,000 nodes, 20 million files
- Ultrapeer-based Topology
- Queries flooded among ultrapeers
- Leaf nodes shielded from query traffic
Figure legend: Ultrapeer nodes, Leaf nodes; >100 Files, 0-100 Files, 0 Files
6. Gnutella Measurements
- Quality of Searches
- Recall (% of all relevant items retrieved)
- Response Time (Latency) to 1st result
- Software utilized
- Modified the LimeWire Gnutella Client
- Run as leaf or ultrapeer
- Sniff Gnutella traffic
- Inject queries and gather results
7. Gnutella Search Quality
- Reissue Gnutella queries
- 30 LimeWire Ultrapeers on PlanetLab
- 700 Gnutella queries at 3 different times
- Compute Query Recall
- Each query issued simultaneously from 30 ultrapeers
- Union of results from 30 ultrapeers
- Union-of-30 is our approximation of the perfect answer
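The recall computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the three-vantage-point sample data are hypothetical, and results are assumed to be comparable item identifiers.

```python
def union_of_vantage_points(results_per_ultrapeer):
    """Union of the result sets returned to each ultrapeer for one query."""
    union = set()
    for results in results_per_ultrapeer:
        union |= set(results)
    return union


def recall(single_result, union):
    """Fraction of the union-of-N answer retrieved from one vantage point.

    Returns None when the union is empty, since recall is undefined there.
    """
    if not union:
        return None
    return len(set(single_result) & union) / len(union)


# Invented results for one query seen from three vantage points (the talk uses 30).
per_ultrapeer = [{"a", "b"}, {"b", "c", "d"}, set()]
union = union_of_vantage_points(per_ultrapeer)
print([recall(r, union) for r in per_ultrapeer])  # [0.5, 0.75, 0.0]
```

Because the union is only an approximation of the perfect answer, recall computed this way is an upper bound on true recall.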
8. Queries with Small Result Sets
9. Result Size CDF
10. Result Size CDF
Figure legend: Single Query, Union-of-30 Query
A large fraction of queries return few or no results even when matching items exist.
11. Query Latency
Queries that return few results have poor response times.
12. Summary of Measurements
- Searching on Gnutella
- Highly effective for popular items
- Less effective for rare items
- Significant opportunity to do better
- Large fraction of queries return few or no results even when matching items exist
- Bad response times for queries on rare items
13. DHT-based Search
- Advantages
- Avoid flooding query in network
- Guarantee recall (critical for small result sets)
- Disadvantages
- Constructing inverted lists is costly
- So is intersecting inverted lists at query time
- Back-of-envelope calculations show infeasibility for large datasets (IPTPS '03)
- Feasible for querying rare items
- Queries over rare items ship 7x fewer inverted list entries than the average query
14. Hybrid Search
- Hybrid: best of both worlds
- Flooding techniques for searching popular items
- DHT for rare items
- Identifying rare items
- Query snooping
- Items from previous queries that return few results.
- Other techniques (ongoing work)
- Term Frequency Statistics (single term, pairs)
- Sample items on neighboring nodes
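Query snooping, as listed above, can be sketched as a small filter over observed query results. This is an illustration only: the threshold value, the class and function names, and the filename tokenizer are assumptions, since the slides do not state how "few results" is defined or how keywords are extracted.

```python
import re

RARE_THRESHOLD = 20  # assumed cutoff; the slides do not give a value


def keywords_of(filename):
    """Split a filename into lowercase alphanumeric tokens (assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", filename.lower()) if t]


class QuerySnooper:
    """Marks items as rare when they appear in a small result set."""

    def __init__(self, threshold=RARE_THRESHOLD):
        self.threshold = threshold
        self.to_publish = {}  # item_id -> keywords to index in the DHT

    def observe(self, results):
        """results: list of (item_id, filename) returned for one snooped query.

        Returns the item IDs newly selected for publishing into the DHT.
        """
        if len(results) > self.threshold:
            return []  # large result set: items are popular, leave them to flooding
        new_items = []
        for item_id, filename in results:
            if item_id not in self.to_publish:
                self.to_publish[item_id] = keywords_of(filename)
                new_items.append(item_id)
        return new_items


snooper = QuerySnooper(threshold=2)
print(snooper.observe([("f1", "Rare_Demo Tape.mp3")]))  # ['f1']
print(snooper.to_publish["f1"])  # ['rare', 'demo', 'tape', 'mp3']
```

Only the items selected here would be inserted into the DHT, which keeps the inverted lists short.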
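15. PlanetLab Deployment
Figure: deployment topology showing hybrid ultrapeers P1, P2 and P3, Gnutella ultrapeers U1 and U2, and Gnutella leaves (L), with the horizon of P1 marked.
Legend: L = Gnutella Leaf; U = Gnutella Ultrapeer; P = Hybrid Ultrapeer (PIER + Gnutella); Gnutella links; PIER links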
16. Results
- Improved Response Time
- PIER returns first result in 10 seconds
- 40 seconds in aggregate, including the 30-second timeout
- Gnutella queries return first result in 65 seconds
- 25 seconds (38%) reduction in latency
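The latency figures above fit together arithmetically. The sketch below reproduces them, assuming the 30-second timeout is spent waiting on Gnutella before the query is reissued to PIER; the function names are hypothetical.

```python
def hybrid_first_result_latency(gnutella_timeout_s, dht_latency_s):
    """Time to first result when the query goes to the DHT after the flood times out."""
    return gnutella_timeout_s + dht_latency_s


def reduction(baseline_s, improved_s):
    """Absolute and relative latency reduction against a baseline."""
    saved = baseline_s - improved_s
    return saved, saved / baseline_s


hybrid = hybrid_first_result_latency(30, 10)  # 30 s timeout + 10 s PIER lookup
saved, frac = reduction(65, hybrid)           # 65 s is the Gnutella-only figure
print(hybrid, saved, round(frac * 100))       # 40 25 38
```

A shorter timeout would reduce the aggregate latency further, at the cost of sending more queries to the DHT.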
17. Results
- Improved recall: 18% reduction in the number of queries that eventually receive no results from Gnutella
- Lots of room for improvement: 66% potential reduction based on Union-of-30 query
18. Conclusion
- Hybrid Search Infrastructure
- Flooding for popular items, DHT for rare items
- Simple idea that works in practice
19. Questions?
20. Backup Slides
21. Gnutella Optimizations
- Ultrapeers and Leaf nodes
- Dynamic Querying
- More aggressive flooding for queries that return few results
- Query Routing Protocol (QRP) tables
- Bloom-filter-like structures
- Cache at neighboring nodes for directed flooding.
22. Hybrid Implementation
- PIER
- Fully decentralized relational query processor over DHTs
- Supports Select, Project, Joins, Group By/Aggregation.
- Support for keyword searching
- Intersection of Posting Lists -> Joins.
- Hybrid Ultrapeer Node
- LimeWire Ultrapeer (Gnutella)
- PIER (DHT overlay)
23. For more details
- http://pier.cs.berkeley.edu
- LimeWire. http://www.limewire.org/
- Gnutella Developer Forum
- http://groups.yahoo.com/group/the_gdf
24. Handling Network Churn
- Only ultrapeers participate in DHT overlay. Ultrapeers publish items for leaves.
- Avoid publishing items from short-lived nodes.
25. Hybrid Implementation
26. Search Engine using PIER
Symmetric Hash Join
Join Index
Item(docID, filename, location, filesize, ...)
Posting(keyword, docID), using keyword as DHT storage key.
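A symmetric hash join over two Posting streams can be sketched as below. This is a generic, single-process illustration of the technique, not PIER's API: the function name, the "L"/"R" side tags, and the sample tuples are hypothetical.

```python
from collections import defaultdict


def symmetric_hash_join(arrivals):
    """Join two interleaved streams of Posting(keyword, docID) tuples on docID.

    arrivals yields (side, (keyword, doc_id)) with side "L" or "R".
    Each tuple is inserted into its own side's hash table and then probes
    the other side's table, so matches are emitted as soon as both halves
    have arrived, without waiting for either input to finish.
    """
    left, right = defaultdict(list), defaultdict(list)
    for side, (keyword, doc_id) in arrivals:
        own, other = (left, right) if side == "L" else (right, left)
        own[doc_id].append(keyword)
        for match in other.get(doc_id, ()):
            if side == "L":
                yield (doc_id, keyword, match)
            else:
                yield (doc_id, match, keyword)


# Invented posting tuples for two keywords, arriving interleaved.
arrivals = [
    ("L", ("britney", "Doc4")),
    ("R", ("remix", "Doc5")),
    ("L", ("britney", "Doc5")),
    ("R", ("remix", "Doc9")),
    ("L", ("britney", "Doc9")),
]
print(list(symmetric_hash_join(arrivals)))
# [('Doc5', 'britney', 'remix'), ('Doc9', 'britney', 'remix')]
```

The docIDs produced by the join are the intersection of the two posting lists and could then be matched against Item tuples to fetch filenames and locations.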
27. Related Work
- Most file downloads are for popular items (SIGCOMM 2003)
- Downloads are only half the story.
- Rare items in aggregate can be substantial.
- Build Gnutella on a DHT? (HOTNETS 2003)
- Improved performance for floods and random walks
- Reduced maintenance overheads.
28. Outline
- Background
- P2P Search Today
- Unstructured Networks (Gnutella)
- Structured Networks or DHTs
- The Case for Hybrid
- Hybrid Design and Implementation
- Evaluation on PlanetLab