Title: 1
1Enhancing P2P File-Sharing with an Internet-Scale
Query Processor
- Boon Thau Loo
- Joseph M. Hellerstein, Ryan Huebsch, Scott Shenker, Ion Stoica
- UC Berkeley, Intel Research Berkeley, ICSI
2Introduction
- Internet-scale Query Processing
- Millions of machines
- Churn: nodes arriving and departing continuously
- Lots of semantic homogeneity
- E.g. keywords in file names, packet headers on the Internet, etc.
- This is the focus of the PIER project at Berkeley (VLDB '03)
- The real-world example today is file-sharing
3P2P File-Sharing
- Real-world Challenges
- Millions of real users and data
- Large number of concurrent queries
- Plenty of research on this topic
- Databases, Networks, Distributed Systems
- My talk today is on exploring this real-world scenario
- Provide insights into future research in Internet-scale querying
4Our Goal
- P2P Search - What is the best design?
- Target Environment
- Typical P2P file-sharing environments.
- Items stored in P2P network, queried by keywords
- Replicas of items follow a long-tailed distribution
- Popular items at head of distribution
- Rare items at tail of distribution
5Unstructured Networks
- Ad-hoc topology
- Queries are flooded for bounded number of hops
- No guarantees on recall
- E.g. Gnutella and Kazaa
[Figure: query "xyz" flooded through the network]
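The bounded flooding described on this slide can be sketched as a breadth-first search that stops forwarding once the hop budget is spent. This is a minimal illustration, not Gnutella's actual protocol: the function name `flood_query`, the toy topology, and the whole-word keyword match are assumptions made for the example.

```python
from collections import deque

def flood_query(graph, files, start, keyword, max_hops):
    """Breadth-first flood from `start`, forwarding for at most `max_hops` hops.

    graph: dict node -> list of neighbour nodes (hypothetical overlay)
    files: dict node -> list of file names stored at that node
    Returns the set of (node, filename) matches the flood reached.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    hits = set()
    while frontier:
        node, hops = frontier.popleft()
        for name in files.get(node, []):
            if keyword in name.lower().split():
                hits.add((node, name))
        if hops == max_hops:
            continue  # hop budget exhausted: do not forward further
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return hits

# Toy line topology a - b - c - d; only d holds a matching file.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
files = {"d": ["xyz demo"]}
near = flood_query(graph, files, "a", "xyz", max_hops=2)  # never reaches d
far = flood_query(graph, files, "a", "xyz", max_hops=3)   # reaches d
```

With a budget of 2 hops the matching file exists but is not found, which is the "no guarantees on recall" point in miniature.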
6Structured Networks
- Distributed Hash Tables (DHTs)
- Hash table interface: put(key, item), get(key)
- O(log n) hops
- Guarantees on recall
7Keyword Search using DHTs
- Inverted lists hashed by keyword (term) in the DHT
Query: T1 AND T2
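As a rough sketch of the idea on this slide, the snippet below publishes one inverted-list entry per keyword through a put/get interface and answers a conjunctive query by intersecting the lists. `ToyDHT` is a single-process stand-in invented for illustration: it hashes keys to buckets and does none of the O(log n) routing a real DHT performs.

```python
import hashlib

class ToyDHT:
    """Single-process stand-in for a DHT: each key is hashed to one of n buckets ("nodes")."""

    def __init__(self, num_nodes=8):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _owner(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, item):
        self._owner(key).setdefault(key, set()).add(item)

    def get(self, key):
        return set(self._owner(key).get(key, set()))

def publish(dht, item_id, filename):
    """Add item_id to the inverted list of every keyword in the file name."""
    for term in set(filename.lower().split()):
        dht.put(term, item_id)

def search_and(dht, terms):
    """Fetch each term's inverted list and intersect them (T1 AND T2 AND ...)."""
    lists = [dht.get(term.lower()) for term in terms]
    return set.intersection(*lists) if lists else set()

dht = ToyDHT()
publish(dht, 1, "free jazz live")
publish(dht, 2, "jazz trio")
result = search_and(dht, ["jazz", "live"])
```

Because every entry for a keyword lands on the node that owns that keyword's hash, a lookup sees the complete list, which is where the recall guarantee comes from.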
8Solution Space Proposal
Flood-based Network
DHT
9Outline
- Introduction
- Gnutella (Flood-based) Measurements
- Hybrid (Flood + DHT) Solution
- Evaluation
- Analytical Model
- Real-world Measurements
10Gnutella Network
Oct 2003 Crawl
- Popular open-source file-sharing network
- 450,000 users today
- Ultrapeer-based Topology
- Queries flooded among ultrapeers
- Leaf nodes shielded from query traffic
- Based on multiple crawlers from 30 vantage points
on PlanetLab
[Figure: Gnutella topology crawl. Legend: ultrapeer nodes, leaf nodes; file counts 100 Files, 0 Files, 0-100 Files]
11PlanetLab
- PlanetLab: open, globally distributed platform for deploying planetary-scale network services
- 431 nodes at 181 sites, 5 continents
- URL: http://www.planet-lab.org
12Gnutella Measurements
- Quality of Searches
- Recall (% of all relevant items retrieved)
- Distinct Recall (% of all relevant distinct items retrieved)
- Response Time (Latency) to 1st result
- Software utilized
- Modified the LimeWire Gnutella Client
- Run as leaf or ultrapeer
- Monitor Gnutella traffic
- Inject queries and gather results
13Gnutella Search Quality
- Log of Gnutella queries from LimeWire clients
- Reissued Gnutella queries
- 700 randomly chosen queries
- 30 LimeWire Ultrapeers on PlanetLab
- 3 different times
- Computed Query Recall
- Each query issued simultaneously from 30 ultrapeers
- Union of results from 30 ultrapeers
- Union-of-30 is our approximation of perfect answer
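One way to read the recall computation above: treat the union of results over all vantage points as the reference answer and score each single vantage point against it. The sketch below does that for recall and distinct recall. It assumes a result is a (filename, host) pair and that "distinct" means distinct file names; both are simplifications chosen for the example, not the talk's exact definitions.

```python
def recall_vs_union(results_per_vantage):
    """Score each vantage point against the union of all vantage points.

    results_per_vantage: list of result lists, one per vantage point.
    Each result is a (filename, host) pair (assumed representation).
    Returns one (recall, distinct_recall) pair per vantage point,
    or (None, None) entries when the union is empty.
    """
    union = set().union(*map(set, results_per_vantage))
    union_distinct = {name for name, _host in union}
    stats = []
    for results in results_per_vantage:
        got = set(results)
        got_distinct = {name for name, _host in got}
        recall = len(got) / len(union) if union else None
        distinct = len(got_distinct) / len(union_distinct) if union_distinct else None
        stats.append((recall, distinct))
    return stats

# Three vantage points instead of 30, to keep the example small.
v1 = [("a.mp3", "h1"), ("a.mp3", "h2")]
v2 = [("a.mp3", "h1"), ("b.mp3", "h3")]
v3 = []
stats = recall_vs_union([v1, v2, v3])
```

Here the third vantage point returns nothing even though three replicas exist in the union, the situation the measurements are designed to expose.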
14Queries with Small Result Sets
15Result Size CDF
16Result Size CDF
Single Query
Union-of-30 Query
Large fraction of queries return few or no
results even when they exist
17Query Latency
18Summary of Measurements
- Searching on Flood-based networks
- Highly effective for popular (highly replicated) items
- Less effective for rare items
- Significant opportunity to do better
- Large fraction of queries return few or no results even when they exist
- Bad response times for queries on rare items
- Aggressive flooding is not the solution
- Diminishing returns with flooding
- Does not improve response times
19Outline
- Motivation and Introduction
- Gnutella (Flood-based) Measurements
- Hybrid (Flood + DHT) Solution
- Evaluation
- Analytical Model and Simulations
- Real-world Measurements
20DHT-based Search
- Advantages
- Avoid flooding query in network
- Guarantee recall (critical for small result sets)
- Disadvantages
- Hashing inverted lists into the DHT is costly
- So is intersecting inverted lists at query time
- Infeasible for Google-like datasets (IPTPS '03)
- Feasible for querying rare items
- Queries with ≤10 results ship 7x fewer inverted list entries compared to the average query
- Query optimization can reduce communication overhead: intersect rare terms first
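The "intersect rare terms first" optimization can be illustrated with a toy cost model. The sketch assumes that each inverted list lives on a different node and that the running partial result is shipped to the node holding the next list; the count of shipped entries is a stand-in for communication overhead, not the cost model used in the paper.

```python
def intersect_in_order(lists):
    """Intersect inverted lists in the given order.

    Returns (matching item IDs, number of list entries shipped between nodes),
    where the running partial result is shipped to the next list's node.
    """
    if not lists:
        return set(), 0
    running = set(lists[0])
    shipped = 0
    for nxt in lists[1:]:
        shipped += len(running)  # ship the current partial result onward
        running &= nxt
        if not running:
            break  # empty intersection: nothing left to ship
    return running, shipped

def intersect_rare_first(lists):
    """Same intersection, but start from the shortest (rarest) list."""
    return intersect_in_order(sorted(lists, key=len))

common = set(range(1000))  # inverted list of a very common term
rare = {3, 7}              # inverted list of a rare term
naive = intersect_in_order([common, rare])
optimized = intersect_rare_first([common, rare])
```

Both orders return the same two items, but starting from the rare term ships 2 entries instead of 1000 under this model.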
21Hybrid Search
- Hybrid: best of both worlds
- Flood-based network (search popular items)
- DHT (search rare items)
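A minimal sketch of how a hybrid query might be dispatched: flood first, and fall back to the DHT only when flooding returns too few results. The fallback rule, the `rare_threshold` parameter, and the callables `flood_search` and `dht_search` are hypothetical names introduced for illustration; a deployed system would also wait for a flooding timeout before re-issuing the query, which is omitted here.

```python
def hybrid_search(query, flood_search, dht_search, rare_threshold=1):
    """Try flooding first; re-issue the query in the DHT if results are scarce.

    flood_search, dht_search: caller-supplied callables mapping a query
    string to a list of results (stand-ins for the two networks).
    rare_threshold: hypothetical cutoff below which a query is treated as rare.
    Returns (results, path) where path names the network(s) consulted.
    """
    results = list(flood_search(query))
    if len(results) >= rare_threshold:
        return results, "flood"
    extra = [r for r in dht_search(query) if r not in results]
    return results + extra, "flood+dht"

# Stub networks: flooding only finds the popular file, the DHT indexes the rare one.
flood = lambda q: ["popular.mp3"] if q == "popular" else []
dht = lambda q: ["rare.mp3"] if q == "rare" else []

popular_answer = hybrid_search("popular", flood, dht)
rare_answer = hybrid_search("rare", flood, dht)
```

Popular queries never touch the DHT in this arrangement, so only the rare tail pays the DHT's publishing and lookup costs.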
22Challenges
- Identifying Rare Items
- Query Results Size (QRS)
- Publish items from previous queries that return few results
- Term Frequency Statistics
- Single Term (TF)
- Term Pairs (TPF)
- Sample items on neighboring nodes (SAM)
- Network Churn (IPTPS '04)
- Use Ultrapeers as DHT nodes
- Avoid publishing items from short-lived nodes
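The QRS heuristic listed above can be sketched as a filter over an observed query log: items that appear in small result sets are treated as rare and selected for publishing into the DHT. The function name, the log format, and the `threshold` value of 20 are assumptions for illustration; the slides do not state the cutoff.

```python
def select_items_to_publish(query_log, threshold=20):
    """Query Results Size (QRS) sketch: pick items seen in small result sets.

    query_log: iterable of (query, results) pairs observed at a node.
    threshold: hypothetical cutoff; result sets smaller than this count as rare.
    Returns the set of items to publish into the DHT.
    """
    rare_items = set()
    for _query, results in query_log:
        if 0 < len(results) < threshold:
            rare_items.update(results)
    return rare_items

log = [
    ("obscure live bootleg", ["bootleg.mp3"]),
    ("top hit", [f"hit_copy_{i}.mp3" for i in range(50)]),
    ("no such file", []),
]
to_publish = select_items_to_publish(log, threshold=20)
```

Note the limitation visible in the third log entry: a query that returned nothing gives QRS no item to publish, so the scheme only helps items that some earlier query already found.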
23Outline
- Motivation and Introduction
- Flood-based (Gnutella) Measurements
- Hybrid (Flood + DHT) Solution
- Evaluation
- Analytical Model and Simulations
- Real-world Measurements
24Evaluation Methodology
- Combination of analysis, trace-based simulation, and live deployment
- Analytical cost model for understanding tradeoffs between query recall and system overheads
- Trace-based simulation to allow comparisons of rare-item selection schemes
- Including an infeasible oracle scheme
- Live deployment to ground hybrid search in practice
25Analytical Model Simulation
- Components
- Probability of finding an item
- Cost of querying an item
- Publishing overhead of rare items
- Details of the model in the paper
- Trace-driven simulations
- 315,000 files, 75,000 nodes, 350 queries
26Identifying Rare Items
- Upper bound: Perfect scheme with complete knowledge of network
- Lower bound: Random scheme
- Diminishing returns as more items are published
- All schemes in between Random and Perfect
27Outline
- Motivation and Introduction
- Gnutella (Flood-based) Measurements
- Hybrid (Flood + DHT) Solution
- Evaluation
- Analytical Model and Simulations
- Real-world Measurements
28Hybrid Search Implementation
- Prototype implementation based on PIER
- PIER: P2P Information Exchange and Retrieval
- A relational query engine
- Designed for Internet-scale (millions of nodes)
- Built on top of DHTs
- Relational Operators
- Selection, Projection
- Join, Intersect
- Group-By, Aggregation
- DHT-based Search Engine with PIER
29Hybrid Peer Implementation
30PlanetLab Deployment
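[Figure: PlanetLab deployment topology with hybrid ultrapeers P1-P3, Gnutella ultrapeers U1-U2, and Gnutella leaves (L); the horizon of P1 is marked. Legend: L = Gnutella leaf, U = Gnutella ultrapeer, P = hybrid ultrapeer (PIER + Gnutella); Gnutella links and PIER links.]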
31Results
- Improved Response Time
- PIER (DHT) returns first result in 10 seconds
- 40 seconds in aggregate, including the 30-second timeout
- Gnutella (Flood) queries return first result in 65 seconds
- 25 seconds (38%) reduction in latency
- Improved Recall Analysis
- 18% reduction in queries with empty results
- Using the naïve QRS rare-item selection scheme
- Opportunity to do far better
- Recall: 66% potential reduction based on Union-of-30
- Approaches: larger-scale deployments and using better rare-item schemes
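The response-time figures on this slide fit together arithmetically. The short check below recomputes them, reading the 40-second aggregate as the DHT's 10 seconds plus the 30-second timeout; the variable names are introduced here for illustration.

```python
dht_first_result_s = 10        # PIER (DHT) time to first result
flood_timeout_s = 30           # wait before the query is re-issued in the DHT
gnutella_first_result_s = 65   # Gnutella (flood) time to first result

hybrid_first_result_s = dht_first_result_s + flood_timeout_s
saving_s = gnutella_first_result_s - hybrid_first_result_s
saving_pct = round(100 * saving_s / gnutella_first_result_s)

print(hybrid_first_result_s, saving_s, saving_pct)  # 40 25 38
```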
32Conclusion
- Address the challenges of scalable P2P Search
- Focus on P2P file-sharing, a real-world application
- Discuss two alternative designs
- Gnutella (Flood-based) Measurements
- Highly effective for popular items
- Ineffective for rare items
- Propose Hybrid Search Infrastructure
- Flooding for popular items
- DHT for rare items
- Evaluate hybrid search
- Analytical model and trace-driven simulations to compare schemes for identifying rare items
- Deployment on PlanetLab for validation
33Questions?
34Backup Slides
35DHT-based Search with PIER
Item(itemID, filename, filesize, ipAddress, port, ...)
Inverted(keyword, itemID), with keyword as the DHT publishing (index) key
Query: T1 AND T2
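Read relationally, a two-term query over these schemas is a self-join of Inverted on itemID followed by a join with Item. The sketch below evaluates that plan over in-memory tuples. It is not PIER's operator API: the function `keyword_query`, the sample rows, and the addresses are made up for the example, and only the five named Item fields are modeled.

```python
from collections import namedtuple

# Schemas from the slide; any further Item fields are left out.
Item = namedtuple("Item", "itemID filename filesize ipAddress port")
Inverted = namedtuple("Inverted", "keyword itemID")

def keyword_query(items, inverted, t1, t2):
    """Return Item rows whose itemID appears under both keywords t1 and t2."""
    ids_t1 = {row.itemID for row in inverted if row.keyword == t1}
    ids_t2 = {row.itemID for row in inverted if row.keyword == t2}
    matching = ids_t1 & ids_t2  # self-join of Inverted on itemID
    return [item for item in items if item.itemID in matching]  # join with Item

# Illustrative rows (addresses are from a documentation-only range).
items = [
    Item(1, "free jazz live.mp3", 4200000, "192.0.2.10", 6346),
    Item(2, "jazz trio.mp3", 3100000, "192.0.2.11", 6346),
]
inverted = [
    Inverted("free", 1), Inverted("jazz", 1), Inverted("live", 1),
    Inverted("jazz", 2), Inverted("trio", 2),
]
matches = keyword_query(items, inverted, "jazz", "live")
```

In a DHT setting each keyword's Inverted tuples sit on the node that owns that keyword, so the self-join is where inverted-list entries have to cross the network.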
36Gnutella Optimizations
- Ultrapeers and Leaf nodes
- Dynamic Querying
- More aggressive flooding for queries that return few results
- Query Routing Protocol (QRP) tables
- Bloom-filter-like structures
- Cache at neighboring nodes for directed flooding.
37Handling Network Churn
- Only ultrapeers participate in DHT overlay.
- Ultrapeers publish items for leaves.
- Avoid publishing items from short-lived nodes.