1
Enhancing P2P File-Sharing with an Internet-Scale
Query Processor

Boon Thau Loo
Joseph M. Hellerstein, Ryan Huebsch, Scott
Shenker, Ion Stoica
UC Berkeley, Intel Research Berkeley, ICSI

2
Introduction

Internet-scale Query Processing
Millions of machines
Churn: nodes arriving and departing continuously
Lots of semantic homogeneity
E.g. keywords in file names, packet headers on
the Internet, etc
This is the focus of the PIER project at Berkeley
(VLDB '03)
The real-world example today is file-sharing

3
P2P File-Sharing

Real-world Challenges
Millions of real users and data
Large number of concurrent queries
Plenty of research on this topic
Databases, Networks, Distributed Systems
My talk today is on exploring this real-world
scenario
Provide insights into future research in
Internet-scale querying

4
Our Goal

P2P Search - What is the best design?
Target Environment
Typical P2P file-sharing environments.
Items stored in P2P network, queried by keywords
Replicas of items follow a long-tailed
distribution
Popular items at head of distribution
Rare items at tail of distribution

5
Unstructured Networks

Ad-hoc topology
Queries are flooded for bounded number of hops
No guarantees on recall
E.g. Gnutella and Kazaa

[Figure: a query for "xyz" flooded through the network]
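
The hop-bounded flooding on this slide can be sketched as a breadth-first search with a hop limit. This is a minimal illustration rather than Gnutella's actual protocol code: the function name, the toy topology, and the substring match are all assumptions made for the example.

```python
from collections import deque

def flood_query(graph, start, keyword, files, max_hops):
    """Breadth-first flood from `start`, reaching nodes at most `max_hops` away.

    graph: dict node -> list of neighbor nodes (hypothetical toy topology)
    files: dict node -> set of file names stored at that node
    Returns the set of (node, filename) matches found within the hop bound.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    matches = set()
    while frontier:
        node, hops = frontier.popleft()
        for name in files.get(node, ()):
            if keyword in name:
                matches.add((node, name))
        if hops == max_hops:
            continue  # hop budget exhausted: do not forward further
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return matches

# Toy line topology A - B - C - D; only D holds a matching file.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
files = {"D": {"xyz.mp3"}}

near = flood_query(graph, "A", "xyz", files, max_hops=2)  # D is out of reach
far = flood_query(graph, "A", "xyz", files, max_hops=3)   # D is reached
```

With a bound of two hops the query misses the file even though it exists, which is the "no guarantees on recall" point above.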
6
Structured Networks

Distributed Hash Tables (DHTs)
Hash table interface: put(key, item), get(key)
O(log n) hops
Guarantees on recall

7
Keyword Search using DHTs

Inverted Lists hashed by keyword (term) in the
DHT

Query: T1 AND T2
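
A minimal single-process sketch of keyword search over the put/get interface follows. The class ToyDHT, the helper names, and the filename tokenizer are hypothetical. Routing between nodes (the O(log n) hops) is not modelled; each key is simply hashed to one of a fixed number of in-memory nodes.

```python
import hashlib

class ToyDHT:
    """Stand-in for a DHT: hashes each key to one of n in-memory nodes."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _owner(self, key):
        # Map the key to the index of the node responsible for it.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % len(self.nodes)

    def put(self, key, item):
        self.nodes[self._owner(key)].setdefault(key, set()).add(item)

    def get(self, key):
        return set(self.nodes[self._owner(key)].get(key, set()))

def publish(dht, filename):
    """Add the file to the inverted list of every term in its name."""
    for term in filename.lower().replace(".", " ").replace("_", " ").split():
        dht.put(term, filename)

def search_and(dht, t1, t2):
    """Answer 'T1 AND T2' by intersecting the two inverted lists."""
    return dht.get(t1) & dht.get(t2)

dht = ToyDHT(num_nodes=8)
for name in ["blue_moon.mp3", "blue_sky.mp3", "full_moon.avi"]:
    publish(dht, name)

result = search_and(dht, "blue", "moon")
```

Because every file containing a term is stored under that term's key, the lookup finds all published matches, which is the recall guarantee claimed for structured networks.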
8
Solution Space Proposal
Flood-based Network
DHT
9
Outline

Introduction
Gnutella (Flood-based) Measurements
Hybrid (Flood + DHT) Solution
Evaluation
Analytical Model
Real-world Measurements

10
Gnutella Network
Oct 2003 Crawl

Popular open-source file-sharing network
450,000 users today
Ultrapeer-based Topology
Queries flooded among ultrapeers
Leaf nodes shielded from query traffic
Based on multiple crawlers from 30 vantage points
on PlanetLab

[Figure: crawled Gnutella topology. Legend: ultrapeer nodes, leaf nodes; nodes grouped by number of shared files (0 files, 0-100 files, 100 files)]
11
PlanetLab

PlanetLab: open, globally distributed platform
for deploying planetary-scale network services
431 nodes at 181 sites, 5 continents
URL: http://www.planet-lab.org

12
Gnutella Measurements

Quality of Searches
Recall (% of all relevant items retrieved)
Distinct Recall (% of all relevant distinct items
retrieved)
Response Time (Latency) to 1st result
Software utilized
Modified the LimeWire Gnutella Client
Run as leaf or ultrapeer
Monitor Gnutella traffic
Inject queries and gather results

13
Gnutella Search Quality

Log of Gnutella queries from LimeWire clients
Reissued Gnutella queries
700 randomly chosen queries
30 LimeWire Ultrapeers on PlanetLab
3 different times
Computed Query Recall
Each query issued simultaneously from 30
ultrapeers
Union of results from 30 ultrapeers
Union-of-30 is our approximation of perfect
answer
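
The recall computation can be sketched as follows, using three vantage points in place of thirty to keep the example small. The function name and data are hypothetical, and treating the filename as the identity of a distinct item is an assumption made here, not something the slides specify.

```python
def recall_metrics(single, vantage_results):
    """Return (recall %, distinct recall %) for one vantage point's results.

    single: set of (filename, host) replicas returned to one ultrapeer
    vantage_results: list of such sets, one per ultrapeer; their union
        stands in for the perfect answer
    Assumption: a distinct item is identified by its filename.
    """
    union = set().union(*vantage_results)
    if not union:
        return None, None  # no known relevant items: recall is undefined
    distinct_union = {name for name, _host in union}
    distinct_single = {name for name, _host in single}
    recall = 100.0 * len(single & union) / len(union)
    distinct_recall = (
        100.0 * len(distinct_single & distinct_union) / len(distinct_union)
    )
    return recall, distinct_recall

# Results seen from three hypothetical vantage points.
v1 = {("a.mp3", "h1"), ("a.mp3", "h2")}
v2 = {("a.mp3", "h2"), ("b.mp3", "h3")}
v3 = {("a.mp3", "h1"), ("a.mp3", "h4")}

m1 = recall_metrics(v1, [v1, v2, v3])  # (50.0, 50.0)
m2 = recall_metrics(v2, [v1, v2, v3])  # (50.0, 100.0)
```

The second vantage point sees only half of the replicas but every distinct file, which is why the two metrics are reported separately.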

14
Queries with Small Result Sets
15
Result Size CDF
16
Result Size CDF
[Figure: result size CDF for a single query vs. the union-of-30 query]
Large fraction of queries return few or no
results even when they exist
17
Query Latency
18
Summary of Measurements

Searching on Flood-based networks
Highly effective for popular (highly replicated)
items
Less effective for rare items
Significant opportunity to do better
Large fraction of queries return few or no
results even when they exist
Bad response times for queries on rare items
Aggressive flooding is not the solution
Diminishing returns with flooding
Does not improve response times

19
Outline

Motivation and Introduction
Gnutella (Flood-based) Measurements
Hybrid (Flood + DHT) Solution
Evaluation
Analytical Model and Simulations
Real-world Measurements

20
DHT-based Search

Advantages
Avoid flooding query in network
Guarantee recall (critical for small result sets)
Disadvantages
Hashing inverted lists into the DHT is costly
So is intersecting inverted lists at query time
Infeasible for Google-like datasets (IPTPS '03)
Feasible for querying rare items
Queries with ≤10 results ship 7x fewer inverted
list entries compared to the average query
Query optimization can reduce communication
overhead: intersect rare terms first
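
The effect of intersecting rare terms first can be illustrated with a simplified cost model. The assumption, made only for this sketch, is that the running intermediate result is shipped to the node holding the next term's inverted list. The function name and the data are hypothetical.

```python
def shipped_entries(posting_lists, order):
    """Count list entries sent between nodes when intersecting in `order`.

    Simplified model (an assumption): the current intermediate result is
    shipped to the node that holds the next term's inverted list. Shipping
    the final answer back to the querying node is not counted.
    """
    current = set(posting_lists[order[0]])  # copy, so inputs stay intact
    shipped = 0
    for term in order[1:]:
        shipped += len(current)
        current &= posting_lists[term]
    return shipped, current

lists = {
    "mp3": set(range(1000)),         # very common term
    "live": set(range(0, 1000, 2)),  # common term (500 entries)
    "bootleg": {4, 10, 2001},        # rare term
}

naive_cost, naive_hits = shipped_entries(lists, ["mp3", "live", "bootleg"])
rare_first = sorted(lists, key=lambda term: len(lists[term]))
rare_cost, rare_hits = shipped_entries(lists, rare_first)
```

Both orders return the same two items, but under this model the naive order ships 1500 entries and the rare-first order ships 5.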

21
Hybrid Search

Hybrid: best of both worlds

Flood-based network (search popular items)
DHT (search rare items)
22
Challenges

Identifying Rare Items
Query Results Size (QRS)
Publish items from previous queries that return
few results
Term Frequency Statistics
Single Term (TF)
Term Pairs (TPF)
Sample items on neighboring nodes (SAM)
Network Churn (IPTPS '04)
Use Ultrapeers as DHT nodes
Avoid publishing items from short-lived nodes
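
The QRS heuristic can be sketched in a few lines: items returned by past queries with small result sets are treated as rare and selected for publishing into the DHT. The function name, the log format, and the threshold value are hypothetical; the slides do not state a threshold.

```python
def qrs_select(query_log, threshold):
    """Pick items to publish using the Query Results Size (QRS) heuristic.

    query_log: iterable of (query, results) pairs seen by this node
    threshold: result-set size at or below which items count as rare
        (hypothetical parameter, not taken from the slides)
    """
    to_publish = set()
    for _query, results in query_log:
        if len(results) <= threshold:
            to_publish.update(results)
    return to_publish

log = [
    ("pop hit", ["p%d.mp3" % i for i in range(50)]),
    ("rare demo", ["demo_take2.mp3"]),
    ("obscure live", ["live_a.mp3", "live_b.mp3"]),
]

rare_items = qrs_select(log, threshold=5)
```

The scheme needs no global knowledge, but it can only identify a rare item after some earlier query has already returned it.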

23
Outline

Motivation and Introduction
Flood-based (Gnutella) Measurements
Hybrid (Flood + DHT) Solution
Evaluation
Analytical Model and Simulations
Real-world Measurements

24
Evaluation Methodology

Combination of analysis, trace-based simulation,
and live deployment
Analytical cost model for understanding tradeoffs
between query recall and system overheads
Trace-based simulation to allow comparisons of
rare-item selection schemes
Including an infeasible oracle scheme
Live deployment to ground hybrid search in
practice

25
Analytical Model and Simulation

Components
Probability of finding an item
Cost of querying an item
Publishing overhead of rare items
Details of the model in the paper
Trace-driven simulations
315,000 files, 75,000 nodes, 350 queries

26
Identifying Rare Items

Upper bound: Perfect scheme with complete
knowledge of network
Lower bound: Random scheme
Diminishing returns as more items are published
All schemes fall between Random and Perfect

27
Outline

Motivation and Introduction
Gnutella (Flood-based) Measurements
Hybrid (Flood + DHT) Solution
Evaluation
Analytical Model and Simulations
Real-world Measurements

28
Hybrid Search Implementation

Prototype implementation based on PIER
PIER: P2P Information Exchange and Retrieval
A relational query engine
Designed for Internet-scale (millions of nodes)
Built on top of DHTs
Relational Operators
Selection, Projection
Join, Intersect
Group-By, Aggregation
DHT-based Search Engine with PIER

29
Hybrid Peer Implementation
30
PlanetLab Deployment
[Figure: deployment topology showing Gnutella leaves (L), Gnutella ultrapeers (U1, U2), and hybrid ultrapeers running PIER + Gnutella (P1, P2, P3), connected by Gnutella links and PIER links, with the horizon of P1 marked]
31
Results

Improved Response Time
PIER (DHT) returns first result in 10 seconds
40 seconds in aggregate, including the 30-second
timeout
Gnutella (Flood) queries return first result in
65 seconds
25 seconds (38%) reduction in latency
Improved Recall Analysis
18% reduction in queries with empty results
Using the naïve QRS rare-item selection scheme
Opportunity to do far better
Recall: 66% potential reduction based on
Union-of-30
Approaches: larger-scale deployments and
better rare-item schemes.
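
The latency figures on this slide fit together as simple arithmetic, reproduced here as a check. The variable names are illustrative; the numbers are the ones reported above.

```python
# Figures reported on the slide, in seconds.
flood_timeout = 30         # wait on the flood-based network before using the DHT
dht_first_result = 10      # PIER (DHT) time to first result
gnutella_first_result = 65

# Hybrid latency: wait out the timeout, then query the DHT.
hybrid_first_result = flood_timeout + dht_first_result
saving = gnutella_first_result - hybrid_first_result
saving_pct = round(100 * saving / gnutella_first_result)

print(hybrid_first_result, saving, saving_pct)
```

This gives 40 seconds for the hybrid path and a saving of 25 seconds, about 38% of the 65-second flood-only figure.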

32
Conclusion

Address the challenges of scalable P2P Search
Focus on P2P file-sharing, a real-world
application
Discuss two alternative designs
Gnutella (Flood-based) Measurements
Highly effective for popular items
Ineffective for rare items
Propose Hybrid Search Infrastructure
Flooding for popular items
DHT for rare items
Evaluate hybrid search
Analytical model and trace-driven simulations to
compare schemes for identifying rare items
Deployment on PlanetLab for validation

33
Questions?
34
Backup Slides
35
DHT-based Search with PIER
Item(itemID, filename, filesize, ipAddress, port, ...)
Inverted(keyword, itemID), with keyword as the
DHT publishing (index) key
Query: T1 AND T2
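
The two relations can be mimicked locally to show how a two-term query resolves to items. This is only an in-memory sketch, not PIER's API: it keeps the five Item fields that are named on the slide, and the function name and sample rows are hypothetical.

```python
from collections import namedtuple

# Only the fields named on the slide; any further Item fields are omitted.
Item = namedtuple("Item", "itemID filename filesize ipAddress port")
Inverted = namedtuple("Inverted", "keyword itemID")

def keyword_query(items, inverted, t1, t2):
    """Select Inverted rows for each term, intersect on itemID, join Item."""
    ids_t1 = {row.itemID for row in inverted if row.keyword == t1}
    ids_t2 = {row.itemID for row in inverted if row.keyword == t2}
    matching = ids_t1 & ids_t2
    return [item for item in items if item.itemID in matching]

items = [
    Item(1, "blue_moon.mp3", 4096, "10.0.0.1", 6346),
    Item(2, "blue_sky.mp3", 2048, "10.0.0.2", 6346),
]
inverted = [
    Inverted("blue", 1), Inverted("moon", 1),
    Inverted("blue", 2), Inverted("sky", 2),
]

hits = keyword_query(items, inverted, "blue", "moon")
```

In the deployed system the Inverted tuples would be spread across DHT nodes by keyword, so the two selections would run on different nodes before the intersection.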
36
Gnutella Optimizations