1
Enhancing P2P File-Sharing with an Internet-Scale
Query Processor
  • Boon Thau Loo
  • Joseph M. Hellerstein, Ryan Huebsch, Scott
    Shenker, Ion Stoica
  • UC Berkeley, Intel Research Berkeley, ICSI

2
Introduction
  • Internet-scale Query Processing
  • Millions of machines
  • Churn: nodes arriving and departing continuously
  • Lots of semantic homogeneity
  • E.g. keywords in file names, packet headers on
    the Internet, etc
  • This is the focus of the PIER project at Berkeley
    (VLDB 03)
  • The real-world example today is file-sharing

3
P2P File-Sharing
  • Real-world Challenges
  • Millions of real users and data
  • Large number of concurrent queries
  • Plenty of research on this topic
  • Databases, Networks, Distributed Systems
  • My talk today is on exploring this real-world
    scenario
  • Provide insights into future research in
    Internet-scale querying

4
Our Goal
  • P2P Search - What is the best design?
  • Target Environment
  • Typical P2P file-sharing environments.
  • Items stored in P2P network, queried by keywords
  • Replicas of items follow a long-tailed
    distribution
  • Popular items at head of distribution
  • Rare items at tail of distribution

5
Unstructured Networks
  • Ad-hoc topology
  • Queries are flooded for bounded number of hops
  • No guarantees on recall
  • E.g. Gnutella and Kazaa

[Figure: flooding the query "xyz"]
6
Structured Networks
  • Distributed Hash Tables (DHTs)
  • Hash table interface: put(key, item), get(key)
  • O(log n) hops
  • Guarantees on recall
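
The put/get interface above can be sketched as a single-process toy. This is only an illustration under assumptions: the class name ToyDHT, the node names, and the SHA-1 hash ring are hypothetical and not taken from the talk, and the O(log n) multi-hop routing of a real DHT is not modeled.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """Single-process stand-in for a DHT: each key is owned by one node on a hash ring."""

    def __init__(self, node_names):
        # Place every node on the ring at the hash of its (hypothetical) name.
        self.ring = sorted((self._hash(name), name) for name in node_names)
        self.store = {name: {} for name in node_names}

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def _owner(self, key):
        # The owner is the first node at or after the key's hash, wrapping around.
        positions = [pos for pos, _ in self.ring]
        idx = bisect_left(positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, item):
        # Several items may be stored under the same key.
        self.store[self._owner(key)].setdefault(key, []).append(item)

    def get(self, key):
        return self.store[self._owner(key)].get(key, [])


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.put("xyz", "item-1")
print(dht.get("xyz"))  # ['item-1']
```

Because exactly one node owns each key, a get reaches every item stored under that key, which is the intuition behind the recall guarantee.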

7
Keyword Search using DHTs
  • Inverted Lists hashed by keyword (term) in the
    DHT

Query T1 AND T2
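
A minimal sketch of keyword search over inverted lists, assuming a plain dictionary stands in for the DHT. The tokenizer, the function names publish and query_and, and the sample filenames are hypothetical.

```python
from collections import defaultdict

# Stand-in for the DHT: keyword (term) -> set of item IDs (one inverted list per term).
inverted = defaultdict(set)


def publish(item_id, filename):
    """Add one inverted-list entry per keyword in the file name."""
    for term in filename.lower().replace(".", " ").replace("_", " ").split():
        inverted[term].add(item_id)


def query_and(*terms):
    """Answer 'T1 AND T2 ...' by intersecting the inverted lists of all terms."""
    lists = [inverted.get(term.lower(), set()) for term in terms]
    return set.intersection(*lists) if lists else set()


publish(1, "holiday_photos_2003.zip")
publish(2, "holiday_video.avi")
print(query_and("holiday", "photos"))  # {1}
```

In a real deployment each inverted list lives on the node responsible for its keyword, so the intersection requires shipping list entries between nodes.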
8
Solution Space Proposal
Flood-based Network
DHT
9
Outline
  • Introduction
  • Gnutella (Flood-based) Measurements
  • Hybrid (Flood + DHT) Solution
  • Evaluation
  • Analytical Model
  • Real-world Measurements

10
Gnutella Network
Oct 2003 Crawl
  • Popular open-source file-sharing network
  • 450,000 users today
  • Ultrapeer-based Topology
  • Queries flooded among ultrapeers
  • Leaf nodes shielded from query traffic
  • Based on multiple crawlers from 30 vantage points
    on PlanetLab

[Figure legend: ultrapeer nodes, leaf nodes; 100 files, 0 files, 0-100 files]
11
PlanetLab
  • PlanetLab: open, globally distributed platform
    for deploying planetary-scale network services
  • 431 nodes at 181 sites, 5 continents
  • URL: http://www.planet-lab.org

12
Gnutella Measurements
  • Quality of Searches
  • Recall (% of all relevant items retrieved)
  • Distinct Recall (% of all relevant distinct items
    retrieved)
  • Response Time (Latency) to 1st result
  • Software utilized
  • Modified the LimeWire Gnutella Client
  • Run as leaf or ultrapeer
  • Monitor Gnutella traffic
  • Inject queries and gather results

13
Gnutella Search Quality
  • Log of Gnutella queries from LimeWire clients
  • Reissued Gnutella queries
  • 700 randomly chosen queries
  • 30 LimeWire Ultrapeers on PlanetLab
  • 3 different times
  • Computed Query Recall
  • Each query issued simultaneously from 30
    ultrapeers
  • Union of results from 30 ultrapeers
  • Union-of-30 is our approximation of perfect
    answer
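
The recall computation above can be sketched as follows, assuming each vantage point's results are available as a set of result identifiers. The function name and the sample data are hypothetical; with 30 ultrapeers the list would hold 30 sets.

```python
def recall_against_union(results_per_vantage):
    """Return each vantage point's recall relative to the union of all results.

    results_per_vantage: list of sets, one result set per ultrapeer.
    The union approximates the perfect answer for the query.
    """
    union = set().union(*results_per_vantage)
    if not union:
        # No vantage point found anything, so recall is undefined.
        return [None] * len(results_per_vantage)
    return [len(results) / len(union) for results in results_per_vantage]


# Three vantage points instead of 30, to keep the example short.
print(recall_against_union([{"a", "b"}, {"b"}, set()]))  # [1.0, 0.5, 0.0]
```

For distinct recall, the same function applies if the sets hold distinct file names rather than individual replicas.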

14
Queries with Small Result Sets
15
Result Size CDF
16
Result Size CDF
[Figure: CDF of result size for a single query and for the Union-of-30 query]
Large fraction of queries return few or no
results even when they exist
17
Query Latency
18
Summary of Measurements
  • Searching on Flood-based networks
  • Highly effective for popular (highly replicated)
    items
  • Less effective for rare items
  • Significant opportunity to do better
  • Large fraction of queries return few or no
    results even when they exist
  • Bad response times for queries on rare items
  • Aggressive flooding is not the solution
  • Diminishing returns with flooding
  • Does not improve response times

19
Outline
  • Motivation and Introduction
  • Gnutella (Flood-based) Measurements
  • Hybrid (Flood + DHT) Solution
  • Evaluation
  • Analytical Model and Simulations
  • Real-world Measurements

20
DHT-based Search
  • Advantages
  • Avoid flooding query in network
  • Guarantee recall (critical for small result sets)
  • Disadvantages
  • Hashing inverted lists into the DHT is costly
  • So is intersecting inverted lists at query time
  • Infeasible for Google-like datasets (IPTPS 03)
  • Feasible for querying rare items
  • Queries with ≤10 results ship 7x fewer inverted
    list entries compared to the average query
  • Query optimization can reduce communication
    overhead: intersect rare terms first
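
A rough sketch of why intersecting rare terms first reduces communication. It assumes a simplified cost model in which the running result is shipped to the node holding the next term's inverted list; the function name and that cost model are assumptions, not the model from the paper.

```python
def shipped_entries(posting_lists, rare_first=True):
    """Intersect lists in sequence and count entries shipped between nodes.

    Returns (final_result, number_of_entries_shipped).
    """
    if not posting_lists:
        return set(), 0
    # Rare-first ordering starts from the shortest inverted list.
    order = sorted(posting_lists, key=len) if rare_first else list(posting_lists)
    current = set(order[0])
    shipped = 0
    for next_list in order[1:]:
        shipped += len(current)  # ship the running result to the next term's node
        current &= set(next_list)
    return current, shipped


common = set(range(1000))  # inverted list of a popular term
rare = {5, 2000}           # inverted list of a rare term
print(shipped_entries([common, rare], rare_first=False))  # ({5}, 1000)
print(shipped_entries([common, rare], rare_first=True))   # ({5}, 2)
```

Both orderings return the same result, but starting from the rare term ships 2 entries instead of 1000 in this example.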

21
Hybrid Search
  • Hybrid: best of both worlds

Flood-based Network (Search Popular Items)
DHT (Search Rare Items)
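
One way to picture the hybrid is as a fallback: flood first, and reissue the query through the DHT only when flooding returns too little. The sketch below takes the two search functions as parameters. The name hybrid_search and the min_results cutoff are assumptions, and the wait for flooded results is not modeled.

```python
def hybrid_search(query, flood_search, dht_search, min_results=1):
    """Try the flood-based network first; fall back to the DHT for rare items.

    flood_search and dht_search are callables that take a query and return
    a list of results. min_results is a hypothetical cutoff for "too few".
    """
    results = list(flood_search(query))
    if len(results) < min_results:
        results.extend(dht_search(query))
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(results))


flood_index = {"popular": ["popular-1", "popular-2"]}
dht_index = {"rare": ["rare-1"]}

print(hybrid_search("popular", lambda q: flood_index.get(q, []), lambda q: dht_index.get(q, [])))
print(hybrid_search("rare", lambda q: flood_index.get(q, []), lambda q: dht_index.get(q, [])))
```

Popular queries are answered by flooding alone, so the DHT only carries the load for queries that would otherwise come back empty.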
22
Challenges
  • Identifying Rare Items
  • Query Results Size (QRS)
  • Publish items from previous queries that return
    few results
  • Term Frequency Statistics
  • Single Term (TF)
  • Term Pairs (TPF)
  • Sample items on neighboring nodes (SAM)
  • Network Churn (IPTPS 04)
  • Use Ultrapeers as DHT nodes
  • Avoid publishing items from short-lived nodes
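
The QRS scheme above can be sketched as a filter over a node's log of past queries. The function name, the log format, and the threshold value are assumptions chosen for illustration.

```python
def select_rare_items_qrs(query_log, threshold=20):
    """Pick items to publish into the DHT using Query Results Size (QRS).

    query_log: iterable of (query, result_items) pairs observed by this node.
    threshold: hypothetical cutoff; queries with at most this many results
               are treated as queries for rare items.
    """
    to_publish = set()
    for _query, items in query_log:
        if 0 < len(items) <= threshold:
            to_publish.update(items)
    return to_publish


log = [
    ("popular song", [f"copy-{i}" for i in range(500)]),  # many results: skip
    ("obscure lecture", ["lecture-a", "lecture-b"]),      # few results: publish
    ("no such file", []),                                 # nothing to publish
]
print(sorted(select_rare_items_qrs(log)))  # ['lecture-a', 'lecture-b']
```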

23
Outline
  • Motivation and Introduction
  • Flood-based (Gnutella) Measurements
  • Hybrid (Flood + DHT) Solution
  • Evaluation
  • Analytical Model and Simulations
  • Real-world Measurements

24
Evaluation Methodology
  • Combination of analysis, trace-based simulation,
    and live deployment
  • Analytical cost model for understanding tradeoffs
    between query recall and system overheads
  • Trace-based simulation to allow comparisons of
    rare-item selection schemes
  • Including an infeasible oracle scheme
  • Live deployment to ground hybrid search in
    practice

25
Analytical Model and Simulation
  • Components
  • Probability of finding an item
  • Cost of querying an item
  • Publishing overhead of rare items
  • Details of the model in the paper
  • Trace-driven simulations
  • 315,000 files, 75,000 nodes, 350 queries

26
Identifying Rare Items
  • Upper bound: Perfect scheme with complete
    knowledge of network
  • Lower bound: Random scheme
  • Diminishing returns as more items are published
  • All other schemes fall between Random and Perfect

27
Outline
  • Motivation and Introduction
  • Gnutella (Flood-based) Measurements
  • Hybrid (Flood + DHT) Solution
  • Evaluation
  • Analytical Model and Simulations
  • Real-world Measurements

28
Hybrid Search Implementation
  • Prototype implementation based on PIER
  • PIER: P2P Information Exchange and Retrieval
  • A relational query engine
  • Designed for Internet-scale (millions of nodes)
  • Built on top of DHTs
  • Relational Operators
  • Selection, Projection
  • Join, Intersect
  • Group-By, Aggregation
  • DHT-based Search Engine with PIER

29
Hybrid Peer Implementation
30
PlanetLab Deployment
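[Figure: deployment topology with hybrid ultrapeers P1, P2, P3, Gnutella ultrapeers U1, U2, and leaves L; the horizon of P1 is marked]
Legend: L = Gnutella leaf, U = Gnutella ultrapeer, P = hybrid ultrapeer (PIER + Gnutella); link types: Gnutella links, PIER links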
31
Results
  • Improved Response Time
  • PIER (DHT) returns first result in 10 seconds
  • 40 seconds in aggregate including 30 seconds
    timeout
  • Gnutella (Flood) queries return first result in
    65 seconds
  • 25 seconds (38%) reduction in latency
  • Improved Recall Analysis
  • 18% reduction in queries with empty results
  • Using the naïve QRS rare-item selection scheme
  • Opportunity to do far better
  • Recall: 66% potential reduction based on
    Union-of-30
  • Approaches: larger-scale deployments and
    better rare-item schemes
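
The response-time figures on this slide can be checked with a few lines of arithmetic. The variable names are made up; the numbers are the ones quoted above.

```python
pier_first_result_s = 10      # PIER (DHT) time to first result
gnutella_timeout_s = 30       # wait on Gnutella before falling back to PIER
gnutella_first_result_s = 65  # Gnutella (flood) time to first result

hybrid_first_result_s = gnutella_timeout_s + pier_first_result_s
saving_s = gnutella_first_result_s - hybrid_first_result_s
saving_pct = round(100 * saving_s / gnutella_first_result_s)

print(hybrid_first_result_s)  # 40
print(saving_s)               # 25
print(saving_pct)             # 38
```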

32
Conclusion
  • Address the challenges of scalable P2P Search
  • Focus on P2P file-sharing, a real-world
    application
  • Discuss two alternative designs
  • Gnutella (Flood-based) Measurements
  • Highly effective for popular items
  • Ineffective for rare items
  • Propose Hybrid Search Infrastructure
  • Flooding for popular items
  • DHT for rare items
  • Evaluate hybrid search
  • Analytical model and trace-driven simulations to
    compare schemes for identifying rare items
  • Deployment on PlanetLab for validation

33
Questions?
34
Backup Slides
35
DHT-based Search with PIER
Item(itemID, filename, filesize, ipAddress, port, ...)
Inverted(keyword, itemID), with keyword as the DHT publishing (index) key
Query T1 T2
36
Gnutella Optimizations
  • Ultrapeers and Leaf nodes
  • Dynamic Querying
  • More aggressive flooding for queries that return
    few results
  • Query Routing Protocol (QRP) tables
  • Bloom-filter-like structures
  • Cache at neighboring nodes for directed flooding.
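
To illustrate the "Bloom-filter-like" idea behind directed flooding, here is a generic Bloom filter over keywords. It is a sketch under assumptions: the real QRP table format is not reproduced, and the class name, bit-array size, and hash construction are hypothetical.

```python
import hashlib


class KeywordFilter:
    """Generic Bloom filter over keywords (not the actual QRP table format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, keyword):
        # Derive several bit positions by salting the keyword with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{keyword.lower()}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, keyword):
        for pos in self._positions(keyword):
            self.bits[pos] = 1

    def might_contain(self, keyword):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(keyword))


def neighbors_to_forward(query_terms, neighbor_filters):
    """Forward a query only to neighbors whose filter matches every term."""
    return [name for name, kw_filter in neighbor_filters.items()
            if all(kw_filter.might_contain(term) for term in query_terms)]


filters = {"neighbor-a": KeywordFilter(), "neighbor-b": KeywordFilter()}
filters["neighbor-a"].add("holiday")
filters["neighbor-a"].add("photos")
filters["neighbor-b"].add("lecture")
print(neighbors_to_forward(["holiday", "photos"], filters))
```

A filter never misses a keyword that was added, so a matching neighbor is never skipped. False positives only cost an occasional unnecessary forward.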

37
Handling Network Churn
  • Only ultrapeers participate in DHT overlay.
    Ultrapeers publish items for leaves.
  • Avoid publishing items from short-lived nodes.