Title: Peer-to-Peer (p2p) Querying
1Peer-to-Peer (p2p) Querying
- Joe Hellerstein
- CS186 Fall 2005
2Note
- These slides are based on a tutorial given at VLDB 2004
- http://db.cs.berkeley.edu/jmh/talks/vldb04-p2ptut-final.ppt
- http://db.cs.berkeley.edu/jmh/talks/vldb04-p2ptut-2upbw.pdf
- These slides were made on a Mac
- May not display correctly in PPT for Windows
- Animation is often a portability problem for PPT
- PPT's Compatibility Check finds 185 issues!
3Outline
- What is p2p?
- Querying in early p2p systems
- Napster
- Gnutella
- KaZaA, Gnutella with Ultrapeers
- Some problems with queries in Gnutella
- Distributed Hash Tables (DHTs)
- Chord
- Keyword search over DHTs
- More fun
- Towards full-service p2p querying
- Get involved!
- DB ideas infecting the network more deeply
4p2p
- Distributed applications without servers
- Scale
- Peers
- Churn
- Self-admin
- People tend to think of file stealing
- Respect the musicians, don't steal music.
- Also used for, e.g., swapping biological data, open-source software, etc.
- Lots of potential applications of the technology
- Go make some up!
- My favorite: Public Health for the Internet
- P2P is an inherently democratic architecture
5p2p, cont
- p2p is organic
- Start the next phenomenon in your dorm room
- No need for a hosted server, administrator, etc.
- Hence no need for Venture Capital
- Hence no need to worry if it will take off
- Infrastructure right-sizes itself
7Early P2P I: Client-Server
- Napster
- Client-Server search
- pt2pt file xfer
[Figure: a client asks the server "xyz.mp3 ?", then fetches xyz.mp3 directly from the peer that has it]
12Early P2P II: Flooding on Overlays
- An overlay network. Unstructured.
- Flooding
[Figure: the query "xyz.mp3 ?" is flooded across the overlay until it reaches the peer holding xyz.mp3]
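Flooding on an unstructured overlay can be sketched in a few lines. This is a minimal illustration, not Gnutella's actual protocol: the five-peer topology, the `flood_search` name, and the hop-count TTL are all assumptions made for the example.

```python
from collections import deque

def flood_search(overlay, files, start, filename, ttl):
    """Flood a query from `start`; each peer forwards it to all neighbors until the TTL runs out.

    Returns (peers holding the file that the query reached, number of messages sent).
    """
    visited = {start}
    frontier = deque([(start, ttl)])
    hits, messages = [], 0
    while frontier:
        peer, hops_left = frontier.popleft()
        if filename in files.get(peer, ()):
            hits.append(peer)
        if hops_left == 0:
            continue
        for neighbor in overlay[peer]:
            messages += 1                  # every forwarded copy costs bandwidth
            if neighbor not in visited:    # duplicate copies are dropped on arrival
                visited.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return hits, messages

# Hypothetical overlay: only peer E holds the file.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
files = {"E": {"xyz.mp3"}}

print(flood_search(overlay, files, "A", "xyz.mp3", ttl=2))  # ([], 6)
print(flood_search(overlay, files, "A", "xyz.mp3", ttl=3))  # (['E'], 9)
```

With a TTL of 2 the query never reaches E, so the file is missed even though it exists. Raising the TTL finds it, at the cost of more messages.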
16Early P2P II.v: Ultrapeers
- Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella)
17Gnutella Network
Oct 2003 Crawl
- Popular open-source file-sharing network
- 450,000 users as of 2003
- 2,000,000 today
- Ultrapeer-based Topology
- Queries flooded among ultrapeers
- Leaf nodes shielded from query traffic
- Based on multiple crawlers from 30 vantage points on PlanetLab
[Figure legend: Ultrapeer nodes, Leaf nodes; 100 Files, 0 Files, 0-100 Files]
18PlanetLab
- PlanetLab: Open, globally distributed platform for deploying planetary-scale network services
- 631 nodes at 299 sites, 5 continents
- URL: http://www.planet-lab.org
19Gnutella Measurements
- Quality of Searches
- Recall (% of all relevant items retrieved)
- Distinct Recall (% of all relevant distinct items retrieved)
- Response Time (Latency) to 1st result
- Software utilized
- Modified the LimeWire Gnutella Client
- Run as leaf or ultrapeer
- Monitor Gnutella traffic
- Inject queries and gather results
20Gnutella Search Quality
- Log of Gnutella queries from LimeWire clients
- Reissued Gnutella queries
- 700 randomly chosen queries
- 30 LimeWire Ultrapeers on PlanetLab
- 3 different times
- Computed Query Recall
- Each query issued simultaneously from 30 ultrapeers
- Union of results from 30 ultrapeers
- Union-of-30 is our approximation of perfect answer
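The recall computation against the Union-of-30 reference can be sketched as follows. This is an illustration under assumptions: results are modeled as (host, file) pairs, only three vantage points are shown instead of 30, and the function names are made up for the example.

```python
def query_recall(single, reference):
    """Share of all reference (host, file) replicas that one vantage point retrieved."""
    return len(single & reference) / len(reference)

def distinct_recall(single, reference):
    """Same idea, but counting each distinct file once, however many hosts replicate it."""
    def distinct(results):
        return {file_id for _host, file_id in results}
    return len(distinct(single) & distinct(reference)) / len(distinct(reference))

# Hypothetical results for one query, as seen from three ultrapeers.
per_ultrapeer = [
    {("h1", "f1"), ("h2", "f1")},
    {("h2", "f1"), ("h3", "f2")},
    {("h4", "f1")},
]
# The union over all vantage points stands in for the perfect answer.
reference = set().union(*per_ultrapeer)

for results in per_ultrapeer:
    print(query_recall(results, reference), distinct_recall(results, reference))
# 0.5 0.5
# 0.5 1.0
# 0.25 0.5
```

Both functions assume a non-empty reference set. A query whose union is empty has no defined recall.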
21Queries with Small Result Sets
22Result Size CDF
[Figure: CDF of result size for Single Query and Union-of-30 Query]
Large fraction of queries return few or no results even when they exist
23Query Latency
24Summary of Measurements
- Searching on Flood-based networks
- Highly effective for popular (highly replicated) items
- Less effective for rare items
- Significant opportunity to do better
- Large fraction of queries return few or no results even when they exist
- Bad response times for queries on rare items
- Aggressive flooding is not the solution
- Diminishing returns with flooding
- Does not improve response times
26High-Level Idea: Indirection
- Indirection in space
- Logical (content-based) IDs, routing to those IDs
- Content-addressable network
- Tolerant of churn
- nodes joining and leaving the network
- Indirection in time
- Want some scheme to temporally decouple send and receive
- Persistence required. Typical Internet solution: soft state
- Combo of persistence via storage and via retry
- Publisher requests TTL on storage
- Republishes as needed
- Metaphor: Distributed Hash Table
27What is a DHT?
- Hash Table
- data structure that maps keys to values
- essential building block in software systems
- Distributed Hash Table (DHT)
- similar, but spread across the Internet
- Interface
- insert(key, value)
- lookup(key)
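The insert/lookup interface can be illustrated with a toy, single-process stand-in. This is a sketch under assumptions, not any real DHT: the `ToyDHT` class, the 16-bit identifier space, and the "first node ID at or after the key's hash" placement rule are choices made for the example, and no network routing happens here.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # toy identifier space (an assumption for this example)

def hash_id(name):
    """Map a node name or a key to a point in the identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)

class ToyDHT:
    def __init__(self, node_names):
        self.ring = sorted((hash_id(n), n) for n in node_names)
        self.storage = {n: {} for n in node_names}   # each node's local hash table

    def owner(self, key):
        """The node responsible for a key: first node ID at or after the key's hash, wrapping around."""
        ids = [node_id for node_id, _ in self.ring]
        idx = bisect_left(ids, hash_id(key)) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, key, value):
        self.storage[self.owner(key)].setdefault(key, []).append(value)

    def lookup(self, key):
        return list(self.storage[self.owner(key)].get(key, []))

dht = ToyDHT(["node%d" % i for i in range(8)])
dht.insert("xyz.mp3", "peer-17")
print(dht.lookup("xyz.mp3"))   # ['peer-17']
print(dht.lookup("missing"))   # []
```

Callers only ever name keys. Which node stores a given key is decided by hashing, so it can change as nodes join and leave without the caller noticing.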
28How?
- Every DHT node supports a single operation
- Given key as input, route messages toward node holding key
29DHT in action
Operation: take key as input; route messages to node holding key
32DHT in action: put()
insert(K1,V1)
[Figure: insert(K1,V1) is routed to the node holding K1, which stores (K1,V1)]
35DHT in action: get()
retrieve(K1)
36DHT Design Goals
- An overlay network with
- Flexible mapping of keys to physical nodes
- Small network diameter
- Small degree (fanout)
- Local routing decisions
- Robustness to churn
- Routing flexibility
- Decent locality (low stretch)
- A storage or memory mechanism with
- No guarantees on persistence
- Maintenance via soft state
- Each fact has a time-to-live
- Publisher of the fact must republish to achieve
persistence
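Soft-state storage with a time-to-live can be sketched like this. It is a minimal illustration: the `SoftStateStore` class is hypothetical, and time is passed in explicitly as `now` so the example is deterministic.

```python
class SoftStateStore:
    """Facts live only until their TTL expires; nothing is persisted beyond that."""

    def __init__(self):
        self._facts = {}  # key -> (value, expiry time)

    def put(self, key, value, ttl, now):
        # Publishing and republishing are the same operation: both reset the expiry.
        self._facts[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self._facts.get(key)
        if entry is None or entry[1] <= now:
            self._facts.pop(key, None)  # expired facts are silently dropped
            return None
        return entry[0]

store = SoftStateStore()
store.put("xyz.mp3", "peer-17", ttl=30, now=0)
print(store.get("xyz.mp3", now=10))             # peer-17
store.put("xyz.mp3", "peer-17", ttl=30, now=25) # publisher republishes before expiry
print(store.get("xyz.mp3", now=40))             # peer-17
print(store.get("xyz.mp3", now=60))             # None: the publisher stopped republishing
```

If the publisher disappears, its facts disappear with it once the TTL runs out. That is how stale data gets cleaned up without any explicit delete.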
37DHT Outline
- High-level overview
- Fundamentals of structured network topologies
- And examples
- One concrete DHT
- Chord
- Some systems issues
- Storage models soft state
- Locality
- Churn management
39An Example DHT: Chord
- Assume n = 2^m nodes for a moment
- A complete Chord ring
- We'll generalize shortly
43Routing in Chord
- At most one of each Gon
- E.g. 1-to-0
- What happened?
- We constructed the binary number 15!
- Routing from x to y is like computing y - x mod n by summing powers of 2
[Figure: route from 1 to 0; hops labeled 8, 4, 2, 1]
- Diameter: log n (1 hop per gon type)
- Degree: log n (one outlink per gon type)
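The 1-to-0 example works out as 0 - 1 mod 16 = 15 = 8 + 4 + 2 + 1, which is 1111 in binary. The ring size of 16 is inferred from that 15, not stated on the slide. A minimal sketch of this routing on a complete ring, with a made-up function name `chord_route`:

```python
def chord_route(x, y, m):
    """Route from node x to node y on a complete Chord ring of n = 2**m nodes.

    Each node has one outgoing chord per power of two (1, 2, 4, ..., 2**(m-1)),
    so a route uses each chord length at most once.
    """
    n = 2 ** m
    distance = (y - x) % n
    path, current = [x], x
    for i in reversed(range(m)):      # try the longest chord first
        hop = 2 ** i
        if distance >= hop:
            current = (current + hop) % n
            distance -= hop
            path.append(current)
    return path

print(chord_route(1, 0, 4))  # [1, 9, 13, 15, 0]
print(chord_route(3, 8, 4))  # [3, 7, 8]
```

The hops taken are exactly the 1 bits of (y - x) mod n, so no route needs more than m = log n hops.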
50File Search using DHTs
- Inverted Index in a DHT
- To answer query: term1 AND term2
- Route query to hash(term1) and hash(term2)
- Rehash postings for term1 on DocID
- Rehash postings for term2 on DocID
- Do local intersection at each node that received tuples
- Send matches to querier
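The steps above can be sketched in one process. This is an illustration under assumptions, not PIER's implementation: nodes are simulated as dictionary slots, keys are placed by `hash mod NUM_NODES` (a simplification), and `publish` and `query_and` are names made up for the example.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 8  # hypothetical network size

def node_for(key):
    """Pick the node responsible for a key (simplified: hash modulo node count)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_NODES

def publish(index, doc_id, terms):
    """Inverted index: each term's posting list lives at the node for hash(term)."""
    for term in terms:
        index[node_for(term)][term].add(doc_id)

def query_and(index, term1, term2):
    # Route the query to hash(term1) and hash(term2) to find the posting lists.
    postings1 = index[node_for(term1)].get(term1, set())
    postings2 = index[node_for(term2)].get(term2, set())
    # Rehash both posting lists on DocID: equal DocIDs land on the same node.
    received = defaultdict(lambda: (set(), set()))
    for doc_id in postings1:
        received[node_for(doc_id)][0].add(doc_id)
    for doc_id in postings2:
        received[node_for(doc_id)][1].add(doc_id)
    # Local intersection at each node that received tuples; matches go to the querier.
    matches = set()
    for from_term1, from_term2 in received.values():
        matches |= from_term1 & from_term2
    return matches

index = defaultdict(lambda: defaultdict(set))
publish(index, "doc1", ["peer", "query"])
publish(index, "doc2", ["peer", "hash"])
publish(index, "doc3", ["query", "hash", "peer"])
print(sorted(query_and(index, "peer", "query")))  # ['doc1', 'doc3']
```

Every posting for both terms is shipped across the network before any intersection happens, so the cost depends on posting-list length, not on the size of the answer.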
51Keyword Search using DHTs
- Inverted Lists hashed by keyword (term) in the DHT
Query: T1 AND T2
52File Search: Flooding vs. DHTs
- Recall
- Flooding can miss files
- DHTs should never
- Query complexity
- Flooding can handle arbitrary single-site logic
- DHTs can do equijoins, selections, aggregates, etc.
- But not so good at fancy selections like wildcards
- Query Performance
- Flooding can be slow to find things, uses lots of BW
- DHTs expensive to publish documents with lots of terms
- DHTs expensive to intersect really long term lists
- Even if output is really small!
- Not likely to replace Google any time soon
- Hybrid solution!
53Hybrid Search
- Hybrid: Best of both worlds
- Flood-based Network (Search Popular Items)
- DHT (Search Rare Items)
54Challenges
- Identifying Rare Items
- Query Results Size (QRS)
- Publish items from previous queries that return few results
- Term Frequency Statistics
- Single Term (TF)
- Term Pairs (TPF)
- Sample items on neighboring nodes (SAM)
- Network Churn
- Use Ultrapeers as DHT nodes
- Avoid publishing items from short-lived nodes
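The Query Results Size (QRS) idea can be sketched as a small hybrid search routine. This is a loose illustration under assumptions, not the scheme as deployed: the threshold value, the plain dict standing in for the DHT, and the two fake flood functions are all hypothetical.

```python
def hybrid_search(query, flood, dht, rare_threshold=3):
    """Flood first; treat a small result set as a sign that the item is rare."""
    results = set(flood(query))
    if len(results) >= rare_threshold:
        return results                            # popular item: flooding was enough
    from_dht = set(dht.get(query, ()))            # rare item: also ask the DHT
    dht.setdefault(query, set()).update(results)  # QRS: publish what flooding did find
    return results | from_dht

POPULAR = {"p1/hit.mp3", "p2/hit.mp3", "p3/hit.mp3"}

def flood_near_copy(query):
    """Hypothetical vantage point whose flood reaches the one rare copy."""
    return {"peer9/rare.mp3"} if query == "rare" else set(POPULAR)

def flood_far_away(query):
    """Hypothetical vantage point whose flood misses the rare copy."""
    return set() if query == "rare" else set(POPULAR)

dht = {}
first = hybrid_search("rare", flood_near_copy, dht)   # found by flooding, then published
second = hybrid_search("rare", flood_far_away, dht)   # flooding misses, DHT answers
popular = hybrid_search("hit", flood_far_away, dht)   # never touches the DHT
print(first, second, len(popular))
```

Only items from small result sets get published, which keeps popular items (the expensive ones to index) out of the DHT.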
55Results
- Trace-driven simulations
- 315,000 files, 75,000 nodes, 350 queries
- Improved Response Time
- PIER (DHT) returns first result in 10 seconds
- 40 seconds in aggregate including 30 seconds timeout
- Gnutella (Flood) queries return first result in 65 seconds
- 25 seconds (38%) reduction in latency
- Improved Recall Analysis
- 18% reduction in queries with empty results
- Using a naïve rare-item selection scheme
- Opportunity to do far better
- Recall: 66% potential reduction based on Union-of-30
- Approaches: larger-scale deployments and using better rare-item schemes.
57DHTs Gave Us Equality Lookups
- What else might we want?
- Range Search
- Aggregation
- Group By
- More complex Joins
- Intelligent Query Dissemination
- Theme
- All can be built elegantly on DHTs!
- PIER
pier.cs.berkeley.edu
58Joining the Fun
59OpenDHT
- A shared DHT service
- Hosted on PlanetLab
- Simple API
- You don't need to deploy or host to play with a real DHT!
- A playground for killer apps?
- Needn't be as big as PIER!
- Example: FreeDB replacement
60Infecting the network even more deeply?
- Today's internet infrastructure is a mess
- Very complex to configure
- Very limited in functionality
- Assume they cannot know all kinds of things
- Things like p2p are threatening to make it obsolete
- What do networks do?
- Manage routing and forwarding tables
- Perform dataflow
- Uhhhhh... isn't that kind of like query engines?
- Yes!
- But don't try using Oracle for this just yet
62p2.cs.berkeley.edu
- P2: A declarative networking system
- Based on recursive queries over graphs
- E.g. Find the shortest path between me and you
- A topic in DB theory that was mostly abandoned in practice until recently
- Can be used to implement routing protocols, DHTs, etc.
- Chord in 47 rules
- Instead of 10,000 lines of C
- Lots of new fun query processing challenges
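To give a feel for a recursive query over a graph, here is a Python stand-in for two rules: every link is a path, and a link followed by a path is a path, keeping the cheapest cost per (source, destination) pair. This is not P2's rule language or its evaluation strategy. The function name, the example graph, and the naive fixpoint loop are assumptions for illustration, and link costs are assumed non-negative.

```python
def shortest_paths(links):
    """links: dict mapping (src, dst) -> cost. Returns dict (src, dst) -> cheapest path cost."""
    best = dict(links)                    # base rule: every link is a path
    changed = True
    while changed:                        # recursive rule, reapplied until nothing changes
        changed = False
        for (src, mid), link_cost in links.items():
            for (start, dst), path_cost in list(best.items()):
                if start != mid or src == dst:
                    continue              # the path must continue from the link's endpoint
                cost = link_cost + path_cost
                if cost < best.get((src, dst), float("inf")):
                    best[(src, dst)] = cost
                    changed = True
    return best

# Hypothetical network between "me" and "you".
links = {
    ("me", "a"): 1,
    ("a", "you"): 1,
    ("me", "you"): 5,
    ("a", "b"): 2,
    ("b", "you"): 1,
}
paths = shortest_paths(links)
print(paths[("me", "you")])  # 2, via a, beating the direct link of cost 5
print(paths[("me", "b")])    # 3
```

The whole computation is a join between the link table and the path table, repeated to a fixpoint. That is the sense in which routing looks like query processing.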