Title: DISTRIBUTED HASH TABLES Building large-scale, robust distributed applications
Slide 1: DISTRIBUTED HASH TABLES: Building large-scale, robust distributed applications
- Frans Kaashoek
- kaashoek_at_lcs.mit.edu
- Joint work with H. Balakrishnan, P. Druschel, J. Hellerstein, D. Karger, R. Karp, J. Kubiatowicz, B. Liskov, D. Mazières, R. Morris, S. Shenker, I. Stoica
Slide 2: P2P: an exciting social development
- Internet users cooperating to share, for example, music files
- Napster, Gnutella, Morpheus, KaZaA, etc.
- Lots of attention from the popular press
- The ultimate form of democracy on the Internet
- The ultimate threat to copyright protection on the Internet
- Many vendors have launched P2P efforts
Slide 3: What is P2P?
[Figure: five clients attached to the Internet]
- A distributed system architecture
- No centralized control
- Nodes are symmetric in function
- Typically many nodes, but unreliable and
heterogeneous
Slide 4: Traditional distributed computing: client/server
[Figure: four clients reaching a server through the Internet]
- Successful architecture, and will continue to be so
- Tremendous engineering necessary to make server farms scalable and robust
Slide 5: Application-level overlays
[Figure: overlay nodes (N) at Sites 1 to 4, connected through ISP1, ISP2 and ISP3]
- One per application
- Nodes are decentralized
- NOC is centralized
P2P systems are overlay networks without central control
Slide 6: (Potential) P2P advantages
- Allows for scalable incremental growth
- Aggregate tremendous amount of computation and storage resources
- Tolerate faults or intentional attacks
Slide 7: Example P2P problem: lookup
[Figure: nodes N1 to N6 on the Internet; a publisher holds Key = title, Value = file data; a client issues Lookup(title) and has to find out which node to ask]
- At the heart of all P2P systems
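Slide 8: Centralized lookup (Napster)
[Figure: the publisher at N4 holds Key = title, Value = file data and registers SetLoc(title, N4) with a central DB; the client sends Lookup(title) to the DB; other nodes shown: N1, N2, N3, N6, N7, N8, N9]
Simple, but O(N) state and a single point of failure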
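Slide 9: Flooded queries (Gnutella)
[Figure: the client's Lookup(title) is flooded from node to node (N1, N2, N3, N6, N7, N8, N9) until it reaches the publisher at N4, which holds Key = title, Value = MP3 data]
Robust, but worst case O(N) messages per lookup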
Slide 10: Another approach: distributed hash tables
[Figure: distributed applications call Lookup(key) and Insert(key, data) on the distributed hash table layer, which returns data and runs across many nodes]
- Nodes are the hash buckets
- Key identifies data uniquely
- DHT balances keys and data across nodes
- DHT replicates, caches, routes lookups, etc.
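The idea that nodes are the hash buckets can be sketched with an in-memory toy. `Node`, `ToyDHT` and the modulo placement rule are illustrative assumptions, not part of any real DHT; replication, caching and routing are left out.

```python
import hashlib


class Node:
    """One participant; it stores the key/value pairs that hash to it."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT: nodes are the buckets."""

    def __init__(self, node_names):
        self.nodes = [Node(n) for n in node_names]

    def _bucket(self, key):
        # Hash the key and pick a node. Modulo placement is a simplification;
        # real DHTs place keys so that few keys move when nodes join or leave.
        digest = hashlib.sha1(key.encode()).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def insert(self, key, data):
        self._bucket(key).store[key] = data

    def lookup(self, key):
        return self._bucket(key).store.get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.insert("title", b"file data")
print(dht.lookup("title"))
```

The application only sees `insert` and `lookup`; which node holds the data is hidden behind the hash.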
Slide 11: Why DHTs now?
- Demand pulls
- Growing need for security and robustness
- Large-scale distributed apps are difficult to build
- Many applications use location-independent data
- Technology pushes
- Bigger, faster, and better: every PC can be a server
- Scalable lookup algorithms are available
- Trustworthy systems from untrusted components
Slide 12: DHT is a good interface
DHT: lookup(key) → data; Insert(key, data)
UDP/IP: Send(IP address, data); Receive(IP address) → data
- Supports a wide range of applications, because few restrictions
- Keys have no semantic meaning
- Value is application dependent
- Minimal interface
Slide 13: DHT is a good shared infrastructure
- Applications inherit some security and robustness from DHT
- DHT replicates data
- Resistant to malicious participants
- Low-cost deployment
- Self-organizing across administrative domains
- Can be shared among applications
- Large scale supports Internet-scale workloads
Slide 14: DHTs support many applications
- File sharing: CFS, OceanStore, PAST, ...
- Web cache: Squirrel, ...
- Censor-resistant stores: Eternity, FreeNet, ...
- Event notification: Scribe
- Naming systems: ChordDNS, INS, ...
- Query and indexing: Kademlia, ...
- Communication primitives: I3, ...
- Backup store: HiveNet
- Web archive: Herodotus
Data is location-independent
Slide 15: Cooperative read-only file sharing
[Figure: a file system calls Lookup(key) and insert(key, block) on the distributed hash table layer, which returns blocks and runs across many nodes]
- DHT is a robust block store
- Client of DHT implements file system
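Slide 16: File representation: self-authenticating data
[Figure: a tree of blocks for a file system with key = 995. The signed root block (995) lists key = 901 and key = 732; below it are directory blocks, an i-node block and data. Other labels: key = 431, key = 795, a.txt ID = 144; blocks 431, 144 and 901 are each marked SHA-1.]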
- DHT key for block is SHA-1(content block)
- File and file systems form Merkle hash trees
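A minimal sketch of self-authenticating blocks, assuming a plain dictionary in place of the DHT. `put_block`, `get_block` and the one-level index block are hypothetical names and a simplified layout; the signed root block from the slide is omitted.

```python
import hashlib

store = {}  # stands in for the DHT; maps hex key -> block bytes


def put_block(block):
    """Insert a block under the SHA-1 of its content and return the key."""
    key = hashlib.sha1(block).hexdigest()
    store[key] = block
    return key


def get_block(key):
    """Fetch a block and reject it if its content does not hash to the key."""
    block = store[key]
    if hashlib.sha1(block).hexdigest() != key:
        raise ValueError("block does not match its key")
    return block


# Build a tiny two-level tree: an index block that lists the data-block keys.
data_keys = [put_block(b"hello "), put_block(b"world")]
index_key = put_block("\n".join(data_keys).encode())


def read_file(index_key):
    """Follow the index block to its data blocks, verifying every fetch."""
    keys = get_block(index_key).decode().split("\n")
    return b"".join(get_block(k) for k in keys)


print(read_file(index_key))
```

Because the index block names its children by hash, a reader who trusts `index_key` can check every block below it, which is the Merkle-tree property the slide refers to.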
Slide 17: DHT distributes blocks by hashing IDs
[Figure: two signed root blocks, 995 (key = 901, key = 732) and 247 (key = 407, key = 992, key = 705); blocks 732, 705, 407, 901 and 992 are spread over Nodes A, B, C and D across the Internet]
- DHT replicates blocks for fault tolerance
- DHT caches popular blocks for load balance
Slide 18: Historical web archiver
- Goal: make and archive a daily checkpoint of the Web
- Estimates
- Web is about 57 Tbyte, compressed HTML+img
- New data per day: 580 Gbyte
- 128 Tbyte per year with 5 replicas
- Design
- 12,810 nodes: 100 Gbyte disk each and 61 Kbit/s per node
Slide 19: Implementation using DHT
[Figure: a crawler and a client use the distributed hash table layer, which runs across many nodes; the labelled calls are Insert(sha-1(URL), page), Lookup(URL) and Insert(sha-1(URL), URL)]
- DHT usage
- Crawler distributes crawling load by hash(URL)
- Crawler inserts Web pages by hash(URL)
- Clients retrieve Web pages by hash(URL)
- DHT replicates data for fault tolerance
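The three uses of hash(URL) can be sketched together. This assumes a dictionary in place of the DHT and a fixed number of crawlers; `crawler_for`, `crawl` and `retrieve` are hypothetical helpers, not part of the archiver described in the talk.

```python
import hashlib


def url_key(url):
    """DHT key for a page: SHA-1 of its URL, as on the slide."""
    return hashlib.sha1(url.encode()).hexdigest()


def crawler_for(url, num_crawlers):
    """Pick the crawler responsible for a URL from the same hash."""
    return int(url_key(url), 16) % num_crawlers


archive = {}  # stands in for the DHT


def crawl(url, page, me, num_crawlers):
    """Store the page only if this crawler owns the URL."""
    if crawler_for(url, num_crawlers) != me:
        return False
    archive[url_key(url)] = page
    return True


def retrieve(url):
    """Client side: recompute the key from the URL and look it up."""
    return archive.get(url_key(url))


urls = ["http://example.com/a", "http://example.com/b", "http://example.org/"]
for url in urls:
    for me in range(4):
        crawl(url, b"<html>" + url.encode(), me, 4)
print(retrieve("http://example.com/a"))
```

Each URL is fetched by exactly one of the four crawlers, and a client needs nothing but the URL to find the page.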
Slide 20: Backup store
- Goal: back up on other users' machines
- Observations
- Many user machines are not backed up
- Backup requires significant manual effort
- Many machines have lots of spare disk space
- Using DHT
- Merkle tree to validate integrity of data
- Administrative and financial costs are less for all participants
- Backups are robust (automatic off-site backups)
- Blocks are stored once, if key = sha1(data)
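Why content-keyed blocks are stored only once can be shown in a few lines. This is a sketch assuming fixed-size blocks and a dictionary in place of the DHT; the 4-byte block size and the `backup` and `restore` names are illustrative only.

```python
import hashlib

blocks = {}  # stands in for the DHT


def backup(data, block_size=4):
    """Split data into fixed-size blocks and store each under sha1(block)."""
    keys = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        key = hashlib.sha1(block).hexdigest()
        blocks[key] = block  # identical content maps to the same key
        keys.append(key)
    return keys


def restore(keys):
    """Rebuild the original data from its list of block keys."""
    return b"".join(blocks[k] for k in keys)


alice = backup(b"AAAABBBBCCCC")
bob = backup(b"AAAABBBBDDDD")
print(len(alice) + len(bob), "block references,", len(blocks), "blocks stored")
```

Two users who back up the same content share the stored blocks without coordinating, since both compute the same key.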
Slide 21: Research challenges
- Scalable lookup
- Balance load (flash crowds)
- Handling failures
- Coping with systems in flux
- Network-awareness for performance
- Robustness with untrusted participants
- Programming abstraction
- Heterogeneity
- Anonymity
- Goal: simple, provably-good algorithms
this talk
Slide 22: 1. Scalable lookup
- Map keys to nodes in a load-balanced way
- Hash keys and nodes into a string of digits
- Assign key to closest node
- Forward a lookup for a key to a closer node
- Insert: lookup + store
- Join: insert node in ring
Examples: CAN, Chord, Kademlia, Pastry, Tapestry, Viceroy, ...
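The mapping step can be sketched as consistent hashing on a ring. "Closest node" is taken here to mean the first node at or after the key, as in Chord; the 16-bit ID space and the `Ring` class are assumptions made for readability.

```python
import bisect
import hashlib

BITS = 16  # small ID space for readability; Chord uses 160-bit SHA-1 IDs


def ident(name):
    """Hash a node name or key into the circular ID space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** BITS)


class Ring:
    def __init__(self):
        self.ids = []      # sorted node IDs
        self.names = {}    # node ID -> node name

    def join(self, name):
        """Insert a node into the ring at the position given by its hash."""
        node_id = ident(name)
        if node_id not in self.names:
            bisect.insort(self.ids, node_id)
        self.names[node_id] = name

    def successor(self, key):
        """First node at or after the key's ID, wrapping around the ring."""
        i = bisect.bisect_left(self.ids, ident(key))
        return self.names[self.ids[i % len(self.ids)]]


ring = Ring()
for name in ["node-a", "node-b", "node-c", "node-d"]:
    ring.join(name)
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.successor(k) for k in keys}
ring.join("node-e")
moved = [k for k in keys if ring.successor(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved, all to node-e")
```

A join only takes keys away from the new node's successor, so the other nodes keep what they had.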
Slide 23: Chord's routing table: fingers
[Figure: node N80 keeps fingers pointing 1/2, 1/4, 1/8, 1/16, 1/32, 1/64 and 1/128 of the way around the ring]
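The finger table can be computed for a small example. This sketch assumes a 7-bit ID space (which matches the smallest fraction, 1/128) and borrows the node IDs from the ring on the lookup slide; finger i is taken to be the first node at least 2**i past the node, as in Chord.

```python
import bisect

M = 7  # bits in the ID space, so IDs run from 0 to 127 (an assumption)
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])  # node IDs from the slides


def successor(ident):
    """First node at or after ident on the ring, wrapping past the top."""
    i = bisect.bisect_left(NODES, ident % 2 ** M)
    return NODES[i % len(NODES)]


def fingers(n):
    """Finger i points at the first node at least 2**i past n."""
    return [successor(n + 2 ** i) for i in range(M)]


print(fingers(80))  # [99, 99, 99, 99, 99, 5, 20]
```

With only eight nodes most short fingers collapse onto the immediate successor N99; the two longest fingers reach a quarter and half of the way around the ring.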
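Slide 24: Lookups take O(log(N)) hops
[Figure: ring with nodes N5, N10, N20, N32, N60, N80, N99 and N110; Lookup(K19) is issued at N80 and hops around the ring toward key K19, which lies just before N20]
- Lookup: route to closest predecessor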
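Routing to the closest predecessor can be simulated on the ring from the figure. This is a sketch assuming a 7-bit ID space and global knowledge of the node list, which a real node would not have; the function names follow common Chord terminology but are not taken from the slides.

```python
import bisect

M = 7  # ID bits (an assumption); IDs run from 0 to 127
NODES = sorted([5, 10, 20, 32, 60, 80, 99, 110])  # ring from the figure


def successor(ident):
    """First node at or after ident, wrapping around the ring."""
    i = bisect.bisect_left(NODES, ident % 2 ** M)
    return NODES[i % len(NODES)]


def between(x, a, b):
    """True if x lies in the ring interval (a, b], going clockwise from a."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def closest_preceding_finger(n, key):
    """Farthest finger of n that still lies strictly before key."""
    for i in reversed(range(M)):
        f = successor(n + 2 ** i)
        if f != key and between(f, n, key):
            return f
    return n


def lookup(start, key):
    """Return (node responsible for key, list of nodes visited)."""
    n, path = start, [start]
    while not between(key, n, successor(n + 1)):
        n = closest_preceding_finger(n, key)
        path.append(n)
    return successor(n + 1), path


print(lookup(80, 19))  # (20, [80, 5, 10])
```

Each hop jumps to the farthest finger that does not overshoot the key, which is why the number of hops grows with log(N) rather than N.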
Slide 25: CAN: exploit d dimensions
- Each node is assigned a zone
- Nodes are identified by zone boundaries
- Join: choose a random point, split its zone
Slide 26: Routing in 2 dimensions
- Routing is navigating a d-dimensional ID space
- Route to closest neighbor in direction of destination
- Routing table contains O(d) neighbors
- Number of hops is O(d N^(1/d))
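The hop bound can be checked on an idealized CAN. The sketch assumes N equal-sized zones arranged as a wrapped grid with N^(1/d) zones per side, and a greedy rule that fixes one coordinate at a time; real CAN zones are uneven and its greedy choice differs, so this only illustrates the d N^(1/d) scaling.

```python
def torus_step(a, b, side):
    """Signed one-hop move from a toward b along one wrapped dimension."""
    forward = (b - a) % side
    if forward == 0:
        return 0
    return 1 if forward <= side - forward else -1


def route(src, dst, side):
    """Greedy routing: fix one coordinate at a time, one neighbor per hop."""
    path, cur = [tuple(src)], list(src)
    for dim in range(len(src)):
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + torus_step(cur[dim], dst[dim], side)) % side
            path.append(tuple(cur))
    return path


# 64 zones in d = 2 dimensions: an 8 x 8 grid, so N**(1/d) = 8.
path = route((0, 0), (5, 3), side=8)
print(len(path) - 1, "hops:", path)
```

Each dimension needs at most half a side, so no route in this grid exceeds d * N^(1/d) / 2 = 8 hops.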
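Slide 27: 2. Balance load
[Figure: the same ring (N5, N10, N20, N32, N60, N80, N99, N110) with Lookup(K19); a second K19 label appears next to N5 in addition to the one near N20]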
- Hash function balances keys over nodes
- For popular keys, cache along the path
Slide 28: Why Caching Works Well
[Figure: node N20 on the ring]
- Only O(log N) nodes have fingers pointing to N20
- This limits the single-block load on N20
Slide 29: 3. Handling failures: redundancy
[Figure: ring with nodes N5, N10, N20, N32, N40, N60, N80, N99 and N110]
- Each node knows IP addresses of next r nodes
- Each key is replicated at next r nodes
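Replication at the next r nodes can be sketched on the ring from the figure. The value r = 3 and the `successors` and `live_replica` helpers are assumptions for illustration; failure detection is reduced to a set of failed node IDs.

```python
import bisect

NODES = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])  # ring from the slide
R = 3  # replication factor r; the value is an assumption


def successors(ident, r, nodes=NODES):
    """The next r nodes at or after ident, going clockwise around the ring."""
    i = bisect.bisect_left(nodes, ident)
    return [nodes[(i + j) % len(nodes)] for j in range(min(r, len(nodes)))]


def live_replica(key, failed):
    """First holder of key that has not failed, or None if all r are down."""
    for node in successors(key, R):
        if node not in failed:
            return node
    return None


print(successors(19, R))                   # [20, 32, 40]
print(live_replica(19, failed={20, 32}))   # 40
```

Key 19 survives as long as one of its r successors is alive, and a node that knows its next r successors can route around the failed ones.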
Slide 30: Lookups find replicas
[Figure: ring with nodes N5, N10, N20, N40, N50, N60, N68, N80, N99 and N110; Lookup(K19) proceeds in numbered steps 1 to 4 toward key K19]
- Tradeoff between latency and bandwidth [Kademlia]
Slide 31: 4. Systems in flux
- Lookup takes log(N) hops
- If system is stable
- But, system is never stable!
- What we desire are theorems of the type
- In the almost-ideal state, ... log(N)
- System maintains almost-ideal state as nodes join and fail
Slide 32: Half-life [Liben-Nowell 2002]
[Figure: a system of N nodes; N new nodes join; N/2 old nodes leave]
- Doubling time: time for N joins
- Halving time: time for N/2 old nodes to fail
- Half life: MIN(doubling time, halving time)
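The definition is a minimum of two times, which a short calculation makes concrete. The constant join and failure rates, and the numbers used, are assumptions chosen only to produce an example.

```python
def half_life(doubling_time, halving_time):
    """Half life as defined on the slide: the smaller of the two times."""
    return min(doubling_time, halving_time)


def times_from_rates(n, joins_per_hour, failures_per_hour):
    """Estimate both times for N nodes under constant rates (an assumption)."""
    doubling = n / joins_per_hour            # time for N new nodes to join
    halving = (n / 2) / failures_per_hour    # time for N/2 old nodes to leave
    return doubling, halving


d, h = times_from_rates(n=1000, joins_per_hour=50, failures_per_hour=20)
print(d, h, half_life(d, h))  # 20.0 25.0 20.0
```

In this example joins dominate: the system doubles in 20 hours, before half of the old nodes have left.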
Slide 33: Applying half life
- For any node u in any P2P network
- If u wishes to stay connected with high probability,
- then, on average, u must be notified about Ω(log N) new nodes per half life
- And so on, ...
Slide 34: 5. Optimize routing to reduce latency
[Figure: nodes N20, N40, N41 and N80 on the ring]
- Nodes close on ring, but far away in Internet
- Goal: put nodes in routing table that result in few hops and low latency
Slide 35: "close" metric impacts choice of nearby nodes
[Figure: ring with N06 (USA), N32 (Far East), N60 (USA), N103 (Europe), N105 (USA) and key K104]
- Chord's numerical closeness and table restrict choice
- Prefix-based routing allows for choice
- Kademlia offers choice in nodes and places nodes in absolute order: close(a, b) = XOR(a, b)
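The XOR metric can be tried on the IDs from the figure. `close` follows the slide's notation; `nearest` is a hypothetical helper, and the IDs are treated as plain integers.

```python
def close(a, b):
    """Kademlia-style distance between two IDs: bitwise XOR as an integer."""
    return a ^ b


def nearest(nodes, key, k=1):
    """The k node IDs closest to key under the XOR metric."""
    return sorted(nodes, key=lambda n: close(n, key))[:k]


nodes = [6, 32, 60, 103, 105]  # node IDs from the slide
print(nearest(nodes, 104, k=2))  # [105, 103]
```

XOR is symmetric, and for a fixed key no two IDs are at the same distance, so every node agrees on one total ordering of the others around that key.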
Slide 36: Neighbor set
[Figure: ring with N06 (USA), N32 (Far East), N60 (USA), N103 (Europe), N105 (USA) and key K104]
- From k nodes, insert nearest node with appropriate prefix in routing table
- Assumption: triangle inequality holds
Slide 37: Finding k near neighbors
- Ping random nodes
- Swap neighbor sets with neighbors
- Combine with random pings to explore
- Provably-good algorithm to find nearby neighbors based on sampling [Karger and Ruhl 02]
Slide 38: Finding nearest neighbor [Karger and Ruhl 02]
- Maintain a neighbor table
- Entry i: k nodes within distance 2^i r
- Find nearest node
- Ask nodes in entry i for their nodes in entry i
- Insert nearest in entry i1
[Figure: node A with circles of radius r and 2r around it]
- Claim: algorithm will find the most nearby nodes with high probability
- Triangle inequality holds
- Doubling property holds
- Chord maintains finger and neighbor table
Slide 39: 6. Malicious participants
- Attacker denies service
- Flood DHT with data
- Attacker returns incorrect data [detectable]
- Self-authenticating data
- Attacker denies data exists [liveness]
- Bad node is responsible, but says no
- Bad node supplies incorrect routing info
- Bad nodes make a bad ring, and good node joins it
Basic approach: use redundancy
Slide 40: Sybil attack [Douceur 02]
- Attacker creates multiple identities
- Attacker controls enough nodes to foil the redundancy
[Figure: ring with nodes N5, N10, N20, N32, N40, N60, N80, N99 and N110]
- Need a way to control creation of node IDs
Slide 41: One solution: secure node IDs
- Every node has a public key
- Certificate authority signs public key of good nodes
- Every node signs and verifies messages
- Quotas per publisher
Slide 42: Another solution: exploit practical byzantine protocols
[Figure: ring with N06, N32, N60, N103 and N105, plus five further nodes labeled N]
- A core set of servers is pre-configured with keys and performs admission control
- The servers achieve consensus with a practical byzantine recovery protocol [Castro and Liskov 99 and 00]
- The servers serialize updates [OceanStore] or assign secure node IDs [Configuration service]
Slide 43: A more decentralized solution: weak secure node IDs
- ID = SHA-1(IP address of node)
- Assumption: attacker controls limited IP addresses
- Before using a node, challenge it to verify its ID
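The ID check can be sketched as follows. The challenge itself, for example sending a nonce to the address and waiting for the reply, is abstracted into the `observed_ip` argument, which is assumed to be the address that actually answered; `node_id` and `verify` are hypothetical names.

```python
import hashlib


def node_id(ip_address):
    """Weak secure ID: SHA-1 of the node's IP address, as hex."""
    return hashlib.sha1(ip_address.encode()).hexdigest()


def verify(claimed_id, observed_ip):
    """Accept a node only if its claimed ID matches the address it answers from."""
    return claimed_id == node_id(observed_ip)


honest = node_id("192.0.2.1")
print(verify(honest, "192.0.2.1"))      # True
print(verify(honest, "198.51.100.7"))   # False
```

An attacker can then hold only as many positions on the ring as it has IP addresses, which is the sense in which these IDs are weakly secure.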
Slide 44: Using weak secure node IDs
- Detect malicious nodes
- Define verifiable system properties
- Each node has a successor
- Data is stored at its successor
- Allow querier to observe lookup progress
- Each hop should bring the query closer
- Cross check routing tables with random queries
- Recovery: assume limited number of bad nodes
- Quota per node ID
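One verifiable property, that each hop brings the query closer, can be checked by the querier. This sketch assumes a 7-bit ring and measures closeness as clockwise distance to the key, as in Chord; `makes_progress` is a hypothetical helper.

```python
M = 7  # ID bits; a small ring chosen for illustration


def clockwise(a, b):
    """Distance travelled going clockwise from a to b on the ring."""
    return (b - a) % 2 ** M


def makes_progress(path, key):
    """True if every hop is strictly closer to key than the one before."""
    dist = [clockwise(n, key) for n in path]
    return all(later < earlier for earlier, later in zip(dist, dist[1:]))


print(makes_progress([80, 5, 10], key=19))   # True: an honest route
print(makes_progress([80, 5, 99], key=19))   # False: the last hop moves away
```

A querier that sees a hop move away from the key can discard that hop and retry through another node.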
Slide 45: 7. Programming abstraction
- Blocks versus files
- Database queries (join, etc.)
- Mutable data (writers)
- Atomicity of DHT operations
Slide 46: Philosophical questions
- How decentralized should systems be?
- Gnutella versus content distribution network
- Have a bit of both? (e.g., OceanStore)
- Why does the distributed systems community have more problems with decentralized systems than the networking community?
- "A distributed system is a system in which a computer you don't know about renders your own computer unusable"
- Internet (BGP, NetNews)
Slide 47: What are we doing at MIT?
- Building a system based on Chord
- Applications: CFS, Herodotus, Melody, Backup store, R/W file system, ...
- Collaborate with other institutions
- P2P workshop
- Big ITR
- Building a large-scale testbed
- RON, PlanetLab
Slide 48: Summary
- Once we have DHTs, building large-scale distributed applications is easy
- Single, shared infrastructure for many applications
- Robust in the face of failures and attacks
- Scalable to large number of servers
- Self configuring across administrative domains
- Easy to program
- Let's build DHTs ... stay tuned ...