Title: Brahms
1. Brahms
- Byzantine-Resilient Random Membership Sampling
Bortnikov, Gurevich, Keidar, Kliot, and Shraer
2. Edward (Eddie) Bortnikov
Maxim (Max) Gurevich
Idit Keidar
Alexander (Alex) Shraer
Gabriel (Gabi) Kliot
3. Why Random Node Sampling
- Gossip partners
- Random choices make gossip protocols work
- Unstructured overlay networks
- E.g., among super-peers
- Random links provide robustness, expansion
- Gathering statistics
- Probe random nodes
- Choosing cache locations
4. The Setting
- Many nodes n
- 10,000s, 100,000s, 1,000,000s, ...
- Come and go
- Churn
- Every joining node knows some others
- Connectivity
- Full network
- Like the Internet
- Byzantine failures
5. Byzantine Fault Tolerance (BFT)
- Faulty nodes (portion f)
- Arbitrary behavior: bugs, intrusions, selfishness
- Choose f ids arbitrarily
- No CA, but no panacea for Sybil attacks
- May want to bias samples
- Isolate nodes, DoS nodes
- Promote themselves, bias statistics
6. Previous Work
- Benign gossip membership
- Small (logarithmic) views
- Robust to churn and benign failures
- Empirical study [Lpbcast, Scamp, Cyclon, PSS]
- Analytical study [Allavena et al.]
- Never proven uniform samples
- Spatial correlation among neighbors' views [PSS]
- Byzantine-resilient gossip
- Full views [MMR, MS, Fireflies, Drum, BAR]
- Small views, some resilience [SPSS]
- We are not aware of any analytical work
7. Our Contributions
- Gossip-based BFT membership
- Linear portion f of Byzantine failures
- O(n^{1/3})-size partial views
- Correct nodes remain connected
- Mathematically analyzed, validated in simulations
- Random sampling
- Novel memory-efficient approach
- Converges to proven independent uniform samples
The view is not all bad
Better than benign gossip
8. Brahms
- Sampling: local component
- Gossip: distributed component
[Diagram: the Gossip component maintains the view; the Sampler component maintains the sample]
9. Sampler Building Block
- Input: data stream, one element at a time
- Bias: some values appear more than others
- Used with stream of gossiped ids
- Output: uniform random sample
- of unique elements seen thus far
- Independent of other Samplers
- One element at a time (converging)
[Diagram: Sampler with input operation next and output sample]
10. Sampler Implementation
- Memory: stores one element at a time
- Use random hash function h
- From min-wise independent family [Broder et al.]
- For each set $X$ and all $x \in X$: $\Pr[\min\{h(X)\} = h(x)] = 1/|X|$
- init: choose random hash function
- next: keep id with smallest hash so far
- sample: return the id currently stored
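The init/next/sample interface on this slide could be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that keyed BLAKE2b with a fresh random key is an acceptable stand-in for a hash function drawn from a min-wise independent family, and the class and method names simply mirror the slide labels.

```python
import hashlib
import os
from typing import Optional


class Sampler:
    """Keeps only the id with the smallest hash seen so far."""

    def __init__(self) -> None:
        self.init()

    def init(self) -> None:
        # "Choose random hash function": draw a fresh random key.
        # Assumption: keyed BLAKE2b approximates a min-wise independent hash.
        self._key = os.urandom(16)
        self._best_id: Optional[str] = None
        self._best_hash: Optional[bytes] = None

    def _h(self, node_id: str) -> bytes:
        return hashlib.blake2b(
            node_id.encode("utf-8"), key=self._key, digest_size=16
        ).digest()

    def next(self, node_id: str) -> None:
        # "Keep id with smallest hash so far"; repeated ids change nothing.
        h = self._h(node_id)
        if self._best_hash is None or h < self._best_hash:
            self._best_id, self._best_hash = node_id, h

    def sample(self) -> Optional[str]:
        return self._best_id
```

Because the stored id depends only on the set of ids seen, not on how often each one appears, an id that is pushed a thousand times has no better chance of being kept than one seen once.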
11. Component S: Sampling and Validation
[Diagram: the id stream from gossip feeds, via init and next, an array of Samplers; each Sampler's sample is checked by a Validator using pings; together they form component S]
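One way to read this slide is as an array of independent samplers, each paired with a liveness check. The sketch below is an assumption-laden illustration: `SampleStore`, `update`, `validate`, and the `ping` callback are hypothetical names, the ping is modeled as a plain function returning True for reachable ids, and a failed ping is assumed to re-initialize the sampler.

```python
import hashlib
import os
from typing import Callable, Iterable, List, Optional


class Sampler:
    """Min-hash sampler (keyed BLAKE2b is an assumed stand-in for min-wise hashing)."""

    def __init__(self) -> None:
        self.init()

    def init(self) -> None:
        self._key = os.urandom(16)
        self._best: Optional[str] = None

    def _h(self, node_id: str) -> bytes:
        return hashlib.blake2b(node_id.encode(), key=self._key, digest_size=16).digest()

    def next(self, node_id: str) -> None:
        if self._best is None or self._h(node_id) < self._h(self._best):
            self._best = node_id

    def sample(self) -> Optional[str]:
        return self._best


class SampleStore:
    """Hypothetical stand-in for component S: samplers plus validators."""

    def __init__(self, size: int, ping: Callable[[str], bool]) -> None:
        self._samplers = [Sampler() for _ in range(size)]
        self._ping = ping  # assumed liveness probe supplied by the caller

    def update(self, ids: Iterable[str]) -> None:
        # Every gossiped id is offered to every sampler.
        for node_id in ids:
            for sampler in self._samplers:
                sampler.next(node_id)

    def validate(self) -> None:
        # A sampler whose current sample fails the ping starts over.
        for sampler in self._samplers:
            current = sampler.sample()
            if current is not None and not self._ping(current):
                sampler.init()

    def samples(self) -> List[str]:
        return [s.sample() for s in self._samplers if s.sample() is not None]
```

Each sampler draws its own key, so the samples are chosen independently of one another even though all samplers see the same stream.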
12. Gossip Process
- Provides the stream of ids for S
- Needs to ensure connectivity
- Use a bag of tricks to overcome attacks
13. Gossip-Based Membership Primer
- Small (sub-linear) local view V
- V constantly changes - essential due to churn
- Typically, evolves in (unsynchronized) rounds
- Push: send my id to some node in V
- Reinforce underrepresented nodes
- Pull: retrieve view from some node in V
- Spread knowledge within the network
- [Allavena et al. 05]: both are essential
- Low probability for partitions and star topologies
14. Brahms Gossip Rounds
- Each round:
- Send pushes, pulls to random nodes from V
- Wait to receive pulls, pushes
- Update S with all received ids
- (Sometimes) re-compute V
- Tricky! Beware of adversary attacks
15. Problem 1: Push Drowning
[Diagram: pushes from correct nodes (Alice, Bob, Carol, Dana, Ed) arrive alongside a flood of pushes from faulty nodes M (Mallory, MM, Malfoy)]
16. Trick 1: Rate-Limit Pushes
- Use limited messages to bound faulty pushes system-wide
- E.g., computational puzzles/virtual currency
- Faulty nodes can send portion p of them
- Views won't be all bad
17. Problem 2: Quick Isolation
"Ha! She's out! Now let's move on to the next guy!"
[Diagram: pushes from Alice, Bob, Carol, Dana, and Ed and from faulty nodes M (Mallory, MM, Malfoy); the attacked node is isolated]
18. Trick 2: Detection & Recovery
- Do not re-compute V in rounds when too many pushes are received
- Slows down isolation; does not prevent it
"Hey! I'm swamped! I better ignore all of 'em pushes"
[Diagram: a node receiving pushes from Bob, Mallory, MM, and Malfoy]
19. Trick 3: Balance Pulls & Pushes
- Control contribution of push (α|V| ids) versus contribution of pull (β|V| ids)
- Parameters α, β
- Pull-only ⇒ eventually all faulty ids
- Pull from faulty nodes: all faulty ids; from correct nodes: some faulty ids
- Push-only ⇒ quick isolation of attacked node
- Push ensures: system-wide, not all bad ids
- Pull slows down (does not prevent) isolation
20. Trick 4: History Samples
- Attacker influences both push and pull
- Feedback: γ|V| random ids from S
- Parameters: α + β + γ = 1
- Attacker loses control - samples are eventually perfectly uniform
"Yoo-hoo, is there any good process out there?"
21. View and Sample Maintenance
[Diagram: the view V is rebuilt from α|V| pushed ids, β|V| pulled ids, and γ|V| ids from the sample S; pushed and pulled ids also feed S]
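The view re-computation described across Tricks 2 to 4 might look roughly like the sketch below. It is illustrative only: `update_view` and `rand_subset` are hypothetical names, the "too many pushes" threshold is assumed to be α|V|, and rounding α|V|, β|V|, and γ|V| to integers is a simplification.

```python
import random
from typing import List, Optional, Sequence


def rand_subset(ids: Sequence[str], k: int, rng: random.Random) -> List[str]:
    """Pick up to k entries uniformly at random, without replacement."""
    pool = list(ids)
    return rng.sample(pool, min(k, len(pool)))


def update_view(
    view: Sequence[str],
    pushed: Sequence[str],
    pulled: Sequence[str],
    history: Sequence[str],
    alpha: float,
    beta: float,
    gamma: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Rebuild the view from pushed ids, pulled ids, and history samples."""
    rng = rng or random.Random()
    size = len(view)
    n_push = round(alpha * size)
    n_pull = round(beta * size)
    n_hist = round(gamma * size)

    # Trick 2: when swamped with pushes, keep the old view for this round.
    # Assumption: "too many" means more than alpha * |V| pushes.
    if len(pushed) > n_push:
        return list(view)

    # Tricks 3 and 4: bounded contributions from push, pull, and history.
    return (
        rand_subset(pushed, n_push, rng)
        + rand_subset(pulled, n_pull, rng)
        + rand_subset(history, n_hist, rng)
    )
```

In this sketch the new view can be smaller than the old one when fewer than α|V| pushes arrive; a fuller implementation would have to decide how to handle that case.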
22. Key Property
- Samples take time to help
- Assume attack starts when samples are empty
- With appropriate parameters
- E.g.,
- Time to isolation > time to convergence
Prove lower bound using tricks 1, 2, 3 (not using samples yet)
Prove upper bound until some good sample persists forever
Self-healing from partitions
23. History Samples: Rationale
- Judicious use essential
- Bootstrap, avoid slow convergence
- Deal with churn
- With a little bit of history samples (10%) we can cope with any adversary
- Amplification!
24. Analysis
- Sampling - mathematical analysis
- Connectivity - analysis and simulation
- Full system simulation
25. Connectivity ⇒ Sampling
- Theorem: If overlay remains connected indefinitely, samples are eventually uniform
26. Sampling ⇒ Connectivity Ever After
- Perfect sample of a sampler with hash h: the id with the lowest h(id) system-wide
- If correct, sticks once the sampler sees it
- Correct perfect sample ⇒ self-healing from partitions ever after
- We analyze PSP(t): probability of perfect sample at time t
27. Convergence to 1st Perfect Sample
- n = 1000
- f = 0.2
- 40 unique ids in stream
28. Scalability
- Analysis says
- For scalability, want small and constant convergence time
- independent of system size, e.g., when
29. Connectivity Analysis 1: Balanced Attacks
- Attack all nodes the same
- Maximizes faulty ids in views system-wide
- in any single round
- If repeated, system converges to fixed point ratio of faulty ids in views, which is < 1 if
- γ = 0 (no history) and p < 1/3, or
- History samples are used, any p
There are always good ids in views!
30. Fixed Point Analysis: Push
[Diagram: local views of node 1 and node i at times t and t+1; legend: push, lost push, push from faulty node]
x(t): portion of faulty nodes in views at round t
Portion of faulty pushes to correct nodes: $\frac{p}{p + (1 - p)(1 - x(t))}$
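31. Fixed Point Analysis: Pull
[Diagram: local views of node 1 and node i at times t and t+1; a pull from i reaches a faulty node with probability x(t); legend: pull from faulty]
$E[x(t+1)] = \alpha \cdot \frac{p}{p + (1 - p)(1 - x(t))} + \beta \cdot \big( x(t) + (1 - x(t)) \cdot x(t) \big) + \gamma \cdot f$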
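The recurrence on these two slides can be iterated numerically to see where the portion of faulty ids settles. The sketch below assumes the recurrence as read from the slides (a push term weighted by α, a pull term weighted by β, and a history term γ·f that treats history samples as perfectly uniform); the function names and the choice to start from x(0) = f are illustrative assumptions.

```python
from typing import Optional


def next_x(x: float, p: float, f: float,
           alpha: float, beta: float, gamma: float) -> float:
    """One step of the expected portion of faulty ids in correct views."""
    # Push: faulty pushes make up p of all pushes; correct pushes that
    # land on faulty nodes (probability x) are lost.
    push = p / (p + (1 - p) * (1 - x))
    # Pull: a faulty target (probability x) returns only faulty ids;
    # a correct target returns a portion x of faulty ids.
    pull = x + (1 - x) * x
    # History: assumed perfectly uniform, so a portion f is faulty.
    return alpha * push + beta * pull + gamma * f


def fixed_point(p: float, f: float, alpha: float, beta: float, gamma: float,
                x0: Optional[float] = None, rounds: int = 500) -> float:
    """Iterate the recurrence from x0 (default: f) for a fixed number of rounds."""
    x = f if x0 is None else x0
    for _ in range(rounds):
        x = next_x(x, p, f, alpha, beta, gamma)
    return x


if __name__ == "__main__":
    print(fixed_point(p=0.2, f=0.2, alpha=0.5, beta=0.5, gamma=0.0))
    print(fixed_point(p=0.5, f=0.2, alpha=0.5, beta=0.5, gamma=0.0))
    print(fixed_point(p=0.5, f=0.2, alpha=0.45, beta=0.45, gamma=0.1))
```

With α = β = 0.5 and γ = 0, the iteration settles near 0.64 for p = 0.2 but climbs to 1 for p = 0.5, which matches the p < 1/3 condition stated for the no-history case. With γ = 0.1 the value stays below 1 even for p = 0.5, since the history term caps the sum at α + β + γ·f.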
32. Faulty Ids in Fixed Point
Assumed perfect in analysis, real history in simulations
With a few history samples, any portion of bad nodes can be tolerated
Perfectly validated fixed points and convergence
33. Convergence to Fixed Point
34. Connectivity Analysis 2: Targeted Attack
Roadmap
- Step 1: analysis without history samples
- Isolation in logarithmic time
- but not too fast, thanks to tricks 1, 2, 3
- Step 2: analysis of history sample convergence
- Time-to-perfect-sample < Time-to-Isolation
- Step 3: putting it all together
- Empirical evaluation
- No isolation happens
35. Targeted Attack: Step 1
- Q: How fast (lower bound) can an attacker isolate one node from the rest?
- Worst-case assumptions
- No use of history samples (γ = 0)
- Unrealistically strong adversary
- Observes the exact number of correct pushes and complements it to α|V|
- Attacked node not represented initially
- Balanced attack on the rest of the system
36. Isolation w/out History Samples
Isolation time for |V| = 60
Depends on α, β, p
37. Step 2: Sample Convergence
- n = 1000
- p = 0.2
- α = β = 0.5, γ = 0
- 40 unique ids
Perfect sample in 2-3 rounds
Empirically verified
38. Step 3: Putting It All Together - No Isolation with History Samples
Works well despite small PSP
39. Sample Convergence (Balanced)
Convergence twice as fast with
40. Summary
- O(n^{1/3})-size views
- Resist Byzantine failures of linear portion
- Converge to proven uniform samples
- Precise analysis of impact of failures
41. Balanced Attack Analysis (1)
- Assume (roughly) equal initial node degrees
- x(t): portion of faulty ids in correct node views at time t
- Compute E[x(t+1)] as function of x(t), p, α, β, γ
- Result 1: Short-term Optimality
- Any non-balanced schedule imposes a smaller x(t) in a single round
42. Balanced Attack Analysis (2)
- Result 2: Existence of Fixed Point X
- E[x(t+1)] = x(t) = X
- Analyze X (function of p, α, β, γ)
- Conditions for uniqueness
- For α = β = 0.5, p < 1/3, there exists X < 1
- The view is not entirely poisoned; history samples are not essential
- Result 3: Convergence to fixed point
- From any initial portion < 1 of faulty ids
- From [Hillam 1975] (sequence convergence)