Title: Byzantine Techniques II
1 Byzantine Techniques II
- Justin W. Hart
- CS 614
- 12/01/2005
2 Papers
- BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer et al. (SOSP 2005)
- Fault-Scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et al. (SOSP 2005)
3 BAR Fault Tolerance for Cooperative Services
- BAR Model
- General Three-Level Architecture
- BAR-B
4 Motivation
- General approach to constructing cooperative
services that span multiple administrative
domains (MADs)
5 Why is this difficult?
- Nodes are under the control of multiple administrators
- Broken: Byzantine behaviors
  - Misconfigured, or configured with malicious intent
- Selfish: rational behaviors
  - Alter the protocol to increase local utility
6 Other models?
- Byzantine models account for Byzantine behavior, but do not handle rational behavior
- Rational models account for rational behavior, but may break under Byzantine behavior
7 BAR Model
- Byzantine
  - Behave arbitrarily or maliciously
- Altruistic
  - Execute the proposed program, whether it benefits them or not
- Rational
  - Deviate from the proposed program for purposes of local benefit
8 BART: BAR Tolerant
- It's a cruel world
- At most (n-2)/3 nodes in the system are Byzantine
- The rest are rational
9 Two classes of protocols
- Incentive-Compatible Byzantine Fault Tolerant (IC-BFT)
  - Guarantees a set of safety and liveness properties
  - It is in the best interest of rational nodes to follow the protocol exactly
- Byzantine Altruistic Rational Tolerant (BART)
  - Guarantees a set of safety and liveness properties despite the presence of rational nodes
- IC-BFT is a subset of BART
10 An important concept
- It isn't enough for a protocol to survive drills of a handful of attacks. It must provably provide its guarantees.
11 A flavor of things to come
- The protocol builds on Practical Byzantine Fault Tolerance in order to combat Byzantine behavior
- The protocol uses game-theoretic concepts in order to combat rational behavior
12 A taste of Nash Equilibrium
- The game of chicken, with payoffs (row player, column player):

               Swerve       Go Straight
  Swerve       0, 0         -1, 1
  Go Straight  1, -1        -100, -100
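To make the equilibrium idea concrete, here is a minimal sketch (not from the paper) that brute-forces the pure-strategy Nash equilibria of the game above; the payoffs are taken straight from the table.

```python
# Brute-force pure-strategy Nash equilibrium check for chicken.
from itertools import product

# payoffs[(row, col)] = (row player's utility, column player's utility)
payoffs = {
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("swerve", "swerve"): (0, 0),
    ("straight", "straight"): (-100, -100),
}
moves = ["swerve", "straight"]

def is_nash(row, col):
    """Neither player can gain by unilaterally deviating."""
    u_row, u_col = payoffs[(row, col)]
    row_ok = all(payoffs[(r, col)][0] <= u_row for r in moves)
    col_ok = all(payoffs[(row, c)][1] <= u_col for c in moves)
    return row_ok and col_ok

print([p for p in product(moves, moves) if is_nash(*p)])
# -> [('swerve', 'straight'), ('straight', 'swerve')]
```

Both pure equilibria have exactly one driver swerving; BAR's aim (see slide 15) is to engineer the protocol so that following it is such an equilibrium for every rational node.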
13 ...and the nodes are starving!
- Nodes require access to a state machine in order to complete their objectives
- The protocol contains methods for punishing rational nodes, including denying them access to the state machine
14 An expensive notion of identity
- Identity is established through cryptographic keys assigned by a trusted authority
- Prevents Sybil attacks
- Bounds the number of Byzantine nodes
- Gives rational nodes reason to consider the long-term consequences of their actions
- Gives real-world grounding to identity
15 Assumptions about rational nodes
- Receive long-term benefit from staying in the protocol
- Conservative when computing the impact of Byzantine nodes on their utility
- If the protocol provides a Nash equilibrium, then all rational nodes will follow it
- Rational nodes do not collude; colluding nodes are classified as Byzantine
16 Byzantine nodes
- Byzantine fault model
- Strong adversary
- Adversary can coordinate collusion attacks
17 Important concepts
- Promptness principle
- Proof of Misbehavior (POM)
- Cost balancing
18 Promptness principle
- If a rational node gains no benefit from delaying a message, it will send it as soon as possible
19 Proof of Misbehavior (POM)
- Self-contained, cryptographic proof of wrongdoing
- Provides accountability to nodes for their actions
20 Example of a POM
- Node A requests that Node B store a chunk
- Node B replies that it has stored the chunk
- Later, Node A requests that chunk back
- Node B sends back random garbage (it hadn't stored the chunk) and a signature
- Because Node A stored a hash of the chunk, it can demonstrate misbehavior on the part of Node B (see the sketch below)
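A hypothetical sketch of this hash-based POM, assuming SHA-256; the function names and message shapes below are illustrative, not the paper's wire format.

```python
import hashlib

def store_chunk(chunk: bytes) -> str:
    """Node A keeps only the hash of the chunk it handed to B."""
    return hashlib.sha256(chunk).hexdigest()

def check_retrieval(expected_hash: str, returned: bytes, signature: str):
    """If B's signed response doesn't hash to what A stored, the signed
    response itself is a self-contained Proof of Misbehavior."""
    if hashlib.sha256(returned).hexdigest() != expected_hash:
        return {"pom": True, "evidence": (returned, signature)}
    return {"pom": False}

h = store_chunk(b"my backup chunk")
print(check_retrieval(h, b"random garbage", "sig_B"))  # B is caught
```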
21 ...but it's a bit more complicated than that!
- This corresponds to a rather simple behavior to combat: aggressively Byzantine behavior.
22 Passive-aggressive behaviors
- Harder cases than aggressively Byzantine behavior
- A malicious Node A could merely lie about misbehavior on the part of Node B
- A node could exploit non-determinism in order to shirk work
23 Cost Balancing
- If two behaviors have the same cost, there is no reason to choose the wrong one
24 Three-Level Architecture
25 Level 1
- Unilaterally deny service to nodes that fail to deliver messages
  - Tit-for-tat
- Balance costs
  - No incentive to make the wrong choice
- Penance
  - Unilaterally impose extra work on nodes with untimely responses
26 Level 2
- Failure to respond to a request by a state machine will generate a POM from a quorum of nodes in the state machine
27 Level 3
- Makes use of reliable work assignment
- Needs only to provide sufficient information to identify valid request/response pairs
28 Nuts and Bolts
29 Level 1
- Ensure long-term benefit to participants
  - The RSM rotates the leadership role among participants
  - Participants want to stay in the system in order to control the RSM and complete their protocols
- Limit non-determinism
  - Self-interested nodes could hide behind non-determinism to shirk work
  - Use Terminating Reliable Broadcast (TRB), rather than consensus
  - In TRB, only the sender can propose a value
  - Other nodes can only adopt this value, or choose a default value (see the sketch below)
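A toy sketch of the receiver-side rule just described, under the assumption that a slow or missing proposal resolves to a fixed default; the names are illustrative, not the paper's.

```python
# TRB receiver: adopt the sender's value or fall back to the default.
# Receivers never propose their own value, which removes the
# consensus-style choice a self-interested node could hide behind.
DEFAULT = "sender-failed"

def trb_decide(proposal_from_sender, timed_out: bool):
    """Adopt the sender's value, or the default if the sender is slow."""
    if timed_out or proposal_from_sender is None:
        return DEFAULT
    return proposal_from_sender

print(trb_decide("op: store(chunk)", timed_out=False))  # adopt
print(trb_decide(None, timed_out=True))                 # default value
```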
30 Level 1
- Mitigate the effects of residual non-determinism
  - Cost balancing: the protocol's preferred choice is no more expensive than any other
- Encourage timeliness
  - Nodes can inflict sanctions for untimely messages
- Enforce predictable communication patterns
  - Nodes must have participated at every step in order to have the opportunity to issue a command
31 Terminating Reliable Broadcast
32 3f+2 nodes, rather than 3f+1
- Suppose a sender s is slow
- The same group of nodes now wants to determine that s is slow
- A new leader is elected
- Every node but s wants a timely conclusion, in order to get its turn to propose a value to the state machine
- s is not allowed to participate in this quorum
33 TRB provides a few guarantees
- They differ between periods of synchrony and periods of asynchrony
34 In synchrony
- Termination
  - Every non-Byzantine process delivers exactly one message
- Agreement
  - If one non-Byzantine process delivers a message m, then all non-Byzantine processes eventually deliver m
35 In asynchrony
- Integrity
  - If a non-Byzantine process delivers m, then the sender sent m
- Non-Triviality
  - If the sender is non-Byzantine and sends m, then the sender eventually delivers m
36 Message Queue
- Enforces predictable communication patterns
- Bubbles: a simple retaliation policy
  - Node A's message queue is filled with messages that it intends to send to Node B
  - This message queue is interleaved with bubbles
  - Bubbles contain predicates indicating messages expected from B
  - No message except one matching the predicate can fill the bubble
  - No messages in A's queue will go to B until B fills the bubble (see the sketch below)
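An illustrative sketch of a bubble-interleaved queue; the data structures and the predicate shape are assumptions, not the paper's implementation.

```python
from collections import deque

# A's outgoing queue to B, interleaved with a bubble whose predicate
# describes the reply expected from B.
queue = deque([
    ("msg", "m1"),
    ("bubble", lambda m: m.startswith("ack")),
    ("msg", "m2"),
])

def next_for_b():
    """Next message for B, or None while an unfilled bubble is at the head."""
    if queue and queue[0][0] == "msg":
        return queue.popleft()[1]
    return None  # head is a bubble: nothing goes to B

def fill_bubble(reply: str) -> bool:
    """Only a reply satisfying the bubble's predicate can fill it."""
    if queue and queue[0][0] == "bubble" and queue[0][1](reply):
        queue.popleft()
        return True
    return False

print(next_for_b())          # 'm1'
print(next_for_b())          # None -- waiting on B
print(fill_bubble("nack"))   # False -- wrong message can't fill it
print(fill_bubble("ack m1")) # True
print(next_for_b())          # 'm2'
```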
37 Balanced Messages
- We've already discussed this quite a bit
- We assure this at this level of the protocol
- This is where we get our gigantic timeout message
38 Penance
- Untimely vector
  - Tracks a node's perception of the responsiveness of other nodes
  - When a node becomes the sender, it includes its untimely vector with its message
39 Penance
- All nodes but the sender receive penance messages from each node
- Because of bubbles, each untimely node must send a penance message back in order to continue using the system
  - This imposes a penalty on those nodes
- The sender is excluded from this process, because it may be motivated to lie in its untimely vector in order to avoid the work of transmitting penance messages
40 Timeouts and Garbage Collection
- Set-turn timeout
  - Timeout to take leadership away from the sender
  - Initially 10 seconds in this implementation, in order to overcome all expected network delays
  - Can only be changed by the sender
- Max_response_time
  - Time at which a node is removed from the system, its messages discarded, and its resources garbage-collected
  - Set to 1 week or 1 month in the prototypes
41 Global Punishment
- Badlists
  - Transform local suspicion into POMs
  - Suspicion is recorded in a local node's badlist
  - The sender includes its badlist with its message
  - If, over time, recipients see a node in f+1 different senders' badlists, then they, too, consider that node to be faulty (see the sketch below)
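A sketch of the badlist aggregation rule with an assumed in-memory representation: with at most f Byzantine nodes, f+1 distinct accusers guarantee at least one non-Byzantine accuser.

```python
f = 2  # tolerated Byzantine nodes (illustrative value)

seen_in_badlists = {}  # suspect -> set of senders who listed it

def receive_badlist(sender, badlist):
    """Record a sender's badlist; return nodes now globally faulty."""
    faulty = []
    for suspect in badlist:
        seen_in_badlists.setdefault(suspect, set()).add(sender)
        if len(seen_in_badlists[suspect]) >= f + 1:
            faulty.append(suspect)  # f+1 distinct accusers reached
    return faulty

receive_badlist("s1", ["nodeX"])
receive_badlist("s2", ["nodeX"])
print(receive_badlist("s3", ["nodeX"]))  # ['nodeX'] with f = 2
```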
42 Proof
- Full proofs do not appear in this paper; they appear in the technical report
43 ...but here's a bit
- Theorem 1: The TRB protocol satisfies Termination, Agreement, Integrity, and Non-Triviality
44 ...and a bit more
- Theorem 2: No node has a unilateral incentive to deviate from the protocol
- Lemma 1: No rational node r benefits from delaying sending the set-turn message
  - Follows from penance
- Lemma 2: No rational node r benefits from sending the set-turn message early
  - Sending early could result in a senderTO being sent (this protocol uses synchronized clocks, and all messages are cryptographically signed)
45 ...and the rest that's mentioned in the paper
- Lemma 3: No rational node r benefits from sending a malformed set-turn message
  - The set-turn message contains only the turn number. Because of this, a malformed message reduces to either sending early (dealt with in Lemma 2) or sending late (dealt with in Lemma 1)
46 Level 2
- State machine replication is sufficient to support a backup service, but the overhead is unacceptable
  - With 100 participants each backing up 100 MB, every node would need 10 GB of drive space
- Assign work to individual nodes, using erasure codes to provide low-overhead fault-tolerant storage
47 Guaranteed Response
- Direct communication is insufficient when nodes can behave rationally
- We introduce a witness that overhears the conversation
  - This eliminates ambiguity
- Messages are routed through this intermediary
48 Guaranteed Response
49 Guaranteed Response
- Node A sends a request to Node B through the witness
- The witness stores the request and enters the RequestReceived state
- Node B sends a response to Node A through the witness
- The witness stores the response and enters the ResponseReceived state (see the sketch below)
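A hedged sketch of the witness's state transitions for a single exchange; the state names follow the slide, everything else is illustrative.

```python
class Witness:
    """Stores each message of a request/response exchange it relays."""
    def __init__(self):
        self.state = "Idle"
        self.stored = {}

    def on_request(self, src, dst, request):
        self.stored["request"] = (src, dst, request)
        self.state = "RequestReceived"   # witness saw A's request

    def on_response(self, src, dst, response):
        assert self.state == "RequestReceived"
        self.stored["response"] = (src, dst, response)
        self.state = "ResponseReceived"  # B answered through the witness

w = Witness()
w.on_request("A", "B", "store chunk 42")
w.on_response("B", "A", "receipt")
print(w.state)  # ResponseReceived
```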
50 Guaranteed Response
- Deviation from this protocol will cause the witness to notice either a timeout from Node B or lying on the part of Node A
51 Implementation
- The system must remain incentive-compatible
- Communication with the witness node is not in the form of actual message sending; it is in the form of a command to the RSM
- Theorem 3: If the witness node enters the RequestReceived state for some work w assigned to rational node b, then b will execute w
  - Holds if sufficient sanctions exist to motivate b to do this
52 State limiting
- State is limited by limiting the number of slots (nodes with which a node can communicate) available to a node
- Applies a limit to the memory overhead
- Limits the rate at which requests are inserted into the system
- Forces nodes to acknowledge responses to requests
  - Nodes want their slots back
53 Optimization through Credible Threats
54 Optimization through Credible Threats
- Returns to game theory
- The protocol is optimized so nodes can communicate directly: add a fast path
- Nodes register vows with the witness
  - If the recipient does not respond, nodes proceed to the unoptimized case
- Analogous to a driver in chicken throwing their steering wheel out the window
55 Periodic Work Protocol
- The witness checks that periodic tasks, such as system maintenance, are performed
- It is expected that, with a certain frequency, each node in the system will perform such a task
- Failure to perform one will generate a POM from the witness
56 Authoritative Time Service
- Maintains authoritative time
- Binds messages sent to that time
- The guaranteed response protocol relies on this for generating NoResponses
57 Authoritative Time Service
- Each submission to the state machine contains the timestamp of the proposer
- The authoritative timestamp is taken to be the maximum of the previous authoritative time and the median of the timestamps of the previous f+1 decisions (see the sketch below)
- If no decision is decided, then the timestamp is the previous authoritative time
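A hedged sketch of one plausible reading of this rule; the exact formula is in the paper, and the median-of-f+1 with monotonicity shown here is a reconstruction from the bullets above.

```python
import statistics

def authoritative_time(prev_auth_time, recent_timestamps, f):
    """recent_timestamps: proposer timestamps of the previous f+1
    decisions; with no new decision, time stays where it was."""
    if len(recent_timestamps) < f + 1:
        return prev_auth_time
    med = statistics.median(recent_timestamps[-(f + 1):])
    return max(prev_auth_time, med)  # monotone: never moves backwards

print(authoritative_time(100, [98, 105, 103], f=2))  # 103
```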
58 Level 3: BAR-B
- BAR-B is a cooperative backup system
- Three operations
  - Store
  - Retrieve
  - Audit
59 Storage
- Nodes break files up into chunks
- Chunks are encrypted
- Chunks are stored on remote nodes
- Remote nodes send signed receipts and store
StoreInfos
60 Retrieval
- A node storing a chunk can respond to a request for that chunk with:
  - The chunk
  - A demonstration that the chunk's lease has expired
  - A more recent StoreInfo
61 Auditing
- Receipts constitute audit records
- Nodes will exchange receipts in order to verify
compliance with storage quotas
62 Erasure Coding
- Erasure coding is used to keep storage size reasonable
  - Storing 1 GB requires 1.3 GB of raw space
- Keeping this ratio reasonable is crucial to motivate self-interested nodes to participate
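A back-of-the-envelope check of the 1.3x figure in terms of a generic (k, n) erasure code, where an object is split into k fragments, n fragments are stored, and any k suffice to reconstruct; the concrete k = 10, n = 13 are illustrative assumptions, not the paper's parameters.

```python
def storage_overhead(k: int, n: int) -> float:
    """Each fragment is 1/k of the object and n fragments are stored,
    so the raw-space-to-data ratio is n/k."""
    return n / k

print(storage_overhead(k=10, n=13))  # 1.3 -> 1 GB of data uses 1.3 GB
```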
63 Request-Response pattern
64 Retrieve
- The originator sends a Receipt for the StoreInfo to be retrieved
- The storage node can send:
  - A RetrieveConfirm, containing the data and the receipt
  - A RetrieveDeny, containing a receipt and a proof of why
  - Anything else generates a POM
65 Store
- The originator sends a StoreInfo to be stored
- The storage node can send:
  - A receipt
  - A StoreReject, demonstrating that the node has reached its storage commitment
  - Anything else generates a POM
66 Audit
- Three phases
  - The auditor requests both the OwnList and the StoreList from the auditee
    - Does this for random nodes in the system
  - The lists are checked for inconsistencies
  - Inconsistencies result in a POM
67 Time constraints
- Data is stored for 30 days
  - After this, it is garbage-collected
- Nodes must renew their leases on stored chunks prior to this expiration in order to keep them in the system
68 Sanctions
- The periodic work protocol forces generation of POMs or special NoPOMs
- POMs and NoPOMs are balanced
- POMs evict nodes from the system
69 Recovery
- Nodes must be able to recover after failures
- Chained membership certificates are used to allow them to retrieve their old chunks
- Use of a certificate later in the chain is regarded as a new node entering the system
  - The old node is regarded as dead
  - The new node is allowed to view the old node's chunks
70 Recovery
- This forces other nodes to redistribute the chunks they had stored on the dead node
- The length of chains is limited, in order to prevent nodes from shirking work by using a certificate later in the chain
71 Guarantees
- Data on BAR-B can be retrieved within the lease period
- No POM can be gathered against a node that does not deviate from the protocol
- No node can store more than its quota
- A time window is available for nodes with catastrophic failures to recover
72 Evaluation
- Performance is inferior to protocols that do not make these guarantees, but acceptable
73 Impact of additional nodes
74 Impact of rotating leadership
75 Impact of fast path optimization
76 Fault-Scalable Byzantine Fault-Tolerant Services
- Query/Update (Q/U) protocol
- An optimistic quorum-based protocol
- Better throughput and fault-scalability than replicated state machines
- Introduces the preferred quorum as an optimization on quorum protocols
77 Motivation
- Compelling need for services and distributed data structures to be efficient and fault-tolerant
- In Byzantine fault-tolerant systems, performance drops off sharply as more faults are tolerated
78 Fault Scalability
- A fault-scalable service is one in which
performance degrades gracefully as more server
faults are tolerated
79 Operations-based interface
- Provides an interface similar to RSMs
- Exports interfaces comprised of deterministic methods
- Queries
  - Do not modify data
- Updates
  - Modify data
- Multi-object updates
  - Allow a set of objects to be updated together
80 Properties
- Operates correctly under an asynchronous model
- Queries and updates are strictly serializable
- In benign executions, they are obstruction-free
- The cost is an increase in the number of required servers: 5b+1 servers, rather than 3b+1
81 Optimism
- Servers store a version history of objects
- Updates are non-destructive to the objects
- Logical timestamps are based on the contents of the update and the object state upon which the update is conditioned
82 Speedups
- Preferred quorums, rather than random quorums
  - Addressed later
- Efficient cryptographic techniques
  - Addressed later
83 Efficiency and Scalability
84 Efficiency
- Most failure-atomic protocols require at least a two-phase commit
  - Prepare
  - Commit
- The optimistic approach does not need a prepare phase
  - This introduces the need for clients to repair inconsistent objects
- The optimistic approach also obviates the need for locking!
85 Versioning Servers
- To allow for this, versioning servers are employed
- Each update creates a new version on the server
- Updates contain information about the version to be updated
  - If no update has been committed since that version, the update goes through unimpeded
86 Throughput-scalability
- Additional servers, beyond those necessary to
provide the desired fault tolerance, can provide
additional throughput
87 Scaleup pitfall?
- Encourage the use of fine-grained objects, which reduce per-object contention
- If the majority of accesses touch individual objects, or only a few objects, then the scaleup pitfall can be avoided
  - In the example applications, this holds
88 No need to partition
- Other systems achieve throughput-scalability by
partitioning services - This is unnecessary in this system
89 The Query/Update Protocol
90 System model
- Asynchronous timing
- Clients and servers may be Byzantine faulty
- Clients and servers are assumed to be computationally bounded, assuring the effectiveness of cryptography
- The failure model is a hybrid failure model
  - Benign
  - Malevolent
  - Faulty
91 System model
- Extends the definition of a "fail-prone system" given by Malkhi and Reiter
92 System model
- Point-to-point authenticated channels exist between all clients and servers
  - Infrastructure deploying symmetric keys on all channels
- Channels are assumed unreliable
  - ...but, of course, they can be made reliable
93 Overview
- Clients update objects by issuing requests stamped with object versions to version servers
- Version servers evaluate these requests
  - If the request is over an out-of-date version, the client's version is corrected and the request reissued
  - If an out-of-date server is required to reach a quorum, it retrieves an object history from a group of other servers
  - If the version matches the server's version, of course, the request is executed
- Everything else is a variation upon this theme
94 Overview
- Queries are read-only methods
- Updates modify an object
- The exported methods take arguments and return answers
- Clients perform operations by issuing requests to a quorum
- A server receives a request; if it accepts it, it invokes a method
- Each update creates a new object version
95 Overview
- Each object version is kept with its logical timestamp in a version history called the replica history
- Servers return replica histories in response to requests
- Clients store replica histories in their object history set (OHS), an array of replica histories indexed by server (see the sketch below)
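An assumed shape for the client-side bookkeeping just described; the field names and values are illustrative, not the Q/U data structures.

```python
# Each server returns its replica history: (object version, logical
# timestamp) pairs. The client indexes these histories by server.
replica_histories = {
    "s1": [("v1", 1), ("v2", 2)],
    "s2": [("v1", 1), ("v2", 2)],
    "s3": [("v1", 1)],            # s3 is behind
}

def object_history_set(responses):
    """Client's OHS: replica histories indexed by server."""
    return dict(responses)

def latest_candidates(ohs):
    """Timestamps seen in the histories are candidates for the next op."""
    return {sid: hist[-1] for sid, hist in ohs.items()}

print(latest_candidates(object_history_set(replica_histories.items())))
# {'s1': ('v2', 2), 's2': ('v2', 2), 's3': ('v1', 1)}
```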
96 Overview
- Timestamps in these histories are candidates for
future operations - Candidates are classified in order to determine
which object version a method should be executed
upon
97 Overview
- In non-optimistic operation, a client may need to perform a repair
  - Addressed later
- To perform an operation, a client first retrieves an object history set. The client's operation is conditioned on this set, which is transmitted with the operation.
98 Overview
- The client sends this operation to a quorum of servers
- To promote efficiency, the client sends the request to a preferred quorum
  - Addressed later
- Single-phase operation hinges on the availability of a preferred quorum, and on concurrency-free access
99 Overview
- Before executing a request, servers first validate its integrity
- This is important: servers do not communicate object histories directly to each other, so the client's data must be validated
- Servers use authenticators to do this: lists of HMACs that prevent malevolent nodes from fabricating replica histories (see the sketch below)
- Servers cull replica histories that they cannot validate from the conditioned-on OHS
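An illustrative sketch of an authenticator as a per-server list of HMACs over a replica history; key distribution, encoding, and names are assumptions, not the Q/U wire format.

```python
import hashlib
import hmac

# Shared symmetric keys between the client and each server (assumed).
server_keys = {"s1": b"k1", "s2": b"k2", "s3": b"k3"}

def make_authenticator(replica_history: bytes):
    """One HMAC per server, each keyed with that server's shared key."""
    return {sid: hmac.new(key, replica_history, hashlib.sha256).hexdigest()
            for sid, key in server_keys.items()}

def validate(server_id, replica_history, authenticator):
    """A benign server can verify its own entry, so a malevolent node
    cannot fabricate a replica history that passes validation."""
    expected = hmac.new(server_keys[server_id], replica_history,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, authenticator[server_id])

auth = make_authenticator(b"history-v5")
print(validate("s2", b"history-v5", auth))  # True
print(validate("s2", b"forged", auth))      # False -> culled from OHS
```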
100 Overview, the last bit
- Servers validate that they do not have a higher timestamp in their local replica histories
- Failing this, the client repairs
- Passing this, the method is executed and the new timestamp created
- Timestamps are crafted such that they always increase in value
101 Preferred Quorums
- Traditional quorum systems use random quorums to distribute load, but this means that servers frequently need to be synced
- Preferred quorums access the servers with the most up-to-date data, assuring that syncs happen less often
102 Preferred Quorums
- If a preferred quorum cannot be met, clients probe for additional servers to add to the quorum
- Authenticators make it impossible to forge object histories for benign servers
- The new host syncs with b+1 host servers, in order to validate that the data is correct
- In the prototype, probing selects servers such that load is distributed, using a method parameterized on object ID and server ID
103 Concurrency and Repair
- Concurrent access to an object may fail
- Two operations
  - Barrier
    - Barrier candidates have no data associated with them, and so are safe to select during periods of contention
    - A barrier advances the logical clock so as to prevent earlier timestamps from completing
  - Copy
    - Copies the latest object data past the barrier, so it can be acted upon
104 Concurrency and Repair
- Clients may repeatedly barrier each other; to combat this, an exponential backoff strategy is enforced (see the sketch below)
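A minimal sketch of randomized exponential backoff of the kind described; the base delay, cap, and attempt count are assumptions, not the paper's parameters.

```python
import random

def backoff_delays(base=0.05, cap=2.0, attempts=6):
    """Randomized exponential backoff: the delay bound doubles per
    retry, capped, with jitter to break barrier/barrier races."""
    return [random.uniform(0, min(cap, base * 2 ** i))
            for i in range(attempts)]

for delay in backoff_delays():
    # time.sleep(delay)  # wait before retrying the barrier/copy repair
    print(f"retrying after {delay:.3f}s")
```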
105 Classification and Constraints
- Based on partial observations of the global system state, an operation may be:
  - Complete
  - Repairable
    - Can be repaired using the barrier and copy strategy
  - Incomplete
106 Multi-Object Updates
- In this case, servers lock their local copies; if they approve the OHS, the update goes through
- If not, a multi-object repair protocol is run
- In this case, repair depends on the ability to establish all objects in the set
- Objects in the set are only repairable if all are repairable; otherwise, objects in the set that would individually be repairable are reclassified as incomplete
107 An example of all of this
108 Implementation details
109 Cached object history set
- Clients cache object history sets during execution, and execute updates without first querying
- If the request fails based on an out-of-date OHS, the server returns an up-to-date OHS with the failure
110 Optimistic query execution
- If a client has not accessed an object recently, it is still possible to complete in a single phase
- Servers execute the query on the latest object version that they store; clients then evaluate the result normally
111 Inline repair
- Does not require a barrier and copy
- Repairs the candidate in place, obviating the need for a round trip
- Only possible in cases where there is no contention
112 Handling repeated requests
- Mechanisms may cause requests to be repeated
- In order to shortcut other checks, the timestamp is checked first
113 Retry and backoff policies
- Update-update contention requires retry, with backoff to avoid livelock
- Update-query contention does not; the query can be repaired in place
114 Object syncing
- Only one server needs to send the entire object version state
- The others send hashes
- The syncing server then calculates the hash and compares it against all the others (see the sketch below)
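A sketch of this syncing rule, assuming SHA-256; in the real protocol the hashes arrive from distinct servers rather than being computed locally as they are here.

```python
import hashlib

def sync_object(full_copy: bytes, peer_hashes: list[str]) -> bytes:
    """One server ships the full object version; the rest ship hashes.
    Accept the full copy only if it hashes to what everyone reported."""
    digest = hashlib.sha256(full_copy).hexdigest()
    if all(digest == h for h in peer_hashes):
        return full_copy  # verified against all other servers
    raise ValueError("full copy disagrees with peer hashes")

obj = b"object version 7"
hashes = [hashlib.sha256(obj).hexdigest()] * 3  # stand-ins for peers
print(sync_object(obj, hashes) == obj)  # True
```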
115 Other speedups
- Authenticators
  - Authenticators use HMACs rather than digital signatures
- Compact timestamps
  - Collision-resistant hashes are used in timestamps, rather than full object histories
- Compact replica histories
  - Replica histories are pruned based on the conditioned-on timestamp after updates
116 Malevolent components
- The astute among you may have noticed the possibility of a denial-of-service attack by refusing to back off exponentially
  - Servers could rate-limit clients
- Clients could also issue updates to a subset of a quorum, forcing incomplete updates
  - Lazy verification can be used to verify the correctness of client operations in the background
  - The amount of unverified work by a client can then be limited
117 Correctness
- Operations are strictly serializable
- To understand why, consider the conditioned-on chain
  - All operations chain back to the initial candidate, and a total order is imposed on all established operations
- Operations occur atomically, including those spanning multiple objects
- If no operations span multiple objects, then correct operations that complete are also linearizable
118 Tests
- Tests were performed on a rack of 76 Intel Pentium 4 2.8 GHz machines
- Implemented an increment method and an NFSv3 metadata service
119 Fault Scalability
120 More fault-scalability
121 Isolated vs. Contending
122 NFSv3 metadata
123 References
- Text and images have been borrowed directly from
both papers.