Title: Increasing Intrusion Tolerance Via Scalable Redundancy
1. Increasing Intrusion Tolerance Via Scalable Redundancy
- Greg Ganger
- greg.ganger_at_cmu.edu
- Natassa Ailamaki, Mike Reiter, Priya Narasimhan, Chuck Cranor
2. Technical Objective
- To design, implement, and evaluate new protocols for implementing intrusion-tolerant services that scale better
- Here, "scale" refers to efficiency as the number of servers and the number of failures tolerated grows
- Targeting three types of services:
  - Read-write data objects
  - Custom flat object types for particular applications, notably directories for implementing an intrusion-tolerant file system
  - Arbitrary objects that support object nesting
3. Expected Impact
- Significant efficiency and scalability benefits over today's protocols for intrusion tolerance
- For example, for data services, we anticipate:
  - At least a twofold latency improvement over the current best, even at small configurations (e.g., tolerating 3-5 Byzantine server failures)
  - And improvements will grow as the system scales up
  - A twofold improvement in throughput, again growing with system size
- Without such improvements, intrusion tolerance will remain relegated to small deployments in narrow application areas
4. The Problem Space
- Distributed services manage redundant state across servers to tolerate faults
- We consider tolerance to Byzantine faults, as might result from an intrusion into a server or client
  - A faulty server or client may behave arbitrarily
- We also make no timing assumptions in this work
  - An asynchronous system
- Primary existing practice: replicated state machines
  - Offers no load dispersion, requires data replication, and degrades as the system scales, with O(N²) messages
5. Our approach
- Combine techniques to eliminate work in common cases
- Server-side versioning
  - allows optimism with read-time repair, if necessary
  - allows work to be off-loaded to clients in lieu of server agreement
- Quorum systems (and erasure coding)
  - allow load dispersion (and more efficient redundancy for bulk data)
- Several other techniques applied to defend against Byzantine actions
- Major risk?
  - could be complex for arbitrary objects
6. Evaluation
- We are in Scenario I: the centralized server setting
- Baseline: the BFT library
  - Popular, publicly available implementation of Byzantine fault-tolerant state machine replication (by Castro and Liskov)
  - Reported to be an efficient implementation of that approach
- Two measures:
  - Average latency of operations, from the client's perspective
  - Peak sustainable throughput of operations
- Our consistency definition: linearizability of invocations
7. Outline
- Overview
- Read-write storage protocol
- Some results
- Continuing work
8. Read-write block storage
- Clients erasure-code/replicate blocks into fragments
- Storage-nodes version fragments on every write
[Figure: a client encodes a data block into fragments that are stored across the storage-nodes]
9. Challenges: Concurrency
- Concurrent updates can violate linearizability
[Figure: two concurrent writes interleave their fragments across servers 1-5, mixing the two data items]
10. Challenges: Server Failures
- Can attempt to mislead clients
- Typically addressed by voting
[Figure: a faulty server returns a bogus fragment during a read, leaving the client unable to decode the data]
11. Challenges: Client Failures
- Byzantine client failures can also mislead clients
- Typically addressed by submitting a request via an agreement protocol
[Figure: a Byzantine client writes inconsistent fragments, so different readers may reconstruct different or unusable values]
12. Consistency via versioning
- Leverage versioning storage-nodes for consistency
- Allow writes to proceed with versioning
  - All writes create new data versions
  - Partial writes and concurrency won't destroy data
- Reader detects and resolves update conflicts
  - Concurrency is rare in FS workloads (typically < 1%)
  - Offloads work to the client, resulting in greater scalability
- Only perform extra work when needed
  - Optimistically assume fault-free, concurrency-free operation
  - Single round-trip for reads and writes in the common case
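As a toy illustration of the versioning idea (not code from the slides): a storage-node keeps every write as a separate version keyed by logical timestamp, so a later read can always fall back to an older, complete version.

```python
# Toy version history for one block on one storage-node: every write inserts
# a new entry keyed by its logical timestamp and never overwrites older data,
# so a partial or concurrent write cannot destroy an earlier complete version.
history = {}
history[1] = b"D0"                 # an earlier, complete write
history[2] = b"D1"                 # a later write that may turn out to be partial
latest_ts = max(history)           # readers start from the newest version...
previous = history[latest_ts - 1]  # ...and can step back if it proves incomplete
```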
13. Our system model
- Crash-recovery storage-node fault model
  - Up to t total bad storage-nodes (crashed or Byzantine)
  - Up to b ≤ t Byzantine (arbitrary faults)
  - So, t - b faults are crash-recovery faults
- Client fault model
  - Any number of crash or Byzantine clients
- Asynchronous timing model
- Point-to-point authenticated channels
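Purely as an illustration of these parameters (not project code), a small Python sketch that checks a configuration against the N = 2t + 2b + 1 sizing quoted with the response-time results; the write threshold W here is simply the one used in the example slides, not a derived protocol bound.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    n: int   # total number of storage-nodes
    t: int   # total storage-node faults tolerated (crash-recovery or Byzantine)
    b: int   # of those t faults, how many may be Byzantine
    w: int   # write threshold: storage-nodes that must accept a write

    def check(self) -> None:
        assert 0 <= self.b <= self.t                   # Byzantine faults fit in the t budget
        assert self.n >= 2 * self.t + 2 * self.b + 1   # sizing used in the evaluation slides
        assert self.t < self.w <= self.n               # a write quorum must survive t failures

# The configuration used in the read examples: N=5, W=3, t=1, b=0.
Config(n=5, t=1, b=0, w=3).check()
```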
14. Read/write protocol
- Unit of update: a block
  - Complete blocks are read and written
  - Erasure-coding may be used for space-efficiency
- Update semantics: read/write
  - No guarantee of contents between a read and a write
  - Sufficient for block-based storage
- Consistency: linearizability
- Liveness: wait-freedom
15. R/W protocol: Write
- Client erasure-codes the data-item into N data-fragments
- Client tags write requests with a logical timestamp
  - Round-trip required to read the logical time
- Client issues requests to at least W storage-nodes
- Storage-nodes validate the integrity of the request
- Storage-nodes insert the request into their version history
- Write completes after W requests have completed
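A toy, in-memory sketch of the client write path just listed (Python, not the PASIS implementation): replication stands in for erasure coding, and in the real protocol storage-nodes also validate request integrity before accepting it.

```python
def make_nodes(n):
    # Each storage-node is modelled as a version history for one block:
    # a dict mapping logical timestamp -> stored fragment (here a full replica).
    return [dict() for _ in range(n)]

def write(nodes, data, w):
    # Round-trip to read the current logical time, then advance it for this write.
    ts = max((max(h) for h in nodes if h), default=0) + 1
    # Issue write requests to at least W storage-nodes; each inserts the request
    # into its version history rather than overwriting older versions.
    for h in nodes[:w]:
        h[ts] = data
    return ts   # the write is complete once W requests have completed

nodes = make_nodes(5)
write(nodes, b"D0", w=3)   # usage: N=5, W=3 as in the example slides
```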
16. R/W protocol: Read
- Client reads the latest version from a storage-node subset
  - Read set guaranteed to intersect with the latest complete write
- Client determines the latest candidate write (the "candidate")
  - Set of responses containing the latest timestamp
- Client classifies the candidate as one of:
  - Complete
  - Incomplete
  - Repairable
- For consistency, only complete writes can be returned
17. R/W protocol: Read classification
- Based on the client's (limited) system knowledge
  - Failures and asynchrony lead to imperfect information
- Candidate classification rules:
  - Complete: candidate exists on ≥ W nodes
    - candidate is decoded and returned
  - Incomplete: candidate exists on < W nodes (cannot reach W)
    - Read the previous version to determine a new candidate
    - Iterate: perform classification on the new candidate
  - Repairable: candidate may exist on ≥ W nodes
    - Repair and return the data-item
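A simplified classifier for the rules above (illustrative Python, assuming full replicas and b = 0; the real rules also account for Byzantine responses and erasure-coded fragments).

```python
def classify(responses, n, w):
    # responses: {node_id: (timestamp, value)} for the storage-nodes the client
    # actually heard from; n is the total node count, w the write threshold.
    latest_ts = max(ts for ts, _ in responses.values())        # latest candidate
    value = next(v for ts, v in responses.values() if ts == latest_ts)
    seen = sum(1 for ts, _ in responses.values() if ts == latest_ts)
    unread = n - len(responses)                                 # nodes not contacted

    if seen >= w:
        return "complete", latest_ts, value      # decode and return the candidate
    if seen + unread >= w:
        return "repairable", latest_ts, value    # may exist on >= W nodes: repair it
    return "incomplete", latest_ts, value        # cannot reach W: step back a version

# First step of the "successful read" example: D1 seen on one node, D0 elsewhere.
# D1 is classified incomplete, so the client would then step back to D0.
print(classify({1: (1, "D0"), 2: (2, "D1"), 4: (1, "D0"), 5: (1, "D0")}, n=5, w=3))
```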
18. Example: Successful read (N=5, W=3, t=1, b=0)
[Figure: timeline across storage-nodes 1-5. D0 is written to at least W nodes; a later write D1 reaches only one node. The reader finds D1 as the latest candidate and classifies it incomplete; the previous version D0 becomes the latest candidate, is determined complete, and is returned.]
19. Example: Repairable read (N=5, W=3, t=1, b=0)
[Figure: timeline across storage-nodes 1-5. D0 is written completely, D1 only partially, and the latest write D2 reaches only some nodes. The reader finds D2 as the latest candidate and classifies it repairable; the client repairs D2 (completing its write to W nodes) and returns D2.]
20. Protecting against Byzantine storage-nodes
- Must defend against servers that modify data in their possession
- Solution: cross checksums [Gong 89]
  - Hash each data-fragment
  - Concatenate all N hashes
  - Append the cross checksum to each fragment
  - Clients verify hashes against fragments and use cross checksums as votes
[Figure: a data-item is encoded into data-fragments; the hashes of all fragments are concatenated to form the cross checksum]
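A minimal sketch of the cross-checksum construction in Python; the hash function (SHA-256 here) is an assumption of this sketch, not a detail taken from the slides.

```python
import hashlib

HASH = hashlib.sha256        # assumed hash for this sketch
DIGEST = HASH().digest_size  # bytes per fragment hash

def cross_checksum(fragments):
    # Hash each of the N data-fragments and concatenate all N hashes.
    return b"".join(HASH(f).digest() for f in fragments)

def tag(fragments):
    # Append the same cross checksum to every fragment before it is stored.
    cc = cross_checksum(fragments)
    return [(f, cc) for f in fragments]

def verify(index, fragment, cc):
    # A client checks the fragment against its slot in the cross checksum;
    # matching cross checksums returned by different nodes then act as votes.
    return cc[index * DIGEST:(index + 1) * DIGEST] == HASH(fragment).digest()

stored = tag([b"frag0", b"frag1", b"frag2"])
assert verify(1, stored[1][0], stored[1][1])
```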
21. Protecting against Byzantine clients
- Must ensure all fragment sets decode to the same value
- Solution: validating timestamps
  - Write: place a hash of the cross checksum in the timestamp
    - also prevents multiple values being written at the same timestamp
  - Storage-nodes validate their fragment against the corresponding hash
  - Read: regenerate fragments and the cross checksum
[Figure: example of a Byzantine encoding with a poisonous fragment, where different fragment subsets decode to different data-items]
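An illustrative sketch (not the actual PASIS message format) of binding the cross checksum into the timestamp so a Byzantine client cannot write different values under one timestamp; SHA-256 and the tuple layout are assumptions of this sketch.

```python
import hashlib

D = hashlib.sha256().digest_size

def timestamp(logical_time, cross_checksum):
    # The timestamp carries a hash of the cross checksum, so one timestamp
    # can name at most one encoded value.
    return (logical_time, hashlib.sha256(cross_checksum).digest())

def node_accepts(ts, index, fragment, cross_checksum):
    # A storage-node validates the cross checksum against the timestamp, and
    # its own fragment against the corresponding hash in the cross checksum.
    _, cc_hash = ts
    if hashlib.sha256(cross_checksum).digest() != cc_hash:
        return False
    return cross_checksum[index * D:(index + 1) * D] == hashlib.sha256(fragment).digest()

# On read, clients additionally regenerate the fragments from the decoded value
# and recompute the cross checksum to catch a "poisonous" (inconsistent) encoding.
```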
22. Experimental setup
- Prototype system: PASIS
- 20-node cluster
  - Dual 1 GHz Pentium III storage-nodes
  - Single 2 GHz Pentium 4 clients
  - 100 Mb switched Ethernet
- 16 KB data-item size (before encoding)
  - Blowup of N/m over the data-item size
  - Each fragment is 1/m of the data-item size
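A worked example of the space blowup, assuming an m-of-N erasure code (the N/m figure is inferred from the encoding parameters rather than quoted from the slides).

```python
data_kb = 16                    # data-item size before encoding
m, n = 2, 5                     # assumed m-of-N code: any m fragments decode the item
fragment_kb = data_kb / m       # each fragment is 1/m of the data-item: 8 KB
total_kb = n * fragment_kb      # 40 KB stored across all N storage-nodes
blowup = total_kb / data_kb     # 2.5x, versus 5x for full replication at N = 5
```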
23. PASIS response time
- Configuration: N = 2t + 2b + 1; fault models b = t and b = 1
[Figure: mean response time (ms, 0-20) vs. total failures tolerated (t = 1 to 4) for reads and writes under b = t and b = 1. Annotated cost components: decode computation, the N-W delay for redundant fragments, and the 1-way 16 KB ping baseline.]
24. Throughput experiment
- Same system set-up as the response-time experiment
- Clients issue read or write requests
  - Increase the number of clients to increase load
- Demonstrate the value of erasure codes
  - Increase m to reduce per-storage-node load
- Compare with Byzantine atomic broadcast
  - BFT library [Castro Liskov 99]
  - Supports arbitrary operations
  - Replication (even with multicast) limits write throughput
  - O(N²) messages limit performance scalability
25. PASIS vs. BFT: Write throughput
- Configurations (b = t = 1): PASIS m=2 N=5, PASIS m=3 N=6, BFT m=1 N=4
- PASIS has higher write throughput than BFT
  - BFT uses replication, which increases per-storage-node load
  - Erasure codes reduce per-storage-node load
[Figure: write throughput (req/s, up to ~3500) vs. number of clients (0-8) for the three configurations]
26. PASIS vs. BFT: Read throughput
- Configurations (b = t = 1): PASIS m=2 N=5, PASIS m=3 N=6, BFT m=1 N=4
[Figure: read throughput (req/s, up to ~3500) vs. number of clients (0-8) for the three configurations]
27. Continuing work
- New testbed: 70 servers connected by a switched Gbit/sec network
  - experiments can then explore higher scalability points
  - baseline and our results will come from this testbed
- Protocol for arbitrary deterministic functions on objects
  - built from the same basic primitives
- Protocol for objects with nested objects
  - adds the requirement of replicated invocations
28. Summary
- Goal: to design, implement, and evaluate new protocols for implementing intrusion-tolerant services that scale better
  - Here, "scale" refers to efficiency as the number of servers and the number of failures tolerated grows
- Started with a protocol for read-write storage
  - based on versioning and quorums
  - scales efficiently (and much better than BFT)
  - also flexible (can add assumptions to reduce costs)
- Going forward (in progress)
  - generalize the types of objects and operations that can be supported
29. Questions?
30. Garbage collection
- Pruning old versions is necessary to reclaim space
- Versions prior to the latest complete write can be pruned
- Storage-nodes need to know the latest complete write
  - In isolation they do not have this information
  - Perform a read operation to classify the latest complete write
- Many possible policies exist for when to clean what
  - Best to clean during idle time (if possible)
  - Rank blocks in order of greatest potential gains
- Work remains in this area
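A toy sketch of the pruning step (Python, illustrative only), assuming the storage-node has already learned the latest complete write's timestamp via a read-style classification.

```python
def garbage_collect(history, latest_complete_ts):
    # history: {logical timestamp -> fragment} for one block on one storage-node.
    # Versions strictly older than the latest complete write can be pruned;
    # the node cannot determine that timestamp in isolation.
    for ts in [t for t in history if t < latest_complete_ts]:
        del history[ts]
    return history

history = {1: b"D0", 2: b"D1", 3: b"D2"}
garbage_collect(history, latest_complete_ts=3)   # leaves only version 3
```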