Title: Byzantine Fault Isolation in the Farsite Distributed File System
1Byzantine Fault Isolation in the Farsite
Distributed File System
- John R. Douceur and Jon Howell
2Definitions
Farsite \'fär-sit\ n (2000) serverless
distributed file system developed at Microsoft
Research, designed to be scalable, strongly
consistent, and secure despite running on an
untrusted infrastructure of desktop PCs
3Talk Outline
- Context: Farsite system
- Why BFT doesn't scale
- Farsite's use of multiple BFT groups
- The need for isolating Byzantine faults
- Formal system specification
- BFI in Farsite
4Farsite System
[Diagram: clients and servers]
5Farsite System
[Diagram labels: metadata, users, BFT group, clients]
6Farsite System
[Diagram labels: metadata, users, BFT group, clients]
T = tolerable faults
R = count of replicas
R > 3T
- Using Byzantine agreement protocol, assign sequence numbers to messages
- Prepare-commit among 2T + 1 servers
- Deterministically update metadata
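A minimal sketch of the sizing arithmetic above, assuming the smallest group that satisfies R > 3T (that is, R = 3T + 1). The function names are illustrative and not part of Farsite.

```python
def min_replicas(t: int) -> int:
    """Smallest replica count R that satisfies R > 3T."""
    return 3 * t + 1


def prepare_commit_quorum(t: int) -> int:
    """Number of servers that take part in prepare-commit: 2T + 1."""
    return 2 * t + 1


def tolerable_faults(r: int) -> int:
    """Largest T such that R > 3T still holds for a group of R replicas."""
    return (r - 1) // 3


if __name__ == "__main__":
    for r in (4, 7, 10):
        t = tolerable_faults(r)
        print(f"R={r}: tolerates T={t}, quorum={prepare_commit_quorum(t)}")
```

Under this reading, groups of 4, 7, and 10 replicas tolerate 1, 2, and 3 faulty members respectively.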
7The Cost of BFT Groups
                  single server    4-member BFT group
computation       ×1               ×4
message delays    2                5
messages          2                32
8Throughput vs. Scale
[Plot: throughput multiple (0 to 7) vs. machine count (1 to 7); curves: ideal, typical, flat, BFT]
9Workload Sharing
[Diagram labels: workload, client, server]
10BFT at Scale
11Multiple BFT Groups
12Tree of BFT Groups
13Tree of BFT Groups
[Diagram: namespace tree divided among BFT groups; node labels: /, users, public, cruft, emacs, Alice, Bob, vi, Outlook, docs, code, C, C, Proj X, foo, bar, src, bin, src, bin]
14Delegation to New Group
[Diagram: the namespace tree from the previous slide, with part of the tree delegated to a new BFT group]
15Pathname Resolution
/users/Alice/code/C/bar
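A minimal sketch of how resolving this pathname might cross BFT group boundaries, assuming each delegated subtree is recorded as a mapping from its root path to the group that owns it. The group names and delegation points below are hypothetical and are not taken from the slides.

```python
def resolve(path, delegations, root_group="root"):
    """Return (prefix, owning group) for each prefix of `path`, root first.

    `delegations` maps the root path of a delegated subtree to the
    (hypothetical) name of the BFT group that owns it.
    """
    owner = root_group
    hops = [("/", owner)]
    prefix = ""
    for component in (c for c in path.split("/") if c):
        prefix = f"{prefix}/{component}"
        # Crossing into a delegated subtree hands authority to its group.
        owner = delegations.get(prefix, owner)
        hops.append((prefix, owner))
    return hops


if __name__ == "__main__":
    # Hypothetical delegation points, chosen only for illustration.
    delegations = {"/users/Alice": "group_A", "/users/Alice/code/C": "group_B"}
    for prefix, group in resolve("/users/Alice/code/C/bar", delegations):
        print(f"{prefix:28} -> {group}")
```

Each change of owner along the walk is a point where the client relies on a different BFT group, so a fault in any group on the path can affect the lookup.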
16Machine Failures at Scale
17Group Failures at Scale
18System Failure at Scale
19Quantitative Fault Analysis
- Example system
  - File system distributed among interacting BFT groups
- Simplifying assumptions
  - Files are partitioned evenly among BFT groups
  - Machine failures are independent
  - Machine fault probability 0.001
- Evaluate operational fault rate
  - Probability that an operation on a randomly selected file exhibits a fault
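The analysis above can be sketched numerically. The model below is an assumption made for illustration, not necessarily the one behind the talk's results: a group of R machines fails when more than T = floor((R - 1) / 3) of its members are faulty; without isolation, any failed group faults every operation; with ideal isolation, an operation faults only if the group holding its file has failed.

```python
from math import comb


def group_fault_probability(r: int, p: float) -> float:
    """Probability that more than T = (r - 1) // 3 of r machines are faulty,
    assuming independent machine faults with probability p."""
    t = (r - 1) // 3
    return sum(comb(r, k) * p**k * (1 - p) ** (r - k) for k in range(t + 1, r + 1))


def operational_fault_rate(r: int, p: float, groups: int, isolated: bool) -> float:
    """Fault rate for an operation on a randomly selected file (assumed model)."""
    pg = group_fault_probability(r, p)
    if isolated:
        # Ideal isolation: only the file's own group matters.
        return pg
    # No isolation: a single failed group anywhere corrupts the operation.
    return 1 - (1 - pg) ** groups


if __name__ == "__main__":
    for groups in (1, 100, 10_000, 100_000):
        no_bfi = operational_fault_rate(4, 0.001, groups, isolated=False)
        ideal = operational_fault_rate(4, 0.001, groups, isolated=True)
        print(f"{groups:>7} groups: no BFI {no_bfi:.3e}, ideal BFI {ideal:.3e}")
```

Under these assumptions the unisolated rate grows with the number of groups, while the ideally isolated rate stays at the single-group failure probability.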
20Operational Faults vs. System Scale
[Plot: operational fault rate vs. system scale (count of BFT groups; ticks at 1, 10, 100, 1,000, 10,000, 100,000); curves: BFT 4, no BFI; BFT 7, no BFI; BFT 10, no BFI; BFT 4, ideal BFI; BFT 4, tree (4) BFI; BFT 4, tree (16) BFI]
21BFI versus no BFI
22BFI versus no BFI
                        4-member BFT groups with BFI    10-member BFT groups without BFI
computation             ×4                              ×10
messages                32                              200
throughput reduction    60%                             84%
23BFI via Formal Specification
[Diagram: distributed-system spec and semantic spec, each comprising state, actions, and faults]
24Farsite Semantic Spec
[Diagram: semantic state as a namespace tree (/, tools, code, C, emacs, src, bin, a.h, a.cpp, a.exe, cl.exe, a.obj) with open handles and pending operations (read, open, move)]
25Farsite Distributed-System Spec
26Farsite Refinement
[Diagram; surviving label: del]
27Actions are State Transitions
[Diagram labels: /, a.cpp, open handles, pending operations]
28Proving Refinement Inductively
[Diagram labels: /, a.cpp, open handles, pending operations]
29Refinement with Byzantine Faults
30Refinement with Byzantine Faults
[Diagram: semantic state as a namespace tree (/, tools, code, C, emacs, src, bin, a.h, a.cpp, a.exe, cl.exe, a.obj) with open handles and pending operations (read, del, move)]
31Semantic Fault Specification
- Safety
  - A tainted file may have arbitrary contents and attributes
  - A tainted file may appear not to be linked into the namespace
  - A tainted file may pretend not to have children it actually has
  - A tainted file may pretend to have children that do not exist
  - A tainted file may pretend another tainted file is a child or parent
- Liveness
  - Operations involving a tainted file may not complete
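The safety rules above can be read as a predicate over what a client observes. The sketch below is one simplified interpretation, not the talk's formal specification: the File type, the function names, and the rule that a tainted directory cannot claim an existing untainted file as its child are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    name: str
    contents: str = ""
    children: set = field(default_factory=set)  # names of the true children
    tainted: bool = False


def contents_allowed(f: File, observed: str) -> bool:
    """An untainted file must return its true contents; a tainted one may return anything."""
    return f.tainted or observed == f.contents


def listing_allowed(directory: File, observed: set, files: dict) -> bool:
    """Check an observed child listing against the weakened (fault-aware) rules."""
    if not directory.tainted:
        # Simplification: an untainted directory lists exactly its true children.
        return observed == directory.children
    for name in observed - directory.children:
        claimed = files.get(name)
        # A tainted directory may claim nonexistent or tainted children,
        # but (assumed) not an existing untainted file.
        if claimed is not None and not claimed.tainted:
            return False
    return True
```

The point of the weakened rules is that every permitted misbehavior stays attached to tainted files, so untainted files keep their normal semantics.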
[Diagram: namespace tree (/, code, tools, emacs, src, bin, C, foo, bar, a.h, a.cpp, a.exe, a.obj, cl.exe); file contents shown include "Hello world" and ASCII art]
32Distributed-System Improvements
- Maintain redundant info across BFT group boundaries
- Augment messages with info that justifies correctness
- Ensure unambiguous chains of authority over data
- Carefully order messages and state updates for operations involving multiple BFT groups
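One way to picture redundant information across a group boundary is a delegation recorded on both sides and cross-checked before it is trusted. The sketch below is purely illustrative and does not describe Farsite's actual protocol; the record types and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentRecord:
    """Held by the delegating group: which subtree it handed to which group."""
    group: str
    path: str
    child_group: str


@dataclass(frozen=True)
class ChildRecord:
    """Held by the receiving group: which subtree it got, and from whom."""
    group: str
    path: str
    parent_group: str


def delegation_consistent(parent: ParentRecord, child: ChildRecord) -> bool:
    """Accept a delegation only if both sides of the boundary agree on it."""
    return (
        parent.path == child.path
        and parent.child_group == child.group
        and child.parent_group == parent.group
    )
```

With both records available, a faulty group that invents or denies a delegation is contradicted by its neighbor, which limits the damage to the subtree it actually controls.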
33Summary of BFI Methodology
- Formally specify your system
  - Semantic spec: user's view of system
  - Distributed-system spec: designer's view of system
  - Refinement interprets distributed-system spec in semantic terms
- Modify distributed-system spec to express Byzantine faults
- Simultaneously
  - Strategically weaken semantic spec to describe faults
  - Improve distributed-system spec to quarantine faults
- Refinement lets you know when you are done
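To make the role of refinement concrete, here is a toy sketch under strong simplifying assumptions: the semantic state is a map from file name to contents, the distributed state is one such map per BFT group with each file held by exactly one group, and a semantic action changes at most one file. It checks a single trace, whereas a real refinement proof argues inductively over all reachable states. None of these names come from the Farsite specification.

```python
def refine(dist_state: dict) -> dict:
    """Interpret a distributed state in semantic terms by merging per-group file maps."""
    semantic = {}
    for files in dist_state.values():
        semantic.update(files)
    return semantic


def is_semantic_step(before: dict, after: dict) -> bool:
    """True for a stutter (no change) or a create, write, or delete of one file."""
    changed = {f for f in before.keys() | after.keys() if before.get(f) != after.get(f)}
    return len(changed) <= 1


def check_refinement(trace: list) -> bool:
    """Every distributed-system step must map to a semantic step."""
    abstract = [refine(state) for state in trace]
    return all(is_semantic_step(a, b) for a, b in zip(abstract, abstract[1:]))


if __name__ == "__main__":
    trace = [
        {"G1": {"a.cpp": "v1"}, "G2": {}},
        {"G1": {}, "G2": {"a.cpp": "v1"}},  # migration between groups: a stutter
        {"G1": {}, "G2": {"a.cpp": "v2"}},  # a write: one semantic action
    ]
    print(check_refinement(trace))
```

In this toy setting, moving a file between groups is invisible at the semantic level, which is the sense in which refinement hides distributed-system detail from the user's view.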
34Conclusions
- BFT groups have negative throughput scaling
- Scalable systems can be built from multiple BFT groups
- System scale increases the probability of non-maskable Byzantine faults
- If faults are not isolated, a single faulty group can corrupt the entire system.
- BFI is a methodology for isolating Byzantine faults
  - BFI uses formal system specification
  - Improves fault tolerance without hurting throughput, unlike increasing BFT group size
35Contact Information
- JohnDo@microsoft.com
- Howell@microsoft.com
- http://research.microsoft.com/farsite
36Backup Slides
37Farsite Spec Stats
- Semantic specification
  - 1800 lines of TLA+
  - 114 definitions
- Distributed-system specification
  - 11,500 lines of TLA+
  - 775 definitions
- Why so big?
  - Windows file-system semantics are complex
  - Scalability and strong consistency
  - Byzantine fault isolation