Title: Scalability, Accountability and Instant Information Access for Network Centric Warfare

1. Scalability, Accountability and Instant Information Access for Network Centric Warfare

Yair Amir, Claudiu Danilov, Danny Dolev, Jon Kirsch, John Lane, Jonathan Shapiro
- Department of Computer Science, Johns Hopkins University

Chi-Bun Chan, Cristina Nita-Rotaru, Josh Olsen, David Zage
- Department of Computer Science, Purdue University

http://www.cnds.jhu.edu
2. Dealing with Insider Threats: Project Goals

- Scaling survivable replication to wide area networks.
  - Overcome 5 malicious replicas.
  - SRS goal: improve latency by a factor of 3.
  - Self-imposed goal: improve throughput by a factor of 3.
  - Self-imposed goal: improve availability of the system.
- Dealing with malicious clients.
  - Compromised clients can inject authenticated but incorrect data - hard to detect on the fly.
  - Malicious or just an honest error? Accountability is useful for both.
- Exploiting application update semantics for replication speedup in malicious environments.
  - Weaker update semantics allow for immediate response.
3. State Machine Replication

- Main challenge: ensuring coordination between servers.
  - Requires agreement on the request to be processed and a consistent order of requests.
- Byzantine faults: BFT [CL99] must contact 2f+1 out of 3f+1 servers and uses 3 rounds to allow consistent progress.
- Benign faults: Paxos [Lam98, Lam01] must contact f+1 out of 2f+1 servers and uses 2 rounds to allow consistent progress.
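The quorum arithmetic on this slide can be made concrete with a small sketch. The helper names are illustrative (not from the Steward codebase); the formulas are the standard ones cited above.

```python
# Quorum sizes for the two fault models compared on the slide.

def bft_params(f):
    """BFT [CL99]: tolerate f Byzantine faults with n = 3f+1 servers,
    quorums of 2f+1, and 3 communication rounds."""
    return {"servers": 3 * f + 1, "quorum": 2 * f + 1, "rounds": 3}

def paxos_params(f):
    """Paxos [Lam98, Lam01]: tolerate f benign faults with n = 2f+1
    servers, quorums of f+1, and 2 communication rounds."""
    return {"servers": 2 * f + 1, "quorum": f + 1, "rounds": 2}

# With f = 5 (the project's target), Byzantine tolerance needs 16 servers
# and 11-server quorums; benign tolerance needs 11 servers and 6-server quorums.
assert bft_params(5) == {"servers": 16, "quorum": 11, "rounds": 3}
assert paxos_params(5) == {"servers": 11, "quorum": 6, "rounds": 2}
```

The gap between these two rows (more servers, bigger quorums, one extra round) is exactly what makes flat Byzantine replication expensive on wide area networks.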
4. State of the Art in Byzantine Replication: BFT [CL99]

Baseline technology.

5. The Paxos Protocol (Normal Case, after Leader Election) [Lam98]

Key: a simple end-to-end algorithm.
6. Steward: Survivable Technology for Wide Area Replication

(Figure: a site of 3f+1 server replicas serving local clients; numbered steps show a client update handled inside the site.)
- Each site acts as a trusted logical unit that can crash or partition.
- Effects of malicious faults are confined to the local site.
- Threshold signatures prove agreement to other sites.
- Between sites:
  - Fault-tolerant protocol between sites.
- There is no free lunch - we pay with more hardware.
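The site abstraction can be sketched as follows. This is an illustrative simplification, not Steward's actual code: the digest comparison stands in for combining threshold-signature shares, and the function names are hypothetical.

```python
# A site emits one message only after 2f+1 of its 3f+1 replicas contribute
# matching signature shares, so other sites can verify local agreement from
# a single threshold-signed message.

def site_message(shares, f):
    """shares: (replica_id, digest) pairs collected inside the site.
    Returns the digest backed by at least 2f+1 distinct replicas,
    else None (no site message leaves the site)."""
    threshold = 2 * f + 1
    support = {}
    for replica_id, digest in shares:
        support.setdefault(digest, set()).add(replica_id)
    for digest, ids in support.items():
        if len(ids) >= threshold:
            return digest  # stands in for a threshold signature
    return None

# f = 1, so 3f+1 = 4 replicas per site: three matching shares suffice...
assert site_message([(0, "u42"), (1, "u42"), (2, "u42")], f=1) == "u42"
# ...but a single malicious replica cannot forge a site message,
# even by submitting many shares.
assert site_message([(3, "bogus"), (3, "bogus"), (3, "other")], f=1) is None
```

Counting distinct replica ids per digest (rather than raw shares) is what confines a malicious replica's influence to its own site.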
7. Challenges (I)

- Each site has a representative that:
  - Coordinates the Byzantine protocol inside the site.
  - Forwards packets in and out of the site.
- One of the sites acts as the leader in the wide area protocol.
  - The representative of the leading site is the one assigning sequence numbers to updates.
- How do we select and change the representatives and the leader site, in agreement?
- How do we transition safely when we need to change them?
8. Challenges (II)

- Messages coming out of a site during leader election are based on communication between 2f+1 (out of 3f+1) servers inside the site.
  - There can be multiple sets of 2f+1 servers.
  - In some instances, multiple correct but different site messages can be issued by a malicious representative.
- It is sometimes impossible to completely isolate a malicious server's behavior inside its own site.
- This behavior can happen in two instances:
  - The servers inside a site propose a new leading site.
  - The servers inside a site report their individual status with respect to the global site progress.
- Developed a detailed proof of correctness of the protocol.
9. Main Idea

- Sites change their local representatives based on timeouts.
- The leader site representative has a larger timeout, which allows for communication with at least one correct representative at other sites.
- After changing f+1 leader site representatives, servers at all sites stop participating in the protocol and elect a different leading site.
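The rotation rule above can be sketched as a small state function. This is an illustrative model of the slide's two-level timeout hierarchy, not Steward's pseudocode; the round-robin successor rule is an assumption.

```python
# After every leader-site representative timeout, the leading site rotates
# its representative; after f+1 such rotations, all sites give up on the
# leading site and elect the next one.

def next_config(timeouts_fired, f, num_sites, rep=0, leader_site=0):
    """Configuration reached after `timeouts_fired` leader-site
    representative timeouts, rotating round-robin."""
    leader_changes, rep_changes = divmod(timeouts_fired, f + 1)
    return {"leader_site": (leader_site + leader_changes) % num_sites,
            "representative": (rep + rep_changes) % (3 * f + 1)}

# f = 1, 5 sites: one timeout replaces only the representative...
assert next_config(1, f=1, num_sites=5) == {"leader_site": 0, "representative": 1}
# ...but after f+1 = 2 timeouts the leading site itself is replaced.
assert next_config(2, f=1, num_sites=5) == {"leader_site": 1, "representative": 0}
```

The point of bounding rotations by f+1 is that at least one of the f+1 representatives tried must be correct, so a run of failed rotations implicates the site, not just its representative.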
10. Steward: First Byzantine Replication Scalable to Wide Area Networks

- A second-iteration implementation:
  - Based on the complete theoretical design.
  - Follows closely the pseudocode proven to be correct.
- We benchmarked the new implementation against the program metrics.
- The code successfully passed the red-team experiment.
- We believe it is theoretically unbreakable.
11. Testing Environment

- Platform: dual Intel Xeon 3.2 GHz 64-bit CPUs, 1 GByte RAM, Linux Fedora Core 3.
- Library relies on OpenSSL.
  - Used OpenSSL 0.9.7a, 19 Feb 2003.
- Baseline operations:
  - RSA 1024-bit sign: 1.3 ms; verify: 0.07 ms.
  - Modular exponentiation, 1024 bits: 1 ms.
  - Generate a 1024-bit RSA key: 55 ms.
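The sign and modular-exponentiation baselines are closely related: RSA signing is dominated by one 1024-bit modular exponentiation. A stdlib-only sketch of that baseline measurement (no OpenSSL needed; absolute numbers depend on hardware, and the random operands are synthetic, not a real RSA key):

```python
# Time 1024-bit modular exponentiation, the core cost of an RSA signature.
import secrets
import time

BITS = 1024
modulus = secrets.randbits(BITS) | (1 << (BITS - 1)) | 1  # odd, full width
base = secrets.randbits(BITS) % modulus
exponent = secrets.randbits(BITS)

iterations = 20
start = time.perf_counter()
for _ in range(iterations):
    pow(base, exponent, modulus)  # three-argument pow = modular exponentiation
elapsed_ms = (time.perf_counter() - start) * 1000 / iterations
print(f"1024-bit modular exponentiation: {elapsed_ms:.2f} ms per operation")
```

These per-operation costs matter because every threshold-signed site message and every signed wide-area message pays them.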
12. Evaluation Network 1: Symmetric Wide Area Network

- Synthetic network used for analysis and understanding.
- 5 sites, each connected to all other sites with equal-bandwidth, equal-latency links.
- One fully deployed site of 16 replicas; the other sites are emulated by one computer each.
- Total: 80 replicas in the system, emulated by 20 computers.
- 50 ms wide area links between sites.
- Varied wide area bandwidth and the number of clients.
13. Write Update Performance

- Symmetric network, 5 sites.
- Steward:
  - 16 replicas per site.
  - Total of 80 replicas (four sites are emulated).
  - Actual computers: 20.
- BFT:
  - 16 replicas total.
  - 4 replicas in one site, 3 replicas in each other site.
- Update-only performance (no disk writes).
14. Read-only Query Performance

- 10 Mbps on wide area links.
- 10 clients inject mixes of read-only queries and write updates.
- Neither system was limited by bandwidth.
- Performance improves between a factor of two and more than an order of magnitude.
- Availability: queries can be answered locally, within each site.
15. Evaluation Network 2: Practical Wide-Area Network

(Figure: the CAIRN topology spanning Boston (MITPC), Delaware (UDELPC), Virginia (TISWPC, ISEPC, ISEPC3), Los Angeles (ISIPC), and San Jose (ISIPC4). Measured wide-area links: 4.9 ms at 9.81 Mbits/sec, 3.6 ms at 1.42 Mbits/sec, and 1.4 ms at 1.47 Mbits/sec on the East coast; a 38.8 ms, 1.86 Mbits/sec link between the coasts; 100 Mb/s, <1 ms links inside the local clusters.)

- Based on a real experimental network (CAIRN).
- Modeled on our cluster, emulating bandwidth and latency constraints, both for Steward and BFT.
16. CAIRN Emulation Performance

- The 1.86 Mbps link between the East and West coasts is the bottleneck.
- Steward is limited by bandwidth at 51 updates per second.
- 1.86 Mbps can barely accommodate 2 updates per second for BFT.
- Earlier experimentation with benign-fault two-phase commit protocols achieved up to 76 updates per second.
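A back-of-the-envelope check of the bottleneck claim: dividing the link capacity by each system's measured throughput gives the implied wide-area cost per ordered update. This is a rough estimate from the slide's numbers only, ignoring framing and retransmissions.

```python
# Implied wide-area bytes per update over the 1.86 Mbps bottleneck link.
LINK_BPS = 1.86e6  # East-West coast bottleneck, bits per second

def bytes_per_update(updates_per_sec):
    """Bandwidth consumed per update if the link is saturated."""
    return LINK_BPS / updates_per_sec / 8

print(f"Steward: ~{bytes_per_update(51):,.0f} bytes/update")
print(f"BFT:     ~{bytes_per_update(2):,.0f} bytes/update")
```

The roughly 25x gap per update reflects BFT running its all-to-all agreement rounds across the wide area, whereas Steward crosses the bottleneck with compact threshold-signed site messages.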
17. Wide-Area Scalability (3)

- Selected 5 PlanetLab sites on 5 different continents: US, Brazil, Sweden, Korea and Australia.
- Measured bandwidth and latency between every pair of sites.
- Emulated the network on our cluster, both for Steward and BFT.
- 3-fold latency improvement even when bandwidth is not limited.
18. Performance Metrics

- The system can withstand f (5) faults in each site.
  - Performs better than a flat solution that withstands f (5) faults total.
- Quantitative improvements - performance:
  - Between twice and over 30 times lower latency, depending on network topology and update/query mix.
  - Program metric met and exceeded in most types of wide area networks, even when only write updates are considered.
- Qualitative improvements - availability:
  - Read-only queries can be answered locally even in case of partitions.
  - Write updates can be done when only a majority of sites are connected (as opposed to 2f+1 out of 3f+1 connected servers).
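The first two bullets can be quantified with simple arithmetic (illustrative helper names; the per-site bound assumes no single site ever exceeds f faulty replicas):

```python
# Fault tolerance of a flat BFT deployment vs. Steward's hierarchy.

def flat_bft(f):
    """One flat group of 3f+1 servers tolerates f faults total."""
    return {"servers": 3 * f + 1, "faults_tolerated": f}

def steward(f, num_sites):
    """Each of num_sites sites runs 3f+1 replicas and tolerates f faults
    locally, so up to num_sites * f faults total - provided no single
    site exceeds f."""
    return {"servers": num_sites * (3 * f + 1),
            "faults_tolerated": num_sites * f}

# f = 5, 5 sites: 80 servers withstand up to 25 well-distributed faults,
# versus 5 faults for a 16-server flat deployment.
assert steward(5, 5) == {"servers": 80, "faults_tolerated": 25}
assert flat_bft(5) == {"servers": 16, "faults_tolerated": 5}
```

The extra hardware (80 vs. 16 servers) is the "no free lunch" cost noted earlier; the payoff is both the higher fault bound and the locality of queries.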
19. Red Team Experiment

- Excellent interaction with both the red team and the white team.
- Performance evaluation, symmetric network:
  - Several points on the performance graphs presented were re-evaluated; results were almost identical.
- Thorough discussions regarding the measuring methodology and the presentation of the latency results validated our experiments.
- Five crash faults were induced in the leading site:
  - Performance slightly improved!
20. Red Team Experiment (2)

- Steward under attack:
  - Five sites, 4 replicas each.
  - Red team had full control (sudo) over five replicas, one in each site.
- Compromised replicas were injecting:
  - Loss (up to 20% each)
  - Delay (up to 200 ms)
  - Packet reordering
  - Fragmentation (up to 100 bytes)
  - Replay attacks
- Compromised replicas were running modified servers that contained malicious code.
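A sketch of what such an adversarial channel looks like, with parameters mirroring the slide (the function and its structure are hypothetical; the red team's actual tooling is not described in the source):

```python
# Simulate a link under the injected faults: loss, delay, reordering, replay.
import random

def attack_channel(packets, rng, loss=0.20, max_delay_ms=200.0):
    """Return (delivery_time_ms, packet) pairs after adversarial handling.
    Packets are sent one per millisecond."""
    out = []
    for t, pkt in enumerate(packets):
        if rng.random() < loss:
            continue  # up to 20% loss
        delay = rng.uniform(0.0, max_delay_ms)  # up to 200 ms delay
        out.append((t + delay, pkt))
        if rng.random() < 0.1:  # occasional replay of the same packet
            out.append((t + delay + rng.uniform(0.0, max_delay_ms), pkt))
    out.sort()  # varying delays reorder delivery
    return out

delivered = attack_channel(range(100), random.Random(7))
print(len(delivered), "deliveries (including replays) for 100 packets sent")
```

A protocol that stays safe under this channel must deduplicate replays, tolerate reordering, and make progress despite loss, which is what the next slide reports Steward did.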
21. Red Team Experiment (3)

- The system was NOT compromised!
  - Safety and liveness guarantees were preserved.
  - The system continued to run correctly under all attacks.
  - All logs from all experiments are available.
- Most of the attacks did not affect performance.
- The system was slowed down when the representative of the leading site was attacked:
  - The speed of update ordering dropped to a fifth of normal.
  - The slowdown was not severe enough to trigger the defense mechanisms.
  - Crashing the corrupt representative caused the system to do a view change and regain performance.
22. Red Team Experiment (4)

- Lessons learned:
  - We re-built the entire system with the red-team attack in mind; we learned a lot even before the experiment.
  - The overall performance of the system could be improved by not validating messages that are not needed (after 2f+1 messages have been received).
  - Performance under attack could be improved substantially with further research.
23. Next Steps: Throughput Comparison (CAIRN)

(Figure: throughput comparison against [ADMST02] - not Byzantine!)
24. Next Steps

- Performance during common operation:
  - We believe that wide-area throughput can be improved by at least a factor of 5 by using a more elaborate replication algorithm between wide area sites.
- Performance under attack:
  - So far, we have focused on optimizing performance in the common case, while guaranteeing safety and liveness at all times. Performance under attack is extremely important, but not trivial to achieve.
- System availability and safety guarantees:
  - A Byzantine-tolerant protocol between wide-area sites would guarantee system availability and safety even when some of the sites are completely compromised.
25. Scalability, Accountability and Instant Information Access for Network-Centric Warfare

New ideas:
- First scalable wide-area intrusion-tolerant replication architecture.
- Providing accountability for authorized but malicious client updates.
- Exploiting update semantics to provide instant and consistent information access.

Impact:
- Resulting systems with at least 3 times higher throughput, lower latency and high availability for updates over wide area networks.
- Clear path for technology transitions into military C3I systems such as the Army Future Combat System.

Schedule (June 04 - Dec 05):
- Component analysis and design
- Component evaluation
- Component implementation
- System integration and evaluation
- C3I model, baseline and demo
- Final C3I demo and baseline evaluation

http://www.cnds.jhu.edu/funding/srs/