1
Scalability, Accountability and Instant
Information Access for Network Centric Warfare
Yair Amir, Claudiu Danilov, Danny Dolev, Jon
Kirsch, John Lane, Jonathan Shapiro
  • Department of Computer Science
  • Johns Hopkins University

Chi-Bun Chan, Cristina Nita-Rotaru, Josh
Olsen, David Zage
Department of Computer Science, Purdue University
http://www.cnds.jhu.edu
2
Dealing with Insider Threats
Project Goals
  • Scaling survivable replication to wide area
    networks.
  • Overcome 5 malicious replicas.
  • SRS goal: Improve latency by a factor of 3.
  • Self-imposed goal: Improve throughput by a factor
    of 3.
  • Self-imposed goal: Improve availability of the
    system.
  • Dealing with malicious clients.
  • Compromised clients can inject authenticated but
    incorrect data - hard to detect on the fly.
  • Malicious or just an honest error? Accountability
    is useful in both cases.
  • Exploiting application update semantics for
    replication speedup in malicious environments.
  • Weaker update semantics allows for immediate
    response.

3
State Machine Replication
  • Main challenge: Ensuring coordination between
    servers.
  • Requires agreement on the request to be processed
    and a consistent order of requests.
  • Byzantine faults: BFT [CL99] must contact 2f+1
    out of 3f+1 servers and uses 3 rounds to allow
    consistent progress.
  • Benign faults: Paxos [Lam98, Lam01] must contact
    f+1 out of 2f+1 servers and uses 2 rounds to
    allow consistent progress (see the sketch below).
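
A quick sketch of the quorum arithmetic above, simply restating the slide's formulas (illustrative Python, not part of the original deck):

```python
# Quorum sizes needed for consistent progress, per the slide above.

def bft_quorum(f: int) -> tuple[int, int]:
    """BFT [CL99]: 3f+1 servers total, quorums of 2f+1, 3 rounds."""
    return 3 * f + 1, 2 * f + 1

def paxos_quorum(f: int) -> tuple[int, int]:
    """Paxos [Lam98, Lam01]: 2f+1 servers total, quorums of f+1, 2 rounds."""
    return 2 * f + 1, f + 1

for f in (1, 3, 5):
    n_bft, q_bft = bft_quorum(f)
    n_pax, q_pax = paxos_quorum(f)
    print(f"f={f}: BFT {q_bft}/{n_bft} (3 rounds), Paxos {q_pax}/{n_pax} (2 rounds)")
```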

4
State of the Art in Byzantine Replication: BFT
[CL99]
Baseline technology
5
The Paxos Protocol: Normal Case, after Leader
Election [Lam98]
Key: A simple end-to-end algorithm
6
Steward: Survivable Technology for Wide Area
Replication
[Diagram: a site, with clients and 3f+1 server replicas]
  • Each site acts as a trusted logical unit that can
    crash or partition.
  • Effects of malicious faults are confined to the
    local site.
  • Threshold signatures prove agreement to other
    sites (toy interface sketched below).
  • Between sites: a fault-tolerant protocol.
  • There is no free lunch: we pay with more
    hardware.
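
A toy sketch of the threshold-signature idea: a site message becomes valid only once 2f+1 of the site's 3f+1 replicas back it. This simulates just the counting logic with plain hashes; the real system uses threshold RSA, and all names here are illustrative:

```python
import hashlib
from dataclasses import dataclass, field

F = 5                # tolerated faults per site (the program goal above)
QUORUM = 2 * F + 1   # partial signatures required to "sign" as a site

@dataclass
class SiteSigner:
    """Collects partial signatures; emits a site message only at quorum."""
    partials: dict = field(default_factory=dict)  # server_id -> digest

    def add_partial(self, server_id: int, message: bytes) -> None:
        self.partials[server_id] = hashlib.sha256(message).hexdigest()

    def combine(self, message: bytes):
        digest = hashlib.sha256(message).hexdigest()
        agreeing = [s for s, d in self.partials.items() if d == digest]
        # Only a quorum-backed message leaves the site; f malicious
        # replicas alone can never produce one.
        return ("site-signature", digest) if len(agreeing) >= QUORUM else None

signer = SiteSigner()
for server_id in range(QUORUM):
    signer.add_partial(server_id, b"ordered update #42")
assert signer.combine(b"ordered update #42") is not None
```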

7
Challenges (I)
  • Each site has a representative that:
  • Coordinates the Byzantine protocol inside the
    site.
  • Forwards packets in and out of the site.
  • One of the sites acts as the leader in the wide
    area protocol:
  • The representative of the leading site is the one
    assigning sequence numbers to updates.
  • How do we select and change the representatives
    and the leader site, in agreement?
  • How do we transition safely when we need to
    change them?

8
Challenges (II)
  • Messages coming out of a site during leader
    election are based on communication between
    2f+1 (out of 3f+1) servers inside the site.
  • There can be multiple sets of 2f+1 servers (see
    the example below).
  • In some instances, multiple correct but different
    site messages can be issued by a malicious
    representative.
  • It is sometimes impossible to completely isolate
    a malicious server's behavior inside its own site.
  • This behavior can happen in two instances:
  • The servers inside a site propose a new leading
    site.
  • The servers inside a site report their individual
    status with respect to the global site progress.
  • We developed a detailed proof of correctness of the
    protocol.
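
A tiny combinatorial illustration of the multiple-quorum point, for f = 1 (illustrative Python, not from the deck):

```python
from itertools import combinations

# With 3f+1 servers per site and quorums of 2f+1, several distinct
# quorums exist, so messages aggregated from different quorums (e.g.
# individual status reports) can legitimately differ.
f = 1
servers = range(3 * f + 1)                       # 4 servers: 0..3
quorums = list(combinations(servers, 2 * f + 1))
print(quorums)  # [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

# Any two quorums intersect in at least f+1 servers, so they always
# share at least one correct server; that overlap is what the proof
# of correctness leans on.
min_overlap = min(len(set(a) & set(b))
                  for a in quorums for b in quorums if a != b)
print(min_overlap)  # 2 == f + 1
```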

9
Main idea
  • Sites change their local representatives based on
    timeouts.
  • The leader site's representative has a larger
    timeout, which allows for communication with at
    least one correct representative at other sites.
  • After changing f+1 leader site representatives,
    servers at all sites stop participating in the
    protocol and elect a different leading site
    (rotation logic sketched below).
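
A minimal sketch of this rotation logic; the constants and names are illustrative, not Steward's actual values:

```python
# Representatives rotate deterministically with the local view number,
# so all correct servers in a site agree on who the representative is.
LOCAL_TIMEOUT = 5.0                  # assumed non-leader-site rep timeout
LEADER_TIMEOUT = 3 * LOCAL_TIMEOUT   # leader-site rep gets a larger timeout

def representative(local_view: int, n_servers: int) -> int:
    """The representative for a given local view."""
    return local_view % n_servers

def must_elect_new_leader_site(leader_rep_changes: int, f: int) -> bool:
    """After f+1 leader-site representative changes without progress,
    all sites stop participating and elect a different leading site."""
    return leader_rep_changes >= f + 1
```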

10
Steward: First Byzantine Replication Scalable to
Wide Area Networks
  • A second-iteration implementation:
  • Based on the complete theoretical design.
  • Follows closely the pseudocode proven to be
    correct.
  • We benchmarked the new implementation against the
    program metrics.
  • The code successfully passed the red-team
    experiment.
  • We believe it is theoretically unbreakable.

11
Testing Environment
  • Platform: Dual Intel Xeon 3.2 GHz CPUs, 64-bit,
    1 GByte RAM, Linux Fedora Core 3.
  • Library relies on OpenSSL:
  • Used OpenSSL 0.9.7a, 19 Feb 2003.
  • Baseline operations (re-measurement sketch below):
  • RSA 1024-bit: sign 1.3 ms, verify 0.07 ms.
  • Modular exponentiation, 1024 bits: 1 ms.
  • Generate a 1024-bit RSA key: 55 ms.
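
For reference, a rough way to reproduce these baseline measurements today with Python's cryptography package rather than the original OpenSSL 0.9.7a build (the padding and hash choices are assumptions typical for that era):

```python
import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

t0 = time.perf_counter()
key = rsa.generate_private_key(public_exponent=65537, key_size=1024)
print(f"keygen: {(time.perf_counter() - t0) * 1000:.1f} ms")

msg = b"steward update"
t0 = time.perf_counter()
sig = key.sign(msg, padding.PKCS1v15(), hashes.SHA1())  # assumed scheme
print(f"sign:   {(time.perf_counter() - t0) * 1000:.2f} ms")

t0 = time.perf_counter()
key.public_key().verify(sig, msg, padding.PKCS1v15(), hashes.SHA1())
print(f"verify: {(time.perf_counter() - t0) * 1000:.2f} ms")
```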

12
Evaluation Network 1: Symmetric Wide Area Network
  • Synthetic network used for analysis and
    understanding.
  • 5 sites, each connected to all other sites with
    equal-bandwidth, equal-latency links.
  • One fully deployed site of 16 replicas; the other
    sites are emulated by one computer each.
  • Total: 80 replicas in the system, emulated by 20
    computers.
  • 50 ms wide area links between sites.
  • Varied wide area bandwidth and the number of
    clients.

13
Write Update Performance
  • Symmetric network.
  • 5 sites.
  • Steward:
  • 16 replicas per site.
  • Total of 80 replicas (four sites are emulated).
  • Actual computers: 20.
  • BFT:
  • 16 replicas total.
  • 4 replicas in one site, 3 replicas in each other
    site.
  • Update-only performance (no disk writes).

14
Read-only Query Performance
  • 10 Mbps on wide area links.
  • 10 clients inject mixes of read-only queries and
    write updates.
  • None of the systems was limited by bandwidth.
  • Performance improves by between a factor of two
    and more than an order of magnitude.
  • Availability: queries can be answered locally,
    within each site.

15
Evaluation Network 2: Practical Wide-Area Network
[Diagram: CAIRN network topology with nodes MITPC,
UDELPC, TISWPC, ISEPC, ISEPC3, ISIPC4 and ISIPC in
Boston, Delaware, Virginia, San Jose and Los
Angeles. Local links: 100 Mb/s, <1 ms. Wide-area
links: 4.9 ms at 9.81 Mbits/sec, 3.6 ms at
1.42 Mbits/sec, 1.4 ms at 1.47 Mbits/sec, and a
38.8 ms, 1.86 Mbits/sec cross-country link.]
  • Based on a real experimental network (CAIRN).
  • Modeled on our cluster, emulating bandwidth and
    latency constraints, both for Steward and BFT.

16
CAIRN Emulation Performance
  • The 1.86 Mbps link between the East and West
    coasts is the bottleneck:
  • Steward is limited by bandwidth at 51 updates per
    second.
  • 1.86 Mbps can barely accommodate 2 updates per
    second for BFT (back-of-the-envelope check below).
  • Earlier experimentation with benign-fault two-phase
    commit protocols achieved up to 76 updates per
    second.
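
A back-of-the-envelope check of those numbers, computing the wide-area traffic each system sends across the bottleneck per ordered update (illustrative Python):

```python
# Bytes crossing the 1.86 Mbps bottleneck link per ordered update.
link_bps = 1.86e6
for system, updates_per_sec in (("Steward", 51), ("BFT", 2)):
    bytes_per_update = link_bps / 8 / updates_per_sec
    print(f"{system}: ~{bytes_per_update / 1024:.1f} KB per update")
# Steward: ~4.5 KB; BFT: ~113.5 KB. The flat protocol pushes far more
# wide-area traffic per update, which is why it saturates so early.
```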

17
Wide-Area Scalability (3)
  • Selected 5 PlanetLab sites on 5 different
    continents: US, Brazil, Sweden, Korea and
    Australia.
  • Measured bandwidth and latency between every pair
    of sites.
  • Emulated the network on our cluster, both for
    Steward and BFT.
  • 3-fold latency improvement even when bandwidth is
    not limited.

18
Performance metrics
  • The system can withstand f (5) faults in each
    site.
  • Performs better than a flat solution that
    withstands f (5) faults total.
  • Quantitative improvements - Performance:
  • Between twice and over 30 times lower latency,
    depending on network topology and update/query
    mix.
  • Program metric met and exceeded in most types of
    wide area networks, even when only write updates
    are considered.
  • Qualitative improvements - Availability:
  • Read-only queries can be answered locally even in
    case of partitions.
  • Write updates can be done when only a majority of
    sites are connected, as opposed to 2f+1 out of
    3f+1 connected servers (arithmetic sketched below).
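
The availability arithmetic behind that last point, under the configuration used above (5 sites, f = 5 per site; illustrative Python):

```python
sites, f = 5, 5
replicas_per_site = 3 * f + 1            # 16 replicas per site
total = sites * replicas_per_site        # 80 replicas overall

# Steward: writes need a majority of sites, each internally reaching
# agreement among 2f+1 of its own replicas.
sites_needed = sites // 2 + 1            # 3 of 5 sites
print(f"Steward keeps writing with {sites - sites_needed} whole sites down")

# A flat system withstanding f = 5 faults total: 3f+1 = 16 servers, of
# which 2f+1 = 11 must stay mutually connected; a partition isolating
# any 6 of them blocks writes.
flat_n, flat_q = 3 * f + 1, 2 * f + 1
print(f"Flat BFT: needs {flat_q} of {flat_n} servers connected")
```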

19
Red Team Experiment
  • Excellent interaction both with the red team and
    the white team.
  • Performance evaluation (symmetric network):
  • Several points on the presented performance graphs
    were re-evaluated.
  • Results were almost identical.
  • Thorough discussions regarding the measurement
    methodology and the presentation of the latency
    results validated our experiments.
  • Five crash faults were induced in the leading
    site:
  • Performance slightly improved!

20
Red Team Experiment (2)
  • Steward under attack:
  • Five sites, 4 replicas each.
  • The red team had full control (sudo) over five
    replicas, one in each site.
  • Compromised replicas were injecting (relay sketch
    below):
  • Loss (up to 20% each)
  • Delay (up to 200 ms)
  • Packet reordering
  • Fragmentation (up to 100 bytes)
  • Replay attacks
  • Compromised replicas were running modified
    servers that contained malicious code.
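
For illustration only, the loss/delay/reordering part of that attack behaves like a lossy UDP relay; a minimal sketch (addresses, ports and rates are made up, and the red team actually modified the replica code itself):

```python
import random
import socket
import threading

LISTEN = ("127.0.0.1", 9000)   # hypothetical: traffic from a replica
FORWARD = ("127.0.0.1", 9001)  # hypothetical: next protocol hop
LOSS = 0.20                    # drop up to 20% of packets
MAX_DELAY = 0.200              # delay up to 200 ms

def relay() -> None:
    inbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    outbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    inbound.bind(LISTEN)
    while True:
        packet, _ = inbound.recvfrom(65535)
        if random.random() < LOSS:
            continue                          # injected loss
        delay = random.uniform(0, MAX_DELAY)  # injected delay; variable
        threading.Timer(delay, outbound.sendto,  # delay also reorders
                        args=(packet, FORWARD)).start()

if __name__ == "__main__":
    relay()
```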

21
Red Team Experiment (3)
  • The system was NOT compromised!
  • Safety and liveness guarantees were preserved.
  • The system continued to run correctly under all
    attacks.
  • All logs from all experiments are available.
  • Most of the attacks did not affect the
    performance.
  • The system was slowed down when the
    representative of the leading site was attacked.
  • Update ordering slowed to about 1/5 of its normal
    speed.
  • The slowdown was not severe enough to trigger the
    defense mechanisms.
  • Crashing the corrupt representative caused the
    system to do a view change and regain
    performance.

22
Red Team Experiment (4)
  • Lessons learned
  • We rebuilt the entire system with the red-team
    attack in mind; we learned a lot even before the
    experiment.
  • The overall performance of the system could be
    improved by not validating messages that are not
    needed (after 2f+1 messages have been received).
  • Performance under attack could be improved
    substantially with further research.

23
Next Steps: Throughput Comparison (CAIRN)
[ADMST02]
(Note: not Byzantine fault-tolerant!)
24
Next Steps
  • Performance during common operation:
  • We believe that wide-area throughput performance
    can be improved by at least a factor of 5 by
    using a more elaborate replication algorithm
    between wide area sites.
  • Performance under attack:
  • So far, we focused only on optimizing performance
    in the common case, while guaranteeing safety
    and liveness at all times. Performance under
    attack is extremely important, but not trivial to
    achieve.
  • System availability and safety guarantees:
  • A Byzantine-tolerant protocol between wide-area
    sites would guarantee system availability and
    safety even when some of the sites are completely
    compromised.

25
Scalability, Accountability and Instant
Information Access for Network-Centric Warfare
New ideas:
  • First scalable wide-area intrusion-tolerant
    replication architecture.
  • Providing accountability for authorized but
    malicious client updates.
  • Exploiting update semantics to provide instant
    and consistent information access.
Impact:
  • Resulting systems with at least 3 times higher
    throughput, lower latency and high availability
    for updates over wide area networks.
  • Clear path for technology transitions into
    Military C3I systems such as the Army Future
    Combat System.
Schedule (June 04 - Dec 05):
  • Component analysis and design
  • Component implementation
  • Component evaluation
  • System integration and evaluation
  • C3I model, baseline and demo
  • Final C3I demo and baseline evaluation
http://www.cnds.jhu.edu/funding/srs/