A Research Program in Reliable Adaptive Distributed Systems (RADS)

About This Presentation

Title:

A Research Program in Reliable Adaptive Distributed Systems (RADS)

Description:

Armando Fox*, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion ... Michael Jordan: Statistical Learning Theory ...

Slides: 15

Provided by: Rand220

Learn more at: http://roc.cs.berkeley.edu

Transcript and Presenter's Notes

1
A Research Program in Reliable Adaptive Distributed Systems (RADS)

Armando Fox, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
University of California, Berkeley / Stanford University

2
What Are We Trying to Do? New Approach for RADS

Dramatically improve the trustworthiness of networked systems
Observe: design observation points throughout the system
Analyze: infer via statistical learning
Respond: detect anomalous behavior vs. baseline
Learn: use observations to modify responses to future observations
Act:
Reactive: use control points in the system for rapid recovery if something wrong is detected
Proactive/protective: prophylactically act on the system to prevent a predicted impending failure
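
The observe/analyze/respond/learn cycle above can be sketched as a small online detector. This is only an illustration under stated assumptions: the slides do not say which statistical model is used, so a running mean and variance (Welford's algorithm) with a z-score test stands in for the baseline, and the names OnlineBaseline, observe_analyze_act, z_threshold, warmup, and the recover callback are all hypothetical.

```python
import math


class OnlineBaseline:
    """Running mean/variance (Welford's algorithm) used as a behavioral baseline.

    Assumption: a z-score against the running baseline stands in for the
    statistical learning step, which the slides do not specify.
    """

    def __init__(self, z_threshold=3.0, warmup=5):
        self.n = 0            # number of samples learned so far
        self.mean = 0.0       # running mean
        self.m2 = 0.0         # running sum of squared deviations
        self.z_threshold = z_threshold
        self.warmup = warmup  # samples to learn before flagging anything

    def is_anomalous(self, x):
        """Analyze/respond: compare an observation against the baseline."""
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return x != self.mean
        return abs(x - self.mean) / std > self.z_threshold

    def learn(self, x):
        """Learn: fold a normal observation into the baseline."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def observe_analyze_act(samples, baseline, recover):
    """Run the loop over a stream of observations.

    `recover` is a hypothetical control-point callback (e.g. restart a server).
    Returns the indices of the samples that triggered it.
    """
    flagged = []
    for i, x in enumerate(samples):
        if baseline.is_anomalous(x):
            recover(i, x)         # act: reactive recovery
            flagged.append(i)
        else:
            baseline.learn(x)     # only normal samples update the baseline
    return flagged
```

For example, feeding the request latencies 100, 102, 98, 101, 99, 100, 500, 101 through observe_analyze_act with the default settings triggers the recovery callback only for the 500 sample. Anomalous samples are deliberately kept out of the baseline so that a failure does not become the new normal; that is a design choice of this sketch, not something the slides prescribe.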

3
Today's Systems Are Too Brittle

Fragile, easily broken, yielding poor dependability and security
E.g., Amazon: yearly revenue $3.1B, downtime costs $600,000/hr
Why?
Existing systems focus on performance, not fast adaptive detection and response to failure and attack
Fundamentally incorrect assumptions:
Humans are perfect
Software can be made bug free
Maintenance is free
People/HW/SW failures are facts, not problems
"If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time." (Shimon Peres)

4
Failures and Attacks Are Inevitable, so Design for Rapid Adaptation

Rapid application and server recovery, agile network rerouting, proactive protective actions ...
No distinction between normal operation and recovery
Elements of our solution:
Programming paradigms for robust recovery
Crash-only software design for rapid server recovery
Network protocols designed for observation to allow rapid detection of behavioral violations
Instrumentation and online statistical analysis for anomaly detection and failure diagnosis/localization
Adaptation benchmarks to measure progress
"What you can't measure, you can't improve"
Collect real failure data to drive benchmarks

5
Example: anomaly detection meets crash-only design

Use simple time series analysis on key operating statistics (committed writes, offered load, etc.)
Count relative frequencies of all substrings of length k or shorter; look for discrepancies in relative frequencies across replicas
Works even when the period is irregular or not known a priori
If you see anything unusual, coerce to a crash and recover from that; reboot is nearly free, so occasional false positives are OK
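
The substring-frequency comparison described above might look roughly like the following. This is a minimal sketch, not the project's implementation: it assumes each replica's time series has already been discretized into a string of symbols (the slides do not say how), and the distance measure, the median-over-peers rule, and the names substring_frequencies, discrepancy, and find_divergent_replicas are illustrative choices.

```python
from collections import Counter
from statistics import median
from typing import Dict, List


def substring_frequencies(symbols: str, k: int) -> Dict[str, float]:
    """Relative frequency of every substring of length 1..k.

    Frequencies are normalized separately for each substring length.
    """
    freqs: Dict[str, float] = {}
    for n in range(1, k + 1):
        counts = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
        total = sum(counts.values())
        for sub, count in counts.items():
            freqs[sub] = count / total
    return freqs


def discrepancy(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Sum of absolute frequency differences over all substrings seen in either profile."""
    return sum(abs(a.get(s, 0.0) - b.get(s, 0.0)) for s in set(a) | set(b))


def find_divergent_replicas(traces: Dict[str, str], k: int, threshold: float) -> List[str]:
    """Return replicas whose median discrepancy from their peers exceeds `threshold`.

    Assumption: comparing against the median peer distance is one simple way to
    "look for discrepancies across replicas"; the slides do not fix a rule.
    """
    profiles = {name: substring_frequencies(t, k) for name, t in traces.items()}
    divergent = []
    for name, profile in profiles.items():
        distances = [discrepancy(profile, other)
                     for peer, other in profiles.items() if peer != name]
        if distances and median(distances) > threshold:
            divergent.append(name)
    return divergent
```

With four hypothetical replicas whose symbol traces are "abababababab", "babababababa", "abababababab", and "aaaaaaaabbbb", calling find_divergent_replicas with k=2 and threshold=0.5 singles out only the last one. In a crash-only setting, the flagged replica would simply be restarted; since the reboot is cheap, a loose threshold that occasionally misfires is tolerable.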

6
Security Challenges for RADS

Need new techniques to detect and respond to rapidly-evolving attacks
But these techniques can themselves be used to mount attacks
So we must secure the learning process
Rapid secure protocol synthesis tools can be applied to this problem

7
Approach for Success: Interdisciplinary Expertise

Interdisciplinary Team:
Armando Fox/Dave Patterson: Dependable System Design
Randy Katz/Ion Stoica: Network Services/Protocols
Michael Jordan: Statistical Learning Theory
Ion Stoica/Doug Tygar: Verification of networks and security
George Necula: Language/Applications-level mechanisms
Spans algorithm design and system implementations
Comprehensive distributed architecture embedding SLT as a primitive building block
Embedding observational and inference means at strategic points throughout the distributed system
New kinds of statistical inference and verification techniques able to execute on-line and in real-time

8
RADS Conceptual Architecture

[Figure: architecture diagram; labels recovered from the slide]
Actors: User, Operator
Prototype Application: Messaging, E-Mail for Operational Systems
Programming Abstractions for Roll-back (Necula)
Benchmarks, Tools for Human Operators (Patterson)
Crash-Only Middleware Servers, System OC Infrastructure (Fox)
SLT Services
Application-Specific Overlay Network
Online Statistical Learning Algorithms (Jordan)
PNE, Edge Network, Router (each shown twice)
Protocols Enabling Fast Detection and Route Recovery, Network OC Infrastructure (Katz, Stoica)
Commodity Internet, IP networks
Reduction to practice of on-line SLT and observe/analyze/act infrastructure
Reusable embeddable components
Pervasive security considerations (Tygar)

9
Vulnerable Messaging Application that Requires Trustworthiness

[Figure: network diagram; labels recovered from the slide]
DHS/Federal Network
Coalition Internet
Trust Relations
Allies' Networks (shown four times)
Local Police, Fire, State Police
Incident Reports, Responder Locations, GIS Data, etc.
Compromised Network With Embedded Adversaries
Exploit DETER Testbed for Prototyping

10
Scientific Foundation for Self-* Systems

New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
New metrics and benchmarks for evaluating self-adapting networked systems
Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems

11
Measuring Success

Build messaging prototype using RADS design principles and tools
Put realistic performance workload on prototype, embed in DHS DETER testbed
Subject prototype to increasingly aggressive failure and attack workloads
E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, ...
Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, ...
Compare results with conventional systems under similar performance, failure, and attack workloads
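
Some of the measurements listed above can be computed from a log of detector decisions and outage durations. The sketch below is illustrative only: the slides do not define the metrics, so the standard definitions are assumed (false positive rate = FP / (FP + TN), accuracy = (TP + TN) / total, availability = uptime / (uptime + downtime)), and the function names detection_metrics and availability are hypothetical.

```python
def detection_metrics(predicted, actual):
    """Compute false positive rate and accuracy for a detector.

    predicted, actual: equal-length sequences of booleans, where True means
    "failure or attack". Standard definitions are assumed here.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)      # false alarms
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly ignored
    fn = sum(1 for p, a in pairs if not p and a)      # missed failures
    total = tp + fp + tn + fn
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


def availability(uptime_seconds, downtime_seconds):
    """Fraction of time the prototype was serving requests."""
    total = uptime_seconds + downtime_seconds
    return uptime_seconds / total if total else 0.0
```

Running the same functions over the prototype and over a conventional system, under the same failure and attack workload, gives directly comparable numbers.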

12
New Funding Opportunity: NSF CyberTrust Program

From RFP:
People rely on systems based on networked computers
Too vulnerable to cyber attacks that inhibit function, corrupt data, or expose private information
Promote vision where networked systems are:
More predictable, more accountable, and less vulnerable to attack and abuse
Developed, configured, operated and evaluated by a well-trained and diverse workforce
Used by a public educated in their secure and ethical operation
Example research area: improve trustworthiness of networks; explore evolving nature of security protocols and policies in communications networks
Individual, Team projects and 1-2 Centers

13
CATS: Center for Adaptive Trustworthy Systems

Dramatically improve the trustworthiness of networked systems
New understanding of how to construct such systems
Observe-Analyze-Act
From responding to known problems to learning new problems
From reacting to problems to proactively responding before problems become significant
Experimental method of benchmarking, prototyping, and deployment to provide context
Technical Thrusts:
Statistical Learning Theory
Crash-Only Software
Behaviorally-Consistent and Secure Protocols
Programmable Network Elements
Integration Vehicle:
Application: Disaster Response Messaging
Supported by prototype distributed system architecture
Deployment and Evaluation Plan