Title: A Research Program in Reliable Adaptive Distributed Systems (RADS)
1 A Research Program in Reliable Adaptive Distributed Systems (RADS)
- Armando Fox, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
- University of California, Berkeley; Stanford University
2 What Are We Trying to Do? New Approach for RADS
- Dramatically improve the trustworthiness of networked systems
- Observe: design observation points throughout the system
- Analyze: infer via statistical learning
- Respond: detect anomalous behavior vs. baseline
- Learn: use observations to modify responses to future observations
- Act
  - Reactive: use control points in the system for rapid recovery if something wrong is detected
  - Proactive/protective: act prophylactically on the system to prevent a predicted impending failure
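The observe/analyze/learn/act cycle above can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the RADS design: the function names, the warm-up length, and the use of a simple standard-deviation test in place of the statistical learning methods the program proposes.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Analyze: flag a value far from the baseline mean (hypothetical z-score test)."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # needs at least two baseline points
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def observe_analyze_act(observations, warmup=5, threshold=3.0):
    """Walk a stream of observations and return the (index, action) pairs taken.

    Assumes warmup >= 2 so that a standard deviation can be computed.
    """
    baseline = []
    actions = []
    for t, value in enumerate(observations):  # Observe
        if len(baseline) < warmup:
            baseline.append(value)  # still building the initial baseline
            continue
        if is_anomalous(baseline, value, threshold):
            actions.append((t, "recover"))  # Act: reactive use of a control point
        else:
            baseline.append(value)  # Learn: fold normal behavior into the baseline
    return actions


if __name__ == "__main__":
    offered_load = [100, 102, 98, 101, 99, 100, 500, 101]
    print(observe_analyze_act(offered_load))
```

In this sketch anomalous values are kept out of the baseline, so a single spike does not shift what counts as normal for later observations.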
3 Today's Systems Are Too Brittle
- Fragile, easily broken, yielding poor dependability and security
  - E.g., Amazon: yearly revenue $3.1B, downtime costs $600,000/hr
- Why?
  - Existing systems focus on performance, not on fast, adaptive detection of and response to failure and attack
  - Fundamentally incorrect assumptions
    - Humans are perfect
    - Software can be made bug free
    - Maintenance is free
- People/HW/SW failures are facts, not problems
  - "If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time" (Shimon Peres)
4 Failures and Attacks Inevitable, so Design for Rapid Adaptation
- Rapid application and server recovery, agile network rerouting, proactive protective actions ...
- No distinction between normal operation and recovery
- Elements of our solution
  - Programming paradigms for robust recovery
  - Crash-only software design for rapid server recovery
  - Network protocols designed for observation to allow rapid detection of behavioral violations
  - Instrumentation and online statistical analysis for anomaly detection and failure diagnosis/localization
- Adaptation benchmarks to measure progress
  - "What you can't measure, you can't improve"
  - Collect real failure data to drive benchmarks
5 Example: Anomaly Detection Meets Crash-Only Design
- Use simple time series analysis on key operating statistics (committed writes, offered load, etc.)
- Count relative frequencies of all substrings of length k or shorter; look for discrepancies in relative frequencies across replicas
- Works even when the period is irregular or not known a priori
- If you see anything unusual, coerce it to a crash and recover from that: reboot is nearly free, so occasional false positives are OK
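The substring-frequency comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each replica's statistic is assumed to have already been discretized into a string of symbols (here "L" for low and "H" for high), the L1 distance and the 0.5 threshold are arbitrary choices, and all function names are hypothetical.

```python
from collections import Counter


def substring_frequencies(symbols, k):
    """Relative frequencies of all substrings of length 1..k in a symbol string."""
    counts = Counter()
    for length in range(1, k + 1):
        for i in range(len(symbols) - length + 1):
            counts[symbols[i:i + length]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()} if total else {}


def l1_distance(p, q):
    """Sum of absolute differences between two frequency tables."""
    keys = set(p) | set(q)
    return sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in keys)


def find_outlier_replicas(traces, k=2, threshold=0.5):
    """Return names of replicas whose frequencies disagree with every peer.

    Assumes at least three replicas, so that one faulty replica stands out
    against a majority that still agree with each other.
    """
    freqs = {name: substring_frequencies(t, k) for name, t in traces.items()}
    outliers = []
    for name, f in freqs.items():
        distances = [l1_distance(f, g) for peer, g in freqs.items() if peer != name]
        if distances and min(distances) > threshold:
            outliers.append(name)
    return sorted(outliers)


if __name__ == "__main__":
    traces = {
        "replica_a": "LHLHLHLH",
        "replica_b": "HLHLHLHL",  # same behavior, shifted in phase
        "replica_c": "LLLLLLLL",  # stuck low
    }
    print(find_outlier_replicas(traces))
```

Because only substring frequencies are compared, the phase-shifted replica_b stays close to replica_a, while the stuck replica_c is flagged. In a crash-only setting, each flagged replica would simply be rebooted.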
6 Security Challenges for RADS
- Need new techniques to detect and respond to rapidly evolving attacks
- But these techniques can themselves be used to mount attacks
- So we must secure the learning process
- Rapid secure protocol synthesis tools can be applied to this problem
7 Approach for Success: Interdisciplinary Expertise
- Interdisciplinary Team
  - Armando Fox/Dave Patterson: Dependable System Design
  - Randy Katz/Ion Stoica: Network Services/Protocols
  - Michael Jordan: Statistical Learning Theory
  - Ion Stoica/Doug Tygar: Verification of networks and security
  - George Necula: Language/Application-level mechanisms
- Spans algorithm design and system implementations
  - Comprehensive distributed architecture embedding Statistical Learning Theory (SLT) as a primitive building block
  - Embedding observational and inference means at strategic points throughout the distributed system
  - New kinds of statistical inference and verification techniques able to execute online and in real time
8 RADS Conceptual Architecture
[Figure: layered architecture diagram; recoverable labels listed below]
- Prototype Application: Messaging, E-Mail for Operational Systems (serving User and Operator)
- Programming Abstractions for Roll-back (Necula)
- Benchmarks, Tools for Human Operators (Patterson)
- Crash-Only Middleware: Servers, System OC Infrastructure (Fox)
- SLT Services: Online Statistical Learning Algorithms (Jordan)
- Application-Specific Overlay Network
- Protocols Enabling Fast Detection and Route Recovery, Network OC Infrastructure (Katz, Stoica)
- Network elements shown: PNEs, Edge Networks, Routers, Commodity Internet and IP networks
- Reduction to practice of online SLT and observe/analyze/act infrastructure; reusable, embeddable components; pervasive security considerations (Tygar)
9 Vulnerable Messaging Application that Requires Trustworthiness
[Figure: network diagram; recoverable labels listed below]
- DHS/Federal Network
- Coalition Internet
- Trust Relations
- Allies' Networks
- Local Police, Fire, State Police
- Incident Reports, Responder Locations, GIS Data, etc.
- Compromised Network with Embedded Adversaries
- Exploit DETER Testbed for Prototyping
10 Scientific Foundation for Self-* Systems
- New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
- New metrics and benchmarks for evaluating self-adapting networked systems
- Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems
11 Measuring Success
- Build messaging prototype using RADS design principles and tools
- Put realistic performance workload on prototype, embed in DHS DETER testbed
- Subject prototype to increasingly aggressive failure and attack workloads
  - E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks
- Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype
- Compare results with conventional systems under similar performance, failure, and attack workloads
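A few of the measures listed above can be computed as in the following sketch. It assumes each experiment yields a list of detector verdicts paired with ground-truth labels from the injected fault workload, plus measured uptime and downtime; the function names and the sample numbers are hypothetical, not results from the project.

```python
def detection_metrics(predicted, actual):
    """False positive rate and accuracy from paired boolean verdicts and labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    negatives = fp + tn
    return {
        # Share of fault-free intervals that were wrongly flagged.
        "false_positive_rate": fp / negatives if negatives else 0.0,
        # Share of all intervals where the detector's verdict was right.
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


def availability(uptime_seconds, downtime_seconds):
    """Fraction of the measured interval during which the prototype was up."""
    total = uptime_seconds + downtime_seconds
    return uptime_seconds / total if total else 0.0


if __name__ == "__main__":
    verdicts = [True, True, False, False, True]   # detector said "anomaly"
    injected = [True, False, False, False, True]  # a fault was really injected
    print(detection_metrics(verdicts, injected))
    print(availability(uptime_seconds=3590, downtime_seconds=10))
```

Running the same computation on a conventional system under the same workload gives the side-by-side comparison the last bullet calls for.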
12 New Funding Opportunity: NSF CyberTrust Program
- From RFP
  - People rely on systems based on networked computers
  - Too vulnerable to cyber attacks that inhibit function, corrupt data, or expose private information
- Promote vision where networked systems are
  - More predictable, more accountable, and less vulnerable to attack and abuse
  - Developed, configured, operated and evaluated by a well-trained and diverse workforce
  - Used by a public educated in their secure and ethical operation
- Example research area: improve trustworthiness of networks; explore evolving nature of security protocols and policies in communications networks
- Individual, Team projects and 1-2 Centers
13 CATS: Center for Adaptive Trustworthy Systems
- Dramatically improve the trustworthiness of networked systems
- New understanding of how to construct such systems
  - Observe-Analyze-Act
  - From responding to known problems to learning new problems
  - From reacting to problems to proactively responding before problems become significant
- Experimental method of benchmarking, prototyping, and deployment to provide context
- Technical Thrusts
  - Statistical Learning Theory
  - Crash-Only Software
  - Behaviorally-Consistent and Secure Protocols
  - Programmable Network Elements
- Integration Vehicle
  - Application: Disaster Response Messaging
  - Supported by prototype distributed system architecture
- Deployment and Evaluation Plan
14 We Need Your Help and Support! Discussion?