Title: A Research Program in Reliable Adaptive Distributed Systems (RADS)
1. A Research Program in Reliable Adaptive Distributed Systems (RADS)
- Armando Fox, Michael Jordan, Randy Katz, George Necula, David Patterson, Ion Stoica, Doug Tygar
- University of California, Berkeley and Stanford University
2. Presentation Outline
- Why We Need a New Approach to Networked Systems
- New Design Philosophy for RADS
- Applying the Philosophy: Early Experience with Specific Approaches
  - Approaches for Software and Hardware Dependability
  - Approaches for Networking
  - Approaches for Security
- Applying SLT to dependability problems
- Elements of a unified Experimental Prototype
- Summary and Conclusions
3. New Approach for RADS (Reliable Adaptive Distributed Systems)
- Dramatically improve the trustworthiness of networked systems
- Observe: design observation points throughout the system
- Analyze: SLT (statistical learning theory) as an enabling technology
  - Respond: detect anomalous behavior vs. baseline
  - Learn: use observations to modify responses to future observations
- Act
  - Reactive: use control points in the system for rapid recovery if something wrong is detected
  - Proactive/protective: prophylactically act on the system to prevent a predicted impending failure
4. Today's Systems Are Too Brittle
- Fragile, easily broken, yielding poor trustworthiness (dependability and security)
  - Amazon: revenue $3.1B, downtime costs $600,000 per hour
- Why? Overly focused on performance, performance, and cost-performance
- Systems based on fundamentally incorrect assumptions
  - Humans are perfect
  - Software will eventually be bug free
  - Maintenance is free
- People/HW/SW failures are facts, not problems
  - "If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time." Shimon Peres ("Peres's Law")
5. If Failure Is Inevitable... Then Design for Rapid Adaptation
- Encompasses rapid server recovery, network rerouting, prophylactic/protective actions, ...
- Blurs the distinction between normal operation and recovery
- Elements of the solution
  - Programming paradigms for robust recovery
  - Crash-only software design for rapid server recovery
  - Network protocols designed for rapid detection of assertion violations
  - Instrumentation and SLT for online analysis, anomaly detection, and diagnosis of failure
- Recovery benchmarks to measure progress
  - What you can't measure, you can't improve
  - Collect real failure data to drive benchmarks
6. RADS Conceptual Architecture
[Figure: layered architecture linking the User and the Operator to the commodity Internet. Labeled components, with responsible investigators:
- Prototype Applications: E-voting, Messaging, E-Mail, etc.
- Programming Abstractions for Roll-back (Necula)
- Benchmarks, Tools for Human Operators (Patterson)
- Crash-Only Middleware Servers, System OC Infrastructure (Fox)
- SLT Services: Online Statistical Learning Algorithms (Jordan)
- Application-Specific Overlay Network
- Protocols Enabling Fast Detection and Route Recovery, Network OC Infrastructure (Katz, Stoica)
- PNEs in the edge networks, routers, commodity Internet and IP networks]
- Reduction to practice of online SLT and observe/analyze/act infrastructure
- Reusable, embeddable components
8. Crash-Only Software: Dramatically Simplifying Recovery
- Since robust systems must be crash-safe, make crashes the only supported form of shutdown/restart
  - Software components: the "external power switch" is independent of the misbehaving component
- Recovery becomes inexpensive and safe to try
  - Simplifies failure detection, since detection can be overly aggressive
  - Simplifies recovery, since there is only one type of recovery action and it is always safe to try
- Idea: if something looks anomalous, it's probably wrong
  - Can machine learning and statistical monitoring approaches be applied during online operations?
9. Crash-Only Software: Practical to Build
- Refocus on JAGR; talk about relevance of middleware
- Case studies: two crash-only state-storage subsystems (for session state and durable state)
  - OK to crash any node at any time for any reason
  - Recovery is highly predictable and doesn't impact online performance
  - Replication provides probabilistic durability and capacity during recovery
  - Access pattern of workload exploited for consistency guarantees
- 9 activity and state statistics monitored per storage brick
  - Metrics compared against those of peer bricks
  - Basic idea: changes in workload tend to affect all bricks equally
  - Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- Anomaly in 6 or more (out of 9) metrics => reboot brick
  - Simple thresholding and substring-frequency used to determine what is anomalous
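The peer-comparison rule on this slide can be sketched in a few lines. This is a minimal illustration, not the prototype's detector: the slide says only that simple thresholding and substring-frequency were used, so the median/MAD test, the factor `k`, and the function names here are assumptions chosen for the example. Only the "6 or more of 9 metrics" reboot rule comes from the slide.

```python
from statistics import median

def anomalous_metrics(brick, peers, k=3.0):
    """Count metrics on `brick` that sit more than k MADs from the peer median.

    `brick` is a list of metric values; `peers` is a list of such lists.
    The median/MAD test is an assumed stand-in for the slide's thresholding.
    """
    count = 0
    for i, value in enumerate(brick):
        column = [p[i] for p in peers]
        med = median(column)
        # Median absolute deviation; fall back to a tiny value if peers agree exactly.
        mad = median(abs(v - med) for v in column) or 1e-9
        if abs(value - med) > k * mad:
            count += 1
    return count

def should_reboot(brick, peers, min_anomalous=6):
    """Slide rule: an anomaly in 6 or more (out of 9) metrics triggers a reboot."""
    return anomalous_metrics(brick, peers) >= min_anomalous

# Five healthy peers whose metrics differ only slightly from one another.
peers = [[100 + b] * 9 for b in range(5)]
healthy = [102] * 9
faulty = [500] * 6 + [102] * 3
```

Because a crash-only brick is always safe to reboot, a detector like this can afford to be aggressive: a false positive costs one cheap restart.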
10. Supporting Crash-Only in Middleware
- Add observation and control points to Java application middleware
  - Observe: capture paths taken through the system by user requests
  - Analyze: look for highly unlikely, anomalous (therefore probably buggy) paths
  - Act: micro-reboot suspected-faulty J2EE components transparently to the rest of the system
- Result: fast recovery improves overall performability
  - Micro-reboot is 2-3 orders of magnitude faster than a full application reboot
  - Improves performability (total amount of work per unit time in the presence of faults)
  - Minimizes disruption to users of other (non-faulty) parts of the system
- Fast, cheap micro-reboots (µRBs) and statistical monitoring provide a degree of application-generic failure detection and recovery
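A rough sketch of the observe/analyze steps for request paths, under stated assumptions: paths are recorded as tuples of component names, a path is "rare" if it accounts for less than `min_fraction` of requests, and a component's suspicion score is the share of its requests that travelled rare paths. The threshold, the scoring rule, and the component names are all hypothetical; the slides do not say how the prototype scores paths.

```python
from collections import Counter

def find_suspects(paths, min_fraction=0.01):
    """Return (rare_paths, scores) for a list of observed request paths.

    scores[c] is the fraction of requests touching component c that
    followed a rare path; a high score marks a micro-reboot candidate.
    """
    path_counts = Counter(paths)
    total = sum(path_counts.values())
    rare = {p for p, c in path_counts.items() if c / total < min_fraction}

    seen, in_rare = Counter(), Counter()
    for path, c in path_counts.items():
        for comp in set(path):
            seen[comp] += c
            if path in rare:
                in_rare[comp] += c
    scores = {comp: in_rare[comp] / seen[comp] for comp in seen}
    return rare, scores

# Hypothetical trace: two common paths and one path seen only once.
paths = (
    [("Web", "Cart", "DB")] * 500
    + [("Web", "Catalog", "DB")] * 499
    + [("Web", "Cart", "Inventory")]
)
rare, scores = find_suspects(paths)
top_suspect = max(scores, key=scores.get)
```

The act step would then micro-reboot `top_suspect` alone, leaving the other components running.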
11. Crash-Oriented Software: Systematic Approach
- Needed: a systematic mechanism for determining when micro-reboots are safe
  - Programming-language-level support for rollback and state tracking
- Needed: better integration with SLT
  - Which clustering/analysis techniques best correlate anomalous paths to particular observed failure types? (The current prototype uses very simple data clustering techniques.)
  - Are these techniques suitable for online use? (The current prototype does offline analysis.)
13. Research Challenges
- No protection against DoS attacks
  - MS Blaster inflicted Internet packet loss > 20%
- Routing protocols blindly believe routes advertised by neighbors
- BGP router misconfigurations
  - 200-1200 prefixes affected every day
  - C&W's (AS3561) misconfiguration caused an outage for > 5000 prefixes for 2 hours (April 2001)
- Malicious routers: huge potential threat
  - Drop packets and render a destination unreachable
  - Eavesdrop on the traffic to a given destination
  - Impersonate the destination
14. Observe, Analyze, Act
- Observe
  - Use multiple vantage points to monitor the network
  - Design protocols whose behaviors can be verified
- Analyze
  - Learn from protocol behavior
  - Identify bogus information
- Act
  - Contain misbehaving components
  - Raise flags for network operators
  - Empower end-hosts (e.g., enable end-hosts to stop unwanted packets in the network infrastructure)
    - End-hosts know better when they are under attack (flash crowds vs. DoS attacks)
    - End-hosts can react faster than the infrastructure
[Figure: sender and receiver]
15. Case Study: BGP (Listen & Whisper)
- Whisper
  - Use redundancy to check route advertisements for consistency
- Listen
  - Monitor TCP flow progress to detect reachability problems
- Results
  - Whisper: reduces the region of the Internet vulnerable to an isolated adversary to 5%
  - Scalable: the implementation can handle 10 times today's BGP load
  - Listen: detects reachability problems
    - Probability of false positives: 1%
    - Vulnerable to port scans => plan to use SLT
16. Programmable Network Elements
[Figure: PNE pipeline (In-Port, Classify, Transform, Out-Port); edge networks, routers, and commodity Internet and IP networks]
- Enabling technology
  - Edge network elements for IDS, firewall, traffic shaping, etc.
  - Next generation: exposed APIs for third-party programming
- Location for efficient network-level monitoring and control
  - Observe: rapid detection of route failure or network attack
  - Act: e.g., filter intrusions, quarantine propagating worms
  - Avoid configuration and "latest patch not installed" errors
18. Research Challenge: Self-Sensing and Reactive Systems
- Internet-scale attacks are fundamentally different from host-scale attacks
- Traditional Intrusion Detection Systems (IDS) have had some success with host-scale attacks, but also produce many false positives
- Internet-scale attacks offer an opportunity (more evidence of a wide-scale attack) but also a greater challenge (integrating data from a large number of disparate sources)
19. Observe, Analyze, Act
- Observe: what to monitor, how to monitor
- Analyze: learning from patterns of messages (not parsing their contents)
- Act
  - How to exchange minimal information (in a system under attack)
  - Rapidly evolving security protocols (for resilience to attack)
- Applications: worm detection, spam detection
- Ultimate challenge: move beyond detection and into response
20. Security of Networked Systems: Technical Approach
- Mechanisms to learn, share, and repair against potential threats to dependability
- Strengthen assurance of shared information via lightweight authentication and encryption
  - TESLA authentication system: replaces public-key crypto with lightweight symmetric cryptography; uses time asymmetry to provide assurance
  - Messages are initially protected with a key that is still secret; verification keys are revealed later, which prevents an attacker from using a received key to forge messages
  - Variations provide instant authentication
- Athena system: generate random instances of secure protocols
  - Ultra-fast checking software: model-checking and proof-theoretic techniques to verify protocols against stated requirements
  - Intelligently generate the most efficient secure protocol satisfying the requirements, or a random instance of a secure protocol satisfying a given set of requirements
- Apply to SLT systems to exchange information more quickly
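The delayed key disclosure idea behind TESLA can be illustrated with a one-way key chain and a MAC. This is a simplified sketch, not the TESLA specification: SHA-256 and HMAC are stand-ins, time is reduced to integer intervals, the MAC key is the chain key itself rather than a key derived from it, and all function and parameter names are assumptions made for the example.

```python
import hashlib
import hmac

def h(data: bytes) -> bytes:
    """One-way function used to build the key chain (SHA-256 as a stand-in)."""
    return hashlib.sha256(data).digest()

def make_key_chain(seed: bytes, n: int) -> list:
    """Return [K_0, ..., K_n] where K_i = h(K_{i+1}); K_0 is the public commitment."""
    chain = [seed]
    for _ in range(n):
        chain.append(h(chain[-1]))
    chain.reverse()
    return chain

def mac(key: bytes, message: bytes) -> bytes:
    """Symmetric message authentication code over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message, tag, sent_interval, arrival_interval, disclosed_key,
           commitment, disclosure_delay=1):
    """Check a buffered message once the sender has disclosed its key."""
    # Safety condition: the packet must arrive before its key is disclosed,
    # otherwise an attacker holding the key could have forged it.
    if arrival_interval >= sent_interval + disclosure_delay:
        return False
    # Authenticate the disclosed key by hashing it back to the commitment K_0.
    key = disclosed_key
    for _ in range(sent_interval):
        key = h(key)
    if not hmac.compare_digest(key, commitment):
        return False
    # Only now is the MAC meaningful.
    return hmac.compare_digest(mac(disclosed_key, message), tag)

chain = make_key_chain(b"seed", 4)           # the sender keeps the chain secret
tag = mac(chain[2], b"route ok")             # message sent in interval 2
ok = verify(b"route ok", tag, 2, 2, chain[2], chain[0])
late = verify(b"route ok", tag, 2, 3, chain[2], chain[0])
```

The time asymmetry is the point: a key is useful for forging only while it is secret, and by the time a receiver learns it, messages claiming that interval are no longer accepted.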
22. Statistical Learning Theory
- Toolbox for design/analysis of adaptive systems
  - Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making
- Classification algorithms
  - Recent scaling breakthroughs: 10K features, millions of data points
- Kernel machines: functional analysis and convex optimization
  - Generalized inner product: similarities among data point pairs
  - Defined for many data types
  - Classical linear statistical algorithms kernelized for state-of-the-art nonlinear SLT algorithms in many areas
23. Statistical Learning Theory
- Novelty detection problem
  - Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug
  - E.g., intrusion detection, machine diagnostics
- Second-order cone program: a convex optimization problem with an efficient solution method
  - Given a cloud of data in a high-dimensional feature space, place a boundary around it that guarantees only a small fraction falls outside
- Basic problem: find a boundary that encloses a desired fraction of the data and is as tight as possible
  - Can be done using the generalized Chebyshev inequality
  - Using kernels, this is a convex problem
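As a worked illustration of how a Chebyshev-type bound yields such a boundary, consider the basic multivariate form. This is an assumption about the flavor of inequality meant; the second-order cone formulation on the slide presumably uses a sharper variant. For any distribution on $\mathbb{R}^d$ with mean $\mu$ and invertible covariance $\Sigma$, the quadratic form has expectation $d$, so Markov's inequality gives a distribution-free ellipsoid:

```latex
\Pr\!\left[(x-\mu)^{\top}\Sigma^{-1}(x-\mu) \ge k^{2}\right] \;\le\; \frac{d}{k^{2}}
\qquad\Longrightarrow\qquad
\Pr\!\left[x \notin \mathcal{E}_{\alpha}\right] \le \alpha,
\quad
\mathcal{E}_{\alpha} = \left\{x : (x-\mu)^{\top}\Sigma^{-1}(x-\mu) \le \frac{d}{\alpha}\right\}
```

Choosing $k^2 = d/\alpha$ guarantees that at most a fraction $\alpha$ of normal activity falls outside $\mathcal{E}_{\alpha}$, whatever the underlying distribution. Points outside the ellipsoid are then flagged as novel.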
24. Example: Statistical Bug-Finding
- Programs are buggy, yet many people use them
- Instrument programs to take samples of program state at runtime
- Collect information over the Internet from many users' runs
- Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
- Example: finding a bug in the Unix bc utility
  - 2908 features instrumented
  - All top features indicate indx being unusually large in the more_arrays subroutine
    - storage.c:176 more_arrays(): indx > optopt
    - storage.c:176 more_arrays(): indx > opterr
    - storage.c:176 more_arrays(): indx > use_math
  - Indeed, an array overrun bug in the re-allocation routine more_arrays() was found to cause memory corruption and sometimes an eventual crash
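The flavor of pinpointing bugs from successful and failed runs can be shown with a much simpler scoring rule than a trained classifier. The sketch below swaps in a plain heuristic (assumed here, not the method on the slide): score each sampled predicate by how much the failure rate rises when that predicate was observed true. The run data is made up for illustration; only the predicate name `indx > optopt` is taken from the slide.

```python
def rank_predicates(runs):
    """Rank predicates by failure-rate increase.

    `runs` is a list of (predicates_observed_true, failed) pairs, where the
    first element is a set of predicate names and the second a bool.
    """
    base_rate = sum(1 for _, failed in runs if failed) / len(runs)
    stats = {}  # predicate -> (runs where true, failed runs where true)
    for preds, failed in runs:
        for p in preds:
            seen, bad = stats.get(p, (0, 0))
            stats[p] = (seen + 1, bad + (1 if failed else 0))
    scores = {p: bad / seen - base_rate for p, (seen, bad) in stats.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample: 3 failed runs and 7 successful runs.
runs = (
    [({"indx > optopt", "x > 0"}, True)] * 3
    + [({"x > 0"}, False)] * 7
)
ranked = rank_predicates(runs)
```

A predicate that is true in every run, like `x > 0` here, scores zero because it says nothing about failure. A predicate seen only in failing runs rises to the top.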
25. Example III: Diagnosis
- A probabilistic graphical model with 600 disease nodes and 4000 finding nodes
- Node probabilities p(f_i | d) were assessed from an expert (Shwe et al., 1991)
- Want to compute posteriors p(d_j | f)
- Is this tractable?
26. Case Study: Medical Diagnosis
- Symbolic complexity
  - Symbolic expressions fill dozens of pages
  - Would take years to compute
- Numerical simplicity
  - Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second
27. Challenge for SLT
- Challenge: online versions of the best algorithms have yet to be developed
  - Update the learning system's state based on small sets of data
  - Available for some kernelized problems
- Online versions of the best algorithms have yet to be developed!
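To make "update the learning system's state based on small sets of data" concrete, here is about the simplest online learner there is: a running mean and variance maintained with Welford's update, used to score new observations. It is offered only as an illustration of the online style (constant memory, one pass, no stored history). It is not one of the SLT algorithms the slide has in mind, and the class and method names are made up for this sketch.

```python
import math

class OnlineGaussian:
    """Running mean/variance of a metric, updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford's update: no past observations need to be stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def zscore(self, x):
        """How many standard deviations x lies from the running mean."""
        sd = math.sqrt(self.variance())
        return 0.0 if sd == 0 else (x - self.mean) / sd

model = OnlineGaussian()
for latency in [2, 4, 4, 4, 5, 5, 7, 9]:
    model.update(latency)
```

After these eight updates, `model.zscore(x)` can flag an unusual new observation without revisiting the earlier data.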
29. System Prototype
- Comprehensive system architecture
- Reduction of SLT to practical software components embedded within a distributed systems context
- Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms
30. Messaging as an Application
- E-mail is now a mission-critical application
  - Organizational storage capacity is shifting from financial databases to email (email is the fastest-growing storage)
  - Loss of email is more critical to the continuing operation of an organization than loss of telephony (imagine if the government had no email for a week)
- Instant messaging is now a mission-critical application
  - In a crisis, many communication schemes will be used: land-based telephony, cellular telephony, instant messaging, email, ...
  - Coordination among first responders during crisis response in the field (administrators and operators)
- Demands for dependability, resistance to attack, and establishment of trust among interacting entities
  - Despite attempts by hackers, terrorists, ...
31. Measuring Success
- Build an email/IM prototype using RADS design principles and tools
- Put a realistic performance workload on the prototype
- Subject the prototype to increasingly difficult failure workloads and attack workloads
  - E.g., hardware failures, software failures, operator failures, worm attacks, DDoS attacks, ...
- Measure false positive rates, accuracy rates, time to analyze failures, time to act, performance impact of actions, availability of prototype, performability of prototype, ...
- Compare results to conventional email/IM systems under similar performance, failure, and attack workloads
32. Disaster Response Messaging Application
[Figure: a DHS/Federal network, a coalition Internet, allies' networks, and local police, fire, and state police linked by trust relations, exchanging incident reports, responder locations, GIS data, etc. Threats shown: an active adversary mounting service attacks, network failures, and a compromised network with embedded adversaries.]
34. Old Science vs. New Science
- First 50 years of computer science
  - Manually engineered systems
  - Lack of adaptability, robustness, and security
  - No concern with closing the loop with the environment
- Next 50 years of computer science
  - Statistical learning systems throughout the infrastructure
  - Self-configuring, adaptive, sentient systems
  - Perception, reasoning, decision-making cycle
  - Systems are always recovering because of this ongoing automatic and dynamic adaptation
- New way to think about and design adaptive systems
  - Makes continuous monitoring and reaction a first-class goal
  - Provides a point of leverage for applying SLT and related techniques
35. Scientific Foundation for Self-* Systems
- New design principles and tools for systems that continuously adjust their behavior in response to analysis of online observations
- New metrics and benchmarks for evaluating self-adapting networked systems
- Advances in Statistical Learning Theory to move from offline to online analysis of large-scale distributed systems
36. Backup Slides
37. Statistical Learning Theory
- Super kernels: combine heterogeneous data via multiple kernels
  - Semidefinite programs: convex optimization problems with efficient solutions involving efficient decomposition techniques
  - Useful in fusing evidence at distributed nodes
- Problems of interest require combined parameter estimation and optimization
  - Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space that lead to performance improvements
  - Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a behavior or policy; policies have been successfully developed for nonlinear control problems involving high degrees of freedom
38. Statistical Machine Learning
- Kernel methods
  - neural network heritage
  - convex optimization algorithms
  - kernels available for strings, trees, graphs, vectors, etc.
  - state-of-the-art performance in many problem domains
  - frequentist theoretical foundations
- Graphical models
  - marriage of graph theory and probability theory
  - recursive algorithms on graphs
  - modular design
  - state-of-the-art performance in many problem domains
  - Bayesian theoretical foundations
39. Self-Verifiable Protocols: BGP Whisper
- AS1 advertises its address prefix
  - Chooses a secret key x and sends y = h(x)
  - h(): well-known one-way hash function
- Every router replaces y with h(y) before forwarding
- AS4 performs a consistency check: is y1 = h(y2), given y1 = h^3(x) and y2 = h^2(x)?
  - If yes, assume both routes are correct
  - If no, at least one route is incorrect (but we don't know which) => raise a flag
[Figure: AS1 chooses secret key x and advertises (AS1, y = h(x)) to its neighbors. One route reaches AS4 as (AS1, AS2, AS3, y1 = h^3(x)); the other reaches AS4 as (AS1, AS3, y2 = h^2(x)).]
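The hash-chain check in this example can be sketched as follows. The sketch assumes what the figure suggests: each AS on a path applies h once, so a value that travelled n hops equals h^n(x), and two advertisements for the same prefix are consistent when hashing the shorter route's value the extra number of times reproduces the longer route's value. SHA-256 stands in for h, and the function names are made up for the illustration.

```python
import hashlib

def h(value: bytes) -> bytes:
    """Well-known one-way hash function (SHA-256 as a stand-in)."""
    return hashlib.sha256(value).digest()

def originate(secret: bytes) -> bytes:
    """The origin AS sends y = h(x) for its secret key x."""
    return h(secret)

def forward(y: bytes) -> bytes:
    """Every router replaces y with h(y) before forwarding."""
    return h(y)

def consistent(y_a: bytes, hops_a: int, y_b: bytes, hops_b: int) -> bool:
    """Compare two advertisements by hashing the shorter route's value forward."""
    if hops_a < hops_b:
        y_a, hops_a, y_b, hops_b = y_b, hops_b, y_a, hops_a
    for _ in range(hops_a - hops_b):
        y_b = h(y_b)
    return y_a == y_b

x = b"AS1 secret key"
y1 = forward(forward(originate(x)))   # three applications: h^3(x)
y2 = forward(originate(x))            # two applications: h^2(x)
bogus = forward(h(b"guessed key"))    # an AS that does not know h(x)
```

A mismatch tells AS4 only that at least one route is wrong, not which one, which is why the slide's response is to raise a flag rather than to pick a winner.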
40. Enabling Technology: Edge Services by Network Appliances
- In-the-network processing: the Computer IS THE Network
41. Self-Verifiable Protocols: Status and Future Plans
- Two examples
  - BGP verification (Listen & Whisper)
    - Can trigger alarms and contain malicious routers
    - Minimal changes to BGP; incrementally deployable (Listen)
  - Self-verifiable CSFQ
    - Per-flow isolation without maintaining per-flow state
    - Detect and contain malicious flows
- Ultimate goal: develop a distributed system able to self-diagnose and self-repair
  - Eliminate faulty components
  - Minimum: raise a flag in case of misconfigurations and attackers
- Develop a set of principles and techniques for robust protocols
42. Enabling Technology: Programmable Networks
- Problem
  - Common programming/control environment for diverse network elements to realize the full power of "inside the network" services and applications
- Approach
  - Software toolkit and VM architecture for PNEs, with a retargetable optimized backend for diverse appliance-specific architectures
- Current focus
  - Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining
- Potential impact
  - Open framework for multi-platform appliances, enabling third-party service development
  - Provable application properties and invariants; avoidance of configuration and "latest patch not installed" errors
43. Enabling Technology: Programmable Networks
- Generalized PNE programming and control model
  - Generalized "virtual machine" model for this class of devices
  - Retargetable for different underlying implementations
- Edge services of interest
  - Network measurement and monitoring supporting model formation and statistical anomaly detection
  - Framework for inside-the-network protocol listening
  - Selective blocking/filtering/quarantining of traffic
  - Application-specific routing
  - Faster detection of and recovery from routing failures than is possible with existing Internet protocols
  - Implementation of self-verifiable protocols
44. Crash-Only and Statistical Monitoring: Resilience to Real-World Transients
- Simple fault model: observed anomalies are coerced into crash faults
- Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline
45. Self-Verifiable Protocols: Statement of the Problem
- Problem: detect and contain network effects of misconfigurations and faulty/malicious components
- Approach: design network protocols so each component verifies correct behavior of the protocol
- Examples
  - e2e protocols
  - routing (BGP) protocols
46. Self-Verifiable Protocols: Case Study BGP
- Propagating invalid BGP routes can bring the Internet down
- Multiple causes
  - Router misconfigurations happen daily, yielding outages lasting hours
  - Malicious routers: huge potential threat
    - Routers with default passwords
    - Possible to buy routers' passwords on darknets
- Existing solutions
  - Hard to deploy (e.g., Secure-BGP), or insufficient security
- Our solution
  - Whisper: verify the correctness of router advertisements
  - Listen: verify reachability on the data plane
47. Self-Verifiable Protocols: BGP Whisper
- Use redundancy to check consistency of peers' information
- Whisper game
  - Group sits in a circle; one person whispers a secret phrase to both neighbors
  - Person at the other end concludes
    - Phrase is correct if the same phrase arrives from both neighbors
    - Otherwise, at least one phrase is incorrect
48. Self-Verifiable Protocols: BGP Listen
- Monitor progress of TCP flows
  - If a TCP flow doesn't make progress, it might be because the route is incorrect
  - Use heuristics to reduce the number of false positives and negatives
- Still difficult to handle traffic patterns like port scanners
  - Use SLT techniques to improve detection accuracy?
49. Military Messaging Application
[Figure: a US forces network, a coalition Internet, and allies' networks linked by trust relations, exchanging SitReps. Threats shown: an active adversary mounting service attacks, network failures, and a compromised network with embedded adversaries.]