Title: A Research Program in Reliable Adaptive Distributed Systems (RADS)
1 A Research Program in Reliable Adaptive Distributed Systems (RADS)
- Armando Fox, Stanford University
- Michael Jordan, Randy Katz, George Necula, David Patterson, Doug Tygar, University of California, Berkeley
2 Presentation Outline
- A New Vision for Networked Systems
- Enabling Technology: Statistical Learning Theory
- Approaches for Dependability
- Approaches for Security
- Elements of an Experimental Prototype
- Summary and Conclusions
4 Networked Systems: Current State-of-the-Art
- Today's systems: fragile, easily broken, yielding poor reliability and security
- Configuration by humans is overwhelmingly complex and infrequently correct, yielding lack of dependability and introducing vulnerabilities
- 50% of outages, 90% of security break-ins attributed to configuration
- Attackers exploit known problems faster than system managers apply known fixes
- Overly focused on performance, performance, and cost-performance
- Systems based on fundamentally incorrect assumptions:
- Humans are perfect
- Software will eventually be bug free
- Hardware MTBF is already very large, and will continue to increase
- Maintenance costs irrelevant vs. purchase price
5 Networked Systems: Cost of Failure and its Inevitability
- Outage costs:
- Amazon: revenue $3.1B, 7744 employees
- Revenue (24x7): $350k per hour
- Employee productivity costs: $250k per hour
- Total downtime costs: $600,000 per hour
- Employee cost/hour comparable to revenue, even for an Internet company
- People/HW/SW failures are facts, not problems
- "If a problem has no solution, it may not be a problem, but a fact--not to be solved, but to be coped with over time" (Shimon Peres, "Peres's Law")
- Recovery/repair is how we cope with them
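The per-hour figures on this slide follow from simple arithmetic. A minimal sketch, assuming the amounts are US dollars and a 365-day year of 24x7 operation (neither is stated on the slide); the variable names are illustrative only:

```python
HOURS_PER_YEAR = 24 * 365              # assumption: 24x7 operation, non-leap year

annual_revenue = 3.1e9                 # from the slide
employees = 7744                       # from the slide
revenue_per_hour_quoted = 350_000      # the slide's rounded revenue figure
productivity_cost_per_hour = 250_000   # from the slide

# Revenue lost per hour of outage, before rounding
revenue_per_hour = annual_revenue / HOURS_PER_YEAR

# Total downtime cost per hour, using the slide's rounded revenue figure
total_per_hour = revenue_per_hour_quoted + productivity_cost_per_hour

# Implied cost of one idle employee-hour
cost_per_employee_hour = productivity_cost_per_hour / employees

print(f"revenue/hour  ~ {revenue_per_hour:,.0f}")
print(f"total/hour    = {total_per_hour:,}")
print(f"employee-hour ~ {cost_per_employee_hour:.2f}")
```

The unrounded revenue figure comes to about 354k per hour, which the slide rounds to 350k, and the productivity figure implies roughly 32 per employee-hour.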
6 Principles for Reliable Adaptive Distributed Systems
- Given errors occur, design to recover rapidly:
- Partial restart
- Crash-only software (one way to start, one way to stop)
- Given humans make (most of the) errors, build tools to help the operator find and repair problems:
- Pinpoint the error
- Undo of human error
- Note: errors often associated with configuration
- Recovery benchmarks to measure progress:
- What you can't measure, you can't improve
- Collect real failure data to drive benchmarks
7 Networked Systems: Components of New Approach
- Statistical learning algorithms that observe and predict future behaviors
- Verification techniques that check for correct behavior, reveal vulnerabilities, and harness techniques for the rapid generation of behaviors with desirable properties
- Programmable network elements allowing active code to be inserted into the network, to provide observation and enforcement points without the need for access to user end systems
8 Interdisciplinary Expertise
- SLT (Jordan), Network Services/Protocols (Fox, Katz, Patterson, Stoica), and Verification Methods applied to network and security behaviors (Stoica, Tygar)
- Comprehensive distributed architecture embedding SLT as a building block for critical components for system observation, coordination, inference, correction, and evolution of behaviors
- Components suitable for embedding in distributed systems
- Network behaviors that reveal correct or incorrect operation of higher-level network applications
- Embedding observational and inference means at strategic points in the network, obviating the need to modify end hosts or applications
- System-level heterogeneity and ability to generate new behaviors on demand in response to a dynamic system threat environment, to achieve enhanced dependability and resilience to attack
- Enabling applications for investigation will include web services, intrusion detection, and storage access
10 Statistical Learning Theory
- Toolbox for design/analysis of adaptive systems
- Algorithms for classification, diagnosis, prediction, novelty detection, outlier detection, quantile estimation, density estimation, feature selection, variable selection, response surface optimization, sequential decision-making
- Classification algorithms
- Recent scaling breakthroughs: 10K features, millions of data points
- Kernel machines: functional analysis and convex optimization
- Generalized inner product: similarities among data point pairs
- Defined for many data types
- Classical linear statistical algorithms kernelized for state-of-the-art nonlinear SLT algorithms in many areas
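To make "generalized inner product defined for many data types" concrete, here is a minimal sketch of two kernels and the Gram matrix a kernel machine would consume. The Gaussian (RBF) kernel and the n-gram counting kernel for strings are standard textbook choices picked for illustration, not kernels named in the slides; function names and the `gamma` value are assumptions.

```python
import math
from collections import Counter

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel on numeric vectors: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def spectrum_kernel(s, t, n=2):
    """String kernel: inner product of the n-gram count vectors of s and t."""
    grams_s = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    grams_t = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(grams_s[g] * grams_t[g] for g in grams_s)

def gram_matrix(points, kernel):
    """Matrix of pairwise similarities; a kernel algorithm only needs this matrix."""
    return [[kernel(p, q) for q in points] for p in points]

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
G = gram_matrix(vectors, rbf_kernel)
S = gram_matrix(["abab", "bab", "xyz"], spectrum_kernel)
```

Because the learning algorithm sees only the Gram matrix, the same linear method can be run on vectors or strings simply by swapping the kernel function.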
11 Statistical Learning Theory
- Novelty detection problem
- Unlimited observations reflecting normal activity, yet few (or no) instances that reflect an attack or a bug
- Second-order cone program: a convex optimization problem with an efficient solution method
- Given a cloud of data in a high-dimensional feature space, place a boundary around these to guarantee that only a small fraction falls outside
- Needed: on-line variants of SLT algorithms that update the learning system's state based on small sets of data
- Available for some kernelized problems
- On-line versions of the best algorithms have yet to be developed!
12 Statistical Learning Theory
- Super kernels: combine heterogeneous data via multiple kernels
- Semidefinite programs: convex optimization problems with efficient solutions involving efficient decomposition techniques
- Useful in fusing evidence at distributed nodes
- Problems of interest require combined parameter estimation and optimization
- Response surface methodology: building local mappings from configurations to performance, and suggesting gradient directions in configuration space leading to performance improvements
- Policy-gradient methods: SLT algorithms that make sequences of decisions, yielding a behavior or policy; successfully developed policies for nonlinear control problems involving high degrees of freedom
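The response-surface idea can be sketched in a few lines: measure performance at a handful of configurations around the current one, fit a local linear model, and move along its gradient. This is a minimal, noise-free illustration; the `throughput` function, its optimum, and the step sizes are invented for the example and do not come from the slides.

```python
def local_gradient(measure, config, step):
    """Slope of a local linear fit to measure() around config.

    Samples are taken at config +/- step along each axis. For this symmetric
    design the least-squares slope along each axis equals a central difference.
    """
    gradient = []
    for i in range(len(config)):
        above = list(config)
        below = list(config)
        above[i] += step
        below[i] -= step
        gradient.append((measure(above) - measure(below)) / (2 * step))
    return gradient

def improve(measure, config, step=0.5, rate=0.2, rounds=300):
    """Repeatedly move the configuration along the estimated gradient."""
    config = list(config)
    for _ in range(rounds):
        gradient = local_gradient(measure, config, step)
        config = [c + rate * g for c, g in zip(config, gradient)]
    return config

def throughput(config):
    """Hypothetical performance surface peaking at 8 threads and a 64 MB cache."""
    threads, cache_mb = config
    return 1000 - 4 * (threads - 8) ** 2 - 0.1 * (cache_mb - 64) ** 2

best = improve(throughput, [2.0, 32.0])
```

A real system would average repeated noisy measurements per sample point and cap the step size, since each measurement is an experiment on a live configuration.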
13 Statistical Machine Learning
- Kernel methods:
- neural network heritage
- convex optimization algorithms
- kernels available for strings, trees, graphs, vectors, etc.
- state-of-the-art performance in many problem domains
- frequentist theoretical foundations
- Graphical models:
- marriage of graph theory and probability theory
- recursive algorithms on graphs
- modular design
- state-of-the-art performance in many problem domains
- Bayesian theoretical foundations
14 Vision
- First 50 years of computer science:
- manually-engineered systems
- lack of adaptability, robustness, and security
- no concern with closing the loop with the environment
- Next 50 years of computer science:
- statistical learning systems throughout the infrastructure
- self-configuring, adaptive, sentient systems
- perception, reasoning, decision-making cycle
15 Example I: Statistical Bug-finding
- Programs are buggy, yet people use them
- Exploit this: use user trials to debug programs
- Outline of system:
- Instrument programs to take samples at runtime of program state
- Collect information over the Internet
- Learn a statistical classifier based on successful and failed runs, using feature selection methods to pinpoint the bugs
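A much-simplified stand-in for the last step: instead of training a classifier with feature selection, the sketch below ranks each sampled predicate by how much more often runs fail when the predicate was observed true than runs fail overall. The scoring rule, the function name, and the five toy run reports are all assumptions made for illustration.

```python
from collections import defaultdict

def rank_predicates(runs):
    """Rank predicates by P(fail | predicate observed true) - P(fail).

    runs: list of (set of predicates observed true, failed flag).
    """
    baseline = sum(1 for _, failed in runs if failed) / len(runs)
    seen = defaultdict(int)
    seen_in_failures = defaultdict(int)
    for predicates, failed in runs:
        for predicate in predicates:
            seen[predicate] += 1
            if failed:
                seen_in_failures[predicate] += 1
    scores = {p: seen_in_failures[p] / seen[p] - baseline for p in seen}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy run reports, made up for illustration
runs = [
    ({"indx > optopt", "argc > 1"}, True),
    ({"indx > optopt"}, True),
    ({"argc > 1"}, False),
    ({"argc > 1"}, False),
    (set(), False),
]
ranking = rank_predicates(runs)
```

Predicates that are true in both successful and failed runs score near zero, so the ranking surfaces program state that is specific to failures.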
16 Case Study: BC
- Array overrun bug in re-allocation routine more_arrays() leads to memory corruption and sometimes an eventual crash
- 2908 features
- All top features indicate indx being unusually large:
- storage.c:176 more_arrays(): indx > optopt
- storage.c:176 more_arrays(): indx > opterr
- storage.c:176 more_arrays(): indx > use_math
- storage.c:176 more_arrays(): indx > quiet
- storage.c:176 more_arrays(): indx > f_count
- And this indeed pinpoints the bug
17 Example II: Novelty Detection
- The goal is binary classification
- but all of the training data come from one class
- Many practical applications:
- intrusion detection
- machine diagnostics
- Basic problem: find a boundary that encloses a desired fraction of the data, and is as tight as possible
- can be done using the generalized Chebyshev inequality
- using kernels, this is a convex problem
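A minimal sketch of the boundary idea, using the plain multivariate Chebyshev inequality in the input space rather than the kernelized convex program the slide refers to. For d-dimensional data with mean mu and covariance Sigma, at most a fraction d/t of the mass has squared Mahalanobis distance above t, so t = d/alpha bounds the fraction outside by alpha without any distributional assumption. The 2-D restriction, function names, and synthetic Gaussian training data are assumptions for the example.

```python
import random

def fit_boundary(points, outside_fraction):
    """Fit an elliptical boundary around 2-D points from their mean and covariance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    det = sxx * syy - sxy * sxy
    # Entries of the inverse covariance matrix [[a, b], [b, c]]
    inverse = (syy / det, -sxy / det, sxx / det)
    threshold = 2 / outside_fraction  # d / alpha with d = 2
    return (mean_x, mean_y), inverse, threshold

def is_novel(point, model):
    """True if the point's squared Mahalanobis distance exceeds the threshold."""
    (mean_x, mean_y), (a, b, c), threshold = model
    dx, dy = point[0] - mean_x, point[1] - mean_y
    return a * dx * dx + 2 * b * dx * dy + c * dy * dy > threshold

rng = random.Random(0)
normal_activity = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
model = fit_boundary(normal_activity, outside_fraction=0.1)
flagged = sum(is_novel(p, model) for p in normal_activity) / len(normal_activity)
```

The bound is loose: for well-behaved data far fewer than 10% of the training points end up outside. The tighter boundary described on the slide is what the convex formulation buys.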
18 Case Study: Analog Circuit Design
- case study: a Low Noise Amplifier (LNA) for wireless applications
- 7 design parameters (transistor size, bias currents, etc.)
- 50,000 positive samples
- visualize the projection of feasible solutions in a plane representing second-order harmonic distortion (HD_2) and third-order harmonic distortion (HD_3)
19 Case Study: Analog Circuit Design
20 Example III: Diagnosis
- A probabilistic graphical model with 600 disease nodes, 4000 finding nodes
- Node probabilities p(f_i | d) were assessed from an expert (Shwe, et al., 1991)
- Want to compute posteriors p(d_j | f)
- Is this tractable?
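Why tractability is in question can be seen by writing the posterior out with Bayes' rule. A sketch, assuming binary disease variables $d = (d_1, \dots, d_{600})$, a prior $p(d)$, and findings that are conditionally independent given the diseases (assumptions about the form of the model, not stated on the slide):

```latex
p(d_j \mid f)
  = \frac{\sum_{d \setminus d_j} p(d) \prod_i p(f_i \mid d)}
         {\sum_{d} p(d) \prod_i p(f_i \mid d)}
```

Under these assumptions the denominator sums over $2^{600}$ joint disease configurations, so brute-force enumeration is out of reach and the structure of the model has to be exploited or approximated.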
21 Case Study: Medical Diagnosis
- Symbolic complexity:
- symbolic expressions fill dozens of pages
- would take years to compute
- Numerical simplicity:
- Jaakkola and Jordan (1999) describe a variational method based on convexity that computes approximate posteriors in less than a second
23 Crash-Only Software: Dramatically Simplifying Recovery
- Robust systems must be crash-safe
- Restart-curable bugs cause outages
- Rebooting eliminates many corruptions
- Why support any other kind of shutdown/restart?
- Crash-only software:
- Shutdown = crash, recover = restart
- Software components provide an external "power switch" independent of component behavior
- Recovery is inexpensive/safe to try
- Crash-only "power switch" infrastructure is simpler than apps, common to all of them
- Higher confidence that it will work
- Like transaction invariants, yet focused on recovery
- Can machine learning and statistical monitoring approaches be applied during online operations?
24 Crash-Only Software: Simplified Recovery Management
- Failure detection and recovery management are hard:
- How to detect that something's wrong?
- How do you know when recovery is really necessary?
- Will a particular recovery technique work?
- What is the effect on online performance?
- What if you needlessly over-recover?
- Predictable, fast recovery simplifies failure detection and recovery management:
- Something doesn't look the way it used to: anomaly
- Not all anomalies are failures... but over-recovering is OK
- If rebooting a suspected-bad component doesn't work, reboot its larger containing group, recursively
- Leverage for applying statistical monitoring and machine learning
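The recursive recovery policy on this slide can be sketched as a walk up a containment tree. This is an illustration under assumptions: the `Component` class and the `reboot` and `is_healthy` hooks are hypothetical names, not an API from the RADS work.

```python
class Component:
    """A restartable unit; parent is the larger group that contains it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def recover(component, reboot, is_healthy):
    """Reboot the suspect, then ever-larger containing groups, until healthy.

    reboot(component) and is_healthy() are caller-supplied hooks.
    Returns the names rebooted, in order.
    """
    rebooted = []
    target = component
    while target is not None:
        reboot(target)
        rebooted.append(target.name)
        if is_healthy():
            break
        target = target.parent
    return rebooted

# Toy scenario: the fault lives in state shared by the whole group
node = Component("node")
group = Component("group", parent=node)
leaf = Component("leaf", parent=group)
state = {"healthy": False}

def reboot(component):
    if component.name == "group":
        state["healthy"] = True

history = recover(leaf, reboot, lambda: state["healthy"])
```

The policy needs no diagnosis of the fault: because reboots are cheap and safe, trying the smallest scope first and escalating is acceptable even when the first guess is wrong.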
25 Crash-Only Software: Practical to Build
- Case studies: two crash-only state-storage subsystems (for session state and durable state)
- OK to crash any node at any time for any reason
- Recovery is highly predictable, doesn't impact online performance
- Replication provides probabilistic durability and capacity during recovery
- Access pattern exploited for consistency guarantees
- Nine activity and state statistics monitored per storage brick
- Metrics compared against those of peer bricks
- Basic idea: changes in workload tend to affect all bricks equally
- Underlying (weak) assumption: most bricks are doing mostly the right thing most of the time
- Anomaly in 6 or more (out of 9) metrics: reboot brick
- Simple thresholding and substring-frequency used to determine which metrics are anomalous
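A sketch of the peer-comparison rule. The "6 of 9 metrics" vote is from the slide; the anomaly test used here (relative deviation from the peer median above a fixed tolerance) is a simple stand-in chosen for illustration, not the thresholding or substring-frequency methods the slide mentions. Names, the tolerance, and the metric values are assumptions.

```python
from statistics import median

ANOMALY_VOTES = 6  # from the slide: anomaly in 6 or more of 9 metrics

def anomalous_metrics(brick, peers, tolerance=0.5):
    """Count metrics where a brick deviates from the peer median by more than tolerance."""
    count = 0
    for i, value in enumerate(brick):
        reference = median(peer[i] for peer in peers)
        scale = abs(reference) if reference != 0 else 1.0
        if abs(value - reference) / scale > tolerance:
            count += 1
    return count

def bricks_to_reboot(metrics_by_brick):
    """Return names of bricks that look unlike their peers on enough metrics."""
    victims = []
    for name, metrics in metrics_by_brick.items():
        peers = [m for other, m in metrics_by_brick.items() if other != name]
        if anomalous_metrics(metrics, peers) >= ANOMALY_VOTES:
            victims.append(name)
    return victims

# Hypothetical snapshot: nine statistics per brick
base = [100, 50, 10, 200, 5, 80, 30, 60, 20]
snapshot = {
    "brick0": base,
    "brick1": [v * 1.02 for v in base],
    "brick2": [v * 0.98 for v in base],
    "brick3": [v * 3 for v in base[:7]] + base[7:],  # 7 of 9 metrics far off
}
victims = bricks_to_reboot(snapshot)
```

Comparing against peers rather than a stored baseline means a workload shift that moves every brick's metrics together does not trigger reboots.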
26 Crash-Only Statistical Monitoring: Resilience to Real-World Transients
- Simple fault model: observed anomalies coerced into crash faults
- Surprise! Statistical monitoring catches many real-world faults, without a pre-established baseline:
- Memory bitflips in code, data, checksums (=> crash)
- Hang/timeout/freeze
- Network loss (drop up to 70% of packets randomly)
- Hiccup (e.g., from garbage collection)
- Persistent slowdown (one node lags the others)
- Overload (TCP-like mechanism used to generate backpressure)
27 Generalizing Crash-Only: Micro-reboots
- Add micro-reboot (uRB) support to middleware:
- Enhance open-source JBoss J2EE application server with fault injection, code path tracing, micro-reboots
- Use automated fault injection and observation to infer propagation of exceptions
- During operation, micro-reboot components or component groups suspected of being correlated to an observed failure
- uRBs improve performability:
- 2-3 orders of magnitude faster than full reboots or application reload
- Minimizes disruption to users of other (non-faulty) parts of system
- Goal is fast recovery, not causal analysis
- Fast, cheap uRBs plus statistical monitoring provide a degree of application-generic failure detection and recovery
28 Crash-Oriented Software: Systematic Approach
- Some design lessons already learned:
- OK to say no, OK to make mistakes, interchangeable parts
- Systematic approach for generic componentized apps ...
- Compiler and languages technology to understand what makes an app amenable to micro-reboots or crash-only (c/o) design generally
- E.g., tracking state management across app components
- E.g., establishing observational equivalence between executions with and without micro-recovery
- Goal:
- Static and dynamic analysis of when it is safe to use generic recovery
- Aggressive application of machine learning and statistical monitoring to trigger generic recovery mechanisms
- High confidence in mechanisms due to simplicity and orthogonality
30 Self-Verifiable Protocols: Statement of the Problem
- Problem: detect and contain network effects of misconfigurations and faulty/malicious components
- Approach: design network protocols so each component verifies correct behavior of the protocol
- Examples:
- e2e protocols
- routing (BGP) protocols
31 Self-Verifiable Protocols: Case Study BGP
- Propagating invalid BGP routes can bring the Internet down
- Multiple causes:
- Router misconfigurations happen daily, yielding outages lasting hours
- Malicious routers: huge potential threat
- Routers with default passwords
- Possible to buy routers' passwords on darknets
- Existing solutions:
- Hard to deploy (e.g., Secure-BGP), or insufficient security
- Our solution:
- Whisper: verify the correctness of router advertisements
- Listen: verify the reachability on the data plane
32 Self-Verifiable Protocols: BGP Whisper
- Use redundancy to check consistency of peers' information
- Whisper game:
- Group sits in a circle; one person whispers a secret phrase to both neighbors
- Person at other end concludes:
- Phrase is correct if the same phrase arrives from both neighbors
- Otherwise, at least one phrase is incorrect
33 Self-Verifiable Protocols: BGP Whisper
- AS1 advertises its address prefix
- Chooses a secret key x, and sends y = h(x)
- h(): well-known one-way hash function
- Every router replaces y with h(y) before forwarding
- AS4 performs a consistency check on the values y1 and y2 received along the two routes (for the routes in the figure: does h(y2) equal y1?)
- If yes, assume both routes are correct
- If no, at least one route is incorrect (but don't know which): raise a flag
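[Figure: AS1 chooses a secret key x. One advertisement travels toward AS4 through AS2 and AS3 as (AS1, y1 = h(x)), then (AS1, AS2, y1 = h^2(x)), then (AS1, AS2, AS3, y1 = h^3(x)). A second advertisement travels toward AS4 through the other node labeled AS3 as (AS1, y2 = h(x)), then (AS1, AS3, y2 = h^2(x)).]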
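The hash-chain check can be sketched as follows. SHA-256 stands in for the one-way function h(), and the rule "hash the shorter route's value forward by the difference in path length, then compare" is an interpretation of the figure's values (y1 = h^3(x) on a three-AS path, y2 = h^2(x) on a two-AS path); both, along with the function names, are assumptions rather than the protocol's specification.

```python
import hashlib

def h(value: bytes) -> bytes:
    """One-way hash; SHA-256 is an assumed stand-in for the slide's h()."""
    return hashlib.sha256(value).digest()

def h_n(value: bytes, n: int) -> bytes:
    """Apply h() n times."""
    for _ in range(n):
        value = h(value)
    return value

def advertise(secret: bytes, path_length: int) -> bytes:
    """Value attached to the route after it has crossed path_length ASes."""
    return h_n(secret, path_length)

def consistent(sig_a: bytes, len_a: int, sig_b: bytes, len_b: int) -> bool:
    """Hash the shorter route's value forward and compare with the longer one."""
    if len_a > len_b:
        sig_a, len_a, sig_b, len_b = sig_b, len_b, sig_a, len_a
    return h_n(sig_a, len_b - len_a) == sig_b

x = b"AS1 secret key"
y1 = advertise(x, 3)                  # route AS1, AS2, AS3
y2 = advertise(x, 2)                  # route AS1, AS3
forged = h_n(b"attacker guess", 2)    # bogus origin that does not know x

routes_agree = consistent(y1, 3, y2, 2)
forgery_detected = not consistent(y1, 3, forged, 2)
```

As on the slide, a mismatch only shows that at least one of the two routes is wrong; the check cannot say which.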
34 Self-Verifiable Protocols: BGP Listen
- Monitor progress of TCP flows
- If a TCP flow doesn't make progress, it might be because the route is incorrect
- Use heuristics to reduce number of false positives and negatives
- Still difficult to handle traffic patterns like port scanners
- Use SLT techniques to improve the detection accuracy?
35 Self-Verifiable Protocols: Status and Future Plans
- Two examples:
- BGP verifications (Listen and Whisper)
- Can trigger alarms and contain malicious routers
- Minimal changes to BGP; incrementally deployable (Listen)
- Self-verifiable CSFQ
- Per-flow isolation without maintaining per-flow state
- Detect and contain malicious flows
- Ultimate goal: develop a distributed system able to self-diagnose and self-repair
- Eliminate faulty components
- Minimum: raise a flag in case of misconfigurations and attackers
- Develop a set of principles and techniques for robust protocols
36 Enabling Technology: Edge Services by Network Appliances
- In-the-Network Processing: the Computer IS THE Network
37 Generic PNE Architecture
[Figure: generic PNE architecture; recoverable labels: Tag Mem, Rules, Programs]
38 Enabling Technology: Programmable Networks
- Problem:
- Common programming/control environment for diverse network elements to realize the full power of inside-the-network services and applications
- Approach:
- Software toolkit and VM architecture for PNEs, with retargetable optimized backend for diverse appliance-specific architectures
- Current focus:
- Network health monitoring, protocol interworking and packet translation services, iSCSI processing and performance enhancement, intrusion and worm detection and quarantining
- Potential impact:
- Open framework for multi-platform appliances, enabling third-party service development
- Provable application properties and invariants; avoidance of configuration and "latest patch not installed" errors
39 Enabling Technology: Programmable Networks
- Generalized PNE programming and control model:
- Generalized virtual machine model for this class of devices
- Retargetable for different underlying implementations
- Edge services of interest:
- Network measurement and monitoring supporting model formation and statistical anomaly detection
- Framework for inside-the-network protocol listening
- Selective blocking/filtering/quarantining of traffic
- Application-specific routing
- Faster detection and recovery from routing failures than is possible from existing Internet protocols
- Implementation of self-verifiable protocols
40 Security of Networked Systems: Learning Systems Opportunity
- New focus on network-wide attacks:
- E.g., worms, denial of service
- Arise suddenly, spread quickly
- No time to deploy patches or filters to protect machines
- SLT offers promise for improvements:
- Distributed, so information is shared across machines
- Handles changes in user behavior, preventing false positives
- Truly distributed SLT systems are possible that can detect and protect against very large-scale security attacks
41 Security of Networked Systems: Technical Approach
- Mechanisms to learn, share, repair against potential threats to dependability
- Strengthen assurance of shared information via lightweight authentication and encryption
- TESLA authentication system: replaces public-key crypto with lightweight symmetric encryption; uses time asymmetry to provide assurance
- Messages initially encrypted, verification keys revealed later; prevents attacker from using a received key to forge messages
- Variations provide instant authentication
- Athena system: generate random instances of secure protocols
- Ultra-fast checking software: model-checking and proof-theoretic techniques to verify protocols against stated requirements
- Intelligently generate the most efficient secure protocol satisfying requirements, or a random instance of a secure protocol satisfying a given set of requirements
- Apply for SLT systems to more quickly exchange information
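The "keys revealed later" idea behind TESLA can be illustrated with a one-way key chain and delayed disclosure. This is a heavily simplified sketch: it uses HMAC-SHA256 as the symmetric primitive, replaces the loosely synchronized clocks of the real protocol with explicit interval numbers, and omits key derivation. All class and function names are assumptions for the example.

```python
import hashlib
import hmac

def h(value: bytes) -> bytes:
    return hashlib.sha256(value).digest()

def h_n(value: bytes, n: int) -> bytes:
    for _ in range(n):
        value = h(value)
    return value

def make_key_chain(seed: bytes, length: int):
    """Keys K_0..K_length with K_i = h(K_{i+1}); K_0 is the public commitment."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    chain.reverse()
    return chain

def mac(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()

class Receiver:
    def __init__(self, commitment: bytes):
        self.last_key = commitment   # most recent key known to be authentic
        self.last_index = 0
        self.pending = {}            # interval -> [(message, tag)]

    def on_packet(self, interval: int, message: bytes, tag: bytes):
        # Accept only packets whose key has not been disclosed yet;
        # anything later could have been forged with the revealed key.
        if interval > self.last_index:
            self.pending.setdefault(interval, []).append((message, tag))

    def on_key_disclosed(self, interval: int, key: bytes):
        """Check the key against the chain, then release the messages it authenticates."""
        if interval <= self.last_index:
            return []
        if h_n(key, interval - self.last_index) != self.last_key:
            return []  # not a key from the sender's chain
        self.last_key, self.last_index = key, interval
        return [m for m, tag in self.pending.pop(interval, [])
                if hmac.compare_digest(mac(key, m), tag)]

chain = make_key_chain(b"sender seed", 3)
receiver = Receiver(commitment=chain[0])
receiver.on_packet(1, b"route update", mac(chain[1], b"route update"))
receiver.on_packet(1, b"forged", mac(b"wrong key", b"forged"))
accepted = receiver.on_key_disclosed(1, chain[1])

# After disclosure, the revealed key is useless to an attacker
receiver.on_packet(1, b"late forgery", mac(chain[1], b"late forgery"))
late = receiver.on_key_disclosed(1, chain[1])
```

The asymmetry comes entirely from timing: only the sender knows a key while it is in use, and by the time receivers learn it, they no longer accept packets authenticated with it.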
43 System Prototype
- Comprehensive system architecture
- Reduction of SLT to practical software components embedded within a distributed systems context
- Exhibition of an architecture for dramatically improving the reliability and security of important systems through observation-coordination-adaptation mechanisms
44 RADS Prototype Applications
- E-mail Systems/Messaging
- Scale
- Distribution, heterogeneity
- Non-stop
- Reactive Systems
- E.g., Distributed worm detection
- Network/Web Services
- Financial Applications
- Collective Decision Making/Electronic Voting
- Security, privacy
- Non-stop
45 RADS Conceptual Architecture
[Figure: layered architecture diagram; labels in order of appearance]
- Operator, User
- Prototype applications: e-voting, messaging, e-mail, etc.
- Programming abstractions for roll-back; SLT services
- Crash-oriented services; observation infrastructure for system SLT
- Application-specific overlay network
- Verifiable protocols; fast detection and route recovery; observation infrastructure for network SLT
- PNEs at the edge networks
- Commodity Internet
47 Summary and Conclusions