A Research Program in Reliable Adaptive Distributed Systems (RADS)

Slides: 48

Provided by: Rand220

Learn more at: http://bnrg.eecs.berkeley.edu

1
A Research Program in Reliable Adaptive Distributed Systems (RADS)

Armando Fox, Stanford University
Michael Jordan, Randy Katz, George Necula, David
Patterson, Doug Tygar,
University of California, Berkeley

2
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

3
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

4
Networked Systems: Current State-of-the-Art

Today's systems are fragile and easily broken, yielding
poor reliability and security
Configuration by humans is overwhelmingly complex
and infrequently correct, yielding a lack
of dependability and introducing vulnerabilities
50% of outages, 90% of security break-ins attributed
to configuration
Attackers exploit known problems faster than
system managers apply known fixes
Overly focused on performance, performance, and
cost-performance
Systems based on fundamentally incorrect
assumptions
Humans are perfect
Software will eventually be bug free
Hardware MTBF is already very large, and will
continue to increase
Maintenance costs irrelevant vs. purchase price

5
Networked Systems: Cost of Failure and its
Inevitability

Outage Costs
Amazon: revenue $3.1B, 7,744 employees
Revenue (24x7): $350k per hour
Employee productivity costs: $250k per hour
Total downtime costs: $600,000 per hour
Employee cost/hour comparable to revenue, even
for an Internet company
People/HW/SW failures are facts, not problems
"If a problem has no solution, it may not be a
problem, but a fact--not to be solved, but to be
coped with over time" (Shimon Peres, "Peres's
Law")
Recovery/repair is how we cope with them
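
The per-hour figures on this slide can be roughly reproduced with a few lines of arithmetic. The sketch below assumes the amounts are US dollars and that revenue accrues evenly over every hour of the year; the function names are illustrative and the per-employee figure is only what the slide's numbers imply, not a stated fact.

```python
HOURS_PER_YEAR = 24 * 365  # a 24x7 business earns revenue in every hour


def hourly_revenue(annual_revenue):
    """Average revenue per hour, assuming it accrues evenly all year."""
    return annual_revenue / HOURS_PER_YEAR


def downtime_cost_per_hour(revenue_per_hour, productivity_per_hour):
    """Total hourly cost of an outage: lost revenue plus idle staff."""
    return revenue_per_hour + productivity_per_hour


revenue = hourly_revenue(3.1e9)                    # close to the slide's 350k
total = downtime_cost_per_hour(350_000, 250_000)   # the slide's rounded inputs
implied_cost_per_employee_hour = 250_000 / 7744    # implied, not stated

print(round(revenue), total, round(implied_cost_per_employee_hour, 2))
```

The implied cost of roughly 32 per employee-hour is what makes the productivity loss comparable to the lost revenue.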

6
Principles for Reliable Adaptive Distributed
Systems

Given that errors occur, design to recover rapidly
Partial restart
Crash-only software (one way to start, one way to stop)
Given that humans make (most of the) errors, build
tools to help the operator find and repair problems
Pinpoint the error
Undo of human error
Note: errors are often associated with configuration
Recovery benchmarks to measure progress
What you can't measure, you can't improve
Collect real failure data to drive benchmarks

7
Networked Systems: Components of New Approach

Statistical learning algorithms that observe and
predict future behaviors
Verification techniques that check for correct
behavior, reveal vulnerabilities, harness
techniques for the rapid generation of behaviors
with desirable properties
Programmable network elements allowing active
code to be inserted into the network, to provide
observation and enforcement points without the
need for access to user end systems

8
Interdisciplinary Expertise

SLT (Jordan), Network Services/Protocols (Fox,
Katz, Patterson, Stoica), and Verification
Methods applied to network and security behaviors
(Stoica, Tygar)
Comprehensive distributed architecture embedding
SLT as building block for critical components for
system observation, coordination, inference,
correction, and evolution of behaviors
Components suitable for embedding in distributed
systems
Network behaviors that reveal correct or
incorrect operation of higher-level network
applications
Embedding observational and inference means at
strategic points in the network, obviating need
to modify end hosts or applications
System level heterogeneity and ability to
generate new behaviors on demand in response to a
dynamic system threat environment to achieve
enhanced dependability and resilience to attack
Enabling applications for investigation will
include web services, intrusion detection,
storage access

9
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

10
Statistical Learning Theory

Toolbox for design/analysis of adaptive systems
Algorithms for classification, diagnosis,
prediction, novelty detection, outlier detection,
quantile estimation, density estimation, feature
selection, variable selection, response surface
optimization, sequential decision-making
Classification algorithms
Recent scaling breakthroughs: 10K features,
millions of data points
Kernel machines: functional analysis and convex
optimization
Generalized inner product: similarities among data
point pairs
Defined for many data types
Classical linear statistical algorithms can be
kernelized to give state-of-the-art nonlinear SLT
algorithms in many areas
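
As a rough illustration of a kernel acting as a generalized inner product, the sketch below compares a plain linear kernel with a Gaussian (RBF) kernel and builds the matrix of pairwise similarities. The function names, the sample points, and the `gamma` value are assumptions chosen for the example, not taken from the slides.

```python
import math


def linear_kernel(x, y):
    """Ordinary inner product: the kernel behind classical linear methods."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, kernel):
    """Pairwise similarities; a kernelized algorithm only ever sees this."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, rbf_kernel)
```

Swapping `rbf_kernel` for `linear_kernel` in the last line recovers the linear case, which is the sense in which a linear algorithm is "kernelized".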

11
Statistical Learning Theory

Novelty detection problem
Unlimited observations reflecting normal
activity, yet few (or no) instances that reflect
an attack or a bug
Second-order cone program: a convex optimization
problem with an efficient solution method
Given a cloud of data in a high-dimensional feature
space, place a boundary around it to guarantee
that only a small fraction falls outside
Needed: on-line variants of SLT algorithms that
update the learning system's state based on small
sets of data
Available for some kernelized problems
On-line versions of the best algorithms have yet
to be developed!

12
Statistical Learning Theory

Super kernels: combine heterogeneous data via
multiple kernels
Semidefinite programs: convex optimization
problems with efficient solutions involving
efficient decomposition techniques
Useful in fusing evidence at distributed nodes
Problems of interest require combined parameter
estimation and optimization
Response surface methodology: building local
mappings from configurations to performance, and
suggesting gradient directions in configuration
space leading to performance improvements
Policy-gradient methods: SLT algorithms that make
sequences of decisions, yielding a behavior or
policy; policies have been successfully developed for
nonlinear control problems involving high degrees
of freedom
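
The response-surface idea can be sketched as follows: probe performance around the current configuration, estimate a local gradient, and step in the improving direction. This is a deliberately simplified stand-in that uses finite differences rather than a fitted regression surface; `toy_throughput`, its optimum at (4, 8), and the step sizes are hypothetical.

```python
def estimate_gradient(measure, config, step=0.1):
    """Local slope of performance w.r.t. each knob, by central differences."""
    gradient = []
    for i in range(len(config)):
        up = list(config)
        down = list(config)
        up[i] += step
        down[i] -= step
        gradient.append((measure(up) - measure(down)) / (2 * step))
    return gradient


def improve(measure, config, rounds=50, rate=0.2):
    """Repeatedly move the configuration along the estimated gradient."""
    config = list(config)
    for _ in range(rounds):
        gradient = estimate_gradient(measure, config)
        config = [c + rate * g for c, g in zip(config, gradient)]
    return config


def toy_throughput(config):
    """Hypothetical performance model with its best point at (4, 8)."""
    cache_size, thread_count = config
    return -(cache_size - 4.0) ** 2 - (thread_count - 8.0) ** 2


best = improve(toy_throughput, [1.0, 1.0])
```

In a real system `measure` would be a noisy benchmark run, so each probe would need repetition and the step size would need tuning.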

13
Statistical Machine Learning

Kernel methods
neural network heritage
convex optimization algorithms
kernels available for strings, trees, graphs,
vectors, etc.
state-of-the-art performance in many problem
domains
frequentist theoretical foundations
Graphical models
marriage of graph theory and probability theory
recursive algorithms on graphs
modular design
state-of-the-art performance in many problem
domains
Bayesian theoretical foundations

14
Vision

First 50 years of computer science
manually-engineered systems
lack of adaptability, robustness, and security
no concern with closing the loop with the
environment
Next 50 years of computer science
statistical learning systems throughout the
infrastructure
self-configuring, adaptive, sentient systems
perception, reasoning, decision-making cycle

15
Example I: Statistical Bug-finding

Programs are buggy, yet people use them
Exploit this: use user trials to debug programs
Outline of system
Instrument programs to take samples at runtime of
program state
Collect information over the Internet
Learn a statistical classifier based on
successful and failed runs, using feature
selection methods to pinpoint the bugs
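
A much-simplified sketch of the last step: given sampled predicates from many runs, rank each predicate by how much more often runs fail when it was observed true. This scoring rule is a stand-in for the classifier and feature selection named on the slide, and the predicate names and run data are hypothetical.

```python
from collections import Counter


def rank_predicates(runs):
    """Rank predicates by failure rate when observed true, minus the base rate.

    `runs` is a list of (predicates_observed_true, run_failed) pairs.
    """
    base_rate = sum(1 for _, failed in runs if failed) / len(runs)
    seen = Counter()
    seen_in_failures = Counter()
    for predicates, failed in runs:
        for predicate in predicates:
            seen[predicate] += 1
            if failed:
                seen_in_failures[predicate] += 1
    scores = {p: seen_in_failures[p] / seen[p] - base_rate for p in seen}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sampled reports: which predicates held, and did the run crash?
runs = [
    ({"indx > limit", "argc > 1"}, True),
    ({"indx > limit"}, True),
    ({"argc > 1"}, False),
    ({"argc > 1", "quiet == 0"}, False),
    ({"quiet == 0"}, False),
    ({"indx > limit", "quiet == 0"}, True),
]
ranking = rank_predicates(runs)
```

The top-ranked predicate is the one most strongly associated with failure, which is where a developer would look first.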

16
Case Study: bc

Array overrun bug in re-allocation routine
more_arrays() leads to memory corruption and
sometimes an eventual crash
2908 features
All top features indicate indx being unusually
large
storage.c:176 more_arrays() indx > optopt
storage.c:176 more_arrays() indx > opterr
storage.c:176 more_arrays() indx > use_math
storage.c:176 more_arrays() indx > quiet
storage.c:176 more_arrays() indx > f_count
And this indeed pinpoints the bug

17
Example II: Novelty Detection

The goal is binary classification
but all of the training data come from one class
Many practical applications
intrusion detection
machine diagnostics
Basic problem---find a boundary that encloses a
desired fraction of the data, and is as tight as
possible
can be done using the generalized Chebyshev
inequality
using kernels, this is a convex problem
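
The boundary idea can be illustrated with a distribution-free bound. The sketch below is a simple stand-in, not the kernelized convex program from the slide: it standardizes each feature, sums the squared deviations, and uses Markov's inequality (the sum has mean d for d features) to pick a threshold that at most the chosen fraction of normal data can exceed. All names and the synthetic data are assumptions.

```python
import random


def fit_boundary(samples, outside_fraction=0.05):
    """Learn per-feature mean/variance and a threshold from normal data only.

    Assumes every feature actually varies (non-zero variance).
    """
    n, d = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(d)]
    variances = [sum((s[i] - means[i]) ** 2 for s in samples) / n
                 for i in range(d)]
    # The score below has mean d, so P(score >= t) <= d / t (Markov).
    threshold = d / outside_fraction
    return means, variances, threshold


def score(point, means, variances):
    """Sum of squared standardized deviations from the normal profile."""
    return sum((x - m) ** 2 / v for x, m, v in zip(point, means, variances))


def is_novel(point, model):
    means, variances, threshold = model
    return score(point, means, variances) > threshold


rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(5, 2)) for _ in range(500)]
model = fit_boundary(normal, outside_fraction=0.05)
```

The bound is loose, so the boundary is far from "as tight as possible"; tightening it is exactly what the optimization formulation on the slide is for.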

18
Case Study: Analog Circuit Design

case study---a Low Noise Amplifier (LNA) for
wireless applications
7 design parameters (transistor size, bias
currents, etc)
50,000 positive samples
visualize the projection of feasible solutions in
a plane representing second-order harmonic
distortion (HD_2) and third-order (HD_3) harmonic
distortion

19
Case Study: Analog Circuit Design

20
Example III: Diagnosis

A probabilistic graphical model with 600 disease
nodes, 4000 finding nodes
Node probabilities p(f_i | d) were assessed from
an expert (Shwe, et al., 1991)
Want to compute posteriors p(d_j | f)
Is this tractable?

21
Case Study: Medical Diagnosis

Symbolic complexity
symbolic expressions fill dozens of pages
would take years to compute
Numerical simplicity
Jaakkola and Jordan (1999) describe a variational
method based on convexity that computes
approximate posteriors in less than a second

22
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

23
Crash-Only Software: Dramatically Simplifying
Recovery

Robust systems must be crash-safe
Restart-curable bugs cause outages
Rebooting eliminates many corruptions
Why support any other kind of shutdown/restart?
Crash-only software:
Shutdown = crash, recover = restart
Software components provide an external "power switch"
independent of component behavior
Recovery is inexpensive/safe to try
Crash-only "power switch" infrastructure is
simpler than apps, common to all of them
Higher confidence that it will work
Like transaction invariants, yet focused on
recovery
Can machine learning and statistical monitoring
approaches be applied during online operations?

24
Crash-Only Software: Simplified Recovery
Management

Failure detection and recovery management are
hard
How to detect that something's wrong?
How do you know when recovery is really
necessary?
Will a particular recovery technique work?
What is the effect on online performance?
What if you needlessly over-recover?
Predictable, fast recovery simplifies failure
detection and recovery management
Something doesn't look the way it used to:
anomaly
Not all anomalies are failures... but
over-recovering is OK
If rebooting a suspected-bad component doesn't
work, reboot its larger containing group,
recursively
Leverage for applying statistical monitoring and
machine learning
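
The recursive policy above can be sketched as a walk up a containment tree: reboot the suspect, check health, and escalate to the enclosing group if the problem persists. The `Component` class, the `reboot` and `is_healthy` callbacks, and the three-level hierarchy are all hypothetical.

```python
class Component:
    """A restartable unit; `parent` is the larger group that contains it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def recover(component, reboot, is_healthy):
    """Reboot the suspect, then ever-larger containing groups, until healthy.

    Returns the names of everything that was rebooted, innermost first.
    """
    rebooted = []
    target = component
    while target is not None:
        reboot(target)
        rebooted.append(target.name)
        if is_healthy():
            return rebooted
        target = target.parent
    raise RuntimeError("still unhealthy after rebooting the outermost group")


# Hypothetical hierarchy: server contains app, app contains bean.
server = Component("server")
app = Component("app", parent=server)
bean = Component("bean", parent=app)

state = {"healthy": False}


def reboot(component):
    # In this toy fault, only restarting the whole app clears the problem.
    if component.name == "app":
        state["healthy"] = True


trace = recover(bean, reboot, lambda: state["healthy"])
```

Because each reboot is cheap and safe to try, starting with the smallest suspect costs little even when the guess is wrong.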

25
Crash-Only Software: Practical to Build

Case studies: two crash-only state-storage
subsystems (for session state and durable state)
OK to crash any node at any time for any reason
Recovery is highly predictable, doesn't impact
online performance
Replication provides probabilistic durability
and capacity during recovery
Access pattern exploited for consistency
guarantees
Nine activity and state statistics monitored
per storage brick
Metrics compared against those of peer bricks
Basic idea: changes in workload tend to affect
all bricks equally
Underlying (weak) assumption: most bricks are
doing mostly the right thing most of the time
Anomaly in 6 or more (out of 9) metrics: reboot
brick
Simple thresholding and substring-frequency used
to determine which metrics are anomalous
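
A minimal sketch of the peer-comparison rule: each of a brick's metrics is compared with the same metric on its peers, and the brick is rebooted when at least 6 of 9 look anomalous. The median-and-spread test used here is an assumed stand-in for the thresholding and substring-frequency methods named on the slide; metric names, values, and the tolerance are hypothetical.

```python
from statistics import median


def anomalous_metrics(brick, peers, tolerance=3.0):
    """Metrics where this brick sits far from the median of its peers."""
    flagged = []
    for metric, value in brick.items():
        peer_values = [peer[metric] for peer in peers]
        center = median(peer_values)
        # Median absolute deviation; tiny floor avoids dividing by zero.
        spread = median([abs(v - center) for v in peer_values]) or 1e-9
        if abs(value - center) / spread > tolerance:
            flagged.append(metric)
    return flagged


def should_reboot(brick, peers, min_anomalies=6):
    """Reboot when at least `min_anomalies` metrics disagree with the peers."""
    return len(anomalous_metrics(brick, peers)) >= min_anomalies


metrics = [f"m{i}" for i in range(9)]
peers = [{m: 100 + k for m in metrics} for k in range(5)]

healthy = {m: 101 for m in metrics}
sick = {m: (150 if i < 7 else 102) for i, m in enumerate(metrics)}
blip = {m: (150 if i < 2 else 102) for i, m in enumerate(metrics)}
```

Comparing against peers rather than a stored baseline is what lets a workload change, which shifts every brick together, pass without triggering reboots.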

26
Crash-Only and Statistical Monitoring: Resilience
to Real-World Transients

Simple fault model: observed anomalies coerced
into crash faults
Surprise! Statistical monitoring catches
many real-world faults, without a pre-established
baseline
Memory bitflips in code, data, checksums (leading
to a crash)
Hang/timeout/freeze
Network loss (drop up to 70% of packets
randomly)
Hiccup (e.g., from garbage collection)
Persistent slowdown (one node lags the others)
Overload (TCP-like mechanism used to generate
backpressure)

27
Generalizing Crash-Only: Micro-reboots

Add micro-reboot (uRB) support to middleware
Enhance open-source JBoss J2EE application server
with fault injection, code path tracing,
micro-reboots
Use automated fault injection and observation to
infer propagation of exceptions
During operation, micro-reboot components or
component groups suspected of being correlated to
an observed failure
uRBs improve performability
2-3 orders of magnitude faster than full
reboots or application reload
Minimizes disruption to users of other
(non-faulty) parts of system
Goal is fast recovery, not causal analysis

Fast, cheap uRBs plus statistical monitoring
provide a degree of application-generic failure
detection and recovery

28
Crash-Oriented Software: Systematic Approach

Some design lessons already learned
OK to say no, OK to make mistakes,
interchangeable parts
Systematic approach for generic componentized
apps ...
Compiler and language technology to understand
what makes an app amenable to micro-reboots or
crash-only (c/o) design generally
E.g., tracking state management across app
components
E.g., establishing observational equivalence
between executions with and without
micro-recovery
Goal
Static and dynamic analysis of when it is safe to use
generic recovery
Aggressive application of machine learning and
statistical monitoring to trigger generic
recovery mechanisms
High confidence in mechanisms due to simplicity
and orthogonality

29
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

30
Self-Verifiable Protocols: Statement of the
Problem

Problem: detect and contain network effects of
misconfigurations and faulty/malicious components
Approach: design network protocols so each
component verifies correct behavior of the
protocol
Examples
e2e protocols
routing (BGP) protocols

31
Self-Verifiable Protocols: Case Study BGP

Propagating invalid BGP routes can bring the
Internet down
Multiple causes
Router misconfigurations happen daily, yielding
outages lasting hours
Malicious routers: huge potential threat
Routers with default passwords
Possible to buy router passwords on darknets
Existing solutions
Hard to deploy (e.g., Secure-BGP), or
insufficient security
Our solution
Whisper: verify the correctness of router
advertisements
Listen: verify the reachability on the data plane

32
Self-Verifiable Protocols: BGP Whisper

Use redundancy to check consistency of peers'
information
Whisper game
Group sits in a circle; one person whispers a secret
phrase to both neighbors
Person at other end concludes
Phrase is correct if same phrase from both
neighbors
Otherwise, at least one phrase is incorrect

33
Self-Verifiable Protocols: BGP Whisper

AS1 advertises its address prefix
Chooses a secret key x, and sends y = h(x)
h(): well-known one-way hash function
Every router replaces y with h(y) before forwarding
AS4 performs consistency check: h^2(y1) = h^3(y2)?
If yes, assume both routes are correct
If no, at least one route is incorrect (but we don't
know which), so raise a flag

[Figure: Whisper example over AS1, AS2, AS3, AS4. AS1 chooses secret key x. Advertisements on one path: (AS1, y1 = h(x)), (AS1, AS2, y1 = h^2(x)), (AS1, AS2, AS3, y1 = h^3(x)). On the other path: (AS1, y2 = h(x)), (AS1, AS3, y2 = h^2(x)).]

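
The hash-chain check can be sketched in a few lines. This is an illustrative reading of the slide, assuming SHA-256 as the one-way function and assuming each AS on a path applies the hash exactly once; the function names and the forged route are hypothetical. Two routes are treated as consistent when hashing each signature by the other route's path length gives the same value.

```python
import hashlib


def h(value):
    """The well-known one-way hash (SHA-256 is an assumption here)."""
    return hashlib.sha256(value).digest()


def h_iter(value, times):
    """Apply h repeatedly: h^times(value)."""
    for _ in range(times):
        value = h(value)
    return value


def advertise(secret, path):
    """Signature after the origin and every later AS on `path` hashed once."""
    return (path, h_iter(secret, len(path)))


def consistent(route_a, route_b):
    """Both signatures must meet at the same point of the hash chain."""
    path_a, y_a = route_a
    path_b, y_b = route_b
    return h_iter(y_a, len(path_b)) == h_iter(y_b, len(path_a))


x = b"secret chosen by AS1"
via_as2 = advertise(x, ["AS1", "AS2", "AS3"])
via_as3 = advertise(x, ["AS1", "AS3"])
forged = (["AS1", "AS9"], h_iter(b"attacker guess", 2))
```

A mismatch only says that at least one of the two routes is bad, which matches the "raise a flag" outcome rather than identifying the culprit.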
34
Self-Verifiable Protocols: BGP Listen

Monitor progress of TCP flows
If a TCP flow doesn't make progress, it might be
because the route is incorrect
Use heuristics to reduce number of false
positives and negatives
Still difficult to handle traffic patterns like
port scanners
Use SLT techniques to improve the detection
accuracy?
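
One possible heuristic of this kind, sketched under stated assumptions: group observed flows by destination prefix and flag a prefix only when enough flows were seen and most of them stalled. The thresholds, the prefixes, and the notion of "made progress" as a single boolean are hypothetical simplifications, not the rules used by Listen.

```python
from collections import defaultdict


def suspicious_prefixes(flows, min_flows=5, max_stalled_fraction=0.8):
    """Prefixes where most observed TCP flows failed to make progress.

    `flows` is a list of (destination_prefix, made_progress) pairs.
    Requiring `min_flows` avoids flagging a prefix on one or two samples.
    """
    totals = defaultdict(int)
    stalled = defaultdict(int)
    for prefix, made_progress in flows:
        totals[prefix] += 1
        if not made_progress:
            stalled[prefix] += 1
    return sorted(
        prefix for prefix in totals
        if totals[prefix] >= min_flows
        and stalled[prefix] / totals[prefix] >= max_stalled_fraction
    )


flows = (
    [("10.1.0.0/16", False)] * 9 + [("10.1.0.0/16", True)]
    + [("10.2.0.0/16", True)] * 10
    + [("10.3.0.0/16", False)] * 2
)
flagged = suspicious_prefixes(flows)
```

A port scanner produces many stalled flows to otherwise reachable prefixes, which is why fixed thresholds like these produce false positives.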

35
Self-Verifiable Protocols: Status and Future Plans

Two examples
BGP verifications (Listen and Whisper)
Can trigger alarms and contain malicious routers
Minimal changes to BGP; incrementally deployable
(Listen)
Self-verifiable CSFQ
Per-flow isolation without maintaining per-flow
state
Detect and contain malicious flows
Ultimate goal: develop a distributed system able to
self-diagnose and self-repair
Eliminate faulty components
At minimum: raise a flag in case of misconfigurations
and attacks
Develop set of principles and techniques for
robust protocols

36
Enabling Technology: Edge Services by Network
Appliances

In-the-Network Processing: the Computer IS THE
Network

37
Generic PNE Architecture

[Figure: generic PNE architecture; surviving labels: "Tag Mem" and "Rules Programs"]

38
Enabling Technology: Programmable Networks

Problem
Common programming/control environment for
diverse network elements to realize the full power of
"inside the network" services and applications
Approach
Software toolkit and VM architecture for PNEs,
with retargetable optimized backend for diverse
appliance-specific architectures
Current Focus
Network health monitoring, protocol interworking
and packet translation services, iSCSI processing
and performance enhancement, intrusion and worm
detection and quarantining
Potential Impact
Open framework for multi-platform appliances,
enabling third party service development
Provable application properties and invariants;
avoidance of configuration and "latest patch not
installed" errors

39
Enabling Technology: Programmable Networks

Generalized PNE programming and control model
Generalized virtual machine model for this
class of devices
Retargetable for different underlying
implementations
Edge services of interest
Network measurement and monitoring supporting
model formation and statistical anomaly detection
Framework for inside-the-network protocol
listening
Selective blocking/filtering/quarantining of
traffic
Application-specific routing
Faster detection and recovery from routing
failures than is possible from existing Internet
protocols
Implementation of self-verifiable protocols

40
Security of Networked Systems: Learning Systems
Opportunity

New focus on network-wide attacks
E.g., worms, denial of service
Arise suddenly, spread quickly
No time to deploy patches or filters to protect
machines
SLT offers promise for improvements
Distributed, so information across machines is
shared
Handles changes in user behavior, preventing
false positives
Truly distributed SLT systems are possible that
can detect and protect against very large-scale
security attacks

41
Security of Networked Systems: Technical Approach

Mechanisms to learn, share, repair against
potential threats to dependability
Strengthen assurance of shared information via
lightweight authentication and encryption
TESLA authentication system: replaces public-key
crypto with lightweight symmetric encryption;
uses time asymmetry to provide assurance
Messages initially encrypted, verification keys
revealed later, which prevents an attacker from using a
received key to forge messages
Variations provide instant authentication
Athena system: generates random instances of
secure protocols
Ultra-fast checking software: model-checking and
proof-theoretic techniques to verify protocols
against stated requirements
Intelligently generate the most efficient secure
protocol satisfying requirements, or a random
instance of a secure protocol satisfying a given
set of requirements
Apply to SLT systems so they can exchange
information more quickly
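
The delayed-key idea behind TESLA can be sketched with a one-way key chain: the sender tags each message with a key that is still secret, then discloses that key later, and the receiver checks that the disclosed key hashes back to a commitment it already trusts. This is a simplified illustration assuming SHA-256 and HMAC; it leaves out the time-synchronization check that rejects messages arriving after their key was disclosed, and all names and values are hypothetical.

```python
import hashlib
import hmac


def h(value):
    return hashlib.sha256(value).digest()


def make_key_chain(seed, length):
    """chain[i] == h(chain[i + 1]); chain[0] is the public commitment."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    chain.reverse()
    return chain


def tag(key, message):
    """Symmetric message authentication code under a not-yet-disclosed key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(commitment, interval, disclosed_key, message, mac):
    """Check the key hashes back to the commitment, then check the tag."""
    value = disclosed_key
    for _ in range(interval):
        value = h(value)
    if not hmac.compare_digest(value, commitment):
        return False
    return hmac.compare_digest(tag(disclosed_key, message), mac)


chain = make_key_chain(b"sender seed", 4)
commitment = chain[0]                      # distributed authentically up front
mac = tag(chain[2], b"alert: worm seen")   # sent during interval 2
ok = verify(commitment, 2, chain[2], b"alert: worm seen", mac)
```

Because a key is useless for forgery only if it was secret when the message arrived, a real receiver must also buffer messages and enforce the disclosure schedule.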

42
Presentation Outline

A New Vision for Networked Systems
Enabling Technology: Statistical Learning Theory
Approaches for Dependability
Approaches for Security
Elements of an Experimental Prototype
Summary and Conclusions

43
System Prototype

Comprehensive system architecture
Reduction of SLT to practical software components
embedded within a distributed systems context
Exhibition of an architecture for dramatically
improving the reliability and security of
important systems through
observation-coordination-adaptation mechanisms.

44
RADS Prototype Applications

E-mail Systems/Messaging
Scale
Distribution, heterogeneity
Non-stop
Reactive Systems
E.g., Distributed worm detection
Network/Web Services
Financial Applications
Collective Decision Making/Electronic Voting
Security, privacy
Non-stop

45
RADS Conceptual Architecture

[Figure: RADS conceptual architecture; surviving labels listed below]
Operator, User
Prototype applications: E-voting, Messaging, E-Mail, etc.
Programming abstractions for roll-back
SLT services
Crash-oriented services; observation infrastructure for system SLT
Application-specific overlay network
Verifiable protocols; fast detection and route recovery; observation infrastructure for network SLT
PNEs at the edge networks
Commodity Internet

46
Presentation Outline