A Research Program in Reliable Adaptive Distributed Systems (RADS)

1
A Research Program in Reliable Adaptive Distributed Systems (RADS)
  • Armando Fox, Stanford University
  • Michael Jordan, Randy Katz, George Necula, David
    Patterson, Doug Tygar
  • University of California, Berkeley

2
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

3
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

4
Networked Systems: Current State-of-the-Art
  • Today's systems are fragile and easily broken,
    yielding poor reliability and security
  • Configuration by humans is overwhelmingly complex
    and infrequently correct, yielding lack of
    dependability and introducing vulnerabilities
  • 50% of outages and 90% of security break-ins are
    attributed to configuration
  • Attackers exploit known problems faster than
    system managers apply known fixes
  • Overly focused on performance, performance, and
    cost-performance
  • Systems based on fundamentally incorrect
    assumptions
  • Humans are perfect
  • Software will eventually be bug free
  • Hardware MTBF is already very large, and will
    continue to increase
  • Maintenance costs irrelevant vs. purchase price

5
Networked Systems: Cost of Failure and its
Inevitability
  • Outage Costs
  • Amazon: revenue $3.1B, 7,744 employees
  • Revenue (24x7): $350k per hour
  • Employee productivity costs: $250k per hour
  • Total downtime costs: $600,000 per hour
  • Employee cost/hour comparable to revenue, even
    for an Internet company
  • People/HW/SW failures are facts, not problems
  • "If a problem has no solution, it may not be a
    problem, but a fact--not to be solved, but to be
    coped with over time." Shimon Peres ("Peres's
    Law")
  • Recovery/repair is how we cope with them
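The hourly figures above can be checked with a few lines of arithmetic. This is a minimal sketch that takes the slide's annual revenue, headcount, and hourly productivity cost as given; the per-employee hourly cost is derived here and does not appear on the slide.

```python
# Figures quoted on the slide, in US dollars.
ANNUAL_REVENUE = 3.1e9
EMPLOYEES = 7744
EMPLOYEE_COST_PER_HOUR = 250_000  # productivity cost, as quoted

HOURS_PER_YEAR = 24 * 365  # a 24x7 service earns revenue every hour

revenue_per_hour = ANNUAL_REVENUE / HOURS_PER_YEAR
total_per_hour = revenue_per_hour + EMPLOYEE_COST_PER_HOUR
# Derived value (not on the slide): implied cost of one idle employee-hour.
cost_per_employee_hour = EMPLOYEE_COST_PER_HOUR / EMPLOYEES

print(f"revenue per hour:        ${revenue_per_hour:,.0f}")
print(f"total downtime per hour: ${total_per_hour:,.0f}")
print(f"cost per employee-hour:  ${cost_per_employee_hour:,.2f}")
```

The results land close to the rounded $350k and $600,000 figures on the slide.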

6
Principles for Reliable Adaptive Distributed
Systems
  • Given errors occur, design to recover rapidly
  • Partial Restart
  • Crash only software (1 way to start, stop)
  • Given humans make (most of the) errors, build
    tools to help operator find and repair problems
  • Pinpoint the error
  • Undo of human error
  • Note: errors are often associated with
    configuration
  • Recovery benchmarks to measure progress
  • What you can't measure, you can't improve
  • Collect real failure data to drive benchmarks

7
Networked Systems: Components of New Approach
  • Statistical learning algorithms that observe and
    predict future behaviors
  • Verification techniques that check for correct
    behavior, reveal vulnerabilities, harness
    techniques for the rapid generation of behaviors
    with desirable properties
  • Programmable network elements allowing active
    code to be inserted into the network, to provide
    observation and enforcement points without the
    need for access to user end systems

8
Interdisciplinary Expertise
  • SLT (Jordan), Network Services/Protocols (Fox,
    Katz, Patterson, Stoica), and Verification
    Methods applied to network and security behaviors
    (Stoica, Tygar)
  • Comprehensive distributed architecture embedding
    SLT as building block for critical components for
    system observation, coordination, inference,
    correction, and evolution of behaviors
  • Components suitable for embedding in distributed
    systems
  • Network behaviors that reveal correct or
    incorrect operation of higher-level network
    applications
  • Embedding observational and inference means at
    strategic points in the network, obviating need
    to modify end hosts or applications
  • System level heterogeneity and ability to
    generate new behaviors on demand in response to a
    dynamic system threat environment to achieve
    enhanced dependability and resilience to attack
  • Enabling applications for investigation will
    include web services, intrusion detection,
    storage access

9
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

10
Statistical Learning Theory
  • Toolbox for design/analysis of adaptive systems
  • Algorithms for classification, diagnosis,
    prediction, novelty detection, outlier detection,
    quantile estimation, density estimation, feature
    selection, variable selection, response surface
    optimization, sequential decision-making
  • Classification algorithms
  • Recent scaling breakthroughs: 10K features,
    millions of data points
  • Kernel machines: functional analysis and convex
    optimization
  • Generalized inner product: similarities among data
    point pairs
  • Defined for many data types
  • Classical linear statistical algorithms
    kernelized for state-of-the-art nonlinear SLT
    algorithms in many areas
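To make "generalized inner product, defined for many data types" concrete, here is a minimal sketch of two well-known kernels: a Gaussian (RBF) kernel on numeric vectors and a k-spectrum kernel on strings. The function names and default parameters are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel on two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def spectrum_kernel(s, t, k=2):
    """k-spectrum string kernel: inner product of k-mer count vectors."""
    count_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    count_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # A Counter returns 0 for k-mers that never occur in t.
    return sum(c * count_t[kmer] for kmer, c in count_s.items())
```

Both functions return a similarity score for a pair of data points, so the same kernelized learning algorithm can run on vectors or on strings by swapping the kernel.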

11
Statistical Learning Theory
  • Novelty Detection Problem
  • Unlimited observations reflecting normal
    activity, yet few (or no) instances that reflect
    an attack or a bug
  • Second-order cone program: a convex optimization
    problem with an efficient solution method
  • Given a cloud of data in a high-dimensional
    feature space, place a boundary around it to
    guarantee that only a small fraction falls outside
  • Needed: on-line variants of SLT algorithms that
    update the learning system's state based on small
    sets of data
  • Available for some kernelized problems
  • On-line versions of the best algorithms have yet
    to be developed!

12
Statistical Learning Theory
  • Super kernels: combine heterogeneous data via
    multiple kernels
  • Semidefinite programs, convex optimization
    problems with efficient solutions involving
    efficient decomposition techniques
  • Useful in fusing evidence at distributed nodes
  • Problems of interest require combined parameter
    estimation and optimization
  • Response surface methodology: building local
    mappings from configurations to performance, and
    suggesting gradient directions in configuration
    space leading to performance improvements
  • Policy-gradient methods: SLT algorithms that make
    sequences of decisions, yielding a behavior or
    policy; these have successfully developed policies
    for nonlinear control problems involving high
    degrees of freedom
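The response-surface idea can be sketched as follows: probe performance at small perturbations around the current configuration, estimate a local gradient, and step in that direction. This is a minimal illustration only. The `toy_throughput` function, its parameter names, the step size, and the learning rate are all hypothetical, and a real system would measure noisy performance rather than evaluate a formula.

```python
def estimate_gradient(measure, config, step=0.1):
    """Central-difference estimate of d(performance)/d(config[i])."""
    grad = []
    for i in range(len(config)):
        up = list(config)
        up[i] += step
        down = list(config)
        down[i] -= step
        grad.append((measure(up) - measure(down)) / (2 * step))
    return grad


def improve(measure, config, rate=0.1, rounds=50):
    """Repeatedly move the configuration along the estimated gradient."""
    config = list(config)
    for _ in range(rounds):
        grad = estimate_gradient(measure, config)
        config = [c + rate * g for c, g in zip(config, grad)]
    return config


def toy_throughput(config):
    """Hypothetical performance surface peaking at 8 threads, 64 MB cache."""
    threads, cache_mb = config
    return -(threads - 8.0) ** 2 - (cache_mb - 64.0) ** 2
```

Calling `improve(toy_throughput, [2.0, 16.0])` moves the configuration toward the peak of the toy surface.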

13
Statistical Machine Learning
  • Kernel methods
  • neural network heritage
  • convex optimization algorithms
  • kernels available for strings, trees, graphs,
    vectors, etc.
  • state-of-the-art performance in many problem
    domains
  • frequentist theoretical foundations
  • Graphical models
  • marriage of graph theory and probability theory
  • recursive algorithms on graphs
  • modular design
  • state-of-the-art performance in many problem
    domains
  • Bayesian theoretical foundations

14
Vision
  • First 50 years of computer science
  • manually-engineered systems
  • lack of adaptability, robustness, and security
  • no concern with closing the loop with the
    environment
  • Next 50 years of computer science
  • statistical learning systems throughout the
    infrastructure
  • self-configuring, adaptive, sentient systems
  • perception, reasoning, decision-making cycle

15
Example I: Statistical Bug-finding
  • Programs are buggy, yet people use them
  • Exploit this: use user trials to debug programs
  • Outline of system
  • Instrument programs to take samples at runtime of
    program state
  • Collect information over the Internet
  • Learn a statistical classifier based on
    successful and failed runs, using feature
    selection methods to pinpoint the bugs
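The pipeline outlined above can be illustrated with a much simpler stand-in for the learned classifier: score each sampled predicate by how much the failure rate rises in runs where that predicate was observed true. This scoring heuristic is an assumption made for illustration, not the classifier and feature-selection method the slides refer to, and the toy run data in the usage note is invented.

```python
from collections import defaultdict


def rank_predicates(runs):
    """Rank predicates by their association with failed runs.

    runs: list of (predicates_observed_true, failed) pairs, where the
    first element is a set of predicate names and the second a bool.
    Returns (predicate, score) pairs, highest score first.
    """
    total = len(runs)
    overall_fail_rate = sum(1 for _, failed in runs if failed) / total
    seen = defaultdict(int)
    seen_and_failed = defaultdict(int)
    for predicates, failed in runs:
        for p in set(predicates):
            seen[p] += 1
            if failed:
                seen_and_failed[p] += 1
    # Score: failure rate when the predicate held, minus the base rate.
    scores = {p: seen_and_failed[p] / seen[p] - overall_fail_rate
              for p in seen}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, given six runs in which a predicate such as "indx > optopt" is true only in the three failing runs, that predicate ranks first with a score of 0.5.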

16
Case Study: BC
  • Array overrun bug in re-allocation routine
    more_arrays() leads to memory corruption and
    sometimes an eventual crash (2908 features)
  • All top features indicate indx being unusually
    large
  • storage.c:176 more_arrays() indx > optopt
  • storage.c:176 more_arrays() indx > opterr
  • storage.c:176 more_arrays() indx > use_math
  • storage.c:176 more_arrays() indx > quiet
  • storage.c:176 more_arrays() indx > f_count
  • And this indeed pinpoints the bug

17
Example II: Novelty Detection
  • The goal is binary classification
  • but all of the training data come from one class
  • Many practical applications
  • intrusion detection
  • machine diagnostics
  • Basic problem: find a boundary that encloses a
    desired fraction of the data, and is as tight as
    possible
  • can be done using the generalized Chebyshev
    inequality
  • using kernels, this is a convex problem
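A minimal sketch of the boundary idea, trained on one class only. It is not the kernelized convex program from the slides. It uses a plain Chebyshev/Markov argument with per-feature variances: the mean squared scaled distance over the training set equals the number of features n, so at most a fraction alpha of training points can lie beyond n / alpha. The resulting boundary is valid but loose, not "as tight as possible". All names and the test data are illustrative.

```python
import random
import statistics


def fit_boundary(samples, outside_fraction=0.05):
    """Fit a distribution-free boundary around one-class training data."""
    dims = list(zip(*samples))
    means = [statistics.fmean(d) for d in dims]
    # Population variance per feature; assumed non-zero for every feature.
    variances = [statistics.pvariance(d) for d in dims]
    # Markov bound: at most `outside_fraction` of the training points can
    # have squared scaled distance above n / outside_fraction.
    threshold = len(means) / outside_fraction
    return means, variances, threshold


def is_novel(x, boundary):
    """True if x falls outside the fitted boundary."""
    means, variances, threshold = boundary
    d2 = sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))
    return d2 > threshold
```

A point far from the training cloud is flagged as novel, while the guarantee on the training set holds without any assumption about the shape of the distribution.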

18
Case Study: Analog Circuit Design
  • Case study: a Low Noise Amplifier (LNA) for
    wireless applications
  • 7 design parameters (transistor size, bias
    currents, etc)
  • 50,000 positive samples
  • visualize the projection of feasible solutions in
    a plane representing second-order harmonic
    distortion (HD_2) and third-order (HD_3) harmonic
    distortion

19
Case Study: Analog Circuit Design

20
Example III: Diagnosis
  • A probabilistic graphical model with 600 disease
    nodes, 4000 finding nodes
  • Node probabilities p(f_i | d) were assessed from
    an expert (Shwe et al., 1991)
  • Want to compute posteriors p(d_j | f)
  • Is this tractable?
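Why tractability is in question can be seen from a brute-force sketch. It computes exact posteriors p(d_j | f) by summing over every disease assignment, which takes 2**n terms for n diseases. The noisy-OR form of p(f_i | d) used here is an assumption, since the slide does not state the functional form, and all parameter names are illustrative.

```python
from itertools import product


def posterior_marginals(priors, leak, strength, findings):
    """Exact p(d_j = 1 | findings) by enumerating disease assignments.

    priors[j]      : p(d_j = 1)
    leak[i]        : p(f_i = 1) when no disease is present
    strength[i][j] : probability that disease j alone turns finding i on
    findings[i]    : observed value of finding i (True or False)
    """
    n = len(priors)
    evidence = 0.0
    joint_with_dj = [0.0] * n
    for d in product([0, 1], repeat=n):  # 2**n assignments
        p = 1.0
        for j in range(n):
            p *= priors[j] if d[j] else 1.0 - priors[j]
        for i, observed in enumerate(findings):
            # Noisy-OR: the finding stays off only if every cause fails.
            p_off = 1.0 - leak[i]
            for j in range(n):
                if d[j]:
                    p_off *= 1.0 - strength[i][j]
            p *= (1.0 - p_off) if observed else p_off
        evidence += p
        for j in range(n):
            if d[j]:
                joint_with_dj[j] += p
    return [x / evidence for x in joint_with_dj]
```

This is fine for a handful of diseases, but with 600 disease nodes the loop would need 2**600 iterations, which is the motivation for approximate methods.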

21
Case Study: Medical Diagnosis
  • Symbolic complexity
  • symbolic expressions fill dozens of pages
  • would take years to compute
  • Numerical simplicity
  • Jaakkola and Jordan (1999) describe a variational
    method based on convexity that computes
    approximate posteriors in less than a second

22
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

23
Crash-Only Software: Dramatically Simplifying
Recovery
  • Robust systems must be crash-safe
  • Restart-curable bugs cause outages
  • Rebooting eliminates many corruptions
  • Why support any other kind of shutdown/restart?
  • Crash-only software:
  • Shutdown = crash, recover = restart
  • Software components provide external PWR switch
    independent of component behavior
  • Recovery is inexpensive/safe to try
  • Crash-only "power switch" infrastructure is
    simpler than apps, common to all of them
  • Higher confidence that it will work
  • Like transaction invariants yet focused on
    recovery
  • Can machine learning and statistical monitoring
    approaches be applied during online operations?

24
Crash-Only Software: Simplified Recovery
Management
  • Failure detection and recovery management are
    hard
  • How to detect that something's wrong?
  • How do you know when recovery is really
    necessary?
  • Will a particular recovery technique work?
  • What is the effect on online performance?
  • What if you needlessly over-recover?
  • Predictable, fast recovery simplifies failure
    detection and recovery management
  • "Something doesn't look the way it used to":
    anomaly
  • Not all anomalies are failures... but
    over-recovering is OK
  • If rebooting a suspected-bad component doesn't
    work, reboot its larger containing group,
    recursively
  • Leverage for applying statistical monitoring and
    machine learning
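The escalation rule (reboot the suspect, and if that does not help, reboot its containing group) can be sketched as a walk up a containment tree. The `Component` class and the `reboot` and `is_healthy` hooks are hypothetical stand-ins for whatever the real system provides.

```python
class Component:
    """A node in the containment tree; parent is the enclosing group."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def recover(component, reboot, is_healthy):
    """Reboot `component`; while the symptom persists, escalate upward.

    `reboot(component)` and `is_healthy()` are caller-supplied callables.
    Returns the names of everything rebooted, in order.
    """
    rebooted = []
    target = component
    while target is not None:
        reboot(target)
        rebooted.append(target.name)
        if is_healthy():
            return rebooted
        target = target.parent  # escalate to the larger containing group
    raise RuntimeError("still unhealthy after rebooting the root group")
```

Because crash-only recovery is cheap and safe to try, starting with the smallest suspect and over-recovering when needed costs little.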

25
Crash-Only Software: Practical to Build
  • Case studies: two crash-only state-storage
    subsystems (for session state and durable state)
  • OK to crash any node at any time for any reason
  • Recovery is highly predictable, doesn't impact
    online performance
  • Replication provides probabilistic durability
    and capacity during recovery
  • Access pattern exploited for consistency
    guarantees
  • Nine activity and state statistics monitored
    per storage brick
  • Metrics compared against those of peer bricks
  • Basic idea: changes in workload tend to affect
    all bricks equally
  • Underlying (weak) assumption: most bricks are
    doing mostly the right thing most of the time
  • Anomaly in 6 or more (out of 9) metrics: reboot
    brick
  • Simple thresholding and substring-frequency used
    to determine what is anomalous
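The peer-comparison rule can be sketched as follows. The slide specifies only "6 or more out of 9" and "simple thresholding", so the median and median-absolute-deviation test and the tolerance value used here are assumptions chosen for illustration.

```python
import statistics


def anomalous_metric_count(brick, peers, tolerance=3.0):
    """Count metrics where `brick` deviates from the median of its peers.

    brick : list of metric values for the brick under test
    peers : list of metric lists, one per peer brick
    """
    count = 0
    for i, value in enumerate(brick):
        peer_values = [p[i] for p in peers]
        median = statistics.median(peer_values)
        # Median absolute deviation: a robust estimate of peer spread.
        mad = statistics.median([abs(v - median) for v in peer_values])
        if abs(value - median) > tolerance * max(mad, 1e-9):
            count += 1
    return count


def should_reboot(brick, peers, min_anomalous=6):
    """Apply the slide's rule: reboot when enough metrics look anomalous."""
    return anomalous_metric_count(brick, peers) >= min_anomalous
```

Comparing against peers rather than a fixed baseline means a workload change that shifts every brick equally does not trigger a reboot.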

26
Crash-Only and Statistical Monitoring: Resilience
to Real-World Transients
  • Simple fault model: observed anomalies coerced
    into crash faults
  • Surprise! Statistical monitoring catches many
    real-world faults, without a pre-established
    baseline
  • Memory bitflips in code, data, checksums
    (leading to a crash)
  • Hang/timeout/freeze
  • Network loss (drop up to 70% of packets
    randomly)
  • Hiccup (eg from garbage collection)
  • Persistent slowdown (one node lags the others)
  • Overload (TCP-like mechanism used to generate
    backpressure)

27
Generalizing Crash-Only: Micro-reboots
  • Add micro-reboot (uRB) support to middleware
  • Enhance open-source JBoss J2EE application server
    with fault injection, code path tracing,
    micro-reboots
  • Use automated fault injection and observation to
    infer propagation of exceptions
  • During operation, micro-reboot components or
    component groups suspected of being correlated to
    an observed failure
  • uRBs improve performability
  • 2-3 orders of magnitude faster than full
    reboots or application reload
  • Minimizes disruption to users of other
    (non-faulty) parts of system
  • Goal is fast recovery, not causal analysis

Fast, cheap uRBs plus statistical monitoring
provide a degree of application-generic failure
detection and recovery

28
Crash-Oriented Software: Systematic Approach
  • Some design lessons already learned
  • OK to say no, OK to make mistakes,
    interchangeable parts
  • Systematic approach for generic componentized
    apps ...
  • Compiler and language technology to understand
    what makes an app amenable to micro-reboots or
    crash-only (c/o) design generally
  • E.g., tracking state management across app
    components
  • E.g., establishing observational equivalence
    between executions with and without
    micro-recovery
  • Goal
  • Static and dynamic analysis of when safe to use
    generic recovery
  • Aggressive application of machine learning
    statistical monitoring to trigger generic
    recovery mechanisms
  • High confidence in mechanisms due to simplicity
    and orthogonality

29
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

30
Self-Verifiable Protocols: Statement of the
Problem
  • Problem: detect and contain network effects of
    misconfigurations and faulty/malicious components
  • Approach: design network protocols so each
    component verifies correct behavior of the
    protocol
  • Examples
  • e2e protocols
  • routing (BGP) protocols

31
Self-Verifiable Protocols: Case Study (BGP)
  • Propagating invalid BGP routes can bring the
    Internet down
  • Multiple causes
  • Router misconfigurations happen daily, yielding
    outages lasting hours
  • Malicious routers: huge potential threat
  • Routers with default passwords
  • Possible to buy routers' passwords on darknets
  • Existing solutions
  • Hard to deploy (e.g., Secure-BGP), or
    insufficient security
  • Our solution
  • Whisper: verify the correctness of router
    advertisements
  • Listen: verify the reachability on the data plane

32
Self-Verifiable Protocols: BGP Whisper
  • Use redundancy to check consistency of peers'
    information
  • Whisper game
  • Group sits in a circle; one person whispers a
    secret phrase to both neighbors
  • Person at other end concludes
  • Phrase is correct if same phrase from both
    neighbors
  • Otherwise, at least one phrase is incorrect

33
Self-Verifiable Protocols: BGP Whisper
  • AS1 advertises its address prefix
  • Chooses a secret key x, and sends y = h(x)
  • h(): well-known one-way hash function
  • Every router forwards y = h(y)
  • AS4 performs consistency check: do y1 and y2
    match once hashed to the same chain length
    (here, h(y2) = y1)?
  • If yes, assume both routes are correct
  • If no, at least one route is incorrect (but don't
    know which): raise a flag

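[Figure: AS1 chooses secret key x and advertises toward AS4 along two paths. Path via AS2 and AS3: (AS1, y1 = h(x)), (AS1, AS2, y1 = h^2(x)), (AS1, AS2, AS3, y1 = h^3(x)). Path via AS3: (AS1, y2 = h(x)), (AS1, AS3, y2 = h^2(x)).]
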
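The hash-chain check on this slide can be sketched in a few lines. SHA-256 as the one-way function h, hop counts taken from the AS path length, and the rule of hashing the shorter path's value up to the longer path's length are assumptions made for illustration; the slide's own check is not fully recoverable from the transcript.

```python
import hashlib


def h(value: bytes) -> bytes:
    """Stand-in for the well-known one-way hash function."""
    return hashlib.sha256(value).digest()


def hash_n(value: bytes, times: int) -> bytes:
    """Apply h repeatedly."""
    for _ in range(times):
        value = h(value)
    return value


def originate(secret: bytes) -> bytes:
    """The origin AS sends y = h(x) with its advertisement."""
    return h(secret)


def forward(y: bytes) -> bytes:
    """Each router replaces y with h(y) before re-advertising."""
    return h(y)


def consistent(y_a: bytes, hops_a: int, y_b: bytes, hops_b: int) -> bool:
    """Hash the value from the shorter path up to the longer path's length."""
    if hops_a < hops_b:
        y_a, hops_a, y_b, hops_b = y_b, hops_b, y_a, hops_a
    return hash_n(y_b, hops_a - hops_b) == y_a
```

A router that invents a route without knowing a value on the genuine chain cannot produce a matching y, so the mismatch raises a flag, although the check alone cannot say which route is the bad one.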
34
Self-Verifiable Protocols: BGP Listen
  • Monitor progress of TCP flows
  • If a TCP flow doesn't make progress, it might be
    because the route is incorrect
  • Use heuristics to reduce number of false
    positives and negatives
  • Still difficult to handle traffic patterns like
    port scanners
  • Use SLT techniques to improve the detection
    accuracy?

35
Self-Verifiable Protocols: Status and Future Plans
  • Two examples
  • BGP verifications (Listen and Whisper)
  • Can trigger alarms and contain malicious routers
  • Minimal changes to BGP; incrementally deployable
    (Listen)
  • Self-verifiable CSFQ
  • Per-flow isolation without maintaining per-flow
    state
  • Detect and contain malicious flows
  • Ultimate goal: develop distributed system able to
    self-diagnose and self-repair
  • Eliminate faulty components
  • Minimum: raise a flag in case of
    misconfigurations and attackers
  • Develop set of principles and techniques for
    robust protocols

36
Enabling Technology: Edge Services by Network
Appliances
  • In-the-Network Processing: the Computer IS THE
    Network

37
Generic PNE Architecture
[Figure: generic PNE architecture; recoverable labels: Tag Mem, Rules, Programs]

38
Enabling Technology: Programmable Networks
  • Problem
  • Common programming/control environment for
    diverse network elements to realize full power of
    "inside the network" services and applications
  • Approach
  • Software toolkit and VM architecture for PNEs,
    with retargetable optimized backend for diverse
    appliance-specific architectures
  • Current Focus
  • Network health monitoring, protocol interworking
    and packet translation services, iSCSI processing
    and performance enhancement, intrusion and worm
    detection and quarantining
  • Potential Impact
  • Open framework for multi-platform appliances,
    enabling third party service development
  • Provable application properties and invariants;
    avoidance of configuration and "latest patch not
    installed" errors

39
Enabling Technology: Programmable Networks
  • Generalized PNE programming and control model
  • Generalized virtual machine model for this
    class of devices
  • Retargetable for different underlying
    implementations
  • Edge services of interest
  • Network measurement and monitoring supporting
    model formation and statistical anomaly detection
  • Framework for inside-the-network protocol
    listening
  • Selective blocking/filtering/quarantining of
    traffic
  • Application-specific routing
  • Faster detection and recovery from routing
    failures than is possible from existing Internet
    protocols
  • Implementation of self-verifiable protocols

40
Security of Networked Systems: Learning Systems
Opportunity
  • New focus on network-wide attacks
  • E.g., worms, denial of service
  • Arise suddenly, spread quickly
  • No time to deploy patches or filters to protect
    machines
  • SLT offers promise for improvements
  • Distributed, so information across machines is
    shared
  • Handles changes in user behavior, preventing
    false positives
  • Truly distributed SLT systems are possible that
    can detect and protect against very large-scale
    security attacks

41
Security of Networked Systems: Technical Approach
  • Mechanisms to learn, share, repair against
    potential threats to dependability
  • Strengthen assurance of shared information via
    lightweight authentication and encryption
  • TESLA authentication system: replaces public-key
    crypto with lightweight symmetric encryption;
    uses time asymmetry to provide assurance
  • Messages initially encrypted, verification keys
    revealed later; prevents attacker from using a
    received key to forge messages
  • Variations provide instant authentication
  • Athena system: generate random instances of
    secure protocols
  • Ultra-fast checking software: model-checking and
    proof-theoretic techniques to verify protocols
    against stated requirements
  • Intelligently generate most efficient secure
    protocol satisfying requirements or a random
    instance of a secure protocol satisfying a given
    set of requirements
  • Apply to SLT systems so they can exchange
    information more quickly
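The delayed key disclosure idea behind TESLA can be sketched with a one-way key chain and a message authentication code. This is a simplified illustration under several assumptions: SHA-256 and HMAC are arbitrary choices, the function names are invented, and the loose time synchronization that lets a receiver confirm a packet arrived before its key was disclosed is omitted entirely.

```python
import hashlib
import hmac


def make_key_chain(seed: bytes, length: int) -> list:
    """Return [K_0, ..., K_length] with K_i = H(K_{i+1}).

    K_0 is the commitment the receiver obtains in advance.
    """
    chain = [seed]
    for _ in range(length):
        chain.append(hashlib.sha256(chain[-1]).digest())
    chain.reverse()
    return chain


def tag(key: bytes, message: bytes) -> bytes:
    """Authenticator sent with the message while the key is still secret."""
    return hmac.new(key, message, hashlib.sha256).digest()


def key_is_authentic(disclosed: bytes, interval: int, commitment: bytes) -> bool:
    """Hash the disclosed K_interval back `interval` times to reach K_0."""
    value = disclosed
    for _ in range(interval):
        value = hashlib.sha256(value).digest()
    return hmac.compare_digest(value, commitment)


def verify(message: bytes, mac: bytes, disclosed: bytes,
           interval: int, commitment: bytes) -> bool:
    """Check the late-disclosed key first, then the buffered message tag."""
    return (key_is_authentic(disclosed, interval, commitment)
            and hmac.compare_digest(tag(disclosed, message), mac))
```

The receiver buffers each message and its tag, then verifies both once the sender discloses the key for that interval. A key that has already been disclosed is useless for forging, provided receivers reject messages that arrive after the disclosure time.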

42
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

43
System Prototype
  • Comprehensive system architecture
  • Reduction of SLT to practical software components
    embedded within a distributed systems context
  • Exhibition of an architecture for dramatically
    improving the reliability and security of
    important systems through
    observation-coordination-adaptation mechanisms

44
RADS Prototype Applications
  • E-mail Systems/Messaging
  • Scale
  • Distribution, heterogeneity
  • Non-stop
  • Reactive Systems
  • E.g., Distributed worm detection
  • Network/Web Services
  • Financial Applications
  • Collective Decision Making/Electronic Voting
  • Security, privacy
  • Non-stop

45
RADS Conceptual Architecture
[Figure: layered architecture, top to bottom. Operator and User; Prototype Applications (E-voting, Messaging, E-Mail, etc.); Programming Abstractions for Roll-back; SLT Services; Crash-Oriented Services and Observation Infrastructure for System SLT; Application-Specific Overlay Network; Verifiable Protocols, Fast Detection and Route Recovery, Observation Infrastructure for Network SLT; PNEs at the Edge Networks; Commodity Internet]

46
Presentation Outline
  • A New Vision for Networked Systems
  • Enabling Technology: Statistical Learning Theory
  • Approaches for Dependability
  • Approaches for Security
  • Elements of an Experimental Prototype
  • Summary and Conclusions

47
Summary and Conclusions