1
A Research Program in Reliable Adaptive
Distributed Systems (RADS)
  • Armando Fox, Michael Jordan, Randy Katz, George
    Necula, David Patterson, Ion Stoica, Doug Tygar
  • University of California, Berkeley and Stanford
    University

2
Presentation Outline
  • Why We Need a New Approach to Networked Systems
  • New Design Philosophy for RADS
  • Applying the Philosophy: Early Experience with
    Specific Approaches
  • Approaches for Software and Hardware
    Dependability
  • Approaches for Networking
  • Approaches for Security
  • Applying SLT to dependability problems
  • Elements of a unified Experimental Prototype
  • Summary and Conclusions

3
New Approach for RADS (Reliable Adaptive
Distributed Systems)
  • Dramatically improve the trustworthiness of
    networked systems
  • Observe: design observation points throughout
    the system
  • Analyze: statistical learning theory (SLT) as an
    enabling technology
  • Respond: detect anomalous behavior vs. baseline
  • Learn: use observations to modify responses to
    future observations
  • Act
  • Reactive: use control points in the system for
    rapid recovery if something wrong is detected
  • Proactive/protective: prophylactically act on the
    system to prevent a predicted impending failure

4
Today's Systems are Too Brittle
  • Fragile, easily broken, yielding poor
    trustworthiness (dependability and security).
  • Amazon: revenue $3.1B, downtime costs $600,000
    per hour
  • Why? Overly focused on performance, performance,
    and cost-performance
  • Systems based on fundamentally incorrect
    assumptions:
  • Humans are perfect
  • Software will eventually be bug free
  • Maintenance is free
  • People/HW/SW failures are facts, not problems
  • "If a problem has no solution, it may not be a
    problem, but a fact--not to be solved, but to be
    coped with over time." Shimon Peres ("Peres's
    Law")

5
If Failure is Inevitable... then Design for Rapid
Adaptation
  • Encompasses rapid server recovery, network
    rerouting, prophylactic/protective actions...
  • Blurs distinction between normal operation and
    recovery
  • Elements of the solution
  • Programming paradigms for robust recovery
  • Crash-only software design for rapid server
    recovery
  • Network protocols designed for rapid detection of
    assertion violations
  • Instrumentation and SLT for online analysis,
    anomaly detection, and diagnosis of failure
  • Recovery benchmarks to measure progress
  • What you can't measure, you can't improve
  • Collect real failure data to drive benchmarks

6
RADS Conceptual Architecture
[Figure: layered architecture diagram. Users and
operators sit above prototype applications (e-voting,
messaging, e-mail, etc.). Components, with lead
researchers: programming abstractions for roll-back
(Necula); benchmarks and tools for human operators
(Patterson); crash-only middleware, servers, and
system OC infrastructure (Fox); SLT services and
online statistical learning algorithms (Jordan);
application-specific overlay network; protocols
enabling fast detection and route recovery, and
network OC infrastructure (Katz, Stoica). PNEs sit in
edge networks, which connect through routers to
commodity Internet IP networks.]
  • Reduction to practice of online SLT and
    observe/analyze/act infrastructure
  • Reusable embeddable components

8
Crash-Only Software: Dramatically Simplifying
Recovery
  • Since robust systems must be crash-safe, make
    crashes the only supported form of
    shutdown/restart
  • Software components: the external power switch is
    independent of the misbehaving component
  • Recovery becomes inexpensive/safe to try
  • Simplifies failure detection, since detection can
    be overly aggressive
  • Simplifies recovery, since there is only 1 type of
    recovery action and it is always safe to try
  • Idea: if something looks anomalous, it's probably
    wrong
  • Can machine learning and statistical monitoring
    approaches be applied during online operations?

9
Crash-Only Software: Practical to Build
  • Refocus on JAGR; talk about relevance of
    middleware
  • Case studies: two crash-only state-storage
    subsystems (for session state and durable state)
  • OK to crash any node at any time for any reason
  • Recovery is highly predictable, doesn't impact
    online performance
  • Replication provides probabilistic durability
    and capacity during recovery
  • Access pattern of workload exploited for
    consistency guarantees
  • 9 activity and state statistics monitored per
    storage brick
  • Metrics compared against those of peer bricks
  • Basic idea: changes in workload tend to affect
    all bricks equally
  • Underlying (weak) assumption: most bricks are
    doing mostly the right thing most of the time
  • Anomaly in 6 or more (out of 9) metrics => reboot
    brick
  • Simple thresholding and substring frequency used
    to determine what is anomalous
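
The peer-comparison rule on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the 6-of-9 vote comes from the slide. The relative-deviation test against the peer median, the `tolerance` value, the function names, and the sample readings are all assumptions made for the example.

```python
from statistics import median

REBOOT_VOTES = 6  # from the slide: anomaly in 6 or more of 9 metrics


def anomalous_metric_count(brick, peers, tolerance=0.5):
    """Count metrics on which `brick` deviates strongly from its peers.

    `brick` is a list of metric values; `peers` is a list of such lists.
    A metric is flagged when it differs from the peer median by more than
    `tolerance` (as a fraction of that median). This simple threshold is
    an assumed test chosen for illustration.
    """
    count = 0
    for i, value in enumerate(brick):
        center = median(p[i] for p in peers)
        if abs(value - center) > tolerance * max(abs(center), 1e-9):
            count += 1
    return count


def bricks_to_reboot(metrics_by_brick):
    """Return the names of bricks whose anomaly vote reaches the threshold."""
    doomed = []
    for name, metrics in metrics_by_brick.items():
        peers = [m for n, m in metrics_by_brick.items() if n != name]
        if anomalous_metric_count(metrics, peers) >= REBOOT_VOTES:
            doomed.append(name)
    return doomed


# Hypothetical readings: nine metrics per brick, one brick misbehaving.
readings = {
    "brick-a": [10, 20, 30, 40, 50, 60, 70, 80, 90],
    "brick-b": [11, 21, 29, 41, 49, 61, 71, 79, 91],
    "brick-c": [9, 19, 31, 39, 51, 59, 69, 81, 89],
    "brick-d": [10, 20, 30, 400, 500, 600, 700, 800, 900],
}
print(bricks_to_reboot(readings))
```

Because every brick is judged against the median of its peers, a workload change that shifts all bricks together moves the baseline too and raises no votes, which is the "basic idea" stated on the slide.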

10
Supporting Crash-Only in Middleware
  • Add observation and control points to Java
    application middleware
  • Observe: capture paths taken through the system
    by user requests
  • Analyze: look for highly unlikely, anomalous
    (therefore probably buggy) paths
  • Act: micro-reboot suspected-faulty J2EE
    components transparently to the rest of the system
  • Result: fast recovery improves overall
    performability
  • Micro-reboot is 2-3 orders of magnitude faster
    than full application reboot
  • Improves performability (total amount of work per
    unit time in presence of faults)
  • Minimizes disruption to users of other
    (non-faulty) parts of system

Fast, cheap micro-reboots (uRBs) and statistical
monitoring provide a degree of application-generic
failure detection and recovery
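
A rough sketch of the observe and analyze steps on this slide: record the component path each request takes, then flag paths that are very rare relative to the rest. The frequency threshold, the function names, and the component names in the traces are hypothetical, and a plain frequency count stands in for whatever path analysis the prototype actually uses.

```python
from collections import Counter


def rare_paths(observed_paths, min_probability=0.01):
    """Return paths whose empirical probability falls below the threshold."""
    counts = Counter(observed_paths)
    total = sum(counts.values())
    return {path for path, n in counts.items() if n / total < min_probability}


def suspect_components(observed_paths, min_probability=0.01):
    """Components that appear in rare paths but in no common path.

    These would be the candidates for a micro-reboot in this sketch.
    """
    rare = rare_paths(observed_paths, min_probability)
    common = set(observed_paths) - rare
    seen_in_common = {c for path in common for c in path}
    return {c for path in rare for c in path} - seen_in_common


# Hypothetical traces: each path is the tuple of components a request touched.
traces = (
    [("web", "cart", "db")] * 600
    + [("web", "search", "db")] * 399
    + [("web", "cart", "legacy-auth", "db")] * 1
)
print(rare_paths(traces))
print(suspect_components(traces))
```

Since a micro-reboot is cheap and safe to try, a detector like this can afford false positives, which is what lets the analysis stay this simple.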
11
Crash-Oriented Software: Systematic Approach
  • Needed: systematic mechanism for determining when
    micro-reboots are safe
  • Programming-language level support for rollback
    and state tracking
  • Needed: better integration with SLT
  • Which clustering/analysis techniques best
    correlate anomalous paths to particular observed
    failure types? (current prototype uses very
    simple data clustering techniques)
  • Are these techniques suitable for online use?
    (current prototype does offline analysis)

13
Research Challenges
  • No protection against DoS attacks
  • MS Blaster inflicted Internet packet loss > 20%
  • Routing protocols blindly believe routes
    advertised by neighbors
  • BGP router misconfigurations
  • 200-1200 prefixes affected every day
  • C&W's (AS3561) misconfiguration caused an outage
    for > 5000 prefixes for 2 hours (April 2001)
  • Malicious routers: huge potential threat
  • Drop packets and render a destination unreachable
  • Eavesdrop the traffic to a given destination
  • Impersonate the destination

14
Observe, Analyze, Act
  • Observe
  • Use multiple vantage points to monitor the
    network
  • Design protocols whose behaviors can be verified
  • Analyze
  • Learn from protocol behavior
  • Identify bogus information
  • Act
  • Contain misbehaving components
  • Raise flags for network operators
  • Empower end-hosts (e.g., enable end-hosts to stop
    unwanted packets in the network infrastructure)
  • End-hosts know better when under attack
    (flashcrowds vs. DoS attacks)
  • End-hosts can react faster than infrastructure

15
Case Study: BGP (Listen & Whisper)
  • Whisper
  • Use redundancy to check route advertisements for
    consistency
  • Listen
  • Monitor TCP flow progress to detect reachability
    problems
  • Results
  • Whisper: reduces the region of the Internet
    vulnerable to an isolated adversary to 5%
  • Scalable; implementation can handle 10 times
    today's BGP load
  • Listen: detects reachability problems
  • Probability of false positives: 1%
  • Vulnerable to port scans => plan to use SLT

16
Programmable Network Elements
[Figure: a programmable network element pipeline
(In-Port, Classify, Transform, Out-Port); edge
networks connect through routers to commodity
Internet IP networks.]
  • Enabling Technology
  • Edge network elements for IDS, firewall, traffic
    shaping, etc.
  • Next generation exposed APIs for 3rd party
    programming
  • Location for efficient network-level monitoring
    and control
  • Observe: rapid detection of route failure or
    network attack
  • Act: e.g., filter intrusions, quarantine
    propagating worms
  • Avoid configuration and "latest patch not
    installed" errors

18
Research Challenge: Self-sensing and Reactive
Systems
  • Internet-scale attacks are fundamentally
    different from host-scale attacks
  • Traditional Intrusion Detection Systems (IDS)
    have had some success with host-scale attacks,
    but also many false positives
  • Internet-scale attacks offer opportunity (more
    evidence of a wide-scale attack) but also more
    challenge (integrating data from a large number
    of disparate sources)

19
Observe, Analyze, Act
  • Observe: what to monitor, how to monitor
  • Analyze: learning from patterns of messages (not
    parsing their contents)
  • Act
  • How to exchange minimal information (in a system
    under attack)
  • Rapidly evolving security protocols (for
    resilience to attack)
  • Applications: worm detection, spam detection
  • Ultimate challenge: beyond detection and into
    response

20
Security of Networked Systems: Technical Approach
  • Mechanisms to learn, share, repair against
    potential threats to dependability
  • Strengthen assurance of shared information via
    lightweight authentication and encryption
  • TESLA authentication system: replaces public-key
    crypto with lightweight symmetric encryption;
    uses time asymmetry to provide assurance
  • Messages initially encrypted, verification keys
    revealed later; prevents attacker from using a
    received key to forge messages
  • Variations provide instant authentication.
  • Athena system: generate random instances of
    secure protocols
  • Ultra-fast checking software: model-checking and
    proof-theoretic techniques to verify protocols
    against stated requirements
  • Intelligently generate most efficient secure
    protocol satisfying requirements or a random
    instance of a secure protocol satisfying a given
    set of requirements
  • Apply to SLT systems so they can exchange
    information more quickly
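
The delayed key disclosure idea behind TESLA can be illustrated with a one-way key chain. The sketch below is a simplification under stated assumptions: it uses a message authentication code (HMAC) rather than encryption, which is how TESLA is usually described; SHA-256 and all function names are choices made for the example; and the timing check (the receiver must have the message before the key is disclosed) is left out.

```python
import hashlib
import hmac
import os


def make_key_chain(length, seed=None):
    """Build keys K_0..K_length with K_{i-1} = H(K_i).

    chain[0] is the public commitment; chain[i] is used in interval i.
    """
    key = seed if seed is not None else os.urandom(32)
    chain = [key]
    for _ in range(length):
        key = hashlib.sha256(key).digest()
        chain.append(key)
    chain.reverse()
    return chain


def tag(key, message):
    """Symmetric authentication tag for a message under an interval key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def key_is_authentic(disclosed_key, interval, commitment):
    """Hash the disclosed key `interval` times; it must reach the commitment."""
    value = disclosed_key
    for _ in range(interval):
        value = hashlib.sha256(value).digest()
    return hmac.compare_digest(value, commitment)


def verify(message, mac, disclosed_key, interval, commitment):
    """Accept a buffered message once the key for its interval is disclosed."""
    return (key_is_authentic(disclosed_key, interval, commitment)
            and hmac.compare_digest(tag(disclosed_key, message), mac))


chain = make_key_chain(4)
commitment = chain[0]              # given to receivers in advance
message = b"route update"          # hypothetical payload
mac = tag(chain[2], message)       # sent with the message in interval 2
# Later, the sender discloses chain[2]; receivers can now check the tag.
print(verify(message, mac, chain[2], 2, commitment))
print(verify(b"forged", mac, chain[2], 2, commitment))
```

By the time a key is disclosed, its interval is over, so an attacker who learns it can no longer use it to forge messages that receivers will accept. That is the time asymmetry the slide refers to.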

22
Statistical Learning Theory
  • Toolbox for design/analysis of adaptive systems
  • Algorithms for classification, diagnosis,
    prediction, novelty detection, outlier detection,
    quantile estimation, density estimation, feature
    selection, variable selection, response surface
    optimization, sequential decision-making
  • Classification algorithms
  • Recent scaling breakthroughs: 10K features,
    millions of data points
  • Kernel machines: functional analysis and convex
    optimization
  • Generalized inner product = similarities among
    data point pairs
  • Defined for many data types
  • Classical linear statistical algorithms are
    kernelized to yield state-of-the-art nonlinear
    SLT algorithms in many areas
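
As one concrete instance of a kernel defined on a non-vector data type, here is a minimal k-spectrum string kernel: the inner product of the two strings' substring-count vectors. The function name and sample strings are illustrative only; nothing here is specific to RADS.

```python
from collections import Counter


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-length substring count vectors of two strings."""
    grams_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    grams_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum(count * grams_b[gram] for gram, count in grams_a.items())


# Hypothetical request strings: similar requests share many substrings.
print(spectrum_kernel("GET /index.html", "GET /index.htm"))
print(spectrum_kernel("GET /index.html", "POST /login"))
```

Any linear algorithm that touches the data only through inner products can use a function like this in their place, which is how linear methods get "kernelized".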

23
Statistical Learning Theory
  • Novelty Detection Problem
  • Unlimited observations reflecting normal
    activity, yet few (or no) instances that reflect
    an attack or a bug
  • E.g. intrusion detection, machine diagnostics
  • Second-order cone program: a convex optimization
    problem with an efficient solution method
  • Given a cloud of data in a high-dimensional
    feature space, place a boundary around these to
    guarantee that only a small fraction falls outside
  • Basic problem: find a boundary that encloses a
    desired fraction of the data, and is as tight as
    possible
  • Can be done using the generalized Chebyshev
    inequality
  • Using kernels, this is a convex problem
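
To show the flavor of a Chebyshev-style guarantee, the sketch below uses the classical one-dimensional inequality, P(|x - mu| >= k*sigma) <= 1/k^2, to pick an interval that at most a chosen fraction of the data can fall outside. It is a deliberately simplified stand-in, not the kernelized second-order cone program the slide describes. It makes no attempt to find the tightest boundary, and the function names and latency samples are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def chebyshev_interval(samples, outside_fraction):
    """Interval around the mean with at most `outside_fraction` outside.

    Choosing k = 1/sqrt(outside_fraction) makes the Chebyshev bound
    1/k^2 equal to the requested fraction.
    """
    k = 1 / sqrt(outside_fraction)
    mu, sigma = mean(samples), pstdev(samples)
    return mu - k * sigma, mu + k * sigma


def is_novel(x, interval):
    """True when x falls outside the boundary learned from normal data."""
    low, high = interval
    return x < low or x > high


# Hypothetical "normal" request latencies in milliseconds.
normal = [100, 102, 98, 101, 99, 103, 97, 100]
boundary = chebyshev_interval(normal, outside_fraction=0.05)
print(boundary)
print(is_novel(101, boundary), is_novel(250, boundary))
```

The bound holds without any assumption about the shape of the distribution, which suits this setting: there are plenty of normal observations to learn from and few or no examples of attacks.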

24
Example: Statistical Bug-finding
  • Programs are buggy, yet many people use them
  • Instrument programs to take samples of program
    state at runtime
  • Collect information over the Internet from many
    users' runs
  • Learn a statistical classifier based on
    successful and failed runs, using feature
    selection methods to pinpoint the bugs
  • Example: finding a bug in the Unix bc utility
  • 2908 features instrumented
  • All top features indicate indx being unusually
    large in the more_arrays subroutine
  • storage.c:176 more_arrays(): indx > optopt
  • storage.c:176 more_arrays(): indx > opterr
  • storage.c:176 more_arrays(): indx > use_math
  • Indeed, an array overrun bug in the re-allocation
    routine more_arrays() was found to cause memory
    corruption and sometimes an eventual crash
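
The idea of pinpointing bugs from successful and failed runs can be illustrated with a much simpler scoring rule than a learned classifier: rank each instrumented predicate by how much the failure rate rises when it is observed to be true. The scoring rule, the function name, and the toy run data (including the `argc > 1` predicate) are assumptions for illustration and are not the data from the bc experiment.

```python
def rank_predicates(runs):
    """Rank predicates by failure-rate increase when the predicate held.

    `runs` is a list of (true_predicates, failed) pairs.
    Score = P(fail | predicate true) - P(fail) over all runs.
    """
    base_rate = sum(failed for _, failed in runs) / len(runs)
    stats = {}
    for predicates, failed in runs:
        for p in predicates:
            seen, fails = stats.get(p, (0, 0))
            stats[p] = (seen + 1, fails + failed)
    scores = {p: fails / seen - base_rate for p, (seen, fails) in stats.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: which sampled predicates were true in each run, and whether it failed.
runs = [
    ({"indx > optopt", "argc > 1"}, True),
    ({"indx > optopt"}, True),
    ({"argc > 1"}, False),
    ({"argc > 1"}, False),
    (set(), False),
]
for predicate, score in rank_predicates(runs):
    print(f"{score:+.2f}  {predicate}")
```

With thousands of instrumented features, some form of feature selection like this is what reduces the output to the handful of predicates a developer actually reads.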

25
Example III: Diagnosis
  • A probabilistic graphical model with 600 disease
    nodes, 4000 finding nodes
  • Node probabilities p(f_i | d) were assessed from
    an expert (Shwe et al., 1991)
  • Want to compute posteriors p(d_j | f)
  • Is this tractable?
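
The tractability question can be made concrete with a brute-force posterior computation on a tiny network. The sketch assumes a noisy-OR model for p(f_i | d), which the slide does not state explicitly, and every probability in it is made up for the example.

```python
from itertools import product


def posterior(priors, leak, weights, findings, target):
    """p(d_target = 1 | findings) in a noisy-OR network, by enumeration.

    priors[j]      prior probability that disease j is present
    leak[i]        probability finding i appears with no disease present
    weights[i][j]  probability that disease j alone triggers finding i
    findings[i]    observed value of finding i (True or False)
    """
    joint_with_target = 0.0
    joint_total = 0.0
    for diseases in product([0, 1], repeat=len(priors)):  # 2^n configurations
        p = 1.0
        for d, prior in zip(diseases, priors):
            p *= prior if d else 1 - prior
        for i, observed in enumerate(findings):
            p_absent = 1 - leak[i]
            for j, d in enumerate(diseases):
                if d:
                    p_absent *= 1 - weights[i][j]
            p *= (1 - p_absent) if observed else p_absent
        joint_total += p
        if diseases[target]:
            joint_with_target += p
    return joint_with_target / joint_total


# Hypothetical network: 3 diseases, 2 findings, both findings observed.
priors = [0.01, 0.05, 0.02]
leak = [0.01, 0.02]
weights = [[0.8, 0.1, 0.0],
           [0.3, 0.0, 0.9]]
print(posterior(priors, leak, weights, [True, True], target=0))
```

The outer loop visits 2^n disease configurations. That is fine for n = 3 and hopeless for n = 600, which is why exact inference by enumeration is not an option at that scale.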

26
Case Study: Medical Diagnosis
  • Symbolic complexity
  • symbolic expressions fill dozens of pages
  • would take years to compute
  • Numerical simplicity
  • Jaakkola and Jordan (1999) describe a variational
    method based on convexity that computes
    approximate posteriors in less than a second

27
Challenge for SLT
  • Challenge: on-line versions of the best
    algorithms have yet to be developed
  • Update the learning system's state based on small
    sets of data
  • Available for some kernelized problems
  • On-line versions of the best algorithms have yet
    to be developed!
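
As a small illustration of updating state from one observation at a time, a running mean and variance can be maintained without storing past data (Welford's method). This is a generic textbook example of an online update, not one of the SLT algorithms the slide has in mind; the class name and sample values are illustrative.

```python
class RunningStats:
    """Mean and variance maintained incrementally, one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.variance)
```

Each update costs constant time and memory. An online learner embedded in a running system needs the same property.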

29
System Prototype
  • Comprehensive system architecture
  • Reduction of SLT to practical software components
    embedded within a distributed systems context
  • Exhibition of an architecture for dramatically
    improving the reliability and security of
    important systems through
    observation-coordination-adaptation mechanisms.

30
Messaging as an Application
  • E-mail is now a mission-critical application
  • Organizational storage capacity is shifting from
    financial databases to email (email is the
    fastest-growing storage)
  • Loss of email is more critical to the continuing
    operation of an organization than loss of
    telephony (imagine if the government had no email
    for a week)
  • Instant messaging is now a mission-critical
    application
  • In a crisis, many communication schemes will be
    used: land-based telephony, cellular telephony,
    instant messaging, email, ...
  • Coordination among first responders during crisis
    response in the field (administrators and
    operators)
  • Demands for dependability, resistance to attack,
    establishment of trust among interacting entities
  • Despite attempts by hackers, terrorists, ...

31
Measuring Success
  • Build email/IM prototype using RADS design
    principles and tools
  • Put realistic performance workload on prototype
  • Subject prototype to increasingly difficult
    failure workloads and attack workloads
  • E.g., hardware failures, software failures,
    operator failures, worm attacks, DDoS attacks, ...
  • Measure false positive rates, accuracy rates,
    time to analyze failures, time to act,
    performance impact of actions, availability of
    prototype, performability of prototype, ...
  • Compare results to conventional email/IM systems
    under similar performance, failure, and attack
    workloads

32
Disaster Response Messaging Application
[Figure: a DHS/federal network, a coalition internet,
allies' networks, and local police, fire, and state
police networks, linked by trust relations and
exchanging incident reports, responder locations,
GIS data, etc. The networks are subject to net
failures, active-adversary service attacks, and a
compromised network with embedded adversaries.]
34
Old Science vs. New Science
  • First 50 years of computer science
  • manually-engineered systems
  • lack of adaptability, robustness, and security
  • no concern with closing the loop with the
    environment
  • Next 50 years of computer science
  • statistical learning systems throughout the
    infrastructure
  • self-configuring, adaptive, sentient systems
  • perception, reasoning, decision-making cycle
  • systems are always recovering because of this
    ongoing automatic and dynamic adaptation
  • New way to think about and design adaptive
    systems
  • Makes continuous monitoring and reaction a
    first-class goal
  • Provides point of leverage for applying SLT and
    related techniques

35
Scientific Foundation for Self-* Systems
  • New design principles and tools for systems that
    continuously adjust their behavior in response to
    analysis of online observations
  • New metrics and benchmarks for evaluating
    self-adapting networked systems
  • Advances in Statistical Learning Theory to move
    from offline to online analysis of large-scale
    distributed systems

36
BACKUP SLIDES
37
Statistical Learning Theory
  • Super kernels: combine heterogeneous data via
    multiple kernels
  • Semidefinite programs: convex optimization
    problems with efficient solutions involving
    efficient decomposition techniques
  • Useful in fusing evidence at distributed nodes
  • Problems of interest require combined parameter
    estimation and optimization
  • Response surface methodology: building local
    mappings from configurations to performance, and
    suggesting gradient directions in configuration
    space leading to performance improvements
  • Policy-gradient methods: SLT algorithms that make
    sequences of decisions, yielding a behavior or
    policy; successfully developed policies for
    nonlinear control problems involving high degrees
    of freedom
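
The response-surface idea (probe performance near the current configuration, estimate a gradient direction, move that way) might be sketched as below. Finite differences stand in for fitting a local regression model, and the performance function, parameter meanings, step sizes, and names are all hypothetical.

```python
def local_gradient(measure, config, step=1.0):
    """Central-difference estimate of d(performance)/d(config[i])."""
    grad = []
    for i in range(len(config)):
        up = list(config)
        up[i] += step
        down = list(config)
        down[i] -= step
        grad.append((measure(up) - measure(down)) / (2 * step))
    return grad


def improve(measure, config, rate=0.1, rounds=50):
    """Repeatedly step the configuration in the estimated uphill direction."""
    for _ in range(rounds):
        grad = local_gradient(measure, config)
        config = [c + rate * g for c, g in zip(config, grad)]
    return config


def throughput(config):
    """Hypothetical performance surface, best at 8 threads and 64 MB cache."""
    threads, cache_mb = config
    return -(threads - 8) ** 2 - (cache_mb - 64) ** 2


print(improve(throughput, [2.0, 32.0]))
```

In a live system each call to `measure` would be a noisy measurement rather than a formula, which is where the combined parameter estimation and optimization mentioned on the slide comes in.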

38
Statistical Machine Learning
  • Kernel methods
  • neural network heritage
  • convex optimization algorithms
  • kernels available for strings, trees, graphs,
    vectors, etc.
  • state-of-the-art performance in many problem
    domains
  • frequentist theoretical foundations
  • Graphical models
  • marriage of graph theory and probability theory
  • recursive algorithms on graphs
  • modular design
  • state-of-the-art performance in many problem
    domains
  • Bayesian theoretical foundations

39
Self-Verifiable Protocols: BGP Whisper
  • AS1 advertises its address prefix
  • Chooses a secret key x, and sends y = h(x)
  • h(): well-known one-way hash function
  • Every router forwards y = h(y)
  • AS4 performs consistency check (y1)3 (y2)3 ?
  • If yes, assume both routes are correct
  • If no, at least one route is incorrect (but don't
    know which) => raise a flag

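[Figure: AS1 chooses secret key x. Advertisements
along one route: (AS1, y1 = h(x)), (AS1, AS2,
y1 = h^2(x)), (AS1, AS2, AS3, y1 = h^3(x)). Along the
other route: (AS1, y2 = h(x)), (AS1, AS3,
y2 = h^2(x)). Both routes arrive at AS4.]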
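
The hash-chain mechanism on this slide can be sketched as follows. It assumes that the number of times x has been hashed equals the AS-path length, as in the advertisements shown on the slide, and that the receiver compares two routes by hashing the value from the shorter path forward by the length difference. That comparison rule, the use of SHA-256 for h(), and the function names are assumptions made for the example.

```python
import hashlib


def h(value):
    """Stand-in for the well-known one-way hash function h()."""
    return hashlib.sha256(value).digest()


def advertise(secret):
    """Origin AS: the first hop carries y = h(x)."""
    return h(secret)


def forward(y):
    """Each router on the path replaces y with h(y)."""
    return h(y)


def consistent(y_a, len_a, y_b, len_b):
    """Check two advertisements for the same prefix against each other.

    Hash the value from the shorter AS path forward by the length
    difference; it should then equal the value from the longer path.
    """
    if len_a > len_b:
        y_a, len_a, y_b, len_b = y_b, len_b, y_a, len_a
    for _ in range(len_b - len_a):
        y_a = h(y_a)
    return y_a == y_b


x = b"secret chosen by AS1"
y1 = forward(forward(advertise(x)))   # path AS1, AS2, AS3: h^3(x)
y2 = forward(advertise(x))            # path AS1, AS3: h^2(x)
print(consistent(y1, 3, y2, 2))

bogus = h(h(b"guess by a router that never saw x"))
print(consistent(y1, 3, bogus, 2))
```

A router that never received a value derived from x cannot produce one that passes the check, but as the slide notes, a failed check only shows that at least one of the two routes is wrong, not which one.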
40
Enabling Technology: Edge Services by Network
Appliances
  • In-the-Network Processing: the Computer IS THE
    Network

41
Self-Verifiable Protocols: Status and Future Plans
  • Two examples
  • BGP verifications (Listen & Whisper)
  • Can trigger alarms and contain malicious routers
  • Minimal changes to BGP; incrementally deployable
    (Listen)
  • Self-verifiable CSFQ
  • Per-flow isolation without maintaining per-flow
    state
  • Detect and contain malicious flows
  • Ultimate goal: develop a distributed system able
    to self-diagnose and self-repair
  • Eliminate faulty components
  • Minimum: raise a flag in case of misconfigurations
    and attackers
  • Develop set of principles and techniques for
    robust protocols

42
Enabling Technology: Programmable Networks
  • Problem
  • Common programming/control environment for
    diverse network elements to realize full power of
    "inside the network" services and applications
  • Approach
  • Software toolkit and VM architecture for PNEs,
    with retargetable optimized backend for diverse
    appliance-specific architectures
  • Current Focus
  • Network health monitoring, protocol interworking
    and packet translation services, iSCSI processing
    and performance enhancement, intrusion and worm
    detection and quarantining
  • Potential Impact
  • Open framework for multi-platform appliances,
    enabling third party service development
  • Provable application properties and invariants;
    avoidance of configuration and "latest patch not
    installed" errors

43
Enabling Technology: Programmable Networks
  • Generalized PNE programming and control model
  • Generalized "virtual machine" model for this
    class of devices
  • Retargetable for different underlying
    implementations
  • Edge services of interest
  • Network measurement and monitoring supporting
    model formation and statistical anomaly detection
  • Framework for inside-the-network protocol
    listening
  • Selective blocking/filtering/quarantining of
    traffic
  • Application-specific routing
  • Faster detection and recovery from routing
    failures than is possible from existing Internet
    protocols
  • Implementation of self-verifiable protocols

44
Crash-Only and Statistical Monitoring: Resilience
to Real-World Transients
  • Simple fault model: observed anomalies coerced
    into crash faults
  • Surprise! Statistical monitoring catches many
    real-world faults, without a pre-established
    baseline

45
Self-Verifiable Protocols: Statement of the
Problem
  • Problem: detect and contain network effects of
    misconfigurations and faulty/malicious components
  • Approach: design network protocols so each
    component verifies correct behavior of the
    protocol
  • Examples
  • e2e protocols
  • routing (BGP) protocols

46
Self-Verifiable Protocols: Case Study BGP
  • Propagating invalid BGP routes can bring the
    Internet down
  • Multiple causes
  • Router misconfigurations happen daily, yielding
    outages lasting hours
  • Malicious routers: huge potential threat
  • Routers with default passwords
  • Possible to buy routers' passwords on darknets
  • Existing solutions
  • Hard to deploy (e.g., Secure-BGP), or
    insufficient security
  • Our solution
  • Whisper: verify the correctness of router
    advertisements
  • Listen: verify the reachability on the data plane

47
Self-Verifiable Protocols: BGP Whisper
  • Use redundancy to check consistency of peers'
    information
  • Whisper game
  • Group sits in a circle; one person whispers a
    secret phrase to both neighbors
  • Person at other end concludes
  • Phrase is correct if same phrase from both
    neighbors
  • Otherwise, at least one phrase is incorrect

48
Self-Verifiable Protocols: BGP Listen
  • Monitor progress of TCP flows
  • If a TCP flow doesn't make progress, it might be
    because the route is incorrect
  • Use heuristics to reduce number of false
    positives and negatives
  • Still difficult to handle traffic patterns like
    port scanners
  • Use SLT techniques to improve the detection
    accuracy?

49
Military Messaging Application
[Figure: a US forces network, a coalition internet,
and allies' networks, linked by trust relations and
exchanging SitReps. The networks are subject to net
failures, active-adversary service attacks, and a
compromised network with embedded adversaries.]