Crash-Only Web Services: Failure Semantics in an SOA Environment

1
Crash-Only Web Services: Failure Semantics in an
SOA Environment
www.oasis-open.org
Chris Hobbs and Abbie Barbir
Presented by Paul Knight
Nortel
OASIS Symposium 2007, San Diego
2
The crash-only model
  • Software design approach
  • Easier to restart quickly in a known state than
    to clean up and rebuild to recover from an error

George Candea and Armando Fox are key proponents
of crash-only software
3
Two themes of this talk
  • Discuss the behaviors of individual and
    composed services and their role in Web Services
    Service Level Agreements (WSLA)
  • Based on the behaviors of the individual services
  • Need a taxonomy or ontology of service behaviors
  • Need an approach to calculating behaviors of
    composed services
  • The crash-only model of operation as a simple
    failure behavior for a Web Service
  • Failure is one of many identified behaviors

4
Background: Orchestration as a New Programming Paradigm
  • SOA promotes the concept of combining services
    through orchestration - invoking services in a
    defined sequence to implement a business process
  • Orchestration compounds the difficulties of
    testing and managing the quality of the deployed
    services
  • Testing composite services in an SOA environment is
    a discipline which is still at an early stage of
    study
  • Describing and usefully modeling the individual
    and combined behaviors - needed to offer Service
    Level Agreements (SLA) - is at an even earlier
    stage
  • We hope to stimulate additional research on these
    topics
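As a minimal sketch of orchestration (invoking services in a defined sequence), assume a composed service Z that calls a location service X and then a mapping service Y. The function names and the dictionary message format are hypothetical stubs, not real service APIs; in practice X and Y would be remote calls to providers outside our control.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Service = Callable[[Message], Message]

def location_service_x(msg: Message) -> Message:
    # Hypothetical stub for a third-party location service.
    return {**msg, "location": "San Diego"}

def mapping_service_y(msg: Message) -> Message:
    # Hypothetical stub for a third-party mapping service.
    return {**msg, "map": "map-of-" + msg["location"]}

def orchestrate(steps: List[Service], request: Message) -> Message:
    """Invoke each service in a defined sequence, feeding each result forward."""
    msg = dict(request)
    for step in steps:
        msg = step(msg)
    return msg

def composed_service_z(request: Message) -> Message:
    # Z's "program" is only the sequence; X and Y are black boxes to Z's tester.
    return orchestrate([location_service_x, mapping_service_y], request)

result = composed_service_z({"user": "alice"})
# result["map"] == "map-of-San Diego"
```

The orchestration logic itself is trivial to test; the difficulty is that X and Y can only be observed from outside, never opened up.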

5
Testing Composed Services
  • It's fairly straightforward to test the operation
    of a device or system if we control all the
    parts.
  • When we start offering orchestrated services as a
    product, the services we are using may be outside
    our control.
  • For example, consider well-known components:
  • Google mapping service
  • Amazon S3 storage service
  • Mobile operators' location services

6
Testing Composed Services (2)
  • With orchestrated services, there is never a
    complete box we can test
  • With orchestration as the new programming
    paradigm, testing becomes a much bigger problem
  • Failures of orchestrated services are often
    Heisenbugs - impervious to conventional
    debugging, generally non-reproducible
  • Offering a WSLA based on testing alone, without
    reliable knowledge of component service
    behaviors, may be risky

7
Web Services SLA (WSLA)
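[Figure: a Client reaches the Web Service of Service Provider Z, covered by a WSLA, across the Network; Z in turn uses Service X from Provider X and Service Y from Provider Y. Message flows and packets pass between them.]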
  • Concerned with behaviors of the message flows and
    services spanning the end-to-end business
    transaction
  • Clients can develop testing strategies that
    stress the service to ensure that the service
    provider has met the contracted WSLA commitment
  • Composed services make offering a WSLA more risky

8
How can WSLAs be derived from behaviors of
component services?
  • Need to develop a model of the behavioral
    attributes of the individual component Web
    Services which contribute to the overall behavior
    of an orchestrated or composed Web Service.
  • Need to model the combination of individual
    service behavioral models

9
Web Services behaviors
  • Behaviors may be described and quantified for
    each Web Service
  • May be combined by a calculus of behaviors when
    multiple services are composed
  • Behavior parameters may become a part of the
    service description, perhaps in WSDL.
  • Availability and Reliability
  • Performance
  • Management
  • Failure
  • Security
  • Privacy, confidentiality and integrity
  • Scalability
  • Execution
  • Internationalization
  • Synchronization
  • Etc.

10
Web Services behaviors (2)
  • To develop a Service Level Agreement (SLA) for a
    composed service (Z), we need to have relevant
    behavior descriptions for the individual services
    (X and Y)
  • We also need a deep understanding of how to
    combine the descriptions of X and Y to calculate
    results for Z

[Figure: composed service Z built from component services X and Y]
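One simple calculus for a single behavior, availability, can be sketched under strong assumptions: component failures are independent, and Z either needs every component on every request (serial composition, A_Z = A_X · A_Y) or needs only one of several replicas (parallel composition, A = 1 − (1 − A_1)(1 − A_2)). The figures below are hypothetical, not published values.

```python
from functools import reduce

def serial_availability(*components: float) -> float:
    """Z needs every component for each request; failures assumed independent."""
    return reduce(lambda acc, a: acc * a, components, 1.0)

def parallel_availability(*replicas: float) -> float:
    """Succeeds if any replica answers; failures assumed independent."""
    all_down = reduce(lambda acc, a: acc * (1.0 - a), replicas, 1.0)
    return 1.0 - all_down

# Hypothetical advertised figures: X, Y, and Z's own orchestration logic.
a_x, a_y, a_z_own = 0.999, 0.995, 0.9999
a_z = serial_availability(a_x, a_y, a_z_own)   # about 0.9939
a_x_pair = parallel_availability(a_x, a_x)     # two independent copies of X
```

Even this toy calculus shows Z ending up less available (about 99.39%) than its least available component. Correlated failures, such as a shared network outage, break the independence assumption and the formulas with it.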
11
Web Services behaviors (3)
  • For each behavior, the challenges include the
    following
  • 1. How may service X's and service Y's behavior
    be characterized?
  • 2. How may those characterizations be formalized
    and advertised by X and Y?
  • 3. How may Z incorporate X's and Y's
    characterizations and then advertise the result?
  • Z itself might become a component of an even
    larger service and therefore needs to advertise
    its own characteristics. It also needs this
    characterization to offer an SLA to consumers.

12
Web Services behaviors (4)
  • Each behavior may have its own ontology,
    measures, and calculus of combining those
    measures when services are composed.

[Figure: services X, Y and Z each have a local ontology; the abstracted ontologies of X and Y must somehow (?) be mapped into a Z-specific ontology]
Need this analysis for each behavior of services
X, Y and Z
13
Web Services behaviors (5)
  • Ten behavior examples
  • Availability and Reliability
  • Performance
  • Management
  • Failure (Crash-only is one mode)
  • Security
  • Privacy, confidentiality and integrity
  • Scalability
  • Execution
  • Internationalization
  • Synchronization
  • Let's focus on a few of these behaviors

Source: "Advertising Service Properties",
unpublished paper by C. Hobbs, J. Bell, P. Sanchez
14
Availability and Reliability
  • Availability is the percentage of client
    requests to which the server responds within the
    time it advertised.
  • Reliability is the percentage of such server
    responses which return the correct answer.
  • In some applications availability is more
    important than reliability
  • Many protocols used within the Internet, for
    example, are self-correcting and an occasional
    wrong answer is unimportant. The failure to give
    any answer, however, can cause a major network
    upheaval.
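The two definitions on this slide can be computed directly from a log of client requests. The `Outcome` record and the advertised deadline are assumptions for illustration; only responses within the advertised time count toward availability, and reliability is measured over those timely responses.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Outcome:
    latency_s: Optional[float]  # None means the server never answered
    correct: bool = False       # was the answer right? (only meaningful if answered)

def availability_and_reliability(
    outcomes: Sequence[Outcome], advertised_s: float
) -> Tuple[float, float]:
    if not outcomes:
        raise ValueError("no requests observed")
    timely = [o for o in outcomes
              if o.latency_s is not None and o.latency_s <= advertised_s]
    availability = len(timely) / len(outcomes)
    # Reliability is undefined when no response arrived in time.
    reliability = sum(o.correct for o in timely) / len(timely) if timely else math.nan
    return availability, reliability
```

A client of a self-correcting network protocol would mostly watch the first number; a client of the tax service on the next slide would mostly watch the second.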

15
Availability and Reliability (2)
  • In other applications reliability is more
    important than availability
  • If the service that calculates a person's annual
    tax return occasionally fails to respond, it's not
    a major problem - the user can try again
  • If that service does respond but with the wrong
    answer which is submitted to the tax authorities,
    then it could be disastrous

16
Availability and Reliability (3)
  • Services are built with either availability or
    reliability in mind, with clients accepting that
    no service can ever be 100% available or 100%
    reliable.
  • In combining services X and Y into a composite
    service Z, it is necessary to combine the
    underlying availability and reliability models
    and predict Z's model.
  • To do so without manual intervention, X's and Y's
    models must be exposed.

17
Availability and Reliability (4)
  • Availability and reliability models are often
    expressed as Markov Models or Petri Nets, which
    are easy to combine in a hierarchical way.
  • Major issues
  • Agreeing upon the semantics of the states in the
    Markov model or places in the Petri nets
  • Finding a way for X and Y to publish the models
    in a standard form.
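As an illustration of the simplest such model, consider a two-state (up/down) continuous-time Markov model with failure rate λ and repair rate μ. Its balance equation P(up)·λ = P(down)·μ gives a steady-state availability of μ/(λ+μ), equivalently MTTF/(MTTF+MTTR). The rates below are hypothetical, and the final product assumes X and Y fail independently. What counts as "up" is exactly the state semantics X, Y and Z would have to agree on.

```python
def two_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state P(up) for a two-state up/down Markov model.
    From the balance equation P(up) * failure_rate = P(down) * repair_rate."""
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate > 0")
    return repair_rate / (failure_rate + repair_rate)

# Hypothetical published models, rates per hour.
x_up = two_state_availability(failure_rate=1 / 1000, repair_rate=1.0)  # MTTF 1000 h, MTTR 1 h
y_up = two_state_availability(failure_rate=1 / 200, repair_rate=2.0)   # MTTF 200 h, MTTR 30 min

# One level up the hierarchy: Z needs both X and Y (independence assumed).
z_up = x_up * y_up
```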

18
Availability and Reliability (5)
  • Currently, apart from raw percentage figures,
    there is no method for describing these models
  • Percentage of time when the server is unavailable?
  • Percentage of requests to which it does not
    reply?
  • Different clients may experience these
    differently
  • A server which is unavailable from 0000 to 0400
    every day can be 100% available to a client that
    only tries to access it in the afternoons.
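The 0000-0400 example can be made concrete by measuring availability as each client sees it. This sketch assumes hourly request attempts and two hypothetical clients; the server is identical in both cases, only the access pattern differs.

```python
from typing import Iterable, Set

def observed_availability(access_hours: Iterable[int], down_hours: Set[int]) -> float:
    """Fraction of a client's attempts that land outside the server's outage."""
    attempts = list(access_hours)
    if not attempts:
        raise ValueError("client made no requests")
    served = sum(1 for h in attempts if h % 24 not in down_hours)
    return served / len(attempts)

DOWN = set(range(0, 4))            # unavailable 0000-0400 every day
afternoon_client = range(13, 18)   # hypothetical: one try per hour, 13:00-17:00
round_the_clock = range(24)        # hypothetical: one try every hour of the day

# afternoon_client sees 1.0; round_the_clock sees 20/24, about 0.833
```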

19
Availability and Reliability (6)
  • If X and Y are distributed, then it is possible,
    following network failures, that for some
    customers, Z can access X but not Y and for
    others Y but not X.
  • The assessment of Z's availability may be hard to
    quantify, so it may be difficult for Z to offer a
    meaningful WSLA.

20
Failure
  • The failure models of X and Y may be very
    different
  • X fails cleanly and may, because of its
    idempotency, immediately be called again
  • Y has more complex failure modes
  • Z will add its own failure modes to those of X
    and Y
  • Predicting the outcome could be very difficult
  • The complexity is increased because many
    developers do not understand failure modeling
    and, even were models to be published, their
    combination would be difficult due to their
    stochastic nature.

21
Failure (2)
  • One approach to describing a service's failure
    model:
  • Service publishes the exceptions that it can
    raise and associates the required consumer
    behavior with each
  • Exception D may be thrown when the database is
    locked by another process. Required action is to
    try again after a random backoff period of not
    less than 34ms.
  • Crash-only failure model is a simple starting
    point for building a taxonomy of failure
    behavior. This work is just beginning.
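A consumer-side sketch of the published-exception approach follows. `DatabaseLocked` is a hypothetical name standing in for the slide's Exception D, and the 200 ms upper bound on the backoff is our assumption (the slide specifies only the 34 ms minimum). Exceptions with no published action are re-raised rather than guessed at.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Dict, Type, TypeVar

T = TypeVar("T")

class DatabaseLocked(Exception):
    """Hypothetical name for the slide's 'Exception D'."""

@dataclass(frozen=True)
class RetryAfterRandomBackoff:
    min_s: float  # published lower bound
    max_s: float  # upper bound: an assumption, the slide gives none

# Hypothetical published failure model: exception type -> required consumer action.
PUBLISHED_ACTIONS: Dict[Type[Exception], RetryAfterRandomBackoff] = {
    DatabaseLocked: RetryAfterRandomBackoff(min_s=0.034, max_s=0.200),
}

def invoke(call: Callable[[], T], max_attempts: int = 5,
           sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            action = PUBLISHED_ACTIONS.get(type(exc))
            # Unpublished exception, or out of attempts: give up.
            if action is None or attempt == max_attempts:
                raise
            # random.uniform(a, b) never returns less than a, so the 34 ms floor holds.
            sleep(random.uniform(action.min_s, action.max_s))
```

Injecting `sleep` keeps the policy testable without real delays.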

22
Scalability
  • A behavioral description and WSLA for the
    composite service Z must include its scalability
  • How many simultaneous service instances can it
    support?
  • What service request rate does it handle? etc.
  • These parameters will almost certainly differ
    between the component services X and Y, and will
    need to be published by those services.
  • X and Y are presumably not dedicated solely to Z,
    so the actual load being applied to X and Y at
    any given time is unknown to the provider of Z,
    making the scalability of Z even harder to
    determine.

23
Web Services behaviors (again)
  • Ten behavior examples
  • Availability and Reliability
  • Performance
  • Management
  • Failure (Crash-only is one mode)
  • Security
  • Privacy, confidentiality and integrity
  • Scalability
  • Execution
  • Internationalization
  • Synchronization
  • We described a few of these behaviors
  • Can we use them to build WSLAs?

24
Web Service Level Agreement (WSLA)
  • Based on behaviors and descriptors for these
    behaviors.
  • Example: failure model
  • Is the transaction half-performed?
  • Is it rewound?
  • These behaviors and descriptors are not available
    in the WS description (WSDL)
  • No performance info
  • Not even price!

25
Web Service Level Agreements (2)
  • Business acceptance of composed services for
    business-critical operations depends on a service
    provider's ability to offer a WSLA
  • Uptime, response time, etc.
  • Offering a WSLA depends on ability to compose the
    WSLA-related behaviors of the individual services
  • This information needs to be available via WSDL
    or similar source
  • Should include test vectors to test the SLA
    claims
  • The ability to determine and offer a WSLA
    commitment is a limiting factor for widespread
    acceptance of services based on orchestration
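If WSLA-related behaviors and test vectors were published alongside WSDL, a consumer could check some claims automatically. The descriptor below is entirely hypothetical (WSDL has no such element); it shows that test vectors can check correctness and response time, while an uptime claim still needs long-term monitoring.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class SlaTestVector:
    request: str
    expected: str

@dataclass
class BehaviorDescriptor:
    """Hypothetical WSLA descriptor published alongside (not inside) WSDL."""
    service: str
    uptime_pct: float        # needs long-term monitoring; vectors cannot check it
    max_response_s: float
    failure_model: str       # e.g. "crash-only"
    test_vectors: List[SlaTestVector] = field(default_factory=list)

def check_claims(desc: BehaviorDescriptor, call: Callable[[str], str],
                 clock: Callable[[], float] = time.perf_counter) -> List[Tuple[str, str]]:
    """Run the published test vectors; return (problem, request) per failure."""
    problems = []
    for tv in desc.test_vectors:
        start = clock()
        answer = call(tv.request)
        elapsed = clock() - start
        if answer != tv.expected:
            problems.append(("wrong answer", tv.request))
        elif elapsed > desc.max_response_s:
            problems.append(("too slow", tv.request))
    return problems
```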

26
Web Service Level Agreements conclusions
  • Need a more precise way to express the parameters
    of behaviors
  • Availability: what does 99.97% uptime mean?
  • Several milliseconds of outage each minute?
  • Several minutes of planned downtime each month?
  • Failure model: crash-only as the simplest,
    lowest layer or level of failure in a future full
    failure model.
  • The eight other SLA-related behaviors listed here
    each have complex semantics for description and
    composition
  • More questions than answers now - many PhDs still
    to be earned in this area!
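The ambiguity in 99.97% uptime is easy to quantify: the downtime budget is 0.03% of whatever period is measured. This sketch computes that budget for a minute, a 30-day month and a 365-day year; both readings above fit the same figure (about 18 ms every minute, or about 13 minutes once a month).

```python
def allowed_downtime_s(uptime_pct: float, period_s: float) -> float:
    """Downtime budget implied by an uptime percentage over a period."""
    return (1.0 - uptime_pct / 100.0) * period_s

MINUTE = 60
MONTH_30D = 30 * 24 * 3600
YEAR_365D = 365 * 24 * 3600

per_minute = allowed_downtime_s(99.97, MINUTE)     # ~0.018 s, i.e. ~18 ms
per_month = allowed_downtime_s(99.97, MONTH_30D)   # ~777.6 s, i.e. ~13 min
per_year = allowed_downtime_s(99.97, YEAR_365D)    # ~9460.8 s, i.e. ~2.6 h
```

A WSLA that states only the percentage leaves the distribution of that budget open, and the two distributions matter very differently to clients.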

27
Back to the crash-only software model
  • Can it simplify service composition, testing,
    development of WSLA, and end world hunger?

28
Crash-only software (1)
  • Historically, developers have spent a lot of
    effort making software resilient
  • Put borders around it so it will not affect other
    things if it fails
  • Try to close it down cleanly
  • Save state
  • Reload the software component
  • Restart and replay
  • Trying to keep the client from becoming aware
    that a failure occurred

29
Crash-only software (2)
  • Much work over the last ten years on resilient
    software, which stays up all the time and
    recovers from problems
  • For example, tutorials by Bev Littlewood
  • Crash-only software is the exact opposite
  • Client accepts that the server may crash
  • Power failure, network down, hardware, etc.
  • Client must be able to recover or restart the
    process by itself

30
Crash-only software (3)
  • Crash-only principles
  • Forget recovery - more trouble than it's worth
  • When the server senses a problem, it will crash
    as cleanly as possible and may perform a
    micro-reboot to return to original state
  • Sometimes recover to a well-defined checkpoint
  • Client may initiate the crash
  • The server is back working sooner than if it
    tried to recover via logs and journals, etc.
  • Principles fit the Web Services paradigm nicely!
  • Loose coupling of services
  • Little state shared among services
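A minimal sketch of these principles follows. `CrashOnlyComponent` is hypothetical: it has no shutdown or recovery path, only `crash()` (which a client may also call) and `reboot()`, which restores either a known initial state or an optional checkpoint. State is discarded rather than repaired.

```python
import copy
from typing import Any, Dict, Optional

class CrashOnlyComponent:
    """Hypothetical crash-only component: crash() is the only way to stop,
    reboot() the only way to start. No clean shutdown, no log replay."""

    INITIAL_STATE: Dict[str, Any] = {"counter": 0}

    def __init__(self, checkpoint: Optional[Dict[str, Any]] = None):
        self._checkpoint = checkpoint  # optional well-defined checkpoint
        self.reboot()

    def reboot(self) -> None:
        # Micro-reboot: discard whatever state existed, start from a known one.
        known = self._checkpoint if self._checkpoint is not None else self.INITIAL_STATE
        self.state: Optional[Dict[str, Any]] = copy.deepcopy(known)

    def crash(self) -> None:
        # Same path whether the component detects a fault or a client asks for it.
        self.state = None

    @property
    def up(self) -> bool:
        return self.state is not None

    def handle(self, op: str) -> int:
        if self.state is None:
            raise ConnectionError("component is down")
        if op == "bad":  # stands in for any internally detected problem
            self.crash()
            raise ConnectionError("crashed instead of attempting recovery")
        self.state["counter"] += 1
        return self.state["counter"]
```

Because `reboot()` deep-copies its source, a crash can never corrupt the known state it restarts from.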

31
Crash-Only Software (4)
  • Crash-only semantic has several advantages
  • Simpler macroscopic behavior with fewer
    externally visible states
  • Reduces outage time by removing all shutting-down
    time
  • Simplifies failure model by reducing recovery
    state table size
  • Crashing can be invoked from outside the software
    of the provider
  • Recovery from a failed state is notoriously
    difficult and the crash-only paradigm coerces the
    system into a known state without attempting
    recovery
  • Reduces the complexity of the provider code
  • Simplifies testing by reducing the failure
    combinations that have to be verified. Consumer
    is assumed to be able to initiate the crash.

32
Crash-Only Web Services
  • Candea's list of properties required for a
    crash-only system can be abstracted to match
    properties of Web Services
  • Components have externally enforced boundaries.
    This is supported by the virtual machine concept
    used on many Web Service systems
  • All interactions between components have a
    timeout. This is implicit in any loosely-coupled
    Web Services interaction.
  • All resources are leased to the service rather
    than being permanently allocated. This is
    particularly useful in Web Services.
  • Requests are entirely self-describing. For
    crash-only services this requires that the
    request carry information about its time-to-live
    and idempotency (will it return the same result
    if invoked again?).
  • All important non-volatile state is managed by
    dedicated state stores.
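A sketch of a consumer honoring two of these properties, a timeout on every interaction and self-describing requests, is below. The field names `ttl_s` and `idempotent` and the `transport` callable are assumptions, not part of any WS-* specification; the point is that a consumer re-sends after a failure only when the request declares itself idempotent and its time-to-live has not passed.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass(frozen=True)
class SelfDescribingRequest:
    payload: Dict[str, Any]
    issued_at: float   # consumer's monotonic clock when the request was created
    ttl_s: float       # after this, the answer is no longer wanted
    idempotent: bool   # would re-sending return the same result, with no extra effect?

    def expired(self, now: float) -> bool:
        return now - self.issued_at > self.ttl_s

def send(req: SelfDescribingRequest,
         transport: Callable[[SelfDescribingRequest, float], Any],
         timeout_s: float, max_attempts: int = 3,
         clock: Callable[[], float] = time.monotonic) -> Any:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        if req.expired(clock()):
            raise TimeoutError("time-to-live exceeded; dropping request")
        try:
            # Every interaction carries a timeout; a crashed provider looks like one.
            return transport(req, timeout_s)
        except (ConnectionError, TimeoutError):
            # Outcome unknown: only an idempotent request may safely be re-sent.
            if not req.idempotent or attempt == max_attempts:
                raise
```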

33
Crash-Only Reliable Web Service
[Figure: a Web Service Consumer communicates across the Internet, using a reliable SOAP protocol (WS-ReliableMessaging), with a Web Services Endpoint on a Crash-Only Application Server; components shown include a Crash-only WSM, a Stall Proxy, a Recovery Agent and two Crash-Only Backends.]
  • For systems with hardware redundancy, by using
    crash-only techniques, SOAP WS-RM can be
    extended to produce an always-available
    Web Service from the provider's and consumer's
    point of view
  • WSLA response time may be at risk if a service is
    forced to crash
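The slide does not define the Stall Proxy's behavior. Assuming, as its name suggests, that it holds incoming requests while a crash-only backend reboots and releases them afterwards, a minimal single-threaded sketch looks like this. Stalled requests are delayed rather than failed, which is why response time, rather than availability, becomes the WSLA risk.

```python
from collections import deque
from typing import Any, Callable, Deque, Tuple

Reply = Callable[[Any], None]

class StallProxy:
    """Hypothetical stall proxy in front of a crash-only backend."""

    def __init__(self, backend: Callable[[Any], Any]):
        self.backend = backend
        self.rebooting = False
        self.stalled: Deque[Tuple[Any, Reply]] = deque()

    def submit(self, request: Any, reply: Reply) -> None:
        if self.rebooting:
            # Hold the request instead of returning an error to the consumer.
            self.stalled.append((request, reply))
        else:
            reply(self.backend(request))

    def begin_reboot(self) -> None:
        # Assumed to be called (hypothetically, by a recovery agent) before the backend is crashed.
        self.rebooting = True

    def end_reboot(self) -> None:
        # Backend is back in its known state: release stalled requests in order.
        self.rebooting = False
        while self.stalled:
            request, reply = self.stalled.popleft()
            reply(self.backend(request))
```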

34
Conclusions
  • Testing Web Services in an SOA environment is a
    discipline that is still in its infancy
  • There are no standard models to describe or
    combine Web Services behavior information across
    various services and providers
  • Web Services SLAs (WSLAs) for composed services
    are problematic
  • Testing is only a partial solution
  • Behavioral composition needs work, but is
    promising
  • Crash-only Web Services can address some of these
    difficulties
  • There are many related areas for further work

35
Q&A