Wide-Area Service Composition: Availability, Performance, and Scalability
1
Wide-Area Service Composition: Availability,
Performance, and Scalability
  • Bhaskaran Raman
  • SAHARA, EECS, U.C. Berkeley
  • SAHARA Retreat, Jan 2002

2
Service Composition: Motivation
(Diagram: a service-level path composed across providers, e.g. an email repository or video-on-demand server, a text-to-speech service at Provider A, a transcoder at Provider B, reached via Providers Q and R, delivering to a cellular phone or thin client.)
  • Reuse, Flexibility
  • Other examples: ICEBERG, IETF [OPES00]
3
In this work: Problem Statement and Goals
  • Problem Statement
    • A path could stretch across multiple service providers and multiple network domains
    • Inter-domain Internet paths: poor availability [Labovitz99], poor time-to-recovery [Labovitz00]
    • Take advantage of service replicas
  • Goals
    • Performance: choose a good set of service instances
    • Availability: detect and handle failures quickly
    • Scalability: Internet-scale operation

4
In this work: Assumptions and Non-goals
  • Operational model
    • Service providers deploy different services at various network locations
    • Next-generation portals compose services
  • Code is NOT mobile (mutually untrusting service providers)
  • We do not address the service-interface issue
  • Assume that service instances have no persistent state
    • Not very restrictive [OPES00]

5
Related Work
  • Other efforts have addressed:
    • Semantics and interface definitions: OPES (IETF), COTS (Stanford)
    • Fault-tolerant composition within a single cluster: TACC (Berkeley)
    • Performance-constrained choice of a service, but not for composed services: SPAND (Berkeley), Harvest (Colorado), Tapestry/CAN (Berkeley), RON (MIT)
  • None address wide-area network performance or failure issues for long-lived composed sessions

6
Solution Requirements
  • Failure detection / liveness tracking: server and network failures
  • Performance information collection: load and network characteristics
  • Service location
  • Global information is required: a hop-by-hop approach will not work

7
Challenges
  • Scalability and global information
    • Information about all service instances and the network paths in-between should be known
  • Quick failure detection and recovery
    • Internet dynamics → intermittent congestion

8
Is quick failure detection possible?
  • What is a failure on an Internet path?
    • Outage periods happen for varying durations
  • Study outage periods using traces
    • 12 pairs of hosts
    • Periodic UDP heart-beat, every 300 ms
    • Study the gaps between receive-times
  • Main results
    • Short outage (1.2-1.8 sec) → long outage (> 30 sec)
    • Sometimes this is true over 50% of the time
    • False-positives are rare: O(once an hour) at most
  • Okay to react to short outage periods by switching the service-level path
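The trace methodology above can be sketched as follows (an assumed reconstruction, not the authors' analysis code): with heartbeats every 300 ms, any gap between receive-times above a timeout threshold, e.g. the 1.8-second bound from the study, is flagged as an outage.

```python
# Sketch of the outage-detection logic (hypothetical reconstruction):
# heartbeats arrive every 300 ms; a receive-time gap above the timeout
# threshold is flagged as an outage period.
HEARTBEAT_MS = 300

def outages(recv_times_ms, threshold_ms=1800):
    """Return (start, duration) pairs for gaps above threshold_ms."""
    gaps = []
    for prev, cur in zip(recv_times_ms, recv_times_ms[1:]):
        gap = cur - prev
        if gap > threshold_ms:
            gaps.append((prev, gap))
    return gaps

# Example trace: heartbeats every 300 ms, with one 2.4-second gap.
trace = [0, 300, 600, 3000, 3300]
print(outages(trace))  # [(600, 2400)]
```

Reacting at the 1.8-second threshold is what the trace study justifies: a gap that long rarely turns out to be a false positive.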

9
Towards an Architecture
  • Service execution platforms
    • For providers to deploy services
    • First-party or third-party service platforms
  • Overlay network of such execution platforms
    • Collect performance information
    • Exploit redundancy in Internet paths

10
Architecture
(Diagram: overlay network of service clusters; not reproduced in the transcript.)
11
Key Design Points
  • Overlay size
    • Could grow much more slowly than the number of services or clients
    • How many nodes? A comparison: Akamai cache servers, O(10,000) nodes for Internet-wide operation
  • Overlay network is virtual-circuit based
    • Switching state at each node (e.g., source/destination of the RTP stream at a transcoder)
    • Failure information need not propagate for recovery
  • Problem of service location separated from that of performance and liveness
  • Cluster → process/machine failures handled within
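The virtual-circuit design above implies per-node switching state. A hypothetical sketch (names and data layout are assumptions, not the actual implementation): each overlay node maps a (session, upstream node) pair to the service to apply and the next hop, so recovery only installs new entries along the alternate path and failure information need not propagate globally.

```python
# Hypothetical per-node virtual-circuit switching state.
switching_table = {}  # (session_id, in_node) -> (service_name, out_node)

def install(session_id, in_node, service_name, out_node):
    """Set up (or re-route) one hop of a service-level path."""
    switching_table[(session_id, in_node)] = (service_name, out_node)

def forward(session_id, in_node, data, services):
    """Apply this node's service to data; return (next hop, output)."""
    service_name, out_node = switching_table[(session_id, in_node)]
    return out_node, services[service_name](data)

# Example: a "transcoder" hop for an RTP session, modeled as str.lower.
services = {"transcode": str.lower}
install("rtp-42", "A", "transcode", "B")
print(forward("rtp-42", "A", "HELLO", services))  # ('B', 'hello')
```

Re-routing after a failure is then a local operation: call `install` with a new `out_node` at the nodes on the alternate path, leaving the rest of the circuit untouched.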

12
Software Architecture
(Layer diagram: functionalities at the cluster-manager, top to bottom)
  • Service-Composition Layer: service-level path creation, maintenance, recovery
  • Link-State Layer: link-state propagation; finding overlay entry/exit; location of service replicas
  • Peer-Peer Layer: at-least-once UDP; performance measurement; liveness detection
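The "at-least-once UDP" box presumably retransmits each message until an acknowledgment arrives, guaranteeing delivery at the cost of possible duplicates. A self-contained sketch under that assumption (the lossy channel and all names are hypothetical, not the actual implementation):

```python
# At-least-once delivery sketch: retransmit until acked; the receiver
# deduplicates by sequence number, since retries can cause duplicates.
import random

def send_at_least_once(channel, msg, seq, max_tries=10):
    """channel(payload) -> ack seq or None (lossy). Returns tries used."""
    for attempt in range(1, max_tries + 1):
        if channel((seq, msg)) == seq:
            return attempt
    raise TimeoutError("peer unreachable")

def make_lossy_receiver(loss=0.5, rng=random.Random(0)):
    delivered = {}
    def channel(payload):
        seq, msg = payload
        if rng.random() < loss:
            return None                  # packet (or its ack) lost
        delivered.setdefault(seq, msg)   # duplicate suppressed
        return seq                       # acknowledgment
    return channel, delivered

channel, delivered = make_lossy_receiver()
send_at_least_once(channel, "heartbeat", seq=1)
print(delivered)  # {1: 'heartbeat'}
```

At-least-once (rather than exactly-once) semantics keeps the peer-peer layer cheap; deduplication is a one-line check at the receiver.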
13
Layers of Functionality
  • Link-State layer: why link-state?
  • Service-Composition layer:
    • What algorithm for path creation?
    • Algorithm for path recovery?
    • State management?
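One plausible answer to the path-creation question (an assumption for illustration; the slides leave the choice open) is a layered shortest-path search over the link-state graph: layer i holds the replicas of the i-th service, each hop advances one layer, and the cheapest path therefore picks one instance of each service in composition order.

```python
# Layered-graph shortest path for service-level path creation (sketch;
# function names, the latency map, and the toy topology are assumptions).
import heapq

def compose_path(latency, instances, src, dst):
    """latency: {(u, v): ms between overlay nodes}; instances: one list
    of candidate nodes per service, in composition order."""
    layers = [[src]] + [list(s) for s in instances] + [[dst]]
    pq = [(0, 0, src, (src,))]  # (cost, layer index, node, path so far)
    best = {}
    while pq:
        cost, layer, node, path = heapq.heappop(pq)
        if best.get((layer, node), float("inf")) <= cost:
            continue
        best[(layer, node)] = cost
        if layer == len(layers) - 1:   # reached the destination layer
            return cost, list(path)
        for nxt in layers[layer + 1]:  # advance to the next service
            w = latency.get((node, nxt))
            if w is not None:
                heapq.heappush(pq, (cost + w, layer + 1, nxt, path + (nxt,)))
    return None                        # some service has no reachable instance

# Toy topology: text-to-speech replicas n1, n2 between source and client.
latency = {("email", "n1"): 10, ("email", "n2"): 5,
           ("n1", "client"): 10, ("n2", "client"): 30}
print(compose_path(latency, [["n1", "n2"]], "email", "client"))
# (20, ['email', 'n1', 'client'])
```

This also suggests why link-state is attractive here: path creation needs the full graph of node-to-node latencies, which a link-state layer provides at every cluster-manager.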

14
Evaluation
  • What is the effect of the recovery mechanism on the application?
  • What is the scaling bottleneck?

15
Evaluation Emulation Testbed
  • Idea: use the real implementation, but emulate the wide-area network behavior (NistNET)
  • Opportunity: Millennium cluster

(Testbed diagram: Nodes 1-4 run the application (App + Lib); an emulator between them applies per-link rules such as 1→2, 1→3, 3→4, and 4→3.)
16
Evaluation: Recovery of Application Session
  • Text-to-Speech application
  • Two possible places of failure
  • Setup
    • 20-node overlay network
    • One service instance for each service
    • Deterministic failure for 10 sec during the session
  • Metric: gap between the arrival of successive audio packets at the client

17
Recovery of Application Session: CDF of gaps > 100 ms
(Plot: Jump 1 at 350-450 ms; Jump 2 at 2,963 ms; without recovery, a jump at 10,000 ms.)
18
Discussion
  • Jump 1: due to synchronous text-to-speech processing
  • Jump 2: recovery after failure
    • Breakup: 2,963 ms ≈ 1,800 + O(700) + O(450)
    • 1,800 ms: timeout to conclude failure
    • 700 ms: signaling to set up the alternate path
    • 450 ms: recovery of application soft-state (re-processing the current sentence)
  • Without recovery, the gap lasts as long as the failure duration
  • O(3 sec) recovery
    • Can be completely masked with buffering
    • For interactive apps, still much better than no recovery
  • Quick recovery is possible since failure information does not have to propagate across the network
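The recovery-time breakup above is worth checking as arithmetic; the components from the slide sum to within measurement noise of the observed jump.

```python
# Recovery-time breakup from the slide (the small remainder vs. the
# measured 2,963 ms jump is unattributed in the source).
timeout_ms   = 1800  # conclude failure (from the outage study)
signaling_ms = 700   # set up the alternate service-level path
softstate_ms = 450   # re-process the current sentence at the new instance
total_ms = timeout_ms + signaling_ms + softstate_ms
print(total_ms)  # 2950, close to the measured 2,963 ms
```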

19
Evaluation: Scaling
  • Scaling bottleneck: simultaneous recovery of all client sessions on a failed overlay link
  • Can recover at least 1,500 paths without hitting bottlenecks
    • Translates to about 700 simultaneous client sessions per cluster-manager
    • In comparison, our text-to-speech implementation can support O(15) clients per machine
  • Other scaling concerns
    • Link-state floods
    • Graph computation for service-level path creation

20
Summary
  • Service Composition: flexible service creation
  • We address performance, availability, and scalability
  • Initial analysis: failure detection -- meaningful to timeout in O(1.2-1.8 sec)
  • Design: overlay network of service clusters
  • Evaluation results so far
    • Good recovery time for real-time applications: O(3 sec)
    • Good scalability -- minimal additional provisioning for cluster-managers
  • Ongoing work
    • Overlay topology issues: how many nodes, peering
    • Stability issues

Feedback, Questions?
Presentation made using VMWare
21
References
  • [OPES00] A. Beck et al., "Example Services for Network Edge Proxies", Internet Draft draft-beck-opes-esfnep-01.txt, Nov 2000
  • [Labovitz99] C. Labovitz, A. Ahuja, and F. Jahanian, "Experimental Study of Internet Stability and Wide-Area Network Failures", Proc. FTCS '99
  • [Labovitz00] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, "Delayed Internet Routing Convergence", Proc. SIGCOMM '00
  • [Acharya96] A. Acharya and J. Saltz, "A Study of Internet Round-Trip Delay", Technical Report CS-TR-3736, U. of Maryland
  • [Yajnik99] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, "Measurement and Modeling of the Temporal Dependence in Packet Loss", Proc. INFOCOM '99
  • [Balakrishnan97] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz, "Analyzing Stability in Wide-Area Network Performance", Proc. SIGMETRICS '97