Wide-Area Service Composition: Availability, Performance, and Scalability

1
Wide-Area Service Composition: Availability, Performance, and Scalability

Bhaskaran Raman
SAHARA, EECS, U.C.Berkeley
SAHARA Retreat, Jan 2002

2
Service Composition: Motivation
[Figure: service-level paths across providers. Labels: Cellular Phone, Thin Client, Video-on-demand server, Email repository, Text to speech, Transcoder, Service-Level Path; Provider A, Provider B, Provider Q, Provider R]
Reuse, Flexibility
Other examples: ICEBERG, IETF [OPES00]
3
In this work: Problem Statement and Goals

Problem Statement
Path could stretch across:
multiple service providers
multiple network domains
Inter-domain Internet paths:
Poor availability [Labovitz99]
Poor time-to-recovery [Labovitz00]
Take advantage of service replicas

Goals
Performance: Choose set of service instances
Availability: Detect and handle failures quickly
Scalability: Internet-scale operation

4
In this work: Assumptions and Non-goals

Operational model
Service providers deploy different services at various network locations
Next generation portals compose services
Code is NOT mobile (mutually untrusting service providers)
We do not address service interface issue
Assume that service instances have no persistent state
Not very restrictive [OPES00]

5
Related Work

Other efforts have addressed:
Semantics and interface definitions
OPES (IETF), COTS (Stanford)
Fault tolerant composition within a single cluster
TACC (Berkeley)
Performance constrained choice of service, but not for composed services
SPAND (Berkeley), Harvest (Colorado), Tapestry/CAN (Berkeley), RON (MIT)
None address wide-area network performance or failure issues for long-lived composed sessions

6
Solution Requirements

Failure detection/liveness tracking
Server, Network failures
Performance information collection
Load, Network characteristics
Service location
Global information is required
Hop-by-hop approach will not work

7
Challenges

Scalability and Global information
Information about all service instances, and network paths in-between should be known
Quick failure detection and recovery
Internet dynamics -> intermittent congestion

8
Is quick failure detection possible?

What is a failure on an Internet path?
Outage periods happen for varying durations
Study outage periods using traces
12 pairs of hosts
Periodic UDP heart-beat, every 300 ms
Study gaps between receive-times
Main results
Short outage (1.2-1.8 sec) -> Long outage (> 30 sec)
Sometimes this is true over 50% of the time
False-positives are rare
O(once an hour) at most
Okay to react to short outage periods, by
switching service-level path
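
The trace study above can be sketched in a few lines. This is a minimal illustration, not the analysis code behind the slide: the function names `find_outages` and `fraction_long` are hypothetical, and the choice of 1.8 sec (the upper end of the 1.2-1.8 sec range) as the default timeout is an assumption.

```python
from typing import List, Tuple

HEARTBEAT_INTERVAL_S = 0.3  # from the slide: UDP heart-beat every 300 ms
SHORT_OUTAGE_S = 1.8        # assumed timeout: upper end of the 1.2-1.8 sec range
LONG_OUTAGE_S = 30.0        # from the slide: long outage is > 30 sec


def find_outages(recv_times: List[float],
                 timeout_s: float = SHORT_OUTAGE_S) -> List[Tuple[float, float]]:
    """Return (last_receive_time, gap) for every gap of at least timeout_s."""
    outages = []
    for prev, cur in zip(recv_times, recv_times[1:]):
        gap = cur - prev
        if gap >= timeout_s:
            outages.append((prev, gap))
    return outages


def fraction_long(outages: List[Tuple[float, float]],
                  long_s: float = LONG_OUTAGE_S) -> float:
    """Fraction of detected outages that went on to exceed long_s seconds."""
    if not outages:
        return 0.0
    return sum(1 for _, gap in outages if gap > long_s) / len(outages)
```

Running `find_outages` on the receive times of one host pair and then `fraction_long` on the result gives the share of short outages that turned into long ones, which is the quantity the "over 50% of the time" result refers to.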

9
Towards an Architecture

Service execution platforms
For providers to deploy services
First-party, or third-party service platforms
Overlay network of such execution platforms
Collect performance information
Exploit redundancy in Internet paths

10
Architecture
11
Key Design Points

Overlay size
Could grow much slower than services, or clients
How many nodes?
A comparison: Akamai cache servers
O(10,000) nodes for Internet-wide operation
Overlay network is virtual-circuit based
Switching-state at each node
E.g. Source/Destination of RTP stream, in transcoder
Failure information need not propagate for recovery
Problem of service-location separated from that of performance and liveness
Cluster -> process/machine failures handled within

12
Software Architecture
[Figure: Functionalities at the Cluster-Manager]
Layers: Service-Composition Layer, Link-State Layer, Peer-Peer Layer
Other labels: Service-Level Path Creation, Maintenance, Recovery; Link-State Propagation; Finding Overlay Entry/Exit; Location of Service Replicas; At-least-once UDP; Perf. Meas.; Liveness Detection
13
Layers of Functionality

Link-State layer: Why Link-State?
Service-Composition layer
What algorithm for path creation?
Algorithm for path recovery?
State management?
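
The slide leaves the path-creation algorithm as an open question. One possible approach, sketched here as an assumption and not as the author's method, is a shortest-path search over (overlay node, number of services applied so far) states, using link costs of the kind a link-state layer would distribute. The name `service_level_path`, the `hosted` map, and the zero cost for applying a service are all hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: link cost}
State = Tuple[str, int]              # (overlay node, services applied so far)


def service_level_path(graph: Graph, hosted: Dict[str, Set[str]],
                       src: str, dst: str, services: List[str]
                       ) -> Optional[Tuple[float, List[State]]]:
    """Dijkstra over (node, k) states; returns (cost, state path) or None."""
    start, goal = (src, 0), (dst, len(services))
    dist: Dict[State, float] = {start: 0.0}
    prev: Dict[State, State] = {}
    heap = [(0.0, src, 0)]
    while heap:
        d, node, k = heapq.heappop(heap)
        if d > dist.get((node, k), float("inf")):
            continue  # stale heap entry
        if (node, k) == goal:
            break
        # Option 1: forward over an overlay link without applying a service.
        steps = [((nbr, k), w) for nbr, w in graph.get(node, {}).items()]
        # Option 2: apply the next required service if this node hosts it
        # (assumed to cost nothing here; a load metric could be used instead).
        if k < len(services) and services[k] in hosted.get(node, set()):
            steps.append(((node, k + 1), 0.0))
        for state, w in steps:
            nd = d + w
            if nd < dist.get(state, float("inf")):
                dist[state] = nd
                prev[state] = (node, k)
                heapq.heappush(heap, (nd, state[0], state[1]))
    if goal not in dist:
        return None
    path, state = [goal], goal
    while state != start:
        state = prev[state]
        path.append(state)
    path.reverse()
    return dist[goal], path
```

Under the same assumptions, path recovery could reuse the function: drop the failed link from `graph` and search again, which picks another service replica if one is reachable.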

14
Evaluation

What is the effect of recovery mechanism on application?
What is the scaling bottleneck?

15
Evaluation: Emulation Testbed

Idea: Use real implementation, emulate the wide-area network behavior (NistNET)
Opportunity: Millennium cluster

[Figure: emulation testbed. Labels: App, Lib, Emulator; Node 1, Node 2, Node 3, Node 4; Rule for 1->2, Rule for 1->3, Rule for 3->4, Rule for 4->3]
16
Evaluation: Recovery of Application Session

Text-to-Speech application
Two possible places of failure
Setup
20-node overlay network
One service instance for each service
Deterministic failure for 10 sec during session
Metric: gap between arrival of successive audio packets at the client

17
Recovery of Application Session: CDF of gaps > 100 ms
[Figure: CDF of gaps > 100 ms. Annotations: Jump 1 at 350-450 ms; Jump 2 at 2,963 ms; Jump at 10,000 ms]
18
Discussion

Jump 1: Due to synchronous text-to-speech processing
Jump 2: Recovery after failure
Breakup: 2,963 = 1,800 + O(700) + O(450)
1,800 ms: timeout to conclude failure
700 ms: signaling to setup alternate path
450 ms: recovery of application soft-state
Re-processing current sentence
Without recovery algorithm: takes as long as failure duration
O(3 sec) recovery
Can be completely masked with buffering
Interactive apps: still much better than without recovery
Quick recovery possible since failure information does not have to propagate across network
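
The breakup of Jump 2 can be checked with simple arithmetic. The sketch below adds the three component figures from the slide and compares the sum with the measured 2,963 ms. The buffering rule in `masked_by_buffer` is an assumption made for illustration (a playout buffer hides the gap if it holds at least that much audio), and both function names are hypothetical.

```python
# Component estimates from the slide, in milliseconds.
DETECT_TIMEOUT_MS = 1800  # timeout to conclude failure
SIGNALING_MS = 700        # O(700): signaling to set up the alternate path
SOFT_STATE_MS = 450       # O(450): recovery of application soft-state
MEASURED_JUMP_MS = 2963   # Jump 2 in the CDF of gaps


def recovery_time_ms(detect: int = DETECT_TIMEOUT_MS,
                     signaling: int = SIGNALING_MS,
                     soft_state: int = SOFT_STATE_MS) -> int:
    """Sum of the three recovery components."""
    return detect + signaling + soft_state


def masked_by_buffer(buffer_ms: int, gap_ms: int = MEASURED_JUMP_MS) -> bool:
    """Assumed rule: the gap is hidden if the buffer holds at least gap_ms of audio."""
    return buffer_ms >= gap_ms
```

The components sum to 2,950 ms, within 13 ms of the measured jump, which is consistent with the O(3 sec) recovery figure; a client buffering about 3 sec of audio would ride through the gap under the assumed rule.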

19
Evaluation: Scaling

Scaling bottleneck
Simultaneous recovery of all client sessions on a failed overlay link
Can recover at least 1,500 paths without hitting bottlenecks
Translates to about 700 simultaneous client sessions per cluster-manager
In comparison, our text-to-speech implementation can support O(15) clients per machine
Other scaling concerns
Link-State floods
Graph computation for service-level path creation

20
Summary

Service Composition: flexible service creation
We address performance, availability, scalability
Initial analysis: Failure detection -- meaningful to timeout in O(1.2-1.8 sec)
Design: Overlay network of service clusters
Evaluation results so far
Good recovery time for real-time applications: O(3 sec)
Good scalability -- minimal additional provisioning for cluster managers
Ongoing work
Overlay topology issues: how many nodes, peering
Stability issues

Feedback, Questions?
Presentation made using VMWare
21
References

[OPES00] A. Beck et al., "Example Services for Network Edge Proxies", Internet Draft, draft-beck-opes-esfnep-01.txt, Nov 2000
[Labovitz99] C. Labovitz, A. Ahuja, and F. Jahanian, "Experimental Study of Internet Stability and Wide-Area Network Failures", Proc. FTCS'99
[Labovitz00] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, "Delayed Internet Routing Convergence", Proc. SIGCOMM'00
[Acharya96] A. Acharya and J. Saltz, "A Study of Internet Round-Trip Delay", Technical Report CS-TR-3736, U. of Maryland
[Yajnik99] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, "Measurement and Modeling of the Temporal Dependence in Packet Loss", Proc. INFOCOM'99
[Balakrishnan97] H. Balakrishnan, S. Seshan, M. Stemm, and R. H. Katz, "Analyzing Stability in Wide-Area Network Performance", Proc. SIGMETRICS'97