Title: OnCall
1OnCall
- Defeating Traffic Spikes with a Free-Market Application Cluster
James Norris, Keith Coleman, Armando Fox, George Candea
Stanford University
2Motivation
- CNN.com, September 11: 4x traffic in a single day, 8x traffic on second day
- Offline for 2.5 hours, diminished service afterwards
- Slashdot Effect
- Variable Traffic
- Ticket Sales
- Contests, etc.
[Figure: CNN.com Page Views; values shown: 40 M, 162.4 M, 337.4 M]
3What to do?
4Three Options
- One Option: Overprovision
- Works for steady-state fluctuations (but not optimal)
- Too expensive for spike conditions (8x servers for CNN)
- Another Option: Graceful Degradation
- Provides basic service continuity
- Full features (including revenue-generating features) may be lost
- Better Option: Dynamic Allocation
5What is OnCall?
- OnCall is
- a cluster management system designed to multiplex several (possibly competing) dynamic web applications onto a single cluster.
- Goal
- Make spike handling possible while providing useful resource guarantees to all apps
- Solution
- Marketplace of Applications
- Applications rent and lend computing resources according to pre-defined market policies
- Generic Platform
- Based on VMs
- -> application generic
- -> fast app swapping
6Marketplace
7Market Rounds
- Offline
- Each application assigned ownership of G computers at a fixed price (or rate)
- Online
- Determine market equilibrium price, P, by querying each application
- Calculate new allocation sizes at price P
- Adjust allocations, moving computers from sellers to buyers
- Repeat every time quantum, t
8Offline Market G
- G
- Each app owns G nodes
- Resource guarantees
- Never have to sell: no matter what the price or what other apps demand, an app is guaranteed use of its G nodes
- Can lend by choice (if there are renters at desired price)
- Can rent extra nodes (if it needs to and/or can afford to)
9Online Market
[Figure: a Marketplace with 10 nodes in the cluster queries three application Policies]
- 7 + 5 + 2 = 14, but I only have 10 nodes!
- 5 + 3 + 2 = 10. Perfect!
10Online Market Policies
[Figure: Price P -> POLICY -> Output: number of computers desired at price P]
11Example Market Policy
- For each round, application A computes the number of nodes, n, it needs to handle current traffic
- Ex: Application A has a price threshold of 6
- If (P < 6), A will ask for n nodes
- If (P >= 6), A will only ask for min(n, G) nodes; it can't afford to rent extras
[Figure: two cases, n < G (no spike) and n > G (spike)]
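The threshold policy on this slide can be sketched as a small function. This is only an illustration: the name `threshold_policy` and its parameter names are hypothetical, and the default threshold of 6 is taken from the slide's example.

```python
def threshold_policy(price, needed, guaranteed, threshold=6.0):
    """Return how many nodes an app requests at the given price.

    needed     -- n, nodes required to handle current traffic
    guaranteed -- G, nodes the app owns outright
    threshold  -- price at or above which the app stops renting extras
    """
    if price < threshold:
        # Cheap enough: ask for everything the traffic requires.
        return needed
    # Too expensive: never ask for more than the app's own G nodes.
    return min(needed, guaranteed)
```

With G = 10 and a spike needing n = 14, the app asks for 14 nodes at price 5 and falls back to 10 nodes at price 6.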
12Finding the Equilibrium
- Sample points along the different policy functions
- Determine the price at which the total number of nodes desired by all apps equals the total number of nodes available on the cluster
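A minimal sketch of the sampling search, under stated assumptions: policies are plain functions of price, the sampled prices are supplied by the caller, and because step-shaped policies may never hit exact equality, the sketch accepts the lowest sampled price at which total demand fits within the cluster. The names `find_equilibrium` and `make_threshold_policy`, and the thresholds below, are hypothetical; the node counts mirror the 10-node example from the Online Market slide.

```python
def make_threshold_policy(needed, guaranteed, threshold):
    """Build a policy: price -> number of nodes requested (hypothetical helper)."""
    def policy(price):
        return needed if price < threshold else min(needed, guaranteed)
    return policy


def find_equilibrium(policies, total_nodes, price_samples):
    """Return (price, demands) for the lowest sampled price whose total demand fits."""
    for price in sorted(price_samples):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= total_nodes:
            return price, demands
    raise ValueError("demand exceeds supply at every sampled price")


policies = [
    make_threshold_policy(needed=7, guaranteed=5, threshold=6),
    make_threshold_policy(needed=5, guaranteed=3, threshold=4),
    make_threshold_policy(needed=2, guaranteed=2, threshold=4),
]
price, demands = find_equilibrium(policies, total_nodes=10, price_samples=range(1, 11))
print(price, demands)
```

At low prices the apps ask for 7 + 5 + 2 = 14 nodes; as the sampled price rises, demand falls until it reaches 5 + 3 + 2 = 10.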
13Competitive vs Cooperative
- Competitive Environments
- Ex: ASP, where app owners may be in competition
- Cooperative Environments
- Ex: Search engine, Yahoogle
- Quick Case Study
- App 1: Paid web search (very high value in low latency)
- App 2: Ad-supported web search (high value in low latency)
- App 3: Crawler (latency OK, starvation not)
- For each app, model utility of running at a given time
- Benefit: If you add an app, just need to model that app, not remodel whole system
14Platform
15Platform Overview
16Does this work?
17Simulation Testbed
- Three Simulations, Four Traits
- Spike handling under unconstrained resources
- Spike handling under constrained resources
- Resource guarantees
- Fast server activation
- U.C. Berkeley X Cluster
- 30 Nodes (double CNN.com)
- Dual 1 GHz PIII, 1.5 GB RAM
- VMware GSX Server on Linux
18Sim 1: Spike Handling
- G = 10 for both apps
- App 1 handles spikes, App 2 makes
- Notice: lag time between node assigned -> node active
19Sim 2: Resource Constraints
- G1 = 12, G2 = 6, G3 = 12
- App 1 has higher budget than App 2, but both spike
- App 1 handles spikes, App 2 sees guarantee, App 3 makes
- App 2 buys more when App 1's spike subsides
20Sim 3: Fast Activation
Platform             Time until Active (s)
OnCall Optimal       5-10
OnCall Limited       50-120
Standard with OS     270-330
Standard w/out OS    710-750
- OnCall Optimal: Load VMs from suspended state
- OnCall Limited: Load VMs from shutdown state
- Standard with OS: OS already installed on node
- Standard without OS: Must install OS first
- Significance
- Worst case, > 2x improvement
- When spike lasts only 30 minutes, this is significant
- If you can start up quickly, an accurate predictor is not critical
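The significance claims can be checked with a little arithmetic on the table values. This sketch assumes one reading of "worst case": the slowest OnCall Limited time compared with the fastest Standard with OS time. The function and variable names are illustrative only.

```python
SPIKE_SECONDS = 30 * 60  # the 30-minute spike mentioned on the slide

# (fastest, slowest) activation times in seconds, copied from the table
ACTIVATION_SECONDS = {
    "OnCall Optimal": (5, 10),
    "OnCall Limited": (50, 120),
    "Standard with OS": (270, 330),
    "Standard w/out OS": (710, 750),
}


def spike_fraction_lost(seconds, spike_seconds=SPIKE_SECONDS):
    """Fraction of the spike that passes before a new node becomes active."""
    return seconds / spike_seconds


# Assumed reading of "worst case": slowest OnCall time vs. fastest standard time.
worst_case_speedup = (
    ACTIVATION_SECONDS["Standard with OS"][0] / ACTIVATION_SECONDS["OnCall Limited"][1]
)

for platform, (fast, slow) in ACTIVATION_SECONDS.items():
    print(f"{platform}: {spike_fraction_lost(fast):.1%} to "
          f"{spike_fraction_lost(slow):.1%} of the spike")
print(f"worst-case speedup: {worst_case_speedup:.2f}x")
```

Under that reading the ratio is 270 / 120 = 2.25, consistent with the "> 2x" bullet.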
21Questions?
22Notes and Assumptions
- Homogeneity Assumption
- Cluster is assumed to be homogeneous: all nodes rented at same price (for simplicity)
- Swapping Costs
- Time delay cost in start up / shut down of an app on a node.
- If a rental contract is renewed, app runs on same node.
- P Only for Extras
- Apps only pay price P for nodes above and beyond their own G
- Ex: Using 40, G = 30
- -> 40 - 30 = 10 nodes at price P
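The "P only for extras" rule amounts to a one-line charge calculation. A minimal sketch, assuming the charge covers a single time quantum and ignoring any income from lending; `rental_charge` is a hypothetical name.

```python
def rental_charge(nodes_used, guaranteed, price):
    """Charge for one time quantum: price P applies only to nodes beyond G."""
    extras = max(0, nodes_used - guaranteed)
    return extras * price
```

For the slide's example, `rental_charge(40, 30, price)` bills 10 nodes at price P, and an app using no more than its G nodes pays nothing.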
23Runtime Operation
- Runtime cycle repeats every t
- Marketplace calculates equilibrium price (and thus application allocations)
- Manager assigns apps to physical nodes (minimizing shutdowns and startups)
- Manager signals Responders to shut down and start new apps, as necessary
- At end of round, Manager gathers new usage stats and reports stats to Market Policies
- Repeat
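One way the assignment step could keep shutdowns and startups low is to leave each app on the nodes it already holds and reassign only the freed ones. The sketch below is an assumption about how such a step might look, not the OnCall Manager's actual algorithm; `assign_nodes` and the node and app names are hypothetical.

```python
def assign_nodes(current, targets):
    """Map nodes to apps for the next round, keeping apps in place where possible.

    current -- dict: node name -> app currently running there (or None if idle)
    targets -- dict: app name -> number of nodes allocated by the market
    Returns a new dict: node name -> app (or None if the node is left idle).
    """
    new_assignment = {}
    remaining = dict(targets)  # nodes each app still needs
    free = []                  # nodes whose current app must stop (or idle nodes)

    # Pass 1: an app keeps a node it already runs on while it still needs nodes.
    for node in sorted(current):
        app = current[node]
        if app is not None and remaining.get(app, 0) > 0:
            new_assignment[node] = app
            remaining[app] -= 1
        else:
            free.append(node)

    # Pass 2: hand freed nodes to apps that still need more.
    for app in sorted(remaining):
        for _ in range(remaining[app]):
            if not free:
                raise ValueError("targets exceed cluster size")
            new_assignment[free.pop(0)] = app

    # Any node still unclaimed is left idle.
    for node in free:
        new_assignment[node] = None
    return new_assignment


current = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}
print(assign_nodes(current, {"A": 3, "B": 1}))
```

In this example only `n4` changes hands (from B to A); the other three nodes keep running the same app.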
24Marketplace Optimality
- What is optimal?
- Under resource constraints, those applications with the most utility to derive from the use of additional nodes are given those nodes
- Utility Curves
- Curve specifies dollar value an application derives from possessing a certain number of nodes for a specific time quantum.
- Trivially: utility curves are always monotonically non-decreasing (i.e. it is never worse to own more nodes at a given total cost)
- To be optimal: marginal utility curves must be monotonically non-increasing (i.e. every additional node is worth the same or less than the one before)
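The optimality condition can be illustrated with a greedy allocator that always gives the next node to the app with the highest marginal utility. This is a sketch of the idea only, not the OnCall marketplace itself: it ignores the G guarantees and prices, the curve values are made up, and the greedy result is only guaranteed to maximize total utility when every marginal utility curve is non-increasing, as the slide requires.

```python
import heapq


def allocate_by_marginal_utility(utility_curves, total_nodes):
    """Greedily hand out nodes to whichever app gains the most from the next one.

    utility_curves -- dict: app -> list u, where u[k] is the value of holding k nodes
    total_nodes    -- number of nodes available in the cluster
    """
    allocation = {app: 0 for app in utility_curves}
    heap = []  # entries are (-marginal gain of the app's next node, app)
    for app, u in utility_curves.items():
        if len(u) > 1:
            heapq.heappush(heap, (-(u[1] - u[0]), app))

    for _ in range(total_nodes):
        if not heap:
            break  # every app already holds as many nodes as its curve describes
        _, app = heapq.heappop(heap)
        allocation[app] += 1
        k = allocation[app]
        u = utility_curves[app]
        if k + 1 < len(u):
            heapq.heappush(heap, (-(u[k + 1] - u[k]), app))
    return allocation


# Hypothetical curves with non-increasing marginal utility.
curves = {
    "A": [0, 10, 18, 24, 28],  # marginal gains: 10, 8, 6, 4
    "B": [0, 7, 12, 15, 17],   # marginal gains: 7, 5, 3, 2
}
print(allocate_by_marginal_utility(curves, total_nodes=4))
```

With four nodes, the gains taken in order are 10 (A), 8 (A), 7 (B), and 6 (A), so A ends up with three nodes and B with one.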
25Profit Through Efficiency
- Shut Down App
- ASP shuts down servers when it can buy them for less than the cost of keeping them running (A/C, utilities, etc.)
- ASP can then add additional capacity and sell only when profitable
26Marketplace Fairness
- Markets are optimal if
- they are free and fair
- Anti-competitive behavior
- Monopoly/Oligopoly
- Aggressive tactics
- Fairness through Regulation
- Ensure enough distinct owners -> no monopoly
- Fine or ban app that engages in overtly
anti-competitive behavior
27Future Work
- VM caching
- Cache VMs to local disk (speculatively or as read from NAS)
- Fault tolerance
- Add master-backup fault tolerance to the OnCall Manager
- Performance statistics
- Provide market policies with additional statistics (e.g. end-to-end response time)
- Scalable data layer
- Add support for scalable persistent stores that would allow replication on the data tier.
- Multiplexing
- Study trade-offs of running several applications on one node