Power Aware Domain Migration in a Virtualized Cluster
1
Power Aware Domain Migration in a Virtualized
Cluster
  • Tyler Bletsch, Min Yeol Lim, and Vince Freeh

2
Motivation
  • Data centers provision for peak demand
  • Expect to see average demand far less than peak
  • Underutilized resources!
  • Can we exploit this to save energy?
  • Of course.

[Figure: CPU demand vs. time]
3
Existing techniques
  • Prior work
  • Assume a single input stream
  • Requests are short-lived, independent
  • Can be served by homogeneous stateless software
  • Power-Aware Request Distribution (PARD) [1]
  • Dynamically power on/off nodes in response to
    anticipated load
  • No state to save/migrate
  • Direct requests to activated nodes
  • This is a very specific workload

4
A more general solution
  • What about
  • systems that maintain state?
  • environments with separate, indivisible
    processes?
  • Goal: a more general energy-savings technique
  • How?
  • Mechanism: virtualization with live migration
  • Policy: Power-Aware Domain Distribution (PADD)
  • Terminology
  • Domain: a virtual machine capable of migration
  • Node: a physical machine that runs one or more
    domains

5
PARD vs. PADD
6
Conceptual diagram
[Diagram: CPU demand vs. time, with snapshots at times Ta and Tb.
Initially Dom 1 and Dom 2 run on Node 1 and Node 2; after demand
drops, both domains run on Node 1 and Node 2 is off.]
7
Problem
  • Problem is easy to state
  • What is the best way to distribute domains within
    a cluster of physical nodes such that resource
    demands are satisfied while minimizing energy
    (and therefore the number of required nodes)?
  • Solution is a bit harder

8
Questions
  • How is resource utilization measured?
  • What window size?
  • When do we shut down a node?
  • Which node gets shut down?
  • Where do we send the victim domains?
  • When do we bring up a node?
  • Which domains get migrated to the new node?
  • What safety margins are needed to handle
    transient spikes? How are they enforced?

9
How to proceed?
  • PADD Simulator
  • All questions become parameters
  • Quickly explore the problem
  • Test many combinations of parameters
  • Ask what-if questions
  • What if migration time could be reduced?
  • What if my incoming traffic doubles?
  • Avoid dealing with transient technical issues

10
Simulator design (1)
  • System resources model
  • Only one resource for now: CPU
  • Other resources will be added
  • Node model
  • 3 power states
  • What is standby?
  • Fixed performance measure
  • Maximum supported CPU utilization
  • Allows for SMP (values > 1)
  • Allows for different CPU models (non-integer
    values)

[Diagram: the three node power states (Active, Standby, Off)]
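The node model above can be put into code. The following Python sketch is illustrative only (the class and field names are my assumptions, not the authors' simulator API); it captures the three power states and the fixed performance capacity, which can exceed 1 for SMP or be non-integer for differing CPU models.

```python
from dataclasses import dataclass
from enum import Enum

class PowerState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    OFF = "off"

@dataclass
class Node:
    """A physical machine in the simulated cluster."""
    # Maximum supported CPU utilization: 1.0 for one core,
    # > 1 for SMP, non-integer for differing CPU models.
    capacity: float = 1.0
    state: PowerState = PowerState.OFF

    def slack(self, demand: float) -> float:
        """CPU headroom left after serving `demand`."""
        return self.capacity - demand
```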
11
Simulator design (2)
  • CPU utilization metric
  • At a given moment, CPU is busy or halted
  • Utilization = busy_time / total_time
  • What is total_time, the sampling window?
  • Tradeoff
  • Too small: chaotic mix of peaks/valleys, no trend,
    overreacts to small events
  • Too large: slow to react
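The metric itself is a one-liner; a minimal sketch (function name mine, not from the slides):

```python
def utilization(busy_time: float, total_time: float) -> float:
    """CPU utilization over one sampling window: the fraction of
    the window during which the CPU was busy rather than halted."""
    if total_time <= 0:
        raise ValueError("sampling window must be positive")
    return busy_time / total_time
```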

12
Simulator design (3)
  • Reduction: run a reduce function on raw
    utilization samples within a given reduction
    window
  • Functions: average, max, nth percentile, always
    return X

[Figure: raw utilization data with its max and mean reductions]
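The four reduce functions named on the slide could look as follows. This is a sketch under my own naming assumptions; the percentile uses the nearest-rank method, one of several reasonable choices.

```python
import math

def reduce_samples(samples, fn="mean", pct=95, const=1.0):
    """Reduce raw utilization samples within a reduction window
    to a single value: average, max, nth percentile, or a
    constant ("always return X")."""
    if fn == "mean":
        return sum(samples) / len(samples)
    if fn == "max":
        return max(samples)
    if fn == "percentile":
        # nth percentile, nearest-rank method
        ordered = sorted(samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]
    if fn == "const":
        # e.g. const=1.0 models the naive assumption of 100% demand
        return const
    raise ValueError(f"unknown reduction function: {fn}")
```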
13
Simulator algorithm
  • For each time step
  • Find utilization of all domains, compare to node
    performance capacity
  • For each overcommitted node
  • Pick a victim domain
  • Is there a node with enough slack to run the
    victim?
  • Yes: migrate the domain there
  • No: boot a new node, then migrate to it
  • If a node can have all of its domains migrated
    off safely
  • Migrate such domains
  • Set node to reduced power state
  • Record statistics (energy, migration count, etc.)
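The per-time-step loop above can be sketched as plain Python. This is a simplified illustration, not the authors' simulator: nodes are dicts, the victim policy (smallest domain first) is one arbitrary choice, and the power-down pass only switches off nodes that are already empty rather than actively draining lightly loaded ones.

```python
def step(nodes):
    """One PADD time step: resolve overcommitted nodes by migrating
    victim domains, then power off fully drained nodes.
    `nodes`: list of {"capacity": float, "on": bool,
    "domains": {name: utilization}}. Returns the migration count."""
    migrations = 0

    def load(n):
        return sum(n["domains"].values())

    # Resolve each overcommitted node.
    for n in nodes:
        while n["on"] and load(n) > n["capacity"]:
            # Victim policy (illustrative): smallest domain first.
            victim = min(n["domains"], key=n["domains"].get)
            util = n["domains"].pop(victim)
            # Prefer a running node with enough slack; else boot one.
            target = next((m for m in nodes if m["on"] and m is not n
                           and m["capacity"] - load(m) >= util), None)
            if target is None:
                target = next(m for m in nodes if not m["on"])
                target["on"] = True  # boot a new node
            target["domains"][victim] = util
            migrations += 1

    # Simplified drain pass: power off nodes with no domains left.
    for n in nodes:
        if n["on"] and not n["domains"]:
            n["on"] = False
    return migrations
```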

14
Simulator details
  • Safety margins
  • Local delta: extra CPU slack for each node
  • To handle transient demand spikes
  • Calculated dynamically according to a policy
    parameter
  • Static user-specified value
  • Sum of (standard deviation of utilization over
    reduction window) for each domain
  • Sum of (100% - average domain utilization) for
    each domain
  • Global delta: extra CPU slack for the whole
    cluster
  • Keeps some extra nodes running so we don't have
    to wait on boot-up while demand increases
  • Calculated dynamically, can use same policies as
    local delta
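The three local-delta policies can be sketched directly from the slide. Policy names and the function signature are my assumptions; utilizations are expressed in [0, 1], so "100%" becomes 1.0.

```python
from statistics import mean, pstdev

def local_delta(domain_windows, policy="stddev", static=0.1):
    """Extra per-node CPU slack to absorb transient demand spikes.
    `domain_windows`: one list of utilization samples (in [0, 1])
    per domain, taken over the reduction window."""
    if policy == "static":
        # A fixed, user-specified slack value.
        return static
    if policy == "stddev":
        # Sum of each domain's utilization standard deviation.
        return sum(pstdev(w) for w in domain_windows)
    if policy == "headroom":
        # Sum of (100% - average domain utilization).
        return sum(1.0 - mean(w) for w in domain_windows)
    raise ValueError(f"unknown delta policy: {policy}")
```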

15
When is demand satisfied?
  • Depends on workload
  • Throughput-oriented workload
  • First pass: unmet CPU demand
  • Each sample, compare total domain demand D
    against node CPU performance capacity P
  • If P ≥ D, then demand is satisfied
  • Else, D - P is the unmet demand for that sample
  • Add to a running total
  • A simulation is successful if unmet_demand <
    MAX_UNMET_DEMAND
  • MAX_UNMET_DEMAND is empirically derived and small
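The success criterion above amounts to a running sum of per-sample shortfalls. A minimal sketch (function names are mine):

```python
def unmet_demand(demand_trace, capacity_trace):
    """Total unmet CPU demand over a run: per sample, add
    max(0, D - P), where D is total domain demand and P is node
    capacity. Demand is satisfied whenever P >= D."""
    total = 0.0
    for d, p in zip(demand_trace, capacity_trace):
        total += max(0.0, d - p)
    return total

def successful(total_unmet, max_unmet):
    """A simulation succeeds if unmet_demand < MAX_UNMET_DEMAND."""
    return total_unmet < max_unmet
```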

16
Summary
  • Simulator takes inputs
  • CPU utilization traces from real servers
  • Cluster configuration
  • Domain migration policy parameters
  • Produces outputs
  • Trace of migration history with running
    statistics
  • Summary statistics: total energy use, total unmet
    demand, etc.

17
Preliminary test
  • Configuration
  • 4 web-server domains
  • Test duration: 12 hours
  • Sampling window: 1 s
  • Reduction window: 300 s
  • Comparing policy decisions

18
Preliminary results
  • Bad news
  • Most non-naïve policies led to large unmet demand
    (> 5800 CPU seconds)
  • Naïve: always assume 100% demand
  • No successful simulation saved any power;
    naïve techniques simply ran on 4 nodes all the
    time
  • Good news
  • The maximum reduction function worked as well
    as naïve
  • Should perform even better on workloads with
    stable, low demand
  • E.g. run-to-completion where CPU is not the
    bottleneck
  • Several unsuccessful runs did save power
  • If we can curb the unmet demand, we'll be in
    business!

19
Conclusion
  • This work is in its infancy, with many problems to
    overcome
  • Simulator allows rapid exploration of the problem
  • Hope to find successful power-saving
    configurations in the coming months

20
References
  • [1] K. Rajamani and C. Lefurgy. On evaluating
    request-distribution schemes for saving energy in
    server clusters. In Proceedings of the IEEE
    International Symposium on Performance Analysis
    of Systems and Software, 2003.

21
Unmet demand
22
Energy savings (or the lack thereof)
23
Boot timing (1)
Cold boot
Restoring from hibernate mode
24
Boot timing (2)