Title: Sensor-Based Fast Thermal Evaluation Model For Energy Efficient High-Performance Datacenters
1Sensor-Based Fast Thermal Evaluation Model For
Energy Efficient High-Performance Datacenters
- Q. Tang, T. Mukherjee, Sandeep K. S. Gupta
- Department of Computer Science and Engineering
- Arizona State University
- Phil Cayton, Intel Corp.
2Heating problem in Data Center
- Power densities are increasing exponentially, in line with Moore's Law
- Current cooling solutions at various levels
- Chip / component level
- Server/board level
- Rack level
- Data center level
3Two steps of reducing heating effects
- Design and deployment stage (Civil and Mechanical Engineering approach)
- Increasing air conditioner capacity
- Designing an optimized layout to facilitate air circulation
- Operation stage (Computer Science approach)
- Example: dynamically assigning tasks to avoid overheated servers and to achieve thermal balancing
- Assigning tasks to servers that consume less energy
4Thermal Management of Datacenter
- Motivation and significance
- Compute-intensive applications (online gaming, computer movie animation, data mining) require increased utilization of the data center
- Maximizing computing capacity is a demanding requirement
- New blade servers can be packed more densely
- Energy cost is rising dramatically
- Goal
- Improving thermal performance
- Lowering hardware failure rate
- Reducing energy cost
5Typical layout of a datacenter
- Rack outlet temperature Tout
- Rack inlet temperature Tin
- Air conditioner supply temperature Ts
6Schematic View of Thermal Management
7Thermal-Aware Scheduling versus Datacenter Energy Cost
8Thermal Scheduling Problem Statement
- We present results of thermal-aware scheduling to improve the energy efficiency of (blade-server-based) datacenters
- Given a total task C, how should it be divided among N server nodes so that the computing task finishes with minimal total energy cost?
9Energy Conservation
- Outlet airflow
- Server power consumption Pi, depending on the amount of computing task
- Inlet airflow: a mixture of supplied cold air and recirculated hot air
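The energy balance named on this slide can be sketched as a short calculation. This is a minimal illustration, assuming steady state and that all power drawn by a server leaves as heat in its outlet airflow. The function names, the air properties and the mixing fraction are assumptions for the example, not values from the slides.

```python
AIR_DENSITY = 1.19        # kg/m^3, assumed typical value for air
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), assumed typical value for air


def outlet_temperature(t_in, power_w, flow_m3_per_s):
    """Outlet air temperature (deg C) of a node.

    Steady-state balance: heat carried out = heat carried in + power drawn.
    """
    heat_capacity_rate = AIR_DENSITY * flow_m3_per_s * AIR_SPECIFIC_HEAT  # W/K
    return t_in + power_w / heat_capacity_rate


def inlet_temperature(t_sup, t_recirculated, recirculated_fraction):
    """Inlet air as a mixture of supplied cold air and recirculated hot air."""
    return ((1.0 - recirculated_fraction) * t_sup
            + recirculated_fraction * t_recirculated)
```

For instance, `inlet_temperature(15.0, 35.0, 0.25)` mixes three parts of 15 °C supply air with one part of 35 °C recirculated air.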
10Thermal Management
- Different task assignments result in different power consumption distributions
- Different power consumption distributions result in different temperature distributions
- Different temperature distributions result in different total energy costs
11Example
- Inlet temperature distribution without cooling
- With cooling, inlet temperatures are lowered below the redline threshold
- Different schedulings result in different inlet temperature distributions
- Scheduling 1: demand for cooling load/energy (25°C threshold)
- Scheduling 2: demand for cooling load/energy (25°C threshold)
12Total Energy Cost of Datacenter
- Computing energy cost
- Cooling energy cost
- Keep the maximal inlet temperature below the redline temperature of devices (25°C)
- Coefficient of Performance (COP): $\mathrm{COP} = \frac{\text{amount of heat removed}}{\text{energy consumed by the cooling device}}$
- Total energy cost
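The COP definition above leads directly to a total energy figure. The sketch below is a minimal illustration, assuming that all computing power becomes heat that the cooling device must remove. The function names and the COP values in the usage note are hypothetical, not taken from the slides.

```python
def cooling_power(computing_power_w, cop):
    """Power drawn by the cooling device to remove the given heat load.

    Follows from COP = heat removed / energy consumed by the cooling device.
    """
    if cop <= 0:
        raise ValueError("COP must be positive")
    return computing_power_w / cop


def total_power(computing_power_w, cop):
    """Computing power plus the cooling power needed to remove its heat."""
    return computing_power_w + cooling_power(computing_power_w, cop)
```

With a 10 kW computing load, a COP of 2.5 adds 4 kW of cooling power, while a COP of 4.0 adds only 2.5 kW. Any scheduling that lets the cooling unit run at a higher COP therefore lowers the total.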
13Observation
- Even with the same computing power dissipation, different temperature distributions may demand different cooling loads, resulting in different total energy costs
- We can manipulate task scheduling to achieve the best temperature distribution and consequently minimize total energy cost
14Naive Scheduling Algorithm
15Uniform Outlet Profile
- Why naive?
- Based on observation and intuition
- No mathematical formalization
- Uniform Outlet Profile (UOP)
- Assigning tasks so as to achieve a uniform outlet temperature distribution Tc
- Assigning more tasks to nodes with low inlet temperature (water-filling process)
(Figure labels: inlet temperature, temperature rise due to power consumption, Tc)
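The water-filling idea behind UOP can be sketched as a search for the common outlet level Tc. This is a minimal illustration, not the authors' implementation. It assumes that each node's outlet temperature rises linearly with its power, and it expresses tasks as watts. The function name, the `rise_per_watt` parameter and the per-node cap `p_max` are hypothetical.

```python
def uniform_outlet_profile(t_in, rise_per_watt, total_power, p_max, tol=1e-9):
    """Water-filling power assignment aiming at a common outlet temperature.

    t_in:          inlet temperature of each node (deg C)
    rise_per_watt: outlet temperature rise per watt on each node (K/W)
    total_power:   power to distribute over the nodes (W)
    p_max:         per-node power cap (W)
    Returns (tc, powers).
    """
    if total_power > p_max * len(t_in):
        raise ValueError("total power exceeds capacity")

    def powers_at(tc):
        # A node is filled up to the level tc, never below zero or above the cap.
        return [min(p_max, max(0.0, (tc - t) / r))
                for t, r in zip(t_in, rise_per_watt)]

    lo = min(t_in)                                          # nothing assigned
    hi = max(t + r * p_max for t, r in zip(t_in, rise_per_watt))  # all capped
    while hi - lo > tol:                                    # bisection on tc
        mid = (lo + hi) / 2.0
        if sum(powers_at(mid)) < total_power:
            lo = mid
        else:
            hi = mid
    return hi, powers_at(hi)
```

Nodes with a low inlet temperature receive more power, and a node whose inlet temperature is already at or above Tc receives none.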
16Uniform Task
- Uniform Task (UT)
- Assigning all chassis the same amount of tasks (power consumption)
- All nodes experience the same power consumption and temperature rise
17Minimum Computing Energy
- Minimum computing energy (cooling inlet)
- Assigning tasks so that the number of active (powered-on) chassis is as small as possible
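The Uniform Task and Minimum Computing Energy policies can be contrasted in a few lines. This is a minimal sketch under stated assumptions. Tasks are treated as divisible units and every chassis has the same `capacity`. The slides do not say which chassis are filled first, so the coolest-inlet-first order is an assumption, as are the function names.

```python
def uniform_task(total_tasks, n_chassis):
    """Uniform Task: every chassis receives the same share of the work."""
    return [total_tasks / n_chassis] * n_chassis


def minimum_computing_energy(total_tasks, capacity, inlet_temps):
    """Fill as few chassis as possible; the rest stay unassigned (powered off).

    Assumption: chassis with the lowest inlet temperature are filled first.
    """
    order = sorted(range(len(inlet_temps)), key=lambda i: inlet_temps[i])
    assignment = [0] * len(inlet_temps)
    remaining = total_tasks
    for i in order:
        if remaining <= 0:
            break
        assignment[i] = min(capacity, remaining)
        remaining -= assignment[i]
    if remaining > 0:
        raise ValueError("not enough capacity for the requested tasks")
    return assignment
```

With 25 task units, four chassis and a capacity of 10 each, the uniform policy keeps all four chassis active, while the packing policy leaves the warmest one without work.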
18Abstract Heat Flow Model: Cross Interference Coefficients
19Abstract Heat Flow Model
- Observation
- Airflow patterns are stable (confirmed through CFD simulation)
- Hypothesis
- The amount of recirculated heat is stable and can be characterized
- Define aij as the percentage of recirculated heat from node i to node j
20Cross Interference among Server Nodes
- Cross Interference Coefficients (CIC)
- Define aij as the percentage of recirculated heat from node i to node j
- Cross interference coefficients
- Cross Interference Matrix
- Correlations among power consumption (utilization
rate), temperature, and cross interference
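One way to see how the coefficients aij tie power consumption to temperature is a small heat-balance calculation. The sketch below is an illustration derived from the definition of aij, not the authors' code. It assumes that the inlet air not coming from other nodes comes from the air conditioner at the supply temperature, and that each node's airflow has a known heat capacity rate `k[i]` (air density times flow rate times specific heat). All names are hypothetical. A fixed-point iteration keeps the example dependency-free; a direct linear solve of the same balance equations is an alternative.

```python
def predict_temperatures(a, k, power, t_sup, iterations=200):
    """Predict inlet and outlet temperatures from cross interference coefficients.

    a[i][j]:  fraction of the heat leaving node i that recirculates into node j
    k[i]:     heat capacity rate of node i's airflow (W/K)
    power[i]: power drawn by node i (W)
    t_sup:    air conditioner supply temperature (deg C)
    Returns (t_in, t_out).
    """
    n = len(power)
    t_in = [t_sup] * n
    t_out = [t_sup] * n
    for _ in range(iterations):
        for j in range(n):
            # Heat arriving at inlet j from the outlets of all nodes.
            recirculated = sum(a[i][j] * k[i] * t_out[i] for i in range(n))
            # The remaining inlet airflow is assumed to be supply air.
            supplied = (k[j] - sum(a[i][j] * k[i] for i in range(n))) * t_sup
            t_in[j] = (recirculated + supplied) / k[j]
        # Outlet heat = inlet heat + power drawn by the node.
        t_out = [t_in[i] + power[i] / k[i] for i in range(n)]
    return t_in, t_out
```

In a two-node example where 20% of node 0's outlet heat reaches node 1, node 1's inlet temperature sits above the supply temperature even though both nodes draw the same power.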
21Fast Thermal Evaluation
- Use a profiling process to calculate cross interference coefficients
- Temperature prediction
(Figure: a configuration of a distributed system reaches thermal performance evaluation either through numerical simulation (hours) or through fast thermal evaluation (real time))
22Recirculation Minimized Scheduling XInt
23Formalizing optimization problem
- To minimize cooling energy cost, we only need to minimize the maximal inlet temperature
- The optimization problem, formalized on the abstract heat flow model, can be converted into LP, ILP, linear, or nonlinear problems according to different models and policies
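The objective of minimizing the maximal inlet temperature can be illustrated on a toy instance. The sketch below is not the LP or ILP formulation mentioned above. It swaps in an exhaustive search, which is only practical for a handful of nodes. It assumes a hypothetical matrix `d` in which `d[j][i]` is the inlet temperature rise at node j per watt drawn by node i, and a fixed power per task. All names and numbers are assumptions for the example.

```python
from itertools import product


def max_inlet_rise(d, powers):
    """Largest inlet temperature rise, with rise_j = sum_i d[j][i] * powers[i]."""
    return max(sum(dji * p for dji, p in zip(row, powers)) for row in d)


def best_assignment(d, total_tasks, capacity, watts_per_task):
    """Search all integer task splits for the smallest peak inlet rise.

    Returns (peak_rise, tasks_per_node), or None if no split is feasible.
    """
    n = len(d)
    best = None
    for tasks in product(range(capacity + 1), repeat=n):
        if sum(tasks) != total_tasks:
            continue  # every task must be placed exactly once
        peak = max_inlet_rise(d, [watts_per_task * c for c in tasks])
        if best is None or peak < best[0]:
            best = (peak, tasks)
    return best
```

With `d = [[0.002, 0.010], [0.001, 0.002]]`, node 1 heats node 0's inlet five times more strongly than node 0 heats itself, so the search places all four tasks on node 0.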
24Simulation Results
25Simulation Environment
- 2 Row Datacenter
- Ten standard 42U racks
- Each rack has five Dell 1855 blade servers
- CFD simulation is used to evaluate the temperature distribution (Flovent from Flomerics)
26DataCenter model
(Figure: datacenter model with labeled nodes, including Node 1, Node 2, Node 5, Node 25, Node 30 and Node 50)
27Cross Interference Coefficients
- Confirmed with datacenter reality
- Strong interference to neighboring nodes
28Fast Thermal Evaluation Results
- Provides fast and accurate temperature prediction
- Practical for online real-time thermal management
29Simulation Results Cooling Cost
30Simulation Results Analysis Summary
- XInt consistently outperforms all other scheduling algorithms
- Compared with MinHR, XInt is more practicable
- Task-oriented scheduling vs. power-oriented scheduling
- Online, real-time
- XInt is mathematically formalized
31Future Work
- Integrating with cluster management software platforms (Moab, Torque, etc.)
- Considering task priorities and time constraints
32Questions?
33Related Work
- ConSil vs. Fast Thermal Evaluation
- Deduction vs. prediction
- Current vs. future: which is more important for proactive and preventive thermal management?
- MinHR vs. XInt
- Both characterize recirculation at similar granularities
- Aggregated effects vs. point to point
- Offline vs. online
- Power oriented vs. Task oriented
34Supply Heat Index (SHI)
- Roughly characterizes recirculation
- Cannot differentiate between different temperature distributions that have the same SHI