Title: ISCA 2004 Tutorial
1ISCA 2004 Tutorial
- Thermal Issues for Temperature-Aware Computer Systems
- Saturday, June 19th
- 8:00am - 5:00pm
2Presenters
- Kevin Skadron (skadron_at_cs.virginia.edu)
- CS Department, University of Virginia
- David Brooks (dbrooks_at_eecs.harvard.edu)
- CS Department, Harvard University
- Antonio Gonzalez (antonio_at_ac.upc.es)
- UPC-Barcelona, and Intel Barcelona Research Center
- Lev Finkelstein (lev.finkelstein_at_intel.com)
- Intel Haifa
- Mircea Stan (mircea_at_virginia.edu)
- ECE Department, University of Virginia
3Overview
- Motivation (Kevin): 1.5 hrs
- Thermal issues (Kevin)
- Power modeling (David): 1.5 hrs
- Thermal management (David)
- Optimal DTM (Lev): 0.5 hrs
- Clustering (Antonio): 1 hr
- Power distribution (David): 15 min
- What current chips do (Lev): 45 min
- HotSpot and sensors (Kevin): 1 hr
4Overview
- Motivation (Kevin)
- Thermal issues (Kevin)
- Power modeling (David)
- Thermal management (David)
- Optimal DTM (Lev)
- Clustering (Antonio)
- Power distribution (David)
- What current chips do (Lev)
- HotSpot (Kevin)
5Motivation
- Power consumption: first-order design constraint
- unconstrained power is a theoretical max
- peak (~instantaneous) power limits power delivery
- sustained power limits thermal design/packaging
- max sustained power: thermal virus
- same as thermal design power
- average active power and idle power limit mobile battery life, etc.
- Common fallacy: equating instantaneous power with temperature
- Power density is increasing exponentially
- Unfortunate corollary of Moore's Law
- thermal effects become more problematic
- Need Power/Temperature-aware computing!
6Power Dissipation
Source: Microprocessor Report
7Effects of Technology Scaling on Power Dissipation
- Feature size is scaling down: ~30%
- Frequency is increasing: ~2x
- Area increases due to microarchitecture improvements: ~25% (ideal scaling decreases by 50%)
- Active capacitance increases: at least 30% (ideal scaling decreases by 30%)
- Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling 30%)
- Ideal scaling: $P \propto C V^2 f \Rightarrow 0.7^2$ reduction $\approx 0.5$
- Observed scaling: $\approx$ 2-2.5x increase
- Power density becomes a problem!
- Especially since the power density is non-uniform
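The scaling arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that reads the slide's figures as per-generation percentages and applies the dynamic-power relation P proportional to C * V^2 * f. The function name `power_scale` is hypothetical, and the 1/0.7 frequency gain used for the ideal case is an assumption (classical constant-field scaling), not a value from the slide.

```python
def power_scale(cap, vdd, freq):
    """Relative change in dynamic power, using P proportional to C * V^2 * f."""
    return cap * vdd ** 2 * freq

# Ideal scaling: C and Vdd each shrink by 30%.
# The 1/0.7 frequency gain is an assumption of this sketch.
ideal = power_scale(cap=0.7, vdd=0.7, freq=1 / 0.7)

# Observed: C up by 30%, Vdd down by only 0-10%, frequency up 2x.
observed_low = power_scale(cap=1.3, vdd=0.9, freq=2.0)
observed_high = power_scale(cap=1.3, vdd=1.0, freq=2.0)

print(f"ideal:    {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

Under these assumptions the ideal case comes out near the slide's 0.5, and the observed case lands roughly in the slide's 2-2.5x range.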
8Trends in Power Density
[Figure: power density on a logarithmic scale (1 to 1000) versus process generation (1.5µm down to 0.07µm). Processors plotted: i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4. Reference levels: hot plate, nuclear reactor, rocket nozzle, Sun's surface.]
Source: Fred Pollack, Intel Corp., "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32 conference keynote, 1999.
9ITRS Projections
- These are targets
- Power-density problem is still getting worse
- Intel papers suggest that in the 45-75W range, cooling costs $1/W, but then the rate of increase goes up: $2, $3/W, probably more! (Borkar, IEEE Micro '99; Gunther et al., ITJ '01)
ITRS 2001
10Leakage Power
- The fraction of leakage power is increasing exponentially with each generation
- Also exponentially dependent on temperature
[Figure: increasing ratio across generations]
Source: Sankaranarayanan et al., University of Virginia
11Power-aware figures of merit
- Power ($P$): battery time (mobile)
- (1/W): packaging (high-performance)
- Energy ($PD$): battery life (mobile)
- (MIPS/W): fundamental limits (kT)
- Energy-delay ($PD^2$)
- (MIPS^2/W): performance and low power
- Energy-delay^2 ($PD^3$): indep. of Vdd
- (MIPS^3/W): emphasis on performance
- Power-aware ≠ low power
- Similar to old VLSI complexity ($A$, $AD$, $AD^2$)
- None of these are appropriate for thermal
- This is a problem
- Refs:
- R. Gonzalez et al., "Supply and threshold voltage scaling for low power CMOS," JSSC, Aug. 1997
- A. Martin et al., "Design of an Asynchronous MIPS R3000," ARVLSI '97
- J. Ullman, Computational Aspects of VLSI, CS Press, 1984
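To make the difference between these figures of merit concrete, here is a small sketch that evaluates P, PD, PD^2 and PD^3 for two design points. The function name `figures_of_merit` and both design points are hypothetical, chosen only to show that the ranking of two designs can flip depending on which metric is used. They are not data from the tutorial.

```python
def figures_of_merit(power_w, delay_s):
    """Return P, PD, PD^2 and PD^3 for one design point (lower is better)."""
    return {
        "P": power_w,
        "PD": power_w * delay_s,
        "PD^2": power_w * delay_s ** 2,
        "PD^3": power_w * delay_s ** 3,
    }

# Hypothetical design points: a fast, power-hungry design and a slow, frugal one.
fast = figures_of_merit(power_w=80.0, delay_s=1.0)
slow = figures_of_merit(power_w=25.0, delay_s=2.0)

for name in fast:
    better = "fast" if fast[name] < slow[name] else "slow"
    print(f"{name:5s} fast={fast[name]:7.1f} slow={slow[name]:7.1f} -> {better}")
```

With these made-up numbers the slow design wins on power and energy, while the fast design wins once delay is weighted more heavily.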
12Cooking-aware computing
- Some chips rated for 100°C
13Power and temperature are BAD
Source: Tom's Hardware Guide, http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html
14Other Costs of High Heat Flux
- Some chips may already be underclocked due to thermal constraints!
- (especially mobile and sealed systems)
15Temporal, Spatial Variations
Temperature variation of SPEC applu over time
Hot spots increase cooling costs ⇒ must cool for hot spot
16Application Variations
- Wide variation across applications
- Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
- Leakage is an especially severe problem: exponentially dependent on temperature!
17Heat vs. Temperature
- Different time scales
- Heat: no notion of spatial locality
- Does architecture have a role?
- Temperature-aware computing
- Optimize performance subject to a temperature constraint
18Overview
- Motivation (Kevin)
- Thermal issues (Kevin)
- Power modeling (David)
- Thermal management (David)
- Optimal DTM (Lev)
- Clustering (Antonio)
- Power distribution (David)
- What current chips do (Lev)
- HotSpot and sensors (Kevin)
19Thermal issues
- Temperature affects
- Circuit performance
- Circuit power (leakage)
- IC reliability
- IC and system packaging cost
- Environment
20Performance and leakage
- Temperature affects
- Transistor threshold and mobility
- Subthreshold leakage, gate leakage
- Ion, Ioff, Igate, delay
- ITRS: 85°C for high-performance, 110°C for embedded!
[Figures: Ioff; Ion (NMOS)]
21Temperature-aware circuits
- Robustness constraint sets Ion/Ioff ratio
- Robustness and reliability: Ion/Igate ratio
- Idea: keep ratios constant with T; trade leakage for performance!
Ref: Ghoshal et al., "Refrigeration Technologies," ISSCC 2000; Garrett et al., "T3," ISCAS 2001
22Resulting performance
- 25-30% extra performance (110°C to 0°C)
[Figure: regular vs. TAC]
23Reliability
- The Arrhenius equation: $\mathrm{MTF} = A \exp(E_a / (kT))$
- MTF: mean time to failure at T
- $A$: empirical constant
- $E_a$: activation energy
- $k$: Boltzmann's constant
- $T$: absolute temperature
- Failure mechanisms
- Die metalization (corrosion, electromigration, contact spiking)
- Oxide (charge trapping, gate oxide breakdown, hot electrons)
- Device (ionic contamination, second breakdown, surface-charge)
- Die attach (fracture, thermal breakdown, adhesion fatigue)
- Interconnect (wirebond failure, flip-chip joint failure)
- Package (cracking, whisker and dendritic growth, lid seal failure)
- Most of the above increase with T (Arrhenius)
- Notable exception: hot electrons are worse at low temperatures
- More on this later
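The Arrhenius relation can be used to compare expected lifetimes at two temperatures, because the empirical constant A cancels in the ratio. The sketch below assumes the form MTF = A * exp(Ea / (k * T)) with T in kelvin. The function name `mtf_ratio` is hypothetical, and the activation energy of 0.7 eV is an illustrative assumption, not a value from the tutorial; real failure mechanisms have different activation energies.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K

def mtf_ratio(t_cool_c, t_hot_c, ea_ev):
    """MTF(t_cool) / MTF(t_hot) from MTF = A * exp(Ea / (k * T)); A cancels."""
    t_cool_k = t_cool_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_cool_k - 1.0 / t_hot_k)
    return math.exp(exponent)

# Ea = 0.7 eV is an assumed, illustrative activation energy.
ratio = mtf_ratio(t_cool_c=85.0, t_hot_c=100.0, ea_ev=0.7)
print(f"MTF at 85 C is about {ratio:.1f}x the MTF at 100 C")
```

The ratio depends strongly on the assumed activation energy, so this should be read as a way to explore the equation rather than as a lifetime prediction.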
24Packaging cost
- From Cray (local power generator and refrigeration)
Source: Gordon Bell, "A Seymour Cray perspective," http://www.research.microsoft.com/users/gbell/craytalk/
25Packaging cost
- To today
- Grid computing: power plants co-located near compute farms
- IBM S/390
- refrigeration
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling," IBM Journal of R&D
26IBM S/390 refrigeration
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling," IBM Journal of R&D
27IBM S/390 processor packaging
- Processor subassembly: complex!
- C4: Controlled Collapse Chip Connection (flip-chip)
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling," IBM Journal of R&D
28Intel Itanium packaging
- Complex and expensive (note heatpipe)
Source: H. Xie et al., "Packaging the Itanium Microprocessor," Electronic Components and Technology Conference, 2002
29Intel Pentium 4 packaging
Source: Intel web site
30Graphics Cards
Source: Tech-Report.com
31More Graphics Cards
32Under/Overclocking
- Some chips need to be underclocked
- Especially true in constrained form factors
- Try fitting this in a laptop or Gameboy!
Ultra model of Gigabyte's 3D Cooler Series
Source: Tom's Hardware Guide
33Apple G5 liquid cooling
- Don't know details
- Lots of people in thermal engineering community think liquid is inevitable, especially for server rooms
- But others say no
- This introduces a whole new kind of leakage problem
- Water and electronics don't mix!
34Environment
- Environmental Protection Agency (EPA): computers consume 10% of commercial electricity consumption
- This incl. peripherals, possibly also manufacturing
- A DOE report suggested this percentage is much lower
- No consensus, but it's still a lot
- Equivalent power (with only 30% efficiency) for AC
- CFCs used for refrigeration
- Lap burn
- Fan noise
35Heat mechanisms
- Conduction
- Convection
- Radiation
- Phase change
- Heat storage
36Conduction
- Similar to electrical conduction (e.g. metals are good conductors)
- Heat flow from high energy to low energy
- Microscopic (vibration, adjacent molecules, electron transport)
- No major displacement of molecules
- Need a material: typically in solids (fluids: distance between molecules)
- Typical examples: thermal slug, spreader, heatsink
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
37Conduction
- Not a strong function of temperature
- But for the high temp. variations on high-perf. chips (30°), it matters
- Note esp. Si vs. Al, Cu
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
38Convection
- Macroscopic (bulk transport, mix of hot and cold, energy storage)
- Need material (typically in fluids, liquid, gas)
- Natural vs. forced (gas or liquid)
- Typical examples: heatsink (fan), liquid cooling
- Note that convection is profoundly affected by board layout
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
39Radiation
- Electromagnetic waves (can occur in vacuum)
- Negligible in typical applications
- Sometimes the only mechanism (e.g. in space)
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
40Carnot Efficiency
- Note that in all cases, heat transfer is proportional to ΔT
- This is also one of the reasons energy harvesting in computers is probably not cost-effective
- ΔT w.r.t. ambient is << 100
- For example, with a 25W processor, thermoelectric effect yields only 50mW
- Solbrekken et al., ITHERM '04
- This is also why Peltier coolers are not energy efficient
- 10% eff., vs. 30% for a refrigerator
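The point about small temperature differences can be illustrated with the Carnot bound, 1 - Tc/Th with both temperatures in kelvin, which caps how much of the heat flow any harvester could convert to work. This is a rough sketch: the function name `carnot_efficiency` is hypothetical, and the 85 C chip and 45 C ambient temperatures are assumed values, not numbers from the slide.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Upper bound on heat-to-work conversion efficiency: 1 - Tc/Th (in kelvin)."""
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    return 1.0 - t_cold_k / t_hot_k

# Assumed temperatures: 85 C at the chip, 45 C ambient inside the case.
limit = carnot_efficiency(t_hot_c=85.0, t_cold_c=45.0)
print(f"Carnot limit: {limit:.1%}")
print(f"Upper bound on power harvested from 25 W: {25.0 * limit:.2f} W")
```

Even this theoretical ceiling is only a small fraction of the dissipated power, and practical thermoelectric devices fall well below it, which is consistent with the 50mW figure quoted on the slide.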
41Surface-to-surface contacts
- Not negligible, heat crowding
- Thermal greases/epoxy (can pump-out)
- Phase Change Films (undergo a transition from solid to semi-solid with the application of heat)
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
42Phase-change
- Thermal solutions evolution
- Natural air cooling
- Forced-air cooling
- Liquid cooling
- Phase change (e.g. heat pipe)
- Refrigeration
- Phase change
a. Solid changing to a liquid: fusion, or melting
b. Liquid changing to a vapor: evaporation, also boiling
c. Vapor changing to a liquid: condensation
d. Liquid changing to a solid: crystallization, or freezing
e. Solid changing to a vapor: sublimation
f. Vapor changing to a solid: deposition
43Thermal resistance
44Thermal capacitance
- $C_{th} = V \cdot C_p \cdot \rho$
- $\rho$(Aluminum) = 2,710 kg/m³
- $C_p$(Aluminum) = 875 J/(kg·°C)
- $V = t \cdot A$ = 0.000025 m³
- $C_{bulk} = V \cdot C_p \cdot \rho$ = 59.28 J/°C
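The aluminum example can be reproduced numerically. A minimal sketch, assuming the slide's relation is thermal capacitance = volume * specific heat * density. The function name `thermal_capacitance` is hypothetical; the three input values are the ones listed on the slide.

```python
def thermal_capacitance(volume_m3, specific_heat_j_per_kg_c, density_kg_per_m3):
    """Cth = V * Cp * rho, in joules per degree Celsius."""
    return volume_m3 * specific_heat_j_per_kg_c * density_kg_per_m3

# Values from the slide: V = 0.000025 m^3, Cp = 875 J/(kg-C), rho = 2,710 kg/m^3.
c_bulk = thermal_capacitance(0.000025, 875.0, 2710.0)
print(f"C_bulk = {c_bulk:.2f} J/C")
```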
45Refrigeration
- conventional vs. thermo-electric (TEC)
- Can get T < T_amb (negative Rth!)
- TEC: Peltier effect (can use for local cooling)
46TEC electro-thermal model
47Simplistic steady-state model
- All thermal transfer: R = k/A
- Power density matters!
- Ohm's law for thermals (steady-state)
- $\Delta V = I \cdot R \rightarrow \Delta T = P \cdot R$
- $T_{hot} = P \cdot R_{th} + T_{amb}$
- Ways to reduce T_hot
- reduce P (power-aware)
- reduce Rth (packaging)
- reduce T_amb (Alaska?)
- maybe also take advantage of transients (Cth)
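The steady-state relation T_hot = P * Rth + T_amb and the three ways to reduce T_hot can be tried out directly. This is a sketch under stated assumptions: the function name `steady_state_temp` is hypothetical, and the baseline of 60 W, 0.8 C/W and 45 C ambient is an illustrative operating point, not a figure from the tutorial.

```python
def steady_state_temp(power_w, r_th_c_per_w, t_amb_c):
    """T_hot = P * Rth + T_amb."""
    return power_w * r_th_c_per_w + t_amb_c

# Hypothetical baseline: 60 W, 0.8 C/W junction-to-ambient, 45 C ambient.
base = steady_state_temp(60.0, 0.8, 45.0)
lower_power = steady_state_temp(50.0, 0.8, 45.0)    # power-aware design
lower_rth = steady_state_temp(60.0, 0.6, 45.0)      # better packaging
lower_ambient = steady_state_temp(60.0, 0.8, 35.0)  # cooler environment

print(f"baseline:        {base:.1f} C")
print(f"reduce P:        {lower_power:.1f} C")
print(f"reduce Rth:      {lower_rth:.1f} C")
print(f"reduce T_amb:    {lower_ambient:.1f} C")
```

This single-resistance view ignores spatial variation and transients, which is why the slide calls it simplistic.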
48Simplistic dynamic thermal model
- Electrical-thermal duality
- V ↔ temp (T)
- I ↔ power (P)
- R ↔ thermal resistance (Rth)
- C ↔ thermal capacitance (Cth)
- RC ↔ time constant
- KCL
- differential eq.: $I = C \, dV/dt + V/R$
- difference eq.: $\Delta V = (I/C)\,\Delta t - (V/(RC))\,\Delta t$
- thermal domain: $\Delta T = (P/C)\,\Delta t - (T/(RC))\,\Delta t$
- ($T = T_{hot} - T_{amb}$)
- One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C
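The stepwise update described on this slide can be sketched as a simple loop over a power trace. The code assumes the thermal-domain difference equation is dT = (P/C) dt - (T/(R C)) dt, with T measured relative to ambient, and uses a plain forward step, so dt should be much smaller than the RC time constant. The function name `simulate` and all numeric values (Rth = 0.8 C/W, Cth = 50 J/C, 60 W, 45 C ambient) are hypothetical.

```python
def simulate(power_trace_w, r_th, c_th, dt, t_amb_c, t0_c=None):
    """Step dT = (P/C) dt - (T/(R C)) dt, with T measured relative to ambient."""
    rel = 0.0 if t0_c is None else t0_c - t_amb_c  # start at ambient by default
    temps = []
    for p in power_trace_w:
        rel += (p / c_th) * dt - (rel / (r_th * c_th)) * dt
        temps.append(rel + t_amb_c)  # convert back to absolute temperature
    return temps

# Hypothetical values: RC = 0.8 * 50 = 40 s, time step 0.1 s.
trace = [60.0] * 4000  # constant 60 W for 400 s, i.e. 10 time constants
temps = simulate(trace, r_th=0.8, c_th=50.0, dt=0.1, t_amb_c=45.0)

print(f"after 40 s:  {temps[399]:.1f} C")
print(f"after 400 s: {temps[-1]:.1f} C")
print(f"steady state from P * Rth + T_amb: {60.0 * 0.8 + 45.0:.1f} C")
```

After many time constants the simulated temperature approaches the steady-state value, which gives a quick sanity check on the update rule. A time-varying power trace can be passed in the same way to see how the thermal capacitance smooths short power bursts.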
49Combined package model
Note: Tja is meaningless!
Steady-state: Tj = junction temperature, Tc = case temperature, Ts = heatsink temperature, Ta = ambient temperature
What exactly is Ta?
[Figure label: guts of the component]
Tjc is better but still sketchy
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001
50Reliability as f(T)
- Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
- But actual behavior is often not worst case
- So aging occurs more slowly
- This means the DTM design is over-engineered!
- We can exploit this, e.g. for DTM or frequency
[Figure labels: Spend, Bank]
51EM Model
Life Consumption Rate
Apply in a lumped fashion at the granularity of microarchitecture units, just like RAMP (Srinivasan et al.)
52Reliability-Aware DTM
53Temperature limits
- Temperature limits for circuit performance can be measured
- Temperature limits for reliability are at best an estimate
- 150°C is a reasonable rule of thumb for when immediate damage might occur
- Chips are typically specified at lower temperatures, 100-125°C, for both performance and long-term reliability
- Rule of thumb that every 10° halves circuit lifetime is false
- Originates from a mil-spec that has been debunked
54Thermal issues summary
- Temperature affects performance, power, and reliability
- Architecture-level: conduction only
- Very crude approximation of convection as equivalent resistance
- Convection: too complicated
- Need CFD!
- Radiation can be ignored
- Use compact models for package
- Power density is key
- Temporal, spatial variation are key
- Hot spots drive thermal design
55Review of Thermal Issues
- From ITHERM '04 keynote by Ken Goodson, Stanford/Cooligy