Title: Energy Awareness and Uncertainty in Design at Microarchitectural Level
1Energy Awareness and Uncertainty in Design at
Microarchitectural Level
- Diana Marculescu
- Dept. of Electrical and Computer Engineering
- Carnegie Mellon University
- dianam_at_ece.cmu.edu
- http://www.ece.cmu.edu/enyac
2Why energy aware?
Courtesy of Deo Singh, Intel Cool Chips Tutorial, MICRO-32
- Moderate performance improvements come at significant power costs
- Wide variation, within and across applications, of:
- Application behavior
- User profile
- Environment characteristics
3The Road Towards Energy Awareness
- Need for synergistic approaches that tie
- Lower-level technology capabilities such as supply and threshold voltage or speed scaling
- with knobs available at higher levels of abstraction (microarchitecture, architecture, system level)
- Need for coping with design complexity issues through localized, fine-grain power management
- Possible solution: the use of voltage/frequency islands that may run at lower speed/power for a prescribed workload
4The Road Towards Energy Awareness
One possible solution: Globally Asynchronous, Locally Synchronous (GALS) processing
5Outline
- Why energy awareness via GALS?
- Impact of variability in design
- Joint energy and variability metrics
- Case study
- Globally Asynchronous, Locally Synchronous (GALS) vs. fully synchronous architectures
- Ahead
6Why energy aware via GALS?
uarch variability
- Many approaches targeting energy awareness may not be complexity-effective
- And thus may indirectly worsen overall variability
- Our claim is that GALS designs may enable lower microarchitecture-driven variability
7Energy Variability interactions
8Faster clock speeds ? Less variability
- Deeper pipelining worsens random variation impact
- Total variation impact insensitive to pipeline depth
- Variations worsen with increasing number of critical paths
- Most performance-enhancing techniques increase the number of critical paths; the simplest (superpipelining) increases overall random variations
Source: Shekhar Borkar, Intel, 2004
9Energy awareness ? Less variability
- Deeper pipelining worsens random variation impact
- Total variation impact insensitive to pipeline depth
- Variations worsen with increasing number of critical paths
- Many techniques for achieving application-driven energy awareness increase variability: selective voltage scaling, adaptive resource scaling, etc. (e.g., they exploit existing slack to hide voltage scaling latencies)
Source: Shekhar Borkar, Intel, 2004
10If Energy is the question, is GALS the answer?
Synchronous
GALS
- Potential for fine-grain adaptability
- Different speeds among synchronous blocks
- Different voltages, and hence potential for lower power consumption
- Can be used with on-the-fly application-driven adaptation
11If Variability is the question, is GALS the
answer?
Synchronous
GALS
- Potential for less process- or system-parameter-induced variability
- Local clock domains are characterized by tighter static or dynamic variations
- Enable faster local speeds AND better overall energy efficiency
12More on GALS systems
13GALS design issues
- Metastability resolution
- Always a problem when interfacing clock domains
- Synchronizers and arbiters
- Possible solution: mixed-timing FIFOs [Chelcea et al., DAC 2001]
- Local clock generation
- Ring oscillators [Muttersbach et al., ASYNC 2000]
- Failure modeling
- Synchronization failures can happen
- The goal is to maximize Mean Time Between Failures (MTBF)
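
To make the MTBF goal concrete, here is a minimal sketch based on the common textbook synchronizer model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The slides do not state which failure model was used, so the formula choice, the function name, and every parameter value below are assumptions for illustration only.

```python
import math

def synchronizer_mtbf(t_resolve_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Mean time between synchronization failures, in seconds.

    Textbook model (an assumption, not taken from the slides):
      t_resolve_s : time allowed for metastability to resolve
      tau_s       : metastability time constant of the flip-flop
      window_s    : metastability (setup/hold) window
      f_clk_hz    : receiving clock frequency
      f_data_hz   : rate of asynchronous data transitions
    """
    return math.exp(t_resolve_s / tau_s) / (window_s * f_clk_hz * f_data_hz)

# Hypothetical parameter values, chosen only to show the trend.
one_cycle = synchronizer_mtbf(1e-9, 50e-12, 20e-12, 1e9, 100e6)
two_cycles = synchronizer_mtbf(2e-9, 50e-12, 20e-12, 1e9, 100e6)

print(f"MTBF with 1 ns to resolve: {one_cycle:.3g} s")
print(f"MTBF with 2 ns to resolve: {two_cycles:.3g} s")
```

Under this model, MTBF grows exponentially with the resolution time, which is why each extra synchronizer stage buys reliability at the cost of added cross-domain latency.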
14GALS problems of interest
- Granularity of clock domain partitioning
- How many clock domains?
- Architecture definition: where to decouple?
- Fine-grain, dynamic control strategy for adjusting the voltage/clock speed
- Occupancy-based, threshold control algorithms
- Attack-decay control algorithms
- Plus using cross-domain dependency information
15How many frequency islands?
18Architecture definition: where to decouple?
- Automatic partitioning into clock domains [Hemani et al., DAC 1999]
- Applied only to random logic, not helpful for high-end processors
- A typical superscalar core exhibits natural decoupling
- Front-end (I-cache and branch prediction hardware)
- Decode, register renaming
- Integer, FP and memory domains
- Local clock grids usually follow the same type of partitioning
19Performance increase coefficient
- Significant speed-up can be achieved by increasing clock speed in the Fetch or Memory partitions, followed by the Integer and FP partitions
20How many frequency islands?
- Performance penalty increases with the number of
asynchronous interfaces
21Impact on energy reduction
- Due to the increase in execution time, total energy per task required by GALS processors can actually increase
- For more than 5 clock domains, a GALS design is no longer cost-effective
22Energy and performance trends
- Average power decreases, but overall energy may increase
- Reasons:
- Performance drops due to asynchronous communication
- Hence longer execution time and more overhead for unused modules
- Longer branch recovery pipeline
- Consequently, more speculative execution
- Higher fetch-to-commit time for each instruction
- Higher occupancies of rename tables and issue queues
23Where Does Energy Go?
- Longer execution times translate into larger
energy costs
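
The arithmetic behind this point is simply energy = average power x execution time. The sketch below uses made-up numbers (a 5% power saving and a 10% slowdown are assumptions, not measurements from the talk) to show how lower average power can still mean higher energy per task.

```python
def energy_joules(avg_power_w, exec_time_s):
    """Energy per task = average power x execution time."""
    return avg_power_w * exec_time_s

def breakeven_slowdown(power_ratio):
    """Largest execution-time ratio (new/old) that still saves energy,
    given the average-power ratio (new/old)."""
    return 1.0 / power_ratio

# Hypothetical baseline and GALS figures, for illustration only.
sync_energy = energy_joules(avg_power_w=10.0, exec_time_s=1.00)
gals_energy = energy_joules(avg_power_w=9.5, exec_time_s=1.10)

print(f"synchronous: {sync_energy:.2f} J, GALS: {gals_energy:.2f} J")
print(f"break-even slowdown at 5% lower power: {breakeven_slowdown(0.95):.3f}x")
```

In this example the slowdown exceeds the break-even ratio, so the design with lower average power consumes more energy per task.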
24Delay - voltage dependency
- Fine-grain voltage reduction can be beneficial in GALS systems
- Each clock domain can be run at a different speed
- Vdd is the supply voltage
- Vt is the threshold voltage
- α is a technology-defined constant between 1 and 2
- α is 1.2-1.6 for present generation [Chen et al., 1998]
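
The delay expression itself did not survive on this slide. The symbols listed match the widely used alpha-power law, delay proportional to Vdd / (Vdd - Vt)^alpha, so the sketch below assumes that form. The function names are hypothetical, the exponent value 1.4 is an arbitrary pick inside the quoted 1.2-1.6 range, and the quadratic energy scaling is the usual dynamic-power approximation rather than something stated in the talk. Vdd = 1.8V and Vt = 0.2V are the nominal values listed under Microarchitecture settings.

```python
def relative_delay(vdd, vt=0.2, alpha=1.4, vdd_nominal=1.8):
    """Gate delay at `vdd`, normalized to the delay at `vdd_nominal`.

    Assumes the alpha-power law: delay ~ Vdd / (Vdd - Vt) ** alpha.
    """
    if vdd <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def raw_delay(v):
        return v / (v - vt) ** alpha

    return raw_delay(vdd) / raw_delay(vdd_nominal)

def relative_dynamic_energy(vdd, vdd_nominal=1.8):
    """Dynamic switching energy scales roughly with Vdd squared (assumption)."""
    return (vdd / vdd_nominal) ** 2

for v in (1.8, 1.5, 1.2, 0.9):
    print(f"Vdd={v:.1f}V  delay x{relative_delay(v):.2f}  "
          f"energy x{relative_dynamic_energy(v):.2f}")
```

The loop shows the trade-off a per-domain voltage knob exposes: lowering Vdd in a clock domain that has slack trades a longer gate delay for a roughly quadratic drop in switching energy.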
25A possible solution: Dynamic control strategy
- Threshold-based algorithm [Iyer et al., 2002]
- Assumes two operating modes
- Selects the appropriate mode based on the Issue Queue occupancy
- Attack-decay algorithm [Semeraro et al., 2002]
- Assumes several (tens) operating modes
- Tries to preserve the same Issue Queue occupancy, while more aggressively pursuing the best power/performance trade-off
- Additional improvements [Talpes et al., 2003]
- Fetch clock speed scaled to match commit rates
- Use cross-domain dependency information to eliminate false positives in low occupancy rates
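
A two-mode, occupancy-driven controller of the kind described above can be sketched as follows. This is an illustrative reading, not the published algorithm: the class name, the hysteresis interpretation of the two thresholds, and the sampled occupancy trace are assumptions. The default numbers are the integer-domain values listed under Microarchitecture settings (thresholds 9 and 12, speeds 1GHz and 750MHz).

```python
from dataclasses import dataclass

@dataclass
class ThresholdController:
    """Two-mode speed selector driven by issue queue occupancy (sketch)."""
    low_threshold: int = 9     # below this, the domain is assumed underused
    high_threshold: int = 12   # above this, the domain is assumed a bottleneck
    high_mhz: int = 1000
    low_mhz: int = 750
    current_mhz: int = 1000

    def update(self, occupancy: int) -> int:
        """Pick the operating mode for the next interval."""
        if occupancy > self.high_threshold:
            self.current_mhz = self.high_mhz
        elif occupancy < self.low_threshold:
            self.current_mhz = self.low_mhz
        # Between the thresholds the current mode is kept (hysteresis).
        return self.current_mhz

# Hypothetical occupancy samples, one per control interval.
ctrl = ThresholdController()
trace = [14, 11, 8, 10, 13]
speeds = [ctrl.update(occ) for occ in trace]
print(list(zip(trace, speeds)))
```

Keeping a band between the two thresholds avoids toggling the voltage/frequency mode on every small occupancy change, which matters because each mode switch has its own latency and energy cost.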
26Impact on energy reduction [Talpes et al., 2003]
- By using DVS, an average energy reduction of up to 25% can be achieved at the expense of a 10% penalty in performance
- Note that these are pessimistic results: variability is not taken into account
27Joint Variability-Energy Characterization
28GALS and variability
- Our claim: GALS (or frequency island-based) design allows for less process-induced variability
- While globally enabling better performance and/or better power/performance trade-offs
- Intuitive observation on the role of microarchitectural (uarch) decisions
- The number of critical paths per clock domain is smaller
- Hence, less variability
- Beneficial impact on other parameters as well
- Will include smaller variations in temperature per clock domain
29Variability modeling [Bowman et al., 2002]
- Assume normal distributions for critical path delay (Tcp,nom: nominal critical path delay)
- Maximum critical path delay distribution (f: probability density, F: cumulative probability function, Ncp: number of critical paths)
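
The equations for this slide were lost in extraction. As a hedged illustration, the sketch below uses the standard order-statistics result for the maximum of Ncp independent, identically distributed normal delays: F_max(t) = F(t)^Ncp and f_max(t) = Ncp * f(t) * F(t)^(Ncp - 1). Independence between paths, the function names, and the numeric values are assumptions, not details taken from the talk.

```python
from statistics import NormalDist

def max_delay_cdf(t, t_nom, sigma, ncp):
    """P(slowest of ncp independent critical paths has delay <= t)."""
    return NormalDist(t_nom, sigma).cdf(t) ** ncp

def max_delay_pdf(t, t_nom, sigma, ncp):
    """Density of the maximum critical path delay."""
    path = NormalDist(t_nom, sigma)
    return ncp * path.pdf(t) * path.cdf(t) ** (ncp - 1)

def median_max_delay(t_nom, sigma, ncp):
    """Delay t at which max_delay_cdf equals 0.5."""
    return NormalDist(t_nom, sigma).inv_cdf(0.5 ** (1.0 / ncp))

# Hypothetical numbers: 1.0 ns nominal path delay, 5% sigma.
for ncp in (1, 100, 1000, 10000):
    print(f"Ncp={ncp:>5}: median max delay = "
          f"{median_max_delay(1.0, 0.05, ncp):.3f} ns")
```

Under these assumptions the median of the slowest path moves further from the nominal delay as Ncp grows, which is the intuition behind expecting a clock domain with fewer critical paths to see a smaller worst-case delay.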
30Putting energy, performance and variability
together
- A possible probabilistic design metric that needs to be maximized (FMAX: clock speed distribution) [Borkar, 2004]
- However, in the case of high-end processors
- Clock speed does not necessarily translate into performance
- Moreover, IPC-increasing artifacts affect variability
- Proposed joint metric that must be minimized
- The goal is to include variability in the maximum critical path or minimum clock speed, with and without temperature modeling (q: temperature, ncp: logic depth) [Basu et al., 2004]
31Modeling details
- Assume only within-die (WID) effects
- WID variations mostly affect the mean of the FMAX distribution
- Die-to-die (D2D) variations affect the spread
- Concerned only with
- Static gate length variations
- Dynamic temperature variations
- Here we do not look at the impact on leakage variability
- Use device counts per module/clock domain to estimate
- Total die area or clock domain area
- Number of critical paths Ncp per die or clock domain
32Microarchitecture settings
- Pipeline: 16 stages, 4-way out-of-order
- Instruction Window: 64 entries (32 Int, 16 FP, 16 Mem)
- Load/Store Queue: 32 entries
- I-Cache: 32K, 2-way set-associative, 1-cycle hit time, LRU replacement
- D-Cache: 32K, 4-way set-associative, 2-cycle hit time, LRU replacement
- L2 Cache: unified, 256K, 4-way set-associative, LRU replacement, access time 10 cycles
- Memory access time: 100 cycles
- Functional Units: 4 Integer ALUs, 2 Integer MUL/DIV, 2 Memory ports, 2 FP Adders, 1 FP MUL/DIV
- Branch Prediction: G-share, 11 bits history, 2048 entries
- Technology: 0.13 um STMicro technology (high speed), Vdd 1.8V, Vt 0.2V
- Normalized leakage 80 nA
- current per device 1
- Clock Speed / Vdd: 250MHz - 1000MHz, 0.7V - 1.8V
- DVS Thresholds: Integer - 9, 12; Memory - 9, 12; FP - 6, 9
- DVS Speed levels: Integer - High 1GHz, Low 750MHz
33Case study: GALS vs. fully synchronous processors
34Critical path delay distribution without Temp
- Locally clocked domains have a mean value for the maximum critical path delay that is 2-12% smaller than for the fully synchronous baseline
35Critical path delay distribution with Temp
- Locally clocked domains have a mean value for the maximum critical path delay that is 8-18% smaller than for the fully synchronous baseline
36Q metric distribution with and without Temp
- Using local speed/voltage scaling per clock domain decreases Q by 26% when compared to the synchronous baseline
37Q metric probability with and without Temp
- GALS-T-DVS eliminates most of the high-Q bin when
compared to the synchronous baseline
38Instead of summary
- Microarchitectural modeling of process variability effects is possible
- In conjunction with fine-grain DVS, minimally-clocked machines provide a better joint energy/performance/variability metric than their fully synchronous counterparts
- Considered only WID-induced gate length effects and temperature-induced effects
- Ahead
- both WID and D2D variability
- leakage variations
- true microarchitecture design exploration with
variability in mind
39Thank you!
- More information
- CMU's Energy Aware Computing Group
- http://www.ece.cmu.edu/enyac