Title: Power Aware Architecture Design
1. Power Aware Architecture Design
- Meeta S. Gupta
- Snehal Sanghavi
- Sama Usmani
2. The Power Problem
- Power consumption has been increasing with each new CPU generation
3. Power Reduction Techniques
- Where does all the power go?
[Figure: Alpha 21264 power breakdown. Source: Wilcox, Micro-99]
- Clock gating
- Voltage scaling
- Architectural techniques
4. Voltage Scaling
5. Motivation
- P = C·V²·f
- Reducing voltage gives significant power gains (see the sketch below)
- CPU usage is bursty
- What is voltage scaling?
- Provide multiple voltage levels to the voltage regulator
- Use minimum voltage where possible to save power
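A quick back-of-the-envelope on why this pays off. Dynamic power follows P = C·V²·f from the slide, and the sustainable clock frequency scales roughly with supply voltage, so scaling V and f together gives close to cubic power savings. This is a minimal sketch; the 0.7x voltage factor and unit capacitance are illustrative, not values from the slides.

def dynamic_power(c_eff, vdd, freq):
    # Dynamic power P = C * V^2 * f
    return c_eff * vdd ** 2 * freq

nominal = dynamic_power(c_eff=1.0, vdd=1.0, freq=1.0)
scaled = dynamic_power(c_eff=1.0, vdd=0.7, freq=0.7)  # frequency scaled with voltage
print(scaled / nominal)  # ~0.34: roughly a 3x dynamic-power reduction at 70% of nominal voltage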
6. Dual-Speed Pipelines
- Pipelines running at different voltages and clock frequencies
- Non-critical instructions can tolerate higher latencies
- Criticality can be determined by (sketched below):
- Oldest Instruction critical (OI) scheme
- More than one Consumer critical (MC) scheme
Pyreddy et al, WCED-01
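A minimal sketch of one simple reading of the two criticality heuristics named above, assuming a toy instruction window where each instruction carries its program-order sequence number and its in-flight consumer count. The Instr record and the window structure are hypothetical, not Pyreddy et al's implementation.

from dataclasses import dataclass

@dataclass
class Instr:
    seq: int        # program order (smaller = older)
    consumers: int  # in-flight instructions that read this result

def critical_oi(instr, window):
    # Oldest Instruction (OI) scheme: the oldest unissued instruction in the
    # window is treated as critical and steered to the fast pipeline.
    return instr.seq == min(i.seq for i in window)

def critical_mc(instr):
    # More than one Consumer (MC) scheme: an instruction feeding multiple
    # consumers is critical; others can tolerate the slow pipeline's latency.
    return instr.consumers > 1

window = [Instr(seq=10, consumers=0), Instr(seq=11, consumers=2), Instr(seq=12, consumers=1)]
fast = [i for i in window if critical_oi(i, window) or critical_mc(i)]
print(len(fast))  # 2 of the 3 instructions are steered to the fast pipeline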
7. Processor Model
Pyreddy et al, WCED-01
8. Results
- Claim
- Peak performance is achieved when the number of fast instructions is between 40-50%
Pyreddy et al, WCED-01
9. Determining the voltage supply
- Adjust the supply voltage and frequency in response to the ILP of an application
- Different modes of voltage operation
- High performance: processor runs at maximum voltage
- Power-saver: processor runs at minimum voltage
- Automatic: supply voltage chosen on the basis of application needs
- f_new = f_old × (MIPS_goal / MIPS_observed)
- Voltage adjustment latency
- Multiple voltages on a single chip ease switching
Childers et al, Micro-00
10. How the voltage adjustment works
When the timer expires (see the sketch below):
  get observed MIPS
  choose new MIPS
  compute new frequency
  if the new frequency differs from the old frequency (by more than 33 MHz):
    stop fetch
    drain pipeline
    get level value
    get discrete voltage, frequency value
    resume fetch
Childers et al, Micro-00
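A runnable sketch of the interval-based adjustment above, assuming a fixed menu of discrete voltage/frequency operating points and a 33 MHz change threshold. The LEVELS table and the MIPS numbers in the example are illustrative, not Childers et al's values.

# Discrete (voltage V, frequency MHz) operating points, lowest first (illustrative).
LEVELS = [(0.9, 300), (1.0, 400), (1.1, 500), (1.2, 600), (1.3, 700)]

def pick_level(target_mhz):
    # Choose the lowest operating point whose frequency meets the target.
    for v, f in LEVELS:
        if f >= target_mhz:
            return v, f
    return LEVELS[-1]

def on_timer_expiry(old_mhz, observed_mips, goal_mips):
    # f_new = f_old * (MIPS_goal / MIPS_observed), as on the previous slide.
    new_mhz = old_mhz * goal_mips / observed_mips
    if abs(new_mhz - old_mhz) > 33:      # only switch on a significant change
        # stop fetch, drain pipeline (not modeled here), then switch levels
        return pick_level(new_mhz)       # resume fetch at the new operating point
    return None                          # keep the current operating point

print(on_timer_expiry(old_mhz=500, observed_mips=800, goal_mips=600))  # -> (1.0, 400)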
11. Energy Improvement
- Claim
- 47% improvement in energy consumption
Childers et al, Micro-00
12. Enhanced SpeedStep Technology
- Voltage-frequency switching operation
- Voltage stepped in short increments (see the sketch below)
- Shorter jumps between operating points
- Lower latency
- CPU unavailability time reduced
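A rough sketch of the stepped-transition idea: instead of one large voltage jump, the regulator walks through intermediate points, so each step settles quickly and the core pauses only briefly per step. The step size and voltages are illustrative, not Intel's actual operating-point tables.

def stepped_transition(v_from, v_to, step=0.05):
    # Walk the supply in short increments rather than one large jump; each
    # small step settles fast, shortening the time the CPU is unavailable.
    direction = 1 if v_to > v_from else -1
    points, v = [], v_from
    while abs(v_to - v) > 1e-9:
        v = round(min(v + step, v_to) if direction > 0 else max(v - step, v_to), 3)
        points.append(v)
    return points

print(stepped_transition(1.2, 0.95))  # -> [1.15, 1.1, 1.05, 1.0, 0.95]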
13. Power Aware Issue Queue Design
14. Motivation
- Cannot use voltage scaling beyond a certain limit
- The one-size-fits-all philosophy does not hold
- Resource usage varies across applications, and also within an application
- Oversized resources committed for performance
- Wastage of power
- Need to dynamically adapt resources with minimal performance loss
Source: V. Tiwari, Micro-99
15. Issue Queue Design
- One of the most complex logic blocks of a superscalar processor
- Wakeup logic, select logic
- Performance-centric design
- Latch-based
- Compacting vs. non-compacting
- CAM/RAM based
16. Co-Adaptive Instruction Fetch and Issue
- Issue-centric fetch gating
- Instructions fetched earlier than necessary spend idle energy
- Detect mismatch between the size of the instruction window and the size required for the application's parallelism
[Figure: ROB and issue queue occupancy — distant parallelism (toward the ROB tail) goes with higher issue-queue utilization; close parallelism (near the head) with lower utilization.]
- Fetch gate when close parallelism and high utilization (see the sketch below)
Buyuktosunoglu et al, ISCA03
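A minimal sketch of the issue-centric gating decision above, assuming per-interval measurements of issue-queue utilization and of how far from the ROB head the issuing instructions sit. The distance metric and the thresholds are illustrative assumptions, not Buyuktosunoglu et al's hardware.

def should_gate_fetch(iq_occupancy, iq_size, mean_issue_distance_from_head, rob_size,
                      util_threshold=0.75, close_fraction=0.25):
    # Gate fetch when parallelism is "close" (issuing instructions sit near the
    # ROB head) yet the issue queue is highly utilized: the window already holds
    # more instructions than the application's parallelism needs.
    high_utilization = iq_occupancy / iq_size >= util_threshold
    close_parallelism = mean_issue_distance_from_head <= close_fraction * rob_size
    return high_utilization and close_parallelism

# Example: 28 of 32 IQ entries full, issuing within the first 16 of 128 ROB slots.
print(should_gate_fetch(28, 32, 16, 128))  # -> True: gate fetch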
17. Co-Adaptive Instruction Fetch and Issue
- Dynamic adaptation of issue queue size
- Based on activity of instructions
[Figure: issue queue entries tagged as active (A) or not active (NA)]
- Dynamic adaptation of the issue queue (sketched below) gives a 31% power reduction
- 20% additional savings when coupled with fetch gating
Buyuktosunoglu et al, ISCA03
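A minimal sketch of activity-driven issue-queue resizing, assuming the queue is split into fixed chunks that can be disabled independently and that a count of active (A) entries is sampled each interval. The chunk size, thresholds, and sampling scheme are illustrative, not the ISCA-03 circuit.

CHUNK = 8  # entries per power-gateable chunk (illustrative)

def resize_issue_queue(active_counts, current_chunks, max_chunks, low=0.3, high=0.9):
    # active_counts: per-cycle counts of active (A) entries over the last interval;
    # not-active (NA) entries contribute no useful work and waste power.
    mean_active = sum(active_counts) / len(active_counts)
    capacity = current_chunks * CHUNK
    if mean_active < low * capacity and current_chunks > 1:
        return current_chunks - 1   # turn one chunk off
    if mean_active > high * capacity and current_chunks < max_chunks:
        return current_chunks + 1   # turn one chunk back on
    return current_chunks

print(resize_issue_queue([5, 6, 4, 7], current_chunks=4, max_chunks=4))  # -> 3 chunks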
18. Dynamic Allocation of Multiple Datapath Resources
- Use of the multiple resources in the instruction path is highly correlated
- Independently resize the ROB, IQ, and LSQ based on their occupancy
Ponomarev et al, Micro01
19. Dynamic Allocation of Multiple Datapath Resources
- Downsize a resource based on its occupancy (sketched below)
- Periodically sample the resource, and average over a few sample periods
- If the difference between the current size and the active size is greater than a partition
- Reduce by a partition
- Aggressively reduce by (difference / partition size) partitions
- Upsizing more aggressive in nature
- Not based on a decrease in IPC
- If the number of instructions blocked at dispatch increases beyond a threshold, increase the resource size
- Power savings of 50% in the IQ, 70.5% in the ROB and 55.7% in the LSQ with a performance degradation of 7.3%
Ponomarev et al, Micro01
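A minimal sketch of the occupancy-driven resizing policy above, applied to one resource. The partition size, thresholds, and the single-partition upsize step are illustrative parameters, not Ponomarev et al's exact policy.

def resize_resource(samples, current_size, partition, max_size,
                    blocked_dispatches, block_threshold, aggressive=True):
    # Downsizing: average occupancy over the sample period; if the resource is
    # larger than its active portion by more than one partition, shrink it.
    active_size = sum(samples) / len(samples)
    diff = current_size - active_size
    if diff > partition:
        shrink = (int(diff // partition) if aggressive else 1) * partition
        return max(current_size - shrink, partition)
    # Upsizing is triggered by dispatch pressure, not by an IPC drop: if too many
    # instructions block at dispatch, grow by one partition (the slides note the
    # real policy upsizes more aggressively).
    if blocked_dispatches > block_threshold:
        return min(current_size + partition, max_size)
    return current_size

# 96-entry ROB, 16-entry partitions, average occupancy ~40 -> shrink aggressively to 48.
print(resize_resource([38, 42, 41, 39], 96, 16, 96, blocked_dispatches=0, block_threshold=64))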
20. Discussion
- Provides a coarse-grained gating mechanism
- Can help reduce the leakage current
- By shutting down unused resources
- Tradeoffs
- Impact on IPC
- Less than 8% decrease in performance
- Circuit complexity increases with the additional logic added
- Simpler designs like non-compacting queues with clock gating might give similar benefits
21. Profile-based Hardware Configuration
22. Motivation
- Variation in program behavior
- Across various applications
- From one section to another
- Variable resource usage
- Unused resources consume energy
- Identify the optimal configuration for each code region
- Requires code profiling
- Dynamic configuration of processor
23. Profiling Basics
- What is profiling for power?
- Collecting performance and power statistics
- Static and Dynamic profiling
- Static: off-line, profiling done before the final execution
- Dynamic: profiling done during execution
- Profiling granularity
- Fine-grained: basic blocks, functions
- Coarse-grained: meta-blocks
24. Compiler Assisted Code Annotation
- Profiling for a basic block
- Find the optimal number of instructions to be executed in parallel
- Consider variable fetch width and execution width
- Fetch width: control-dominated architectures (superscalar, O-o-O)
- Execution width: datapath-dominated architectures (VLIW)
- Annotation of all instructions of a basic block with profiling data
- Annotation value signifies the number of functional units
- If the annotated value is > 1, instructions are executed concurrently
- Assumptions
- Instruction format allows for code annotation
- Microprocessor supports variable fetch and execution rates
- Technique holds for a single type of functional unit
- Advantage for unrolled loops
Marculescu et al
25. Code Annotation - Algorithm
Low_Energy_Code_Annotate(Program):
  Extract all basic blocks (b_1, ..., b_n)
  For x = 1, ..., K:
    For i = 1, ..., n:
      P[i][x] = power of basic block b_i when x instructions are executed in parallel
  For i = 1, ..., n:
    Find the value of x = 1, ..., K such that P[i][x] is minimized
    Annotate all instructions in b_i with x
Marculescu et al
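A runnable sketch of the annotation pass above. The toy power model (blocks_info, toy_power) is a stand-in for the profiled per-block power numbers; only the loop structure mirrors the slide.

import math

def low_energy_code_annotate(blocks, K, power):
    # power(b, x): profiled power of basic block b when x instructions are
    # executed in parallel (supplied by the profiling step on the slide).
    annotation = {}
    for b in blocks:
        # Pick the parallelism x in 1..K that minimizes the block's power.
        annotation[b] = min(range(1, K + 1), key=lambda x: power(b, x))
    return annotation  # every instruction in block b would be tagged with annotation[b]

# Toy stand-in for profiled data: (instruction count, available ILP) per block;
# wider execution finishes a block faster but burns more power per cycle.
blocks_info = {"b1": (16, 4), "b2": (16, 1)}
def toy_power(b, x):
    n, ilp = blocks_info[b]
    cycles = math.ceil(n / min(x, ilp))
    return cycles * (1 + 0.4 * x)

print(low_energy_code_annotate(["b1", "b2"], K=4, power=toy_power))  # {'b1': 4, 'b2': 1}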
26. Results
- Variable Execution Width
- 44% power savings at 35% performance decrease with no constraints
- 33% power savings at 25% performance decrease with a constraint
- Variable Fetch Width
- 15% power reduction for 6% performance loss
Marculescu et al
27. Dynamic Profiling Scheme
- Profiling based on hotspots
- 92% of execution time is spent in hotspots!
- Find the best configuration for a hotspot
- Consider the size of the RUU and the pipeline width
- RUU: four different sizes (16, 32, 48, 64)
- Pipeline width: three widths (4, 6, 8)
A. Iyer et al
28. Example of FSM
d = 1 after every 1024 instructions
- On detecting a hotspot, traverse the 12-state FSM (sketched below)
- Track power for each configuration
- Switch the processor to the optimal configuration
[Figure: 12-state FSM. States are RUU-size/width configurations — 64/8, 48/8, 32/8, 16/8; 64/4, 48/4, 32/4, 16/4; 64/2, 48/2, 32/2, 16/2 — linked by d transitions. The program starts in state S; "Hotspot found" enters the chain, traversal ends in OPT, and "Hotspot lost" returns to S.]
A. Iyer et al
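A minimal sketch of the exploration the FSM performs: on hotspot detection it dwells in each configuration for one sampling window (d asserted every 1024 instructions), records the power observed there, then settles on the best one. The configuration labels follow the FSM figure above; the sample_power hook and the toy model are placeholders, not Iyer et al's hardware.

RUU_SIZES = (64, 48, 32, 16)
WIDTHS = (8, 4, 2)                                       # as labeled in the FSM figure
CONFIGS = [(r, w) for w in WIDTHS for r in RUU_SIZES]    # 12 states

def explore_hotspot(sample_power):
    # sample_power(ruu, width) -> power measured over one 1024-instruction
    # window while running in that configuration (placeholder for the
    # hardware's per-window power tracking).
    measured = {cfg: sample_power(*cfg) for cfg in CONFIGS}  # traverse all 12 states
    return min(measured, key=measured.get)                    # switch to OPT

# Toy measurement: bigger structures and wider pipelines cost more power here.
toy = lambda ruu, width: 0.05 * ruu + 1.0 * width
print(explore_hotspot(toy))  # -> (16, 2) under this toy model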
29. Dynamic Profiling - Results
18% power savings over the baseline processor
30. Discussion
- Static vs. Dynamic
- Static methods have pre-processing overheads
- Dynamic may get out of the hotspot before detecting the optimal configuration
- Fine-grained vs. Coarse-grained
- Fine-grained increases switching and its overhead
- Coarse-grained might not recognize changes in program behavior
31. Thank you. Questions?
32. Appendices
33. Appendix A1. Comparison of the two heuristics with the base configuration (issue restricted and issue unrestricted) for CRAFTY.
34. Appendix A2. Enhanced Intel SpeedStep technology
35. Appendix B1. Instruction format in the SimpleScalar architecture
36. Appendix B2. Meta-blocks obtained from successive basic blocks
[Figure: meta-block formation, with edges labeled by branch probability]
37. Appendix B3. Hotspot detection hardware (behavioral sketch below)
- BBB (Branch Behavior Buffer)
- Execution counter incremented when the branch is taken
- Once it reaches a threshold, the branch is marked as a candidate branch
- (2K entries, 9-bit counters)
- HDC (Hotspot Detection Counter)
- Saturating counter, initialized to all 1s
- Decrement by D when a candidate branch is taken, increment by I when a non-candidate branch is taken
- When HDC = 0, a hotspot is detected
- (13-bit HDC, D = 2, I = 1)
- After Nf cycles, BBB non-candidate branches are flushed
- After Nr cycles, the entire BBB is reset
- This prevents stagnation
- (Nf = 4K, Nr = 64K)
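A behavioral sketch of the detector described above, with the BBB modeled as a dictionary instead of a 2K-entry table and the periodic Nf/Nr flush and reset omitted for brevity. The counter widths, D, and I follow the slide; the candidate threshold is an illustrative assumption.

CAND_THRESHOLD = 16        # executions before a branch becomes a candidate (illustrative)
HDC_MAX = (1 << 13) - 1    # 13-bit saturating counter, initialized to all 1s
D, I = 2, 1

class HotspotDetector:
    def __init__(self):
        self.bbb = {}       # branch PC -> (execution counter, candidate flag)
        self.hdc = HDC_MAX

    def on_taken_branch(self, pc):
        count, candidate = self.bbb.get(pc, (0, False))
        count = min(count + 1, (1 << 9) - 1)         # 9-bit execution counter
        candidate = candidate or count >= CAND_THRESHOLD
        self.bbb[pc] = (count, candidate)
        if candidate:
            self.hdc = max(self.hdc - D, 0)          # candidate branch taken: decrement by D
        else:
            self.hdc = min(self.hdc + I, HDC_MAX)    # non-candidate branch: increment by I
        return self.hdc == 0                         # HDC reaching 0 signals a hotspot

det = HotspotDetector()
hot = False
for _ in range(10000):
    hot = det.on_taken_branch(0x400)   # a tight loop hammering one branch
print(hot)  # -> True: the single candidate branch drags the HDC down to zero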