Title: Power Aware Architecture Design
1. Power Aware Architecture Design
- Meeta S. Gupta
- Snehal Sanghavi
- Sama Usmani
2. The Power Problem
- Power consumption has been increasing with each new CPU generation
3. Power Reduction Techniques
- Where does all the power go?
[Figure: Alpha 21264 power breakdown. Source: Wilcox, Micro-99]
- Clock gating
- Voltage scaling
- Architectural techniques
4. Voltage Scaling
5. Motivation
- P = C·V²·f
- Reducing voltage gives significant power gains (see the sketch below)
- CPU usage is bursty
- What is voltage scaling?
- Provide multiple voltage levels to the voltage regulator
- Use minimum voltage where possible to save power
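A quick back-of-the-envelope on why this pays off. Dynamic power follows P = C·V²·f from the slide, and the sustainable clock frequency scales roughly with supply voltage, so scaling V and f together gives close to cubic power savings. This is a minimal sketch; the 0.7x voltage factor and unit capacitance are illustrative, not values from the slides.

def dynamic_power(c_eff, vdd, freq):
    # Dynamic power P = C * V^2 * f
    return c_eff * vdd ** 2 * freq

nominal = dynamic_power(c_eff=1.0, vdd=1.0, freq=1.0)
scaled = dynamic_power(c_eff=1.0, vdd=0.7, freq=0.7)  # frequency scaled with voltage
print(scaled / nominal)  # ~0.34: roughly a 3x dynamic-power reduction at 70% of nominal voltage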
6. Dual-Speed Pipelines
- Pipelines running at different voltages and clock frequencies
- Non-critical instructions can tolerate higher latencies
- Criticality can be determined by (sketched below):
- Oldest Instruction critical (OI) scheme
- More than one Consumer critical (MC) scheme
Pyreddy et al, WCED-01
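A minimal sketch of one simple reading of the two criticality heuristics named above, assuming a toy instruction window where each instruction carries its program-order sequence number and its in-flight consumer count. The Instr record and the window structure are hypothetical, not Pyreddy et al's implementation.

from dataclasses import dataclass

@dataclass
class Instr:
    seq: int        # program order (smaller = older)
    consumers: int  # in-flight instructions that read this result

def critical_oi(instr, window):
    # Oldest Instruction (OI) scheme: the oldest unissued instruction in the
    # window is treated as critical and steered to the fast pipeline.
    return instr.seq == min(i.seq for i in window)

def critical_mc(instr):
    # More than one Consumer (MC) scheme: an instruction feeding multiple
    # consumers is critical; others can tolerate the slow pipeline's latency.
    return instr.consumers > 1

window = [Instr(seq=10, consumers=0), Instr(seq=11, consumers=2), Instr(seq=12, consumers=1)]
fast = [i for i in window if critical_oi(i, window) or critical_mc(i)]
print(len(fast))  # 2 of the 3 instructions are steered to the fast pipeline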
7. Processor Model
Pyreddy et al, WCED-01
8. Results
- Claim
- Peak performance is achieved when the number of fast instructions is between 40-50%
Pyreddy et al, WCED-01
9. Determining the voltage supply
- Adjust the supply voltage and frequency in response to the ILP of an application
- Different modes of voltage operation
- High performance: processor runs at maximum voltage
- Power-saver: processor runs at minimum voltage
- Automatic: supply voltage chosen on the basis of application needs
- f_new = f_old × (MIPS_goal / MIPS_observed)
- Voltage adjustment latency
- Multiple voltages on a single chip ease switching
Childers et al, Micro-00
10. How the voltage adjustment works
When the timer expires (see the sketch below):
  get observed MIPS
  choose new MIPS
  compute new frequency
  if the new frequency differs from the old frequency (by more than 33 MHz):
    stop fetch
    drain pipeline
    get level value
    get discrete voltage, frequency value
    resume fetch
Childers et al, Micro-00
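A runnable sketch of the interval-based adjustment above, assuming a fixed menu of discrete voltage/frequency operating points and a 33 MHz change threshold. The LEVELS table and the MIPS numbers in the example are illustrative, not Childers et al's values.

# Discrete (voltage V, frequency MHz) operating points, lowest first (illustrative).
LEVELS = [(0.9, 300), (1.0, 400), (1.1, 500), (1.2, 600), (1.3, 700)]

def pick_level(target_mhz):
    # Choose the lowest operating point whose frequency meets the target.
    for v, f in LEVELS:
        if f >= target_mhz:
            return v, f
    return LEVELS[-1]

def on_timer_expiry(old_mhz, observed_mips, goal_mips):
    # f_new = f_old * (MIPS_goal / MIPS_observed), as on the previous slide.
    new_mhz = old_mhz * goal_mips / observed_mips
    if abs(new_mhz - old_mhz) > 33:      # only switch on a significant change
        # stop fetch, drain pipeline (not modeled here), then switch levels
        return pick_level(new_mhz)       # resume fetch at the new operating point
    return None                          # keep the current operating point

print(on_timer_expiry(old_mhz=500, observed_mips=800, goal_mips=600))  # -> (1.0, 400)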
11. Energy Improvement
- Claim
- 47% improvement in energy consumption
Childers et al, Micro-00
12. Enhanced SpeedStep Technology
- Voltage-frequency switching operation
- Voltage stepped in short increments (see the sketch below)
- Shorter jumps between operating points
- Lower latency
- CPU unavailability time reduced
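A rough sketch of the stepped-transition idea: instead of one large voltage jump, the regulator walks through intermediate points, so each step settles quickly and the core pauses only briefly per step. The step size and voltages are illustrative, not Intel's actual operating-point tables.

def stepped_transition(v_from, v_to, step=0.05):
    # Walk the supply in short increments rather than one large jump; each
    # small step settles fast, shortening the time the CPU is unavailable.
    direction = 1 if v_to > v_from else -1
    points, v = [], v_from
    while abs(v_to - v) > 1e-9:
        v = round(min(v + step, v_to) if direction > 0 else max(v - step, v_to), 3)
        points.append(v)
    return points

print(stepped_transition(1.2, 0.95))  # -> [1.15, 1.1, 1.05, 1.0, 0.95]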
13. Power Aware Issue Queue Design
14. Motivation
- Cannot use voltage scaling beyond a certain limit
- The one-size-fits-all philosophy does not hold
- Resource usage varies across applications, and also within an application
- Oversized resources committed for performance
- Wastage of power
- Need to dynamically adapt resources with minimal performance loss
Source: V. Tiwari, Micro-99
15. Issue Queue Design
- One of the most complex logic blocks of a superscalar processor
- Wakeup logic, select logic
- Performance-centric design
- Latch-based
- Compacting vs. non-compacting
- CAM/RAM based
16. Co-Adaptive Instruction Fetch and Issue
- Issue-centric fetch gating
- Instructions fetched earlier than necessary spend idle energy
- Detect mismatch between the size of the instruction window and the size required for the application's parallelism
[Figure: ROB and issue queue occupancy — distant parallelism (toward the ROB tail) goes with higher issue-queue utilization; close parallelism (near the head) with lower utilization.]
- Fetch gate when close parallelism and high utilization (see the sketch below)
Buyuktosunoglu et al, ISCA03
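A minimal sketch of the issue-centric gating decision above, assuming per-interval measurements of issue-queue utilization and of how far from the ROB head the issuing instructions sit. The distance metric and the thresholds are illustrative assumptions, not Buyuktosunoglu et al's hardware.

def should_gate_fetch(iq_occupancy, iq_size, mean_issue_distance_from_head, rob_size,
                      util_threshold=0.75, close_fraction=0.25):
    # Gate fetch when parallelism is "close" (issuing instructions sit near the
    # ROB head) yet the issue queue is highly utilized: the window already holds
    # more instructions than the application's parallelism needs.
    high_utilization = iq_occupancy / iq_size >= util_threshold
    close_parallelism = mean_issue_distance_from_head <= close_fraction * rob_size
    return high_utilization and close_parallelism

# Example: 28 of 32 IQ entries full, issuing within the first 16 of 128 ROB slots.
print(should_gate_fetch(28, 32, 16, 128))  # -> True: gate fetch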
17. Co-Adaptive Instruction Fetch and Issue
- Dynamic adaptation of issue queue size
- Based on activity of instructions
[Figure: issue queue entries tagged as active (A) or not active (NA)]
- Dynamic adaptation of the issue queue (sketched below) gives a 31% power reduction
- 20% additional savings when coupled with fetch gating
Buyuktosunoglu et al, ISCA03
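A minimal sketch of activity-driven issue-queue resizing, assuming the queue is split into fixed chunks that can be disabled independently and that a count of active (A) entries is sampled each interval. The chunk size, thresholds, and sampling scheme are illustrative, not the ISCA-03 circuit.

CHUNK = 8  # entries per power-gateable chunk (illustrative)

def resize_issue_queue(active_counts, current_chunks, max_chunks, low=0.3, high=0.9):
    # active_counts: per-cycle counts of active (A) entries over the last interval;
    # not-active (NA) entries contribute no useful work and waste power.
    mean_active = sum(active_counts) / len(active_counts)
    capacity = current_chunks * CHUNK
    if mean_active < low * capacity and current_chunks > 1:
        return current_chunks - 1   # turn one chunk off
    if mean_active > high * capacity and current_chunks < max_chunks:
        return current_chunks + 1   # turn one chunk back on
    return current_chunks

print(resize_issue_queue([5, 6, 4, 7], current_chunks=4, max_chunks=4))  # -> 3 chunks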
18. Dynamic Allocation of Multiple Datapath Resources
- Use of the multiple resources in the instruction path is highly correlated
- Independently resize the ROB, IQ, and LSQ based on their occupancy
Ponomarev et al, Micro01
19. Dynamic Allocation of Multiple Datapath Resources
- Downsize a resource based on its occupancy (sketched below)
- Periodically sample the resource, and average over a few sample periods
- If the difference between the current size and the active size is greater than a partition
- Reduce by a partition
- Aggressively reduce by (difference / partition size) partitions
- Upsizing more aggressive in nature
- Not based on a decrease in IPC
- If the number of instructions blocked at dispatch increases beyond a threshold, increase the resource size
- Power savings of 50% in the IQ, 70.5% in the ROB and 55.7% in the LSQ with a performance degradation of 7.3%
Ponomarev et al, Micro01
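A minimal sketch of the occupancy-driven resizing policy above, applied to one resource. The partition size, thresholds, and the single-partition upsize step are illustrative parameters, not Ponomarev et al's exact policy.

def resize_resource(samples, current_size, partition, max_size,
                    blocked_dispatches, block_threshold, aggressive=True):
    # Downsizing: average occupancy over the sample period; if the resource is
    # larger than its active portion by more than one partition, shrink it.
    active_size = sum(samples) / len(samples)
    diff = current_size - active_size
    if diff > partition:
        shrink = (int(diff // partition) if aggressive else 1) * partition
        return max(current_size - shrink, partition)
    # Upsizing is triggered by dispatch pressure, not by an IPC drop: if too many
    # instructions block at dispatch, grow by one partition (the slides note the
    # real policy upsizes more aggressively).
    if blocked_dispatches > block_threshold:
        return min(current_size + partition, max_size)
    return current_size

# 96-entry ROB, 16-entry partitions, average occupancy ~40 -> shrink aggressively to 48.
print(resize_resource([38, 42, 41, 39], 96, 16, 96, blocked_dispatches=0, block_threshold=64))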
20. Discussion
- Provides a coarse-grained gating mechanism
- Can help reduce the leakage current
- By shutting down unused resources
- Tradeoffs
- Impact on IPC
- Less than 8% decrease in performance
- Circuit complexity increases with the additional logic added
- Simpler designs like non-compacting queues with clock gating might give similar benefits
21. Profile-based Hardware Configuration
22. Motivation
- Variation in program behavior
- Across various applications
- From one section to another
- Variable resource usage
- Unused resources consume energy
- Identify the optimal configuration for each code region
- Requires code profiling
- Dynamic configuration of processor
23. Profiling Basics
- What is profiling for power?
- Collecting performance and power statistics
- Static and Dynamic profiling
- Static: off-line, profiling done before the final execution
- Dynamic: profiling done during execution
- Profiling granularity
- Fine-grained: basic blocks, functions
- Coarse-grained: meta-blocks
24. Compiler Assisted Code Annotation
- Profiling for a basic block
- Find the optimal number of instructions to be executed in parallel
- Consider variable fetch width and execution width
- Fetch width: control-dominated architectures (superscalar, O-o-O)
- Execution width: datapath-dominated architectures (VLIW)
- Annotation of all instructions of a basic block with profiling data
- Annotation value signifies the number of functional units
- If the annotated value is > 1, instructions are executed concurrently
- Assumptions
- Instruction format allows for code annotation
- Microprocessor supports variable fetch and execution rates
- Technique holds for a single type of functional unit
- Advantage for unrolled loops
Marculescu et al
25. Code Annotation - Algorithm
Low_Energy_Code_Annotate(Program):
  Extract all basic blocks (b_1, ..., b_n)
  For x = 1, ..., K:
    For i = 1, ..., n:
      P[i][x] = power of basic block b_i when x instructions are executed in parallel
  For i = 1, ..., n:
    Find the value of x = 1, ..., K such that P[i][x] is minimized
    Annotate all instructions in b_i with x
Marculescu et al
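A runnable sketch of the annotation pass above. The toy power model (blocks_info, toy_power) is a stand-in for the profiled per-block power numbers; only the loop structure mirrors the slide.

import math

def low_energy_code_annotate(blocks, K, power):
    # power(b, x): profiled power of basic block b when x instructions are
    # executed in parallel (supplied by the profiling step on the slide).
    annotation = {}
    for b in blocks:
        # Pick the parallelism x in 1..K that minimizes the block's power.
        annotation[b] = min(range(1, K + 1), key=lambda x: power(b, x))
    return annotation  # every instruction in block b would be tagged with annotation[b]

# Toy stand-in for profiled data: (instruction count, available ILP) per block;
# wider execution finishes a block faster but burns more power per cycle.
blocks_info = {"b1": (16, 4), "b2": (16, 1)}
def toy_power(b, x):
    n, ilp = blocks_info[b]
    cycles = math.ceil(n / min(x, ilp))
    return cycles * (1 + 0.4 * x)

print(low_energy_code_annotate(["b1", "b2"], K=4, power=toy_power))  # {'b1': 4, 'b2': 1}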
26. Results
- Variable Execution Width
- 44% power savings at 35% performance decrease with no constraints
- 33% power savings at 25% performance decrease with a constraint
- Variable Fetch Width
- 15% power reduction for 6% performance loss
Marculescu et al
27. Dynamic Profiling Scheme
- Profiling based on hotspots
- 92% of execution time is spent in hotspots!
- Find the best configuration for a hotspot
- Consider the size of the RUU and the pipeline width
- RUU: four different sizes (16, 32, 48, 64)
- Pipeline width: three widths (4, 6, 8)
A. Iyer et al
28. Example of FSM
d = 1 after every 1024 instructions
- On detecting a hotspot, traverse the 12-state FSM (sketched below)
- Track power for each configuration
- Switch the processor to the optimal configuration
[Figure: 12-state FSM. States are RUU-size/width configurations — 64/8, 48/8, 32/8, 16/8; 64/4, 48/4, 32/4, 16/4; 64/2, 48/2, 32/2, 16/2 — linked by d transitions. The program starts in state S; "Hotspot found" enters the chain, traversal ends in OPT, and "Hotspot lost" returns to S.]
A. Iyer et al
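A minimal sketch of the exploration the FSM performs: on hotspot detection it dwells in each configuration for one sampling window (d asserted every 1024 instructions), records the power observed there, then settles on the best one. The configuration labels follow the FSM figure above; the sample_power hook and the toy model are placeholders, not Iyer et al's hardware.

RUU_SIZES = (64, 48, 32, 16)
WIDTHS = (8, 4, 2)                                       # as labeled in the FSM figure
CONFIGS = [(r, w) for w in WIDTHS for r in RUU_SIZES]    # 12 states

def explore_hotspot(sample_power):
    # sample_power(ruu, width) -> power measured over one 1024-instruction
    # window while running in that configuration (placeholder for the
    # hardware's per-window power tracking).
    measured = {cfg: sample_power(*cfg) for cfg in CONFIGS}  # traverse all 12 states
    return min(measured, key=measured.get)                    # switch to OPT

# Toy measurement: bigger structures and wider pipelines cost more power here.
toy = lambda ruu, width: 0.05 * ruu + 1.0 * width
print(explore_hotspot(toy))  # -> (16, 2) under this toy model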
29. Dynamic Profiling - Results
18% power savings over the baseline processor
30. Discussion
- Static vs. Dynamic
- Static methods have pre-processing overheads
- Dynamic may get out of the hotspot before detecting the optimal configuration
- Fine-grained vs. Coarse-grained
- Fine-grained increases switching and its overhead
- Coarse-grained might not recognize changes in program behavior
31. Thank you. Questions?
32. Appendices
33. Appendix A1. Comparison of the two heuristics with the base configuration (issue restricted and issue unrestricted) for CRAFTY.
34. Appendix A2. Enhanced Intel SpeedStep technology
35. Appendix B1. Instruction format in the SimpleScalar architecture
36. Appendix B2. Meta-blocks obtained from successive basic blocks
[Figure: meta-block formation, with edges labeled by branch probability]
37. Appendix B3. Hotspot detection hardware (behavioral sketch below)
- BBB (Branch Behavior Buffer)
- Execution counter incremented when the branch is taken
- Once it reaches a threshold, the branch is marked as a candidate branch
- (2K entries, 9-bit counters)
- HDC (Hotspot Detection Counter)
- Saturating counter, initialized to all 1s
- Decrement by D when a candidate branch is taken, increment by I when a non-candidate branch is taken
- When HDC = 0, a hotspot is detected
- (13-bit HDC, D = 2, I = 1)
- After Nf cycles, BBB non-candidate branches are flushed
- After Nr cycles, the entire BBB is reset
- This prevents stagnation
- (Nf = 4K, Nr = 64K)
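A behavioral sketch of the detector described above, with the BBB modeled as a dictionary instead of a 2K-entry table and the periodic Nf/Nr flush and reset omitted for brevity. The counter widths, D, and I follow the slide; the candidate threshold is an illustrative assumption.

CAND_THRESHOLD = 16        # executions before a branch becomes a candidate (illustrative)
HDC_MAX = (1 << 13) - 1    # 13-bit saturating counter, initialized to all 1s
D, I = 2, 1

class HotspotDetector:
    def __init__(self):
        self.bbb = {}       # branch PC -> (execution counter, candidate flag)
        self.hdc = HDC_MAX

    def on_taken_branch(self, pc):
        count, candidate = self.bbb.get(pc, (0, False))
        count = min(count + 1, (1 << 9) - 1)         # 9-bit execution counter
        candidate = candidate or count >= CAND_THRESHOLD
        self.bbb[pc] = (count, candidate)
        if candidate:
            self.hdc = max(self.hdc - D, 0)          # candidate branch taken: decrement by D
        else:
            self.hdc = min(self.hdc + I, HDC_MAX)    # non-candidate branch: increment by I
        return self.hdc == 0                         # HDC reaching 0 signals a hotspot

det = HotspotDetector()
hot = False
for _ in range(10000):
    hot = det.on_taken_branch(0x400)   # a tight loop hammering one branch
print(hot)  # -> True: the single candidate branch drags the HDC down to zero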