Power Aware Architecture Design - PowerPoint PPT Presentation

Title: Power Aware Architecture Design
Provided by: meetasha
Slides: 38

Transcript and Presenter's Notes
1
Power Aware Architecture Design
  • Meeta S. Gupta
  • Snehal Sanghavi
  • Sama Usmani

2
The Power Problem
  • Power consumption has been increasing with each
    new CPU generation

3
Power Reduction Techniques
  • Where does all the power go?

[Figure: power breakdown of the Alpha 21264. Source: Wilcox, Micro99]
  • Clock gating
  • Voltage scaling
  • Architectural techniques

4
Voltage Scaling
5
Motivation
  • P = C·V²·f
  • Reducing voltage gives significant power gains
  • CPU usage is bursty
  • What is voltage scaling?
  • Provide multiple voltage levels to the voltage
    regulator
  • Use the minimum voltage where possible to save power
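The equation above is the standard dynamic-power relation, P = C·V²·f; since frequency usually has to drop along with voltage, scaling both gives a roughly cubic power reduction. A minimal sketch (the capacitance, voltage, and frequency values are illustrative assumptions, not from the slides):

```python
def dynamic_power(c_eff, vdd, freq):
    """Dynamic switching power: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq

# Hypothetical operating points (not from the slides).
full = dynamic_power(c_eff=1e-9, vdd=1.2, freq=2.0e9)  # full speed
# Scaling Vdd and f together by 0.75 cuts power roughly cubically.
slow = dynamic_power(c_eff=1e-9, vdd=0.9, freq=1.5e9)

print(f"power ratio: {slow / full:.3f}")  # 0.75**3 ~= 0.422
```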

6
Dual Speed Pipelines
  • Pipelines running at different voltages and clock
    frequencies
  • Non-critical instructions can tolerate higher
    latencies
  • Criticality can be determined by
  • Oldest Instruction critical (OI) scheme
  • More than one Consumer critical (MC) scheme

Pyreddy et al, WCED-01
7
Processor Model
Pyreddy et al, WCED-01
8
Results
  • Claim
  • Peak performance occurs when the fraction of fast
    instructions is between 40% and 50%

Pyreddy et al, WCED-01
9
Determining the voltage supply
  • Adjust the supply voltage and frequency in
    response to the ILP of an application
  • Different modes of voltage operation:
  • High-performance: processor runs at maximum
    voltage
  • Power-saver: processor runs at minimum voltage
  • Automatic: supply voltage chosen on the basis of
    application needs
  • F_new = F_old × (MIPS_goal / MIPS_observed)
  • Voltage adjustment latency
  • Multiple voltages on a single chip ease
    switching

Childers et al, Micro-00
10
How the voltage adjustment works
When timer expires:
    get observed MIPS
    choose new MIPS goal
    compute new frequency
    if new frequency ≠ old frequency (± 33 MHz steps):
        stop fetch
        drain pipeline
        get level value
        get discrete voltage/frequency values
        resume fetch
Childers et al, Micro-00
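The frequency update above, F_new = F_old × MIPS_goal / MIPS_observed, snapped to the 33 MHz granularity the pseudocode mentions, can be sketched as follows (the function name and example numbers are illustrative assumptions):

```python
STEP_MHZ = 33  # discrete frequency granularity from the slide


def next_frequency(f_old_mhz, mips_goal, mips_observed):
    """Scale frequency in proportion to the MIPS shortfall or
    surplus, then snap to the nearest 33 MHz step."""
    f_new = f_old_mhz * mips_goal / mips_observed
    return round(f_new / STEP_MHZ) * STEP_MHZ


# Example: running at 660 MHz while observing half the target MIPS
# asks for double the frequency.
print(next_frequency(660, mips_goal=500, mips_observed=250))  # 1320
```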
11
Energy Improvement
  • Claim
  • 47% improvement in energy consumption

Childers et al, Micro-00
12
Enhanced SpeedStep Technology
  • Voltage-frequency switching operation
  • Voltage stepped in small increments
  • Shorter jumps between operating points
  • Lower latency
  • CPU unavailability time reduced

13
Power Aware Issue Queue Design
14
Motivation
  • Voltage scaling cannot be used beyond a certain limit
  • The one-size-fits-all philosophy does not hold
  • Resource usage varies across applications, and also
    within an application
  • Oversized resources are committed for performance
  • Waste of power
  • Need to dynamically adapt resources with minimal
    performance loss

Source V Tiwari, Micro99
15
Issue Queue Design
  • One of the most complex logic blocks of a
    superscalar processor
  • Wakeup logic, Select logic
  • Performance centric design
  • Latch-based
  • Compacting vs. non-compacting
  • CAM/RAM based

16
CoAdaptive Instruction Fetch and Issue
  • Issue centric fetch gating
  • Instructions fetched earlier than necessary waste
    energy sitting idle
  • Detect mismatch between size of instruction
    window and the required size for application
    parallelism

[Figure: ROB and issue queue. Parallelism distant from the ROB head (toward the tail) corresponds to higher utilization; parallelism close to the head corresponds to lower utilization.]
  • Gate fetch when parallelism is close and
    utilization is high

Buyuktosunoglu et al, ISCA03
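The gating rule above can be sketched as a simple predicate (the utilization threshold is an assumed illustrative value, not from the paper):

```python
def fetch_gate(parallelism_close, issue_queue_util, util_threshold=0.75):
    """Gate (stall) fetch when ready instructions cluster near the
    ROB head (close parallelism) AND the issue queue is already well
    utilized; otherwise keep fetching."""
    return parallelism_close and issue_queue_util >= util_threshold


# Gate only when both conditions hold.
print(fetch_gate(True, 0.8))   # True: close parallelism, high utilization
print(fetch_gate(True, 0.5))   # False: utilization too low to gate
```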
17
Co-Adaptive Instruction Fetch and Issue
  • Dynamic adaptation of issue queue size
  • Based on activity of instructions

[Figure: issue queue entries tagged Active (A) or Not Active (NA)]
  • Dynamic adaptation of issue queue size gives a 31%
    power reduction
  • 20% additional savings when coupled with fetch
    gating

Buyuktosunoglu et al, ISCA03
18
Dynamic Allocation of Multiple Datapath Resources
  • Use of multiple resources in the instruction path
    is highly correlated
  • Independently resize ROB, IQ, LSQ based on their
    occupancy

Ponomarev et al, Micro01
19
Dynamic Allocation of Multiple Datapath Resources
  • Downsize a resource based on occupancy
  • Periodically sample the resource, and average
    over a few sample periods
  • If the difference between the current size and
    active size is greater than a partition:
  • Reduce by one partition, or
  • Aggressively reduce by (difference / partition size)
    partitions
  • Upsizing is more aggressive in nature
  • Not based on a decrease in IPC
  • If the number of instructions blocked at dispatch
    increases beyond a threshold, increase the
    resource size
  • Power savings of 50% in the IQ, 70.5% in the ROB and
    55.7% in the LSQ with a performance degradation of
    7.3%

Ponomarev et al, Micro01
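The downsize/upsize policy above can be sketched as follows (the partition size, blocking threshold, and size limit are assumed illustrative values, not taken from the paper):

```python
PARTITION = 8  # entries per partition (assumed size)


def resize(current_size, avg_active, dispatch_blocked,
           block_threshold=2, max_size=64, aggressive=True):
    """Occupancy-driven resizing in the spirit of Ponomarev et al.:
    upsize by a partition when dispatch blocking crosses a threshold;
    downsize when averaged occupancy trails the current size by at
    least one partition (aggressively: by every full partition of slack)."""
    if dispatch_blocked > block_threshold:
        return min(current_size + PARTITION, max_size)  # upsize
    slack = current_size - avg_active
    if slack >= PARTITION:
        if aggressive:
            return current_size - (slack // PARTITION) * PARTITION
        return current_size - PARTITION
    return current_size


print(resize(64, avg_active=30, dispatch_blocked=0))  # 32 (shed 4 partitions)
print(resize(32, avg_active=30, dispatch_blocked=5))  # 40 (dispatch blocked)
```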
20
Discussion
  • Provides a coarse-grained gating mechanism
  • Can help reduce the leakage current
  • By shutting down unused resources
  • Tradeoffs
  • Impact on IPC
  • <8% decrease in performance
  • Circuit complexity increases with additional
    logic added
  • Simpler designs like non-compacting queues with
    clock gating might give similar benefits

21
Profile-based Hardware Configuration
22
Motivation
  • Variation in program behavior
  • Across various applications
  • From one section to another
  • Variable resource usage
  • Unused resources consume energy
  • Identify the optimal configuration for each code
    region
  • Requires code profiling
  • Dynamic configuration of processor

23
Profiling Basics
  • What is profiling for power?
  • Collecting performance and power statistics
  • Static and dynamic profiling
  • Static: off-line, profiling done before final
    execution
  • Dynamic: profiling done during execution
  • Profiling granularity
  • Fine-grained: basic blocks, functions
  • Coarse-grained: meta blocks

24
Compiler Assisted Code Annotation
  • Profiling for a basic block
  • Find the optimal number of instructions to be
    executed in parallel
  • Consider variable fetch width and execution width
  • Fetch width: control-dominated architectures
    (superscalar, out-of-order)
  • Execution width: datapath-dominated architectures
    (VLIW)
  • Annotate all instructions of a basic block
    with profiling data
  • The annotation value signifies the number of
    functional units
  • If the annotated value > 1, instructions are
    executed concurrently
  • Assumptions:
  • Instruction format allows for code annotation
  • Microprocessor supports variable fetch and
    execution rates
  • Technique holds for a single type of functional
    unit
  • Advantage for unrolled loops

Marculescu et al
25
Code Annotation - Algorithm
  • Low_Energy_Code_Annotate(Program)
  •   Extract all basic blocks (b1, …, bn)
  •   For x = 1, …, K:
  •     For i = 1, …, n:
  •       P(bi, x) = power of basic block bi
  •         when x instructions are executed in parallel
  •   For i = 1, …, n:
  •     Find the value of x in 1, …, K such that P(bi, x)
        is minimized
  •     Annotate all instructions in bi with x

Marculescu et al
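The annotation algorithm can be written as a short runnable sketch; the toy power model below is a hypothetical stand-in for the profiled per-block power data the slides assume:

```python
def low_energy_code_annotate(blocks, power, K):
    """For each basic block, pick the parallelism x in 1..K that
    minimizes its power estimate, and record it as the block's
    annotation. `power(block, x)` is an assumed profiling callback."""
    annotations = {}
    for b in blocks:
        annotations[b] = min(range(1, K + 1), key=lambda x: power(b, x))
    return annotations


# Toy power model (hypothetical): power is lowest when the issue
# width matches the block's inherent parallelism.
ilp = {"b1": 2, "b2": 4}
power_fn = lambda b, x: abs(x - ilp[b]) + 0.1 * x
print(low_energy_code_annotate(["b1", "b2"], power_fn, K=4))
# {'b1': 2, 'b2': 4}
```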
26
Results
  • Variable execution width
  • 44% power savings at 35% performance decrease
    with no constraints
  • 33% power savings at 25% performance decrease
    with constraint
  • Variable fetch width
  • 15% power reduction for 6% performance loss

Marculescu et al
27
Dynamic Profiling Scheme
  • Profiling based on hotspots
  • 92% of execution time is spent in hotspots!
  • Find the best configuration for each hotspot
  • Consider the size of the RUU and the pipeline width
  • RUU: 4 different sizes (16, 32, 48, 64)
  • Pipeline width: 3 widths (4, 6, 8)

A. Iyer et al
28
Example of FSM
d: transition taken after every 1024 instructions
  • On detecting hotspot, traverse through 12-state
    FSM
  • Track power for each configuration
  • Switch processor to optimal configuration

[Figure: 12-state FSM. Program start enters state S; on "Hotspot found," transition d steps through the twelve RUU-size/pipeline-width configurations, one every 1024 instructions, and the FSM settles in the optimal state OPT; "Hotspot lost" returns it to S.]
A. Iyer et al
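The FSM's search for the lowest-power configuration can be sketched as a walk over the 12 states; the `measure` callback is a hypothetical stand-in for the per-window power tracking, and the configuration lists follow the slide's bullets:

```python
from itertools import product

RUU_SIZES = (64, 48, 32, 16)   # RUU sizes from the slide
WIDTHS = (8, 6, 4)             # pipeline widths from the slide


def find_opt_config(measure):
    """Visit all 12 RUU-size/width states, recording the measured
    power of each, and return the lowest-power configuration (OPT)."""
    best, best_power = None, float("inf")
    for cfg in product(RUU_SIZES, WIDTHS):
        p = measure(cfg)
        if p < best_power:
            best, best_power = cfg, p
    return best


# Toy power model (hypothetical): power grows with both RUU size
# and width, so the smallest configuration wins.
print(find_opt_config(lambda cfg: cfg[0] * 0.5 + cfg[1] * 2.0))  # (16, 4)
```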
29
Dynamic Profiling - Results
18% power savings over the baseline processor
30
Discussion
  • Static vs. Dynamic
  • Static methods have pre-processing overheads
  • Dynamic profiling may leave the hotspot before
    detecting the optimal configuration
  • Fine-grained vs. coarse-grained
  • Fine-grained increases switching overhead
  • Coarse-grained might not recognize changes in
    program behavior

31
Thank you! Questions?
32
Appendices
33
  • Appendix A1. Comparison of the two heuristics
    with base configuration (issue restricted and
    issue unrestricted) for CRAFTY.

34
Appendix A2. Enhanced Intel SpeedStep technology
35
Appendix B1. Instruction format in SimpleScalar
architecture
36
Appendix B2. Meta-blocks obtained from successive
basic blocks
[Figure: meta-blocks formed from successive basic blocks, annotated with branch probabilities]
37
Appendix B3. Hotspot detection hardware
  • BBB (Branch Behavior Buffer)
  • Execution counter incremented when a branch is
    taken
  • Once it reaches a threshold, the branch is marked as a
    candidate branch
  • (2K entries, 9-bit counters)
  • HDC (Hotspot Detection Counter)
  • Saturating counter, initialized to all 1s
  • Decrement by D when a candidate branch is taken,
    increment by I when a non-candidate branch is taken
  • When the HDC reaches 0, a hotspot is detected
  • (13-bit HDC, D = 2, I = 1)
  • After Nf cycles, BBB non-candidate branches are
    flushed
  • After Nr cycles, the entire BBB is reset
  • This prevents stagnation
  • (Nf = 4K, Nr = 64K)
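The BBB/HDC interaction can be sketched as follows. The 13-bit counter and D = 2, I = 1 are from the slide; the candidate threshold and the flush/reset timers' omission are assumptions of this sketch:

```python
class HotspotDetector:
    """Sketch of the BBB/HDC scheme: branches taken at least
    CAND_THRESHOLD times become candidates; candidate branches
    decrement the saturating hotspot detection counter by D,
    non-candidates increment it by I. A hotspot is flagged when
    the counter hits 0."""

    CAND_THRESHOLD = 16        # assumed candidate threshold (not on slide)
    HDC_MAX = (1 << 13) - 1    # 13-bit counter, initialized to all 1s
    D, I = 2, 1

    def __init__(self):
        self.exec_count = {}   # BBB: per-branch taken counts
        self.hdc = self.HDC_MAX

    def on_taken_branch(self, branch):
        """Update counters for a taken branch; return True once a
        hotspot is detected."""
        n = self.exec_count.get(branch, 0) + 1
        self.exec_count[branch] = n
        if n >= self.CAND_THRESHOLD:                  # candidate branch
            self.hdc = max(0, self.hdc - self.D)
        else:                                         # non-candidate branch
            self.hdc = min(self.HDC_MAX, self.hdc + self.I)
        return self.hdc == 0


# A tight loop branch taken thousands of times drives the HDC to 0.
det = HotspotDetector()
hot = any(det.on_taken_branch("loop") for _ in range(5000))
print(hot)  # True
```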