Lecture 14: Software Design for Low-Power

1
Lecture 14: Software Design for Low-Power

Software dictates much of hardware activity
Need software power estimation method
Must optimize software at several levels of
abstraction
Ultimately involves hardware/software trade-offs
Summary

Michael L. Bushnell
CAIP Center and WINLAB
ECE Dept., Rutgers U., Piscataway, NJ

2
Sources of Software Power Dissipation

Memory system uses 1/10 to 1/4 of the power in portable
computers
Dominates in video processing
System bus switching activity is controlled by
software
ALU and FPU data paths need good scheduling to
avoid pipeline stalls
Control logic and clock: reduce power by using the
shortest possible program to do the computation
Software control of hardware
Reduces power of idle components
Controls power saving modes

3
Memory Accesses Are Expensive

Highly capacitive data / address / row / column
decode / word data lines with high fanout
Mapping of data structures to memory banks
determines possibility of parallel word loads
(which are more energy efficient)
Memory access patterns greatly affect cache
performance
Cache more power efficient than main memory
Closer to CPU and smaller than main memory

4
Methods of Software Power Estimation for Code

Lower level: use gate-level simulation power
estimation on a gate-level description of the hardware
running the code (most accurate)
Higher level: look at the frequency of each type of
instruction / sequence (execution profile)
Use lookup table
Model ALUs, register files, etc.
Use bus switching activity exclusively
Must know bus architecture, OPCODEs,
representative input data to program, mapping of
code/data to address space
Characterize instruction power empirically on
real hardware
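
The bus-switching estimate above can be sketched as a count of bit toggles between consecutive words driven onto a bus. This is only an illustration: the function name, the 8-bit bus width, and the sample trace are assumptions, not taken from the lecture.

```python
def bus_toggles(words, width=8):
    """Count bit transitions between consecutive words on a bus of the given width."""
    mask = (1 << width) - 1
    total = 0
    for prev, cur in zip(words, words[1:]):
        # XOR marks the bus lines that change state between two transfers
        total += bin((prev ^ cur) & mask).count("1")
    return total


# Hypothetical sequence of values seen on an 8-bit bus
trace = [0x00, 0xFF, 0x0F, 0x0E]
toggles = bus_toggles(trace)
```

A real estimate would weight each toggle by the capacitance of the line it switches, which is why the bus architecture and the code/data mapping must be known.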

5
Instruction Level Power Analysis

For each instruction, need to measure:
Base power cost (independent of prior state)
Prior state cost: needs large-scale analysis /
simulation
Pipeline stalls, buffer stalls, cache misses
Circuit state effects due to localized
processor state change from execution of
instruction pair
Example: ADD A B, C
MULT C A, B
Includes change of instruction bus state
Switching of control lines
ALU mode changes
Routing costs to / from register file

6
Instruction Analysis Example

E_P: overall energy cost of program
B_i: base cost of instruction type i
O_i,j: cost of instruction type i followed by
type j
O_i,j O_j,i
E_k: costs of pipeline stalls and cache misses
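
One common way to combine these terms is E_P = sum_i (B_i * N_i) + sum_i,j (O_i,j * N_i,j) + sum_k E_k, where N_i and N_i,j are execution counts. The counts and this exact form are assumptions here, since the slide's formula did not survive in the transcript. A minimal sketch that evaluates it over an instruction trace, with made-up costs in arbitrary energy units:

```python
def program_energy(base_cost, overhead_cost, trace, other_costs=()):
    """Estimate E_P for an executed instruction trace.

    base_cost:     dict, instruction type -> B_i
    overhead_cost: dict, (i, j) -> O_i,j for type i followed by type j
    trace:         list of executed instruction types, in order
    other_costs:   iterable of E_k terms (stalls, cache misses)
    """
    energy = sum(base_cost[op] for op in trace)
    for prev, cur in zip(trace, trace[1:]):
        # Pairs missing from the table are assumed to add no overhead
        energy += overhead_cost.get((prev, cur), 0.0)
    return energy + sum(other_costs)


# Hypothetical cost tables, not measured values
B = {"ADD": 1.0, "MULT": 3.0}
O = {("ADD", "MULT"): 0.5, ("MULT", "ADD"): 0.5}
E = program_energy(B, O, ["ADD", "MULT", "ADD"], other_costs=[2.0])
```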

7
Detailed Example

Simple DSP with 4 registers A, B, C, D
Evaluate (x y) z

8
Example Concluded

9
Software Power Optimizations

Select least expensive instructions / instruction
sequences
Minimize frequency or cost of memory access
Use hardware power minimization features
Algorithm choices
Must map well to available hardware
Be efficient for problem being solved
Must maximize performance (parallel processing)
Constraints
Battery-powered system: minimize total energy
dissipated
When heat dissipation or reliability is important:
minimize instantaneous / average power

10
Algorithm Transformations

Reduction operations: use parallel hardware, but
run it slowly (reduced clock rate and supply voltage)
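
As a rough illustration of why more adders help, the sketch below counts the cycles needed to sum a set of values when a limited number of additions can run per cycle. The function name and the one-addition-per-adder-per-cycle model are assumptions; the slide's own diagrams are not in the transcript.

```python
def reduction_cycles(n_values, n_adders):
    """Cycles needed to sum n_values when at most n_adders additions run per cycle."""
    if n_values < 1 or n_adders < 1:
        raise ValueError("need at least one value and one adder")
    remaining = n_values
    cycles = 0
    while remaining > 1:
        # Each addition merges two partial sums into one
        additions = min(n_adders, remaining // 2)
        remaining -= additions
        cycles += 1
    return cycles


cycles_by_adders = {k: reduction_cycles(8, k) for k in (1, 2, 4)}
```

With fewer cycles needed, the same deadline can be met at a slower clock, which is what makes a lower supply voltage possible.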

11
Two Adders Available

12
Four Adders Available

13
Minimizing Memory Access Costs

Memory causes power and performance bottleneck
Minimize accesses needed by algorithm
Minimize total memory size needed by algorithm
Put memory accesses as close as possible to CPU
Register, then cache, external RAM last
Efficiently use memory bandwidth
Use multiple-word parallel loads, not single word
loads

14
Improve Loop Nesting and Operation Order

B too large to store in registers, so memory
transfers were used instead
Loop rearrangement allowed intermediate B to stay
in a general register
Eliminated 2N memory accesses for B; N memory
locations no longer needed
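
The slide's code is missing from the transcript, so the following is only a sketch of the kind of rearrangement described. The functions f and g are hypothetical stand-ins for the real computation, and the Python scalar b merely models a value kept in a register.

```python
def f(a):
    return a + 1      # hypothetical first stage


def g(b):
    return b * 2      # hypothetical second stage


def two_loops(A):
    n = len(A)
    B = [0] * n               # N extra memory locations for the intermediate
    C = [0] * n
    for i in range(n):
        B[i] = f(A[i])        # N memory writes of B
    for i in range(n):
        C[i] = g(B[i])        # N memory reads of B
    return C


def one_loop(A):
    C = [0] * len(A)
    for i in range(len(A)):
        b = f(A[i])           # intermediate stays in a scalar (register)
        C[i] = g(b)
    return C
```

Both versions compute the same C; the merged loop avoids the N writes and N reads of B as well as the array itself.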

15
Reduce Space Requirements

C not needed afterward: reorder so that B can
overwrite C
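
A minimal sketch of the idea, assuming a hypothetical element-wise computation B[i] = A[i] + C[i]. Because each C[i] is read before it is written, the result can reuse the storage of C.

```python
def combine_in_place(A, C):
    """Compute B[i] = A[i] + C[i], storing B in the storage that held C."""
    for i in range(len(C)):
        C[i] = A[i] + C[i]    # C[i] is consumed before being overwritten
    return C                  # same list object, now holding B


C_buf = [10, 20]
B_out = combine_in_place([1, 2], C_buf)
```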

16
Reduced Intermediate Value Storage

Reduce storage for intermediate values AiM
Reduced from N locations to 1

17
Dual Memory Load Example

18
Maximize Parallel Loads of Multiple Memory Words

Dual word loads reduce energy by 47%
1st: maximize dual loads with memory allocation
2nd: combine memory accesses
Cases compared: no dual load; dual load; dual load parallel exec.

19
Minimize Memory Bandwidth

Allocate registers to minimize memory references
Cache blocking, loop unrolling: fix array
computations so that blocks of array are read only
once
Register blocking: eliminate redundant register
loads
Recurrence detection and optimization: use
registers for values carried over from one
recursion level to the next
Compact multiple memory references into 1
reference
40% speedup on DEC Alpha, 25% on Motorola 88100
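
The recurrence optimization above can be illustrated with a running sum. This is a sketch under assumptions: the function names are invented, and the Python variable carry only models a value that a compiler would keep in a register.

```python
def running_sum_memory(b):
    """a[i] = a[i-1] + b[i], re-reading a[i-1] from the array on every iteration."""
    if not b:
        return []
    a = [0] * len(b)
    a[0] = b[0]
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]
    return a


def running_sum_register(b):
    """Same recurrence, with the carried value held in a scalar."""
    if not b:
        return []
    a = [0] * len(b)
    carry = b[0]              # value carried from one iteration to the next
    a[0] = carry
    for i in range(1, len(b)):
        carry = carry + b[i]  # no reload of a[i-1]
        a[i] = carry
    return a
```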

20
Cache Locking

Lock data into cache
Prevents memory reads/writes from going to main
memory
Real benefit: cache write hits are not written
through to main memory
Read hits: no main memory reference, even if
cache were unlocked
Fujitsu SPARClite: writing 0 drew 341 mA from
power supply when cache unlocked, only 194 mA
when cache locked

21
Instruction Selection/Ordering

Instruction packing
Single instruction does both ALU operation and
memory data transfer
Much instruction overhead not duplicated when
operations run in parallel
Concurrent execution of integer and floating-point
ops
Easier to do in VLIW and superscalar
architectures
Reorder instructions to minimize circuit state
effects
Most significant for DSP units
Accumulator spilling and mode switching are most
sensitive

22
Operand Swapping/Ordering

Swapping minimizes switching of functional unit
inputs
Example: x + 7 and y + 7: keep 7 on the same adder
input
Ordering most significant if commutative operands
not treated symmetrically by hardware
Example: Booth multiplier
2nd operand bit pattern determines the number of
additions and subtractions (called recoding weight)
Put operand with lowest recoding weight on 2nd
operand
Saved 10-30% of the power in Lee's experiments
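
To make the recoding-weight rule concrete, the sketch below assumes radix-2 Booth recoding, where digit i is nonzero exactly when bit i differs from bit i-1 (with an implied 0 below bit 0). The actual multiplier in the cited experiments may recode differently, and the function names and 16-bit width are assumptions.

```python
def booth_weight(x, width=16):
    """Number of nonzero digits in the radix-2 Booth recoding of x (two's complement)."""
    mask = (1 << width) - 1
    x &= mask
    # Bit i of x ^ (x << 1) is set when bit i differs from bit i-1
    return bin((x ^ (x << 1)) & mask).count("1")


def order_operands(a, b, width=16):
    """Return (first, second) with the lower recoding weight in the second position."""
    if booth_weight(a, width) < booth_weight(b, width):
        return b, a
    return a, b


first, second = order_operands(7, 5)
```

Here 7 recodes as 8 - 1 (weight 2) while 5 recodes with four nonzero digits, so 7 is placed on the second operand.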

23
Power Management

Software can often control processor power-down
modes
User interfaces: activity comes in bursts
When system idle time exceeds threshold, likely
to continue to be idle: start shutdown
Example: SPARClite
Power-down register masks/enables clock for
SDRAM, DMA module, FPU, floating-point queues
Example: Hitachi SH3
Standby mode: CPU core stopped; peripheral
controller, bus controller, memory refresh
continue
Sleep mode: everything but real-time clock stops
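
The idle-time threshold rule above can be sketched as a simple calculation of how long a system would spend shut down. The function name, the sample idle periods, and the assumption of zero shutdown and wake-up latency are all illustrative.

```python
def time_shut_down(idle_periods, threshold):
    """Total time spent shut down if shutdown begins once an idle period exceeds threshold."""
    # Idle periods no longer than the threshold never trigger a shutdown
    return sum(max(0, idle - threshold) for idle in idle_periods)


# Hypothetical idle periods between bursts of user activity, in seconds
saved = time_shut_down([1, 10, 3, 20], threshold=5)
```

A longer threshold avoids needless shutdowns during short pauses but wastes more of each long idle period.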

24
More Examples

Example: Intel 486SL
System Management Mode entered by asynchronous
interrupt
Can enable, disable, switch between fast and
slow clocks for CPU and ISA bus
Example: PowerPC 603 and 604
Dynamic power management removes clock from
execution units (saves 8-16%)
Static power management
Doze: shuts off most function units, keeps bus
snooping enabled to maintain data cache coherence
Nap: shuts off bus snooping and sets wakeup
timer, keeps phase-locked loop running to allow
quick clock restart
Sleep mode: also shuts off phase-locked loop
Software control of power management better than
pure hardware control: software has more information

25
Automated Low-Power Code Generation

High-level
Graphical / textual languages available to
describe DSP algorithms
HYPER_LP uncovers parallelism and minimizes
critical paths
Allows data path supply voltage to be reduced
MASAI reorganizes loops to minimize memory
transfers and size
DSP Compiler technology must deal with small
register set, irregular data paths, fully using
parallel resources

26
Su's Cold Scheduling Algorithm

Allocate registers
Pre-assemble: calculate target addresses, index
symbol table, transform instructions to binary
Schedule instructions with cold scheduling
Post-assemble: complete assembly
Reduced switching activity 20-30%, performance
loss of 2-4%
Lee et al. had a similar approach, but used
as-soon-as-possible packing of instructions
Saved 26% to 73% of energy compared with no
instruction packing and no memory bank assignment
optimization
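
The transcript does not give the details of the scheduling step, so the sketch below is an assumption: a greedy list scheduler that, among the instructions whose dependencies are satisfied, picks the one whose binary encoding differs in the fewest bits from the previously issued instruction. The encodings and dependencies are made up.

```python
def hamming(a, b):
    """Number of differing bits between two encodings."""
    return bin(a ^ b).count("1")


def cold_schedule(encodings, deps):
    """encodings: dict name -> int, in program order.
    deps: dict name -> set of names that must be issued earlier."""
    scheduled, remaining, prev = [], list(encodings), 0  # bus assumed to start at 0
    while remaining:
        ready = [n for n in remaining if deps.get(n, set()) <= set(scheduled)]
        if not ready:
            raise ValueError("cyclic dependencies")
        best = min(ready, key=lambda n: hamming(prev, encodings[n]))
        scheduled.append(best)
        remaining.remove(best)
        prev = encodings[best]
    return scheduled


def switching(order, encodings):
    """Total bit toggles on the instruction bus for a given issue order."""
    total, prev = 0, 0
    for name in order:
        total += hamming(prev, encodings[name])
        prev = encodings[name]
    return total


# Hypothetical 4-bit encodings; i4 depends on i2
enc = {"i1": 0b0001, "i2": 0b1110, "i3": 0b0011, "i4": 0b1111}
order = cold_schedule(enc, {"i4": {"i2"}})
```

For this example the program order toggles 10 bits while the scheduled order toggles 6. A greedy choice like this is not guaranteed to be optimal.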

27
Instruction Set Co-Design

PEAS-I system
Takes HDL for CPU and C compiler, assembler, and
simulator for CPU
Takes design constraints (chip area, power),
hardware module database, sample application
program, program data set
Optimizes instruction set and implementations for
application program and given data set
Starts with core instructions needed for any C
program
Augments with instructions for C operators not
already included as a single instruction
Defines hardware, microprogram, and software
implementations
Estimates power and area of each instruction
implementation
Accounts for pipeline hazards
Solves as an integer program using
branch-and-bound search

28
Instruction Set Design

Huang and Despain system
Optimizes instruction set for sample application,
too
Groups micro-operations (MOPS) together to form
higher-level instructions
Merge MOPS together as byproduct of scheduling
MOPS must be scheduled to same clock cycle
Constrain with instruction bit width, instruction
set size, and hardware resources
Solved by simulated annealing with objective of
minimizing execution cycles and instruction set size

29
Reconfigurable Computing

Some of hardware interconnect and logic is
modified at run time
Can closely optimize to wide variety of
applications
Can reconfigure at gate level or at architectural
level
Usually implemented using FPGAs
Gives software designer chance to tailor software
and processor to fit each other
Even after hardware design is fixed

30
Memory System Considerations

Feature
Total size
Bank partitioning
Wide data bus
Proximity to CPU
Cache size
Cache protocol
Cache locking

Low-Power Impact
Code compaction, algorithm transformations
Smaller, lower CL, faster
Determined by application size, data structure
Needed for parallel loads
Reduces CL, makes memory use less costly
Minimizes cache misses for application
Need spatial and temporal locality
Optimize for application
Use to run part of application only from cache

31
Architectural Considerations

Processor class (DSP/RISC/CISC): application dependent
Parallel processing (VLIW, superscalar, SIMD, MIMD):
does parallel processing greatly improve performance
so that reduced VDD and slower clock can be used?
Bus architecture: separate address, data,
instruction, I/O busses make it easier to
optimize instructions, addressing, and data to
minimize bus activity
Register file size: eases register allocation,
reduces memory accesses; too many registers increase power

32
Power Management Considerations

Software vs. hardware control: software control lets
power management be tailored to application
Granularity of clock/voltage removal: shutdown modes
needed to benefit from minimized execution times
Clock/voltage reduction: useful if there is schedule
slack for a software task (power reduced during slack)
Guarded evaluation: latch functional unit inputs to
avoid meaningless calculation when not in use

33
Summary