Title: 8 Software Design for Low Power
8 Software Design for Low Power
- Algorithm optimizations
- Minimizing memory access
- Optimal selection and sequencing of machine instructions
- Power management
8.2 Sources of Software Power Dissipation
- The memory system takes a substantial fraction of the power budget for portable computers, and it can be the dominant source of power dissipation in some memory-intensive DSP applications such as video processing.
- System buses are a high-capacitance component whose switching activity is largely determined by the software.
8.3 Software Power Estimation
- Two basic levels of abstraction:
- 1. Gate level: simulation and power estimation tools applied to a gate-level description of an instruction-processing system.
- 2. Instruction level: estimation based on the frequency of execution of each type of instruction or instruction sequence.
- Architectural power estimation determines which major components of a processor will be active during each execution cycle of a program.
8.3.4 Instruction-Level Power Analysis
- $E_p = \sum_i (B_i \cdot N_i) + \sum_{i,j} (O_{i,j} \cdot N_{i,j}) + \sum_k E_k$
- Here $E_p$ is the overall energy cost of a program, decomposed into base costs, circuit-state overhead, and stalls and cache misses.
- The first summation represents base costs, where $B_i$ is the base cost of an instruction of type $i$ and $N_i$ is the number of type-$i$ instructions in the execution profile of a program.
- The second summation represents circuit-state effects, where $O_{i,j}$ is the cost incurred when an instruction of type $i$ is followed by an instruction of type $j$.
- $N_{i,j}$ is the number of occurrences where instruction type $i$ is immediately followed by instruction type $j$.
- The last sum accounts for other effects, such as stalls and cache misses. Each $E_k$ represents the cost of one such effect found in the program execution profile.
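As a minimal sketch (not from the slides), the model could be evaluated in C as follows; the three instruction classes, the cost tables, and the trace format are all hypothetical placeholders:

    #include <stdio.h>

    #define NTYPES 3  /* hypothetical classes: 0 = ALU, 1 = LOAD, 2 = STORE */

    /* Assumed per-type base costs B_i (nJ) and circuit-state overheads O_ij. */
    static const double B[NTYPES] = { 1.0, 2.2, 2.0 };
    static const double O[NTYPES][NTYPES] = {
        { 0.0, 0.4, 0.3 },
        { 0.4, 0.0, 0.2 },
        { 0.3, 0.2, 0.0 },
    };

    /* E_p = sum_i B_i*N_i + sum_ij O_ij*N_ij + sum_k E_k */
    double program_energy(const int *trace, int len, double stall_energy)
    {
        double e = 0.0;
        for (int n = 0; n < len; n++) {
            e += B[trace[n]];                   /* base cost term */
            if (n > 0)
                e += O[trace[n - 1]][trace[n]]; /* circuit-state term */
        }
        return e + stall_energy;                /* stalls, cache misses */
    }

    int main(void)
    {
        int trace[] = { 1, 0, 0, 2, 1, 0 };     /* LOAD ALU ALU STORE LOAD ALU */
        printf("E_p = %.2f nJ\n", program_energy(trace, 6, 0.8));
        return 0;
    }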
8.4 Software Power Optimizations
- A prerequisite to optimizing a program for low power must always be to design an algorithm that maps well to the available hardware and is efficient for the problem at hand, in terms of both time and storage complexity.
- Reduction operations are an example of a common class of operations that lend themselves to a trade-off between resource usage and execution time.
Only one adder is available
The basic principle is to try to match the degree of parallelism in an algorithm to the number of parallel resources available.
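A minimal sketch of the idea in C (not from the slides): the same reduction written for a datapath with one adder, and restructured into two independent partial sums that can map onto two adders, halving the cycle count so that a lower clock rate and supply voltage could meet the same deadline:

    /* Serial reduction: one running sum, matching a single available adder. */
    int sum_serial(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two-way reduction: two independent partial sums a compiler can map
       onto two adders.  Assumes n is even, for brevity. */
    int sum_two_way(const int *a, int n)
    {
        int s0 = 0, s1 = 0;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }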
8.4.2 Minimizing Memory Access Costs
- Power minimization techniques related to memory concentrate on one or more of the following objectives:
- Minimize the number of memory accesses required by an algorithm.
- Minimize the total memory required by an algorithm.
- Put memory accesses as close as possible to the processor: choose registers first, cache next, and external RAM last.
- Make the most efficient use of the available memory bandwidth: use multiple-word parallel loads instead of single-word loads as much as possible (see the sketch after this list).
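A sketch of the last point in C (not from the slides; the even element count, 8-byte alignment, and little-endian word layout are assumptions): the same sum performed with one 32-bit load per element, and with one 64-bit load per pair of elements, halving the number of memory transactions:

    #include <stdint.h>
    #include <string.h>

    /* Single-word loads: one 32-bit memory access per element. */
    int64_t sum32(const int32_t *a, int n)
    {
        int64_t s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Double-word loads: each 64-bit access fetches two elements. */
    int64_t sum32_dual(const int32_t *a, int n)
    {
        int64_t s = 0;
        for (int i = 0; i < n; i += 2) {
            uint64_t w;
            memcpy(&w, &a[i], sizeof w);        /* one wide load */
            s += (int32_t)(uint32_t)w;          /* low word (little-endian) */
            s += (int32_t)(uint32_t)(w >> 32);  /* high word */
        }
        return s;
    }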
- Dual loads are beneficial for energy reduction but not necessarily for power reduction, because the instantaneous power dissipation of a dual load is somewhat higher than that of two single loads.
- Lower average power solutions were obtained at the expense of execution cycles and total energy.
To evaluate (x y) z
Memory bandwidth minimization
- Register allocation: minimize external memory references.
- Cache blocking: transform array computations so that blocks of array elements only have to be read into cache once (see the sketch after this list).
- Register blocking: eliminate redundant register loading of array elements.
- Recurrence detection and optimization: use registers for values that are carried over from one iteration to the next.
- Compact multiple memory references into a single reference.
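A minimal cache-blocking sketch in C (not from the slides; N, BLK, and the kernel are illustrative): the unblocked loop streams through b with stride-N accesses and refetches it from external memory once the array exceeds the cache, while the blocked version reads each BLK x BLK tile into cache once and uses it fully:

    #define N   512
    #define BLK 64   /* tile edge chosen so a BLK x BLK tile fits in cache */

    /* Unblocked: b is walked column-wise, so cache lines of b are evicted
       and refetched many times. */
    void add_transpose(double (*a)[N], double (*b)[N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += b[j][i];
    }

    /* Blocked: each BLK x BLK tile of b is read into cache once. */
    void add_transpose_blocked(double (*a)[N], double (*b)[N])
    {
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int j = jj; j < jj + BLK; j++)
                        a[i][j] += b[j][i];
    }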
- The label inside the circle for each vertex indicates a variable (var_v), a symbolic register name (reg_v), or a physical register name (X).
- The label just outside of each circle indicates the assignment of a physical register name to a symbolic register, or of a memory bank to a variable.
- A red edge indicates that two register values are active at the same time and should be assigned to different physical registers.
- A green edge indicates a parallel memory-to-register transfer that should be preserved.
- Black edges indicate limitations on operand usage imposed by the target processor.
- ADD R1 ← a, e    ADD R2 ← e, b
- ADD R3 ← a, b    ADD R4 ← b, d
- ADD R5 ← a, c    ADD R6 ← c, d
- All the variables (vertices) in one partition are assigned to the same memory bank.
- If an edge crosses the partition, then the memory accesses for those two variables can be compacted into a dual-load operation at a cost of one execution cycle.
8.5 Automated Low-Power Code Generation
- Cold scheduling algorithm: an instruction scheduling algorithm that reduces bus switching activity related to the change in state when execution moves from one instruction type to another.
- 1. Allocate registers.
- 2. Pre-assemble: calculate target addresses, index the symbol table, and transform instructions to binary.
- 3. Schedule instructions using the cold scheduling algorithm (see the sketch after this list).
- 4. Post-assemble: complete the assembly process.
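A simplified sketch of the cold-scheduling core in C (not from the slides): among the currently ready instructions, always issue the one whose binary encoding differs from the previously issued one in the fewest bit positions, since toggled bus lines are what dissipate power. A real scheduler would also track data dependences and refill the ready set:

    /* Hamming distance between two encodings: a proxy for the bus
       switching activity of issuing them back to back. */
    static int hamming(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;
        int d = 0;
        while (x) { d += x & 1u; x >>= 1; }
        return d;
    }

    /* Greedily order the ready instructions to minimize bit toggles. */
    void cold_schedule(const unsigned *ready, int n, unsigned prev,
                       int *order /* out: issue order of indices */)
    {
        int used[64] = { 0 };                   /* assumes n <= 64 */
        for (int slot = 0; slot < n; slot++) {
            int best = -1, best_d = 1 << 30;
            for (int i = 0; i < n; i++) {
                if (used[i]) continue;
                int d = hamming(prev, ready[i]);
                if (d < best_d) { best_d = d; best = i; }
            }
            used[best] = 1;
            order[slot] = best;
            prev = ready[best];
        }
    }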
Code generation and optimization methodology
- 1. Allocate registers and select instructions.
- 2. Build a data flow graph (DFG) for each basic block.
- 3. Optimize memory bank assignments by simulated annealing (see the sketch after this list).
- 4. Perform as-soon-as-possible (ASAP) packing of instructions.
- 5. Perform list scheduling of instructions (similar to cold scheduling).
- 6. Swap instruction operands where beneficial.
List scheduling
- The idea is to make an ordered list of processes by assigning them priorities, and then repeatedly execute the following two steps until a valid schedule is obtained:
- Select from the list the process with the highest priority for scheduling.
- Select a resource to accommodate this process.
- The priorities are determined statically before the scheduling process begins: the first step chooses the process with the highest priority, and the second step selects the best possible resource.
8.6 Codesign for Low Power
- Instruction set design and implementation seems to be one of the most well-defined of the codesign problems.
- Using a processor for which some portion of the interconnect and logic is reconfigurable.
Memory system considerations for low-power software
- Total memory size: minimizing total memory requirements through code compaction and algorithm transformations allows for a smaller memory system with lower capacitances and improved performance. However, this benefit might have to be traded off against a faster algorithm that requires more memory.
- Partitioning into banks: how many, and how big? Banks are needed if parallel loads are to be used; the size of the application, its data structures, and its access patterns should determine this.
- Wide data bus: needed for parallel loads.
- Proximity to CPU: memory close to the CPU reduces capacitances and makes memory use less expensive.
- Cache size: should be sized to minimize cache misses for the intended application. Software should apply techniques such as cache blocking to maintain a high degree of spatial and temporal locality.
- Cache protocol: if an application's memory access patterns are well defined, the protocol can be optimized.
- Cache locking: if significant portions of an application can run entirely from cache, locking can be used to prevent any external memory accesses while those portions of code are executing.
Architectural considerations for low-power software
- Class of processor (DSP/RISC/CISC): application dependent. How well does the instruction set fit the application? Is there hardware support for operations common to the application? Does the hardware implement functions not needed by the application?
- Parallel processing (VLIW, superscalar, SIMD, MIMD): how much parallelism? Does the level and type of parallelism in an algorithm fit one of these architectures? If so, the parallel architecture can greatly improve performance and allow reduced voltages and clock rates to be used.
- Bus architecture: separate buses (e.g., address, data, instruction, I/O) can make it easier to optimize instruction sequences, addressing patterns, and data correlations to minimize bus switching.
- Register file size: an increased register count eases register allocation and reduces memory accesses, but too large a register file can make register accesses as expensive as cache accesses.
Power management considerations for low-power software
- Software vs. hardware control: software control allows power management to be tailored to the application, at the cost of added execution cycles.
- Clock/voltage removal: at what level of granularity? Some kind of shutdown mode is needed in order to realize a benefit from minimized execution times.
- Clock/voltage reduction: if there is schedule slack associated with a software task, reduce power by slowing down the clock and reducing the supply voltage.
- Guarded evaluation: latch the inputs of functional units to prevent meaningless calculations when the outputs of those units are not being used. This adds to the power reductions obtained from strength reduction and operand swapping.
Sources of Software Energy Consumption
- Datapaths in integer ALU and FP units
- Cache and memory systems
- System buses (address, instruction, and data)
- Control circuitry, and clock logic and distribution
Levels of Optimization
- Algorithm and Application Design
- Compiler Optimizations
- Operating System Control
Reducing memory energy
- Improve the locality of memory accesses
- Minimize the number of memory accesses
- Optimize the interaction of the compiler and the cache architecture
- Reduce the total memory area (in embedded systems)
- Make effective use of memory bandwidth
Improving locality: loop transformations
- Linear loop transformations
- Loop permutation (see the sketch after this list)
- Loop skewing
- Loop reversal
- Loop scaling
- Multi-loop transformations
- Loop fusion/fission
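A minimal loop-permutation (interchange) sketch in C (not from the slides; N and the kernel are illustrative): both functions scale a row-major array, but the first walks it column by column with stride-N accesses, while the permuted version walks it row by row so every fetched cache line is used completely:

    #define N 1024

    /* Column-major traversal of a row-major array: poor spatial locality. */
    void scale_bad(double (*a)[N], double s)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] *= s;
    }

    /* After loop permutation: unit-stride accesses, same computation. */
    void scale_good(double (*a)[N], double s)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= s;
    }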
Improving locality: data transformations
- Linear layout transformations
- Dimension re-indexing (see the sketch after this list)
- Diagonal (skewed) memory layouts
- Blocked memory layouts
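A dimension re-indexing sketch in C (not from the slides; the array names and sizes are illustrative): when the traversal order is fixed by the algorithm and loops cannot be interchanged, swapping the dimensions of the array's layout makes the fixed traversal unit-stride instead:

    #define R 256
    #define C 256

    double a_orig[R][C];   /* original layout, indexed a_orig[i][j] */
    double a_swap[C][R];   /* re-indexed layout, indexed a_swap[j][i] */

    /* Column sum over the re-indexed layout: the column now occupies
       contiguous memory, so the walk is unit-stride. */
    double column_sum(int j)
    {
        double s = 0.0;
        for (int i = 0; i < R; i++)
            s += a_swap[j][i];
        return s;
    }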
Data partitioning steps
- All scalars are mapped to the scratch-pad memory (SPM)
- All arrays larger than the SPM are mapped into off-chip memory and accessed through the cache
- For the remaining arrays, conflicting arrays go to different memories
- Experiments show a 30-33% improvement in memory latencies
Features affecting data partitioning
- Scalar variables and constants
- Array sizes
- Lifetimes of arrays
- Access frequency of arrays
- Access patterns and potential conflict misses
Interaction of optimizations and cache architectures
- Multiple-access caches
- Sequential predictive accesses
- Most recently used way cache (MRU)
- Column-associative cache (CA)
- Selective way caches
- Activate a different number of ways dynamically
Evaluation of different cache architectures
- MRU caches consume the least energy for all sizes of caches and for all the benchmarks
Minimizing code space
- Storage assignment
- Effective use of auto-increment/decrement addressing modes
- Code compression
- Pure software approach: common sequences are extracted and placed in a dictionary; instances of these sequences are replaced by mini-subroutine calls
- Hardware approach: new instructions are defined; flexible dictionary structures with architectural support
Storage assignment
- Many DSPs have limited addressing modes
- They use address registers to access memory
- There are no indexing modes, but there are auto-increment/decrement modes (see the sketch after this list)
- Compiler support is needed to minimize the number of explicit assignments to address registers
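A small C sketch of the point (not from the slides): walking arrays through pointers lets the compiler use auto-increment addressing, so the address registers are updated as a side effect of each access, with no explicit address-arithmetic instructions:

    /* Dot product written pointer-style: the *p++ accesses map directly
       onto post-increment addressing modes on such DSPs. */
    int dot(const int *a, const int *b, int n)
    {
        int s = 0;
        while (n--)
            s += *a++ * *b++;
        return s;
    }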
Instruction selection and ordering
- There are usually many possible code sequences that accomplish the same task
- Techniques:
- Instruction packing
- Instruction re-ordering (scheduling)
- Operand ordering/swapping
Low-power file system
- Predictive shutdown of disks
- Spin mode consumes 1 watt
- Sleep mode consumes 25 milliwatts
- Latency constraints
- Seek latency: 10 milliseconds
- Spinup latency: 2 seconds
- The OS must balance the need for low-latency file access against the need to reduce disk power
Requirements for a low-power file system
- Fine-grained disk spindown
- Whole-file prefetching cache
- 8-16 MB of low-power, low-read-latency memory
Factors influencing file system design
- The number of disk spinups affects reliability
- Friction-induced wear
- The number of disk spinups is a function of:
- Read/write inter-request times
- Disk spindown delay
Spindown prediction techniques
- Threshold_demand: the disk is spun down after a fixed period of inactivity and is spun up upon the next access (see the sketch below)
- Optimum offline algorithm ("Oracle")
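A sketch of the Threshold_demand policy in C (not from the slides; the threshold value, time unit, and state layout are illustrative placeholders):

    #include <stdbool.h>

    #define SPINDOWN_THRESHOLD_MS 2000   /* assumed fixed inactivity period */

    typedef struct {
        bool spinning;
        long last_access_ms;
    } disk_state;

    /* Called periodically: spin down once the disk has been idle past
       the threshold. */
    void on_tick(disk_state *d, long now_ms)
    {
        if (d->spinning && now_ms - d->last_access_ms >= SPINDOWN_THRESHOLD_MS)
            d->spinning = false;
    }

    /* Called on every file system request: spin up on demand (the caller
       then pays the roughly 2 s spinup latency). */
    void on_request(disk_state *d, long now_ms)
    {
        if (!d->spinning)
            d->spinning = true;
        d->last_access_ms = now_ms;
    }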