Title: 8 Software Design for Low Power
8 Software Design for Low Power
- Algorithm optimizations
- Minimizing memory access
- Optimal selection and sequencing of machine instructions
- Power management
8.2 Sources of Software Power Dissipation
- The memory system takes a substantial fraction of the power budget for portable computers, and it can be the dominant source of power dissipation in some memory-intensive DSP applications such as video processing.
- System buses are a high-capacitance component whose switching activity is largely determined by the software.
8.3 Software Power Estimation
- Two basic levels of abstraction:
- 1. Gate level: simulation and power estimation tools applied to a gate-level description of an instruction-processing system.
- 2. Instruction level: estimation based on the frequency of execution of each type of instruction or instruction sequence.
- Architectural power estimation determines which major components of a processor will be active during each execution cycle of a program.
8.3.4 Instruction-Level Power Analysis
- $E_p = \sum_i (B_i \cdot N_i) + \sum_{i,j} (O_{i,j} \cdot N_{i,j}) + \sum_k E_k$
- Here $E_p$ is the overall energy cost of a program, decomposed into base costs, circuit-state overhead, and stalls and cache misses.
- The first summation represents base costs, where $B_i$ is the base cost of an instruction of type $i$ and $N_i$ is the number of type-$i$ instructions in the execution profile of a program.
- The second summation represents circuit-state effects, where $O_{i,j}$ is the cost incurred when an instruction of type $i$ is followed by an instruction of type $j$.
- $N_{i,j}$ is the number of occurrences where instruction type $i$ is immediately followed by instruction type $j$.
- The last sum accounts for other effects, such as stalls and cache misses. Each $E_k$ represents the cost of one such effect found in the program execution profile.
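As a minimal sketch (not from the slides), the model could be evaluated in C as follows; the three instruction classes, the cost tables, and the trace format are all hypothetical placeholders:

    #include <stdio.h>

    #define NTYPES 3  /* hypothetical classes: 0 = ALU, 1 = LOAD, 2 = STORE */

    /* Assumed per-type base costs B_i (nJ) and circuit-state overheads O_ij. */
    static const double B[NTYPES] = { 1.0, 2.2, 2.0 };
    static const double O[NTYPES][NTYPES] = {
        { 0.0, 0.4, 0.3 },
        { 0.4, 0.0, 0.2 },
        { 0.3, 0.2, 0.0 },
    };

    /* E_p = sum_i B_i*N_i + sum_ij O_ij*N_ij + sum_k E_k */
    double program_energy(const int *trace, int len, double stall_energy)
    {
        double e = 0.0;
        for (int n = 0; n < len; n++) {
            e += B[trace[n]];                   /* base cost term */
            if (n > 0)
                e += O[trace[n - 1]][trace[n]]; /* circuit-state term */
        }
        return e + stall_energy;                /* stalls, cache misses */
    }

    int main(void)
    {
        int trace[] = { 1, 0, 0, 2, 1, 0 };     /* LOAD ALU ALU STORE LOAD ALU */
        printf("E_p = %.2f nJ\n", program_energy(trace, 6, 0.8));
        return 0;
    }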
8.4 Software Power Optimizations
- A prerequisite to optimizing a program for low power must always be to design an algorithm that maps well to the available hardware and is efficient for the problem at hand, in terms of both time and storage complexity.
- Reduction operations are an example of a common class of operations that lend themselves to a trade-off between resource usage and execution time.
Only one adder is available
The basic principle is to try to match the degree of parallelism in an algorithm to the number of parallel resources available.
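A minimal sketch of the idea in C (not from the slides): the same reduction written for a datapath with one adder, and restructured into two independent partial sums that can map onto two adders, halving the cycle count so that a lower clock rate and supply voltage could meet the same deadline:

    /* Serial reduction: one running sum, matching a single available adder. */
    int sum_serial(const int *a, int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two-way reduction: two independent partial sums a compiler can map
       onto two adders.  Assumes n is even, for brevity. */
    int sum_two_way(const int *a, int n)
    {
        int s0 = 0, s1 = 0;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }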
8.4.2 Minimizing Memory Access Costs
- Power minimization techniques related to memory concentrate on one or more of the following objectives:
- Minimize the number of memory accesses required by an algorithm.
- Minimize the total memory required by an algorithm.
- Put memory accesses as close as possible to the processor: choose registers first, cache next, and external RAM last.
- Make the most efficient use of the available memory bandwidth: use multiple-word parallel loads instead of single-word loads as much as possible (see the sketch after this list).
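A sketch of the last point in C (not from the slides; the even element count, 8-byte alignment, and little-endian word layout are assumptions): the same sum performed with one 32-bit load per element, and with one 64-bit load per pair of elements, halving the number of memory transactions:

    #include <stdint.h>
    #include <string.h>

    /* Single-word loads: one 32-bit memory access per element. */
    int64_t sum32(const int32_t *a, int n)
    {
        int64_t s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Double-word loads: each 64-bit access fetches two elements. */
    int64_t sum32_dual(const int32_t *a, int n)
    {
        int64_t s = 0;
        for (int i = 0; i < n; i += 2) {
            uint64_t w;
            memcpy(&w, &a[i], sizeof w);        /* one wide load */
            s += (int32_t)(uint32_t)w;          /* low word (little-endian) */
            s += (int32_t)(uint32_t)(w >> 32);  /* high word */
        }
        return s;
    }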
- Dual loads are beneficial for energy reduction but not necessarily for power reduction, because the instantaneous power dissipation of a dual load is somewhat higher than that of two single loads.
- Lower average power solutions were obtained at the expense of execution cycles and total energy.
To evaluate (x y) z
Memory bandwidth minimization
- Register allocation: minimize external memory references.
- Cache blocking: transform array computations so that blocks of array elements only have to be read into cache once (see the sketch after this list).
- Register blocking: eliminate redundant register loading of array elements.
- Recurrence detection and optimization: use registers for values that are carried over from one iteration to the next.
- Compact multiple memory references into a single reference.
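A minimal cache-blocking sketch in C (not from the slides; N, BLK, and the kernel are illustrative): the unblocked loop streams through b with stride-N accesses and refetches it from external memory once the array exceeds the cache, while the blocked version reads each BLK x BLK tile into cache once and uses it fully:

    #define N   512
    #define BLK 64   /* tile edge chosen so a BLK x BLK tile fits in cache */

    /* Unblocked: b is walked column-wise, so cache lines of b are evicted
       and refetched many times. */
    void add_transpose(double (*a)[N], double (*b)[N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += b[j][i];
    }

    /* Blocked: each BLK x BLK tile of b is read into cache once. */
    void add_transpose_blocked(double (*a)[N], double (*b)[N])
    {
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int j = jj; j < jj + BLK; j++)
                        a[i][j] += b[j][i];
    }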
- The label inside the circle for each vertex indicates a variable (var_v), a symbolic register name (reg_v), or a physical register name (X).
- The label just outside of each circle indicates the assignment of a physical register name to a symbolic register, or of a memory bank to a variable.
- A red edge indicates that two register values are active at the same time and should be assigned to different physical registers.
- A green edge indicates a parallel memory-to-register transfer that should be preserved.
- Black edges indicate limitations on operand usage imposed by the target processor.
- ADD R1 ← a, e    ADD R2 ← e, b
- ADD R3 ← a, b    ADD R4 ← b, d
- ADD R5 ← a, c    ADD R6 ← c, d
- All the variables (vertices) in one partition are assigned to the same memory bank.
- If an edge crosses the partition, then the memory accesses for those two variables can be compacted into a dual-load operation at a cost of one execution cycle.
8.5 Automated Low-Power Code Generation
- Cold scheduling algorithm: an instruction scheduling algorithm that reduces bus switching activity related to the change in state when execution moves from one instruction type to another.
- 1. Allocate registers.
- 2. Pre-assemble: calculate target addresses, index the symbol table, and transform instructions to binary.
- 3. Schedule instructions using the cold scheduling algorithm (see the sketch after this list).
- 4. Post-assemble: complete the assembly process.
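A simplified sketch of the cold-scheduling core in C (not from the slides): among the currently ready instructions, always issue the one whose binary encoding differs from the previously issued one in the fewest bit positions, since toggled bus lines are what dissipate power. A real scheduler would also track data dependences and refill the ready set:

    /* Hamming distance between two encodings: a proxy for the bus
       switching activity of issuing them back to back. */
    static int hamming(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;
        int d = 0;
        while (x) { d += x & 1u; x >>= 1; }
        return d;
    }

    /* Greedily order the ready instructions to minimize bit toggles. */
    void cold_schedule(const unsigned *ready, int n, unsigned prev,
                       int *order /* out: issue order of indices */)
    {
        int used[64] = { 0 };                   /* assumes n <= 64 */
        for (int slot = 0; slot < n; slot++) {
            int best = -1, best_d = 1 << 30;
            for (int i = 0; i < n; i++) {
                if (used[i]) continue;
                int d = hamming(prev, ready[i]);
                if (d < best_d) { best_d = d; best = i; }
            }
            used[best] = 1;
            order[slot] = best;
            prev = ready[best];
        }
    }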
Code generation and optimization methodology
- 1. Allocate registers and select instructions.
- 2. Build a data flow graph (DFG) for each basic block.
- 3. Optimize memory bank assignments by simulated annealing (see the sketch after this list).
- 4. Perform as-soon-as-possible (ASAP) packing of instructions.
- 5. Perform list scheduling of instructions (similar to cold scheduling).
- 6. Swap instruction operands where beneficial.
List scheduling
- The idea is to make an ordered list of processes by assigning them priorities, and then repeatedly execute the following two steps until a valid schedule is obtained:
- Select from the list the process with the highest priority for scheduling.
- Select a resource to accommodate this process.
- The priorities are determined statically before the scheduling process begins: the first step chooses the process with the highest priority, and the second step selects the best possible resource.
8.6 Codesign for Low Power
- Instruction set design and implementation seems to be one of the most well-defined of the codesign problems.
- Using a processor for which some portion of the interconnect and logic is reconfigurable.
Memory system considerations for low-power software
- Total memory size: minimizing total memory requirements through code compaction and algorithm transformations allows for a smaller memory system with lower capacitances and improved performance. However, this benefit might have to be traded off against a faster algorithm that requires more memory.
- Partitioning into banks: how many, and how big? Banks are needed if parallel loads are to be used; the size of the application, its data structures, and its access patterns should determine this.
- Wide data bus: needed for parallel loads.
- Proximity to CPU: memory close to the CPU reduces capacitances and makes memory use less expensive.
- Cache size: should be sized to minimize cache misses for the intended application. Software should apply techniques such as cache blocking to maintain a high degree of spatial and temporal locality.
- Cache protocol: if an application's memory access patterns are well defined, the protocol can be optimized.
- Cache locking: if significant portions of an application can run entirely from cache, locking can be used to prevent any external memory accesses while those portions of code are executing.
Architectural considerations for low-power software
- Class of processor (DSP/RISC/CISC): application dependent. How well does the instruction set fit the application? Is there hardware support for operations common to the application? Does the hardware implement functions not needed by the application?
- Parallel processing (VLIW, superscalar, SIMD, MIMD): how much parallelism? Does the level and type of parallelism in an algorithm fit one of these architectures? If so, the parallel architecture can greatly improve performance and allow reduced voltages and clock rates to be used.
- Bus architecture: separate buses (e.g., address, data, instruction, I/O) can make it easier to optimize instruction sequences, addressing patterns, and data correlations to minimize bus switching.
- Register file size: an increased register count eases register allocation and reduces memory accesses, but too large a register file can make register accesses as expensive as cache accesses.
Power management considerations for low-power software
- Software vs. hardware control: software control allows power management to be tailored to the application, at the cost of added execution cycles.
- Clock/voltage removal: at what level of granularity? Some kind of shutdown mode is needed in order to realize a benefit from minimized execution times.
- Clock/voltage reduction: if there is schedule slack associated with a software task, reduce power by slowing down the clock and reducing the supply voltage.
- Guarded evaluation: latch the inputs of functional units to prevent meaningless calculations when the outputs of those units are not being used. This adds to the power reductions obtained from strength reduction and operand swapping.
Sources of Software Energy Consumption
- Datapaths in integer ALU and FP units
- Cache and memory systems
- System buses (address, instruction, and data)
- Control circuitry, and clock logic and distribution
Levels of Optimization
- Algorithm and Application Design
- Compiler Optimizations
- Operating System Control
Reducing memory energy
- Improve the locality of memory accesses
- Minimize the number of memory accesses
- Optimize the interaction of the compiler and the cache architecture
- Reduce the total memory area (in embedded systems)
- Make effective use of memory bandwidth
Improving locality: loop transformations
- Linear loop transformations
- Loop permutation (see the sketch after this list)
- Loop skewing
- Loop reversal
- Loop scaling
- Multi-loop transformations
- Loop fusion/fission
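A minimal loop-permutation (interchange) sketch in C (not from the slides; N and the kernel are illustrative): both functions scale a row-major array, but the first walks it column by column with stride-N accesses, while the permuted version walks it row by row so every fetched cache line is used completely:

    #define N 1024

    /* Column-major traversal of a row-major array: poor spatial locality. */
    void scale_bad(double (*a)[N], double s)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] *= s;
    }

    /* After loop permutation: unit-stride accesses, same computation. */
    void scale_good(double (*a)[N], double s)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= s;
    }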
Improving locality: data transformations
- Linear layout transformations
- Dimension re-indexing (see the sketch after this list)
- Diagonal (skewed) memory layouts
- Blocked memory layouts
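A dimension re-indexing sketch in C (not from the slides; the array names and sizes are illustrative): when the traversal order is fixed by the algorithm and loops cannot be interchanged, swapping the dimensions of the array's layout makes the fixed traversal unit-stride instead:

    #define R 256
    #define C 256

    double a_orig[R][C];   /* original layout, indexed a_orig[i][j] */
    double a_swap[C][R];   /* re-indexed layout, indexed a_swap[j][i] */

    /* Column sum over the re-indexed layout: the column now occupies
       contiguous memory, so the walk is unit-stride. */
    double column_sum(int j)
    {
        double s = 0.0;
        for (int i = 0; i < R; i++)
            s += a_swap[j][i];
        return s;
    }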
Data partitioning steps
- All scalars are mapped to the scratch-pad memory (SPM)
- All arrays larger than the SPM are mapped into off-chip memory and accessed through the cache
- For the remaining arrays, conflicting arrays go to different memories
- Experiments show a 30-33% improvement in memory latencies
Features affecting data partitioning
- Scalar variables and constants
- Array sizes
- Lifetimes of arrays
- Access frequency of arrays
- Access patterns and potential conflict misses
Interaction of optimizations and cache architectures
- Multiple-access caches
- Sequential predictive accesses
- Most recently used way cache (MRU)
- Column-associative cache (CA)
- Selective way caches
- Activate a different number of ways dynamically
Evaluation of different cache architectures
- MRU caches consume the least energy for all sizes of caches and for all the benchmarks
Minimizing code space
- Storage assignment
- Effective use of auto-increment/decrement addressing modes
- Code compression
- Pure software approach: common sequences are extracted and placed in a dictionary; instances of these sequences are replaced by mini-subroutine calls
- Hardware approach: new instructions are defined; flexible dictionary structures with architectural support
Storage assignment
- Many DSPs have limited addressing modes
- They use address registers to access memory
- There are no indexing modes, but there are auto-increment/decrement modes (see the sketch after this list)
- Compiler support is needed to minimize the number of explicit assignments to address registers
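A small C sketch of the point (not from the slides): walking arrays through pointers lets the compiler use auto-increment addressing, so the address registers are updated as a side effect of each access, with no explicit address-arithmetic instructions:

    /* Dot product written pointer-style: the *p++ accesses map directly
       onto post-increment addressing modes on such DSPs. */
    int dot(const int *a, const int *b, int n)
    {
        int s = 0;
        while (n--)
            s += *a++ * *b++;
        return s;
    }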
Instruction selection and ordering
- There are usually many possible code sequences that accomplish the same task
- Techniques:
- Instruction packing
- Instruction re-ordering (scheduling)
- Operand ordering/swapping
Low-power file system
- Predictive shutdown of disks
- Spin mode consumes 1 watt
- Sleep mode consumes 25 milliwatts
- Latency constraints
- Seek latency: 10 milliseconds
- Spinup latency: 2 seconds
- The OS must balance the need for low-latency file access against the need to reduce disk power
Requirements for a low-power file system
- Fine-grained disk spindown
- Whole-file prefetching cache
- 8-16 MB of low-power, low-read-latency memory
Factors influencing file system design
- The number of disk spinups affects reliability
- Friction-induced wear
- The number of disk spinups is a function of:
- Read/write inter-request times
- Disk spindown delay
Spindown prediction techniques
- Threshold_demand: the disk is spun down after a fixed period of inactivity and is spun up upon the next access (see the sketch below)
- Optimum offline algorithm ("Oracle")
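A sketch of the Threshold_demand policy in C (not from the slides; the threshold value, time unit, and state layout are illustrative placeholders):

    #include <stdbool.h>

    #define SPINDOWN_THRESHOLD_MS 2000   /* assumed fixed inactivity period */

    typedef struct {
        bool spinning;
        long last_access_ms;
    } disk_state;

    /* Called periodically: spin down once the disk has been idle past
       the threshold. */
    void on_tick(disk_state *d, long now_ms)
    {
        if (d->spinning && now_ms - d->last_access_ms >= SPINDOWN_THRESHOLD_MS)
            d->spinning = false;
    }

    /* Called on every file system request: spin up on demand (the caller
       then pays the roughly 2 s spinup latency). */
    void on_request(disk_state *d, long now_ms)
    {
        if (!d->spinning)
            d->spinning = true;
        d->last_access_ms = now_ms;
    }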