8 Software Design for Low Power - PowerPoint PPT Presentation
Slides: 59
Provided by: NTU
1
8 Software Design for Low Power
  • Algorithm optimizations
  • Minimizing memory access
  • Optimal selection and sequencing of machine
    instructions
  • Power management

2
8.2 Sources of Software Power Dissipation
  • The memory system takes a substantial fraction of
    the power budget for portable computers, and it
    can be the dominant source of power dissipation
    in some memory-intensive DSP applications such as
    video processing.
  • System buses are a high-capacitance component for
    which switching activity is largely determined by
    the software.

3
8.3 Software Power Estimation
  • Two basic levels of abstraction:
  • 1. Gate level: simulation and power estimation
    tools applied to a gate-level description of an
    instruction processing system.
  • 2. Instruction level: based on the frequency of
    execution of each type of instruction or
    instruction sequence.

4
  • Architectural power estimation determines which
    major components of a processor will be active
    during each execution cycle of a program.

5
8.3.4 Instruction Level Power Analysis
  • Ep = Σi (Bi × Ni) + Σi,j (Oi,j × Ni,j) + Σk Ek
  • Where Ep is the overall energy cost of a program,
    decomposed into base costs, circuit state
    overhead, and stalls and cache misses.
  • The first summation represents base costs, where
    Bi is the base cost of an instruction of type i
    and Ni is the number of type i instructions in
    the execution profile of a program.

6
  • The second summation represents circuit state
    effects where O i,j is the cost incurred when an
    instruction of type i is followed by an
    instruction of type j.
  • N i,j is the number of occurrences where
    instruction type i is immediately followed by
    instruction type j.
  • The last sum accounts for other effects, such as
    stalls and cache misses. Each Ek represents the
    cost of one such effect found in the program
    execution profile.
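
The three terms of the model above can be evaluated directly over an instruction trace. The sketch below is illustrative; the instruction types and all cost numbers are invented for the example.

```python
# Sketch of the instruction-level energy model
# Ep = sum_i Bi*Ni + sum_ij Oij*Nij + sum_k Ek
# (instruction types and costs are invented for illustration).
def program_energy(trace, base_cost, pair_overhead, other_effects):
    """trace: instruction types in execution order.
    base_cost: type -> Bi; pair_overhead: (i, j) -> Oij;
    other_effects: list of Ek costs (stalls, cache misses)."""
    e = sum(base_cost[t] for t in trace)              # sum_i Bi * Ni
    e += sum(pair_overhead.get(p, 0)
             for p in zip(trace, trace[1:]))          # sum_ij Oij * Nij
    e += sum(other_effects)                           # sum_k Ek
    return e

trace = ["add", "mul", "add", "load"]
B = {"add": 1, "mul": 3, "load": 2}     # base cost per instruction type
O = {("add", "mul"): 1, ("mul", "add"): 1, ("add", "load"): 1}
print(program_energy(trace, B, O, [2]))  # base 7 + overhead 3 + stalls 2 = 12
```

Note that the circuit-state term walks adjacent pairs of the trace, so Ni,j is counted implicitly by iterating over every consecutive pair.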

7
(No Transcript)
8
8.4 Software Power Optimizations
  • A prerequisite to optimizing a program for low
    power must always be to design an algorithm that
    maps well to available hardware and is efficient
    for the problem at hand in terms of both time
    and storage complexity.

9
  • Reduction operations are an example of a common
    class of operations that lend themselves to a
    trade-off between resource usage and execution
    time.

10
Only one adder is available
11
(No Transcript)
12
The basic principle is to try to match the degree
of parallelism in an algorithm to the number of
parallel resources available.
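
A toy cycle-count model makes this concrete: adding adders beyond the parallelism of the reduction buys nothing. Unit-latency adds are an assumption of the sketch.

```python
def reduction_cycles(n, adders):
    """Cycles to reduce (sum) n values when at most `adders` additions
    can issue per cycle; unit-latency adds are an assumption."""
    remaining, cycles = n, 0
    while remaining > 1:
        remaining -= min(adders, remaining // 2)   # pairs combined this cycle
        cycles += 1
    return cycles

# One adder forces a serial n-1 cycle sum; four adders allow the full
# log2(n) tree for n = 8, and more than n/2 adders buy nothing.
print(reduction_cycles(8, 1), reduction_cycles(8, 4), reduction_cycles(8, 8))
# prints: 7 3 3
```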
13
8.4.2 Minimizing Memory Access Costs
  • Power minimization techniques related to memory
    concentrate on one or more of the following
    objectives:
  • Minimize the number of memory accesses required
    by an algorithm.
  • Minimize the total memory required by an
    algorithm.
  • Put memory accesses as close as possible to the
    processor: choose registers first, cache next,
    and external RAM last.
  • Make the most efficient use of the available
    memory bandwidth: use multiple-word parallel
    loads instead of single-word loads as much as
    possible.

14
  • Dual loads are beneficial for energy reduction
    but not necessarily for power reduction because
    the instantaneous power dissipation of a dual
    load is somewhat higher than for two single
    loads.
  • Lower average power solutions were obtained at
    the expense of execution cycles and total energy.

15
To evaluate (x y) z
16
(x y) z
17
Memory bandwidth minimization
  • Register allocation: minimize external memory
    references.
  • Cache blocking: transform array computations so
    that blocks of array elements only have to be
    read into cache once.
  • Register blocking: redundant register loading of
    array elements is eliminated.
  • Recurrence detection and optimization: use
    registers for values that are carried over from
    one iteration of a recurrence to the next.
  • Compact multiple memory references into a single
    reference.
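
Cache blocking can be sketched as a change of traversal order: the computation visits the array tile by tile so each tile is fully processed while its lines are still cache-resident. The sketch below only reorders a 2D traversal; the miss-count payoff shows up in computations that revisit tiles, such as matrix multiply.

```python
def blocked_sum(a, block):
    """Cache-blocking sketch: walk a 2D array tile by tile so that each
    block x block tile is fully processed while it is cache-resident,
    instead of streaming whole rows repeatedly. The result is the same;
    only the access order (and hence the miss count) changes."""
    n = len(a)
    total = 0
    for ii in range(0, n, block):          # tile row
        for jj in range(0, n, block):      # tile column
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    total += a[i][j]
    return total

a = [[i * 4 + j for j in range(4)] for i in range(4)]
print(blocked_sum(a, 2))   # 120, the sum of the elements 0..15
```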

18
  • The label inside the circle for each vertex
    indicates a variable (var_v), a symbolic register
    name (reg_v), or a physical register name (X).
  • The label just outside of each circle indicates
    the assignment of a physical register name to a
    symbolic register or a memory bank to a variable.
  • A red edge indicates that two register values are
    active at the same time and should be assigned to
    different physical registers.
  • A green edge indicates a parallel memory to
    register transfer that should be preserved.
  • Black edges indicate limitations on operand usage
    imposed by the target processor.

19
(No Transcript)
20
  • ADD R1 ← a, e ADD R2 ← e, b
  • ADD R3 ← a, b ADD R4 ← b, d
  • ADD R5 ← a, c ADD R6 ← c, d

21
  • All the variables (vertices) in one partition are
    assigned to the same memory bank.
  • If an edge crosses the partition, then the memory
    accesses for those two variables can be compacted
    into a dual-load operation at a cost of one
    execution cycle.
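
A minimal cost model for this compaction, assuming two memory banks and the one-cycle dual-load cost stated above (variable names and pairs are illustrative):

```python
def load_cycles(pairs, bank):
    """Toy model of dual-load compaction: `pairs` lists variable pairs
    needed together; `bank` maps each variable to memory bank 0 or 1.
    A pair split across banks compacts into one dual-load cycle; a
    same-bank pair costs two single-load cycles."""
    return sum(1 if bank[a] != bank[b] else 2 for a, b in pairs)

pairs = [("a", "e"), ("e", "b"), ("a", "b")]
split = {"a": 0, "b": 0, "e": 1}      # e alone in bank 1
print(load_cycles(pairs, split))      # 1 + 1 + 2 = 4 cycles
```

A bank-assignment optimizer would search over `bank` mappings to maximize the number of cross-bank edges.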

22
(No Transcript)
23
8.5 Automated Low-Power Code Generation
  • Cold scheduling algorithm: an instruction
    scheduling algorithm that reduces bus switching
    activity related to the change in state when
    execution moves from one instruction type to
    another.
  • 1. Allocate registers.
  • 2. Pre-assemble: calculate target addresses,
    index the symbol table, and transform
    instructions to binary.
  • 3. Schedule instructions using the cold
    scheduling algorithm.
  • 4. Post-assemble: complete the assembly process.
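
One simplified reading of step 3 is a greedy scheduler that always issues the ready instruction whose binary encoding differs least from the last one issued. The sketch below ignores data dependences, and the encodings are invented; real cold scheduling folds the switching cost into a dependence-aware list scheduler.

```python
def hamming(a, b):
    """Number of differing bits between two instruction encodings."""
    return bin(a ^ b).count("1")

def cold_schedule(ready, encoding):
    """Greedy cold-scheduling sketch over independent instructions:
    repeatedly issue the ready instruction whose encoding is closest in
    Hamming distance to the previously issued one, reducing
    instruction-bus switching activity. Simplification: data
    dependences are ignored and encodings are invented."""
    order, last = [], 0
    pool = list(ready)
    while pool:
        nxt = min(pool, key=lambda i: hamming(encoding[i], last))
        pool.remove(nxt)
        order.append(nxt)
        last = encoding[nxt]
    return order

enc = {"i1": 0b0000, "i2": 0b1111, "i3": 0b0001}
print(cold_schedule(["i1", "i2", "i3"], enc))   # ['i1', 'i3', 'i2']
```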

24
Code generation and optimization methodology
  • 1. Allocate registers and select instructions.
  • 2. Build a data flow graph (DFG) for each basic
    block.
  • 3. Optimize memory bank assignments by simulated
    annealing.
  • 4. Perform as soon as possible (ASAP) packing of
    instructions.
  • 5. Perform list scheduling of instructions.
    (similar to cold scheduling)
  • 6. Swap instruction operands where beneficial.

25
List scheduling
  • makes an ordered list of processes by assigning
    them priorities, and then repeatedly executes the
    following two steps until a valid schedule is
    obtained:
  • Select from the list the process with the
    highest priority for scheduling.
  • Select a resource to accommodate this process.
  • The priorities are determined statically before
    the scheduling process begins. The first step
    chooses the process with the highest priority;
    the second step selects the best possible
    resource.
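
The two steps above can be sketched for unit-latency tasks on identical resources; the task graph and priorities below are invented, and the graph is assumed acyclic.

```python
def list_schedule(tasks, priority, num_units):
    """List scheduling sketch: unit-latency tasks, static priorities,
    num_units identical resources per cycle. tasks maps each task name
    to the set of its predecessors (assumed acyclic)."""
    done, schedule = set(), []
    while len(done) < len(tasks):
        # step 1: collect ready tasks, highest static priority first
        ready = sorted((t for t in tasks
                        if t not in done and tasks[t] <= done),
                       key=lambda t: -priority[t])
        # step 2: assign the top-priority tasks to available resources
        issued = ready[:num_units]
        schedule.append(issued)
        done |= set(issued)
    return schedule

tasks = {"a": set(), "b": set(), "c": {"a", "b"}}
prio = {"a": 1, "b": 2, "c": 3}
print(list_schedule(tasks, prio, 1))   # [['b'], ['a'], ['c']]
```

With two units, `a` and `b` issue together and the schedule shrinks to two cycles.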

26
8.6 Codesign for Low Power
  • Instruction set design and implementation seems
    to be one of the most well defined of the
    codesign problems.
  • Using a processor for which some portion of the
    interconnect and logic is reconfigurable.

27
Memory system considerations for low-power
software
  • Total memory size: minimizing total memory
    requirements through code compaction and
    algorithm transformations allows for a smaller
    memory system with lower capacitances and
    improved performance. However, this benefit might
    have to be traded off against a faster algorithm
    that requires more memory.

28
  • Partitioning into banks (how many, how big):
    needed if parallel loads are to be used. The
    size of the application, data structures, and
    access patterns should determine this.
  • Wide data bus: needed for parallel loads.
  • Proximity to CPU: memory close to the CPU
    reduces capacitances and makes memory use less
    expensive.

29
  • Cache size: should be sized to minimize cache
    misses for the intended application. Software
    should apply techniques such as cache blocking
    to maintain a high degree of spatial and
    temporal locality.
  • Cache protocol: if an application's memory
    access patterns are well defined, the protocol
    can be optimized.
  • Cache locking: if significant portions of an
    application can run entirely from cache, locking
    can be used to prevent any external memory
    accesses while those portions of code are
    executing.

30
Architectural considerations for low power
software
  • Class of processor (DSP/RISC/CISC): application
    dependent. How well does the instruction set fit
    the application? Is there hardware support for
    operations common to the application? Does the
    hardware implement functions not needed by the
    application?
  • Parallel processing (VLIW, superscalar, SIMD,
    MIMD): how much parallelism? Does the level and
    type of parallelism in an algorithm fit one of
    these architectures? If so, the parallel
    architecture can greatly improve performance and
    allow reduced voltages and clock rates to be
    used.

31
  • Bus architecture: separate buses (e.g., address,
    data, instruction, I/O) can make it easier to
    optimize instruction sequences, addressing
    patterns, and data correlations to minimize bus
    switching.
  • Register file size: an increased register count
    eases register allocation and reduces memory
    accesses, but too large a register file can make
    register accesses as expensive as cache accesses.

32
Power management considerations for low power
software
  • Software vs. hardware control: software control
    allows power management to be tailored to the
    application at the cost of added execution
    cycles.
  • Clock/voltage removal: at what level of
    granularity? Some kind of shutdown mode is
    needed in order to realize a benefit from
    minimized execution times.

33
  • Clock/voltage reduction: if there is schedule
    slack associated with a software task, reduce
    power by slowing down the clock and reducing the
    supply voltage.
  • Guarded evaluation: latch the inputs of
    functional units to prevent meaningless
    calculations when the outputs of those units are
    not being used. This adds to the power reduction
    obtained from strength reduction and operand
    swapping.
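
The clock/voltage reduction bullet follows from dynamic energy scaling roughly as V², with frequency roughly proportional to voltage. The sketch below uses that simple proportional model (real parts have a minimum operating voltage and a nonlinear f-V curve); the part parameters are invented.

```python
def scaled_operating_point(work_cycles, deadline_s, f_max, v_max):
    """Toy voltage/frequency scaling model: slow the clock just enough
    to fill the schedule slack, scale the supply voltage in proportion
    (f ~ V is an assumption), and report the dynamic-energy ratio,
    since energy per cycle ~ C * V^2."""
    scale = min(1.0, (work_cycles / deadline_s) / f_max)
    f = scale * f_max
    v = scale * v_max
    energy_ratio = scale ** 2          # E per cycle ~ V^2
    return f, v, energy_ratio

# A task needing 5e8 cycles with a 1 s deadline on a 1 GHz, 1.8 V part
# can run at half speed and half voltage: 4x less energy per cycle.
f, v, r = scaled_operating_point(5e8, 1.0, 1e9, 1.8)
```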

34
Sources of Software Energy Consumption
  • Datapaths in integer ALU and FP units
  • Cache and memory systems
  • System buses (address, instruction, and data)
  • Control circuitry and clock logic and distribution

35
Level of Optimization
  • Algorithm and Application Design
  • Compiler Optimizations
  • Operating System Control

36
Reducing memory energy
  • Improve the locality of memory accesses
  • Minimize the number of memory accesses
  • Optimizing interactions of compiler and cache
    architecture
  • Reducing the total memory area (in embedded
    systems)
  • Make effective use of memory bandwidth

37
Improving locality: loop transformations
  • Linear loop transformations
  • Loop permutation
  • Loop skewing
  • Loop reversal
  • Loop scaling
  • Multi-loop transformations
  • Loop fusion/fission
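
Loop fusion, the last item above, can be shown in a few lines: two separate passes over an array become one, so each element is loaded once instead of twice. The functions and data are illustrative.

```python
# Unfused: two loops, two trips through memory for each element.
def unfused(a):
    b = [x * 2 for x in a]        # first pass writes an intermediate
    c = [x + 1 for x in b]        # second pass re-reads it
    return c

# Fused: one loop; each element is loaded once and both ops applied,
# and the intermediate array disappears entirely.
def fused(a):
    return [x * 2 + 1 for x in a]

print(fused([1, 2, 3]))   # [3, 5, 7], same result as unfused([1, 2, 3])
```

Fission (the reverse) can also help, e.g. when splitting a loop lets each half fit its working set in cache.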

38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
Improving locality: data transformations
  • Linear layout transformations
  • Dimension re-indexing
  • Diagonal (skewed) memory layouts
  • Blocked memory layouts

42
(No Transcript)
43
(No Transcript)
44
Data partitioning steps
  • All scalars are mapped to the SPM
  • All arrays larger than the SPM are mapped into
    off-chip memory and accessed through the cache
  • For the remaining arrays, conflicting arrays go
    to different memories
  • Experiments show a 30-33% improvement in memory
    latencies

45
Features affecting data partitioning
  • Scalar variables and constants
  • Array sizes
  • Life times of arrays
  • Access frequency of arrays
  • Access pattern and potential conflict misses

46
Interaction of optimizations and cache
architectures
  • Multiple-access caches
  • Sequential predictive accesses
  • Most-recently-used way cache (MRU)
  • Column-associative cache (CA)
  • Selective-way caches
  • Activate a different number of ways dynamically

47
(No Transcript)
48
Evaluation of different cache architectures
  • MRU caches consume the least energy for all sizes
    of caches and for all the benchmarks

49
Minimizing code space
  • Storage assignment
  • Effective use of auto-increment/decrement
    addressing modes
  • Code compression
  • Pure software approach
  • Common sequences are extracted and placed in a
    dictionary
  • Instances of these sequences are replaced by
    minisubroutine calls
  • Hardware approach
  • New instructions are defined
  • Flexible dictionary structures with architectural
    support
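
The pure-software approach above can be sketched as extracting the most common instruction pair into a dictionary and replacing each occurrence with a call. The `CALL d0` opcode, the one-entry dictionary, and the instruction stream are all illustrative.

```python
from collections import Counter

def compress(code):
    """Pure-software code compression sketch: find the most common
    adjacent instruction pair, put it in a dictionary, and replace each
    non-overlapping occurrence with a minisubroutine call. The
    'CALL d0' opcode and one-entry dictionary are illustrative."""
    pairs = Counter(tuple(code[i:i + 2]) for i in range(len(code) - 1))
    best, _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(code):
        if tuple(code[i:i + 2]) == best:
            out.append("CALL d0")      # reference into the dictionary
            i += 2
        else:
            out.append(code[i])
            i += 1
    return out, {"d0": list(best)}

code = ["ld", "add", "st", "ld", "add", "mul"]
packed, dictionary = compress(code)
print(packed)        # ['CALL d0', 'st', 'CALL d0', 'mul']
print(dictionary)    # {'d0': ['ld', 'add']}
```

A real scheme repeats this for many sequences of varying length and stops when the dictionary overhead outweighs the savings.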

50
Storage assignment
  • Many DSPs have limited addressing modes
  • They use address registers to access memory
  • No indexing modes, only auto-increment/decrement
    modes
  • Compiler support is needed to minimize the
    number of explicit assignments to address
    registers
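
The compiler's objective can be captured by a small cost model: with only auto-increment/decrement, a consecutive pair of accesses is free exactly when their memory offsets differ by at most one word. The single-address-register model and the variable layouts below are simplifying assumptions.

```python
def explicit_ar_updates(access_seq, layout):
    """Cost model for storage assignment on a DSP with only
    auto-increment/decrement addressing: consecutive accesses whose
    memory offsets differ by at most one word are free; any larger
    jump needs an explicit address-register update (simplified
    single-AR offset-assignment model)."""
    return sum(1 for prev, cur in zip(access_seq, access_seq[1:])
               if abs(layout[cur] - layout[prev]) > 1)

seq = ["a", "b", "c"]
good = {"a": 0, "b": 1, "c": 2}   # layout matches the access order
bad = {"a": 0, "b": 2, "c": 1}    # a -> b jumps two words
print(explicit_ar_updates(seq, good), explicit_ar_updates(seq, bad))  # 0 1
```

Choosing `layout` to minimize this count is the classic offset-assignment problem.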

51
Instruction selection and ordering
  • There are usually many possible code sequences
    that accomplish the same task
  • Techniques
  • Instruction packing
  • Instruction re-ordering (scheduling)
  • Operand ordering/swapping

52
Low power file system
  • Predictive shutdown of disks
  • Spin mode consumes 1 watt
  • Sleep mode consumes 25 milliwatts
  • Latency constraints:
  • Seek latency: 10 milliseconds
  • Spinup latency: 2 seconds
  • The OS must balance the need for low-latency
    file access against the need to reduce disk
    power

53
Requirements for a low power file system
  • Fine-grained disk spindown
  • Whole-file prefetching cache
  • 8-16MB of low power, low read latency memory

54
Factors influencing file system design
  • The number of disk spinups affects reliability
  • Friction-induced wear
  • The number of spinups is a function of:
  • Read/write inter-request time
  • Disk spindown delay

55
Spindown prediction techniques
  • Threshold-demand: the disk is spun down after a
    fixed period of inactivity and is spun up upon
    the next access
  • Optimum offline algorithm (oracle)
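
The threshold-demand policy can be simulated over the idle gaps between requests, using the spin (1 W) and sleep (25 mW) powers from the earlier slide; the spin-up energy cost is an assumed figure, not from the source.

```python
def disk_energy(gaps_s, threshold_s, p_spin=1.0, p_sleep=0.025,
                e_spinup=3.0):
    """Energy (J) of the threshold-demand policy over a sequence of
    idle gaps (s) between requests. Spin power (1 W) and sleep power
    (25 mW) are from the slides; the spin-up energy is an assumed
    figure."""
    energy = 0.0
    for gap in gaps_s:
        if gap > threshold_s:
            energy += threshold_s * p_spin            # spinning until timeout
            energy += (gap - threshold_s) * p_sleep   # asleep for the rest
            energy += e_spinup                        # cost of spinning back up
        else:
            energy += gap * p_spin                    # too short: keeps spinning
    return energy

# A 5 s gap stays spinning; a 60 s gap spins down after a 10 s threshold.
print(disk_energy([5, 60], threshold_s=10))   # 5 + (10 + 1.25 + 3) = 19.25 J
```

Sweeping the threshold over a request trace and comparing against the offline oracle is how such policies are typically evaluated.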

56
(No Transcript)
57
(No Transcript)
58
(No Transcript)