Lecture 14: Software Design for Low-Power

1
Lecture 14: Software Design for Low-Power
  • Software dictates much of hardware activity
  • Need software power estimation method
  • Must optimize software at several levels of
    abstraction
  • Ultimately involves hardware/software trade-offs
  • Summary
  • Michael L. Bushnell
  • CAIP Center and WINLAB
  • ECE Dept., Rutgers U., Piscataway, NJ

2
Sources of Software Power Dissipation
  • Memory system uses 1/10 to 1/4 of the power in
    portable computers
  • Dominates in video processing
  • System bus switching activity is controlled by
    software
  • ALU and FPU data paths need good scheduling to
    avoid pipeline stalls
  • Control logic and clock: reduce power by using the
    shortest possible program to do the computation
  • Software control of hardware
  • Reduces power of idle components
  • Controls power-saving modes

3
Memory Accesses Are Expensive
  • Highly capacitive data / address / row / column
    decode / word data lines with high fanout
  • Mapping of data structures to memory banks
    determines possibility of parallel word loads
    (which are more energy efficient)
  • Memory access patterns greatly affect cache
    performance
  • Cache is more power-efficient than main memory:
    it is closer to the CPU and smaller

4
Methods of Software Power Estimation for Code
  • Lower level: run gate-level simulation power
    estimation on a gate-level description of the
    hardware running the code (most accurate)
  • Higher level: look at frequency of each type of
    instruction / sequence (execution profile)
  • Use lookup table
  • Model ALUs, register files, etc.
  • Use bus switching activity exclusively
  • Must know bus architecture, OPCODEs,
    representative input data to program, mapping of
    code/data to address space
  • Characterize instruction power empirically on
    real hardware

5
Instruction Level Power Analysis
  • For each instruction, need to measure:
  • Base power cost (independent of prior state)
  • Prior state cost: needs large-scale analysis /
    simulation
  • Pipeline stalls, buffer stalls, cache misses
  • Circuit state effects due to localized
    processor state change from execution of an
    instruction pair
  • Example pair: ADD A B, C followed by MULT C A, B
  • Includes change of instruction bus state
  • Switching of control lines
  • ALU mode changes
  • Routing costs to / from register file

6
Instruction Analysis Example
  • E_P: overall energy cost of program
  • B_i: base cost of instruction type i
  • O_{i,j}: cost of instruction type i followed by
    type j
  • O_{i,j} O_{j,i}
  • E_k: costs of pipeline stalls and cache misses
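
The terms listed above fit the instruction-level energy model that is usually written as E_P = sum_i (B_i * N_i) + sum_{i,j} (O_{i,j} * N_{i,j}) + sum_k E_k, where N_i and N_{i,j} count executed instructions and adjacent instruction pairs. The counts are not defined on this slide, so treat them as an assumption. A minimal Python sketch under that assumption follows; the instruction names and cost values are hypothetical placeholders, not measured data.

```python
from collections import Counter

def program_energy(trace, base_cost, pair_overhead, other_costs=()):
    """Estimate E_P for an executed instruction trace.

    trace         -- instruction types in execution order
    base_cost     -- B_i: base cost per instruction type
    pair_overhead -- O_{i,j}: circuit-state cost of type i followed by type j
    other_costs   -- E_k: stall and cache-miss costs, already in energy units
    """
    counts = Counter(trace)                    # N_i
    pairs = Counter(zip(trace, trace[1:]))     # N_{i,j}
    base = sum(base_cost[i] * n for i, n in counts.items())
    overhead = sum(pair_overhead.get(p, 0.0) * n for p, n in pairs.items())
    return base + overhead + sum(other_costs)

# Hypothetical costs in arbitrary energy units (assumed, not measured).
B = {"LOAD": 3.0, "ADD": 1.0, "MULT": 2.0}
O = {("LOAD", "ADD"): 0.5, ("ADD", "MULT"): 0.25}
trace = ["LOAD", "ADD", "MULT", "ADD"]
print(program_energy(trace, B, O, other_costs=[1.5]))
```

Pairs missing from the overhead table are treated as zero cost here, which is a simplification; a real characterization would measure every pair that occurs.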

7
Detailed Example
  • Simple DSP with 4 registers A, B, C, D
  • Evaluate (x y) z

8
Example Concluded

9
Software Power Optimizations
  • Select least expensive instructions / instruction
    sequences
  • Minimize frequency or cost of memory access
  • Use hardware power minimization features
  • Algorithm choices
  • Must map well to available hardware
  • Be efficient for problem being solved
  • Must maximize performance: parallel processing
  • Constraints
  • Battery-powered system: minimize total energy
    dissipated
  • When heat dissipation or reliability is important:
    minimize instantaneous / average power
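
The two constraints can pull in different directions, because energy is average power multiplied by run time. The sketch below compares two hypothetical program variants; the names and numbers are invented for illustration only.

```python
def energy_joules(avg_power_w, runtime_s):
    """Total energy = average power x execution time."""
    return avg_power_w * runtime_s

# Hypothetical variants: (average power in W, run time in s).
variants = {"fast": (2.0, 1.0), "slow": (1.5, 2.0)}

# Battery-powered system: pick the variant with the lowest total energy.
best_for_battery = min(variants, key=lambda v: energy_joules(*variants[v]))
# Heat- or reliability-limited system: pick the lowest average power.
best_for_heat = min(variants, key=lambda v: variants[v][0])

print(best_for_battery, best_for_heat)
```

With these assumed numbers the faster variant draws more power but finishes sooner, so it uses less total energy while the slower one runs cooler.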

10
Algorithm Transformations
  • Reduction operations use parallel hardware, but
    run slowly
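
As a rough illustration of how the number of available adders changes a reduction such as a sum, the sketch below counts the cycles needed when up to k additions can run per cycle. The single-cycle adders and the choice of 8 operands are assumptions made for the example, not figures from the slides.

```python
def reduction_cycles(n_values, n_adders):
    """Cycles needed to sum n_values when up to n_adders additions run per cycle."""
    remaining, cycles = n_values, 0
    while remaining > 1:
        additions = min(n_adders, remaining // 2)  # each addition merges two values
        remaining -= additions
        cycles += 1
    return cycles

for adders in (1, 2, 4):
    print(adders, "adder(s):", reduction_cycles(8, adders), "cycles")
```

The total number of additions stays at n - 1 in every case; only the schedule length changes, which is what creates room to trade speed for a lower supply voltage or clock rate.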

11
Two Adders Available

12
Four Adders Available

13
Minimizing Memory Access Costs
  • Memory causes power and performance bottleneck
  • Minimize accesses needed by algorithm
  • Minimize total memory size needed by algorithm
  • Put memory accesses as close as possible to CPU
  • Register, then cache, external RAM last
  • Efficiently use memory bandwidth
  • Use multiple-word parallel loads, not single word
    loads

14
Improve Loop Nesting and Operation Order
  • B too large to store in registers - used memory
    transfers, instead
  • Loop rearrangement allowed intermediate B to stay
    in general register
  • Eliminated 2N memory accesses for B; the N memory
    locations for B are no longer needed
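
The loop rearrangement described above can be sketched as a loop fusion. The slide's actual computation is not preserved in the transcript, so the operations below (multiply by 2, then add 1) are hypothetical stand-ins, and `CountingArray` is a helper invented here to count accesses to B.

```python
class CountingArray:
    """List wrapper that counts element reads and writes (stand-in for memory)."""
    def __init__(self, size):
        self.data = [0] * size
        self.accesses = 0
    def __getitem__(self, i):
        self.accesses += 1
        return self.data[i]
    def __setitem__(self, i, value):
        self.accesses += 1
        self.data[i] = value

def separate_loops(a, n):
    b = CountingArray(n)      # intermediate B stored in memory
    c = [0] * n
    for i in range(n):
        b[i] = a[i] * 2       # N writes to B
    for i in range(n):
        c[i] = b[i] + 1       # N reads from B
    return c, b.accesses

def fused_loop(a, n):
    c = [0] * n
    for i in range(n):
        b = a[i] * 2          # B is now a scalar that can stay in a register
        c[i] = b + 1
    return c

a = list(range(8))
c_separate, b_accesses = separate_loops(a, len(a))
c_fused = fused_loop(a, len(a))
print(c_separate == c_fused, "accesses to B removed:", b_accesses)
```

For N = 8 the separate loops touch B 16 times (2N), and the fused loop needs neither those accesses nor the N storage locations.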

15
Reduce Space Requirements
  • C not needed afterward: reorder so that B can
    overwrite C

16
Reduced Intermediate Value Storage
  • Reduce storage for intermediate values AiM
  • Reduced from N locations to 1

17
Dual Memory Load Example

18
Maximize Parallel Loads of Multiple Memory Words
  • Dual word loads reduce energy by 47%
  • 1st: maximize dual loads with memory allocation
  • 2nd: combine memory accesses
  • Cases compared: no dual load; dual load; dual load
    with parallel execution

19
Minimize Memory Bandwidth
  • Allocate registers to minimize memory references
  • Cache blocking, loop unrolling: fix array
    computations so that blocks of the array are read
    only once
  • Register blocking: eliminate redundant register
    loads
  • Recurrence detection and optimization: use
    registers for values carried over from one
    recursion level to the next
  • Compact multiple memory references into 1
    reference
  • 40% speedup on DEC Alpha, 25% on Motorola 88100
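
Cache blocking and register blocking can be sketched on a matrix multiply, a common textbook case that is chosen here as an assumption rather than taken from the lecture. Python lists do not model a cache, so the code only shows the loop structure: work proceeds block by block, and `a_ik` is loaded once and reused across the inner loop.

```python
def matmul_naive(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_blocked(a, b, n, block):
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):              # iterate over blocks that fit in cache
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]         # loaded once, reused for every j
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c

n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n, block=2) == matmul_naive(a, b, n))
```

The block size would normally be tuned to the cache of the target processor; the value 2 is only there to exercise partial blocks at the matrix edge.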

20
Cache Locking
  • Lock data into cache
  • Prevents memory reads/writes from going to main
    memory
  • Real benefit: cache write hits are not written
    through to main memory
  • Read hits: no main memory reference even if
    cache were unlocked
  • Fujitsu SPARClite: writing 0 drew 341 mA from
    power supply when cache unlocked, only 194 mA
    when cache locked (about 43% less)

21
Instruction Selection/Ordering
  • Instruction packing
  • Single instruction does both ALU operation and
    memory data transfer
  • Much instruction overhead not duplicated when
    operations run in parallel
  • Concurrent execution of integer and floating-point
    ops
  • Easier to do in VLIW and superscalar
    architectures
  • Reorder instructions to minimize circuit state
    effects
  • Most significant for DSP units
  • Accumulator spilling and mode switching are most
    sensitive

22
Operand Swapping/Ordering
  • Swapping minimizes switching of functional unit
    inputs
  • Example: for x + 7 and y + 7, keep 7 on the same
    adder input
  • Ordering most significant if commutative operands
    not treated symmetrically by hardware
  • Example: Booth multiplier
  • 2nd operand bit pattern determines the number of
    additions and subtractions (called recoding weight)
  • Use the operand with the lowest recoding weight as
    the 2nd operand
  • Saved 10-30% of the power in Lee's experiments
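
To make the recoding-weight rule concrete, the sketch below counts the nonzero digits in a radix-2 Booth recoding and orders two operands accordingly. The slide does not say which Booth variant or word width was used, so radix-2 and 8 bits are assumptions, and both function names are made up for this example.

```python
def booth_weight(value, bits=8):
    """Nonzero digits in the radix-2 Booth recoding of a two's-complement value."""
    word = value & ((1 << bits) - 1)   # two's-complement bit pattern
    weight, prev = 0, 0                # prev: implicit 0 to the right of bit 0
    for i in range(bits):
        bit = (word >> i) & 1
        if bit != prev:                # Booth digit is +1 or -1: an add or subtract
            weight += 1
        prev = bit
    return weight

def order_operands(x, y, bits=8):
    """Return (first, second) with the lower recoding weight as the 2nd operand."""
    return (x, y) if booth_weight(y, bits) <= booth_weight(x, bits) else (y, x)

print(booth_weight(7), booth_weight(85), order_operands(7, 85))
```

Long runs of identical bits recode cheaply (7 is 00000111), while alternating patterns (85 is 01010101) need an operation at every bit position.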

23
Power Management
  • Software can often control processor power-down
    modes
  • User-interface activity comes in bursts
  • When system idle time exceeds a threshold, the
    system is likely to stay idle: start shutdown
  • Example: SPARClite
  • Power-down register masks/enables clock for
    SDRAM, DMA module, FPU, floating-point queues
  • Example: Hitachi SH3
  • Standby mode: CPU core stopped; peripheral
    controller, bus controller, memory refresh
    continue
  • Sleep mode: everything but real-time clock stops
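
The idle-threshold policy above can be sketched as a fixed timeout: stay powered until the idle time passes the threshold, then shut down and pay a wake-up cost later. All power figures, idle periods, and the wake-up energy below are hypothetical values chosen for illustration.

```python
def energy_with_timeout(idle_periods, threshold, p_idle, p_sleep, wake_energy):
    """Energy spent over a list of idle periods under a fixed idle-timeout policy."""
    total = 0.0
    for idle in idle_periods:
        if idle <= threshold:
            total += idle * p_idle                  # timeout never reached
        else:
            total += threshold * p_idle             # waited out the timeout
            total += (idle - threshold) * p_sleep   # then powered down
            total += wake_energy                    # cost of restarting
    return total

idle_periods = [0.5, 0.5, 20.0, 1.0, 60.0]          # seconds, bursty usage
always_on = energy_with_timeout(idle_periods, float("inf"), 1.0, 0.1, 0.5)
timeout_2s = energy_with_timeout(idle_periods, 2.0, 1.0, 0.1, 0.5)
print(always_on, timeout_2s)
```

Short idle periods never trigger a shutdown, so the wake-up cost is only paid when the idle period is long enough to be likely to continue.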

24
More Examples
  • Example: Intel 486SL
  • System Management Mode entered by asynchronous
    interrupt
  • Can enable, disable, switch between fast and
    slow clocks for CPU and ISA bus
  • Example: PowerPC 603 and 604
  • Dynamic power management: removes clock from
    execution units (saves 8-16%)
  • Static power management
  • Doze: shuts off most function units, keeps bus
    snooping enabled to maintain data cache coherence
  • Nap: shuts off bus snooping and sets wakeup
    timer, keeps phase-locked loop running to allow
    quick clock restart
  • Sleep: also shuts off phase-locked loop
  • Software control of power management is better
    than pure hardware control: software has more
    information

25
Automated Low-Power Code Generation
  • High-level
  • Graphical / textual languages available to
    describe DSP algorithms
  • HYPER_LP uncovers parallelism and minimizes
    critical paths
  • Allows data path supply voltage to be reduced
  • MASAI reorganizes loops to minimize memory
    transfers and size
  • DSP Compiler technology must deal with small
    register set, irregular data paths, fully using
    parallel resources

26
Su's Cold Scheduling Algorithm
  • Allocate registers
  • Pre-assemble: calculate target addresses, index
    symbol table, transform instructions to binary
  • Schedule instructions with cold scheduling
  • Post-assemble: complete assembly
  • Reduced switching activity 20-30%, performance
    loss of 2-4%
  • Lee et al. had a similar approach, but used
    as-soon-as-possible packing of instructions
  • Saved 26% to 73% energy compared with no
    instruction packing and no memory bank assignment
    optimization
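
The scheduling step can be approximated by a greedy list scheduler that, among the instructions whose dependencies are satisfied, picks the one whose binary encoding differs least from the previous instruction. This is a simplified sketch, not the published algorithm: Hamming distance between encodings is used as a stand-in for the power cost, and the encodings and dependencies are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions that differ between two instruction encodings."""
    return bin(a ^ b).count("1")

def cold_schedule(encodings, deps):
    """Greedy list scheduling that picks the ready instruction with fewest bit flips.

    encodings -- dict: instruction name -> binary encoding (int)
    deps      -- dict: instruction name -> set of names that must execute first
    """
    done, order, last = set(), [], 0   # assume the instruction bus starts at zero
    while len(order) < len(encodings):
        ready = [n for n in encodings
                 if n not in done and deps.get(n, set()) <= done]
        nxt = min(ready, key=lambda n: (hamming(last, encodings[n]), n))
        order.append(nxt)
        done.add(nxt)
        last = encodings[nxt]
    return order

def switching(order, encodings):
    """Total bus bit transitions for a given instruction order."""
    words = [0] + [encodings[n] for n in order]
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))

enc = {"i1": 0b11110000, "i2": 0b00001111, "i3": 0b11110001, "i4": 0b00001110}
deps = {"i3": {"i1"}, "i4": {"i2"}}
original = ["i1", "i2", "i3", "i4"]
scheduled = cold_schedule(enc, deps)
print(scheduled, switching(original, enc), switching(scheduled, enc))
```

Because the scheduler works on binary encodings, instructions must already be translated to binary before this step, which is why a pre-assembly pass precedes it.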

27
Instruction Set Co-Design
  • PEAS-I system
  • Generates HDL for CPU and C compiler, assembler,
    and simulator for CPU
  • Takes design constraints (chip area, power),
    hardware module database, sample application
    program, program data set
  • Optimizes instruction set and implementations for
    application program and given data set
  • Starts with core instructions needed for any C
    program
  • Augments with instructions for C operators not
    already included as a single instruction
  • Defines hardware, microprogram, and software
    implementations
  • Estimates power and area of each instruction
    implementation
  • Accounts for pipeline hazards
  • Solves as an integer program using
    branch-and-bound search

28
Instruction Set Design
  • Huang and Despain system
  • Optimizes instruction set for sample application,
    too
  • Groups micro-operations (MOPS) together to form
    higher-level instructions
  • Merge MOPS together as byproduct of scheduling
  • MOPS must be scheduled to same clock cycle
  • Constrain with instruction bit width, instruction
    set size, and hardware resources
  • Solved by simulated annealing with objective of
    minimizing execution cycles and instruction set
    size

29
Reconfigurable Computing
  • Some of hardware interconnect and logic is
    modified at run time
  • Can closely optimize to wide variety of
    applications
  • Can reconfigure at gate level or at architectural
    level
  • Usually implemented using FPGAs
  • Gives software designer chance to tailor software
    and processor to fit each other
  • Even after hardware design is fixed

30
Memory System Considerations
  • Feature
  • Total size
  • Bank partitioning
  • Wide data bus
  • Proximity to CPU
  • Cache size
  • Cache protocol
  • Cache locking

Low-Power Impact
  • Code compaction, algorithm transformations
  • Smaller, lower C_L, faster
  • Determined by application size, data structure
  • Needed for parallel loads
  • Reduces C_L, makes memory use less costly
  • Minimizes cache misses for application
  • Need spatial and temporal locality
  • Optimize for application
  • Use to run part of application only from cache

31
Architectural Considerations
  • Processor class (DSP/RISC/CISC): application
    dependent
  • Parallel processing (VLIW, superscalar, SIMD,
    MIMD): does parallel processing greatly improve
    performance so that reduced VDD and slower clock
    can be used?
  • Bus architecture: separate address, data,
    instruction, I/O busses make it easier to
    optimize instructions, addressing, and data to
    minimize bus activity
  • Register file size: eases register allocation,
    reduces memory accesses; too many registers
    increase power

32
Power Management Considerations
  • Software vs. hardware control: software control
    lets power management be tailored to application
  • Granularity of clock/voltage removal: shutdown
    modes needed to benefit from minimized execution
    times
  • Clock/voltage reduction: useful if there is
    schedule slack for a software task (power reduced
    during slack)
  • Guarded evaluation: latch functional unit inputs
    to avoid meaningless calculation when not in use

33
Summary
  • Estimation of software contribution to power
    dissipation
  • Minimize software power dissipation
  • Choose best algorithm that is suited to hardware
    resources
  • Minimize memory size and expensive memory
    accesses
  • Algorithm transformations
  • Efficient data mapping onto memory
  • Optimal use of memory bandwidth, registers, and
    cache
  • Optimally use available parallelism for
    application
  • Use hardware power management support
  • Select instruction sequences to minimize
    switching in CPU and data path