1
Low Power Processor Design Part II
  • M. Balakrishnan

2
Contents
  • Dynamic and Leakage Power consumption
  • Low power processor adaptations
  • DVS
  • OS level power reduction
  • Power aware compiler
  • Application transformations for power reduction
  • Some specific power reduction techniques
  • References

3
Dynamic Power
  • Switched capacitance
  • Power consumed for charging and discharging the
    capacitance associated with all inputs, outputs
    as well as interconnects
  • 85 to 90% of dynamic power
  • Short circuit current
  • Power consumed due to overlap in PMOS and NMOS
    on-time during switching
  • 10 to 15% of dynamic power

4
Switched Capacitance Power Consumption
  • Switched capacitance
  • 80 to 90% of dynamic power
  • P_dynamic = α · C · V² · f (α is the switching
    activity factor)
  • Lower switched capacitance (mainly low-level
    design techniques such as transistor sizing, and
    connectivity choices at higher levels)
  • Lower switching activity (all levels: synthesis,
    clock gating)
  • Reduce clock frequency
  • Reduce supply voltage (DVS is effective because
    of the cubic relationship)
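The cubic payoff of DVS follows directly from the formula above: P_dynamic = α·C·V²·f, and frequency scales roughly linearly with voltage, so halving both cuts power by about 8×. A minimal numeric sketch (all constant values here are illustrative, not from the slides):

```python
# Dynamic (switched-capacitance) power: P = alpha * C * V^2 * f.
# Under DVS, f scales roughly linearly with V, so scaling V and f together
# gives a roughly cubic reduction in power.

def dynamic_power(alpha, C, V, f):
    """Switched-capacitance power: activity factor * capacitance * V^2 * frequency."""
    return alpha * C * V**2 * f

# Two illustrative operating points (hypothetical values).
P_full = dynamic_power(alpha=0.15, C=1e-9, V=1.2, f=1e9)     # full speed
P_half = dynamic_power(alpha=0.15, C=1e-9, V=0.6, f=0.5e9)   # half V and half f

print(P_full / P_half)  # ~8: the cubic relationship
```

Note that energy per task drops only quadratically (the task also takes twice as long at half frequency), which is why DVS targets workloads with slack.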

5
Leakage Current
[Figure: transistor cross-sections (gate, source, drain) contrasting a device with no leakage against one with leakage current]
6
Types Of Leakage
  • Reverse-biased-junction leakage
  • Gate-induced-drain leakage
  • Sub-threshold leakage
  • Gate-oxide leakage
  • Gate-current leakage
  • Punch-through leakage

7
Gate-oxide Leakage
  • Flows from the gate into the substrate
  • This leakage increases exponentially as the
    thickness of the oxide decreases
  • Oxide thickness must shrink along with the other
    reductions in geometry and supply voltage
  • One solution is to use a high-k dielectric
    material

8
Sub-threshold Leakage
  • Dominant leakage current
  • I_sub = K1 · W · e^(−Vth/(n·Vθ)) · (1 − e^(−V/Vθ)),
    where Vθ is the thermal voltage, proportional to
    temperature T
  • Isub increases exponentially as Vth decreases
  • Isub increases as temperature T increases
    (thermal runaway)
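The exponential dependence on Vth can be checked numerically. A sketch of the sub-threshold equation above (the values of K1, W, and n are illustrative assumptions; Vθ ≈ 26 mV at room temperature):

```python
import math

def i_sub(K1, W, Vth, V, n, v_theta):
    """Sub-threshold leakage: I_sub = K1*W*exp(-Vth/(n*v_theta))*(1 - exp(-V/v_theta)).
    v_theta = kT/q is the thermal voltage (~0.026 V at 300 K); it grows with T,
    which is why leakage rises with temperature."""
    return K1 * W * math.exp(-Vth / (n * v_theta)) * (1 - math.exp(-V / v_theta))

# Illustrative constants: halving Vth from 0.4 V to 0.2 V multiplies leakage
# by e^(0.2/(1.5*0.026)), i.e. two orders of magnitude.
base    = i_sub(K1=1e-6, W=1.0, Vth=0.4, V=1.0, n=1.5, v_theta=0.026)
low_vth = i_sub(K1=1e-6, W=1.0, Vth=0.2, V=1.0, n=1.5, v_theta=0.026)
print(low_vth / base)  # ratio ≈ e^(0.2/0.039), on the order of 100x
```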

9
Reduction of Sub-threshold Leakage Current
  • Reduce supply voltage
  • Reduce size of the circuit
  • Resize transistors as per performance
    requirements
  • Dynamically cut power supply to unused circuits
  • Cooling
  • Reduce threshold voltage
  • Stack the off-transistors in series
  • Isolating supply through sleep transistors
  • Dual threshold higher threshold on non-critical
    paths
  • Adaptive body biasing

10
Low-Power Processor Adaptations
  • Adaptive Cache
  • Deactivates cache sets depending on the current
    application characteristics
  • Drowsy cache lines in unused portions of the
    cache placed in drowsy state
  • Power down of blocks which had their last use
    (compiler directed)
  • Adaptive instruction queues
  • IPC is monitored and Queue size adjusted
  • Reconfiguring multiple structures
  • Structures like instruction queue, reorder
    buffer, load/store buffer changed based on
    hotspots

11
Dynamic Voltage Scaling
  • DVS is the most widely used technique for
    reducing power consumption.
  • DVS reduces the voltage and slows down the
    processor frequency for workloads which have
    slack time available to execute

12
Issues in DVS
  • Unpredictable nature of workloads as well as
    preemption of tasks by interrupts creates
    further problems
  • Learning techniques introduce a learning lag,
    reducing their utility
  • The voltage-to-power relationship is not strictly
    quadratic, because I/O signals are derived from a
    separate voltage and peripheral devices are
    managed independently
  • Inter-task relationships are disturbed if one
    processor or core is slowed down

13
DVS Strategies
  • Interval based approaches
  • Processor idle time in a window
  • Aged averages (weighted averages with lower
    weights to previous intervals)
  • Inter-task approaches
  • Voltage is linked to a task and switched along
    with the context switch (assumes uniform task
    behavior and is not aware of program structure)
  • Intra-task approaches
  • Many approaches: fixed-length timeslots (hw),
    splitting a task into two sub-programs with the
    highest clock for the first while the second
    adjusts (OS), and check-pointing (compiler
    support)
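The interval-based "aged averages" idea above can be sketched as a small policy function. The decay factor, frequency floor, and utilization-to-frequency mapping are illustrative assumptions, not from the slides:

```python
def next_frequency(utilizations, f_max, decay=0.5):
    """Interval-based DVS sketch: predict the next interval's utilization as an
    aged average -- the most recent interval gets weight 1, older intervals decay
    geometrically -- then scale frequency to just cover the predicted load."""
    weight, acc, norm = 1.0, 0.0, 0.0
    for u in reversed(utilizations):   # walk from most recent to oldest
        acc += weight * u
        norm += weight
        weight *= decay
    predicted = acc / norm if norm else 1.0   # no history: assume full load
    return max(predicted * f_max, 0.1 * f_max)  # clamp to a floor frequency

# A window that is trending idle lets the processor slow down.
print(next_frequency([0.9, 0.5, 0.2], f_max=1000.0))  # well below f_max
```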

14
Resource Hibernation
  • Disk drives: the OS stops disk rotation during
    periods of inactivity. Restarting the disk adds
    delay, implying both performance and energy loss
  • Predictive dynamic threshold adjustment can be
    helped by the OS, which can cluster requests or
    delay non-urgent ones
  • Disk controllers can modulate speeds by looking
    at input queue lengths
  • Network interfaces
  • Displays

15
Compiler-Level Power Reduction
  • Some performance oriented optimizations reduce
    power
  • e.g. Common sub-expression elimination
  • Some performance oriented optimizations may
    increase power consumption (may not be energy)
  • e.g. Loop unrolling
  • Compilers can help reduce power during
    instruction set selection, reducing memory
    accesses, structured data traversal patterns etc.
  • For mobile devices, opportunities exist for
    remote compilation and/or remote execution

16
Application Level Power Reduction
  • Application transformations
  • Architecture aware
  • Software architecture graph transformations to
    reduce inter-process communication
  • Accuracy of computation traded against power
  • Quality of service traded against power

17
Some Specific Techniques
18
Value Cache Approach
  • Yang et al. [5] first proposed a cache-based
    scheme for transmitting frequently occurring
    values. With the cache size restricted to the
    word length (32 entries), every hit can be
    transmitted by toggling just one bit, with a
    control line indicating hit/miss. On a miss the
    original data value is sent.

[Figure: identical value caches sit at both ends of the data bus, between the on-chip data cache and off-chip memory, with a control line indicating hit/miss]
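The hit/miss protocol above can be sketched as matching encoder/decoder state machines kept in lockstep on both sides of the bus. FIFO replacement and the one-hot index encoding are illustrative assumptions:

```python
class ValueCacheCodec:
    """Sketch of a frequent-value bus codec in the style of [5]: sender and
    receiver keep identical 32-entry caches. On a hit, only the entry's one-hot
    index goes on the 32-bit bus (a single bit toggles) plus a hit/miss control
    bit; on a miss the raw value is sent and both sides insert it, keeping the
    two caches in lockstep. FIFO replacement here is an assumption."""

    def __init__(self, size=32):
        self.size = size
        self.entries = []          # FIFO list of cached values

    def encode(self, value):
        if value in self.entries:
            return ('hit', 1 << self.entries.index(value))  # one-hot index
        if len(self.entries) == self.size:
            self.entries.pop(0)                             # FIFO eviction
        self.entries.append(value)
        return ('miss', value)     # original data sent on a miss

    def decode(self, kind, payload):
        if kind == 'hit':
            return self.entries[payload.bit_length() - 1]
        # miss: mirror the sender's insertion so caches stay identical
        if len(self.entries) == self.size:
            self.entries.pop(0)
        self.entries.append(payload)
        return payload

# Round-trip: repeated values become one-bit hits.
tx, rx = ValueCacheCodec(), ValueCacheCodec()
for v in (7, 7, 42, 7):
    assert rx.decode(*tx.encode(v)) == v
```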
19
TUBE Value Cache Approach
  • Dinesh et al. [7] proposed a tunable approach
    where the bits are separated by their activity
    coefficients.
  • Two different caches were used. One for high
    activity bits and the other for low activity
    bits.
  • Because of small cache size, this could more
    effectively capture locality in data values.

20
Hierarchical Value Cache Encoding [8]
  • The HVCE is organized into multiple levels, with
    level i storing 32/2^(i−1)-bit values.

[Figure: HVCE cache hierarchy — one 32×32 cache (C0), two 16×16 caches (C1, C2), four 8×8 caches (C3–C6), and eight 4×4 caches]
21
HVCE (contd.)
  • A match at a higher level implies matches at all
    lower levels as well.
  • The highest-level match is encoded as a bit
    change on the 32-bit data bus.
  • A control word (15 bits for 15 caches) indicates
    which caches have a hit; it is carried on spare
    address-bus bandwidth (cycle stealing).
  • The position of the switched bit indicates the
    value-cache address (the hit itself being flagged
    in the control word).

[Figure: example of two consecutive 32-bit bus values, 10010011100001100010101010111001 and 10000011100001000010101010011000, the caches that hit (C3, C4, C5, C13, C14), and the resulting 15-bit control word 000111000000011]
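A two-level simplification of the hierarchical lookup might look like this. Sets stand in for the fixed-size caches, replacement is omitted, and only one 32-bit cache plus the two 16-bit half caches are modeled:

```python
def hvce_lookup(word, c0, c1, c2):
    """Simplified hierarchical value-cache lookup in the spirit of [8]:
    C0 holds full 32-bit values; C1/C2 hold the upper/lower 16-bit halves.
    Returns the list of caches that hit; because inserts populate every
    level, a C0 hit always implies both half-caches hit too."""
    hi, lo = word >> 16, word & 0xFFFF
    hits = []
    if word in c0:
        hits.append('C0')
    if hi in c1:
        hits.append('C1')
    if lo in c2:
        hits.append('C2')
    return hits

def hvce_insert(word, c0, c1, c2):
    """On a miss, insert the value at every level of the hierarchy."""
    c0.add(word)
    c1.add(word >> 16)
    c2.add(word & 0xFFFF)

# A value seen before hits at all levels; a value sharing only its upper
# half hits only the matching half-cache.
c0, c1, c2 = set(), set(), set()
hvce_insert(0xDEADBEEF, c0, c1, c2)
print(hvce_lookup(0xDEADBEEF, c0, c1, c2))  # all three levels hit
print(hvce_lookup(0xDEAD0000, c0, c1, c2))  # only the upper-half cache hits
```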
22
Value Cache Approaches
23
Secondary Memory Storage Management [9]
  • All current storage management techniques assume
    magnetic storage as secondary memory and
    performance optimization as the sole objective
  • Flash memory is evolving as a popular alternative
    for secondary storage in portable devices
  • Power optimization is an equally important
    objective as performance

24
Proposed Modifications
  • The page size used (4 KB to 12 KB) is too large a
    unit for replacement; define sub-pages equal to
    the flash memory page (say 256 B) for replacement
  • Use a battery-backed SRAM as a hot cache for the
    flash to avoid frequent writes
  • Manage fragmentation to avoid frequent and
    expensive garbage collection
  • Hot-cache replacement policies have to be
    power-sensitive
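The hot-cache idea can be sketched as a small write-absorbing cache that counts the flash writes it saves. LRU eviction here is an illustrative stand-in for the power-sensitive policy the slide calls for:

```python
class HotCache:
    """Sketch of a battery-backed SRAM hot cache in front of flash [9]:
    writes to hot sub-pages are absorbed in SRAM and only flushed to flash
    when a line is evicted, cutting expensive flash writes. Plain LRU
    eviction is an illustrative assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}           # sub-page id -> data; dict order tracks LRU
        self.flash_writes = 0     # writes that actually reached flash

    def write(self, subpage, data):
        if subpage in self.lines:
            self.lines.pop(subpage)                 # refresh LRU position
        elif len(self.lines) == self.capacity:
            self.lines.pop(next(iter(self.lines)))  # evict LRU line...
            self.flash_writes += 1                  # ...which flushes to flash
        self.lines[subpage] = data

# Ten rewrites of the same hot sub-page cost zero flash writes.
hc = HotCache(capacity=2)
for _ in range(10):
    hc.write('p1', b'data')
print(hc.flash_writes)  # 0
```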

25
Processor Pipeline Power Reduction
  • Processors today have a very deep pipeline.
  • Stalls occur due to data dependency, control
    dependency, resource contention and cache miss.
  • The stalls provide an opportunity for reducing
    power using clock gating of various stage
    latches.
  • Other techniques have also been reported for
    power reduction taking advantage of stalls.

26
Clock Gating
  • The basic technique has been clock gating of
    various stage registers in the stall condition.
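The effect of gating a stage register on stall can be modeled in a few lines: the register only latches on ungated clock edges, so stalled cycles cost no register switching. This is a behavioral sketch, not RTL:

```python
def run_pipeline(inputs, stalls):
    """Clock-gating sketch for one pipeline stage register: on a stall cycle
    the clock is gated and the register holds its old value (no switching
    energy); otherwise it latches the stage input. Returns the register's
    per-cycle trace and the number of clock edges gated away."""
    reg, trace, gated = None, [], 0
    for value, stalled in zip(inputs, stalls):
        if stalled:
            gated += 1          # gated edge: register keeps its old contents
        else:
            reg = value         # normal edge: latch the new stage input
        trace.append(reg)
    return trace, gated

trace, gated = run_pipeline(['a', 'b', 'c', 'd'], [False, True, True, False])
print(trace, gated)  # ['a', 'a', 'a', 'd'] 2
```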

27
Pipelining Using Transparent Latches [10]
  • Normal edge-triggered latches are called opaque
    latches; clock gating them saves energy
  • Level-triggered latches become transparent when
    enabled and are called transparent latches.
    Energy is saved by enabling these latches, which
    is equivalent to combining stages into longer
    delay stages.

28
Transparent Latches
  • Transparent latches introduced in the pipeline
    can result in energy saving. This is of course
    dependent on the distance between subsequent
    useful work and thus dynamic in nature.

[Figure: pipeline stages — Fetch, Inst. Buffer, Decode, Op. fetch, ALU Exec., Write back]
29
Stall Cycles Redistribution [11]
  • Stall cycles, if redistributed, can help save
    energy. The delay is also referred to as slack.
  • Slack longer than the pipeline depth cannot be
    exploited.
  • This information is stored with the fetch group
    as extra bits in the BTB. Both the predicted
    slack and a confidence level are encoded in these
    bits.
  • Fetch of instructions with slack is delayed by up
    to the pipeline depth.
  • They report up to 50% reduction in the front-end
    (I-cache, branch predictor and front-end latches)
    energy-delay product. Transparent latches alone
    give a negligible (5%) reduction, whereas slack
    prediction with clock gating is effective,
    reducing it by 25%. The loss in performance is
    less than 2%.

30
Speculative Execution Energy Reduction
  • High-performance processors are all pipelined.
  • The current trend is to support speculative
    instruction execution and to squash instructions
    in the execution pipeline before the write-back
    stage if the speculation proves wrong.
  • This ensures correctness while being
    power-inefficient, as the energy consumed by the
    squashed instructions is wasted.

31
Basic Strategy
  • The basic strategy is to reduce the pipeline feed
    by gating the fetch stage whenever the
    probability of an instruction finishing drops.
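A sketch of such fetch gating in the style of Manne et al.'s pipeline gating: count low-confidence branches in flight and gate fetch once they exceed a threshold. Branch resolution is omitted and the 0.5 confidence cutoff is an illustrative simplification:

```python
def fetch_gate_trace(confidences, threshold=2):
    """Pipeline-gating sketch: each element is a fetched branch's predictor
    confidence in [0, 1]. Track how many low-confidence branches are in
    flight and gate the fetch stage whenever the count reaches the
    threshold, so likely-wrong-path instructions never enter the pipe.
    (Resolving and retiring old branches is omitted for brevity.)"""
    low_conf_in_flight = 0
    gated = []
    for conf in confidences:
        if conf < 0.5:
            low_conf_in_flight += 1
        gated.append(low_conf_in_flight >= threshold)
    return gated

# Fetch is gated only after two low-confidence branches accumulate.
print(fetch_gate_trace([0.9, 0.3, 0.4, 0.8]))  # [False, False, True, True]
```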

[Figure: fetch-stage clock gating]
32
Approaches
33
Low Power Multipliers
  • Array multipliers and shift-and-add multipliers
  • Fixed-coefficient multipliers: DSP applications
    like filters/FFT/DCT, and functions like sine and
    cosine computation
  • Booth multipliers: CSD coding to reduce the
    number of additions and subtractions

34
CSD Coding [1]
  • Replace a string of 1s (longer than two) by a 1
    and a −1 to reduce the number of operations to 2
  • 00111101 → 0 1 0 0 0 (−1) 0 1 (i.e. 64 − 4 + 1 = 61)
  • Modified CSD coding considers a set of
    coefficients instead of one at a time to increase
    the number of 0 columns; these can be removed to
    save area.
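The recoding rule above can be made concrete with a standard CSD (non-adjacent form) encoder. This sketches plain CSD only, not the modified multi-coefficient variant the slide mentions:

```python
def csd(n):
    """Canonical signed-digit (CSD) recoding of a non-negative integer:
    digits in {-1, 0, 1}, least significant first, with no two adjacent
    nonzero digits -- a run of 1s collapses into one +1 and one -1."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n % 4)   # +1 if n ends in ...01, -1 if it ends in ...11
            n -= d            # subtracting the digit clears the low bit
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

# The slide's example: 00111101 (61) has five 1s; its CSD form
# 1 0 0 0 -1 0 1 (64 - 4 + 1) needs only three nonzero digits.
print(csd(61)[::-1])  # MSB first: [1, 0, 0, 0, -1, 0, 1]
```

Fewer nonzero digits means fewer partial products, hence fewer additions/subtractions in a fixed-coefficient multiplier.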

35
Synergistic Temperature and Energy Management [17]
  • Globally asynchronous, locally synchronous (GALS)
    architectures are getting popular
  • They have many clock domains which interact
    asynchronously
  • A temperature rise in one domain can be addressed
    by reducing only that domain's clock frequency,
    limiting the impact on performance
  • In a synchronous system, reducing the clock
    frequency would reduce overall performance
  • Slowing one domain introduces slack in other
    domains due to excess capacity; this can be
    exploited to reduce their clocks and save further
    energy.
  • Cooling adjoining domains benefits heat removal
    from the affected domain by providing a higher
    temperature gradient.

36
References
  • For the first part:
  • V. Venkatachalam and M. Franz, "Power Reduction
    Techniques for Microprocessor Systems", ACM
    Computing Surveys, Vol. 37, No. 3, Sep. 2005, pp.
    195-237
  • Low Power Multipliers
  • ISLPED 2006 paper
  • Register file power reduction
  • ISLPED 2006 paper
  • Associative Memory
  • J. Sharkey et al., "Power efficient wakeup tag
    broadcast", ICCD 2005
  • ISLPED 2006 paper
  • Off-chip Bus Power Reduction
  • J. Yang et al., "FV encoding for low power data
    I/O", ISLPED 2001
  • Basu et al., "Power protocol: reducing power
    dissipation on off-chip data buses", MICRO 2002
  • Dinesh et al., "A tunable encoder for off-chip
    buses", ISLPED 2005
  • ISLPED 2006 paper
  • Secondary Storage Management
  • ISLPED 2006 paper

37
References (contd.)
  • Processor Pipeline Power Reduction
  • H. M. Jacobson, "Improved clock gating through
    transparent pipelining", ISLPED 2004, Aug. 2004
  • ISLPED 2006 paper
  • Speculative Execution Energy Reduction
  • Aragon et al., "Power-aware control speculation
    through selective throttling", HPCA-9, Feb. 2003
  • Baniasadi et al., "Instruction flow-based
    front-end throttling for power-aware
    high-performance processors", ISLPED '01, pp.
    16-21, Aug. 2001
  • Buyuktosunoglu et al., "Energy efficient
    co-adaptive instruction fetch and issue", ISCA
    '03, pp. 147-156, June 2003
  • Manne et al., "Pipeline gating: speculation
    control for energy reduction", ISCA '98, pp.
    132-141, June 1998
  • ISLPED 2006 paper
  • Synergistic Temp. and Energy Management
  • ISLPED 2006 paper