1
Low Power Processor Design Part II
  • M. Balakrishnan

2
Contents
  • Dynamic and Leakage Power consumption
  • Low power processor adaptations
  • DVS
  • OS level power reduction
  • Power aware compiler
  • Application transformations for power reduction
  • Some specific power reduction techniques
  • References

3
Dynamic Power
  • Switched capacitance
  • Power consumed for charging and discharging the
    capacitance associated with all inputs, outputs
    as well as interconnects
  • 85 to 90% of dynamic power
  • Short circuit current
  • Power consumed due to overlap in PMOS and NMOS
    on-time during switching
  • 10 to 15% of dynamic power

4
Switched Capacitance Power Consumption
  • Switched capacitance
  • 80 to 90% of dynamic power
  • P_dynamic = α · C · V² · f (α is the switching
    activity factor)
  • Lower switched capacitance (mainly low-level
    design techniques such as transistor sizing, and
    connectivity choices at higher levels)
  • Lower switching activity (all levels: synthesis,
    clock gating)
  • Reduce clock frequency
  • Reduce supply voltage (DVS is effective because
    of the cubic relationship)
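The cubic payoff of DVS follows directly from the formula above: P_dynamic = α·C·V²·f, and frequency scales roughly linearly with voltage, so halving both cuts power by about 8×. A minimal numeric sketch (all constant values here are illustrative, not from the slides):

```python
# Dynamic (switched-capacitance) power: P = alpha * C * V^2 * f.
# Under DVS, f scales roughly linearly with V, so scaling V and f together
# gives a roughly cubic reduction in power.

def dynamic_power(alpha, C, V, f):
    """Switched-capacitance power: activity factor * capacitance * V^2 * frequency."""
    return alpha * C * V**2 * f

# Two illustrative operating points (hypothetical values).
P_full = dynamic_power(alpha=0.15, C=1e-9, V=1.2, f=1e9)     # full speed
P_half = dynamic_power(alpha=0.15, C=1e-9, V=0.6, f=0.5e9)   # half V and half f

print(P_full / P_half)  # ~8: the cubic relationship
```

Note that energy per task drops only quadratically (the task also takes twice as long at half frequency), which is why DVS targets workloads with slack.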

5
Leakage Current
[Figure: transistor cross-sections (gate, source, drain) contrasting a device with no leakage against one with leakage current]
6
Types Of Leakage
  • Reverse-biased-junction leakage
  • Gate-induced-drain leakage
  • Sub-threshold leakage
  • Gate-oxide leakage
  • Gate-current leakage
  • Punch-through leakage

7
Gate-oxide Leakage
  • Flows from the gate into the substrate
  • This leakage increases exponentially as the
    thickness of the oxide decreases
  • Oxide thickness must shrink along with the other
    reductions in geometry and supply voltage
  • One solution is to use a high-k dielectric
    material

8
Sub-threshold Leakage
  • Dominant leakage current
  • I_sub = K1 · W · e^(−Vth/(n·Vθ)) · (1 − e^(−V/Vθ)),
    where Vθ is the thermal voltage, proportional to
    temperature T
  • Isub increases exponentially as Vth decreases
  • Isub increases as temperature T increases
    (thermal runaway)
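The exponential dependence on Vth can be checked numerically. A sketch of the sub-threshold equation above (the values of K1, W, and n are illustrative assumptions; Vθ ≈ 26 mV at room temperature):

```python
import math

def i_sub(K1, W, Vth, V, n, v_theta):
    """Sub-threshold leakage: I_sub = K1*W*exp(-Vth/(n*v_theta))*(1 - exp(-V/v_theta)).
    v_theta = kT/q is the thermal voltage (~0.026 V at 300 K); it grows with T,
    which is why leakage rises with temperature."""
    return K1 * W * math.exp(-Vth / (n * v_theta)) * (1 - math.exp(-V / v_theta))

# Illustrative constants: halving Vth from 0.4 V to 0.2 V multiplies leakage
# by e^(0.2/(1.5*0.026)), i.e. two orders of magnitude.
base    = i_sub(K1=1e-6, W=1.0, Vth=0.4, V=1.0, n=1.5, v_theta=0.026)
low_vth = i_sub(K1=1e-6, W=1.0, Vth=0.2, V=1.0, n=1.5, v_theta=0.026)
print(low_vth / base)  # ratio ≈ e^(0.2/0.039), on the order of 100x
```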

9
Reduction of Sub-threshold Leakage Current
  • Reduce supply voltage
  • Reduce size of the circuit
  • Resize transistors as per performance
    requirements
  • Dynamically cut power supply to unused circuits
  • Cooling
  • Reduce threshold voltage
  • Stack the off-transistors in series
  • Isolating supply through sleep transistors
  • Dual threshold higher threshold on non-critical
    paths
  • Adaptive body biasing

10
Low-Power Processor Adaptations
  • Adaptive Cache
  • Deactivates cache sets depending on the current
    application characteristics
  • Drowsy cache lines in unused portions of the
    cache placed in drowsy state
  • Power down of blocks which had their last use
    (compiler directed)
  • Adaptive instruction queues
  • IPC is monitored and Queue size adjusted
  • Reconfiguring multiple structures
  • Structures like instruction queue, reorder
    buffer, load/store buffer changed based on
    hotspots

11
Dynamic Voltage Scaling
  • DVS is the most widely used technique for
    reducing power consumption.
  • DVS reduces the voltage and slows down the
    processor frequency for workloads which have
    slack time available to execute

12
Issues in DVS
  • Unpredictable nature of workloads as well as
    preemption of tasks by interrupts creates
    further problems
  • Learning techniques introduce a learning lag,
    reducing their utility
  • The voltage-to-power relationship is not strictly
    quadratic, because I/O signals are derived from a
    separate voltage and peripheral devices are
    managed independently
  • Inter-task relationships are disturbed if one
    processor or core is slowed down

13
DVS Strategies
  • Interval based approaches
  • Processor idle time in a window
  • Aged averages (weighted averages with lower
    weights to previous intervals)
  • Inter-task approaches
  • Voltage is linked to a task and switched along
    with the context switch (assumes uniform task
    behavior and is not aware of program structure)
  • Intra-task approaches
  • Many approaches: fixed-length timeslots (hw),
    splitting a task into two sub-programs with the
    highest clock for the first while the second
    adjusts (OS), and check-pointing (compiler
    support)
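The interval-based "aged averages" idea above can be sketched as a small policy function. The decay factor, frequency floor, and utilization-to-frequency mapping are illustrative assumptions, not from the slides:

```python
def next_frequency(utilizations, f_max, decay=0.5):
    """Interval-based DVS sketch: predict the next interval's utilization as an
    aged average -- the most recent interval gets weight 1, older intervals decay
    geometrically -- then scale frequency to just cover the predicted load."""
    weight, acc, norm = 1.0, 0.0, 0.0
    for u in reversed(utilizations):   # walk from most recent to oldest
        acc += weight * u
        norm += weight
        weight *= decay
    predicted = acc / norm if norm else 1.0   # no history: assume full load
    return max(predicted * f_max, 0.1 * f_max)  # clamp to a floor frequency

# A window that is trending idle lets the processor slow down.
print(next_frequency([0.9, 0.5, 0.2], f_max=1000.0))  # well below f_max
```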

14
Resource Hibernation
  • Disk drives: the OS stops disk rotation during
    periods of inactivity. Restarting the disk adds
    delay, implying both performance and energy loss
  • Predictive dynamic threshold adjustment can be
    helped by the OS, which can cluster requests or
    delay non-urgent ones
  • Disk controllers can modulate speeds by looking
    at input queue lengths
  • Network interfaces
  • Displays

15
Compiler-Level Power Reduction
  • Some performance oriented optimizations reduce
    power
  • e.g. Common sub-expression elimination
  • Some performance oriented optimizations may
    increase power consumption (may not be energy)
  • e.g. Loop unrolling
  • Compilers can help reduce power during
    instruction set selection, reducing memory
    accesses, structured data traversal patterns etc.
  • For mobile devices, opportunities exist for
    remote compilation and/or remote execution

16
Application Level Power Reduction
  • Application transformations
  • Architecture aware
  • Software architecture graph transformations to
    reduce inter-process communication
  • Accuracy of computation traded against power
  • Quality of service traded against power

17
Some Specific Techniques
18
Value Cache Approach
  • Yang et al. [5] first proposed a cache-based
    scheme for transmitting frequently occurring
    values. With the cache size restricted to the
    word length (32 entries), every hit can be
    transmitted by toggling just one bit, with a
    control line indicating hit/miss. On a miss the
    original data value is sent.

[Figure: identical value caches sit at both ends of the data bus, between the on-chip data cache and off-chip memory, with a control line indicating hit/miss]
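The hit/miss protocol above can be sketched as matching encoder/decoder state machines kept in lockstep on both sides of the bus. FIFO replacement and the one-hot index encoding are illustrative assumptions:

```python
class ValueCacheCodec:
    """Sketch of a frequent-value bus codec in the style of [5]: sender and
    receiver keep identical 32-entry caches. On a hit, only the entry's one-hot
    index goes on the 32-bit bus (a single bit toggles) plus a hit/miss control
    bit; on a miss the raw value is sent and both sides insert it, keeping the
    two caches in lockstep. FIFO replacement here is an assumption."""

    def __init__(self, size=32):
        self.size = size
        self.entries = []          # FIFO list of cached values

    def encode(self, value):
        if value in self.entries:
            return ('hit', 1 << self.entries.index(value))  # one-hot index
        if len(self.entries) == self.size:
            self.entries.pop(0)                             # FIFO eviction
        self.entries.append(value)
        return ('miss', value)     # original data sent on a miss

    def decode(self, kind, payload):
        if kind == 'hit':
            return self.entries[payload.bit_length() - 1]
        # miss: mirror the sender's insertion so caches stay identical
        if len(self.entries) == self.size:
            self.entries.pop(0)
        self.entries.append(payload)
        return payload

# Round-trip: repeated values become one-bit hits.
tx, rx = ValueCacheCodec(), ValueCacheCodec()
for v in (7, 7, 42, 7):
    assert rx.decode(*tx.encode(v)) == v
```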
19
TUBE Value Cache Approach
  • Dinesh et al. [7] proposed a tunable approach
    where the bits are separated by their activity
    coefficients.
  • Two different caches were used. One for high
    activity bits and the other for low activity
    bits.
  • Because of small cache size, this could more
    effectively capture locality in data values.

20
Hierarchical Value Cache Encoding [8]
  • The HVCE is organized into multiple levels, with
    level i storing 32/2^(i−1)-bit values.

[Figure: HVCE cache hierarchy — one 32×32 cache (C0), two 16×16 caches (C1, C2), four 8×8 caches (C3–C6), and eight 4×4 caches]
21
HVCE (contd.)
  • A match at a higher level implies matches at all
    lower levels as well.
  • The highest-level match is encoded as a bit
    change on the 32-bit data bus.
  • A control word (15 bits for 15 caches) indicates
    which caches have a hit; it is carried on spare
    address-bus bandwidth (cycle stealing).
  • The position of the switched bit indicates the
    value-cache address (the hit itself being flagged
    in the control word).

[Figure: example of two consecutive 32-bit bus values, 10010011100001100010101010111001 and 10000011100001000010101010011000, the caches that hit (C3, C4, C5, C13, C14), and the resulting 15-bit control word 000111000000011]
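A two-level simplification of the hierarchical lookup might look like this. Sets stand in for the fixed-size caches, replacement is omitted, and only one 32-bit cache plus the two 16-bit half caches are modeled:

```python
def hvce_lookup(word, c0, c1, c2):
    """Simplified hierarchical value-cache lookup in the spirit of [8]:
    C0 holds full 32-bit values; C1/C2 hold the upper/lower 16-bit halves.
    Returns the list of caches that hit; because inserts populate every
    level, a C0 hit always implies both half-caches hit too."""
    hi, lo = word >> 16, word & 0xFFFF
    hits = []
    if word in c0:
        hits.append('C0')
    if hi in c1:
        hits.append('C1')
    if lo in c2:
        hits.append('C2')
    return hits

def hvce_insert(word, c0, c1, c2):
    """On a miss, insert the value at every level of the hierarchy."""
    c0.add(word)
    c1.add(word >> 16)
    c2.add(word & 0xFFFF)

# A value seen before hits at all levels; a value sharing only its upper
# half hits only the matching half-cache.
c0, c1, c2 = set(), set(), set()
hvce_insert(0xDEADBEEF, c0, c1, c2)
print(hvce_lookup(0xDEADBEEF, c0, c1, c2))  # all three levels hit
print(hvce_lookup(0xDEAD0000, c0, c1, c2))  # only the upper-half cache hits
```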
22
Value Cache Approaches
23
Secondary Memory Storage Management [9]
  • All current storage management techniques assume
    magnetic storage as secondary memory and
    performance optimization as the sole objective
  • Flash memory is evolving as a popular alternative
    for secondary storage in portable devices
  • Power optimization is an equally important
    objective as performance

24
Proposed Modifications
  • The page size used (4 KB to 12 KB) is too large a
    unit for replacement; define sub-pages equal to
    the flash memory page (say 256 B) for replacement
  • Use a battery-backed SRAM as a hot cache for the
    flash to avoid frequent writes
  • Manage fragmentation to avoid frequent and
    expensive garbage collection
  • Hot-cache replacement policies have to be
    power-sensitive
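The hot-cache idea can be sketched as a small write-absorbing cache that counts the flash writes it saves. LRU eviction here is an illustrative stand-in for the power-sensitive policy the slide calls for:

```python
class HotCache:
    """Sketch of a battery-backed SRAM hot cache in front of flash [9]:
    writes to hot sub-pages are absorbed in SRAM and only flushed to flash
    when a line is evicted, cutting expensive flash writes. Plain LRU
    eviction is an illustrative assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}           # sub-page id -> data; dict order tracks LRU
        self.flash_writes = 0     # writes that actually reached flash

    def write(self, subpage, data):
        if subpage in self.lines:
            self.lines.pop(subpage)                 # refresh LRU position
        elif len(self.lines) == self.capacity:
            self.lines.pop(next(iter(self.lines)))  # evict LRU line...
            self.flash_writes += 1                  # ...which flushes to flash
        self.lines[subpage] = data

# Ten rewrites of the same hot sub-page cost zero flash writes.
hc = HotCache(capacity=2)
for _ in range(10):
    hc.write('p1', b'data')
print(hc.flash_writes)  # 0
```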

25
Processor Pipeline Power Reduction
  • Processors today have a very deep pipeline.
  • Stalls occur due to data dependency, control
    dependency, resource contention and cache miss.
  • The stalls provide an opportunity for reducing
    power using clock gating of various stage
    latches.
  • Other techniques have also been reported for
    power reduction taking advantage of stalls.

26
Clock Gating
  • The basic technique has been clock gating of
    various stage registers in the stall condition.
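The effect of gating a stage register on stall can be modeled in a few lines: the register only latches on ungated clock edges, so stalled cycles cost no register switching. This is a behavioral sketch, not RTL:

```python
def run_pipeline(inputs, stalls):
    """Clock-gating sketch for one pipeline stage register: on a stall cycle
    the clock is gated and the register holds its old value (no switching
    energy); otherwise it latches the stage input. Returns the register's
    per-cycle trace and the number of clock edges gated away."""
    reg, trace, gated = None, [], 0
    for value, stalled in zip(inputs, stalls):
        if stalled:
            gated += 1          # gated edge: register keeps its old contents
        else:
            reg = value         # normal edge: latch the new stage input
        trace.append(reg)
    return trace, gated

trace, gated = run_pipeline(['a', 'b', 'c', 'd'], [False, True, True, False])
print(trace, gated)  # ['a', 'a', 'a', 'd'] 2
```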

27
Pipelining Using Transparent Latches [10]
  • Normal edge-triggered latches are called opaque
    latches; clock gating them saves energy
  • Level-triggered latches become transparent when
    enabled and are called transparent latches.
    Energy is saved by enabling these latches, which
    is equivalent to combining stages into longer
    delay stages.

28
Transparent Latches
  • Transparent latches introduced in the pipeline
    can result in energy saving. This is of course
    dependent on the distance between subsequent
    useful work and thus dynamic in nature.

[Figure: pipeline stages — Fetch, Inst. Buffer, Decode, Op. fetch, ALU Exec., Write back]
29
Stall Cycles Redistribution [11]
  • Stall cycles, if redistributed, can help save
    energy. The delay is also referred to as slack.
  • Slack longer than the pipeline depth cannot be
    exploited.
  • This information is stored with the fetch group
    as extra bits in the BTB. Both the predicted
    slack and a confidence level are encoded in these
    bits.
  • Fetch of instructions with slack is delayed by up
    to the pipeline depth.
  • They report up to 50% reduction in the front-end
    (I-cache, branch predictor and front-end latches)
    energy-delay product. Transparent latches alone
    give a negligible (5%) reduction, whereas slack
    prediction with clock gating is effective,
    reducing it by 25%. The loss in performance is
    less than 2%.

30
Speculative Execution Energy Reduction
  • High-performance processors are all pipelined.
  • The current trend is to support speculative
    instruction execution and to squash instructions
    in the execution pipeline before the write-back
    stage if the speculation proves wrong.
  • This ensures correctness while being
    power-inefficient, as the energy consumed by the
    squashed instructions is wasted.

31
Basic Strategy
  • The basic strategy is to reduce the pipeline feed
    by gating the fetch stage whenever the
    probability of an instruction finishing drops.
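A sketch of such fetch gating in the style of Manne et al.'s pipeline gating: count low-confidence branches in flight and gate fetch once they exceed a threshold. Branch resolution is omitted and the 0.5 confidence cutoff is an illustrative simplification:

```python
def fetch_gate_trace(confidences, threshold=2):
    """Pipeline-gating sketch: each element is a fetched branch's predictor
    confidence in [0, 1]. Track how many low-confidence branches are in
    flight and gate the fetch stage whenever the count reaches the
    threshold, so likely-wrong-path instructions never enter the pipe.
    (Resolving and retiring old branches is omitted for brevity.)"""
    low_conf_in_flight = 0
    gated = []
    for conf in confidences:
        if conf < 0.5:
            low_conf_in_flight += 1
        gated.append(low_conf_in_flight >= threshold)
    return gated

# Fetch is gated only after two low-confidence branches accumulate.
print(fetch_gate_trace([0.9, 0.3, 0.4, 0.8]))  # [False, False, True, True]
```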

[Figure: fetch-stage clock gating]
32
Approaches
33
Low Power Multipliers
  • Array multipliers and shift-and-add multipliers
  • Fixed-coefficient multipliers: DSP applications
    like filters/FFT/DCT, and functions like sine and
    cosine computation
  • Booth multipliers: CSD coding to reduce the
    number of additions and subtractions

34
CSD Coding [1]
  • Replace a string of 1s (longer than two) by a 1
    and a −1 to reduce the number of operations to 2
  • 00111101 → 0 1 0 0 0 (−1) 0 1 (i.e. 64 − 4 + 1 = 61)
  • Modified CSD coding considers a set of
    coefficients instead of one at a time to increase
    the number of 0 columns; these can be removed to
    save area.
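The recoding rule above can be made concrete with a standard CSD (non-adjacent form) encoder. This sketches plain CSD only, not the modified multi-coefficient variant the slide mentions:

```python
def csd(n):
    """Canonical signed-digit (CSD) recoding of a non-negative integer:
    digits in {-1, 0, 1}, least significant first, with no two adjacent
    nonzero digits -- a run of 1s collapses into one +1 and one -1."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n % 4)   # +1 if n ends in ...01, -1 if it ends in ...11
            n -= d            # subtracting the digit clears the low bit
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

# The slide's example: 00111101 (61) has five 1s; its CSD form
# 1 0 0 0 -1 0 1 (64 - 4 + 1) needs only three nonzero digits.
print(csd(61)[::-1])  # MSB first: [1, 0, 0, 0, -1, 0, 1]
```

Fewer nonzero digits means fewer partial products, hence fewer additions/subtractions in a fixed-coefficient multiplier.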

35
Synergistic Temperature and Energy Management [17]
  • Globally asynchronous, locally synchronous (GALS)
    architectures are getting popular
  • They have many clock domains which interact
    asynchronously
  • A temperature rise in one domain can be addressed
    by reducing only that domain's clock frequency,
    limiting the impact on performance
  • In a synchronous system, reducing the clock
    frequency would reduce overall performance
  • Slowing one domain introduces slack in other
    domains due to excess capacity; this can be
    exploited to reduce their clocks and save further
    energy.
  • Cooling adjoining domains benefits heat removal
    from the affected domain by providing a higher
    temperature gradient.

36
References
  • For the first part:
  • V. Venkatachalam and M. Franz, "Power Reduction
    Techniques for Microprocessor Systems", ACM
    Computing Surveys, Vol. 37, No. 3, Sep. 2005, pp.
    195-237
  • Low Power Multipliers
  • ISLPED 2006 paper
  • Register file power reduction
  • ISLPED 2006 paper
  • Associative Memory
  • J. Sharkey et al., "Power efficient wakeup tag
    broadcast", ICCD 2005
  • ISLPED 2006 paper
  • Off-chip Bus Power Reduction
  • J. Yang et al., "FV encoding for low power data
    I/O", ISLPED 2001
  • Basu et al., "Power protocol: reducing power
    dissipation on off-chip data buses", MICRO 2002
  • Dinesh et al., "A tunable encoder for off-chip
    buses", ISLPED 2005
  • ISLPED 2006 paper
  • Secondary Storage Management
  • ISLPED 2006 paper

37
References (contd.)
  • Processor Pipeline Power Reduction
  • H. M. Jacobson, "Improved clock gating through
    transparent pipelining", ISLPED 2004, Aug. 2004
  • ISLPED 2006 paper
  • Speculative Execution Energy Reduction
  • Aragon et al., "Power-aware control speculation
    through selective throttling", HPCA-9, Feb. 2003
  • Baniasadi et al., "Instruction flow-based
    front-end throttling for power-aware
    high-performance processors", ISLPED '01, pp.
    16-21, Aug. 2001
  • Buyuktosunoglu et al., "Energy efficient
    co-adaptive instruction fetch and issue", ISCA
    '03, pp. 147-156, June 2003
  • Manne et al., "Pipeline gating: speculation
    control for energy reduction", ISCA '98, pp.
    132-141, June 1998
  • ISLPED 2006 paper
  • Synergistic Temp. and Energy Management
  • ISLPED 2006 paper