Title: Low Power Processor Design: Part II
Contents
- Dynamic and Leakage Power consumption
- Low power processor adaptations
- DVS
- OS level power reduction
- Power aware compiler
- Application transformations for power reduction
- Some specific power reduction techniques
- References
Dynamic Power
- Switched capacitance
- Power consumed in charging and discharging the capacitance associated with all inputs, outputs, and interconnects: 85 to 90% of dynamic power
- Short-circuit current
- Power consumed due to the overlap in PMOS and NMOS on-time during switching: 10 to 15% of dynamic power
Switched Capacitance Power Consumption
- Switched capacitance: 80 to 90% of dynamic power
- P_dynamic = a * C * V^2 * f, where a is the switching activity factor
- Lower the switched capacitance (low-level design techniques, mainly transistor sizing, as well as connectivity at higher levels)
- Lower the switching activity (all levels: synthesis, clock gating)
- Reduce the clock frequency
- Reduce the supply voltage (DVS is effective because of the cubic relationship: f must scale down with V)
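The cubic relationship can be sketched numerically. The activity factor, capacitance, and voltage/frequency operating points below are illustrative placeholders, not values from the slides:

```python
# Dynamic power model from the slide: P = a * C * V^2 * f.
# All numbers below are illustrative placeholders.

def dynamic_power(a, c_farads, v_volts, f_hz):
    """Switched-capacitance dynamic power in watts."""
    return a * c_farads * v_volts ** 2 * f_hz

# DVS: halving V typically forces halving f too, so power drops
# roughly cubically (here: by a factor of 8).
p_full = dynamic_power(0.2, 1e-9, 1.2, 1e9)    # full speed
p_dvs = dynamic_power(0.2, 1e-9, 0.6, 0.5e9)   # half V, half f
print(p_full / p_dvs)  # -> 8.0
```

Reducing frequency alone saves power but not energy per task (the task just runs longer); reducing voltage along with frequency saves both, which is why DVS targets the voltage.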
Leakage Current
[Figure: transistor cross-sections (gate, source, drain) contrasting the no-leakage and leakage cases]
Types of Leakage
- Reverse-biased-junction leakage
- Gate-induced-drain leakage
- Sub-threshold leakage
- Gate-oxide leakage
- Gate-current leakage
- Punch-through leakage
Gate-Oxide Leakage
- Flows from the gate into the substrate
- This leakage increases exponentially as the oxide thickness decreases
- Oxide thickness must shrink along with the other reductions in geometry and supply voltage
- One solution is to use a high-k dielectric material
Sub-threshold Leakage
- The dominant leakage current
- I_sub = K1 * W * e^(-V_th / (n * V_theta)) * (1 - e^(-V / V_theta)), where W is the gate width and V_theta = kT/q is the thermal voltage
- I_sub increases exponentially as V_th decreases
- I_sub increases as temperature T increases (risk of thermal runaway)
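The two trends above can be checked by evaluating the model directly. K1, W, and n below are illustrative placeholders, not device data:

```python
import math

# Sub-threshold leakage model from the slide:
#   I_sub = K1 * W * exp(-Vth / (n * Vtheta)) * (1 - exp(-V / Vtheta))
# K1, W, and n are illustrative placeholders, not device data.

def i_sub(vth, temp_k, v=1.0, k1=1e-6, w=1.0, n=1.5):
    vtheta = 8.617e-5 * temp_k  # thermal voltage kT/q in volts
    return k1 * w * math.exp(-vth / (n * vtheta)) * (1.0 - math.exp(-v / vtheta))

# Lowering Vth by 0.2 V raises leakage by orders of magnitude ...
ratio = i_sub(vth=0.2, temp_k=300) / i_sub(vth=0.4, temp_k=300)
# ... and leakage also grows with temperature (thermal runaway risk).
hotter = i_sub(vth=0.3, temp_k=350) > i_sub(vth=0.3, temp_k=300)
print(ratio > 100, hotter)  # -> True True
```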
Reduction of Sub-threshold Leakage Current
- Reduce the supply voltage
- Reduce the size of the circuit
- Resize transistors as per performance requirements
- Dynamically cut the power supply to unused circuits
- Cooling
- Increase the threshold voltage
- Stack the off-transistors in series
- Isolate the supply through sleep transistors
- Dual threshold: higher threshold on non-critical paths
- Adaptive body biasing
Low-Power Processor Adaptations
- Adaptive cache
- Deactivates cache sets depending on the current application's characteristics
- Drowsy cache: lines in unused portions of the cache are placed in a drowsy (low-voltage) state
- Power-down of blocks that have had their last use (compiler directed)
- Adaptive instruction queues
- IPC is monitored and the queue size adjusted
- Reconfiguring multiple structures
- Structures like the instruction queue, reorder buffer, and load/store buffer are resized based on hotspots
Dynamic Voltage Scaling
- DVS is the most widely used technique for reducing power consumption.
- DVS reduces the supply voltage and slows the processor clock for workloads that have slack time available during execution.
Issues in DVS
- The unpredictable nature of workloads, as well as preemption of tasks by interrupts, creates further problems
- Learning techniques introduce learning lags, reducing their utility
- The voltage-to-power relationship is not simply quadratic, because of how I/O signals are driven (separate voltage) and how peripheral devices are managed
- Inter-task relationships suffer if one processor or core is slowed down
DVS Strategies
- Interval-based approaches
- Processor idle time in a window
- Aged averages (weighted averages with lower weights for older intervals)
- Inter-task approaches
- Voltage is linked to a task and switched along with the context switch (assumes uniform task behavior and is not aware of program structure)
- Intra-task approaches
- Many approaches, such as fixed-length timeslots (hardware), splitting a task into two sub-programs with the highest clock for the first while the second adjusts (OS), and check-pointing (compiler support)
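A minimal sketch of the interval-based "aged average" idea in Python. The frequency levels, decay factor, and the mapping from predicted utilization to frequency are illustrative assumptions, not part of any specific scheme above:

```python
# Interval-based DVS sketch: predict the next interval's utilization
# with an aged (exponentially weighted) average, then pick the lowest
# frequency that still covers the predicted demand.
# Operating points and decay factor are illustrative assumptions.

FREQ_LEVELS_MHZ = [400, 800, 1200, 1600]  # hypothetical DVS operating points

def aged_average(samples, decay=0.5):
    """Weighted average where older utilization samples get lower weight."""
    avg = samples[0]
    for s in samples[1:]:
        avg = decay * avg + (1 - decay) * s
    return avg

def pick_frequency(utilizations, f_max=1600):
    predicted = aged_average(utilizations)
    demand = predicted * f_max  # MHz needed to absorb the predicted load
    for f in FREQ_LEVELS_MHZ:
        if f >= demand:
            return f
    return f_max

print(pick_frequency([0.9, 0.5, 0.3, 0.2]))  # falling load -> 800
```

This is also where the "learning lag" issue from the previous slide shows up: the aged average needs several intervals to track a sudden change in load.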
Resource Hibernation
- Disk drives: the OS stops disk rotation during periods of inactivity. Restarting the disk adds delay, costing both performance and energy
- Predictive dynamic threshold adjustment can be helped by the OS, which can cluster requests or delay non-urgent requests
- Disk controllers can modulate speeds by looking at input queue lengths
- Network interfaces
- Displays
Compiler-Level Power Reduction
- Some performance-oriented optimizations reduce power
- e.g. common sub-expression elimination
- Some performance-oriented optimizations may increase power consumption (though not necessarily energy)
- e.g. loop unrolling
- Compilers can help reduce power through instruction selection, reducing memory accesses, structured data traversal patterns, etc.
- For mobile devices, opportunities exist for remote compilation and/or remote execution
Application-Level Power Reduction
- Application transformations
- Architecture aware
- Software architecture graph transformations to reduce inter-process communication
- Accuracy of computation traded against power
- Quality of service traded against power
Some Specific Techniques
Value Cache Approach
- Yang et al. [5] first proposed a cache-based system for transmitting frequently occurring values. With the cache size restricted to the word length (32 entries), all hits can be transmitted by toggling just one bit, with a control signal indicating hit/miss. On a miss, the original data values are sent.
[Figure: matching value caches on both sides of the data bus between the on-chip data cache and off-chip memory, with a control line indicating hit/miss]
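The hit/miss encoding can be sketched as a toy Python model. The FIFO-style replacement and the exact control signaling are simplifying assumptions, not the authors' implementation:

```python
# Toy model of a word-length value cache for bus encoding.
# On a hit, only one of the 32 bus lines toggles (a one-hot index);
# on a miss, the raw 32-bit value is sent and inserted into the cache.
# Both bus ends keep identical caches, so the receiver can decode.

class ValueCache:
    def __init__(self, size=32):
        self.size = size
        self.values = []  # oldest entry first

    def encode(self, value):
        if value in self.values:
            index = self.values.index(value)
            return ("hit", 1 << index)      # one-hot: a single bit toggles
        self.values.append(value)
        if len(self.values) > self.size:
            self.values.pop(0)              # evict the oldest entry
        return ("miss", value)              # raw value on the bus

vc = ValueCache()
print(vc.encode(0xDEADBEEF))  # first sighting: miss, raw value sent
print(vc.encode(0xDEADBEEF))  # hit at index 0: only bit 0 toggles
```

The energy win comes from frequently occurring values: a one-hot pattern switches at most two bus lines between consecutive transfers, versus up to 32 for raw data.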
TUBE Value Cache Approach
- Dinesh et al. [7] proposed a tunable approach in which the bits are separated by their activity coefficients.
- Two different caches are used: one for high-activity bits and the other for low-activity bits.
- Because of the small cache size, this captures locality in data values more effectively.
Hierarchical Value Cache Encoding [8]
- The HVCE is organized into multiple levels, with level i storing values of 32/2^(i-1) bits.
[Figure: cache hierarchy — C0 (32x32) at the top; C1 and C2 (16x16); C3 through C6 (8x8); eight 4x4 caches at the bottom]
HVCE (contd.)
- A match at a higher level implies matches at all lower levels as well.
- The highest-level match is encoded as a bit change on the 32-bit data bus.
- A 15-bit control word (one bit per cache, C0 through C14) indicates which caches have a hit. It is carried on the address bus using spare bandwidth (cycle stealing).
- Within a hit cache, the position of the switched bit indicates the matching VC entry address.
[Example: data word 10010011100001100010101010111001 encoded as 10000011100001000010101010011000, with hits in C3, C4, C5, C13, and C14; control word 000111000000011]
Value Cache Approaches
Secondary Memory Storage Management [9]
- All current storage management techniques assume magnetic storage as secondary memory, with performance optimization as the sole objective
- Flash memory is evolving as a popular alternative for secondary storage in portable devices
- Power optimization is an equally important objective alongside performance
Proposed Modifications
- The page size of 4 KB to 12 KB in use is too large for replacement; define sub-pages equal to the flash memory page size (say 256 B) as the replacement unit
- Use a battery-backed SRAM as a hot cache for the flash to avoid frequent writes
- Manage fragmentation to avoid frequent and expensive garbage collection
- Hot-cache replacement policies have to be power sensitive
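The hot-cache idea can be sketched as a toy Python model: writes to sub-pages are absorbed in SRAM and flash is written only on eviction, so repeated writes to a hot sub-page cost a single flash write. The capacity and the LRU policy are illustrative assumptions, not the proposal's actual policy:

```python
from collections import OrderedDict

# Toy model of a battery-backed SRAM "hot cache" in front of flash.
# Capacity and LRU replacement are illustrative assumptions.

class HotCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.cache = OrderedDict()  # sub-page id -> data, LRU order
        self.flash_writes = 0

    def write(self, subpage, data):
        if subpage in self.cache:
            self.cache.move_to_end(subpage)  # rewrite absorbed in SRAM
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict LRU: one flash write
            self.flash_writes += 1
        self.cache[subpage] = data

hc = HotCache(capacity=2)
for sp in (0, 1, 0, 0, 1):   # five writes, all absorbed in SRAM
    hc.write(sp, b"data")
hc.write(2, b"data")          # capacity exceeded: one flush to flash
print(hc.flash_writes)        # -> 1
```

A power-sensitive replacement policy would additionally weigh the write cost of the victim (clean vs. dirty, and how fragmented its flash block is) rather than recency alone.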
Processor Pipeline Power Reduction
- Processors today have very deep pipelines.
- Stalls occur due to data dependency, control dependency, resource contention, and cache misses.
- The stalls provide an opportunity for reducing power by clock gating the latches of the various stages.
- Other techniques taking advantage of stalls have also been reported for power reduction.
Clock Gating
- The basic technique has been clock gating of the various stage registers under stall conditions.
Pipelining Using Transparent Latches [10]
- Normal edge-triggered latches are called opaque latches; clock gating them saves energy.
- Level-triggered latches become transparent when enabled and are called transparent latches. Energy is saved by enabling these latches, which is equivalent to combining stages into longer-delay stages.
Transparent Latches
- Transparent latches introduced in the pipeline can result in energy savings. The saving depends on the distance between successive items of useful work and is thus dynamic in nature.
[Figure: pipeline stages — Fetch, Inst. Buffer, Decode, Op. fetch, ALU Exec., Write back]
Stall Cycles Redistribution [11]
- Stall cycles, if redistributed, can help save energy. The delay is also referred to as slack.
- Slacks longer than the pipeline depth are of no use.
- This information is stored with the fetch group as extra bits in the BTB. Both the predicted slack and a confidence level are encoded in these bits.
- The fetch of instructions with slack is delayed by up to the pipeline depth.
- The authors report up to 50% reduction in the front-end (I-cache, branch predictor, and front-end latches) energy-delay product. Transparent latches alone give a negligible (5%) reduction, whereas prediction combined with clock gating is effective, reducing it by 25%. The loss in performance is less than 2%.
Speculative Execution Energy Reduction
- High-performance processors are all pipelined.
- The current trend is to support speculative instruction execution and squash the instructions in the execution pipeline before the write-back stage if the speculation proves wrong.
- This ensures correctness but is power inefficient, since the energy consumed by the squashed instructions is wasted.
Basic Strategy
- The basic strategy is to reduce the pipeline feed by gating the fetch stage as the probability of instructions finishing decreases.
[Figure: clock gating applied to the fetch stage]
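The fetch-gating strategy can be sketched in Python, in the spirit of the pipeline-gating work of Manne et al. cited in the references: stop fetching when too many unresolved low-confidence branches are in flight, since their wrong-path instructions would likely be squashed. The threshold and the confidence estimator are illustrative assumptions:

```python
# Sketch of confidence-based pipeline gating: fetch is disabled once
# the count of in-flight low-confidence branches reaches a threshold.
# The threshold value is an illustrative assumption.

class FetchGate:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.low_confidence_branches = 0  # unresolved risky branches

    def on_branch_fetched(self, confident):
        if not confident:
            self.low_confidence_branches += 1

    def on_branch_resolved(self, confident):
        if not confident:
            self.low_confidence_branches -= 1

    def fetch_enabled(self):
        # Gate fetch once unresolved low-confidence branches pile up.
        return self.low_confidence_branches < self.threshold

gate = FetchGate(threshold=2)
gate.on_branch_fetched(confident=False)
print(gate.fetch_enabled())   # True: one risky branch in flight
gate.on_branch_fetched(confident=False)
print(gate.fetch_enabled())   # False: fetch gated until a branch resolves
```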
Approaches
Low-Power Multipliers
- Array multipliers and shift-and-add multipliers
- Fixed-coefficient multipliers: DSP applications like filters/FFT/DCT, and functions like sine and cosine computation
- Booth multipliers: CSD coding to reduce the number of additions and subtractions
CSD Coding [1]
- Replace a string of 1s (longer than two) by a 1 and a -1 to reduce the number of operations to 2
- Example: 00111101 (= 61) becomes 0 1 0 0 0 -1 0 1 (64 - 4 + 1 = 61), turning four consecutive 1s into two nonzero digits
- Modified CSD coding considers a set of coefficients together instead of one at a time, to increase the number of zero columns; these can be removed to save area.
Synergistic Temperature and Energy Management [17]
- GALS (globally asynchronous, locally synchronous) architectures are getting popular
- There are many clock domains which interact asynchronously
- A temperature rise in one domain can be addressed by reducing only that domain's clock frequency, thus limiting the impact on performance
- In a synchronous system, reducing the clock frequency would reduce overall performance
- Slowing one domain introduces slack in the other domains due to their excess capacity; this can be exploited to reduce their clocks as well and save further energy
- Cooling the adjoining domains also helps reduce heat in the affected domain by providing a higher temperature gradient
References
- For the first part:
- V. Venkatachalam and M. Franz, "Power Reduction Techniques for Microprocessor Systems," ACM Computing Surveys, Vol. 37, No. 3, Sep. 2005, pp. 195-237
- Low-power multipliers:
- ISLPED 2006 paper
- Register file power reduction:
- ISLPED 2006 paper
- Associative memory:
- J. Sharkey et al., "Power Efficient Wakeup Tag Broadcast," ICCD 2005
- ISLPED 2006 paper
- Off-chip bus power reduction:
- J. Yang et al., "FV Encoding for Low Power Data I/O," ISLPED 2001
- Basu et al., "Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses," MICRO 2002
- Dinesh et al., "A Tunable Encoder for Off-Chip Buses," ISLPED 2005
- ISLPED 2006 paper
- Secondary storage management:
- ISLPED 2006 paper
References (contd.)
- Processor pipeline power reduction:
- H. M. Jacobson, "Improved Clock Gating through Transparent Pipelining," ISLPED 2004, Aug. 2004
- ISLPED 2006 paper
- Speculative execution energy reduction:
- Aragon et al., "Power-Aware Control Speculation through Selective Throttling," HPCA-9, pp. 103-112, Feb. 2003
- Baniasadi et al., "Instruction Flow-Based Front-End Throttling for Power-Aware High-Performance Processors," ISLPED '01, pp. 16-21, Aug. 2001
- Buyuktosunoglu et al., "Energy Efficient Co-Adaptive Instruction Fetch and Issue," ISCA '03, pp. 147-156, June 2003
- Manne et al., "Pipeline Gating: Speculation Control for Energy Reduction," ISCA '98, pp. 132-141, June 1998
- ISLPED 2006 paper
- Synergistic temperature and energy management:
- ISLPED 2006 paper