Title: Low Power Processor Design: Part I
1. Low Power Processor Design: Part I
2. Contents
- Introduction
- Processor development history
- Data storage and movement
- Control and sequencing
- Component
- Processor Power
3. Components of Computation
- Any algorithm is a sequence of steps
- Any algorithm execution has the following
components:
  - Computation: perform the computation
  - Data: bring the data to the compute units and
    take away the results
  - Control: schedule and activate the steps and
    generate the associated control signals to
    bring/take away the data and perform the
    computation
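To make the three components concrete, here is a minimal sketch of a toy single-accumulator machine. The opcode names (`LOAD`, `ADD`, `STORE`) and the `run` function are illustrative assumptions, not taken from the slides. Comments mark which lines play the role of computation, data movement, and control.

```python
def run(program, memory):
    """Execute a list of (opcode, address) pairs on a toy accumulator machine."""
    acc = 0   # single accumulator register
    pc = 0    # program counter
    while pc < len(program):
        opcode, addr = program[pc]      # control: fetch and decode the step
        if opcode == "LOAD":
            acc = memory[addr]          # data: bring the operand to the compute unit
        elif opcode == "ADD":
            acc = acc + memory[addr]    # computation (the operand read is data movement)
        elif opcode == "STORE":
            memory[addr] = acc          # data: take the result away
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc += 1                         # control: schedule the next step
    return memory


# Compute memory[2] = memory[0] + memory[1]
print(run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], [3, 4, 0]))
```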
4. Processors: What is different?
- Processors vis-à-vis ASICs are distinguished (or
identified) by instructions being associated
with control
- These instructions isolate the user from the
micro-architectural details and provide an easy
programming paradigm
[Figure: Instructions; Micro-architecture]
5. Processor Developments 1971-2008
- Processor architectures have developed over the
years in all three components, with the
primary focus on increasing performance
- Apart from performance, the complexity of
applications and portability across platforms
have also driven architectural developments
- More recently, we have been forced to look into
the power impact of these architectural
developments. The trend now, exemplified by
multi-core design, is to look at performance-power
benefits
6. Computation in Processors
- Single ALU for add/sub/logic ops
- Increase in effective rate of computation per
unit time (integer ops or floating-point ops per
second)
- Increase in data word-length
- Dedicated hardware op. units
- Pipelined ALUs
- Vector ops
- Multiple ALUs (superscalar/VLIW)
7. Data Movement in Processors
- Memory, single accumulator
- Increase in effective rate of memory access per
unit time
- Secondary storage / virtual memory
- Register file
- Cache
- Bypass paths
- Write/read buffers
8. Instruction Flow
- Instruction register, decoder
- Increase in effective rate of instructions
executed per unit time (MIPS)
- Microprogramming
- Instruction prefetch
- Instruction pipelining
- Speculative execution
- Issue units
9. Register File Power Reduction [2]
- A large register file with multiple ports is
key to processor performance measured in terms
of IPC (instructions per cycle).
- Such large register files consume considerable
energy and add delay.
- Many techniques have been used to reduce RF
access requirements. Many of them revolve
around de-allocating existing registers as
soon as possible, which reduces
register pressure and thus improves performance
for a given RF.
- Here we discuss a technique that reduces writes to
the RF and instead uses the bypass network to feed
the operand.
10. Multi-port Register File
[Figure: multi-ported RF and two ALUs]
11. Multi-port RF Access Energy
[Figure labels: ReadPort 1, WriteAdr 1, ReadAdr 1, ReadPort 2, ReadAdr 2]
12. Writeback Avoidance Condition
- A result value is a transient if all of the
following hold:
  - The value is short-lived. A value generated
    by instruction x is short-lived if the target
    register of x is redefined by another instruction
    before x is written back.
  - There is only one consumer for this value.
  - The consuming instruction is issued before the
    value is produced.
  - There is no branch instruction between the
    value producer and re-definer.
  - The sole consumer of the value is not
    subject to replay caused by load latency
    mis-prediction or memory dependence
    mis-prediction.
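The conditions above can be read as a single conjunction. The following is a minimal sketch of that reading; the `ResultInfo` record and its field names are hypothetical and chosen only for illustration, and real hardware would track this state with per-register bits rather than a data structure like this.

```python
from dataclasses import dataclass


@dataclass
class ResultInfo:
    """Facts about one produced value (all field names are illustrative assumptions)."""
    redefined_before_writeback: bool          # target register redefined before x writes back
    num_consumers: int                        # number of instructions that read the value
    consumer_issued_before_produced: bool     # consumer will pick the value off the bypass
    branch_between_producer_and_redefiner: bool
    consumer_may_replay: bool                 # load-latency or memory-dependence mis-prediction


def is_transient(r: ResultInfo) -> bool:
    """Return True only if every listed condition holds."""
    return (
        r.redefined_before_writeback
        and r.num_consumers == 1
        and r.consumer_issued_before_produced
        and not r.branch_between_producer_and_redefiner
        and not r.consumer_may_replay
    )


short_lived = ResultInfo(True, 1, True, False, False)
two_readers = ResultInfo(True, 2, True, False, False)
print(is_transient(short_lived), is_transient(two_readers))
```

A value with two consumers fails the test even if it is short-lived, so it would still be written back to the register file.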
13. Selective Writeback
- Transient values need not be written back into
the register file but can be sent directly to the
ALU execution units through the bypass network.
- This requires three bits per register file to
make sure the required conditions for a transient
value are met. Further, check-pointing is used for
rolling back in case of interrupts, etc.
14. Register File Bypass Path
[Figure: multi-ported RF, bypass buffer, and two ALUs]
15. Results of Selective Write-back
- Write energy is 1.8 times read energy
- 45% of the produced results are transients and
need not be written back
- This results in a 36% reduction in energy
consumption in the RF. Assuming the RF itself takes
10% to 25% of the overall energy, this results in
a 3% to 7% overall reduction.
- It also improves the performance of the base
processor by 10.9%. Over related techniques it improves
performance by 5% to 10%.
- The technique can be used to reduce the number of
registers and/or ports and thus save energy.
16. Associative Memory
- A number of associative memories are
used in modern processor designs. These include
the DTLB and ITLB (data and instruction TLBs) and the
STQ and LDQ (store and load queues).
- These are energy-intensive components, as both
broadcast and concurrent comparison across
all the key elements are involved at each step.
- As the key data repeats frequently, this repetition can
be exploited for energy reduction [3].
17. Associative Memory Structure
[Figure: broadcast key; entries key 0/data 0, key 1/data 1, ..., key (n-1)/data (n-1); comparator, match, encoder, multiplexer]
18. Search Key Memoization [3]
- The search key is split into a high-order part (H)
and a low-order part (L). Typically the high-order
bits repeat frequently.
- The optimal dividing line between H and L bits
varies from application to application. It is also
a function of the address allocation strategy of the OS.
- If the H part of the key repeats, it is neither
broadcast nor compared again. The result of the
previous match for the H bits is stored in a
flip-flop in each entry and is reused.
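The memoization idea can be sketched in software as follows. This is an illustrative model only: the class name `MemoizedCAM`, the counter `h_compares`, and the way a repeated H part is detected (by remembering the previous search key's H bits) are assumptions, and entry updates are not modelled.

```python
class MemoizedCAM:
    """Toy associative memory that skips the H comparison when H repeats."""

    def __init__(self, keys, l_bits):
        self.keys = list(keys)
        self.l_bits = l_bits
        self.l_mask = (1 << l_bits) - 1
        self.last_h = None                        # H part of the previous search key
        self.h_match = [False] * len(self.keys)   # one "flip-flop" per entry
        self.h_compares = 0                       # per-entry H comparisons performed

    def search(self, key):
        h, l = key >> self.l_bits, key & self.l_mask
        if h != self.last_h:
            # H changed: broadcast it and compare it in every entry
            self.h_match = [(k >> self.l_bits) == h for k in self.keys]
            self.h_compares += len(self.keys)
            self.last_h = h
        # L is always broadcast and compared; the H result comes from the flip-flops
        for i, k in enumerate(self.keys):
            if self.h_match[i] and (k & self.l_mask) == l:
                return i
        return None


cam = MemoizedCAM([0x1234, 0x1250, 0x9901], l_bits=8)
print(cam.search(0x1234), cam.search(0x1250), cam.search(0x9901), cam.h_compares)
```

In this run the second search shares its H part (0x12) with the first, so only the L bits are compared for it. The counter `h_compares` is a rough proxy for the comparison energy that is avoided.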
19. Key Memoization Structure
[Figure: KeyH and KeyL with separate comparators against key 0H and key 0L; signals match, clock, drive-upper]
20. Results
- For a 40-bit virtual address, the size of L varied
from 10 to 20 bits
- DTLB power consumption was reduced by 70% and
ITLB by 93%. For the ITLB, L was only 3 bits
- A split into more than two parts could reduce power further.
With 3 components, DTLB power reduction went up
to 81%
21. Off-Chip Bus Energy Reduction
- Off-chip busses typically connect the cache to
main memory (and possibly other peripherals).
- Studies show they also consume 10% to 23% of
system power.
- Techniques for data encoding have been proposed
to reduce the activity on these busses.
- More recently, value caches on both sides with
associated encoding have been used for power
reduction.
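The slides do not name a specific data encoding. As one well-known example from the literature, bus-invert coding sends the complement of a word (plus one extra "invert" line) whenever that would toggle fewer bus lines. The sketch below illustrates that technique under simple assumptions (a 32-bit bus that starts at all zeros); the function names are illustrative and it is not claimed to be the method of the cited references.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=32):
    """Return a list of (data_lines, invert_line) bus states."""
    mask = (1 << width) - 1
    bus = 0                      # current state of the data lines
    out = []
    for w in words:
        if hamming(bus, w) > width // 2:
            bus, inv = ~w & mask, 1   # fewer toggles if the complement is sent
        else:
            bus, inv = w, 0
        out.append((bus, inv))
    return out


def bus_invert_decode(states, width=32):
    mask = (1 << width) - 1
    return [(~d & mask) if inv else d for d, inv in states]


def line_transitions(states):
    """Total bit toggles over a sequence of (data, invert) states, starting from zeros."""
    total, prev_d, prev_i = 0, 0, 0
    for d, i in states:
        total += hamming(prev_d, d) + (prev_i ^ i)
        prev_d, prev_i = d, i
    return total


words = [0x00000000, 0xFFFFFFFF, 0x00000001, 0xFFFFFFFE]
encoded = bus_invert_encode(words)
assert bus_invert_decode(encoded) == words
print(line_transitions([(w, 0) for w in words]), line_transitions(encoded))
```

The two printed numbers are the toggle counts without and with encoding for this (deliberately adversarial) sequence; bus activity of this kind is what the encoding techniques aim to reduce.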
22. Value Cache Approach
- Yang et al. [4] first proposed a cache-based system for
transmitting frequently occurring values. With the
cache size restricted to the word length (32), all
hits can be transmitted by toggling just one bit,
with a control signal indicating hit/miss. On a miss, the
original data values are sent.
[Figure: on-chip data cache and off-chip memory, each with a value cache, connected by control and data bus]
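A minimal sketch of the value cache idea, assuming both sides keep identical 32-entry caches. The `Endpoint` class, the round-robin replacement policy, and updating the cache on every miss are assumptions made for illustration; the scheme in [4] may differ in these details.

```python
WIDTH = 32  # bus width; the value cache holds at most WIDTH entries


class Endpoint:
    """One side of the bus, with its own copy of the value cache."""

    def __init__(self):
        self.cache = []   # value cache, kept identical on both sides
        self.slot = 0     # round-robin replacement pointer (an assumption)
        self.bus = 0      # last observed state of the data lines

    def _remember(self, value):
        if len(self.cache) < WIDTH:
            self.cache.append(value)
        else:
            self.cache[self.slot] = value
            self.slot = (self.slot + 1) % WIDTH

    def send(self, value):
        """Return (hit, new bus state) for one data word."""
        if value in self.cache:
            self.bus ^= 1 << self.cache.index(value)   # hit: toggle exactly one line
            return True, self.bus
        self._remember(value)
        self.bus = value                               # miss: drive the original word
        return False, self.bus

    def receive(self, hit, bus):
        if hit:
            index = (bus ^ self.bus).bit_length() - 1  # which single line toggled
            value = self.cache[index]
        else:
            value = bus
            self._remember(value)
        self.bus = bus
        return value


tx, rx = Endpoint(), Endpoint()
data = [0, 0xFFFFFFFF, 0, 0, 0xFFFFFFFF, 7]
received = [rx.receive(*tx.send(v)) for v in data]
assert received == data
```

On a hit the receiver recovers the cache index from the single line that changed, so a repeated value costs one bit toggle on the data bus plus the hit/miss control signal.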
23. References
- [1] V. Venkatachalam and M. Franz, "Power Reduction
Techniques for Microprocessor Systems," ACM
Computing Surveys, Vol. 37, No. 3, Sep. 2005, pp.
195-237.
- [2] D. Balkan et al., "Selective writeback exploiting
transient values for energy-efficiency and
performance," ISLPED 2006, Oct. 2006, pp. 37-42.
- [3] J. Sharkey et al., "Power efficient wakeup tag
broadcast," ICCD 2005, Oct. 2005, pp. 654-661.
- [4] J. Yang et al., "FV encoding for low power data
I/O," ISLPED 2001, Aug. 2001, pp. 84-87.