Title: Decomposition of Instruction Decoder for Low Power Design
1 Decomposition of Instruction Decoder for Low Power Design
- TingTing Hwang
- Department of Computer Science
- Tsing Hua University
2 Power Dissipation
- Static dissipation due to leakage current
- Short-circuit dissipation
- Charge and discharge of output load capacitor
4 Dynamic Power Dissipation Model
- P: power dissipation
- C: load capacitance
- E: avg. transition count of the gate per clock cycle
- Vdd: supply voltage
- Tcyc: clock period
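Note: the equation on this slide did not survive extraction. The symbols listed match the widely used switching-power model P = 0.5 * C * Vdd^2 * E / Tcyc, so the sketch below assumes that form. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def dynamic_power(c_load, e_avg, vdd, t_cyc):
    """Average switching power of one gate.

    Assumes the standard model P = 0.5 * C * Vdd^2 * E / Tcyc,
    which is not stated explicitly in the slides.
    """
    if t_cyc <= 0:
        raise ValueError("clock period must be positive")
    return 0.5 * c_load * vdd ** 2 * e_avg / t_cyc


# Hypothetical values: 10 fF load, 0.2 transitions per cycle, 1.8 V, 10 ns clock.
p = dynamic_power(c_load=10e-15, e_avg=0.2, vdd=1.8, t_cyc=10e-9)
print(f"{p:.3e} W")
```

Under this model, lowering E (the switching activity) reduces power without touching supply voltage or clock period, which is the lever the decoder decomposition aims at.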
6 Motivation
- Execution frequency of instructions is uneven
- Take the MOV class as an example
- three instructions
- 22% execution frequency
(Profiling from Powerstone)
7 Coupling Sub-decoders
- Partition an instruction decoder into two coupling sub-decoders
- The smaller decoder decodes only a small number of instructions
- When the smaller decoder is active, the larger decoder is turned off
- The smaller decoder is active frequently
8 Architecture of Coupling Sub-decoders
- Controls to turn on/off sub-decoders
- Activate-Control
- Input AND-OR
- Output OR
[Figure: instruction flip-flops FF1 ... FFn feed I-Activate Control and S-Activate Control, which drive I-Control0/I-Control1 and S-Control0/S-Control1 to gate the inputs of sub-decoders I-Decoder0/I-Decoder1 and S-Decoder0/S-Decoder1; the sub-decoder outputs are combined into each output bit]
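Note: a minimal behavioral sketch of the idea on slides 7 and 8, written under several assumptions. The opcodes, control words, and function names are hypothetical. Input gating is modeled as a plain AND mask, and each sub-decoder is assumed to output all zeros when its inputs are gated to zero, so that a bitwise OR can merge the outputs.

```python
# Hypothetical 4-bit opcodes mapped to 3-bit control words (illustration only).
SMALL_TABLE = {0b0001: 0b001, 0b0010: 0b010}                  # frequent instructions
LARGE_TABLE = {0b0100: 0b100, 0b1000: 0b110, 0b1100: 0b111}   # everything else


def activate_control(opcode):
    """Return (small_on, large_on); exactly one sub-decoder is enabled."""
    small_on = opcode in SMALL_TABLE
    return small_on, not small_on


def gate_inputs(opcode, enable, width=4):
    """AND every input bit with the enable signal.

    A disabled sub-decoder sees constant zeros, so its inputs do not toggle.
    """
    mask = (1 << width) - 1 if enable else 0
    return opcode & mask


def sub_decode(table, gated_opcode):
    """Lookup-table decoder; assumed to output zeros for the gated (all-zero) input."""
    return table.get(gated_opcode, 0)


def decode(opcode):
    small_on, large_on = activate_control(opcode)
    out_small = sub_decode(SMALL_TABLE, gate_inputs(opcode, small_on))
    out_large = sub_decode(LARGE_TABLE, gate_inputs(opcode, large_on))
    return out_small | out_large   # output OR merges the two sub-decoders


print(bin(decode(0b0001)), bin(decode(0b1000)))
```

The sketch only shows functional equivalence with a single decoder; the power benefit comes from the inactive sub-decoder's inputs staying constant, which a behavioral model like this does not measure.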
9 Instruction Grouping Problem
- How to decompose the decoder so that
- the smaller sub-decoder is small
- the smaller sub-decoder is executed frequently
- the activate logic is small
10 Weighted Graph Model of Execution Sequence
- Node: instruction type
- Edge (U,V): instruction U (V) executed after V (U)
- Weights on nodes and edges: execution frequency
[Figure: example weighted graph with nodes mov, mul, ldr, b, cmp; node and edge labels are execution frequencies]
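Note: a minimal sketch of how such a graph could be built from an execution trace. The trace below is made up for illustration and is not the example from the figure; the function name is hypothetical. Edges are treated as undirected, following the (U,V) definition above.

```python
from collections import Counter


def build_transition_graph(trace):
    """Count node weights and undirected edge weights from an instruction trace.

    Node weight: how often an instruction type executes.
    Edge weight: how often two types execute back to back, in either order.
    """
    node_w = Counter(trace)
    edge_w = Counter()
    for prev, cur in zip(trace, trace[1:]):
        # frozenset makes (U,V) and (V,U) the same edge; a one-element set is a self-loop.
        edge_w[frozenset((prev, cur))] += 1
    return node_w, edge_w


# Hypothetical trace of instruction types.
trace = ["ldr", "mov", "mov", "cmp", "b", "ldr", "mov", "mul"]
node_w, edge_w = build_transition_graph(trace)
print(node_w["mov"], edge_w[frozenset(("ldr", "mov"))])
```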
11 Power Model
[Figure: the example graph (nodes mov, mul, ldr, cmp, b) partitioned into two groups, Mi and Mj]
- SFi: transition frequency from Mi to Mi
- CFij: transition frequency between Mi and Mj
- Poweri: power of Mi, estimated by Synopsys
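Note: the formula that combines these quantities is not preserved in the slides, so the sketch below stops at computing SFi and CFij from a weighted transition graph and a given grouping. The edge weights, the grouping, and the function name are hypothetical.

```python
from collections import defaultdict


def group_frequencies(edge_w, group_of):
    """Compute self and cross transition frequencies of instruction groups.

    edge_w:   dict mapping (u, v) instruction pairs to transition counts
    group_of: dict mapping each instruction to its group index
    Returns (sf, cf): sf[i] sums transitions that stay inside group i,
    cf[(i, j)] with i < j sums transitions between groups i and j.
    """
    sf = defaultdict(int)
    cf = defaultdict(int)
    for (u, v), w in edge_w.items():
        gi, gj = group_of[u], group_of[v]
        if gi == gj:
            sf[gi] += w
        else:
            cf[tuple(sorted((gi, gj)))] += w
    return dict(sf), dict(cf)


# Hypothetical transition counts and a two-group assignment.
edge_w = {("mov", "mov"): 1, ("ldr", "mov"): 2, ("mov", "cmp"): 1,
          ("cmp", "b"): 1, ("b", "ldr"): 1, ("mov", "mul"): 1}
group_of = {"mov": 0, "ldr": 0, "cmp": 1, "b": 1, "mul": 1}
sf, cf = group_frequencies(edge_w, group_of)
print(sf, cf)
```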
12 Instruction Grouping Problem: Graph Partitioning
- Generation of transition graph
- Initial clustering by random walk
- Initial partition of clusters
- Iterative improvement by moving clusters among groups
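Note: a minimal sketch of the last step only (iterative improvement by moving clusters among groups), written as a greedy hill climb. This is an assumed strategy, not necessarily the one used in the work. The cost function here is plain cut weight over hypothetical edges; the slides' actual objective is the power model, whose formula is not preserved. The random-walk clustering step is not shown.

```python
def improve(groups, cost, max_passes=10):
    """Move one cluster at a time to another group whenever cost(groups) drops.

    groups: list of sets of cluster ids; cost: callable on that list.
    A group is never emptied. Returns (groups, best_cost).
    """
    groups = [set(g) for g in groups]
    best = cost(groups)
    for _ in range(max_passes):
        improved = False
        for src in range(len(groups)):
            for cluster in sorted(groups[src]):      # sorted for determinism
                if len(groups[src]) == 1:
                    break                            # keep every group non-empty
                for dst in range(len(groups)):
                    if dst == src:
                        continue
                    groups[src].remove(cluster)      # tentative move
                    groups[dst].add(cluster)
                    new_cost = cost(groups)
                    if new_cost < best:
                        best, improved = new_cost, True
                        break                        # keep the move
                    groups[dst].remove(cluster)      # undo the move
                    groups[src].add(cluster)
        if not improved:
            break
    return groups, best


# Hypothetical cluster graph: heavy edges a-b and c-d, light edge a-c.
EDGES = {("a", "b"): 10, ("c", "d"): 10, ("a", "c"): 1}


def cut_weight(groups):
    """Stand-in cost: total weight of edges whose endpoints lie in different groups."""
    where = {c: i for i, g in enumerate(groups) for c in g}
    return sum(w for (u, v), w in EDGES.items() if where[u] != where[v])


final_groups, final_cost = improve([{"a", "c"}, {"b", "d"}], cut_weight)
print(final_groups, final_cost)
```

A greedy pass like this can stall in a local minimum, which is one reason a good initial clustering matters before improvement starts.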
13 Experimental Process
- ARM7TDMI
- Circuit described in Verilog
- Circuit synthesized by Synopsys Design Compiler
- Power estimated by PrimePower; switching activities are collected by simulating the Powerstone benchmark set
14 Results on Two-way Decomposition
15 Power Consumption Comparisons
16 Critical Path and Area Comparisons
- Shorter critical path timing
- Area overhead
17 Results on Multiple-way Decomposition
18 Power Consumption for Different Multi-way Grouping
- Two-way decomposition has best power reduction
- more groups → more overhead
[Chart: power (W), axis from 0 to 5.E-04, for Original, 2way, 3way, and 4way decoders; each bar split into Decoder and Overhead]
19 Critical Path Timing for Different Multi-way Grouping
- Four-way decomposition has best timing reduction
[Chart: critical path timing for different multi-way groupings]
20 Area Comparisons
- Area for different multi-way grouping
[Chart: area, axis from 30000 to 42000, for Original, 2way, 3way, 4way, and 5way decoders]
21 Conclusions
- Two-way partitioning has the best results for the 142-instruction set
- Compared to un-decomposed decoder:
- 30% reduction in power consumption
- 13% improvement in critical path timing
- Compared to un-decomposed control-U:
- 19% reduction in power consumption
- 12% improvement in critical path timing