Title: Automatically Generating Custom Instruction Set Extensions
1 Automatically Generating Custom Instruction Set Extensions
- Nathan Clark, Wilkin Tang, Scott Mahlke
- Workshop on Application Specific Processors
2 Problem Statement
- There's a demand for high-performance, low-power special-purpose systems
  - E.g. cell phones, network routers, PDAs
- One way to achieve these goals is augmenting a general-purpose processor with Custom Function Units (CFUs)
  - Combine several primitive operations
- We propose an automated method for CFU generation
3 System Overview
4 Example
[Figure: dataflow graph with nodes 1 through 8]
- Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
5 Example
[Figure: dataflow graph with nodes 1 through 8]
- Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4}, {2,4,5}, {2,6,7}
6 Example
[Figure: dataflow graph with nodes 1 through 8]
- Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, {1,3,4,5,8}
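The way candidates grow across the three example slides can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each pair listed on the first example slide is an edge of the dataflow graph, treats edges as undirected when checking connectivity, and grows every candidate by one adjacent node per step. The names `EDGES`, `adjacency`, and `grow_candidates` are hypothetical.

```python
# Assumption: the pairs on the first example slide are the DFG edges.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def grow_candidates(edges, max_ops):
    """Return {size: set of frozensets}, each frozenset a connected node set.

    Size-2 candidates are the edges themselves; each larger candidate is a
    smaller one extended by a single neighbouring node.
    """
    adj = adjacency(edges)
    by_size = {2: {frozenset(e) for e in edges}}
    for size in range(3, max_ops + 1):
        grown = set()
        for cand in by_size[size - 1]:
            # Nodes adjacent to the candidate but not yet part of it.
            frontier = set().union(*(adj[n] for n in cand)) - cand
            for n in frontier:
                grown.add(cand | {n})
        by_size[size] = grown
    return by_size


if __name__ == "__main__":
    for size, cands in grow_candidates(EDGES, 5).items():
        print(size, sorted(sorted(c) for c in cands))
```

The sketch enumerates every connected node set up to `max_ops`, so it produces a superset of the candidates shown on the slides; a practical system would prune this set before characterization.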
7 Characterization
- Use the macro library to get information on each potential CFU
  - Latency is the sum of each primitive's latency
  - Area is the sum of each primitive's macrocell area
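The two sums above can be sketched in a few lines. The library contents below are illustrative placeholders, not values from the authors' macro library; the units (latency as a fraction of a clock cycle, area in adder-equivalents) are assumptions, and `Macro`, `MACRO_LIBRARY`, and `characterize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Macro:
    latency: float  # assumed unit: fraction of one clock cycle
    area: float     # assumed unit: adder-equivalents


# Hypothetical macro library; the numbers are for illustration only.
MACRO_LIBRARY = {
    "ADD": Macro(latency=0.6, area=1.0),
    "AND": Macro(latency=0.1, area=0.1),
    "XOR": Macro(latency=0.1, area=0.1),
    "LSL": Macro(latency=0.1, area=0.1),
}


def characterize(ops):
    """Return (latency, area) of a candidate CFU as sums over its primitives."""
    latency = sum(MACRO_LIBRARY[op].latency for op in ops)
    area = sum(MACRO_LIBRARY[op].area for op in ops)
    return latency, area


if __name__ == "__main__":
    print(characterize(["AND", "ADD", "XOR"]))
```

Summing every primitive's latency matches the slide's rule and is conservative for candidates in which some operations could run in parallel.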
8 Issues We Consider
- Performance
  - On critical path
  - Cycles saved
- Cost
  - CFU area
  - Control logic
    - Difficult to measure
  - Decode logic
    - Difficult to measure
  - Register file area
    - Can be amortized
[Figure: dataflow graph with operations LD, AND, ADD, ASL, ADD, XOR, BR; AND, ASL, and XOR are each annotated with the values 1 and 0.1, and each ADD with the values 1 and 0.6]
9 More Issues to Consider
- IO
  - Number of input and output operands
- Usability
  - How well can the compiler use the pattern
[Figure: example pattern with operations OR, LSL, AND, CMPP]
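As a rough illustration of the IO issue, the sketch below counts the input and output operands of a candidate taken from the earlier eight-node example graph. It assumes the listed pairs are directed (producer, consumer) edges, and it counts only values produced by other nodes in the graph; live-in registers and literals are not modelled. The port limits passed to `fits_ports` are hypothetical parameters, not figures from the slides.

```python
# Assumption: pairs from the example slides, read as (producer, consumer).
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def io_counts(nodes, edges):
    """Return (inputs, outputs) for the candidate made of `nodes`.

    inputs:  distinct outside producers feeding a node inside the candidate
    outputs: distinct inside nodes whose value is consumed outside it
    """
    nodes = set(nodes)
    inputs = {src for src, dst in edges if dst in nodes and src not in nodes}
    outputs = {src for src, dst in edges if src in nodes and dst not in nodes}
    return len(inputs), len(outputs)


def fits_ports(nodes, edges, max_inputs, max_outputs):
    """Check a candidate against hypothetical register-port limits."""
    n_in, n_out = io_counts(nodes, edges)
    return n_in <= max_inputs and n_out <= max_outputs


if __name__ == "__main__":
    print(io_counts({4, 5}, EDGES))
    print(fits_ports({2, 4, 6}, EDGES, max_inputs=2, max_outputs=1))
```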
10 Selection
- Currently use a greedy algorithm
  - Pick the best performance gain / area first
- Can yield bad selections
[Figure: example pattern with operations OR, LSL, AND, CMPP]
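A greedy pass of this kind can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' selector: candidates are plain `(name, cycles_saved, area)` tuples with positive area, the limit is a maximum number of CFUs, and gains are not re-estimated after each pick. The names and the sample numbers are hypothetical.

```python
def greedy_select(candidates, max_cfus):
    """Pick up to `max_cfus` candidates in order of cycles saved per unit area.

    candidates: iterable of (name, cycles_saved, area) tuples, area > 0.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    return [name for name, _gain, _area in ranked[:max_cfus]]


if __name__ == "__main__":
    # Made-up numbers: "small" has the best ratio, "big" saves the most cycles.
    candidates = [("small", 2, 1.0), ("big", 10, 6.0), ("medium", 4, 3.0)]
    print(greedy_select(candidates, max_cfus=1))
```

With a limit of one CFU the sketch picks `small` (ratio 2.0) even though `big` would save five times as many cycles, which is one way a ratio-first greedy choice can yield a bad selection.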
11 Case Study 1: Blowfish
- Speedup: 1.24
- 10 cycles can be compressed down to 2!
- Cost: 6 adders
- 6 inputs, 2 outputs
- C code this DFG came from:
  - r ^= (((s[(t>>24)] +
  - s[0x0100+((t>>16)&0xff)]) ^
  - s[0x0200+((t>>8)&0xff)]) +
  - s[0x0300+((t&0xff))]) & 0xffffffff
[Figure: DFG with operations ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD and operands r65, r70, r76, r81, -1, r891, 16, 255, 256, 2, r91]
12 Case Study 2: ADPCM Decode
- Speedup: 1.20
- 3 cycles can be compressed down to 1
- Cost: 1.5 adders
- 2 inputs, 2 outputs
- C code this DFG came from:
  - d = d & 7
  - if ( d & 4 )
[Figure: DFG with operations AND, AND, CMPP and operands r16, 7, 4, 0]
13 Experimental Setup
- CFU recognition implemented in the Trimaran research infrastructure
- Speedup shown is with CFUs relative to a baseline machine
  - Four-wide VLIW with predication
  - Can issue at most one integer, floating-point, memory, and branch instruction per cycle
  - 300 MHz clock
- CFU latency is estimated using standard cells from the Synopsys design library
14 Varying the Number of CFUs
- More CFUs yield more performance
- Weakness in our selection algorithm causes plateaus
15 Varying the Number of Ops
- Bigger CFUs yield better performance
- If they're too big, they can't be used as often and they expose alternate critical paths
16 Related Work
- Many people have done this for code size
  - Bose et al., Liao et al.
- Typically done with traces
  - Arnold et al.
- Previous paper used a more enumerative discovery algorithm
- We are unique because
  - Compiler-based approach
  - Novel analysis of CFUs
17 Conclusion and Future Work
- CFUs have the potential to offer big performance gains for small cost
- Recognize more complex subgraphs
  - Generalized acyclic/cyclic subgraphs
- Develop our system to automatically synthesize application-tailored coprocessors