1
Automatically Generating Custom Instruction Set
Extensions
  • Nathan Clark, Wilkin Tang, Scott Mahlke
  • Workshop on Application Specific Processors

2
Problem Statement
  • There's a demand for high-performance, low-power
    special-purpose systems
  • E.g., cell phones, network routers, PDAs
  • One way to achieve these goals is to augment a
    general-purpose processor with Custom Function
    Units (CFUs)
  • CFUs combine several primitive operations into one unit
  • We propose an automated method for CFU generation

3
System Overview
4
Example
(Dataflow graph with operations numbered 1-8)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
5
Example
(Same dataflow graph as the previous slide)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4}, {2,4,5}, {2,6,7}
6
Example
(Same dataflow graph as the previous slides)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, {1,3,4,5,8}
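The progression across these three example slides (seed with connected pairs, then repeatedly extend each candidate with one adjacent operation) can be sketched as a small enumeration pass over the dataflow graph. The C sketch below only illustrates that growth idea on the example DFG's node numbering and edges; the data structures and size limits are assumptions made here, not the system's actual discovery algorithm, which also prunes candidates using the criteria on the later slides.

    /* Illustrative sketch: grow candidate CFUs from connected pairs.
     * Node numbers and edges are taken from the example DFG above;
     * MAX_SIZE and MAX_CAND are assumed limits for this sketch only. */
    #include <stdio.h>

    #define MAX_SIZE 5
    #define MAX_CAND 128

    /* producer -> consumer edges of the example dataflow graph */
    static const int edges[][2] = {
        {1,3},{2,4},{2,6},{3,4},{4,5},{5,8},{6,7},{7,8}
    };
    enum { NUM_EDGES = sizeof(edges) / sizeof(edges[0]) };

    typedef struct { int ops[MAX_SIZE]; int n; } Cand;

    static int has(const Cand *c, int op) {
        for (int i = 0; i < c->n; i++) if (c->ops[i] == op) return 1;
        return 0;
    }

    /* candidates are unordered op sets; compare them as sets */
    static int same(const Cand *a, const Cand *b) {
        if (a->n != b->n) return 0;
        for (int i = 0; i < a->n; i++) if (!has(b, a->ops[i])) return 0;
        return 1;
    }

    static int known(const Cand *cand, int ncand, const Cand *c) {
        for (int i = 0; i < ncand; i++) if (same(&cand[i], c)) return 1;
        return 0;
    }

    int main(void) {
        Cand cand[MAX_CAND];
        int ncand = 0;

        /* step 1: seed with every directly connected pair of operations */
        for (int e = 0; e < NUM_EDGES; e++) {
            Cand c = { { edges[e][0], edges[e][1] }, 2 };
            cand[ncand++] = c;
        }

        /* step 2: repeatedly extend each candidate with one adjacent op */
        int first = 0;
        for (int size = 3; size <= MAX_SIZE; size++) {
            int last = ncand;
            for (int i = first; i < last; i++) {
                for (int e = 0; e < NUM_EDGES; e++) {
                    for (int side = 0; side < 2; side++) {
                        int in  = edges[e][side];      /* op already in the set */
                        int out = edges[e][1 - side];  /* op to pull in         */
                        if (!has(&cand[i], in) || has(&cand[i], out)) continue;
                        if (ncand >= MAX_CAND) continue;
                        Cand grown = cand[i];
                        grown.ops[grown.n++] = out;
                        if (!known(cand, ncand, &grown)) cand[ncand++] = grown;
                    }
                }
            }
            first = last;
        }

        for (int i = 0; i < ncand; i++) {
            printf("candidate CFU:");
            for (int j = 0; j < cand[i].n; j++) printf(" %d", cand[i].ops[j]);
            printf("\n");
        }
        return 0;
    }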
7
Characterization
  • Use the macro library to get information on each
    potential CFU
  • Latency is the sum of each primitive's latency
  • Area is the sum of each primitive's macrocell area

8
Issues we consider
  • Performance
    • On critical path
    • Cycles saved (see the worked sketch below)
  • Cost
    • CFU area
    • Control logic (difficult to measure)
    • Decode logic (difficult to measure)
    • Register file area (can be amortized)

(Example dataflow graph: LD, then AND (0.1), ADD (0.6), ASL (0.1), ADD (0.6), XOR (0.1), ending in a BR)
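As a concrete illustration of the cycles-saved estimate, the sketch below sums the per-primitive latencies annotated on the dataflow graph above for the AND, ADD, ASL, ADD, XOR chain and rounds the result up to whole cycles. Reading the 0.1/0.6 annotations as fractions of a clock cycle is an assumption made for this illustration, and it ignores the cost items listed above (control and decode logic, register file area).

    /* Illustrative only: estimate CFU latency and cycles saved for the
     * AND-ADD-ASL-ADD-XOR chain from the example DFG. Latencies are the
     * figures on this slide, interpreted here as fractions of one cycle. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const char  *op[]      = { "AND", "ADD", "ASL", "ADD", "XOR" };
        const double latency[] = {  0.1,   0.6,   0.1,   0.6,   0.1  };
        const int n = (int)(sizeof(latency) / sizeof(latency[0]));

        double total = 0.0;
        for (int i = 0; i < n; i++) {
            printf("%s ", op[i]);
            total += latency[i];            /* CFU latency = sum of primitives */
        }

        int cfu_cycles = (int)ceil(total);  /* round up to whole cycles */
        if (cfu_cycles < 1) cfu_cycles = 1;

        printf("-> %.1f cycles of logic, issued as a %d-cycle CFU, saving %d of %d cycles\n",
               total, cfu_cycles, n - cfu_cycles, n);
        return 0;
    }

With these numbers the five primitives fit in 1.5 cycles of logic, so the combined CFU issues in 2 cycles and saves 3 of the original 5.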
9
More Issues to Consider
  • IO
    • Number of input and output operands
  • Usability
    • How well the compiler can use the pattern

(Example pattern: OR, LSL, AND, CMPP)
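The IO constraint can be checked per candidate by counting how many values cross the pattern boundary. The sketch below does this for one candidate from the earlier example DFG; it is only an illustration and counts just the edges between the numbered operations, so operands arriving from registers or constants outside the graph are not included, and the candidate choice is not taken from the paper.

    /* Illustrative boundary count for a candidate op set on the example DFG:
     * inputs  = values consumed inside the set but produced outside it,
     * outputs = values produced inside the set and consumed outside it. */
    #include <stdio.h>

    static const int edges[][2] = {
        {1,3},{2,4},{2,6},{3,4},{4,5},{5,8},{6,7},{7,8}
    };
    enum { NUM_EDGES = sizeof(edges) / sizeof(edges[0]) };

    static int in_set(const int *set, int n, int op) {
        for (int i = 0; i < n; i++) if (set[i] == op) return 1;
        return 0;
    }

    int main(void) {
        const int cfu[] = { 2, 4, 5 };   /* candidate {2,4,5} from the example */
        const int n = 3;

        int inputs = 0, outputs = 0;
        for (int e = 0; e < NUM_EDGES; e++) {
            int src = edges[e][0], dst = edges[e][1];
            if (!in_set(cfu, n, src) &&  in_set(cfu, n, dst)) inputs++;
            if ( in_set(cfu, n, src) && !in_set(cfu, n, dst)) outputs++;
        }

        printf("candidate {2,4,5}: %d input edge(s), %d output edge(s)\n",
               inputs, outputs);
        return 0;
    }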
10
Selection
  • Currently use a greedy algorithm (sketched below)
    • Pick the candidate with the best performance gain / area first
    • Can yield bad selections

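A minimal sketch of the greedy selection heuristic, assuming a fixed area budget: repeatedly take the candidate with the best cycles-saved per unit area that still fits. The candidate names, gains, areas, and the budget below are invented for illustration and are not measurements from the paper.

    /* Illustrative greedy selection over characterized candidate CFUs. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        double cycles_saved;   /* estimated performance gain          */
        double area;           /* cost, e.g. in adder-equivalent area */
        int taken;
    } Cfu;

    int main(void) {
        /* made-up candidates; not data from the paper */
        Cfu cand[] = {
            { "big pattern (OR-LSL-AND-CMPP)", 4.0, 1.2, 0 },
            { "small pattern 1",               3.0, 1.0, 0 },
            { "small pattern 2",               3.0, 1.0, 0 },
        };
        const int n = (int)(sizeof(cand) / sizeof(cand[0]));
        double budget = 2.0;   /* hypothetical total area budget */

        /* greedy: repeatedly take the best gain/area candidate that still fits */
        for (;;) {
            int best = -1;
            double best_ratio = 0.0;
            for (int i = 0; i < n; i++) {
                if (cand[i].taken || cand[i].area > budget) continue;
                double ratio = cand[i].cycles_saved / cand[i].area;
                if (ratio > best_ratio) { best_ratio = ratio; best = i; }
            }
            if (best < 0) break;           /* nothing else fits the budget */
            cand[best].taken = 1;
            budget -= cand[best].area;
            printf("select %s (gain/area %.2f)\n", cand[best].name, best_ratio);
        }
        return 0;
    }

With these invented numbers the greedy pass takes only the large pattern (4 cycles saved) and then runs out of area, while the two smaller patterns together would have saved 6 cycles; this is the kind of bad selection the slide refers to.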
11
Case study 1: Blowfish
(Dataflow graph extracted from the Blowfish C code below: ADD, XOR, AND, LSR, and LSL operations over registers r65, r70, r76, r81, r91, and others)
  • Speedup 1.24
  • 10 cycles can be compressed down to 2!
  • Cost: 6 adders
  • 6 inputs, 2 outputs
  • C code this DFG came from:
    r = (((s[(t>>24)] +
           s[0x0100+((t>>16)&0xff)]) ^
           s[0x0200+((t>>8)&0xff)]) +
           s[0x0300+((t&0xff))]) & 0xffffffff;
12
Case study 2: ADPCM Decode
  • Speedup 1.20
  • 3 cycles can be compressed down to 1
  • Cost: 1.5 adders
  • 2 inputs, 2 outputs
  • C code this DFG came from:
    d = d & 7;
    if ( d & 4 )

(Dataflow graph: r16 feeding AND with 7, then AND with 4, then CMPP against 0)
13
Experimental Setup
  • CFU recognition implemented in the Trimaran
    research infrastructure
  • Speedup shown is with CFUs relative to a baseline
    machine
    • Four-wide VLIW with predication
    • Can issue at most 1 Int, Flt, Mem, Brn inst./cyc.
    • 300 MHz clock
  • CFU latency is estimated using standard cells
    from Synopsys' design library

14
Varying the Number of CFUs
  • More CFUs yield more performance
  • A weakness in our selection algorithm causes
    plateaus

15
Varying the Number of Ops
  • Bigger CFUs yield better performance
  • If they're too big, they can't be used as often
    and they expose alternate critical paths

16
Related Work
  • Many people have done this for code size
    • Bose et al., Liao et al.
  • Typically done with traces
    • Arnold et al.
  • Previous paper used a more enumerative discovery
    algorithm
  • We are unique because of:
    • A compiler-based approach
    • Novel analysis of CFUs

17
Conclusion and Future Work
  • CFUs have the potential to offer big performance
    gains for a small cost
  • Recognize more complex subgraphs
    • Generalized acyclic/cyclic subgraphs
  • Develop our system to automatically synthesize
    application-tailored coprocessors