1
Automatically Generating Custom Instruction Set
Extensions
  • Nathan Clark, Wilkin Tang, Scott Mahlke
  • Workshop on Application Specific Processors

2
Problem Statement
  • There's a demand for high-performance, low-power
    special-purpose systems
  • E.g., cell phones, network routers, PDAs
  • One way to achieve these goals is to augment a
    general-purpose processor with Custom Function
    Units (CFUs)
  • CFUs combine several primitive operations into one unit
  • We propose an automated method for CFU generation

3
System Overview
4
Example
(Dataflow graph with operations numbered 1-8)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {3,4}, {4,5}, {5,8}, {6,7}, {7,8}
5
Example
(Same dataflow graph as the previous slide)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4}, {2,4,5}, {2,6,7}
6
Example
(Same dataflow graph as the previous slides)
  • Potential CFUs: {1,3}, {2,4}, {2,6}, {1,3,4,5}, {2,4,5,8}, {2,6,7,8}, {1,3,4,5,8}
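The progression across these three example slides (seed with connected pairs, then repeatedly extend each candidate with one adjacent operation) can be sketched as a small enumeration pass over the dataflow graph. The C sketch below only illustrates that growth idea on the example DFG's node numbering and edges; the data structures and size limits are assumptions made here, not the system's actual discovery algorithm, which also prunes candidates using the criteria on the later slides.

    /* Illustrative sketch: grow candidate CFUs from connected pairs.
     * Node numbers and edges are taken from the example DFG above;
     * MAX_SIZE and MAX_CAND are assumed limits for this sketch only. */
    #include <stdio.h>

    #define MAX_SIZE 5
    #define MAX_CAND 128

    /* producer -> consumer edges of the example dataflow graph */
    static const int edges[][2] = {
        {1,3},{2,4},{2,6},{3,4},{4,5},{5,8},{6,7},{7,8}
    };
    enum { NUM_EDGES = sizeof(edges) / sizeof(edges[0]) };

    typedef struct { int ops[MAX_SIZE]; int n; } Cand;

    static int has(const Cand *c, int op) {
        for (int i = 0; i < c->n; i++) if (c->ops[i] == op) return 1;
        return 0;
    }

    /* candidates are unordered op sets; compare them as sets */
    static int same(const Cand *a, const Cand *b) {
        if (a->n != b->n) return 0;
        for (int i = 0; i < a->n; i++) if (!has(b, a->ops[i])) return 0;
        return 1;
    }

    static int known(const Cand *cand, int ncand, const Cand *c) {
        for (int i = 0; i < ncand; i++) if (same(&cand[i], c)) return 1;
        return 0;
    }

    int main(void) {
        Cand cand[MAX_CAND];
        int ncand = 0;

        /* step 1: seed with every directly connected pair of operations */
        for (int e = 0; e < NUM_EDGES; e++) {
            Cand c = { { edges[e][0], edges[e][1] }, 2 };
            cand[ncand++] = c;
        }

        /* step 2: repeatedly extend each candidate with one adjacent op */
        int first = 0;
        for (int size = 3; size <= MAX_SIZE; size++) {
            int last = ncand;
            for (int i = first; i < last; i++) {
                for (int e = 0; e < NUM_EDGES; e++) {
                    for (int side = 0; side < 2; side++) {
                        int in  = edges[e][side];      /* op already in the set */
                        int out = edges[e][1 - side];  /* op to pull in         */
                        if (!has(&cand[i], in) || has(&cand[i], out)) continue;
                        if (ncand >= MAX_CAND) continue;
                        Cand grown = cand[i];
                        grown.ops[grown.n++] = out;
                        if (!known(cand, ncand, &grown)) cand[ncand++] = grown;
                    }
                }
            }
            first = last;
        }

        for (int i = 0; i < ncand; i++) {
            printf("candidate CFU:");
            for (int j = 0; j < cand[i].n; j++) printf(" %d", cand[i].ops[j]);
            printf("\n");
        }
        return 0;
    }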
7
Characterization
  • Use the macro library to get information on each
    potential CFU
  • Latency is the sum of each primitive's latency
  • Area is the sum of each primitive's macrocell area

8
Issues we consider
  • Performance
    • On critical path
    • Cycles saved (see the worked sketch below)
  • Cost
    • CFU area
    • Control logic (difficult to measure)
    • Decode logic (difficult to measure)
    • Register file area (can be amortized)

(Example dataflow graph: LD, then AND (0.1), ADD (0.6), ASL (0.1), ADD (0.6), XOR (0.1), ending in a BR)
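As a concrete illustration of the cycles-saved estimate, the sketch below sums the per-primitive latencies annotated on the dataflow graph above for the AND, ADD, ASL, ADD, XOR chain and rounds the result up to whole cycles. Reading the 0.1/0.6 annotations as fractions of a clock cycle is an assumption made for this illustration, and it ignores the cost items listed above (control and decode logic, register file area).

    /* Illustrative only: estimate CFU latency and cycles saved for the
     * AND-ADD-ASL-ADD-XOR chain from the example DFG. Latencies are the
     * figures on this slide, interpreted here as fractions of one cycle. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const char  *op[]      = { "AND", "ADD", "ASL", "ADD", "XOR" };
        const double latency[] = {  0.1,   0.6,   0.1,   0.6,   0.1  };
        const int n = (int)(sizeof(latency) / sizeof(latency[0]));

        double total = 0.0;
        for (int i = 0; i < n; i++) {
            printf("%s ", op[i]);
            total += latency[i];            /* CFU latency = sum of primitives */
        }

        int cfu_cycles = (int)ceil(total);  /* round up to whole cycles */
        if (cfu_cycles < 1) cfu_cycles = 1;

        printf("-> %.1f cycles of logic, issued as a %d-cycle CFU, saving %d of %d cycles\n",
               total, cfu_cycles, n - cfu_cycles, n);
        return 0;
    }

With these numbers the five primitives fit in 1.5 cycles of logic, so the combined CFU issues in 2 cycles and saves 3 of the original 5.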
9
More Issues to Consider
  • IO
    • Number of input and output operands
  • Usability
    • How well the compiler can use the pattern

(Example pattern: OR, LSL, AND, CMPP)
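The IO constraint can be checked per candidate by counting how many values cross the pattern boundary. The sketch below does this for one candidate from the earlier example DFG; it is only an illustration and counts just the edges between the numbered operations, so operands arriving from registers or constants outside the graph are not included, and the candidate choice is not taken from the paper.

    /* Illustrative boundary count for a candidate op set on the example DFG:
     * inputs  = values consumed inside the set but produced outside it,
     * outputs = values produced inside the set and consumed outside it. */
    #include <stdio.h>

    static const int edges[][2] = {
        {1,3},{2,4},{2,6},{3,4},{4,5},{5,8},{6,7},{7,8}
    };
    enum { NUM_EDGES = sizeof(edges) / sizeof(edges[0]) };

    static int in_set(const int *set, int n, int op) {
        for (int i = 0; i < n; i++) if (set[i] == op) return 1;
        return 0;
    }

    int main(void) {
        const int cfu[] = { 2, 4, 5 };   /* candidate {2,4,5} from the example */
        const int n = 3;

        int inputs = 0, outputs = 0;
        for (int e = 0; e < NUM_EDGES; e++) {
            int src = edges[e][0], dst = edges[e][1];
            if (!in_set(cfu, n, src) &&  in_set(cfu, n, dst)) inputs++;
            if ( in_set(cfu, n, src) && !in_set(cfu, n, dst)) outputs++;
        }

        printf("candidate {2,4,5}: %d input edge(s), %d output edge(s)\n",
               inputs, outputs);
        return 0;
    }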
10
Selection
  • Currently use a greedy algorithm (sketched below)
    • Pick the candidate with the best performance gain / area first
    • Can yield bad selections

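A minimal sketch of the greedy selection heuristic, assuming a fixed area budget: repeatedly take the candidate with the best cycles-saved per unit area that still fits. The candidate names, gains, areas, and the budget below are invented for illustration and are not measurements from the paper.

    /* Illustrative greedy selection over characterized candidate CFUs. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        double cycles_saved;   /* estimated performance gain          */
        double area;           /* cost, e.g. in adder-equivalent area */
        int taken;
    } Cfu;

    int main(void) {
        /* made-up candidates; not data from the paper */
        Cfu cand[] = {
            { "big pattern (OR-LSL-AND-CMPP)", 4.0, 1.2, 0 },
            { "small pattern 1",               3.0, 1.0, 0 },
            { "small pattern 2",               3.0, 1.0, 0 },
        };
        const int n = (int)(sizeof(cand) / sizeof(cand[0]));
        double budget = 2.0;   /* hypothetical total area budget */

        /* greedy: repeatedly take the best gain/area candidate that still fits */
        for (;;) {
            int best = -1;
            double best_ratio = 0.0;
            for (int i = 0; i < n; i++) {
                if (cand[i].taken || cand[i].area > budget) continue;
                double ratio = cand[i].cycles_saved / cand[i].area;
                if (ratio > best_ratio) { best_ratio = ratio; best = i; }
            }
            if (best < 0) break;           /* nothing else fits the budget */
            cand[best].taken = 1;
            budget -= cand[best].area;
            printf("select %s (gain/area %.2f)\n", cand[best].name, best_ratio);
        }
        return 0;
    }

With these invented numbers the greedy pass takes only the large pattern (4 cycles saved) and then runs out of area, while the two smaller patterns together would have saved 6 cycles; this is the kind of bad selection the slide refers to.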
11
Case study 1: Blowfish
(Dataflow graph extracted from the Blowfish C code below: ADD, XOR, AND, LSR, and LSL operations over registers r65, r70, r76, r81, r91, and others)
  • Speedup 1.24
  • 10 cycles can be compressed down to 2!
  • Cost: 6 adders
  • 6 inputs, 2 outputs
  • C code this DFG came from:
    r = (((s[(t>>24)] +
           s[0x0100+((t>>16)&0xff)]) ^
           s[0x0200+((t>>8)&0xff)]) +
           s[0x0300+((t&0xff))]) & 0xffffffff;
12
Case study 2: ADPCM Decode
  • Speedup 1.20
  • 3 cycles can be compressed down to 1
  • Cost: 1.5 adders
  • 2 inputs, 2 outputs
  • C code this DFG came from:
    d = d & 7;
    if ( d & 4 )

(Dataflow graph: r16 feeding AND with 7, then AND with 4, then CMPP against 0)
13
Experimental Setup
  • CFU recognition implemented in the Trimaran
    research infrastructure
  • Speedup shown is with CFUs relative to a baseline
    machine
    • Four-wide VLIW with predication
    • Can issue at most 1 Int, Flt, Mem, Brn inst./cyc.
    • 300 MHz clock
  • CFU latency is estimated using standard cells
    from Synopsys' design library

14
Varying the Number of CFUs
  • More CFUs yield more performance
  • A weakness in our selection algorithm causes
    plateaus

15
Varying the Number of Ops
  • Bigger CFUs yield better performance
  • If they're too big, they can't be used as often
    and they expose alternate critical paths

16
Related Work
  • Many people have done this for code size
    • Bose et al., Liao et al.
  • Typically done with traces
    • Arnold et al.
  • Previous paper used a more enumerative discovery
    algorithm
  • We are unique because of:
    • A compiler-based approach
    • Novel analysis of CFUs

17
Conclusion and Future Work
  • CFUs have the potential to offer big performance
    gains for a small cost
  • Recognize more complex subgraphs
    • Generalized acyclic/cyclic subgraphs
  • Develop our system to automatically synthesize
    application-tailored coprocessors