Processor Acceleration Through Automated Instruction Set Customization



1
Processor Acceleration Through Automated
Instruction Set Customization
  • Nathan Clark, Hongtao Zhong, Scott Mahlke
  • Advanced Computer Architecture Lab
  • University of Michigan, Ann Arbor
  • December 3, 2003

2
Motivation
  • Cell phones, PDAs, digital cameras, etc. are
    everywhere
  • High performance yet low power design point
  • General core + ASIC solution
  • Limited post-programmability
  • General core + application-specific instructions
    (CFUs)

[Figure: CPU with CFU]
3
What is a CFU?
  • Combine multiple primitive operations
  • Smaller code size, fewer RF reads
  • Increases performance

[Figure: CFU example]
4
Automation is Key
  • This is ¼ of the DFG for a single basic block of
    blowfish

[Figure: DFG excerpt; visible node labels include 159 XOR, 164 SHR, 173 AND]
5
Related Work
  • Tensilica Xtensa
  • Commercial example
  • MIPS core manually constructed CFU
  • Automatic instruction set synthesis is a mature
    field
  • See paper for comparison of techniques
  • Our contributions
  • Novel technique for automatic CFU creation
  • System to utilize CFUs in multiple applications
  • Analysis of how effectively CFUs for one
    application apply to other applications in the
    same domain

6
System Overview
  • Synthesis
  • Subgraph identification
  • Discover candidates for CFUs
  • Weed out what shouldn't be picked
  • Selection
  • Determine which candidates to use as CFUs
  • Compilation
  • Subgraph replacement
  • Make use of the CFUs in a range of applications

7
Subgraph Identification

  • Grow subgraphs from seed nodes
  • All nodes are seeds
  • Most directions don't make sense
  • How to decide where to grow?
  • Making decisions using factors similar to an
    architect
  • Take 4 factors into consideration
  • Criticality, Latency, Area, Input/Output





8
Subgraph Identification





[Figure: CFU candidates]
9
Subgraph Identification

  • Take 4 factors into consideration
  • Criticality, Latency, Area, Input/Output
  • Sum of these factors determines value of each
    direction
  • NOT picking CFUs
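
The "sum of factors" rule can be sketched in a few lines of Python. This is only an illustration: the direction names and the per-factor scores are hypothetical, and the ranking orders growth directions for exploration without selecting any CFU.

```python
from typing import Dict, List, Tuple

# The four factors named on the slide.
FACTORS = ("criticality", "latency", "area", "io")


def direction_value(scores: Dict[str, float]) -> float:
    """Sum the four factor scores for one growth direction."""
    return sum(scores[name] for name in FACTORS)


def rank_directions(
    directions: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Return (direction, value) pairs, highest value first."""
    valued = [(name, direction_value(s)) for name, s in directions.items()]
    return sorted(valued, key=lambda pair: pair[1], reverse=True)


# Hypothetical directions with illustrative per-factor scores.
directions = {
    "toward_a": {"criticality": 10.0, "latency": 8.33, "area": 10.0, "io": 6.67},
    "toward_b": {"criticality": 3.33, "latency": 5.0, "area": 3.33, "io": 4.0},
}
ranking = rank_directions(directions)
print(ranking)
```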




[Figure: CFU candidates]

10
Critical Path
  • Combining operations on the critical path will
    shrink the longer dependence chains
  • Maximize potential performance gain
  • Wt = 10 / (slack + 1)
  • Slack is cycles off longest dependence path
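
The slide's two worked values read as 10 / (0 + 1) = 10 and 10 / (2 + 1) = 3.33, which suggests a weight of 10 / (slack + 1). The Python sketch below computes slack on a small DFG under one assumed interpretation: slack is the length of the longest dependence path minus the longest path passing through the node. The graph, node names, and unit latencies are hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def slacks(succs: Dict[str, List[str]], latency: Dict[str, int]) -> Dict[str, int]:
    """Slack per node: critical path length minus longest path through the node."""
    preds: Dict[str, List[str]] = {n: [] for n in succs}
    for n, outs in succs.items():
        for m in outs:
            preds[m].append(n)

    @lru_cache(maxsize=None)
    def down(n: str) -> int:
        # Longest path starting at n, including n's own latency.
        return latency[n] + max((down(m) for m in succs[n]), default=0)

    @lru_cache(maxsize=None)
    def up(n: str) -> int:
        # Longest path ending just before n (n's latency excluded).
        return max((up(p) + latency[p] for p in preds[n]), default=0)

    critical = max(down(n) for n in succs)
    return {n: critical - (up(n) + down(n)) for n in succs}


def criticality_weight(slack: int) -> float:
    """Weight pattern read off the slide's worked values."""
    return 10 / (slack + 1)


# Hypothetical DFG: chain a -> b -> c -> d, plus a short side input e -> d.
succs = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "e": ["d"]}
latency = {n: 1 for n in succs}
slack = slacks(succs, latency)
weights = {n: round(criticality_weight(s), 2) for n, s in slack.items()}
print(slack, weights)
```

Nodes on the chain have zero slack and get the full weight of 10, while the side input `e` sits two cycles off the longest path and gets 3.33.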




10 / (0 + 1) = 10
10 / (2 + 1) = 3.33



11
Latency
  • Growing toward low latency operations allows
    combination of more nodes in a cycle
  • Maximize DFG compression
  • Wt




10 × 0.3 / 0.36 = 8.33
10 × 0.3 / 0.6 = 5

Opcode     Area   Cycles
           1.00   0.30
           0.12   0.06
<<, >>     0.01   0.00
           0.16   0.09
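
The two worked values on this slide appear to be 10 × 0.3 / 0.36 = 8.33 and 10 × 0.3 / 0.6 = 5. A minimal Python sketch of that arithmetic follows. It assumes the numerator is the candidate's current latency and the denominator is its latency after growing by one node (0.30 + 0.06 and 0.30 + 0.30 cycles from the Cycles column). The function name and this reading are assumptions, not a formula stated in the transcript.

```python
def latency_weight(current_cycles: float, grown_cycles: float) -> float:
    """Score a growth direction by how little it lengthens the candidate."""
    return 10 * current_cycles / grown_cycles


current = 0.30                       # latency of the subgraph so far (assumed)
cheap = latency_weight(current, current + 0.06)   # grow toward a 0.06-cycle op
costly = latency_weight(current, current + 0.30)  # grow toward a 0.30-cycle op
print(round(cheap, 2), round(costly, 2))
```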




12
Area
  • Want the most benefit for the least area
  • Wt
  • Area is the sum of macrocell areas

10 × 0.5 / 0.5 = 10
10 × 0.5 / 1.5 = 3.33

Opcode     Area   Cycles
           1.00   0.30
           0.12   0.06
<<, >>     0.01   0.00
           0.16   0.09
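
The slide states that area is the sum of macrocell areas, and its worked values read as 10 × 0.5 / 0.5 = 10 and 10 × 0.5 / 1.5 = 3.33. The Python sketch below reproduces both. The reading of the ratio as "area before growth over area after growth" is an assumption, as are the function names; the starting area of 0.5 is simply taken from the slide's numbers.

```python
from typing import Iterable


def cfu_area(macrocell_areas: Iterable[float]) -> float:
    """Area of a candidate: the sum of its macrocell areas."""
    return sum(macrocell_areas)


def area_weight(current_area: float, grown_area: float) -> float:
    """Score a growth direction by how little area it adds (assumed reading)."""
    return 10 * current_area / grown_area


current = 0.5                           # candidate area so far (from the slide)
grown = cfu_area([current, 1.00])       # add a macrocell of area 1.00
print(area_weight(current, current), round(area_weight(current, grown), 2))
```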
13
Input/Output
  • Want CFUs to use as few RF ports as possible
  • Smaller encoding
  • Allow growth of larger candidates
  • Wt




10 × 2 / (4 + 1) = 4
10 × 2 / (2 + 1) = 6.67
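
Counting the register-file ports a candidate needs is straightforward once the DFG is available. The Python sketch below counts distinct external inputs and externally visible outputs of a subgraph, then applies the pattern of the slide's worked values, 10 × 2 / (ports + 1). Treating the denominator as a port count is an assumption, and the transcript does not say what the numerator 2 denotes, so it is passed through unchanged. The DFG and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def count_io(operands: Dict[str, List[str]], subgraph: Set[str]) -> Tuple[int, int]:
    """Return (inputs, outputs) for a subgraph of a DFG.

    operands maps each node to the values it reads; names that are not
    keys of the dict (for example "r1") stand for register inputs.
    """
    inputs = {src for n in subgraph for src in operands[n] if src not in subgraph}
    # Subgraph values read by nodes outside the subgraph.
    outputs = {
        src
        for node, srcs in operands.items()
        if node not in subgraph
        for src in srcs
        if src in subgraph
    }
    # Values nobody reads are treated as live-out results.
    used = {src for srcs in operands.values() for src in srcs}
    outputs |= {n for n in subgraph if n not in used}
    return len(inputs), len(outputs)


def io_weight(numerator: float, ports: int) -> float:
    """Pattern of the slide's worked values; numerator meaning is unknown."""
    return 10 * numerator / (ports + 1)


# Hypothetical DFG: n1 = f(r1, r2), n2 = g(n1, r3), n3 = h(n2, r4).
operands = {"n1": ["r1", "r2"], "n2": ["n1", "r3"], "n3": ["n2", "r4"]}
n_in, n_out = count_io(operands, {"n1", "n2"})
print(n_in, n_out, io_weight(2, n_in + n_out))
```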




14
Example


[Figure: DFG example with values 28.5, 35, 30.8, 37.5, 28.5, 37.5]




15
Example


[Figure: DFG example with values 28.5, 35, 30.8, 28.5, 40, 33.5]




16
Example


[Figure: DFG example with values 28.5, 35, 30.8, 28.5, 36, 36]




17-23
Example
[Figure: DFG example, continued]




24
Finished: Met External Constraints
25
Set of Candidates









26
Avoids Exponential Explosion
27
Greedy Selection Heuristic
  • Use estimates of performance improvement / cost

Subgraph Number   Value   Cost   Ops
1                 20      4      (3,4),(6,8)
2                 6       1      (1,3,7)
...
N                 9       5      (1,7)

Subgraph Number   Value   Cost   Ops
1                 10      4      (6,8)
2                 6       1      (1,3,7)
...
N                 0       5
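
The two tables look like the state before and after subgraph 2 is selected: it has the best value/cost ratio (6/1), and once its ops (1,3,7) are claimed, overlapping occurrences of the other subgraphs lose their value. The Python sketch below reproduces that behaviour. It assumes each occurrence of subgraph 1 contributes half of its value of 20, and it adds a hypothetical cost budget as the stopping rule; neither detail is stated in the transcript.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Occurrence:
    value: float            # estimated gain from replacing this occurrence
    ops: Tuple[int, ...]    # DFG operation ids the occurrence covers


@dataclass
class Candidate:
    cost: float
    occurrences: List[Occurrence]


def live_value(cand: Candidate, covered: Set[int]) -> float:
    """Value from occurrences that do not touch already-covered ops."""
    return sum(o.value for o in cand.occurrences if covered.isdisjoint(o.ops))


def greedy_select(cands: Dict[str, Candidate], budget: float) -> List[str]:
    """Repeatedly pick the affordable candidate with the best value/cost."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(cands)
    while remaining:
        ratios = {
            cid: live_value(c, covered) / c.cost
            for cid, c in remaining.items()
            if c.cost <= budget and live_value(c, covered) > 0
        }
        if not ratios:
            break
        best = max(ratios, key=ratios.get)
        cand = remaining.pop(best)
        for occ in cand.occurrences:
            if covered.isdisjoint(occ.ops):
                covered.update(occ.ops)   # claim these ops for the chosen CFU
        budget -= cand.cost
        chosen.append(best)
    return chosen


cands = {
    "1": Candidate(4, [Occurrence(10, (3, 4)), Occurrence(10, (6, 8))]),
    "2": Candidate(1, [Occurrence(6, (1, 3, 7))]),
    "N": Candidate(5, [Occurrence(9, (1, 7))]),
}
after_pick = {cid: live_value(c, {1, 3, 7}) for cid, c in cands.items() if cid != "2"}
selection = greedy_select(cands, budget=5)
print(after_pick, selection)
```

With ops 1, 3, and 7 claimed, subgraph 1 drops to a value of 10 and subgraph N to 0, matching the second table.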
28
Compiler Replacement
  • Multiple applications can utilize CFUs
  • Vflib pattern matcher [Cor 99]

[Figure: Instruction Synthesis, CFU Description, Compiler]
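
To show what the matching step has to do, here is a brute-force Python sketch that looks for a CFU pattern inside an application DFG. It is a stand-in for a real graph matcher and does not use vflib or its API. The graph encoding, opcode names, and function names are hypothetical, and the check ignores operand order and whether intermediate values escape the matched region.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple

# node name -> (opcode, names of the nodes it reads)
Graph = Dict[str, Tuple[str, List[str]]]


def _node_ok(p: str, mapping: Dict[str, str], pattern: Graph, dfg: Graph) -> bool:
    """Opcode must match and every pattern edge must exist in the DFG."""
    p_op, p_srcs = pattern[p]
    d_op, d_srcs = dfg[mapping[p]]
    if p_op != d_op:
        return False
    # Operands that are not pattern nodes are external inputs: unconstrained.
    return all(mapping[s] in d_srcs for s in p_srcs if s in pattern)


def find_match(pattern: Graph, dfg: Graph) -> Optional[Dict[str, str]]:
    """Return a pattern-node -> DFG-node mapping, or None if there is none."""
    p_nodes = list(pattern)
    for combo in permutations(dfg, len(p_nodes)):
        mapping = dict(zip(p_nodes, combo))
        if all(_node_ok(p, mapping, pattern, dfg) for p in p_nodes):
            return mapping
    return None


# Hypothetical CFU pattern: a shift feeding an xor.
pattern: Graph = {"p0": ("shl", []), "p1": ("xor", ["p0"])}
# Hypothetical application DFG.
dfg: Graph = {
    "n1": ("and", []),
    "n2": ("shl", ["n1"]),
    "n3": ("xor", ["n2"]),
    "n4": ("add", ["n3"]),
}
match = find_match(pattern, dfg)
print(match)
```

Trying every assignment is exponential in the pattern size, which is acceptable only for patterns of a few nodes; a dedicated matcher prunes that search.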
29
Experimental Setup
  • Implemented in the Trimaran toolset
  • Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem per cycle
  • CFUs use Int issue slot
  • CFU latency/area generated as sum of each
    individual macrocell
  • Pipeline latches were added if CFU latency > 1
    clock cycle
  • 300 MHz clock assumed
  • No branch or memory instructions in CFUs
  • Four application domains tested
  • Audio, Encryption, Image, Network

30
Native Encryption Results
31
Encryption Cross Compile
32
Generalizing CFUs
Subsumed (Multiple Paths)
Wildcards (Multiple Nodes)
[Figure: CFU graphs with inputs IN_1 and IN_2, >> nodes, and constant operands 0x8, 0x0 and 0xF, 0x0]
33
Effects of Generalization
Speedup
34
Conclusions
  • Developed two-phase instruction set synthesis
    system
  • Guide function removes bad candidates
  • Greedy selection heuristic
  • Substantial speedups can be attained with very
    little die impact
  • Subsumed subgraphs and wildcarding increase
    cross-application effectiveness

Domain         Encryption   Network   Image   Audio
Ave. Speedup   1.61         1.38      1.16    1.66
35
Questions?
http://cccp.eecs.umich.edu
36
Backup slides
37
Individual Factors - Blowfish
38
Individual Factors - Djpeg
39
Selection
  • Uses estimates of performance improvement
  • Greedy Heuristic used






