Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
Transcript and Presenter's Notes



1
Streamroller: Automatic Synthesis of Prescribed Throughput Accelerator Pipelines
  • Manjunath Kudlur, Kevin Fan, Scott Mahlke
  • Advanced Computer Architecture Lab
  • University of Michigan

2
Automated C to Gates Solution
  • SoC design: 10-100 Gop/s within a 200 mW power budget
  • Low-level tools are ineffective
  • Automated accelerator synthesis for the whole application
  • Correct by construction
  • Increased designer productivity
  • Faster time to market

3
Streaming Applications
  • Data streams through kernels
  • Kernels are tight loops: FIR, Viterbi, DCT
  • Coarse-grain dataflow between kernels: sub-blocks of images, network packets

4
Software Overview
[Toolflow diagram: whole application → frontend analyses → loop graph → system-level synthesis → accelerator pipeline of multifunction accelerators and SRAM buffers]
5
Input Specification
  • Sequential C program
  • Kernel specification
    • Perfectly nested FOR loop
    • Wrapped inside a C function
    • All data accesses made explicit
  • System specification
    • Function with the main input/output
    • Local arrays to pass data
    • Sequence of calls to kernels

/* Kernel specification: each kernel is a perfectly nested loop wrapped in a C function */
row_trans(char inp[8][8], char out[8][8])
{
  for (i = 0; i < 8; i++)
    for (j = 0; j < 8; j++)
      . . . inp[i][j] . . . out[i][j] . . .
}
col_trans(char inp[8][8], char out[8][8]) { . . . }
zigzag_trans(char inp[8][8], char out[8][8]) { . . . }

/* System specification: local arrays pass data between a sequence of kernel calls */
dct(char inp[8][8], char out[8][8])
{
  char tmp1[8][8], tmp2[8][8];
  row_trans(inp, tmp1);
  col_trans(tmp1, tmp2);
  zigzag_trans(tmp2, out);
}
6
Performance Specification
  • High-performance DCT
    • Process one 1024x768 image every 2 ms
    • Given a 400 MHz clock: one image every 800,000 cycles, i.e., one block every 64 cycles
  • Low-performance DCT
    • Process one 1024x768 image every 4 ms: one block every 128 cycles

[Task dataflow: inp → row_trans → tmp1 → col_trans → tmp2 → zigzag_trans → out (output coefficients); one pass of a block through this chain is one task]
Performance goal: task throughput, expressed as the number of cycles between tasks
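
A worked version of the throughput arithmetic above (assuming 8x8 DCT blocks, so the 64-cycle figure is the rounded-down per-block budget):

\[
2\,\text{ms} \times 400\,\text{MHz} = 800{,}000\ \tfrac{\text{cycles}}{\text{image}},\qquad
\frac{1024 \times 768}{8 \times 8} = 12{,}288\ \tfrac{\text{blocks}}{\text{image}},\qquad
\frac{800{,}000}{12{,}288} \approx 65\ \tfrac{\text{cycles}}{\text{block}}
\]

At 4 ms per image the per-block budget roughly doubles, giving the 128-cycle figure.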
7
Building Blocks
[Building blocks: Kernel 1-Kernel 4 mapped onto multifunction loop accelerators (CODES+ISSS '06), with SRAM buffers holding the intermediate arrays tmp1, tmp2, tmp3 between them]
8
System Schema Overview
[System schema: Kernel 1-Kernel 5 grouped onto loop accelerators LA 1-LA 3; tasks stream through the accelerators over time, and the task throughput is the interval between successive tasks]
9
Cost Components
  • Cost of the loop accelerator (LA) datapath
    • Cost of FUs, shift registers, muxes, interconnect
  • Initiation interval (II)
    • Key parameter that decides LA cost
    • Low II → high performance → high cost
  • Loop execution time = trip count × II
  • An appropriate II is chosen to satisfy the task throughput (see the sketch below)

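A minimal sketch of the II selection described above, assuming one loop per accelerator and that each accelerator gets the full per-task cycle budget (function and variable names are illustrative, not from the Streamroller tool):

#include <stdio.h>

/* Pick the largest (cheapest) II whose execution time, trip_count * II,
   still fits within the per-task cycle budget.  A lower II would also
   meet the budget but cost more hardware. */
int choose_ii(int trip_count, int cycle_budget, int ii_max)
{
    int ii = cycle_budget / trip_count;
    if (ii < 1)
        return -1;              /* infeasible even at II = 1 */
    if (ii > ii_max)
        ii = ii_max;            /* no point exceeding the loop's maximum II */
    return ii;
}

int main(void)
{
    /* e.g., a 64-iteration kernel with a 64-cycle task budget -> II = 1 */
    printf("II = %d\n", choose_ii(64, 64, 14));
    return 0;
}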
10
Cost Components (Contd..)
  • Grouping of loops into a multifunction LA
    • More loops in a single LA → the LA is occupied for a longer time within each task

[Schedule example: with a throughput of 1 task / 200 cycles, LA 1 is occupied for the full 200 cycles of each task, shown alongside LA 2 and LA 3]
11
Cost Components (Contd..)
  • Cost of SRAM buffers for the intermediate arrays
    • More buffers → more task overlap → higher performance (see the sketch below)

[Buffering example: the tmp1 buffer of one task is still in use by LA 2 while the next task is being produced, so adjacent tasks use different buffers]
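
A minimal illustration of the buffer rotation implied above, where adjacent tasks write their intermediate arrays into different copies of the buffer (hypothetical helper, not the paper's implementation):

#include <stdio.h>

/* With N copies of an intermediate array, task k uses copy k mod N,
   so the producer of task k+1 can write while the consumer of task k
   is still reading, overlapping adjacent tasks. */
static int buffer_for_task(int task_id, int num_buffers)
{
    return task_id % num_buffers;
}

int main(void)
{
    for (int task = 0; task < 6; task++)
        printf("task %d -> tmp1 copy %d\n", task, buffer_for_task(task, 2));
    return 0;
}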
12
ILP Formulation
  • Variables
    • II for each loop
    • Which loops are combined into a single LA
    • Number of buffers for each temporary array
  • Objective function
    • Cost of LAs + cost of buffers
  • Constraints
    • The overall task throughput must be achieved (an outline is sketched below)
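
In outline, the formulation has roughly this shape (a hedged reconstruction for illustration, not the paper's exact formulation; $b_t$ denotes the number of buffers chosen for temporary array $t$):

\begin{align*}
\text{minimize}\quad & \sum_{a\,\in\,\text{LAs}} \mathrm{Cost_{LA}}(a) \;+\; \sum_{t\,\in\,\text{temp arrays}} b_t \cdot \mathrm{Cost_{buf}}(t)\\
\text{subject to}\quad & \text{each loop is assigned exactly one II and one LA,}\\
& \text{and the resulting schedule meets the prescribed task throughput.}
\end{align*}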

13
Non-linear LA Cost
[Plot: relative LA cost versus initiation interval over the range IImin to IImax]
The II of each loop is restricted to $II_{\min} \le II \le II_{\max}$, and the non-linear cost-versus-II curve is linearized with 0/1 selection variables $II_1, \ldots, II_{14}$, exactly one of which is set:

\[
II = 1 \cdot II_1 + 2 \cdot II_2 + 3 \cdot II_3 + \cdots + 14 \cdot II_{14}, \qquad II_i \in \{0, 1\}
\]
\[
\mathrm{Cost}(II) = C_1 II_1 + C_2 II_2 + C_3 II_3 + \cdots + C_{14} II_{14}
\]
14
Multifunction Accelerator Cost
[Three sharing scenarios for building one multifunction accelerator from loops LA 1-LA 4]
  • Worst case: no sharing, cost = sum of the individual LA costs
  • Realistic case: some sharing, cost between the sum and the max
  • Best case: full sharing, cost = max of the individual LA costs
  • Impractical to obtain an accurate cost for every combination
  • Estimate used instead: $C_{LA} = 0.5 \times (\mathrm{SUM}_{C_{LA}} + \mathrm{MAX}_{C_{LA}})$ (a worked example follows)
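
For instance, with illustrative (not measured) individual LA costs of 10, 6, 4, and 2 units for the four loops:

\[
C_{LA} \approx 0.5 \times \big((10 + 6 + 4 + 2) + \max(10, 6, 4, 2)\big) = 0.5 \times (22 + 10) = 16
\]

which sits between the no-sharing sum of 22 and the full-sharing max of 10.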

15
Case Study: Simple Benchmark
[Loop graph and resulting loop accelerator configuration for the simple benchmark (TC = 256)]
16
Beamformer
  • Beamformer benchmark: 10 loops
  • Memory cost is 60% to 70% of the total cost
  • Up to 20% cost savings due to hardware sharing in multifunction accelerators
  • Systems at lower throughput have over-designed LAs
    • Not profitable to pick a lower-performance LA
  • Memory buffer cost is significant
    • A high-performance producer/consumer is better than more buffers

17
Conclusions
  • Automated design is realistic for a system of loops
    • Designers can move up the abstraction hierarchy
  • Observations
    • Macro-level hardware sharing can achieve significant cost savings
    • Memory cost is significant: need to simultaneously optimize for datapath and memory cost
    • The ILP formulation is tractable: the solver took less than 1 minute for systems with 30 loops
