A Code Refinement Methodology for Performance-Improved Synthesis from C

1
A Code Refinement Methodology for
Performance-Improved Synthesis from C

Greg Stitt, Frank Vahid, Walid Najjar
Department of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer
Systems, UC Irvine

This research is supported in part by the
National Science Foundation and the Semiconductor
Research Corporation
2
Introduction

Previous work: In-depth hw/sw partitioning study of H.264 decoder
Collaboration with Freescale

[Figure: H.264 decoder functions (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) partitioned between uP and FPGA]
4
Introduction

Noticed that coding constructs/practices limited hw speed
Identified problematic coding constructs
Developed simple coding guidelines
Dozens of lines of code
Minutes per guideline
Refined critical regions using guidelines

[Figure: H.264 decoder functions motionComp(), filterLuma(), filterChroma(), deblocking(), ...]
5
Introduction

Can simple coding guidelines show similar
improvements on other applications?
6
Coding Guidelines

Analyzed dozens of benchmarks
Identified common problems related to synthesis
Developed 10 guidelines to fix problems
Although some are well known, analysis shows they
are rarely applied
Automation unlikely or impossible in many cases

7
Fast Refinement
[Figure: sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), ...]

Several dozen lines of code provide most
performance improvement
Refining takes minutes/hours

8
Conversion to Constants (CC)
Original:

int coef[100];
void initCoef() {
    // initialize coef
}
void fir() {
    // fir filter using coef
}
void f() {
    initCoef();
    // other code
    fir();
}

Refined:

int coef[100];
void initCoef() {
    // initialize coef
}
void fir(const int array[100]) {
    // fir filter using const array
}
void constWrapper(const int array[100]) {
    prefetchArray(array);
    // misc code . . .
    fir(array);
}
void f() {
    initCoef();
    constWrapper(coef);
}

Problem: Arrays of constants commonly not specified as constants
Initialized at runtime
Guideline: Use constant wrapper function
Specifies array constant for all future functions
Automation: Difficult, requires global def-use/alias analysis

Can also enable constant folding
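A minimal runnable sketch of the constant-wrapper guideline follows. It assumes a 4-tap filter instead of the slide's 100 entries, and the names `NTAPS`, `init_coef`, `const_wrapper`, and `run_filter` are hypothetical. The point it illustrates is that the array is still filled at runtime, but every function reached through the wrapper sees it through a `const` parameter.

```c
#include <stddef.h>

#define NTAPS 4  /* hypothetical size; the slide uses 100 */

static int coef[NTAPS];  /* filled at runtime, never written afterwards */

/* Runtime initialization of the coefficient array (values are illustrative). */
void init_coef(void) {
    for (size_t i = 0; i < NTAPS; i++)
        coef[i] = (int)(i + 1);
}

/* The const qualifier states that fir only reads taps. */
int fir(const int taps[NTAPS], const int x[NTAPS]) {
    int acc = 0;
    for (size_t i = 0; i < NTAPS; i++)
        acc += taps[i] * x[i];
    return acc;
}

/* Wrapper: everything called from here receives the array as read-only. */
int const_wrapper(const int taps[NTAPS], const int x[NTAPS]) {
    return fir(taps, x);
}

/* Initialize once, then enter the const region through the wrapper. */
int run_filter(const int x[NTAPS]) {
    init_coef();
    return const_wrapper(coef, x);
}
```

Whether a given synthesis tool actually exploits the `const` parameter (for example by prefetching or folding constants) depends on the tool; the sketch only shows the code shape.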
9
Conversion to Explicit Data Flow (CEDF)
Original:

int array[100];
void a() {
    for (i = 0; i < 100; i++)
        array[i] = . . . . .;
}
void b() {
    for (i = 0; i < 100; i++)
        array[i] = array[i] + f(i);  // '+' assumed
}
int c() {
    for (i = 0; i < 100; i++)
        temp += array[i];  // '+=' assumed
}
void d() {
    for (. . . . .) {
        a();
        b();
        c();
    }
}

Refined:

void a(int array[100]) {
    for (i = 0; i < 100; i++)
        array[i] = . . . . .;
}
void b(int array1[100], int array2[100]) {
    for (i = 0; i < 100; i++)
        array2[i] = array1[i] + f(i);  // '+' assumed
}
int c(int array[100]) {
    for (i = 0; i < 100; i++)
        temp += array[i];  // '+=' assumed
}
void d() {
    int array1[100], array2[100];
    for (. . . . .) {
        a(array1);
        b(array1, array2);
        c(array2);
    }
}

Problem: Global variables make determination of parallelism difficult
Requires global def-use/alias analysis
Guideline: Replace globals with extra parameters
Makes data flow explicit
Simpler analysis may expose parallelism
Automation: Has been proposed [Lee01]
But difficult because of aliases
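As a runnable sketch of explicit data flow, the version below passes every array as a parameter so that each function's inputs and outputs are visible in its signature. The size `N`, the function names, and the arithmetic in each loop are illustrative assumptions, not taken from the slides.

```c
#define N 8  /* hypothetical size; the slide uses 100 */

/* Writes every element of out and touches nothing else. */
void produce(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;
}

/* Reads in, writes out: the dependence on produce() is visible in the call. */
void transform(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 2 * i;
}

/* Reads in only and returns a scalar. */
int reduce(const int in[N]) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += in[i];
    return sum;
}

/* The arrays are locals of the caller, so no global def-use analysis is needed. */
int pipeline(void) {
    int array1[N], array2[N];
    produce(array1);
    transform(array1, array2);
    return reduce(array2);
}
```

With `const` on the read-only parameters, a local analysis of `pipeline` is enough to see that `transform` consumes what `produce` wrote and that `reduce` consumes what `transform` wrote.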

10
Constant Input Enumeration (CIE)

Problem: Function parameters may limit parallelism
Guideline: Create enum for possible values
Synthesis can create specialized functions
Automation: In some cases, def-use analysis may identify all inputs
In general, difficult due to aliases

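Original:

void f(int a, int b) {
    . . . .
    for (i = 0; i < a; i++)
        for (j = 0; j < b; j++)
            c[i][j] = i + j;  // '+' assumed
}

Refined:

enum PRM { VAL1 = 2, VAL2 = 4 };
void f(enum PRM a, enum PRM b) {
    . . . .
    for (i = 0; i < a; i++)
        for (j = 0; j < b; j++)
            c[i][j] = i + j;  // '+' assumed
}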
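To make "specialized functions" concrete, here is a small sketch that pairs an enum-bounded loop nest with the kind of fixed-size version a tool could derive for one enumerated case. The names `Dim`, `MAXD`, `fill`, and `fill_2x2` are hypothetical, the loop body `i + j` is an illustrative choice, and the hand-written specialization only stands in for what a synthesis tool might generate.

```c
enum Dim { DIM_2 = 2, DIM_4 = 4 };  /* hypothetical set of legal bounds */

#define MAXD 4  /* largest enumerated bound */

/* Generic version: each bound is known to be one of the enumerators. */
void fill(enum Dim rows, enum Dim cols, int c[MAXD][MAXD]) {
    for (int i = 0; i < (int)rows; i++)
        for (int j = 0; j < (int)cols; j++)
            c[i][j] = i + j;
}

/* Specialization for rows == DIM_2, cols == DIM_2: constant trip counts,
   so the loop nest can be fully unrolled into independent assignments. */
void fill_2x2(int c[MAXD][MAXD]) {
    c[0][0] = 0; c[0][1] = 1;
    c[1][0] = 1; c[1][1] = 2;
}
```

With two enumerators per parameter there are only four bound combinations, so generating one specialized version per combination stays cheap.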
11
Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control flow analysis
Guideline: Replace function pointer with if-else, static calls
Makes possible targets explicit
Automation: In general, is impossible
Equivalent to halting problem

Original:

void f(int (*fp)(int)) {
    . . . . .
    for (i = 0; i < 10; i++)
        a[i] = fp(i);
}

Refined:

enum Target { FUNC1, FUNC2, FUNC3 };
void f(enum Target fp) {
    . . . . .
    for (i = 0; i < 10; i++) {
        if (fp == FUNC1)
            a[i] = f1(i);
        else if (fp == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
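A self-contained version of the enum dispatch can be sketched as follows. The bodies of `f1`, `f2`, and `f3` and the name `apply` are placeholders chosen for illustration; only the dispatch structure comes from the slide.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Placeholder targets; the real functions are not given in the slides. */
static int f1(int x) { return x + 1; }
static int f2(int x) { return x * 2; }
static int f3(int x) { return x * x; }

/* Every possible callee appears as a static call, so the call graph is
   fully known at compile time. */
void apply(enum Target t, int a[10]) {
    for (int i = 0; i < 10; i++) {
        if (t == FUNC1)
            a[i] = f1(i);
        else if (t == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

Because `t` does not change inside the loop, a tool is free to hoist the test and build one loop per target.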
12
Algorithmic Specialization (AS)

Algorithms targeting sw may not be fast in hw
Sequential vs. parallel
C code generally uses sw algorithms
Guideline: Specialize critical functions with hw algorithms
Automation: Requires higher level specification
Intrinsics

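Original (binary search):

int search(int *a, int k, int l, int r) {
    int mid;
    while (l <= r) {
        mid = (l + r) / 2;
        if (k > a[mid])
            l = mid + 1;
        else if (k < a[mid])
            r = mid - 1;
        else
            return mid;
    }
    return -1;
}

Refined (linear search over a constant-size array):

int search(int *a, int k, const int s) {
    int i;
    for (i = 0; i < s; i++)
        if (a[i] == k)
            return i;
    return -1;
}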
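The two searches can be placed side by side in a runnable form to check that they agree on sorted input. The names `search_binary` and `search_linear`, the size `S`, and the `-1` not-found value are assumptions made for this sketch.

```c
#define S 8  /* hypothetical fixed array size */

/* Software-style algorithm: O(log n) steps, each depending on the last. */
int search_binary(const int *a, int k, int l, int r) {
    while (l <= r) {
        int mid = l + (r - l) / 2;
        if (k > a[mid])
            l = mid + 1;
        else if (k < a[mid])
            r = mid - 1;
        else
            return mid;
    }
    return -1;  /* not found */
}

/* Hardware-style algorithm: S independent comparisons with a constant
   trip count, which a synthesis tool can unroll and evaluate in parallel. */
int search_linear(const int a[S], int k) {
    for (int i = 0; i < S; i++)
        if (a[i] == k)
            return i;
    return -1;  /* not found */
}
```

In software the linear version does more work, which is the overhead the methodology weighs before applying this guideline.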
13
Pass-By-Value Return (PVR)

Problem: Array parameters cannot be prefetched due to potential aliases
Designer may know aliases don't exist
Guideline: Use pass-by-value-return
Automation: Requires global alias analysis

Original:

void f(int a, int b, int array[16]) {
    // unrelated computation
    g(array);
    // unrelated computation
}
int g(int array[16]) {
    // computation done on array
}

Refined:

void f(int a, int b, int array[16]) {
    int localArray[16];
    memcpy(localArray, array, 16 * sizeof(int));
    // misc computation
    g(localArray);
    // misc computation
    memcpy(array, localArray, 16 * sizeof(int));
}
int g(int array[16]) {
    // computation done on array
}
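A runnable sketch of pass-by-value-return is shown below. The doubling done in `g` is a placeholder for the real computation, and `N` is a named constant introduced here for the slide's literal 16.

```c
#include <string.h>

#define N 16

/* Placeholder computation on the array (the slides do not specify it). */
static void g(int array[N]) {
    for (int i = 0; i < N; i++)
        array[i] *= 2;
}

/* Copy in, compute on a local array that cannot alias anything else,
   then copy the result back out. */
void f(int array[N]) {
    int local[N];
    memcpy(local, array, sizeof local);   /* pass by value */
    g(local);
    memcpy(array, local, sizeof local);   /* return */
}
```

The two copies are the software overhead of this guideline, which is why the methodology applies it only when that overhead is acceptable.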
14
Why Synthesis From C?

Why not use HDL?
HDL may yield better results
C is mainstream language
Acceptable performance in many cases
Learning HDL is large overhead
Approaches are orthogonal
This work focuses on improving mainstream
Guidelines common for HDL
Can also be applied to algorithmic HDL

15
Software Overhead

Refined regions may not be partitioned to
hardware
Partitioner may select non-refined regions
OS may select software or hardware implementation
Based on state of FPGA
Coding guidelines have potential software overhead

[Figure: hw/sw partitioning splits motionComp(), filterLuma(), filterChroma(), deblocking(), ... into two groups: motionComp(), deblocking() and filterLuma(), filterChroma()]
Problem: Refined code mapped to software
16
Refinement Methodology

Considerations
Reduce software overhead
Reduce refinement time
Methodology
Profile
Iterative-improvement
Determine critical region
Apply all except PVR/AS
Minimal overhead
Apply PVR if overhead acceptable
Apply AS if known algorithm and overhead
acceptable

[Flowchart: Profile, then Determine Critical Region, then Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR, then Apply PVR (yes/no decision), then Apply AS (yes/no decision); repeat until performance acceptable]
17
Experimental Setup

Benchmark suite
MediaBench, Powerstone
Manually applied guidelines
1-2 hours
23 additional lines/benchmark, on average
Target Architecture
Xilinx VirtexII FPGA with ARM9 uP
Hardware/software partitioning
Selects critical regions for hardware
Synthesis
High-level synthesis tool
30,000 lines of C code
Outputs register-transfer level (RTL) VHDL
RTL Synthesis using Xilinx ISE
Compilation
GCC with -O1 optimizations

[Figure: tool flow. Benchmarks, Manual Refinement, Refined Code, Hw/Sw Partitioning; Sw goes to Compilation, Hw goes to Synthesis; output is a Bitfile]
18
Speedups from Guidelines
19
Speedups from Guidelines
Algorithmic Specialization: Speedup 19x, Time 30 minutes, Sw Overhead 6000
20
Speedups from Guidelines