Title: A Code Refinement Methodology for Performance-Improved Synthesis from C
1A Code Refinement Methodology for
Performance-Improved Synthesis from C
- Greg Stitt, Frank Vahid, Walid Najjar
- Department of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer
Systems, UC Irvine
This research is supported in part by the
National Science Foundation and the Semiconductor
Research Corporation
2Introduction
- Previous work: in-depth hw/sw partitioning study of H.264 decoder
- Collaboration with Freescale
[Figure: H.264 decoder functions (motionComp(), filterLuma(), filterChroma(), deblocking(), ...) mapped onto a uP and an FPGA]
4Introduction
- Noticed that coding constructs/practices limited hw speed
- Identified problematic coding constructs
- Developed simple coding guidelines
- Dozens of lines of code
- Minutes per guideline
- Refined critical regions using guidelines
[Figure: H.264 decoder functions motionComp(), filterLuma(), filterChroma(), deblocking(), ...]
5Introduction
Can simple coding guidelines show similar
improvements on other applications?
6Coding Guidelines
- Analyzed dozens of benchmarks
- Identified common problems related to synthesis
- Developed 10 guidelines to fix problems
- Although some are well known, analysis shows they are rarely applied
- Automation unlikely or impossible in many cases
7Fast Refinement
[Figure: sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), ...]
- Several dozen lines of code provide most of the performance improvement
- Refining takes minutes to hours
8Conversion to Constants (CC)
Original code:

  int coef[100];
  void initCoef() { /* initialize coef */ }
  void fir() { /* fir filter using coef */ }
  void f() {
    initCoef();
    // other code
    fir();
  }

Refined code, built up in four steps (add the wrapper, pass the array to fir as a const parameter, call the wrapper from f, prefetch the array in the wrapper):

  int coef[100];
  void initCoef() { /* initialize coef */ }
  void fir(const int array[100]) { /* fir filter using const array */ }
  void constWrapper(const int array[100]) {
    prefetchArray(array);
    // misc code . . .
    fir(array);
  }
  void f() {
    initCoef();
    constWrapper(coef);
  }
- Problem: arrays of constants are commonly not specified as constants
- Initialized at runtime
- Guideline: use a constant wrapper function
- Specifies the array as constant for all functions called afterwards
- Automation
- Difficult, requires global def-use/alias analysis
- Can also enable constant folding
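The constant-wrapper guideline above can be sketched as a small compilable example. This is only an illustration under stated assumptions: the filter length NTAPS, the coefficient values, the input parameter x, and the return values are hypothetical and do not come from the original slides, and the prefetch step is omitted because its implementation is tool-specific.

```c
#include <stddef.h>

#define NTAPS 4            /* hypothetical filter length (the slides use 100) */

static int coef[NTAPS];    /* filled at runtime, so it cannot be declared const */

/* Runtime initialization, as in the original code. */
void initCoef(void) {
    for (size_t i = 0; i < NTAPS; i++)
        coef[i] = (int)i + 1;          /* illustrative values 1, 2, 3, 4 */
}

/* The const qualifier states that fir never writes through `array`. */
int fir(const int array[NTAPS], const int x[NTAPS]) {
    int acc = 0;
    for (size_t i = 0; i < NTAPS; i++)
        acc += array[i] * x[i];
    return acc;
}

/* Constant wrapper: every function called from here receives the
   coefficients through a read-only parameter. */
int constWrapper(const int array[NTAPS], const int x[NTAPS]) {
    return fir(array, x);
}

/* Entry point: initialize once, then only use the const view. */
int f(const int x[NTAPS]) {
    initCoef();
    return constWrapper(coef, x);
}
```

Calling f with an input of four ones sums the illustrative coefficients. The point of the wrapper is that the read-only property is visible from the function signatures alone, without global def-use analysis.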
9Conversion to Explicit Data Flow (CEDF)
Original code:

  int array[100];
  void a() {
    for (i=0; i < 100; i++)
      array[i] = . . . . .
  }
  void b() {
    for (i=0; i < 100; i++)
      array[i] = array[i] + f(i);
  }
  int c() {
    for (i=0; i < 100; i++)
      temp += array[i];
  }
  void d() {
    for (. . . . . ) {
      a(); b(); c();
    }
  }

Refined code:

  void a(int array[100]) {
    for (i=0; i < 100; i++)
      array[i] = . . . . .
  }
  void b(int array1[100], int array2[100]) {
    for (i=0; i < 100; i++)
      array2[i] = array1[i] + f(i);
  }
  int c(int array[100]) {
    for (i=0; i < 100; i++)
      temp += array[i];
  }
  void d() {
    int array1[100], array2[100];
    for (. . . . . ) {
      a(array1); b(array1, array2); c(array2);
    }
  }
- Problem: global variables make determination of parallelism difficult
- Requires global def-use/alias analysis
- Guideline: replace globals with extra parameters
- Makes data flow explicit
- Simpler analysis may expose parallelism
- Automation
- Has been proposed [Lee01]
- But difficult because of aliases
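A minimal compilable sketch of the explicit-data-flow guideline follows. The array size N, the loop bodies, and the return value of d are assumptions made for illustration; only the producer/transformer/consumer shape follows the slide.

```c
#define N 8   /* hypothetical size; the slides use 100 */

/* Producer: writes only to its parameter. */
void a(int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = i;
}

/* Transformer: reads `in` and writes `out`, with no hidden global state. */
void b(const int in[N], int out[N]) {
    for (int i = 0; i < N; i++)
        out[i] = in[i] + i;
}

/* Consumer: reduces its input to a sum. */
int c(const int in[N]) {
    int temp = 0;
    for (int i = 0; i < N; i++)
        temp += in[i];
    return temp;
}

/* The chain a -> b -> c is visible from the call arguments alone. */
int d(void) {
    int array1[N], array2[N];
    a(array1);
    b(array1, array2);
    return c(array2);
}
```

Because each function names the arrays it reads and writes, a tool can see the dependences between the three calls without analyzing global accesses.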
10Constant Input Enumeration (CIE)
- Problem: function parameters may limit parallelism
- Guideline: create an enum for the possible values
- Synthesis can create specialized functions
- Automation
- In some cases, def-use analysis may identify all inputs
- In general, difficult due to aliases
Original code:

  void f(int a, int b) {
    . . . .
    for (i=0; i < a; i++)
      for (j=0; j < b; j++)
        c[i][j] = i + j;
  }

Refined code:

  enum PRM { VAL1=2, VAL2=4 };
  void f(enum PRM a, enum PRM b) {
    . . . .
    for (i=0; i < a; i++)
      for (j=0; j < b; j++)
        c[i][j] = i + j;
  }
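To illustrate why enumerating the inputs helps, the sketch below pairs the enum-typed function with one specialization of the kind a synthesis tool might derive. The array c, its size MAXDIM, the loop body, and the function f_VAL1_VAL2 are hypothetical; the slides do not show what a tool actually generates.

```c
enum PRM { VAL1 = 2, VAL2 = 4 };

#define MAXDIM 4                 /* largest enumerated value (assumption) */

static int c[MAXDIM][MAXDIM];    /* zero-initialized result array */

/* Generic version: the enum type documents that the bounds can only be 2 or 4. */
void f(enum PRM a, enum PRM b) {
    for (int i = 0; i < (int)a; i++)
        for (int j = 0; j < (int)b; j++)
            c[i][j] = i + j;
}

/* Hypothetical specialization for (VAL1, VAL2): both bounds are literals,
   so the loops could be fully unrolled. */
void f_VAL1_VAL2(void) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            c[i][j] = i + j;
}
```

With only two legal values per parameter there are four possible bound combinations, which is a small enough set to specialize exhaustively.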
11Conversion to Explicit Control Flow (CECF)
- Problem: function pointers may prevent static control flow analysis
- Guideline: replace function pointer with if-else and static calls
- Makes possible targets explicit
- Automation
- In general, is impossible
- Equivalent to halting problem
Original code:

  void f( int (*fp)(int) ) {
    . . . . .
    for (i=0; i < 10; i++)
      a[i] = fp(i);
  }

Refined code:

  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    . . . . .
    for (i=0; i < 10; i++)
      if (fp == FUNC1) a[i] = f1(i);
      else if (fp == FUNC2) a[i] = f2(i);
      else a[i] = f3(i);
  }
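The explicit-control-flow refinement can be sketched as a compilable example. The bodies of f1, f2, and f3 and the extra array parameter are placeholders chosen for illustration; only the enum-plus-if-else structure follows the slide.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical call targets; the originals are not shown in the slides. */
static int f1(int i) { return i + 1; }
static int f2(int i) { return i * 2; }
static int f3(int i) { return i * i; }

/* Every possible callee is named statically, so the call graph is known
   without resolving a function pointer. */
void f(enum Target fp, int a[10]) {
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)
            a[i] = f1(i);
        else if (fp == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

The selector is loop-invariant, so a tool could in principle hoist the test and build one datapath per target.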
12Algorithmic Specialization (AS)
- Algorithms targeting sw may not be fast in hw
- Sequential vs. parallel
- C code generally uses sw algorithms
- Guideline: specialize critical functions with hw algorithms
- Automation
- Requires higher level specification
- Intrinsics
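Software algorithm (binary search):

  int search(int a[], int k, int l, int r) {
    while (l <= r) {
      mid = (l+r)/2;
      if (k > a[mid]) l = mid+1;
      else if (k < a[mid]) r = mid-1;
      else return mid;
    }
    return -1;
  }

Hardware-specialized algorithm (linear search):

  int search(int a[], int k, const int s) {
    for (i=0; i < s; i++)
      if (a[i] == k) return i;
    return -1;
  }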
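The two search variants can be compared side by side in a compilable sketch. The names search_sw and search_hw are hypothetical (the slides call both functions search), and the input is assumed to be sorted with distinct elements so that both return the same index.

```c
/* Software-oriented: binary search over the sorted range a[l..r].
   Each iteration depends on the previous one, so it is inherently sequential. */
int search_sw(const int a[], int k, int l, int r) {
    while (l <= r) {
        int mid = l + (r - l) / 2;   /* same midpoint as (l+r)/2, without overflow */
        if (k > a[mid])
            l = mid + 1;
        else if (k < a[mid])
            r = mid - 1;
        else
            return mid;
    }
    return -1;                       /* not found */
}

/* Hardware-oriented: linear search over s elements. The comparisons do not
   depend on each other, so an unrolled version can evaluate them in parallel. */
int search_hw(const int a[], int k, const int s) {
    for (int i = 0; i < s; i++)
        if (a[i] == k)
            return i;
    return -1;                       /* not found */
}
```

In software the linear version does more work than the binary one, which is the kind of software overhead that makes this guideline optional in the methodology.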
13Pass-By-Value Return (PVR)
- Problem: array parameters cannot be prefetched due to potential aliases
- Designer may know aliases don't exist
- Guideline: use pass-by-value-return
- Automation
- Requires global alias analysis
Original code:

  void f(int *a, int *b, int array[16]) {
    // unrelated computation
    g(array);
    // unrelated computation
  }
  int g(int array[16]) {
    // computation done on array
  }

Refined code:

  void f(int *a, int *b, int array[16]) {
    int localArray[16];
    memcpy(localArray, array, 16*sizeof(int));
    // misc computation
    g(localArray);
    // misc computation
    memcpy(array, localArray, 16*sizeof(int));
  }
  int g(int array[16]) {
    // computation done on array
  }
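A compilable sketch of pass-by-value-return is shown below. The body of g, the updates through a and b, and the return values are invented for illustration. The transformation only preserves behavior when a and b do not alias the array, which is exactly the knowledge the designer is assumed to have.

```c
#include <string.h>

/* Placeholder computation on the array: doubles each element, returns the sum. */
static int g(int array[16]) {
    int sum = 0;
    for (int i = 0; i < 16; i++) {
        array[i] *= 2;
        sum += array[i];
    }
    return sum;
}

int f(int *a, int *b, int array[16]) {
    int localArray[16];

    /* Pass by value: copy in once, so the data can be fetched up front. */
    memcpy(localArray, array, 16 * sizeof(int));

    *a += 1;                      /* stands in for unrelated computation */
    int sum = g(localArray);      /* g touches only the local copy */
    *b += 1;                      /* stands in for unrelated computation */

    /* Return: copy the results back to the caller's array. */
    memcpy(array, localArray, 16 * sizeof(int));
    return sum;
}
```

The two copies are the software overhead of this guideline, which is why the methodology applies it only when that overhead is acceptable.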
14Why Synthesis From C?
- Why not use HDL?
- HDL may yield better results
- C is mainstream language
- Acceptable performance in many cases
- Learning HDL is large overhead
- Approaches are orthogonal
- This work focuses on improving mainstream
- Guidelines common for HDL
- Can also be applied to algorithmic HDL
15Software Overhead
- Refined regions may not be partitioned to hardware
- Partitioner may select non-refined regions
- OS may select software or hardware implementation
- Based on state of FPGA
- Coding guidelines have potential software overhead
[Figure: Hw/Sw partitioning splits motionComp(), filterLuma(), filterChroma(), deblocking(), ... into the groups {motionComp(), deblocking()} and {filterLuma(), filterChroma()}]
Problem: refined code mapped to software
16Refinement Methodology
- Considerations
- Reduce software overhead
- Reduce refinement time
- Methodology
- Profile
- Iterative-improvement
- Determine critical region
- Apply all except PVR/AS
- Minimal overhead
- Apply PVR if overhead acceptable
- Apply AS if a hw algorithm is known and overhead is acceptable
[Flowchart: Profile; Determine Critical Region; Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR; then Apply PVR and Apply AS, each gated by a yes/no decision; repeat until performance acceptable]
17Experimental Setup
- Benchmark suite
- MediaBench, Powerstone
- Manually applied guidelines
- 1-2 hours
- 23 additional lines/benchmark, on average
- Target Architecture
- Xilinx Virtex-II FPGA with ARM9 uP
- Hardware/software partitioning
- Selects critical regions for hardware
- Synthesis
- High-level synthesis tool
- 30,000 lines of C code
- Outputs register-transfer level (RTL) VHDL
- RTL Synthesis using Xilinx ISE
- Compilation
- gcc with -O1 optimizations
[Figure: tool flow with stages Benchmarks, Manual Refinement, Refined Code, Hw/Sw Partitioning (Sw, Hw), Synthesis, Compilation, Bitfile]
18Speedups from Guidelines
19Speedups from Guidelines
Algorithmic Specialization: speedup 19x, time 30 minutes, Sw overhead 6000
20Speedups from Guidelines
- Original code
- Speedups range from 1x (no speedup) to 573x
- Average 2.6x (excludes brev)
- Refined code with guidelines
- Average 8.4x (excludes brev)
- 3.5x average improvement compared to original code
21Speedups from Guidelines
- Guidelines move speedups closer to ideal
- Almost identical for mpeg2, fir
- Several examples still far from ideal
- May imply new guidelines needed
22Guideline SW Overhead/Improvement
- Average Sw performance overhead: -15.7 (improvement)
- -1.1 excluding brev
- 3 examples improved
- Average Sw size overhead (lines of C code)
- 8.4 excluding brev
23Summary
- Simple coding guidelines significantly improve synthesis from C
- 3.5x speedup compared to Hw/Sw synthesized from unrefined code
- Major rewrites may not be necessary
- Refinement took between 1 and 2 hours
- Refinement Methodology
- Reduces software size/performance overhead
- In some cases, improvement
- Future Work
- Test on commercial synthesis tools
- New guidelines for different domains