A Code Refinement Methodology for Performance-Improved Synthesis from C - PowerPoint PPT Presentation

Slides: 24
Provided by: gregs77
Learn more at: http://www.cs.ucr.edu
1
A Code Refinement Methodology for
Performance-Improved Synthesis from C
  • Greg Stitt, Frank Vahid, Walid Najjar
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems, UC Irvine

This research is supported in part by the
National Science Foundation and the Semiconductor
Research Corporation
2
Introduction
  • Previous work: In-depth hw/sw partitioning study
    of H.264 decoder
  • Collaboration with Freescale

[Figure: H.264 functions motionComp(), filterLuma(), filterChroma(), deblocking(), . . . partitioned between uP and FPGA]
4
Introduction
  • Noticed coding constructs/practices limited hw
    speed
  • Identified problematic coding constructs
  • Developed simple coding guidelines
  • Dozens of lines of code
  • Minutes per guideline
  • Refined critical regions using guidelines

[Figure: H.264 functions motionComp(), filterLuma(), filterChroma(), deblocking(), . . .]
5
Introduction
Can simple coding guidelines show similar
improvements on other applications?
6
Coding Guidelines
  • Analyzed dozens of benchmarks
  • Identified common problems related to synthesis
  • Developed 10 guidelines to fix problems
  • Although some are well known, analysis shows they
    are rarely applied
  • Automation unlikely or impossible in many cases

7
Fast Refinement
[Figure: sample application with functions Idct(), Memset(), FIR(), Sort(), Search(), ReadInput(), WriteOutput(), Matrix(), Brev(), Compress(), Quantize(), . . .]
  • Several dozen lines of code provide most
    performance improvement
  • Refining takes minutes/hours

8
Conversion to Constants (CC)
Original code:
    int coef[100];
    void initCoef() { /* initialize coef */ }
    void fir() { /* fir filter using coef */ }
    void f() {
        initCoef();
        // other code
        fir();
    }

Refined code (built up in steps: add a constant wrapper, pass the array to fir() as const, call the wrapper from f(), then prefetch the array):
    int coef[100];
    void initCoef() { /* initialize coef */ }
    void fir(const int array[100]) { /* fir filter using const array */ }
    void constWrapper(const int array[100]) {
        prefetchArray(array);
        // misc code . . .
        fir(array);
    }
    void f() {
        initCoef();
        constWrapper(coef);
    }
  • Problem: Arrays of constants are commonly not
    specified as constants
  • Initialized at runtime
  • Guideline: Use a constant wrapper function
  • Specifies array as constant for all future functions
  • Automation
  • Difficult; requires global def-use/alias analysis

Can also enable constant folding
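
The constant-wrapper guideline can be sketched as a small compilable example. This is only an illustration under stated assumptions: the 4-tap length, the coefficient values, and the names NTAPS, init_coef, fir_const, and const_wrapper are hypothetical and do not come from the slides.

```c
#include <stddef.h>

#define NTAPS 4  /* hypothetical filter length, chosen for illustration */

int coef[NTAPS];  /* filled at runtime, as in the original code */

/* Runtime initialization: the coefficients never change afterwards. */
void init_coef(void)
{
    for (size_t i = 0; i < NTAPS; i++)
        coef[i] = (int)i + 1;  /* hypothetical values 1, 2, 3, 4 */
}

/* The const qualifier marks the coefficient array as read-only here. */
int fir_const(const int c[NTAPS], const int x[NTAPS])
{
    int acc = 0;
    for (size_t i = 0; i < NTAPS; i++)
        acc += c[i] * x[i];
    return acc;
}

/* Constant wrapper: everything called from here sees the array as const. */
int const_wrapper(const int c[NTAPS], const int x[NTAPS])
{
    return fir_const(c, x);
}
```

A caller would run init_coef() once and then pass coef through const_wrapper(), so the array is writable only during initialization.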
9
Conversion to Explicit Data Flow (CEDF)
Original code:
    int array[100];
    void a() {
        for (i = 0; i < 100; i++)
            array[i] = . . . . .;
    }
    void b() {
        for (i = 0; i < 100; i++)
            array[i] = array[i] + f(i);
    }
    int c() {
        for (i = 0; i < 100; i++)
            temp += array[i];
    }
    void d() {
        for (. . . . .) {
            a(); b(); c();
        }
    }

Refined code:
    void a(int array[100]) {
        for (i = 0; i < 100; i++)
            array[i] = . . . . .;
    }
    void b(int array1[100], int array2[100]) {
        for (i = 0; i < 100; i++)
            array2[i] = array1[i] + f(i);
    }
    int c(int array[100]) {
        for (i = 0; i < 100; i++)
            temp += array[i];
    }
    void d() {
        int array1[100], array2[100];
        for (. . . . .) {
            a(array1); b(array1, array2); c(array2);
        }
    }
  • Problem: Global variables make determination of
    parallelism difficult
  • Requires global def-use/alias analysis
  • Guideline: Replace globals with extra parameters
  • Makes data flow explicit
  • Simpler analysis may expose parallelism
  • Automation
  • Has been proposed [Lee01]
  • But difficult because of aliases
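
A minimal compilable sketch of the explicit-data-flow guideline follows. The array size N and the function names produce, transform, consume, and pipeline are hypothetical, as are the computations inside them; the point is only that each stage names the arrays it reads and writes in its parameter list.

```c
#define N 4  /* hypothetical size; the slides use 100 */

/* Producer: writes only to its parameter. */
void produce(int out[N])
{
    for (int i = 0; i < N; i++)
        out[i] = i;
}

/* Transformer: reads in[] and writes out[], two distinct parameters. */
void transform(const int in[N], int out[N])
{
    for (int i = 0; i < N; i++)
        out[i] = in[i] + 1;
}

/* Consumer: reads only. */
int consume(const int in[N])
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += in[i];
    return sum;
}

/* The producer/consumer chain is visible from the call sites alone. */
int pipeline(void)
{
    int stage1[N], stage2[N];
    produce(stage1);
    transform(stage1, stage2);
    return consume(stage2);
}
```

With no globals involved, a tool only needs to inspect pipeline() to see which stage feeds which.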

10
Constant Input Enumeration (CIE)
  • Problem: Function parameters may limit
    parallelism
  • Guideline: Create enum for possible values
  • Synthesis can create specialized functions
  • Automation
  • In some cases, def-use analysis may identify all
    inputs
  • In general, difficult due to aliases

Original code:
    void f(int a, int b) {
        . . . .
        for (i = 0; i < a; i++)
            for (j = 0; j < b; j++)
                c[i][j] = i + j;
    }

Refined code:
    enum PRM { VAL1 = 2, VAL2 = 4 };
    void f(enum PRM a, enum PRM b) {
        . . . .
        for (i = 0; i < a; i++)
            for (j = 0; j < b; j++)
                c[i][j] = i + j;
    }
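
As a compilable illustration of constant input enumeration, the sketch below restricts both loop bounds to an enum. The names Prm, MAX_DIM, and fill and the loop body are assumptions made for this example, not taken from the slides.

```c
/* The only loop bounds a caller may pass. */
enum Prm { VAL1 = 2, VAL2 = 4 };

#define MAX_DIM 4  /* largest enumerated value (hypothetical helper macro) */

/* Because a and b come from a closed set, a synthesis tool could in
   principle build one specialized datapath per (a, b) pair. */
void fill(enum Prm a, enum Prm b, int c[MAX_DIM][MAX_DIM])
{
    for (int i = 0; i < (int)a; i++)
        for (int j = 0; j < (int)b; j++)
            c[i][j] = i + j;
}
```

Calling fill(VAL1, VAL2, c) touches only the first two rows of c, and both trip counts are known at the call site.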
11
Conversion to Explicit Control Flow (CECF)
  • Problem: Function pointers may prevent static
    control flow analysis
  • Guideline: Replace function pointer with if-else,
    static calls
  • Makes possible targets explicit
  • Automation
  • Impossible in general
  • Equivalent to halting problem

Original code:
    void f(int (*fp)(int)) {
        . . . . .
        for (i = 0; i < 10; i++)
            a[i] = fp(i);
    }

Refined code:
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f(enum Target fp) {
        . . . . .
        for (i = 0; i < 10; i++) {
            if (fp == FUNC1)
                a[i] = f1(i);
            else if (fp == FUNC2)
                a[i] = f2(i);
            else
                a[i] = f3(i);
        }
    }
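
The explicit-control-flow guideline can be tried out with a small compilable sketch. The bodies of f1, f2, and f3 and the name apply are hypothetical placeholders; only the enum-plus-if-else structure mirrors the slide.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical call targets, invented for this example. */
int f1(int x) { return x + 1; }
int f2(int x) { return x * 2; }
int f3(int x) { return -x; }

/* Every possible callee appears as a static call, so the control flow
   is fully visible without pointer analysis. */
void apply(enum Target t, int a[10])
{
    for (int i = 0; i < 10; i++) {
        if (t == FUNC1)
            a[i] = f1(i);
        else if (t == FUNC2)
            a[i] = f2(i);
        else
            a[i] = f3(i);
    }
}
```

The cost is that adding a new target means editing both the enum and the if-else chain.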
12
Algorithmic Specialization (AS)
  • Algorithms targeting sw may not be fast in hw
  • Sequential vs. parallel
  • C code generally uses sw algorithms
  • Guideline: Specialize critical functions with hw
    algorithms
  • Automation
  • Requires higher level specification
  • Intrinsics

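Software algorithm (binary search):
    int search(int a[], int k, int l, int r) {
        while (l <= r) {
            mid = (l + r) / 2;
            if (k > a[mid])
                l = mid + 1;
            else if (k < a[mid])
                r = mid - 1;
            else
                return mid;
        }
        return -1;
    }

Hardware algorithm (linear search with constant bound):
    int search(int a[], int k, const int s) {
        for (i = 0; i < s; i++)
            if (a[i] == k)
                return i;
        return -1;
    }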
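
To check that the two search variants are interchangeable, the sketch below gives both distinct names so they can be compiled side by side. The names search_sw and search_hw are hypothetical, and the code assumes a sorted array with distinct elements so that both return the same index.

```c
/* Software-oriented: binary search over a sorted array. Each step
   depends on the previous comparison, so it is inherently sequential. */
int search_sw(const int a[], int k, int l, int r)
{
    while (l <= r) {
        int mid = l + (r - l) / 2;
        if (k > a[mid])
            l = mid + 1;
        else if (k < a[mid])
            r = mid - 1;
        else
            return mid;
    }
    return -1;
}

/* Hardware-oriented: the comparisons are independent of each other, so
   with a constant bound the loop could be unrolled into parallel
   comparators. */
int search_hw(const int a[], int k, const int s)
{
    for (int i = 0; i < s; i++)
        if (a[i] == k)
            return i;
    return -1;
}
```

In software the linear version does more work per lookup, which is the kind of overhead the refinement methodology weighs before applying this guideline.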
13
Pass-By-Value Return (PVR)
  • Problem: Array parameters cannot be prefetched
    due to potential aliases
  • Designer may know aliases don't exist
  • Guideline: Use pass-by-value-return
  • Automation
  • Requires global alias analysis

Original code:
    void f(int a, int b, int array[16]) {
        // unrelated computation
        g(array);
        // unrelated computation
    }
    int g(int array[16]) {
        // computation done on array
    }

Refined code:
    void f(int a, int b, int array[16]) {
        int localArray[16];
        memcpy(localArray, array, 16 * sizeof(int));
        // misc computation
        g(localArray);
        // misc computation
        memcpy(array, localArray, 16 * sizeof(int));
    }
    int g(int array[16]) {
        // computation done on array
    }
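
A compilable sketch of pass-by-value-return is given below. The kernel g here simply doubles each element, which is an invented stand-in for the unspecified computation on the slide; LEN and the simplified signature of f are also assumptions.

```c
#include <string.h>

#define LEN 16

/* Hypothetical kernel: doubles every element in place. */
void g(int array[LEN])
{
    for (int i = 0; i < LEN; i++)
        array[i] *= 2;
}

void f(int array[LEN])
{
    int local[LEN];                      /* private copy, cannot alias other data */
    memcpy(local, array, sizeof local);  /* copy in: pass by value */
    g(local);
    memcpy(array, local, sizeof local);  /* copy out: return */
}
```

The two memcpy calls are the software overhead of this guideline, which is why the methodology applies it only when that overhead is acceptable.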
14
Why Synthesis From C?
  • Why not use HDL?
  • HDL may yield better results
  • C is mainstream language
  • Acceptable performance in many cases
  • Learning HDL is large overhead
  • Approaches are orthogonal
  • This work focuses on improving mainstream
  • Guidelines common for HDL
  • Can also be applied to algorithmic HDL

15
Software Overhead
  • Refined regions may not be partitioned to
    hardware
  • Partitioner may select non-refined regions
  • OS may select software or hardware implementation
  • Based on state of FPGA
  • Coding guidelines have potential software overhead

[Figure: Hw/Sw partitioning splits motionComp(), filterLuma(), filterChroma(), deblocking(), . . . into two groups: motionComp() and deblocking(); filterLuma() and filterChroma()]
Problem: Refined code mapped to software
16
Refinement Methodology
  • Considerations
  • Reduce software overhead
  • Reduce refinement time
  • Methodology
  • Profile
  • Iterative-improvement
  • Determine critical region
  • Apply all except PVR/AS
  • Minimal overhead
  • Apply PVR if overhead acceptable
  • Apply AS if known algorithm and overhead
    acceptable

[Flowchart: Determine Critical Region -> Apply CC, CF, CEMA, CIE, CEDF, CECF, FS, LR -> (yes/no decision) -> Apply PVR -> (yes/no decision) -> Apply AS; repeat until performance acceptable]
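
The per-region decisions in the methodology can be sketched as a small function. This is a simplified reading of the bullets, not the authors' tool: the names Guideline, plan_region, and the three boolean inputs are hypothetical, and the outer "repeat until performance acceptable" loop is left to the caller.

```c
#include <stdbool.h>

/* Groups of guidelines, encoded as bit flags (hypothetical encoding). */
enum Guideline {
    LOW_OVERHEAD = 1 << 0,  /* CC, CF, CEMA, CIE, CEDF, CECF, FS, LR */
    PVR          = 1 << 1,  /* pass-by-value return */
    AS           = 1 << 2   /* algorithmic specialization */
};

/* Decide which guideline groups to apply to one critical region. */
unsigned plan_region(bool pvr_overhead_ok,
                     bool hw_algorithm_known,
                     bool as_overhead_ok)
{
    unsigned plan = LOW_OVERHEAD;          /* always applied: minimal overhead */
    if (pvr_overhead_ok)
        plan |= PVR;                       /* only if the copy overhead is acceptable */
    if (hw_algorithm_known && as_overhead_ok)
        plan |= AS;                        /* needs a known hw algorithm as well */
    return plan;
}
```

A caller would profile, pick the next critical region, apply the returned plan, and repeat until performance is acceptable.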
17
Experimental Setup
  • Benchmark suite
  • MediaBench, Powerstone
  • Manually applied guidelines
  • 1-2 hours
  • 23 additional lines/benchmark, on average
  • Target Architecture
  • Xilinx Virtex-II FPGA with ARM9 uP
  • Hardware/software partitioning
  • Selects critical regions for hardware
  • Synthesis
  • High-level synthesis tool
  • 30,000 lines of C code
  • Outputs register-transfer level (RTL) VHDL
  • RTL Synthesis using Xilinx ISE
  • Compilation
  • GCC with -O1 optimizations

[Figure: tool flow. Benchmarks -> Manual Refinement -> Refined Code -> Hw/Sw Partitioning -> Hw: Synthesis, Sw: Compilation -> Bitfile]
18
Speedups from Guidelines
19
Speedups from Guidelines
Algorithmic Specialization: Speedup 19x, Time 30 minutes, Sw Overhead 6000
20
Speedups from Guidelines
  • Original code
  • Speedups range from 1x (no speedup) to 573x
  • Average 2.6x (excludes brev)
  • Refined code with guidelines
  • Average 8.4x (excludes brev)
  • 3.5x average improvement compared to original code

21
Speedups from Guidelines
  • Guidelines move speedups closer to ideal
  • Almost identical for mpeg2, fir
  • Several examples still far from ideal
  • May imply new guidelines needed

22
Guideline SW Overhead/Improvement
Overhead
Improvement
  • Average Sw performance overhead -15.7
    (improvement)
  • -1.1 excluding brev
  • 3 examples improved
  • Average Sw size overhead (lines of C code)
  • 8.4 excluding brev

23
Summary
  • Simple coding guidelines significantly improve
    synthesis from C
  • 3.5x speedup compared to Hw/Sw synthesized from
    unrefined code
  • Major rewrites may not be necessary
  • Between 1-2 hours
  • Refinement Methodology
  • Reduces software size/performance overhead
  • In some cases, improvement
  • Future Work
  • Test on commercial synthesis tools
  • New guidelines for different domains