Title: Portability for FPGA Applications
1Portability for FPGA Applications: Warp Processing and SystemC Bytecode
Frank Vahid
Dept. of CSE, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
- Contributing Ph.D. Students
- Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
- Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
- Scotty Sirowy (current)
- David Sheldon (current)
- Chen Huang (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.
2Portable Applications on PCs
One binary
x86 binary
How? Why?
Pentium
Opteron
Atom
Dual Core
Multiple platforms
3Portable Applications on PCs
- Standard software binary
- Dynamic software binary translation
x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
4Meanwhile, Circuits on FPGAs Show Large Speedups
- Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS, MICRO, CASES, DAC, DATE, ICCAD, RAW, ...
5FPGAs Entering Computing Mainstream
- AMD Opteron
- Intel QuickAssist
- Cray, SGI
- Mitrionics
- IBM Cell (research)
- Xilinx, Altera
SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)
6Circuits on FPGAs are Software Binaries
- Microprocessor binaries (instructions): bits loaded into program memory
- FPGA binaries (circuits), aka "bitstream": bits loaded into LUTs and SMs (switch matrices); software, not hardware
7Portable Applications FPGAs
- Standard software binary
- Dynamic translation
x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
8Warp Processing
Step 1: Initially, software binary loaded into instruction memory
Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD
9Warp Processing
Step 2: Microprocessor executes instructions in software binary
Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD
10Warp Processing
Step 3: Profiler monitors instructions and detects critical regions in binary
Figure: Profiler observing the µP's instruction stream of repeated add and beq instructions; I Mem, D, FPGA, On-chip CAD
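A minimal sketch of how critical-region detection of this kind could work, assuming the profiler sees each taken branch and keeps a small table of counters keyed by backward-branch target. The table size and the names profile_branch and hottest_loop are illustrative assumptions, not the on-chip profiler's actual design.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16  /* hypothetical number of loop heads tracked */

typedef struct {
    uint32_t target;  /* address a backward branch jumps to (loop head) */
    uint32_t count;   /* how many times that backward branch was taken */
    int used;         /* nonzero once the slot holds a loop head */
} LoopEntry;

static LoopEntry loop_table[TABLE_SIZE];

/* Called for every taken branch observed on the instruction bus. */
void profile_branch(uint32_t pc, uint32_t target) {
    LoopEntry *free_slot = NULL;
    if (target >= pc) return;  /* forward branch: not a loop back-edge */
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (loop_table[i].used && loop_table[i].target == target) {
            loop_table[i].count++;
            return;
        }
        if (!loop_table[i].used && free_slot == NULL) free_slot = &loop_table[i];
    }
    if (free_slot != NULL) {  /* first sighting: start counting */
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
    /* Table full: this sketch simply ignores additional loops. */
}

/* Returns the loop head with the highest count, or 0 if none was seen. */
uint32_t hottest_loop(void) {
    uint32_t best_target = 0, best_count = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (loop_table[i].used && loop_table[i].count > best_count) {
            best_count = loop_table[i].count;
            best_target = loop_table[i].target;
        }
    }
    return best_target;
}
```

The address returned by hottest_loop would then be the starting point handed to the on-chip CAD in the next step.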
11Warp Processing
Step 4: On-chip CAD reads in critical region
Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD
12Warp Processing
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
Recover loops, arrays, subroutines, etc. needed to synthesize good circuits
Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD
13Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD
14Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA
Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD
15Warp Processing
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
>10x speedups for some apps
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD
16Warp Processing Challenges
- Can we decompile binaries sufficiently for synthesis?
- Can we just-in-time (JIT) compile to FPGAs?
Figure: on-chip CAD flow (profiling & partitioning, decompilation to CDFG, JIT FPGA compilation, binary updater) producing an FPGA binary and an updated microprocessor binary
17Decompilation
- Recover high-level information from binary: branches, loops, arrays, subroutines, ...
- Adapted previous methods for processor-to-processor translation (UQBT)
- Developed new synthesis-oriented methods (e.g., reroll loops, strength promotion)
Original C Code
    long f( short a[10] ) {
      long accum = 0;
      for (int i = 0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
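To make the recovery concrete, the sketch below renders the slide's assembly as register-level C next to the loop that decompilation aims to recover. It assumes reg2 holds the byte address of the array, that the load is 16 bits wide, and that the loop exits when reg3 reaches 10; the function names f_binary and f_recovered are illustrative. Reading the shift by 1 as a multiply by the 2-byte element size is what lets the address arithmetic be seen as the array access a[i].

```c
#include <stdint.h>
#include <string.h>

/* Register-level rendering of the assembly: byte addressing, no arrays.
   reg2 is assumed to hold the byte address of a[0] inside mem. */
long f_binary(const uint8_t *mem, uint32_t reg2) {
    uint32_t reg1, reg5;
    uint32_t reg3 = 0;                 /* Mov reg3, 0 */
    long reg4 = 0;                     /* Mov reg4, 0 */
    int16_t reg6;
    do {
        reg1 = reg3 << 1;              /* Shl: index times 2 bytes */
        reg5 = reg2 + reg1;            /* Add: byte address of element */
        memcpy(&reg6, mem + reg5, sizeof reg6);  /* Ld: 16-bit load */
        reg4 += reg6;                  /* Add reg4, reg4, reg6 */
        reg3 += 1;                     /* Add reg3, reg3, 1 */
    } while (reg3 != 10);              /* loop branch back to "loop" */
    return reg4;                       /* Ret reg4 */
}

/* High-level form a decompiler would try to recover: a counted loop
   over a 10-element array of 16-bit values (short on the slide). */
long f_recovered(const int16_t a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}
```

Both functions return the same sum for the same data, but only the second exposes the loop bound and array access pattern that a synthesis tool needs.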
18Decompilation Results vs. C
- Synthesis from decompiled binary is competitive
with synthesis from C
19Decompilation Results on Optimized H.264: In-depth Study with Freescale
- Again, competitive with synthesis from C
20Decompilation Effective Even with Compiler
Optimizations
- Do compiler optimizations hurt decompilation?
- (Surprisingly) found optimized code synthesizes
to even better circuits
Speedup when decompiled binary is partitioned and
synthesized to FPGA
Average Speedup of 10 Examples
21Decompilation
Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis.
Stitt et al., ICCAD02, DAC03, CODES/ISSS05, ICCAD05, FPGA05, TODAES06, TODAES07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)
22Warp Processing Challenges
- Can we decompile binaries sufficiently for synthesis?
- Can we just-in-time (JIT) compile to FPGAs?
Figure: on-chip CAD flow (profiling & partitioning, decompilation to CDFG, JIT FPGA compilation, binary updater) producing an FPGA binary and an updated microprocessor binary
23Challenge: JIT Compile to FPGA
- Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s
- Ultra-lean Riverside JIT FPGA tools (drawn to scale): 0.2 s
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
- Logic synthesis: run single expand phase
- Technology mapping: bottom-up graph clustering heuristic
- Placement: place critical path first, then adjacent items
- Routing: use resource graph that matches switch matrix / channel structure
- Penalty: 1.3-2x in performance and size (even more might be acceptable)
24JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler -> 40x speedup, 20x less memory, 1.3x-2x circuit penalty
Lysecky et al., DAC03, ISSS/CODES03, DATE04, DAC04, DATE05, FCCM05, TODAES06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
25Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
vs. 200 MHz ARM
1 ARM-only execution
Overall application speedup average is 7.4
26Warping Thread-Based Applications
- Multi-core platforms -> multi-threaded apps
    for (i = 0; i < 10; i++)
      thread_create( f, i );
- OS schedules threads onto available µPs
- Remaining threads added to queue
- OS invokes on-chip CAD tools to create accelerators for f()
- Thread warping: use one core to create accelerator for waiting threads
- OS schedules threads onto accelerators (possibly dozens), in addition to µPs
- Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
Figure: compiler produces binary containing f(); OS maps threads onto µPs and FPGA
27Memory Access Synchronization (MAS)
- Must deal with widely known memory bottleneck problem
- FPGAs great, but often can't get data to them fast enough
- Data for dozens of threads can create bottleneck (RAM -> DMA -> FPGA)
    for (i = 0; i < 10; i++)
      thread_create( thread_function, a, i );

    void f( int a[], int val ) {
      int result;
      for (i = 0; i < 10; i++)
        result += a[i] * val;
      . . . .
    }
- Threaded programs exhibit unique feature: multiple threads often access same or overlapping data (here, the same array)
- Solution: fetch data once, broadcast to multiple threads (MAS)
28Memory Access Synchronization (MAS)
- Detect overlapping memory regions: "windows"
- Synthesis creates active "smart buffer" [Guo/Najjar FPGA04]
- Actively fetches data, stores the reused data, delivers windows to threads
- Active rather than passive component, designed for specific threads
    for (i = 0; i < 100; i++)
      thread_create( thread_function, a, i );

    void f( int a[], int i ) {
      int result;
      result = a[i] + a[i+1] + a[i+2] + a[i+3];
      . . . .
    }
- Each thread accesses different addresses, but addresses may overlap
- Data streamed to smart buffer (RAM -> DMA -> smart buffer holding A[0-103])
- Buffer delivers window to each thread (A[0-3], A[1-4], ..., A[6-9], ...)
- W/O smart buffer: 400 memory accesses
- With smart buffer: 104 memory accesses
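The access counts can be checked with a small sketch, assuming thread i reads the window a[i] through a[i+3] and that the smart buffer fetches each distinct element exactly once. The function names and the MAX_ELEMS bound are illustrative. For 100 threads this count gives 400 reads without the buffer and 103 distinct elements (indices 0 to 102) with it; the slide quotes 104, matching a buffer range of A[0-103], so the one-element difference presumably comes from how the range is counted.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ELEMS 1024  /* hypothetical upper bound for this sketch */

/* Every thread fetches its own window straight from memory. */
size_t accesses_without_buffer(size_t threads, size_t window) {
    return threads * window;
}

/* Each distinct element is fetched once and then reused from the buffer.
   Thread i is assumed to read elements i .. i + window - 1. */
size_t accesses_with_buffer(size_t threads, size_t window) {
    unsigned char fetched[MAX_ELEMS];
    size_t fetches = 0;
    if (threads + window > MAX_ELEMS) return 0;  /* outside sketch range */
    memset(fetched, 0, sizeof fetched);
    for (size_t i = 0; i < threads; i++) {
        for (size_t k = 0; k < window; k++) {
            if (!fetched[i + k]) {   /* first use: one memory access */
                fetched[i + k] = 1;
                fetches++;
            }
        }
    }
    return fetches;
}
```

Either way the reduction is roughly fourfold, which is the point of broadcasting shared data rather than refetching it per thread.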
29Speedups from Thread Warping
- Chose benchmarks with extensive parallelism
- Four core (ARM11 400 MHz) base system
- Virtex IV FPGA at circuit-specific clock frequency (100-300 MHz)
- Average 130x speedup
- But, FPGA uses additional area. Our FPGA size: 36 ARM11s
- Still 20x faster than 32-core system (and 11x faster than 64-core)
- Simulation pessimistic, actual results likely better
- FPGA more flexible
30Warp Scenarios
Warping takes time (seconds, minutes, or more). When useful?
- Long-running applications
- Scientific computing, etc.
- Recurring applications (save and reuse FPGA configurations)
- Common in embedded systems
- Might view as (long) boot phase
- For networked/docked devices, CAD can occur on server (ongoing work)
Figure: execution timelines (µP, on-chip CAD) for long-running applications and for recurring applications (1st execution)
31Why Dynamic?
- Static good, but hiding FPGA opens technique to all sw platforms
- Standard languages/tools/binaries
- Static compiling to FPGAs: specialized language -> specialized compiler -> binary + netlist -> µP + FPGA
- Dynamic compiling to FPGAs: any language -> any compiler -> binary -> µP + FPGA
32Synthesis-Friendly Applications
- Coding style impacts synthesis results
33Synthesis-Friendly Application Coding Guidelines
Coding Guidelines
34Conversion to Explicit Control Flow (CECF)
- Problem: Function pointers may prevent static control flow analysis
- Guideline: Don't use function pointers. Replace with if-else, static calls
- Makes possible targets explicit
Before:
    void f( int (*fp) (int) ) {
      . . . . .
      for (i = 0; i < 10; i++)
        a[i] = fp(i);
    }
After:
    enum Target { FUNC1, FUNC2, FUNC3 };
    void f( enum Target fp ) {
      . . . . .
      for (i = 0; i < 10; i++) {
        if (fp == FUNC1) a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else a[i] = f3(i);
      }
    }
35Speedups from Synthesis-Friendly Coding
Guidelines
- 10 guidelines
- For a 1,000-line benchmark: 5-6 changes typical, tens of minutes each
36Speedups from Synthesis-Friendly Coding Guidelines
- Original C code (Powerstone, Mediabench)
- Original average speedup with FPGA: 2.6x (excludes brev)
- Refined C code with guidelines
- Average speedup: 8.4x (excludes brev)
- Guidelines led to 3.5x improvement of speedup
37Spatial Algorithms for FPGAs
- Example: Count patterns
- Sequential algorithm
- Hash table
- 10s cycles per pattern
    int patterns[1000];
    int counts[1000];
    while (1) {
      WaitForPattern();
      CurrPattern = X;
      hash = HashFct(CurrPattern);
      item = Find(patterns, CurrPattern, hash);
      if (item) counts[item]++;
    }
- Spatial algorithm
- Pipelined stages
- Essence is the connectivity of components, not the sequencing of instructions
Figure: CurrPattern arrives on a bus and passes through Level 1, Level 2, ..., Level m, each level holding a pattern, a count, and compare logic
38Spatial Algorithms for FPGAs
- Spatial algorithm 2
- Pipelined binary tree
Figure: the current pattern flows through pipelined levels, each with its own count/pattern memory and logic
Level 1: 1 count, 1 pattern
Level 2: 2 counts, 2 patterns
Level 3: 4 counts, 4 patterns
. . .
Level n: 2^n counts, 2^n patterns
39Example
Possible patterns pre-stored in binary search tree circuit
Incoming: 48 | Stage 1: 73 | Stage 2: - | Stage 3: - | Stage 4: -
40Example
Incoming: 23 | Stage 1: 48 | Stage 2: 73 | Stage 3: - | Stage 4: -
41Example
Incoming: 75 | Stage 1: 23 | Stage 2: 48 | Stage 3: 73 | Stage 4: -
42Example
Incoming: 11 | Stage 1: 75 | Stage 2: 23 | Stage 3: 48 | Stage 4: 73
Match counts shown: 1
43Example
Stage 1: 11 | Stage 2: 75 | Stage 3: 23 | Stage 4: 48
Match counts shown: 1, 1, 1
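The frames above can be mimicked in software. The following is a minimal sketch, assuming a complete binary search tree stored in heap order with one pipeline stage per tree level; the key values, LEVELS, and the name run_pipeline are illustrative and not taken from the slides. Each cycle one new pattern enters stage 1 while earlier patterns advance one level, so n patterns finish in n + LEVELS - 1 cycles rather than tens of cycles each.

```c
#include <stddef.h>

#define LEVELS 3                      /* hypothetical tree depth */
#define NODES ((1 << LEVELS) - 1)

typedef struct {
    int valid;    /* stage currently holds a pattern */
    int pattern;  /* pattern being looked up */
    int node;     /* tree node this stage compares against */
} StageReg;

/* Pre-stored patterns in heap order: children of node n are 2n+1, 2n+2. */
static const int keys[NODES] = { 48, 23, 75, 11, 30, 60, 90 };
static int counts[NODES];

/* Feeds one pattern per cycle; returns the number of cycles simulated. */
int run_pipeline(const int *input, size_t n) {
    StageReg reg[LEVELS] = { { 0, 0, 0 } };
    size_t total = n + LEVELS - 1;
    for (size_t cycle = 0; cycle < total; cycle++) {
        if (cycle < n) {              /* a new pattern enters stage 1 */
            reg[0].valid = 1;
            reg[0].pattern = input[cycle];
            reg[0].node = 0;
        }
        /* Last stage first, so each pattern advances one level per cycle. */
        for (int s = LEVELS - 1; s >= 0; s--) {
            if (!reg[s].valid) continue;
            int p = reg[s].pattern;
            int node = reg[s].node;
            reg[s].valid = 0;
            if (p == keys[node]) {
                counts[node]++;       /* match: bump this node's count */
            } else if (s + 1 < LEVELS) {
                reg[s + 1].valid = 1; /* pass down to left or right child */
                reg[s + 1].pattern = p;
                reg[s + 1].node = 2 * node + (p < keys[node] ? 1 : 2);
            }                         /* else: not stored, pattern dropped */
        }
    }
    return (int)total;
}
```

In hardware every stage works in the same clock cycle; the inner loop here only serializes that concurrency for simulation.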
44Study of Spatial Algorithms in FCCM
Year | Application | Type
2001 | 3D Vec. Normalization | Spatial
2001 | Efficient CAM | --
2001 | Automated Sensor | Temporal
2001 | Regular Expression | Spatial
2002 | Hyperspectral Image | Spatial
2002 | Machine Vision | Spatial
2002 | RC4 | Temporal
2002 | Set Covering | Spatial
2002 | Template Matching | Spatial
2002 | Triangle Mesh | Spatial
2003 | Congruential Sieves | Temporal
2003 | Content Scanning | Temporal
2003 | F.P and Square Root | Spatial
2003 | Gaussian Noise | Spatial
2003 | TRNG | --
2004 | 3D FDTD Method | Spatial
2004 | Deep Packet Filter | --
2004 | Online Floating Point | --
2004 | Molecular Dynamics | Spatial
2004 | Pattern Matching | Spatial
2004 | Seismic Migration | Spatial
2004 | Software Deceleration | --
2004 | V.M Window | --
2005 | Data Mining | Spatial
2005 | Cell Automata | Temporal
2005 | Particle Graphics | Spatial
2005 | Radiosity | Temporal
2005 | Transient Waves | Spatial
2005 | Road Traffic | Temporal
2006 | All Pairs Shortest Path | Spatial
2006 | Apriori Data Mining | Spatial
2006 | Molecular Dynamics | Spatial
2006 | Gaussian Elimination | Spatial
2006 | Radiation Dose | Temporal
2006 | Random Variates | Spatial
- FCCM 2001-2006
- 70 papers describing fast application on FPGA
- Examined 35 in depth (every other one)
- 6 used device-specific features
- 9 represented expected synthesized circuit from the obvious sequential algorithm
- 20 were spatially-oriented applications
- e.g., earlier pipelined binary tree
45Portable Spatial Applications?
- Current portable microprocessor binaries: sequential
- Extensions for threads, processes, ...
- How to support spatial constructs?
- Ports, connections, timing model
- .....
- SystemC (www.systemc.org)
- Adds libraries and macros, still standard C++
- Sequential and spatial constructs
- Compiling links in the simulation kernel
- Self-executing simulation
- Intended for SoC simulation
46Bytecode
- Modern portability approach
- Java, C#
- Compiler produces bytecode
- Virtual Machine (VM): program that executes bytecode; may JIT compile to native architecture
- One VM per platform: Pentium, Opteron, Atom
47SystemC Bytecode?
SystemC
Compiler
SystemC bytecode
VM
VM
VM
Opteron FPGA
Pentium
FPGA
48UCR SystemC Bytecode and Compiler
UCR's SystemC-to-bytecode compiler translates SystemC into UCR's SystemC bytecode.
SystemC:
    class EDGE_DETECTOR : public sc_module {
      // signal declarations
      EDGE_DETECTOR() {
        SC_METHOD(mainComp); sensitive << dataReady;
        SC_METHOD(getPixel); sensitive << clock.pos();
      }
      void getPixel() {
        dataReady.write(1);
      }
      void mainComp() {
        int i, j;
        for (i = 0; i < 3; i++)
          for (j = 0; j < 3; j++)
            sumX = sumX + mem.read() * GX[i][j];
        edge.write(sumX + sumY);
      }
    };
UCR's SystemC bytecode (spatial constructs: header signals and process declarations; process bodies: MIPS-like sequential instructions):
    --header
    signal clock 1
    signal reset 1
    signal memory_in 32
    signal fb_data 32
    signal leds 4
    process(clock)
    READ 1 memory_in
    ADD 2 0 3
    ADD 3 2 1
    WRITE 3 s1
    ADDI 1 0 1
    WRITE 1 dataReady
    END
    process(dataReady)
    READ 5 val6
    SW 5 24(0)
    READ 5 val7
    ADDI 10 0 0
    ADDI 7 0 0
    ADDI 13 0 8
    END
49SystemC Bytecode for FPGAs
50SystemC Bytecode Emulator
SystemC bytecode
Bytecode uploadable via USB drive
FPGA
Accelerators speed up emulation
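To illustrate what emulating one bytecode process might involve, here is a minimal interpreter sketch for a handful of the MIPS-like instructions. The operand order (destination first), the hardwired zero in register 0, the numeric signal indices, and all type and function names are assumptions made for illustration; the deck does not specify the encoding.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 16     /* hypothetical register file size */
#define NUM_SIGNALS 8   /* hypothetical number of signals */

typedef enum { OP_READ, OP_WRITE, OP_ADD, OP_ADDI, OP_END } Opcode;

/* One decoded instruction: READ/WRITE use (reg, signal);
   ADD uses (rd, rs, rt); ADDI uses (rd, rs, immediate). */
typedef struct {
    Opcode op;
    int a, b, c;
} Instr;

/* Runs one process body until END. Returns the number of instructions
   executed (END included), or -1 if the code ends without END.
   Operands are not validated in this sketch. */
int run_process(const Instr *code, size_t len, int32_t signals[NUM_SIGNALS]) {
    int32_t reg[NUM_REGS] = { 0 };
    int executed = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const Instr *in = &code[pc];
        executed++;
        switch (in->op) {
        case OP_READ:  reg[in->a] = signals[in->b];           break;
        case OP_WRITE: signals[in->b] = reg[in->a];           break;
        case OP_ADD:   reg[in->a] = reg[in->b] + reg[in->c];  break;
        case OP_ADDI:  reg[in->a] = reg[in->b] + in->c;       break;
        case OP_END:   return executed;
        }
        reg[0] = 0;  /* assumed MIPS-like register 0 that always reads 0 */
    }
    return -1;
}
```

An emulator along these lines would call run_process whenever a signal in a process's sensitivity list changes; that scheduling loop is omitted here.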
51SystemC Bytecode Accelerators
- Implementation
- MIPS-like multicycle RISC datapath
- 100 MHz clock
- 33 million instr/sec
- Communicates with core emulator through memory-mapped registers
- Area: 5000 slices
- # of accelerators limited to # of masters allowed on bus
- 1200 lines of VHDL
SystemC bytecode
Accelerator 1
Accelerator 2
Accelerator 3
FPGA
52Dynamic SystemC Accelerator Management
- Only a limited number of SystemC accelerators can fit on an FPGA fabric
- Dynamically map processes to accelerators based on process usage
- Involves online algorithms
Figure (image filter example): SystemC bytecode, Accelerators 1-3 on the FPGA; numeric labels 42, 43, 11, 12, 10, 44
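One simple online policy of this kind can be sketched as follows, assuming each process activation bumps a usage counter and a non-resident process displaces the least-used resident one only once its own count is strictly higher. The policy, the sizes, and the name on_activation are illustrative assumptions rather than the algorithm used in the UCR system.

```c
#define NUM_ACCEL 3   /* accelerators that fit on the FPGA (as in the figure) */
#define NUM_PROC 8    /* hypothetical number of bytecode processes */

static int usage[NUM_PROC];                     /* activations per process */
static int resident[NUM_ACCEL] = { -1, -1, -1 };  /* process in each slot */

/* Records one activation of process p (0 <= p < NUM_PROC).
   Returns 1 if p ran on an accelerator, 0 if it ran in the emulator.
   May load p into an accelerator for future activations. */
int on_activation(int p) {
    int victim = 0;
    usage[p]++;
    for (int a = 0; a < NUM_ACCEL; a++) {
        if (resident[a] == p) return 1;         /* already accelerated */
    }
    for (int a = 0; a < NUM_ACCEL; a++) {
        if (resident[a] < 0) {                  /* free slot: take it */
            resident[a] = p;
            return 0;
        }
        if (usage[resident[a]] < usage[resident[victim]]) victim = a;
    }
    /* All slots busy: evict the least-used resident if p is now hotter. */
    if (usage[p] > usage[resident[victim]]) resident[victim] = p;
    return 0;
}
```

Because decisions are made one activation at a time with no knowledge of future activations, this is an online algorithm in the sense the slide mentions; better policies would also weigh the cost of remapping.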
53Just-in-Time Synthesis
Send SystemC bytecode to synthesis server
SystemC bytecode
Dynamically reconfigure some or all of the FPGA
FPGA Specific Bitstream
Possible to even perform synthesis on-chip: warp processing (previous UCR work)
54Spatial Algorithms for FPGAs
- Even better spatial algorithm for pattern counting
- Pipelined binary tree
Figure: CurrPattern flows through pipelined levels, each with its own count/pattern memory and logic
Level 1: 1 count, 1 pattern
Level 2: 2 counts, 2 patterns
Level 3: 4 counts, 4 patterns
. . .
Level n: 2^n counts, 2^n patterns
55Study of Spatial Algorithms in FCCM (Sirowy
FPGA2008)
57Transmuting Coprocessors
58FPGA is a Size-Limited Coprocessing Resource
App executions change. Must decide which coprocessors should be FPGA-resident at a given time: "transmuting coprocessors"
Speedup with previous apps
Upload app profile info
Select coproc. set, generate new FPGA bitstream
FPGA implements coprocessors
Send back new bitstream, re-program FPGA
59Transmuting Coprocessor Demo
- Three image filters
- Blur filter (S/L): Blur the image
- Sobel filter (S/L): Find the edges of the image
- Emboss filter (S/L): Emboss the image
- Platform
- Virtex 2P (XC2VP30): PPC + coprocessors
- PPC frequency: 100 MHz
- Coproc. frequency: 50 MHz
- Speedup: 30x (small), 120x (large)
Size (slices) | Small | Large
Blur | 30 | 120
Sobel | 228 | 912
Emboss | 81 | 324
60Demo architecture
- Image (128x128 pixels and 24-bit color): 24 BRAMs
- Soft version: Read (Image BRAM) -> Execution (PPC) -> Write (Display BRAM)
- Coprocessor version: Read (Image BRAM) -> Execution (Coproc) -> Write (Display BRAM)
- Dock: send the profile information through UART
Figure: PPC, coprocessor, and peripherals (UART, push button, interface to external) on the PLB with instruction, image, and display BRAMs; VGA control drives the VGA display; built with EDK and ISE
61Coprocessor configurations
- Microprocessor only
- Small blur + small sobel
- Small blur + small emboss
- Small sobel + small emboss
- Large blur
- Large sobel
- Large emboss
- Choose the configuration according to app profile info.
Figure: Virtex2P with PPC, peripherals, memory, and a coprocessor region holding one of the configurations above
62Video demo program flow
Step | Time
Dock: CP selection | 0.001 s
Start IMPACT: FPGA reprogramming | 12 s
Filter: PPC only (128 frames) | 30 s
Filter: CP small (128 frames) | 1 s
Filter: CP large (128 frames) | 0.25 s
Flow steps: execution, update profile information, dock, read profile info from UART, select new program file, reprogram FPGA
Different objectives and different heuristics.
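The selection step can be sketched as a small search over the seven configurations, assuming the profile is a count of runs per filter and that every filter takes the demo's per-run times (30 s on the PPC, 1 s on a small coprocessor, 0.25 s on a large one). Applying the same times to all three filters, ignoring the 12 s reprogramming cost, and the names used are simplifying assumptions for illustration; other objectives would call for other heuristics.

```c
#include <stddef.h>

enum { BLUR, SOBEL, EMBOSS, NUM_FILTERS };
typedef enum { SW, SMALL, LARGE } Impl;

typedef struct {
    const char *name;
    Impl impl[NUM_FILTERS];  /* how each filter runs in this configuration */
} Config;

static const Config configs[] = {
    { "Microprocessor only",        { SW,    SW,    SW    } },
    { "Small blur + small sobel",   { SMALL, SMALL, SW    } },
    { "Small blur + small emboss",  { SMALL, SW,    SMALL } },
    { "Small sobel + small emboss", { SW,    SMALL, SMALL } },
    { "Large blur",                 { LARGE, SW,    SW    } },
    { "Large sobel",                { SW,    LARGE, SW    } },
    { "Large emboss",               { SW,    SW,    LARGE } },
};
#define NUM_CONFIGS (sizeof configs / sizeof configs[0])

/* Seconds per 128-frame run, indexed by Impl (assumed equal per filter). */
static const double run_time[] = { 30.0, 1.0, 0.25 };

/* Returns the index of the configuration with the lowest total filter
   time for the profiled run counts; ties keep the earlier entry. */
size_t select_config(const unsigned runs[NUM_FILTERS]) {
    size_t best = 0;
    double best_time = -1.0;
    for (size_t c = 0; c < NUM_CONFIGS; c++) {
        double t = 0.0;
        for (int f = 0; f < NUM_FILTERS; f++) {
            t += runs[f] * run_time[configs[c].impl[f]];
        }
        if (best_time < 0.0 || t < best_time) {
            best_time = t;
            best = c;
        }
    }
    return best;
}
```

With a profile of ten blur runs and one sobel run, two small coprocessors beat one large blur coprocessor (11 s versus 32.5 s), which shows why the best choice depends on the usage mix and not only on the single hottest filter.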
63Dynamic Enables Expandable Logic Concept
RAM
Expandable Logic
Expandable RAM
uP
Performance
64Summary
- FPGAs entering mainstream
- Portability of applications is important
- Dynamic binary translation to FPGAs: Warp processing
- Shown feasible. Extensive future work
- Trends towards FPGA ubiquity
- Microprocessor binaries need extensions for spatial constructs
- One approach: SystemC bytecode and virtual machine
- Can also be warped for circuit-speed
http://www.cs.ucr.edu/~vahid/pubs