1
Portability for FPGA Applications: Warp Processing and SystemC Bytecode

Contributing Ph.D. Students
Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
Scotty Sirowy (current)
David Sheldon (current)
Chen Huang (current)

Frank Vahid
Dept. of CSE, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.
2
Portable Applications on PCs
One binary
x86 binary
How? Why?
Pentium
Opteron
Atom
Dual Core
Multiple platforms
3
Portable Applications on PCs

Standard software binary
Dynamic software binary translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
4
Meanwhile, Circuits on FPGAs Show Large Speedups

Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS,
MICRO, CASES, DAC, DATE, ICCAD, RAW,

5
FPGAs Entering Computing Mainstream

AMD Opteron
Intel QuickAssist
Cray, SGI
Mitrionics
IBM Cell (research)
Xilinx, Altera

SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)
6
Circuits on FPGAs are Software Binaries
Microprocessor Binaries (Instructions): bits loaded into program memory
FPGA Binaries (Circuits), aka "bitstream": bits loaded into LUTs and SMs, not hardware
[Figure: example bit patterns 0111 and 0010 loaded into the FPGA and the program memory]
7
Portable Applications FPGAs

Standard software binary
Dynamic translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
8
Warp Processing
Step 1: Initially, software binary loaded into instruction memory
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
9
Warp Processing
Step 2: Microprocessor executes instructions in software binary
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
10
Warp Processing
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: Profiler observing the µP instruction stream (repeated beq and add instructions); I Mem, D, FPGA, On-chip CAD]
11
Warp Processing
Step 4: On-chip CAD reads in critical region
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
12
Warp Processing
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
Recover loops, arrays, subroutines, etc. needed to synthesize good circuits
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
13
Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
14
Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]

15
Warp Processing
>10x speedups for some apps
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]

16
Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

[Figure: on-chip CAD flow. Labels: Profiling & partitioning, Decompilation, CDFG, JIT FPGA compilation, Binary Updater, FPGA binary, Microp binary. Hardware blocks: Profiler, µP, I, D, FPGA, On-chip CAD]
17
Decompilation

Recover high-level information from binary: branches, loops, arrays, subroutines, ...
Adapted previous methods for processor-to-processor translation (UQBT)
Developed new synthesis-oriented methods (e.g., reroll loops, strength promotion)

Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i = 0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
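Loop rerolling, one of the synthesis-oriented methods named on this slide, can be illustrated in plain C. The sketch below is only an illustration under assumptions: the function names are hypothetical, the unroll factor of 2 is chosen arbitrarily, and it shows the before and after shapes of the code, not the decompiler itself. A compiler may emit the unrolled form; a decompiler aimed at synthesis wants to get back to the compact loop so that the synthesis tool can choose its own unrolling or pipelining.

```c
#include <assert.h>
#include <stddef.h>

/* Shape a compiler might emit after unrolling by 2 (hypothetical example). */
long sum_unrolled(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i += 2) {
        accum += a[i];
        accum += a[i + 1];
    }
    return accum;
}

/* Shape a decompiler tries to recover: one loop body per iteration. */
long sum_rerolled(const short a[10]) {
    assert(a != NULL);
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}
```

Both functions compute the same sum; only the loop structure differs, which is the information rerolling restores.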
18
Decompilation Results vs. C

Synthesis from decompiled binary is competitive
with synthesis from C

19
Decompilation Results on Optimized H.264: In-depth Study with Freescale

Again, competitive with synthesis from C

20
Decompilation Effective Even with Compiler
Optimizations

Do compiler optimizations hurt decompilation?
(Surprisingly) found optimized code synthesizes
to even better circuits

Speedup when decompiled binary is partitioned and
synthesized to FPGA
Average Speedup of 10 Examples
21
Decompilation
Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis
Stitt et al.: ICCAD02, DAC03, CODES/ISSS05, ICCAD05, FPGA05, TODAES06, TODAES07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)
22
Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

[Figure: on-chip CAD flow. Labels: Profiling & partitioning, Decompilation, CDFG, JIT FPGA compilation, Binary Updater, FPGA binary, Microp binary. Hardware blocks: Profiler, µP, I, D, FPGA, On-chip CAD]
23
Challenge: JIT Compile to FPGA
Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s

Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
Logic synthesis: run single expand phase
Technology mapping: bottom-up graph clustering heuristic
Placement: place critical path first, then adjacent items
Routing: use resource graph that matches switch matrix / channel structure

Ultra-lean Riverside JIT FPGA tools (drawn to scale): 0.2 s
Penalty: 1.3-2x in performance/size (even more might be acceptable)
24
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler achieves 40x speedup, 20x less memory, 1.3x-2x circuit penalty
Lysecky et al.: DAC03, ISSS/CODES03, DATE04, DAC04, DATE05, FCCM05, TODAES06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
25
Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
vs. 200 MHz ARM
1 ARM-only execution
Overall application speedup average is 7.4
26
Warping Thread-Based Applications
for (i = 0; i < 10; i++) thread_create( f, i );

Multi-core platforms → multi-threaded apps
OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f()
Thread warping: use one core to create accelerator for waiting threads
OS schedules threads onto accelerators (possibly dozens), in addition to µPs
Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
[Figure: Compiler produces Binary containing f(); OS dispatches f() threads to several µPs and to the FPGA; performance plot]
27
Memory Access Synchronization (MAS)

Must deal with widely known memory bottleneck problem
FPGAs great, but often can't get data to them fast enough

for (i = 0; i < 10; i++) thread_create( thread_function, a, i );
[Figure: RAM, DMA, FPGA]
Data for dozens of threads can create bottleneck
void f( int a, int val ) int result
for (i 0 i lt 10 i) result ai
val . . . .
FPGA
.
Same array

Threaded programs exhibit unique feature: multiple threads often access same or overlapping data
Solution: fetch data once, broadcast to multiple threads (MAS)

28
Memory Access Synchronization (MAS)

Detect overlapping memory regions ("windows")

Synthesis creates active "smart buffer" (Guo/Najjar FPGA04)
Actively fetches data, stores the reused data, delivers windows to threads
Active rather than passive component designed for specific threads

for (i = 0; i < 100; i++) thread_create( thread_function, a, i );

void f( int *a, int i ) {
  int result;
  result = a[i] + a[i+1] + a[i+2] + a[i+3];
  . . . .
}

[Figure: data a[0], a[1], ..., a[5], ... streamed from RAM via DMA to the smart buffer (A[0-103]); the buffer delivers windows A[0-3], A[1-4], ..., A[6-9] to the threads]

Each thread accesses different addresses, but addresses may overlap
Buffer delivers window to each thread
Without smart buffer: 400 memory accesses. With smart buffer: 104 memory accesses
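The saving from fetching shared data once can be checked with a small software model. The sketch below is an assumption-laden illustration, not the synthesized smart buffer: the names `ram_read`, `run_direct`, and `run_buffered` are hypothetical, the window is fixed at 4 elements as in f() above, and thread i is assumed to read a[i] through a[i+3]. It counts simulated RAM reads when every thread fetches its own window, and when each element is streamed once through a 4-entry buffer.

```c
#include <assert.h>

#define WINDOW 4

int ram_reads = 0;  /* counts simulated RAM accesses */

static int ram_read(const int *a, int idx) {
    ram_reads++;
    return a[idx];
}

/* Each thread fetches its own window directly from RAM. */
void run_direct(const int *a, int threads, int *result) {
    assert(threads > 0);
    for (int t = 0; t < threads; t++) {
        result[t] = 0;
        for (int k = 0; k < WINDOW; k++)
            result[t] += ram_read(a, t + k);
    }
}

/* Smart-buffer style: stream each element from RAM once, keep the last
   WINDOW elements, and deliver a window as soon as it is complete. */
void run_buffered(const int *a, int threads, int *result) {
    assert(threads > 0);
    int buf[WINDOW];
    int n = threads + WINDOW - 1;       /* distinct elements touched */
    for (int i = 0; i < n; i++) {
        buf[i % WINDOW] = ram_read(a, i);
        int t = i - (WINDOW - 1);       /* thread whose window just completed */
        if (t >= 0) {
            result[t] = 0;
            for (int k = 0; k < WINDOW; k++)
                result[t] += buf[k];    /* a sum does not depend on order */
        }
    }
}
```

For three threads over a[0] to a[5], the direct version performs 12 reads and the buffered version 6, while both deliver the same per-thread results.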
29
Speedups from Thread Warping

Chose benchmarks with extensive parallelism
Four core (ARM11 400 MHz) base system
Virtex IV FPGA at circuit-specific clock
frequency (100-300 MHz)
Average 130x speedup

But FPGA uses additional area. Our FPGA size ≈ 36 ARM11s

Still 20x faster than 32-core system (and 11x
faster than 64-core)
Simulation pessimistic, actual results likely
better
FPGA more flexible

30
Warp Scenarios
Warping takes time (seconds, minutes, or more): when is it useful?

Long-running applications
Scientific computing, etc.

Recurring applications (save and reuse FPGA
configurations)
Common in embedded systems
Might view as (long) boot phase
For networked/docked devices, CAD can occur on
server (ongoing work)

Long Running Applications
Recurring Applications
µP (1st execution)
On-chip CAD
µP
Time
Time
31
Why Dynamic?

Static good, but hiding FPGA opens technique to
all sw platforms
Standard languages/tools/binaries

Static Compiling to FPGAs: Specialized Language, Specialized Compiler, Binary plus Netlist
Dynamic Compiling to FPGAs: Any Language, Any Compiler, Binary
[Figure: both flows target a µP and an FPGA]
32
Synthesis-Friendly Applications

Coding style impacts synthesis results

33
Synthesis-Friendly Application Coding Guidelines
Coding Guidelines
34
Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control flow analysis
Guideline: Don't use function pointers. Replace with if-else, static calls
Makes possible targets explicit

void f( int (*fp) (int) ) {
  . . . . .
  for (i = 0; i < 10; i++)
    a[i] = fp(i);
}

enum Target { FUNC1, FUNC2, FUNC3 };
void f( enum Target fp ) {
  . . . . .
  for (i = 0; i < 10; i++) {
    if (fp == FUNC1) a[i] = f1(i);
    else if (fp == FUNC2) a[i] = f2(i);
    else a[i] = f3(i);
  }
}
35
Speedups from Synthesis-Friendly Coding
Guidelines

10 guidelines
For 1,000 line benchmark 5-6 changes typical,
tens of minutes each

36
Speedups from Synthesis-Friendly Coding Guidelines

Original C code (Powerstone, Mediabench)
Original average speedups with FPGA 2.6x
(excludes brev)
Refined C code with guidelines
Average speedup 8.4x (excludes brev)
Guidelines led to 3.5x improvement of speedup

37
Spatial Algorithms for FPGAs

Example: Count patterns
Sequential algorithm
Hash table
10s of cycles per pattern

Spatial algorithm
Pipelined stages
Essence is the connectivity of components, not the sequencing of instructions

int patterns[1000]; int counts[1000];
while (1) {
  WaitForPattern();
  CurrPattern = X;
  hash = HashFct(CurrPattern);
  item = Find(patterns, CurrPattern, hash);
  if (item) counts[item]++;
}

[Figure: CurrPattern arrives on a bus and passes through pipelined Levels 1 to m; each level holds a pattern, a count, and compare logic]
38
Spatial Algorithms for FPGAs
Current pattern

Spatial algorithm 2
Pipelined binary tree

Level 1: memory holds 1 pattern and 1 count, plus logic
Level 2: memory holds 2 patterns and 2 counts, plus logic
Level 3: memory holds 4 patterns and 4 counts, plus logic
. . .
Level n: memory holds 2^n patterns and 2^n counts, plus logic
. . .
39
Example
48
73
Possible patterns pre-stored in binary search
tree circuit
Stage 1
Stage 2
Stage 3
Stage 4
40
Example
23
48
Stage 1
73
Stage 2
Stage 3
Stage 4
41
Example
75
23
Stage 1
48
Stage 2
73
Stage 3
Stage 4
42
Example
11
75
Stage 1
23
Stage 2
48
Stage 3
73
Stage 4
1
43
Example
11
Stage 1
75
Stage 2
1
23
Stage 3
48
Stage 4
1
1
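The pipelined binary tree in this example can be modeled in software to show why it accepts one new pattern per clock. The following is a minimal sketch under assumptions: the names `Pipeline`, `pipeline_init`, and `pipeline_clock` are hypothetical, the tree has only 3 levels, and it is stored in heap layout (children of node i at 2i+1 and 2i+2). Each stage compares against one node of its own level, then hands the pattern to the next stage, so several patterns are in flight at once.

```c
#include <assert.h>
#include <stdbool.h>

#define LEVELS 3
#define NODES ((1 << LEVELS) - 1)

typedef struct {
    bool valid;   /* stage holds a pattern this cycle */
    int pattern;  /* pattern being looked up */
    int node;     /* tree node this stage compares against */
} StageReg;

typedef struct {
    int keys[NODES];    /* binary search tree in heap layout */
    int counts[NODES];  /* one counter per stored pattern */
    StageReg stage[LEVELS];
} Pipeline;

void pipeline_init(Pipeline *p, const int keys[NODES]) {
    assert(p != 0 && keys != 0);
    for (int i = 0; i < NODES; i++) {
        p->keys[i] = keys[i];
        p->counts[i] = 0;
    }
    for (int s = 0; s < LEVELS; s++)
        p->stage[s].valid = false;
}

/* One clock cycle: every stage does one compare, patterns advance one
   stage, and a new pattern (if any) enters the first stage. */
void pipeline_clock(Pipeline *p, bool has_input, int pattern) {
    /* Walk back to front so each register is read before it is overwritten. */
    for (int s = LEVELS - 1; s >= 0; s--) {
        StageReg cur = p->stage[s];
        StageReg next = { false, 0, 0 };
        if (cur.valid) {
            int key = p->keys[cur.node];
            if (cur.pattern == key) {
                p->counts[cur.node]++;              /* match: count it */
            } else {
                next.valid = true;                  /* descend left or right */
                next.pattern = cur.pattern;
                next.node = 2 * cur.node + (cur.pattern < key ? 1 : 2);
            }
        }
        if (s + 1 < LEVELS)
            p->stage[s + 1] = next;                 /* a miss at a leaf is dropped */
    }
    p->stage[0].valid = has_input;
    p->stage[0].pattern = pattern;
    p->stage[0].node = 0;
}
```

To use it, call `pipeline_clock` once per incoming pattern, then clock LEVELS more times with `has_input` false to drain the pipeline. With a hypothetical tree {48, 23, 73, 11, 30, 60, 75} and the input values 73, 48, 23, 75, 11 borrowed from the slides, each of those five patterns ends with a count of one.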
44
Study of Spatial Algorithms in FCCM
Year  Application              Type
2001  3D Vec. Normalization    Spatial
2001  Efficient CAM            --
2001  Automated Sensor         Temporal
2001  Regular Expression       Spatial
2002  Hyperspectral Image      Spatial
2002  Machine Vision           Spatial
2002  RC4                      Temporal
2002  Set Covering             Spatial
2002  Template Matching        Spatial
2002  Triangle Mesh            Spatial
2003  Congruential Sieves      Temporal
2003  Content Scanning         Temporal
2003  F.P and Square Root      Spatial
2003  Gaussian Noise           Spatial
2003  TRNG                     --
2004  3D FDTD Method           Spatial
2004  Deep Packet Filter       --
2004  Online Floating Point    --
2004  Molecular Dynamics       Spatial
2004  Pattern Matching         Spatial
2004  Seismic Migration        Spatial
2004  Software Deceleration    --
2004  V.M Window               --
2005  Data Mining              Spatial
2005  Cell Automata            Temporal
2005  Particle Graphics        Spatial
2005  Radiosity                Temporal
2005  Transient Waves          Spatial
2005  Road Traffic             Temporal
2006  All Pairs Shortest Path  Spatial
2006  Apriori Data Mining      Spatial
2006  Molecular Dynamics       Spatial
2006  Gaussian Elimination     Spatial
2006  Radiation Dose           Temporal
2006  Random Variates          Spatial

FCCM 2001-2006
70 papers describing fast application on FPGA
Examined 35 in depth (every other one)
6 used device-specific features
9 represented expected synthesized circuit from
the obvious sequential algorithm
20 were spatially-oriented applications
e.g., earlier pipelined binary tree

45
Portable Spatial Applications?

Current portable microprocessor binaries are sequential
Extensions for threads, processes, ...
How to support spatial constructs?
Ports, connections, timing model
.....

SystemC
Adds libraries and macros, still standard C++
Sequential and spatial constructs
Compiling links in the simulation kernel
Self-executing simulation
Intended for SoC simulation

www.systemc.org
46
Bytecode

Modern portability approach
Java, C#

Compiler produces bytecode
Virtual Machine (VM): program that executes bytecode; may JIT compile to native architecture
[Figure: the same bytecode runs on a VM for each of Pentium, Opteron, Atom]
47
SystemC Bytecode?
SystemC
Compiler
SystemC bytecode
VM
VM
VM
Opteron FPGA
Pentium
FPGA
48
UCR SystemC Bytecode and Compiler
SystemC
class EDGE_DETECTOR : public sc_module {
  // signal declarations
  EDGE_DETECTOR() {
    SC_method(mainComp);
    sensitive << dataReady;
    SC_method(getPixel);
    sensitive << clock.pos();
  }
  void getPixel() {
    dataReady.write(1);
  }
  void mainComp() {
    int i, j;
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        sumX = sumX + mem.read() * GX[i][j];
    edge.write(sumX + sumY);
  }
};

UCR's SystemC-to-bytecode compiler

UCR's SystemC bytecode
--header
signal clock 1
signal reset 1
signal memory_in 32
signal fb_data 32
signal leds 4
process(clock)
READ 1 memory_in
ADD 2 0 3
ADD 3 2 1
WRITE 3 s1
ADDI 1 0 1
WRITE 1 dataReady
END
process(dataReady)
READ 5 val6
SW 5 24(0)
READ 5 val7
ADDI 10 0 0
ADDI 7 0 0
ADDI 13 0 8
END

Bytecode combines spatial constructs (signals, processes) with MIPS-like sequential instructions
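How a virtual machine might step through the sequential part of such bytecode can be sketched in C. Everything here is an assumption made for illustration: the `Instr` encoding, the operand order (destination, source, then second source or immediate), the 16-register file, and register 0 being hardwired to zero are guesses based on the MIPS-like listing, not UCR's actual format. Only ADD and ADDI are modeled; READ, WRITE, and SW would need the signal and memory model.

```c
#include <assert.h>

enum Opcode { OP_ADD, OP_ADDI, OP_END };

/* Hypothetical decoded instruction: "ADD rd rs rt" or "ADDI rd rs imm". */
typedef struct {
    enum Opcode op;
    int rd;
    int rs;
    int rt_or_imm;
} Instr;

#define NUM_REGS 16

/* Executes one process body up to END; returns the number of
   instructions executed. */
int run_process(const Instr *code, int regs[NUM_REGS]) {
    assert(code != 0 && regs != 0);
    int executed = 0;
    for (const Instr *pc = code; pc->op != OP_END; pc++) {
        switch (pc->op) {
        case OP_ADD:
            regs[pc->rd] = regs[pc->rs] + regs[pc->rt_or_imm];
            break;
        case OP_ADDI:
            regs[pc->rd] = regs[pc->rs] + pc->rt_or_imm;
            break;
        default:
            break;
        }
        regs[0] = 0;  /* assumption: register 0 always reads as zero */
        executed++;
    }
    return executed;
}
```

Under these assumptions, the tail of the process(dataReady) listing (ADDI 10 0 0, ADDI 7 0 0, ADDI 13 0 8, END) executes three instructions and leaves 8 in register 13.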
49
SystemC Bytecode for FPGAs

Demo

50
SystemC Bytecode Emulator
SystemC bytecode
Bytecode uploadable via USB drive
FPGA
Accelerators speed up emulation
51
SystemC Bytecode Accelerators

Implementation
MIPS-like multicycle RISC datapath
100 MHz Clock
33 Million Instr/Sec
Communicates with core emulator via memory-mapped registers
Area: 5000 slices
# of accelerators limited to # of masters allowed on bus
on bus
1200 lines of VHDL

SystemC bytecode
Accelerator 1
Accelerator 2
Accelerator 3
FPGA
52
Dynamic SystemC Accelerator Management

Only a limited number of SystemC accelerators can
fit on an FPGA fabric
Dynamically map processes to accelerators based
on process usage
Involves online algorithms

SystemC bytecode
42
43
11
12
10
44
Accelerator 1
Accelerator 2
Accelerator 3
FPGA
Image Filter Example
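Mapping processes to a limited number of accelerators based on usage can be sketched as a simple greedy selection. This is an illustrative policy only, not necessarily the online algorithm used in the UCR work: the function name `select_processes`, the use of raw usage counters, and the fixed count of three accelerators (matching the figure) are assumptions.

```c
#include <assert.h>

#define NUM_ACCEL 3

/* Fills mapped[] with the ids of the NUM_ACCEL most-used processes,
   most used first. Ties go to the lower process id. */
void select_processes(const int usage[], int num_procs, int mapped[NUM_ACCEL]) {
    assert(usage != 0 && num_procs >= NUM_ACCEL);
    for (int a = 0; a < NUM_ACCEL; a++) {
        int best = -1;
        for (int p = 0; p < num_procs; p++) {
            int taken = 0;
            for (int k = 0; k < a; k++)
                if (mapped[k] == p)
                    taken = 1;          /* already assigned an accelerator */
            if (!taken && (best < 0 || usage[p] > usage[best]))
                best = p;
        }
        mapped[a] = best;
    }
}
```

A real manager would rerun such a selection as the counters change and weigh the cost of swapping a process in or out of an accelerator, which is where online algorithms come in.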
53
Just-in-Time Synthesis
Send SystemC bytecode to synthesis server
SystemC bytecode
Dynamically reconfigure some or all of the FPGA
FPGA Specific Bitstream
Possible to even perform synthesis on-chip: warp processing (previous UCR work)
54
Spatial Algorithms for FPGAs
CurrPattern

Even better spatial algorithm for pattern counting
Pipelined binary tree

Level 1: memory holds 1 pattern and 1 count, plus logic
Level 2: memory holds 2 patterns and 2 counts, plus logic
Level 3: memory holds 4 patterns and 4 counts, plus logic
. . .
Level n: memory holds 2^n patterns and 2^n counts, plus logic
. . .
55
Study of Spatial Algorithms in FCCM (Sirowy
FPGA2008)
Year  Application              Type
2001  3D Vec. Normalization    Spatial
2001  Efficient CAM            --
2001  Automated Sensor         Temporal
2001  Regular Expression       Spatial
2002  Hyperspectral Image      Spatial
2002  Machine Vision           Spatial
2002  RC4                      Temporal
2002  Set Covering             Spatial
2002  Template Matching        Spatial
2002  Triangle Mesh            Spatial
2003  Congruential Sieves      Temporal
2003  Content Scanning         Temporal
2003  F.P and Square Root      Spatial
2003  Gaussian Noise           Spatial
2003  TRNG                     --
2004  3D FDTD Method           Spatial
2004  Deep Packet Filter       --
2004  Online Floating Point    --
2004  Molecular Dynamics       Spatial
2004  Pattern Matching         Spatial
2004  Seismic Migration        Spatial
2004  Software Deceleration    --
2004  V.M Window               --
2005  Data Mining              Spatial
2005  Cell Automata            Temporal
2005  Particle Graphics        Spatial
2005  Radiosity                Temporal
2005  Transient Waves          Spatial
2005  Road Traffic             Temporal
2006  All Pairs Shortest Path  Spatial
2006  Apriori Data Mining      Spatial
2006  Molecular Dynamics       Spatial
2006  Gaussian Elimination     Spatial
2006  Radiation Dose           Temporal
2006  Random Variates          Spatial

FCCM 2001-2006
70 papers describing fast application on FPGA
Examined 35 in depth (every other one)
6 used device-specific features
9 represented expected synthesized circuit from
the obvious sequential algorithm
20 were spatially-oriented applications
akin to earlier pipelined binary tree

56
Portable Spatial Applications?

Current portable microprocessor binaries are sequential
Extensions for threads, processes, ...
How to support spatial constructs?
Ports, connections, timing model
.....

SystemC
Adds libraries and macros, still standard C++
Sequential and spatial constructs
Compiling links in the simulation kernel
Self-executing simulation
Intended for SoC simulation

www.systemc.org
57
Transmuting Coprocessors

Demo

58
FPGA is a Size-Limited Coprocessing Resource
App executions change. Must decide which coprocessors should be FPGA-resident at a given time: "transmuting coprocessors"
Speedup with previous apps
Upload app profile info
Select coproc. set, generate new FPGA bitstream
Send back new bitstream, re-program FPGA
FPGA implements coprocessors
59
Transmuting Coprocessor Demo

Three image filters
Blur filter (S/L): blur the image
Sobel filter (S/L): find the edges of the image
Emboss filter (S/L): emboss the image
Platform
Virtex 2P (XC2VP30): PPC, coprocessors
PPC frequency: 100 MHz
Coproc. frequency: 50 MHz

Size (slices)  Small  Large
Blur           30     120
Sobel          228    912
Emboss         81     324
Speedup        30x    120x
60
Demo architecture
UART
Push button

Image (128x128 pixels, 24-bit color): 24 BRAMs
Soft version: Read (Image BRAM) → Execution (PPC) → Write (Display BRAM)
Coprocessor version: Read (Image BRAM) → Execution (Coproc) → Write (Display BRAM)
Dock: send the profile information through UART.

Image BRAM
PLB
PPC
Peripherals
Coproc
Interface to external
Instruction BRAM
Display BRAM
EDK
VGA control
ISE
VGA display
61
Coprocessor configurations

Microprocessor only
Small blur + small sobel
Small blur + small emboss
Small sobel + small emboss
Large blur
Large sobel
Large emboss
Choose the configuration according to app profile info.

PPC
Peripherals
Blur (S)
Blur (S)
Sobel(s)
Blur (L)
Sobel (L)
Emboss(L)
Memory
Sobel(S)
Emboss(s)
Emboss(s)
Coprocessor region
Virtex2P
62
Video demo program flow
Time information
Dock: CP selection                0.001 s
Start IMPACT: FPGA reprogramming  12 s
Filter: PPC only (128 frames)     30 s
Filter: CP small (128 frames)     1 s
Filter: CP large (128 frames)     0.25 s
Update profile information
Execution
Dock
Reprogram FPGA
Read profile info from UART
Different objectives and different heuristics.
Select new program file
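The timing figures above imply a break-even point for reprogramming, which a few lines of C can compute. This is a rough sketch under assumptions: the function name `break_even_frames` is hypothetical, times are expressed in milliseconds to keep the arithmetic in integers, and the 0.001 s dock selection time and any UART transfer time are ignored.

```c
#include <assert.h>

/* Smallest number of frames n for which
   reprogramming + n frames on the coprocessor
   takes no longer than n frames in software on the PPC. */
long break_even_frames(long reprogram_ms, long sw_ms_per_batch,
                       long hw_ms_per_batch, long frames_per_batch) {
    assert(sw_ms_per_batch > hw_ms_per_batch && frames_per_batch > 0);
    long num = reprogram_ms * frames_per_batch;
    long den = sw_ms_per_batch - hw_ms_per_batch;  /* saving per batch */
    return (num + den - 1) / den;                  /* ceiling division */
}
```

With 12 s of reprogramming and 30 s per 128 frames on the PPC, the small coprocessor (1 s per 128 frames) pays off after 53 frames and the large one (0.25 s per 128 frames) after 52 frames.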
63
Dynamic Enables Expandable Logic Concept
RAM
Expandable Logic
Expandable RAM
uP
Performance
64
Summary