1
Portability for FPGA Applications: Warp Processing
and SystemC Bytecode
  • Contributing Ph.D. Students
  • Roman Lysecky (Ph.D. 2005, now Asst. Prof. at
    Univ. of Arizona)
  • Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ.
    of Florida, Gainesville)
  • Scotty Sirowy (current)
  • David Sheldon (current)
  • Chen Huang (current)

Frank Vahid
Dept. of CSE, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
This research was supported in part by the
National Science Foundation, the Semiconductor
Research Corporation, Intel, Freescale, IBM, and
Xilinx.
2
Portable Applications on PCs
One binary
x86 binary
How? Why?
Pentium
Opteron
Atom
Dual Core
Multiple platforms
3
Portable Applications on PCs
  • Standard software binary
  • Dynamic software binary translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
4
Meanwhile, Circuits on FPGAs Show Large Speedups
  • Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS,
    MICRO, CASES, DAC, DATE, ICCAD, RAW, ...

5
FPGAs Entering Computing Mainstream
  • AMD Opteron
  • Intel QuickAssist
  • Cray, SGI
  • Mitrionics
  • IBM Cell (research)
  • Xilinx, Altera

SGI Altix supercomputer (UCR: 64 Itaniums plus 2
FPGA RASCs)
6
Circuits on FPGAs are Software Binaries
Microprocessor Binaries (Instructions): bits loaded into program memory
FPGA Binaries (Circuits), aka "bitstream": bits loaded into LUTs and SMs; not hardware
[Figure: FPGA with example configuration bits 0111 and 0010]
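To make the "bits loaded into LUTs" point concrete, here is a minimal sketch of a 2-input lookup table evaluated in software. The function name `lut2` and the bit ordering (bit index = `(a << 1) | b`) are assumptions for illustration; the slide does not say how its example bit strings map to inputs.

```c
#include <assert.h>

/* Evaluate a 2-input LUT.
 * `config` holds the four truth-table bits that a bitstream would load.
 * Assumed ordering: output for inputs (a, b) is bit number (a << 1) | b. */
unsigned lut2(unsigned config, unsigned a, unsigned b)
{
    assert(config <= 0xFu && a <= 1u && b <= 1u);
    return (config >> ((a << 1) | b)) & 1u;
}
```

Under this ordering, `lut2(0x8, a, b)` behaves as AND and `lut2(0xE, a, b)` as OR: changing the loaded bits changes the circuit, with no change to the hardware itself.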
7
Portable Applications on FPGAs
  • Standard software binary
  • Dynamic translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
8
Warp Processing
Step 1: Initially, software binary loaded into
instruction memory
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
9
Warp Processing
Step 2: Microprocessor executes instructions in software
binary
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
10
Warp Processing
Step 3: Profiler monitors instructions and detects
critical regions in binary
[Figure: Profiler observes a repeating stream of add and beq instructions from the µP; I Mem, D, FPGA, On-chip CAD]
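One way such profiling could work is sketched below, under the assumption that critical regions are loops and that a loop shows up as a frequently taken backward branch. The table size, type names, and function names are hypothetical; the slides do not describe the profiler's internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16  /* hypothetical number of loop heads tracked */

typedef struct {
    uint32_t target;  /* address a backward branch jumps to (loop head) */
    uint32_t count;   /* times that branch was taken */
    int      used;    /* 1 if this entry holds a loop head */
} LoopEntry;

typedef struct { LoopEntry e[TABLE_SIZE]; } Profiler;

/* Record one taken branch; only backward branches (loops) are counted. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc) return;  /* forward branch: not a loop */
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (p->e[i].used && p->e[i].target == target) {
            p->e[i].count++;
            return;
        }
    }
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (!p->e[i].used) {
            p->e[i] = (LoopEntry){ target, 1, 1 };
            return;
        }
    }
    /* Table full: this sketch simply drops the sample. */
}

/* Return the loop head with the highest count, or 0 if none was seen. */
uint32_t profiler_hottest(const Profiler *p)
{
    uint32_t best = 0, best_count = 0;
    assert(p != NULL);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (p->e[i].used && p->e[i].count > best_count) {
            best = p->e[i].target;
            best_count = p->e[i].count;
        }
    }
    return best;
}
```

The address returned by `profiler_hottest` would be the candidate region handed to the on-chip CAD in the next step.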
11
Warp Processing
Step 4: On-chip CAD reads in critical region
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
12
Warp Processing
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
Recover loops, arrays, subroutines, etc. needed
to synthesize good circuits
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
13
Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
14
Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


15
Warp Processing
>10x speedups for some apps
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
"warp" by an order of magnitude or more
  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


16
Warp Processing Challenges
  • Can we decompile binaries sufficiently for
    synthesis?
  • Can we just-in-time (JIT) compile to FPGAs?

[Figure: warp processing flow with stages Profiling & partitioning, Decompilation (CDFG), JIT FPGA compilation (FPGA binary), and Binary Updater (microprocessor binary), shown with Profiler, µP, I, D, FPGA, and On-chip CAD]
17
Decompilation
  • Recover high-level information from binary:
    branches, loops, arrays, subroutines, ...
  • Adapted previous methods for processor-processor
    translation (UQBT)
  • Developed new synthesis-oriented methods (e.g.,
    reroll loops, strength promotion)

Original C Code
  long f( short a[10] ) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding Assembly
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
18
Decompilation Results vs. C
  • Synthesis from decompiled binary is competitive
    with synthesis from C

19
Decompilation Results on Optimized H.264: In-depth
Study with Freescale
  • Again, competitive with synthesis from C

20
Decompilation Effective Even with Compiler
Optimizations
  • Do compiler optimizations hurt decompilation?
  • (Surprisingly) found optimized code synthesizes
    to even better circuits

Speedup when decompiled binary is partitioned and
synthesized to FPGA
Average Speedup of 10 Examples
21
Decompilation
Summary: Decompilation is surprisingly effective
at recovering high-level program structures for
synthesis (Stitt et al., ICCAD'02, DAC'03,
CODES/ISSS'05, ICCAD'05, FPGA'05, TODAES'06,
TODAES'07)
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now
Asst. Prof. at UF Gainesville)
22
Warp Processing Challenges
  • Can we decompile binaries sufficiently for
    synthesis?
  • Can we just-in-time (JIT) compile to FPGAs?

[Figure: warp processing flow with stages Profiling & partitioning, Decompilation (CDFG), JIT FPGA compilation (FPGA binary), and Binary Updater (microprocessor binary), shown with Profiler, µP, I, D, FPGA, and On-chip CAD]
23
Challenge: JIT Compile to FPGA
Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s
  • Developed ultra-lean CAD heuristics for
    synthesis, placement, routing, and technology
    mapping, e.g.,
  • Logic synthesis: run single expand phase
  • Technology mapping: bottom-up graph clustering
    heuristic
  • Placement: place critical path first, then
    adjacent items
  • Routing: use resource graph that matches switch
    matrix / channel structure

Ultra-lean Riverside JIT FPGA tools (drawn to
scale): 0.2 s
Penalty: 1.3-2x in performance and size (even more
might be acceptable)
24
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler → 40x
speedup, 20x less memory, 1.3x-2x circuit
penalty (Lysecky et al., DAC'03, ISSS/CODES'03,
DATE'04, DAC'04, DATE'05, FCCM'05, TODAES'06)
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now
Asst. Prof. at Univ. of Arizona)
25
Warp Processing Results: Performance Speedup (Most
Frequent Kernel Only)
vs. 200 MHz ARM
1 ARM-only execution
Overall application speedup average is 7.4
26
Warping Thread-Based Applications
  for (i = 0; i < 10; i++)
    thread_create( f, i );
  • Multi-core platforms → multi-threaded apps
  • OS schedules threads onto available µPs
  • Remaining threads added to queue
  • OS invokes on-chip CAD tools to create
    accelerators for f()
  • Thread warping: use one core to create
    accelerator for waiting threads
  • OS schedules threads onto accelerators (possibly
    dozens), in addition to µPs
  • Very large speedups possible: parallelism at
    bit, arithmetic, and now thread level too
[Figure: compiler produces binary containing f(); OS, multiple µPs, and FPGA; performance chart]
27
Memory Access Synchronization (MAS)
  • Must deal with widely known memory bottleneck
    problem
  • FPGAs great, but often can't get data to them
    fast enough

  for (i = 0; i < 10; i++)
    thread_create( thread_function, a, i );

  void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++)
      result += a[i] * val;
    . . . .
  }

Data for dozens of threads can create bottleneck
[Figure: RAM and DMA feeding FPGA threads that all read the same array]
  • Threaded programs exhibit unique feature:
    multiple threads often access same or overlapping
    data
  • Solution: fetch data once, broadcast to multiple
    threads (MAS)

28
Memory Access Synchronization (MAS)
  • Detect overlapping memory regions: "windows"
  • Synthesis creates active "smart buffer"
    [Guo/Najjar FPGA04]
  • Actively fetches data, stores the reused data,
    delivers windows to threads
  • Active rather than passive component designed
    for specific threads

  for (i = 0; i < 100; i++)
    thread_create( thread_function, a, i );

  void f( int a[], int i ) {
    int result;
    result += a[i] + a[i+1] + a[i+2] + a[i+3];
    . . . .
  }

[Figure: array cells a[0] to a[5]; RAM and DMA feeding the smart buffer]
Data streamed to smart buffer: A[0-103]
Buffer delivers window to each thread: A[0-3], A[1-4], A[6-9]
Each thread accesses different addresses, but
addresses may overlap
W/O smart buffer: 400 memory accesses. With smart
buffer: 104 memory accesses
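The access counts can be checked with a small software model of the buffer. This is only a sketch under stated assumptions: thread i consumes the window a[i] to a[i+window-1], each window is reduced to a sum, and the names `accesses_without_buffer`, `smart_buffer_run`, and `MAX_WINDOW` are hypothetical. The smart buffer described on the slide is synthesized hardware, not C code.

```c
#include <assert.h>

#define MAX_WINDOW 16  /* hypothetical upper bound on window size */

/* RAM reads if every thread fetches its own window independently. */
long accesses_without_buffer(long n_threads, long window)
{
    assert(n_threads > 0 && window > 0);
    return n_threads * window;
}

/* Stream ram[] once through a ring buffer of `window` elements and
 * deliver to thread i the sum ram[i] + ... + ram[i + window - 1].
 * Returns the number of RAM reads performed. */
long smart_buffer_run(const int *ram, long n_threads, long window,
                      long *results)
{
    int  buf[MAX_WINDOW];
    long reads = 0, sum = 0;

    assert(ram != 0 && results != 0);
    assert(n_threads > 0 && window > 0 && window <= MAX_WINDOW);

    for (long k = 0; k < n_threads + window - 1; k++) {
        if (k >= window)
            sum -= buf[k % window];   /* element leaving the window */
        buf[k % window] = ram[k];     /* the only RAM read of element k */
        reads++;
        sum += buf[k % window];
        if (k >= window - 1)
            results[k - window + 1] = sum;  /* window for one thread */
    }
    return reads;
}
```

For 100 threads and a 4-element window, the unbuffered count is 400, matching the slide. This model performs 103 reads, one per distinct element a[0] to a[102]; the slide's figure of 104 corresponds to streaming A[0-103].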
29
Speedups from Thread Warping
  • Chose benchmarks with extensive parallelism
  • Four-core (ARM11 400 MHz) base system
  • Virtex IV FPGA at circuit-specific clock
    frequency (100-300 MHz)
  • Average 130x speedup

But, FPGA uses additional area. Our FPGA size is
about 36 ARM11s
  • Still 20x faster than 32-core system (and 11x
    faster than 64-core)
  • Simulation pessimistic, actual results likely
    better
  • FPGA more flexible

30
Warp Scenarios
Warping takes time (seconds, minutes, or more):
when is it useful?
  • Long-running applications
  • Scientific computing, etc.
  • Recurring applications (save and reuse FPGA
    configurations)
  • Common in embedded systems
  • Might view as (long) boot phase
  • For networked/docked devices, CAD can occur on
    server (ongoing work)

[Figure: timelines for Long Running Applications and Recurring Applications, showing µP (1st execution), On-chip CAD, and µP over time]
31
Why Dynamic?
  • Static good, but hiding the FPGA opens the technique to
    all sw platforms
  • Standard languages/tools/binaries

Static Compiling to FPGAs: specialized language, specialized compiler, binary plus netlist
Dynamic Compiling to FPGAs: any language, any compiler, binary
[Figure: both flows target a µP and an FPGA]
32
Synthesis-Friendly Applications
  • Coding style impacts synthesis results

33
Synthesis-Friendly Application Coding Guidelines
Coding Guidelines
34
Conversion to Explicit Control Flow (CECF)
  • Problem: Function pointers may prevent static
    control flow analysis
  • Guideline: Don't use function pointers. Replace
    with if-else, static calls
  • Makes possible targets explicit

Original:
  void f( int (*fp)(int) ) {
    . . . . .
    for (i = 0; i < 10; i++)
      a[i] = fp(i);
  }

After CECF:
  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    . . . . .
    for (i = 0; i < 10; i++)
      if (fp == FUNC1) a[i] = f1(i);
      else if (fp == FUNC2) a[i] = f2(i);
      else a[i] = f3(i);
  }
35
Speedups from Synthesis-Friendly Coding
Guidelines
  • 10 guidelines
  • For a 1,000-line benchmark: 5-6 changes typical,
    tens of minutes each

36
Speedups from Synthesis-Friendly Coding Guidelines
  • Original C code (Powerstone, Mediabench)
  • Original average speedup with FPGA: 2.6x
    (excludes brev)
  • Refined C code with guidelines
  • Average speedup: 8.4x (excludes brev)
  • Guidelines led to 3.5x improvement of speedup

37
Spatial Algorithms for FPGAs
  • Example: Count patterns
  • Sequential algorithm
  • Hash table
  • 10s of cycles per pattern
  • Spatial algorithm
  • Pipelined stages
  • Essence is the connectivity of components, not
    the sequencing of instructions

Sequential algorithm:
  int patterns[1000];
  int counts[1000];
  while (1) {
    WaitForPattern();
    CurrPattern = X;
    hash = HashFct(CurrPattern);
    item = Find(patterns, CurrPattern, hash);
    if (item) counts[item]++;
  }

[Figure: CurrPattern arrives on a bus and passes through pipelined Levels 1 to m, each holding a pattern, a count, and compare logic]
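The pipelined spatial algorithm can be modelled cycle by cycle in software. The sketch below assumes one stored pattern per level and one input register per level; `LEVELS`, `Level`, `pipeline_init`, and `pipeline_clock` are hypothetical names, and the real design is a circuit in which all levels compare in parallel.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 4  /* hypothetical pipeline depth, one pattern per level */

typedef struct {
    int  pattern;   /* pattern stored at this level */
    long count;     /* matches seen at this level */
    int  in_valid;  /* input register holds a pattern this cycle? */
    int  in_value;  /* pattern currently being compared */
} Level;

/* Load the stored patterns; clear counts and pipeline registers. */
void pipeline_init(Level lv[LEVELS], const int patterns[LEVELS])
{
    assert(lv != NULL && patterns != NULL);
    for (size_t i = 0; i < LEVELS; i++)
        lv[i] = (Level){ patterns[i], 0, 0, 0 };
}

/* One clock tick: every level compares its input register against its
 * stored pattern, then the inputs shift one level down the pipeline. */
void pipeline_clock(Level lv[LEVELS], int new_valid, int new_value)
{
    assert(lv != NULL);
    for (int i = LEVELS - 1; i >= 0; i--) {
        if (lv[i].in_valid && lv[i].in_value == lv[i].pattern)
            lv[i].count++;
        if (i + 1 < LEVELS) {
            lv[i + 1].in_valid = lv[i].in_valid;
            lv[i + 1].in_value = lv[i].in_value;
        }
    }
    lv[0].in_valid = new_valid;
    lv[0].in_value = new_value;
}
```

A new pattern can enter on every call to `pipeline_clock`, so throughput is one pattern per cycle regardless of depth; after the last input, `LEVELS` further calls with `new_valid = 0` drain the pipeline.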
38
Spatial Algorithms for FPGAs
Current pattern
  • Spatial algorithm 2
  • Pipelined binary tree

Level 1: 1 count; memory holds 1 pattern; logic
Level 2: 2 counts; memory holds 2 patterns; logic
Level 3: 4 counts; memory holds 4 patterns; logic
. . .
Level n: 2^n counts; memory holds 2^n patterns; logic
. . .
39
Example
48
73
Possible patterns pre-stored in binary search
tree circuit
Stage 1
Stage 2
Stage 3
Stage 4
40
Example
23
48
Stage 1
73
Stage 2
Stage 3
Stage 4
41
Example
75
23
Stage 1
48
Stage 2
73
Stage 3
Stage 4
42
Example
11
75
Stage 1
23
Stage 2
48
Stage 3
73
Stage 4
1
43
Example
11
Stage 1
75
Stage 2
1
23
Stage 3
48
Stage 4
1
1
44
Study of Spatial Algorithms in FCCM
Year  Application               Type
2001  3D Vec. Normalization     Spatial
2001  Efficient CAM             --
2001  Automated Sensor          Temporal
2001  Regular Expression        Spatial
2002  Hyperspectral Image       Spatial
2002  Machine Vision            Spatial
2002  RC4                       Temporal
2002  Set Covering              Spatial
2002  Template Matching         Spatial
2002  Triangle Mesh             Spatial
2003  Congruential Sieves       Temporal
2003  Content Scanning          Temporal
2003  F.P and Square Root       Spatial
2003  Gaussian Noise            Spatial
2003  TRNG                      --
2004  3D FDTD Method            Spatial
2004  Deep Packet Filter        --
2004  Online Floating Point     --
2004  Molecular Dynamics        Spatial
2004  Pattern Matching          Spatial
2004  Seismic Migration         Spatial
2004  Software Deceleration     --
2004  V.M Window                --
2005  Data Mining               Spatial
2005  Cell Automata             Temporal
2005  Particle Graphics         Spatial
2005  Radiosity                 Temporal
2005  Transient Waves           Spatial
2005  Road Traffic              Temporal
2006  All Pairs Shortest Path   Spatial
2006  Apriori Data Mining       Spatial
2006  Molecular Dynamics        Spatial
2006  Gaussian Elimination      Spatial
2006  Radiation Dose            Temporal
2006  Random Variates           Spatial
  • FCCM 2001-2006
  • 70 papers describing fast application on FPGA
  • Examined 35 in depth (every other one)
  • 6 used device-specific features
  • 9 represented expected synthesized circuit from
    the obvious sequential algorithm
  • 20 were spatially-oriented applications
  • e.g., earlier pipelined binary tree

45
Portable Spatial Applications?
  • Current portable microprocessor binaries:
    sequential
  • Extensions for threads, processes, ...
  • How to support spatial constructs?
  • Ports, connections, timing model
  • .....
  • SystemC: adds libraries and macros, still standard C++
  • Sequential and spatial constructs
  • Compiling links in the simulation kernel
  • Self-executing simulation
  • Intended for SoC simulation

www.systemc.org
46
Bytecode
  • Modern portability approach
  • Java, C#

Virtual Machine (VM): program that executes
bytecode; may JIT compile to native architecture
[Figure: compiler produces bytecode, which runs on a VM on each of Pentium, Opteron, and Atom]
47
SystemC Bytecode?
SystemC
Compiler
SystemC bytecode
VM
VM
VM
Opteron FPGA
Pentium
FPGA
48
UCR SystemC Bytecode and Compiler
SystemC (input to UCR's SystemC-to-bytecode compiler):
  class EDGE_DETECTOR : public sc_module {
    // signal declarations
    EDGE_DETECTOR() {
      SC_method(mainComp); sensitive << dataReady;
      SC_method(getPixel); sensitive << clock.pos();
    }
    void getPixel() {
      dataReady.write(1);
    }
    void mainComp() {
      int i, j;
      for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
          sumX = sumX + mem.read() * GX[i][j];
      edge.write(sumX + sumY);
    }
  };

UCR's SystemC bytecode (spatial constructs plus MIPS-like sequential instructions):
  --header
  signal clock 1
  signal reset 1
  signal memory_in 32
  signal fb_data 32
  signal leds 4
  process(clock)
    READ 1 memory_in
    ADD 2 0 3
    ADD 3 2 1
    WRITE 3 s1
    ADDI 1 0 1
    WRITE 1 dataReady
  END
  process(dataReady)
    READ 5 val6
    SW 5 24(0)
    READ 5 val7
    ADDI 10 0 0
    ADDI 7 0 0
    ADDI 13 0 8
  END
49
SystemC Bytecode for FPGAs
  • Demo

50
SystemC Bytecode Emulator
SystemC bytecode
Bytecode uploadable via USB drive
FPGA
Accelerators speed up emulation
51
SystemC Bytecode Accelerators
  • Implementation
  • MIPS-like multicycle RISC datapath
  • 100 MHz clock
  • 33 million instr/sec
  • Communicates with core emulator via memory-mapped
    registers
  • Area: 5000 slices
  • # of accelerators limited to # of masters allowed
    on bus
  • 1200 lines of VHDL

SystemC bytecode
Accelerator 1
Accelerator 2
Accelerator 3
FPGA
52
Dynamic SystemC Accelerator Management
  • Only a limited number of SystemC accelerators can
    fit on an FPGA fabric
  • Dynamically map processes to accelerators based
    on process usage
  • Involves online algorithms

SystemC bytecode
42
43
11
12
10
44
Accelerator 1
Accelerator 2
Accelerator 3
FPGA
Image Filter Example
53
Just-in-Time Synthesis
Send SystemC bytecode to synthesis server
SystemC bytecode
Dynamically reconfigure some or all of the FPGA
FPGA-specific bitstream
Possible to even perform synthesis on-chip:
warp processing (previous UCR work)
54
Spatial Algorithms for FPGAs
CurrPattern
  • Even better spatial algorithm for pattern
    counting
  • Pipelined binary tree

Level 1: 1 count; memory holds 1 pattern; logic
Level 2: 2 counts; memory holds 2 patterns; logic
Level 3: 4 counts; memory holds 4 patterns; logic
. . .
Level n: 2^n counts; memory holds 2^n patterns; logic
. . .
55
Study of Spatial Algorithms in FCCM (Sirowy
FPGA2008)
Year  Application               Type
2001  3D Vec. Normalization     Spatial
2001  Efficient CAM             --
2001  Automated Sensor          Temporal
2001  Regular Expression        Spatial
2002  Hyperspectral Image       Spatial
2002  Machine Vision            Spatial
2002  RC4                       Temporal
2002  Set Covering              Spatial
2002  Template Matching         Spatial
2002  Triangle Mesh             Spatial
2003  Congruential Sieves       Temporal
2003  Content Scanning          Temporal
2003  F.P and Square Root       Spatial
2003  Gaussian Noise            Spatial
2003  TRNG                      --
2004  3D FDTD Method            Spatial
2004  Deep Packet Filter        --
2004  Online Floating Point     --
2004  Molecular Dynamics        Spatial
2004  Pattern Matching          Spatial
2004  Seismic Migration         Spatial
2004  Software Deceleration     --
2004  V.M Window                --
2005  Data Mining               Spatial
2005  Cell Automata             Temporal
2005  Particle Graphics         Spatial
2005  Radiosity                 Temporal
2005  Transient Waves           Spatial
2005  Road Traffic              Temporal
2006  All Pairs Shortest Path   Spatial
2006  Apriori Data Mining       Spatial
2006  Molecular Dynamics        Spatial
2006  Gaussian Elimination      Spatial
2006  Radiation Dose            Temporal
2006  Random Variates           Spatial
  • FCCM 2001-2006
  • 70 papers describing fast application on FPGA
  • Examined 35 in depth (every other one)
  • 6 used device-specific features
  • 9 represented expected synthesized circuit from
    the obvious sequential algorithm
  • 20 were spatially-oriented applications
  • akin to earlier pipelined binary tree

56
Portable Spatial Applications?
  • Current portable microprocessor binaries:
    sequential
  • Extensions for threads, processes, ...
  • How to support spatial constructs?
  • Ports, connections, timing model
  • .....
  • SystemC: adds libraries and macros, still standard C++
  • Sequential and spatial constructs
  • Compiling links in the simulation kernel
  • Self-executing simulation
  • Intended for SoC simulation

www.systemc.org
57
Transmuting Coprocessors
  • Demo

58
FPGA is a Size-Limited Coprocessing Resource
App executions change. Must decide which
coprocessors should be FPGA-resident at a given
time: "transmuting coprocessors"
[Figure: FPGA implements coprocessors (speedup with previous apps); upload app profile info; select coproc. set and generate new FPGA bitstream; send back new bitstream and re-program FPGA]
59
Transmuting Coprocessor Demo
  • Three image filters
  • Blur filter (S/L): blur the image
  • Sobel filter (S/L): find the edges of the image
  • Emboss filter (S/L): emboss the image
  • Platform
  • Virtex 2P (XC2VP30): PPC + coprocessors
  • PPC frequency: 100 MHz
  • Coproc. frequency: 50 MHz

Speedup: 30x (small), 120x (large)
Size (slices)   Small   Large
Blur            30      120
Sobel           228     912
Emboss          81      324
60
Demo architecture
UART
Push button
  • Image (128x128 pixels, 24-bit color): 24 BRAMs
  • Soft version: Read (Image BRAM) → Execution
    (PPC) → Write (Display BRAM)
  • Coprocessor version: Read (Image
    BRAM) → Execution (Coproc) → Write (Display BRAM)
  • Dock: send the profile information through UART.

Image BRAM
PLB
PPC
Peripherals
Coproc
Interface to external
Instruction BRAM
Display BRAM
EDK
VGA control
ISE
VGA display
61
Coprocessor configurations
  • Microprocessor only
  • Small blur + small sobel
  • Small blur + small emboss
  • Small sobel + small emboss
  • Large blur
  • Large sobel
  • Large emboss
  • Choose the configuration according to app profile
    info.
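A simple way to choose among these configurations is to estimate total run time from the profile and pick the minimum. The sketch below assumes the per-128-frame times from the demo's timing information (30 s on the PPC, 1 s on a small coprocessor, 0.25 s on a large one) apply equally to all three filters, and it ignores reprogramming time. The names `CONFIGS`, `SECONDS`, and `choose_config` are hypothetical, and the demo's actual selection heuristic is not given in the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { BLUR, SOBEL, EMBOSS, NUM_FILTERS };
enum Impl { SW, SMALL, LARGE };

typedef struct {
    const char *name;
    enum Impl   impl[NUM_FILTERS];  /* how each filter runs in this config */
} Config;

static const Config CONFIGS[] = {
    { "Microprocessor only",        { SW,    SW,    SW    } },
    { "Small blur + small sobel",   { SMALL, SMALL, SW    } },
    { "Small blur + small emboss",  { SMALL, SW,    SMALL } },
    { "Small sobel + small emboss", { SW,    SMALL, SMALL } },
    { "Large blur",                 { LARGE, SW,    SW    } },
    { "Large sobel",                { SW,    LARGE, SW    } },
    { "Large emboss",               { SW,    SW,    LARGE } },
};

/* Seconds per 128-frame run, indexed by enum Impl.
 * Assumption: the same times hold for every filter. */
static const double SECONDS[] = { 30.0, 1.0, 0.25 };

/* runs[f] = number of 128-frame runs of filter f in the app profile.
 * Returns the index into CONFIGS with the lowest estimated time;
 * ties go to the earlier configuration. */
int choose_config(const long runs[NUM_FILTERS])
{
    int    best = 0;
    double best_time = -1.0;

    assert(runs != NULL);
    for (size_t c = 0; c < sizeof CONFIGS / sizeof CONFIGS[0]; c++) {
        double t = 0.0;
        for (int f = 0; f < NUM_FILTERS; f++)
            t += (double)runs[f] * SECONDS[CONFIGS[c].impl[f]];
        if (best_time < 0.0 || t < best_time) {
            best = (int)c;
            best_time = t;
        }
    }
    return best;
}
```

A profile dominated by one filter selects that filter's large coprocessor, while a profile split between two filters selects the matching pair of small coprocessors.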

[Figure: Virtex2P floorplan with PPC, peripherals, memory, and a coprocessor region holding one of the configurations: Blur (S) + Sobel (S), Blur (S) + Emboss (S), Sobel (S) + Emboss (S), Blur (L), Sobel (L), or Emboss (L)]
62
Video demo program flow
Time information
Dock CP selection                      0.001 s
Start IMPACT, FPGA reprogramming       12 s
Filter, PPC only (128 frames)          30 s
Filter, CP small (128 frames)          1 s
Filter, CP large (128 frames)          0.25 s

Flowchart steps: Execution; Update profile information; Dock; Read profile info from UART; Select new program file; Reprogram FPGA
Different objectives and different heuristics.
63
Dynamic Enables Expandable Logic Concept
RAM
Expandable Logic
Expandable RAM
uP
Performance
64
Summary
  • FPGAs entering mainstream
  • Portability of applications is important
  • Dynamic binary translation to FPGAs: warp
    processing
  • Shown feasible; extensive future work
  • Trends towards FPGA ubiquity
  • Microprocessor binaries need extensions for
    spatial constructs
  • One approach: SystemC bytecode and virtual
    machine
  • Can also be warped for circuit speed

http://www.cs.ucr.edu/~vahid/pubs