Title: Warp Processing
1 Warp Processing: Towards FPGA Ubiquity
- Frank Vahid
- Professor
- Department of Computer Science and Engineering
- University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
- Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)
2 Outline
- FPGAs
- Why they're great
- Why they're not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing
- Binary decompilation
- Just-in-time FPGA compilation
- Directions
3 FPGAs
[Figure: a LUT with inputs a, b and outputs F, G. Implement circuit by downloading particular bits]
- FPGA -- Field-Programmable Gate Array
- Implement circuit by downloading bits
- A memory with N address inputs (a lookup table, LUT) implements any N-input combinational logic function
- Register-controlled switch matrix (SM) connects LUTs
- FPGA fabric
- Thousands of LUTs and SMs, increasingly with additional hard-core components like multipliers, RAM, etc.
- CAD tools automatically map desired circuit onto FPGA fabric
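The idea that a LUT is just a small memory addressed by its inputs can be sketched in C. This is a minimal illustration, not a model of any particular FPGA: the type `lut2_t` and the functions `lut2_program` and `lut2_eval` are hypothetical names, and a 2-input LUT is assumed for brevity.

```c
#include <stdint.h>

/* A hypothetical 2-input LUT: 4 configuration bits, one per input combination.
   Bit i of `config` is the output when the inputs form address i = (a << 1) | b. */
typedef struct {
    uint8_t config; /* only the low 4 bits are used */
} lut2_t;

/* "Download" configuration bits into the LUT. */
lut2_t lut2_program(uint8_t bits) {
    lut2_t lut = { (uint8_t)(bits & 0x0F) };
    return lut;
}

/* Evaluate the LUT: the inputs form the memory address, the stored bit is the output. */
int lut2_eval(lut2_t lut, int a, int b) {
    unsigned addr = ((unsigned)(a & 1) << 1) | (unsigned)(b & 1);
    return (lut.config >> addr) & 1;
}
```

With this encoding, programming the bits `0x8` yields an AND gate and `0x6` yields an XOR gate: the same hardware computes a different function purely because different bits were loaded.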
4 FPGAs are "Programmable" like Microprocessors: Just Download Bits
Microprocessor binaries: bits loaded into program memory
FPGA "binaries" (more commonly known as "bitstreams"): bits loaded into LUTs and SMs
[Figure: example bit patterns 0111 and 0010 downloaded into an FPGA]
5 FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
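The bit-reversal sequence on this slide can be written out as a complete function. The sketch below assumes a 32-bit unsigned word; `reverse_bits32_naive` is a hypothetical helper added only to check the result one bit at a time.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word by swapping progressively smaller groups. */
uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 16) | (x << 16);                                /* swap 16-bit halves   */
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);  /* swap bytes           */
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);  /* swap nibbles         */
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);  /* swap bit pairs       */
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);  /* swap adjacent bits   */
    return x;
}

/* Reference version: move one bit per iteration (used only for checking). */
uint32_t reverse_bits32_naive(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

On a microprocessor even the fast version costs a sequence of shift, mask, and OR instructions. In a circuit, a fixed bit permutation like this needs no logic at all, only wires, which is one source of the FPGA advantage the slide title refers to.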
6 FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
for (i=0; i < 128; i++) y[i] += c[i] * x[i]; ...
- 1000s of instructions
- Several thousand cycles
Circuit for FIR Filter
[Figure: circuit implementing the same loop]
In general, FPGA better due to circuit's concurrency, from bit-level to task level
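To make the concurrency argument concrete, here is a minimal sketch of the loop as software. It assumes the garbled loop body on the slide is an element-wise multiply-accumulate; the function name `mac_loop` and the element types are illustrative choices, not taken from the slides.

```c
#include <stddef.h>

/* Software version: one multiply-accumulate per iteration, executed one after another.
   No iteration reads a value written by another iteration. */
void mac_loop(long *y, const int *c, const int *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] += (long)c[i] * x[i];
    }
}
```

Because the iterations are independent, a circuit is free to instantiate many multipliers and adders and evaluate many iterations in the same clock cycle, whereas the microprocessor must fetch and execute the load, multiply, add, store, and branch instructions for each iteration in turn.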
7 Extensive Studies over Past Decade
- Large speedups on many important applications
- See ACM/SIGDA Int. Symp. on FPGAs
- So why aren't FPGAs ubiquitous?
8 Why FPGAs aren't Mainstream
- Cost: But improving yearly
- Power: But improving yearly, and energy benefits too
- Extra chip: But integration continues
- Programming methodology
[Figure: 1 million system gate FPGA cost. Source: Xilinx]
9 Why FPGAs aren't Mainstream
- Cost
- Power
- Extra chip
- Programming methodology
- Though tremendous progress in past decade
[Figure: parallel tool flows from application to implementation]
Application (C/C++/Java/SystemC/Handel-C/Streams-C/...) -> automated hardware/software partitioning
Microprocessor flow: C/C++/Java -> compilation (1960s, 1970s) -> assembly code -> assembling, linking (1950s, 1960s) -> microprocessor binary -> downloading -> microprocessors
FPGA flow: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C ... -> behavioral synthesis (1990s) -> register transfers -> RT synthesis (1980s, 1990s) -> logic equations / FSMs -> logic synthesis, physical design (1970s, 1980s) -> FPGA binary -> downloading -> FPGA circuits
10 So What's the Holdup?
- FPGAs require special compilers
- Limits adoption: desktop world dominates
- 100 software writers for every CAD user
- Millions of compiler seats worldwide, vs. 15,000 CAD seats
[Figure: Standard Compiler]
11 Outline
- FPGAs
- Why they're great
- Why they're not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing
- Binary decompilation
- Just-in-time FPGA compilation
- Directions
12 Can we Hide FPGAs from Programmers and Standard Tools?
- Example
- Radically different x86 architectures hidden from programmers and tools
- All execute standard x86 binaries
- On-chip tools dynamically translate binary to particular architecture
- Idea: Hide FPGA from programmers and tools
- Download standard binary
- Have on-chip tools dynamically translate binary (portions) to FPGA
- We call this Warp Processing
[Figure: on-chip translators mapping a standard binary to a RISC architecture and to a VLIW architecture; annotation: "Traditional partitioning done here"]
13 Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
14 Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
15 Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD; Profiler and µP highlighted, with the µP executing a repeating stream of beq and add instructions]
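One plausible way a profiler could detect critical regions is to count how often each backward branch is taken, since a frequently taken backward branch marks a hot loop. The sketch below illustrates that idea only; it is not the profiler design from this work, and the table size and all names (`profiler_t`, `profiler_observe`, `profiler_hottest`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define PROFILER_ENTRIES 16  /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILER_ENTRIES];  /* loop-start address of each tracked branch  */
    uint32_t count[PROFILER_ENTRIES];   /* how often that backward branch was taken   */
    int used;                           /* number of table entries in use             */
} profiler_t;

void profiler_init(profiler_t *p) {
    memset(p, 0, sizeof *p);
}

/* Record one taken branch. Only backward branches (target <= pc) are treated as loops. */
void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target) {
    if (target > pc) return;                 /* forward branch: not a loop back-edge */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) { p->count[i]++; return; }
    }
    if (p->used < PROFILER_ENTRIES) {        /* when the table is full, new loops are ignored */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the start address of the most frequently taken loop, or 0 if none was seen. */
uint32_t profiler_hottest(const profiler_t *p) {
    uint32_t best = 0, best_count = 0;
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) { best_count = p->count[i]; best = p->target[i]; }
    }
    return best;
}
```

Feeding the repeated `beq` of a tight loop into `profiler_observe` quickly makes that loop's start address the result of `profiler_hottest`, which is the region the on-chip CAD would then read.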
16 Warp Processing Idea
Step 4: On-chip CAD reads in critical region
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD; Profiler, µP, and On-chip CAD highlighted]
17 Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD containing the Dynamic Part. Module (DPM)]
18 Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD containing the Dynamic Part. Module (DPM)]
19 Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD containing the Dynamic Part. Module (DPM); FPGA highlighted]
20 Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: on-chip blocks: Profiler, I Mem, µP, D, FPGA, On-chip CAD containing the Dynamic Part. Module (DPM); FPGA highlighted]
21 Warp Processing Idea
Likely multiple microprocessors per chip, serviced by one on-chip CAD block
[Figure: several µPs sharing one Profiler, I Mem, D, FPGA, and On-chip CAD block]
22 Warp Processing Challenges
- Two key challenges
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
23 Decompilation
- Synthesis from binary has a potential hurdle
- High-level information (e.g., loops, arrays) lost during compilation
- Direct translation of assembly to circuit: huge overheads
- Need to recover high-level information
[Figure: Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]
24 Decompilation
- Solution: Recover high-level information from binary: decompilation
- Adapted extensive previous work (for different purposes)
- Developed new decompilation methods also
- Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
- Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}
Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
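The correspondence between the assembly and the C code on this slide can be checked with a small sketch that emulates the assembly one instruction at a time. It rests on several assumptions that are not stated on the slide: `reg2` holds the array's base address, `short` is 2 bytes (so the shift by 1 is the index scaling), `accum` starts at 0 as `Mov reg4, 0` implies, and the final branch repeats the loop until `reg3` reaches 10. The name `f_asm_emulated` is hypothetical.

```c
#include <stdint.h>

/* The slide's C function, with accum initialised to 0. */
long f(const short a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
        accum += a[i];
    }
    return accum;
}

/* Instruction-by-instruction emulation of the slide's assembly.
   reg2 is assumed to hold the base address of the array. */
long f_asm_emulated(const short *reg2) {
    long reg3 = 0;                                     /* Mov reg3, 0  (loop counter)     */
    long reg4 = 0;                                     /* Mov reg4, 0  (accumulator)      */
    do {
        long reg1 = reg3 << 1;                         /* Shl reg1, reg3, 1  (i * 2)      */
        const char *reg5 = (const char *)reg2 + reg1;  /* Add reg5, reg2, reg1            */
        long reg6 = *(const short *)reg5;              /* Ld reg6, 0(reg5)                */
        reg4 = reg4 + reg6;                            /* Add reg4, reg4, reg6            */
        reg3 = reg3 + 1;                               /* Add reg3, reg3, 1               */
    } while (reg3 != 10);                              /* loop branch (assumed semantics) */
    return reg4;                                       /* Ret reg4                        */
}
```

Read in this direction, the patterns a decompiler would look for become visible: the shift-then-add before the load is an array access with 2-byte elements, the register incremented by 1 and compared against 10 is a loop counter with a known bound, and the register updated by addition on every iteration is an accumulator.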
25 Decompilation Results vs. C
- Compared with synthesis from C
- Synthesis after decompilation often quite similar
- Almost identical performance, small area overhead
FPGA 2005
26 Decompilation Results on Optimized H.264: In-depth Study with Freescale
- Used highly-optimized benchmark
- Results: Binary approach competitive
- Speedups compared to ARM9 software
- Binary: 2.48, C: 2.53
- Decompilation recovered nearly all high-level
information needed for partitioning and synthesis
27 Simple Coding Guidelines Bring Speedups Closer to Ideal
- Interesting discovery during H.264 study: C style limited speedup
- Orthogonal to binary vs. C issue: coding style hurt both
- Developed simple coding guidelines
- Rewritten software: 20 minutes, and only 3% slower than original
- New speedups: Binary 6.55, C 6.56
- Binary still competitive with C
- Following guidelines not required, but helps any approach targeting FPGAs
28 Warp Processing Challenges
- Two key challenges
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
29 JIT FPGA Compilation
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
- e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at cost of 30% longer critical path. Similar results for synthesis and placement
- Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
- Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
DAC'04
30 JIT FPGA Compilation
31 Overall Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
[Chart: speedup relative to SW-only execution]
32 Overall Warp Processing Results: Performance Speedup (Overall, Multiple Kernels)
- Energy reduction of 38% - 94%
[Chart: speedup relative to SW-only execution]
Assuming 100 MHz ARM, fabric in same technology and clocked at rate determined by synthesis
33 FPGA Ubiquity via Obscurity
- FPGA is hidden from languages and tools
- Thus, ANY microprocessor platform extendible with FPGA
- So any program can potentially be sped up by FPGAs
- No new languages, no new tools
- Maintains "ecosystem" among application, tool, and architecture developers
[Figure: Profiling, Standard Compiler]
34 Outline
- FPGAs
- Why they're great
- Why they're not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing
- Binary decompilation
- Just-in-time FPGA compilation
- Directions
35 Directions: What's Next?
- Immediate future: Develop warp processing using benchmarks from other domains
- Desktop, server, scientific
- With partners IBM, Freescale
- May require new decompilation techniques
36 Directions: What's Next?
- Application-specific FPGA
- Tune FPGA fabric to application (or domain)
- Parameters: LUTs/CLB, LUT size
- Many more possible, e.g., switch matrix size, long vs. short channels
[Figure: Delay for each configuration (LUTs/CLB, and LUT sizes 2-7) for one application]
[Figure: Delay and area when tuning parameters for best delay for each app, rather than for all apps]
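Tuning the two parameters named on this slide amounts to a search over a small configuration space. The sketch below shows an exhaustive search under stated assumptions: the delay of an application on a given configuration is supplied by the caller through the hypothetical hook `delay_fn` (in practice it would come from running a CAD flow), and `toy_delay` is a made-up model with no physical meaning, included only so the search can be exercised.

```c
#include <stddef.h>

/* One candidate fabric configuration, using the two parameters named on the slide. */
typedef struct {
    int luts_per_clb;
    int lut_size;      /* number of LUT inputs */
} fabric_config_t;

/* Caller-supplied delay estimate for one application on one configuration (hypothetical). */
typedef double (*delay_fn)(fabric_config_t cfg);

/* Try every configuration in the given ranges and return the one with the lowest delay. */
fabric_config_t best_config(delay_fn delay, int max_luts_per_clb,
                            int min_lut_size, int max_lut_size) {
    fabric_config_t best = { 1, min_lut_size };
    double best_delay = delay(best);
    for (int n = 1; n <= max_luts_per_clb; n++) {
        for (int k = min_lut_size; k <= max_lut_size; k++) {
            fabric_config_t cfg = { n, k };
            double d = delay(cfg);
            if (d < best_delay) { best_delay = d; best = cfg; }
        }
    }
    return best;
}

/* Made-up delay model used only to exercise the search. */
double toy_delay(fabric_config_t cfg) {
    double dk = cfg.lut_size - 4;
    double dn = cfg.luts_per_clb - 2;
    return 10.0 + dk * dk + dn * dn;
}
```

Running the search once per application, rather than once for a whole benchmark suite, corresponds to the per-application tuning the second figure caption describes.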
37 Directions: What's Next?
- Parallel benchmarks
- NAS, SPEComp, Splash, ...
- Map each thread to custom FPGA circuit
- Huge potential speedups
[Figure: threads Thrd1, Thrd2, Thrd3, ..., ThrdN running on multiple µPs (with Profiler, I Mem, D, and On-chip CAD) and mapped onto the FPGA; chart of sample speedups from other works]
38 Directions: What's Next?
- With JIT FPGA compiler, what else is possible?
- Implications for existing applications?
- Image processing, neural networks, ...
- Add FPGA hardware to improve performance, like expandable memory?
- Standard binaries for FPGAs?
- Rather than extracting circuit from sequential code, distribute circuit binary itself, use JIT FPGA compiler to best map to FPGA resources
39 Summary
- FPGA future looks bright
- Hiding FPGA via warp processing is feasible
- Decompilation can recover high-level constructs to yield speedups competitive with source-level
- JIT FPGA compilation can be made sufficiently lean
- Many possible directions exist that may use FPGAs to gain ultra-high performance without ultra-high engineering or hardware costs
Publications can be found at http://www.cs.ucr.edu/~vahid/pubs