Title: The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems
1. The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems
- Frank Vahid
- Professor
- Department of Computer Science and Engineering
- University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
- Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)
2. Outline
- FPGAs: The New Software
  - Why they're great
  - Why they're not ubiquitous yet
- Hiding FPGAs from programmers
  - Warp processing
  - Binary decompilation
  - Just-in-time FPGA compilation
- Towards Standard Binaries for FPGAs
3. FPGAs
[Figure: a LUT with inputs a and b and outputs F and G; the circuit is implemented by downloading particular bits]
- FPGA: Field-Programmable Gate Array
  - Implement circuit by downloading bits
  - A memory with N address inputs (LUT) implements any N-input combinational logic function
  - Register-controlled switch matrix (SM) connects LUTs
- FPGA fabric
  - Thousands of LUTs and SMs, increasingly with additional hard-core components such as multipliers, RAM, etc.
  - CAD tools automatically map desired circuit onto FPGA fabric
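The LUT idea above can be sketched in C as a tiny memory whose address is formed from the logic inputs. This is a minimal illustration, not a model of any particular FPGA: the type `lut2_t`, the function `lut2_eval`, and the two example configurations are names and bit patterns assumed for this sketch.

```c
#include <stdint.h>

/* A 2-input LUT modeled as a 4-entry, 1-bit-wide memory (hypothetical type).
 * Bit k of `config` is the output for input pattern k = (a << 1) | b. */
typedef struct {
    uint8_t config; /* the 4 "downloaded" bits */
} lut2_t;

/* Evaluate the LUT: the two inputs form the memory address. */
unsigned lut2_eval(lut2_t lut, unsigned a, unsigned b)
{
    unsigned addr = ((a & 1u) << 1) | (b & 1u);
    return (lut.config >> addr) & 1u;
}

/* Example configurations chosen for this sketch. */
const lut2_t LUT_AND = { 0x8 }; /* binary 1000: 1 only when a = 1 and b = 1 */
const lut2_t LUT_XOR = { 0x6 }; /* binary 0110: 1 when a != b */
```

Changing the four configuration bits changes the logic function without changing the hardware, which is the sense in which a circuit is "downloaded".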
4. FPGAs are "Programmable" like Microprocessors: Just Download Bits
- Microprocessor binaries: bits loaded into program memory
- FPGA "binaries" (more commonly known as "bitstreams"): bits loaded into LUTs and SMs
[Figure: bits such as 0111 and 0010 downloaded to the microprocessor's program memory and to the FPGA]
5. FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
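To see why a circuit can beat software on this example, it may help to compare the shift-and-mask sequence with a "wiring" view, in which output bit i is simply input bit 31 - i. The sketch below assumes a 32-bit word; the function names `reverse_bits_sw` and `reverse_bits_wires` are made up for this illustration.

```c
#include <stdint.h>

/* Software version: five shift/mask/OR stages on a 32-bit word. */
uint32_t reverse_bits_sw(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Wiring model: each loop iteration stands for one wire that connects
 * input bit (31 - i) to output bit i. In a circuit, all 32 wires exist
 * at once, so no computation is needed at all. */
uint32_t reverse_bits_wires(uint32_t x)
{
    uint32_t y = 0;
    for (int i = 0; i < 32; i++)
        y |= ((x >> (31 - i)) & 1u) << i;
    return y;
}
```

Both functions return the same value for every input; the second only makes explicit that the operation is a fixed permutation of bits.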
6. FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
    for (i = 0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...
- 1000s of instructions
- Several thousand cycles
Circuit for FIR Filter
[Figure: circuit implementing the same loop]
In general, an FPGA is better due to the circuit's concurrency, from bit level to task level
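The concurrency argument can be sketched by modeling the circuit's structure in C. This sketch assumes the 128 products are accumulated into a single output (a dot product) and that the circuit uses one multiplier per tap followed by a balanced adder tree; neither detail is confirmed by the slide text, and the function names are hypothetical.

```c
#include <stddef.h>

#define TAPS 128

/* Sequential software model: one multiply-accumulate per loop iteration. */
long fir_sequential(const short c[TAPS], const short x[TAPS])
{
    long sum = 0;
    for (size_t i = 0; i < TAPS; i++)
        sum += (long)c[i] * x[i];
    return sum;
}

/* Circuit-style model: all products are independent, so hardware could
 * compute them at once; the sums then collapse in log2(128) = 7 levels. */
long fir_tree(const short c[TAPS], const short x[TAPS])
{
    long level[TAPS];

    for (size_t i = 0; i < TAPS; i++)       /* 128 multipliers side by side */
        level[i] = (long)c[i] * x[i];

    for (size_t width = TAPS / 2; width >= 1; width /= 2)  /* 7 adder levels */
        for (size_t i = 0; i < width; i++)  /* adders within a level are independent */
            level[i] = level[2 * i] + level[2 * i + 1];

    return level[0];
}
```

In software both functions take time proportional to the number of taps. The point of the second form is that every inner loop corresponds to operations a circuit could perform in the same clock cycle, leaving roughly one multiply step plus seven add steps.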
7. Extensive Studies over Past Decade
- Large speedups on many important applications
  - See ACM/SIGDA Int. Symp. on FPGAs
- So why aren't FPGAs ubiquitous?
8. Why FPGAs Aren't Ubiquitous
- Cost: but improving yearly
- Power: but improving yearly, and energy benefits too
- Extra chip: but integration continues
- Programming methodology
[Figure: 1 million system gate FPGA cost. Source: Xilinx]
9. Why FPGAs Aren't Mainstream
- Cost
- Power
- Extra chip
- Programming methodology
  - Though tremendous progress in past decade
[Figure: two parallel tool flows starting from one application]
- Application (C/C++/Java/SystemC/Handel-C/Streams-C/...), followed by automated hardware/software partitioning
- Microprocessor flow: C/C++/Java, compilation (1960s, 1970s), assembly code, assembling and linking (1950s, 1960s), microprocessor binary, downloading, implementation on microprocessors
- FPGA flow: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C/..., behavioral synthesis (1990s), register transfers, RT synthesis (1980s, 1990s), logic equations / FSMs, logic synthesis and physical design (1970s, 1980s), FPGA binary, downloading, implementation as FPGA circuits
10. So What's the Holdup?
- FPGAs require special compilers
- Limits adoption: desktop world dominates
  - 100 software writers for every CAD user
  - Millions of compiler seats worldwide, vs. 15,000 CAD seats
- Can't ignore "ecosystem" from separation of applications, tools, and architectures
  - Just consider history of popular processors
[Figure: ecosystem built around a standard compiler]
11. Outline
- FPGAs: The New Software
  - Why they're great
  - Why they're not ubiquitous yet
- Hiding FPGAs from programmers
  - Warp processing
  - Binary decompilation
  - Just-in-time FPGA compilation
- Towards Standard Binaries for FPGAs
12. Can we Hide FPGAs from Programmers and Standard Tools?
- Example
  - Radically different x86 architectures hidden from programmers and tools
  - All execute standard x86 binaries
  - On-chip tools dynamically translate binary to particular architecture
- Idea: hide FPGA from programmers and tools
  - Download standard binary
  - Have on-chip tools dynamically translate (portions of) binary to FPGA
  - We call this warp processing
[Figure: a standard binary passing through on-chip translators to a RISC architecture and to a VLIW architecture; annotation: "Traditional partitioning done here"]
13. Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
14. Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
15. Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, On-chip CAD; the executed instruction stream shows repeated add and beq instructions]
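One way a profiler might spot a critical region is to count how often each backward branch is taken, since a frequently taken backward branch usually closes a hot loop. The sketch below is an assumption made for illustration, not the profiler described in the talk; `profiler_t`, `profiler_on_branch`, `profiler_hottest`, and the table size are all hypothetical.

```c
#include <stdint.h>

#define MAX_LOOPS 16 /* hypothetical table size */

typedef struct {
    uint32_t target[MAX_LOOPS]; /* loop-start addresses seen so far */
    uint32_t count[MAX_LOOPS];  /* times each loop's backward branch was taken */
    int used;                   /* number of table entries in use */
} profiler_t;

/* Call for every taken branch; only backward branches (loops) are counted. */
void profiler_on_branch(profiler_t *p, uint32_t pc, uint32_t target)
{
    if (target >= pc)
        return; /* forward branch: not a loop */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < MAX_LOOPS) { /* new loop; ignored if the table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the start address of the most frequently executed loop,
 * or 0 if no backward branch has been observed. */
uint32_t profiler_hottest(const profiler_t *p)
{
    uint32_t best = 0, best_count = 0;
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

A small fixed-size table keeps the sketch close to what limited on-chip hardware could afford; a real design would also need a policy for evicting entries.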
16. Warp Processing Idea
Step 4: On-chip CAD reads in critical region
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
17. Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and the Dynamic Part. Module (DPM) running the on-chip CAD]
18. Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and the Dynamic Part. Module (DPM) running the on-chip CAD]
19. Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and the Dynamic Part. Module (DPM) running the on-chip CAD]
20. Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and the Dynamic Part. Module (DPM) running the on-chip CAD]
21. Warp Processing Challenges
- Two key challenges
  - Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  - Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
22. Decompilation
- If we don't decompile
  - High-level information (e.g., loops, arrays) lost during compilation
  - Direct translation of assembly to circuit: big overhead
- Need to recover high-level information
[Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]
23. Decompilation
- Solution: recover high-level information from the binary (decompilation)
  - Adapted extensive previous work (for different purposes)
  - Developed new decompilation methods also
  - Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  - Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Original C Code
    long f( short a[10] ) {
        long accum = 0;
        for (int i = 0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
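To make concrete what decompilation has to recover, the sketch below writes the assembly listing as C, one statement per instruction, next to the high-level loop. It assumes `short` is 2 bytes (which is why the index is shifted left by 1), that `reg2` holds the array base address, and that the final branch loops back until the counter reaches 10; the function names are hypothetical.

```c
#include <string.h>

/* High-level form: the loop and array access that decompilation aims to recover. */
long f_source(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* Register-level form: each statement mirrors one assembly instruction.
 * Assumes sizeof(short) == 2. */
long f_binary(const short a[10])
{
    const unsigned char *reg2 = (const unsigned char *)a; /* array base address */
    long reg3 = 0;                                 /* Mov reg3, 0 */
    long reg4 = 0;                                 /* Mov reg4, 0 */
    do {
        long reg1 = reg3 << 1;                     /* Shl reg1, reg3, 1 */
        const unsigned char *reg5 = reg2 + reg1;   /* Add reg5, reg2, reg1 */
        short reg6;
        memcpy(&reg6, reg5, sizeof reg6);          /* Ld reg6, 0(reg5) */
        reg4 += reg6;                              /* Add reg4, reg4, reg6 */
        reg3 += 1;                                 /* Add reg3, reg3, 1 */
    } while (reg3 != 10);                          /* loop-back branch */
    return reg4;                                   /* Ret reg4 */
}
```

Recognizing that the shift, add, and load together mean `a[i]`, and that the compare-and-branch forms a counted loop with 10 iterations, is the kind of high-level information a synthesis tool can exploit, for example to unroll the loop or fetch several array elements at once.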
24. Decompilation Results vs. C
- Compared with synthesis from C
  - Synthesis after decompilation often quite similar
  - Almost identical performance, small area overhead
FPGA 2005
25. Decompilation Results on Optimized H.264: In-depth Study with Freescale
- Used highly-optimized benchmark
- Results: binary approach competitive
  - Speedups compared to ARM9 software
  - Binary: 2.48, C: 2.53
  - Decompilation recovered nearly all high-level information needed for partitioning and synthesis
26. Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal
- Interesting discovery during H.264 study: C style limited speedup
  - Orthogonal to binary vs. C issue: coding style hurt both
- Developed simple coding guidelines
  - Rewritten software: 20 minutes, and only 3% slower than original
  - New speedups: binary 6.55, C 6.56
  - Binary still competitive with C
  - Following guidelines not required, but helps any approach targeting FPGAs
27. Warp Processing Challenges
- Two key challenges
  - Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  - Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
28. JIT FPGA Compilation
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
  - e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement
  - Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
  - Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
DAC'04
29. Overall Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
Currently prototyping our simpler FPGA fabric with Intel, scheduled for Q3 shuttle
[Figure: speedup per benchmark relative to SW-only execution]
Average overall application speedup is 7.4
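The relation between a kernel's speedup and the whole application's speedup is commonly estimated with Amdahl's law. The sketch below is a generic illustration only; the function name and the numbers in the usage note are hypothetical and are not measurements from the talk.

```c
/* Overall speedup when only part of the run time is accelerated (Amdahl's law).
 * kernel_fraction: share of the original run time spent in the kernel (0..1).
 * kernel_speedup:  factor by which the kernel alone gets faster (> 0). */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, if a kernel accounted for 90% of run time and the FPGA made it 10 times faster, `overall_speedup(0.9, 10.0)` would give roughly 5.3, because the remaining 10% of the time still runs in software.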
30. Outline
- FPGAs: The New Software
  - Why they're great
  - Why they're not ubiquitous yet
- Hiding FPGAs from programmers
  - Warp processing
  - Binary decompilation
  - Just-in-time FPGA compilation
- Towards Standard Binaries for FPGAs
31. FPGA Ubiquity via Obscurity
- Warp processing hides FPGA from languages and tools
  - ANY microprocessor platform extendible with FPGA
  - Maintains "ecosystem": application, tool, and architecture developers
- New platforms with FPGAs appearing
[Figure: profiling and a standard compiler unchanged; new processor platforms with FPGA evolving]
32. FPGA Standard Binaries?
- Microprocessor binary represents one form of a "standard binary for FPGAs"
  - Missing is explicit concurrency
  - Parallelism, pipelining, queues, etc.
- As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: ecosystem of applications, tools (profiling, standard compiler), and architectures built around standard binaries and standard FPGA binaries]
33. FPGA Standard Binaries?
- Translator makes best use of existing FPGA resources
- Can even add FPGA, like adding memory, to improve performance
  - Add more FPGA to your PDA to implement compute-intensive application?
34. Summary
- FPGAs may be the new software
- Hiding FPGA via warp processing is feasible
  - Decompilation can recover high-level constructs to yield speedups competitive with source-level
  - JIT FPGA compilation can be made sufficiently lean
- Future: standard binaries for FPGAs?
- Extensive work to be done
Publications can be found at http://www.cs.ucr.edu/~vahid/pubs