The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems

Slides: 35
Provided by: romanl5
Learn more at: http://www.cs.ucr.edu
Transcript and Presenter's Notes

1
The New Software: Invisible Ubiquitous FPGAs
that Enable Next-Generation Embedded Systems
  • Frank Vahid
  • Professor
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Associate Director, Center for Embedded Computer
    Systems, UC Irvine
  • Work supported by the National Science
    Foundation, the Semiconductor Research
    Corporation, Xilinx, Intel, and Freescale
  • Contributing students: Roman Lysecky (PhD 2005,
    now asst. prof. at U. Arizona), Greg Stitt (PhD
    2006), David Sheldon (3rd-yr PhD), Ryan Mannion
    (2nd-yr PhD), Scott Sirowy (1st-yr PhD)

2
Outline
  • FPGAs: The New Software
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Towards Standard Binaries for FPGAs

3
FPGAs
[Figure: LUT example with signals a, b, F, G; the circuit is implemented by downloading particular bits]
  • FPGA -- Field-Programmable Gate Array
  • Implement circuit by downloading bits
  • Memory with N address inputs (LUT, lookup table)
    implements N-input combinational logic
  • Register-controlled switch matrix (SM) connects
    LUTs
  • FPGA fabric
  • Thousands of LUTs and SMs, increasingly with
    additional hard-core components such as
    multipliers and RAM
  • CAD tools automatically map desired circuit onto
    FPGA fabric
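
As a rough software model of the LUT idea above, the sketch below treats a 2-input LUT as a 4-bit memory: the inputs form the address and the stored bit is the output. The type `lut2_t`, the function `lut2_eval`, and the configuration constants are hypothetical names chosen for illustration; real FPGA LUTs typically have more inputs and are configured through a vendor bitstream, not through C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-input LUT: 4 configuration bits, one per input combination.
   Bit k of `config` is the output when the inputs (b, a) encode the value k. */
typedef struct {
    uint8_t config; /* only the low 4 bits are used */
} lut2_t;

/* Evaluate the LUT: the inputs form the address, the stored bit is the output. */
int lut2_eval(lut2_t lut, int a, int b)
{
    assert((a == 0 || a == 1) && (b == 0 || b == 1));
    unsigned address = ((unsigned)b << 1) | (unsigned)a;
    return (int)((lut.config >> address) & 1u);
}

/* "Downloading bits" selects the logic function the same LUT implements. */
const lut2_t LUT_AND = { 0x8 }; /* output 1 only at address 3 (a=1, b=1) */
const lut2_t LUT_XOR = { 0x6 }; /* output 1 at addresses 1 and 2         */
const lut2_t LUT_OR  = { 0xE }; /* output 1 at addresses 1, 2, and 3     */
```

Changing only the stored bits turns the same structure from an AND gate into an XOR or OR gate, which is the sense in which the circuit is "programmed".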

4
FPGAs are "Programmable" like Microprocessors:
Just Download Bits
  • Microprocessor binaries: bits loaded into program memory
  • FPGA "binaries" (more commonly known as "bitstreams"):
    bits loaded into LUTs and SMs
[Figure: example bit patterns (0111, 0010) being downloaded to the FPGA]
5
FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
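
A self-contained version of the bit-reversal code, wrapped in a function for 32-bit words, might look like the sketch below. The function names and the bit-by-bit reference version are additions for illustration, not part of the slide. In software each step costs shift, mask, and OR instructions, whereas a fixed bit permutation in a circuit is essentially wiring.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of a 32-bit word by swapping progressively
   smaller groups: halves, bytes, nibbles, bit pairs, single bits. */
uint32_t reverse_bits32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version (hypothetical helper): move one bit per iteration. */
uint32_t reverse_bits32_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Small self-check comparing both versions on a few inputs. */
void reverse_bits32_selfcheck(void)
{
    assert(reverse_bits32(0x00000001u) == 0x80000000u);
    assert(reverse_bits32(0x12345678u) == reverse_bits32_naive(0x12345678u));
}
```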
6
FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
    for (i=0; i < 128; i++) y[i] += c[i] * x[i];
    .. .. ..
  • On the microprocessor: 1000s of instructions
  • Several thousand cycles

Circuit for FIR Filter
  • 7 cycles
  • Speedup > 100x

In general, FPGA better due to circuit's
concurrency, from bit-level to task level
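
To make the concurrency argument concrete, the sketch below contrasts a sequential multiply-accumulate loop with a software model of a parallel circuit. It assumes one output sample is the sum of 128 products (the usual FIR formulation) and that the circuit sums the products with a balanced pairwise adder tree; neither assumption is stated on the slide. Under those assumptions the tree has log2(128) = 7 levels, which is one plausible reading of the "7 cycles" figure. The function names are hypothetical.

```c
#include <assert.h>

#define TAPS 128

/* Sequential version: one multiply-accumulate per loop iteration. */
long fir_sequential(const int c[TAPS], const int x[TAPS])
{
    long sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += (long)c[i] * x[i];
    return sum;
}

/* Software model of a parallel circuit: all products first, then a
   pairwise adder tree. *levels reports how many tree levels were needed. */
long fir_adder_tree(const int c[TAPS], const int x[TAPS], int *levels)
{
    long stage[TAPS];
    assert(levels != 0);

    for (int i = 0; i < TAPS; i++)          /* hardware: 128 multipliers side by side */
        stage[i] = (long)c[i] * x[i];

    *levels = 0;
    for (int width = TAPS; width > 1; width /= 2) {
        for (int i = 0; i < width / 2; i++) /* hardware: width/2 adders side by side */
            stage[i] = stage[2 * i] + stage[2 * i + 1];
        (*levels)++;
    }
    return stage[0];
}
```

Both functions return the same value; the difference is that every inner loop in the tree version corresponds to work a circuit could perform in a single step.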
7
Extensive Studies over Past Decade
  • Large speedups on many important applications
  • See ACM/SIGDA Int. Symp. on FPGAs
  • So why aren't FPGAs ubiquitous?

8
Why FPGAs Aren't Ubiquitous
  • Cost: but improving yearly
  • Power: but improving yearly, and energy benefits
    too
  • Extra chip: but integration continues
  • Programming methodology

[Chart: 1-million-system-gate FPGA cost. Source: Xilinx]
9
Why FPGAs Aren't Mainstream
  • Cost
  • Power
  • Extra chip
  • Programming methodology
  • Though tremendous progress in past decade

[Figure: parallel design flows from application to implementation]
  • Application (C/C++/Java/SystemC/Handel-C/Streams-C/...)
    goes through automated hardware/software partitioning
  • Microprocessor flow: C/C++/Java, compilation (1960s,
    1970s), assembly code, assembling and linking (1950s,
    1960s), microprocessor binary, downloading,
    implementation on microprocessors
  • FPGA flow: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/
    Streams-C/..., behavioral synthesis (1990s), register
    transfers, RT synthesis (1980s, 1990s), logic equations /
    FSMs, logic synthesis and physical design (1970s, 1980s),
    FPGA binary, downloading, implementation as FPGA circuits
10
So What's the Holdup?
  • FPGAs require special compilers
  • Limits adoption: desktop world dominates
  • 100 software writers for every CAD user
  • Millions of compiler seats worldwide, vs. 15,000
    CAD seats
  • Can't ignore "ecosystem" from separation of
    applications, tools, and architectures
  • Just consider history of popular processors

[Figure label: Standard Compiler]
11
Outline
  • FPGAs: The New Software
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Towards Standard Binaries for FPGAs

12
Can we Hide FPGAs from Programmers and Standard
Tools?
  • Example
  • Radically different x86 architectures hidden from
    programmers and tools
  • All execute standard x86 binaries
  • On-chip tools dynamically translate binary to
    particular architecture
  • Idea: Hide FPGA from programmers and tools
  • Download standard binary
  • Have on-chip tools dynamically translate binary
    (portions) to FPGA
  • We call this Warp Processing

[Figure: on-chip translators map standard binaries to a RISC architecture and to a VLIW architecture; label: "Traditional partitioning done here"]
13
Warp Processing Idea
Step 1: Initially, software binary loaded into
instruction memory
[Figure: warp processor architecture: Profiler, I Mem, µP, D$, FPGA, On-chip CAD]
14
Warp Processing Idea
Step 2: Microprocessor executes instructions in software
binary
[Figure: warp processor architecture, as on the previous slide]
15
Warp Processing Idea
Step 3: Profiler monitors instructions and detects
critical regions in binary
[Figure: warp processor architecture with Profiler and µP highlighted; the µP executes a repeating stream of add and beq instructions]
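
One simple way to detect a critical region is to count taken backward branches, since each one marks another iteration of a loop. The sketch below assumes that approach; the slides do not describe the actual profiler design, and the type `profiler_t`, the table size, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PROFILE_SLOTS 16 /* hypothetical table size */

typedef struct {
    uint32_t target[PROFILE_SLOTS]; /* loop-start addresses seen so far   */
    uint32_t count[PROFILE_SLOTS];  /* taken backward branches per target */
    int used;
} profiler_t;

/* Record one taken branch; only backward branches (loop back-edges) count. */
void profiler_observe(profiler_t *p, uint32_t branch_pc, uint32_t target_pc)
{
    assert(p != 0);
    if (target_pc >= branch_pc)
        return;                     /* forward branch: not a loop back-edge */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target_pc) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) {  /* new loop; dropped if the table is full */
        p->target[p->used] = target_pc;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the start address of the most frequently iterated loop, or 0 if none. */
uint32_t profiler_hottest(const profiler_t *p)
{
    uint32_t best_target = 0;
    uint32_t best_count = 0;
    assert(p != 0);
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_target = p->target[i];
        }
    }
    return best_target;
}
```

The address returned by `profiler_hottest` would be the starting point of the region handed to the on-chip CAD in the next step.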
16
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
[Figure: warp processor architecture with Profiler, µP, and On-chip CAD highlighted]
17
Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
[Figure: warp processor architecture; the on-chip CAD is labeled Dynamic Partitioning Module (DPM)]
18
Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
[Figure: warp processor architecture; the on-chip CAD is labeled Dynamic Partitioning Module (DPM)]
19
Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: warp processor architecture with the FPGA highlighted; the on-chip CAD is labeled Dynamic Partitioning Module (DPM)]


20
Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
"warp" by an order of magnitude or more
    Mov reg3, 0
    Mov reg4, 0
    loop:
      // instructions that interact with FPGA
    Ret reg4
[Figure: warp processor architecture with the FPGA highlighted; the on-chip CAD is labeled Dynamic Partitioning Module (DPM)]


21
Warp Processing Challenges
  • Two key challenges
  • Can we decompile binaries to recover enough
    high-level constructs to create fast circuits on
    FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using
    limited on-chip compute resources?

22
Decompilation
  • If we don't decompile
  • High-level information (e.g., loops, arrays) lost
    during compilation
  • Direct translation of assembly to circuit: big
    overhead
  • Need to recover high-level information

[Chart: overhead of microprocessor/FPGA solution WITHOUT
decompilation, vs. microprocessor alone]
23
Decompilation
  • Solution: Recover high-level information from the
    binary through decompilation
  • Adapted extensive previous work (for different
    purposes)
  • Developed new decompilation methods also
  • Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  • Numerous publications: http://www.cs.ucr.edu/vahid/pubs

Original C Code
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
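
To illustrate what decompilation has to recover, the sketch below places the source-level function next to a register-level version that mirrors the assembly listing one instruction at a time. The mapping (reg2 as the array base, reg3 as the loop index, reg4 as the accumulator, the shift by 1 as scaling by a 2-byte `short`) is an interpretation of the listing, and the loop branch is modeled as "repeat until reg3 reaches 10". The name `f_register_level` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Source-level view: sum of a 10-element array. */
long f(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* Register-level view: each statement mirrors one instruction of the listing.
   The loop, the array, and the element size are no longer explicit. */
long f_register_level(const short a[10])
{
    const unsigned char *reg2 = (const unsigned char *)a; /* base address of a */
    const unsigned char *reg5;
    long reg1, reg3, reg4, reg6;
    short loaded;

    assert(sizeof(short) == 2);               /* the shift by 1 assumes 2-byte elements */

    reg3 = 0;                                 /* Mov reg3, 0          */
    reg4 = 0;                                 /* Mov reg4, 0          */
    do {
        reg1 = reg3 << 1;                     /* Shl reg1, reg3, 1    */
        reg5 = reg2 + reg1;                   /* Add reg5, reg2, reg1 */
        memcpy(&loaded, reg5, sizeof loaded); /* Ld reg6, 0(reg5)     */
        reg6 = loaded;
        reg4 = reg4 + reg6;                   /* Add reg4, reg4, reg6 */
        reg3 = reg3 + 1;                      /* Add reg3, reg3, 1    */
    } while (reg3 != 10);                     /* loop branch on reg3 and 10 */
    return reg4;                              /* Ret reg4             */
}
```

Recognizing that reg3 is a counter with a fixed bound and that reg5 walks a 10-element array is what lets a synthesis tool treat the loop as an array reduction instead of translating it instruction by instruction.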
24
Decompilation Results vs. C
  • Compared with synthesis from C
  • Synthesis after decompilation often quite similar
  • Almost identical performance, small area overhead

FPGA 2005
25
Decompilation Results on Optimized H.264: In-depth
Study with Freescale
  • Used highly-optimized benchmark
  • Results: Binary approach competitive
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
  • Decompilation recovered nearly all high-level
    information needed for partitioning and synthesis

26
Tangent: Simple Coding Guidelines Bring Speedups
Closer to Ideal
  • Interesting discovery during H.264 study: C style
    limited speedup
  • Orthogonal to binary vs. C issue: coding style
    hurt both
  • Developed simple coding guidelines
  • Rewrote software in 20 minutes, and only 3%
    slower than original
  • New speedups: Binary 6.55, C 6.56
  • Binary still competitive with C
  • Following guidelines not required, but helps any
    approach targeting FPGAs

27
Warp Processing Challenges
  • Two key challenges
  • Can we decompile binaries to recover enough
    high-level constructs to create fast circuits on
    FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using
    limited on-chip compute resources?

28
JIT FPGA Compilation
  • Developed ultra-lean CAD heuristics for
    synthesis, placement, routing, and technology
    mapping; simultaneously developed CAD-oriented
    FPGA
  • e.g., our router (ROCR) is 10x faster and uses 20x
    less memory than the popular VPR tool, at the cost
    of a 30% longer critical path. Similar results for
    synthesis and placement
  • Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now
    Asst. Prof. at Univ. of Arizona)
  • Numerous publications: http://www.cs.ucr.edu/vahid/pubs

DAC04
29
Overall Warp Processing Results: Performance
Speedup (Most Frequent Kernel Only)
[Chart: speedup relative to software-only (SW Only) execution]
Currently prototyping our simpler FPGA fabric
with Intel, scheduled for Q3 shuttle
Overall application speedup average is 7.4
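
Kernel speedup and overall application speedup differ because only part of the runtime is accelerated. The standard relation is Amdahl's law, sketched below; it is not taken from the slides, and the example fraction and speedup in the usage note are made-up values, not measurements from this work.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of the original runtime
   (kernel_fraction, between 0 and 1) is sped up by kernel_speedup. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, with a hypothetical kernel that accounts for 90% of runtime and runs 10x faster in hardware, `overall_speedup(0.9, 10.0)` gives roughly 5.3, well below the kernel's own 10x.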
30
Outline
  • FPGAs: The New Software
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Towards Standard Binaries for FPGAs

31
FPGA Ubiquity via Obscurity
  • Warp processing hides FPGA from languages and
    tools
  • ANY microprocessor platform extendible with FPGA
  • Maintains "ecosystem" of application, tool, and
    architecture developers
  • New platforms with FPGAs appearing

[Figure: standard compiler plus profiling; new processor platforms with FPGA evolving]
32
FPGA Standard Binaries?
  • Microprocessor binary represents one form of a
    "standard binary for FPGAs"
  • Missing is explicit concurrency
  • Parallelism, pipelining, queues, etc.
  • As FPGAs appear in more platforms, might a more
    general FPGA binary evolve?

[Figure: ecosystem of applications, tools, and architectures built around standard binaries and standard FPGA binaries; labels: Profiling, Standard Compiler]
33
FPGA Standard Binaries?
  • Translator makes best use of existing FPGA
    resources
  • Can even add FPGA, like adding memory, to improve
    performance
  • Add more FPGA to your PDA to implement
    compute-intensive application?

34
Summary
  • FPGAs may be the new software
  • Hiding FPGA via warp processing is feasible
  • Decompilation can recover high-level constructs
    to yield speedups competitive with source-level
  • JIT FPGA compilation can be made sufficiently
    lean
  • Future: Standard binaries for FPGAs?
  • Extensive work to be done

Publications can be found at http://www.cs.ucr.edu/vahid/pubs