Warp Processing - PowerPoint PPT Presentation

About This Presentation

Slides: 40

Provided by: romanl5

Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

1
Warp Processing: Towards FPGA Ubiquity

Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)

2
Outline

FPGAs
Why they're great
Why they're not ubiquitous yet
Hiding FPGAs from programmers
Warp processing
Binary decompilation
Just-in-time FPGA compilation
Directions

3
FPGAs
(Figure: a LUT with signals a, b, F, and G; a circuit is implemented by downloading particular bits)

FPGA: Field-Programmable Gate Array
Implement circuit by downloading bits
A memory with N address inputs (a lookup table, or LUT) implements any N-input combinational logic function
Register-controlled switch matrix (SM) connects LUTs
FPGA fabric
Thousands of LUTs and SMs, plus increasingly many hard-core components such as multipliers, RAM, etc.
CAD tools automatically map desired circuit onto FPGA fabric
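
To make the LUT idea concrete, here is a minimal C sketch of a small lookup table being "programmed" by downloading bits. The type `Lut` and the functions `lut_program`, `lut_eval`, `fn_xor`, and `fn_and` are illustrative names, not part of any FPGA vendor's tools, and the 4-input limit is an assumption chosen so the configuration fits in 16 bits.

```c
#include <stdint.h>

/* A k-input LUT is a 2^k x 1-bit memory. Assumes k <= 4, so the
   configuration bits fit in a 16-bit word. */
typedef struct {
    unsigned k;     /* number of inputs (address bits) */
    uint16_t bits;  /* one stored output bit per input combination */
} Lut;

/* "Download bits": evaluate the desired function on every input
   combination and store the result at that address. */
Lut lut_program(unsigned k, int (*fn)(unsigned inputs))
{
    Lut l = { k, 0 };
    for (unsigned addr = 0; addr < (1u << k); addr++) {
        if (fn(addr)) {
            l.bits |= (uint16_t)(1u << addr);
        }
    }
    return l;
}

/* Evaluating the logic function is just a memory read. */
int lut_eval(Lut l, unsigned inputs)
{
    unsigned addr = inputs & ((1u << l.k) - 1u);
    return (l.bits >> addr) & 1;
}

/* Two example 2-input functions: input a is bit 0, input b is bit 1. */
int fn_xor(unsigned in) { return (int)((in ^ (in >> 1)) & 1u); }
int fn_and(unsigned in) { return (int)(in & (in >> 1) & 1u); }
```

The same `Lut` hardware computes XOR or AND depending only on the four bits loaded into it, which is the sense in which an FPGA is programmed by downloading bits.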

4
FPGAs are "Programmable" like Microprocessors: Just Download Bits
Microprocessor binaries: bits loaded into program memory
FPGA "binaries" (more commonly known as "bitstreams"): bits loaded into LUTs and SMs
(Figure: example bits, 0111 and 0010, downloaded into the FPGA)
5
FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
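
As a rough illustration of why this computation suits hardware: in a circuit, reversing a 32-bit word needs no logic at all, because output bit i is simply wired to input bit 31 - i. The sketch below checks the shift-and-mask software version against that wire-by-wire view. The function names `reverse_masks` and `reverse_wires` are illustrative, and `x` is assumed to be an unsigned 32-bit value.

```c
#include <stdint.h>

/* Software version: five shift-and-mask stages that swap groups of
   16, 8, 4, 2, and finally 1 bit. */
uint32_t reverse_masks(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* "Wiring" view: output bit i takes its value from input bit 31 - i.
   In a circuit each loop iteration is one wire, not one instruction. */
uint32_t reverse_wires(uint32_t x)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        out |= ((x >> (31 - i)) & 1u) << i;
    }
    return out;
}
```

Both functions return the same value for every input; the software version spends a sequence of shift, mask, and OR instructions on what the circuit gets from routing alone.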
6
FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
for (i=0; i < 128; i++) y[i] += c[i] * x[i];
...
1000s of instructions
Several thousand cycles

Circuit for FIR Filter
7 cycles
Speedup > 100x

In general, FPGA better due to circuit's concurrency, from bit-level to task level
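
One plausible reading of the "7 cycles" figure, offered here as an assumption because the slide does not spell it out: if the filter output is a single sum of 128 products, a circuit can compute all 128 multiplications in parallel and then add them with a binary adder tree of depth log2(128) = 7. The C sketch below models that tree and counts its levels; `TAPS`, `fir_seq`, and `fir_tree` are illustrative names.

```c
#include <stddef.h>

#define TAPS 128

/* Sequential software view: one multiply-accumulate per iteration. */
long fir_seq(const int c[TAPS], const int x[TAPS])
{
    long sum = 0;
    for (size_t i = 0; i < TAPS; i++) {
        sum += (long)c[i] * x[i];
    }
    return sum;
}

/* Circuit-like view: all products at once, then a binary adder tree.
   *levels receives the number of tree levels (adder stages). */
long fir_tree(const int c[TAPS], const int x[TAPS], int *levels)
{
    long partial[TAPS];
    int depth = 0;

    /* In hardware these 128 multiplications would run concurrently. */
    for (size_t i = 0; i < TAPS; i++) {
        partial[i] = (long)c[i] * x[i];
    }

    /* Each pass halves the number of partial sums: 128, 64, ..., 1. */
    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++) {
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        }
        depth++;
    }

    *levels = depth;
    return partial[0];
}
```

With 128 taps the tree has 7 levels, so a pipelined circuit of this shape needs on the order of 7 adder stages where software needs at least 128 multiply-accumulate iterations plus loop overhead.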
7
Extensive Studies over Past Decade

Large speedups on many important applications
See ACM/SIGDA Int. Symp. on FPGAs
So why aren't FPGAs ubiquitous?

8
Why FPGAs Aren't Mainstream

Cost: but improving yearly
Power: but improving yearly, and energy benefits too
Extra chip: but integration continues
Programming methodology

(Figure: 1 million system gate FPGA cost. Source: Xilinx)
9
Why FPGAs Aren't Mainstream

Cost
Power
Extra chip
Programming methodology
Though tremendous progress in past decade

(Figure: tool flows from application to implementation)
Application (C/C++/Java/SystemC/Handel-C/Streams-C/...)
Automated hardware/software partitioning
Microprocessor flow: C/C++/Java -> Compilation (1960s, 1970s) -> Assembly code -> Assembling, linking (1950s, 1960s) -> Microprocessor binary -> Downloading -> Microprocessors
FPGA flow: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C... -> Behavioral synthesis (1990s) -> Register transfers -> RT synthesis (1980s, 1990s) -> Logic equations / FSMs -> Logic synthesis, physical design (1970s, 1980s) -> FPGA binary -> Downloading -> FPGA circuits
10
So What's the Holdup?

FPGAs require special compilers
Limits adoption: desktop world dominates
100 software writers for every CAD user
Millions of compiler seats worldwide, vs. 15,000 CAD seats

Standard Compiler
11
Outline

FPGAs
Why they're great
Why they're not ubiquitous yet
Hiding FPGAs from programmers
Warp processing
Binary decompilation
Just-in-time FPGA compilation
Directions

12
Can we Hide FPGAs from Programmers and Standard Tools?

Example: radically different x86 architectures are hidden from programmers and tools
All execute standard x86 binaries
On-chip tools dynamically translate binary to particular architecture
Idea: hide FPGA from programmers and tools
Download standard binary
Have on-chip tools dynamically translate (portions of) binary to FPGA
We call this Warp Processing

(Figure labels: "Traditional partitioning done here"; a Translator feeding a RISC architecture; a Translator feeding a VLIW architecture)
13
Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
(Figure: warp processor with µP, I Mem, D$, Profiler, FPGA, and On-chip CAD)
14
Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
15
Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
(Figure: Profiler and µP highlighted; a stream of beq and add instructions flows past the profiler)
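
The slides do not describe how the profiler works internally, so the following is only a sketch under stated assumptions: it treats every taken backward branch as a loop back-edge and keeps a small table of saturating counters, one per loop head. The table size, the halve-on-saturation policy, and the names `Profiler`, `profiler_observe`, and `profiler_hottest` are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define PROFILE_ENTRIES 16  /* hypothetical table size */

typedef struct {
    uint32_t target;  /* address a backward branch jumps to (loop head) */
    uint16_t count;   /* saturating count of taken back-edges */
    int      valid;
} ProfileEntry;

typedef struct {
    ProfileEntry entries[PROFILE_ENTRIES];
} Profiler;

void profiler_init(Profiler *p)
{
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        p->entries[i].target = 0;
        p->entries[i].count = 0;
        p->entries[i].valid = 0;
    }
}

/* Record one taken branch from address pc to address target.
   Forward branches are ignored; only back-edges indicate loops. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target)
{
    ProfileEntry *free_slot = NULL;

    if (target > pc) {
        return;
    }
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        ProfileEntry *e = &p->entries[i];
        if (e->valid && e->target == target) {
            if (e->count == UINT16_MAX) {
                /* Halve every counter so relative frequencies survive. */
                for (size_t j = 0; j < PROFILE_ENTRIES; j++) {
                    p->entries[j].count /= 2;
                }
            }
            e->count++;
            return;
        }
        if (!e->valid && free_slot == NULL) {
            free_slot = e;
        }
    }
    if (free_slot != NULL) {
        free_slot->target = target;
        free_slot->count = 1;
        free_slot->valid = 1;
    }
    /* Table full: the sample is dropped. A real design would need a
       replacement policy here. */
}

/* Writes the most frequently executed loop head to *target_out.
   Returns 1 on success, 0 if no loop has been observed. */
int profiler_hottest(const Profiler *p, uint32_t *target_out)
{
    const ProfileEntry *best = NULL;
    for (size_t i = 0; i < PROFILE_ENTRIES; i++) {
        const ProfileEntry *e = &p->entries[i];
        if (e->valid && (best == NULL || e->count > best->count)) {
            best = e;
        }
    }
    if (best == NULL) {
        return 0;
    }
    *target_out = best->target;
    return 1;
}
```

The loop head returned by `profiler_hottest` is the kind of critical region the on-chip CAD would read in next.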
16
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
17
Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
(Figure: on-chip CAD block labeled Dynamic Part. Module (DPM))
18
Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
19
Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA

20
Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4

21
Warp Processing Idea
Likely multiple microprocessors per chip, serviced by one on-chip CAD block
(Figure: several µPs sharing one Profiler, I Mem, D$, FPGA, and On-chip CAD block)
22
Warp Processing Challenges

Two key challenges
Can we decompile binaries to recover enough
high-level constructs to create fast circuits on
FPGAs?
Can we just-in-time (JIT) compile to FPGAs using
limited on-chip compute resources?

23
Decompilation

Synthesis from binary has a potential hurdle
High-level information (e.g., loops, arrays) lost during compilation
Direct translation of assembly to circuit: huge overheads
Need to recover high-level information

(Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone)
24
Decompilation

Solution: recover high-level information from binary (decompilation)
Adapted extensive previous work (for different purposes)
Developed new decompilation methods also
Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
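
As a hedged illustration of what "recovering high-level information" can mean for this example, the sketch below recognizes two patterns in the assembly: a shift, add, load sequence that is really an array access, and a branch with a negative offset that is really a loop back-edge. It is not the authors' decompiler. The `Instr` encoding, the convention that branch offsets count instructions relative to the branch itself, and the names `match_array_access` and `find_loop_head` are assumptions made for illustration.

```c
#include <stddef.h>

typedef enum { OP_MOV, OP_SHL, OP_ADD, OP_LD, OP_BEQ, OP_RET } Opcode;

/* Hypothetical decoded instruction; unused register fields are -1. */
typedef struct {
    Opcode op;
    int dst;         /* destination register */
    int src1, src2;  /* source registers */
    int imm;         /* immediate, shift amount, or load displacement */
    int branch_off;  /* branch offset in instructions, relative to branch */
} Instr;

typedef struct {
    int base_reg;    /* register holding the array base address */
    int index_reg;   /* register holding the element index */
    int value_reg;   /* register receiving the loaded element */
    int elem_bytes;  /* element size implied by the shift amount */
} ArrayAccess;

/* Matches "Shl t, idx, k; Add addr, base, t; Ld v, 0(addr)" at code[i].
   Returns 1 and fills *out on a match, 0 otherwise. */
int match_array_access(const Instr *code, size_t n, size_t i,
                       ArrayAccess *out)
{
    const Instr *shl, *add, *ld;
    int base;

    if (n < 3 || i > n - 3) {
        return 0;
    }
    shl = &code[i];
    add = &code[i + 1];
    ld = &code[i + 2];
    if (shl->op != OP_SHL || add->op != OP_ADD || ld->op != OP_LD) {
        return 0;
    }
    if (shl->imm < 0 || shl->imm > 3) {
        return 0;
    }
    if (ld->src1 != add->dst || ld->imm != 0) {
        return 0;
    }
    /* The scaled index may be either operand of the add. */
    if (add->src1 == shl->dst) {
        base = add->src2;
    } else if (add->src2 == shl->dst) {
        base = add->src1;
    } else {
        return 0;
    }
    out->base_reg = base;
    out->index_reg = shl->src1;
    out->value_reg = ld->dst;
    out->elem_bytes = 1 << shl->imm;
    return 1;
}

/* A branch with a negative offset is treated as a loop back-edge.
   Returns the index of the loop head, or -1 if there is none. */
long find_loop_head(const Instr *code, size_t n, size_t branch_idx)
{
    long target;

    if (branch_idx >= n || code[branch_idx].op != OP_BEQ) {
        return -1;
    }
    if (code[branch_idx].branch_off >= 0) {
        return -1;
    }
    target = (long)branch_idx + code[branch_idx].branch_off;
    return target < 0 ? -1 : target;
}
```

Applied to the assembly above, the matcher reports an access to 2-byte elements (a `short` array) based at reg2 and indexed by reg3, and the branch resolves to the instruction labeled `loop`. Facts like these let synthesis build a loop over an array instead of a literal copy of the instruction sequence.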
25
Decompilation Results vs. C

Compared with synthesis from C
Synthesis after decompilation often quite similar
Almost identical performance, small area overhead

FPGA 2005
26
Decompilation Results on Optimized H.264: In-depth Study with Freescale

Used highly-optimized benchmark
Results: binary approach competitive
Speedups compared to ARM9 software: Binary 2.48, C 2.53
Decompilation recovered nearly all high-level information needed for partitioning and synthesis

27
Simple Coding Guidelines Bring Speedups Closer to Ideal

Interesting discovery during H.264 study: C style limited speedup
Orthogonal to binary vs. C issue: coding style hurt both
Developed simple coding guidelines
Rewritten software: 20 minutes, and only 3% slower than original
New speedups: Binary 6.55, C 6.56
Binary still competitive with C
Following guidelines not required, but helps any approach targeting FPGAs

28
Warp Processing Challenges

Two key challenges
Can we decompile binaries to recover enough
high-level constructs to create fast circuits on
FPGAs?
Can we just-in-time (JIT) compile to FPGAs using
limited on-chip compute resources?

29
JIT FPGA Compilation

Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

DAC'04
30
JIT FPGA Compilation
31
Overall Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
(Figure: performance speedup relative to SW Only Execution)
32
Overall Warp Processing Results: Performance Speedup (Overall, Multiple Kernels)

Energy reduction of 38% - 94%

(Figure: performance speedup relative to SW Only Execution)
Assuming 100 MHz ARM, fabric in same technology and clocked at rate determined by synthesis
33
FPGA Ubiquity via Obscurity

FPGA is hidden from languages and tools
Thus, ANY microprocessor platform extendible with
FPGA
So any program can potentially be sped up by
FPGAs
No new languages, no new tools
Maintains "ecosystem" among application, tool,
and architecture developers

Profiling
Standard Compiler
34
Outline

FPGAs
Why they're great
Why they're not ubiquitous yet
Hiding FPGAs from programmers
Warp processing
Binary decompilation
Just-in-time FPGA compilation
Directions

35
Directions: What's Next?

Immediate future: develop warp processing using benchmarks from other domains
Desktop, server, scientific
With partners IBM, Freescale
May require new decompilation techniques

36
Directions: What's Next?

Application-specific FPGA
Tune FPGA fabric to application (or domain)
Parameters: LUTs/CLB, LUT size
Many more possible, e.g., switch matrix size, long vs. short channels

(Figure: delay for each configuration (LUTs/CLB, and LUT sizes 2-7) for one application)
(Figure: delay and area when tuning parameters for best delay for each app, rather than for all apps)
37
Directions: What's Next?

Parallel benchmarks
NAS, SPEComp, Splash,
Map each thread to custom FPGA circuit
Huge potential speedups

(Figure: multiple µPs sharing a Profiler, I Mem, D$, and On-chip CAD, with threads Thrd1 through ThrdN each mapped to a circuit on the FPGA)
(Figure: sample speedups from other works)
38
Directions: What's Next?

With JIT FPGA compiler, what else is possible?
Implications for existing applications?
Image processing, neural networks, ...
Add FPGA hardware to improve performance, like
expandable memory?
Standard binaries for FPGAs?
Rather than extracting circuit from sequential
code, distribute circuit binary itself, use JIT
FPGA compiler to best map to FPGA resources

39
Summary

FPGA future looks bright
Hiding FPGA via warp processing is feasible
Decompilation can recover high-level constructs
to yield speedups competitive with source-level
JIT FPGA compilation can be made sufficiently
lean
Many possible directions exist that may use FPGAs
to gain ultra-high performance without ultra-high
engineering or hardware costs

Publications can be found at http://www.cs.ucr.edu/~vahid/pubs
