Title: Warp Processing


1
Warp Processing: Towards FPGA Ubiquity
  • Frank Vahid
  • Professor
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Associate Director, Center for Embedded Computer
    Systems, UC Irvine
  • Work supported by the National Science
    Foundation, the Semiconductor Research
    Corporation, Xilinx, Intel, and Freescale
  • Contributing students: Roman Lysecky (PhD 2005,
    now asst. prof. at U. Arizona), Greg Stitt (PhD
    2006), Kris Miller (MS 2007), David Sheldon (3rd
    yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy
    (1st yr PhD)

2
Outline
  • FPGAs
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Directions

3
FPGAs
(Figure: LUT block with signals a, b, F, and G. Caption: implement
circuit by downloading particular bits)
  • FPGA -- Field-Programmable Gate Array
  • Implement circuit by downloading bits
  • Memory with N address lines (a lookup table, or LUT)
    implements any N-input combinational logic function
  • Register-controlled switch matrix (SM) connects
    LUTs
  • FPGA fabric
  • Thousands of LUTs and SMs, increasingly
    additional hard core components like multipliers,
    RAM, etc.
  • CAD tools automatically map desired circuit onto
    FPGA fabric
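
The LUT bullet above can be sketched in C. This is a minimal illustration, not a model of any real FPGA: the type `lut2_t`, the function `lut2_eval`, and the two example configurations are hypothetical names introduced here. The inputs form a memory address, and the "downloaded bits" are simply the truth table stored at those addresses.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input LUT is a 4-entry, 1-bit-wide memory.
   Bit k of `bits` holds the output for address k. */
typedef struct {
    uint8_t bits;
} lut2_t;

/* Read the LUT: inputs a and b (each 0 or 1) form the address. */
static int lut2_eval(lut2_t lut, int a, int b) {
    unsigned addr = ((unsigned)(a & 1) << 1) | (unsigned)(b & 1);
    return (lut.bits >> addr) & 1;
}

/* "Downloading bits" means choosing the truth table. */
static const lut2_t LUT_AND = { 0x8 }; /* 1000: only address 3 (a=1, b=1) yields 1 */
static const lut2_t LUT_XOR = { 0x6 }; /* 0110: addresses 1 and 2 yield 1 */
```

The same `lut2_eval` hardware computes AND or XOR depending only on the stored bits, which is the sense in which the circuit is "programmed".
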

4
FPGAs are "Programmable" like Microprocessors: Just Download Bits
  • Microprocessor binaries: bits loaded into program memory
  • FPGA "binaries" (more commonly known as "bitstreams"): bits
    loaded into LUTs and SMs

(Figure: bit patterns such as 0111 and 0010 loaded into the FPGA)

5
FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
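
For readers who want to try the bit-reversal code, here is a self-contained sketch. It wraps the slide's five shift-and-mask steps in a function and adds a one-bit-per-iteration reference to check against. The function names and the choice of `uint32_t` are assumptions made for this example. In hardware the same operation needs no logic at all, only crossed wires, which is why it favors an FPGA.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-mask reversal of a 32-bit word: swap halves, then bytes,
   then nibbles, then bit pairs, then single bits. */
static uint32_t reverse_bits(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per iteration, 32 iterations. */
static uint32_t reverse_bits_naive(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

Reversing twice returns the original word, which gives a quick sanity check in addition to comparing the two versions.
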
6
FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
    for (i = 0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...
  • 1000s of instructions
  • Several thousand cycles

Circuit for FIR Filter
  • 7 cycles
  • Speedup > 100x

In general, FPGA better due to circuit's
concurrency, from bit-level to task level
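
The concurrency argument can be illustrated with a small C sketch. It assumes a single filter output computed as a sum of 128 products (the slide's loop is abbreviated, so this is an interpretation), and the names `mac_sequential`, `mac_tree`, and `TAPS` are introduced here. The sequential form does one multiply-accumulate per iteration. The tree form computes all products independently and then adds them pairwise, the way a circuit with 128 multipliers and an adder tree would.

```c
#include <assert.h>

#define TAPS 128

/* Sequential form: one multiply-accumulate per loop iteration. */
static long mac_sequential(const int c[TAPS], const int x[TAPS]) {
    long sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += (long)c[i] * x[i];
    return sum;
}

/* Tree form: all products are independent, then pairwise additions.
   Each pass of the while loop corresponds to one adder-tree level. */
static long mac_tree(const int c[TAPS], const int x[TAPS], int *levels) {
    long p[TAPS];
    for (int i = 0; i < TAPS; i++)
        p[i] = (long)c[i] * x[i];
    int n = TAPS, depth = 0;
    while (n > 1) {
        for (int i = 0; i < n / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];
        n /= 2;
        depth++;
    }
    *levels = depth;
    return p[0];
}
```

For 128 terms the tree has 7 levels. That may be where the slide's cycle count comes from, but the link is an assumption; the transcript does not say how the 7 cycles were derived.
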
7
Extensive Studies over Past Decade
  • Large speedups on many important applications
  • See ACM/SIGDA Int. Symp. on FPGAs
  • So why aren't FPGAs ubiquitous?

8
Why FPGAs aren't Mainstream
  • Cost: but improving yearly
  • Power: but improving yearly, and energy benefits
    too
  • Extra chip: but integration continues
  • Programming methodology

(Chart: cost of a 1 million system gate FPGA. Source: Xilinx)

9
Why FPGAs aren't Mainstream
  • Cost
  • Power
  • Extra chip
  • Programming methodology
  • Though tremendous progress in past decade

(Figure: two parallel implementation flows from one application)
Application (C/C++/Java/SystemC/Handel-C/Streams-C/...)
  -> Automated hardware/software partitioning
Microprocessor flow:
  C/C++/Java
  -> Compilation (1960s, 1970s) -> Assembly code
  -> Assembling, linking (1950s, 1960s) -> Microprocessor binary
  -> Downloading -> Implementation: microprocessors
FPGA flow:
  C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C ...
  -> Behavioral synthesis (1990s) -> Register transfers
  -> RT synthesis (1980s, 1990s) -> Logic equations / FSMs
  -> Logic synthesis, physical design (1970s, 1980s) -> FPGA binary
  -> Downloading -> Implementation: FPGA circuits

10
So What's the Holdup?
  • FPGAs require special compilers
  • Limits adoption: desktop world dominates
  • 100 software writers for every CAD user
  • Millions of compiler seats worldwide, vs. 15,000
    CAD seats

(Figure label: Standard Compiler)

11
Outline
  • FPGAs
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Directions

12
Can we Hide FPGAs from Programmers and Standard Tools?
  • Example
  • Radically different x86 architectures hidden from
    programmers and tools
  • All execute standard x86 binaries
  • On-chip tools dynamically translate binary to
    particular architecture
  • Idea: hide FPGA from programmers and tools
  • Download standard binary
  • Have on-chip tools dynamically translate binary
    (portions) to FPGA
  • We call this Warp Processing

(Figure: two translators, one in front of a RISC architecture and one
in front of a VLIW architecture. Callout: traditional partitioning
done here)

13
Warp Processing Idea
Step 1: Initially, software binary loaded into
instruction memory
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and
On-chip CAD)

14
Warp Processing Idea
Step 2: Microprocessor executes instructions in software
binary
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and
On-chip CAD)

15
Warp Processing Idea
Step 3: Profiler monitors instructions and detects
critical regions in binary
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and
On-chip CAD, showing a repeating stream of add and beq instructions)
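
The transcript does not say how the profiler detects critical regions. One plausible mechanism, sketched here purely as an assumption, is to count taken backward branches, since a backward branch usually closes a loop and the slide's figure shows repeated add and beq instructions. All names (`profiler_t`, `profiler_observe`, `profiler_hottest`) and the table size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 16  /* hypothetical size of the on-chip counter table */

typedef struct {
    uint32_t branch_addr[SLOTS]; /* address of each tracked backward branch */
    uint32_t count[SLOTS];       /* times that branch was taken */
    int used;                    /* number of slots in use */
} profiler_t;

/* Called for every taken branch. Only backward branches
   (likely loop back-edges) are counted. */
static void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target) {
    if (target >= pc)
        return; /* forward branch: ignore */
    for (int i = 0; i < p->used; i++) {
        if (p->branch_addr[i] == pc) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < SLOTS) {
        p->branch_addr[p->used] = pc;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: this sketch simply drops the new branch. */
}

/* Address of the most frequently taken backward branch, or 0 if none. */
static uint32_t profiler_hottest(const profiler_t *p) {
    uint32_t best_addr = 0, best_count = 0;
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_addr = p->branch_addr[i];
        }
    }
    return best_addr;
}
```

The region between the hottest branch's target and the branch itself would then be the candidate handed to the on-chip CAD in step 4.
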
16
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA, and
On-chip CAD)

17
Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA,
Dynamic Partitioning Module (DPM), and On-chip CAD)

18
Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA,
Dynamic Partitioning Module (DPM), and On-chip CAD)

19
Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA,
Dynamic Partitioning Module (DPM), and On-chip CAD)

20
Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
warp by an order of magnitude or more
    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
(Figure: block diagram with Profiler, I Mem, µP, D, FPGA,
Dynamic Partitioning Module (DPM), and On-chip CAD)

21
Warp Processing Idea
Likely multiple microprocessors per chip,
serviced by one on-chip CAD block
(Figure: several µPs sharing one Profiler, I Mem, D, FPGA, and
On-chip CAD block)

22
Warp Processing Challenges
  • Two key challenges
  • Can we decompile binaries to recover enough
    high-level constructs to create fast circuits on
    FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using
    limited on-chip compute resources?

23
Decompilation
  • Synthesis from binary has a potential hurdle
  • High-level information (e.g., loops, arrays) lost
    during compilation
  • Direct translation of assembly to circuit: huge
    overheads
  • Need to recover high-level information

(Chart: overhead of microprocessor/FPGA solution WITHOUT
decompilation, vs. microprocessor alone)
24
Decompilation
  • Solution: recover high-level information from
    binary via decompilation
  • Adapted extensive previous work (for different
    purposes)
  • Developed new decompilation methods also
  • Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

Original C Code
    long f(short a[10]) {
        long accum = 0;
        for (int i = 0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }

Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
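
To make concrete what "recovering high-level information" means for this example, the following sketch expresses the assembly twice in C. `f_lowlevel` mirrors the instructions one by one, with registers, a shift, and raw byte addresses. `f_recovered` is the form a decompiler would aim for: a counted loop over an array of 16-bit elements. Both function names are hypothetical, the use of `int16_t` assumes the slide's `short` is 16 bits wide, and the loop is assumed to run exactly 10 times.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Instruction-by-instruction view of the assembly. */
static long f_lowlevel(const unsigned char *reg2 /* base address of a */) {
    long reg3 = 0;                               /* Mov reg3, 0 (index) */
    long reg4 = 0;                               /* Mov reg4, 0 (accumulator) */
    do {
        long reg1 = reg3 << 1;                   /* Shl reg1, reg3, 1: index * 2 bytes */
        const unsigned char *reg5 = reg2 + reg1; /* Add reg5, reg2, reg1 */
        int16_t reg6;
        memcpy(&reg6, reg5, sizeof reg6);        /* Ld reg6, 0(reg5) */
        reg4 += reg6;                            /* Add reg4, reg4, reg6 */
        reg3 += 1;                               /* Add reg3, reg3, 1 */
    } while (reg3 != 10);                        /* loop runs 10 times */
    return reg4;                                 /* Ret reg4 */
}

/* Recovered view: the shift by 1 becomes 16-bit array indexing,
   and the compare-and-branch becomes a loop with a known bound. */
static long f_recovered(const int16_t a[10]) {
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}
```

The recovered form exposes the facts a synthesis tool can exploit: the trip count is a constant 10, and the accesses are to consecutive array elements, so the loop can be unrolled and the loads scheduled in parallel.
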
25
Decompilation Results vs. C
  • Compared with synthesis from C
  • Synthesis after decompilation often quite similar
  • Almost identical performance, small area overhead

FPGA 2005
26
Decompilation Results on Optimized H.264: In-depth
Study with Freescale
  • Used highly-optimized benchmark
  • Results: binary approach competitive
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
  • Decompilation recovered nearly all high-level
    information needed for partitioning and synthesis

27
Simple Coding Guidelines Bring Speedups Closer to
Ideal
  • Interesting discovery during H.264 study: C style
    limited speedup
  • Orthogonal to binary vs. C issue: coding style
    hurt both
  • Developed simple coding guidelines
  • Rewritten software: 20 minutes, and only 3%
    slower than original
  • New speedups: Binary 6.55, C 6.56
  • Binary still competitive with C
  • Following guidelines not required, but helps any
    approach targeting FPGAs

28
Warp Processing Challenges
  • Two key challenges
  • Can we decompile binaries to recover enough
    high-level constructs to create fast circuits on
    FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using
    limited on-chip compute resources?

29
JIT FPGA Compilation
  • Developed ultra-lean CAD heuristics for
    synthesis, placement, routing, and technology
    mapping; simultaneously developed CAD-oriented
    FPGA
  • e.g., our router (ROCR) is 10x faster and uses 20x less
    memory than the popular VPR tool, at cost of 30%
    longer critical path. Similar results for synthesis
    and placement
  • Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now
    Asst. Prof. at Univ. of Arizona)
  • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs

DAC 2004
30
JIT FPGA Compilation
31
Overall Warp Processing Results: Performance
Speedup (Most Frequent Kernel Only)
(Chart: speedup relative to SW-only execution)
32
Overall Warp Processing Results: Performance
Speedup (Overall, Multiple Kernels)
  • Energy reduction of 38% - 94%

(Chart: speedup relative to SW-only execution)
Assuming 100 MHz ARM, fabric in same technology
and clocked at rate determined by synthesis
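
The two results slides distinguish the speedup of the most frequent kernel from the overall application speedup. The standard relation between them is Amdahl's law, sketched below. The function name is hypothetical, and any numbers passed to it are illustrative rather than taken from the talk's results.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `kernel_fraction` of the
   original run time is accelerated by a factor `kernel_speedup` and the
   rest runs unchanged in software. */
static double overall_speedup(double kernel_fraction, double kernel_speedup) {
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, a kernel that accounts for 90% of run time and is sped up 10x gives an overall speedup of about 5.3x, which is why mapping several kernels to the FPGA matters for the overall figure.
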
33
FPGA Ubiquity via Obscurity
  • FPGA is hidden from languages and tools
  • Thus, ANY microprocessor platform extendible with
    FPGA
  • So any program can potentially be sped up by
    FPGAs
  • No new languages, no new tools
  • Maintains "ecosystem" among application, tool,
    and architecture developers

(Figure labels: Profiling, Standard Compiler)
34
Outline
  • FPGAs
  • Why they're great
  • Why they're not ubiquitous yet
  • Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
  • Directions

35
Directions: What's Next?
  • Immediate future: develop warp processing using
    benchmarks from other domains
  • Desktop, server, scientific
  • With partners: IBM, Freescale
  • May require new decompilation techniques

36
Directions: What's Next?
  • Application-specific FPGA
  • Tune FPGA fabric to application (or domain)
  • Parameters: LUTs/CLB, LUT size
  • Many more possible, e.g., switch matrix size,
    long vs. short channels

(Chart: delay for each configuration (LUTs/CLB, and LUT
sizes 2-7) for one application)
(Chart: delay and area when tuning parameters for best
delay for each app, rather than for all apps)
37
Directions: What's Next?
  • Parallel benchmarks
  • NAS, SPEComp, Splash, ...
  • Map each thread to custom FPGA circuit
  • Huge potential speedups

(Figure: threads Thrd1, Thrd2, Thrd3, ..., ThrdN on a chip with
several µPs, Profiler, I Mem, D, FPGA, and On-chip CAD)
(Chart: sample speedups from other works)
38
Directions: What's Next?
  • With JIT FPGA compiler, what else is possible?
  • Implications for existing applications?
  • Image processing, neural networks, ...
  • Add FPGA hardware to improve performance, like
    expandable memory?
  • Standard binaries for FPGAs?
  • Rather than extracting circuit from sequential
    code, distribute circuit binary itself, use JIT
    FPGA compiler to best map to FPGA resources

39
Summary
  • FPGA future looks bright
  • Hiding FPGA via warp processing is feasible
  • Decompilation can recover high-level constructs
    to yield speedups competitive with source-level
  • JIT FPGA compilation can be made sufficiently
    lean
  • Many possible directions exist that may use FPGAs
    to gain ultra-high performance without ultra-high
    engineering or hardware costs

Publications can be found at http://www.cs.ucr.edu/~vahid/pubs