Title: Warp Processors
1 Warp Processors
- Frank Vahid (Task Leader)
- Department of Computer Science and Engineering
- University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Task ID 1331.001, July 2005 to June 2008
- Ph.D. students
  - Greg Stitt, Ph.D. expected June 2006
  - Ann Gordon-Ross, Ph.D. expected June 2006
  - David Sheldon, Ph.D. expected 2009
  - Ryan Mannion, Ph.D. expected 2009
  - Scott Sirowy, Ph.D. expected 2010
- Industrial Liaisons
- Brian W. Einloth, Motorola
- Serge Rutman, Dave Clark, Intel
- Jeff Welser, IBM
2 Task Description
- Warp processing background
  - Two seed SRC CSR grants (2002-2005) showed feasibility
  - Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
- Task: Mature warp technology
  - Years 1/2 (in progress)
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
      - Also discovered unanticipated problem, developed solution
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
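3 Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: warp processor block diagram with profiler, instruction memory (I Mem), microprocessor (µP), data cache (D$), FPGA, and on-chip CAD]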
4 Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: warp processor block diagram with profiler, I Mem, µP, D$, FPGA, and on-chip CAD]
5 Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: block diagram with the profiler and µP highlighted; the profiler observes a repeating stream of beq and add instructions]
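The profiling step above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the on-chip profiler hardware from the cited work: it assumes critical regions are loops, identifies a loop by the target address of a taken backward branch, and keeps a fixed table of counters. The names `LoopProfiler`, `profiler_observe`, `profiler_hottest`, and the table size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16  /* assumed number of counters; not taken from the slides */

typedef struct {
    uint32_t target[TABLE_SIZE];  /* loop-start address (backward-branch target) */
    uint32_t count[TABLE_SIZE];   /* how many times that back-edge was taken */
    size_t   used;                /* number of table entries in use */
} LoopProfiler;

/* Record one taken branch. Only backward branches (loop back-edges) are counted. */
static void profiler_observe(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc)
        return;                               /* forward branch: not a loop */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {         /* known loop: bump its counter */
            p->count[i]++;
            return;
        }
    }
    if (p->used < TABLE_SIZE) {               /* new loop; ignored if table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the start address of the most frequently executed loop, or 0 if none. */
static uint32_t profiler_hottest(const LoopProfiler *p)
{
    uint32_t best_target = 0, best_count = 0;
    assert(p != NULL);
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_target = p->target[i];
        }
    }
    return best_target;
}
```

Feeding every taken branch of a run into `profiler_observe` and then calling `profiler_hottest` yields the loop that an on-chip CAD step would examine first in this simplified model.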
6 Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
[Figure: block diagram with the profiler, µP, and on-chip CAD highlighted]
7 Warp Processing Background: Basic Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: block diagram with the profiler and µP highlighted; on-chip CAD labeled Dynamic Partitioning Module (DPM)]
8 Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: block diagram with the profiler and µP highlighted; on-chip CAD labeled Dynamic Partitioning Module (DPM)]
9 Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: block diagram with the profiler, µP, and FPGA highlighted; on-chip CAD labeled Dynamic Partitioning Module (DPM)]
10 Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Updated binary:
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: block diagram with the profiler, µP, and FPGA highlighted; on-chip CAD labeled Dynamic Partitioning Module (DPM)]
11 Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard core processors
- FPGAs with soft core processors
- Computer boards with FPGAs
[Figures: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Xilinx Spartan (source: Xilinx); Cray XD1 (source: FPGA Journal, Apr. 2005)]
12 Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
- Programming a key challenge
  - Soln 1: Compile high-level language to custom binaries
  - Soln 2: Use standard binaries, dynamically re-map (warp)
    - Cons
      - Less high-level information, less optimization
    - Pros
      - Available to all software developers, not just specialists
      - Data dependent optimization
      - Most importantly, standard binaries enable ecosystem among tools, architecture, and applications
        - Most significant concept presently absent in FPGAs and other new programmable platforms
[Figure: Xilinx Virtex II Pro (source: Xilinx)]
13 Warp Processing Background: Basic Technology
- Warp processing
  - On-chip profiler
  - Warp-tuned FPGA
  - On-chip CAD, including Just-in-Time (JIT) FPGA compilation
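14 Warp Processing Background: Initial Results
[Figure: on-chip CAD flow stages (decompilation, partitioning, RT synthesis, logic synthesis, technology mapping, place, route) with a run-time comparison; surviving labels: 9.1 s, Xilinx ISE]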
15 Warp Processing Background: Publications 2002-2005
- On-chip profiler
  - Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003.
  - Extended version of above in special issue "Best of CASES/MICRO" of IEEE Trans. on Comp., Oct. 2005.
- Warp-tuned FPGA
  - A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe Conf. (DATE), Feb. 2004.
- On-chip CAD, including Just-in-Time FPGA compilation
  - A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
  - A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
  - Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. Design Automation Conf. (DAC), June 2004.
  - A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. ISSS/CODES conf., Oct. 2003.
  - Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  - On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  - The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
  - Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), Nov. 2002.
- Related
  - A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid and R. Lysecky. ACM Transactions on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
  - Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
17 Automatic High-Level Construct Recovery from Binaries
- Challenge: Binary lacks high-level constructs (loops, arrays, ...)
- Decompilation can help recover
  - Extensive previous work (e.g., Cifuentes 93, 94, 99)
Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}
Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
18 New Method: Loop Rerolling
- Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems
  - Huge input (slow), can't unroll to desired amount, can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: New decompilation method, Loop Rerolling
  - Identify unrolled iterations, compact into one iteration
Original loop
for (int i=0; i < 3; i++)
  accum += a[i];
After Loop Unrolling
Ld reg2, 100(0)
Add reg1, reg1, reg2
Ld reg2, 100(1)
Add reg1, reg1, reg2
Ld reg2, 100(2)
Add reg1, reg1, reg2
19 Loop Rerolling: Identify Unrolled Iterations
- Find consecutively repeating instruction sequences
Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;
20 Loop Rerolling: Compacting Iterations
Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;
Unrolled Loop Identification
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3
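The identification step ("find consecutively repeating instruction sequences") can be sketched as a period search over opcodes. This is a simplified illustration, not the published rerolling algorithm: it assumes the candidate region has already been isolated (for example, the Ld/Add/St body above without the surrounding Add and Mov), and it compares opcodes only. A real reroller would also have to confirm that operands differ between copies by a constant stride, such as b(0), b(1). The function name `find_unrolled_period` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the smallest period p such that ops[0..n) consists of n/p >= 2
 * back-to-back copies of its first p opcodes, or 0 if no such period exists.
 */
static size_t find_unrolled_period(const char *const ops[], size_t n)
{
    assert(ops != NULL);
    for (size_t p = 1; p * 2 <= n; p++) {
        if (n % p != 0)
            continue;                    /* copies must tile the region exactly */
        size_t i = p;
        while (i < n && strcmp(ops[i], ops[i - p]) == 0)
            i++;                         /* each opcode must match one period back */
        if (i == n)
            return p;                    /* every later opcode repeats the first p */
    }
    return 0;
}
```

For the six-instruction body above (Ld, Add, St, Ld, Add, St) the period is 3, so the region can be compacted into one three-instruction iteration executed twice.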
21 Method: Strength Promotion
- Problem: Compiler's strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance
[Figure: FIR filter; strength-reduced multiplication; strength-reduced FIR filter]
22 Strength Promotion
- Solution: Promote strength-reduced code to muls
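A small worked example may make the idea concrete. The sketch below is illustrative only: the constant 10 and all function names are assumptions, not taken from the slides. It shows a shift-and-add sequence of the kind a compiler might emit for a constant multiply, the promoted multiply that synthesis could map to a hard-core multiplier, and a helper that recovers the constant from the shift amounts.

```c
#include <assert.h>
#include <stdint.h>

/* Strength-reduced form a compiler might emit for x * 10: (x << 3) + (x << 1). */
static uint32_t times10_reduced(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Promoted form: a single multiply. */
static uint32_t times10_promoted(uint32_t x)
{
    return x * 10u;
}

/*
 * Recover the multiplier from a sum of shifted copies of the same value:
 * x << s0 + x << s1 + ... equals x * (2^s0 + 2^s1 + ...).
 */
static uint32_t promoted_constant(const unsigned shifts[], unsigned n)
{
    uint32_t c = 0;
    for (unsigned i = 0; i < n; i++) {
        assert(shifts[i] < 32);          /* shift must be valid for 32-bit values */
        c += (uint32_t)1 << shifts[i];
    }
    return c;
}
```

Because unsigned arithmetic wraps modulo 2^32 in both forms, the reduced and promoted versions agree for every input, which is what makes the promotion safe in this unsigned case.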
23 New Decompilation Methods' Benefits
- Rerolling
  - Speedups from better use of smart buffers
  - Other potential benefits: faster synthesis, less area
- Strength promotion
  - Speedups from fewer cycles
  - Speedups from faster clock
- New methods to be developed
  - e.g., pointer DS to arrays
[Figure: Speedups from Loop Rerolling]
24 Decompilation is Effective Even with High Compiler-Optimization Levels
[Figure: Average Speedup of 10 Examples]
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
26 Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages
- Performed in-depth study with Freescale
  - H.264 video decoder
  - Highly-optimized proprietary code, not reference code
    - Huge difference
    - A benefit of SRC collaboration
- Research question: Is synthesis from binaries competitive on highly-optimized code?
  - Several-month study
[Figure: MPEG 2 vs. H.264. H.264: better quality, or smaller files, using more computation]
27 Optimized H.264
- Larger than most benchmarks
  - H.264: 16,000 lines
  - Previous work: 100 to several thousand lines
- Highly-optimized
  - H.264: Many man-hours of manual optimization
  - 10x faster than reference code used in previous works
- Different profiling results
  - Previous examples
    - 90% of time in several loops
  - H.264
    - 90% of time in 45 functions
    - Harder to speed up
28 C vs. Binary Synthesis on Opt. H.264
- Binary partitioning competitive with source partitioning
  - Speedups compared to ARM9 software
    - Binary: 2.48, C: 2.53
- Decompilation recovered nearly all high-level information needed for partitioning and synthesis
- Discovered another research problem: Why aren't speedups (from binary or C) closer to ideal (0-time per fct)?
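The "ideal (0-time per fct)" bound can be stated with Amdahl's law. The symbols below are introduced here for illustration and do not appear in the slides: f is the fraction of software run time spent in the functions moved to hardware, and s is the speedup of those functions. The value f = 0.9 is only an assumed example, borrowed from the profiling figure quoted for H.264; the slides do not state which fraction was actually partitioned.

```latex
S(f, s) = \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad
S_{\text{ideal}}(f) = \lim_{s \to \infty} S(f, s) = \frac{1}{1 - f}
% Assumed example: f = 0.9 gives S_ideal = 1 / (1 - 0.9) = 10
```

Under that assumption the ideal speedup would be 10, so measured speedups near 2.5 leave a large gap regardless of whether the input is C or a binary.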
29 Coding Guidelines
- Are there C-coding guidelines to improve partitioning speedups?
  - Orthogonal to C vs. binary question
  - Guidelines may help both
- Examined H.264 code further
  - Several phone conferences with Freescale liaisons, also several email exchanges and reports
[Figure annotations: Competitive, but both could be better. Coding guidelines get closer to ideal.]
30 Synthesis-Oriented Coding Guidelines
- Pass by value-return
  - Declare a local array and copy in all data needed by a function (makes lack of aliases explicit)
- Function specialization
  - Create function version having frequent parameter-values as constants
Original
void f(int width, int height) {
  . . . .
  for (i=0; i < width; i++)
    for (j=0; j < height; j++)
      . . . .
  . . .
}
Rewritten
void f_4_4() {
  . . . .
  for (i=0; i < 4; i++)
    for (j=0; j < 4; j++)
      . . . .
  . . .
}
Bounds are explicit so loops are now unrollable
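The pass by value-return guideline has no example on the slide, so here is a minimal sketch of what it could look like. The block size, the function name `add_offset_value_return`, and the operation performed are all hypothetical; only the pattern (copy in, compute on a local array, copy back) reflects the guideline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 16  /* assumed fixed block size; not taken from the slides */

/*
 * Value-return style: the loop reads and writes only a local array, so a
 * synthesis tool can see that no other pointer aliases the data it touches.
 */
static void add_offset_value_return(int *data, int offset)
{
    int local[BLOCK];

    assert(data != NULL);
    memcpy(local, data, sizeof local);   /* copy in everything the loop needs */
    for (int i = 0; i < BLOCK; i++)
        local[i] += offset;              /* computation touches only local[] */
    memcpy(data, local, sizeof local);   /* copy results back to the caller */
}
```

The two copies cost extra cycles in software, which is consistent with a rewritten program being slightly slower on the microprocessor while being easier to synthesize.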
31 Synthesis-Oriented Coding Guidelines
- Algorithmic specialization
  - Use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking
  - Keep error checking out of loops to enable unrolling
- Lookup table avoidance
  - Use expressions rather than lookup tables
Original
int clip[512] = { . . . };
void f() {
  . . .
  for (i=0; i < 10; i++)
    val[i] = clip[val[i]];
  . . .
}
Rewritten
void f() {
  . . .
  for (i=0; i < 10; i++) {
    if (val[i] > 255) val[i] = 255;
    else if (val[i] < 0) val[i] = 0;
  }
  . . .
}
Comparisons can now be parallelized
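The hoisting guideline can be illustrated the same way. The sketch below is an assumed example, not code from the H.264 study: both functions and the loop bound `N` are hypothetical. The first version repeats a loop-invariant check on every iteration, which gives the loop an extra exit; the second performs the check once before the loop, leaving a fixed-bound body with no early exit.

```c
#include <assert.h>
#include <stddef.h>

#define N 10  /* assumed fixed trip count */

/* Original: the loop-invariant error check sits inside the loop. */
static int sum_checked_in_loop(const int *val, int *out)
{
    int sum = 0;
    assert(out != NULL);
    for (int i = 0; i < N; i++) {
        if (val == NULL)
            return -1;                   /* early exit from inside the loop */
        sum += val[i];
    }
    *out = sum;
    return 0;
}

/* Rewritten: the check is hoisted, so the loop has a single exit. */
static int sum_check_hoisted(const int *val, int *out)
{
    int sum = 0;
    assert(out != NULL);
    if (val == NULL)
        return -1;                       /* error handled once, before the loop */
    for (int i = 0; i < N; i++)
        sum += val[i];
    *out = sum;
    return 0;
}
```

Both versions return the same results; only the second leaves a loop body that is straightforward to unroll.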
32 Synthesis-Oriented Coding Guidelines
- Use explicit control flow
  - Replace function pointers with if statements and static function calls
Original
void (*funcArray[]) (char *data) = { func1, func2, . . . };
void f(char *data) {
  . . .
  funcPointer = funcArray[i];
  (*funcPointer) (data);
  . . .
}
Rewritten
void f(char *data) {
  . . .
  if (i == 0)
    func1(data);
  else if (i == 1)
    func2(data);
  . . .
}
33 Coding Guideline Results on H.264
- Simple coding guidelines made large improvement
- Rewritten software only 3% slower than original
- And, binary partitioning still competitive with C partitioning
  - Speedups: Binary 6.55, C 6.56
  - Small difference caused by switch statements that used indirect jumps
34 Studied More Benchmarks, Developed More Guidelines
- Studied guidelines further on standard benchmarks
  - Further synthesis speedups (again, independent of C vs. binary issue)
- Publications
  - Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale).
  - Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.
- More guidelines to be developed
36 Warp-Tailored FPGA Prototype
- Developed FPGA fabric tailored to fast/small-memory on-chip CAD
- Building chip prototype with Intel
  - Created synthesizable VHDL models, running through Intel shuttle tool flow
  - Plan to incorporate with ARM processor and other IP on shuttle seat
  - Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing; tapeout scheduled for 2006 Q3
37 Industrial Interactions
- Freescale
  - Numerous phone conferences, emails, and reports, on technical subjects
  - Co-authored paper (CODES/ISSS'05), another pending
  - Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin
- Intel
  - Three visits by PI, one by graduate student Roman Lysecky, to Intel Research in Santa Clara
  - PI presented at Intel System Design Symposium, Nov. 2005
  - PI served on Intel Research Silicon Prototyping Workshop panel, May 2005
  - Participating in Intel's Research Shuttle (chip prototype), bi-weekly phone conferences since summer 2005 involving PI, Intel engineers, and Roman Lysecky (now Prof. at UA)
- IBM
  - Embarking on studies of warp processing results on server applications
  - UCR group to receive Cell-based prototyping platform (w/ Prof. Walid Najjar)
- Several interactions with Xilinx also
38 Task Description: Coming Up
- Warp processing background
  - Two seed SRC CSR grants (2002-2005) showed feasibility
  - Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
- Task: Mature warp technology
  - Years 1/2 (in progress)
    - Automatic high-level construct recovery from binaries
    - In-depth case studies (with Freescale)
      - Also discovered unanticipated problem, developed solution
    - Warp-tailored FPGA prototype (with Intel)
  - Years 2/3: All three sub-tasks just now underway
    - Reduce memory bottleneck by using smart buffer
    - Investigate domain-specific-FPGA concepts (with Freescale)
    - Consider desktop/server domains (with IBM)
39 Recent Publications
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.