Warp Processors - PowerPoint PPT Presentation

Slides: 40
Provided by: FrankV156
Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

Title: Warp Processors


1
Warp Processors
  • Frank Vahid (Task Leader)
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Associate Director, Center for Embedded Computer
    Systems, UC Irvine
  • Task ID 1331.001, July 2005 to June 2008
  • Ph.D. students
  • Greg Stitt, Ph.D. expected June 2006
  • Ann Gordon-Ross, Ph.D. expected June 2006
  • David Sheldon, Ph.D. expected 2009
  • Ryan Mannion, Ph.D. expected 2009
  • Scott Sirowy, Ph.D. expected 2010
  • Industrial Liaisons
  • Brian W. Einloth, Motorola
  • Serge Rutman, Dave Clark, Intel
  • Jeff Welser, IBM

2
Task Description
  • Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed
    feasibility
  • Idea: Transparently move critical binary regions
    from microprocessor to FPGA → 10x perf./energy
    gains or more
  • Task: Mature warp technology
  • Years 1/2 (in progress)
  • Automatic high-level construct recovery from
    binaries
  • In-depth case studies (with Freescale)
  • Also discovered unanticipated problem, developed
    solution
  • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3
  • Reduce memory bottleneck by using smart buffer
  • Investigate domain-specific-FPGA concepts (with
    Freescale)
  • Consider desktop/server domains (with IBM)

3
Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into
instruction memory
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, On-chip CAD]
4
Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software
binary
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, On-chip CAD]
5
Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects
critical regions in binary
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, On-chip CAD; the instruction stream shown is a repeated sequence of beq and add instructions]
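
One way to picture the profiling step is as a small table of counters keyed by the target of each taken backward branch. The following C sketch is a software illustration only: the slides describe a hardware on-chip profiler, and the names used here (`Profiler`, `profiler_observe`, `profiler_hottest`, `TABLE_SIZE`) as well as the policy of dropping samples when the table is full are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16 /* hypothetical number of loop counters */

typedef struct {
    uint32_t target; /* address a backward branch jumps to (loop head) */
    uint32_t count;  /* number of times that branch was taken */
    int used;        /* nonzero once the slot holds a loop head */
} LoopEntry;

typedef struct {
    LoopEntry entries[TABLE_SIZE];
} Profiler;

/* Clear every counter slot. */
void profiler_init(Profiler *p) {
    assert(p != NULL);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        p->entries[i].target = 0;
        p->entries[i].count = 0;
        p->entries[i].used = 0;
    }
}

/* Record one taken branch at address pc that jumps to target.
   Only backward branches (target <= pc) count as loop iterations. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target) {
    assert(p != NULL);
    if (target > pc) return; /* forward branch: not a loop back-edge */
    LoopEntry *free_slot = NULL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        LoopEntry *e = &p->entries[i];
        if (e->used && e->target == target) {
            e->count++;
            return;
        }
        if (!e->used && free_slot == NULL) free_slot = e;
    }
    if (free_slot != NULL) {
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
    /* Table full: this sketch simply drops the sample. */
}

/* Return the most frequently taken loop head and store its count in
   *count_out. Returns 0 with a count of 0 if nothing was recorded. */
uint32_t profiler_hottest(const Profiler *p, uint32_t *count_out) {
    assert(p != NULL && count_out != NULL);
    uint32_t best_target = 0;
    uint32_t best_count = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        const LoopEntry *e = &p->entries[i];
        if (e->used && e->count > best_count) {
            best_target = e->target;
            best_count = e->count;
        }
    }
    *count_out = best_count;
    return best_target;
}
```

Calling `profiler_observe` once per taken branch and then `profiler_hottest` yields the loop head that would be handed to the on-chip CAD as the critical region.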
6
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, On-chip CAD]
7
Warp Processing Background: Basic Idea
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
8
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
9
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


10
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
warp by an order of magnitude or more
    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: block diagram of Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


11
Warp Processing Background: Trend Towards
Processor/FPGA Programmable Platforms
  • FPGAs with hard core processors
  • FPGAs with soft core processors
  • Computer boards with FPGAs

[Figures: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Xilinx Spartan (source: Xilinx); Cray XD1 (source: FPGA Journal, Apr. 2005)]
12
Warp Processing Background: Trend Towards
Processor/FPGA Programmable Platforms
  • Programming a key challenge
  • Solution 1: Compile high-level language to custom
    binaries
  • Solution 2: Use standard binaries, dynamically re-map
    (warp)
  • Cons
  • Less high-level information, less optimization
  • Pros
  • Available to all software developers, not just
    specialists
  • Data dependent optimization
  • Most importantly, standard binaries enable
    ecosystem among tools, architecture, and
    applications

[Figure: Xilinx Virtex II Pro (source: Xilinx)]
[Callout: Most significant concept presently absent in
FPGAs and other new programmable platforms]
13
Warp Processing Background: Basic Technology
  • Warp processing
  • On-chip profiler
  • Warp-tuned FPGA
  • On-chip CAD, including Just-in-Time FPGA
    compilation

JIT FPGA compilation
14
Warp Processing Background: Initial Results
[Figure: CAD tool stages labeled Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route; labels: Xilinx ISE, 9.1 s]
15
Warp Processing Background: Publications 2002-2005
  • On-chip profiler
  • Frequent Loop Detection Using Efficient
    Non-Intrusive On-Chip Hardware, A. Gordon-Ross
    and F. Vahid, ACM/IEEE Conf. on Compilers,
    Architecture and Synthesis for Embedded Systems
    (CASES), 2003
  • Extended version of above in special issue Best
    of CASES/MICRO of IEEE Trans. on Comp., Oct
    2005.
  • Warp-tuned FPGA
  • A Configurable Logic Architecture for Dynamic
    Hardware/Software Partitioning, R. Lysecky and F.
    Vahid, Design Automation and Test in Europe Conf.
    (DATE), Feb 2004.
  • On-chip CAD, including Just-in-Time FPGA
    compilation
  • A Study of the Scalability of On-Chip Routing for
    Just-in-Time FPGA Compilation. R. Lysecky, F.
    Vahid and S. Tan. IEEE Symp. on
    Field-Programmable Custom Computing Machines
    (FCCM), 2005.
  • A Study of the Speedups and Competitiveness of
    FPGA Soft Processor Cores using Dynamic
    Hardware/Software Partitioning. R. Lysecky and F.
    Vahid. Design Automation and Test in Europe
    (DATE), March 2005.
  • Dynamic FPGA Routing for Just-in-Time FPGA
    Compilation. R. Lysecky, F. Vahid, and S. Tan.
    Design Automation Conf. (DAC), June 2004.
  • A Codesigned On-Chip Logic Minimizer, R. Lysecky
    and F. Vahid, ISSS/CODES conf., Oct 2003.
  • Dynamic Hardware/Software Partitioning: A First
    Approach. G. Stitt, R. Lysecky and F. Vahid,
    Design Automation Conf. (DAC), 2003.
  • On-Chip Logic Minimization, R. Lysecky and F.
    Vahid, Design Automation Conf. (DAC), 2003.
  • The Energy Advantages of Microprocessor Platforms
    with On-Chip Configurable Logic, G. Stitt and F.
    Vahid, IEEE Design and Test of Computers,
    Nov./Dec. 2002.
  • Hardware/Software Partitioning of Software
    Binaries, G. Stitt and F. Vahid, IEEE/ACM
    International Conference on Computer Aided Design
    (ICCAD), Nov. 2002.
  • Related
  • A Self-Tuning Cache Architecture for Embedded
    Systems. C. Zhang, F. Vahid and R. Lysecky. ACM
    Transactions on Embedded Computing Systems
    (TECS), Vol. 3., Issue 2, May 2004.
  • Fast Configurable-Cache Tuning with a Unified
    Second-Level Cache. A. Gordon-Ross, F. Vahid, N.
    Dutt. Int. Symp. on Low-Power Electronics and
    Design (ISLPED), 2005.

16
Task Description
  • Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed
    feasibility
  • Idea: Transparently move critical binary regions
    from microprocessor to FPGA → 10x perf./energy
    gains or more
  • Task: Mature warp technology
  • Year 1 (in progress)
  • Automatic high-level construct recovery from
    binaries
  • In-depth case studies (with Freescale)
  • Also discovered unanticipated problem, developed
    solution
  • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3
  • Reduce memory bottleneck by using smart buffer
  • Investigate domain-specific-FPGA concepts (with
    Freescale)
  • Consider desktop/server domains (with IBM)

17
Automatic High-Level Construct Recovery from
Binaries
  • Challenge: Binary lacks high-level constructs
    (loops, arrays, ...)
  • Decompilation can help recover
  • Extensive previous work (e.g., Cifuentes 93, 94,
    99)

Original C Code
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
18
New Method: Loop Rerolling
  • Problem: Compiler unrolling of loops (to expose
    parallelism) causes synthesis problems
  • Huge input (slow), can't unroll to desired
    amount, can't use advanced loop methods (loop
    pipelining, fusion, splitting, ...)
  • Solution: New decompilation method: Loop
    Rerolling
  • Identify unrolled iterations, compact into one
    iteration

Loop Unrolling
Original loop:
    for (int i=0; i < 3; i++)
      accum += a[i];
Unrolled assembly:
    Ld reg2, 100(0)
    Add reg1, reg1, reg2
    Ld reg2, 100(1)
    Add reg1, reg1, reg2
    Ld reg2, 100(2)
    Add reg1, reg1, reg2
19
Loop Rerolling: Identify Unrolled Iterations
  • Find consecutively repeating instruction sequences

Original C Code
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i] = b[i] + 1;
    y = x;
20
Loop Rerolling: Compacting Iterations
Unrolled Loop Identification
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
Original C Code
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i] = b[i] + 1;
    y = x;
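
The identification step, finding consecutively repeating instruction sequences, can be sketched in C as a brute-force search over start positions and candidate iteration lengths. This is only an illustration under simplifying assumptions, not the method from the slides: it compares opcodes alone and ignores operands, whereas a real rerolling pass would also have to check that operands such as b(0), b(1) differ by a regular stride. The names `Opcode`, `Unrolled`, and `find_unrolled` are hypothetical.

```c
#include <stddef.h>

typedef enum { OP_ADD, OP_LD, OP_ST, OP_MOV } Opcode;

typedef struct {
    size_t start;  /* index of the first instruction of the first iteration */
    size_t length; /* instructions per iteration */
    size_t reps;   /* consecutive iterations found (>= 2), or 0 if none */
} Unrolled;

/* Find the run of back-to-back repeated opcode sequences that covers
   the most instructions. Ties keep the earliest, shortest candidate. */
Unrolled find_unrolled(const Opcode *ops, size_t n) {
    Unrolled best = {0, 0, 0};
    for (size_t start = 0; start < n; start++) {
        for (size_t len = 1; start + 2 * len <= n; len++) {
            size_t reps = 1;
            /* Count how many copies of ops[start .. start+len) follow. */
            while (start + (reps + 1) * len <= n) {
                size_t k = 0;
                while (k < len &&
                       ops[start + k] == ops[start + reps * len + k]) {
                    k++;
                }
                if (k < len) break; /* next window differs: run ends */
                reps++;
            }
            if (reps >= 2 && reps * len > best.reps * best.length) {
                best.start = start;
                best.length = len;
                best.reps = reps;
            }
        }
    }
    return best;
}
```

On the eight-instruction sequence above (Add, Ld, Add, St, Ld, Add, St, Mov), the search reports a three-instruction body starting at the first Ld and repeated twice, which is the pair of unrolled iterations that would then be compacted into one loop body.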
21
Method: Strength Promotion
  • Problem: Compiler's strength reduction (replacing
    multiplies by shifts and adds) prevents synthesis
    from using hard-core multipliers, sometimes
    hurting circuit performance

[Figure: FIR Filter, transformed by strength-reduced multiplication into Strength-Reduced FIR Filter]
22
Strength Promotion
  • Solution: Promote strength-reduced code to muls
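
As a small illustration of what promotion undoes, the C sketch below shows a multiply by 10 in the shift-and-add form a compiler might emit, the promoted form with an explicit multiply, and a helper that recovers the constant from the shift amounts. It assumes the simplest pattern, a sum of shifted copies of one operand; real strength-reduced code may also use subtractions, and the slides do not show the actual recognition algorithm. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Strength-reduced form: x * 10 written as x*8 + x*2. */
uint32_t times10_reduced(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* Promoted form: one multiply that synthesis can map to a multiplier. */
uint32_t times10_promoted(uint32_t x) {
    return x * 10u;
}

/* Given the shift amounts of the terms (x << s0) + (x << s1) + ...,
   recover the constant c such that the sum equals x * c. */
uint32_t promoted_constant(const unsigned *shifts, size_t n) {
    uint32_t c = 0;
    for (size_t i = 0; i < n; i++) {
        assert(shifts[i] < 32); /* shift must fit a 32-bit operand */
        c += (uint32_t)1 << shifts[i];
    }
    return c;
}
```

For shift amounts 3 and 1 the recovered constant is 10, and both forms agree for every 32-bit input because unsigned arithmetic wraps identically in each.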

23
New Decompilation Methods: Benefits
  • Rerolling
  • Speedups from better use of smart buffers
  • Other potential benefits: faster synthesis, less
    area
  • Strength promotion
  • Speedups from fewer cycles
  • Speedups from faster clock
  • New methods to be developed
  • e.g., pointer DS to arrays

[Figure: Speedups from Loop Rerolling]

24
Decompilation is Effective Even with High
Compiler-Optimization Levels
[Figure: Average Speedup of 10 Examples]
Publication: New Decompilation Techniques for
Binary-level Co-processor Generation. G. Stitt,
F. Vahid. IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), Nov. 2005.
25
Task Description
  • Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed
    feasibility
  • Idea: Transparently move critical binary regions
    from microprocessor to FPGA → 10x perf./energy
    gains or more
  • Task: Mature warp technology
  • Year 1 (in progress)
  • Automatic high-level construct recovery from
    binaries
  • In-depth case studies (with Freescale)
  • Also discovered unanticipated problem, developed
    solution
  • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3
  • Reduce memory bottleneck by using smart buffer
  • Investigate domain-specific-FPGA concepts (with
    Freescale)
  • Consider desktop/server domains (with IBM)

26
Research Problem: Make Synthesis from Binaries
Competitive with Synthesis from High-Level
Languages
  • Performed in-depth study with Freescale
  • H.264 video decoder
  • Highly-optimized proprietary code, not reference
    code
  • Huge difference
  • A benefit of SRC collaboration
  • Research question: Is synthesis from binaries
    competitive on highly-optimized code?
  • Several-month study

[Figure: MPEG 2 vs. H.264. H.264: Better quality, or smaller files, using
more computation]
27
Optimized H.264
  • Larger than most benchmarks
  • H.264: 16,000 lines
  • Previous work: 100 to several thousand lines
  • Highly-optimized
  • H.264: Many man-hours of manual optimization
  • 10x faster than reference code used in previous
    works
  • Different profiling results
  • Previous examples
  • 90% time in several loops
  • H.264
  • 90% time in 45 functions
  • Harder to speed up

28
C vs. Binary Synthesis on Opt. H.264
  • Binary partitioning competitive with source
    partitioning
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
  • Decompilation recovered nearly all high-level
    information needed for partitioning and synthesis
  • Discovered another research problem: Why aren't
    speedups (from binary or C) closer to ideal
    (0-time per fct)?
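
The "ideal (0-time per fct)" reference point can be read as an Amdahl's-law bound: if the functions moved to hardware took no time at all, the overall speedup would be limited only by the software that remains. The C sketch below works that arithmetic out. The function names are hypothetical, and the fraction 0.9 used in the usage note is chosen only for illustration; the slides do not state what fraction of execution time was actually partitioned.

```c
#include <assert.h>

/* Overall speedup when a fraction f of the software run time is moved
   to hardware that executes that part hw_speedup times faster. */
double partition_speedup(double f, double hw_speedup) {
    assert(f >= 0.0 && f <= 1.0 && hw_speedup > 0.0);
    return 1.0 / ((1.0 - f) + f / hw_speedup);
}

/* Ideal bound: the moved functions are assumed to take zero time. */
double ideal_speedup(double f) {
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

With f = 0.9 the ideal bound is about 10x, while a finite hardware speedup for the moved functions gives less, for example roughly 1.8x when those functions run only twice as fast in hardware.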

29
Coding Guidelines
  • Are there C-coding guidelines to improve
    partitioning speedups?
  • Orthogonal to C vs. binary question
  • Guidelines may help both
  • Examined H.264 code further
  • Several phone conferences with Freescale liaisons,
    also several email exchanges and reports

[Figure callouts: Competitive, but both could be better; Coding guidelines get closer to ideal]
30
Synthesis-Oriented Coding Guidelines
  • Pass by value-return
  • Declare a local array and copy in all data needed
    by a function (makes lack of aliases explicit)
  • Function specialization
  • Create function version having frequent
    parameter-values as constants

Original
    void f(int width, int height) {
      . . . .
      for (i=0; i < width; i++)
        for (j=0; j < height; j++)
          . . . . . .
    }
Rewritten
    void f_4_4() {
      . . . .
      for (i=0; i < 4; i++)
        for (j=0; j < 4; j++)
          . . . . . .
    }
Bounds are explicit so loops are now unrollable
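
The pass by value-return guideline on this slide can be sketched the same way. In the C example below, the first function works directly through caller-supplied pointers that may alias, while the second copies its input into a local array, computes into another local array, and copies the result out, which makes the absence of aliasing inside the loop explicit. The function names, the block size of 16, and the scaling operation are all assumptions made for illustration, not code from the H.264 study.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 16 /* hypothetical block size */

/* Original style: reads and writes through pointers that may alias. */
void scale_in_place(int *dst, const int *src, int k) {
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < BLOCK; i++) {
        dst[i] = src[i] * k;
    }
}

/* Pass by value-return: copy in, compute on local arrays, copy out. */
void scale_value_return(int *dst, const int *src, int k) {
    assert(dst != NULL && src != NULL);
    int in[BLOCK];
    int out[BLOCK];
    memcpy(in, src, sizeof in);   /* copy in all data the loop needs */
    for (int i = 0; i < BLOCK; i++) {
        out[i] = in[i] * k;       /* in and out cannot overlap */
    }
    memcpy(dst, out, sizeof out); /* copy the result back out */
}
```

Both versions produce the same result for non-overlapping buffers; the second one simply states the data movement in a form that a synthesis tool can see without alias analysis.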
31
Synthesis-Oriented Coding Guidelines
  • Algorithmic specialization
  • Use parallelizable hardware algorithms when
    possible
  • Hoisting and sinking of error checking
  • Keep error checking out of loops to enable
    unrolling
  • Lookup table avoidance
  • Use expressions rather than lookup tables

Original
    int clip[512] = { . . . };
    void f() {
      . . .
      for (i=0; i < 10; i++)
        val[i] = clip[val[i]];
      . . .
    }
Rewritten
    void f() {
      . . .
      for (i=0; i < 10; i++)
        if (val[i] > 255) val[i] = 255;
        else if (val[i] < 0) val[i] = 0;
      . . .
    }
Comparisons can now be parallelized
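
The guideline on hoisting and sinking of error checking has no example on the slide, so here is a minimal C sketch of the sinking case. In the first function an error check inside the loop creates an early exit in every iteration; in the second the loop only records whether an error occurred and the check is sunk below the loop, leaving a loop body with a fixed trip count. The function names and the rule that negative values are errors are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Original style: the error check can exit the loop at any iteration. */
int sum_check_inside(const int *vals, size_t n, long *out) {
    assert(vals != NULL && out != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (vals[i] < 0) return -1; /* early exit blocks unrolling */
        sum += vals[i];
    }
    *out = sum;
    return 0;
}

/* Rewritten: the loop records the error and the check is sunk below it. */
int sum_check_sunk(const int *vals, size_t n, long *out) {
    assert(vals != NULL && out != NULL);
    long sum = 0;
    int bad = 0;
    for (size_t i = 0; i < n; i++) {
        bad |= (vals[i] < 0); /* no control flow leaves the loop */
        sum += vals[i];
    }
    if (bad) return -1;       /* single check after the loop */
    *out = sum;
    return 0;
}
```

Both functions return the same status and write the same sum on success; the rewritten one does extra work on erroneous input in exchange for a loop that is straightforward to unroll.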
32
Synthesis-Oriented Coding Guidelines
  • Use explicit control flow
  • Replace function pointers with if statements and
    static function calls

Original
    void (*funcArray[]) (char *data) = { func1, func2, . . . };
    void f(char *data) {
      . . .
      funcPointer = funcArray[i];
      (*funcPointer) (data);
      . . .
    }
Rewritten
    void f(char *data) {
      . . .
      if (i == 0)
        func1(data);
      else if (i == 1)
        func2(data);
      . . .
    }
33
Coding Guideline Results on H.264
  • Simple coding guidelines made large improvement
  • Rewritten software only 3% slower than original
  • And, binary partitioning still competitive with C
    partitioning
  • Speedups: Binary 6.55, C 6.56
  • Small difference caused by switch statements that
    used indirect jumps

34
Studied More Benchmarks, Developed More Guidelines
  • Studied guidelines further on standard benchmarks
  • Further synthesis speedups (again, independent of
    C vs. binary issue)
  • Publications
  • Hardware/Software Partitioning of Software
    Binaries: A Case Study of H.264 Decode. G. Stitt,
    F. Vahid, G. McGregor, B. Einloth. Int. Conf. on
    Hardware/Software Codesign and System Synthesis
    (CODES/ISSS), 2005 (joint publication with
    Freescale)
  • Submitted: A Code Refinement Methodology for
    Performance-Improved Synthesis from C. G. Stitt,
    F. Vahid, W. Najjar, 2006.
  • More guidelines to be developed

35
Task Description
  • Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed
    feasibility
  • Idea: Transparently move critical binary regions
    from microprocessor to FPGA → 10x perf./energy
    gains or more
  • Task: Mature warp technology
  • Year 1 (in progress)
  • Automatic high-level construct recovery from
    binaries
  • In-depth case studies (with Freescale)
  • Also discovered unanticipated problem, developed
    solution
  • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3
  • Reduce memory bottleneck by using smart buffer
  • Investigate domain-specific-FPGA concepts (with
    Freescale)
  • Consider desktop/server domains (with IBM)

36
Warp-Tailored FPGA Prototype
  • Developed FPGA fabric tailored to
    fast/small-memory on-chip CAD
  • Building chip prototype with Intel
  • Created synthesizable VHDL models, running
    through Intel shuttle tool flow
  • Plan to incorporate with ARM processor and other
    IP on shuttle seat
  • Bi-weekly phone meetings with Intel engineers
    since summer 2005, ongoing, scheduled tapeout
    2006 Q3

37
Industrial Interactions
  • Freescale
  • Numerous phone conferences, emails, and reports,
    on technical subjects
  • Co-authored paper (CODES/ISSS'05), another
    pending
  • Summer internship: Scott Sirowy (new UCR
    graduate student), summer 2005, Austin
  • Intel
  • Three visits by PI, one by graduate student Roman
    Lysecky, to Intel Research in Santa Clara
  • PI presented at Intel System Design Symposium,
    Nov. 2005
  • PI served on Intel Research Silicon Prototyping
    Workshop panel, May 2005
  • Participating in Intel's Research Shuttle (chip
    prototype), bi-weekly phone conferences since
    summer 2005 involving PI, Intel engineers, and
    Roman Lysecky (now Prof. at UA)
  • IBM
  • Embarking on studies of warp processing results
    on server applications
  • UCR group to receive Cell-based prototyping
    platform (w/ Prof. Walid Najjar)
  • Several interactions with Xilinx also

38
Task Description: Coming Up
  • Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed
    feasibility
  • Idea: Transparently move critical binary regions
    from microprocessor to FPGA → 10x perf./energy
    gains or more
  • Task: Mature warp technology
  • Years 1/2 (in progress)
  • Automatic high-level construct recovery from
    binaries
  • In-depth case studies (with Freescale)
  • Also discovered unanticipated problem, developed
    solution
  • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3: All three sub-tasks just now underway
  • Reduce memory bottleneck by using smart buffer
  • Investigate domain-specific-FPGA concepts (with
    Freescale)
  • Consider desktop/server domains (with IBM)

39
Recent Publications
  • New Decompilation Techniques for Binary-level
    Co-processor Generation. G. Stitt, F. Vahid.
    IEEE/ACM International Conference on
    Computer-Aided Design (ICCAD), 2005.
  • Fast Configurable-Cache Tuning with a Unified
    Second-Level Cache. A. Gordon-Ross, F. Vahid, N.
    Dutt. Int. Symp. on Low-Power Electronics and
    Design (ISLPED), 2005.
  • Hardware/Software Partitioning of Software
    Binaries: A Case Study of H.264 Decode. G. Stitt,
    F. Vahid, G. McGregor, B. Einloth. International
    Conference on Hardware/Software Codesign and
    System Synthesis (CODES/ISSS), 2005. (Co-authored
    paper with Freescale)
  • Frequent Loop Detection Using Efficient
    Non-Intrusive On-Chip Hardware. A. Gordon-Ross
    and F. Vahid. IEEE Trans. on Computers, Special
    Issue- Best of Embedded Systems,
    Microarchitecture, and Compilation Techniques in
    Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
  • A Study of the Scalability of On-Chip Routing for
    Just-in-Time FPGA Compilation. R. Lysecky, F.
    Vahid and S. Tan. IEEE Symposium on
    Field-Programmable Custom Computing Machines
    (FCCM), 2005.
  • A First Look at the Interplay of Code Reordering
    and Configurable Caches. A. Gordon-Ross, F.
    Vahid, N. Dutt. Great Lakes Symposium on VLSI
    (GLSVLSI), April 2005.
  • A Study of the Speedups and Competitiveness of
    FPGA Soft Processor Cores using Dynamic
    Hardware/Software Partitioning. R. Lysecky and F.
    Vahid. Design Automation and Test in Europe
    (DATE), March 2005.
  • A Decompilation Approach to Partitioning Software
    for Microprocessor/FPGA Platforms. G. Stitt and
    F. Vahid. Design Automation and Test in Europe
    (DATE), March 2005.