VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors - PowerPoint PPT Presentation

About This Presentation
Title:

VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors

Description:

VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors Peter Yiannacouras Univ. of Toronto J. Gregory Steffan Univ. of Toronto – PowerPoint PPT presentation

Number of Views:103
Avg rating:3.0/5.0
Slides: 27
Provided by: loo71
Category:

less

Transcript and Presenter's Notes

1
VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors
  • Peter Yiannacouras Univ. of Toronto
  • J. Gregory Steffan Univ. of Toronto
  • Jonathan Rose Univ. of Toronto

2
Soft Processors in FPGA Systems
  • Soft processor: programmed in C via a compiler → easier
  • Custom logic: designed in HDL via CAD → faster, smaller, less power
  • Soft processors are configurable → how can we make use of this?
3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b
[Figure: the vadd instruction expands into 16 independent element operations, b[0]+=a[0] through b[15]+=a[15], executed one after another by a single vector lane. Each vector instruction holds many units of independent operations.]
4
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b
[Figure: with 16 vector lanes, the 16 element operations b[0]+=a[0] through b[15]+=a[15] all execute in parallel, one per lane. Each vector instruction holds many units of independent operations.]
1) Portable  2) Flexible  3) Scalable
5
Soft Vector Processor Benefits
  • Portable
  • SW: Agnostic to the HW implementation
  • E.g. the number of lanes (see the sketch after this list)
  • HW: Can be implemented on any FPGA architecture
  • Flexible
  • Many parameters to tune (by the end-user, not the vendor)
  • E.g. number of lanes, width of lanes, etc.
  • Scalable
  • SW: Applies to any code with data-level parallelism
  • HW: Number of lanes can grow with the capacity of the device
  • Parallelism can scale with Moore's law
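A minimal C sketch of the strip-mining idea behind this portability: the routine asks how many elements fit in one vector operation and adapts, so the same software runs unchanged whether the core has 1 or 16 lanes. This is an illustration only, not VESPA's actual software interface (which is vector assembly, as on the primer slides); MVL here simply models the maximum vector length of the configured core.

    /* Sketch: strip-mined vector add, b[i] += a[i], for any n.
       Software written this way is agnostic to the number of lanes:
       each "vector" step just finishes faster when more lanes exist. */
    #include <stddef.h>

    #define MVL 16   /* illustrative value; set by the chosen configuration */

    void vadd_stripmined(int *b, const int *a, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = (n - i < MVL) ? (n - i) : MVL;   /* set vl */
            for (size_t j = 0; j < vl; j++)              /* vload, vadd, vstore */
                b[i + j] += a[i + j];
            i += vl;
        }
    }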

6
Conventional FPGA Design Flow
  • Three options:
  • Manual hardware design
  • Acquire an RTL IP core
  • High-level synthesis
  • E.g. Altera C2H
  • Push-button
  • Code dependent

[Figure: conventional flow. Software routines run on the soft processor; if the soft processor is the bottleneck, the hot code is found and moved into custom accelerators alongside the soft processor, memory interface, and peripherals.]
7
Proposed Soft Vector Processor System Design Flow
[Figure: proposed flow. The user downloads a portable soft vector processor from the FPGA vendor (www.fpgavendor.com) and writes vectorized software routines (portable, easy to use). If the soft processor is the bottleneck, the fix is simply to increase the number of vector lanes (portable, flexible, scalable) rather than design a custom accelerator. The system comprises the soft processor with its vector lanes, a memory interface, and peripherals.]
8
Our Goals
  • Evaluate soft vector processing for real
  • Using a complete hardware design (in Verilog)
  • On real FPGA hardware (Stratix 1S80C6)
  • Running full benchmarks (EEMBC)
  • From off-chip memory (DDR, 133 MHz)
  • Quantify performance/area tradeoffs
  • Across different vector processor configurations
  • Explore application-specific customizations
  • Reduce generality of soft vector processors

9
Current Infrastructure
[Figure: current infrastructure.
Software side: EEMBC C benchmarks are compiled with GCC, manually vectorized assembly subroutines are assembled with GNU as, and both are linked (ld) into an ELF binary, which is verified by instruction set simulation.
Hardware side: SPREE generates the scalar µP in Verilog; a manually designed vector coprocessor (VPU) adds vector support, yielding the Vector Extended Soft Processor Architecture (VESPA). RTL simulation provides cycle counts and verification; CAD software reports area and frequency.]
10
VESPA Architecture Design
[Figure: VESPA architecture. A 3-stage scalar pipeline (decode/RF, ALU, writeback) fetches from the Icache and is coupled to the vector coprocessor; both share the Dcache. The coprocessor contains a 3-stage vector control pipeline (decode, VC/VS register files, logic, writeback) and a 6-stage vector pipeline (decode, replicate, hazard check, VR register file read, ALUs with multiply, right-shift, and saturation support, memory unit, VR writeback), with 32-bit datapaths replicated per lane and multiplexed into the shared Dcache. Supports integer and fixed-point operations, and predication.]
11
Experiment 1: Vector Lane Exploration
  • Vary the number of vector lanes implemented
  • Using the parameterized vector core
  • Measure speedup on 6 EEMBC benchmarks
  • Directly on a Stratix I 1S80C6 clocked at 50 MHz
  • A version designed for Stratix III runs at 135 MHz
  • Using a 32KB direct-mapped level 1 cache
  • DDR 133 MHz → 10-cycle miss penalty
  • Measure area cost
  • Equate the silicon area of all resources used
  • Report in units of Equivalent LEs

12
Performance Scaling Across Vector Lanes
[Chart: cycle speedup relative to 1 lane for each benchmark as the number of lanes grows; the average speedup reaches 6.3x at 16 lanes.]
13
Design Characteristics on Stratix III
Lanes                            1      2      4      8      16     32     64
Clock Frequency (MHz)            135    137    140    137    137    136    122
Logic Used (ALMs)                2763   3606   4741   7031   11495  20758  38983
Multipliers Used (18-bit DSPs)   8      12     20     36     76     172    388
Block RAMs Used (M9Ks)           45     46     46     46     46     78     144

Device: 3S200C2
14
Application-Specific Vector Processing
  • Customize to the application if:
  • It is the only application that will run, OR
  • The FPGA can be reconfigured between runs
  • Observations: not all applications
  • Operate on 32-bit data types
  • Use the entire vector instruction set
  • Eliminate unused hardware (reduce area)
  • Reduce cost (buy a smaller FPGA)
  • Re-invest area savings into more lanes
  • Speed up the clock (nets span shorter distances)

15
Opportunity for Customization
Benchmark   Largest Data Type Size   Percentage of Vector ISA Used
autcor      4 bytes                  9.6%
conven      1 byte                   5.9%
fbital      2 bytes                  14.1%
viterb      2 bytes                  13.3%
rgbcmyk     1 byte                   5.9%
rgbyiq      2 bytes                  8.1%

(4-byte data → 0% width reduction; 1-byte data → up to 75% width reduction; every benchmark uses less than 15% of the vector ISA.)
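For example, conven and rgbcmyk operate on 1-byte data, so each lane needs only 8 of its 32 datapath bits, a width reduction of up to 1 - 8/32 = 75%; autcor's 4-byte data leaves no room for width reduction.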
16
Customizing the Vector Processor
  • The parameterized core can very easily change:
  • L - Number of vector lanes
  • W - Bit-width of the vector lanes
  • M - Size of the memory crossbar
  • MVL - Maximum vector length
  • The instruction set is automatically subsetted (see the sketch after this list)
  • Each vector instruction is individually enabled/disabled
  • Control logic and datapath hardware are automatically removed
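A hypothetical C-style view of these knobs, purely to make the list above concrete; the actual core exposes them as parameters on the Verilog design, and the field names here are invented for illustration.

    /* Illustrative only: the configuration space of the parameterized core. */
    #include <stdbool.h>

    struct vespa_config {
        unsigned lanes;            /* L   - number of vector lanes            */
        unsigned lane_width_bits;  /* W   - bit-width of each vector lane     */
        unsigned mem_crossbar;     /* M   - size of the memory crossbar       */
        unsigned mvl;              /* MVL - maximum vector length             */
        bool     insn_enabled[64]; /* per-instruction subsetting: disabled
                                      instructions have their control logic and
                                      datapath hardware removed automatically  */
    };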

17
Experiment 2: Reducing Area by Reducing Vector Width
[Chart: vector coprocessor area, normalized to the full-width core, after reducing lane width to each benchmark's largest data type size (in bytes); labeled area reductions of 54%, 54%, 38%, 38%, and 38%.]
18
Experiment 3: Reducing Area by Subsetting the Instruction Set
[Chart: vector coprocessor area, normalized to the full-ISA core, after subsetting the instruction set per benchmark; labeled area reductions of 55% and 46%.]
19
Experiment 4: Combined Width Reduction and Instruction Set Subsetting
[Chart: combining both techniques reduces vector coprocessor area by 61% on average and by up to 70%.]
20
Re-Invest Area Savings into Lanes (Improved VESPA)
[Chart: speedups of the improved VESPA after re-investing the area savings into additional lanes; peak speedups of 11.5x and 9.3x are labeled.]
21
Summary
  • Evaluated soft vector processors
  • Real hardware, memory, and benchmarks
  • Observed significant performance scaling
  • Average of 6.3x with 16 lanes
  • Further scaling possible on newer devices
  • Explored measures to reduce area cost
  • Reducing vector width
  • Reducing supported instruction set
  • Combining width and instruction set reduction
  • 61% area reduction on average, up to 70%

Soft vector processors provide a portable, flexible, and scalable framework for exploiting data-level parallelism that is easier to use than designing custom FPGA hardware.
22
Backup Slides
23
Future Work
  • Address scalability bottlenecks
  • Memory system
  • Evaluate scaling past 16 lanes
  • Port to a platform with a newer FPGA
  • Compare against custom hardware
  • What do we pay for the simpler design?

24
Performance Impact of Cache Size
  • Measure the impact of cache size on the 16-lane VPU

[Chart: performance of the 16-lane VPU across level 1 cache sizes; two benchmarks are labeled as streaming.]
25
Combined Width Reduction and Instruction Set
Subsetting
26
Performance vs Scalar (C) Code
Benchmark   1 Lane   2 Lanes   4 Lanes   8 Lanes   16 Lanes
autcor      1.3      2.6       4.7       8.1       11.4
conven      6.9      12.5      21.4      32.9      43.6
fbital      1.2      2.2       3.7       5.5       6.9
viterb      1.0      1.8       2.8       3.6       3.9
rgbcmyk     1.0      1.8       2.4       3.2       3.8
rgbyiq      2.4      4.2       6.7       9.6       12.0
GEOMEAN     1.8      3.1       5.1       7.4       9.2
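The GEOMEAN row is the geometric mean of the six benchmark speedups in each column; a small C check for the 16-lane column (values taken from the table above):

    /* Reproduces the 16-lane GEOMEAN entry from the table. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double speedup[] = { 11.4, 43.6, 6.9, 3.9, 3.8, 12.0 };
        const int n = sizeof speedup / sizeof speedup[0];
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(speedup[i]);
        printf("geomean = %.1f\n", exp(log_sum / n));   /* ~9.2, matching the table */
        return 0;
    }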