VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors - PowerPoint PPT Presentation

About This Presentation
Title:

VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors

Description:

VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors Peter Yiannacouras Univ. of Toronto J. Gregory Steffan Univ. of Toronto – PowerPoint PPT presentation

Number of Views:103
Avg rating:3.0/5.0
Slides: 27
Provided by: loo71
Category:

less

Transcript and Presenter's Notes

1
VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors
  • Peter Yiannacouras Univ. of Toronto
  • J. Gregory Steffan Univ. of Toronto
  • Jonathan Rose Univ. of Toronto

2
Soft Processors in FPGA Systems
  • Soft processor: programmed in C via a compiler → easier
  • Custom logic: designed in HDL via CAD → faster, smaller, less power
  • Soft processors are configurable → how can we make use of this?
3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b
[Figure: the vadd instruction expands into 16 independent element operations, b[0]+=a[0] through b[15]+=a[15], executed one after another by a single vector lane. Each vector instruction holds many units of independent operations.]
4
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b
[Figure: with 16 vector lanes, the 16 element operations b[0]+=a[0] through b[15]+=a[15] all execute in parallel, one per lane. Each vector instruction holds many units of independent operations.]
1) Portable  2) Flexible  3) Scalable
5
Soft Vector Processor Benefits
  • Portable
  • SW: Agnostic to the HW implementation
  • E.g. the number of lanes (see the sketch after this list)
  • HW: Can be implemented on any FPGA architecture
  • Flexible
  • Many parameters to tune (by the end-user, not the vendor)
  • E.g. number of lanes, width of lanes, etc.
  • Scalable
  • SW: Applies to any code with data-level parallelism
  • HW: Number of lanes can grow with the capacity of the device
  • Parallelism can scale with Moore's law
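A minimal C sketch of the strip-mining idea behind this portability: the routine asks how many elements fit in one vector operation and adapts, so the same software runs unchanged whether the core has 1 or 16 lanes. This is an illustration only, not VESPA's actual software interface (which is vector assembly, as on the primer slides); MVL here simply models the maximum vector length of the configured core.

    /* Sketch: strip-mined vector add, b[i] += a[i], for any n.
       Software written this way is agnostic to the number of lanes:
       each "vector" step just finishes faster when more lanes exist. */
    #include <stddef.h>

    #define MVL 16   /* illustrative value; set by the chosen configuration */

    void vadd_stripmined(int *b, const int *a, size_t n)
    {
        size_t i = 0;
        while (i < n) {
            size_t vl = (n - i < MVL) ? (n - i) : MVL;   /* set vl */
            for (size_t j = 0; j < vl; j++)              /* vload, vadd, vstore */
                b[i + j] += a[i + j];
            i += vl;
        }
    }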

6
Conventional FPGA Design Flow
  • Three options:
  • Manual hardware design
  • Acquire an RTL IP core
  • High-level synthesis
  • E.g. Altera C2H
  • Push-button
  • Code dependent

[Figure: conventional flow. Software routines run on the soft processor; if the soft processor is the bottleneck, the hot code is found and moved into custom accelerators alongside the soft processor, memory interface, and peripherals.]
7
Proposed Soft Vector Processor System Design Flow
[Figure: proposed flow. The user downloads a portable soft vector processor from the FPGA vendor (www.fpgavendor.com) and writes vectorized software routines (portable, easy to use). If the soft processor is the bottleneck, the fix is simply to increase the number of vector lanes (portable, flexible, scalable) rather than design a custom accelerator. The system comprises the soft processor with its vector lanes, a memory interface, and peripherals.]
8
Our Goals
  • Evaluate soft vector processing for real
  • Using a complete hardware design (in Verilog)
  • On real FPGA hardware (Stratix 1S80C6)
  • Running full benchmarks (EEMBC)
  • From off-chip memory (DDR, 133 MHz)
  • Quantify performance/area tradeoffs
  • Across different vector processor configurations
  • Explore application-specific customizations
  • Reduce generality of soft vector processors

9
Current Infrastructure
[Figure: current infrastructure.
Software side: EEMBC C benchmarks are compiled with GCC, manually vectorized assembly subroutines are assembled with GNU as, and both are linked (ld) into an ELF binary, which is verified by instruction set simulation.
Hardware side: SPREE generates the scalar µP in Verilog; a manually designed vector coprocessor (VPU) adds vector support, yielding the Vector Extended Soft Processor Architecture (VESPA). RTL simulation provides cycle counts and verification; CAD software reports area and frequency.]
10
VESPA Architecture Design
[Figure: VESPA architecture. A 3-stage scalar pipeline (decode/RF, ALU, writeback) fetches from the Icache and is coupled to the vector coprocessor; both share the Dcache. The coprocessor contains a 3-stage vector control pipeline (decode, VC/VS register files, logic, writeback) and a 6-stage vector pipeline (decode, replicate, hazard check, VR register file read, ALUs with multiply, right-shift, and saturation support, memory unit, VR writeback), with 32-bit datapaths replicated per lane and multiplexed into the shared Dcache. Supports integer and fixed-point operations, and predication.]
11
Experiment 1: Vector Lane Exploration
  • Vary the number of vector lanes implemented
  • Using the parameterized vector core
  • Measure speedup on 6 EEMBC benchmarks
  • Directly on a Stratix I 1S80C6 clocked at 50 MHz
  • A version designed for Stratix III runs at 135 MHz
  • Using a 32KB direct-mapped level 1 cache
  • DDR 133 MHz → 10-cycle miss penalty
  • Measure area cost
  • Equate the silicon area of all resources used
  • Report in units of Equivalent LEs

12
Performance Scaling Across Vector Lanes
[Chart: cycle speedup relative to 1 lane for each benchmark as the number of lanes grows; the average speedup reaches 6.3x at 16 lanes.]
13
Design Characteristics on Stratix III
Lanes                            1      2      4      8      16     32     64
Clock Frequency (MHz)            135    137    140    137    137    136    122
Logic Used (ALMs)                2763   3606   4741   7031   11495  20758  38983
Multipliers Used (18-bit DSPs)   8      12     20     36     76     172    388
Block RAMs Used (M9Ks)           45     46     46     46     46     78     144

Device: 3S200C2
14
Application-Specific Vector Processing
  • Customize to the application if:
  • It is the only application that will run, OR
  • The FPGA can be reconfigured between runs
  • Observations: not all applications
  • Operate on 32-bit data types
  • Use the entire vector instruction set
  • Eliminate unused hardware (reduce area)
  • Reduce cost (buy a smaller FPGA)
  • Re-invest area savings into more lanes
  • Speed up the clock (nets span shorter distances)

15
Opportunity for Customization
Benchmark   Largest Data Type Size   Percentage of Vector ISA Used
autcor      4 bytes                  9.6%
conven      1 byte                   5.9%
fbital      2 bytes                  14.1%
viterb      2 bytes                  13.3%
rgbcmyk     1 byte                   5.9%
rgbyiq      2 bytes                  8.1%

(4-byte data → 0% width reduction; 1-byte data → up to 75% width reduction; every benchmark uses less than 15% of the vector ISA.)
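For example, conven and rgbcmyk operate on 1-byte data, so each lane needs only 8 of its 32 datapath bits, a width reduction of up to 1 - 8/32 = 75%; autcor's 4-byte data leaves no room for width reduction.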
16
Customizing the Vector Processor
  • The parameterized core can very easily change:
  • L - Number of vector lanes
  • W - Bit-width of the vector lanes
  • M - Size of the memory crossbar
  • MVL - Maximum vector length
  • The instruction set is automatically subsetted (see the sketch after this list)
  • Each vector instruction is individually enabled/disabled
  • Control logic and datapath hardware are automatically removed
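A hypothetical C-style view of these knobs, purely to make the list above concrete; the actual core exposes them as parameters on the Verilog design, and the field names here are invented for illustration.

    /* Illustrative only: the configuration space of the parameterized core. */
    #include <stdbool.h>

    struct vespa_config {
        unsigned lanes;            /* L   - number of vector lanes            */
        unsigned lane_width_bits;  /* W   - bit-width of each vector lane     */
        unsigned mem_crossbar;     /* M   - size of the memory crossbar       */
        unsigned mvl;              /* MVL - maximum vector length             */
        bool     insn_enabled[64]; /* per-instruction subsetting: disabled
                                      instructions have their control logic and
                                      datapath hardware removed automatically  */
    };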

17
Experiment 2: Reducing Area by Reducing Vector Width
[Chart: vector coprocessor area, normalized to the full-width core, after reducing lane width to each benchmark's largest data type size (in bytes); labeled area reductions of 54%, 54%, 38%, 38%, and 38%.]
18
Experiment 3: Reducing Area by Subsetting the Instruction Set
[Chart: vector coprocessor area, normalized to the full-ISA core, after subsetting the instruction set per benchmark; labeled area reductions of 55% and 46%.]
19
Experiment 4: Combined Width Reduction and Instruction Set Subsetting
[Chart: combining both techniques reduces vector coprocessor area by 61% on average and by up to 70%.]
20
Re-Invest Area Savings into Lanes (Improved VESPA)
[Chart: speedups of the improved VESPA after re-investing the area savings into additional lanes; peak speedups of 11.5x and 9.3x are labeled.]
21
Summary
  • Evaluated soft vector processors
  • Real hardware, memory, and benchmarks
  • Observed significant performance scaling
  • Average of 6.3x with 16 lanes
  • Further scaling possible on newer devices
  • Explored measures to reduce area cost
  • Reducing vector width
  • Reducing supported instruction set
  • Combining width and instruction set reduction
  • 61% area reduction on average, up to 70%

Soft vector processors provide a portable, flexible, and scalable framework for exploiting data-level parallelism that is easier to use than designing custom FPGA hardware.
22
Backup Slides
23
Future Work
  • Address scalability bottlenecks
  • Memory system
  • Evaluate scaling past 16 lanes
  • Port to a platform with a newer FPGA
  • Compare against custom hardware
  • What do we pay for the simpler design?

24
Performance Impact of Cache Size
  • Measure the impact of cache size on the 16-lane VPU

[Chart: performance of the 16-lane VPU across level 1 cache sizes; two benchmarks are labeled as streaming.]
25
Combined Width Reduction and Instruction Set
Subsetting
26
Performance vs Scalar (C) Code
Benchmark   1 Lane   2 Lanes   4 Lanes   8 Lanes   16 Lanes
autcor      1.3      2.6       4.7       8.1       11.4
conven      6.9      12.5      21.4      32.9      43.6
fbital      1.2      2.2       3.7       5.5       6.9
viterb      1.0      1.8       2.8       3.6       3.9
rgbcmyk     1.0      1.8       2.4       3.2       3.8
rgbyiq      2.4      4.2       6.7       9.6       12.0
GEOMEAN     1.8      3.1       5.1       7.4       9.2
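The GEOMEAN row is the geometric mean of the six benchmark speedups in each column; a small C check for the 16-lane column (values taken from the table above):

    /* Reproduces the 16-lane GEOMEAN entry from the table. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double speedup[] = { 11.4, 43.6, 6.9, 3.9, 3.8, 12.0 };
        const int n = sizeof speedup / sizeof speedup[0];
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(speedup[i]);
        printf("geomean = %.1f\n", exp(log_sum / n));   /* ~9.2, matching the table */
        return 0;
    }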