The Microarchitecture of FPGA-Based Soft Processors


1
The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras
CARG - June 14, 2005

2
FPGA vs ASIC Flows

Reduced cost for low-volume
Reduced time-to-market
Programmability affords customization
Designers use FPGAs!

[Figure: ASIC flow vs FPGA flow, each starting from circuit design]
3
Processors and FPGAs
Custom Logic
Processor
4
Tuning Processors
Application, Design constraints

$3, 4 MHz, 800 mW, 2-stage pipeline
$300, 3.8 GHz, 80 W, 31-stage pipeline

5
Understanding Soft Processors
Architecture Description

Tuning requires understanding of the soft processor design space
We implement many processors and study the design space

Synthesized Processor

Area
Performance
Power

6
Don't we already understand architecture?

Not completely
We can evaluate area, power, performance
Not accurately (rules of thumb)
FPGA CAD tools are very accurate
Not in the FPGA domain
LUTs vs transistors
Relative speed of RAM and multipliers

7
Goals

Develop measurement methodology
Populate the design space
Compare against industrial soft processor(s)

8
Measurement Methodology

Require a set of metrics

Performance
Power
Area

Resource Usage
Clock Frequency
Power estimate

9
Area
Logic Elements (LEs: LUT + flip-flop)
Big RAM
Little RAM
Multipliers
Medium RAM
10
Performance

Wall Clock Time = Cycles × Clock Period
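
As a worked illustration of this relation, the sketch below computes wall clock time from a cycle count and a clock frequency. The function name and the sample numbers are assumptions chosen for illustration, not measurements from the presentation.

```python
def wall_clock_time_s(cycles: int, clock_mhz: float) -> float:
    """Wall clock time = cycles x clock period, where period = 1 / frequency."""
    clock_period_s = 1.0 / (clock_mhz * 1e6)
    return cycles * clock_period_s


# Two hypothetical designs: B needs more cycles but runs at a higher clock.
time_a = wall_clock_time_s(cycles=1_000_000, clock_mhz=50.0)
time_b = wall_clock_time_s(cycles=1_200_000, clock_mhz=80.0)
print(f"A: {time_a * 1e3:.1f} ms, B: {time_b * 1e3:.1f} ms")
```

In this made-up case design B finishes sooner despite its higher cycle count, which is why cycle count and clock frequency have to be measured together.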

11
Power

CAD tool can estimate power from an assumed toggle ratio (derived experimentally)

Clock Frequency (MHz)
12
Metrics summary

Require the following information
Resource Usage (area), from CAD Tool
Clock Frequency (wall clock time), from CAD Tool
Power Estimate (energy/cycle), from CAD Tool
Cycle Count (wall clock time), from RTL Simulator
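
The four measurements above can be gathered into one record per processor and benchmark. The sketch below is one possible way to do that; the class name, field names, units, and sample values are all assumptions made for illustration and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    # Reported by the CAD tool (hypothetical field names and units).
    area_les: int                # resource usage, in logic elements
    clock_mhz: float             # achieved clock frequency
    energy_per_cycle_nj: float   # power estimate expressed as energy per cycle
    # Reported by the RTL simulator.
    cycles: int                  # cycle count for one benchmark run

    @property
    def wall_clock_time_us(self) -> float:
        # A clock of f MHz completes f cycles per microsecond.
        return self.cycles / self.clock_mhz

    @property
    def energy_uj(self) -> float:
        # Total energy = cycles x energy per cycle (nJ converted to uJ).
        return self.cycles * self.energy_per_cycle_nj / 1000.0


m = Measurement(area_les=1000, clock_mhz=80.0,
                energy_per_cycle_nj=2.0, cycles=400_000)
print(m.wall_clock_time_us, m.energy_uj)
```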

13
RTL-based Design Space Exploration
Circuit Design (RTL)
Benchmarks
CAD Tool
RTL Simulator
3. Area 4. Clock Frequency 5. Power

Correctness
Cycle Count

14
Goals

Develop measurement methodology
Populate the design space
Compare against industrial soft processor(s)

15
Microarchitectural Design Space Exploration
Circuit Design (RTL)
Benchmarks
CAD Tool
RTL Simulator
3. Area 4. Clock Frequency 5. Power

Correctness
Cycle Count

16
SPREE (Soft Processor Rapid Exploration Environment)
SPREE RTL Generator
Benchmarks
CAD Tool
RTL Simulator
3. Area 4. Clock Frequency 5. Power

Correctness
Cycle Count

17
Goals

Develop measurement methodology
Populate the design space
Rapidly
With interesting designs
Accurately (minimize overhead)
Compare against industrial soft processor(s)

18
Related Work

Parametrized Cores
Narrow design space, laborious changes to control
Architecture Description Languages (ADLs)
Too robust, inaccurate (simulator-based or behavioural RTL)
PEAS-III/ASIPMeister [Itoh 2000]
Not FPGA-specific, ISA design focus

19
SPREE RTL Generator Overview
ISA Description
Datapath Description
SPREE RTL Generator
Component Library
Efficiently Synthesizable RTL
20
Some current limitations

No caches (use fast on-chip RAM)
Simple in-order issue pipelines
No dynamic branch prediction
No OS or exceptions support
No ISA changes!
Need compiler generation to support
Use subset of MIPS-I

21
Architecture Input
Mul
Write Back
ALU
Component Library
22
Architecture Input
Component Library
Datapath Description
23
Architecture Input
ISA Description
Datapath Description
Ifetch
Reg File
Ifetch
Reg File
SPREE RTL Generator
Mul
Data Mem
Mul

ALU
Write Back
ALU
Write Back
Component Library
24
Architecture Input: ISA Description

Generic Operations (GENOPs)
MIPS instructions made of GENOPs

GENOPs: FETCH, RFREAD, ADD, RFWRITE
MIPS ADD (add rd, rs, rt): FETCH -> RFREAD, RFREAD -> ADD -> RFWRITE
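
To make the idea of describing an instruction as GENOPs concrete, the sketch below encodes the MIPS ADD example as a small dataflow graph and checks it against a component library. This is not SPREE's actual input format: the graph encoding, the node labels such as "RFREAD_rs", the component names, and the helper functions are hypothetical.

```python
# GENOP dataflow for MIPS "add rd, rs, rt" as producer -> consumers edges.
# Node labels with a suffix (e.g. "RFREAD_rs") are hypothetical names that
# distinguish the two register-file reads.
ADD_GRAPH = {
    "FETCH": ["RFREAD_rs", "RFREAD_rt"],
    "RFREAD_rs": ["ADD"],
    "RFREAD_rt": ["ADD"],
    "ADD": ["RFWRITE"],
    "RFWRITE": [],
}

# Hypothetical component library: component name -> GENOPs it implements.
LIBRARY = {
    "ifetch": {"FETCH"},
    "regfile": {"RFREAD", "RFWRITE"},
    "alu": {"ADD"},
}


def required_genops(graph):
    """Return the set of GENOP kinds used by an instruction graph."""
    return {node.split("_")[0] for node in graph}


def missing_genops(graph, library):
    """Return GENOPs that no component in the library implements."""
    provided = set().union(*library.values()) if library else set()
    return required_genops(graph) - provided


print(sorted(required_genops(ADD_GRAPH)))
print(missing_genops(ADD_GRAPH, LIBRARY))
```

A check of this kind shows how a generator could reject a datapath that lacks a component for some GENOP an instruction needs.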
25
Complete Experimental Framework Using SPREE
FIXED
ISA Description
Datapath Description
SPREE RTL Generator
Component Library
Benchmarks
CAD Tool
RTL Simulator
3. Area 4. Clock Frequency 5. Power

Correctness
Cycle Count

26
Goals

Develop measurement methodology
Populate the design space
Compare against industrial soft processor(s)

Performance
SPREE
Power
Area
27
Altera's Nios II

Second generation soft processor
Has three variations
Nios II/e: unpipelined, no hardware multiply
Nios II/s: 5 stages, no branch prediction
Nios II/f: 6 stages, dynamic branch prediction
Caveats
Supports exceptions, OS, and caches
Very similar but tweaked ISA

28
Design Space vs Nios II Variations
29
Summary

We span the design space
Remain competitive
Achieved 9% faster and 11% smaller than Nios II/s
=> don't suffer from prohibitive overhead

30
Architectural Axes

Hardware vs Software Multiplication
Shifter implementation
Pipeline
Depth
Organization
Forwarding

31
Hardware vs Software Multiplication

Hardware multiplication
Increases area and power consumption
Speeds up execution
BUT
Not all applications care about speed
Not all applications use multiplication (significantly)
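
For context on what software multiplication costs, the sketch below shows a generic shift-and-add multiply of the kind a processor without a hardware multiplier might execute. It is not taken from the presentation; the function name and the iteration count it returns are assumptions, and the count is only a rough stand-in for execution time.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF


def soft_multiply(a: int, b: int) -> Tuple[int, int]:
    """Return (low 32 bits of a * b, number of loop iterations)."""
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next bit position
        b >>= 1
        iterations += 1
    return product, iterations


print(soft_multiply(6, 7))
```

Each iteration maps to several instructions (test, add, two shifts, branch), so a multiply-heavy benchmark pays this cost on every multiplication, while a benchmark with few multiplications hardly notices it.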

32
Cycle Count Speedup of Hardware Multiplication
33
Cost of Hardware Multiply

250 LEs (20%)
35% more energy/cycle

34
Shifter Implementations

Shifters (multiplexers) are big in FPGAs
Consider 3 implementations
Serial shifter
LUT-based barrel shifter
Multiplier-based barrel shifter
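
The three shifter styles can be compared with small behavioural models. The sketch below covers 32-bit logical left shifts only and is an illustration under assumptions: the function names are hypothetical, the serial model assumes one bit position per cycle, and none of it reflects the actual RTL used in the study.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(value, amount):
    """Shift one bit position per cycle; return (result, cycles used)."""
    cycles = 0
    for _ in range(amount & 31):
        value = (value << 1) & MASK32
        cycles += 1
    return value, cycles


def barrel_shift_left(value, amount):
    """Five multiplexer stages shifting by 1, 2, 4, 8 and 16 positions,
    each enabled by one bit of the shift amount."""
    for stage in range(5):
        if (amount >> stage) & 1:
            value = (value << (1 << stage)) & MASK32
    return value & MASK32


def multiplier_shift_left(value, amount):
    """Shift left by multiplying by 2**amount and keeping the low 32 bits."""
    return (value * (1 << (amount & 31))) & MASK32


print(serial_shift_left(1, 5), barrel_shift_left(1, 5),
      multiplier_shift_left(1, 5))
```

All three produce the same result; they differ in what they cost. The serial model spends cycles, the barrel model spends multiplexers (LUTs), and the multiplier model reuses a hardware multiplier.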

35
Impact of Shifter Implementation

Consistent across different pipe depths

36
Shifter Implementation Tradeoffs

Averaged over all pipeline depths
Smallest: Serial
Fastest: LUT-based barrel
Most energy efficient: Serial

37
Pipelines - Depth

Study different pipeline depths
Over 3 shifters
Arrows: possible forwarding lines (not used)
All use predict not-taken branches

38
Pipelining and Clock Frequency
39
Impact of Pipelining

Adds area, can increase speed (2 to 3 stage?)

40
FPGA Nuance: Synchronous RAMs, 2-stage Pipeline
Mul

Regfile
Write Back

ALU
41
3-stage Pipeline

Fewer stalls, increased frequency
=> Big speedup (1.7x)

Mul

Regfile
Write Back

ALU
42
3, 4 and 5 stage pipelines

Increased area, small change in performance
=> Deeper pipelines have potential for better speedups

43
The 7-stage Pipeline: Where Branch Delay Slots Break Down

The ideal case

BEQ
OR
JR
ADD
44
Problem: Separation of Branch and Branch Delay Slot
Stalls on RAW hazard

BEQ
ADD
JR
45
Problem: Separation of Branch and Branch Delay Slot
X

BEQ
ADD
JR
NOP