Title: The Microarchitecture of FPGA-Based Soft Processors
1. The Microarchitecture of FPGA-Based Soft Processors
- Peter Yiannacouras
- CARG - June 14, 2005
2. FPGA vs ASIC Flows
- Reduced cost for low-volume production
- Reduced time-to-market
- Programmability affords customization
- Designers use FPGAs!
(Figure: ASIC Flow and FPGA Flow, each starting from Circuit Design)
3. Processors and FPGAs
(Figure labels: Custom Logic, Processor)
4. Tuning Processors
Application, Design constraints
- 3
- 4 MHz
- 800 mW
- 2-stage pipeline
- 300
- 3.8 GHz
- 80 W
- 31-stage pipeline
5. Understanding Soft Processors
- Tuning requires understanding of the soft processor design space
- We implement many processors and study the design space
(Figure labels: Architecture Description, Synthesized Processor)
6. Don't we already understand architecture?
- Not completely
- We can evaluate area, power, performance
- Not accurately (rules of thumb)
- FPGA CAD tools are very accurate
- Not in the FPGA domain
- LUTs vs transistors
- Relative speed of RAM and multipliers
7. Goals
- Develop measurement methodology
- Populate the design space
- Compare against industrial soft processor(s)
8. Measurement Methodology
Performance
Power
Area
- Resource Usage
- Clock Frequency
- Power estimate
9. Area
Logic Elements (LEs: LUT + flip-flop)
Big RAM
Little RAM
Multipliers
Medium RAM
10. Performance
- Wall Clock Time = Cycles × Clock Period
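The relation above can be sketched in a few lines of Python. This is only an illustration: the function name and the two sets of numbers are hypothetical, not measurements from the talk. It shows why cycle count alone is not enough, since a design that needs more cycles can still finish sooner if its clock is fast enough.

```python
def wall_clock_time_s(cycles: int, clock_mhz: float) -> float:
    """Wall clock time in seconds = cycles * clock period."""
    clock_period_s = 1.0 / (clock_mhz * 1e6)
    return cycles * clock_period_s


# Hypothetical designs (illustrative numbers only).
# Design A: fewer cycles, slower clock. Design B: more cycles, faster clock.
time_a = wall_clock_time_s(cycles=1_000_000, clock_mhz=50.0)
time_b = wall_clock_time_s(cycles=1_500_000, clock_mhz=100.0)

print(f"A: {time_a * 1e3:.1f} ms, B: {time_b * 1e3:.1f} ms")
print("B is faster" if time_b < time_a else "A is faster")
```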
11. Power
- CAD tool can estimate power from assumed toggle ratio (derived experimentally)
- Energy/cycle = Power / Clock Frequency (MHz)
12. Metrics summary
- Require the following information
- Resource Usage (for area; from the CAD tool)
- Clock Frequency (for wall clock time; from the CAD tool)
- Power Estimate (for energy/cycle; from the CAD tool)
- Cycle Count (for wall clock time; from the RTL simulator)
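As a rough sketch of how these four measurements might be combined, the record below derives wall clock time and energy from them. The class name, field names, and sample values are assumptions made for illustration; the slides do not specify a data format. Energy per cycle is taken as power divided by clock frequency, so milliwatts over megahertz gives nanojoules per cycle.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One processor on one benchmark (hypothetical record layout)."""
    area_les: int      # resource usage, from the CAD tool
    clock_mhz: float   # clock frequency, from the CAD tool
    power_mw: float    # power estimate, from the CAD tool
    cycles: int        # cycle count, from the RTL simulator

    def wall_clock_s(self) -> float:
        # cycles * clock period
        return self.cycles / (self.clock_mhz * 1e6)

    def energy_per_cycle_nj(self) -> float:
        # mW / MHz = nJ per cycle
        return self.power_mw / self.clock_mhz

    def total_energy_j(self) -> float:
        return self.energy_per_cycle_nj() * self.cycles * 1e-9


# Illustrative values only, not results from the talk.
m = Measurement(area_les=1000, clock_mhz=80.0, power_mw=400.0, cycles=2_000_000)
print(m.wall_clock_s(), m.energy_per_cycle_nj(), m.total_energy_j())
```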
13. RTL-based Design Space Exploration
(Figure: flow from Circuit Design (RTL) and Benchmarks through the RTL Simulator and CAD Tool; outputs include 3. Area, 4. Clock Frequency, 5. Power)
14. Goals
- Develop measurement methodology
- Populate the design space
- Compare against industrial soft processor(s)
15. Microarchitectural Design Space Exploration
(Figure: flow from Circuit Design (RTL) and Benchmarks through the RTL Simulator and CAD Tool; outputs include 3. Area, 4. Clock Frequency, 5. Power)
16. SPREE (Soft Processor Rapid Exploration Environment)
(Figure: flow from the SPREE RTL Generator and Benchmarks through the RTL Simulator and CAD Tool; outputs include 3. Area, 4. Clock Frequency, 5. Power)
17. Goals
- Develop measurement methodology
- Populate the design space
- Rapidly
- With interesting designs
- Accurately (minimize overhead)
- Compare against industrial soft processor(s)
18. Related Work
- Parametrized Cores
- Narrow design space, laborious changes to control
- Architecture Description Languages (ADLs)
- Too robust, inaccurate (simulator-based or behavioural RTL)
- PEAS-III/ASIPMeister [Itoh2000]
- Non-FPGA-specific, ISA design focus
19. SPREE RTL Generator Overview
(Figure: ISA Description, Datapath Description and Component Library feed the SPREE RTL Generator, which produces Efficiently Synthesizable RTL)
20. Some current limitations
- No caches (use fast on-chip RAM)
- Simple in-order issue pipelines
- No dynamic branch prediction
- No OS or exceptions support
- No ISA changes!
- Would need compiler generation to support them
- Use subset of MIPS-I
21. Architecture Input
(Figure: Component Library with components such as Mul, ALU and Write Back)
22. Architecture Input
(Figure labels: Component Library, Datapath Description)
23. Architecture Input
(Figure: ISA Description and Datapath Description feed the SPREE RTL Generator; the datapath uses Component Library parts Ifetch, Reg File, Mul, ALU, Data Mem and Write Back)
24. Architecture Input: ISA Description
- Generic Operations (GENOPs)
- MIPS instructions made of GENOPs
(Figure: the GENOPs FETCH, RFREAD, ADD and RFWRITE, and MIPS ADD (add rd, rs, rt) drawn as a graph of one FETCH, two RFREADs, one ADD and one RFWRITE)
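One way to picture an instruction built from GENOPs is as a small dependency graph. The sketch below is an assumption-laden illustration, not SPREE's actual input format: the dictionary layout, the node names, and the edges between GENOPs are hypothetical, chosen to match the MIPS ADD example (one FETCH, two RFREADs, one ADD, one RFWRITE).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical encoding: node name -> (GENOP, predecessor node names).
# The edges are an assumed dataflow for "add rd, rs, rt".
MIPS_ADD = {
    "fetch":  ("FETCH",   []),
    "read_a": ("RFREAD",  ["fetch"]),
    "read_b": ("RFREAD",  ["fetch"]),
    "add":    ("ADD",     ["read_a", "read_b"]),
    "write":  ("RFWRITE", ["add"]),
}


def genops_used(instr):
    """Set of distinct GENOPs an instruction is made of."""
    return {op for op, _ in instr.values()}


def genop_order(instr):
    """GENOPs in an order that respects the dependencies."""
    deps = {name: preds for name, (_, preds) in instr.items()}
    return [instr[name][0] for name in TopologicalSorter(deps).static_order()]


print(sorted(genops_used(MIPS_ADD)))
print(genop_order(MIPS_ADD))
```

Describing every supported instruction this way would let a generator see which operations the datapath must provide and in what order they depend on one another.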
25. Complete Experimental Framework Using SPREE
(Figure: ISA Description (marked FIXED), Datapath Description and Component Library feed the SPREE RTL Generator; its output and the Benchmarks go through the RTL Simulator and CAD Tool; outputs include 3. Area, 4. Clock Frequency, 5. Power)
26. Goals
- Develop measurement methodology
- Populate the design space
- Compare against industrial soft processor(s)
Performance
SPREE
Power
Area
27. Altera's NiosII
- Second generation soft processor
- Has three variations
- NiosII/e: unpipelined, no hardware multiply
- NiosII/s: 5 stages, no branch prediction
- NiosII/f: 6 stages, dynamic branch prediction
- Caveats
- Supports exceptions, OS, and caches
- Very similar but tweaked ISA
28. Design Space vs NiosII Variations
29. Summary
- We span the design space
- Remain competitive
- Achieved 9% faster and 11% smaller than NiosII/s
- => Don't suffer from prohibitive overhead
30. Architectural Axes
- Hardware vs Software Multiplication
- Shifter implementation
- Pipeline
- Depth
- Organization
- Forwarding
31. Hardware vs Software Multiplication
- Hardware multiplication
- Increases area and power consumption
- Speeds up execution
- BUT
- Not all applications care about speed
- Not all applications use multiplication (significantly)
32. Cycle Count Speedup of Hardware Multiplication
33. Cost of Hardware Multiply
- 250 LEs (20%)
- 35% more energy/cycle
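A short worked example may help relate the energy cost to the cycle-count speedup. It assumes the slide's figure means 35% more energy per cycle, and that total energy is energy per cycle times cycle count; the function name is hypothetical. Under those assumptions, hardware multiply saves energy only for applications whose cycle count drops by more than a factor of 1.35.

```python
def energy_ratio(cycle_speedup: float, energy_per_cycle_increase: float = 0.35) -> float:
    """Total energy with hardware multiply, relative to software multiply.

    cycle_speedup is cycles_without / cycles_with. The 0.35 default assumes
    the slide's figure is a 35% increase in energy per cycle.
    """
    return (1.0 + energy_per_cycle_increase) / cycle_speedup


# Illustrative speedups, not benchmark results from the talk.
for speedup in (1.0, 1.35, 2.0):
    ratio = energy_ratio(speedup)
    verdict = "saves energy" if ratio < 1.0 else "does not save energy"
    print(f"cycle speedup {speedup:.2f}x -> energy ratio {ratio:.3f} ({verdict})")
```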
34. Shifter Implementations
- Shifters (multiplexers) are big in FPGAs
- Consider 3 implementations
- Serial shifter
- LUT-based barrel shifter
- Multiplier-based barrel shifter
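The three options can be sketched behaviourally for a 32-bit left shift. This is a software model under stated assumptions, not the RTL used in the study: the function names are hypothetical, the serial shifter is assumed to move one bit position per cycle, and the barrel shifter is modelled as five conditional power-of-two stages.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(x: int, shamt: int):
    """Shift one bit position per cycle; returns (result, cycles used)."""
    cycles = 0
    for _ in range(shamt):
        x = (x << 1) & MASK32
        cycles += 1
    return x, cycles


def barrel_shift_left(x: int, shamt: int) -> int:
    """Five mux stages; stage i shifts by 2**i when bit i of shamt is set."""
    for stage in range(5):
        if (shamt >> stage) & 1:
            x = (x << (1 << stage)) & MASK32
    return x


def multiplier_shift_left(x: int, shamt: int) -> int:
    """Left shift expressed as multiplication by 2**shamt."""
    return (x * (1 << shamt)) & MASK32


value = 0x12345678
for amount in (0, 1, 7, 31):
    serial_result, serial_cycles = serial_shift_left(value, amount)
    assert serial_result == barrel_shift_left(value, amount)
    assert serial_result == multiplier_shift_left(value, amount)
    print(amount, hex(serial_result), "serial cycles:", serial_cycles)
```

All three compute the same result; the difference is in what they cost. The serial version trades cycles for small hardware, while the two barrel versions finish in one pass using either multiplexers or an existing multiplier.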
35. Impact of Shifter Implementation
- Consistent across different pipe depths
36. Shifter Implementation Tradeoffs
- Averaged over all pipeline depths
- Smallest: Serial
- Fastest: LUT-based barrel
- Energy efficient: Serial
37. Pipelines - Depth
- Study different pipeline depths
- Over 3 shifters
- Arrows: possible forwarding lines (not used)
- All use predict not-taken branches
38. Pipelining: clock frequency
39. Impact of Pipelining
- Adds area, can increase speed (2 to 3 stage?)
40. FPGA Nuance: Synchronous RAMs, 2-stage Pipeline
(Figure: 2-stage pipeline with Regfile, ALU, Mul and Write Back)
41. 3-stage Pipeline
- Fewer stalls, increased frequency
- => Big speedup (1.7x)
(Figure: 3-stage pipeline with Regfile, ALU, Mul and Write Back)
42. 3, 4 and 5 stage pipelines
- Increased area, small change in performance
- => Deeper pipelines have potential for better speedups
43. The 7-stage Pipeline: Where Branch Delay Slots Break Down
(Figure: pipeline containing BEQ, OR, JR, ADD)
44. Problem: Separation of Branch and Branch Delay Slot
(Figure: pipeline containing BEQ, ADD, JR; stalls on RAW hazard)
45. Problem: Separation of Branch and Branch Delay Slot
(Figure: pipeline containing BEQ, ADD, JR and a NOP, with one instruction marked X)
- Must track and protect delay slots
46. Multiple Delay Slots
- Can't guard all delay slots
(Figure: pipeline containing BEQ, OR, JR, ADD)
- Must detect separation of branch from delay slot
- OR prevent multiple delay slots
- Stall branch if a delay slot exists in the pipe
- We did this one (30 LEs, -15% clock frequency)
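The chosen fix, stalling a branch while a delay slot is still in the pipe, can be expressed as a simple check. The sketch below is a toy software model that reads the rule literally; the class, field names, and pipeline representation are hypothetical, and the actual hardware condition in the generated processors may be narrower.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instr:
    name: str
    is_branch: bool = False
    is_delay_slot: bool = False


def must_stall(incoming: Instr, in_flight: List[Optional[Instr]]) -> bool:
    """Stall an incoming branch while any delay-slot instruction is in flight.

    in_flight lists the instructions in later pipeline stages; None marks
    an empty stage (a bubble).
    """
    if not incoming.is_branch:
        return False
    return any(i is not None and i.is_delay_slot for i in in_flight)


# Sequence from the slides: BEQ, OR (BEQ's delay slot), JR, ADD (JR's delay slot).
beq = Instr("BEQ", is_branch=True)
or_ = Instr("OR", is_delay_slot=True)
jr = Instr("JR", is_branch=True)

print(must_stall(jr, [or_, beq]))   # JR arrives while OR is still in the pipe
print(must_stall(jr, [None, None])) # OR has left the pipe
```

With this rule at most one delay slot is ever in flight, so only one needs to be tracked and protected.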
47. Pipeline organization
- Where stages are placed is important
- Pipe stage placement can
- Result in all around win/loss
- Present a tradeoff
48. Forwarding
- SPREE supports stage to stage forwarding
49. Effect of Forwarding
50. An Aside: ISA Subsetting
- Applications don't generally use all instructions
51. Processor reduction
- Can strip away unused components/control
- Generator supports instruction disabling
- Automatically strips away unused components
- Create an Application Specific processor
- Do this for each benchmark
- FPGAs are a good platform for this!
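The stripping step can be illustrated with a small set computation. This is a sketch only: the instruction-to-component table is hypothetical (the component names loosely follow the datapath figures), and the real generator works on RTL and control logic rather than Python sets.

```python
# Hypothetical table: which datapath components each instruction needs.
COMPONENTS_NEEDED = {
    "add":  {"ifetch", "regfile", "alu", "writeback"},
    "mult": {"ifetch", "regfile", "mul", "writeback"},
    "sll":  {"ifetch", "regfile", "shifter", "writeback"},
    "lw":   {"ifetch", "regfile", "alu", "data_mem", "writeback"},
}


def subset_processor(used_instructions):
    """Return (components to keep, components that can be stripped)."""
    kept = set()
    for instr in used_instructions:
        kept |= COMPONENTS_NEEDED[instr]
    all_components = set().union(*COMPONENTS_NEEDED.values())
    return kept, all_components - kept


# A benchmark that never multiplies or shifts (illustrative).
kept, stripped = subset_processor({"add", "lw"})
print("keep:", sorted(kept))
print("strip:", sorted(stripped))
```

Running this per benchmark mirrors the "do this for each benchmark" step: each application gets a processor containing only the components its instructions touch.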
52. Area of a Subsetted Processor
53. Speed of a Subsetted Processor
54. Conclusion
- Understanding architectural trade-offs
- => Maximize efficiency
- Developed SPREE measurement methodology
- Performed preliminary architectural study
- Quantified cost of hardware multiplication
- Explored shift unit implementations
- Explored pipelines: depth, organization, forwarding