The Microarchitecture of FPGA-Based Soft Processors

1
The Microarchitecture of FPGA-Based Soft
Processors
  • Peter Yiannacouras
  • CARG - June 14, 2005

2
FPGA vs ASIC Flows
  • Reduced cost for low-volume
  • Reduced time-to-market
  • Programmability affords customization
  • Designers use FPGAs!

[Figure: ASIC flow and FPGA flow shown side by side, each starting from Circuit Design]
3
Processors and FPGAs
Custom Logic
Processor
4
Tuning Processors
Application, Design constraints
  • $3, 4 MHz, 800 mW, 2-stage pipeline
  • $300, 3.8 GHz, 80 W, 31-stage pipeline

5
Understanding Soft Processors
Architecture Description
  • Tuning requires understanding of soft processor
    design space
  • We implement many processors and study the design
    space

Synthesized Processor
  • Area
  • Performance
  • Power

6
Don't we already understand architecture?
  • Not completely
  • We can evaluate area, power, performance
  • Not accurately (rules of thumb)
  • FPGA CAD tools are very accurate
  • Not in the FPGA domain
  • LUTs vs transistors
  • Relative speed of RAM and multipliers

7
Goals
  • Develop measurement methodology
  • Populate the design space
  • Compare against industrial soft processor(s)

8
Measurement Methodology
  • Require a set of metrics

Performance
Power
Area
  • Resource Usage
  • Clock Frequency
  • Power estimate

9
Area
Logic Elements (LEs: LUT + flip-flop)
Big RAM
Little RAM
Multipliers
Medium RAM
10
Performance
  • Wall Clock Time = Cycles × Clock Period

11
Power
  • CAD tool can estimate power from assumed toggle
    ratio (derived experimentally)

[Figure: plot with axis Clock Frequency (MHz)]
12
Metrics summary
  • Require the following information
  • Resource Usage (for area; from the CAD tool)
  • Clock Frequency (for wall clock time; from the CAD tool)
  • Power Estimate (for energy/cycle; from the CAD tool)
  • Cycle Count (for wall clock time; from the RTL simulator)
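
A minimal sketch of how the four measurements could be combined into the reported metrics. The record layout, field names, units, and sample numbers are assumptions made for illustration; they are not taken from the presentation or from any CAD tool's report format.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    # Reported by the CAD tool (hypothetical field names and units).
    logic_elements: int
    clock_mhz: float
    energy_per_cycle_nj: float
    # Reported by the RTL simulator.
    cycle_count: int

    @property
    def wall_clock_s(self) -> float:
        # Wall clock time = cycles x clock period, with period = 1 / frequency.
        return self.cycle_count / (self.clock_mhz * 1e6)

    @property
    def energy_mj(self) -> float:
        # Total energy = energy per cycle x cycles (nJ converted to mJ).
        return self.energy_per_cycle_nj * self.cycle_count * 1e-6


# Made-up values, not measurements from the study.
m = Measurement(logic_elements=1200, clock_mhz=80.0,
                energy_per_cycle_nj=1.5, cycle_count=2_000_000)
print(f"area: {m.logic_elements} LEs, "
      f"time: {m.wall_clock_s * 1e3:.1f} ms, "
      f"energy: {m.energy_mj:.1f} mJ")
```

Keeping cycle count separate from clock frequency makes it visible when a design wins on one and loses on the other.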

13
RTL-based Design Space Exploration
[Figure: the circuit design (RTL) goes to the CAD tool, which reports area, clock frequency, and power, and to the RTL simulator, which runs the benchmarks and reports correctness and cycle count.]

14
Goals
  • Develop measurement methodology
  • Populate the design space
  • Compare against industrial soft processor(s)

15
Microarchitectural Design Space Exploration
[Figure: the circuit design (RTL) goes to the CAD tool, which reports area, clock frequency, and power, and to the RTL simulator, which runs the benchmarks and reports correctness and cycle count.]

16
SPREE (Soft Processor Rapid Exploration
Environment)
[Figure: the SPREE RTL Generator supplies the RTL to the CAD tool, which reports area, clock frequency, and power, and to the RTL simulator, which runs the benchmarks and reports correctness and cycle count.]

17
Goals
  • Develop measurement methodology
  • Populate the design space
  • Rapidly
  • With interesting designs
  • Accurately (minimize overhead)
  • Compare against industrial soft processor(s)

18
Related Work
  • Parametrized Cores
  • Narrow design space, laborious changes to control
  • Architecture Description Languages (ADLs)
  • Too robust, inaccurate (simulator based, or
    behavioural RTL)
  • PEAS-III/ASIPMeister [Itoh 2000]
  • Not FPGA-specific, ISA design focus

19
SPREE RTL Generator Overview
[Figure: the ISA description, the datapath description, and the component library are inputs to the SPREE RTL Generator, which outputs efficiently synthesizable RTL.]
20
Some current limitations
  • No caches (use fast on-chip RAM)
  • Simple in-order issue pipelines
  • No dynamic branch prediction
  • No OS or exceptions support
  • No ISA changes!
  • Need compiler generation to support
  • Use subset of MIPS-I

21
Architecture Input
[Figure: component library containing components such as Mul, ALU, and Write Back]
22
Architecture Input
Component Library
Datapath Description
23
Architecture Input
[Figure: the ISA description and a datapath description (Ifetch, Reg File, Mul, Data Mem, ALU, Write Back), together with the component library, are inputs to the SPREE RTL Generator.]
24
Architecture Input: ISA Description
  • Generic Operations (GENOPs)
  • MIPS instructions made of GENOPs

[Figure: the available GENOPs (FETCH, RFREAD, ADD, RFWRITE) and the MIPS ADD instruction (add rd, rs, rt) expressed as a graph of FETCH, two RFREADs, ADD, and RFWRITE.]
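
A minimal sketch of how an instruction could be described as a graph of GENOPs. The dictionary encoding, node names, and helper functions are hypothetical and are not SPREE's actual input format; only the GENOP names and the MIPS ADD example come from the slide.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical encoding: node name -> (GENOP, nodes it depends on).
MIPS_ADD = {
    "fetch":    ("FETCH",   []),
    "read_rs":  ("RFREAD",  ["fetch"]),
    "read_rt":  ("RFREAD",  ["fetch"]),
    "add":      ("ADD",     ["read_rs", "read_rt"]),
    "write_rd": ("RFWRITE", ["add"]),
}


def genop_order(instr):
    """Return the GENOPs in an order that respects the dependencies."""
    deps = {node: set(preds) for node, (_, preds) in instr.items()}
    return [instr[node][0] for node in TopologicalSorter(deps).static_order()]


def required_genops(instr):
    """Return the distinct GENOPs the datapath must provide."""
    return {genop for genop, _ in instr.values()}


print(genop_order(MIPS_ADD))
print(sorted(required_genops(MIPS_ADD)))
```

Describing instructions this way keeps the ISA description independent of any particular datapath: a generator only needs to check that the datapath provides every required GENOP in a compatible order.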
25
Complete Experimental Framework Using SPREE
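[Figure: the complete flow. The ISA description (marked FIXED), the datapath description, and the component library feed the SPREE RTL Generator; the generated RTL goes to the CAD tool, which reports area, clock frequency, and power, and to the RTL simulator, which runs the benchmarks and reports correctness and cycle count.]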

26
Goals
  • Develop measurement methodology
  • Populate the design space
  • Compare against industrial soft processor(s)

[Figure: SPREE shown against the three metrics: performance, power, and area]
27
Altera's Nios II
  • Second generation soft processor
  • Has three variations
  • Nios II/e: unpipelined, no hardware multiply
  • Nios II/s: 5 stages, no branch prediction
  • Nios II/f: 6 stages, dynamic branch prediction
  • Caveats
  • Supports exceptions, OS, and caches
  • Very similar but tweaked ISA

28
Design Space vs Nios II Variations
29
Summary
  • We span the design space
  • Remain competitive
  • Achieved 9% faster and 11% smaller than Nios II/s
  • => Don't suffer from prohibitive overhead

30
Architectural Axes
  • Hardware vs Software Multiplication
  • Shifter implementation
  • Pipeline
  • Depth
  • Organization
  • Forwarding

31
Hardware vs Software Multiplication
  • Hardware multiplication
  • Increases area and power consumption
  • Speeds up execution
  • BUT
  • Not all applications care about speed
  • Not all applications use multiplication
    (significantly)
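
To illustrate why software multiplication costs cycles, here is a minimal sketch of a shift-and-add multiply of the kind a software routine might use when no hardware multiplier is present. The presentation does not say which routine was used, so the algorithm and the function name `soft_mul32` are assumptions.

```python
MASK32 = 0xFFFFFFFF


def soft_mul32(a: int, b: int) -> tuple[int, int]:
    """Return (low 32 bits of a * b, number of loop iterations)."""
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next partial product
        b >>= 1                        # next multiplier bit
        iterations += 1
    return product, iterations


print(soft_mul32(6, 7))  # (42, 3)
```

Each iteration corresponds to several instructions on the processor, so a multiply-heavy benchmark pays many cycles per product, while a benchmark that rarely multiplies pays almost nothing and saves the multiplier's area.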

32
Cycle Count Speedup of Hardware Multiplication
33
Cost of Hardware Multiply
  • 250 LEs (20%)
  • 35% more energy/cycle

34
Shifter Implementations
  • Shifters (multiplexers) are big in FPGAs
  • Consider 3 implementations
  • Serial shifter
  • LUT-based barrel shifter
  • Multiplier-based barrel shifter
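
A behavioural sketch of the three shifter styles for a 32-bit left shift. It models what each style computes, not the hardware; the function names and the cycle counter are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(x: int, n: int) -> tuple[int, int]:
    """Shift one bit per cycle; return (result, cycles spent)."""
    cycles = 0
    for _ in range(n & 31):
        x = (x << 1) & MASK32
        cycles += 1
    return x, cycles


def barrel_shift_left(x: int, n: int) -> int:
    """Five multiplexer stages shifting by 1, 2, 4, 8, and 16 bits."""
    for stage in range(5):
        if (n >> stage) & 1:
            x = (x << (1 << stage)) & MASK32
    return x


def multiplier_shift_left(x: int, n: int) -> int:
    """Reuse a multiplier: shifting left by n is multiplying by 2**n."""
    return (x * (1 << (n & 31))) & MASK32


x, n = 0x0000F00F, 5
print(serial_shift_left(x, n), barrel_shift_left(x, n),
      multiplier_shift_left(x, n))
```

The serial version needs almost no logic but takes a data-dependent number of cycles; the barrel version finishes in one pass at the cost of the multiplexer stages; the multiplier version trades those multiplexers for a multiplier that may already exist in the datapath.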

35
Impact of Shifter Implementation
  • Consistent across different pipe depths

36
Shifter Implementation Tradeoffs
  • Averaged over all pipeline depths
  • Smallest: Serial
  • Fastest: LUT-based barrel
  • Most energy efficient: Serial

37
Pipelines - Depth
  • Study different pipeline depths
  • Over 3 shifters
  • Arrows: possible forwarding lines (not used)
  • All use predict not-taken branches

38
Pipelining and clock frequency
39
Impact of Pipelining
  • Adds area, can increase speed (2 to 3 stage?)

40
FPGA Nuance: Synchronous RAMs (2-stage Pipeline)
[Figure: 2-stage pipeline datapath showing Regfile, ALU, Mul, and Write Back]
41
3-stage Pipeline
  • Fewer stalls, increased frequency
  • => Big speedup (1.7x)

[Figure: 3-stage pipeline datapath showing Regfile, ALU, Mul, and Write Back]
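
A minimal sketch of how a wall-clock speedup splits into a cycle-count factor (fewer stalls) and a clock-frequency factor. The function name and the input numbers are invented for illustration; they are not the measurements behind the speedup reported on the slide.

```python
def speedup(cycles_old: int, mhz_old: float,
            cycles_new: int, mhz_new: float) -> tuple[float, float, float]:
    """Return (total speedup, cycle-count factor, clock factor)."""
    cycle_factor = cycles_old / cycles_new   # > 1 when stalls are removed
    clock_factor = mhz_new / mhz_old         # > 1 when the clock is faster
    return cycle_factor * clock_factor, cycle_factor, clock_factor


# Made-up example: 20% fewer cycles and a 20% faster clock.
total, from_cycles, from_clock = speedup(1_000_000, 50.0, 800_000, 60.0)
print(f"{total:.2f}x = {from_cycles:.2f} x {from_clock:.2f}")
```

Because the two factors multiply, a modest gain in each can combine into a large overall speedup.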
42
3, 4 and 5 stage pipelines
  • Increased area, small change in performance
  • => Deeper pipelines have potential for better
    speedups

43
The 7-stage Pipeline: Where Branch Delay Slots
Break Down
  • The ideal case


BEQ
OR
JR
ADD
44
Problem: Separation of Branch and Branch Delay
Slot
Stalls on RAW hazard

BEQ
ADD
JR
45
Problem: Separation of Branch and Branch Delay
Slot
X

BEQ
ADD
JR
NOP
  • Must track and protect delay slots

46
Multiple Delay Slots
  • Can't guard all delay slots


BEQ
OR
JR
ADD
  • Must detect separation of branch from delay slot
  • OR prevent multiple delay slots
  • Stall branch if a delay slot exists in the pipe
  • We did this one (30 LEs, -15% clock frequency)
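
The chosen policy, stalling a branch while a delay slot is still in the pipe, can be sketched as a simple predicate. This is a behavioural illustration under assumed names (`InFlight`, `stall_branch`); it is not the control logic SPREE generates.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    name: str
    is_delay_slot: bool = False


def stall_branch(next_is_branch: bool, pipe: list[InFlight]) -> bool:
    """Hold a branch at issue while any delay slot is still in flight."""
    return next_is_branch and any(instr.is_delay_slot for instr in pipe)


# BEQ and its delay slot (OR) are in flight; JR is next to issue.
pipe = [InFlight("BEQ"), InFlight("OR", is_delay_slot=True)]
print(stall_branch(True, pipe))                 # True: JR waits
print(stall_branch(False, pipe))                # False: non-branches proceed
print(stall_branch(True, [InFlight("ADD")]))    # False: no delay slot in flight
```

With this rule at most one delay slot is ever in the pipe, so only one needs to be tracked and protected.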

47
Pipeline organization
  • Where stages are placed is important
  • Pipe stage placement can
  • Result in all around win/loss
  • Present a tradeoff

48
Forwarding
  • SPREE supports stage to stage forwarding

49
Effect of Forwarding
50
An Aside: ISA Subsetting
  • Applications don't generally use all instructions

51
Processor reduction
  • Can strip away unused components/control
  • Generator supports instruction disabling
  • Automatically strips away unused components
  • Create an Application Specific processor
  • Do this for each benchmark
  • FPGAs are a good platform for this!
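
A minimal sketch of the subsetting idea: given the instructions a benchmark actually uses, keep only the components those instructions need. The instruction-to-component table and the function name are hypothetical; the real generator derives this from its ISA and datapath descriptions.

```python
# Hypothetical mapping from instruction to the components it needs.
COMPONENTS_FOR = {
    "add":  {"alu"},
    "sll":  {"shifter"},
    "mult": {"multiplier"},
    "lw":   {"alu", "data_mem"},
}


def subset_processor(used_instructions: set[str]):
    """Return (enabled instructions, kept components, removed components)."""
    enabled = {i for i in COMPONENTS_FOR if i in used_instructions}
    kept = set().union(*(COMPONENTS_FOR[i] for i in enabled))
    removed = set().union(*COMPONENTS_FOR.values()) - kept
    return enabled, kept, removed


# A benchmark that never shifts or multiplies.
enabled, kept, removed = subset_processor({"add", "lw"})
print(sorted(enabled), sorted(kept), sorted(removed))
```

On an FPGA the stripped design can simply be re-synthesized and loaded for each application, which is what makes per-benchmark processors practical there.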

52
Area of a Subsetted Processor
53
Speed of a Subsetted Processor
54
Conclusion
  • Understanding architectural trade-offs
  • => Maximize efficiency
  • Developed SPREE measurement methodology
  • Performed preliminary architectural study
  • Quantified cost of hardware multiplication
  • Explored shift unit implementations
  • Explored pipelines: depth, organization,
    forwarding