Title: Survey of Digital Signal Processors
1 Survey of Digital Signal Processors
- Michael Warner
- ECD VLSI Communication Systems
2 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems
4 Moore's Law Drives Processor Development
Doubling the number of transistors every 18-24 months at the
same price point drives significant product
opportunities, especially if you have little
regard for power.
But what if energy-delay had to be reduced by an
order of magnitude every generation?
5 Gene's Law Drives DSP Development
Gene's Law will face challenges holding the
line!
6 What's Driving Gene's Law?
7 DSP Design Constraints
DEVICE CAPABILITIES
8 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems
9 What Makes a DSP a DSP?
- Single-Cycle MAC
- Multiple Execution Units
- High Bandwidth (Flat) Memory Sub-Systems
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Sophisticated DMA
- Little to No Speculation
10 Single-Cycle MAC
- MACs Typically Determine DSP Performance and Pipeline Length (EX)
- Most DSPs Have 2-8 MAC Units
- MACs Typically Operate in Both a Scalar and Vector Mode
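The role of the MAC can be illustrated with the inner loop of an FIR filter, where each tap costs exactly one multiply-accumulate. The following is a minimal sketch in portable C, not code from the slides: the function name `fir_q15`, the Q15 fixed-point format, and the newest-first sample ordering are assumptions made for illustration. The 64-bit accumulator stands in for the guard bits a DSP accumulator typically provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one output sample of an ntaps-tap FIR filter in
 * Q15 fixed point. h[k] is the k-th coefficient and x[k] is the input
 * sample k steps in the past (x[0] is the newest sample). */
int16_t fir_q15(const int16_t *h, const int16_t *x, size_t ntaps)
{
    int64_t acc = 0; /* wide accumulator, models DSP guard bits */

    assert(h != NULL && x != NULL);

    for (size_t k = 0; k < ntaps; ++k) {
        /* One multiply-accumulate per tap: the operation a DSP
         * issues in a single cycle per MAC unit. */
        acc += (int32_t)h[k] * (int32_t)x[k];
    }

    /* Rescale the Q30 sum back to Q15. Right-shifting a negative value
     * is implementation-defined in C; an arithmetic shift is assumed. */
    acc >>= 15;

    /* Saturate instead of wrapping, as DSP arithmetic usually does. */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

With two MAC units, a compiler or hand-written kernel would typically process two taps of this loop per cycle, which is why MAC count tends to set peak filter throughput.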
11 Multiple Instruction Units
- VLIW Architectures Driving ILP
- Typical Instruction Units
  - M-Unit - MAC
  - S-Unit - Shift
  - L-Unit - ALU
  - D-Unit - Load/Store
- Industry Has Converged on an ILP of 8
[Figure: data-path diagram with register files A0 - A15 and B0 - B15, cross paths 1X and 2X, functional units L1, S1, M1, D1, D2, M2, S2, L2 with operand ports labeled S1, S2, SL, D and DL, and load-data buses DDATA_I1 and DDATA_I2]
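An ILP of 8 means up to eight instructions can issue together, but real code rarely fills every slot, so the hardware needs a way to mark which instructions run in parallel. The sketch below assumes the convention used by the C6x family as it is commonly documented, which the slides themselves do not describe: a fetch packet holds eight 32-bit instructions, and the least significant bit of each (the p-bit) says whether the next instruction executes in the same cycle. The function name and the rule that a group never crosses the fetch-packet boundary are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { FETCH_PACKET_WORDS = 8 }; /* eight 32-bit slots, matching an ILP of 8 */

/* Hypothetical helper: split one fetch packet into execute packets.
 * sizes[i] receives the number of instructions in execute packet i;
 * the return value is the number of execute packets found. */
size_t split_execute_packets(const uint32_t fp[FETCH_PACKET_WORDS],
                             size_t sizes[FETCH_PACKET_WORDS])
{
    size_t count = 0; /* execute packets emitted so far */
    size_t run = 0;   /* instructions in the packet being built */

    assert(fp != NULL && sizes != NULL);

    for (size_t i = 0; i < FETCH_PACKET_WORDS; ++i) {
        ++run;
        /* Assumed p-bit rule: bit 0 set means "the next instruction runs
         * in parallel with this one". The last slot always ends a group. */
        bool chained = (fp[i] & 1u) != 0 && i + 1 < FETCH_PACKET_WORDS;
        if (!chained) {
            sizes[count++] = run;
            run = 0;
        }
    }
    return count;
}
```

A fetch packet whose p-bits are 1,1,0,0,1,0,0,0 would split into execute packets of 3, 1, 2, 1 and 1 instructions, taking five cycles to issue instead of one.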
12 High Bandwidth Memory Sub-Systems
- Multiple Load-Store Units Required to Feed Data Path
- Tightly Coupled Memory is Typically Dual Ported
- Harvard Architecture is Heavily Banked
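[Figure: memory and data-path block diagram with PC, CNTL, ARs, muxes, internal and external memory, connections labeled P, D, C and E, and a Central Arithmetic Logic Unit containing MAC, ALU and shifter, with blocks labeled A and B]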
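Banking lets several load-store units reach memory in the same cycle as long as their addresses fall in different banks. The following sketch shows one way to reason about that; the bank count, bank width, word-interleaved mapping, and the rule that two accesses to the same word do not conflict are all assumptions chosen for illustration, not parameters of any device in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory geometry: 8 banks, each 4 bytes wide,
 * interleaved on word boundaries. */
enum { NUM_BANKS = 8, BANK_WIDTH_BYTES = 4 };

/* Bank selected by a byte address under word interleaving. */
unsigned bank_of(uint32_t byte_addr)
{
    return (unsigned)((byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS);
}

/* Two same-cycle accesses are modeled as conflicting when they need
 * different words from the same single-ported bank. */
bool bank_conflict(uint32_t addr_a, uint32_t addr_b)
{
    bool same_bank = bank_of(addr_a) == bank_of(addr_b);
    bool same_word = (addr_a / BANK_WIDTH_BYTES) == (addr_b / BANK_WIDTH_BYTES);
    return same_bank && !same_word;
}
```

Under this model, addresses 0 and 4 can be read together, while addresses 0 and 32 map to the same bank and would cost an extra cycle. That is the situation data placement in DSP kernels tries to avoid.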
13 Specialized Instruction Sets
- Base RISC ISA Plus CISC ISA Driven by End Application
  - MAC
  - SAD
  - LMS
  - FIRS
  - Viterbi
- Support For Both Scalar and Vector Instructions
- Support For 8, 16 and 32-Bit Instructions
- Instructions are Highly Orthogonal
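To make one of these application-driven instructions concrete, the sketch below spells out what a SAD (sum of absolute differences) instruction computes, written as plain scalar C. It is an illustrative reference model only; the function name `sad_u8` and the 8-bit operand type are assumptions, and a real DSP would evaluate several of these byte lanes per cycle in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of a SAD operation over n 8-bit values,
 * as used when comparing pixel blocks in video motion estimation. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;

    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Widen before subtracting so the difference can go negative. */
        int diff = (int)a[i] - (int)b[i];
        sum += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}
```

Written with a base RISC ISA, each element needs a subtract, an absolute value and an add; folding those into one vector instruction is the kind of CISC extension the slide refers to.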
14 Scalar (55x) vs VLIW (64x)
- Scalar DSPs Tend to be More CISC-Like
  - Hurts Compiler Performance
  - Improves Energy-Delay
  - Improves Code Density
  - Limits Top End Performance
- VLIW DSPs Tend to be More RISC-Like
  - RISC GP Regs Orthogonality Makes For a Good C Compiler
  - Assembler Code Is Challenging
  - RISC ISA Allows for Higher Frequencies
  - Load-Store Hurts Energy-Delay
15 TMS320C54x
16 TMS320C54x Protected Pipeline
[Figure: pipeline diagram by cycle, labeled P1 through X6]
- Prefetch: Calculate address of instruction
- Fetch: Collect instruction
- Decode: Interpret instruction
- Access: Collect address of operand
- Read: Collect operand
- Execute: Perform operation
Fully loaded pipeline
Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
17 TMS320C6xx
[Figure: C6xx CPU core block diagram: program fetch, instruction dispatch and instruction decode; data path 1 with the A register file and units L1, S1, M1, D1; data path 2 with the B register file and units D2, M2, S2, L2; control registers, control logic, test, emulation and interrupts. Unit legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit]
18 TMS320C6xx Exposed Pipeline
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
[Figure: pipeline diagram of Execute Packets 1 through 7, each passing through stages PG, PS, PW, PR, DP, DC, E1, E2, E3, E4, E5]
Note: Exposed Pipeline Adds Risk to Programming Model
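The risk comes from the fact that an exposed pipeline has no hardware interlocks: a result lands in its register a fixed number of cycles after issue, and any instruction that reads the register earlier silently gets the old value. The toy model below illustrates this. The type and function names are hypothetical, and the latency of four delay slots for a load is an assumption for illustration (the slides give no figure).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model of an exposed pipeline's register file. */
enum { NUM_REGS = 4, LOAD_DELAY_SLOTS = 4, MAX_PENDING = 8 };

typedef struct {
    int reg;         /* destination register */
    int32_t value;   /* data that will eventually be written */
    int cycles_left; /* cycles until the write becomes visible */
    bool active;
} pending_write;

typedef struct {
    int32_t regs[NUM_REGS];
    pending_write pending[MAX_PENDING];
} cpu;

void cpu_init(cpu *c) { memset(c, 0, sizeof *c); }

/* Issue a load in the current cycle; the register is NOT updated yet. */
void issue_load(cpu *c, int reg, int32_t value)
{
    assert(reg >= 0 && reg < NUM_REGS);
    for (int i = 0; i < MAX_PENDING; ++i) {
        if (!c->pending[i].active) {
            c->pending[i] = (pending_write){ reg, value, LOAD_DELAY_SLOTS + 1, true };
            return;
        }
    }
    assert(0 && "too many loads in flight for this toy model");
}

/* Advance one cycle and commit any write whose delay has elapsed. */
void step(cpu *c)
{
    for (int i = 0; i < MAX_PENDING; ++i) {
        if (c->pending[i].active && --c->pending[i].cycles_left == 0) {
            c->regs[c->pending[i].reg] = c->pending[i].value;
            c->pending[i].active = false;
        }
    }
}

/* No interlock: a read returns whatever the register holds right now. */
int32_t read_reg(const cpu *c, int reg) { return c->regs[reg]; }
```

In this model a read during any of the four cycles after `issue_load` still returns the stale value, and only the fifth cycle sees the loaded data. A protected pipeline would stall instead, so on an exposed pipeline the programmer or compiler has to schedule around the delay.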
19 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems
20 Micro-Architectural Challenges
- Accessing (Flat) On Chip Memory At Speed Within 2-3 cycles
- Feeding Multiple Functional Units From a Single Register File
- Running 600 MHz with a 7-9 Stage Pipeline
- Linking Multiple Functional Units with Result Forwarding
- Implementing CISC Data-path to Meet Area and Performance Goals
- Achieving ARM-Like Code Density
21 What Does and Doesn't Work?
- Do
  - Banked Memory
  - Dual Access Memory
  - Full Custom Register Files
  - Split/Multiple Register Files
  - Custom/Semi-Custom Data-paths
  - Variable Length Instructions
  - CISC ISA
  - Co-Processors
  - Multi-Core
- Don't
  - Multi-Level Caches
  - Super-Scalar
  - VLIW Packet Descriptors
  - Speculative Branching
  - Full Synthesis
  - Dynamic Logic
- Consider
  - Multi-Threading
22 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems
23 DSP Systems
24 VoIP Platform
- TNETV3010 Features
  - 6 C55x DSPs @ 300 MHz
  - Shared Instruction Memory
  - Broadcast DMA
  - 24 Mbits of On Chip SRAM
25 DaVinci Platform
26 OMAP Platform
- OMAP2420 Features
  - ARM 1136 @ 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
  - DSP @ 220 MHz
  - 2D/3D graphics accelerator
  - IVA supports still images to >4 Mpixels, 30 fps VGA video decode
  - Output to TV for gaming and video playback
  - Encryption hardware for DRM and security
[Figure: OMAP2420 block diagram: ARM11 with VFP, TMS320C55x DSP, Imaging Video Accelerator (IVA), 2D/3D Graphics Accelerator, L3 and L4 Interconnects, LCD I/F and Video Out, Camera I/F, Memory Controller, Internal SRAM, Peripherals, Security]