Title: ECE 734
1. ECE 734
Programmable DSP Architectures and Algorithm Considerations
TMS320 DSPs
2. TMS320 DSP slides courtesy of Texas Instruments.
3. Processor Architecture 101
- Memory Architectures
- Single Issue Processors
- Pipelining
- Multiple Issue Processors
  - Considerations
    - Resources
    - Scheduling
    - Completion
  - Superscalar
  - Very-Long Instruction Word (VLIW)
4. Algorithm Considerations
- DSP Implications
- Determinism and Predictability
- Resource Scheduling
  - Buses
  - Memory
  - ALUs/MACs
- Completion
5. TMS320 Programmable DSPs
- The TMS320 family includes 3 major divisions
  - TMS320C2000 (fixed-point)
    - Optimized for embedded motor control
    - Low cost, very high integration (like a microcontroller)
  - TMS320C5000 (fixed-point)
    - Optimized for low-power operation (0.25 mW/MIPS)
    - Commonly used in cell phones, MP3 players, cameras
  - TMS320C6000 (fixed-/floating-point)
    - Optimized for high performance
    - Commonly used in cellular base stations, digital radios, image processing, printers
6. C2x DSP Applications
7. C2x DSP Architecture
8. C2x DSP Integration
9. C54x Architecture (10km view)
[Block diagram labels: PC, XPC, Addr Gen, Data Write A/D Bus (E)]
Example instructions:
    MAC AR2, AR3, A
    ADD @x2, B
    ...
Separate program and data spaces = Harvard architecture
(C) 2002-2004 Michael G. Morrow - 9
10. C54x Memory, Buses and Pipeline
Internal and External Memory
- External: 1 access / cycle
  - up to 8M words program
- Internal: up to 4 accesses / cycle
Pipeline Phases: P F D A R X
P - generate program address
F - get opcode
D - decode instruction
A - generate read address
R - read operands
X - execute
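As a rough illustration of the six phases listed above, the following Python sketch builds a chart of which phase each instruction occupies in each cycle. It assumes an idealized pipeline in which one instruction enters per cycle and nothing stalls; the function name `pipeline_chart` is made up for this example and is not from TI documentation.

```python
PHASES = ["P", "F", "D", "A", "R", "X"]  # the six C54x pipeline phases


def pipeline_chart(n_instructions):
    """Return one row per instruction; row[c] is the phase occupied in cycle c.

    Assumption: ideal pipeline, one instruction issued per cycle, no stalls.
    A "." marks a cycle in which the instruction is not in the pipeline.
    """
    total_cycles = n_instructions + len(PHASES) - 1
    chart = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        for stage, name in enumerate(PHASES):
            row[i + stage] = name
        chart.append(row)
    return chart


if __name__ == "__main__":
    for i, row in enumerate(pipeline_chart(4)):
        print(f"I{i}: " + " ".join(row))
```

Reading the chart column by column shows that, once the pipeline is full, six different instructions are in flight at once, which is why memory and bus conflicts between phases matter.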
11. Pipeline Implications
[Diagram: memory blocks Program ROM, Data ROM, SARAM, DARAM, Extl Mem I/F]
There are no conflicts as long as you follow these rules:
- ROM/SARAM - 1 access per block per cycle
- DARAM - 2 accesses per block per cycle
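The two access rules can be checked mechanically. The sketch below is a minimal model, assuming each access is described only by the cycle in which it happens and the memory block it touches; the block names and the function `find_conflicts` are hypothetical.

```python
from collections import Counter

# Accesses allowed per block per cycle, taken from the rules above.
ACCESS_LIMIT = {"ROM": 1, "SARAM": 1, "DARAM": 2}


def find_conflicts(accesses, block_types):
    """Return the (cycle, block) pairs that exceed the per-cycle limit.

    accesses:    list of (cycle, block_name) tuples
    block_types: dict mapping block_name to "ROM", "SARAM" or "DARAM"
    """
    counts = Counter(accesses)
    return sorted(
        key for key, n in counts.items()
        if n > ACCESS_LIMIT[block_types[key[1]]]
    )
```

For example, two accesses to the same DARAM block in one cycle are accepted, while two accesses to the same SARAM block in one cycle are reported as a conflict.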
12. Parallel Instructions
Example: Z = X + Y and F = D + E
Parallel instruction pairs:
    LD || MACR    LD || MASR    ST || MPY    ST || MACR
    ST || MASR    ST || ADD     ST || SUB    ST || LD
Sequential code:
    LD AR5,16,A
    ADD AR5,16,A
    STH A,AR5
    LD AR4,16,B
    ADD AR4,16,B
    STH B,AR4
Parallel store and load:
    ST A,AR5 || LD AR4,B
- Parallel load/store instructions use the D Bus and E Bus in the same cycle.
- Parallel ops focus on the high accumulator.
- Stores in parallel ops are shifted by the ASM value.
13. Pipeline and Delayed Branches
[Pipeline diagrams: standard branch, 2 words / 4 cycles; delayed branch (BD new), 2 words / 2 cycles]
14. Using Delayed Instructions
Standard branch (6 words / 8 cycles):
    LD @x,A
    ADD @y,A
    MPY @z,B
    STL A,@r
    B next
Delayed branch (6 words / 6 cycles):
    LD @x,A
    ADD @y,A
    BD next
    MPY @z,B
    STL A,@r
- Delay slot is two words deep; cycles or lines of code are not relevant
- Delay operation may not be a branch of any kind (B, CALL, RET, RPT, etc.)
- Conditions set in delay slot of BCD/CCD/RCD will have NO effect on the instruction
- Do not load BRC in delay slot of RPTBD
- No PUSH/POP in CALLD or RETD delay slots
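The word and cycle totals quoted for the two code sequences can be reproduced with a small tally. This is only a sketch: it assumes every instruction other than the branch is one word and one cycle, and it takes the branch costs (B: 2 words / 4 cycles, BD: 2 words / 2 cycles) from the slides. The table `COST` and the function `words_and_cycles` are illustrative names.

```python
# (words, cycles) per mnemonic. Branch costs are from the slides; every other
# instruction is ASSUMED to be a single-word, single-cycle instruction.
COST = {"B": (2, 4), "BD": (2, 2)}
DEFAULT = (1, 1)


def words_and_cycles(mnemonics):
    """Sum the code size in words and the execution time in cycles."""
    words = sum(COST.get(m, DEFAULT)[0] for m in mnemonics)
    cycles = sum(COST.get(m, DEFAULT)[1] for m in mnemonics)
    return words, cycles


plain = ["LD", "ADD", "MPY", "STL", "B"]
# MPY and STL are moved after BD, where they fill the two-word delay slot.
delayed = ["LD", "ADD", "BD", "MPY", "STL"]
```

Under these assumptions the delayed version does the same work in the same code size, with the two cycles that a standard branch would waste now spent on the MPY and STL.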
15. Handling Accumulative Overflow
- F + F could be > 1, so how is this handled?
1. Use guard bits (allow at least 128 signed summations).
   Guard bits increase dynamic range from +/-1 to +/-128.
2. In a non-gain system, temporary overflow is permitted.
   The output is guaranteed to remain bounded by the input.
3. In a system with gain, the output is not guaranteed to remain bounded
   (i.e. the result is larger than 32 bits).
How do you handle a result larger than 32 bits?
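To see what guard bits buy, the sketch below accumulates the same samples in a 40-bit and in a 32-bit two's-complement accumulator. It assumes 8 guard bits on top of a 32-bit result and Q31-scaled samples; the helper names and the sample values are made up for this illustration.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate(samples, bits):
    """Add samples one at a time, wrapping to the accumulator width each step."""
    acc = 0
    for s in samples:
        acc = wrap_signed(acc + s, bits)
    return acc


Q31_ONE = 1 << 31
samples = [int(0.9 * Q31_ONE)] * 100   # one hundred values of about 0.9

acc40 = accumulate(samples, 40)  # 32 bits plus 8 guard bits (assumed width)
acc32 = accumulate(samples, 32)  # no guard bits
```

Here `acc40` holds the true sum, while `acc32` has wrapped around and no longer matches it. The guard bits only postpone the problem, which is why the final value still has to be brought back into 32 bits.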
16. Saturation
Two saturation methods exist for A/B:
- Manual: use the SAT instruction (saturates A or B)
- Auto: saturate on store (saturates stored value only)
Manual:
    SAT A
    STH A,AR1
-OR-
Auto:
    LD 0,DP
    ORM 1,@PMST      ; SST = 1
    STH A,AR0
PMST = Processor Mode Status Register
- SAT will set the overflow bit (OVA or OVB) if saturation occurs
- SST does not affect OVx or accumulator contents
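The arithmetic behind saturation is a clamp to the signed 32-bit range. The following sketch models that clamp and the upper 16 bits that a high-word store would write; it is a numerical illustration only, not a model of the exact C54x instruction behavior, and the function names are hypothetical.

```python
INT32_MAX = (1 << 31) - 1
INT32_MIN = -(1 << 31)


def saturate32(acc):
    """Clamp a wide accumulator value to the signed 32-bit range.

    Returns (value, overflowed); the flag plays the role of an
    OVA/OVB-style overflow indication in this sketch.
    """
    if acc > INT32_MAX:
        return INT32_MAX, True
    if acc < INT32_MIN:
        return INT32_MIN, True
    return acc, False


def high_word(value32):
    """Upper 16 bits of a 32-bit value, as an unsigned 16-bit pattern."""
    return (value32 >> 16) & 0xFFFF
```

A positive overflow therefore stores 0x7FFF in the high word and a negative overflow stores 0x8000, instead of whatever bits the wrapped value happens to contain.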
17. C6x Architecture
- High-performance VLIW architecture
- 32-bit RISC core, 32 GP registers
- 8 functional units
- Static scheduling
- Byte addressable memory
- Caches
  - Split L1
  - Unified L2 / internal SRAM
  - Determinism?
- Fixed- and floating-point versions
  - Floating-point is superset of fixed-point
18. 'C6200 Instruction Set (by unit)
19. 'C6700: Superset of Fixed-Point (by unit)
20. The C64x adds ...
- C62x: Dual 32-Bit Load/Store
- C64x: Dual 64-Bit Load/Store
- C67x: Dual 64-Bit Load / 32-Bit Store
21. Conditional Instructions
Execution is based on a zero/non-zero condition ("!" tests for zero).
The condition register is one of A0, A1, A2, B0, B1, B2.
Note: Only C64x allows A0 to be used as a condition.
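A conditional (predicated) instruction can be modeled as an operation that is skipped unless a register test passes. The sketch below assumes the usual TI assembly notation, where `[B0]` means "execute if B0 is non-zero" and `[!B0]` means "execute if B0 is zero"; the register file is a plain dictionary and `execute_conditional` is a made-up helper.

```python
import operator


def execute_conditional(regs, cond_reg, negate, dest, op, src1, src2):
    """Perform regs[dest] = op(regs[src1], regs[src2]) only if the test passes.

    negate=False models [cond]  (execute when the register is non-zero).
    negate=True  models [!cond] (execute when the register is zero).
    Returns True if the operation was performed.
    """
    nonzero = regs[cond_reg] != 0
    if nonzero != negate:
        regs[dest] = op(regs[src1], regs[src2])
        return True
    return False


regs = {"B0": 1, "A1": 2, "A2": 3, "A3": 0}
ran = execute_conditional(regs, "B0", False, "A3", operator.add, "A1", "A2")
```

Because the instruction is always issued and simply has no effect when the test fails, short if/else bodies can be scheduled without a branch and its delay slots.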
22. 'C6000 System Block Diagram
23. C6000 Internal Buses
- C62x: Dual 32-Bit Load/Store
- C64x: Dual 64-Bit Load/Store
- C67x: Dual 64-Bit Load / 32-Bit Store
24. 'C6000 System Block Diagram
25. 'C6000 Peripherals
26. 'C6000 Peripherals (EMIF)
EMIF
- Glueless access to async/sync memory
- 8/16/32-bit data or 32-bit program access
27. 'C6000 Peripherals (HPI/XB/PCI)
HPI / XB (Expansion Bus) / PCI
- 16 / 32-bit host-µP dedicated bus
- HPI/XB are great for PCI interfacing or boot-loading
- PCI is even better for PCI interfacing :-)
28. 'C6000 Peripherals (McBSP)
McBSP
- 2 (or 3) full-duplex, synchronous serial ports
- Supports multi-channel operation (T1, E1, MVIP, ...)
29. 'C6000 Peripherals (DMA/EDMA)
DMA / EDMA
- Transfers any set of memory locations to another
- 4 / 17 channels (transfer setups)
- Includes boot-strap capability
30. 'C6000 Peripherals (Timer/Counter)
Timer / Counter
- Two 32-bit timer/counters
- Can generate interrupts
- Input and output pins
31. 'C6000 Peripherals (PLL)
Input
- CLKIN
Output
- CLKOUT1
  - x1 or x4 CLKIN
  - Instruction (MIP) rate
- CLKOUT2
  - 1/2 rate of CLKOUT1
PLL
- x1 or x4 clock multiplier
- Reduces EMI and cost
- Pin selectable
32. 'C6000 System Block Diagram (Final)
[Block diagram: Program RAM and Data RAM on the internal buses (Addr, D (32)); DMA; EMIF to external memory (sync and async); serial port; host port; boot load; timers; power down]
33. Standard VLIW (FP = EP)
FP = Fetch Packet, EP = Execute Packet
34. Standard VLIW (FP = EP)
35. Standard VLIW (FP = EP)
36. VelociTI (FP ≠ EP)
Code Example:
    B   .S1
    MVK .S2
    ADD .L1
    ADD .L2
    MPY .M1
    MPY .M1
    LDH .D1
    LDW .D2
Definitions:
- Fetch Packet = 8 32-bit instr (256 bits)
- VLIW = Very Long Instr Word (256 bits)
- EP = Execute Packet (group of instr)
- Instruction = 32-bit opcode
- VelociTI = TI's VLIW Architecture w/EP's
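One way to picture how a single fetch packet yields several execute packets is a per-instruction "parallel" flag. The sketch below assumes the C6000 convention that a p-bit of 1 means the next instruction executes in parallel with the current one; the slides do not show the p-bits for the code example, so the values used here are invented purely for illustration.

```python
def split_execute_packets(fetch_packet):
    """Group instructions into execute packets using their p-bits.

    fetch_packet: list of (mnemonic, p_bit) tuples in program order.
    p_bit == 1 means the NEXT instruction runs in parallel with this one.
    """
    packets, current = [], []
    for mnemonic, p_bit in fetch_packet:
        current.append(mnemonic)
        if p_bit == 0:          # chain ends here: close the execute packet
            packets.append(current)
            current = []
    if current:                 # chain left open at the end of the list
        packets.append(current)
    return packets


# Hypothetical p-bits for the eight instructions of the code example.
example = [("B", 1), ("MVK", 1), ("ADD", 0), ("ADD", 1),
           ("MPY", 0), ("MPY", 1), ("LDH", 1), ("LDW", 0)]
execute_packets = split_execute_packets(example)
```

With these made-up flags the eight fetched instructions split into three execute packets of three, two and three instructions, so no NOPs are needed to fill unused slots.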
37. VelociTI vs. Standard VLIW
[Figure: Standard VLIW vs. VelociTI]
- VelociTI reduces code size up to 8:1
  - Fewer program fetches
  - Less power consumption
  - Lower memory costs
38. C62x/67x VelociTI EP/FP Alignment
Code Example:
    B   .S1
    SUB .L1
    MVK .S2
    ADD .L2
    ADD .L1
    MPY .M1
    MPY .M1
    MPY .M2
    LDH .D1
    LDB .D2
Execute packets cannot cross fetch packet boundaries
- To align EP's within FP's, the tools add parallel NOPs
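The padding rule can be sketched as a simple layout pass over execute-packet sizes. This is a minimal model, assuming a fetch packet of 8 instructions and that padding NOPs are attached to the execute packet preceding the one that would otherwise straddle the boundary; `pad_execute_packets` is a hypothetical helper, not a description of what the TI tools actually do internally.

```python
FETCH_PACKET_SIZE = 8  # instructions per fetch packet


def pad_execute_packets(ep_sizes):
    """Lay out execute packets so that none crosses a fetch-packet boundary.

    ep_sizes: number of instructions in each execute packet (1 to 8).
    Returns (padded_sizes, nops_added).
    """
    padded, used, nops = [], 0, 0
    for size in ep_sizes:
        if not 1 <= size <= FETCH_PACKET_SIZE:
            raise ValueError("execute packet must hold 1 to 8 instructions")
        room = FETCH_PACKET_SIZE - used
        if size > room:
            # Does not fit: fill the rest of the fetch packet with NOPs that
            # run in parallel with the previous execute packet.
            padded[-1] += room
            nops += room
            used = 0
        padded.append(size)
        used = (used + size) % FETCH_PACKET_SIZE
    return padded, nops
```

For execute packets of 3, 3 and 4 instructions, the third would cross the boundary, so two NOPs are added and the layout becomes 3, 5 and 4 instructions across two fetch packets.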
39. C64x Alignment
Code Example:
    B   .S1
    SUB .L1
    MVK .S2
    ADD .L2
    ADD .L1
    MPY .M1
    MPY .M1
    MPY .M2
    LDH .D1
    LDB .D2
Execute packets can cross fetch packet boundaries
[Figure: fetch packets FP1 and FP2 holding execute packets EP1, EP2 and EP3, with EP3 continuing from FP1 into FP2]
40. C6x Instruction Scheduling
- To write optimal code for the C6x platform, we need to understand the instruction execution pipeline.
- Implement a simple FIR filter on the C6x architecture.
- Scheduling must be done to maximize utilization of buses and functional units, and to minimize the number of wasted delay slots.
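For reference, the computation being scheduled is the FIR sum of products. The sketch below is a plain Python model of a direct-form FIR filter, useful only as a functional reference against which a scheduled implementation could be checked; it says nothing about how the slides' assembly code is organized.

```python
def fir(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].

    Samples before the start of x are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y
```

Each output sample needs two loads, one multiply and one add per tap, which is the pattern the scheduling steps try to spread across the load, multiply and ALU units.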
41. C6x Pipeline Phases
Program Fetch: PG PS PW PR (phases 1-4)
Decode: DP DC (phases 5-6)
Execute: E1 E2 E3 E4 E5 E6 (phases 7-12)
E2-E6 are placeholders for delayed results
42. C6x Instruction Latency
- Most instructions are single cycle
- Others require delay slots to be filled until the
result becomes available
43. C6x Scheduling Example
- FIR Assembly code and constraints
44. C6x Scheduling Example
- Accounting for delay slots
  - Multiply - 1
  - Load - 4
  - Branch - 5
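The delay-slot counts translate directly into the earliest cycle at which a dependent instruction can issue. The sketch below uses the three values from the slide and treats every other instruction as having no delay slots; mapping "Load" to the LDH and LDW mnemonics, and the helper names, are assumptions made for this example.

```python
# Delay slots from the slide; any mnemonic not listed is assumed to have none.
DELAY_SLOTS = {"MPY": 1, "LDH": 4, "LDW": 4, "B": 5}


def result_ready_cycle(mnemonic, issue_cycle):
    """First cycle in which a dependent instruction can use the result."""
    return issue_cycle + DELAY_SLOTS.get(mnemonic, 0) + 1


def serial_cycles(sequence):
    """Cycles needed when each instruction waits for the previous result."""
    cycle = 0
    for mnemonic in sequence:
        cycle = result_ready_cycle(mnemonic, cycle)
    return cycle
```

A dependent chain of load, multiply and add therefore takes 8 cycles when nothing else is scheduled into the delay slots, which is the slack that the later scheduling steps try to fill with useful work.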
45. C6x Scheduling Example
46. C6x Scheduling Example
- Using 32-bit loads to fully utilize functional units
47. C6x Scheduling Example
- Assigning functional units
48. C6x Scheduling Example
- Parallelizing and using delay slots
49. C6x Instruction Scheduling
50. C6x Scheduling
- Software pipelined code
  - Prolog
  - Loop iteration
  - Extraneous instructions
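The shape of software-pipelined code can be illustrated by listing which stage of which loop iteration runs in each cycle. The sketch below assumes a new iteration starts every cycle and that each stage takes one cycle; the stage names are placeholders, and the label "epilog" for the drain phase is a common term that does not appear on the slide.

```python
def software_pipeline(n_iterations, stages):
    """Return, for each cycle, the (iteration, stage) pairs that are active.

    Assumption: a new iteration starts every cycle, one cycle per stage.
    """
    depth = len(stages)
    schedule = []
    for cycle in range(n_iterations + depth - 1):
        active = [(it, stages[cycle - it])
                  for it in range(n_iterations)
                  if 0 <= cycle - it < depth]
        schedule.append(active)
    return schedule


def classify(schedule, depth):
    """Label each cycle as prolog, kernel (all stages busy) or epilog."""
    labels = []
    seen_kernel = False
    for active in schedule:
        if len(active) == depth:
            labels.append("kernel")
            seen_kernel = True
        else:
            labels.append("epilog" if seen_kernel else "prolog")
    return labels


schedule = software_pipeline(4, ["load", "mpy", "add"])
labels = classify(schedule, 3)
```

With three stages and four iterations the pipeline fills for two cycles, runs full for two and drains for two; only the full cycles form the repeating loop body, and the fill and drain code is the extra code that software pipelining adds around it.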
51. Effects on C6x DSP Software Development
- DSP hardware is often used in multiprocessor configurations with communication links.
- DSP software is increasingly large and complex.
- Assembly code is difficult.
- More reliance on high-level languages.
- Increasing use of graphical environments (e.g. Matlab Simulink, Hyperception RIDE) to develop algorithms.
- Increased use of RTOSs.
- Packaging and sale of algorithms as IP blocks.
- Standard interface (e.g. TI ExpressDSP Algorithm Standard)