ECE 734 - PowerPoint PPT Presentation


About This Presentation

Title:

ECE 734

Description:

The TMS320 Family includes 3 major divisions. TMS320C2000 (fixed-point) ... Commonly used in cell phones, MP3 players, cameras. TMS320C6000 (fixed-/floating-point) ... – PowerPoint PPT presentation


Slides: 44

Provided by: michael172


Transcript and Presenter's Notes

Title: ECE 734

1
ECE 734
Programmable DSP Architectures and Algorithm Considerations
TMS320 DSPs

Michael G. Morrow, P.E.

2
TMS320 DSP slides courtesy of Texas Instruments.
3
Processor Architecture 101

Memory Architectures
Single Issue Processors
Pipelining
Multiple Issue Processors
Considerations
Resources
Scheduling
Completion
Superscalar
Very-Long Instruction Word (VLIW)

4
Algorithm Considerations

DSP Implications
Determinism and Predictability
Resource Scheduling
Buses
Memory
ALUs/MACs
Completion

5
TMS320 Programmable DSPs

The TMS320 family includes 3 major divisions:
TMS320C2000 (fixed-point)
Optimized for embedded motor control
Low cost, very high integration (like a microcontroller)
TMS320C5000 (fixed-point)
Optimized for low-power operation (0.25 mW/MIPS)
Commonly used in cell phones, MP3 players, cameras
TMS320C6000 (fixed-/floating-point)
Optimized for high performance
Commonly used in cellular base stations, digital radios, image processing, printers

6
C2x DSP Applications
7
C2x DSP Architecture
8
C2x DSP Integration
9
C54x Architecture (10km view)
[Block diagram labels: PC, XPC, Addr Gen, Data Write A/D Bus (E)]
MAC AR2, AR3, A
ADD @x2, B ...
Separate program and data spaces: Harvard architecture
(C) 2002-2004 Michael G. Morrow - 9
10
C54x Memory, Buses and Pipeline
Internal Memory
External Memory

External: 1 access / cycle
up to 8M words program

Internal: up to 4 accesses / cycle

Pipeline phases: P, F, D, A, R, X

P - generate program address
F - get opcode
D - decode instruction
A - generate read address
R - read operands
X - execute
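
The overlap of the six phases can be sketched with a small model. This is only an idealized illustration: it assumes one instruction enters the pipeline per cycle and that nothing stalls, and the function names are made up for this sketch.

```python
PHASES = ["P", "F", "D", "A", "R", "X"]  # the six phases listed above


def pipeline_table(n_instructions):
    """Map each cycle to {instruction index: phase} for an ideal, stall-free pipeline."""
    table = {}
    for i in range(n_instructions):
        for stage, phase in enumerate(PHASES):
            # Instruction i enters phase number `stage` at cycle i + stage.
            table.setdefault(i + stage, {})[i] = phase
    return table


def total_cycles(n_instructions):
    """Cycles until the last instruction leaves the X phase (no stalls assumed)."""
    # The first instruction needs one cycle per phase; each later one adds a cycle.
    return len(PHASES) + n_instructions - 1 if n_instructions else 0
```

For example, `pipeline_table(3)[2]` shows instruction 0 in D while instruction 1 is in F and instruction 2 is in P.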
11
Pipeline Implications
Memory blocks: Program ROM, Data ROM, SARAM, DARAM, external memory interface
There are no conflicts as long as you follow these rules:

ROM/SARAM - 1 access per block per cycle
DARAM - 2 accesses per block per cycle
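
The two rules above can be written as a simple per-cycle check. This is a minimal sketch; the block names and the memory layout are hypothetical and only serve to exercise the rule.

```python
from collections import Counter

# Accesses allowed per block per cycle, taken from the rules above.
LIMITS = {"ROM": 1, "SARAM": 1, "DARAM": 2}


def has_conflict(accesses, block_types):
    """Return True if any block is accessed more often than its kind allows.

    accesses:    names of the blocks touched in one cycle
    block_types: mapping from block name to "ROM", "SARAM" or "DARAM"
    """
    counts = Counter(accesses)
    return any(n > LIMITS[block_types[name]] for name, n in counts.items())


# Hypothetical layout, for illustration only.
blocks = {"saram0": "SARAM", "daram0": "DARAM", "daram1": "DARAM"}
```

With this layout, two accesses to `daram0` in one cycle are fine, while two accesses to `saram0` are flagged as a conflict.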

12
Parallel Instructions
Example: Z = X + Y and F = D + E
Parallel instruction pairs:
LD || MACR
LD || MASR
ST || MPY
ST || MACR
ST || MASR
ST || ADD
ST || SUB
ST || LD
Sequential code:
LD AR5,16,A
ADD AR5,16,A
STH A,AR5
LD AR4,16,B
ADD AR4,16,B
STH B,AR4
Parallel store/load pair:
ST A,AR5 || LD AR4,B

Parallel load/store instructions use the D bus and E bus in the same cycle.

Parallel ops focus on the high accumulator.
Stores in parallel ops are offset by the ASM value.

13
Pipeline and Delayed Branches
B new: 2 words / 4 cycles
BD new: 2 words / 2 cycles
14
Using Delayed Instructions
Without a delayed branch (6 words / 8 cycles):
LD @x,A
ADD @y,A
MPY @z,B
STL A,@r
B next

With a delayed branch (6 words / 6 cycles):
LD @x,A
ADD @y,A
BD next
MPY @z,B
STL A,@r

The delay slot is two words deep; cycles or lines of code are not relevant.

A delay-slot operation may not be a branch of any kind (B, CALL, RET, RPT, etc.)
Conditions set in the delay slot of BCD/CCD/RCD will have NO effect on the instruction
Do not load BRC in the delay slot of RPTBD
No PUSH/POP in CALLD or RETD delay slots
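
The word and cycle totals for the two sequences can be reproduced with a short tally. The branch costs come from the slides (2 words / 4 cycles for B, 2 words / 2 cycles for BD); the one-word, one-cycle cost for the other instructions is an assumption that happens to match the stated totals.

```python
# (words, cycles) per instruction. Branch costs are from the slides; the
# 1-word / 1-cycle entries are assumed for this sketch.
COST = {
    "LD": (1, 1), "ADD": (1, 1), "MPY": (1, 1), "STL": (1, 1),
    "B": (2, 4),   # ordinary branch
    "BD": (2, 2),  # delayed branch
}


def size_and_cycles(mnemonics):
    """Sum code size in words and execution time in cycles."""
    words = sum(COST[m][0] for m in mnemonics)
    cycles = sum(COST[m][1] for m in mnemonics)
    return words, cycles


def delay_slot_words(mnemonics):
    """Words placed after the BD instruction (the slot must hold exactly two)."""
    after = mnemonics[mnemonics.index("BD") + 1:]
    return sum(COST[m][0] for m in after)


plain = ["LD", "ADD", "MPY", "STL", "B"]
delayed = ["LD", "ADD", "BD", "MPY", "STL"]
```

Both versions occupy six words; moving MPY and STL into the delay slot is what removes the two idle cycles.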

15
F F could be > 1, so how is this handled?

1. Use guard bits (allow at least 128 signed summations)
Guard bits increase dynamic range from +/-1 to +/-128
2. In a non-gain system, temporary overflow is permitted. The output is guaranteed to remain bounded by the input.
3. In a system with gain, the output is not guaranteed to remain bounded (i.e., the result is larger than 32 bits).
How do you handle a result larger than 32 bits?
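
The effect of guard bits and of temporary overflow can be illustrated with two's-complement arithmetic of different widths. This is a rough model, not C54x code: the 40-bit width stands in for a 32-bit result plus 8 guard bits, and the sample values are made up.

```python
def wrap(value, bits):
    """Interpret value as a two's-complement number of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate(terms, bits):
    """Add the terms one by one in an accumulator that wraps at `bits` bits."""
    acc = 0
    for t in terms:
        acc = wrap(acc + t, bits)
    return acc


# Hypothetical data: four terms, each near the top of the 32-bit range.
products = [0x60000000] * 4
```

`accumulate(products, 40)` keeps the exact sum, while `accumulate(products, 32)` wraps to a negative value. A sequence such as `[0x60000000, 0x60000000, -0x60000000]` overflows 32 bits part-way through yet still ends at the correct result, which is the sense in which temporary overflow is harmless when the final output is bounded.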
16
Saturation
Two saturation methods exist for A/B

Manual use the SAT instruction (saturates A or
B)

Auto saturate on store (saturates stored value
only)

Manual:
SAT A
STH A,AR1
-OR-
Auto:
LD 0,DP
ORM 1,@PMST   (SST = 1)
STH A,AR0
PMST: Processor Mode Status Register

SAT will set the overflow bit (OVA or OVB) if
saturation occurs

SST does not affect OVx or accumulator contents
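
The difference between the two methods can be sketched as follows. This is a simplified model under stated assumptions: the accumulator is a plain Python integer, the ASM shift is ignored, and the function names are invented for the illustration.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def sat(acc):
    """Model of manual saturation: clamp to 32 bits and report whether it clamped.

    Returns (new accumulator value, overflow flag).
    """
    clamped = max(INT32_MIN, min(INT32_MAX, acc))
    return clamped, clamped != acc


def store_high(acc, saturate_on_store=False):
    """Model of storing bits 31..16 of the accumulator.

    With saturate_on_store set, only the stored value is saturated;
    the accumulator passed in is left untouched.
    """
    value = sat(acc)[0] if saturate_on_store else acc
    return (value >> 16) & 0xFFFF
```

For an accumulator holding `0x180000000`, a plain store of the high word yields `0x8000` (a large negative number), whereas the saturating store yields `0x7FFF`.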

17
High-performance VLIW architecture
32-bit RISC core, 32 GP registers
8 functional units
Static scheduling
Byte addressable memory
Caches
Split L1
Unified L2 / internal SRAM
Determinism?
Fixed- and floating-point versions
Floating-point is superset of fixed-point

18
'C6200 Instruction Set (by unit)
19
'C6700 Superset of Fixed-Point (by unit)
20
The C64x adds ...
C62x: Dual 32-bit load/store
C64x: Dual 64-bit load/store
C67x: Dual 64-bit load / 32-bit store
21
Conditional Instructions
Execution is based on a zero / non-zero register condition
The condition register is A0, A1, A2, B0, B1, or B2
Note: only the C64x allows A0 to be used as a condition
22
'C6000 System Block Diagram
23
C6000 Internal Buses
C62x: Dual 32-bit load/store
C64x: Dual 64-bit load/store
C67x: Dual 64-bit load / 32-bit store
24
'C6000 System Block Diagram
25
'C6000 Peripherals
26
'C6000 Peripherals (EMIF)
EMIF
Glueless access to async/sync memory
8/16/32-bit data or 32-bit program access
27
'C6000 Peripherals (HPI/XB/PCI)
HPI / XB (Expansion Bus) / PCI
16/32-bit host-µP dedicated bus
HPI/XB are great for PCI interfacing or boot-loading
PCI is even better for PCI interfacing :-)
28
'C6000 Peripherals (McBSP)
McBSP
2 (or 3) full-duplex, synchronous serial ports
Supports multi-channel operation (T1, E1, MVIP, ...)
29
'C6000 Peripherals (DMA/EDMA)
DMA / EDMA
Transfers any set of memory locations to another
4 / 17 channels (transfer setups)
Includes boot-strap capability
30
'C6000 Peripherals (Timer/Counter)
Timer / Counter
Two 32-bit timer/counters
Can generate interrupts
Input and output pins
31
'C6000 Peripherals (PLL)
Input: CLKIN
Output: CLKOUT1 - x1 or x4 CLKIN - instruction (MIP) rate
Output: CLKOUT2 - 1/2 rate of CLKOUT1
PLL
x1 or x4 clock multiplier
Reduces EMI and cost
Pin selectable
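
The clock relationships listed above amount to a small calculation. The helper name and the 50 MHz input used in the usage note are hypothetical.

```python
def c6000_clocks(clkin_mhz, multiplier):
    """Return (CLKOUT1, CLKOUT2) in MHz for a pin-selected x1 or x4 PLL setting."""
    if multiplier not in (1, 4):
        raise ValueError("the PLL multiplier is x1 or x4")
    clkout1 = clkin_mhz * multiplier  # instruction rate
    clkout2 = clkout1 / 2             # half the rate of CLKOUT1
    return clkout1, clkout2
```

Assuming a 50 MHz CLKIN with the x4 setting, `c6000_clocks(50, 4)` gives a 200 MHz CLKOUT1 and a 100 MHz CLKOUT2.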
32
'C6000 System Block Diagram (Final)
[Block diagram: Program RAM, Data RAM, internal buses (Addr, D (32)), DMA, EMIF to external memory (sync, async), serial port, host port, boot load, timers, power down]
33
Standard VLIW (FP = EP)
FP: Fetch Packet; EP: Execute Packet
36
VelociTI (FP ≠ EP)
Code Example
B .S1
MVK .S2
ADD .L1
ADD .L2
MPY .M1
MPY .M1
LDH .D1
LDW .D2
Definitions
Fetch Packet: eight 32-bit instructions (256 bits)
VLIW: Very Long Instruction Word (256 bits)
EP: Execute Packet (group of instructions)
Instruction: 32-bit opcode
VelociTI: TI's VLIW architecture with EPs
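
How a fetch packet divides into execute packets can be sketched with per-instruction parallel bits. On the C6000, the least significant bit of each opcode (the p-bit) marks whether the next instruction executes in parallel with the current one. The grouping below is hypothetical, since the parallel markers of the slide's code example did not survive in this transcript.

```python
def split_execute_packets(instructions, p_bits):
    """Group the instructions of a fetch packet into execute packets.

    p_bits[i] == 1 means instruction i + 1 runs in parallel with instruction i.
    """
    packets, current = [], []
    for instr, p in zip(instructions, p_bits):
        current.append(instr)
        if p == 0:  # the parallel chain ends here, closing the execute packet
            packets.append(current)
            current = []
    if current:  # p-bit set on the last slot: keep whatever was collected
        packets.append(current)
    return packets


fetch_packet = ["B", "MVK", "ADD", "ADD", "MPY", "MPY", "LDH", "LDW"]
p_bits = [1, 1, 0, 1, 1, 0, 1, 0]  # hypothetical grouping, not from the slide
```

With these bits the eight fetched instructions issue as three execute packets of three, three and two instructions; with all p-bits except the last set to 1, the whole fetch packet issues at once, which is the standard VLIW case.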
37
VelociTI vs. Standard VLIW
[Diagram: Standard VLIW vs. VelociTI]

VelociTI reduces code size by up to 81%
Fewer program fetches
Less power consumption
Lower memory costs
38
C62x/67x VelociTI EP/FP Alignment
Code Example
B .S1
SUB .L1
MVK .S2
ADD .L2
ADD .L1
MPY .M1
MPY .M1
MPY .M2
LDH .D1
LDB .D2
Execute packets cannot cross fetch packet boundaries

To align EPs within FPs, the tools add parallel NOPs
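
The cost of this alignment rule can be estimated by counting the padding needed for a given sequence of execute packet sizes. This is a sketch of the bookkeeping only; the sizes in the usage note are made up, and the real tools decide where the NOPs go.

```python
FP_SIZE = 8  # instructions per fetch packet


def pad_execute_packets(ep_sizes):
    """Return (fetch packets used, NOPs added) when EPs may not cross FP boundaries."""
    used = 0  # instruction slots consumed so far, padding included
    nops = 0
    for size in ep_sizes:
        if not 1 <= size <= FP_SIZE:
            raise ValueError("an execute packet holds 1 to 8 instructions")
        room = FP_SIZE - used % FP_SIZE  # free slots left in the current FP
        if size > room:  # the EP would straddle the boundary: pad to the next FP
            nops += room
            used += room
        used += size
    fetch_packets = -(-used // FP_SIZE)  # ceiling division
    return fetch_packets, nops
```

For hypothetical packet sizes of 3, 4 and 3 instructions, one NOP is added so that the third packet starts in a new fetch packet. Sizes of 5, 5 and 5 need six NOPs and three fetch packets, where 15 instructions would otherwise fit in two.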

39
C64x Alignment
Code Example
B .S1
SUB .L1
MVK .S2
ADD .L2
ADD .L1
MPY .M1
MPY .M1
MPY .M2
LDH .D1
LDB .D2
Execute packets can cross fetch packet boundaries
[Diagram: fetch packets FP1 and FP2 holding execute packets EP1 to EP3, with EP3 spanning the FP1/FP2 boundary]
40
C6x Instruction scheduling

To write optimal code for the C6x platform, we
need to understand the instruction execution
pipeline.
Implement a simple FIR filter on the C6x
architecture.
Scheduling must be done to maximize utilization
of buses and functional units, and to minimize
the number of wasted delay slots.

41
C6x Pipeline Phases
Program Fetch: PG PS PW PR (phases 1-4)
Decode: DP DC (phases 5-6)
Execute: E1 E2 E3 E4 E5 E6 (phases 7-12)
E2-E6 are placeholders for delayed results
42
C6x Instruction Latency

Most instructions are single cycle
Others require delay slots to be filled until the
result becomes available

43
C6x Scheduling Example

FIR Assembly code and constraints

44
C6x Scheduling Example

Accounting for delay slots
Multiply - 1
Load - 4
Branch - 5
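
These delay-slot counts determine the earliest cycle at which a dependent instruction can issue. The sketch below assumes an instruction issued in cycle c with d delay slots has its result usable by an instruction issued in cycle c + d + 1. It ignores functional-unit and bus limits, and the zero-delay ADD entry and the dependency chain are assumptions for illustration.

```python
# Delay slots per instruction; MPY, LDH and B are from the slide, ADD is
# assumed to be single-cycle (no delay slots).
DELAY_SLOTS = {"MPY": 1, "LDH": 4, "B": 5, "ADD": 0}


def earliest_issue(program):
    """Return {name: earliest issue cycle} for a list of (name, opcode, deps).

    deps lists the names of earlier instructions whose results are needed.
    """
    issue, ops = {}, {}
    for name, op, deps in program:
        ops[name] = op
        # A consumer may issue one cycle after the producer's last delay slot.
        issue[name] = max(
            (issue[d] + DELAY_SLOTS[ops[d]] + 1 for d in deps), default=0
        )
    return issue


# One FIR tap: load a sample and a coefficient, multiply, accumulate.
fir_tap = [
    ("x", "LDH", []),
    ("h", "LDH", []),
    ("p", "MPY", ["x", "h"]),
    ("y", "ADD", ["p"]),
]
```

Under these assumptions the loads issue in cycle 0, the multiply cannot issue before cycle 5, and the add cannot issue before cycle 7. The cycles in between are the slots the scheduler tries to fill with other work.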

45
C6x Scheduling Example

Parallelizing

46
C6x Scheduling Example

Using 32-bit loads to fully utilize functional
units
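
The idea can be modelled as packing two 16-bit values into each 32-bit word and multiplying the low and high halves separately, in the manner of the C6x MPY (low half times low half) and MPYH (high half times high half) instructions. The helper names are invented, and placing the first sample in the low half is an assumption about memory layout.

```python
def to_signed16(v):
    """Interpret a 16-bit field as a signed value."""
    return v - 0x10000 if v & 0x8000 else v


def pack(lo, hi):
    """Pack two 16-bit values into one 32-bit word (first value in the low half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def mpy(a, b):
    """Signed product of the low halves of two 32-bit words."""
    return to_signed16(a & 0xFFFF) * to_signed16(b & 0xFFFF)


def mpyh(a, b):
    """Signed product of the high halves of two 32-bit words."""
    return to_signed16((a >> 16) & 0xFFFF) * to_signed16((b >> 16) & 0xFFFF)


def dot_packed(x, h):
    """Dot product of two equal, even-length lists of 16-bit values."""
    total = 0
    for i in range(0, len(x), 2):
        xw = pack(x[i], x[i + 1])  # stands in for one 32-bit load
        hw = pack(h[i], h[i + 1])  # stands in for one 32-bit load
        total += mpy(xw, hw) + mpyh(xw, hw)
    return total
```

Each loop pass consumes two taps with two word loads and two multiplies, instead of four halfword loads, so the load units no longer limit the multipliers.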

47
C6x Scheduling Example

Assigning functional units

48
C6x Scheduling Example

Parallelizing and using delay slots

49
C6x Instruction Scheduling

Software pipelining

50
C6x Scheduling

Software pipelined code
Prolog
Loop iteration
Extraneous instructions
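
The structure of a software-pipelined loop can be illustrated with a toy three-stage model of a dot product, in which the load, multiply and add of different iterations overlap. Each stage is assumed to take one cycle, which is far simpler than the real C6x latencies, and the function name is invented.

```python
def dot_software_pipelined(x, h):
    """Compute sum(x[i] * h[i]) with overlapped load / multiply / add stages.

    Returns (result, trace) where trace lists the stages active in each cycle.
    """
    n = len(x)
    loaded = None   # value passed from the load stage to the multiply stage
    product = None  # value passed from the multiply stage to the add stage
    acc = 0
    trace = []
    for cycle in range(n + 2):  # two extra cycles drain the pipeline
        active = []
        # Stages are evaluated last to first, so each one consumes the value
        # produced in the previous cycle.
        if product is not None:
            acc += product
            active.append("ADD")
        product = loaded[0] * loaded[1] if loaded is not None else None
        if product is not None:
            active.append("MPY")
        loaded = (x[cycle], h[cycle]) if cycle < n else None
        if loaded is not None:
            active.append("LD")
        trace.append(active)
    return acc, trace
```

In the trace, the first two cycles form the prolog (the pipeline is filling), the cycles with all three stages active are the steady-state loop iterations, and the last two cycles drain the remaining results.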

51
Effects on C6x DSP Software Development

DSP Hardware is often used in multiprocessor
configurations with communication links.
DSP software is increasingly large and complex.
Assembly code is difficult.
More reliance on high-level languages.
Increasing use of graphical environments (e.g., Matlab Simulink, Hyperception RIDE) to develop algorithms.
Increased use of RTOSs.
Packaging and sale of algorithms as IP blocks.
Standard interface (e.g., TI ExpressDSP Algorithm standard)