ECE 734 - PowerPoint PPT Presentation


Title: ECE 734


1
ECE 734
Programmable DSP Architectures and Algorithm Considerations: TMS320 DSPs
  • Michael G. Morrow, P.E.

2
TMS320 DSP slides courtesy of Texas Instruments.
3
Processor Architecture 101
  • Memory Architectures
  • Single Issue Processors
  • Pipelining
  • Multiple Issue Processors
  • Considerations
  • Resources
  • Scheduling
  • Completion
  • Superscalar
  • Very-Long Instruction Word (VLIW)

4
Algorithm Considerations
  • DSP Implications
  • Determinism and Predictability
  • Resource Scheduling
  • Buses
  • Memory
  • ALUs/MACs
  • Completion

5
TMS320 Programmable DSPs
  • The TMS320 Family includes 3 major divisions
  • TMS320C2000 (fixed-point)
      • Optimized for embedded motor control
      • Low cost, very high integration (like a microcontroller)
  • TMS320C5000 (fixed-point)
      • Optimized for low-power operation (0.25 mW/MIP)
      • Commonly used in cell phones, MP3 players, cameras
  • TMS320C6000 (fixed-/floating-point)
      • Optimized for high performance
      • Commonly used in cellular base stations, digital radios,
        image processing, printers

6
C2x DSP Applications
7
C2x DSP Architecture
8
C2x DSP Integration
9
C54x Architecture (10km view)
[Block diagram labels: PC, XPC, Addr Gen, Data Write A/D Bus (E)]
MAC AR2, AR3, A
ADD @x2, B ...
Separate program and data spaces: Harvard architecture
(C) 2002-2004 Michael G. Morrow - 9
10
C54x Memory, Buses and Pipeline
Memory
  • External: 1 access / cycle, up to 8M words program
  • Internal: up to 4 accesses / cycle

Pipeline Phases
  • P - generate program address
  • F - get opcode
  • D - decode instruction
  • A - generate read address
  • R - read operands
  • X - execute
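
As a rough illustration of how the six phases overlap, the sketch below prints an occupancy chart. It assumes an ideal pipeline (one instruction enters per cycle, no stalls); the function name `pipeline_chart` is hypothetical and not part of any TI tool.

```python
# Six C54x pipeline phases as named on the slide.
PHASES = ["P", "F", "D", "A", "R", "X"]

def pipeline_chart(num_instructions):
    """Return one row per instruction; the column index is the cycle number.

    Assumption: ideal pipeline, one instruction enters per cycle, no stalls.
    """
    total_cycles = num_instructions + len(PHASES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["."] * total_cycles
        for stage, name in enumerate(PHASES):
            row[i + stage] = name      # instruction i is in this phase at cycle i + stage
        rows.append("".join(row))
    return rows

for line in pipeline_chart(3):
    print(line)
```

Each row is shifted one cycle to the right of the previous one, so once the pipeline is full one instruction completes per cycle.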
11
Pipeline Implications
[Memory blocks shown: Program ROM, Data ROM, SARAM, DARAM, external memory interface]
There are no conflicts as long as you follow these rules:
  • ROM/SARAM - 1 access per block per cycle
  • DARAM - 2 accesses per block per cycle
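
The two rules above can be checked mechanically. The following is a minimal sketch, assuming each access in a cycle is described by a block name and a block type; the names `ACCESS_LIMIT` and `has_conflict` are hypothetical.

```python
from collections import Counter

# Accesses allowed per block per cycle, per the rules on the slide.
ACCESS_LIMIT = {"ROM": 1, "SARAM": 1, "DARAM": 2}

def has_conflict(accesses):
    """accesses: list of (block_name, block_type) pairs issued in one cycle.

    Returns True if any single block is accessed more often than its type allows.
    """
    counts = Counter(accesses)
    return any(n > ACCESS_LIMIT[block_type]
               for (_, block_type), n in counts.items())

# Two reads from the same SARAM block in one cycle conflict ...
print(has_conflict([("saram0", "SARAM"), ("saram0", "SARAM")]))
# ... but the same two reads from one DARAM block do not.
print(has_conflict([("daram0", "DARAM"), ("daram0", "DARAM")]))
```

Placing two operands of the same instruction in different blocks, or in DARAM, is the usual way to stay within these limits.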

12
Parallel Instructions
Example: Z = X + Y and F = D + E
Parallel instruction pairs:
  LD || MACR    LD || MASR
  ST || MPY     ST || MACR    ST || MASR
  ST || ADD     ST || SUB     ST || LD
Sequential code:
  LD  AR5,16,A
  ADD AR5,16,A
  STH A,AR5
  LD  AR4,16,B
  ADD AR4,16,B
  STH B,AR4
Parallel store/load:
  ST A,AR5 || LD AR4,B
  • Parallel load/store instructions use D Bus and E
    Bus in same cycle.
  • Parallel ops focus on high accumulator.
  • Stores in parallel ops are offset by the ASM value.

13
Pipeline and Delayed Branches
[Pipeline diagrams]
Standard branch: 2 words / 4 cycles
Delayed branch (BD new): 2 words / 2 cycles
14
Using Delayed Instructions
Without delayed branch (6 words / 8 cycles):
  LD  @x,A
  ADD @y,A
  MPY @z,B
  STL A,@r
  B   next
With delayed branch (6 words / 6 cycles):
  LD  @x,A
  ADD @y,A
  BD  next
  MPY @z,B
  STL A,@r
  • Delay slot is two words deep - cycles or lines of
    code are not relevant
  • Delay operation may not be a branch of any kind
    (B, CALL, RET, RPT, etc.)
  • Conditions set in delay slot of BCD/CCD/RCD will
    have NO effect on the instruction
  • Do not load BRC in delay slot of RPTBD
  • No PUSH/POP in CALLD or RETD delay slots
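
The word and cycle counts quoted for the two code sequences can be reproduced with simple bookkeeping. This sketch assumes every non-branch instruction in the example is 1 word and 1 cycle, and uses the branch costs from the slides (B: 2 words / 4 cycles, BD: 2 words / 2 cycles); `words_and_cycles` is a hypothetical helper.

```python
# Branch costs from the slides; all other instructions are assumed
# to be 1 word and 1 cycle for this example.
COST = {"B": (2, 4), "BD": (2, 2)}

def words_and_cycles(program):
    """Sum (words, cycles) over a list of mnemonics."""
    words = cycles = 0
    for mnemonic in program:
        w, c = COST.get(mnemonic, (1, 1))
        words += w
        cycles += c
    return words, cycles

plain = ["LD", "ADD", "MPY", "STL", "B"]      # branch last, pipeline flushed
delayed = ["LD", "ADD", "BD", "MPY", "STL"]   # MPY and STL fill the delay slot

print(words_and_cycles(plain))    # (6, 8)
print(words_and_cycles(delayed))  # (6, 6)
```

The code size is unchanged; the saving comes from doing useful work in the two words that follow BD instead of idling.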

15
Handling Accumulative Overflow
  • F F could be > 1, so how is this handled?

1. Use guard bits (allow at least 128 signed summations).
   Guard bits increase dynamic range from +/-1 to +/-128.
2. In a non-gain system, temporary overflow is permitted. The output is
   guaranteed to remain bounded by the input.
3. In a system with gain, the output is not guaranteed to remain bounded
   (i.e. the result is larger than 32 bits).
How do you handle a result larger than 32 bits?
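
To see why guard bits matter, the sketch below accumulates 128 worst-case 16 x 16-bit products and checks whether the sum fits in 32 bits and in 40 bits. The 40-bit width (32 bits plus 8 guard bits) is an assumption about the accumulator made for this illustration, and `fits_signed` is a hypothetical helper.

```python
def fits_signed(value, bits):
    """True if value is representable as a two's-complement integer of that width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

# Worst-case product of two 16-bit operands: (-32768) * (-32768) = 2**30.
worst_product = (-32768) * (-32768)

acc = 0
for _ in range(128):            # 128 worst-case summations
    acc += worst_product

print(fits_signed(acc, 32))     # False: overflows a 32-bit accumulator
print(fits_signed(acc, 40))     # True: fits once 8 guard bits are added
```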
16
Saturation
Two saturation methods exist for A/B
  • Manual: use the SAT instruction (saturates A or B)
  • Auto: saturate on store (saturates stored value only)

Manual:
  SAT A
  STH A,AR1
-OR-
Auto:
  LD  0,DP
  ORM 1,@PMST    ; SST = 1
  STH A,AR0
PMST: Processor Mode Status Register
  • SAT will set the overflow bit (OVA or OVB) if
    saturation occurs
  • SST does not affect OVx or accumulator contents
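
The effect of manual saturation can be modeled as a clamp to the 32-bit signed range that also reports whether clamping happened. This is a sketch of the arithmetic only, not of the instruction's exact flag behavior; `sat` here is a hypothetical Python function.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000

def sat(acc):
    """Clamp a wide accumulator value to the 32-bit signed range.

    Returns (saturated_value, overflow_flag); the flag plays the role
    of OVA/OVB in this model.
    """
    if acc > INT32_MAX:
        return INT32_MAX, True
    if acc < INT32_MIN:
        return INT32_MIN, True
    return acc, False

print(sat(2**37))   # clamps to the most positive 32-bit value
print(sat(1234))    # in range, returned unchanged
```

Saturate-on-store would apply the same clamp to the value written to memory while leaving the accumulator itself untouched.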

17
C6x Architecture
  • High-performance VLIW architecture
  • 32-bit RISC core, 32 GP registers
  • 8 functional units
  • Static scheduling
  • Byte addressable memory
  • Caches
  • Split L1
  • Unified L2 / internal SRAM
  • Determinism?
  • Fixed- and floating-point versions
  • Floating-point is superset of fixed-point

18
'C6200 Instruction Set (by unit)
19
'C6700 Superset of Fixed-Point (by unit)
20
The C64x adds ...
  • C62x: Dual 32-bit load/store
  • C64x: Dual 64-bit load/store
  • C67x: Dual 64-bit load / 32-bit store
21
Conditional Instructions
Execution is based on a zero/non-zero register condition (a leading ! inverts the test).
The condition register is one of A0, A1, A2, B0, B1, B2.
Note: Only the C64x allows A0 to be used as a condition.
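
Conditional execution of this kind can be modeled as an operation guarded by a register test. The sketch below is a behavioral model under the assumption that `[reg]` means "execute if non-zero" and `[!reg]` means "execute if zero"; the register-file dictionary and the function `predicated` are hypothetical.

```python
def predicated(regs, cond_reg, negate, operation):
    """Run operation(regs) only if the condition holds.

    [cond_reg]  (negate=False) executes when the register is non-zero.
    [!cond_reg] (negate=True)  executes when the register is zero.
    """
    nonzero = regs[cond_reg] != 0
    if nonzero != negate:
        operation(regs)
    return regs

def add_one_to_a5(regs):
    regs["A5"] += 1

regs = {"A1": 0, "A5": 10}
predicated(regs, "A1", False, add_one_to_a5)   # [A1]  ADD: skipped, A1 is zero
predicated(regs, "A1", True, add_one_to_a5)    # [!A1] ADD: executed
print(regs["A5"])
```

Because the skipped instruction still occupies its slot, short if/else bodies can be expressed without branches.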
22
'C6000 System Block Diagram
23
C6000 Internal Buses
  • C62x: Dual 32-bit load/store
  • C64x: Dual 64-bit load/store
  • C67x: Dual 64-bit load / 32-bit store
24
'C6000 System Block Diagram
25
'C6000 Peripherals
26
'C6000 Peripherals (EMIF)
EMIF
  • Glueless access to async/sync memory
  • 8/16/32-bit data or 32-bit program access
27
'C6000 Peripherals (HPI/XB/PCI)
HPI / XB (Expansion Bus) / PCI
  • 16/32-bit host-µP dedicated bus
  • HPI/XB are great for PCI interfacing or boot-loading
  • PCI is even better for PCI interfacing :-)
28
'C6000 Peripherals (McBSP)
McBSP
  • 2 (or 3) full-duplex, synchronous serial ports
  • Supports multi-channel operation (T1, E1, MVIP, ...)
29
'C6000 Peripherals (DMA/EDMA)
DMA / EDMA
  • Transfers any set of memory locations to another
  • 4 / 17 channels (transfer setups)
  • Includes boot-strap capability
30
'C6000 Peripherals (Timer/Counter)
Timer / Counter
  • Two 32-bit timer/counters
  • Can generate interrupts
  • Input and output pins
31
'C6000 Peripherals (PLL)
Input
  • CLKIN
Output
  • CLKOUT1 - x1 or x4 CLKIN - instruction (MIP) rate
  • CLKOUT2 - 1/2 rate of CLKOUT1
PLL
  • x1 or x4 clock multiplier
  • Reduces EMI and cost
  • Pin selectable
32
'C6000 System Block Diagram (Final)
[Block diagram labels: Program RAM, Data RAM, Addr, Internal Buses, DMA, D (32), EMIF, Serial Port, Extl Memory (Sync, Async), Host Port, Boot Load, Timers, Pwr Down]
33
Standard VLIW (FP = EP)
FP: Fetch Packet
EP: Execute Packet
34
Standard VLIW (FP = EP)
FP: Fetch Packet
EP: Execute Packet
35
Standard VLIW (FP = EP)
FP: Fetch Packet
EP: Execute Packet
36
VelociTI (FP ? EP)
Code example:
  B   .S1
  MVK .S2
  ADD .L1
  ADD .L2
  MPY .M1
  MPY .M1
  LDH .D1
  LDW .D2
Definitions:
  • Fetch Packet (FP): 8 32-bit instructions (256 bits)
  • VLIW: Very Long Instruction Word (256 bits)
  • EP: Execute Packet (group of instructions)
  • Instruction: 32-bit opcode
  • VelociTI: TI's VLIW architecture with EPs
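
One way to picture the FP/EP distinction is to split a fetch packet into execute packets using a per-instruction "runs in parallel with the next instruction" flag, which is the role the p-bit plays in the C6000 encoding. The sketch below is a simplified model; the grouping chosen for the example instructions is hypothetical, since the transcript does not preserve which instructions were marked parallel.

```python
def split_execute_packets(fetch_packet):
    """fetch_packet: list of (mnemonic, parallel_with_next) pairs.

    Instructions chained by the flag land in the same execute packet.
    """
    packets, current = [], []
    for mnemonic, parallel_with_next in fetch_packet:
        current.append(mnemonic)
        if not parallel_with_next:
            packets.append(current)   # flag cleared: this execute packet ends here
            current = []
    if current:                       # trailing chain without a terminator
        packets.append(current)
    return packets

# Hypothetical grouping of the eight instructions in one fetch packet.
fp = [("B", True), ("MVK", True), ("ADD", True), ("ADD", True), ("MPY", False),
      ("MPY", True), ("LDH", True), ("LDW", False)]
print(split_execute_packets(fp))
```

With this grouping one 256-bit fetch packet yields two execute packets, whereas a standard VLIW would need one full-width word per execute packet.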
37
VelociTI vs. Standard VLIW
[Diagram: Standard VLIW vs. VelociTI]
  • VelociTI reduces code size by up to 81%
  • Fewer program fetches
  • Less power consumption
  • Lower memory costs
38
C62x/67x VelociTI EP/FP Alignment
Code example:
  B   .S1
  SUB .L1
  MVK .S2
  ADD .L2
  ADD .L1
  MPY .M1
  MPY .M1
  MPY .M2
  LDH .D1
  LDB .D2
Execute packets cannot cross fetch packet boundaries.
  • To align EPs within FPs, the tools add parallel NOPs

39
C64x Alignment
Code example:
  B   .S1
  SUB .L1
  MVK .S2
  ADD .L2
  ADD .L1
  MPY .M1
  MPY .M1
  MPY .M2
  LDH .D1
  LDB .D2
Execute packets can cross fetch packet boundaries.
[Diagram: fetch packets FP1 and FP2 holding EP1, EP2, and EP3, with EP3 spanning the FP1/FP2 boundary]
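
The cost of the alignment restriction can be estimated by counting instruction slots with and without NOP padding. This is a minimal sketch assuming 8-slot fetch packets and execute packets of at most 8 instructions; the execute packet sizes in the example are hypothetical, and `slots_used` is not a TI tool function.

```python
FP_SLOTS = 8  # instructions per fetch packet

def slots_used(ep_sizes, may_cross):
    """Total instruction slots consumed, including any NOP padding.

    may_cross=False models execute packets that must stay inside one
    fetch packet; may_cross=True lets them span the boundary.
    """
    used = 0
    for size in ep_sizes:
        room = FP_SLOTS - used % FP_SLOTS
        if not may_cross and size > room:
            used += room       # pad the rest of this fetch packet with NOPs
        used += size
    return used

ep_sizes = [3, 4, 3]           # hypothetical execute packet sizes
print(slots_used(ep_sizes, may_cross=False))   # 11: one padding NOP
print(slots_used(ep_sizes, may_cross=True))    # 10: no padding
```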
40
C6x Instruction scheduling
  • To write optimal code for the C6x platform, we
    need to understand the instruction execution
    pipeline.
  • Implement a simple FIR filter on the C6x
    architecture.
  • Scheduling must be done to maximize utilization
    of buses and functional units, and to minimize
    the number of wasted delay slots.
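
For reference, the computation being scheduled is a plain FIR convolution. The sketch below is a direct, unoptimized version in Python, assuming samples before the start of the input are zero; it shows the multiply-accumulate structure only, not C6x code.

```python
def fir(x, h):
    """y[n] = sum over k of h[k] * x[n-k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, coeff in enumerate(h):
            if n - k >= 0:
                acc += coeff * x[n - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y

# An impulse input returns the coefficients themselves.
print(fir([1, 0, 0, 0], [1, 2, 3]))   # [1, 2, 3, 0]
```

Each tap needs two loads, one multiply, and one add, which is why the scheduling problem centers on the load, multiply, and add units.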

41
C6x Pipeline Phases
Program Fetch:  PG PS PW PR          (1) (2) (3) (4)
Decode:         DP DC                (5) (6)
Execute:        E1 E2 E3 E4 E5 E6    (7) (8) (9) (10) (11) (12)
E2-E6 are placeholders for delayed results.
42
C6x Instruction Latency
  • Most instructions are single cycle
  • Others require delay slots to be filled until the
    result becomes available

43
C6x Scheduling Example
  • FIR Assembly code and constraints

44
C6x Scheduling Example
  • Accounting for delay slots
  • Multiply - 1
  • Load - 4
  • Branch - 5
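
The delay slot counts above translate directly into the earliest cycle at which a dependent instruction can be issued. The sketch below assumes that any instruction not listed has zero delay slots; `DELAY_SLOTS` and `result_ready` are hypothetical names, and "LDH" stands in for any load.

```python
# Delay slots as listed on the slide.
DELAY_SLOTS = {"MPY": 1, "LDH": 4, "B": 5}

def result_ready(mnemonic, issue_cycle):
    """First cycle in which a dependent instruction may be issued.

    Assumption: instructions not listed are single cycle (0 delay slots).
    """
    return issue_cycle + DELAY_SLOTS.get(mnemonic, 0) + 1

load_done = result_ready("LDH", 0)          # 5
mpy_done = result_ready("MPY", load_done)   # 7
add_done = result_ready("ADD", mpy_done)    # 8
print(load_done, mpy_done, add_done)
```

A single load-multiply-add chain therefore spans 8 cycles if nothing else is scheduled into the delay slots.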

45
C6x Scheduling Example
  • Parallelizing

46
C6x Scheduling Example
  • Using 32-bit loads to fully utilize functional
    units
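
The idea behind 32-bit loads is that one load can fetch two 16-bit values, which are then multiplied separately. The sketch below models this in Python; that the low and high halves would feed separate multiply instructions on the C6x (for example MPY and MPYH) is an assumption about the slide's assembly, and all function names are hypothetical.

```python
def signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack_word(low, high):
    """Pack two 16-bit values into one 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def split_word(word):
    """Return (low_half, high_half) of a 32-bit word as signed 16-bit values."""
    return signed16(word), signed16(word >> 16)

def dual_mac(sample_word, coeff_word):
    """Two taps' worth of products from one pair of 32-bit loads."""
    x_lo, x_hi = split_word(sample_word)
    h_lo, h_hi = split_word(coeff_word)
    return x_lo * h_lo + x_hi * h_hi

print(dual_mac(pack_word(-3, 7), pack_word(2, 5)))   # (-3)*2 + 7*5 = 29
```

Halving the number of loads per tap frees the load units so that both multipliers can be kept busy.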

47
C6x Scheduling Example
  • Assigning functional units

48
C6x Scheduling Example
  • Parallelizing and using delay slots

49
C6x Instruction Scheduling
  • Software pipelining

50
C6x Scheduling
  • Software pipelined code
  • Prolog
  • Loop iteration
  • Extraneous instructions
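
Software pipelining overlaps stages of successive loop iterations. The sketch below builds such a schedule under the simplifying assumption that each stage takes exactly one cycle (real C6x code must also honor the delay slots); the stage names and `pipelined_schedule` are hypothetical.

```python
def pipelined_schedule(stages, iterations):
    """Return a list of cycles; each cycle lists (stage, iteration) pairs.

    Iteration i runs stage number s in cycle i + s, so stages of
    different iterations overlap.
    """
    total = iterations + len(stages) - 1
    schedule = []
    for cycle in range(total):
        active = [(stage, cycle - s)
                  for s, stage in enumerate(stages)
                  if 0 <= cycle - s < iterations]
        schedule.append(active)
    return schedule

for cycle, active in enumerate(pipelined_schedule(["LD", "MPY", "ADD"], 4)):
    print(cycle, active)
```

The first two cycles (pipeline filling) correspond to the prolog, the cycles with all three stages active form the steady-state loop, and the remaining cycles drain the pipeline. Four iterations finish in 6 cycles instead of 12.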

51
Effects on C6x DSP Software Development
  • DSP Hardware is often used in multiprocessor
    configurations with communication links.
  • DSP software is increasingly large and complex.
  • Assembly code is difficult.
  • More reliance on high-level languages.
  • Increasing use of graphical environments (e.g.
    Matlab Simulink, Hyperception RIDE) to develop
    algorithms.
  • Increased use of RTOSs.
  • Packaging and sale of algorithms as IP blocks.
  • Standard interface (e.g. TI ExpressDSP Algorithm
    standard)