Title: Programmable Digital Signal Processor II
1 Programmable Digital Signal Processor (II)
Based on presentations by S. Kittitornkun
2 Outline
- Programmable DSP: What and Why?
- TMS320C8x: C80, C82
- TMS320C6x: C62x, C67x
- Target Applications
- Application: H.324 on TMS320C82
- Current Multimedia Processors
- References
3 What is PDSP?
- A special-purpose, programmable microprocessor designed for DSP applications
- Features
  - Specialized instruction sets
    - Complex instructions lead to smaller programs
    - Instruction-level parallelism (ILP)
  - Specialized hardware support
    - Fast/parallel input/output support for media processing
    - Special ALUs, function units for bit operations, etc.
    - Special memory and bus transfer architecture
    - On-board co-processors, etc.
4 Why PDSP?
- Higher performance than general-purpose microprocessors for specialized (embedded) applications
- Software implementation offers more flexibility for product upgrade and migration than an ASIC
- Low cost
  - Lower per-unit cost than general-purpose microprocessors
  - Lower overall cost than an ASIC for lower-volume products
5 TMS320C8x: Overview
- RISC master processor (MP) at 50 and 60 MHz
- Parallel processors (PPs): 2 (4 for C80)
- Transfer controller: DMA and memory controller
- Video controller (C80 only)
6 TMS320C8x Master Processor
- 32-bit RISC instructions, 64-bit data
- Scoreboarded 31 general-purpose registers and a zero register
- IEEE 754 floating-point unit
- Supports vector floating-point (FP) operations
- Performs a single-precision floating-point MAC in 1 cycle: 100 MFLOPS at 50 MHz
- Suitable for control protocols and FP-intensive algorithms
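The 100 MFLOPS figure can be reproduced with simple arithmetic, assuming a multiply-accumulate is counted as two floating-point operations (one multiply and one add). The function name below is illustrative only.

```python
def peak_mflops(clock_mhz, flops_per_cycle=2):
    """Peak MFLOPS if one MAC (multiply + add = 2 flops) completes every cycle."""
    return clock_mhz * flops_per_cycle


print(peak_mflops(50))  # 100
```

This is a peak rate under that counting assumption; sustained throughput depends on how often the code can actually issue a MAC every cycle.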
7 TMS320C8x Processor Communication
[Figure labels: x86 host processor, ports, master processor, kernel, tasks, signals, semaphore, function, PP0, PP1]
- Shared-memory multiprocessor
- MP sends commands through command buffers located in shared memory
8 TMS320C8x Parallel Processor
- Data unit: 32-bit datapath, ALU, multiplier, etc.
- 2 independent address units: global and local
  - 1-cycle on-chip memory access (when there is no conflict)
  - 1-cycle load/store of byte, halfword, and word
  - Internal adder can offload data unit computation
- Program flow control unit
9 TMS320C8x Parallel Processor: Program Flow Control Unit
- 3-stage pipeline
  - Instruction fetch
  - Address generation
  - Operation execution
- Conditional execution of data unit operations, moves, loads from memory, and branches
- PC is mapped into the register file
- To minimize overhead, the loop controller supports 3 levels of nested loops
10 TMS320C8x Parallel Processor: Data Unit
- Split 32-bit 3-input ALU: Boolean and arithmetic operations
- Split and rounded multiplier: dual 8x8 = 16-bit, 16x16 = 32-bit
- Flexible datapath: barrel rotator, mask generator
- Supports signed, unsigned, and saturating arithmetic
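As a rough illustration of what "split" and "saturate" mean here, the sketch below adds two 32-bit words as four independent unsigned bytes, with no carry crossing a byte boundary and with optional clamping at 255. It models the idea only and is not the C8x instruction semantics; the function name and parameters are hypothetical.

```python
def split_add_u8(a, b, saturate=False):
    """Add two 32-bit words lane by lane as four unsigned bytes."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        s = x + y
        # Either clamp at the lane maximum or wrap around inside the lane.
        s = min(s, 0xFF) if saturate else s & 0xFF
        result |= s << shift
    return result


print(hex(split_add_u8(0x01FF7F10, 0x01017F20)))                 # 0x200fe30
print(hex(split_add_u8(0x01FF7F10, 0x01017F20, saturate=True)))  # 0x2fffe30
```

In the second-highest byte, 0xFF + 0x01 wraps to 0x00 in the first call and clamps to 0xFF in the second, which is the behavior pixel arithmetic usually wants.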
11 TMS320C8x Parallel Processor: Data Unit 3-Input ALU
- Supports 512 operations in total: 256 Boolean and 256 arithmetic
- Boolean: any OR of the eight product terms of A, B, and C, with function bits F0 to F7 each enabling one term
- Arithmetic: A + f1(B,C) + f2(B,C) + 1
- Examples
  - A + B + 1
  - (A & C) + (B & C): mask A and B by C, then add
  - A + ((B & C) | (-B & ~C)): multiple-byte A + B
  - A - ((B & C) | (-B & ~C)): multiple-byte A - B
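The count of 256 Boolean operations matches the 2^8 possible truth tables of a three-input function: eight function bits each enable one product term of A, B, and C. The sketch below models that idea bitwise on 32-bit words. The mapping of bit i of `func` to a particular product term is an assumption made for illustration and may differ from the ordering used by the C8x; `alu_boolean` is a hypothetical name.

```python
def alu_boolean(func, a, b, c, width=32):
    """Evaluate a 3-input Boolean function, given as an 8-bit truth table, bitwise."""
    mask = (1 << width) - 1
    out = 0
    for i in range(8):
        if (func >> i) & 1:
            # Assumed ordering: bit 0 of i selects A, bit 1 selects B, bit 2 selects C;
            # a cleared bit selects the complemented input.
            ta = a if i & 1 else ~a
            tb = b if i & 2 else ~b
            tc = c if i & 4 else ~c
            out |= ta & tb & tc
    return out & mask


A, B, C = 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF
print(alu_boolean(0x80, A, B, C) == (A & B & C))  # True: only the A&B&C term is enabled
print(alu_boolean(0x96, A, B, C) == (A ^ B ^ C))  # True: the four odd-parity terms are enabled
print(2 ** 8)                                     # 256
```

Under the same assumed ordering, the table 0xAC gives a bitwise multiplexer that takes A where C is 1 and B where C is 0.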
12 TMS320C8x Parallel Processor: Instruction Set
- 64-bit opcode contains multiple sub-instructions for
  - Data unit
  - Global address unit
  - Local address unit
- Example: d4d5d6>>d0 a8d7 d0(a0x1)
13 TMS320C8x Transfer Controller
- Prioritizes, schedules, and transfers data and cache contents between on- and off-chip memories
- Handles data cache (on-chip RAM) and instruction cache misses
- Supports multidimensional data transfers
  - From a simple contiguous linear sequence up to a 3D region
- Memory interface supports a wide range of memory systems
  - DRAM, SDRAM, video RAM, and SRAM
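A multidimensional transfer can be pictured as nested address generation: a contiguous run of bytes, repeated at a fixed pitch to form a 2D patch, repeated again at a second pitch to form a 3D region. The sketch below is a generic model of that idea; the parameter names (`a_count`, `b_pitch`, and so on) are hypothetical and are not the transfer controller's actual packet fields.

```python
def region_addresses(start, a_count, b_count=1, b_pitch=0, c_count=1, c_pitch=0):
    """Yield byte addresses of a 1D, 2D, or 3D region in transfer order."""
    for c in range(c_count):          # third dimension, e.g. frames or planes
        for b in range(b_count):      # second dimension, e.g. image lines
            base = start + c * c_pitch + b * b_pitch
            for a in range(a_count):  # first dimension: contiguous bytes
                yield base + a


# A 4-byte-wide, 2-line patch from an image whose lines are 16 bytes apart.
print(list(region_addresses(100, 4, b_count=2, b_pitch=16)))
# [100, 101, 102, 103, 116, 117, 118, 119]
```

With the default counts of 1 the generator degenerates to the simple contiguous linear sequence.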
14 TMS320C8x Video Controller (C80 only)
- Provides simultaneous control over two independent capture or display systems and frame grabber or frame buffer image storage
- Dual frame timers
- Programmable timing and control registers
- Programmable line interrupt to the MP
15 TMS320C8x Development Tools
- C-like compilers and assemblers for both master and parallel processors
- Register allocator
  - Identifies live and free registers
  - Allows using variable names in assembly code
  - Assigns a specific register to a variable
- Code compactor converts straight-line assembly code into parallel code
- Optimization can be done by hand for time-critical parallel code
16 TMS320C8x Execution Time for 256-Point FFT
- "-C" indicates performance with the cache pre-loaded
- Benchmark results for the TMS320C80 are for one of the on-chip DSP processors
17 TMS320C6x VelociTI: Overview
- VLIW DSPs
- TMS320C62x: fixed-point DSPs
- TMS320C67x: floating-point DSPs
18 TMS320C6x VelociTI: Key Features
- Issues and executes up to 8 instructions every cycle
- Load/store architecture
- 32-bit RISC instructions, 32-bit data
- Conditional instructions
  - Reduce costly branching
  - Increase parallelism for higher sustained performance
- Instruction packing
  - Reduces code size, program fetches, and power consumption
19 TMS320C6x VelociTI: Datapath
20 TMS320C6x VelociTI: Datapath
- Two register files
  - 16 x 32 bits each
  - Each supports 10 simultaneous reads and 6 writes
- Two sets of identical functional units (8 units in total)
  - L: logic functions, bit counting, and add/sub
  - S: shifting, bit manipulation, branch/control, and add/sub
  - D: addressing and add/sub
  - M: multiplication
- Grouping of functional units reduces the number of register ports
21 TMS320C6x VelociTI: Instruction Set
- 32-bit RISC-like opcode format
  - creg: conditional register
  - z: zero or nonzero
  - dst: destination
  - src1/2: source 1 and 2
  - cst: constant
  - x: use cross path for src2
  - s: side A or B for destination
  - op: operation
- An instruction can be conditioned on the value of A1, A2, B0, B1, or B2
- Each instruction takes 1 cycle to execute, except double-precision operations on the C67x
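To make the opcode fields concrete, the sketch below pulls the common fields out of a 32-bit instruction word. The slides name the fields but give no bit positions, so the positions used here are an assumption based on the commonly documented layout of the C6000 .L-unit format (creg in bits 31-29, z in bit 28, dst in 27-23, src2 in 22-18, src1/cst in 17-13, x in bit 12, s in bit 1, and the parallel bit p in bit 0). Check the TI instruction set reference before relying on them; `decode_fields` is a hypothetical helper.

```python
def decode_fields(word):
    """Extract common fields from a 32-bit opcode (bit positions are assumed)."""
    return {
        "creg": (word >> 29) & 0x7,   # which register conditions the instruction
        "z":    (word >> 28) & 0x1,   # test for zero or nonzero
        "dst":  (word >> 23) & 0x1F,  # destination register
        "src2": (word >> 18) & 0x1F,  # source 2
        "src1": (word >> 13) & 0x1F,  # source 1 or constant
        "x":    (word >> 12) & 0x1,   # use cross path for src2
        "s":    (word >> 1) & 0x1,    # side A or B
        "p":    word & 0x1,           # parallel bit
    }


# Build a word with known field values and decode it again.
word = (1 << 29) | (5 << 23) | (3 << 18) | (4 << 13) | (1 << 12) | (1 << 1) | 1
print(decode_fields(word))
```

The operation field and the unit-specific bits are left out on purpose, since their position and width vary by functional unit.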
22 TMS320C6x VelociTI: Instruction Packing
- Fetch packet: eight 32-bit instructions are fetched simultaneously
23 TMS320C6x VelociTI: Instruction Packing
- Execute packet: indicated by the p-bit (parallel bit)
  - 1: in parallel
  - 0: not in parallel
- Example (figure)
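The p-bit rule can be sketched as follows: a p-bit of 1 on an instruction means the next instruction executes in parallel with it, and a p-bit of 0 closes the current execute packet. The function below splits one fetch packet into execute packets from its eight p-bits. It assumes, as on the original C62x/C67x, that an execute packet does not cross a fetch packet boundary; the function name is hypothetical.

```python
def execute_packets(p_bits):
    """Group instruction indices of one fetch packet into execute packets."""
    packets, current = [], []
    for index, p in enumerate(p_bits):
        current.append(index)
        if p == 0:            # 0: the next instruction is not in parallel
            packets.append(current)
            current = []
    if current:               # assumption: a packet ends at the fetch packet boundary
        packets.append(current)
    return packets


print(execute_packets([1, 1, 1, 1, 1, 1, 1, 0]))  # [[0, 1, 2, 3, 4, 5, 6, 7]]
print(execute_packets([0, 0, 0, 0, 0, 0, 0, 0]))  # [[0], [1], [2], [3], [4], [5], [6], [7]]
print(execute_packets([0, 0, 1, 1, 0, 1, 1, 0]))  # [[0], [1], [2, 3, 4], [5, 6, 7]]
```

The first case is fully parallel (one cycle for all eight instructions), the second fully serial, and the third a mix.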
24 TMS320C6x VelociTI: Pipeline
- Deep pipeline: 3 stages with 16 phases in total
  - Fetch: 4 phases (PG, PS, PW, PR)
  - Decode: 2 phases (DP, DC)
  - Execute: up to 10 phases (E1 to E10)
- No stalls except on a cache miss or external access
- Performs load after store to the same memory location
- Each branch takes 5 cycles, whether taken or not taken
25 TMS320C6x VelociTI: Memory Hierarchy
- Internal program memory is configurable
  - Mapped memory or direct-mapped cache
  - 16K 32-bit instructions or 2K 256-bit fetch packets
- Internal data memory
  - 2 blocks of 4 interleaved 8-Kbyte banks
- DMA controller: 800 Mbytes/s peak
  - Transfers between on-chip memories, peripherals, and external memory
- EMIF (External Memory Interface)
  - 800 Mbytes/s peak
  - Supports SBSRAM, SDRAM, etc.
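The two descriptions of the program memory are the same capacity counted in different units, and the data memory works out to the same size. A quick check, assuming K means 1024 (variable names are illustrative):

```python
KBYTE = 1024

program_as_instructions = 16 * 1024 * 32 // 8    # 16K instructions of 32 bits each
program_as_fetch_packets = 2 * 1024 * 256 // 8   # 2K fetch packets of 256 bits each
data_memory = 2 * 4 * 8 * KBYTE                  # 2 blocks x 4 banks x 8 Kbytes

print(program_as_instructions // KBYTE)   # 64
print(program_as_fetch_packets // KBYTE)  # 64
print(data_memory // KBYTE)               # 64
```

One 256-bit fetch packet holds 256 / 32 = 8 instructions, which is why 16K instructions correspond to 2K fetch packets.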
26 TMS320C6x VelociTI: Peripherals
- McBSP (Multichannel Buffered Serial Port)
  - Two independent 100 Mbits/s full-duplex serial ports
  - Supports standards such as ST-BUS and AC97 audio codec
- Timers
  - Two programmable 32-bit timers
- Host Port Interface
  - 100 Mbytes/s 16-bit bidirectional port to standard processors
- Power-down modes 1, 2, 3
  - Reduce power consumption
27 TMS320C6x VelociTI: Programming
- Includes C compiler, assembler, optimizer, and debuggers in a software simulator
- 72-82% efficiency compared with handwritten assembly code
- Optimization techniques
  - Intrinsic functions in the C compiler
  - Software pipelining
  - If/else and case conversion to conditional instructions
- Data types (by compiler)
  - long: 40 bits
  - int: 32 bits
  - short: 16 bits
  - char: 8 bits
28 Target Applications
- Video: DVD, MPEG-1 and MPEG-2 decoding
- Audio: Dolby AC-3, 3D audio, MPEG decode, wavetable synthesis
- Graphics: 2D and 3D acceleration
- Communication
  - Vocoder
  - ADSL, fax/modem V.34, 56k
  - Echo canceller
- Desktop videoconferencing
  - H.320 on ISDN
  - H.324 on POTS (Plain Old Telephone Service)
29 H.324 on TMS320C82: Overview
- ITU-T H.324: low-bit-rate multimedia teleconferencing on circuit-switched networks; includes
  - G.723: audio coding at 5.3-6.4 kbps; requires 18-20 fixed-point MIPS
  - H.263: video coding based on H.261, with some enhancements
  - H.223: MUX/DEMUX control
  - H.245: control protocol
  - V.34: modem, up to 33.6 kbps
- Other related standards: H.320 (ISDN), H.323 (LAN), and H.310 (ATM/B-ISDN)
30 H.324 on TMS320C82: Overview
31 H.324 on TMS320C82: Task Partitioning
- Video processing (H.263)
  - Encoding
    - Pre-processing: MP
    - Motion estimation: PP0
    - DCT: PP0
  - Decoding
    - Huffman or arithmetic decode, IDCT, etc.: PP0
  - Post-processing: PP0
- Audio processing and AEC (Acoustic Echo Cancellation): PP1
  - G.723
    - Encoding: 22 MIPS
    - Decoding: 3 MIPS
  - AEC: LMS algorithm, up to 64-ms echo, 10 MIPS
- Modem V.34: 20 MIPS, PP1
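Adding up the MIPS figures quoted for the tasks assigned to PP1 gives a rough load estimate for that processor. The sketch assumes the loads simply add, with no overlap and no scheduling overhead; the dictionary keys are labels chosen here.

```python
pp1_tasks_mips = {
    "G.723 encode": 22,
    "G.723 decode": 3,
    "AEC (LMS)": 10,
    "V.34 modem": 20,
}

total_mips = sum(pp1_tasks_mips.values())
print(total_mips)  # 55
```

Audio coding, echo cancellation, and the modem together are budgeted at about 55 MIPS on PP1 under these assumptions, which leaves PP0 free for the H.263 video work.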
32 H.324 on TMS320C82: Task Partitioning
33 Current Multimedia Processors
- Digital signal processor => multimedia processor
- Employ RISC instruction sets and pipelining to reach higher clock frequencies
- Perform operations on single and multiple bytes of data
- Try to exploit more parallelism through static instruction-level parallelism (ILP) rather than dynamic ILP
- Increasingly concerned with data movement and I/O interfaces
- Pay more attention to low-power design
- The PC/consumer market is one of their primary targets
34 Current Multimedia Processors
35 References
- TMS320C8x
  - J. Golston, "Single-chip H.324 video conferencing," IEEE Micro, August 1996, pp. 42-50
  - Texas Instruments, TMS320C80 Data Sheet, 1997, at http://www.ti.com/../sprs023b.pdf
  - P. Lapsley and G. Blalock, "How to estimate DSP processor performance," IEEE Spectrum, July 1996, pp. 74-78
    - HTML file: http://www.bdti.com/../wpeval.html
- TMS320C6x
  - N. Seshan, "High VelociTI Processing," IEEE Signal Processing Magazine, March 1998, pp. 86-101
  - TMS320C6x data sheet
36 References
- Trimedia
  - G. A. Slavenburg, "The Trimedia TM-1 PCI VLIW Mediaprocessor," IEEE Hot Chips 8 Symposium on High-Performance Chips, Aug. 1996
    - http://infopad.eecs.berkeley.edu/HotChips8/
- MSP
  - L. T. Nguyen, M. Mohamed, H. Park, Y. Pal, R. Wong, A. Qureshi, P. Psong, F. Valesco, H. D. Truong, C. Reader, "Multi-media Signal Processor (MSP) Summary," IEEE Hot Chips 8 Symposium on High-Performance Chips, Aug. 1996
    - http://infopad.eecs.berkeley.edu/HotChips8/
- H.324
  - D. Lindbergh, "The H.324 multimedia communication standard," IEEE Communications Magazine, December 1996, pp. 46-51
  - K. Rijkse, "H.263: Video coding for low-bit-rate communication," IEEE Communications Magazine, December 1996, pp. 42-45
37 Useful Links
- CPU Information Center
  - http://infopad.eecs.berkeley.edu/CIC/
- Microprocessor Report
  - http://www.chipanalyst.com/q/
- Berkeley Design Technology Inc.
  - http://www.bdti.com/
- Peter Pirsch's research group
  - http://www.mst.uni-hannover.de/