Title: Digital Signal Processor: Architectures and Applications
1Digital Signal Processor Architectures and
Applications
- ECE734 VLSI Array Structure for Digital Signal Processing, Spring 1998
- By Surin Kittitornkun
- Apr. 28, 1998
2Contents
- Programmable DSP: Why?
- TMS320C8x: C80, C82
- MPACT-R3600 -2
- TMS320C6x: C62x, C67x
- Target Applications
- Application: H.324 on TMS320C82
- Current Multimedia Processors
- References
3Programmable DSP: Why?
- More flexibility: changes can be made in software
- Less complexity: shorter time to market
- More cost efficient than ASIC design
- Requires software development
- May consume more power
- PC and consumer product
4TMS320C8x Overview
- RISC master processor at 50 and 60 MHz
- Parallel processors: 2 (4 for C80)
- Transfer controller: DMA and memory controller
- Video controller (C80 only)
5TMS320C8x Master Processor
- 32-bit RISC instruction/64-bit data
- Scoreboarded 31 GP registers and a zero register
- IEEE 754 floating point
- Supports vector FP operations
- Performs single-precision FP MAC in 1 cycle: 100 MFLOPS (at 50 MHz)
- Suitable for control protocols and FP-intensive algorithms
6TMS320C8x Processor communication
- Shared memory multiprocessor
- MP sends commands through command buffers located
in shared memory
7TMS320C8x Parallel Processor
- Data unit: 32-bit datapath, ALU, multiplier, etc.
- 2 independent address units: global and local
- 1 cycle on-chip memory access (no conflict)
- 1 cycle load/store of byte, halfword, and word
- Internal adder can offload data unit computation
- Program flow control unit
8TMS320C8x Parallel Processor Program Flow Control Unit
- 3-stage pipelining
- Instruction fetch
- Address generation, and
- Operation execution
- Conditional operation of data unit operations, moves, loads from memory, and branches
- PC is mapped into the register file
- To minimize overhead, the loop controller supports 3 levels of nested loops
9TMS320C8x Parallel Processor Data Unit
- Split 32-bit 3-input ALU: Boolean and arithmetic operations
- Split and rounded multiplier: dual 8x8 (16-bit results), 16x16 (32-bit result)
- Flexible datapath: barrel rotator, mask generator
- Supports signed, unsigned, and saturating arithmetic
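The split ALU and saturating arithmetic listed above can be illustrated with a small software model. The sketch below is only an illustration in Python, not a description of the C8x hardware: `split_add` is a hypothetical helper that treats a 32-bit word as independent unsigned lanes (four 8-bit lanes by default) and adds lane by lane, either wrapping or saturating.

```python
def split_add(a, b, lane_bits=8, word_bits=32, saturate=False):
    """Add two words lane by lane; no carry crosses a lane boundary."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        if saturate:
            lane_sum = min(lane_sum, lane_mask)  # clamp at the unsigned maximum
        else:
            lane_sum &= lane_mask                # wrap around, dropping the carry
        result |= lane_sum << shift
    return result


wrapped = split_add(0x01FF02FE, 0x01010101)
clamped = split_add(0x01FF02FE, 0x01010101, saturate=True)
```

With wrapping, the lane holding 0xFF overflows to 0x00 without disturbing its neighbour; with saturation it stays at 0xFF.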
10TMS320C8x Parallel Processor Data Unit 3-input ALU
- Supports 512 operations in total: 256 Boolean and 256 arithmetic
- Boolean: eight terms in A, B, and C, selected by F0 through F7
- Arithmetic: built from A, f1(B,C), f2(B,C), and 1
- Examples
- AB1
- (A&C)+(B&C): mask A and B by C and then add
- A((BC) (-BC)): multiple-byte AB
- A-((BC) (-BC)): multiple-byte A-B
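The count of 256 Boolean operations matches the number of distinct truth tables over three inputs (2^8). The sketch below models a three-input Boolean ALU driven by an 8-bit function code. It is an assumption-laden illustration: the function name `alu3_boolean` and the mapping of code bits to input combinations (bit index = 4·A + 2·B + C) are choices made here, not the C8x encoding.

```python
def alu3_boolean(f, a, b, c, width=32):
    """Apply the 3-input Boolean function selected by the 8-bit code f, bitwise."""
    mask = (1 << width) - 1
    result = 0
    for index in range(8):
        if not (f >> index) & 1:
            continue                      # this input combination produces 0
        term = mask
        term &= a if index & 4 else ~a    # A true or complemented
        term &= b if index & 2 else ~b    # B true or complemented
        term &= c if index & 1 else ~c    # C true or complemented
        result |= term
    return result & mask


# Under the assumed bit ordering, 0xE4 selects "A where C is 1, else B".
merged = alu3_boolean(0xE4, 0xAAAA, 0x5555, 0xFF00, width=16)
```

A single code thus covers AND, XOR, multiplexing, and every other bitwise combination of the three inputs.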
11TMS320C8x Parallel Processor Instruction Set
- 64-bit opcode contains multiple subinstructions for
- Data unit
- Global address unit and
- Local address unit
- Ex: d4d5d6>>d0 a8d7 d0(a0x1)
12TMS320C8x Transfer Controller
- Prioritizes, schedules, and transfers data between on- and off-chip memories
- Handles data cache (on-chip RAM) and instruction cache misses
- Supports multidimensional data transfers
- From a simple contiguous linear sequence up to a 3D region
- Memory interface supports a wide range of memory systems
- DRAM, SDRAM, Video RAM, and SRAM
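A multidimensional transfer of the kind listed above can be pictured as nested address generation. The following sketch is a generic illustration, not the transfer controller's actual parameter format: `region_addresses` and its parameter names (`a_count`, `b_pitch`, and so on) are hypothetical.

```python
def region_addresses(base, a_count, b_count=1, b_pitch=0, c_count=1, c_pitch=0):
    """Yield byte addresses for a transfer of up to three dimensions.

    a_count: contiguous bytes per line
    b_count: lines per plane, spaced b_pitch bytes apart
    c_count: planes, spaced c_pitch bytes apart
    """
    for plane in range(c_count):
        for line in range(b_count):
            start = base + plane * c_pitch + line * b_pitch
            for offset in range(a_count):
                yield start + offset


linear = list(region_addresses(100, 4))                      # 1D: contiguous run
block = list(region_addresses(0, 2, b_count=3, b_pitch=10))  # 2D: 3 lines of 2 bytes
```

Leaving the outer counts at 1 degenerates to the simple contiguous case, so one routine covers 1D, 2D, and 3D regions.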
13TMS320C8x Video Controller (c80 only)
- Provides simultaneous control over two independent capture or display systems and frame grabber or frame buffer image storage
- Dual frame timers
- Programmable timing and control registers
- Programmable line interrupt to MP
14TMS320C8x Development Tools
- C-like compilers and assemblers for both master and parallel processors
- Register allocator
- identifies live and free registers
- allows using variable names in assembly code and
- assigns specific registers to variables
- Code compactor: converts straight-line assembly code into parallel code
- Optimization can be done by hand for time-critical parallel code
15TMS320C8x Execution Time for 256-Point FFT
- "-C" indicates performance with the cache pre-loaded
- Benchmark results for the TMS320C80 are for one of the on-chip DSP processors
16MPACT-R3600 -2 Overview
- VLIW CPU
- Multimedia ISA
- Hardware/Software relationship
- Variety of high speed I/O interfaces
17MPACT-R3600 -2 CPU Datapath
- Data size: multiple of 9 bits
- 512-entry, 72-bit register file with 4 read and 4 write ports
- ALU1: shift and align
- ALU2: add and logic
- ALU3: arithmetic and logic
- ALU4: stage 1 of multiplication
- ALU5: motion estimation
- Full crossbar between ALU outputs, inputs, register read and write ports
18MPACT-R3600 Datapath
- Register file: SRAM (512 entries), 4 read ports, 4 write ports
- ALU group 1: shift and align
- ALU group 2: add and logic
- ALU group 3: multiply and add
- ALU group 4: stage 1 of multiplication
- ALU group 5: motion estimation
19MPACT-R3600 -2 Multimedia ISA
- Issues a two-instruction pack of 72 bits every cycle
- Data forwarding from one ALU to another
- Vector instructions (length up to 255)
- Multimedia data bytes of 9, 18, 27, and 36 bits
- Supports signed, unsigned, and saturating arithmetic
- MPACT 2 includes single-precision FP for 3D graphics
- Flow control: branch, jump, and call
- Special-purpose instructions
- Motion estimation
- IDCT
- Butterfly FFT, etc.
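To make the motion estimation entry concrete, the sketch below shows block matching with a sum of absolute differences (SAD), a common cost metric for this task. The slide does not say which metric the MPACT instruction computes, so SAD is an assumption here, and `sad` and `best_match` are hypothetical helper names.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))


def best_match(frame, block, search_positions):
    """Return the (top, left) position in frame whose block has the lowest SAD."""
    size = len(block)

    def candidate(top, left):
        return [row[left:left + size] for row in frame[top:top + size]]

    return min(search_positions, key=lambda pos: sad(candidate(*pos), block))


frame = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
target = [[6, 7], [10, 11]]
positions = [(r, c) for r in range(3) for c in range(3)]
motion_vector = best_match(frame, target, positions)
```

The inner loop is many small subtract, absolute-value, and accumulate steps, which is why a dedicated ALU group for it pays off.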
20MPACT-R3600 -2 Hardware/Software Relationship
- Requires a host x86 CPU
- Mediaware: uses standard APIs
- RM: Resource Manager running under Windows
- MRK: MPACT real-time kernel
- Nearest-deadline scheduling algorithm
- Interrupt-driven kernel with a worst-case context switch time of 4 µs
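Nearest-deadline scheduling always runs the ready task whose deadline comes first. The sketch below is a minimal illustration using a priority queue. It is not the MRK implementation, and the class name, method names, and task names are hypothetical.

```python
import heapq


class NearestDeadlineScheduler:
    """Pick the ready task with the earliest deadline."""

    def __init__(self):
        self._queue = []
        self._counter = 0  # tie-breaker: equal deadlines run in insertion order

    def add_task(self, name, deadline_us):
        heapq.heappush(self._queue, (deadline_us, self._counter, name))
        self._counter += 1

    def next_task(self):
        """Remove and return (name, deadline_us), or None if nothing is ready."""
        if not self._queue:
            return None
        deadline_us, _, name = heapq.heappop(self._queue)
        return name, deadline_us


scheduler = NearestDeadlineScheduler()
scheduler.add_task("audio", 500)
scheduler.add_task("video", 33000)
scheduler.add_task("modem", 200)
order = [scheduler.next_task()[0] for _ in range(3)]
```

In an interrupt-driven kernel, `next_task` would be consulted on each interrupt, which is where a short context switch time matters.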
21MPACT-R3600 -2 Hardware/Software Relationship
22MPACT-R3600 -2 High speed I/O interface
- PCI bus or AGP (Accelerated Graphics Port)
- x86 Host CPU bus
- 66 MHz: 264 Mbytes/s
- Rambus memory interface
- 300 MHz bus (9 bits wide) on both edges: 600 Mbytes/s
- Requires 2-4 Mbytes
- Display controller
- 24-bit RAMDAC
- High resolution: up to 1280x1024 24-bit or 1600x1200 16-bit
- Video interface
- Accepts NTSC and PAL format video or
- DVD input through PCI or AGP
- Programmable peripheral I/O interface
- Supports connection to several devices
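The two bandwidth figures on this slide follow from clock rate times bytes per transfer. The sketch below reproduces them under two assumptions that the slide does not state: the 66 MHz host-side bus is taken to be 32 bits (4 bytes) wide, and the 9-bit Rambus channel is taken to carry 8 data bits per transfer. The function name is hypothetical.

```python
def peak_bandwidth_mbytes(clock_mhz, bytes_per_transfer, transfers_per_clock=1):
    """Peak bandwidth in Mbytes/s for a bus with the given clock and width."""
    return clock_mhz * bytes_per_transfer * transfers_per_clock


host_bus = peak_bandwidth_mbytes(66, 4)    # assumed 32-bit data path
rambus = peak_bandwidth_mbytes(300, 1, 2)  # assumed 1 data byte, both clock edges
```

Under these assumptions the results are 264 and 600 Mbytes/s, matching the slide.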
23MPACT-R3600 -2 Architecture trade-offs
- High-speed I/O to move data in and out
- No data cache but a large register file
- Multimedia data has poor locality
- Based on standard APIs (Application Program Interfaces) of Microsoft Windows: no proprietary API
- Pin count vs. high memory bandwidth/low latency
- RDRAM is chosen
- PC and consumer market
24TMS320C6x VelociTI Overview
- VLIW DSPs
- TMS320C62x: fixed-point DSPs
- TMS320C67x: floating-point DSPs
25TMS320C6x VelociTI Key features
- Issues and executes up to 8 instructions every cycle
- Load/store architecture
- 32-bit RISC instruction / 32-bit data
- Conditional instructions
- reduce costly branching
- increase parallelism for higher sustained performance
- Instruction packing
- Reduces code size, program fetches, and power consumption
26TMS320C6x VelociTI Datapath
27TMS320C6x VelociTI Datapath
- Two register files
- 16 x 32 bits
- Each supports 10 simultaneous reads and 6 writes
- Two sets of identical functional units: 8 units
- L: logic functions, bit counting, and add/sub
- S: shifting, bit manipulation, branch/control, and add/sub
- D: addressing and add/sub
- M: multiplication
- Grouping of functional units reduces the number of register ports
28TMS320C6x VelociTI Instruction set
- 32-bit RISC-like opcode format
- creg: conditional register
- z: zero or nonzero
- dst: destination
- src1/2: source 1 and 2
- cst: constant
- x: use cross path for src2
- s: side A or B for destination
- op: operation
- Instructions can be conditioned on the value of A1, A2, B0, B1, or B2
- Each instruction takes 1 cycle to execute, except double-precision operations in C67x
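The creg and z fields allow a branch to be replaced by two predicated instructions. The sketch below models that idea in Python. It assumes that z = 1 means "execute when the condition register is zero" and z = 0 means "execute when it is nonzero"; the tuple layout and the helper names `should_execute`, `run`, and `select_max` are hypothetical and are not the C6x encoding.

```python
def should_execute(creg_value, z):
    """Decide whether a conditional instruction takes effect."""
    if creg_value is None:                     # no condition register: unconditional
        return True
    return (creg_value == 0) if z else (creg_value != 0)


def run(instructions, registers):
    """Execute (creg, z, dst, operation) tuples in order on a register dict."""
    for creg, z, dst, operation in instructions:
        value = None if creg is None else registers[creg]
        if should_execute(value, z):
            registers[dst] = operation(registers)
    return registers


def select_max(a, b):
    """Branch-free form of: if a > b then A5 = a else A5 = b."""
    program = [
        (None, 0, "B0", lambda r: int(r["A3"] > r["A4"])),  # compare into B0
        ("B0", 0, "A5", lambda r: r["A3"]),                 # runs when B0 != 0
        ("B0", 1, "A5", lambda r: r["A4"]),                 # runs when B0 == 0
    ]
    return run(program, {"A3": a, "A4": b, "A5": 0, "B0": 0})["A5"]
```

Both predicated moves are issued on every pass and exactly one takes effect, so no branch is needed.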
29TMS320C6x VelociTI Instruction packing
- Fetch packet: eight 32-bit instructions are fetched simultaneously
30TMS320C6x VelociTI Instruction packing
- Execute packet: indicated by the p-bit (parallel bit)
- 1: in parallel
- 0: not in parallel
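The p-bit grouping can be sketched as a simple scan over a fetch packet. The code below assumes that a p-bit of 1 on instruction i means instruction i+1 executes in the same cycle as instruction i, and that a p-bit of 0 closes the current execute packet. The function name `execute_packets` is hypothetical.

```python
def execute_packets(p_bits):
    """Group instruction indices of one fetch packet into execute packets."""
    packets, current = [], []
    for index, p in enumerate(p_bits):
        current.append(index)
        if p == 0:                 # chain ends: this packet is complete
            packets.append(current)
            current = []
    if current:                    # trailing chain whose last p-bit was still 1
        packets.append(current)
    return packets


groups = execute_packets([1, 1, 0, 1, 0, 0, 1, 0])
```

Here eight fetched instructions issue over four cycles as packets of 3, 2, 1, and 2 instructions.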
31TMS320C6x VelociTI Pipeline
- Deep pipeline: 3 stages with up to 16 phases
- Fetch: 4 phases (PG, PS, PW, PR)
- Decode: 2 phases (DP, DC)
- Execute: up to 10 phases (E1 to E10)
- No stalls except on a cache miss or external access
- Performs load after store to the same memory location
- Each branch takes 5 cycles, whether taken or not taken
32TMS320C6x VelociTI Memory Hierarchy
- Internal program memory is configurable
- Mapped memory or direct-mapped cache
- 16 K 32-bit instructions or 2 K 256-bit fetch packets
- Internal data memory
- 2 blocks of four 8-Kbyte interleaved banks
- DMA controller: 800 Mbytes/s peak
- Transfers between on-chip memories, peripherals, and external memory
- EMIF (External Memory Interface): 800 Mbytes/s peak
- Supports SBSRAM, SDRAM, etc.
33TMS320C6x VelociTI Peripherals
- McBSP (Multichannel Buffered Serial Port)
- Two independent 100 Mbits/s full-duplex serial ports
- Supports standards: ST-BUS, AC97 audio codec, etc.
- Timers
- Two programmable 32-bit timers
- Host port interface
- 100 Mbytes/s 16-bit bidirectional port to standard processors
- Power-down modes 1, 2, 3
- Reduce power consumption
34TMS320C6x VelociTI Programming
- Includes C compiler, assembler, optimizer, and debuggers in software simulator
- 72-82% efficiency compared to handwritten assembly code
- Optimization techniques
- Intrinsic functions in C compiler
- Software pipelining
- If..Else and Case conversion to conditional instructions
- Data types (by compiler)
- long: 40 bits
- int: 32 bits
- short: 16 bits
- char: 8 bits
35 Target Applications
- Video: DVD, MPEG-1 and MPEG-2 decoding
- Audio: Dolby AC-3, 3D audio, MPEG decode, wavetable synthesis
- Graphics: 2D and 3D acceleration
- Communication
- Vocoder
- ADSL, Fax/MODEM V.34, 56k
- Echo canceller
- Desktop videoconferencing
- H.320 ISDN
- H.324 on POTS (Plain Old Telephone Service)
36H.324 on TMS320C82 Overview
- ITU-T H.324: low-bit-rate multimedia teleconferencing on circuit-switched networks; includes
- G.723: audio coding at 5.3-6.4 kbps; requires 18-20 fixed-point MIPS
- H.263: video coding based on H.261 with some enhancements
- H.223: MUX/DEMUX control
- H.245: control protocol
- V.34: modem up to 33.6 kbps
- Other related standards: H.320 (ISDN), H.323 (LAN), and H.310 (ATM/B-ISDN)
37H.324 on TMS320C82 Overview
38H.324 on TMS320C82 Task Partitioning
- Video Processing (H.263)
- Encoding
- Pre-processing: MP
- Motion estimation: PP0
- DCT: PP0
- Decoding
- Huffman or arithmetic decode, IDCT, etc.: PP0
- Post-processing: PP0
- Audio processing and AEC (Acoustic Echo Cancellation): PP1
- G.723
- Encoding: 22 MIPS
- Decoding: 3 MIPS
- AEC: LMS algorithm, up to 64-ms echo, 10 MIPS
- MODEM V.34: 20 MIPS, PP1
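The acoustic echo canceller named above is based on the LMS algorithm. The sketch below is a minimal floating-point LMS adaptive filter in plain Python, not the fixed-point C82 implementation. The function names, step size, tap count, and synthetic echo path are illustrative assumptions. At an 8 kHz sampling rate, a 64-ms echo span would correspond to 512 taps; the demo uses 4 to stay small.

```python
import random


def simulate_echo(far_end, echo_path):
    """Microphone signal containing only the echo of the far-end signal."""
    return [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
            for n in range(len(far_end))]


def lms_echo_canceller(far_end, mic, num_taps, mu):
    """Adapt FIR weights so the filtered far-end signal cancels the echo in mic."""
    weights = [0.0] * num_taps
    history = [0.0] * num_taps              # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * s for w, s in zip(weights, history))
        error = d - estimate                # what remains after cancellation
        weights = [w + mu * error * s for w, s in zip(weights, history)]
        residual.append(error)
    return weights, residual


random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(5000)]
path = [0.5, -0.3, 0.1]                     # hypothetical echo path
weights, residual = lms_echo_canceller(far, simulate_echo(far, path), num_taps=4, mu=0.05)
```

The per-sample cost is two dot-product-sized loops over the taps, which is why the MIPS budget grows with the echo span.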
39H.324 on TMS320C82 Task Partitioning
40Current Multimedia Processors
- From digital signal processor to multimedia processor
- Employ RISC instruction sets and pipelining to gain higher clock frequency
- Perform operations on single and multiple bytes of data
- Try to exploit more parallelism through static instruction-level parallelism (ILP) rather than dynamic ILP
- Are increasingly concerned with data movement and I/O interfaces
- Pay more attention to low-power design
- PC/consumer market is one of their primary targets
41Current Multimedia Processors
42References
- TMS320C8x
- J. Golston, Single-chip H.324 video conferencing, IEEE Micro, August 1996, pp. 42-50
- Texas Instruments, TMS320C80 Data Sheet, 1997, at http://www.ti.com/../sprs023b.pdf
- P. Lapsley and G. Blalock, How to estimate DSP processor performance, IEEE Spectrum, July 1996, pp. 74-78
- HTML file: http://www.bdti.com/../wpeval.html
- MPACT
- P. Kalapathy, Hardware-software interfacing on Mpact, IEEE Micro, March 1997, pp. 20-26
- Presentation file: http://infopad.eecs.berkeley.edu/HotChips8/
- Chromatic Research, MPACT2 Preliminary Data Sheet, Feb. 1998
- http://www.mpact.com/../mpact2.pdf
- Toshiba, TOSHIBA ANNOUNCES ITS NEXT-GENERATION MPACT MEDIA PROCESSOR, September 22, 1997
- http://www.toshiba.com/taec/../to-628.htm
43References
- TMS320C6x
- N. Seshan, High VelociTI Processing, IEEE Signal Processing Mag., March 1998, pp. 86-101
- TMS320C6x data sheet
- Trimedia
- G. A. Slavenburg, The Trimedia TM-1 PCI VLIW Mediaprocessor, IEEE Hot Chips 8 Symposium on High-Performance Chips, Aug. 1996
- http://infopad.eecs.berkeley.edu/HotChips8/
- MSP
- L. T. Nguyen, M. Mohamed, H. Park, Y. Pal, R. Wong, A. Qureshi, P. Psong, F. Valesco, H. D. Truong, C. Reader, Multi-media Signal Processor (MSP) Summary, IEEE Hot Chips 8 Symposium on High-Performance Chips, Aug. 1996
- http://infopad.eecs.berkeley.edu/HotChips8/
- H.324
- D. Lindbergh, The H.324 multimedia communication standard, IEEE Communications Magazine, December 1996, pp. 46-51
- K. Rijkse, H.263 Video coding for low-bit-rate communication, IEEE Communications Magazine, December 1996, pp. 42-45
44Useful links
- CPU Information Center
- http://infopad.eecs.berkeley.edu/CIC/
- Microprocessor Report
- http://www.chipanalyst.com/q/
- Berkeley Design Technology Inc.
- http://www.bdti.com/
- Peter Pirsch's research group
- http://www.mst.uni-hannover.de/