Programmable Digital Signal Processor II

Slides: 38
Provided by: sur73

1
Programmable Digital Signal Processor (II)
Based on presentations by S. Kittitornkun
2
Outline
  • Programmable DSP: What and Why?
  • TMS320C8x: C80, C82
  • TMS320C6x: C62x, C67x
  • Target Applications
  • Application: H.324 on TMS320C82
  • Current Multimedia Processors
  • References

3
What is PDSP?
  • A special-purpose, programmable microprocessor
    designed for DSP applications.
  • Features
    • Specialized instruction sets
      • Complex instructions → smaller programs
      • Instruction-level parallelism (ILP)
    • Specialized hardware support
      • Fast/parallel input/output support for media
        processing
      • Special ALUs, function units for bit operations,
        etc.
      • Special memory, bus transfer architecture
      • On-board co-processors, etc.

4
Why PDSP?
  • Higher performance than general-purpose
    microprocessors for specialized (embedded)
    applications
  • Software implementation offers more flexibility
    for product upgrade and migration than an ASIC
  • Low cost
    • Lower per-unit cost than general-purpose
      microprocessors
    • Lower overall cost than an ASIC for lower-volume
      products

5
TMS320C8x Overview
  • RISC master processor at 50 and 60 MHz
  • Parallel processors: 2 (4 for the C80)
  • Transfer controller: DMA and memory controller
  • Video controller (C80 only)

6
TMS320C8x Master Processor
  • 32-bit RISC instructions, 64-bit data
  • 31 scoreboarded general-purpose registers and a
    zero register
  • IEEE 754 floating-point unit
  • Supports vector floating-point (FP) operations
  • Performs a single-precision floating-point MAC in 1
    cycle: 100 MFLOPS (at 50 MHz)
  • Suitable for control protocols and FP-intensive
    algorithms
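
The 100 MFLOPS figure follows if one multiply-accumulate is counted as two floating-point operations (a multiply and an add) completing every cycle. A minimal sketch of that arithmetic; the function name is illustrative, and the 60 MHz case is simply the other clock rate quoted for the C8x master processor:

```python
def peak_mflops(clock_mhz, flops_per_cycle=2):
    """Peak MFLOPS assuming one MAC (counted as 2 FLOPs) completes per cycle."""
    return clock_mhz * flops_per_cycle


print(peak_mflops(50))  # clock rate quoted on the slide
print(peak_mflops(60))  # the faster C8x clock rate
```

This is a peak rate; sustained throughput depends on keeping the MAC unit fed every cycle.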

7
TMS320C8x Processor communication
[Figure: x86 host processor, master processor (kernel and tasks), and parallel processors PP0 and PP1 (tasks); labels include ports, function, signals, and semaphore]
  • Shared memory multiprocessor
  • MP sends commands through command buffers located
    in shared memory

8
TMS320C8x Parallel Processor
  • Data unit: 32-bit datapath, ALU, multiplier,
    etc.
  • 2 independent address units: global and local
    • 1-cycle on-chip memory access (no conflict)
    • 1-cycle load/store of byte, halfword, and word
    • Internal adder can offload data unit computation
  • Program flow control unit

9
TMS320C8x Parallel Processor Program Flow Control Unit
  • 3-stage pipeline
    • Instruction fetch
    • Address generation
    • Operation execution
  • Conditional execution of data unit operations,
    moves, loads from memory, and branches
  • PC is mapped into the register file
  • To minimize overhead, the loop controller supports 3
    levels of nested loops

10
TMS320C8x Parallel Processor Data Unit
  • Split 32-bit 3-input ALU: Boolean and arithmetic
    operations
  • Split and rounded multiplier: dual 8x8 (16-bit
    products) or 16x16 (32-bit product)
  • Flexible datapath: barrel rotator, mask
    generator
  • Supports signed, unsigned, and saturating arithmetic
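
To illustrate what "split" and "saturate" mean for a 32-bit datapath, the sketch below treats one 32-bit word as four independent unsigned bytes. It is a software model only; the function name and the choice of four 8-bit lanes are assumptions for illustration, not a description of the actual C8x instruction encoding.

```python
def split_add_u8(a, b, saturate=True):
    """Add four unsigned bytes packed into 32-bit words, lane by lane.

    No carry propagates from one byte lane into the next. With saturate=True
    each lane clamps at 255; otherwise each lane wraps modulo 256.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        s = x + y
        s = min(s, 0xFF) if saturate else (s & 0xFF)
        result |= s << shift
    return result


a, b = 0x01FF7F10, 0x01017F20
print(hex(split_add_u8(a, b)))                  # lanes saturate independently
print(hex(split_add_u8(a, b, saturate=False)))  # lanes wrap independently
print(hex((a + b) & 0xFFFFFFFF))                # ordinary add: carry crosses lanes
```

Comparing the three results shows why pixel arithmetic benefits: an ordinary 32-bit add lets an overflowing byte corrupt its neighbour, while the split add keeps each lane separate.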

11
TMS320C8x Parallel Processor Data Unit: 3-input ALU
  • Supports 512 operations in total: 256 Boolean and
    256 arithmetic
  • Boolean: F0 to F7 each enable one of the eight
    product terms of A, B, and C (each input true or
    complemented); the enabled terms are ORed together
  • Arithmetic A f1(B,C) f2(B,C) 1
  • Example
  • A+B+1
  • (A&C)+(B&C): mask A and B by C and then add
  • A((BC) (-BC)) Multiple-byte AB
  • A-((BC) (-BC)) Multiple-byte A-B
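
Eight enable bits, one per product term of three inputs, give 2^8 = 256 Boolean functions, which matches the count on the slide. A minimal sketch of such a function-code ALU follows; the mapping of code bit i to a particular product term is an assumption made here for illustration, since the slide does not spell out the ordering.

```python
def alu_boolean(code, a, b, c, width=32):
    """Evaluate a bitwise 3-input Boolean function selected by an 8-bit code.

    Bit i of `code` enables product term i, where i = 4*A + 2*B + C for the
    input bits at each position (hypothetical ordering).
    """
    mask = (1 << width) - 1
    result = 0
    for i in range(8):
        if (code >> i) & 1:
            ta = a if i & 4 else ~a   # A true or complemented
            tb = b if i & 2 else ~b
            tc = c if i & 1 else ~c
            result |= ta & tb & tc
    return result & mask


a, b, c = 0xF0F0F0F0, 0xCCCCCCCC, 0xAAAAAAAA
print(hex(alu_boolean(0xC0, a, b, c)))  # A AND B
print(hex(alu_boolean(0x96, a, b, c)))  # A XOR B XOR C
print(hex(alu_boolean(0xE4, a, b, c)))  # per-bit select: A where C is 1, else B
```

Under this ordering a single operation can do a masked merge of two operands, the kind of work that otherwise takes several two-input instructions.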

12
TMS320C8x Parallel Processor Instruction Set
  • 64-bit opcode contains multiple sub-instructions
    for
    • Data unit
    • Global address unit
    • Local address unit
  • Ex: d4d5d6>>d0 a8d7 d0(a0x1)

13
TMS320C8x Transfer Controller
  • Prioritizes, schedules, and transfers data
    between on- and off-chip memories
  • Handles data cache (on-chip RAM) and
    instruction cache misses
  • Supports multidimensional data transfers
    • From a simple contiguous linear sequence up to a
      3D region
  • Memory interface supports a wide range of memory
    systems
    • DRAM, SDRAM, video RAM, and SRAM
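
A multidimensional transfer can be pictured as nested strided copies over flat memory. The sketch below is a software model under assumed parameter names (`counts`, `src_pitches`, `dst_pitches`, "line", "patch"); it does not reproduce the C8x transfer controller's actual parameter format.

```python
def transfer_3d(src, src_start, dst, dst_start, counts, src_pitches, dst_pitches):
    """Copy a region of up to three dimensions between two flat buffers.

    counts      = (bytes per line, lines per patch, number of patches)
    src_pitches = (address step between lines, address step between patches)
    dst_pitches = same, for the destination
    """
    n_bytes, n_lines, n_patches = counts
    for patch in range(n_patches):
        for line in range(n_lines):
            s = src_start + patch * src_pitches[1] + line * src_pitches[0]
            d = dst_start + patch * dst_pitches[1] + line * dst_pitches[0]
            dst[d:d + n_bytes] = src[s:s + n_bytes]


# Pull a 4x4 block at row 2, column 3 out of an 8x8 "image" into packed memory.
image = bytearray(range(64))
block = bytearray(16)
transfer_3d(image, 2 * 8 + 3, block, 0,
            counts=(4, 4, 1), src_pitches=(8, 0), dst_pitches=(4, 0))
print(list(block))
```

Setting the line and patch counts to 1 reduces this to the simple contiguous linear case; the block fetch in the usage lines is the kind of access a motion-estimation or DCT kernel would request.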

14
TMS320C8x Video Controller (C80 only)
  • Provides simultaneous control over two
    independent capture or display systems and frame
    grabber or frame buffer image storage
  • Dual-frame timers
  • Programmable timing and control registers
  • Programmable line interrupt to MP

15
TMS320C8x Development Tools
  • C-like compilers and assemblers for both master
    and parallel processors
  • Register allocator
    • Identifies live and free registers
    • Allows using variable names in assembly code
    • Assigns a specific register to a variable
  • Code compactor converts straight-line assembly
    code into parallel code
  • Optimization can be done by hand for
    time-critical parallel code

16
TMS320C8x Execution Time for 256-Point FFT
  • "-C" indicates performance with the cache
    pre-loaded
  • Benchmark results for the TMS320C80 are for one of
    the on-chip DSP processors

17
TMS320C6x VelociTI Overview
  • VLIW DSPs
    • TMS320C62x: fixed-point DSPs
    • TMS320C67x: floating-point DSPs

18
TMS320C6x VelociTI Key features
  • Issues and executes up to 8 instructions every
    cycle
  • Load/store architecture
  • 32-bit RISC instructions, 32-bit data
  • Conditional instructions
    • Reduce costly branching
    • Increase parallelism for higher sustained
      performance
  • Instruction packing
    • Reduces code size, program fetches, and power
      consumption

19
TMS320C6x VelociTI Datapath
20
TMS320C6x VelociTI Datapath
  • Two register files
    • 16 x 32 bits each
    • Each supports 10 simultaneous reads and 6 writes
  • Two sets of identical functional units (8 units)
    • L: logic functions, bit counting, and add/sub
    • S: shifting, bit manipulation, branch/control,
      and add/sub
    • D: addressing and add/sub
    • M: multiplication
  • Grouping of functional units reduces the number of
    register ports

21
TMS320C6x VelociTI Instruction set
  • 32-bit RISC-like opcode format
    • creg: conditional register
    • z: zero or nonzero
    • dst: destination
    • src1/2: source 1 and 2
    • cst: constant
    • x: use cross path for src2
    • s: side A or B for destination
    • op: operation
  • Instructions can be conditioned on the value of A1,
    A2, B0, B1, or B2
  • Each instruction takes 1 cycle to execute, except
    double-precision operations on the C67x

22
TMS320C6x VelociTI Instruction packing
  • Fetch packet: eight 32-bit instructions are fetched
    simultaneously

23
TMS320C6x VelociTI Instruction packing
  • Execute packet: indicated by the p-bit (parallel
    bit)
    • 1: in parallel
    • 0: not in parallel
  • Example
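
The grouping rule can be sketched in a few lines. This model assumes the p-bit is the least-significant bit of each 32-bit instruction word and that p = 1 chains an instruction to the one that follows it; both details are assumptions here rather than something stated on the slide, and the function name is hypothetical.

```python
def split_execute_packets(fetch_packet):
    """Group the instruction words of one fetch packet into execute packets.

    A word with p = 1 executes in parallel with the next word; a word with
    p = 0 closes the current execute packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p = 0: end of this execute packet
            packets.append(current)
            current = []
    if current:                    # trailing chain whose last p-bit was 1
        packets.append(current)
    return packets


# Eight dummy words whose p-bits are 1,1,0,0,1,0,0,0.
p_bits = [1, 1, 0, 0, 1, 0, 0, 0]
fetch_packet = [(i << 1) | p for i, p in enumerate(p_bits)]
for packet in split_execute_packets(fetch_packet):
    print([w >> 1 for w in packet])
```

A fully serial fetch packet (all p-bits 0) yields eight execute packets, while seven 1s followed by a 0 yield a single packet that occupies all eight functional units in one cycle.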
24
TMS320C6x VelociTI Pipeline
  • Deep pipeline: 3 stages, 16 phases
    • Fetch: 4 phases (PG, PS, PW, PR)
    • Decode: 2 phases (DP, DC)
    • Execute: up to 10 phases (E1 to E10)
  • No stalls except on a cache miss or external access
  • Performs load after store to the same memory
    location
  • Each branch takes 5 cycles, whether taken or not taken

25
TMS320C6x VelociTI Memory Hierarchy
  • Internal program memory is configurable
    • Mapped memory or direct-mapped cache
    • 16 K 32-bit instructions or 2 K 256-bit
      fetch packets
  • Internal data memory
    • 2 blocks of 4 interleaved 8-Kbyte banks
  • DMA controller: 800 Mbytes/s peak
    • Transfers between on-chip memories, peripherals,
      and external memory
  • EMIF (External Memory Interface)
    • 800 Mbytes/s peak
    • Supports SBSRAM, SDRAM, etc.
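
The two ways of stating the program memory size describe the same capacity, which a quick calculation confirms. The sketch assumes K = 1024 and reads the data memory line as 2 x 4 banks of 8 Kbytes each; the names are illustrative.

```python
K = 1024
BITS_PER_BYTE = 8


def capacity_bytes(units, bits_per_unit):
    """Capacity in bytes of `units` items that are `bits_per_unit` bits wide."""
    return units * bits_per_unit // BITS_PER_BYTE


as_instructions = capacity_bytes(16 * K, 32)    # 16 K instructions of 32 bits
as_fetch_packets = capacity_bytes(2 * K, 256)   # 2 K fetch packets of 256 bits
data_memory = 2 * 4 * 8 * K                     # blocks x banks x bytes per bank

print(as_instructions // K, as_fetch_packets // K, data_memory // K)  # in Kbytes
```

A 256-bit fetch packet is exactly eight 32-bit instructions, so both counts describe the same memory.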

26
TMS320C6x VelociTI Peripherals
  • McBSP (Multichannel Buffered Serial Port)
    • Two independent 100 Mbits/s full-duplex serial
      ports
    • Supports standards such as ST-BUS and the AC97
      audio codec
  • Timers
    • Two programmable 32-bit timers
  • Host Port Interface
    • 100 Mbytes/s 16-bit bidirectional port to
      standard processors
  • Power-down modes 1, 2, 3
    • Reduce power consumption

27
TMS320C6x VelociTI Programming
  • Includes C compiler, assembler, optimizer, and
    debuggers in a software simulator
    • 72-82% efficiency compared to handwritten
      assembly code
  • Optimization techniques
    • Intrinsic functions in the C compiler
    • Software pipelining
    • If..Else and Case conversion to conditional
      instructions
  • Data types (by compiler)
    • long: 40 bits
    • int: 32 bits
    • short: 16 bits
    • char: 8 bits
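
If-conversion replaces a branch with straight-line instructions that each carry a guard. The sketch below models the idea in Python; `guarded_move` is a hypothetical stand-in for one conditional instruction, and the bracketed comments are informal notation rather than verified C6x assembly.

```python
def clip_with_branch(x, limit):
    """Original control flow: an if/else that would compile to a branch."""
    if x > limit:
        y = limit
    else:
        y = x
    return y


def guarded_move(pred, dst, src):
    """Model of one conditional instruction: dst is overwritten only if pred holds."""
    return src if pred else dst


def clip_if_converted(x, limit):
    """Same result with no branch: both moves issue, the guard decides which one writes."""
    p = x > limit                    # compare sets a condition register
    y = 0
    y = guarded_move(p, y, limit)    # [p]  y = limit
    y = guarded_move(not p, y, x)    # [!p] y = x
    return y


print([clip_if_converted(x, 10) for x in (3, 10, 25)])
```

Because the two guarded moves do not depend on each other, a VLIW scheduler is free to place them in the same execute packet, which is where the parallelism gain comes from.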

28
Target Applications
  • Video - DVD, MPEG 1 & 2 decoding
  • Audio - Dolby AC-3, 3D audio, MPEG decode,
    wavetable synthesis
  • Graphics - 2D & 3D acceleration
  • Communication
    • Vocoder
    • ADSL, fax/modem V.34, 56k
    • Echo canceller
  • Desktop videoconferencing
    • H.320 ISDN
    • H.324 on POTS (Plain Old Telephone Service)

29
H.324 on TMS320C82 Overview
  • ITU-T H.324: low-bit-rate multimedia
    teleconferencing on circuit-switched networks;
    includes
    • G.723: audio coding at 5.3-6.4 kbps, requires
      18-20 fixed-point MIPS
    • H.263: video coding based on H.261, with some
      enhancements
    • H.223: MUX/DEMUX control
    • H.245: control protocol
    • V.34: modem up to 33.6 kbps
  • Other related standards: H.320 (ISDN), H.323
    (LAN), and H.310 (ATM/B-ISDN)

30
H.324 on TMS320C82 Overview
31
H.324 on TMS320C82 Task Partitioning
  • Video processing (H.263)
    • Encoding
      • Pre-processing: MP
      • Motion estimation: PP0
      • DCT: PP0
    • Decoding
      • Huffman or arithmetic decode, IDCT, etc.: PP0
      • Post-processing: PP0
  • Audio processing and AEC (Acoustic Echo
    Cancellation): PP1
    • G.723
      • Encoding: 22 MIPS
      • Decoding: 3 MIPS
    • AEC: LMS algorithm, up to 64-ms echo, 10 MIPS
  • Modem V.34: 20 MIPS, PP1
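
For readers unfamiliar with LMS echo cancellation, here is a minimal floating-point sketch. It assumes an 8 kHz telephony sampling rate when sizing the filter and uses a short synthetic echo path in the demo; the function names, step size, and echo path are all illustrative, and a production AEC on a fixed-point DSP would look quite different.

```python
import random


def echo_taps(echo_ms, sample_rate_hz):
    """Number of FIR taps needed to span an echo tail of `echo_ms` milliseconds."""
    return echo_ms * sample_rate_hz // 1000


def lms_echo_canceller(far_end, mic, num_taps, mu):
    """Subtract an adaptively estimated echo of `far_end` from `mic`.

    Returns the error signal, which is also the echo-cancelled output.
    """
    w = [0.0] * num_taps          # adaptive FIR coefficients
    history = [0.0] * num_taps    # recent far-end samples, newest first
    out = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate
        w = [wi + mu * e * hi for wi, hi in zip(w, history)]  # LMS update
        out.append(e)
    return out


def demo(num_samples=2000):
    """Synthetic check: known echo path, no near-end talker, no noise."""
    rng = random.Random(0)
    far = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    path = [0.0, 0.5, 0.0, 0.25]  # hypothetical echo path
    mic = [sum(path[k] * far[n - k] for k in range(len(path)) if n >= k)
           for n in range(num_samples)]
    residual = lms_echo_canceller(far, mic, num_taps=len(path), mu=0.1)
    return max(abs(e) for e in residual[-100:])


print(echo_taps(64, 8000))  # taps for a 64-ms tail at the assumed 8 kHz rate
print(demo())               # residual echo after adaptation
```

Each sample costs one multiply-accumulate per tap for filtering and one more for the coefficient update, so the MIPS cost grows with the length of the echo tail.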

32
H.324 on TMS320C82 Task Partitioning
33
Current Multimedia Processors
  • Digital Signal Processor => Multimedia Processor
  • Employ RISC instruction sets and pipelining to
    achieve higher clock frequencies
  • Perform operations on single and multiple bytes
    of data
  • Try to exploit static instruction-level
    parallelism (ILP) rather than dynamic ILP
  • Increasingly concerned with data movement and I/O
    interfaces
  • Pay more attention to low-power design
  • The PC/consumer market is one of their primary
    targets

34
Current Multimedia Processors
35
References
  • TMS320C8x
    • J. Golston, "Single-chip H.324 video
      conferencing," IEEE Micro, August 1996, pp. 42-50
    • Texas Instruments, TMS320C80 Data Sheet, 1997,
      at http://www.ti.com/../sprs023b.pdf
    • P. Lapsley and G. Blalock, "How to estimate DSP
      processor performance," IEEE Spectrum, July 1996,
      pp. 74-78
    • HTML file: http://www.bdti.com/../wpeval.html
  • TMS320C6x
    • N. Seshan, "High VelociTI processing," IEEE Signal
      Processing Magazine, March 1998, pp. 86-101
    • TMS320C6x data sheet

36
References
  • Trimedia
    • G. A. Slavenburg, "The Trimedia TM-1 PCI VLIW
      Mediaprocessor," IEEE Hot Chips 8 Symposium on
      High-Performance Chips, Aug. 1996
    • http://infopad.eecs.berkeley.edu/HotChips8/
  • MSP
    • L. T. Nguyen, M. Mohamed, H. Park, Y. Pal, R.
      Wong, A. Qureshi, P. Psong, F. Valesco, H. D.
      Truong, C. Reader, "Multi-media Signal Processor
      (MSP) Summary," IEEE Hot Chips 8 Symposium on
      High-Performance Chips, Aug. 1996
    • http://infopad.eecs.berkeley.edu/HotChips8/
  • H.324
    • D. Lindbergh, "The H.324 multimedia communication
      standard," IEEE Communications Magazine, December
      1996, pp. 46-51
    • K. Rijkse, "H.263: Video coding for low-bit-rate
      communication," IEEE Communications Magazine,
      December 1996, pp. 42-45

37
Useful links
  • CPU Information Center
    • http://infopad.eecs.berkeley.edu/CIC/
  • Microprocessor Report
    • http://www.chipanalyst.com/q/
  • Berkeley Design Technology Inc.
    • http://www.bdti.com/
  • Peter Pirsch's research group
    • http://www.mst.uni-hannover.de/