The TM3270 TriMedia MediaProcessor

1
The TM3270 TriMedia Media-Processor
  • Jan-Willem van de Waerdt
  • Chief Processor Architect
  • TmCoE Philips Semiconductors

2
Outline
  • Introduction
  • Overview
  • Architecture highlights
  • Implementation highlights
  • Realization highlights
  • Conclusions

3
Introduction
TRIMEDIA TM3270 DESIGN OBJECTIVES
  • Multi-purpose programmable solution
  • Standard video/audio (en/de)-coders
  • H.264 standard definition encode or decode
  • MPEG2, WMV high definition decode
  • Proprietary video enhancement processing to
    improve picture quality
  • E.g. temporal upconversion, de-interlacing,
    peaking, etc.
  • Many applications evolve after HW designs
    complete
  • Programmable solution => fast time-to-market
  • Multi-market (connected and battery operated)
  • Share development costs (CPU, software and tools
    development)
  • Low power (battery operated products)
  • The bottom line
  • Enough programmable performance at acceptable
    power consumption with smallest silicon area

4
Overview
TRIMEDIA TM3270 CHARACTERISTICS
  • Fully synthesizable design (450/350 MHz)
  • VLIW machine with 5 issue slots
  • 32-bit address range, 32-bit datapath
  • Operations are guarded
  • Unified 128x32-bit register-file
  • 35 execution units
  • SIMD multimedia and IEEE754 FP operation support
  • 64 Kbyte instruction cache (8-way set
    associative)
  • 128 Kbyte data cache (4-way set associative)
  • Variable length instruction encoding
  • Pipeline depth 7-12 stages

Example instruction (five operations, one per issue slot):
  if r34 fadd r12 r14 -> r56,
  if r24 fmul r3 r89 -> r112,
  add r34 r76 -> r121,
  std32d(4) r42 r48,
  if r21 ld32d(-8) r93 -> r45
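The guarded operations in the instruction example above can be read as conditional register writes. The following is a minimal sketch in C, assuming (the slide does not say so) that an operation takes effect only when the least significant bit of its guard register is 1. The `RegFile` type and the `guarded_add` helper are hypothetical names used for illustration, and an integer add stands in for the floating-point operations on the slide.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 128  /* unified 128 x 32-bit register file */

typedef struct {
    uint32_t r[NUM_REGS];
} RegFile;

/* Models "if rG add rA rB -> rD": rD is written only when the guard's
 * least significant bit is set (ASSUMPTION: LSB guard semantics). */
static void guarded_add(RegFile *rf, unsigned guard,
                        unsigned a, unsigned b, unsigned dst)
{
    assert(guard < NUM_REGS && a < NUM_REGS && b < NUM_REGS && dst < NUM_REGS);
    if (rf->r[guard] & 1u)
        rf->r[dst] = rf->r[a] + rf->r[b];
}
```

With a guard value of 0 the destination register keeps its old contents, which is what lets a compiler turn short branches into straight-line code.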
5
Overview
TRIMEDIA TM3270 PIPELINE
  • Sequential instruction cache design
  • Unified register file
  • Two-slot execution unit: double the register file
    bandwidth
  • Load/store unit connects to two issue slots
6
Architecture
  • Highlights
  • New operations
  • Two-slot operations
  • Fractional load operations
  • More information: ACM SAC 2005
  • H.264 CABAC decoding operations
  • Data prefetching

7
Architecture
TWO-SLOT OPERATIONS
SUPER_QUADUMEDIAN src1 src2 src3 -> dst
[Figure: sources src1, src2 and src3; destination dst]
8
Architecture
TWO-SLOT OPERATIONS
SUPER_LD32R src1 src2 -> dst1 dst2
[Figure: sources src1 and src2 (bits 31..0 each); memory bytes A, A+1, ..., A+7; destinations dst1 and dst2]
9
Architecture
TWO-SLOT OPERATIONS
SUPER_LD32R src1 src2 -> dst1 dst2
[Figure: sources src1 and src2 (bits 31..0 each); memory bytes A, A+1, ..., A+7; destinations dst1 and dst2]
  • 2 independent destinations
  • 4 cycle latency
  • Non-aligned memory access
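A two-destination, non-aligned load of this kind can be modeled in a few lines of C. This is only a sketch under stated assumptions: the address is taken to be A = src1 + src2, bytes A..A+3 are assumed to go to dst1 and A+4..A+7 to dst2, and big-endian byte order is assumed. None of these details is given on the slide, and `super_ld32r_model` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble one 32-bit word from four consecutive bytes, first byte most
 * significant (ASSUMPTION: big-endian byte order). Works for any alignment. */
static uint32_t load32_unaligned(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Model of a two-destination load: bytes A..A+3 go to *dst1 and bytes
 * A+4..A+7 go to *dst2, with A = src1 + src2 (ASSUMPTION). */
static void super_ld32r_model(const uint8_t *mem, size_t mem_size,
                              uint32_t src1, uint32_t src2,
                              uint32_t *dst1, uint32_t *dst2)
{
    uint32_t a = src1 + src2;
    assert(mem_size >= 8 && a <= mem_size - 8);
    *dst1 = load32_unaligned(mem + a);
    *dst2 = load32_unaligned(mem + a + 4);
}
```

For a 16-byte buffer holding 0, 1, ..., 15 and A = 3, the model yields 0x03040506 and 0x0708090A, even though A is not a multiple of four.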
10
Architecture
FRACTIONAL LOAD OPERATIONS
LD_FRAC8 src1 src2 -> dst
[Figure: source src1 (bits 31..0) gives address A; source src2 supplies bits 3..0]
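The slide gives only the operand names and the use of src2 bits 3..0; it does not spell out what the fractional load computes. One plausible reading, offered purely as an assumption, is a two-tap interpolation between neighbouring bytes at a 1/16 position selected by those four bits. The sketch below implements that interpolation in C; `interp16` and `ld_frac8_model` are hypothetical names, and the rounding and byte packing are illustrative choices, not the processor's documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Two-tap linear interpolation between neighbouring bytes at position
 * frac/16, rounded to nearest (ASSUMPTION: rounding mode). */
static uint8_t interp16(uint8_t left, uint8_t right, unsigned frac)
{
    assert(frac < 16);
    return (uint8_t)((left * (16u - frac) + right * frac + 8u) / 16u);
}

/* Hypothetical model: read 5 bytes starting at p and return 4 interpolated
 * bytes packed into a word, first byte most significant. */
static uint32_t ld_frac8_model(const uint8_t *p, unsigned frac)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++)
        out = (out << 8) | interp16(p[i], p[i + 1], frac);
    return out;
}
```

With frac = 0 the model degenerates to a plain load of the first four bytes, which is the behaviour one would expect from any such filter.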
11
Architecture
CABAC DECODING OPERATIONS
  • Context-Adaptive Binary Arithmetic Coding (CABAC)
  • H.264/AVC compression feature
  • Lossless compression of syntax elements in the
    video stream based on the probabilities of syntax
    elements in a given context
  • Achieves high compression ratio
  • High computational complexity

12
Architecture
CABAC DECODING OPERATIONS
Function: biari_decode_symbol
Semantics: CABAC decoding of a single bin
Inputs: value, range, state, mps, bit_position, data
Outputs: value, range, state, mps, bit_position, bit

  data_aligned = data << bit_position
  range_lps = LpsRangeTable[state][(range >> 6) & 3]
  temp_range = range - range_lps
  most_probable = value < temp_range
  bit = if most_probable then mps else !mps
  value = if most_probable then value else value - temp_range
  range = if most_probable then temp_range else range_lps
  mps = if most_probable then mps else (if state != 0 then mps else !mps)
  state = if most_probable then MpsStateTable[state] else LpsStateTable[state]
  while (range < 256) {
    value = (value << 1) | ((data_aligned >> 31) & 1)
    range <<= 1
    data_aligned <<= 1
    bit_position += 1
  }
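The pseudocode above translates fairly directly into C. The sketch below is one such rendering, not the reference decoder: the `CabacState` and `CabacTables` types are hypothetical, the lookup tables are passed in by the caller rather than reproduced here, and the MPS is assumed to flip only when a least-probable symbol is decoded in state 0, as in H.264.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t value, range;   /* arithmetic decoder registers */
    uint32_t state, mps;     /* context model */
    uint32_t bit_position;   /* bits of `data` already consumed */
} CabacState;

typedef struct {             /* caller-supplied tables (not reproduced here) */
    const uint16_t (*lps_range)[4];  /* indexed [state][(range >> 6) & 3] */
    const uint8_t *mps_next;         /* next state after an MPS */
    const uint8_t *lps_next;         /* next state after an LPS */
} CabacTables;

/* Decode a single bin and update the decoder state in place. */
static uint32_t biari_decode_symbol(CabacState *s, uint32_t data,
                                    const CabacTables *t)
{
    assert(s->bit_position < 32);    /* shifting by >= 32 is undefined in C */
    uint32_t data_aligned = data << s->bit_position;
    uint32_t range_lps = t->lps_range[s->state][(s->range >> 6) & 3];
    uint32_t temp_range = s->range - range_lps;
    uint32_t bit;

    if (s->value < temp_range) {     /* most probable symbol */
        bit = s->mps;
        s->range = temp_range;
        s->state = t->mps_next[s->state];
    } else {                         /* least probable symbol */
        bit = !s->mps;
        s->value -= temp_range;
        s->range = range_lps;
        if (s->state == 0)           /* ASSUMPTION: flip MPS only in state 0 */
            s->mps = !s->mps;
        s->state = t->lps_next[s->state];
    }
    while (s->range < 256) {         /* renormalize, pulling in new bits */
        s->value = (s->value << 1) | ((data_aligned >> 31) & 1);
        s->range <<= 1;
        data_aligned <<= 1;
        s->bit_position += 1;
    }
    return bit;
}
```

The data-dependent chain of table lookup, compare, select and renormalization loop is what makes this function expensive in software when it is built from ordinary operations.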
14
Architecture
CABAC DECODING OPERATIONS
SUPER_CABAC_CTX (two-slot operation: 4 inputs and 2 outputs)
[Figure: the fields bit_position, data, value, range, state and mps mapped onto sources src1 to src4; the fields bit_position, bit, value, range, state and mps mapped onto destinations dst1 and dst2]
15
Architecture
CABAC DECODING OPERATIONS
SUPER_CABAC_STR (two-slot operation: 3 inputs and 2 outputs)
[Figure: the fields bit_position, data, value, range, state and mps mapped onto sources src1 to src3; the fields bit_position, bit, value, range, state and mps mapped onto destinations dst1 and dst2]
16
Architecture
CABAC DECODING OPERATIONS
  • H.264 CABAC 2.5 Mbits/s encoded bitstream
  • 720x576 resolution @ 25 frames/s
  • 80% of 16x16 blocks decomposed into sixteen 4x4
    blocks
  • No TriMedia operations used, except for the
    CABAC decoding operations

18
Architecture
DATA PREFETCHING
  • Pre-fetching to hide SDRAM latency in SoC
    environment
  • Pre-fetching based on memory regions
  • Memory regions are under software control
  • The programmer knows best: rely on the programmer's
    algorithm knowledge
  • Simple program model / low overhead
  • Four memory regions supported

19
Architecture
DATA PREFETCHING
  • Memory region n defined by
  • Start address: rgn_start_address
  • End address: rgn_end_address
  • Stride: rgn_stride
  • Pre-fetching is triggered by loads
  • Functionality
  • Load for address A:
  • if ( (A >= rgn_start_address) &&
  •      (A <= rgn_end_address) &&
  •      (prefetch bit of A is 1) &&
  •      (miss for (A + rgn_stride)) ) then
  •   Pre-fetch for address (A + rgn_stride)
  •   Set prefetch bit of (A + rgn_stride) to 1
  •   Set prefetch bit of A to 0

[Figure: Region 0, spanning rg0_start_address to rg0_end_address; a load requests address A and address A + rg0_stride is pre-fetched]
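The trigger rule above can be exercised with a toy cache model. The sketch below is a simplification under several assumptions that the slide does not confirm: region bounds are inclusive, all four conditions are combined with a logical AND, there is one prefetch bit per 128-byte line, and a demand refill also sets the prefetch bit of the refilled line. `ToyCache`, `Region` and `on_load` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 128u   /* data cache line size used as stride on the slides */
#define NUM_LINES 64u    /* toy model: covers addresses 0 .. 8191 */

typedef struct { uint32_t start, end, stride; } Region;

typedef struct {
    bool present[NUM_LINES];       /* line is in the cache */
    bool prefetch_bit[NUM_LINES];  /* line may still trigger a pre-fetch */
} ToyCache;

static uint32_t line_of(uint32_t addr) { return addr / LINE_SIZE; }

/* Handle a load from address a. Returns true and stores the pre-fetched
 * address in *prefetched when the load triggers a pre-fetch. */
static bool on_load(ToyCache *c, const Region *r, uint32_t a,
                    uint32_t *prefetched)
{
    uint32_t la = line_of(a);
    assert(la < NUM_LINES);
    if (!c->present[la]) {         /* demand miss: refill the line */
        c->present[la] = true;
        c->prefetch_bit[la] = true;  /* ASSUMPTION: refill sets the bit */
    }
    uint32_t next = a + r->stride;
    if (a >= r->start && a <= r->end && c->prefetch_bit[la] &&
        line_of(next) < NUM_LINES && !c->present[line_of(next)]) {
        c->present[line_of(next)] = true;        /* pre-fetch A + stride */
        c->prefetch_bit[line_of(next)] = true;
        c->prefetch_bit[la] = false;
        *prefetched = next;
        return true;
    }
    return false;
}
```

In this model a sequential walk through the region issues one pre-fetch per line: the first load to a line triggers the fetch of its successor, and later loads to the same line find the prefetch bit cleared.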
20
Architecture
DATA PREFETCHING
  • Baseline byte copy:
      void Copy (char *src_ptr, char *dst_ptr, int size) {
        for (int i = 0; i < size; i++)
          *dst_ptr++ = *src_ptr++;
      }
  • Copy with pre-fetching on the source region:
      void Copy (char *src_ptr, char *dst_ptr, int size) {
        int *local_src_ptr = (int *) src_ptr;
        int *local_dst_ptr = (int *) dst_ptr;
        rg0_start_address = src_ptr;              // Perform pre-fetching
        rg0_end_address   = src_ptr + (size-1);   // on the source region.
        rg0_stride        = 128;                  // Set pre-fetch stride
                                                  // to data line size.
        for (int i = 0; i < size/4; i++)          // Non-aligned LD/ST support.
          *local_dst_ptr++ = *local_src_ptr++;    // write misses caused by
                                                  // local_dst_ptr references
      }

21
Implementation
  • Highlights
  • Load/store unit
  • 128 Kbyte, 4-way set associative
  • Allocate on write miss policy
  • The right balance between area and performance
  • More information: IEEE ICCD 2005

22
Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram. Blocks shown for each of slot 4 and slot 5: src1/src2 inputs, address calculation, tag arbiter, tags (tag SRAMs), tag comparison, control state machine, store data, dst output. Other blocks shown: refill unit, prefetch unit, copy back unit, data arbiters, cache data (data SRAMs), cache write buffer with pending stores, cache way selection, load aligner, sign extension, fractional load filter bank with input src2[3:0] and output to slot 5 dst]
  • Two copies of tags: cache line crossing, unaligned
    access
  • Multiple data SRAM arbiters
  • Two additional cycles for fractional loads
24
Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram (same blocks as on slide 22), annotated "LD_FRAC8 operation"]
26
Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram (same blocks as on slide 22), annotated "LD operation"]
27
Realization
OVERVIEW
  • Fully synthesizable design, low power process
    technology
  • Relatively high VT
  • Frequency
  • 450 MHz (1.2 V, 85 °C, worst case process corner)
  • 350 MHz (1.1 V, 125 °C, worst case process corner)
  • Area: 8.08 mm^2
  • 46.9% for SRAMs
  • 13.9%: 64 Kbyte instruction cache
  • 33.0%: 128 Kbyte data cache
  • Standard cell utilization: 65%
  • Power: 0.7 - 1.0 mW/MHz (1.2 V)
  • Extensive usage of clock gating (>70 domains in
    total)

28
Realization
TM3270 FLOORPLAN
[Figure: TM3270 floorplan. Labeled blocks: INSTRUCTION, INSTR. TAG, ILRU, BUS INTERFACE UNIT, DECODE, INSTRUCTION FETCH UNIT, MMIO, LOAD STORE UNIT, DATA, BYTE VALID]
29
Realization
MP3 POWER DISTRIBUTION
  • Power
  • Static (negligible)
  • Low power process technology, high VT standard
    cells
  • Dynamic: C * V^2 * f
  • C: process technology and switching activity
    (clock gating)
  • V: process voltage, range 0.8 - 1.4 V
  • MP3 decoder (384 Kbits/s stereo decoding at 44.1
    kHz) => 8 MHz execution time
  • 0.94 mW/MHz (1.2 V) => 7.48 mW
  • Voltage scaling
  • 6.29 mW (1.1 V), 5.19 mW (1.0 V),
    4.21 mW (0.9 V), 3.32 mW (0.8 V)
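The voltage-scaling figures above are consistent with the quadratic voltage dependence of dynamic power at fixed capacitance and frequency. A minimal sketch in C that scales the 7.48 mW reference at 1.2 V to other supply voltages; `scale_power_mw` is a hypothetical helper name, and static power is ignored, as the slide treats it as negligible.

```c
#include <assert.h>

/* Dynamic power is proportional to C * V^2 * f. With C and f fixed,
 * power at voltage v is the reference power times (v / v_ref)^2. */
static double scale_power_mw(double p_ref_mw, double v_ref, double v)
{
    assert(v_ref > 0.0 && v > 0.0);
    double ratio = v / v_ref;
    return p_ref_mw * ratio * ratio;
}
```

For example, `scale_power_mw(7.48, 1.2, 0.8)` gives roughly 3.32 mW, matching the last entry in the list.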

30
Realization
PERFORMANCE WRT. TM3260
  • TM3260: TM3270 predecessor
  • 240 MHz
  • 64 Kbyte instruction cache, 16 Kbyte data cache
  • Performance impact of
  • Processor type (TM3260 vs TM3270)
  • Frequency (240 MHz vs 350 MHz)
  • Data cache size (16 Kbyte vs 128 Kbyte)

31
Realization
PERFORMANCE WRT. TM3260
  • TM3260 source code
  • recompiled for TM3270, no source modifications
  • Lower bound of TM3270 performance potential

32
Conclusions
  • ISA enhancements to improve video processing
    performance
  • Two-slot operations
  • Collapsed load operations
  • CABAC decoding operations
  • Balanced design for performance, power and area
  • Enough performance to enable
  • Standard definition encode or decode (incl.
    H.264)
  • More information: IEEE ASAP 2005
  • Low power
  • Enables application in battery operated products
