Title: The TM3270 TriMedia Media-Processor
1The TM3270 TriMedia Media-Processor
- Jan-Willem van de Waerdt
- Chief Processor Architect
- TmCoE Philips Semiconductors
2Outline
- Introduction
- Overview
- Architecture highlights
- Implementation highlights
- Realization highlights
- Conclusions
3Introduction
TRIMEDIA TM3270 DESIGN OBJECTIVES
- Multi-purpose programmable solution
- Standard video/audio (en/de)-coders
- H.264 standard definition encode or decode
- MPEG2, WMV high definition decode
- Proprietary video enhancement processing to improve picture quality
- E.g. temporal upconversion, de-interlacing, peaking, etc.
- Many applications evolve after HW designs complete
- Programmable solution -> fast time-to-market
- Multi-market (connected and battery operated)
- Share development costs (CPU, software and tools development)
- Low power (battery operated products)
- The bottom line
- Enough programmable performance at acceptable power consumption with smallest silicon area
4Overview
TRIMEDIA TM3270 CHARACTERISTICS
- Fully synthesizable design (450/350 MHz)
- VLIW machine with 5 issue slots
- 32-bit address range, 32-bit datapath
- Operations are guarded
- Unified 128x32-bit register-file
- 35 execution units
- SIMD multimedia and IEEE754 FP operation support
- 64 Kbyte instruction cache (8-way set associative)
- 128 Kbyte data cache (4-way set associative)
- Variable length instruction encoding
- Pipeline depth 7-12 stages
Example VLIW instruction (five operations):
if r34 fadd r12 r14 -> r56,
if r24 fmul r3 r89 -> r112,
add r34 r76 -> r121,
std32d(4) r42 r48,
if r21 ld32d(-8) r93 -> r45
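A guarded operation only takes effect when its guard register holds a true value. The following is a minimal C sketch of that idea for a guarded integer add. It assumes the guard is taken from bit 0 of the guard register; the slide does not spell this out, and the names `regfile` and `guarded_add` are hypothetical.

```c
#include <stdint.h>

#define NUM_REGS 128  /* unified 128 x 32-bit register file */

typedef struct {
    uint32_t r[NUM_REGS];
} regfile;

/* Model of "if rg add ra rb -> rd".
 * Assumption: the guard is true when bit 0 of register rg is set. */
static void guarded_add(regfile *rf, int rg, int ra, int rb, int rd)
{
    if (rf->r[rg] & 1u) {
        rf->r[rd] = rf->r[ra] + rf->r[rb];
    }
    /* Guard false: the destination register keeps its old value. */
}
```

With a false guard the operation still occupies its issue slot but leaves the destination untouched, which lets the compiler turn short branches into straight-line code.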
5Overview
TRIMEDIA TM3270 PIPELINE
[Figure: TM3270 pipeline, with callouts]
- Sequential instruction cache design
- Unified register file
- Two-slot execution unit: double the register-file bandwidth
- Load/store unit connects to two issue slots
6Architecture
- Highlights
- New operations
- Two-slot operations
- Fractional load operations
- More information: ACM SAC 2005
- H.264 CABAC decoding operations
- Data prefetching
7Architecture
TWO-SLOT OPERATIONS
SUPER_QUADUMEDIAN src1 src2 src3 -> dst
[Figure: three sources (src1, src2, src3) feed one destination (dst)]
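The slide shows only the operand shape: three sources and one destination, which is more than a single issue slot can supply. A minimal C sketch follows. It assumes that the operation computes a per-byte unsigned median of three 32-bit words, each holding four 8-bit lanes; the mnemonic suggests this reading, but the slide does not confirm it. The function names are hypothetical.

```c
#include <stdint.h>

/* Median of three unsigned bytes: max(min(a,b), min(max(a,b), c)). */
static uint8_t median3(uint8_t a, uint8_t b, uint8_t c)
{
    uint8_t lo = a < b ? a : b;
    uint8_t hi = a < b ? b : a;
    uint8_t m  = hi < c ? hi : c;
    return lo > m ? lo : m;
}

/* Assumed semantics: lane-wise median over four 8-bit lanes. */
static uint32_t quad_umedian(uint32_t src1, uint32_t src2, uint32_t src3)
{
    uint32_t dst = 0;
    for (int lane = 0; lane < 4; lane++) {
        int sh = 8 * lane;
        uint8_t m = median3((uint8_t)(src1 >> sh),
                            (uint8_t)(src2 >> sh),
                            (uint8_t)(src3 >> sh));
        dst |= (uint32_t)m << sh;
    }
    return dst;
}
```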
8Architecture
TWO-SLOT OPERATIONS
SUPER_LD32R src1 src2 -> dst1 dst2
[Figure: the 32-bit sources src1 and src2 form address A; memory bytes A .. A+7 are loaded into the two destinations dst1 and dst2]
9Architecture
TWO-SLOT OPERATIONS
SUPER_LD32R src1 src2 -> dst1 dst2
- 2 independent destinations
- 4 cycle latency
- Non-aligned memory access
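The behaviour of this two-slot load can be modelled in a few lines of C. The sketch assumes that address A is the sum of src1 and src2, that bytes A .. A+3 go to dst1 and bytes A+4 .. A+7 go to dst2, and that bytes are assembled in big-endian order. The slide does not state the byte order, and the names `ld_pair`, `load32_be` and `super_ld32r_model` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t dst1;
    uint32_t dst2;
} ld_pair;

/* Assemble a 32-bit word from four consecutive bytes
 * (big-endian order is an assumption of this sketch). */
static uint32_t load32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Load 8 consecutive bytes from a possibly non-aligned address. */
static ld_pair super_ld32r_model(const uint8_t *mem,
                                 uint32_t src1, uint32_t src2)
{
    uint32_t a = src1 + src2;  /* assumed address calculation */
    ld_pair out;
    out.dst1 = load32_be(mem + a);
    out.dst2 = load32_be(mem + a + 4);
    return out;
}
```

Because the model reads byte by byte, it makes no alignment demands on A, mirroring the non-aligned access support listed above.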
10Architecture
FRACTIONAL LOAD OPERATIONS
LD_FRAC8 src1 src2 -> dst
[Figure: src1 (bits 31..0) supplies address A; bits 3..0 of src2 supply the fractional position]
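The slide gives only the operands: an address A and a 4-bit fraction. One plausible reading is a load that interpolates between neighbouring bytes. The C sketch below assumes a two-tap linear filter with 1/16 resolution and round-to-nearest, producing four output bytes from five input bytes. The filter, the rounding and the output byte order are all assumptions made for illustration, not the documented behaviour.

```c
#include <stdint.h>

/* Hypothetical model: dst byte i = ((16 - f) * mem[A+i] + f * mem[A+i+1] + 8) / 16,
 * where f is the 4-bit fraction taken from src2[3:0]. */
static uint32_t ld_frac8_model(const uint8_t *mem, uint32_t a, uint32_t src2)
{
    uint32_t frac = src2 & 0xFu;
    uint32_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t p0 = mem[a + i];
        uint32_t p1 = mem[a + i + 1];
        uint32_t v = ((16u - frac) * p0 + frac * p1 + 8u) >> 4;
        dst = (dst << 8) | v;  /* first byte ends up in the top lane */
    }
    return dst;
}
```

Folding the filter into the load saves the separate load, unpack and multiply-add operations that fractional-pixel motion compensation would otherwise need.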
11Architecture
CABAC DECODING OPERATIONS
- Context-Adaptive Binary Arithmetic Coding (CABAC)
- H.264/AVC compression feature
- Lossless compression of syntax elements in the video stream based on the probabilities of syntax elements in a given context
- Achieves high compression ratio
- High computational complexity
12Architecture
CABAC DECODING OPERATIONS
Function: biari_decode_symbol
Semantics: CABAC decoding of a single bin
Inputs: value, range, state, mps, bit_position, data
Outputs: value, range, state, mps, bit_position, bit
data_aligned = data << bit_position
range_lps = LpsRangeTable[state][(range >> 6) & 3]
temp_range = range - range_lps
most_probable = value < temp_range
bit = if most_probable then mps else !mps
value = if most_probable then value else value - temp_range
range = if most_probable then temp_range else range_lps
mps = if most_probable then mps else mps ^ (state == 0)
state = if most_probable then MpsStateTable[state] else LpsStateTable[state]
while (range < 256) {
  value = (value << 1) | ((data_aligned >> 31) & 1)
  range <<= 1
  data_aligned <<= 1
  bit_position += 1
}
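The single-bin decoding semantics above translate almost line for line into C. The sketch below takes the three lookup tables as parameters instead of embedding the H.264 tables, so the caller must supply them. It toggles mps on a least-probable symbol when the state is 0, as the H.264 standard specifies. The struct and field names are hypothetical, and bit_position is assumed to stay below 32.

```c
#include <stdint.h>

typedef struct {
    uint32_t value;
    uint32_t range;
    uint32_t state;
    uint32_t mps;
    uint32_t bit_position;  /* assumed < 32 */
} cabac_state;

typedef struct {
    const uint16_t (*lps_range)[4];  /* LpsRangeTable[state][0..3] */
    const uint8_t *mps_state;        /* MpsStateTable[state] */
    const uint8_t *lps_state;        /* LpsStateTable[state] */
} cabac_tables;

/* Decode one bin; updates *c in place and returns the decoded bit. */
static uint32_t biari_decode_symbol(cabac_state *c, uint32_t data,
                                    const cabac_tables *t)
{
    uint32_t data_aligned = data << c->bit_position;
    uint32_t range_lps = t->lps_range[c->state][(c->range >> 6) & 3];
    uint32_t temp_range = c->range - range_lps;
    int most_probable = c->value < temp_range;
    uint32_t bit = most_probable ? c->mps : !c->mps;

    if (most_probable) {
        c->range = temp_range;
        c->state = t->mps_state[c->state];
    } else {
        c->value -= temp_range;
        c->range = range_lps;
        c->mps ^= (c->state == 0);  /* uses the state before its update */
        c->state = t->lps_state[c->state];
    }

    /* Renormalize: shift in stream bits until range >= 256. */
    while (c->range < 256) {
        c->value = (c->value << 1) | ((data_aligned >> 31) & 1);
        c->range <<= 1;
        data_aligned <<= 1;
        c->bit_position += 1;
    }
    return bit;
}
```

The table lookup, compare, select and renormalization loop form a long dependency chain, which is why a dedicated operation pays off here.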
14Architecture
CABAC DECODING OPERATIONS
Function: biari_decode_symbol (CABAC decoding of a single bin)
SUPER_CABAC_CTX (two-slot operation: 4 inputs and 2 outputs)
[Figure: sources src1 .. src4 carry the fields bit_position, data, value, range, state, mps; destinations dst1 and dst2 carry the fields bit_position, bit, value, range, state, mps]
15Architecture
CABAC DECODING OPERATIONS
Function: biari_decode_symbol (CABAC decoding of a single bin)
SUPER_CABAC_STR (two-slot operation: 3 inputs and 2 outputs)
[Figure: sources src1 .. src3 are drawn against the fields bit_position, data, value, range, state, mps; destinations dst1 and dst2 are drawn against the fields bit_position, bit, value, range, state, mps]
16Architecture
CABAC DECODING OPERATIONS
- H.264 CABAC 2.5 Mbits/s encoded bitstream
- 720x576 resolution @ 25 frames/s
- 80% of 16x16 blocks decomposed into sixteen 4x4 blocks
- No TriMedia-specific operations used, except for the CABAC decoding operations
18Architecture
DATA PREFETCHING
- Pre-fetching to hide SDRAM latency in SoC environment
- Pre-fetching based on memory regions
- Memory regions are under software control
- The programmer knows best: rely on the programmer's algorithm knowledge
- Simple program model / low overhead
- Four memory regions supported
19Architecture
DATA PREFETCHING
- Memory Region n defined by
- Start address: rgn_start_address
- End address: rgn_end_address
- Stride: rgn_stride
- Pre-fetching is triggered by loads
- Functionality
- Load for address A
- if ( (A >= rgn_start_address) and
- (A <= rgn_end_address) and
- (prefetch bit [A] is 1) and
- (miss for (A + rgn_stride)) ) then
- Pre-fetch for address (A + rgn_stride)
- Set prefetch bit [A + rgn_stride] to 1
- Set prefetch bit [A] to 0
[Figure: Region 0 spans rg0_start_address to rg0_end_address; address A is requested and address A + rg0_stride is pre-fetched]
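The trigger condition for a region-based pre-fetch can be sketched as a small C predicate. The sketch assumes inclusive region bounds and that all four conditions must hold together. The cache state (prefetch bit of A, hit or miss of A + stride) is passed in as plain booleans, because the slide does not describe how the hardware tracks it. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t start;   /* rgn_start_address */
    uint32_t end;     /* rgn_end_address (assumed inclusive) */
    uint32_t stride;  /* rgn_stride */
} prefetch_region;

/* Address that a load from address a would pre-fetch. */
static uint32_t prefetch_address(const prefetch_region *rgn, uint32_t a)
{
    return a + rgn->stride;
}

/* True when a load from address a should trigger a pre-fetch. */
static bool should_prefetch(const prefetch_region *rgn, uint32_t a,
                            bool prefetch_bit_a, bool next_is_miss)
{
    return a >= rgn->start && a <= rgn->end &&
           prefetch_bit_a && next_is_miss;
}
```

After a pre-fetch is issued, the model's caller would set the prefetch bit of the new line and clear the bit of the line at A, as the functionality list describes.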
20Architecture
DATA PREFETCHING
void Copy (char *src_ptr, char *dst_ptr, int size)
{
  for (int i = 0; i < size; i++)
    *dst_ptr++ = *src_ptr++;
}

void Copy (char *src_ptr, char *dst_ptr, int size)
{
  int *local_src_ptr = (int *) src_ptr;
  int *local_dst_ptr = (int *) dst_ptr;
  rg0_start_address = src_ptr;               // Perform pre-fetching
  rg0_end_address   = src_ptr + (size - 1);  // on the source region.
  rg0_stride        = 128;                   // Set pre-fetch stride
                                             // to data line size.
  for (int i = 0; i < size/4; i++)           // Non-aligned LD/ST support.
    *local_dst_ptr++ = *local_src_ptr++;     // write misses caused by
                                             // local_dst_ptr references
}
21Implementation
- Highlights
- Load/store unit
- 128 Kbyte, 4-way set associative
- Allocate-on-write-miss policy
- The right balance between area and performance
- More information: IEEE ICCD 2005
22Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram]
- Inputs: src1 and src2 for slot 4 and for slot 5, each with its own address calculation
- Refill unit, prefetch unit, copy back unit
- Slot 4 tag arbiter, slot 5 tag arbiter, four data arbiters
- Slot 4 tags and slot 5 tags (tag SRAMs), cache data (data SRAMs), cache write buffer (pending stores)
- Tag comparison per slot
- Slot 4 store data, slot 5 store data
- Slot 4 control state machine, slot 5 control state machine
- Cache way selection, load aligner, sign extension
- Fractional load filter bank, controlled by src2[3:0]
- Outputs: slot 4 dst, slot 5 dst
Callouts:
- Two copies of tags: cache line crossing, unaligned access
- Multiple data SRAM arbiters
- Two additional cycles for fractional loads
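The callout about cache line crossing can be made concrete with a little address arithmetic. The C sketch below uses the figures given elsewhere in the deck (128 Kbyte, 4-way, 128-byte lines), which yields 256 sets. It assumes a conventional split of the address into offset, set index and tag; the actual TM3270 indexing scheme is not stated on the slides, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    LINE_BYTES  = 128,                             /* data line size */
    WAYS        = 4,
    CACHE_BYTES = 128 * 1024,
    SETS        = CACHE_BYTES / (WAYS * LINE_BYTES) /* = 256 */
};

/* Assumed conventional mapping: offset | set index | tag. */
static uint32_t set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % SETS;
}

static uint32_t tag_of(uint32_t addr)
{
    return addr / (LINE_BYTES * SETS);
}

/* True when an access of nbytes starting at addr spans two lines,
 * so two sets must be looked up for a single load or store. */
static bool crosses_line(uint32_t addr, uint32_t nbytes)
{
    return addr / LINE_BYTES != (addr + nbytes - 1) / LINE_BYTES;
}
```

A non-aligned access that crosses a line touches two different sets, so it needs two tag lookups; this is presumably why the unit keeps more than one copy of the tags.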
24Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram, same blocks as on the first load/store unit slide, annotated "LD_FRAC8 operation"]
26Implementation
LOAD/STORE UNIT
[Figure: load/store unit block diagram, same blocks as on the first load/store unit slide, annotated "LD operation"]
27Realization
OVERVIEW
- Fully synthesizable design, low power process technology
- Relatively high VT
- Frequency
- 450 MHz (1.2 V, 85 °C, worst case process corner)
- 350 MHz (1.1 V, 125 °C, worst case process corner)
- Area 8.08 mm²
- 46.9% for SRAMs
- 13.9%: 64 Kbyte instruction cache
- 33.0%: 128 Kbyte data cache
- Standard cell utilization 65%
- Power 0.7 - 1.0 mW/MHz (1.2 V)
- Extensive usage of clock gating (>70 domains in total)
28Realization
TM3270 FLOORPLAN
[Figure: TM3270 floorplan. Labelled blocks: instruction cache (INSTRUCTION, INSTR. TAG, ILRU), instruction fetch unit, decode, bus interface unit, MMIO, load store unit, data cache (DATA, BYTE VALID)]
29Realization
MP3 POWER DISTRIBUTION
- Power
- Static (negligible)
- Low power process technology: high VT standard cells
- Dynamic: CV²f
- C: process technology and switching activity (clock gating)
- V: process voltage, range 0.8 - 1.4 V
- MP3 decoder (384 Kbits/s stereo decoding at 44.1 kHz) -> 8 MHz execution time
- 0.94 mW/MHz (1.2 V) -> 7.48 mW
- Voltage scaling
- 6.29 mW (1.1 V), 5.19 mW (1.0 V), 4.21 mW (0.9 V), 3.32 mW (0.8 V)
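The voltage-scaling figures follow from the dynamic power relation CV²f: with C and f held constant, power scales with the square of the supply voltage. The short C sketch below applies that ratio to the 7.48 mW reference at 1.2 V. The function name is hypothetical, and static power is ignored, as the slide treats it as negligible.

```c
/* Dynamic power scales with V^2 when C and f are unchanged. */
static double scaled_power_mw(double p_ref_mw, double v_ref, double v)
{
    double ratio = v / v_ref;
    return p_ref_mw * ratio * ratio;
}
```

For example, `scaled_power_mw(7.48, 1.2, 1.0)` gives about 5.19 mW, matching the value listed for 1.0 V.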
30Realization
PERFORMANCE WRT. TM3260
- TM3260: TM3270 predecessor
- 240 MHz
- 64 Kbyte I-cache, 16 Kbyte D-cache
- Performance impact of
- Processor type (TM3260 vs TM3270)
- Frequency (240 MHz vs 350 MHz)
- Data cache size (16 Kbyte vs 128 Kbyte)
31Realization
PERFORMANCE WRT. TM3260
- TM3260 source code
- recompiled for TM3270, no source modifications
- Lower bound of TM3270 performance potential
32Conclusions
- ISA enhancements to improve video processing performance
- Two-slot operations
- Collapsed load operations
- CABAC decoding operations
- Balanced design for performance, power and area
- Enough performance to enable
- Standard definition encode or decode (incl. H.264)
- More information: IEEE ASAP 2005
- Low power
- Enables application in battery operated products