Title: Stream Architecture: Rethinking Media Processor Design
1Stream Architecture: Rethinking Media Processor Design
Scott Rixner April 9, 2001
- Rice University
- Computer Systems Laboratory
2Media Processing
- Video/image compression and decompression
- MPEG, JPEG, ...
- Signal Processing
- DSL modems, cellular base stations, ...
- Image synthesis
- Polygon rendering, image-based rendering, ...
- Image understanding
- Face recognition, depth extraction, ...
3Stereo Depth Extraction
Left Camera Image
Right Camera Image
- 640x480 at 30 fps
- Requirements
- 11 GOPS
- Imagine stream processor
- 12.1 GOPS, 4.6 GOPS/W
Depth Map
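As a rough check on the figures above, the sketch below divides the stated 11 GOPS requirement by the pixel rate of a 640x480, 30 fps image stream, and derives the power implied by the quoted Imagine numbers (12.1 GOPS at 4.6 GOPS/W). The function names are illustrative and not from the slides, and the calculation assumes the operation count is spread evenly over pixels.

```python
def ops_per_pixel(width, height, fps, ops_per_second):
    """Operations per pixel at a given sustained operation rate."""
    pixels_per_second = width * height * fps
    return ops_per_second / pixels_per_second


def implied_power_watts(gops, gops_per_watt):
    """Power implied by a throughput and an efficiency figure."""
    return gops / gops_per_watt


# Stated requirement: 11 GOPS for 640x480 at 30 fps.
required = ops_per_pixel(640, 480, 30, 11e9)
# Quoted Imagine figures: 12.1 GOPS at 4.6 GOPS/W.
power = implied_power_watts(12.1, 4.6)

print(f"{required:.0f} operations per pixel")
print(f"{power:.2f} W implied power")
```

Under these assumptions the requirement works out to roughly 1,200 operations per pixel, and the quoted throughput and efficiency imply a power of about 2.6 W.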
4Outline
- Stream Processing
- VLSI Constraints
- Register Organization
- Imagine
- Conclusions
5Media Processing Characteristics
- Low-precision data
- 24% 8-bit integer operations
- 29% 16-bit integer operations
- Abundant data-parallelism
- Little global data reuse
- Average of 1.5 references per global data word
- Numerous computations per global reference
- 50-500 operations per global data reference
6Stream Processing
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels not dependent on other output pixels)
- Compute intensive (>60 operations per memory reference)
7Locality and Concurrency
Operations within a kernel operate on local data
Kernels can be partitioned across chips to
exploit control parallelism
[Figure: stream and kernel diagram. Image 0 and Image 1 each pass through two convolve kernels; a SAD kernel produces the Depth Map.]
Streams expose data parallelism
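The kernel pipeline on this slide can be sketched as ordinary functions that consume and produce streams. This is a simplified one-dimensional illustration, not the Imagine programming model: the slides name only the kernels (convolve, SAD), so the function names, the filter taps, the disparity search, and the window size below are all assumptions.

```python
def convolve(stream, taps):
    """Kernel: 1-D filter over a stream. The taps are local to the kernel."""
    k = len(taps)
    return [sum(taps[j] * stream[i + j] for j in range(k))
            for i in range(len(stream) - k + 1)]


def sad(left, right, max_disparity, window):
    """Kernel: for each position, pick the disparity with the smallest
    sum of absolute differences between a left and a right window."""
    depth = []
    for i in range(len(left) - window + 1):
        best_d, best_cost = 0, None
        for d in range(max_disparity + 1):
            if i - d < 0:
                break  # the shifted window would fall off the row
            cost = sum(abs(left[i + j] - right[i - d + j])
                       for j in range(window))
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        depth.append(best_d)
    return depth


def depth_row(row0, row1, taps_a, taps_b, max_disparity=3, window=2):
    """Stream program: two convolve kernels per image, then one SAD kernel."""
    filtered0 = convolve(convolve(row0, taps_a), taps_b)
    filtered1 = convolve(convolve(row1, taps_a), taps_b)
    return sad(filtered0, filtered1, max_disparity, window)


# Hypothetical test rows: row0 is row1 shifted right by two pixels.
row1 = [1, 7, 3, 9, 4, 8, 2, 6, 5, 0]
row0 = [0, 0] + row1[:-2]
disparities = depth_row(row0, row1, taps_a=[1], taps_b=[1])
```

Each kernel touches only its own inputs and local state, which is the locality the slide refers to. Every output element is computed independently of the others, which is the data parallelism that streams expose.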
8Sony PlayStation2
Emotion Engine
FPU
MIPS Core
VPU0
Graphics Synthesizer
VPU1
Display
IPU
RDRAM, I/O, DMAC, etc.
9Special vs. General Purpose
- Special Purpose
- Fixed function
- High performance
- General Purpose
- Programmable
- Insufficient performance
10Register Files Dwarf ALUs
11Register File Area
- Each cell requires
- 1 word line per port
- 1 bit line per port
- Each cell grows as $p^2$
- $R$ registers in the file
- Area: $p^2 R \propto N^3$
Register Bit Cell
12Register File Access Delay
- Signal must traverse
- Word line to access cell
- Bit line to transfer data
- Wire capacitance dominates
- Delay: $p R^{1/2} \propto N^{3/2}$
Register File
13Register File Power Dissipation
- 100% utilization requires
- driving all $p R^{1/2}$ bit lines
- Wire capacitance dominates
- Power: $p^2 R \propto N^3$
Register File
14Centralized Register Organization
- Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
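The scaling of a centralized register file can be checked with a small model. It uses the relations from the preceding slides (area and power proportional to p^2 R, delay proportional to p R^(1/2)) and assumes that both the port count p and the register count R grow linearly with the number of ALUs N. The constants `ports_per_alu` and `regs_per_alu` are hypothetical and cancel out of the growth ratios.

```python
def central_register_file(n_alus, ports_per_alu=3, regs_per_alu=1):
    """Relative cost of one register file shared by n_alus ALUs.

    Assumes p and R are both proportional to N; the per-ALU constants
    are illustrative and only set the scale.
    """
    p = ports_per_alu * n_alus   # ports
    r = regs_per_alu * n_alus    # registers
    return {
        "area": p ** 2 * r,      # area  ~ p^2 R
        "delay": p * r ** 0.5,   # delay ~ p R^(1/2)
        "power": p ** 2 * r,     # power ~ p^2 R
    }


base = central_register_file(8)
doubled = central_register_file(16)

area_growth = doubled["area"] / base["area"]      # N^3 scaling
delay_growth = doubled["delay"] / base["delay"]   # N^(3/2) scaling
print(area_growth, delay_growth)
```

Doubling the number of ALUs multiplies area and power by 8 and delay by about 2.8 in this model, which is why a single central register file does not scale to tens of ALUs.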
15Partitioned Organizations
- SIMD
- Data-parallel axis
- Distributed Register Files (DRF)
- Instruction-level parallel axis
- Hierarchical
- Memory hierarchy axis
- Stream
- Optimizing for streams
16SIMD Register Organization
- Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$
17Distributed Register Organization
- Area, Power $\propto N^2$; Delay $\propto N$
18Combining SIMD and DRF
[Figure labels: Scalar, SIMD, Central, DRF]
19Hierarchical Register Organization
Hierarchical T40
- Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
20Hierarchical Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
21Stream Register Organization
- Area, Power $\propto N^2/C$; Delay $\propto N/C$
22Stream Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
23Comparison of Organizations
- 48 ALUs (32-bit), 500 MHz
- Stream organization improves on the central organization by
- Area 195x, Delay 20x, Power 430x
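The asymptotic scaling rules quoted on the earlier slides can be placed side by side for N = 48 ALUs. This is a sketch only: it takes C to be the number of SIMD clusters and uses C = 8 as an illustrative value, and it drops every constant factor, so it is not expected to reproduce the 195x, 20x, and 430x figures above, which presumably come from a more detailed model.

```python
def scaling(n, c):
    """Asymptotic register-file cost for each organization, with n ALUs
    and c clusters (c is an assumed parameter). Constants are omitted."""
    return {
        "central": {"area_power": n ** 3, "delay": n ** 1.5},
        "simd": {"area_power": n ** 3 / c ** 2, "delay": (n / c) ** 1.5},
        "drf": {"area_power": n ** 2, "delay": n},
        "stream": {"area_power": n ** 2 / c, "delay": n / c},
    }


costs = scaling(n=48, c=8)
central, stream = costs["central"], costs["stream"]

area_power_ratio = central["area_power"] / stream["area_power"]  # N * C
delay_ratio = central["delay"] / stream["delay"]                 # sqrt(N) * C

for name, cost in costs.items():
    print(f"{name:8s} area/power {cost['area_power']:10.0f} "
          f"delay {cost['delay']:7.1f}")
```

In this constant-free model the stream organization's advantage over the central one grows as N times C for area and power and as the square root of N times C for delay, so the gap widens as more ALUs are added.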
24Performance
16% Performance Drop (8% with latency constraints)
180x Improvement
25Stream Architecture
- Stream Processing
- Matched to media processing
- Exposes locality and concurrency
- Stream Register Organization
- Efficiency of special-purpose hardware
- Optimized for streaming applications
- Data bandwidth
- Bandwidth hierarchy
- Memory access scheduling
- Conditional streams
26The Imagine Stream Processor
27Arithmetic Clusters
Communication Unit
Scratch-pad Register File
Intercluster Network
Local Register File
/
CU
To SRF
Cross Point
From SRF
28Bandwidth Hierarchy
[Figure: bandwidth hierarchy from SDRAM through the Stream Register File to the ALU clusters; bandwidth labels 2GB/s, 32GB/s, 544GB/s]
- 41.2 32-bit operations per word of memory bandwidth
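The ratios between the three bandwidth figures on this slide can be worked out directly. The sketch assumes the figures map, in order, to SDRAM, the Stream Register File, and the ALU clusters, that a word is 4 bytes (32 bits), and that 1 GB/s is 10^9 bytes per second. The mapping and the names are assumptions made for illustration.

```python
# (level, bandwidth in GB/s); the level-to-figure mapping is assumed.
LEVELS = [
    ("SDRAM", 2.0),
    ("Stream Register File", 32.0),
    ("ALU clusters", 544.0),
]


def ratios(levels):
    """Bandwidth of each level relative to the first (slowest) level."""
    base = levels[0][1]
    return [(name, bandwidth / base) for name, bandwidth in levels]


def implied_ops_per_second(memory_bytes_per_second, word_bytes, ops_per_word):
    """Operation rate implied by an operations-per-memory-word figure."""
    words_per_second = memory_bytes_per_second / word_bytes
    return words_per_second * ops_per_word


relative = ratios(LEVELS)
peak_ops = implied_ops_per_second(2e9, 4, 41.2)

for name, ratio in relative:
    print(f"{name}: {ratio:.0f}x memory bandwidth")
print(f"{peak_ops / 1e9:.1f} G operations per second")
```

Under these assumptions the hierarchy spans ratios of 1 : 16 : 272, and 41.2 operations per memory word at 2 GB/s corresponds to about 20.6 billion 32-bit operations per second.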
29Stream Recirculation
30Bandwidth Demands of FIR Filter
31Bandwidth Utilization of FIR Filter
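To make the FIR example concrete, the sketch below counts how many words an FIR kernel moves through its input and output streams compared with the arithmetic it performs. It assumes the taps and the sliding window of recent samples are held in local registers. The counter names are hypothetical, and the slides' own measurements are not reproduced here.

```python
def fir(samples, taps):
    """FIR filter with a local sliding window; returns outputs and counts."""
    k = len(taps)
    window = [0] * k  # most recent k samples, kept in local storage
    out = []
    counts = {"stream_words": 0, "local_reads": 0, "ops": 0}
    for x in samples:
        counts["stream_words"] += 1      # read one new input word
        window = [x] + window[:-1]       # shift the local window
        acc = 0
        for tap, sample in zip(taps, window):
            acc += tap * sample
            counts["local_reads"] += 2   # one tap and one sample
            counts["ops"] += 2           # one multiply and one add
        out.append(acc)
        counts["stream_words"] += 1      # write one output word
    return out, counts


outputs, counts = fir([1, 2, 3, 4], taps=[1, 1])
print(outputs, counts)
```

With K taps, each output costs 2K arithmetic operations and 2K local reads but only two stream words, so the ratio of local to stream traffic grows with the filter length.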
32Performance
floating-point application
16-bit kernels
16-bit applications
floating-point kernel
33Power
GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
34Relative Performance and Power Efficiency
FFT Performance
Power Efficiency
35Imagine Floorplan
- Tapeout Q2 2001
- 21 million transistors
- 6M SRF SRAM
- 6M UC SRAM
- 6M Clusters
- 3M Other
- Target 32 FO4
- 300 MHz at SSSS
- 500 MHz at TTSS
- TI GS30KA
- 0.15 µm L_drawn
- 457 Signal Pins
36Imagine Team
- William J. Dally
- Ujval Kapasi
- Brucek Khailany
- Peter Mattson
- Jinyung Namkoong
- John Owens
- Ben Serebrin
- Brian Towles
- Scott Rixner
- Don Alpert (Intel)
- Ghazi Ben Amor
- Chris Buehler (MIT)
- JP Grossman (MIT)
- Brad Johanson
- Abelardo Lopez-Lagunas
- Ben Mowery
- Manman Ren
37Conclusions
- Media Processing
- Little data reuse
- Highly data parallel
- Compute intensive
- VLSI
- Stream register organization
- Bandwidth hierarchy
- Imagine
- Stream architecture
- 10 GOPS sustained application performance
- 5 GOPS/W application power efficiency