Stream Architecture: Rethinking Media Processor Design

Slides: 38

Provided by: andrew638

Learn more at: https://www.cs.utexas.edu

1
Stream Architecture: Rethinking Media Processor Design
Scott Rixner
April 9, 2001

Rice University
Computer Systems Laboratory

2
Media Processing

Video/image compression and decompression
MPEG, JPEG, ...
Signal Processing
DSL modems, cellular base stations, ...
Image synthesis
Polygon rendering, image-based rendering, ...
Image understanding
Face recognition, depth extraction, ...

3
Stereo Depth Extraction
Left Camera Image
Right Camera Image

640x480 @ 30 fps
Requirements
11 GOPS
Imagine stream processor
12.1 GOPS, 4.6 GOPS/W
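
As a rough back-of-the-envelope check of the requirement above, the sketch below works out how many operations are available per output pixel. It assumes the 11 GOPS figure covers one 640x480 depth map per frame at 30 frames per second; the variable names are illustrative and not from the talk.

```python
# Back-of-the-envelope: operations per depth-map pixel.
WIDTH, HEIGHT, FPS = 640, 480, 30      # resolution and frame rate from the slide
REQUIRED_OPS_PER_SEC = 11e9            # 11 GOPS requirement from the slide

pixels_per_sec = WIDTH * HEIGHT * FPS
ops_per_pixel = REQUIRED_OPS_PER_SEC / pixels_per_sec

print(f"{pixels_per_sec:,} pixels per second")
print(f"roughly {ops_per_pixel:.0f} operations per pixel")
```

Under these assumptions the application needs on the order of a thousand operations for every depth-map pixel, which is why it is described as compute intensive.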

Depth Map
4
Outline

Stream Processing
VLSI Constraints
Register Organization
Imagine
Conclusions

5
Media Processing Characteristics

Low-precision data
24% 8-bit integer operations
29% 16-bit integer operations
Abundant data-parallelism
Little global data reuse
Average of 1.5 references per global data word
Numerous computations per global reference
50-500 operations per global data reference

6
Stream Processing

Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (>60 operations per memory reference)

7
Locality and Concurrency
Operations within a kernel operate on local data
Kernels can be partitioned across chips to exploit control parallelism
Streams expose data parallelism
[Figure: stream graph for stereo depth extraction. Image 0 and Image 1 each pass through two convolve kernels; a SAD kernel then produces the Depth Map.]
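
The kernel-and-stream structure on this slide can be illustrated with a toy one-dimensional analogue. This is only a sketch: the kernel names mirror the slide, but the filter taps, window size, disparity range, and the `depth_pipeline` helper are made-up illustration values, not the actual Imagine kernels.

```python
# Toy 1-D analogue of the stereo pipeline: convolve -> convolve -> SAD.
# Each kernel consumes whole streams and touches only nearby elements.

def convolve(stream, taps):
    """Kernel: filter a stream; each output depends on len(taps) neighbours."""
    n = len(taps)
    return [sum(t * stream[i + k] for k, t in enumerate(taps))
            for i in range(len(stream) - n + 1)]

def sad(left, right, max_disparity, window=3):
    """Kernel: for each position, pick the disparity whose window has the
    smallest sum of absolute differences (ties go to the smaller disparity)."""
    depth = []
    for i in range(len(left) - window - max_disparity + 1):
        costs = [sum(abs(left[i + d + w] - right[i + w]) for w in range(window))
                 for d in range(max_disparity + 1)]
        depth.append(costs.index(min(costs)))
    return depth

def depth_pipeline(image0_row, image1_row, max_disparity=4):
    blur = [1, 2, 1]     # hypothetical smoothing taps
    edge = [-1, 0, 1]    # hypothetical edge-enhancing taps
    s0 = convolve(convolve(image0_row, blur), edge)
    s1 = convolve(convolve(image1_row, blur), edge)
    return sad(s0, s1, max_disparity)

# Usage: the second row is the first shifted by two samples.
right = [i * i for i in range(20)]
left = [0, 0] + right[:-2]
print(depth_pipeline(left, right))   # every entry is the recovered shift, 2
```

The two convolve chains are independent of each other, and every output element of each kernel is independent of the others, which is the data parallelism the slide refers to.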
8
Sony PlayStation2
[Figure: block diagram. Labels: Emotion Engine (MIPS Core, FPU, VPU0, VPU1, IPU), Graphics Synthesizer, Display, RDRAM, I/O, DMAC, etc.]
9
Special vs. General Purpose

Special Purpose
Fixed function
High performance
General Purpose
Programmable
Insufficient performance

10
Register Files Dwarf ALUs
11
Register File Area

Each cell requires
1 word line per port
1 bit line per port
Each cell grows as $p^2$
R registers in the file
Area $\propto p^2 R \propto N^3$

Signal must traverse
Word line to access cell
Bit line to transfer data
Wire capacitance dominates
Delay $\propto p R^{1/2} \propto N^{3/2}$

100% utilization requires
driving all $p R^{1/2}$ bit lines
Wire capacitance dominates
Power $\propto p^2 R \propto N^3$

Area, Power $\propto N^3$, Delay $\propto N^{3/2}$
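
The scaling argument on this slide can be checked numerically. The sketch below assumes, which the slide does not state explicitly, that the port count p and the register count R both grow linearly with the number of ALUs N; the constants `ports_per_alu` and `regs_per_alu` are hypothetical, and all results are in arbitrary units.

```python
def central_register_file(n_alus, ports_per_alu=3, regs_per_alu=16):
    """Relative area, delay, and power of one central register file."""
    p = ports_per_alu * n_alus     # assumed: ports grow linearly with N
    r = regs_per_alu * n_alus      # assumed: registers grow linearly with N
    area = p**2 * r                # each cell grows as p^2, R cells
    delay = p * r**0.5             # wire length across the array
    power = p**2 * r               # same proportionality as area
    return area, delay, power

a1, d1, w1 = central_register_file(8)
a2, d2, w2 = central_register_file(16)
print(f"area x{a2 / a1:.2f}, delay x{d2 / d1:.2f}, power x{w2 / w1:.2f}")
```

Doubling the number of ALUs multiplies area and power by 8 and delay by about 2.83, matching the $N^3$ and $N^{3/2}$ trends regardless of the constants chosen.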

15
Partitioned Organizations

SIMD
Data-parallel axis
Distributed Register Files (DRF)
Instruction-level parallel axis
Hierarchical
Memory hierarchy axis
Stream
Optimizing for streams

16
SIMD Register Organization

Area, Power $\propto N^3/C^2$, Delay $\propto (N/C)^{3/2}$

17
Distributed Register Organization

Area, Power $\propto N^2$, Delay $\propto N$

18
Combining SIMD and DRF
Scalar
SIMD
Central
DRF
19
Hierarchical Register Organization
Hierarchical T40

Area, Power $\propto N^3$, Delay $\propto N^{3/2}$

20
Hierarchical Organizations
Scalar
SIMD
Central
DRF
21
Stream Register Organization

Area, Power $\propto N^2/C$, Delay $\propto N/C$
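
To compare the asymptotic trends quoted for each organization side by side, the sketch below evaluates them for N = 48 ALUs (the count used in the comparison slide) and C = 8 clusters. The cluster count is an assumption for illustration. These are proportionalities with constant factors dropped, so the ratios they give are not expected to match the measured improvement factors reported in the talk.

```python
def scaling(n, c):
    """Asymptotic area/power and delay terms per organization (arbitrary units)."""
    return {
        "central":      {"area_power": n**3,        "delay": n**1.5},
        "SIMD":         {"area_power": n**3 / c**2, "delay": (n / c)**1.5},
        "DRF":          {"area_power": n**2,        "delay": n},
        "hierarchical": {"area_power": n**3,        "delay": n**1.5},
        "stream":       {"area_power": n**2 / c,    "delay": n / c},
    }

N_ALUS, N_CLUSTERS = 48, 8     # N from the slides; C = 8 is assumed
table = scaling(N_ALUS, N_CLUSTERS)
for name, terms in table.items():
    print(f"{name:13s} area/power ~ {terms['area_power']:>9.0f}   "
          f"delay ~ {terms['delay']:6.1f}")
```

The SIMD term follows from building C independent register files that each serve N/C ALUs, giving C * (N/C)^3 = N^3/C^2; the stream term combines that partitioning with the distributed organization inside each cluster.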

22
Stream Organizations
Scalar
SIMD
Central
DRF
23
Comparison of Organizations

48 ALUs (32-bit), 500 MHz
Stream organization improves on central organization by:
Area 195x, Delay 20x, Power 430x

24
Performance
16% Performance Drop (8% with latency constraints)
180x Improvement
25
Stream Architecture

Stream Processing
Matched to media processing
Exposes locality and concurrency
Stream Register Organization
Efficiency of special-purpose hardware
Optimized for streaming applications
Data bandwidth
Bandwidth hierarchy
Memory access scheduling
Conditional streams

26
The Imagine Stream Processor
27
Arithmetic Clusters
[Figure: arithmetic cluster block diagram. Labels: Local Register File, Scratch-pad Register File, Communication Unit (CU), /, Cross Point, Intercluster Network, To SRF, From SRF.]
28
Bandwidth Hierarchy
[Figure: bandwidth hierarchy. SDRAM to Stream Register File: 2 GB/s; Stream Register File to ALU clusters: 32 GB/s; within the ALU clusters: 544 GB/s.]

41.2 32-bit operations per word of memory bandwidth
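
The three bandwidth figures on this slide can be expressed as ratios relative to off-chip memory. The sketch below assumes the figures map, in increasing order, to SDRAM, the stream register file, and the register files inside the ALU clusters; the dictionary keys are illustrative labels.

```python
# Peak bandwidth at each level of the hierarchy, in GB/s (from the slide).
BANDWIDTH_GB_S = {
    "SDRAM": 2,
    "stream register file": 32,
    "ALU clusters": 544,
}

base = BANDWIDTH_GB_S["SDRAM"]
ratios = {level: bw / base for level, bw in BANDWIDTH_GB_S.items()}

for level, ratio in ratios.items():
    print(f"{level:22s} {ratio:6.0f}x memory bandwidth")
```

Each level supplies roughly an order of magnitude more bandwidth than the one below it (1 : 16 : 272), so an application keeps the ALUs busy only if most of its data references are satisfied inside the clusters or the stream register file.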

29
Stream Recirculation
30
Bandwidth Demands of FIR Filter
31
Bandwidth Utilization of FIR Filter
32
Performance
floating-point application
16-bit kernels
16-bit applications
floating-point kernel
33
Power
GOPS/W (in chart order): 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
34
Relative Performance and Power Efficiency
FFT Performance
Power Efficiency
35
Imagine Floorplan