Architectures for Video Signal Processing - PowerPoint PPT Presentation

1
Architectures for Video Signal Processing

Wayne Wolf
Dept. of EE
Princeton University

2
Outline

Multimedia requirements
Architectural styles
Jason Fritts's PhD work: programmable VSPs

3
Multimedia requirements

Today, compression is the dominant application.
Tomorrow, analysis will be as important:
object recognition
summarization
analysis of situations.

4
Storyboard made of keyframes
For political ads, see www.ee.princeton.edu/caeti
5
Key frame analysis algorithm

Compute optical flow.
Compute sum of magnitudes of optical flow vectors
per frame.
Select key frames at local minima; the min/max
ratio is a user parameter.

[Figure: plot of motion versus time, with keyframe 1 and keyframe 2 marked]
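
The selection step above can be sketched in a few lines. This is a minimal sketch, not the presenter's implementation: it assumes the per-frame sums of optical-flow magnitudes are already computed, and it reads the "min/max ratio" parameter as "the local minimum divided by the largest motion seen since the previous key frame", which is one plausible interpretation. The names `select_key_frames` and `min_max_ratio` are hypothetical.

```python
from typing import List, Sequence


def select_key_frames(motion: Sequence[float], min_max_ratio: float = 0.5) -> List[int]:
    """Return indices of frames that are local minima of the motion signal.

    motion[i] is the sum of optical-flow vector magnitudes for frame i.
    A local minimum is kept only if it is small relative to the largest
    motion seen since the previous key frame (assumed meaning of the
    slide's "min/max ratio" user parameter).
    """
    key_frames: List[int] = []
    peak = motion[0] if motion else 0.0  # largest motion since the last key frame
    for i in range(1, len(motion) - 1):
        peak = max(peak, motion[i])
        is_local_min = motion[i] <= motion[i - 1] and motion[i] < motion[i + 1]
        if is_local_min and peak > 0 and motion[i] / peak <= min_max_ratio:
            key_frames.append(i)
            peak = motion[i]  # restart peak tracking after a key frame
    return key_frames
```

A lower `min_max_ratio` keeps only the deepest dips in motion, so it yields fewer key frames.
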
6
The multimedia processing funnel
[Figure: processing funnel. Labels: pixel processing, data volume, data abstraction, principal component analysis, hidden Markov models]
7
Styles of video processing

Single-instruction multiple-data (SIMD).
Heterogeneous multiprocessors.
Instruction set architecture (ISA) extensions.
Very long instruction word (VLIW) processors.

8
SIMD processing

Broadcast operation to an array of processing
elements, each of which has its own data.
Well-suited to regular, data-oriented operations.

9
A block correlation architecture
[Figure: array of elements, each labeled D]

10
Heterogeneous multiprocessor design

Will need accelerators for quite some time to
come, for:
power
performance.
Candidates for acceleration:
complex coding and error correction
motion estimation.

11
Expensive operations

Expensive operations can be sped up by
special-purpose units:
specialized memory accesses
specialized datapath operations.
Special-purpose units may be useful for only
certain parameters:
block size
search region size.
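
To make the two parameters concrete, here is a minimal sketch of full-search block matching with a sum-of-absolute-differences (SAD) cost, a common form of motion estimation. The slides do not say which matching criterion or search strategy is meant, so SAD and exhaustive search are assumptions, and the names `sad`, `full_search`, `block` and `search` are illustrative.

```python
from typing import List, Tuple

Frame = List[List[int]]  # frame[row][column] holds one luminance value


def sad(cur: Frame, ref: Frame, cy: int, cx: int, ry: int, rx: int, block: int) -> int:
    """Sum of absolute differences between a block in cur and one in ref."""
    total = 0
    for dy in range(block):
        for dx in range(block):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur: Frame, ref: Frame, cy: int, cx: int,
                block: int = 8, search: int = 4) -> Tuple[int, int, int]:
    """Return (best_dy, best_dx, best_sad) for the block whose corner is (cy, cx).

    `block` is the block size and `search` the search region half-width;
    the work grows as block**2 * (2 * search + 1)**2.
    """
    height, width = len(ref), len(ref[0])
    best = (0, 0, sad(cur, ref, cy, cx, cy, cx, block))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if ry < 0 or rx < 0 or ry + block > height or rx + block > width:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, cy, cx, ry, rx, block)
            if cost < best[2]:
                best = (dy, dx, cost)
    return best
```

Because both loop bounds are fixed by `block` and `search`, a hardware unit built for one setting does not automatically serve another, which is the limitation the slide points out.
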

12
Communication bandwidth

Performance is often limited by communication
bandwidth:
internal
external.
Specialized communication topologies can make
more efficient use of available bandwidth.

13
ISA extensions

Augment instruction set of traditional
microprocessor to provide media processing
instructions:
smaller word sizes
operations particular to multimedia (saturation
arithmetic).
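
Saturation arithmetic clamps a result to the largest representable value instead of letting it wrap around. The sketch below contrasts the two for unsigned 8-bit pixel values; the function names are illustrative and are not taken from any particular instruction set.

```python
def wrap_add_u8(a: int, b: int) -> int:
    """Ordinary modular (wraparound) 8-bit addition."""
    return (a + b) & 0xFF


def sat_add_u8(a: int, b: int) -> int:
    """Unsigned saturating 8-bit addition: clamp to 255 instead of wrapping."""
    return min(a + b, 0xFF)
```

Brightening a pixel of value 200 by 100 gives 44 with wraparound (a dark pixel) but 255 with saturation (full white), which is why media code prefers the saturating form.
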

14
Why ISA extensions

Easy: provides significant parallelism with small
changes to architecture.
Cheap: can be implemented with
Effective: provides 2x-4x speedups.

15
Basic principles of ISA extensions

Split data word into subwords to provide single
instruction multiple data (SIMD) parallelism.
Assemble CPU word from pixels

[Figure: 64-bit CPU word made of four 16-bit subwords holding pixel 1 to pixel 4]
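
The subword layout can be modelled with plain integers. This is a minimal sketch that assumes pixel 1 sits in the least significant 16 bits (the slide does not state the lane order) and uses hypothetical helper names.

```python
from typing import List

LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(pixels: List[int]) -> int:
    """Pack four 16-bit values into one 64-bit word, pixel 1 in the low lane."""
    word = 0
    for lane, value in enumerate(pixels):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack(word: int) -> List[int]:
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def packed_add(a: int, b: int) -> int:
    """Lane-wise wraparound add: carries never cross a lane boundary."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])
```

One `packed_add` performs four additions at once, which is the SIMD parallelism the slide refers to. An overflow in one lane wraps within that lane and leaves its neighbours untouched.
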
16
Packed compare instruction

Used for chromakey

[Figure: packed compare applied to subwords xa, xb, xc, xd and wa, wb, wc, wd; the result goes to a packed logical op (logo example)]
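
A packed compare produces an all-ones or all-zeros mask per lane, and packed logical operations then use that mask to select between two sources. The sketch below models the lanes as a Python list rather than a packed word. The 16-bit lane width, the equality test against a single key colour, and the function names are assumptions made for illustration.

```python
from typing import List

LANE_MASK = 0xFFFF  # 16-bit lanes


def packed_cmpeq(a: List[int], b: List[int]) -> List[int]:
    """Per-lane equality: all ones where lanes match, all zeros elsewhere."""
    return [LANE_MASK if x == y else 0 for x, y in zip(a, b)]


def chroma_key(foreground: List[int], background: List[int], key: int) -> List[int]:
    """Replace foreground lanes equal to the key colour with background lanes."""
    mask = packed_cmpeq(foreground, [key] * len(foreground))
    # (background AND mask) OR (foreground AND NOT mask), lane by lane
    return [(bg & m) | (fg & ~m & LANE_MASK)
            for fg, bg, m in zip(foreground, background, mask)]
```

The selection uses only AND, OR and NOT, so no per-pixel branch is needed.
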

17
VLIW architectures?

Parallel function units, shared register file,
static scheduling of operations

[Figure: register file shared by parallel function units, with instruction decode and memory]
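
Static scheduling means the compiler, not the hardware, decides which operations issue together. The sketch below is a toy greedy scheduler. It assumes every operation has unit latency and that any function unit can run any operation, neither of which holds on a real VLIW. The `schedule` function and its arguments are hypothetical.

```python
from typing import Dict, List


def schedule(deps: Dict[str, List[str]], width: int) -> List[List[str]]:
    """Greedily pack unit-latency operations into instruction words.

    deps maps each operation name to the operations it depends on.
    Each returned inner list is one long instruction word of at most
    `width` operations whose inputs were produced by earlier words.
    """
    done: set = set()
    remaining = list(deps)  # keeps the program order given by the dict
    words: List[List[str]] = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle or unknown operation")
        word = ready[:width]
        words.append(word)
        done.update(word)
        remaining = [op for op in remaining if op not in word]
    return words
```

For five operations where `c` needs `a` and `b`, and `d` needs `c`, a 2-wide machine needs three words while a 1-wide machine needs five. The dependency chain a, c, d sets the floor of three words regardless of width.
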
18
VLIW's popularity

Invented 20 years ago, popular today:
Good compiler technology.
Low control overhead.
Systems-on-silicon eliminates pinout problems.
Advantages for video:
Embarrassing parallelism with static scheduling
opportunities.
Less problem with code compatibility.

19
Trimedia TM-1
[Block diagram: VLIW CPU, VLD co-processor, image co-processor, memory interface, video in, video out, audio in, audio out, I2C, serial, timers, PCI]
20
TM-1 VLIW CPU
[Figure: register file and read/write crossbar serving function units FU1 through FU27; five issue slots, slot 1 to slot 5]
21
Workload characteristics experiments

Goal: compare media workload characteristics to
general-purpose load.
Used MediaBench benchmarks.
Compiled on Impact compiler, measured with
Impact simulator.

22
Basic characteristics

Comparison of operation frequencies with SPEC:
(ALU, mem, branch, shift, FP, mult) => (4, 2, 1,
1, 1, 1)
Lower frequency of memory and floating-point
operations
More arithmetic operations
Larger variation in memory usage
Basic block statistics
Average of 5.5 operations per basic block
Need global scheduling techniques to extract ILP

23
Basic characteristics, contd

Static branch prediction
Average of 89.5% static branch prediction accuracy
on training input
Average of 85.9% static branch prediction accuracy
on evaluation input
Data types and sizes
Nearly 70% of all instructions require only 8 or
16 bit data types

24
Breakdown of data types by media type
25
Memory experiment setup

Spatial locality experiment
cache regression: line sizes 8 to 1024 bytes
assumed cache size of 64 KB
measure read and write miss ratios
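
An experiment of this shape can be approximated with a small trace-driven simulation. This is a minimal sketch and not the Impact simulator setup: it assumes a direct-mapped cache (the slide gives only the 64 KB size), treats every access alike rather than separating reads from writes, and uses hypothetical function names.

```python
from typing import Dict, Sequence


def miss_ratio(addresses: Sequence[int], cache_bytes: int, line_bytes: int) -> float:
    """Miss ratio of a direct-mapped cache over a byte-address trace."""
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines  # line address currently held by each cache line
    misses = accesses = 0
    for addr in addresses:
        line_addr = addr // line_bytes
        index = line_addr % num_lines
        accesses += 1
        if tags[index] != line_addr:
            misses += 1
            tags[index] = line_addr
    return misses / accesses if accesses else 0.0


def sweep(addresses: Sequence[int], cache_bytes: int = 64 * 1024) -> Dict[int, float]:
    """Miss ratio for each line size from 8 to 1024 bytes (powers of two)."""
    return {8 << k: miss_ratio(addresses, cache_bytes, 8 << k) for k in range(8)}
```

For a purely sequential trace of 4-byte accesses, the miss ratio halves each time the line size doubles. Real workloads fall off more slowly, and how slowly is a measure of their spatial locality.
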

26
Data spatial locality
27
Multimedia looping characteristics

Highly loop-centric
95% of CPU time in two innermost loop levels
Significant processing regularity
About 10 iterations per loop on average
Complex loop control
Path ratio: average number of instructions executed
per loop invocation / total number of loop instructions
Average path ratio of 78%: high complexity

28
Average iterations per loop and path ratio
[Figure: average number of loop iterations and average path ratio]
29
Instruction level parallelism

Instruction level parallelism
base model: single issue using classical
optimizations only
parallel model: 8-issue
Explores only parallel scheduling performance
assumes an ideal processor model
no performance penalties from branches, cache
misses, etc.

30
ILP results
31
Workload evaluation conclusions

Operation characteristics
More arithmetic, less memory and floating-point
Large variation in memory usage
(ALU, mem, branch, shift, FP, mult) => (4, 2, 1,
1, 1, 1)
Good static branch prediction
Multimedia: 10-15% avg. miss ratio
General-purpose: 20-30% avg. miss ratio
Similar basic block sizes (5 instrs per basic
block)

32
Workload evaluation conclusions, contd

Primarily small data types (8 or 16 bits)
Nearly 70% of instructions require 16-bit or
smaller data types
Significant opportunity for subword parallelism
or narrower datapaths
Memory
Typically small data and instruction working set
sizes
High data and instruction spatial locality

33
Workload evaluation conclusions, contd

Loop-centric
Majority of execution time spent in two innermost
loops
Average of 10 iterations per loop invocation
Path ratio indicates greater control complexity
than expected

34
VSP architecture evaluation

Determine fundamental architecture style
Statically Scheduled => Very Long Instruction
Word (VLIW)
Dynamically Scheduled => Superscalar
Examine variety of architecture parameters
Fundamental Architecture Style
Instruction Fetch Architecture
High Frequency Effects
Cache Memory Hierarchy

35
Fundamental architecture evaluation

Major issues
Static vs. dynamic scheduling
Issue width
Focused on non-memory limited applications.

36
Architectural model

8-issue processor
Operation latencies targeted for 500 MHz to 1 GHz
64 integer and floating-point registers
Pipeline: 1 fetch, 2 decode, 1 write back,
variable execute stages

37
Architectural model, contd

32 KB direct-mapped L1 data cache with 64 byte
lines
16 KB direct-mapped L1 instruction cache with 256
byte lines
256 KB 4-way set-associative on-chip L2 cache
4:1 processor to external bus frequency ratio

38
Static versus Dynamic Scheduling
39
Increasing issue width
40
Dynamic branch prediction comparison
41
Impact of higher processor frequencies

Increased wire delay at higher frequencies may
cause
Longer operation latencies
Delayed bypassing

42
Processor frequency models

Three processor models with different operation
latencies
250 MHz to 500 MHz: stores 1, loads 2, FP 3,
mult 3, div 10
500 MHz to 1 GHz: stores 2, loads 3, FP 4,
mult 5, div 20
1 GHz to 2 GHz: stores 3, loads 4, FP 5,
mult 7, div 30
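
The cycle counts are easier to compare once converted to time. The sketch below uses the latencies from the slide and assumes each model runs at the top of its frequency range, which is only a convenient choice for illustration. The dictionary keys and the `latency_ns` name are hypothetical.

```python
# Operation latencies in cycles, as listed on the slide.
MODELS = {
    "250-500 MHz": {"store": 1, "load": 2, "fp": 3, "mult": 3, "div": 10},
    "500 MHz-1 GHz": {"store": 2, "load": 3, "fp": 4, "mult": 5, "div": 20},
    "1-2 GHz": {"store": 3, "load": 4, "fp": 5, "mult": 7, "div": 30},
}

# Assumption: evaluate each model at the upper end of its range.
TOP_MHZ = {"250-500 MHz": 500, "500 MHz-1 GHz": 1000, "1-2 GHz": 2000}


def latency_ns(model: str, op: str) -> float:
    """Latency of one operation in nanoseconds at the model's top frequency."""
    return MODELS[model][op] * 1000.0 / TOP_MHZ[model]
```

Under that assumption a load takes 4 ns, 3 ns and 2 ns across the three models. The latency in cycles doubles from 2 to 4 while the clock rate quadruples, so each operation occupies more pipeline slots even though it finishes sooner.
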

43
Processor frequency results

10% performance difference between processor
models
35% performance degradation for delayed bypassing
Out-of-order scheduling and superscalar
compilation least susceptible to high frequency
effects
20-30% less performance degradation

44
Cache evaluation
45
Evaluation of cache memory hierarchy

Conclusions
L2 cache has little impact on performance
useful for storing state during context switches
External memory miss latency is primary memory
problem
Streaming data structures will help alleviate
this
External memory bandwidth is the second-largest problem

46
Architecture evaluation conclusions

Fundamental Architecture Style
VLIW and In-order superscalar are comparable
Out-of-order superscalar has 70% better
performance
Hyperblock is most effective compilation
technique
Issue widths of 3-4 are sufficient

47
Architecture conclusions, contd.

Instruction Fetch Architecture
Small dynamic branch predictor provides good
performance
Aggressive fetch provides little benefit
2% performance degradation for additional
pre-execute pipeline stages
Instruction fetch is not critical in media
processors

48
Architecture conclusions, contd.

High Frequency Effects
10% performance difference between processors
with varying operation latencies
35% performance degradation from delayed
bypassing
Out-of-order superscalar and superscalar
compilation least affected
Cache Memory Hierarchy
L1 cache size has little effect on media
processing
External memory latency and bandwidth are primary
bottlenecks

49
Summary

Multimedia applications are already more complex
and will become more so.
Programmable video architectures enable
sophisticated applications.
Video architectures must be sophisticated enough
to handle modern video applications.
