Title: Architectures for Video Signal Processing
1. Architectures for Video Signal Processing
- Wayne Wolf
- Dept. of EE
- Princeton University
2. Outline
- Multimedia requirements
- Architectural styles
- Jason Fritts' PhD work: programmable VSPs
3. Multimedia requirements
- Today, compression is the dominant application.
- Tomorrow, analysis will be just as important:
- object recognition
- summarization
- analysis of situations.
4. Storyboard made of key frames
For political ads, see www.ee.princeton.edu/caeti
5. Key frame analysis algorithm
- Compute optical flow.
- Compute the sum of the magnitudes of the optical flow vectors per frame.
- Select key frames at local minima; the min/max ratio is a user parameter.
[Figure: per-frame motion vs. time, with key frames 1 and 2 at local minima]
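The selection step above can be sketched in a few lines. This is a toy model, not the authors' implementation: it assumes the per-frame optical-flow magnitudes have already been summed into a list, and it interprets the user's min/max ratio as a threshold relative to the largest motion peak.

```python
def select_keyframes(motion, ratio=0.5):
    """Select key frames at local minima of per-frame motion.

    motion: summed optical-flow vector magnitudes, one value per frame.
    ratio:  hypothetical reading of the slide's user parameter -- a local
            minimum qualifies only if it is at most `ratio` times the
            largest motion peak in the sequence.
    """
    peak = max(motion)
    keys = []
    for i in range(1, len(motion) - 1):
        is_local_min = motion[i] <= motion[i - 1] and motion[i] <= motion[i + 1]
        if is_local_min and motion[i] <= ratio * peak:
            keys.append(i)
    return keys
```

For the trace [5, 1, 6, 2, 7] this selects frames 1 and 3, the two motion dips.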
6. The multimedia processing funnel
[Figure: funnel from pixel processing (high data volume) down to principal component analysis and hidden Markov models (high data abstraction)]
7. Styles of video processing
- Single-instruction multiple-data (SIMD).
- Heterogeneous multiprocessors.
- Instruction set architecture (ISA) extensions.
- Very long instruction word (VLIW) processors.
8. SIMD processing
- Broadcast an operation to an array of processing elements, each of which has its own data.
- Well suited to regular, data-oriented operations.
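A minimal software stand-in for this style — one operation broadcast to every processing element, each holding its own data lane (`simd_broadcast` is an illustrative name, not a real API):

```python
def simd_broadcast(op, *operand_arrays):
    """One broadcast operation, applied in lockstep to every processing
    element's private data lane (a software model of a PE array)."""
    return [op(*lane) for lane in zip(*operand_arrays)]
```

For example, `simd_broadcast(lambda a, b: a + b, [1, 2, 3, 4], [10, 20, 30, 40])` performs one broadcast add across four lanes and yields [11, 22, 33, 44].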
9. A block correlation architecture
[Figure: array of delay elements (D) implementing block correlation]
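The computation such an array accelerates can be written as exhaustive block matching with a sum-of-absolute-differences (SAD) metric. This pure-Python sketch shows the arithmetic the hardware parallelizes, not the delay-element structure itself:

```python
def sad(block, frame, dy, dx):
    """Sum of absolute differences between `block` and the same-sized
    region of `frame` with top-left corner at (dy, dx)."""
    return sum(abs(block[i][j] - frame[dy + i][dx + j])
               for i in range(len(block))
               for j in range(len(block[0])))

def best_match(block, frame):
    """Exhaustive block matching: the displacement with minimum SAD."""
    h, w = len(block), len(block[0])
    candidates = ((dy, dx)
                  for dy in range(len(frame) - h + 1)
                  for dx in range(len(frame[0]) - w + 1))
    return min(candidates, key=lambda p: sad(block, frame, *p))
```

Every candidate displacement evaluates the same SAD kernel on different data, which is exactly the regularity that a SIMD correlation array exploits.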
10. Heterogeneous multiprocessor design
- Will need accelerators for quite some time to come:
- power
- performance.
- Candidates for acceleration:
- complex coding and error correction
- motion estimation.
11. Expensive operations
- Expensive operations can be sped up by special-purpose units:
- specialized memory accesses
- specialized datapath operations.
- Special-purpose units may be useful only for certain parameters:
- block size
- search region size.
12. Communication bandwidth
- Performance is often limited by communication bandwidth:
- internal
- external.
- Specialized communication topologies can make more efficient use of available bandwidth.
13. ISA extensions
- Augment the instruction set of a traditional microprocessor with media processing instructions:
- smaller word sizes
- operations particular to multimedia (e.g., saturation arithmetic).
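Saturation arithmetic, mentioned above, clamps results at the representable limit instead of wrapping around — a minimal sketch for unsigned subwords:

```python
def sat_add(a, b, bits=8):
    """Saturating unsigned add: clamp at the maximum representable value
    instead of wrapping around as ordinary modular arithmetic would."""
    return min(a + b, (1 << bits) - 1)
```

For pixel data this is the right behavior: `sat_add(200, 100)` yields 255 (a fully saturated pixel), whereas ordinary 8-bit wraparound would give 44, a visible artifact.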
14. Why ISA extensions?
- Easy: provide significant parallelism with small changes to the architecture.
- Cheap: can be implemented with little additional hardware.
- Effective: provide 2x-4x speedups.
15. Basic principles of ISA extensions
- Split the data word into subwords to provide single-instruction multiple-data (SIMD) parallelism.
- Assemble the CPU word from pixels.
[Figure: a 64-bit CPU word assembled from four 16-bit pixels]
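The word assembly shown in the figure can be modeled with shifts and masks. A sketch packing four 16-bit pixels into one 64-bit word (pixel order and bit layout are illustrative assumptions):

```python
def pack_pixels(pixels, bits=16):
    """Assemble one CPU word from subword pixels (pixel 0 in the low bits)."""
    mask = (1 << bits) - 1
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & mask) << (i * bits)
    return word

def unpack_pixels(word, n=4, bits=16):
    """Split a packed word back into its n subword pixels."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(n)]
```

Once packed, a single 64-bit ALU operation can act on all four pixels at once — the source of the 2x-4x speedups cited earlier.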
16. Packed compare instruction
[Figure: subword pairs (xa, wa) through (xd, wd) compared in parallel; the per-subword results feed a packed logical operation]
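The packed compare can be emulated on packed words like those of the previous slide: each subword lane of the result becomes an all-ones mask where the comparison holds, ready to feed a packed logical operation. A sketch, assuming 16-bit lanes and a greater-than comparison:

```python
def packed_cmp_gt(x, w, n=4, bits=16):
    """Subword compare of two packed words: each lane of the result is
    all ones where the x lane exceeds the w lane, else all zeros --
    a mask for a following packed logical operation."""
    mask = (1 << bits) - 1
    out = 0
    for i in range(n):
        xa = (x >> (i * bits)) & mask   # lane i of x
        wa = (w >> (i * bits)) & mask   # lane i of w
        if xa > wa:
            out |= mask << (i * bits)   # set the whole lane to ones
    return out
```

ANDing the mask with a packed operand then selects, per lane, only the values that passed the compare — all without branches.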
17. VLIW architectures
- Parallel function units, shared register file, static scheduling of operations.
[Figure: instruction decode and memory driving a row of function units that share one register file]
18. VLIW's popularity
- Invented 20 years ago, popular today:
- Good compiler technology.
- Low control overhead.
- Systems-on-silicon eliminate pinout problems.
- Advantages for video:
- Embarrassing parallelism with static scheduling opportunities.
- Fewer problems with code compatibility.
19. Trimedia TM-1
[Block diagram: VLIW CPU with memory interface, video in/out, audio in/out, I2C, serial, timers, VLD co-processor, image co-processor, and PCI]
20. TM-1 VLIW CPU
[Figure: register file and read/write crossbar serving function units FU1...FU27 across five issue slots]
21. Workload characteristics experiments
- Goal: compare media workload characteristics to a general-purpose load.
- Used MediaBench benchmarks.
- Compiled with the Impact compiler, measured with the Impact simulator.
22. Basic characteristics
- Comparison of operation frequencies with SPEC:
- (ALU : mem : branch : shift : FP : mult) ≈ 4 : 2 : 1 : 1 : 1 : 1
- Lower frequency of memory and floating-point operations
- More arithmetic operations
- Larger variation in memory usage
- Basic block statistics:
- Average of 5.5 operations per basic block
- Need global scheduling techniques to extract ILP
23. Basic characteristics, cont'd
- Static branch prediction:
- Average of 89.5% static branch prediction accuracy on the training input
- Average of 85.9% static branch prediction accuracy on the evaluation input
- Data types and sizes:
- Nearly 70% of all instructions require only 8- or 16-bit data types
24. Breakdown of data types by media type
25. Memory experiment setup
- Spatial locality experiment:
- varied cache line sizes from 8 to 1024 bytes
- assumed cache size of 64 KB
- measured read and write miss ratios
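An experiment of this shape can be approximated with a tiny direct-mapped cache simulator (a sketch, not the Impact toolchain): sweep the line size and watch the miss ratio of a byte-address trace fall as longer lines capture more spatial locality.

```python
def miss_ratio(addresses, cache_bytes=64 * 1024, line_bytes=64):
    """Miss ratio of a direct-mapped cache over a byte-address trace."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one tag per cache line
    misses = 0
    for a in addresses:
        line = a // line_bytes       # which memory line holds this byte
        idx = line % n_lines         # direct-mapped index
        if tags[idx] != line:        # miss: fetch the line
            tags[idx] = line
            misses += 1
    return misses / len(addresses)
```

On a purely sequential trace, doubling the line size halves the miss ratio — the limiting case of the spatial-locality behavior measured on the next slide.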
26. Data spatial locality
27. Multimedia looping characteristics
- Highly loop-centric:
- 95% of CPU time is spent in the two innermost loop levels
- Significant processing regularity:
- About 10 iterations per loop on average
- Complex loop control:
- path ratio = (average # of instructions executed per loop invocation) / (total # of loop instructions)
- Average path ratio of 78% indicates high complexity
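The path ratio defined above is straightforward to compute from per-invocation instruction counts; a worked sketch:

```python
def path_ratio(executed_per_invocation, total_loop_instructions):
    """Path ratio: average number of instructions actually executed per
    loop invocation, divided by the total number of instructions in the
    loop body. A ratio near 1 means nearly the whole body runs every
    invocation; lower values mean heavy control-dependent skipping."""
    avg = sum(executed_per_invocation) / len(executed_per_invocation)
    return avg / total_loop_instructions
```

For a 100-instruction loop body that executes 70 instructions on one invocation and 86 on another, the path ratio is 0.78 — matching the 78% average the study reports as surprisingly control-complex.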
28. Average iterations per loop and path ratio
[Figure: average number of loop iterations and average path ratio per benchmark]
29. Instruction-level parallelism
- Instruction-level parallelism experiment:
- base model: single-issue, using classical optimizations only
- parallel model: 8-issue
- Explores only parallel scheduling performance:
- assumes an ideal processor model
- no performance penalties from branches, cache misses, etc.
30. ILP results
31. Workload evaluation conclusions
- Operation characteristics:
- More arithmetic, less memory and floating point
- Large variation in memory usage
- (ALU : mem : branch : shift : FP : mult) ≈ 4 : 2 : 1 : 1 : 1 : 1
- Good static branch prediction:
- Multimedia: 10-15% avg. miss ratio
- General-purpose: 20-30% avg. miss ratio
- Similar basic block sizes (5 instructions per basic block)
32. Workload evaluation conclusions, cont'd
- Primarily small data types (8 or 16 bits):
- Nearly 70% of instructions require 16-bit or smaller data types
- Significant opportunity for subword parallelism or narrower datapaths
- Memory:
- Typically small data and instruction working-set sizes
- High data and instruction spatial locality
33. Workload evaluation conclusions, cont'd
- Loop-centric:
- Majority of execution time is spent in the two innermost loops
- Average of 10 iterations per loop invocation
- Path ratio indicates greater control complexity than expected
34. VSP architecture evaluation
- Determine the fundamental architecture style:
- Statically scheduled → Very Long Instruction Word (VLIW)
- Dynamically scheduled → Superscalar
- Examine a variety of architecture parameters:
- Fundamental architecture style
- Instruction fetch architecture
- High-frequency effects
- Cache memory hierarchy
35. Fundamental architecture evaluation
- Major issues:
- Static vs. dynamic scheduling
- Issue width
- Focused on non-memory-limited applications.
36. Architectural model
- 8-issue processor
- Operation latencies targeted for 500 MHz to 1 GHz
- 64 integer and floating-point registers
- Pipeline: 1 fetch, 2 decode, 1 writeback, variable execute stages
37. Architectural model, cont'd
- 32 KB direct-mapped L1 data cache with 64-byte lines
- 16 KB direct-mapped L1 instruction cache with 256-byte lines
- 256 KB 4-way set-associative on-chip L2 cache
- 4:1 processor-to-external-bus frequency ratio
38. Static versus dynamic scheduling
39. Increasing issue width
40. Dynamic branch prediction comparison
41. Impact of higher processor frequencies
- Increased wire delay at higher frequencies may cause:
- Longer operation latencies
- Delayed bypassing
42. Processor frequency models
- Three processor models with different operation latencies:
- 250 MHz - 500 MHz: stores 1, loads 2, FP 3, mult 3, div 10
- 500 MHz - 1 GHz: stores 2, loads 3, FP 4, mult 5, div 20
- 1 GHz - 2 GHz: stores 3, loads 4, FP 5, mult 7, div 30
43. Processor frequency results
- 10% performance difference between processor models
- 35% performance degradation for delayed bypassing
- Out-of-order scheduling and superscalar compilation are least susceptible to high-frequency effects:
- 20-30% less performance degradation
44. Cache evaluation
45. Evaluation of cache memory hierarchy
- Conclusions:
- L2 cache has little impact on performance
- useful for storing state during context switches
- External memory miss latency is the primary memory problem
- Streaming data structures will help alleviate this
- External memory bandwidth is the second-biggest problem
46. Architecture evaluation conclusions
- Fundamental architecture style:
- VLIW and in-order superscalar are comparable
- Out-of-order superscalar has 70% better performance
- Hyperblock is the most effective compilation technique
- Issue widths of 3-4 are sufficient
47. Architecture conclusions, cont'd
- Instruction fetch architecture:
- A small dynamic branch predictor provides good performance
- Aggressive fetch provides little benefit
- 2% performance degradation for additional pre-execute pipeline stages
- Instruction fetch is not critical in media processors
48. Architecture conclusions, cont'd
- High-frequency effects:
- 10% performance difference between processors with varying operation latencies
- 35% performance degradation from delayed bypassing
- Out-of-order superscalar and superscalar compilation are least affected
- Cache memory hierarchy:
- L1 cache size has little effect on media processing
- External memory latency and bandwidth are the primary bottlenecks
49. Summary
- Multimedia applications are already complex and will become more so.
- Programmable video architectures enable sophisticated applications.
- Video architectures must be sophisticated enough to handle modern video applications.