Architectures for Video Signal Processing - PowerPoint PPT Presentation

1
Architectures for Video Signal Processing
  • Wayne Wolf
  • Dept. of EE
  • Princeton University

2
Outline
  • Multimedia requirements
  • Architectural styles
  • Jason Fritts' PhD work: programmable VSPs

3
Multimedia requirements
  • Today, compression is the dominant application.
  • Tomorrow, analysis will be as important:
    • object recognition
    • summarization
    • analysis of situations.

4
Storyboard made of keyframes
For political ads, see www.ee.princeton.edu/caeti
5
Key frame analysis algorithm
  • Compute optical flow.
  • Compute sum of magnitudes of optical flow vectors
    per frame.
  • Select key frames at local minima; the min/max
    ratio is a user parameter.

[Figure: motion plotted against time, with keyframe 1 and keyframe 2 marked]
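A minimal sketch of the selection steps above, assuming the optical flow has already been computed and is available as a list of (dx, dy) vectors per frame. The slide does not say how the min/max ratio is applied; this sketch assumes a local minimum is accepted only when it is at most `max_ratio` times the largest motion seen since the previous key frame. The function and parameter names are hypothetical.

```python
import math


def motion_per_frame(flows):
    """Sum of optical-flow vector magnitudes for each frame.

    `flows` is a list of frames; each frame is a list of (dx, dy) vectors.
    """
    return [sum(math.hypot(dx, dy) for dx, dy in frame) for frame in flows]


def select_key_frames(motion, max_ratio=0.5):
    """Return indices of local minima of `motion` that are deep enough.

    Assumption (not from the slides): a minimum qualifies when
    motion[i] <= max_ratio * (largest motion since the previous key frame).
    """
    keys = []
    peak = 0.0
    for i in range(1, len(motion) - 1):
        peak = max(peak, motion[i - 1])
        is_local_min = motion[i] <= motion[i - 1] and motion[i] < motion[i + 1]
        if is_local_min and peak > 0 and motion[i] <= max_ratio * peak:
            keys.append(i)
            peak = 0.0  # start tracking a new peak after each key frame
    return keys
```

A smaller `max_ratio` keeps only the deepest dips in motion and therefore yields fewer key frames.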
6
The multimedia processing funnel
[Figure: funnel diagram. Labels: pixel processing; data volume; data abstraction; principal component analysis, hidden Markov models]
7
Styles of video processing
  • Single-instruction multiple-data (SIMD).
  • Heterogeneous multiprocessors.
  • Instruction set architecture (ISA) extensions.
  • Very long instruction word (VLIW) processors.

8
SIMD processing
  • Broadcast operation to an array of processing
    elements, each of which has its own data.
  • Well-suited to regular, data-oriented operations.

9
A block correlation architecture
[Figure: block correlation architecture drawn as an array of nine elements labeled D]



10
Heterogeneous multiprocessor design
  • Will need accelerators for quite some time to
    come:
    • power
    • performance.
  • Candidates for acceleration:
    • complex coding and error correction
    • motion estimation.

11
Expensive operations
  • Expensive operations can be sped up by
    special-purpose units:
    • specialized memory accesses
    • specialized datapath operations.
  • Special-purpose units may be useful for only
    certain parameters:
    • block size
    • search region size.
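
To make the two parameters concrete, here is a rough software sketch of full-search block matching for motion estimation, the kind of kernel an accelerator would target. It assumes a sum-of-absolute-differences (SAD) cost, which the slides do not name; `n` is the block size and `r` the search range, and all function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, r):
    """Return (cost, dx, dy) of the best match within +/- r pixels.

    Candidates that would fall outside the reference frame are skipped.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w
                      and 0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The work grows as roughly (2r + 1)^2 * n^2 operations per block, which is why a unit built for one block size and search region may not help at other settings.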

12
Communication bandwidth
  • Performance is often limited by communication
    bandwidth
  • internal
  • external.
  • Specialized communication topologies can make
    more efficient use of available bandwidth.

13
ISA extensions
  • Augment instruction set of traditional
    microprocessor to provide media processing
    instructions:
    • smaller word sizes
    • operations particular to multimedia (saturation
      arithmetic)
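
As an illustration of saturation arithmetic, the sketch below contrasts a wrapping 8-bit add with a saturating one. The 8-bit unsigned width and the function names are assumptions chosen for the example; real ISA extensions offer several widths and signed variants.

```python
def wrap_add_u8(a, b):
    """Ordinary modular add: overflow wraps around past 255."""
    return (a + b) & 0xFF


def sat_add_u8(a, b):
    """Saturating add: overflow clamps to the largest 8-bit value."""
    return min(a + b, 255)
```

Brightening a pixel of value 200 by 100 wraps to 44 (a dark pixel) with modular arithmetic but clamps to 255 (white) with saturation, which is the behavior image code usually wants.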

14
Why ISA extensions
  • Easy: provide significant parallelism with small
    changes to architecture.
  • Cheap: can be implemented with
  • Effective: provide 2x-4x speedups.

15
Basic principles of ISA extensions
  • Split data word into subwords to provide single
    instruction multiple data (SIMD) parallelism.
  • Assemble CPU word from pixels

[Figure: a 64-bit CPU word assembled from four 16-bit subwords holding pixel 1 through pixel 4]
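
The subword idea can be sketched in software: pack four 16-bit pixels into one 64-bit word, then add two such words lane by lane with a single wide addition. The lane order (pixel 1 in the lowest 16 bits) and the helper names are assumptions; the masking trick stands in for hardware that simply cuts the carry chain between subwords.

```python
HIGH = 0x8000800080008000  # top bit of each 16-bit lane
LOW = 0x7FFF7FFF7FFF7FFF   # remaining 15 bits of each lane


def pack4(pixels):
    """Pack four 16-bit values into one 64-bit word (pixels[0] lowest)."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFFFF) << (16 * i)
    return word


def unpack4(word):
    """Split a 64-bit word back into four 16-bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]


def packed_add16(a, b):
    """Lane-wise modular add of four 16-bit subwords.

    The low 15 bits of each lane are added normally; the top bit is then
    fixed up with XOR so that no carry crosses a lane boundary.
    """
    return ((a & LOW) + (b & LOW)) ^ ((a ^ b) & HIGH)
```

For example, `unpack4(packed_add16(pack4([1, 0xFFFF, 300, 40000]), pack4([2, 1, 500, 30000])))` gives `[3, 0, 800, 4464]`: each lane wraps independently of its neighbors.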
16
Packed compare instruction
  • Used for chromakey

[Figure: packed compare on four subword lanes (wa/xa, wb/xb, wc/xc, wd/xd); two lanes are labeled "logo"; the result goes to a packed logical op]
17
VLIW architectures
  • Parallel function units, shared register file,
    static scheduling of operations

[Figure: a register file shared by several function units, fed by instruction decode and memory]
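
Static scheduling can be illustrated with a greedy scheduler that packs independent operations into fixed-width bundles ahead of time, which is loosely what a VLIW compiler does. This is a simplified sketch assuming unit latency and identical function units; the function name and data layout are hypothetical.

```python
def schedule(ops, deps, width):
    """Pack operations into bundles of at most `width` ops per cycle.

    ops:   operation names in program order
    deps:  dict mapping an op to the set of ops it depends on
    Returns a list of bundles; an op is issued only after all of its
    dependencies were issued in an earlier bundle.
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("dependency cycle or missing operation")
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles.append(bundle)
    return bundles
```

With ops `a, b, c, d, e` where `c` needs `a` and `b`, `d` needs `c`, and `e` needs `a`, a 4-wide machine finishes in three bundles (`[a, b]`, `[c, e]`, `[d]`) while a 1-wide machine needs five.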
18
VLIW popularity
  • Invented 20 years ago, popular today:
    • Good compiler technology.
    • Low control overhead.
    • Systems-on-silicon eliminate pinout problems.
  • Advantages for video:
    • Embarrassing parallelism with static scheduling
      opportunities.
    • Less problem with code compatibility.

19
Trimedia TM-1
[Figure: Trimedia TM-1 block diagram: memory interface, video in, video out, audio in, audio out, I2C, serial, timers, VLD co-processor, image co-processor, VLIW CPU, PCI]
20
TM-1 VLIW CPU
[Figure: TM-1 VLIW CPU: register file, read/write crossbar, function units FU1 ... FU27, issue slots 1 through 5]
21
Workload characteristics experiments
  • Goal: compare media workload characteristics to
    general-purpose load.
  • Used MediaBench benchmarks.
  • Compiled on Impact compiler, measured with
    Impact simulator.

22
Basic characteristics
  • Comparison of operation frequencies with SPEC:
    • (ALU, mem, branch, shift, FP, mult) => (4, 2, 1,
      1, 1, 1)
    • Lower frequency of memory and floating-point
      operations
    • More arithmetic operations
    • Larger variation in memory usage
  • Basic block statistics:
    • Average of 5.5 operations per basic block
    • Need global scheduling techniques to extract ILP

23
Basic characteristics, contd
  • Static branch prediction:
    • Average of 89.5% static branch prediction
      accuracy on training input
    • Average of 85.9% static branch prediction
      accuracy on evaluation input
  • Data types and sizes:
    • Nearly 70% of all instructions require only 8 or
      16 bit data types

24
Breakdown of data types by media type
25
Memory experiment setup
  • Spatial locality experiment:
    • cache regression: line sizes 8 to 1024 bytes
    • assumed cache size of 64 KB
    • measure read and write miss ratios
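
A toy version of this kind of experiment, for illustration only: it holds the cache size at 64 KB and reports the miss ratio of an address trace for a given line size. The slides do not state the associativity used in this experiment, so direct mapping is an assumption, reads and writes are not distinguished, and the function name is hypothetical.

```python
def miss_ratio(addresses, cache_bytes=64 * 1024, line_bytes=64):
    """Miss ratio of a byte-address trace on a direct-mapped cache."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines  # block number currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % n_lines
        if tags[index] != block:
            tags[index] = block
            misses += 1
    return misses / len(addresses)
```

For a purely sequential trace such as `range(4096)`, the miss ratio falls from 0.125 with 8-byte lines to 0.015625 with 64-byte lines. Longer lines pay off in that way when spatial locality is high.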

26
Data spatial locality
27
Multimedia looping characteristics
  • Highly loop centric:
    • 95% of CPU time in two innermost loop levels
  • Significant processing regularity:
    • About 10 iterations per loop on average
  • Complex loop control:
    • Path ratio = average number of instructions
      executed per loop invocation / total number of
      loop instructions
    • Average path ratio of 78%: high complexity

28
Average iterations per loop and path ratio
- average number of loop iterations
- average path ratio
29
Instruction level parallelism
  • Instruction level parallelism:
    • base model: single issue using classical
      optimizations only
    • parallel model: 8-issue
  • Explores only parallel scheduling performance:
    • assumes an ideal processor model
    • no performance penalties from branches, cache
      misses, etc.

30
ILP results
31
Workload evaluation conclusions
  • Operation characteristics:
    • More arithmetic, less memory and floating-point
    • Large variation in memory usage
    • (ALU, mem, branch, shift, FP, mult) => (4, 2, 1,
      1, 1, 1)
  • Good static branch prediction:
    • Multimedia: 10-15% avg. miss ratio
    • General-purpose: 20-30% avg. miss ratio
  • Similar basic block sizes (5 instrs per basic
    block)

32
Workload evaluation conclusions, contd
  • Primarily small data types (8 or 16 bits):
    • Nearly 70% of instructions require 16-bit or
      smaller data types
    • Significant opportunity for subword parallelism
      or narrower datapaths
  • Memory:
    • Typically small data and instruction working set
      sizes
    • High data and instruction spatial locality

33
Workload evaluation conclusions, contd
  • Loop-centric:
    • Majority of execution time spent in two innermost
      loops
    • Average of 10 iterations per loop invocation
    • Path ratio indicates greater control complexity
      than expected

34
VSP architecture evaluation
  • Determine fundamental architecture style:
    • Statically scheduled => very long instruction
      word (VLIW)
    • Dynamically scheduled => superscalar
  • Examine variety of architecture parameters:
    • Fundamental Architecture Style
    • Instruction Fetch Architecture
    • High Frequency Effects
    • Cache Memory Hierarchy

35
Fundamental architecture evaluation
  • Major issues:
    • Static vs. dynamic scheduling
    • Issue width
  • Focused on non-memory-limited applications.

36
Architectural model
  • 8-issue processor
  • Operation latencies targeted for 500 MHz to 1 GHz
  • 64 integer and floating-point registers
  • Pipeline: 1 fetch, 2 decode, 1 write back,
    variable execute stages

37
Architectural model, contd
  • 32 KB direct-mapped L1 data cache with 64 byte
    lines
  • 16 KB direct-mapped L1 instruction cache with 256
    byte lines
  • 256 KB 4-way set-associative on-chip L2 cache
  • 4:1 processor to external bus frequency ratio

38
Static versus Dynamic Scheduling
39
Increasing issue width
40
Dynamic branch prediction comparison
41
Impact of higher processor frequencies
  • Increased wire delay at higher frequencies may
    cause:
    • Longer operation latencies
    • Delayed bypassing

42
Processor frequency models
  • Three processor models with different operation
    latencies:
    • 250 MHz to 500 MHz: stores 1, loads 2, FP 3,
      mult 3, div 10
    • 500 MHz to 1 GHz: stores 2, loads 3, FP 4,
      mult 5, div 20
    • 1 GHz to 2 GHz: stores 3, loads 4, FP 5, mult
      7, div 30

43
Processor frequency results
  • 10% performance difference between processor
    models
  • 35% performance degradation for delayed bypassing
  • Out-of-order scheduling and superscalar
    compilation least susceptible to high frequency
    effects:
    • 20-30% less performance degradation

44
Cache evaluation
45
Evaluation of cache memory hierarchy
  • Conclusions:
    • L2 cache has little impact on performance
      • useful for storing state during context switches
    • External memory miss latency is the primary
      memory problem
      • Streaming data structures will help alleviate
        this
    • External memory bandwidth is the second-most
      significant problem

46
Architecture evaluation conclusions
  • Fundamental Architecture Style:
    • VLIW and in-order superscalar are comparable
    • Out-of-order superscalar has 70% better
      performance
    • Hyperblock is the most effective compilation
      technique
    • Issue widths of 3-4 are sufficient

47
Architecture conclusions, contd.
  • Instruction Fetch Architecture:
    • Small dynamic branch predictor provides good
      performance
    • Aggressive fetch provides little benefit
    • 2% performance degradation for additional
      pre-execute pipeline stages
    • Instruction fetch is not critical in media
      processors

48
Architecture conclusions, contd.
  • High Frequency Effects:
    • 10% performance difference between processors
      with varying operation latencies
    • 35% performance degradation from delayed
      bypassing
    • Out-of-order superscalar and superscalar
      compilation least affected
  • Cache Memory Hierarchy:
    • L1 cache size has little effect on media
      processing
    • External memory latency and bandwidth are primary
      bottlenecks

49
Summary
  • Multimedia applications are already more complex
    and will become more so.
  • Programmable video architectures enable
    sophisticated applications.
  • Video architectures must be sophisticated enough
    to handle modern video applications.