CHAPTER 6 VLSI Architectures for Motion Estimation - PowerPoint PPT Presentation

1
CHAPTER 6: VLSI Architectures for Motion Estimation
2
1-D Systolic Array
  • A Family of VLSI Designs for the Motion
    Compensation Block-Matching Algorithm
  • K. M. Yang, M. T. Sun, and L. Wu
  • IEEE Transactions on Circuits and Systems, vol.
    36, no. 10, pp. 1317-1325, Oct. 1989

3
Main Features
  • They allow full-search capability, which yields
    the optimal solution in block matching.
  • They allow sequential inputs to save pin count
    but perform parallel processing.
  • They use common busses for data transfers and
    save silicon area.
  • They are very flexible and modular designs,
    capable of processing different block sizes, e.g.,
    8 × 8, 16 × 16, or 32 × 32.
  • They are cascadable, i.e., cascaded chips allow a
    larger tracking area.
  • They contain testing circuitry to increase
    testability.
  • This was the first chip design in the world for
    block-matching motion estimation.

4
Architecture Design
  • To fully utilize the processing power of the PEs,
    a special data flow has to be derived to keep the
    PEs as busy as possible.
  • The data are repeatedly used at different
    searching positions.
  • In the following, two data-flow techniques that
    allow the designs to achieve 100 percent
    efficiency are described: one broadcasts previous
    frame data and the other broadcasts current block
    data.

5
Notations
6
Broadcasting the Previous Frame Data
  • While b(I_b, J_b + 15) is being input, it can be
    broadcast to all processors that need it.
  • This relieves the burden of repeatedly accessing
    the same data from the previous frame.
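The broadcast data flow can be sketched in software (a minimal Python model, not the authors' hardware; the sizes are illustrative): each previous-frame pixel is read from memory once and broadcast to every accumulator whose search position needs it.

```python
# Sketch of broadcasting previous-frame data for a 1-D block match.
# Toy sizes: block length N = 4, search positions m = 0..3.

def sad_broadcast(x, y, n_pos):
    """x: reference block, y: previous-frame row (streamed in).
    Each incoming y[j] is broadcast once; accumulator m uses it
    only if its search window covers position j."""
    N = len(x)
    acc = [0] * n_pos
    for j, pixel in enumerate(y):      # y streams in sequentially
        for m in range(n_pos):         # broadcast to all PEs
            k = j - m                  # position inside the block for PE m
            if 0 <= k < N:
                acc[m] += abs(x[k] - pixel)
    return acc

x = [10, 20, 30, 40]
y = [12, 19, 31, 45, 28, 33, 7]
# Direct computation for comparison: SAD(m) = sum_k |x[k] - y[k+m]|
direct = [sum(abs(x[k] - y[k + m]) for k in range(4)) for m in range(4)]
assert sad_broadcast(x, y, 4) == direct
```

Each element of `y` is fetched exactly once, which is the point of the broadcast: the repeated accesses implied by the direct formula are replaced by one stream shared across all search positions.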

7
Broadcasting Reference Frame
  • The 16 PE columns represent the calculation of
    the error measurement for 16 search positions.
  • Except for a very short initial delay, all the
    PEs are busy all the time, so the utilization is
    100%.
  • The address generator generates each address by
    summing a base address and a running index.
  • The base address, (I_a, J_a) or (I_b, J_b),
    defined as the upper-left corner of a block,
    remains the same for the entire processing of
    that block, and the running indexes (i, j) and
    (k, l) follow an identical sequence for all
    blocks.
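The addressing scheme can be sketched as follows (an illustrative Python model; the coordinates and block size are assumed, not taken from the chip):

```python
# Sketch of the address generator: a block's base address (its
# upper-left corner) is added to a running index sequence that is
# identical for every block.

def block_addresses(base, n=16):
    """Yield (row, col) addresses for an n x n block at base = (I, J)."""
    I, J = base
    for i in range(n):          # running index (i, j), same for all blocks
        for j in range(n):
            yield (I + i, J + j)

addrs = list(block_addresses((32, 48), n=2))
assert addrs == [(32, 48), (32, 49), (33, 48), (33, 49)]
```

Because the running-index sequence never changes, only the base register needs to be reloaded between blocks.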

8
Basic Data Flow
9
Architecture of PE
  • These sub-operations are performed in a pipelined
    fashion and thus reduce the cycle time.
  • The accumulator in the last stage of the PE has
    16-bit precision to accommodate the largest
    possible error measurement (for a 16 × 16 block,
    256 × 255 = 65,280, which fits in 16 bits).

10
Broadcasting the Current Frame Data
Parallel-in-parallel- output shift registers
Parallel-in-parallel- output shift registers with
multiplexers
11
Basic Dataflow for Broadcasting Current Block Data
12
Flexible Block Size
  • Different motion-compensation schemes may use
    different block sizes and require large tracking
    ranges. It is very desirable to have a chip
    flexible enough for use in different systems.
  • Consider a block-size of 8 ? 8, the required
    computations for each block is ¼ of the
    computation required for a block-size of 16 ? 16.
  • However, in each frame, the number of blocks is 4
    times the number of the block-size of 16 ? 16.

13
Flexible Block Size (Cont.)
  • The computational load per frame is the same for
    different block sizes, except that the internal
    dynamic range is slightly different (the tracking
    range is fixed).
  • Both architectures discussed are flexible enough
    to process 8 × 8, 16 × 16, or 32 × 32 blocks as
    long as the tracking range is fixed to 16
    searches in one coordinate.
  • The same hardware containing 16 PEs can be
    reconfigured to process different block sizes by
    a very simple control signal (address generator).
  • The above discussion can be generalized to other
    block sizes that are powers of 2.
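A quick arithmetic check of this invariance (the frame size and number of search candidates below are illustrative, not from the paper):

```python
# With a fixed tracking range, the per-frame computation load is the
# same for 8x8, 16x16, and 32x32 blocks: smaller blocks cost less each,
# but there are proportionally more of them.

frame_w, frame_h = 704, 576        # illustrative frame size
candidates = 16 * 16               # fixed number of search positions

loads = []
for N in (8, 16, 32):
    per_block = N * N * candidates                 # absolute differences per block
    num_blocks = (frame_w // N) * (frame_h // N)   # blocks per frame
    loads.append(per_block * num_blocks)

# per_block * num_blocks = frame_w * frame_h * candidates for every N
assert len(set(loads)) == 1
```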

14
Large Tracking Ranges
  • The tracking range is basically limited by the
    computation power of the PEs. If a tracking
    range of -16 to 15 is needed, the computation
    load is increased by 4 times.
  • Assuming each PE is already operating at the
    limit of its capability, 4 times the number of
    PEs will be needed.
  • In this connection, essentially two chips are
    cascaded to provide 32-stage input registers and
    32 PEs for the doubled horizontal tracking
    range.

15
Block Diagram for Cascading Four Chips to Achieve
Tracking Range of -16 to 15
(Figure: four cascaded chips A, B, C, and D feed a comparator, CMP,
which outputs the motion vector.)
16
Overlapped Search Area
(Figure: the 48 × 48 search area, indexed 0, 16, 32, 47 along each
axis, is divided into four overlapping sub-tracking areas I, II, III,
and IV.)
17
Overlapped Search Area (Cont.)
  • The cascaded-chip design can also easily be done
    by assigning each chip to process one portion of
    the tracking area.
  • While the data from the overlapped area are
    input, they can be broadcast to two chips to
    save bandwidth. This avoids a proportional
    increase of the memory requirement in a
    cascaded-chip system.

18
Motion Estimation with Fractional Precision
  • Quarter-pel precision

19
Fractional Motion Estimation Chip-Pair Design
(Figure: chip-pair design comprising the Current Frame Storage Memory,
Current Block Storage Memory, Previous Frame Storage Memory, and
Tracking Area Storage Memory, with Motion Compensation Chip I (integer
precision) producing (mi, mj) and Motion Compensation Chip F
(fractional precision).)
20
Block Diagram of a Fractional Motion Estimation Chip
21
Interpolation
  • The combination of IP1 and IP2 eases the input
    rate and keeps the PEs performing operations
    every cycle.
  • The interpolated values at the output of the IP1
    and the IP2 can be expressed as
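The IP1/IP2 equations are not reproduced in this transcript. As an illustration only, the following Python sketch shows generic bilinear half-pel interpolation, the kind of computation such interpolators perform; the function and its layout are assumptions, not the chip's actual datapath.

```python
# Generic bilinear interpolation at half-pel positions (illustrative;
# not the chip's IP1/IP2 expressions).

def half_pel(frame, i, j):
    """Interpolated value at (i/2, j/2), with (i, j) in half-pel units.
    frame: 2-D list of integer-pel values."""
    i0, j0 = i // 2, j // 2
    if i % 2 == 0 and j % 2 == 0:                    # integer position
        return frame[i0][j0]
    if i % 2 == 0:                                   # horizontal half-pel
        return (frame[i0][j0] + frame[i0][j0 + 1]) / 2
    if j % 2 == 0:                                   # vertical half-pel
        return (frame[i0][j0] + frame[i0 + 1][j0]) / 2
    return (frame[i0][j0] + frame[i0][j0 + 1]        # diagonal half-pel
            + frame[i0 + 1][j0] + frame[i0 + 1][j0 + 1]) / 4

f = [[0, 4], [8, 12]]
assert half_pel(f, 1, 1) == 6.0   # average of all four neighbors
```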

22
Basic Data Flow for Fractional Motion Vector
Estimator
23
Basic Data Flow for Fractional Motion Vector
Estimator (cont.)
24
Schematic Diagram of IP1
25
Schematic Diagram of IP2
26
Chip Layout
27
Testability
  • The motion vector calculated by the chip is a
    function of the current block data and the data
    in the previous frame within the tracking range.
    Since the number of possible combinations of
    these input data is extremely large, exhaustive
    testing of the chip is impossible.
  • To be able to test the chip, it is highly
    desirable to have a testing circuit inside the
    chip without using excessive chip area or
    degrading performance.
  • The proposed chip operates in two modes, the
    normal mode and the test mode, which are selected
    by an external signal named test.

28
Testability (Cont.)
  • By using tri-state buses and a decoder, the
    testing vectors for the whole chip are reduced to
    much smaller sets of functionally divided
    modules.
  • In the test mode, a test pattern is input
    through some data pins, which are normally used
    for inputting one set of the previous frame
    data, and is then decoded by the Test Pattern
    Decoder.
  • Only one of the modules will be tested at a time
    and only its results are routed to an output bus
    and observed from the output pins.

29
Array Architectures for Block Matching Algorithms
  • T. Komarek and P. Pirsch
  • IEEE Transactions on Circuits and Systems, vol.
    36, no. 10, pp. 1301-1308, Oct. 1989
30
Block Matching Algorithm
(motion vector)
  • The BMA is defined over a four-dimensional index
    space due to its four indexes i, k, m, and n.
  • As an example, the BMA is decomposed into two
    parts, each defined over a two-dimensional index
    space.
  • The first part is spanned by the indexes i and k
    and consists of the accumulation of the sum
    s(m, n).
  • In the second part, which is defined over m and
    n, the minimum search and the selection of the
    displacement vector components are performed.
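The two-part decomposition can be written directly in software (an illustrative Python sketch, not a systolic implementation; the search-area indexing convention below is an assumption):

```python
# First part, over (i, k): accumulate s(m, n) for each displacement.
# Second part, over (m, n): minimum search and selection of the
# displacement vector.

def full_search(x, y, N, p):
    """x: N x N reference block; y: (N+2p) x (N+2p) search area, with
    displacement (m, n) mapped to y[i+m+p][k+n+p]."""
    best = None
    for m in range(-p, p + 1):            # second part: over (m, n)
        for n in range(-p, p + 1):
            s = 0
            for i in range(N):            # first part: over (i, k)
                for k in range(N):
                    s += abs(x[i][k] - y[i + m + p][k + n + p])
            if best is None or s < best[0]:
                best = (s, (m, n))
    return best                            # (minimum SAD, displacement vector)

x = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
y = [[0] * 7 for _ in range(7)]
for r in range(3):                         # plant a matching block at (1, -2)
    for c in range(3):
        y[r + 1 + 2][c - 2 + 2] = 5
sad, mv = full_search(x, y, N=3, p=2)
assert sad == 0 and mv == (1, -2)
```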

31
Derivation of Systolic Arrays for Full Search BMA
  • The addition of s(m, n ) starts with the index k,
    and is continued over the index i for fixed m and
    n.
  • The second part of the decomposed BMA is given by

m and n fixed
32
DG Spanned in the i, k Plane
(Figure: dependence graph for a block size of N = 3 and a maximum
displacement of p = 2 in the i, k-plane of the decomposed full-search
BMA, with a time schedule from step 0 to 7. AD nodes perform
subtraction, magnitude operation, and addition on the inputs x(i, k)
and y(i+m, k+n); A nodes accumulate s(m, n) from s(m-1, n); the M node
performs the minimum search for the displacement vector.)
33
Systolic Architecture AB1 for N = 3, p = 2
(Figure: a linear array of AD PEs fed with search area data and
reference data, followed by an M stage that outputs the displacement
vector.)
Number of time instances necessary to determine a displacement vector:
N · (2p+1) · (2p+1+N-1) = N · (2p+1) · (2p+N)
34
Three-Dimensional Index Space Spanned by the
Indexes i, k, and m
35
Systolic Array AS2
  • Systolic architecture AS2 with processing
    elements AD, A, and M derived from the previous
    DG, with the indexes of input data x(i, k) and
    y(i+m, k+n). The indexes enclosed by the dashed
    lines belong to data of one search area line and
    one reference block.

projection onto the i, m plane
36
Systolic Architecture AB2
  • Systolic architecture AB2 with the indexes of
    search area data y(i+m, k+n). The reference
    block data x(i, k) remain fixed in the AD PEs.
    The indexes of one search area line data are
    enclosed by the dashed line.

Projection along the i, k-plane
37
Processing Element
38
Bit-Level Cell Array
4x4 PE array
39
Bit-Level PE Array (Cont.)
40
Systolic Array AS1
  • Systolic architecture AS1 for N = 3 and p = 2
    with the indexes of search area data y(i+m, k+n)
    and reference block data x(i, k).

41
Efficient Hybrid Tree/Linear Array Architectures
for Block-Matching Motion Estimation Algorithms
  • M.-J. Chen, L.-G. Chen, K.-N. Cheng, and
    M. C. Chen
  • IEE Proc.-Vis. Image Signal Process., vol. 143,
    no. 4, pp. 217-222, Aug. 1996

42
Illustration of One-Dimensional Full Search
Algorithm
43
Tree-Type Array Architecture with N = 4
44
Hybrid Tree/Linear Architecture
45
Tree-Cut Technique: Direct Form
46
Image pel Distribution for Memory Interleaving
47
Chip Layout and Characteristics
48
Analysis and Architecture Design of Variable
Block Size Motion Estimation for H.264/AVC
  • Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang,
    Tung-Chien Chen, Tu-Chih Wang, and Liang-Gee Chen
  • IEEE Trans. Circuits Syst. Video Technology

49
Abstract
  • Variable block size motion estimation (VBSME) has
    become an important video coding technique, but
    it increases the difficulty of hardware design.
  • We use inter/intra-level classification and
    various data flows to analyze the impact of
    supporting VBSME in different hardware
    architectures.
  • We propose two hardware architectures, which can
    support traditional fixed block size motion
    estimation as well as VBSME with less chip-area
    overhead than previous approaches.

50
Abstract (Cont.)
  • By broadcasting reference pixel rows and
    propagating partial SADs, the first design has
    fewer reference-pixel registers and a shorter
    critical path.
  • The second design utilizes a 2-D distortion array
    and one adder tree with a reference buffer, which
    maximizes the data reuse between successive
    searching candidates.
  • We demonstrate a 720p, 30fps solution at 108 MHz
    with 330.2K gate count and 208K bits on-chip
    memory.

51
Introduction (Cont.)
  • The row (column) SAD is the summation of N
    distortions in a row (column).
  • Although FSBMA provides the best quality among
    various ME algorithms, it consumes the largest
    computation power. In general, the computation
    complexity of ME is 50% to 90% of a typical
    video coding system. Hence a hardware accelerator
    for ME is required.

52
VBSME
  • Variable block size motion estimation (VBSME) is
    a new coding technique and provides more accurate
    predictions compared to traditional fixed block
    size motion estimation (FBSME).
  • With FBSME, if a MB consists of two objects with
    different motion directions, the coding
    performance of this MB is worse.
  • On the other hand, for the same condition, the MB
    can be divided into smaller blocks in order to
    fit the different motion directions with VBSME.
  • VBSME has been adopted in the latest video coding
    standards, including H.263, MPEG-4, WMV9.0, and
    H.264/AVC.

53
VBSME (Cont.)
  • In H.264/AVC, a MB with variable block size can
    be divided into seven kinds of blocks: 4 × 4,
    4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8, and
    16 × 16.
  • Although VBSME can achieve higher compression
    ratio, it not only requires huge computation
    complexity but also increases the difficulty of
    hardware implementation for ME.
  • Traditional ME hardware architectures are
    designed for FBSME, and they can be classified
    into two categories.
  • One is an inter-level architecture, where each
    processing element (PE) is responsible for one
    SAD of a specific searching candidate.
  • The other is an intra-level architecture, where
    each PE is responsible for the distortion of a
    specific current pixel in the current MB for all
    searching candidates.

54
Yang, Sun, and Wu's Architectures
  • A 1-D inter-level hardware architecture
    (1DInterYSW).
  • The number of PEs is equal to the number of
    searching candidates in the horizontal direction,
    2Ph.
  • The most important concept is data broadcasting.
    With the broadcasting technique, the memory
    bandwidth, defined as the number of bits of
    reference data required in one cycle, is reduced
    significantly, although some global routing is
    required.

55
Yeo and Hu's Architectures
56
Lai and Chen's Architecture
  • Reference pixels are propagated with propagation
    registers, and current pixels are broadcast into
    PEs.
  • The partial SADs are still stored and accumulated
    in PEs.
  • Besides, 2DInterLC has to load reference pixels
    into propagation registers before computing SADs.
    The latency of loading reference pixels can be
    reduced by partitioning the search range in
    2DInterLC.

57
Vos and Stegherr's Architecture
58
Vos and Stegherr's Architecture (Cont.)
  • A 2-D intra-level architecture.
  • The number of PEs is equal to the block size.
    Each PE corresponds to a current pixel, and the
    current pixels are stored in the PEs.
  • The important concept of 2DIntraVS is the
    scanning order of the searching candidates:
    snake scan.
  • The computation flow is as follows.
  • First, the distortion is computed in each PE, and
    N partial row SADs are propagated and accumulated
    in the horizontal direction.
  • Second, an adder tree is used to accumulate the N
    row SADs into the total SAD. The accumulations of
    row SADs and the SAD are done in one cycle. Hence
    no partial SAD needs to be stored.
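The snake-scan order can be sketched as follows (illustrative Python; the exact traversal direction used by 2DIntraVS may differ):

```python
# Snake scan: successive search candidates differ by one pixel
# position, so most reference pixels held in the PE array can be
# reused from one candidate to the next.

def snake_scan(width, height):
    """Yield candidate positions column by column, reversing direction
    on alternate columns (boustrophedon order)."""
    for x in range(width):
        rows = range(height) if x % 2 == 0 else range(height - 1, -1, -1)
        for y in rows:
            yield (x, y)

order = list(snake_scan(3, 3))
assert order == [(0, 0), (0, 1), (0, 2),
                 (1, 2), (1, 1), (1, 0),
                 (2, 0), (2, 1), (2, 2)]
# Every consecutive pair of candidates is exactly one step apart:
assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
           for a, b in zip(order, order[1:]))
```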

59
Komarek and Pirsch's Architecture
Hsieh and Lin's Architecture
60
Komarek and Pirsch's Architecture (Cont.)
  • Komarek and Pirsch contributed a detailed
    systolic mapping procedure based on the
    dependence graph (DG). AB2 (2DIntraKP) is a 2-D
    intra-level architecture.
  • Current pixels are stored in corresponding PEs.
    Reference pixels are propagated PE by PE in the
    horizontal direction.
  • The N partial column SADs are first propagated
    and accumulated in the vertical direction.
  • After the vertical propagation, these N column
    SADs are propagated in the horizontal direction.

61
Hsieh and Lin's Architecture
  • 2DIntraHL consists of N PE arrays in the vertical
    direction, and each PE array is composed of N PEs
    in a row.
  • In 2DIntraHL, reference pixels are propagated
    with propagation registers one by one, which
    provides the advantages of serial data input and
    increased data reuse.
  • Current pixels are still stored in PEs. The N
    partial column SADs are propagated in the
    vertical direction from bottom to top.
  • In each computing cycle, each PE array generates
    N distortions of a searching candidate and
    accumulates these distortions with N partial
    column SADs in the vertical propagation.
  • After the accumulation in the vertical direction,
    the N column SADs are accumulated in the top
    adder tree in one cycle. The longer latency for
    loading reference pixels and the large
    propagation registers are the penalties for the
    reduction of memory bitwidth and memory
    bandwidth.

62
Proposed Propagate Partial SAD
63
Proposed Propagate Partial SAD (Cont.)
  • The architecture is composed of N PE arrays with
    a 1-D adder tree in the vertical direction.
  • Current pixels are stored in each PE, and two
    sets of N continuous reference pixels in a row
    are broadcast to the N PE arrays at the same
    time.

64
Data Flow of Propagate Partial SAD
65
Proposed SAD Tree
66
Scan Order and Memory Access
67
Variable Block Size Motion Estimation
68
The Impact of Variable Block Size Motion
Estimation in Hardware Architectures
  • There are many methods to support VBSME in
    hardware architectures.
  • For example, we can increase the number of PEs or
    the operating frequency to perform ME for each
    block size separately. Another method is to reuse
    the SADs of the smallest blocks, which are the
    blocks partitioned with the smallest block size,
    to derive the SADs of larger blocks.
  • With this method, the overhead of supporting
    VBSME is only a slight increase in gate count,
    and the other factors, such as frequency,
    hardware utilization, memory usage, and so on,
    are the same as those of FBSME.
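The SAD-reuse method can be sketched as follows (illustrative Python; only the square and 8 × 16 / 16 × 8 partitions are shown, and the 4 × 8 / 8 × 4 cases merge pairs of 4 × 4 SADs analogously):

```python
# Compute SADs only for the sixteen 4x4 blocks of a macroblock, then
# derive every larger block's SAD by adding 4x4 results.
# Sizes follow H.264/AVC (N = 16, n = 4).

def merge_sads(sad4x4):
    """sad4x4: 4x4 grid of 4x4-block SADs for one search candidate.
    Returns SADs for the 8x8 blocks, 8x16 and 16x8 blocks, and the
    16x16 MB."""
    sad8x8 = [[sum(sad4x4[2 * r + i][2 * c + j]
                   for i in range(2) for j in range(2))
               for c in range(2)] for r in range(2)]
    sad8x16 = [sad8x8[0][c] + sad8x8[1][c] for c in range(2)]
    sad16x8 = [sad8x8[r][0] + sad8x8[r][1] for r in range(2)]
    sad16x16 = sad8x16[0] + sad8x16[1]
    return sad8x8, sad8x16, sad16x8, sad16x16

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
s8, s8x16, s16x8, s16 = merge_sads(grid)
assert s8 == [[14, 22], [46, 54]]
assert s16 == 136
```

Only additions are reused; no distortion is recomputed, which is why the overhead is limited to extra adders.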

69
Data Flow I: Storing in PEs (Inter-Level
Architecture)
  • The number of bits for the data buffer in each PE
    is increased from log2(N²) + 8 to
    n² × (log2((N/n)²) + 8), where N² and (N/n)² are
    the numbers of pixels in one block and in one
    smallest block, and 8 is the wordlength of one
    pixel. For example, with N = 16 (FBSME) the
    buffer needs 8 + 8 = 16 bits, while with n = 4
    (VBSME) it needs 16 × (4 + 8) = 192 bits.
70
Data Flow II: Propagating with Propagation
Registers (Intra-Level Architecture)
  • In intra-level architectures, partial SADs can be
    accumulated and propagated with propagation
    registers.
  • Each PE computes the distortion of one
    corresponding current pixel in the current MB.
  • By propagation adders and registers, the partial
    SAD is accumulated with these distortions.
  • When supporting VBSME, more propagation registers
    are required to store the partial SADs of the
    smallest blocks. In each propagating direction,
    the number of propagation registers is n times
    the original, for the n smallest blocks in the
    other direction.

71
The Proposed Propagate Partial SAD Architecture
with Data Flow II
72
Data Flow III: No Partial SADs
The proposed SAD Tree architecture with Data Flow
III, where N = 16 and n = 4.
73
Data Flow III: No Partial SADs (Cont.)
  • In intra-level architectures, it is possible that
    no partial SADs need to be stored, as in SAD
    Tree.
  • Each PE computes the distortion of one current
    pixel for a searching candidate, and the total
    SAD is accumulated by an adder tree in one cycle,
    as shown in Fig. 5(a).
  • Because there is no partial SAD in this
    architecture, there is no register overhead for
    storing partial SADs when supporting VBSME.
  • The adder tree is the part that must be
    reorganized to support VBSME.
  • That is, we partition the 2-D adder tree in order
    to get the SADs of the smallest blocks first, and
    then, based on these SADs, derive the SADs of
    larger blocks. Although there is no additional
    register overhead, the extra adder-tree additions
    required to support VBSME do require additional
    area.

74
THE PARALLELISM, CYCLES, LATENCY, AND DATA FLOW
OF EIGHT HARDWARE ARCHITECTURES
75
THE DATA BUFFER AND MEMORY BITWIDTH OF EIGHT
HARDWARE ARCHITECTURES
76
An Example
  • The specifications of ME are as follows. The MB
    size is 16 × 16, and the search range is Ph = 64
    and Pv = 32.
  • The frame size is D1 size, 720 × 480.
  • When VBSME is supported, a MB can be partitioned
    into at most 16 4 × 4 blocks.
  • We use Verilog-HDL and the SYNOPSYS Design
    Compiler with the ARTISAN UMC 0.18 µm cell
    library to implement each hardware architecture.
  • Because the timing of the critical path in some
    architectures is too long, which means the
    maximum operating frequency is limited without
    modifying the architecture, the frame rate is set
    to only 10 frames per second (fps).

77
Area and Required Frequency
  • Among these eight hardware architectures, all
    inter-level architectures with Data Flow I
    increase the gate count dramatically. The chip
    area is at least five times that of FBSME.

78
Latency
  • The latency is defined as the number of start-up
    cycles that a hardware takes to generate the
    first SAD.
  • If a module has a long latency and it cannot be
    shortened by parallel architectures, the effect
    of parallel computation is reduced. That is, a
    shorter latency is better for video coding
    systems.
  • Two factors affect the latency:
  • Hardware architecture
  • Memory bandwidth
  • Compared to these hardware architectures, the
    other intra-level architectures, such as proposed
    Propagate Partial SAD and SAD Tree, have shorter
    latencies.

79
Utilization
  • In general, inter-level architectures can
    continuously compute MB after MB, so the initial
    cycles can be neglected and the utilization will
    be 100%.
  • Therefore, we define the utilization as
    computing cycles / operating cycles for a MB.
  • The operating cycles include three parts:
    latency, computing cycles, and bubble cycles.
    Computing cycles are the cycles in which at
    least one SAD is produced. That is, if the
    utilization is 100%, we get at least one SAD in
    each cycle. The fewer the operating cycles, the
    more apparent the penalty of the latency becomes.
  • The more bubble cycles there are, the lower the
    utilization is.

80
Memory Usage
  • Memory usage consists of two parts, memory
    bitwidth and memory bandwidth.
  • Memory bitwidth is defined as the number of bits
    which a hardware has to access from memory in
    each cycle, and memory bandwidth is re-defined as
    the number of bits which a hardware has to access
    from memory for a MB.
  • Memory bandwidth affects the loading of the
    system bus (without on-chip memory) or the power
    of on-chip memory, and memory bitwidth is the key
    to the data arrangement of on-chip memories.
  • Memory bitwidth and bandwidth are affected by the
    data-reuse scheme and the operating cycles.

81
Hexagonal Plot
  • The closer the point is to the center, the worse
    the performance is.
  • Note that, in various video coding systems or
    hardware system platforms, the weighting of each
    axis will be very different.
  • We can use these hexagonal plots to select the
    optimal architecture based on different
    constraints for the system integration.

82
Hexagonal Plots
83
Hexagonal Plots
84
Hexagonal Plots
85
Hexagonal Plots
86
Hardware Architecture of H.264 Integer Motion
Estimation
  • Based on the above analysis, we propose a ME
    hardware for H.264/AVC integer-pixel motion
    estimation (IME) as an example.
  • Two frame sizes are supported in our
    specification.
  • One is D1 format with four reference frames at 30
    fps. In the previous frame, the search range is
    [-64, 64) and [-32, 32) in the horizontal and
    vertical directions. In the other reference
    frames, the search range is [-32, 32) and
    [-16, 16) in the horizontal and vertical
    directions.
  • The other is 720p with one reference frame at 30
    fps. The search range is the same as that of the
    previous frame in D1 format.

87
Hardware Architecture of H.264 Integer Motion
Estimation (Cont.)
  • In our specification, the computation complexity
    of H.264 is 2.4 tera-instructions per second and
    3.8 terabytes per second in D1 format, dominated
    by IME, as estimated by instruction profiling of
    the reference software, JM7.3.
  • The ultra-large computation complexity can be
    solved by parallel computation, but the huge
    external memory bandwidth cannot. Therefore, the
    huge memory bandwidth is a difficult challenge
    for hardware design.
  • There are still two problems.
  • First, because of VBSME and Lagrangian mode
    decision, the data dependency of the motion
    vector predictor prohibits parallel computation
    among the smaller blocks in a MB.
  • Second, when high processing ability is
    necessary, the hardware cost of ME architectures
    with high degrees of parallelism also has to be
    considered.

88
Modified Algorithm
  • First, we divide the computation of ME into two
    parts, integer-pixel ME (IME) and
    fractional-pixel ME (FME), and propose two
    individual hardware accelerators for IME and
    FME, respectively. The utilization of the
    hardware accelerators can be significantly
    improved in this way.
  • Second, in the original Lagrangian mode decision,
    the MV predictor of a block is the median MV
    among the MVs of the top, top-right, and left
    neighboring 4 × 4 blocks, but in the parallel
    computation of hardware architectures, the coding
    modes of the neighboring 4 × 4 blocks cannot be
    decided in parallel, especially when the block
    size is 4 × 4.

89
The motion vector predictor for (a) the 4 × 8
block, (b) the 16 × 16 block, and (c) the modified
motion vector predictor for all blocks.
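For reference, the baseline median rule that the modified predictor replaces can be sketched as follows (illustrative Python; the authors' modified predictor is defined in their figure and is not reproduced here):

```python
# H.264 baseline MV predictor: the component-wise median of the left,
# top, and top-right neighbors' motion vectors.

def median_mv(mv_left, mv_top, mv_topright):
    """Component-wise median of three motion vectors (x, y)."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(mv_left[0], mv_top[0], mv_topright[0]),
            med(mv_left[1], mv_top[1], mv_topright[1]))

assert median_mv((2, -1), (5, 0), (3, 3)) == (3, 0)
```

It is this dependence on already-decided neighboring blocks that blocks parallel computation, motivating the modification.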
90
Hardware Architecture with M-parallelism
  • In our specification, we require eight sets of
    Propagate Partial SAD or SAD Tree to achieve
    real-time computation.
  • Eight sets of Propagate Partial SAD or SAD Tree,
    which can process eight successive candidates in
    a row at the same time, are combined as
    Eight-Parallel Propagate Partial SAD and
    Eight-Parallel SAD Tree, respectively.

91
Hardware Architecture of H.264 Integer Motion
Estimation.
92
Comparison of RD Curves Between JM7.3 and Our
Proposed Encoder
93
Memory Reduction of H.264 IME