DSP Algorithms on FPGA, Part II: Digital Image Processing

Slides: 26
Provided by: kittitor

Transcript and Presenter's Notes


1
DSP Algorithms on FPGA, Part II: Digital
Image Processing
2
Content
  • Overview of image processing and FPGAs
  • Algorithm to FPGA Mapping Flow
  • Nested Loop Algorithms and MODG
  • Example: Motion Estimation
  • Conclusion and Future Trends

3
Video signal in different formats
  • PAL: 720 × 576 pixels, 25 f/s, 10.4 Mp/s
  • NTSC: 720 × 480 pixels, 29.97 f/s, 10.4 Mp/s
  • HDTV: 1920 × 1080 pixels, 30.0 f/s, 62.2 Mp/s
  • Common delivery forms: analog (cable), USB, FireWire
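The pixel rates listed for each format follow from width × height × frame rate. A minimal sketch that reproduces them; the function name `pixel_rate_mps` and the `FORMATS` table are illustrative names, not part of the slides:

```python
# Pixel rate = width * height * frame rate, reported in megapixels per second.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}


def pixel_rate_mps(width, height, fps):
    """Return the pixel rate in megapixels per second (1 Mp = 10**6 pixels)."""
    return width * height * fps / 1e6


if __name__ == "__main__":
    for name, (w, h, fps) in FORMATS.items():
        print(f"{name}: {pixel_rate_mps(w, h, fps):.1f} Mp/s")
```

These rates are the sustained throughput that any per-pixel hardware pipeline would have to match for real-time processing.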

4
Image Processing Characteristics
  • Need to maximize the available logic by supporting
    N-D processing across multiple configurable devices
  • For example, a 3 × 3 window applied to an image:

      1 2 1
      2 4 2
      1 2 1
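The 3 × 3 array above reads like a weighted-average (smoothing) window. A minimal sketch of applying it to an image, assuming the weights are normalized by their sum (16) and that border pixels are copied unchanged; neither assumption is stated on the slide, and the function name `smooth` is illustrative:

```python
KERNEL = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]


def smooth(image, kernel=KERNEL):
    """Apply a 3x3 weighted-average window to the interior pixels of `image`.

    Border pixels are copied unchanged (an assumption, the slide does not say).
    """
    rows, cols = len(image), len(image[0])
    norm = sum(sum(row) for row in kernel)  # 16 for the window above
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc // norm  # integer division, as fixed-point hardware might do
    return out
```

Each output pixel needs nine multiply-accumulate operations that are independent of every other output pixel, which is the kind of regular, local computation that maps well onto an array of identical logic blocks.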
5
Challenges
  • How to:
  • Partition algorithms appropriately between
    hardware and software
  • Exploit spatial and temporal parallelism
  • Integrate the configurable computer into the
    software framework
  • Select a suitable configuration strategy
  • How shall we deal with these challenges?

6
Why SRAM-Based FPGAs? (Pros)
  • Higher logic/storage capacity
  • Fast carry chain for adders/subtractors
  • Built-in XOR gates/LUT
  • Array of bit-parallel multipliers
  • Fast and local storage array of SRAM
    blocks
  • Interconnect supports three-state
    buffers/LUT
  • Equivalent to fine-grained reconfigurable
    hardware
  • Finer-grained pipelining can help preserve
    performance at a low power-supply voltage
  • More mature CMOS manufacturing technology

7
Algorithm to FPGA Mapping Flow
8
The Matrix Multiplication MODG
Many different execution orders can be used to
carry out the same algorithm.
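To make the point about execution order concrete, here is a minimal sketch that runs the matrix multiplication loop nest in all six loop orders and collects the results. The function name `matmul_in_order` and the sample matrices are illustrative, not taken from the slides:

```python
from itertools import permutations, product


def matmul_in_order(a, b, order):
    """Multiply a (M x K) by b (K x N) visiting indices in the given loop order.

    `order` is a permutation of "ijk"; the first letter is the outermost loop.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    extent = {"i": m, "j": n, "k": k_dim}
    c = [[0] * n for _ in range(m)]
    for idx in product(*(range(extent[axis]) for axis in order)):
        pos = dict(zip(order, idx))
        i, j, k = pos["i"], pos["j"], pos["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
results = [matmul_in_order(A, B, "".join(p)) for p in permutations("ijk")]
```

Every entry of `results` is the same matrix, because integer addition is associative and commutative. A representation that does not fix the order in advance leaves the mapping step free to pick whichever order suits the hardware.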
9
Nested Do Loop Algorithms and Inter-Iteration
Dependence Graph
  • Matrix multiplication as a nested Do loop:
      Do i = 1 to M
        Do j = 1 to N
          c(i,j) = 0
          Do k = 1 to K
            c(i,j) = c(i,j) + a(i,k) * b(k,j)
          EndDo k
        EndDo j
      EndDo i
  • Dependence vectors
      d_a = (i,j,k)^T = (0,1,0)^T
      d_b = (i,j,k)^T = (1,0,0)^T
      d_c = (i,j,k)^T = (0,0,1)^T
  • Index space: J^3 = {(i,j,k)^T | 1 ≤ i,j,k ≤ 3}
    (M = N = K = 3)
  • Inter-Iteration Data Dependence Graph (DG)
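The dependence graph for this loop nest can be enumerated directly: one node per index point, and one edge per dependence vector wherever the target point is still inside the index space. A minimal sketch for the 3 × 3 × 3 case; the function name `dependence_graph` and the edge format are illustrative choices:

```python
from itertools import product

# Dependence vectors from the slide: a is reused along j, b along i, c along k.
DEPENDENCES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def dependence_graph(m, n, k):
    """Return (nodes, edges) of the inter-iteration dependence graph.

    Nodes are index points (i, j, k) with 1-based indices; each edge
    (src, dst, var) says iteration `dst` receives variable `var` from `src`.
    """
    nodes = list(product(range(1, m + 1), range(1, n + 1), range(1, k + 1)))
    node_set = set(nodes)
    edges = []
    for src in nodes:
        for var, d in DEPENDENCES.items():
            dst = tuple(s + step for s, step in zip(src, d))
            if dst in node_set:
                edges.append((src, dst, var))
    return nodes, edges


nodes, edges = dependence_graph(3, 3, 3)
```

For M = N = K = 3 this gives 27 nodes and 54 edges (18 per variable). Every edge connects nearest neighbours in the index space, which is what makes the graph uniform.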

10
Systolic Mapping (space-time) of Matrix
Multiplication
11
Systolic Mapping of Matrix Multiplication, cont.
12
Why Is Space-Time Mapping Suitable for FPGAs?
  • It bridges nested Do loop signal/image
    processing algorithms and their processor array
    implementation.
  • The space-time array matches the modular and
    regular FPGA structure.
  • The localized/pipelined interprocessor links can
    overcome the long programmable interconnect
    delay.
  • The size of configuration storage can be
    significantly reduced because of the almost
    identical processing elements and interconnect
    structure.
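A space-time mapping assigns each index point of the loop nest to a processing element and a time step. The sketch below is one textbook-style choice, not necessarily the mapping shown in the slides: it assumes a projection direction u = (0,0,1) (collapse the k axis, so PE (i,j) accumulates c(i,j)) and a schedule vector s = (1,1,1) (time = i + j + k). The names `SCHEDULE`, `PROJECTION`, and `space_time_map` are illustrative.

```python
from itertools import product

DEPENDENCES = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]  # d_a, d_b, d_c from the slides
SCHEDULE = (1, 1, 1)     # assumed schedule vector s: time = i + j + k
PROJECTION = (0, 0, 1)   # assumed projection direction u: collapse the k axis


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def space_time_map(m, n, k):
    """Map each index point (i, j, k) to (PE coordinate, time step)."""
    # Causality: every dependence must advance time by at least one step.
    assert all(dot(SCHEDULE, d) > 0 for d in DEPENDENCES)
    # No conflict: points merged onto one PE must get different time steps.
    assert dot(SCHEDULE, PROJECTION) != 0
    mapping = {}
    for i, j, kk in product(range(1, m + 1), range(1, n + 1), range(1, k + 1)):
        pe = (i, j)                      # projecting along k drops the k coordinate
        t = dot(SCHEDULE, (i, j, kk))
        mapping[(i, j, kk)] = (pe, t)
    return mapping


mapping = space_time_map(3, 3, 3)
pes = {pe for pe, _ in mapping.values()}
times = {t for _, t in mapping.values()}
```

Under these assumptions the 27 iterations run on 9 identical PEs over time steps 3 to 9, and each dependence vector becomes either a one-step link to a neighbouring PE (a and b) or a local accumulation (c). That matches the modular structure and localized links described above.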

13
Problems with Existing Design Methodologies/Tools
  • The dependence graphs of many other algorithms
    are not uniform and must be predetermined by
    human designers.
  • Existing methodologies cannot handle these
    complex algorithms and use unrealistic cost
    functions (metrics).
  • No built-in features of FPGAs have been
    incorporated.
  • Longer interconnect delay in deep submicron CMOS
    technology
  • Much lower hardware utilization due to
    programmable interconnect delay in FPGAs
  • There is another problem: speed

14
What is Intra-PE Pipelining?
  • Interconnect delay of FPGAs results in an even
    longer clock period.
  • To enhance the overall throughput,
    intra-iteration parallelism must be exploited.
  • A simple vector dot product array
  • It can be observed that the utilization of each
    operator is increased.
  • Of course, the control mechanism is more complex.
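A rough way to see the benefit is to model a multiply-accumulate PE for a vector dot product with and without a register between the multiplier and the adder. The sketch below is a behavioral model under assumed delays; the delay values, the function names, and the two-stage split are hypothetical and are not taken from the slides.

```python
def dot_product_time_ns(n_terms, t_mul, t_add, t_reg=0, pipelined=False):
    """Estimate the time to accumulate n_terms products in one PE.

    Unpipelined: the clock period must cover multiplier plus adder.
    Pipelined:   the period covers the slower stage plus register overhead,
                 at the cost of one extra cycle to drain the pipeline.
    """
    if pipelined:
        period = max(t_mul, t_add) + t_reg
        cycles = n_terms + 1
    else:
        period = t_mul + t_add
        cycles = n_terms
    return period * cycles


def pipelined_dot(xs, ys):
    """Cycle-by-cycle model of a two-stage multiply-accumulate PE."""
    product_reg = None   # pipeline register between multiplier and adder
    acc = 0
    cycles = 0
    stream = list(zip(xs, ys))
    pos = 0
    while pos < len(stream) or product_reg is not None:
        # Stage 2: the adder consumes the product latched in the previous cycle.
        if product_reg is not None:
            acc += product_reg
        # Stage 1: the multiplier works on the next operand pair in the same cycle.
        if pos < len(stream):
            x, y = stream[pos]
            product_reg = x * y
            pos += 1
        else:
            product_reg = None
        cycles += 1
    return acc, cycles
```

With assumed delays of 15 ns (multiplier), 10 ns (adder), and 1 ns (register), a 64-term dot product takes 64 × 25 = 1600 ns unpipelined and 65 × 16 = 1040 ns pipelined. Both operators are busy in almost every cycle, at the cost of the extra control needed to fill and drain the pipeline.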


  • Tech done example

15
Examples of Nested Do Loop Algorithms
  • Motion estimation: one of the most time-consuming
    operations (tasks) in digital video compression
  • Stereo matching: used to build a disparity map for
    3D robot/computer navigation
  • Matrix/vector multiplication: FFT, DCT, 2D/3D
    graphics, etc.
  • 2D linear transforms/operations: 2D FFT, 2D DCT,
    etc.

16
Tennis frame 0
17
Tennis frame 1
18
Motion Vectors of 8x8-Pixel Blocks
19
Reconstructed Frame 1 from Frame 0 and Motion
Vectors
20
Illustration of Full Search Block Matching Motion
Estimation (6-level nested Do loop)
Motion vector (m,n)
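The six loop levels of full-search block matching can be written out as a minimal sketch. Several details are assumptions rather than content from the slides: the first array index is treated as the h direction, the matching cost is the sum of absolute differences (dividing by N² to get a mean would not change the winner), and candidates that fall outside the reference frame are skipped. The function name `full_search` is illustrative.

```python
def full_search(cur, ref, block, p):
    """Return {(h, v): (dm, dn)} mapping each block to its best displacement.

    cur, ref: 2-D lists (current and reference frame) of equal size.
    block:    block size N; p: search range, so displacements lie in [-p, p].
    """
    rows, cols = len(cur), len(cur[0])
    vectors = {}
    for h in range(rows // block):                 # loop 1: block index h
        for v in range(cols // block):             # loop 2: block index v
            best = None
            for m in range(2 * p + 1):             # loop 3: candidate index m
                for n in range(2 * p + 1):         # loop 4: candidate index n
                    r0 = h * block + m - p
                    c0 = v * block + n - p
                    # Skip candidates that fall outside the reference frame.
                    if r0 < 0 or c0 < 0 or r0 + block > rows or c0 + block > cols:
                        continue
                    mad = 0
                    for i in range(block):         # loop 5: pixel index i
                        for j in range(block):     # loop 6: pixel index j
                            mad += abs(cur[h * block + i][v * block + j]
                                       - ref[r0 + i][c0 + j])
                    if best is None or mad < best[0]:
                        best = (mad, (m - p, n - p))
            vectors[(h, v)] = best[1]
    return vectors
```

The zero displacement is always a valid candidate, so every block receives a vector. The two innermost loops are pure accumulation, and the two candidate loops are independent of each other, which is where an array of PEs can work in parallel.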
21
Example: A Simpler PE Microarchitecture
  • MAD(m,n) = MAD(m,n) + |x(hN+i, vN+j) − y(hN+i+m−p, vN+j+n−p)|
  • Xilinx Core Generator System
  • Critical path delay: 25 ns, based on Xilinx
    Virtex data
  • 1,500-2,000 equivalent gate count
  • Critical path (blue line) can be shortened
    further by intra-PE pipelining
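As a behavioral illustration only, a PE of this kind can be modeled as an absolute-difference accumulator, and the quoted critical path translates into an upper bound on clock frequency. The class and function names are hypothetical and do not describe the actual microarchitecture on the slide:

```python
class AbsDiffAccumulatorPE:
    """Behavioral model of one PE: MAD <- MAD + |x - y| once per clock cycle."""

    def __init__(self):
        self.mad = 0

    def step(self, x, y):
        """Consume one pixel pair and return the running accumulation."""
        self.mad += abs(x - y)
        return self.mad


def max_clock_mhz(critical_path_ns):
    """Upper bound on clock frequency implied by a critical path delay."""
    return 1000.0 / critical_path_ns
```

A 25 ns critical path corresponds to a clock of at most 1000 / 25 = 40 MHz. Shortening that path raises the bound accordingly.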

22
Significance of the Contributions
  • The MODG representation for nested Do loop
    algorithms
  • The actual execution is not constrained to any
    predetermined order.
  • It keeps track of every variable instance, so
    there is no redundant memory access, which saves
    I/O, bandwidth, and power consumption.
  • can be automated using memory .
  • Without the MODG,
  • the motion estimation and many other nested Do
    loop algorithms can be written as many
    different DGs,
  • a human must be involved to formulate a DG,
  • the built-in ROM/RAM of FPGA may not be
    exploited, and

23
Significance of the Contributions, cont.
  • Space-Time mapping for the MODG can be applied to
  • any SRAM-based FPGA, under architecture
    constraints and practical cost functions
  • any coarse-grained architecture
  • Intra-PE pipelining
  • enhances/preserves the throughput rate at low
    power mode.

24
Conclusion
  • Users demand more communication/multimedia
    processing capabilities on resource-limited
    Internet appliances.
  • Reconfigurable SoC is the ultimate solution for
    designing a challenging low-power,
    high-performance platform.
  • Its success depends on the embedded high-density
    FPGA core as reconfigurable (programmable)
    accelerating hardware.
  • As technology (supply voltage) scales down, logic
    (transistors) is virtually free, while the
    interconnect becomes the bottleneck and the
    dominant power consumer.
  • Parallel execution of nested Do loop algorithms
    by an array of localized processing elements at
    moderate clock frequency is a viable solution.
  • It can balance the three main issues: design
    time, power consumption, and performance.

25
Future Trends
  • Memory (storage) organization should be
    investigated, because multiple reads per clock
    cycle are needed to sustain such high throughput.
  • The control mechanism of the entire array is one
    of the aspects that will determine its success.
  • A given MODG may need to be partitioned so
    that the resulting array fits the on-chip
    reconfigurable FPGA core.
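One simple way to picture partitioning is to tile the output of a matrix multiplication so that each pass needs no more PEs than the core provides. This is a sketch of one possible scheme, not the method proposed in the slides; the names `tiles` and `partitioned_matmul` and the pass count are illustrative:

```python
def tiles(extent, size):
    """Yield (start, stop) ranges that cover range(extent) in chunks of `size`."""
    for start in range(0, extent, size):
        yield start, min(start + size, extent)


def partitioned_matmul(a, b, pe_rows, pe_cols):
    """Compute the product of a and b one output tile at a time.

    Each tile needs at most pe_rows x pe_cols processing elements, so the
    same small array can be reused for every tile.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    passes = 0
    for i0, i1 in tiles(m, pe_rows):
        for j0, j1 in tiles(n, pe_cols):
            passes += 1
            for i in range(i0, i1):          # PEs of the current tile
                for j in range(j0, j1):
                    for k in range(k_dim):   # time steps inside each PE
                        c[i][j] += a[i][k] * b[k][j]
    return c, passes
```

For a 3 × 3 product on a 2 × 2 array this takes four passes. Every pass re-reads rows of `a` and columns of `b`, which illustrates why the memory organization and the number of reads per clock cycle matter.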