DSP Algorithms on FPGA, Part II: Digital Image Processing



Provided by: kittitor



1
DSP Algorithms on FPGA, Part II: Digital Image Processing
2
Content

Overview of Image Processing and FPGAs
Algorithm to FPGA Mapping Flow
Nested Loop Algorithms and MODG
Example: Motion Estimation
Conclusion and Future Trends

3
Video signal in different formats

Format   Resolution (pixels)   Frame rate (f/s)   Pixel rate (Mp/s)
PAL      720 × 576             25                 10.4
NTSC     720 × 480             29.97              10.4
HDTV     1920 × 1080           30.0               62.2
Common delivery forms
Analog (cable)
USB
Firewire
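
The pixel rates listed above for each format follow from width × height × frame rate. A minimal sketch of that arithmetic is given here; the function name `pixel_rate_mps` and the `FORMATS` table are illustrative names, not part of the original slides.

```python
# Pixel throughput = width * height * frames per second.
# Resolutions and frame rates are the ones listed on the slide.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}


def pixel_rate_mps(width, height, fps):
    """Return the pixel rate in megapixels per second."""
    return width * height * fps / 1e6


for name, (w, h, fps) in FORMATS.items():
    print(f"{name}: {pixel_rate_mps(w, h, fps):.1f} Mp/s")
```

This is the sustained rate at which a processing array has to accept pixels if every frame is to be handled in real time.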

4
Image Processing Characteristics

Need to maximize the available logic by supporting N-D
arrays of multiple configurable devices
For example, an image and a 3×3 kernel:

1 2 1
2 4 2
1 2 1
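
The 3×3 array of weights above can be read as a neighborhood filter applied at every pixel. The sketch below is one plausible reading, not something the slide spells out: it assumes the weights are normalized by their sum (16), uses integer division, and copies border pixels unchanged. The name `filter3x3` is hypothetical.

```python
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def filter3x3(image, kernel=KERNEL):
    """Apply a 3x3 kernel to interior pixels; border pixels are copied as-is."""
    rows, cols = len(image), len(image[0])
    norm = sum(sum(row) for row in kernel)  # 16 for the kernel above (assumed normalization)
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            # Nine multiply-accumulates per output pixel; each of them is
            # independent of the work done for neighboring pixels.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc // norm
    return out
```

Every output pixel needs the same small set of multiply-accumulates on local data, which is the kind of regular, parallel workload the rest of the presentation maps onto processing arrays.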
5
Challenges

How to???
Appropriate partitioning of algorithms between
hardware and software
Exploiting spatial and temporal parallelism
Integrating the configurable computer into the
software framework
Selecting a suitable configuration strategy
How shall we deal with these challenges?

6
Why SRAM-Based FPGAs? (Pros)

Higher logic/storage capacity
Fast carry chain for adders/subtractors
Built-in XOR gates/LUT
Array of bit-parallel multipliers
Fast and local storage array of SRAM
blocks
Interconnect supports three-state
buffers/LUT
Equivalent to fine-grained reconfigurable
hardware
Finer-grained pipelining can help preserve
performance at a low supply voltage
More mature CMOS manufacturing technology

7
Algorithm to FPGA Mapping Flow
8
The Matrix Multiplication MODG
A number of different execution orders can be
carried out to achieve the same algorithm.
9
Nested Do Loop Algorithms and Inter-Iteration
Dependence Graph

Do i = 1 to M
  Do j = 1 to N
    c(i,j) = 0
    Do k = 1 to K
      c(i,j) = c(i,j) + a(i,k) b(k,j)
    EndDo k
  EndDo j
EndDo i
Dependence vectors
d_a = (Δi, Δj, Δk)^t = (0,1,0)^t
d_b = (Δi, Δj, Δk)^t = (1,0,0)^t
d_c = (Δi, Δj, Δk)^t = (0,0,1)^t
Index space J^3 = { (i,j,k)^t : 1 ≤ i,j,k ≤ 3 } (M = N = K = 3)
Inter-Iteration Data Dependence graph (DG)
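
The nested loop and its dependence vectors can be checked with a short script. This is a minimal sketch: `matmul` transcribes the three loops, and `reuse_vectors` (a hypothetical helper, not from the slides) lists the index-space steps between consecutive iterations that touch the same variable.

```python
def matmul(a, b):
    """Three-level nested loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    M, K, N = len(a), len(b), len(b[0])
    c = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                c[i][j] += a[i][k] * b[k][j]
    return c


def index_space(M, N, K):
    """All index points (i, j, k) with 1-based bounds, in loop (lexicographic) order."""
    return [(i, j, k)
            for i in range(1, M + 1)
            for j in range(1, N + 1)
            for k in range(1, K + 1)]


def reuse_vectors(points, key):
    """Steps between consecutive index points that use the same variable instance."""
    users = {}
    for p in points:
        users.setdefault(key(p), []).append(p)
    steps = set()
    for plist in users.values():
        for p, q in zip(plist, plist[1:]):
            steps.add(tuple(y - x for x, y in zip(p, q)))
    return steps


points = index_space(3, 3, 3)
print(len(points))                                     # size of the index space
print(reuse_vectors(points, lambda p: (p[0], p[2])))   # a(i,k) is shared along j
print(reuse_vectors(points, lambda p: (p[2], p[1])))   # b(k,j) is shared along i
print(reuse_vectors(points, lambda p: (p[0], p[1])))   # c(i,j) accumulates along k
```

For M = N = K = 3 the index space has 27 points, and each variable has a single, constant dependence vector, which is what makes this dependence graph uniform.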

10
Systolic Mapping (space-time) of Matrix
Multiplication
11
Systolic Mapping of Matrix Multiplication, cont.
12
Why Is Space-Time Mapping Suitable for FPGAs?

It bridges nested Do loop signal/image
processing algorithms and their processor array
implementation.
The space-time array matches the modular and
regular FPGA structure.
The localized/pipelined interprocessor links can
overcome the long programmable interconnect
delay.
The size of configuration storage can be
significantly reduced because of the almost
identical processing elements and interconnect
structure.
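
A small script can make the points above concrete for the matrix multiplication example. The slides do not state which mapping is used, so the choices here are assumptions: the index space is projected along k (so index point (i,j,k) runs on processing element (i,j)) and the schedule is t = i + j + k. All names are illustrative.

```python
from itertools import product

# Dependence vectors of the matrix multiplication loop nest.
DEPENDENCES = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}
SCHEDULE = (1, 1, 1)  # assumed schedule vector: t = i + j + k


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def space_time_map(point):
    """Map an index point to (processing element, time step).

    Assumed projection along k: PE coordinates are (i, j).
    """
    i, j, _ = point
    return (i, j), dot(SCHEDULE, point)


def link_displacement(d):
    """PE-space displacement of a dependence under the projection along k."""
    return (d[0], d[1])


def analyze(M, N, K):
    # Causality: every dependence must advance time by at least one step.
    causal = all(dot(SCHEDULE, d) > 0 for d in DEPENDENCES.values())
    slots = {space_time_map(p)
             for p in product(range(1, M + 1), range(1, N + 1), range(1, K + 1))}
    times = {t for _, t in slots}
    return {
        "causal": causal,
        # No two index points may share a PE in the same time step.
        "conflict_free": len(slots) == M * N * K,
        "num_pes": len({pe for pe, _ in slots}),
        "num_steps": max(times) - min(times) + 1,
    }


print(analyze(3, 3, 3))
print({name: link_displacement(d) for name, d in DEPENDENCES.items()})
```

Under these assumptions every dependence travels at most one PE per time step (a moves one PE along j, b one PE along i, c stays in place), which is the localized, pipelined interconnect the slide refers to, and all M × N processing elements are identical.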

13
Problems with Existing Design Methodologies/Tools

The dependence graphs of many other algorithms
are not uniform and must be predetermined by
human designers.
Existing methodologies
cannot handle these complex algorithms
use unrealistic cost functions (metrics)
No built-in features of FPGAs have been
incorporated.
Longer interconnect delay in deep submicron CMOS
technology
Much lower hardware utilization due to
programmable interconnect delay in FPGAs
There is another problem: speed

14
What is Intra-PE Pipelining?

Interconnect delay of FPGAs results in an even
longer clock period.
To enhance the overall throughput,
intra-iteration parallelism must be exploited.
A simple vector dot product array
It can be observed that the utilization of each
operator is increased.
Of course, the control mechanism is more complex.
Tech done example
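
The idea can be sketched with a cycle-by-cycle model of a vector dot product element. This is an illustrative model under stated assumptions, not the design from the slides: the PE is split into two stages (multiplier, then accumulator) with one pipeline register between them, register overhead is ignored, and the delay values used with `clock_period` are hypothetical.

```python
def pipelined_dot(a, b):
    """Two-stage multiply-accumulate pipeline, simulated one clock cycle at a time.

    Returns (dot product, number of clock cycles used).
    """
    prod_reg = 0          # pipeline register between multiplier and adder
    prod_valid = False
    acc = 0
    cycles = 0
    n = len(a)
    for t in range(n + 1):            # one extra cycle drains the pipeline
        # Stage 2: the adder consumes the product latched in the previous cycle.
        if prod_valid:
            acc += prod_reg
        # Stage 1: the multiplier works on the next operand pair in the same cycle.
        if t < n:
            prod_reg = a[t] * b[t]
            prod_valid = True
        else:
            prod_valid = False
        cycles += 1
    return acc, cycles


def clock_period(t_mul, t_add, pipelined):
    """Clock period limited by the slowest stage (pipelined) or by the whole path."""
    return max(t_mul, t_add) if pipelined else t_mul + t_add


print(pipelined_dot([1, 2, 3], [4, 5, 6]))
print(clock_period(15, 10, pipelined=False), clock_period(15, 10, pipelined=True))
```

The pipelined version needs one extra cycle of latency, but the multiplier and the adder are both busy in almost every cycle and the clock period is set by the slower stage rather than by their sum. The extra valid flag is a small instance of the more complex control the slide mentions.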

15
Examples of Nested Do Loop Algorithms

Motion estimation
One of the most time consuming operations (tasks)
in digital video compression
Stereo matching
used to build disparity map for 3D robot/computer
navigation
Matrix/Vector Multiplication
FFT, DCT, 2D/3D graphics, etc.
2D Linear Transform/Operations
2D FFT, 2D DCT, etc.

16
Tennis frame 0
17
Tennis frame 1
18
Motion Vectors of 8x8-Pixel Blocks
19
Reconstructed Frame 1 from Frame 0 and Motion
Vectors
20
Illustration of Full Search Block Matching Motion
Estimation (6-level nested Do loop)
Motion vector (m,n)
21
Example: A Simpler PE Microarchitecture

MAD(m,n) = MAD(m,n) + |x(hN+i, vN+j) - y(hN+i+m-p, vN+j+n-p)|
Xilinx Core Generator System
Critical path delay: 25 ns, based on Xilinx
Virtex data
1,500-2,000 equivalent gate count
Critical path (blue line) can be shortened
further by intra-PE pipelining
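
The six-level loop nest behind full search block matching can be sketched in software as follows. This is a reference sketch under assumptions the slides do not state: frames are lists of rows, candidates that fall outside the reference frame are skipped, ties keep the first candidate in scan order, and the accumulated value is the sum of absolute differences (dividing by N×N would not change which candidate wins). The function name `full_search` is hypothetical.

```python
def full_search(x, y, N, p):
    """Full search block matching.

    x: current frame, y: reference frame, N: block size, p: search range.
    Returns {(h, v): (m - p, n - p)}, the displacement with the smallest
    accumulated absolute difference for each N x N block of x.
    """
    rows, cols = len(x), len(x[0])
    vectors = {}
    for h in range(rows // N):                    # loop 1: block row
        for v in range(cols // N):                # loop 2: block column
            best, best_mv = None, (0, 0)
            for m in range(2 * p + 1):            # loop 3: vertical candidate
                for n in range(2 * p + 1):        # loop 4: horizontal candidate
                    mad, valid = 0, True
                    for i in range(N):            # loop 5: row inside the block
                        for j in range(N):        # loop 6: column inside the block
                            r = h * N + i + m - p
                            c = v * N + j + n - p
                            if not (0 <= r < rows and 0 <= c < cols):
                                valid = False     # assumed: skip out-of-frame candidates
                                break
                            mad += abs(x[h * N + i][v * N + j] - y[r][c])
                        if not valid:
                            break
                    if valid and (best is None or mad < best):
                        best, best_mv = mad, (m - p, n - p)
            vectors[(h, v)] = best_mv
    return vectors
```

The innermost statement is the MAD update shown on the slide: one subtraction, one absolute value, and one accumulation per pixel, which is why a small processing element is enough for each iteration.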

22
Significance of the Contributions

The MODG representation for nested Do loop
algorithms
The actual execution is not constrained to any
predetermined order.
It keeps track of every variable instance so that
there is no redundant memory access, which saves
I/O bandwidth and power consumption.
can be automated using memory .
Without the MODG,
motion estimation and many other nested Do
loop algorithms can be written as many
different DGs,
a human must be involved to formulate a DG,
the built-in ROM/RAM of the FPGA may not be
exploited, and

23
Significance of the Contributions, cont.

Space-time mapping for the MODG can be applied to
any SRAM-based FPGA (architecture constraints and
practical cost functions)
any coarse-grained architecture
Intra-PE pipelining
enhances/preserves the throughput rate in
low-power mode.

24
Conclusion

Users demand more communication/multimedia
processing capabilities on resource-limited
Internet appliances.
Reconfigurable SOC is the ultimate solution for
designing a challenging low-power/high-performance
platform.
Its success relies on the embedded high-density
FPGA core as reconfigurable (programmable)
accelerating hardware.
As technology (supply voltage) scales down, logic
(transistors) becomes virtually free while the
interconnect becomes the power-consuming
bottleneck.
Parallel execution of nested Do loop algorithms
by an array of localized processing elements at
a moderate clock frequency is a viable solution.
It can balance the three main issues: design
time, power consumption, and performance.

25
Future Trends

Memory (storage) organization should be
investigated, because multiple reads per clock
cycle are needed to sustain such high throughput.
The control mechanism of the entire array is one
of the aspects that will determine its success.
A given MODG may need to be partitioned so
that the resulting array fits the on-chip
reconfigurable FPGA core.
