Title: DSP Algorithms on FPGA, Part II: Digital Image Processing
1. DSP Algorithms on FPGA, Part II: Digital Image Processing
2. Content
- Overview of image processing and FPGA
- Algorithm to FPGA Mapping Flow
- Nested Loop Algorithms and MODG
- Example: Motion Estimation
- Conclusion and Future Trends
3. Video signal in different formats
- PAL: 720×576 pixels, 25 f/s, 10.4 Mp/s
- NTSC: 720×480 pixels, 29.97 f/s, 10.4 Mp/s
- HDTV: 1920×1080 pixels, 30.0 f/s, 62.2 Mp/s
- Common delivery form
- Analog (cable)
- USB
- Firewire
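The pixel rates listed for each format follow from resolution times frame rate. A minimal Python check is sketched here; the names `FORMATS` and `pixel_rate_mpps` are illustrative and not taken from the slides.

```python
# Pixel rate = width * height * frames per second, reported in megapixels/s.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}

def pixel_rate_mpps(width, height, fps):
    """Return the pixel rate in megapixels per second (1 Mp = 1e6 pixels)."""
    return width * height * fps / 1e6

for name, (w, h, fps) in FORMATS.items():
    print(f"{name}: {pixel_rate_mpps(w, h, fps):.1f} Mp/s")
```

This is the sustained input rate that a processing array has to keep up with, before any per-pixel work is counted.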
4. Image Processing Characteristics
- Need to maximize the available logic by supporting N-D multiple configurable devices
- For example
- Image
  1 2 1
  2 4 2
  1 2 1
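The 3×3 mask above resembles a common smoothing kernel. The sketch below assumes it is applied to each pixel neighbourhood and normalised by the sum of its weights (16). The normalisation, the border handling and the name `smooth` are assumptions, since the slide gives only the mask.

```python
# Assumption: the 3x3 mask from the slide is used as a smoothing kernel,
# normalised by the sum of its weights (16). Borders are left unfiltered.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

def smooth(image):
    """Apply the 3x3 kernel to every interior pixel of a 2-D list of ints."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]          # copy; border pixels unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += KERNEL[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc // 16            # integer divide by the weight sum
    return out

flat = [[8] * 4 for _ in range(4)]
assert smooth(flat) == flat                  # a constant image is unchanged
```

All weights are powers of two, so a hardware version could plausibly use shifts and adds instead of multipliers.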
5. Challenges
- How to???
- Appropriate partitioning of algorithms between hardware and software
- Exploiting spatial and temporal parallelism
- Integrating the configurable computer into the software framework
- Selecting a suitable configuration strategy
- How shall we deal with these challenges?
6. Why SRAM-Based FPGAs? (Pros)
- Higher logic/storage capacity
- Fast carry chain for adders/subtractors
- Built-in XOR gates/LUT
- Array of bit-parallel multipliers
- Fast and local storage array of SRAM blocks
- Interconnect supports three-state buffers/LUT
- Equivalent to fine-grained reconfigurable hardware
- Finer-grained pipelining can help preserve the performance at low power supply voltage
- More mature CMOS manufacturing technology
7. Algorithm to FPGA Mapping Flow
8. The Matrix Multiplication MODG
A number of different execution orders can be
carried out to achieve the same algorithm.
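The statement about execution orders can be checked on matrix multiplication itself. The sketch below runs the three-level loop nest in all six loop orders and confirms that each produces the same product. The function name `matmul_in_order` is illustrative, and the sketch does not model the MODG.

```python
from itertools import permutations, product

def matmul_in_order(a, b, order):
    """Multiply a (MxK) by b (KxN), visiting the loop nest in `order`,
    a permutation of the loop names ('i', 'j', 'k'); first name is outermost."""
    M, K, N = len(a), len(b), len(b[0])
    bounds = {"i": M, "j": N, "k": K}
    c = [[0] * N for _ in range(M)]
    for idx in product(*(range(bounds[name]) for name in order)):
        point = dict(zip(order, idx))
        i, j, k = point["i"], point["j"], point["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c

a = [[1, 2], [3, 4], [5, 6]]        # 3x2
b = [[7, 8, 9], [10, 11, 12]]       # 2x3
results = [matmul_in_order(a, b, order) for order in permutations("ijk")]
assert all(r == results[0] for r in results)
print(results[0])
```

With integer data every order gives identical results. With floating-point data the accumulation order can change the rounding slightly.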
9. Nested Do Loop Algorithms and Inter-Iteration Dependence Graph
- Do i = 1 to M
- Do j = 1 to N
- $c_{i,j} = 0$
- Do k = 1 to K
- $c_{i,j} = c_{i,j} + a_{i,k} b_{k,j}$
- EndDo k
- EndDo j
- EndDo i
- Dependence vectors, in $(i,j,k)^t$ coordinates
- $d_a = (0,1,0)^t$
- $d_b = (1,0,0)^t$
- $d_c = (0,0,1)^t$
- Index space: $J^3 = \{(i,j,k)^t \mid 1 \le i,j,k \le 3\}$ $(M = N = K = 3)$
- Inter-Iteration Data Dependence Graph (DG)
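The index space and the dependence vectors are enough to enumerate the dependence graph. The sketch below builds the nodes and edges for M = N = K = 3. The edge representation (source, target, variable) is a choice made here for illustration.

```python
from itertools import product

M = N = K = 3
# Dependence vectors from the slide, in (i, j, k) coordinates.
DEPS = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}

# Index space J^3: every iteration (i, j, k) of the loop nest.
index_space = set(product(range(1, M + 1), range(1, N + 1), range(1, K + 1)))

# One DG edge per (iteration, dependence vector) whose source is in J^3.
edges = []
for point in index_space:
    for var, d in DEPS.items():
        source = tuple(p - q for p, q in zip(point, d))
        if source in index_space:
            edges.append((source, point, var))

print(len(index_space), "nodes,", len(edges), "edges")
```

Every edge connects two neighbouring index points, which is the locality that a processor array can exploit.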
10. Systolic Mapping (space-time) of Matrix Multiplication
11. Systolic Mapping of Matrix Multiplication, cont.
12. Why Is Space-Time Mapping Suitable for FPGAs?
- It bridges nested Do loop signal/image processing algorithms and their processor array implementation.
- The space-time array matches the modular and regular FPGA structure.
- The localized/pipelined interprocessor links can overcome the long programmable interconnect delay.
- The size of configuration storage can be significantly reduced because of the almost identical processing elements and interconnect structure.
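A space-time mapping assigns each iteration of the loop nest to a processing element and a time step. The sketch below assumes one particular mapping for matrix multiplication: iteration (i, j, k) runs on PE (i, j) at time i + j + k. Both the projection and the schedule vector are assumptions chosen for illustration and may differ from the mapping shown in the slides' figures.

```python
from itertools import product

DEPS = [(0, 1, 0), (1, 0, 0), (0, 0, 1)]   # d_a, d_b, d_c from the loop nest
SCHEDULE = (1, 1, 1)                        # assumed schedule vector s

def pe_of(i, j, k):
    return (i, j)                           # assumed projection along k

def time_of(i, j, k):
    return sum(s * x for s, x in zip(SCHEDULE, (i, j, k)))

# Causality: every dependence must advance time by at least one step.
assert all(sum(s * d for s, d in zip(SCHEDULE, dep)) >= 1 for dep in DEPS)

def systolic_matmul(a, b):
    M, K, N = len(a), len(b), len(b[0])
    events = sorted(product(range(M), range(N), range(K)),
                    key=lambda p: time_of(*p))
    c = [[0] * N for _ in range(M)]
    busy = set()                            # (PE, time) slots already used
    for i, j, k in events:
        slot = (pe_of(i, j, k), time_of(i, j, k))
        assert slot not in busy             # one iteration per PE per cycle
        busy.add(slot)
        c[i][j] += a[i][k] * b[k][j]
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert systolic_matmul(a, b) == [[19, 22], [43, 50]]
```

Because each dependence vector moves data by one PE or keeps it in place, every link in the resulting array is local, which is the property the bullets above rely on.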
13. Problems with Existing Design Methodologies/Tools
- The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers.
- Existing methodologies cannot handle these complex algorithms and use unrealistic cost functions (metrics).
- No built-in features of FPGAs have been incorporated.
- Longer interconnect delay in deep submicron CMOS technology
- Much lower hardware utilization due to programmable interconnect delay in FPGAs
- There is another problem: speed
14. What is Intra-PE Pipelining?
- Interconnect delay of FPGAs results in an even longer clock period.
- To enhance the overall throughput, intra-iteration parallelism must be exploited.
- A simple vector dot product array
- It can be observed that the utilization of each operator is increased.
- Of course, the control mechanism is more complex.
- Tech done example
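The effect of pipelining inside a PE can be sketched with a cycle-level model of a multiply-accumulate unit for a vector dot product. The two-stage split (multiplier, then adder) and the names below are assumptions for illustration, not the microarchitecture from the slides.

```python
def dot_product_pipelined(xs, ys):
    """Cycle-level model of a multiply-accumulate PE with one pipeline
    register between the multiplier and the adder (assumed 2 stages)."""
    product_reg = None      # pipeline register after the multiplier
    acc = 0                 # accumulator register after the adder
    cycles = 0
    inputs = list(zip(xs, ys))
    pos = 0
    while pos < len(inputs) or product_reg is not None:
        # Stage 2 consumes the product latched in the previous cycle.
        if product_reg is not None:
            acc += product_reg
        # Stage 1 multiplies the next operand pair in the same cycle.
        if pos < len(inputs):
            x, y = inputs[pos]
            product_reg = x * y
            pos += 1
        else:
            product_reg = None
        cycles += 1
    return acc, cycles

def clock_period(t_mul, t_add, pipelined):
    """Critical path: the slower stage if pipelined, else both in series."""
    return max(t_mul, t_add) if pipelined else t_mul + t_add

acc, cycles = dot_product_pipelined([1, 2, 3, 4], [5, 6, 7, 8])
assert acc == 70 and cycles == 5
```

The pipelined unit needs one extra cycle of latency, but its clock period is set by the slower of the two stages instead of their sum, and both operators are busy in every cycle once the pipeline is full.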
15. Examples of Nested Do Loop Algorithms
- Motion estimation: one of the most time-consuming operations (tasks) in digital video compression
- Stereo matching: used to build a disparity map for 3D robot/computer navigation
- Matrix/vector multiplication: FFT, DCT, 2D/3D graphics, etc.
- 2D linear transforms/operations: 2D FFT, 2D DCT, etc.
16. Tennis frame 0
17. Tennis frame 1
18. Motion Vectors of 8x8-Pixel Blocks
19. Reconstructed Frame 1 from Frame 0 and Motion Vectors
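Reconstructing a frame from a reference frame and per-block motion vectors amounts to copying displaced blocks. The sketch below assumes frames stored as 2-D lists, one (dy, dx) vector per block, and displaced blocks that stay inside the reference frame. The function name and data layout are illustrative.

```python
def reconstruct(ref, vectors, block=8):
    """Predict a frame by copying, for every block, the block of `ref`
    displaced by that block's motion vector (dy, dx).
    Assumes frame dimensions are multiples of `block` and that every
    displaced block lies inside `ref` (no bounds checking is done)."""
    rows, cols = len(ref), len(ref[0])
    out = [[0] * cols for _ in range(rows)]
    for bv in range(rows // block):
        for bh in range(cols // block):
            dy, dx = vectors[bv][bh]
            for i in range(block):
                for j in range(block):
                    r, c = bv * block + i, bh * block + j
                    out[r][c] = ref[r + dy][c + dx]
    return out

# Tiny usage example with 2x2 blocks on a 4x4 frame.
ref = [[r * 4 + c for c in range(4)] for r in range(4)]
vectors = [[(0, 0), (0, -1)],
           [(0, 0), (0, 0)]]
pred = reconstruct(ref, vectors, block=2)
assert pred[0][2] == ref[0][1]      # top-right block is shifted by one column
```

A video coder would then transmit only the vectors and the residual between this prediction and the actual frame.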
20. Illustration of Full Search Block Matching Motion Estimation (6-level nested Do loop)
Motion vector (m,n)
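The six-level loop nest named in the slide title can be sketched as follows. The index names h, v, m, n, i, j, the block size N and the search range p follow common full-search notation. The border policy (skipping candidates outside the reference frame) and the returned data structure are assumptions made here.

```python
def full_search(cur, ref, N=8, p=4):
    """Full-search block matching. For each NxN block (h, v) of `cur`, try
    every displacement (m - p, n - p), 0 <= m, n <= 2p, in `ref` and keep the
    one with the smallest accumulated absolute difference.
    Returns {(h, v): (m - p, n - p)}. Candidates that fall outside `ref`
    are skipped (a border policy assumed here, not taken from the slides)."""
    size0, size1 = len(cur), len(cur[0])
    vectors = {}
    for h in range(size0 // N):                     # loop 1: block, 1st axis
        for v in range(size1 // N):                 # loop 2: block, 2nd axis
            best = None
            for m in range(2 * p + 1):              # loop 3: offset, 1st axis
                for n in range(2 * p + 1):          # loop 4: offset, 2nd axis
                    r0, c0 = h * N + m - p, v * N + n - p
                    if not (0 <= r0 <= size0 - N and 0 <= c0 <= size1 - N):
                        continue
                    mad = 0                         # un-normalised MAD(m, n)
                    for i in range(N):              # loop 5: pixel, 1st axis
                        for j in range(N):          # loop 6: pixel, 2nd axis
                            mad += abs(cur[h * N + i][v * N + j]
                                       - ref[r0 + i][c0 + j])
                    if best is None or mad < best[0]:
                        best = (mad, (m - p, n - p))
            vectors[(h, v)] = best[1]
    return vectors
```

The accumulated sum is not divided by the block area, because a constant factor does not change which displacement is the minimum. Each block costs up to (2p + 1)² × N² absolute differences, which is why this loop nest is a natural candidate for a processor array.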
21. Example: A Simpler PE Microarchitecture
- $\mathrm{MAD}(m,n) = \mathrm{MAD}(m,n) + |x(hN+i,\, vN+j) - y(hN+i+m-p,\, vN+j+n-p)|$
- Xilinx Core Generator System
- Critical path delay: 25 ns, based on Xilinx Virtex data
- 1,500-2,000 equivalent gate count
- Critical path (blue line) can be shortened further by intra-PE pipelining
22. Significance of the Contributions
- The MODG representation for nested Do loop algorithms
- The actual execution is not constrained to any predetermined order.
- keeps track of every variable instance so that there is no redundant memory access, saving I/O, bandwidth and power consumption.
- can be automated using memory .
- Without the MODG,
- the motion estimation and many other nested Do loop algorithms can be written as many different DGs,
- a human must be involved to formulate a DG,
- the built-in ROM/RAM of the FPGA may not be exploited.
23. Significance of the Contributions, cont.
- Space-time mapping for the MODG can be applied to
- any SRAM-based FPGA
- Architecture constraints and practical cost functions
- any coarse-grained architecture
- Intra-PE pipelining
- enhances/preserves the throughput rate in low-power mode.
24. Conclusion
- Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances.
- Reconfigurable SOC is the ultimate solution for designing the challenging low-power/high-performance platform.
- Its success lies in the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware.
- As technology (supply voltage) scales down, logic (transistors) is virtually free, while the interconnect becomes the bottleneck and is power-consuming.
- Parallel execution of nested Do loop algorithms by an array of localized processing elements at a moderate clock frequency is a viable solution.
- It can strike a compromise among the three main issues: design time, power consumption, and performance.
25. Future Trends
- Memory (storage) organization should be investigated, because multiple reads per clock cycle are needed to sustain such high throughput.
- The control mechanism of the entire array is one of the aspects that will determine its success.
- A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.