Title: Image Processing With FPGAs
1Image Processing With FPGAs
- Zach Fuchs
- Sarit Patel
- EEL6935
- 14 April 2008
2FPGA-Based Configurable Systolic Architecture for
Window-Based Image Processing
- Authors
- César Torres-Huitzil
- Miguel Arias-Estrada
3Introduction
- Image processing is a fundamental step in modern
machine vision systems.
- Many complex algorithms use lower-level results
to pursue higher-level goals.
- e.g. edge detection to determine an object
- Real-time performance is usually required in
video applications.
4Difficulty Building Systems
- Most computer vision applications are
computationally intensive
- The sequential nature of conventional processors
slows down performance
- Differing computations within the processing
chain limit parallelization
- Real-time performance is required
5Sample Applications
- Robotics
- Multimedia
- Virtual reality
- Industrial inspection
- Medical engineering
- Autonomous navigation
6Goals of Paper
- Design a 2D systolic architecture for
window-based image processing
- Consider design issues
- Flexibility
- Silicon area
- Power consumption
- Performance
- Area
7Window-Based Image Processing
- Large number of repetitive neighbor operations
over image data
- Area of w x w pixels is extracted from the image
- Transformed according to a window mask and
mathematical functions
- Produces a single new output according to the
transform
8Window-Based Image Processing
[Figure: window-based image processing, with parts labeled 1, 2, 3]
9Window-Based Operators
- Same scalar function applied on a pixel-by-pixel
basis
- Scalar functions
- e.g. relational, arithmetic, logical, look-up
tables
- Reduction functions
- Reduce window of results from scalar function to
one output
- e.g. accumulation, maximum, absolute value
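The split into a scalar function and a reduction function can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's hardware: the names `window_operator`, `scalar_fn`, and `reduce_fn` are hypothetical, and border pixels are simply skipped.

```python
def window_operator(image, mask, scalar_fn, reduce_fn):
    """Apply a w x w window operator wherever the window fits in the image."""
    w = len(mask)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - w + 1):
        out_row = []
        for c in range(cols - w + 1):
            # Scalar function: applied pixel by pixel against the mask.
            partial = [
                scalar_fn(image[r + i][c + j], mask[i][j])
                for i in range(w)
                for j in range(w)
            ]
            # Reduction function: collapse the w*w partial results to one value.
            out_row.append(reduce_fn(partial))
        out.append(out_row)
    return out


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
ones = [[1] * 3 for _ in range(3)]

# Multiply as the scalar function, accumulation as the reduction.
box_sum = window_operator(image, ones, lambda p, m: p * m, sum)
# Masked select as the scalar function, maximum as the reduction.
local_max = window_operator(image, ones, lambda p, m: p if m else 0, max)
```

Swapping only the two function arguments changes the operator, which is the flexibility the slide attributes to window-based operators.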
10Computational Requirements
- Window-based operations are computationally
expensive tasks
- Focusing on convolution
- Convolution: the amount of overlap between f and
a reversed and translated version of g
- In general, complexity is O(w^2 x M x N)
- w x w window mask
- M x N image
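To give a feel for the O(w^2 x M x N) cost, the count of window-element operations can be worked out directly. The sketch below assumes one operation per mask element per output pixel and ignores border handling; the function name is hypothetical, and the 7 x 7 mask on a 512 x 512 image matches the configuration reported later in the slides.

```python
def window_op_count(w, m, n):
    """Number of mask-element operations for a w x w mask over an M x N image."""
    return w * w * m * n


# 7 x 7 mask over a 512 x 512 image.
count = window_op_count(7, 512, 512)
print(count)  # 12845056
```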
11Data Transfer Rate
- Must transfer data between image acquisition
module, memory, and processor
- Input Data Transfer Rate
- Output Data Transfer Rate
- b: number of bits per pixel
- f_F: processing rate in images per second
- Requires efficient use of communication bandwidth
and parallel processing
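The rate expressions themselves are not reproduced in the slide text. As a rough first-order estimate, and as an assumption rather than the paper's formula, one b-bit pixel must move per image position per frame, giving M x N x b x f_F bits per second in each direction. The 8 bits per pixel and 30 images per second below are illustrative values only.

```python
def transfer_rate_bits_per_s(m, n, bits_per_pixel, frames_per_s):
    """First-order estimate: every pixel of every frame crosses the link once."""
    return m * n * bits_per_pixel * frames_per_s


# Assumed example: 512 x 512 image, b = 8 bits, f_F = 30 images per second.
rate = transfer_rate_bits_per_s(512, 512, 8, 30)
print(rate)        # 62914560 bits per second
print(rate // 8)   # 7864320 bytes per second
```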
12Implementation Technology FPGA
- Provides massively parallel structures and high
density for logic and arithmetic
- Tasks are implemented spatially rather than
temporally
- Possible to control at bit level to build
specialized data paths
- Offers more raw computational power than
conventional processors
- Shorter design cycles than ASICs
- Well suited for implementing parallel
architectures.
13Memory Accesses
- Gap between processor speed and memory access
speed
- Memory access overhead is a critical issue
- Window-based operations are memory intensive:
a new pixel is required in each step
- High potential for parallelism since independent
operations are applied to large regions of image
arrays
14Memory Accesses
- Pixels might not be stored as neighboring
elements
- Parallelism is hidden
- Windows usually overlap with neighboring windows
- Must create vectors of data elements and process
them using parallel vectorization techniques.
15Overlapping Windows
- Three windows are shown; the shaded box indicates
overlapping data.
16Overlapping Windows
- Some pixels can be used in computation of all
three windows
- Reduce memory accesses for those pixels by a
factor of 3
- Large number of windows means less overlap
- Must compromise between data overlap and window
count
17Data Parallelism
- Can be combined with loop unrolling to reduce
memory accesses in sequential processing
- Process one window, then slide to the right and
process the next
- Unroll this loop so more windows are computed in
parallel
- Authors use vertical unrolling
- Horizontal unrolling can be applied equally
18Data Parallelism
- Number of pixels read per column is directly
dependent on number of rows processed in parallel
- Number of pixels read: w + NR - 1
- w: window mask length/width
- NR: rows processed in parallel
- Number of Memory Accesses (M x N Image)
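The w + NR - 1 figure follows from the overlap between vertically adjacent windows. The sketch below checks it by enumerating the image rows touched by NR stacked windows and compares it with reading each window separately. The function names are hypothetical, and the values w = 7 and NR = 4 are chosen only for illustration.

```python
def rows_touched(w, nr):
    """Distinct image rows covered by nr vertically adjacent w x w windows."""
    return {k + i for k in range(nr) for i in range(w)}


def pixels_read_per_column(w, nr):
    """Pixels read per column when overlapping rows are fetched only once."""
    return w + nr - 1


def naive_pixels_per_column(w, nr):
    """Pixels read per column when every window is fetched independently."""
    return w * nr


w, nr = 7, 4
shared = pixels_read_per_column(w, nr)   # 10
naive = naive_pixels_per_column(w, nr)   # 28
assert len(rows_touched(w, nr)) == shared
print(shared, naive)
```

Under these assumptions, processing four rows together cuts the reads per column from 28 to 10, which is the saving that vertical unrolling is after.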
19Data Parallelism
20Systolic Architecture
- Configurable Window Processor (CWP)
- Processing element in the systolic architecture
- Architecture reads data from input memory
- P: image pixel
- W: window mask coefficients
- Transmitted to array of processing elements for
computation
21Array of CWPs
- LDC: Local data collector
- Collects results of CWPs
- CWP
- Computes a window operator on the same column of
the input image
- D: Delay line / shift register
- Used for synchronization purposes
22Architecture Flow
- Pixel is broadcast to all CWPs
- At each clock cycle
- Each CWP receives a different window coefficient
- New image pixel for all processing elements
- Each CWP multiplies and accumulates values until
all pixels in a window are processed
- After a short latency, the LDC collects the
data and sends it to output memory
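The flow above can be mimicked with a small behavioral model. It is a sketch under stated assumptions, not a cycle-accurate model of the authors' design: the name `cwp_column_pass` is hypothetical, the delay line is modeled simply as CWP k seeing the coefficient stream k cycles late, and the mask is applied without flipping (correlation form).

```python
def cwp_column_pass(pixels, mask, nr):
    """Behavioral sketch of nr CWPs computing nr vertically adjacent windows.

    pixels: (w + nr - 1) rows by w columns of image data.
    Returns one accumulated result per CWP.
    """
    w = len(mask)
    acc = [0] * nr  # one accumulator per CWP
    for c in range(w):                # window columns are streamed in turn
        for r in range(w + nr - 1):   # one pixel is broadcast per clock cycle
            p = pixels[r][c]
            for k in range(nr):       # all CWPs see the same pixel
                i = r - k             # coefficient row, delayed k cycles for CWP k
                if 0 <= i < w:
                    acc[k] += p * mask[i][c]  # multiply and accumulate
    return acc


pixels = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [1, 0, 2],
]
mask = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]
results = cwp_column_pass(pixels, mask, nr=2)

# Reference: compute each window directly.
direct = [
    sum(pixels[k + i][j] * mask[i][j] for i in range(3) for j in range(3))
    for k in range(2)
]
print(results, direct)
```

Each pixel is fetched once and reused by every CWP whose window covers it, which is how the broadcast scheme keeps the read count at w + NR - 1 per column.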
23CWP
- AP: Arithmetic Processor (ALU)
- Multiplies
- LRM: Local Reduction Module
- Accumulator
- Pc: Result of window operation
- Wd: Delayed window coefficient
24Systolic Architecture
25Processing Time
- Latency
- Time required to start pipeline operation
- Measured from activation of the first CWP to
activation of the last CWP
- Parallel processing time
- Time when all CWPs are working in parallel
- Sum of all times needed to process each set of
rows
- Performance is a compromise with the number of
rows processed
- Directly reflects silicon resources allocated to
the architecture
26Throughput
- Number of elemental operations the system can
perform per second
- Only the scalar function and local reduction
function are considered
27Implementation
- Fully parameterizable VHDL description
- Use generics to make design flexible
- Structural description uses only elementary logic
operations
- Design is platform, version, technology, and
tool independent
- Used an XCV2000E-6 Virtex-E FPGA with 2 million
gates
28FPGA Technical Data
29Performance Results
- I/O time not considered in results
- 512x512 Image w/ 7x7 Window Mask
30Performance Results
- Image processing time for 7x7 window mask is
8.35 ms
- Leaves enough time for image acquisition
- 30 ms required for real-time constraints
- Post-processing also possible
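The reported figures can be turned into a quick budget check. The sketch below assumes two elemental operations per window element (one scalar function, one reduction step) and one window per pixel of the 512 x 512 image. Those counting conventions are assumptions made for illustration, so the resulting throughput is an estimate, not the paper's published number.

```python
FRAME_TIME_MS = 8.35      # reported processing time, 7 x 7 mask
REAL_TIME_MS = 30.0       # real-time budget per frame from the slide
W, M, N = 7, 512, 512
OPS_PER_ELEMENT = 2       # assumed: one multiply plus one accumulate

slack_ms = REAL_TIME_MS - FRAME_TIME_MS    # time left for acquisition, post-processing
max_fps = 1000.0 / FRAME_TIME_MS           # frame rate if processing were the only cost
total_ops = OPS_PER_ELEMENT * W * W * M * N
throughput_ops_per_s = total_ops / (FRAME_TIME_MS / 1000.0)

print(f"slack: {slack_ms:.2f} ms, max rate: {max_fps:.1f} frames/s")
print(f"estimated throughput: {throughput_ops_per_s / 1e9:.2f} GOPS")
```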
31Performance Results
- Throughput increases with number of processing
elements
- Utilization and activity efficiency of processing
elements decrease
32Improving Performance
- Optimize design mapped on the FPGA
- Apply timing constraints for increased speed
- Use a better FPGA
- Note that the performance requirement for
real-time operation is still met with a lower-end
FPGA
33Comparisons to Other Architectures
34Area/Performance Tradeoffs
- Low resource utilization allows implementation in
compact mobile apps
- High computational density due to small area
usage
- Can reduce hardware or clock frequency
- Reduces power
- Still meets timing requirements
35Reconfigurability
- Flexible enough to support different window-based
image operators
- Allows different image-based applications on a
SoC
36Conclusion
- Easy to exploit SIMD for parallelism in image
processing
- FPGAs allow reconfigurability and flexibility
- Real-time constraints can be met with high
performance and low area usage
- All Images and Graphs from
- Torres-Huitzil, César, and Miguel Arias-Estrada.
"FPGA-Based Configurable Systolic Architecture
for Window-Based Image Processing." EURASIP
Journal on Applied Signal Processing 7 (2005):
1024-1034.
37Hardware, Design and Implementation Issues on a
FPGA-Based Smart Camera
- Fabio Dias, Francois Berry, Jocelyn Serot,
Francois Marmoiton
38Summary of Paper
- Describe the hardware architecture of an
FPGA-based Smart Camera research platform and
some of the hardware design issues.
- Propose an architectural design methodology based
on pre-programmed processing elements.
- Provide a low-level image processing example.
- Present an embedded tracking application to show
the camera's utilization.
39What is a Smart Camera?
- Smart cameras utilize embedded processing to
relieve some of the low-level computational
burden of the interfacing system.
- Reduce communication flow and overhead.
- Processing resources consist of FPGA devices,
media/streaming processors, DSPs, etc.
40Why FPGA devices?
- Reconfigurability
- Allows the camera to adapt to a wide range of
applications.
- Parallelism
- Take advantage of the independence of many
computational tasks in order to meet timing
constraints.
- Hardware Flexibility
- Capable of interfacing with a wide range of
external devices such as memory or ASICs.
41Smart Camera Hardware Architecture
- ALTERA Stratix EP1S60F1020C7
- 4-Mpixel LUPA-400 image sensor
- (2) 2D accelerometers
- (3) gyroscopes
- 10Mb SRAM
- 64Mb SDRAM
42Smart Camera Hardware Architecture
43Design Methodology
- Centered on reconfiguration of the FPGA.
- Set of pre-designed configurable data processing
elements (PEs).
- Programmable Control Module
- System supervisor, communicating with the PEs
through registers and handshake signals
- Configures and synchronizes different PEs
44Design Methodology
Schematic of a SoPC architecture illustrating the
proposed methodological approach.
45Generic Window-Based Processing Element
- Applied over a small, defined portion of the
input image.
- Deal with large amounts of data because they are
often applied over the entire image.
- Examples
- Convolution
- Correlation estimation
- Morphological transformations
46Generic Window-Based Processing Element
47Smart Camera Application
- Template Tracking System
- VGA images sent to host computer to be displayed.
- The user selects a frame of interest for tracking.
- A search window is acquired and stored into
memory.
- A sliding window SAD algorithm is applied.
- The portion with the best correlation score is
considered the new template location.
- A null acceleration model is employed in order to
predict displacement in the next frame.
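The tracking steps above can be sketched in software as follows. This is a plain Python illustration, not the camera's FPGA implementation: the function names are hypothetical, the best score is taken to be the lowest sum of absolute differences (SAD), and the null acceleration model is read as constant velocity between frames.

```python
def best_sad_match(search, template):
    """Slide the template over the search window; return (score, row, col)."""
    th, tw = len(template), len(template[0])
    best = None
    for r in range(len(search) - th + 1):
        for c in range(len(search[0]) - tw + 1):
            # Sum of absolute differences for the window at (r, c).
            score = sum(
                abs(search[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            if best is None or score < best[0]:
                best = (score, r, c)
    return best


def predict_next(prev_pos, curr_pos):
    """Null acceleration: assume the last displacement repeats."""
    return (2 * curr_pos[0] - prev_pos[0], 2 * curr_pos[1] - prev_pos[1])


search = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 8],
    [7, 6],
]

score, row, col = best_sad_match(search, template)
print(score, row, col)                    # 0 1 2
print(predict_next((1, 1), (row, col)))   # (1, 3)
```

The predicted position would be used to center the search window in the next frame, keeping the search area small.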
48Smart Camera Application
Embedded tracking implemented architecture
49Experimental Results
50Conclusion
- Generic window-based processing element
successfully implemented in an FPGA.
- An image tracking algorithm utilizing the
described design methodology successfully
implemented with adequate performance.
- A flexible FPGA-based smart camera research
platform created for future research.
- All Images and Graphs from
- Dias, Fabio, Francois Berry, Jocelyn Serot, and
Francois Marmoiton. "Hardware, Design and
Implementation Issues on a FPGA-Based Smart
Camera." IEEE 1-4244-1354-0/07 (2007): 20-26.