Title: Implementation of High-Rate JPEG2000 Coding on a Virtex-2 Pro Reconfigurable Computing Board
1 Implementation of High-Rate JPEG2000 Coding on a Virtex-2 Pro Reconfigurable Computing Board
- Presented by Damon Van Buren
- SEAKR Engineering
- MAPLD 2004
- Submission 133
2 The Sensor Bandwidth Problem
- Commercial satellite imaging systems are experiencing growth in imaging capability:
- Higher resolution: < 1 m
- Larger images: > 10k image width and height
- More spectral components
- Panchromatic
- Red/Green/Blue
- Multi-spectral
- Improved capabilities are leading to high sensor data rates.
- Data output rates > 2 Gbps for some systems
- Providing storage and downlink bandwidth for the data is becoming a significant challenge for system designers.
- The largest data recorders can store less than 20 minutes of data at 2 Gbps.
- Downlinks must be several hundred Mbps to downlink 15 minutes of data in under an hour.
- Data storage and high-bandwidth downlinks require lots of power.
- By reducing the amount of image data, compression provides a solution to the bandwidth problem!
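The storage and downlink figures above can be checked with a short calculation. This is a minimal sketch using only the numbers quoted on the slide; the function names are illustrative, and 1 Gbit is taken as 1000 Mbit.

```python
# Back-of-envelope check of the storage and downlink figures quoted above.
# Function names are illustrative; the inputs come from the slide.

def recorder_capacity_gbit(rate_gbps, minutes):
    """Data volume produced by a sensor running at rate_gbps for the given minutes."""
    return rate_gbps * minutes * 60


def required_downlink_mbps(rate_gbps, capture_minutes, downlink_minutes):
    """Downlink rate needed to return capture_minutes of data in downlink_minutes."""
    captured_gbit = rate_gbps * capture_minutes * 60
    return captured_gbit * 1000 / (downlink_minutes * 60)


capacity = recorder_capacity_gbit(2.0, 20)        # 20 minutes at 2 Gbps
downlink = required_downlink_mbps(2.0, 15, 60)    # 15 minutes of data in one hour

print(f"20 min at 2 Gbps = {capacity:.0f} Gbit ({capacity / 8:.0f} GB)")
print(f"Downlink needed  = {downlink:.0f} Mbps")
```

Twenty minutes at 2 Gbps is 2400 Gbit (300 GB), and returning 15 minutes of data within an hour needs 500 Mbps, consistent with "several hundred Mbps".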
3 Desired Compressor Features
- Real Time
- Compression must be performed in real time, prior to storage.
- High throughput (> 2 Gbps)
- Excellent Performance in Lossy and Lossless Modes
- Purchasers of satellite imagery are sensitive to reductions in image quality caused by lossy compression.
- Scientific users prefer undistorted data (bit true).
- Space-Qualified
- Must survive hazards of launch and space operation, including radiation.
- Low Risk
- Satellite imaging companies seek high reliability solutions.
- Low Cost
- Commercial customers require cost effective solutions.
- Flexible
- The ability to support varying compression ratios and contents would allow more effective use of available storage and bandwidth.
4 JPEG2000 Algorithm
- JPEG2000 is an excellent choice for satellite image compression.
- Latest still image compression standard from the JPEG committee
- Meets two key requirements for satellite image compression
- Excellent performance in both lossy and lossless modes.
- 1.7 to 1 lossless compression for typical satellite imagery: 70% improvement!
- Visually lossless compression > 2 to 1: 100% improvement in storage and downlink performance.
- Very flexible
- Many options for compressed images.
- Other advantages
- International Standard
- Wavelet based
- High quality lossy images with comp. ratios > 100:1
- Packet oriented
- Allows random access to the compressed code stream.
- Makes compressed data more robust in the presence of bit errors.
- Allows selection of image quality, spatial region, resolution, and color component after compression.
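The link between a compression ratio and the quoted improvement percentages is simple arithmetic. The sketch below assumes "improvement" means the extra image data that fits in a fixed storage or downlink budget; the function names are illustrative.

```python
# Relating compression ratio to the improvement percentages quoted above.
# Assumption: "improvement" = extra image data carried in the same budget.

def extra_data_percent(compression_ratio):
    """Percent more image data that fits in the same storage or downlink budget."""
    return (compression_ratio - 1.0) * 100.0


def compressed_fraction(compression_ratio):
    """Fraction of the original size remaining after compression."""
    return 1.0 / compression_ratio


for ratio in (1.7, 2.0, 100.0):
    print(f"{ratio:g}:1 -> {extra_data_percent(ratio):.0f}% more data, "
          f"{compressed_fraction(ratio) * 100:.1f}% of original size")
```

Under that reading, 1.7:1 gives 70% more data and 2:1 gives 100% more, matching the figures on the slide.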
5 JPEG2000 Implementation Challenges
- JPEG2000 is a very complex algorithm.
- More Features = More Complexity.
- Operation intensive
- Several hundred operations per pixel, because each bit must be processed many times, for the wavelet transform, entropy coding, MQ coding, packet generation, etc.
- Complex
- Many different stages to produce compressed output.
- Wavelet transform.
- Quantization.
- Context generation.
- Arithmetic coding.
- Packet generation.
- Many parameters must be tracked individually for each code block (64x64).
- Memory intensive
- Each pixel must be accessed many times, so many small buffers are needed to get good throughput.
- Few processors are capable of implementing JPEG2000 at high rates!
6 High-Performance Processing Using Xilinx FPGAs
- Xilinx FPGAs have many advantages for fast parallel processing:
- Millions of gates.
- System clocks of several hundred MHz.
- High speed I/O
- 622 Mbps LVDS
- Multi-Gigabit serial I/O
- Hundreds of internal block RAMs.
- Hundreds of internal 18-bit multipliers.
- Xilinx FPGAs are available in space-qualified versions.
- Radiation testing is complete on the Virtex and Virtex-II devices.
- 200 kRad total dose, latchup immune.
- Radiation testing to begin on the Virtex-II Pro devices soon.
- Xilinx FPGAs are very flexible, reducing risk.
- May be re-programmed an unlimited number of times.
- Configurations may be uploaded at any time during the mission to fix errors or add new capability.
- Xilinx FPGAs are the best solution for fast compression in space!
7 Challenges for Xilinx Use in Space
- The effects of radiation in spacecraft electronics are well known.
- Caused primarily by charged particles.
- May cause permanent damage over time by ionizing SiO2 (total dose).
- May also cause errors in digital logic by upsetting registers (single event effects).
- Mitigation techniques are used to reduce or eliminate the effect of radiation upsets.
- Triple Modular Redundancy (TMR) uses voting to select the correct output from 3 separate instances of the design.
- Mitigation of radiation effects in SRAM-based FPGAs presents an additional challenge.
- As with other digital electronics, the functional logic of the device is susceptible to upset, however...
- Another layer of logic (configuration logic) controls the routing of the part, giving the device its capability to be reprogrammed to perform different functions.
- Configuration logic is also susceptible to radiation upsets.
- Xilinx FPGAs require system level mitigation strategies in addition to the device level mitigation techniques (such as TMR) that are commonly used for space electronics.
- Configuration data must be continuously re-written, or scrubbed using a read-and-correct approach.
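The voting step of TMR can be illustrated in a few lines. This is a software sketch of a bitwise majority voter, not the hardware voter used on the board; the function name and the example values are illustrative.

```python
# Bitwise majority voter: each output bit takes the value held by at least
# two of the three redundant copies. Software sketch only.

def tmr_vote(a, b, c):
    """Return the bitwise majority of three redundant outputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000       # one copy suffers a single-bit upset

voted = tmr_vote(good, upset, good)
print(f"voted output = {voted:#010b}, matches good copy: {voted == good}")
```

A single upset copy is outvoted. If two copies are corrupted in the same bit position the voter passes the error through, which is one reason the slide pairs TMR with continuous scrubbing.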
8 SEAKR's RCC Board Processing Solutions
- SEAKR has developed a line of Reconfigurable Computing (RCC) products based on Xilinx FPGAs.
- RCC 1: 4x Virtex 1000s
- RCC 2: 4x Virtex II 6000s
- RCC 3 (NTRCC): 4x Virtex II Pro 70/100s
- Boards include system-level upset mitigation (scrub) for the Xilinx devices.
- Configuration data is continuously read and checked for errors.
- Errors are corrected by overwriting the corrupted frames, without interrupting the operation of the device.
- Other devices on board employ radiation mitigation strategies as well:
- Radiation hardened
- EDAC
- Boards also have dedicated resources to support high-performance processing:
- High speed I/O.
- External memories.
- Industry standard form-factor: 6U Compact PCI.
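The read-and-correct scrub described above can be sketched in software. This is a toy model under stated assumptions: the configuration memory is a list of byte strings ("frames"), a known-good copy is held elsewhere, and corruption is detected by comparing CRC32 values. The slides do not say how the boards detect errors, and `scrub_pass` is a hypothetical name.

```python
import zlib

# Toy model of a configuration scrub pass. The frame layout, the golden copy,
# and the CRC32 comparison are assumptions made for illustration only.

def scrub_pass(config_frames, golden_frames):
    """Check every frame and overwrite only the corrupted ones.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for i, golden in enumerate(golden_frames):
        if zlib.crc32(config_frames[i]) != zlib.crc32(golden):
            config_frames[i] = golden      # rewrite this frame only
            repaired.append(i)
    return repaired


golden = [bytes([i] * 8) for i in range(4)]    # known-good configuration
live = list(golden)                            # configuration in the device
live[2] = bytes([2] * 7 + [2 ^ 0x10])          # simulate a single-bit upset

fixed = scrub_pass(live, golden)
print(f"repaired frames: {fixed}, configuration restored: {live == golden}")
```

Only the damaged frame is rewritten, which mirrors the point that correction happens without interrupting the rest of the device.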
9 Network RCC (NTRCC)
- Four Xilinx XC2VP70-6FF1704 FPGA Co-Processors (COPs)
- Design compatible with XC2VP100-6FF1706 and V2P-X
- (4) banks of 1Mx36 Quad Data Rate (QDR) SRAMs for each COP
- 512MB of DDRII Shared SDRAM memory for prototype
- 1GB of 128M x 64 EDAC (R-S) Protected DDRII SDRAM shared memory (19.2 Gbps at 150 MHz) using 1 Gbit memory
- Network IF
- (2) parallel 16-bit RapidIO ports to front panel (8 Gbps)
- (1) 4x3.125 Gbps serial port to front panel (> 10 Gbps)
- 4x3.125 Gbps ports from NIC to each COP (> 10 Gbps)
- 4x3.125 Gbps ports from each COP to each neighbor COP (> 10 Gbps)
- Shared Data Buses
- COP Interconnect Bus (4.224 Gbps)
- cPCI 32-bit 33 MHz
- Read and write COP configurations via cPCI
- Extended 6U form factor
- Configuration RAM SEU detection and correction
- DDRII SDRAM on configuration controller for shadow config program storage
- Non-Volatile memory for 16 different configurations (1 Gbit Flash)
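Some of the bandwidth figures above can be reproduced from bus width and clock rate. The sketch below is an interpretation, not a statement of the board design: it assumes the DDRII figure counts two transfers per clock on 64 data bits, that the cPCI rate is the theoretical peak, and that the serial links use 8b/10b line coding. The helper name is illustrative.

```python
# Peak-bandwidth arithmetic for some of the interfaces listed above.
# Transfers per clock and 8b/10b coding are assumptions, not slide content.

def parallel_bus_gbps(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak rate of a parallel bus in Gbps."""
    return width_bits * clock_mhz * transfers_per_clock / 1000.0


ddr2_gbps = parallel_bus_gbps(64, 150, transfers_per_clock=2)
cpci_gbps = parallel_bus_gbps(32, 33)
serial_raw_gbps = 4 * 3.125                      # four lanes at 3.125 Gbps
serial_payload_gbps = serial_raw_gbps * 8 / 10   # after assumed 8b/10b coding

print(f"DDRII SDRAM, 64 bit at 150 MHz : {ddr2_gbps:.1f} Gbps")
print(f"cPCI, 32 bit at 33 MHz         : {cpci_gbps:.3f} Gbps")
print(f"4x3.125 Gbps serial, raw       : {serial_raw_gbps:.1f} Gbps")
print(f"4x3.125 Gbps serial, payload   : {serial_payload_gbps:.1f} Gbps")
```

With these assumptions the DDRII result matches the 19.2 Gbps quoted on the slide, and the serial payload rate comes out at 10 Gbps per four-lane port.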
10 Network RCC Block Diagram
11 NTRCC Layout
- 24 Layer board
- MicroVias, blind vias, via-in-pad
- High speed 3.125 Gbps Serial links
- 82 pages of schematic capture
- 10 weeks of PCB layout time
12 Implementation of the JPEG2000 Algorithm
- The JPEG2000 core has been in development for over a year.
- Eventual target data rate: 600 Mbps/device.
- Written in VHDL.
- Simulations performed in ModelSim.
- Synthesis in Synplify Pro.
- Targeted to the NTRCC-R, summer '04.
- Targeted to a reduced version of the NTRCC with a single coprocessor.
- Take advantage of improved external memory throughput.
- Ultimately use the high-speed serial I/O to move image information on the board.
- Designed for high throughput.
- Cycle efficient coding style.
- Highly parallel design.
- Pipelined architecture.
- Rolling wavelet transform.
- Designed for flexible output file format.
- Output is divided into quality layers for easy selection of compression ratio.
13 JPEG2000 Block Diagram
14 JPEG2000 Coding Steps
- Image is broken into tiles
- Tiles are wavelet transformed
- 5/3 reversible or 9/7 irreversible, also user defined.
- Selectable number of transform levels.
- Each subband from the transform is further broken up into code blocks (typically 32x32 or 64x64) for entropy coding.
- Each code block is entropy coded, starting from the top bit plane and working down.
- The current bit of each pixel is passed to an arithmetic coder, along with context information.
- The MQ encoder takes advantage of any skewing of the probability for each context, and adapts contexts as the coding progresses.
- Packets are formed by combining the entropy coder outputs from a single resolution.
- Tile parts are formed from all the packets in a given bit plane.
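To make the wavelet step concrete, here is a minimal sketch of one level of the reversible 5/3 transform in its integer lifting form, applied to a 1-D, even-length signal with symmetric extension at the edges. It is a plain software illustration, not the rolling hardware transform described in this talk; a 2-D transform would apply it along rows and then columns. The function names are illustrative.

```python
# One level of the reversible 5/3 wavelet in integer lifting form (1-D).
# Software illustration only; even-length input, symmetric edge extension.

def fwd_53(x):
    """Return (low, high) subbands for an even-length list of integers."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    # Predict step: detail (high-pass) samples from the odd positions.
    d = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        d.append(x[2 * i + 1] - ((x[2 * i] + right) >> 1))
    # Update step: smooth (low-pass) samples from the even positions.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        s.append(x[2 * i] + ((left + d[i] + 2) >> 2))
    return s, d


def inv_53(s, d):
    """Invert fwd_53 exactly by undoing the lifting steps in reverse order."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        x[2 * i] = s[i] - ((left + d[i] + 2) >> 2)
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + ((x[2 * i] + right) >> 1)
    return x


samples = [10, 12, 14, 200, 13, 9, 8, 7]
low, high = fwd_53(samples)
print("low :", low)
print("high:", high)
print("lossless round trip:", inv_53(low, high) == samples)
```

Because every step uses integer arithmetic and is undone exactly by the inverse, the round trip is bit-true, which is what makes the 5/3 filter suitable for the lossless mode.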
15 JPEG2000 Architecture Drivers
- To achieve high data rates, the processing must be parallelized as much as possible.
- The tall pole in the tent is the arithmetic coding, because the coding of a single data bit with its context can take several clock cycles.
- Significance propagation coding is also a challenge, because each coefficient must be accessed many times, as each bit plane is processed.
- Other operations, such as wavelet transform, code block loading, and packet generation are much more efficient, and require fewer parallel paths.
- A pipelined architecture with many entropy coders in parallel was used to achieve the required throughput.
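The repeated access to each coefficient follows from the bit-plane order of the entropy coder. The sketch below only walks the magnitude bit planes of a tiny block from the most significant plane down; it omits the coding passes, context formation, and the MQ coder, and the function name is illustrative.

```python
# Walking the magnitude bit planes of a block from the top plane down.
# Coding passes, context formation, and the MQ coder are omitted.

def bit_planes(coeffs):
    """Yield (plane_index, bits) from the top magnitude plane down to plane 0."""
    mags = [abs(c) for c in coeffs]
    top = max(mags).bit_length() - 1 if any(mags) else -1
    for plane in range(top, -1, -1):
        yield plane, [(m >> plane) & 1 for m in mags]


block = [5, -3, 0, 9]            # a tiny stand-in for a 64x64 code block
visits = 0
for plane, bits in bit_planes(block):
    visits += len(bits)
    print(f"plane {plane}: {bits}")
print(f"{visits} coefficient reads for {len(block)} coefficients")
```

Every coefficient is read once per bit plane even in this simplified walk. In JPEG2000 a bit plane is coded in up to three passes (significance propagation, magnitude refinement, cleanup), so the real access count is higher still.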
16 Architecture Description
- Processes 256x256 tiles.
- Pipelined architecture, using separate external memories for image, tile, and compressed data storage.
- 19 entropy coders working in parallel to improve throughput, one for each code block.
- 64x64 code blocks.
- FIFO buffering between the stages improves data flow efficiency.
- A rolling wavelet transform is used to reduce memory accesses and improve efficiency.
- Entropy coder outputs are formed into layers, giving each tile a progressive output format.
- Tile parts are interleaved as the image tiles are processed.
- Performs lossy or lossless compression.
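One way to arrive at 19 code blocks per tile is to count the blocks in each subband of a 256x256 tile. The slides do not state the number of transform levels; the sketch below assumes three levels and a single image component, which happens to reproduce the figure. The function name and defaults are illustrative.

```python
import math

# Counting code blocks in one tile. Three transform levels and a single
# component are assumptions; the slides give only tile and block sizes.

def count_code_blocks(tile=256, levels=3, block=64):
    """Count code blocks over all subbands of a square tile."""
    total = 0
    size = tile
    for _ in range(levels):
        size //= 2                                   # subband side at this level
        per_band = math.ceil(size / block) ** 2
        total += 3 * per_band                        # HL, LH and HH subbands
    total += math.ceil(size / block) ** 2            # final LL subband
    return total


for levels in (2, 3, 4):
    print(f"{levels} levels -> {count_code_blocks(levels=levels)} code blocks")
```

With three levels the count is 12 blocks at level 1, 3 at level 2, 3 at level 3, and 1 for the LL band. Subbands smaller than 64x64 still occupy one code block each.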
17 NTRCC-R Implementation Results
- The JPEG2000 encoder was targeted to the V2Pro 70 FPGA on the NTRCC-R.
- Lossless or lossy compression.
- Data precision up to 13 bits.
- Simulation and Routing Results
- Slices: 30043 out of 33088 (90%)
- Block RAMs: 148 out of 328 (45%)
- Max system clock 43 MHz without optimization.
- Hardware Throughput
- 140 Mbps w/ 33 MHz clock (depending on image).
- 180 Mbps w/ 43 MHz clock.
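The results above can be turned into utilization and bits-per-clock figures, and then scaled to the 66 MHz and 600 Mbps goals stated elsewhere in this talk. The scaling assumes throughput grows linearly with clock rate, which is a simplification; the names are illustrative.

```python
# Utilization and throughput arithmetic from the reported results.
# Linear scaling of throughput with clock rate is an assumption.

def percent(used, total):
    return 100.0 * used / total


def bits_per_clock(mbps, mhz):
    return mbps / mhz


slice_util = percent(30043, 33088)
bram_util = percent(148, 328)
bpc_33 = bits_per_clock(140, 33)
bpc_43 = bits_per_clock(180, 43)
projected_66 = bpc_43 * 66              # Mbps at 66 MHz, same bits per clock
needed_bpc = bits_per_clock(600, 66)    # bits per clock for the 600 Mbps target

print(f"slices {slice_util:.1f}%, block RAMs {bram_util:.1f}%")
print(f"{bpc_33:.2f} bits/clock at 33 MHz, {bpc_43:.2f} bits/clock at 43 MHz")
print(f"projected at 66 MHz: {projected_66:.0f} Mbps")
print(f"600 Mbps at 66 MHz needs {needed_bpc:.1f} bits/clock")
```

Under this assumption a faster clock alone does not reach the target: the design would also need to roughly double the bits processed per clock.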
18 JPEG2000 Floorplan
- The Pro 70 Device is quite full!
19 Planned Improvements
- Optimize design to hit 66 MHz.
- Un-optimized design will operate at up to 43 MHz.
- Use of asynchronous FIFOs will allow optimal clocking of various parts of the design.
- Improve pipelining of code block loader and wavelet transform.
- Allow autonomous operation of each stage, so that operations take place as soon as input data and output buffers are ready.
- Make use of additional QDR SRAMs available to each coprocessor by creating separate buffers for wavelet transform and packetizer output.
- NTRCC has 4 QDR memories for each coprocessor.
- Arithmetic coder bypass.
- Arithmetic coder requires > 2 cycles per bit coded, on average.
- 9/7 wavelet transform with quantization.
- Use of the 9/7 wavelet results in better SNR and max error performance for lossy compression.
- Add RapidIO serial interface to Network Interface Chip (NIC).
20 Conclusions
- The JPEG2000 core is expected to provide a valuable option for satellite imagery systems.
- Compression will result in a dramatic improvement in system performance.
- Lossless compression will allow 70% more image data to be stored and downlinked by a system.
- Lossy compression will allow even greater improvements.
- NTRCC hardware is an excellent platform for the compressor.
- High bandwidth interconnect and I/O (several Gbps).
- High bandwidth external memories.
- Excellent processing capability with the Virtex-II Pro devices.
- The sky's the limit!
- Target rate of 600 Mbps per device appears to be a realistic goal.
- Some improvements are left to be made to the clock rate and pipelining of the design.