Title: Scientific Data Compression Through Wavelet Transformation
1 Scientific Data Compression Through Wavelet Transformation
2 Problem Statement
- Scientific data is hard to compress with traditional lossless compressors such as gzip or bzip2, because the repeating patterns they rely on do not exist
- If the data can be transformed into a form that retains the information but can be thresholded, it can be compressed
- Thresholding removes excessively small values and replaces them with zeros (much easier to compress than a 64-bit double)
- Wavelet transforms are well suited for this purpose and have been used in image compression (JPEG 2000)
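As a rough illustration of the thresholding step, a minimal sketch in C (not the project's actual code; coeffs, n, and threshold are assumed names):

    #include <math.h>
    #include <stddef.h>

    /* Replace every coefficient whose magnitude is below the threshold with zero.
     * Long runs of zeros compress far better than arbitrary 64-bit doubles.
     * Returns how many coefficients were zeroed. */
    size_t threshold_coeffs(double *coeffs, size_t n, double threshold)
    {
        size_t zeroed = 0;
        for (size_t i = 0; i < n; i++) {
            if (fabs(coeffs[i]) < threshold) {
                coeffs[i] = 0.0;
                zeroed++;
            }
        }
        return zeroed;
    }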
3 Why Wavelets?
- Wavelet transforms encode more information than other techniques such as the Fourier transform: both time and frequency information is preserved
- In practical terms, the transformation is applied at many scales and positions within the signal
- This results in vectors that encode approximation and detail information
- By separating the signal this way, it is easier to threshold and remove information, so the data can be compressed
4 Wavelets in Compression
- The JPEG 2000 standard gave up the discrete cosine transform in favor of a wavelet transform
- The FBI uses wavelets to compress fingerprint scans by a factor of 15 to 20
5 Choosing the Right Wavelet: The Transform
- Continuous wavelet transform: the sum over the time of the signal convolved with scaled and shifted versions of the wavelet
- Unfortunately, it is slow and generates far too much data. It is also hard to implement
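For reference, the continuous wavelet transform of a signal f(t) at scale a and position b can be written as

    C(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt

and it has to be evaluated over a continuum of scales and shifts, which is where the cost and redundancy come from.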
(Figure from mathworks.com)
6 Choosing the Right Wavelet: The Transform
- The discrete transform: if the scales and positions are chosen as powers of two, the transform is much more efficient and just as accurate
- The signal is then sent through only two subband coders (which extract the approximation and the detail data from the signal)
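Concretely, the dyadic choice restricts the scales and positions to powers of two,

    a = 2^{j}, \qquad b = k \, 2^{j}, \qquad \psi_{j,k}(t) = 2^{-j/2} \, \psi\!\left(2^{-j} t - k\right),

so each level reduces to one low-pass and one high-pass filtering pass followed by downsampling by two.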
(Figure from mathworks.com: the signal is decomposed by low-pass and high-pass filters to obtain the approximation and detail information)
7 Choosing the Right Wavelet: The Decomposition
- The signal can be recursively decomposed to obtain finer detail and a more general approximation. This is called multi-level decomposition
- A signal can be decomposed as many times as it can be divided in half
- Thus, we are left with only one approximation signal at the end of the process
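Since each level halves the approximation, a signal of length N admits at most

    L_{\max} = \lfloor \log_{2} N \rfloor

levels, leaving one approximation vector plus one detail vector per level.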
(Figure from mathworks.com)
8 Choosing the Right Wavelet: The Wavelet
- The low- and high-pass filters (subband coders) are, in reality, the wavelet that is used. A wide variety of wavelets have been created over time
- The low pass is called the scaling function
- The high pass is the wavelet function
- Different wavelets give better results depending on the type of data
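As a side note, for orthogonal wavelet families (the biorthogonal case used later relaxes this), the high-pass filter is the quadrature mirror of the low-pass filter, up to a sign convention:

    g[n] = (-1)^{n} \, h[L - 1 - n],

where h is the scaling (low-pass) filter of length L, which is why the two subband coders really are two views of the same wavelet.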
(Figure from mathworks.com: the scaling function (approximation / low pass) and the wavelet function (detail / high pass))
9 Choosing the Right Wavelet: The Wavelet
- The wavelets that turned out to give the best results were the biorthogonal wavelets
- These were introduced by Cohen, Daubechies, and Feauveau; they exploit the fact that symmetry and exact reconstruction are incompatible (except for the Haar wavelet) if the same filter is used for decomposition and reconstruction
- Instead, a decomposition wavelet and a slightly different reconstruction wavelet are used
(Figures from mathworks.com: the filter coefficients used for convolution, and the actual wavelet and scaling functions)
10 Testing Methodology
- In order to find the best combination of wavelet, decomposition level, and threshold, an exhaustive search was done with Matlab
- A 1000x1000 grid of vorticity data from the Navier-Stokes simulator was first compressed with gzip. This was the baseline file to compare against
- Then each available wavelet in Matlab was tested with 1-, 3-, and 5-level decomposition, in combination with thresholding that removed values smaller than 1x10^-4 to 1x10^-7
- The resulting data was saved, compressed with gzip, and compared against the baseline
- Then the data was reconstructed and the maximum and average errors were measured
11 Testing Methodology: Sample Results
For the application, I chose three of the configurations, representing high compression / high error, medium compression / medium error, and low compression / low error.
12 Matlab Functions
- Four Matlab functions were written for compression and decompression
- wavecompress (1-D) and wavecompress2 (2-D)
- wavedecompress (1-D) and wavedecompress2 (2-D)
wavecompress2 - Lossy compression for 2-D data

    savings = WAVECOMPRESS2(x, mode, outputfile) compresses the 2-D data in x
    using the specified mode and saves it to outputfile. The return value is
    the compression ratio achieved.

    Valid values for mode:
      1 - high compression, high error     (bior3.1 filter, 1x10^-4 threshold)
      2 - medium compression, medium error (bior3.1 filter, 1x10^-5 threshold)
      3 - low compression, low error       (bior5.5 filter, 1x10^-7 threshold)

    To decompress the data, see wavedecompress2.
13 Some Pictures
14 C Implementation
- With the easy work out of the way, the next phase of the project was a C implementation. There were a few reasons for reinventing the wheel:
- I wanted to fully understand the process
- I could try my hand at some parallel processing
- I could have a native 3-D transformation
- And Matlab makes my computer very slow
15 Demo
- ./wavecomp -c 1 -d 2 vorticity000.dat
- ./wavedec vorticity000.dat.wcm
16 Algorithm
- The basic algorithm for 1-D multi-level decomposition (a sketch in C follows this list):
- Convolve the input with the low-pass filter to get the approximation vector
- Convolve the input with the high-pass filter to get the detail vector
- Set the input to the approximation and repeat until the desired decomposition level is reached
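A minimal sketch of that loop in C (illustrative only, not the project's actual code; dwt_step is a helper defined here, and the output sizes are simplified, ignoring the growth caused by the filter length at the edges):

    #include <stdlib.h>
    #include <string.h>

    /* Convolve with one filter and downsample by two (zero-padding before the signal). */
    static void dwt_step(const double *in, size_t n,
                         const double *filt, size_t flen, double *out)
    {
        for (size_t k = 0; k < n; k += 2) {
            double sum = 0.0;
            for (size_t j = 0; j < flen && j <= k; j++)
                sum += filt[j] * in[k - j];
            out[k / 2] = sum;
        }
    }

    /* Multi-level 1-D decomposition: the approximation is re-decomposed at each level. */
    static void dwt_multilevel(double *signal, size_t n, int levels,
                               const double *lo, const double *hi, size_t flen,
                               double **details /* one detail vector per level */)
    {
        for (int lvl = 0; lvl < levels && n >= 2; lvl++) {
            size_t half = n / 2;
            double *approx = malloc(half * sizeof *approx);
            details[lvl]   = malloc(half * sizeof *details[lvl]);

            dwt_step(signal, n, lo, flen, approx);        /* low-pass  -> approximation */
            dwt_step(signal, n, hi, flen, details[lvl]);  /* high-pass -> detail        */

            memcpy(signal, approx, half * sizeof *approx); /* approx becomes next input */
            free(approx);
            n = half;
        }
    }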
17 Convolution
- The convolution step is tricky, though, because the filters use data from before and after a specific point, which makes the edges hard to handle
- For signals that aren't sized appropriately, the data must be extended. The most common approaches are periodic extension, symmetric extension, and zero-padding
- The convolution algorithm (with downsampling by two and zero-padding at the edge):

    for (int k = 0; k < signal_size; k += 2) {
        double sum = 0.0;
        for (int j = 0; j < filter_size; j++) {
            if (k - j >= 0)                  /* zero-pad before the start of the signal */
                sum += filter[j] * input[k - j];
        }
        output[k / 2] = sum;                 /* half-length, downsampled output */
    }
18 Implementation
- The convolution caused the most problems, as many available libraries didn't seem to do it correctly, or assumed the data was periodic or symmetric
- I finally appropriated some code from the PyWavelets project that handled zero-padding extension, determining the appropriate output sizes, and performing the correct convolution along the edges
19 2-D Transformation
- The 2-D transformation proved more challenging in terms of how to store the data and how to decompose it (a sketch follows this list)
- Convolve each row with the low-pass filter to get the row approximation, then downsample
- Convolve each row with the high-pass filter to get the row detail, then downsample
- Convolve each column of the low-pass result with the low-pass filter
- Convolve each column of the low-pass result with the high-pass filter
- Convolve each column of the high-pass result with the low-pass filter
- Convolve each column of the high-pass result with the high-pass filter
- Downsample each result
- Store the three detail subbands and set the input to the low-pass/low-pass subband. Repeat for the desired number of levels.
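A sketch of one 2-D level in C, reusing a dwt_step() helper like the one in the 1-D sketch above (illustrative only; row-major storage, even dimensions assumed, edge-size handling omitted):

    #include <stdlib.h>

    /* One 2-D decomposition level: filter and downsample every row into low/high
     * halves, then filter and downsample the columns of each half, producing the
     * LL approximation and the LH, HL, HH detail subbands (each rows/2 x cols/2). */
    static void dwt2_level(const double *img, size_t rows, size_t cols,
                           const double *lo, const double *hi, size_t flen,
                           double *LL, double *LH, double *HL, double *HH)
    {
        size_t hc = cols / 2, hr = rows / 2;
        double *rowL = malloc(rows * hc * sizeof *rowL);  /* rows through low-pass  */
        double *rowH = malloc(rows * hc * sizeof *rowH);  /* rows through high-pass */
        double *col  = malloc(rows * sizeof *col);
        double *out  = malloc(hr * sizeof *out);

        for (size_t r = 0; r < rows; r++) {               /* 1) filter the rows     */
            dwt_step(img + r * cols, cols, lo, flen, rowL + r * hc);
            dwt_step(img + r * cols, cols, hi, flen, rowH + r * hc);
        }

        for (size_t c = 0; c < hc; c++) {                 /* 2) filter the columns  */
            for (size_t r = 0; r < rows; r++) col[r] = rowL[r * hc + c];
            dwt_step(col, rows, lo, flen, out);
            for (size_t r = 0; r < hr; r++) LL[r * hc + c] = out[r];
            dwt_step(col, rows, hi, flen, out);
            for (size_t r = 0; r < hr; r++) LH[r * hc + c] = out[r];

            for (size_t r = 0; r < rows; r++) col[r] = rowH[r * hc + c];
            dwt_step(col, rows, lo, flen, out);
            for (size_t r = 0; r < hr; r++) HL[r * hc + c] = out[r];
            dwt_step(col, rows, hi, flen, out);
            for (size_t r = 0; r < hr; r++) HH[r * hc + c] = out[r];
        }

        free(rowL); free(rowH); free(col); free(out);
    }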
20 Post Transformation
- After the data was transformed and stored in an appropriate data structure, an in-memory gzip compression (which was oddly better than bzip2) was applied and the result was written out in binary format (a sketch of this step follows below)
- Reconstruction is a separate program that does everything in reverse, except that it uses the reconstruction filters
- There was trouble in storing and reading the data back in an appropriate form based on the decomposition structure
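For the in-memory compression step, something like zlib's one-shot compress() can be used (a sketch under the assumption that zlib is acceptable; note that compress() emits a zlib/DEFLATE stream rather than a true gzip file, so this only approximates what the project did):

    #include <stdlib.h>
    #include <zlib.h>

    /* Deflate the serialized coefficient buffer in memory before writing it out.
     * Returns the compressed length, or 0 on failure. */
    static uLongf deflate_buffer(const unsigned char *raw, uLong raw_len,
                                 unsigned char **compressed)
    {
        uLongf comp_len = compressBound(raw_len);   /* worst-case output size */
        *compressed = malloc(comp_len);
        if (*compressed == NULL ||
            compress(*compressed, &comp_len, raw, raw_len) != Z_OK)
            return 0;
        return comp_len;
    }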
(Diagram: storing the data; the DD, DA, and AD detail subbands for levels 1-3 are laid out around the final AA approximation from level 3)
21 2-D Results
Results for 2-D vorticity data sets using high compression (bior3.1 wavelet, 1x10^-4 threshold):
- 2500x2500 went from 50,000,016 B -> 169,344 B
- 1024x1024 went from 8,388,624 B -> 201,601 B !!!
22 2-D Results
(Chart: compression ratio for different 2-D data sizes)
23 2-D Results
Results for 2-D vorticity data sets using low compression (bior5.5 wavelet, which has more coefficients than bior3.1, with a 1x10^-7 threshold):
- 64x64 actually increases in size, because the decomposition creates matrices whose combined size is larger than the original and the threshold level is too low.
24 Comparison
- Compared to the adaptive subsampling presented in Tallat's thesis.
25 2-D Pictures
(Images: 128x128 vorticity data, original and reconstructed using high compression)
26 2-D Pictures
(Image: difference between the original and reconstructed 128x128 vorticity data)
27 2-D Pictures
(Plot: maximum per-row difference between the original and reconstructed 1024x1024 data)
28 3-D Data
(Figure from http://taco.poly.edu/WaveletSoftware/standard3D.html)
- Similar to 2-D, except more complicated
- My implementation (sketched in code below):
- Take the Z axis, filter and downsample to get A and D
- Take the Y axis, filter and downsample to get AA, AD, DA, DD
- Take the X axis, filter and downsample to get AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD
- Take AAA, set it as the input, and repeat for the desired number of levels
- The real trouble was in trying to store the data in a way that can be reconstructed later
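Sketched in C with a hypothetical dwt_along_axis() helper (it would filter and downsample every line of a block along one axis, splitting it into a low-pass and a high-pass half); the names are illustrative, not the project's actual API:

    #include <stddef.h>

    /* Minimal 3-D block type for the sketch. */
    typedef struct { double *data; size_t nx, ny, nz; } block_t;

    /* Hypothetical helper: split a block into low-pass and high-pass halves along one axis. */
    void dwt_along_axis(block_t in, char axis, block_t *lo, block_t *hi);

    /* One 3-D level: each axis pass doubles the number of subbands (2 -> 4 -> 8). */
    static void dwt3_level(block_t in, block_t out[8])
    {
        block_t a, d, aa, ad, da, dd;

        dwt_along_axis(in, 'z', &a, &d);            /* Z axis: A, D      */
        dwt_along_axis(a,  'y', &aa, &ad);          /* Y axis: AA, AD    */
        dwt_along_axis(d,  'y', &da, &dd);          /*         DA, DD    */
        dwt_along_axis(aa, 'x', &out[0], &out[1]);  /* X axis: AAA, AAD  */
        dwt_along_axis(ad, 'x', &out[2], &out[3]);  /*         ADA, ADD  */
        dwt_along_axis(da, 'x', &out[4], &out[5]);  /*         DAA, DAD  */
        dwt_along_axis(dd, 'x', &out[6], &out[7]);  /*         DDA, DDD  */
        /* out[0] (AAA) becomes the input for the next level. */
    }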
(Diagram: the eight octants AAA, AAD, ADA, ADD, DAA, DAD, DDA, DDD laid out along the x, y, and z axes; the next decomposition level is stored in place of the AAA octant)
29 3-D Data
- It is also problematic to get real 3-D data.
- Vorticity frames were taken and concatenated.
- Tested with 64x64 vorticity over 64 frames (so not really 3-D data).
30 3-D Visual Comparison
(Images: original and reconstructed 64x64x64 vorticity data)
31 Parallel Processing
- The detail and approximation data can be calculated independently
- XML-RPC was used to send an input vector to one node to compute the detail data and to another to compute the approximation data (see the sketch after this list)
- The master node coordinates the sending and receiving
- This led to an enormous slowdown in performance, as expected:
- XML-RPC adds a huge overhead
- Data is sent one row at a time instead of sending an entire decomposition level, which creates excessive communication
- In the 2-D decomposition, when convolving the columns, four operations could be done in parallel, but they are only done two at a time
- Performance was not the main goal here; rather, this was a proof of concept
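A rough sketch of the master-side dispatch, with a hypothetical rpc_convolve() wrapper standing in for the XML-RPC call (names and signatures are illustrative, not the project's API; the calls are shown sequentially, which is exactly the row-at-a-time granularity criticized above):

    #include <stddef.h>

    /* Hypothetical blocking RPC wrapper: ask a slave node to convolve and downsample
     * one row with the low-pass (0) or high-pass (1) filter, returning cols/2 values. */
    void rpc_convolve(const char *node_addr, const double *row, size_t cols,
                      int use_highpass, double *result);

    /* Master node: one node computes the approximation rows, the other the detail rows. */
    static void parallel_rows(const double *img, size_t rows, size_t cols,
                              const char *approx_node, const char *detail_node,
                              double *approx, double *detail)
    {
        for (size_t r = 0; r < rows; r++) {
            rpc_convolve(approx_node, img + r * cols, cols, 0, approx + r * (cols / 2));
            rpc_convolve(detail_node, img + r * cols, cols, 1, detail + r * (cols / 2));
        }
    }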
32 Demo
- Master node:
- ./wavecomp -c 1 -d 2 -p -m -s 132.239.55.175 132.239.55.174 vorticity001.dat
- Slave node:
- ./wavecomp -p -s
33 Parallel Processing Results
Results for 2-D vorticity data sets using high compression, with parallel processing on two nodes
34 Known Issues
- This method is not applicable to all kinds of data
- If not enough values are thresholded away (because the limit is too low or the wavelet wasn't appropriate), the size can actually increase, because the decomposition usually creates detail and approximation vectors whose combined size is larger than the original
- My implementation does excessive data copying, which could be eliminated to speed up processing. It comes down to the question of whether the transformations should be done in place (which is tricky because the sizes can change)
35 Conclusion
- Lossy compression is applicable to many kinds of data, but the user should have a basic understanding of the thresholding required
- Wavelets are a good choice for such compression, as evidenced by other applications and by these results
- The finer the resolution of the data, the better the compression
36 References
- The following helped significantly:
- Matlab Wavelet Toolbox: http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/wavelet.html
- Robi Polikar, Wavelet Tutorial: http://users.rowan.edu/~polikar/WAVELETS/WTtutorial.html
- PyWavelets: http://www.pybytes.com/pywavelets/
- Geoff Davis, Wavelet Construction Kit: http://www.geoffdavis.net/dartmouth/wavelet/wavelet.html
- Wickerhauser, Adapted Wavelet Analysis from Theory to Software