Title: Data Compression by Quantization
1Data Compression by Quantization
- Edward J. Wegman
- Center for Computational Statistics
- George Mason University
2Outline
- Acknowledgements
- Complexity
- Sampling Versus Binning
- Some Quantization Theory
- Recommendations for Quantization
3Acknowledgements
- This is joint work with Nkem-Amin (Martin) Khumbah
- This work was funded by the Army Research Office
4Complexity
  Descriptor      Data Set Size in Bytes   Storage Mode
  Tiny            10^2                     Piece of Paper
  Small           10^4                     A Few Pieces of Paper
  Medium          10^6                     A Floppy Disk
  Large           10^8                     Hard Disk
  Huge            10^10                    Multiple Hard Disks, e.g. RAID Storage
  Massive         10^12                    Robotic Magnetic Tape Storage Silos
  Super Massive   10^15                    Distributed Archives
- The Huber/Wegman Taxonomy of Data Set Sizes
5Complexity
- O(r)         Plot a scatterplot
- O(n)         Calculate means, variances, kernel density estimates
- O(n log(n))  Calculate fast Fourier transforms
- O(nc)        Calculate the singular value decomposition of an r × c matrix; solve a multiple linear regression
- O(n^2)       Solve most clustering algorithms
- O(a^n)       Detect multivariate outliers
- Algorithmic Complexity
6Complexity
7Motivation
- Massive data sets can make many algorithms computationally infeasible, e.g. O(n^2) and higher
- Must reduce the effective number of cases
- Reduce computational complexity
- Reduce data transfer requirements
- Enhance visualization capabilities
8Data Sampling
- Database Sampling
- Exhaustive search may not be practically feasible because of the size of the database
- KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined
- For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)
- Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.
9Data Compression
- Squishing, Squashing, Thinning, Binning
- Squishing: cases reduced
  - Sampling = Thinning
  - Quantization = Binning
- Squashing: dimensions (variables) reduced
- Depending on the goal, either sampling or quantization may be preferable
10Data Quantization
- Thinning vs. Binning
- People's first thought about massive data is usually statistical subsampling
- Quantization is engineering's success story
- Binning is the statistician's quantization
11Data Quantization
- Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels
- Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels
- Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3
- For a terabyte data set, use 10^6 bins
12Data Quantization
- Binning, but at microresolution
- Conventions:
  - d = dimension
  - k = # of bins
  - n = sample size
- Typically k << n
13Data Quantization
- Choose the bin representor y_j = mean of the observations in the jth bin, so that E[W | Q = y_j] = y_j
- In other words, E[W | Q] = Q
- The quantizer is self-consistent
14Data Quantization
- E[W] = E[Q]
- If θ̂ is a linear unbiased estimator, then so is E[θ̂ | Q]
- If h is a convex function, then E[h(Q)] ≤ E[h(W)]
- In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W)
- E[Q(Q − W)] = 0
- cov(W − Q) = cov(W) − cov(Q)
- E‖W − P‖^2 ≥ E‖W − Q‖^2, where P is any other quantizer (these identities are checked numerically below)
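These properties are easy to check numerically. The sketch below is illustrative only: it assumes a simulated one-dimensional sample and an arbitrary choice of k = 100 equal-width bins (neither comes from the slides), builds the self-consistent quantizer by replacing each observation with its bin mean, and verifies E[W] = E[Q], var(Q) ≤ var(W), and E[Q(Q − W)] = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)              # raw data W (simulated for illustration)
k, a, b = 100, w.min(), w.max()           # k equal-width bins over [a, b]

# Bin index j = floor(k (x - a) / (b - a)), clipped so x = b lands in the last bin
j = np.minimum((k * (w - a) / (b - a)).astype(int), k - 1)

# Self-consistent quantizer: representor of each bin is the bin mean
counts = np.bincount(j, minlength=k)
sums = np.bincount(j, weights=w, minlength=k)
means = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
q = means[j]                              # Q = E[W | bin]

print(w.mean(), q.mean())                 # E[W] = E[Q]
print(w.var(), q.var())                   # var(Q) <= var(W)
print(np.mean(q * (q - w)))               # E[Q(Q - W)] ~ 0
```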
15Data Quantization
16Distortion due to Quantization
- Distortion is the error due to quantization.
- In simple terms, it is E‖W − Q‖^2 (see the short derivation below).
- Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.
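Filling in a step the slides leave implicit: the identities on the previous slide give a closed form for this distortion. The following is a standard bit of quantization algebra, sketched here rather than quoted from the slides.

```latex
\mathbb{E}\|W-Q\|^{2}
  = \mathbb{E}\|W\|^{2} - 2\,\mathbb{E}\langle W,Q\rangle + \mathbb{E}\|Q\|^{2}
  = \mathbb{E}\|W\|^{2} - \mathbb{E}\|Q\|^{2}
  = \operatorname{tr}\operatorname{cov}(W) - \operatorname{tr}\operatorname{cov}(Q)
```

The second equality uses E[Q(Q − W)] = 0, which gives E⟨W, Q⟩ = E‖Q‖^2, and the third uses E[W] = E[Q]; the variance removed by quantization is exactly the distortion.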
17Geometry-based Quantization
- Need space-filling tessellations
- Need congruent tiles
- Need tiles that are as spherical as possible
18Geometry-based Quantization
- In one dimension
  - The only polytope is a straight line segment (also bounded by the one-dimensional sphere)
- In two dimensions
  - The only regular polygons that tile the plane are equilateral triangles, squares, and hexagons
19Geometry-based Quantization
- In 3 dimensions
  - Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron
- In 4 dimensions
  - 4-simplex, hypercube, 24-cell
Truncated octahedron tessellation
20Geometry-based Quantization
  Tetrahedron            .1040042
  Cube                   .0833333
  Octahedron             .0825482
  Hexagonal Prism        .0812227
  Rhombic Dodecahedron   .0787451
  Truncated Octahedron   .0785433
  Dodecahedron           .0781285
  Icosahedron            .0778185
  Sphere                 .0769670
Dimensionless Second Moment for 3-D Polytopes
21Geometry-based Quantization
Tetrahedron
Cube
Octahedron
Icosahedron
Dodecahedron
Truncated Octahedron
22Geometry-based Quantization
Rhombic Dodecahedron
http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html
23Geometry-based Quantization
24 Cell with Cuboctahedron Envelope
Hexagonal Prism
24Geometry-based Quantization
- Using 10^6 bins is computationally and visually feasible.
- Fast binning, for data in the range [a, b] and for k bins:
- j = ⌊k(x_i − a)/(b − a)⌋ gives the index of the bin for x_i in one dimension (see the sketch below).
- Computational complexity is 4n + 1 = O(n).
- Memory requirements drop to 3k: location of bin, # of items in bin, representor of bin, i.e. storage complexity is 3k.
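A minimal sketch of this fast 1-D binning, assuming equal-width bins and keeping only the per-bin count and the bin mean as representor (the bin locations are implicit in the index); the function name fast_bin_1d and the example data are illustrative, not from the slides.

```python
import numpy as np

def fast_bin_1d(x, a, b, k):
    """One-pass binning with j = floor(k (x_i - a) / (b - a)).

    Returns per-bin counts and bin means (representors); together with the
    implicit bin edges a + j (b - a) / k this is the 3k storage on the slide.
    """
    x = np.asarray(x, dtype=float)
    j = np.clip(np.floor(k * (x - a) / (b - a)).astype(int), 0, k - 1)
    counts = np.bincount(j, minlength=k)
    sums = np.bincount(j, weights=x, minlength=k)
    means = np.divide(sums, counts, out=np.zeros(k), where=counts > 0)
    return counts, means

# Example: compress a million observations to k = 1000 bins
rng = np.random.default_rng(1)
x = rng.exponential(size=1_000_000)
counts, means = fast_bin_1d(x, a=0.0, b=float(x.max()), k=1000)
```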
25Geometry-based Quantization
- In two dimensions
- Each hexagon is indexed by 3 parameters (see the sketch below).
- Computational complexity is 3 times the 1-D complexity, i.e. 12n + 3 = O(n).
- Complexity for squares is 2 times the 1-D complexity.
- Ratio is 3/2.
- Storage complexity is still 3k.
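The slides do not spell out the 3-parameter hexagon index, so the sketch below uses a common alternative construction of the same hexagonal tessellation: hexagon centers form two rectangular lattices, the second offset by half a cell, and each point is assigned to the nearer candidate center (the Voronoi cells of the combined lattice are regular hexagons). It happens to label each hexagon with three integers (a lattice flag plus column and row), though not necessarily the three parameters the slide has in mind; hex_bin_index and dx are illustrative names.

```python
import numpy as np

def hex_bin_index(x, y, dx=1.0):
    """Assign (x, y) points to a regular hexagonal tessellation.

    Centers are the union of lattice A at (i*dx, j*dy) and lattice B offset by
    (dx/2, dy/2), with dy = sqrt(3)*dx; each point goes to the nearer center.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    dy = np.sqrt(3.0) * dx
    ix_a, iy_a = np.round(x / dx), np.round(y / dy)              # nearest A center
    ix_b, iy_b = np.round(x / dx - 0.5), np.round(y / dy - 0.5)  # nearest B center
    d_a = (x - ix_a * dx) ** 2 + (y - iy_a * dy) ** 2
    d_b = (x - (ix_b + 0.5) * dx) ** 2 + (y - (iy_b + 0.5) * dy) ** 2
    use_a = d_a <= d_b
    col = np.where(use_a, ix_a, ix_b).astype(int)
    row = np.where(use_a, iy_a, iy_b).astype(int)
    return use_a.astype(int), col, row       # (lattice flag, column, row) per point

lat, col, row = hex_bin_index(np.random.rand(10) * 5, np.random.rand(10) * 5)
```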
26Geometry-based Quantization
- In 3 dimensions
- For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.
- Computational complexity is 28n + 7 = O(n) (see the sketch below).
- Computational complexity for a cube is 12n + 3.
- Ratio is 7/3.
- Storage complexity is still 3k.
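For the truncated octahedron the same trick works in 3-D, because truncated octahedra are the Voronoi cells of the body-centered cubic lattice (a standard lattice fact, not stated on the slides): compare the nearest point of a cubic lattice with the nearest point of the same lattice shifted by half a cell. A hedged sketch with an illustrative function name:

```python
import numpy as np

def bcc_bin_index(points, s=1.0):
    """Bin 3-D points into truncated-octahedron cells, i.e. the Voronoi cells
    of the body-centered cubic lattice with cube side s."""
    p = np.asarray(points, dtype=float)            # shape (n, 3)
    ia = np.round(p / s)                           # nearest corner-lattice point
    ib = np.round(p / s - 0.5)                     # nearest body-centered point
    da = np.sum((p - ia * s) ** 2, axis=1)
    db = np.sum((p - (ib + 0.5) * s) ** 2, axis=1)
    use_a = da <= db
    centers = np.where(use_a[:, None], ia, ib + 0.5)   # cell centers in units of s
    return use_a.astype(int), centers

flags, centers = bcc_bin_index(np.random.rand(5, 3) * 4.0)
```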
27Quantization Strategies
- Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.
- Complexity is always O(n).
- Storage complexity is 3k.
- # of tiles grows exponentially with dimension, the so-called curse of dimensionality.
- Higher-dimensional geometry is poorly known.
- Computational complexity grows faster than for the hypercube.
28Quantization Strategies
- For purposes of simplicity, always use hypercubes or d-dimensional simplices.
- Computational complexity is always O(n).
- Methods for data-adaptive tiling are available.
- Storage complexity is 3k.
- # of tiles grows exponentially with dimension.
- Both polytopes depart from spherical shape rapidly as d increases.
- The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.
29Quantization Strategies
- Conclusions on Geometric Quantization
- The geometric approach is good to 4 or 5 dimensions.
- Adaptive tilings may improve the rate at which the # of tiles grows, but probably destroy the spherical structure.
- Good for large n, but weaker for large d.
30Quantization Strategies
- Alternate Strategy
- Form bins via clustering
- Known in the electrical engineering literature as vector quantization.
- Distance-based clustering is O(n^2), which implies poor performance for large n.
- Not terribly dependent on the dimension, d.
- Clusters may be very out of round, not even convex.
- Conclusion
- The cluster approach may work for large d, but fails for large n.
- Not particularly applicable to massive data mining.
31Quantization Strategies
- Third strategy
- Density-based clustering
- Density estimation with kernel estimators is O(n).
- Uses the modes m_ν to form clusters
- Put x_i in cluster ν if it is closest to mode m_ν (see the sketch below).
- This procedure is distance based, but with complexity O(kn), not O(n^2).
- Normal mixture densities may be an alternative approach.
- Roundness may be a problem.
- But quantization based on density-based clustering offers promise for both large d and large n.
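A minimal sketch of the O(kn) assignment step, assuming the k mode locations have already been found (for example from a kernel density estimate, which is not shown); assign_to_modes and the array shapes are illustrative choices.

```python
import numpy as np

def assign_to_modes(x, modes):
    """Assign each d-dimensional observation to the cluster of its nearest mode.

    x: (n, d) observations; modes: (k, d) mode locations.
    Cost is one distance per (observation, mode) pair, i.e. O(kn) for fixed d.
    """
    x, modes = np.asarray(x, float), np.asarray(modes, float)
    d2 = ((x[:, None, :] - modes[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return np.argmin(d2, axis=1)                                 # cluster label per point

labels = assign_to_modes(np.random.rand(1000, 5), np.random.rand(10, 5))
```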
32Data Quantization
- Binning does not lose fine structure in the tails as sampling might.
- Roundoff analysis applies.
- At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.
- Discretization: a finite number of bins implies discrete variables, which are more compatible with categorical data.
33Data Quantization
- Analysis on a finite subset of the integers has theoretical advantages
- Analysis is less delicate: different forms of convergence are equivalent
- Analysis is often more natural since the data are already quantized or categorical
- Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)