1
Data Compression by Quantization
  • Edward J. Wegman
  • Center for Computational Statistics
  • George Mason University

2
Outline
  • Acknowledgements
  • Complexity
  • Sampling Versus Binning
  • Some Quantization Theory
  • Recommendations for Quantization

3
Acknowledgements
  • This is joint work with Nkem-Amin (Martin)
    Khumbah
  • This work was funded by the Army Research Office

4
Complexity
  Descriptor      Size in Bytes   Storage Mode
  Tiny            10^2            Piece of paper
  Small           10^4            A few pieces of paper
  Medium          10^6            A floppy disk
  Large           10^8            Hard disk
  Huge            10^10           Multiple hard disks, e.g. RAID storage
  Massive         10^12           Robotic magnetic tape storage silos
  Super Massive   10^15           Distributed archives

  The Huber/Wegman Taxonomy of Data Set Sizes

5
Complexity
  • O(r) Plot a scatterplot
  • O(n) Calculate means, variances, kernel density
    estimates
  • O(n log(n)) Calculate fast Fourier transforms
  • O(nc) Calculate singular value decomposition of
    an r × c matrix; solve a multiple linear
    regression
  • O(n^2) Solve most clustering algorithms
  • O(a^n) Detect multivariate outliers
  • Algorithmic Complexity

6
Complexity
7
Motivation
  • Massive data sets can make many algorithms
    computationally infeasible, e.g. O(n^2) and
    higher
  • Must reduce effective number of cases
  • Reduce computational complexity
  • Reduce data transfer requirements
  • Enhance visualization capabilities

8
Data Sampling
  • Database Sampling
  • Exhaustive search may not be practically feasible
    because of the size of the database
  • KDD systems must be able to assist in the
    selection of appropriate parts of the databases
    to be examined
  • For sampling to work, the data must satisfy
    certain conditions (not ordered, no systematic
    biases)
  • Sampling can be a very expensive operation,
    especially when the sample is taken from data
    stored in a DBMS. Sampling 5% of the database can
    be more expensive than a sequential full scan of
    the data.

9
Data Compression
  • Squishing, Squashing, Thinning, Binning
  • Squishing: number of cases reduced
  • Sampling = Thinning
  • Quantization = Binning
  • Squashing: number of dimensions (variables) reduced
  • Depending on the goal, one of sampling or
    quantization may be preferable

10
Data Quantization
  • Thinning vs Binning
  • People's first thought about massive data
    is usually statistical subsampling
  • Quantization is engineering's success story
  • Binning is the statistician's quantization

11
Data Quantization
  • Images are quantized in 8 to 24 bits, i.e. 256 to
    16 million levels.
  • Signals (audio on CDs) are quantized in 16 bits,
    i.e. 65,536 levels
  • Ask a statistician how many bins to use and the
    likely response is a few hundred; ask a CS data
    miner and the likely response is 3
  • For a terabyte data set, 10^6 bins

12
Data Quantization
  • Binning, but at microresolution
  • Conventions
  • d = dimension
  • k = # of bins
  • n = sample size
  • Typically k << n

13
Data Quantization
  • Let W denote a random observation and Q its
    quantized (binned) version
  • Choose the representor y_j of the jth bin to be
    the mean of the observations in that bin, so that
    E[W | Q = y_j] = y_j
  • In other words, E[W | Q] = Q
  • The quantizer is self-consistent

14
Data Quantization
  • E[W] = E[Q]
  • If θ̂ is a linear unbiased estimator, then so is
    E[θ̂ | Q]
  • If h is a convex function, then E[h(Q)] ≤ E[h(W)]
  • In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W)
  • E[Q(Q - W)] = 0
  • cov(W - Q) = cov(W) - cov(Q)
  • E‖W - P‖^2 ≥ E‖W - Q‖^2, where P is any other
    quantizer
  • (Several of these identities are checked
    numerically in the sketch below)
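As a quick numerical check of these properties, here is a
minimal Python/NumPy sketch (the data, bin count, and names
are illustrative, not from the presentation); Q is the
bin-mean quantizer, so it is self-consistent by construction:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=100_000)          # raw data W
    k, a, b = 100, w.min(), w.max()
    # bin index for each case, then bin means as quantized values
    j = np.clip((k * (w - a) / (b - a)).astype(int), 0, k - 1)
    counts = np.maximum(np.bincount(j, minlength=k), 1)
    means = np.bincount(j, weights=w, minlength=k) / counts
    q = means[j]                          # Q = E[W | bin]

    print(w.mean(), q.mean())     # E[W] and E[Q] agree
    print(q.var() <= w.var())     # var(Q) <= var(W): True
    print(np.mean(q * (q - w)))   # E[Q(Q - W)] ~ 0 up to rounding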

15
Data Quantization
16
Distortion due to Quantization
  • Distortion is the error due to quantization.
  • In simple terms, EW-Q2.
  • Distortion is minimized when the quantization
    regions, Sj, are most like a (hyper-) sphere.

17
Geometry-based Quantization
  • Need space-filling tessellations
  • Need congruent tiles
  • Need as spherical as possible

18
Geometry-based Quantization
  • In one dimension
  • The only polytope is a straight line segment
    (which is also a one-dimensional ball)
  • In two dimensions
  • The only regular polygons that tile the plane are
    equilateral triangles, squares and hexagons

19
Geometry-based Quantization
  • In 3 dimensions
  • Tetrahedron (3-simplex), cube, hexagonal prism,
    rhombic dodecahedron, truncated octahedron
  • In 4 dimensions
  • 4-simplex, hypercube, 24-cell

Truncated octahedron tessellation
20
Geometry-based Quantization
Dimensionless Second Moment for 3-D Polytopes

  Tetrahedron            .1040042
  Cube                   .0833333
  Octahedron             .0825482
  Hexagonal Prism        .0812227
  Rhombic Dodecahedron   .0787451
  Truncated Octahedron   .0785433
  Dodecahedron           .0781285
  Icosahedron            .0778185
  Sphere                 .0769670
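The cube's entry can be checked with a short Monte Carlo
sketch in Python/NumPy (my illustration, assuming the usual
normalization G = E‖X - c‖^2 / (d V^(2/d)) for a cell of
volume V centred at c):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-0.5, 0.5, size=(10**6, 3))  # unit cube: V = 1, c = 0
    G = (X**2).sum(axis=1).mean() / 3            # d = 3, V = 1
    print(G)   # ~0.0833, the table's value for the cube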
21
Geometry-based Quantization
[Images of the polyhedra: tetrahedron, cube, octahedron,
icosahedron, dodecahedron, truncated octahedron]
22
Geometry-based Quantization
Rhombic Dodecahedron
http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html
23
Geometry-based Quantization
24-Cell with Cuboctahedron Envelope
Hexagonal Prism
24
Geometry-based Quantization
  • Using 10^6 bins is computationally and visually
    feasible.
  • Fast binning, for data in the range [a, b] and
    for k bins:
  • j = floor(k (x_i - a) / (b - a))
  • gives the index of the bin for x_i in one
    dimension (see the sketch below).
  • Computational complexity is 4n + 1 = O(n).
  • Memory requirements drop to 3k (location of bin,
    # of items in bin, representor of bin), i.e.
    storage complexity is 3k.
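A minimal Python/NumPy sketch of this fast binning scheme
(the function and variable names are mine); it returns the
per-case bin index plus the 3k stored quantities, with bin
means serving as the representors:

    import numpy as np

    def quantize_1d(x, a, b, k):
        # j = floor(k (x_i - a) / (b - a)) indexes the bin for x_i
        j = np.clip(np.floor(k * (x - a) / (b - a)).astype(int), 0, k - 1)
        counts = np.bincount(j, minlength=k)             # # of items per bin
        sums = np.bincount(j, weights=x, minlength=k)
        reps = np.divide(sums, counts, out=np.zeros(k),
                         where=counts > 0)               # bin means as representors
        return j, counts, reps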

25
Geometry-based Quantization
  • In two dimensions
  • Each hexagon is indexed by 3 parameters (see the
    sketch below).
  • Computational complexity is 3 times the 1-D
    complexity, i.e. 12n + 3 = O(n).
  • Complexity for squares is 2 times the 1-D
    complexity.
  • Ratio is 3/2.
  • Storage complexity is still 3k.
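One concrete way to get a 3-parameter hexagon index is via
cube coordinates (i, j, k with i + j + k = 0) on a hexagonal
lattice. The Python sketch below uses the standard
pixel-to-hex rounding for pointy-top hexagons; it is my
illustration, not necessarily the indexing used in the talk:

    import numpy as np

    def hex_index(px, py, size=1.0):
        # fractional axial coordinates of the point
        q = (np.sqrt(3) / 3 * px - py / 3) / size
        r = (2 / 3) * py / size
        x, z = q, r
        y = -x - z                                 # cube constraint x + y + z = 0
        rx, ry, rz = round(x), round(y), round(z)
        dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
        # repair the coordinate with the largest rounding error
        if dx > dy and dx > dz:
            rx = -ry - rz
        elif dy > dz:
            ry = -rx - rz
        else:
            rz = -rx - ry
        return int(rx), int(ry), int(rz)           # 3 integers summing to 0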

26
Geometry-based Quantization
  • In 3 dimensions
  • For the truncated octahedron, there are 3 pairs of
    square sides and 4 pairs of hexagonal sides (see
    the sketch below).
  • Computational complexity is 28n + 7 = O(n).
  • Computational complexity for a cube is 12n + 3.
  • Ratio is 7/3.
  • Storage complexity is still 3k.
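Truncated octahedra are the Voronoi cells of the body-centred
cubic (BCC) lattice, so assigning a point to its cell means
finding the nearest BCC lattice point. A minimal Python
sketch using the classic two-sublattice comparison (my
formulation, not code from the presentation):

    import numpy as np

    def bcc_quantize(x):
        x = np.asarray(x, dtype=float)
        c0 = np.round(x)               # nearest point of Z^3
        c1 = np.round(x - 0.5) + 0.5   # nearest point of Z^3 + (1/2, 1/2, 1/2)
        # the closer candidate is the centre of x's truncated-octahedron cell
        return c0 if ((x - c0)**2).sum() <= ((x - c1)**2).sum() else c1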

27
Quantization Strategies
  • Optimally, for purposes of minimizing distortion,
    use the roundest polytope in d dimensions.
  • Complexity is always O(n).
  • Storage complexity is 3k.
  • # of tiles grows exponentially with dimension, the
    so-called curse of dimensionality.
  • Higher-dimensional geometry is poorly known.
  • Computational complexity grows faster than for the
    hypercube.

28
Quantization Strategies
  • For purposes of simplicity, always use the
    hypercube or d-dimensional simplices
  • Computational complexity is always O(n).
  • Methods for data-adaptive tiling are available
  • Storage complexity is 3k.
  • # of tiles grows exponentially with dimension.
  • Both polytopes depart from spherical shape rapidly
    as d increases.
  • The hypercube approach is known as the datacube in
    the computer science literature and is closely
    related to multivariate histograms in the
    statistical literature (a d-dimensional binning
    sketch follows).
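A minimal Python/NumPy sketch of hypercube (datacube) binning
in d dimensions, with k bins per axis and hence k^d cells
(the names are illustrative):

    import numpy as np

    def bin_indices_dd(X, lo, hi, k):
        # X is (n, d); lo and hi are per-axis bounds; k bins per axis
        J = np.clip(np.floor(k * (X - lo) / (hi - lo)).astype(int), 0, k - 1)
        # collapse d per-axis indices into one flat cell index
        return np.ravel_multi_index(tuple(J.T), (k,) * X.shape[1])

The k^d cell count is exactly the exponential growth in the
number of tiles noted above.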

29
Quantization Strategies
  • Conclusions on Geometric Quantization
  • The geometric approach is good to 4 or 5
    dimensions.
  • Adaptive tilings may improve the rate at which the
    # of tiles grows, but probably destroy the
    spherical structure.
  • Good for large n, but weaker for large d.

30
Quantization Strategies
  • Alternate Strategy
  • Form bins via clustering
  • Known in the electrical engineering literature as
    vector quantization.
  • Distance-based clustering is O(n^2), which implies
    poor performance for large n.
  • Not terribly dependent on the dimension, d.
  • Clusters may be very out of round, not even
    convex.
  • Conclusion
  • The cluster approach may work for large d, but
    fails for large n.
  • Not particularly applicable to massive data
    mining.

31
Quantization Strategies
  • Third strategy
  • Density-based clustering
  • Density estimation with kernel estimators is
    O(n).
  • Uses the modes m_ν to form clusters
  • Put x_i in cluster ν if it is closest to mode m_ν
    (see the sketch below).
  • This procedure is distance based, but with
    complexity O(kn), not O(n^2).
  • Normal mixture densities may be an alternative
    approach.
  • Roundness may be a problem.
  • But quantization based on density-based
    clustering offers promise for both large d and
    large n.
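A minimal Python/NumPy sketch of the assignment step,
assuming the k modes have already been located (e.g. by a
kernel density estimate and mode search, which is not shown;
the names are illustrative):

    import numpy as np

    def assign_to_modes(X, modes):
        # X is (n, d); modes is (k, d); O(kn) distance work, not O(n^2)
        d2 = ((X[:, None, :] - modes[None, :, :])**2).sum(axis=2)
        return d2.argmin(axis=1)   # index of the nearest mode for each case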

32
Data Quantization
  • Binning does not lose fine structure in the tails
    as sampling might.
  • Roundoff analysis applies.
  • At this scale of binning, discretization is not
    likely to be much less accurate than the accuracy
    of the recorded data.
  • Discretization (a finite number of bins) implies
    discrete variables, more compatible with
    categorical data.

33
Data Quantization
  • Analysis on a finite subset of the integers has
    theoretical advantages
  • Analysis is less delicate
  • Different forms of convergence are equivalent
  • Analysis is often more natural since the data are
    already quantized or categorical
  • Graphical analysis of numerical data is not much
    changed, since 10^6 pixels is at the limit of the
    human visual system (HVS)