Title: Fast Parallel Algorithms for Universal Lossless Source Coding
1. Fast Parallel Algorithms for Universal Lossless Source Coding
- Dror Baron
- CSL, ECE Department, UIUC
- dbaron@uiuc.edu
- Ph.D. Defense, February 18, 2003
2. Overview
- Motivation, applications, and goals
- Background
- Source models
- Lossless source coding and universality
- Semi-predictive methods
- An O(N) semi-predictive universal encoder
- Two-part codes
- Rigorous analysis of their compression quality
- Application to parallel compression of Bernoulli sequences
- Parallel semi-predictive (PSP) coding
- Achieving a work-efficient algorithm
- Theoretical results
- Summary
3. Motivation
- Lossless compression: text files, facsimiles, software executables, medical and financial data, etc.
- What do we want in a compression algorithm?
- Universality: adaptive to a large class of sources
- Good compression quality
- Speed: low computational complexity
- Simple implementation
- Low memory use
- Sequential vs. offline
4. Why Parallel Compression?
- Some applications require high data rates
- Compressed pages in virtual memory
- Remote archiving over fast communication links
- Real-time compression in storage systems
- Power reduction for interconnects on a circuit board
- Serial compression is limited by the clock rate
5. Room for Improvement and Goals
- Previous Art
- Serial universal source coding methods have reached the bounds on compression quality [Willems 1998, Rissanen 1999]
- Parallel source coding algorithms have high complexity and/or poor compression quality
- Naïve parallelization compresses poorly
- Parallel dictionary compression [Franaszek et al. 1996]
- Parallel context tree weighting [Stassen and Tjalkens 2001, Willems 2000]
- Research Goals: a good parallel compression algorithm
- Work-efficient: O(N/B) time with B computational units
- Compression quality as good as the best serial methods (almost!)
6. Main Contributions
- BWT-MDL (O(N) universal encoder)
- An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
- Combines efficient prefix tree construction with the semi-predictive approach to universal coding
- Fast Suffix Sorting (not in this talk)
- Core algorithm is very simple (can be implemented in VLSI)
- Worst-case complexity O(N log^0.5(N))
- Competitive with other suffix sorting methods in practice
- Two-Part Codes
- Rigorous analysis of their compression quality
- Application to distributed/parallel compression
- Optimal two-part codes
- Parallel Compression Algorithm (not in this talk)
- Work-efficient O(N/B) algorithm
- Compression loss is roughly B log(N/B) bits
7. Source Models
- Binary alphabet X = {0,1}, sequence x ∈ X^N
- Bernoulli Model
- i.i.d. model
- p(x_i = 1) = θ
- Order-K Markov Model
- The previous K symbols are called the context
- Context-dependent conditional probability for the next symbol
- More flexible than Bernoulli
- Exponentially many states
8. Context Tree Sources
- More flexible than Bernoulli
- More compact than Markov
- Particularly good for text
- Works for M-ary alphabet
- Each state (context) has its own conditional probabilities
- Example: N = 11, x = 01011111111 (context tree figure omitted)
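To make the state/context correspondence concrete, here is a minimal Python sketch of a hypothetical depth-2 tree source; the suffix set S and the parameter values are illustrative assumptions, not taken from the talk.

    S = {"0": 0.8, "01": 0.3, "11": 0.9}   # hypothetical states -> P(next symbol = 1)

    def generating_state(past):
        # The state is the unique element of S that is a suffix of the past
        # (most recent symbol written last).
        for depth in (1, 2):
            if past[-depth:] in S:
                return past[-depth:]
        raise ValueError("no matching state")

    def sequence_probability(x, past="00"):
        # Probability that this tree source assigns to the string x.
        p = 1.0
        for symbol in x:
            theta = S[generating_state(past)]
            p *= theta if symbol == "1" else 1.0 - theta
            past += symbol
        return p

    print(sequence_probability("01011111111"))   # the length-11 example above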
9. Review of Lossless Source Coding
- Stationary ergodic sources
- Entropy rate H = lim_{N→∞} H(x)/N
- Asymptotically, H is the lowest attainable per-symbol rate
- Arithmetic coding
- Probability assignment p(x)
- Coding length l(x) = -log(p(x)) + O(1)
- Can achieve entropy + O(1) bits (illustrated below)
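As a small illustration of the coding-length relation above, the following sketch computes the ideal codelength -log2 p(x) for an i.i.d. Bernoulli model; the parameter value is an assumption chosen only for the example.

    from math import log2

    def ideal_codelength(x, theta):
        # -log2 p(x) under a Bernoulli(theta) model; an arithmetic coder
        # approaches this length to within O(1) bits.
        n1 = x.count("1")
        n0 = len(x) - n1
        return -n1 * log2(theta) - n0 * log2(1.0 - theta)

    print(ideal_codelength("01011111111", theta=0.8))   # about 7.5 bits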
10. Universal Source Coding
- Source statistics are unknown
- Need probability assignment p(x)
- Need to estimate source model
- Need to describe the estimated source (explicitly or implicitly)
- Redundancy: excess coding length above entropy
- ρ(x) = l(x) - N·H
11. Redundancy Bounds
- Rissanen's bound (K unknown parameters)
- E[ρ(x)] > (K/2)·log(N) + O(1)
- Worst-case redundancy for Bernoulli sequences (K = 1): ρ*(x) = max_{x ∈ X^N} ρ(x) ≈ 0.5 log(πN/2)
- Asymptotically, ρ*(x)/N → 0
- In practice (e.g., text), the number of parameters scales almost linearly with N
- Low redundancy is still essential
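For a sense of scale, a small sketch evaluating the K = 1 worst-case bound above at a few illustrative sequence lengths:

    from math import log2, pi

    # Worst-case Bernoulli redundancy ~ 0.5*log2(pi*N/2) from the slide above.
    for N in (10**3, 10**6, 10**9):
        print(N, 0.5 * log2(pi * N / 2))   # roughly 5.3, 10.3, and 15.3 bits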
12. Semi-Predictive Approach
- Semi-predictive methods describe x in two phases
- Phase I: find a good tree source structure S and describe it using codelength l_S
- Phase II: encode x using S with probability assignment p_S(x)
- Phase I estimates the minimum description length (MDL) tree source model S = arg min_S [l_S - log(p_S(x))]
13. Semi-Predictive Approach: Phase II
- Sequential encoding of x given S
- Determine which state s of S generated symbol x_i
- Assign x_i a conditional probability p(x_i|s)
- Arithmetic encoding
- p(x_i|s) can be based on the previously processed portion of x, quantized probability estimates, etc.
14. Context Trees
- We will provide an O(N) semi-predictive algorithm by estimating S using context trees
- Context trees arrange x in a tree
- Each node corresponds to the sequence of arc labels appended along the path to the root
- Internal nodes correspond to repeating contexts in x
- Leaves correspond to unique contexts
- The sentinel symbol x_0 ensures that all symbols have different contexts
15. Context Tree Pruning (To prune or not to prune)
- The MDL structure for state s yields the shortest description of the symbols generated by s
- When processing state s
- Estimate MDL structures for states 0s and 1s
- Decide whether to keep 0s and 1s or prune them into state s
- Base the decision on coding lengths
16. Phase I with Atomic Context Trees
- Atomic context tree
- Arc labels are atomic (single symbol)
- Internal nodes are not necessarily branching
- Has up to O(N^2) nodes
- The coding length minimization of Phase I processes each node of the context tree [Nohre 1994]
- With atomic context trees, the worst-case complexity is at least O(N^2)
17. Compact Context Trees
- Compact context tree
- Arc labels not necessarily atomic
- Internal nodes are branching
- O(N) nodes
- Compact representation of the same tree
- A depth-first traversal of the compact context tree provides O(N) complexity
- Theorem: Phase I of BWT-MDL requires O(N) operations performed with O(log(N)) bits of precision
18. Phase II of BWT-MDL
- We determine the generator state using a novel algorithm that is based on properties of the Burrows-Wheeler transform (BWT)
- Theorem: The BWT-MDL encoder requires O(N) operations performed with O(log(N)) bits of precision
- Theorem [Willems et al. 2000]: The redundancy w.r.t. any tree source S is at most |S|·0.5·log(N) + O(1) bits
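For readers unfamiliar with the BWT, here is a textbook sketch that computes it by sorting all cyclic rotations; this is only illustrative and is not the algorithm of the thesis, which obtains the transform via fast suffix sorting and uses further BWT properties to find generator states.

    def bwt(x, sentinel="\x00"):
        # Burrows-Wheeler transform via sorted cyclic rotations (illustrative,
        # not the O(N log^0.5 N) suffix-sorting construction from the thesis).
        s = x + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("01011111111"))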
19. Distributed/Parallel Compression of Bernoulli Sequences
- A splitter partitions x into B blocks x(1), ..., x(B)
- Encoder j ∈ {1, ..., B} compresses x(j); it assigns probabilities p(x_i(j) = 1) = θ and p(x_i(j) = 0) = 1 - θ
- The total probability assigned to x is identical to that in a serial compression system
- This structure assumes that θ is known; our goal is to provide a universal parallel compression algorithm for Bernoulli sequences
20. Two-Part Codes
- Two-part codes use a semi-predictive approach to describe Bernoulli sequences
- First part of the code
- Determine the maximum likelihood (ML) parameter estimate θ_ML(x) = n_1/(n_0 + n_1)
- Quantize θ_ML(x) to r_k, one of K representation levels
- Describe the bin index k with log(K) bits
- Second part of the code: encode x using r_k
- In distributed systems
- Sequential compressors require O(N) internal communications
- Two-part codes need only communicate {n_0(j), n_1(j)}, j ∈ {1, ..., B}
- This requires O(B log(K)) internal communications
21. Jeffreys Two-Part Code
- Quantize θ_ML(x)
- Bin edges b_k = sin^2(πk/(2K))
- Representation levels r_k = sin^2(π(2k-1)/(4K))
- Use K = ⌈1.772·N^0.5⌉ bins
- Source description
- log(K) bits for describing the bin index k
- Need -n_1·log(θ_ML(x)) - n_0·log(1 - θ_ML(x)) bits for encoding x
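Putting the pieces of this slide together, here is a minimal Python sketch of the Jeffreys two-part code for a binary string. The helper name and the tie-breaking tolerance are assumptions; a real encoder would emit an arithmetic code driven by r_k, whereas this sketch only reports the resulting code lengths.

    from math import ceil, sin, pi, log2

    def jeffreys_two_part_codelength(x):
        N = len(x)
        n1 = x.count("1")
        n0 = N - n1
        theta_ml = n1 / N
        K = ceil(1.772 * N ** 0.5)                      # number of bins
        # Bin edges b_k = sin^2(pi*k/(2K)); find the bin containing theta_ml.
        k = next(k for k in range(1, K + 1)
                 if theta_ml <= sin(pi * k / (2 * K)) ** 2 + 1e-12)
        r_k = sin(pi * (2 * k - 1) / (4 * K)) ** 2       # representation level
        first_part = log2(K)                             # idealized bin-index cost
        second_part = -n1 * log2(r_k) - n0 * log2(1 - r_k)
        return k, r_k, first_part + second_part

    print(jeffreys_two_part_codelength("01011111111"))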
22. Redundancy of Jeffreys Code for Bernoulli Sequences
- Redundancy
- log(K) bits for describing k
- N·D(θ_ML(x)||r_k) bits for encoding x using the imprecise model
- D(a||b) is the Kullback-Leibler divergence
- In bin k, l(θ_ML(x)) = -θ_ML(x)·log(r_k) - (1 - θ_ML(x))·log(1 - r_k)
- l(θ_ML(x)) is a poly-line (piecewise linear in θ_ML(x))
- Redundancy = log(K) + N·l(θ_ML(x)) - N·H(θ_ML(x)) ≤ log(K) + N·L_∞
- Use quantizers that have a small L_∞ distance between the entropy function and the induced poly-line fit
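The same decomposition in code form: a small sketch that evaluates log2(K) + N·D(θ_ML||r_k) directly. The numeric arguments below reuse the values produced by the Jeffreys-quantizer sketch above and are purely illustrative.

    from math import log2

    def binary_kl(a, b):
        # Kullback-Leibler divergence D(a||b) between Bernoulli(a) and Bernoulli(b), in bits.
        terms = 0.0
        if a > 0:
            terms += a * log2(a / b)
        if a < 1:
            terms += (1 - a) * log2((1 - a) / (1 - b))
        return terms

    def two_part_redundancy(N, theta_ml, r_k, K):
        return log2(K) + N * binary_kl(theta_ml, r_k)

    print(two_part_redundancy(N=11, theta_ml=9/11, r_k=0.8536, K=6))   # about 2.7 bits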
23. Redundancy Properties
- For x s.t. θ_ML(x) is quantized to r_k, the worst-case redundancy is
- log(K) + N·max{D(b_k||r_k), D(b_{k-1}||r_k)}
- D(b_k||r_k) and D(b_{k-1}||r_k)
- Largest in the initial or end bins
- Similar in the middle bins
- The difference is reduced over a wider range of k for larger N (larger K)
- Can construct a near-optimal quantizer by modifying the initial and end bins of the Jeffreys quantizer
24. Redundancy Results
- Theorem: The worst-case redundancy of the Jeffreys code is 1.221 + O(1/N) bits above Rissanen's bound
- Theorem: The worst-case redundancy of the optimal two-part code is 1.047 + O(1) bits above Rissanen's bound
25. Parallel Universal Compression for Bernoulli Sequences
- Phase I
- Parallel units (PUs) compute symbol counts for the B blocks
- The coordinating unit (CU) computes and quantizes the ML parameter estimate θ_ML(x) and describes k
- Phase II: B PUs encode the B blocks based on r_k (see the sketch below)
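A minimal end-to-end sketch of this two-phase flow for a Bernoulli sequence. The block partition and helper names are assumptions, and in a real system the per-block codelengths would be produced by B arithmetic encoders running in parallel.

    from math import ceil, sin, pi, log2

    def psp_bernoulli_codelength(x, B):
        L = -(-len(x) // B)                              # ceil(N/B) symbols per block
        blocks = [x[j * L:(j + 1) * L] for j in range(B)]
        # Phase I: each PU reports only its symbol counts to the CU.
        counts = [(blk.count("0"), blk.count("1")) for blk in blocks]
        n0 = sum(c0 for c0, _ in counts)
        n1 = sum(c1 for _, c1 in counts)
        N = n0 + n1
        theta_ml = n1 / N
        K = ceil(1.772 * N ** 0.5)
        k = next(k for k in range(1, K + 1)
                 if theta_ml <= sin(pi * k / (2 * K)) ** 2 + 1e-12)
        r_k = sin(pi * (2 * k - 1) / (4 * K)) ** 2
        # Phase II: every PU encodes its own block with the shared r_k.
        block_bits = [-c1 * log2(r_k) - c0 * log2(1 - r_k) for c0, c1 in counts]
        return log2(K) + sum(block_bits)

    print(psp_bernoulli_codelength("0101111111101011111111", B=2))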
26. Why do we need Parallel Semi-Predictive Coding?
- Naïve parallelization
- Partition x into B blocks
- Compress blocks independently
- The redundancy for a length-N/B block is O(log(N/B))
- The total redundancy is O(B·log(N/B))
- Rissanen's bound is O(log(N))
- The redundancy with naïve parallelization is excessive! (illustrated below)
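A rough back-of-the-envelope comparison; the values of N and B are assumptions chosen only for illustration.

    from math import log2

    # Naive per-block redundancy, B*log2(N/B), versus the roughly 0.5*log2(N)
    # cost of describing a single Bernoulli parameter once.
    N, B = 2**20, 64
    print(B * log2(N / B))   # 896 bits
    print(0.5 * log2(N))     # 10 bits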
27. Parallel Semi-Predictive (PSP) Concept
- Phase I
- B parallel units (PUs) accumulate statistics (symbol counts) on the B blocks
- The coordinating unit (CU) computes the MDL tree source estimate S
- Phase II: B PUs compress the B blocks based on S
28. Source Description in PSP
- Phase I: the CU describes the structure of S and the quantized ML parameter estimates {k_s}, s ∈ S
- Phase II: each of the B PUs compresses block x(b) just like Phase II of the (serial) semi-predictive approach
29. Complexity of Phase I
- Phase I processes each node of the context tree [Nohre 1994]
- The CU processes the states of a full atomic context tree of depth D_max, where D_max ≤ log(N/B)
- Processing a node:
- An internal node requires O(1) time
- Leaf: the CU adds up the block symbol counts to compute each symbol count, i.e., n_s^α = Σ_b n_s^α(b), where α ∈ {0,1}
- The CU processes a leaf node in O(B) time
- With O(N/B) leaves, the aggregate complexity is O(N), which is excessive
30. Phase I in O(N/B) Time
- We want to compute n_s^α = Σ_b n_s^α(b) faster
- An adder tree incurs O(log(B)) delay for adding up B block symbol counts (see the sketch below)
- Pipelining enables us to generate a result every O(1) time
- O(N/B) nodes, each requiring O(1) time
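A small sketch of the adder-tree reduction; the pipelining itself is a hardware notion, so the code below only shows the O(log B)-level tree structure over assumed per-block counts.

    def adder_tree_sum(block_counts):
        # Reduce B per-block counts in O(log B) levels; with pipelined levels,
        # one aggregated count n_s = sum_b n_s(b) emerges per cycle.
        level = list(block_counts)
        while len(level) > 1:                  # each iteration is one tree level
            if len(level) % 2:
                level.append(0)                # pad to an even number of inputs
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    print(adder_tree_sum([3, 5, 2, 7, 1, 4, 6, 2]))   # 30, after log2(8) = 3 levels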
31. Phase II in O(N/B) Time
- The challenging part of Phase II is determining s
- Define the context index of the length-D_max context s preceding x_i(b) as the binary number that represents s
- The length-2^D_max generator table g satisfies g_j = s ∈ S if s is a suffix of the context whose context index is j
- We can construct g in O(N/B) time (far from trivial!)
- Compute the context indices for all symbols of x(b) and determine the generating states via the generator table g (sketched below)
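A minimal sketch of the generator table for a hypothetical suffix set S (the efficient O(N/B)-time construction from the thesis is not reproduced here); states are written with the most recent symbol last.

    D_MAX = 3
    S = ["0", "01", "11"]                      # hypothetical MDL tree source

    def build_generator_table(states, d_max):
        # g[j] is the state of S that is a suffix of the length-d_max context
        # whose binary representation (context index) is j.
        g = []
        for j in range(2 ** d_max):
            context = format(j, "0{}b".format(d_max))
            g.append(next(s for s in states if context.endswith(s)))
        return g

    for j, s in enumerate(build_generator_table(S, D_MAX)):
        print(format(j, "03b"), "->", s)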
32. Decoder
- An input bus is demultiplexed to multiple units
- The MDL source and the quantized ML parameters are reconstructed
- The B compressed blocks y(1), ..., y(B) are decompressed on B decoding units
33. Theoretical Results
- Theorem: With computations performed with 2·log(N) bits of precision (defined as O(1) time),
- Phase I of PSP approximates the MDL coding length within O(1) of the true optimum
- The PSP algorithm requires O(N/B) time
- Theorem: The PSP algorithm uses a total of O(N) words of memory, i.e., a total of O(N·log(N)) bits
- Theorem: The pointwise redundancy of PSP w.r.t. S is ρ(x) < B·[log(N/B) + O(1)] + |S|·log(N)/2 + O(1), where the first term is the parallelization overhead
34. Main Contributions
- BWT-MDL (O(N) universal encoder)
- An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
- Combines efficient prefix tree construction with the semi-predictive approach to universal coding
- Fast Suffix Sorting (not in this talk)
- Core algorithm is very simple (can be implemented in VLSI)
- Worst-case complexity O(N log^0.5(N))
- Competitive with other suffix sorting methods in practice
- Two-Part Codes
- Rigorous analysis of their compression quality
- Application to distributed/parallel compression
- Optimal two-part codes
- Parallel Compression Algorithm (not in this talk)
- Work-efficient O(N/B) algorithm
- Compression loss is roughly B log(N/B) bits
35. More
- Results have been extended to M-ary alphabets
- Future research can concentrate on
- Processing broader classes of tree sources
- Problems in statistical inference
- Universal classification
- Channel decoding
- Prediction
- Characterizing the design space for parallel compression algorithms
36. Generic Phase I
if (s is a leaf)
    Count symbol appearances n_s^0 and n_s^1
    MDL_s <- length(n_s^0, n_s^1)
else  /* s is an internal node */
    Recursively compute the MDL lengths and counts for 0s and 1s
    n_s^0 <- n_{0s}^0 + n_{1s}^0
    n_s^1 <- n_{0s}^1 + n_{1s}^1
    MDL_s <- length(n_s^0, n_s^1)
    if (MDL_s > MDL_{0s} + MDL_{1s})
        Keep 0s and 1s
    else
        Prune 0s and 1s, keep s
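Below is a runnable Python sketch of the same recursion on an explicit binary sequence. The length() function is instantiated here with the Krichevsky-Trofimov estimator plus one structure bit per node, which is a common choice but not necessarily the exact cost used in BWT-MDL; boundary symbols without a full-depth context are simply skipped, whereas the talk handles them with the sentinel x_0.

    from math import lgamma, log

    def kt_codelength(n0, n1):
        # -log2 of the Krichevsky-Trofimov probability of a sequence with
        # n0 zeros and n1 ones (standing in for the "length" function above).
        logp = (lgamma(n0 + 0.5) + lgamma(n1 + 0.5) - 2 * lgamma(0.5)
                - lgamma(n0 + n1 + 1.0))
        return -logp / log(2)

    def symbol_counts(x, s):
        # Counts of 0s and 1s at positions whose preceding symbols end with s.
        n = [0, 0]
        for i in range(len(s), len(x)):
            if x[i - len(s):i] == s:
                n[int(x[i])] += 1
        return n

    def mdl_prune(x, s="", d_max=3):
        # Returns (MDL codelength, list of kept states) for the subtree rooted at s.
        n0, n1 = symbol_counts(x, s)
        prune_cost = 1 + kt_codelength(n0, n1)     # 1 structure bit + data at state s
        if len(s) == d_max or n0 + n1 == 0:
            return prune_cost, [s]
        mdl_0s, keep_0s = mdl_prune(x, "0" + s, d_max)
        mdl_1s, keep_1s = mdl_prune(x, "1" + s, d_max)
        split_cost = 1 + mdl_0s + mdl_1s           # 1 structure bit + child subtrees
        if prune_cost > split_cost:
            return split_cost, keep_0s + keep_1s   # keep 0s and 1s
        return prune_cost, [s]                     # prune 0s and 1s, keep s

    print(mdl_prune("01011111111"))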