Title: Fast Parallel Algorithms for Universal Lossless Source Coding
1. Fast Parallel Algorithms for Universal Lossless Source Coding
- Dror Baron
- CSL, ECE Department, UIUC
- dbaron@uiuc.edu
- Ph.D. Defense, February 18, 2003
2. Overview
- Motivation, applications, and goals
- Background
- Source models
- Lossless source coding and universality
- Semi-predictive methods
- An O(N) semi-predictive universal encoder
- Two-part codes
- Rigorous analysis of their compression quality
- Application to parallel compression of Bernoulli sequences
- Parallel semi-predictive (PSP) coding
- Achieving a work-efficient algorithm
- Theoretical results
- Summary
3. Motivation
- Lossless compression: text files, facsimiles, software executables, medical and financial data, etc.
- What do we want in a compression algorithm?
- Universality: adaptive to a large class of sources
- Good compression quality
- Speed: low computational complexity
- Simple implementation
- Low memory use
- Sequential vs. offline
4. Why Parallel Compression?
- Some applications require high data rates
- Compressed pages in virtual memory
- Remote archiving over fast communication links
- Real-time compression in storage systems
- Power reduction for interconnects on a circuit board
- Serial compression is limited by the clock rate
5. Room for Improvement and Goals
- Previous Art
- Serial universal source coding methods have reached the bounds on compression quality [Willems 1998, Rissanen 1999]
- Parallel source coding algorithms have high complexity and/or poor compression quality
- Naïve parallelization compresses poorly
- Parallel dictionary compression [Franaszek et al. 1996]
- Parallel context tree weighting [Stassen and Tjalkens 2001, Willems 2000]
- Research Goals: a good parallel compression algorithm
- Work-efficient: O(N/B) time with B computational units
- Compression quality as good as the best serial methods (almost!)
6. Main Contributions
- BWT-MDL (O(N) universal encoder)
- An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
- Combines efficient prefix tree construction with the semi-predictive approach to universal coding
- Fast Suffix Sorting (not in this talk)
- Core algorithm is very simple (can be implemented in VLSI)
- Worst-case complexity O(N log^0.5(N))
- Competitive with other suffix sorting methods in practice
- Two-Part Codes
- Rigorous analysis of their compression quality
- Application to distributed/parallel compression
- Optimal two-part codes
- Parallel Compression Algorithm (not in this talk)
- Work-efficient O(N/B) algorithm
- Compression loss is roughly B log(N/B) bits
7. Source Models
- Binary alphabet X = {0,1}, sequence x ∈ X^N
- Bernoulli Model
- i.i.d. model
- p(x_i = 1) = θ
- Order-K Markov Model
- The previous K symbols are called the context
- Context-dependent conditional probability for the next symbol
- More flexible than Bernoulli
- Exponentially many states
8. Context Tree Sources
- More flexible than Bernoulli
- More compact than Markov
- Particularly good for text
- Works for M-ary alphabet
- Each state (context) has its own conditional probabilities
- Example: N = 11, x = 01011111111 (context tree figure omitted)
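To make the state/context correspondence concrete, here is a minimal Python sketch of a hypothetical depth-2 tree source; the suffix set S and the parameter values are illustrative assumptions, not taken from the talk.

    S = {"0": 0.8, "01": 0.3, "11": 0.9}   # hypothetical states -> P(next symbol = 1)

    def generating_state(past):
        # The state is the unique element of S that is a suffix of the past
        # (most recent symbol written last).
        for depth in (1, 2):
            if past[-depth:] in S:
                return past[-depth:]
        raise ValueError("no matching state")

    def sequence_probability(x, past="00"):
        # Probability that this tree source assigns to the string x.
        p = 1.0
        for symbol in x:
            theta = S[generating_state(past)]
            p *= theta if symbol == "1" else 1.0 - theta
            past += symbol
        return p

    print(sequence_probability("01011111111"))   # the length-11 example above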
9. Review of Lossless Source Coding
- Stationary ergodic sources
- Entropy rate H = lim_{N→∞} H(x)/N
- Asymptotically, H is the lowest attainable per-symbol rate
- Arithmetic coding
- Probability assignment p(x)
- Coding length l(x) = -log(p(x)) + O(1)
- Can achieve entropy + O(1) bits (illustrated below)
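As a small illustration of the coding-length relation above, the following sketch computes the ideal codelength -log2 p(x) for an i.i.d. Bernoulli model; the parameter value is an assumption chosen only for the example.

    from math import log2

    def ideal_codelength(x, theta):
        # -log2 p(x) under a Bernoulli(theta) model; an arithmetic coder
        # approaches this length to within O(1) bits.
        n1 = x.count("1")
        n0 = len(x) - n1
        return -n1 * log2(theta) - n0 * log2(1.0 - theta)

    print(ideal_codelength("01011111111", theta=0.8))   # about 7.5 bits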
10. Universal Source Coding
- Source statistics are unknown
- Need probability assignment p(x)
- Need to estimate source model
- Need to describe the estimated source (explicitly or implicitly)
- Redundancy: excess coding length above entropy
- ρ(x) = l(x) - N·H
11. Redundancy Bounds
- Rissanen's bound (K unknown parameters)
- E[ρ(x)] > (K/2)·log(N) + O(1)
- Worst-case redundancy for Bernoulli sequences (K = 1): ρ*(x) = max_{x ∈ X^N} ρ(x) ≈ 0.5 log(πN/2)
- Asymptotically, ρ*(x)/N → 0
- In practice (e.g., text), the number of parameters scales almost linearly with N
- Low redundancy is still essential
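For a sense of scale, a small sketch evaluating the K = 1 worst-case bound above at a few illustrative sequence lengths:

    from math import log2, pi

    # Worst-case Bernoulli redundancy ~ 0.5*log2(pi*N/2) from the slide above.
    for N in (10**3, 10**6, 10**9):
        print(N, 0.5 * log2(pi * N / 2))   # roughly 5.3, 10.3, and 15.3 bits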
12. Semi-Predictive Approach
- Semi-predictive methods describe x in two phases
- Phase I: find a good tree source structure S and describe it using codelength l_S
- Phase II: encode x using S with probability assignment p_S(x)
- Phase I estimates the minimum description length (MDL) tree source model S = arg min_S [l_S - log(p_S(x))]
13. Semi-Predictive Approach: Phase II
- Sequential encoding of x given S
- Determine which state s of S generated symbol x_i
- Assign x_i a conditional probability p(x_i|s)
- Arithmetic encoding
- p(x_i|s) can be based on the previously processed portion of x, quantized probability estimates, etc.
14. Context Trees
- We will provide an O(N) semi-predictive algorithm by estimating S using context trees
- Context trees arrange x in a tree
- Each node corresponds to the sequence of arc labels appended along the path to the root
- Internal nodes correspond to repeating contexts in x
- Leaves correspond to unique contexts
- The sentinel symbol x_0 ensures that all symbols have different contexts
15. Context Tree Pruning (To prune or not to prune)
- The MDL structure for state s yields the shortest description of the symbols generated by s
- When processing state s
- Estimate MDL structures for states 0s and 1s
- Decide whether to keep 0s and 1s or prune them into state s
- Base the decision on coding lengths
16. Phase I with Atomic Context Trees
- Atomic context tree
- Arc labels are atomic (single symbol)
- Internal nodes are not necessarily branching
- Has up to O(N^2) nodes
- The coding length minimization of Phase I processes each node of the context tree [Nohre 1994]
- With atomic context trees, the worst-case complexity is at least O(N^2)
17. Compact Context Trees
- Compact context tree
- Arc labels not necessarily atomic
- Internal nodes are branching
- O(N) nodes
- Compact representation of the same tree
- A depth-first traversal of the compact context tree provides O(N) complexity
- Theorem: Phase I of BWT-MDL requires O(N) operations performed with O(log(N)) bits of precision
18. Phase II of BWT-MDL
- We determine the generator state using a novel algorithm that is based on properties of the Burrows-Wheeler transform (BWT)
- Theorem: The BWT-MDL encoder requires O(N) operations performed with O(log(N)) bits of precision
- Theorem [Willems et al. 2000]: The redundancy w.r.t. any tree source S is at most |S|·0.5·log(N) + O(1) bits
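For readers unfamiliar with the BWT, here is a textbook sketch that computes it by sorting all cyclic rotations; this is only illustrative and is not the algorithm of the thesis, which obtains the transform via fast suffix sorting and uses further BWT properties to find generator states.

    def bwt(x, sentinel="\x00"):
        # Burrows-Wheeler transform via sorted cyclic rotations (illustrative,
        # not the O(N log^0.5 N) suffix-sorting construction from the thesis).
        s = x + sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("01011111111"))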
19. Distributed/Parallel Compression of Bernoulli Sequences
- A splitter partitions x into B blocks x(1), ..., x(B)
- Encoder j ∈ {1, ..., B} compresses x(j); it assigns probabilities p(x_i(j) = 1) = θ and p(x_i(j) = 0) = 1 - θ
- The total probability assigned to x is identical to that in a serial compression system
- This structure assumes that θ is known; our goal is to provide a universal parallel compression algorithm for Bernoulli sequences
20. Two-Part Codes
- Two-part codes use a semi-predictive approach to describe Bernoulli sequences
- First part of the code
- Determine the maximum likelihood (ML) parameter estimate θ_ML(x) = n_1/(n_0 + n_1)
- Quantize θ_ML(x) to r_k, one of K representation levels
- Describe the bin index k with log(K) bits
- Second part of the code: encode x using r_k
- In distributed systems
- Sequential compressors require O(N) internal communications
- Two-part codes need only communicate {n_0(j), n_1(j)}, j ∈ {1, ..., B}
- This requires O(B log(K)) internal communications
21. Jeffreys Two-Part Code
- Quantize θ_ML(x)
- Bin edges b_k = sin^2(πk/(2K))
- Representation levels r_k = sin^2(π(2k-1)/(4K))
- Use K = ⌈1.772·N^0.5⌉ bins
- Source description
- log(K) bits for describing the bin index k
- Need -n_1·log(θ_ML(x)) - n_0·log(1 - θ_ML(x)) bits for encoding x
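Putting the pieces of this slide together, here is a minimal Python sketch of the Jeffreys two-part code for a binary string. The helper name and the tie-breaking tolerance are assumptions; a real encoder would emit an arithmetic code driven by r_k, whereas this sketch only reports the resulting code lengths.

    from math import ceil, sin, pi, log2

    def jeffreys_two_part_codelength(x):
        N = len(x)
        n1 = x.count("1")
        n0 = N - n1
        theta_ml = n1 / N
        K = ceil(1.772 * N ** 0.5)                      # number of bins
        # Bin edges b_k = sin^2(pi*k/(2K)); find the bin containing theta_ml.
        k = next(k for k in range(1, K + 1)
                 if theta_ml <= sin(pi * k / (2 * K)) ** 2 + 1e-12)
        r_k = sin(pi * (2 * k - 1) / (4 * K)) ** 2       # representation level
        first_part = log2(K)                             # idealized bin-index cost
        second_part = -n1 * log2(r_k) - n0 * log2(1 - r_k)
        return k, r_k, first_part + second_part

    print(jeffreys_two_part_codelength("01011111111"))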
22. Redundancy of Jeffreys Code for Bernoulli Sequences
- Redundancy
- log(K) bits for describing k
- N·D(θ_ML(x)||r_k) bits for encoding x using the imprecise model
- D(a||b) is the Kullback-Leibler divergence
- In bin k, l(θ_ML(x)) = -θ_ML(x)·log(r_k) - (1 - θ_ML(x))·log(1 - r_k)
- l(θ_ML(x)) is a poly-line (piecewise linear in θ_ML(x))
- Redundancy = log(K) + N·l(θ_ML(x)) - N·H(θ_ML(x)) ≤ log(K) + N·L_∞
- Use quantizers that have a small L_∞ distance between the entropy function and the induced poly-line fit
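The same decomposition in code form: a small sketch that evaluates log2(K) + N·D(θ_ML||r_k) directly. The numeric arguments below reuse the values produced by the Jeffreys-quantizer sketch above and are purely illustrative.

    from math import log2

    def binary_kl(a, b):
        # Kullback-Leibler divergence D(a||b) between Bernoulli(a) and Bernoulli(b), in bits.
        terms = 0.0
        if a > 0:
            terms += a * log2(a / b)
        if a < 1:
            terms += (1 - a) * log2((1 - a) / (1 - b))
        return terms

    def two_part_redundancy(N, theta_ml, r_k, K):
        return log2(K) + N * binary_kl(theta_ml, r_k)

    print(two_part_redundancy(N=11, theta_ml=9/11, r_k=0.8536, K=6))   # about 2.7 bits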
23. Redundancy Properties
- For x s.t. θ_ML(x) is quantized to r_k, the worst-case redundancy is
- log(K) + N·max{D(b_k||r_k), D(b_{k-1}||r_k)}
- D(b_k||r_k) and D(b_{k-1}||r_k)
- Largest in the initial or end bins
- Similar in the middle bins
- The difference is reduced over a wider range of k for larger N (larger K)
- Can construct a near-optimal quantizer by modifying the initial and end bins of the Jeffreys quantizer
24. Redundancy Results
- Theorem: The worst-case redundancy of the Jeffreys code is 1.221 + O(1/N) bits above Rissanen's bound
- Theorem: The worst-case redundancy of the optimal two-part code is 1.047 + O(1) bits above Rissanen's bound
25. Parallel Universal Compression for Bernoulli Sequences
- Phase I
- Parallel units (PUs) compute symbol counts for the B blocks
- The coordinating unit (CU) computes and quantizes the ML parameter estimate θ_ML(x) and describes k
- Phase II: B PUs encode the B blocks based on r_k (see the sketch below)
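A minimal end-to-end sketch of this two-phase flow for a Bernoulli sequence. The block partition and helper names are assumptions, and in a real system the per-block codelengths would be produced by B arithmetic encoders running in parallel.

    from math import ceil, sin, pi, log2

    def psp_bernoulli_codelength(x, B):
        L = -(-len(x) // B)                              # ceil(N/B) symbols per block
        blocks = [x[j * L:(j + 1) * L] for j in range(B)]
        # Phase I: each PU reports only its symbol counts to the CU.
        counts = [(blk.count("0"), blk.count("1")) for blk in blocks]
        n0 = sum(c0 for c0, _ in counts)
        n1 = sum(c1 for _, c1 in counts)
        N = n0 + n1
        theta_ml = n1 / N
        K = ceil(1.772 * N ** 0.5)
        k = next(k for k in range(1, K + 1)
                 if theta_ml <= sin(pi * k / (2 * K)) ** 2 + 1e-12)
        r_k = sin(pi * (2 * k - 1) / (4 * K)) ** 2
        # Phase II: every PU encodes its own block with the shared r_k.
        block_bits = [-c1 * log2(r_k) - c0 * log2(1 - r_k) for c0, c1 in counts]
        return log2(K) + sum(block_bits)

    print(psp_bernoulli_codelength("0101111111101011111111", B=2))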
26. Why do we need Parallel Semi-Predictive Coding?
- Naïve parallelization
- Partition x into B blocks
- Compress blocks independently
- The redundancy for a length-N/B block is O(log(N/B))
- The total redundancy is O(B·log(N/B))
- Rissanen's bound is O(log(N))
- The redundancy with naïve parallelization is excessive! (illustrated below)
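A rough back-of-the-envelope comparison; the values of N and B are assumptions chosen only for illustration.

    from math import log2

    # Naive per-block redundancy, B*log2(N/B), versus the roughly 0.5*log2(N)
    # cost of describing a single Bernoulli parameter once.
    N, B = 2**20, 64
    print(B * log2(N / B))   # 896 bits
    print(0.5 * log2(N))     # 10 bits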
27. Parallel Semi-Predictive (PSP) Concept
- Phase I
- B parallel units (PUs) accumulate statistics (symbol counts) on the B blocks
- The coordinating unit (CU) computes the MDL tree source estimate S
- Phase II: B PUs compress the B blocks based on S
28. Source Description in PSP
- Phase I: the CU describes the structure of S and the quantized ML parameter estimates {k_s}, s ∈ S
- Phase II: each of the B PUs compresses block x(b) just like Phase II of the (serial) semi-predictive approach
29. Complexity of Phase I
- Phase I processes each node of the context tree [Nohre 1994]
- The CU processes the states of a full atomic context tree of depth D_max, where D_max ≤ log(N/B)
- Processing a node:
- An internal node requires O(1) time
- Leaf: the CU adds up the block symbol counts to compute each symbol count, i.e., n_s^α = Σ_b n_s^α(b), where α ∈ {0,1}
- The CU processes a leaf node in O(B) time
- With O(N/B) leaves, the aggregate complexity is O(N), which is excessive
30. Phase I in O(N/B) Time
- We want to compute n_s^α = Σ_b n_s^α(b) faster
- An adder tree incurs O(log(B)) delay for adding up B block symbol counts (see the sketch below)
- Pipelining enables us to generate a result every O(1) time
- O(N/B) nodes, each requiring O(1) time
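A small sketch of the adder-tree reduction; the pipelining itself is a hardware notion, so the code below only shows the O(log B)-level tree structure over assumed per-block counts.

    def adder_tree_sum(block_counts):
        # Reduce B per-block counts in O(log B) levels; with pipelined levels,
        # one aggregated count n_s = sum_b n_s(b) emerges per cycle.
        level = list(block_counts)
        while len(level) > 1:                  # each iteration is one tree level
            if len(level) % 2:
                level.append(0)                # pad to an even number of inputs
            level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        return level[0]

    print(adder_tree_sum([3, 5, 2, 7, 1, 4, 6, 2]))   # 30, after log2(8) = 3 levels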
31. Phase II in O(N/B) Time
- The challenging part of Phase II is determining s
- Define the context index of the length-D_max context s preceding x_i(b) as the binary number that represents s
- The length-2^D_max generator table g satisfies g_j = s ∈ S if s is a suffix of the context whose context index is j
- We can construct g in O(N/B) time (far from trivial!)
- Compute the context indices for all symbols of x(b) and determine the generating states via the generator table g (sketched below)
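A minimal sketch of the generator table for a hypothetical suffix set S (the efficient O(N/B)-time construction from the thesis is not reproduced here); states are written with the most recent symbol last.

    D_MAX = 3
    S = ["0", "01", "11"]                      # hypothetical MDL tree source

    def build_generator_table(states, d_max):
        # g[j] is the state of S that is a suffix of the length-d_max context
        # whose binary representation (context index) is j.
        g = []
        for j in range(2 ** d_max):
            context = format(j, "0{}b".format(d_max))
            g.append(next(s for s in states if context.endswith(s)))
        return g

    for j, s in enumerate(build_generator_table(S, D_MAX)):
        print(format(j, "03b"), "->", s)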
32. Decoder
- An input bus is demultiplexed to multiple units
- The MDL source and the quantized ML parameters are reconstructed
- The B compressed blocks y(1), ..., y(B) are decompressed on B decoding units
33. Theoretical Results
- Theorem: With computations performed with 2·log(N) bits of precision (defined as O(1) time),
- Phase I of PSP approximates the MDL coding length within O(1) of the true optimum
- The PSP algorithm requires O(N/B) time
- Theorem: The PSP algorithm uses a total of O(N) words of memory, i.e., a total of O(N·log(N)) bits
- Theorem: The pointwise redundancy of PSP w.r.t. S is ρ(x) < B·[log(N/B) + O(1)] + |S|·log(N)/2 + O(1), where the first term is the parallelization overhead
34. Main Contributions
- BWT-MDL (O(N) universal encoder)
- An O(N) algorithm that achieves Rissanen's redundancy bounds on the best achievable compression
- Combines efficient prefix tree construction with the semi-predictive approach to universal coding
- Fast Suffix Sorting (not in this talk)
- Core algorithm is very simple (can be implemented in VLSI)
- Worst-case complexity O(N log^0.5(N))
- Competitive with other suffix sorting methods in practice
- Two-Part Codes
- Rigorous analysis of their compression quality
- Application to distributed/parallel compression
- Optimal two-part codes
- Parallel Compression Algorithm (not in this talk)
- Work-efficient O(N/B) algorithm
- Compression loss is roughly B log(N/B) bits
35. More
- Results have been extended to M-ary alphabets
- Future research can concentrate on
- Processing broader classes of tree sources
- Problems in statistical inference
- Universal classification
- Channel decoding
- Prediction
- Characterizing the design space for parallel compression algorithms
36. Generic Phase I
if (s is a leaf)
    Count symbol appearances n_s^0 and n_s^1
    MDL_s <- length(n_s^0, n_s^1)
else  /* s is an internal node */
    Recursively compute the MDL lengths and counts for 0s and 1s
    n_s^0 <- n_{0s}^0 + n_{1s}^0
    n_s^1 <- n_{0s}^1 + n_{1s}^1
    MDL_s <- length(n_s^0, n_s^1)
    if (MDL_s > MDL_{0s} + MDL_{1s})
        Keep 0s and 1s
    else
        Prune 0s and 1s, keep s
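Below is a runnable Python sketch of the same recursion on an explicit binary sequence. The length() function is instantiated here with the Krichevsky-Trofimov estimator plus one structure bit per node, which is a common choice but not necessarily the exact cost used in BWT-MDL; boundary symbols without a full-depth context are simply skipped, whereas the talk handles them with the sentinel x_0.

    from math import lgamma, log

    def kt_codelength(n0, n1):
        # -log2 of the Krichevsky-Trofimov probability of a sequence with
        # n0 zeros and n1 ones (standing in for the "length" function above).
        logp = (lgamma(n0 + 0.5) + lgamma(n1 + 0.5) - 2 * lgamma(0.5)
                - lgamma(n0 + n1 + 1.0))
        return -logp / log(2)

    def symbol_counts(x, s):
        # Counts of 0s and 1s at positions whose preceding symbols end with s.
        n = [0, 0]
        for i in range(len(s), len(x)):
            if x[i - len(s):i] == s:
                n[int(x[i])] += 1
        return n

    def mdl_prune(x, s="", d_max=3):
        # Returns (MDL codelength, list of kept states) for the subtree rooted at s.
        n0, n1 = symbol_counts(x, s)
        prune_cost = 1 + kt_codelength(n0, n1)     # 1 structure bit + data at state s
        if len(s) == d_max or n0 + n1 == 0:
            return prune_cost, [s]
        mdl_0s, keep_0s = mdl_prune(x, "0" + s, d_max)
        mdl_1s, keep_1s = mdl_prune(x, "1" + s, d_max)
        split_cost = 1 + mdl_0s + mdl_1s           # 1 structure bit + child subtrees
        if prune_cost > split_cost:
            return split_cost, keep_0s + keep_1s   # keep 0s and 1s
        return prune_cost, [s]                     # prune 0s and 1s, keep s

    print(mdl_prune("01011111111"))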