Title: Data Compression with finite windows Fiala and Greene
1 Data Compression with Finite Windows, Fiala and Greene
3 Overview
- Our main purpose
  - See how a suffix tree supports a compression algorithm.
- What we will see
  - A data compression method that works by substituting text. It uses a
    modification of the basic suffix tree to support cyclic maintenance of
    the most recent strings seen in the file.
4 Outline
- 1. Compression
  - In general
  - Our algorithm
- 2. Data structure
  - Modification of the suffix tree
- 3. Theoretical considerations
  - Proofs
- 4. Improvements
6 Compression
- What is compression?
  - Compression is the coding of data to minimize its representation.
  - We will focus on lossless, adaptive, one-pass methods.
- Main approaches
  - Statistical approach: try to predict the next symbol.
  - Substitutional approach: replace blocks of text with references to
    earlier occurrences of identical text.
- We will focus on a substitutional method.
8 Compression (cont.)
- What characterizes a good compressor?
  - Good compression ratio.
  - Runs fast in compression.
  - Uses a minimum of space.
  - Runs fast in expansion.
- There are trade-offs between all of these.
- Naturally, we want to achieve them all:
  - A good algorithm plus a matching data structure.
9 Substitutional Compression
- Consider the following basic scheme.
- The compressed file contains two types of codewords:
  - literal x: pass the next x characters directly to the output.
  - copy x, y: go back y characters in the output and copy x characters,
    starting at that position.
13 Example
- ..it was the best of times,
- it was the worst of times..
- Would compress to:
  - (literal 26) it was the best of times,
  - (copy 11, 26) wor (copy 11, 27)
- The literal covers the 25 characters of the first line plus the line
  break. Each copy takes 11 characters, starting 26 and 27 characters back.
14 Example (cont.)
- And we get a very simple lossless method.
- The compression achieved depends on the size of the copy and literal
  codewords.
- Compression turns
  ..it was the best of times,
  it was the worst of times.
  into
  (literal 26) it was the best of times,
  (copy 11, 26) wor (copy 11, 27).
- Expansion reverses the mapping and recovers
  ..it was the best of times,
  it was the worst of times.
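The expansion step can be sketched in a few lines of Python. This is a minimal illustration of the literal/copy scheme, not the authors' implementation; the tuple representation of codewords and the name `expand` are assumptions made for the example.

```python
def expand(codewords):
    """Expand ('literal', text) and ('copy', length, displacement) codewords."""
    out = []
    for cw in codewords:
        if cw[0] == "literal":
            # A literal passes its characters straight to the output.
            out.extend(cw[1])
        else:
            _, length, displacement = cw
            start = len(out) - displacement
            # Copy one character at a time so a copy may overlap its own output.
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


# Codewords from the example (hypothetical tuple form).
codewords = [
    ("literal", "it was the best of times,\n"),  # 25 characters plus a line break
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
    ("literal", "."),
]
text = expand(codewords)
print(text)
```

Note that the expander needs no search structure at all: it only indexes into text it has already produced, which is why expansion is so much cheaper than compression.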
15 A1
- The encoding of A1
  - 8 bits for a literal codeword (bits 0..7): 0000xxxx, where xxxx gives
    a length of 1..16.
  - 16 bits for a copy codeword (bits 0..15): xxxxyy..yy, where xxxx gives
    a length of 2..16 and yy..yy gives a displacement of 1..4096.
- (Can you figure out the logic behind this?)
16 A1 (cont.)
- With this encoding (8 bits per literal codeword, 16 bits per copy
  codeword), the example compresses from 51 bytes to 36:
  - (literal 16) it was the best (literal 10) of times,
  - (copy 11, 26) wor (copy 11, 27)
- Byte count: 1 + 16 and 1 + 10 for the two literals, 2 for the first copy,
  1 + 3 for "wor" (which needs its own literal codeword), and 2 for the
  second copy.
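The byte-level layout can be checked with a short Python sketch. The exact bit assignment here is an assumption consistent with the slide: a first nibble of 0 marks a literal whose low nibble stores length - 1, while a non-zero first nibble stores copy length - 1, followed by 12 bits of displacement - 1. The names `encode_a1` and `decode_a1` are hypothetical.

```python
def encode_a1(codewords):
    """Pack ('literal', text) / ('copy', length, displacement) into bytes."""
    out = bytearray()
    for cw in codewords:
        if cw[0] == "literal":
            data = cw[1].encode("ascii")
            assert 1 <= len(data) <= 16
            out.append(len(data) - 1)               # 0000xxxx
            out += data
        else:
            _, length, disp = cw
            assert 2 <= length <= 16 and 1 <= disp <= 4096
            word = ((length - 1) << 12) | (disp - 1)  # xxxxyy..yy
            out += word.to_bytes(2, "big")
    return bytes(out)


def decode_a1(blob):
    """Inverse of encode_a1 under the same assumed bit layout."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] >> 4 == 0:                        # literal codeword
            n = (blob[i] & 0x0F) + 1
            out += blob[i + 1:i + 1 + n]
            i += 1 + n
        else:                                        # copy codeword
            word = int.from_bytes(blob[i:i + 2], "big")
            length, disp = (word >> 12) + 1, (word & 0x0FFF) + 1
            start = len(out) - disp
            for k in range(length):
                out.append(out[start + k])
            i += 2
    return out.decode("ascii")


original = "it was the best of times,\nit was the worst of times"
codewords = [
    ("literal", "it was the best "), ("literal", "of times,\n"),
    ("copy", 11, 26), ("literal", "wor"), ("copy", 11, 27),
]
blob = encode_a1(codewords)
print(len(original), len(blob))
```

Because a copy of length 1 is never worth 16 bits, the first nibble of a copy codeword is never zero, so the zero nibble is free to mark literals.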
17 A1's policy
- If the compressor is idle (it has just finished a codeword):
  - look for a copy of length at least 2;
  - otherwise, start a literal.
- If the compressor is in the middle of a literal:
  - extend it until a copy of length at least 3 is found.
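The policy above can be sketched as a greedy compressor in Python. This is only an illustration: it finds matches by brute-force search over the window rather than with the suffix tree discussed later, it flushes a literal at 16 characters (an assumption that follows from the 1..16 literal length field), and the names `longest_match`, `compress` and `expand` are hypothetical.

```python
WINDOW, MAX_LEN = 4096, 16


def longest_match(data, pos):
    """Brute force: (length, displacement) of the longest earlier match."""
    best_len, best_disp = 0, 0
    limit = min(MAX_LEN, len(data) - pos)
    for start in range(max(0, pos - WINDOW), pos):
        n = 0
        while n < limit and data[start + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_disp = n, pos - start
    return best_len, best_disp


def compress(data):
    out, literal, pos = [], "", 0
    while pos < len(data):
        length, disp = longest_match(data, pos)
        # Idle: accept copies of length >= 2. Inside a literal: need >= 3.
        threshold = 3 if literal else 2
        if length >= threshold:
            if literal:
                out.append(("literal", literal))
                literal = ""
            out.append(("copy", length, disp))
            pos += length
        else:
            literal += data[pos]
            pos += 1
            if len(literal) == MAX_LEN:      # literal length field is full
                out.append(("literal", literal))
                literal = ""
    if literal:
        out.append(("literal", literal))
    return out


def expand(codewords):
    out = []
    for cw in codewords:
        if cw[0] == "literal":
            out.extend(cw[1])
        else:
            start = len(out) - cw[2]
            for i in range(cw[1]):
                out.append(out[start + i])
    return "".join(out)


sample = "it was the best of times,\nit was the worst of times."
packed = compress(sample)
print(packed)
```

The higher threshold inside a literal reflects the codeword sizes: interrupting a literal for a 2-character copy costs a 2-byte copy codeword and usually a fresh literal header afterwards, so nothing is gained.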
18 Where do we stand?
- 1. Compression: in general, our algorithm (done)
- 2. Data structure: modification of the suffix tree (we are here)
- 3. Theoretical considerations: proofs
20 The Data Structure
- What do we need?
  - Find the current longest match (for a copy).
- What could we use?
  - Naive solution: a suffix tree holding all strings of length at most 16
    that start in the previous 4096-byte window.
21 Naive solution
- A suffix tree holding all strings of length at most 16 that start in the
  previous 4096-byte window.
[Figure: the 4096-byte window ending at the current position, with
16-character strings marked inside it.]
23 The cost
- If we descended d levels to insert the string starting at position j, we
  will descend at least d-1 levels to insert the string starting at j+1.
- So the cost of insertion is O(nd).
- But we want to eliminate d.
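The "at least d-1" observation can be checked empirically with a small Python sketch. It uses the longest-match length at each position as a stand-in for the depth reached in the tree, which is an assumption made for illustration; `match_lengths` is a hypothetical helper that searches the window by brute force.

```python
def match_lengths(data, window=4096, max_len=16):
    """Longest earlier match (capped at max_len) starting at each position."""
    lengths = []
    for pos in range(len(data)):
        best = 0
        limit = min(max_len, len(data) - pos)
        for start in range(max(0, pos - window), pos):
            n = 0
            while n < limit and data[start + n] == data[pos + n]:
                n += 1
            best = max(best, n)
        lengths.append(best)
    return lengths


data = "it was the best of times,\nit was the worst of times."
d = match_lengths(data)

# If position j matches d[j] characters at some earlier position i, then
# position j+1 matches at least d[j] - 1 characters at position i+1.
holds = all(d[j + 1] >= d[j] - 1 for j in range(len(d) - 1))
print(holds)
```

This is the property that suffix links exploit: most of the work done for position j is still valid for position j+1, so restarting from the root repeats it needlessly.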
24 Modifications
- a. Suffix links
  - Each node representing a string aX has a pointer to the node
    representing the string X.
- Immediate advantage
  - We don't need to return to the root after each insertion.
[Figure: a tree in which the node for aX has a suffix link to the node for X.]
25 Suffix Links
- How we use and create suffix links
- .. aXYb ..
- 1. Create a new node a, and insert b.
- 2. Use the suffix link to insert XYb:
  - a.1 Go up to b and cross to g, using the suffix link.
  - a.2 Rescan to d. If d doesn't exist, create it. Rescanning means we
    don't need to check the string again, but go straight to d.
  - a.3 Scan from d to insert XYb.
[Figure: the tree during the insertion, with the rescan and scan steps marked.]
31 Suffix Links (cont.)
- How we use and create suffix links
- .. aXYb ..
- 1. Create a new node a, and insert b.
- 2. Use the suffix link to insert XYb.
- 3. Add the new node's suffix link (to d).
- And we have finished the insertion!
- Invariant kept: every internal node has a suffix link (except one just
  created).
32 Demands from the data structure
- We explained insertion.
- What about deletion?
[Figure: the 4096-byte window, with the match, insert, and delete operations
marked.]
33 Modifications (cont.)
- Deletion
  - b. Leaves in a circular buffer: identify the oldest leaf and delete it.
  - c. Son count: when a node's son count falls to one, delete the node and
    combine the arcs.
[Figure: a tree whose internal node has son count 3, and a circular buffer of
4096 leaves.]
34 Is it enough?
- No.
- We still have a problem.
  - Higher pointers can become out of date.
  - But climbing up to update those pointers would take away the advantages
    of using the suffix links!
35 Modifications (last)
- d. Percolating updates
  - Each internal node has an update bit (true/false).
36 Percolating updates
- d. Percolating updates
  - When updating a node:
    - if its bit is true:
      - 1. set the bit to false.
      - 2. propagate the update to the parent.
    - if its bit is false:
      - 1. set the bit to true.
      - 2. stop the update.
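The update rule can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the `Node` class, its `pos` field and the function `percolate` are hypothetical, and refreshing the position of every visited node (not only those whose bit was true) is an assumption of the sketch.

```python
class Node:
    """Internal tree node with a window position and an update bit."""

    def __init__(self, parent=None):
        self.parent = parent
        self.pos = 0              # position in the file this node points to
        self.update_bit = False   # the true/false bit from the slide


def percolate(node, position):
    """Apply one update starting at `node`; return how many nodes were visited."""
    visited = 0
    while node is not None:
        visited += 1
        node.pos = position       # assumption: refresh on every visit
        if node.update_bit:
            node.update_bit = False
            node = node.parent    # bit was true: keep propagating upward
        else:
            node.update_bit = True
            break                 # bit was false: stop here
    return visited


# A small chain: root <- a <- b, with updates arriving at b.
root = Node()
a = Node(root)
b = Node(a)
work = [percolate(b, position) for position in (10, 20, 30, 40)]
print(work, a.pos, root.pos)
```

Each node passes on only every second update it receives, so the traffic halves at each level; this is what keeps the average work per update constant even though a single update can reach the root.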
38 Percolating updates (cont.)
- Effect
  - Keeps all internal pointers on positions within the 4096-byte window of
    the file.
- Cost
  - Worst case: the update propagates all the way to the root.
  - Amortized: summing over all new leaves, we get constant cost per leaf.
39 Summary of the inner loop
- The operations
- 1. Insert
  - a. insert the previous string.
  - b. use the suffix link to insert the next string.
- 2. Percolate an update from the leaf
  - if the bit is true:
    - set the position field of the node to the current position.
    - set the bit to false and propagate to the parent.
  - if the bit is false:
    - set it to true, and stop.
40 Summary (cont.)
- 3. Circular buffer
  - a. replace the oldest leaf with the new one.
  - b. if its parent has only one remaining son:
    - 1. delete the parent, and attach the remaining son to the grandparent.
    - 2. percolate the deleted node's position (special case: comparative
      percolation).
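The deletion step can be sketched in Python with a FIFO queue of leaves. This is a simplified illustration under stated assumptions: arc labels, suffix links and the percolation of the deleted node's position are all omitted, a `deque` stands in for the fixed-size circular buffer, and the names `Node` and `delete_oldest` are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def delete_oldest(leaves, root):
    """Remove the oldest leaf; splice out its parent if one son remains."""
    leaf = leaves.popleft()
    parent = leaf.parent
    parent.children.remove(leaf)
    if parent is not root and len(parent.children) == 1:
        only = parent.children[0]
        grand = parent.parent
        # Attach the remaining son to the grandparent in the parent's place.
        grand.children[grand.children.index(parent)] = only
        only.parent = grand
    return leaf


# root -> p -> {l1, l2}, root -> l3; leaves were inserted in the order l1, l3, l2.
root = Node("root")
p = Node("p", root)
l1 = Node("l1", p)
l3 = Node("l3", root)
l2 = Node("l2", p)
leaves = deque([l1, l3, l2])

removed = delete_oldest(leaves, root)
print(removed.name, [c.name for c in root.children])
```

In a real tree the splice would also concatenate the two arc labels, which is the "combine arcs" step from the earlier slide.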
41 Where do we stand?
- 1. Compression: in general, our algorithm (done)
- 2. Data structure: modification of the suffix tree (done)
- 3. Theoretical considerations: proofs (we are here)
42 Theoretical Considerations
- Correctness and linearity of suffix tree construction: we already saw that.
- We need to be convinced about destruction.
- Theorem 1
  - Deleting leaves in FIFO order and deleting internal nodes with single
    sons will never leave dangling suffix pointers.
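44 Proof
- Assume the contrary:
  - a points to g, and g was deleted.
- The existence of a means that
  - two strings agree for l characters and differ at l+1, for example
    df..gb and df..gz.
- Dropping the first character of each gives two strings that agree for l-1
  characters and differ at l. Each starts one position later than the string
  it came from, so FIFO deletion cannot have removed it first.
- This contradicts the assumption that g had only one son and was therefore
  deleted.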
45 Theoretical Considerations
- Theorem 2
  - Each percolated update has constant amortized cost.
- Proof
  - Assume a credit on each internal node whose update flag is true.
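47 Proof (cont.)
- A new node is added with two credits.
  - One is spent to update the parent.
  - The second is either given to the parent, and we terminate (the parent's
    flag was false), or combined with the parent's own credit to give two
    credits at the parent, and we continue (the flag was true). Apply this
    recursively at the parent.
- Result
  - The invariant is kept, and we get an amortized cost of two updates per
    new leaf.
[Figure: credits on false and true nodes before and after an update.]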
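The credit argument can also be written as a potential-function calculation. The following is a sketch under stated assumptions rather than the authors' own derivation: the symbols $\Phi$, $k$, $c$ and $\hat{c}$ are introduced here for illustration, all update bits are assumed to start false, and a percolation is assumed to pass through $k$ nodes whose bit is true before stopping at a node whose bit is false.

```latex
\begin{aligned}
\Phi &= \text{number of internal nodes whose update bit is true},\\
c &= k + 1 \quad\text{(nodes updated by one percolation)},\\
\Delta\Phi &= -k + 1 \quad\text{($k$ bits cleared, one bit set)},\\
\hat{c} &= c + \Delta\Phi = 2,\\
\sum_{i=1}^{m} c_i &= \sum_{i=1}^{m} \hat{c}_i - (\Phi_m - \Phi_0) \le 2m
\quad\text{since } \Phi_0 = 0 \text{ and } \Phi_m \ge 0 .
\end{aligned}
```

If the percolation instead runs past the root, then $c = k$ and $\Delta\Phi = -k$, so $\hat{c} = 0$ and the bound still holds. Deleting a node can only lower $\Phi$, which does not weaken the inequality.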
48 Theoretical Considerations
- Theorem 3 (effectiveness)
  - Using the percolating update, every internal node will be updated at
    least once in a period (4096).
- Proof
  - We will prove that every internal node is updated at least twice in a
    period, and thus propagates at least one update up.
51 Proof (cont.)
- (By contradiction) Find b, the farthest node from the root that doesn't
  propagate an update to its parent.
- 3 cases (a remaining child is one that has remained for the entire period)
  - a. b has two (or more) remaining children: both are farther from the
    root, and thus both updated b.
  - b. b has only one remaining child: one update comes from it, and the
    second from a new child when it is created (a new arc causes the son to
    update the parent).
  - c. b has two new children: similar.
- In all cases, b receives two updates during a period and thus propagates
  an update. Contradiction.
52 Other Theoretical Considerations (bounds on the compression)
- We have focused on the data structure.
- There are other questions, about the compression.
53 Other Theoretical Considerations (bounds on the compression), cont.
- Consider the following (the encoder is at position j):

  Position                j    j+1  j+2  j+3  j+4  j+5
  Copy length available   1    3    16   15   14   13

- A1:      (literal 1) x (copy 3, y) (copy 14, y)    6 bytes
- Optimal: (literal 2) xx (copy 16, y)               5 bytes
- How bad can it get?
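The byte counts in this example follow directly from the A1 codeword sizes, as the short Python check below shows. It assumes the sizes given earlier (a literal costs 1 header byte plus its characters, a copy costs 2 bytes); the helper names `encoded_bytes` and `covered` are hypothetical.

```python
def encoded_bytes(codewords):
    """Total encoded size: literal = 1 + length bytes, copy = 2 bytes."""
    total = 0
    for kind, length in codewords:
        total += (1 + length) if kind == "literal" else 2
    return total


def covered(codewords):
    """Number of input characters the codewords account for."""
    return sum(length for _, length in codewords)


# The two parses from the slide, as (kind, length) pairs.
a1 = [("literal", 1), ("copy", 3), ("copy", 14)]
optimal = [("literal", 2), ("copy", 16)]

print(covered(a1), encoded_bytes(a1))            # both parses cover 18 characters
print(covered(optimal), encoded_bytes(optimal))
```

The greedy parse loses because taking the length-3 copy at j+1 prevents it from reaching the length-16 copy available one position later.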
54 Heuristic vs. Optimal
- Foresight algorithms
  - Must make more than one pass, so we pay heavily in time.
- And the gain? (Optimal vs. A1)
  - On average: about 1% better.
  - In the worst case: 20%.
55 Back to our business
56 A1's virtues
- Simple one-pass, adaptive, lossless method.
- Natural approach to 8 bits per character.
- Performance
  - Compression ratio: up to 1/8.
  - Expander: fast, simple, small storage requirements.
  - Compressor: much slower and larger.
  - (all in comparison to other copy/literal methods)
57 Improvements
- Enlarge the window: gain compression ratio, pay in space and speed.
- Enlarge the copy length: same.
- Change the encoding: gain performance, pay in simplicity.
- Change the update policy: gain compression speed, pay in space and
  expansion speed.
58 Summary
- We introduced the compression problem and proposed a simple substitutional
  compression algorithm, based on copy/literal codewords.
- Our main interest was the data structure. We saw how a modification of the
  basic suffix tree answers the algorithm's demands, and at what cost.