Data Compression with finite windows Fiala and Greene - PowerPoint PPT Presentation

Slides: 60
Provided by: Gio3

Transcript and Presenter's Notes

Title: Data Compression with finite windows Fiala and Greene


1
Data Compression with Finite Windows (Fiala and Greene)
  • Speaker: Giora Alexandron

3
Overview
  • Our main purpose:
  • See how a suffix tree supports a compression
    algorithm.
  • What we will see:
  • A data compression method that works by
    substituting text. It uses a
    modification of the basic suffix tree
    to support cyclic maintenance of the most
    recent strings seen in the file.

4
Outline
  • 1. Compression
  • - In general
  • - Our algorithm
  • 2. Data structure
  • - Modification of the suffix tree.
  • 3. Theoretical considerations
  • - Proofs.
  • 4. Improvements.

6
Compression
  • What is compression?
  • Compression is the coding of data to
    minimize its representation.
    We will focus on
  • lossless, adaptive, one-pass methods.
  • Main approaches:
  • Statistical approach: try to predict the
    next symbol.
  • Substitutional approach: replace blocks of
    text with references to earlier
    occurrences of identical text.
  • We will focus on a substitutional method.

8
Compression (cont.)
  • What characterizes a good compressor?
  • - Good compression ratio.
  • - Fast compression.
  • - Minimal space usage.
  • - Fast expansion.
  • There are trade-offs among all of these.
  • Naturally, we want to achieve them all:
  • A good algorithm and a matching data
    structure.

9
Substitutional Compression
  • Consider the following basic scheme.
  • The compressed file contains two types
    of codewords:
  • literal x: pass the next x characters directly
    to the output.
  • copy x, y: go back y characters and copy the
    next x characters, starting at that
    position.
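
The expansion side of this scheme can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the tuple representation of codewords and the name `expand` are assumptions made here, and the displacement is written as a positive "go back" count. The sample text assumes the first line ends with a newline, which is what makes it 26 characters long.

```python
def expand(codewords):
    """Expand a list of ('literal', text) and ('copy', length, back) codewords."""
    out = []
    for cw in codewords:
        if cw[0] == "literal":
            # literal: pass the characters straight to the output
            out.extend(cw[1])
        else:
            _, length, back = cw
            start = len(out) - back      # go back `back` characters
            for i in range(length):      # copy one by one so overlapping copies work
                out.append(out[start + i])
    return "".join(out)


text = expand([
    ("literal", "it was the best of times,\n"),
    ("copy", 11, 26),
    ("literal", "wor"),
    ("copy", 11, 27),
])
print(text)
```

Copying character by character means a copy may overlap the text it is producing, so `("literal", "ab"), ("copy", 4, 2)` expands to `ababab`.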

13
Example
  • ..it was the best of times,
  • it was the worst of times..
  • Would compress to:
  • (literal 26) it was the best of times,
  • (copy 11 -26) wor (copy 11 -27)

[Figure: arrows mark the two copies, each 11 characters long, taken from 26 and 27 characters back.]
14
Example (cont.)
  • And we get a very simple lossless method.
  • The compression achieved depends on the sizes of
    the copy and literal codewords.

[Figure: compression turns "..it was the best of times, it was the worst of times." into "(literal 26) it was the best of times, (copy 11 -26) wor (copy 11 -27)."; expansion turns it back into the original text.]
15
A1
  • The encoding of A1:
  • - 8 bits for a literal codeword:
    0000xxxx (bits 0..7), where xxxx is the literal length, 1..16.
  • - 16 bits for a copy codeword:
    xxxxyy..yy (bits 0..15), where xxxx is the copy length, 2..16,
    and yy..yy is the displacement, 1..4096.
  • (Can you figure out the logic behind this?)

16
A1
  • The encoding of A1:
  • - 8 bits for a literal codeword:
    0000xxxx (bits 0..7), where xxxx is the literal length, 1..16.
  • - 16 bits for a copy codeword:
    xxxxyy..yy (bits 0..15), where xxxx is the copy length, 2..16,
    and yy..yy is the displacement, 1..4096.
  • And we get a compression from 51 bytes to 36:
  • (literal 16) it was the best (literal 10) of
    times,
  • (copy 11 -26) wor (copy 11 -27)

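
The 51-to-36 figure can be checked with a short size calculation. This is a sketch under stated assumptions: the slide does not show a codeword in front of "wor", so it is treated here as a 3-character literal (that is what makes the total come to 36). The bit packing is also an assumption: lengths and displacements are stored as value minus one, which fits the stated ranges and keeps the top four bits of a copy codeword nonzero, so they cannot be mistaken for the 0000 prefix of a literal. The function names are hypothetical.

```python
def pack_literal(length):
    """8-bit literal codeword 0000xxxx; assumes xxxx stores length - 1."""
    assert 1 <= length <= 16
    return length - 1


def pack_copy(length, displacement):
    """16-bit copy codeword xxxxyy..yy; assumes xxxx = length - 1 (1..15)
    and the 12 y bits store displacement - 1."""
    assert 2 <= length <= 16 and 1 <= displacement <= 4096
    return ((length - 1) << 12) | (displacement - 1)


def compressed_size(codewords):
    """Bytes used: 1 + n for a literal of n characters, 2 for a copy."""
    total = 0
    for cw in codewords:
        total += 1 + cw[1] if cw[0] == "literal" else 2
    return total


# The parse from the slide, with the assumed (literal 3) in front of "wor".
a1_output = [("literal", 16), ("literal", 10), ("copy", 11, 26),
             ("literal", 3), ("copy", 11, 27)]
original_size = sum(cw[1] for cw in a1_output)   # characters covered
size = compressed_size(a1_output)
print(original_size, size)
```

The first literal is split into 16 + 10 because a literal codeword can carry at most 16 characters.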
17
A1's policy
  • If the compressor is idle (it has just finished a codeword):
  • look for a copy of length ≥ 2;
  • otherwise, start a literal.
  • If the compressor is in the middle of a literal:
  • extend it until a copy of length ≥ 3 is found.
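
The policy above can be sketched as a greedy encoder. This is an illustration only: the longest match is found by brute force over the window rather than with the suffix tree the talk goes on to describe, and the names `longest_match` and `a1_compress` and the tuple codewords are assumptions made for this sketch. It also closes a literal when it reaches the 16-character limit.

```python
WINDOW, MAX_LEN = 4096, 16


def longest_match(data, pos):
    """Brute-force longest match for data[pos:] starting inside the window."""
    best_len, best_back = 0, 0
    for start in range(max(0, pos - WINDOW), pos):
        k = 0
        while (k < MAX_LEN and pos + k < len(data)
               and data[start + k] == data[pos + k]):
            k += 1
        if k > best_len:
            best_len, best_back = k, pos - start
    return best_len, best_back


def a1_compress(data):
    out, literal, pos = [], "", 0
    while pos < len(data):
        length, back = longest_match(data, pos)
        threshold = 3 if literal else 2      # mid-literal needs a longer copy
        if length >= threshold:
            if literal:                      # close the pending literal first
                out.append(("literal", literal))
                literal = ""
            out.append(("copy", length, back))
            pos += length
        else:
            literal += data[pos]             # start or extend a literal
            pos += 1
            if len(literal) == MAX_LEN:      # literal length limit reached
                out.append(("literal", literal))
                literal = ""
    if literal:
        out.append(("literal", literal))
    return out


codewords = a1_compress("it was the best of times,\nit was the worst of times")
for cw in codewords:
    print(cw)
```

On this input the sketch produces two literals (16 and 10 characters), a copy of 11 from 26 back, the literal "wor", and a copy of 11 from 27 back.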

18
Where do we stand?
  • 1. Compression (in general, our algorithm): done.
  • 2. Data structure (modification of the suffix tree): we are here.
  • 3. Theoretical considerations (proofs).
20
The Data Structure
  • What do we need?
  • Find the current longest match (for a copy).
  • What could we use?
  • Naive solution:
  • A suffix tree with all strings of length ≤ 16 in
    the previous 4096-byte window.

21
Naive solution
  • A suffix tree with all strings of length ≤ 16 in
    the previous 4096-byte window.

[Figure: a 4096-byte window ending at the current position, with strings of length 16 marked inside it.]
23
The cost
  • If we descended d levels to insert the string starting
    at position j,
  • we will descend at least d-1 levels to
    insert the string starting at j+1.
  • So insertion costs O(nd).
  • But we want to eliminate d.

[Figure: positions j and j+1 in the 4096-byte window; the string at j descends d levels and the string at j+1 at least d-1 levels.]
24
Modifications
  • a. Suffix links
  • Each node representing the string aX
  • has a pointer to the node representing
  • the string X.
  • Immediate advantage:
  • We don't need to return to the root after each
    insertion.

[Figure: a suffix tree fragment in which the node for aX has a suffix link to the node for X.]
29
Suffix Links (cont.)
  • How we use and create suffix links:
  • .. aXYb ..
  • 1. Create a new node α, and insert b.
  • 2. a. Use the suffix link to insert XYb:
  • a.1 We go up to β, and cross to γ
  • using the suffix link.
  • a.2 Rescan to δ (which does not necessarily exist).
  • If δ doesn't exist, create it!
  • Rescan means we don't need to check the string
    again, but go straight to δ.

[Figure: the suffix link from β (string aX) to γ (string X), and the rescan of Y from γ down to δ.]
30
Suffix Links (cont.)
  • How we use and create suffix links:
  • .. aXYb ..
  • 1. Create a new node α, and insert b.
  • 2. a. Use the suffix link to insert XYb:
  • a.1 We go up to β, and cross to γ
  • using the suffix link.
  • a.2 Rescan to δ.
  • a.3 Scan from δ to insert XYb.

[Figure: the suffix link from β to γ, the rescan from γ to δ, and the scan below δ.]
31
Suffix Links (cont.)
  • How we use and create suffix links:
  • .. aXYb ..
  • 1. Create a new node α, and insert b.
  • 2. Use the suffix link to insert XYb.
  • 3. Add δ as the suffix link of α.
  • And we are finished with the insertion!
  • Invariant kept: every internal node has a suffix
    link (except one just created).

32
  • Demands from the data structure
  • We explained insertion.
  • What about deletion?

[Figure: the 4096-byte window, labeled with the operations match, insert, and delete.]
33
Modifications (cont.)
  • Deletion
  • b. Leaves in a circular buffer:
  • identify the oldest leaf and delete it.
  • c. Son count:
  • when it falls to one, delete the node
  • and combine the arcs.

[Figure: a tree fragment with a node whose son count is 3, and the leaves held in a circular buffer of 4096 entries.]
34
Is it enough?
  • No.
  • We still have a problem.
  • Pointers in higher (internal) nodes can become out of date.
  • But climbing up to update those pointers would
    cancel the advantage of using the suffix
    links!

35
Modifications (last)
  • d. Percolating updates
  • Each internal node has an update bit (true/false).

36
Percolating updates
  • d. Percolating updates:
  • When updating a node:
  • If its bit is true:
  • 1. set the bit to false.
  • 2. propagate the update to the parent.
  • If its bit is false:
  • 1. set the bit to true.
  • 2. stop the update.

38
Percolating updates (cont.)
  • Effect:
  • Keeps all internal position pointers
  • within the 4096-byte window of the file.
  • Cost:
  • Worst case:
  • the update propagates up to the root.
  • Amortized:
  • summing over all new leaves, we get
    constant cost per leaf.
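
The update-bit rule can be sketched on a tiny chain of nodes. This is an illustration under assumptions: the `Node` class and `percolate` function are hypothetical names, and the sketch writes the new position into every node the update reaches, which is one reading of the rule rather than a confirmed detail of the original implementation.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.pos = 0           # a window position where this node's string occurs
        self.update = False    # the update bit


def percolate(node, position):
    """Send an update carrying `position` upward from `node`.

    Returns the number of nodes visited."""
    steps = 0
    while node is not None:
        steps += 1
        node.pos = position        # assumption: refresh every visited node
        if node.update:
            node.update = False    # bit was true: clear it, continue to parent
            node = node.parent
        else:
            node.update = True     # bit was false: set it and stop
            break
    return steps


# A chain root <- a <- b; four updates arrive at b, as if from new leaves.
root = Node()
a = Node(parent=root)
b = Node(parent=a)
steps = [percolate(b, p) for p in (10, 11, 12, 13)]
print(steps, root.pos)
```

Every second update that reaches a node is passed on to its parent, so the bits behave like a binary counter: most updates stop quickly and only occasionally does one climb to the root.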

39
Summary of the inner loop
  • The operations:
  • 1. Insert
  • a. insert the previous string.
  • b. use the suffix link to insert the next
    string.
  • 2. Percolate an update from the leaf
  • if the bit is true:
  • set the position field of the node to the
    current position.
  • set the bit to false and propagate to the
    parent.
  • if the bit is false:
  • set it to true, and stop.

40
Summary (cont.)
  • 3. Circular buffer
  • a. replace the oldest leaf with the new
    one.
  • b. if its parent has only one remaining
    son:
  • 1. delete the parent, and attach the
    remaining son
  • to the grandparent.
  • 2. percolate the deleted node's
    position
  • (special case: comparative
    percolation).

41
Where do we stand?
  • 1. Compression (in general, our algorithm): done.
  • 2. Data structure (modification of the suffix tree): done.
  • 3. Theoretical considerations (proofs): we are here.
42
Theoretical Considerations
  • Correctness and linearity of suffix tree
    construction:
  • we already saw that.
  • We need to be convinced about destruction.
  • Theorem 1
  • Deleting leaves in FIFO order and deleting
    internal nodes
  • with single sons will never leave dangling
    suffix pointers.

44
Proof
  • Assume the contrary:
  • α points to γ, which was deleted.
  • The existence of α means
  • two strings agree for l characters and differ at l+1:
  • df..gb and df..gz
  • Dropping the first character, two strings agree for l-1 characters and differ at l.
  • This contradicts the assumption that γ had only one son and
    was therefore deleted.

45
Theoretical Considerations
  • Theorem 2
  • Each percolated update has constant amortized
    cost.
  • Proof
  • Assume a credit on each internal node
  • whose update flag is true.

47
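  • A new node is added with two credits:
  • One is spent to update the parent.
  • The second is either given to the parent, and the update terminates
    (the parent's flag was false),
  • or joins the parent's own credit to give two credits on the parent, and the update
    continues (the parent's flag was true).
  • Apply recursively to the parent.
  • Result:
  • the invariant is kept, and we get an amortized cost of
    two
  • updates per new leaf.

[Figure: credit counts before and after an update: a parent whose flag is false goes from 0 to 1 credit; a parent whose flag is true goes from 1 to 2 credits and passes the update on.]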
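
The credit argument can also be written as a potential-function calculation. This is a restatement offered as a sketch, not taken from the slides: the symbols Φ (the number of internal nodes whose update flag is true), k (the number of true flags an update clears before it stops), c (actual cost), and ĉ (amortized cost) are introduced here for illustration, and node deletions are ignored.

```latex
% An update that clears k true flags and then sets one false flag
% visits k + 1 nodes and changes the potential by 1 - k.
\begin{align*}
\hat{c} &= c + \Delta\Phi = (k + 1) + (1 - k) = 2, \\
\sum_{i=1}^{n} c_i
  &= \sum_{i=1}^{n} \hat{c}_i - (\Phi_n - \Phi_0)
  \;\le\; 2n + \Phi_0 .
\end{align*}
% If the update instead runs past the root after clearing k flags,
% then c = k and \Delta\Phi = -k, so \hat{c} = 0 \le 2 and the bound still holds.
```

Starting with all flags false gives Φ₀ = 0, so n updates visit at most 2n nodes in total, which matches the two updates per new leaf stated above.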
48
Theoretical Considerations
  • Theorem 3 (effectiveness)
  • Using the percolating update, every internal node
    will
  • be updated at least once in a period (4096).
  • Proof
  • We will prove that every internal node is
  • updated at least twice in a period, and thus
    propagates
  • at least one update up.

51
  • (By contradiction) Find β, the farthest node from
    the root that
  • doesn't propagate an update to its
    parent.
  • 3 cases:
  • a. β has two (or more) children that have remained for the entire period:
  • both are farther from the root, and thus each has updated
    it.
  • b. β has only one remaining child:
  • one update comes from it, and a second from a new child
    when it is created
  • (a new arc causes the son to update the parent).
  • c. β has two new children: similar.
  • In all cases, β receives two updates during a
    period,
  • and thus propagates an update. Contradiction.

52
Other Theoretical Considerations (bounds on the
compression)
  • We have focused on the data structure.
  • There are other questions, about the compression.


53
Other Theoretical Considerations (bounds on the
compression)
  • Consider the following:
  • Position:              j   j+1  j+2  j+3  j+4  j+5
  • Copy length available: 1   3    16   15   14   13
  • A1: (literal 1) x (copy 3 y) (copy 14 y),
    6 bytes
  • Optimal: (literal 2) xx (copy 16 y),
    5 bytes
  • How bad can it get?

[Figure: an arrow marks the encoder's current position, and brackets mark the A1 and optimal parses.]
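
The two parses can be reproduced from the table of available copy lengths. This is a sketch with assumptions: the table is taken to cover the consecutive positions j to j+5, the greedy rule is A1's policy (idle: take a copy of length 2 or more; mid-literal: wait for a copy of length 3 or more), the 16-character literal limit is ignored, and `a1_parse` and `size_in_bytes` are hypothetical names. The optimal parse is written out by hand, not computed.

```python
def a1_parse(avail):
    """Apply A1's greedy policy to a table of available copy lengths."""
    out, i, lit = [], 0, 0
    while i < len(avail):
        need = 3 if lit else 2           # a pending literal raises the bar
        if avail[i] >= need:
            if lit:
                out.append(("literal", lit))
                lit = 0
            out.append(("copy", avail[i]))
            i += avail[i]                # skip the characters just copied
        else:
            lit += 1
            i += 1
    if lit:
        out.append(("literal", lit))
    return out


def size_in_bytes(codewords):
    """1 + n bytes for a literal of n characters, 2 bytes for a copy."""
    return sum(1 + cw[1] if cw[0] == "literal" else 2 for cw in codewords)


avail = [1, 3, 16, 15, 14, 13]           # copy lengths at j, j+1, ..., j+5
a1 = a1_parse(avail)
optimal = [("literal", 2), ("copy", 16)]  # the hand-written better parse
print(a1, size_in_bytes(a1), size_in_bytes(optimal))
```

Both parses cover the same 18 characters; the greedy one commits to the short copy at j+1 and so misses the 16-character copy that starts one position later.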
54
Heuristic vs. Optimal
  • Foresight algorithms:
  • Must make more than one pass, so we pay big
    time.
  • And the gain?
  • (Optimal vs. A1):
  • On average: about 1% better.
  • In the worst case: 20%.

55
Back to our business
56
A1's virtues
  • - Simple one-pass, adaptive, lossless method.
  • - A natural approach for 8 bits per character.
  • Performance:
  • - Compression ratio: up to 1/8.
  • - Expander: fast and simple, with small storage
    requirements.
  • - Compressor: much slower and larger.
  • (All in comparison to other copy/literal methods.)

57
Improvements
  • - Enlarge the window: gain compression ratio,
    pay in space and speed.
  • - Enlarge the copy length: same.
  • - Change the encoding: gain performance, pay in
    simplicity.
  • - Change the update policy: gain compression speed,
    pay in space and expansion speed.

58
Summary
  • We introduced the compression problem and proposed
    a simple substitutional compression
    algorithm based on copy/literal codewords.
  • Our main interest was the data structure. We saw
    how a modification of the basic suffix tree meets
    the algorithm's demands, and at what cost.
