Chapter 9 LZ78 Compression - PowerPoint PPT Presentation

About This Presentation

Title:

Chapter 9 LZ78 Compression

Description:

The LZ77 algorithms have a deficiency that these algorithms use only a small ... The algorithm, popularly referred to as LZ78, was published in 'Compression of ... – PowerPoint PPT presentation

Slides: 24

Provided by: yr53

Transcript and Presenter's Notes

Title: Chapter 9 LZ78 Compression

1
Chapter 9 LZ78 Compression
The Data Compression Book
2
Deficiency in LZ77
LZ77 algorithms have two deficiencies. First, they use
only a small window into previously seen text, so they
continuously throw away valuable phrases as those
phrases slide out of the dictionary. Second, the size
of a phrase that can be matched is limited.
3
9.1 Can LZ77 Improve?
Suppose that instead of a 4K window and a seventeen-byte
look-ahead buffer, we use a 64K text window and a 1K
look-ahead buffer. What happens then?
We need 16 bits to encode an index location and 10 bits
for a phrase length, so 27 bits to encode a phrase (the
remaining bit is presumably the flag that distinguishes
a phrase from a plain character). This changes the
BREAK_EVEN point in the program from just under two
characters to three characters. It also drastically
increases the amount of CPU time needed to perform
compression.
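The break-even arithmetic can be checked with a short sketch. It assumes a one-bit flag on every token and an 8-bit character, so a plain character costs 9 bits; it also assumes the seventeen-byte buffer is encoded in a 4-bit length field. The function names are hypothetical and do not come from the book's programs.

```c
#include <assert.h>

/* Bits in one phrase token: index + length + an assumed 1-bit flag. */
int phrase_token_bits(int index_bits, int length_bits)
{
    return index_bits + length_bits + 1;
}

/* Phrase length, in characters, at which a phrase token costs the same
   as sending the characters one by one. Each plain character is
   assumed to cost 8 bits plus the 1-bit flag. */
double break_even(int index_bits, int length_bits)
{
    const int literal_bits = 8 + 1;

    assert(index_bits > 0 && length_bits > 0);
    return (double) phrase_token_bits(index_bits, length_bits) / literal_bits;
}
```

With a 12-bit index and a 4-bit length, break_even() gives 17/9, just under two characters. With a 16-bit index and a 10-bit length it gives 27/9, exactly three.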
4
9.2 Enter LZ78
The algorithm, popularly referred to as LZ78, was
published in "Compression of Individual Sequences
via Variable-Rate Coding" in IEEE Transactions on
Information Theory (September 1978). LZ78
abandons the concept of a text window. Under LZ77
the dictionary of phrases was defined by a fixed
window of previously seen text. Under LZ78, the
dictionary is a potentially unlimited list of
previously seen phrases. LZ78 is similar to LZ77
in some ways. LZ77 outputs a series of tokens.
Each token has three components: a phrase
location, the phrase length, and a character that
follows the phrase. LZ78 also outputs a series of
tokens with essentially the same meanings. Each
LZ78 token consists of a code that selects a
given phrase and a single character that follows
the phrase. Unlike LZ77, the phrase length is not
passed, since the decoder already knows it.
5
9.2.1 LZ78 Details
When using the LZ78 algorithm, both the encoder and
the decoder start off with a nearly empty
dictionary. By definition, the dictionary has a
single encoded string: the null string. As each
character is read in, it is added to the current
string. As long as the current string matches
some phrase in the dictionary, this process
continues. When the string no longer matches, the
encoder outputs a token (the code of the last
matching phrase) and a character (the one that broke
the match). The new phrase, consisting of the
dictionary match and the new character, is added
to the dictionary.
6
9.2.1 LZ78 Details (cont.1)
A code fragment to implement the algorithm:

    for ( ; ; ) {
        current_match = 1;
        current_length = 0;
        memset( test_string, '\0', MAX_STRING );
        /* Grow the string until it is no longer in the dictionary. */
        for ( ; ; ) {
            test_string[ current_length++ ] = getc( input );
            new_match = find_match( test_string );
            if ( new_match == -1 )
                break;
            current_match = new_match;
        }
        /* Emit the last matching phrase and the character that broke it. */
        output_code( current_match );
        output_char( test_string[ current_length - 1 ] );
        add_string_to_dictionary( test_string );
    }
7
9.2.1 LZ78 Details (cont.2)
An example of the encoder output.
Input text: "DAD DADA DADDY DADO..."

    Output Phrase   Output Character   Encoded String
    0               'D'                "D"
    0               'A'                "A"
    1               ' '                "D "
    1               'A'                "DA"
    4               ' '                "DA "
    4               'D'                "DAD"
    1               'Y'                "DY"
    0               ' '                " "
    6               'O'                "DADO"

The dictionary:

    0   ""
    1   "D"
    2   "A"
    3   "D "
    4   "DA"
    5   "DA "
    6   "DAD"
    7   "DY"
    8   " "
    9   "DADO"
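The example can be reproduced with a small in-memory encoder. This is a minimal sketch, not the book's code: the names lz78_encode, lz78_token and phrases are hypothetical, the dictionary is a fixed-size array searched linearly, and the null string is given code 0 to match the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_PHRASES    64   /* assumed limits, large enough for the example */
#define MAX_PHRASE_LEN 32

struct lz78_token {
    int phrase;             /* dictionary index of the matched phrase */
    char character;         /* character that follows the phrase */
};

char phrases[MAX_PHRASES][MAX_PHRASE_LEN];
int phrase_count;

/* Return the index of the phrase equal to s[0..len-1], or -1. */
static int find_phrase(const char *s, size_t len)
{
    for (int i = 0; i < phrase_count; i++)
        if (strlen(phrases[i]) == len && memcmp(phrases[i], s, len) == 0)
            return i;
    return -1;
}

/* Encode text into tokens; returns the number of tokens written. */
int lz78_encode(const char *text, struct lz78_token *out, int max_tokens)
{
    size_t n = strlen(text), pos = 0;
    int count = 0;

    assert(out != NULL && max_tokens >= 0);
    phrases[0][0] = '\0';           /* entry 0 is the null string */
    phrase_count = 1;
    while (pos < n && count < max_tokens) {
        int match = 0;
        size_t len = 0;

        /* Extend the match, always leaving one character to emit. */
        while (pos + len + 1 < n) {
            int next = find_phrase(text + pos, len + 1);
            if (next < 0)
                break;
            match = next;
            len++;
        }
        out[count].phrase = match;
        out[count].character = text[pos + len];
        count++;
        /* The new phrase is the match plus the emitted character. */
        if (phrase_count < MAX_PHRASES && len + 1 < MAX_PHRASE_LEN) {
            memcpy(phrases[phrase_count], text + pos, len + 1);
            phrases[phrase_count][len + 1] = '\0';
            phrase_count++;
        }
        pos += len + 1;
    }
    return count;
}
```

Calling lz78_encode("DAD DADA DADDY DADO", tokens, 16) produces nine tokens and leaves "DADO" in dictionary slot 9. A real encoder would replace the linear search with the tree described in the next slide.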
8
9.2.2 LZ78 Implementation
The real difficulty with LZ78 actually comes in
managing the dictionary. These phrases are
conventionally stored in a multi-way tree. The
major difficulty with managing a tree such as
this is the potentially large number of branches
off of each node. One negative side effect of
LZ78 not found in LZ77 is that the decoder has to
maintain this tree as well. Another issue is that
of the dictionary filling up.
9
9.3 An Effective Variant
Under LZW, the variant of LZ78 described by Terry Welch,
the compressor never outputs single characters, only
phrases. It does this by preloading the phrase
dictionary with single-symbol phrases, one for each
symbol in the alphabet. The LZW compression algorithm
in its simplest form:

    int character;
    char c;

    old_string[ 0 ] = getc( input );
    old_string[ 1 ] = '\0';
    /* Test the result of getc() rather than feof(), so that EOF
       itself is never appended to the string. */
    while ( ( character = getc( input ) ) != EOF ) {
        c = (char) character;
        strcpy( new_string, old_string );
        strncat( new_string, &c, 1 );
        if ( in_dictionary( new_string ) )
            strcpy( old_string, new_string );
        else {
            code = look_up_dictionary( old_string );
            output_code( code );
            add_to_dictionary( new_string );
            old_string[ 0 ] = c;
            old_string[ 1 ] = '\0';
        }
    }
    /* Flush the phrase still pending at end of input. */
    code = look_up_dictionary( old_string );
    output_code( code );
10
9.3 An Effective Variant (cont 1)
A sample string used to demonstrate the algorithm.
Input String: " WED WE WEE WEB WET"

    Characters Input   Code Output   New code value and associated string
    " W"               ' '           256 = " W"
    "E"                'W'           257 = "WE"
    "D"                'E'           258 = "ED"
    " "                'D'           259 = "D "
    "WE"               256           260 = " WE"
    " "                'E'           261 = "E "
    "WEE"              260           262 = " WEE"
    " W"               261           263 = "E W"
    "EB"               257           264 = "WEB"
    " "                'B'           265 = "B "
    "WET"              260           266 = " WET"
    <EOF>              'T'
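The table can be reproduced with an in-memory version of the simple algorithm. This is a sketch under stated assumptions: lzw_compress, lzw_lookup and lzw_table are hypothetical names, new codes start at 256 as in the example, and the dictionary is a small array searched linearly rather than a tree.

```c
#include <assert.h>
#include <string.h>

#define FIRST_CODE  256     /* first code after the single characters */
#define MAX_ENTRIES 64      /* assumed limits, enough for the example */
#define MAX_LEN     32

char lzw_table[MAX_ENTRIES][MAX_LEN];
int lzw_entries;

/* Code for s[0..len-1], or -1 if it is not in the dictionary.
   Single characters are implicitly present as codes 0..255. */
static int lzw_lookup(const char *s, size_t len)
{
    if (len == 1)
        return (unsigned char) s[0];
    for (int i = 0; i < lzw_entries; i++)
        if (strlen(lzw_table[i]) == len && memcmp(lzw_table[i], s, len) == 0)
            return FIRST_CODE + i;
    return -1;
}

/* Compress text into codes; returns the number of codes written. */
int lzw_compress(const char *text, int *codes, int max_codes)
{
    size_t n = strlen(text), start = 0, len = 1;
    int count = 0;

    assert(codes != NULL && max_codes >= 0);
    lzw_entries = 0;
    if (n == 0)
        return 0;
    while (start + len < n) {               /* a next character exists */
        if (lzw_lookup(text + start, len + 1) >= 0) {
            len++;                          /* old_string grows */
            continue;
        }
        if (count < max_codes)
            codes[count++] = lzw_lookup(text + start, len);
        if (lzw_entries < MAX_ENTRIES && len + 1 < MAX_LEN) {
            memcpy(lzw_table[lzw_entries], text + start, len + 1);
            lzw_table[lzw_entries][len + 1] = '\0';
            lzw_entries++;
        }
        start += len;                       /* restart at the new character */
        len = 1;
    }
    if (count < max_codes)                  /* flush the pending phrase */
        codes[count++] = lzw_lookup(text + start, len);
    return count;
}
```

For " WED WE WEE WEB WET" this emits twelve codes: ' ', 'W', 'E', 'D', 256, 'E', 260, 261, 257, 'B', 260, 'T'.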
11
9.4 Decompression
The decompression algorithm takes the stream of
codes output from the compression algorithm and
uses them to recreate the exact input stream. A
rough C implementation:

    old_string[ 0 ] = input_bits();
    old_string[ 1 ] = '\0';
    putc( old_string[ 0 ], output );
    while ( ( new_code = input_bits() ) != EOF ) {
        new_string = dictionary_lookup( new_code );
        fputs( new_string, output );
        /* New entry: previous string plus first character of this one. */
        append_char_to_string( old_string, new_string[ 0 ] );
        add_to_dictionary( old_string );
        strcpy( old_string, new_string );
    }
12
9.4 Decompression (cont. 1)
The output of the algorithm given the input
created by the earlier compression.
Input Codes: " WED<256>E<260><261><257>B<260>T"

    Input/NEW_CODE   OLD_CODE   STRING/Output   CHARACTER   New table entry
    ' '                         " "
    'W'              ' '        "W"             'W'         256 = " W"
    'E'              'W'        "E"             'E'         257 = "WE"
    'D'              'E'        "D"             'D'         258 = "ED"
    256              'D'        " W"            ' '         259 = "D "
    'E'              256        "E"             'E'         260 = " WE"
    260              'E'        " WE"           ' '         261 = "E "
    261              260        "E "            'E'         262 = " WEE"
    257              261        "WE"            'W'         263 = "E W"
    'B'              257        "B"             'B'         264 = "WEB"
    260              'B'        " WE"           ' '         265 = "B "
    'T'              260        "T"             'T'         266 = " WET"
13
9.4.1 The Catch
Each time the compressor adds a new string to the
phrase table, it does so before the entire phrase
has actually been output to the file. If for some
reason the compressor used that phrase as its
next code, the expansion code would have a
problem. It would be expected to decode a string
that was not yet in its table.
Input String: IWOMBAT....IWOMBATIWOMBATIXXX

    Character Input   New code value and associated string   Code Output
    ...IWOMBATA       300 = IWOMBAT                          288 (IWOMBA)
    .
    .
    ...IWOMBATI       400 = IWOMBATI                         300 (IWOMBAT)
    WOMBATIX          401 = IWOMBATIX                        400 (IWOMBATI)
14
9.4.1 The Catch (cont. 1)
Fortunately, this is the only time when the
decompression algorithm will encounter an
undefined code. The exception handler takes
advantage of the knowledge that this problem can
happen only in the special circumstances of
CHARACTER+STRING+CHARACTER+STRING+CHARACTER.

    old_string[ 0 ] = input_bits();
    old_string[ 1 ] = '\0';
    putc( old_string[ 0 ], output );
    while ( ( new_code = input_bits() ) != EOF ) {
        new_string = dictionary_lookup( new_code );
        if ( new_string == NULL ) {
            /* Undefined code: it must be old_string plus its own
               first character. scratch is an assumed writable
               buffer, since new_string is NULL at this point. */
            new_string = scratch;
            strcpy( new_string, old_string );
            append_character_to_string( new_string, new_string[ 0 ] );
        }
        fputs( new_string, output );
        append_character_to_string( old_string, new_string[ 0 ] );
        add_to_dictionary( old_string );
        strcpy( old_string, new_string );
    }
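The exception case is easier to see in a version that runs on an in-memory array of codes. This is a sketch with hypothetical names (lzw_expand, lookup, append) and a small array-based string table. It assumes new codes start at 256, as in the earlier example, and that inputs are short enough for the fixed limits.

```c
#include <assert.h>
#include <string.h>

#define FIRST_CODE  256
#define MAX_ENTRIES 64      /* assumed limits for short test inputs */
#define MAX_LEN     32

static char table[MAX_ENTRIES][MAX_LEN];
static int entries;

/* Copy the string for code into buf; return 0 if code is undefined. */
static int lookup(int code, char *buf)
{
    if (code < FIRST_CODE) {
        buf[0] = (char) code;
        buf[1] = '\0';
        return 1;
    }
    if (code - FIRST_CODE >= entries)
        return 0;
    strcpy(buf, table[code - FIRST_CODE]);
    return 1;
}

/* Append s to out, which already holds used characters. */
static size_t append(char *out, size_t used, size_t out_size, const char *s)
{
    size_t len = strlen(s);

    assert(used + len < out_size);
    memcpy(out + used, s, len + 1);
    return used + len;
}

/* Expand n codes into out; returns the number of characters written. */
size_t lzw_expand(const int *codes, int n, char *out, size_t out_size)
{
    char old_string[MAX_LEN], new_string[MAX_LEN];
    size_t used = 0;

    assert(out_size > 0);
    entries = 0;
    out[0] = '\0';
    if (n == 0)
        return 0;
    assert(codes[0] >= 0 && codes[0] < FIRST_CODE);  /* first code is a character */
    lookup(codes[0], old_string);
    used = append(out, used, out_size, old_string);
    for (int i = 1; i < n; i++) {
        if (!lookup(codes[i], new_string)) {
            /* Undefined code: it can only be the entry about to be
               created, which is old_string plus its first character. */
            assert(codes[i] == FIRST_CODE + entries);
            assert(strlen(old_string) + 1 < MAX_LEN);
            strcpy(new_string, old_string);
            strncat(new_string, old_string, 1);
        }
        used = append(out, used, out_size, new_string);
        if (entries < MAX_ENTRIES && strlen(old_string) + 1 < MAX_LEN) {
            strcpy(table[entries], old_string);
            strncat(table[entries], new_string, 1);
            entries++;
        }
        strcpy(old_string, new_string);
    }
    return used;
}
```

The codes 'A', 'B', 256, 258 exercise the handler: when 258 arrives, the table only goes up to 257, so the decoder rebuilds it as "AB" plus 'A' and the output is "ABABABA".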
15
9.4.2 LZW Implementation
The concepts in the compression algorithm are so
simple that the whole algorithm can be expressed
in a dozen lines. Implementation of this
algorithm is somewhat more complicated, mainly
due to management of the dictionary. A short
example program that uses twelve-bit codes is in
LZW12.C, and it will illustrate some of the
techniques used here.
16
9.4.3 Tree Maintenance and Navigation

The LZW dictionary is maintained as a multi-way
tree.

    struct dictionary {
        int code_value;
        int parent_code;
        char character;
    } dict[ TABLE_SIZE ];

The node is defined by three items:
code_value. The actual code for the string that
terminates at this node. It is what the
compression program emits when it wants to encode
the string.
parent_code. The code for the parent string.
character. The character for this particular node.

17
9.4.3 Tree Maintenance and Navigation (cont. 1)
This tree maintains the dictionary pointers
through a hashed array of nodes.

    unsigned int find_child_node( int parent_code, int child_character )
    {
        int index;
        int offset;

        /* Hash the parent code and the child character to a slot. */
        index = ( child_character << ( BITS - 8 ) ) ^ parent_code;
        if ( index == 0 )
            offset = 1;
        else
            offset = TABLE_SIZE - index;
        for ( ; ; ) {
            if ( dict[ index ].code_value == UNUSED )
                return( index );
            if ( dict[ index ].parent_code == parent_code &&
                 dict[ index ].character == (char) child_character )
                return( index );
            /* Collision: step back by offset, wrapping around. */
            index -= offset;
            if ( index < 0 )
                index += TABLE_SIZE;
        }
    }
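The hashed lookup can be tried on its own with a compilable sketch. The constants are assumptions chosen for a twelve-bit coder (BITS of 12 and a TABLE_SIZE of 5021, larger than the 4096 possible codes), and init_dictionary and add_child are hypothetical helpers added for the demonstration.

```c
#include <assert.h>

#define BITS        12      /* assumed code width */
#define TABLE_SIZE  5021    /* assumed; must exceed 1 << BITS */
#define UNUSED      (-1)

struct dictionary {
    int code_value;
    int parent_code;
    char character;
} dict[TABLE_SIZE];

/* Mark every slot free before the first lookup. */
void init_dictionary(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        dict[i].code_value = UNUSED;
}

/* Return the slot holding (parent_code, child_character), or the free
   slot where that pair should be stored. */
unsigned int find_child_node(int parent_code, int child_character)
{
    int index;
    int offset;

    assert(child_character >= 0 && child_character < 256);
    assert(parent_code >= 0 && parent_code < (1 << BITS));
    index = (child_character << (BITS - 8)) ^ parent_code;
    offset = (index == 0) ? 1 : TABLE_SIZE - index;
    for (;;) {
        if (dict[index].code_value == UNUSED)
            return (unsigned int) index;
        if (dict[index].parent_code == parent_code &&
            dict[index].character == (char) child_character)
            return (unsigned int) index;
        index -= offset;                /* probe the next candidate slot */
        if (index < 0)
            index += TABLE_SIZE;
    }
}

/* Hypothetical helper: store the pair under new_code unless it is
   already present; return the code the pair ends up with. */
int add_child(int parent_code, int child_character, int new_code)
{
    unsigned int index = find_child_node(parent_code, child_character);

    if (dict[index].code_value == UNUSED) {
        dict[index].code_value = new_code;
        dict[index].parent_code = parent_code;
        dict[index].character = (char) child_character;
    }
    return dict[index].code_value;
}
```

After init_dictionary(), add_child('W', 'E', 257) stores the phrase "WE" under code 257, and a second add_child('W', 'E', 300) returns 257 because the pair is already present.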
18
9.5 Compression
The compression program can be written fairly
easily.

    next_code = FIRST_CODE;
    for ( i = 0 ; i < TABLE_SIZE ; i++ )
        dict[ i ].code_value = UNUSED;
    if ( ( string_code = getc( input ) ) == EOF )
        string_code = END_OF_STREAM;
    while ( ( character = getc( input ) ) != EOF ) {
        index = find_child_node( string_code, character );
        if ( dict[ index ].code_value != -1 )
            /* string + character is known: keep extending it. */
            string_code = dict[ index ].code_value;
        else {
            /* Add the new phrase if there is room, then emit the old one. */
            if ( next_code <= MAX_CODE ) {
                dict[ index ].code_value = next_code++;
                dict[ index ].parent_code = string_code;
                dict[ index ].character = (char) character;
            }
            OutputBits( output, string_code, BITS );
            string_code = character;
        }
    }
    OutputBits( output, string_code, BITS );
    OutputBits( output, END_OF_STREAM, BITS );
19
9.6 Decompression
decode_string() follows the parent pointers up
through the dictionary until it finds a code less
than 256, which we have defined as the first
character in the string. Because it walks from the
end of the phrase to the start, the characters land
on the decode stack in reverse order. A count of
characters in the decode stack is then returned to
the calling program.

    unsigned int decode_string( unsigned int count, unsigned int code )
    {
        while ( code > 255 ) {
            decode_stack[ count++ ] = dict[ code ].character;
            code = dict[ code ].parent_code;
        }
        decode_stack[ count++ ] = (char) code;
        return( count );
    }
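A small sketch shows the reversal. It fills in three dictionary entries by hand, using the codes from the earlier " WEE" example, and pops the stack into a string. The TABLE_SIZE value and the helper code_to_string are assumptions made for the demonstration; on the decoding side the dictionary is indexed directly by code.

```c
#include <assert.h>

#define TABLE_SIZE 5021     /* assumed table size */

struct dictionary {
    int code_value;
    int parent_code;
    char character;
} dict[TABLE_SIZE];

char decode_stack[TABLE_SIZE];

/* Push the characters of code onto decode_stack, last character
   first, and return the new stack depth. */
unsigned int decode_string(unsigned int count, unsigned int code)
{
    while (code > 255) {
        assert(code < TABLE_SIZE && count < TABLE_SIZE - 1);
        decode_stack[count++] = dict[code].character;
        code = (unsigned int) dict[code].parent_code;
    }
    decode_stack[count++] = (char) code;
    return count;
}

/* Hypothetical helper: decode code into out as a NUL-terminated
   string in reading order; returns its length. */
unsigned int code_to_string(unsigned int code, char *out)
{
    unsigned int count = decode_string(0, code);
    unsigned int n = 0;

    while (count > 0)
        out[n++] = decode_stack[--count];   /* pop reverses the order */
    out[n] = '\0';
    return n;
}
```

With 256 = (' ', 'W'), 260 = (256, 'E') and 262 = (260, 'E') stored as parent and character pairs, code_to_string(262, buf) walks 262, 260, 256, ' ' and yields " WEE".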
20
9.6 Decompression (cont. 1)
The decompression routine:

    next_code = FIRST_CODE;
    old_code = InputBits( input, BITS );
    if ( old_code == END_OF_STREAM )
        return;
    character = old_code;
    putc( old_code, output );
    while ( ( new_code = InputBits( input, BITS ) ) != END_OF_STREAM ) {
        if ( new_code >= next_code ) {
            /* Undefined code: old string plus its first character. */
            decode_stack[ 0 ] = (char) character;
            count = decode_string( 1, old_code );
        }
        else
            count = decode_string( 0, new_code );
        character = decode_stack[ count - 1 ];
        /* The stack holds the string reversed, so pop it to output. */
        while ( count > 0 )
            putc( decode_stack[ --count ], output );
        if ( next_code <= MAX_CODE ) {
            dict[ next_code ].parent_code = old_code;
            dict[ next_code ].character = (char) character;
            next_code++;
        }
        old_code = new_code;
    }
21
9.7 The Code
The source code for a complete twelve-bit version
of LZW compression and decompression
lzw12.c
22
9.8 Improvements
A second version of the LZW program, LZW15V.C,
follows. It contains several enhancements, most
of which are also found in the UNIX compress
program. Compression can be improved by increasing
the size of the dictionary. LZW15V.C starts out using
a nine-bit code, and it doesn't advance to ten bits
until the dictionary has added 256 new entries.
It progresses through ten, eleven, twelve, etc.,
until it starts using fifteen-bit codes. One
final enhancement in LZW15V.C is the FLUSH_CODE.
This tells the decompressor to throw away all
phrases currently in the dictionary and to start
over with a blank slate.
lzw15v.c
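The growth of the code width can be sketched as a function of how many phrases have been added. This is a simplified illustration and is not taken from LZW15V.C: code_width is a hypothetical name, and the sketch ignores any codes reserved for special purposes such as FLUSH_CODE, which shift the exact switch points in a real program.

```c
#include <assert.h>

#define MIN_BITS  9     /* starting code width named in the text */
#define MAX_BITS  15    /* final code width named in the text */

/* Width in bits needed once new_entries phrases have been added on
   top of the 256 single-character codes. */
int code_width(int new_entries)
{
    int bits = MIN_BITS;

    assert(new_entries >= 0);
    /* A width of bits holds (1 << bits) codes, 256 of them taken by
       the single characters. */
    while (bits < MAX_BITS && new_entries >= (1 << bits) - 256)
        bits++;
    return bits;
}
```

Under these assumptions code_width(255) is 9 and code_width(256) is 10, matching the statement that the coder moves to ten bits after 256 new entries. The width then stays at 15 once the dictionary is full.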
23
9.9 Patents
One note of caution regarding the use of the LZW
algorithm. Terry Welch filed for, and was
awarded, a U.S. patent covering at least some
portions of his algorithm.
