Title: Data Compression
2. Data Compression
- Transmitting video and speech requires a great deal of bandwidth. In many networks, such as the Internet, the necessary bandwidth is simply not available.
- What we can do is compress the data so that we send far fewer bits.
- There are many algorithms for performing data compression. Some algorithms are more suitable for some types of data than others.
- The process of compressing the data is called encoding and the process of decompressing the data is called decoding.
3. Lossy and Non-Lossy Compression
- If the process of encoding and decoding data results in the loss of information, then the compression technique is said to be lossy.
- If the data is recovered accurately, then the technique is said to be lossless.
- Lossy techniques can often achieve a far greater reduction in the size of the data, but at the loss of some of the information.
- In the case of voice and video data, this loss is often not noticeable.
- The loss of data in a computer program, however, would cause it to crash, so a lossy technique would not be suitable.
4. Lossy and Non-Lossy Compression
- The type of compression we use may depend on what we want to use it for.
- If we want to compress a video clip, we need only run the encoding algorithm once, and thereafter we will use the decoding algorithm every time we want to view the video.
- In this situation it is acceptable to have an encoding algorithm that is much slower than the decoding algorithm.
- The decoding algorithm would have to work in real time, whereas the encoding algorithm may take weeks on a supercomputer.
- If, on the other hand, we want to compress information from a video phone, then both compression and decompression will have to work in real time.
5. Entropy and Source Encoding
- Compression schemes fall into two categories: entropy encoding and source encoding.
- Entropy encoding compresses the data without regard to where the data came from or what it means.
- Source encoding tries to take advantage of the properties of the data to improve compression (usually by using a lossy technique).
6. Run-Length Encoding

Run-length encoding is an example of entropy encoding. Consider the following sequence of decimal digits:

31500000000000084587111111111111116354674000000000000000000000065

You will notice that there are a lot of 0s and 1s being repeated. Many types of data (e.g. simple diagrams, gaps between speech, data containing tables) have repeated symbols in them. It is a simple matter to count the number of duplicate symbols. We can replace these duplicates with a special marker (say X) followed by the symbol being duplicated and a two-digit count. The encoded string will then become:

315X01284587X11316354674X02265

(less than half the size).
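A minimal sketch of this scheme in Python (assuming, as in the example, that a run is only worth compressing when it is longer than the 4-character X-code that replaces it, and that the marker X never occurs in the data):

```python
def rle_encode(data, marker="X"):
    """Replace runs of a repeated symbol with marker + symbol +
    two-digit count.  Runs of 4 or fewer symbols are left alone,
    since the 4-character code would not make them any shorter."""
    out, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # find the end of the run
        run = j - i
        out.append(f"{marker}{data[i]}{run:02d}" if run > 4 else data[i] * run)
        i = j
    return "".join(out)

def rle_decode(data, marker="X"):
    """Expand marker + symbol + two-digit count back into a run."""
    out, i = [], 0
    while i < len(data):
        if data[i] == marker:
            out.append(data[i + 1] * int(data[i + 2:i + 4]))
            i += 4
        else:
            out.append(data[i])
            i += 1
    return "".join(out)
```

Note that a run longer than 99 symbols would overflow the two-digit count field; a production encoder would split such a run into several codes.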
7. Half-Byte Compression
- When sending numerical characters, significant savings can be achieved by using 4 bits rather than the 7 or 8 bits usually used to send characters.
- The ANSI character set represents the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 with the bit patterns 00110000, 00110001, 00110010, 00110011, 00110100, 00110101, 00110110, 00110111, 00111000 and 00111001.
- The first 4 bits are always 0011. If the receiver knows this and can be relied upon to recreate these bits, then there is no real need to send them.
8. Half-Byte Compression
- If we send the digits 2 7 9 1, we can send them in half the time by just sending 0010 0111 1001 0001.
- In byte format this would look like 00100111 10010001.
- Of course, if we send numerical characters in this way, we need to supply some sort of signal to the receiving host so that it will know when to start half-byte decompression.
- This signal is a special control byte that is sent just before the half-byte compressed data begins.
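The packing step can be sketched as follows; the convention for padding an odd number of digits is an assumption here, not part of the slides:

```python
def pack_digits(digits):
    """Pack a string of decimal digits two-per-byte, keeping only the
    low 4 bits of each character code (the constant leading 0011 is
    dropped).  An odd count is padded with the nibble 1111, an
    assumed convention."""
    nibbles = [ord(d) & 0x0F for d in digits]   # '2' (00110010) -> 0010
    if len(nibbles) % 2:
        nibbles.append(0x0F)                    # padding nibble
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))
```

For example, pack_digits("2791") yields the two bytes 00100111 10010001 shown above.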
9. Statistical Encoding
- Statistical encoding examines the data and represents the most frequent symbols using the shortest codes.
- Morse code does this: E (the most frequent letter) is represented by a single dit and Q (one of the least used letters) is represented by dah-dah-dit-dah.
- The same approach is used by the Ziv-Lempel algorithm used by the UNIX Compress program and by Huffman coding.
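As a sketch of the dictionary-based idea behind the Ziv-Lempel family, here is the LZW variant (the one underlying UNIX Compress); details such as dictionary size limits and variable code widths are omitted:

```python
def lzw_encode(data):
    """LZW sketch: grow a dictionary of phrases seen so far and emit
    the dictionary index of the longest matching phrase."""
    table = {chr(i): i for i in range(256)}  # start with single bytes
    w, out = "", []
    for c in data:
        if w + c in table:
            w += c                       # extend the current phrase
        else:
            out.append(table[w])         # emit code for longest match
            table[w + c] = len(table)    # learn the new phrase
            w = c
    if w:
        out.append(table[w])
    return out
```

Repeated text compresses because later occurrences of a phrase are emitted as a single code; for instance, lzw_encode("ABABABA") emits only four codes for seven characters.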
10. Huffman Coding
- Instead of representing symbols as a fixed number of bits, fewer bits are used for the most frequently occurring symbols. To do this, we must first determine the relative frequency of the symbols. With this information, we create an unbalanced tree (a tree with unequal branches).
- Consider the following sequence of DNA: AAACCCTTGCAAATAA
- There are only 4 symbols: A, C, G and T. The frequencies of these symbols within the 16-symbol string are as follows: A 8, C 4, G 1 and T 3.
11. Huffman Coding

In Huffman encoding, we list the symbols in order of ascending frequency. We take the two symbols with the smallest frequencies and give the smaller the bit label 0 and the larger the bit label 1 (if they have the same frequency, then the allocation of bit labels is arbitrary):

G 1 (0)
T 3 (1)
C 4
A 8

Next we group the two smallest symbols together and reorder the list. We then, once again, label the two smallest frequencies:

GT 4 (0)
C 4 (1)
A 8
12. Huffman continued

We repeat the process again:

GTC 8 (0)
A 8 (1)

We draw a tree as follows: we examine the above tables in reverse order. The group of symbols labelled 0 is sent to the left branch and the group labelled 1 is sent to the right.

Figure 1: The first branch.
13. Huffman continued

Figure 2: The second branch.

Figure 3: The final unbalanced tree.
14. Huffman Encoding

The bits used for each symbol are found by combining the 1s and 0s passed over en route to the symbol from the root of the tree. For example, to get to T we go over 0, 0 and 1, so the Huffman code for T will be 001. When we use these Huffman codes to encode the sequence AAACCCTTGCAAATAA we get:

1 1 1 01 01 01 001 001 000 01 1 1 1 001 1 1 (28 bits long)

We can calculate the length of the message from the frequencies by summing the lengths of the codes multiplied by the frequencies. The length of the above message is given as 8×1 + 4×2 + 3×3 + 1×3 = 28 bits.
15. Huffman Encoding
- The average number of bits used per symbol is the length of the message divided by the number of symbols in the message: 28/16 = 1.75 bits per symbol. Compare that with the binary code we would get if we just used 2 bits to represent each symbol:
- 00 00 00 01 01 01 10 10 11 01 00 00 00 10 00 00 (32 bits long)
- The saving may not seem much for a very short sequence, but for long sequences with many symbols we can often achieve compression rates of 50%.
- To decode the Huffman encoded data, we simply use the bits to find our way from the root of the tree to the symbols (0 means go left and 1 means go right).
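The whole procedure (build the tree from the frequencies, read the codes off the root-to-leaf paths, then encode and decode) can be sketched as below. Tie-breaking between equal frequencies is arbitrary, so the exact codes may differ from those in the slides, but the code lengths, and hence the 28-bit total, come out the same.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} map.  The
    lower-frequency subtree is labelled 0, the other 1."""
    # Heap entries are (frequency, tie-breaker, tree); a tree is
    # either a symbol (leaf) or a (left, right) pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)    # smallest frequency -> 0
        f1, i, right = heapq.heappop(heap)   # next smallest -> 1
        heapq.heappush(heap, (f0 + f1, i, (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")      # 0 means go left
            walk(tree[1], prefix + "1")      # 1 means go right
        else:
            codes[tree] = prefix or "0"      # lone symbol still needs a bit
    walk(heap[0][2], "")
    return codes

def huffman_encode(text, codes):
    return "".join(codes[s] for s in text)

def huffman_decode(bits, codes):
    """Emit a symbol whenever a complete code has been read; Huffman
    codes are prefix-free, so no delimiter is needed."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

For the DNA example, huffman_codes(Counter("AAACCCTTGCAAATAA")) gives a 1-bit code for A, a 2-bit code for C and 3-bit codes for G and T, so the encoded sequence is 28 bits and decodes back exactly.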
16. Source Encoding
- Source encoding tries to take advantage of the nature of the data rather than just compressing the data regardless of its source. A good example of source encoding is JPEG (Joint Photographic Experts Group), a standard for compressing photographs.
- It relies on the fact that the human eye will rarely notice minor distortions of an image (particularly if the viewer has never seen the original). This being the case, the expedient reduction of quality in return for better compression is usually acceptable.
17. Source Encoding

The photographic image must undergo a number of steps in order to create a JPEG. Figure 4 shows these steps as a block diagram.

Figure 4: Block diagram of the operation of JPEG.

First, blocks of four pixels are averaged. The averaged pixels are grouped into 8×8 blocks. Discrete cosine transforms (DCT) are applied to the blocks. The less important coefficients of the DCT are removed (quantization). Each block's average value is replaced with the difference between it and that of the neighbouring block. Run-length encoding is applied to the blocks and finally Huffman encoding is used. To decode the JPEG, the operations are reversed.
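To illustrate the transform-and-quantize step, here is a sketch of a one-dimensional (unnormalized) DCT-II with a crude thresholding quantizer. Real JPEG applies the DCT in two dimensions to each 8×8 block and divides by a quantization matrix rather than thresholding, so treat this only as an illustration of why the DCT compresses well: smooth blocks concentrate their energy in a few low-frequency coefficients, and the rest can be discarded.

```python
import math

def dct(block):
    """Unnormalized 1-D DCT-II of a list of samples."""
    N = len(block)
    return [sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n, x in enumerate(block))
            for k in range(N)]

def quantize(coeffs, threshold=1.0):
    """Crude quantization: zero out coefficients below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]
```

For a perfectly flat 8-sample block, every coefficient except the first is (numerically) zero, so after quantization a single number describes all eight samples.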
18. MPEG

A standard compression technique for video and sound is the MPEG standard. This is effectively a sequence of JPEGs with additional compression gained from the similarities between successive frames. The MPEG-2 standard is used in digital television. The acronym MPEG stands for Moving Picture Experts Group, which worked to generate the specifications under ISO, the International Organization for Standardization, and IEC, the International Electrotechnical Commission. What is commonly referred to as "MPEG video" actually consists at present of two finalized standards, MPEG-1 and MPEG-2; a third standard, MPEG-4, was finalized in 1998 for Very Low Bitrate Audio-Visual Coding. The MPEG-1 and MPEG-2 standards are similar in basic concepts.
19. MPEG-1
- MPEG-1 and MPEG-2 are based on motion-compensated block-based transform coding techniques, while MPEG-4 deviates from these more traditional approaches in its usage of software image construct descriptors, for target bit-rates in the very low range, less than 64 Kb/sec.
- MPEG-1 was finalized in 1991, and was originally optimized to work at video resolutions of 352x240 pixels at 30 frames/sec (NTSC based) or 352x288 pixels at 25 frames/sec (PAL based), commonly referred to as Source Input Format (SIF) video.
- It is often mistakenly thought that MPEG-1 resolution is limited to the above sizes, but it may in fact go as high as 4095x4095 at 60 frames/sec. The bit-rate is optimized for applications of around 1.5 Mb/sec, but again can be used at higher rates if required.
- MPEG-1 is defined for progressive frames only, and has no direct provision for interlaced video applications, such as broadcast television.
20. MPEG-2
- MPEG-2 was finalized in 1994, and addressed issues directly related to digital television broadcasting, such as the efficient coding of field-interlaced video and scalability.
- Also, the target bit-rate was raised to between 4 and 9 Mb/sec, resulting in potentially very high quality video. MPEG-2 consists of profiles and levels.
- The profile defines the bitstream scalability and the colorspace resolution, while the level defines the image resolution and the maximum bit-rate per profile. Probably the most common descriptor in use currently is Main Profile, Main Level (MP@ML), which refers to 720x480 resolution video at 30 frames/sec, at bit-rates up to 15 Mb/sec for NTSC video.
- Another example is the HDTV resolution of 1920x1080 pixels at 30 frames/sec, at a bit-rate of up to 80 Mb/sec. This is an example of the Main Profile, High Level (MP@HL) descriptor.
21. Practical Work

Consider the following string of decimal digits:

52100000000000084847111111111111162675430000000000000000000000017

Using a marker X to signify the occurrence of compression and allowing a two-digit number for the repetition count, show the output string following the application of run-length encoding.

SOLUTION: 521X01284847X1136267543X02217
22. Summary
- There are two basic types of compression techniques: entropy encoding and source encoding. Entropy encoding uses statistical techniques to compress the data.
- Typically it is lossless, meaning that, when uncompressed again, the data is restored to an exact copy of the original. Source encoding uses knowledge about where the data came from in order to improve compression. Source encoding is often lossy, meaning that, when uncompressed again, the data suffers an acceptable amount of distortion.
- Run-length encoding and Huffman encoding are examples of entropy encoding. JPEG and MPEG are examples of source encoding.