Data Compression - PowerPoint PPT Presentation

Provided by: kevinc3
Title: Data Compression


1
Data Compression
2
Data Compression
  • Transmitting video and speech requires a great
    deal of bandwidth. In many networks, such as the
    Internet, the necessary bandwidth is simply not
    available.
  • What we can do is compress the data so that
    we send far fewer bits.
  • There are many algorithms for performing data
    compression. Some algorithms are more suitable
    for some types of data than others.
  • The process of compressing the data is called
    encoding and the process of decompressing the
    data is called decoding.

3
Lossy and Non Lossy Compression
  • If the process of encoding and decoding data
    results in the loss of information then the
    compression technique is said to be lossy.
  • If the data is recovered exactly then the
    technique is said to be lossless.
  • Lossy techniques can often achieve a far better
    reduction in the size of data but at the loss of
    some of the information.
  • In the case of voice and video data, this loss
    is often not noticeable.
  • The loss of data in a computer program, however,
    could corrupt it and cause it to crash, so a
    lossy technique would not be suitable.
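The difference between the two kinds of technique can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are made up, zlib stands in for "some lossless technique", and dropping the low 4 bits of each sample stands in for "some lossy technique". Neither is the method used by any particular voice or video codec.

```python
import zlib

# Hypothetical 8-bit samples standing in for a short burst of audio.
samples = bytes([12, 13, 13, 14, 200, 201, 199, 198] * 16)

# Lossless: zlib restores every byte exactly.
packed = zlib.compress(samples)
restored = zlib.decompress(packed)

# Lossy (assumed scheme): keep only the top 4 bits of each sample,
# discarding the fine detail held in the low 4 bits.
def quantize(data: bytes) -> bytes:
    return bytes(b & 0xF0 for b in data)

approx = quantize(samples)
max_error = max(abs(a - b) for a, b in zip(samples, approx))

print(restored == samples)  # the lossless round trip is an exact copy
print(approx == samples)    # the lossy version is not
print(max_error)            # each sample is off by at most 15
```

For speech samples an error of this size may pass unnoticed, but the same change applied to the bytes of a program would alter its instructions.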

4
Lossy and Non Lossy Compression
  • The type of compression we use may depend on
    what we want to use it for.
  • If we want to compress a video clip, we need
    only run the encoding algorithm once and
    thereafter we will use the decoding algorithm
    every time we want to view the video.
  • In this situation it is acceptable to have an
    encoding algorithm that is much slower than the
    decoding algorithm.
  • The decoding algorithm would have to work in
    real time whereas the encoding algorithm may take
    weeks on a supercomputer.
  • If, on the other hand, we want to compress
    information from a video phone then both
    compression and decompression will have to work
    in real time.

5
Entropy and Source Encoding
  • Compression schemes fall into two categories:
    entropy encoding and source encoding.
  • Entropy encoding compresses the data without
    regard to where the data came from or what it
    means.
  • Source encoding tries to take advantage of the
    properties of the data to improve compression
    (usually by using a lossy technique).

6
Run-Length Encoding
Run-length encoding is an example of entropy
encoding. Consider the following sequence of
decimal digits 3150000000000008458711111111111
1116354674000000000000000000000065.   You will
notice that there are a lot of 0s and 1s being
repeated. Many types of data (e.g. simple
diagrams, gaps between speech, data containing
tables) have repeated symbols in them. It is a
simple matter to count the number of duplicate
symbols. We can replace each run of duplicates with
a special marker (say X) followed by the symbol
being duplicated and a two-digit count. The
encoded string will then become
315X01284587X11316354674X02265 (about half the
size).
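The marker scheme described above can be sketched as follows. The names rle_encode and rle_decode are hypothetical, and two details are assumptions that the slide does not state: runs shorter than four symbols are left alone (the marker form costs four characters), and runs longer than 99 are split because the count has only two digits. The test string is rebuilt from the runs in the Practical Work exercise later in this deck.

```python
MARKER = "X"   # marker from the slide; it must never occur in the data itself
MIN_RUN = 4    # assumption: shorter runs are cheaper to send unencoded
MAX_RUN = 99   # a two-digit count cannot exceed 99

def rle_encode(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        # Measure the run of identical symbols starting at i (capped at MAX_RUN).
        j = i
        while j < len(text) and text[j] == text[i] and j - i < MAX_RUN:
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out.append(f"{MARKER}{text[i]}{run:02d}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

def rle_decode(encoded: str) -> str:
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == MARKER:
            # Marker, then the symbol, then a two-digit repetition count.
            symbol = encoded[i + 1]
            count = int(encoded[i + 2:i + 4])
            out.append(symbol * count)
            i += 4
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)

digits = "521" + "0" * 12 + "84847" + "1" * 13 + "6267543" + "0" * 22 + "17"
encoded = rle_encode(digits)
print(encoded)                       # 521X01284847X1136267543X02217
print(rle_decode(encoded) == digits) # True
```

Because the decoder rebuilds the input exactly, this is a lossless technique.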
7
Half Byte Compression
  • When sending numerical characters, significant
    savings can be achieved by using 4 bits rather
    than the 7 or 8 bits usually used to send
    characters.
  • The ANSI character set (like ASCII) represents
    the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 with
    the bit patterns 00110000, 00110001, 00110010,
    00110011, 00110100, 00110101, 00110110, 00110111,
    00111000 and 00111001.
  • The first 4 bits are always 0011. If the
    receiver knows this and can be relied upon to
    recreate these bits then there is no real need to
    send them.

8
Half Byte Compression
  • If we send the digits 2 7 9 1, we can send them
    in half the time by just sending 0010 0111 1001
    0001.
  • In byte format this would look like 00100111
    10010001.
  • Of course, if we send numerical characters in
    this way, we need to supply some sort of signal
    to the receiving host so that it will know when
    to start half-byte decompression.
  • This signal is a special control byte that is
    sent just before the half-byte compressed data
    begins.
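A minimal sketch of half-byte packing, assuming the digits arrive as a Python string. The function names and the filler nibble used for an odd number of digits are assumptions, and the control byte that announces the compressed data is not modelled here.

```python
PAD = 0xF  # assumption: filler nibble when the number of digits is odd

def pack_digits(digits: str) -> bytes:
    if not all(d in "0123456789" for d in digits):
        raise ValueError("only the characters 0-9 can be half-byte packed")
    # Keep only the low 4 bits of each character code; the high 4 bits are always 0011.
    nibbles = [ord(d) & 0x0F for d in digits]
    if len(nibbles) % 2:
        nibbles.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))

def unpack_digits(packed: bytes) -> str:
    out = []
    for byte in packed:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble != PAD:
                # Recreate the dropped high bits 0011 (0x30).
                out.append(chr(0x30 | nibble))
    return "".join(out)

packed = pack_digits("2791")
print([f"{b:08b}" for b in packed])  # ['00100111', '10010001']
print(unpack_digits(packed))         # 2791
```

The four digits travel in two bytes instead of four, matching the byte format shown on the slide.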

9
Statistical Encoding
  • Statistical encoding examines the data and
    represents the most frequent symbols using the
    shortest codes.
  • Morse code does this: E (the most frequent
    letter) is represented by a single dit and Q (one
    of the least used letters) is represented by
    dah-dah-dit-dah.
  • The same approach is used by Huffman coding. The
    Ziv-Lempel algorithm used by the UNIX compress
    program is related, but it replaces repeated
    strings with references to a dictionary rather
    than assigning a code to each symbol.

10
Huffman Coding
  • Instead of representing symbols as a fixed
    number of bits, fewer bits are used for the most
    frequently occurring symbols. To do this, we
    must first determine the relative frequency of
    the symbols. With this information, we create an
    unbalanced tree (a tree with unequal branches).
  • Consider the following sequence of DNA:
    AAACCCTTGCAAATAA
  • There are only 4 symbols: A, C, G and T. The
    frequencies of these symbols within the 16-symbol
    string are as follows: A 8, C 4, G 1 and T 3.

11
Huffman Coding
In Huffman encoding, we list the symbols in order
of ascending frequency. We take the two symbols
with the smallest frequencies and give the smaller
the bit label 0 and the larger the bit label 1
(if they have the same frequency then the
allocation of bit labels is arbitrary).
  G 1 (0)   T 3 (1)   C 4   A 8
Next we group the two smallest symbols together
and reorder the list. We then, once again, label
the two smallest frequencies.
  GT 4 (0)   C 4 (1)   A 8
12
Huffman continued
We repeat the process again.
  GTC 8 (0)   A 8 (1)
We draw the tree by examining the above tables in
reverse order. The group of symbols labelled 0 is
sent to the left branch and the group labelled 1
is sent to the right.
Figure 1: The first branch (GTC to the left, A to the right).
13
Huffman continued
Figure 2: The second branch (GT to the left, C to the right).
Figure 3: The final unbalanced tree.
14
Huffman Encoding
The bits used for each symbol are found by
combining the 1s and 0s passed over en route to
the symbol from the root of the tree. For
example, to get to T we go over 0, 0 and 1 so the
Huffman code for T will be 001. When we use these
Huffman codes (A = 1, C = 01, T = 001, G = 000)
to encode the sequence AAACCCTTGCAAATAA we get
1 1 1 01 01 01 001 001 000 01 1 1 1 001 1 1
(28 bits long).
We can calculate the length of the message from
the frequencies by summing up the lengths of the
codes multiplied by the frequencies. The length
of the above message is given as
8×1 + 4×2 + 3×3 + 1×3 = 28 bits.
15
Huffman Encoding
  • The average number of bits used per symbol is
    the length of the message divided by the number
    of symbols in the message: 28/16 = 1.75 bits per
    symbol. Compare that with the binary code we
    would get if we just used 2 bits to represent
    each symbol (A = 00, C = 01, T = 10, G = 11):
  • 00 00 00 01 01 01 10 10 11 01 00 00 00 10 00 00
    (32 bits long)
  • The saving may not seem much for a very short
    sequence but for long sequences with many symbols
    we can often achieve compression rates of 50%.
  • To decode the Huffman encoded data, we simply
    use the bits to find our way from the root of the
    tree to the symbols (0 means go left and 1 means
    go right).
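The whole procedure, from counting frequencies to decoding, can be sketched with a priority queue. This is one possible arrangement rather than the only one: the function names are hypothetical, and ties are broken in favour of the most recently formed group so that the DNA example yields the same labels as the slides (any other tie-break gives codes of the same total length).

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text: str) -> dict:
    freqs = Counter(text)
    if not freqs:
        return {}
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    order = count()
    # Each entry: (frequency, tie-break key, {symbol: code so far}).
    # The negated counter makes newer groups win ties (an assumption, see above).
    heap = [(f, -next(order), {sym: ""})
            for sym, f in sorted(freqs.items(), key=lambda item: item[1])]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, low = heapq.heappop(heap)    # smallest group gets label 0
        f1, _, high = heapq.heappop(heap)   # next smallest gets label 1
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (f0 + f1, -next(order), merged))
    return heap[0][2]

def huffman_decode(bits: str, codes: dict) -> str:
    # No code is a prefix of another, so the first match is always the right one.
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)

dna = "AAACCCTTGCAAATAA"
codes = huffman_codes(dna)              # A=1, C=01, T=001, G=000
bits = "".join(codes[s] for s in dna)
print(len(bits), len(bits) / len(dna))  # 28 1.75
print(huffman_decode(bits, codes) == dna)  # True
```

Matching accumulated bits against a table of codes is equivalent to walking the tree from the root, going left on 0 and right on 1.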

16
Source Encoding
  • Source encoding tries to take advantage of the
    nature of the data rather than just compressing
    the data regardless of its source. A good
    example of source encoding is JPEG (Joint
    Photographic Experts Group), which is a standard
    for compressing photographs.
  • It relies on the fact that the human eye will
    rarely notice minor distortions of an image
    (particularly if the viewer has never seen the
    original). This being the case, a reduction in
    quality in return for better compression is
    usually acceptable.

17
Source Encoding
The photographic image must undergo a number of
steps in order to create a JPEG. Figure 4 shows
these steps as a block diagram.
Figure 4: Block diagram of operation of JPEG.
First, blocks of four pixels are averaged. The
averaged pixels are grouped into 8×8 blocks.
Discrete cosine transforms (DCT) are applied to
the blocks. The less important coefficients of
the DCT are removed (quantization). Each block's
average value is replaced with the difference
between it and that of the neighbouring block.
Run-length encoding is applied to the blocks and
finally Huffman encoding is used. To decode the
JPEG, the operations are applied in reverse,
although the detail discarded by quantization
cannot be recovered.
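The DCT and quantization steps can be sketched in plain Python. This covers only those two steps, not a JPEG encoder. The pixel values are made up, and the single uniform step size of 16 is an assumption made for brevity; real JPEG encoders use a table with a separate step for each coefficient.

```python
import math

N = 8  # block size used by JPEG

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an 8x8 block (a list of 8 rows of 8 numbers)."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = 0.25 * c(u) * c(v) * total
    return coeffs

def quantize(coeffs, step=16):
    # Hypothetical uniform step: small coefficients round to zero and are lost.
    return [[round(value / step) for value in row] for row in coeffs]

# A smooth, made-up block: brightness rises gently from left to right.
block = [[100 + 2 * y for y in range(N)] for _ in range(N)]
quantized = quantize(dct_2d(block))
nonzero = sum(1 for row in quantized for q in row if q != 0)
print(nonzero, "of", N * N, "coefficients survive quantization")
```

For a smooth block nearly all of the 64 quantized coefficients are zero, which is what makes the later run-length and Huffman stages effective.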
18
MPEG
A standard compression technique for video and
sound is the MPEG standard. This is effectively
a sequence of JPEGs with additional compression
gained from the similarities between successive
frames. The MPEG-2 standard is used in digital
television. The acronym MPEG stands for Moving
Picture Experts Group, which worked to generate
the specifications under ISO, the International
Organization for Standardization, and IEC, the
International Electrotechnical Commission. What
is commonly referred to as "MPEG video" actually
consists at the present time of two finalized
standards, MPEG-1 and MPEG-2, with a third
standard, MPEG-4, finalized in 1998 for Very
Low Bitrate Audio-Visual Coding. The MPEG-1 and
MPEG-2 standards are similar in basic concepts.
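The gain from similarities between successive frames can be illustrated with plain frame differencing. This is a deliberately simplified sketch with made-up one-dimensional "frames" and hypothetical function names. MPEG itself uses motion-compensated prediction on blocks, which is considerably more elaborate.

```python
def frame_delta(previous, current):
    # Store only how each pixel changed since the previous frame.
    return [c - p for p, c in zip(previous, current)]

def apply_delta(previous, delta):
    # Rebuild the current frame from the previous one plus the changes.
    return [p + d for p, d in zip(previous, delta)]

frame1 = [10, 10, 10, 50, 50, 10, 10, 10]
frame2 = [10, 10, 10, 10, 50, 50, 10, 10]  # the bright patch moved one pixel right

delta = frame_delta(frame1, frame2)
print(delta)                                 # [0, 0, 0, -40, 0, 40, 0, 0]
print(apply_delta(frame1, delta) == frame2)  # True
```

Most entries of the difference are zero, so it compresses far better than the frame itself would.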
19
MPEG-1
  • MPEG-1 and MPEG-2 are based on motion-compensated
    block-based transform coding techniques, while
    MPEG-4 deviates from these more traditional
    approaches in its usage of software image
    construct descriptors, for target bit-rates in
    the very low range, < 64 Kb/sec.
  • MPEG-1 was finalized in 1991, and was originally
    optimized to work at video resolutions of 352x240
    pixels at 30 frames/sec (NTSC based) or 352x288
    pixels at 25 frames/sec (PAL based), commonly
    referred to as Source Input Format (SIF) video.
  • It is often mistakenly thought that the MPEG-1
    resolution is limited to the above sizes, but it
    in fact may go as high as 4095x4095 at 60
    frames/sec. The bit-rate is optimized for
    applications of around 1.5 Mb/sec, but again can
    be used at higher rates if required.
  • MPEG-1 is defined for progressive frames only,
    and has no direct provision for interlaced video
    applications, such as broadcast television.

20
MPEG-2
  • MPEG-2 was finalized in 1994, and addressed
    issues directly related to digital television
    broadcasting, such as the efficient coding of
    field-interlaced video and scalability.
  • Also, the target bit-rate was raised to between
    4 and 9 Mb/sec, resulting in potentially very
    high quality video. MPEG-2 consists of profiles
    and levels.
  • The profile defines the bitstream scalability
    and the colorspace resolution, while the level
    defines the image resolution and the maximum
    bit-rate per profile. Probably the most common
    descriptor in use currently is Main Profile, Main
    Level (MP@ML), which refers to 720x480 resolution
    video at 30 frames/sec, at bit-rates up to 15
    Mb/sec for NTSC video.
  • Another example is the HDTV resolution of
    1920x1080 pixels at 30 frames/sec, at a bit-rate
    of up to 80 Mb/sec. This is an example of the
    Main Profile, High Level (MP@HL) descriptor.

21
Practical Work
Consider the following string of decimal
digits 52100000000000084847111111111111162675430
00000000000000000000017 Using a marker X to
signify the occurrence of compression and
allowing a two-digit number for the repetition
count, show the output string following the
application of run-length encoding.
Solution: 521X01284847X1136267543X02217
22
Summary
  • There are two basic types of compression
    technique: entropy encoding and source encoding.
    Entropy encoding uses statistical techniques to
    compress the data.
  • Typically it is lossless, meaning that, when
    uncompressed again, the data is restored to an
    exact copy of the original data. Source encoding
    uses knowledge about where the data came from in
    order to improve compression. Source encoding is
    often lossy, meaning that, when uncompressed
    again, the data suffers an acceptable amount of
    distortion.
  • Run-length encoding and Huffman encoding are
    examples of entropy encoding. JPEG and MPEG are
    examples of source encoding.