Title: Information Theory and Source Coding
Lecture 5
- Information Theory and Source Coding
Contents
- Formal definition of information
- Formal definition of entropy
- Information loss due to noise
- Channel capacity
Terminology
- An information source consists of an alphabet of
symbols. Each symbol has a probability
associated with it. A message consists of one or
more symbols.
- Example: If we were to transmit an ASCII text file, then the alphabet could be
all the characters in the ASCII set, a symbol would be any individual
character, e.g. the letter 'a' with its associated probability P(a).
- A message is any sequence of symbols, e.g. the letter 'a', a whole word
'aardvark' or a whole file.
- If you are going to be successful at data compression then you need to think
carefully about your definition of the alphabet of symbols. Just because
someone says it is an ASCII file doesn't mean there are 256 8-binary-digit
characters. It may help your cause in data compression to think about it
differently.
- For example, is there something to be gained by redefining your alphabet as
2^16 = 65536 characters of 16 bits? (There is if the 8-bit characters are not
independent, as in most ASCII files. For example, think about the probability
of a 'u' following a 'q' in an English text file, or a carriage return
following certain characters in a C program.)
Information
In communications, information is a measurable quantity with a precise
definition. If a message m has probability P(m) then the information conveyed
by the message is

I(m) = -log2(P(m))

Note that log2 is usually used to give the information in bits. Do not confuse
bits as the information conveyed with 16 bit as in 16 bit microprocessor; this
second quantity we shall call binary digits for the remainder of this lecture.
Also remember that

log2(x) = log10(x) / log10(2)

when using your calculators! Note that reception of a highly probable event
contains little information; for example, if P(m) = 1.0 then the information
conveyed is 0 bits. Reception of a highly improbable message gives lots of
information; for example, if P(m) = 0.0 then this contains an infinite amount
of information.
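As a quick Python sketch of this definition (the probabilities used below are illustrative values, not taken from the lecture):

```python
import math

def information_bits(p):
    """Information conveyed by a message of probability p, in bits."""
    return -math.log2(p)

print(information_bits(1.0))   # 0.0 bits: a certain message carries no information
print(information_bits(0.5))   # 1.0 bit
print(information_bits(0.01))  # ~6.64 bits: an improbable message carries a lot
```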
Information (Example)
- If an alphabet consists of two symbols A and B with probabilities P(A) and
P(B), calculate the information associated with receiving a message consisting
of A followed by B, if A and B are independent.
The answer to this question can be calculated by two methods, if your knowledge
of the properties of the log function is good.
Method 1: Calculate the probability of the message and then calculate the
information associated with it. The probability of the message is P(A)P(B) if A
and B are independent, therefore the information content is

I = -log2(P(A)P(B))
Method 2: Calculate the information associated with each symbol, then add. The
information associated with receiving symbol A is -log2(P(A)) and the
information associated with symbol B is -log2(P(B)), therefore the information
associated with receiving A followed by B is

I = -log2(P(A)) - log2(P(B)) = -log2(P(A)P(B))

remembering that log(x) + log(y) = log(xy).
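A short sketch confirming that the two methods agree (the probabilities here are assumed example values):

```python
import math

p_a, p_b = 0.5, 0.25  # assumed probabilities for independent symbols A and B

# Method 1: information of the whole message
method1 = -math.log2(p_a * p_b)

# Method 2: add the information of the individual symbols
method2 = -math.log2(p_a) - math.log2(p_b)

print(method1, method2)  # both give 3.0 bits
```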
Entropy
In this case, nothing to do with thermodynamics. The entropy of a source is a
measure of its randomness: the more random a source, the higher its entropy.
The entropy of a source also sets the limit for arithmetic source coding; it
gives the minimum average number of binary digits per symbol. Arithmetic source
coding is any technique which assumes there is no memory in the system. An
example of arithmetic source coding is Huffman coding, see next lecture.
Entropy is defined as

H = -Σ P(i) log2(P(i))   (summed over all symbols i in the alphabet)

For example, if a source can transmit one of three symbols A, B, C with
associated probabilities P(A) = 0.60, P(B) = P(C) = 0.20, then the entropy of
this source is

H = -(0.60 log2(0.60) + 2 × 0.20 log2(0.20)) = 1.371 bits/symbol
This means that the best possible arithmetic coding scheme would represent this
message with an average of 1.371 binary digits per symbol transmitted. The
highest entropy occurs when the symbols have equal probabilities, and in this
case the best thing to do is to allocate each one an equal-length code.
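The 1.371 figure can be checked with a few lines of Python (a minimal sketch of the entropy formula above):

```python
import math

def entropy(probs):
    """Entropy of a memoryless source, in bits/symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.60, 0.20, 0.20]))        # ~1.371 bits/symbol
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits/symbol (equal probabilities)
```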
Example 1: If we have four symbols, A, B, C and D, each with probability 0.25,
then the entropy of this source is

H = -4 × 0.25 × log2(0.25) = 2 bits/symbol

Therefore the coding scheme
Symbol Code
A 00
B 01
C 10
D 11
is 100% efficient, since the average number of binary digits per symbol is
equal to the entropy. Next lecture we shall look at arithmetic coding schemes
which can approach 100% efficiency when the probabilities are not equal.
Example 2: For a two-symbol (binary) source where p is the probability of
transmitting a 1, we have

H = -p log2(p) - (1 - p) log2(1 - p)

which has its maximum of 1 bit/symbol at p = 0.5.
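Tabulating this binary entropy function shows the maximum at p = 0.5 (a sketch based on the formula above):

```python
import math

def binary_entropy(p):
    """Entropy of a two-symbol source with P(1) = p, in bits/symbol."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}  H = {binary_entropy(p):5.3f} bits/symbol")
```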
Information Loss Due to Noise
If a symbol A has probability P(A), then we have already seen that the
information transmitted is

I(A) = -log2(P(A))

If the channel is completely noiseless, then this is also the information
received. If the channel is not noiseless, then there is a loss in information
as the result of incorrect decisions.
The information received when there is noise in the system is redefined in
terms of the probability that the symbol really was transmitted, given that it
was received. This can be written mathematically as

I_received(A) = log2( P(A_TX | A_RX) / P(A_TX) )

The top line of this equation is always less than 1.000 in a system with noise.
The effective or received entropy Heff is therefore less than the transmitted
entropy H.
The difference between the transmitted entropy and the received (effective)
entropy Heff is called the equivocation (E) and is the loss in information rate
caused by noise (in bits/symbol):

E = H - Heff
Example: A source has an alphabet of three symbols, P(A) = 0.5, P(B) = 0.2 and
P(C) = 0.3. For an A transmitted, the probability of reception for each symbol
is A = 0.6, B = 0.2, C = 0.2. For a B transmitted, the probability of reception
for each symbol is A = 0.5, B = 0.5, C = 0.0. For a C transmitted, the
probability of reception for each symbol is A = 0.0, B = 0.333, C = 0.667.
Calculate the information associated with the transmission and reception of
each symbol, and calculate the equivocation. On transmission, P(A) = 0.5000,
P(B) = 0.2000, P(C) = 0.3000. Therefore the information associated with each
symbol at the transmitter is

I(A) = -log2(0.5000) = 1.0000 bits
I(B) = -log2(0.2000) = 2.3219 bits
I(C) = -log2(0.3000) = 1.7370 bits
On reception we need to calculate the probability of an A being received:

P(A_RX) = 0.6 × 0.5 + 0.5 × 0.2 + 0.0 × 0.3 = 0.4
Therefore we can calculate P(A_TX | A_RX), which is simply 0.3/0.4 = 0.75, and
the information received is

I_received(A) = log2(0.75/0.5) = 0.585 bits

Notice the reduction from 1.0000 bits on transmission to 0.585 bits on
reception.
If we go through a similar process for the symbol B:

P(B_RX) = 0.2 × 0.5 + 0.5 × 0.2 + 0.333 × 0.3 = 0.3
P(B_TX | B_RX) = 0.1/0.3 = 0.333
I_received(B) = log2(0.333/0.2) = 0.737 bits

A reduction from 2.3219 bits on transmission to 0.737 bits on reception.
If we go through a similar process for the symbol C:

P(C_RX) = 0.2 × 0.5 + 0.0 × 0.2 + 0.667 × 0.3 = 0.3
P(C_TX | C_RX) = 0.2/0.3 = 0.667
I_received(C) = log2(0.667/0.3) = 1.153 bits

A reduction from 1.7370 bits on transmission to 1.153 bits on reception.
Note how the symbol with the lowest probability (B) has suffered the greatest
loss due to noise. Noise creates uncertainty, and this has the greatest effect
on those signals which have very low probability. Think about this in the
context of alarm signals. To find the equivocation we need all the conditional
probabilities P(x_TX | y_RX), which are found using Bayes' rule:

P(x_TX | y_RX) = P(y_RX | x_TX) P(x_TX) / P(y_RX)
Averaging the uncertainty that remains about the transmitted symbol over all
received symbols then gives the effective entropy Heff, and therefore the
equivocation E = H - Heff for this channel.
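The whole example can be reproduced in a few lines of Python. This is a sketch rather than the slides' own working: it takes Heff to be the standard mutual information between transmitted and received symbols, so that the equivocation E = H - Heff is the average remaining uncertainty about the transmitted symbol after reception.

```python
import math

# Channel from the example: source probabilities and P(received | transmitted)
p_tx = {'A': 0.5, 'B': 0.2, 'C': 0.3}
p_rx_given_tx = {'A': {'A': 0.6, 'B': 0.2, 'C': 0.2},
                 'B': {'A': 0.5, 'B': 0.5, 'C': 0.0},
                 'C': {'A': 0.0, 'B': 0.333, 'C': 0.667}}
symbols = ['A', 'B', 'C']

# Probability of each received symbol
p_rx = {y: sum(p_tx[x] * p_rx_given_tx[x][y] for x in symbols) for y in symbols}

# Bayes' rule: P(x transmitted | y received)
p_tx_given_rx = {y: {x: p_tx[x] * p_rx_given_tx[x][y] / p_rx[y] for x in symbols}
                 for y in symbols}

# Transmitted entropy H
h_tx = -sum(p * math.log2(p) for p in p_tx.values())

# Equivocation: average remaining uncertainty about the transmitted symbol
equivocation = -sum(p_tx[x] * p_rx_given_tx[x][y] * math.log2(p_tx_given_rx[y][x])
                    for x in symbols for y in symbols if p_rx_given_tx[x][y] > 0)

print(f"H = {h_tx:.3f} bits/symbol")
print(f"E = {equivocation:.3f} bits/symbol, Heff = {h_tx - equivocation:.3f} bits/symbol")
```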
Channel Capacity
The Hartley-Shannon coding theorem states that the maximum capacity of a
channel (R_MAX) is given by

R_MAX = B log2(1 + S/N) bits/s

where B is the bandwidth of the channel in Hz and S/N is the signal to noise
ratio (as a power ratio, NOT in dB).
If we divide by the bandwidth we obtain the Shannon limit

C/B = log2(1 + S/N)

(writing C for the capacity R_MAX). Average signal power S can be expressed as

S = Eb × k / T = Eb × C

where Eb is the energy per bit, k is the number of bits per symbol, T is the
duration of a symbol, and C = k/T is the transmission rate of the system in
bits/s.
N = N0 B is the total noise power, where N0 is the one-sided noise power
spectral density in W/Hz. Substituting for S and N gives

C/B = log2(1 + (Eb/N0)(C/B))

From this we can calculate the minimum bit energy to noise power spectral
density ratio, called the Shannon Bound:

Eb/N0 = (2^(C/B) - 1) / (C/B)  →  ln(2) = 0.693 = -1.59 dB as C/B → 0
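The capacity formula and the Shannon Bound can be evaluated numerically; a sketch, with an assumed example channel (3.1 kHz bandwidth and 30 dB signal to noise ratio, not figures from the slides):

```python
import math

def capacity(bandwidth_hz, snr_power_ratio):
    """Hartley-Shannon capacity R_MAX = B log2(1 + S/N), in bits/s."""
    return bandwidth_hz * math.log2(1 + snr_power_ratio)

snr = 10 ** (30 / 10)                 # 30 dB converted to a power ratio first
print(capacity(3100, snr))            # roughly 31 kbit/s for this assumed channel

def min_eb_n0_db(c_over_b):
    """Minimum Eb/N0 in dB at spectral efficiency C/B, from C/B = log2(1 + (Eb/N0)(C/B))."""
    return 10 * math.log10((2 ** c_over_b - 1) / c_over_b)

for c_over_b in (2.0, 1.0, 0.5, 0.01):
    print(c_over_b, min_eb_n0_db(c_over_b))   # tends towards -1.59 dB as C/B -> 0
```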
This is for a continuous input, infinite block
size and coding rate 0.00 (not very practical!).
If we use a rate ½ code the capacity for a
continuous input and infinite block size is 0.00
dB (still not practical for digital
communications). If we use a rate ½ code and a
binary input and infinite block size, the
capacity limit is 0.19 dB. It is difficult (but
not impossible) to get very close to this limit,
for example Turbo codes in lecture 9.
Lecture 6
Lecture 6 contents
- Definition of coding efficiency
- Desirable properties of a source code
- Huffman coding
- Lempel-Ziv coding
- Other source coding methods
Definition of coding efficiency
For an arithmetic code (i.e. one based only on the probabilities of the
symbols, for example Huffman coding), the efficiency of the code is defined as

efficiency = H / L × 100%
where H is the entropy of the source as defined in lecture 5 and L is the
average length of a codeword,

L = Σ P(m) × lm

where P(m) is the probability of the symbol m and lm is the length of the
codeword assigned to the symbol m in binary digits.
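As a small sketch of this definition (the code lengths below assume the lecture 5 three-symbol source with codewords A = 0, B = 10, C = 11, an illustrative assignment rather than one taken from the slides):

```python
import math

def coding_efficiency(probs, lengths):
    """Return entropy H, average codeword length L and efficiency H/L."""
    h = -sum(p * math.log2(p) for p in probs)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    return h, avg_len, h / avg_len

h, avg_len, eff = coding_efficiency([0.6, 0.2, 0.2], [1, 2, 2])
print(f"H = {h:.3f} bits, L = {avg_len:.2f} binary digits, efficiency = {eff:.1%}")
```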
Variable Length Arithmetic Codes
The principle behind arithmetic coding schemes
(Huffman coding) is to assign different length
codewords to symbols. By assigning shorter
codewords to the more frequent (probable)
symbols, we hope to reduce the average length of
a codeword, L. There is one essential property
and one desirable property of variable length
codewords.
It is essential that a variable length code is uniquely decodable. This means
that a received message must have a single possible meaning. For example,
suppose we have four possible symbols in our alphabet and we assign them the
following codes: A = 0, B = 01, C = 11 and D = 00. If we receive the message
0011 then it is not known whether the message was D, C or A, A, C. This code is
not uniquely decodable and therefore useless.
It is also desirable that a code is instantaneously decodable. For example, if
we again have four symbols in our alphabet and we assign them the following
codes: A = 0, B = 01, C = 011, D = 111. This code is uniquely decodable but not
instantaneously decodable. If we receive the sequence 0010 (which represents A,
B, A) we do not know that the first digit (0) represents A rather than B or C
until we receive the second digit (0).
Similarly, we do not know whether the second (0) and third (1) received digits
represent B rather than C until we receive the fourth digit (0), and so on.
This code is usable, but the decoding is unnecessarily complicated. If we
reverse the order of the bits in our previous code, so that our new code is
A = 0, B = 10, C = 110, D = 111, then we have a uniquely and instantaneously
decodable code. This is called a comma code, because receipt of a 0 indicates
the end of a codeword (except for the maximum length case). The same message as
used in the previous example would be 0100, and this can be instantaneously
decoded as ABA, as in the decoding sketch below.
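A minimal decoding sketch for this comma code, showing that each symbol is known as soon as its last digit arrives:

```python
# Comma code from above: A = 0, B = 10, C = 110, D = 111.
# A codeword ends as soon as a 0 arrives, or after three 1s.
CODE = {'0': 'A', '10': 'B', '110': 'C', '111': 'D'}

def decode(bits):
    message, current = [], ''
    for bit in bits:
        current += bit
        if bit == '0' or current == '111':   # codeword complete, no look-ahead needed
            message.append(CODE[current])
            current = ''
    return ''.join(message)

print(decode('0100'))  # -> 'ABA'
```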
Simple Coding and Efficiency
For comparison, we shall use the same example for
simple and Huffman coding. This example consists
of eight symbols (A-H) with the probabilities of
occurrence given in the table.
Symbol Probability
A 0.10
B 0.18
C 0.40
D 0.05
E 0.06
F 0.10
G 0.07
H 0.04
Simple Coding
For the simple case (often referred to as the uncoded case) we would assign
each of the symbols a three binary digit code, as shown below.
Symbol Code
A 000
B 001
C 010
D 011
E 100
F 101
G 110
H 111
Simple Coding (efficiency)
The entropy of this source is

H = -Σ P(m) log2(P(m)) = 2.552 bits/symbol

The average length of a codeword is 3 binary digits, so the efficiency is

efficiency = 2.552 / 3 = 85.1%
Huffman coding
This is a variable length coding method. The method is:
1) Reduction. Write the symbols in descending order of probability. Reduce the
two least probable symbols into one symbol which has the probability of the two
symbols added together. Reorder again in descending order of probability.
Repeat until all symbols are combined into one symbol of probability 1.00.
2) Splitting process. Working backwards (from the right) through the tree you
have created, assign a 0 to the top branch of each combining operation and a 1
to the bottom branch. Add each new digit to the right of the previous one.
Symbol Code
A 011
B 001
C 1
D 00010
E 0101
F 0000
G 0100
H 00011
The average length of a codeword for this code is

L = Σ P(m) × lm = 2.61 binary digits/symbol
The efficiency of this code is therefore

efficiency = 2.552 / 2.61 = 97.8%

Note that this efficiency is higher than simple coding. If the probabilities
are all of the form 2^(-n), where n is an integer, then Huffman coding is 100%
efficient.
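A sketch of the reduction and splitting procedure using a priority queue. Ties between equal probabilities may be broken differently from the hand-worked slides, so individual codewords can differ, but the codeword lengths, and therefore the average length and efficiency, come out the same:

```python
import heapq
import math

def huffman_code(probs):
    """Huffman coding by repeatedly combining the two least probable groups."""
    heap = [(p, n, {sym: ''}) for n, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable group
        p2, _, codes2 = heapq.heappop(heap)   # next least probable group
        merged = {s: '0' + c for s, c in codes2.items()}        # top branch gets a 0
        merged.update({s: '1' + c for s, c in codes1.items()})  # bottom branch gets a 1
        heapq.heappush(heap, (p1 + p2, n, merged))
        n += 1
    return heap[0][2]

probs = {'A': 0.10, 'B': 0.18, 'C': 0.40, 'D': 0.05,
         'E': 0.06, 'F': 0.10, 'G': 0.07, 'H': 0.04}
code = huffman_code(probs)
h = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(f"H = {h:.3f} bits, L = {avg_len:.2f} binary digits, efficiency = {h / avg_len:.1%}")
```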
Lempel-Ziv Coding
This family of coding methods is fundamentally different from the arithmetic
technique described so far (Huffman coding). It uses the fact that small
strings within messages repeat locally. One such Lempel-Ziv method has a
history buffer and an incoming buffer and looks for matches. If a match is
found, the position and length of the match is transmitted instead of the
characters.
Suppose the string tam_eht is still to be transmitted and the history buffer
(already sent) contains _no_tas_tac_eht, with a history buffer of length 16
characters and a maximum match of 8 characters. We wish to encode tam_eht.
We would transmit _eht as a match to the _eht at the right end of our history
buffer using 8 bits: 1 bit to indicate a match has been found, 4 to give the
position of the match in the history buffer and 3 for the length of the match
(note that we only need 3 bits for a match length of up to 8, as a match of
length zero isn't a match at all!). So encoding _eht becomes:
1     to indicate a match has been found
1011  to indicate the match starts at position 11 in the history buffer
011   to indicate the length of the match is 4 (000 indicates 1, 001 indicates
      2, 010 indicates 3, 011 indicates 4, etc.)

Thus _eht encodes as 11011011. We then have:
The string tam is still to be transmitted and the history buffer (already sent)
now contains _eht_no_tas_tac_. There is no match for the m. We would then
transmit m as 9 bits: 1 bit to indicate we couldn't find a match and 8 bits for
the ASCII code for m. So encoding m becomes:

0         to indicate no match
01101101  eight bit ASCII code for m

Thus m encodes as 001101101.
We then have ta still to be transmitted and the history buffer (already sent)
now contains m_eht_no_tas_tac_. We would then transmit ta as a match to the ta
at history buffer position 9 or 13, using the same sort of 8 bit pattern as we
used for _eht. So encoding ta becomes:
1     to indicate a match has been found
1001  to indicate the match starts at position 9
001   to indicate the length of the match is 2 characters

Thus ta encodes as 11001001. Overall this means we have encoded 7 eight-bit
ASCII characters (56 binary digits) into 25 binary digits, a compression factor
of 25/56. This is fairly typical for practical implementations of this type of
coding.
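A sketch of this kind of history-buffer encoder. It is not the exact scheme of the worked example above (that example uses its own buffer layout, so the tokens differ); this version simply scans the text left to right with a 16-character history, 8-bit match tokens, 9-bit literal tokens, and an assumed rule that a match must be at least 2 characters long to be worth sending:

```python
def lz_encode(text, history_len=16, max_match=8):
    """History-buffer encoder: '1' + 4-bit position + 3-bit (length - 1) for a
    match, '0' + 8-bit ASCII for a literal character."""
    tokens, pos = [], 0
    while pos < len(text):
        history = text[max(0, pos - history_len):pos]
        best_len, best_at = 0, 0
        for start in range(len(history)):        # search for the longest match
            length = 0
            while (length < max_match and start + length < len(history)
                   and pos + length < len(text)
                   and history[start + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_at = length, start
        if best_len >= 2:   # assumed rule: a 1-character match is not worth 8 bits
            tokens.append('1' + format(best_at, '04b') + format(best_len - 1, '03b'))
            pos += best_len
        else:
            tokens.append('0' + format(ord(text[pos]), '08b'))
            pos += 1
    return tokens

out = lz_encode('the_cat_sat_on_the_mat')
print(out)
print(sum(len(t) for t in out), 'binary digits, versus', 8 * 22, 'uncoded')
```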
Normally the history buffer is much longer, at least 256 characters, and the
stream to be encoded doesn't rhyme! This type of technique has many advantages
over arithmetic source coding techniques such as Huffman. It is more efficient
(often achieving over 100% by our previous memoryless definition) and it can
adapt to changes in data statistics.
However, errors can be disastrous, the coding time is longer as matches have to
be searched for, and it doesn't work for short messages. This technique is used
for compression of computer files for transport (ZIP), for saving space on hard
disk drives (compress, stacker etc.) and in some data modem standards.
The particular implementation of Lempel-Ziv we have looked at is not very fast
because of the searching of history buffers. The implementation known as
Lempel-Ziv-Welch (LZW) builds up tables of commonly occurring strings and then
allocates them a 9 to 16 bit code (16 gives best compression but 9 is faster
and uses less memory). The compression ratio is similar to the method we have
looked at, but it is more suited to long messages and it is faster (but LZW is
useless for exam questions!).
LZW
- Encoding Algorithm
- 1) Initialize the dictionary to contain all blocks of length one (D = {a, b}).
- 2) Search for the longest block W which has appeared in the dictionary.
- 3) Encode W by its index in the dictionary.
- 4) Add W followed by the first symbol of the next block to the dictionary.
- 5) Go to Step 2 (a code sketch of these steps follows below).
Run Length Encoding
Images naturally contain a lot of data. A FAX page has 1728 pixels per line,
3.85 lines/mm and a page length of 300 mm, which gives almost 2 million pixels
per page. This would take around 7 minutes to transmit using the typical 4.8
kbit/s modem built into FAX machines. This is reduced using run length
encoding, which exploits the fact that pixels tend to occur in runs of black or
white.
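A sketch of the first step, turning a row of black/white pixels into run lengths (Group 3 FAX then Huffman-codes these run lengths, which this sketch does not attempt):

```python
from itertools import groupby

def run_lengths(pixels):
    """Convert a row of pixels ('W'/'B') into (colour, run length) pairs."""
    return [(colour, sum(1 for _ in run)) for colour, run in groupby(pixels)]

row = 'WWWWWWWWWWBBBWWWWWWWWWWWWWWWWWBBWWWWWW'   # an assumed toy scan line
print(run_lengths(row))   # long runs of one colour collapse into single pairs
```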
A standard Huffman code has been devised for the run lengths, based on their
typical probabilities, and is used in so-called Group 3 FAX transmission.
There is not much of a saving in a short example (34 binary digits reduced to
26), but remember that a real page of text contains large white spaces. If you
insert a document in a FAX machine, it will quickly pass through the white
areas of a page but slow down on dense text or images. An average page has a
compression factor of around 7 and therefore takes around a minute to pass
through the FAX machine.
Source Coding for Speech and Audio
In speech, the time series of samples is broken up into short blocks or frames
before encoding. We then transmit the coefficients of a model. For example, in
linear prediction coding (LPC) the encoder breaks the incoming sound into 30-50
frames per second. For each frame it sends the frequency (pitch) and intensity
(volume) of a simple buzzer. It also sends the coefficients of an FIR filter
that, when used with the buzzer, gives a minimum error between the input signal
and the filtered buzzer.
The coefficients of the filters are estimated using an adaptive filter which
compares its output with the voice input samples. LPC can bring the data rate
down to 2400 bits/s without sounding too bad, and more modern techniques such
as harmonic excited linear prediction (HELP) can be intelligible at 960 bits/s.
At these coding rates people tend to sound a bit artificial; typical mobile
phone coding rates for voice range from 9.6 kbit/s to around 13 kbit/s.
(Audio examples: original signal, LPC at 2400 bits/s, HELP at 960 bits/s.)
Other Techniques for Compression
Other techniques transform blocks of the input signal into the frequency domain
and use a perceptual model, precision adaptive subband coding (PASC), to
eliminate frequency bins which the ear cannot hear due to masking. This
technique was used in digital compact cassette (DCC) and has high perceived
quality (comparable with CD) while compressing audio by a factor of 4.
PASC
(Figure: spectral mask, power in dB against frequency in kHz on a log scale,
showing the threshold of human hearing and components masked by a signal close
in frequency.)
Other Techniques for Compression
LPC, HELP and PASC are lossy: there is a loss of information in the encoding
process, even when there is no noise in the transmission channel. This is very
noticeable in HELP with extremely low data rate transmission, whereas the loss
in DCC is imperceptible compared with a CD. However, DCC is being scrapped; the
people you rely on most to pay for new audio technology wouldn't pay for a
compressed system, and now DCC has been superseded by CD-R and CD-RW.