Title: CS225/ECE205A, Spring 2005: Information Theory
1. CS225/ECE205A, Spring 2005: Information Theory
- Wim van Dam
- Engineering 1, Room 5109
- vandam_at_cs
- http://www.cs.ucsb.edu/~vandam/teaching/S06_CS225/
2. Formalities
- The coordinates of the final:
  - date: Thursday, June 15
  - time: 4 to 7 pm
  - place: Phelps 1401
- The paper is due at the end of Week 10.
- The coming two weeks: bits, pieces, and exercises.
3. Chapter 7: Kolmogorov Complexity
4. Measuring Information
The data compression of Chapter 5 concerned the expected length of a compressed string that is sampled from a random source with entropy H(X). The practice of coding suggests that it is possible to talk about the compression of an individual bit-string. This is not obvious: should we view the string 0101110101001010001110001100011011 as a completely random string of 35 bits, or as a single letter from a source with alphabet {0,1}^35?
5. Randomness
We know that we can give a short description of 0101010101010101010101010101010101 as "Print 01 seventeen times." For other strings like 0101110101001010001110001100011011 this seems more problematic.
We want to make the following idea work:
A regular string is a string that has a short description. A random/irregular string has no such summary.
6. Allowed Descriptions
Problem: What do we consider a description? What may be an obvious description for one may be illegible for somebody else:
0110010010000111111011010101000100010000101101000
1100001000110100110001001100011001100010100010111
0000000110111000001110011010001001010010
?
We need a proper definition of a description.
7. Turing Describable
Key idea: We allow as our description the description of a Turing machine (which is a universal notion).
A string x will have many different TMs. We consider the size of the shortest description as an indication of the intrinsic complexity of x.
To fix the description (length) of the TM, we fix a universal Turing machine U that can take the descriptions of other TMs as binary input.
8. Kolmogorov Complexity of x
The Kolmogorov complexity KU(x) of a string x with respect to a universal TM U is defined as

  K_U(x) = \min_{p : U(p) = x} l(p),

where p is a program that describes x, and l(p) is its binary length.
AKA descriptive complexity, algorithmic complexity, or Kolmogorov-Solomonoff-Chaitin complexity.
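K(x) itself is uncomputable (see slide 22), but any off-the-shelf compressor witnesses an upper bound in the same spirit. A minimal sketch in Python, using zlib as a crude stand-in for the universal machine U; the fixed size of the decompressor plays the role of the additive constant:

import os, zlib

def k_upper_bound(x: bytes) -> int:
    """Upper bound (in bits) on a description length for x: the size of
    a zlib-compressed copy. The true K(x) is at most this, plus a constant."""
    return 8 * len(zlib.compress(x, 9))

regular = b"01" * 1000         # "print 01 a thousand times"
noise = os.urandom(2000)       # overwhelmingly incompressible (cf. slide 14)
print(k_upper_bound(regular))  # far below the raw 8 * 2000 bits
print(k_upper_bound(noise))    # roughly 8 * 2000 bits, or slightly more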
9. Kolmogorov(?) Complexity
The idea of measuring the complexity of bit-strings by the smallest possible Turing machine that produces the string was proposed independently by R. Solomonoff (1964), A. Kolmogorov (1965), and G. Chaitin (1966).
10. How Long is x?
- The definition of KU assumes that U does not know the length of the string x, which means that p has to contain this information.
- Sometimes we want to avoid this issue, in which case we use the conditional Kolmogorov complexity

  K_U(x | l(x)) = \min_{p : U(p, l(x)) = x} l(p).

Example: For x = 00...0 we have K(x) ≤ log l(x) + c, but K(x | l(x)) ≤ c.
11. Universality of KU
Universality question: The definition of K depends on the choice of the universal TM U. How critical is this? What happens if we replace U by some other TM M?
Theorem 7.2.1 (given a universal TM U): For any TM M there exists a constant cM such that for all strings x we have K_U(x) ≤ K_M(x) + c_M. (The constant does not depend on x.)
Proof idea: c_M is the amount of information that the universal TM U needs to simulate the TM M.
Straightforward consequence: If both U and M are universal TMs, then there exists a constant c such that |K_U(x) − K_M(x)| < c for all strings x.
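The simulation argument behind Theorem 7.2.1 fits in one line. A minimal sketch, assuming ⟨M⟩ denotes some fixed binary encoding of M that U accepts as a prefix of its input:

  % If p is a shortest M-program for x, then U can simulate M on p:
  % U(\langle M\rangle p) = M(p) = x, hence
  K_U(x) \le l(\langle M\rangle) + l(p) = K_M(x) + c_M,
  \quad \text{with } c_M := l(\langle M\rangle).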
12. Dealing with Constants
- Because of the previous universality result, we denote the Kolmogorov complexity simply by K(x).
- You have to be careful with the intrinsic constants to reach meaningful statements. Think quantifiers: "For all x there is a c such that K(x) ≤ c" is trivial. "There is a c such that for all x: K(x) ≤ c" is false.
- Some straightforward results (c does not depend on x):
- Theorem 7.2.2: K(x | l(x)) ≤ l(x) + c.
- Theorem 7.2.3: K(x) ≤ K(x | l(x)) + 2 log l(x) + c.
13. Some Examples
- Although K is supposed to talk about the complexity of individual strings, our results are typically phrased in the familiar "in the limit" setting.
- Hence for x = 0^n we say K(x | n) ≤ c and K(x) ≤ 2 log n + c.
- Question: Why not K(x) ≤ log n + c?
- Other example: In binary we have π = 11.0010010... If π_1...π_n denote the first n bits, then K(π_1...π_n | n) ≤ c.
- This last example indicates the difference between Shannon's information measure and K.
14. Incompressibility
Theorem 7.2.4: The number of strings x ∈ {0,1}* with Kolmogorov complexity K(x) < k is less than 2^k.
Proof: There are no more than 2^j programs of length j, hence there are no more than 1 + 2 + ... + 2^(k−1) = 2^k − 1 programs of length less than k.
For every k there are 2^k strings of length k, hence there exists an incompressible string x with K(x) ≥ l(x) = k.
This is a nonconstructive proof: for reasonably sized k it is impossible to give an explicit x with K(x) ≥ l(x). (Why?) We can ask, though: what are the properties of such x?
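The counting in Theorem 7.2.4 is elementary enough to check numerically; in this sketch the "programs" are just abstract binary strings, exactly as in the proof:

def programs_shorter_than(k: int) -> int:
    """Number of binary programs of length < k: 1 + 2 + ... + 2^(k-1)."""
    return sum(2**j for j in range(k))  # equals 2**k - 1

for k in (4, 8, 16):
    strings = 2**k                      # strings of length exactly k
    programs = programs_shorter_than(k)
    # Pigeonhole: at least one length-k string lacks a sub-k description.
    print(k, strings - programs)        # always prints k, 1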
15. Shannon Entropy versus K?
- How does Shannon's entropy relate to K-complexity?
- Example: What is the complexity of strings with ten 1s?
- Answer: For large lengths l(x), we only need the ten indices that describe the 1-positions: K(x) ≤ 10 log l(x) + c.
- Example: What is the complexity of strings with 10% 1s?
- Answer: Given the 10% rate, index all allowed strings. This index is a number smaller than |{x : x has 10% 1s}|. We know that there are approximately 2^(l(x)·H(0.1, 0.9)) such strings. Hence K(x | l(x)) ≤ l(x)·H(0.1, 0.9) + c.
16. Shannon Entropy versus K
- A string x of length n and Hamming weight w can be described by w and an index j for the right element in the set S = {y : y ∈ {0,1}^n and w(y) = w}, with |S| = C(n,w).
- Hence we have the bound K(x | n) ≤ 2 log w + log |S| + c.
- Because log C(n,w) ≤ n·H(w/n, 1−w/n) (checked numerically below), we have:
- Theorem 7.2.5: K(x | n) ≤ n·H(w/n, 1−w/n) + 2 log w + c.
- This shows that Shannon's entropy gives an upper bound on the average Kolmogorov complexity per letter.
- An incompressible string is 50-50 in zeros and ones.
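A small sketch comparing the exact index length log C(n,w) with the entropy bound of Theorem 7.2.5:

from math import comb, log2

def H(p: float) -> float:
    """Binary entropy H(p, 1-p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*log2(p) - (1-p)*log2(1-p)

n = 1000
for w in (10, 100, 500):
    index_bits = log2(comb(n, w))   # bits needed to index a weight-w string
    shannon_bits = n * H(w / n)     # the entropy bound of Theorem 7.2.5
    print(w, round(index_bits, 1), "<=", round(shannon_bits, 1))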
17. Kolmogorov Complexity
The (conditional) Kolmogorov complexity of a string x with respect to a universal TM U is defined as

  K_U(x | y) = \min_{p : U(p, y) = x} l(p),

where l(p) is the binary length of p, which is the (binary, self-delimiting) program that describes x to U.
18. Two Trivial Upper Bounds
- Be aware that the input tape is binary, without markers.
- The l(x) in K(x | l(x)) can act like a marker, hence Theorem 7.2.2: K(x | l(x)) ≤ l(x) + c.
- Proof: Let c bits describe the program "Print the following l(x) bit values: 01001000100..."
- If l(x) is unknown then we have to encode it. Note that l(x) can be an arbitrarily big number, which we have to describe in a self-delimiting way to avoid confusion.
- Easy/lazy solution: use 0 → 00, 1 → 11, with end-marker 01 (see the sketch after this list).
- Theorem 7.2.3: K(x) ≤ K(x | l(x)) + 2 log l(x) + c.
- Proof: Let c + 2 log l(x) bits describe the program "Print the following l(x) bit values: 01001000100..."
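The easy/lazy doubling code is concrete enough to implement; a minimal sketch (function names are mine):

def self_delimiting(n: int) -> str:
    """Encode n by doubling each bit of its binary expansion (0 -> 00,
    1 -> 11) and appending the end-marker 01: about 2 log n + 2 bits."""
    return "".join(b + b for b in bin(n)[2:]) + "01"

def decode_prefix(s: str) -> tuple[int, str]:
    """Read one self-delimited integer off the front of s; return (n, rest).
    Scanning aligned pairs, only 00 or 11 can occur before the 01 marker."""
    bits, i = "", 0
    while s[i:i+2] != "01":
        bits += s[i]
        i += 2
    return int(bits, 2), s[i+2:]

msg = self_delimiting(34) + "01" * 17   # length header, then the raw bits of x
print(decode_prefix(msg))               # -> (34, '0101...01')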
19. Average K-Complexity?
- Given a set S of size |S|, what can we say about the average Kolmogorov complexity (Σ_{x∈S} K(x)) / |S|?
- The programs p_x that reproduce the x have to be described in a self-delimiting way. Kraft's inequality then says

  \sum_{x \in S} 2^{-K(x)} \le 1.

- Using Jensen's inequality, this gives us (see the derivation below)

  \frac{1}{|S|} \sum_{x \in S} K(x) \ge \log |S|.
20. Shannon vs Kolmogorov
- A string x of length n and Hamming weight k can be described by k and an index j for the right element in the set S = {y : y ∈ {0,1}^n and w(y) = k}, with |S| = C(n,k).
- Because log C(n,k) ≤ n·H(k/n, 1−k/n), we get
- Theorem 7.2.5: K(x | n) ≤ n·H(k/n, 1−k/n) + 2 log k + c,
- and also, combining with the averaging bound of the previous slide, that the average of K(x | n) over S is at least log C(n,k).
21. K Complexity of Integers
- Using binary encodings we can talk about the Kolmogorov complexity of all sorts of things.
- Kolmogorov complexity K(N) of integers N ∈ ℕ, with K(N) ≤ log N + 2 log log N + c.
- Theorem 7.4.3: Summing over all of ℕ it must hold that

  \sum_{N=1}^{\infty} 2^{-K(N)} \le 1,

  hence for an infinite number of integers K(N) > log N.
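To see why the Kraft sum forces infinitely many complex integers: if K(N) ≤ log N held for all but finitely many N, the sum would dominate the (divergent) harmonic series,

  \sum_{N=1}^{\infty} 2^{-K(N)} \ge \sum_{N} 2^{-\log N} = \sum_{N} \frac{1}{N} = \infty,

contradicting \sum_N 2^{-K(N)} \le 1.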
22. (Un)computability of K
To which extent can we know the value K(x)?
- An upper bound K(x) ≤ T can be proven by a specific example of a program p (of length T) such that U(p) = x.
- For a lower bound K(x) ≥ T we have to prove that all programs p with length < T do not produce x. The uncomputability of the Halting problem shows that this is impossible to do for all but the smallest T.
For specific x, K(x) can only be approximated from above.
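"Approximated from above" has an effective meaning: dovetail over all short programs and record the shortest one seen so far that prints x. A toy sketch, assuming a hypothetical interpreter run(p, steps) for the universal machine (not provided here):

from itertools import count, product

def k_upper_bounds(x: str, run):
    """Yield a nonincreasing sequence of upper bounds on K(x):
    at stage t, run every program of length <= t for t steps.
    The sequence converges to K(x), but no stage certifies it."""
    best = float("inf")
    for t in count(1):                        # dovetail over time budgets
        for length in range(1, t + 1):
            for bits in product("01", repeat=length):
                p = "".join(bits)
                if run(p, steps=t) == x and length < best:
                    best = length
                    yield best                # a new, smaller upper bound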
23. Applications of K-Complexity
Kolmogorov complexity gives a rigorous definition
of the notions of order and randomness.
The TM model gives us the most general way of
describing mathematical objects like primes,
computer programs, mathematical theories, graphs,
and so on. Together with the incompressibility
theorem, this allows us to make general
statements about these objects.
24. Counting Primes less than N
Q: How many primes are there less than N?
Let p_1, ..., p_m be the m primes between 1 and N. We know that we can describe N by

  N = p_1^{e_1} p_2^{e_2} \cdots p_m^{e_m}.

Hence <e_1, ..., e_m> gives a description of N. Furthermore, for each j we have e_j ≤ log N. Thus <e_1, ..., e_m> requires less than 2m log log N bits.
Incompressibility: There are N with K(N) > log N.
Conclusion: m ≥ ½ log N / log log N − c / log log N.
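A quick numerical sanity check of the conclusion against the true prime counts (taking c = 0 just for illustration):

from math import log2

def count_primes(n: int) -> int:
    """Count primes <= n with a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return sum(sieve)

for N in (10**3, 10**4, 10**6):
    bound = 0.5 * log2(N) / log2(log2(N))   # the slide's lower bound, c = 0
    print(N, count_primes(N), ">=", round(bound, 1))

The bound is extremely weak compared to the prime number theorem's N / ln N, but it is remarkable that it follows from incompressibility alone.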
25. Occam's Razor
Why do we assume that the sun will rise tomorrow? What is the next number in 0, 1, 4, 9, ...?
Shorter explanations are preferred over longer ones.
26. Learning Theory
Assume that there is a TM M inside the box. (The function F is Turing computable.)
[Figure: a black box mapping input j to output F(j).]
Given a (finite) set of observations about F, we know that certain TMs are impossible, while other TM models M are consistent with our data.
How likely do we consider the possible models? Occam's Razor approach: Pr(TM = M) ∝ 2^(−K(M)). From this we can make nontrivial predictions about F.
27. Universal Probability
How likely is a string of binary data x? In a completely random universe each string would have equal probability 2^(−l(x)). Typically data is not completely random but generated by a process p, which can be modeled by a Turing machine. The smaller the TM, the more likely the process. The universal probability captures this viewpoint:

  P_U(x) = \sum_{p : U(p) = x} 2^{-l(p)}.
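Note how the single shortest program already accounts for a fixed fraction of this sum, which is half of Theorem 7.11.1 on the next slide:

  P_U(x) = \sum_{p : U(p) = x} 2^{-l(p)} \ge 2^{-K(x)},

since the shortest program for x contributes the term 2^(−K(x)).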
28. Why Universal Probability?
- Theorem 7.6.1: For every TM M, there is a constant c_M such that P_U(x) ≥ c_M · P_M(x) for all x.
- For every universal TM W there are constants c, d such that d · P_W(x) ≤ P_U(x) ≤ c · P_W(x) for all x.
- Theorem 7.11.1: There exists a constant c such that for all strings x:

  2^{-K(x)} \le P_U(x) \le c \cdot 2^{-K(x)}.