Transcript and Presenter's Notes

Title: CS225/ECE205A, Spring 2005: Information Theory


1
CS225/ECE205A, Spring 2005: Information Theory
  • Wim van Dam
  • Engineering 1, Room 5109, vandam_at_cs
  • http://www.cs.ucsb.edu/vandam/teaching/S06_CS225/

2
Formalities
  • The coordinates of the final:
  • date: Thursday June 15
  • time: 4 to 7pm
  • place: Phelps 1401
  • The paper is due at the end of Week 10.
  • Coming two weeks: bits, pieces, and exercises.

3
Chapter 7: Kolmogorov Complexity
4
Measuring Information
The data compression of Chapter 5 concerned the expected length of a compressed string that is sampled from a random source with entropy H(X). The practice of coding suggests that it is possible to talk about the compression of an individual bit-string. This is not obvious.
Should we view the string 0101110101001010001110001100011011 as a completely random string of 35 bits, or as a single letter from a source with alphabet {0,1}^35?
5
Randomness
We know that we can give a short description of 0101010101010101010101010101010101 as "Print 01 seventeen times". For other strings like 0101110101001010001110001100011011 this seems more problematic.
We want to make the following idea work:
A regular string is a string that has a short description. A random/irregular string has no such summary.
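A rough illustration of this idea in code (a sketch only, not part of the original slides; Python stands in for the description language): the regular string is produced by a program much shorter than the string itself, while for the irregular string we know no better program than one that quotes it literally.

  # Two "descriptions" of 34-bit strings, written as tiny Python programs.
  regular = "01" * 17                                 # short program, long regular output
  irregular = "0101110101001010001110001100011011"    # no shorter description known: quote the string
  print(regular)
  print(irregular)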
6
Allowed Descriptions
Problem: What do we consider a description? What may be an obvious description for one may be illegible for somebody else:
0110010010000111111011010101000100010000101101000
1100001000110100110001001100011001100010100010111
0000000110111000001110011010001001010010
?
We need a proper definition of a description.
7
Turing Describable
Key idea: We allow as our description the description of a Turing machine (which is a universal notion).
A string x will have many different TMs that produce it. We consider the size of the shortest description as an indication of the intrinsic complexity of x.
To fix the description (length) of the TM, we fix a universal Turing machine U that can take the description of other TMs as binary input.
8
Kolmogorov Complexity of x
The Kolmogorov complexity K_U(x) of a string x with respect to a universal TM U is defined as
K_U(x) = min { l(p) : U(p) = x },
where p ranges over the programs that describe x (that is, U(p) = x), and l(p) is the binary length of p.
AKA descriptive complexity, algorithmic complexity, or Kolmogorov-Solomonoff-Chaitin complexity.
9
Kolmogorov(?) Complexity
The idea of measuring the complexity of bit-strings by the smallest possible Turing machine that produces the string was proposed by
R. Solomonoff (1964) - A. Kolmogorov (1965) - G. Chaitin (1966).
10
How Long is x?
  • The definition of K_U assumes that U does not know the length of the string x, which means that p has to contain this information.
  • Sometimes we want to avoid this issue, in which case we use the conditional Kolmogorov complexity K_U(x | l(x)) = min { l(p) : U(p, l(x)) = x }.

Example: For x = 00...0 we have K(x) ≤ log l(x) + c, but K(x | l(x)) ≤ c.
11
Universality of K_U
Universality question: The definition of K depends on the choice of the universal TM U. How critical is this? What happens if we replace U by some other TM M?
Theorem 7.2.1 (given a universal TM U): For any TM M there exists a constant c_M such that for all strings x we have K_U(x) ≤ K_M(x) + c_M (the constant does not depend on x).
Proof idea: c_M is the amount of information that the universal TM U needs to simulate the TM M.
Straightforward consequence: If both U and M are universal TMs, then there exists a constant c such that |K_U(x) − K_M(x)| < c for all strings x.
12
Dealing with Constants
  • Because of the previous universality result, we denote the Kolmogorov complexity simply by K(x).
  • You have to be careful with the intrinsic constants to reach meaningful statements. Think quantifiers (in symbols below): "For all x there is a c such that K(x) ≤ c" is trivial. "There is a c such that for all x, K(x) ≤ c" is false.
  • Some straightforward results (c does not depend on x):
  • Theorem 7.2.2: K(x | l(x)) ≤ l(x) + c.
  • Theorem 7.2.3: K(x) ≤ K(x | l(x)) + 2 log l(x) + c.
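The quantifier contrast, written out (a small aside, not on the original slide):
\[
\forall x\;\exists c:\; K(x) \le c \quad\text{is trivially true (take } c = K(x)\text{),}
\qquad
\exists c\;\forall x:\; K(x) \le c \quad\text{is false (there are incompressible strings of every length).}
\]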

13
Some Examples
  • Although K is supposed to talk about the complexity of individual strings, our results are typically phrased in the familiar "in the limit" setting.
  • Hence for x = 0^n we say K(x | n) ≤ c and K(x) ≤ log n + c.
  • Question: Why not K(x) = log n + c?
  • Other example: In binary we have π = 11.0010010... If π_1...π_n denotes the first n bits of this expansion, then K(π_1...π_n | n) ≤ c.
  • This last example indicates the difference between Shannon's information measure and K.

14
Incompressibility
Theorem 7.2.4: The number of strings x ∈ {0,1}* with Kolmogorov complexity K(x) < k is less than 2^k.
Proof: There are no more than 2^j programs of length j, hence there are no more than 1 + 2 + ... + 2^(k−1) = 2^k − 1 programs of length less than k.
For every k there are 2^k strings of length k, hence there exists an incompressible string x with K(x) ≥ l(x) = k.
This is a nonconstructive proof: for reasonably sized k it is impossible to give an explicit x with K(x) ≥ l(x). Why? We can ask, though, what are the properties of such an x?
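The counting in the proof, written out as a single chain (nothing beyond the slide's argument):
\[
\#\{\,p : l(p) < k\,\} \;\le\; \sum_{j=0}^{k-1} 2^{j} \;=\; 2^{k}-1 \;<\; 2^{k} \;=\; \#\{\,x : l(x) = k\,\},
\]
so at least one string of length k is not the output of any program of length less than k.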
15
Shannon Entropy versus K?
  • How does Shannon's entropy relate to K-complexity?
  • Example: What is the complexity of strings with ten 1s?
  • Answer: For large lengths l(x), we only need the ten indices that describe the 1-positions: K(x) ≤ 10 log l(x) + c.
  • Example: What is the complexity of strings with 10% 1s?
  • Answer: Given the 10% rate, index all allowed strings. This index is a number smaller than |{ x : x has 10% 1s }|. We know that there are approximately 2^(l(x)·H(0.1, 0.9)) such strings. Hence K(x | l(x)) ≤ l(x)·H(0.1, 0.9) + c.

16
Shannon Entropy versus K
  • A string x of length n and Hamming weight w can be described by w and an index j for the right element in the set S = { y : y ∈ {0,1}^n and w(y) = w }, with |S| = Binom(n,w).
  • Hence we have the bound K(x | n) ≤ 2 log w + log |S| + c.
  • Because log Binom(n,w) ≤ n·H(w/n, 1−w/n) we have
  • Theorem 7.2.5: K(x | n) ≤ n·H(w/n, 1−w/n) + 2 log w + c.
  • This shows that Shannon's entropy gives an upper bound on the average Kolmogorov complexity per letter (see the numerical sketch below).
  • An incompressible string is 50-50 in zeros and ones.
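A small numerical check of the two sides of Theorem 7.2.5 (a sketch only; Python and the concrete values of n and w are illustrative, and the additive constant c is ignored):

  from math import comb, log2

  def H(p):
      """Binary entropy H(p, 1-p) in bits."""
      return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

  n, w = 1000, 100                       # string length and Hamming weight
  index_bits = log2(comb(n, w))          # log |S|: bits to index x among all weight-w strings
  bound = n * H(w / n) + 2 * log2(w)     # n*H(w/n, 1-w/n) + 2 log w, the theorem's bound (minus c)
  print(index_bits, bound)               # log Binom(n,w) <= n*H(w/n), so the index always fits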

17
Kolmogorov Complexity
The conditional Kolmogorov complexity of a string x, given a string y, with respect to a universal TM U is defined as
K_U(x | y) = min { l(p) : U(p, y) = x },
where l(p) is the binary length of p, which is the (binary, self-delimiting) program that describes x to U when U is also given y.
18
Two Trivial Upper Bounds
  • Be aware that the input tape is binary, without markers.
  • The l(x) in K(x | l(x)) can act like a marker, hence
  • Theorem 7.2.2: K(x | l(x)) ≤ l(x) + c.
  • Proof: Let c bits describe the program "Print the following l(x) bit values: 01001000100..."
  • If l(x) is unknown then we have to encode it. Note that l(x) is an arbitrarily big number, which we have to describe in a self-delimiting way to avoid confusion.
  • Easy/lazy solution: encode each bit of l(x) as 0 ↦ 00 and 1 ↦ 11, with end-marker 01 (see the sketch after this list).
  • Theorem 7.2.3: K(x) ≤ K(x | l(x)) + 2 log l(x) + c.
  • Proof: Let c + 2 log l(x) bits describe the program "Print the following l(x) bit values: 01001000100..."
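A sketch of the easy/lazy self-delimiting encoding of l(x) (Python chosen purely for illustration; not part of the original slides):

  def self_delimiting(n: int) -> str:
      """Encode the integer n by doubling each bit of its binary form (0 -> 00, 1 -> 11)
      and appending the end-marker 01, so a reader can tell where n stops."""
      bits = format(n, "b")
      return "".join("00" if b == "0" else "11" for b in bits) + "01"

  print(self_delimiting(12))   # 12 = 1100 in binary -> 1111000001, about 2 log n + 2 bits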

19
Average K-Complexity?
  • Given a set S of size |S|, what can we say about the average Kolmogorov complexity (Σ_{x∈S} K(x)) / |S|?
  • The programs p_x that reproduce the x have to be described in a self-delimiting way. Kraft's inequality then says Σ_x 2^{−K(x)} ≤ 1.
  • Using Jensen's inequality, this gives us (Σ_{x∈S} K(x)) / |S| ≥ log |S| (derivation below).
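The Jensen step, written out (this uses only convexity of 2^{−t} together with the Kraft bound above):
\[
2^{-\frac{1}{|S|}\sum_{x\in S} K(x)} \;\le\; \frac{1}{|S|}\sum_{x\in S} 2^{-K(x)} \;\le\; \frac{1}{|S|}\sum_{x} 2^{-K(x)} \;\le\; \frac{1}{|S|},
\qquad\text{hence}\qquad
\frac{1}{|S|}\sum_{x\in S} K(x) \;\ge\; \log |S| .
\]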

20
Shannon vs Kolmogorov
  • A string x of length n and Hamming weight k can be described by k and an index j for the right element in the set S = { y : y ∈ {0,1}^n and w(y) = k }, with |S| = Binom(n,k).
  • Because log Binom(n,k) ≤ n·H(k/n, 1−k/n) we get
  • Theorem 7.2.5: K(x | n) ≤ n·H(k/n, 1−k/n) + 2 log k + c,
  • and also, per letter, K(x | n)/n ≤ H(k/n, 1−k/n) + (2 log k + c)/n.

21
K-Complexity of Integers
  • Using binary encodings we can talk about the Kolmogorov complexity of all sorts of things.
  • Kolmogorov complexity K(N) of integers N ∈ ℕ, with K(N) ≤ log N + 2 log log N ≈ log N.
  • Theorem 7.4.3: Summing over all of ℕ it must hold that Σ_N 2^{−K(N)} ≤ 1,
  • hence for an infinite number of integers K(N) > log N.
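Why the summation bound forces K(N) > log N infinitely often (filling in the step the slide leaves implicit): if we had K(N) ≤ log N for all but finitely many N, then
\[
\sum_{N} 2^{-K(N)} \;\ge\; \sum_{N \ge N_0} 2^{-\log N} \;=\; \sum_{N \ge N_0} \frac{1}{N} \;=\; \infty ,
\]
contradicting Σ_N 2^{−K(N)} ≤ 1.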

22
(Un)computability of K
To which extent can we know the value K(x)?
- An upper bound K(x) ≤ T can be proven by giving a specific example of a program p (of length T) such that U(p) = x.
- For a lower bound K(x) ≥ T we have to prove that all programs p with length < T do not produce x.
The uncomputability of the Halting problem shows that this is impossible to do for all but the smallest T. For specific x, K(x) can only be approximated from above.
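In the same "from above only" spirit, any off-the-shelf compressor yields a computable upper bound on the description length of x relative to that one fixed decompressor. A minimal sketch (not on the original slides, and of course not K(x) itself, just one particular upper bound):

  import os
  import zlib

  def description_upper_bound(x: bytes) -> int:
      """Length in bits of one particular description of x: its zlib-compressed form.
      This can only ever upper-bound the shortest description; it never certifies a lower bound."""
      return 8 * len(zlib.compress(x, 9))

  print(description_upper_bound(b"01" * 1000))       # very regular data: compresses a lot
  print(description_upper_bound(os.urandom(2000)))   # random bytes: hardly compress at all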
23
Applications of K-Complexity
Kolmogorov complexity gives a rigorous definition
of the notions of order and randomness.
The TM model gives us the most general way of
describing mathematical objects like primes,
computer programs, mathematical theories, graphs,
and so on. Together with the incompressibility
theorem, this allows us to make general
statements about these objects.
24
Counting Primes less than N
Q: How many primes are there less than N?
Let p_1, ..., p_m be the m primes between 1 and N. We know that we can describe N by its factorization N = p_1^{e_1} · p_2^{e_2} · ... · p_m^{e_m}.
Hence <e_1, ..., e_m> gives a description of N. Furthermore, for each j we have e_j ≤ log N, so each exponent takes at most log log N bits. Thus <e_1, ..., e_m>, encoded self-delimitingly, requires less than 2m log log N bits.
Incompressibility: There are N with K(N) > log N.
Conclusion: m ≥ ½ log N / log log N − c / log log N.
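Putting the pieces together for an incompressible N (this just expands the slide's last step):
\[
\log N \;<\; K(N) \;\le\; 2m\log\log N + c
\quad\Longrightarrow\quad
m \;\ge\; \frac{\log N}{2\log\log N} \;-\; \frac{c}{2\log\log N},
\]
which is the stated lower bound on the number of primes below N (up to the constant).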
25
Occam's Razor
Why do we assume that the sun will rise tomorrow? What is the next number in 0, 1, 4, 9, ...?
Shorter explanations are preferred over longer ones.
26
Learning Theory
Assume that there is a TM M inside the box. (The function F is Turing computable.)
[Figure: a black box that on input j produces output F(j).]
Given a (finite) set of observations about F, we know that certain TMs are impossible, while other TM models M are consistent with our data.
How likely do we consider the possible models? Occam's Razor approach: Pr(TM M) ∝ 2^{−K(M)}. From this we can make nontrivial predictions about F.
27
Universal Probability
How likely is a string of binary data x? In a completely random universe each string would have equal probability 2^{−l(x)}. Typically data is not completely random but generated by a process p, which can be modeled by a Turing machine. The smaller the TM, the more likely the process. The universal probability captures this viewpoint:
P_U(x) = Σ_{p : U(p) = x} 2^{−l(p)}.
28
Why Universal Probability?
  • Theorem 7.6.1: For every TM M, there is a constant c_M such that P_U(x) ≥ c_M · P_M(x) for all x.
  • For every universal TM W there are constants c, d such that d · P_W(x) ≤ P_U(x) ≤ c · P_W(x) for all x.
  • Theorem 7.11.1: There exists a constant c such that 2^{−K(x)} ≤ P_U(x) ≤ c · 2^{−K(x)} for all strings x.