A Bit of Information Theory

1
A Bit of Information Theory
  • Unsupervised Learning Working Group
  • Assaf Oron, Oct. 15 2003

Based mostly upon Cover & Thomas, Elements of
Information Theory, 1991
2
Contents
  • Coding and Transmitting Information
  • Entropy etc.
  • Information Theory and Statistics
  • Information Theory and Machine Learning

3
What is Coding? (1)
  • We keep coding all the time
  • Crucial requirement for coding: source and
    receiver agree on the key.
  • Modern coding: telegraph -> radio -> ...
  • Practical problem: how efficient can we make it?
    Tackled from the 1920s on.
  • 1940s: Claude Shannon

4
What is Coding? (2)
  • Shannon's greatness: finding a solution to the
    specific problem by working on the general
    problem.
  • Namely: how does one quantify information, its
    coding and its transmission?
  • ANY type of information

5
Some Day-to-Day Codes
Code             | Channel                                 | Unique? Instant?
Spoken Language  | Sounds via air                          | Well...
Written Language | Signs on paper/screen                   | Well...
Numbers and math | Signs on paper/screen, electronic, etc. | Usually (decimal point, operation signs, etc.)
DNA protein code | Nucleotide pairs                        | Yes (start, end, 3-somes)
6
Information Complexity of Some Coded Messages
  • Let's think of written numbers
  • k digits → 10^k possible messages
  • How about written English?
  • k letters → 26^k possible messages
  • k words → D^k possible messages, where D is
    English dictionary size
  • ⇒ Length ~ log(complexity)
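
The counting argument on this slide can be checked numerically. The sketch below is illustrative only: the helper name `bits_needed` and the dictionary size of 50,000 words are assumptions, not figures from the talk. It shows that the number of bits needed to single out one message grows linearly in k, that is, logarithmically in the number of possible messages.

```python
import math

def bits_needed(alphabet_size: int, k: int) -> float:
    """Bits needed to index one of alphabet_size**k equally likely messages."""
    # log2(alphabet_size**k) = k * log2(alphabet_size):
    # linear in k, logarithmic in the number of possible messages.
    return k * math.log2(alphabet_size)

if __name__ == "__main__":
    D = 50_000  # assumed dictionary size, for illustration only
    for name, size in [("digits", 10), ("letters", 26), ("words", D)]:
        print(f"5 {name}: {bits_needed(size, 5):.1f} bits")
```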

7
Information Entropy
  • The expected length (bits) of a binary message
    conveying x-type information
  • Other common descriptions: code complexity,
    uncertainty, missing/required information,
    expected surprise, information content (BAD),
    etc.
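
For a discrete distribution p(x), the standard definition (as in Cover & Thomas) is H(X) = -Σ p(x) log2 p(x), measured in bits. The following is a minimal sketch of that formula; the function name `entropy` is an illustrative choice.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p); p == 0 terms contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

if __name__ == "__main__":
    print(entropy([0.5, 0.5]))   # fair coin
    print(entropy([0.9, 0.1]))   # biased coin: less uncertainty than a fair one
    print(entropy([0.25] * 4))   # uniform over four outcomes
```

For a fixed number of outcomes the uniform distribution gives the largest value, and a certain outcome gives zero, which matches the "uncertainty" and "expected surprise" readings listed above.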

8
Why Entropy?
  • Thermodynamics (mid-19th century): amount of
    unusable heat in the system
  • Statistical physics (late 19th century): log
    (complexity of current system state)
  • → amount of "mess" in the system
  • The two were proven to be equivalent
  • Statistical entropy is proportional to
    information entropy if p(x) is uniform
  • 2nd Law of Thermodynamics
  • Entropy never decreases (more later)

9
Entropy Properties, Examples
  • .

10
Kullback-Leibler Divergence (Relative Entropy)
  • In words: the excess message length incurred by
    using a p(x)-optimized code for messages that
    actually follow q(x)
  • Properties, Relation to H
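
One way to make the "excess message length" reading concrete: if messages follow q but the code is optimal for p, the expected length is the cross-entropy H(q, p), and the excess over the best possible length H(q) is D(q || p). The sketch below uses illustrative function names and assumes p(x) > 0 wherever q(x) > 0.

```python
import math

def entropy(q):
    """H(q) in bits: the best achievable expected code length for source q."""
    return -sum(qi * math.log2(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """Expected length in bits when data follow q but the code is optimal for p."""
    return -sum(qi * math.log2(pi) for qi, pi in zip(q, p) if qi > 0)

def kl_divergence(q, p):
    """D(q || p) = sum q * log2(q / p): excess bits per symbol.

    Assumes p[i] > 0 wherever q[i] > 0.
    """
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

if __name__ == "__main__":
    q = [0.5, 0.25, 0.25]   # true source distribution (illustrative)
    p = [1 / 3, 1 / 3, 1 / 3]  # distribution the code was designed for
    print(kl_divergence(q, p), cross_entropy(q, p) - entropy(q))
```

The two printed numbers agree up to rounding. D is zero when the two distributions coincide, never negative, and not symmetric in its arguments.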

11
Mutual Information
  • Relationship to D, H (hint: conditional probability)
  • Properties, Examples
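
The standard identities are I(X;Y) = D(p(x,y) || p(x)p(y)) = H(X) + H(Y) - H(X,Y). The sketch below computes both forms from a small joint probability table; the function names and the example tables are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    # K-L divergence between the joint and the product of its marginals.
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def mutual_information_via_entropies(joint):
    """Same quantity computed as H(X) + H(Y) - H(X,Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat = [pxy for row in joint for pxy in row]
    return entropy(px) + entropy(py) - entropy(flat)

if __name__ == "__main__":
    independent = [[0.25, 0.25], [0.25, 0.25]]
    identical = [[0.5, 0.0], [0.0, 0.5]]
    noisy = [[0.4, 0.1], [0.1, 0.4]]
    for table in (independent, identical, noisy):
        print(mutual_information(table))
```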

12
Entropy for Continuous RVs
  • Little h, defined in the natural way
  • However, it is not the same measure
  • h of discrete RVs is always 0, and H of
    continuous RVs is infinite (measure theory)
  • For many continuous distributions, h is ½ log
    (variance) plus some constant
  • Why?

13
The Statistical Connection (1)
  • K-L D ↔ likelihood ratio (D is the expected
    log-likelihood ratio)
  • Law of large numbers can be rephrased as a limit
    on D
  • Among distributions with the same variance, the
    normal is the one with maximum h.
  • (2nd law of thermodynamics revisited)
  • h is an average quantity. Is the CLT, then, a
    law of nature? (I think YES!)
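
The link between D and the likelihood ratio can be illustrated by simulation: by the law of large numbers, the average log-likelihood ratio log[p(x)/q(x)] over samples drawn from p approaches D(p || q). The sketch below uses Bernoulli distributions; the parameter values, the seed and the function names are assumptions made for the illustration.

```python
import math
import random

def kl_bernoulli(p, q):
    """D(Bern(p) || Bern(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def mean_log_likelihood_ratio(p, q, n, seed=0):
    """Average of log[p(x) / q(x)] over n draws x ~ Bern(p)."""
    rng = random.Random(seed)
    llr_one = math.log(p / q)                # contribution when x = 1
    llr_zero = math.log((1 - p) / (1 - q))   # contribution when x = 0
    total = 0.0
    for _ in range(n):
        total += llr_one if rng.random() < p else llr_zero
    return total / n

if __name__ == "__main__":
    p, q = 0.7, 0.5
    print("D(p || q):", kl_bernoulli(p, q))
    for n in (100, 10_000, 100_000):
        print(n, mean_log_likelihood_ratio(p, q, n))
```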

14
The Statistical Connection (2)
  • Mutual information is very useful
  • Certainly for discrete RVs
  • Also for continuous (no dist. assumptions!)
  • A lot of implications for stochastic processes,
    as well
  • I just don't quite understand them
  • English?

15
Machine Learning? (1)
  • So far, we haven't mentioned noise
  • In information theory, noise exists in the channel
  • Channel capacity: max(mutual information) between
    source and receiver
  • Noise directly decreases the capacity
  • Shannon's biggest result: this can be (almost)
    achieved with (almost) zero error
  • Known as the Channel Coding Theorem
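
As a concrete case, take the binary symmetric channel (BSC), which flips each bit independently with probability eps. The BSC is an example chosen here, not one named in the talk. The sketch below finds the capacity by a grid search over input distributions, which mirrors the "max(mutual information)" definition, and compares it with the known closed form 1 - H_b(eps).

```python
import math

def binary_entropy(p):
    """H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(a, eps):
    """I(X;Y) for input P(X=1) = a over a BSC with flip probability eps."""
    # I(X;Y) = H(Y) - H(Y|X), and for the BSC H(Y|X) = H_b(eps).
    p_y1 = a * (1 - eps) + (1 - a) * eps
    return binary_entropy(p_y1) - binary_entropy(eps)

def bsc_capacity(eps, grid=1001):
    """Capacity = max over input distributions of I(X;Y), by grid search."""
    return max(bsc_mutual_information(i / (grid - 1), eps) for i in range(grid))

if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.11, 0.5):
        print(eps, bsc_capacity(eps), 1 - binary_entropy(eps))
```

Capacity falls from 1 bit per use with no noise to 0 when eps = 0.5, where the output is independent of the input.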

16
Machine Learning? (2)
  • The CCT inspired practical developments
  • Now it all depends on code and channel!
  • Smarter, error-correcting codes
  • Tech developments focus on channel capacity
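
To see what an error-correcting code buys, and why "smarter" codes were needed, consider the simplest one: send each bit n times and decode by majority vote. This repetition code is an illustration chosen here, not an example from the talk. It lowers the error rate only by cutting the transmission rate to 1/n, whereas the Channel Coding Theorem promises near-zero error at any rate below capacity.

```python
from math import comb

def repetition_error_rate(eps, n=3):
    """P(majority vote decodes wrongly) when each of n copies flips w.p. eps.

    Assumes n is odd, so the vote can never tie.
    """
    # Decoding fails when more than half of the n copies are flipped.
    return sum(
        comb(n, k) * eps**k * (1 - eps) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

if __name__ == "__main__":
    eps = 0.1
    for n in (1, 3, 5, 9):
        print(f"n={n}  rate={1 / n:.3f}  error={repetition_error_rate(eps, n):.6f}")
```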

17
Machine Learning? (3)
  • Can you find analogy between coding and
    classification/clustering? (can it be useful??)

Coding               | M. Learning
Source Entropy       | Variability of Interest
Choice of Channel    | Parameterization
Choice of Code       | Classification Rules
Channel noise        | Noise, random errors
Channel Capacity     | Maximum accuracy
I (source, receiver) | Actual Accuracy
18
Machine Learning? (4)
  • Information theory tells us that:
  • We CAN find a nearly optimal classification or
    clustering rule (coding)
  • We CAN find a nearly optimal
    parameterization-classification combo
  • Perhaps the newer wave of successful, but
    statistically intractable methods (boosting
    etc.) works by increasing channel capacity (i.e.,
    high-dimensional parameterization)?