Title: A Bit of Information Theory
1. A Bit of Information Theory
- Unsupervised Learning Working Group
- Assaf Oron, Oct. 15 2003
Based mostly upon Cover & Thomas, Elements of Information Theory, 1991
2. Contents
- Coding and Transmitting Information
- Entropy etc.
- Information Theory and Statistics
- Information Theory and Machine Learning
3. What is Coding? (1)
- We keep coding all the time
- Crucial requirement for coding: source and receiver agree on the key
- Modern coding: telegraph → radio → …
- Practical problem: how efficient can we make it? Tackled from the 1920s on
- 1940s: Claude Shannon
4. What is Coding? (2)
- Shannon's greatness: finding a solution to the specific problem by working on the general problem
- Namely: how does one quantify information, its coding, and its transmission?
- ANY type of information
5. Some Day-to-Day Codes

Code             | Channel                                 | Unique? Instant?
Spoken Language  | Sounds via air                          | Well
Written Language | Signs on paper/screen                   | Well
Numbers and math | Signs on paper/screen, electronic, etc. | Usually (decimal point, operation signs, etc.)
DNA protein code | Nucleotide pairs                        | Yes (start, end, 3-somes)
6. Information Complexity of Some Coded Messages
- Let's think written numbers:
- k digits → 10^k possible messages
- How about written English?
- k letters → 26^k possible messages
- k words → D^k possible messages, where D is the English dictionary size
- ⇒ Length ∝ log(complexity) (worked out below)
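A quick arithmetic check of the Length ∝ log(complexity) claim (my own worked example, in LaTeX, not from the original slide):

    % k decimal digits distinguish 10^k messages; k letters distinguish 26^k.
    % In both cases the length is proportional to the log of the message count:
    k = \log_{10}\!\bigl(10^{k}\bigr), \qquad
    \text{Length in bits} = \log_{2}(\#\text{possible messages})
    % One decimal digit carries log2(10) ~ 3.32 bits; one letter log2(26) ~ 4.70 bits.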
7. Information Entropy
- The expected length (in bits) of a binary message conveying X-type information (the standard formula is written out below)
- Other common descriptions: code complexity, uncertainty, missing/required information, expected surprise, information content (BAD), etc.
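For reference, the standard definition (as in Cover & Thomas), written out in LaTeX:

    % Entropy of a discrete random variable X with pmf p(x), in bits:
    H(X) = -\sum_{x} p(x)\,\log_{2} p(x) \;=\; E\bigl[-\log_{2} p(X)\bigr]
    % i.e. the expected "surprise", and the expected length of an
    % optimally coded binary message about X.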
8. Why Entropy?
- Thermodynamics (mid 19th century): the amount of un-usable heat in a system
- Statistical physics (end of 19th century): log(complexity of the current system state)
- ⇒ the amount of "mess" in the system
- The two were proven to be equivalent
- Statistical entropy is proportional to information entropy if p(x) is uniform (spelled out below)
- 2nd Law of Thermodynamics: entropy never decreases (more later)
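The uniform-case link between statistical and information entropy, spelled out (a standard result, not shown on the original slide):

    % Boltzmann entropy: S = k_B ln W, with W the number of (equally likely) microstates.
    S = k_{B} \ln W
    % Information entropy of the uniform distribution over W states:
    H(X) = -\sum_{i=1}^{W} \tfrac{1}{W}\,\log_{2}\tfrac{1}{W} = \log_{2} W
    % So for uniform p(x), S and H agree up to a constant factor (k_B ln 2).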
9. Entropy Properties, Examples
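The slide's own examples are not reproduced in this text version; as a stand-in, here is a minimal Python sketch of two standard properties (Bernoulli entropy peaks at p = 0.5, and the uniform distribution maximizes H):

    import numpy as np

    def entropy(p):
        """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # treat 0 * log(0) as 0
        return -np.sum(p * np.log2(p))

    # Bernoulli(p): H is near 0 for extreme p, maximal (1 bit) at p = 0.5
    for p in (0.01, 0.1, 0.5, 0.9):
        print(f"Bernoulli({p}): H = {entropy([p, 1 - p]):.3f} bits")

    # Among distributions on 8 symbols, the uniform one has the largest H (= 3 bits)
    print("Uniform on 8 symbols:", entropy(np.full(8, 1 / 8)))
    print("Skewed on 8 symbols :", entropy([0.65] + [0.05] * 7))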
10. Kullback-Leibler Divergence (Relative Entropy)
- In words: the excess message length incurred when a code optimized for q(x) is used for messages actually drawn from p(x)
- Properties, relation to H (written out below)
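The usual definition and the properties hinted at above, in LaTeX (standard convention, D(p‖q)):

    % Relative entropy / K-L divergence between pmfs p and q:
    D(p \,\|\, q) = \sum_{x} p(x)\,\log_{2}\frac{p(x)}{q(x)}
    % Properties: D(p||q) >= 0, with equality iff p = q; it is not symmetric.
    % Relation to H via the cross-entropy H(p, q) = -sum_x p(x) log2 q(x):
    D(p \,\|\, q) = H(p, q) - H(p)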
11. Mutual Information
- Relationship to D and H (hint: conditional probability; identities below)
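Spelling out the hinted relationships (standard identities):

    % Mutual information as a K-L divergence from independence:
    I(X; Y) = D\bigl(p(x, y)\,\|\,p(x)\,p(y)\bigr)
            = \sum_{x, y} p(x, y)\,\log_{2}\frac{p(x, y)}{p(x)\,p(y)}
    % ... and via (conditional) entropies:
    I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)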
12. Entropy for Continuous RVs
- "Little h", defined in the natural way
- However, it is not the same measure:
- h of a discrete RV degenerates (to −∞), and H of a continuous RV is infinite (measure theory)
- For many continuous distributions, h is ½·log(variance) plus some constant (Gaussian example below)
- Why?
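The "natural" definition and the Gaussian case, which shows the ½·log(variance) + constant form:

    % Differential entropy of a continuous RV X with density f:
    h(X) = -\int f(x)\,\log f(x)\,dx
    % Example: X ~ N(mu, sigma^2):
    h(X) = \tfrac{1}{2}\log\bigl(2\pi e\,\sigma^{2}\bigr)
         = \tfrac{1}{2}\log \sigma^{2} + \text{const}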
13. The Statistical Connection (1)
- K-L divergence ↔ likelihood ratio (identity below)
- The law of large numbers can be rephrased as a limit involving D
- Among distributions with the same variance, the normal is the one with maximum h
- (2nd law of thermodynamics revisited)
- h is an average quantity. Is the CLT, then, a law of nature? (I think YES!)
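The K-L/likelihood-ratio link, written out (a standard identity; the slide only hints at it):

    % Expected log-likelihood ratio (p vs. q), taken under p:
    E_{p}\!\left[\log\frac{p(X)}{q(X)}\right] = D(p \,\|\, q)
    % By the law of large numbers, for X_1, ..., X_n i.i.d. from p:
    \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i)}{q(X_i)} \;\longrightarrow\; D(p \,\|\, q)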
14. The Statistical Connection (2)
- Mutual information is very useful
- Certainly for discrete RVs
- Also for continuous ones (no distributional assumptions!); a small sketch follows below
- A lot of implications for stochastic processes as well
- I just don't quite understand them
- English?
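A minimal Python sketch of a distribution-free (plug-in) estimate of I(X;Y) for continuous data; the histogram binning is my own illustrative choice, not something prescribed in the talk:

    import numpy as np

    def mutual_information(x, y, bins=20):
        """Plug-in estimate of I(X;Y) in bits from paired samples, via 2-D histogram binning."""
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()                 # empirical joint distribution
        px = pxy.sum(axis=1, keepdims=True)       # marginal of X
        py = pxy.sum(axis=0, keepdims=True)       # marginal of Y
        nonzero = pxy > 0
        return np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero]))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5000)
    noise = rng.normal(size=5000)
    print(mutual_information(x, x + 0.5 * noise))   # strongly dependent -> large I
    print(mutual_information(x, noise))             # independent -> I near 0 (up to estimation bias)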
15. Machine Learning? (1)
- So far, we haven't mentioned noise
- In information theory, noise lives in the channel
- Channel capacity: the maximum mutual information between source and receiver (formula below)
- Noise directly decreases the capacity
- Shannon's biggest result: this capacity can be (almost) achieved with (almost) zero error
- Known as the Channel Coding Theorem
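The capacity definition and the textbook binary-symmetric-channel example, for concreteness:

    % Channel capacity: the maximum mutual information over input distributions.
    C = \max_{p(x)} I(X; Y)
    % Example: binary symmetric channel with crossover probability \epsilon:
    C = 1 - H(\epsilon)
      = 1 + \epsilon\log_{2}\epsilon + (1-\epsilon)\log_{2}(1-\epsilon)
    % Channel Coding Theorem: every rate R < C is achievable with error
    % probability tending to zero; no rate R > C is.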
16. Machine Learning? (2)
- The CCT inspired practical developments
- Now it all depends on code and channel!
- Smarter, error-correcting codes
- Tech developments focus on channel capacity
17. Machine Learning? (3)
- Can you find an analogy between coding and classification/clustering? (Can it be useful?)
Coding              | M. Learning
Source Entropy      | Variability of Interest
Choice of Channel   | Parameterization
Choice of Code      | Classification Rules
Channel noise       | Noise, random errors
Channel Capacity    | Maximum accuracy
I (source,receiver) | Actual Accuracy
18. Machine Learning? (4)
- Information theory tells us that:
- We CAN find a nearly optimal classification or clustering rule (coding)
- We CAN find a nearly optimal parameterization/classification combo
- Perhaps the newer wave of successful but statistically intractable methods (boosting, etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?