Title: A Brief Introduction to Information Theory
1. A Brief Introduction to Information Theory
- Readings
- Manning & Schütze, Chap. 2.1, 2.2
- Cover & Thomas, Chap. 2 (handout)
2. Entropy: A Measure of Uncertainty
- Entropy: H(X) = - Σ_x p(x) log p(x); equivalently, H(X) = E_p[ log 1/p(X) ] (logs base 2, so H is measured in bits)
Slide from Cover and Thomas, Elements of Information Theory, John Wiley & Sons, 1991
3. Entropy: A Measure of Uncertainty
4. Entropy: Another Example
- Example 1.1.2 (Cover): Suppose we have a horse race with eight horses taking part. Assume their respective probabilities of winning are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
- We can calculate the entropy as
  H(X) = -1/2 log(1/2) - 1/4 log(1/4) - 1/8 log(1/8) - 1/16 log(1/16) - 4 · (1/64) log(1/64) = 2 bits
  (computed in the sketch below)
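A minimal sketch of this calculation in Python (the entropy helper below is ours, not from the readings):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum over x of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Win probabilities of the eight horses in the example.
horse_probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(horse_probs))   # 2.0 bits
print(entropy([1/8] * 8))     # 3.0 bits for a uniform eight-horse race
```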
5. Entropy is Concave
6. Properties of Information Content
- H is a continuous function of the p_i
- If all p_i are equal (p_i = 1/n), then H is a monotonically increasing function of n
- If a message is broken into two successive messages, the original H is the weighted sum of the resulting values of H (checked numerically in the sketch below)
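A quick numerical check of the last property, using Shannon's classic example of splitting the choice (1/2, 1/3, 1/6) into a first choice (1/2, 1/2) followed, half the time, by a second choice (2/3, 1/3); the helper function is ours:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy of the three-way choice computed directly.
direct = entropy([1/2, 1/3, 1/6])

# The same choice broken into two successive choices: first (1/2, 1/2),
# then, with probability 1/2, a second choice with probabilities (2/3, 1/3).
staged = entropy([1/2, 1/2]) + 1/2 * entropy([2/3, 1/3])

print(direct, staged)   # both are about 1.459 bits
```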
7. Coding Interpretation of Entropy
- Suppose we wish to send a message indicating which horse won the race as succinctly as possible.
- We could send the index of the winning horse, requiring 3 bits (since there are 8 horses).
- Because the win probabilities are not uniform, we should use shorter descriptions for more probable horses.
- For example, we could use the following set of bit strings to represent the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
- The average description length will then be 2 bits, equal to the entropy (verified in the sketch below).
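A minimal check of that claim, assuming the code assignment listed above (the variable names are ours):

```python
# Win probabilities and the corresponding prefix-free code words.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

# Expected description length = sum over horses of p(horse) * len(code word).
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(avg_len)   # 2.0 bits, matching the entropy H(X)
```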
8. Kullback-Leibler Divergence
- Idea: Suppose you want to compare two probability distributions P and Q that are defined over the same set of outcomes.
(figure: unfair dice example)
- A natural way of defining a distance between two distributions is the so-called Kullback-Leibler divergence (KL distance), or relative entropy:
  D(P || Q) = Σ_x P(x) log ( P(x) / Q(x) )
  (see the sketch below)
(from Jochen Triesch, UC San Diego)
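A small sketch of the definition, comparing a fair die with a made-up loaded one (both distributions are ours, chosen only for illustration):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P || Q) = sum over x of P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [1/6] * 6
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical unfair die

print(kl_divergence(loaded, fair))   # positive: the distributions differ
print(kl_divergence(fair, loaded))   # a different value: D is not symmetric
print(kl_divergence(fair, fair))     # 0.0: identical distributions
```

Note that D(P || Q) is not symmetric and does not satisfy the triangle inequality, so despite the name "KL distance" it is not a true metric.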
9. Mutual Information
  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
  H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X;Y)
- Mutual information: the reduction in uncertainty of one random variable due to knowing about another, or the amount of information one random variable contains about another.
(this and following slides from Drago Radev, U
of Michigan)
10. Mutual Information and Entropy
(figure: diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y))
- I(X;Y) is 0 iff the two variables are independent
- For two dependent variables, mutual information grows not only with the degree of dependence, but also with the entropy of the variables
11. Formulas for I(X;Y)
- I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
- Since H(X|X) = 0, note that H(X) = H(X) - H(X|X) = I(X;X)
- Pointwise mutual information: I(x, y) = log2 [ p(x, y) / ( p(x) p(y) ) ]
12. Mutual Information
- Mutual information is just the KL divergence between the joint distribution and the product of the marginals:
  I(X;Y) = D( p(x, y) || p(x) p(y) )
- i.e., it is the cost in bits of assuming that X and Y are independent when they are not (see the sketch below).
(from Jochen Triesch, UC San Diego)
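A small sketch checking this identity on a made-up 2x2 joint distribution (the numbers are ours, chosen only for illustration):

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y).
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# I(X;Y) via entropies: H(X) + H(Y) - H(X,Y).
mi_from_entropy = entropy(px) + entropy(py) - entropy(joint)

# I(X;Y) via KL divergence: D( p(x,y) || p(x) p(y) ).
mi_from_kl = sum(p * math.log2(p / (px[x] * py[y]))
                 for (x, y), p in joint.items() if p > 0)

print(mi_from_entropy, mi_from_kl)   # both about 0.278 bits
```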
13. Mutual Information
- Mutual information can be used to segment Chinese text into words (Sproat & Shi, 1990)
14. Feature Selection via Mutual Information
- Problem: From a training set of documents for some given class (topic), choose the k words which best discriminate that topic.
- One way is to use the terms with maximal mutual information with the class.
- For each word w and each category c, compute I(w, c) and keep the top-scoring words (a rough sketch follows this list).
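A rough sketch of this selection procedure, assuming (based on the columns shown two slides below) the score I(w, c) = log2( P(w|c) / P(w) ); the corpora and helper here are toy placeholders, not the Reuters data:

```python
import math
from collections import Counter

def mi_scores(class_tokens, corpus_tokens):
    """Score each word w in the class subcorpus by log2( P(w|class) / P(w) ).

    Assumes class_tokens is a subset of corpus_tokens, so every class word
    has a nonzero corpus count.
    """
    class_counts = Counter(class_tokens)
    corpus_counts = Counter(corpus_tokens)
    n_class, n_corpus = len(class_tokens), len(corpus_tokens)
    return {w: math.log2((c / n_class) / (corpus_counts[w] / n_corpus))
            for w, c in class_counts.items()}

# Toy stand-ins for the "cancer" subcorpus and the full corpus.
corpus_tokens = "the cat sat on the mat the dog ran to the cat".split()
class_tokens = "the cat sat on the mat".split()

top_words = sorted(mi_scores(class_tokens, corpus_tokens).items(),
                   key=lambda kv: kv[1], reverse=True)[:3]
print(top_words)   # words most characteristic of the class
```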
15. Feature Selection: An Example
- Test corpus: Reuters document set
  - Words in corpus: 704,903
- Sample subcorpus: ten documents containing the word "cancer"
  - Words in subcorpus: 5,519
- "cancer" occurs
  - 181 times in the subcorpus
  - 201 times in the entire document set
16. Most Probable Words Given that "cancer" Appears in the Document
- 311 the
- 181 cancer
- 171 of
- 141 and
- 137 in
- 123 a
- 106 to
- 71 women
- 69 is
- 65 that
- 64 s
- 61 breast
- 56 said
- 54 for
- 37 on
- 36 about
- 35 but
- 35 are
- 34 it
- 33 have
- 33 at
- 32 they
- 30 with
- 29 who
17. Words Sorted by I(w, cancer)

Word          Count (subcorpus)   Count (corpus)   2^I(w,cancer)   P(w|cancer)   P(w)
Lung          15                  15               128             0.00272       2.13e-05
Cancers       14                  14               128             0.00254       1.99e-05
Counseling    14                  14               128             0.00254       1.99e-05
Mammograms    11                  11               128             0.00199       1.56e-05
Oestrogen     10                  10               128             0.00181       1.42e-05
Brca          8                   8                128             0.00145       1.13e-05
Brewster      9                   9                128             0.00163       1.28e-05
Detection     7                   7                128             0.00127       9.93e-06
Ovarian       7                   7                128             0.00127       9.93e-06
Incidence     6                   6                128             0.00109       8.51e-06
Klausner      6                   6                128             0.00109       8.51e-06
Lerman        6                   6                128             0.00109       8.51e-06
Mammography   4                   4                128             0.000725      5.67e-06
18. Examples from:
- "Class-Based n-gram Models of Natural Language"
- Brown et al., Computational Linguistics, 18(4), 1992
19. Word Cluster Trees Formed with MI
20. Mutual Information Word Clusters
21. Sticky Words and Relative Entropy
- Let Pnear(w1, w2) be the probability that a word chosen at random from the text is w1 and that a second word, chosen at random from a window of 1,001 words centered on w1 (but excluding the words in a window of 5 centered on w1), is w2.
- w1 and w2 are semantically sticky if Pnear(w1, w2) >> P(w1) P(w2).
- But this is just saying: w1 and w2 are semantically sticky if D( Pnear(w1, w2) || P(w1) P(w2) ) is large, where D is pointwise relative entropy (a rough sketch follows this list).
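A rough sketch of how such a sticky-word score might be computed; the windowing follows the description above, but the function, corpus handling, and choice of the per-pair relative-entropy term are our own illustration:

```python
import math
from collections import Counter

def sticky_score(tokens, w1, w2, near=500, skip=2):
    """Contribution of the pair (w1, w2) to D( Pnear || P(w1) P(w2) ), in bits.

    Pnear(w1, w2): the probability that a randomly chosen word is w1 and that
    a word drawn from the 1,001-word window around it (near=500 on each side),
    excluding the 5-word window right at w1 (skip=2 on each side), is w2.
    """
    n = len(tokens)
    counts = Counter(tokens)
    p_w1, p_w2 = counts[w1] / n, counts[w2] / n

    hits, window = 0, 0
    for i, token in enumerate(tokens):
        if token != w1:
            continue
        for j in range(max(0, i - near), min(n, i + near + 1)):
            if abs(j - i) > skip:          # skip the inner 5-word window
                window += 1
                hits += tokens[j] == w2
    if hits == 0:
        return 0.0
    p_near = p_w1 * hits / window          # estimate of Pnear(w1, w2)
    return p_near * math.log2(p_near / (p_w1 * p_w2))
```

Pairs that co-occur in the wide window far more often than independence would predict receive large scores; pairs that co-occur only by chance score near zero.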
22. Some Sticky Word Clusters