Title: A Brief Introduction to Information Theory
1. A Brief Introduction to Information Theory
- Readings
- Manning & Schütze, Chap. 2.1, 2.2
- Cover & Thomas, Chap. 2 (handout)
2. Entropy: A Measure of Uncertainty
- Entropy: H(X) = - Σ_x p(x) log p(x); equivalently, H(X) = E_p[ log 1/p(X) ] (logs base 2, so H is measured in bits)
Slide from Cover and Thomas, Elements of Information Theory, John Wiley & Sons, 1991
3. Entropy: A Measure of Uncertainty
4. Entropy: Another Example
- Example 1.1.2 (Cover): Suppose we have a horse race with eight horses taking part. Assume their respective probabilities of winning are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
- We can calculate the entropy as
  H(X) = -1/2 log(1/2) - 1/4 log(1/4) - 1/8 log(1/8) - 1/16 log(1/16) - 4 · (1/64) log(1/64) = 2 bits
  (computed in the sketch below)
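A minimal sketch of this calculation in Python (the entropy helper below is ours, not from the readings):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum over x of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Win probabilities of the eight horses in the example.
horse_probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
print(entropy(horse_probs))   # 2.0 bits
print(entropy([1/8] * 8))     # 3.0 bits for a uniform eight-horse race
```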
5. Entropy is Concave
6. Properties of Information Content
- H is a continuous function of the p_i
- If all p_i are equal (p_i = 1/n), then H is a monotonically increasing function of n
- If a message is broken into two successive messages, the original H is the weighted sum of the resulting values of H (checked numerically in the sketch below)
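A quick numerical check of the last property, using Shannon's classic example of splitting the choice (1/2, 1/3, 1/6) into a first choice (1/2, 1/2) followed, half the time, by a second choice (2/3, 1/3); the helper function is ours:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Entropy of the three-way choice computed directly.
direct = entropy([1/2, 1/3, 1/6])

# The same choice broken into two successive choices: first (1/2, 1/2),
# then, with probability 1/2, a second choice with probabilities (2/3, 1/3).
staged = entropy([1/2, 1/2]) + 1/2 * entropy([2/3, 1/3])

print(direct, staged)   # both are about 1.459 bits
```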
7. Coding Interpretation of Entropy
- Suppose we wish to send a message indicating which horse won the race as succinctly as possible.
- We could send the index of the winning horse, requiring 3 bits (since there are 8 horses).
- Because the win probabilities are not uniform, we should use shorter descriptions for more probable horses.
- For example, we could use the following set of bit strings to represent the eight horses: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
- The average description length will then be 2 bits, equal to the entropy (verified in the sketch below).
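A minimal check of that claim, assuming the code assignment listed above (the variable names are ours):

```python
# Win probabilities and the corresponding prefix-free code words.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

# Expected description length = sum over horses of p(horse) * len(code word).
avg_len = sum(p * len(c) for p, c in zip(probs, codes))
print(avg_len)   # 2.0 bits, matching the entropy H(X)
```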
8. Kullback-Leibler Divergence
- Idea: Suppose you want to compare two probability distributions P and Q that are defined over the same set of outcomes.
(figure: unfair dice example)
- A natural way of defining a distance between two distributions is the so-called Kullback-Leibler divergence (KL distance), or relative entropy:
  D(P || Q) = Σ_x P(x) log ( P(x) / Q(x) )
  (see the sketch below)
(from Jochen Triesch, UC San Diego)
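A small sketch of the definition, comparing a fair die with a made-up loaded one (both distributions are ours, chosen only for illustration):

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P || Q) = sum over x of P(x) * log2(P(x) / Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair   = [1/6] * 6
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical unfair die

print(kl_divergence(loaded, fair))   # positive: the distributions differ
print(kl_divergence(fair, loaded))   # a different value: D is not symmetric
print(kl_divergence(fair, fair))     # 0.0: identical distributions
```

Note that D(P || Q) is not symmetric and does not satisfy the triangle inequality, so despite the name "KL distance" it is not a true metric.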
9. Mutual Information
  H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
  H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X;Y)
- Mutual information: the reduction in uncertainty of one random variable due to knowing about another, or the amount of information one random variable contains about another.
(this and following slides from Drago Radev, U
of Michigan)
10. Mutual Information and Entropy
(figure: diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y))
- I(X;Y) is 0 iff the two variables are independent
- For two dependent variables, mutual information grows not only with the degree of dependence, but also with the entropy of the variables
11. Formulas for I(X;Y)
- I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)
- Since H(X|X) = 0, note that H(X) = H(X) - H(X|X) = I(X;X)
- Pointwise mutual information: I(x, y) = log2 [ p(x, y) / ( p(x) p(y) ) ]
12. Mutual Information
- Mutual information is just the KL divergence between the joint distribution and the product of the marginals:
  I(X;Y) = D( p(x, y) || p(x) p(y) )
- i.e., it is the cost in bits of assuming that X and Y are independent when they are not (see the sketch below).
(from Jochen Triesch, UC San Diego)
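A small sketch checking this identity on a made-up 2x2 joint distribution (the numbers are ours, chosen only for illustration):

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y).
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# I(X;Y) via entropies: H(X) + H(Y) - H(X,Y).
mi_from_entropy = entropy(px) + entropy(py) - entropy(joint)

# I(X;Y) via KL divergence: D( p(x,y) || p(x) p(y) ).
mi_from_kl = sum(p * math.log2(p / (px[x] * py[y]))
                 for (x, y), p in joint.items() if p > 0)

print(mi_from_entropy, mi_from_kl)   # both about 0.278 bits
```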
13. Mutual Information
- Mutual information can be used to segment Chinese text into words (Sproat & Shi, 1990)
14. Feature Selection via Mutual Information
- Problem: From a training set of documents for some given class (topic), choose the k words which best discriminate that topic.
- One way is to use the terms with maximal mutual information with the class.
- For each word w and each category c, compute I(w, c) and keep the top-scoring words (a rough sketch follows this list).
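A rough sketch of this selection procedure, assuming (based on the columns shown two slides below) the score I(w, c) = log2( P(w|c) / P(w) ); the corpora and helper here are toy placeholders, not the Reuters data:

```python
import math
from collections import Counter

def mi_scores(class_tokens, corpus_tokens):
    """Score each word w in the class subcorpus by log2( P(w|class) / P(w) ).

    Assumes class_tokens is a subset of corpus_tokens, so every class word
    has a nonzero corpus count.
    """
    class_counts = Counter(class_tokens)
    corpus_counts = Counter(corpus_tokens)
    n_class, n_corpus = len(class_tokens), len(corpus_tokens)
    return {w: math.log2((c / n_class) / (corpus_counts[w] / n_corpus))
            for w, c in class_counts.items()}

# Toy stand-ins for the "cancer" subcorpus and the full corpus.
corpus_tokens = "the cat sat on the mat the dog ran to the cat".split()
class_tokens = "the cat sat on the mat".split()

top_words = sorted(mi_scores(class_tokens, corpus_tokens).items(),
                   key=lambda kv: kv[1], reverse=True)[:3]
print(top_words)   # words most characteristic of the class
```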
15. Feature Selection: An Example
- Test corpus: Reuters document set
  - Words in corpus: 704,903
- Sample subcorpus: ten documents containing the word "cancer"
  - Words in subcorpus: 5,519
- "cancer" occurs
  - 181 times in the subcorpus
  - 201 times in the entire document set
16. Most Probable Words Given that "cancer" Appears in the Document
- 311 the
- 181 cancer
- 171 of
- 141 and
- 137 in
- 123 a
- 106 to
- 71 women
- 69 is
- 65 that
- 64 s
- 61 breast
- 56 said
- 54 for
- 37 on
- 36 about
- 35 but
- 35 are
- 34 it
- 33 have
- 33 at
- 32 they
- 30 with
- 29 who
17. Words Sorted by I(w, cancer)

Word          Count (subcorpus)   Count (corpus)   2^I(w,cancer)   P(w|cancer)   P(w)
Lung          15                  15               128             0.00272       2.13e-05
Cancers       14                  14               128             0.00254       1.99e-05
Counseling    14                  14               128             0.00254       1.99e-05
Mammograms    11                  11               128             0.00199       1.56e-05
Oestrogen     10                  10               128             0.00181       1.42e-05
Brca          8                   8                128             0.00145       1.13e-05
Brewster      9                   9                128             0.00163       1.28e-05
Detection     7                   7                128             0.00127       9.93e-06
Ovarian       7                   7                128             0.00127       9.93e-06
Incidence     6                   6                128             0.00109       8.51e-06
Klausner      6                   6                128             0.00109       8.51e-06
Lerman        6                   6                128             0.00109       8.51e-06
Mammography   4                   4                128             0.000725      5.67e-06
18. Examples from:
- "Class-Based n-gram Models of Natural Language"
- Brown et al., Computational Linguistics, 18(4), 1992
19. Word Cluster Trees Formed with MI
20. Mutual Information Word Clusters
21. Sticky Words and Relative Entropy
- Let Pnear(w1, w2) be the probability that a word chosen at random from the text is w1 and that a second word, chosen at random from a window of 1,001 words centered on w1 (but excluding the words in a window of 5 centered on w1), is w2.
- w1 and w2 are semantically sticky if Pnear(w1, w2) >> P(w1) P(w2).
- But this is just saying: w1 and w2 are semantically sticky if D( Pnear(w1, w2) || P(w1) P(w2) ) is large, where D is pointwise relative entropy (a rough sketch follows this list).
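A rough sketch of how such a sticky-word score might be computed; the windowing follows the description above, but the function, corpus handling, and choice of the per-pair relative-entropy term are our own illustration:

```python
import math
from collections import Counter

def sticky_score(tokens, w1, w2, near=500, skip=2):
    """Contribution of the pair (w1, w2) to D( Pnear || P(w1) P(w2) ), in bits.

    Pnear(w1, w2): the probability that a randomly chosen word is w1 and that
    a word drawn from the 1,001-word window around it (near=500 on each side),
    excluding the 5-word window right at w1 (skip=2 on each side), is w2.
    """
    n = len(tokens)
    counts = Counter(tokens)
    p_w1, p_w2 = counts[w1] / n, counts[w2] / n

    hits, window = 0, 0
    for i, token in enumerate(tokens):
        if token != w1:
            continue
        for j in range(max(0, i - near), min(n, i + near + 1)):
            if abs(j - i) > skip:          # skip the inner 5-word window
                window += 1
                hits += tokens[j] == w2
    if hits == 0:
        return 0.0
    p_near = p_w1 * hits / window          # estimate of Pnear(w1, w2)
    return p_near * math.log2(p_near / (p_w1 * p_w2))
```

Pairs that co-occur in the wide window far more often than independence would predict receive large scores; pairs that co-occur only by chance score near zero.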
22. Some Sticky Word Clusters