A Brief Introduction to Information Theory
1
A Brief Introduction to Information Theory
  • Readings
  • Manning & Schütze, Chap. 2.1, 2.2
  • Cover & Thomas, Chap. 2 (handout)

2
Entropy: A Measure of Uncertainty
H(X) = -Σ_x p(x) log2 p(x); equivalently, H(X) = E_p[ log 1/p(X) ]
Slide from Cover and Thomas, Elements of
Information Theory, John Wiley & Sons, 1991
3
Entropy: A Measure of Uncertainty

4
Entropy: Another Example
  • Example 1.1.2 (Cover): Suppose we have a horse
    race with eight horses taking part. Assume their
    respective probabilities of winning are (1/2, 1/4,
    1/8, 1/16, 1/64, 1/64, 1/64, 1/64).
  • We can calculate the entropy as H(X) = -(1/2) log(1/2)
    - (1/4) log(1/4) - (1/8) log(1/8) - (1/16) log(1/16)
    - 4·(1/64) log(1/64) = 2 bits
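A quick numeric check of the calculation above (a minimal Python sketch; the probabilities are the ones given on this slide):

    import math

    # Win probabilities for the eight horses (Cover & Thomas, Example 1.1.2)
    p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

    # H(X) = -sum_x p(x) log2 p(x)
    H = -sum(pi * math.log2(pi) for pi in p)
    print(H)  # 2.0 bits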


5
Entropy Is Concave (as a function of p)

6
Properties of information content
  • H is a continuous function of the p_i
  • If all p_i are equal (p_i = 1/n), then H is a
    monotone increasing function of n
  • If a message is broken into two successive
    messages, the original H is the weighted sum of the
    resulting values of H (checked numerically below)
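The third property (decomposition into successive choices) can be verified numerically. A minimal sketch; the distribution (1/2, 1/3, 1/6) is Shannon's classic illustration, not taken from the slide:

    import math

    def H(*p):
        """Entropy in bits of a discrete distribution given as probabilities."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    # Choosing among (1/2, 1/3, 1/6) directly, versus first choosing between
    # {a} and {b, c} (probabilities 1/2, 1/2) and then, within {b, c},
    # choosing with probabilities (2/3, 1/3):
    direct = H(1/2, 1/3, 1/6)
    staged = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)
    print(direct, staged)  # both ~1.459 bits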

7
Coding Interpretation of Entropy
  • Suppose we wish to send a message indicating
    which horse won the race as succinctly as
    possible.
  • We could send the index of the winning horse,
    requiring 3 bits (since there are 8 horses).
  • Because the win probabilities are not uniform, we
    should use shorter descriptions for more probable
    horses.
  • For example, we could use the following set of
    bit strings to represent the eight horses: 0, 10,
    110, 1110, 111100, 111101, 111110, 111111.
  • The average description length will then be 2
    bits, equal to the entropy (verified in the sketch
    below).
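A short check that this prefix code achieves an average length of 2 bits (codewords and probabilities as given on the slide):

    # Expected code length: sum_x p(x) * len(codeword(x))
    p     = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

    avg_len = sum(pi * len(c) for pi, c in zip(p, codes))
    print(avg_len)  # 2.0 bits, equal to H(X)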

8
Kullback-Leibler Divergence
Idea: Suppose you want to compare two
probability distributions P and Q that are
defined over the same set of outcomes.
[Figure: unfair dice]
A natural way of defining a distance between
two distributions is the so-called
Kullback-Leibler divergence (KL-distance), or
relative entropy.
(from Jochen Triesch, UC San Diego)
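The relative entropy referred to here is D(P||Q) = Σ_x P(x) log2( P(x)/Q(x) ). A minimal sketch comparing a fair die to an unfair one (the unfair probabilities are my own illustration, not from the slide):

    import math

    def kl(p, q):
        """Kullback-Leibler divergence D(P || Q) in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    fair   = [1/6] * 6
    unfair = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # illustrative loaded die

    print(kl(unfair, fair))  # > 0
    print(kl(fair, unfair))  # a different value: D is not symmetric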
9
Mutual information
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X;Y)
  • Mutual information: the reduction in uncertainty of
    one random variable due to knowing about another,
    or the amount of information one random variable
    contains about another.

(this and following slides from Drago Radev, U
of Michigan)
10
Mutual information and entropy
[Venn diagram relating H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y)]
  • I(X;Y) is 0 iff the two variables are independent
  • For two dependent variables, mutual information
    grows not only with the degree of dependence, but
    also with the entropy of the variables

11
Formulas for I(X;Y)
  • I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)

Since H(X|X) = 0, note that H(X) = H(X) - H(X|X)
= I(X;X)
Pointwise mutual information:
I(x;y) = log2 [ p(x,y) / (p(x) p(y)) ]
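A minimal sketch checking these identities on a small joint distribution (the 2x2 table is my own illustration, not from the slides):

    import math

    def H(ps):
        return -sum(p * math.log2(p) for p in ps if p > 0)

    # Illustrative joint distribution p(x, y) over two binary variables
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in (0, 1)}
    py = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    I = H(px.values()) + H(py.values()) - H(joint.values())

    # Pointwise mutual information of a single outcome (x, y)
    pmi = lambda x, y: math.log2(joint[(x, y)] / (px[x] * py[y]))
    print(I, pmi(0, 0))  # ~0.278 bits, ~0.678 bits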
12
Mutual Information
Mutual information is just the KL divergence
between the joint distribution and the product of
the marginals:
I(X;Y) = D( p(x,y) || p(x) p(y) )
i.e. it's the cost in bits of assuming that X and
Y are independent when they are not
(from Jochen Triesch, UC San Diego)
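Continuing the illustration from the earlier sketch, the KL form gives the same number (a self-contained version; the marginals happen to be uniform for this particular table):

    import math

    # Same illustrative 2x2 joint distribution as in the earlier sketch
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px = {0: 0.5, 1: 0.5}
    py = {0: 0.5, 1: 0.5}

    # I(X;Y) = D( p(x,y) || p(x) p(y) )
    I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)
    print(I)  # ~0.278 bits, matching H(X) + H(Y) - H(X,Y)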
13
Mutual Information
  • Mutual information can be used to segment Chinese
    characters (Sproat & Shi, 1990)

14
Feature Selection via Mutual Information
  • Problem: From a training set of documents for some
    given class (topic), choose the k words which best
    discriminate that topic.
  • One way is to use the terms with maximal mutual
    information with the class.
  • For each word w and each category c, compute
    I(w; c) (a sketch follows below).
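A minimal sketch of the per-word score, treating word presence and class membership as binary variables over documents (the contingency-table counting and the function name are my assumptions, not from the slide):

    import math

    def word_class_mi(n11, n10, n01, n00):
        """I(W;C) in bits from a 2x2 contingency table of document counts:
        n11 = docs in class c containing w,  n10 = docs outside c containing w,
        n01 = docs in c without w,           n00 = docs outside c without w."""
        n = n11 + n10 + n01 + n00
        total = 0.0
        for nij, n_w, n_c in [(n11, n11 + n10, n11 + n01),
                              (n10, n11 + n10, n10 + n00),
                              (n01, n01 + n00, n11 + n01),
                              (n00, n01 + n00, n10 + n00)]:
            if nij > 0:
                total += (nij / n) * math.log2(nij * n / (n_w * n_c))
        return total

    # Rank candidate words by word_class_mi(...) and keep the top k for the class.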

15
Feature Selection: An Example
  • Test corpus
  • Reuters document set
  • Words in corpus: 704,903
  • Sample subcorpus of ten documents containing the
    word cancer
  • Words in subcorpus: 5,519
  • cancer occurs
  • 181 times in the subcorpus
  • 201 times in the entire document set

16
Most probable words given that Cancer appears
in the document
  • 311 the
  • 181 cancer
  • 171 of
  • 141 and
  • 137 in
  • 123 a
  • 106 to
  • 71 women
  • 69 is
  • 65 that
  • 64 s
  • 61 breast

  • 56 said
  • 54 for
  • 37 on
  • 36 about
  • 35 but
  • 35 are
  • 34 it
  • 33 have
  • 33 at
  • 32 they
  • 30 with
  • 29 who
17
Words sorted by I(w,cancer)
  Word          c    total   2^I(w,cancer)   P(w|cancer)   P(w)
  Lung          15   15      128             0.00272       2.13e-05
  Cancers       14   14      128             0.00254       1.99e-05
  Counseling    14   14      128             0.00254       1.99e-05
  Mammograms    11   11      128             0.00199       1.56e-05
  Oestrogen     10   10      128             0.00181       1.42e-05
  Brca           8    8      128             0.00145       1.13e-05
  Brewster       9    9      128             0.00163       1.28e-05
  Detection      7    7      128             0.00127       9.93e-06
  Ovarian        7    7      128             0.00127       9.93e-06
  Incidence      6    6      128             0.00109       8.51e-06
  Klausner       6    6      128             0.00109       8.51e-06
  Lerman         6    6      128             0.00109       8.51e-06
  Mammography    4    4      128             0.000725      5.67e-06
  (c = count in the cancer subcorpus; total = count in the whole corpus)
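The 2^I(w,cancer) column is simply the ratio P(w|cancer) / P(w), which can be checked from the counts (a quick check for the Lung row, using the corpus sizes from the previous slide):

    import math

    subcorpus_words = 5519     # words in the cancer subcorpus
    corpus_words    = 704903   # words in the whole corpus

    c, total = 15, 15          # "Lung": count in subcorpus, count in whole corpus
    p_w_given_cancer = c / subcorpus_words      # ~0.00272
    p_w              = total / corpus_words     # ~2.13e-05

    print(p_w_given_cancer / p_w)               # ~128 = 2**I(w, cancer)
    print(math.log2(p_w_given_cancer / p_w))    # I(w, cancer) ~ 7 bits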

18
  • Examples from
  • Class-Based n-gram Models of Natural Language
  • Brown et al.
  • Computational Linguistics, 18(4), 1992

19
Word Cluster Trees Formed with MI
20
Mutual Information Word Clusters
21
Sticky Words and Relative Entropy
  • Let Pnear(w1,w2) be the probability that a word
    chosen at random from the text is w1 and that a
    2nd word, chosen at random from a window of 1,001
    words centered on w1, but excluding the words in
    a window of 5 centered on w1, is w2.
  • w1 and w2 are semantically sticky if
    Pnear(w1,w2) >> P(w1)P(w2)
  • But this is just saying:
  • w1 and w2 are semantically sticky if
    D(Pnear(w1,w2) || P(w1)P(w2)) is large,
  • where D is pointwise relative entropy (a sketch
    follows below)
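A minimal sketch of that pointwise score computed from raw counts (the function and parameter names are my own; gathering the near-window counts is assumed to follow the slide's definition):

    import math

    def stickiness(near_count, n_near_pairs, count1, count2, n_words):
        """Pointwise relative entropy:
        Pnear(w1,w2) * log2( Pnear(w1,w2) / (P(w1) P(w2)) )."""
        p_near = near_count / n_near_pairs    # estimate of Pnear(w1, w2)
        p1 = count1 / n_words                 # estimate of P(w1)
        p2 = count2 / n_words                 # estimate of P(w2)
        return p_near * math.log2(p_near / (p1 * p2))

    # Large positive scores flag semantically sticky pairs;
    # scores near zero indicate roughly independent words.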

22
Some Sticky Word Clusters