1
Lecture 3: Basic Probability (Chapter 2 of Manning and Schütze)
Wen-Hsiang Lu
Department of Computer Science and Information Engineering, National Cheng Kung University
2004/09/29
2
Motivation
  • Statistical NLP aims to do statistical inference.
  • Statistical inference consists of taking some
    data (generated according to some unknown
    probability distribution) and then making some
    inferences about this distribution.
  • An example of statistical inference is the task
    of language modeling, namely predicting the next
    word given a window of previous words. To do
    this, we need a model of the language.
  • Probability theory helps us to find such a model.

3
Probability Terminology
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial (e.g., tossing a
    coin twice).
  • The collection of basic outcomes (or sample
    points) for our experiment is called the sample
    space.
  • An event is a subset of the sample space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1, certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample
    space.

4
Experiments and Sample Spaces
  • The set of possible basic outcomes of an experiment is the sample space Ω
  • coin toss (Ω = {head, tail})
  • tossing a coin 2 times (Ω = {HH, HT, TH, TT})
  • die roll (Ω = {1, 2, 3, 4, 5, 6})
  • missing word (|Ω| ≈ vocabulary size)
  • Discrete (countable) versus continuous (uncountable)
  • Every observation/trial is a basic outcome or sample point.
  • An event A is a set of basic outcomes with A ⊆ Ω, and all A ∈ 2^Ω (the event space)
  • Ω is then the certain event, ∅ the impossible event

5
Events and Probability
  • The probability of event A is denoted p(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
  • Example experiment: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • cases with two or more tails: A = {HTT, THT, TTH, TTT}
  • P(A) = |A| / |Ω| = 1/2 (assuming a uniform distribution)
  • all heads: A = {HHH}
  • P(A) = |A| / |Ω| = 1/8 (see the sketch below)
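A minimal Python sketch, added here for illustration, that enumerates this sample space and checks both probabilities under the uniform-distribution assumption:

    from itertools import product

    # All 8 equally likely outcomes of three coin tosses: HHH, HHT, ..., TTT
    omega = ["".join(t) for t in product("HT", repeat=3)]

    two_or_more_tails = [o for o in omega if o.count("T") >= 2]
    all_heads = [o for o in omega if o == "HHH"]

    print(len(two_or_more_tails) / len(omega))  # 0.5
    print(len(all_heads) / len(omega))          # 0.125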

6
Probability Properties
  • Basic properties
  • P: 2^Ω → [0, 1]
  • P(Ω) = 1
  • For disjoint events: P(∪i Ai) = Σi P(Ai)
  • NB: the axiomatic definition of probability takes the above three conditions as axioms
  • Immediate consequences
  • P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)
  • Σa∈Ω P(a) = 1

7
Joint Probability
  • Joint probability of A and B: P(A,B) = P(A ∩ B)
  • A two-dimensional table (A × B) with a value in every cell giving the probability of that specific pair occurring.

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
8
Conditional Probability
  • Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful. If we know that event B is true, then we can determine the probability that A is true given this knowledge: P(A|B) = P(A,B) / P(B)

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
9
Conditional and Joint Probabilities
  • P(A|B) = P(A,B)/P(B) ⇒ P(A,B) = P(A|B) P(B)
  • P(B|A) = P(A,B)/P(A) ⇒ P(A,B) = P(B|A) P(A)
  • Chain rule: P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1, ..., An−1)

10
Bayes Rule
  • Since P(A,B) = P(B,A), P(A ∩ B) = P(B ∩ A), and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
  • P(A|B) = P(A,B)/P(B) = P(B|A) P(A) / P(B)
  • P(B|A) = P(A,B)/P(A) = P(A|B) P(B) / P(A)

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
11
Example
  • S = have a stiff neck, M = have meningitis
  • P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
  • I have a stiff neck; should I worry? (worked out in the sketch below)
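The slide leaves the question open; the following Python sketch simply applies Bayes' rule from the previous slide to the numbers given above:

    # Bayes' rule: P(M|S) = P(S|M) * P(M) / P(S)
    p_s_given_m = 0.5
    p_m = 1 / 50_000
    p_s = 1 / 20

    p_m_given_s = p_s_given_m * p_m / p_s
    print(p_m_given_s)  # 0.0002, i.e. about 1 in 5000: a stiff neck alone is weak evidence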

12
Independence
  • Two events A and B are independent of each other if P(A) = P(A|B)
  • Examples: two coin tosses; today's weather and the weather on March 4th, 1789
  • If A and B are independent, then we can compute P(A,B) from P(A) and P(B) as
  • P(A,B) = P(A|B) P(B) = P(A) P(B)
  • Two events A and B are conditionally independent of each other given C if
  • P(A|C) = P(A|B,C)

13
A Golden Rule (of Statistical NLP)
  • If we are interested in which event B is most likely given A, we can use Bayes' rule and maximize over all B:
  • argmaxB P(B|A) = argmaxB P(A|B) P(B) / P(A) = argmaxB P(A|B) P(B)
  • P(A) is a normalizing constant that does not depend on B, so the denominator can be dropped from the maximization.

14
Random Variables (RV)
  • Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
  • An RV is a function X: Ω → Q
  • in general Q = R^n, typically Q = R
  • it is easier to handle real numbers than real-world events
  • An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}
  • We can define a probability mass function (pmf) for RV X that gives the probability it takes at different values:
  • pX(x) = P(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
  • often written just p(x) if it is clear from context what X is

15
Example
  • Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, ..., 11, 12} with the following pmf (derived in the sketch after this list):
  • P(X=2) = 1/36,
  • P(X=3) = 2/36,
  • P(X=4) = 3/36,
  • P(X=5) = 4/36,
  • P(X=6) = 5/36,
  • P(X=7) = 6/36,
  • P(X=8) = 5/36,
  • P(X=9) = 4/36,
  • P(X=10) = 3/36,
  • P(X=11) = 2/36,
  • P(X=12) = 1/36
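A small Python sketch, added for illustration, that derives this pmf by enumerating the 36 equally likely outcomes of the two dice:

    from collections import Counter
    from fractions import Fraction
    from itertools import product

    # Count how often each sum occurs among the 36 equally likely (d1, d2) pairs
    counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
    pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
    print(pmf)  # {2: 1/36, 3: 1/18, 4: 1/12, 5: 1/9, 6: 5/36, 7: 1/6, ..., 12: 1/36}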

16
Expectation and Variance
  • The expectation is the mean or average of an RV, defined as E(X) = Σx x p(x).
  • The variance of an RV is a measure of whether the values of the RV tend to vary over trials: Var(X) = E((X − E(X))²).
  • The standard deviation (σ) is the square root of the variance.

17
Examples
  • What is the expectation of the sum of the numbers on two dice?
    2 · P(X=2) = 2 · 1/36 = 1/18
    3 · P(X=3) = 3 · 2/36 = 3/18
    4 · P(X=4) = 4 · 3/36 = 6/18
    5 · P(X=5) = 5 · 4/36 = 10/18
    6 · P(X=6) = 6 · 5/36 = 15/18
    7 · P(X=7) = 7 · 6/36 = 21/18
    8 · P(X=8) = 8 · 5/36 = 20/18
    9 · P(X=9) = 9 · 4/36 = 18/18
    10 · P(X=10) = 10 · 3/36 = 15/18
    11 · P(X=11) = 11 · 2/36 = 11/18
    12 · P(X=12) = 12 · 1/36 = 6/18
    E(SUM) = 126/18 = 7
  • Or more simply:
  • E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  • E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = 1/6 · (1 + 2 + 3 + 4 + 5 + 6) = 21/6
  • Hence, E(SUM) = 21/6 + 21/6 = 7

18
Examples
  • Var(X) = E((X − E(X))²) = E(X² − 2X E(X) + E²(X)) = E(X²) − 2E(X E(X)) + E²(X) = E(X²) − 2E²(X) + E²(X) = E(X²) − E²(X)
  • E(SUM²) = 329/6 and E²(SUM) = 7² = 49
  • Hence, Var(SUM) = 329/6 − 294/6 = 35/6 (verified in the sketch below)
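A short Python check of these values, added for illustration; it recomputes E(SUM), E(SUM²) and Var(SUM) directly from the 36 outcomes:

    from fractions import Fraction
    from itertools import product

    outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
    p = Fraction(1, 36)                        # each of the 36 outcomes is equally likely

    e_sum = sum(x * p for x in outcomes)       # E(SUM)
    e_sum2 = sum(x * x * p for x in outcomes)  # E(SUM^2)
    var = e_sum2 - e_sum ** 2                  # Var = E(X^2) - E^2(X)

    print(e_sum, e_sum2, var)                  # 7, 329/6, 35/6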

19
Joint and Conditional Distributions for RVs
  • The joint pmf for two RVs X and Y is p(x,y) = P(X=x, Y=y)
  • Marginal pmfs are calculated as pX(x) = Σy p(x,y) and pY(y) = Σx p(x,y)
  • If X and Y are independent, then p(x,y) = pX(x) pY(y)
  • Define the conditional pmf using the joint distribution: pX|Y(x|y) = p(x,y) / pY(y) if pY(y) > 0
  • Chain rule:
  • p(w,x,y,z) = p(w) p(x|w) p(y|w,x) p(z|w,x,y)

20
Estimating Probability Functions
  • What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown, so P must be estimated from a sample of data.
  • An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times an outcome u occurs: f(u) = C(u) / N, where C(u) is the number of times u occurs in N trials.
  • For N → ∞, the relative frequency tends to stabilize around some number: the probability estimate.
  • Two different approaches:
  • Parametric (assume a distribution)
  • Non-parametric (distribution-free)

21
Parametric Methods
  • Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
  • If we assume an explicit probabilistic model of the process by which the data was generated, then determining a particular probability distribution within the family requires only the specification of a few parameters, which requires less training data (i.e., only a small number of parameters need to be estimated).

22
Non-parametric Methods
  • No assumption is made about the underlying
    distribution of the data.
  • For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
  • Given less prior information, more training data
    is needed.

23
Estimating Probability
  • Example: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • count cases with exactly two tails: A = {HTT, THT, TTH}
  • Run the experiment 1000 times (i.e., 3000 tosses)
  • Counted 386 cases with two tails (HTT, THT, or TTH)
  • Estimate of p(A) = 386 / 1000 = .386
  • Run again: 373, 399, 382, 355, 372, 406, 359
  • p(A) ≈ .379 (weighted average), or simply 3032 / 8000
  • Under the uniform-distribution assumption: p(A) = 3/8 = .375 (simulated in the sketch below)
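A hypothetical Python simulation of this experiment; the random seed and the number of runs are arbitrary choices made only for illustration:

    import random

    random.seed(0)  # arbitrary seed, for reproducibility of the illustration

    def run_block(n=1000):
        """Relative frequency of 'exactly two tails' in n three-toss experiments."""
        hits = sum(1 for _ in range(n)
                   if [random.choice("HT") for _ in range(3)].count("T") == 2)
        return hits / n

    print([run_block() for _ in range(8)])  # each estimate fluctuates around 3/8 = 0.375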

24
Standard Distributions
  • In practice, one commonly finds the same basic
    form of a probability mass function, but with
    different constants employed.
  • Families of pmfs are called distributions and the
    constants that define the different possible pmfs
    in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

25
Standard Distributions: Binomial
  • A series of trials with only two outcomes, 0 or 1, each trial being independent of all the others.
  • The number r of successes out of n trials, given that the probability of success in any trial is p:
  • b(r; n, p) = C(n, r) p^r (1 − p)^(n − r)
  • The binomial coefficient C(n, r) counts how many possibilities there are for choosing r objects out of n, i.e., n! / ((n − r)! r!)
26
Binomial Distribution
  • Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
  • Use it when counting whether something has a certain property or not (assuming independence).
  • Actually quite common in statistical NLP: e.g., look through a corpus to estimate the percentage of sentences that have a certain word in them, or how often a verb is used as transitive versus intransitive.
  • The expectation is n · p; the variance is n · p · (1 − p) (see the sketch below)
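A minimal Python sketch of the binomial pmf; the sentence counts and the probability 0.2 below are made up purely for illustration:

    from math import comb

    def binomial_pmf(r, n, p):
        """P(exactly r successes in n independent trials, success probability p)."""
        return comb(n, r) * p**r * (1 - p)**(n - r)

    # e.g. the chance that 3 of 10 sampled sentences contain a given word,
    # assuming (hypothetically) each sentence contains it with probability 0.2
    n, p = 10, 0.2
    print(binomial_pmf(3, n, p))      # ~0.201
    print(n * p, n * p * (1 - p))     # expectation 2.0, variance 1.6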

27
Standard Distributions: Normal
  • The normal (Gaussian) distribution is a continuous distribution with two parameters: the mean μ and the standard deviation σ. It is the standard normal if μ = 0 and σ = 1. (clustering)

(Figure: bell-shaped density curve over X, centered at the mean μ)
28
Frequentist Statistics
  • s = sequence of observations
  • M = model (a distribution plus parameters)
  • For a fixed model M, the maximum likelihood estimate chooses the parameter values that maximize the probability of the observed data s.
  • Probability expresses something about the world, with no prior belief.

29
Bayesian Statistics I: Bayesian Updating
  • Assume that the data are coming in sequentially
    and are independent.
  • Given an a-priori probability distribution, we
    can update our beliefs when a new datum comes in
    by calculating the Maximum A Posteriori (MAP)
    distribution.
  • The MAP probability becomes the new prior and the
    process repeats on each new datum.

30
Bayesian Statistics: MAP
  • The MAP (maximum a posteriori) estimate chooses the model that maximizes P(M|s) = P(s|M) P(M) / P(s), i.e., argmaxM P(s|M) P(M).
  • P(s) is a normalizing constant and can be dropped from the maximization.

31
Bayesian Statistics II: Bayesian Decision Theory
  • Bayesian Statistics can be used to evaluate which
    model or family of models better explains some
    data.
  • We define two different models of the event and
    calculate the likelihood ratio between these two
    models.

32
Bayesian Decision Theory
  • Suppose we have two models, M1 and M2; we want to evaluate which model better explains some new data.
  • Compute the likelihood ratio P(data|M1) P(M1) / (P(data|M2) P(M2)); if the ratio is greater than 1, then M1 is the more likely model, otherwise M2. (A sketch follows below.)
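A hypothetical Python sketch of such a comparison; the two coin models, the equal priors, and the observation string are all made up for illustration:

    def likelihood(p_heads, data):
        """P(data | model) for a sequence of independent coin tosses."""
        heads = data.count("H")
        return p_heads**heads * (1 - p_heads)**(len(data) - heads)

    data = "HHTHHHTHHH"                 # made-up observations (8 heads, 2 tails)
    prior_m1 = prior_m2 = 0.5           # assume equal priors

    # M1: fair coin (p = 0.5); M2: biased coin (p = 0.7)
    ratio = (likelihood(0.5, data) * prior_m1) / (likelihood(0.7, data) * prior_m2)
    print(ratio)  # ~0.19 < 1, so M2 explains this particular data better than M1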

33
Essential Information Theory
  • Developed by Shannon in the 1940s.
  • Goal is to maximize the amount of information
    that can be transmitted over an imperfect
    communication channel.
  • Wished to determine the theoretical maxima for
    data compression (entropy H) and transmission
    rate (channel capacity C).
  • If a message is transmitted at a rate slower than
    C, then the probability of transmission errors
    can be made as small as desired.

34
Entropy
  • X: a discrete RV with distribution p(X)
  • Entropy (or self-information) is the average uncertainty of a single RV. Let p(x) = P(X=x), where x ∈ X; then
  • H(X) = − Σx∈X p(x) log2 p(x)
  • In other words, entropy measures the amount of information in a random variable, measured in bits. It is also the average length of the message needed to transmit an outcome of that variable using the optimal code.
  • An optimal code sends a message of probability p(i) in ⌈−log2 p(i)⌉ bits.

35
Entropy (cont.)
  • H(X) = 0 if and only if the outcome is certain; i.e., when the value of X is determinate, it provides no new information.

36
Using the Formula: Examples
  • Toss a fair coin: Ω = {head, tail}
  • P(X=head) = .5, P(X=tail) = .5
  • H(X) = −0.5 log2(0.5) − 0.5 log2(0.5) = 1 bit
  • Take a fair, 32-sided die: p(x) = 1/32 for every side x
  • H(p) = −Σi=1..32 p(xi) log2 p(xi) = −32 · (1/32 · log2(1/32)) = 5 bits (since for all i, p(xi) = 1/32)
  • Unfair coin:
  • P(X=head) = .2 → H(X) = .722; P(X=head) = .01 → H(X) = .081
  • (These values are reproduced in the sketch below.)
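A small Python helper, added for illustration, that reproduces the entropy values quoted above:

    from math import log2

    def entropy(probs):
        """H(X) = -sum p(x) log2 p(x), treating 0 log 0 as 0."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))      # 1.0 bit  (fair coin)
    print(entropy([1 / 32] * 32))   # 5.0 bits (fair 32-sided die)
    print(entropy([0.2, 0.8]))      # ~0.722   (unfair coin)
    print(entropy([0.01, 0.99]))    # ~0.081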

37
Entropy of a Weighted Coin (one toss)
(Figure: H(X) as a function of P(head); it is 0 at P(head) = 0 or 1 and maximal, 1 bit, at P(head) = 0.5.)
38
The Limits
  • When is H(p) = 0?
  • if the result of an experiment is known ahead of time:
  • necessarily, ∃x ∈ Ω: p(x) = 1 and p(y) = 0 for all y ≠ x
  • Upper bound?
  • for |Ω| = n: H(p) ≤ log2 n
  • nothing can be more uncertain than the uniform distribution
  • Entropy increases with message length.

39
Coding Interpretation of Entropy
  • The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
  • Compression algorithms:
  • do well on data with repeating (i.e., easily predictable, low-entropy) patterns
  • their results have high entropy ⇒ compressing compressed data does nothing

40
Coding Example
  • Experience: some characters are more common, some (very) rare.
  • What if we use more bits for the rare and fewer bits for the frequent? Be careful: we want to be able to decode (easily)!
  • suppose p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, and for the rest p(x) ≈ .0004
  • code: 'a' → 00, 'b' → 01, 'c' → 10, rest: 11 b1 b2 b3 b4 b5 b6 b7 b8
  • code "acbbécbaac" → 00 10 01 01 11 00001111 10 01 00 00 10
    (a  c  b  b  é           c  b  a  a  c)
  • number of bits used: 28 (vs. 80 using a naive 8-bit uniform coding; see the sketch below)
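A Python sketch of this code; the 8-bit pattern chosen for a rare character (here its Latin-1 byte value) is one arbitrary choice, since the slide leaves the bits b1...b8 unspecified, but the total length comes out the same:

    def encode(text):
        """'a' -> 00, 'b' -> 01, 'c' -> 10, anything else -> 11 + 8 bits for the byte."""
        out = []
        for ch in text:
            if ch in "abc":
                out.append(format("abc".index(ch), "02b"))
            else:
                out.append("11" + format(ch.encode("latin-1")[0], "08b"))
        return "".join(out)

    bits = encode("acbbécbaac")
    print(len(bits))  # 28 bits, versus 10 * 8 = 80 bits with a plain 8-bit code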

41
Properties of Entropy I
  • Entropy is non-negative:
  • H(X) ≥ 0
  • Recalling that H(X) = − Σx∈X p(x) log2 p(x):
  • log2 p(x) is negative or zero for p(x) ≤ 1,
  • p(x) is non-negative, so p(x) log2 p(x) is negative or zero,
  • a sum of non-positive numbers is non-positive,
  • and −x is non-negative for non-positive x.

42
Joint Entropy
  • The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:
  • H(X,Y) = − Σx Σy p(x,y) log2 p(x,y)

43
Conditional Entropy
  • The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information we need to supply on average to communicate Y given that the other party knows X:
  • H(Y|X) = Σx p(x) H(Y|X=x) = − Σx Σy p(x,y) log2 p(y|x)
  • (Recall that H(X) = E(log2(1/pX(x))); here the weights p(x,y) are not conditional.)

44
Properties of Entropy II
  • Conditional entropy is "better" (i.e., never larger) than unconditional entropy:
  • H(Y|X) ≤ H(Y)
  • H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  • equality iff X, Y are independent
  • recall: X, Y independent iff p(X,Y) = p(X) p(Y)
  • H(p) is concave (remember the weighted-coin graph?)
  • a concave function f over an interval (a,b) satisfies f(λx + (1−λ)y) ≥ λ f(x) + (1−λ) f(y) for all x, y ∈ (a,b) and λ ∈ [0,1]
  • a function f is convex if −f is concave
  • for proofs and generalizations, see Cover and Thomas.

45
Chain Rule for Entropy
  • H(X,Y) = H(X) + H(Y|X); more generally, H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1)
  • The product (from the chain rule for probabilities) becomes a sum because of the log.

46
Entropy Rate
  • Because the amount of information contained in a
    message depends on its length, we may want to
    compare using entropy rate (the entropy per
    unit).
  • The entropy rate of a language is the limit of
    the entropy rate of a sample of language as the
    sample gets longer and longer.

47
Mutual Information
  • By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
  • Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y)
  • I(X; Y) is called the mutual information between X and Y.
  • It is the reduction in uncertainty of one random variable due to knowing about another; in other words, the amount of information one random variable contains about another.

48
Relationship between I and H
(Figure: H(X,Y) decomposed into H(X|Y), I(X; Y), and H(Y|X); H(X) and H(Y) overlap in I(X; Y).)
49
Mutual Information (cont)
  • I(X; Y) is a symmetric, non-negative measure of the common information of two variables.
  • Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence:
  • I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X)
  • For two dependent variables, I grows not only with the degree of dependence but also with the entropy of the two variables.
  • H(X) = H(X) − H(X|X) = I(X; X): this is why entropy is called self-information.

50
Mutual Information (cont)
  • We can also derive conditional mutual information:
  • I(X; Y|Z) = I((X; Y)|Z) = H(X|Z) − H(X|Y,Z)
  • Chain rule:
  • I(X1..n; Y) = I(X1; Y) + I(X2; Y|X1) + ... + I(Xn; Y|X1, ..., Xn−1)
  • Don't confuse this with pointwise mutual information, which has some problems.

51
Mutual Information and Entropy
  • I(X,Y) is the number of bits by which the knowledge of Y lowers the entropy of X (and, by symmetry, vice versa):
  • I(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • Recall H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y) ⇒
  • I(X,Y) = H(X) + H(Y) − H(X,Y)
  • I(X,X) = H(X) − H(X|X) = H(X) (since H(X|X) = 0)
  • I(X,Y) = I(Y,X) (symmetry)
  • I(X,Y) ≥ 0
  • (A small numeric sketch follows below.)
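A small numeric sketch in Python using I(X,Y) = H(X) + H(Y) − H(X,Y); the joint table is invented purely for illustration:

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Hypothetical joint distribution of two binary variables
    joint = {("rain", "umbrella"): 0.3, ("rain", "no umbrella"): 0.1,
             ("sun", "umbrella"): 0.1, ("sun", "no umbrella"): 0.5}

    px, py = {}, {}
    for (x, y), p in joint.items():          # marginals
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p

    mi = H(px.values()) + H(py.values()) - H(joint.values())
    print(mi)  # ~0.256 bits > 0: the two variables are not independent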

52
The Noisy Channel Model
  • We want to optimize communication across a channel in terms of throughput and accuracy: the transmission of messages in the presence of noise in the channel.
  • There is a duality between compression (achieved
    by removing all redundancy) and transmission
    accuracy (achieved by adding controlled
    redundancy so that the input can be recovered in
    the presence of noise).

53
The Noisy Channel Model
  • Goal encode the message in such a way that it
    occupies minimal space while still containing
    enough redundancy to be able to detect and
    correct errors.

54
The Noisy Channel Model
  • Channel capacity: the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output. For a memoryless channel, C = max over p(X) of I(X; Y).
  • We reach a channel's capacity if we manage to design an input code X whose distribution p(X) maximizes the mutual information I between input and output.

55
Language and the Noisy Channel Model
  • In language we cannot control the encoding phase; we can only decode the output to give the most likely input. Determine the most likely input given the output!

(Diagram: input I → noisy channel p(o|i) → output O → decoder → most likely input Î)
56
The Noisy Channel Model
  • Î = argmaxi p(i|o) = argmaxi p(o|i) p(i), where p(i) is the language model and p(o|i) is the channel probability.
  • This is used in machine translation, optical character recognition, speech recognition, etc. (see the sketch below)
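A toy decoding sketch in Python; the candidate inputs and every probability below are made up for illustration and do not come from any real model:

    # Noisy channel decoding: pick the input i that maximizes p(i) * p(o | i),
    # e.g. guessing the intended word given the observed OCR output "form".
    language_model = {"form": 0.004, "from": 0.010, "farm": 0.002}   # p(i), made up
    channel = {"form": 0.80, "from": 0.10, "farm": 0.05}             # p(o="form" | i), made up

    best = max(language_model, key=lambda i: language_model[i] * channel[i])
    print(best)  # "form": 0.004 * 0.80 = 0.0032 beats 0.010 * 0.10 = 0.0010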

57
The Noisy Channel Model
58
Relative Entropy: Kullback-Leibler Divergence
  • For two pmfs p(x) and q(x), their relative entropy is
  • D(p || q) = Σx p(x) log2 (p(x) / q(x))
  • The relative entropy, also called the Kullback-Leibler divergence, is a measure of how different two probability distributions are (over the same event space).
  • The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from the distribution p with a code based on the not-quite-right distribution q. (See the sketch below.)
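A minimal Python sketch of the definition; the two distributions are made up simply to show that the measure is not symmetric:

    from math import log2

    def kl_divergence(p, q):
        """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
        return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

    p = {"a": 0.5, "b": 0.5}    # "true" distribution (made up)
    q = {"a": 0.9, "b": 0.1}    # model distribution (made up)
    print(kl_divergence(p, q))  # ~0.737
    print(kl_divergence(q, p))  # ~0.531: different, so D is not symmetric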

59
Comments on Relative Entropy
  • Goal: minimize the relative entropy D(p || q) to have a probabilistic model that is as accurate as possible.
  • Conventions:
  • 0 log 0 = 0
  • p log (p/0) = ∞ (for p > 0)
  • Is it a distance? Not quite:
  • it is not symmetric: D(p || q) ≠ D(q || p)
  • it does not satisfy the triangle inequality
  • But it can be useful to think of it as a distance.

60
Mutual Information and Relative Entropy
  • Random variables X, Y with joint pmf pX,Y(x,y) and marginals pX(x), pY(y)
  • Mutual information (between two random variables X, Y):
  • I(X,Y) = D(p(x,y) || p(x) p(y))
  • I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
  • or: how much p(x,y) deviates from independence (p(x) p(y))

61
From Mutual Information to Entropy
  • By how many bits does the knowledge of Y lower the entropy H(X)?
  • I(X,Y) = Σx Σy p(x,y) log2 (p(x,y) / (p(x) p(y)))
  • // use p(x,y)/p(y) = p(x|y)
  • = Σx Σy p(x,y) log2 (p(x|y) / p(x))
  • // use log(a/b) = log a − log b (a = p(x|y), b = p(x)), distribute the sums
  • = Σx Σy p(x,y) log2 p(x|y) − Σx Σy p(x,y) log2 p(x)
  • // use the def. of H(X|Y) (left term) and Σy∈Y p(x,y) = p(x) (right term)
  • = −H(X|Y) + (− Σx∈Ω p(x) log2 p(x))
  • // use the def. of H(X) (right term), swap terms
  • = H(X) − H(X|Y) ... and by symmetry also = H(Y) − H(Y|X)

62
Jensen's Inequality
  • Recall: f is convex on an interval (a,b) iff
  • ∀x, y ∈ (a,b), ∀λ ∈ [0,1]:
  • f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y)
  • Jensen's inequality: for a distribution p(x), an r.v. X on Ω, and a convex f,
  • f(Σx∈Ω p(x) x) ≤ Σx∈Ω p(x) f(x)
  • Proof (idea): by induction on the number of basic outcomes
  • start with |Ω| = 2:
  • p(x1) f(x1) + p(x2) f(x2) ≥ f(p(x1) x1 + p(x2) x2) (⇐ def. of convexity)
  • for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the def. of convexity (again).

63
Relative Entropy Inequality
  • D(p || q) ≥ 0
  • Proof:
  • 0 = −log 1 = −log Σx∈Ω q(x) = −log Σx∈Ω (q(x)/p(x)) p(x)
  • ... apply Jensen's inequality here (−log is convex) ...
  • ≤ Σx∈Ω p(x) (−log(q(x)/p(x))) = Σx∈Ω p(x) log(p(x)/q(x)) = D(p || q)

64
Other Entropy Facts
  • Log sum inequality: for ri, si ≥ 0,
  • Σi=1..n ri log(ri/si) ≥ (Σi=1..n ri) log(Σi=1..n ri / Σi=1..n si)
  • D(p || q) is convex in (p, q) (⇐ log sum inequality)
  • H(pX) ≤ log2 |Ω|, where Ω is the sample space of pX
  • Proof: for the uniform u(x) on the same sample space Ω, Σ p(x) log2 u(x) = −log2 |Ω|; then
  • log2 |Ω| − H(X) = −Σ p(x) log2 u(x) + Σ p(x) log2 p(x) = D(p || u) ≥ 0
  • H(p) is concave in p
  • Proof: from H(X) = log2 |Ω| − D(p || u); D(p || u) is convex, hence H(X) is concave.

65
Entropy and Language
  • Entropy is measure of uncertainty. The more we
    know about something the lower the entropy.
  • If a language model captures more of the
    structure of the language than another model,
    then its entropy should be lower.
  • Entropy can be thought of as a matter of how
    surprised we will be to see the next word given
    previous words we have already seen.

66
The Relation to Language: Cross Entropy
  • We can use pointwise entropy as a measure of surprise:
  • H(w|h) = −log2 m(w|h), where w is the next word and h is the history of previously seen words.
  • The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by:
  • H(X, q) = H(X) + D(p || q) = −Σx p(x) log2 q(x) = Hp(q)
  • Cross entropy can help us find out what our average surprise for the next word is.

67
Cross Entropy
  • H(X) + D(p || q) = the number of bits needed to encode p if the code is based on q.

68
Cross Entropy
  • Typical case: we have a series of observations T = {t1, t2, t3, t4, ..., tn} (numbers, words, ...; ti ∈ Ω)
  • A simple estimate:
  • ∀y ∈ Ω: p̂(y) = C(y) / |T|, where C(y) = |{t ∈ T : t = y}|
  • ... but the true p is unknown: every sample is too small!
  • Natural question: how well do we do using p̂ instead of p?
  • Idea: simulate the actual p by using different data T'
  • (or rather: by using different observations we simulate the insufficiency of T vs. some other data ("random" difference))

69
Conditional Cross Entropy
  • So far: unconditional distribution(s) p(x), p'(x), ...
  • In practice: we are virtually always conditioning on context.
  • Interested in: sample space Y, r.v. Y, y ∈ Y
  • context: sample space Ω, r.v. X, x ∈ Ω
  • our distribution p̂(y|x), tested against p'(y,x), which is taken from some independent data:
  • Hp'(p̂) = − Σy∈Y, x∈Ω p'(y,x) log2 p̂(y|x)

70
Sample Space vs. Data
  • In practice, it is inconvenient to sum over the sample space(s) Y, Ω.
  • Use the following formula over the test data T' instead:
  • Hp'(p̂) = − Σy∈Y, x∈Ω p'(y,x) log2 p̂(y|x) = − (1/|T'|) Σi=1..|T'| log2 p̂(yi|xi)
  • This is in fact the normalized log probability of the test data:
  • Hp'(p̂) = − (1/|T'|) log2 Πi=1..|T'| p̂(yi|xi)

71
Computation Example
  • Ω = {a, b, ..., z}; probability distribution (assumed/estimated from data):
    p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, and 0 for s, t, u, v, w, x, y, z
  • Data (test): "barb"; empirical distribution p'(a) = p'(r) = .25, p'(b) = .5
  • Sum over Ω: the only non-zero terms −p'(α) log2 p(α) are
    a: .25 · 2 = .5,  b: .5 · 1 = .5,  r: .25 · 6 = 1.5  (all other letters contribute 0)
    total: Hp'(p) = .5 + .5 + 1.5 = 2.5
  • Sum over the data instead:
    i / si:       1/b  2/a  3/r  4/b
    −log2 p(si):  1    2    6    1     → sum = 10, and (1/|T'|) · 10 = (1/4) · 10 = 2.5
  • (The sketch below reproduces this computation.)
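A short Python sketch, added for illustration, that reproduces these numbers with the per-item formula (summing over the data rather than over Ω):

    from math import log2

    # The model p from the slide: p(a) = .25, p(b) = .5, p(c)..p(r) = 1/64, rest 0
    p = {"a": 0.25, "b": 0.5}
    p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})

    def cross_entropy(text, model):
        """-(1/|T|) * sum over the data of log2 model(t_i)."""
        return -sum(log2(model[ch]) for ch in text) / len(text)

    print(cross_entropy("barb", p))      # 2.5 bits
    print(cross_entropy("probable", p))  # 4.25 bits
    # cross_entropy("baby", p) has no finite answer: p(y) = 0, so -log2(0) is infinite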


72
Cross Entropy: Some Observations
  • How does H(p) compare to Hp'(p): <, =, or >? All are possible!
  • Previous example:
  • p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest: s, t, u, v, w, x, y, z
  • H(p) = 2.5 bits = Hp'(p) (test data "barb")
  • Other data: "probable": (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  • H(p) = 2.5 bits < 4.25 bits = Hp'(p) (test data "probable")
  • And finally, "abba": (1/4)(2 + 1 + 1 + 2) = 1.5
  • H(p) = 2.5 bits > 1.5 bits = Hp'(p) (test data "abba")
  • But what about "baby"? −p'(y) log2 p(y) = −.25 · log2 0 = ∞ (!)

73
Cross Entropy: Usage
  • Comparing distributions using real data:
  • We have two distributions p and q (on some Ω, X); which is better?
  • The better one has the lower cross entropy on real test data S:
  • HS(p) = − (1/|S|) Σi=1..|S| log2 p(yi|xi)  vs.  HS(q) = − (1/|S|) Σi=1..|S| log2 q(yi|xi)

74
Comparing Distributions
  • Test data S: "probable"
  • p(.) from the previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest
    HS(p) = 4.25
  • q(.|.) (conditional, defined by a table), e.g., q(o|r) = 1, q(r|p) = .125
  • HS(q) = −(1/8)(log2 q(p|oth.) + log2 q(r|p) + log2 q(o|r) + log2 q(b|o) + log2 q(a|b) + log2 q(b|a) + log2 q(l|b) + log2 q(e|l))
    = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
  • So q is the better model on this test data: HS(q) = .625 < HS(p) = 4.25
75
Entropy of a Language
  • Imagine that we produce the next letter using
  • p(ln+1 | l1, ..., ln),
  • where l1, ..., ln is the sequence of all the letters uttered so far (i.e., n is really big!); let's call l1, ..., ln the history h
  • Then compute its entropy:
  • − Σh∈H Σl∈A p(l, h) log2 p(l | h)
  • Not very practical, is it?

76
The Entropy of English
  • We can model English using n-gram models (also known as Markov chains).
  • These models assume limited memory, i.e., we assume that the next word depends only on the previous k words (a kth-order Markov approximation).
  • What is the entropy of English?
  • First order: 4.03 bits
  • Second order: 2.8 bits
  • Shannon's experiment: 1.3 bits

77
Perplexity
  • A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity:
  • Perplexity(x1..n, m) = 2^H(x1..n, m) = m(x1..n)^(−1/n)
  • A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step. (See the sketch below.)
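A minimal Python sketch of the relationship between cross entropy and perplexity; the probability lists are made up for illustration:

    from math import log2

    def perplexity(probs):
        """probs: the model's probability for each observed item of the test data."""
        cross_entropy = -sum(log2(p) for p in probs) / len(probs)
        return 2 ** cross_entropy

    print(perplexity([1 / 8] * 20))         # 8.0: like guessing among 8 equiprobable choices
    print(perplexity([0.5, 0.25, 0.125]))   # 4.0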