Title: Wen-Hsiang Lu
1. Lecture 3: Basic Probability (Chapter 2 of Manning and Schutze)
Wen-Hsiang Lu, Department of Computer Science and Information Engineering, National Cheng Kung University, 2004/09/29
2. Motivation
- Statistical NLP aims to do statistical inference.
- Statistical inference consists of taking some data (generated according to some unknown probability distribution) and then making some inferences about this distribution.
- An example of statistical inference is the task of language modeling, namely predicting the next word given a window of previous words. To do this, we need a model of the language.
- Probability theory helps us to find such a model.
3. Probability Terminology
- Probability theory deals with predicting how likely it is that something will happen.
- The process by which an observation is made is called an experiment or a trial (e.g., tossing a coin twice).
- The collection of basic outcomes (or sample points) for our experiment is called the sample space.
- An event is a subset of the sample space.
- Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1, certainty.
- A probability function/distribution distributes a probability mass of 1 throughout the sample space.
4. Experiments and Sample Spaces
- The set of possible basic outcomes of an experiment is the sample space Ω
- coin toss (Ω = {head, tail})
- tossing a coin 2 times (Ω = {HH, HT, TH, TT})
- dice roll (Ω = {1, 2, 3, 4, 5, 6})
- missing word (|Ω| = vocabulary size)
- Discrete (countable) versus continuous (uncountable)
- Every observation/trial is a basic outcome or sample point.
- An event A is a set of basic outcomes, with A ⊆ Ω and A ∈ 2^Ω (the event space)
- Ω is then the certain event, ∅ is the impossible event
5. Events and Probability
- The probability of event A is denoted P(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
- Example: Experiment = toss a coin three times
- Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- cases with two or more tails: A = {HTT, THT, TTH, TTT}
- P(A) = |A| / |Ω| = 1/2 (assuming a uniform distribution)
- all heads: A = {HHH}
- P(A) = |A| / |Ω| = 1/8
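A quick enumeration sketch in Python (not from the slides; the helper and variable names are ours) that reproduces these two probabilities:

```python
# Enumerate the sample space of three coin tosses and count event outcomes
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))      # 8 equally likely basic outcomes

def prob(event):
    # P(A) = |A| / |Omega| under the uniform-distribution assumption
    return Fraction(len(event), len(omega))

two_or_more_tails = [w for w in omega if w.count("T") >= 2]
all_heads = [w for w in omega if w.count("H") == 3]

print(prob(two_or_more_tails))   # 1/2
print(prob(all_heads))           # 1/8
```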
6. Probability Properties
- Basic properties:
- P: 2^Ω → [0,1]
- P(Ω) = 1
- For disjoint events: P(∪i Ai) = Σi P(Ai)
- NB: the axiomatic definition of probability takes the above three conditions as axioms
- Immediate consequences:
- P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)
- Σ_{a∈Ω} P(a) = 1
7. Joint Probability
- The joint probability of A and B: P(A,B) = P(A ∩ B)
- a 2-dimensional table (A × B) with a value in every cell giving the probability of that specific pair occurring
[Venn diagram: events A and B within Ω, overlapping in A ∩ B]
8. Conditional Probability
- Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful. If we know that event B is true, then we can determine the probability that A is true given this knowledge:
- P(A|B) = P(A,B) / P(B)
9. Conditional and Joint Probabilities
- P(A|B) = P(A,B) / P(B) ⇒ P(A,B) = P(A|B) P(B)
- P(B|A) = P(A,B) / P(A) ⇒ P(A,B) = P(B|A) P(A)
- Chain rule: P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1, ..., An−1)
10. Bayes' Rule
- Since P(A,B) = P(B,A), P(A ∩ B) = P(B ∩ A), and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
- P(A|B) = P(A,B) / P(B) = P(B|A) P(A) / P(B)
- P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)
11. Example
- S = have a stiff neck, M = have meningitis
- P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
- I have a stiff neck, should I worry?
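Plugging the slide's numbers into Bayes' rule answers the question; a minimal Python sketch (not part of the original slides, variable names ours):

```python
# Bayes' rule with the slide's numbers: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5          # P(S|M): stiff neck given meningitis
p_m = 1 / 50_000           # P(M):   prior probability of meningitis
p_s = 1 / 20               # P(S):   prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # ~0.0002, i.e. about 1 in 5000 -- probably no need to worry
```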
12. Independence
- Two events A and B are independent of each other if P(A) = P(A|B)
- Example: two coin tosses; the weather today and the weather on March 4th, 1789
- If A and B are independent, then we can compute P(A,B) from P(A) and P(B) as
- P(A,B) = P(A|B) P(B) = P(A) P(B)
- Two events A and B are conditionally independent of each other given C if
- P(A|C) = P(A|B,C)
13. A Golden Rule (of Statistical NLP)
- If we are interested in which event B is most likely given A, we can use Bayes' rule, maximizing over all B:
- argmax_B P(B|A) = argmax_B P(A|B) P(B) / P(A) = argmax_B P(A|B) P(B)
- P(A) is a normalizing constant
- Since the denominator does not depend on B, it can simply be dropped when maximizing.
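A minimal sketch of this argmax in Python; the candidate events and all probabilities below are invented purely for illustration:

```python
# "Golden rule": pick argmax_B P(A|B) P(B); P(A) is never needed.
candidates = {
    # B: (P(A|B), P(B))  -- toy numbers, not from the slides
    "B1": (0.7, 0.1),
    "B2": (0.2, 0.6),
    "B3": (0.1, 0.3),
}

best = max(candidates, key=lambda b: candidates[b][0] * candidates[b][1])
print(best)   # "B2": 0.2 * 0.6 = 0.12 beats 0.07 (B1) and 0.03 (B3)
```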
14. Random Variables (RV)
- Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
- An RV is a function X: Ω → Q
- in general Q = Rⁿ, typically Q = R
- it is easier to handle real numbers than real-world events
- An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}
- We can define a probability mass function (pmf) for an RV X that gives the probability it takes on different values:
- p_X(x) = P(X = x) = P(A_x), where A_x = {ω ∈ Ω : X(ω) = x}
- often written just p(x) if it is clear from context which RV is meant
15. Example
- Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, ..., 11, 12}, with the pmf as follows:
- P(X=2) = 1/36,
- P(X=3) = 2/36,
- P(X=4) = 3/36,
- P(X=5) = 4/36,
- P(X=6) = 5/36,
- P(X=7) = 6/36,
- P(X=8) = 5/36,
- P(X=9) = 4/36,
- P(X=10) = 3/36,
- P(X=11) = 2/36,
- P(X=12) = 1/36
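This pmf can be checked by enumerating the 36 equally likely outcomes; a small Python sketch (names ours, not from the slides):

```python
# pmf of X = sum of two dice, by enumeration of the 36 outcomes
from itertools import product
from fractions import Fraction
from collections import Counter

counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(f"P(X={x}) = {p}")   # Fraction reduces, e.g. 2/36 prints as 1/18
```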
16. Expectation and Variance
- The expectation is the mean or average of an RV, defined as E(X) = Σ_x x p(x).
- The variance of an RV is a measure of whether the values of the RV tend to vary over trials:
- Var(X) = E((X − E(X))²) = E(X²) − E²(X)
- The standard deviation (σ) is the square root of the variance.
17. Examples
- What is the expectation of the sum of the numbers on two dice?
- 2 · P(X=2) = 2 · 1/36 = 1/18
- 3 · P(X=3) = 3 · 2/36 = 3/18
- 4 · P(X=4) = 4 · 3/36 = 6/18
- 5 · P(X=5) = 5 · 4/36 = 10/18
- 6 · P(X=6) = 6 · 5/36 = 15/18
- 7 · P(X=7) = 7 · 6/36 = 21/18
- 8 · P(X=8) = 8 · 5/36 = 20/18
- 9 · P(X=9) = 9 · 4/36 = 18/18
- 10 · P(X=10) = 10 · 3/36 = 15/18
- 11 · P(X=11) = 11 · 2/36 = 11/18
- 12 · P(X=12) = 12 · 1/36 = 6/18
- E(SUM) = 126/18 = 7
- Or more simply:
- E(SUM) = E(D1 + D2) = E(D1) + E(D2)
- E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = 1/6 · (1 + 2 + 3 + 4 + 5 + 6) = 21/6
- Hence, E(SUM) = 21/6 + 21/6 = 7
18. Examples
- Var(X) = E((X − E(X))²) = E(X² − 2X·E(X) + E²(X)) = E(X²) − 2E(X·E(X)) + E²(X) = E(X²) − 2E²(X) + E²(X) = E(X²) − E²(X)
- E(SUM²) = 329/6 and E²(SUM) = 7² = 49
- Hence, Var(SUM) = 329/6 − 294/6 = 35/6
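A short sketch (ours) that checks E(SUM) = 7, E(SUM²) = 329/6 and Var(SUM) = 35/6 directly from the pmf:

```python
# Expectation and variance of the sum of two dice, with exact fractions
from itertools import product
from fractions import Fraction

pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

E = sum(x * p for x, p in pmf.items())
E_sq = sum(x * x * p for x, p in pmf.items())
Var = E_sq - E * E

print(E)      # 7
print(E_sq)   # 329/6
print(Var)    # 35/6
```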
19. Joint and Conditional Distributions for RVs
- The joint pmf for two RVs X and Y is p(x,y) = P(X=x, Y=y)
- Marginal pmfs are calculated as p_X(x) = Σ_y p(x,y) and p_Y(y) = Σ_x p(x,y)
- If X and Y are independent, then p(x,y) = p_X(x) p_Y(y)
- Define the conditional pmf using the joint distribution:
- p_{X|Y}(x|y) = p(x,y) / p_Y(y) if p_Y(y) > 0
- Chain rule:
- p(w,x,y,z) = p(w) p(x|w) p(y|w,x) p(z|w,x,y)
20. Estimating Probability Functions
- What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown, so P must be estimated from a sample of data.
- An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times an outcome u occurs: C(u) / N
- C(u) is the number of times u occurs in N trials.
- For N → ∞, the relative frequency tends to stabilize around some number: the probability estimate.
- Two different approaches:
- Parametric (assume a distribution)
- Non-parametric (distribution-free)
21. Parametric Methods
- Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
- By assuming an explicit probabilistic model of the process by which the data was generated, determining a particular probability distribution within the family requires only the specification of a few parameters, and therefore less training data (i.e., only a small number of parameters need to be estimated).
22. Non-parametric Methods
- No assumption is made about the underlying distribution of the data.
- For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
- Given less prior information, more training data is needed.
23. Estimating Probability
- Example: toss a coin three times
- Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- count cases with exactly two tails: A = {HTT, THT, TTH}
- Run the experiment 1000 times (i.e., 3000 tosses)
- Counted 386 cases with two tails (HTT, THT, or TTH)
- Estimate of p(A) = 386 / 1000 = .386
- Run again: 373, 399, 382, 355, 372, 406, 359
- p(A) = .379 (weighted average) or simply 3032 / 8000
- Uniform distribution assumption: p(A) = 3/8 = .375
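A small simulation sketch (ours) of this experiment; the relative frequency settles near the theoretical 3/8 as the number of runs grows:

```python
# Relative-frequency estimate of P(exactly two tails in three tosses)
import random

def trial(rng):
    tosses = [rng.choice("HT") for _ in range(3)]
    return tosses.count("T") == 2

rng = random.Random(0)              # fixed seed so the run is repeatable
for n in (1000, 10_000, 100_000):
    count = sum(trial(rng) for _ in range(n))
    print(n, count / n)             # tends toward 3/8 = 0.375 as n grows
```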
24. Standard Distributions
- In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
- Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
- Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
- Continuous distributions: the normal distribution, the standard normal distribution.
25. Standard Distributions: Binomial
- A series of trials with only two outcomes, 0 or 1, with each trial being independent from all the others.
- The number r of successes out of n trials, given that the probability of success in any trial is p:
- b(r; n, p) = C(n, r) p^r (1−p)^(n−r)
- C(n, r) = n! / ((n−r)! r!) counts how many possibilities there are for choosing r objects out of n.
26. Binomial Distribution
- Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
- Use it when counting whether something has a certain property or not (assuming independence).
- Actually quite common in statistical NLP: e.g., look through a corpus to estimate the percentage of sentences that contain a certain word, or how often a verb is used as transitive or intransitive.
- The expectation is n·p, the variance is n·p·(1−p)
27. Standard Distributions: Normal
- The normal (Gaussian) distribution is a continuous distribution with two parameters: the mean μ and the standard deviation σ. It is the standard normal if μ = 0 and σ = 1. (used, e.g., in clustering)
[Figure: normal density curve centered at the mean μ]
28. Frequentist Statistics
- s = a sequence of observations
- M = a model (a distribution plus parameters)
- For a fixed model M, the maximum likelihood estimate chooses the parameter values that maximize P(s|M).
- Probability expresses something about the world, with no prior belief!
29. Bayesian Statistics I: Bayesian Updating
- Assume that the data are coming in sequentially and are independent.
- Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the maximum a posteriori (MAP) distribution.
- The MAP probability becomes the new prior, and the process repeats on each new datum.
30. Bayesian Statistics: MAP
- MAP estimate: argmax_M P(M|s) = argmax_M P(s|M) P(M) / P(s) = argmax_M P(s|M) P(M)
- P(s) is a normalizing constant (it does not depend on M, so it can be dropped)
31. Bayesian Statistics II: Bayesian Decision Theory
- Bayesian statistics can be used to evaluate which model or family of models better explains some data.
- We define two different models of the event and calculate the likelihood ratio between these two models.
32. Bayesian Decision Theory
- Suppose we have two models, M1 and M2; we want to evaluate which model better explains some new data.
- Compute the likelihood ratio P(M1|data) / P(M2|data): if it is greater than 1, then M1 is the more likely model, otherwise M2.
33. Essential Information Theory
- Developed by Shannon in the 1940s.
- The goal is to maximize the amount of information that can be transmitted over an imperfect communication channel.
- Shannon wished to determine the theoretical maxima for data compression (entropy H) and transmission rate (channel capacity C).
- If a message is transmitted at a rate slower than C, then the probability of transmission errors can be made as small as desired.
34. Entropy
- X: a discrete RV with distribution p(X)
- Entropy (or self-information) is the average uncertainty of a single RV. Let p(x) = P(X = x), where x ∈ X; then
- H(X) = −Σ_{x∈X} p(x) log2 p(x)
- In other words, entropy measures the amount of information in a random variable, measured in bits. It is also the average length of the message needed to transmit an outcome of that variable using the optimal code.
- An optimal code sends a message of probability p(i) in ⌈−log2 p(i)⌉ bits.
35. Entropy (cont.)
- H(X) = 0 when the value of X is determinate, i.e., when it provides no new information
36. Using the Formula: Examples
- Toss a fair coin: Ω = {head, tail}
- P(X = head) = .5, P(X = tail) = .5
- H(X) = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
- Take a fair, 32-sided die: p(x) = 1/32 for every side x
- H(p) = −Σ_{i=1..32} p(x_i) log2 p(x_i) = −32 · (1/32 · log2(1/32)) = 5 bits (since for all i, p(x_i) = 1/32)
- Unfair coin:
- P(X = head) = .2 ⇒ H(X) = .722; P(X = head) = .01 ⇒ H(X) = .081
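A tiny entropy helper (ours) that reproduces these numbers:

```python
# H(p) = -sum p(x) log2 p(x), in bits
from math import log2

def H(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

print(H([0.5, 0.5]))              # 1.0   (fair coin)
print(H([1 / 32] * 32))           # 5.0   (fair 32-sided die)
print(round(H([0.2, 0.8]), 3))    # 0.722 (unfair coin)
print(round(H([0.01, 0.99]), 3))  # 0.081 (very unfair coin)
```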
37. Entropy of a Weighted Coin (one toss)
38. The Limits
- When is H(p) = 0?
- if the result of an experiment is known ahead of time
- necessarily, one outcome has probability 1 and all others probability 0
- Upper bound?
- for |Ω| = n: H(p) ≤ log2 n
- nothing can be more uncertain than the uniform distribution
- Entropy increases with message length.
39. Coding Interpretation of Entropy
- The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
- Compression algorithms:
- do well on data with repeating (easily predictable, low-entropy) patterns
- their results have high entropy ⇒ compressing compressed data does nothing
40. Coding Example
- Experience: some characters are more common, some are (very) rare
- What if we use more bits for the rare ones and fewer bits for the frequent ones? Be careful: we want to be able to decode (easily)!
- suppose p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
- code: 'a' = 00, 'b' = 01, 'c' = 10, rest: 11b1b2b3b4b5b6b7b8
- code of 'acbbécbaac' = 00 10 01 01 1100001111 10 01 00 00 10 (a c b b é c b a a c)
- number of bits used: 28 (vs. 80 using a naive 8-bit uniform coding)
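A sketch (ours) of this variable-length code; the 8-bit pattern used for 'é' depends on the character encoding, so the exact bit string may differ from the slide's, but the lengths (28 vs. 80 bits) match:

```python
# 'a' -> 00, 'b' -> 01, 'c' -> 10, anything else -> 11 + its 8-bit byte value
code = {"a": "00", "b": "01", "c": "10"}

def encode(text):
    return "".join(code.get(ch, "11" + format(ord(ch), "08b")) for ch in text)

bits = encode("acbbécbaac")
print(bits)                                    # the 28-bit encoding
print(len(bits), "vs", 8 * len("acbbécbaac"))  # 28 vs 80 bits for a naive 8-bit code
```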
41. Properties of Entropy I
- Entropy is non-negative:
- H(X) ≥ 0
- Proof: recall that H(X) = −Σ_{x∈X} p(x) log2 p(x)
- log2 p(x) is negative or zero for 0 < p(x) ≤ 1,
- p(x) is non-negative, so p(x) log2 p(x) is negative or zero,
- a sum of such terms is negative or zero,
- and −x is positive or zero for a negative or zero x
42. Joint Entropy
- The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:
- H(X,Y) = −Σ_{x∈X} Σ_{y∈Y} p(x,y) log2 p(x,y)
43. Conditional Entropy
- The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information we need to supply on average to communicate Y, given that the other party knows X:
- H(Y|X) = −Σ_{x∈X} Σ_{y∈Y} p(x,y) log2 p(y|x)
- (Recall that H(X) = E(log2(1/p_X(x))); here the weights p(x,y) are joint, not conditional.)
44. Properties of Entropy II
- Conditional entropy is never larger than unconditional entropy:
- H(Y|X) ≤ H(Y)
- H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
- equality iff X, Y are independent
- recall: X, Y independent iff p(X,Y) = p(X) p(Y)
- H(p) is concave (remember the coin-toss graph?)
- a concave function f over an interval (a,b) satisfies f(λx + (1−λ)y) ≥ λ f(x) + (1−λ) f(y)
- a function f is convex if −f is concave
- for proofs and generalizations, see Cover & Thomas
45. Chain Rule for Entropy
- H(X,Y) = H(X) + H(Y|X), and in general H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1)
- The product (in the chain rule for probabilities) becomes a sum due to the log.
46. Entropy Rate
- Because the amount of information contained in a message depends on its length, we may want to compare using the entropy rate (the entropy per unit, e.g., per word): (1/n) H(X_1n)
- The entropy rate of a language is the limit of the entropy rate of a sample of the language as the sample gets longer and longer.
47. Mutual Information
- By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
- Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y)
- I(X; Y) is called the mutual information between X and Y.
- It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.
48. Relationship between I and H
49. Mutual Information (cont.)
- I(X; Y) is a symmetric, non-negative measure of the common information of two variables.
- Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence:
- I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X)
- For two dependent variables, I grows not only with the degree of dependence but also with the entropy of the two variables.
- H(X) = H(X) − H(X|X) = I(X; X) ⇒ this is why entropy is called self-information.
50. Mutual Information (cont.)
- We can also derive conditional mutual information:
- I(X; Y|Z) = I((X; Y)|Z) = H(X|Z) − H(X|Y,Z)
- Chain rule:
- I(X_1n; Y) = I(X1; Y) + I(X2; Y|X1) + ... + I(Xn; Y|X1, ..., Xn−1)
- Don't confuse this with pointwise mutual information, which has some problems.
51. Mutual Information and Entropy
- I(X,Y) is the number of bits by which the knowledge of Y lowers the entropy of X (and, by symmetry, vice versa):
- I(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
- Recall H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y) ⇒ I(X,Y) = H(X) + H(Y) − H(X,Y)
- I(X,X) = H(X) − H(X|X) = H(X) (since H(X|X) = 0)
- I(X,Y) = I(Y,X) (symmetry)
- I(X,Y) ≥ 0
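A small sketch (ours) computing I(X,Y) = H(X) + H(Y) − H(X,Y) from a joint pmf; the joint table below is made up for illustration, not from the slides:

```python
# Mutual information from a toy 2x2 joint distribution
from math import log2

joint = {  # p(x, y) -- invented numbers
    ("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
    ("x2", "y1"): 0.1, ("x2", "y2"): 0.4,
}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p    # marginal p_X
    py[y] = py.get(y, 0) + p    # marginal p_Y

I = H(px.values()) + H(py.values()) - H(joint.values())
print(round(I, 4))   # ~0.278 > 0, since X and Y are dependent here
```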
52. The Noisy Channel Model
- We want to optimize a communication across a channel in terms of throughput and accuracy: the communication of messages in the presence of noise in the channel.
- There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).
53. The Noisy Channel Model
- Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors.
54. The Noisy Channel Model
- Channel capacity: the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output. For a memoryless channel: C = max_{p(X)} I(X; Y)
- We reach a channel's capacity if we manage to design an input code X whose distribution p(X) maximizes I between input and output.
55. Language and the Noisy Channel Model
- In language we can't control the encoding phase; we can only decode the output to give the most likely input. Determine the most likely input given the output:
- Î = argmax_i p(i|o) = argmax_i p(i) p(o|i)
[Diagram: input I → noisy channel p(o|i) → output O → decoder → most likely input Î]
56. The Noisy Channel Model
- p(i) is the language model and p(o|i) is the channel probability.
- This is used in machine translation, optical character recognition, speech recognition, etc.
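A very small decoding sketch (ours); the candidate words and all probabilities are invented toy numbers, purely to show the shape of the argmax:

```python
# Noisy-channel decoding: pick the input i maximizing p(i) * p(o|i)
language_model = {"their": 0.6, "there": 0.4}   # p(i) -- toy values
channel = {                                     # p(o|i) for the observed o = "thier"
    "their": 0.10,
    "there": 0.01,
}

best = max(language_model, key=lambda i: language_model[i] * channel[i])
print(best)   # "their": 0.6 * 0.10 = 0.06 beats 0.4 * 0.01 = 0.004
```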
57. The Noisy Channel Model
58. Relative Entropy: Kullback-Leibler Divergence
- For two pmfs p(x) and q(x), their relative entropy is:
- D(p||q) = Σ_x p(x) log2 (p(x) / q(x))
- The relative entropy, also called the Kullback-Leibler divergence, is a measure of how different two probability distributions are (over the same event space).
- The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on the not-quite-right distribution q.
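A short sketch (ours); the two distributions are toy values chosen only to show the computation and the asymmetry:

```python
# D(p || q) = sum_x p(x) log2(p(x)/q(x))
from math import log2

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(round(kl(p, q), 4))   # ~0.085 bits wasted coding p with a code built for q
print(round(kl(q, p), 4))   # a different number: D is not symmetric
```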
59. Comments on Relative Entropy
- Goal: minimize the relative entropy D(p||q) to have a probabilistic model that is as accurate as possible.
- Conventions:
- 0 log 0 = 0
- p log (p/0) = ∞ (for p > 0)
- A distance? Not quite:
- not symmetric: D(p||q) ≠ D(q||p)
- does not satisfy the triangle inequality
- But it can be useful to think of it as a distance.
60. Mutual Information and Relative Entropy
- Random variables X, Y: joint p_{X,Y}(x,y), marginals p_X(x), p_Y(y)
- Mutual information (between two random variables X, Y):
- I(X,Y) = D(p(x,y) || p(x) p(y))
- I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X,
- or how much p(x,y) deviates from independence (p(x) p(y))
61. From Mutual Information to Entropy
- By how many bits does the knowledge of Y lower the entropy H(X)?
- I(X,Y) = Σ_x Σ_y p(x,y) log2 (p(x,y) / (p(y) p(x)))
-   // use p(x,y)/p(y) = p(x|y)
- = Σ_x Σ_y p(x,y) log2 (p(x|y) / p(x))
-   // use log(a/b) = log a − log b (a = p(x|y), b = p(x)), distribute the sums
- = Σ_x Σ_y p(x,y) log2 p(x|y) − Σ_x Σ_y p(x,y) log2 p(x)
-   // use the def. of H(X|Y) (left term) and Σ_{y∈Y} p(x,y) = p(x) (right term)
- = −H(X|Y) + (−Σ_{x∈Ω} p(x) log2 p(x))
-   // use the def. of H(X) (right term), swap the terms
- = H(X) − H(X|Y)   ... and by symmetry also = H(Y) − H(Y|X)
62. Jensen's Inequality
- Recall: f is convex on an interval (a,b) iff
- ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y)
- Jensen's inequality: for a distribution p(x), an r.v. X on Ω, and a convex f:
- f(Σ_{x∈Ω} p(x) x) ≤ Σ_{x∈Ω} p(x) f(x)
- Proof (idea): by induction on the number of basic outcomes
- start with |Ω| = 2:
- p(x1) f(x1) + p(x2) f(x2) ≥ f(p(x1) x1 + p(x2) x2) (⇐ def. of convexity)
- for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the def. of convexity (again).
63. Relative Entropy Inequality
- D(p||q) ≥ 0
- Proof:
- 0 = −log 1 = −log Σ_{x∈Ω} q(x) = −log Σ_{x∈Ω} (q(x)/p(x)) p(x)
- ...apply Jensen's inequality here (−log is convex)...
- ≤ Σ_{x∈Ω} p(x) (−log(q(x)/p(x))) = Σ_{x∈Ω} p(x) log(p(x)/q(x)) = D(p||q)
64. Other Entropy Facts
- Log sum inequality: for r_i, s_i ≥ 0:
- Σ_{i=1..n} r_i log(r_i/s_i) ≥ (Σ_{i=1..n} r_i) log(Σ_{i=1..n} r_i / Σ_{i=1..n} s_i)
- D(p||q) is convex in (p,q) (⇐ log sum inequality)
- H(p_X) ≤ log2 |Ω|, where Ω is the sample space of p_X
- Proof: for the uniform u(x) on the same sample space Ω, Σ p(x) log2 u(x) = −log2 |Ω|; then log2 |Ω| − H(X) = −Σ p(x) log2 u(x) + Σ p(x) log2 p(x) = D(p||u) ≥ 0
- H(p) is concave in p
- Proof: from H(X) = log2 |Ω| − D(p||u); D(p||u) is convex ⇒ H(X) is concave
65. Entropy and Language
- Entropy is a measure of uncertainty. The more we know about something, the lower the entropy.
- If a language model captures more of the structure of the language than another model, then its entropy should be lower.
- Entropy can be thought of as a matter of how surprised we will be to see the next word given the previous words we have already seen.
66. The Relation to Language: Cross Entropy
- We can use pointwise entropy as a measure of surprise:
- H(w|h) = −log2 m(w|h), where w is the next word and h is the history of previously seen words.
- The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by:
- H(X, q) = H(X) + D(p||q) = H_p(q) = −Σ_x p(x) log2 q(x)
- Cross entropy can help us find out what our average surprise for the next word is.
67. Cross Entropy
- H(X) + D(p||q) = the number of bits needed to encode outcomes of p if a code based on q is used.
68. Cross Entropy
- Typical case: we've got a series of observations T = {t1, t2, t3, t4, ..., tn} (numbers, words, ...; ti ∈ Ω)
- estimate (simple):
- ∀y ∈ Ω: p̃(y) = C(y) / |T|, where C(y) = |{t ∈ T : t = y}|
- ...but the true p is unknown: every sample is too small!
- Natural question: how well do we do using p̃ instead of p?
- Idea: simulate the actual p by using a different T'
- (or rather, by using different observations we simulate the insufficiency of T vs. some other data (a 'random' difference))
69. Conditional Cross Entropy
- So far: unconditional distribution(s) p(x), p̃(x), ...
- In practice: we are virtually always conditioning on context
- Interested in: sample space Y, r.v. Y, y ∈ Y
- context: sample space Ω, r.v. X, x ∈ Ω
- our distribution p̃(y|x); test it against p'(y,x), which is taken from some independent data
- H_{p'}(p̃) = −Σ_{y∈Y, x∈Ω} p'(y,x) log2 p̃(y|x)
70. Sample Space vs. Data
- In practice, it is inconvenient to sum over the sample space(s) Y, Ω.
- Use the following formula instead:
- H_{p'}(p̃) = −Σ_{y∈Y, x∈Ω} p'(y,x) log2 p̃(y|x) ≅ −(1/|T'|) Σ_{i=1..|T'|} log2 p̃(y_i|x_i), where T' is the test data
- This is in fact the normalized log probability of the test data:
- H_{p'}(p̃) = −(1/|T'|) log2 Π_{i=1..|T'|} p̃(y_i|x_i)
71. Computation Example
- Ω = {a, b, ..., z}, probability distribution (assumed/estimated from data):
- p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, and p(α) = 0 for α ∈ {s, t, u, v, w, x, y, z}
- Data (test): 'barb'; empirical distribution p': p'(a) = p'(r) = .25, p'(b) = .5
- Sum over Ω:
-   α:                  a    b    c   d  ...  q    r    s  ...  z
-   −p'(α) log2 p(α):  .5   .5    0   0  ...  0   1.5   0  ...  0    → sum = 2.5
- Sum over the data:
-   i / s_i:            1/b  2/a  3/r  4/b
-   −log2 p(s_i):        1    2    6    1    → (1/|T'|) Σ = (1/4) · 10 = 2.5
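A sketch (ours) that reproduces these numbers via the normalized negative log probability of the test data:

```python
# Cross entropy of test strings under the model p from the slide
from math import log2

p = {"a": 0.25, "b": 0.5}
p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})   # s..z implicitly 0

def cross_entropy(data, model):
    return -sum(log2(model[ch]) for ch in data) / len(data)

print(cross_entropy("barb", p))       # 2.5 bits
print(cross_entropy("probable", p))   # 4.25 bits
print(cross_entropy("abba", p))       # 1.5 bits
# "baby" would break: 'y' has probability 0, so -log2(0) is infinite (the (??) on the next slide)
```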
72. Cross Entropy: Some Observations
- Is H(p) <, =, or > H_{p'}(p)? ALL three are possible!
- Previous example:
- p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest: s, t, u, v, w, x, y, z
- H(p) = 2.5 bits = H_{p'}(p) ('barb')
- Other data: 'probable': (1/8)(6+6+6+1+2+1+6+6) = 4.25
- H(p) = 2.5 bits < 4.25 bits = H_{p'}(p) ('probable')
- And finally 'abba': (1/4)(2+1+1+2) = 1.5
- H(p) = 2.5 bits > 1.5 bits = H_{p'}(p) ('abba')
- But what about 'baby'? −p'('y') log2 p('y') = −.25 log2 0 = ∞ (??)
73. Cross Entropy: Usage
- Comparing distributions vs. real data:
- We have two distributions p and q (on some Ω, X)
- Which is better?
- The better one has lower cross entropy on real data S.
- Real data S:
- H_S(p) = −(1/|S|) Σ_{i=1..|S|} log2 p(y_i|x_i)  vs.  H_S(q) = −(1/|S|) Σ_{i=1..|S|} log2 q(y_i|x_i)
74. Comparing Distributions
- Test data S: 'probable'
- p(.) from the previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest (s, t, u, v, w, x, y, z) → H_S(p) = 4.25
- q(.|.) is a conditional distribution defined by a table (not reproduced here); e.g., q(o|r) = 1, q(r|p) = .125
- H_S(q) = −(1/8)(log2 q(p|oth.) + log2 q(r|p) + log2 q(o|r) + log2 q(b|o) + log2 q(a|b) + log2 q(b|a) + log2 q(l|b) + log2 q(e|l))
-        = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
- So q wins: H_S(q) = .625 < 4.25 = H_S(p)
75. Entropy of a Language
- Imagine that we produce the next letter using
- p(l_{n+1} | l_1, ..., l_n),
- where l_1, ..., l_n is the sequence of all the letters uttered so far (i.e., n is really big!); let's call l_1, ..., l_n the history h
- Then compute its entropy:
- −Σ_{h∈H} Σ_{l∈A} p(l,h) log2 p(l|h)
- Not very practical, is it?
76. The Entropy of English
- We can model English using n-gram models (also known as Markov chains).
- These models assume limited memory, i.e., we assume that the next word depends only on the previous k ones (a kth-order Markov approximation).
- What is the entropy of English?
- First order: 4.03 bits
- Second order: 2.8 bits
- Shannon's experiment: 1.3 bits
77. Perplexity
- A measure related to the notion of cross entropy and used in the speech recognition community is called the perplexity.
- Perplexity(x_1n, m) = 2^{H(x_1n, m)} = m(x_1n)^{−1/n}
- A perplexity of k means that you are as surprised on average as you would have been if you had to guess between k equiprobable choices at each step.
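A last sketch (ours), tying perplexity back to the 'barb' cross-entropy example from slide 71:

```python
# Perplexity = 2^(cross entropy) = m(x_1..n)^(-1/n)
from math import log2

def perplexity(data, model):
    cross_entropy = -sum(log2(model[ch]) for ch in data) / len(data)
    return 2 ** cross_entropy

p = {"a": 0.25, "b": 0.5}
p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})

print(round(perplexity("barb", p), 3))   # 5.657: like guessing among ~5.7 equally likely choices
```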