1
Lecture 3: Basic Probability (Chapter 2 of Manning and Schütze)
Wen-Hsiang Lu
Department of Computer Science and Information Engineering, National Cheng Kung University
2004/09/29
2
Motivation
  • Statistical NLP aims to do statistical inference.
  • Statistical inference consists of taking some
    data (generated according to some unknown
    probability distribution) and then making some
    inferences about this distribution.
  • An example of statistical inference is the task
    of language modeling, namely predicting the next
    word given a window of previous words. To do
    this, we need a model of the language.
  • Probability theory helps us to find such a model.

3
Probability Terminology
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial (e.g., tossing a
    coin twice).
  • The collection of basic outcomes (or sample
    points) for our experiment is called the sample
    space.
  • An event is a subset of the sample space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1, certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample
    space.

4
Experiments and Sample Spaces
  • The set of possible basic outcomes of an experiment is the sample space Ω
  • coin toss (Ω = {head, tail})
  • tossing a coin 2 times (Ω = {HH, HT, TH, TT})
  • die roll (Ω = {1, 2, 3, 4, 5, 6})
  • missing word (|Ω| ≈ vocabulary size)
  • Discrete (countable) versus continuous (uncountable)
  • Every observation/trial is a basic outcome or sample point.
  • An event A is a set of basic outcomes with A ⊆ Ω, and all A ∈ 2^Ω (the event space)
  • Ω is then the certain event, ∅ the impossible event

5
Events and Probability
  • The probability of event A is denoted p(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
  • Example experiment: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • cases with two or more tails: A = {HTT, THT, TTH, TTT}
  • P(A) = |A| / |Ω| = 1/2 (assuming a uniform distribution)
  • all heads: A = {HHH}
  • P(A) = |A| / |Ω| = 1/8 (see the sketch below)
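A minimal Python sketch, added here for illustration, that enumerates this sample space and checks both probabilities under the uniform-distribution assumption:

    from itertools import product

    # All 8 equally likely outcomes of three coin tosses: HHH, HHT, ..., TTT
    omega = ["".join(t) for t in product("HT", repeat=3)]

    two_or_more_tails = [o for o in omega if o.count("T") >= 2]
    all_heads = [o for o in omega if o == "HHH"]

    print(len(two_or_more_tails) / len(omega))  # 0.5
    print(len(all_heads) / len(omega))          # 0.125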

6
Probability Properties
  • Basic properties
  • P: 2^Ω → [0, 1]
  • P(Ω) = 1
  • For disjoint events: P(∪i Ai) = Σi P(Ai)
  • NB: the axiomatic definition of probability takes the above three conditions as axioms
  • Immediate consequences
  • P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)
  • Σa∈Ω P(a) = 1

7
Joint Probability
  • Joint probability of A and B: P(A,B) = P(A ∩ B)
  • A two-dimensional table (A × B) with a value in every cell giving the probability of that specific pair occurring.

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
8
Conditional Probability
  • Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful. If we know that event B is true, then we can determine the probability that A is true given this knowledge: P(A|B) = P(A,B) / P(B)

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
9
Conditional and Joint Probabilities
  • P(A|B) = P(A,B)/P(B) ⇒ P(A,B) = P(A|B) P(B)
  • P(B|A) = P(A,B)/P(A) ⇒ P(A,B) = P(B|A) P(A)
  • Chain rule: P(A1, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1, ..., An−1)

10
Bayes Rule
  • Since P(A,B) = P(B,A), P(A ∩ B) = P(B ∩ A), and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
  • P(A|B) = P(A,B)/P(B) = P(B|A) P(A) / P(B)
  • P(B|A) = P(A,B)/P(A) = P(A|B) P(B) / P(A)

(Venn diagram: sample space Ω with events A and B overlapping in A ∩ B)
11
Example
  • S = have a stiff neck, M = have meningitis
  • P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
  • I have a stiff neck; should I worry? (worked out in the sketch below)
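The slide leaves the question open; the following Python sketch simply applies Bayes' rule from the previous slide to the numbers given above:

    # Bayes' rule: P(M|S) = P(S|M) * P(M) / P(S)
    p_s_given_m = 0.5
    p_m = 1 / 50_000
    p_s = 1 / 20

    p_m_given_s = p_s_given_m * p_m / p_s
    print(p_m_given_s)  # 0.0002, i.e. about 1 in 5000: a stiff neck alone is weak evidence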

12
Independence
  • Two events A and B are independent of each other if P(A) = P(A|B)
  • Examples: two coin tosses; today's weather and the weather on March 4th, 1789
  • If A and B are independent, then we can compute P(A,B) from P(A) and P(B) as
  • P(A,B) = P(A|B) P(B) = P(A) P(B)
  • Two events A and B are conditionally independent of each other given C if
  • P(A|C) = P(A|B,C)

13
A Golden Rule (of Statistical NLP)
  • If we are interested in which event B is most likely given A, we can use Bayes' rule and maximize over all B:
  • argmaxB P(B|A) = argmaxB P(A|B) P(B) / P(A) = argmaxB P(A|B) P(B)
  • P(A) is a normalizing constant that does not depend on B, so the denominator can be dropped from the maximization.

14
Random Variables (RV)
  • Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
  • An RV is a function X: Ω → Q
  • in general Q = R^n, typically Q = R
  • it is easier to handle real numbers than real-world events
  • An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}
  • We can define a probability mass function (pmf) for RV X that gives the probability it takes at different values:
  • pX(x) = P(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
  • often written just p(x) if it is clear from context what X is

15
Example
  • Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, ..., 11, 12} with the following pmf (derived in the sketch after this list):
  • P(X=2) = 1/36,
  • P(X=3) = 2/36,
  • P(X=4) = 3/36,
  • P(X=5) = 4/36,
  • P(X=6) = 5/36,
  • P(X=7) = 6/36,
  • P(X=8) = 5/36,
  • P(X=9) = 4/36,
  • P(X=10) = 3/36,
  • P(X=11) = 2/36,
  • P(X=12) = 1/36
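A small Python sketch, added for illustration, that derives this pmf by enumerating the 36 equally likely outcomes of the two dice:

    from collections import Counter
    from fractions import Fraction
    from itertools import product

    # Count how often each sum occurs among the 36 equally likely (d1, d2) pairs
    counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
    pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
    print(pmf)  # {2: 1/36, 3: 1/18, 4: 1/12, 5: 1/9, 6: 5/36, 7: 1/6, ..., 12: 1/36}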

16
Expectation and Variance
  • The expectation is the mean or average of an RV, defined as E(X) = Σx x p(x).
  • The variance of an RV is a measure of whether the values of the RV tend to vary over trials: Var(X) = E((X − E(X))²).
  • The standard deviation (σ) is the square root of the variance.

17
Examples
  • What is the expectation of the sum of the numbers on two dice?
    2 · P(X=2) = 2 · 1/36 = 1/18
    3 · P(X=3) = 3 · 2/36 = 3/18
    4 · P(X=4) = 4 · 3/36 = 6/18
    5 · P(X=5) = 5 · 4/36 = 10/18
    6 · P(X=6) = 6 · 5/36 = 15/18
    7 · P(X=7) = 7 · 6/36 = 21/18
    8 · P(X=8) = 8 · 5/36 = 20/18
    9 · P(X=9) = 9 · 4/36 = 18/18
    10 · P(X=10) = 10 · 3/36 = 15/18
    11 · P(X=11) = 11 · 2/36 = 11/18
    12 · P(X=12) = 12 · 1/36 = 6/18
    E(SUM) = 126/18 = 7
  • Or more simply:
  • E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  • E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = 1/6 · (1 + 2 + 3 + 4 + 5 + 6) = 21/6
  • Hence, E(SUM) = 21/6 + 21/6 = 7

18
Examples
  • Var(X) = E((X − E(X))²) = E(X² − 2X E(X) + E²(X)) = E(X²) − 2E(X E(X)) + E²(X) = E(X²) − 2E²(X) + E²(X) = E(X²) − E²(X)
  • E(SUM²) = 329/6 and E²(SUM) = 7² = 49
  • Hence, Var(SUM) = 329/6 − 294/6 = 35/6 (verified in the sketch below)
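A short Python check of these values, added for illustration; it recomputes E(SUM), E(SUM²) and Var(SUM) directly from the 36 outcomes:

    from fractions import Fraction
    from itertools import product

    outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
    p = Fraction(1, 36)                        # each of the 36 outcomes is equally likely

    e_sum = sum(x * p for x in outcomes)       # E(SUM)
    e_sum2 = sum(x * x * p for x in outcomes)  # E(SUM^2)
    var = e_sum2 - e_sum ** 2                  # Var = E(X^2) - E^2(X)

    print(e_sum, e_sum2, var)                  # 7, 329/6, 35/6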

19
Joint and Conditional Distributions for RVs
  • The joint pmf for two RVs X and Y is p(x,y) = P(X=x, Y=y)
  • Marginal pmfs are calculated as pX(x) = Σy p(x,y) and pY(y) = Σx p(x,y)
  • If X and Y are independent, then p(x,y) = pX(x) pY(y)
  • Define the conditional pmf using the joint distribution: pX|Y(x|y) = p(x,y) / pY(y) if pY(y) > 0
  • Chain rule:
  • p(w,x,y,z) = p(w) p(x|w) p(y|w,x) p(z|w,x,y)

20
Estimating Probability Functions
  • What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown, so P must be estimated from a sample of data.
  • An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times an outcome u occurs: f(u) = C(u) / N, where C(u) is the number of times u occurs in N trials.
  • For N → ∞, the relative frequency tends to stabilize around some number: the probability estimate.
  • Two different approaches:
  • Parametric (assume a distribution)
  • Non-parametric (distribution-free)

21
Parametric Methods
  • Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
  • If we assume an explicit probabilistic model of the process by which the data was generated, then determining a particular probability distribution within the family requires only the specification of a few parameters, which requires less training data (i.e., only a small number of parameters need to be estimated).

22
Non-parametric Methods
  • No assumption is made about the underlying
    distribution of the data.
  • For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
  • Given less prior information, more training data
    is needed.

23
Estimating Probability
  • Example: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • count cases with exactly two tails: A = {HTT, THT, TTH}
  • Run the experiment 1000 times (i.e., 3000 tosses)
  • Counted 386 cases with two tails (HTT, THT, or TTH)
  • Estimate of p(A) = 386 / 1000 = .386
  • Run again: 373, 399, 382, 355, 372, 406, 359
  • p(A) ≈ .379 (weighted average), or simply 3032 / 8000
  • Under the uniform-distribution assumption: p(A) = 3/8 = .375 (simulated in the sketch below)
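A hypothetical Python simulation of this experiment; the random seed and the number of runs are arbitrary choices made only for illustration:

    import random

    random.seed(0)  # arbitrary seed, for reproducibility of the illustration

    def run_block(n=1000):
        """Relative frequency of 'exactly two tails' in n three-toss experiments."""
        hits = sum(1 for _ in range(n)
                   if [random.choice("HT") for _ in range(3)].count("T") == 2)
        return hits / n

    print([run_block() for _ in range(8)])  # each estimate fluctuates around 3/8 = 0.375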

24
Standard Distributions
  • In practice, one commonly finds the same basic
    form of a probability mass function, but with
    different constants employed.
  • Families of pmfs are called distributions and the
    constants that define the different possible pmfs
    in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

25
Standard Distributions: Binomial
  • A series of trials with only two outcomes, 0 or 1, each trial being independent of all the others.
  • The number r of successes out of n trials, given that the probability of success in any trial is p:
  • b(r; n, p) = C(n, r) p^r (1 − p)^(n − r)
  • The binomial coefficient C(n, r) counts how many possibilities there are for choosing r objects out of n, i.e., n! / ((n − r)! r!)
26
Binomial Distribution
  • Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
  • Use it when counting whether something has a certain property or not (assuming independence).
  • Actually quite common in statistical NLP: e.g., look through a corpus to estimate the percentage of sentences that have a certain word in them, or how often a verb is used as transitive versus intransitive.
  • The expectation is n · p; the variance is n · p · (1 − p) (see the sketch below)
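A minimal Python sketch of the binomial pmf; the sentence counts and the probability 0.2 below are made up purely for illustration:

    from math import comb

    def binomial_pmf(r, n, p):
        """P(exactly r successes in n independent trials, success probability p)."""
        return comb(n, r) * p**r * (1 - p)**(n - r)

    # e.g. the chance that 3 of 10 sampled sentences contain a given word,
    # assuming (hypothetically) each sentence contains it with probability 0.2
    n, p = 10, 0.2
    print(binomial_pmf(3, n, p))      # ~0.201
    print(n * p, n * p * (1 - p))     # expectation 2.0, variance 1.6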

27
Standard Distributions: Normal
  • The normal (Gaussian) distribution is a continuous distribution with two parameters: the mean μ and the standard deviation σ. It is the standard normal if μ = 0 and σ = 1. (clustering)

(Figure: bell-shaped density curve over X, centered at the mean μ)
28
Frequentist Statistics
  • s = sequence of observations
  • M = model (a distribution plus parameters)
  • For a fixed model M, the maximum likelihood estimate chooses the parameter values that maximize the probability of the observed data s.
  • Probability expresses something about the world, with no prior belief.

29
Bayesian Statistics I: Bayesian Updating
  • Assume that the data are coming in sequentially
    and are independent.
  • Given an a-priori probability distribution, we
    can update our beliefs when a new datum comes in
    by calculating the Maximum A Posteriori (MAP)
    distribution.
  • The MAP probability becomes the new prior and the
    process repeats on each new datum.

30
Bayesian Statistics: MAP
  • The MAP (maximum a posteriori) estimate chooses the model that maximizes P(M|s) = P(s|M) P(M) / P(s), i.e., argmaxM P(s|M) P(M).
  • P(s) is a normalizing constant and can be dropped from the maximization.

31
Bayesian Statistics II: Bayesian Decision Theory
  • Bayesian Statistics can be used to evaluate which
    model or family of models better explains some
    data.
  • We define two different models of the event and
    calculate the likelihood ratio between these two
    models.

32
Bayesian Decision Theory
  • Suppose we have two models, M1 and M2; we want to evaluate which model better explains some new data.
  • Compute the likelihood ratio P(data|M1) P(M1) / (P(data|M2) P(M2)); if the ratio is greater than 1, then M1 is the more likely model, otherwise M2. (A sketch follows below.)
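A hypothetical Python sketch of such a comparison; the two coin models, the equal priors, and the observation string are all made up for illustration:

    def likelihood(p_heads, data):
        """P(data | model) for a sequence of independent coin tosses."""
        heads = data.count("H")
        return p_heads**heads * (1 - p_heads)**(len(data) - heads)

    data = "HHTHHHTHHH"                 # made-up observations (8 heads, 2 tails)
    prior_m1 = prior_m2 = 0.5           # assume equal priors

    # M1: fair coin (p = 0.5); M2: biased coin (p = 0.7)
    ratio = (likelihood(0.5, data) * prior_m1) / (likelihood(0.7, data) * prior_m2)
    print(ratio)  # ~0.19 < 1, so M2 explains this particular data better than M1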

33
Essential Information Theory
  • Developed by Shannon in the 1940s.
  • Goal is to maximize the amount of information
    that can be transmitted over an imperfect
    communication channel.
  • Wished to determine the theoretical maxima for
    data compression (entropy H) and transmission
    rate (channel capacity C).
  • If a message is transmitted at a rate slower than
    C, then the probability of transmission errors
    can be made as small as desired.

34
Entropy
  • X: a discrete RV with distribution p(X)
  • Entropy (or self-information) is the average uncertainty of a single RV. Let p(x) = P(X=x), where x ∈ X; then
  • H(X) = − Σx∈X p(x) log2 p(x)
  • In other words, entropy measures the amount of information in a random variable, measured in bits. It is also the average length of the message needed to transmit an outcome of that variable using the optimal code.
  • An optimal code sends a message of probability p(i) in ⌈−log2 p(i)⌉ bits.

35
Entropy (cont.)
  • H(X) = 0 if and only if the outcome is certain; i.e., when the value of X is determinate, it provides no new information.

36
Using the Formula: Examples
  • Toss a fair coin: Ω = {head, tail}
  • P(X=head) = .5, P(X=tail) = .5
  • H(X) = −0.5 log2(0.5) − 0.5 log2(0.5) = 1 bit
  • Take a fair, 32-sided die: p(x) = 1/32 for every side x
  • H(p) = −Σi=1..32 p(xi) log2 p(xi) = −32 · (1/32 · log2(1/32)) = 5 bits (since for all i, p(xi) = 1/32)
  • Unfair coin:
  • P(X=head) = .2 → H(X) = .722; P(X=head) = .01 → H(X) = .081
  • (These values are reproduced in the sketch below.)
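A small Python helper, added for illustration, that reproduces the entropy values quoted above:

    from math import log2

    def entropy(probs):
        """H(X) = -sum p(x) log2 p(x), treating 0 log 0 as 0."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))      # 1.0 bit  (fair coin)
    print(entropy([1 / 32] * 32))   # 5.0 bits (fair 32-sided die)
    print(entropy([0.2, 0.8]))      # ~0.722   (unfair coin)
    print(entropy([0.01, 0.99]))    # ~0.081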

37
Entropy of a Weighted Coin (one toss)
(Figure: H(X) as a function of P(head); it is 0 at P(head) = 0 or 1 and maximal, 1 bit, at P(head) = 0.5.)
38
The Limits
  • When is H(p) = 0?
  • if the result of an experiment is known ahead of time:
  • necessarily, ∃x ∈ Ω: p(x) = 1 and p(y) = 0 for all y ≠ x
  • Upper bound?
  • for |Ω| = n: H(p) ≤ log2 n
  • nothing can be more uncertain than the uniform distribution
  • Entropy increases with message length.

39
Coding Interpretation of Entropy
  • The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
  • Compression algorithms:
  • do well on data with repeating (i.e., easily predictable, low-entropy) patterns
  • their results have high entropy ⇒ compressing compressed data does nothing

40
Coding Example
  • Experience: some characters are more common, some (very) rare.
  • What if we use more bits for the rare and fewer bits for the frequent? Be careful: we want to be able to decode (easily)!
  • suppose p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, and for the rest p(x) ≈ .0004
  • code: 'a' → 00, 'b' → 01, 'c' → 10, rest: 11 b1 b2 b3 b4 b5 b6 b7 b8
  • code "acbbécbaac" → 00 10 01 01 11 00001111 10 01 00 00 10
    (a  c  b  b  é           c  b  a  a  c)
  • number of bits used: 28 (vs. 80 using a naive 8-bit uniform coding; see the sketch below)
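A Python sketch of this code; the 8-bit pattern chosen for a rare character (here its Latin-1 byte value) is one arbitrary choice, since the slide leaves the bits b1...b8 unspecified, but the total length comes out the same:

    def encode(text):
        """'a' -> 00, 'b' -> 01, 'c' -> 10, anything else -> 11 + 8 bits for the byte."""
        out = []
        for ch in text:
            if ch in "abc":
                out.append(format("abc".index(ch), "02b"))
            else:
                out.append("11" + format(ch.encode("latin-1")[0], "08b"))
        return "".join(out)

    bits = encode("acbbécbaac")
    print(len(bits))  # 28 bits, versus 10 * 8 = 80 bits with a plain 8-bit code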

41
Properties of Entropy I
  • Entropy is non-negative:
  • H(X) ≥ 0
  • Recalling that H(X) = − Σx∈X p(x) log2 p(x):
  • log2 p(x) is negative or zero for p(x) ≤ 1,
  • p(x) is non-negative, so p(x) log2 p(x) is negative or zero,
  • a sum of non-positive numbers is non-positive,
  • and −x is non-negative for non-positive x.

42
Joint Entropy
  • The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:
  • H(X,Y) = − Σx Σy p(x,y) log2 p(x,y)

43
Conditional Entropy
  • The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information we need to supply on average to communicate Y given that the other party knows X:
  • H(Y|X) = Σx p(x) H(Y|X=x) = − Σx Σy p(x,y) log2 p(y|x)
  • (Recall that H(X) = E(log2(1/pX(x))); here the weights p(x,y) are not conditional.)

44
Properties of Entropy II
  • Conditional entropy is "better" (i.e., never larger) than unconditional entropy:
  • H(Y|X) ≤ H(Y)
  • H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  • equality iff X, Y are independent
  • recall: X, Y independent iff p(X,Y) = p(X) p(Y)
  • H(p) is concave (remember the weighted-coin graph?)
  • a concave function f over an interval (a,b) satisfies f(λx + (1−λ)y) ≥ λ f(x) + (1−λ) f(y) for all x, y ∈ (a,b) and λ ∈ [0,1]
  • a function f is convex if −f is concave
  • for proofs and generalizations, see Cover and Thomas.

45
Chain Rule for Entropy
  • H(X,Y) = H(X) + H(Y|X); more generally, H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1)
  • The product (from the chain rule for probabilities) becomes a sum because of the log.

46
Entropy Rate
  • Because the amount of information contained in a
    message depends on its length, we may want to
    compare using entropy rate (the entropy per
    unit).
  • The entropy rate of a language is the limit of
    the entropy rate of a sample of language as the
    sample gets longer and longer.

47
Mutual Information
  • By the chain rule for entropy, we have H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
  • Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y)
  • I(X; Y) is called the mutual information between X and Y.
  • It is the reduction in uncertainty of one random variable due to knowing about another; in other words, the amount of information one random variable contains about another.

48
Relationship between I and H
(Figure: H(X,Y) decomposed into H(X|Y), I(X; Y), and H(Y|X); H(X) and H(Y) overlap in I(X; Y).)
49
Mutual Information (cont)
  • I(X; Y) is a symmetric, non-negative measure of the common information of two variables.
  • Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence:
  • I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X)
  • For two dependent variables, I grows not only with the degree of dependence but also with the entropy of the two variables.
  • H(X) = H(X) − H(X|X) = I(X; X): this is why entropy is called self-information.

50
Mutual Information (cont)
  • We can also derive conditional mutual information:
  • I(X; Y|Z) = I((X; Y)|Z) = H(X|Z) − H(X|Y,Z)
  • Chain rule:
  • I(X1..n; Y) = I(X1; Y) + I(X2; Y|X1) + ... + I(Xn; Y|X1, ..., Xn−1)
  • Don't confuse this with pointwise mutual information, which has some problems.

51
Mutual Information and Entropy
  • I(X,Y) is the number of bits by which the knowledge of Y lowers the entropy of X (and, by symmetry, vice versa):
  • I(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • Recall H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y) ⇒
  • I(X,Y) = H(X) + H(Y) − H(X,Y)
  • I(X,X) = H(X) − H(X|X) = H(X) (since H(X|X) = 0)
  • I(X,Y) = I(Y,X) (symmetry)
  • I(X,Y) ≥ 0
  • (A small numeric sketch follows below.)
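A small numeric sketch in Python using I(X,Y) = H(X) + H(Y) − H(X,Y); the joint table is invented purely for illustration:

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    # Hypothetical joint distribution of two binary variables
    joint = {("rain", "umbrella"): 0.3, ("rain", "no umbrella"): 0.1,
             ("sun", "umbrella"): 0.1, ("sun", "no umbrella"): 0.5}

    px, py = {}, {}
    for (x, y), p in joint.items():          # marginals
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p

    mi = H(px.values()) + H(py.values()) - H(joint.values())
    print(mi)  # ~0.256 bits > 0: the two variables are not independent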

52
The Noisy Channel Model
  • We want to optimize communication across a channel in terms of throughput and accuracy: the transmission of messages in the presence of noise in the channel.
  • There is a duality between compression (achieved
    by removing all redundancy) and transmission
    accuracy (achieved by adding controlled
    redundancy so that the input can be recovered in
    the presence of noise).

53
The Noisy Channel Model
  • Goal encode the message in such a way that it
    occupies minimal space while still containing
    enough redundancy to be able to detect and
    correct errors.

54
The Noisy Channel Model
  • Channel capacity: the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output. For a memoryless channel, C = max over p(X) of I(X; Y).
  • We reach a channel's capacity if we manage to design an input code X whose distribution p(X) maximizes the mutual information I between input and output.

55
Language and the Noisy Channel Model
  • In language we cannot control the encoding phase; we can only decode the output to give the most likely input. Determine the most likely input given the output!

(Diagram: input I → noisy channel p(o|i) → output O → decoder → most likely input Î)
56
The Noisy Channel Model
  • Î = argmaxi p(i|o) = argmaxi p(o|i) p(i), where p(i) is the language model and p(o|i) is the channel probability.
  • This is used in machine translation, optical character recognition, speech recognition, etc. (see the sketch below)
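A toy decoding sketch in Python; the candidate inputs and every probability below are made up for illustration and do not come from any real model:

    # Noisy channel decoding: pick the input i that maximizes p(i) * p(o | i),
    # e.g. guessing the intended word given the observed OCR output "form".
    language_model = {"form": 0.004, "from": 0.010, "farm": 0.002}   # p(i), made up
    channel = {"form": 0.80, "from": 0.10, "farm": 0.05}             # p(o="form" | i), made up

    best = max(language_model, key=lambda i: language_model[i] * channel[i])
    print(best)  # "form": 0.004 * 0.80 = 0.0032 beats 0.010 * 0.10 = 0.0010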

57
The Noisy Channel Model
58
Relative Entropy: Kullback-Leibler Divergence
  • For two pmfs p(x) and q(x), their relative entropy is
  • D(p || q) = Σx p(x) log2 (p(x) / q(x))
  • The relative entropy, also called the Kullback-Leibler divergence, is a measure of how different two probability distributions are (over the same event space).
  • The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from the distribution p with a code based on the not-quite-right distribution q. (See the sketch below.)
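A minimal Python sketch of the definition; the two distributions are made up simply to show that the measure is not symmetric:

    from math import log2

    def kl_divergence(p, q):
        """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
        return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

    p = {"a": 0.5, "b": 0.5}    # "true" distribution (made up)
    q = {"a": 0.9, "b": 0.1}    # model distribution (made up)
    print(kl_divergence(p, q))  # ~0.737
    print(kl_divergence(q, p))  # ~0.531: different, so D is not symmetric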

59
Comments on Relative Entropy
  • Goal: minimize the relative entropy D(p || q) to have a probabilistic model that is as accurate as possible.
  • Conventions:
  • 0 log 0 = 0
  • p log (p/0) = ∞ (for p > 0)
  • Is it a distance? Not quite:
  • it is not symmetric: D(p || q) ≠ D(q || p)
  • it does not satisfy the triangle inequality
  • But it can be useful to think of it as a distance.

60
Mutual Information and Relative Entropy
  • Random variables X, Y with joint pmf pX,Y(x,y) and marginals pX(x), pY(y)
  • Mutual information (between two random variables X, Y):
  • I(X,Y) = D(p(x,y) || p(x) p(y))
  • I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
  • or: how much p(x,y) deviates from independence (p(x) p(y))

61
From Mutual Information to Entropy
  • By how many bits does the knowledge of Y lower the entropy H(X)?
  • I(X,Y) = Σx Σy p(x,y) log2 (p(x,y) / (p(x) p(y)))
  • // use p(x,y)/p(y) = p(x|y)
  • = Σx Σy p(x,y) log2 (p(x|y) / p(x))
  • // use log(a/b) = log a − log b (a = p(x|y), b = p(x)), distribute the sums
  • = Σx Σy p(x,y) log2 p(x|y) − Σx Σy p(x,y) log2 p(x)
  • // use the def. of H(X|Y) (left term) and Σy∈Y p(x,y) = p(x) (right term)
  • = −H(X|Y) + (− Σx∈Ω p(x) log2 p(x))
  • // use the def. of H(X) (right term), swap terms
  • = H(X) − H(X|Y) ... and by symmetry also = H(Y) − H(Y|X)

62
Jensen's Inequality
  • Recall: f is convex on an interval (a,b) iff
  • ∀x, y ∈ (a,b), ∀λ ∈ [0,1]:
  • f(λx + (1−λ)y) ≤ λ f(x) + (1−λ) f(y)
  • Jensen's inequality: for a distribution p(x), an r.v. X on Ω, and a convex f,
  • f(Σx∈Ω p(x) x) ≤ Σx∈Ω p(x) f(x)
  • Proof (idea): by induction on the number of basic outcomes
  • start with |Ω| = 2:
  • p(x1) f(x1) + p(x2) f(x2) ≥ f(p(x1) x1 + p(x2) x2) (⇐ def. of convexity)
  • for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the def. of convexity (again).

63
Relative Entropy Inequality
  • D(p || q) ≥ 0
  • Proof:
  • 0 = −log 1 = −log Σx∈Ω q(x) = −log Σx∈Ω (q(x)/p(x)) p(x)
  • ... apply Jensen's inequality here (−log is convex) ...
  • ≤ Σx∈Ω p(x) (−log(q(x)/p(x))) = Σx∈Ω p(x) log(p(x)/q(x)) = D(p || q)

64
Other Entropy Facts
  • Log sum inequality: for ri, si ≥ 0,
  • Σi=1..n ri log(ri/si) ≥ (Σi=1..n ri) log(Σi=1..n ri / Σi=1..n si)
  • D(p || q) is convex in (p, q) (⇐ log sum inequality)
  • H(pX) ≤ log2 |Ω|, where Ω is the sample space of pX
  • Proof: for the uniform u(x) on the same sample space Ω, Σ p(x) log2 u(x) = −log2 |Ω|; then
  • log2 |Ω| − H(X) = −Σ p(x) log2 u(x) + Σ p(x) log2 p(x) = D(p || u) ≥ 0
  • H(p) is concave in p
  • Proof: from H(X) = log2 |Ω| − D(p || u); D(p || u) is convex, hence H(X) is concave.

65
Entropy and Language
  • Entropy is measure of uncertainty. The more we
    know about something the lower the entropy.
  • If a language model captures more of the
    structure of the language than another model,
    then its entropy should be lower.
  • Entropy can be thought of as a matter of how
    surprised we will be to see the next word given
    previous words we have already seen.

66
The Relation to Language: Cross Entropy
  • We can use pointwise entropy as a measure of surprise:
  • H(w|h) = −log2 m(w|h), where w is the next word and h is the history of previously seen words.
  • The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by:
  • H(X, q) = H(X) + D(p || q) = −Σx p(x) log2 q(x) = Hp(q)
  • Cross entropy can help us find out what our average surprise for the next word is.

67
Cross Entropy
  • H(X) + D(p || q) = the number of bits needed to encode p if the code is based on q.

68
Cross Entropy
  • Typical case: we have a series of observations T = {t1, t2, t3, t4, ..., tn} (numbers, words, ...; ti ∈ Ω)
  • A simple estimate:
  • ∀y ∈ Ω: p̂(y) = C(y) / |T|, where C(y) = |{t ∈ T : t = y}|
  • ... but the true p is unknown: every sample is too small!
  • Natural question: how well do we do using p̂ instead of p?
  • Idea: simulate the actual p by using different data T'
  • (or rather: by using different observations we simulate the insufficiency of T vs. some other data ("random" difference))

69
Conditional Cross Entropy
  • So far: unconditional distribution(s) p(x), p'(x), ...
  • In practice: we are virtually always conditioning on context.
  • Interested in: sample space Y, r.v. Y, y ∈ Y
  • context: sample space Ω, r.v. X, x ∈ Ω
  • our distribution p̂(y|x), tested against p'(y,x), which is taken from some independent data:
  • Hp'(p̂) = − Σy∈Y, x∈Ω p'(y,x) log2 p̂(y|x)

70
Sample Space vs. Data
  • In practice, it is inconvenient to sum over the sample space(s) Y, Ω.
  • Use the following formula over the test data T' instead:
  • Hp'(p̂) = − Σy∈Y, x∈Ω p'(y,x) log2 p̂(y|x) = − (1/|T'|) Σi=1..|T'| log2 p̂(yi|xi)
  • This is in fact the normalized log probability of the test data:
  • Hp'(p̂) = − (1/|T'|) log2 Πi=1..|T'| p̂(yi|xi)

71
Computation Example
  • Ω = {a, b, ..., z}; probability distribution (assumed/estimated from data):
    p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, and 0 for s, t, u, v, w, x, y, z
  • Data (test): "barb"; empirical distribution p'(a) = p'(r) = .25, p'(b) = .5
  • Sum over Ω: the only non-zero terms −p'(α) log2 p(α) are
    a: .25 · 2 = .5,  b: .5 · 1 = .5,  r: .25 · 6 = 1.5  (all other letters contribute 0)
    total: Hp'(p) = .5 + .5 + 1.5 = 2.5
  • Sum over the data instead:
    i / si:       1/b  2/a  3/r  4/b
    −log2 p(si):  1    2    6    1     → sum = 10, and (1/|T'|) · 10 = (1/4) · 10 = 2.5
  • (The sketch below reproduces this computation.)
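A short Python sketch, added for illustration, that reproduces these numbers with the per-item formula (summing over the data rather than over Ω):

    from math import log2

    # The model p from the slide: p(a) = .25, p(b) = .5, p(c)..p(r) = 1/64, rest 0
    p = {"a": 0.25, "b": 0.5}
    p.update({ch: 1 / 64 for ch in "cdefghijklmnopqr"})

    def cross_entropy(text, model):
        """-(1/|T|) * sum over the data of log2 model(t_i)."""
        return -sum(log2(model[ch]) for ch in text) / len(text)

    print(cross_entropy("barb", p))      # 2.5 bits
    print(cross_entropy("probable", p))  # 4.25 bits
    # cross_entropy("baby", p) has no finite answer: p(y) = 0, so -log2(0) is infinite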


72
Cross Entropy: Some Observations
  • How does H(p) compare to Hp'(p): <, =, or >? All are possible!
  • Previous example:
  • p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest: s, t, u, v, w, x, y, z
  • H(p) = 2.5 bits = Hp'(p) (test data "barb")
  • Other data: "probable": (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  • H(p) = 2.5 bits < 4.25 bits = Hp'(p) (test data "probable")
  • And finally, "abba": (1/4)(2 + 1 + 1 + 2) = 1.5
  • H(p) = 2.5 bits > 1.5 bits = Hp'(p) (test data "abba")
  • But what about "baby"? −p'(y) log2 p(y) = −.25 · log2 0 = ∞ (!)

73
Cross Entropy: Usage
  • Comparing distributions using real data:
  • We have two distributions p and q (on some Ω, X); which is better?
  • The better one has the lower cross entropy on real test data S:
  • HS(p) = − (1/|S|) Σi=1..|S| log2 p(yi|xi)  vs.  HS(q) = − (1/|S|) Σi=1..|S| log2 q(yi|xi)

74
Comparing Distributions
  • Test data S: "probable"
  • p(.) from the previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, 0 for the rest
    HS(p) = 4.25
  • q(.|.) (conditional, defined by a table), e.g., q(o|r) = 1, q(r|p) = .125
  • HS(q) = −(1/8)(log2 q(p|oth.) + log2 q(r|p) + log2 q(o|r) + log2 q(b|o) + log2 q(a|b) + log2 q(b|a) + log2 q(l|b) + log2 q(e|l))
    = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
  • So q is the better model on this test data: HS(q) = .625 < HS(p) = 4.25
75
Entropy of a Language
  • Imagine that we produce the next letter using
  • p(ln+1 | l1, ..., ln),
  • where l1, ..., ln is the sequence of all the letters uttered so far (i.e., n is really big!); let's call l1, ..., ln the history h
  • Then compute its entropy:
  • − Σh∈H Σl∈A p(l, h) log2 p(l | h)
  • Not very practical, is it?

76
The Entropy of English
  • We can model English using n-gram models (also known as Markov chains).
  • These models assume limited memory, i.e., we assume that the next word depends only on the previous k words (a kth-order Markov approximation).
  • What is the entropy of English?
  • First order: 4.03 bits
  • Second order: 2.8 bits
  • Shannon's experiment: 1.3 bits

77
Perplexity
  • A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity:
  • Perplexity(x1..n, m) = 2^H(x1..n, m) = m(x1..n)^(−1/n)
  • A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step. (See the sketch below.)
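A minimal Python sketch of the relationship between cross entropy and perplexity; the probability lists are made up for illustration:

    from math import log2

    def perplexity(probs):
        """probs: the model's probability for each observed item of the test data."""
        cross_entropy = -sum(log2(p) for p in probs) / len(probs)
        return 2 ** cross_entropy

    print(perplexity([1 / 8] * 20))         # 8.0: like guessing among 8 equiprobable choices
    print(perplexity([0.5, 0.25, 0.125]))   # 4.0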