Chapter 7: Mathematical Foundations (presentation transcript)
1
Chapter 7: Mathematical Foundations
2
Notions of Probability Theory
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial.
  • The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω (omega).
  • An event is a subset of the sample space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1 certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample space.

3
Example
  • An experiment: a fair coin is tossed 3 times. What is the chance of 2 heads?
  • Sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Probability function: the uniform distribution, so P(basic outcome) = 1/8
  • The event of interest (a subset of Ω): the chance of getting 2 heads when tossing 3 coins, A = {HHT, HTH, THH}
  • P(A) = 3/8
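The calculation on this slide can be checked in a few lines of Python; a minimal sketch (the variable names are mine, not the slides'):

```python
from itertools import product

# Sample space: all sequences of three fair coin tosses (uniform distribution).
omega = list(product("HT", repeat=3))        # 8 basic outcomes
p_outcome = 1 / len(omega)                   # P(basic outcome) = 1/8

# Event A: exactly two heads.
A = [w for w in omega if w.count("H") == 2]  # HHT, HTH, THH
print(len(A) * p_outcome)                    # P(A) = 3/8
```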
4
Conditional Probability
  • Conditional probabilities measure the probability
    of events given some knowledge.
  • Prior probabilities measure the probabilities of
    events before we consider our additional
    knowledge.
  • Posterior probabilities are probabilities that
    result from using our additional knowledge.

Example: toss a fair coin 3 times (1st, 2nd, 3rd).
Event B: the 1st toss is H. Event A: exactly 2 Hs among the 1st, 2nd and 3rd tosses.
P(A|B) = P(A ∩ B) / P(B)
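A small sketch of the definition P(A|B) = P(A ∩ B) / P(B) on the same three-toss sample space; the numeric answer follows from the uniform-distribution assumption above:

```python
from itertools import product

omega = set(product("HT", repeat=3))         # three fair coin tosses
A = {w for w in omega if w.count("H") == 2}  # exactly two heads
B = {w for w in omega if w[0] == "H"}        # the first toss is heads

p = lambda event: len(event) / len(omega)    # uniform distribution
print(p(A & B) / p(B))                       # P(A|B) = (2/8) / (4/8) = 0.5
```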
5
(Venn diagram of events A and B)
The multiplication rule: P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
The chain rule: P(A1 ∩ A2 ∩ ... ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ... P(An|A1 ∩ ... ∩ An-1)
(used in Markov models, ...)
6
Independence
  • The chain rule relates intersections to conditional probabilities (important in NLP).
  • Independence and conditional independence of events are two very important notions in statistics.
  • Independence: P(A ∩ B) = P(A) P(B)
  • Conditional independence: P(A ∩ B | C) = P(A|C) P(B|C)

7
Bayes' Theorem
  • Bayes' theorem lets us swap the order of dependence between events:
    P(B|A) = P(A|B) P(B) / P(A).
    This is important when the former quantity is difficult to determine.
  • P(A) is a normalizing constant.

8
An Application
  • Pick the best conclusion c given some evidence e:
  • (1) evaluate the probability P(c|e)   <--- unknown
  • (2) select the c with the largest P(c|e).
  • P(c) and P(e|c) are known, so by Bayes' theorem P(c|e) = P(e|c) P(c) / P(e) ∝ P(e|c) P(c).
  • Example:
  • the relative (prior) probability of a disease, and
  • how often a symptom is associated with it.

9
Bayes' Theorem
(Venn diagram of events A and B)
10
Bayes' Theorem (general form)
If the sets B1, ..., Bn partition A, i.e., Bi ∩ Bj = ∅ for i ≠ j and A ⊆ B1 ∪ ... ∪ Bn, and P(A) > 0, then
P(Bj | A) = P(A | Bj) P(Bj) / Σ_i P(A | Bi) P(Bi)
(used in the noisy channel model)
11
An Example
  • A parasitic gap occurs once in 100,000 sentences.
  • A complicated pattern matcher attempts to
    identify sentences with parasitic gaps.
  • The answer is positive with probability 0.95 when
    a sentence has a parasitic gap, and the answer is
    positive with probability 0.005 when it has no
    parasitic gap.
  • When the test says that a sentence contains a parasitic gap, what is the probability that this is true?

P(G) = 0.00001, P(¬G) = 0.99999,
P(T|G) = 0.95, P(T|¬G) = 0.005
P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|¬G) P(¬G)] ≈ 0.002
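The slide's numbers are enough to reproduce the posterior; a minimal sketch of the Bayes computation (variable names are mine):

```python
# G: a sentence has a parasitic gap; T: the pattern matcher answers "yes".
p_g = 0.00001            # P(G)
p_t_given_g = 0.95       # P(T | G)
p_t_given_not_g = 0.005  # P(T | not G)

# Bayes' theorem, with {G, not G} partitioning the space in the denominator.
p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)
p_g_given_t = p_t_given_g * p_g / p_t
print(round(p_g_given_t, 4))   # roughly 0.002: a positive answer is still usually wrong
```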
12
Random Variables
  • A random variable is a function X: Ω (the sample space) → R^n.
  • It lets us talk about probabilities in terms of numeric values related to the event space.
  • A discrete random variable is a function X: Ω → S, where S is a countable subset of R.
  • If X: Ω → {0, 1}, then X is called a Bernoulli trial.
  • The probability mass function (pmf) for a random variable X gives the probability that the random variable takes its different numeric values:
    pmf: p(x) = p(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}, and Σ_x p(x) = 1.
13
Example: toss two dice and sum their faces.
Ω = {(1,1), (1,2), ..., (1,6), (2,1), (2,2), ..., (6,1), (6,2), ..., (6,6)}, S = {2, 3, 4, ..., 12}, X: Ω → S
pmf: p(3) = p(X = 3) = P(A3) = P({(1,2), (2,1)}) = 2/36, where A3 = {ω ∈ Ω : X(ω) = 3} = {(1,2), (2,1)}
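A sketch that builds this pmf by enumerating the 36 equally likely outcomes, under the slide's fair-dice assumption:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes
X = sum                                        # random variable: sum of the two faces

counts = Counter(X(w) for w in omega)
pmf = {x: Fraction(c, len(omega)) for x, c in sorted(counts.items())}
print(pmf[3])               # 1/18, i.e. 2/36
print(sum(pmf.values()))    # 1: the probability mass sums to one
```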
14
Expectation
  • The expectation is the mean (μ, mu) or average of a random variable: E(X) = Σ_x x p(x).
  • Example: roll one die and let Y be the value on its face. Then E(Y) = Σ_{y=1..6} y (1/6) = 21/6 = 3.5.

E(aY + b) = a E(Y) + b;  E(X + Y) = E(X) + E(Y)
15
Variance
  • The variance (σ², sigma squared) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot:
    Var(X) = E((X - E(X))²) = E(X²) - E²(X)
  • The standard deviation (σ) is the square root of the variance.

16
X: a random variable that is the sum of the numbers on two dice,
i.e., toss two dice and sum their faces, so X = Y + Y', where Y (and Y') is the value of tossing one die.
E(X) = E(Y) + E(Y') = 3.5 + 3.5 = 7;  Var(X) = Var(Y) + Var(Y') = 35/12 + 35/12 = 35/6
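A sketch that checks these expectation and variance values numerically; everything follows from the fair-die assumption, nothing extra is taken from the slides:

```python
from itertools import product
from fractions import Fraction

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expectation(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())

die = {y: Fraction(1, 6) for y in range(1, 7)}           # Y: one fair die
dice_sum = {}                                            # X = Y + Y'
for a, b in product(range(1, 7), repeat=2):
    dice_sum[a + b] = dice_sum.get(a + b, 0) + Fraction(1, 36)

print(expectation(die), variance(die))            # 7/2 and 35/12
print(expectation(dice_sum), variance(dice_sum))  # 7 and 35/6 (= 2 * Var(Y))
```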
17
Joint and Conditional Distributions
  • More than one random variable can be defined over
    a sample space. In this case, we talk about a
    joint or multivariate probability distribution.
  • The joint probability mass function for two discrete random variables X and Y is
    p(x, y) = P(X = x, Y = y).
  • The marginal probability mass function totals up the probability masses for the values of each variable separately:
    pX(x) = Σ_y p(x, y),  pY(y) = Σ_x p(x, y).
  • If X and Y are independent, then p(x, y) = pX(x) pY(y).

18
Joint and Conditional Distributions
  • Similar intersection rules hold for joint distributions as for events:
    p(x, y) = pX(x) p(y|x),  p(y|x) = p(x, y) / pX(x).
  • Chain rule in terms of random variables:
    p(w, x, y, z) = p(w) p(x|w) p(y|w, x) p(z|w, x, y).

19
Estimating Probability Functions
  • What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown → P must be estimated from a sample of data.
  • An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs.
  • Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
  • If no such assumption can be made, we must use a non-parametric or distribution-free approach.

20
parametric approach
  • Select an explicit probabilistic model.
  • Specify a few parameters to determine a particular probability distribution.
  • The amount of training data required is not great, and the amount needed to make good probability estimates can be calculated.

21
Standard Distributions
  • In practice, one commonly finds the same basic
    form of a probability mass function, but with
    different constants employed.
  • Families of pmfs are called distributions and the
    constants that define the different possible pmfs
    in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

22
Binomial distribution
  • A series of trials with only two outcomes, each trial independent of all the others.
  • The number r of successes out of n trials, given that the probability of success in any trial is p:
    b(r; n, p) = C(n, r) p^r (1 - p)^(n - r), where n and p are the parameters and r is the variable.

expectation: np    variance: np(1 - p)
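A sketch of the binomial pmf and its moments; the parameter values n = 10 and p = 0.5 are my illustrative choices, not the slide's:

```python
from math import comb

def binom_pmf(r, n, p):
    """Probability of r successes in n independent trials, each with success prob p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.5                      # illustrative parameter choices
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
print(round(mean, 6), round(var, 6))   # np = 5.0 and np(1 - p) = 2.5
```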
23
(No Transcript)
24
The normal distribution
  • Two parameters: the mean μ and the standard deviation σ; the curve is given by
    n(x; μ, σ) = (1 / (σ √(2π))) e^(-(x - μ)² / (2σ²))
  • Standard normal distribution: μ = 0, σ = 1

25
(No Transcript)
26
Bayesian Statistics I: Bayesian Updating
  • frequentist statistics vs. Bayesian statistics
  • Toss a coin 10 times and get 8 heads → 8/10 (the maximum likelihood estimate)
  • But 8 heads out of 10 just happens sometimes, given a small sample.
  • Assume that the data are coming in sequentially and are independent.
  • Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
  • The MAP probability becomes the new prior and the process repeats on each new datum.

27
Frequentist Statistics
μm: the model that asserts P(head) = m
s: a particular sequence of observations yielding i heads and j tails
P(s | μm) = m^i (1 - m)^j
Find the MLE (maximum likelihood estimate) by differentiating this polynomial: m = i / (i + j).
8 heads and 2 tails → 8 / (8 + 2) = 0.8
28
Bayesian Statistics
a priori probability distribution: our belief in the fairness of the coin (e.g., that it is a regular, fair one)
s: a particular sequence of observations, i heads and j tails
New belief in the fairness of the coin: the posterior P(μm | s) ∝ P(s | μm) P(μm), which becomes the new prior
(evaluated on the slide for i = 8, j = 2)
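A minimal sketch of MAP updating over a grid of coin-bias models. The prior used here, proportional to m(1 - m) and so favouring fair coins, is an illustrative assumption; the slide's own prior appears only in its figure:

```python
# Grid-based Bayesian (MAP) updating for coin-bias models mu_m.
def normalize(weights):
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

models = [m / 100 for m in range(1, 100)]            # candidate values of P(head)
prior = normalize({m: m * (1 - m) for m in models})  # assumed prior: coins tend to be fair

def update(prior, heads, tails):
    """Posterior over the models after observing `heads` heads and `tails` tails."""
    posterior = {m: prior[m] * m**heads * (1 - m)**tails for m in prior}
    return normalize(posterior)

posterior = update(prior, heads=8, tails=2)
map_estimate = max(posterior, key=posterior.get)
print(map_estimate)   # 0.75 under this prior: pulled toward 0.5 compared with the MLE of 0.8
```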
29
Bayesian Statistics II Bayesian Decision Theory
  • Bayesian Statistics can be used to evaluate which
    model or family of models better explains some
    data.
  • We define two different models of the event and
    calculate the likelihood ratio between these two
    models.

30
Entropy
  • The entropy is the average uncertainty of a
    single random variable.
  • Let p(x) = P(X = x), where x ∈ X.
  • H(p) = H(X) = - Σ_{x∈X} p(x) log2 p(x)
  • In other words, entropy measures the amount of
    information in a random variable. It is normally
    measured in bits.

Example: toss two coins and count the number of heads: p(0) = 1/4, p(1) = 1/2, p(2) = 1/4, so H(X) = 1.5 bits.
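A sketch of the entropy formula applied to the two-coin example just given:

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

heads = {0: 1/4, 1: 1/2, 2: 1/4}   # number of heads when tossing two coins
print(entropy(heads))              # 1.5 bits
```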
31
Example
  • Roll an 8-sided die: H(X) = - Σ_{i=1..8} (1/8) log2(1/8) = 3 bits
  • Entropy (another view): the average length of the message needed to transmit an outcome of that variable, e.g. the 3-bit code
    1: 001, 2: 010, 3: 011, 4: 100, 5: 101, 6: 110, 7: 111, 8: 000
  • Optimal code: send a message of probability p(i) in -log2 p(i) bits

32
Example
  • Problem: send a friend a message that is a number from 0 to 3.
  • How long a message must you send? (in terms of number of bits)
  • With four equiprobable messages, 2 bits suffice.
  • Example: watch a house with two occupants.

33
  • Variable-length encoding
  • Code tree: (1) all messages are handled; (2) it is clear when one message ends and the next starts (a prefix code).
  • Fewer bits for more frequent messages; more bits for less frequent messages.

0-No occupants
10-both occupants
110-First occupant
111-Second occupant
34
  • W: random variable for a message
  • V(W): the set of possible messages
  • P: probability distribution
  • Lower bound on the average number of bits needed to encode such a message:
    H(W) = - Σ_{w∈V(W)} P(w) log2 P(w)  (the entropy of the random variable)
35
  • (1) Entropy of a message: a lower bound for the average number of bits needed to transmit that message.
  • (2) Encoding method: use ⌈-log2 P(w)⌉ bits for message w.
  • Entropy (another view): a measure of the uncertainty about what a message says.
  • Fewer bits for a more certain message; more bits for a less certain message.

36
(No Transcript)
37
Simplified Polynesian
  • Letter probabilities: p: 1/8, t: 1/4, k: 1/8, a: 1/4, i: 1/8, u: 1/8
  • Per-letter entropy: H = - Σ P(letter) log2 P(letter) = 2.5 bits
  • A code that achieves this: p: 100, t: 00, k: 101, a: 01, i: 110, u: 111

Fewer bits are used to send more frequent letters.
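A sketch reproducing the per-letter entropy and the expected code length from the distribution above:

```python
from math import log2

letters = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}

per_letter_entropy = -sum(p * log2(p) for p in letters.values())
print(per_letter_entropy)        # 2.5 bits per letter

# The slide's code (t: 00, a: 01, p: 100, k: 101, i: 110, u: 111) achieves this
# bound: its expected length equals the entropy.
code_lengths = {"p": 3, "t": 2, "k": 3, "a": 2, "i": 3, "u": 3}
print(sum(letters[c] * code_lengths[c] for c in letters))   # 2.5 bits
```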
38
Joint Entropy and Conditional Entropy
  • The joint entropy of a pair of discrete random variables X, Y ~ p(x, y) is the amount of information needed on average to specify both their values:
    H(X, Y) = - Σ_x Σ_y p(x, y) log2 p(x, y)
  • The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x, y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:
    H(Y|X) = Σ_x p(x) H(Y|X = x) = - Σ_x Σ_y p(x, y) log2 p(y|x)

39
Chain rule for entropy: H(X, Y) = H(X) + H(Y|X); more generally, H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn-1)
Proof: write log p(x, y) = log p(x) + log p(y|x) and take expectations.
40
Simplified Polynesian revised
  • Distinction between model and reality:
  • Simplified Polynesian has syllable structure.
  • All words consist of sequences of CV (consonant-vowel) syllables.
  • A better model is in terms of two random variables C and V.

41
(Table: the joint distribution P(C, V) over consonants and vowels, with marginals P(C, ·) and P(·, V))
42
(No Transcript)
43
(No Transcript)
44
Short Summary
  • A better understanding means much less uncertainty (2.44 bits < 5 bits).
  • Incorrect models: the cross entropy is larger than that of the correct model.

correct model
approximate model
45
Entropy rate: per-letter/per-word entropy
  • The amount of information contained in a message depends on the length of the message:
    H_rate = (1/n) H(X1n) = -(1/n) Σ_{x1n} p(x1n) log2 p(x1n)
  • The entropy of a human language L:
    H(L) = lim_{n→∞} (1/n) H(X1, X2, ..., Xn)

46
Mutual Information
  • By the chain rule for entropy, we have H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
  • Therefore, H(X) - H(X|Y) = H(Y) - H(Y|X).
  • This difference is called the mutual information between X and Y:
    I(X; Y) = H(X) - H(X|Y) = Σ_{x,y} p(x, y) log2 [p(x, y) / (p(x) p(y))]
  • It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

47
(Diagram: the relationship between H(X), H(Y), H(X|Y), H(Y|X), I(X; Y), and H(X, Y))
48
Mutual information is 0 only when the two variables are independent; for two dependent variables, it grows not only with the degree of dependence but also with the entropy of the variables.
Conditional mutual information: I(X; Y | Z) = H(X|Z) - H(X|Y, Z)
Chain rule for mutual information: I(X1n; Y) = Σ_i I(Xi; Y | X1, ..., Xi-1)
49
Pointwise Mutual Information
I(x, y) = log2 [p(x, y) / (p(x) p(y))]
Applications: clustering words, word sense disambiguation
50
  • Clustering by Next Word (Brown et al., 1992)
  • 1. Each word was characterized by the word that immediately followed it:
    c(wi) = <w1, w2, ..., wW>, where the j-th component is the total number of times the bigram wi wj occurs in the corpus.
  • 2. Define a distance measure on such vectors.
  • Mutual information I(x, y): the amount of information one outcome gives us about the other,
    I(x, y) = (-log P(x)) - (-log P(x|y)) = log2 [P(x, y) / (P(x) P(y))]
    i.e., (the uncertainty of x) minus (the uncertainty of x once y is known) = the certainty that y gives about x.
51
  • Example: how much information does the word "pancake" give us about the following word "syrup"?
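A sketch of the pointwise mutual information calculation for such a word pair; the corpus counts below are hypothetical, since the slide's own numbers are not in the transcript:

```python
from math import log2

# Hypothetical corpus counts (not from the slides): unigram and bigram frequencies
# in a corpus of N word tokens.
N = 1_000_000
count_x, count_y, count_xy = 500, 100, 40   # "pancake", "syrup", "pancake syrup"

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N
pmi = log2(p_xy / (p_x * p_y))              # I(x, y) = log2 [p(x, y) / (p(x) p(y))]
print(round(pmi, 2))                        # >> 0: the two words are strongly associated
```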

52
Physical meaning of MI
  • (1) wi and wj have no particular relation to each other:
  • I(x; y) ≈ 0, since P(wj | wi) ≈ P(wj)  (x tells us essentially nothing about y)

53
Physical meaning of MI
  • (2) wi and wj are perfectly coordinated:
  • I(x; y) >> 0 (a very large number): whenever we see y, we can be essentially certain that x occurs with it.
54
Physical meaning of MI
  • (3) wi and wj are negatively correlated:
  • I(x; y) << 0 (a large negative number): seeing one makes the other unlikely.

55
The Noisy Channel Model
  • Assuming that you want to communicate messages over a channel of restricted capacity, optimize (in terms of throughput and accuracy) the communication in the presence of noise in the channel.
  • A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions:
    C = max_{p(X)} I(X; Y)

(Diagram: the noisy channel)
W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)
56
A binary symmetric channel: a 1 or a 0 in the input gets flipped on transmission with probability p.
(Diagram: 0 → 0 and 1 → 1 with probability 1 - p; 0 → 1 and 1 → 0 with probability p)
I(X; Y) = H(Y) - H(Y|X) = H(Y) - H(p), where H(p) = -p log2 p - (1 - p) log2(1 - p), so the capacity is C = 1 - H(p).
The channel capacity is 1 bit only if the entropy H(p) is 0, i.e., if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1, and if p = 1 it always flips bits.
The channel capacity is 0 when 0s and 1s are both transmitted as 0s and 1s with equal probability (i.e., p = 1/2): a completely noisy binary channel.
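A sketch of the capacity C = 1 - H(p) for the cases discussed above (p = 0.1 is an extra illustrative value):

```python
from math import log2

def binary_entropy(p):
    """H(p) for a Bernoulli(p) variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for flip_prob in (0.0, 0.5, 1.0, 0.1):
    capacity = 1 - binary_entropy(flip_prob)     # C = 1 - H(p) for the BSC
    print(flip_prob, round(capacity, 3))
# p = 0 or p = 1: capacity 1 bit (the channel is deterministic);
# p = 1/2: capacity 0 (completely noisy).
```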
57
The noisy channel model in linguistics
(Diagram: I → Noisy Channel p(o|i) → O → Decoder → Î)
Î = argmax_i p(i|o) = argmax_i p(i) p(o|i) / p(o) = argmax_i p(i) p(o|i)
p(i): language model; p(o|i): channel probability
58
Speech Recognition
  • Find the sequence of words Ŵ that maximizes P(W | Speech Signal).
  • Maximize P(W | Speech Signal) ∝ P(W) P(Speech Signal | W)
    P(W): the language model; P(Speech Signal | W): the acoustic model (acoustic aspects of the speech signal).
59
big pig
The dog ...
Assume P(big | the) = P(pig | the).
P(the big dog) = P(the) P(big | the) P(dog | the, big)
P(the pig dog) = P(the) P(pig | the) P(dog | the, pig)
Since P(dog | the, big) > P(dog | the, pig), "the big dog" is selected, i.e., "dog" selects "big".
60
(No Transcript)
61
Relative Entropy or Kullback-Leibler Divergence
  • For two pmfs p(x) and q(x), their relative entropy is
    D(p || q) = Σ_x p(x) log2 (p(x) / q(x)).
  • The relative entropy (also known as the
    Kullback-Leibler divergence) is a measure of how
    different two probability distributions (over the
    same event space) are.
  • The KL divergence between p and q can also be
    seen as the average number of bits that are
    wasted by encoding events from a distribution p
    with a code based on a not-quite-right
    distribution q.

i.e., no bits are wasted when p = q, since D(p || p) = 0
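A sketch of D(p || q) on a small event space, showing that it is zero exactly when p = q (assuming q(x) > 0 wherever p(x) > 0); the two distributions are made up for illustration:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(kl_divergence(p, p))               # 0.0: no bits wasted when p == q
print(round(kl_divergence(p, q), 3))     # 0.25 bits; note D(p||q) != D(q||p) in general
```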
62
Recall: mutual information is a measure of how far a joint distribution is from independence:
I(X; Y) = D(p(x, y) || p(x) p(y))
Conditional relative entropy: D(p(y|x) || q(y|x)) = Σ_x p(x) Σ_y p(y|x) log2 (p(y|x) / q(y|x))
Chain rule for relative entropy: D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
Application: measuring selectional preferences in selection
63
The Relation to Language Cross-Entropy
  • Entropy can be thought of as a matter of how
    surprised we will be to see the next word given
    previous words we already saw.
  • The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by
    H(X, q) = H(X) + D(p || q) = - Σ_x p(x) log2 q(x).
  • Cross-entropy can help us find out what our
    average surprise for the next word is.

64
Cross Entropy
  • Cross entropy of a language L = (Xi) ~ p(x):
    H(L, m) = - lim_{n→∞} (1/n) Σ_{x1n} p(x1n) log2 m(x1n)
  • If the language is "nice" (stationary and ergodic), this equals - lim_{n→∞} (1/n) log2 m(x1n),
  • so it can be estimated from a large body of available utterances.

65
How much good does the approximate model do ?
correct model
approximate model
66

Proof sketch: label the correct model's cross entropy (1) and the approximate model's (2); the inequality (1) ≤ (2) follows from D(p || q) ≥ 0, which is proved using ln y ≤ y - 1 (equivalently, log2 y ≤ (y - 1) log2 e).
67
Cross entropy of a language L given a model M: in principle the sum runs over all English text, which is infinite, so in practice it is estimated from representative samples of English text.
68
(No Transcript)
69
Cross Entropy as a Model Evaluator
  • Example:
  • find the best model to produce a message of 20 words.
  • Correct probabilistic model:

70
  • Per-word cross entropy of the approximate model:
    - (0.05 log2 P(M1) + 0.05 log2 P(M2) + 0.05 log2 P(M3) + 0.10 log2 P(M4) + 0.10 log2 P(M5) + 0.20 log2 P(M6) + 0.20 log2 P(M7) + 0.25 log2 P(M8))

100 samples: M1 ×5, M2 ×5, M3 ×5, M4 ×10, M5 ×10, M6 ×20, M7 ×20, M8 ×25
Each message is independent of the next.
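A sketch of this cross-entropy calculation; the empirical frequencies are the slide's, while the probabilities assigned by the approximate model are placeholders (the slide's model values are not in the transcript):

```python
from math import log2

# Relative frequencies of M1..M8 in the 100-sample test suite (from the slide).
empirical = [0.05, 0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.25]

# Probabilities an approximate model assigns to M1..M8 (placeholder values).
model = [0.125] * 8

cross_entropy = -sum(f * log2(m) for f, m in zip(empirical, model))
true_entropy = -sum(f * log2(f) for f in empirical)
print(round(true_entropy, 3), round(cross_entropy, 3))   # cross entropy >= entropy
```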
71
Probability of the 100-sample test set (messages M1 ... M8, each independent of the next):
P(M1) P(M1) ... P(M1)  (5 times)
× P(M2) P(M2) ... P(M2)  (5 times)
× ...
× P(M8) P(M8) ... P(M8)  (25 times)
= P(M1)^5 P(M2)^5 P(M3)^5 P(M4)^10 P(M5)^10 P(M6)^20 P(M7)^20 P(M8)^25
72
  • Per-word cross entropy:
  • this assumes the test suite of 100 examples was exactly indicative of the probabilistic model.
  • Measuring the cross entropy of a model is meaningful only if the test sequence has not been used by the model builder.

closed test (inside test) vs. open test (outside test)
73
  • The Composition of Brown and LOB Corpora
  • Brown Corpus
  • Brown University Standard Corpus of Present-day
    American English.
  • LOB Corpus
  • Lancaster-Oslo/Bergen Corpus of British English.

74
(No Transcript)
75
The Entropy of English
  • We can model English using n-gram models (also known as Markov chains).
  • These models assume limited memory, i.e., we assume that the next word depends only on the previous k words: a kth-order Markov approximation.

76
P(w1,n) ≈ Π_i P(wi | wi-1)  (bigram)   or   Π_i P(wi | wi-2, wi-1)  (trigram)
77
The Entropy of English
  • What is the Entropy of English?

78
Perplexity
  • A measure related to the notion of cross entropy and used in the speech recognition community is called the perplexity:
    perplexity(x1n, m) = 2^H(x1n, m) = m(x1n)^(-1/n)
  • A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
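A sketch relating perplexity to per-word cross entropy via perplexity = 2^H; the per-word model probabilities below are hypothetical:

```python
from math import log2

# Hypothetical per-word probabilities a model assigns to a 5-word test sequence.
word_probs = [0.1, 0.05, 0.2, 0.01, 0.1]

cross_entropy = -sum(log2(p) for p in word_probs) / len(word_probs)   # bits per word
perplexity = 2 ** cross_entropy
print(round(cross_entropy, 2), round(perplexity, 1))
# A perplexity of k: as surprised, on average, as choosing among k equal options.
```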