Title: Chapter 7 Mathematical Foundations
1 Chapter 7: Mathematical Foundations
2 Notions of Probability Theory
- Probability theory deals with predicting how likely it is that something will happen.
- The process by which an observation is made is called an experiment or a trial.
- The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω (Omega).
- An event is a subset of the sample space.
- Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty.
- A probability function/distribution distributes a probability mass of 1 throughout the sample space.
3 Example
- A fair coin is tossed 3 times (an experiment). What is the chance of 2 heads?
- Sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Uniform distribution: P(basic outcome) = 1/8 (the probability function)
- The event of interest, "2 heads when tossing 3 coins," is a subset of Ω: A = {HHT, HTH, THH}
- P(A) = 3/8
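A minimal sketch of this enumeration in Python; the sample space, event, and probabilities come from the slide, and the variable names are ours:

```python
from itertools import product
from fractions import Fraction

# Sample space: all sequences of 3 tosses of a fair coin.
omega = list(product("HT", repeat=3))      # 8 equally likely basic outcomes
p_outcome = Fraction(1, len(omega))        # uniform distribution: 1/8 each

# Event A: exactly 2 heads.
A = [w for w in omega if w.count("H") == 2]

print(A)                    # [('H', 'H', 'T'), ('H', 'T', 'H'), ('T', 'H', 'H')]
print(len(A) * p_outcome)   # 3/8
```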
4 Conditional Probability
- Conditional probabilities measure the probability of events given some knowledge.
- Prior probabilities measure the probabilities of events before we consider our additional knowledge.
- Posterior probabilities are probabilities that result from using our additional knowledge.
- Example (three coin tosses): event B = "the 1st toss is H"; event A = "2 Hs among the 1st, 2nd, and 3rd tosses."
- P(A|B) = P(A ∩ B) / P(B)
5 The Multiplication Rule and the Chain Rule
- The multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
- The chain rule: P(A1 ∩ ... ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ... P(An|A1 ∩ ... ∩ An−1)
- (used in Markov models, etc.)
6 Independence
- The chain rule relates intersection with conditionalization (important to NLP).
- Independence and conditional independence of events are two very important notions in statistics.
- Independence: P(A ∩ B) = P(A) P(B)
- Conditional independence: P(A ∩ B | C) = P(A|C) P(B|C)
7 Bayes' Theorem
- Bayes' theorem lets us swap the order of dependence between events: P(B|A) = P(A|B) P(B) / P(A).
- This is important when the former quantity is difficult to determine.
- P(A) is a normalizing constant.
8 An Application
- Pick the best conclusion c given some evidence e:
- (1) evaluate the probability P(c|e)  ← unknown
- (2) select the c with the largest P(c|e)
- P(c) and P(e|c) are known, so apply Bayes' theorem: P(c|e) = P(e|c) P(c) / P(e) ∝ P(e|c) P(c).
- Example: the relative probability of a disease and how often a symptom is associated with it.
9 Bayes' Theorem
[Venn diagram of events A and B, illustrating P(B|A) = P(A ∩ B) / P(A) = P(A|B) P(B) / P(A).]
10 Bayes' Theorem (general form)
- If the group of sets Bi partitions A, i.e., A ⊆ ∪i Bi and Bi ∩ Bj = ∅ (i ≠ j), and P(A) > 0, then
  P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)
- (used in the noisy channel model)
11 An Example
- A parasitic gap occurs once in 100,000 sentences.
- A complicated pattern matcher attempts to identify sentences with parasitic gaps.
- The answer is positive with probability 0.95 when a sentence has a parasitic gap, and positive with probability 0.005 when it has no parasitic gap.
- When the test says that a sentence contains a parasitic gap, what is the probability that this is true?
- P(G) = 0.00001, P(¬G) = 0.99999, P(T|G) = 0.95, P(T|¬G) = 0.005
- P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|¬G) P(¬G)] ≈ 0.002
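A quick check of this calculation in Python, using only the numbers given on the slide (variable names are ours):

```python
# Parasitic-gap example: posterior probability that a positive test is correct.
p_G = 0.00001           # P(G): a sentence has a parasitic gap
p_T_given_G = 0.95      # P(T|G): test is positive given a gap
p_T_given_notG = 0.005  # P(T|~G): false-positive rate

p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)  # total probability of a positive test
p_G_given_T = p_T_given_G * p_G / p_T                 # Bayes' theorem

print(round(p_G_given_T, 5))   # ~0.0019: a positive answer is still very unlikely to be right
```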
12 Random Variables
- A random variable is a function X: Ω (sample space) → R^n.
- It lets us talk about the probabilities that are related to the event space.
- A discrete random variable is a function X: Ω → S, where S is a countable subset of R.
- If X: Ω → {0, 1}, then X is called a Bernoulli trial.
- The probability mass function (pmf) for a random variable X gives the probability that the random variable has different numeric values:
  pmf: p(x) = p(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
13 Example: toss two dice and sum their faces
- Ω = {(1,1), (1,2), ..., (1,6), (2,1), (2,2), ..., (6,1), (6,2), ..., (6,6)}, S = {2, 3, 4, ..., 12}, X: Ω → S
- pmf: p(3) = p(X = 3) = P(A3) = P({(1,2), (2,1)}) = 2/36, where A3 = {ω ∈ Ω : X(ω) = 3} = {(1,2), (2,1)}
14 Expectation
- The expectation is the mean (μ, mu) or average of a random variable: E(X) = Σx x p(x).
- Example: roll one die and let Y be the value on its face. E(Y) = (1 + 2 + ... + 6)/6 = 3.5.
- E(aY + b) = aE(Y) + b and E(X + Y) = E(X) + E(Y).
15 Variance
- The variance (σ^2, sigma squared) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot: Var(X) = E((X − E(X))^2) = E(X^2) − E^2(X).
- The standard deviation (σ) is the square root of the variance.
16 Example: X, a random variable that is the sum of the numbers on two dice
- X = "toss two dice and sum their faces," i.e., X = Y + Y', where Y is "toss one die and take its value."
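A small sketch that enumerates the 36 outcomes and computes the pmf, expectation, and variance of the two-dice sum; everything here follows from the slides above, with our own variable names:

```python
from itertools import product
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

# pmf of X = sum of the two faces.
pmf = {}
for a, b in outcomes:
    pmf[a + b] = pmf.get(a + b, 0) + p

E_X = sum(x * px for x, px in pmf.items())                   # expectation
Var_X = sum((x - E_X) ** 2 * px for x, px in pmf.items())    # variance

print(pmf[3])        # 1/18 (= 2/36, matching the pmf example above)
print(E_X, Var_X)    # 7, 35/6
```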
17 Joint and Conditional Distributions
- More than one random variable can be defined over a sample space. In this case, we talk about a joint or multivariate probability distribution.
- The joint probability mass function for two discrete random variables X and Y is p(x, y) = P(X = x, Y = y).
- The marginal probability mass function totals up the probability masses for the values of each variable separately: pX(x) = Σy p(x, y) and pY(y) = Σx p(x, y).
- If X and Y are independent, then p(x, y) = pX(x) pY(y).
18 Joint and Conditional Distributions
- Similar intersection rules hold for joint distributions as for events: pX|Y(x | y) = p(x, y) / pY(y).
- Chain rule in terms of random variables: p(w, x, y, z) = p(w) p(x|w) p(y|w, x) p(z|w, x, y).
19 Estimating Probability Functions
- What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown ⇒ P must be estimated from a sample of data.
- An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs.
- Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
- If no such assumption can be made, we must use a non-parametric or distribution-free approach.
20 The Parametric Approach
- Select an explicit probabilistic model.
- Specify a few parameters to determine a particular probability distribution.
- The amount of training data required is not great, and good probability estimates can be calculated from it.
21 Standard Distributions
- In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
- Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
- Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
- Continuous distributions: the normal distribution, the standard normal distribution.
22 Binomial Distribution
- A series of trials with only two outcomes, each trial independent of all others.
- The number r of successes out of n trials, given that the probability of success in any trial is p:
  b(r; n, p) = C(n, r) p^r (1 − p)^(n − r), where n and p are the parameters and r is the variable.
- Expectation: np; variance: np(1 − p).
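A short sketch checking the binomial mean and variance against np and np(1 − p); n = 10 and p = 0.5 are arbitrary illustration values, not from the slide:

```python
from math import comb

def binomial_pmf(r, n, p):
    """b(r; n, p): probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.5   # illustration values
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
print(mean, var)   # 5.0 2.5, matching np and np(1 - p)
```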
24 The Normal Distribution
- Two parameters, the mean μ and the standard deviation σ; the curve is given by
  n(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)^2 / (2σ^2))
- Standard normal distribution: μ = 0, σ = 1.
26 Bayesian Statistics I: Bayesian Updating
- Frequentist statistics vs. Bayesian statistics.
- Toss a coin 10 times and get 8 heads ⇒ 8/10 (the maximum likelihood estimate).
- But 8 heads out of 10 just happens sometimes given a small sample.
- Assume that the data are coming in sequentially and are independent.
- Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
- The MAP probability becomes the new prior, and the process repeats on each new datum.
27 Frequentist Statistics
- μm: the model that asserts P(head) = m; s: a particular sequence of observations yielding i heads and j tails.
- P(s | μm) = m^i (1 − m)^j
- Find the MLE (maximum likelihood estimate) by differentiating this polynomial: m = i / (i + j).
- 8 heads and 2 tails: 8 / (8 + 2) = 0.8.
28 Bayesian Statistics
- Prior: our belief in the fairness of the coin, expressed as an a priori probability distribution (it is probably a regular, fair one).
- Observation: a particular sequence s with i heads and j tails.
- New belief in the fairness of the coin: the posterior P(μm | s) ∝ P(s | μm) P(μm), which becomes the new prior (evaluated here for i = 8, j = 2).
29 Bayesian Statistics II: Bayesian Decision Theory
- Bayesian statistics can be used to evaluate which model or family of models better explains some data.
- We define two different models of the event and calculate the likelihood ratio between these two models.
30 Entropy
- The entropy is the average uncertainty of a single random variable.
- Let p(x) = P(X = x), where x ∈ X.
- H(p) = H(X) = −Σx∈X p(x) log2 p(x)
- In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
- Example: toss two coins and count the number of heads: p(0) = 1/4, p(1) = 1/2, p(2) = 1/4, so H(X) = 1.5 bits.
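A minimal entropy calculator in Python, applied to the two-coin example above and to the 8-sided die from the next slide (the function name is ours):

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Number of heads when tossing two fair coins.
heads = {0: 0.25, 1: 0.5, 2: 0.25}
print(entropy(heads))    # 1.5 bits

# A uniform 8-sided die needs log2(8) = 3 bits per outcome.
die8 = {i: 1 / 8 for i in range(1, 9)}
print(entropy(die8))     # 3.0 bits
```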
31 Example
- Roll an 8-sided die: H(X) = log2 8 = 3 bits.
- Entropy (another view): the average length of the message needed to transmit an outcome of that variable.
  1 2 3 4 5 6 7 8 → 001 010 011 100 101 110 111 000
- Optimal code: send a message of probability p(i) in −log2 p(i) bits.
32 Example
- Problem: send a friend a message that is a number from 0 to 3.
- How long a message must you send (in terms of number of bits)?
- Example: watch a house with two occupants and report who is home.
33
- Variable-length encoding with a code tree:
- (1) all messages are handled;
- (2) it is clear when one message ends and the next starts.
- Fewer bits for more frequent messages; more bits for less frequent messages.
- 0: no occupants; 10: both occupants; 110: first occupant; 111: second occupant.
34
- W: a random variable for a message; V(W): the possible messages; P: a probability distribution over them.
- Lower bound on the number of bits needed to encode such a message: H(W) = −Σw∈V(W) P(w) log2 P(w) (the entropy of the random variable).
35
- (1) The entropy of a message is a lower bound for the average number of bits needed to transmit that message.
- (2) An encoding method can achieve this using ⌈−log2 P(w)⌉ bits per message w.
- Entropy (another view): a measure of the uncertainty about what a message says.
- Fewer bits for a more certain message; more bits for a less certain message.
37 Simplified Polynesian
- Letter probabilities: p t k a i u = 1/8 1/4 1/8 1/4 1/8 1/8
- Per-letter entropy: H(P) = −Σi p(i) log2 p(i) = 2.5 bits
- Code: p → 100, t → 00, k → 101, a → 01, i → 110, u → 111
- Fewer bits are used to send more frequent letters.
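A small sketch verifying the per-letter entropy and the expected length of the code shown on the slide:

```python
from math import log2

# Simplified Polynesian letter distribution and code from the slide.
p = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(pr * log2(pr) for pr in p.values())               # per-letter entropy
avg_len = sum(pr * len(code[c]) for c, pr in p.items())    # expected code length

print(H, avg_len)   # 2.5 2.5: this code meets the entropy lower bound exactly
```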
38 Joint Entropy and Conditional Entropy
- The joint entropy of a pair of discrete random variables X, Y ~ p(x, y) is the amount of information needed on average to specify both their values:
  H(X, Y) = −Σx Σy p(x, y) log2 p(x, y)
- The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x, y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:
  H(Y|X) = Σx p(x) H(Y | X = x) = −Σx Σy p(x, y) log2 p(y|x)
39 Chain Rule for Entropy
- H(X, Y) = H(X) + H(Y|X); more generally, H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1).
- Proof (two-variable case): H(X, Y) = −E[log2 p(X, Y)] = −E[log2 p(X) + log2 p(Y|X)] = H(X) + H(Y|X).
40 Simplified Polynesian Revisited
- Distinction between model and reality.
- Simplified Polynesian has syllable structure: all words consist of sequences of CV (consonant-vowel) syllables.
- A better model uses two random variables, C and V.
41
[Table: the joint distribution P(C, V) over consonants (columns) and vowels (rows), with the marginals P(C, ·) and P(·, V) shown in the margins.]
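The table itself did not survive the conversion; the sketch below assumes the joint distribution P(C, V) given for this example in Manning and Schütze, which reproduces the 2.44-bit joint entropy quoted in the summary slide:

```python
from math import log2

# Assumed joint distribution P(C, V) (consonants p, t, k; vowels a, i, u),
# as in Manning & Schuetze's Simplified Polynesian example.
joint = {
    ("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
    ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16,
}

def H(probs):
    """Entropy of a collection of probabilities, skipping zeros."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginals over consonants and vowels.
P_C = {c: sum(p for (c2, v), p in joint.items() if c2 == c) for c in "ptk"}
P_V = {v: sum(p for (c, v2), p in joint.items() if v2 == v) for v in "aiu"}

H_C = H(P_C.values())            # entropy of the consonant
H_CV = H(joint.values())         # joint entropy H(C, V)
H_V_given_C = H_CV - H_C         # chain rule: H(V|C) = H(C, V) - H(C)

print(round(H_C, 3), round(H_V_given_C, 3), round(H_CV, 2))   # 1.061 1.375 2.44
```

The last line uses the chain rule from slide 39 to recover the conditional entropy without computing it directly.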
44 Short Summary
- Better understanding means much less uncertainty: the two-variable model needs 2.44 bits per syllable, less than the 5 bits (2 × 2.5) the per-letter model would use.
- Incorrect models: the cross entropy under an approximate model is larger than the entropy under the correct model.
45 Entropy Rate: Per-Letter/Per-Word Entropy
- The amount of information contained in a message depends on the length of the message, so we use the entropy rate: H_rate = (1/n) H(X1n) = −(1/n) Σx1n p(x1n) log2 p(x1n).
- The entropy of a human language L: H(L) = lim n→∞ (1/n) H(X1, X2, ..., Xn).
46 Mutual Information
- By the chain rule for entropy, we have H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
- Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X).
- This difference is called the mutual information I(X; Y) between X and Y.
- It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another:
  I(X; Y) = Σx Σy p(x, y) log2 [p(x, y) / (p(x) p(y))]
47
[Diagram: H(X, Y) decomposed into H(X|Y), I(X; Y), and H(Y|X), with H(X) = H(X|Y) + I(X; Y) and H(Y) = H(Y|X) + I(X; Y).]
48
- Mutual information is 0 only when the two variables are independent; for two dependent variables, it grows not only with the degree of dependence but also with the entropy of the variables.
- Conditional mutual information: I(X; Y | Z) = H(X|Z) − H(X|Y, Z).
- Chain rule for mutual information: I(X1n; Y) = Σi I(Xi; Y | X1, ..., Xi−1).
49 Pointwise Mutual Information
- Pointwise mutual information between two particular outcomes: I(x, y) = log2 [p(x, y) / (p(x) p(y))].
- Applications: clustering words, word sense disambiguation.
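A minimal PMI sketch estimated from bigram counts; the corpus counts below are made-up illustration values, not data from the chapter:

```python
from math import log2

# Hypothetical counts from a toy corpus (illustration only).
N = 1_000_000    # total bigrams
count_x = 500    # occurrences of word x (e.g., "pancake")
count_y = 800    # occurrences of word y (e.g., "syrup")
count_xy = 60    # occurrences of the bigram "x y"

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N

pmi = log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # large positive PMI: seeing x tells us a lot about seeing y next
```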
50 Clustering by Next Word (Brown et al., 1992)
- 1. Each word wi was characterized by the word that immediately followed it: c(wi) = ⟨#w1, #w2, ..., #w|V|⟩, where #wj is the total number of times the bigram wi wj occurs in the corpus.
- 2. Define the distance measure on such vectors using mutual information I(x, y): the amount of information one outcome gives us about the other.
  I(x, y) = (−log P(x)) − (−log P(x|y)) = log [P(x, y) / (P(x) P(y))]
  i.e., the uncertainty of x minus the uncertainty of x once y is known (the certainty that y gives about x).
51 Example
- How much information does the word "pancake" give us about the following word being "syrup"?
52 Physical Meaning of MI
- (1) wi and wj have no particular relation to each other:
- P(wj | wi) = P(wj), so I(x; y) ≈ 0: knowing x tells us nothing about y.
53 Physical Meaning of MI
- (2) wi and wj are perfectly coordinated:
- The ratio P(wj | wi) / P(wj) is a very large number, so I(x; y) >> 0: whenever we see y, we can be almost certain that x came with it.
54 Physical Meaning of MI
- (3) wi and wj are negatively correlated:
- The ratio P(wj | wi) / P(wj) is very small, so I(x; y) << 0 (a negative number of large magnitude): the two words tend not to occur together.
55 The Noisy Channel Model
- Assuming that you want to communicate messages over a channel of restricted capacity, optimize the communication (in terms of throughput and accuracy) in the presence of noise in the channel.
- A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions: C = max p(X) I(X; Y).
[Diagram of the noisy channel: a message W from a finite alphabet goes into an Encoder, producing the input X to the channel; the channel p(y|x) adds noise, producing the output Y; a Decoder attempts to reconstruct the message from the output.]
56 A Binary Symmetric Channel
- A 1 or 0 in the input gets flipped on transmission with probability p, and passes through unchanged with probability 1 − p.
- I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(p), where H(p) = −p log2 p − (1 − p) log2 (1 − p), so the capacity is C = 1 − H(p).
- The channel capacity is 1 bit only if the entropy H(p) is 0, i.e., if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1, and if p = 1 it always flips bits.
- The channel capacity is 0 when both 0s and 1s are transmitted with equal probability as 0s and 1s (i.e., p = 1/2): a completely noisy binary channel.
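A small sketch of the binary symmetric channel capacity C = 1 − H(p), evaluated at a few flip probabilities (function names are ours):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with flip probability p."""
    return 1 - binary_entropy(p)

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(bsc_capacity(p), 3))   # 1.0, 0.531, 0.0, 1.0
```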
57 The Noisy Channel Model in Linguistics
- The observed output O is viewed as a noisy version of an underlying input I: I → Noisy Channel p(o|i) → O → Decoder → Î.
- The decoder chooses Î = argmax_i p(i|o) = argmax_i p(i) p(o|i), where p(i) is the language model and p(o|i) is the channel probability.
58 Speech Recognition
- Find the sequence of words Ŵ that maximizes P(W | Speech Signal):
  Ŵ = argmax_W P(W | Speech Signal) = argmax_W P(W) P(Speech Signal | W) / P(Speech Signal) = argmax_W P(W) P(Speech Signal | W)
- P(W): the language model; P(Speech Signal | W): the acoustic aspects of the speech signal.
59 Example: "big" vs. "pig"
- The acoustic signal is ambiguous between "big" and "pig" in "The ___ dog...".
- Assume P(big | the) = P(pig | the). Then
  P(the big dog) = P(the) P(big | the) P(dog | the big)
  P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
- Since P(dog | the big) > P(dog | the pig), "the big dog" is selected: in effect, "dog" selects "big".
61 Relative Entropy or Kullback-Leibler Divergence
- For two pmfs p(x) and q(x), their relative entropy is D(p || q) = Σx p(x) log2 [p(x) / q(x)].
- The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
- The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
- D(p || p) = 0, i.e., no bits are wasted when p = q.
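A minimal KL-divergence sketch; the two example distributions are illustration values only:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# True distribution p vs. an approximate model q over the same events.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(round(kl_divergence(p, p), 3))   # 0.0: no bits wasted when p = q
print(round(kl_divergence(p, q), 3))   # 0.25: extra bits per event when coding with q
```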
62 Recall
- Mutual information is the KL divergence between a joint distribution and the product of its marginals, i.e., a measure of how far a joint distribution is from independence: I(X; Y) = D(p(x, y) || p(x) p(y)).
- Conditional relative entropy: D(p(y|x) || q(y|x)) = Σx p(x) Σy p(y|x) log2 [p(y|x) / q(y|x)].
- Chain rule for relative entropy: D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
- Application: measuring selectional preferences in selection.
63 The Relation to Language: Cross-Entropy
- Entropy can be thought of as a matter of how surprised we will be to see the next word given the previous words we already saw.
- The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by H(X, q) = −Σx p(x) log2 q(x).
- Cross entropy can help us find out what our average surprise for the next word is.
64 Cross Entropy
- Cross entropy of a language L = (Xi) ~ p(x) with respect to a model m:
  H(L, m) = −lim n→∞ (1/n) Σx1n p(x1n) log2 m(x1n)
- If the language is "nice" (stationary and ergodic) and a large body of utterances is available, this can be computed from a single long sample:
  H(L, m) = −lim n→∞ (1/n) log2 m(x1n)
65 How much good does the approximate model do?
- Compare the entropy under the correct model with the cross entropy under the approximate model: the cross entropy is never lower, i.e., H(L) ≤ H(L, m).
66 Proof
- Use the inequality ln x ≤ x − 1 (the line y = x − 1 lies above ln x, touching it at x = 1):
  H(p) − H(p, m) = Σx p(x) log2 [m(x) / p(x)] = (1 / ln 2) Σx p(x) ln [m(x) / p(x)]
  ≤ (1 / ln 2) Σx p(x) [m(x) / p(x) − 1] = (1 / ln 2) [Σx m(x) − Σx p(x)] = 0
- Hence H(p) ≤ H(p, m): the cross entropy under the approximate model m is never lower than the entropy under the correct model p.
67 Cross Entropy of a Language L Given a Model M
- H(L, M) = −lim n→∞ (1/n) log2 M(x1n), where the limit is over (all) English text of infinite length.
- In practice, we approximate it using a sufficiently large, representative sample of English text: H(L, M) ≈ −(1/n) log2 M(x1n).
69 Cross Entropy as a Model Evaluator
- Example: find the best model to produce messages of 20 words.
- A correct probabilistic model over the messages M1, ..., M8 is assumed known.
70 Per-Word Cross Entropy
- 100 samples, each message independent of the next:
  M1 ×5, M2 ×5, M3 ×5, M4 ×10, M5 ×10, M6 ×20, M7 ×20, M8 ×25
- Per-word cross entropy of an approximate model P:
  −(0.05 log2 P(M1) + 0.05 log2 P(M2) + 0.05 log2 P(M3) + 0.10 log2 P(M4) + 0.10 log2 P(M5) + 0.20 log2 P(M6) + 0.20 log2 P(M7) + 0.25 log2 P(M8))
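A short sketch that evaluates this per-word cross entropy on the 100-message test suite; the "uniform" comparison model is our own illustration, not part of the slide:

```python
from math import log2

# The 100-message test suite from the slide: message -> count.
counts = {"M1": 5, "M2": 5, "M3": 5, "M4": 10, "M5": 10, "M6": 20, "M7": 20, "M8": 25}
N = sum(counts.values())

def per_word_cross_entropy(model):
    """-(1/N) * sum over the test suite of log2 P_model(message)."""
    return -sum(c * log2(model[m]) for m, c in counts.items()) / N

# Correct model: probabilities equal to the relative frequencies in the suite.
correct = {m: c / N for m, c in counts.items()}

# A deliberately worse model (uniform over the 8 messages), for comparison only.
uniform = {m: 1 / 8 for m in counts}

print(round(per_word_cross_entropy(correct), 2))   # ~2.74 bits (the entropy itself)
print(round(per_word_cross_entropy(uniform), 2))   # 3.0 bits: higher, as expected
```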
71
- Since each message is independent of the next, the probability of the whole test suite under a model P is a product with one factor per sample:
  P(M1) × ... × P(M1) (5 times) × P(M2) × ... × P(M2) (5 times) × ... × P(M8) × ... × P(M8) (25 times)
  = P(M1)^5 P(M2)^5 P(M3)^5 P(M4)^10 P(M5)^10 P(M6)^20 P(M7)^20 P(M8)^25
72
- The per-word cross entropy computed this way equals the entropy here because the test suite of 100 examples was exactly indicative of the probabilistic model.
- Measuring the cross entropy of a model works only if the test sequence has not been used by the model builder.
- Closed test (inside test) vs. open test (outside test).
73 The Composition of the Brown and LOB Corpora
- Brown Corpus: the Brown University Standard Corpus of Present-Day American English.
- LOB Corpus: the Lancaster-Oslo/Bergen Corpus of British English.
75 The Entropy of English
- We can model English using n-gram models (also known as Markov chains).
- These models assume limited memory, i.e., we assume that the next word depends only on the previous k words: a kth-order Markov approximation.
76
- P(w1,n) = Πk P(wk | w1, ..., wk−1) ≈ Πk P(wk | wk−1) (bigram) or Πk P(wk | wk−2, wk−1) (trigram)
77 The Entropy of English
- What is the entropy of English?
78 Perplexity
- A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity: perplexity(x1n, m) = 2^H(x1n, m) = m(x1n)^(−1/n).
- A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
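A minimal sketch relating per-word cross entropy to perplexity; the per-word model probabilities are made-up illustration values:

```python
from math import log2

def cross_entropy_per_word(word_probs):
    """H = -(1/n) * sum_i log2 m(w_i | history), given the model's per-word probabilities."""
    n = len(word_probs)
    return -sum(log2(p) for p in word_probs) / n

def perplexity(word_probs):
    """perplexity = 2 ** (per-word cross entropy)."""
    return 2 ** cross_entropy_per_word(word_probs)

# Hypothetical probabilities a model assigns to each word of a 5-word test sentence.
probs = [0.1, 0.25, 0.5, 0.05, 0.2]
print(round(perplexity(probs), 2))   # ~6.03: like guessing among ~6 equiprobable words per step
```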