Title: Chapter 7 Mathematical Foundations
1 Chapter 7: Mathematical Foundations
2 Notions of Probability Theory
- Probability theory deals with predicting how likely it is that something will happen.
- The process by which an observation is made is called an experiment or a trial.
- The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω (Omega).
- An event is a subset of the sample space.
- Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty.
- A probability function/distribution distributes a probability mass of 1 throughout the sample space.
3 Example
- A fair coin is tossed 3 times (an experiment). What is the chance of 2 heads?
- Sample space: Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
- Uniform distribution: P(basic outcome) = 1/8 (the probability function)
- The event of interest, "2 heads when tossing 3 coins," is a subset of Ω: A = {HHT, HTH, THH}
- P(A) = 3/8
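A minimal sketch of this enumeration in Python; the sample space, event, and probabilities come from the slide, and the variable names are ours:

```python
from itertools import product
from fractions import Fraction

# Sample space: all sequences of 3 tosses of a fair coin.
omega = list(product("HT", repeat=3))      # 8 equally likely basic outcomes
p_outcome = Fraction(1, len(omega))        # uniform distribution: 1/8 each

# Event A: exactly 2 heads.
A = [w for w in omega if w.count("H") == 2]

print(A)                    # [('H', 'H', 'T'), ('H', 'T', 'H'), ('T', 'H', 'H')]
print(len(A) * p_outcome)   # 3/8
```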
4 Conditional Probability
- Conditional probabilities measure the probability of events given some knowledge.
- Prior probabilities measure the probabilities of events before we consider our additional knowledge.
- Posterior probabilities are probabilities that result from using our additional knowledge.
- Example (three coin tosses): event B = "the 1st toss is H"; event A = "2 Hs among the 1st, 2nd, and 3rd tosses."
- P(A|B) = P(A ∩ B) / P(B)
5 The Multiplication Rule and the Chain Rule
- The multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
- The chain rule: P(A1 ∩ ... ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ... P(An|A1 ∩ ... ∩ An−1)
- (used in Markov models, etc.)
6 Independence
- The chain rule relates intersection with conditionalization (important to NLP).
- Independence and conditional independence of events are two very important notions in statistics.
- Independence: P(A ∩ B) = P(A) P(B)
- Conditional independence: P(A ∩ B | C) = P(A|C) P(B|C)
7 Bayes' Theorem
- Bayes' theorem lets us swap the order of dependence between events: P(B|A) = P(A|B) P(B) / P(A).
- This is important when the former quantity is difficult to determine.
- P(A) is a normalizing constant.
8 An Application
- Pick the best conclusion c given some evidence e:
- (1) evaluate the probability P(c|e)  ← unknown
- (2) select the c with the largest P(c|e)
- P(c) and P(e|c) are known, so apply Bayes' theorem: P(c|e) = P(e|c) P(c) / P(e) ∝ P(e|c) P(c).
- Example: the relative probability of a disease and how often a symptom is associated with it.
9 Bayes' Theorem
[Venn diagram of events A and B, illustrating P(B|A) = P(A ∩ B) / P(A) = P(A|B) P(B) / P(A).]
10 Bayes' Theorem (general form)
- If the group of sets Bi partitions A, i.e., A ⊆ ∪i Bi and Bi ∩ Bj = ∅ (i ≠ j), and P(A) > 0, then
  P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)
- (used in the noisy channel model)
11 An Example
- A parasitic gap occurs once in 100,000 sentences.
- A complicated pattern matcher attempts to identify sentences with parasitic gaps.
- The answer is positive with probability 0.95 when a sentence has a parasitic gap, and positive with probability 0.005 when it has no parasitic gap.
- When the test says that a sentence contains a parasitic gap, what is the probability that this is true?
- P(G) = 0.00001, P(¬G) = 0.99999, P(T|G) = 0.95, P(T|¬G) = 0.005
- P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|¬G) P(¬G)] ≈ 0.002
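A quick check of this calculation in Python, using only the numbers given on the slide (variable names are ours):

```python
# Parasitic-gap example: posterior probability that a positive test is correct.
p_G = 0.00001           # P(G): a sentence has a parasitic gap
p_T_given_G = 0.95      # P(T|G): test is positive given a gap
p_T_given_notG = 0.005  # P(T|~G): false-positive rate

p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)  # total probability of a positive test
p_G_given_T = p_T_given_G * p_G / p_T                 # Bayes' theorem

print(round(p_G_given_T, 5))   # ~0.0019: a positive answer is still very unlikely to be right
```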
12 Random Variables
- A random variable is a function X: Ω (sample space) → R^n.
- It lets us talk about the probabilities that are related to the event space.
- A discrete random variable is a function X: Ω → S, where S is a countable subset of R.
- If X: Ω → {0, 1}, then X is called a Bernoulli trial.
- The probability mass function (pmf) for a random variable X gives the probability that the random variable has different numeric values:
  pmf: p(x) = p(X = x) = P(Ax), where Ax = {ω ∈ Ω : X(ω) = x}
13 Example: toss two dice and sum their faces
- Ω = {(1,1), (1,2), ..., (1,6), (2,1), (2,2), ..., (6,1), (6,2), ..., (6,6)}, S = {2, 3, 4, ..., 12}, X: Ω → S
- pmf: p(3) = p(X = 3) = P(A3) = P({(1,2), (2,1)}) = 2/36, where A3 = {ω ∈ Ω : X(ω) = 3} = {(1,2), (2,1)}
14 Expectation
- The expectation is the mean (μ, mu) or average of a random variable: E(X) = Σx x p(x).
- Example: roll one die and let Y be the value on its face. E(Y) = (1 + 2 + ... + 6)/6 = 3.5.
- E(aY + b) = aE(Y) + b and E(X + Y) = E(X) + E(Y).
15 Variance
- The variance (σ^2, sigma squared) of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot: Var(X) = E((X − E(X))^2) = E(X^2) − E^2(X).
- The standard deviation (σ) is the square root of the variance.
16 Example: X, a random variable that is the sum of the numbers on two dice
- X = "toss two dice and sum their faces," i.e., X = Y + Y', where Y is "toss one die and take its value."
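A small sketch that enumerates the 36 outcomes and computes the pmf, expectation, and variance of the two-dice sum; everything here follows from the slides above, with our own variable names:

```python
from itertools import product
from fractions import Fraction

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))

# pmf of X = sum of the two faces.
pmf = {}
for a, b in outcomes:
    pmf[a + b] = pmf.get(a + b, 0) + p

E_X = sum(x * px for x, px in pmf.items())                   # expectation
Var_X = sum((x - E_X) ** 2 * px for x, px in pmf.items())    # variance

print(pmf[3])        # 1/18 (= 2/36, matching the pmf example above)
print(E_X, Var_X)    # 7, 35/6
```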
17 Joint and Conditional Distributions
- More than one random variable can be defined over a sample space. In this case, we talk about a joint or multivariate probability distribution.
- The joint probability mass function for two discrete random variables X and Y is p(x, y) = P(X = x, Y = y).
- The marginal probability mass function totals up the probability masses for the values of each variable separately: pX(x) = Σy p(x, y) and pY(y) = Σx p(x, y).
- If X and Y are independent, then p(x, y) = pX(x) pY(y).
18 Joint and Conditional Distributions
- Similar intersection rules hold for joint distributions as for events: pX|Y(x | y) = p(x, y) / pY(y).
- Chain rule in terms of random variables: p(w, x, y, z) = p(w) p(x|w) p(y|w, x) p(z|w, x, y).
19 Estimating Probability Functions
- What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown ⇒ P must be estimated from a sample of data.
- An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs.
- Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
- If no such assumption can be made, we must use a non-parametric or distribution-free approach.
20 The Parametric Approach
- Select an explicit probabilistic model.
- Specify a few parameters to determine a particular probability distribution.
- The amount of training data required is not great, and good probability estimates can be calculated from it.
21 Standard Distributions
- In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
- Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
- Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
- Continuous distributions: the normal distribution, the standard normal distribution.
22 Binomial Distribution
- A series of trials with only two outcomes, each trial independent of all others.
- The number r of successes out of n trials, given that the probability of success in any trial is p:
  b(r; n, p) = C(n, r) p^r (1 − p)^(n − r), where n and p are the parameters and r is the variable.
- Expectation: np; variance: np(1 − p).
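A short sketch checking the binomial mean and variance against np and np(1 − p); n = 10 and p = 0.5 are arbitrary illustration values, not from the slide:

```python
from math import comb

def binomial_pmf(r, n, p):
    """b(r; n, p): probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.5   # illustration values
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
print(mean, var)   # 5.0 2.5, matching np and np(1 - p)
```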
24 The Normal Distribution
- Two parameters, the mean μ and the standard deviation σ; the curve is given by
  n(x; μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)^2 / (2σ^2))
- Standard normal distribution: μ = 0, σ = 1.
26 Bayesian Statistics I: Bayesian Updating
- Frequentist statistics vs. Bayesian statistics.
- Toss a coin 10 times and get 8 heads ⇒ 8/10 (the maximum likelihood estimate).
- But 8 heads out of 10 just happens sometimes given a small sample.
- Assume that the data are coming in sequentially and are independent.
- Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
- The MAP probability becomes the new prior, and the process repeats on each new datum.
27 Frequentist Statistics
- μm: the model that asserts P(head) = m; s: a particular sequence of observations yielding i heads and j tails.
- P(s | μm) = m^i (1 − m)^j
- Find the MLE (maximum likelihood estimate) by differentiating this polynomial: m = i / (i + j).
- 8 heads and 2 tails: 8 / (8 + 2) = 0.8.
28 Bayesian Statistics
- Prior: our belief in the fairness of the coin, expressed as an a priori probability distribution (it is probably a regular, fair one).
- Observation: a particular sequence s with i heads and j tails.
- New belief in the fairness of the coin: the posterior P(μm | s) ∝ P(s | μm) P(μm), which becomes the new prior (evaluated here for i = 8, j = 2).
29 Bayesian Statistics II: Bayesian Decision Theory
- Bayesian statistics can be used to evaluate which model or family of models better explains some data.
- We define two different models of the event and calculate the likelihood ratio between these two models.
30 Entropy
- The entropy is the average uncertainty of a single random variable.
- Let p(x) = P(X = x), where x ∈ X.
- H(p) = H(X) = −Σx∈X p(x) log2 p(x)
- In other words, entropy measures the amount of information in a random variable. It is normally measured in bits.
- Example: toss two coins and count the number of heads: p(0) = 1/4, p(1) = 1/2, p(2) = 1/4, so H(X) = 1.5 bits.
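A minimal entropy calculator in Python, applied to the two-coin example above and to the 8-sided die from the next slide (the function name is ours):

```python
from math import log2

def entropy(pmf):
    """H(p) = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Number of heads when tossing two fair coins.
heads = {0: 0.25, 1: 0.5, 2: 0.25}
print(entropy(heads))    # 1.5 bits

# A uniform 8-sided die needs log2(8) = 3 bits per outcome.
die8 = {i: 1 / 8 for i in range(1, 9)}
print(entropy(die8))     # 3.0 bits
```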
31 Example
- Roll an 8-sided die: H(X) = log2 8 = 3 bits.
- Entropy (another view): the average length of the message needed to transmit an outcome of that variable.
  1 2 3 4 5 6 7 8 → 001 010 011 100 101 110 111 000
- Optimal code: send a message of probability p(i) in −log2 p(i) bits.
32 Example
- Problem: send a friend a message that is a number from 0 to 3.
- How long a message must you send (in terms of number of bits)?
- Example: watch a house with two occupants and report who is home.
33
- Variable-length encoding with a code tree:
- (1) all messages are handled;
- (2) it is clear when one message ends and the next starts.
- Fewer bits for more frequent messages; more bits for less frequent messages.
- 0: no occupants; 10: both occupants; 110: first occupant; 111: second occupant.
34
- W: a random variable for a message; V(W): the possible messages; P: a probability distribution over them.
- Lower bound on the number of bits needed to encode such a message: H(W) = −Σw∈V(W) P(w) log2 P(w) (the entropy of the random variable).
35
- (1) The entropy of a message is a lower bound for the average number of bits needed to transmit that message.
- (2) An encoding method can achieve this using ⌈−log2 P(w)⌉ bits per message w.
- Entropy (another view): a measure of the uncertainty about what a message says.
- Fewer bits for a more certain message; more bits for a less certain message.
37 Simplified Polynesian
- Letter probabilities: p t k a i u = 1/8 1/4 1/8 1/4 1/8 1/8
- Per-letter entropy: H(P) = −Σi p(i) log2 p(i) = 2.5 bits
- Code: p → 100, t → 00, k → 101, a → 01, i → 110, u → 111
- Fewer bits are used to send more frequent letters.
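A small sketch verifying the per-letter entropy and the expected length of the code shown on the slide:

```python
from math import log2

# Simplified Polynesian letter distribution and code from the slide.
p = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(pr * log2(pr) for pr in p.values())               # per-letter entropy
avg_len = sum(pr * len(code[c]) for c, pr in p.items())    # expected code length

print(H, avg_len)   # 2.5 2.5: this code meets the entropy lower bound exactly
```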
38 Joint Entropy and Conditional Entropy
- The joint entropy of a pair of discrete random variables X, Y ~ p(x, y) is the amount of information needed on average to specify both their values:
  H(X, Y) = −Σx Σy p(x, y) log2 p(x, y)
- The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x, y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:
  H(Y|X) = Σx p(x) H(Y | X = x) = −Σx Σy p(x, y) log2 p(y|x)
39 Chain Rule for Entropy
- H(X, Y) = H(X) + H(Y|X); more generally, H(X1, ..., Xn) = H(X1) + H(X2|X1) + ... + H(Xn|X1, ..., Xn−1).
- Proof (two-variable case): H(X, Y) = −E[log2 p(X, Y)] = −E[log2 p(X) + log2 p(Y|X)] = H(X) + H(Y|X).
40 Simplified Polynesian Revisited
- Distinction between model and reality.
- Simplified Polynesian has syllable structure: all words consist of sequences of CV (consonant-vowel) syllables.
- A better model uses two random variables, C and V.
41
[Table: the joint distribution P(C, V) over consonants (columns) and vowels (rows), with the marginals P(C, ·) and P(·, V) shown in the margins.]
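The table itself did not survive the conversion; the sketch below assumes the joint distribution P(C, V) given for this example in Manning and Schütze, which reproduces the 2.44-bit joint entropy quoted in the summary slide:

```python
from math import log2

# Assumed joint distribution P(C, V) (consonants p, t, k; vowels a, i, u),
# as in Manning & Schuetze's Simplified Polynesian example.
joint = {
    ("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
    ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
    ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16,
}

def H(probs):
    """Entropy of a collection of probabilities, skipping zeros."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginals over consonants and vowels.
P_C = {c: sum(p for (c2, v), p in joint.items() if c2 == c) for c in "ptk"}
P_V = {v: sum(p for (c, v2), p in joint.items() if v2 == v) for v in "aiu"}

H_C = H(P_C.values())            # entropy of the consonant
H_CV = H(joint.values())         # joint entropy H(C, V)
H_V_given_C = H_CV - H_C         # chain rule: H(V|C) = H(C, V) - H(C)

print(round(H_C, 3), round(H_V_given_C, 3), round(H_CV, 2))   # 1.061 1.375 2.44
```

The last line uses the chain rule from slide 39 to recover the conditional entropy without computing it directly.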
44 Short Summary
- Better understanding means much less uncertainty: the two-variable model needs 2.44 bits per syllable, less than the 5 bits (2 × 2.5) the per-letter model would use.
- Incorrect models: the cross entropy under an approximate model is larger than the entropy under the correct model.
45 Entropy Rate: Per-Letter/Per-Word Entropy
- The amount of information contained in a message depends on the length of the message, so we use the entropy rate: H_rate = (1/n) H(X1n) = −(1/n) Σx1n p(x1n) log2 p(x1n).
- The entropy of a human language L: H(L) = lim n→∞ (1/n) H(X1, X2, ..., Xn).
46 Mutual Information
- By the chain rule for entropy, we have H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
- Therefore, H(X) − H(X|Y) = H(Y) − H(Y|X).
- This difference is called the mutual information I(X; Y) between X and Y.
- It is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another:
  I(X; Y) = Σx Σy p(x, y) log2 [p(x, y) / (p(x) p(y))]
47
[Diagram: H(X, Y) decomposed into H(X|Y), I(X; Y), and H(Y|X), with H(X) = H(X|Y) + I(X; Y) and H(Y) = H(Y|X) + I(X; Y).]
48
- Mutual information is 0 only when the two variables are independent; for two dependent variables, it grows not only with the degree of dependence but also with the entropy of the variables.
- Conditional mutual information: I(X; Y | Z) = H(X|Z) − H(X|Y, Z).
- Chain rule for mutual information: I(X1n; Y) = Σi I(Xi; Y | X1, ..., Xi−1).
49 Pointwise Mutual Information
- Pointwise mutual information between two particular outcomes: I(x, y) = log2 [p(x, y) / (p(x) p(y))].
- Applications: clustering words, word sense disambiguation.
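A minimal PMI sketch estimated from bigram counts; the corpus counts below are made-up illustration values, not data from the chapter:

```python
from math import log2

# Hypothetical counts from a toy corpus (illustration only).
N = 1_000_000    # total bigrams
count_x = 500    # occurrences of word x (e.g., "pancake")
count_y = 800    # occurrences of word y (e.g., "syrup")
count_xy = 60    # occurrences of the bigram "x y"

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N

pmi = log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # large positive PMI: seeing x tells us a lot about seeing y next
```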
50 Clustering by Next Word (Brown et al., 1992)
- 1. Each word wi was characterized by the word that immediately followed it: c(wi) = ⟨#w1, #w2, ..., #w|V|⟩, where #wj is the total number of times the bigram wi wj occurs in the corpus.
- 2. Define the distance measure on such vectors using mutual information I(x, y): the amount of information one outcome gives us about the other.
  I(x, y) = (−log P(x)) − (−log P(x|y)) = log [P(x, y) / (P(x) P(y))]
  i.e., the uncertainty of x minus the uncertainty of x once y is known (the certainty that y gives about x).
51 Example
- How much information does the word "pancake" give us about the following word being "syrup"?
52 Physical Meaning of MI
- (1) wi and wj have no particular relation to each other:
- P(wj | wi) = P(wj), so I(x; y) ≈ 0: knowing x tells us nothing about y.
53 Physical Meaning of MI
- (2) wi and wj are perfectly coordinated:
- The ratio P(wj | wi) / P(wj) is a very large number, so I(x; y) >> 0: whenever we see y, we can be almost certain that x came with it.
54 Physical Meaning of MI
- (3) wi and wj are negatively correlated:
- The ratio P(wj | wi) / P(wj) is very small, so I(x; y) << 0 (a negative number of large magnitude): the two words tend not to occur together.
55 The Noisy Channel Model
- Assuming that you want to communicate messages over a channel of restricted capacity, optimize the communication (in terms of throughput and accuracy) in the presence of noise in the channel.
- A channel's capacity can be reached by designing an input code that maximizes the mutual information between the input and output over all possible input distributions: C = max p(X) I(X; Y).
[Diagram of the noisy channel: a message W from a finite alphabet goes into an Encoder, producing the input X to the channel; the channel p(y|x) adds noise, producing the output Y; a Decoder attempts to reconstruct the message from the output.]
56 A Binary Symmetric Channel
- A 1 or 0 in the input gets flipped on transmission with probability p, and passes through unchanged with probability 1 − p.
- I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(p), where H(p) = −p log2 p − (1 − p) log2 (1 − p), so the capacity is C = 1 − H(p).
- The channel capacity is 1 bit only if the entropy H(p) is 0, i.e., if p = 0 the channel reliably transmits a 0 as 0 and a 1 as 1, and if p = 1 it always flips bits.
- The channel capacity is 0 when both 0s and 1s are transmitted with equal probability as 0s and 1s (i.e., p = 1/2): a completely noisy binary channel.
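A small sketch of the binary symmetric channel capacity C = 1 − H(p), evaluated at a few flip probabilities (function names are ours):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with flip probability p."""
    return 1 - binary_entropy(p)

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, round(bsc_capacity(p), 3))   # 1.0, 0.531, 0.0, 1.0
```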
57 The Noisy Channel Model in Linguistics
- The observed output O is viewed as a noisy version of an underlying input I: I → Noisy Channel p(o|i) → O → Decoder → Î.
- The decoder chooses Î = argmax_i p(i|o) = argmax_i p(i) p(o|i), where p(i) is the language model and p(o|i) is the channel probability.
58 Speech Recognition
- Find the sequence of words Ŵ that maximizes P(W | Speech Signal):
  Ŵ = argmax_W P(W | Speech Signal) = argmax_W P(W) P(Speech Signal | W) / P(Speech Signal) = argmax_W P(W) P(Speech Signal | W)
- P(W): the language model; P(Speech Signal | W): the acoustic aspects of the speech signal.
59 Example: "big" vs. "pig"
- The acoustic signal is ambiguous between "big" and "pig" in "The ___ dog...".
- Assume P(big | the) = P(pig | the). Then
  P(the big dog) = P(the) P(big | the) P(dog | the big)
  P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
- Since P(dog | the big) > P(dog | the pig), "the big dog" is selected: in effect, "dog" selects "big".
61 Relative Entropy or Kullback-Leibler Divergence
- For two pmfs p(x) and q(x), their relative entropy is D(p || q) = Σx p(x) log2 [p(x) / q(x)].
- The relative entropy (also known as the Kullback-Leibler divergence) is a measure of how different two probability distributions (over the same event space) are.
- The KL divergence between p and q can also be seen as the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
- D(p || p) = 0, i.e., no bits are wasted when p = q.
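A minimal KL-divergence sketch; the two example distributions are illustration values only:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# True distribution p vs. an approximate model q over the same events.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(round(kl_divergence(p, p), 3))   # 0.0: no bits wasted when p = q
print(round(kl_divergence(p, q), 3))   # 0.25: extra bits per event when coding with q
```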
62 Recall
- Mutual information is the KL divergence between a joint distribution and the product of its marginals, i.e., a measure of how far a joint distribution is from independence: I(X; Y) = D(p(x, y) || p(x) p(y)).
- Conditional relative entropy: D(p(y|x) || q(y|x)) = Σx p(x) Σy p(y|x) log2 [p(y|x) / q(y|x)].
- Chain rule for relative entropy: D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
- Application: measuring selectional preferences in selection.
63 The Relation to Language: Cross-Entropy
- Entropy can be thought of as a matter of how surprised we will be to see the next word given the previous words we already saw.
- The cross entropy between a random variable X with true probability distribution p(x) and another pmf q (normally a model of p) is given by H(X, q) = −Σx p(x) log2 q(x).
- Cross entropy can help us find out what our average surprise for the next word is.
64 Cross Entropy
- Cross entropy of a language L = (Xi) ~ p(x) with respect to a model m:
  H(L, m) = −lim n→∞ (1/n) Σx1n p(x1n) log2 m(x1n)
- If the language is "nice" (stationary and ergodic) and a large body of utterances is available, this can be computed from a single long sample:
  H(L, m) = −lim n→∞ (1/n) log2 m(x1n)
65 How much good does the approximate model do?
- Compare the entropy under the correct model with the cross entropy under the approximate model: the cross entropy is never lower, i.e., H(L) ≤ H(L, m).
66 Proof
- Use the inequality ln x ≤ x − 1 (the line y = x − 1 lies above ln x, touching it at x = 1):
  H(p) − H(p, m) = Σx p(x) log2 [m(x) / p(x)] = (1 / ln 2) Σx p(x) ln [m(x) / p(x)]
  ≤ (1 / ln 2) Σx p(x) [m(x) / p(x) − 1] = (1 / ln 2) [Σx m(x) − Σx p(x)] = 0
- Hence H(p) ≤ H(p, m): the cross entropy under the approximate model m is never lower than the entropy under the correct model p.
67 Cross Entropy of a Language L Given a Model M
- H(L, M) = −lim n→∞ (1/n) log2 M(x1n), where the limit is over (all) English text of infinite length.
- In practice, we approximate it using a sufficiently large, representative sample of English text: H(L, M) ≈ −(1/n) log2 M(x1n).
69 Cross Entropy as a Model Evaluator
- Example: find the best model to produce messages of 20 words.
- A correct probabilistic model over the messages M1, ..., M8 is assumed known.
70 Per-Word Cross Entropy
- 100 samples, each message independent of the next:
  M1 ×5, M2 ×5, M3 ×5, M4 ×10, M5 ×10, M6 ×20, M7 ×20, M8 ×25
- Per-word cross entropy of an approximate model P:
  −(0.05 log2 P(M1) + 0.05 log2 P(M2) + 0.05 log2 P(M3) + 0.10 log2 P(M4) + 0.10 log2 P(M5) + 0.20 log2 P(M6) + 0.20 log2 P(M7) + 0.25 log2 P(M8))
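A short sketch that evaluates this per-word cross entropy on the 100-message test suite; the "uniform" comparison model is our own illustration, not part of the slide:

```python
from math import log2

# The 100-message test suite from the slide: message -> count.
counts = {"M1": 5, "M2": 5, "M3": 5, "M4": 10, "M5": 10, "M6": 20, "M7": 20, "M8": 25}
N = sum(counts.values())

def per_word_cross_entropy(model):
    """-(1/N) * sum over the test suite of log2 P_model(message)."""
    return -sum(c * log2(model[m]) for m, c in counts.items()) / N

# Correct model: probabilities equal to the relative frequencies in the suite.
correct = {m: c / N for m, c in counts.items()}

# A deliberately worse model (uniform over the 8 messages), for comparison only.
uniform = {m: 1 / 8 for m in counts}

print(round(per_word_cross_entropy(correct), 2))   # ~2.74 bits (the entropy itself)
print(round(per_word_cross_entropy(uniform), 2))   # 3.0 bits: higher, as expected
```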
71
- Since each message is independent of the next, the probability of the whole test suite under a model P is a product with one factor per sample:
  P(M1) × ... × P(M1) (5 times) × P(M2) × ... × P(M2) (5 times) × ... × P(M8) × ... × P(M8) (25 times)
  = P(M1)^5 P(M2)^5 P(M3)^5 P(M4)^10 P(M5)^10 P(M6)^20 P(M7)^20 P(M8)^25
72
- The per-word cross entropy computed this way equals the entropy here because the test suite of 100 examples was exactly indicative of the probabilistic model.
- Measuring the cross entropy of a model works only if the test sequence has not been used by the model builder.
- Closed test (inside test) vs. open test (outside test).
73 The Composition of the Brown and LOB Corpora
- Brown Corpus: the Brown University Standard Corpus of Present-Day American English.
- LOB Corpus: the Lancaster-Oslo/Bergen Corpus of British English.
75 The Entropy of English
- We can model English using n-gram models (also known as Markov chains).
- These models assume limited memory, i.e., we assume that the next word depends only on the previous k words: a kth-order Markov approximation.
76
- P(w1,n) = Πk P(wk | w1, ..., wk−1) ≈ Πk P(wk | wk−1) (bigram) or Πk P(wk | wk−2, wk−1) (trigram)
77 The Entropy of English
- What is the entropy of English?
78 Perplexity
- A measure related to the notion of cross entropy and used in the speech recognition community is called perplexity: perplexity(x1n, m) = 2^H(x1n, m) = m(x1n)^(−1/n).
- A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equiprobable choices at each step.
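A minimal sketch relating per-word cross entropy to perplexity; the per-word model probabilities are made-up illustration values:

```python
from math import log2

def cross_entropy_per_word(word_probs):
    """H = -(1/n) * sum_i log2 m(w_i | history), given the model's per-word probabilities."""
    n = len(word_probs)
    return -sum(log2(p) for p in word_probs) / n

def perplexity(word_probs):
    """perplexity = 2 ** (per-word cross entropy)."""
    return 2 ** cross_entropy_per_word(word_probs)

# Hypothetical probabilities a model assigns to each word of a 5-word test sentence.
probs = [0.1, 0.25, 0.5, 0.05, 0.2]
print(round(perplexity(probs), 2))   # ~6.03: like guessing among ~6 equiprobable words per step
```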