CSA4050: Advanced Topics in NLP
Transcript and Presenter's Notes
1
CSA4050: Advanced Topics in NLP
  • Probability I
  • Experiments/Outcomes/Events
  • Independence/Dependence
  • Bayes Rule
  • Conditional Probability/Chain Rule

2
Acknowledgement
  • Much of this material is based on work by
    Mary Dalrymple, King's College London

3
Experiment, Basic Outcome, Sample Space
  • Probability theory is founded upon the notion of
    an experiment.
  • An experiment is a situation which can have one
    or more different basic outcomes.
  • Example: if we throw a die, there are six
    possible basic outcomes.
  • A Sample Space Ω is the set of all possible basic
    outcomes. For example:
  • If we toss a coin, Ω = {H,T}
  • If we toss a coin twice, Ω = {HT,TH,TT,HH}
  • If we throw a die, Ω = {1,2,3,4,5,6}

4
Event
  • An Event A ⊆ Ω is a set of basic outcomes, e.g.
  • tossing two heads: {HH}
  • throwing a 6: {6}
  • getting either a 2 or a 4: {2,4}.
  • Ω itself is the certain event, whilst ∅ is the
    impossible event.
  • Event Space = 2^Ω, the set of all subsets of the
    Sample Space.

5
Probability distribution
  • A probability distribution of an experiment is a
    function that assigns a number (or probability)
    between 0 and 1 to each basic outcome, such that
    the sum of all the probabilities is 1.
  • The probability p(E) of an event E is the sum of
    the probabilities of all the basic outcomes in E.
  • A uniform distribution is one in which each basic
    outcome is equally likely.

6
Probability of an Event: die example
  • Sample space = set of basic outcomes =
    {1,2,3,4,5,6}
  • If the die is not loaded, the distribution is
    uniform.
  • Thus each basic outcome, e.g. 6 (throwing a
    six), is assigned the same probability, 1/6.
  • So p({3,6}) = p(3) + p(6) = 2/6 = 1/3

7
Estimating Probability
  • Repeat the experiment T times and count the
    frequency of E.
  • Estimated p(E) = count(E)/T
  • This can be done over m runs, yielding estimates
    p1(E), ..., pm(E).
  • The best estimate is the (possibly weighted)
    average of the individual pi(E), as sketched
    below.
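
A minimal sketch of this relative-frequency estimation in Python; the
fair-die experiment, trial count, and number of runs are illustrative
assumptions, not part of the original slides.

```python
import random

def estimate_p(event, experiment, trials):
    """Relative-frequency estimate: count(E)/T over T repetitions."""
    count = sum(1 for _ in range(trials) if experiment() in event)
    return count / trials

# Illustrative experiment: one throw of a fair die, E = {6}.
def throw_die():
    return random.randint(1, 6)

# m = 5 independent runs of T = 1000 trials each, then averaged.
runs = [estimate_p({6}, throw_die, trials=1000) for _ in range(5)]
print(runs, sum(runs) / len(runs))  # each run is near 1/6 ≈ .167
```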

8
3 times coin toss
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Cases with exactly 2 tails: E = {HTT, THT, TTH}
  • Each experiment i: 1000 cases (3000 tosses).
  • c1(E) = 386, p1(E) = .386
  • c2(E) = 375, p2(E) = .375
  • pmean(E) = (.386 + .375)/2 = .3805
  • Recall that a uniform distribution is one in which
    each basic outcome is equally likely.
  • Assuming a uniform distribution, p(E) = 3/8 = .375
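
A quick simulation of one such experiment; a sketch only, with random
outcomes, so each run's estimate fluctuates around 3/8.

```python
import random

# One experiment: 1000 cases of three coin tosses;
# E = exactly two tails.
def three_tosses():
    return "".join(random.choice("HT") for _ in range(3))

cases = 1000
c = sum(1 for _ in range(cases) if three_tosses().count("T") == 2)
print(c / cases)  # fluctuates around the uniform value 3/8 = .375
```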

9
Word Probability
  • General Problem: What is the probability of the
    next word/character/phoneme in a sequence, given
    the first N words/characters/phonemes?
  • To approach this problem we study an experiment
    whose sample space is the set of possible words.
  • N.B. The same approach could be used to study
    the probability of the next character or phoneme.

10
Word Probability
  • Approximation 1: all words are equally probable.
  • Then the probability of each word = 1/N, where N
    is the number of word types.
  • But words are not all equally probable.
  • Approximation 2: the probability of each word is
    its relative frequency of occurrence in a corpus.

11
Word Probability
  • Estimate p(w), the probability of word w.
  • Given corpus C: p(w) ≈ count(w)/size(C)
  • Example:
  • Brown corpus: 1,000,000 tokens
  • the: 69,971 tokens
  • Probability of the = 69,971/1,000,000 ≈ .07
  • rabbit: 11 tokens
  • Probability of rabbit = 11/1,000,000 ≈ .00001
  • Conclusion: the next word is most likely to be
    the.
  • Is this correct?
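
As a concrete sketch of the count-based estimate above; the toy corpus
here is illustrative, standing in for the Brown corpus.

```python
from collections import Counter

def unigram_probs(tokens):
    """p(w) ≈ count(w) / size(C), the relative frequency in corpus C."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

# Illustrative toy corpus standing in for the Brown corpus.
corpus = "look at the cute rabbit near the house".split()
p = unigram_probs(corpus)
print(p["the"], p["rabbit"])  # 2/8 = .25 and 1/8 = .125
```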

12
A counter example
  • Given the context "Look at the cute ..."
  • is "the" more likely than "rabbit"?
  • Context matters in determining what word comes
    next.
  • What is the probability of the next word in a
    sequence, given the first N words?

13
Independent Events
[Figure: sample space containing two independent events,
A = eggs and B = monday]
14
Sample Space
  • (eggs,mon) (cereal,mon) (nothing,mon)
  • (eggs,tue) (cereal,tue) (nothing,tue)
  • (eggs,wed) (cereal,wed) (nothing,wed)
  • (eggs,thu) (cereal,thu) (nothing,thu)
  • (eggs,fri) (cereal,fri) (nothing,fri)
  • (eggs,sat) (cereal,sat) (nothing,sat)
  • (eggs,sun) (cereal,sun) (nothing,sun)

15
Independent Events
  • Two events, A and B, are independent if the fact
    that A occurs does not affect the probability of
    B occurring.
  • When two events, A and B, are independent, the
    probability of both occurring p(A,B) is the
    product of the prior probabilities of each, i.e.
  • p(A,B) = p(A) × p(B)
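
A small check of the product rule on the breakfast/day sample space of
slide 14, assuming (purely for illustration) that all 21 outcomes are
equally likely:

```python
from fractions import Fraction
from itertools import product

# The 21-outcome sample space from slide 14, assumed uniform.
foods = ["eggs", "cereal", "nothing"]
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
omega = list(product(foods, days))

def p(event):
    return Fraction(sum(1 for o in omega if event(o)), len(omega))

A = lambda o: o[0] == "eggs"   # A = eggs
B = lambda o: o[1] == "mon"    # B = monday
both = lambda o: A(o) and B(o)

print(p(both) == p(A) * p(B))  # True: 1/21 = 1/3 × 1/7
```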

16
Dependent Events
  • Two events, A and B, are dependent if the
    occurrence of one affects the probability of the
    occurrence of the other.

17
Dependent Events
[Figure: sample space with overlapping events A and B;
their intersection is A ∩ B]
18
Conditional Probability
  • The conditional probability of an event A, given
    that event B has already occurred, is written
    p(A|B).
  • In general p(A|B) ≠ p(B|A).

19
Dependent Events: p(A|B) ≠ p(B|A)
[Figure: sample space with overlapping events A and B
and their intersection A ∩ B]
20
Example Dependencies
  • Consider the fair die example with:
  • A = outcome divisible by 2 = {2,4,6}
  • B = outcome divisible by 3 = {3,6}
  • C = outcome divisible by 4 = {4}
  • p(A|B) = p(A ∩ B)/p(B) = (1/6)/(1/3) = 1/2
  • p(A|C) = p(A ∩ C)/p(C) = (1/6)/(1/6) = 1
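
These values are easy to verify mechanically; a minimal sketch,
assuming a fair (uniform) die:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # fair die, uniform distribution
A = {o for o in omega if o % 2 == 0}  # divisible by 2
B = {o for o in omega if o % 3 == 0}  # divisible by 3
C = {o for o in omega if o % 4 == 0}  # divisible by 4

def p(event):
    return Fraction(len(event), len(omega))

def cond(a, b):
    """p(a|b) = p(a ∩ b) / p(b)"""
    return p(a & b) / p(b)

print(cond(A, B))  # 1/2
print(cond(A, C))  # 1
```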

21
Conditional Probability
  • Intuitively, after B has occurred, event A is
    replaced by A ∩ B, the sample space Ω is replaced
    by B, and probabilities are renormalised
    accordingly.
  • The conditional probability of an event A given
    that B has occurred (p(B) > 0) is thus given by
    p(A|B) = p(A ∩ B)/p(B).
  • If A and B are independent, p(A ∩ B) =
    p(A) × p(B), so p(A|B) = p(A) × p(B)/p(B) = p(A).

22
Bayesian Inversion
  • For both A and B to occur, either A must occur
    first, then B, or vice versa. We get the following
    possibilities:
  • p(A|B) = p(A ∩ B)/p(B) and p(B|A) = p(A ∩ B)/p(A)
  • Hence p(A|B) p(B) = p(B|A) p(A)
  • We can thus express p(A|B) in terms of p(B|A):
  • p(A|B) = p(B|A) p(A)/p(B)
  • This equivalence, known as Bayes' Theorem, is
    useful when one or the other quantity is difficult
    to determine.
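
Continuing the die example from slide 20, a quick numeric check of the
inversion (a sketch; the events are the ones defined there):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # fair die
A = {2, 4, 6}                # divisible by 2
B = {3, 6}                   # divisible by 3

def p(e):
    return Fraction(len(e), len(omega))

direct = p(A & B) / p(B)              # p(A|B) by definition: 1/2
p_B_given_A = p(A & B) / p(A)         # p(B|A) = 1/3
inverted = p_B_given_A * p(A) / p(B)  # Bayes: p(B|A) p(A)/p(B)
print(direct == inverted)             # True: both are 1/2
```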

23
Bayes Theorem
  • p(B|A) = p(B ∩ A)/p(A) = p(A|B) p(B)/p(A)
  • The denominator p(A) can be ignored if we are
    only interested in which event out of some set is
    most likely.
  • Typically we are interested in the value of B
    that is most likely given an observation A, i.e.
  • arg maxB p(A|B) p(B)/p(A) = arg maxB p(A|B) p(B)
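
A minimal sketch of this argmax; the two candidate words and their
likelihoods/priors are invented for illustration, not corpus-derived:

```python
# Pick the most likely B given observation A. The denominator p(A)
# is the same for every candidate, so it can be dropped.
candidates = {
    # B: (p(A|B), p(B)) -- illustrative numbers only
    "rabbit": (0.30, 0.0001),   # 0.30 × 0.0001 = 3e-5
    "the":    (0.0001, 0.07),   # 0.0001 × 0.07 = 7e-6
}
best = max(candidates, key=lambda b: candidates[b][0] * candidates[b][1])
print(best)  # rabbit
```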

24
The Chain Rule
  • We can extend the definition of conditional
    probability to more than two events:
  • p(A1 ∩ ... ∩ An) = p(A1) p(A2|A1) p(A3|A1 ∩ A2)
    ... p(An|A1 ∩ ... ∩ An-1)
  • The chain rule allows us to talk about the
    probability of sequences of events p(A1,...,An).
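
A minimal sketch of the chain rule applied to a word sequence, with
each conditional probability estimated as a relative frequency over
the full history; the toy corpus is illustrative:

```python
# p(w1,...,wn) = p(w1) p(w2|w1) ... p(wn|w1,...,wn-1)
corpus = ["look", "at", "the", "cute", "rabbit"] * 3

def p_next(word, history, tokens):
    """p(word|history) ≈ count(history + word) / count(history + any)."""
    n = len(history)
    ctx = sum(1 for i in range(len(tokens) - n)
              if tokens[i:i + n] == history)
    both = sum(1 for i in range(len(tokens) - n)
               if tokens[i:i + n + 1] == history + [word])
    return both / ctx if ctx else 0.0

def p_sequence(words, tokens):
    """Chain-rule product of history-conditioned probabilities."""
    prob = 1.0
    for i, w in enumerate(words):
        prob *= p_next(w, words[:i], tokens)
    return prob

print(p_sequence(["look", "at", "the"], corpus))  # 3/15 × 1 × 1 = 0.2
```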