Csci 4152: Statistical Natural Language Processing


1
Mathematical Foundations I: Probability Theory
2
Notions of Probability Theory
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial.
  • The collection of basic outcomes (or sample
    points) for our experiment is called the sample
    space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1, certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample
    space.

3
Experiments & Sample Spaces
  • Set of possible basic outcomes: the sample space Ω
  • coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
  • yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
  • lottery (|Ω| ≈ 10^7 .. 10^12)
  • # of traffic accidents somewhere per year (Ω = N)
  • spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over such an alphabet
  • missing word (|Ω| ≈ vocabulary size)

4
Events
  • An event is a subset of the sample space
  • Event A is a set of basic outcomes
  • Usually A ⊂ Ω, and all A ∈ 2^Ω (the event space)
  • Ω is then the certain event; ∅ is the impossible event
  • Example:
  • experiment: three coin tosses
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • count cases with exactly two tails: then
  • A = {HTT, THT, TTH}
  • all heads:
  • A = {HHH}

5
Probability
  • Repeat the experiment many times, and record how many times a given event A occurred (count c_1).
  • Do this whole series many times; remember all the c_i's.
  • Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
  • Call this constant the probability of A. Notation: p(A)

6
Estimating probability
  • Remember: "... close to an unknown constant."
  • We can only estimate it:
  • from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment), set
  • p(A) ≅ c_1/T_1;
  • otherwise, take the weighted average of all c_i/T_i (or, if the data allow, simply treat the set of series as a single long series).
  • This is the best estimate.

7
  • Recall our example:
  • experiment: three coin tosses
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • count cases with exactly two tails: A = {HTT, THT, TTH}
  • Run the experiment 1000 times (i.e., 3000 tosses)
  • Counted 386 cases with two tails (HTT, THT, or TTH)
  • estimate: p(A) = 386 / 1000 = .386
  • Run again: 373, 399, 382, 355, 372, 406, 359
  • p(A) = .379 (weighted average), or simply 3032 / 8000
  • Uniform distribution assumption: p(A) = 3/8 = .375 (see the simulation sketch below)
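A minimal simulation sketch of this experiment (the series and trial counts mirror the slide, but observed ratios will vary by random seed):

```python
import random

def estimate_two_tails(series=8, trials_per_series=1000):
    """Estimate p(exactly two tails in three coin tosses) by repeated simulation."""
    counts = []
    for _ in range(series):
        c = sum(
            1
            for _ in range(trials_per_series)
            if sum(random.choice("HT") == "T" for _ in range(3)) == 2
        )
        counts.append(c)
        print(f"series estimate: {c / trials_per_series:.3f}")
    # Pooled estimate: treat all series as one long series.
    print(f"pooled estimate: {sum(counts) / (series * trials_per_series):.3f}")
    print("theoretical p(A) = 3/8 =", 3 / 8)

estimate_two_tails()
```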

8
Basic Properties
  • Basic properties:
  • p: 2^Ω → [0, 1]
  • p(Ω) = 1
  • Disjoint events: p(∪_i A_i) = ∑_i p(A_i) (checked numerically below)
  • NB: the axiomatic definition of probability takes the above three conditions as axioms
  • Immediate consequences:
  • p(∅) = 0, p(Ā) = 1 − p(A)
  • A ⊆ B ⇒ p(A) ≤ p(B)
  • ∑_{a ∈ Ω} p(a) = 1
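A small check of the disjoint-additivity property on the three-coin-toss space from the earlier example (the two events chosen are illustrative):

```python
from fractions import Fraction

# The three-coin-toss sample space, with the uniform distribution.
omega = ["HHH", "HHT", "HTH", "HTT", "THH", "THT", "TTH", "TTT"]

def p(event):
    return Fraction(len(event), len(omega))

A1 = {o for o in omega if o.count("T") == 2}  # exactly two tails
A2 = {o for o in omega if o.count("T") == 0}  # no tails
assert A1 & A2 == set()                       # the events are disjoint
assert p(A1 | A2) == p(A1) + p(A2)            # p(A1 ∪ A2) = p(A1) + p(A2)
print(p(A1 | A2))                             # 4/8 = 1/2
```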

9
Conditional Probability and Independence
  • Conditional probabilities measure the probability
    of events given some knowledge.
  • Prior probabilities measure the probabilities of
    events before we consider our additional
    knowledge.
  • Posterior probabilities are probabilities that
    result from using our additional knowledge.

10
Joint and Conditional Probability
  • p(A,B) = p(A ∩ B)
  • p(A|B) = p(A,B) / p(B)
  • Estimating from counts (see the sketch below):
  • p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
  • [Venn diagram: sample space Ω with events A and B overlapping in A ∩ B]
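A minimal sketch of the counting estimate c(A ∩ B) / c(B); the sample outcomes and the two events are made up for illustration:

```python
def conditional_prob(outcomes, in_A, in_B):
    """Estimate p(A|B) = c(A ∩ B) / c(B) from a sample; in_A and in_B are
    predicates marking membership in the two events."""
    c_B = sum(1 for o in outcomes if in_B(o))
    c_AB = sum(1 for o in outcomes if in_A(o) and in_B(o))
    return c_AB / c_B if c_B else 0.0

# A = "exactly two tails", B = "first toss is heads" on three-coin-toss outcomes.
sample = ["HTT", "THT", "HHH", "HTT", "HHT", "TTT", "HTH"]
print(conditional_prob(sample, lambda o: o.count("T") == 2, lambda o: o[0] == "H"))
```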

11
Bayes Rule
  • p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
  • therefore p(A|B) p(B) = p(B|A) p(A), and therefore p(A|B) = p(B|A) p(A) / p(B) !



12
Independence
  • Can we compute p(A,B) from p(A) and p(B)?
  • Recall from the previous foil:
  • p(A|B) = p(B|A) p(A) / p(B)
  • p(A|B) p(B) = p(B|A) p(A)
  • p(A,B) = p(B|A) p(A)
  • ... we're almost there: how does p(B|A) relate to p(B)?
  • p(B|A) = p(B) iff A and B are independent
  • Example: two coin tosses; the weather today and the weather on March 4th, 1789
  • Any two events for which p(B|A) = p(B)! (see the sketch below)
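A minimal sketch checking independence of two events on the two-coin-toss space (the events chosen are illustrative):

```python
from itertools import product

# Two fair coin tosses: four equally likely outcomes HH, HT, TH, TT.
omega = ["".join(t) for t in product("HT", repeat=2)]

def p(event):
    return len(event) / len(omega)

A = {o for o in omega if o[0] == "H"}  # first toss is heads
B = {o for o in omega if o[1] == "H"}  # second toss is heads

# Independence: p(A ∩ B) = p(A) p(B), equivalently p(B|A) = p(B).
print(p(A & B), p(A) * p(B))  # 0.25 0.25
```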

13
The Golden Rule (of Classic Statistical NLP)
  • Interested in an event A given B (where it is not easy, practical, or desirable to estimate p(A|B) directly)
  • take Bayes rule, max over all A's:
  • argmax_A p(A|B) = argmax_A p(B|A) · p(A) / p(B) = argmax_A p(B|A) p(A) !
  • ... as p(B) is constant when changing A's (see the toy decoder below)
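A toy sketch of the golden rule as an argmax; the candidate labels and all probabilities are made-up illustrative numbers, not from the slides:

```python
# Pick the A that maximizes p(B|A) * p(A); p(B) can be ignored, as it is
# the same for every candidate A.
prior = {"noun": 0.5, "verb": 0.3, "adj": 0.2}          # p(A), made up
likelihood = {"noun": 0.01, "verb": 0.05, "adj": 0.02}  # p(B|A), made up

best = max(prior, key=lambda a: likelihood[a] * prior[a])
print(best)  # "verb": 0.05 * 0.3 = 0.015 beats 0.005 and 0.004
```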

14
Random Variables
  • a random variable is a function X: Ω → Q
  • in general Q ⊆ R^n, typically Q = R
  • easier to handle real numbers than real-world events
  • a random variable is discrete if Q is countable (i.e., also if finite)
  • Example: die: natural numbering {1, ..., 6}; coin: {0, 1}
  • Probability distribution:
  • p_X(x) = p(X = x) =_df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
  • often just p(x) if it is clear from context what X is (see the sketch below)
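A minimal sketch deriving p_X from the underlying distribution on Ω, here for the sum of two fair dice (the choice of X is illustrative):

```python
from collections import defaultdict
from itertools import product

# Sample space: all ordered outcomes of two fair dice, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """A random variable: the sum of the two dice (a function Ω → R)."""
    return outcome[0] + outcome[1]

# p_X(x) = p(A_x) where A_x = {a ∈ Ω : X(a) = x}.
p_X = defaultdict(float)
for a in omega:
    p_X[X(a)] += 1 / len(omega)

print(p_X[7])  # 6/36 ≈ 0.167
```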

15
Expectation; Joint and Conditional Distributions
  • the expectation is the mean of a random variable (a weighted average):
  • E(X) = ∑_{x ∈ X(Ω)} x · p_X(x)
  • Example: one six-sided die: 3.5; two dice (sum): 7 (see the sketch below)
  • Joint and conditional distribution rules:
  • analogous to probabilities of events
  • Bayes: p_{X|Y}(x,y), notation p_{X|Y}(x|y), even simpler notation p(x|y) = p(y|x) · p(x) / p(y)
  • Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)
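A one-line sketch of the expectation formula for a single fair die:

```python
# E(X) = Σ_x x · p_X(x) for one fair six-sided die.
p_X = {x: 1 / 6 for x in range(1, 7)}
print(sum(x * p for x, p in p_X.items()))  # 3.5; for the sum of two dice, 7
```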

16
Chain Rule
  • p(A_1, A_2, A_3, A_4, ..., A_n) =
  • p(A_1|A_2,A_3,A_4,...,A_n) · p(A_2|A_3,A_4,...,A_n) · p(A_3|A_4,...,A_n) · ... · p(A_{n-1}|A_n) · p(A_n)
  • this is a direct consequence of the Bayes rule (verified on a toy joint below).
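A minimal check of the two-event case p(a,b) = p(a|b) p(b), which iterates into the full chain rule; the joint distribution below is made up:

```python
# A tiny made-up joint distribution over two variables.
joint = {("x1", "y1"): 0.2, ("x1", "y2"): 0.3,
         ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

def p_b(b):
    """Marginal p(b)."""
    return sum(v for (_, b2), v in joint.items() if b2 == b)

def p_a_given_b(a, b):
    """Conditional p(a|b) = p(a,b) / p(b)."""
    return joint[(a, b)] / p_b(b)

# Chain rule, two-event case: p(a,b) = p(a|b) p(b).
for (a, b), v in joint.items():
    assert abs(p_a_given_b(a, b) * p_b(b) - v) < 1e-12
print("chain rule holds on this joint")
```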

17
Estimating Probability Functions
  • What is the probability that the sentence "The cow chewed its cud" will be uttered? Unknown ⇒ P must be estimated from a sample of data.
  • An important measure for estimating P is the relative frequency of the outcome, i.e., the proportion of times a certain outcome occurs.
  • Assuming that certain aspects of language can be modeled by one of the well-known distributions is called using a parametric approach.
  • If no such assumption can be made, we must use a non-parametric approach.

18
Standard Distributions
  • In practice, one commonly finds the same basic
    form of a probability mass function, but with
    different constants employed.
  • Families of pmfs are called distributions and the
    constants that define the different possible pmfs
    in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

19
Standard distributions
  • Binomial (discrete)
  • outcome: 0 or 1 (thus "binomial")
  • make n trials
  • interested in the (probability of the) number of successes r
  • Must be careful: it's not uniform!
  • p_b(r|n) = C(n,r) / 2^n (for equally likely outcomes; see the sketch below)
  • C(n,r) counts how many possibilities there are for choosing r objects out of n: n! / ((n−r)! r!)
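A minimal sketch of the binomial pmf; the general success probability p is an assumption beyond the slide's fair-coin case, which it reduces to at p = 1/2:

```python
from math import comb

def binomial_pmf(r, n, p=0.5):
    """p_b(r|n): probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# For p = 1/2 this is C(n, r) / 2**n, as on the slide:
print(binomial_pmf(2, 3))  # 3/8 = 0.375, matching the two-tails example
```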
20
Continuous Distributions
  • The normal distribution ("Gaussian")
  • p_norm(x | μ, σ) = e^(−(x−μ)²/(2σ²)) / (σ√(2π))
  • where:
  • μ is the mean (the x-coordinate of the peak) (0 for the standard normal)
  • σ is the standard deviation (1 for the standard normal)
  • [plot: the standard normal density curve over x]
  • other continuous distributions: hyperbolic, t (see the sketch below)
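A direct transcription of the density formula above into a minimal sketch:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma²)."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

print(normal_pdf(0.0))  # peak of the standard normal: 1/√(2π) ≈ 0.3989
```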

21
Bayesian Statistics I: Bayesian Updating
  • Given an a-priori probability distribution, we
    can update our beliefs when a new datum comes in
    by calculating the Maximum A Posteriori (MAP)
    distribution.
  • The MAP probability becomes the new prior and the
    process repeats on each new datum.
  • Assume that the data come in sequentially and are independent (see the sketch below).
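A minimal sketch of sequential updating for a coin's bias over a small discrete grid of hypotheses; the grid and the data stream are illustrative:

```python
# Candidate values of p(heads) and a uniform prior over them (both made up).
hypotheses = [0.25, 0.5, 0.75]
prior = {h: 1 / 3 for h in hypotheses}

for datum in "HHTH":  # data arrive one at a time, assumed independent
    likelihood = {h: (h if datum == "H" else 1 - h) for h in hypotheses}
    unnorm = {h: likelihood[h] * prior[h] for h in hypotheses}
    z = sum(unnorm.values())
    prior = {h: v / z for h, v in unnorm.items()}  # posterior becomes the new prior

print(max(prior, key=prior.get), prior)  # MAP hypothesis after all the data
```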

22
Bayesian Statistics II: Bayesian Decision Theory
  • Bayesian Statistics can be used to evaluate which
    model or family of models better explains some
    data.
  • We define two different models of the event and calculate the likelihood ratio between these two models (as sketched below).
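A minimal sketch of a likelihood ratio between two illustrative coin models; the bias values and the data are made up:

```python
def likelihood(p_heads, data):
    """Likelihood of a sequence of independent coin tosses under a given bias."""
    out = 1.0
    for d in data:
        out *= p_heads if d == "H" else 1 - p_heads
    return out

data = "HHTHHH"  # made-up observations
lr = likelihood(0.7, data) / likelihood(0.5, data)
print(lr)  # > 1 favors the biased model (p = 0.7) over the fair coin
```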