1
Statistical Learning (From data to distributions)
2
Reminders
  • HW5 deadline extended to Friday

3
Agenda
  • Learning a probability distribution from data
  • Maximum likelihood estimation (MLE)
  • Maximum a posteriori (MAP) estimation
  • Expectation Maximization (EM)

4
Motivation
  • Agent has made observations (data)
  • Now must make sense of it (hypotheses)
  • Hypotheses alone may be important (e.g., in basic
    science)
  • For inference (e.g., forecasting)
  • To take sensible actions (decision making)
  • A basic component of economics, the social and hard sciences, engineering, ...

5
Candy Example
  • Candy comes in 2 flavors, cherry and lime, with
    identical wrappers
  • Manufacturer makes 5 (indistinguishable) bags
  • Suppose we draw a sequence of candies from our bag
  • What bag are we holding? What flavor will we draw
    next?

H1: 100% cherry, 0% lime
H2: 75% cherry, 25% lime
H3: 50% cherry, 50% lime
H4: 25% cherry, 75% lime
H5: 0% cherry, 100% lime
6
Machine Learning vs. Statistics
  • Machine Learning ≈ automated statistics
  • This lecture:
  • Bayesian learning, the more traditional statistics (R&N 20.1-3)
  • Learning Bayes nets

7
Bayesian Learning
  • Main idea: consider the probability of each hypothesis, given the data
  • Data: d
  • Hypotheses: compute P(hi|d)

h1: 100% cherry, 0% lime
h2: 75% cherry, 25% lime
h3: 50% cherry, 50% lime
h4: 25% cherry, 75% lime
h5: 0% cherry, 100% lime
8
Using Bayes Rule
  • P(hi|d) = α P(d|hi) P(hi) is the posterior
  • (Recall, 1/α = Σi P(d|hi) P(hi))
  • P(d|hi) is the likelihood
  • P(hi) is the hypothesis prior

h1: 100% cherry, 0% lime
h2: 75% cherry, 25% lime
h3: 50% cherry, 50% lime
h4: 25% cherry, 75% lime
h5: 0% cherry, 100% lime
9
Computing the Posterior
  • Assume draws are independent
  • Let P(h1),...,P(h5) = (0.1, 0.2, 0.4, 0.2, 0.1)
  • d = 10 lime candies in a row

P(d|h1) = 0, P(d|h2) = 0.25^10, P(d|h3) = 0.5^10, P(d|h4) = 0.75^10, P(d|h5) = 1^10
10
Posterior Hypotheses
11
Predicting the Next Draw
  • P(X|d) = Σi P(X|hi, d) P(hi|d) = Σi P(X|hi) P(hi|d)

(Figure: Bayes net with the hypothesis H as parent of the observed data D and the next draw X)
Probability that the next candy drawn is a lime:
P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90
P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1
P(X|d) ≈ 0.975
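The numbers above can be reproduced in a few lines. The following is a minimal sketch in plain Python (not part of the original slides); the priors, per-bag lime probabilities, and the ten-lime data set are taken from the slides.

```python
# Bayesian posterior and predictive for the candy-bag example (plain Python).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(next candy is lime | hi)
N = 10                                   # data: 10 limes in a row

# Likelihood of the data under each hypothesis: P(d|hi) = P(lime|hi)^10
likelihoods = [p ** N for p in p_lime]

# Posterior: P(hi|d) proportional to P(d|hi) P(hi), normalized to sum to 1
unnormalized = [l * p for l, p in zip(likelihoods, priors)]
z = sum(unnormalized)
posteriors = [u / z for u in unnormalized]

# Bayesian prediction: P(X|d) = sum_i P(X|hi) P(hi|d)
p_next_lime = sum(px * ph for px, ph in zip(p_lime, posteriors))
print(posteriors)     # approx [0, 0.00, 0.003, 0.10, 0.90]
print(p_next_lime)    # approx 0.973 (the slide's 0.975 uses the rounded posteriors)
```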
12
P(Next Candy is Lime | d)
13
Other properties of Bayesian Estimation
  • Any learning technique trades off between good
    fit and hypothesis complexity
  • Prior can penalize complex hypotheses
  • There are many more complex hypotheses than simple ones
  • Ockham's razor

14
Hypothesis Spaces often Intractable
  • A hypothesis is a joint probability table over
    state variables
  • 2^n entries ⇒ the hypothesis space is [0,1]^(2^n)
  • 2^(2^n) deterministic hypotheses; 6 boolean variables ⇒ 2^64, over 10^19 hypotheses
  • Summing over hypotheses is expensive!

15
Some Common Simplifications
  • Maximum a posteriori estimation (MAP)
  • hMAP = argmax_hi P(hi|d)
  • P(X|d) ≈ P(X|hMAP)
  • Maximum likelihood estimation (ML)
  • hML = argmax_hi P(d|hi)
  • P(X|d) ≈ P(X|hML)
  • Both approximate the true Bayesian predictions as the amount of data grows large

16
Maximum a Posteriori
  • hMAP = argmax_hi P(hi|d)
  • P(X|d) ≈ P(X|hMAP)

(Figure: P(X|hMAP) vs. P(X|d), with hMAP moving through h3, h4, h5 as data accumulate)
17
Maximum a Posteriori
  • For large amounts of data, P(incorrect hypothesis | d) → 0
  • For small sample sizes, MAP predictions are
    overconfident

(Figure: P(X|hMAP) vs. P(X|d))
18
Maximum Likelihood
  • hML = argmax_hi P(d|hi)
  • P(X|d) ≈ P(X|hML)

(Figure: P(X|hML) vs. P(X|d); hML is undefined before any data are seen, then becomes h5)
19
Maximum Likelihood
  • hML = hMAP with a uniform prior
  • The relevance of the prior diminishes with more data
  • Preferred by some statisticians
  • Are priors cheating?
  • What is a prior anyway?

20
Advantages of MAP and MLE over Bayesian estimation
  • Involves an optimization rather than a large
    summation
  • Local search techniques
  • For some types of distributions, there are
    closed-form solutions that are easily computed

21
Learning Coin Flips (Bernoulli distribution)
  • Let the unknown fraction of cherries be θ
  • Suppose draws are independent and identically distributed (i.i.d.)
  • Observe that c out of N draws are cherries

22
Maximum Likelihood
  • Likelihood of data d = d1,...,dN given θ:
  • P(d|θ) = Πj P(dj|θ) = θ^c (1-θ)^(N-c)

(The first equality uses the i.i.d. assumption; the second gathers the c cherry terms together, then the N-c lime terms.)
23
Maximum Likelihood
  • Same as maximizing log likelihood
  • L(d|θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
  • maxθ L(d|θ) ⇒ dL/dθ = 0 ⇒ 0 = c/θ - (N-c)/(1-θ) ⇒ θ = c/N
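A minimal sketch of the closed-form result θ = c/N (the function name and example data are illustrative, not from the slides):

```python
# Bernoulli ML estimate: the empirical fraction of cherries, theta_ML = c / N.
def bernoulli_mle(draws):
    """draws: list of observations, e.g. 'cherry'/'lime' strings."""
    c = sum(1 for d in draws if d == "cherry")   # count of cherries
    return c / len(draws)                         # theta_ML = c / N

print(bernoulli_mle(["cherry", "lime", "cherry", "cherry"]))   # 0.75
```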

24
Maximum Likelihood for BN
  • For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data

(Example: N = 1000 samples, with B: 200 burglaries and E: 500 earthquakes, so P(B) = 0.2 and P(E) = 0.5.
Network: Burglar → Alarm ← Earthquake.
Observed alarm fractions: A|E,B = 19/20, A|B = 188/200, A|E = 170/500, A = 1/380.)
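A sketch of this counting procedure, assuming the data arrive as boolean records; the helper name ml_cpt and the record layout are hypothetical, not from the slides:

```python
# ML estimation of a CPT by counting: for each setting of the parents,
# the parameter is the observed fraction of records in which the child is true.
from collections import Counter

def ml_cpt(records, child, parents):
    """records: list of dicts mapping variable name -> bool."""
    trials, successes = Counter(), Counter()
    for r in records:
        key = tuple(r[p] for p in parents)
        trials[key] += 1
        successes[key] += r[child]        # bool counts as 0/1
    return {key: successes[key] / trials[key] for key in trials}

# Toy example: P(Alarm | Earthquake, Burglar) from two boolean records
data = [{"Earthquake": True, "Burglar": False, "Alarm": True},
        {"Earthquake": False, "Burglar": False, "Alarm": False}]
print(ml_cpt(data, "Alarm", ["Earthquake", "Burglar"]))
```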
25
Maximum Likelihood for Gaussian Models
  • Observe a continuous variable x1,...,xN
  • Fit a Gaussian with mean μ, standard deviation σ
  • Standard procedure: write the log likelihood L = N(C - log σ) - Σj (xj-μ)²/(2σ²)
  • Set derivatives to zero

26
Maximum Likelihood for Gaussian Models
  • Observe a continuous variable x1,...,xN
  • Results: μ = (1/N) Σj xj (sample mean), σ² = (1/N) Σj (xj-μ)² (sample variance)
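A minimal sketch of these closed-form estimates (the example data are made up, not from the slides):

```python
# Gaussian ML estimates: sample mean and the uncorrected (1/N) sample variance.
import math

def gaussian_mle(xs):
    n = len(xs)
    mu = sum(xs) / n                              # mu = (1/N) sum x_j
    var = sum((x - mu) ** 2 for x in xs) / n      # sigma^2 = (1/N) sum (x_j - mu)^2
    return mu, math.sqrt(var)

print(gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # (5.0, 2.0)
```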

27
Maximum Likelihood for Conditional Linear
Gaussians
  • Y is a child of X
  • Data: (xj, yj)
  • X is Gaussian, Y is a linear Gaussian function of X
  • Y(x) ~ N(ax + b, σ)
  • The ML estimates of a, b are given by least-squares regression, and σ by the standard errors

(Network: X → Y)
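A sketch of the least-squares computation for a single scalar parent, assuming Y(x) = ax + b plus Gaussian noise; the helper name and example data are illustrative, not from the slides:

```python
# ML for a conditional linear Gaussian: ordinary least squares for (a, b),
# with sigma estimated from the residuals.
import math

def linear_gaussian_mle(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx                      # slope from least squares
    b = my - a * mx                    # intercept
    residual_var = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, math.sqrt(residual_var)

print(linear_gaussian_mle([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (2.0, 1.0, 0.0)
```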
28
Back to Coin Flips
  • What about Bayesian or MAP learning?
  • Motivation
  • I pick a coin out of my pocket
  • 1 flip turns up heads
  • What's the MLE?

29
Back to Coin Flips
  • Need some prior distribution P(θ)
  • P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)

Define, for all θ, the probability that I believe in θ.

(Figure: a prior density P(θ) over θ ∈ [0, 1])
30
MAP estimate
  • Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization
  • It turns out that for some families of P(θ), the MAP estimate is easy to compute

(Conjugate prior)

(Figure: several Beta distributions P(θ) over θ ∈ [0, 1])
31
Beta Distribution
  • Beta_{a,b}(θ) = α θ^(a-1) (1-θ)^(b-1)
  • a, b are hyperparameters
  • α is a normalization constant
  • Mean at a/(a+b)

32
Posterior with Beta Prior
  • Posterior: θ^c (1-θ)^(N-c) P(θ) = α θ^(c+a-1) (1-θ)^(N-c+b-1)
  • MAP estimate: θ = (c+a)/(N+a+b)
  • The posterior is also a Beta distribution!
  • See heads, increment a
  • See tails, increment b
  • The prior specifies a "virtual count" of a heads, b tails
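A minimal sketch of this conjugate update and the hyperparameter bookkeeping described above (the helper name and example values are illustrative, not from the slides):

```python
# Beta-Bernoulli conjugate update: with a Beta(a, b) prior and c cherries
# in N draws, the posterior is Beta(a + c, b + N - c).
def beta_update(a, b, c, n):
    """Return posterior hyperparameters and the slide's estimate (c+a)/(N+a+b)."""
    post_a, post_b = a + c, b + (n - c)          # posterior is Beta(a+c, b+N-c)
    estimate = (c + a) / (n + a + b)             # the slide's estimate; equals post_a/(post_a+post_b)
    return post_a, post_b, estimate

print(beta_update(a=2, b=2, c=7, n=10))          # (9, 5, 0.642...)
```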

33
Does this work in general?
  • Only specific distributions have the right type
    of prior
  • Bernoulli, Poisson, geometric, Gaussian, exponential, ...
  • Otherwise, MAP needs an (often expensive) numerical optimization

34
How to deal with missing observations?
  • Very difficult statistical problem in general
  • E.g., surveys
  • Did the person leave political affiliation blank at random?
  • Or do independents do this more often than someone with a strong affiliation?
  • The situation is better if a variable is completely hidden

35
Expectation Maximization for Gaussian Mixture Models
  • Clustering with N Gaussian distributions
  • Each data point has a label saying which Gaussian it belongs to, but the label is a hidden variable
  • E step: compute the probability that each data point belongs to each Gaussian
  • M step: compute ML estimates of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
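A minimal sketch of EM for a one-dimensional Gaussian mixture, assuming a fixed number of components and random initialization (the function names and toy data are illustrative, not from the slides):

```python
# EM for a 1-D mixture of Gaussians.
# E-step: responsibilities resp[j][k] = P(component k | x_j) under current params.
# M-step: re-estimate weights, means, variances, weighting each point by resp.
import math, random

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(xs, k, iters=50):
    random.seed(0)
    weights = [1.0 / k] * k
    means = random.sample(xs, k)                 # initialize means at k random data points
    variances = [1.0] * k
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to each Gaussian
        resp = []
        for x in xs:
            p = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            z = sum(p)
            resp.append([pk / z for pk in p])
        # M-step: weighted ML estimates of each Gaussian's parameters
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(sum(r[j] * (x - means[j]) ** 2
                                   for r, x in zip(resp, xs)) / nj, 1e-6)
    return weights, means, variances

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
print(em_gmm(data, k=2))
```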
36
Learning HMMs
  • Want to find transition and observation
    probabilities
  • Data: many sequences O_{1:t}^(j) for 1 ≤ j ≤ N
  • Problem: we don't observe the X's!

(Figure: HMM with hidden states X0, X1, X2, X3 and observations O1, O2, O3)
37
Learning HMMs
  • Assume a stationary Markov chain with discrete states x1,...,xm
  • Transition parameters: θij = P(Xt+1 = xj | Xt = xi)
  • Observation parameters: φi = P(O | Xt = xi)

(Figure: HMM with hidden states X0, X1, X2, X3 and observations O1, O2, O3)
38
Learning HMMs
  • Assume a stationary Markov chain with discrete states x1,...,xm
  • Transition parameters: θij = P(Xt+1 = xj | Xt = xi)
  • Observation parameters: φi = P(O | Xt = xi)
  • Initial state parameters: λi = P(X0 = xi)

(Figure: 3-state model with states x1, x2, x3, transition parameters such as θ13, θ31, and observation parameters φ2, φ3 for the output O)
39
Expectation Maximization
  • Initialize parameters randomly
  • E-step: infer the expected probabilities of the hidden variables over time, given the current parameters
  • M-step: maximize the likelihood of the data over the parameters

θ = (λ1, λ2, λ3, θ11, θ12, ..., θ32, θ33, φ1, φ2, φ3)
(λi = P(initial state), θij = P(transition i→j), φi = P(emission))

(Figure: the same 3-state model x1, x2, x3 with transitions θ13, θ31 and emissions φ2, φ3 to the output O)
40
Expectation Maximization
θ = (λ1, λ2, λ3, θ11, θ12, ..., θ32, θ33, φ1, φ2, φ3)
Initialize θ(0)
E: compute the expectation over P(Z = z | θ(0), O), where Z ranges over all combinations of hidden state sequences
Result: a probability distribution over the hidden state at each time t
M: compute θ(1), the ML estimate of the transition / observation distributions

(Figure: candidate hidden state sequences over x1, x2, x3, evaluated against the 3-state model)
41
Expectation Maximization
θ = (λ1, λ2, λ3, θ11, θ12, ..., θ32, θ33, φ1, φ2, φ3)
Initialize θ(0)
E: compute the expectation over P(Z = z | θ(0), O), where Z ranges over all combinations of hidden state sequences (this is the hard part)
Result: a probability distribution over the hidden state at each time t
M: compute θ(1), the ML estimate of the transition / observation distributions

(Figure: same as the previous slide)
42
E-Step on HMMs
  • Computing the expectations can be done by:
  • Sampling
  • Using the forward/backward algorithm on the unrolled HMM (R&N p. 546)
  • The latter gives the classic Baum-Welch algorithm
  • Note that EM can still get stuck in local optima
    or even saddle points
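A sketch of the forward/backward smoothing pass that supplies the E-step quantities P(Xt | O1:T); the transition and emission matrices in the example are made up, and a full Baum-Welch implementation would also accumulate the pairwise state expectations needed for the M-step:

```python
# Forward-backward smoothing for a discrete HMM with fixed parameters.
def forward_backward(obs, init, trans, emit):
    """init[i], trans[i][j], emit[i][o]: model parameters over m hidden states."""
    m, T = len(init), len(obs)
    # Forward pass: alpha[t][i] proportional to P(O_1..t, X_t = i)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(m)]]
    for t in range(1, T):
        alpha.append([emit[i][obs[t]] * sum(alpha[t - 1][j] * trans[j][i] for j in range(m))
                      for i in range(m)])
    # Backward pass: beta[t][i] = P(O_{t+1}..T | X_t = i)
    beta = [[1.0] * m for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] for j in range(m))
                   for i in range(m)]
    # Smoothed posteriors gamma[t][i] = P(X_t = i | O_1..T)
    gammas = []
    for t in range(T):
        g = [alpha[t][i] * beta[t][i] for i in range(m)]
        z = sum(g)
        gammas.append([gi / z for gi in g])
    return gammas

# Tiny example: 2 hidden states, observations coded as 0/1 (all numbers made up)
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_backward([0, 1, 0], init, trans, emit))
```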

43
Next Time
  • Machine learning