Title: Exact Inference by Complete Enumeration
1. Exact Inference by Complete Enumeration
2. Overview
- What (Inference recap)
- Why (Burglary example)
- How (Enumeration)
3. Inference
E.g. infer parameters θ in model H from observed data D:

P(θ | D, H) = P(D | θ, H) P(θ | H) / P(D | H)

where P(D | θ, H) is the likelihood, P(θ | H) is the prior, and P(D | H) is the evidence (normalization), obtained by marginalization:

P(D | H) = Σ_θ P(D | θ, H) P(θ | H)   (an integral for continuous θ)
4. Inference by complete enumeration
List the values of P(θ | D, H) for all possible θ.
Remark on the burglary example: the noisy-OR combination of false alarms caused by the "rest of the world" with burglars and earthquakes is treated implicitly in the example and should be made explicit; the value of P(a=1 | b=1, e=1) is far from trivial to prove. I will therefore treat a simplified case with only burglars and false alarms.
5. Are you being robbed?
Inference example: you are at work, far from home. Your neighbor phones to tell you that your burglar alarm is ringing. What is the probability that there is a burglar in your home?
With b = burglar, f = false alarm (e.g. an electrical glitch), a = alarm, and p = phone call:

P(b, f, a, p) = P(p | b, f, a) P(b, f, a) = P(p | a) P(a | b, f) P(b) P(f)   (1)
6. P(b=1 | p=1) = ? Use marginalization:

P(b=1 | p=1) = P(b=1, a=0, f=0 | p=1) + P(b=1, a=1, f=0 | p=1)
             + P(b=1, a=0, f=1 | p=1) + P(b=1, a=1, f=1 | p=1)
7. P(b=1 | p=1) = P(b=1, f=0 | a=1, p=1) + P(b=1, f=1 | a=1, p=1)
Note that b and f are not conditionally independent given the alarm a.
8. So, P(b=1 | p=1) = 0.495. When we learn that the alarm was due to an electrical glitch (f=1), the posterior probability of b=1 becomes much smaller. So, one cause of the alarm, b=1, becomes less probable when another cause, f=1, becomes more probable, even though these two causes are independent a priori! This is called explaining away, and we do it intuitively.
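A minimal Python sketch of this complete enumeration (the conditional probability values below are made up for illustration; the slide's actual numbers, which yield 0.495, are not reproduced here):

```python
import itertools

# Illustrative (made-up) probabilities -- assumptions for this sketch only.
P_b1 = 0.01                         # prior probability of a burglar
P_f1 = 0.01                         # prior probability of an electrical glitch
P_a1_given_bf = {(0, 0): 0.001, (0, 1): 0.99, (1, 0): 0.99, (1, 1): 0.9999}
P_p1_given_a = {0: 0.0, 1: 0.9}     # the neighbor only phones if the alarm rings

def joint(b, f, a, p):
    """P(b, f, a, p) = P(p | a) P(a | b, f) P(b) P(f), equation (1)."""
    pb = P_b1 if b else 1 - P_b1
    pf = P_f1 if f else 1 - P_f1
    pa = P_a1_given_bf[(b, f)] if a else 1 - P_a1_given_bf[(b, f)]
    pp = P_p1_given_a[a] if p else 1 - P_p1_given_a[a]
    return pp * pa * pb * pf

def posterior_b1(**observed):
    """P(b=1 | observed) by complete enumeration of all joint assignments."""
    num = den = 0.0
    for b, f, a, p in itertools.product((0, 1), repeat=4):
        assign = dict(b=b, f=f, a=a, p=p)
        if any(assign[k] != v for k, v in observed.items()):
            continue                # inconsistent with the observations
        w = joint(b, f, a, p)
        den += w
        num += w if b == 1 else 0.0
    return num / den

print(posterior_b1(p=1))            # P(b=1 | p=1)
print(posterior_b1(p=1, f=1))       # explaining away: much lower once f=1 is known
```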
9. P(b=1 | p=1) = P(b=1, f=0 | a=1, p=1) + P(b=1, f=1 | a=1, p=1)   (1)
(Alternative to slide 6)
(PK)
P(a=1 | f=1) = 1
10. Exact inference for continuous hypothesis spaces
E.g. infer a Gaussian distribution given observed data {x_n}:

P(x | µ, σ) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))   (1)

P({x_n} | µ, σ) = Π_n P(x_n | µ, σ)   (2)

Discretize µ and σ, with ranges based on the observed data, e.g. µ ∈ [0, 2] and σ ∈ [0, 1] given x = {0.54, 0.55, 0.56, 0.57, 1.8}.
Evaluate the likelihood (2) at each point of this two-dimensional grid.
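A minimal Python sketch of this grid evaluation (the grid resolution and the flat prior are assumptions of the sketch):

```python
import numpy as np

# Observed data and grid ranges from the slide.
x = np.array([0.54, 0.55, 0.56, 0.57, 1.8])
mu    = np.linspace(0.0, 2.0, 100)       # grid over the mean
sigma = np.linspace(0.01, 1.0, 100)      # grid over the std (avoid sigma = 0)
MU, SIGMA = np.meshgrid(mu, sigma)

# Log-likelihood (2): sum of log P(x_n | mu, sigma) over all data points.
loglik = sum(-0.5 * np.log(2 * np.pi * SIGMA**2) - (xn - MU)**2 / (2 * SIGMA**2)
             for xn in x)

# With a flat prior, the posterior over the grid is the normalized likelihood.
post = np.exp(loglik - loglik.max())
post /= post.sum()

i, j = np.unravel_index(post.argmax(), post.shape)
print("most probable (mu, sigma) on the grid:", MU[i, j], SIGMA[i, j])
```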
11. Assume we have two models, e.g. H1 = a Gaussian and H2 = a mixture of two Gaussians. Then H1 has parameters θ1 = {µ1, σ1} and H2 has parameters θ2 = {µ2a, σ2a, µ2b, σ2b, a}, with a the mixing proportion. Discretizing the 5-dimensional parameter space θ2, we can evaluate P(x | θ2). Then we could compare the likelihoods of the models by comparing, for i = 1 and i = 2, the evidence

P(x | Hi) = Σ_θi P(x | θi, Hi) P(θi | Hi).

This is a sum over a two- and a five-dimensional parameter space, respectively. For any reasonable accuracy we would need a grid of about 10^(number of parameters) points, so complete enumeration like this quickly becomes computationally infeasible.
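A rough Python sketch of this comparison, reusing the log-likelihood grid from the previous sketch and assuming a uniform prior over the grid points:

```python
import numpy as np

def log_evidence(loglik_grid):
    """log P(x | H) ~= log of the mean likelihood over the parameter grid,
    i.e. a uniform prior P(theta | H) = 1/M on the M grid points."""
    m = loglik_grid.max()
    return m + np.log(np.exp(loglik_grid - m).mean())   # stable log-mean-exp

# Grid sizes explode with dimensionality: roughly 10 points per parameter.
print("H1 (2 parameters):", 10**2, "grid points")
print("H2 (5 parameters):", 10**5, "grid points")
```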
12. Exact Marginalization
13. In chapter 21 we saw that complete enumeration, by discretizing a continuous hypothesis space, is exponential in the number of parameters of the hypothesis and quickly becomes infeasible. A solution is to do exact marginalization over continuous (nuisance) parameters by doing integrals. This is a macho activity, enjoyed only by those without a social life and with an obsession for definite integration. Message of this chapter: do not try this. Stoop to the approximate methods and don't look back.
14. Consider a Gaussian distribution P(x | µ, σ). We want to infer P(µ, σ | x) = P(x | µ, σ) P(µ, σ) / P(x), so we need the prior P(µ, σ). For exact marginalization (integrals), it is often convenient to use a prior with similar properties to the likelihood P(x | µ, σ), so that the integrals remain tractable. Such priors are called conjugate priors. E.g. the prior for µ is chosen as a Gaussian distribution and the prior for σ (which is positive by definition) is chosen as a Gamma distribution (which is like a Gaussian, except that its range runs from 0 to infinity). Here we take P(µ, σ) = P(µ) P(σ) = constant, an improper uninformative prior (uniform over the entire range of µ and σ), for simplicity.
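For reference, one common parameterisation of the Gamma density mentioned above (the scale s and shape c are not specified on the slide):

$$\Gamma(x; s, c) = \frac{1}{\Gamma(c)\, s}\left(\frac{x}{s}\right)^{c-1} \exp\!\left(-\frac{x}{s}\right), \qquad 0 \le x < \infty.$$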
15. All we need to know from the data to calculate this likelihood are S = Σ_n (x_n − x̄)² and the sample mean x̄, which are therefore called the sufficient statistics.
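The likelihood itself is not reproduced on the slide; written in terms of these sufficient statistics, the standard form is:

$$P(\{x_n\}_{n=1}^{N} \mid \mu, \sigma) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\frac{N(\mu - \bar{x})^2 + S}{2\sigma^2}\right).$$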
Given our improper (constant) priors, the posterior probability of µ and σ is proportional to the likelihood, and hence their maxima coincide. Maximizing the likelihood w.r.t. µ and σ (setting the derivatives to 0) gives (µ, σ)_ML = (x̄, (S/N)^(1/2)), where (S/N)^(1/2) is the famous biased estimate of σ, called σ_N.
16. If we just ask what µ might be and do not care about σ, we want the marginalized posterior of µ, which is proportional to the marginalized likelihood. Maximizing it w.r.t. µ gives µ_ML = x̄.
If we want the most likely value of σ and do not care about µ, we can proceed along similar lines and obtain σ_ML = (S/(N−1))^(1/2), which is the famous unbiased estimate of σ, called σ_{N−1}.
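A sketch of the σ case under the flat improper priors assumed above (the Gaussian integral over µ contributes a factor proportional to σ):

$$P(\{x_n\} \mid \sigma) = \int P(\{x_n\} \mid \mu, \sigma)\, d\mu \;\propto\; \sigma^{-(N-1)} \exp\!\left(-\frac{S}{2\sigma^2}\right),$$
$$\frac{d}{d\sigma}\left[-(N-1)\ln\sigma - \frac{S}{2\sigma^2}\right] = 0 \;\Rightarrow\; \sigma_{ML} = \left(\frac{S}{N-1}\right)^{1/2}.$$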
17. So, using Bayesian inference we get, by straightforward reasoning and a simple choice of priors,
(µ, σ)_ML = (x̄, (S/N)^(1/2)),   µ_ML = x̄,   σ_ML = (S/(N−1))^(1/2).
These estimates are also well known from sampling theory, but there they are obtained ad hoc, according to MacKay.
18. Any Questions?