Title: Concepts and Definitions in Bayesian Estimation
1. Concepts and Definitions in Bayesian Estimation
- Robert J. Mislevy
- University of Maryland
- November 29, 2005
2. A quote from Glenn Shafer
- "Probability is not really about numbers; it is about the structure of reasoning."
- Glenn Shafer, quoted in Pearl, 1988, p. 77
3. Views of Probability
- Two conceptions of probability
- Aleatory (chance)
  - Long-run frequencies, mechanisms
  - Probability is a property of the world
- Degree of belief (subjective)
  - Probability is a property of Your state of knowledge (de Finetti) and your model of the situation
- Same formal definitions and machinery in either view
- Aleatory paradigms as an analogical basis for degree of belief (Glenn Shafer)
4. Frames of discernment
- A frame of discernment is all the possible combinations of values of the variables you are working with (Shafer, 1976)
- Discern: detect, recognize, distinguish
- A property of you as much as a property of the world
- Depends on what you know and what your purpose is
- A frame of discernment can evolve over time
  - Medical diagnosis
  - Document literacy example (more information)
5. Frames of Discernment in Assessment
- In the Student Model: determining what aspects of skill and knowledge to use as explicit student-model variables--psychological perspective, grain size, reporting requirements.
- In the Evidence Model: evidence identification (task scoring); evaluation rules map from the unique work product to common observable variables.
- In the Task Model: task variables are aspects of situations that are important to keep track of and manipulate in task design, to achieve the assessment's purpose.
6. (Random) Variables
- We will start with variables that have a finite number of possible values.
- Denote a random variable by upper case, say X.
- Denote particular values and generic values by lower case, x.
- Y is the outcome of a coin flip: y ∈ {h, t}.
- Xi is the answer to Item i: xi ∈ {0, 1}.
7. Finite Probability Distributions
- Finite set of possible values x1, ..., xn
- Prob(X = xj), P(X = xj), or more simply p(xj), is the probability that X takes the value xj.
- 0 ≤ p(xj) ≤ 1.
- Σj p(xj) = 1.
- P(X = xj or X = xm) = p(xj) + p(xm) for j ≠ m.
8. Continuous Probability Distributions
- Infinitely many possible values, e.g., {x : x ∈ [0,1]}, {x : x ∈ (-∞, ∞)}
- Events A1, ..., Am are sets of possible values
  - A1 = {x : x < 0}, A2 = {x : x ∈ (0,1)}, A3 = {x : x > 0}, ...
- P(Aj) is the probability that X takes a value in Aj
- 0 ≤ P(Aj) ≤ 1.
- If A1, ..., Am are disjoint events that exhaust all possible values of x, then Σj P(Aj) = 1.
- If Aj and Ak are disjoint events, P(Aj ∪ Ak) = P(Aj) + P(Ak).
9. The Icy Road Example
Police Inspector Smith is impatiently awaiting the arrival of Mr. Holmes and Dr. Watson. They are late, and Inspector Smith has another important appointment (lunch). Looking out the window, he wonders whether the roads are icy. Both are notoriously bad drivers, so if the roads are icy they are likely to crash. His secretary enters and tells him that Dr. Watson has had a car accident. "Watson? OK. I'll bet the roads are icy! Then Holmes has most probably crashed too. I'll go for lunch now." (based on an example in Jensen, 1996, p. 7)
Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
10. From the Icy Road Example
- Ice: Is there an icy road?
- Values: Yes, No
- Initial probabilities: (.7, .3)
- Note: the choice of values for the variable "icy road," and its probabilities, are determined by knowledge of weather in the area at this time of year, but without having looked out the window.
11. Icy Road Probabilities
- P(Ice = yes) = .7
- P(Ice = no) = .3
12. Joint probability distributions
- Two random variables, X and Y
- P(X = xj, Y = yk), or p(xj, yk), is the probability that X takes the value xj and Y takes the value yk.
- 0 ≤ p(xj, yk) ≤ 1.
- Σj Σk p(xj, yk) = 1.
13. Marginal probability distributions 1
- Two discrete random variables, X and Y
- Recall that P(X = xj, Y = yk), or p(xj, yk), is the joint probability that X takes the value xj and Y takes the value yk.
- The marginal probability of a value xj of X is the sum over all the possible joint probabilities p(xj, yk) with that value of X:
- p(xj) = Σk p(xj, yk).
14. Conditional probability distributions
- Two random variables, X and Y
- P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
- This is how we express relationships among real-world phenomena
  - Coin flip: p(heads) vs. p(heads | BobReport)
  - P(heart attack | age, family history, blood pressure)
  - P(February 10 high temperature | geographical location, February 9 high temperature)
  - IRT: P(Xj = 1) vs. P(Xj = 1 | θ)
15. Conditional probability distributions
- Two discrete random variables, X and Y
- P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
- 0 ≤ p(xj | yk) ≤ 1 for each given yk.
- Σj p(xj | yk) = 1 for each given yk.
- P(X = xj or X = xm | Y = yk) = p(xj | yk) + p(xm | yk).
16. Marginal probability distributions 2
- Two discrete random variables, X and Y
- Recall that p(xj | yk) is the probability that X = xj given Y = yk.
- The marginal probability of a value of X is the sum of its conditional probabilities given all possible values of Y, with each weighted by its probability:
- p(xj) = Σk p(xj | yk) p(yk). (A numeric sketch using the Icy Road example follows below.)
17. Bayes Theorem
- The setup, with two random variables, X and Y
- You know conditional probabilities, p(xj | yk), which tell you what to believe about X if you knew the value of Y.
- You learn X = x; what should you believe about Y?
- You combine two things:
  - Relative conditional probabilities (the likelihood)
  - Previous probabilities about Y values
- p(yk | x) ∝ p(x | yk) p(yk); that is, posterior ∝ likelihood × prior.
18. From the Icy Road Example
- Ice: Is there an icy road?
  - Values: Yes, No
  - Initial probabilities: (.7, .3)
- Watson: Does Watson have a car crash?
  - Values: Yes, No
  - Probabilities conditional on Ice: (.8, .2) if Ice = Yes, (.1, .9) if Ice = No.
19. Icy Road: Conditional Probabilities

  Ice    Watson=Yes   Watson=No
  Yes    .8           .2
  No     .1           .9

- p(Watson=yes | Ice=yes) = .8, p(Watson=no | Ice=yes) = .2

20. Icy Road: Conditional Probabilities

  Ice    Watson=Yes   Watson=No
  Yes    .8           .2
  No     .1           .9

- p(Watson=yes | Ice=no) = .1, p(Watson=no | Ice=no) = .9
21. Icy Road: Likelihoods
- If Watson = no, the likelihoods are the Watson=No column of the table: p(Watson=no | Ice=yes) = .2 and p(Watson=no | Ice=no) = .9. Note the 2/9 ratio.

22. Icy Road: Likelihoods
- If Watson = yes, the likelihoods are the Watson=Yes column: p(Watson=yes | Ice=yes) = .8 and p(Watson=yes | Ice=no) = .1. Note the 8/1 ratio.
23. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Ice = yes: .7 × .8 = .56; Ice = no: .3 × .1 = .03

24. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Note: the products sum to .59, not 1.00. These aren't probabilities.

25. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Divide through by the normalizing constant .59 to get the posterior probabilities (.95, .05) for Ice = (yes, no).
26. Icy Road: Bayes Theorem, if Watson = no
- Prior × Likelihood ∝ Posterior
- The likelihoods are now the Watson=No column: .2 if Ice = yes, .9 if Ice = no.
- Ice = yes: .7 × .2 = .14; Ice = no: .3 × .9 = .27
- Divide through by the normalizing constant (.41) to get the posterior probabilities (.34, .66). (A short computational sketch of both cases follows below.)
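The following is a minimal computational sketch (an illustration, not the slides' own code) of the prior × likelihood ∝ posterior arithmetic for both possible reports about Watson:

```python
# Bayes Theorem for the Icy Road example:
# posterior(Ice) is proportional to prior(Ice) * p(Watson = w | Ice)
prior = {"yes": 0.7, "no": 0.3}
lik = {"yes": {"yes": 0.8, "no": 0.2},   # p(Watson = w | Ice = yes)
       "no":  {"yes": 0.1, "no": 0.9}}   # p(Watson = w | Ice = no)

def posterior(watson):
    unnormalized = {ice: prior[ice] * lik[ice][watson] for ice in prior}
    c = sum(unnormalized.values())        # normalizing constant
    return {ice: round(v / c, 2) for ice, v in unnormalized.items()}, c

print(posterior("yes"))  # ({'yes': 0.95, 'no': 0.05}, 0.59)
print(posterior("no"))   # ({'yes': 0.34, 'no': 0.66}, 0.41)
```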
27. Independence
- The probability of the joint occurrence of values of two variables is always equal to the product of the probabilities individually:
- P(X = x, Y = y) = P(X = x) P(Y = y).
- Equivalent to saying that learning the value of one of the variables does not change your belief about the other.
28. Conditional independence
- The conditional probability of the joint occurrence, given the value of another variable, is always equal to the product of the conditional probabilities:
- P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z).
29. Conditional independence
- "Conditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy actively by organizing our knowledge in a specific way."
- "An important tool in such organization is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them."
- "In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g., syndrome, complication, pathological state) and treats it as a new auxiliary variable that induces conditional independence; dependency between any two interacting systems is fully attributed to the dependencies of each on the auxiliary variable." (Pearl, 1988, p. 44)
30. Working with distributions
31. Discrete Random Variables: Density Functions
- A discrete random variable is characterized by a set of possible values it can take, and the probability assigned to each of those possibilities:
  - x1, ..., xn
  - Prob(X = xj), P(X = xj), or more simply p(xj).
- Can represent this as a histogram
[Histogram of p(x) over x ∈ {0, 1}, with bars at .3 and .7]
32. Continuous Random Variables: Density Functions
- The normal distribution is often written N(μ, σ), showing that it depends on two parameters, the mean μ and the standard deviation σ. (BUGS writes it in terms of the mean μ and the precision, 1/σ².)
- The density function of the normal distribution is
  p(x | μ, σ) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) ).
- Gelman et al. sometimes just write a density as N(x | μ, σ).
33. Continuous Random Variables: Density Functions
- To determine the probability of getting a value less than some value z for a real-valued distribution, integrate the density over (-∞, z). Using the normal(3.5, 3) distribution, Prob(X < 4.5 | μ = 3.5, σ = 3) is the area under the density to the left of 4.5:
  Prob(X < 4.5 | μ = 3.5, σ = 3) = ∫ from -∞ to 4.5 of N(x | 3.5, 3) dx.
34. Continuous Random Variables: Density Functions
- Integrating the density over the entire range of a continuous random variable gives the value 1; e.g.,
  ∫ from -∞ to ∞ of N(x | μ, σ) dx = 1.
- This will become important in Bayesian analysis, because sometimes we can determine functions p that are proportional to the distributions we want, but don't integrate to one.
35. Continuous Random Variables: Averages, or Expected Values
- Suppose you have some function g(x) that is defined over the range of X, and your current belief about x is expressed by a distribution with density p(x). The expected value of g(x) is obtained by integrating g(x) over the range, with respect to p(x):
  E[g(X)] = ∫ g(x) p(x) dx.   (*)
- You obtain the mean of a distribution by calculating (*) with g(x) = x. You get the variance by subtracting the square of the mean from another application of (*) with g(x) = x².
- You can approximate these quantities by taking many draws from p(x), evaluating g at each of them, and examining the distribution of the results. (This works if you can sample in proportion to p too.) A small sampling sketch follows below.
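A small sketch of this sampling idea, assuming for illustration that p(x) is the normal(3.5, 3) density used on an earlier slide:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draws from p(x); here p(x) is taken to be N(3.5, 3) for illustration.
draws = rng.normal(loc=3.5, scale=3.0, size=100_000)

mean_est = draws.mean()                       # approximates (*) with g(x) = x
var_est = (draws**2).mean() - mean_est**2     # E[x^2] minus the squared mean
print(mean_est, var_est)                      # close to 3.5 and 9.0
```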
36. Parametric Distributions--Why?
- Paradigmatic shapes, summarized in terms of a few variables (i.e., the parameters)
- Often straightforward relationships between values of parameters and values of summary features
- Building blocks for large problems (BUGS)
- Computational advantages
  - Conjugate priors make Bayesian inference very simple (used in classic conjugate analyses and in Gibbs sampling)
  - For some, can generate values conveniently (used in Metropolis-Hastings estimation)
37. Some discrete parametric distributions (see Gelman et al., Appendix A)
- Bernoulli. For success/fail in a single binary trial with probability p. E.g., one coin flip, with p the probability of heads.
  - In BUGS notation, r ~ dbern(p), where r ∈ {0, 1}.
- Binomial. For counts of successes in binary trials, each with probability p, in n independent trials. E.g., n coin flips, with p the common probability of heads.
  - In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.
- Categorical. A single trial with a variable that can take values 1, 2, ..., ncat, with respective probabilities (p1, p2, ..., pncat), which sum to one.
  - r ~ dcat(p). (A sampling sketch of all three follows below.)
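For readers more comfortable with code than BUGS notation, here is a hedged sketch of drawing from these three distributions with NumPy; the parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.5, 10
probs = [0.2, 0.5, 0.3]                       # categorical probabilities, sum to one

r_bern = rng.binomial(1, p)                   # Bernoulli: r ~ dbern(p), r in {0, 1}
r_bin = rng.binomial(n, p)                    # Binomial: r ~ dbin(p, n), r in {0, ..., n}
r_cat = rng.choice(len(probs), p=probs) + 1   # Categorical: r ~ dcat(probs), r in {1, ..., ncat}
print(r_bern, r_bin, r_cat)
```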
38. Some discrete parametric distributions: A closer look at the binomial distribution
- Binomial. For counts of successes in binary trials, each with probability p, in n independent trials. E.g., n coin flips, with p the common probability of heads.
- In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.
- The probability function is
  p(r | p, n) = (n choose r) p^r (1-p)^(n-r),
  where r is the count of successes (the variable), n - r is the count of failures, p is the success probability parameter, and 1 - p is the failure probability.
- We will be using this as a likelihood in an example of the use of conjugate distributions.
39. Some continuous parametric distributions (see Gelman et al., Appendix A)
- Normal. Often used in measurement.
  - x ~ dnorm(mu, tau) in BUGS format, written in terms of the precision τ.
- Uniform. Can use as an uninformative prior on an interval.
  - x ~ dunif(a, b)
[Plots: dnorm(0,1) and dunif(3,5) densities]
40. Some continuous parametric distributions (see Gelman et al., Appendix A)
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
  - p ~ dbeta(a, b)
- Gamma. Defined on (0, ∞). Conjugate prior for the precision in the normal distribution.
  - x ~ dgamma(r, mu)
[Plots: dbeta(1.1, 1.1) and dbeta(10, 40) densities]
41. Some continuous parametric distributions: A closer look at the Beta distribution
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
- p ~ dbeta(a, b), with density proportional to p^(a-1) (1-p)^(b-1), where p is the success probability (the variable), 1 - p is the failure probability, a - 1 acts as a pseudo-count of successes, b - 1 as a pseudo-count of failures, and a and b together carry the shape, or prior sample, information.
- Mean(p) = a / (a + b)
- Variance(p) = ab / [(a + b)²(a + b + 1)]
- Mode(p) = (a - 1) / (a + b - 2)
- (A quick check of these formulas appears below.)
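A quick check of these formulas (an illustration only), using the dbeta(10, 40) example plotted on the previous slide:

```python
from scipy import stats

a, b = 10, 40
mean = a / (a + b)                              # Mean(p)
var = a * b / ((a + b)**2 * (a + b + 1))        # Variance(p)
mode = (a - 1) / (a + b - 2)                    # Mode(p), valid for a, b > 1
print(mean, var, mode)
print(stats.beta(a, b).mean(), stats.beta(a, b).var())  # agrees with the formulas
```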
42. Summarizing distributions
- A probability distribution conveys everything you know, and still don't know, about a variable at a given state of information.
- With finite distributions with few variables, as in MSBNx, we can just look at the vector of probabilities.
- With continuous variables, we can look at summary statistics: means, standard deviations, medians, modes, percentile points.
- We can also look at pictures of estimates of the density, and at collections of values drawn from the distribution.
43. Bayes Theorem revisited
- The general form of Bayes Theorem
- Review of Bayes Theorem with finite variables
- An example with a continuous variable
- A beta-binomial example
- A glimpse ahead to computational approaches
- Basic ideas behind...
- Conjugate priors
- Sampling-based approaches (MCMC/BUGS)
44. Bayes Theorem
- The setup, with two random variables, X and Y
- You know conditional probabilities, p(x | y), which tell you what to believe about X if you knew the value of Y.
- You learn X = x; what should you believe about Y?
- You combine two things:
  - Relative conditional probabilities (the likelihood)
  - Previous probabilities about Y values
- p(y | x) ∝ p(x | y) p(y); that is, posterior ∝ likelihood × prior.
- Note: this is proportional to the posterior, not yet equal to it.
45. Bayes Theorem
- Note that likelihood × prior is only proportional to the posterior.
- To make it a proper distribution, we must divide through by the sum over all the possibilities for y, given that X = x--the normalizing constant C.
- If y is a finite variable, this means dividing through by a sum:
  p(y | x) = p(x | y) p(y) / Σy' p(x | y') p(y').
- If y is a continuous variable, it means dividing through by an integral:
  p(y | x) = p(x | y) p(y) / ∫ p(x | y') p(y') dy'.
- posterior = likelihood × prior / C. The integral in the denominator is the joker in the deck!
46. An example with a continuous variable: A beta-binomial example
- The setup: We are flipping a biased coin, where the probability of heads p could be anywhere between 0 and 1. We are interested in p. We will have two sources of information:
  - Prior beliefs, which we will express as a beta distribution, and
  - Data, which will come in the form of counts of heads in 10 independent flips.
47. An example with a continuous variable: A beta-binomial example--the Prior Distribution
- The prior distribution
- Let's suppose we think it is more likely that the coin is close to fair, so p is probably nearer to .5 than it is to either 0 or 1. We don't have any reason to think it is biased toward either heads or tails, so we'll want a prior distribution that is symmetric around .5. We're not very sure about what p might be--say, about as sure as only 6 observations. This corresponds to 3 pseudo-counts of H and 3 of T, which, if we want to use a beta distribution to express this belief, corresponds to beta(4,4).
48. An example with a continuous variable: A beta-binomial example--the Prior Distribution
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
- p ~ dbeta(4, 4), with density proportional to p³(1-p)³: pseudo-counts of 3 successes and 3 failures, with a = b = 4 carrying the shape, or prior sample, information about the success probability p.
- Mean(p) = 4/8 = .5
- Variance(p) = (4)(4) / [(8)²(9)] ≈ .028
- Mode(p) = 3/6 = .5
49. An example with a continuous variable: A beta-binomial example--the Likelihood
- The likelihood
- Next we will flip the coin ten times. Assuming the same true (but unknown to us) value of p is in effect for each of ten independent trials, we can use the binomial distribution to model the probability of getting any number of heads; i.e.,
  p(r | p, n=10) = (10 choose r) p^r (1-p)^(10-r),
  where r is the count of observed successes (the variable), 10 - r is the count of observed failures, p is the success probability parameter, and 1 - p is the failure probability.
50. An example with a continuous variable: A beta-binomial example--the Likelihood
- The likelihood
- We flip the coin ten times and observe 7 heads; i.e., r = 7. The likelihood is obtained using the same form as on the preceding slide, except now r is fixed at 7 and we are interested in the relative value of this function at different possible values of p:
  L(p) ∝ p^7 (1-p)^3.
51. An example with a continuous variable: Obtaining the posterior by Bayes Theorem
- posterior ∝ likelihood × prior
- General form: p(y | x) ∝ p(x | y) p(y).
- In our example, r = 7 plays the role of x, and p plays the role of y. Before normalizing:
  p(p | r=7) ∝ [p^7 (1-p)^3] × [p^3 (1-p)^3] = p^10 (1-p)^6.
- After normalizing, this is the beta(11, 7) distribution.
- Now, how can we get an idea of what this means we believe about p after combining our prior belief and our observations?
52. An example with a continuous variable: In pictures
[Plot: Prior × Likelihood = Posterior]
53. An example with a continuous variable: Using the fact that we have conjugate distributions
- Now p(p | r=7) ∝ p^10 (1-p)^6. This is just the kernel of a beta(11, 7) distribution.
- This is rather special. The data were observed in accordance with a probability function which has that same mathematical form as a likelihood once the data are observed. We chose a prior distribution (in this case, a beta distribution) which combines with the likelihood just so as to produce another distribution in the same parametric family (another beta distribution), just with updated parameters.
- We can work out its summary statistics (a short computational sketch follows below):
  - Mean(p) = 11/18 ≈ .61 (prior mean was .5)
  - Variance(p) = (11)(7) / [(18)²(19)] ≈ .0125 (prior variance was .028)
  - Mode(p) = 10/16 = .625 (prior mode was .5)
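A short computational sketch of the conjugate update and the posterior summaries just described; it simply reproduces the slide's arithmetic:

```python
# Beta(4, 4) prior combined with 7 heads in 10 flips -> Beta(4 + 7, 4 + 3) = Beta(11, 7)
a0, b0 = 4, 4                  # prior parameters (pseudo-counts plus one)
heads, flips = 7, 10
a, b = a0 + heads, b0 + (flips - heads)

mean = a / (a + b)                          # 11/18, about .61
var = a * b / ((a + b)**2 * (a + b + 1))    # about .0125
mode = (a - 1) / (a + b - 2)                # 10/16 = .625
print(a, b, mean, var, mode)
```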
54. An example with a continuous variable: Using BUGS
- What BUGS does in this simple problem with one variable is to sample lots of values from the posterior distribution for p; that is, its distribution as determined first with information from the prior, but further conditional on the observed data. Here are the summary statistics from 50,000 draws:
  - Mean(p) close to the analytic 11/18 ≈ .61 (prior mean was .5)
  - SD(p) ≈ .1116, i.e., Variance(p) ≈ .0125 (prior variance was .028)
  - Mode(p) near .625 (prior mode was .5)
- (A sketch of drawing from the posterior directly appears below.)
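A sketch of the same idea without BUGS: because the posterior here is beta(11, 7), one can draw 50,000 values directly and summarize them (BUGS arrives at its draws by Markov chain simulation rather than by sampling the beta directly):

```python
import numpy as np

rng = np.random.default_rng(2)
draws = rng.beta(11, 7, size=50_000)            # samples from the posterior beta(11, 7)
print(draws.mean(), draws.std(), draws.var())   # about .611, .112, and .0125
```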
55. An example with a continuous variable: Using BUGS
- BUGS setup for this problem
56. Looking ahead to sampling-based approaches with many variables
- BUGS: Bayesian inference Using Gibbs Sampling
- Basic idea: Model a multi-parameter problem in terms of assemblies of distributions and functions for all data and all parameters (taking advantage of conditional independence whenever possible).
  - E.g., p(Data | x, y) p(x | z) p(y) p(z).   (**)
- Observe Data. The posterior p(x, y, z | Data) is proportional to (**). It is hard to evaluate the normalizing constant, but ...
57. Looking ahead to sampling-based approaches with many variables
- Can draw values from full conditional distributions:
  - Start with a possible value for each variable in cycle 0.
  - In cycle t+1,
    - Draw x_(t+1) from p(x | Y = y_t, Z = z_t, Data)
    - Draw y_(t+1) from p(y | X = x_(t+1), Z = z_t, Data)
    - Draw z_(t+1) from p(z | X = x_(t+1), Y = y_(t+1), Data)
- Under suitable conditions, this series of draws comes to approximate draws from the actual joint posterior for all the parameters. (A minimal Gibbs sketch for a toy model follows below.)
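A minimal Gibbs-sampling sketch, using an assumed toy model (a standard bivariate normal with correlation rho, not the x, y, z model above) because its full conditionals are known in closed form; each cycle draws each variable from its conditional given the current value of the other:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                        # assumed correlation of a standard bivariate normal
n_cycles = 10_000
x, y = 0.0, 0.0                  # cycle-0 starting values
draws = np.empty((n_cycles, 2))

for t in range(n_cycles):
    # Full conditionals of the standard bivariate normal:
    # x | y ~ N(rho*y, 1 - rho^2),  y | x ~ N(rho*x, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    draws[t] = (x, y)

# After a burn-in period, the draws approximate draws from the joint distribution.
print(draws[1000:].mean(axis=0))                 # close to (0, 0)
print(np.corrcoef(draws[1000:].T)[0, 1])         # close to 0.8
```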