Title: Concepts and Definitions in Bayesian Estimation
1. Concepts and Definitions in Bayesian Estimation
- Robert J. Mislevy
- University of Maryland
- November 29, 2005
2. A quote from Glenn Shafer
- "Probability is not really about numbers; it is about the structure of reasoning."
- Glenn Shafer, quoted in Pearl, 1988, p. 77
3. Views of Probability
- Two conceptions of probability
- Aleatory (chance)
  - Long-run frequencies, mechanisms
  - Probability is a property of the world
- Degree of belief (subjective)
  - Probability is a property of Your state of knowledge (de Finetti) and your model of the situation
- Same formal definitions and machinery in either view
- Aleatory paradigms as an analogical basis for degree of belief (Glenn Shafer)
4. Frames of discernment
- A frame of discernment is all the possible combinations of values of the variables you are working with (Shafer, 1976)
- Discern: detect, recognize, distinguish
- A property of you as much as a property of the world
- Depends on what you know and what your purpose is
- A frame of discernment can evolve over time
  - Medical diagnosis
  - Document literacy example (more information)
5. Frames of Discernment in Assessment
- In the Student Model: determining what aspects of skill and knowledge to use as explicit student-model variables--psychological perspective, grain size, reporting requirements.
- In the Evidence Model: evidence identification (task scoring); evaluation rules map from the unique work product to common observable variables.
- In the Task Model: task variables are aspects of situations that are important to keep track of and manipulate in task design, to achieve the assessment's purpose.
6. (Random) Variables
- We will start with variables that have a finite number of possible values.
- Denote a random variable by upper case, say X.
- Denote particular values and generic values by lower case, x.
- Y is the outcome of a coin flip: y ∈ {h, t}.
- Xi is the answer to Item i: xi ∈ {0, 1}.
7. Finite Probability Distributions
- Finite set of possible values x1, ..., xn
- Prob(X = xj), P(X = xj), or more simply p(xj), is the probability that X takes the value xj.
- 0 ≤ p(xj) ≤ 1.
- Σj p(xj) = 1.
- P(X = xj or X = xm) = p(xj) + p(xm) for j ≠ m.
8. Continuous Probability Distributions
- Infinitely many possible values, e.g., {x : x ∈ [0,1]}, {x : x ∈ (-∞, ∞)}
- Events A1, ..., Am are sets of possible values
  - A1 = {x : x < 0}, A2 = {x : x ∈ (0,1)}, A3 = {x : x > 0}, ...
- P(Aj) is the probability that X takes a value in Aj
- 0 ≤ P(Aj) ≤ 1.
- If A1, ..., Am are disjoint events that exhaust all possible values of x, then Σj P(Aj) = 1.
- If Aj and Ak are disjoint events, P(Aj ∪ Ak) = P(Aj) + P(Ak).
9. The Icy Road Example
Police Inspector Smith is impatiently awaiting the arrival of Mr. Holmes and Dr. Watson. They are late, and Inspector Smith has another important appointment (lunch). Looking out the window, he wonders whether the roads are icy. Both are notoriously bad drivers, so if the roads are icy they are likely to crash. His secretary enters and tells him that Dr. Watson has had a car accident. "Watson? OK. I'll bet the roads are icy! Then Holmes has most probably crashed too. I'll go for lunch now." (based on an example in Jensen, 1996, p. 7)
Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
10. From the Icy Road Example
- Ice: Is there an icy road?
- Values: Yes, No
- Initial probabilities: (.7, .3)
- Note: the choice of values for the variable "icy road," and its probabilities, are determined by knowledge of weather in the area at this time of year, but without having looked out the window.
11. Icy Road Probabilities
- P(Ice = yes) = .7
- P(Ice = no) = .3
12. Joint probability distributions
- Two random variables, X and Y
- P(X = xj, Y = yk), or p(xj, yk), is the probability that X takes the value xj and Y takes the value yk.
- 0 ≤ p(xj, yk) ≤ 1.
- Σj Σk p(xj, yk) = 1.
13. Marginal probability distributions 1
- Two discrete random variables, X and Y
- Recall that P(X = xj, Y = yk), or p(xj, yk), is the joint probability that X takes the value xj and Y takes the value yk.
- The marginal probability of a value xj of X is the sum over all the possible joint probabilities p(xj, yk) with that value of X:
- p(xj) = Σk p(xj, yk).
14. Conditional probability distributions
- Two random variables, X and Y
- P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
- This is how we express relationships among real-world phenomena
  - Coin flip: p(heads) vs. p(heads | BobReport)
  - P(heart attack | age, family history, blood pressure)
  - P(February 10 high temperature | geographical location, February 9 high temperature)
  - IRT: P(Xj = 1) vs. P(Xj = 1 | θ)
15. Conditional probability distributions
- Two discrete random variables, X and Y
- P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
- 0 ≤ p(xj | yk) ≤ 1 for each given yk.
- Σj p(xj | yk) = 1 for each given yk.
- P(X = xj or X = xm | Y = yk) = p(xj | yk) + p(xm | yk).
16. Marginal probability distributions 2
- Two discrete random variables, X and Y
- Recall that p(xj | yk) is the probability that X = xj given Y = yk.
- The marginal probability of a value of X is the sum of its conditional probabilities given all possible values of Y, with each weighted by its probability:
- p(xj) = Σk p(xj | yk) p(yk). (A numeric sketch using the Icy Road example follows below.)
17. Bayes Theorem
- The setup, with two random variables, X and Y
- You know conditional probabilities, p(xj | yk), which tell you what to believe about X if you knew the value of Y.
- You learn X = x; what should you believe about Y?
- You combine two things:
  - Relative conditional probabilities (the likelihood)
  - Previous probabilities about Y values
- p(yk | x) ∝ p(x | yk) p(yk); that is, posterior ∝ likelihood × prior.
18. From the Icy Road Example
- Ice: Is there an icy road?
  - Values: Yes, No
  - Initial probabilities: (.7, .3)
- Watson: Does Watson have a car crash?
  - Values: Yes, No
  - Probabilities conditional on Ice: (.8, .2) if Ice = Yes, (.1, .9) if Ice = No.
19. Icy Road: Conditional Probabilities

  Ice    Watson=Yes   Watson=No
  Yes    .8           .2
  No     .1           .9

- p(Watson=yes | Ice=yes) = .8, p(Watson=no | Ice=yes) = .2

20. Icy Road: Conditional Probabilities

  Ice    Watson=Yes   Watson=No
  Yes    .8           .2
  No     .1           .9

- p(Watson=yes | Ice=no) = .1, p(Watson=no | Ice=no) = .9
21. Icy Road: Likelihoods
- If Watson = no, the likelihoods are the Watson=No column of the table: p(Watson=no | Ice=yes) = .2 and p(Watson=no | Ice=no) = .9. Note the 2/9 ratio.

22. Icy Road: Likelihoods
- If Watson = yes, the likelihoods are the Watson=Yes column: p(Watson=yes | Ice=yes) = .8 and p(Watson=yes | Ice=no) = .1. Note the 8/1 ratio.
23. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Ice = yes: .7 × .8 = .56; Ice = no: .3 × .1 = .03

24. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Note: the products sum to .59, not 1.00. These aren't probabilities.

25. Icy Road: Bayes Theorem, if Watson = yes
- Prior × Likelihood ∝ Posterior
- Divide through by the normalizing constant .59 to get the posterior probabilities (.95, .05) for Ice = (yes, no).
26. Icy Road: Bayes Theorem, if Watson = no
- Prior × Likelihood ∝ Posterior
- The likelihoods are now the Watson=No column: .2 if Ice = yes, .9 if Ice = no.
- Ice = yes: .7 × .2 = .14; Ice = no: .3 × .9 = .27
- Divide through by the normalizing constant (.41) to get the posterior probabilities (.34, .66). (A short computational sketch of both cases follows below.)
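The following is a minimal computational sketch (an illustration, not the slides' own code) of the prior × likelihood ∝ posterior arithmetic for both possible reports about Watson:

```python
# Bayes Theorem for the Icy Road example:
# posterior(Ice) is proportional to prior(Ice) * p(Watson = w | Ice)
prior = {"yes": 0.7, "no": 0.3}
lik = {"yes": {"yes": 0.8, "no": 0.2},   # p(Watson = w | Ice = yes)
       "no":  {"yes": 0.1, "no": 0.9}}   # p(Watson = w | Ice = no)

def posterior(watson):
    unnormalized = {ice: prior[ice] * lik[ice][watson] for ice in prior}
    c = sum(unnormalized.values())        # normalizing constant
    return {ice: round(v / c, 2) for ice, v in unnormalized.items()}, c

print(posterior("yes"))  # ({'yes': 0.95, 'no': 0.05}, 0.59)
print(posterior("no"))   # ({'yes': 0.34, 'no': 0.66}, 0.41)
```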
27. Independence
- The probability of the joint occurrence of values of two variables is always equal to the product of the probabilities individually:
- P(X = x, Y = y) = P(X = x) P(Y = y).
- Equivalent to saying that learning the value of one of the variables does not change your belief about the other.
28. Conditional independence
- The conditional probability of the joint occurrence, given the value of another variable, is always equal to the product of the conditional probabilities:
- P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z).
29. Conditional independence
- "Conditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy actively by organizing our knowledge in a specific way."
- "An important tool in such organization is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them."
- "In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g., syndrome, complication, pathological state) and treats it as a new auxiliary variable that induces conditional independence; dependency between any two interacting systems is fully attributed to the dependencies of each on the auxiliary variable." (Pearl, 1988, p. 44)
30. Working with distributions
31. Discrete Random Variables: Density Functions
- A discrete random variable is characterized by a set of possible values it can take, and the probability assigned to each of those possibilities:
  - x1, ..., xn
  - Prob(X = xj), P(X = xj), or more simply p(xj).
- Can represent this as a histogram
[Histogram of p(x) over x ∈ {0, 1}, with bars at .3 and .7]
32. Continuous Random Variables: Density Functions
- The normal distribution is often written N(μ, σ), showing that it depends on two parameters, the mean μ and the standard deviation σ. (BUGS writes it in terms of the mean μ and the precision, 1/σ².)
- The density function of the normal distribution is
  p(x | μ, σ) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) ).
- Gelman et al. sometimes just write a density as N(x | μ, σ).
33. Continuous Random Variables: Density Functions
- To determine the probability of getting a value less than some value z for a real-valued distribution, integrate the density over (-∞, z). Using the normal(3.5, 3) distribution, Prob(X < 4.5 | μ = 3.5, σ = 3) is the area under the density to the left of 4.5:
  Prob(X < 4.5 | μ = 3.5, σ = 3) = ∫ from -∞ to 4.5 of N(x | 3.5, 3) dx.
34. Continuous Random Variables: Density Functions
- Integrating the density over the entire range of a continuous random variable gives the value 1; e.g.,
  ∫ from -∞ to ∞ of N(x | μ, σ) dx = 1.
- This will become important in Bayesian analysis, because sometimes we can determine functions p that are proportional to the distributions we want, but don't integrate to one.
35. Continuous Random Variables: Averages, or Expected Values
- Suppose you have some function g(x) that is defined over the range of X, and your current belief about x is expressed by a distribution with density p(x). The expected value of g(x) is obtained by integrating g(x) over the range, with respect to p(x):
  E[g(X)] = ∫ g(x) p(x) dx.   (*)
- You obtain the mean of a distribution by calculating (*) with g(x) = x. You get the variance by subtracting the square of the mean from another application of (*) with g(x) = x².
- You can approximate these quantities by taking many draws from p(x), evaluating g at each of them, and examining the distribution of the results. (This works if you can sample in proportion to p too.) A small sampling sketch follows below.
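A small sketch of this sampling idea, assuming for illustration that p(x) is the normal(3.5, 3) density used on an earlier slide:

```python
import numpy as np

rng = np.random.default_rng(0)
# Draws from p(x); here p(x) is taken to be N(3.5, 3) for illustration.
draws = rng.normal(loc=3.5, scale=3.0, size=100_000)

mean_est = draws.mean()                       # approximates (*) with g(x) = x
var_est = (draws**2).mean() - mean_est**2     # E[x^2] minus the squared mean
print(mean_est, var_est)                      # close to 3.5 and 9.0
```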
36. Parametric Distributions--Why?
- Paradigmatic shapes, summarized in terms of a few variables (i.e., the parameters)
- Often straightforward relationships between values of parameters and values of summary features
- Building blocks for large problems (BUGS)
- Computational advantages
  - Conjugate priors make Bayesian inference very simple (used in classic conjugate analyses and in Gibbs sampling)
  - For some, can generate values conveniently (used in Metropolis-Hastings estimation)
37. Some discrete parametric distributions (see Gelman et al., Appendix A)
- Bernoulli. For success/fail in a single binary trial with probability p. E.g., one coin flip, with p the probability of heads.
  - In BUGS notation, r ~ dbern(p), where r ∈ {0, 1}.
- Binomial. For counts of successes in binary trials, each with probability p, in n independent trials. E.g., n coin flips, with p the common probability of heads.
  - In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.
- Categorical. A single trial with a variable that can take values 1, 2, ..., ncat, with respective probabilities (p1, p2, ..., pncat), which sum to one.
  - r ~ dcat(p). (A sampling sketch of all three follows below.)
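For readers more comfortable with code than BUGS notation, here is a hedged sketch of drawing from these three distributions with NumPy; the parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.5, 10
probs = [0.2, 0.5, 0.3]                       # categorical probabilities, sum to one

r_bern = rng.binomial(1, p)                   # Bernoulli: r ~ dbern(p), r in {0, 1}
r_bin = rng.binomial(n, p)                    # Binomial: r ~ dbin(p, n), r in {0, ..., n}
r_cat = rng.choice(len(probs), p=probs) + 1   # Categorical: r ~ dcat(probs), r in {1, ..., ncat}
print(r_bern, r_bin, r_cat)
```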
38. Some discrete parametric distributions: A closer look at the binomial distribution
- Binomial. For counts of successes in binary trials, each with probability p, in n independent trials. E.g., n coin flips, with p the common probability of heads.
- In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.
- The probability function is
  p(r | p, n) = (n choose r) p^r (1-p)^(n-r),
  where r is the count of successes (the variable), n - r is the count of failures, p is the success probability parameter, and 1 - p is the failure probability.
- We will be using this as a likelihood in an example of the use of conjugate distributions.
39. Some continuous parametric distributions (see Gelman et al., Appendix A)
- Normal. Often used in measurement.
  - x ~ dnorm(mu, tau) in BUGS format, written in terms of the precision τ.
- Uniform. Can use as an uninformative prior on an interval.
  - x ~ dunif(a, b)
[Plots: dnorm(0,1) and dunif(3,5) densities]
40. Some continuous parametric distributions (see Gelman et al., Appendix A)
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
  - p ~ dbeta(a, b)
- Gamma. Defined on (0, ∞). Conjugate prior for the precision in the normal distribution.
  - x ~ dgamma(r, mu)
[Plots: dbeta(1.1, 1.1) and dbeta(10, 40) densities]
41. Some continuous parametric distributions: A closer look at the Beta distribution
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
- p ~ dbeta(a, b), with density proportional to p^(a-1) (1-p)^(b-1), where p is the success probability (the variable), 1 - p is the failure probability, a - 1 acts as a pseudo-count of successes, b - 1 as a pseudo-count of failures, and a and b together carry the shape, or prior sample, information.
- Mean(p) = a / (a + b)
- Variance(p) = ab / [(a + b)²(a + b + 1)]
- Mode(p) = (a - 1) / (a + b - 2)
- (A quick check of these formulas appears below.)
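A quick check of these formulas (an illustration only), using the dbeta(10, 40) example plotted on the previous slide:

```python
from scipy import stats

a, b = 10, 40
mean = a / (a + b)                              # Mean(p)
var = a * b / ((a + b)**2 * (a + b + 1))        # Variance(p)
mode = (a - 1) / (a + b - 2)                    # Mode(p), valid for a, b > 1
print(mean, var, mode)
print(stats.beta(a, b).mean(), stats.beta(a, b).var())  # agrees with the formulas
```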
42. Summarizing distributions
- A probability distribution conveys everything you know, and still don't know, about a variable at a given state of information.
- With finite distributions with few variables, as in MSBNx, we can just look at the vector of probabilities.
- With continuous variables, we can look at summary statistics: means, standard deviations, medians, modes, percentile points.
- We can also look at pictures of estimates of the density, and at collections of values drawn from the distribution.
43. Bayes Theorem revisited
- The general form of Bayes Theorem
- Review of Bayes Theorem with finite variables
- An example with a continuous variable
- A beta-binomial example
- A glimpse ahead to computational approaches
- Basic ideas behind...
- Conjugate priors
- Sampling-based approaches (MCMC/BUGS)
44. Bayes Theorem
- The setup, with two random variables, X and Y
- You know conditional probabilities, p(x | y), which tell you what to believe about X if you knew the value of Y.
- You learn X = x; what should you believe about Y?
- You combine two things:
  - Relative conditional probabilities (the likelihood)
  - Previous probabilities about Y values
- p(y | x) ∝ p(x | y) p(y); that is, posterior ∝ likelihood × prior.
- Note: this is proportional to the posterior, not yet equal to it.
45. Bayes Theorem
- Note that likelihood × prior is only proportional to the posterior.
- To make it a proper distribution, we must divide through by the sum over all the possibilities for y, given that X = x--the normalizing constant C.
- If y is a finite variable, this means dividing through by a sum:
  p(y | x) = p(x | y) p(y) / Σy' p(x | y') p(y').
- If y is a continuous variable, it means dividing through by an integral:
  p(y | x) = p(x | y) p(y) / ∫ p(x | y') p(y') dy'.
- posterior = likelihood × prior / C. The integral in the denominator is the joker in the deck!
46. An example with a continuous variable: A beta-binomial example
- The setup: We are flipping a biased coin, where the probability of heads p could be anywhere between 0 and 1. We are interested in p. We will have two sources of information:
  - Prior beliefs, which we will express as a beta distribution, and
  - Data, which will come in the form of counts of heads in 10 independent flips.
47. An example with a continuous variable: A beta-binomial example--the Prior Distribution
- The prior distribution
- Let's suppose we think it is more likely that the coin is close to fair, so p is probably nearer to .5 than it is to either 0 or 1. We don't have any reason to think it is biased toward either heads or tails, so we'll want a prior distribution that is symmetric around .5. We're not very sure about what p might be--say, about as sure as only 6 observations. This corresponds to 3 pseudo-counts of H and 3 of T, which, if we want to use a beta distribution to express this belief, corresponds to beta(4,4).
48. An example with a continuous variable: A beta-binomial example--the Prior Distribution
- Beta. Defined on [0, 1]. Conjugate prior for the probability parameter in Bernoulli/binomial models.
- p ~ dbeta(4, 4), with density proportional to p³(1-p)³: pseudo-counts of 3 successes and 3 failures, with a = b = 4 carrying the shape, or prior sample, information about the success probability p.
- Mean(p) = 4/8 = .5
- Variance(p) = (4)(4) / [(8)²(9)] ≈ .028
- Mode(p) = 3/6 = .5
49. An example with a continuous variable: A beta-binomial example--the Likelihood
- The likelihood
- Next we will flip the coin ten times. Assuming the same true (but unknown to us) value of p is in effect for each of ten independent trials, we can use the binomial distribution to model the probability of getting any number of heads; i.e.,
  p(r | p, n=10) = (10 choose r) p^r (1-p)^(10-r),
  where r is the count of observed successes (the variable), 10 - r is the count of observed failures, p is the success probability parameter, and 1 - p is the failure probability.
50. An example with a continuous variable: A beta-binomial example--the Likelihood
- The likelihood
- We flip the coin ten times and observe 7 heads; i.e., r = 7. The likelihood is obtained using the same form as on the preceding slide, except now r is fixed at 7 and we are interested in the relative value of this function at different possible values of p:
  L(p) ∝ p^7 (1-p)^3.
51. An example with a continuous variable: Obtaining the posterior by Bayes Theorem
- posterior ∝ likelihood × prior
- General form: p(y | x) ∝ p(x | y) p(y).
- In our example, r = 7 plays the role of x, and p plays the role of y. Before normalizing:
  p(p | r=7) ∝ [p^7 (1-p)^3] × [p^3 (1-p)^3] = p^10 (1-p)^6.
- After normalizing, this is the beta(11, 7) distribution.
- Now, how can we get an idea of what this means we believe about p after combining our prior belief and our observations?
52. An example with a continuous variable: In pictures
[Plot: Prior × Likelihood = Posterior]
53. An example with a continuous variable: Using the fact that we have conjugate distributions
- Now p(p | r=7) ∝ p^10 (1-p)^6. This is just the kernel of a beta(11, 7) distribution.
- This is rather special. The data were observed in accordance with a probability function which has that same mathematical form as a likelihood once the data are observed. We chose a prior distribution (in this case, a beta distribution) which combines with the likelihood just so as to produce another distribution in the same parametric family (another beta distribution), just with updated parameters.
- We can work out its summary statistics (a short computational sketch follows below):
  - Mean(p) = 11/18 ≈ .61 (prior mean was .5)
  - Variance(p) = (11)(7) / [(18)²(19)] ≈ .0125 (prior variance was .028)
  - Mode(p) = 10/16 = .625 (prior mode was .5)
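A short computational sketch of the conjugate update and the posterior summaries just described; it simply reproduces the slide's arithmetic:

```python
# Beta(4, 4) prior combined with 7 heads in 10 flips -> Beta(4 + 7, 4 + 3) = Beta(11, 7)
a0, b0 = 4, 4                  # prior parameters (pseudo-counts plus one)
heads, flips = 7, 10
a, b = a0 + heads, b0 + (flips - heads)

mean = a / (a + b)                          # 11/18, about .61
var = a * b / ((a + b)**2 * (a + b + 1))    # about .0125
mode = (a - 1) / (a + b - 2)                # 10/16 = .625
print(a, b, mean, var, mode)
```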
54. An example with a continuous variable: Using BUGS
- What BUGS does in this simple problem with one variable is to sample lots of values from the posterior distribution for p; that is, its distribution as determined first with information from the prior, but further conditional on the observed data. Here are the summary statistics from 50,000 draws:
  - Mean(p) close to the analytic 11/18 ≈ .61 (prior mean was .5)
  - SD(p) ≈ .1116, i.e., Variance(p) ≈ .0125 (prior variance was .028)
  - Mode(p) near .625 (prior mode was .5)
- (A sketch of drawing from the posterior directly appears below.)
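A sketch of the same idea without BUGS: because the posterior here is beta(11, 7), one can draw 50,000 values directly and summarize them (BUGS arrives at its draws by Markov chain simulation rather than by sampling the beta directly):

```python
import numpy as np

rng = np.random.default_rng(2)
draws = rng.beta(11, 7, size=50_000)            # samples from the posterior beta(11, 7)
print(draws.mean(), draws.std(), draws.var())   # about .611, .112, and .0125
```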
55. An example with a continuous variable: Using BUGS
- BUGS setup for this problem
56. Looking ahead to sampling-based approaches with many variables
- BUGS: Bayesian inference Using Gibbs Sampling
- Basic idea: Model a multi-parameter problem in terms of assemblies of distributions and functions for all data and all parameters (taking advantage of conditional independence whenever possible).
  - E.g., p(Data | x, y) p(x | z) p(y) p(z).   (**)
- Observe Data. The posterior p(x, y, z | Data) is proportional to (**). It is hard to evaluate the normalizing constant, but ...
57. Looking ahead to sampling-based approaches with many variables
- Can draw values from full conditional distributions:
  - Start with a possible value for each variable in cycle 0.
  - In cycle t+1,
    - Draw x_(t+1) from p(x | Y = y_t, Z = z_t, Data)
    - Draw y_(t+1) from p(y | X = x_(t+1), Z = z_t, Data)
    - Draw z_(t+1) from p(z | X = x_(t+1), Y = y_(t+1), Data)
- Under suitable conditions, this series of draws comes to approximate draws from the actual joint posterior for all the parameters. (A minimal Gibbs sketch for a toy model follows below.)
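A minimal Gibbs-sampling sketch, using an assumed toy model (a standard bivariate normal with correlation rho, not the x, y, z model above) because its full conditionals are known in closed form; each cycle draws each variable from its conditional given the current value of the other:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8                        # assumed correlation of a standard bivariate normal
n_cycles = 10_000
x, y = 0.0, 0.0                  # cycle-0 starting values
draws = np.empty((n_cycles, 2))

for t in range(n_cycles):
    # Full conditionals of the standard bivariate normal:
    # x | y ~ N(rho*y, 1 - rho^2),  y | x ~ N(rho*x, 1 - rho^2)
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    draws[t] = (x, y)

# After a burn-in period, the draws approximate draws from the joint distribution.
print(draws[1000:].mean(axis=0))                 # close to (0, 0)
print(np.corrcoef(draws[1000:].T)[0, 1])         # close to 0.8
```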