Concepts and Definitions in Bayesian Estimation
1
Concepts and Definitions in Bayesian Estimation
  • Robert J. Mislevy
  • University of Maryland
  • November 29, 2005

2
A quote from Glenn Shafer
  • "Probability is not really about numbers; it is about the structure of reasoning."
  • Glenn Shafer, quoted in Pearl, 1988, p. 77

3
Views of Probability
  • Two conceptions of probability
  • Aleatory (chance)
  • Long-run frequencies, mechanisms
  • Probability is a property of the world
  • Degree of belief (subjective)
  • Probability is a property of Your state of knowledge (de Finetti) and of your model of the situation
  • Same formal definitions and machinery
  • Aleatory paradigms as an analogical basis for degree of belief (Glenn Shafer)

4
Frames of discernment
  • A frame of discernment is all the possible combinations of values of the variables you are working with (Shafer, 1976).
  • Discern: detect, recognize, distinguish
  • A property of you as much as a property of the world
  • Depends on what you know and what your purpose is
  • Frame of discernment can evolve over time
  • Medical diagnosis
  • Document literacy example (more information)

5
Frames of Discernment in Assessment
  • In the Student Model: determining what aspects of skill and knowledge to use as explicit student-model variables -- psychological perspective, grain size, reporting requirements.
  • In the Evidence Model: evidence identification (task scoring); evaluation rules map from a unique work product to common observable variables.
  • In the Task Model: task variables are aspects of situations that are important to keep track of and manipulate in task design, to achieve the assessment's purpose.

6
(Random) Variables
  • We will start with variables that have a finite number of possible values.
  • Denote a random variable by upper case, say X.
  • Denote particular values and generic values by lower case, x.
  • Y is the outcome of a coin flip: y ∈ {h, t}.
  • Xi is the answer to Item i: xi ∈ {0, 1}.

7
Finite Probability Distributions
  • Finite set of possible values {x1, ..., xn}
  • Prob(X = xj), P(X = xj), or more simply p(xj), is the probability that X takes the value xj.
  • 0 ≤ p(xj) ≤ 1.
  • P(X = xj or X = xm) = p(xj) + p(xm).

8
Continuous Probability Distributions
  • Infinitely many possible values, e.g., {x: x ∈ [0,1]}, {x: x ∈ (-∞, ∞)}
  • Events A1, ..., Am are sets of possible values
  • A1 = {x: x < 0}, A2 = {x: x ∈ (0,1)}, A3 = {x: x > 0}
  • P(Aj) is the probability that X takes a value in Aj
  • 0 ≤ P(Aj) ≤ 1.
  • If A1, ..., Am are disjoint events that exhaust all possible values of x, then P(A1) + ... + P(Am) = 1.
  • If Aj and Ak are disjoint events, P(Aj ∪ Ak) = P(Aj) + P(Ak).

9
The Icy Road Example
Police Inspector Smith is impatiently awaiting the arrival of Mr. Holmes and Dr. Watson. They are late, and Inspector Smith has another important appointment (lunch). Looking out the window, he wonders whether the roads are icy. Both are notoriously bad drivers, so if the roads are icy they are likely to crash. His secretary enters and tells him that Dr. Watson has had a car accident. "Watson? OK, I'll bet the roads are icy! Then Holmes has most probably crashed too. I'll go for lunch now." (Based on an example in Jensen, 1996, p. 7.)

Jensen, F.V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
10
From the Icy Road Example
  • Ice: Is there an icy road?
  • Values: Yes, No
  • Initial probabilities: (.7, .3)
  • Note: the choice of values for the variable "icy road," and the probabilities, are determined by knowledge of weather in the area at this time of year, but without having looked out the window.

11
Icy Road Probabilities
  Ice    P(Ice)
  Yes    .7
  No     .3
12
Joint probability distributions
  • Two random variables, X and Y
  • P(X = xj, Y = yk), or p(xj, yk), is the probability that X takes the value xj and Y takes the value yk.
  • 0 ≤ p(xj, yk) ≤ 1.

13
Marginal probability distributions 1
  • Two discrete random variables, X and Y
  • Recall P(X = xj, Y = yk), or p(xj, yk), is the joint probability that X takes the value xj and Y takes the value yk.
  • The marginal probability of a value xj of X is the sum over all the possible joint probabilities p(xj, yk) with that value of X:
  • p(xj) = Σk p(xj, yk)

14
Conditional probability distributions
  • Two random variables, X and Y
  • P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
  • This is how we express relationships among real-world phenomena:
  • Coin flip: p(heads) vs. p(heads | Bob's report)
  • P(heart attack | age, family history, blood pressure)
  • P(February 10 high temperature | geographical location, February 9 high temperature)
  • IRT: P(Xj = 1) vs. P(Xj = 1 | θ)

15
Conditional probability distributions
  • Two discrete random variables, X and Y
  • P(X = xj | Y = yk), or p(xj | yk), is the probability that X takes the value xj given that Y takes the value yk.
  • 0 ≤ p(xj | yk) ≤ 1 for each given yk.
  • Σj p(xj | yk) = 1 for each given yk.
  • P(X = xj or X = xm | Y = yk) = p(xj | yk) + p(xm | yk).

16
Marginal probability distributions 2
  • Two discrete random variables, X and Y
  • Recall p(xj | yk) is the probability that X = xj given Y = yk.
  • The marginal probability of a value of X is the sum of its conditional probabilities given all possible values of Y, with each weighted by its probability:
  • p(xj) = Σk p(xj | yk) p(yk)

17
Bayes Theorem
  • The setup, with two random variables, X and Y:
  • You know conditional probabilities, p(xj | yk), which tell you what to believe about X if you knew the value of Y.
  • You learn X = x; what should you believe about Y?
  • You combine two things:
  • Relative conditional probabilities (the likelihood)
  • Previous probabilities about Y values (the prior)

p(yk | x) ∝ p(x | yk) × p(yk)
(posterior ∝ likelihood × prior)
18
From the Icy Road Example
  • Ice: Is there an icy road?
  • Values: Yes, No
  • Initial probabilities: (.7, .3)
  • Watson: Does Watson have a car crash?
  • Values: Yes, No
  • Probabilities conditional on Icy Road:
  • (.8, .2) if Ice = Yes; (.1, .9) if Ice = No.

19
Icy Road Conditional Probabilities
  p(Watson | Ice)    Watson = No    Watson = Yes
  Ice = Yes          .2             .8
  Ice = No           .9             .1

The row for Ice = yes gives p(Watson = no | Ice = yes) = .2 and p(Watson = yes | Ice = yes) = .8.
20
Icy Road Conditional Probabilities
  p(Watson | Ice)    Watson = No    Watson = Yes
  Ice = Yes          .2             .8
  Ice = No           .9             .1

The row for Ice = no gives p(Watson = no | Ice = no) = .9 and p(Watson = yes | Ice = no) = .1.
21
Icy Road Likelihoods
  p(Watson | Ice)    Watson = No    Watson = Yes
  Ice = Yes          .2             .8
  Ice = No           .9             .1

If Watson = no is observed, the likelihood is the Watson = no column: p(Watson = no | Ice = yes) = .2 versus p(Watson = no | Ice = no) = .9. Note the 2/9 ratio.
22
Icy Road Likelihoods
  p(Watson | Ice)    Watson = No    Watson = Yes
  Ice = Yes          .2             .8
  Ice = No           .9             .1

If Watson = yes is observed, the likelihood is the Watson = yes column: p(Watson = yes | Ice = yes) = .8 versus p(Watson = yes | Ice = no) = .1. Note the 8/1 ratio.
23
Icy Road Bayes Theorem: If Watson = yes
Prior × Likelihood ∝ Posterior

  Ice    Prior    Likelihood (Watson = yes)
  Yes    .7       .8
  No     .3       .1
24
Icy Road Bayes Theorem: If Watson = yes
Prior × Likelihood ∝ Posterior

  Ice    Prior    Likelihood    Prior × Likelihood
  Yes    .7       .8            .56
  No     .3       .1            .03

Note: the products sum to .59, not 1.00. These aren't probabilities.
25
Icy Road Bayes Theorem: If Watson = yes
Prior × Likelihood ∝ Posterior

  Ice    Prior    Likelihood    Prior × Likelihood    Posterior
  Yes    .7       .8            .56                   .95
  No     .3       .1            .03                   .05

Divide through by the normalizing constant .59 to get the posterior probabilities. (A Python sketch of this calculation follows.)
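The arithmetic on these slides is easy to check in a few lines of code. The following is a minimal Python sketch (an addition, not part of the original slides) of the prior-times-likelihood-then-normalize step for the icy road example; the numbers come straight from the slides.

  # Icy road example: update belief about Ice after observing Watson = yes.
  prior = {"yes": 0.7, "no": 0.3}                   # P(Ice)
  p_crash_given_ice = {"yes": 0.8, "no": 0.1}       # P(Watson = yes | Ice)

  # Prior x likelihood for each value of Ice (these do not sum to 1).
  unnormalized = {ice: prior[ice] * p_crash_given_ice[ice] for ice in prior}

  # Divide by the normalizing constant to get posterior probabilities.
  norm = sum(unnormalized.values())                 # 0.59
  posterior = {ice: v / norm for ice, v in unnormalized.items()}

  print({k: round(v, 2) for k, v in unnormalized.items()})  # {'yes': 0.56, 'no': 0.03}
  print(round(norm, 2))                                     # 0.59
  print({k: round(v, 2) for k, v in posterior.items()})     # {'yes': 0.95, 'no': 0.05}

Rerunning the same two lines of arithmetic with the Watson = no likelihood (.2, .9) gives the posterior on the next slide.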
26
Icy Road Bayes Theorem: If Watson = no
Prior × Likelihood ∝ Posterior

  Ice    Prior    Likelihood (Watson = no)    Prior × Likelihood    Posterior
  Yes    .7       .2                          .14                   .34
  No     .3       .9                          .27                   .66

Divide through by the normalizing constant (.14 + .27 = .41) to get the posterior probabilities.
27
Independence
  • Independence
  • The probability of the joint occurrence of values of two variables is always equal to the product of the probabilities individually:
  • P(X = x, Y = y) = P(X = x) P(Y = y).
  • Equivalent to saying that learning the value of
    one of the variables does not change your belief
    about the other.

28
Conditional independence
  • Conditional independence
  • The conditional probability of the joint occurrence of values of two variables, given the value of another variable, is always equal to the product of the conditional probabilities:
  • P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z).
  • (A small numerical illustration follows below.)
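As a concrete illustration of this definition, the following Python sketch (a hypothetical example added here, not from the slides) builds a joint distribution over three binary variables in which X and Y are conditionally independent given Z but not marginally independent, and verifies both facts numerically. The probabilities are made up for illustration.

  from itertools import product

  p_z = {0: 0.5, 1: 0.5}
  p_x_given_z = {0: {1: 0.2, 0: 0.8}, 1: {1: 0.8, 0: 0.2}}
  p_y_given_z = {0: {1: 0.3, 0: 0.7}, 1: {1: 0.9, 0: 0.1}}

  # Joint distribution built so that X and Y are conditionally independent given Z.
  joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
           for x, y, z in product([0, 1], repeat=3)}

  def marg(**fixed):
      """Sum the joint probability over all cells consistent with the fixed values."""
      return sum(p for (x, y, z), p in joint.items()
                 if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items()))

  # Conditional independence given Z: P(X=1, Y=1 | Z=z) = P(X=1 | Z=z) * P(Y=1 | Z=z)
  for z in (0, 1):
      lhs = marg(x=1, y=1, z=z) / marg(z=z)
      rhs = (marg(x=1, z=z) / marg(z=z)) * (marg(y=1, z=z) / marg(z=z))
      print(z, round(lhs, 4), round(rhs, 4))   # equal for each z

  # But X and Y are NOT marginally independent:
  print(round(marg(x=1, y=1), 4), round(marg(x=1) * marg(y=1), 4))  # 0.39 vs 0.3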

29
Conditional independence
  • "Conditional independence is not a grace of nature for which we must wait passively, but rather a psychological necessity which we satisfy actively by organizing our knowledge in a specific way.
  • An important tool in such organization is the identification of intermediate variables that induce conditional independence among observables; if such variables are not in our vocabulary, we create them.
  • In medical diagnosis, for instance, when some symptoms directly influence one another, the medical profession invents a name for that interaction (e.g., syndrome, complication, pathological state) and treats it as a new auxiliary variable that induces conditional independence; dependency between any two interacting systems is fully attributed to the dependencies of each on the auxiliary variable." (Pearl, 1988, p. 44)

30
Working with distributions
31
Discrete Random Variables: Density Functions
  • A discrete random variable is characterized by a set of possible values it can take, and the probability assigned to each of those possibilities:
  • {x1, ..., xn}
  • Prob(X = xj), P(X = xj), or more simply p(xj).
  • Can represent this as a histogram:

[Histogram of a binary variable x with p(x = 0) = 0.3 and p(x = 1) = 0.7; vertical axis p(x) runs from 0.0 to 1.0.]
32
Continuous Random Variables: Density Functions
  • The normal distribution is often written N(μ, σ), showing that it depends on two parameters, the mean μ and the standard deviation σ. (BUGS writes it in terms of the mean μ and the precision 1/σ².) The density function of the normal distribution is
  • p(x | μ, σ) = (1 / (σ √(2π))) exp( -(x - μ)² / (2σ²) )

Gelman et al. sometimes just write a density as p(x | μ, σ).

[Plot of the normal density, centered at μ with spread σ.]
33
Continuous Random Variables: Density Functions
  • To determine the probability of getting a value less than some value z for a real-valued distribution, integrate the density over (-∞, z). Using the normal(3.5, 3) distribution, Prob(z < 4.5 | μ = 3.5, σ = 3) is the area under the density to the left of 4.5, i.e., the integral of N(x | 3.5, 3) over (-∞, 4.5). (A quick numerical check appears below.)

[Plot of the N(3.5, 3) density with the region below 4.5 shaded.]
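As a quick numerical check (an addition, not in the original slides), the shaded area can be computed from the normal cumulative distribution function; here is a short Python sketch using scipy.

  from scipy import stats

  # P(z < 4.5) for a normal distribution with mean 3.5 and standard deviation 3.
  prob = stats.norm.cdf(4.5, loc=3.5, scale=3)
  print(round(prob, 3))   # approximately 0.631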
34
Continuous Random Variables: Density Functions
  • Integrating the density over the entire range of a continuous random variable gives the value 1; e.g., the integral of N(x | μ, σ) over (-∞, ∞) equals 1.
  • This will become important in Bayesian analysis, because sometimes we can determine functions p that are proportional to the distributions we want, but don't integrate to one.
35
Continuous Random Variables: Averages, or Expected Values
  • Suppose you have some function g(x) that is defined over the range of X, and your current belief about x is expressed by a distribution with density p(x). The expected value of g(x) is obtained by integrating g(x) over the range, with respect to p(x):
  • E[g(X)] = ∫ g(x) p(x) dx    (*)
  • You obtain the mean of a distribution by calculating (*) with g(x) = x. You get the variance by subtracting the square of the mean from another application of (*) with g(x) = x².
  • You can approximate these quantities by taking many draws from p(x), evaluating g at each of them, and examining the distribution of the results. (This works if you can sample in proportion to p too.) A sketch of this sampling approximation follows.
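A minimal Python sketch (not from the slides) of this sampling approximation, assuming p(x) is the normal(3.5, 3) density used earlier: the sample mean of the draws approximates E[X], and the mean of x² minus the squared mean approximates the variance.

  import numpy as np

  rng = np.random.default_rng(0)
  draws = rng.normal(loc=3.5, scale=3.0, size=100_000)   # draws from p(x)

  mean_est = draws.mean()                          # E[g(X)] with g(x) = x
  var_est = (draws ** 2).mean() - mean_est ** 2    # E[x^2] - (E[x])^2

  print(round(mean_est, 2), round(var_est, 2))     # close to 3.5 and 9.0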

36
Parametric Distributions--Why?
  • Paradigmatic shapes, summarized in terms of a few
    variables (i.e., the parameters)
  • Often straightforward relationships between
    values of parameters and values of summary
    features
  • Building blocks for large problems (BUGS)
  • Computational advantages
  • Conjugate priors make Bayesian inference very
    simple (used in classic conjugate analyses and in
    Gibbs sampling)
  • For some, can generate values conveniently (used
    in Metropolis-Hastings estimation)

37
Some discrete parametric distributions (see
Gelman et al., Appendix A)
  • Bernoulli. For success/fail in a single binary trial with probability p. E.g., one coin flip, with p the probability of heads.
  • In BUGS notation, r ~ dbern(p), where r ∈ {0, 1}.
  • Binomial. For counts of successes in n independent binary trials, each with probability p. E.g., n coin flips, with p the common probability of heads.
  • In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.
  • Categorical. A single trial with a variable that can take values 1, 2, ..., ncat, with respective probabilities (p1, p2, ..., pncat), which sum to one.
  • In BUGS notation, r ~ dcat(p).
  • (Python stand-ins for sampling from these distributions appear below.)
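These distributions have direct counterparts in numpy; the following sketch (an addition, not part of the slides) simply shows how one might draw from each of them, as a rough stand-in for the BUGS notation above.

  import numpy as np

  rng = np.random.default_rng(1)

  # Bernoulli(p): a single binary trial (a binomial with n = 1).
  r_bern = rng.binomial(n=1, p=0.5)

  # Binomial(p, n): count of successes in n independent trials.
  r_bin = rng.binomial(n=10, p=0.5)

  # Categorical(p1, ..., p_ncat): one draw of a category label.
  probs = [0.2, 0.5, 0.3]
  r_cat = rng.choice([1, 2, 3], p=probs)

  print(r_bern, r_bin, r_cat)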

38
Some discrete parametric distributions: A closer look at the binomial distribution
  • Binomial. For counts of successes in n independent binary trials, each with probability p. E.g., n coin flips, with p the common probability of heads.
  • In BUGS notation, r ~ dbin(p, n), where r ∈ {0, 1, ..., n}.

The binomial probability function is

  P(r | p, n) = (n choose r) p^r (1 - p)^(n - r)

where r, the variable, is the count of successes; n - r is the count of failures; p is the success probability parameter; and 1 - p is the failure probability.
We will be using this as a likelihood in an
example of the use of conjugate distributions.
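To make the connection to a likelihood concrete, this small Python sketch (an addition, not from the slides) evaluates the binomial probability of r = 7 successes in n = 10 trials at a few values of p; read as a function of p with r fixed, these are relative likelihoods.

  from scipy import stats

  n, r = 10, 7
  for p in (0.3, 0.5, 0.7, 0.9):
      # Binomial probability of r successes in n trials with success probability p.
      print(p, round(stats.binom.pmf(r, n, p), 4))
  # The function peaks near p = r/n = 0.7, which is why observing 7 heads in
  # 10 flips favors values of p around .7.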
39
Some continuous parametric distributions (see
Gelman et al., Appendix A)
  • Normal. Often used in measurement.
  • x ~ dnorm(mu, tau) in BUGS format, written in terms of the precision tau = 1/σ².
  • Uniform. Can use as an uninformative prior on an interval.
  • x ~ dunif(a, b)

[Plots: the dnorm(0,1) and dunif(3,5) densities.]
40
Some continuous parametric distributions (see
Gelman et al., Appendix A)
  • Beta. Defined on [0,1]. Conjugate prior for the probability parameter in Bernoulli and binomial models.
  • p ~ dbeta(a, b)
  • Gamma. Defined on (0, ∞). Conjugate prior for the precision in the normal distribution.
  • x ~ dgamma(r, mu)

[Plots: the dbeta(1.1,1.1) and dbeta(10,40) densities. A density-evaluation sketch follows.]
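The two beta shapes in the plots are easy to reproduce; here is a brief Python sketch (added for illustration, not from the slides) that evaluates the dbeta(1.1, 1.1) and dbeta(10, 40) densities on a grid, which is all a plotting call would need.

  import numpy as np
  from scipy import stats

  p = np.linspace(0.01, 0.99, 99)

  # dbeta(1.1, 1.1): nearly flat over (0, 1) -- a weakly informative prior.
  flat_ish = stats.beta.pdf(p, 1.1, 1.1)

  # dbeta(10, 40): concentrated around its mean 10 / (10 + 40) = 0.2.
  peaked = stats.beta.pdf(p, 10, 40)

  print(round(flat_ish.max(), 2), round(peaked.max(), 2))
  # e.g., matplotlib's plt.plot(p, peaked) would draw the second curve.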
41
Some continuous parametric distributions: A closer look at the Beta distribution
  • Beta. Defined on [0,1]. Conjugate prior for the probability parameter in Bernoulli and binomial models.
  • p ~ dbeta(a, b), with density proportional to p^(a-1) (1 - p)^(b-1), where p, the variable, is the success probability and 1 - p is the failure probability.
  • Mean(p) = a / (a + b)
  • Variance(p) = ab / ((a + b)² (a + b + 1))
  • Mode(p) = (a - 1) / (a + b - 2)
  • Here a - 1 and b - 1 act as pseudo-counts of successes and failures, and a + b reflects the shape, or the amount of prior sample information.
42
Summarizing distributions
  • A probability distribution conveys everything you know, and still don't know, about a variable at a given state of information.
  • With finite distributions with few variables, as in MSBNx, we can just look at the vector of probabilities.
  • With continuous variables, we can look at summary statistics: means, standard deviations, medians, modes, percentile points.
  • We can also look at pictures of estimates of the density, and at collections of values drawn from the distribution.

43
Bayes Theorem revisited
  • The general form of Bayes Theorem
  • Review of Bayes Theorem with finite variables
  • An example with a continuous variable
  • A beta-binomial example
  • A glimpse ahead to computational approaches
  • Basic ideas behind...
  • Conjugate priors
  • Sampling-based approaches (MCMC/BUGS)

44
Bayes Theorem
  • The setup, with two random variables, X and Y:
  • You know conditional probabilities, p(x | y), which tell you what to believe about X if you knew the value of Y.
  • You learn X = x; what should you believe about Y?
  • You combine two things:
  • Relative conditional probabilities (the likelihood)
  • Previous probabilities about Y values (the prior)
  • Note: this product is proportional to the posterior.

p(y | x) ∝ p(x | y) × p(y)
(posterior ∝ likelihood × prior)
45
Bayes Theorem
  • Note that likelihood × prior is proportional to the posterior.
  • To make it a proper distribution, we must divide through by the total over all the possibilities for y, given that X = x -- the normalizing constant C. If y is a finite variable, this means dividing through by a sum: C = Σk p(x | yk) p(yk).
  • If y is a continuous variable, it means dividing through by an integral: C = ∫ p(x | y) p(y) dy.

posterior = (likelihood × prior) / C
The normalizing constant C is the joker in the deck!
46
An example with a continuous variable: A beta-binomial example
  • The setup: We are flipping a biased coin, where the probability of heads, p, could be anywhere between 0 and 1. We are interested in p. We will have two sources of information:
  • Prior beliefs, which we will express as a beta
    distribution, and
  • Data, which will come in the form of counts of
    heads in 10 independent flips.

47
An example with a continuous variable: A beta-binomial example -- the prior distribution
  • The prior distribution
  • Let's suppose we think it is more likely that the coin is close to fair, so p is probably nearer to .5 than it is to either 0 or 1. We don't have any reason to think it is biased toward either heads or tails, so we'll want a prior distribution that is symmetric around .5. We're not very sure about what p might be -- say, about as sure as only 6 observations. This corresponds to 3 pseudo-counts of H and 3 of T, which, if we want to use a beta distribution to express this belief, corresponds to beta(4,4).
48
An example with a continuous variable: A beta-binomial example -- the prior distribution
  • Beta. Defined on [0,1]. Conjugate prior for the probability parameter in Bernoulli and binomial models.
  • p ~ dbeta(4,4), with density proportional to p^3 (1 - p)^3.
  • Mean(p) = 4/8 = .5
  • Variance(p) = (4 × 4) / (8² × 9) ≈ .028
  • Mode(p) = 3/6 = .5
  • Here a - 1 = 3 and b - 1 = 3 are the pseudo-counts of successes and failures, and a + b = 8 reflects the amount of prior sample information.
49
An example with a continuous variable: A beta-binomial example -- the likelihood
  • The likelihood
  • Next we will flip the coin ten times. Assuming the same true (but unknown to us) value of p is in effect for each of ten independent trials, we can use the binomial distribution to model the probability of getting any number of heads; i.e.,
  • P(r | p, n = 10) = (10 choose r) p^r (1 - p)^(10 - r)
  • where r, the variable, is the count of observed successes; 10 - r is the count of observed failures; p is the success probability parameter; and 1 - p is the failure probability.
50
An example with a continuous variable: A beta-binomial example -- the likelihood
  • The likelihood
  • We flip the coin ten times, and observe 7 heads; i.e., r = 7. The likelihood is obtained using the same form as on the preceding slide, except now r is fixed at 7 and we are interested in the relative value of this function at different possible values of p:
  • L(p) ∝ p^7 (1 - p)^3
51
An example with a continuous variable: Obtaining the posterior by Bayes Theorem

posterior ∝ likelihood × prior
  • General form: p(y | x) ∝ p(x | y) p(y).
  • In our example, the observed count 7 plays the role of x, and p plays the role of y. Before normalizing:
  • p^7 (1 - p)^3 × p^3 (1 - p)^3 = p^10 (1 - p)^6
  • After normalizing (dividing by the integral of this function over p from 0 to 1), we have a proper posterior density.

Now, how can we get an idea of what this means we believe about p after combining our prior belief and our observations? (One way: the grid-approximation sketch below.)
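One way to see what the posterior looks like, without any conjugacy tricks, is a simple grid approximation. This Python sketch (an addition, not from the slides) evaluates prior × likelihood on a grid of p values, normalizes, and reports the posterior mean and variance; it should agree closely with the beta(11,7) result on the next slides.

  import numpy as np
  from scipy import stats

  grid = np.linspace(0.001, 0.999, 999)          # grid of possible values of p
  prior = stats.beta.pdf(grid, 4, 4)             # beta(4,4) prior density
  likelihood = stats.binom.pmf(7, 10, grid)      # P(r = 7 | p, n = 10)

  unnormalized = prior * likelihood              # proportional to the posterior
  posterior = unnormalized / unnormalized.sum()  # normalize over the grid

  post_mean = (grid * posterior).sum()
  post_var = ((grid - post_mean) ** 2 * posterior).sum()
  print(round(post_mean, 3), round(post_var, 4)) # close to 11/18 = .611 and .0125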
52
An example with a continuous variable: In pictures

[Figure: curves for the prior, the likelihood, and the resulting posterior; Prior × Likelihood ∝ Posterior.]
53
An example with a continuous variable: Using the fact that we have conjugate distributions

Now the unnormalized posterior is p^10 (1 - p)^6 = p^(11-1) (1 - p)^(7-1).

This is just the kernel of a beta(11,7) distribution. This is rather special. The data were observed in accordance with a probability function (the binomial) that, once the data are observed, has this same mathematical form as a likelihood. We chose a prior distribution (in this case, a beta distribution) that combines with the likelihood so as to produce another distribution in the same parametric family (another beta distribution), just with updated parameters. We can work out its summary statistics:

              Mean(p)          Variance(p)                    Mode(p)
  posterior   11/18 ≈ .611     (11 × 7)/(18² × 19) ≈ .0125    10/16 = .625
  prior was   .5               .028                           .5

54
An example with a continuous variable: Using BUGS

Now consider estimating the same posterior by simulation. What BUGS does in this simple problem with one variable is to sample lots of values from the posterior distribution for p; that is, its distribution as determined first by information from the prior, and further conditioned on the observed data. The summary statistics from 50,000 draws are Mean(p) ≈ .611 and Variance(p) ≈ .0125, in close agreement with the conjugate beta(11,7) results (the prior values were .5 and .028). (A Python stand-in for this sampling check follows.)
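The BUGS run itself is not reproduced in this transcript; as a rough stand-in (an addition, not from the slides), this Python sketch draws 50,000 values directly from the beta(11,7) posterior, which is what the MCMC draws approximate, and summarizes them.

  import numpy as np

  rng = np.random.default_rng(2)
  draws = rng.beta(11, 7, size=50_000)   # draws from the beta(11,7) posterior for p

  print(round(draws.mean(), 3))          # about .611
  print(round(draws.var(), 4))           # about .0125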
55
An example with a continuous variable: Using BUGS

56
Looking ahead to sampling-based approaches with
many variables
  • BUGS: Bayesian inference Using Gibbs Sampling
  • Basic idea: Model a multi-parameter problem in terms of assemblies of distributions and functions for all data and all parameters (taking advantage of conditional independence whenever possible).
  • E.g., p(Data | x, y) p(x | z) p(y) p(z).   (*)
  • Observe Data. The posterior p(x, y, z | Data) is proportional to (*). The normalizing constant is hard to evaluate, but ...

57
Looking ahead to sampling-based approaches with
many variables
  • We can draw values from the full conditional distributions:
  • Start with a possible value for each variable in cycle 0.
  • In cycle t+1,
  • Draw x(t+1) from p(x | Y = y(t), Z = z(t), Data)
  • Draw y(t+1) from p(y | X = x(t+1), Z = z(t), Data)
  • Draw z(t+1) from p(z | X = x(t+1), Y = y(t+1), Data)
  • Under suitable conditions, this series of draws comes to approximate draws from the actual joint posterior of all the parameters. (A toy Gibbs sampler is sketched below.)
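To make the cycle-by-cycle scheme concrete, here is a toy Gibbs sampler in Python (an addition, not from the slides). The target is a bivariate normal with correlation .8, chosen only because its full conditionals are known in closed form; the loop is exactly the "draw each variable given the current values of the others" scheme described above.

  import numpy as np

  rng = np.random.default_rng(3)
  rho = 0.8                      # correlation of the target bivariate normal
  n_cycles = 20_000

  x, y = 0.0, 0.0                # cycle-0 starting values
  draws = np.empty((n_cycles, 2))

  for t in range(n_cycles):
      # Full conditional of x given y: normal(rho * y, sqrt(1 - rho^2))
      x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
      # Full conditional of y given the just-updated x
      y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
      draws[t] = (x, y)

  burned = draws[1000:]          # discard early cycles as burn-in
  print(np.round(burned.mean(axis=0), 2))          # near (0, 0)
  print(round(np.corrcoef(burned.T)[0, 1], 2))     # near 0.8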