1
Introduction to Hierarchical Models
  • Intuitions of Hierarchical Modeling
  • The hierarchical setup and the concept of
    exchangeability
  • The hierarchical Poisson model with gamma priors
  • The hierarchical normal model with normal priors

2
The Concept of Hierarchical Data Structures
  • Hierarchical data is ubiquitous in the social
    sciences where measurement occurs at different
    levels of aggregation.
  • e.g. we collect measurements of individuals who
    live in a certain locality or belong to a
    particular race or social group.
  • When this occurs, standard techniques either
    assume that these groups belong to entirely
    different populations or ignore the aggregate
    information entirely.
  • Hierarchical models provide a way of pooling the
    information for the disparate groups without
    assuming that they belong to precisely the same
    population.

3
The basic Bayesian setup for hierarchical data
structures
  • Suppose we have collected data about some random
    variable Y from m different populations with n
    observations for each population.
  • Let y_ij represent observation j from population
    i.
  • Suppose y_ij ~ f(θ_i), where θ_i is a vector of
    parameters for population i.
  • Further, θ_i ~ f(φ), where φ may also be a vector.
  • → Note: up to this point this is just a standard
    Bayesian setup in which we assign some prior
    distribution to the parameters θ that govern the
    distribution of y.
  • Now we extend the model and assume that the
    parameters φ_1, φ_2 that govern the distribution
    of the θs are themselves random variables, and we
    assign a prior distribution to them as well:
  • φ ~ f(a, b),
  • where the distribution assigned to φ is called the
    hyperprior. The parameters a, b, c, d of the
    hyperprior may be known and represent our prior
    beliefs about φ, or, in theory, we can assign a
    probability distribution to these quantities as
    well and proceed to another layer of hierarchy.
    (The three stages are collected compactly below.)
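Collecting the three stages just described (a compact LaTeX restatement, with a and b standing in for the known hyperprior parameters):

  $$ y_{ij} \mid \theta_i \sim f(y \mid \theta_i), \qquad
     \theta_i \mid \phi \sim p(\theta \mid \phi), \qquad
     \phi \sim p(\phi \mid a, b). $$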

4
Graphical Illustration of a hierarchical model
(Figure showing the three levels of the model:
hyperpriors for the full sample, priors for each
sub-population, and the data.)
5
Exchangeability
  • Exchangeability (formal): The parameters θ_1,
    θ_2, …, θ_n are exchangeable in their joint
    distribution if p(θ_1, θ_2, …, θ_n) is invariant
    to permutations of the indices 1, 2, …, n
    (written out symbolically below).
  • Exchangeability (informal): If no information
    other than the data is available to distinguish
    any of the θ_js from any of the others, and no
    ordering of the parameters can be made, one must
    assume symmetry among the parameters in the prior
    distribution.
  • This concept is closely related to that of
    independent and identically distributed random
    variables where, conditional on the data, each
    observation is treated the same.
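Stated symbolically, the formal definition requires that for every permutation π of the indices,

  $$ p(\theta_1, \theta_2, \ldots, \theta_n) \;=\; p(\theta_{\pi(1)}, \theta_{\pi(2)}, \ldots, \theta_{\pi(n)}). $$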

6
Example: A Gamma-Poisson Model
  • Gill (2002) examines data on the number of
    marriages per 1,000 people in Italy from 1936 to
    1951. He asks, did the marriage rate decline
    during the war years?
  • To model this process, he assumes that the number
    of marriages per 1000 people follows a Poisson
    distribution.
  • marriages_t ~ Poisson(λ_t)
  • How would we have addressed this question before
    now?
  • Why might we model this as a hierarchical process?

7
Exchangeability continued
  • Exchangeability means that we can treat the
    parameters for each sub-population as
    exchangeable units.
  • In its simplest form, each parameter θ_j is
    treated as an independent sample from a
    distribution governed by the unknown parameter
    vector φ:
  • p(θ_1, θ_2, …, θ_n | φ) = ∏_j p(θ_j | φ)
  • → In a more general form, we may also condition
    on data that we have about the different
    sub-populations.
  • Further, we can write the joint prior
    distribution as
  • p(θ_1, θ_2, …, θ_n, φ) = p(θ_1, θ_2, …, θ_n | φ) p(φ).
  • By Bayes' rule,
  • p(θ_1, θ_2, …, θ_n, φ | Y) ∝ prior × likelihood
    for Y (combined explicitly below).
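Combining the factored prior with the likelihood (and assuming, as the exchangeable setup implies, that the y_i are conditionally independent given the θ_i), the joint posterior is

  $$ p(\theta_1, \ldots, \theta_n, \phi \mid Y) \;\propto\;
     p(\phi) \prod_i p(\theta_i \mid \phi) \prod_i p(y_i \mid \theta_i). $$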

8
Italian Marriages cont.
  • marriages_t ~ Poisson(λ_t) for t = 1936, …, 1951.
  • To model this as a hierarchical process, we
    assume that each of the annual means λ_t is an
    exchangeable draw from a common distribution.
  • → In this case, the gamma distribution has
    desirable properties.
  • Thus,
  • λ_t ~ Gamma(α, β) for t = 1936, …, 1951
  • → Note that α and β are unknown parameters.
  • To satisfy the requirement of exchangeability,
    what must we assume about the data-generating
    process?
  • Finally, to complete the hierarchical structure,
    we must assign hyperpriors for the parameters α
    and β. Again, the gamma distribution has nice
    properties, so we assume that
  • α ~ Gamma(A, B) and β ~ Gamma(C, D).
  • → Note: we pick real numbers for A, B, C, and D
    to represent our prior beliefs (which in the
    usual case we shall assume are flat). The full
    hierarchy is collected below.

9
Graphical Representation of the Hierarchical
Gamma-Poisson Model
The prior parameters α and β are unknown. Both α
and β are assumed to be drawn from gamma
distributions:
  α ~ Gamma(A, B)        β ~ Gamma(C, D)
The year-specific means λ_t are random draws from a
gamma distribution:
  λ_1936 ~ Gamma(α, β), …, λ_t ~ Gamma(α, β), …, λ_1951 ~ Gamma(α, β)
The data observed for any given year, y_t, is a
random draw from a Poisson distribution with the
year-specific mean:
  y_1936 ~ Poisson(λ_1936), …, y_t ~ Poisson(λ_t), …, y_1951 ~ Poisson(λ_1951)
In this model we have more unknown parameters
than observations! There are t parameters λ_t, plus
α and β, and only t observations. Why is this okay?
10
The conditional distributions of λ_i, α, and β in the
Gamma-Poisson Hierarchical Model
  • To implement this model in a Gibbs sampler, it is
    necessary to derive the conditional distributions
    of λ_i, α, and β. WinBUGS knows what these are,
    but sometimes it is informative to derive them
    ourselves. In this case,
  • p(λ, α, β | y) ∝ [∏_i Poisson(y_i | λ_i)]
    [∏_i p(λ_i | α, β)] p(α) p(β)
  • Using our trick for conditional distributions, we
    know that
  • p(λ_i | α, β, y) ∝ Poisson(y_i | λ_i) p(λ_i | α, β)
    ∝ Gamma(y_i + α, 1 + β)
  • and
  • p(α | λ_1, …, λ_n, β, y) ∝ p(α) ∏_i p(λ_i | α, β)
    → not a standard distribution
  • and
  • p(β | λ_1, …, λ_n, α, y) ∝ p(β) ∏_i p(λ_i | α, β)
    → a gamma distribution (explicit forms below)
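Carrying the algebra one step further gives the explicit forms (a sketch assuming the shape-rate parameterization of the gamma with rate β, the hyperpriors α ~ Gamma(A, B) and β ~ Gamma(C, D) from the previous slide, and n years of data):

  $$ \lambda_i \mid \alpha, \beta, y_i \sim \text{Gamma}(y_i + \alpha,\; 1 + \beta), \qquad
     \beta \mid \lambda, \alpha \sim \text{Gamma}\Big(n\alpha + C,\; D + \textstyle\sum_i \lambda_i\Big), $$

while

  $$ p(\alpha \mid \lambda, \beta) \;\propto\;
     \frac{\beta^{n\alpha}}{\Gamma(\alpha)^n}\Big(\textstyle\prod_i \lambda_i\Big)^{\alpha-1}\,\alpha^{A-1} e^{-B\alpha} $$

has no standard form, so it is usually updated with, for example, a Metropolis step.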

11
WinBUGS Implementation of the Italian Marriage
Rates Example
  model {
    for (i in 1:16) {
      marriages[i] ~ dpois(lambda[i])
      lambda[i] ~ dgamma(alpha, beta)
    }
    alpha ~ dgamma(1, 1)    # A = 1, B = 1 (diffuse prior)
    beta ~ dgamma(1, 1)     # C = 1, D = 1 (diffuse prior)
  }
  # Data
  list(marriages = c(7, 9, 8, 7, 7, 6, 6, 5, 5, 7, 9, 10, 8, 8, 8, 7))
  • Use the boxplot function in WinBUGS.
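For comparison with the WinBUGS version, here is a minimal hand-coded Gibbs sampler in Python built on the conditional distributions from slide 10. It is an illustrative sketch, not part of the original slides: the random-walk Metropolis step for alpha (whose conditional is non-standard), the proposal scale of 0.25, the starting values, and the burn-in of 1,000 draws are all assumptions.

  # Gibbs sampler for the hierarchical Gamma-Poisson model of the marriage data.
  # lambda_t and beta have the gamma full conditionals derived on slide 10;
  # alpha is updated with a random-walk Metropolis step.
  import numpy as np
  from scipy.special import gammaln

  y = np.array([7, 9, 8, 7, 7, 6, 6, 5, 5, 7, 9, 10, 8, 8, 8, 7], dtype=float)
  n = len(y)
  A, B, C, D = 1.0, 1.0, 1.0, 1.0          # diffuse gamma hyperprior parameters
  rng = np.random.default_rng(0)

  def log_p_alpha(alpha, lam, beta):
      # log p(alpha | lambda, beta), up to a constant (shape-rate gamma)
      if alpha <= 0:
          return -np.inf
      return (n * alpha * np.log(beta) - n * gammaln(alpha)
              + (alpha - 1.0) * np.log(lam).sum()
              + (A - 1.0) * np.log(alpha) - B * alpha)

  n_iter = 5000
  alpha, beta = 1.0, 1.0                   # starting values
  draws = np.empty((n_iter, n + 2))
  for s in range(n_iter):
      # lambda_t | alpha, beta, y_t ~ Gamma(y_t + alpha, rate = 1 + beta)
      lam = rng.gamma(y + alpha, 1.0 / (1.0 + beta))
      # beta | lambda, alpha ~ Gamma(n*alpha + C, rate = D + sum(lambda))
      beta = rng.gamma(n * alpha + C, 1.0 / (D + lam.sum()))
      # alpha | lambda, beta is non-standard: random-walk Metropolis update
      prop = alpha + 0.25 * rng.standard_normal()
      if np.log(rng.uniform()) < log_p_alpha(prop, lam, beta) - log_p_alpha(alpha, lam, beta):
          alpha = prop
      draws[s] = np.concatenate([lam, [alpha, beta]])

  # posterior means of the year-specific marriage rates, after burn-in
  print(draws[1000:, :n].mean(axis=0).round(2))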

12
Example 2: Political Ideologies
  • Previously, we examined individuals' responses to
    a 7-point liberal-conservative ideology survey
    question in two different ways.
  • Method 1) Assume that all respondents are drawn
    from the same pool and examine the overall mean
    and variance.
  • Method 2) Break respondents into categories based
    on their self-reported partisan identities and
    estimate the mean and variance of Democrats,
    Republicans, and Independents separately.
  • Using a hierarchical approach, individuals are
    treated as independent draws from a
    party-specific distribution, but the mean of each
    of the party-specific distributions is itself a
    draw from a hyper-distribution with some unknown
    mean and variance.
  • → If the hyper-distribution has zero variance,
    then Method 1 above is a special case.
  • → If the hyper-distribution has infinite
    variance, then Method 2 above is a special case.
  • → Typically, however, we find that if we borrow
    strength across populations by including the
    hyper-distribution, the separate population means
    shrink toward a common mean.
  • Note: I am using the term "population"
    colloquially, not in a technical sense.

13
The ideology example
  • We assume that the random variable respondent
    ideology (denoted y) follows a normal
    distribution with a mean and variance specific to
    the respondent's party:
  • y_dem,j ~ N(μ_dem, τ_dem)
  • y_ind,j ~ N(μ_ind, τ_ind)
  • y_rep,j ~ N(μ_rep, τ_rep) for all j in the sample.
  • Further, assume that each μ_p is also a normal
    random variable with unknown mean and variance.
  • Thus, μ_p ~ N(M, T) for p ∈ {Dems, Inds, Reps}.
  • But we shall model the precision terms in the
    standard way, with non-informative gamma priors.
  • Thus, τ_p ~ Gamma(.1, .1) for p ∈ {Dems, Inds, Reps}.
  • Finally, we need to assign pdfs for the
    hyperpriors as well.
  • What would be a reasonable distribution to
    choose?

14
Ideology example continued
  • If μ_p ~ N(M, T) for p ∈ {Dems, Inds, Reps},
  • then the obvious choice for the hyperpriors for M
    and T is to assume that the mean is normally
    distributed and the precision follows a gamma
    distribution.
  • Thus, M ~ N(4, .01) and T ~ Gamma(.1, .1).
  • Note: if we assume that T → ∞, then we are
    imposing the condition that μ_Dem = μ_Ind = μ_Rep.
    This is equivalent to assuming that all
    observations are drawn from a distribution with
    the same overall mean.
  • If we assume that T → 0, then we are assuming
    that there is no underlying structure to the
    data. This is equivalent to assuming that there
    is no hierarchical structure in the data.

15
Derivation of the conditional distributions of μ,
τ, and M for the normal hierarchical model
  • y_p,j ~ N(μ_p, τ_p);  μ_p ~ N(M, T);  τ_p ~ Gamma(α, β);
    M ~ N(m, t);  T ~ Gamma(a, b)
  • By the conditional distribution trick,
  • p(μ_p | y, τ_p, M, T) ∝ [∏_j N(y_p,j | μ_p, τ_p)] × N(μ_p | M, T)

This is the kernel of a normal distribution. The
mean of this distribution is a weighted average
of the sample mean for sub-population p and the
parent population mean. The weights are provided by
the precision of the parent population and the
precision of the sub-population. The resulting
kernel is written out below.
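Written out (a sketch introducing n_p for the number of respondents in party p and ȳ_p for their sample mean, both pieces of notation not on the original slide, with normals parameterized by precision as in WinBUGS):

  $$ \mu_p \mid \cdot \;\sim\; N\!\left(\frac{T\,M + n_p\,\tau_p\,\bar{y}_p}{T + n_p\,\tau_p},\;\; \text{precision} = T + n_p\,\tau_p\right). $$

As T grows relative to n_p τ_p the estimate is pulled toward the parent mean M; as it shrinks, the estimate stays near the sub-population sample mean.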
16
Derivation of the conditional distributions of μ,
τ, and M for the normal hierarchical model
  • y_p,j ~ N(μ_p, τ_p);  μ_p ~ N(M, T);  τ_p ~ Gamma(α, β);
    M ~ N(m, t);  T ~ Gamma(a, b)
  • By the conditional distribution trick,
  • p(M | y, μ_p, τ_p, T) ∝ [∏_p N(μ_p | M, T)] × N(M | m, t)

This is the kernel of a normal distribution. The
mean of this distribution is a weighted average
of the prior population mean and the average
sub-population mean. The weights are provided by
the prior precision and the precision of the
parent population. The corresponding kernel is
written out below.
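Written out (a sketch, with P = 3 sub-populations and μ̄ denoting the average of the μ_p, notation introduced here):

  $$ M \mid \cdot \;\sim\; N\!\left(\frac{t\,m + P\,T\,\bar{\mu}}{t + P\,T},\;\; \text{precision} = t + P\,T\right). $$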
17
WinBUGS Implementation of the Ideology Example
  model {
    for (i in 1:669) {
      ideologyR[i] ~ dnorm(mu[pid3[i]], tau[pid3[i]])
      temp1[i] <- pid3[i]
      temp2[i] <- pid7[i]
    }
    for (j in 1:3) {
      mu[j] ~ dnorm(M, T)
      tau[j] ~ dgamma(.1, .1)
    }
    M ~ dnorm(4, .01)
    T ~ dgamma(.1, .1)
  }
18
Final Comments
  • In the case of political ideologies, we find that
    there is very little shrinkage toward the
    overall mean. Why?
  • One of the properties of hierarchical models is
    that if the posterior precision of the hyperprior
    is very large, then we are essentially finding
    that each of the sub-populations is drawn from a
    common distribution with zero variance. This
    means that we have endogenously estimated that
    the means of all of the sub-populations are
    identical. In this case, there may be a great
    deal of shrinkage toward the overall population
    mean due to the fact that variation from the
    population mean is random noise.
  • On the other hand, if the posterior precision of
    the hyperprior is very small, then we are
    essentially finding that each of the
    sub-populations is drawn from a distribution with
    a very different mean, in which case there is
    little shrinkage toward the population mean.
    This is desirable because we don't want to
    impose structure where none exists.
  • The absence of shrinkage in the example is due to
    the fact that Democrats, Republicans, and
    Independents actually do have significantly
    different ideologies.