Title: Introduction to Hierarchical Models
1. Introduction to Hierarchical Models
- Intuitions of hierarchical modeling
- The hierarchical setup and the concept of exchangeability
- The hierarchical Poisson model with gamma priors
- The hierarchical normal model with normal priors
2. The Concept of Hierarchical Data Structures
- Hierarchical data are ubiquitous in the social sciences, where measurement occurs at different levels of aggregation.
  - E.g., we collect measurements of individuals who live in a certain locality or belong to a particular race or social group.
- When this occurs, standard techniques either assume that these groups belong to entirely different populations or ignore the aggregate information entirely.
- Hierarchical models provide a way of pooling the information for the disparate groups without assuming that they belong to precisely the same population.
3. The basic Bayesian setup for hierarchical data structures
- Suppose we have collected data about some random variable Y from m different populations, with n observations for each population.
- Let $y_{ij}$ represent observation j from population i.
- Suppose $y_{ij} \sim f(\theta_i)$, where $\theta_i$ is a vector of parameters for population i.
- Further, $\theta_i \sim f(\phi)$, where $\phi$ may also be a vector.
  - Note: up to this point, this is just a standard Bayesian setup in which we assign a prior distribution to the parameters $\theta$ that govern the distribution of y.
- Now we extend the model and assume that the parameters $\phi_{11}, \phi_{12}$ that govern the distribution of the $\theta$s are themselves random variables, and we assign a prior distribution to these variables as well:
- $\phi \sim f(a, b)$
- where this distribution is called the hyperprior. The parameters a, b, c, d for the hyperprior may be known and represent our prior beliefs about $\phi$ or, in theory, we can assign a probability distribution to these quantities as well and proceed to another layer of hierarchy.
4. Graphical Illustration of a hierarchical model
[Figure: diagram of the model's three levels, labeled "Hyperpriors for the full sample", "Priors for each sub-population", and "Data".]
5. Exchangeability
- Exchangeability (formal): The parameters $\theta_1, \theta_2, \dots, \theta_n$ are exchangeable in their joint distribution if $p(\theta_1, \theta_2, \dots, \theta_n)$ is invariant to permutations of the indices 1, 2, ..., n.
- Exchangeability (informal): If no information other than the data is available to distinguish any of the $\theta_j$s from any of the others, and no ordering of the parameters can be made, one must assume symmetry among the parameters in the prior distribution.
- This concept is closely related to the concept of independent and identically distributed random variables where, conditional on the data, each observation is treated the same.
6. Example: A Gamma-Poisson Model
- Gill (2002) examines data on the number of marriages per 1,000 people in Italy from 1936 to 1951. He asks: did the marriage rate decline during the war years?
- To model this process, he assumes that the number of marriages per 1,000 people follows a Poisson distribution.
- $\text{marriages}_t \sim \text{Poisson}(\lambda_t)$
- How would we have addressed this question before now?
- Why might we model this as a hierarchical process?
7. Exchangeability continued
- Exchangeability means that we can treat the parameters for each sub-population as exchangeable units.
- In its simplest form, each parameter $\theta_j$ is treated as an independent sample from a distribution governed by an unknown parameter vector $\phi$.
- $p(\theta_1, \theta_2, \dots, \theta_n \mid \phi) = \prod_i p(\theta_i \mid \phi)$
  - In a more general form, we may also condition on data that we have about the different sub-populations.
- Further, we can write the joint prior distribution as
- $p(\theta_1, \theta_2, \dots, \theta_n, \phi) = p(\theta_1, \theta_2, \dots, \theta_n \mid \phi)\, p(\phi)$.
- By Bayes' rule,
- $p(\theta_1, \theta_2, \dots, \theta_n, \phi \mid Y) \propto \text{prior} \times \text{likelihood for } Y$.
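The permutation invariance behind the i.i.d. form can be checked numerically. The following is only a minimal sketch: it assumes a gamma density for each $p(\theta_i \mid \phi)$, and the parameter values and hyperparameters are hypothetical numbers chosen for illustration.

```python
import math
from itertools import permutations


def gamma_logpdf(x, shape, rate):
    """Log density of a Gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def joint_log_prior(thetas, shape, rate):
    """log p(theta_1, ..., theta_n | phi) for i.i.d. draws, with phi = (shape, rate)."""
    return math.fsum(gamma_logpdf(t, shape, rate) for t in thetas)


thetas = [6.5, 7.2, 9.1]   # hypothetical parameter values
shape, rate = 2.0, 0.3     # hypothetical hyperparameters phi

# Evaluate the joint log prior under every ordering of the parameters.
values = [joint_log_prior(p, shape, rate) for p in permutations(thetas)]
spread = max(values) - min(values)
print(spread)  # zero up to floating-point rounding
```

Because the joint density is a product of identical factors, relabeling the parameters only reorders the factors, which is why every ordering gives the same value.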
8. Italian Marriages cont.
- $\text{marriages}_t \sim \text{Poisson}(\lambda_t)$ for t = 1936, ..., 1951.
- To model this as a hierarchical process, we assume that the annual means $\lambda_t$ are exchangeable draws from a common distribution.
  - In this case, the gamma distribution has desirable properties.
- Thus,
- $\lambda_t \sim \text{Gamma}(\alpha, \beta)$ for t = 1936, ..., 1951
  - Note that $\alpha$ and $\beta$ are unknown parameters.
- To satisfy the requirement of exchangeability, what must we assume about the data generating process?
- Finally, to complete the hierarchical structure, we must assign hyperpriors to the parameters $\alpha$ and $\beta$. Again, the gamma distribution has nice properties, so we assume that
- $\alpha \sim \text{Gamma}(A, B)$ and $\beta \sim \text{Gamma}(C, D)$.
  - Note: we pick real numbers for A, B, C, D to represent our prior beliefs (which in the usual case we shall assume are flat).
9. Graphical Representation of the Hierarchical Gamma-Poisson Model
The prior parameters $\alpha$ and $\beta$ are unknown. Both $\alpha$ and $\beta$ are assumed to be drawn from gamma distributions.
$\alpha \sim \text{Gamma}(A, B)$
$\beta \sim \text{Gamma}(C, D)$
The year-specific means $\lambda_t$ are random draws from a gamma distribution.
$\lambda_{1936} \sim \text{Gamma}(\alpha, \beta)$
$\lambda_t \sim \text{Gamma}(\alpha, \beta)$
$\lambda_{1951} \sim \text{Gamma}(\alpha, \beta)$
The data observed for any given year, y, is a random draw from a Poisson distribution with a year-specific mean.
$y_{1936} \sim \text{Poisson}(\lambda_{1936})$
$y_t \sim \text{Poisson}(\lambda_t)$
$y_{1951} \sim \text{Poisson}(\lambda_{1951})$
In this model we have more unknown parameters than observations! There are t parameters $\lambda$ plus $\alpha$ and $\beta$, and only t observations. Why is this okay?
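10. The conditional distributions of $\lambda_i$, $\alpha$, $\beta$ in the Gamma-Poisson Hierarchical Model
- To implement this model in a Gibbs sampler, it is necessary to derive the conditional distributions of $\lambda_i$, $\alpha$, and $\beta$. WinBUGS knows what these are, but sometimes it is informative to derive them ourselves. In this case,
- $p(\lambda, \alpha, \beta \mid y) \propto \left[ \prod_i \text{Poisson}(y_i \mid \lambda_i)\, p(\lambda_i \mid \alpha, \beta) \right] p(\alpha)\, p(\beta)$
- Using our trick for conditional distributions, we know that
- $p(\lambda_i \mid \alpha, \beta, y) \propto \text{Poisson}(y_i \mid \lambda_i)\, p(\lambda_i \mid \alpha, \beta) \propto \Gamma(y_i + \alpha,\ 1 + \beta)$
- and
- $p(\alpha \mid \lambda, \beta, y) \propto p(\alpha) \prod_i p(\lambda_i \mid \alpha, \beta)$, which is not a standard distribution,
- and
- $p(\beta \mid \lambda, \alpha, y) \propto p(\beta) \prod_i p(\lambda_i \mid \alpha, \beta)$, which is a gamma distribution.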
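The three conditionals can be turned into a small sampler. The following is a minimal sketch rather than what WinBUGS does internally: it assumes the shape-rate parameterization of the gamma distribution, uses the Gamma(1, 1) hyperpriors and the marriage data from the WinBUGS slide, and handles the non-standard conditional for $\alpha$ with a random-walk Metropolis step on $\log \alpha$ (a choice made here for illustration). The names `gibbs`, `log_cond_alpha`, `step`, and the iteration counts are hypothetical.

```python
import math
import random

# Marriages per 1,000 people, 1936..1951 (the data listed on the WinBUGS slide).
marriages = [7, 9, 8, 7, 7, 6, 6, 5, 5, 7, 9, 10, 8, 8, 8, 7]
A, B, C, D = 1.0, 1.0, 1.0, 1.0  # hyperprior constants, as in the WinBUGS code


def log_cond_alpha(alpha, beta, lambdas):
    """Unnormalized log p(alpha | lambda, beta, y)."""
    n = len(lambdas)
    log_prior = (A - 1.0) * math.log(alpha) - B * alpha
    log_lik = (n * (alpha * math.log(beta) - math.lgamma(alpha))
               + (alpha - 1.0) * sum(math.log(l) for l in lambdas))
    return log_prior + log_lik


def gibbs(y, n_iter=6000, burn_in=1000, step=0.3, seed=1):
    """Return posterior-mean estimates of each lambda_i (assumed tuning values)."""
    rng = random.Random(seed)
    n = len(y)
    alpha, beta = 1.0, 1.0  # arbitrary starting values
    totals = [0.0] * n
    for it in range(n_iter):
        # lambda_i | alpha, beta, y ~ Gamma(y_i + alpha, rate = 1 + beta).
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lambdas = [rng.gammavariate(yi + alpha, 1.0 / (1.0 + beta)) for yi in y]
        # beta | lambda, alpha ~ Gamma(C + n * alpha, rate = D + sum(lambda)).
        beta = rng.gammavariate(C + n * alpha, 1.0 / (D + sum(lambdas)))
        # alpha: random-walk Metropolis on log(alpha); the log(prop) - log(alpha)
        # term is the Jacobian of the log transform.
        prop = alpha * math.exp(step * rng.gauss(0.0, 1.0))
        log_ratio = (log_cond_alpha(prop, beta, lambdas) + math.log(prop)
                     - log_cond_alpha(alpha, beta, lambdas) - math.log(alpha))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            alpha = prop
        if it >= burn_in:
            totals = [s + l for s, l in zip(totals, lambdas)]
    return [s / (n_iter - burn_in) for s in totals]


posterior_means = gibbs(marriages)
for year, est in zip(range(1936, 1952), posterior_means):
    print(year, round(est, 2))
```

Each draw of $\lambda_i$ has conditional mean $(y_i + \alpha)/(1 + \beta)$, a compromise between the observed count $y_i$ and the common gamma mean $\alpha/\beta$, so the printed estimates should be less spread out than the raw counts.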
11. WinBUGS Implementation of the Italian Marriage Rates Example
    model {
      for (i in 1:16) {
        marriages[i] ~ dpois(lambda[i])
        lambda[i] ~ dgamma(alpha, beta)
      }
      alpha ~ dgamma(1, 1)  # A = 1, B = 1 (diffuse priors)
      beta ~ dgamma(1, 1)   # C = 1, D = 1 (diffuse priors)
    }
    # Data
    list(marriages = c(7, 9, 8, 7, 7, 6, 6, 5, 5, 7, 9, 10, 8, 8, 8, 7))
- Use the boxplot function in WinBUGS
12. Example 2: Political Ideologies
- Previously, we examined individuals' responses to a 7-point liberal-conservative ideology survey question in two different ways.
  - Method 1) Assume that all respondents are drawn from the same pool and examine the overall mean and variance.
  - Method 2) Break respondents into categories based on their self-reported partisan identities and estimate the mean and variance of Democrats, Republicans, and Independents separately.
- Using a hierarchical approach, individuals are treated as independent draws from a party-specific distribution, but the mean of each of the party-specific distributions is itself a draw from a hyper-distribution with some unknown mean and variance.
  - If the hyper-distribution has zero variance, then Method 1 above is a special case.
  - If the hyper-distribution has infinite variance, then Method 2 above is a special case.
  - Typically, however, we find that if we borrow strength across populations by including the hyper-distribution, the separate population means shrink toward a common mean.
- Note: I am using the term "population" colloquially, not in the technical sense.
13. The ideology example
- We assume that the random variable respondent ideology (denoted y) follows a normal distribution with a mean and variance specific to the respondent's party (parameterized below by the precision $\tau$):
- $y_{dem,j} \sim N(\mu_{dem}, \tau_{dem})$
- $y_{ind,j} \sim N(\mu_{ind}, \tau_{ind})$
- $y_{rep,j} \sim N(\mu_{rep}, \tau_{rep})$ for all j in the sample.
- Further, assume that $\mu_p$ is also a normal random variable with unknown mean and variance.
- Thus, $\mu_p \sim N(M, T)$ for p in {Dems, Inds, Reps}.
- We shall model the precision terms in the standard way, with non-informative gamma priors.
- Thus, $\tau_p \sim \Gamma(.1, .1)$ for p in {Dems, Inds, Reps}.
- Finally, we need to assign pdfs for the hyperpriors as well.
- What would be a reasonable distribution to choose?
14. Ideology example continued
- If $\mu_p \sim N(M, T)$ for p in {Dems, Inds, Reps},
- then the obvious choice for the hyperpriors for M and T is to assume that the mean is normally distributed and the precision follows a gamma distribution.
- Thus, $M \sim N(4, .01)$ and $T \sim \Gamma(.1, .1)$.
- Note: if we assume that $T \to \infty$, then we are imposing the condition that $\mu_{Dem} = \mu_{Ind} = \mu_{Rep}$. This is equivalent to assuming that all observations are drawn from a distribution with the same overall mean.
- If we assume that $T \to 0$, then we are assuming that there is no underlying structure to the data. This is equivalent to assuming that there is no hierarchical structure in the data.
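The two limits can be illustrated with the standard normal-normal result that, given everything else, the conditional mean of $\mu_p$ is a precision-weighted average of the group's sample mean and M. The following is a rough sketch: the function name `conditional_mean` and the group summaries are hypothetical, not taken from the survey data.

```python
def conditional_mean(ybar, n, tau, M, T):
    """Mean of mu_p given y, tau_p, M, T: precision-weighted average of ybar and M."""
    data_precision = n * tau  # n observations, each with precision tau
    return (data_precision * ybar + T * M) / (data_precision + T)


# Hypothetical group summaries (not the survey data from the slides).
ybar, n, tau, M = 3.0, 200, 0.5, 4.0

# Small T leaves the group mean almost untouched; large T pulls it to M.
for T in (1e-6, 1.0, 100.0, 1e6):
    print(T, round(conditional_mean(ybar, n, tau, M, T), 4))
```

As T grows, the estimate moves from the group's own sample mean (no pooling) toward the common mean M (complete pooling); intermediate values of T give partial shrinkage.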
15. Derivation of the conditional distributions of $\mu$, $\tau$, M for the normal hierarchical model
- $y_{p,j} \sim N(\mu_p, \tau_p)$, $\mu_p \sim N(M, T)$, $\tau_p \sim \Gamma(\alpha, \beta)$, $M \sim N(m, t)$, $T \sim \Gamma(a, b)$
- By the conditional distribution trick,
- $p(\mu_p \mid y, \tau_p, \mu_{-p}, M, T) \propto \prod_j N(y_{p,j} \mid \mu_p, \tau_p)\, N(\mu_p \mid M, T)$
This is the kernel of a normal distribution. The mean of this distribution is a weighted average of the sample mean for sub-population p and the mean of the parent population. The weights are provided by the precision of the parent population and the precision of the sub-population.
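Written out, the weighted average takes the following form. This is the standard conjugate result under the precision parameterization used above; $n_p$ (the number of respondents in sub-population p) and $\bar{y}_p$ (their sample mean) are notation introduced here for illustration.

```latex
\mu_p \mid y, \tau_p, M, T \;\sim\;
N\!\left( \frac{n_p \tau_p \, \bar{y}_p + T \, M}{n_p \tau_p + T},\;\; n_p \tau_p + T \right)
```

The second argument is a precision, so the conditional precision is simply the sum of the data precision $n_p \tau_p$ and the parent-population precision T.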
16. Derivation of the conditional distributions of $\mu$, $\tau$, M for the normal hierarchical model
- $y_{p,j} \sim N(\mu_p, \tau_p)$, $\mu_p \sim N(M, T)$, $\tau_p \sim \Gamma(\alpha, \beta)$, $M \sim N(m, t)$, $T \sim \Gamma(a, b)$
- By the conditional distribution trick,
- $p(M \mid y, \mu_p, \tau_p, T) \propto \prod_p N(\mu_p \mid M, T)\, N(M \mid m, t)$
This is the kernel of a normal distribution. The mean of this distribution is a weighted average of the prior population mean and the average sub-population mean. The weights are provided by the prior precision and the precision of the parent population.
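In explicit form, and again only as a sketch of the standard conjugate result, let P denote the number of sub-populations (three in this example) and $\bar{\mu} = \frac{1}{P} \sum_p \mu_p$; both symbols are introduced here and do not appear on the slides.

```latex
M \mid \mu_1, \dots, \mu_P, T \;\sim\;
N\!\left( \frac{t \, m + P \, T \, \bar{\mu}}{t + P \, T},\;\; t + P \, T \right)
```

With a diffuse prior (small t), the conditional mean of M is close to the average of the sub-population means.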
17. WinBUGS Implementation of the ideology example
    model {
      for (i in 1:669) {
        # Each respondent's ideology is normal with a party-specific mean and precision
        ideologyR[i] ~ dnorm(mu[pid3[i]], tau[pid3[i]])
        temp1[i] <- pid3[i]
        temp2[i] <- pid7[i]
      }
      for (j in 1:3) {
        mu[j] ~ dnorm(M, T)
        tau[j] ~ dgamma(.1, .1)
      }
      M ~ dnorm(4, .01)
      T ~ dgamma(.1, .1)
    }
18. Final Comments
- In the case of political ideologies, we find that there is very little shrinkage toward the overall mean. Why?
- One of the properties of hierarchical models is that if the posterior precision of the hyperprior is very large, then we are essentially finding that each of the sub-populations is drawn from a common distribution with zero variance. This means that we have endogenously estimated that the means of all of the sub-populations are identical. In this case, there may be a great deal of shrinkage toward the overall population mean, because variation from the population mean is random noise.
- On the other hand, if the posterior precision of the hyperprior is very small, then we are essentially finding that each of the sub-populations is drawn from a distribution with a very different mean. In that case there is little shrinkage toward the population mean, which is desirable because we don't want to impose structure where none exists.
- The absence of shrinkage in the example is due to the fact that Democrats, Republicans, and Independents actually do have significantly different ideologies.