Title: Sample Size calculations for multilevel models (Part II)
1Sample Size calculations for multilevel models
(Part II)
- William Browne and Mousa Golalizadeh
- Department of Clinical Veterinary Sciences
- University of Bristol
2Summary
- Introduction to sample size calculations.
- Simulation-based approaches.
- The MLPOWSIM software.
- Two level normal examples.
- Binary responses.
- Cross classified models.
- Future plans.
3Background
- Many quantitative social science research
questions are of the form of a hypothesis A has
a significant effect on B. - To answer such a question data is collected that
allows the researcher to (hopefully) test whether
statistically A has a significant effect on B.
(In fact we aim to reject the hypothesis that A
doesnt significantly affect B). - A test is performed and either the researcher is
happy and A indeed has a significant effect on B
or is left wondering why the data collected do
not back up their hypothesis. Is the hypothesis
false or was the data not sufficient? - The sufficiency of the data is the motivation for
sample size calculations.
4Example
- Suppose I have the research question Are
Welshmen on average taller than 175 cms? - I now need to get hold of a random sample of n
Welshmen and measure each of their heights. - I make some statistical assumption about the
distribution of the heights of Welshmen e.g. that
they come from a Normal distribution. - I might like to check this assumption by plotting
a histogram of the data. - I can then form a statistical hypothesis test and
test whether indeed Welshmen are taller than
175cms. - I need to decide how big to make n, my sample of
Welshmen.
5Hypothesis Testing
- Let us assume our null hypothesis is that the
average height of Welshmen (µ) is 175cm. - So we test H0µ175 vs HAµgt175 (or alternatively
H0?0 vs HA?gt0 where ?µ-175) - In practice we calculate from our sample its mean
( ) and standard deviation (s2) and use these
along with n to form a test statistic which we
can compare with the distribution assumed under
H0
6Type I and Type II errors
- No hypothesis test is perfect and there is always
the possibility of errors - P(Type I error) a significance level or size
- P(Type II error) ß, 1-ß is the power of the
test. - In general we fix a to some value e.g. 0.05, 0.01
then 1-ß depends on our sample size.
Truth Truth
H0 True H0 False
Decision Reject H0 Type I error Correct
Decision Accept H0 Correct Type II error
7Example hypothesis test
- Let us assume that in reality our sample mean is
180cms and the population standard deviation (sd)
is 5cms (known). - We can then form a test statistic as follows
- Note here that for small n and unknown sd we
should use a student-t distribution rather than
Normal. - For a 1-sided Z test we wish Z gt 1.645 and so
we need our sample to be of size 3 to reject H0,
using a student-t distribution increases this to
5. (Here a0.05) - However if the sample mean had been only 176cms
then we would need n gt (1.6455)2 68 Welshmen
to reject H0
8Power calculations
- Our last slide in some sense is backwards as we
cannot get from a given sample mean to choosing a
sample size! - What we do instead is use different terminology
and play God! - We will choose an effect size, ? which will
represent a guess at the increase in the sample
mean for Welshmen. - There then exists an (approximate) formula that
links four quantities, size (a), power (1-ß),
effect size (?) and sample size (n) - Note that the standard error (SE) of ? is a
function of n and s the population sd which is
assumed known. - We can now evaluate one of these quantities
conditional on the others e.g. what sample size
is required given a,1-ß and ??
Here RHS is sum of cases H0 true and H0 false.
9Welsh height example
- Here we have looked at two examples with effect
sizes 5 and 1 respectively. Assume s takes the
value 5 and so let us suppose we take a sample of
size 25 Welshmen. - Then
- Case 1 5/(5/v25)1.645z1-ß,z1-ß3.355
- ß0.9996
- Case 2 1/(5/ v25)1.645z1-ß,z1-ß-0.645
- ß0.25946
- So here a sample of 25 Welshmen from a population
with mean 180cms would almost always result in
rejecting H0, - but if the population mean is 176cms then only
26 of such samples would be rejected. - We can plot curves of how power increases with
sample size as shown in the next slide.
10Power curve for Welshmen example
- Here we see the two power curves for the two
scenarios
11Extending the idea
- The simple formula
can - be used in many situations and hypothesis tests.
- To generalise the idea we assume that ? is an
effect size associated with a statistic that we
wish to compare with a (null) hypothesized value
of 0. - The complication occurs in finding a formula for
the standard error for the statistic and relating
this formula to the sample size, n. - We will next consider an alternative approach
before returning to look at how the above
approach extends to multilevel models.
12The use of simulation
- In reality our (hoped for) research path will be
as follows - Construct research question -gt Form null
hypothesis that we believe false -gt Collect
appropriate data -gt Reject hypothesis therefore
proving our research question. - Assuming what we believe in our research question
is correct and hence null hypothesis is false we
can still be let down by not collecting enough
data. - The idea behind using simulation is to simulate
the data gathering process (assuming we know the
right answer) many times and see how often we can
reject the null hypothesis. The percentage of
rejected null hypotheses (via simulation) will
then estimate power.
13Simulation in our example
- Consider our Welsh height example case 2 where we
believe Welshmen have a mean height of 176cms
(and sd 5cms) and we are testing the hypothesis
H0µ175cms, and we consider a sample size 25. - Then we generate N samples (e.g. 5000) of size
25, - and for each sample
form a lower bound for the confidence interval of
the form - . This we compare with
the value 175 and the proportion greater than 175
is an estimate of the power of the test. - We can repeat this exercise for different sample
sizes and form a power curve.
14Power curve comparison
Note simulation curve is a good approximation of
the theoretical curve although there are some
minor (Monte Carlo) errors even with 5000
simulations per sample size.
15Advantages/Disadvantages
- Theoretical approach is quick when the formula
can be derived. - Approximations for more complex situations exist
which are equally quick. - Simulation approach generalizes to more
situations but is much slower and we may need
large numbers of simulations per scenario to get
accurate power estimates. - Note that alternative method described next
typically needs less simulations per scenario for
the same accuracy.
16Methods to calculate Power calculations from
simulations
- The method illustrated previously we will
describe as the 0/1 method where each simulation
results in a sample that would either reject or
fail to reject the null hypothesis. - An alternative method (the standard error method)
is to instead take from each simulation the
estimated standard error of the parameter to be
tested. Then the average of these standard errors
can be used with the effect size to construct the
power. - A 2nd alternative (when MCMC is used) is as in
the 0/1 approach to establish whether the
simulation sample rejects the null hypothesis but
rather than construct the interval assuming
normality to use the quantiles of the MCMC chain.
17MLPOWSIM software package
- Software package currently being written and
tested (90 complete). - A rather old fashioned text-based interface
allows user to specify sample size scenarios. - Software then generates either MLwiN macro code
or an R command file to run the simulations to
calculate power for scenarios. - Normal, Binomial and Poisson response offered.
- Software will cope with 1-level, 2-level
(balanced and unbalanced), 3-level nested
(balanced and unbalanced) and cross-classified
(balanced and unbalanced) with 2 higher
classifications models. - Many options for unbalanced designs.
- Extensive user manual in preparation (100 pages
with examples)
18MLPOWSIM state of play
- The software still requires code to generate
macros for unbalanced cross-classified models in
MLwiN to be written. - Documentation for normal responses is almost
complete with short chapters on other response
types to write. - Currently MLwiN code (using IGLS) runs
considerably faster than R code (using lmer) for
most models though we have had some success
speeding up R code. - Exception is cross-classified models where we
have only programmed an MCMC-based macro in MLwiN
that is typically slower than the lmer code in R.
See more details later.
19What happens with multilevel data?
- We will here mainly consider 2-level models and
take as our application area education, so we
have students nested within schools. - When deciding on a sampling scheme we have many
choices - How many schools, N ?
- How many pupils per school, nj ?
- Should we collect the same size sample from each
school ? - Our decision will depend on which parameter we
wish to estimate in the model.
20Education Example
- For motivation we considered a two level dataset
with exam marks measured for each student in a
collection of schools. In fact this dataset
exists and has 4915 students in 96 schools. - Our hypothesis of interest is that the exam mark
for an average student is gt 20 (null hypothesis
20) which with such a large sample results in the
null hypothesis being rejected for our particular
data. - If we fit the following multilevel model to the
data we get the estimates given - If we treat these estimates as population values,
we are interested in what power for testing our
hypothesis results from various combinations of N
and nj
21Design effect formula
- If we assume balance then with n pupils in each
of N schools for our simple model (and only this
simple model) the following formula holds - Design effect 1 (n-1)? where ? is the
intra-class correlation. - So if we know the simple random sample size
required for a given power we need to multiply
this by the design effect. - For example our data has ?16.205/(16.205139.367)
0.104 - So for schools of size 10 pupils we would need
190.1041.94 times as many students (in total)
to get the same power. - For this model (and this model only) we could
therefore perform our power calculations assuming
simple random sampling from a population with
variance 155.572 and scale up the sample required
based on the design. - So
- And for schools of size 10 we require
1.94338.4657 pupils which we can round up to 66
schools.
22Comparison of formula/simulations
- The following graph compares the design effect
formula to the simulation approach
23Other sample size issues
- There are other reasons why we may be interested
in sample size questions in multilevel modelling. - It is often problematic to fit multilevel models
when the number of higher level units is small as
demonstrated in the last graph - Also some methods can be biased for small sample
sizes. - Note although method comparison is done using a
similar approach of generating simulated
datasets, here power calculations are not the
main aim that said when performing power
calculations parameter bias of methods should be
noted as this will result in bias of predicted
power (see later). - Browne (1998), Browne and Draper(2006) compare
MCMC, RIGLS and IGLS for small sample sizes and
continuous responses, and MCMC, MQL and PQL for
binary response models. - Maas and Hox (2004) also look at small sample
sizes and how robust estimation is to the Normal
distributional assumption of the level 2
residuals.
24Sampling policy
- The design effect formula
- Design effect 1 (n-1)?
- suggests that if we are to sample a fixed
(balanced) number of pupils nN then our best
power results when n is smallest i.e. sampling
one pupil each from 100 schools is better than
sampling 100 pupils from the same school. - The effect of sampling policy is most important
in scenarios where ? is large e.g. repeated
measures designs. - The simulation procedure gives approximately the
same power curve and so in this simple example we
have an easy to use formula. - The reason in practice for sampling several
pupils from each school is purely the additional
cost incurred in visiting additional schools.
25More complex examples random intercepts and
random slopes
- We will now look at more complex random effect
models with predictor variables. - We will consider the random intercept model
- and the random slopes model
- We will consider how to extend (approximately)
the theoretical approach and also the simulation
approach.
26Example problem
- We will continue with our educational example but
also consider the effect of gender (ß1). For a
random intercepts model let us assume the true
parameter values are ß020.9, ß1 1.6, s2u15,
s2e135. - We have two hypotheses to test
- Hypothesis 1 boys get on average more than 20
marks (H0 ß020 vs HA ß0gt20) - Hypothesis 2 girls do better on average than
boys (H0 ß10 vs HA ß1gt0) - We will also consider the effect of random slopes
on the dataset so will have a second model with
additionally s2u010, su012, s2u15
27What happens when we include random slopes?
- The following table gives power values for ß1 for
the random intercept model. - Note that pairs of values with the same total Nn
have similar powers.
Schools (N) Schools (N) Schools (N) Schools (N) Schools (N) Schools (N)
Pupils (n) 20 30 40 50 60
Pupils (n) 20 0.39 0.49 0.60 0.70 0.75
Pupils (n) 30 0.51 0.67 0.76 0.84 0.89
Pupils (n) 40 0.61 0.77 0.86 0.92 0.95
Pupils (n) 50 0.69 0.84 0.92 0.96 0.98
Pupils (n) 60 0.77 0.90 0.95 0.98 0.99
28What happens when we include random slopes?
- The following table gives power values for ß1 for
the random slopes model. - Note that pairs of values with the same total Nn
now do not have similar powers and larger N is
better.
Schools (N) Schools (N) Schools (N) Schools (N) Schools (N) Schools (N)
Pupils (n) 20 30 40 50 60
Pupils (n) 20 0.33 0.45 0.54 0.63 0.69
Pupils (n) 30 0.42 0.55 0.67 0.76 0.82
Pupils (n) 40 0.49 0.64 0.76 0.83 0.89
Pupils (n) 50 0.57 0.69 0.80 0.89 0.93
Pupils (n) 60 0.60 0.76 0.85 0.91 0.95
29Binary response models
- The simplest model for binary data is to compare
an observed proportion with some fixed
proportion, e.g. to detect whether more than 9
out of 10 cats prefer! - We will be using as our example the use of
contraceptives by women in Bangladesh taken from
the MLwiN user guide. Here the full dataset
(nearly 2000 women) has roughly 40 contraceptive
use and we are interested in what sample size we
need to say that less than 50 use
contraceptives. - The standard approach uses a normal approximation
to the binomial for the underlying probability to
give a standard formula that can be easily
calculated and gives a sample size of 153 for a
power of 0.80.
30Model based approach
- We will use a logistic regression model
- and now the probability 0.5 will correspond to
the value 0 for ß0. - The MLPOWSIM software will perform approximate
Wald tests i.e. assume that ß0 is normally
distributed. - As can be seen by the curves in the following
slide the power is fairly similar (sample size of
160 for power of 0.8) but not identical to that
obtained by the standard formulae as a different
quantity is being assumed normally distributed.
31Power graphs
Here we see the results of the 0/1 approach in
blue and the much smoother red line corresponds
to the standard error approach for calculating
power.
32Multilevel binary response models
- We can extend the earlier modelling of the
Bangladeshi dataset to account for the structure
of the data women nested within districts. - We will again test whether usage of contraceptive
is less than 50 but now we account for the
clustering in the model with a random effects
logistic regression model. - The estimate of the district level variability is
0.246 - We also now have options of different estimation
methods and will consider MQL1 and PQL2.
33Differences between methods MQL 1
- Here we see that the SE method (red) gives higher
power than the 0/1 method (blue) but this is due
to the bias in the estimates and hence the SEs.
Here the 0/1 method is preferable.
34Differences between methods PQL 2
- Here we see less difference between the SE
method (red) and the 0/1 method (blue) as PQL is
far less biased than MQL.
35Cross-classified models
- We will here consider a dataset of education
results from Fife in Scotland. Here for each
student their attainment at 16 can be affected by
both the secondary school and the primary school
they attended. - We will look at perhaps the simplest model that
accounts for the structure. Here our response is
a mark out of 10 from which we subtract 5 which
we treat as a pass mark. This response is then
assumed to be affected by both the primary and
secondary school attended - Approximate estimates from the actual data that
we will use for our power calculations are ß0
0.5, variances for primary, secondary and
residual are 1.2, 0.4 and 5 respectively. We are
interested in sample sizes to detect an average
greater than 5.
36Cross classified methods
- Currently in MLPOWSIM we have only implemented
cross classified models using MCMC. - The IGLS method uses a clever trick of converting
cross-classified models to constrained nested
models but this is harder to do within simulation
macros. - The R function lmer allows cross-classified
models which although slower than nested models
can be computed with no problems. - We looked at balanced structures with 5 pupils
per combination of secondary and primary school,
20-50 primary schools in steps of 10 and 10-30
secondary schools in steps of 10. - The MCMC method was run for a burnin of 500 and
main run of 5001 for each simulation. We used the
default inverse-gamma (e,e) priors for the
variance parameters. - The power values for various combinations can be
seen on the next page based on the standard error
method.
37Comparison of Power estimates
XC1 XC 2 MCMC R
20 10 0.39 0.47
20 20 0.47 0.56
20 30 0.67 0.59
30 10 0.54 0.55
30 20 0.67 0.66
30 30 0.72 0.70
40 10 0.60 0.61
40 20 0.73 0.72
40 30 0.75 0.78
50 10 0.62 0.64
50 20 0.79 0.77
50 30 0.83 0.83
38Cross-classified models
- From the table we see reasonable agreement
between MLwiN and R. - When the number of higher level units is 10 we
have bigger differences as the effect of prior
and running for only 5000 will affect the MCMC
estimates. - This was more marked when we initially also
considered only 10 primary schools. - It is reassuring that in the range that counts
i.e. where power is 0.8 the methods agree well. - R took of the order of 1 hour 20 minutes whilst
the MCMC code ran overnight.
39Conclusions
- In this talk we have shown the flexibility of
using simulation to perform power calculations
for multilevel models. - Although computationally the approach can be slow
the flexibility of the approach means it can be
used for virtually all models given enough
assumptions. - Low powered studies often involve small amounts
of data thus making the power calculations
quicker. - Other software PINT (Bosker,Snijders and
Guldemond, 1996) is quick for specific models
balanced 2 level models and we compare answers in
our documentation.
40Further work
- Find (faster) approximations to simulation
results potentially create look up tables,
power curves etc. - Investigate efficient MCMC methods.
- Consider the IGLS algorithm for cross-classified
models - Expand the model classes considered.
- Link MLPOWSIM directly within MLwiN.
41References
- Bosker, R.J., Snijders, T.A.B. and Guldemond, H.
(1996) PINT (Power IN Two-level designs) User
Manual. - Browne, W.J. (1998) Applying MCMC methods to
Multi-level models. Unpublished PhD. thesis.
University of Bath. - Browne, W.J. and Draper, D. (2006) A comparison
of Bayesian and likelihood-based methods for
fitting multilevel models (with discussion).
Bayesian Analysis 1, 473-550. - Mass, C.J.M. and Hox, J.J. (2004) Robustness
issues in multilevel regression analysis.
Statistica Neerlandica 58, 127-137.