Sample Size calculations for multilevel models (Part II) - PowerPoint PPT Presentation

About This Presentation
Title:

Sample Size calculations for multilevel models (Part II)

Description:

Power curve for Welshmen example. Here we see the two power curves for the two scenarios: ... full dataset (nearly 2000 women) has roughly 40% contraceptive use ... – PowerPoint PPT presentation

Number of Views:206
Avg rating:3.0/5.0
Slides: 42
Provided by: smssderm
Category:

less

Transcript and Presenter's Notes

Title: Sample Size calculations for multilevel models (Part II)


1
Sample Size calculations for multilevel models
(Part II)
  • William Browne and Mousa Golalizadeh
  • Department of Clinical Veterinary Sciences
  • University of Bristol

2
Summary
  • Introduction to sample size calculations.
  • Simulation-based approaches.
  • The MLPOWSIM software.
  • Two level normal examples.
  • Binary responses.
  • Cross classified models.
  • Future plans.

3
Background
  • Many quantitative social science research
    questions are of the form of a hypothesis A has
    a significant effect on B.
  • To answer such a question data is collected that
    allows the researcher to (hopefully) test whether
    statistically A has a significant effect on B.
    (In fact we aim to reject the hypothesis that A
    doesnt significantly affect B).
  • A test is performed and either the researcher is
    happy and A indeed has a significant effect on B
    or is left wondering why the data collected do
    not back up their hypothesis. Is the hypothesis
    false or was the data not sufficient?
  • The sufficiency of the data is the motivation for
    sample size calculations.

4
Example
  • Suppose I have the research question Are
    Welshmen on average taller than 175 cms?
  • I now need to get hold of a random sample of n
    Welshmen and measure each of their heights.
  • I make some statistical assumption about the
    distribution of the heights of Welshmen e.g. that
    they come from a Normal distribution.
  • I might like to check this assumption by plotting
    a histogram of the data.
  • I can then form a statistical hypothesis test and
    test whether indeed Welshmen are taller than
    175cms.
  • I need to decide how big to make n, my sample of
    Welshmen.

5
Hypothesis Testing
  • Let us assume our null hypothesis is that the
    average height of Welshmen (µ) is 175cm.
  • So we test H0µ175 vs HAµgt175 (or alternatively
    H0?0 vs HA?gt0 where ?µ-175)
  • In practice we calculate from our sample its mean
    ( ) and standard deviation (s2) and use these
    along with n to form a test statistic which we
    can compare with the distribution assumed under
    H0

6
Type I and Type II errors
  • No hypothesis test is perfect and there is always
    the possibility of errors
  • P(Type I error) a significance level or size
  • P(Type II error) ß, 1-ß is the power of the
    test.
  • In general we fix a to some value e.g. 0.05, 0.01
    then 1-ß depends on our sample size.

Truth Truth
H0 True H0 False
Decision Reject H0 Type I error Correct
Decision Accept H0 Correct Type II error
7
Example hypothesis test
  • Let us assume that in reality our sample mean is
    180cms and the population standard deviation (sd)
    is 5cms (known).
  • We can then form a test statistic as follows
  • Note here that for small n and unknown sd we
    should use a student-t distribution rather than
    Normal.
  • For a 1-sided Z test we wish Z gt 1.645 and so
    we need our sample to be of size 3 to reject H0,
    using a student-t distribution increases this to
    5. (Here a0.05)
  • However if the sample mean had been only 176cms
    then we would need n gt (1.6455)2 68 Welshmen
    to reject H0

8
Power calculations
  • Our last slide in some sense is backwards as we
    cannot get from a given sample mean to choosing a
    sample size!
  • What we do instead is use different terminology
    and play God!
  • We will choose an effect size, ? which will
    represent a guess at the increase in the sample
    mean for Welshmen.
  • There then exists an (approximate) formula that
    links four quantities, size (a), power (1-ß),
    effect size (?) and sample size (n)
  • Note that the standard error (SE) of ? is a
    function of n and s the population sd which is
    assumed known.
  • We can now evaluate one of these quantities
    conditional on the others e.g. what sample size
    is required given a,1-ß and ??

Here RHS is sum of cases H0 true and H0 false.
9
Welsh height example
  • Here we have looked at two examples with effect
    sizes 5 and 1 respectively. Assume s takes the
    value 5 and so let us suppose we take a sample of
    size 25 Welshmen.
  • Then
  • Case 1 5/(5/v25)1.645z1-ß,z1-ß3.355
  • ß0.9996
  • Case 2 1/(5/ v25)1.645z1-ß,z1-ß-0.645
  • ß0.25946
  • So here a sample of 25 Welshmen from a population
    with mean 180cms would almost always result in
    rejecting H0,
  • but if the population mean is 176cms then only
    26 of such samples would be rejected.
  • We can plot curves of how power increases with
    sample size as shown in the next slide.

10
Power curve for Welshmen example
  • Here we see the two power curves for the two
    scenarios

11
Extending the idea
  • The simple formula
    can
  • be used in many situations and hypothesis tests.
  • To generalise the idea we assume that ? is an
    effect size associated with a statistic that we
    wish to compare with a (null) hypothesized value
    of 0.
  • The complication occurs in finding a formula for
    the standard error for the statistic and relating
    this formula to the sample size, n.
  • We will next consider an alternative approach
    before returning to look at how the above
    approach extends to multilevel models.

12
The use of simulation
  • In reality our (hoped for) research path will be
    as follows
  • Construct research question -gt Form null
    hypothesis that we believe false -gt Collect
    appropriate data -gt Reject hypothesis therefore
    proving our research question.
  • Assuming what we believe in our research question
    is correct and hence null hypothesis is false we
    can still be let down by not collecting enough
    data.
  • The idea behind using simulation is to simulate
    the data gathering process (assuming we know the
    right answer) many times and see how often we can
    reject the null hypothesis. The percentage of
    rejected null hypotheses (via simulation) will
    then estimate power.

13
Simulation in our example
  • Consider our Welsh height example case 2 where we
    believe Welshmen have a mean height of 176cms
    (and sd 5cms) and we are testing the hypothesis
    H0µ175cms, and we consider a sample size 25.
  • Then we generate N samples (e.g. 5000) of size
    25,
  • and for each sample
    form a lower bound for the confidence interval of
    the form
  • . This we compare with
    the value 175 and the proportion greater than 175
    is an estimate of the power of the test.
  • We can repeat this exercise for different sample
    sizes and form a power curve.

14
Power curve comparison
Note simulation curve is a good approximation of
the theoretical curve although there are some
minor (Monte Carlo) errors even with 5000
simulations per sample size.
15
Advantages/Disadvantages
  • Theoretical approach is quick when the formula
    can be derived.
  • Approximations for more complex situations exist
    which are equally quick.
  • Simulation approach generalizes to more
    situations but is much slower and we may need
    large numbers of simulations per scenario to get
    accurate power estimates.
  • Note that alternative method described next
    typically needs less simulations per scenario for
    the same accuracy.

16
Methods to calculate Power calculations from
simulations
  • The method illustrated previously we will
    describe as the 0/1 method where each simulation
    results in a sample that would either reject or
    fail to reject the null hypothesis.
  • An alternative method (the standard error method)
    is to instead take from each simulation the
    estimated standard error of the parameter to be
    tested. Then the average of these standard errors
    can be used with the effect size to construct the
    power.
  • A 2nd alternative (when MCMC is used) is as in
    the 0/1 approach to establish whether the
    simulation sample rejects the null hypothesis but
    rather than construct the interval assuming
    normality to use the quantiles of the MCMC chain.

17
MLPOWSIM software package
  • Software package currently being written and
    tested (90 complete).
  • A rather old fashioned text-based interface
    allows user to specify sample size scenarios.
  • Software then generates either MLwiN macro code
    or an R command file to run the simulations to
    calculate power for scenarios.
  • Normal, Binomial and Poisson response offered.
  • Software will cope with 1-level, 2-level
    (balanced and unbalanced), 3-level nested
    (balanced and unbalanced) and cross-classified
    (balanced and unbalanced) with 2 higher
    classifications models.
  • Many options for unbalanced designs.
  • Extensive user manual in preparation (100 pages
    with examples)

18
MLPOWSIM state of play
  • The software still requires code to generate
    macros for unbalanced cross-classified models in
    MLwiN to be written.
  • Documentation for normal responses is almost
    complete with short chapters on other response
    types to write.
  • Currently MLwiN code (using IGLS) runs
    considerably faster than R code (using lmer) for
    most models though we have had some success
    speeding up R code.
  • Exception is cross-classified models where we
    have only programmed an MCMC-based macro in MLwiN
    that is typically slower than the lmer code in R.
    See more details later.

19
What happens with multilevel data?
  • We will here mainly consider 2-level models and
    take as our application area education, so we
    have students nested within schools.
  • When deciding on a sampling scheme we have many
    choices
  • How many schools, N ?
  • How many pupils per school, nj ?
  • Should we collect the same size sample from each
    school ?
  • Our decision will depend on which parameter we
    wish to estimate in the model.

20
Education Example
  • For motivation we considered a two level dataset
    with exam marks measured for each student in a
    collection of schools. In fact this dataset
    exists and has 4915 students in 96 schools.
  • Our hypothesis of interest is that the exam mark
    for an average student is gt 20 (null hypothesis
    20) which with such a large sample results in the
    null hypothesis being rejected for our particular
    data.
  • If we fit the following multilevel model to the
    data we get the estimates given
  • If we treat these estimates as population values,
    we are interested in what power for testing our
    hypothesis results from various combinations of N
    and nj

21
Design effect formula
  • If we assume balance then with n pupils in each
    of N schools for our simple model (and only this
    simple model) the following formula holds
  • Design effect 1 (n-1)? where ? is the
    intra-class correlation.
  • So if we know the simple random sample size
    required for a given power we need to multiply
    this by the design effect.
  • For example our data has ?16.205/(16.205139.367)
    0.104
  • So for schools of size 10 pupils we would need
    190.1041.94 times as many students (in total)
    to get the same power.
  • For this model (and this model only) we could
    therefore perform our power calculations assuming
    simple random sampling from a population with
    variance 155.572 and scale up the sample required
    based on the design.
  • So
  • And for schools of size 10 we require
    1.94338.4657 pupils which we can round up to 66
    schools.

22
Comparison of formula/simulations
  • The following graph compares the design effect
    formula to the simulation approach

23
Other sample size issues
  • There are other reasons why we may be interested
    in sample size questions in multilevel modelling.
  • It is often problematic to fit multilevel models
    when the number of higher level units is small as
    demonstrated in the last graph
  • Also some methods can be biased for small sample
    sizes.
  • Note although method comparison is done using a
    similar approach of generating simulated
    datasets, here power calculations are not the
    main aim that said when performing power
    calculations parameter bias of methods should be
    noted as this will result in bias of predicted
    power (see later).
  • Browne (1998), Browne and Draper(2006) compare
    MCMC, RIGLS and IGLS for small sample sizes and
    continuous responses, and MCMC, MQL and PQL for
    binary response models.
  • Maas and Hox (2004) also look at small sample
    sizes and how robust estimation is to the Normal
    distributional assumption of the level 2
    residuals.

24
Sampling policy
  • The design effect formula
  • Design effect 1 (n-1)?
  • suggests that if we are to sample a fixed
    (balanced) number of pupils nN then our best
    power results when n is smallest i.e. sampling
    one pupil each from 100 schools is better than
    sampling 100 pupils from the same school.
  • The effect of sampling policy is most important
    in scenarios where ? is large e.g. repeated
    measures designs.
  • The simulation procedure gives approximately the
    same power curve and so in this simple example we
    have an easy to use formula.
  • The reason in practice for sampling several
    pupils from each school is purely the additional
    cost incurred in visiting additional schools.

25
More complex examples random intercepts and
random slopes
  • We will now look at more complex random effect
    models with predictor variables.
  • We will consider the random intercept model
  • and the random slopes model
  • We will consider how to extend (approximately)
    the theoretical approach and also the simulation
    approach.

26
Example problem
  • We will continue with our educational example but
    also consider the effect of gender (ß1). For a
    random intercepts model let us assume the true
    parameter values are ß020.9, ß1 1.6, s2u15,
    s2e135.
  • We have two hypotheses to test
  • Hypothesis 1 boys get on average more than 20
    marks (H0 ß020 vs HA ß0gt20)
  • Hypothesis 2 girls do better on average than
    boys (H0 ß10 vs HA ß1gt0)
  • We will also consider the effect of random slopes
    on the dataset so will have a second model with
    additionally s2u010, su012, s2u15

27
What happens when we include random slopes?
  • The following table gives power values for ß1 for
    the random intercept model.
  • Note that pairs of values with the same total Nn
    have similar powers.

Schools (N) Schools (N) Schools (N) Schools (N) Schools (N) Schools (N)
Pupils (n) 20 30 40 50 60
Pupils (n) 20 0.39 0.49 0.60 0.70 0.75
Pupils (n) 30 0.51 0.67 0.76 0.84 0.89
Pupils (n) 40 0.61 0.77 0.86 0.92 0.95
Pupils (n) 50 0.69 0.84 0.92 0.96 0.98
Pupils (n) 60 0.77 0.90 0.95 0.98 0.99
28
What happens when we include random slopes?
  • The following table gives power values for ß1 for
    the random slopes model.
  • Note that pairs of values with the same total Nn
    now do not have similar powers and larger N is
    better.

Schools (N) Schools (N) Schools (N) Schools (N) Schools (N) Schools (N)
Pupils (n) 20 30 40 50 60
Pupils (n) 20 0.33 0.45 0.54 0.63 0.69
Pupils (n) 30 0.42 0.55 0.67 0.76 0.82
Pupils (n) 40 0.49 0.64 0.76 0.83 0.89
Pupils (n) 50 0.57 0.69 0.80 0.89 0.93
Pupils (n) 60 0.60 0.76 0.85 0.91 0.95
29
Binary response models
  • The simplest model for binary data is to compare
    an observed proportion with some fixed
    proportion, e.g. to detect whether more than 9
    out of 10 cats prefer!
  • We will be using as our example the use of
    contraceptives by women in Bangladesh taken from
    the MLwiN user guide. Here the full dataset
    (nearly 2000 women) has roughly 40 contraceptive
    use and we are interested in what sample size we
    need to say that less than 50 use
    contraceptives.
  • The standard approach uses a normal approximation
    to the binomial for the underlying probability to
    give a standard formula that can be easily
    calculated and gives a sample size of 153 for a
    power of 0.80.

30
Model based approach
  • We will use a logistic regression model
  • and now the probability 0.5 will correspond to
    the value 0 for ß0.
  • The MLPOWSIM software will perform approximate
    Wald tests i.e. assume that ß0 is normally
    distributed.
  • As can be seen by the curves in the following
    slide the power is fairly similar (sample size of
    160 for power of 0.8) but not identical to that
    obtained by the standard formulae as a different
    quantity is being assumed normally distributed.

31
Power graphs
Here we see the results of the 0/1 approach in
blue and the much smoother red line corresponds
to the standard error approach for calculating
power.
32
Multilevel binary response models
  • We can extend the earlier modelling of the
    Bangladeshi dataset to account for the structure
    of the data women nested within districts.
  • We will again test whether usage of contraceptive
    is less than 50 but now we account for the
    clustering in the model with a random effects
    logistic regression model.
  • The estimate of the district level variability is
    0.246
  • We also now have options of different estimation
    methods and will consider MQL1 and PQL2.

33
Differences between methods MQL 1
  • Here we see that the SE method (red) gives higher
    power than the 0/1 method (blue) but this is due
    to the bias in the estimates and hence the SEs.
    Here the 0/1 method is preferable.

34
Differences between methods PQL 2
  • Here we see less difference between the SE
    method (red) and the 0/1 method (blue) as PQL is
    far less biased than MQL.

35
Cross-classified models
  • We will here consider a dataset of education
    results from Fife in Scotland. Here for each
    student their attainment at 16 can be affected by
    both the secondary school and the primary school
    they attended.
  • We will look at perhaps the simplest model that
    accounts for the structure. Here our response is
    a mark out of 10 from which we subtract 5 which
    we treat as a pass mark. This response is then
    assumed to be affected by both the primary and
    secondary school attended
  • Approximate estimates from the actual data that
    we will use for our power calculations are ß0
    0.5, variances for primary, secondary and
    residual are 1.2, 0.4 and 5 respectively. We are
    interested in sample sizes to detect an average
    greater than 5.

36
Cross classified methods
  • Currently in MLPOWSIM we have only implemented
    cross classified models using MCMC.
  • The IGLS method uses a clever trick of converting
    cross-classified models to constrained nested
    models but this is harder to do within simulation
    macros.
  • The R function lmer allows cross-classified
    models which although slower than nested models
    can be computed with no problems.
  • We looked at balanced structures with 5 pupils
    per combination of secondary and primary school,
    20-50 primary schools in steps of 10 and 10-30
    secondary schools in steps of 10.
  • The MCMC method was run for a burnin of 500 and
    main run of 5001 for each simulation. We used the
    default inverse-gamma (e,e) priors for the
    variance parameters.
  • The power values for various combinations can be
    seen on the next page based on the standard error
    method.

37
Comparison of Power estimates
XC1 XC 2 MCMC R
20 10 0.39 0.47
20 20 0.47 0.56
20 30 0.67 0.59
30 10 0.54 0.55
30 20 0.67 0.66
30 30 0.72 0.70
40 10 0.60 0.61
40 20 0.73 0.72
40 30 0.75 0.78
50 10 0.62 0.64
50 20 0.79 0.77
50 30 0.83 0.83
38
Cross-classified models
  • From the table we see reasonable agreement
    between MLwiN and R.
  • When the number of higher level units is 10 we
    have bigger differences as the effect of prior
    and running for only 5000 will affect the MCMC
    estimates.
  • This was more marked when we initially also
    considered only 10 primary schools.
  • It is reassuring that in the range that counts
    i.e. where power is 0.8 the methods agree well.
  • R took of the order of 1 hour 20 minutes whilst
    the MCMC code ran overnight.

39
Conclusions
  • In this talk we have shown the flexibility of
    using simulation to perform power calculations
    for multilevel models.
  • Although computationally the approach can be slow
    the flexibility of the approach means it can be
    used for virtually all models given enough
    assumptions.
  • Low powered studies often involve small amounts
    of data thus making the power calculations
    quicker.
  • Other software PINT (Bosker,Snijders and
    Guldemond, 1996) is quick for specific models
    balanced 2 level models and we compare answers in
    our documentation.

40
Further work
  • Find (faster) approximations to simulation
    results potentially create look up tables,
    power curves etc.
  • Investigate efficient MCMC methods.
  • Consider the IGLS algorithm for cross-classified
    models
  • Expand the model classes considered.
  • Link MLPOWSIM directly within MLwiN.

41
References
  • Bosker, R.J., Snijders, T.A.B. and Guldemond, H.
    (1996) PINT (Power IN Two-level designs) User
    Manual.
  • Browne, W.J. (1998) Applying MCMC methods to
    Multi-level models. Unpublished PhD. thesis.
    University of Bath.
  • Browne, W.J. and Draper, D. (2006) A comparison
    of Bayesian and likelihood-based methods for
    fitting multilevel models (with discussion).
    Bayesian Analysis 1, 473-550.
  • Mass, C.J.M. and Hox, J.J. (2004) Robustness
    issues in multilevel regression analysis.
    Statistica Neerlandica 58, 127-137.
Write a Comment
User Comments (0)
About PowerShow.com