Title: ASNEMGE
1ASNEMGE
- Young Investigator Meeting
- Vienna April 21-23, 2006
Maastricht University Department of Methodology
and Statistics
2Outline
- What do clinicians expect from statistics? Could
that be a source of misunderstanding?
- The infamous Null Hypothesis Significance Testing
(NHST)
- What it is and what it is not
- A small selection of questions of practitioners'
interest: sample size (power), randomisation,
confounding, bias and interaction
- Interactive session
3The present (though partly challenged) state of
affairs
- The simplistic dichotomisation of hypothesis
testing, already manifested in the language
usage
- Accept/reject, significant/non-significant
- The p-value has been handled and regarded as a
magical device (it is often overrated and
misinterpreted)
- There is this misconception that significance
testing will put all doubts aside
4What medical doctors wish for
- As a model of clarity, I commend to you the
style of the Improvised Munitions Handbook, in
which the authors append a comment for each
material: "This material was tested. It is
effective."
- All else is recondite, if elegant, but ultimately
fruitless ratiocination
- Can you afford to share this opinion, even if
this point of view is temptingly attractive and
practical?
- Should you get involved with obsessively abstruse
mathematics and futile conjectures?
5How things really are
- Though statistical calculations carry an aura of
numerical exactitude, debate necessarily
surrounds statistical conclusions, MADE AS THEY
ARE AGAINST A BACKGROUND OF UNCERTAINTY
- Statistical tests are an aid to wise judgment, NOT A
TWO-VALUED LOGICAL DECLARATION OF TRUTH OR
FALSITY
6Statistical Inference
- The process of drawing conclusions about a
population on the basis of measurements or
observations made on a sample of individuals from
the population
- Statistics provides the tools for the ancient
inductive process of reasoning from the
particular to the general
- Statistics deals with some general methods of
finding patterns that are hidden in a cloud of
irrelevancies, of natural variability, and of
error-prone observations or measurements
7Inference from Sample to Population
- Population parameters
- σ (Standard deviation)
- µ (Mean)
- Sample statistics (estimates of the parameters)
- s (Standard deviation)
- x̄ (Sample mean)
8NHST The formal structure
- Outcome variable and its measurement scale
- Distributional properties (put on hold!)
- Null hypothesis H0
- Alternative hypothesis HA
- NHST: a confrontation of the all-chance (H0)
explanation with the systematic-plus-chance
explanation (HA) for the observed data
- Because most of the time you test what
you want to discredit, NHST has been described as
the ritualised exercise of devil's advocate
9Null Hypothesis Significance Testing (NHST): A
gist
- The starting assumption of hypothesis testing is
that the observed data can be primarily explained
by chance factors. This starting assumption is
formulated as the null hypothesis, alias H0
10A small digression The Normal distribution
11The H0 distribution and α (critical area)
[Figure: H0 distribution with the critical areas in the tails labelled HA]
12Formulating the hypothesis (interactive session)
- NSAIDs frequently cause gastrointestinal injury
and increase the risk of ulcer complications
- An RCT is carried out to test whether an NSAID
(etodolac) causes less gastric injury than a
standard NSAID (naproxen) in a double-blind trial
assessing the effects on gastroduodenal injury,
symptoms and prostaglandin production in healthy
volunteers
- (Gastrointest Endosc 1995;42:428-33)
13The logic behind NHST: The SIGNAL/NOISE ratio
- Often the test statistics (z, t, F) are referred
to as the SIGNAL/NOISE ratio, which represents the
basic rationale of any statistical test!
[Figure: H0 and HA distributions]
14The SIGNAL/NOISE ratio
- If the signal, the difference, is large enough,
AS COMPARED TO THE NOISE, then it is reasonable
to conclude that the signal has some effect.
- If the signal does not rise above the noise
level, then it is reasonable to conclude that no
association (systematic effect) exists.
- The basis of all inferential statistics is to
attach a probability to this ratio.
16Errors in hypothesis testing
Decision               Null hypothesis is
                       True                      False
-------------------------------------------------------------------
Reject null            Type I error (α)          Correct decision
                                                 (power = 1-β)
Fail to reject null    Correct decision          Type II error (β)
                       (confidence = 1-α)
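The two error rates in the table can be made tangible with a small simulation. The sketch below is only an illustration under assumptions that are not in the slides: a two-sided z-test with known standard deviation, n = 25 per study, α = 0.05, and a hypothetical true effect of 0.5 SD when H0 is false.

```python
import random
from statistics import NormalDist

def rejects_h0(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test with known sigma: True if H0 (mean == mu0) is dismissed."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - mu0) / (sigma / len(sample) ** 0.5)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical

def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=25, trials=4000, seed=1):
    """Fraction of simulated studies in which H0 is dismissed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        hits += rejects_h0(sample, mu0, sigma)
    return hits / trials

# H0 true: every rejection is a Type I error, so the rate estimates alpha.
type_one_rate = rejection_rate(true_mean=0.0)
# H0 false (hypothetical effect of 0.5): the rate estimates the power, 1 - beta.
power_estimate = rejection_rate(true_mean=0.5)

print(f"Type I error rate ~ {type_one_rate:.3f}")
print(f"Power ~ {power_estimate:.3f}, Type II error rate ~ {1 - power_estimate:.3f}")
```

The first estimate should land close to the chosen α, while the second depends entirely on the assumed effect size and sample size.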
17The H0 and HA distributions: α and β
[Figure: overlapping H0 and HA distributions with the critical value(s) marked; power = 1 - β]
18What is the p-value?
p < α => H0 discredited given the data
[Figure: observed value (sample measurement) of the test statistic marked on the distribution]
19p-value continued
p >> α => Not enough evidence to dismiss H0. It
is retained.
[Figure: observed value (sample measurement) of the test statistic marked on the distribution]
20Statistical significance and the p-value
- The p-value can be thought of as the probability of
obtaining a test statistic as extreme as or more
extreme than the actual test statistic obtained,
GIVEN THAT THE NULL HYPOTHESIS IS TRUE!
- All that a significant result implies is that
one has observed something relatively unlikely to
happen given the hypothetical situation.
Everything else is a matter of what one does with
this information.
- Statistical significance is a statement about the
likelihood p(data | H0) of the observed result,
nothing else. It does not guarantee that
something important or even meaningful has been
found.
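As a minimal sketch of this definition, the function below turns a test statistic into a two-sided p-value. It assumes the statistic follows a standard normal distribution under H0 (a z-test); the three z values are arbitrary illustrations, not results from the slides.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. computed as if H0 were true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (0.5, 1.96, 3.0):
    print(f"z = {z:4.2f}  ->  p = {two_sided_p_value(z):.4f}")
```

Note that the calculation never uses HA: the p-value describes the data under H0 and says nothing about the probability that H0 itself is true.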
21The language of the null hypothesis testing
- The terms accepting or rejecting the null
hypothesis are too strong.
- An alternative would be to replace these terms by
more moderate ones
- Retaining the null hypothesis, or treating it
as viable, and discrediting or dismissing the
null hypothesis
22The power of a test
- Definition
- Statistical power refers to the ability of a
statistical test to detect relationships between
variables, or to detect a difference, i.e. an effect
of specified size, when it exists.
- It is the basis of procedures for estimating the
sample size needed to detect an effect of a
particular magnitude.
- Official definition: the power is the
probability of dismissing the null hypothesis
when it is false (1-β), where β represents the
probability of retaining a null hypothesis when
it is false.
23Power
- Power is a direct function of four variables
- Significance level (α)
- Sample size (N)
- Effect size
- The type of statistical test being conducted
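The dependence of power on the first three variables can be sketched numerically. The function below is an approximation under stated assumptions: the type of test is fixed to a two-sided, two-sample z-test with a known common standard deviation, and all numbers (difference 5, SD 10, 30 per group) are hypothetical.

```python
from statistics import NormalDist

def power_two_sample_z(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    delta: true difference between the population means (the effect)
    sigma: common standard deviation, assumed known
    """
    std_normal = NormalDist()
    standard_error = sigma * (2 / n_per_group) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / standard_error
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

baseline = power_two_sample_z(delta=5, sigma=10, n_per_group=30)
larger_alpha = power_two_sample_z(delta=5, sigma=10, n_per_group=30, alpha=0.10)
larger_effect = power_two_sample_z(delta=8, sigma=10, n_per_group=30)
larger_sample = power_two_sample_z(delta=5, sigma=10, n_per_group=60)

for label, value in [("baseline", baseline), ("larger alpha", larger_alpha),
                     ("larger effect", larger_effect), ("larger sample", larger_sample)]:
    print(f"{label:14s} power = {value:.2f}")
```

Each change raises the power relative to the baseline; with no true effect (delta = 0) the function returns α, the Type I error rate.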
24How can you increase power?
(1) Select a larger α
[Figure: H0 and HA distributions plotted against x]
28How can you increase power?
(2) Increase the difference between population
means
29How can you increase power?
(2) Increase the relevant effect difference
[Figure: H0 and HA distributions]
30How can you increase power?
(3) Increase the sample size => standard error
decreases
31How can you increase power?
(3) Larger sample size
[Figure: H0 and HA distributions plotted against x]
34Power formulas
For a continuous outcome variable
For proportions
With c1, z0.80 = 0.84, z0.90 = 1.28, z0.95 =
1.65, z0.975 = 1.96, σ² = p(1 - p) and p = (p1 +
p2)/2
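The formulas themselves did not survive in this transcript. As a hedged stand-in, the sketch below uses the common normal-approximation formula n = 2(z(1-α/2) + z(power))² σ² / Δ² per group, which is an assumption and not necessarily the slide's exact expression. For proportions it plugs in σ² = p(1 - p) with p = (p1 + p2)/2, as defined above. The function names and the example numbers are illustrative.

```python
from math import ceil
from statistics import NormalDist

def z_quantile(p):
    """Quantile of the standard normal, e.g. z_quantile(0.975) is about 1.96."""
    return NormalDist().inv_cdf(p)

def n_per_group_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n to detect a mean difference delta with a two-sided test."""
    z_sum = z_quantile(1 - alpha / 2) + z_quantile(power)
    return ceil(2 * z_sum ** 2 * sigma ** 2 / delta ** 2)

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for p1 vs p2, using sigma^2 = p(1 - p) with p = (p1 + p2)/2."""
    p_bar = (p1 + p2) / 2
    sigma = (p_bar * (1 - p_bar)) ** 0.5
    return n_per_group_means(p1 - p2, sigma, alpha, power)

# Hypothetical inputs: a 5-unit difference with SD 10, and 15% vs 12% event rates.
print(n_per_group_means(delta=5, sigma=10))
print(n_per_group_proportions(0.15, 0.12))
```

Because Δ enters squared in the denominator, halving the difference to be detected roughly quadruples the required sample size.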
35A randomized, double-blind comparison of placebo,
etodolac, and naproxen on gastro-intestinal
injury and prostaglandin production
Background: NSAIDs frequently cause
gastrointestinal injury and increase the risk of
ulcer complications. We compared an NSAID
suggested to cause less gastric injury (etodolac)
with a standard NSAID (naproxen) and a placebo in
a 4-week double-blind trial assessing the effects
on gastroduodenal injury, symptoms, and
prostaglandin production in healthy
volunteers.
Methods: Fifty-two healthy
volunteers not taking NSAIDs, alcohol,
antibiotics, bismuth, or anti-ulcer drugs and
with a normal endoscopic examination were
randomly assigned to identical drugs: placebo,
etodolac 400 mg, or naproxen 500 mg b.i.d. for 4
weeks. Endoscopies with biopsies were repeated at
weeks 1 and 4. The number and dimensions of
ulcers and erosions were recorded to quantitate
injury.
Results: At week 1 the mean number and
area of gastric ulcers per subject were greater
with naproxen than placebo or etodolac (area:
naproxen, 7.4 mm²; placebo, 0.6 mm², p = 0.02 vs
naproxen; etodolac, 2.1 mm², p = 0.06 vs
naproxen). Ulcer scores at week 4 were low and
comparable in the three groups. The mean number
and area of gastric erosions per subject were
greatest with naproxen at both weeks 1 and 4
(week 4 area: naproxen, 58.3 mm²; placebo, 29.0
mm²; etodolac, 13.9 mm²; p < 0.02, naproxen vs
placebo and vs etodolac). Placebo injury was
presumably due to biopsies at prior endoscopy.
Gastric mucosal prostaglandin E2 production did
not change significantly from baseline after 1 or
4 weeks of treatment with placebo or etodolac,
but did decrease significantly with naproxen
(week 0, 1689; week 1, 479; week 4, 577 pg/mg
protein). Gastrointestinal symptoms were present
in only 1 (5%) of 20 visits in which endoscopy
showed no erosions or ulcers vs 21 (26%) of 82
visits in which a mucosal defect was identified
(p = 0.066).
Conclusions: Gastric injury with 4
weeks of etodolac is comparable to that seen with
placebo and significantly less than that
occurring with naproxen, presumably due to the
fact that etodolac does not suppress gastric
mucosal prostaglandin production, whereas
naproxen leads to a significant reduction.
(Gastrointest Endosc 1995;42:428-33.)
36Interactive session I
- Acute upper gastrointestinal bleeding (UGIB) is a
serious complication of a variety of GI diseases,
and is associated with a mortality rate of around
10-20%. A new antifibrinolytic drug is to be
tested to see if it can reduce the mortality in
UGIB patients, and you are asked to suggest a
sample size for a proposed placebo-controlled
trial. How would you explain to the investigators
the need to specify a clinically significant
difference to be detected, and what size would
you recommend if they agree on a 20% reduction in
mortality?
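One way to approach the exercise is to see how strongly the answer depends on what is assumed. The sketch below is not a model answer. It assumes the reduction is relative (e.g. 15% to 12%), a two-sided α of 0.05, 80% power and the usual normal approximation for two proportions, and it tries three baseline mortalities within the stated range.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treated) / 2
    variance = p_bar * (1 - p_bar)
    return ceil(2 * z_sum ** 2 * variance / (p_control - p_treated) ** 2)

RELATIVE_REDUCTION = 0.20  # assumption: the agreed reduction is read as relative

sizes = {}
for baseline in (0.10, 0.15, 0.20):  # assumed placebo mortality
    treated = baseline * (1 - RELATIVE_REDUCTION)
    sizes[baseline] = n_per_group(baseline, treated)
    print(f"mortality {baseline:.0%} -> {treated:.0%}: "
          f"{sizes[baseline]} patients per arm")
```

Under these assumptions the requirement runs from roughly one and a half thousand to just over three thousand patients per arm, depending on the baseline mortality. This is why the investigators have to commit to a baseline rate and a clinically significant difference before any sample size can be named.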
37Interactive session II: The NSAID example
- 2. How would you set up an RCT to investigate the
effect of the supposedly less detrimental
NSAID? And given the study design, which
statistical technique would you consider adequate
to test your hypothesis? In order to answer this
question, consider the following aspects:
- Which is your outcome variable and how is it
measured (which scale)?
- How many groups will be compared?
- Do you think it suffices to measure the patients
only once (cross-sectional) or, preferably, twice
with a pre and post measurement?
- Are the samples to be compared dependent on or
independent from each other?
- Consider an extra explanatory factor,
in addition to the NSAID, that you know to have an
influence on your outcome measure. How would your
study design be changed by considering this extra
variable? Would you carry out several tests
separately, for each variable, or one test, with
the two variables together?
38Study designs
- Two samples are said to be independent when the
data points in one sample are unrelated to the
data points in a second sample.
- An experimental study design: the Randomised
Clinical Trial (RCT). Random allocation of
patients to case and control groups (independent
samples).
- Observational study designs
- The cross-sectional, where the participants are
seen/measured at only one point in time
(independent measures)
- The longitudinal, follow-up study, where the same
group of people is followed over time
(dependent, repeated measures).
- These different study designs all share
a common denominator: the main question of
interest is often the comparison of the mean
values of two groups
- The simplest method to make this comparison
between groups is Student's t-test.
39The set-up
- A study is carried out to evaluate the effect of
the new NSAID on the prostaglandin concentration
- Three situations are distinguished
- One of independent samples with drug(s) and
control (placebo) groups (CASE I)
- A second case in which one group is measured
twice, before and after the administration of the
drug (CASE II)
- A third case with dependent samples (before and
after measurements) within independent
experimental and control groups (CASE III)
- Evaluate the pros and cons of each of these
design settings and choose the adequate
statistical test.
40The t-test
- The usual signal/noise ratio
41CASE I: separate, independent groups
Subject    Placebo    NSAID 1
------------------------------
1             65         62
2             88         86
3            125        118
4            103        105
5             90         91
6             76         72
7             85         81
8            126        122
9             97         95
10           142        145
11           132        132
12           110        105
Mean       103.3      101.2
SD          24.0       24.8
42With SPSS
[SPSS output: Group Statistics and Independent Samples Test tables, with the test statistic and the p-value marked]
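The SPSS tables are not part of this transcript, but the CASE I test statistic can be recomputed from the tabulated data. The sketch below assumes the equal-variance (pooled) form of Student's t-test. Rather than computing a p-value, it compares |t| with 2.074, the tabulated two-sided 5% critical value of t with 22 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

placebo = [65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110]
nsaid = [62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105]

def independent_t(x, y):
    """Student's t for two independent samples with a pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = sqrt(pooled_var * (1 / nx + 1 / ny))  # the noise
    signal = mean(x) - mean(y)                             # the signal
    return signal / standard_error, nx + ny - 2

t_stat, df = independent_t(placebo, nsaid)
T_CRIT_22 = 2.074  # two-sided 5% critical value of t with 22 df (from tables)
print(f"t = {t_stat:.2f} on {df} df; significant at 5%: {abs(t_stat) > T_CRIT_22}")
```

The signal of about 2 units is small compared with a standard error of about 10, so the null hypothesis is retained when the data are treated as independent groups.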
43A different setting
- Now, instead of having drug and placebo groups,
let's follow a single group of patients, whose
prostaglandin concentrations are measured before
and after the intervention, i.e. the drug
administration.
44CASE II: repeated measures
Subject    Pretest    Posttest    Difference (d)
-------------------------------------------------
1             65         62          -3
2             88         86          -2
3            125        118          -7
4            103        105           2
5             90         91           1
6             76         72          -4
7             85         81          -4
8            126        122          -4
9             97         95          -2
10           142        145           3
11           132        132           0
12           110        105          -5
Mean       103.3      101.2        -2.1
SD          24.0       24.8         3.02
45[SPSS output: Paired Samples Statistics, Paired Samples Correlations and Paired Samples Test tables, with the test statistic and the p-value marked]
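The SPSS tables are not in the transcript, so the sketch below recomputes the paired t statistic for CASE II from the pretest and posttest columns. The comparison with 2.201, the tabulated two-sided 5% critical value of t with 11 degrees of freedom, stands in for the p-value.

```python
from math import sqrt
from statistics import mean, stdev

pretest = [65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110]
posttest = [62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105]

# The paired test works on the within-subject differences only.
differences = [post - pre for pre, post in zip(pretest, posttest)]
n = len(differences)
standard_error = stdev(differences) / sqrt(n)  # the noise
t_stat = mean(differences) / standard_error    # signal / noise

T_CRIT_11 = 2.201  # two-sided 5% critical value of t with n - 1 = 11 df (from tables)
print(f"mean d = {mean(differences):.2f}, SD of d = {stdev(differences):.2f}")
print(f"t = {t_stat:.2f} on {n - 1} df; significant at 5%: {abs(t_stat) > T_CRIT_11}")
```

The numerator is the same 2-unit difference as in the independent-groups analysis. The standard error, however, is now based on the SD of the differences (about 3) rather than the between-subject SD (about 24), which is enough to push |t| past the critical value.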
46Note.
- The Signal: the numerator for the computation of
the test statistic (the signal) is the same for
the dependent and independent cases (the
difference is 2.1). The distinction between the
two cases lies in the denominator, representing
the noise variability
- The Noise: the SDs of the pre-measures and
post-measures are quite large (about 25), reflecting
the large stable differences in prostaglandin
concentrations between human beings
- By contrast, the SD of the concentration
differences is much smaller, only about 3.0
- Stable differences between individuals are far
greater than any likely difference resulting from
treatment within individuals
- The between-subject SD is much larger than the
within-subject SD. By using the SD of the differences,
we are eliminating the intrinsic variability of
the prostaglandin distribution in the population
from the noise.
- In this way you can regard each
individual as her/his own control.
47To sum up
- Drastically different conclusions result from the
application of two different statistical
approaches to the data (same numerical data,
different experimental designs!)
- The appropriate test procedure is determined by
how the experiment is performed: with independent
or dependent samples (repeated or matched
measures)
- In CASE II the experiment is organised in such a
way as to measure individual rather than average
losses over the population, which is represented
by CASE I.
48Advantages of the paired approach
- You eliminate between-subject differences from the
denominator (the noise variability) of the test
- This can lead to a potential gain in statistical
power
- It might also be possible to correct for baseline
differences between groups (inadequate
randomisation)
- This advantage only exists as long as the
subjects or pairs have systematic differences
between them. If this is not the case, the test
can result in a loss, instead of a gain, in
statistical power.
- What is the shortcoming?
49CASE III: Separate groups, within each group repeated measures
           Experimental                   Control
Subject    Pretest  Posttest   (d)        Pretest  Posttest   (d)
--------------------------------------------------------------------
1            65       62       -3           68       70        2
2            88       86       -2          122      123        1
3           125      118       -7           84       83       -1
4           103      105        2           95       97        2
5            90       91        1          106      106        0
6            76       72       -4           71       72        1
7            85       81       -4           87       86       -1
8           126      122       -4          147      152        5
9            97       95       -2          129      131        2
10          142      145        3          136      138        2
11          132      132        0          105      104       -1
12          110      105       -5           99      100        1
Mean       103.3    101.2     -2.1        104.1    105.2      1.1
SD          24.0     24.8      3.02        25.2     26.2      1.73
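One possible analysis of CASE III, offered here as an assumption rather than as the method used in the presentation, is to compute each subject's change score and compare the two groups' changes with an independent-samples t-test. The change scores are recomputed from the pretest and posttest columns. The value 2.074 is the tabulated two-sided 5% critical value of t with 22 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

exp_pre = [65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110]
exp_post = [62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105]
ctl_pre = [68, 122, 84, 95, 106, 71, 87, 147, 129, 136, 105, 99]
ctl_post = [70, 123, 83, 97, 106, 72, 86, 152, 131, 138, 104, 100]

def change_scores(pre, post):
    """Within-subject differences (post minus pre)."""
    return [b - a for a, b in zip(pre, post)]

d_exp = change_scores(exp_pre, exp_post)
d_ctl = change_scores(ctl_pre, ctl_post)

# Independent-samples t-test (pooled variance) applied to the change scores.
n1, n2 = len(d_exp), len(d_ctl)
pooled_var = ((n1 - 1) * variance(d_exp) + (n2 - 1) * variance(d_ctl)) / (n1 + n2 - 2)
standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(d_exp) - mean(d_ctl)) / standard_error

T_CRIT_22 = 2.074  # two-sided 5% critical value of t with 22 df (from tables)
print(f"mean change: experimental {mean(d_exp):.2f}, control {mean(d_ctl):.2f}")
print(f"t = {t_stat:.2f} on {n1 + n2 - 2} df; "
      f"significant at 5%: {abs(t_stat) > T_CRIT_22}")
```

This design keeps the paired approach's small within-subject noise while adding a control group, so the change in the treated group is judged against the change that occurs without treatment.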
51Reminder
- Most statistical tests relate the magnitude of
an observed difference to the probability that
such a difference might occur by chance alone.
- The notion of statistical significance is
embodied in this probability. But statistical
significance does not, of itself, reveal anything
about the importance of the observed difference.