Title: Basic Statistical Principles for the Clinical Research Scientist. Kristin Cobb, October 13 and October 20, 2004
2. Statistics in Medical Research
- 1. Design phase
- Statistics starts in the planning stages of a clinical trial or laboratory experiment, to:
- establish the optimal sample size needed
- ensure sound study design
- 2. Analysis phase
- Make inferences about a wider population.
3. Common problems with statistics in medical research
- Sample size too small to find an effect (design-phase problem)
- Sub-optimal choice of measurement for predictors and outcomes (design-phase problem)
- Inadequate control for confounders (design or analysis problem)
- Statistical analyses inadequate (analysis problem)
- Incorrect statistical test used (analysis problem)
- Incorrect interpretation of computer output (analysis problem)
- Therefore, it is essential to collaborate with a statistician during both planning and analysis!
4. Additionally, errors arise when:
- The statistical content of the paper is confusing or misleading because the authors do not fully understand the statistical techniques used by the statistician.
- The statistician performs inadequate or inappropriate analyses because she is unclear about the questions the research is designed to answer.
- Therefore, clinical research scientists need to understand the basic principles of biostatistics.
5. Outline (today and next week)
- 1. Primer on hypothesis testing, p-values, confidence intervals, and statistical power.
- 2. Biostatistics in practice: applying statistics to clinical research design.
6. Quick review
- Standard deviation
- Histograms (frequency distributions)
- Normal distribution (bell curve)
7. Review: Standard deviation
Standard deviation tells you how variable a characteristic is in a population. For example, how variable is height in the US? The standard deviation of height represents the average distance of a random person from the mean height in the population.
8. Review: Histograms
9. Review: Histograms
1-inch bins
10. Review: Normal distribution
11. Review: Normal distribution
In fact, here, 101/150 (67%) of subjects have heights between 62.7 and 67.7 inches (1 standard deviation below and above the mean).
A perfect, theoretical normal distribution carries 68% of its area within 1 standard deviation of the mean.
12. Review: Normal distribution
In fact, here, 146/150 (97%) of subjects have heights between 60.2 and 70.2 inches (2 standard deviations below and above the mean).
A perfect, theoretical normal distribution carries 95% of its area within 2 standard deviations of the mean.
13. Review: Normal distribution
In fact, here, 150/150 (100%) of subjects have heights between 57.7 and 72.7 inches (3 standard deviations below and above the mean).
A perfect, theoretical normal distribution carries 99.7% of its area within 3 standard deviations of the mean.
14. Review: Applying the normal distribution
- If women's heights in the US are normally distributed with a mean of 65 inches and a standard deviation of 2.5 inches, what percentage of women do you expect to have heights above 6 feet (72 inches)?
From a standard normal chart (or computer), a Z of 2.8 corresponds to a right-tail area of .0026, so we expect 2-3 women per 1000 to have heights of 6 feet or greater.
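This tail-area lookup is easy to reproduce in a few lines of Python using only the standard library (the function name here is our own, not from the slides):

```python
from math import erf, sqrt

def normal_tail_above(x, mu, sd):
    """P(X > x) for X ~ Normal(mu, sd), via the error function."""
    z = (x - mu) / sd
    # Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Women's heights: mean 65 in, SD 2.5 in; P(height > 72 in)?
p = normal_tail_above(72, 65, 2.5)
print(round(p, 4))  # 0.0026 — about 2-3 women per 1000
```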
15. Statistics Primer
- Statistical Inference
- Sample statistics
- Sampling distributions
- Central limit theorem
- Hypothesis testing
- P-values
- Confidence intervals
- Statistical power
16. Statistical inference: the process of making guesses about the truth from a sample.
17.
- EXAMPLE: What is the average blood pressure of US post-docs?
- We could go out and measure blood pressure in every US post-doc (thousands).
- Or, we could take a sample and make inferences about the truth from our sample.
Using what we observe:
1. We can test an a priori guess (hypothesis testing).
2. We can estimate the true value (confidence intervals).
18. Statistical inference is based on sampling variability
- Sample statistic: we summarize a sample into one number; e.g., it could be a mean, a difference in means or proportions, an odds ratio, or a correlation coefficient
- E.g., the average blood pressure of a sample of 50 American men
- E.g., the difference in average blood pressure between a sample of 50 men and a sample of 50 women
- Sampling variability: if we could repeat an experiment many, many times on different samples with the same number of subjects, the resultant sample statistic would not always be the same (because of chance!).
- Standard error: a measure of the sampling variability
19. Examples of sample statistics
- Single population mean
- Difference in means (t-test)
- Difference in proportions (Z-test)
- Odds ratio/risk ratio
- Correlation coefficient
- Regression coefficient
20. Variability of a sample mean
Random postdocs
The truth (not knowable)
21. Variability of a sample mean
Random samples of 5 post-docs
The truth (not knowable)
22. Variability of a sample mean
Samples of 50 postdocs
The truth (not knowable)
129 mmHg
134 mmHg
131 mmHg
130 mmHg
128 mmHg
130 mmHg
23. Variability of a sample mean
Samples of 150 postdocs
The truth (not knowable)
131.2 mmHg
130.2 mmHg
129.7 mmHg
130.9 mmHg
130.4 mmHg
129.5 mmHg
24. How sample means vary: a computer experiment
- 1. Pick any probability distribution and specify a mean and standard deviation.
- 2. Tell the computer to randomly generate 1000 observations from that probability distribution
- E.g., the computer is more likely to spit out values with high probabilities
- 3. Plot the observed values in a histogram.
- 4. Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 and take their average) from that probability distribution. Plot the observed averages in histograms.
- 5. Repeat for averages-of-5 and averages-of-100.
25. Uniform on [0,1]: average of 1 (original distribution)
26. Uniform: 1000 averages of 2
27. Uniform: 1000 averages of 5
28. Uniform: 1000 averages of 100
29. Exp(1): average of 1 (original distribution)
30. Exp(1): 1000 averages of 2
31. Exp(1): 1000 averages of 5
32. Exp(1): 1000 averages of 100
33. Bin(40, .05): average of 1 (original distribution)
34. Bin(40, .05): 1000 averages of 2
35. Bin(40, .05): 1000 averages of 5
36. Bin(40, .05): 1000 averages of 100
37. The Central Limit Theorem
- If all possible random samples, each of size n, are taken from any population with mean μ and standard deviation σ, the sampling distribution of the sample means (averages) will:
1. have a mean equal to μ
2. have a standard deviation (standard error) equal to σ/√n
3. be approximately normally distributed, regardless of the shape of the parent population (normality improves with larger n)
38. Example 1: Weights of doctors
- Experimental question: Are practicing doctors setting a good example for their patients with their weights?
- Experiment: Take a sample of practicing doctors and measure their weights
- Sample statistic: mean weight for the sample
- IF weight is normally distributed in doctors with a mean of 150 lbs and a standard deviation of 15, how much would you expect the sample average to vary if you could repeat the experiment over and over?
39. Relative frequency of 1000 observations of weight: mean 150 lbs, standard deviation 15 lbs
43. Using sampling variability
- In reality, we only get to take one sample!!
- But, since we have an idea about how sampling variability works, we can make inferences about the truth based on one sample.
44. Experimental results
- Let's say we take one sample of 100 doctors and calculate their average weight.
45. Expected sampling variability for n=100 if the true mean weight is 150 (and SD=15)
46. Expected sampling variability for n=100 if the true mean weight is 150 (and SD=15)
47. P-value associated with this experiment
P-value (the probability of our sample average being 160 lbs or more IF the true average weight is 150) < .0001. This gives us evidence that 150 isn't a good guess.
48. The P-value
- The p-value is the probability that we would have seen our data (or something more unexpected) just by chance if the null hypothesis (null value) is true.
- Small p-values mean the null value is unlikely given our data.
49. The P-value
- By convention, p-values of <.05 are often accepted as statistically significant in the medical literature, but this is an arbitrary cut-off.
- A cut-off of p<.05 means that in about 5 of 100 experiments, a result would appear significant just by chance (Type I error).
50. Hypothesis testing
- The steps:
- 1. Define your hypotheses (null, alternative)
- The null hypothesis is the "straw man" that we are trying to shoot down.
- Null here: mean weight of doctors = 150 lbs
- Alternative here: mean weight > 150 lbs (one-sided)
- 2. Specify your sampling distribution (under the null)
- If we repeated this experiment many, many times, the sample average weights would be normally distributed around 150 lbs with a standard error of 1.5
- 3. Do a single experiment (observed sample mean = 160 lbs)
- 4. Calculate the p-value of what you observed (p < .0001)
- 5. Reject or fail to reject the null hypothesis (reject)
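Steps 2-4 can be sketched in Python using only the standard library, with the numbers from this example (a one-sided Z-test, which is appropriate here because the standard deviation is assumed known):

```python
from math import erf, sqrt

# Numbers from the doctors'-weight example:
# null mean 150 lbs, SD 15, n = 100, observed sample mean 160 lbs.
null_mean, sd, n, observed = 150.0, 15.0, 100, 160.0

se = sd / sqrt(n)                       # standard error = 1.5
z = (observed - null_mean) / se         # z ≈ 6.67
# One-sided p-value: P(Z >= z) under the null
p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print(se, round(z, 2), p_value)         # p < .0001 -> reject the null
```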
51. Errors in hypothesis testing
52. Errors in hypothesis testing
- Type I error (false positive)
- Concluding that the observed effect is real when it's just due to chance.
- Type II error (false negative)
- Missing a real effect.
- POWER (the complement of Type II error)
- The probability of seeing a real effect (of rejecting the null if the null is false).
53. Beyond hypothesis testing: estimation (confidence intervals)
We'd estimate, based on these data, that the average weight is somewhere closer to 160 lbs. And we could state the precision of this estimate (a confidence interval).
54. Confidence intervals
- (sample statistic) ± (measure of how confident we want to be) × (standard error)
55. Confidence interval (more information!!)
- 95% CI for the mean: 160 ± 1.96(1.5) = (157, 163)
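The same arithmetic in a short Python sketch (1.96 is the standard normal multiplier for 95% confidence; the function name is our own):

```python
from math import sqrt

def ci_mean(sample_mean, sd, n, z=1.96):
    """(sample statistic) ± z × (standard error); a 95% CI when z = 1.96."""
    se = sd / sqrt(n)
    return sample_mean - z * se, sample_mean + z * se

lo, hi = ci_mean(160, 15, 100)     # doctors' weights: 160 ± 1.96(1.5)
print(round(lo, 1), round(hi, 1))  # 157.1 162.9 — the slide rounds to (157, 163)
```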
56. What confidence intervals do
- They indicate the un/certainty about the size of a population characteristic or effect. Wider CIs indicate less certainty.
- Confidence intervals can also answer the question of whether an association exists, or whether a treatment is beneficial or harmful (analogous to p-values).
- E.g., since the 95% CI of the mean weight does not cross 150 lbs (the null value), we reject the null at p<.05.
57. Expected sampling variability for n=2
58. Expected sampling variability for n=2
59. Expected sampling variability for n=10
60. Statistical power
- We found the same sample mean (160 lbs) in our 100-doctor sample, 10-doctor sample, and 2-doctor sample.
- But we only rejected the null based on the 100-doctor and 10-doctor samples.
- Larger samples give us more statistical power.
61. Can we quantify how much power we have for given sample sizes?
64. Null distribution: mean=150, sd=4.74
Clinically relevant alternative: mean=160, sd=4.74
66. Null distribution: mean=150, sd=1.37
Nearly 100% power!
Clinically relevant alternative: mean=160, sd=1.37
67. Factors affecting power
- 1. Size of the difference (10 pounds higher)
- 2. Standard deviation of the characteristic (sd=15)
- 3. Bigger sample size
- 4. Significance level desired
68. 1. Bigger difference from the null mean
69. 2. Bigger standard deviation
70. 3. Bigger sample size
71. 4. Higher significance level
72. Examples of sample statistics
- Single population mean
- Difference in means (t-test)
- Difference in proportions (Z-test)
- Odds ratio/risk ratio
- Correlation coefficient
- Regression coefficient
73. Example 2: Difference in means
- Example: Rosenthal, R. and Jacobson, L. (1966). Teachers' expectancies: determinants of pupils' IQ gains. Psychological Reports, 19, 115-118.
74. The experiment (note: exact numbers have been altered)
- Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18).
- BUT the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same IQ test was re-administered.
75. The results
- Children who had been randomly assigned to the top-20-percent list had a mean IQ increase of 12.2 points (sd=2.0), vs. an increase of only 8.2 points (sd=2.0) for children in the control group.
- Is this a statistically significant difference? Give a confidence interval for this difference.
76. Difference in means
- Sample statistic: difference in mean change in IQ test score.
- Null hypothesis: no difference between academic bloomers and normal students
77. Explore the sampling distribution of the difference in means
- Simulate 1000 differences in mean IQ change under the null hypothesis (both academic bloomers and controls improve by, let's say, 8 points, with a standard deviation of 2.0)
78. Academic bloomers
79. Normal students
80. Difference: academic bloomers - normal students
Notice that most experiments yielded a difference value between -1.1 and 1.1 (wider than the above sampling distributions!)
81. Confidence interval (more information!!)
- 95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)
Does not cross 0; therefore, significant at .05.
82. 95% confidence interval for the observed difference: 4 ± 2(.52) ≈ 3-5
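As a check on these numbers, here is the two-sample standard-error arithmetic in Python; the group sizes (18 bloomers, 72 controls) come from the experiment described earlier, and 1.99 is the t critical value for roughly 88 degrees of freedom:

```python
from math import sqrt

# Group sizes from the experiment: 18 "bloomers", 72 controls (n = 90)
n1, n2, sd = 18, 72, 2.0
diff = 12.2 - 8.2                      # observed difference in mean IQ gain

se = sqrt(sd**2 / n1 + sd**2 / n2)     # ≈ 0.52
t_crit = 1.99                          # two-sided 95%, ~88 df
lo, hi = diff - t_crit * se, diff + t_crit * se
print(round(se, 2), round(lo, 1), round(hi, 1))  # 0.53 3.0 5.0
```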
83. Clearly lots of power to detect a difference of 4!
84. How much power to detect a difference of 1.0?
85. Power closer to 50% now.
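That "power near 50%" figure can be approximated with a small Python sketch (a normal approximation that ignores the negligible opposite tail; the standard error of .52 comes from the earlier slides):

```python
from math import erf, sqrt

def power_two_sided(true_diff, se, alpha_z=1.96):
    """Approximate power: probability the observed z exceeds the critical
    value alpha_z when the true difference is true_diff (one tail only)."""
    shift = true_diff / se - alpha_z
    return 0.5 * (1 + erf(shift / sqrt(2)))

print(round(power_two_sided(4.0, 0.52), 3))  # ≈ 1.0: huge power for a 4-point difference
print(round(power_two_sided(1.0, 0.52), 2))  # ≈ 0.49: close to 50% for a 1-point difference
```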
86. Example 3: Difference in proportions
- Experimental question: Do men tend to prefer Bush more than women?
- Experimental design: Poll representative samples of men and women in the U.S. and ask them the question: "Do you plan to vote for Bush in November, yes or no?"
- Sample statistic: the difference in the proportion of men who are pro-Bush versus women who are pro-Bush
- Null hypothesis: the difference in proportions = 0
- Observed results: women .36, men .46
87. Explore the sampling distribution of the difference in proportions
- Simulate 1000 differences in proportion preferring Bush under the null hypothesis (41% overall prefer Bush, with no difference between genders)
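A bare-bones Python version of this simulation (standard library only, printing the spread of the simulated differences instead of plotting them):

```python
import random
import statistics

def simulate_diffs(n_per_group, p=0.41, n_experiments=1000):
    """Simulate the difference (men - women) in sample proportions under
    the null: both groups prefer Bush with the same probability p."""
    diffs = []
    for _ in range(n_experiments):
        men = sum(random.random() < p for _ in range(n_per_group)) / n_per_group
        women = sum(random.random() < p for _ in range(n_per_group)) / n_per_group
        diffs.append(men - women)
    return diffs

for n in (50, 200, 800):
    d = simulate_diffs(n)
    print(n, round(statistics.stdev(d), 3))  # the spread shrinks as n grows
```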
88. Men
89. Women
Under the null hypothesis, most experiments yielded a mean between .27 and .55
90. Difference: men - women
Under the null hypothesis, most experiments yielded difference values between -.20 (women preferring Bush more than men) and .20 (men preferring Bush more)
91. What if we had 200 men and 200 women?
92. Men
Most of 1000 simulated experiments yielded a mean between .34 and .48
93. Women
Most of 1000 simulated experiments yielded a mean between .34 and .48
94. Difference: men - women
Notice that most experiments will yield a difference value between -.10 (women preferring Bush more than men) and .10 (men preferring Bush more)
95. What if we had 800 men and 800 women?
96. Men
Most experiments will yield a mean between .38 and .44
97. Women
Most experiments will yield a mean between .38 and .44
98. Difference: men - women
Notice that most experiments will yield a difference value between -.05 (women preferring Bush more than men) and .05 (men preferring Bush more)
99. If we sampled 1600 per group, a 2.5% difference would be statistically significant at a significance level of .05. If we sampled 3200 per group, a 1.25% difference would be statistically significant at a significance level of .05. If we sampled 6400 per group, a .625% difference would be statistically significant at a significance level of .05. BUT if we found a significant difference of 1% between men and women, would we care if we were Bush or Kerry??
100. Limits of hypothesis testing: statistical vs. clinical significance
Consider a hypothetical trial comparing death rates in 12,000 patients with multi-organ failure receiving a new inotrope with 12,000 patients receiving usual care. A 1% reduction in mortality in the treatment group (49% deaths versus 50% in the usual care group) would be statistically significant (p<.05) because of the large sample size. However, such a small difference in death rates may not be clinically important.
101. Example 4: The odds ratio
- Experimental question: Does smoking increase fracture risk?
- Experiment: Ask 50 patients with fractures and 50 controls if they ever smoked.
- Sample statistic: odds ratio (a measure of relative risk)
- Null hypothesis: there is no association between smoking and fractures (odds ratio = 1.0).
102. The Odds Ratio (OR)
103. Example 4: Sampling variability of the null odds ratio (OR) (50 cases / 50 controls / 20 exposed)
If the odds ratio = 1.0, then with 50 cases and 50 controls, of whom 20 smoke, this is the expected variability of the sample OR. Note the right skew.
104. The sampling variability of the natural log of the OR (ln OR) is more Gaussian
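Both points can be sketched in Python: the OR computed from a 2×2 table, and a confidence interval built on the ln(OR) scale, where the sampling distribution is closer to normal. The cell counts here are made up for illustration, not from the slides:

```python
from math import log, exp, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR from a 2x2 table (a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls), with a 95% CI
    computed on the log scale (Woolf's standard error of ln OR)."""
    or_ = (a / b) / (c / d)
    se_ln = sqrt(1/a + 1/b + 1/c + 1/d)
    lo, hi = log(or_) - z * se_ln, log(or_) + z * se_ln
    return or_, exp(lo), exp(hi)

# Hypothetical counts: 25 of 50 cases smoked vs. 15 of 50 controls
or_, lo, hi = odds_ratio_ci(25, 25, 15, 35)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # OR ≈ 2.33, CI roughly (1.0, 5.3)
```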
105. Statistical power
- Statistical power here is the probability of
concluding that there is an association between
exposure and disease if an association truly
exists. - The stronger the association, the more likely we
are to pick it up in our study. - The more people we sample, the more likely we are
to conclude that there is an association if one
exists (because the sampling variability is
reduced).
106. Part II. Biostatistics in practice: applying statistics to clinical research design
107. From concept to protocol
- Define your primary hypothesis
- Define your primary predictor and outcome variables
- Decide on study type (cross-sectional, case-control, cohort, RCT)
- Decide how you will measure your predictor and outcome variables, balancing statistical power, ease of measurement, and potential biases
- Decide on the main statistical tests that will be used in analysis
- Calculate sample size needs for your chosen statistical test(s)
- Describe your sample size needs in your written protocol, disclosing your assumptions
- Write a statistical analysis plan
- Briefly describe the descriptive statistics that you plan to present
- Describe which statistical tests you will use to test your primary hypotheses
- Describe which statistical tests you will use to test your secondary hypotheses
- Describe how you will account for confounders and test for interactions
- Describe any exploratory analyses that you might perform
108. Powering a study: what is the primary hypothesis?
- Before you can calculate sample size, you need to know the primary statistical analysis that you will use in the end.
- What is your main outcome of interest?
- What is your main predictor of interest?
- Which statistical test will you use to test for associations between your outcome and your predictor?
- Do you need to adjust sample size needs upwards to account for loss to follow-up, switching arms of a randomized trial, or accounting for confounders?
- Seek guidance from a statistician
109. Overview of statistical tests
- The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design.
e.g., blood pressure, pounds, age; treatment (1/0)
111. Comparing groups
- T-test: compares two means
- (null hypothesis: difference in means = 0)
- ANOVA: compares means between >2 groups
- (null hypothesis: difference in means = 0)
- Non-parametric tests: used when normality assumptions are not met
- (null hypothesis: difference in medians = 0)
- Chi-square test: compares proportions between groups
- (null hypothesis: the categorical variables are independent)
112. Simple sample size formulas/calculators are available
- Sample size for a difference in means
- Sample size for a difference in proportions
- Can roughly be used if you plan to calculate risk ratios or odds ratios, or to run logistic regression or chi-square tests
- Sample size for a hazard ratio / log-rank test
- If you plan to do survival analysis: Kaplan-Meier methods (log-rank test), Cox regression
114. The pay-off for sitting through the theoretical part of these lectures!
- Here's where it pays to understand what's behind sample size/power calculations!
- You'll have a much easier time using sample size calculators if you aren't just putting numbers into a black box!
119. If this looks complicated, don't panic!
- In reality, you're unlikely to have to derive sample size formulas yourself
- But it's critical to understand where they come from if you're going to apply them yourself.
120. Formula for difference in means
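The slide's formula is shown as an image; the usual normal-approximation version is n per group = 2σ²(z_α/2 + z_β)²/Δ², sketched here in Python (the 5-lb difference and sd of 15 are illustrative numbers, not from the slides):

```python
from math import ceil

def n_per_group_means(sd, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size for comparing two means (normal approximation):
    n = 2 * sd^2 * (z_alpha + z_beta)^2 / delta^2,
    here for 80% power (z_beta = 0.84) at two-sided alpha = .05 (z_alpha = 1.96)."""
    return ceil(2 * sd**2 * (z_alpha + z_beta)**2 / delta**2)

# e.g., to detect a hypothetical 5-lb difference in weight when sd = 15:
print(n_per_group_means(15, 5))  # 142 per group
```

A useful sanity check: halving the detectable difference quadruples the required n.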
121. Formula for difference in proportions
122. Formula for hazard ratio / log-rank test
123. Recommended sample size calculators!
- http://hedwig.mgh.harvard.edu/sample_size/size.html
- http://vancouver.stanford.edu:8080/clio/index.html
- Traverse protocol wizard
124. These sample size calculations are idealized
- We have not accounted for losses to follow-up
- We have not accounted for non-compliance (for an intervention trial or RCT)
- We have assumed that individuals are independent observations (not true in clustered designs)
- Consult a statistician for these considerations!
125. Applying statistics to clinical research design: example
- You want to study the relationship between smoking and fractures.
126. Steps
- ✓ Define your primary hypothesis
- ✓ Define your primary predictor and outcome variables
- ✓ Decide on study type
127. Applying statistics to clinical research design: example
- ✓ Predictor: smoking (yes/no, or continuous)
- ✓ Outcome: osteoporotic fracture (time-to-event)
- ✓ Study design: cohort
128. From concept to protocol
- ✓ Decide how you will measure your predictor and outcome variables
- ✓ Decide on the main statistical tests that will be used in analysis
- ✓ Calculate sample size needs for your chosen statistical test(s)
130. Formula for hazard ratio / log-rank test
131. Example sample size calculation
- Ratio of exposed to unexposed in your sample?
- 1:1
- Proportion of non-smokers who will fracture in your defined population over your defined study period?
- 10%
- What is a clinically meaningful hazard ratio?
- 2.0
- Based on the hazard ratio, how many smokers will fracture?
- 1 − (.90)² = 19%
- What power are you targeting?
- 80%
- What significance level?
- .05
132. Formula for hazard ratio / log-rank test
You may want to adjust upwards for loss to follow-up. E.g., if you expect to lose 10%, divide the above estimate by 90% (i.e., by 0.90).
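The slide's formula is shown as an image; one common approximation for this calculation (a Schoenfeld-style formula, which may differ in detail from the one on the slide) can be sketched in Python with the worked example's inputs:

```python
from math import log, ceil

def n_total_logrank(hr, p_event_0, p_event_1, z_alpha=1.96, z_beta=0.84):
    """Schoenfeld-style approximation for a 1:1 log-rank test:
    required events d = 4 * (z_alpha + z_beta)^2 / ln(hr)^2,
    then total n = d / (average event probability across the two groups)."""
    d = 4 * (z_alpha + z_beta)**2 / log(hr)**2
    return ceil(d / ((p_event_0 + p_event_1) / 2))

n = n_total_logrank(2.0, 0.10, 0.19)   # the worked example's inputs
print(n, ceil(n / 0.90))               # total n, then inflated for 10% loss to follow-up
```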
133. From concept to protocol
- Describe your sample size needs in your written protocol, disclosing your assumptions
- Write a statistical analysis plan
135. Statistical analysis plan
- Descriptive statistics
- E.g., of the study population by smoking status
- Kaplan-Meier curves (univariate)
- Describe exploratory analyses that may be used to identify confounders and other predictors of fracture
- Cox regression (multivariate)
- What confounders have you measured, and how will you incorporate them into the multivariate analysis?
- How will you explore possible interactions?
- Describe potential exploratory analyses for other predictors of fracture