Title: Bootstrap resampling
1. Bootstrap resampling
2. Here's some data
- Mercury (mg) in one-gram soil samples
- 0.853511661, 0.391905707, 0.143344303,
0.198267857, 0.266572367, 0.327306702,
0.834747834, 5.322618220, 0.817037696,
0.157247167, 0.328456677, 3.793153524,
0.513433215, 0.502938253, 0.733454663,
0.279345254, 0.952473470, 0.742740502,
0.178309271, 0.469049646, 0.764546106,
1.819858816, 0.830187557, 0.369993886,
0.644729374, 0.841576129, 0.734056277,
0.773035692, 0.810722543, 0.357449318
3. Soil mercury at a mine site
4. How well does the sample statistic estimate the population parameter?
- Accuracy: bias and precision
- Thought experiment: imagine repeating the sampling procedure many times, and each time calculating a sample mean
- Bias is the difference between the average sample mean and the population mean
- Precision is the standard deviation of the sample means
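The thought experiment above can be run as a small simulation. This is only a sketch: the log-normal population and its parameters (`MU_LOG`, `SIGMA_LOG`) are hypothetical choices made for illustration, not something taken from the mine data.

```python
import math
import random
import statistics

# Hypothetical population: log-normal with log-scale mean 0 and SD 1 (an assumption).
MU_LOG, SIGMA_LOG = 0.0, 1.0
POPULATION_MEAN = math.exp(MU_LOG + SIGMA_LOG ** 2 / 2)

def sampling_experiment(n=30, replicates=2000, seed=1):
    """Repeat the sampling procedure many times; return (bias, precision) of the sample mean."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(replicates):
        sample = [rng.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(n)]
        sample_means.append(statistics.fmean(sample))
    # Bias: average sample mean minus the population mean.
    bias = statistics.fmean(sample_means) - POPULATION_MEAN
    # Precision: standard deviation of the sample means.
    precision = statistics.stdev(sample_means)
    return bias, precision

bias, precision = sampling_experiment()
print(f"bias = {bias:.4f}, precision = {precision:.4f}")
```

For this population the bias should come out close to zero, while the precision reflects how much individual sample means scatter around the population mean.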
5. Take some more samples
6. Some statistical theory
- If the population is normally distributed
- The sample mean is a normally distributed random variable with mean µ and variance σ²/n
- The sample mean and variance are unbiased estimates of the population mean and variance
- The standard error of the mean is an unbiased estimate of the precision of the sample mean, and can be used to construct confidence intervals
- The above are also approximately true even if the population is not normally distributed, as long as the sample size is large enough
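The σ²/n result for a non-normal population can be checked numerically. The sketch below assumes an exponential population with rate 1 (so σ² = 1); that choice is hypothetical and made only because it is clearly skewed.

```python
import random
import statistics

def variance_of_sample_means(n=30, replicates=5000, seed=2):
    """Variance of the sample mean across many samples from an exponential(1) population."""
    rng = random.Random(seed)
    means = []
    for _ in range(replicates):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.variance(means)

observed = variance_of_sample_means()
expected = 1.0 / 30  # sigma^2 / n with sigma^2 = 1 and n = 30
print(f"observed = {observed:.4f}, theory = {expected:.4f}")
```

The two numbers should agree closely, even though the population itself is far from normal.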
7. Confidence Intervals
- If we know how data are sampled
- We can construct a Confidence Interval for an unknown parameter, θ.
- A 95% C.I. gives a range such that the true θ is in the interval 95% of the time.
- A 100(1-α)% C.I. captures the true θ a fraction (1-α) of the time.
- Smaller α, more sure the true θ falls in the interval, but wider interval.
- C.I. FOR MEAN OF NORMALLY DISTRIBUTED DATA
- 95% C.I. for µ: sample mean ± t97.5 × SE
- SE is standard error of mean.
- t97.5 is critical value of t distribution
- Critical t value depends on sample size (n)
- If n > 20, then t97.5 ≈ 1.96 ≈ 2
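As a worked example, the t-based interval can be computed for the 30 soil mercury values listed on slide 2. This is a sketch using only the Python standard library; the critical value 2.045 (97.5th percentile of t with 29 degrees of freedom) is hard-coded from a standard t table, since the standard library has no t quantile function.

```python
import math
import statistics

# Soil mercury data from slide 2 (mg per one-gram sample).
mercury = [
    0.853511661, 0.391905707, 0.143344303,
    0.198267857, 0.266572367, 0.327306702,
    0.834747834, 5.322618220, 0.817037696,
    0.157247167, 0.328456677, 3.793153524,
    0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502,
    0.178309271, 0.469049646, 0.764546106,
    1.819858816, 0.830187557, 0.369993886,
    0.644729374, 0.841576129, 0.734056277,
    0.773035692, 0.810722543, 0.357449318,
]

n = len(mercury)
mean = statistics.fmean(mercury)
se = statistics.stdev(mercury) / math.sqrt(n)  # standard error of the mean

T_CRIT_29_DF = 2.045  # assumption: table value for 29 df, not computed here
lower = mean - T_CRIT_29_DF * se
upper = mean + T_CRIT_29_DF * se
print(f"mean = {mean:.3f}, SE = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The result should land close to the 0.455 to 1.262 interval quoted for the mine data on slide 18.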
8. Confidence Intervals
9. Distribution of 1000 sample means
10. 95% CIs based on t statistic
- Out of 1000 replicate samples
- CI contained µ 908 times
- Entire CI was below µ 87 times
- Entire CI was above µ 5 times
- Average sample mean very close to µ (estimate is unbiased), but
- Sample mean less than µ 558 times
- Average sample SD less than σ
- Biased estimate!
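A coverage experiment of this kind can be sketched as follows. The slides do not say which population the 1000 replicate samples were drawn from, so the log-normal population here is a hypothetical stand-in and the counts will not match the ones above. The t critical value is again a hard-coded table value, valid only for n = 30.

```python
import math
import random
import statistics

N = 30
T_CRIT_29_DF = 2.045         # table value for 29 df; only valid for N = 30
MU_LOG, SIGMA_LOG = 0.0, 1.0  # hypothetical skewed population (assumption)
MU = math.exp(MU_LOG + SIGMA_LOG ** 2 / 2)  # true population mean

def t_interval_coverage(replicates=1000, seed=3):
    """Count how often the t-based 95% CI contains, falls below, or falls above MU."""
    rng = random.Random(seed)
    contained = below = above = 0
    for _ in range(replicates):
        sample = [rng.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(N)]
        mean = statistics.fmean(sample)
        half_width = T_CRIT_29_DF * statistics.stdev(sample) / math.sqrt(N)
        if mean + half_width < MU:
            below += 1       # entire CI is below the true mean
        elif mean - half_width > MU:
            above += 1       # entire CI is above the true mean
        else:
            contained += 1
    return contained, below, above

contained, below, above = t_interval_coverage()
print(f"contained: {contained}, entirely below: {below}, entirely above: {above}")
```

With a right-skewed population the misses tend to be lopsided, with far more intervals falling entirely below µ than above it, which is the pattern the slide reports.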
11. Resampling for a confidence interval of the mean
- IN AN IDEAL WORLD
- Take sample
- Calculate sample mean and std dev
- Take new sample from same population
- Calculate new mean and std dev
- Repeat many times
- Look at the distribution of sample means and std devs
- Find bias and use it to correct sample statistic for original sample
- Figure out formula to get 95% CI that has correct coverage
- IN THE REAL WORLD
- Take sample
- Calculate sample mean and std dev
- Find some way to simulate taking a new sample from same population
- Calculate new mean and std dev
- Repeat many times
- Look at the distribution of sample means and std devs
- Find bias and use it to correct sample statistic for original sample
- Figure out formula to get 95% CI that has correct coverage
12. Bootstrap resampling
- PARAMETRIC BOOTSTRAP
- Assume data are random variables from a particular distribution
- E.g., log-normal
- Use data to estimate parameters of the distribution
- E.g., mean, variance
- Use random number generator to create sample
- Same size as original
- Calculate sample statistics
- Allows us to ask: What if data were a random sample from specified distribution with specified parameters?
- NONPARAMETRIC BOOTSTRAP
- Assume underlying distribution from which data come is unknown
- Best estimate of this distribution is the data themselves: the empirical distribution function
- Create a new dataset by sampling with replacement from the data
- Same size as original
- Calculate sample statistics
- WHICH IS BETTER?
- If underlying distribution is correctly chosen, parametric has more precision
- If underlying distribution incorrectly chosen, parametric has more bias
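The two ways of generating a bootstrap sample can be put side by side. This is a minimal sketch: the function names are made up for illustration, the log-normal choice follows the "e.g." above, and the data are just the first nine mercury values from slide 2.

```python
import math
import random
import statistics

def parametric_sample(data, rng):
    """Parametric bootstrap, assuming a log-normal model: fit on the log scale, then simulate."""
    logs = [math.log(x) for x in data]
    mu_hat = statistics.fmean(logs)
    sigma_hat = statistics.stdev(logs)
    return [rng.lognormvariate(mu_hat, sigma_hat) for _ in data]

def nonparametric_sample(data, rng):
    """Nonparametric bootstrap: sample with replacement from the data themselves."""
    return rng.choices(data, k=len(data))

# First nine mercury values from slide 2, used only as a small illustration.
data = [0.853511661, 0.391905707, 0.143344303,
        0.198267857, 0.266572367, 0.327306702,
        0.834747834, 5.322618220, 0.817037696]

rng = random.Random(0)
param = parametric_sample(data, rng)
nonparam = nonparametric_sample(data, rng)
print("parametric mean:   ", round(statistics.fmean(param), 3))
print("nonparametric mean:", round(statistics.fmean(nonparam), 3))
```

The parametric sample can contain values never observed in the data, while the nonparametric sample only ever reuses observed values.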
13. Nonparametric bootstrap
- We want to estimate a parameter θ of the population
- Sample statistic from data is t
- Create bootstrap sample of size n from the data
- Calculate t*, the value of t for the bootstrap data
- Repeat B times
- Look at the distribution of t*
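The steps above translate almost line for line into code. In this sketch, `bootstrap_distribution` is a hypothetical helper name, the statistic is the median, and the ten data values are invented for illustration.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, B=2000, seed=0):
    """Return B bootstrap replicates of `statistic`, each from a resample of size n."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        resample = rng.choices(data, k=n)      # sample with replacement, same size as original
        replicates.append(statistic(resample))  # value of the statistic for the bootstrap data
    return replicates

# Invented data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 2.8, 3.3, 4.1, 2.2, 7.5, 3.0]

t = statistics.median(data)
t_star = bootstrap_distribution(data, statistics.median)
print(f"t = {t:.2f}, mean of bootstrap values = {statistics.fmean(t_star):.2f}")
print(f"SD of bootstrap values = {statistics.stdev(t_star):.2f}")
```

Any function that maps a list of numbers to a single value can be passed as `statistic`, which is what makes the approach useful for quantities that lack a convenient formula.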
14. The Bootstrap principle
- The distribution of t* around the sample statistic t will be similar to the distribution of t around the population parameter θ
- If mean of t* differs from t, suggests t is a biased estimator of θ
- Can construct more accurate confidence intervals
- Can construct CIs for quantities lacking theory
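The bias check can be illustrated with the sample standard deviation of the mercury data from slide 2. This is a sketch, not the slides' own calculation: the bias estimate is the mean of the bootstrap values minus the sample statistic, and subtracting it from t is one common way to correct the estimate.

```python
import random
import statistics

# Soil mercury data from slide 2 (mg per one-gram sample).
mercury = [
    0.853511661, 0.391905707, 0.143344303,
    0.198267857, 0.266572367, 0.327306702,
    0.834747834, 5.322618220, 0.817037696,
    0.157247167, 0.328456677, 3.793153524,
    0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502,
    0.178309271, 0.469049646, 0.764546106,
    1.819858816, 0.830187557, 0.369993886,
    0.644729374, 0.841576129, 0.734056277,
    0.773035692, 0.810722543, 0.357449318,
]

rng = random.Random(0)
n = len(mercury)
t = statistics.stdev(mercury)  # sample statistic: standard deviation

# Bootstrap values of the statistic from 2000 resamples.
t_star = [statistics.stdev(rng.choices(mercury, k=n)) for _ in range(2000)]

bias = statistics.fmean(t_star) - t  # estimated bias of t
corrected = t - bias                 # bias-corrected estimate
print(f"t = {t:.3f}, estimated bias = {bias:.3f}, corrected = {corrected:.3f}")
```

A negative bias estimate here suggests the sample SD tends to underestimate the population SD for data like these.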
17. Bootstrap confidence intervals
- If no bias and t* follows normal distribution
- Calculate standard deviation of t*
- Use like standard error in standard formula for CI of mean
- If no bias and t* is symmetric
- Use upper and lower percentiles of t* distribution
- E.g., for 95% CI, use 2.5th and 97.5th percentiles of the distribution
- Otherwise, use bias-corrected accelerated (BCa) intervals
- May be asymmetric
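The first two interval types are simple enough to sketch directly. Here `bootstrap_cis` is a hypothetical helper, the data are invented, and the normal-theory interval uses the normal critical value rather than a t value.

```python
import random
import statistics

def bootstrap_cis(data, statistic, B=4000, seed=0):
    """Return (normal_ci, percentile_ci), both 95% intervals from B bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    t = statistic(data)
    t_star = [statistic(rng.choices(data, k=n)) for _ in range(B)]

    # Normal interval: treat the SD of t* like a standard error.
    z = statistics.NormalDist().inv_cdf(0.975)
    sd = statistics.stdev(t_star)
    normal_ci = (t - z * sd, t + z * sd)

    # Percentile interval: with n=40 the cut points fall every 2.5%,
    # so the first and last are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(t_star, n=40)
    percentile_ci = (cuts[0], cuts[-1])
    return normal_ci, percentile_ci

# Invented data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 2.8, 3.3, 4.1, 2.2, 7.5, 3.0]

normal_ci, percentile_ci = bootstrap_cis(data, statistics.fmean)
print("normal:    ", tuple(round(v, 2) for v in normal_ci))
print("percentile:", tuple(round(v, 2) for v in percentile_ci))
```

When the bootstrap distribution is skewed the two intervals can disagree noticeably, which is the situation where a BCa interval is usually preferred.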
18. Back to the mine
- CI from normal approximation
- 95% confident true mean is between 0.455 and 1.262
- CI from bootstrap BCa
- 95% confident true mean is between 0.606 and 1.548
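A BCa interval for the mean of the mercury data can be sketched with the standard library alone. This follows the usual textbook recipe (bias correction z0 from the share of bootstrap values below the sample statistic, acceleration from a jackknife); it is not necessarily how the slide's figures were produced, and the endpoints will vary a little with the seed and the number of resamples.

```python
import random
import statistics

mercury = [
    0.853511661, 0.391905707, 0.143344303,
    0.198267857, 0.266572367, 0.327306702,
    0.834747834, 5.322618220, 0.817037696,
    0.157247167, 0.328456677, 3.793153524,
    0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502,
    0.178309271, 0.469049646, 0.764546106,
    1.819858816, 0.830187557, 0.369993886,
    0.644729374, 0.841576129, 0.734056277,
    0.773035692, 0.810722543, 0.357449318,
]

def bca_interval(data, statistic, B=5000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap CI (textbook recipe, illustrative only)."""
    rng = random.Random(seed)
    n = len(data)
    nd = statistics.NormalDist()
    t_hat = statistic(data)
    t_star = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))

    # Bias correction: fails if no (or all) bootstrap values fall below t_hat.
    z0 = nd.inv_cdf(sum(v < t_hat for v in t_star) / B)

    # Acceleration from leave-one-out (jackknife) values of the statistic.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den

    def adjusted(level):
        """Shift a nominal percentile level using z0 and a."""
        z = nd.inv_cdf(level)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(level):
        """Look up the bootstrap value at the given level, clamped to valid indices."""
        return t_star[max(0, min(B - 1, int(level * B)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))

lower, upper = bca_interval(mercury, statistics.fmean)
print(f"BCa 95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

For right-skewed data like these, the interval extends further above the sample mean than below it, in line with the asymmetric BCa interval reported on the slide.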
19. Bootstrapping a regression
Call:
lm(formula = Chlorophyll.a ~ Phosphorus, data = chlor)

Residuals:
    Min      1Q  Median      3Q     Max
-36.148 -13.901  -5.022   5.254  61.037

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.34093    6.72380   1.687    0.105
Phosphorus   0.30241    0.03512   8.610 1.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.86 on 23 degrees of freedom
Multiple R-Squared: 0.7632, Adjusted R-squared: 0.7529
F-statistic: 74.13 on 1 and 23 DF, p-value: 1.189e-08
22. 95% CI of slope parameter
- From OLS theory
- 0.302 ± 0.035 × t0.05,23
- (0.230, 0.374)
- BCa CI
- (0.247, 0.425)
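Bootstrapping a regression slope can be sketched by resampling whole (x, y) pairs. The `chlor` data set is not included in the slides, so the 25 pairs below are simulated stand-ins with a slope near 0.3; they are hypothetical and will not reproduce the intervals above. The sketch uses a simple percentile interval rather than BCa.

```python
import random
import statistics

def ols_slope(pairs):
    """Ordinary least squares slope for a list of (x, y) pairs."""
    x_bar = statistics.fmean(x for x, _ in pairs)
    y_bar = statistics.fmean(y for _, y in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    return sxy / sxx

rng = random.Random(4)

# Hypothetical stand-in data: 25 (phosphorus, chlorophyll) pairs, NOT the real chlor data.
pairs = []
for _ in range(25):
    phosphorus = rng.uniform(10, 400)
    chlorophyll = 11 + 0.3 * phosphorus + rng.gauss(0, 25)
    pairs.append((phosphorus, chlorophyll))

slope_hat = ols_slope(pairs)

# Case resampling: draw 25 pairs with replacement and refit, 2000 times.
slopes = [ols_slope(rng.choices(pairs, k=len(pairs))) for _ in range(2000)]

# Percentile interval: the first and last of the 2.5% cut points.
cuts = statistics.quantiles(slopes, n=40)
lower, upper = cuts[0], cuts[-1]
print(f"slope = {slope_hat:.3f}, 95% percentile CI = ({lower:.3f}, {upper:.3f})")
```

Resampling pairs makes no assumption about constant residual variance, which is one reason a bootstrap interval can differ from the OLS one.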