Bootstrap resampling

Provided by: brucek64


Title: Bootstrap resampling

1
Bootstrap resampling

May 13, 2008
ESM 206C

2
Here's some data

Mercury (mg) in one-gram soil samples:
0.853511661, 0.391905707, 0.143344303,
0.198267857, 0.266572367, 0.327306702,
0.834747834, 5.322618220, 0.817037696,
0.157247167, 0.328456677, 3.793153524,
0.513433215, 0.502938253, 0.733454663,
0.279345254, 0.952473470, 0.742740502,
0.178309271, 0.469049646, 0.764546106,
1.819858816, 0.830187557, 0.369993886,
0.644729374, 0.841576129, 0.734056277,
0.773035692, 0.810722543, 0.357449318

3
Soil mercury at a mine site
4
How well does the sample statistic estimate the population parameter?

Accuracy: bias and precision
Thought experiment: imagine repeating the sampling procedure many times, each time calculating a sample mean
Bias is the difference between the average sample mean and the population mean
Precision is the standard deviation of the sample means
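
The thought experiment can be run literally on a computer when the population is known. The sketch below is only an illustration: the log-normal population and its parameters (MU_LOG, SIGMA_LOG), the sample size, and the number of repetitions are all assumptions, not values from the slides.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population (assumed for illustration): log-normal on these log-scale parameters.
MU_LOG, SIGMA_LOG = -0.5, 0.8
pop_mean = math.exp(MU_LOG + SIGMA_LOG ** 2 / 2)  # exact mean of a log-normal

n, reps = 30, 5000
sample_means = []
for _ in range(reps):
    # Repeat the sampling procedure and record the sample mean each time.
    sample = [random.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

bias = statistics.fmean(sample_means) - pop_mean  # average sample mean minus population mean
precision = statistics.stdev(sample_means)        # SD of the sample means
print(f"bias = {bias:.4f}, precision = {precision:.4f}")
```

With a known population the bias of the sample mean should come out near zero, while the precision shrinks as n grows.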

5
Take some more samples
6
Some statistical theory

If the population is normally distributed:
The sample mean is a normally distributed random variable with mean µ and variance σ²/n
The sample mean and variance are unbiased estimates of the population mean and variance
The standard error of the mean estimates the precision of the sample mean, and can be used to construct confidence intervals
The above are also approximately true even if the population is not normally distributed, as long as the sample size is large enough

7
Confidence Intervals

If we know how the data are sampled:
We can construct a confidence interval (C.I.) for an unknown parameter, θ
A 95% C.I. gives a range such that the true θ is in the interval 95% of the time
A 100(1-α)% C.I. captures the true θ 100(1-α)% of the time
Smaller α: more sure the true θ falls in the interval, but a wider interval

C.I. FOR MEAN OF NORMALLY DISTRIBUTED DATA
95% C.I. for µ: sample mean ± t97.5 × SE
SE is the standard error of the mean
t97.5 is the critical value (97.5th percentile) of the t distribution
The critical t value depends on sample size (n)
If n > 20, then t97.5 ≈ 1.96 ≈ 2
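
As a worked example, the t-based interval can be computed for the mercury data listed earlier. This is a minimal sketch: the critical value 2.045 (29 degrees of freedom) is taken from a standard t table and hard-coded, since the Python standard library has no t distribution.

```python
import math
import statistics

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

n = len(mercury)
mean = statistics.fmean(mercury)
se = statistics.stdev(mercury) / math.sqrt(n)  # standard error of the mean
T_CRIT = 2.045  # t97.5 for n - 1 = 29 degrees of freedom (from a t table)

ci = (mean - T_CRIT * se, mean + T_CRIT * se)
print(f"mean = {mean:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The result should land close to the normal-approximation interval quoted on the "Back to the mine" slide (0.455 to 1.262).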

8
Confidence Intervals
9
Distribution of 1000 sample means
10
95% CIs based on t statistic

Out of 1000 replicate samples:
CI contained µ 908 times
Entire CI was below µ 87 times
Entire CI was above µ 5 times
Average sample mean very close to µ (estimate is unbiased), but:
Sample mean less than µ 558 times
Average sample SD less than σ
Biased estimate!
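
A coverage check of this kind can be sketched as follows. The slides do not say which population produced their 1000 replicates, so the skewed log-normal population here (MU_LOG, SIGMA_LOG) is an assumption and the counts will not match the slide's numbers.

```python
import math
import random
import statistics

random.seed(2)

MU_LOG, SIGMA_LOG = -0.5, 0.8  # hypothetical skewed population (assumed)
pop_mean = math.exp(MU_LOG + SIGMA_LOG ** 2 / 2)
n, reps = 30, 1000
T_CRIT = 2.045  # t97.5 for 29 degrees of freedom (from a t table)

contained = below = above = 0
for _ in range(reps):
    x = [random.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(n)]
    m = statistics.fmean(x)
    half_width = T_CRIT * statistics.stdev(x) / math.sqrt(n)
    if m + half_width < pop_mean:
        below += 1      # entire CI below the population mean
    elif m - half_width > pop_mean:
        above += 1      # entire CI above the population mean
    else:
        contained += 1  # CI contains the population mean

print(contained, below, above)
```

For a right-skewed population, misses tend to pile up on the "entire CI below µ" side, which is the pattern the slide reports.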

11
Resampling for a confidence interval of the mean

IN AN IDEAL WORLD
Take sample
Calculate sample mean & std dev
Take new sample from same population
Calculate new mean & std dev
Repeat many times
Look at the distribution of sample means & std devs
Find bias; use it to correct the sample statistic for the original sample
Figure out a formula to get a 95% CI that has correct coverage

IN THE REAL WORLD
Take sample
Calculate sample mean & std dev
Find some way to simulate taking a new sample from same population
Calculate new mean & std dev
Repeat many times
Look at the distribution of sample means & std devs
Find bias; use it to correct the sample statistic for the original sample
Figure out a formula to get a 95% CI that has correct coverage

12
Bootstrap resampling

PARAMETRIC BOOTSTRAP
Assume data are random variables from a particular distribution
E.g., log-normal
Use data to estimate parameters of the distribution
E.g., mean, variance
Use random number generator to create sample
Same size as original
Calculate sample statistics
Allows us to ask: "What if the data were a random sample from the specified distribution with the specified parameters?"

NONPARAMETRIC BOOTSTRAP
Assume the underlying distribution from which the data come is unknown
Best estimate of this distribution is the data themselves: the empirical distribution function
Create a new dataset by sampling with replacement from the data
Same size as original
Calculate sample statistics

WHICH IS BETTER?
If the underlying distribution is correctly chosen, parametric has more precision
If the underlying distribution is incorrectly chosen, parametric has more bias
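
The two ways of simulating a new sample can be put side by side. This sketch draws one resample of each kind from the mercury data; fitting the log-normal by the mean and SD of the logged values is one reasonable choice, assumed here rather than taken from the slides.

```python
import math
import random
import statistics

random.seed(3)

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]
n = len(mercury)

# Parametric: assume log-normal, estimate its parameters from the logged data,
# then generate a new sample of the same size with a random number generator.
logs = [math.log(x) for x in mercury]
mu_hat, sigma_hat = statistics.fmean(logs), statistics.stdev(logs)
parametric_sample = [random.lognormvariate(mu_hat, sigma_hat) for _ in range(n)]

# Nonparametric: sample with replacement from the data themselves.
nonparametric_sample = random.choices(mercury, k=n)

print(statistics.fmean(parametric_sample), statistics.fmean(nonparametric_sample))
```

The parametric resample can contain values never observed, while the nonparametric resample only reuses observed values, some of them more than once.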

13
Nonparametric bootstrap

We want to estimate a parameter θ of the population
Sample statistic from data is t
Create bootstrap sample of size n by sampling with replacement from the data
Calculate t*, the value of t for the bootstrap data
Repeat B times
Look at the distribution of t*
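
The steps above can be sketched as a small helper. The function name `bootstrap`, its arguments, and the default B = 2000 are illustrative choices, not part of the slides.

```python
import random
import statistics

def bootstrap(data, statistic, B=2000, seed=0):
    """Return B bootstrap replicates t* of `statistic` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate: resample n values with replacement, then recompute the statistic.
    return [statistic(rng.choices(data, k=n)) for _ in range(B)]

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

t = statistics.fmean(mercury)                    # sample statistic t
t_star = bootstrap(mercury, statistics.fmean)    # distribution of t*
print(f"t = {t:.3f}; t* ranges from {min(t_star):.3f} to {max(t_star):.3f}")
```

Any function of a sample can be passed as `statistic`, for example `statistics.median` or `statistics.stdev`.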

14
The Bootstrap principle

The distribution of t* around the sample statistic t will be similar to the distribution of t around the population parameter θ
If the mean of t* differs from t, this suggests t is a biased estimator of θ
Can construct more accurate confidence intervals
Can construct CIs for quantities lacking theory
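
A bias check along these lines might look like the sketch below, which uses the sample standard deviation of the mercury data as the statistic t. Correcting the estimate by subtracting the estimated bias is one common convention, assumed here.

```python
import random
import statistics

rng = random.Random(4)

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]
n, B = len(mercury), 2000

t = statistics.stdev(mercury)  # sample statistic: standard deviation
t_star = [statistics.stdev(rng.choices(mercury, k=n)) for _ in range(B)]

bias = statistics.fmean(t_star) - t  # mean of t* minus t estimates the bias of t
t_corrected = t - bias               # bias-corrected estimate
print(f"t = {t:.3f}, estimated bias = {bias:.3f}, corrected = {t_corrected:.3f}")
```

A negative estimated bias here would agree with the earlier observation that the average sample SD falls below σ.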

17
Bootstrap confidence intervals

If no bias and t* follows a normal distribution:
Calculate standard deviation of t*
Use like standard error in standard formula for CI of mean
If no bias and t* is symmetric:
Use upper and lower percentiles of the t* distribution
E.g., for 95% CI, use the 2.5th and 97.5th percentiles of the distribution
Otherwise, use bias-corrected and accelerated (BCa) intervals
May be asymmetric
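
The first two interval types can be sketched for the mean of the mercury data. B = 5000 and the simple order-statistic rule for picking percentiles are assumptions made for illustration.

```python
import random
import statistics

random.seed(5)

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]
n, B = len(mercury), 5000

t = statistics.fmean(mercury)
t_star = sorted(statistics.fmean(random.choices(mercury, k=n)) for _ in range(B))

# Normal interval: the SD of t* plays the role of the standard error.
se_boot = statistics.stdev(t_star)
normal_ci = (t - 1.96 * se_boot, t + 1.96 * se_boot)

# Percentile interval: 2.5th and 97.5th percentiles of the sorted t* values.
percentile_ci = (t_star[round(0.025 * B)], t_star[round(0.975 * B) - 1])

print(f"normal:     ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
print(f"percentile: ({percentile_ci[0]:.3f}, {percentile_ci[1]:.3f})")
```

The normal interval is symmetric about t by construction; the percentile interval follows the shape of the t* distribution.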

18
Back to the mine

CI from normal approximation
95% confident true mean is between 0.455 and 1.262
CI from bootstrap BCa
95% confident true mean is between 0.606 and 1.548
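
For readers curious how a BCa interval is built, here is a minimal sketch using the usual textbook recipe: a bias-correction term z0 from the share of t* below t, and an acceleration term a from jackknife (leave-one-out) values. The function `bca_interval` and its defaults are hypothetical, and it is not necessarily the routine that produced the numbers on the slide, so the output will differ somewhat from 0.606 and 1.548.

```python
import random
import statistics
from statistics import NormalDist

def bca_interval(data, statistic, B=5000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap CI (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    t = statistic(data)
    t_star = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    norm = NormalDist()

    # Bias correction: proportion of t* below t, moved to the normal scale.
    z0 = norm.inv_cdf(sum(ts < t for ts in t_star) / B)

    # Acceleration: skewness-like measure from leave-one-out estimates.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = statistics.fmean(jack)
    a = sum((jbar - j) ** 3 for j in jack) / (
        6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    )

    def adjusted(p):
        # Shift the nominal percentile p using z0 and a.
        z = norm.inv_cdf(p)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    lo = t_star[min(B - 1, int(adjusted(alpha / 2) * B))]
    hi = t_star[min(B - 1, int(adjusted(1 - alpha / 2) * B))]
    return lo, hi

mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]
mean = statistics.fmean(mercury)
lo, hi = bca_interval(mercury, statistics.fmean)
print(f"mean = {mean:.3f}, BCa 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because the data are strongly right-skewed, the interval should extend further above the sample mean than below it, as on the slide.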

19
Bootstrapping a regression
Call:
lm(formula = Chlorophyll.a ~ Phosphorus, data = chlor)

Residuals:
    Min      1Q  Median      3Q     Max
-36.148 -13.901  -5.022   5.254  61.037

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.34093    6.72380   1.687    0.105
Phosphorus   0.30241    0.03512   8.610 1.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.86 on 23 degrees of freedom
Multiple R-Squared: 0.7632, Adjusted R-squared: 0.7529
F-statistic: 74.13 on 1 and 23 DF, p-value: 1.189e-08
22
95% CI of slope parameter
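
One common way to bootstrap a regression is to resample whole (x, y) cases and refit the line each time. The sketch below assumes that approach; the slides do not say which scheme was used. The chlor data are not listed in the transcript, so the 25 points here are made up, with values loosely patterned on the fitted line above.

```python
import random
import statistics

def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    xbar = statistics.fmean(x for x, _ in pairs)
    ybar = statistics.fmean(y for _, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    return sxy / sxx

rng = random.Random(7)

# Made-up stand-in for the chlor data set (phosphorus, chlorophyll a); not real data.
pairs = []
for _ in range(25):
    x = rng.uniform(10, 400)
    y = 11 + 0.3 * x + rng.gauss(0, 25)
    pairs.append((x, y))

slope = ols_slope(pairs)

# Case resampling: draw 25 (x, y) pairs with replacement and refit, B times.
B = 2000
slopes = sorted(ols_slope(rng.choices(pairs, k=len(pairs))) for _ in range(B))

# Percentile 95% CI for the slope.
ci = (slopes[round(0.025 * B)], slopes[round(0.975 * B) - 1])
print(f"slope = {slope:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Resampling cases keeps each x paired with its own y, so the interval does not depend on the residuals having constant variance or a normal distribution.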