Bootstrap resampling

Provided by: brucek64

1
Bootstrap resampling
  • May 13, 2008
  • ESM 206C

2
Here's some data
  • mg of mercury in one-gram soil samples
  • 0.853511661, 0.391905707, 0.143344303,
    0.198267857, 0.266572367, 0.327306702,
    0.834747834, 5.322618220, 0.817037696,
    0.157247167, 0.328456677, 3.793153524,
    0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502,
    0.178309271, 0.469049646, 0.764546106,
    1.819858816, 0.830187557, 0.369993886,
    0.644729374, 0.841576129, 0.734056277,
    0.773035692, 0.810722543, 0.357449318

3
Soil mercury at a mine site
4
How well does the sample statistic estimate the
population parameter?
  • Accuracy: bias and precision
  • Thought experiment: imagine repeating the
    sampling procedure many times, and each time
    calculating a sample mean
  • Bias is the difference between the average sample
    mean and the population mean
  • Precision is the standard deviation of the sample
    means
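The thought experiment above can be tried numerically. The following is a minimal Python sketch, assuming a hypothetical log-normal population whose parameters (`MU_LOG`, `SIGMA_LOG`) are chosen only for illustration; they do not come from the slides.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population (an assumption, not from the slides):
# log-normal with log-mean MU_LOG and log-sd SIGMA_LOG.
MU_LOG, SIGMA_LOG = -0.5, 0.8
POP_MEAN = math.exp(MU_LOG + SIGMA_LOG**2 / 2)  # true mean of a log-normal

N = 30          # size of each sample
REPEATS = 5000  # number of imagined repeat samplings

# Repeat the sampling procedure and record the sample mean each time.
sample_means = [
    statistics.fmean([random.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(N)])
    for _ in range(REPEATS)
]

bias = statistics.fmean(sample_means) - POP_MEAN  # average sample mean minus population mean
precision = statistics.stdev(sample_means)        # standard deviation of the sample means

print(f"population mean = {POP_MEAN:.3f}")
print(f"bias            = {bias:+.4f}")
print(f"precision (SD)  = {precision:.4f}")
```

Under these assumptions the bias should come out close to zero, while the precision reflects how much a single sample mean can be expected to wander.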

5
Take some more samples
6
Some statistical theory
  • If the population is normally distributed:
  • The sample mean is a normally distributed random
    variable with mean µ and variance σ²/n
  • The sample mean and variance are unbiased
    estimates of the population mean and variance
  • The standard error of the mean is an unbiased
    estimate of the precision of the sample mean, and
    can be used to construct confidence intervals
  • The above are also approximately true even if the
    population is not normally distributed, as long as
    the sample size is large enough (central limit
    theorem)
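In symbols, the statements above can be summarized as follows. The notation for the sample mean and sample standard deviation (x-bar and s) is assumed here and does not appear in the slides.

```latex
\bar{x} \sim N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right), \qquad
E[\bar{x}] = \mu, \qquad
E[s^{2}] = \sigma^{2}, \qquad
\mathrm{SE} = \frac{s}{\sqrt{n}} \ \text{estimates}\ \frac{\sigma}{\sqrt{n}}
```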

7
Confidence Intervals
  • If we know how data are sampled
  • We can construct a Confidence Interval for an
    unknown parameter, θ.
  • A 95% C.I. gives a range such that true θ is in
    interval 95% of the time.
  • A 100(1-α)% C.I. captures true θ a fraction (1-α)
    of the time.
  • Smaller α, more sure true θ falls in interval,
    but wider interval.
  • C.I. FOR MEAN OF NORMALLY DISTRIBUTED DATA
  • 95% C.I. for µ: sample mean ± t97.5 × SE
  • SE is standard error of mean.
  • t97.5 is critical value of t distribution
  • Critical t value depends on sample size (n)
  • If n > 20, then t97.5 ≈ 1.96 ≈ 2
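As a worked example, the t-based interval can be computed for the soil mercury data from slide 2. This is a Python sketch using only the standard library; the critical value `T_CRIT` is hard-coded from a t table for 29 degrees of freedom, which is an assumption of the sketch rather than something the slides state.

```python
import math
import statistics

# Soil mercury data (mg per gram of soil) from slide 2.
mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

n = len(mercury)
mean = statistics.fmean(mercury)
se = statistics.stdev(mercury) / math.sqrt(n)  # standard error of the mean

# t_97.5 for n - 1 = 29 degrees of freedom, looked up in a t table
# (hard-coded assumption; the standard library has no t quantile function).
T_CRIT = 2.045

ci_low, ci_high = mean - T_CRIT * se, mean + T_CRIT * se
print(f"mean = {mean:.3f}, SE = {se:.3f}")
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

The result should land close to the normal-approximation interval reported for the mine data on slide 18.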

8
Confidence Intervals
9
Distribution of 1000 sample means
10
95% CIs based on t statistic
  • Out of 1000 replicate samples
  • CI contained µ 908 times
  • Entire CI was below µ 87 times
  • Entire CI was above µ 5 times
  • Average sample mean very close to µ (estimate is
    unbiased), but
  • Sample mean less than µ 558 times
  • Average sample SD less than σ
  • Biased estimate!
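A coverage experiment of this kind can be sketched in Python as below. The population behind the slide's counts is not given, so this sketch assumes a hypothetical right-skewed log-normal population; its counts will differ from the ones above.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical skewed population (assumed log-normal; parameters are illustrative).
MU_LOG, SIGMA_LOG = -0.5, 0.8
POP_MEAN = math.exp(MU_LOG + SIGMA_LOG**2 / 2)

N = 30
T_CRIT = 2.045     # t_97.5 with 29 degrees of freedom, from a t table
REPLICATES = 1000

contained = below = above = 0
for _ in range(REPLICATES):
    sample = [random.lognormvariate(MU_LOG, SIGMA_LOG) for _ in range(N)]
    mean = statistics.fmean(sample)
    half_width = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    if mean + half_width < POP_MEAN:
        below += 1       # entire CI below the population mean
    elif mean - half_width > POP_MEAN:
        above += 1       # entire CI above the population mean
    else:
        contained += 1   # CI contains the population mean

print(f"contained: {contained}, entirely below: {below}, entirely above: {above}")
```

For a right-skewed population, misses tend to pile up on the low side, which is the pattern the slide reports.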

11
Resampling for a confidence interval of the mean
  • IN AN IDEAL WORLD
  • Take sample
  • Calculate sample mean and std dev
  • Take new sample from same population
  • Calculate new mean and std dev
  • Repeat many times
  • Look at the distribution of sample means and std
    devs
  • Find bias; use it to correct sample statistic for
    original sample
  • Figure out formula to get 95% CI that has correct
    coverage
  • IN THE REAL WORLD
  • Take sample
  • Calculate sample mean and std dev
  • Find some way to simulate taking a new sample
    from same population
  • Calculate new mean and std dev
  • Repeat many times
  • Look at the distribution of sample means and std
    devs
  • Find bias; use it to correct sample statistic for
    original sample
  • Figure out formula to get 95% CI that has correct
    coverage

12
Bootstrap resampling
  • PARAMETRIC BOOTSTRAP
  • Assume data are random variables from a
    particular distribution
  • E.g., log-normal
  • Use data to estimate parameters of the
    distribution
  • E.g., mean, variance
  • Use random number generator to create sample
  • Same size as original
  • Calculate sample statistics
  • Allows us to ask: What if data were a random
    sample from specified distribution with specified
    parameters?
  • NONPARAMETRIC BOOTSTRAP
  • Assume underlying distribution from which data
    come is unknown
  • Best estimate of this distribution is the data
    themselves: the empirical distribution function
  • Create a new dataset by sampling with replacement
    from the data
  • Same size as original
  • Calculate sample statistics
  • WHICH IS BETTER?
  • If underlying distribution is correctly chosen,
    parametric has more precision
  • If underlying distribution incorrectly chosen,
    parametric has more bias
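The parametric branch can be sketched in Python as follows, using the log-normal example mentioned above and the mercury data from slide 2. Fitting by the mean and standard deviation of the logged data is one simple choice made for this sketch; the slides do not say how the parameters were estimated.

```python
import math
import random
import statistics

random.seed(3)

# Soil mercury data from slide 2.
mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

# Step 1: assume a log-normal model and estimate its parameters from the data.
logs = [math.log(x) for x in mercury]
mu_hat = statistics.fmean(logs)
sigma_hat = statistics.stdev(logs)

# Step 2: generate B simulated samples of the original size from the fitted
# distribution and record the sample statistic (here, the mean) for each.
B = 2000
n = len(mercury)
boot_means = [
    statistics.fmean([random.lognormvariate(mu_hat, sigma_hat) for _ in range(n)])
    for _ in range(B)
]

print(f"fitted log-mean = {mu_hat:.3f}, log-sd = {sigma_hat:.3f}")
print(f"SD of parametric bootstrap means = {statistics.stdev(boot_means):.3f}")
```

The nonparametric version differs only in step 2: instead of drawing from a fitted distribution, it draws with replacement from the data themselves.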

13
Nonparametric bootstrap
  • We want to estimate a parameter θ of the
    population
  • Sample statistic from data is t
  • Create bootstrap sample of size n from the data
  • Calculate t*, the value of t for the bootstrap
    data
  • Repeat B times
  • Look at the distribution of t*
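The steps above map directly onto a few lines of Python. This is a minimal sketch; the helper name `bootstrap_distribution` and its defaults are hypothetical, and the statistic used here is the mean of the mercury data from slide 2.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, B=2000, seed=0):
    """Return B bootstrap values t*: `statistic` evaluated on resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices samples WITH replacement; k=n keeps the original sample size.
    return [statistic(rng.choices(data, k=n)) for _ in range(B)]

# Soil mercury data from slide 2.
mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

t = statistics.fmean(mercury)                                # sample statistic t
t_star = bootstrap_distribution(mercury, statistics.fmean)   # bootstrap values t*

print(f"t = {t:.3f}")
print(f"mean of t* = {statistics.fmean(t_star):.3f}, SD of t* = {statistics.stdev(t_star):.3f}")
```

Any function of a sample can be passed as `statistic`, for example `statistics.median` or `statistics.stdev`.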

14
The Bootstrap principle
  • The distribution of t* around the sample
    statistic t will be similar to the distribution
    of t around the population parameter θ
  • If mean of t* differs from t, suggests t is a
    biased estimator of θ
  • Can construct more accurate confidence intervals
  • Can construct CIs for quantities lacking theory
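The bias check described above (compare the mean of t* with t) can be sketched in Python as below. The sample is simulated from a hypothetical log-normal distribution, not taken from the slides, and the statistic is the sample standard deviation, which tends to underestimate the population value for skewed data.

```python
import random
import statistics

rng = random.Random(4)

# Hypothetical skewed sample (simulated for illustration; not data from the slides).
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(30)]

def bootstrap_bias(data, statistic, B=4000):
    """Return (t, estimated bias), where the bias estimate is mean(t*) - t."""
    t = statistic(data)
    t_star = [statistic(rng.choices(data, k=len(data))) for _ in range(B)]
    return t, statistics.fmean(t_star) - t

t, bias = bootstrap_bias(sample, statistics.stdev)
corrected = t - bias  # subtract the estimated bias from the original statistic

print(f"sample SD t = {t:.3f}")
print(f"estimated bias = {bias:+.3f}, bias-corrected SD = {corrected:.3f}")
```

A negative bias estimate suggests the statistic runs low, so the corrected value is pushed upward.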

17
Bootstrap confidence intervals
  • If no bias and t* follows normal distribution
  • Calculate standard deviation of t*
  • Use like standard error in standard formula for
    CI of mean
  • If no bias and distribution of t* is symmetric
  • Use upper and lower percentiles of t*
    distribution
  • E.g., for 95% CI, use 2.5th and 97.5th percentiles
    of dist.
  • Otherwise, use bias-corrected accelerated (BCa)
    intervals
  • May be asymmetric
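The first two interval types can be sketched in Python for the mean of the mercury data from slide 2. This is an illustration under assumed settings (B = 5000, a fixed seed); whether either interval is appropriate depends on the no-bias and symmetry conditions listed above.

```python
import random
import statistics

rng = random.Random(5)

# Soil mercury data from slide 2.
mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

n, B = len(mercury), 5000
t = statistics.fmean(mercury)
t_star = [statistics.fmean(rng.choices(mercury, k=n)) for _ in range(B)]

# Normal-theory interval: the SD of t* stands in for the standard error.
z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96
se_boot = statistics.stdev(t_star)
normal_ci = (t - z * se_boot, t + z * se_boot)

# Percentile interval: 2.5th and 97.5th percentiles of the t* distribution.
cuts = statistics.quantiles(t_star, n=40)    # cut points at 2.5%, 5%, ..., 97.5%
percentile_ci = (cuts[0], cuts[-1])

print(f"normal-theory CI: ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
print(f"percentile CI:    ({percentile_ci[0]:.3f}, {percentile_ci[1]:.3f})")
```

For skewed data such as these, the percentile interval is typically not symmetric about t, which is a hint that neither simple interval is ideal.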

18
Back to the mine
  • CI from normal approximation
  • 95% confident true mean is between 0.455 and
    1.262
  • CI from bootstrap BCa
  • 95% confident true mean is between 0.606 and
    1.548
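A BCa interval for the mean of the mine data can be sketched in Python as follows. This is a minimal illustration of the standard BCa recipe (bias correction from the bootstrap distribution, acceleration from jackknife values); the function name `bca_interval`, the seed, and B are assumptions of the sketch, and it is not necessarily the software used to produce the numbers above.

```python
import random
import statistics

def bca_interval(data, statistic, alpha=0.05, B=5000, seed=6):
    """Bias-corrected and accelerated (BCa) bootstrap interval, a minimal sketch."""
    rng = random.Random(seed)
    norm = statistics.NormalDist()
    n = len(data)
    t = statistic(data)
    t_star = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))

    # Bias correction z0 from the share of t* values below t.
    # (inv_cdf fails if that share is exactly 0 or 1; not handled here.)
    z0 = norm.inv_cdf(sum(ts < t for ts in t_star) / B)

    # Acceleration a from leave-one-out (jackknife) values of the statistic.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den

    def adjusted_percentile(p):
        z = norm.inv_cdf(p)
        q = norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))  # adjusted level
        return t_star[min(B - 1, max(0, int(q * B)))]

    return adjusted_percentile(alpha / 2), adjusted_percentile(1 - alpha / 2)

# Soil mercury data from slide 2.
mercury = [
    0.853511661, 0.391905707, 0.143344303, 0.198267857, 0.266572367,
    0.327306702, 0.834747834, 5.322618220, 0.817037696, 0.157247167,
    0.328456677, 3.793153524, 0.513433215, 0.502938253, 0.733454663,
    0.279345254, 0.952473470, 0.742740502, 0.178309271, 0.469049646,
    0.764546106, 1.819858816, 0.830187557, 0.369993886, 0.644729374,
    0.841576129, 0.734056277, 0.773035692, 0.810722543, 0.357449318,
]

low, high = bca_interval(mercury, statistics.fmean)
print(f"BCa 95% CI for the mean: ({low:.3f}, {high:.3f})")
```

The interval should be asymmetric and shifted toward larger values, in the same direction as the BCa interval quoted above; the exact endpoints depend on the seed and on B.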

19
Bootstrapping a regression
Call:
lm(formula = Chlorophyll.a ~ Phosphorus, data = chlor)

Residuals:
    Min      1Q  Median      3Q     Max
-36.148 -13.901  -5.022   5.254  61.037

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.34093    6.72380   1.687    0.105
Phosphorus   0.30241    0.03512   8.610 1.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.86 on 23 degrees of freedom
Multiple R-Squared: 0.7632, Adjusted R-squared: 0.7529
F-statistic: 74.13 on 1 and 23 DF, p-value: 1.189e-08
22
95% CI of slope parameter
  • From OLS theory
  • 0.302 ± 0.035 × t0.05,23
  • (0.230, 0.374)
  • BCa CI
  • (0.247, 0.425)
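Bootstrapping a regression slope can be sketched as below. Several assumptions apply: the sketch is in Python rather than the R shown in the regression output, the chlorophyll and phosphorus data are not reproduced in the slides so a simulated stand-in with skewed errors is used, resampling is done on (x, y) cases, and the interval is a simple percentile interval rather than BCa.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical stand-in for the chlorophyll/phosphorus data (simulated):
# y = 10 + 0.3 x + skewed noise with mean zero.
n = 25
x = [rng.uniform(10, 400) for _ in range(n)]
y = [10 + 0.3 * xi + rng.expovariate(1 / 20) - 20 for xi in x]

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

slope = ols_slope(x, y)

# Case resampling: draw (x, y) pairs with replacement and refit each time.
B = 2000
pairs = list(zip(x, y))
slopes = []
for _ in range(B):
    resample = rng.choices(pairs, k=n)
    slopes.append(ols_slope(*zip(*resample)))

cuts = statistics.quantiles(slopes, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
print(f"slope = {slope:.3f}")
print(f"95% percentile CI for slope: ({cuts[0]:.3f}, {cuts[-1]:.3f})")
```

Resampling whole cases keeps each x paired with its own y, so the sketch does not rely on the residuals being normal or having constant variance.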
