Title: Alternative Forecasting Methods: Bootstrapping
Alternative Forecasting Methods: Bootstrapping
- Bryce Bucknell
- Jim Burke
- Ken Flores
- Tim Metts
Agenda
Scenario
Obstacles
Regression Model
Bootstrapping
Applications and Uses
Results
Scenario
You have recently been hired as the statistician for the University of Notre Dame football team. You are tasked with performing a statistical analysis of the first year of the Charlie Weis era. Specifically, you have been asked to develop a regression model that explains the relationship between key statistical categories and the number of points scored by the offense. You have a limited number of data points, so you must also find a way to ensure that the regression results generated by the model are reliable and significant.
Problems/Obstacles
- Central Limit Theorem
- Replication of data
- Sampling
- Variance of error terms
Constrained by the Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̄ can be approximated by a normal probability distribution as the sample size becomes large. It is generally accepted that a sample size of 30 or greater satisfies the large-sample condition of the theorem.

1. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm
Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical procedures: the distribution of the phenomenon under study does NOT have to be normal, because the distribution of its sample averages WILL tend toward normality.

Why is the assumption of a normal distribution important?

- A normal distribution allows for the application of the empirical rule: roughly 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean.
- Chebyshev's theorem: no more than 1/4 of the values are more than 2 standard deviations away from the mean, no more than 1/9 are more than 3 standard deviations away, no more than 1/25 are more than 5 standard deviations away, and so on.
- The assumption of normally distributed data allows descriptive statistics to be used to explain the nature of the population.
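The two rules above can be checked numerically. A short Python sketch (not from the slides) compares Chebyshev's distribution-free bound with the actual outside-k-sigma fraction for normal data:

```python
import math

# Chebyshev's theorem (any distribution): at most 1/k^2 of values lie more
# than k standard deviations from the mean. For normal data, the actual
# fraction outside k sigma is erfc(k / sqrt(2)), which is far smaller.
chebyshev = {k: 1 / k**2 for k in (2, 3, 5)}
normal_tail = {k: math.erfc(k / math.sqrt(2)) for k in (2, 3, 5)}

for k in (2, 3, 5):
    print(f"k={k}: Chebyshev bound {chebyshev[k]:.4f}, normal {normal_tail[k]:.4f}")
```

For k = 2, Chebyshev guarantees no more than 25% outside, while normal data actually puts only about 4.55% outside, which is why normality buys much tighter statements.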
Not enough data available?
Monte Carlo simulation, a type of spreadsheet simulation, randomly generates values for uncertain variables over and over to simulate a model.

- Monte Carlo methods randomly select values to create scenarios.
- The random selection process is repeated many times to create multiple scenarios.
- Through the random selection process, the scenarios give a range of possible solutions, some more probable and some less probable.
- As the process is repeated many times, 10,000 or more, the average solution gives an approximate answer to the problem.
- The accuracy can be improved by increasing the number of scenarios selected.
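A minimal Monte Carlo sketch of the idea, using a two-dice scenario as a stand-in for an uncertain variable (illustrative, not from the slides):

```python
import random

random.seed(42)  # reproducible scenarios

def estimate(n_scenarios):
    """Repeat a random scenario n times and average the outcomes."""
    hits = sum(random.randint(1, 6) + random.randint(1, 6) >= 10
               for _ in range(n_scenarios))
    return hits / n_scenarios

# Exact answer for comparison: P(two dice total >= 10) = 6/36, about 0.1667.
small, large = estimate(100), estimate(100_000)
print(small, large)  # estimates settle toward 1/6 as the scenario count grows
```

Increasing the number of scenarios shrinks the sampling noise in the estimate, which is exactly the accuracy claim on the slide.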
Sampling without Replacement
Simple Random Sampling

- A simple random sample from a population is a sample chosen randomly, so that each possible sample has the same probability of being chosen.
- In small populations such sampling is typically done "without replacement."
- Sampling without replacement deliberately avoids choosing any member of the population more than once.
- This process should be used when outcomes are mutually exclusive, e.g., poker hands.
Sampling with Replacement
- The initial data set is not large enough to use simple random sampling without replacement.
- Through Monte Carlo simulation we have been able to replicate the original population.
- Units are sampled from the population one at a time, with each unit being replaced before the next is sampled.
- One outcome does not affect the other outcomes.
- This allows a greater number of potential outcomes than sampling without replacement.
- If observations were not replaced, there would not be enough independent observations to create a sample size of n ≥ 30.
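The difference is easy to see with Python's standard library; a small sketch, with hypothetical game labels standing in for the original observations:

```python
import random

random.seed(0)
# Hypothetical labels for a 12-game season (illustrative only):
games = [f"game_{i}" for i in range(1, 13)]

# Without replacement: each observation can appear at most once,
# so the sample size can never exceed the 12 originals.
no_repl = random.sample(games, k=12)

# With replacement: each draw is put back before the next, so draws are
# independent and a sample of n >= 30 is possible from 12 observations.
with_repl = random.choices(games, k=30)

print(len(set(no_repl)), len(with_repl))
```

Asking `random.sample` for k=30 here would raise a `ValueError`, which is precisely the limitation the slide describes.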
Heteroscedasticity vs. Homoscedasticity

Homoscedasticity (constant variance):
- All random variables have the same finite variance.
- Simplifies mathematical and computational treatment.
- Leads to good estimation results in data mining and regression.

Heteroscedasticity (nonconstant variance):
- Random variables may have different variances.
- Standard errors of regression coefficients may be understated.
- t-ratios may be larger than actual.
- More common with cross-sectional data.
Regression Model for ND Points Scored

ND Points = 38.54 + 0.079(b1) - 0.170(b2) - 0.662(b3) - 3.16(b4)

- b1 = Total Yards Gained
- b2 = Penalty Yards
- b3 = Total Plays
- b4 = Turnovers
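Under these fitted coefficients, a predicted score can be computed directly. A small sketch with a hypothetical stat line (the input values are illustrative, not from the actual data set):

```python
def nd_points(total_yards, penalty_yards, total_plays, turnovers):
    """Predicted points from the fitted model above."""
    return (38.54 + 0.079 * total_yards - 0.170 * penalty_yards
            - 0.662 * total_plays - 3.16 * turnovers)

# Hypothetical stat line: 450 yards, 40 penalty yards, 70 plays, 1 turnover.
print(round(nd_points(450, 40, 70, 1), 2))  # → 17.79
```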
4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
2. Are the slope terms statistically significant?
3. How well does the model fit the data?
4. Is there any serial correlation?
4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
Could this represent a big play factor?
4 Checks of a Regression Model
2. Are the slope terms statistically significant?
3. How well does the model fit the data?
4. Is there any serial correlation?
4 Checks of a Regression Model
3. How well does the model fit the data?
Adjusted R² = 74.22%
4 Checks of a Regression Model
4. Is there any serial correlation?
The data is cross-sectional.

With limited data points, how useful is this regression in describing how well the model fits the actual data? Is there a way to test its reliability?
How to test the significance of the analysis
What happens when the sample size is not large enough (n < 30)?

Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.

- Commonly used statistical significance tests determine the likelihood of a result given a random sample of size n.
- If the population is not random and does not allow a large enough sample to be drawn, the central limit theorem would not hold true.
- Thus, the statistical significance of the data would not hold.
- Bootstrapping uses replication of the original data to simulate a larger population, allowing many samples to be drawn and statistical tests to be calculated.
How It Works
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.

- The bootstrap procedure is a means of estimating the statistical accuracy . . . from the data in a single sample.
- Bootstrapping mimics the process of selecting many samples when the population is too small to do otherwise.
- The samples are generated from the data in the original sample by copying it many times (Monte Carlo simulation).
- Samples can then be selected at random, and descriptive statistics calculated or regressions run for each sample.
- The results generated from the bootstrap samples can be treated as if they were the result of actual sampling from the original population.
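The procedure can be sketched in a few lines. This example bootstraps a percentile confidence interval for a mean from a hypothetical 12-observation sample (both the data and the choice of statistic are illustrative):

```python
import random
import statistics

random.seed(1)
# Hypothetical small original sample (e.g. points scored per game):
original = [42, 17, 44, 36, 49, 41, 21, 34, 5, 34, 38, 20]

def bootstrap_means(data, n_boot=10_000):
    """Draw n_boot resamples with replacement, each the same size as the
    original sample, and record the statistic (here, the mean) of each."""
    return [statistics.mean(random.choices(data, k=len(data)))
            for _ in range(n_boot)]

boots = sorted(bootstrap_means(original))
# Percentile 95% confidence interval for the mean:
lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots))]
print(round(lo, 2), round(hi, 2))
```

Any statistic can be substituted for the mean inside the loop, including the regression coefficients or adjusted R² used later in this deck.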
Characteristics of Bootstrapping

[Diagram: the full sample is repeatedly resampled with replacement to form bootstrap samples]
Bootstrapping Example

Random sampling with replacement can be employed to create multiple independent samples for analysis. The original data set has a limited number of observations, so 109 copies of each observation are made, creating a much larger sample with which to work; random samples (the 1st random sample, the 2nd, and so on) are then drawn from this enlarged set.
When it should be used
Bootstrapping is especially useful in situations where no analytic formula for the sampling distribution is available.

- Traditional forecasting methods, like exponential smoothing, work well when demand is constant: patterns are easily recognized by software.
- In contrast, when demand is irregular, patterns may be difficult to recognize.
- Therefore, when faced with irregular demand, bootstrapping may be used to provide more accurate forecasts, subject to some important assumptions.
Assumptions and Methodology
- Bootstrapping makes no assumption regarding the population:
  - No normality of error terms.
  - No equal variance.
- Allows for accurate forecasts of intermittent demand.
- If the sample is a good approximation of the population, the sampling distribution may be estimated by generating a large number of new samples.
- For small data sets, taking a small representative sample of the data and replicating it will yield superior results.
Applications and Uses
Criminology

- Statistical significance testing is important in criminology and criminal justice.
- Six of the most popular journals in criminology and criminal justice are dominated by quantitative methods that rely on statistical significance testing.
- However, it poses two potential problems: tautology and violations of assumptions.
Applications and Uses
Criminology
- Tautology: the null hypothesis is always false, because virtually any null hypothesis may be rejected at some sample size.
- Violation of regression assumptions: that errors are homogeneous and that errors of independent variables are normally distributed.
- Bootstrapping provides a user-friendly alternative to cross-validation and the jackknife to augment statistical significance testing.
Applications and Uses
Actuarial Practice
- The process of developing an actuarial model begins with the creation of probability distributions of input variables.
- Input variables are generally cash flows generated on the asset side (financial) or cash flows generated from the liabilities side (underwriting).
- Traditional actuarial methodologies are rooted in parametric approaches, which fit a prescribed distribution of losses to the data.
Applications and Uses
Actuarial Practice
- However, experience from the last two decades has shown greater interdependence of loss variables with asset variables.
- Increased complexity has been accompanied by increased competitive pressures and more frequent insolvencies.
- There is a need for nonparametric methods in modeling loss distributions.
- Bootstrap standard errors and confidence intervals are used to derive the distribution.
Applications and Uses
Classifications Used by Ecologists
- Ecologists often use cluster analysis as a tool in the classification and mapping of entities such as communities or landscapes.
- However, the researcher has to choose an adequate group-partition level, and cluster analysis techniques will always reveal groups.
- The bootstrap can be used to test statistically for fuzziness of the partitions in cluster analysis.
- Partitions found in bootstrap samples are compared to the observed partition by the similarity of the sampling units that form the groups.
Applications and Uses
Human Nutrition
- Inverse regression was used to estimate the vitamin B-6 requirement of young women.
- Standard statistical methods were used to estimate the mean vitamin B-6 requirement.
- A bootstrap procedure served as a further check on the mean vitamin B-6 requirement, by examining the standard error estimates and confidence intervals.
Applications and Uses
Outsourcing
- Agilent Technologies determined it was time to transfer manufacturing of its 3070 in-circuit test systems from Colorado to Singapore.
- A major concern was the change in environmental test conditions (dry vs. humid).
- Because Agilent tests to tighter factory limits ("guard banding"), it needed to adjust the guard band for Singapore.
- The bootstrap was used to determine the appropriate guard band for the Singapore facility.
An Alternative to the Bootstrap
Jackknife
- A statistical method for estimating and removing bias and for deriving robust estimates of standard errors and confidence intervals.
- Created by systematically dropping out subsets of the data one at a time and assessing the resulting variation.

Bias: a statistical sampling or testing error caused by systematically favoring some outcomes over others.
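A minimal jackknife sketch on an illustrative sample, estimating the standard error of the mean by leaving each observation out in turn (for the mean, the jackknife standard error reduces exactly to the familiar s/√n):

```python
import statistics

data = [12.1, 9.8, 11.4, 10.2, 13.6, 10.9, 11.7, 9.5]  # illustrative sample
n = len(data)

# Jackknife: recompute the statistic with each observation dropped in turn.
leave_one_out = [statistics.mean(data[:i] + data[i + 1:]) for i in range(n)]
jack_center = statistics.mean(leave_one_out)

# Jackknife standard error of the estimate:
se = ((n - 1) / n * sum((m - jack_center) ** 2 for m in leave_one_out)) ** 0.5
print(round(se, 4))
```

Because the n leave-one-out subsets are fixed, rerunning this yields the same answer every time, which is the determinism contrast drawn with the bootstrap on the next slide.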
A Comparison of the Bootstrap and Jackknife
- Bootstrap
  - Yields slightly different results when repeated on the same data (when estimating the standard error).
  - Not bound to theoretical distributions.
- Jackknife
  - A less general technique.
  - Explores sample variation differently.
  - Yields the same result each time.
  - Similar data requirements.
Cross-Validation
Another alternative method
- The practice of partitioning data into sub-samples such that the initial analysis is conducted on one sub-sample (training data), while the remaining sub-samples (test or validation data) are held back blind for subsequent use in confirming and validating the initial analysis.
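A minimal sketch of the partitioning idea, using a 4-fold split on illustrative data (the fold count and data are assumptions for the example):

```python
# Partition illustrative data into four disjoint sub-samples (folds).
data = list(range(20))
folds = [data[i::4] for i in range(4)]

results = []
for k in range(4):
    validation = folds[k]                                  # held-back fold
    training = [x for j, f in enumerate(folds) if j != k for x in f]
    results.append((len(training), len(validation)))
    # ...fit the model on `training`, then score it on `validation`...

print(results)  # each round: 15 training points, 5 validation points
```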
Bootstrap vs. Cross-Validation
- Bootstrap
  - Requires a small amount of data.
  - A more complex, time-consuming technique.
- Cross-Validation
  - Not a resampling technique.
  - Requires large amounts of data.
  - Extremely useful in data mining and artificial intelligence.
Methodology for ND Points Model
- Use bootstrapping on the ND points-scored regression model.
- Goal: determine the reliability of the model.
- Replication, random sampling, and numerous independent regressions.
- Calculation of a confidence interval for adjusted R².
R² Data
Bootstrapping Results
Sample   Adjusted R²
1        0.7351
2        0.7545
3        0.7438
4        0.7968
5        0.5164
6        0.6449
7        0.9951
8        0.9253
9        0.8144
10       0.7631
11       0.8257
12       0.9099
13       0.7482
14       0.8719
15       0.7391
16       0.9025
17       0.8634
18       0.7927
19       0.6797
20       0.6765
21       0.8226
22       0.9902
23       0.8812
24       0.9169

The mean, standard deviation, and 95% and 99% confidence intervals are then calculated in Excel from the 24 observations.
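The same Excel calculation can be reproduced from the 24 adjusted R² values. The margins below use z-multipliers (1.96 and 2.576), which is the convention the slide's intervals appear to follow:

```python
import statistics

adj_r2 = [0.7351, 0.7545, 0.7438, 0.7968, 0.5164, 0.6449,
          0.9951, 0.9253, 0.8144, 0.7631, 0.8257, 0.9099,
          0.7482, 0.8719, 0.7391, 0.9025, 0.8634, 0.7927,
          0.6797, 0.6765, 0.8226, 0.9902, 0.8812, 0.9169]

n = len(adj_r2)
mean = statistics.mean(adj_r2)
sd = statistics.stdev(adj_r2)      # sample standard deviation
se = sd / n ** 0.5                 # standard error of the mean
m95, m99 = 1.96 * se, 2.576 * se   # z-based confidence margins
print(f"mean={mean:.4f} sd={sd:.4f} 95%=±{m95:.4f} 99%=±{m99:.4f}")
```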
Bootstrapping Results

R² Data
- Mean = 0.8046
- Std. Dev. = 0.1131
- 95% Conf. = ±0.0453, i.e. 75.93% to 84.98%
- 99% Conf. = ±0.0595, i.e. 74.51% to 86.41%
So what does this mean for the results of the
regression?
Can we rely on this model to help predict the
number of points per game that will be scored by
the 2006 team?
Questions?