Title: STATISTICAL MODELING PROCEDURES Chapter 2
1STATISTICAL MODELING PROCEDURES, Chapter 2
- RSF CONFERENCE
- JANUARY 10, 2003
2Outline
- Introduction to Model Building
- Simple Comparisons/Graphical Methods
- Statistical Models in RSF (e.g., linear regression)
- Hypothesis Tests
- Model Selection
- Multiple Testing
- Bootstrapping
3Dr. George Box Quotes
- Modelling is an art, not a science.
- All models are wrong, some are useful, and we should seek out those.
4Dr. Fisher discussing model specification
- as for problems of specification, these are
entirely a matter for the practical
statistician.
5General Principles of Modeling (McCullagh and Nelder 1989)
- Search for useful models, and know that eternal truth is not within our grasp.
- Do not fall in love with a single model, to the exclusion of alternatives.
- Thoroughly check the fit of the model.
6John Stuart Mill (1879) writing in his System of
Logic
- The guesses which served to give mental unity and wholeness to a chaos of scattered particulars, are accidents which rarely occur to any minds but those abounding in knowledge and disciplined in intellectual combinations
- INTERPRETATION: MODELING IS NOT FOR THE MENTALLY CHALLENGED
7Modeling Approaches
- simple sample comparisons/graphical displays
- linear regression
- logistic regression
- log-linear models
- proportional hazard models
- generalized linear models
8T-tests
9Graphing Example Chipping sparrow RSF
[Figure: bar chart of mean values (vertical axis 0 to 40) for Used and Unused units, by habitat variable: CANOPY, DEBRIS, LIVE TREE, SAPLINGS, SHRUBS]
10Simple Sample Comparisons: Graphical Example
11Linear Regression: Analysis of Continuous Measures of the Amount of Use
- Assume the amount of use of a resource unit is a continuous variable Y.
- Standard statistical methods should be sufficient.
- The linear regression model is
- $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$, (2.1)
- where $\beta_0$ to $\beta_p$ are constants to be estimated from data,
- and $\varepsilon \sim N(0, \sigma^2)$
12Linear Regression/RSF Example
- Y = biomass of eelgrass in 1 m x 1 m quadrats
- X1 = depth
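A minimal sketch of fitting a one-predictor version of the linear regression model by least squares. The depth and biomass numbers are made up for illustration and are not the eelgrass data from the talk; the function name is hypothetical.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1

# Hypothetical values only: depth (m) and biomass per quadrat.
depth = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
biomass = [42.0, 38.5, 33.0, 30.5, 24.0, 21.0]

b0, b1 = fit_simple_regression(depth, biomass)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

With more than one predictor the same idea applies, but the estimates come from solving the normal equations rather than these closed-form sums.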
14Logistic Regression
- An assumption for this type of model is that the probability of a success is given by the equation
- $\pi = \dfrac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}$, (2.2)
- where $\beta_0$ to $\beta_p$ are constants to be estimated from the available data,
- and $X_1$ to $X_p$ are the variables that the probability of a success is to be related to.
- The number of successes observed in n trials follows a binomial distribution with mean $n\pi$ and variance $n\pi(1 - \pi)$
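As an illustration of the logistic model, the sketch below evaluates the success probability for given coefficients and the binomial mean and variance for n trials. The function names and the coefficient values are hypothetical.

```python
import math

def logistic_probability(betas, xs):
    """Probability of success: exp(eta) / (1 + exp(eta)),
    where eta = betas[0] + betas[1]*xs[0] + ... is the linear predictor."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))   # same value, avoids overflow
    e = math.exp(eta)
    return e / (1.0 + e)

def binomial_mean_variance(n, pi):
    """Mean and variance of the number of successes in n trials."""
    return n * pi, n * pi * (1.0 - pi)

# Hypothetical coefficients (intercept, slope) and one covariate value.
pi = logistic_probability([-1.0, 0.5], [2.0])
print(pi, binomial_mean_variance(20, pi))
```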
15Logistic Regression Example
- Chipping sparrow resource selection
- Used and unused determined based on presence/absence at point count stations
- Design I, sampling protocol D
16Resulting Model
- $w(x) = \dfrac{\exp(\eta)}{1 + \exp(\eta)}$, where the linear predictor $\eta$ combines the constant 3.215 with the terms 0.088 canopy, 0.053 seedling, 0.019 nsapling, and 0.676 grndb (the signs of the terms did not survive in this transcript)
17Log-linear Model
- Y are counts of the number of occurrences of a certain event under different conditions
- A natural assumption is that the counts follow a Poisson distribution
- $E(Y) = \mu = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$. (2.3)
- Example: number of animals observed within blocks of land, with covariates measured on those blocks
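To make the log-linear model concrete, here is a small sketch of the expected count and the Poisson log-likelihood that a fitting program would maximise. The coefficients, covariate values, and counts are invented for illustration.

```python
import math

def expected_count(betas, xs):
    """E(Y) = exp(b0 + b1*x1 + ... + bp*xp) for one block of land."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(eta)

def poisson_log_likelihood(betas, rows, counts):
    """Sum over blocks of y*log(mu) - mu - log(y!)."""
    total = 0.0
    for xs, y in zip(rows, counts):
        mu = expected_count(betas, xs)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

# Hypothetical data: one covariate per block and the animals counted there.
rows = [[0.2], [0.5], [0.9], [1.4]]
counts = [1, 2, 2, 5]
print(poisson_log_likelihood([0.1, 1.0], rows, counts))
```

Comparing this value across candidate coefficient vectors is what the maximum likelihood fit does; larger (less negative) values indicate coefficients that agree better with the counts.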
18Example
19Generalized Linear Models (McCullagh and Nelder, 1989)
- $E(Y) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$, (2.7) with the distribution of Y being suitably defined
- $f(z) = z$ and Y Normal gives ordinary linear regression
- $f(z) = \exp(z)/\{1 + \exp(z)\}$ and Y binomial gives logistic regression
- $f(z) = \exp(z)$ and Y Poisson gives the log-linear model
- $f(z) = 1 - \exp\{-\exp(z)\,g(t)\}$ and Y binomial gives the proportional hazards model
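The four choices of f can be written as small functions sharing one linear predictor, as in the sketch below. The function names are hypothetical, and g(t) is passed in as a plain number because its form is not given on the slide.

```python
import math

def identity(z):
    return z                                  # ordinary linear regression

def logistic(z):
    if z >= 0:                                # stable form of exp(z)/(1+exp(z))
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def exponential(z):
    return math.exp(z)                        # log-linear model

def proportional_hazards(z, g_t):
    return 1.0 - math.exp(-math.exp(z) * g_t)

def glm_mean(f, betas, xs, **extra):
    """E(Y) = f(b0 + b1*x1 + ... + bp*xp); extra is forwarded to f."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return f(eta, **extra)

betas, xs = [0.0, 1.0], [0.0]                 # hypothetical values, eta = 0
print(glm_mean(identity, betas, xs),
      glm_mean(logistic, betas, xs),
      glm_mean(exponential, betas, xs),
      glm_mean(proportional_hazards, betas, xs, g_t=1.0))
```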
20Statistical Software
- Fitting log-linear models and other generalized linear models requires a suitable computer program.
- Many Poisson regression programs are now available, including
- SAS's PROC GENMOD
- GLIM
- S-Plus's glm()
- SPSS
- SYSTAT
- Quattro or Excel can also be used.
21Tests Used in Modeling
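- A test of $\beta_i = 0$ can be made by comparing $\hat{\beta}_i / \mathrm{se}(\hat{\beta}_i)$ with critical values from a standard normal distribution.
- Approximate confidence intervals for $\beta_i$ are of the form $\hat{\beta}_i \pm z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_i)$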
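A short sketch of such a test, assuming the quantity compared with the standard normal is the estimate divided by its standard error and that the interval is the estimate plus or minus a normal critical value times the standard error. The estimate and standard error used here are hypothetical.

```python
from statistics import NormalDist

def wald_test(estimate, std_error, level=0.95):
    """Return (z, two-sided p-value, confidence interval) for one coefficient."""
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    return z, p_value, (estimate - crit * std_error, estimate + crit * std_error)

# Hypothetical coefficient estimate and standard error.
z, p, ci = wald_test(0.9, 0.4)
print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```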
22Deviance
- Deviance measures closeness of model to data
- Analogous to Residual Sum-of-Squares
- $D = -2[\log_e(L_M) - \log_e(L_F)]$, (2.8)
- $L_M$ = likelihood of the fitted model
- $L_F$ = likelihood of the full model
- Can be used as a general measure of fit by comparing the observed value to a chi-square distribution with df = (number of observations - number of parameters), ALTHOUGH NOT VERY ROBUST
- General rule: counts in observed cells > 5
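As a worked illustration of the deviance, the sketch below computes it for Poisson counts, taking the full model to be the one that fits every observation exactly (an assumption about what "full" means here). The counts and fitted means are made up.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood; the y*log(mu) term is 0 when y is 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        total += (yi * math.log(mi) if yi > 0 else 0.0) - mi - math.lgamma(yi + 1)
    return total

def deviance(loglik_model, loglik_full):
    """D = -2 * [log L_M - log L_F]."""
    return -2.0 * (loglik_model - loglik_full)

counts = [2, 5, 9, 14]            # hypothetical observed counts
fitted = [3.0, 5.0, 8.0, 14.0]    # hypothetical fitted means from a model
n_params = 2                      # hypothetical number of fitted parameters

d = deviance(poisson_loglik(counts, fitted), poisson_loglik(counts, counts))
df = len(counts) - n_params
print(f"deviance = {d:.3f} on {df} df")
```

The deviance is zero only when the fitted means reproduce the counts exactly, which is why it plays the role of a residual sum of squares.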
23Difference in Deviance
- Difference in deviance for nested models is approximated by a chi-square distribution with $p_2 - p_1$ degrees of freedom
- $\Delta D_{12} = -2[\log_e(L_1) - \log_e(L_2)]$, (2.8)
- Model 1 is a subset of model 2
- Model selection tool for GLIM
- Overall test of selection (no selection or null
model versus full model)
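The comparison of two nested models can be sketched as follows. The chi-square tail probability is computed with the standard closed-form series for integer degrees of freedom so that no external library is needed; the log-likelihood values and parameter counts are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total

def deviance_difference_test(loglik_1, p1, loglik_2, p2):
    """Model 1 (p1 parameters) is nested in model 2 (p2 parameters)."""
    delta_d = -2.0 * (loglik_1 - loglik_2)
    df = p2 - p1
    return delta_d, df, chi2_sf(delta_d, df)

# Hypothetical maximised log-likelihoods for a small and a larger model.
delta_d, df, p_value = deviance_difference_test(-120.0, 2, -115.0, 4)
print(f"difference in deviance = {delta_d:.1f}, df = {df}, p = {p_value:.4f}")
```

Setting model 1 to the no-selection (null) model and model 2 to the full model gives the overall test of selection.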
24Model Selection
- Art and not a science
- Most RSF analyses are based on observational data and are exploratory in nature
- Limit number of variables based on professional judgement/knowledge of issues.
- Do not limit yourself to one model unless obvious. Make sure statistical inference is understood. Replicate study when possible.
25Model Selection Criteria
- Nested models: analysis of deviance
- Stepwise
- AIC
- AICc
- BIC
26Akaike's Information Criterion
- Burnham and Anderson (1999)
- $\mathrm{AIC} = -2\log_e(L_M) + 2p$, (2.9)
- where p is the number of unknown parameters in the model that must be estimated
- Small values of AIC suggest a better model
27Other Measures
- Corrected AIC
- $\mathrm{AIC}_c = -2\log_e(L_M) + 2p\,n/(n - p - 1)$, (2.10)
- Useful when sample sizes are relatively small
- Bayesian information criterion (BIC)
- $\mathrm{BIC} = -2\log_e(L_M) + p\log_e(n)$. (2.11)
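The three criteria differ only in the penalty added to minus twice the log-likelihood, as this sketch shows. The log-likelihood, parameter count, and sample size in the example are hypothetical.

```python
import math

def aic(loglik, p):
    """AIC = -2 log L + 2p."""
    return -2.0 * loglik + 2.0 * p

def aicc(loglik, p, n):
    """AICc = -2 log L + 2p * n / (n - p - 1); requires n > p + 1."""
    return -2.0 * loglik + 2.0 * p * n / (n - p - 1)

def bic(loglik, p, n):
    """BIC = -2 log L + p * log(n)."""
    return -2.0 * loglik + p * math.log(n)

# Hypothetical fit: log-likelihood -100 with 3 parameters from 50 units.
loglik, p, n = -100.0, 3, 50
print(aic(loglik, p), aicc(loglik, p, n), bic(loglik, p, n))
```

For small n the AICc penalty is noticeably larger than the AIC penalty, and the two converge as n grows.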
28Model Selection Simulation
- Case of 4 variables: 2 proportions and 2 continuous
- Simulated the no selection case, and selection for one categorical variable (R.75/.25).
- Assumed 50 used units, and either 10000 or 1000 available units.
- Looked at which models are selected as best by AIC.
- MODEL 0 - no selection
- MODEL 1 - P1
- MODEL 2 - D1
- MODEL 3 - D2
- MODEL 4 - P1 + D1
- MODEL 5 - P1 + D2
- MODEL 6 - D1 + D2
- MODEL 7 - P1 + D1 + D2
29Simulation Using AIC
30Model Averaging
- Do not fall in love with a single model, to the exclusion of alternatives.
- Using information from multiple models to improve inference and interpretation
- Has been shown to improve prediction
- Allows for assessing importance of individual
variables
31Process
AIC WEIGHTS
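The slide content is not in the transcript, so the sketch below assumes the usual Akaike-weight calculation described by Burnham and Anderson: each model's weight is proportional to exp(-delta/2), where delta is its AIC minus the smallest AIC. A variable's importance value is then taken as the sum of the weights of the models that contain it. The model set and AIC values are hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Weights proportional to exp(-0.5 * (AIC_i - min AIC)), summing to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

def importance(variable, model_terms, weights):
    """Sum of the weights of all models that include the variable."""
    return sum(w for terms, w in zip(model_terms, weights) if variable in terms)

# Hypothetical candidate models (by the variables they contain) and AICs.
model_terms = [("P1",), ("P1", "D1"), ("D1", "D2")]
aic_values = [100.0, 102.0, 110.0]

weights = akaike_weights(aic_values)
for var in ("P1", "D1", "D2"):
    print(var, round(importance(var, model_terms, weights), 3))
```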
32Example
33Importance Values
Variable      Alder Flycatcher   Blackpoll Warbler   Savannah Sparrow
river         0.7 (-)            0.09 (-)            0
lake          0                  0                   1 (-)
band 1        0                  0.93 (-)            0.95 ( )
band 4        1 ( )              1 ( )               0.13 ( )
band 5        0                  0                   0.06 ( )
Std band 1    0.28 (-)           0                   0.94 ( )
Std band 3    0.89 ( )           0.72 ( )            0.05 (-)
Std band 4    0.61 ( )           0                   0.06 ( )
Std band 7    0.39 ( )           0.20 ( )            0.06 ( )
elevation     1 (-)              0.93 (-)            1 (-)
slope         0                  0                   0
aspect        0                  0                   0
34Multiple Testing
- Inflated experiment-wise Type I error can occur when several significance tests or several confidence intervals are conducted at once
- Example: with 10 independent tests carried out at the 5% level and the null hypothesis true, the probability of one or more significant results is $1 - 0.95^{10} = 0.40$
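The arithmetic in this example generalises to any number of independent tests, as in this brief sketch (the function name is hypothetical).

```python
def experimentwise_error(alpha, k):
    """Probability of at least one false rejection among k independent
    tests, each carried out at level alpha with every null hypothesis true."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 3, 10, 20):
    print(k, round(experimentwise_error(0.05, k), 3))
```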
35Approaches to Address Multiple Testing
- Bonferroni procedure: a conservative approach
- Test each comparison at the $100(\alpha/k)\%$ level, with k being the number of comparisons
- For example, with 3 discrete habitat types there are n = 3 comparisons; test each at the 100(0.05/3)% = 1.67% level
- With 10 comparisons, the adjusted level is 0.005
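A minimal sketch of the Bonferroni adjustment applied to a set of p-values. The p-values are invented for illustration and the function names are hypothetical.

```python
def bonferroni_level(alpha, k):
    """Per-comparison significance level for k comparisons."""
    return alpha / k

def bonferroni_reject(p_values, alpha=0.05):
    """True for each comparison whose p-value falls below alpha / k."""
    level = bonferroni_level(alpha, len(p_values))
    return [p < level for p in p_values]

# Hypothetical p-values for comparisons among 3 habitat types.
p_values = [0.010, 0.030, 0.200]
print(bonferroni_level(0.05, len(p_values)))
print(bonferroni_reject(p_values))
```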
36Holm's Method
- Decide on overall $\alpha$ level
- Calculate p-values
- Sort the p-values in ascending order
- See if $p_1 < \alpha/k$
- If no, stop; if yes, determine if $p_2 < \alpha/(k-1)$
- If no, stop; if yes, determine if $p_3 < \alpha/(k-2)$
- If no, stop; if yes, continue in the same way through the remaining p-values
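The sequential steps above can be sketched as a short function that returns, for each test in its original order, whether it is declared significant. The p-values in the example are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the smallest p-value with
    alpha/k, the next with alpha/(k-1), and so on, stopping at the
    first comparison that fails."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])   # ascending p-values
    reject = [False] * k
    for step, idx in enumerate(order):
        if p_values[idx] < alpha / (k - step):
            reject[idx] = True
        else:
            break                      # all larger p-values are retained too
    return reject

# Hypothetical p-values from four comparisons.
p_values = [0.010, 0.040, 0.020, 0.005]
print(holm_reject(p_values))
```

With these values every comparison passes its threshold, whereas testing each at 0.05/4 = 0.0125 would flag only the two smallest p-values, which illustrates why Holm's method is less conservative than Bonferroni.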
37Example ? ?
38Bootstrap Methods
- IDEA: when the only information available about a statistical population consists of a random sample from that population, the best guide as to what might happen by resampling the population is obtained by resampling the sample
- The sample is assumed to represent the population well
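A minimal sketch of the idea: draw many samples with replacement from the observed sample, recompute the statistic each time, and use the spread of the recomputed values as its standard error. The data values, number of replicates, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of statistic(sample)."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    n = len(sample)
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return stdev(replicates)

# Hypothetical measurements on 10 sampled units.
data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]
print(bootstrap_se(data, mean))
```

The same function works for statistics with no simple variance formula: pass any function of the resampled list in place of `mean`.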
39Applications
- Variance of a complicated sample statistic
- $\pi$: probability of use of a unit with certain characteristics
- Model weights in model averaging
- Importance values for variables
40Applications
- Incorporation of between-animal variability or between true experimental unit variability
- Radio-tagged animals and logistic regression
- Transects used to gather use information (walking transects)