Title: Assumptions of linear statistical models
1 Outline
Transformations in Statistical Analysis
- Assumptions of linear statistical models.
  - Effect additivity
  - Normality
  - Homoscedasticity
  - Independence
- Types of Transformations
- Alternatives to Transformations
Model Assumptions
2 Order of Importance
Experimental Analysis Models (ANOVA)
- Homoscedasticity
- Normality
- Additivity
- Independence
Observational Analysis Models (Regression)
- Additivity
- Homoscedasticity
- Normality
- Independence
All four are so interrelated that which is most important may be immaterial!
3 Independence
When is this important?
- Measurements over time on the same individual.
  - Time series data (rainfall, temperature, etc.).
  - Repeated measures - split plots in time.
  - Growth curves.
- Measurements near each other in space.
  - Split plot designs.
  - Spatial data.
How do I know it's a problem?
By design - how the data were collected. Temporal/spatial autocorrelation analysis.
Rectifying a dependence problem.
Modify the type of model to be fitted to the data.
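A temporal autocorrelation check on the residuals can be sketched as follows. This is only an illustration on simulated data: the AR(1) error model, the coefficient 0.7 and all variable names are assumptions, not part of the original example.

```r
# Simulated (hypothetical) data: a linear trend with AR(1) errors,
# mimicking repeated measurements over time with correlated errors.
set.seed(1)
n <- 200
time <- seq_len(n)
err <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 2 + 0.05 * time + err

fit <- lm(y ~ time)

# Sample autocorrelation function of the residuals (element 1 is lag 0).
r <- acf(resid(fit), lag.max = 10, plot = FALSE)
lag1 <- r$acf[2]

# Rough 95% band for autocorrelations of independent residuals.
band <- 1.96 / sqrt(n)
cat("lag-1 autocorrelation:", round(lag1, 3),
    " approx. band: +/-", round(band, 3), "\n")
```

A lag-1 autocorrelation well outside the band suggests the independence assumption is doubtful and that a model with a correlation structure may be needed.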
4 Homoscedasticity
How do I know I have a problem?
Plot predicted (fitted) values versus residuals. What is the pattern of the spread in the residuals as the predicted values increase?
[Figure: three residual plots]
- Spread constant: acceptable.
- Spread increases: problem.
- Spread decreases then increases: problem.
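The residual-versus-fitted check can be sketched in R as below. The data are simulated so that the error spread grows with x (an assumption made purely for illustration), and the numeric comparison of the two halves is only a rough companion to the plot, not a formal test.

```r
# Simulated (hypothetical) data whose error SD increases with x.
set.seed(2)
x <- runif(150, 1, 10)
y <- 3 + 2 * x + rnorm(150, sd = 0.5 * x)

fit <- lm(y ~ x)

# Residuals against predicted (fitted) values; drawn only in an interactive session.
if (interactive()) {
  plot(fitted(fit), resid(fit),
       xlab = "Predicted (fitted) values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}

# Crude numeric summary: residual SD below and above the median fitted value.
upper <- fitted(fit) > median(fitted(fit))
sd_low  <- sd(resid(fit)[!upper])
sd_high <- sd(resid(fit)[upper])
cat("residual SD, lower half:", round(sd_low, 2),
    " upper half:", round(sd_high, 2), "\n")
```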
5 Lack of Homogeneity in Regression
What to do?
- Attempt a transformation.
- Weighted regression.
- Incorporate additional covariates.
- Non-linear regression.
What to do if the spread of the residuals plotted versus X looks like this? [Figure: residual plot]
Need another x variable.
Or this? [Figure: residual plot]
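Of the remedies listed, weighted regression is the easiest to sketch. The example below assumes, purely for illustration, that the error variance is proportional to x squared, so the weights are 1/x^2. In practice the weights have to come from knowledge of the data or from an estimated variance function.

```r
# Simulated (hypothetical) data with error SD proportional to x.
set.seed(3)
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.4 * x)

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)  # assumed variance function: Var(error) ~ x^2

# Compare residual spread for small and large x.
big <- x > median(x)
ratio_ols <- sd(resid(ols)[big]) / sd(resid(ols)[!big])

# Weighted residuals (residual * sqrt(weight)) should have roughly constant spread.
rw <- weighted.residuals(wls)
ratio_wls <- sd(rw[big]) / sd(rw[!big])

cat("spread ratio, OLS residuals:", round(ratio_ols, 2),
    " WLS weighted residuals:", round(ratio_wls, 2), "\n")
```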
6 Transforming the Response to Achieve Linearity
If a scatterplot of y versus x curves upward, proceed down the scale of power transformations (for example from y to sqrt(y) to log(y)) to choose a transformation.
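Moving down the scale can be tried out numerically. The sketch below uses simulated data that curve upward; the candidate transformations and the use of R-squared as a rough linearity score are assumptions for illustration, and a scatterplot of each transformed response remains the better guide.

```r
# Simulated (hypothetical) data: y curves upward in x.
set.seed(4)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.5 * x + rnorm(100, sd = 0.1))

# Candidate transformations of the response, moving down the scale.
ladder <- list("y"       = identity,
               "sqrt(y)" = sqrt,
               "log(y)"  = log,
               "-1/y"    = function(v) -1 / v)

# R-squared of a straight-line fit for each transformed response.
r2 <- sapply(ladder, function(f) summary(lm(f(y) ~ x))$r.squared)
print(round(r2, 3))
best <- names(which.max(r2))
cat("most linear on this scale:", best, "\n")
```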
8 Handling Heterogeneity
[Flowchart]
- Regression?
  - yes: Fit linear model, then plot residuals.
  - no (ANOVA): Group means.
- Test for Homoscedasticity
  - accept: OK.
  - reject: choose the Type of Transformation (Box/Cox Family, Power Family, or Traditional), then Transform Observations.
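For the ANOVA branch, the test-then-transform loop might look like the sketch below. Bartlett's test is used as one possible homoscedasticity test (it is sensitive to non-normality), and the Box-Cox profile log-likelihood is coded by hand to keep the example self-contained. The simulated lognormal groups and the helper name boxcox_loglik are assumptions for illustration.

```r
# Simulated (hypothetical) one-way layout: spread grows with the group mean.
set.seed(5)
group <- factor(rep(c("A", "B", "C"), each = 30))
mu <- unname(c(A = 1, B = 2, C = 3)[as.character(group)])
y <- exp(mu + rnorm(90, sd = 0.4))

# Test for homoscedasticity before and after a log transformation.
p_raw <- bartlett.test(y ~ group)$p.value
p_log <- bartlett.test(log(y) ~ group)$p.value
cat("Bartlett p-value, raw:", signif(p_raw, 3), " log:", signif(p_log, 3), "\n")

# Box-Cox profile log-likelihood for a given power lambda (hypothetical helper).
boxcox_loglik <- function(lambda, y, group) {
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(z ~ group))^2)
  n <- length(y)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-1, 1.5, by = 0.05)
ll <- sapply(lambdas, boxcox_loglik, y = y, group = group)
lambda_hat <- lambdas[which.max(ll)]
cat("Box-Cox lambda maximizing the likelihood:", lambda_hat, "\n")
```

A lambda near 0 points to the log transformation, near 0.5 to the square root, and near 1 to no transformation.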
9 Transformations to Achieve Normality
[Flowchart]
- Regression?
  - yes: Fit linear model.
  - no (ANOVA): Estimate group means.
- Residuals Normal? (Q-Q plot, Formal Tests)
  - yes: OK.
  - no: Transform, or use a Different Model.
10 Transformations to Achieve Normality
How can we determine if observations are normally distributed?
- Graphical examination
  - Normal quantile-quantile plot (QQ-plot).
  - Histogram or boxplot.
- Goodness of fit tests
  - Kolmogorov-Smirnov test.
  - Shapiro-Wilk test.
  - D'Agostino's test.
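The graphical and formal checks can be sketched in base R as follows, applied to the residuals of a fitted model. The data are simulated with skewed (exponential) errors as an assumption for illustration. D'Agostino's test is not in base R and is left out here.

```r
# Simulated (hypothetical) regression with skewed errors.
set.seed(6)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rexp(100)

fit <- lm(y ~ x)
r <- resid(fit)

# Graphical examination; drawn only in an interactive session.
if (interactive()) {
  qqnorm(r); qqline(r)
  hist(r, main = "Residuals")
}

# Shapiro-Wilk test on the residuals.
sw <- shapiro.test(r)

# Kolmogorov-Smirnov test against a standard normal after standardizing.
# Note: estimating mean and SD from the same data makes this p-value conservative.
ks <- ks.test(as.numeric(scale(r)), "pnorm")

cat("Shapiro-Wilk p:", signif(sw$p.value, 3),
    " Kolmogorov-Smirnov p:", signif(ks$p.value, 3), "\n")
```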
11 Non-normal! So what?
Only very skewed distributions will have a marked effect on the significance level of the F-test for the overall model or for model effects.
Often the same transformations that are used to achieve homoscedasticity will produce more normal-looking observations (residuals).
Transformations to Achieve Model Simplicity
GOAL: To provide as simple a mathematical form as possible for the relationship among response and explanatory variables.
May require transforming both response and explanatory variables.
12 Alternative Models
In order of complexity, from low to high:
- Regular Least Squares
- Weighted Least Squares
- Non-Parametric Methods
- Generalized Linear Models
- Non-Linear Regression
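As one illustration of an alternative to transforming, the sketch below fits a generalized linear model to simulated count data and, for comparison, a least squares fit to a log-transformed response. The Poisson family, the log link and the log(count + 1) transformation are assumptions chosen for the example, not recommendations from the slides.

```r
# Simulated (hypothetical) count data with a log-linear mean.
set.seed(7)
x <- runif(200, 0, 10)
counts <- rpois(200, lambda = exp(0.5 + 0.2 * x))

# Transformation route: least squares on log(count + 1).
ls_fit <- lm(log(counts + 1) ~ x)

# Alternative model: Poisson GLM with log link, no transformation of the data.
glm_fit <- glm(counts ~ x, family = poisson(link = "log"))

slope_ls  <- unname(coef(ls_fit)["x"])
slope_glm <- unname(coef(glm_fit)["x"])
cat("slope, transformed LS:", round(slope_ls, 3),
    " slope, Poisson GLM:", round(slope_glm, 3), "\n")
```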
13 Example: Predicting brain weight from body weight in mammals via SLR
Data are average brain (Y, g) and body (X, kg) weights for 62 species of mammals (2 omitted).
Source: Allison & Cicchetti (1976), Science.

Species (common name)   body weight   brain weight
Arctic fox                    3.385         44.500
Owl monkey                    0.480         15.499
Horse                       521.000        655.000
Kangaroo                     35.000         56.000
Human                        62.000       1320.000
African elephant           6654.000       5712.000
Asian elephant             2547.000       4603.000
Chimpanzee                   52.160        440.000
Tree shrew                    0.104          2.500   (Omit)
Red fox                       4.235         50.400   (Omit)
14 Scatterplot of data is non-informative. Most species have small weights compared to the elephants.
Viewing only those mammals with body weight below 300 kg suggests transforming to a log scale to linearize the relationship.
15 On the log scale, the scatterplot looks linear. Fitted regression equation is
log(brain) = 2.111 + 0.755 log(body)
Body weight is a very significant predictor of brain weight (p-value < 0.0001). Also, R^2 = 0.922.
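The log-log fit can be sketched in R as follows. Only the ten species shown in the data listing are used here, so the coefficients and R-squared will not reproduce the values reported for the full data set. The data frame name mammals10 is an assumption for this illustration.

```r
# The ten species shown in the data listing (body in kg, brain in g).
mammals10 <- data.frame(
  species = c("Arctic fox", "Owl monkey", "Horse", "Kangaroo", "Human",
              "African elephant", "Asian elephant", "Chimpanzee",
              "Tree shrew", "Red fox"),
  body  = c(3.385, 0.480, 521, 35, 62, 6654, 2547, 52.160, 0.104, 4.235),
  brain = c(44.5, 15.499, 655, 56, 1320, 5712, 4603, 440, 2.5, 50.4)
)

# Simple linear regression on the (natural) log scale.
fit <- lm(log(brain) ~ log(body), data = mammals10)

slope <- unname(coef(fit)[2])
r2 <- summary(fit)$r.squared
print(coef(fit))
cat("R-squared:", round(r2, 3), "\n")
```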
16 Residual plot shows no obvious violations of the zero mean and constant variance assumptions. QQ-plot demonstrates that the normality assumption for the residuals is plausible.
[Figure: residual plot and QQ-plot; labelled points: human, opossum]
17 Checking for influential observations (R)
> fm <- lm(log(y) ~ log(x))
> influence.measures(fm)
Influence measures of
  lm(formula = log(y) ~ log(x)) :
     dfb.1.   dfb.lg..     dffit  cov.r    cook.d     hat  inf
1   0.13501  -8.18e-03   0.14452  1.009  1.04e-02  0.0167
2   0.27274  -1.56e-01   0.27714  0.956  3.71e-02  0.0245  (Owl Monk.)
3  -0.04860   1.62e-02  -0.04876  1.051  1.21e-03  0.0187
14 -0.02853   3.42e-02  -0.03775  1.142  7.25e-04  0.0937  (Shrew)
19  0.00538   1.69e-01   0.18810  1.121  1.79e-02  0.0881  (Asian El.)
32  0.22151   3.51e-01   0.53207  0.788  1.24e-01  0.0295  (Human)
33  0.00130  -5.11e-02  -0.05538  1.164  1.56e-03  0.1110  (African El.)
34 -0.31147   1.54e-02  -0.33480  0.846  5.11e-02  0.0167  (Opossum)
35  0.27033   5.36e-02   0.32472  0.861  4.85e-02  0.0171  (Rhesus Monk.)
40 -0.00740   8.39e-03  -0.00945  1.124  4.55e-05  0.0786  (Brown Bat)
60 -0.00799   2.27e-03  -0.00806  1.054  3.31e-05  0.0181
In MTB: Stat > Regression > Regression > Storage
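A self-contained version of the influence check is sketched below on the ten species shown in the data listing, so the numbers will differ from the output for the full data set. The object names are assumptions for illustration.

```r
# The ten species shown in the data listing (body in kg, brain in g).
mammals10 <- data.frame(
  species = c("Arctic fox", "Owl monkey", "Horse", "Kangaroo", "Human",
              "African elephant", "Asian elephant", "Chimpanzee",
              "Tree shrew", "Red fox"),
  body  = c(3.385, 0.480, 521, 35, 62, 6654, 2547, 52.160, 0.104, 4.235),
  brain = c(44.5, 15.499, 655, 56, 1320, 5712, 4603, 440, 2.5, 50.4)
)
fit <- lm(log(brain) ~ log(body), data = mammals10)

# DFBETAS, DFFITS, covariance ratio, Cook's distance and hat values in one table.
im <- influence.measures(fit)
infl <- data.frame(species = mammals10$species, round(im$infmat, 3))
print(infl)

# The individual measures are also available separately.
cook <- cooks.distance(fit)
hat  <- hatvalues(fit)
cat("largest Cook's distance:", mammals10$species[which.max(cook)], "\n")
```

The hat values sum to the number of regression coefficients (here 2), which is a quick sanity check on the leverage column.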
18 Decision: Leave out man (he doesn't really fit in with the rest of the mammals) and re-run the analysis.

Feature          Full Model   Omit Human
Intercept             2.111        2.090
Slope                 0.755        0.745
SE(slope)             0.029        0.027
R^2                   0.922        0.929
Slope p-value      < 0.0001     < 0.0001

Even though results don't change much, we will go with this last model.
19 Predicting the brain weights of the omitted mammals (R)
> xh <- x[-32]
> yh <- y[-32]
> fmh <- lm(log(yh) ~ log(xh))
> new <- data.frame(xh = c(.104, 4.235))
> predict(fmh, newdata = new, interval = "prediction")
        fit        lwr      upr
1 0.4038624 -0.9269029 1.734628
2 3.1660753  1.8499283 4.482222
> exp(predict(fmh, newdata = new, interval = "prediction"))
        fit       lwr       upr
1  1.497598 0.3957776  5.666817
2 23.714231 6.3593633 88.430985
Exponentiate final results!

Mammal       Predicted Brain Wt   Prediction Interval   Actual Brain Wt
Tree Shrew                1.498   (0.396, 5.667)                  2.500
Red Fox                  23.714   (6.359, 88.431)                50.400

This illustrates the idea of cross-validation in regression. It is often recommended that the data be split into two (equal?) portions: use one for model fitting, the other for model checking/verification.
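The hold-out idea can be sketched end to end as follows, again using only the ten species shown in the data listing: the model is fitted without Human, Tree shrew and Red fox, then used to predict the two omitted species. With so few training rows the intervals will be much wider than those reported for the full data set. The object names are assumptions for illustration.

```r
# The ten species shown in the data listing (body in kg, brain in g).
mammals10 <- data.frame(
  species = c("Arctic fox", "Owl monkey", "Horse", "Kangaroo", "Human",
              "African elephant", "Asian elephant", "Chimpanzee",
              "Tree shrew", "Red fox"),
  body  = c(3.385, 0.480, 521, 35, 62, 6654, 2547, 52.160, 0.104, 4.235),
  brain = c(44.5, 15.499, 655, 56, 1320, 5712, 4603, 440, 2.5, 50.4)
)

# Fitting portion: drop Human and the two species held out for checking.
held_out <- c("Tree shrew", "Red fox")
train <- subset(mammals10, !(species %in% c("Human", held_out)))
fit <- lm(log(brain) ~ log(body), data = train)

# Checking portion: predict on the log scale, then exponentiate.
test_set <- subset(mammals10, species %in% held_out)
pred <- exp(predict(fit, newdata = test_set, interval = "prediction"))

result <- data.frame(species = test_set$species,
                     round(pred, 3),
                     actual = test_set$brain)
print(result)
```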
20 Predicting the brain weights of the omitted mammals (MTB)
Influence measures can be selected here.
21 MTB output (with man)
The regression equation is
lbrain = 2.11 + 0.755 lbody

Predictor     Coef  SE Coef      T      P
Constant   2.11091  0.09860  21.41  0.000
lbody      0.75467  0.02889  26.12  0.000

S = 0.696924   R-Sq = 92.2%   R-Sq(adj) = 92.0%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       1  331.35  331.35  682.21  0.000
Residual Error  58   28.17    0.49
Total           59  359.52

Unusual Observations
Obs  lbody  lbrain     Fit  SE Fit  Residual  St Resid
 32   4.13  7.1854  5.2255  0.1197    1.9599    2.85R
 33   8.80  8.6503  8.7542  0.2322   -0.1039   -0.16 X
 34   1.25  1.3610  3.0563  0.0901   -1.6954   -2.45R
 35   1.92  5.1874  3.5575  0.0912    1.6298    2.36R
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Predicted Values for New Observations
New Obs     Fit  SE Fit            95% CI              95% PI
      1  0.4028  0.1388  (0.1249, 0.6807)  (-1.0196, 1.8253)
      2  3.2002  0.0900  (3.0201, 3.3803)  ( 1.7936, 4.6068)

Only available influence measures are: standard/student residuals, hat matrix, Cook's distance, and dffits.