Title: Lab 14
1. Lab 14
- Curvilinear analysis and a detailed example of an analysis with
categorical and continuous variables
2. Curvilinear Regression
- Linear regression assumes that a straight line
properly represents the relationship between each IV
and the DV.
- This is not always the case. For example, the relationship between job
satisfaction and job tenure (length of time in a
job) has been found to be curvilinear: employees
with low and high tenure have high satisfaction,
and employees with moderate tenure have the
lowest satisfaction.
3. Example of a Curvilinear Relationship: Job Satisfaction and Tenure
4. How to test this with SAS
- In polynomial regression we conduct
a sequence of tests. We start by regressing the DV
on the IV.
- Then we add IV*IV (the squared term) to the model to see whether it accounts
for a significant amount of additional variance.
- If it does, we add IV*IV*IV (the cubed term) to see whether it adds
further variance. We stop when adding a successive power
term fails to add significantly to the variance accounted for.
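The test applied at each step can be written as an incremental F test. This is the standard hierarchical-regression formula, not something shown on the slide; the symbols (k for the number of predictors, N for the sample size) are notation assumed here.

```latex
F_{\text{change}} =
  \frac{\left(R^2_{\text{full}} - R^2_{\text{reduced}}\right) / \left(k_{\text{full}} - k_{\text{reduced}}\right)}
       {\left(1 - R^2_{\text{full}}\right) / \left(N - k_{\text{full}} - 1\right)},
\qquad
df = \left(k_{\text{full}} - k_{\text{reduced}},\; N - k_{\text{full}} - 1\right)
```

When a single power term is added, this F equals the square of the t value that PROC REG reports for that term, so checking the t test on the highest-order term is equivalent.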
5. Example
- A sports physiologist is interested in the
effects of diet on the strength of athletes. He
measures strength and the amount of protein
consumed, and he wants to know the
relationship between these two variables.
- Form quadratic and cubic terms.
- Run the regressions to test for trends and
identify the best model.
- Graph the relationship between X and Y for evidence
of nonlinearity.
6. Example program
- data d1;
- input protein strength;
- * create power terms;
- protein2 = protein*protein;
- protein3 = protein2*protein;
- cards;
- (data lines go here, followed by a line containing only a semicolon)
- * regressions with linear, quadratic, and cubic models;
- * linear;
- proc reg;
- model strength = protein;
- plot strength*protein r.*p.;
- run;
- * quadratic;
- proc reg;
- model strength = protein protein2;
- plot r.*p.;
- run;
- * cubic;
- proc reg;
- model strength = protein protein2 protein3;
- plot r.*p.;
- run;
7. Output Model 1
Model: MODEL1
Dependent Variable: strength

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1            16191         16191    646.01   <.0001
Error             248       6215.86885      25.06399
Corrected Total   249            22407

Root MSE            5.00639   R-Square   0.7226
Dependent Mean    202.56800   Adj R-Sq   0.7215
Coeff Var           2.47146

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            145.33012          2.27414     63.91     <.0001
protein      1              0.81480          0.03206     25.42     <.0001
10. Model 2 Output
Model: MODEL1
Dependent Variable: strength

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2            19145    9572.45217    724.73   <.0001
Error             247       3262.43966      13.20826
Corrected Total   249            22407

Root MSE            3.63432   R-Square   0.8544
Dependent Mean    202.56800   Adj R-Sq   0.8532
Coeff Var           1.79412

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             22.06447          8.40699      2.62     0.0092
protein      1              4.42387          0.24247     18.24     <.0001
protein2     1             -0.02589          0.00173    -14.95     <.0001
12. Model 3 Output
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3            19145    6381.64432    481.20   <.0001
Error             246       3262.41104      13.26183
Corrected Total   249            22407

Root MSE            3.64168   R-Square   0.8544
Dependent Mean    202.56800   Adj R-Sq   0.8526
Coeff Var           1.79776

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             20.15763         41.90111      0.48     0.6309
protein      1              4.51006          1.87112      2.41     0.0167
protein2     1             -0.02716          0.02742     -0.99     0.3230
protein3     1           0.00000613       0.00013194      0.05     0.9630
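The step-by-step comparison of the three models can be checked by hand from the error sums of squares in the PROC REG outputs above. The sketch below is plain Python rather than SAS, and the function and variable names are made up for illustration; it only re-does arithmetic on numbers already reported.

```python
# Error sums of squares and error df copied from the three PROC REG outputs.
models = {
    "linear":    {"sse": 6215.86885, "df_error": 248},
    "quadratic": {"sse": 3262.43966, "df_error": 247},
    "cubic":     {"sse": 3262.41104, "df_error": 246},
}

def f_change(reduced, full):
    """Incremental F for the terms added when moving from `reduced` to `full`."""
    df_added = reduced["df_error"] - full["df_error"]
    ss_added = reduced["sse"] - full["sse"]
    mse_full = full["sse"] / full["df_error"]
    return (ss_added / df_added) / mse_full

f_quadratic = f_change(models["linear"], models["quadratic"])
f_cubic = f_change(models["quadratic"], models["cubic"])

print(f"Adding protein2: F(1, 247) = {f_quadratic:.2f}")
print(f"Adding protein3: F(1, 246) = {f_cubic:.4f}")
```

The square root of the first F is about 14.95, matching the size of the t value for protein2 in Model 2, while the second F is close to zero, in line with the nonsignificant protein3 term in Model 3.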
14. Conclusions
- The b-weight for the highest-order term is significant in the quadratic
model but not in the cubic model; therefore it
appears that the quadratic equation is the best
fit for these data ($Y = 22.06 + 4.42X_1 - .026X_1^2$), and
it accounts for 85% of the variance.
- Looking back at the graph (strength*protein), it
appears that the benefit of protein is large at
first and then levels off, with athletes receiving
little to no additional benefit at around the 70 mark.
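To see how the quadratic equation produces diminishing returns, the fitted curve and its slope can be evaluated at a few protein values. This is a rough sketch using the Model 2 coefficients; the function names and the protein values chosen are illustrative assumptions, not part of the lab.

```python
# Coefficients copied from the Model 2 (quadratic) parameter estimates.
b0, b1, b2 = 22.06447, 4.42387, -0.02589

def predicted_strength(protein):
    """Predicted strength from the fitted quadratic equation."""
    return b0 + b1 * protein + b2 * protein ** 2

def marginal_gain(protein):
    """Slope of the fitted curve: change in predicted strength per extra unit of protein."""
    return b1 + 2 * b2 * protein

# Protein value at which the fitted parabola stops rising.
turning_point = -b1 / (2 * b2)

for protein in (50, 60, 70, 80):
    print(protein, round(predicted_strength(protein), 1), round(marginal_gain(protein), 2))
print("turning point:", round(turning_point, 1))
```

Under these coefficients the gain per unit of protein shrinks steadily and reaches zero only near 85, so the point at which the curve "levels off" is a judgment read from the graph rather than a sharp threshold in the equation.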
15. Detailed Example
- The Events variable is a person's score on a life
event scale, indicating the number and severity
of recent life events.
- The Status variable indicates whether a person
cohabits with a partner (0 indicates that they
do not, and 1 indicates that they do).
- The Stress variable is the score on a self-report
measure of experienced stress.
16. Hypotheses
- 1. The more life events, the greater the stress.
- 2. Those who live with their partner will have
lower stress than participants who don't live
with a partner.
- 3. The relationship between events and stress is
predicted to be moderated by status.
Participants who cohabit with a partner are
predicted to be less stressed by life events than
those who do not live with a partner.
17. Evaluate Normality
- Check normality in variables.
- proc univariate normal plot;
- Check normality by Status (the data must be sorted by status first).
- proc univariate normal plot;
- by status;
18. Results of normality
- Box plots: the Stress variable looks normal, but
Events is positively skewed, with few people
having high scores. No evident outliers.
- Shapiro-Wilk supports the visual conclusions: Stress
was not significant (W = 0.981, ns) and Events
was significant (W = 0.935, p < .05), indicating
non-normality, with a small percentage of
participants reporting a large number of life
events.
- Good distribution of status: 30 in a relationship
and 30 not in a relationship.
19. Normality by Status
- Participants not in a relationship had a higher
mean on life events than those in a
relationship. Variability on the events variable was similar in the two
status groups.
- Participants not in a relationship had a higher
mean on the stress variable than those in a
relationship, providing visual support for
hypothesis 2. There were two outliers in the
relationship group, and the variance appears
smaller in the relationship group.
20. Descriptive stats
- Means, SDs, and correlations.
- proc means;
- proc corr;
21. Proc means and corr results
- Both independent variables, Status and Events,
had significant relationships with stress.
- Status had a significant negative relationship
with stress (r(58) = -.49, p < .05; 0 = doesn't
cohabit and 1 = does cohabit).
- Events had a significant positive relationship
with stress (r(58) = .41, p < .05).
- The independent variables were not significantly
correlated with one another (status and events:
r(58) = -.12, ns), which indicates that
collinearity is not a problem with these data.
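The significance decisions for these correlations can be reproduced by converting each r to a t statistic with N - 2 = 58 degrees of freedom. A minimal Python sketch follows; the critical value is taken from a t table and is an assumption added here, and the dictionary labels are illustrative.

```python
import math

def t_from_r(r, df):
    """t statistic for testing a Pearson correlation against zero (df = N - 2)."""
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Approximate two-tailed .05 critical value of t with 58 df (assumed, from a t table).
T_CRITICAL_58 = 2.002

correlations = {
    "status-stress": -0.49,
    "events-stress": 0.41,
    "status-events": -0.12,
}

for label, r in correlations.items():
    t = t_from_r(r, 58)
    verdict = "significant" if abs(t) > T_CRITICAL_58 else "ns"
    print(f"{label}: r = {r:+.2f}, t(58) = {t:.2f}, {verdict}")
```

The first two correlations clear the critical value comfortably, while the correlation between the two predictors does not, which agrees with the pattern reported on the slide.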
22. Linearity, Outliers, and Homoscedasticity
- Look at plots for heteroscedasticity and
nonlinearity.
- proc gplot;
- plot stress*event;
- proc gplot;
- plot stress*event;
- by status;
23. Graphs
- No evidence of heteroscedasticity or nonlinear
trends.
- There does appear to be a stronger relationship
between stress and events for those participants
who do not live with a partner.
24. Statistical test for Curvilinear data
- Create power terms
- event2 = event*event;
- Standardize variables
- proc standard mean=0;
- Run regression on linear and quadratic models
- proc reg;
- model stress = event;
- proc reg;
- model stress = event event2;
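One common reason for setting the mean to 0 in polynomial regression is that a variable and its square are usually highly correlated, and centering can remove most of that overlap. The sketch below illustrates the idea in Python with made-up scores (not the lab data); note that the reduction appears when the squared term is formed from the already-centered variable, which is an assumption of this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical event scores, for illustration only (not the lab data).
event = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Square of the raw scores.
raw_square = [x * x for x in event]

# Center first, then square the centered scores.
mean_event = sum(event) / len(event)
centered = [x - mean_event for x in event]
centered_square = [c * c for c in centered]

r_raw = pearson_r(event, raw_square)
r_centered = pearson_r(centered, centered_square)
print(round(r_raw, 3), round(r_centered, 3))
```

With these symmetric scores the correlation drops from above .9 to zero; with skewed data such as Events the drop would be smaller but usually still substantial.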
25. Results of curvilinear analysis
- The linear model is significant and accounts for 17%
of the variance in stress (F(1, 58) = 11.85, p <
.05).
- The quadratic model is also significant (F(2, 57) =
6.80, p < .05) and accounts for 19% of the
variance, but the b-weight for the quadratic
term is not significant (t(57) = 1.27, ns).
- Therefore, the linear model appears to be the
best fit for these data.
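The variance percentages and the test of the quadratic term can be recovered from the two overall F values alone. The Python sketch below is a consistency check on the reported numbers, not SAS output; the function name is illustrative.

```python
import math

def r_squared_from_f(f_value, df_model, df_error):
    """Recover R-squared from an overall F test: R2 = F*df1 / (F*df1 + df2)."""
    return f_value * df_model / (f_value * df_model + df_error)

r2_linear = r_squared_from_f(11.85, 1, 58)
r2_quadratic = r_squared_from_f(6.80, 2, 57)

# F for the one added term (event2), tested against the quadratic model's error.
f_change = (r2_quadratic - r2_linear) / ((1 - r2_quadratic) / 57)
t_quadratic_term = math.sqrt(f_change)  # magnitude only; the sign is not recoverable

print(f"R2 linear = {r2_linear:.3f}, R2 quadratic = {r2_quadratic:.3f}")
print(f"F change(1, 57) = {f_change:.2f}, |t| = {t_quadratic_term:.2f}")
```

The recovered values round to 17% and 19%, and the implied t for the quadratic term is about 1.27, well short of the roughly 2.0 needed for significance with 57 degrees of freedom.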
26. Data fit, outliers and homoscedasticity
- Run regression and check for outliers.
- proc reg;
- model stress = event status / stb r influence;
- plot p.*r. stress*p.;
27. Results: outliers
- The predicted-by-residuals plot showed no apparent
heteroscedasticity. The values appeared to be
randomly scattered around the zero residual line.
- The predicted-by-actual plot demonstrates a positive
relationship. No apparent outliers.
28. Results: outliers (cont.)
- Three outliers were identified with a studentized
residual greater than 2: cases 10, 29, and 54.
- Leverage > 2(k+1)/N = .10.
- Cook's D > .2
- DFBETAS > .26
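Two of these cutoffs follow directly from the sample size and the number of predictors. The sketch below recomputes them in Python; the 2/sqrt(N) rule for DFBETAS is a widely used rule of thumb that happens to reproduce the slide's .26, and treating it as the source of that value is an assumption.

```python
import math

n_cases = 60       # participants (30 per status group)
n_predictors = 2   # event and status

# Leverage cutoff from the slide: 2(k+1)/N.
leverage_cutoff = 2 * (n_predictors + 1) / n_cases

# Assumed rule of thumb behind the slide's DFBETAS cutoff: 2/sqrt(N).
dfbetas_cutoff = 2 / math.sqrt(n_cases)

print(f"leverage cutoff = {leverage_cutoff:.2f}")
print(f"DFBETAS cutoff  = {dfbetas_cutoff:.2f}")
```

Cases whose values in the INFLUENCE output exceed these cutoffs are the ones worth inspecting individually.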
29. Outlier conclusions
- There don't appear to be any large problems
with outliers. Case 29 did have some influence, so we
will try running the regression analysis without
it at the end and see whether the
significance changes.
30. Collinearity
- Analyze regression with collinearity diagnostics
included.
- proc reg;
- model stress = event status / vif tol collin;
31. Collinearity results
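The SAS output for this slide is not part of the transcript. With only two predictors, though, tolerance and VIF follow directly from the correlation between them, so approximate values can be worked out from the r = -.12 reported earlier. A minimal Python sketch under that assumption:

```python
# Correlation between the two predictors, from the Proc corr results.
r_status_events = -0.12

# With two predictors, the R-squared of one predictor on the other is r squared.
r_squared = r_status_events ** 2
tolerance = 1 - r_squared
vif = 1 / tolerance

print(f"tolerance = {tolerance:.4f}, VIF = {vif:.3f}")
```

A tolerance near 1 and a VIF near 1 are what the TOL and VIF options should show if collinearity is not a concern; the exact figures will differ slightly because the correlation used here is rounded.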
32. Analyze Regression Results
- Create interaction term
- inter = status*event;
- Run regression analysis with and without
interaction.
- proc reg;
- model stress = status event / stb;
- model stress = status event inter / stb;
- Go to flow chart on the next slide.
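33.
- $Y = a + b_1X_1 + b_2X_2 + b_3X_1X_2$, where $X_1$ is the group variable (groupvar),
$X_2$ is the continuous variable (continvar), and $X_1X_2$ is their interaction (inter)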
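Because the group variable is coded 0/1, substituting each code into the equation gives a separate regression line for each group. The derivation below is standard algebra on the model above, using the same symbols:

```latex
\begin{aligned}
Y &= a + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2 \\
X_1 = 0:\quad Y &= a + b_2 X_2 \\
X_1 = 1:\quad Y &= (a + b_1) + (b_2 + b_3) X_2
\end{aligned}
```

So $b_1$ is the difference between the two intercepts and $b_3$ is the difference between the two slopes; a significant $b_3$ means the slopes of the two groups differ.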
35. Regression results
- The overall model without the interaction was
significant (F(2, 57) = 16.51, p < .05) and
accounted for 37% of the variance.
- Both life events (β = .36, t(58) = 3.35, p < .05)
and status (β = -.45, t(58) = -4.21, p < .05) were
significant predictors of stress.
- The overall model with the interaction was also
significant (F(3, 56) = 12.98, p < .05) and
accounted for 41% of the variance.
- The interaction was significant (β = -.40, t(58) =
-2.00, p < .05), but status was no longer
significant (β = -.13, t(58) = -.67, ns).
- Therefore, the slopes of the two groups differ.
Compute separate regressions for each group.
36. Produce regressions on the same graph, correlations
by status, proc means
- proc means;
- by status;
- Run correlation by group
- proc corr;
- var stress event;
- by status;
- Overlay regressions for two groups
- symbol1 color=blue interpol=rl value=none;
- symbol2 color=black interpol=rl value=none;
- proc sort; by status;
- proc gplot;
- plot stress*event=status;
37. Conclusions
- For participants who did not live with a partner,
the correlation between stress and life events
was not significant (r(28) = .10, ns).
- For participants who did live with a partner, the
correlation between stress and life events was
significant (r(28) = .62, p < .05).
- The graph of the two regression lines illustrates
the interaction effect, with almost no slope for
those not living with a partner and a moderate
slope for those living with a partner.
- Those participants living with a partner did show
lower levels of stress (M = 18.3, SD = 5.47) than
participants who do not live with a partner (M =
24.3, SD = 6.14), but this difference was not
significant when the interaction was added to the
model.
38. Oops, one last thing: we forgot to run the model
again deleting participant 29
- Delete participant 29 and rerun the analysis.
- if _n_ = 29 then delete;
39. Conclusions after deleting
- After deleting that one case, the interaction
term is no longer significant (β = -.32, t(57) =
-1.62, ns). You would want to look at that one
value and see if it was an error.
- If you feel that the data point is a true score,
you should probably report results before and
after.
- A big limitation of this example is the low
sample size.
- If the sample size were larger, the interaction
would probably be significant. There seemed to
be a large effect. Even after the outlier was
deleted, the correlations for the two groups were
.62 and .19.
- Might try testing the significance of the difference
between the two correlations, even though this
test generally has less power.
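One way to run the test suggested in the last bullet is Fisher's r-to-z transformation for two independent correlations. The Python sketch below is an illustration, not the lab's own procedure; it assumes group sizes of 30 and 30 before the deletion and 30 and 29 after it, and the function names are made up here.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

def two_tailed_p(z):
    """Two-tailed p value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Before deleting case 29: r = .62 vs r = .10, assumed n = 30 per group.
z_before = fisher_z_difference(0.62, 30, 0.10, 30)

# After deleting case 29: r = .62 vs r = .19, assumed n = 30 and 29.
z_after = fisher_z_difference(0.62, 30, 0.19, 29)

for label, z in (("before", z_before), ("after", z_after)):
    print(f"{label}: z = {z:.2f}, p = {two_tailed_p(z):.3f}")
```

Under these assumptions the difference exceeds the 1.96 cutoff before the deletion and falls just short of it afterwards, which mirrors what happened to the interaction term in the regression.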