Title: Regression with categorical independent variables
1. Regression with categorical independent variables
2. Types of variables
- Categorical
  - Dichotomous
    - Male/Female
    - Pre-regulation/Post-regulation
    - Island/Mainland
  - Nominal (nom = named)
    - Continent
    - Political party
    - Soil type
  - Ordinal (ord = ordered)
    - Survey response: strongly disagree, disagree, neutral, agree, strongly agree
    - Size classifications: small, medium, large
    - Income ranges
- Numeric
  - Continuous
    - Observations can take on, in principle, any real number
    - Infinite number of possible values between 1 and 10
  - Discrete
    - Observations can take on, in principle, any integer
    - 10 possible values between 1 and 10
3. Dummy variables
- How can we handle categorical explanatory (independent) variables in a regression?
- Answer: make a dummy!
  - Dichotomous: zero or one
  - Categorical with q categories: q-1 variables, each scored zero or one (examples to follow)
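The q-1 rule can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code`, the choice of baseline, and the toy data are assumptions made here, not part of the lecture's dataset.

```python
def dummy_code(values, baseline):
    """Return (column_names, rows) using q-1 zero/one columns.

    The baseline level is coded all zeros; every other level gets
    a 1 in its own column and 0 elsewhere.
    """
    levels = sorted(set(values))
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} not found in data")
    columns = [lvl for lvl in levels if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in columns] for v in values]
    return columns, rows


# Dichotomous: q = 2 categories, so a single 0/1 column.
island_cols, island_rows = dummy_code(
    ["Mainland", "Island", "Island", "Mainland"], baseline="Mainland"
)

# Nominal with q = 3 categories, so q - 1 = 2 columns (toy soil types).
soil_cols, soil_rows = dummy_code(
    ["clay", "sand", "loam", "clay"], baseline="clay"
)

print(island_cols, island_rows)
print(soil_cols, soil_rows)
```

The baseline level never gets its own column; its effect is absorbed into the regression intercept.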
4. Alien Species
- Exotic species cause economic and ecological damage
- Not all countries are equally invaded
- Want to understand the characteristics of a country that make it more likely to be invaded.
- We'll measure invasiveness as the fraction of species that are alien
- Two hypotheses
  - Human population density plays a role in a country's invasiveness.
  - Island nations are more invaded than mainland nations.
6. A Simple Model
- ISL is a dummy variable, coded 0 if mainland, 1 if island
- The dummy changes the intercept: the island line is shifted up or down relative to the mainland line, with the same slope.
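Why the dummy changes only the intercept can be seen by writing the model out for each group. The symbols b_0, b_1, b_2 and the error term e are notation assumed here for illustration.

```latex
\begin{align*}
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + b_2\,\text{ISL} + e \\[4pt]
\text{Mainland } (\text{ISL}=0):\quad
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + e \\
\text{Island } (\text{ISL}=1):\quad
\text{Prop\_exotic} &= (b_0 + b_2) + b_1\,\text{Pop\_dens} + e
\end{align*}
```

Both groups share the slope b_1; the island line is the mainland line shifted vertically by b_2.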
7. Call:
lm(formula = Prop_exotic ~ Pop_dens + Island, data = ExoticSpecies)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      7.944e-02  2.708e-02   2.934  0.00746
Pop_dens         3.002e-04  9.721e-05   3.088  0.00519
Island[T.Island] 9.215e-02  4.945e-02   1.864  0.07517 .

Residual standard error: 0.11 on 23 degrees of freedom
Multiple R-Squared: 0.4761, Adjusted R-squared: 0.4306
F-statistic: 10.45 on 2 and 23 DF, p-value: 0.0005902
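As a rough check on how to read these estimates, the fitted line can be evaluated for a mainland and an island country. The coefficients are copied from the output above; the function name and the density value of 100 are hypothetical, chosen only for illustration.

```python
# Estimates copied from the regression output above.
INTERCEPT = 7.944e-02
B_POP_DENS = 3.002e-04
B_ISLAND = 9.215e-02


def predict_prop_exotic(pop_dens, island):
    """Fitted proportion of exotic species; island is 0 (mainland) or 1 (island)."""
    return INTERCEPT + B_POP_DENS * pop_dens + B_ISLAND * island


# At zero density the prediction is just the group's intercept.
mainland_intercept = predict_prop_exotic(0, island=0)
island_intercept = predict_prop_exotic(0, island=1)

# Hypothetical density of 100 (in whatever units Pop_dens is recorded):
# the island/mainland gap is the same at every density in this model.
gap_at_100 = predict_prop_exotic(100, island=1) - predict_prop_exotic(100, island=0)

print(mainland_intercept, island_intercept, gap_at_100)
```

The gap equals the Island coefficient regardless of density, which is what "the dummy changes the intercept" means in practice.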
8. Some Rcmdr details
- In the original dataset, the Island variable was coded 0 for mainland countries and 1 for island countries
- Rcmdr sees this as a numeric variable
- To make the dotplot I had to convert it to a factor variable (what R calls categorical variables)
  - Data -> Manage variables -> Convert numeric variables to factors
9. Rcmdr continued
- To include a factor variable in a regression, use Linear model instead of Linear regression
- R converts the factor back to a 0/1 variable on the fly
- In the output, T.Island tells you which level of the factor is being set to 1 (True)
- If it guesses wrong, use Data -> Manage variables -> Reorder factor levels to switch them
- In the original dataset, with the 0s and 1s, simply use Linear regression
10. Call:
lm(formula = Prop_exotic ~ Island + Pop_dens, data = ExoticSpecies2)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.944e-02  2.708e-02   2.934  0.00746
Island      9.215e-02  4.945e-02   1.864  0.07517 .
Pop_dens    3.002e-04  9.721e-05   3.088  0.00519

Residual standard error: 0.11 on 23 degrees of freedom
Multiple R-Squared: 0.4761, Adjusted R-squared: 0.4306
F-statistic: 10.45 on 2 and 23 DF, p-value: 0.0005902
11. What if we want the slopes to differ between islands and mainland countries?
- A model with interactions
12.
- * is shorthand for main effects and interactions
- Could also use Prop_exotic ~ Pop_dens + Island + Pop_dens:Island
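Written out, an interaction model lets both the intercept and the slope differ by group. The coefficient symbols b_0 through b_3 and the dummy name ISL (0 = mainland, 1 = island) are notation assumed here for illustration.

```latex
\begin{align*}
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + b_2\,\text{ISL}
  + b_3\,(\text{Pop\_dens}\times\text{ISL}) + e \\[4pt]
\text{Mainland } (\text{ISL}=0):\quad
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + e \\
\text{Island } (\text{ISL}=1):\quad
\text{Prop\_exotic} &= (b_0 + b_2) + (b_1 + b_3)\,\text{Pop\_dens} + e
\end{align*}
```

Here b_2 is the difference in intercepts and b_3 is the difference in slopes between island and mainland countries.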
13. Call:
lm(formula = Prop_exotic ~ Pop_dens * Island, data = ExoticSpecies)

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                0.1038003  0.0304446   3.409  0.00251
Pop_dens                  -0.0002163  0.0003406  -0.635  0.53184
Island[T.Island]           0.0571083  0.0528125   1.081  0.29126
Pop_dens:Island[T.Island]  0.0005593  0.0003544   1.578  0.12880

Residual standard error: 0.1066 on 22 degrees of freedom
Multiple R-Squared: 0.5294, Adjusted R-squared: 0.4652
F-statistic: 8.25 on 3 and 22 DF, p-value: 0.0007308
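To see what the interaction estimates imply, the fitted intercept and slope for each group can be assembled from the coefficients in the output above. The helper name `line_for` is hypothetical; only the four numbers come from the output.

```python
# Estimates copied from the interaction-model output above.
B0 = 0.1038003      # (Intercept)
B_POP = -0.0002163  # Pop_dens
B_ISL = 0.0571083   # Island dummy
B_INT = 0.0005593   # Pop_dens x Island interaction


def line_for(island):
    """Return (intercept, slope) of the fitted line; island is 0 or 1."""
    return B0 + B_ISL * island, B_POP + B_INT * island


mainland_intercept, mainland_slope = line_for(0)
island_intercept, island_slope = line_for(1)

print("mainland:", mainland_intercept, mainland_slope)
print("island:  ", island_intercept, island_slope)
```

The point estimates give a slightly negative slope for mainland countries and a positive one for islands, but neither the interaction term nor the mainland slope is statistically distinguishable from zero in this output.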
14. What about a polytomous variable (e.g., Continent)?
15. Interpreting the coefficients
- b1 is the effect of being in Africa, relative to being in South America
- b2 is the effect of being in Europe, relative to being in South America
- Intercept for South America is b0 (the model intercept)
- Intercept for Africa is b0 + b1
- Intercept for Europe is b0 + b2
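One coding consistent with this interpretation uses two zero/one dummies with South America as the baseline. The names D_1 and D_2, the intercept symbol b_0, and the placeholder for other terms are assumptions made here for illustration.

```latex
\begin{align*}
D_1 &= \begin{cases} 1 & \text{Africa} \\ 0 & \text{otherwise} \end{cases}
\qquad
D_2 = \begin{cases} 1 & \text{Europe} \\ 0 & \text{otherwise} \end{cases} \\[4pt]
\text{Prop\_exotic} &= b_0 + b_1 D_1 + b_2 D_2 + (\text{other terms}) + e
\end{align*}
```

South America has D_1 = D_2 = 0, so it acts as the reference category against which b_1 and b_2 are measured.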
16. Another coding (used by JMP)
17. Interpreting the coefficients (JMP coding)
- b1 is the effect of being in Africa, relative to the average continent
- b2 is the effect of being in Europe, relative to the average continent
- Intercept for South America is b0 - b1 - b2
- Intercept for Africa is b0 + b1
- Intercept for Europe is b0 + b2
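A small sketch of this kind of sum-to-zero (effect) coding, assuming the last level is the one coded -1 in every column. The function name `effect_code` and the coefficient values are hypothetical, chosen only to show that the intercept becomes the average across continents.

```python
def effect_code(value, levels):
    """Effect (sum-to-zero) coding: the last level is -1 in every column."""
    if value not in levels:
        raise ValueError(f"unknown level {value!r}")
    contrast_levels = levels[:-1]
    if value == levels[-1]:
        return [-1] * len(contrast_levels)
    return [1 if value == lvl else 0 for lvl in contrast_levels]


LEVELS = ["Africa", "Europe", "South America"]
codes = {lvl: effect_code(lvl, LEVELS) for lvl in LEVELS}

# Hypothetical coefficients, for illustration only.
b0, b1, b2 = 0.20, 0.05, -0.02

# Intercept for each continent: b0 plus the coded effects.
intercepts = {lvl: b0 + b1 * d1 + b2 * d2 for lvl, (d1, d2) in codes.items()}
mean_intercept = sum(intercepts.values()) / len(intercepts)

print(codes)
print(intercepts, mean_intercept)
```

Because the coded columns sum to zero across levels, the unweighted mean of the three intercepts equals b0, which is why each coefficient reads as a deviation from the average continent.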
18.
- Continent is already a categorical variable in the dataset
19. Call:
lm(formula = Prop_exotic ~ Pop_dens + Continent, data = ExoticSpecies)

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                1.025e-02  3.729e-02   0.275 0.786265
Pop_dens                   3.637e-04  7.838e-05   4.640 0.000158
Continent[T.Europe]        1.424e-01  5.821e-02   2.447 0.023752
Continent[T.North America] 1.033e-01  4.910e-02   2.104 0.048270
Continent[T.Oceania]       2.600e-01  6.377e-02   4.078 0.000587
Continent[T.South America] 3.880e-02  5.839e-02   0.665 0.513920

Residual standard error: 0.09015 on 20 degrees of freedom
Multiple R-Squared: 0.6942, Adjusted R-squared: 0.6177
F-statistic: 9.08 on 5 and 20 DF, p-value: 0.0001245
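Each Continent coefficient in this output is a shift relative to the baseline continent, that is, the level that does not appear in the coefficient list. A short sketch of how the per-continent intercepts follow, using only the estimates above (the variable names are hypothetical):

```python
# Estimates copied from the output above.
INTERCEPT = 1.025e-02  # intercept for the baseline continent
CONTINENT_EFFECTS = {
    "Europe": 1.424e-01,
    "North America": 1.033e-01,
    "Oceania": 2.600e-01,
    "South America": 3.880e-02,
}

# Intercept for each listed continent = baseline intercept + its coefficient.
continent_intercepts = {
    name: INTERCEPT + effect for name, effect in CONTINENT_EFFECTS.items()
}

# Comparing two non-baseline continents: take the difference of coefficients.
oceania_vs_europe = CONTINENT_EFFECTS["Oceania"] - CONTINENT_EFFECTS["Europe"]

print(continent_intercepts)
print(oceania_vs_europe)
```

The t tests printed in the output only compare each continent with the baseline; a comparison such as Oceania versus Europe is not tested there, although its point estimate is just the difference shown.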
20. Testing for significance of a categorical variable
- Even though there are q-1 dummy variables, there is really only one real variable
- Need to compare model with continent to model without continent
- Most advanced stats packages
  - Recognize nominal variables (you don't have to do the coding yourself)
  - Automatically provide P value for entire variable
- Extra step in Rcmdr: Models -> Hypothesis Tests -> Anova Table
21. Anova Table (Type II tests)

Response: Prop_exotic
           Sum Sq Df F value    Pr(>F)
Pop_dens  0.17499  1 21.5302 0.0001579
Continent 0.15796  4  4.8588 0.0066743
Residuals 0.16255 20
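22. In Excel, have to do Incremental F test yourself
- Run regression on models with (model 1) and without (model 0) nominal variable
- Calculate F0 = [(RSS0 - RSS1) / (q-1)] / [RSS1 / (n-k-1)], where RSS0 and RSS1 are the residual sums of squares of models 0 and 1, n is the number of observations, and k is the number of regressors in model 1
- If null hypothesis (that categorical variable has no effect) is true, then F0 should follow an F-distribution with q-1 and n-k-1 d.f.
- Look up critical value in a table, or calculate P in Excel using FDIST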
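The incremental F statistic is the drop in residual sum of squares per added dummy, divided by the residual mean square of the larger model. The sketch below applies that to the numbers in the Anova table above. It assumes the Type II sum of squares for Continent equals RSS0 - RSS1, which holds here because Pop_dens is the only other term; the function name is hypothetical.

```python
def incremental_f(rss0, rss1, df_num, df_den):
    """F0 = ((RSS0 - RSS1) / df_num) / (RSS1 / df_den)."""
    if rss1 <= 0 or df_num <= 0 or df_den <= 0:
        raise ValueError("rss1 and both degrees of freedom must be positive")
    return ((rss0 - rss1) / df_num) / (rss1 / df_den)


# Values read from the Anova table above.
RSS1 = 0.16255           # residual SS of the model with Continent
SS_CONTINENT = 0.15796   # assumed equal to RSS0 - RSS1
RSS0 = RSS1 + SS_CONTINENT  # implied residual SS of the model without Continent

Q_MINUS_1 = 4        # five continents, so four dummies
N_MINUS_K_MINUS_1 = 20  # residual degrees of freedom of model 1

f0 = incremental_f(RSS0, RSS1, df_num=Q_MINUS_1, df_den=N_MINUS_K_MINUS_1)
print(round(f0, 4))
```

The result agrees with the F value reported for Continent in the Anova table; the P value would then come from the F-distribution with 4 and 20 d.f., for example via FDIST in Excel.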
23. Coding an ordinal variable
24. Interpreting the coefficients
- b1 is the effect of going from Strongly Disagree to Disagree
- b2 is the effect of going from Disagree to Neutral
- Intercept for Strongly Disagree is b0 (the model intercept)
- Intercept for Disagree is b0 + b1
- Intercept for Neutral is b0 + b1 + b2
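For coefficients to mean "the effect of moving up one step", each dummy has to switch on at its step and stay on for all higher responses. A minimal sketch of such a cumulative coding follows; the function name `cumulative_code` and the three-level scale are assumptions made here to match the interpretation above.

```python
LEVELS = ["Strongly Disagree", "Disagree", "Neutral"]


def cumulative_code(response, levels=LEVELS):
    """One 0/1 column per step up the scale.

    Column j is 1 when the response is at or above levels[j + 1],
    so each coefficient measures a single step between adjacent levels.
    """
    rank = levels.index(response)  # raises ValueError for unknown responses
    return [1 if rank >= step else 0 for step in range(1, len(levels))]


codes = {lvl: cumulative_code(lvl) for lvl in LEVELS}
for lvl, row in codes.items():
    print(f"{lvl:18s} {row}")
```

With this coding, Disagree differs from Strongly Disagree only in the first column and Neutral differs from Disagree only in the second, so b1 and b2 each capture one step.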