Regression with categorical independent variables

Provided by: brucek64
1
Regression with categorical independent variables
  • ESM 206B
  • 15 Jan. 2008

2
Types of variables
  • Categorical
    • Dichotomous
      • Male/Female
      • Pre-regulation/Post-regulation
      • Island/Mainland
    • Nominal (nom = named)
      • Continent
      • Political party
      • Soil type
    • Ordinal (ord = ordered)
      • Survey response: strongly disagree, disagree,
        neutral, agree, strongly agree
      • Size classifications: small, medium, large
      • Income ranges
  • Numeric
    • Continuous
      • Observations can take on, in principle, any real
        number
      • Infinite number of possible values between 1 and 10
    • Discrete
      • Observations can take on, in principle, any
        integer
      • 10 possible values between 1 and 10

3
Dummy variables
  • How can we handle categorical explanatory
    (independent) variables in a regression?
  • Answer: make a dummy!
  • Dichotomous: zero or one
  • Categorical with q categories: q-1 variables,
    each scored zero or one (examples to follow)
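
As an illustration of the q-1 rule, here is a minimal Python sketch of dummy coding. The function name dummy_code and the choice of reference category are assumptions made for this example; they are not part of the original slides, which rely on Rcmdr to do the coding.

```python
def dummy_code(values, reference):
    """Return (dummy_names, rows) of 0/1 dummies for a categorical variable.

    One dummy is created for every category except `reference`, so a
    variable with q categories yields q-1 columns.
    """
    # Sorted unique categories, with the reference category left out.
    levels = [lvl for lvl in sorted(set(values)) if lvl != reference]
    # Each observation scores 1 on the dummy matching its category, else 0.
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows


# Hypothetical observations with q = 3 categories -> 2 dummy variables.
continents = ["Africa", "Europe", "South America", "Europe"]
names, rows = dummy_code(continents, reference="South America")
print(names)
for value, row in zip(continents, rows):
    print(value, row)
```

The reference category scores zero on every dummy, so its effect is absorbed into the intercept.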

4
Alien Species
  • Exotic species cause economic and ecological
    damage
  • Not all countries equally invaded
  • Want to understand characteristics of country
    that make it more likely to be invaded.
  • We'll measure invasiveness as the fraction of
    species that are alien
  • Two hypotheses
    • Human population density plays a role in a
      country's invasiveness.
    • Island nations are more invaded than mainland
      nations.

5
(No Transcript)
6
A Simple Model
  • ISL is a Dummy variable, coded 0 if mainland, 1
    if island
  • Dummy changes intercept (explain).
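
The model equation for this slide did not survive in the transcript. Assuming it matches the lm call on the next slide, and writing the coefficients as b0, b1, b2 (notation chosen here), the way the dummy shifts the intercept can be sketched as:

```latex
\begin{align*}
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + b_2\,\text{ISL} + \varepsilon \\[4pt]
\text{Mainland } (\text{ISL}=0):\quad
  E[\text{Prop\_exotic}] &= b_0 + b_1\,\text{Pop\_dens} \\
\text{Island } (\text{ISL}=1):\quad
  E[\text{Prop\_exotic}] &= (b_0 + b_2) + b_1\,\text{Pop\_dens}
\end{align*}
```

Both groups share the slope b1; the dummy only moves the island line up or down by b2.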

7
Call:
lm(formula = Prop_exotic ~ Pop_dens + Island, data = ExoticSpecies)

Coefficients:
                   Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)        7.944e-02  2.708e-02   2.934    0.00746
Pop_dens           3.002e-04  9.721e-05   3.088    0.00519
Island[T.Island]   9.215e-02  4.945e-02   1.864    0.07517 .

Residual standard error: 0.11 on 23 degrees of freedom
Multiple R-Squared: 0.4761, Adjusted R-squared: 0.4306
F-statistic: 10.45 on 2 and 23 DF, p-value: 0.0005902
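
To make the fitted coefficients concrete, the following Python sketch plugs them into the model for a mainland and an island country with the same population density. The density value of 100 is a hypothetical input, and the function name is made up for this example.

```python
# Coefficients copied from the regression output above.
B0 = 7.944e-02        # intercept (mainland baseline)
B_POP = 3.002e-04     # slope for Pop_dens
B_ISLAND = 9.215e-02  # intercept shift for island nations


def predict_prop_exotic(pop_dens, is_island):
    """Predicted proportion of exotic species under the additive model."""
    island = 1 if is_island else 0  # the dummy variable
    return B0 + B_POP * pop_dens + B_ISLAND * island


pop_dens = 100  # hypothetical population density
mainland = predict_prop_exotic(pop_dens, is_island=False)
island = predict_prop_exotic(pop_dens, is_island=True)
print(f"mainland: {mainland:.4f}")
print(f"island:   {island:.4f}")
print(f"gap:      {island - mainland:.4f}")
```

Whatever density is supplied, the gap between the two predictions equals the island coefficient, which is what "the dummy changes the intercept" means.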
8
Some Rcmdr details
  • In the original dataset, the Island variable
    was coded 0 for mainland countries and 1 for
    island countries
  • Rcmdr sees this as a numeric variable
  • To make the dotplot I had to convert it to a
    factor (what R calls a categorical variable)
  • Data -> Manage variables -> Convert numeric variables
    to factors

9
Rcmdr continued
  • To include a factor variable in a regression, use
    Linear model instead of Linear regression
  • R converts the factor back to a 0/1 variable on
    the fly
  • In the output, T.Island tells you which level
    of the factor is being set to 1 (True)
  • If it guesses wrong, use Data -> Manage variables
    -> Reorder factor levels to switch them
  • In the original dataset, with the 0s and 1s,
    simply use Linear regression

10
Call:
lm(formula = Prop_exotic ~ Island + Pop_dens, data = ExoticSpecies2)

Coefficients:
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   7.944e-02  2.708e-02   2.934    0.00746
Island        9.215e-02  4.945e-02   1.864    0.07517 .
Pop_dens      3.002e-04  9.721e-05   3.088    0.00519

Residual standard error: 0.11 on 23 degrees of freedom
Multiple R-Squared: 0.4761, Adjusted R-squared: 0.4306
F-statistic: 10.45 on 2 and 23 DF, p-value: 0.0005902
11
What if we want the slopes to differ between
islands and mainland countries?
  • A model with interactions

12
  • In the model formula, * is shorthand for main effects
    and interactions
  • Could also use Prop_exotic ~ Pop_dens + Island +
    Pop_dens:Island
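
The interaction model itself is not preserved in the transcript. A sketch of what the formula expands to, with coefficient names b0 to b3 chosen here and Island taken as the 0/1 dummy:

```latex
\begin{align*}
\text{Prop\_exotic} &= b_0 + b_1\,\text{Pop\_dens} + b_2\,\text{Island}
  + b_3\,(\text{Pop\_dens}\times\text{Island}) + \varepsilon \\[4pt]
\text{Mainland } (\text{Island}=0):\quad
  E[\text{Prop\_exotic}] &= b_0 + b_1\,\text{Pop\_dens} \\
\text{Island } (\text{Island}=1):\quad
  E[\text{Prop\_exotic}] &= (b_0 + b_2) + (b_1 + b_3)\,\text{Pop\_dens}
\end{align*}
```

Here b2 is the difference in intercepts and b3 is the difference in slopes between island and mainland countries.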

13
Call:
lm(formula = Prop_exotic ~ Pop_dens * Island, data = ExoticSpecies)

Coefficients:
                             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)                  0.1038003   0.0304446    3.409   0.00251
Pop_dens                    -0.0002163   0.0003406   -0.635   0.53184
Island[T.Island]             0.0571083   0.0528125    1.081   0.29126
Pop_dens:Island[T.Island]    0.0005593   0.0003544    1.578   0.12880

Residual standard error: 0.1066 on 22 degrees of freedom
Multiple R-Squared: 0.5294, Adjusted R-squared: 0.4652
F-statistic: 8.25 on 3 and 22 DF, p-value: 0.0007308
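
As a reading aid for this output, the Python sketch below combines the four estimates into a separate intercept and slope for each group. The helper name line_for is hypothetical; the numbers are copied from the output above.

```python
# Estimates copied from the interaction-model output above.
B0 = 0.1038003             # intercept, mainland
B_POP = -0.0002163         # slope on Pop_dens, mainland
B_ISLAND = 0.0571083       # shift in intercept for islands
B_INTERACTION = 0.0005593  # shift in slope for islands


def line_for(is_island):
    """Return (intercept, slope) of the fitted line for one group."""
    island = 1 if is_island else 0
    intercept = B0 + B_ISLAND * island
    slope = B_POP + B_INTERACTION * island
    return intercept, slope


for label, flag in (("mainland", False), ("island", True)):
    intercept, slope = line_for(flag)
    print(f"{label}: intercept = {intercept:.7f}, slope = {slope:.7f}")
```

Note that the fitted slope is negative for mainland countries and positive for islands, although neither the slope difference nor the intercept difference is individually significant in this output.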
14
What about a polytomous variable (e.g.,
Continent)?
15
Interpreting the coefficients
  • b1 is the effect of being in Africa, relative to
    being in South America
  • b2 is the effect of being in Europe, relative to
    being in South America
  • Intercept for South America is b0 (the constant term)
  • Intercept for Africa is b0 + b1
  • Intercept for Europe is b0 + b2

16
Another coding (used by JMP)
17
Interpreting the coefficients (JMP coding)
  • b1 is the effect of being in Africa, relative to
    the average continent
  • b2 is the effect of being in Europe, relative to
    the average continent
  • Intercept for South America is b0 - b1 - b2
  • Intercept for Africa is b0 + b1
  • Intercept for Europe is b0 + b2
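
The coding table for this scheme is missing from the transcript. The sketch below assumes the usual effect (sum-to-zero) coding, in which the last category scores -1 on every dummy; the function name effect_code and the coefficient values are hypothetical and serve only to illustrate the interpretation.

```python
def effect_code(values, last_level):
    """Effect coding: `last_level` scores -1 on every column,
    every other category scores 1 on its own column and 0 elsewhere."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != last_level]
    rows = []
    for v in values:
        if v == last_level:
            rows.append([-1] * len(levels))
        else:
            rows.append([1 if v == lvl else 0 for lvl in levels])
    return levels, rows


names, rows = effect_code(["Africa", "Europe", "South America"], "South America")
print(names, rows)

# Hypothetical coefficients, used only to show what b0 means.
b0, b1, b2 = 0.20, -0.05, 0.08
intercepts = {
    "Africa": b0 + b1,
    "Europe": b0 + b2,
    "South America": b0 - b1 - b2,
}
mean_intercept = sum(intercepts.values()) / len(intercepts)
print(f"average of the three intercepts: {mean_intercept:.4f}")
```

Under this coding the three intercepts average to b0, which is why b1 and b2 read as deviations from the average continent rather than from a baseline continent.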

18
  • Continent is already a categorical variable in
    the dataset

19
Call:
lm(formula = Prop_exotic ~ Pop_dens + Continent, data = ExoticSpecies)

Coefficients:
                              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                   1.025e-02  3.729e-02   0.275    0.786265
Pop_dens                      3.637e-04  7.838e-05   4.640    0.000158
Continent[T.Europe]           1.424e-01  5.821e-02   2.447    0.023752
Continent[T.North America]    1.033e-01  4.910e-02   2.104    0.048270
Continent[T.Oceania]          2.600e-01  6.377e-02   4.078    0.000587
Continent[T.South America]    3.880e-02  5.839e-02   0.665    0.513920

Residual standard error: 0.09015 on 20 degrees of freedom
Multiple R-Squared: 0.6942, Adjusted R-squared: 0.6177
F-statistic: 9.08 on 5 and 20 DF, p-value: 0.0001245
20
Testing for significance of a categorical variable
  • Even though there are q-1 dummy variables, there
    is really only one real variable
  • Need to compare model with continent to model
    without continent
  • Most advanced stats packages
    • Recognize nominal variables (you don't have to do
      the coding yourself)
    • Automatically provide a P value for the entire variable
  • Extra step in Rcmdr: Models -> Hypothesis Tests
    -> Anova Table

21
Anova Table (Type II tests)

Response: Prop_exotic
            Sum Sq   Df  F value  Pr(>F)
Pop_dens    0.17499   1  21.5302  0.0001579
Continent   0.15796   4   4.8588  0.0066743
Residuals   0.16255  20
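
The F values in this table can be rebuilt from its other columns: each is the term's mean square (Sum Sq / Df) divided by the residual mean square. A small Python check, with variable names chosen for this example:

```python
# Sums of squares and degrees of freedom copied from the Anova table above.
sum_sq = {"Pop_dens": 0.17499, "Continent": 0.15796}
df = {"Pop_dens": 1, "Continent": 4}
ss_resid, df_resid = 0.16255, 20

ms_resid = ss_resid / df_resid  # residual mean square


def f_value(term):
    """F statistic for one term: its mean square over the residual mean square."""
    return (sum_sq[term] / df[term]) / ms_resid


for term in sum_sq:
    print(f"{term}: F = {f_value(term):.4f} on {df[term]} and {df_resid} d.f.")
```

The results agree with the table up to rounding of the displayed sums of squares. The single Continent row tests all four continent dummies together, which is the whole-variable test the previous slide asks for.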
22
In Excel, have to do Incremental F test yourself
  • Run regression on models with (model 1) and
    without (model 0) nominal variable
  • Calculate the incremental F statistic, F0
  • If null hypothesis (that categorical variable has
    no effect) is true, then F0 should follow an
    F-distribution with q-1 and n-k-1 d.f.
  • Look up critical value in a table, or calculate P
    in Excel using FDIST

23
Coding an ordinal variable
24
Interpreting the coefficients
  • b1 is the effect of going from Strongly Disagree
    to Disagree
  • b2 is the effect of going from Disagree to
    Neutral
  • Intercept for Strongly Disagree is b0 (the constant term)
  • Intercept for Disagree is b0 + b1
  • Intercept for Neutral is b0 + b1 + b2
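
The coding table from the previous slide is not in the transcript. The interpretation above (each coefficient is the step from one level to the next) is consistent with a cumulative coding, sketched here in Python. The function names and the coefficient values are hypothetical.

```python
LEVELS = ["Strongly Disagree", "Disagree", "Neutral"]


def ordinal_code(response, levels=LEVELS):
    """Cumulative dummies: dummy j is 1 once the response has reached
    level j or higher (levels counted from 0), otherwise 0."""
    rank = levels.index(response)
    return [1 if rank >= j else 0 for j in range(1, len(levels))]


def intercept_for(response, b0, steps):
    """Intercept for a response level: b0 plus every step taken so far."""
    return b0 + sum(d * b for d, b in zip(ordinal_code(response), steps))


# Hypothetical coefficients: b0, then the steps b1 and b2.
b0, steps = 1.0, [0.5, 0.25]
for level in LEVELS:
    print(level, ordinal_code(level), intercept_for(level, b0, steps))
```

With this coding a respondent at Neutral has passed both steps, so that intercept accumulates b1 and b2 on top of b0.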