Dummy Variables - PowerPoint PPT Presentation

Slides: 35
Provided by: robert158
Title: Dummy Variables


1
Dummy Variables
  • Dummy variable coding refers to the technique of
    using dichotomous variables (coded 0 or 1) to
    represent the separate categories of a
    nominal-level measure.
  • The term dummy appears to refer to the fact
    that the presence of the trait indicated by the
    code of 1 represents a factor or collection of
    factors that are not measurable by any better
    means within the context of the analysis.

2
Coding of dummy Variables
  • Take for instance the race of the respondent in a
    study of voter preferences
  • Race coded white(0) or black(1)
  • There are a whole set of factors that are
    possibly different, or even likely to be
    different, between voters of different races
  • Income, socialization, experience of racial
    discrimination, attitudes toward a variety of
    social issues, feelings of political efficacy,
    etc.
  • Since we cannot measure all of those differences
    within the confines of the study we are doing, we
    use a dummy variable to capture these effects.

3
Multiple categories
  • Now picture race coded white(0), black(1),
    Hispanic(2), Asian(3) and Native American(4)
  • If we put the variable race into a regression
    equation, the results will be nonsense, since
    regression implicitly treats the codes as at
    least ordinal-level data with approximately
    equal differences between adjacent categories.
  • Regression on a nominal variable with 3 (or more)
    categories coded this way yields uninterpretable
    and meaningless results.

4
Creating Dummy variables
  • The simple case of race is already coded
    correctly
  • Race coded 0 for white and 1 for black
  • Note the coding can be reversed and leads only to
    changes in sign and direction of interpretation.
  • The complex nominal version turns into 5
    variables
  • White coded 1 for whites and 0 for non-whites
  • Black coded 1 for blacks and 0 for non-blacks
  • Hispanic coded 1 for Hispanics and 0 for
    non-Hispanics
  • Asian coded 1 for Asians and 0 for non-Asians
  • AmInd coded 1 for Native Americans and 0 for
    non-Native Americans
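
The recoding above can be sketched in a few lines of Python. This is only an illustration: the sample of respondents is made up, and `make_dummies` and `RACE_LABELS` are hypothetical names, not part of the original slides.

```python
# Nominal codes from the slide: white(0), black(1), Hispanic(2),
# Asian(3), Native American(4).
RACE_LABELS = {0: "White", 1: "Black", 2: "Hispanic", 3: "Asian", 4: "AmInd"}

def make_dummies(codes, labels=RACE_LABELS):
    """Return one 0/1 column per category, keyed by category name."""
    return {
        name: [1 if code == value else 0 for code in codes]
        for value, name in labels.items()
    }

race = [0, 1, 2, 3, 4, 1, 0]          # made-up sample of seven respondents
dummies = make_dummies(race)

for name, column in dummies.items():
    print(f"{name:9s}", column)
```

Every respondent has a 1 in exactly one of the five columns, which is why one column has to be dropped before the set is used in a regression.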

6
Regression with Dummy Variables
  • The dummy variable is then added to the
    regression model
  • Interpretation of the dummy variable is usually
    quite straightforward.
  • The intercept term represents the intercept for
    the omitted category
  • The slope coefficient for the dummy variable
    represents the change in the intercept for the
    category coded 1 (blacks)

7
Regression with only a dummy
  • When we regress a variable on only the dummy
    variable, we obtain estimates of the group means
    of the dependent variable.
  • a is the mean of Y for Whites and a + B1 is the
    mean of Y for Blacks
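
A small numerical check of this claim, as a sketch: the six data points are made up, and `simple_ols` is a hypothetical helper that applies the usual least-squares formulas for one regressor.

```python
def simple_ols(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up data: dummy coded 0 for Whites, 1 for Blacks.
black = [0, 0, 0, 1, 1, 1]
y = [50.0, 54.0, 52.0, 40.0, 44.0, 42.0]

a, b1 = simple_ols(black, y)

# Group means computed directly, for comparison with a and a + B1.
mean_white = sum(v for v, d in zip(y, black) if d == 0) / black.count(0)
mean_black = sum(v for v, d in zip(y, black) if d == 1) / black.count(1)

print("a      =", a, "  mean for Whites =", mean_white)
print("a + B1 =", a + b1, "  mean for Blacks =", mean_black)
```

The intercept matches the mean of the group coded 0, and intercept plus slope matches the mean of the group coded 1.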

8
Omitting a category
  • When we have a single dummy variable, we have
    information for both categories in the model
  • Also note that
  • White = 1 - Black
  • Thus having both a dummy for White and one for
    Blacks is redundant.
  • As a result of this, we always omit one category,
    whose intercept is the model's intercept.
  • This omitted category is called the reference
    category
  • In the dichotomous case, the reference category
    is simply the category coded 0
  • When we have a series of dummies, you can see
    that the reference category is also the omitted
    variable.

9
Suggestions for selecting the reference category
  • Make it a well-defined group; "other" or an
    obscure (low n) category is usually a poor choice.
  • If there is some underlying ordinality in the
    categories, select the highest or lowest category
    as the reference. (e.g. blue-collar,
    white-collar, professional)
  • It should have an ample number of cases. The
    modal category is often a good choice.

10
Multiple dummy Variables
  • The model for the full dummy variable scheme for
    race is
  • Note that the dummy for White has been omitted,
    and the intercept a is the intercept for Whites.
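
The equation itself did not survive in this transcript. A form consistent with the bullets above would be the following, where the coefficient labels B1 to B4 and the error term e are assumptions rather than the slide's own notation:

```latex
Y = a + B_1\,\mathrm{Black} + B_2\,\mathrm{Hispanic} + B_3\,\mathrm{Asian} + B_4\,\mathrm{AmInd} + e
```

Under this form each coefficient is the difference between that group's intercept and the intercept for Whites, the reference category.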

11
Tests of Significance
  • With dummy variables, the t test on a coefficient
    tests whether that category is different from the
    reference category, not whether the category
    itself is different from 0.
  • Thus if a = 50 and B1 = -45, the value for Blacks
    (a + B1 = 5) might not be significantly different
    from 0, while the value for Whites (a = 50) is
    significantly different from 0

12
Interaction terms
  • When the research hypotheses state that different
    categories may have differing responses to other
    independent variables, we need to use interaction
    terms
  • For example, race and income interact with each
    other so that the relationship between income and
    ideology is different (stronger or weaker) for
    Whites than Blacks

13
Creating Interaction terms
  • Creating an interaction term is easy
  • Multiply the category dummy by the independent
    variable
  • The full model is thus
  • a is the intercept for Whites
  • (a + B1) is the intercept for Blacks
  • B2 is the slope for Whites and
  • (B2 + B3) is the slope for Blacks
  • t-tests for B1 and B3 test whether the intercept
    and slope for Blacks are different from a and B2
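
The model equation is missing from the transcript. Assuming it has the form Y = a + B1·Black + B2·Income + B3·(Black × Income) + e, which is inferred from the bullets, the sketch below builds the interaction term and recovers the four coefficients. It relies on the fact that a model with one dummy and its interaction gives the same point estimates as fitting a separate line to each group. The data and the helper names are made up.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Made-up data: income, race dummy (1 = Black), ideology score.
income = [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0]
black = [0, 0, 0, 0, 1, 1, 1, 1]
ideology = [2.0, 3.0, 4.0, 5.0, 4.0, 4.5, 5.0, 5.5]

# The interaction term is the dummy multiplied by the independent variable.
black_x_income = [d * x for d, x in zip(black, income)]

def group_fit(group):
    """Fit a separate line to the cases with the given dummy value."""
    xs = [x for x, d in zip(income, black) if d == group]
    ys = [v for v, d in zip(ideology, black) if d == group]
    return simple_ols(xs, ys)

a, b2 = group_fit(0)                      # intercept and slope for Whites
a_black, slope_black = group_fit(1)       # intercept and slope for Blacks
b1 = a_black - a                          # shift in intercept
b3 = slope_black - b2                     # shift in slope

# Predictions from the full interaction model using these coefficients.
predicted = [a + b1 * d + b2 * x + b3 * dx
             for d, x, dx in zip(black, income, black_x_income)]
print("a, B1, B2, B3 =", a, b1, b2, b3)
```

Standard errors and t-tests would still need the full model fitted in one pass; this sketch only shows how the coefficients map onto the two group lines.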

14
Separating Effects
  • The literature is unclear on how to fully
    interpret interaction effects
  • There is multicollinearity between a dummy and
    its interaction terms, and also the regular
    independent variable
  • It is suggested that you not use a model with
    interactions and no intercept

15
Non-Linear Models
  • Tractable non-linearity
  • Equation may be transformed to a linear model.
  • Intractable non-linearity
  • No linear transform exists

16
Tractable Non-Linear Models
  • Several general Types
  • Polynomial
  • Power Functions
  • Exponential Functions
  • Logarithmic Functions
  • Trigonometric Functions

17
Polynomial Models
  • Linear
  • Parabolic
  • Cubic and higher-order polynomials
  • All may be estimated with OLS: simply square,
    cube, etc. the independent variable.
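
As a sketch of the parabolic case: square the independent variable, add it as a second regressor, and run ordinary least squares. The data are made up (an exact parabola), and `solve` and `ols` are hypothetical helpers that solve the normal equations directly, which is adequate for a tiny example like this one.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(columns, y):
    """OLS coefficients of y on the given columns, with an intercept first."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]   # made-up exact parabola

x_squared = [xi ** 2 for xi in x]                  # the transformed regressor
coefficients = ols([x, x_squared], y)
print("intercept, linear, quadratic =", coefficients)
```

A cubic model would add a third column holding the cubes of x in the same way.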

18
Power Functions
  • Simple exponents of the Independent Variable
  • Estimated with
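
The estimating equation is missing from the transcript. Assuming the power function has the common form with a multiplicative error, which is an assumption rather than the slide's own formula, taking logs of both sides gives a model that is linear in the parameters:

```latex
Y = a X^{b} e^{\varepsilon}
\quad\Longrightarrow\quad
\ln Y = \ln a + b \ln X + \varepsilon
```

Regressing ln Y on ln X with OLS then estimates b as the slope and ln a as the intercept.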

19
Exponential and Logarithmic Functions
  • Common Growth Curve Formula
  • Estimated with
  • Note that the error terms are now no longer
    normally distributed!
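
The growth curve formula is missing from the transcript. Assuming it is the usual exponential form Y = a·e^(bX), taking logs gives ln Y = ln a + bX, which OLS can estimate. The sketch below uses made-up, noise-free data and a hypothetical `simple_ols` helper.

```python
import math

def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.3 * xi) for xi in x]   # made-up growth data

# Regress ln(Y) on X; the intercept estimates ln(a), the slope estimates b.
log_a, b = simple_ols(x, [math.log(yi) for yi in y])
a = math.exp(log_a)
print("a =", a, " b =", b)
```

Because the fit is done on ln Y, an error that is additive and normal on the log scale is multiplicative on the original scale, which is the point of the note about the error terms.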

20
Logarithmic Functions
21
Trigonometric Functions
  • Sine/Cosine functions
  • Fourier series / harmonic analysis
  • See Wolfram MathWorld for pictures

22
Intractable Non-linearity
  • Occasionally we have models that we cannot
    transform to linear ones.
  • For instance a logit model
  • Or an equilibrium system model

23
Intractable Non-linearity
  • Models such as these must be estimated by other
    means.
  • We do, however, keep the criterion of minimizing
    the squared error as our means of determining the
    best model

24
Estimating Non-linear models
  • All methods of non-linear estimation require an
    iterative search for the best fitting parameter
    values.
  • They differ in how they modify and search for
    those values that minimize the SSE.

25
Methods of Non-linear Estimation
  • There are several methods of selecting parameters
  • Grid search
  • Steepest descent
  • Marquardt's algorithm

26
Grid search estimation
  • In a grid search estimation, we simply try out a
    set of parameters across a set of ranges and
    calculate the SSE.
  • We then ascertain where in the range (or at which
    end) the SSE was at a minimum.
  • We then repeat, either extending the range or
    reducing the range and searching with a smaller
    grid around the point where the SSE was lowest
  • Try the spreadsheet
  • Try this for homework!
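
The procedure above can be sketched as follows. The saturating curve, the data, the starting ranges, and all function names are assumptions chosen for illustration. When the best point lands at an end of a range, the sketch shifts the ranges to recenter on it (one way of extending the search); otherwise it zooms in to one grid step on either side.

```python
import math

def model(x, a, b):
    """Hypothetical saturating curve that is not linear in a and b."""
    return a * (1.0 - math.exp(-b * x))

def sse(a, b, xs, ys):
    """Sum of squared errors for one candidate pair of parameters."""
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def grid(lo, hi, steps):
    width = (hi - lo) / (steps - 1)
    return [lo + i * width for i in range(steps)]

def at_edge(lo, hi, best, steps):
    step = (hi - lo) / (steps - 1)
    return best <= lo + step / 2 or best >= hi - step / 2

def grid_search(xs, ys, a_range, b_range, steps=21, rounds=8):
    for _ in range(rounds):
        candidates = [(a, b) for a in grid(*a_range, steps)
                      for b in grid(*b_range, steps)]
        best_a, best_b = min(candidates, key=lambda p: sse(p[0], p[1], xs, ys))
        a_width = a_range[1] - a_range[0]
        b_width = b_range[1] - b_range[0]
        if at_edge(*a_range, best_a, steps) or at_edge(*b_range, best_b, steps):
            # Minimum at an end of the range: shift the ranges, same width.
            a_half, b_half = a_width / 2, b_width / 2
        else:
            # Minimum inside the range: shrink to one grid step either side.
            a_half, b_half = a_width / (steps - 1), b_width / (steps - 1)
        a_range = (best_a - a_half, best_a + a_half)
        b_range = (best_b - b_half, best_b + b_half)
    return best_a, best_b, sse(best_a, best_b, xs, ys)

xs = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
ys = [model(x, 9.3, 0.37) for x in xs]       # made-up, noise-free data

a_hat, b_hat, best_sse = grid_search(xs, ys, (0.0, 20.0), (0.0, 2.0))
print("a =", a_hat, " b =", b_hat, " SSE =", best_sse)
```

Grid search is simple but costly: the number of SSE evaluations per round grows with the grid size raised to the number of parameters.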

27
Regression Diagnostics
  • Some cases are very influential in regression
    models
  • There are two ways to describe the influence that
    a case may exert
  • Residual
  • Leverage
  • Examination of these particular cases may lead us
    to theoretical insight.

28
Regression Diagnostics
  • Residual
  • Outliers: extreme measures weaken the
    goodness-of-fit indexes and hypothesis tests
  • Studentized residuals
  • where h_ii is the hat diagonal
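
The formula is missing from the transcript. The standard definition of the (internally) studentized residual, which is presumably what the slide showed, is given below, with e_i the ordinary residual, n the number of observations, and p the number of coefficients in the model:

```latex
r_i = \frac{e_i}{s\,\sqrt{1 - h_{ii}}},
\qquad
s^2 = \frac{\sum_{j=1}^{n} e_j^{2}}{n - p}
```

Dividing by the square root of 1 - h_ii corrects for the fact that residuals at high-leverage points have smaller variance.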

29
Residual Diagnostics (cont.)
  • RStudent
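
The slide's formula is missing from the transcript. RStudent usually denotes the externally studentized residual, in which the error variance is re-estimated with observation i deleted. In standard notation (an assumption about what the slide showed), with e_i the ordinary residual, h_ii the hat diagonal, s² the usual residual variance, n observations, and p coefficients:

```latex
t_i = \frac{e_i}{s_{(i)}\,\sqrt{1 - h_{ii}}},
\qquad
s_{(i)}^{2} = \frac{(n - p)\,s^{2} - e_i^{2}/(1 - h_{ii})}{n - p - 1}
```

Leaving the case out of the variance estimate keeps a single large outlier from inflating s and masking itself.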

30
Residual Diagnostics (cont.)
  • Leverage: influential observation
  • Hat diagonal: a measure of the observation's
    remoteness in X-space.
  • Hat diagonals greater than 2 times the number of
    coefficients in the model divided by the number
    of observations are considered significant.
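
A minimal sketch of the cutoff rule for a regression with one predictor, where the hat diagonal has the closed form 1/n + (x_i - mean)² / Sxx. The x values are made up, and `hat_diagonals` is a hypothetical helper.

```python
def hat_diagonals(x):
    """Hat diagonals for a simple regression with intercept and one slope."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]   # last case is remote
h = hat_diagonals(x)

n_coefficients = 2                       # intercept and slope
cutoff = 2 * n_coefficients / len(x)     # the 2p/n rule from the slide

flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print("cutoff =", cutoff)
print("high-leverage cases (0-based index):", flagged)
```

The hat diagonals always sum to the number of coefficients, so the cutoff amounts to twice the average leverage.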

31
Residual Diagnostics (cont.)
  • Leverage (cont.)
  • Cook's D
  • If D is greater than 1, then the observation is
    influential.
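
The slide's formula is missing from the transcript. The sketch below uses the standard expression D_i = e_i² / (p·s²) · h_ii / (1 - h_ii)², for a regression with one predictor (p = 2). The data are made up so that the last case is both remote in X and off the trend, and the helper names are hypothetical.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def cooks_distance(x, y):
    """Cook's D for each case in a simple regression (p = 2 coefficients)."""
    n, p = len(x), 2
    a, b = simple_ols(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)          # residual variance
    h = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]   # hat diagonals
    return [e ** 2 / (p * s2) * hi / (1.0 - hi) ** 2
            for e, hi in zip(residuals, h)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]   # last case off trend

d = cooks_distance(x, y)
influential = [i for i, di in enumerate(d) if di > 1.0]
print("influential cases (0-based index):", influential)
```

Cook's D combines the size of the residual with the leverage of the case, so a point needs both to score highly.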

32
Residual Diagnostics (cont.)
  • Leverage (cont.)
  • DFFITS
  • If the absolute value of DFFITS is greater than
    1, then the observation is influential.
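
The formula is missing from the transcript. The standard definition, given here as an assumption about what the slide showed, scales the externally studentized residual t_i (RStudent) by the leverage, with h_ii the hat diagonal:

```latex
\mathrm{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}
```

It measures, in standard-error units, how much the fitted value for case i changes when that case is left out of the fit.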

33
Residual Diagnostics (cont.)
  • Leverage (cont.)
  • DFBETAS: an indicator of how much a given
    observation influences each regression
    coefficient
