Title: Dummy Variables
1. Dummy Variables
- The term dummy variable refers to the technique of using a dichotomous variable (coded 0 or 1) to represent the separate categories of a nominal-level measure.
- The term "dummy" appears to refer to the fact that the presence of the trait indicated by the code of 1 represents a factor, or collection of factors, that is not measurable by any better means within the context of the analysis.
2. Coding of Dummy Variables
- Take, for instance, the race of the respondent in a study of voter preferences, with race coded white (0) or black (1).
- There is a whole set of factors that are possibly different, or even likely to be different, between voters of different races: income, socialization, experience of racial discrimination, attitudes toward a variety of social issues, feelings of political efficacy, etc.
- Since we cannot measure all of those differences within the confines of the study we are doing, we use a dummy variable to capture these effects.
3. Multiple Categories
- Now picture race coded white (0), black (1), Hispanic (2), Asian (3), and Native American (4).
- If we put this race variable into a regression equation as coded, the results will be nonsense, since the coding implicitly required in regression assumes at least ordinal-level data with approximately equal differences between adjacent categories.
- Regression using a nominal variable with 3 or more categories coded this way yields uninterpretable and meaningless results.
4. Creating Dummy Variables
- The simple case of race is already coded correctly: race coded 0 for white and 1 for black.
- Note that the coding can be reversed; doing so changes only the sign of the coefficient and the direction of interpretation.
- The complex nominal version turns into 5 variables (a code sketch follows this list):
- White coded 1 for whites and 0 for non-whites
- Black coded 1 for blacks and 0 for non-blacks
- Hispanic coded 1 for Hispanics and 0 for non-Hispanics
- Asian coded 1 for Asians and 0 for non-Asians
- AmInd coded 1 for Native Americans and 0 for non-Native Americans
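- A minimal sketch of this coding in Python with pandas; the data frame, the column name race, and the category labels are illustrative assumptions, not data from the slides:
```python
import pandas as pd

# Hypothetical respondents; column name and labels are illustrative.
df = pd.DataFrame({"race": ["white", "black", "hispanic", "asian", "amind", "white"]})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(df["race"], prefix="race", dtype=int)
print(dummies)

# For regression, drop one category (white) as the reference to avoid
# perfect collinearity among the five indicators.
X = dummies.drop(columns="race_white")
print(X)
```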
6. Regression with Dummy Variables
- The dummy variable is then added to the regression model, for example Y = a + B1 Black + B2 X + e, where X is some other independent variable.
- Interpretation of the dummy variable is usually quite straightforward.
- The intercept term a represents the intercept for the omitted category (whites).
- The slope coefficient B1 for the dummy variable represents the change in the intercept for the category coded 1 (blacks).
7. Regression with Only a Dummy
- When we regress a variable on only the dummy variable, Y = a + B1 Black + e, we obtain estimates of the means of the dependent variable for the two groups (see the sketch below).
- a is the mean of Y for Whites and (a + B1) is the mean of Y for Blacks.
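- A minimal sketch, assuming statsmodels and simulated data (the variable names y and black are hypothetical): the fitted intercept reproduces the White group mean, and intercept plus the dummy coefficient reproduces the Black group mean.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y has different means for Whites (black=0) and Blacks (black=1).
rng = np.random.default_rng(0)
df = pd.DataFrame({"black": rng.integers(0, 2, size=200)})
df["y"] = 50 - 10 * df["black"] + rng.normal(0, 5, size=200)

fit = smf.ols("y ~ black", data=df).fit()
print(fit.params)                       # Intercept ~ mean(y | white); black ~ difference
print(df.groupby("black")["y"].mean())  # matches a and a + B1
```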
8. Omitting a Category
- When we have a single dummy variable, we have information for both categories in the model.
- Also note that White = 1 - Black.
- Thus having both a dummy for White and one for Black is redundant.
- As a result, we always omit one category, whose intercept becomes the model's intercept.
- This omitted category is called the reference category.
- In the dichotomous case, the reference category is simply the category coded 0.
- When we have a series of dummies, you can see that the reference category is also the omitted variable.
9. Suggestions for Selecting the Reference Category
- Make it a well-defined group; an "other" or obscure (low-n) category is usually a poor choice.
- If there is some underlying ordinality in the categories, select the highest or lowest category as the reference (e.g., blue-collar, white-collar, professional).
- It should have an ample number of cases; the modal category is often a good choice.
10. Multiple Dummy Variables
- The model for the full dummy variable scheme for race is Y = a + B1 Black + B2 Hispanic + B3 Asian + B4 AmInd + e (a code sketch follows).
- Note that the dummy for White has been omitted, and the intercept a is the intercept for Whites.
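- A minimal sketch of fitting such a model with statsmodels' formula interface; the data frame and the column names ideology and race are assumptions for illustration. Treatment coding builds the four dummies and omits white as the reference category.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical survey data: one row per respondent.
rng = np.random.default_rng(10)
df = pd.DataFrame({
    "race": rng.choice(["white", "black", "hispanic", "asian", "amind"], size=300),
})
df["ideology"] = 50 + rng.normal(0, 10, size=300)

# Treatment contrast: four dummies, "white" omitted as the reference.
fit = smf.ols("ideology ~ C(race, Treatment(reference='white'))", data=df).fit()
print(fit.params)   # Intercept = Whites; each dummy coefficient shifts that intercept
```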
11. Tests of Significance
- With dummy variables, the t-test on a dummy coefficient tests whether that category differs from the reference category, not whether the category's own intercept differs from 0.
- Thus if a = 50 and B1 = -45, the intercept for Blacks (a + B1 = 5) might not be significantly different from 0, even while the intercept for Whites (a = 50) is; testing a + B1 = 0 requires a separate test (see the sketch below).
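- A minimal sketch, assuming statsmodels and simulated data built to match the slide's numbers (a near 50, B1 near -45): the printed t-test for black compares Blacks with the reference group, while t_test("Intercept + black = 0") asks whether the Black group intercept itself differs from 0.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data roughly matching the slide's example.
rng = np.random.default_rng(11)
black = rng.integers(0, 2, size=200)
y = 50 - 45 * black + rng.normal(0, 30, size=200)
fit = smf.ols("y ~ black", data=pd.DataFrame({"y": y, "black": black})).fit()

print(fit.t_test("black = 0"))              # Blacks vs. the reference (Whites)
print(fit.t_test("Intercept + black = 0"))  # Black group intercept (a + B1) vs. 0
```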
12. Interaction Terms
- When the research hypotheses state that different categories may have differing responses to other independent variables, we need to use interaction terms.
- For example, race and income may interact with each other, so that the relationship between income and ideology is different (stronger or weaker) for Whites than for Blacks.
13. Creating Interaction Terms
- Creating an interaction term is easy: multiply the category dummy by the independent variable.
- The full model is thus Y = a + B1 Black + B2 Income + B3 (Black x Income) + e (a code sketch follows this list).
- a is the intercept for Whites
- (a + B1) is the intercept for Blacks
- B2 is the slope for Whites, and
- (B2 + B3) is the slope for Blacks
- The t-tests for B1 and B3 ask whether the Black intercept and slope differ from a and B2, respectively.
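- A minimal sketch, assuming statsmodels and simulated data with the hypothetical variables black, income, and ideology: the formula's * operator adds both main effects and the Black x Income interaction.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: ideology depends on income with a different slope for Blacks.
rng = np.random.default_rng(12)
black = rng.integers(0, 2, size=300)
income = rng.uniform(10, 100, size=300)
ideology = (40 - 5 * black + 0.3 * income
            - 0.2 * black * income + rng.normal(0, 5, size=300))
df = pd.DataFrame({"black": black, "income": income, "ideology": ideology})

# "black * income" expands to black + income + black:income.
fit = smf.ols("ideology ~ black * income", data=df).fit()
print(fit.params)
# Intercept     -> a  (intercept for Whites)
# black         -> B1 (shift in intercept for Blacks)
# income        -> B2 (slope for Whites)
# black:income  -> B3 (shift in slope for Blacks)
```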
14. Separating Effects
- The literature is unclear on how to fully interpret interaction effects.
- There is multicollinearity between a dummy, its interaction terms, and the regular independent variable.
- It is suggested that you not use a model with interactions and no intercept.
15. Non-Linear Models
- Tractable non-linearity: the equation may be transformed to a linear model.
- Intractable non-linearity: no linear transform exists.
16. Tractable Non-Linear Models
- Several general types:
- Polynomial
- Power functions
- Exponential functions
- Logarithmic functions
- Trigonometric functions
17. Polynomial Models
- Linear
- Parabolic (quadratic)
- Cubic and higher-order polynomials
- All may be estimated with OLS: simply square, cube, etc., the independent variable and include those powers as additional regressors (see the sketch below).
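- A minimal sketch, assuming numpy/statsmodels and simulated data: a parabolic (quadratic) model fitted by OLS after adding the squared term as a second regressor.
```python
import numpy as np
import statsmodels.api as sm

# Simulated parabolic relationship.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = 2 + 1.5 * x - 0.8 * x**2 + rng.normal(0, 1, size=100)

X = sm.add_constant(np.column_stack([x, x**2]))   # columns: 1, x, x^2
fit = sm.OLS(y, X).fit()
print(fit.params)                                  # intercept, linear, quadratic terms
```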
18. Power Functions
- Simple exponents (powers) of the independent variable, e.g. Y = a X^b.
- Estimated by taking logs of both sides, ln Y = ln a + b ln X, which is linear in ln X and can be fitted with OLS.
19. Exponential and Logarithmic Functions
- Common growth curve formula: Y = a e^(bX).
- Estimated by taking logs, ln Y = ln a + b X, and fitting with OLS (see the sketch below).
- Note that the error terms are now no longer normally distributed!
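- A minimal sketch, assuming simulated data from Y = a e^(bX) with multiplicative error: OLS on the logged outcome recovers ln a and b.
```python
import numpy as np
import statsmodels.api as sm

# Simulated growth-curve data with multiplicative error.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=100)
y = 2.0 * np.exp(0.7 * x) * np.exp(rng.normal(0, 0.1, size=100))

# Regress ln(y) on x: intercept estimates ln(a), slope estimates b.
fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
ln_a, b = fit.params
print(np.exp(ln_a), b)          # recover a and b on the original scale
```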
20. Logarithmic Functions
21. Trigonometric Functions
- Sine/cosine functions
- Fourier series / harmonic analysis
- See Wolfram's MathWorld for pictures.
22. Intractable Non-Linearity
- Occasionally we have models that we cannot transform to linear ones.
- For instance, a logit model, e.g. P(Y = 1) = 1 / (1 + e^-(a + bX))
- Or an equilibrium system model
23. Intractable Non-Linearity
- Models such as these must be estimated by other means.
- We do, however, keep the criterion of minimizing the squared error as our means of determining the best model.
24. Estimating Non-Linear Models
- All methods of non-linear estimation require an iterative search for the best-fitting parameter values.
- They differ in how they modify and search for the values that minimize the SSE (see the sketch below).
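- A minimal sketch, assuming SciPy and a hypothetical logistic curve: scipy.optimize.curve_fit performs such an iterative least-squares search (Levenberg-Marquardt by default for unconstrained problems).
```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b, c):
    """Logistic curve: a / (1 + exp(-b * (x - c)))."""
    return a / (1.0 + np.exp(-b * (x - c)))

# Simulated data from the curve plus noise.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80)
y = logistic(x, 5.0, 1.2, 5.0) + rng.normal(0, 0.1, size=x.size)

params, cov = curve_fit(logistic, x, y, p0=[4.0, 1.0, 4.0])
print(params)                       # parameter values chosen iteratively to minimize the SSE
```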
25. Methods of Non-Linear Estimation
- There are several methods of selecting parameters:
- Grid search
- Steepest descent
- Marquardt's algorithm
26. Grid Search Estimation
- In a grid search estimation, we simply try out a set of parameter values across a set of ranges and calculate the SSE for each combination.
- We then ascertain where in the range (or at which end) the SSE was at a minimum.
- We then repeat, either extending the range or reducing it and searching with a finer grid around the current minimum-SSE estimate.
- Try the spreadsheet; a code sketch also follows this list.
- Try this for homework!
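- A minimal sketch of such a grid search in Python, assuming a hypothetical exponential model y = a e^(bx) and simulated data: evaluate the SSE over a coarse grid of (a, b) values and keep the pair with the smallest SSE; repeating with a finer grid around the winner refines the estimate.
```python
import numpy as np

# Simulated data from y = 2 * exp(0.5 * x) plus noise.
rng = np.random.default_rng(4)
x = np.linspace(0, 4, 50)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.2, size=x.size)

best = (None, None, np.inf)                      # (a, b, SSE)
for a in np.linspace(0.5, 4.0, 36):
    for b in np.linspace(0.1, 1.0, 46):
        sse = np.sum((y - a * np.exp(b * x)) ** 2)
        if sse < best[2]:
            best = (a, b, sse)

print(best)                                      # coarse-grid estimate of (a, b) and its SSE
```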
27. Regression Diagnostics
- Some cases are very influential in regression models.
- There are two ways to describe the influence that a case may exert:
- Residual
- Leverage
- Examination of these particular cases may lead us to theoretical insight.
28. Regression Diagnostics
- Residual
- Outliers (extreme measures) weaken the goodness-of-fit indexes and hypothesis tests.
- Studentized residuals: r_i = e_i / (s sqrt(1 - h_ii)), where h_ii is the hat diagonal (see the sketch below).
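- A minimal sketch with simulated data, using statsmodels' influence tools to compute the hat diagonals and the internally studentized residuals, e_i / (s sqrt(1 - h_ii)).
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Simple simulated regression.
rng = np.random.default_rng(5)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)
fit = sm.OLS(y, sm.add_constant(x)).fit()

infl = OLSInfluence(fit)
print(infl.hat_matrix_diag[:5])                # hat diagonals h_ii
print(infl.resid_studentized_internal[:5])     # e_i / (s * sqrt(1 - h_ii))
```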
29. Residual Diagnostics (cont.)
30. Residual Diagnostics (cont.)
- Leverage: influential observations.
- Hat diagonal: a measure of the observation's remoteness in X-space.
- Hat diagonals greater than 2 times the number of coefficients in the model divided by the number of observations (2p/n) are considered significant.
31. Residual Diagnostics (cont.)
- Leverage (cont.)
- Cook's D
- If D is greater than 1, then the observation is influential.
32. Residual Diagnostics (cont.)
- Leverage (cont.)
- DFFITS
- If the absolute value of DFFITS is greater than 1, then the observation is influential.
33. Residual Diagnostics (cont.)
- Leverage (cont.)
- DFBETAS: an indicator of how much a given observation influences each regression coefficient (see the sketch below).
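- A minimal sketch with simulated data, assuming statsmodels: OLSInfluence returns the hat diagonals, Cook's D, DFFITS, and DFBETAS discussed on these slides, and the rule-of-thumb cutoffs can be applied directly.
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Simple simulated regression.
rng = np.random.default_rng(6)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

infl = OLSInfluence(fit)
n, p = X.shape                                   # observations, coefficients

high_leverage = infl.hat_matrix_diag > 2 * p / n # hat-diagonal cutoff 2p/n
influential_d = infl.cooks_distance[0] > 1       # Cook's D cutoff of 1
influential_f = np.abs(infl.dffits[0]) > 1       # |DFFITS| cutoff of 1
dfbetas = infl.dfbetas                           # one column per regression coefficient

print(high_leverage.sum(), influential_d.sum(), influential_f.sum())
print(dfbetas[:3])
```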