Dummy Variables - PowerPoint PPT Presentation

Slides: 73
Provided by: mois1


1
Dummy Variables
  • Some potential explanatory variables are
    categorical and cannot be measured on a
    quantitative scale.
  • However, we often need to use these variables
    because they are related to the response
    variable.
  • The trick is to create dummy variables, also
    called indicator or 0-1 variables.
  • These are variables that indicate the category a
    given observation is in.

2
Dummy Variables -- continued
  • To create dummy variables we can use an IF
    statement, or we can use StatPro's Dummy Variable
    procedure.
  • The Dummy Variable procedure is usually easier,
    particularly when there are multiple categories.
  • Once the dummy variables are created, we can
    combine categories if we like by simply adding
    the corresponding dummy columns to get the dummy
    for the combined category.
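
The slides create dummies in Excel, either with an IF statement or with StatPro. As a rough equivalent, here is a minimal Python sketch of the same idea. The helper name make_dummies and the sample values are hypothetical and are not taken from BANK.XLS.

```python
def make_dummies(values, prefix):
    """Return {column_name: [0/1, ...]} with one 0-1 column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical education levels for six employees (illustrative only).
educ_lev = [1, 3, 2, 3, 5, 1]
dummies = make_dummies(educ_lev, "Ed")

# Equivalent of the IF statement for a two-category variable.
gender = ["Female", "Male", "Female", "Male", "Male", "Female"]
female = [1 if g == "Female" else 0 for g in gender]

# Combining two categories: add the two dummy columns.
ed_1_or_2 = [a + b for a, b in zip(dummies["Ed_1"], dummies["Ed_2"])]

print(dummies["Ed_3"])
print(female)
print(ed_1_or_2)
```

Because each observation belongs to exactly one category, the sum of two dummy columns is itself a 0-1 column.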

3
Regression Analysis
  • In this example we create dummy variables for
    Gender and EducLev.
  • Then we can run a regression analysis with Salary
    as the response variable, using any combination
    of numerical and dummy explanatory variables.
  • We must follow two rules:
  • We shouldn't use any of the original categorical
    variables that the dummies are based on.
  • We should use one less dummy than the number of
    categories for any categorical variable.

4
Regression Analysis -- continued
  • This second rule is a technical one. If we
    violate it, the software will give us an error
    message.
  • For example, of the six education dummies Ed_1
    through Ed_6, any five can be used. The omitted
    dummy then corresponds to the reference category.
  • As we will see, the interpretations of the dummy
    variable coefficients are all relative to this
    reference category.
  • To get used to dummy variables in regression
    analysis, we will proceed in several stages.
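
The reason for the "one less dummy" rule can be seen directly. If every category gets a dummy, the dummies add up to 1 in every row, which duplicates the constant column behind the intercept, so the coefficients cannot be determined uniquely. The following sketch illustrates this with hypothetical job grades. It is an illustration of the idea, not a reproduction of what StatPro does internally.

```python
# Hypothetical job grades for five employees (illustrative only).
job_grade = [1, 2, 2, 3, 1]
grades = sorted(set(job_grade))
n = len(job_grade)

# One 0-1 column per grade: the full set of dummies.
full_set = {g: [1 if j == g else 0 for j in job_grade] for g in grades}

# Every employee is in exactly one grade, so the dummies sum to 1 in each
# row, duplicating the constant column that carries the intercept.
row_sums = [sum(col[i] for col in full_set.values()) for i in range(n)]
intercept_column = [1] * n
print(row_sums == intercept_column)

# Dropping one dummy (here grade 1, the reference category) removes the
# exact duplication.
reduced_set = {g: col for g, col in full_set.items() if g != grades[0]}
reduced_sums = [sum(col[i] for col in reduced_set.values()) for i in range(n)]
print(reduced_sums)
```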

5
Regression Analysis -- continued
  • We first estimate a regression equation with only
    one explanatory variable, Female. The output is
    shown in this table. The resulting equation is
    Predicted Salary = 45.505 - 8.296 Female

6
Regression Analysis -- continued
  • To interpret this equation, recall that Female has
    only two possible values, 0 and 1. If we
    substitute 1, the predicted salary equals
    37.209, and if we substitute 0, the predicted
    salary is 45.505.
  • These are the average salaries of females and
    males. Therefore the interpretation of the -8.296
    coefficient of the Female dummy variable is
    straightforward: it is the difference between the
    average female salary and the average male salary.
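
With a single dummy as the only explanatory variable, least squares always reproduces the two group averages: the intercept is the average for the category coded 0, and the coefficient is the difference between the two averages (here 37.209 - 45.505 = -8.296). The sketch below checks this on a small made-up data set. The salaries are hypothetical, not the BANK.XLS values.

```python
from statistics import mean

# Hypothetical salaries and Female dummy values (illustrative only).
salary = [52.0, 41.0, 47.5, 36.0, 39.5, 35.0]
female = [0, 0, 0, 1, 1, 1]

# Ordinary least squares for one explanatory variable.
x_bar, y_bar = mean(female), mean(salary)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary)) / sum(
    (x - x_bar) ** 2 for x in female
)
intercept = y_bar - slope * x_bar

# Group averages computed directly.
male_mean = mean(s for s, f in zip(salary, female) if f == 0)
female_mean = mean(s for s, f in zip(salary, female) if f == 1)

print(round(intercept, 3), round(male_mean, 3))
print(round(slope, 3), round(female_mean - male_mean, 3))
```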

7
Regression Analysis -- continued
  • The above equation tells only part of the story:
    it ignores all information except for gender.
  • We expand this equation by adding the experience
    variables. The output is shown in this table.

8
Regression Analysis -- continued
  • The corresponding equation is
    Predicted Salary = 35.492 + 0.988 YrsExper
    + 0.131 YrsPrior - 8.080 Female
  • It is useful to write two separate equations, one
    for females and one for males.
    Females: Predicted Salary = 27.412
    + 0.988 YrsExper + 0.131 YrsPrior
    Males: Predicted Salary = 35.492
    + 0.988 YrsExper + 0.131 YrsPrior
  • We interpret the coefficient -8.080 of the Female
    dummy variable as the average salary disadvantage
    for females relative to males after controlling
    for job experience. But there is still more story
    to tell.

9
Regression Analysis -- continued
  • We next add job grade to the equation by
    including five of the six job grade dummies.
    Although any five can be used, we use
    Job_2-Job_6. The resulting output is shown in
    this table.

10
Regression Analysis -- continued
  • The estimated regression equation is now
    Predicted Salary = 30.230 + 0.408 YrsExper
    + 0.149 YrsPrior - 1.962 Female + 2.57 Job_2
    + 6.295 Job_3 + 10.475 Job_4 + 16.011 Job_5
    + 27.647 Job_6
  • There are now two categorical variables involved,
    gender and job grade.
  • However, we can still write a separate equation
    for any combination of categories by setting the
    dummies to the appropriate values.

11
Regression Analysis -- continued
  • For example, the equation for females at the
    fifth job grade is found by setting Female = 1 and
    Job_5 = 1 and setting the other job dummies equal
    to 0. The equation formed is
    Predicted Salary = 44.279 + 0.408 YrsExper
    + 0.149 YrsPrior
  • We interpret this equation as follows.
  • For either gender and any job grade, the expected
    increase in salary for one extra year of
    experience with Fifth National is $408; the
    expected salary increase for one extra year of
    experience with another bank is $149.
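
Setting the dummies to 0 or 1 simply shifts the intercept. The sketch below does this bookkeeping for any gender and job grade, using the coefficients of the equation with job grade dummies quoted earlier. The function names are hypothetical, and the salary units are assumed to be thousands of dollars, which the 408 reading of the 0.408 coefficient suggests.

```python
# Coefficients transcribed from the equation with job grade dummies.
COEF = {
    "Intercept": 30.230, "YrsExper": 0.408, "YrsPrior": 0.149,
    "Female": -1.962, "Job_2": 2.57, "Job_3": 6.295, "Job_4": 10.475,
    "Job_5": 16.011, "Job_6": 27.647,
}

def group_intercept(female, job_grade):
    """Intercept of the equation for one gender / job grade combination."""
    value = COEF["Intercept"] + COEF["Female"] * female
    if job_grade != 1:  # grade 1 is the reference category: no dummy
        value += COEF[f"Job_{job_grade}"]
    return value

def predicted_salary(yrs_exper, yrs_prior, female, job_grade):
    """Predicted salary (assumed to be in $1000s) for one employee profile."""
    return (group_intercept(female, job_grade)
            + COEF["YrsExper"] * yrs_exper
            + COEF["YrsPrior"] * yrs_prior)

# Females at the fifth job grade: 30.230 - 1.962 + 16.011.
print(round(group_intercept(1, 5), 3))

# Same experience and grade, different gender: the gap is the Female coefficient.
gap = predicted_salary(10, 2, 1, 3) - predicted_salary(10, 2, 0, 3)
print(round(gap, 3))
```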

12
Regression Analysis -- continued
  • The coefficients of the job dummies indicate the
    average increase in salary an employee can expect
    relative to the reference (lowest) job grade.
  • The key coefficient, the negative $1962 for
    females, indicates the average salary disadvantage
    for females relative to males, given that they
    have the same experience levels and are in the
    same job grade.
  • Although the penalty is still substantial, it
    is less than a fourth of the penalty we saw
    before.
  • It appears that females might be getting paid
    less on average partly because they are in the
    lower job categories.

13
Regression Analysis -- continued
  • We can check whether females are
    disproportionately in the lower job categories by
    using a pivot table with JobGrade in the row
    area, Gender in the column area and the count
    (expressed as a percentage) of any variable in
    the data area.
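
The same cross-tabulation can be sketched outside Excel. The example below counts employees by job grade and gender and expresses each count as a percentage of that gender's total, which is one reasonable reading of "count expressed as a percentage". The employee records are hypothetical, not the BANK.XLS data.

```python
from collections import Counter

# Hypothetical (JobGrade, Gender) pairs (illustrative only).
employees = [
    (1, "Female"), (1, "Female"), (1, "Male"), (2, "Female"),
    (2, "Male"), (5, "Male"), (6, "Male"), (1, "Female"),
]

counts = Counter(employees)                       # cell counts
gender_totals = Counter(g for _, g in employees)  # column totals

# Percentage of each gender falling in each job grade (columns sum to 100).
pivot = {
    (grade, gender): 100 * n / gender_totals[gender]
    for (grade, gender), n in counts.items()
}

for grade in sorted({g for g, _ in employees}):
    row = [pivot.get((grade, gender), 0.0) for gender in ("Female", "Male")]
    print(grade, [f"{p:.1f}%" for p in row])
```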

14
Regression Analysis -- continued
  • Clearly, females tend to be concentrated at the
    lower job grades.
  • This certainly helps to explain why females get
    lower salaries on average, but it doesn't explain
    why females are at the lower job grades in the
    first place.
  • We won't be able to provide a thorough analysis
    of this issue, but we can add one more piece to
    the puzzle now by adding education level, age,
    and PCJob to the equation.

15
Regression Analysis -- continued
  • We don't provide the whole equation, but the
    resulting output is shown here.

16
Regression Analysis -- continued
  • The coefficients can be seen in the output.
  • The new variables don't appear to add much to the
    previous equation. The penalty does, however, go
    up to $2555, which is slightly greater than the
    $1962.
  • At face value we can interpret the coefficients
    of the education dummies as a benefit (or loss if
    negative) of extra education relative to a high
    school diploma, the reference category.

17
Regression Analysis -- continued
  • The coefficient of PCJob implies that an employee
    with a computer-related job can expect an extra
    $4923 in salary relative to an employee without a
    computer-related job, provided the other
    variables are the same for each employee.
  • The age coefficient is quite small and has little
    effect on salary.

18
Conclusion
  • The main conclusion we can draw from the output
    is that there is still a plausible case to be
    made for discrimination against females, even
    after including information on all the variables
    in the database in the regression equation.

19
  • Modeling Possibilities

20
BANK.XLS
  • The Fifth National Bank of Springfield is facing
    a gender-discrimination suit. The charge is that
    its female employees receive substantially
    smaller salaries than its male employees.
  • The bank's employee database is listed in this
    file. Here is a partial list of the data.

21
Question
  • Earlier we estimated an equation for Salary using
    the numerical explanatory variables YrsExper and
    YrsPrior and the dummy variable Female.
  • If we drop the YrsPrior variable from the
    equation (for simplicity) and rerun the
    regression, we obtain the equation
    Predicted Salary = 35.824 + 0.981 YrsExper
    - 8.012 Female
  • The R² value for this equation is 49.1%. If we
    decide to include an interaction variable between
    YrsExper and Female in this equation, what is the
    effect?

22
Interaction Terms
  • Algebraically, an interaction variable is the
    product of two variables. Its effect is to allow
    the effect of one of the variables on Y to depend
    on the value of the other variable.
  • The interaction term allows the slope of the
    regression line to differ between the two
    categories.

23
Solution
  • We first need to form an interaction variable
    that is the product of YrsExper and Female.
  • This can be done in two ways in Excel:
  • We can do it manually by introducing a new
    variable that contains the product of the two
    variables involved, or
  • we can use the StatPro/Data Utilities/Create
    Interaction Variable menu item.
  • Using the latter way, we must select Female and
    YrsExper as the variables, and we do not check
    either of the boxes in the dialog box -- neither
    should be treated as a categorical variable.

24
Solution -- continued
  • Once the interaction variable has been created,
    we include it in the regression equation in
    addition to the other variables. The multiple
    regression output is shown here.

25
Solution -- continued
  • The estimated regression equation is
    Predicted Salary = 30.430 + 1.528 YrsExper
    + 4.098 Female - 1.248 YrsExper_Female
  • As we discussed before, it is useful to write this
    equation as two separate equations, one for
    females and one for males. The female equation
    is Predicted Salary = 34.528 + 0.280 YrsExper
    and the male equation
    is Predicted Salary = 30.430 + 1.528 YrsExper
  • Next we can show these equations graphically.
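
The two separate equations follow mechanically from the interaction equation: the Female coefficient shifts the intercept and the interaction coefficient shifts the slope. The sketch below carries out that arithmetic, taking the Female coefficient as 4.098, the value consistent with the female intercept of 34.528 (30.430 + 4.098). The crossover calculation at the end is an added illustration, not something the slides report.

```python
# Coefficients of the interaction equation (Female coefficient taken as 4.098).
B0, B_EXPER, B_FEMALE, B_INTERACT = 30.430, 1.528, 4.098, -1.248

def line_for(female):
    """Return (intercept, slope in YrsExper) for Female = 0 or 1."""
    return B0 + B_FEMALE * female, B_EXPER + B_INTERACT * female

female_line = line_for(1)
male_line = line_for(0)
print([round(v, 3) for v in female_line])
print([round(v, 3) for v in male_line])

# Years of experience at which the two fitted lines cross: the female
# head start divided by the difference in slopes.
crossover = B_FEMALE / -B_INTERACT
print(round(crossover, 2))
```

Under these coefficients the male line overtakes the female line after a little more than three years of experience.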

26
Nonparallel Female and Male Salary Lines
27
Solution -- continued
  • The Y-intercept for the female line is slightly
    higher - females with no experience at Fifth
    National Bank tend to start out slightly higher
    than males - but the slope of the female line is
    much lower. That is, males tend to move up the
    salary ladder much more quickly than females.
  • Again, this provides another argument, although a
    somewhat different one, for gender discrimination
    against females.
  • The R² value increased from 49.1% to 63.9%. The
    interaction variable has definitely added to the
    explanatory power of the equation.

28
  • Modeling Possibilities

29
BANK.XLS
  • The Fifth National Bank of Springfield is facing
    a gender-discrimination suit. The charge is that
    its female employees receive substantially
    smaller salaries than its male employees.
  • The bank's employee database is listed in this
    file. Here is a partial list of the data.

30
Question
  • A glance at the distribution of salaries of the
    208 employees shows some skewness to the right -
    a few employees make substantially more than the
    majority of employees.
  • Therefore, it might make sense to use the natural
    logarithm of Salary instead of Salary as the
    response variable.
  • If we do this, how do we interpret the results?

31
Solution
  • All of the analyses we did earlier with this data
    set could be repeated except with Log_Salary as
    the response variable.
  • For the sake of discussion we will look only at
    the regression equation with Female and YrsExper
    as explanatory variables.
  • After we create the Log_Salary variable and run
    the regression, we obtain the output shown here.

32
Regression Output with Log_Salary as Response
Variable
33
Solution
  • The estimated regression equation is
    Predicted Log_Salary = 3.5829 + 0.0188 YrsExper
    - 0.1616 Female
  • The R² and se values are 42.4% and 0.1794. For
    comparison, with Salary as the response variable
    these were 49.1% and 8.070.
  • We first note that neither of these values
    is directly comparable to the Salary values.
  • The two R² values are percentages explained of
    different response variables, Log_Salary and
    Salary. The fact that one is smaller does not
    mean a worse fit. They simply aren't comparable.

34
Solution -- continued
  • The situation for se is even worse. Each se is a
    measure of a typical residual, but the residuals
    in the Log_Salary equation are in log dollars,
    whereas the residuals in the Salary equation are
    in dollars.
  • Therefore it is no surprise that the se for the
    Log_Salary equation is much smaller than the se
    for the Salary equation.
  • If we want comparable standard error measures for
    the two equations, we should take antilogs of the
    fitted values from the Log_Salary equation to
    convert them back to dollars, subtract these from
    the original Salary values, and take the standard
    deviation of these residuals.
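
The conversion described above can be sketched as follows. The function name and the sample numbers are hypothetical. Note that the slide asks for the standard deviation of the dollar-scale residuals, which is what this computes; a regression standard error would instead divide by the residual degrees of freedom.

```python
import math
from statistics import stdev

def dollar_scale_se(salaries, fitted_logs):
    """Std. deviation of residuals after converting log fits back to dollars."""
    fitted = [math.exp(v) for v in fitted_logs]            # antilogs
    residuals = [y - f for y, f in zip(salaries, fitted)]  # actual - fitted
    return stdev(residuals)

# Hypothetical salaries and fitted Log_Salary values (illustrative only).
salaries = [30.0, 42.0, 55.0, 38.0]
fitted_logs = [math.log(32.0), math.log(40.0), math.log(50.0), math.log(39.0)]

se_comparable = dollar_scale_se(salaries, fitted_logs)
print(round(se_comparable, 3))
```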

35
Solution -- continued
  • The resulting standard deviation is 7.74. This
    is somewhat smaller than the se from the Salary
    equation, an indication of a slightly better fit.
  • Finally we interpret the equation itself.
  • When the response variable is Log_Y and a term on
    the right hand side of the equation is of the
    form bX, then whenever X increases by one unit
    Y-hat changes by a constant percentage, and this
    percentage is approximately equal to b (written
    as a percentage).

36
Solution -- continued
  • This means that for each year of experience with
    Fifth National, an employee's salary can be
    expected to increase by about 1.88%.
  • The expected percentage decrease in salary for
    females is about 16.16%.
  • In other words, this equation implies that females
    can expect to make about 16% less than men for
    comparable years of experience.
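
Reading a coefficient b directly as a percentage is an approximation that works well when b is small. The exact percentage change in the predicted value is 100(e^b - 1). The sketch below compares the two for the coefficients 0.0188 and -0.1616. The approximation is very close for the experience coefficient and somewhat rougher for the larger Female coefficient (about 15% rather than 16%).

```python
import math

def exact_pct_change(b):
    """Exact % change in predicted Y when X rises by one unit in a log-Y model."""
    return 100 * (math.exp(b) - 1)

for name, b in (("YrsExper", 0.0188), ("Female", -0.1616)):
    approx = 100 * b
    exact = exact_pct_change(b)
    print(f"{name}: approximate {approx:.2f}%, exact {exact:.2f}%")
```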

37
  • Modeling Possibilities

38
POWER.XLS
  • The Public Service Electric Company produces
    different quantities of electricity each month,
    depending on the demand.
  • This file lists the number of units of
    electricity produced (Units) and the total cost
    of producing these (Cost) for a 36-month period.
  • The data set appears on the next slide.
  • How can regression be used to analyze the
    relationship between Cost and Units?

39
Data for Electric Power
40
Solution
  • A good place to start is with a scatterplot of
    Cost versus Units.

41
Solution -- continued
  • The scatterplot indicates a definite positive
    relationship and one that is nearly linear.
  • However, there is also some evidence of curvature
    in the plot. The points increase slightly less
    rapidly as Units increase from left to right.
  • In economic terms, there may be economies of
    scale, where the marginal cost of electricity
    decreases as more units of electricity are
    produced.
  • Nevertheless, we use regression to estimate a
    linear relationship between Cost and Units.

42
Solution -- continued
  • The resulting regression equation is
    Predicted Cost = 23,651 + 30.53 Units
  • The corresponding R² and se are 73.6% and 2734.
    We also requested a scatterplot of the residuals
    versus the fitted values. The scatterplot is on
    the next slide. Obtaining this scatterplot is
    always a good idea if nonlinearity is suspected.
  • The sign of nonlinearity in this plot is that the
    residuals to the far left and the far right are
    all negative, whereas the majority of the
    residuals in the middle are positive.

43
Residuals from a Straight-Line Fit
44
Solution -- continued
  • Admittedly the pattern is far from perfect -
    there are a few negatives in the middle - but the
    plot does hint at nonlinear behavior.
  • The negative-positive-negative behavior of the
    residuals suggests a parabola, that is, a
    quadratic equation with the square of Units
    included in the equation.
  • We first create a new variable Sqr_Units in the
    data set. This can be done manually or using
    StatPro's Transform Variables menu item.

45
Solution -- continued
  • Then we use multiple regression to estimate the
    equation for Cost with both explanatory
    variables, Units and Sqr_Units, included.
  • The resulting equation from the output on the
    next slide is
    Predicted Cost = 5793 + 98.3 Units
    - 0.0600 Sqr_Units
  • Note that R² has increased to 82.2% and se has
    decreased to 2281.

46
Regression Output with Squared Term Included
47
Solution -- continued
  • One way to see how this regression equation fits
    the scatterplot of Cost versus Units is to use
    Excel's trendline option.
  • To do so, activate the scatterplot, click on any
    point and use the Chart/Add Trendline menu item,
    click the Type tab and select the Polynomial type
    of order 2, that is, a quadratic.
  • A graph of the equation is superimposed on the
    scatterplot on the following slide. It shows
    reasonably good fit, plus an obvious curvature.

48
Quadratic Fit Scatterplot
49
Solution -- continued
  • The main downside to a quadratic regression
    equation is that there is no easy interpretation
    of the coefficients of Units and Sqr_Units.
  • All we can say is that the terms in the equation
    combine to explain the nonlinear relationship
    between units produced and total cost.
  • A final note about the equation concerns the
    coefficient of Sqr_Units.
  • First, the fact that it is negative makes the
    parabola bend downward. This produces the
    decreasing marginal cost behavior, where each
    extra unit of electricity adds a smaller amount
    to total cost.
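
Although the individual coefficients have no simple interpretation, the marginal cost implied by the quadratic equation can be computed as its derivative with respect to Units, 98.3 - 2(0.0600)Units. The sketch below evaluates it at a few production levels. The levels are hypothetical, chosen only for illustration, and the equation should not be trusted far outside the range of the observed data.

```python
# Coefficients of the quadratic fit of Cost on Units and Sqr_Units.
B0, B1, B2 = 5793.0, 98.3, -0.0600

def predicted_cost(units):
    """Predicted total cost from the quadratic equation."""
    return B0 + B1 * units + B2 * units ** 2

def marginal_cost(units):
    """Derivative of predicted cost with respect to Units."""
    return B1 + 2 * B2 * units

# Hypothetical production levels (illustrative only).
for units in (300, 500, 700):
    squared_term = B2 * units ** 2   # contribution of -0.0600 * Sqr_Units
    print(units, round(marginal_cost(units), 2), round(squared_term))
```

Despite the small coefficient, the squared term contributes thousands to predicted cost at these production levels.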

50
Solution -- continued
  • Second, we shouldn't be fooled by the small
    magnitude of this coefficient. Remember that it
    is the coefficient of Units squared, which is a
    large quantity. Therefore, the effect of the
    product -0.0600 Sqr_Units is sizable.
  • One other possibility we might examine is a
    logarithmic fit.
  • In this case we create a new variable Log_Units,
    the natural logarithm of Units, and then regress
    Cost against the single variable Log_Units.

51
Solution -- continued
  • To create the new variable we can again use
    StatPro's Transform Variables menu item, and then
    we can superimpose a logarithmic curve on the
    scatterplot of Cost versus Units by using the
    trendline feature.
  • This curve appears in the scatterplot on the next
    slide.
  • To the naked eye, it appears to be similar, and
    about as good a fit as the quadratic curve.

52
Logarithmic Fit Scatterplot
53
Solution -- continued
  • The resulting regression equation is
    Predicted Cost = -63,993 + 16,654 Log_Units
  • The values of R² and se are 79.8% and 2393.
  • These latter values indicate that the logarithmic
    fit is not quite as good as the quadratic fit.
  • However, the advantage of the logarithmic
    equation is that it is easier to interpret.

54
Solution -- continued
  • In this case, where the log of the explanatory
    variable is used, we can interpret its
    coefficient as follows.
  • Suppose Units increases by 1%, for example from
    600 to 606. Then the equation implies that the
    expected Cost will increase by approximately
    $166.54, which is 1% of the coefficient 16,654.
  • In words, every 1% increase in Units is
    accompanied by an expected $166.54 increase in
    Cost.
  • Note that for larger values of Units, a 1%
    increase represents a larger absolute increase.
    But each such 1% increase entails the same
    increase in Cost. This is another way of
    describing the decreasing marginal cost property.
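
The figure of 166.54 is the coefficient divided by 100, a rule of thumb based on ln(1.01) being close to 0.01. The sketch below compares it with the exact change implied by the logarithmic equation and confirms that a 1% increase has the same effect at different starting levels. The second pair of levels (800 to 808) is a hypothetical example.

```python
import math

B0, B1 = -63993.0, 16654.0   # coefficients of the logarithmic fit

def predicted_cost(units):
    """Predicted cost from the equation with Log_Units."""
    return B0 + B1 * math.log(units)

approx = B1 / 100                                        # rule of thumb
exact_low = predicted_cost(606) - predicted_cost(600)    # 1% increase
exact_high = predicted_cost(808) - predicted_cost(800)   # also 1%

print(round(approx, 2), round(exact_low, 2), round(exact_high, 2))
```

The exact change is B1 times ln(1.01), slightly below the rule-of-thumb value, and it does not depend on the starting level.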

55
  • Modeling Possibilities

56
CARDEMAND.XLS
  • This file contains annual data (1970-1987) on
    domestic auto sales in the United States. The
    data set is shown on the next slide.
  • The variables are defined as
  • Quantity: annual domestic auto sales (in number
    of units)
  • Price: real price index of new cars
  • Income: real disposable income
  • Interest: prime rate of interest
  • Estimate and interpret a multiplicative (constant
    elasticity) relationship between Quantity and
    Price, Income, and Interest.

57
Car Demand Data
58
Constant Elasticity Relationships
  • A particular type of nonlinear relationship that
    has firm grounding in economic theory is called a
    constant elasticity relationship. It is also
    called a multiplicative relationship.
  • One property of this type of relationship is that
    the effect of a change in any explanatory
    variable Xi on Y depends on the levels of the
    other Xs in the equation.

59
Solution
  • We first take the natural logs of all four
    variables.
  • This can be done in one step using the Transform
    Variables menu item, or we can use Excel's LN
    function.
  • We then use multiple regression, with
    Log_Quantity as the response variable and
    Log_Price, Log_Income, and Log_Interest as the
    explanatory variables.
  • The resulting output is shown on the next slide,
    and the corresponding equation is
    Predicted Log_Quantity = 4.675 - 1.185 Log_Price
    + 2.183 Log_Income - 0.191 Log_Interest

60
Regression Output for Multiplicative Relationship
61
Solution -- continued
  • If we like, we can convert this back to the
    original variables, that is, back to
    multiplicative form, by taking antilogs. The
    result is
    Predicted Quantity = 107.198 Price^(-1.185)
    Income^(2.183) Interest^(-0.191)
  • In either form the equation implies that the
    elasticities are approximately equal to -1.185,
    2.183 and -0.191.
  • When Price increases by 1%, Quantity tends to
    decrease by about 1.185%; when Income increases
    by 1%, Quantity tends to increase by about
    2.183%; and when Interest increases by 1%,
    Quantity tends to decrease by about 0.191%.
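
The constant elasticity property can be checked numerically. The sketch below writes the equation in multiplicative form and raises Price by 1% while holding the other variables fixed. The input levels are hypothetical, and because the model is multiplicative the percentage change does not depend on them. The constant is taken as the antilog of the rounded intercept 4.675, so it differs slightly from the 107.198 on the slide.

```python
import math

A = math.exp(4.675)   # antilog of the rounded intercept
ELASTICITY = {"Price": -1.185, "Income": 2.183, "Interest": -0.191}

def predicted_quantity(price, income, interest):
    """Multiplicative (constant elasticity) form of the fitted equation."""
    return (A * price ** ELASTICITY["Price"]
              * income ** ELASTICITY["Income"]
              * interest ** ELASTICITY["Interest"])

# Hypothetical input levels; only the ratio below matters.
base = predicted_quantity(100.0, 50.0, 8.0)
bumped = predicted_quantity(101.0, 50.0, 8.0)     # Price up 1%
pct_change = 100 * (bumped / base - 1)

# The same 1% price increase from different levels of the other variables.
base2 = predicted_quantity(250.0, 80.0, 12.0)
bumped2 = predicted_quantity(252.5, 80.0, 12.0)
pct_change2 = 100 * (bumped2 / base2 - 1)

print(round(pct_change, 3), round(pct_change2, 3))
```

The exact change is slightly smaller in magnitude than the elasticity of -1.185, which is why the slide says "about".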

62
Conclusions
  • Does this multiplicative equation provide a
    better fit to the automobile data than does an
    additive relationship?
  • Without doing considerably more work, it is
    difficult to answer this question with
    certainty.
  • As we discussed previously, it is not sufficient
    to compare R2 and se values for the two fits.
  • We will simply state that the multiplicative
    relationship provides a reasonably good fit, and
    it makes sense economically.

63
  • Modeling Possibilities

64
LEARNING.XLS
  • The Presario Company produces a variety of small
    industrial products.
  • It has just finished producing 22 batches of a
    new product (new to Presario) for a customer.
  • This file contains the times (in hours) to
    produce each batch. These data are in the table
    on the next slide.
  • Clearly, the times have tended to decrease as
    Presario has gained more experience in making the
    product.

65
Data for Learning Curve
  • Does the multiplicative learning model apply to
    these data, and what does it imply about the
    learning rate?

66
Learning Curve Model
  • A final example of a multiplicative relationship
    is the learning curve model.
  • A learning curve relates the unit production time
    (or cost) to the cumulative volume of output
    since that production process first began.
  • Empirical studies indicate that production times
    tend to decrease by a relatively constant
    percentage every time cumulative output doubles.
  • The constant percentage is called the learning
    rate.

67
Solution
  • One way to check whether the multiplicative
    learning model is reasonable is to create the log
    variables Log_Time and Log_Batch in the usual way
    and then see whether a scatterplot of Log_Time
    versus Log_Batch is approximately linear.
  • The multiplicative model implies that it should
    be.
  • Such a scatterplot is shown on the next slide,
    along with a superimposed linear trend line. The
    fit appears to be quite good.

68
Scatterplot of Log Variables with Linear Trend
Superimposed
69
Solution -- continued
  • To estimate the relationship, we regress Log_Time
    on Log_Batch. The resulting equation
    is Predicted Log_Time = 4.834 - 0.155 Log_Batch
  • There are a couple of ways of interpreting this
    equation.
  • First, because it is based on a multiplicative
    relationship, we can interpret the coefficient
    -0.155 as an elasticity. That is, when Batch
    increases by 1%, Time tends to decrease by
    approximately 0.155%. Although this is correct, it
    is not as useful as the doubling
    interpretation.

70
Solution -- continued
  • We know that the estimated learning rate
    satisfies
    -0.155 = ln(learning rate)/ln(2)
    Solving for the learning rate (multiply through
    by ln(2) and then take antilogs), we find that
    it is 0.898, or approximately 90%. In other
    words, whenever cumulative production doubles,
    the time to produce a batch decreases by about
    10%.
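
The learning rate calculation can be verified in a few lines. The sketch below solves for the learning rate from the slope -0.155 and checks that it equals the ratio of predicted times when the batch number doubles. The batch numbers 10 and 20 are arbitrary illustrative choices.

```python
import math

INTERCEPT, SLOPE = 4.834, -0.155   # fitted Log_Time equation

# Multiply the slope by ln(2), then take the antilog (same as 2 ** SLOPE).
learning_rate = math.exp(SLOPE * math.log(2))

def predicted_time(batch):
    """Predicted hours for a batch: antilog of the fitted Log_Time."""
    return math.exp(INTERCEPT + SLOPE * math.log(batch))

# Doubling cumulative output multiplies predicted time by the learning rate.
ratio = predicted_time(20) / predicted_time(10)
print(round(learning_rate, 3), round(ratio, 3))
```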

71
Predicting Future Production Times
  • Presario could use this regression equation to
    predict future production times.
  • For example, suppose the customer places an order
    for 15 more batches of the same product. We can
    use the equation to predict the log of production
    time for each batch, then take their antilogs and
    sum them to obtain the total production time.
  • The calculations are shown in rows 26-42 of the
    following table. The total predicted time to
    finish is about 1115 hours.
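
The prediction steps can be sketched as follows, assuming the 15 new batches are numbered 23 through 37, directly after the 22 batches already produced. The sketch uses the rounded coefficients 4.834 and -0.155, so the total may differ slightly from a spreadsheet that uses unrounded values.

```python
import math

INTERCEPT, SLOPE = 4.834, -0.155   # fitted Log_Time equation

def predicted_time(batch):
    """Predicted hours for a given batch number (antilog of the fitted log)."""
    return math.exp(INTERCEPT + SLOPE * math.log(batch))

# Assumption: the new order covers batches 23 through 37.
new_batches = range(23, 38)
times = [predicted_time(b) for b in new_batches]
total_hours = sum(times)

print(len(times), round(total_hours))
```

The total comes out close to the figure of about 1115 hours quoted above.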

72
Using the Learning Curve Model for Predictions