Simple Regression
  • In confidence intervals and hypothesis testing,
    we examined a single parameter of one variable
  • In chi-square analysis, we also can test for the
    existence of a relationship between two variables
  • In regression analysis, we also can determine the
    strength of the relationship between two or more
    variables, build models, and make predictions
  • We will begin with a fairly complete description
    of two-variable models before moving to multiple
    regression and logistic regression

The Linear Model
  • The simplest relationship between two variables
    is a linear one
  • y = α + βx
  • x = independent or explanatory variable (cause)
  • y = dependent or response variable (effect)
  • α = intercept (value of y when x = 0)
  • β = slope (change in y when x increases by one unit)
  • Before beginning a regression, think about
    whether a linear relationship is reasonable

The Error Term
  • Linear relationships are rarely precise:
    measurement error and inherent variability
    result in a distribution of values of y for a
    given value of x.
  • We represent this variability with an error term, ε:
  • y = α + βx + ε
  • where ε is a normally distributed random variable
    with a mean of 0 and a standard deviation of σ
    (constant, independent of x: homoscedasticity).
  • A plot of P(y|x) vs. y is a normal distribution
    with a mean of α + βx and a standard deviation of σ

[Figure: probability of y for a given value of x]

Estimating α, β, σ: Least Squares
  • Based on sample data (xi, yi), choose a and b
    (estimates of α and β) such that the sum of the
    squared residuals is minimized (least squares)
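For a single x variable the least-squares estimates have a closed form: b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a = ȳ − b·x̄. The sketch below shows that calculation in plain Python. The function name least_squares and the sample data are illustrative assumptions, not part of the original material.

```python
def least_squares(xs, ys):
    """Return (a, b): the intercept and slope that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

These are the same quantities that Excel's INTERCEPT and SLOPE functions return.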

Check that values of a and b make sense!

Residuals and Standard Error
  • The predicted or best-fit value of y for a
    given xi is y-hat; the difference between the
    observed value of y and y-hat is the residual
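As a rough illustration, the residuals and the standard error of the estimate, se = √(SSE/(n − 2)), can be computed as follows. This is the quantity Excel's STEYX returns. The helper names and the data are assumptions made for this sketch.

```python
import math

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys, a, b):
    """Residual e_i = observed y_i minus predicted y-hat_i."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def standard_error(res):
    """Standard error of the estimate: sqrt(SSE / (n - 2))."""
    sse = sum(e * e for e in res)
    return math.sqrt(sse / (len(res) - 2))

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
res = residuals(xs, ys, a, b)
se = standard_error(res)
print("residuals:", [round(e, 2) for e in res])
print(f"se = {se:.4f}")
```

The residuals of a least-squares line with an intercept always sum to zero, which makes a quick sanity check.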

The key assumption is that the residuals are
independent and normally distributed with a
constant standard deviation.

Excel Functions
  • Scatterplot
      • Chart/Add Trendline/Linear/Display equation, R²
  • Menu-driven tools
      • Tools/Data Analysis/Regression
      • Tools/Data Analysis Plus/Stepwise Regression
      • Tools/OLS Regression
  • Excel functions
      • a = INTERCEPT(y-range, x-range)
      • b = SLOPE(y-range, x-range)
      • se = STEYX(y-range, x-range)

Goodness of Fit
  • The total variability in y (SST) has two components:
      • variability explained by the regression (SSR)
      • remaining or unexplained variability (SSE)
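The decomposition SST = SSR + SSE can be checked numerically. The sketch below computes the three sums of squares and the ratio SSR/SST for a small made-up data set. The function name decompose and the data are assumptions for illustration only.

```python
def decompose(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    y_hat = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variability
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by regression
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained
    return sst, ssr, sse

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = decompose(xs, ys)
r2 = ssr / sst
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}, R2 = {r2:.4f}")
```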

The proportion of the variability that is
explained by the regression is the coefficient
of determination, R² = SSR/SST.

Degree of Linear Correlation
  • R² = 1: perfect linear correlation; R² = 0: no
    linear correlation
  • High R² means a good fit only if a linear model is
    appropriate; always check with a scatterplot
  • Correlation does not prove causation: x and y may
    both be correlated with a third (possibly
    unidentified) variable
  • A more popular (but less meaningful) measure is
    the correlation coefficient, r

R² = RSQ(y-range, x-range); r = ±√R² (the sign matches the slope b)
[Figure: four panels, each labeled R² = 0.67]

Inferences about the Slope
  • The standard error of the slope is
    sb = se / (sx·√(n − 1))
  • Note that sb depends on both se and sx. All else
    equal, a larger sx produces a smaller sb.
  • Confidence interval: β = b ± t(α/2, df)·sb, where
    df = n − 2 for one independent variable
  • To test the null hypothesis that β = 0 (i.e., no
    association between x and y), find the p-value
    for t = b/sb

Inferences about Correlation
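  • You can do this same hypothesis test (no
    correlation) if you know only r (or R²) and n.
    The standard error of r is equal to
    sr = √((1 − r²)/(n − 2))
  • It can be shown that t = r/sr = b/sb, so the test
    gives the same p-value as the test on the slope
  • So R² = 0.1 is significant (α = 0.05) if n > 40.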
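A small numerical check can make the equivalence of the two tests concrete. The sketch below computes the slope's standard error, its t statistic, a confidence interval, and the t statistic obtained from r and n alone. The function name, the data, and the critical value passed in are assumptions for this example. Python's standard library has no t distribution, so the critical value is taken from a t table rather than computed.

```python
import math

def slope_inference(xs, ys, t_crit):
    """t statistics for the slope and for r, plus a confidence interval for the slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    sb = se / math.sqrt(sxx)              # equals se / (sx * sqrt(n - 1))
    r = sxy / math.sqrt(sxx * syy)
    sr = math.sqrt((1 - r * r) / (n - 2))
    return {
        "b": b,
        "sb": sb,
        "t_slope": b / sb,
        "t_corr": r / sr,
        "ci": (b - t_crit * sb, b + t_crit * sb),
    }

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# 3.182 is the two-sided 95% critical value of t for df = n - 2 = 3 (from a t table)
out = slope_inference(xs, ys, t_crit=3.182)
print(f"b = {out['b']:.3f}, sb = {out['sb']:.4f}")
print(f"t from slope = {out['t_slope']:.2f}, t from r = {out['t_corr']:.2f}")
print(f"95% CI for the slope: ({out['ci'][0]:.3f}, {out['ci'][1]:.3f})")
```

The two t values agree, so a test based only on r and n leads to the same conclusion as the test on the slope.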
[Figure: significant value of R² for given n]

What Does Regression Mean?
  • Draw best-fit line freehand
  • Find mother's height = 60; find average
    daughter's height
  • Repeat for mother's height = 62, 64, …, 70; draw
    best-fit line for these points
  • Draw line: daughter's height = mother's height
  • For a given mother's height, daughter's height
    tends to be between mother's height and mean
    height: regression toward the mean

What Does Regression Mean?
  • The value of y predicted by the best-fit line for
    a given x is y-hat = a + bx
  • This prediction is uncertain for two reasons:
      • The estimated regression line isn't the true
        regression line (a ≠ α, b ≠ β); this uncertainty
        is reduced as the sample size, n, is increased
      • There is natural variability in y for a given
        value of x. We model this with a normal
        distribution with constant standard deviation σ

Uncertainty in Mean Value of y
  • If we knew the exact values of α and β, there
    would be no uncertainty in the mean value of y
    for a given value of x (i.e., the best-fit line)
  • The uncertainty in the mean value (y-hat) that
    arises from the uncertainty in a and b is
    se·√(1/n + (x − x̄)²/((n − 1)·sx²)), where x̄ is the
    mean of the xi
  • This is the error in the location of the best-fit
    line
  • When x = 0, this is the standard error of the
    intercept, a

Uncertainty in Individual Predicted Value
  • If we knew the exact values of α and β, the
    uncertainty in any individual value of y would be
    given by σ ≈ se, regardless of the value of x.
  • The overall uncertainty, including that arising
    from the uncertainty in a and b, is
    se·√(1 + 1/n + (x − x̄)²/((n − 1)·sx²)), where x̄ is
    the mean of the xi
  • Error grows as x moves away from the middle of
    the data. Extrapolation (predicting y for x
    outside the range of the original data) is frowned
    upon.
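To see how both uncertainties grow away from the middle of the data, the sketch below evaluates the standard error of the mean value and of an individual value at a few x values. It uses the standard textbook formulas, written with Σ(xi − x̄)² in place of (n − 1)·sx². The function name and data are assumptions for illustration.

```python
import math

def prediction_errors(xs, ys, x0):
    """Return (y_hat, s_mean, s_indiv) at x0 for the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)   # equals (n - 1) * sx^2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    core = 1 / n + (x0 - x_bar) ** 2 / sxx
    s_mean = se * math.sqrt(core)        # uncertainty in the mean value of y
    s_indiv = se * math.sqrt(1 + core)   # uncertainty in an individual value of y
    return a + b * x0, s_mean, s_indiv

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for x0 in (3, 5, 10):   # x0 = 10 is an extrapolation beyond the data
    y_hat, s_mean, s_indiv = prediction_errors(xs, ys, x0)
    print(f"x = {x0:>2}: y-hat = {y_hat:.2f}, s_mean = {s_mean:.3f}, s_indiv = {s_indiv:.3f}")
```

Multiplying either standard error by the appropriate t value (df = n − 2) gives the half-width of the corresponding interval.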

Prediction Using Data Analysis Plus
  • Enter in the spreadsheet the value of x for which
    you would like to calculate y-hat and its
    confidence interval
  • Tools/Data Analysis Plus/Prediction Interval
  • Input y range and x range; click Labels if applicable
  • Input given x range
  • Input confidence level, click OK

Analysis of Residuals
  • Plot the residuals and look for:
  • Outliers. Check residuals outside ±3se. Because
    regression minimizes the sum of the squared
    residuals, the results are sensitive to outliers,
    particularly for extreme values of x.

Testing for Outliers
  • Compute and inspect standardized residuals
  • To see whether a potential outlier observation is
    important, delete the case and rerun the
    regression
  • If regression coefficients are basically
    unchanged (new values are within the confidence
    intervals of the original regression), the
    observation is not an important outlier
  • Otherwise, consider whether there is a reasonable
    basis for removing the observation
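The delete-and-rerun check can be sketched in a few lines. The example below fits the line with and without a suspect case and prints both sets of coefficients. The data are made up, with a deliberately extreme last point, and the helper names are assumptions. Comparing the new coefficients against the original confidence intervals is left to the reader.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def drop_case(xs, ys, i):
    """Return copies of xs and ys with observation i removed."""
    return xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]

# Made-up data: roughly y = 2x, except for the last point at an extreme x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 30.0]

a_all, b_all = least_squares(xs, ys)
suspect = len(xs) - 1                      # index of the potential outlier
xs_d, ys_d = drop_case(xs, ys, suspect)
a_drop, b_drop = least_squares(xs_d, ys_d)

print(f"all cases:       a = {a_all:.3f}, b = {b_all:.3f}")
print(f"without suspect: a = {a_drop:.3f}, b = {b_drop:.3f}")
```

Here the slope changes substantially when the case is removed, so the observation would count as an important outlier.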

Testing for Non-Normality
  • Assumption: residuals are normally distributed
  • Estimates of regression coefficients are fairly
    robust to violations of this assumption;
    significant violations are usually evidenced by
    outliers or other problems with residuals
  • To inspect visually, make a histogram of the
    residuals and check that it is approximately
    bell-shaped and symmetrical
  • Data Analysis Plus/Chi-Squared Test of Normality

Analysis of Residuals
  • Non-linearity. The mean value of the residuals
    should be zero, independent of x. If residuals
    exhibit a curved pattern, a non-linear model may
    be more appropriate.

Testing for Non-Linearity
  • Visual inspection is usually sufficient
  • Right-click on data in residual plot, select Add
    Trendline, select second-order polynomial
    (quadratic), include R² on chart
  • If R² is large (e.g., > 4/(n + 2)), then curvature
    is significant; consider a non-linear model

Analysis of Residuals
  • Heteroscedasticity. The standard deviation, se,
    should be constant, independent of x. If the
    spread of residuals increases with x, a
    logarithmic model may be appropriate.

Testing for Heteroscedasticity
  • Visual inspection is usually sufficient
  • In professional statistics programs (SPSS,
    Stata), use the Cook-Weisberg test
  • Otherwise, split the data into two parts and
    do a hypothesis test to compare the
    average size of the residuals in each part
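One simple way to carry out the split-sample idea is to compare the mean squared residual in the low-x half of the data with that in the high-x half. The sketch below does this with a single overall fit. It is an informal variance-ratio comparison under assumptions chosen for this example, not the Cook-Weisberg test, and it stops short of the formal hypothesis test. The data and function names are made up.

```python
def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def variance_ratio(xs, ys):
    """Mean squared residual of the high-x half divided by that of the low-x half."""
    a, b = least_squares(xs, ys)
    pairs = sorted(zip(xs, ys))                   # order the cases by x
    res = [y - (a + b * x) for x, y in pairs]
    half = len(res) // 2
    low, high = res[:half], res[half:]
    ms_low = sum(e * e for e in low) / len(low)
    ms_high = sum(e * e for e in high) / len(high)
    return ms_high / ms_low

# Made-up data: y = 2x plus noise whose spread grows with x
xs = list(range(1, 11))
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -1.0, 1.5, -2.0, 2.5, -3.0]
ys = [2 * x + e for x, e in zip(xs, noise)]

ratio = variance_ratio(xs, ys)
print(f"high-x / low-x mean squared residual = {ratio:.1f}")
```

A ratio far from 1 suggests that the spread of the residuals is not constant. A formal test would compare the ratio with an F critical value.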

Analysis of Residuals
  • Autocorrelation. The residuals should be random
    and uncorrelated. If there is a regular pattern in
    the residuals (e.g., up-down-up-down), common in
    time-series data, dummy, lagged, or difference
    variables may be needed.

Testing for Autocorrelation
  • First-order autocorrelation: test for correlation
    between e(t) and e(t−1)
  • Second-order (e(t) and e(t−2)), third-order, etc.
  • Durbin-Watson test
      • D ≈ 2 − 2r
      • If no autocorrelation, D ≈ 2
      • If strong positive autocorrelation, D ≈ 0
      • If strong negative autocorrelation, D ≈ 4
      • Critical value of D depends on n, α, k
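The Durbin-Watson statistic is defined as D = Σ(e(t) − e(t−1))² / Σe(t)², with the numerator summed from the second residual onward. The sketch below computes D and the lag-1 autocorrelation r for two made-up residual series, so the approximate relation D ≈ 2 − 2r can be seen in practice. The function names and series are assumptions for illustration.

```python
def durbin_watson(res):
    """D = sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    den = sum(e * e for e in res)
    return num / den

def lag1_autocorrelation(res):
    """Approximate first-order autocorrelation of residuals (assumed to have mean zero)."""
    num = sum(res[t] * res[t - 1] for t in range(1, len(res)))
    den = sum(e * e for e in res)
    return num / den

# Made-up residual series (not from the slides)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # up-down-up-down: negative autocorrelation
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # long runs: positive autocorrelation

for name, res in (("alternating", alternating), ("persistent", persistent)):
    d = durbin_watson(res)
    r = lag1_autocorrelation(res)
    print(f"{name}: D = {d:.3f}, r = {r:.3f}, 2 - 2r = {2 - 2 * r:.3f}")
```

With series this short the approximation D ≈ 2 − 2r is only rough. It improves as n grows.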

Transformations
  • Transformations are used for three main reasons:
      • to linearize non-linear relationships between
        independent and dependent variables
      • to produce residuals that are normally
        distributed with constant standard deviation
      • to remove autocorrelation from a time series
  • We will focus on the first two. No information is
    lost in transformations, but care must be taken
    in interpreting the coefficients, and the
    transformed model must be validated.

Exponential Function
  • y = a·e^(bx); linearize by regressing log(y) on x
  • Use if a unit change in x produces a fixed
    percentage change in y
  • Most common in time series, when y grows at a
    constant rate (roughly 100·b percent per year
    when b is small)
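As a sketch of the linearization, the example below generates exact exponential data and recovers a and b by regressing the natural log of y on x. The intercept of that regression is log(a), so it is exponentiated at the end. The function name and the synthetic data (a = 3, b = 0.2) are assumptions for illustration.

```python
import math

def least_squares(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic data generated from y = 3 * e^(0.2 x), with no noise
xs = [0, 1, 2, 3, 4]
ys = [3.0 * math.exp(0.2 * x) for x in xs]

log_a, b = least_squares(xs, [math.log(y) for y in ys])   # regress ln(y) on x
a = math.exp(log_a)                                       # undo the log on the intercept
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real, noisy data the fit minimizes squared errors in log(y), not in y, so the back-transformed model should still be checked against the original data.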

Power Function
  • y = a·x^b; linearize by regressing log(y) on log(x)
  • Use if a one percent change in x produces a fixed
    percentage change in y; the slope b is the elasticity
  • Also used to stabilize variance
  • Convenient if y is product of several factors
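The log-log regression can be sketched the same way. The example below recovers the exponent b (the elasticity) and the constant a from exact power-law data. The function name and the synthetic data (a = 2, b = 1.5) are assumptions for illustration.

```python
import math

def least_squares(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Synthetic data generated from y = 2 * x^1.5, with no noise
xs = [1, 2, 4, 8, 16]
ys = [2.0 * x ** 1.5 for x in xs]

log_a, b = least_squares([math.log(x) for x in xs],
                         [math.log(y) for y in ys])   # regress ln(y) on ln(x)
a = math.exp(log_a)
print(f"a = {a:.3f}, elasticity b = {b:.3f}")
```

An elasticity of 1.5 means that a one percent increase in x goes with roughly a 1.5 percent increase in y.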

Logarithmic Function
  • y = a + b·log(x); regress y on log(x)
  • Use if a one percent change in x produces a fixed
    change in y
  • Also used to stabilize variance

Other Transformations
  • Polynomial: mostly used to improve fit
      • y = a + b1·x + b2·x²: define x2 = x² and regress
      • y = a + b1·x + b2·x2
  • Poisson: used if y = counts, to stabilize variance
  • Binomial: used if y = proportion, to stabilize
    variance
  • Logistic: used to model populations, resources

Weighted Least Squares
  • Least squares regression model assumes errors are
    normally distributed with constant variance
  • Sometimes we can correct for heteroscedasticity
    with transformations, e.g., log(y) instead of y
  • Sometimes each y has a different measurement
    error, or each y represents a different sized
    population. In this case we can use weighted
    least squares, in which each case is given a
    different weight in determining the best-fit
    line. Unfortunately, Excel does not include
    weighted least squares.
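Outside Excel, weighted least squares for a single x is short to write by hand. The sketch below minimizes Σ wi·(yi − a − b·xi)² using weighted means. The function name, the data, and the weights are assumptions for illustration. A common choice, assumed here, is to make each weight proportional to 1/variance of that observation.

```python
def weighted_least_squares(xs, ys, ws):
    """Intercept a and slope b minimizing the weighted sum of squared residuals."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up illustrative data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Equal weights reproduce ordinary least squares
a_eq, b_eq = weighted_least_squares(xs, ys, [1, 1, 1, 1, 1])

# Hypothetical weights: the last observation is treated as much less precise
a_w, b_w = weighted_least_squares(xs, ys, [4, 4, 4, 4, 1])

print(f"equal weights: a = {a_eq:.3f}, b = {b_eq:.3f}")
print(f"downweighted:  a = {a_w:.3f}, b = {b_w:.3f}")
```

Giving a case zero weight is equivalent to deleting it, which offers a quick way to check the function.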

