Title: Regression Analysis
1. Regression Analysis
2. Introduction to Regression Analysis (RA)
- Regression analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables:
    Y = f(X1, X2, X3, ..., Xn) + e
- Note:
  - f( ) describes the systematic variation in the relationship.
  - e represents the unsystematic variation (or random error) in the relationship.
3. In other words, the observations we are interested in can be separated into two parts:
    Y = f(X1, X2, X3, ..., Xn) + e
- Observations = Model + Error
- Observations = Signal + Noise
- Ideally, the noise should be very small compared to the model.
4. Signal to Noise
- What we observe can be divided into signal + noise.
5. Model Specification
- If the true function is
    yi = β0 + β1Xi + β2Zi
  and we fit
    yi = b0 + b1Xi + b2Zi + ei
  our model is exactly specified, and we obtain an unbiased and efficient estimate.
6. Model Specification
- Next, if the true function is
    yi = β0 + β1Xi + β2Zi + β3XiZi + β4Zi²
  and we fit
    yi = b0 + b1Xi + b2Zi + ei
  our model is underspecified: we excluded some necessary terms, and we obtain a biased estimate.
7. Model Specification
- On the other hand, if the true function is
    yi = β0 + β1Xi + β2Zi
  and we fit
    yi = b0 + b1Xi + b2Zi + b3XiZi + ei
  our model is overspecified: we included some unnecessary terms, and we obtain an inefficient estimate.
8. Model Specification
- If you specify the model exactly, there is no bias.
- If you overspecify the model (add more terms than needed), the result is unbiased but inefficient.
- If you underspecify the model (omit one or more necessary terms), the result is biased.
- Overall strategy:
  - The best option is to specify the true function exactly.
  - We would prefer to err by overspecifying our model, because that leads only to inefficiency.
  - Therefore, start with a likely overspecified model and reduce it.
9. An Example
- Consider the relationship between advertising (X1) and sales (Y) for a company.
- There probably is a relationship: as advertising increases, sales should increase.
- But how would we measure and quantify this relationship?
10. A Scatter Plot of the Data
11. The Nature of a Statistical Relationship
12. A Simple Linear Regression Model
- The scatter plot shows a linear relationship between advertising and sales, so we estimate the model Y = b0 + b1X1 + e.
13. Determining the Best Fit
- Numerical values must be assigned to b0 and b1 so that the error sum of squares (ESS) is minimized.
- If ESS = 0, our estimated function fits the data perfectly.
- We could solve this problem using Solver...
14. Estimation: Linear Regression
- Formula for a straight line:
    y = b0 + b1x + e
  where y is the outcome variable, x is the program (input) variable, b0 is the intercept, and b1 is the slope (Δy / Δx).
- We want to solve for b0 and b1.
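As a concrete illustration of solving for b0 and b1, here is a minimal Python sketch of the least-squares formulas above (the x and y values are made-up, not the advertising/sales sample from these slides):

```python
import numpy as np

# Made-up illustration data.
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
y = np.array([180.0, 250.0, 310.0, 370.0, 420.0])

# Closed-form least-squares estimates:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```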
15. The Estimated Regression Function
- The estimated regression function for the advertising/sales data is ŷ = 36.342 + 5.550 X1 (both variables in $000s).
16. Evaluating the Fit
[Figure: scatter plot of Sales (in $000s) against Advertising (in $000s) with the fitted regression line; R² = 0.9691.]
17. The R² Statistic
- The R² statistic indicates how well an estimated regression function fits the data, with 0 ≤ R² ≤ 1.
- It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation.
- To understand this better, consider the following graph...
18. Error Decomposition
[Figure: for observation i, the deviation of the actual value Yi from the mean Ȳ decomposes into the portion explained by the estimated line Ŷ = b0 + b1X and the residual Yi − Ŷi.]
19. Partition of the Total Sum of Squares
- Σ(Yi − Ȳ)² = Σ(Yi − Ŷi)² + Σ(Ŷi − Ȳ)²
- or, TSS = ESS + RSS
21. Degree of Linear Correlation
- R² = 1: perfect linear correlation; R² = 0: no correlation.
- A high R² means a good fit only if the linear model is appropriate -- always check with a scatterplot.
- Correlation does not prove causation: x and y may both be correlated with a third (possibly unidentified) variable.
- A more popular (but less meaningful) measure is the correlation coefficient:
    R² = RSQ(y-range, x-range)
    r = CORREL(y-range, x-range)
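For readers working outside Excel, here is a rough Python equivalent of RSQ and CORREL (made-up illustration data); for simple linear regression, R² is simply the square of r:

```python
import numpy as np

# Made-up illustration data.
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
y = np.array([180.0, 250.0, 310.0, 370.0, 420.0])

r = np.corrcoef(x, y)[0, 1]   # analogous to CORREL(y-range, x-range)
r_squared = r ** 2            # analogous to RSQ(y-range, x-range)
print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
```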
22. [Figure: four different scatterplots, each with R² = 0.67, illustrating that the same R² value can arise from very different data patterns.]
23. Testing for Significance: F Test
- Hypotheses:
  - H0: β1 = 0
  - Ha: β1 ≠ 0
- Test statistic: F = MSR / MSE
- Rejection rule: reject H0 if F > Fα, where Fα is based on an F distribution with 1 d.f. in the numerator and n − 2 d.f. in the denominator.
24. Some Cautions about the Interpretation of Significance Tests
- Rejecting H0: β1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.
- Just because we are able to reject H0: β1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.
25. An Example of Inappropriate Interpretation
- A study shows that, in elementary schools, spelling ability is stronger among students with larger feet.
- Could we conclude that foot size influences spelling ability?
- Or is there another factor (such as the students' age) that influences both foot size and spelling ability?
26. Making Predictions
- Estimated Sales = 36.342 + 5.550 × 65 = 397.092
- So when $65,000 is spent on advertising, we expect the average sales level to be $397,092.
27. The Standard Error
- The standard error of the estimate is Se = √(ESS / (n − 2)); for our example, Se = 20.421.
- This is helpful in making predictions...
28. An Approximate Prediction Interval
- An approximate 95% prediction interval for a new value of Y when X1 = X1h is given by
    ŷh ± 2Se, where ŷh = b0 + b1X1h.
- Example: if $65,000 is spent on advertising,
  - 95% lower prediction limit: 397.092 − 2 × 20.421 = 356.250
  - 95% upper prediction limit: 397.092 + 2 × 20.421 = 437.934
- If we spend $65,000 on advertising, we are approximately 95% confident actual sales will be between $356,250 and $437,934.
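A small Python sketch reproducing this approximate interval from the values reported on the slides (b0 = 36.342, b1 = 5.550, Se = 20.421):

```python
# Approximate 95% prediction interval: y-hat ± 2 * Se,
# using the fitted values reported on the slides.
b0, b1, se = 36.342, 5.550, 20.421
x_h = 65                      # $65,000 of advertising (in $000s)
y_hat = b0 + b1 * x_h         # 397.092
lower, upper = y_hat - 2 * se, y_hat + 2 * se
print(f"approx. 95% PI: [{lower:.3f}, {upper:.3f}]")  # [356.250, 437.934]
```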
29. An Exact Prediction Interval
- A (1 − α)% prediction interval for a new value of Y when X1 = X1h is given by
    ŷh ± t(1−α/2; n−2) Sp,
  where ŷh = b0 + b1X1h and Sp = Se √(1 + 1/n + (X1h − X̄1)² / Σ(X1i − X̄1)²).
30. Example
- If $65,000 is spent on advertising:
  - 95% lower prediction limit: 397.092 − 2.306 × 21.489 = 347.556
  - 95% upper prediction limit: 397.092 + 2.306 × 21.489 = 446.666
- If we spend $65,000 on advertising, we are 95% confident actual sales will be between $347,556 and $446,666.
- This interval is only about $20,000 wider than the approximate one calculated earlier, but it was much more difficult to create.
- The greater accuracy is not always worth the trouble.
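A sketch of the exact interval using the slides' reported Sp = 21.489 and a t value from scipy. The t value 2.306 implies 8 d.f., i.e. n = 10 observations (an inference, not stated on the slides), so the endpoints below differ from the slides' by small rounding:

```python
from scipy import stats

# Exact 95% prediction interval: y-hat ± t(alpha/2, n-2) * Sp.
y_hat, s_p, n = 397.092, 21.489, 10   # n = 10 is inferred from t = 2.306
t_crit = stats.t.ppf(0.975, n - 2)    # ≈ 2.306
lower, upper = y_hat - t_crit * s_p, y_hat + t_crit * s_p
print(f"exact 95% PI: [{lower:.3f}, {upper:.3f}]")  # ≈ [347.5, 446.6]
```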
31. Comparison of Prediction Interval Techniques
[Figure: Sales vs. Advertising Expenditures, showing the regression line, the prediction intervals created using the standard error Se, and the prediction intervals created using the standard prediction error Sp.]
32. Confidence Intervals for the Mean
- A (1 − α)% confidence interval for the true mean value of Y when X1 = X1h is given by
    ŷh ± t(1−α/2; n−2) Sa,
  where ŷh = b0 + b1X1h and Sa = Se √(1/n + (X1h − X̄1)² / Σ(X1i − X̄1)²).
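A sketch of this confidence interval computed from raw data (made-up values). Note that it omits the "+1" inside the square root that the prediction interval has, so it is narrower:

```python
import numpy as np
from scipy import stats

# Made-up illustration data.
x = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
y = np.array([180.0, 250.0, 310.0, 370.0, 420.0])
n, x_h, alpha = len(x), 55.0, 0.05

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# Standard error of the estimated MEAN of Y at x_h (no "+1" term).
s_a = se * np.sqrt(1.0 / n + (x_h - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
y_hat = b0 + b1 * x_h
print(f"95% CI for mean Y: [{y_hat - t_crit * s_a:.2f}, {y_hat + t_crit * s_a:.2f}]")
```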
33. A Note About Extrapolation
- Predictions made using an estimated regression
function may have little or no validity for
values of the independent variables that are
substantially different from those represented in
the sample.
34. What Does Regression Mean?
35. What Does Regression Mean?
- Draw a best-fit line freehand.
- Find the average daughter's height for mothers of height 60".
- Repeat for mothers' heights 62", 64", ..., 70"; draw the best-fit line through these points.
- Draw the line: daughter's height = mother's height.
- For a given mother's height, the daughter's height tends to fall between the mother's height and the mean height -- "regression toward the mean."
36. What Does Regression Mean?
37. Residual Analysis
- Residual for observation i:
    ei = yi − ŷi
- Standardized residual for observation i:
    ei / sei, where sei is the standard deviation of residual i.
38. [Figure: residual plots.]
39. Residual Analysis
- Detecting outliers:
  - An outlier is an observation that is unusual in comparison with the other data.
  - Minitab classifies an observation as an outlier if its standardized residual is < −2 or > +2.
  - This standardized residual rule sometimes fails to identify an unusually large observation as an outlier.
  - This rule's shortcoming can be circumvented by using studentized deleted residuals.
  - The i-th studentized deleted residual will be larger than the i-th standardized residual.
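A minimal sketch of the ±2 rule on made-up data with one planted outlier. This simple version standardizes every residual by the overall standard error; packages such as Minitab also adjust each residual for its leverage:

```python
import numpy as np

# Made-up data: a clean line y = 3x with one unusually large y at x = 6.
x = np.arange(1.0, 11.0)
y = 3.0 * x
y[5] = 33.0

n = len(y)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se = np.sqrt(np.sum(resid ** 2) / (n - 2))

std_resid = resid / se
print(np.round(std_resid, 2))
print("outliers at indices:", np.where(np.abs(std_resid) > 2)[0])  # flags index 5
```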
40. Multiple Regression Analysis
- Most regression problems involve more than one independent variable:
    Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
- The optimal values for the bi can again be found by minimizing the ESS.
- The resulting function fits a hyperplane to our sample data.
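A minimal Python sketch of fitting such a hyperplane by minimizing the ESS with a least-squares solver (the X and y values are made-up):

```python
import numpy as np

# Made-up data: two independent variables, five observations.
X = np.array([[2100.0, 2.0],
              [1800.0, 1.0],
              [2400.0, 2.0],
              [1600.0, 1.0],
              [2900.0, 3.0]])
y = np.array([135.0, 110.0, 150.0, 95.0, 180.0])

# Prepend a column of ones so b[0] is the intercept b0.
X1 = np.column_stack([np.ones(len(y)), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("b0, b1, b2 =", np.round(b, 4))
```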
41. Example: Regression Surface for Two Independent Variables
[Figure: the fitted regression surface for Y plotted over the (X1, X2) plane.]
42. Multiple Regression Example: Real Estate Appraisal
- A real estate appraiser wants to develop a model to help predict the fair market values of residential properties.
- Three independent variables will be used to estimate the selling price of a house:
  - total square footage
  - number of bedrooms
  - size of the garage
43. Selecting the Model
- We want to identify the simplest model that adequately accounts for the systematic variation in the Y variable.
- Arbitrarily using all the independent variables may result in overfitting.
- A sample reflects characteristics:
  - representative of the population
  - specific to the sample
- We want to avoid fitting sample-specific characteristics -- or overfitting the data.
44. Models with One Independent Variable
- With simplicity in mind, suppose we fit three simple linear regression functions, one per variable:
    Ŷ = b0 + b1X1,  Ŷ = b0 + b1X2,  Ŷ = b0 + b1X3
- The model using X1 accounts for 87% of the variation in Y, leaving 13% unaccounted for.
45. Important Software Note
- When using more than one independent variable,
all variables for the X-range must be in one
contiguous block of cells (that is, in adjacent
columns).
46. Models with Two Independent Variables
- Now suppose we fit the following models with two independent variables:
    Ŷ = b0 + b1X1 + b2X2,  Ŷ = b0 + b1X1 + b2X3
- The model using X1 and X2 accounts for 93.9% of the variation in Y, leaving 6.1% unaccounted for.
47. The Adjusted R² Statistic
- As additional independent variables are added to a model:
  - The R² statistic can only increase.
  - The adjusted R² statistic can increase or decrease.
- The R² statistic can be artificially inflated by adding any independent variable to the model.
- We can compare adjusted R² values as a heuristic to tell whether adding an additional independent variable really helps.
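A sketch of the adjusted R² formula, R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1). The R² values are the ones reported on the previous slides; the sample size n = 11 is a made-up assumption for illustration:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical n = 11 houses: adjusted R^2 also rises when X2 is added,
# which supports including it.
print(adjusted_r2(0.870, 11, 1))   # one-variable model (X1)
print(adjusted_r2(0.939, 11, 2))   # two-variable model (X1, X2)
```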
48. A Comment on Multicollinearity
- It should not be surprising that adding X3 (number of bedrooms) to the model with X1 (total square footage) did not significantly improve the model.
- Both variables represent the same (or very similar) thing -- the size of the house.
- These variables are highly correlated (or collinear).
- Multicollinearity should be avoided.
49. Testing for Significance: Multicollinearity
- The term multicollinearity refers to the correlation among the independent variables.
- When the independent variables are highly correlated (say, |r| > .7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable.
- If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem.
- Every attempt should be made to avoid including independent variables that are highly correlated.
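A rough sketch of screening for multicollinearity by checking pairwise correlations among the independent variables (the housing-style columns are made-up stand-ins for square footage, bedrooms, and garage size):

```python
import numpy as np

# Made-up data: columns = square footage, bedrooms, garage size.
X = np.array([[2100, 4, 2],
              [1800, 3, 1],
              [2400, 4, 2],
              [1600, 3, 1],
              [2900, 5, 3]], dtype=float)

corr = np.corrcoef(X, rowvar=False)   # columns treated as variables
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        flag = " <-- highly correlated" if abs(corr[i, j]) > 0.7 else ""
        print(f"r(X{i+1}, X{j+1}) = {corr[i, j]:+.3f}{flag}")
```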
50. Model with Three Independent Variables
- Now suppose we fit the following model with three independent variables:
    Ŷ = b0 + b1X1 + b2X2 + b3X3
- The model using X1 and X2 appears to be best:
  - highest adjusted R²
  - lowest Se (most precise prediction intervals)
51. Making Predictions
- Let's estimate the average selling price of a house with 2,100 square feet and a 2-car garage.
- The estimated average selling price is $134,444.
52. Binary Independent Variables
- Other types of non-quantitative factors could be included in the analysis as independent variables by using binary (0-1) variables.
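A tiny sketch of such a coding; the "pool" factor here is a hypothetical example, coded 1 if present and 0 if not:

```python
import numpy as np

# Hypothetical non-quantitative factor: does the house have a pool?
has_pool = np.array(["yes", "no", "no", "yes", "no"])
x_pool = (has_pool == "yes").astype(float)   # binary variable: 1 = yes, 0 = no
print(x_pool)   # [1. 0. 0. 1. 0.]
```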
53. Polynomial Regression
- Sometimes the relationship between a dependent and an independent variable is not linear.
- This graph suggests a quadratic relationship between square footage (X) and selling price (Y).
54. The Regression Model
- An appropriate regression function in this case might be
    Y = β0 + β1X1 + β2X1² + ε,
  or equivalently,
    Y = β0 + β1X1 + β2X2 + ε, where X2 = X1².
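A minimal sketch of fitting this quadratic model by treating X and X² as two columns in an ordinary linear regression (made-up illustration data, not the housing sample):

```python
import numpy as np

# Made-up data: square footage (in 000s) and selling price (in $000s).
x = np.array([1.2, 1.6, 2.0, 2.4, 2.8, 3.2])
y = np.array([80.0, 95.0, 118.0, 150.0, 195.0, 260.0])

# Columns: intercept, X, X^2 -- the quadratic term is just another variable.
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"y-hat = {b[0]:.2f} + {b[1]:.2f} X + {b[2]:.2f} X^2")
```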
55. Implementing the Model
56. Graph of the Estimated Quadratic Regression Function
57. Fitting a Third-Order Polynomial Model
- We could also fit a third-order polynomial model:
    Y = β0 + β1X1 + β2X1² + β3X1³ + ε,
  or equivalently,
    Y = β0 + β1X1 + β2X2 + β3X3 + ε, where X2 = X1² and X3 = X1³.
58. Graph of the Estimated Third-Order Polynomial Regression Function
59. Overfitting
- When fitting polynomial models, care must be taken to avoid overfitting.
- The adjusted R² statistic can be used for this purpose here as well.
60. Example: Programmer Salary Survey
- A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine whether salary was related to the years of experience and the score on the firm's programmer aptitude test. The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for the sample of 20 programmers are shown on the next slide.
61. Example: Programmer Salary Survey

    Exper.  Score  Salary      Exper.  Score  Salary
    4        78    24          9        88    38
    7       100    43          2        73    26.6
    1        86    23.7        10       75    36.2
    5        82    34.3        5        81    31.6
    8        86    35.8        6        74    29
    10       84    38          8        87    34
    0        75    22.2        4        79    30.1
    1        80    23.1        6        94    33.9
    6        83    30          3        70    28.2
    6        91    33          3        89    30
62. Example: Programmer Salary Survey
- Multiple Regression Model
- Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:
    y = β0 + β1x1 + β2x2 + ε
  where
    y = annual salary ($1000s)
    x1 = years of experience
    x2 = score on programmer aptitude test
63. Example: Programmer Salary Survey
- Multiple Regression Equation
- Using the assumption E(ε) = 0, we obtain
    E(y) = β0 + β1x1 + β2x2
- Estimated Regression Equation
- b0, b1, b2 are the least-squares estimates of β0, β1, β2. Thus
    ŷ = b0 + b1x1 + b2x2
64. Example: Programmer Salary Survey
- Solving for the Estimates of β0, β1, β2

    Input Data                                       Least Squares Output
    x1   x2    y         Computer Package
     4   78   24   -->   for Solving Multiple  -->   b0, b1, b2,
     7  100   43         Regression Problems         R², etc.
     .    .    .
     3   89   30
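As a cross-check, here is a minimal Python sketch that fits the 20 observations from slide 61 by least squares; it should reproduce the output on the next slide up to rounding:

```python
import numpy as np

# The programmer salary data from slide 61.
exper = np.array([4, 7, 1, 5, 8, 10, 0, 1, 6, 6,
                  9, 2, 10, 5, 6, 8, 4, 6, 3, 3], dtype=float)
score = np.array([78, 100, 86, 82, 86, 84, 75, 80, 83, 91,
                  88, 73, 75, 81, 74, 87, 79, 94, 70, 89], dtype=float)
salary = np.array([24, 43, 23.7, 34.3, 35.8, 38, 22.2, 23.1, 30, 33,
                   38, 26.6, 36.2, 31.6, 29, 34, 30.1, 33.9, 28.2, 30])

# Least-squares fit with an intercept column.
X = np.column_stack([np.ones_like(exper), exper, score])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(f"Salary = {b[0]:.2f} + {b[1]:.3f} Exper + {b[2]:.3f} Score")
```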
65. Example: Programmer Salary Survey
- Data Analysis Output
- The estimated regression equation is
    Salary = 3.17 + 1.40 Exper + 0.251 Score

    Predictor   Coef      Stdev     t-ratio   p
    Constant    3.174     6.156     0.52      0.613
    Exper       1.4039    0.1986    7.07      0.000
    Score       0.25089   0.07735   3.24      0.005

    s = 2.419   R-sq = 83.4%   R-sq(adj) = 81.5%
66. Example: Programmer Salary Survey
- Computer Output (continued)
- Analysis of Variance

    SOURCE      DF   SS       MS       F       P
    Regression   2   500.33   250.16   42.76   0.000
    Error       17    99.46     5.85
    Total       19   599.79