Title: Simple Regression Models of Estimation
1Simple Regression Models of Estimation
- Bernardo Aguilar-Gonzalez
2Purpose of Regression and Correlation Analysis
- Regression Analysis is Used Primarily for Prediction
- A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable
- Closely related: Correlation Analysis is Used to Measure the Strength of the Association Between Numerical Variables
3The Scatter Diagram
Plot of all (Xi , Yi) pairs
4Types of Regression Models
Positive Linear Relationship
Relationship NOT Linear
Negative Linear Relationship
No Relationship
5Simple Linear Regression Model
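- Relationship Between Variables Is a Linear Function
- The Straight Line that Best Fits the Data
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
- $Y_i$: Dependent (Response) Variable
- $X_i$: Independent (Explanatory) Variable
- $\beta_0$: Y intercept
- $\beta_1$: Slope
- $\varepsilon_i$: Random Error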
6Population Linear Regression Model
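$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (Observed Value)
$\mu_{YX} = \beta_0 + \beta_1 X_i$ (population regression line)
[Figure: Y plotted against X, showing the population regression line, the observed values, and the Random Error $\varepsilon_i$ as the vertical distance between an observed value and the line]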
7Sample Linear Regression Model
$\hat{Y}_i = b_0 + b_1 X_i$
- $\hat{Y}_i$: Predicted Value of Y for observation i
- $X_i$: Value of X for observation i
- $b_0$: Sample Y-intercept, used as an estimate of the population $\beta_0$
- $b_1$: Sample Slope, used as an estimate of the population $\beta_1$
8Simple Linear Regression Equation Example
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
You wish to examine the relationship between the
square footage of produce stores and their annual
sales. Sample data for 7 stores were obtained.
Find the equation of the straight line that best
fits the data.
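As a minimal sketch of how the best-fitting line can be found, the least-squares formulas $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$ can be applied to the seven stores. The function and variable names below are illustrative choices, not part of the original slides, which most likely relied on a statistics package.

```python
# Least-squares fit for the 7 produce stores listed on the slide.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in thousands


def fit_line(x, y):
    """Return (b0, b1), the intercept and slope that minimize squared error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares of X around the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(square_feet, annual_sales)
print(f"Y-hat = {b0:.3f} + {b1:.3f} X")
```

The fitted coefficients should agree, after rounding, with the equation reported on the following slides.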
9Scatter Diagram Example
10Equation for the Best Straight Line
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487X_i$
11Graph of the Best Straight Line
$\hat{Y}_i = 1636.415 + 1.487X_i$
12Interpreting the Results
$\hat{Y}_i = 1636.415 + 1.487X_i$
The slope of 1.487 means that for each increase of one
unit in X, Y is estimated to increase by
1.487 units.
For each increase of 1 square foot in the size of
the store, the model predicts that expected
annual sales increase by $1,487 (1.487 in units of $000).
13Measures of Variation: The Sum of Squares
- SST = Total Sum of Squares
  - measures the variation of the $Y_i$ values around their mean $\bar{Y}$
- SSR = Regression Sum of Squares
  - explained variation attributable to the relationship between X and Y
- SSE = Error Sum of Squares
  - variation attributable to factors other than the relationship between X and Y
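14Measures of Variation: The Sum of Squares
$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
[Figure: Y plotted against X with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$ and a horizontal line at $\bar{Y}$, showing the deviations at $X_i$ that make up SST, SSR, and SSE]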
15Measures of Variation: The Sum of Squares Example
[Figure: example output with SSR, SSE, and SST labeled]
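The three sums of squares can be computed directly from the store data. The sketch below is an illustration rather than the output shown on the original slide; it refits the line and then checks that the total variation splits into explained and unexplained parts (SST = SSR + SSE). All variable names are illustrative.

```python
# Sums of squares for the produce store example.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in thousands

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

# Least-squares slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Predicted value for each store.
y_hat = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in annual_sales)            # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)                 # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(annual_sales, y_hat))  # unexplained variation

print(f"SST = {sst:,.2f}")
print(f"SSR = {ssr:,.2f}")
print(f"SSE = {sse:,.2f}")
print(f"SSR + SSE = {ssr + sse:,.2f}")
```

Up to floating-point rounding, the last line should match SST, which is the identity the preceding slides rely on.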
16The Coefficient of Determination
$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$
Measures the proportion of variation in Y that is
explained by the independent variable X in the
regression model
17Coefficients of Determination ($r^2$) and Correlation ($r$)
[Figure: four scatter plots of Y against X, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$]
- $r^2 = 1$, $r = +1$
- $r^2 = 1$, $r = -1$
- $r^2 = .8$, $r = +0.9$
- $r^2 = 0$, $r = 0$
18Standard Error of Estimate
$S_{YX} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}}$
The standard deviation of the variation of
observations around the regression line
19Measures of Variation Example
Syx
$r^2 = .94$
94% of the variation in annual sales can be
explained by the variability in the size of the
store as measured by square footage
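To make the two measures concrete, the sketch below recomputes $r^2 = SSR/SST$ and the standard error of estimate $S_{YX} = \sqrt{SSE/(n-2)}$ for the seven stores. It is an illustration under the usual least-squares definitions, not the software output from the original slide, and the variable names are illustrative.

```python
import math

# Coefficient of determination and standard error of estimate
# for the produce store example.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in thousands

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in annual_sales)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
ssr = sst - sse

r_squared = ssr / sst
# The correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), b1)
# n - 2 degrees of freedom: two coefficients were estimated.
s_yx = math.sqrt(sse / (n - 2))

print(f"r^2  = {r_squared:.4f}")
print(f"r    = {r:.4f}")
print(f"S_YX = {s_yx:.2f}")
```

The value of $r^2$ should round to the .94 quoted above; $S_{YX}$ is in the same units as annual sales.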
20Linear Regression Assumptions
For Linear Models
- 1. Normality
- Y Values Are Normally Distributed For Each X
- Error is Normally Distributed
- 2. Homoscedasticity (Constant Variance)
- 3. Independence of Errors
21Inferences about the Slope t Test
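- t Test for a Population Slope: Is There a Linear Relationship Between X and Y?
- Null and Alternative Hypotheses
  - $H_0: \beta_1 = 0$ (No Linear Relationship)
  - $H_1: \beta_1 \neq 0$ (Linear Relationship)
- Test Statistic
$t = \frac{b_1 - \beta_1}{S_{b_1}}$
Where $S_{b_1} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$
and $df = n - 2$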
22Example Produce Stores
Data for 7 Stores
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Regression Model Obtained
$\hat{Y}_i = 1636.415 + 1.487X_i$
The slope of this model is 1.487. Is there a
linear relationship between the square footage of
a store and its annual sales?
23Inferences about the Slope t Test Example
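- $H_0: \beta_1 = 0$
- $H_1: \beta_1 \neq 0$
- $\alpha = .05$
- $df = 7 - 2 = 5$
- Critical Value(s): $t = \pm 2.5706$
[Figure: t distribution centered at 0 with rejection regions of area .025 in each tail, below -2.5706 and above 2.5706]
Test Statistic: $t = b_1 / S_{b_1}$
Decision: Reject $H_0$
Conclusion: There is evidence of a linear relationship.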
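The test statistic itself can be recomputed from the store data. The sketch below assumes the usual formula $t = b_1 / S_{b_1}$ under $H_0: \beta_1 = 0$, with $n - 2 = 5$ degrees of freedom, and compares the result with the critical value 2.5706 quoted on the slide. Variable names are illustrative.

```python
import math

# t test for the slope in the produce store example.
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in thousands

n = len(square_feet)
df = n - 2
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n

sxx = sum((x - x_bar) ** 2 for x in square_feet)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of estimate, then standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / df)
s_b1 = s_yx / math.sqrt(sxx)

# Under H0 the hypothesized slope is 0, so t = (b1 - 0) / s_b1.
t_stat = b1 / s_b1

T_CRITICAL = 2.5706  # two-tailed critical value taken from the slide (alpha = .05)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"df = {df}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Because the computed statistic falls well beyond the critical value, the decision to reject $H_0$ does not depend on small rounding differences.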
24Estimation of Predicted Values
25Example Produce Stores
Data for 7 Stores
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Predict the annual sales for a store with 4200
square feet.
Regression Model Obtained
$\hat{Y}_i = 1636.415 + 1.487X_i$
26Answer
- Predicted Sales: $\hat{Y}_i = 1636.415 + 1.487X_i = 1636.415 + 1.487(4200) = 7,881.82$ (thousands of dollars)
- Yet,
  - 1- How good is this estimate?
  - 2- Could I have values that give me a confidence interval?
- You will be able to find these answers out in your next statistics course!
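The prediction is a single substitution into the fitted equation. As a small sketch, using the rounded coefficients from the slides (the function name is illustrative):

```python
# Point prediction from the fitted line, using the rounded slide coefficients.
B0 = 1636.415  # intercept, in thousands of dollars
B1 = 1.487     # slope, thousands of dollars per square foot


def predict_sales(square_feet):
    """Predicted annual sales (in thousands) for a store of the given size."""
    return B0 + B1 * square_feet


predicted = predict_sales(4200)
print(f"Predicted annual sales: {predicted:,.3f} (thousands)")
```

The data cover stores between 1,292 and 5,555 square feet, so 4,200 square feet lies inside the observed range; predictions far outside that range would be extrapolation and deserve more caution.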
27Ok, so, do the problem in Chapter 15
- The data set is on the NAAGE site
28Estimate the descriptives and correlation
29So
- $GPA_i = 2.140 + 0.168 \times \text{Reading Time}_i$
- This means that for each increase of 1 hour in the average weekly time spent reading with parents, the model predicts that the expected GPA increases by 0.168.
- There is a strong, significant correlation between reading time and GPA ($r = 0.86$).
- 73.9% of the variation in GPA can be explained by the variability in reading time (a good $r^2$).
30- The slope of this model is 0.168.
- Is there a significant linear relationship between the reading time and the GPA?
- Examining the t statistic (8.075) and the p-value (0.00), we reject $H_0$ at both the 0.05 and 0.01 levels and conclude that there is a significant linear relationship.
- So, it seems like this is a good model for prediction of this linear relationship. If we have a student who spends on average 10 hours per week reading with his/her parents, we can safely estimate that his/her GPA will be approximately 3.82.
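The 3.82 figure comes from substituting 10 hours into the fitted equation, and the reported $r^2$ is the square of the reported correlation. A brief sketch with the coefficients quoted on the slides (the names are illustrative):

```python
# GPA model from the slides: GPA = 2.140 + 0.168 * reading time (hours per week).
INTERCEPT = 2.140
SLOPE = 0.168
CORRELATION = 0.86  # r reported on the slide


def predict_gpa(reading_hours):
    """Predicted GPA for a given average weekly reading time with parents."""
    return INTERCEPT + SLOPE * reading_hours


gpa_at_10_hours = predict_gpa(10)
r_squared = CORRELATION ** 2

print(f"Predicted GPA at 10 hours: {gpa_at_10_hours:.2f}")
print(f"r^2 from r = 0.86: {r_squared:.3f}")
```

Squaring the rounded correlation gives roughly 0.74, consistent with the 73.9 reported for the share of variation explained.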
31To confirm the good fit (prediction power and
strong significant relationship) we can look at
the scatter plot of the relationship between GPA
and reading time
32A better fit might be obtained by trying some other
functional form (log-linear, maybe).
33Homework
- Use the General Social Survey for the Year 2000 (GSS00 data set) and simple linear regression estimation methods to determine if
  - There is a significant linear relationship between the strength of religious affiliation of respondents and the happiness in marriage
  - There is a significant linear relationship between the race of the respondent and total family income
  - There is a significant linear relationship between total family income and hours per day watching TV
  - There is a significant linear relationship between occupational prestige and total family income