Title: MATH 1107 Elementary Statistics
1MATH 1107 Elementary Statistics
- Lecture 7
- Regression Analysis
2MATH 1107 Regression Analysis
- Regression Analysis is one of the most heavily
used tools in Statistical Modeling.
- This is true because it enables you to predict or
explain a dependent variable based upon one or
more independent variables.
- Regression Analysis is used in almost every
industry.
3MATH 1107 Regression Analysis
- For example:
- If you are a sports agent, how do you
propose a reasonable contract salary for your
client?
- If you are interested in selling your house, how
can you determine an appropriate market price?
- If you are the head of the admissions department
in a university, how do you decide who gets
accepted?
- If you are an investment banker, how do you
decide which funds to hold in your portfolio?
4MATH 1107 Regression Analysis
In each of these examples, the quantity being
decided (the contract salary, the market price,
who gets accepted, which funds to hold) is the
dependent variable. What would be the
associated independent variables that we might
use to predict or explain these dependent
variables?
5MATH 1107 Regression Analysis
The first step in predicting or explaining a
dependent variable using an independent
variable is to evaluate the relationship between
the two variables using a scatterplot. Let's return
to Median Household Income and Death Rate.
Although many independent variables can be used
in regression analysis, in these notes we will
use only one.
8MATH 1107 Regression Analysis
- Using the CORREL(array1, array2) function in
Excel, we can determine that the correlation
between Median Income and Death Rate is -0.61.
- This indicates three things:
- The relationship is fairly strong: the magnitude
of the correlation, 0.61, is closer to 1 than it
is to 0.
- The direction is negative (inverse), meaning that
as one variable goes up, the other tends to go down.
- The R² value of a predictive regression equation
using these two variables is 0.37, which is the
correlation squared: (-0.61)² ≈ 0.37.
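For readers who want to see what a correlation function computes, here is a minimal Python sketch of the Pearson correlation coefficient, which is the statistic Excel's CORREL returns. The function name pearson_r and the toy values are assumptions made for illustration; the toy values are NOT the state data behind the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy values, made up for illustration only (not the state data).
toy_x = [1, 2, 3, 4]
toy_y = [8, 6, 4, 2]  # falls by exactly 2 whenever x rises by 1

r = pearson_r(toy_x, toy_y)  # a perfectly inverse linear relationship

# Squaring the correlation reported in the slides gives the R² value.
r_squared_from_slides = (-0.61) ** 2
print(round(r_squared_from_slides, 2))
```

The last two lines only redo the arithmetic from the slide: squaring the reported correlation of -0.61 gives the R² of about 0.37.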
9MATH 1107 Regression Analysis
- Since the correlation is fairly strong, we can use
these two variables to create a linear model.
- It will have an equation in the form y = mx + b.
- It will be the best fit of the data.
- It will minimize the sum of the squared vertical
distances between the actual data points and the
predicted points (each such distance is called a
residual).
- It will enable us to predict the death rates in
other states that were NOT included in the
original dataset.
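The "best fit" idea can be sketched in a few lines of Python. This is a minimal illustration of an ordinary least-squares fit for one independent variable, not a reproduction of how Excel computes its trendline internally. The function names fit_line and sum_squared_residuals and the toy values are assumptions made for illustration; the toy values are not the state data.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, b

def sum_squared_residuals(xs, ys, m, b):
    """Total squared vertical distance between the data and the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Toy values, made up for illustration only (not the state data).
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 10]

m, b = fit_line(toy_x, toy_y)
best = sum_squared_residuals(toy_x, toy_y, m, b)

# Any other line does worse on the same data, for example a steeper one.
worse = sum_squared_residuals(toy_x, toy_y, m + 0.5, b)
```

Comparing best with worse shows what "best fit" means here: changing the slope or intercept away from the fitted values increases the total squared residual.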
10MATH 1107 Regression Analysis
From this analysis, the best fit line is
y = -0.0002x + 13.255
This equation was provided by Excel (tick the
"Display Equation on chart" option under the Add
Trendline function). A better way to represent
this equation is
State Death Rate = (-0.0002 × Median State Income) + 13.255
11MATH 1107 Regression Analysis
Let's interpret these values directly. -0.0002 is
the slope of the line. It can be translated
directly to mean: for every one dollar of
additional median income, the predicted death rate
decreases by 0.0002. Equivalently, every additional
$10,000 of median income lowers the predicted death
rate by 2. The slope tells you how
the dependent variable changes with a one-unit
change in the independent variable.
12MATH 1107 Regression Analysis
Let's interpret these values directly. 13.255 is
the y-intercept. Algebraically, this is the
point at which the line crosses the y-axis,
where the x-value is 0. Since it is not
reasonable to have a state with 0 Median Income,
the intercept is not interpreted directly.
13MATH 1107 Regression Analysis
Now, using the model we developed, predict the
death rates for the states below:
STATE            MEDIAN INCOME ($)
Virginia         38,223
Washington       34,064
West Virginia    20,301
Wisconsin        33,415
Wyoming          30,379
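One way to carry out these predictions is sketched below in Python. The slope and intercept are the values from the fitted line in these notes; the names SLOPE, INTERCEPT, and predict_death_rate are assumptions made for illustration. Because the slope is shown to only four decimal places, the results should be read as approximate.

```python
# Coefficients of the fitted line from these notes: y = -0.0002x + 13.255
SLOPE = -0.0002
INTERCEPT = 13.255

def predict_death_rate(median_income):
    """Predicted state death rate for a given median income in dollars."""
    return SLOPE * median_income + INTERCEPT

# Median incomes from the table above.
median_income = {
    "Virginia": 38223,
    "Washington": 34064,
    "West Virginia": 20301,
    "Wisconsin": 33415,
    "Wyoming": 30379,
}

predictions = {
    state: predict_death_rate(income)
    for state, income in median_income.items()
}

for state, rate in predictions.items():
    print(f"{state:<14} {rate:.2f}")
```

Each prediction is just the line evaluated at that state's median income, so states with higher incomes receive lower predicted death rates.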
14MATH 1107 Regression Analysis
Now, let's determine our residuals, or how far
off we were for each prediction.
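A residual is the actual value minus the predicted value. The Python sketch below shows the calculation under stated assumptions: the function names are chosen for illustration, and the income and "actual" death rate in the example are hypothetical placeholders, NOT real state figures. To find the real residuals, substitute each state's observed death rate.

```python
# Coefficients of the fitted line from these notes: y = -0.0002x + 13.255
SLOPE = -0.0002
INTERCEPT = 13.255

def predict_death_rate(median_income):
    """Predicted state death rate for a given median income in dollars."""
    return SLOPE * median_income + INTERCEPT

def residual(actual, predicted):
    """Residual = actual - predicted.

    Positive: the model predicted too low. Negative: too high.
    """
    return actual - predicted

# Hypothetical placeholder values for illustration only (not real data).
example_income = 30000
example_actual_rate = 8.0

example_predicted = predict_death_rate(example_income)
example_residual = residual(example_actual_rate, example_predicted)
print(f"predicted {example_predicted:.3f}, residual {example_residual:.3f}")
```

A residual close to 0 means the prediction was close to the observed value; the sign tells you whether the model overestimated or underestimated.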