Title: QNT 531 Advanced Problems in Statistics and Research Methods
QNT 531 Advanced Problems in Statistics and Research Methods
- WORKSHOP 3
- By Dr. Serhat Eren
- University of Phoenix
SECTION 3 OBJECTIVES
- Find the linear regression equation for a dependent variable Y as a function of a single independent variable X
- Determine whether a relationship between X and Y exists
- Analyze the results of a regression analysis to determine whether the simple linear model is appropriate
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 1. Deterministic and Statistical Relationships
- In some cases where two variables, x and y, are related, the relationship is deterministic, or functional. This means that when a value of x is selected, the value of y is uniquely determined.
- Figure 15.1 illustrates this type of relationship.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- If a person were to order x = 100 items, then the corresponding cost would be y = 50 + (1.20)(100) = 170.
- Every person who orders 100 items will incur the cost of 170. That is, the value of y is unique for a given value of x.
- Suppose that we are looking at the relationship between dollars spent in advertising and the revenues from sales.
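A deterministic relationship like the one above can be written directly as an ordinary function. A minimal Python sketch of the cost example (fixed cost of 50 plus 1.20 per item, as in the slide):

```python
def cost(items):
    """Deterministic cost: fixed cost of 50 plus 1.20 per item ordered."""
    return 50 + 1.20 * items

# Every order of 100 items incurs exactly the same cost: y is unique for a given x.
print(cost(100))  # 170.0
```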
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- Clearly, we expect the two variables to be related, but we do not expect that every time a company spends x in advertising it will always have y in revenues.
- We know that there are other factors, or variables, such as the type of product, location, and various economic factors, that will affect the value of Y for a given value of X.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- When we collect our data we are collecting pairs of observations on the two variables, X and Y. Thus, we will have a set of n data pairs
- (x1, y1), (x2, y2), . . . , (xn, yn)
- This type of plot, Figure 15.2, a scatter plot, is of primary importance in exploring relationships between variables, and should be done before any type of statistical analysis is performed.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 2. The Simple Linear Regression Model
- The true relationship between the variables X and Y, the simple regression model, can be described by the equation
- Y = β0 + β1X + ε
- This equation says that for a given value of the variable X = x, the actual value of Y will be determined by the expression β0 + β1x, plus some random variation, ε, due to other unmeasured factors.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- Thus, if we knew the values of β0, the true population intercept, and β1, the true population slope, we could predict the value of Y to within some random error, ε.
- Figure 15.3 shows the population model for a linear regression. Figure 15.4 shows such a line along with the data.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- The equation of the line that we draw will be
- ŷ = b0 + b1x
- where ŷ is the predicted value of Y for a particular value of X = x. The quantities b0 and b1 are the estimates of the population values β0 and β1. This line is called the regression line of Y on X, or the estimate of the simple regression model.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 3. The Least-Squares Line
- Figure 15.6 shows a set of data and a line drawn to represent the relationship between the variables.
- The distance between the predicted value of Y, ŷ, and the actual value of Y, y, is called the deviation, or error.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- The technique that finds the equation of the line
that minimizes the total or sum of the squared
deviations between the actual data points and the
line is called the least-squares method.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- The least-squares method finds the equation of the line
- ŷ = b0 + b1x
- that minimizes
- Σ (yi − ŷi)²
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- the total of squared deviations from the data points to the line. The values for b0 (the intercept of the line) and b1 (the slope of the line) are found by using the following equations:
- b1 = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
- b0 = ȳ − b1 x̄
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- The easiest way to look at what is involved in
the calculations is to make a table with a column
for each sum needed. The table will look like the
one in Table 15.1.
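As a sketch of these calculations, the column sums and the two formulas can be coded directly. The five (x, y) pairs below are hypothetical illustration data, not from the text:

```python
# Hypothetical illustration data, not from the textbook example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Column sums, one per column of a table like Table 15.1.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Least-squares slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

print(round(b1, 4), round(b0, 4))  # 0.6 2.2
```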
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 4. Using the Computer to Do Regression Analysis
- Any statistical software that you might be using is capable of performing regression analysis.
- In addition, most spreadsheet packages also do regression. We will start by identifying the estimates for the parameters of the regression equation in the output for several software packages.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 5. Using the Regression Equation to Make Predictions
- In Figure 15.3 we saw that the values of Y vary around the true regression line. The value of ŷ we find is really a prediction of the mean value of Y for a given value of X.
- There are two kinds of predictions that you can do: interpolation and extrapolation.
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- Using the equation to predict values of Y within the range of the X data is called interpolation.
- Predicting values of Y for values of X outside the observed range is called extrapolation.
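A small sketch can make the distinction concrete. The helper below is hypothetical, using an assumed fitted line ŷ = 2.2 + 0.6x over hypothetical data with x running from 1 to 5:

```python
def predict(x_new, b0, b1, x_data):
    """Predict y-hat and label the prediction as interpolation or extrapolation."""
    y_hat = b0 + b1 * x_new
    kind = "interpolation" if min(x_data) <= x_new <= max(x_data) else "extrapolation"
    return y_hat, kind

x = [1, 2, 3, 4, 5]                # hypothetical illustration data
print(predict(2.5, 2.2, 0.6, x))   # within the x range  -> interpolation
print(predict(10, 2.2, 0.6, x))    # outside the x range -> extrapolation
```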
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 6. Calculating Residuals
- The difference between the observed value of Y (yi) and the predicted value of Y from the regression equation (ŷi), for a value of X = xi, is called the ith residual, ei = yi − ŷi.
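Residuals are simple to compute once the fitted line is known. The sketch below uses hypothetical data and the matching least-squares estimates b0 = 2.2, b1 = 0.6:

```python
x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6     # least-squares estimates for these data

# ith residual: e_i = y_i - y-hat_i
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)
# A property of least squares: the residuals sum to (essentially) zero.
print(sum(residuals))  # ~ 0.0, up to floating-point rounding
```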
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
- 7. The Standard Error of the Estimate
- The standard error of the estimate, s_y|x = √(SSE/(n − 2)), is a measure of how much the data vary around the regression line.
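A sketch of the calculation, assuming the formula s_y|x = √(SSE/(n − 2)) and hypothetical data with fitted values from b0 = 2.2, b1 = 0.6:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
n = len(x)

# SSE: sum of squared residuals around the fitted line.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # standard error of the estimate, s_y|x
print(round(s, 4))  # 0.8944, i.e. sqrt(2.4 / 3)
```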
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- 1. Hypothesis Testing About the Slope, β1
- If the variables X and Y are related, the slope of the line will be some number. If there is no relationship between X and Y, then the slope of the line is zero.
- That is, we say that as X changes, Y does not change in a related way (Figure 15.9).
- We will use hypothesis testing to decide whether the slope of the regression line is significantly different from zero.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- The first step in testing a hypothesis is to set up the appropriate hypotheses. In this case we want to test
- H0: β1 = 0
- HA: β1 ≠ 0
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- If the test results in rejecting the null hypothesis, then we will conclude that the slope of the regression line is not equal to zero, and that the relationship between the X and Y variables is real.
- Our estimate of β1 is b1, and to proceed with the steps of the hypothesis test we need to know about the sampling distribution of b1.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- It turns out that the sampling distribution associated with the least-squares estimate of the slope is the Student t distribution.
- The test statistic for our hypothesis test is therefore
- t = b1 / s_b1
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- which has a t distribution with n − 2 degrees of freedom. In the formula, s_b1 is the standard error of the slope b1 and is calculated by
- s_b1 = s_y|x / √(Σ(xi − x̄)²)
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- Since the test is a two-sided test, once the significance level of the test, α, is chosen, the critical values of the test are
- ± t_α/2, n−2
- We now have the test set up. All that remains is to perform the test and make a decision.
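Putting the test together on a hypothetical data set (fitted slope b1 = 0.6); the critical value t_0.025,3 ≈ 3.182 is taken from a t table:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
n = len(x)
x_bar = sum(x) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))                               # s_y|x
s_b1 = s / math.sqrt(sum((xi - x_bar) ** 2 for xi in x))   # standard error of b1

t = b1 / s_b1
t_crit = 3.182   # t_{alpha/2, n-2} for alpha = 0.05 and n - 2 = 3, from a t table
print(round(t, 3), abs(t) > t_crit)  # t = 2.121; here we fail to reject H0
```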
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- 2. Partitioning the Variance in Linear Regression
- SST: the total variation in the Y values around the mean
- SSR: the variation in Y that is caused by Y's relationship with X
- SSE: the variation in Y that remains unexplained
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- The quantities SST, SSR, and SSE are known as the sums of squares: SST is the total sum of squares, SSR is the regression sum of squares, and SSE is the error sum of squares.
- Figure 15.11 illustrates how these quantities relate to the data and to the regression line.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- In addition, you see that
- SST = SSR + SSE
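A sketch of the partition on hypothetical data (with fitted line ŷ = 2.2 + 0.6x):

```python
x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error
ssr = sum(((b0 + b1 * xi) - y_bar) ** 2 for xi in x)           # regression

print(sst, round(ssr, 4), round(sse, 4))  # 6.0 3.6 2.4
# The identity SST = SSR + SSE holds (up to floating-point rounding).
```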
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- When X and Y are related, SSR is a large part of
the total variation. This implies that a major
reason that Y varies so much is because it is
related to X. - When this is true, the SSE component of the
variation is small and is the variation in Y that
happens "naturally" or entirely due to chance.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- When X and Y are not related, the regression line is horizontal (β1 = 0) and the SSR component of the variation disappears.
- The SSE part of the variation becomes dominant, and we say that we cannot really explain the variation in Y using the linear model with X.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- The test is referred to as an analysis of variance (ANOVA) test, because it is based on looking at the variation in the Y variable.
- The hypotheses that we test are
- H0: The linear regression model is not significant.
- HA: The linear regression model is significant.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- The test statistic for this test uses the mean squares, which are obtained by dividing the sums of squares, SSR and SSE, by their respective degrees of freedom:
- MSR = SSR/1, MSE = SSE/(n − 2), and F = MSR/MSE
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- The regression sum of squares (SSR) measures the amount of the variation in the Y variable that can be accounted for, or explained, by Y's relationship with X.
- If you look at SSR as a portion of SST, then you can determine the amount of the variability in Y that can be explained, or accounted for.
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- This value is called the coefficient of determination:
- R² = SSR/SST
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
- R² is usually part of the general information in the output from statistical packages.
- The coefficient of determination gives you a measure of how much the variation in Y could be reduced if X were controlled to a single value.
- This is a way of measuring how useful a model is for planning purposes.
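For example, with hypothetical sums of squares SST = 6 and SSR = 3.6:

```python
# Hypothetical sums of squares for illustration.
sst, ssr = 6.0, 3.6

r_squared = ssr / sst
print(r_squared)  # 0.6: 60% of the variation in Y is explained by X
```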
SECTION 3 CORRELATION ANALYSIS
- 1. The Correlation Coefficient
- In Figure 15.14, you see three types of relationships: perfect negative, none, and perfect positive.
- The correlation coefficient is used as a measure of the strength of a linear relationship.
SECTION 3 CORRELATION ANALYSIS
- A correlation of -1 corresponds to a perfect
negative relationship, a correlation of 0
corresponds to no relationship, and a correlation
of 1 corresponds to a perfect positive
relationship.
SECTION 3 CORRELATION ANALYSIS
- The correlation coefficient, r, is calculated using the formula
- r = (n Σxy − Σx Σy) / √[(n Σx² − (Σx)²)(n Σy² − (Σy)²)]
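A sketch of this computational formula on hypothetical data; the last line checks that r² matches the coefficient of determination for the same data:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
n = len(x)

num = n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(xi ** 2 for xi in x) - sum(x) ** 2) *
                (n * sum(yi ** 2 for yi in y) - sum(y) ** 2))
r = num / den
print(round(r, 4))       # 0.7746
print(round(r ** 2, 4))  # 0.6, the coefficient of determination R^2
```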
SECTION 3 CORRELATION ANALYSIS
- The correlation coefficient is also related to one of the quantities that we looked at in regression analysis, the coefficient of determination, R².
- The magnitude of r is equal to the positive square root of R²; r itself takes the sign of the slope b1.
SECTION 3 REGRESSION ASSUMPTIONS AND RESIDUAL ANALYSIS
- 1. Assumptions and Problems in the Regression Models
- Remember that the simple linear model is given by
- Y = β0 + β1X + ε
SECTION 3 REGRESSION ASSUMPTIONS AND RESIDUAL ANALYSIS
- The basic assumptions about the error term ε are:
- It has a mean value of zero.
- For every value of X, the standard deviation, σ, of ε is the same.
- The distribution of ε is normal.
- The error terms for the different observations are not correlated with each other.
SECTION 3 MULTIPLE REGRESSION MODEL
- Multiple regression analysis is the study of how a dependent variable y is related to two or more independent variables.
- In the general case, we will use p to denote the number of independent variables.
SECTION 3 MULTIPLE REGRESSION MODEL
- 1. Regression Model and Regression Equation
- The equation that describes how the dependent variable y is related to the independent variables x1, x2, ..., xp and an error term is called the multiple regression model:
- y = β0 + β1x1 + β2x2 + ... + βpxp + ε
SECTION 3 MULTIPLE REGRESSION MODEL
- 2. Estimated Multiple Regression Equation
- Unfortunately, the parameter values β0, β1, ..., βp will not be known and must be estimated from sample data.
- A simple random sample is used to compute sample statistics b0, b1, ..., bp that are used as the point estimators of the parameters β0, β1, ..., βp.
SECTION 3 MULTIPLE REGRESSION MODEL
- The estimated multiple regression equation is
- ŷ = b0 + b1x1 + b2x2 + ... + bpxp
- where
- ŷ = estimated value of the dependent variable
- b0, b1, ..., bp are the estimates of β0, β1, ..., βp.
SECTION 3 LEAST SQUARES METHOD
- The least squares method develops the estimated regression equation that best approximates the straight-line relationship between the dependent and independent variables.
- 1. Least Squares Criterion
- min Σ (yi − ŷi)²
SECTION 3 LEAST SQUARES METHOD
- where
- yi = observed value of the dependent variable for the ith observation
- ŷi = estimated value of the dependent variable for the ith observation
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
- The total sum of squares can be partitioned into two components: the sum of squares due to regression and the sum of squares due to error.
- 1. Relationship Among SST, SSR, and SSE
- SST = SSR + SSE
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
- 2. Multiple Coefficient of Determination
- R² = SSR/SST
- Many analysts prefer adjusting R² for the number of independent variables to avoid overestimating the impact of adding an independent variable on the amount of variability explained by the estimated regression equation.
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
- With n denoting the number of observations and p denoting the number of independent variables, the adjusted multiple coefficient of determination is computed as follows:
- R²a = 1 − (1 − R²)(n − 1)/(n − p − 1)
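A sketch of this adjustment as a small function; the example values (R² = 0.6, n = 5, p = 1) are hypothetical:

```python
def adjusted_r2(r2, n, p):
    """Adjusted multiple coefficient of determination."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical example: R^2 = 0.6 with n = 5 observations and p = 1 predictor.
print(round(adjusted_r2(0.6, 5, 1), 4))  # 0.4667
```

The adjusted value is always at most R², and the gap widens as more predictors are added without a matching gain in explained variability.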
SECTION 3 MULTIPLE REGRESSION MODEL ASSUMPTIONS
- 1. Multiple Regression Model
- y = β0 + β1x1 + β2x2 + ... + βpxp + ε
- 2. Assumptions About the Error Term in the Multiple Regression Model
- The error ε is a random variable with mean or expected value of zero; that is, E(ε) = 0.
SECTION 3 MULTIPLE REGRESSION MODEL ASSUMPTIONS
- The variance of ε is denoted by σ² and is the same for all values of the independent variables x1, x2, ..., xp.
- The values of ε are independent.
- The error ε is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by β0 + β1x1 + β2x2 + ... + βpxp.
SECTION 3 TESTING FOR SIGNIFICANCE
- The F test is used to determine whether a significant relationship exists between the dependent variable and the set of all the independent variables.
- If the F test shows an overall significance, the t test is used to determine whether each of the individual independent variables is significant. A separate t test is conducted for each of the independent variables in the model.
SECTION 3 TESTING FOR SIGNIFICANCE
- 1. F Test
- For the multiple regression model as defined below:
- y = β0 + β1x1 + β2x2 + ... + βpxp + ε
- The hypotheses for the F test involve the parameters of the multiple regression model:
- H0: β1 = β2 = ... = βp = 0
- HA: One or more of the parameters is not equal to zero
SECTION 3 TESTING FOR SIGNIFICANCE
- If H0 is rejected, we have sufficient statistical evidence to conclude that one or more of the parameters is not equal to zero and that the overall relationship between y and the set of independent variables x1, x2, ..., xp is significant.
- However, if H0 cannot be rejected, we do not have sufficient evidence to conclude that a significant relationship is present.
SECTION 3 TESTING FOR SIGNIFICANCE
- F Test for Overall Significance
- Test Statistic: F = MSR/MSE
- where MSR = SSR/p and MSE = SSE/(n − p − 1)
SECTION 3 TESTING FOR SIGNIFICANCE
- Rejection Rule
- Using the test statistic: Reject H0 if F ≥ Fα, where Fα is based on an F distribution with p numerator degrees of freedom and n − p − 1 denominator degrees of freedom
- Using the p-value: Reject H0 if p-value < α
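A sketch of the F statistic on hypothetical sums of squares (SSR = 3.6, SSE = 2.4, with n = 5 and a single predictor, p = 1):

```python
# Hypothetical sums of squares for illustration (simple case, p = 1).
ssr, sse = 3.6, 2.4
n, p = 5, 1

msr = ssr / p            # regression mean square
mse = sse / (n - p - 1)  # error mean square
f = msr / mse
print(round(f, 4))  # 4.5; with p = 1, F equals the square of the slope's t statistic
```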
SECTION 3 TESTING FOR SIGNIFICANCE
- 2. t Test
- If the F test shows that the multiple regression relationship is significant, a t test can be conducted to determine the significance of each of the individual parameters.
- t Test for Individual Significance
- For any parameter βi:
- H0: βi = 0
- HA: βi ≠ 0
SECTION 3 TESTING FOR SIGNIFICANCE
- Test Statistic: t = bi / s_bi
- Rejection Rule
- Using the test statistic: Reject H0 if |t| ≥ tα/2, where tα/2 is based on a t distribution with n − p − 1 degrees of freedom
- Using the p-value: Reject H0 if p-value < α
SECTION 3 TESTING FOR SIGNIFICANCE
- 3. Multicollinearity
- Multicollinearity refers to the correlation among the independent variables.
SECTION 3 QUALITATIVE INDEPENDENT VARIABLES
- We must work with qualitative independent variables such as gender (male, female), method of payment (cash, credit card, check), and so on.
SECTION 3 RESIDUAL ANALYSIS
- 1. Standardized Residual for Observation i
- (yi − ŷi) / s_(yi − ŷi)
- where s_(yi − ŷi) is the standard deviation of residual i.
- The general formula for the standard deviation of residual i is defined as follows.
SECTION 3 RESIDUAL ANALYSIS
- 2. Standard Deviation of Residual i
- s_(yi − ŷi) = s √(1 − hi)
- where
- s = standard error of the estimate
- hi = leverage of observation i
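For simple regression the leverage hi has a closed form, hi = 1/n + (xi − x̄)²/Σ(xi − x̄)², which makes the standardized residuals easy to sketch on hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical illustration data
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))   # standard error of the estimate

# Leverage (closed form for simple regression) and standardized residuals.
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_resid = [(yi - (b0 + b1 * xi)) / (s * math.sqrt(1 - hi))
             for xi, yi, hi in zip(x, y, h)]
print([round(hi, 2) for hi in h])  # [0.6, 0.3, 0.2, 0.3, 0.6]
print([round(r, 2) for r in std_resid])
```

Note that observations at the extremes of the x range carry the highest leverage.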
SECTION 3 RESIDUAL ANALYSIS
- 3. Detecting Outliers
- An outlier is an observation that is unusual in comparison with the other data; in other words, an outlier does not fit the pattern of the other data.
- In general, the presence of one or more outliers in a data set tends to increase s, the standard error of the estimate, and hence increase s_(yi − ŷi), the standard deviation of residual i.
SECTION 3 RESIDUAL ANALYSIS
- The size of the standardized residual will decrease as s increases.
- 4. Influential Observations
- The rule of thumb hi > 3(p + 1)/n is used to identify influential observations.
- Consider the Butler Trucking example, with p = 2 independent variables and n = 10 observations.
SECTION 3 RESIDUAL ANALYSIS
- The critical value for leverage is
- 3(p + 1)/n = 3(2 + 1)/10 = 0.9
- The leverage values for the Butler Trucking example obtained by using Minitab are reported in Table 3-24, in your textbook. Because hi does not exceed 0.9, we do not detect influential observations in the data set.
SECTION 3 RESIDUAL ANALYSIS
- 5. Using Cook's Distance Measure to Identify Influential Observations
- Cook's distance measure uses both the leverage of observation i, hi, and the residual for observation i, (yi − ŷi), to determine whether the observation is influential.
- Cook's Distance Measure
- Di = [(yi − ŷi)² / ((p + 1)s²)] [hi / (1 − hi)²]
SECTION 3 RESIDUAL ANALYSIS
- where
- Di = Cook's distance measure for observation i
- yi − ŷi = the residual for observation i
- hi = the leverage for observation i
- p = the number of independent variables
- s = the standard error of the estimate
- The value of Cook's distance measure will be large, and indicate an influential observation, if the residual and/or the leverage is large.
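A sketch of the measure as a small function; the single-observation values (residual 1.0, leverage 0.2, p = 1, s² = 0.8) are hypothetical:

```python
def cooks_d(residual, h, p, s2):
    """Cook's distance for one observation: combines its residual and its leverage."""
    return (residual ** 2 / ((p + 1) * s2)) * (h / (1 - h) ** 2)

# Hypothetical observation: residual 1.0, leverage 0.2, p = 1 predictor, s^2 = 0.8.
d = cooks_d(1.0, 0.2, 1, 0.8)
print(round(d, 4))  # 0.1953, well below the rule-of-thumb cutoff of 1
```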
SECTION 3 RESIDUAL ANALYSIS
- As a rule of thumb, values of Di > 1 indicate that the ith observation is influential and should be studied further.