Transcript and Presenter's Notes

Title: QNT 531 Advanced Problems in Statistics and Research Methods


1
QNT 531 Advanced Problems in Statistics and
Research Methods
  • WORKSHOP 3
  • By
  • Dr. Serhat Eren
  • University of Phoenix

2
SECTION 3 OBJECTIVES
  • Find the linear regression equation for a
    dependent variable Y as a function of a single
    independent variable X
  • Determine whether a relationship between X and Y
    exists
  • Analyze the results of a regression analysis to
    determine whether the simple linear model is
    appropriate

3
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 1. Deterministic and Statistical Relationships
  • In some cases where two variables, x and y, are
    related, the relationship is deterministic, or
    functional. This means that when a value of x is
    selected, the value of y is uniquely determined.
  • Figure 15.1 illustrates this type of relationship.

4
(No Transcript)
5
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • If a person were to order x = 100 items, then the
    corresponding cost would be y = 50 + (1.20)(100)
    = 170.
  • Every person who orders 100 items will incur the
    cost of 170. That is, the value of y is unique
    for a given value of x.
  • Suppose that we are looking at the relationship
    between dollars spent in advertising and the
    revenues from sales.

6
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • Clearly, we expect the two variables to be
    related, but we do not expect that every time a
    company spends x dollars in advertising it will
    always have y dollars in revenues.
  • We know that there are other factors, or
    variables, such as the type of product, location,
    and various economic factors, that will affect
    the value of Y for a given value of X.

7
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • When we collect our data we are collecting pairs
    of observations on the two variables, X and Y.
    Thus, we will have a set of n data pairs
  • (X1, Y1), (X2, Y2), . . . , (Xn, Yn)
  • A plot of these pairs, shown in Figure 15.2, is
    called a scatter plot. It is of primary importance
    in exploring relationships between variables, and
    should be made before any type of statistical
    analysis is performed.

8
(No Transcript)
9
(No Transcript)
10
(No Transcript)
11
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 2. The Simple Linear Regression Model
  • The true relationship between the variables X and
    Y, the simple regression model, can be described
    by the equation
  • Y = β0 + β1X + ε
  • This equation says that for a given value of the
    variable X = x, the actual value of Y will be
    determined by the expression β0 + β1x, plus
    some random variation, ε, due to other
    unmeasured factors.

12
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • Thus, if we knew the values of β0, the true
    population intercept, and β1, the true
    population slope, we could predict the value of Y
    to within some random error, ε.
  • Figure 15.3 shows the population model for a
    linear regression. Figure 15.4 shows such a line
    along with the data.

13
(No Transcript)
14
(No Transcript)
15
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • The equation of the line that we draw will be
  • ŷ = b0 + b1x
  • where ŷ is the predicted value of Y for a
    particular value of X = x. The quantities b0 and
    b1 are the estimates of the population values
    β0 and β1. This line is called the
    regression line of Y on X or the estimate of the
    simple regression model.

16
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 3. The Least-Squares Line
  • Figure 15.6 shows a set of data and a line drawn
    to represent the relationship between the
    variables.
  • The distance between the predicted value of Y,
    ŷ, and the actual value of Y, y, is called the
    deviation, or error.

17
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • The technique that finds the equation of the line
    that minimizes the total or sum of the squared
    deviations between the actual data points and the
    line is called the least-squares method.

18
(No Transcript)
19
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • The least-squares method finds the equation of
    the line
  • ŷ = b0 + b1x
  • that minimizes
  • Σ(y − ŷ)²,

20
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • the total of squared deviations from the data
    points to the line. The values for b0 (the
    intercept of the line) and b1 (the slope of the
    line) are found by using the following equations:
  • b1 = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
  • b0 = ȳ − b1x̄

21
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • The easiest way to look at what is involved in
    the calculations is to make a table with a column
    for each sum needed. The table will look like the
    one in Table 15.1.
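  • To make the column-sums recipe concrete, here is a
    minimal Python sketch (not from the original deck;
    the data values below are made up):

    # Least-squares estimates from the column sums of Table 15.1
    xs = [1, 2, 3, 4, 5]             # hypothetical X values
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical Y values
    n = len(xs)

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    print(f"y-hat = {b0:.2f} + {b1:.2f}x")   # y-hat = 0.05 + 1.99x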

22
(No Transcript)
23
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 4. Using the Computer to Do Regression
  • Analysis
  • Any statistical software that you might be using
    is capable of performing regression analysis.
  • In addition, most spreadsheet packages also do
    regression. We will start by identifying the
    estimates for the parameters of the regression
    equation in the output for several software
    packages.
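  • As one concrete illustration (a sketch assuming
    Python with the statsmodels package; any
    regression-capable package gives equivalent
    output):

    import numpy as np
    import statsmodels.api as sm

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    X = sm.add_constant(x)          # adds the intercept column
    results = sm.OLS(y, X).fit()    # ordinary least squares
    print(results.params)           # b0 and b1
    print(results.summary())        # full regression output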

24
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 5. Using the Regression Equation to Make
  • Predictions
  • In Figure 15.3 we saw that the values of Y vary
    around the true regression line. The value of ŷ
    we find is really a prediction of the mean value
    of Y for a given value of X.
  • There are two kinds of predictions that you can
    make, interpolation and extrapolation.

25
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • Using the equation to predict values of Y within
    the range of the X data is called interpolation.
  • Predicting values of Y for values of X outside
    the observed range is called extrapolation.
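  • A minimal Python sketch of the distinction, using
    hypothetical coefficients and an assumed observed
    range for X:

    b0, b1 = 0.05, 1.99       # hypothetical least-squares estimates
    x_min, x_max = 1.0, 5.0   # observed range of X in the sample

    def predict(x):
        if not (x_min <= x <= x_max):
            print(f"warning: x = {x} is outside "
                  f"[{x_min}, {x_max}] (extrapolation)")
        return b0 + b1 * x

    print(predict(3.0))   # interpolation: within the X data
    print(predict(9.0))   # extrapolation: less trustworthy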

26
(No Transcript)
27
(No Transcript)
28
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 6. Calculating Residuals
  • The difference between the observed value of Y
    (yi) and the predicted value of Y from the
    regression equation (ŷi), for a value of X = xi,
    is called the ith residual, ei = yi − ŷi.

29
(No Transcript)
30
SECTION 3 THE SIMPLE LINEAR REGRESSION MODEL
  • 7. The Standard Error of the Estimate
  • The standard error of the estimate, sY|X, is a
    measure of how much the data vary around the
    regression line. It is computed as
  • sY|X = √(SSE / (n − 2))
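  • A short Python sketch of the computation
    (illustrative data; the coefficients are the
    least-squares estimates for these data):

    import math

    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    b0, b1 = 0.05, 1.99

    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    s = math.sqrt(sse / (len(xs) - 2))   # standard error of the estimate
    print(f"s = {s:.4f}")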

31
(No Transcript)
32
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • 1. Hypothesis Testing About the Slope, β1
  • If the variables X and Y are related, the slope
    of the line will be some number. If there is no
    relationship between X and Y then the slope of
    the line is zero.
  • That is, we say that as X changes, Y does not
    change in a related way, Figure 15.9.
  • We will use hypothesis testing to decide whether
    the slope of the regression line is significantly
    different from zero.

33
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • The first step in testing a hypothesis is to set
    up the appropriate hypotheses. In this case we
    want to test
  • H0: β1 = 0
  • HA: β1 ≠ 0

34
(No Transcript)
35
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • If the test results in rejecting the null
    hypothesis, then we will conclude that the slope
    of the regression line is not equal to zero, and
    that the relationship between the X and Y
    variables is real.
  • Our estimate of β1 is b1, and to proceed with the
    steps of the hypothesis test we need to know
    about the sampling distribution of b1.

36
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • It turns out that the sampling distribution
    associated with the least-squares estimate of the
    slope is the Student t distribution.
  • The test statistic for our hypothesis test is
    therefore
  • t = b1 / sb1

37
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • which has a t distribution with n − 2 degrees of
    freedom. In the formula, sb1 is the standard
    error of the slope b1 and is calculated by
  • sb1 = sY|X / √(Σ(xi − x̄)²)

38
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • Since the test is a two-sided test, once the
    significance level of the test, α, is chosen, the
    critical values of the test are
  • ±tα/2, n−2
  • We now have the test set up. All that remains is
    to perform the test and make a decision.
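  • A sketch of the full test in Python with scipy;
    the slope estimate and its standard error are
    made-up numbers:

    from scipy import stats

    b1, s_b1 = 1.99, 0.15   # hypothetical b1 and its standard error
    n, alpha = 10, 0.05

    t_stat = b1 / s_b1
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)     # critical value
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

    print(f"t = {t_stat:.3f}, critical = ±{t_crit:.3f}, p = {p_value:.4f}")
    if abs(t_stat) > t_crit:
        print("Reject H0: the slope differs significantly from zero.")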

39
(No Transcript)
40
(No Transcript)
41
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • 2. Partitioning the Variance in Linear Regression
  • SST: the total variation in the Y values around
    the mean
  • SSR: the variation in Y that is caused by Y's
    relationship with X
  • SSE: the variation in Y that remains unexplained

42
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • The quantities SST, SSR, and SSE are known as the
    sums of squares: SST is the total sum of squares,
    SSR is the regression sum of squares, and SSE is
    the error sum of squares.
  • Figure 15.11 illustrates how these quantities
    relate to the data and to the regression line.

43
(No Transcript)
44
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • In addition, you see that
  • SST = SSR + SSE
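  • A Python sketch that checks the identity
    numerically (illustrative data; the identity holds
    only for the least-squares fit):

    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    b0, b1 = 0.05, 1.99   # least-squares estimates for these data

    y_bar = sum(ys) / len(ys)
    y_hat = [b0 + b1 * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    print(f"SST = {sst:.4f}, SSR + SSE = {ssr + sse:.4f}")   # equal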

45
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • When X and Y are related, SSR is a large part of
    the total variation. This implies that a major
    reason that Y varies so much is because it is
    related to X.
  • When this is true, the SSE component of the
    variation is small and is the variation in Y that
    happens "naturally" or entirely due to chance.

46
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • When X and Y are not related, the regression line
    is horizontal (β1 = 0) and the SSR component of
    the variation disappears.
  • The SSE part of the variation becomes dominant
    and we say that we cannot really explain the
    variation in Y using the linear model with X.

47
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • The test is referred to as an analysis of
    variance (ANOVA) test, because it is based on
    looking at the variation in the Y variable.
  • The hypotheses that we test are
  • H0: The linear regression model is not
    significant.
  • HA: The linear regression model is significant.

48
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • The test statistic for this test uses the mean
    squares, which are obtained by dividing the sums
    of squares, SSR and SSE, by their respective
    degrees of freedom:
  • MSR = SSR / 1, MSE = SSE / (n − 2), and the test
    statistic is F = MSR / MSE.

49
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • The regression sum of squares (SSR) measures the
    amount of the variation in the Y variable that
    can be accounted for or explained by Y's
    relationship with X.
  • If you look at SSR as a portion of SST, then you
    can determine the amount of the variability in Y
    that can be explained, or accounted for.

50
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • This value is called the coefficient of
    determination, R²:
  • R² = SSR / SST
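  • Continuing the sums-of-squares sketch above with
    its (illustrative) values:

    sst, ssr = 39.708, 39.601        # from the partition sketch
    r_squared = ssr / sst
    print(f"R^2 = {r_squared:.4f}")  # share of variation in Y explained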

51
SECTION 3 INFERENCES ABOUT THE LINEAR REGRESSION MODEL
  • R² is usually part of the general information in
    the output from statistical packages.
  • The coefficient of determination gives you a
    measure of how much the variation in Y could be
    reduced if X were controlled to a single value.
  • This is a way of measuring how useful a model is
    for planning purposes.

52
SECTION 3 CORRELATION ANALYSIS
  • 1. The Correlation Coefficient
  • In Figure 15.14, you see three types of
    relationships: perfect negative, none, and
    perfect positive.
  • The correlation coefficient is used as a measure
    of the strength of a linear relationship.

53
SECTION 3 CORRELATION ANALYSIS
  • A correlation of -1 corresponds to a perfect
    negative relationship, a correlation of 0
    corresponds to no relationship, and a correlation
    of 1 corresponds to a perfect positive
    relationship.

54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
SECTION 3 CORRELATION ANALYSIS
  • The correlation coefficient, r, is calculated
    using the formula
  • r = (nΣxy − ΣxΣy) /
    √((nΣx² − (Σx)²)(nΣy² − (Σy)²))

59
SECTION 3 CORRELATION ANALYSIS
  • The correlation coefficient is also related to
    one of the quantities that we looked at in
    regression analysis, the coefficient of
    determination, R².
  • The absolute value of r is equal to the positive
    square root of R²; the sign of r matches the
    sign of the slope, b1.
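  • A Python sketch of the computation and of the link
    to R² (illustrative data):

    import math

    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(xs)

    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)

    r = (n * sxy - sx * sy) / math.sqrt(
        (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
    print(f"r = {r:.4f}, r^2 = {r * r:.4f}")   # r^2 equals R^2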

60
SECTION 3 REGRESSION ASSUMPTIONS AND RESIDUAL ANALYSIS
  • 1. Assumptions and Problems in the Regression
    Models
  • Remember that the simple linear model is given by
  • Y = β0 + β1X + ε
61
SECTION 3 REGRESSION ASSUMPTIONS AND RESIDUAL ANALYSIS
  • The basic assumptions about the error term ε are:
  • It has a mean value of zero.
  • For every value of X, the standard deviation, σ,
    of ε is the same.
  • The distribution of ε is normal.
  • The error terms for the different observations
    are not correlated with each other.
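  • A crude Python sketch of checking the first and
    last assumptions from the residuals (hypothetical
    residuals; a residual plot or a Durbin-Watson test
    would be the formal tools):

    import statistics

    residuals = [0.06, -0.13, 0.18, -0.21, 0.10]   # hypothetical

    print(f"mean: {statistics.mean(residuals):.4f}")  # should be near 0
    signs = "".join("+" if e > 0 else "-" for e in residuals)
    print("sign pattern:", signs)  # long runs suggest correlated errors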

62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
SECTION 3 MULTIPLE REGRESSION MODEL
  • Multiple regression analysis is the study of how
    a dependent variable y is related to two or more
    independent variables.
  • In the general case, we will use p to denote the
    number of independent variables.

68
SECTION 3 MULTIPLE REGRESSION MODEL
  • 1. Regression Model and Regression Equation
  • The equation that describes how the dependent
    variable y is related to the independent
    variables x1, x2, ..., xp and an error term is
    called the multiple regression model:
  • y = β0 + β1x1 + β2x2 + ... + βpxp + ε

69
SECTION 3 MULTIPLE REGRESSION MODEL
  • 2. Estimated Multiple Regression Equation
  • Unfortunately, the parameter values β0, β1, ...,
    βp will not be known and must be estimated from
    sample data.
  • A simple random sample is used to compute sample
    statistics b0, b1, ..., bp that are used as the
    point estimators of the parameters β0, β1, ...,
    βp.

70
SECTION 3 MULTIPLE REGRESSION MODEL
  • The estimated multiple regression equation is
  • ŷ = b0 + b1x1 + b2x2 + ... + bpxp
  • where
  • ŷ = estimated value of the dependent variable
  • b0, b1, ..., bp are the estimates of
    β0, β1, ..., βp.

71
SECTION 3 LEAST SQUARES METHOD
  • The least squares method develops the estimated
    regression equation that best approximates the
    straight-line relationship between the dependent
    and independent variables.
  • 1. Least Squares Criterion
  • min Σ(yi − ŷi)²

72
SECTION 3 LEAST SQUARES METHOD
  • where
  • yi = observed value of the dependent variable
    for the ith observation
  • ŷi = estimated value of the dependent variable
    for the ith observation
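  • A sketch of the criterion in action, assuming
    Python with numpy (the data are made up):

    import numpy as np

    # Two independent variables; lstsq minimizes the sum of
    # squared deviations between observed and estimated y.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                  [4.0, 3.0], [5.0, 5.0]])   # hypothetical x1, x2
    y = np.array([5.0, 4.5, 10.0, 9.0, 13.0])

    X1 = np.column_stack([np.ones(len(y)), X])   # intercept column
    coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
    b0, b1, b2 = coefs
    print(f"y-hat = {b0:.3f} + {b1:.3f} x1 + {b2:.3f} x2")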

73
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
  • The total sum of squares can be partitioned into
    two components: the sum of squares due to
    regression and the sum of squares due to error.
  • 1. Relationship Among SST, SSR, and SSE
  • SST = SSR + SSE

74
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
  • where
  • SST = Σ(yi − ȳ)², the total sum of squares
  • SSR = Σ(ŷi − ȳ)², the sum of squares due to
    regression
  • SSE = Σ(yi − ŷi)², the sum of squares due to
    error
75
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
  • 2. Multiple Coefficient of Determination
  • R² = SSR / SST
  • Many analysts prefer adjusting R² for the number
    of independent variables to avoid overestimating
    the impact of adding an independent variable on
    the amount of variability explained by the
    estimated regression equation.

76
SECTION 3 MULTIPLE COEFFICIENT OF DETERMINATION
  • With n denoting the number of observations and p
    denoting the number of independent variables, the
    adjusted multiple coefficient of determination is
    computed as follows:
  • Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1)
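  • A one-step Python sketch with made-up values:

    r2, n, p = 0.92, 10, 2                           # hypothetical
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(f"adjusted R^2 = {r2_adj:.4f}")            # 0.8971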

77
SECTION 3 MULTIPLE REGRESSION MODEL ASSUMPTIONS
  • 1. Multiple Regression Model
  • y = β0 + β1x1 + β2x2 + ... + βpxp + ε
  • 2. Assumptions About the Error Term in the
  • Multiple Regression Model
  • The error ε is a random variable with mean or
    expected value of zero; that is, E(ε) = 0.

78
SECTION 3 MULTIPLE REGRESSION MODEL ASSUMPTIONS
  • The variance of ε is denoted by σ² and is the
    same for all values of the independent variables
    x1, x2, ..., xp.
  • The values of ε are independent.
  • The error ε is a normally distributed random
    variable reflecting the deviation between the y
    value and the expected value of y given by
    β0 + β1x1 + β2x2 + ... + βpxp.

79
SECTION 3 TESTING FOR SIGNIFICANCE
  • The F test is used to determine whether a
    significant relationship exists between the
    dependent variable and the set of all the
    independent variables.
  • If the F test shows an overall significance, the
    t test is used to determine whether each of the
    individual independent variables is significant.
    A separate t test is conducted for each of the
    independent variables in the model.

80
SECTION 3 TESTING FOR SIGNIFICANCE
  • 1. F-Test
  • For the multiple regression model
  • y = β0 + β1x1 + β2x2 + ... + βpxp + ε
  • the hypotheses for the F test involve the
    parameters of the multiple regression model:
  • H0: β1 = β2 = ... = βp = 0
  • HA: One or more of the parameters is not equal
    to zero

81
SECTION 3 TESTING FOR SIGNIFICANCE
  • If H0 is rejected, we have sufficient statistical
    evidence to conclude that one or more of the
    parameters is not equal to zero and that the
    overall relationship between y and the set of
    independent variables x1, x2, ..., xp is
    significant.
  • However, if H0 cannot be rejected, we do not have
    sufficient evidence to conclude that a
    significant relationship is present.

82
SECTION 3 TESTING FOR SIGNIFICANCE
  • F Test for Overall Significance
  • Test Statistic:
  • F = MSR / MSE
  • where MSR = SSR/p and MSE = SSE/(n − p − 1)

83
SECTION 3 TESTING FOR SIGNIFICANCE
  • Rejection Rule
  • Using test statistic:
  • Reject H0 if F > Fα, where Fα is based on p
    numerator and n − p − 1 denominator degrees of
    freedom
  • Using p-value:
  • Reject H0 if p-value < α
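  • A sketch of the overall F test in Python with
    scipy, using made-up sums of squares:

    from scipy import stats

    ssr, sse = 90.0, 10.0   # hypothetical sums of squares
    n, p = 10, 2
    alpha = 0.05

    msr = ssr / p
    mse = sse / (n - p - 1)
    f_stat = msr / mse
    f_crit = stats.f.ppf(1 - alpha, dfn=p, dfd=n - p - 1)
    p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)
    print(f"F = {f_stat:.2f}, critical = {f_crit:.2f}, p = {p_value:.5f}")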

84
SECTION 3 TESTING FOR SIGNIFICANCE
  • 2. t-Test
  • If the F test shows that the multiple regression
    relationship is significant, a t test can be
    conducted to determine the significance of each
    of the individual parameters.
  • t Test for Individual Significance
  • For any parameter βi:
  • H0: βi = 0
  • HA: βi ≠ 0

85
SECTION 3 TESTING FOR SIGNIFICANCE
  • Test Statistic:
  • t = bi / sbi, with n − p − 1 degrees of freedom
  • Rejection Rule
  • Using test statistic: Reject H0 if |t| > tα/2
  • Using p-value: Reject H0 if p-value < α
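  • A sketch for one coefficient, with made-up
    numbers:

    from scipy import stats

    b_i, s_bi = 0.61, 0.16   # hypothetical bi and its standard error
    n, p, alpha = 10, 2, 0.05

    t_stat = b_i / s_bi
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
    print(f"t = {t_stat:.3f}, critical = ±{t_crit:.3f}")
    print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")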

86
SECTION 3 TESTING FOR SIGNIFICANCE
  • 3. Multicollinearity
  • Multicollinearity refers to the correlation among
    the independent variables.

87
SECTION 3 QUALITATIVE INDEPENDENT VARIABLES
  • We must work with qualitative independent
    variables such as gender (male, female), method
    of payment (cash, credit card, check), and so on.
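  • A Python sketch of dummy (0/1) coding for one such
    variable (hypothetical data; k categories need
    k − 1 dummy variables):

    payments = ["cash", "credit card", "check", "cash", "check"]

    # "cash" is the baseline; two dummies encode three categories.
    rows = [(1 if p == "credit card" else 0,
             1 if p == "check" else 0) for p in payments]
    for p, row in zip(payments, rows):
        print(p, row)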

88
SECTION 3 RESIDUAL ANALYSIS
  • 1. Standardized Residual for Observation i
  • (yi − ŷi) / s(yi − ŷi)
  • where
  • s(yi − ŷi) is the standard deviation of
    residual i.
  • The general formula for the standard deviation of
    residual i is defined as follows.

89
SECTION 3 RESIDUAL ANALYSIS
  • 2. Standard Deviation of Residual i
  • s(yi − ŷi) = s √(1 − hi)
  • where
  • s = standard error of the estimate
  • hi = leverage of observation i
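  • A Python sketch computing leverages and
    standardized residuals via the hat matrix
    (assuming numpy; the data are made up):

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
                  [4.0, 3.0], [5.0, 5.0]])   # hypothetical x1, x2
    y = np.array([5.0, 4.5, 10.0, 9.0, 13.0])
    n, p = X.shape

    X1 = np.column_stack([np.ones(n), X])       # intercept column
    H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T    # hat matrix
    h = np.diag(H)                              # leverages hi

    resid = y - H @ y                           # residuals yi - yhat_i
    s = np.sqrt(resid @ resid / (n - p - 1))    # std. error of estimate
    std_resid = resid / (s * np.sqrt(1 - h))    # standardized residuals
    print(np.round(h, 3), np.round(std_resid, 3))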

90
SECTION 3 RESIDUAL ANALYSIS
  • 3. Detecting Outliers
  • An outlier is an observation that is unusual in
    comparison with the other data; in other words,
    an outlier does not fit the pattern of the other
    data.
  • In general, the presence of one or more outliers
    in a data set tends to increase s, the standard
    error of the estimate, and hence increase
    s(yi − ŷi), the standard deviation of residual i.

91
SECTION 3 RESIDUAL ANALYSIS
  • The size of the standardized residual will
    decrease as s increases.
  • 4. Influential Observations
  • The rule of thumb hi > 3(p + 1)/n is used to
    identify influential observations.
  • For the Butler Trucking example, with p = 2
    independent variables and n = 10 observations:

92
SECTION 3 RESIDUAL ANALYSIS
  • The critical value for leverage is
    3(p + 1)/n = 3(2 + 1)/10 = 0.9.
  • The leverage values for the Butler Trucking
    example obtained by using Minitab are reported in
    Table 3-24 in your textbook. Because hi does not
    exceed 0.9, we do not detect influential
    observations in the data set.

93
SECTION 3 RESIDUAL ANALYSIS
  • 5. Using Cook's Distance Measure to Identify
  • Influential Observations
  • Cook's distance measure uses both the leverage of
    observation i, hi, and the residual for
    observation i, yi − ŷi, to determine whether the
    observation is influential.
  • Cook's Distance Measure
  • Di = [(yi − ŷi)² / ((p + 1)s²)] [hi / (1 − hi)²]

94
SECTION 3 RESIDUAL ANALYSIS
  • where
  • Di = Cook's distance measure for observation i
  • yi − ŷi = the residual for observation i
  • hi = the leverage for observation i
  • p = the number of independent variables
  • s = the standard error of the estimate
  • The value of Cook's distance measure will be
    large and indicate an influential observation if
    the residual and/or the leverage is large.
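  • A Python sketch of the computation, with made-up
    residuals and leverages:

    import numpy as np

    resid = np.array([0.3, -0.5, 1.2, -0.4, -0.6])   # hypothetical
    h = np.array([0.45, 0.30, 0.55, 0.35, 0.60])     # hypothetical
    p, s = 2, 0.4                   # predictors, std. error of estimate

    D = (resid ** 2 / ((p + 1) * s ** 2)) * (h / (1 - h) ** 2)
    print(np.round(D, 3))
    print("influential:", np.where(D > 1)[0])   # rule of thumb: Di > 1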

95
SECTION 3 RESIDUAL ANALYSIS
  • As a rule of thumb, values of Di > 1 indicate
    that the ith observation is influential and
    should be studied further.