Transcript and Presenter's Notes

Title: Cotton example
1
Cotton example
  • Cotton is particularly sensitive to rainfall. Dry
    weather in June appears to slow growth.
  • The following data are records from an
    agricultural experiment station.

2
  June rainfall (cm)   Yield (lb/acre)
          3                 1120
          6                 1750
          7                 1940
          9                 2130
         11                 2380
         15                 2650
         17                 2990
         19                 3130
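As a check, the fitted line reported later in the deck can be reproduced from these eight points; a minimal least-squares sketch in Python (not part of the original slides):

```python
import numpy as np

# June rainfall (cm) and yield (lb/acre) from the table above
x = np.array([3, 6, 7, 9, 11, 15, 17, 19], dtype=float)
y = np.array([1120, 1750, 1940, 2130, 2380, 2650, 2990, 3130], dtype=float)

# Least-squares estimates for the line Yield = b0 + b1 * Rainfall
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
b1 = sxy / sxx                  # slope, roughly 116 lb/acre per cm
b0 = y.mean() - b1 * x.mean()   # intercept, roughly 999 lb/acre
print(f"Yield = {b0:.1f} + {b1:.1f} * Rainfall")
```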

3
  • Yield is the response variable
  • Rainfall in June is the regressor
  • Denote Yield as Y and Rainfall as X.

4
Blackboard
6
Coefficient of determination
  • The proportion of the total variation in Y that
  • is explained by the fitted regression

7
Cotton example
Yield = 999 + 116 × June rainfall
8
Coefficient of determination
  • 96.7% of the total variation is explained by
    the fitted model.
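The 96.7% figure can be verified by splitting the total variation of Y into explained and residual parts; a sketch in Python (not from the original slides):

```python
import numpy as np

x = np.array([3, 6, 7, 9, 11, 15, 17, 19], dtype=float)
y = np.array([1120, 1750, 1940, 2130, 2380, 2650, 2990, 3130], dtype=float)

# Refit the line, then split the total variation in y
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # total variation in Y
ss_resid = np.sum((y - fitted) ** 2)     # unexplained (residual) variation
r2 = 1 - ss_resid / ss_total
print(f"r^2 = {r2:.3f}")                 # about 0.967
```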

10
Recall the assumptions we made for the regression
  • For each X there is a normal distribution of Y
    values
  • The variance of the normal distribution is the
    same for all X values
  • The mean of the Y values at a given X lies on
    the straight regression line

11
Another example
  • Plot tree height against time as an indication
    of growth rate.
  • What is the problem here? The observations
    are not independent.

12
  • Assumption
  • The Y observations collected for Xi must be
    independent of the Y observations collected
    for Xk.

13
  • Make the Y observations independent by just
    looking at incremental growth.
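The idea can be sketched with invented yearly heights: differencing turns cumulative heights into yearly increments, which are much closer to independent observations of growth.

```python
import numpy as np

# Hypothetical yearly tree heights in metres (invented numbers)
heights = np.array([2.1, 2.9, 3.8, 4.4, 5.3, 6.0])

# Successive heights are cumulative, so they are not independent;
# the year-to-year increments are closer to independent observations.
growth = np.diff(heights)
print(growth)
```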

14
Linear relationship?
16
  • It is fairly easy to misuse regression. A
    rule of thumb is to use your common sense. If the
    results of an analysis don't make any sense, get
    help. Ultimately, statistics is a tool employed
    to help us understand life. Although
    understanding life can be tricky, it is not
    usually perverse. Before accepting conclusions
    which seem silly based on statistical analyses,
    consult with a veteran data analyst. Most truly
    revolutionary results from data analyses are
    based on data entry errors.
  • Cody, R. P. and Smith, J. K. (1991) Applied
    Statistics and the SAS Programming Language.

17
Multiple regression
  • Growth rate of a tree is, for example, dependent
    on
  • - water supply
  • - hours of daylight
  • - soil composition
  • - genetics

18
  • Using only one of these variables to explain
    growth rate is not going to lead to a very
    accurate prediction
  • In other words, dividing the total variation into
    the regression variation from one of these
    variables and the residual variation is likely to
    give us a huge residual term

19
Multiple regression
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
The unknown parameters β0, β1, ..., βk are called
regression coefficients or partial regression
coefficients.
20
In addition to the assumptions of simple linear
regression, we also have to make the following
  • Assumption: The variables X1, X2, X3, X4, ...
    are independent, i.e. there is no
    correlation between any pair of variables
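A quick way to screen for violations of this assumption is to inspect the pairwise correlations between the predictors. The sketch below uses invented data, with X3 deliberately built to be correlated with X1:

```python
import numpy as np

# Invented predictor columns; x3 is deliberately correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=50)

# Pairwise correlations between predictors; entries near +/-1 signal
# that the no-correlation assumption is violated (multicollinearity)
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
```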

21
Air pollution in different American cities
  • The response variable Y is the sulfur dioxide
    (SO2) concentration
  • X1 = Temperature
  • X2 = Number of factories
  • X3 = Population size
  • X4 = Wind
  • X5 = Precipitation
  • X6 = Number of days of precipitation

22
Air pollution
  • Our aim is to see how well these six
    variables explain the amount of SO2.
  • In total, 41 different cities were included in
    the study.

23
  • Just as in simple linear regression, we ask
    if the regression coefficients are different
    from zero
  • H0: βj = 0 for all j
  • H1: βj ≠ 0 for at least one j

25
Air pollution
SO2 conc = 111 - 1.26 Temp (F) + 0.0650 No.
factories - 0.0394 Population - 3.17 Wind + 0.509
Ppt - 0.050 No. days Ppt

Y = 111 - 1.26X1 + 0.065X2 - 0.0394X3 - 3.17X4
+ 0.509X5 - 0.050X6
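Plugging a city's values into the fitted equation gives a predicted SO2 concentration. In the sketch below only the coefficients come from the slide; the city values are invented to illustrate the arithmetic:

```python
# Coefficients as reported on the slide above; the city values below
# are hypothetical, chosen only to show how a prediction is computed
def predict_so2(temp_f, n_factories, population, wind, ppt, days_ppt):
    return (111 - 1.26 * temp_f + 0.065 * n_factories
            - 0.0394 * population - 3.17 * wind
            + 0.509 * ppt - 0.050 * days_ppt)

so2 = predict_so2(temp_f=55, n_factories=400, population=600,
                  wind=9, ppt=35, days_ppt=110)
print(f"predicted SO2 concentration: {so2:.1f}")
```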
26
Conclusion
  • At least one of the regression coefficients is
  • different from zero.

27
  • New Question: Which regression coefficients
    are different from zero?

  Predictor     Coef      SE Coef   P
  Temp (F)      -1.2592   0.6203    0.049
  No. fact       0.06500  0.01575   0.000
  Population    -0.03937  0.01514   0.014
  Wind          -3.169    1.815     0.090
  Ppt            0.5092   0.3629    0.170
  No. days      -0.0498   0.1617    0.760

  • Only 3 of the 6 variables have regression
    coefficients different from zero. What happens if
    we fit a regression model omitting the other
    three (Wind, Ppt and No. days)?

28
  • Follow-up null hypothesis
  • H0: β4 = β5 = β6 = 0
    (the coefficients of Wind, Ppt and No. days Ppt)

29
Air pollution
SO2 conc = 58 - 0.584 Temp (F) + 0.0713 No.
factories - 0.0467 Population

Y = 58 - 0.584X1 + 0.0713X2 - 0.0467X3
30
  Predictor     Coef      SE Coef   P
  Temp (F)      -0.5841   0.3710    0.124
  No. fact       0.07131  0.01606   0.000
  Population    -0.04672  0.01537   0.004

  • What happened here? Suddenly the regression
    coefficient of temperature is not significant!
  • Also, r2 for the model including all six
    variables is 0.669, while r2 for the model
    including three variables is 0.612.
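One common way to weigh this r2 drop against the smaller model (not shown on the slides) is the adjusted r2, which penalizes each extra predictor; a sketch using the reported r2 values and the study's n = 41 cities:

```python
def adjusted_r2(r2, n, p):
    # Adjusted r^2 for n observations and p predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 41  # cities in the study
print(adjusted_r2(0.669, n, 6))  # all six variables
print(adjusted_r2(0.612, n, 3))  # three variables only
```

Under this criterion the six-variable model still comes out slightly ahead, which illustrates why the model-choice question on the next slide is not trivial.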

31
  • Questions
  • 1. Which of the two models predicts the sulfur
    dioxide concentration best?
  • 2. Should I further reduce the model by
    eliminating temperature?

32
Multicollinearity of variables
  • Many, if not most, regression analyses are
    conducted on data sets where the independent
    variables show some degree of correlation. These
    data sets, resulting from non-experimental
    research, are common in all fields... The
    potential for a researcher to be misled by a
    non-experimental data set is high; for a novice
    researcher it is a near certainty.
  • Cody, R. P. and Smith, J. K. (1991) Applied
    Statistics and the SAS Programming Language.

33
  • The problem is that the correlation among
    independent variables causes regression estimates
    to change depending on which independent
    variables are being used. That is, the impact of
    B on A depends on whether C is in the equation or
    not. With C omitted, B can look very influential.
    With C included, the impact of B can disappear
    completely!
  • The reason for this is as follows: A regression
    coefficient tells us the unique contribution of
    an independent variable to a dependent variable.
    That is, the coefficient for B tells us what B
    contributes all by itself, with no overlap with
    any other variable. If B is the only variable in
    the equation, this is no problem. But if we add
    C, and if B and C are correlated, then the unique
    contribution of B to A will be changed.
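This effect is easy to reproduce with simulated data: below, A depends only on C, yet B (correlated with C) looks influential when C is omitted. All variables and coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# A depends only on C; B has no direct effect but is correlated with C
c = rng.normal(size=n)
b = c + 0.3 * rng.normal(size=n)
a = 2.0 * c + rng.normal(size=n)

# Regress A on B alone: B borrows C's effect via the correlation
X1 = np.column_stack([np.ones(n), b])
slope_b_alone = np.linalg.lstsq(X1, a, rcond=None)[0][1]

# Regress A on B and C together: B's unique contribution collapses
X2 = np.column_stack([np.ones(n), b, c])
coefs = np.linalg.lstsq(X2, a, rcond=None)[0]

print(f"B alone: {slope_b_alone:.2f}; with C included: {coefs[1]:.2f}")
```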