Title: WFM 5201: Data Management and Statistical Analysis
1 WFM 5201 Data Management and Statistical Analysis
Lecture-6 Correlation and Regression Analysis
Institute of Water and Flood Management (IWFM)
Bangladesh University of Engineering and Technology (BUET)
June, 2008
2 Correlation
- Correlation is concerned with describing the
direction (positive or negative) and strength of
a relationship between two variables.
- Correlation makes no distinction between the two
variables (it is a measure of how they vary
jointly), whereas regression theory depends on a
dependent variable being affected by an
error-free independent variable.
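3 Correlation coefficient
- The direction and strength of the relationship
can be expressed by means of a correlation
coefficient r, which is mathematically defined
in terms of three sums.
- The sum of cross products of deviations:
  $SP_{XY} = \sum (X - \bar{X})(Y - \bar{Y})$
4 Correlation coefficient
- The sum of squared deviations for X:
  $SS_X = \sum (X - \bar{X})^2$
- The sum of squared deviations for Y:
  $SS_Y = \sum (Y - \bar{Y})^2$
5 Pearson's r
  $r = \dfrac{SP_{XY}}{\sqrt{SS_X \, SS_Y}}$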
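Pearson's r is the sum of cross products of deviations divided by the square root of the product of the two sums of squared deviations. A minimal Python sketch of that calculation follows; the function name `pearson_r` and the sample numbers are illustrative assumptions, not part of the lecture.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r built from the three sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross products of deviations
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for X and for Y
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Illustrative data only (not from the lecture)
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # points on a rising line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # points on a falling line
print(r_up, r_down)
```

Swapping the two arguments gives the same value, which reflects the point that correlation makes no distinction between the two variables.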
6 Correlation coefficient
- A correlation coefficient varies from -1 to 1:
- -1 indicates a perfect negative relationship
(one variable increases while the other decreases),
- 0 indicates no linear relationship,
- 1 indicates a perfect positive relationship.
- The absolute size of the correlation indicates the
strength of the relationship; for example, the
correlation coefficient -0.89 indicates a
stronger relationship than a coefficient of 0.60.
7 Linear Regression
- Regression is primarily concerned with using the
relationship for the purpose of predicting one
variable from knowledge of the other.
- Correlation, on the other hand, is primarily
concerned with discovering whether or not a
relationship exists in the first place, and then
specifying the strength and direction of this
relationship.
8 Linear Regression
- The simple linear regression equation is given
as
  $Y = b_0 + b_1 X$
- X: given data
- b0: intercept of regression line
- b1: slope of regression line
Fitting the line in this way is also known as the least squares method.
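As a rough illustration of the least squares method, the slope can be computed as the sum of cross products of deviations divided by the sum of squared deviations of X, and the intercept then follows from the two means. The helper name `fit_line` and the sample numbers below are assumptions made for this sketch only.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for Y = b0 + b1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = sp_xy / ss_x            # slope of regression line
    b0 = mean_y - b1 * mean_x    # intercept of regression line
    return b0, b1

# Illustrative data lying exactly on Y = 1 + 2X (not from the lecture)
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the fitted line always passes through the point of means, the intercept needs no separate minimisation once the slope is known.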
9 Regression line
10 Coefficient of Regression
11 Coefficient of Determination
- The decomposition of the sample variation of Y
leads to a measure of the "goodness of fit",
which is known as the coefficient of
determination and denoted by $R^2$.
Note
12 Coefficient of determination
- $R^2$ is a measure commonly used to describe how well
the sample regression line fits the observed
data.
- Range: $0 \le R^2 \le 1$
- 0 means the poorest fit and 1 the best fit of the regression model.
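The two ends of that range can be checked numerically. The sketch below computes the coefficient of determination as one minus the ratio of unexplained to total variation, which is one common way of writing it; the function name `r_squared` and the numbers are illustrative assumptions.

```python
def r_squared(ys, fitted):
    """Coefficient of determination: 1 - (unexplained / total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - ss_unexplained / ss_total

# Illustrative observations (not from the lecture)
ys = [3, 5, 7, 9]
best = r_squared(ys, [3, 5, 7, 9])      # fitted values equal the data
poorest = r_squared(ys, [6, 6, 6, 6])   # fitted values are just the mean of Y
print(best, poorest)
```

A line that predicts no better than the mean of Y explains none of the variation, which is why it sits at the bottom of the range.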
13 Exercise-1: Fit a regression equation between Boro
production and rainfall and find $R^2$
Year      Boro Production   Rainfall
1975-76   424536            216
1976-77   152273            319
1977-78   437007            164
1978-79   278287            141
1979-80   417225            237
1980-81   500207            197
1981-82   395940            255
1982-83   418170            221
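One possible way to work the exercise in Python is sketched below. It assumes rainfall is the independent variable X and Boro production the dependent variable Y, which the exercise does not state explicitly, and it uses the fact that in simple linear regression the coefficient of determination equals the square of Pearson's r. Units are not given on the slide, so none are shown.

```python
# Data copied from the exercise table
rainfall = [216, 319, 164, 141, 237, 197, 255, 221]
boro = [424536, 152273, 437007, 278287, 417225, 500207, 395940, 418170]

n = len(rainfall)
mean_x = sum(rainfall) / n
mean_y = sum(boro) / n

# Sums of cross products and squared deviations
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rainfall, boro))
ss_x = sum((x - mean_x) ** 2 for x in rainfall)
ss_y = sum((y - mean_y) ** 2 for y in boro)

b1 = sp_xy / ss_x                   # slope
b0 = mean_y - b1 * mean_x           # intercept
r2 = sp_xy ** 2 / (ss_x * ss_y)     # R2 = r squared for a single predictor

print(f"Boro production = {b0:.1f} + ({b1:.2f}) * rainfall")
print(f"R2 = {r2:.3f}")
```

With only eight seasons of data, the fitted coefficients are sensitive to individual years, so the result is best read as practice in the method rather than as a finding.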
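14 Deviations or Errors
- The sum of squares of these deviations from the
fitted line is the unexplained term in the decomposition
  $\sum (Y - \bar{Y})^2 = \sum (\hat{Y} - \bar{Y})^2 + \sum (Y - \hat{Y})^2$
  Total deviation = Explained deviation + Unexplained deviation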
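The split of total variation into explained and unexplained parts can be verified numerically for a least squares line. The following sketch uses made-up numbers and hypothetical variable names, and only checks that the two parts add up to the total.

```python
# Illustrative data only (not from the lecture)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares line
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x
fitted = [b0 + b1 * x for x in xs]

# Sums of squared total, explained, and unexplained deviations
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_explained = sum((f - mean_y) ** 2 for f in fitted)
ss_unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))

print(ss_total, ss_explained + ss_unexplained)  # equal up to rounding
```

The identity relies on the line being fitted by least squares with an intercept; for an arbitrary line the two parts need not sum to the total.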
15 Total, explained, and unexplained deviation
16 Regression diagnostics
- Patterns for residual plots: (a) satisfactory, (b)
funnel, (c) double bow, (d) non-linear
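To see how a non-linear pattern can show up in residuals, the sketch below fits a straight line to data that actually follow a curve and lists the residuals in order of X. The data and names are assumptions chosen for illustration; a real diagnostic would normally plot residuals against fitted values.

```python
# Illustrative curved data, Y = X squared (not from the lecture)
xs = [0, 1, 2, 3]
ys = [0, 1, 4, 9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares straight line through the curved data
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x

# Residual = observed value minus fitted value
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)
```

The residuals are positive at both ends and negative in the middle, a systematic curve rather than a random scatter, which is the kind of signal the non-linear pattern refers to.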