CSC 323 Quarter: Spring - PowerPoint PPT Presentation

CSC 323 Quarter: Spring 02/03 Daniela Stan Raicu School of CTI, DePaul University – PowerPoint PPT presentation

Slides: 21
Provided by: Ioa71
Transcript and Presenter's Notes

Title: CSC 323 Quarter: Spring


1
CSC 323 Quarter Spring 02/03
  • Daniela Stan Raicu
  • School of CTI, DePaul University

2
Outline
Chapter 2 Looking at Data Relationships
between two or more variables
  • Remarks on Correlation (last slides from the
    previous lecture)
  • Linear regression
  • Least-squares regression line
  • Residual Analysis
  • Cautions about regression and correlation
  • SAS procedures for univariate data, scatterplots,
    correlation and regression

3
Correlation
  • The correlation r measures the direction and
    strength of the linear relationship between two
    quantitative variables.
  • Suppose we have the following data

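X    Y
x1   y1
x2   y2
...  ...
xn   yn

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $\bar{x}$, $\bar{y}$ are the means and $s_x$, $s_y$ are the standard deviations of the
two variables X and Y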
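As a minimal sketch, r can be computed as the sum of products of standardized x and y values, divided by n − 1. The data values and the function name `correlation` are hypothetical, chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation r from standardized values (sample SDs divide by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Sum of products of standardized values, divided by n - 1
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))
```

A positive r here indicates that y tends to increase with x; r always lies between −1 and 1.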
4
More on Correlation
  • Correlation ignores distinction between
    explanatory and response variables
  • Correlation requires that both variables be
    quantitative
  • Correlation is not affected by changes in the
    unit of measurement of either variable
  • Correlation measures the strength of only linear
    relationships
  • Correlation is not a resistant measure, so outliers
    can greatly change the value of r.
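A few of these properties can be checked numerically. The sketch below uses made-up height and weight values (an assumption, not data from the lecture) to show that r is unchanged by a change of units or by swapping the two variables, and that a single outlier lowers it.

```python
import math

def correlation(xs, ys):
    """Correlation r, written in the equivalent sums-of-squares form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 80]

r_cm = correlation(heights_cm, weights_kg)

# Changing the unit of measurement (cm to inches) leaves r unchanged
heights_in = [h / 2.54 for h in heights_cm]
r_in = correlation(heights_in, weights_kg)

# Swapping explanatory and response variables leaves r unchanged
r_swapped = correlation(weights_kg, heights_cm)

# Adding one outlying observation changes r noticeably
r_outlier = correlation(heights_cm + [200], weights_kg + [40])

print(r_cm, r_in, r_swapped, r_outlier)
```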

5
Not all Relationships are Linear: Miles per
Gallon versus Speed
  • Curved relationship (r is misleading)
  • Speed varies from 20 mph to 60 mph
  • MPG varies from trial to trial, even at the same
    speed
  • Statistical relationship

Correlation measures the strength of only linear
relationships
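To illustrate why r can mislead for a curved relationship, the sketch below uses invented MPG values (an assumption, not the data behind the slide) that rise and then fall symmetrically with speed. The relationship is strong, yet r comes out as essentially zero.

```python
import math

def correlation(xs, ys):
    """Correlation r in sums-of-squares form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: MPG peaks at 40 mph and drops off on either side
speed_mph = [20, 30, 40, 50, 60]
mpg = [24, 28, 30, 28, 24]

r = correlation(speed_mph, mpg)
print(r)  # essentially zero, although MPG clearly depends on speed
```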
6
Problems with Correlations
  • Outliers can inflate or deflate correlations
  • Groups combined inappropriately may mask
    relationships (a third variable)
  • Groups may have different relationships when
    separated

Plot
Correlation is not a resistant measure, so
outliers can greatly change the value of r.
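The effect of combining groups can be sketched with two invented groups (hypothetical values, not from the lecture). Within each group the relationship is perfectly positive, but pooling them produces a negative correlation.

```python
import math

def correlation(xs, ys):
    """Correlation r in sums-of-squares form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical groups, for illustration only
group_a_x, group_a_y = [1, 2, 3, 4], [10, 11, 12, 13]
group_b_x, group_b_y = [6, 7, 8, 9], [2, 3, 4, 5]

r_a = correlation(group_a_x, group_a_y)
r_b = correlation(group_b_x, group_b_y)
r_combined = correlation(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b, r_combined)
```

Here the group label plays the role of the third variable: ignoring it hides the positive relationship present in each group.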
7
Linear Regression
Objective: To quantify the linear relationship
between an explanatory variable and response
variable by fitting a line to the data (that is,
drawing a line that comes as close as possible to
the points).
Example
8
Linear Regression
  • A regression line is a straight line that
    describes how a response variable y changes as an
    explanatory variable x changes.

Linear regression equation:

y = a + bx
b = slope (rate of change of y as x changes)
a = intercept (the value of y when x = 0)
Example: Height = a + b × age
9
Prediction
  • Use of Regression to predict the value of y for
    any value of x by substituting this x into the
    equation of the regression line.

Example: Prediction via Regression Line (Husband
and Wife Ages)
  • The regression equation is y = 3.6 + 0.97x,
    where y is the average age of all husbands who
    have wives of age x
  • For all women aged 30, we predict the average
    husband age to be 32.7 years
  • 3.6 + (0.97)(30) = 32.7 years
  • Suppose we know that an individual wife's age is
    30. What would we predict her husband's age to
    be?
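The prediction step amounts to substituting x into the regression equation. A minimal sketch using the coefficients from the example (the function name `predict_husband_age` is hypothetical):

```python
def predict_husband_age(wife_age):
    """Predicted average husband age from y = 3.6 + 0.97x."""
    return 3.6 + 0.97 * wife_age

predicted = predict_husband_age(30)
print(round(predicted, 1))
```

The same substitution gives the prediction for an individual wife aged 30, although individual husbands' ages will vary around that average.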

10
Least-squares Regression
  • Used to determine the best line
  • We want the line to be as close as possible to
    the data points in the vertical (y) direction
    (since that is what we are trying to predict)
  • The least-squares regression line of y on x is
    the line that makes the sum of the squares of the
    vertical distances of the data points from the
    line as small as possible.

[Figure: scatterplot of y against x, marking an observed value y, the predicted value on the line, and the error (vertical distance) between them]
A residual is the difference between an observed
value of the response variable y and the value
predicted by the regression line.
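As a sketch of the idea, residuals can be computed as observed minus predicted values, and the sum of their squares compared across candidate lines. The data are hypothetical. The coefficients a = 2.2, b = 0.6 are the least-squares values for this particular data set, and the second line is an arbitrary alternative.

```python
def residuals(xs, ys, a, b):
    """Observed y minus the value predicted by the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sum_squared_residuals(xs, ys, a, b):
    return sum(e ** 2 for e in residuals(xs, ys, a, b))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for this data set
sse_least_squares = sum_squared_residuals(xs, ys, 2.2, 0.6)
# An arbitrary alternative line, for comparison
sse_other = sum_squared_residuals(xs, ys, 2.0, 0.7)

print(sse_least_squares, sse_other)
```

The least-squares line gives the smaller sum of squared vertical distances; any other choice of a and b gives a larger sum for these data.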
11
Least-Squares Regression
The regression line makes the prediction errors
as small as possible.
12
Least-Squares Regression (cont.)
  • How is the least-squares regression line
    calculated?
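The formulas from this slide are not preserved in the transcript. One standard way to compute the line, shown here as a sketch rather than as the slide's own derivation, uses slope b = r · s_y / s_x and intercept a = ȳ − b · x̄. The data and the function name `least_squares_line` are hypothetical.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for y = a + b*x using b = r * s_y / s_x, a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))
```

With these formulas the fitted line always passes through the point (x̄, ȳ).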

13
Coefficient of Determination (R²)
  • Measures usefulness of regression prediction
  • R² (or r², the square of the correlation)
    measures how much variation in the values of the
    response variable (y) is explained by the
    regression line
  • Example
  • r = 1: R² = 1, regression line explains/captures all
    (100%) of the variation in y
  • r = 0.7: R² = 0.49, regression line explains almost
    half (50%) of the variation in y
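A short sketch of what "variation explained" means: for a least-squares line, 1 − (sum of squared residuals) / (total variation in y) equals the square of the correlation. The data are hypothetical.

```python
import math

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line
b = s_xy / s_xx
a = mean_y - b * mean_x

# Variation left unexplained by the line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared_from_fit = 1 - sse / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

print(round(r_squared_from_fit, 3), round(r ** 2, 3))
```

For the slide's second example, r = 0.7 gives 0.7 × 0.7 = 0.49.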

14
A Caution: Beware of Extrapolation
  • Extrapolation is the use of a regression line for
    prediction outside the range of values of the
    explanatory variable x that you used to obtain
    the line.
  • Such predictions are often not accurate.
  • Sarah's height was plotted against her age
  • Can you predict her height at age 42 months?
  • Can you predict her height at age 30 years (360
    months)?

15
A Caution: Beware of Extrapolation
  • Regression line: y = 71.95 + 0.383x
  • height at age 42 months?
  • y = 88
  • height at age 30 years?
  • y = 209.8
  • She is predicted to be 6' 10.5" at age 30.
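The arithmetic can be checked directly. The sketch below assumes the heights are in centimetres, which the slide does not state but which is consistent with its feet-and-inches figure. The function name `predicted_height` is hypothetical.

```python
def predicted_height(age_months):
    """Height predicted by the fitted line y = 71.95 + 0.383x."""
    return 71.95 + 0.383 * age_months

height_42 = predicted_height(42)    # within a plausible range for the data
height_360 = predicted_height(360)  # far outside it: extrapolation

# Convert the extrapolated value to feet and inches, assuming centimetres
total_inches = height_360 / 2.54
feet = int(total_inches // 12)
inches = total_inches - 12 * feet

print(round(height_42, 1), round(height_360, 1), feet, round(inches, 1))
```

The line keeps adding 0.383 per month indefinitely, which is why the age-30 prediction is implausibly tall.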

16
Accuracy of the predictions
One possible measure of the accuracy of the
regression predictions is given by the root mean
square error (r.m.s. error). The r.m.s. error is
defined as the square root of the average of the
squared residuals:
$$\text{r.m.s. error} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
In large data sets, the r.m.s. error is
approximately equal to
$$\sqrt{1 - r^2}\; s_y$$
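As a rough numerical check, the sketch below fits a least-squares line to a synthetic data set of 1000 points and compares the r.m.s. error with sqrt(1 − r²) · s_y, where s_y is the sample standard deviation of y (dividing by n − 1). The data-generating rule is an arbitrary assumption chosen to be deterministic.

```python
import math

# Synthetic data: a linear trend plus a deterministic pseudo-noise term
n = 1000
xs = list(range(n))
ys = [2 + 0.5 * x + ((x * 37) % 11 - 5) for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line and correlation
b = s_xy / s_xx
a = mean_y - b * mean_x
r = s_xy / math.sqrt(s_xx * s_yy)

# r.m.s. error: square root of the average squared residual
rms_error = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)

# Large-sample approximation using the sample SD of y
s_y = math.sqrt(s_yy / (n - 1))
approximation = math.sqrt(1 - r ** 2) * s_y

print(rms_error, approximation)
```

The two values differ only by the factor sqrt(n / (n − 1)), which approaches 1 as n grows.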
17
Confounding factor
A confounding factor is a variable that has an
important effect on the relationship among the
variables in a study but is not included in
the study.
Example: The mathematics department
of a large university must plan the timetable for
the following year. Data are collected on the
enrollment year, the number x of first-year
students, and the number y of students enrolled in
elementary math courses.
The fitted regression line has equation 2491.6
91.0663 x, with R² = 0.694.
18
Influential Point
An observation is influential for the regression
line if removing it would considerably change
the fitted line. An influential point pulls the
regression line towards itself.
[Figure: scatterplot showing the fitted regression line, the regression line if the marked point is omitted, and the influential point/outlier]
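The pull of an influential point can be sketched by fitting the least-squares line with and without one extreme observation. The data, including the extreme point (15, 2), are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Line fitted without the extreme point
a_without, b_without = least_squares_line(xs, ys)

# Line fitted with one point far out in the x direction
a_with, b_with = least_squares_line(xs + [15], ys + [2])

print(b_without, b_with)
```

In this example a single point is enough to flip the sign of the slope, which is what makes it influential.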
19
Summary - Warnings
  1. Correlation measures linear association; a
    regression line should be used only when the
    association is linear.
  2. Extrapolation: do not use the regression line to
    predict values outside the observed range; such
    predictions are not reliable.
  3. Correlation and the regression line are sensitive to
    influential / extreme points.

20
Data Mining
  • Exploring really large databases in the hope of
    finding useful patterns is called data mining.

Domain Understanding → Data Selection → Cleaning and Preprocessing → Discovering Patterns → Evaluation and Interpretation → Knowledge
The entire process is iterative and interactive.