Title: Predictive Modeling: Value Prediction
1Predictive Modeling Value Prediction
- The main traditional technique used for value
prediction is linear regression, which attempts to
fit a straight line through a plot of the data
such that the line is the best representation of
the average of all observations at each point in
the plot.
2Disadvantages of Linear Regression
- 1. This technique works well only if the data is
linear.
(Figure labels: True Regression line, Predicted Regression line)
3- 2. The outcome can be influenced by just a few
outliers.
(Figure labels: Predicted Regression line, True Regression line)
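The effect of a single outlier can be illustrated with a small sketch. The data below is hypothetical (it is not taken from the slides), and `fit_line` is an illustrative helper name; it fits y = a + bx by least squares.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = list(range(10))
clean_ys = [1 + 2 * x for x in xs]

# Same data, but the last observation is replaced by one outlier.
outlier_ys = clean_ys[:-1] + [60]

clean_a, clean_b = fit_line(xs, clean_ys)
out_a, out_b = fit_line(xs, outlier_ys)

print(f"clean data:   y = {clean_a:.2f} + {clean_b:.2f}x")
print(f"with outlier: y = {out_a:.2f} + {out_b:.2f}x")
```

Only one of the ten points was changed, yet the fitted slope roughly doubles, which is the sensitivity the slide describes.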
4Regression
- The relationship between the mean of a random
variable and the values of one or more
independent variables on which it depends.
- Example: we might want to predict the sales of a
new product in terms of the amount of money spent
advertising it on TV.
5- Example: we might want to predict family
expenditures on entertainment in terms of family
income.
- Example: we might want to predict a college
student's grade-point average based on the number
of hours he/she spent studying.
6- Example: we can predict the average earnings of
college graduates ten years after graduation.
7Curve fitting
- Non-linear regression solves these two problems
of linear regression but is still not flexible
enough to handle all possible shapes of the data
plot.
- We shall consider only the linear equation in two
unknowns, y = a + bx, where b is the slope of the
line and a is the y-intercept.
8- This form is useful and important, not only
because many relationships are actually of this
form, but also because such equations often provide
close approximations to relationships which would
otherwise be difficult to describe in
mathematical form.
- The values of a and b are estimated from the
data.
9Least Square Method
- Most experimental data will not lie exactly on a
straight line (even when it should). However,
there is a mathematical method for determining
the equation of the best-fit straight line. This
method is called the least-squares method, or
linear regression.
- Using the least-squares method, y is a linear
function of x, i.e. y = f(x),
- or more specifically y = bx + a.
10- The easiest way to implement this method by hand
is to set up a table with columns of data and
columns of the products x_i·y_i and x_i² (where
i = 1, 2, ..., N and N is the number of data
points). The columns can then be totaled to give the
summations used in the formulas for a and b.
11- Example: this sample of data was obtained in a study
of the relationship between the number of years
that applicants for certain foreign service jobs
have studied English in high school or college
and the grades which they received in a
proficiency test in that language.
12- No. of years (x)   Grade in test (y)   x²    x·y
- 3                    57                  9     171
- 4                    78                  16    312
- 4                    72                  16    288
- 2                    58                  4     116
- 5                    89                  25    445
- 3                    63                  9     189
- 4                    73                  16    292
- 5                    84                  25    420
- 3                    75                  9     225
- 2                    48                  4     96
- 35                   697                 133   2,554   (totals)
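The table above can be rebuilt in a few lines of code. This is a minimal sketch using the ten (x, y) pairs from the slide; the variable names are illustrative.

```python
# Data from the table: years of English studied (x) and test grade (y).
x = [3, 4, 4, 2, 5, 3, 4, 5, 3, 2]
y = [57, 78, 72, 58, 89, 63, 73, 84, 75, 48]

# The two computed columns of the table.
x_squared = [xi ** 2 for xi in x]
x_times_y = [xi * yi for xi, yi in zip(x, y)]

# Print the table row by row, then the column totals.
print(f"{'x':>3} {'y':>4} {'x^2':>4} {'x*y':>5}")
for row in zip(x, y, x_squared, x_times_y):
    print("{:>3} {:>4} {:>4} {:>5}".format(*row))

totals = {
    "n": len(x),
    "sum_x": sum(x),
    "sum_y": sum(y),
    "sum_x2": sum(x_squared),
    "sum_xy": sum(x_times_y),
}
print(totals)
```

The totals (35, 697, 133 and 2,554) are the sums that enter the normal equations.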
13- The goal is to find the one line which best fits the data.
- Normal equations:
- Σy = na + bΣx   (n = no. of pairs)
- Σxy = aΣx + bΣx²
- 697 = 10a + 35b
- 2,554 = 35a + 133b
14- Two methods to find a and b
- (1)
- 1st × 7 ⇒ 4,879 = 70a + 245b
- 2nd × 2 ⇒ 5,108 = 70a + 266b
- ⇒ (2nd − 1st) 229 = 0 + 21b
- ⇒ b = 10.90 ⇒ a = 31.55
- ⇒ y = 31.55 + 10.90x
15- (2)
- 1st: 697 = 10a + 35b
- ⇒ a = (697 − 35b)/10
- 2nd ⇒ 2,554 = 35(697 − 35b)/10 + 133b
- ⇒ b = 10.90 ⇒ a = 31.55
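The elimination steps can also be sketched in code. This is an illustrative helper (the name `eliminate` and the tuple layout are assumptions, not part of the slides): each equation is written as (coefficient of a, coefficient of b, right-hand side).

```python
def eliminate(eq1, eq2):
    """Solve two linear equations in a and b by elimination.

    Each equation is a tuple (coef_a, coef_b, rhs) meaning
    coef_a * a + coef_b * b = rhs.
    """
    a1, b1, r1 = eq1
    a2, b2, r2 = eq2
    if a1 == 0:
        raise ValueError("first equation must contain a")

    # Scale both equations so that the coefficients of a match.
    scaled1 = (a1 * a2, b1 * a2, r1 * a2)
    scaled2 = (a2 * a1, b2 * a1, r2 * a1)

    # Subtracting removes a and leaves one equation in b.
    b_coef = scaled2[1] - scaled1[1]
    if b_coef == 0:
        raise ValueError("equations are not independent")
    b = (scaled2[2] - scaled1[2]) / b_coef

    # Back-substitute into the first equation to get a.
    a = (r1 - b1 * b) / a1
    return a, b


# Normal equations from the slide: 697 = 10a + 35b and 2,554 = 35a + 133b.
a, b = eliminate((10, 35, 697), (35, 133, 2554))
print(f"y = {a:.2f} + {b:.2f}x")
```

Carried through without rounding, b = 229/21 ≈ 10.905 and a ≈ 31.53; the value 31.55 on the slide results from substituting the rounded b = 10.90.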
16- (3) The slope of the line is given by b
(Δy over Δx, or the rise over the run) and the
intercept on the y-axis is given by a. For n
data pairs, the equations used to find the
intercept a and the slope b are:
- 1.) a = (Σy Σx² − Σx Σxy) / (n Σx² − (Σx)²)
- 2.) b = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
17Example: eleven data points are recorded during a
test; each pair consists of an X and a Y value. The
following table can be constructed, with each
column total shown on the last row.
18- Equation 1 gives:
- a = (125 × 78854 − 760 × 12940) / (11 × 78854 − 760 × 760)
= 0.0771 lbs
- Equation 2 gives:
- b = (11 × 12940 − 760 × 125) / (11 × 78854 − 760 × 760)
= 0.1634 lbs/me
- Therefore the best straight-line fit is given by
the equation:
- y = 0.1634x + 0.0771
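The arithmetic above can be checked with a short sketch of equations 1 and 2. The function name and parameter names are illustrative; the column totals (n = 11, Σx = 760, Σy = 125, Σx² = 78854, Σxy = 12940) are the ones used on the slide.

```python
def intercept_and_slope(n, sum_x, sum_y, sum_x2, sum_xy):
    """Apply the closed-form least-squares formulas to column totals."""
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; no unique line exists")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # equation 1
    b = (n * sum_xy - sum_x * sum_y) / denominator       # equation 2
    return a, b


a, b = intercept_and_slope(n=11, sum_x=760, sum_y=125, sum_x2=78854, sum_xy=12940)
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"y = {b:.4f}x + {a:.4f}")
```

Working from the totals rather than the raw pairs mirrors the by-hand table method: only the last row of the table is needed.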
19The linear regression plot would look like:
(Figure: linear regression plot; not transcribed)
22- The scatter diagram is obtained by plotting the
points (70,155), (63,150), (68,152), and so on. Using a
ruler, you can find several straight lines which
apparently suit the relation in question.
Choosing any two points on the line just drawn, you
can calculate the slope of the fitted line.
- Y − Y1 = (X − X1)(Y2 − Y1)/(X2 − X1)
- (170 − 156)/(68 − 66) = 7
- Y − 156 = 7(X − 66)
- Y = 7X − 306
23- If X = 63 then Y = 7 × 63 − 306 = 135, provided that
the line correctly expresses the relation between
height and weight among females. We chose
the best-fitting line in the diagram, we hope.
- As we shall see below, this method is certainly
not exact. Instead of the point (66,156) we
could have chosen (65,139) and got the result
Y = 10.33X − 532.67.
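The two-point construction is easy to reproduce. The sketch below uses the points quoted on the slides; `line_through` is an illustrative helper name.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Line drawn through (66, 156) and (68, 170), as on the slide.
slope, intercept = line_through((66, 156), (68, 170))
predicted = slope * 63 + intercept
print(f"Y = {slope:g}X + ({intercept:g}); at X = 63, Y = {predicted:g}")

# Picking (65, 139) instead of (66, 156) changes the line noticeably.
alt_slope, alt_intercept = line_through((65, 139), (68, 170))
print(f"alternative slope: {alt_slope:.2f}, intercept: {alt_intercept:.2f}")
```

The two slopes differ by almost half, which shows how strongly an eyeballed line depends on the points chosen.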
24Example
- Find the least-squares line for the following data:
(1,1), (3,2), (4,4), (6,4), (8,5), (9,7), (11,8),
(14,9).
- The equation of the line is Y = a0 + a1X. The
normal equations are:
- ΣY = a0N + a1ΣX
- ΣXY = a0ΣX + a1ΣX²
- The work involved in computing the sums can be
arranged as in the following table. Although the
last column is not needed for this part of the
problem, it has been added to the table for use
with X as the dependent variable, which gives quite
another result. The latter is called the regression
of X on Y.
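25- Table (computed from the data above):
- X      Y      X²     XY     Y²
- 1      1      1      1      1
- 3      2      9      6      4
- 4      4      16     16     16
- 6      4      36     24     16
- 8      5      64     40     25
- 9      7      81     63     49
- 11     8      121    88     64
- 14     9      196    126    81
- 56     40     524    364    256   (totals)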
26- Since there are 8 pairs of values of X and Y,
N = 8 and the normal equations become:
- 8a0 + 56a1 = 40
- 56a0 + 524a1 = 364
- Solved simultaneously,
- a0 = 6/11 or 0.545,
- a1 = 7/11 or 0.636
27- Another method:
- a0 = [(ΣY)(ΣX²) − (ΣX)(ΣXY)] / [NΣX² − (ΣX)²]
= [(40)(524) − (56)(364)] / [(8)(524) − (56)²]
= 6/11 or 0.545
- a1 = [NΣXY − (ΣX)(ΣY)] / [NΣX² − (ΣX)²]
= [(8)(364) − (56)(40)] / [(8)(524) − (56)²]
= 7/11 or 0.636
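Both results can be confirmed exactly with rational arithmetic. The sketch below uses Python's standard `fractions` module; `least_squares` is an illustrative helper name. Swapping the roles of X and Y gives the regression of X on Y mentioned earlier; those coefficients are computed here and are not stated on the slides.

```python
from fractions import Fraction


def least_squares(xs, ys):
    """Return exact (intercept, slope) for the regression of ys on xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_x2 - sum_x ** 2
    intercept = Fraction(sum_y * sum_x2 - sum_x * sum_xy, denominator)
    slope = Fraction(n * sum_xy - sum_x * sum_y, denominator)
    return intercept, slope


X = [1, 3, 4, 6, 8, 9, 11, 14]
Y = [1, 2, 4, 4, 5, 7, 8, 9]

# Regression of Y on X: Y = a0 + a1*X.
a0, a1 = least_squares(X, Y)
print(a0, a1)  # 6/11 7/11

# Regression of X on Y: X = b0 + b1*Y (roles of the variables swapped).
b0, b1 = least_squares(Y, X)
print(b0, b1)  # -1/2 3/2
```

Solving X = −1/2 + (3/2)Y for Y gives Y = 1/3 + (2/3)X, which is not the same line as Y = 6/11 + (7/11)X; this is the "quite another result" the slide refers to.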
28Perhaps we should try to estimate the regression
line from the example with heights and weights a
little more exactly.
29- The required least-squares line has this equation:
- Y − 154.2 = 3.22(X − 66.8) or
- Y = 3.22X − 60.9
- Sometimes, when the raw figures are large, it is an
advantage to subtract a large figure from at least
one of the variables, perhaps from both,
before calculating. Then you must remember to add
the same figures back to the averages you calculate
to get the right result.
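The subtraction trick can be sketched as follows. The six height and weight pairs are a hypothetical sample chosen for illustration (not the full data set behind the slides), and the shift values 60 and 150 are arbitrary assumptions.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical sample of heights (X) and weights (Y).
heights = [70, 63, 68, 66, 68, 65]
weights = [155, 150, 152, 156, 170, 139]

# Subtract a large figure from each variable to get small working numbers.
X_SHIFT, Y_SHIFT = 60, 150
small_x = [x - X_SHIFT for x in heights]
small_y = [y - Y_SHIFT for y in weights]

# The slope is unchanged by the shift ...
_, slope_shifted = fit_line(small_x, small_y)
_, slope_raw = fit_line(heights, weights)

# ... but the averages must have the subtracted figures added back.
mean_x = sum(small_x) / len(small_x) + X_SHIFT
mean_y = sum(small_y) / len(small_y) + Y_SHIFT
intercept = mean_y - slope_shifted * mean_x

print(f"slope (shifted data): {slope_shifted:.4f}")
print(f"slope (raw data):     {slope_raw:.4f}")
print(f"Y - {mean_y:.1f} = {slope_shifted:.2f}(X - {mean_x:.1f})")
```

Shifting only moves the origin of the plot, so the slope is the same either way; forgetting to add the shifts back to the averages would give the wrong intercept.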
30Assignment (Deadline 14th Sep 06)
- The following are data on the IQs of 25
students, the number of hours they studied for a
certain achievement test, and their scores on the
test:
31(No Transcript)
32- 1. Use the computer to find the least-squares
line which will enable us to predict a student's
score on the test in terms of his/her IQ, and
draw the line.
- 2. Use the computer to find the least-squares
line which will enable us to predict a student's
score on the test in terms of the number of
hours he/she studied for the test, and draw the
line.
- 3. Use the computer to predict how many hours a
student will study for the test given his/her IQ.
Draw the line.
33- http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html