Title: MSIS 563: Chapter 7 Numerical Prediction
1. MSIS 563 Chapter 7: Numerical Prediction
- Deb Dey
- Professor and McCabe Fellow of Information Systems
- Faculty Director, MSIS Program, UW Business School
2. Why Numerical Prediction
- Traditional classifiers assume a discrete goal variable
- They cannot be used directly on data sets with a continuous goal variable
- Of course, the goal variable can be discretized
- However, that may not be appropriate in some situations
- Numerical prediction techniques can return a numerical value for the goal variable
- Prediction related to economic growth, weather, market conditions
3. Linear Regression
- Fit a multi-dimensional line
- The goal (G) is the dependent variable
- Must be numeric/continuous
- Features (Ai) are independent variables, i = 1, 2, ..., K
- Can be continuous or discrete
- We will start with only numeric features
- The challenge is to find the wi's in such a way that the total (squared) error is minimized (sketched below)
- Error is the difference between the observed value and the predicted value
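In standard notation, the model and objective described above can be written as follows (the intercept term w_0, the error superscripts, and n for the number of training rows are notational assumptions; the slide itself shows no explicit equation):

```latex
% Predicted goal value as a weighted combination of the K features
G^{est} = w_0 + w_1 A_1 + w_2 A_2 + \cdots + w_K A_K

% The weights are chosen to minimize the total squared error
% over the n training observations
\min_{w_0,\dots,w_K} \; \sum_{i=1}^{n} \bigl(G_i^{obs} - G_i^{est}\bigr)^{2}
```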
4. One-dimensional Linear Regression
- Given a set of (X, Y) observations
- Fit a line Y = a + bX
- Of course, observations would have a random error
- Error for observation i: ε_i = Y_i^{obs} - Y_i^{est}
- Pick a and b such that Σ_i ε_i² is minimized (closed-form solution below)
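For the one-dimensional case, the minimizing slope and intercept have a well-known closed form (a standard result, not shown on the slide), where X-bar and Y-bar denote the sample means of the observations:

```latex
b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2},
\qquad
a = \bar{Y} - b\,\bar{X}
```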
5. Example
As is the case with all prediction techniques,
you should be ready to accept some error. The aim
is to keep the error to a minimum.
6. Multi-dimensional Regression
- The formulae for the weights are quite difficult
- They require complex matrix notation
- In principle, we can estimate the weights
- Compute the squared error for each row, add them up, and minimize the total (see the sketch after this list)
- Can be done easily in MS Excel using Solver
- Most statistical and data mining packages support a regression model
- Once a line is fitted
- We can use it for prediction on test cases
- Need a measure of accuracy for testing
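A minimal Python sketch of the "square the error for each row, add them up, and minimize the total" idea, using NumPy's least-squares solver in place of Excel Solver; the feature names A1-A4 and G mirror the WEKA example, but the data values here are purely hypothetical:

```python
import numpy as np

# Hypothetical training data: each row is (A1, A2, A3, A4); g holds the goal values
X = np.array([[1.0, 2.0, 0.5, 3.0],
              [2.0, 1.0, 1.5, 0.0],
              [0.5, 3.0, 2.0, 1.0],
              [3.0, 0.5, 1.0, 2.0],
              [1.5, 1.5, 2.5, 1.0],
              [2.5, 2.0, 0.5, 0.5]])
g = np.array([14.0, 10.5, 16.0, 13.0, 15.5, 11.0])

# Append a column of ones so the last weight plays the role of the intercept
X1 = np.column_stack([X, np.ones(len(X))])

# Least-squares fit: finds the weights that minimize the sum of squared errors,
# the same objective Solver would be asked to minimize numerically
w, *_ = np.linalg.lstsq(X1, g, rcond=None)

pred = X1 @ w                     # predictions for the training rows
sse = np.sum((g - pred) ** 2)     # total squared error
print("weights:", w)
print("sum of squared errors:", sse)
```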
7. Example
WEKA Result (linear1.arff)

Linear Regression Model
G =
    0.5611 * A1 +
    2.5118 * A2 +
    2.9021 * A3 +
    3.461  * A4 +
    3.0972

Evaluation on training set (Summary)
    Correlation coefficient          0.9967
    Mean absolute error              0.306
    Root mean squared error          0.3866
    Relative absolute error          7.8325 %
    Root relative squared error      8.0595 %
    Total Number of Instances        20

Mean squared error = 2.9896 / 20 = 0.14948
Root mean squared error = (0.14948)^0.5 = 0.3866
8. Multi-Dimensional Regression (Discrete Feature)
9. Making a Discrete Feature Numeric
- This is exactly the opposite issue (of discretization)
- Useful, for example, in linear regression
- Two techniques
- Converting a value to a binary (0/1) feature (a small sketch follows this list)
- Consider a discrete feature F with 3 (= n) values: high, medium, and normal
- Create 2 (= n - 1) new features, F-High and F-Medium
- F-Normal is not needed (it is dependent on the other two)
- Assign the values 0 or 1 based on the original values of F
- Replacing a discrete value by a numeric one
- Useful if there is a natural order among the values
- Order the values and assign a numeric equivalent
10. Multi-Dimensional Regression (Discrete Feature)
WEKA Result (linear2.arff)

Linear Regression Model
G =
    0.408  * A1=medium,high +
    0.7046 * A1=high +
    2.5086 * A2 +
    2.9217 * A3 +
    3.4493 * A4 +
    3.6965

Evaluation on training set (Summary)
    Correlation coefficient          0.9968
    Mean absolute error              0.3099
    Root mean squared error          0.381
    Relative absolute error          7.9322 %
    Root relative squared error      7.9424 %
    Total Number of Instances        20
11. Testing and Validation
- Models making numerical predictions should also be tested
- Partitioning of data into training and testing
- Same as before (see notes for Ch. 4)
- Evaluation criteria
- Cannot use previous measures such as accuracy, stratified accuracy, or RIS
- Need to measure the distance between the actual (observed) values (a_i) and the predicted (estimated) ones (p_i) for test case i
12. Performance Measures (Numeric Predictions for n Test Cases)
- Mean-squared error
- Root mean-squared error
- Mean absolute error
- Relative squared error
- Root relative squared error
- Relative absolute error
- Correlation coefficient
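The slide lists these measures without their formulas; the standard definitions (as used in WEKA, which reports the relative measures as percentages, matching the outputs above) are given below, with a_i the actual value, p_i the predicted value, and a-bar, p-bar the corresponding means:

```latex
\begin{align*}
\text{Mean-squared error} &= \tfrac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2 \\
\text{Root mean-squared error} &= \sqrt{\tfrac{1}{n}\textstyle\sum_{i}(p_i - a_i)^2} \\
\text{Mean absolute error} &= \tfrac{1}{n}\sum_{i=1}^{n}\lvert p_i - a_i\rvert \\
\text{Relative squared error} &= \frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2} \\
\text{Root relative squared error} &= \sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2}} \\
\text{Relative absolute error} &= \frac{\sum_i \lvert p_i - a_i\rvert}{\sum_i \lvert a_i - \bar{a}\rvert} \\
\text{Correlation coefficient} &= \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}{\sqrt{\sum_i (p_i - \bar{p})^2 \,\sum_i (a_i - \bar{a})^2}}
\end{align*}
```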
13. Comments on Linear Regression
- Easy to build and use
- Often works well in real-world situations
- Limitations
- The treatment of discrete features is not natural
- The conversion process can be tedious, especially if there are many discrete features, each with several distinct values
- If the relationship is not linear, the prediction can be quite bad
- Of course, one can do non-linear regression, but the number of alternative functional forms is often too large to perform a meaningful search
14. Outliers and Robust Regression
- Outliers are often noisy elements and can seriously change the result
- Demo
- Possible ways of making the regression more robust
- Minimize absolute error instead of squared error (illustrated after this list)
- Remove outliers (e.g., the 10% of points farthest from the regression plane)
- Minimize the median instead of the mean of the squares
- Note: We have been minimizing the sum of squares, which is equivalent to minimizing the mean of squares
- Finds the narrowest strip covering half the observations
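A small illustration (hypothetical data, not the demo referenced above) contrasting the squared-error and absolute-error objectives; the absolute-error fit is far less influenced by the single outlier:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 1-D data with a roughly linear trend (slope about 2)
# and one gross outlier in the last observation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 40.0])

def sse(params):                        # sum of squared errors
    a, b = params
    return np.sum((y - (a + b * x)) ** 2)

def sae(params):                        # sum of absolute errors (more robust)
    a, b = params
    return np.sum(np.abs(y - (a + b * x)))

ls  = minimize(sse, x0=[0.0, 1.0]).x                        # ordinary least squares
lad = minimize(sae, x0=[0.0, 1.0], method="Nelder-Mead").x  # least absolute deviations

print("least squares    (a, b):", ls)    # pulled strongly toward the outlier
print("least abs. error (a, b):", lad)   # stays close to the underlying trend
```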
15. Least Median of Squares
16. Example
WEKA Result (linear1.arff)

Linear Regression Model
G =
    0.6126 * A1 +
    2.1711 * A2 +
    2.7891 * A3 +
    3.4815 * A4 +
    3.863

Evaluation on training set (Summary)
    Correlation coefficient          0.9964
    Mean absolute error              0.3006
    Root mean squared error          0.4471
    Relative absolute error          7.6938 %
    Root relative squared error      9.3203 %
    Total Number of Instances        20

Solver does slightly worse than WEKA!
17. Regression Trees
- Addresses the problems associated with linear regression
- The overall goal may be a non-linear function of the features, but it may be locally (piece-wise) linear in the features
- Design
- Build a tree
- Similar to decision trees
- With each leaf node, associate a linear regression to get the value
18. Building a Regression Tree
- Branching (picking a feature)
- Branching is done based on the standard deviation reduction (SDR) of the goal (formula and a small sketch below)
- Pick the feature that gives the highest SDR
- Let T be the set of goal values, and let Ti be the set of goal values along the i-th partition (split) according to the chosen feature
- Stopping
- When there is no more attribute to split on
- When the maximum SDR is below a threshold
- Typical threshold: x% of the original standard deviation, with x = 5-10
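The SDR formula itself is not shown on the slide; below is a minimal Python sketch, assuming the standard definition SDR = sd(T) - sum_i (|T_i| / |T|) * sd(T_i), where sd(.) is the standard deviation of the goal values in a set, and using made-up goal values:

```python
import numpy as np

def sdr(goal_values, partitions):
    """Standard-deviation reduction of a candidate split:
    SDR = sd(T) - sum_i (|T_i| / |T|) * sd(T_i),
    where T is the full set of goal values and the T_i are the subsets
    produced by splitting on the chosen feature."""
    T = np.asarray(goal_values)
    return T.std() - sum(len(Ti) / len(T) * np.asarray(Ti).std()
                         for Ti in partitions)

# Hypothetical goal values and two candidate two-way splits
T = [4.0, 5.0, 6.0, 20.0, 22.0, 24.0]
print(sdr(T, [[4.0, 5.0, 6.0], [20.0, 22.0, 24.0]]))  # clean split: large SDR
print(sdr(T, [[4.0, 22.0, 6.0], [20.0, 5.0, 24.0]]))  # mixed split: small SDR
```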
19. Example
20. Missing Value Compensation
Here, m is the total number of instances without missing values.

Note: The denominator is m, and not |T| as given in the textbook (p. 204).