Title: Machine learning
LECTURE 27
Machine learning
Distinguish between supervised and unsupervised learning.
Examples of supervised learning:
1. Credit rating. We have a credit application, and we want to predict whether or not the applicant will pay back the loan. Given: a collection of credit applications for which we know whether or not the loans were paid back.
2. Satellite image analysis. We have multi-spectral images of the earth's surface. We want to automatically segment the images, i.e. assign each pixel to one of a number of classes (forest, grassland, water, road, ...). Given: image(s) for which the ground truth is known.
3. Handwritten character recognition. We want to automatically read ZIP codes on mail envelopes. Subproblem: given a 16 x 16 gray-scale image of a handwritten digit, identify the digit. Given: a sample of images for which the digit is known.
4. Chemometrics. We have a gasoline sample and we want to determine its octane level from an IR absorption spectrum. Given: a collection of spectra of gasoline samples for which we determined the octane level with an "octane engine".
Examples of unsupervised learning:
5. Market segmentation. We have data on Amazon customers (purchase history, demographics (?), etc.). We want to partition the customers into homogeneous subsets.
6. Satellite image analysis. We have multi-spectral images of the earth's surface. We want to partition the pixels into subsets (hopefully) representing the same kind of surface.
- The generic supervised learning problem
  - Universe of objects (credit applicants, written digits, gasoline samples, ...).
  - We observe properties of the objects (answers on a credit application, 16 x 16 gray-level image, IR spectrum).
  - Based on the observed properties, we want to predict the value of a response variable (loan performance, digit, octane level).
  - As the basis for our prediction rule we have a training sample of objects for which we know both the properties and the value of the response variable.
- The generic unsupervised learning problem
  - Universe of objects (Amazon customers, pixels in a multi-spectral image).
  - We observe properties (purchase history, spectrum).
  - Based on the observed properties, we want to partition the objects into groups such that objects within each group are similar to each other and dissimilar from objects in other groups.
- Terminology
  - Supervised learning with a continuous response variable: regression
  - Supervised learning with a categorical response variable: classification or discriminant analysis
  - Unsupervised learning: cluster analysis
- Supervised learning
  - X1, ..., Xp: predictor variables
  - Y: response variable
  - Given: (x1, y1), ..., (xn, yn), the training sample
- Goals
  - Generate a rule s(x) that predicts the response Y for new test observations for which only x is known
  - Describe how Y depends on x
- We will restrict ourselves to continuous or binary responses.
  - Methods for a binary response can be used for a general categorical response.
- We will discuss two methods
  - Linear modeling --- oldest, simplest, most widely used
  - Classification and regression trees (CART) --- builds more general models, can handle messy data.
Linear modeling
Model the response Y as a linear function of the predictors X1, ..., Xp:
    Y = a0 + a1 X1 + ... + ap Xp
Determine the regression coefficients a0, ..., ap by fitting the model to the training sample.
Most common fitting method: least squares --- find the coefficients that minimize the residual sum of squares
    RSS = sum_i (yi - a0 - a1 xi1 - ... - ap xip)^2
This optimization problem can be solved by linear algebra.
Prediction: for a test observation with predictor vector x, predict
    yhat = a0 + a1 x1 + ... + ap xp
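A minimal sketch in R of this fit-and-predict step (the numbers in x and y are made up purely for illustration, not a real data set):

x <- c(1.0, 2.1, 2.9, 4.2, 5.1)                 # made-up predictor values
y <- c(2.0, 3.9, 6.2, 8.1, 9.8)                 # made-up responses
fit <- lm(y ~ x)                                # least squares: coefficients minimize the RSS
coef(fit)                                       # estimated a0, a1
predict(fit, newdata = data.frame(x = 3.5))     # yhat = a0 + a1 * 3.5 for a new test observation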
- Illustration 1 (silly data set --- sorry!)
- Data on 60 different car models, from Consumer Reports
- Predictor variables
  - Weight in pounds
  - Engine displacement in liters
  - Type --- Small, Sporty, Compact, Medium, Large, Van (categorical predictor, watch out)
- Response variable: Mileage in miles/gallon
- Question: How does fuel consumption depend on the predictor variables?
- Initially, let's just use a single predictor, Weight.
- First, some simple data exploration.
- Boxplot
  - Center line of box: median
  - Upper and lower bounds of box: quartiles
  - Whiskers extend out to the extremes, or to the quartiles +/- 1.5 x the inter-quartile range if there are outliers (see the sketch below)
- Summary
    Min. 1st Qu. Median   Mean 3rd Qu.  Max.
    1845    2571   2885   2901    3231  3855
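A small R sketch of how such a summary and boxplot are produced (the weight values here are made up, chosen only to lie in the observed range):

weight <- c(1845, 2285, 2560, 2885, 3080, 3375, 3855)   # made-up weights in pounds
summary(weight)                                         # Min, 1st Qu., Median, Mean, 3rd Qu., Max
boxplot(weight, ylab = "Weight (lbs)")                  # whiskers follow the 1.5 x IQR rule above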
The curve in the plots was obtained by a moving average, s(x) = ave(yi : xi in N(x)), where N(x) is an interval centered at x. Maybe better to model gallons/mile?
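A sketch of such a moving-average smoother in R (the function and argument names are hypothetical; h is the half-width of the interval N(x)):

moving.average <- function(x, y, h) {
  grid <- sort(unique(x))                                    # evaluate s() at the observed x's
  s <- sapply(grid, function(x0) mean(y[abs(x - x0) <= h]))  # average of the y's with x in N(x0)
  list(x = grid, y = s)
}
# Usage (hypothetical vectors): sm <- moving.average(weight, fuel, h = 300); lines(sm$x, sm$y)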
Output obtained when looking at the linear model with the Browser:

Linear Model

Call: lm(formula = Fuel ~ Weight)

Residuals:
     Min      1Q  Median     3Q    Max
 -0.7957 -0.2703 0.01414 0.2547 0.9583

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 0.3914     0.2995   1.3070   0.1964
Weight      0.0013     0.0001  12.9323   0.0000

Residual standard error: 0.3877 on 58 degrees of freedom
Multiple R-Squared: 0.7425
F-statistic: 167.2 on 1 and 58 degrees of freedom, the p-value is 0

Model: yi = a0 + a1 xi + ei, where the ei are independent random errors with mean 0 and variance sigma^2.
Std. Error tells us how much the estimated coefficients would vary if we generated new data from the model.
Multiple R-Squared = 1 - sum(ri^2) / sum((yi - ybar)^2): the fraction of response variability explained by the model.
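As a check of this formula, a small R sketch computing R-squared directly from the residuals (x and y are simulated stand-ins for Weight and Fuel; the coefficients used in the simulation are only illustrative):

set.seed(1)
x <- runif(60, 1800, 3900)                       # simulated weights
y <- 0.39 + 0.0013 * x + rnorm(60, sd = 0.39)    # simulated response, roughly on the fitted scale
fit <- lm(y ~ x)
r   <- residuals(fit)
1 - sum(r^2) / sum((y - mean(y))^2)              # equals summary(fit)$r.squared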
75% of the variation in fuel consumption per mile can be explained by vehicle weight.

Next, let's try a model with two predictor variables, Weight and Displacement.
Question: Does displacement help explain fuel consumption, once we adjust for weight?

Linear Model

Call: lm(formula = Fuel ~ Weight + Disp.)

Residuals:
     Min      1Q  Median     3Q    Max
 -0.8109 -0.2559 0.01971 0.2673 0.9812

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 0.4790     0.3418   1.4014   0.1665
Weight      0.0012     0.0002   7.2195   0.0000
Disp.       0.0009     0.0016   0.5427   0.5895

Residual standard error: 0.3901 on 57 degrees of freedom
Multiple R-Squared: 0.7438
F-statistic: 82.75 on 2 and 57 degrees of freedom, the p-value is 0

Adding Displacement as a second predictor does not increase the explanatory power of the model.
Now, let's try Displacement by itself:

Linear Model

Call: lm(formula = Fuel ~ Disp.)

Residuals:
     Min      1Q  Median     3Q   Max
 -0.7998 -0.4765 0.04638 0.2653 1.356

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 2.6919     0.2074  12.9796   0.0000
Disp.       0.0100     0.0013   7.7630   0.0000

Residual standard error: 0.5351 on 58 degrees of freedom
Multiple R-Squared: 0.5096
F-statistic: 60.26 on 1 and 58 degrees of freedom, the p-value is 1.53e-010

Does well, but not as well as Weight.
Finally, let's see how well Type predicts fuel consumption. First, look at a picture.

How to deal with categorical predictors in a linear model: suppose X is categorical, with K categories (in our case, K = 6). Convert X into K - 1 indicator variables I2, ..., IK:
    Ij = 1 if the value of X is the j-th category
    Ij = 0 otherwise
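A small R sketch of this indicator coding (Type here is a hypothetical factor listing the six car types once each; the real data has 60 rows):

Type <- factor(c("Small", "Sporty", "Compact", "Medium", "Large", "Van"))
model.matrix(~ Type)   # intercept column plus K - 1 = 5 indicator columns
# With R's default treatment contrasts, the first level in alphabetical order ("Compact")
# is absorbed into the intercept, which is why the output on the next slide shows
# TypeLarge, ..., TypeVan but no TypeCompact.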
Next, a linear model:

Linear Model

Call: lm(formula = Fuel ~ Type)

Residuals:
     Min      1Q   Median     3Q   Max
 -0.9273 -0.2537 -0.05596 0.1802 1.306

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  4.1677     0.1090 38.2480   0.0000
TypeLarge    0.8001     0.2669  2.9978   0.0041
TypeMedium   0.4338     0.1599  2.7124   0.0089
TypeSmall   -0.8943     0.1599 -5.5922   0.0000
TypeSporty  -0.2100     0.1779 -1.1805   0.2430
TypeVan      1.1456     0.1932  5.9306   0.0000

Residual standard error: 0.422 on 54 degrees of freedom
Multiple R-Squared: 0.7159
F-statistic: 27.22 on 5 and 54 degrees of freedom, the p-value is 1.22e-013

yhat(Compact) = 4.2
yhat(Large)   = 4.2 + 0.8
yhat(Small)   = 4.2 - 0.9

Type predicts fuel consumption almost as well as Weight does. Not surprising --- Type is highly correlated with Weight.
Finally, let's see if Type adds any predictive power to Weight.

Linear Model

Call: lm(formula = Fuel ~ Weight + Type)

Residuals:
    Min      1Q   Median   3Q    Max
 -0.62  -0.2556 -0.03516 0.22 0.8534

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  1.6721     0.5683  2.9423   0.0048
Weight       0.0009     0.0002  4.4524   0.0000
TypeLarge    0.0432     0.2859  0.1510   0.8805
TypeMedium   0.1022     0.1565  0.6530   0.5166
TypeSmall   -0.3960     0.1775 -2.2313   0.0299
TypeSporty  -0.1905     0.1533 -1.2427   0.2194
TypeVan      0.5298     0.2163  2.4489   0.0177

Residual standard error: 0.3634 on 53 degrees of freedom
Multiple R-Squared: 0.7933
F-statistic: 33.9 on 6 and 53 degrees of freedom, the p-value is 2.22e-016

Not much: the percentage of variance explained increases from 74% to 79%. On the other hand, we are fitting 5 additional parameters.
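One formal way to weigh those 5 extra parameters against the gain in fit is an F-test comparing the two nested models. A sketch in R, with a simulated stand-in for the car data (car.data here is made up; it is not the Consumer Reports data):

set.seed(1)
car.data <- data.frame(
  Weight = runif(60, 1800, 3900),
  Type   = factor(sample(c("Small", "Sporty", "Compact", "Medium", "Large", "Van"),
                         60, replace = TRUE)))
car.data$Fuel <- 0.4 + 0.0013 * car.data$Weight + rnorm(60, sd = 0.4)   # simulated response
fit.weight      <- lm(Fuel ~ Weight,        data = car.data)
fit.weight.type <- lm(Fuel ~ Weight + Type, data = car.data)
anova(fit.weight, fit.weight.type)   # F-test of the 5 Type coefficients jointly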
Look at the residuals, to see if there are any peculiarities. First, plot the residuals vs. the predictors: no outliers, no obvious non-linearity.

Next, plot the residuals against the fitted values: no outliers, no non-linearity, no obvious change in residual spread. Model looks fine!
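A sketch of these diagnostic plots in R, continuing the hypothetical fit.weight.type and car.data objects from the sketch above:

plot(fitted(fit.weight.type), residuals(fit.weight.type),   # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                                       # reference line at zero
plot(car.data$Weight, residuals(fit.weight.type),            # residuals vs a predictor
     xlab = "Weight", ylab = "Residuals")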
- Illustration 2 --- Classification of handwritten digits
  - Each handwritten digit is given as a 16 x 16 gray-level image
  - We will try to construct a rule distinguishing 0s from 8s
- Given
  - Training samples of 389 0s and 158 8s
  - Test samples of 359 0s and 166 8s
- Idea
  - Construct the prediction rule using the training sample, then estimate its accuracy from the test sample
  - The error rate obtained by classifying the training sample (resubstitution error) is usually optimistic
  - We will use the test sample to evaluate method(s) for obtaining an unbiased estimate of the error rate from the training sample.
Sample pictures of digits
- The training sample has n = 389 + 158 = 547 observations. There are p = 256 predictor variables (string out the pixels column by column). yi = 0 or 8, depending on the digit (the values don't matter).
- The LS fit gives an intercept and a coefficient vector a of length 256 (a sketch of this fit follows below).
- Interpretation of a
  - The normalized coefficient vector a / ||a|| can be regarded as a direction in 256-dimensional space. The predicted values yhat_i = a0 + a1 xi1 + ... + ap xip are essentially projections of the predictor vectors xi onto this direction (shifted and rescaled).
  - The coefficient vector a can itself be regarded as an image (not particularly enlightening in this case).
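A sketch of this least-squares fit in R, with simulated stand-ins for the digit data (X and y below are made up; the real X would hold the 16 x 16 images as rows of 256 pixel values):

set.seed(1)
n0 <- 389; n8 <- 158
X <- rbind(matrix(rnorm(n0 * 256),             n0, 256),    # simulated "0" images
           matrix(rnorm(n8 * 256, mean = 0.3), n8, 256))    # simulated "8" images
y <- c(rep(0, n0), rep(8, n8))                              # response values don't matter
fit    <- lsfit(X, y)                   # intercept a0 plus coefficient vector a of length 256
a      <- fit$coef[-1]
scores <- drop(fit$coef[1] + X %*% a)   # discriminant scores = fitted values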
Let's now look at the predicted values, or discriminant scores. There obviously are some pretty wild outliers among the regression coefficients and the images. Still, a histogram of started logs --- log(yi + c) with c = abs(min(scores)) + 1 --- clearly shows two groups.
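A short sketch of that started-log histogram in R (scores is the hypothetical vector from the sketch above):

c.shift <- abs(min(scores)) + 1            # shift so every score is at least 1 before taking logs
hist(log(scores + c.shift), breaks = 40)   # two clear modes correspond to the two digit groups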
Make sure the two groups really are 0s and 8s.
Now make up a classification rule. We will do this by hand for now: sort the scores for the training sample and look at the corresponding class identities.

[Printout of the class labels of the 547 sorted training scores: positions 1 through 355 are all 0s; from position 356 on the labels are predominantly 8s, with the remaining 34 0s scattered among them, mostly near the top of the range.]

Clearly, the lower cut point lcut should be between ss[355] and ss[356] (ss = sorted scores). Classify an image as 0 if its score < lcut, as 8 otherwise.
Note: the rule partitions the 256-dimensional space into two half-spaces. We might consider a more complex rule that also has an upper cut point. But first let's see how well the rule performs.
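A sketch of this hand-made rule in R (scores and y are the hypothetical objects from the earlier sketch; the index 355/356 is the gap seen in the slide's actual training data, not in the simulated stand-in):

ss  <- sort(scores)
lab <- y[order(scores)]           # class labels in score order; inspect where they switch
lcut <- (ss[355] + ss[356]) / 2   # halfway across the gap between the two groups
classify <- function(s) ifelse(s < lcut, 0, 8)   # 0 below the cut point, 8 above it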
Evaluating the classification rule (naïve)

1. Compute the resubstitution error rate --- the error rate when classifying the observations in the training sample.

misclassified.0s <- sum((y == 0) & (scores > lcut))
misclassified.0s
[1] 34
misclassified.8s <- sum((y == 8) & (scores < lcut))
misclassified.8s
[1] 0

So the overall resubstitution error rate is 34 / 547 = 6.2%.

2. To get a more realistic assessment of performance, compute the error rate for an independent test sample (or validation sample):

misclassified.0s <- sum((y.test == 0) & (scores.test > lcut))
misclassified.0s
[1] 53
misclassified.8s <- sum((y.test == 8) & (scores.test < lcut))
misclassified.8s
[1] 11

So the validation error rate is 64 / 525 = 12%. Not so surprising: we estimated 257 parameters from 547 observations!
- Some issues that we have not yet addressed
- 1. Better ways of measuring the performance of a classification rule
  - The class proportions in the population (prior probabilities for the classes) might be different from the proportions in the training sample --- misrepresentation. In practice, one would not want the proportions in the training sample to be too uneven.
  - The loss when misclassifying a class 0 observation as class 1 might be different from the loss when committing the reverse error.
- 2. Automatically finding the cut point for a hyperplane rule
  - The optimal cut point isn't always as obvious as in this example.
- 3. Logistic regression
  - We are in effect modeling p(x) = P(Y = 1 | x), but the linear model gives predicted values outside [0, 1].
  - Common remedy: model log(p / (1 - p)) --- logistic regression (see the sketch below).
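A minimal sketch of that remedy in R, using glm() with a binomial family on a small simulated example (x and y01 are made-up stand-ins, not the digit data):

set.seed(1)
x   <- rnorm(100)                               # simulated predictor
y01 <- rbinom(100, 1, plogis(-0.5 + 2 * x))     # simulated binary response
fit <- glm(y01 ~ x, family = binomial)          # fits log(p / (1 - p)) = b0 + b1 x
head(predict(fit, type = "response"))           # predicted probabilities stay inside [0, 1]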
- 4. Constructing more general rules
  - Often, not all classificatory information is contained in a single direction in predictor space. We need methods that can deal with more general situations.