Transcript and Presenter's Notes

Title: Machine learning


1
LECTURE 27
26
Machine learning. Distinguish between supervised and unsupervised learning.
Examples for supervised learning:
1. Credit rating. We have a credit application, and we want to predict whether or not the applicant will pay back the loan. Given: a collection of credit applications for which we know whether or not the loans were paid back.
2. Satellite image analysis. We have multi-spectral images of the earth's surface. We want to automatically segment the images, i.e. assign each pixel to one of a number of classes (forest, grassland, water, road, ...). Given: image(s) for which the ground truth is known.
27
3. Handwritten character recognition. We want to automatically read ZIP codes on mail envelopes. Subproblem: given a 16 x 16 gray scale image of a handwritten digit, identify the digit. Given: a sample of images for which the digit is known.
4. Chemometrics. We have a gasoline sample and we want to determine its octane level from an IR absorption spectrum. Given: a collection of spectra of gasoline samples for which we determined the octane level with an "octane engine".
28
Examples for unsupervised learning:
5. Market segmentation. We have data on Amazon customers (purchase history, demographics (?), etc.). We want to partition the customers into homogeneous subsets.
6. Satellite image analysis. We have multi-spectral images of the earth's surface. We want to partition the pixels into subsets (hopefully) representing the same kind of surface.
29
  • The generic supervised learning problem
  • Universe of objects (credit applicants, written digits, gasoline samples, ...).
  • We observe properties of the objects (answers on a credit application, 16 x 16 gray level image, IR spectrum).
  • Based on the observed properties, we want to predict the value of a response variable (loan performance, digit, octane level).
  • As the basis for our prediction rule we have a training sample of objects for which we know both the properties and the value of the response variable.
  • The generic unsupervised learning problem
  • Universe of objects (Amazon customers, pixels in a multi-spectral image).
  • We observe properties (purchase history, spectrum).
  • Based on the observed properties, we want to partition the objects into groups such that objects within each group are similar, and dissimilar from objects in other groups.
  • Terminology
  • Supervised learning with a continuous response variable: regression
  • Supervised learning with a categorical response variable: classification or discriminant analysis
  • Unsupervised learning: cluster analysis

30
  • Supervised learning
  • X1, ..., Xp: predictor variables
  • Y: response variable
  • Given: (x1, y1), ..., (xn, yn), a training sample
  • Goals:
  • Generate a rule s(x) that predicts the response Y for new test observations for which only x is known
  • Describe how Y depends on x
  • We will restrict ourselves to continuous or binary responses.
  • Methods for a binary response can be used for a general categorical response.
  • We will discuss two methods:
  • Linear modeling --- oldest, simplest, most widely used
  • Classification and regression trees (CART) --- builds more general models, can handle messy data.

31
Linear modeling. Model the response Y as a linear function of the predictors X1, ..., Xp:
Y = a0 + a1 X1 + ... + ap Xp
Determine the regression coefficients a0, ..., ap by fitting the model to the training sample. The most common fitting method is least squares --- find the coefficients that minimize the residual sum of squares
RSS = sum_i (yi - a0 - a1 xi1 - ... - ap xip)^2
This optimization problem can be solved by linear algebra. Prediction: for a test observation with predictor vector x, predict
yhat = a0 + a1 x1 + ... + ap xp
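A minimal R sketch of the least-squares fit just described, assuming the data sit in a hypothetical data frame car.df with columns Fuel and Weight (the Consumer Reports data used below):

# Least squares "by linear algebra": solve the normal equations (X'X) a = X'y
X <- cbind(1, car.df$Weight)            # design matrix with an intercept column
y <- car.df$Fuel
a <- solve(t(X) %*% X, t(X) %*% y)      # coefficients (a0, a1)

# The same fit with the built-in lm() function
fit <- lm(Fuel ~ Weight, data = car.df)
coef(fit)

# Prediction for a new observation, e.g. a 3000-pound car
yhat <- a[1] + a[2] * 3000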
32
  • Illustration 1 (silly data set --- sorry!)
  • Data on 60 different car models, from Consumer Reports
  • Predictor variables:
  • Weight in pounds
  • Engine displacement in liters
  • Type --- Small, Sporty, Compact, Medium, Large, Van (categorical predictor, watch out)
  • Response variable: Mileage in miles / gallon
  • Question: How does fuel consumption depend on the predictor variables?
  • Initially, let's just use a single predictor, Weight.
  • First some simple data exploration.

33
  • Boxplot
  • Center line of box: median
  • Upper and lower bounds of box: quartiles
  • Whiskers extend out to the extremes, or to the quartiles +/- 1.5 times the inter-quartile range if there are outliers
  • Summary of Weight (see the R sketch below):
  • Min. 1st Qu. Median Mean 3rd Qu. Max.
  • 1845 2571 2885 2901 3231 3855
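The summary and boxplot above each correspond to one line of R; car.df$Weight is again the assumed column name:

summary(car.df$Weight)        # Min, 1st Qu., Median, Mean, 3rd Qu., Max
boxplot(car.df$Weight)        # default whiskers reach the most extreme points
                              # within 1.5 * IQR of the quartiles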

34
The curve in the plots was obtained by a moving average, s(x) = ave(yi : xi in N(x)), where N(x) is an interval centered at x. Maybe better to model gallons / mile?
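A sketch of such a moving-average smoother, with an arbitrary illustrative window half-width h:

# s(x) = ave(yi : xi in N(x)), N(x) an interval of half-width h centered at x
moving.average <- function(x0, x, y, h = 200) {
  mean(y[abs(x - x0) <= h])
}
grid <- seq(min(car.df$Weight), max(car.df$Weight), length.out = 50)
s <- sapply(grid, moving.average, x = car.df$Weight, y = car.df$Fuel)
plot(car.df$Weight, car.df$Fuel)
lines(grid, s)                 # the smooth curve shown in the plots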
35
Output obtained when looking at the linear model with the Browser:

Linear Model

Call: lm(formula = Fuel ~ Weight)

Residuals:
    Min      1Q  Median     3Q    Max
-0.7957 -0.2703 0.01414 0.2547 0.9583

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 0.3914      0.2995  1.3070   0.1964
Weight      0.0013      0.0001 12.9323   0.0000

Residual standard error: 0.3877 on 58 degrees of freedom
Multiple R-Squared: 0.7425
F-statistic: 167.2 on 1 and 58 degrees of freedom, the p-value is 0

Model: yi = a0 + a1 xi + ei, where the ei are independent random errors with mean 0 and variance sigma^2.
Std. Error tells us how much the estimated coefficients would vary, if we generated new data from the model.
Multiple R-squared = 1 - sum(ri^2) / sum((yi - ybar)^2): the fraction of response variability explained by the model.
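As a check on the formula, the reported R-squared can be recomputed from the residuals (fit is the assumed lm object for Fuel ~ Weight):

r <- resid(fit)                          # residuals yi - yhat_i
y <- car.df$Fuel
1 - sum(r^2) / sum((y - mean(y))^2)      # should reproduce the 0.7425 above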
36
75% of the variation in fuel consumption per mile can be explained by vehicle weight. Next, let's try a model with two predictor variables, Weight and Displacement. Question: Does displacement help explain fuel consumption, once we adjust for weight?

Linear Model

Call: lm(formula = Fuel ~ Weight + Disp.)

Residuals:
    Min      1Q  Median     3Q    Max
-0.8109 -0.2559 0.01971 0.2673 0.9812

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept) 0.4790      0.3418  1.4014   0.1665
Weight      0.0012      0.0002  7.2195   0.0000
Disp.       0.0009      0.0016  0.5427   0.5895

Residual standard error: 0.3901 on 57 degrees of freedom
Multiple R-Squared: 0.7438
F-statistic: 82.75 on 2 and 57 degrees of freedom, the p-value is 0

Adding Displacement as a second predictor does not increase the explanatory power of the model.
37
  • Now, let's try Displacement by itself:
  • Linear Model
  • Call: lm(formula = Fuel ~ Disp.)
  • Residuals:
  •     Min      1Q  Median     3Q   Max
  • -0.7998 -0.4765 0.04638 0.2653 1.356
  • Coefficients:
  •               Value Std. Error t value Pr(>|t|)
  • (Intercept) 2.6919      0.2074 12.9796   0.0000
  • Disp.       0.0100      0.0013  7.7630   0.0000
  • Residual standard error: 0.5351 on 58 degrees of freedom
  • Multiple R-Squared: 0.5096
  • F-statistic: 60.26 on 1 and 58 degrees of freedom, the p-value is 1.53e-010

Does well, but not as well as Weight
38
Finally, let's see how well Type predicts fuel consumption. First, look at a picture.
How to deal with categorical predictors in a linear model: Suppose X is categorical, with K categories (in our case, K = 6). Convert X into K - 1 indicator variables I2, ..., IK, with
Ij = 1 if the value of X is the j-th category, Ij = 0 otherwise.
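A sketch of this construction in R; model.matrix() builds the indicator columns automatically when Type is a factor (with Compact as the alphabetically first, baseline level):

car.df$Type <- factor(car.df$Type)
head(model.matrix(~ Type, data = car.df))
# columns: (Intercept), TypeLarge, TypeMedium, TypeSmall, TypeSporty, TypeVan
# a row has a 1 in column TypeLarge exactly when that car's Type is "Large"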
39
Next, a linear model:

Linear Model

Call: lm(formula = Fuel ~ Type)

Residuals:
    Min      1Q   Median     3Q   Max
-0.9273 -0.2537 -0.05596 0.1802 1.306

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  4.1677     0.1090 38.2480   0.0000
TypeLarge    0.8001     0.2669  2.9978   0.0041
TypeMedium   0.4338     0.1599  2.7124   0.0089
TypeSmall   -0.8943     0.1599 -5.5922   0.0000
TypeSporty  -0.2100     0.1779 -1.1805   0.2430
TypeVan      1.1456     0.1932  5.9306   0.0000

Residual standard error: 0.422 on 54 degrees of freedom
Multiple R-Squared: 0.7159
F-statistic: 27.22 on 5 and 54 degrees of freedom, the p-value is 1.22e-013

yhat(Compact) = 4.2
yhat(Large) = 4.2 + 0.8
yhat(Small) = 4.2 - 0.9

Type predicts fuel consumption almost as well as Weight does. Not surprising --- Type is highly correlated with Weight.
40
Finally, let's see if Type adds any predictive power to Weight:

Linear Model

Call: lm(formula = Fuel ~ Weight + Type)

Residuals:
   Min      1Q   Median   3Q    Max
 -0.62 -0.2556 -0.03516 0.22 0.8534

Coefficients:
              Value Std. Error t value Pr(>|t|)
(Intercept)  1.6721     0.5683  2.9423   0.0048
Weight       0.0009     0.0002  4.4524   0.0000
TypeLarge    0.0432     0.2859  0.1510   0.8805
TypeMedium   0.1022     0.1565  0.6530   0.5166
TypeSmall   -0.3960     0.1775 -2.2313   0.0299
TypeSporty  -0.1905     0.1533 -1.2427   0.2194
TypeVan      0.5298     0.2163  2.4489   0.0177

Residual standard error: 0.3634 on 53 degrees of freedom
Multiple R-Squared: 0.7933
F-statistic: 33.9 on 6 and 53 degrees of freedom, the p-value is 2.22e-016

Not much: the percentage of variance explained increases from 74% to 79%. On the other hand, we are fitting 5 additional parameters.
41
Look at the residuals, to see if there are any peculiarities. First, plot the residuals against the predictors.
No outliers, no obvious non-linearity.
42
Next, plot the residuals against the fitted values.
No outliers, no non-linearity, no obvious change in residual spread. The model looks fine!
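The residual checks on these two slides amount to a few lines of R (fit again denotes the assumed Fuel ~ Weight object):

plot(car.df$Weight, resid(fit)); abline(h = 0)   # residuals vs. predictor
plot(fitted(fit),   resid(fit)); abline(h = 0)   # residuals vs. fitted values
# look for outliers, curvature (non-linearity), or changing spread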
43
  • Illustration 2 --- Classification of handwritten digits.
  • Each handwritten digit is given as a 16 x 16 gray level image.
  • We will try to construct a rule distinguishing 0s from 8s.
  • Given:
  • Training samples of 389 0s and 158 8s
  • Test samples of 359 0s and 166 8s
  • Idea:
  • Construct the prediction rule using the training sample, then estimate its accuracy from the test sample.
  • The error rate obtained by classifying the training sample (the resubstitution error) is usually optimistic.
  • We will use the test sample to evaluate method(s) for obtaining an unbiased estimate of the error rate from the training sample.

44
Sample pictures of digits
45
  • The training sample has n = 389 + 158 observations. There are p = 256 predictor variables (string out the pixels column by column). yi = 0 or 8, depending on the digit (the actual values don't matter).
  • The LS fit gives an intercept and a coefficient vector a of length 256 (see the sketch below).
  • Interpretation of a:
  • The normalized coefficient vector a / ||a|| can be regarded as a direction in 256-dimensional space. The predicted values yhat_i = a0 + a1 xi1 + ... + ap xip are essentially projections of the predictor vectors xi onto this direction (shifted and rescaled).
  • The coefficient vector a can itself be regarded as an image (not particularly enlightening in this case).
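A sketch of this fit, assuming the 547 training images are stored as the rows of a 547 x 256 matrix X (one pixel per column) and y holds the labels 0/8:

fit <- lm(y ~ X)           # least-squares fit with p = 256 predictors
a0 <- coef(fit)[1]         # intercept
a  <- coef(fit)[-1]        # coefficient vector of length 256
scores <- fitted(fit)      # predicted values = discriminant scores
# a / sqrt(sum(a^2)) is the direction in 256-dimensional space onto which
# the predictor vectors are (essentially) projected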

46
Let's now look at the predicted values, or discriminant scores.
There obviously are some pretty wild outliers among the regression coefficients and the images. Still, a histogram of the started logs --- log(yhat_i + c) with c = abs(min(scores)) + 1 --- clearly shows two groups.
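The started-log histogram, as a sketch using the scores from the fit above:

c.shift <- abs(min(scores)) + 1          # shift so all scores are positive
hist(log(scores + c.shift), breaks = 40) # two clear groups appear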
47
Make sure the two groups really are 0s and 8s
48
Now make up a classification rule. We will do this by hand for now. Sort the scores for the training sample and look at the corresponding class identities:

[Sorted class identities for the 547 training observations, in order of increasing score: below the cut point described next, the labels are all 0; above it they are predominantly 8, with a number of 0s mixed in among the larger scores.]

Clearly, the lower cut point lcut should be between ss[355] and ss[356] (ss = sorted scores). Classify an image as 0 if its score < lcut, as 8 otherwise.
Note: The rule partitions the 256-dimensional predictor space into two half-spaces. We might consider a more complex rule that also has an upper cut point. But first let's see how well the rule performs.
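The by-hand construction of the rule, as a sketch (y and scores are the assumed training labels and discriminant scores):

y[order(scores)]                 # class identities sorted by score (the listing above)
ss <- sort(scores)               # ss = sorted scores
lcut <- (ss[355] + ss[356]) / 2  # any value between ss[355] and ss[356] will do
classify <- ifelse(scores < lcut, 0, 8)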
49
Evaluating the classification rule (naïve):
1. Compute the resubstitution error rate --- the error rate when classifying the observations in the training sample.

misclassified.0s <- sum((y == 0) & (scores > lcut))
misclassified.0s
[1] 34
misclassified.8s <- sum((y == 8) & (scores < lcut))
misclassified.8s
[1] 0

So the overall resubstitution error rate is 34 / 547 = 6.2%.

2. To get a more realistic assessment of performance, compute the error rate for an independent test sample (or validation sample):

misclassified.0s <- sum((y.test == 0) & (scores.test > lcut))
misclassified.0s
[1] 53
misclassified.8s <- sum((y.test == 8) & (scores.test < lcut))
misclassified.8s
[1] 11

So the validation error rate is 64 / 525 = 12%. Not so surprising: we estimated 257 parameters from 547 observations!
50
  • Some issues that we have not yet addressed:
  • 1. Better ways of measuring the performance of a classification rule
  • The class proportions in the population (the prior probabilities for the classes) might be different from the proportions in the training sample --- misrepresentation. In practice, one would not want the proportions in the training sample to be too uneven.
  • The loss when misclassifying a class-0 observation as class 1 might be different from the loss when committing the reverse error.
  • 2. Automatically finding the cut point for a hyperplane rule
  • The optimal cut point isn't always as obvious as in this example.
  • 3. Logistic regression
  • We are in effect modeling p(x) = prob(Y = 1 | x), but the linear model gives predicted values outside [0, 1].
  • Common remedy: model log(p / (1 - p)) as a linear function of x --- logistic regression (see the sketch below).
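A sketch of the logistic-regression remedy using R's glm(); the data frame d, with a binary 0/1 response y and predictors x1, x2, is hypothetical:

# model log(p / (1 - p)) as a linear function of the predictors
fit.logit <- glm(y ~ x1 + x2, data = d, family = binomial)
predict(fit.logit, type = "response")    # predicted probabilities, always in (0, 1)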

51
4. Constructing more general rules. Often, not all of the classificatory information is contained in a single direction in predictor space. We need methods that can deal with more general situations.