Title: Classification Algorithms Continued
1. Classification Algorithms Continued

2. Overview
- Compare rule induction and decision tree learning algorithms
- Understand some classifiers that don't represent data with rules
- Compare a range of benchmark learning algorithms
3. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
4. Generating Rules
- A decision tree can be converted into a rule set
- Straightforward conversion: each path from root to leaf becomes a rule, but this produces an overly complex rule set
- More effective conversions are not trivial (e.g. C4.8 tests each node in the root-to-leaf path to see if it can be eliminated without loss of accuracy)
- Alternatively, generate rules directly from the data: rule induction
5. Covering Algorithms
- Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (excluding instances not in the class)
- This is called a covering approach because at each stage a rule is identified that covers some of the instances
6. Example: Generating a Rule

7. Example: Generating a Rule II

8. Example: Generating a Rule III

9. Example: Generating a Rule IV
- Possible rule set for class b
10. Rules vs. Trees
- A corresponding decision tree produces exactly the same predictions
- But rule sets may be easier to understand; decision trees suffer from replicated subtrees
- Also, in multi-class situations a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account; the covering algorithm is clearer
11. A Simple Covering Algorithm
- Generates a rule by adding tests that maximize the rule's accuracy
- Similar to the decision tree problem of selecting an attribute to split on
- But a decision tree inducer maximizes overall purity and considers all branches
- Each new test reduces the rule's coverage

(Witten & Eibe)
12. Selecting a Test
- Goal: maximize accuracy
- t: total number of instances covered by the rule
- p: positive examples of the class covered by the rule
- t − p: number of errors made by the rule
- Select the test that maximizes the ratio p/t
- We are finished when p/t = 1 or the set of instances can't be split any further
- We also want t to be large; some algorithms have heuristics to take this into account
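The test-selection step can be sketched in Python. This is a minimal illustration, not an implementation from the slides: the helper name `best_test` and the representation of instances as (attribute-dict, label) pairs are assumptions.

```python
def best_test(instances, target_class):
    """Pick the (attribute, value) test that maximizes p/t, where t is the
    number of instances covered by the test and p is the number of covered
    instances belonging to the target class. Ties favour greater coverage."""
    # instances: list of (attribute-dict, class-label) pairs
    counts = {}  # (attr, value) -> [p, t]
    for attrs, label in instances:
        for attr, value in attrs.items():
            p_t = counts.setdefault((attr, value), [0, 0])
            p_t[1] += 1                      # t: covered
            if label == target_class:
                p_t[0] += 1                  # p: covered and positive
    # maximize accuracy p/t first, then coverage t as a tiebreaker
    return max(counts, key=lambda k: (counts[k][0] / counts[k][1], counts[k][1]))
```

For example, with two sunny/"yes" instances and one rainy/"no" instance, the test `outlook = sunny` wins with p/t = 2/2.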
13. Example: Contact Lens Data
- Rule we seek
- Possible tests
14. Modified Rule and Resulting Data
- Rule with best test added
- Instances covered by the modified rule
15. Further Refinement
- Current state
- Possible tests
16. Modified Rule and Resulting Data
- Rule with best test added
- Instances covered by the modified rule
- Next, test spectacle_prescription
17. Further Refinement
- Current state
- Possible tests
- Tie between the first and the fourth test
- We choose the one with greater coverage
18. Resulting Rule
- Final rule
- Second rule for recommending hard lenses (built from instances not covered by the first rule)
- These two rules cover all hard lenses
- The process is repeated with the other two classes
19. Pseudo-code for PRISM
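The PRISM procedure described on the preceding slides can be sketched as runnable Python. This is a simplified reading of the algorithm, not the slide's pseudo-code verbatim; the data representation (lists of attribute-dict/label pairs) and the tie-breaking detail are assumptions.

```python
def prism(instances, classes):
    """Sketch of the PRISM covering algorithm: for each class, repeatedly
    grow a perfect rule by adding the test that maximizes p/t, then remove
    the instances the rule covers. Returns (class, conditions) pairs."""
    rules = []
    for cls in classes:                        # outer loop: one rule set per class
        e = list(instances)                    # E: instances still to be covered
        while any(label == cls for _, label in e):
            conditions = []                    # rule with an empty left-hand side
            covered = list(e)
            # grow the rule until it covers only instances of class `cls`
            while any(label != cls for _, label in covered):
                best, best_score = None, (-1.0, -1)
                for attrs, _ in covered:
                    for test in attrs.items():
                        if test in conditions:
                            continue
                        sub = [(a, l) for a, l in covered if a.get(test[0]) == test[1]]
                        p = sum(1 for _, l in sub if l == cls)
                        score = (p / len(sub), p)   # accuracy p/t, then coverage
                        if score > best_score:
                            best, best_score = test, score
                if best is None:               # can't split any further
                    break
                conditions.append(best)
                covered = [(a, l) for a, l in covered if a.get(best[0]) == best[1]]
            rules.append((cls, conditions))
            e = [x for x in e if x not in covered]  # separate out covered instances
    return rules
```

On a toy data set where attribute b perfectly predicts the class, PRISM finds one single-condition rule per class.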
20. Rules vs. Decision Lists
- PRISM with the outer loop removed generates a decision list for one class
- Subsequent rules are designed for instances that are not covered by previous rules (i.e. rule order matters)
- Order doesn't matter for testing because all rules predict the same class, but it does affect pruning
- The outer loop considers all classes separately
- No order dependence between classes/rules is implied
- Problems: overlapping rules; uncovered examples (a default rule is required)
21. Separate and Conquer
- Methods like PRISM (dealing with one class at a time) are separate-and-conquer algorithms
- First, a rule is identified
- Then, all instances covered by the rule are separated out
- Finally, the remaining instances are "conquered"
- Difference from divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further
22. Rule Induction Algorithms
- Common procedure: separate-and-conquer
- Differences:
- Search method (e.g. greedy, beam search, ...)
- Test selection criteria (e.g. accuracy, ...)
- Pruning method (e.g. MDL, hold-out set, ...)
- Stopping criterion (e.g. minimum accuracy)
- Post-processing step
- Also: a decision list over all classes vs. one rule set for each class
23. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
24. Linear Models
- Work most naturally with numeric attributes
- Standard technique for numeric prediction: linear regression
- The outcome is a linear combination of the attributes
- Weights are calculated from the training data
- Predicted value for the first training instance a(1): x(1) = w0 a0(1) + w1 a1(1) + ... + wk ak(1) = sum_{j=0..k} wj aj(1), where a0 is always 1 (w0 is called the bias)
25. Minimizing the Squared Error
- Choose the k + 1 coefficients to minimize the squared error on the training data
- Squared error: E = sum over patterns i of ( x(i) − sum_{j=0..k} wj aj(i) )^2, where x(i) is the target datum and the inner sum over inputs j is the model output
- Compute the coefficients using standard matrix operations (the pseudo-inverse); a fast process
- Can be done if there are more instances than attributes
- Minimizing the absolute error is more difficult
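The pseudo-inverse computation above can be sketched with NumPy. This is an illustrative fit, not code from the slides; the helper name `fit_linear` and the toy data are assumptions.

```python
import numpy as np

def fit_linear(A, x):
    """A: (n_instances, k) attribute matrix; x: (n_instances,) targets.
    Returns the k+1 weights [w0, w1, ..., wk] minimizing squared error."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # prepend a0 = 1 (bias column)
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)      # least squares via pseudo-inverse
    return w

A = np.array([[0.0], [1.0], [2.0], [3.0]])
x = np.array([1.0, 3.0, 5.0, 7.0])                  # generated by x = 1 + 2a exactly
w = fit_linear(A, x)
print(w)  # ≈ [1. 2.]
```

There are four instances and two coefficients, so the "more instances than attributes" condition on the slide holds and the fit is unique.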
26. Regression for Classification
- What is regression?
- Any regression technique can be used for classification
- Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't (called 1-of-c coding)
- Prediction: predict the class corresponding to the model with the largest output value (membership value)
- For linear regression this is known as multi-response linear regression
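A minimal sketch of multi-response linear regression with 1-of-c coding, under assumed function names and a toy one-attribute data set:

```python
import numpy as np

def fit_multiresponse(A, labels, classes):
    """One least-squares model per class, trained on 1-of-c targets."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # bias column a0 = 1
    W = []
    for c in classes:
        t = (labels == c).astype(float)             # 1-of-c coding: 1 in class, else 0
        w, *_ = np.linalg.lstsq(A1, t, rcond=None)
        W.append(w)
    return np.array(W)                              # one weight row per class

def predict(W, A, classes):
    """Predict the class whose model gives the largest membership value."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    return [classes[i] for i in np.argmax(A1 @ W.T, axis=1)]

A = np.array([[0.0], [0.2], [0.8], [1.0]])
labels = np.array(["a", "a", "b", "b"])
W = fit_multiresponse(A, labels, ["a", "b"])
print(predict(W, A, ["a", "b"]))  # ['a', 'a', 'b', 'b']
```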
27. Theoretical Justification
- Let y be the observed target value (either 0 or 1) for instance x, f(x) the model output, and P(1|x) the true class probability
- The scheme minimizes the expected squared error E[(f(x) − y)^2]
- This decomposes as E[(f(x) − y)^2] = (f(x) − P(1|x))^2 + P(1|x)(1 − P(1|x)), where the second term is a constant that does not depend on the model
- We want to minimize (f(x) − P(1|x))^2, so minimizing the squared error pushes the model output towards the true class probability
28. Pairwise Regression
- Another way of using regression for classification
- Build a regression function for every pair of classes, using only instances from those two classes
- Assign an output of 1 to one member of the pair and −1 to the other (or 1 and 0)
- Prediction is done by voting
- The class that receives the most votes is predicted
- Alternative: output "don't know" if there is no agreement
- More likely to be accurate, but more expensive
- The basic idea of building a classifier for pairs of classes can be used with any model
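The pairwise scheme can be sketched as follows; the function names, +1/−1 coding choice, and toy data are illustrative assumptions, with least squares as the per-pair regressor:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def fit_pairwise(A, labels, classes):
    """One least-squares model per class pair, trained only on that pair's
    instances with +1 / -1 targets."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (labels == c1) | (labels == c2)
        A1 = np.hstack([np.ones((mask.sum(), 1)), A[mask]])
        t = np.where(labels[mask] == c1, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(A1, t, rcond=None)
        models[(c1, c2)] = w
    return models

def predict_pairwise(models, a):
    """Each pairwise model casts one vote; the most-voted class wins."""
    votes = Counter()
    a1 = np.concatenate([[1.0], a])              # prepend the bias input
    for (c1, c2), w in models.items():
        votes[c1 if a1 @ w > 0 else c2] += 1
    return votes.most_common(1)[0][0]

A = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
labels = np.array(["a", "a", "b", "b", "c", "c"])
models = fit_pairwise(A, labels, ["a", "b", "c"])
print(predict_pairwise(models, np.array([2.05])))  # 'c'
```

With c classes this trains c(c−1)/2 models, which is the "more expensive" part noted on the slide.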
29. Logistic Regression
- Problem: some assumptions are violated when linear regression is applied to classification problems (it assumes Gaussian conditional noise); we really want outputs that estimate class probabilities
- Logistic regression: an alternative to linear regression
- Designed for classification problems
- Estimates class probabilities directly using the maximum likelihood method
- Uses a generalised linear model, where P is the class probability
30Logistic Regression II
p 1/(1exp(-y)), where y is the linear model
output.
- Still has linear decision boundaries, but
probabilistic outputs fit into a more principled
framework - Opens up the path for application of Bayesian
techniques complexity control, selection of
inputs, missing data,
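A small sketch of the idea, assuming a one-attribute model and using plain gradient ascent on the log-likelihood as one simple way to realise the maximum-likelihood fit (the function names and data are illustrative):

```python
import math

def sigmoid(y):
    """The logistic link p = 1/(1 + exp(-y)): maps any linear model
    output y onto a valid probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def fit_logistic(a, t, lr=0.1, steps=200):
    """Fit p = sigmoid(w0 + w1*a) by maximum likelihood via gradient ascent."""
    w0 = w1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for ai, ti in zip(a, t):                 # gradient of the log-likelihood
            err = ti - sigmoid(w0 + w1 * ai)
            g0 += err
            g1 += err * ai
        w0 += lr * g0
        w1 += lr * g1
    return w0, w1

a = [0.0, 0.5, 1.5, 2.0]
t = [0, 0, 1, 1]                                 # 0/1 class membership
w0, w1 = fit_logistic(a, t)
p = sigmoid(w0 + w1 * 1.75)                      # probability for a new instance
```

The decision boundary is still linear (p = 0.5 exactly where w0 + w1*a = 0), but the output is now a probability rather than an unbounded regression value.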
31. Discussion of Linear Models
- Not appropriate if the data exhibits non-linear dependencies
- But can serve as building blocks for more complex schemes (e.g. model trees)
- Example: multi-response linear discriminants define a hyperplane for any two given classes
- Logistic regression and linear discriminants both give linear decision boundaries; excellent benchmarks
32. Comments on Basic Methods
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they can't learn XOR
- But combinations of them can (non-linearities → neural nets)
- Can also include pre-computed non-linear terms (e.g. quadratic)
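The XOR limitation and the pre-computed-term remedy can both be demonstrated in a few lines; the least-squares setup below is an illustrative assumption:

```python
import numpy as np

A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR targets

ones = np.ones((4, 1))
lin = np.hstack([ones, A])                         # bias + a1 + a2 only
w_lin, *_ = np.linalg.lstsq(lin, t, rcond=None)
print(lin @ w_lin)     # all 0.5: a purely linear model can't fit XOR

quad = np.hstack([ones, A, A[:, :1] * A[:, 1:]])   # add the product term a1*a2
w_quad, *_ = np.linalg.lstsq(quad, t, rcond=None)
print(quad @ w_quad)   # [0, 1, 1, 0]: exact, since XOR = a1 + a2 - 2*a1*a2
```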
33. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
34. Instance-based Representation
- Simplest form of learning: rote learning
- The training instances are searched for the instance that most closely resembles the new instance
- The instances themselves represent the knowledge
- Also called instance-based learning
- The similarity/distance function defines what is learned
- Instance-based learning is lazy learning
- Methods:
- nearest-neighbour
- k-nearest-neighbour
35. Distance Function
- Key to success (or failure): it defines what is learned
- Several numeric attributes: normally Euclidean distance is used, and attributes are normalized
- Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
- Ordinal attributes: distance depends on the order of values
- Are all attributes equally important?
- Weighting the attributes might be necessary
- Scale so that each attribute contributes (approximately) the same to the distance metric
36. Instance-based Learning
- Most instance-based schemes use Euclidean distance: for instances a(1) and a(2) with k attributes, d(a(1), a(2)) = sqrt( (a1(1) − a1(2))^2 + ... + (ak(1) − ak(2))^2 )
- Taking the square root is not required when comparing distances
- Another popular metric: the city-block (Manhattan) metric
- It adds the absolute differences without squaring them
- Why the name? Think of a city grid in two dimensions
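Both distance functions as runnable sketches (the function names are illustrative):

```python
import math

def euclidean(a1, a2):
    """Euclidean distance; note the squared sum alone ranks neighbours
    identically, so the sqrt can be skipped when only comparing."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def manhattan(a1, a2):
    """City-block metric: sum of absolute differences, no squaring."""
    return sum(abs(x - y) for x, y in zip(a1, a2))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```

The name "city-block" comes from the 7 blocks a taxi would drive on a grid (3 east, 4 north) versus the 5-unit straight-line distance.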
37. Normalization and Other Issues
- Different attributes are measured on different scales → they need to be normalized: ai = (vi − min vi) / (max vi − min vi), where vi is the actual value of attribute i; or standardize to zero mean and unit variance
- Nominal attributes: distance is either 0 or 1
- Common policy for missing values: assumed to be maximally distant (given normalized attributes). Completely ad hoc!
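The min-max formula above can be sketched directly (the helper name and the per-attribute-column data layout are assumptions):

```python
def normalize(columns):
    """columns: list of per-attribute value lists. Rescales each attribute
    to [0, 1] via ai = (vi - min vi) / (max vi - min vi), so every
    attribute contributes comparably to the distance function."""
    result = []
    for values in columns:
        lo, hi = min(values), max(values)
        result.append([(v - lo) / (hi - lo) for v in values])
    return result

print(normalize([[10, 20, 30], [1, 2, 3]]))  # both become [0.0, 0.5, 1.0]
```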
38. Discussion of 1-NN
- Often very accurate
- ... but slow in classification
- The simple version scans the entire training data to derive a prediction; tree data structures provide speed improvements
- Assumes all attributes are equally important
- Remedy: attribute selection or attribute weights
- Possible remedies against noisy instances:
- Take a majority vote over the k nearest neighbours
- Remove noisy instances from the dataset (difficult!)
- Statisticians have used k-NN since the early 1950s
- If n → ∞ and k/n → 0, the error approaches the minimum possible (the Bayes error)
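The majority-vote remedy above amounts to the classic k-NN classifier; a minimal sketch (function name and toy data are illustrative, and this is the slow full-scan version, not a tree-accelerated one):

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """k-nearest-neighbour classification: majority vote among the k
    training instances closest to the query (squared Euclidean distance,
    since sqrt doesn't change the ranking)."""
    neighbours = sorted(
        training,
        key=lambda inst: sum((x - y) ** 2 for x, y in zip(inst[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [([0.0, 0.0], "a"), ([0.1, 0.2], "a"), ([0.9, 1.0], "b"),
            ([1.0, 0.8], "b"), ([0.5, 0.4], "a")]
print(knn_predict(training, [0.95, 0.95], k=3))  # 'b'
```

With k = 1 this reduces to plain nearest-neighbour; larger k trades sensitivity to noise against blurring of class boundaries.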
39. Overview
- Compare rule induction and decision tree learning algorithms
- Understand some classifiers that don't represent data with rules
- Compare a range of benchmark learning algorithms