Title: Classification Algorithms Continued
1. Classification Algorithms Continued

2. Overview
- Compare rule induction and decision tree learning algorithms
- Understand some classifiers that don't represent data with rules
- Compare a range of benchmark learning algorithms
3. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
4. Generating Rules
- A decision tree can be converted into a rule set
- Straightforward conversion: each path from root to leaf becomes a rule, but this produces an overly complex rule set
- More effective conversions are not trivial (e.g. C4.8 tests each node in the root-to-leaf path to see if it can be eliminated without loss of accuracy)
- Alternatively, generate rules directly from the data: rule induction
5. Covering Algorithms
- Strategy for generating a rule set directly: for each class in turn, find a rule set that covers all instances in it (excluding instances not in the class)
- This is called a covering approach because at each stage a rule is identified that covers some of the instances
6. Example: Generating a Rule

7. Example: Generating a Rule II

8. Example: Generating a Rule III

9. Example: Generating a Rule IV
- Possible rule set for class b
10. Rules vs. Trees
- A corresponding decision tree produces exactly the same predictions
- But rule sets may be easier to understand; decision trees suffer from replicated subtrees
- Also, in multi-class situations a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account; the covering algorithm is clearer
11. A Simple Covering Algorithm
- Generates a rule by adding tests that maximize the rule's accuracy
- Similar to the decision tree problem of selecting an attribute to split on
- But a decision tree inducer maximizes overall purity and considers all branches
- Each new test reduces the rule's coverage

(Witten & Eibe)
12. Selecting a Test
- Goal: maximize accuracy
- t: total number of instances covered by the rule
- p: positive examples of the class covered by the rule
- t − p: number of errors made by the rule
- Select the test that maximizes the ratio p/t
- We are finished when p/t = 1 or the set of instances can't be split any further
- We also want t to be large; some algorithms have heuristics to take this into account
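The test-selection step can be sketched in Python. This is a minimal illustration, not an implementation from the slides: the helper name `best_test` and the representation of instances as (attribute-dict, label) pairs are assumptions.

```python
def best_test(instances, target_class):
    """Pick the (attribute, value) test that maximizes p/t, where t is the
    number of instances covered by the test and p is the number of covered
    instances belonging to the target class. Ties favour greater coverage."""
    # instances: list of (attribute-dict, class-label) pairs
    counts = {}  # (attr, value) -> [p, t]
    for attrs, label in instances:
        for attr, value in attrs.items():
            p_t = counts.setdefault((attr, value), [0, 0])
            p_t[1] += 1                      # t: covered
            if label == target_class:
                p_t[0] += 1                  # p: covered and positive
    # maximize accuracy p/t first, then coverage t as a tiebreaker
    return max(counts, key=lambda k: (counts[k][0] / counts[k][1], counts[k][1]))
```

For example, with two sunny/"yes" instances and one rainy/"no" instance, the test `outlook = sunny` wins with p/t = 2/2.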
13. Example: Contact Lens Data
- Rule we seek
- Possible tests
14. Modified Rule and Resulting Data
- Rule with best test added
- Instances covered by the modified rule
15. Further Refinement
- Current state
- Possible tests
16. Modified Rule and Resulting Data
- Rule with best test added
- Instances covered by the modified rule
- Next, test spectacle_prescription
17. Further Refinement
- Current state
- Possible tests
- Tie between the first and the fourth test
- We choose the one with greater coverage
18. Resulting Rule
- Final rule
- Second rule for recommending hard lenses (built from instances not covered by the first rule)
- These two rules cover all hard lenses
- The process is repeated with the other two classes
19. Pseudo-code for PRISM
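The PRISM procedure described on the preceding slides can be sketched as runnable Python. This is a simplified reading of the algorithm, not the slide's pseudo-code verbatim; the data representation (lists of attribute-dict/label pairs) and the tie-breaking detail are assumptions.

```python
def prism(instances, classes):
    """Sketch of the PRISM covering algorithm: for each class, repeatedly
    grow a perfect rule by adding the test that maximizes p/t, then remove
    the instances the rule covers. Returns (class, conditions) pairs."""
    rules = []
    for cls in classes:                        # outer loop: one rule set per class
        e = list(instances)                    # E: instances still to be covered
        while any(label == cls for _, label in e):
            conditions = []                    # rule with an empty left-hand side
            covered = list(e)
            # grow the rule until it covers only instances of class `cls`
            while any(label != cls for _, label in covered):
                best, best_score = None, (-1.0, -1)
                for attrs, _ in covered:
                    for test in attrs.items():
                        if test in conditions:
                            continue
                        sub = [(a, l) for a, l in covered if a.get(test[0]) == test[1]]
                        p = sum(1 for _, l in sub if l == cls)
                        score = (p / len(sub), p)   # accuracy p/t, then coverage
                        if score > best_score:
                            best, best_score = test, score
                if best is None:               # can't split any further
                    break
                conditions.append(best)
                covered = [(a, l) for a, l in covered if a.get(best[0]) == best[1]]
            rules.append((cls, conditions))
            e = [x for x in e if x not in covered]  # separate out covered instances
    return rules
```

On a toy data set where attribute b perfectly predicts the class, PRISM finds one single-condition rule per class.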
20. Rules vs. Decision Lists
- PRISM with the outer loop removed generates a decision list for one class
- Subsequent rules are designed for instances that are not covered by previous rules (i.e. rule order matters)
- Order doesn't matter for testing because all rules predict the same class, but it does affect pruning
- The outer loop considers all classes separately
- No order dependence between classes/rules is implied
- Problems: overlapping rules; uncovered examples (a default rule is required)
21. Separate and Conquer
- Methods like PRISM (dealing with one class at a time) are separate-and-conquer algorithms
- First, a rule is identified
- Then, all instances covered by the rule are separated out
- Finally, the remaining instances are "conquered"
- Difference from divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further
22. Rule Induction Algorithms
- Common procedure: separate-and-conquer
- Differences:
- Search method (e.g. greedy, beam search, ...)
- Test selection criteria (e.g. accuracy, ...)
- Pruning method (e.g. MDL, hold-out set, ...)
- Stopping criterion (e.g. minimum accuracy)
- Post-processing step
- Also: a decision list over all classes vs. one rule set for each class
23. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
24. Linear Models
- Work most naturally with numeric attributes
- Standard technique for numeric prediction: linear regression
- The outcome is a linear combination of the attributes
- Weights are calculated from the training data
- Predicted value for the first training instance a(1): x(1) = w0 a0(1) + w1 a1(1) + ... + wk ak(1) = sum_{j=0..k} wj aj(1), where a0 is always 1 (w0 is called the bias)
25. Minimizing the Squared Error
- Choose the k + 1 coefficients to minimize the squared error on the training data
- Squared error: E = sum over patterns i of ( x(i) − sum_{j=0..k} wj aj(i) )^2, where x(i) is the target datum and the inner sum over inputs j is the model output
- Compute the coefficients using standard matrix operations (the pseudo-inverse); a fast process
- Can be done if there are more instances than attributes
- Minimizing the absolute error is more difficult
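The pseudo-inverse computation above can be sketched with NumPy. This is an illustrative fit, not code from the slides; the helper name `fit_linear` and the toy data are assumptions.

```python
import numpy as np

def fit_linear(A, x):
    """A: (n_instances, k) attribute matrix; x: (n_instances,) targets.
    Returns the k+1 weights [w0, w1, ..., wk] minimizing squared error."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # prepend a0 = 1 (bias column)
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)      # least squares via pseudo-inverse
    return w

A = np.array([[0.0], [1.0], [2.0], [3.0]])
x = np.array([1.0, 3.0, 5.0, 7.0])                  # generated by x = 1 + 2a exactly
w = fit_linear(A, x)
print(w)  # ≈ [1. 2.]
```

There are four instances and two coefficients, so the "more instances than attributes" condition on the slide holds and the fit is unique.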
26. Regression for Classification
- What is regression?
- Any regression technique can be used for classification
- Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class and 0 for those that don't (called 1-of-c coding)
- Prediction: predict the class corresponding to the model with the largest output value (membership value)
- For linear regression this is known as multi-response linear regression
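A minimal sketch of multi-response linear regression with 1-of-c coding, under assumed function names and a toy one-attribute data set:

```python
import numpy as np

def fit_multiresponse(A, labels, classes):
    """One least-squares model per class, trained on 1-of-c targets."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # bias column a0 = 1
    W = []
    for c in classes:
        t = (labels == c).astype(float)             # 1-of-c coding: 1 in class, else 0
        w, *_ = np.linalg.lstsq(A1, t, rcond=None)
        W.append(w)
    return np.array(W)                              # one weight row per class

def predict(W, A, classes):
    """Predict the class whose model gives the largest membership value."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])
    return [classes[i] for i in np.argmax(A1 @ W.T, axis=1)]

A = np.array([[0.0], [0.2], [0.8], [1.0]])
labels = np.array(["a", "a", "b", "b"])
W = fit_multiresponse(A, labels, ["a", "b"])
print(predict(W, A, ["a", "b"]))  # ['a', 'a', 'b', 'b']
```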
27. Theoretical Justification
- Let y be the observed target value (either 0 or 1) for instance x, f(x) the model output, and P(1|x) the true class probability
- The scheme minimizes the expected squared error E[(f(x) − y)^2]
- This decomposes as E[(f(x) − y)^2] = (f(x) − P(1|x))^2 + P(1|x)(1 − P(1|x)), where the second term is a constant that does not depend on the model
- We want to minimize (f(x) − P(1|x))^2, so minimizing the squared error pushes the model output towards the true class probability
28. Pairwise Regression
- Another way of using regression for classification
- Build a regression function for every pair of classes, using only instances from those two classes
- Assign an output of 1 to one member of the pair and −1 to the other (or 1 and 0)
- Prediction is done by voting
- The class that receives the most votes is predicted
- Alternative: output "don't know" if there is no agreement
- More likely to be accurate, but more expensive
- The basic idea of building a classifier for pairs of classes can be used with any model
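The pairwise scheme can be sketched as follows; the function names, +1/−1 coding choice, and toy data are illustrative assumptions, with least squares as the per-pair regressor:

```python
import numpy as np
from collections import Counter
from itertools import combinations

def fit_pairwise(A, labels, classes):
    """One least-squares model per class pair, trained only on that pair's
    instances with +1 / -1 targets."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (labels == c1) | (labels == c2)
        A1 = np.hstack([np.ones((mask.sum(), 1)), A[mask]])
        t = np.where(labels[mask] == c1, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(A1, t, rcond=None)
        models[(c1, c2)] = w
    return models

def predict_pairwise(models, a):
    """Each pairwise model casts one vote; the most-voted class wins."""
    votes = Counter()
    a1 = np.concatenate([[1.0], a])              # prepend the bias input
    for (c1, c2), w in models.items():
        votes[c1 if a1 @ w > 0 else c2] += 1
    return votes.most_common(1)[0][0]

A = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
labels = np.array(["a", "a", "b", "b", "c", "c"])
models = fit_pairwise(A, labels, ["a", "b", "c"])
print(predict_pairwise(models, np.array([2.05])))  # 'c'
```

With c classes this trains c(c−1)/2 models, which is the "more expensive" part noted on the slide.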
29. Logistic Regression
- Problem: some assumptions are violated when linear regression is applied to classification problems (it assumes Gaussian conditional noise); we really want outputs that estimate class probabilities
- Logistic regression: an alternative to linear regression
- Designed for classification problems
- Estimates class probabilities directly using the maximum likelihood method
- Uses a generalised linear model, where P is the class probability
30Logistic Regression II
p 1/(1exp(-y)), where y is the linear model
output.
- Still has linear decision boundaries, but
probabilistic outputs fit into a more principled
framework - Opens up the path for application of Bayesian
techniques complexity control, selection of
inputs, missing data,
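A small sketch of the idea, assuming a one-attribute model and using plain gradient ascent on the log-likelihood as one simple way to realise the maximum-likelihood fit (the function names and data are illustrative):

```python
import math

def sigmoid(y):
    """The logistic link p = 1/(1 + exp(-y)): maps any linear model
    output y onto a valid probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def fit_logistic(a, t, lr=0.1, steps=200):
    """Fit p = sigmoid(w0 + w1*a) by maximum likelihood via gradient ascent."""
    w0 = w1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for ai, ti in zip(a, t):                 # gradient of the log-likelihood
            err = ti - sigmoid(w0 + w1 * ai)
            g0 += err
            g1 += err * ai
        w0 += lr * g0
        w1 += lr * g1
    return w0, w1

a = [0.0, 0.5, 1.5, 2.0]
t = [0, 0, 1, 1]                                 # 0/1 class membership
w0, w1 = fit_logistic(a, t)
p = sigmoid(w0 + w1 * 1.75)                      # probability for a new instance
```

The decision boundary is still linear (p = 0.5 exactly where w0 + w1*a = 0), but the output is now a probability rather than an unbounded regression value.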
31. Discussion of Linear Models
- Not appropriate if the data exhibits non-linear dependencies
- But can serve as building blocks for more complex schemes (e.g. model trees)
- Example: multi-response linear discriminants define a hyperplane for any two given classes
- Logistic regression and linear discriminants both give linear decision boundaries; excellent benchmarks
32. Comments on Basic Methods
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. they can't learn XOR
- But combinations of them can (non-linearities → neural nets)
- Can also include pre-computed non-linear terms (e.g. quadratic)
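The XOR limitation and the pre-computed-term remedy can both be demonstrated in a few lines; the least-squares setup below is an illustrative assumption:

```python
import numpy as np

A = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 0.0])                 # XOR targets

ones = np.ones((4, 1))
lin = np.hstack([ones, A])                         # bias + a1 + a2 only
w_lin, *_ = np.linalg.lstsq(lin, t, rcond=None)
print(lin @ w_lin)     # all 0.5: a purely linear model can't fit XOR

quad = np.hstack([ones, A, A[:, :1] * A[:, 1:]])   # add the product term a1*a2
w_quad, *_ = np.linalg.lstsq(quad, t, rcond=None)
print(quad @ w_quad)   # [0, 1, 1, 0]: exact, since XOR = a1 + a2 - 2*a1*a2
```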
33. Algorithms
- Rule Induction
- Linear Models (Discriminants)
- Instance-based (Nearest-neighbour)
34. Instance-based Representation
- Simplest form of learning: rote learning
- The training instances are searched for the instance that most closely resembles the new instance
- The instances themselves represent the knowledge
- Also called instance-based learning
- The similarity/distance function defines what is learned
- Instance-based learning is lazy learning
- Methods:
- nearest-neighbour
- k-nearest-neighbour
35. Distance Function
- Key to success (or failure): it defines what is learned
- Several numeric attributes: normally Euclidean distance is used, and attributes are normalized
- Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
- Ordinal attributes: distance depends on the order of values
- Are all attributes equally important?
- Weighting the attributes might be necessary
- Scale so that each attribute contributes (approximately) the same to the distance metric
36. Instance-based Learning
- Most instance-based schemes use Euclidean distance: for instances a(1) and a(2) with k attributes, d(a(1), a(2)) = sqrt( (a1(1) − a1(2))^2 + ... + (ak(1) − ak(2))^2 )
- Taking the square root is not required when comparing distances
- Another popular metric: the city-block (Manhattan) metric
- It adds the absolute differences without squaring them
- Why the name? Think of a city grid in two dimensions
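Both distance functions as runnable sketches (the function names are illustrative):

```python
import math

def euclidean(a1, a2):
    """Euclidean distance; note the squared sum alone ranks neighbours
    identically, so the sqrt can be skipped when only comparing."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def manhattan(a1, a2):
    """City-block metric: sum of absolute differences, no squaring."""
    return sum(abs(x - y) for x, y in zip(a1, a2))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```

The name "city-block" comes from the 7 blocks a taxi would drive on a grid (3 east, 4 north) versus the 5-unit straight-line distance.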
37. Normalization and Other Issues
- Different attributes are measured on different scales → they need to be normalized: ai = (vi − min vi) / (max vi − min vi), where vi is the actual value of attribute i; or standardize to zero mean and unit variance
- Nominal attributes: distance is either 0 or 1
- Common policy for missing values: assumed to be maximally distant (given normalized attributes). Completely ad hoc!
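The min-max formula above can be sketched directly (the helper name and the per-attribute-column data layout are assumptions):

```python
def normalize(columns):
    """columns: list of per-attribute value lists. Rescales each attribute
    to [0, 1] via ai = (vi - min vi) / (max vi - min vi), so every
    attribute contributes comparably to the distance function."""
    result = []
    for values in columns:
        lo, hi = min(values), max(values)
        result.append([(v - lo) / (hi - lo) for v in values])
    return result

print(normalize([[10, 20, 30], [1, 2, 3]]))  # both become [0.0, 0.5, 1.0]
```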
38. Discussion of 1-NN
- Often very accurate
- ... but slow in classification
- The simple version scans the entire training data to derive a prediction; tree data structures provide speed improvements
- Assumes all attributes are equally important
- Remedy: attribute selection or attribute weights
- Possible remedies against noisy instances:
- Take a majority vote over the k nearest neighbours
- Remove noisy instances from the dataset (difficult!)
- Statisticians have used k-NN since the early 1950s
- If n → ∞ and k/n → 0, the error approaches the minimum possible (the Bayes error)
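The majority-vote remedy above amounts to the classic k-NN classifier; a minimal sketch (function name and toy data are illustrative, and this is the slow full-scan version, not a tree-accelerated one):

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """k-nearest-neighbour classification: majority vote among the k
    training instances closest to the query (squared Euclidean distance,
    since sqrt doesn't change the ranking)."""
    neighbours = sorted(
        training,
        key=lambda inst: sum((x - y) ** 2 for x, y in zip(inst[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [([0.0, 0.0], "a"), ([0.1, 0.2], "a"), ([0.9, 1.0], "b"),
            ([1.0, 0.8], "b"), ([0.5, 0.4], "a")]
print(knn_predict(training, [0.95, 0.95], k=3))  # 'b'
```

With k = 1 this reduces to plain nearest-neighbour; larger k trades sensitivity to noise against blurring of class boundaries.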
39. Overview
- Compare rule induction and decision tree learning algorithms
- Understand some classifiers that don't represent data with rules
- Compare a range of benchmark learning algorithms