1
Classification Algorithms Continued
2
Overview
  • Compare rule induction and decision tree learning
    algorithms
  • Understand some classifiers that don't represent
    data with rules
  • Compare a range of benchmark learning algorithms

3
Algorithms
  • Rule Induction
  • Linear Models (Discriminants)
  • Instance-based (Nearest-neighbour)

4
Generating Rules
  • A decision tree can be converted into a rule set
  • Straightforward conversion: each path from the root
    to a leaf becomes a rule, but this makes an overly
    complex rule set
  • More effective conversions are not trivial
  • (e.g. C4.8 tests each node in the root-leaf path to
    see if it can be eliminated without loss of
    accuracy)
  • Instead, generate rules directly from the data:
    rule induction

5
Covering Algorithms
  • Strategy for generating a rule set directly: for
    each class in turn, find a rule set that covers all
    instances in it (excluding instances not in the
    class)
  • This approach is called a covering approach
    because at each stage a rule is identified that
    covers some of the instances

6
Example: Generating a Rule
7
Example: Generating a Rule II
8
Example: Generating a Rule III
9
Example: Generating a Rule IV
  • Possible rule set for class b

10
Rules vs. Trees
  • Corresponding decision tree (produces exactly
    the same predictions)
  • But rule sets may be easier to understand;
    decision trees suffer from replicated subtrees
  • Also, in multi-class situations, a covering
    algorithm concentrates on one class at a time,
    whereas a decision tree learner takes all classes
    into account. The covering algorithm is clearer.

11
A Simple Covering Algorithm
  • Generates a rule by adding tests that maximize
    the rule's accuracy
  • Similar to the decision tree's problem of selecting
    an attribute to split on
  • But a decision tree inducer maximizes overall
    purity and considers all branches
  • Each new test reduces the rule's coverage

12
Selecting a Test
  • Goal: maximize accuracy
  • t = total number of instances covered by the rule
  • p = number of positive examples of the class
    covered by the rule
  • t - p = number of errors made by the rule
  • Select the test that maximizes the ratio p/t
    (see the sketch below)
  • We are finished when p/t = 1 or the set of
    instances can't be split any further
  • We also want t to be large; some algorithms have
    heuristics to take this into account
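A minimal Python sketch of this selection step, assuming each instance is a plain dict of attribute values with a 'class' key (the representation and names are illustrative, not from the slides):

def best_test(instances, target_class, attributes):
    """Return the (attribute, value) test with the highest p/t;
    ties are broken in favour of larger coverage t."""
    best, best_ratio, best_t = None, -1.0, 0
    for attr in attributes:
        for value in {inst[attr] for inst in instances}:
            covered = [inst for inst in instances if inst[attr] == value]
            t = len(covered)                                   # instances covered by the test
            p = sum(inst['class'] == target_class for inst in covered)  # covered and correct
            ratio = p / t
            if ratio > best_ratio or (ratio == best_ratio and t > best_t):
                best, best_ratio, best_t = (attr, value), ratio, t
    return best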

13
Example: Contact Lens Data
  • Rule we seek:
  • Possible tests:

14
Modified Rule and Resulting Data
  • Rule with best test added
  • Instances covered by modified rule

15
Further Refinement
  • Current state
  • Possible tests

16
Modified Rule and Resulting Data
  • Rule with best test added
  • Instances covered by modified rule
  • Now you test spectacle_prescription

17
Further refinement
  • Current state
  • Possible tests
  • Tie between the first and the fourth test
  • We choose the one with greater coverage

18
Resulting Rule
  • Final rule
  • Second rule for recommending hard
    lenses (built from instances not covered by the
    first rule)
  • These two rules cover all hard lenses
  • Process is repeated with other two classes

19
Pseudo-code for PRISM
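A rough Python sketch of the PRISM procedure as described on the surrounding slides, reusing the best_test function sketched under slide 12 (the dict-based instance representation is an assumption, not from the slides):

def prism(instances, attributes, classes):
    rules = []
    for target in classes:                        # outer loop: one rule set per class
        remaining = list(instances)
        while any(inst['class'] == target for inst in remaining):
            conditions = {}                       # the rule's left-hand side
            covered = remaining
            # grow the rule until it is perfect or no attributes are left
            while (any(inst['class'] != target for inst in covered)
                   and len(conditions) < len(attributes)):
                unused = [a for a in attributes if a not in conditions]
                attr, value = best_test(covered, target, unused)   # maximize p/t
                conditions[attr] = value
                covered = [inst for inst in covered if inst[attr] == value]
            rules.append((dict(conditions), target))
            # separate out the instances covered by the new rule
            remaining = [inst for inst in remaining
                         if not all(inst[a] == v for a, v in conditions.items())]
    return rules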
20
Rules vs. Decision Lists
  • PRISM with the outer loop removed generates a
    decision list for one class
  • Subsequent rules are designed for instances that are
    not covered by previous rules (i.e. rule order
    matters)
  • Order doesn't matter for testing because all
    rules predict the same class, but it should affect
    pruning
  • The outer loop considers all classes separately
  • No order dependence between classes/rules is implied
  • Problems: overlapping rules, uncovered examples
    (a default rule is required)

21
Separate and Conquer
  • Methods like PRISM (for dealing with one class)
    are separate-and-conquer algorithms
  • First, a rule is identified
  • Then, all instances covered by the rule are
    separated out
  • Finally, the remaining instances are conquered
  • Difference from divide-and-conquer methods:
    the subset covered by a rule doesn't need to be
    explored any further

22
Rule Induction Algorithms
  • Common procedure: separate-and-conquer
  • Differences:
  • Search method (e.g. greedy, beam search, ...)
  • Test selection criteria (e.g. accuracy, ...)
  • Pruning method (e.g. MDL, hold-out set, ...)
  • Stopping criterion (e.g. minimum accuracy)
  • Post-processing step
  • Also: a decision list over all classes vs. one
    rule set for each class

23
Algorithms
  • Rule Induction
  • Linear Models (Discriminants)
  • Instance-based (Nearest-neighbour)

24
Linear Models
  • Work most naturally with numeric attributes
  • Standard technique for numeric prediction: linear
    regression
  • Outcome is a linear combination of the attributes:
    x = w0*a0 + w1*a1 + ... + wk*ak, where a0 = 1
    (called the bias)
  • Weights are calculated from the training data
  • Predicted value for the first training instance a(1):
    w0*a0(1) + w1*a1(1) + ... + wk*ak(1)
25
Minimizing the Squared Error
  • Choose the k + 1 coefficients to minimize the
    squared error on the training data
  • Squared error: sum over all training instances of
    (observed value minus model output)^2, i.e.
    sum_i ( x(i) - sum_j wj*aj(i) )^2
  • Compute the coefficients using standard matrix
    operations (the pseudo-inverse), a fast process
    (see the sketch below)
  • Can be done if there are more instances than
    attributes
  • Minimizing the absolute error is more difficult

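A minimal NumPy sketch of this computation with illustrative data, assuming a design matrix A whose first column is the bias attribute a0 = 1:

import numpy as np

# Rows of A are training instances; the first column is the bias attribute a0 = 1.
A = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 5.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 6.0]])
x = np.array([10.0, 12.0, 9.0, 16.0])     # observed numeric outputs

w = np.linalg.pinv(A) @ x                  # minimizes the squared error ||A w - x||^2
predictions = A @ w                        # model output for the training instances
print(w, predictions)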
26
Regression for Classification
  • What is regression?
  • Any regression technique can be used for
    classification
  • Training: perform a regression for each class,
    setting the output to 1 for training instances
    that belong to the class and 0 for those that
    don't (called 1-of-c coding)
  • Prediction: predict the class corresponding to the
    model with the largest output value (its membership
    value)
  • For linear regression this is known as
    multi-response linear regression (see the sketch
    below)
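A minimal NumPy sketch of multi-response linear regression with 1-of-c coding (the data and array names are illustrative):

import numpy as np

A = np.array([[1.0, 0.5, 1.2],
              [1.0, 2.1, 0.3],
              [1.0, 1.9, 0.4],
              [1.0, 0.4, 1.5]])            # bias column plus two attributes
labels = np.array([0, 1, 1, 0])            # two classes, coded 0 and 1

targets = np.eye(2)[labels]                # 1-of-c coding: one 0/1 column per class
W = np.linalg.pinv(A) @ targets            # one regression (weight column) per class
scores = A @ W                             # membership value for each class
predicted = scores.argmax(axis=1)          # class with the largest output wins
print(predicted)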

27
Theoretical justification
  • The scheme minimizes the expected squared error
    between the model's output for an instance and the
    observed target value (either 0 or 1)
  • This expected squared error decomposes into the
    squared difference between the model output and the
    true class probability, plus a constant that does
    not depend on the model
  • So minimizing the squared error on 0/1 targets
    pushes the model output towards an estimate of the
    true class probability, which is what we want
28
Pairwise Regression
  • Another way of using regression for
    classification
  • Learn a regression function for every pair of
    classes, using only the instances from those two
    classes
  • Assign an output of 1 to one member of the pair and
    -1 to the other (or 1 and 0)
  • Prediction is done by voting
  • The class that receives the most votes is predicted
  • Alternative: output "don't know" if there is no
    agreement
  • More likely to be accurate, but more expensive
  • The basic idea of building a classifier for pairs of
    classes can be used with any model (see the sketch
    below)
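A rough NumPy sketch of pairwise regression with voting (the array names and the 1/0 coding of each pair are illustrative assumptions):

from itertools import combinations
import numpy as np

def pairwise_fit(A, labels, classes):
    """Fit one least-squares regression per pair of classes."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = np.isin(labels, [c1, c2])
        target = (labels[mask] == c1).astype(float)    # 1 for c1, 0 for c2
        models[(c1, c2)] = np.linalg.pinv(A[mask]) @ target
    return models

def pairwise_predict(a, models):
    votes = {}
    for (c1, c2), w in models.items():
        winner = c1 if a @ w >= 0.5 else c2            # threshold the pairwise output
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)                   # class with most votes wins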

29
Logistic Regression
  • Problem: some assumptions are violated when linear
    regression is applied to classification problems
    (it assumes Gaussian conditional noise); we really
    want outputs that estimate class probabilities
  • Logistic regression: an alternative to linear
    regression
  • Designed for classification problems
  • Estimates class probabilities directly using the
    maximum-likelihood method
  • Uses this generalised linear model:

P = class probability
30
Logistic Regression II
p = 1 / (1 + exp(-y)), where y is the linear model
output (see the sketch below)
  • Still has linear decision boundaries, but the
    probabilistic outputs fit into a more principled
    framework
  • Opens up the path for the application of Bayesian
    techniques: complexity control, selection of
    inputs, handling missing data, ...
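A minimal NumPy sketch of the logistic model and a simple maximum-likelihood fit by gradient ascent (the fitting loop is an illustrative assumption, not a method given on the slide):

import numpy as np

def class_probability(a, w):
    y = a @ w                                  # linear model output
    return 1.0 / (1.0 + np.exp(-y))            # p = 1 / (1 + exp(-y))

def fit_logistic(A, labels, steps=2000, lr=0.1):
    """labels is a 0/1 array; w is fitted by maximizing the log-likelihood."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A @ w)))
        w += lr * A.T @ (labels - p) / len(labels)   # gradient of the log-likelihood
    return w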

31
Discussion of Linear Models
  • Not appropriate if the data exhibits non-linear
    dependencies
  • But can serve as building blocks for more
    complex schemes (e.g. model trees)
  • Example: multi-response linear discriminants
    define a hyperplane between any two given
    classes
  • Logistic regression and linear discriminants both
    give linear decision boundaries: excellent
    benchmarks

32
Comments on Basic Methods
  • Minsky and Papert (1969) showed that linear
    classifiers have limitations, e.g. they can't learn
    XOR
  • But combinations of them can (non-linearities →
    neural nets)
  • Can also include pre-computed non-linear terms
    (e.g. quadratic), as in the sketch below
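A small NumPy sketch of the point: a least-squares linear model with an added product term a1*a2 reproduces XOR exactly (illustrative, not from the slides):

import numpy as np

a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0], dtype=float)           # XOR targets

A = np.column_stack([np.ones(4), a, a[:, 0] * a[:, 1]]) # bias, a1, a2, a1*a2
w = np.linalg.pinv(A) @ labels                           # least-squares fit
print((A @ w).round(2))   # outputs 0, 1, 1, 0 -> XOR is now learnable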

33
Algorithms
  • Rule Induction
  • Linear Models (Discriminants)
  • Instance-based (Nearest-neighbour)

34
Instance-based Representation
  • Simplest form of learning: rote learning
  • The training instances are searched for the instance
    that most closely resembles the new instance
  • The instances themselves represent the knowledge
  • Also called instance-based learning
  • The similarity/distance function defines what's
    learned
  • Instance-based learning is lazy learning
  • Methods
  • nearest-neighbour
  • k-nearest-neighbour

35
Distance Function
  • Key to success (or failure): it defines what's
    learned
  • Several numeric attributes: normally, Euclidean
    distance is used and the attributes are normalized
  • Nominal attributes: distance is set to 1 if the
    values are different, 0 if they are equal
  • Ordinal attributes: distance depends on the order
    of the values
  • Are all attributes equally important?
  • Weighting the attributes might be necessary
  • Scale so that each attribute contributes
    (approximately) the same to the distance metric

36
Instance-based Learning
  • Most instance-based schemes use Euclidean
    distance
  • a(1) and a(2): two instances with k attributes;
    distance = sqrt( (a1(1) - a1(2))^2 + ... + (ak(1) - ak(2))^2 )
  • Taking the square root is not required when
    comparing distances
  • Other popular metric: the city-block (Manhattan)
    metric
  • Adds the absolute differences without squaring them
  • Why the name? (Think of a city grid in two
    dimensions; see the sketch below)
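A minimal Python sketch of these distance functions, assuming instances are equal-length lists of already-normalized numeric attribute values:

from math import sqrt

def euclidean(a1, a2):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def manhattan(a1, a2):
    return sum(abs(x - y) for x, y in zip(a1, a2))     # city-block: no squaring

def squared_euclidean(a1, a2):
    return sum((x - y) ** 2 for x, y in zip(a1, a2))   # square root skipped for ranking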

37
Normalization and Other Issues
  • Different attributes are measured on different
    scales → they need to be normalized
  • vi = the actual value of attribute i; normalized
    value ai = (vi - min vi) / (max vi - min vi), or
    standardize using the mean and standard deviation
    (see the sketch below)
  • Nominal attributes: distance is either 0 or 1
  • Common policy for missing values: assumed to be
    maximally distant (given normalized attributes).
    Completely ad hoc!
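A minimal Python sketch of the two normalization options, applied to one numeric attribute across the training set (the function names are illustrative):

def normalize(values):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def standardize(values):
    """Alternative: subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in values]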

38
Discussion of 1-NN
  • Often very accurate
  • ... but slow in classification
  • The simple version scans the entire training data to
    derive a prediction; tree data structures provide
    speed improvements
  • Assumes all attributes are equally important
  • Remedy: attribute selection or attribute weights
  • Possible remedies against noisy instances:
  • Take a majority vote over the k nearest
    neighbours (see the sketch below)
  • Remove noisy instances from the dataset
    (difficult!)
  • Statisticians have used k-NN since the early 1950s
  • If n → ∞ and k/n → 0, the error approaches the
    minimum possible (the Bayes error)
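A minimal Python sketch of k-nearest-neighbour classification by majority vote (the list-of-pairs representation is an illustrative assumption):

from collections import Counter

def squared_distance(a1, a2):
    return sum((x - y) ** 2 for x, y in zip(a1, a2))

def knn_predict(query, training, k=3):
    """training is a list of (attribute_vector, class_label) pairs."""
    neighbours = sorted(training,
                        key=lambda item: squared_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]        # majority class among the k neighbours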

39
Overview
  • Compare rule induction and decision tree learning
    algorithms
  • Understand some classifiers that don't represent
    data with rules
  • Compare a range of benchmark learning algorithms