Transcript and Presenter's Notes

Title: LEARNING FROM NOISY DATA


1
LEARNING FROM NOISY DATA
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
  • Ivan Bratko
  • University of Ljubljana
  • Slovenia

Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides
2
Overview
  • Learning from noisy data
  • Idea of tree pruning
  • How to prune optimally
  • Methods for tree pruning
  • Estimating probabilities

3
Learning from Noisy Data
  • Sources of noise
  • Errors in measurements, errors in data encoding,
    errors in examples, missing values
  • Problems
  • Complex hypothesis
  • Poor comprehensibility
  • Overfitting: the hypothesis overfits the data
  • Low classification accuracy on new data

4
Fitting data
[Figure: data points plotted as y versus x]
What is the relation between x and y, y = y(x)? How can we predict y from x?
5
Overfitting data
[Figure: a curve passing through every training point]
Makes no error on the training data! But how about predicting new cases?
What is the relation between x and y, y = y(x)? How can we predict y from x?
6
Overfitting in Extreme
  • Let default accuracy be the probability of the majority class
  • Overfitting may result in accuracy lower than default
  • Example
  • Attributes have no correlation with the class (i.e., 100% noise)
  • Two classes c1, c2
  • Class probabilities p(c1) = 0.7, p(c2) = 0.3
  • Default accuracy = 0.7

7
Overfitting in Extreme
Decision tree with one example per leaf:
[Figure: a fully grown tree; leaves predicting c1 have accuracy 0.7, leaves predicting c2 have accuracy 0.3]
Expected accuracy = 0.7 x 0.7 + 0.3 x 0.3 = 0.58
0.58 < 0.7 (a short check of this arithmetic follows below)
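A minimal sketch of that check; the class probabilities are the ones from the example above:

```python
# With attributes that carry no information, each class c_i is predicted with
# probability p(c_i), and such a prediction is correct with probability p(c_i),
# so the expected accuracy is the sum of p(c_i)^2.
p = [0.7, 0.3]
expected_accuracy = sum(pi * pi for pi in p)
print(expected_accuracy)   # 0.58, below the default accuracy of 0.7
```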
8
Pruning of Decision Trees
  • Means of handling noise in tree learning
  • After pruning the accuracy on previously unseen
    examples may increase

9
Typical Example from Practice: Locating Primary Tumor
  • Data set
  • 20 classes
  • Default classifier accuracy: 24.7%

10
Effects of Pruning
[Figure: accuracy vs. degree of pruning for the "credit" data set]
11
Effects of Pruning
[Figure: for the "glass" data set, accuracy on the training set and accuracy on the test set plotted against tree size, from bigger trees to smaller trees]
12
How to Prune Optimally?
  • Main questions
  • How much pruning?
  • Where to prune?
  • Large number of candidate pruned trees!
  • Typical relation between tree size and accuracy on new data
  • Main difficulty in pruning: this curve is not known!

[Figure: typical curve of accuracy as a function of tree size on new data]
13
Two Kinds of Pruning
Pre-pruning (forward pruning)
Post-pruning
14
Forward Pruning
  • Stop expanding trees if benefits of potential
    sub-trees seem dubious
  • Information gain low
  • Number of examples very small
  • Example set statistically insignificant
  • Etc.

15
Forward Pruning Is Inferior
  • Myopic
  • Depends on parameters which are hard (impossible?) to guess
  • Example

[Figure: example in the (x1, x2) plane with thresholds a and b]
16
Pre and Post Pruning
  • Forward pruning considered inferior and myopic
  • Post pruning makes use of sub-trees and in this
    way reduces the complexity

17
Post pruning
  • Main idea: prune unreliable parts of the tree
  • Outline of the pruning procedure
  • start at the bottom of the tree, proceed upward
  • that is, prune unreliable subtrees
  • Main questions
  • How do we know whether a subtree is unreliable?
  • Will accuracy improve after pruning?

18
Estimating accuracy of a subtree
  • One idea: use a special test data set (pruning set)
  • This is OK if a sufficient amount of learning data is available
  • In case of a shortage of data: try to estimate accuracy directly from the learning data

19
Partitioning data in tree learning
  • All available data = training set + test set
  • Training set = growing set + pruning set
  • Typical proportions (a splitting sketch follows below)
  • training set 70%, test set 30%
  • growing set 70%, pruning set 30%
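A minimal sketch of this partitioning, assuming scikit-learn's train_test_split and synthetic stand-in data; the slides do not prescribe any particular tool:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "all available data" (an assumption).
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, size=1000)

# 70% training set / 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Within the training set: 70% growing set / 30% pruning set
X_grow, X_prune, y_grow, y_prune = train_test_split(X_train, y_train, test_size=0.3)
```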

20
Estimating accuracy with a pruning set
  • Accuracy of a hypothesis on new data
    = probability of correct classification of a new example
  • Accuracy of a hypothesis on new data
    ≈ proportion of correctly classified examples in the pruning set
  • Error of a hypothesis
    = probability of misclassification of a new example
  • Drawback of using a pruning set: less data for the growing set

21
Reduced error pruning (Quinlan 87)
  • Use the pruning set to estimate the accuracy of sub-trees and the accuracy at individual nodes
  • Let T be a sub-tree rooted at node v
  • [Figure: node v with sub-tree T below it]
  • Define the gain from pruning at v:
    gain = misclassifications in T - misclassifications at v

22
Reduced error pruning
  • Repeat
  •   prune at the node with the largest gain
  • until only negative-gain nodes remain
  • Bottom-up restriction: T can only be pruned if it does not contain a sub-tree with lower error than T (a sketch of a bottom-up variant follows below)
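A minimal sketch of a bottom-up variant of reduced error pruning. The Node structure, the dictionary representation of examples, and the assumption that every node stores its majority class are illustrative choices, not the slides' implementation:

```python
class Node:
    """A decision tree node; leaves have no children."""
    def __init__(self, test=None, children=None, majority_class=None):
        self.test = test                    # attribute tested at this node
        self.children = children or {}      # attribute value -> subtree
        self.majority_class = majority_class

    def is_leaf(self):
        return not self.children

def classify(node, example):
    """Follow the tree until a leaf (or an unseen attribute value) is reached."""
    while not node.is_leaf():
        child = node.children.get(example.get(node.test))
        if child is None:
            break
        node = child
    return node.majority_class

def rep(node, pruning_examples):
    """Bottom-up reduced error pruning on the pruning examples reaching node.
    Children are pruned first (the bottom-up restriction); then the node is
    turned into a leaf if that does not increase the error on the pruning set."""
    if node.is_leaf() or not pruning_examples:
        return node
    # Route the pruning examples to the children and prune the children first.
    parts = {v: [] for v in node.children}
    for x, y in pruning_examples:
        v = x.get(node.test)
        if v in parts:
            parts[v].append((x, y))
    node.children = {v: rep(c, parts[v]) for v, c in node.children.items()}
    # Gain from pruning = misclassifications of the subtree - misclassifications as a leaf.
    subtree_errors = sum(1 for x, y in pruning_examples if classify(node, x) != y)
    leaf_errors = sum(1 for _, y in pruning_examples if y != node.majority_class)
    if leaf_errors <= subtree_errors:
        node.children = {}                  # prune: the node becomes a leaf
    return node
```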

23
Reduced error pruning
  • Theorem (Esposito, Malerba, Semeraro 1997)
  • REP with the bottom-up restriction finds the smallest among the most accurate sub-trees w.r.t. the pruning set.

24
Minimal Error Pruning (MEP) (Niblett and Bratko 86; Cestnik and Bratko 91)
  • Does not require a pruning set for estimating error
  • Estimates error on new data directly from the growing set, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate)
  • Main principle: prune so that the estimated classification error is minimal

25
Minimal Error Pruning
  • Deciding about pruning at node v of a tree T
  • [Figure: node v with branches of probabilities p1, p2, ... leading to sub-trees T1, T2, ...]
  • E(T) = error of the optimally pruned tree T

26
Static and backed-up errors
  • Define the static error at v:
    e(v) = p( class ≠ C | v )
    where C is the most likely class at v
  • If T is pruned at v, then its error is e(v).
  • If T is not pruned at v, then its (backed-up) error is
    p1 E(T1) + p2 E(T2) + ...

27
Minimal error pruning
  • Decision whether to prune or not
  • Prune if static error ≤ backed-up error
  • E(T) = min( e(v), Σi pi E(Ti) )  (see the sketch below)
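A minimal sketch of this rule. It assumes each node stores the number of examples and the number of majority-class examples from the growing set; the m-estimate helper anticipates the next slides:

```python
def m_error(n_majority, n_total, prior, m):
    """Static error e(v): 1 minus the m-estimate of the majority class probability."""
    return 1.0 - (n_majority + prior * m) / (n_total + m)

def mep(node, prior, m):
    """Minimal error pruning: return E(T), the error of the optimally pruned
    subtree rooted at node, pruning whenever the static error does not exceed
    the backed-up error."""
    static = m_error(node.n_majority, node.n_total, prior, m)
    if node.is_leaf():
        return static
    # Backed-up error: sum_i p_i * E(T_i), with p_i estimated as N_i / N.
    backed_up = sum((child.n_total / node.n_total) * mep(child, prior, m)
                    for child in node.children.values())
    if static <= backed_up:
        node.children = {}      # prune at this node; its error is the static error
        return static
    return backed_up
```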

28
Minimal error pruning
  • Main question
  • How to estimate the static errors e(v)?
  • Use the Laplace or m-estimate of probability
  • At a node v
  • N = number of examples
  • nC = number of majority class examples

29
Laplace probability estimate
  • pC = ( nC + 1 ) / ( N + k )
  • where k is the number of classes.
  • Problems with Laplace
  • Assumes all classes are a priori equally likely
  • Degree of pruning depends on the number of classes

30
m-estimate of probability
  • pC = ( nC + pCa m ) / ( N + m )
  • where
  • pCa = a priori probability of class C
  • m is a non-negative parameter tuned by an expert (both estimates are illustrated in the snippet below)
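Both estimates as small helper functions (a sketch; the example numbers are illustrative only):

```python
def laplace_estimate(n_c, n, k):
    """Laplace estimate: (n_c + 1) / (N + k), where k is the number of classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_prior, m):
    """m-estimate: (n_c + p_prior * m) / (N + m)."""
    return (n_c + p_prior * m) / (n + m)

# A node with 2 examples, both of the majority class, 2 classes,
# prior probability of that class 0.7, and m = 2:
print(laplace_estimate(2, 2, 2))    # 0.75
print(m_estimate(2, 2, 0.7, 2.0))   # 0.85
```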

31
m-estimate
  • Important points
  • Takes into account prior probabilities
  • Pruning is not sensitive to the number of classes
  • Varying m gives a series of differently pruned trees
  • Choice of m depends on confidence in the data

32
m-estimate in pruning
  • Choice of m
  • Low noise → low m → little pruning
  • High noise → high m → much pruning
  • Note: using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.

33
Some other pruning methods
  • Error-complexity pruning, Breiman et al. 84
    (CART)
  • Pessimistic error pruning, Quinlan 87
  • Error-based pruning, Quinlan 93 (C4.5)

34
Error-complexity pruning (Breiman et al. 1984, program CART)
  • Considers
  • Error rate on the "growing" set
  • Size of the tree
  • Error rate on the "pruning" set
  • Minimise error and complexity, i.e. find a compromise between error and size

35
  • A sub-tree T with root v
  • [Figure: node v with sub-tree T below it]
  • R(v) = errors on the "growing" set at node v
  • R(T) = errors on the "growing" set of tree T
  • NT = number of leaves in T
  • Total cost = Error cost + Complexity cost
  • Total cost = R + α N

36
Error complexity cost
  • Total cost = Error cost + Complexity cost
  • Total cost = R + α N
  • α = complexity cost per leaf

37
Pruning at v
  • Cost of T (T unpruned) = R(T) + α NT
  • Cost of v (T pruned at v) = R(v) + α
  • When the costs of T and v are equal:
    α = ( R(v) - R(T) ) / ( NT - 1 )
  • α = reduction of error per leaf

38
Pruning algorithm
  • Compute α for each node in the unpruned tree
  • Repeat
  •   prune the sub-tree with the smallest α
  • until only the root is left
  • This gives a series of increasingly pruned trees; estimate their accuracy (a sketch of this loop follows below)
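A minimal sketch of this loop. It assumes each node stores R(v), its error count on the growing set, in an errors field, and that internal nodes have at least two leaves below them; this is an illustration, not CART's implementation:

```python
import copy

def leaves(node):
    if node.is_leaf():
        return [node]
    return [l for c in node.children.values() for l in leaves(c)]

def alpha(node):
    """Reduction of error per leaf when the subtree rooted at node is pruned:
    alpha = ( R(v) - R(T) ) / ( NT - 1 )."""
    r_t = sum(l.errors for l in leaves(node))        # R(T)
    return (node.errors - r_t) / (len(leaves(node)) - 1)

def internal_nodes(node):
    if node.is_leaf():
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def cost_complexity_sequence(root):
    """Repeatedly prune the sub-tree with the smallest alpha, recording the
    series of increasingly pruned trees."""
    sequence = [copy.deepcopy(root)]
    while not root.is_leaf():
        weakest = min(internal_nodes(root), key=alpha)
        weakest.children = {}                        # prune at the weakest link
        sequence.append(copy.deepcopy(root))
    return sequence
```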

39
Selecting the best pruned tree
  • Finally, select the "best" tree from this series
  • Select the smallest tree within 1 standard error of the minimum error (1-SE rule)
  • Standard error = sqrt( Rmin (1 - Rmin) / number of examples )  (see the selection sketch below)
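A minimal sketch of the 1-SE selection, reusing the classify helper from the reduced error pruning sketch above and reading "number of examples" as the size of the set used for the error estimate:

```python
import math

def error_rate(tree, examples):
    return sum(1 for x, y in examples if classify(tree, x) != y) / len(examples)

def select_1se(pruned_trees, pruning_set):
    """Select the smallest (most pruned) tree whose error is within one
    standard error of the minimum error on the pruning set."""
    rates = [error_rate(t, pruning_set) for t in pruned_trees]
    r_min = min(rates)
    se = math.sqrt(r_min * (1 - r_min) / len(pruning_set))
    # pruned_trees are assumed ordered from least to most pruned,
    # so the last acceptable tree is the smallest one.
    acceptable = [t for t, r in zip(pruned_trees, rates) if r <= r_min + se]
    return acceptable[-1]
```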

40
Comments
  • Note: cost-complexity pruning limits selection to a subset of all possible pruned trees.
  • Consequence: the best pruned tree may be missed
  • Two ways of estimating error on new data:
  • (a) using a pruning set
  • (b) using cross-validation, in a rather complicated way

41
Comments
  • 1-SE rule tends to overprune
  • Simply choosing min. error tree ("0-SE rule")
    performs better in experiments
  • Error estimate with cross validation is
    complicated and based on a debatable assumption

42
Selecting best tree
  • Using pruning set
  • Measure error of candidate pruned trees on
    pruning set
  • Select the smallest tree within 1 standard error
    of minimum error.

43
Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)
  • Experiments with 14 data sets from the UCI repository
  • Results: does pruning improve accuracy?
  • Generally yes
  • But the effects of pruning also depend on the domain
  • In most domains pruning improves accuracy, in some it does not, and in very few it worsens accuracy

44
Pruning in rule learning
  • Ideas from pruning decision trees can be adapted
    to the learning of if-then rules
  • Pre-pruning and post-pruning can be combined and
    reduced error pruning idea applies
  • Fürnkranz (1997) reviews several approaches and evaluates them experimentally

45
Estimating Probabilities
  • Setup
  • n experiments (n = r + s)
  • r successes
  • s failures
  • How likely is it that the next experiment will be a success?
  • Estimate with relative frequency: p = r / n

46
Relative Frequency
  • Works when we have many experiments, but not with small samples
  • Consider flipping a coin
  • we flip a coin twice, and both times it comes up heads
  • what is the probability of heads in the next flip?
  • A probability of 1.0 (= 2/2) seems unreasonable

47
Coins and mushrooms
  • Probability of heads?
  • Probability that a mushroom is edible?
  • Make one, two, ... experiments
  • Interpret the results in terms of probability
  • Relative frequency does not work well

48
Coins and mushrooms
  • We need to consider prior expectations
  • A prior probability of 1/2 is, in both cases, not unreasonable
  • But is this enough?
  • Intuition says our probability estimates for coins and mushrooms should still be different
  • The difference lies in the prior probability distribution
  • What are sensible prior distributions for coins and for mushrooms?

49
Bayesian Procedure for Estimating Probabilities
  • Assume initial probability distribution (prior
    distribution)
  • Based on some evidence E, update this
    distribution to obtain posterior distribution
  • Compute the expected value over posterior
    distribution. Variance of posterior distribution
    is related to certainty of this estimate

50
Bayes Formula
  • P( H | E ) = P( H ) P( E | H ) / P( E )
  • The Bayesian process takes a prior probability and combines it with new evidence to obtain an updated (posterior) probability

51
Bayes in estimating probabilities
  • The form of the hypothesis H is
    P(event) = x
  • So
    P( H | E ) = P( P(event) = x | E )
  • That is, the probability that the probability is x
  • May appear confusing!

52
Bayes update of probability
  • P( P(event) = x | E )
    = P( P(event) = x ) P( E | P(event) = x ) / P( E )

[Figure: prior probability density updated by the evidence E into the posterior probability density]
53
Expected probability
  • Expected value of the probability of the event:
    P( event | E ) = integral over [0, 1] of x weighted by the posterior probability density

54
Bayes update of probability
  • Prior prob. distribution
  • Bayes update with evidence E
  • Posterior prob. distribution
  • Expected value, variance

55
Update of density: Example
[Figure: a uniform prior density on [0, 1] and the posterior density on [0, 1] after the Bayes update]

56
Choice of Prior Distribution: Beta Distribution Beta(a, b)
57
Bayesian Update of Beta Distributions
  • Let the prior distribution be Beta(a, b)
  • Assume experimental results
    s successful outcomes
    f failures
  • The updated distribution is Beta(a+s, b+f)
  • Beta probability distributions have a nice mathematical property: the Bayes update of a Beta distribution is again a Beta distribution (see the snippet below)
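A minimal sketch of this update using scipy.stats (an assumption; the slides only state the conjugacy property), with the two-heads coin example from earlier:

```python
from scipy.stats import beta

a, b = 2.0, 2.0                  # prior Beta(a, b), centred at 1/2
s, f = 2, 0                      # evidence: two flips, both heads

posterior = beta(a + s, b + f)   # updated distribution Beta(a+s, b+f)
print(posterior.mean())          # expected probability of heads: (a+s)/(a+b+s+f) = 2/3
print(posterior.var())           # the variance reflects the remaining uncertainty
```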
58
m-estimate of probability
  • Cestnik, 1991
  • Replace the parameters a and b with m and pa
  • pa is the prior probability
  • Assume N experiments with n positive outcomes

59
m-estimate of probability
  • p = ( n + m pa ) / ( N + m )
    = ( N / (N + m) ) (n / N) + ( m / (N + m) ) pa
  • i.e. a weighted combination of the relative frequency n/N and the prior probability pa
60
Choosing the prior probability distribution
  • If we know the prior probability and variance, this determines a, b and m, pa (the conversion between the two parameterisations is sketched below)
  • A domain expert can choose the prior distribution, defined by either a, b or m, pa
  • m, pa may be more practical than a, b
  • The expert hopefully has some idea about pa and m
  • low variance, more confidence in pa → large m
  • high variance, less confidence in pa → small m
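A small sketch of that conversion, using the standard correspondence m = a + b and pa = a / (a + b) obtained by matching the Beta posterior mean with the m-estimate (this relation is implied by, but not spelled out on, the slides):

```python
def beta_to_m(a, b):
    """Beta(a, b) prior -> m-estimate parameters (m, pa)."""
    return a + b, a / (a + b)

def m_to_beta(m, pa):
    """m-estimate parameters (m, pa) -> Beta prior parameters (a, b)."""
    return pa * m, (1 - pa) * m

print(beta_to_m(1, 1))       # (2, 0.5): the Laplace estimate for two classes
print(m_to_beta(2.0, 0.7))   # (1.4, 0.6)
```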

61
Laplace Probability Estimate
  • For a problem with two outcomes
  • Assumes a prior probability distribution of Beta(1, 1)
  • Also equals the m-estimate with pa = 1/k and m = k, where k = 2

62
Using domain knowledge to improve accuracy
  • If domain-specific knowledge is available prior to learning, it may provide useful additional constraints
  • Additional constraints may alleviate problems with noise
  • One approach is Q2 learning, which uses qualitative constraints in numerical learning

63
Q2 Learning (Vladušič, Šuc and Bratko 2004)
  • Q2 learning: Qualitatively faithful Quantitative learning
  • Learning from numerical data is guided by qualitative constraints
  • The resulting numerical model fits the learning data numerically and respects the given qualitative model
  • The qualitative model can be provided by a domain expert, or induced from data

64
Qualitative difficulties of numerical learning
[Figure: a water tank with level h and an outflow; the level h plotted against time t]
  • Learn the time behavior of the water level
  • h = f( t, initial_outflow )

65
Predicting water level with M5
[Figure: predicted water level over time for initial_outflow = 12.5, 11.25, 10.0, 8.75, 7.5, 6.25]
66
Predicting water level with Q2
[Figure: Q2 predictions compared with the true values]