Transcript and Presenter's Notes

Title: LEARNING FROM NOISY DATA


1
LEARNING FROM NOISY DATA
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
  • Ivan Bratko
  • University of Ljubljana
  • Slovenia

Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides
2
Overview
  • Learning from noisy data
  • Idea of tree pruning
  • How to prune optimally
  • Methods for tree pruning
  • Estimating probabilities

3
Learning from Noisy Data
  • Sources of noise
  • Errors in measurements, errors in data encoding,
    errors in examples, missing values
  • Problems
  • Complex hypothesis
  • Poor comprehensibility
  • Overfitting: the hypothesis overfits the data
  • Low classification accuracy on new data

4
Fitting data
[Figure: data points plotted as y versus x]
What is the relation between x and y, y = y(x)? How can we predict y from x?
5
Overfitting data
[Figure: a curve passing through every training point]
Makes no error on the training data! But how about predicting new cases?
What is the relation between x and y, y = y(x)? How can we predict y from x?
6
Overfitting in Extreme
  • Let default accuracy be the probability of the majority class
  • Overfitting may result in accuracy lower than default
  • Example
  • Attributes have no correlation with the class (i.e., 100% noise)
  • Two classes c1, c2
  • Class probabilities p(c1) = 0.7, p(c2) = 0.3
  • Default accuracy = 0.7

7
Overfitting in Extreme
Decision tree with one example per leaf:
[Figure: a fully grown tree; leaves predicting c1 have accuracy 0.7, leaves predicting c2 have accuracy 0.3]
Expected accuracy = 0.7 x 0.7 + 0.3 x 0.3 = 0.58
0.58 < 0.7 (a short check of this arithmetic follows below)
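A minimal sketch of that check; the class probabilities are the ones from the example above:

```python
# With attributes that carry no information, each class c_i is predicted with
# probability p(c_i), and such a prediction is correct with probability p(c_i),
# so the expected accuracy is the sum of p(c_i)^2.
p = [0.7, 0.3]
expected_accuracy = sum(pi * pi for pi in p)
print(expected_accuracy)   # 0.58, below the default accuracy of 0.7
```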
8
Pruning of Decision Trees
  • Means of handling noise in tree learning
  • After pruning the accuracy on previously unseen
    examples may increase

9
Typical Example from Practice: Locating Primary Tumor
  • Data set
  • 20 classes
  • Default classifier accuracy: 24.7%

10
Effects of Pruning
[Figure: accuracy vs. degree of pruning for the "credit" data set]
11
Effects of Pruning
[Figure: for the "glass" data set, accuracy on the training set and accuracy on the test set plotted against tree size, from bigger trees to smaller trees]
12
How to Prune Optimally?
  • Main questions
  • How much pruning?
  • Where to prune?
  • Large number of candidate pruned trees!
  • Typical relation between tree size and accuracy on new data
  • Main difficulty in pruning: this curve is not known!

[Figure: typical curve of accuracy as a function of tree size on new data]
13
Two Kinds of Pruning
Pre-pruning (forward pruning)
Post-pruning
14
Forward Pruning
  • Stop expanding trees if benefits of potential
    sub-trees seem dubious
  • Information gain low
  • Number of examples very small
  • Example set statistically insignificant
  • Etc.

15
Forward Pruning Is Inferior
  • Myopic
  • Depends on parameters which are hard (impossible?) to guess
  • Example

[Figure: example in the (x1, x2) plane with thresholds a and b]
16
Pre and Post Pruning
  • Forward pruning considered inferior and myopic
  • Post pruning makes use of sub-trees and in this
    way reduces the complexity

17
Post pruning
  • Main idea: prune unreliable parts of the tree
  • Outline of the pruning procedure
  • start at the bottom of the tree, proceed upward
  • that is, prune unreliable subtrees
  • Main questions
  • How do we know whether a subtree is unreliable?
  • Will accuracy improve after pruning?

18
Estimating accuracy of a subtree
  • One idea: use a special test data set (pruning set)
  • This is OK if a sufficient amount of learning data is available
  • In case of a shortage of data: try to estimate accuracy directly from the learning data

19
Partitioning data in tree learning
  • All available data = training set + test set
  • Training set = growing set + pruning set
  • Typical proportions (a splitting sketch follows below)
  • training set 70%, test set 30%
  • growing set 70%, pruning set 30%
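A minimal sketch of this partitioning, assuming scikit-learn's train_test_split and synthetic stand-in data; the slides do not prescribe any particular tool:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "all available data" (an assumption).
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, size=1000)

# 70% training set / 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Within the training set: 70% growing set / 30% pruning set
X_grow, X_prune, y_grow, y_prune = train_test_split(X_train, y_train, test_size=0.3)
```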

20
Estimating accuracy with a pruning set
  • Accuracy of a hypothesis on new data
    = probability of correct classification of a new example
  • Accuracy of a hypothesis on new data
    ≈ proportion of correctly classified examples in the pruning set
  • Error of a hypothesis
    = probability of misclassification of a new example
  • Drawback of using a pruning set: less data for the growing set

21
Reduced error pruning (Quinlan 87)
  • Use the pruning set to estimate the accuracy of sub-trees and the accuracy at individual nodes
  • Let T be a sub-tree rooted at node v
  • [Figure: node v with sub-tree T below it]
  • Define the gain from pruning at v:
    gain = misclassifications in T - misclassifications at v

22
Reduced error pruning
  • Repeat
  •   prune at the node with the largest gain
  • until only negative-gain nodes remain
  • Bottom-up restriction: T can only be pruned if it does not contain a sub-tree with lower error than T (a sketch of a bottom-up variant follows below)
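A minimal sketch of a bottom-up variant of reduced error pruning. The Node structure, the dictionary representation of examples, and the assumption that every node stores its majority class are illustrative choices, not the slides' implementation:

```python
class Node:
    """A decision tree node; leaves have no children."""
    def __init__(self, test=None, children=None, majority_class=None):
        self.test = test                    # attribute tested at this node
        self.children = children or {}      # attribute value -> subtree
        self.majority_class = majority_class

    def is_leaf(self):
        return not self.children

def classify(node, example):
    """Follow the tree until a leaf (or an unseen attribute value) is reached."""
    while not node.is_leaf():
        child = node.children.get(example.get(node.test))
        if child is None:
            break
        node = child
    return node.majority_class

def rep(node, pruning_examples):
    """Bottom-up reduced error pruning on the pruning examples reaching node.
    Children are pruned first (the bottom-up restriction); then the node is
    turned into a leaf if that does not increase the error on the pruning set."""
    if node.is_leaf() or not pruning_examples:
        return node
    # Route the pruning examples to the children and prune the children first.
    parts = {v: [] for v in node.children}
    for x, y in pruning_examples:
        v = x.get(node.test)
        if v in parts:
            parts[v].append((x, y))
    node.children = {v: rep(c, parts[v]) for v, c in node.children.items()}
    # Gain from pruning = misclassifications of the subtree - misclassifications as a leaf.
    subtree_errors = sum(1 for x, y in pruning_examples if classify(node, x) != y)
    leaf_errors = sum(1 for _, y in pruning_examples if y != node.majority_class)
    if leaf_errors <= subtree_errors:
        node.children = {}                  # prune: the node becomes a leaf
    return node
```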

23
Reduced error pruning
  • Theorem (Esposito, Malerba, Semeraro 1997)
  • REP with the bottom-up restriction finds the smallest among the most accurate sub-trees w.r.t. the pruning set.

24
Minimal Error Pruning (MEP) (Niblett and Bratko 86; Cestnik and Bratko 91)
  • Does not require a pruning set for estimating error
  • Estimates error on new data directly from the growing set, using the Bayesian method for probability estimation (e.g. Laplace estimate or m-estimate)
  • Main principle: prune so that the estimated classification error is minimal

25
Minimal Error Pruning
  • Deciding about pruning at node v of a tree T
  • [Figure: node v with branches of probabilities p1, p2, ... leading to sub-trees T1, T2, ...]
  • E(T) = error of the optimally pruned tree T

26
Static and backed-up errors
  • Define the static error at v:
    e(v) = p( class ≠ C | v )
    where C is the most likely class at v
  • If T is pruned at v, then its error is e(v).
  • If T is not pruned at v, then its (backed-up) error is
    p1 E(T1) + p2 E(T2) + ...

27
Minimal error pruning
  • Decision whether to prune or not
  • Prune if static error ≤ backed-up error
  • E(T) = min( e(v), Σi pi E(Ti) )  (see the sketch below)
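A minimal sketch of this rule. It assumes each node stores the number of examples and the number of majority-class examples from the growing set; the m-estimate helper anticipates the next slides:

```python
def m_error(n_majority, n_total, prior, m):
    """Static error e(v): 1 minus the m-estimate of the majority class probability."""
    return 1.0 - (n_majority + prior * m) / (n_total + m)

def mep(node, prior, m):
    """Minimal error pruning: return E(T), the error of the optimally pruned
    subtree rooted at node, pruning whenever the static error does not exceed
    the backed-up error."""
    static = m_error(node.n_majority, node.n_total, prior, m)
    if node.is_leaf():
        return static
    # Backed-up error: sum_i p_i * E(T_i), with p_i estimated as N_i / N.
    backed_up = sum((child.n_total / node.n_total) * mep(child, prior, m)
                    for child in node.children.values())
    if static <= backed_up:
        node.children = {}      # prune at this node; its error is the static error
        return static
    return backed_up
```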

28
Minimal error pruning
  • Main question
  • How to estimate the static errors e(v)?
  • Use the Laplace or m-estimate of probability
  • At a node v
  • N = number of examples
  • nC = number of majority class examples

29
Laplace probability estimate
  • pC = ( nC + 1 ) / ( N + k )
  • where k is the number of classes.
  • Problems with Laplace
  • Assumes all classes are a priori equally likely
  • Degree of pruning depends on the number of classes

30
m-estimate of probability
  • pC = ( nC + pCa m ) / ( N + m )
  • where
  • pCa = a priori probability of class C
  • m is a non-negative parameter tuned by an expert (both estimates are illustrated in the snippet below)
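Both estimates as small helper functions (a sketch; the example numbers are illustrative only):

```python
def laplace_estimate(n_c, n, k):
    """Laplace estimate: (n_c + 1) / (N + k), where k is the number of classes."""
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, p_prior, m):
    """m-estimate: (n_c + p_prior * m) / (N + m)."""
    return (n_c + p_prior * m) / (n + m)

# A node with 2 examples, both of the majority class, 2 classes,
# prior probability of that class 0.7, and m = 2:
print(laplace_estimate(2, 2, 2))    # 0.75
print(m_estimate(2, 2, 0.7, 2.0))   # 0.85
```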

31
m-estimate
  • Important points
  • Takes into account prior probabilities
  • Pruning is not sensitive to the number of classes
  • Varying m gives a series of differently pruned trees
  • Choice of m depends on confidence in the data

32
m-estimate in pruning
  • Choice of m
  • Low noise → low m → little pruning
  • High noise → high m → much pruning
  • Note: using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.

33
Some other pruning methods
  • Error-complexity pruning, Breiman et al. 84
    (CART)
  • Pessimistic error pruning, Quinlan 87
  • Error-based pruning, Quinlan 93 (C4.5)

34
Error-complexity pruning (Breiman et al. 1984, program CART)
  • Considers
  • Error rate on the "growing" set
  • Size of the tree
  • Error rate on the "pruning" set
  • Minimise error and complexity, i.e. find a compromise between error and size

35
  • A sub-tree T with root v
  • [Figure: node v with sub-tree T below it]
  • R(v) = errors on the "growing" set at node v
  • R(T) = errors on the "growing" set of tree T
  • NT = number of leaves in T
  • Total cost = Error cost + Complexity cost
  • Total cost = R + α N

36
Error complexity cost
  • Total cost = Error cost + Complexity cost
  • Total cost = R + α N
  • α = complexity cost per leaf

37
Pruning at v
  • Cost of T (T unpruned) = R(T) + α NT
  • Cost of v (T pruned at v) = R(v) + α
  • When the costs of T and v are equal:
    α = ( R(v) - R(T) ) / ( NT - 1 )
  • α = reduction of error per leaf

38
Pruning algorithm
  • Compute α for each node in the unpruned tree
  • Repeat
  •   prune the sub-tree with the smallest α
  • until only the root is left
  • This gives a series of increasingly pruned trees; estimate their accuracy (a sketch of this loop follows below)
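A minimal sketch of this loop. It assumes each node stores R(v), its error count on the growing set, in an errors field, and that internal nodes have at least two leaves below them; this is an illustration, not CART's implementation:

```python
import copy

def leaves(node):
    if node.is_leaf():
        return [node]
    return [l for c in node.children.values() for l in leaves(c)]

def alpha(node):
    """Reduction of error per leaf when the subtree rooted at node is pruned:
    alpha = ( R(v) - R(T) ) / ( NT - 1 )."""
    r_t = sum(l.errors for l in leaves(node))        # R(T)
    return (node.errors - r_t) / (len(leaves(node)) - 1)

def internal_nodes(node):
    if node.is_leaf():
        return []
    return [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def cost_complexity_sequence(root):
    """Repeatedly prune the sub-tree with the smallest alpha, recording the
    series of increasingly pruned trees."""
    sequence = [copy.deepcopy(root)]
    while not root.is_leaf():
        weakest = min(internal_nodes(root), key=alpha)
        weakest.children = {}                        # prune at the weakest link
        sequence.append(copy.deepcopy(root))
    return sequence
```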

39
Selecting the best pruned tree
  • Finally, select the "best" tree from this series
  • Select the smallest tree within 1 standard error of the minimum error (1-SE rule)
  • Standard error = sqrt( Rmin (1 - Rmin) / number of examples )  (see the selection sketch below)
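A minimal sketch of the 1-SE selection, reusing the classify helper from the reduced error pruning sketch above and reading "number of examples" as the size of the set used for the error estimate:

```python
import math

def error_rate(tree, examples):
    return sum(1 for x, y in examples if classify(tree, x) != y) / len(examples)

def select_1se(pruned_trees, pruning_set):
    """Select the smallest (most pruned) tree whose error is within one
    standard error of the minimum error on the pruning set."""
    rates = [error_rate(t, pruning_set) for t in pruned_trees]
    r_min = min(rates)
    se = math.sqrt(r_min * (1 - r_min) / len(pruning_set))
    # pruned_trees are assumed ordered from least to most pruned,
    # so the last acceptable tree is the smallest one.
    acceptable = [t for t, r in zip(pruned_trees, rates) if r <= r_min + se]
    return acceptable[-1]
```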

40
Comments
  • Note: cost-complexity pruning limits selection to a subset of all possible pruned trees.
  • Consequence: the best pruned tree may be missed
  • Two ways of estimating error on new data:
  • (a) using a pruning set
  • (b) using cross-validation, in a rather complicated way

41
Comments
  • 1-SE rule tends to overprune
  • Simply choosing min. error tree ("0-SE rule")
    performs better in experiments
  • Error estimate with cross validation is
    complicated and based on a debatable assumption

42
Selecting best tree
  • Using pruning set
  • Measure error of candidate pruned trees on
    pruning set
  • Select the smallest tree within 1 standard error
    of minimum error.

43
Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)
  • Experiments with 14 data sets from the UCI repository
  • Results: does pruning improve accuracy?
  • Generally yes
  • But the effects of pruning also depend on the domain
  • In most domains pruning improves accuracy, in some it does not, and in very few it worsens accuracy

44
Pruning in rule learning
  • Ideas from pruning decision trees can be adapted
    to the learning of if-then rules
  • Pre-pruning and post-pruning can be combined and
    reduced error pruning idea applies
  • Fürnkranz (1997) reviews several approaches and evaluates them experimentally

45
Estimating Probabilities
  • Setup
  • n experiments (n = r + s)
  • r successes
  • s failures
  • How likely is it that the next experiment will be a success?
  • Estimate with relative frequency: p = r / n

46
Relative Frequency
  • Works when we have many experiments, but not with small samples
  • Consider flipping a coin
  • we flip a coin twice, and both times it comes up heads
  • what is the probability of heads in the next flip?
  • A probability of 1.0 (= 2/2) seems unreasonable

47
Coins and mushrooms
  • Probability of heads?
  • Probability that a mushroom is edible?
  • Make one, two, ... experiments
  • Interpret the results in terms of probability
  • Relative frequency does not work well

48
Coins and mushrooms
  • We need to consider prior expectations
  • A prior probability of 1/2 is, in both cases, not unreasonable
  • But is this enough?
  • Intuition says our probability estimates for coins and mushrooms should still be different
  • The difference lies in the prior probability distribution
  • What are sensible prior distributions for coins and for mushrooms?

49
Bayesian Procedure for Estimating Probabilities
  • Assume initial probability distribution (prior
    distribution)
  • Based on some evidence E, update this
    distribution to obtain posterior distribution
  • Compute the expected value over posterior
    distribution. Variance of posterior distribution
    is related to certainty of this estimate

50
Bayes Formula
  • P( H | E ) = P( H ) P( E | H ) / P( E )
  • The Bayesian process takes a prior probability and combines it with new evidence to obtain an updated (posterior) probability

51
Bayes in estimating probabilities
  • The form of the hypothesis H is
    P(event) = x
  • So
    P( H | E ) = P( P(event) = x | E )
  • That is, the probability that the probability is x
  • May appear confusing!

52
Bayes update of probability
  • P( P(event) = x | E )
    = P( P(event) = x ) P( E | P(event) = x ) / P( E )

[Figure: prior probability density updated by the evidence E into the posterior probability density]
53
Expected probability
  • Expected value of the probability of the event:
    P( event | E ) = integral over [0, 1] of x weighted by the posterior probability density

54
Bayes update of probability
  • Prior prob. distribution
  • Bayes update with evidence E
  • Posterior prob. distribution
  • Expected value, variance

55
Update of density: Example
[Figure: a uniform prior density on [0, 1] and the posterior density on [0, 1] after the Bayes update]

56
Choice of Prior Distribution: Beta Distribution Beta(a, b)
57
Bayesian Update of Beta Distributions
  • Let the prior distribution be Beta(a, b)
  • Assume experimental results
    s successful outcomes
    f failures
  • The updated distribution is Beta(a+s, b+f)
  • Beta probability distributions have a nice mathematical property: the Bayes update of a Beta distribution is again a Beta distribution (see the snippet below)
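A minimal sketch of this update using scipy.stats (an assumption; the slides only state the conjugacy property), with the two-heads coin example from earlier:

```python
from scipy.stats import beta

a, b = 2.0, 2.0                  # prior Beta(a, b), centred at 1/2
s, f = 2, 0                      # evidence: two flips, both heads

posterior = beta(a + s, b + f)   # updated distribution Beta(a+s, b+f)
print(posterior.mean())          # expected probability of heads: (a+s)/(a+b+s+f) = 2/3
print(posterior.var())           # the variance reflects the remaining uncertainty
```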
58
m-estimate of probability
  • Cestnik, 1991
  • Replace the parameters a and b with m and pa
  • pa is the prior probability
  • Assume N experiments with n positive outcomes

59
m-estimate of probability
  • p = ( n + m pa ) / ( N + m )
    = ( N / (N + m) ) (n / N) + ( m / (N + m) ) pa
  • i.e. a weighted combination of the relative frequency n/N and the prior probability pa
60
Choosing the prior probability distribution
  • If we know the prior probability and variance, this determines a, b and m, pa (the conversion between the two parameterisations is sketched below)
  • A domain expert can choose the prior distribution, defined by either a, b or m, pa
  • m, pa may be more practical than a, b
  • The expert hopefully has some idea about pa and m
  • low variance, more confidence in pa → large m
  • high variance, less confidence in pa → small m
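A small sketch of that conversion, using the standard correspondence m = a + b and pa = a / (a + b) obtained by matching the Beta posterior mean with the m-estimate (this relation is implied by, but not spelled out on, the slides):

```python
def beta_to_m(a, b):
    """Beta(a, b) prior -> m-estimate parameters (m, pa)."""
    return a + b, a / (a + b)

def m_to_beta(m, pa):
    """m-estimate parameters (m, pa) -> Beta prior parameters (a, b)."""
    return pa * m, (1 - pa) * m

print(beta_to_m(1, 1))       # (2, 0.5): the Laplace estimate for two classes
print(m_to_beta(2.0, 0.7))   # (1.4, 0.6)
```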

61
Laplace Probability Estimate
  • For a problem with two outcomes
  • Assumes a prior probability distribution of Beta(1, 1)
  • Also equals the m-estimate with pa = 1/k and m = k, where k = 2

62
Using domain knowledge to improve accuracy
  • If domain-specific knowledge is available prior to learning, it may provide useful additional constraints
  • Additional constraints may alleviate problems with noise
  • One approach is Q2 learning, which uses qualitative constraints in numerical learning

63
Q2 Learning (Vladušič, Šuc and Bratko 2004)
  • Q2 learning: Qualitatively faithful Quantitative learning
  • Learning from numerical data is guided by qualitative constraints
  • The resulting numerical model fits the learning data numerically and respects the given qualitative model
  • The qualitative model can be provided by a domain expert, or induced from data

64
Qualitative difficulties of numerical learning
[Figure: a water tank with level h and an outflow; the level h plotted against time t]
  • Learn the time behavior of the water level
  • h = f( t, initial_outflow )

65
Predicting water level with M5
[Figure: predicted water level over time for initial_outflow = 12.5, 11.25, 10.0, 8.75, 7.5, 6.25]
66
Predicting water level with Q2
[Figure: Q2 predictions compared with the true values]