Title: LEARNING FROM NOISY DATA
ACAI 05 ADVANCED COURSE ON KNOWLEDGE DISCOVERY
- Ivan Bratko
- University of Ljubljana
- Slovenia
Acknowledgement: Thanks to Blaz Zupan for his contribution to these slides.
2. Overview
- Learning from noisy data
- Idea of tree pruning
- How to prune optimally
- Methods for tree pruning
- Estimating probabilities
3. Learning from Noisy Data
- Sources of noise
  - Errors in measurements, errors in data encoding, errors in examples, missing values
- Problems
  - Complex hypotheses
  - Poor comprehensibility
  - Overfitting: the hypothesis overfits the data
  - Low classification accuracy on new data
4. Fitting data
(figure: data points plotted in the x-y plane)
What is the relation between x and y, y = y(x)? How can we predict y from x?
5. Overfitting data
(figure: a curve that passes through every training point)
Makes no error on the training data! But how about predicting new cases?
What is the relation between x and y, y = y(x)? How can we predict y from x?
6. Overfitting in Extreme
- Let default accuracy be the probability of the majority class
- Overfitting may result in accuracy lower than default
- Example
  - Attributes have no correlation with the class (i.e., 100% noise)
  - Two classes c1, c2
  - Class probabilities p(c1) = 0.7, p(c2) = 0.3
  - Default accuracy = 0.7
7. Overfitting in Extreme
Decision tree with one example per leaf:
(figure: a fully grown tree; leaves labelled c1 classify new cases correctly with accuracy 0.7, leaves labelled c2 with accuracy 0.3)
Expected accuracy = 0.7 × 0.7 + 0.3 × 0.3 = 0.58
0.58 < 0.7
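A small simulation sketch of this effect (an illustrative setup, not part of the original slides): with attributes that carry no information, a tree with one training example per leaf in effect predicts the class of a random training example, so its accuracy on new cases falls to about 0.58 while the default classifier stays near 0.7.

import random

random.seed(0)
p_c1 = 0.7                        # class probabilities from the slide
n_train, n_test = 1000, 10000

def draw_class():
    return "c1" if random.random() < p_c1 else "c2"

# Pure-noise attributes: for a new case, the single training example in its
# leaf is effectively a random training example.
train = [draw_class() for _ in range(n_train)]

tree_hits = sum(random.choice(train) == draw_class() for _ in range(n_test))
default_hits = sum(draw_class() == "c1" for _ in range(n_test))

print("one-example-per-leaf accuracy:", tree_hits / n_test)        # about 0.58
print("default (majority class) accuracy:", default_hits / n_test)  # about 0.7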
8. Pruning of Decision Trees
- A means of handling noise in tree learning
- After pruning, the accuracy on previously unseen examples may increase
(figure: a full tree is pruned into a smaller tree)
9. Typical Example from Practice: Locating Primary Tumor
- Data set
- 20 classes
- Default classifier: 24.7% accuracy
10. Effects of Pruning
(figure: effect of pruning on the "credit" data set)
11. Effects of Pruning
(figure: "glass" data set: accuracy on the training set and accuracy on the test set, plotted from bigger trees to smaller trees)
12. How to Prune Optimally?
- Main questions
  - How much pruning?
  - Where to prune?
- Large number of candidate pruned trees!
- Typical relation between tree size and accuracy on new data
- Main difficulty in pruning: this curve is not known!
(figure: sketch of accuracy vs. tree size on new data)
13. Two Kinds of Pruning
- Pre-pruning (forward pruning)
- Post-pruning
14. Forward Pruning
- Stop expanding the tree if the benefits of potential sub-trees seem dubious
  - Information gain is low
  - The number of examples is very small
  - The example set is statistically insignificant
  - Etc.
15. Forward Pruning Is Inferior
- Myopic
- Depends on parameters which are hard (impossible?) to guess
- Example
(figure: examples plotted over attributes x1 and x2, with candidate split points a on x1 and b on x2)
16. Pre- and Post-Pruning
- Forward pruning is considered inferior and myopic
- Post-pruning makes use of the grown sub-trees and in this way reduces the complexity
17. Post pruning
- Main idea: prune unreliable parts of the tree
- Outline of the pruning procedure
  - start at the bottom of the tree, proceed upward
  - prune unreliable subtrees along the way
- Main question
  - How do we know whether a subtree is unreliable?
  - Will accuracy improve after pruning?
18. Estimating the accuracy of a subtree
- One idea: use a special test data set (a pruning set)
- This is OK if a sufficient amount of learning data is available
- In case of a shortage of data: try to estimate accuracy directly from the learning data
19. Partitioning the data in tree learning
- All available data
  - Training set + Test set
- Training set
  - Growing set + Pruning set
- Typical proportions
  - training set 70%, test set 30%
  - growing set 70%, pruning set 30%
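A minimal sketch of this two-level split (the split function and the data placeholder are illustrative assumptions):

import random

def split(examples, fraction=0.7, seed=0):
    """Randomly split a list of examples into two parts."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(fraction * len(examples))
    return examples[:cut], examples[cut:]

data = list(range(100))                               # placeholder for all available examples
training_set, test_set = split(data, 0.7)             # 70% / 30%
growing_set, pruning_set = split(training_set, 0.7)   # 70% / 30% of the training set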
20. Estimating accuracy with a pruning set
- Accuracy of a hypothesis on new data
  = probability of correct classification of a new example
- Accuracy of a hypothesis on new data
  ≈ proportion of correctly classified examples in the pruning set
- Error of a hypothesis
  = probability of misclassification of a new example
- Drawback of using a pruning set: less data for the growing set
21. Reduced error pruning (Quinlan 87)
- Use the pruning set to estimate the accuracy of sub-trees and the accuracy at individual nodes
- Let T be a sub-tree rooted at node v
(figure: node v with the sub-tree T below it)
- Define: gain from pruning at v
  = (misclassifications in T) - (misclassifications at v), counted on the pruning set
22. Reduced error pruning
- Repeat
  - prune at the node with the largest gain
- until only negative-gain nodes remain
- Bottom-up restriction: T can only be pruned if it does not contain a sub-tree with lower error than T
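A sketch of this procedure as a single bottom-up pass over a simple tree structure (the Node class and its fields are assumptions made for illustration, not Quinlan's original code):

class Node:
    def __init__(self, split=None, children=None, majority_class=None):
        self.split = split                    # attribute tested at this node (None for a leaf)
        self.children = children or {}        # attribute value -> child Node
        self.majority_class = majority_class  # most frequent class among examples at this node

    def is_leaf(self):
        return not self.children

def classify(node, example):
    if node.is_leaf():
        return node.majority_class
    child = node.children.get(example.get(node.split))
    return classify(child, example) if child else node.majority_class

def subtree_errors(node, examples):            # misclassifications made by the sub-tree
    return sum(classify(node, x) != x["class"] for x in examples)

def leaf_errors(node, examples):               # misclassifications if the node were a leaf
    return sum(node.majority_class != x["class"] for x in examples)

def reduced_error_prune(node, pruning_set):
    # Bottom-up: prune inside each child's sub-tree first, on its share of the
    # pruning set, then decide whether to collapse this node into a leaf.
    for value, child in node.children.items():
        reduced_error_prune(child, [x for x in pruning_set if x.get(node.split) == value])
    if not node.is_leaf() and leaf_errors(node, pruning_set) <= subtree_errors(node, pruning_set):
        node.children = {}                     # gain from pruning here is >= 0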
23. Reduced error pruning
- Theorem (Esposito, Malerba, Semeraro 1997)
  - REP with the bottom-up restriction finds the smallest most accurate sub-tree w.r.t. the pruning set.
24. Minimal Error Pruning (MEP), Niblett and Bratko 86; Cestnik and Bratko 91
- Does not require a pruning set for estimating the error
- Estimates the error on new data directly from the growing set, using a Bayesian method for probability estimation (e.g. the Laplace estimate or the m-estimate)
- Main principle
  - Prune so that the estimated classification error is minimal
25. Minimal Error Pruning
- Deciding about pruning at node v of a tree T
(figure: node v with sub-trees T1, T2, ... reached with probabilities p1, p2, ...)
- E(T) = error of the optimally pruned tree T
26. Static and backed-up errors
- Define the static error at v:
  e(v) = p( class ≠ C | v )
  where C is the most likely class at v
- If T is pruned at v then its error is e(v).
- If T is not pruned at v then its (backed-up) error is
  p1 · E(T1) + p2 · E(T2) + ...
27. Minimal error pruning
- Decision whether to prune or not
  - Prune if static error ≤ backed-up error
- E(T) = min( e(v), Σi pi · E(Ti) )
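A sketch of this pruning decision (the node interface, the class priors and the use of the m-estimate from the following slides are assumptions made for illustration):

def m_estimate(n_majority, n_total, prior, m):
    return (n_majority + prior * m) / (n_total + m)

def static_error(node, priors, m):
    # e(v) = 1 - p(C | v), where C is the most likely class at v
    return 1.0 - m_estimate(node.n_majority, node.n_total, priors[node.majority_class], m)

def minimal_error_prune(node, priors, m):
    """Return E(T) for the sub-tree rooted at node, pruning in place whenever
    the static error does not exceed the backed-up error."""
    e_static = static_error(node, priors, m)
    if not node.children:
        return e_static
    backed_up = sum((child.n_total / node.n_total) * minimal_error_prune(child, priors, m)
                    for child in node.children.values())
    if e_static <= backed_up:
        node.children = {}                    # prune at this node
        return e_static
    return backed_up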
28. Minimal error pruning
- Main question
  - How to estimate the static errors e(v)?
- Use the Laplace or the m-estimate of probability
- At a node v:
  - N = number of examples
  - nC = number of majority-class examples
29. Laplace probability estimate
- pC = ( nC + 1 ) / ( N + k ), where k is the number of classes
- Problems with Laplace
  - Assumes all classes are a priori equally likely
  - The degree of pruning depends on the number of classes
30. m-estimate of probability
- pC = ( nC + pCa · m ) / ( N + m )
- where
  - pCa = a priori probability of class C
  - m is a non-negative parameter tuned by the expert
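A tiny numeric illustration of the two estimates (the numbers are chosen arbitrarily):

def laplace(n_c, n, k):
    return (n_c + 1) / (n + k)                # k = number of classes

def m_estimate(n_c, n, p_prior, m):
    return (n_c + p_prior * m) / (n + m)

# A node with N = 5 examples, nC = 4 of the majority class, k = 2 classes,
# and prior probability pCa = 0.7 for that class:
print(laplace(4, 5, k=2))                     # 5/7   = 0.714...
print(m_estimate(4, 5, p_prior=0.7, m=2))     # 5.4/7 = 0.771...
print(m_estimate(4, 5, p_prior=0.7, m=10))    # 11/15 = 0.733...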
31. m-estimate
- Important points
  - Takes into account prior probabilities
  - Pruning is not sensitive to the number of classes
  - Varying m gives a series of differently pruned trees
  - The choice of m depends on confidence in the data
32. m-estimate in pruning
- Choice of m
  - Low noise → low m → little pruning
  - High noise → high m → much pruning
- Note: Using the m-estimate is as if the examples at a node were a random sample, which they are not. Suitably adjusting m compensates for this.
33. Some other pruning methods
- Error-complexity pruning, Breiman et al. 84 (CART)
- Pessimistic error pruning, Quinlan 87
- Error-based pruning, Quinlan 93 (C4.5)
34. Error-complexity pruning, Breiman et al. 1984 (program CART)
- Considers
  - Error rate on the "growing" set
  - Size of the tree
  - Error rate on the "pruning" set
- Minimise error and complexity, i.e. find a compromise between error and size
35. A sub-tree T with root v
(figure: node v with the sub-tree T below it)
- R(v) = errors on the "growing" set at node v
- R(T) = errors on the "growing" set of tree T
- N_T = number of leaves in T
- Total cost = Error cost + Complexity cost
- Total cost = R + α · N
36. Error-complexity cost
- Total cost = Error cost + Complexity cost
- Total cost = R + α · N
- α = complexity cost per leaf
37. Pruning at v
- Cost of T (T unpruned): R(T) + α · N_T
- Cost of v (T pruned at v): R(v) + α
- The two costs are equal when
  α = ( R(v) - R(T) ) / ( N_T - 1 )
- α = reduction of error per leaf
38. Pruning algorithm
- Compute α for each node in the unpruned tree
- Repeat
  - prune the sub-tree with the smallest α
- until only the root is left
- This gives a series of increasingly pruned trees; estimate their accuracy
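A sketch of this pruning loop (the node interface is an assumption; errors_growing stands for R(v), the errors the node would make as a leaf on the growing set, and every internal node is assumed to have at least two leaves below it):

import copy

def leaves(node):
    return [node] if not node.children else [l for c in node.children.values() for l in leaves(c)]

def internal_nodes(node):
    return [] if not node.children else [node] + [n for c in node.children.values() for n in internal_nodes(c)]

def alpha(node):
    # alpha at which pruning at this node leaves the total cost unchanged:
    # alpha = ( R(v) - R(T) ) / ( N_T - 1 )
    r_subtree = sum(l.errors_growing for l in leaves(node))
    return (node.errors_growing - r_subtree) / (len(leaves(node)) - 1)

def cost_complexity_sequence(root):
    """Return the series of increasingly pruned copies of the tree."""
    sequence = [copy.deepcopy(root)]
    while root.children:                          # until only the root is left
        weakest = min(internal_nodes(root), key=alpha)
        weakest.children = {}                     # prune the sub-tree with the smallest alpha
        sequence.append(copy.deepcopy(root))
    return sequence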
39. Selecting the best pruned tree
- Finally, select the "best" tree from this series
- Select the smallest tree within 1 standard error of the minimum error (the 1-SE rule)
- Standard error = sqrt( Rmin · (1 - Rmin) / number of examples )
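A sketch of the 1-SE selection (the list-of-pairs format is an assumption for illustration):

import math

def select_1se(candidates, n_examples):
    """candidates: (tree_size, error_rate) pairs for the series of pruned trees.
    Return the size of the smallest tree within one standard error of the minimum error."""
    r_min = min(err for _, err in candidates)
    se = math.sqrt(r_min * (1 - r_min) / n_examples)
    return min(size for size, err in candidates if err <= r_min + se)

# e.g. select_1se([(25, 0.18), (13, 0.15), (7, 0.16), (3, 0.22)], n_examples=200)
# returns 7: the 13-leaf tree has the minimum error, but the 7-leaf tree is within 1 SE of it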
40. Comments
- Note: cost-complexity pruning limits the selection to a subset of all possible pruned trees.
- Consequence: the best pruned tree may be missed
- Two ways of estimating the error on new data:
  - (a) using a pruning set
  - (b) using cross-validation, in a rather complicated way
41. Comments
- The 1-SE rule tends to overprune
- Simply choosing the minimum-error tree (the "0-SE rule") performs better in experiments
- The error estimate with cross-validation is complicated and based on a debatable assumption
42. Selecting the best tree
- Using the pruning set
  - Measure the error of the candidate pruned trees on the pruning set
  - Select the smallest tree within 1 standard error of the minimum error.
43. Comparison of pruning methods (Esposito, Malerba, Semeraro 96, IEEE Trans.)
- Experiments with 14 data sets from the UCI repository
- Results: does pruning improve accuracy?
  - Generally yes
  - But the effects of pruning also depend on the domain
  - In most domains pruning improves accuracy, in some it does not, and in very few it makes accuracy worse
44. Pruning in rule learning
- Ideas from pruning decision trees can be adapted to the learning of if-then rules
- Pre-pruning and post-pruning can be combined, and the reduced error pruning idea applies
- Furnkranz (1997) reviews several approaches and evaluates them experimentally
45. Estimating Probabilities
- Setup
  - n experiments (n = r + s)
  - r successes
  - s failures
- How likely is it that the next experiment will be a success?
- Estimate with the relative frequency r / n
46. Relative Frequency
- Works when we have many experiments, but not with small samples
- Consider flipping a coin
  - we flip a coin twice, and both times it comes up heads
  - what is the probability of heads on the next flip?
- A probability of 1.0 (= 2/2) seems unreasonable
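For comparison with the estimates introduced later in these slides: the Laplace estimate would give (2 + 1) / (2 + 2) = 0.75 for the two observed heads, and an m-estimate with prior 0.5 and, say, m = 10 would give (2 + 0.5 · 10) / (2 + 10) ≈ 0.58, both far more reasonable than 1.0 (the value m = 10 is an arbitrary choice for illustration).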
47. Coins and mushrooms
- Probability of heads?
- Probability that a mushroom is edible?
- Make one, two, ... experiments
- Interpret the results in terms of probability
- Relative frequency does not work well
48. Coins and mushrooms
- We need to consider prior expectations
- Prior probability = 1/2 is in both cases not unreasonable
- But is this enough?
- Intuition says our probability estimates for coins and mushrooms should still differ
- The difference lies in the prior probability distribution
- What are sensible prior distributions for coins and for mushrooms?
49. Bayesian Procedure for Estimating Probabilities
- Assume an initial probability distribution (the prior distribution)
- Based on some evidence E, update this distribution to obtain the posterior distribution
- Compute the expected value over the posterior distribution. The variance of the posterior distribution is related to the certainty of this estimate
50. Bayes Formula
- The Bayesian process takes a prior probability and combines it with new evidence to obtain an updated (posterior) probability:
  P( H | E ) = P( H ) · P( E | H ) / P( E )
51. Bayes in estimating probabilities
- The form of hypothesis H is
  - P(event) = x
- So
  - P( H | E ) = P( P(event) = x | E )
- That is, the probability that the probability is x
- May appear confusing!
52. Bayes update of probability
- P( P(event) = x | E ) = P( P(event) = x ) · P( E | P(event) = x ) / P( E )
- Here P( P(event) = x ) is the prior probability density and P( P(event) = x | E ) is the posterior probability density
53. Expected probability
- Expected value of the probability of the event:
  P( event | E ) = integral over [0,1] of x · P( P(event) = x | E ) dx
- i.e. the integral over [0,1] of x weighted by the posterior probability density
54. Bayes update of probability
- Prior prob. distribution
- Bayes update with evidence E
- Posterior prob. distribution
- Expected value, variance
55. Update of density: Example
- Uniform prior over [0, 1]
(figure: the flat prior density on [0, 1] and the resulting posterior density on [0, 1])
56. Choice of Prior Distribution: the Beta Distribution β(a, b)
57. Bayesian Update of Beta Distributions
- Let the prior distribution be β(a, b)
- Assume the experimental results are
  - s successful outcomes
  - f failures
- The updated distribution is β(a + s, b + f)
- Beta probability distributions have a nice mathematical property:
  Beta distribution → Bayes update → Beta distribution
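A standard property of the Beta distribution, worth noting here although not stated on the slide: the expected value of β(a, b) is a / (a + b), so after the update the expected probability of success is (a + s) / (a + b + s + f). The m-estimate on the next slides is exactly this quantity under the reparameterisation m = a + b, pa = a / (a + b).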
58. m-estimate of probability
- Cestnik, 1991
- Replace the parameters a and b with m and pa
  - pa = prior probability
- Assume N experiments with n positive outcomes
59. m-estimate of probability
- p = ( n + pa · m ) / ( N + m )
    = ( N / (N + m) ) · ( n / N ) + ( m / (N + m) ) · pa
- i.e. a weighted combination of the relative frequency n/N and the prior probability pa
60. Choosing the prior probability distribution
- If we know the prior probability and variance, this determines a, b (and equivalently m, pa)
- A domain expert chooses the prior distribution, defined either by a, b or by m, pa
- m, pa may be more practical than a, b
- The expert hopefully has some idea about pa and m
  - low variance, more confidence in pa → large m
  - high variance, less confidence in pa → small m
61. Laplace Probability Estimate
- For a problem with two outcomes: p = ( n + 1 ) / ( N + 2 )
- Assumes a prior probability distribution of β(1, 1), i.e. uniform
- Also equals the m-estimate with pa = 1/k and m = k, where k = 2
62. Using domain knowledge to improve accuracy
- If domain-specific knowledge is available prior to learning, it may provide useful additional constraints
- Additional constraints may alleviate problems with noise
- One approach is Q2 learning, which uses qualitative constraints in numerical learning
63. Q2 Learning (Vladušič, Šuc and Bratko 2004)
- Q2 learning: Qualitatively faithful Quantitative learning
- Learning from numerical data is guided by qualitative constraints
- The resulting numerical model fits the learning data numerically and respects the given qualitative model
- The qualitative model can be provided by a domain expert, or induced from data
64. Qualitative difficulties of numerical learning
(figure: a water container with level h and an outflow; plot of the level h against time t)
- Learn the time behavior of the water level
  h = f( t, initial_outflow )
65. Predicting water level with M5
(figure: M5 predictions of the water level over time, one curve per initial outflow: 12.5, 11.25, 10.0, 8.75, 7.5, 6.25)
66. Predicting water level with Q2
(figure: Q2 predictions and the true values)