Engineering the input and output

1
Engineering the input and output
  • Successful DM
  • more than just selecting the algorithm
  • most methods have many parameters
  • appropriate choice depends on the data
  • How to choose?
  • straight brute-force approach?
  • not usually the best (only training data
    available for comparison)
  • separate test data / cross-validation
  • This chapter: other important processes that can
    improve the success of DM

2
Data engineering
  • Bag of tricks
  • not sure if they work or not
  • yet better to understand what they are like
  • Input
  • make it more suitable for a ML scheme
  • attribute selection & discretization
  • data cleansing
  • not addressed: invention of synthetic attributes
  • Output
  • make the result more effective
  • combine different models

3
7.1 Attribute selection
  • Many ML methods try to find important
    attributes on their own
  • decision trees: splitting
  • Effect of irrelevant attributes
  • DT: adding a random binary attribute → 5-10% worse
  • the chance of picking it increases at lower levels
    of the tree
  • similarly with separate-and-conquer rule learning
  • instance-based methods suffer a lot (distance)
  • naive Bayes is quite robust
  • independence assumption is valid for random data
  • but redundant (dependent) attributes cause trouble

4
Attribute selection...
  • Irrelevant attributes clearly cause harm
  • Also relevant ones may!
  • 2 classes, a new attribute that agrees with the
    class 65% of the time
  • classification accuracy becomes 1-5% worse
  • reason: when the new attribute is chosen for
    splitting, later splits have to rely on sparser
    data
  • Selection is important
  • manual selection (based on understanding of the
    problem) should be best
  • improves performance (accuracy)
  • improves readability

5
Scheme-independent selection
  • Two different selection approaches
  • Filter methods
  • general, independent of learning method
  • based on general characteristics of the data
  • no universal measures for relevance exist
  • Wrapper methods
  • test different subsets with a known (wrapped) ML
    method

6
A simple filter method
  • Use just enough attributes to keep all instances
    distinguishable from each other
  • a simple idea, yet computationally expensive to find
  • statistically unwarranted (relies on training
    data)
  • prone to noise (→ overfitting)

7
Using ML algorithms...
  • Decision trees
  • build a DT on the training data
  • select the attributes that are actually used in the
    tree
  • use these attributes with any other ML scheme
    (sketch below)
  • 1-Rule algorithm
  • the user decides how many attributes are used
  • 1R finds the best ones (one by one)
  • note: 1R is error-based, not necessarily the optimal
    choice
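
A minimal sketch of the decision-tree filter above, assuming scikit-learn (the slides do not prescribe a library): attributes that receive non-zero importance in a fitted tree are exactly those used for splitting, and only they are handed to another scheme (here naive Bayes).

```python
# Hedged sketch: keep the attributes a decision tree actually splits on,
# then reuse them with a different learning scheme (naive Bayes).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
used = np.flatnonzero(tree.feature_importances_ > 0)   # attributes used in the tree
print("attributes kept:", used)

score = cross_val_score(GaussianNB(), X[:, used], y, cv=5).mean()
print("NB accuracy on selected attributes: %.3f" % score)
```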

8
...Using ML algorithms
  • Instance-based methods
  • sample training instances
  • examine the closest neighbours
  • same class: a "near hit"
  • differs in some attribute value → that attribute
    appears to be irrelevant → less weight
  • different class: a "near miss"
  • differs → relevant attribute → more weight
  • selection: keep only the attributes with positive
    weights (sketch below)
  • will not detect dependent attributes (both are
    either selected or rejected)
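
The near-hit / near-miss weighting above is essentially the Relief idea. Below is a rough sketch for numeric attributes (sampling, scaling and tie handling simplified); the function name and data are hypothetical.

```python
# Rough Relief-style sketch: sample instances, find the nearest hit and miss,
# and move attribute weights down (differs near a hit) or up (differs near a miss).
import numpy as np

def relief_weights(X, y, n_samples=100, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)              # Manhattan distance to all instances
        dist[i] = np.inf                                  # exclude the instance itself
        same, diff = y == y[i], y != y[i]
        hit = np.argmin(np.where(same, dist, np.inf))     # nearest instance of the same class
        miss = np.argmin(np.where(diff, dist, np.inf))    # nearest instance of another class
        w -= np.abs(X[i] - X[hit]) / n_samples            # differing attribute near a hit -> less weight
        w += np.abs(X[i] - X[miss]) / n_samples           # differing attribute near a miss -> more weight
    return w

# selection: keep only attributes with positive weight
# selected = np.flatnonzero(relief_weights(X, y) > 0)
```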

9
Searching the attribute space
  • Subset lattice
  • structure created by removing/adding attributes
  • search either one- or two-directional
  • forward selection: start from the empty set
  • tentatively add one attribute & evaluate (e.g. by
    cross-validation)
  • select the best & continue, or stop if no
    improvement (sketch below)
  • backward elimination: start from the full set
  • easy to add a bias towards small attribute sets
  • threshold value for the performance gain
  • best-first search, beam search, GA, ...
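
A compact sketch of forward selection as described above; scikit-learn cross-validation and naive Bayes are assumed stand-ins for "the ML scheme". Backward elimination is the mirror image: start from the full set and tentatively drop attributes.

```python
# Hedged sketch of greedy forward selection with cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
learner = GaussianNB()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # evaluate every one-attribute extension of the current subset
    scores = {a: cross_val_score(learner, X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    a_best = max(scores, key=scores.get)
    if scores[a_best] <= best_score:          # stop if no improvement
        break
    selected.append(a_best)
    remaining.remove(a_best)
    best_score = scores[a_best]

print("selected attributes:", selected, "accuracy: %.3f" % best_score)
```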

10
Scheme-specific selection
  • Performance measure
  • use the given ML scheme on the chosen attributes
  • exhaustive search: 2^k choices
  • Experiments tell that
  • backward elimination: larger sets, more accurate
  • forward selection: smaller sets, more readable
  • reason: we usually stop too early (optimistic
    error estimates)
  • sophisticated search methods
  • are not generally better (no uniform performance
    gain)
  • hard to predict when they are worthwhile to use

11
Decision tables
  • Classifier type for which scheme-specific
    selection is essential
  • the entire learning problem is which attributes to
    select
  • usually done by cross-validating different
    subsets
  • validation is computationally cheap
  • the table structure stays the same all the time
  • only the class counters change

12
Success story
  • Selective Naive Bayes
  • Naive Bayes + forward selection
  • forward selection detects redundant attributes
    better than backward elimination
  • naive evaluation metric: performance on training
    data
  • Experiments
  • improves performance on many standard test cases
  • no negative effects

13
7.2 Discretizing numeric attributes
  • Why?
  • some ML methods work only on nominal data
  • some deal with numeric data, but not satisfactorily
  • e.g. an assumption of normal distribution
  • DT: (repetitive) sorting required
  • 1R discretization
  • sort, place boundaries where the class value changes
    (require some minimum number of points per interval)
  • a global method (applied to all data)
  • DT discretization
  • a local decision on the best (2-way) split point

14
Local or global?
  • Local
  • tailored to the actual context
  • different discretizations in different places
  • less reliable with small datasets
  • Global (prior to learning)
  • has to make one general decision
  • numeric data is ordered → is the discretized
    (nominal) version too?

15
Ordering information
  • Order = potentially valuable knowledge
  • how to express it to a ML scheme that does not
    understand ordering?
  • Transformation
  • replace an attribute with k values by k-1 binary
    attributes
  • original value i:
  • set the first i-1 new attributes to 0, the rest to 1
  • if a DT splits on the ith attribute, it actually uses
    the ordering information
  • note: independent of the original discretization
    method (sketch below)
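
A small sketch of this transformation, following the convention stated above (original value i: first i-1 new attributes 0, the rest 1); the attribute levels are made up for illustration.

```python
# Encode an ordered attribute with k values as k-1 binary attributes so that
# a single split on one of them corresponds to a threshold on the ordering.
import numpy as np

def ordered_to_binary(values, ordered_levels):
    k = len(ordered_levels)
    idx = np.array([ordered_levels.index(v) for v in values])   # 0-based position i-1
    # column j is 1 iff the value's position is <= j  (first i-1 columns 0, rest 1)
    return (idx[:, None] <= np.arange(k - 1)[None, :]).astype(int)

levels = ["cool", "mild", "hot"]          # hypothetical ordered attribute
print(ordered_to_binary(["cool", "hot", "mild"], levels))
# [[1 1]
#  [0 0]
#  [0 1]]
```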

16
Unsupervised discretization
  • Unsupervised/supervised
  • discretization made without/with knowledge of the
    class value
  • Obvious way
  • divide range into a fixed number of equal
    intervals
  • may lose information (or create noise)
  • too coarse intervals
  • unfortunate choices of boundary values

17
Unsupervised discretization...
  • Equal-interval binning
  • the way described on the previous slide
  • → uneven distribution of examples across bins
  • Equal-frequency binning
  • the histogram becomes flat
  • may still separate class members with bad
    boundaries (sketch below)
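
A quick NumPy sketch contrasting the two binning schemes on a skewed, synthetic attribute (the bin count is chosen arbitrarily).

```python
# Equal-interval binning: equally wide bins; equal-frequency binning: bins
# chosen so that roughly the same number of examples falls into each.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=1000)     # skewed numeric attribute
k = 4

equal_interval = np.linspace(x.min(), x.max(), k + 1)          # equal width
equal_frequency = np.quantile(x, np.linspace(0, 1, k + 1))     # equal counts

print("interval boundaries :", np.round(equal_interval, 1))
print("frequency boundaries:", np.round(equal_frequency, 1))
print("counts (interval)   :", np.histogram(x, equal_interval)[0])
print("counts (frequency)  :", np.histogram(x, equal_frequency)[0])
```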

18
Entropy-based discretization
  • Recursively split intervals
  • apply DT method to find the initial split
  • repeat this process in both parts
  • Fact
  • a cut point minimizing the information value never
    appears between two consecutive examples of the same
    class
  • → we can reduce the number of candidate points
    (sketch below)
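
A sketch of one level of entropy-based splitting, checking only class-boundary cut points as justified above; the data is a made-up temperature-style example.

```python
# Find the cut point of a numeric attribute that minimizes the weighted
# class entropy, considering only class-boundary candidates.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if y[i] == y[i - 1] or x[i] == x[i - 1]:
            continue                       # never cut between same-class neighbours
        cut = (x[i] + x[i - 1]) / 2
        e = (i * entropy(y[:i]) + (len(x) - i) * entropy(y[i:])) / len(x)
        if e < best[1]:
            best = (cut, e)
    return best                            # (cut point, weighted entropy)

x = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 83], dtype=float)
y = np.array(["y", "n", "y", "y", "y", "n", "n", "y", "n", "y"])
print(best_split(x, y))
```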

19
When to stop recursion?
  • Use the MDL principle
  • no split: encode the example classes as such
  • split:
  • theory: the split point takes log(N-1) bits to
    encode (N = number of instances)
  • encode the classes in both partitions
  • optimal situation
  • all values below the split are yes, all above are no
  • each instance costs 1 bit without splitting and
    almost 0 bits with it
  • formula for the MDL-based gain threshold (sketch
    below)
  • note: temperature example, no splitting at all → a
    single interval
  • no discretization → quite an irrelevant attribute!
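
The gain threshold referred to above is presumably the Fayyad-Irani MDL criterion; a hedged sketch of that test (E, E1, E2 = entropies of the full set and its two parts, k, k1, k2 = numbers of classes present in each).

```python
# Hedged sketch of the MDL-based stopping test (Fayyad & Irani style):
# accept the split only if the information gain exceeds the cost of
# encoding the split point plus the extra class information.
import numpy as np

def mdl_accepts_split(N, gain, E, E1, E2, k, k1, k2):
    """N: number of instances; gain = E - weighted average of E1 and E2;
    k, k1, k2: numbers of distinct classes in the set and its two parts."""
    delta = np.log2(3 ** k - 2) - (k * E - k1 * E1 - k2 * E2)
    threshold = (np.log2(N - 1) + delta) / N
    return gain > threshold
```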

20
Other methods
  • Entropy + MDL: one of the best general methods
    for supervised discretization
  • Bottom-up discretization
  • consider merges of adjacent intervals
  • find the best pair, merge if good enough
  • Error-based
  • assign the majority class to each interval
  • compute the prediction error
  • problem: the error is minimized by giving each value
    its own interval → fix some number of intervals k

21
Error-based methods...
  • Brute-force method: exponential in k
  • Dynamic programming
  • applicable to any impurity function
  • finds the partition of N instances into k subsets
    minimizing the impurity in time O(kN²)
  • e.g. impurity = entropy
  • O(kN) for the error-based impurity function

22
Error-based vs entropy-based
  • Error-based
  • finds the optimal discretization very quickly
  • but cannot produce adjacent intervals with the same
    (majority) class
  • Do we need such intervals (Fig 7.4)?
  • discretize a1 & a2
  • best intervals for a1: 0..0.3, 0.3..0.7, 0.7..1.0
  • majority classes for a1: dot, ?, triangle
  • → the middle one must be either of them → merging
  • the majority does not change at 0.3, but the class
    distribution does
  • entropy-based methods are sensitive to these
    changes

23
From discrete to numeric?
  • Some methods work only on numeric data
  • nearest neighbour, regression
  • Distance: 0 if values are equal, 1 if different
  • the same effect by a transformation:
  • A with k values → k binary attributes A1,...,Ak
  • Ai = 1 iff the value of A is the ith one
  • equal weights → no change to the distance function
    (sketch below)
  • weighting can be used to express shades of
    difference
  • Ordered values
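
A tiny sketch of the k-binary-attribute transformation (Ai = 1 iff the value of A is the ith one): equal values end up at distance 0 and different values at a constant distance, matching the 0/1 idea up to a constant factor. The attribute levels are hypothetical.

```python
# One-hot ("k binary attributes") encoding of a nominal attribute so that
# distance-based methods see equal values as 0 and different values as a constant.
import numpy as np

def one_hot(values, levels):
    return np.array([[1 if v == lev else 0 for lev in levels] for v in values])

levels = ["red", "green", "blue"]                 # hypothetical nominal attribute
Z = one_hot(["red", "blue", "red"], levels)
print(Z)
print("Manhattan distances:", np.abs(Z[0] - Z[1]).sum(), np.abs(Z[0] - Z[2]).sum())  # 2, 0
```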

24
7.3 Automatic data cleansing
  • Real-life data is bound to contain errors
  • both attribute and class values
  • manual checking is impossible (size)
  • DM techniques may help
  • Improving decision trees
  • build a DT, discard misclassified data & relearn
  • repeat until there are no misclassified examples
    (sketch below)
  • surprisingly often:
  • a simpler DT
  • no significant change in accuracy (+/-)
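
A sketch of the filter-and-relearn loop, assuming scikit-learn trees and a standard dataset as stand-ins.

```python
# Hedged sketch: repeatedly remove the training examples the tree misclassifies.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
while True:
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    correct = tree.predict(X) == y
    if correct.all():
        break                              # no misclassified examples left
    X, y = X[correct], y[correct]          # discard misclassified data and relearn

print("examples kept:", len(y), "final tree size:", tree.tree_.node_count)
```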

25
Improving decision trees...
  • Why does the previous method work?
  • pruning: is the subtree justified by the data?
  • a decision to ignore data misclassified by the new
    tree
  • local to the pruned node
  • removing misclassified data
  • propagates such ignorance decisions to the whole tree
  • if the pruning strategy is good, this should not harm
  • improvements are possible through better attribute
    selection with the cleaned data
  • alternatively: present misclassified examples to a
    human expert, who removes or corrects them

26
Improving decision trees...
  • Assumption: misclassifications are not systematic
  • e.g. exchanged class values
  • DT would probably learn the systematic error
  • Experiment
  • add noise to attribute values in test data
  • better results when similar noise is added also
    to training data
  • idea: no use learning from clean data if
    performance is measured on a dirty test set
  • DT learns which attributes are unreliable and
    how to combine them

27
Robust regression
  • Outliers in statistics
  • detection (e.g. visually) & manual removal
  • is an outlier an error or not?
  • large effect on MSE (mean squared error)
  • Robust methods
  • outlier-tolerant statistical methods
  • error functions other than MSE (e.g. MAE)
  • automatic detection & removal (sketch below)
  • create a regression model, remove the 10% of points
    farthest from it
  • minimize the median instead of the mean error
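
A sketch of the simple detect-and-remove variant above (fit, drop the 10% of points with the largest residuals, refit); least median of squares itself is more involved. The data and the use of scikit-learn are assumptions.

```python
# Hedged sketch: one round of outlier removal based on absolute residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(0, 1, 200)
y[:20] += 30                               # a block of gross outliers

model = LinearRegression().fit(x, y)
residuals = np.abs(y - model.predict(x))
keep = residuals <= np.quantile(residuals, 0.9)    # drop the farthest 10%

robust = LinearRegression().fit(x[keep], y[keep])
print("slope before: %.2f  after: %.2f" % (model.coef_[0], robust.coef_[0]))
```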

28
Robust regression...
  • Example case: phone call data
  • 1964...1969: minutes of calls
  • other years: numbers of calls
  • → a large fraction of outliers on the y-axis
  • MSE gives quite a bad result
  • median squared error works remarkably well
  • finds the narrowest strip covering half of the data
  • median model = center of this strip
  • drawback: computationally (too) expensive

29
Detecting anomalies...
  • Is the error in the model or in the data?
  • are we justified in removing seemingly erroneous
    examples?
  • visualization may help with regression models
  • but not with all models
  • how to visualize a rule set, for example?
  • misclassified data
  • can usually be removed from DT training set
  • but we never know if that is the case with our data

30
...Detecting anomalies
  • One solution attempt
  • try several different ML schemes
  • use their combined results to filter data
  • conservative: remove only instances that all schemes
    misclassify
  • voting (danger of outvoting the right scheme)
  • training with filtered data may yield even better
    results
  • Danger with filtering approaches
  • some classes may get sacrificed in order to get
    better results for other classes
  • Human expert is still the winner
  • filtering suspects reduces the manual work

31
7.4 Combining multiple models
  • Aim: make decisions more reliable
  • consult several experts on the area
  • General combination models
  • bagging (bootstrap aggregating)
  • boosting
  • stacking
  • k-class (k > 2) classification problems
  • error-correcting codes

32
Combining results
  • In general
  • how to convert several predictions into (a
    hopefully better) one
  • Approaches
  • (weighted) vote/average
  • bagging: each model has equal weight
  • boosting: successful experts get more weight

33
Bagging
  • Introductory example
  • take t random training samples
  • build a DT for each sample
  • the trees are usually not identical
  • attribute selection is sensitive to the data
  • there are instances for which some trees are correct
    and some are not
  • voting usually gives a better result than any of
    the DTs alone (sketch below)
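
A sketch of this introductory example: t bootstrap samples, one tree per sample, majority vote over their predictions (scikit-learn's BaggingClassifier packages the same idea; the dataset here is an arbitrary stand-in).

```python
# Hedged bagging sketch: bootstrap resampling + unweighted voting of trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
t = 25
predictions = []
for _ in range(t):
    idx = rng.integers(0, len(ytr), len(ytr))          # sample with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(Xtr[idx], ytr[idx])
    predictions.append(tree.predict(Xte))

votes = np.array(predictions)                           # shape (t, n_test)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("bagged accuracy: %.3f" % (majority == yte).mean())
```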

34
Bias-variance decomposition...
  • Theoretical basis for analyzing the effect of
    combining models
  • build an infinite number of classifiers from an
    infinite number of independent training sets
  • process test instances with each & vote
  • First component of the expected error: bias
  • tells how well the chosen model type fits the data
  • the average error of the combined classifier
  • the persistent error of the learning algorithm,
    which cannot be eliminated by sampling more data

35
...Bias-variance decomposition
  • Error due to the chosen sample: variance
  • samples are finite → they do not fully represent
    the population
  • average value over all training sets of the given
    size & all test sets
  • Total expected error = bias + variance
  • combining models reduces the variance
  • In practice we have only one training set ... what
    to do?

36
Back to bagging
  • Simulate an infinite number of training sets
  • by resampling the same training data
  • delete & replicate instances by sampling with
    replacement (as in the bootstrap method)
  • apply the learning scheme to each sample & vote
  • An approximation of the idealized procedure
  • the training sets are not independent
  • but it still works remarkably well
  • often significant improvements
  • never substantially worse

37
Bagging numeric prediction
  • Just average the results of different predictions
  • Bias-variance decomposition?
  • error = expected value of MSE
  • bias: average MSE over models built from all
    possible datasets of the same size
  • variance: expected error of a single model
  • fact: bagging always reduces the expected total error
    (not true for classification problems)

38
Boosting
  • Bagging
  • works due to the inherent instability of learning
    models
  • does not work for stable models (insensitive to
    small changes in data)
  • e.g. linear regression
  • Boosting
  • explicitly searches for models that complement each
    other

39
Boosting vs. bagging
  • Similarities
  • uses voting/averaging
  • combines several models of same type
  • Differences
  • boosting is iterative: later models depend on
    earlier ones
  • new models should become experts in areas where
    earlier models fail
  • boosting weights models based on their performance

40
AdaBoost.M1
  • One of the many boosting variants
  • designed for classification tasks
  • works for any ML method, but assume first that
    instances can be weighted (e.g. the C4.5 algorithm)
  • Weighted examples
  • error = sum(weights of misclassified examples) /
    sum(weights of all examples)
  • weighting forces the ML method to concentrate on
    certain examples (greater need to classify them
    correctly)

41
Adjusting weights
  • Re-weighting
  • decrease the weights of correctly classified
    instances & normalize the weights
  • next iteration: hard instances get more focus
  • a weight tells how often an example has been
    misclassified by the earlier models
  • How much?
  • depends on the overall error e of the classifier
  • w ← w · e/(1-e) (a small error → a small multiplier)
  • note: if e > 0.5 we stop the algorithm

42
Classification, non-weighted case
  • Weighting the classifiers
  • classifiers with a small error should get more
    votes
  • vote weight = -log(e/(1-e)) (range 0..infinity)
  • ML algorithms without weighted instances
  • replicate examples according to their weights
  • weighted resampling
  • small weight → may not be present in the training
    data
  • error > 0.5 → restart from a fresh sample
  • allows more boosting iterations than the original
    method (sketch below)
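
A hedged sketch of AdaBoost.M1 with weightable base learners, using scikit-learn's sample_weight as a stand-in for a natively weight-aware scheme: weights of correct instances are multiplied by e/(1-e) and each model votes with weight -log(e/(1-e)), as on the slides. The dataset and iteration count are arbitrary.

```python
# Hedged AdaBoost.M1 sketch for a 2-class problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
w = np.ones(n) / n                          # start with equal instance weights
models, alphas = [], []

for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    wrong = stump.predict(X) != y
    e = w[wrong].sum() / w.sum()            # weighted error
    if e == 0 or e >= 0.5:
        break                               # stop as described on the slides
    models.append(stump)
    alphas.append(-np.log(e / (1 - e)))     # vote weight: small error -> large vote
    w[~wrong] *= e / (1 - e)                # shrink weights of correct instances
    w /= w.sum()                            # normalize

# weighted vote: add each model's vote weight to the class it predicts
scores = np.zeros((n, 2))
for m, a in zip(models, alphas):
    scores[np.arange(n), m.predict(X)] += a
print("training accuracy: %.3f" % (scores.argmax(axis=1) == y).mean())
```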

43
Properties of boosting
  • Studied in computational learning theory
  • guaranteed performance improvement bounds
  • fact: the error on training data → 0 (and fast)
  • on test data boosting fails if
  • the component models are too complex, or
  • e > 0.5 is reached too quickly
  • balance between complexity & fit

44
Properties of boosting...
  • Continuing the iterations
  • after the error of the combined classifier reaches 0
  • may still improve performance on test data
  • Does this contradict Occam's razor?
  • not necessarily
  • we are improving our confidence in the model
  • margin = P(estimated class) - P(next most likely
    class)
  • boosting may increase this margin long after the
    overall training error is 0

45
Properties of boosting...
  • Weak learning → strong learning
  • if we have many simple classifiers with e < 0.5
  • we can combine them into a very accurate classifier
    (with good probability)
  • easy to find weak models for 2-class problems
  • decision stump = one-level DT
  • other boosting variants for multiclass situations
  • Boosting may sometimes fail (due to overfitting)
  • the combined model is less accurate than a single
    model
  • bagging does not fail in this way

46
Stacking
  • Stacked generalization
  • difficult to analyze theoretically
  • no generally accepted best way of doing it
  • not normally used to combine models of the same
    type
  • Combining different models
  • voting: probable that the correct one gets outvoted
  • add a meta learner on top of the components
  • it learns how to best combine the outputs (which
    components are reliable, and when)

47
Meta model, a.k.a. level-1 model
  • Input: predictions of the level-0 models
  • Training?
  • how to transform level-0 data into level-1 data?
  • obvious way: feed the training data into the models,
    collect the outputs, combine with the actual class
  • leads to rules like "believe A, ignore B & C"
  • may be appropriate only for the training data
  • in general we learn to prefer overfitting models

48
Better estimates
  • We already have them (chapter 5)
  • a separate hold-out set for validation
  • level-1 data is formed from the validation set
  • cross-validation
  • leave-one-out → one level-1 example from each
    level-0 example
  • slow, but gives full use of the training data
  • using probabilities
  • replace nominal classifications with predicted
    class probabilities (k numbers for k classes)
  • the level-1 model then knows the confidences of the
    level-0 models (sketch below)
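
A sketch of building level-1 data from cross-validated class probabilities of a few level-0 models and fitting a simple (linear) level-1 learner; the particular models, dataset, and scikit-learn helpers are assumptions.

```python
# Hedged stacking sketch: level-1 attributes are the cross-validated class
# probabilities of the level-0 models, so the meta learner never sees
# predictions made on data the level-0 models were trained on.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
level0 = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]

# level-1 data: k probability columns per level-0 model
Z = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in level0])

meta = LogisticRegression(max_iter=1000).fit(Z, y)
print("level-1 training accuracy: %.3f" % meta.score(Z, y))

# at prediction time the level-0 models would be refit on all the data and
# their probability outputs fed through the fitted level-1 model
```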

49
Level-1 learner
  • What models are most suitable?
  • any method in principle
  • most of the work should already be done at level
    0 → simple methods should do at level 1
  • Wolpert: "relatively global, smooth" models
  • linear models work well in practice

50
Error-correcting output codes
  • Aim
  • improve the performance of classification methods
  • in multiclass problems
  • some methods work only on 2-class tasks
  • apply iteratively (A vs. the rest), (B vs. the rest),
    ... & combine
  • error-correcting codes can be used to do most of
    this transformation
  • they are useful even when the method handles multiple
    classes directly

51
k-class task → 2-class tasks
  • Create k (copied) datasets
  • a new binary class attribute for each set i = 1..k
  • yes = class i, no = class ≠ i
  • learn a classifier for each
  • classification:
  • all models output their confidence in "yes"
  • select the one with the highest confidence
  • sensitive to the accuracy of the confidence estimates
    (over-confidence)

52
Example case
  • 4 classes a, b, c, d → yes/no (0/1) bits
  • direct transformation
  • 4 4-bit code words: 1000, 0100, 0010, 0001
  • classifiers predict the bits independently
  • errors occur when a wrong bit gets the highest
    confidence
  • alternative coding
  • 7-bit code words (7 classifiers)
  • output (error in the 2nd bit): 1011111 →
  • a is closest w.r.t. Hamming distance (sketch below)
  • the same correction is not possible with the 4-bit
    coding
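
A sketch of the decoding step: the class whose code word is nearest in Hamming distance wins. The 7-bit code words below are one plausible error-correcting assignment for four classes (consistent with class a being 1111111 in the slide's example), not necessarily the exact table used in the lecture.

```python
# Hedged ECOC sketch: one binary classifier per bit; decode by nearest code word.
import numpy as np

# hypothetical 7-bit error-correcting code words for classes a, b, c, d
code = {
    "a": np.array([1, 1, 1, 1, 1, 1, 1]),
    "b": np.array([0, 0, 0, 0, 1, 1, 1]),
    "c": np.array([0, 0, 1, 1, 0, 0, 1]),
    "d": np.array([0, 1, 0, 1, 0, 1, 0]),
}

def decode(output_bits):
    # pick the class whose code word has the smallest Hamming distance
    dist = {c: int(np.sum(cw != output_bits)) for c, cw in code.items()}
    return min(dist, key=dist.get), dist

# the slide's example: class a predicted with an error in the 2nd bit
print(decode(np.array([1, 0, 1, 1, 1, 1, 1])))   # ('a', {'a': 1, 'b': 3, 'c': 3, 'd': 5})
```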

53
What makes a code error-correcting?
  • Row separation
  • Hamming distance between code words:
  • d(c1,c2) ≥ 2d + 1 → can correct all errors of d
    bits or fewer
  • Column separation
  • columns & their complements should be different
  • otherwise classifiers will make the same errors →
    more simultaneous errors → harder to correct
  • Note: at least 4 classes are required to build an
    error-correcting code

54
Properties of error-correcting codes
  • Exhaustive code for k classes
  • columns: every possible k-bit string
  • excluding complements and the trivial strings 1^k
    and 0^k
  • each code word is 2^(k-1) - 1 bits long
  • the number of columns increases exponentially with k
  • Instance-based learning?
  • prediction is based on nearby instances → all output
    bits would come from the same instances
  • circumvention: use a different attribute set for
    each output bit