Title: Engineering the input and output
Engineering the input and output
- Successful DM
- more than just selecting the algorithm
- most methods have many parameters
- the appropriate choice depends on the data
- How to choose?
- a straight brute-force approach?
- not usually the best (only training data available for comparison)
- separate test data / cross-validation
- This chapter: other important processes that can improve the success of DM
Data engineering
- Bag of tricks
- not sure whether they work in a given case or not
- yet better to understand what they are like
- Input
- make it more suitable for an ML scheme
- attribute selection, discretization
- data cleansing
- not addressed: invention of synthetic attributes
- Output
- make the result more effective
- combine different models
7.1 Attribute selection
- Many ML methods try to find the important attributes on their own
- decision trees: splitting
- Effect of irrelevant attributes
- DT: adding a random binary attribute makes results 5-10% worse
- the chance of choosing it increases at the lower levels of the tree (less data)
- similarly with separate-and-conquer rule learning
- instance-based methods suffer a lot (distance computation)
- naive Bayes is quite robust
- the independence assumption is valid for random data
- but redundant (dependent) attributes cause trouble
Attribute selection...
- Irrelevant attributes clearly cause harm
- Also relevant ones may!
- 2 classes, a new attribute that predicts the class 65% of the time
- classification accuracy becomes 1-5% worse
- reason: when the new attribute is chosen for splitting, later splits have to rely on sparser data
- Selection is important
- manual selection (based on understanding of the problem) should be best
- improves performance (accuracy)
- improves readability
Scheme-independent selection
- Two different selection approaches
- Filter methods
- general, independent of the learning method
- based on general characteristics of the data
- no universal measures of relevance exist
- Wrapper methods
- test different subsets with a known (wrapped) ML method
A simple filter method
- Use just enough attributes that all instances are still distinguishable (sketch below)
- easy to state, yet computationally expensive to find
- statistically unwarranted (relies only on the training data)
- prone to noise (→ overfitting)
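A brute-force sketch of this filter in Python; the function name and toy data are illustrative, not from the text, and the exponential subset search is exactly why the method is computationally expensive:

```python
from itertools import combinations

def smallest_distinguishing_subset(instances):
    """Search for the smallest attribute subset under which no two
    instances share identical values (exponential-time illustration)."""
    n_attrs = len(instances[0])
    for size in range(1, n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            projected = [tuple(row[i] for i in subset) for row in instances]
            if len(set(projected)) == len(projected):  # all instances still differ
                return subset
    return tuple(range(n_attrs))                       # need every attribute

# toy data: 4 instances, 3 attributes
data = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y"), ("b", 2, "x")]
print(smallest_distinguishing_subset(data))            # e.g. (0, 1)
```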
Using ML algorithms...
- Decision trees
- build a DT on the training data
- select the attributes that are actually used in the tree
- use these attributes with any other ML scheme
- 1-Rule algorithm
- the user decides how many attributes are used
- 1R finds the best ones (one by one)
- note: 1R is error-based, not necessarily the optimal choice
...Using ML algorithms
- Instance-based methods
- sample training instances
- examine the closest neighbours
- same class → near hit
- differs in some attribute value → that attribute appears to be irrelevant → less weight
- different class → near miss
- differs → relevant attribute → more weight
- selection: keep only the attributes with positive weights (sketch below)
- will not detect redundant (dependent) attributes (both are either selected or rejected)
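A minimal sketch of this near-hit/near-miss weighting in the spirit of the Relief algorithm, assuming numeric attributes in a NumPy array X and class labels in y; sampling size and tie handling are simplified:

```python
import numpy as np

def relief_weights(X, y, n_samples=50, rng=np.random.default_rng(0)):
    """Sample instances, find the nearest 'hit' (same class) and 'miss'
    (different class), and adjust attribute weights accordingly."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        hits = np.where(y == y[i])[0]
        misses = np.where(y != y[i])[0]
        hit = hits[np.argmin(dist[hits])]       # nearest instance of the same class
        miss = misses[np.argmin(dist[misses])]  # nearest instance of another class
        # differs on a near hit  -> looks irrelevant -> decrease weight
        # differs on a near miss -> looks relevant   -> increase weight
        w -= np.abs(X[i] - X[hit])
        w += np.abs(X[i] - X[miss])
    return w

# keep only attributes with positive weight:
# selected = np.where(relief_weights(X, y) > 0)[0]
```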
Searching the attribute space
- Subset lattice
- the structure created by removing/adding attributes
- search either one- or two-directional
- forward selection: start from the empty set
- tentatively add one attribute, evaluate (e.g. by cross-validation)
- select the best, continue, or stop if there is no improvement (sketch below)
- backward elimination: start from the full set
- easy to add a bias towards small attribute sets
- threshold value for the required performance gain
- best-first, beam search, GA, ...
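A greedy forward-selection sketch that uses cross-validation as the evaluator; scikit-learn and the GaussianNB stand-in are assumptions, not part of the text:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(X, y, estimator=GaussianNB(), cv=5):
    """Start from the empty set, tentatively add each remaining attribute,
    keep the one that improves cross-validated accuracy the most, and stop
    when no addition helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [a]], y, cv=cv).mean(), a)
                  for a in remaining]
        score, attr = max(scores)
        if score <= best_score:          # no improvement -> stop
            break
        best_score = score
        selected.append(attr)
        remaining.remove(attr)
    return selected, best_score
```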
Scheme-specific selection
- Performance measure
- use the given ML scheme on the chosen attributes
- exhaustive search: 2^k choices for k attributes
- Experiments tell that
- backward elimination: larger sets, more accurate
- forward selection: smaller sets, more readable
- reason: we usually stop too early (optimistic error estimates)
- sophisticated search methods
- are not generally better (no uniform performance gain)
- hard to predict when they are worthwhile to use
Decision tables
- A classifier type for which scheme-specific selection is essential
- the entire learning problem is which attributes to select
- usually done by cross-validating different subsets
- validation is computationally cheap
- the table structure stays the same all the time
- only the class counts change
Success story
- Selective Naive Bayes
- Naive Bayes + forward selection
- forward selection detects redundant attributes better than backward elimination
- naive evaluation metric: performance on the training data
- Experiments
- improves performance on many standard test cases
- no negative effects
7.2 Discretizing numeric attributes
- Why?
- some ML methods work only on nominal data
- some deal with numeric attributes, but not satisfactorily
- assumption of a normal distribution
- DT: (repetitive) sorting required
- 1R discretization
- sort, place boundaries at change points of the class value (require some minimum number of points per interval)
- a global method (applied to all the data)
- DT discretization
- a local decision on the best (2-way) split point
Local or global?
- Local
- tailored to the actual context
- different discretizations in different places
- less reliable with small datasets
- Global (prior to learning)
- has to make one general decision
- numeric data is ordered → should the resulting nominal data be too?
Ordering information
- Order is potentially valuable knowledge
- how to express it to an ML scheme that does not understand ordering?
- Transformation (sketch below)
- replace an attribute with k ordered values by k-1 binary attributes
- original value i:
- set the first i-1 new attributes to 0, the rest to 1
- if the DT splits on the i-th binary attribute, it actually uses the ordering information
- note: independent of the original discretization method
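A tiny sketch of this coding; the function name is hypothetical and values are assumed to be indexed 1..k:

```python
def encode_ordered(value_index, k):
    """Encode the i-th of k ordered values as k-1 binary attributes:
    the first i-1 positions are 0 and the rest are 1, so a split on any
    single binary attribute corresponds to a threshold on the ordering."""
    return [0] * (value_index - 1) + [1] * (k - value_index)

# temperature discretized into 4 ordered values: cold < mild < warm < hot
# value 3 ("warm") becomes [0, 0, 1]
print(encode_ordered(3, 4))
```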
Unsupervised discretization
- Unsupervised/supervised
- discretization made without/with knowledge of the class value
- Obvious way
- divide the range into a fixed number of equal intervals
- may lose information (or create noise)
- too coarse intervals
- unfortunate choices of boundary values
Unsupervised discretization...
- Equal-interval binning
- the previous way
- may give an uneven distribution of examples
- Equal-frequency binning
- the histogram should be flat
- may still separate members of one class with badly placed boundaries (sketch below)
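A small sketch of both binning schemes with NumPy; the sample values and the choice of k = 3 bins are illustrative:

```python
import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85], float)
k = 3

# equal-interval (equal-width) binning: split the range into k equal parts
width_edges = np.linspace(values.min(), values.max(), k + 1)
width_bins = np.digitize(values, width_edges[1:-1])

# equal-frequency binning: put (roughly) the same number of values in each bin
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))
freq_bins = np.digitize(values, freq_edges[1:-1])

print(width_bins)  # bins may hold quite different numbers of examples
print(freq_bins)   # bins hold ~equal counts, but boundaries ignore the class
```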
Entropy-based discretization
- Recursively split intervals
- apply the DT method to find the initial split
- repeat this process in both parts
- Fact
- the cut point minimizing the information value never appears between two consecutive examples of the same class
- → we can reduce the number of candidate points
When to stop recursion?
- Use the MDL principle (sketch below)
- no split: encode the example classes as such
- split:
- the theory (splitting point) takes log2(N-1) bits to encode (N = number of instances)
- encode the classes in both partitions
- optimal situation
- all instances below the split are yes, all above are no
- each instance costs 1 bit without the split and almost 0 bits with it
- there is a formula for the MDL-based gain threshold
- note: for the temperature example no splitting is accepted at all → a single value
- no discretization → quite an irrelevant attribute!
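A compact sketch of the recursive procedure with an MDL-style stopping rule, following the Fayyad & Irani formulation as an assumption on my part; for simplicity it checks every boundary rather than only the class-change points:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mdl_splits(values, labels):
    """Find the cut point minimizing class entropy, accept it only if the
    information gain exceeds the MDL-based threshold, then recurse."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)
    best = None
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue
        e = (i * entropy(y[:i]) + (n - i) * entropy(y[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:
        return []
    e_split, i = best
    gain = entropy(y) - e_split
    k, k1, k2 = len(set(y)), len(set(y[:i])), len(set(y[i:]))
    delta = np.log2(3**k - 2) - (k * entropy(y) - k1 * entropy(y[:i]) - k2 * entropy(y[i:]))
    if gain <= (np.log2(n - 1) + delta) / n:   # MDL says: do not split
        return []
    cut = (v[i - 1] + v[i]) / 2
    return mdl_splits(v[:i], y[:i]) + [cut] + mdl_splits(v[i:], y[i:])
```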
Other methods
- Entropy + MDL is one of the best general methods for supervised discretization
- Bottom-up discretization
- consider merges of adjacent intervals
- find the best pair, merge if good enough
- Error-based
- assign the majority class to each interval
- compute the prediction error
- problem: the "best" result gives each example its own interval → fix some maximum number of intervals k
Error-based methods...
- The brute-force method is exponential in k
- Dynamic programming
- applicable to any impurity function
- finds a partition of N instances into k subsets minimizing the impurity in time O(kN²)
- e.g. impurity = entropy
- O(kN) for the error-based impurity function
Error-based vs entropy-based
- Error-based
- finds the optimal discretization very quickly
- but cannot produce adjacent intervals with the same (majority) class value
- Do we need such intervals (Fig 7.4)?
- discretize a1 and a2
- best intervals for a1: 0..0.3, 0.3..0.7, 0.7..1.0
- majority classes for a1: dot, ?, triangle
- the ? must be either of them → error-based methods would merge the middle interval with a neighbour
- the majority class does not change at 0.3, but the class distribution does
- entropy-based methods are sensitive to these changes
From discrete to numeric?
- Some methods work only on numeric data
- nearest neighbour, regression
- Distance: 0/1 for equal/different values
- the same effect by a transformation (sketch below)
- A with k values → k binary attributes A1,...,Ak
- Ai = 1 iff the value of A is i
- equal weights → no change to the distance function
- weighting can be used to express shades of difference
- Ordered values
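A tiny sketch of the transformation (names hypothetical); with equal weights the ordinary distance function then reproduces the 0/1 nominal distance up to a constant factor:

```python
def one_hot(value, domain):
    """Turn a nominal value into k binary attributes A1..Ak
    (Ai = 1 iff the value is the i-th one in the domain)."""
    return [1 if value == v else 0 for v in domain]

domain = ["red", "green", "blue"]
a, b = one_hot("red", domain), one_hot("blue", domain)
# Manhattan distance between two different nominal values is always 2,
# and 0 for equal values, i.e. the 0/1 distance up to scaling.
print(sum(abs(x - y) for x, y in zip(a, b)))
```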
7.3 Automatic data cleansing
- Real-life data is bound to contain errors
- in both attribute and class values
- manual checking is impossible (data size)
- DM techniques may help
- Improving decision trees (sketch below)
- build a DT, discard the misclassified training data, relearn
- repeat until there are no misclassified examples
- surprisingly often the result is
- a simpler DT
- no significant change in accuracy (+/-)
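A sketch of this cleansing loop, with scikit-learn's DecisionTreeClassifier standing in for the book's tree learner and min_samples_leaf as a crude stand-in for pruning:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def clean_by_relearning(X, y, max_rounds=10):
    """Fit a tree, drop the training instances it misclassifies, refit,
    and repeat until everything left is classified correctly
    (or a round limit is hit)."""
    X, y = np.asarray(X), np.asarray(y)
    for _ in range(max_rounds):
        tree = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)
        keep = tree.predict(X) == y
        if keep.all():                    # no misclassified examples left
            return tree, X, y
        X, y = X[keep], y[keep]
    return tree, X, y
```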
Improving decision trees...
- Why does the previous method work?
- pruning asks: is the subtree justified by the data?
- the decision to ignore data misclassified by the new tree
- is local to the pruned node
- removing misclassified data
- propagates these ignorance decisions to the whole tree
- if the pruning strategy is good, this should not harm
- improvements are possible through better attribute selection with the cleaned data
- one can also present the misclassified examples to a human expert, who removes or corrects them
Improving decision trees...
- Assumption: misclassifications are not systematic
- e.g. exchanged class values
- a DT would probably learn the systematic error
- Experiment
- add noise to the attribute values in the test data
- better results when similar noise is also added to the training data
- idea: there is no use learning with clean data if performance is measured on a dirty test set
- the DT learns which attributes are unreliable and how to combine them
Robust regression
- Outliers in statistics
- detection (e.g. visually) and manual removal
- is an outlier an error or not?
- outliers have a large effect on MSE (squared error)
- Robust methods
- outlier-tolerant statistical methods
- other error functions than MSE (e.g. MAE)
- automatic detection and removal
- create a regression model, remove the 10% farthest points, relearn
- minimize the median instead of the mean error (sketch below)
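A rough least-median-of-squares sketch for simple one-variable regression, assuming NumPy arrays x and y; it only searches lines through pairs of data points, which is a common approximation rather than the exact method, and it is clearly expensive:

```python
import numpy as np
from itertools import combinations

def lms_line(x, y):
    """Try the line through every pair of points and keep the one whose
    median squared residual is smallest (robust to outliers)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                       # vertical line, skip
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        med = np.median((y - (slope * x + intercept)) ** 2)
        if best is None or med < best[0]:
            best = (med, slope, intercept)
    return best[1], best[2]
```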
Robust regression...
- Example case: phone call data
- 1964...1969: minutes of calls
- other years: numbers of calls
- a large fraction of outliers in the y values
- MSE gives quite a bad result
- median SE (least median of squares) works remarkably well
- it finds the narrowest strip covering half of the data
- the median model is the center of this strip
- downside: computationally (too) expensive
Detecting anomalies...
- Is the error in the model or in the data?
- are we justified in removing seemingly erroneous examples?
- visualization may help with regression models
- but not with all models
- how to visualize a rule set, for example?
- misclassified data
- can usually be removed from a DT training set
- but we never know if that is the case with our data
...Detecting anomalies
- One solution attempt
- try several different ML schemes
- use their combined results to filter the data
- conservative: remove only examples that all schemes misclassify
- voting (danger of outvoting the right scheme)
- training with the filtered data may yield even better results
- Danger with filtering approaches
- some classes may get sacrificed in order to get better results for other classes
- A human expert is still the winner
- filtering out suspects reduces the manual work
7.4 Combining multiple models
- Aim: make decisions more reliable
- consult several experts on the area
- General combination models
- bagging (bootstrap aggregating)
- boosting
- stacking
- k-class (k > 2) classification problems
- error-correcting codes
Combining results
- In general
- how to convert several predictions into (a hopefully better) one
- Approaches
- (weighted) vote/average
- bagging: each model has equal weight
- boosting: successful experts get more weight
Bagging
- Introductory example
- t random training samples
- build a DT for each sample
- the trees are usually not identical
- attribute selection is sensitive to the data
- there are instances for which some trees are correct and some are not
- voting usually gives a better result than any of the DTs alone
Bias-variance decomposition...
- A theoretical basis for analyzing the effect of combining models
- build an infinite number of classifiers from an infinite number of independent training sets
- process test instances with each and vote
- Expected error: bias
- tells how well the chosen model fits the data
- the average error of the combined classifier
- the persistent error of the learning algorithm, which cannot be eliminated by sampling more data
...Bias-variance decomposition
- Error due to the chosen sample: variance
- samples are finite → they do not fully represent the population
- averaged over all training sets of the given size and all test sets
- Total expected error = bias + variance
- combining models reduces the variance
- In practice there is only one training set ... what to do?
Back to bagging
- Simulate an infinite number of training sets
- by resampling the same training data
- delete and replicate instances by sampling with replacement (like in the bootstrap method)
- apply the learning scheme to each sample and vote (sketch below)
- An approximation of the idealized procedure
- the training sets are not independent
- but it still works remarkably well
- often significant improvements
- never substantially worse
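A bagging sketch assuming NumPy arrays and scikit-learn decision trees as the unstable base learner:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, t=25, rng=np.random.default_rng(0)):
    """Draw t bootstrap samples (with replacement) from the training data,
    fit a decision tree to each, and let the trees vote with equal weight."""
    n = len(X_train)
    all_preds = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        all_preds.append(tree.predict(X_test))
    # majority vote across the t trees, one column per test instance
    return [Counter(col).most_common(1)[0][0] for col in zip(*all_preds)]
```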
Bagging numeric prediction
- Just average the results of the different predictions
- Bias-variance decomposition?
- error: expected value of the MSE
- bias: MSE of the predictor averaged over models built from all possible datasets of the same size
- variance: the remaining component of a single model's expected error, due to the particular training set
- fact: bagging always reduces the expected total error (not true for classification problems)
Boosting
- Bagging
- works due to the inherent instability of learning models
- does not work for stable models (insensitive to small changes in the data)
- e.g. linear regression
- Boosting
- explicitly searches for complementing models
Boosting vs. bagging
- Similarities
- uses voting/averaging
- combines several models of the same type
- Differences
- boosting is iterative: later models depend on earlier ones
- new models should become experts in areas where the earlier models fail
- boosting weights models based on their performance
AdaBoost.M1
- One of the many boosting variants
- designed for classification tasks
- works with any ML method, but we assume first that instances can be weighted (e.g. the C4.5 algorithm)
- Weighted examples
- error e = sum(weights of misclassified examples) / sum(weights of all examples)
- weighting forces the ML method to concentrate on certain examples (a greater need to classify them correctly)
Adjusting weights
- Re-weighting
- decrease the weights of correctly classified instances, then normalize the weights
- next iteration: the hard instances get more focus
- the weight tells how often an example has been misclassified by the earlier models
- How much?
- depends on the overall error e of the classifier
- w ← w · e/(1-e) (small error → small multiplier)
- note: if e > 0.5 we stop the algorithm
Classification, non-weighted case
- Weighting classifiers
- classifiers with a small error should get more votes
- weight = -log(e/(1-e))  (range 0..infinity)  (sketch below)
- ML algorithms with no weighted instances
- replicate examples according to their weights
- weighted resampling
- small weight → possibly not present in the training data at all
- error > 0.5 → restart from a fresh sample
- more boosting iterations than with the original method
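A sketch of AdaBoost.M1 for the weighted-instance case, pulling together the weight-update rule w ← w·e/(1-e) and the classifier weight -log(e/(1-e)); decision stumps and scikit-learn are stand-ins for the book's weighted C4.5:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, rounds=10):
    """After each round the weights of correctly classified examples are
    multiplied by e/(1-e) and renormalized; the classifier itself gets
    voting weight -log(e/(1-e))."""
    n = len(X)
    w = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        e = w[wrong].sum() / w.sum()
        if e == 0 or e > 0.5:             # perfect or too weak -> stop
            break
        w[~wrong] *= e / (1 - e)          # shrink weights of correct examples
        w /= w.sum()                      # normalize
        models.append(stump)
        alphas.append(-np.log(e / (1 - e)))
    return models, alphas

def boosted_predict(models, alphas, X):
    # each model casts a vote of size alpha for its predicted class
    classes = models[0].classes_
    votes = np.zeros((len(X), len(classes)))
    for m, a in zip(models, alphas):
        votes[np.arange(len(X)), np.searchsorted(classes, m.predict(X))] += a
    return classes[votes.argmax(axis=1)]
```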
Properties of boosting
- Studied in computational learning theory
- guaranteed performance improvement bounds
- fact: the error on the training data → 0 (and fast)
- on test data boosting fails if
- the component models are too complex, or
- e exceeds 0.5 too quickly
- balance between complexity and fit
Properties of boosting...
- Continuing iterations
- after the error of the combined classifier reaches 0
- may still improve performance on test data
- Does this contradict Occam's razor?
- not necessarily
- we are improving our confidence in the model
- margin = P(estimated class) - P(next most likely class)
- boosting may increase this margin long after the overall training error is 0
Properties of boosting...
- Weak learning → strong learning
- if we have many simple classifiers with e < 0.5
- we can combine them into a very accurate classifier (with good probability)
- it is easy to find weak models for 2-class problems
- decision stump: a one-level DT
- other boosting variants for multiclass situations
- Boosting may sometimes fail (due to overfitting)
- the combined model is less accurate than a single model
- bagging does not fail this way
Stacking
- Stacked generalization
- difficult to analyze theoretically
- no generally accepted best way of doing it
- not normally used to combine models of the same type
- Combining different models
- voting: probable that the correct one gets outvoted
- add a meta-learner atop the components
- it learns how to best combine the outputs (which components are reliable and when)
Meta model, a.k.a. level-1 model
- Input: the predictions of the level-0 models
- Training?
- how to transform level-0 data into level-1 data?
- obvious way: feed the training data into the models, collect the outputs, combine with the actual class
- leads to rules like "believe A, ignore B and C"
- which may be appropriate only for the training data
- in general we learn to prefer overfitting models
Better estimates
- We already have them (chapter 5)
- a separate hold-out set for validation
- the level-1 data is formed from the validation set
- cross-validation (sketch below)
- leave-one-out → a level-1 example from each level-0 example
- slow, but gives full use of the training data
- using probabilities
- replace nominal classifications with the predicted class probabilities (k numbers for k classes)
- the level-1 model then knows the confidences of the level-0 models
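A stacking sketch with scikit-learn: cross-validated class probabilities from the level-0 models form the level-1 data, and a simple linear model (logistic regression) serves as the level-1 learner; the particular level-0 models are illustrative:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def fit_stacking(X, y, level0=None, cv=10):
    """Build level-1 data from cross-validated predict_proba outputs, so the
    meta-learner never sees predictions made on a model's own training folds."""
    level0 = level0 or [GaussianNB(), DecisionTreeClassifier(), KNeighborsClassifier()]
    meta_features = np.hstack([
        cross_val_predict(m, X, y, cv=cv, method="predict_proba") for m in level0
    ])
    meta = LogisticRegression(max_iter=1000).fit(meta_features, y)  # level-1 model
    fitted = [m.fit(X, y) for m in level0]     # refit level-0 models on all data
    return fitted, meta

def stack_predict(fitted, meta, X_new):
    feats = np.hstack([m.predict_proba(X_new) for m in fitted])
    return meta.predict(feats)
```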
Level-1 learner
- What models are most suitable?
- any method in principle
- most of the work should already be done at level 0 → simple methods should do at level 1
- Wolpert: "relatively global, smooth" models
- linear models work well in practice
Error-correcting output codes
- Aim
- improve the performance of classification methods
- in multiclass problems
- some methods work only with 2-class tasks
- use them repeatedly: (A, all the rest), (B, all the rest), ... and combine
- error-correcting codes can be used to do most of this transformation
- they are useful even when the method works with multiple classes
k-class task → 2-class tasks
- Create k (copied) datasets
- a new binary class attribute for each set i = 1..k
- yes: class = i, no: class ≠ i
- learn a classifier for each set (sketch below)
- classification
- all models output their confidence in yes
- select the one with the highest confidence
- sensitive to accuracy (over-confidence)
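A one-vs-rest sketch, with logistic regression as an illustrative base learner that outputs confidences:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y):
    """One binary 'class i vs. everything else' model per class."""
    classes = np.unique(y)
    models = {c: LogisticRegression(max_iter=1000).fit(X, y == c) for c in classes}
    return classes, models

def one_vs_rest_predict(classes, models, X_new):
    # column i = confidence of model i that the instance belongs to class i;
    # the class whose model is most confident in 'yes' wins
    conf = np.column_stack([models[c].predict_proba(X_new)[:, 1] for c in classes])
    return classes[conf.argmax(axis=1)]
```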
Example case
- 4 classes a, b, c, d → yes/no (0, 1)
- direct transformation
- four 4-bit code words: 1000, 0100, 0010, 0001
- classifiers predict the bits independently
- errors occur when a wrong bit gets the highest confidence
- alternative coding
- 7-bit code words (7 classifiers)
- output (error in the 2nd bit) 1011111 → ?
- a is closest w.r.t. Hamming distance (sketch below)
- the same correction is not possible in the 4-bit coding
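A tiny decoder sketch; the 7-bit code words below are one assignment with pairwise Hamming distance 4, so any single-bit error is corrected, consistent with the example above:

```python
# one 7-bit code for classes a-d with pairwise Hamming distance 4
CODE = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def decode(output_bits):
    """Pick the class whose code word is closest to the classifiers' output."""
    return min(CODE, key=lambda c: hamming(CODE[c], output_bits))

print(decode("1011111"))   # error in the 2nd bit, still decoded as "a"
```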
What makes a code error-correcting?
- Row separation
- Hamming distance between code words
- d(c1,c2) ≥ 2d+1 → can correct all errors of d bits or fewer
- Column separation
- the columns and their complements should all be different
- otherwise two classifiers will make the same errors → more simultaneous errors → harder to correct
- Note: at least 4 classes are required to build such a code
Properties of error-correcting codes
- Exhaustive code for k classes
- columns: every possible k-bit string
- excluding complements and the all-1/all-0 strings 1^k and 0^k
- each code word takes 2^(k-1) - 1 bits
- the number of columns increases exponentially
- Instance-based learning?
- prediction is based on nearby instances → all output bits come from the same instances
- circumvention: a different attribute set for each output bit