Engineering the input and output

1
Engineering the input and output
  • Successful DM
  • more than just selecting the algorithm
  • most methods have many parameters
  • appropriate choice depends on the data
  • How to choose?
  • straight brute-force approach?
  • not usually the best (only training data
    available for comparison)
  • separate test data / cross-validation
  • This chapter: other important processes that can
    improve the success of DM

2
Data engineering
  • Bag of tricks
  • not sure if they work or not
  • yet better to understand what they are like
  • Input
  • make it more suitable for a ML scheme
  • attribute selection & discretization
  • data cleansing
  • not addressed: invention of synthetic attributes
  • Output
  • make the result more effective
  • combine different models

3
7.1 Attribute selection
  • Many ML methods try to find important
    attributes on their own
  • decision trees: splitting
  • Effect of irrelevant attributes
  • DT: adding a random binary attribute → 5-10% worse
  • the chance of picking it increases at lower levels
    of the tree
  • similarly with separate-and-conquer rule learning
  • instance-based methods suffer a lot (distance)
  • naive Bayes is quite robust
  • independence assumption is valid for random data
  • but redundant (dependent) attributes cause trouble

4
Attribute selection...
  • Irrelevant attributes clearly cause harm
  • Also relevant ones may!
  • 2 classes, a new attribute that agrees with the
    class 65% of the time
  • classification accuracy becomes 1-5% worse
  • reason: when the new attribute is chosen for
    splitting, later splits have to rely on sparser
    data
  • Selection is important
  • manual selection (based on understanding of the
    problem) should be best
  • improves performance (accuracy)
  • improves readability

5
Scheme-independent selection
  • Two different selection approaches
  • Filter methods
  • general, independent of learning method
  • based on general characteristics of the data
  • no universal measures for relevance exist
  • Wrapper methods
  • test different subsets with a known (wrapped) ML
    method

6
A simple filter method
  • Use just enough attributes to keep all instances
    distinguishable from each other
  • a simple idea, yet computationally expensive to find
  • statistically unwarranted (relies on training
    data)
  • prone to noise (→ overfitting)

7
Using ML algorithms...
  • Decision trees
  • build a DT on the training data
  • select the attributes that are actually used in the
    tree
  • use these attributes with any other ML scheme
    (sketch below)
  • 1-Rule algorithm
  • the user decides how many attributes are used
  • 1R finds the best ones (one by one)
  • note: 1R is error-based, not necessarily the optimal
    choice
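
A minimal sketch of the decision-tree filter above, assuming scikit-learn (the slides do not prescribe a library): attributes that receive non-zero importance in a fitted tree are exactly those used for splitting, and only they are handed to another scheme (here naive Bayes).

```python
# Hedged sketch: keep the attributes a decision tree actually splits on,
# then reuse them with a different learning scheme (naive Bayes).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
used = np.flatnonzero(tree.feature_importances_ > 0)   # attributes used in the tree
print("attributes kept:", used)

score = cross_val_score(GaussianNB(), X[:, used], y, cv=5).mean()
print("NB accuracy on selected attributes: %.3f" % score)
```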

8
...Using ML algorithms
  • Instance-based methods
  • sample training instances
  • examine the closest neighbours
  • same class: a "near hit"
  • differs in some attribute value → that attribute
    appears to be irrelevant → less weight
  • different class: a "near miss"
  • differs → relevant attribute → more weight
  • selection: keep only the attributes with positive
    weights (sketch below)
  • will not detect dependent attributes (both are
    either selected or rejected)
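
The near-hit / near-miss weighting above is essentially the Relief idea. Below is a rough sketch for numeric attributes (sampling, scaling and tie handling simplified); the function name and data are hypothetical.

```python
# Rough Relief-style sketch: sample instances, find the nearest hit and miss,
# and move attribute weights down (differs near a hit) or up (differs near a miss).
import numpy as np

def relief_weights(X, y, n_samples=100, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)              # Manhattan distance to all instances
        dist[i] = np.inf                                  # exclude the instance itself
        same, diff = y == y[i], y != y[i]
        hit = np.argmin(np.where(same, dist, np.inf))     # nearest instance of the same class
        miss = np.argmin(np.where(diff, dist, np.inf))    # nearest instance of another class
        w -= np.abs(X[i] - X[hit]) / n_samples            # differing attribute near a hit -> less weight
        w += np.abs(X[i] - X[miss]) / n_samples           # differing attribute near a miss -> more weight
    return w

# selection: keep only attributes with positive weight
# selected = np.flatnonzero(relief_weights(X, y) > 0)
```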

9
Searching the attribute space
  • Subset lattice
  • structure created by removing/adding attributes
  • search either one- or two-directional
  • forward selection: start from the empty set
  • tentatively add one attribute & evaluate (e.g. by
    cross-validation)
  • select the best & continue, or stop if no
    improvement (sketch below)
  • backward elimination: start from the full set
  • easy to add a bias towards small attribute sets
  • threshold value for the performance gain
  • best-first search, beam search, GA, ...
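
A compact sketch of forward selection as described above; scikit-learn cross-validation and naive Bayes are assumed stand-ins for "the ML scheme". Backward elimination is the mirror image: start from the full set and tentatively drop attributes.

```python
# Hedged sketch of greedy forward selection with cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
learner = GaussianNB()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # evaluate every one-attribute extension of the current subset
    scores = {a: cross_val_score(learner, X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    a_best = max(scores, key=scores.get)
    if scores[a_best] <= best_score:          # stop if no improvement
        break
    selected.append(a_best)
    remaining.remove(a_best)
    best_score = scores[a_best]

print("selected attributes:", selected, "accuracy: %.3f" % best_score)
```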

10
Scheme-specific selection
  • Performance measure
  • use the given ML scheme on the chosen attributes
  • exhaustive search: 2^k choices
  • Experiments tell that
  • backward elimination: larger sets, more accurate
  • forward selection: smaller sets, more readable
  • reason: we usually stop too early (optimistic
    error estimates)
  • sophisticated search methods
  • are not generally better (no uniform performance
    gain)
  • hard to predict when they are worthwhile to use

11
Decision tables
  • Classifier type for which scheme-specific
    selection is essential
  • the entire learning problem is which attributes to
    select
  • usually done by cross-validating different
    subsets
  • validation is computationally cheap
  • the table structure stays the same all the time
  • only the class counters change

12
Success story
  • Selective Naive Bayes
  • Naive Bayes + forward selection
  • forward selection detects redundant attributes
    better than backward elimination
  • naive evaluation metric: performance on training
    data
  • Experiments
  • improves performance on many standard test cases
  • no negative effects

13
7.2 Discretizing numeric attributes
  • Why?
  • some ML methods work only on nominal data
  • some deal with numeric data, but not satisfactorily
  • e.g. an assumption of normal distribution
  • DT: (repetitive) sorting required
  • 1R discretization
  • sort, place boundaries where the class value changes
    (require some minimum number of points per interval)
  • a global method (applied to all data)
  • DT discretization
  • a local decision on the best (2-way) split point

14
Local or global?
  • Local
  • tailored to the actual context
  • different discretizations in different places
  • less reliable with small datasets
  • Global (prior to learning)
  • has to make one general decision
  • numeric data is ordered → is the discretized
    (nominal) version too?

15
Ordering information
  • Order = potentially valuable knowledge
  • how to express it to a ML scheme that does not
    understand ordering?
  • Transformation
  • replace an attribute with k values by k-1 binary
    attributes
  • original value i:
  • set the first i-1 new attributes to 0, the rest to 1
  • if a DT splits on the ith attribute, it actually uses
    the ordering information
  • note: independent of the original discretization
    method (sketch below)
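
A small sketch of this transformation, following the convention stated above (original value i: first i-1 new attributes 0, the rest 1); the attribute levels are made up for illustration.

```python
# Encode an ordered attribute with k values as k-1 binary attributes so that
# a single split on one of them corresponds to a threshold on the ordering.
import numpy as np

def ordered_to_binary(values, ordered_levels):
    k = len(ordered_levels)
    idx = np.array([ordered_levels.index(v) for v in values])   # 0-based position i-1
    # column j is 1 iff the value's position is <= j  (first i-1 columns 0, rest 1)
    return (idx[:, None] <= np.arange(k - 1)[None, :]).astype(int)

levels = ["cool", "mild", "hot"]          # hypothetical ordered attribute
print(ordered_to_binary(["cool", "hot", "mild"], levels))
# [[1 1]
#  [0 0]
#  [0 1]]
```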

16
Unsupervised discretization
  • Unsupervised/supervised
  • discretization made without/with knowledge of the
    class value
  • Obvious way
  • divide range into a fixed number of equal
    intervals
  • may lose information (or create noise)
  • too coarse intervals
  • unfortunate choices of boundary values

17
Unsupervised discretization...
  • Equal-interval binning
  • the way described on the previous slide
  • → uneven distribution of examples across bins
  • Equal-frequency binning
  • the histogram becomes flat
  • may still separate class members with bad
    boundaries (sketch below)
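
A quick NumPy sketch contrasting the two binning schemes on a skewed, synthetic attribute (the bin count is chosen arbitrarily).

```python
# Equal-interval binning: equally wide bins; equal-frequency binning: bins
# chosen so that roughly the same number of examples falls into each.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=10.0, size=1000)     # skewed numeric attribute
k = 4

equal_interval = np.linspace(x.min(), x.max(), k + 1)          # equal width
equal_frequency = np.quantile(x, np.linspace(0, 1, k + 1))     # equal counts

print("interval boundaries :", np.round(equal_interval, 1))
print("frequency boundaries:", np.round(equal_frequency, 1))
print("counts (interval)   :", np.histogram(x, equal_interval)[0])
print("counts (frequency)  :", np.histogram(x, equal_frequency)[0])
```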

18
Entropy-based discretization
  • Recursively split intervals
  • apply DT method to find the initial split
  • repeat this process in both parts
  • Fact
  • a cut point minimizing the information value never
    appears between two consecutive examples of the same
    class
  • → we can reduce the number of candidate points
    (sketch below)
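
A sketch of one level of entropy-based splitting, checking only class-boundary cut points as justified above; the data is a made-up temperature-style example.

```python
# Find the cut point of a numeric attribute that minimizes the weighted
# class entropy, considering only class-boundary candidates.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if y[i] == y[i - 1] or x[i] == x[i - 1]:
            continue                       # never cut between same-class neighbours
        cut = (x[i] + x[i - 1]) / 2
        e = (i * entropy(y[:i]) + (len(x) - i) * entropy(y[i:])) / len(x)
        if e < best[1]:
            best = (cut, e)
    return best                            # (cut point, weighted entropy)

x = np.array([64, 65, 68, 69, 70, 71, 72, 75, 80, 83], dtype=float)
y = np.array(["y", "n", "y", "y", "y", "n", "n", "y", "n", "y"])
print(best_split(x, y))
```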

19
When to stop recursion?
  • Use the MDL principle
  • no split: encode the example classes as such
  • split:
  • theory: the split point takes log(N-1) bits to
    encode (N = number of instances)
  • encode the classes in both partitions
  • optimal situation
  • all values below the split are yes, all above are no
  • each instance costs 1 bit without splitting and
    almost 0 bits with it
  • formula for the MDL-based gain threshold (sketch
    below)
  • note: temperature example, no splitting at all → a
    single interval
  • no discretization → quite an irrelevant attribute!
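
The gain threshold referred to above is presumably the Fayyad-Irani MDL criterion; a hedged sketch of that test (E, E1, E2 = entropies of the full set and its two parts, k, k1, k2 = numbers of classes present in each).

```python
# Hedged sketch of the MDL-based stopping test (Fayyad & Irani style):
# accept the split only if the information gain exceeds the cost of
# encoding the split point plus the extra class information.
import numpy as np

def mdl_accepts_split(N, gain, E, E1, E2, k, k1, k2):
    """N: number of instances; gain = E - weighted average of E1 and E2;
    k, k1, k2: numbers of distinct classes in the set and its two parts."""
    delta = np.log2(3 ** k - 2) - (k * E - k1 * E1 - k2 * E2)
    threshold = (np.log2(N - 1) + delta) / N
    return gain > threshold
```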

20
Other methods
  • Entropy + MDL: one of the best general methods
    for supervised discretization
  • Bottom-up discretization
  • consider merges of adjacent intervals
  • find the best pair, merge if good enough
  • Error-based
  • assign the majority class to each interval
  • compute the prediction error
  • problem: the error is minimized by giving each value
    its own interval → fix some number of intervals k

21
Error-based methods...
  • Brute-force method: exponential in k
  • Dynamic programming
  • applicable to any impurity function
  • finds the partition of N instances into k subsets
    minimizing the impurity in time O(kN²)
  • e.g. impurity = entropy
  • O(kN) for the error-based impurity function

22
Error-based vs entropy-based
  • Error-based
  • finds the optimal discretization very quickly
  • but cannot produce adjacent intervals with the same
    (majority) class
  • Do we need such intervals (Fig 7.4)?
  • discretize a1 & a2
  • best intervals for a1: 0..0.3, 0.3..0.7, 0.7..1.0
  • majority classes for a1: dot, ?, triangle
  • → the middle one must be either of them → merging
  • the majority does not change at 0.3, but the class
    distribution does
  • entropy-based methods are sensitive to these
    changes

23
From discrete to numeric?
  • Some methods work only on numeric data
  • nearest neighbour, regression
  • Distance: 0 if values are equal, 1 if different
  • the same effect by a transformation:
  • A with k values → k binary attributes A1,...,Ak
  • Ai = 1 iff the value of A is the ith one
  • equal weights → no change to the distance function
    (sketch below)
  • weighting can be used to express shades of
    difference
  • Ordered values
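
A tiny sketch of the k-binary-attribute transformation (Ai = 1 iff the value of A is the ith one): equal values end up at distance 0 and different values at a constant distance, matching the 0/1 idea up to a constant factor. The attribute levels are hypothetical.

```python
# One-hot ("k binary attributes") encoding of a nominal attribute so that
# distance-based methods see equal values as 0 and different values as a constant.
import numpy as np

def one_hot(values, levels):
    return np.array([[1 if v == lev else 0 for lev in levels] for v in values])

levels = ["red", "green", "blue"]                 # hypothetical nominal attribute
Z = one_hot(["red", "blue", "red"], levels)
print(Z)
print("Manhattan distances:", np.abs(Z[0] - Z[1]).sum(), np.abs(Z[0] - Z[2]).sum())  # 2, 0
```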

24
7.3 Automatic data cleansing
  • Real-life data is bound to contain errors
  • both attribute and class values
  • manual checking is impossible (size)
  • DM techniques may help
  • Improving decision trees
  • build a DT, discard misclassified data & relearn
  • repeat until there are no misclassified examples
    (sketch below)
  • surprisingly often:
  • a simpler DT
  • no significant change in accuracy (+/-)
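
A sketch of the filter-and-relearn loop, assuming scikit-learn trees and a standard dataset as stand-ins.

```python
# Hedged sketch: repeatedly remove the training examples the tree misclassifies.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
while True:
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    correct = tree.predict(X) == y
    if correct.all():
        break                              # no misclassified examples left
    X, y = X[correct], y[correct]          # discard misclassified data and relearn

print("examples kept:", len(y), "final tree size:", tree.tree_.node_count)
```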

25
Improving decision trees...
  • Why does the previous method work?
  • pruning: is the subtree justified by the data?
  • a decision to ignore data misclassified by the new
    tree
  • local to the pruned node
  • removing misclassified data
  • propagates such ignorance decisions to the whole tree
  • if the pruning strategy is good, this should not harm
  • improvements are possible through better attribute
    selection with the cleaned data
  • alternatively: present misclassified examples to a
    human expert, who removes or corrects them

26
Improving decision trees...
  • Assumption: misclassifications are not systematic
  • e.g. exchanged class values
  • DT would probably learn the systematic error
  • Experiment
  • add noise to attribute values in test data
  • better results when similar noise is added also
    to training data
  • idea: no use learning from clean data if
    performance is measured on a dirty test set
  • DT learns which attributes are unreliable and
    how to combine them

27
Robust regression
  • Outliers in statistics
  • detection (e.g. visually) & manual removal
  • is an outlier an error or not?
  • large effect on MSE (mean squared error)
  • Robust methods
  • outlier-tolerant statistical methods
  • error functions other than MSE (e.g. MAE)
  • automatic detection & removal (sketch below)
  • create a regression model, remove the 10% of points
    farthest from it
  • minimize the median instead of the mean error
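
A sketch of the simple detect-and-remove variant above (fit, drop the 10% of points with the largest residuals, refit); least median of squares itself is more involved. The data and the use of scikit-learn are assumptions.

```python
# Hedged sketch: one round of outlier removal based on absolute residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 2.0 * x.ravel() + rng.normal(0, 1, 200)
y[:20] += 30                               # a block of gross outliers

model = LinearRegression().fit(x, y)
residuals = np.abs(y - model.predict(x))
keep = residuals <= np.quantile(residuals, 0.9)    # drop the farthest 10%

robust = LinearRegression().fit(x[keep], y[keep])
print("slope before: %.2f  after: %.2f" % (model.coef_[0], robust.coef_[0]))
```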

28
Robust regression...
  • Example case: phone call data
  • 1964...1969: minutes of calls
  • other years: numbers of calls
  • → a large fraction of outliers on the y-axis
  • MSE gives quite a bad result
  • median squared error works remarkably well
  • finds the narrowest strip covering half of the data
  • median model = center of this strip
  • drawback: computationally (too) expensive

29
Detecting anomalies...
  • Is the error in the model or in the data?
  • are we justified in removing seemingly erroneous
    examples?
  • visualization may help with regression models
  • but not with all models
  • how to visualize a rule set, for example?
  • misclassified data
  • can usually be removed from DT training set
  • but we never know if that is the case with our data

30
...Detecting anomalies
  • One solution attempt
  • try several different ML schemes
  • use their combined results to filter data
  • conservative: remove only instances that all schemes
    misclassify
  • voting (danger of outvoting the right scheme)
  • training with filtered data may yield even better
    results
  • Danger with filtering approaches
  • some classes may get sacrificed in order to get
    better results for other classes
  • Human expert is still the winner
  • filtering suspects reduces the manual work

31
7.4 Combining multiple models
  • Aim: make decisions more reliable
  • consult several experts on the area
  • General combination models
  • bagging (bootstrap aggregating)
  • boosting
  • stacking
  • k-class (k > 2) classification problems
  • error-correcting codes

32
Combining results
  • In general
  • how to convert several predictions into (a
    hopefully better) one
  • Approaches
  • (weighted) vote/average
  • bagging: each model has equal weight
  • boosting: successful experts get more weight

33
Bagging
  • Introductory example
  • take t random training samples
  • build a DT for each sample
  • the trees are usually not identical
  • attribute selection is sensitive to the data
  • there are instances for which some trees are correct
    and some are not
  • voting usually gives a better result than any of
    the DTs alone (sketch below)
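
A sketch of this introductory example: t bootstrap samples, one tree per sample, majority vote over their predictions (scikit-learn's BaggingClassifier packages the same idea; the dataset here is an arbitrary stand-in).

```python
# Hedged bagging sketch: bootstrap resampling + unweighted voting of trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
t = 25
predictions = []
for _ in range(t):
    idx = rng.integers(0, len(ytr), len(ytr))          # sample with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(Xtr[idx], ytr[idx])
    predictions.append(tree.predict(Xte))

votes = np.array(predictions)                           # shape (t, n_test)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("bagged accuracy: %.3f" % (majority == yte).mean())
```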

34
Bias-variance decomposition...
  • Theoretical basis for analyzing the effect of
    combining models
  • build an infinite number of classifiers from an
    infinite number of independent training sets
  • process test instances with each & vote
  • First component of the expected error: bias
  • tells how well the chosen model type fits the data
  • the average error of the combined classifier
  • the persistent error of the learning algorithm,
    which cannot be eliminated by sampling more data

35
...Bias-variance decomposition
  • Error due to the chosen sample: variance
  • samples are finite → they do not fully represent
    the population
  • average value over all training sets of the given
    size & all test sets
  • Total expected error = bias + variance
  • combining models reduces the variance
  • In practice we have only one training set ... what
    to do?

36
Back to bagging
  • Simulate an infinite number of training sets
  • by resampling the same training data
  • delete & replicate instances by sampling with
    replacement (as in the bootstrap method)
  • apply the learning scheme to each sample & vote
  • An approximation of the idealized procedure
  • the training sets are not independent
  • but it still works remarkably well
  • often significant improvements
  • never substantially worse

37
Bagging numeric prediction
  • Just average the results of different predictions
  • Bias-variance decomposition?
  • error = expected value of MSE
  • bias: average MSE over models built from all
    possible datasets of the same size
  • variance: expected error of a single model
  • fact: bagging always reduces the expected total error
    (not true for classification problems)

38
Boosting
  • Bagging
  • works due to the inherent instability of learning
    models
  • does not work for stable models (insensitive to
    small changes in data)
  • e.g. linear regression
  • Boosting
  • explicitly searches for models that complement each
    other

39
Boosting vs. bagging
  • Similarities
  • uses voting/averaging
  • combines several models of same type
  • Differences
  • boosting is iterative: later models depend on
    earlier ones
  • new models should become experts in areas where
    earlier models fail
  • boosting weights models based on their performance

40
AdaBoost.M1
  • One of the many boosting variants
  • designed for classification tasks
  • works for any ML method, but assume first that
    instances can be weighted (e.g. the C4.5 algorithm)
  • Weighted examples
  • error = sum(weights of misclassified examples) /
    sum(weights of all examples)
  • weighting forces the ML method to concentrate on
    certain examples (greater need to classify them
    correctly)

41
Adjusting weights
  • Re-weighting
  • decrease the weights of correctly classified
    instances & normalize the weights
  • next iteration: hard instances get more focus
  • a weight tells how often an example has been
    misclassified by the earlier models
  • How much?
  • depends on the overall error e of the classifier
  • w ← w · e/(1-e) (a small error → a small multiplier)
  • note: if e > 0.5 we stop the algorithm

42
Classification, non-weighted case
  • Weighting the classifiers
  • classifiers with a small error should get more
    votes
  • vote weight = -log(e/(1-e)) (range 0..infinity)
  • ML algorithms without weighted instances
  • replicate examples according to their weights
  • weighted resampling
  • small weight → may not be present in the training
    data
  • error > 0.5 → restart from a fresh sample
  • allows more boosting iterations than the original
    method (sketch below)
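
A hedged sketch of AdaBoost.M1 with weightable base learners, using scikit-learn's sample_weight as a stand-in for a natively weight-aware scheme: weights of correct instances are multiplied by e/(1-e) and each model votes with weight -log(e/(1-e)), as on the slides. The dataset and iteration count are arbitrary.

```python
# Hedged AdaBoost.M1 sketch for a 2-class problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
w = np.ones(n) / n                          # start with equal instance weights
models, alphas = [], []

for _ in range(10):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    wrong = stump.predict(X) != y
    e = w[wrong].sum() / w.sum()            # weighted error
    if e == 0 or e >= 0.5:
        break                               # stop as described on the slides
    models.append(stump)
    alphas.append(-np.log(e / (1 - e)))     # vote weight: small error -> large vote
    w[~wrong] *= e / (1 - e)                # shrink weights of correct instances
    w /= w.sum()                            # normalize

# weighted vote: add each model's vote weight to the class it predicts
scores = np.zeros((n, 2))
for m, a in zip(models, alphas):
    scores[np.arange(n), m.predict(X)] += a
print("training accuracy: %.3f" % (scores.argmax(axis=1) == y).mean())
```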

43
Properties of boosting
  • Studied in computational learning theory
  • guaranteed performance improvement bounds
  • fact: the error on training data → 0 (and fast)
  • on test data boosting fails if
  • the component models are too complex, or
  • e > 0.5 is reached too quickly
  • balance between complexity & fit

44
Properties of boosting...
  • Continuing the iterations
  • after the error of the combined classifier reaches 0
  • may still improve performance on test data
  • Does this contradict Occam's razor?
  • not necessarily
  • we are improving our confidence in the model
  • margin = P(estimated class) - P(next most likely
    class)
  • boosting may increase this margin long after the
    overall training error is 0

45
Properties of boosting...
  • Weak learning → strong learning
  • if we have many simple classifiers with e < 0.5
  • we can combine them into a very accurate classifier
    (with good probability)
  • easy to find weak models for 2-class problems
  • decision stump = one-level DT
  • other boosting variants for multiclass situations
  • Boosting may sometimes fail (due to overfitting)
  • the combined model is less accurate than a single
    model
  • bagging does not fail in this way

46
Stacking
  • Stacked generalization
  • difficult to analyze theoretically
  • no generally accepted best way of doing it
  • not normally used to combine models of the same
    type
  • Combining different models
  • voting: probable that the correct one gets outvoted
  • add a meta learner on top of the components
  • it learns how to best combine the outputs (which
    components are reliable, and when)

47
Meta model, a.k.a. level-1 model
  • Input: predictions of the level-0 models
  • Training?
  • how to transform level-0 data into level-1 data?
  • obvious way: feed the training data into the models,
    collect the outputs, combine with the actual class
  • leads to rules like "believe A, ignore B & C"
  • may be appropriate only for the training data
  • in general we learn to prefer overfitting models

48
Better estimates
  • We already have them (chapter 5)
  • a separate hold-out set for validation
  • level-1 data is formed from the validation set
  • cross-validation
  • leave-one-out → one level-1 example from each
    level-0 example
  • slow, but gives full use of the training data
  • using probabilities
  • replace nominal classifications with predicted
    class probabilities (k numbers for k classes)
  • the level-1 model then knows the confidences of the
    level-0 models (sketch below)
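
A sketch of building level-1 data from cross-validated class probabilities of a few level-0 models and fitting a simple (linear) level-1 learner; the particular models, dataset, and scikit-learn helpers are assumptions.

```python
# Hedged stacking sketch: level-1 attributes are the cross-validated class
# probabilities of the level-0 models, so the meta learner never sees
# predictions made on data the level-0 models were trained on.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
level0 = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]

# level-1 data: k probability columns per level-0 model
Z = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in level0])

meta = LogisticRegression(max_iter=1000).fit(Z, y)
print("level-1 training accuracy: %.3f" % meta.score(Z, y))

# at prediction time the level-0 models would be refit on all the data and
# their probability outputs fed through the fitted level-1 model
```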

49
Level-1 learner
  • What models are most suitable?
  • any method in principle
  • most of the work should already be done at level
    0 → simple methods should do at level 1
  • Wolpert: "relatively global, smooth" models
  • linear models work well in practice

50
Error-correcting output codes
  • Aim
  • improve the performance of classification methods
  • in multiclass problems
  • some methods work only on 2-class tasks
  • apply iteratively (A vs. the rest), (B vs. the rest),
    ... & combine
  • error-correcting codes can be used to do most of
    this transformation
  • they are useful even when the method handles multiple
    classes directly

51
k-class task → 2-class tasks
  • Create k (copied) datasets
  • a new binary class attribute for each set i = 1..k
  • yes = class i, no = class ≠ i
  • learn a classifier for each
  • classification:
  • all models output their confidence in "yes"
  • select the one with the highest confidence
  • sensitive to the accuracy of the confidence estimates
    (over-confidence)

52
Example case
  • 4 classes a, b, c, d → yes/no (0/1) bits
  • direct transformation
  • 4 4-bit code words: 1000, 0100, 0010, 0001
  • classifiers predict the bits independently
  • errors occur when a wrong bit gets the highest
    confidence
  • alternative coding
  • 7-bit code words (7 classifiers)
  • output (error in the 2nd bit): 1011111 →
  • a is closest w.r.t. Hamming distance (sketch below)
  • the same correction is not possible with the 4-bit
    coding
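
A sketch of the decoding step: the class whose code word is nearest in Hamming distance wins. The 7-bit code words below are one plausible error-correcting assignment for four classes (consistent with class a being 1111111 in the slide's example), not necessarily the exact table used in the lecture.

```python
# Hedged ECOC sketch: one binary classifier per bit; decode by nearest code word.
import numpy as np

# hypothetical 7-bit error-correcting code words for classes a, b, c, d
code = {
    "a": np.array([1, 1, 1, 1, 1, 1, 1]),
    "b": np.array([0, 0, 0, 0, 1, 1, 1]),
    "c": np.array([0, 0, 1, 1, 0, 0, 1]),
    "d": np.array([0, 1, 0, 1, 0, 1, 0]),
}

def decode(output_bits):
    # pick the class whose code word has the smallest Hamming distance
    dist = {c: int(np.sum(cw != output_bits)) for c, cw in code.items()}
    return min(dist, key=dist.get), dist

# the slide's example: class a predicted with an error in the 2nd bit
print(decode(np.array([1, 0, 1, 1, 1, 1, 1])))   # ('a', {'a': 1, 'b': 3, 'c': 3, 'd': 5})
```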

53
What makes a code error-correcting?
  • Row separation
  • Hamming distance between code words:
  • d(c1,c2) ≥ 2d + 1 → can correct all errors of d
    bits or fewer
  • Column separation
  • columns & their complements should be different
  • otherwise classifiers will make the same errors →
    more simultaneous errors → harder to correct
  • Note: at least 4 classes are required to build an
    error-correcting code

54
Properties of error-correcting codes
  • Exhaustive code for k classes
  • columns: every possible k-bit string
  • excluding complements and the trivial strings 1^k
    and 0^k
  • each code word is 2^(k-1) - 1 bits long
  • the number of columns increases exponentially with k
  • Instance-based learning?
  • prediction is based on nearby instances → all output
    bits would come from the same instances
  • circumvention: use a different attribute set for
    each output bit