1
Model Compression
  • Rich Caruana
  • Computer Science
  • Cornell University
  • joint work with Cristian Bucila and Alex
    Niculescu-Mizil

2
Outline
  • Motivation
  • Ensemble learning usually most accurate
  • Ensemble models can be large and slow
  • Model compression
  • Where does data come from?
  • Experimental results
  • Related work
  • Future work
  • Summary

3
Supervised Learning
  • Major Goals
  • Accurate Models
  • Easy to train
  • Fast to train
  • Can deal with many data types
  • Can deal with many performance criteria
  • Does not require too much human expertise
  • Compact, easy to use models
  • Intelligible models
  • Fast predictions
  • Confidences for predictions
  • Explanations for predictions

4
Normalized Scores for ES
(Threshold metrics: Acc = accuracy, F-Sc = F-score, Lift; rank/ordering
metrics: ROC = ROC area, APR = average precision, BEP = break-even point;
probability metrics: RMS = squared error, MXE = cross-entropy, Cal =
calibration; Mean = average over the nine metrics.)

Model       Acc     F-Sc    Lift    ROC     APR     BEP     RMS     MXE      Cal     Mean
ES          0.9560  0.9442  0.9916  0.9965  0.9846  0.9786  0.9795  0.9808   0.9877  0.9850
BAYESAVG    0.9258  0.8906  0.9785  0.9851  0.9773  0.9557  0.9504  0.9585   0.9871  0.9566
BEST        0.9283  0.9188  0.9754  0.9876  0.9588  0.9581  0.9194  0.9443   0.9891  0.9533
AVG_ALL     0.8363  0.8007  0.9815  0.9878  0.9721  0.9606  0.8271  0.8086   0.9856  0.9067
STACK_LR    0.2753  0.7772  0.8352  0.7992  0.7860  0.8469  0.3317  -0.9897  0.8221  0.4982

(base-level learners, same columns)
BST-DT      0.860   0.854   0.956   0.977   0.958   0.952   0.929   0.932    0.808   0.914
RND-FOR     0.866   0.871   0.958   0.977   0.957   0.948   0.892   0.898    0.702   0.897
ANN         0.817   0.875   0.947   0.963   0.926   0.929   0.872   0.878    0.826   0.892
SVM         0.823   0.851   0.928   0.961   0.931   0.929   0.882   0.880    0.769   0.884
BAG-DT      0.836   0.849   0.953   0.972   0.950   0.928   0.875   0.901    0.637   0.878
KNN         0.759   0.820   0.914   0.937   0.893   0.898   0.786   0.805    0.706   0.835
BST-STMP    0.698   0.760   0.898   0.926   0.871   0.854   0.740   0.783    0.678   0.801
5
Ensemble Selection Works, But Is It Worth It?
  • Best of best of best yields 20% reduction in
    loss compared to boosted trees
  • Accuracy or AUC increase from 88% to 90%
  • RMS decrease from 0.25 to 0.20
  • Typically 10% reduction in loss compared to best
    model above
  • Accuracy or AUC increase from 90% to 91%
  • RMS decrease from 0.20 to 0.18
  • Overall reduction in loss can be 30%, which is
    significant

6
Computational Cost
  • Have to train multiple models anyway
  • models can be trained in parallel
  • different packages, different machines, at
    different times, by different people
  • just generate and collect (no optimization
    necessary, no test sets)
  • saves human effort -- no need to examine/optimize
    models
  • 48 hours on 10 workstations to train 2000
    models with 5k train sets
  • model library can be built before optimization
    metric is known
  • anytime selection -- no need to wait for all
    models
  • Ensemble Selection is cheap
  • each iteration, consider adding 2000 models to
    ensemble
  • adding model is simple unweighted averaging of
    predictions
  • caching makes this very efficient (a sketch
    follows this list)
  • compute performance metric when each model is
    added
  • for 250 iterations, evaluate 250 × 2000 = 500,000
    ensembles
  • 1 minute on a workstation if the metric is not
    expensive
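
A minimal sketch of this selection loop with cached prediction sums; the
names (library_preds, metric) and the exact bookkeeping are assumptions,
not the talk's code, and metric is assumed to close over the hillclimb-set
labels:

    # Forward stepwise ensemble selection with a cached sum of predictions.
    import numpy as np

    def ensemble_selection(library_preds, metric, iterations=250):
        # library_preds: (n_models, n_examples) predictions on a hillclimb set
        n_models, n_examples = library_preds.shape
        pred_sum = np.zeros(n_examples)      # cached sum over selected models
        selected = []
        for it in range(1, iterations + 1):
            # consider adding each of the library models (with replacement)
            scores = [metric((pred_sum + library_preds[m]) / it)
                      for m in range(n_models)]
            best = int(np.argmax(scores))
            pred_sum += library_preds[best]  # O(n_examples) cache update
            selected.append(best)
        return selected, pred_sum / iterations

Because only the running sum is stored, scoring each candidate costs one
vector add plus one metric evaluation, which is what makes 500,000 ensemble
evaluations feasible in about a minute.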

7
Ensemble Selection
  • Good news
  • A carefully selected ensemble that combines many
    models outperforms boosting, bagging, random
    forests, SVMs, and neural nets (because it builds
    on top of them)
  • Bad news
  • The ensembles are too big, too slow, too
    cumbersome to use for most applications

8
Best Ensembles are Big & Ugly!
  • Best ensemble for one problem/metric has 422
    models
  • 72 boosted trees (28,642 individual decision
    trees!)
  • 1 random forest (1024 decision trees)
  • 5 bagged trees (100 decision trees in each model)
  • 44 neural nets (2,200 hidden units total,
    >100,000 weights)
  • 115 knn models (both large and expensive!)
  • 38 SVMs (100s of support vectors in each model)
  • 26 boosted stump models (36,184 stumps total --
    could compress)
  • 122 individual decision trees

9
Best Ensembles are Big & Slow!
  • Size
  • Best single model: 1.41 MB
  • Ensemble selection: 550.29 MB
  • Speed (to classify 10,000 examples)
  • Best single model: 93.37 secs / 10k
  • Ensemble selection: 5396.27 secs / 10k

10
  • Can't we make the ensembles smaller, faster, and
    easier to use by eliminating some base-level
    models?

11
What Models are Used in Ensembles?
ADULT Acc Fsc Lft Roc Apr Bep Rms Mxe Sar Avg
ANN .071 .132 .101 .365 .430 .243 .167 .094 .573 .272
KNN .020 .015 .586 .037 .029 .049 .000 .000 .049 .098
SVM .001 .000 .110 .284 .274 .092 .057 .000 .022 .105
DT .020 .035 .007 .088 .049 .019 .746 .867 .234 .258
BAG_DT .002 .001 .000 .004 .005 .002 .006 .010 .014 .005
BST_DT .110 .152 .025 .057 .032 .047 .024 .028 .075 .069
BST_STMP .776 .666 .171 .166 .181 .548 .000 .000 .032 .317
COV-TYPE Acc Fsc Lft Roc Apr Bep Rms Mxe Sar Avg
ANN .011 .010 .001 .038 .023 .009 .087 .097 .052 .041
KNN .179 .166 .576 .252 .295 .202 .436 .427 .364 .362
SVM .021 .016 .087 .104 .106 .051 .010 .013 .038 .056
DT .061 .054 .012 .238 .242 .029 .408 .368 .200 .202
BAG_DT .005 .006 .002 .010 .015 .006 .016 .022 .044 .016
BST_DT .553 .613 .130 .278 .240 .644 .042 .073 .292 .358
BST_STMP .170 .134 .194 .080 .080 .059 .000 .000 .009 .091
12
What Models are Used in Ensembles?
ADULT Acc Fsc Lft Roc Apr Bep Rms Mxe Sar Avg
ANN .071 .132 .101 .365 .430 .243 .167 .094 .573 .272
KNN .020 .015 .586 .037 .029 .049 .000 .000 .049 .098
SVM .001 .000 .110 .284 .274 .092 .057 .000 .022 .105
DT .020 .035 .007 .088 .049 .019 .746 .867 .234 .258
BAG_DT .002 .001 .000 .004 .005 .002 .006 .010 .014 .005
BST_DT .110 .152 .025 .057 .032 .047 .024 .028 .075 .069
BST_STMP .776 .666 .171 .166 .181 .548 .000 .000 .032 .317
COV-TYPE Acc Fsc Lft Roc Apr Bep Rms Mxe Sar Avg
ANN .011 .010 .001 .038 .023 .009 .087 .097 .052 .041
KNN .179 .166 .576 .252 .295 .202 .436 .427 .364 .362
SVM .021 .016 .087 .104 .106 .051 .010 .013 .038 .056
DT .061 .054 .012 .238 .242 .029 .408 .368 .200 .202
BAG_DT .005 .006 .002 .010 .015 .006 .016 .022 .044 .016
BST_DT .553 .613 .130 .278 .240 .644 .042 .073 .292 .358
BST_STMP .170 .134 .194 .080 .080 .059 .000 .000 .009 .091
13
Summary of Models Used by ES
  • Most ensembles use 10-100 of the 2000 models
  • Different models are selected for different
    problems
  • Different models are selected for different
    metrics
  • Most ensembles use a diversity of model types
  • Most ensembles use different parameter settings
  • Selected Models often make sense
  • Neural nets for RMS, Cross-Entropy
  • Max-margin methods for Accuracy
  • Large k in knn for AUC

14
Motivation: Model Compression
  • Unfortunately, ensembles are not suitable for
    many applications
  • PDAs (storage space is important)
  • Cell phones (storage space)
  • Hearing aids (storage space & speed are important
    because of power restrictions)
  • Search engines like Google (speed)
  • Image recognition applications (speed)
  • Our solution: Model Compression
  • Models that perform as well as the best ensembles,
    but are small and fast enough to be used

15
Solution: Model Compression
  • Train simple model to mimic the complex model
  • Pass large amounts of unlabeled data (synthetic
    data points or real unlabeled data) through the
    ensemble and collect its predictions
  • 100,000 to 10,000,000 synthetic training points
  • Extensional representation of the ensemble model
  • Train copycat model on this large synthetic train
    set to mimic the high-performance ensemble
  • Train neural net to mimic ensemble (see the
    sketch below)
  • Potential to not only perform as well as the
    target ensemble, but possibly outperform it
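
A minimal sketch of this recipe, using scikit-learn stand-ins rather than
the ANN code used in the talk; the names ensemble and X_synth are
assumptions (any trained binary classifier with predict_proba, and the
large unlabeled or synthetic sample):

    # Train a small neural net to mimic the target model's soft predictions.
    from sklearn.neural_network import MLPRegressor

    def compress(ensemble, X_synth, hidden_units=256):
        # label the synthetic points with the ensemble's predicted probabilities
        soft_targets = ensemble.predict_proba(X_synth)[:, 1]
        # fit the mimic net to those soft targets (squared-error objective)
        mimic = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=500)
        mimic.fit(X_synth, soft_targets)
        return mimic

Fitting the ensemble's soft predictions rather than 0/1 labels is what lets
unlabeled points carry the ensemble's function over to the mimic model.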

16
Why Mimic with Neural Nets?
  • Decision trees do not work well
  • synthetic data must be very large because of
    recursive partitioning
  • mimic decision trees are enormous (depth > 1000
    and > 10^6 nodes), making them expensive to store
    and compute
  • single tree does not seem to model ensemble
    accurately enough
  • SVMs
  • number of support vectors increases quickly with
    complexity
  • Artificial neural nets
  • can model complex functions with a modest number
    of hidden units
  • can compress millions of training cases into
    thousands of weights
  • expensive to train, but execution cost is low
    (just matrix multiplies)
  • models with a few thousand weights have a small
    footprint

17
Unlabeled Data?
  • Assume original labeled training set is small
  • But we need a large train set to train the mimic
    ANN
  • Should come from same distribution as train data
  • Learned model must focus on most important
    regions in space
  • For some domains unlabeled data is available
  • Text, web, images, …
  • If not available, we need to generate synthetic
    data
  • Random
  • NBE
  • Munge

18
Synthetic Data: True Distribution
19
Synthetic Data: Small Sample
20
Synthetic Data: Random
  • Values for attributes are generated randomly from
    their univariate distribution

21
Synthetic Data: Random
  • Values for attributes are generated randomly from
    their univariate distribution

22
Synthetic Data: Random
  • Values for attributes are generated randomly from
    their univariate distributions
  • The conditional structure of the data is lost
  • Many generated examples cover uninteresting
    regions of the space (a sketch of this generator
    follows)
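
A minimal sketch of the Random generator, assuming the train set is a
numeric NumPy array (the function and argument names are mine):

    # Sample each attribute independently from its empirical marginal;
    # by construction this discards all conditional structure.
    import numpy as np

    def random_synthetic(X_train, n_samples, seed=0):
        rng = np.random.default_rng(seed)
        n_examples, n_attrs = X_train.shape
        cols = [rng.choice(X_train[:, j], size=n_samples)
                for j in range(n_attrs)]
        return np.column_stack(cols)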

23
Synthetic Data: NBE
  • Estimate the joint distribution from the train set

24
Synthetic Data: NBE
  • Estimate the joint distribution from the train
    set
  • NBE (Naïve Bayes Estimation) algorithm
  • Lowd and Domingos, 2005
  • Code for learning and sampling available

25
  • These don't work well enough.
  • Had to develop a new, better method.

26
  • These don't work well enough.
  • Had to develop a new, better method.
  • Munging
  • 1. To imperfectly transform information. 2. To
    modify data in a way that cannot be described
    succinctly.

27
Munging
28
Munging
29
Munging
       x      y
  1  -1.09  -0.09
  2  -0.94   0.24
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.06  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17
  …
30
Munging
(same table as previous slide)
31
Munging
(same table as previous slide)
32
Munging
(same table as previous slide)
33
Munging
       x      y     d
  1  -1.09  -0.09     -
  2  -0.94   0.24  0.35
  3  -0.56   0.76  0.99
  4   0.82   0.56  2.02
  5  -0.23  -0.90  1.18
  …
 19  -1.06  -0.19  0.11
  …
 61  -0.53   0.77  1.02
  …
 89  -0.95   0.17  0.28
  …
(d = distance from example 1; its nearest neighbor is example 19, d = 0.11)
34
Munging
(same table as previous slide)
35
Munging
       x      y
  1  -1.09  -0.09
  2  -0.94   0.24
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.06  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17
  …
(per-attribute swap decision for example 1 and its nearest neighbor 19:
x: yes, y: no)
36
Munging
       x      y     x'     y'
  1  -1.09  -0.09  -1.06  -0.09
  2  -0.94   0.24
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.06  -0.19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17
  …
(the x values of example 1 and its nearest neighbor 19 are exchanged, with
noise added; y is left unchanged)
37
Munging
       x      y
  1  -1.06  -0.09
  2  -0.94   0.24
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17
  …
38
Munging
(same table as previous slide)
39
Munging
(same table as previous slide)
40
Munging
       x      y     d
  1  -1.06  -0.09  0.35
  2  -0.94   0.24     -
  3  -0.56   0.76  0.65
  4   0.82   0.56  1.79
  5  -0.23  -0.90  1.34
  …
 19  -1.11  -0.19  0.46
  …
 61  -0.53   0.77  0.68
  …
 89  -0.95   0.17  0.07
  …
(d = distance from example 2; its nearest neighbor is example 89, d = 0.07)
41
Munging
(same table as previous slide)
42
Munging
       x      y
  1  -1.06  -0.09
  2  -0.94   0.24
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17
  …
(per-attribute swap decision for example 2 and its nearest neighbor 89:
x: no, y: yes)
43
Munging
       x      y     x'     y'
  1  -1.06  -0.09
  2  -0.94   0.24  -0.94   0.17
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.17  -0.95   0.23
  …
(the y values of example 2 and its nearest neighbor 89 are exchanged, with
noise added; x is left unchanged)
44
Munging
       x      y
  1  -1.06  -0.09
  2  -0.94   0.17
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.23
  …
45
Munging
(same table as previous slide)
46
Munging
(same table as previous slide)
47
Munging
       x      y     d
  1  -1.06  -0.09  0.99
  2  -0.94   0.17  0.72
  3  -0.56   0.76     -
  4   0.82   0.56  1.39
  5  -0.23  -0.90  1.69
  …
 19  -1.11  -0.19  1.10
  …
 61  -0.53   0.77  0.03
  …
 89  -0.95   0.23  0.67
  …
(d = distance from example 3; its nearest neighbor is example 61, d = 0.03)
48
Munging
(same table as previous slide)
49
Munging
       x      y
  1  -1.06  -0.09
  2  -0.94   0.17
  3  -0.56   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77
  …
 89  -0.95   0.23
  …
(per-attribute swap decision for example 3 and its nearest neighbor 61:
x: yes, y: no)
50
Munging
       x      y     x'     y'
  1  -1.06  -0.09
  2  -0.94   0.17
  3  -0.56   0.76  -0.54   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.53   0.77  -0.58   0.77
  …
 89  -0.95   0.23
  …
(the x values of example 3 and its nearest neighbor 61 are exchanged, with
noise added; y is left unchanged)
51
Munging
       x      y
  1  -1.06  -0.09
  2  -0.94   0.17
  3  -0.54   0.76
  4   0.82   0.56
  5  -0.23  -0.90
  …
 19  -1.11  -0.19
  …
 61  -0.58   0.77
  …
 89  -0.95   0.23
  …
52
Munging
(same table as previous slide)
53
Munging
(same table as previous slide)
54
Munging
       x      y
  1  -0.93  -0.09
  2  -0.94   0.16
  3  -0.55   0.76
  4   0.82   0.55
  5  -0.19  -1.01
  …
 19  -0.85  -0.19
  …
 61  -0.57   0.82
  …
 89  -0.95   0.23
  …
(the munged sample after the pass continues through the remaining examples)
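
A sketch of one munge pass as illustrated above, for continuous attributes
only; the parameter names p (swap probability) and s (local variance scale)
follow the authors' KDD'06 paper, and computing distances once up front is
a simplification:

    # One munge pass: each example exchanges attribute values with its
    # nearest neighbor, with probability p per attribute, plus Gaussian noise.
    import numpy as np
    from scipy.spatial.distance import cdist

    def munge(X, p=0.5, s=1.0, seed=0):
        rng = np.random.default_rng(seed)
        X = X.astype(float).copy()
        dist = cdist(X, X)                   # pairwise Euclidean distances
        np.fill_diagonal(dist, np.inf)
        for i in range(len(X)):
            j = int(np.argmin(dist[i]))      # nearest neighbor of example i
            for a in range(X.shape[1]):
                if rng.random() < p:         # exchange this attribute, noisily
                    sd = abs(X[i, a] - X[j, a]) / s
                    X[i, a], X[j, a] = (rng.normal(X[j, a], sd),
                                        rng.normal(X[i, a], sd))
        return X

Each call yields one perturbed copy of the train set; repeated calls,
relabeled by the ensemble, build up the 100K-plus synthetic samples used
later in the talk.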


55
Synthetic Data: Munge
56
Synthetic Data: Munge
57
Synthetic Data: Munge
58
Synthetic Data: Munge
59
Synthetic Data
60
Now That We Have a Method to Generate Data, Let's Do Some Compression
61
Experimental Setup: Datasets

PROBLEM     ATTR     % POS   TRAIN SIZE   TEST SIZE
ADULT       14/104   25      4000         35222
COVTYPE     54       36      4000         25000
HS          200      24      4000         4366
LETTER.P1   16       3       4000         14000
LETTER.P2   16       53      4000         14000
MEDIS       63       11      4000         8199
MG          124      17      4000         12807
SLAC        59       50      4000         25000
62
Experimental Setup
  • Target model: Ensemble Selection
  • Mimic model: neural net
  • Up to 256 hidden units
  • Synthetic data
  • Up to 400,000 examples
  • Methods
  • Random
  • NBE
  • Munge
  • Unlabeled vs. Synthetic

63
Average Results by Size
64
Average Results by Size
65
Average Results by Size
66
Average Results by Size
67
Average Results by Size
68
Average Results by Size
69
Letter.P1 Results
70
Hs Results
71
Average Results by HU
72
Letter.P1 Results
73
Letter.P2 Results
74
Letter Results
  • Letter.p1: Distinguish the letter O from the rest
  • Letter.p2: Distinguish letters A-M from N-Z

75
It Doesn't Always Work As Well As We'd Like, Yet!
76
Covtype Results
77
Covtype Results
78
Covtype Results
79
Covtype Results
  • More hidden units are necessary to get a better
    mimic model
  • More Munge data is also needed
  • Performance on TRUE DIST data is very good, so we
    may get better performance if better synthetic
    data can be generated

80
Adult Results
81
Adult Results
82
Adult Results
  • More Munge data or more hidden units doesn't
    seem to help much
  • Adult has a few high-arity nominal attributes
    that, when binarized, increase the number of
    attributes from 14 to 104 sparse binary
    attributes
  • Neural nets may not be well suited for this
    problem?
  • Munge may not be effective in generating good
    pseudo data for adult?

83
RMSE Results (400K examples, 256 HU)

            MUNGE   ENSEMBLE   ANN     RATIO
ADULT       0.325   0.317      0.328   0.29
COVTYPE     0.340   0.334      0.378   0.84
HS          0.204   0.213      0.231   1.47
LETTER.P1   0.075   0.075      0.092   1.01
LETTER.P2   0.179   0.178      0.228   0.98
MEDIS       0.277   0.278      0.279   2.29
MG          0.288   0.287      0.295   0.88
SLAC        0.422   0.424      0.428   1.69
AVERAGE     0.264   0.263      0.282   0.97
RATIO = (MUNGE - ANN) / (ENSEMBLE - ANN): the fraction of the ensemble's
improvement over the best ANN that the mimic model retains
84
We're Retaining 97% of the Accuracy of the Target Model, but How Are We
Doing on Compression?
85
Size of Models (MB)

            MUNGE   ENSEMBLE   ANN    RATIO
ADULT       0.45    1234.72    0.22   2744
COVTYPE     0.23    1108.16    0.03   4818
HS          0.79    74.37      0.12   94
LETTER.P1   0.08    1.23       0.01   15
LETTER.P2   0.08    325.80     0.04   4073
MEDIS       0.27    5.24       0.14   19
MG          0.50    25.75      0.03   52
SLAC        0.25    1627.08    0.13   6508
AVERAGE     0.33    550.29     0.09   2290
RATIO = ENSEMBLE / MUNGE
86
Execution Time of Models
Time in seconds to classify 10,000 examples

            MUNGE   ENSEMBLE   ANN    RATIO
ADULT       7.88    8560.61    3.94   1086
COVTYPE     4.46    3440.90    1.05   772
HS          12.09   1817.17    3.85   150
LETTER.P1   2.59    1630.21    0.25   629
LETTER.P2   2.59    2651.95    0.74   1024
MEDIS       4.78    190.18     2.85   40
MG          6.98    1220.04    1.80   175
SLAC        3.60    23659.03   2.85   6572
AVERAGE     5.62    5396.27    2.17   1306
RATIO = ENSEMBLE / MUNGE
87
Summary of Compression Results

           MUNGE   ENSEMBLE   ANN    RATIO
RMSE       0.264   0.263      0.282  0.97
Size (MB)  0.33    550.29     0.09   2290
Time (s)   5.62    5396.27    2.17   1306

  • Neural nets trained to mimic high-performing
    ensemble selection models
  • on average, capture more than 97% of the
    performance of the target model
  • perform much better than any ANN we could train
    on the original data
  • More than 2000 times smaller than target ensemble
  • More than 1000 times faster than target ensemble

88
Related Work
  • Neural Net Approximator (Zeng and Martinez,
    2000)
  • Used same general approach
  • Only pseudo data used to train the neural net
  • Trained a neural net to model ensemble of neural
    nets
  • Target model not nearly as complex as ES

89
Related Work
  • CMM (Combine Multiple Models) (Domingos, 1997)
  • Goal: improve accuracy and stability of base
    classifier (C4.5 rules) without losing
    comprehensibility
  • Create ensemble of base classifiers
  • Train a base classifier on original data + extra
    data
  • Generate extra data to be labeled by ensemble
  • Method for generating extra data is specific to
    C4.5 rules

90
Related Work
  • TREPAN (Craven and Shavlik, 1996)
  • Extract tree-structured representations of
    trained neural nets
  • Used the original train set the nets were trained
    on
  • Generated synthetic data at every node in the
    tree
  • Learning rules from neural nets
  • Towell and Shavlik, 1992; Craven and Shavlik,
    1993, 1994

91
Related Work
  • Pruning adaptive boosting (Margineantu and
    Dietterich, 2000)
  • To compress the ensemble, retain only some of the
    models it contains
  • DECORATE (Melville and Mooney, 2003)
  • Use extra data to increase the diversity of base
    classifiers in order to build a better ensemble
  • Data generated randomly from each attribute's
    marginal distribution (similar to our Random
    algorithm)

92
What Still Needs to Be Done?
93
Future Work: Other Mimic Models
  • Neural nets are not the only possible mimic models
  • Other learning methods may provide insight into
    the effectiveness of model compression
  • Things to do
  • Use decision trees, SVMs, k-nearest neighbor
    models to mimic Ensemble Selection
  • Expect to see
  • Decision trees grow too large, need too much data
  • KNN too slow
  • SVMs need too many support vectors

94
Future Work: Other Target Metrics
  • Key feature of Ensemble Selection: it can be
    optimized for different metrics (RMSE, ROC, ACC,
    Precision, …)
  • Important that compressed models are good on the
    target metric
  • If the squared error between the target model and
    the mimic neural net is small enough, performance
    on the target metric should be similar
  • Things to do
  • Use neural nets to mimic ES optimized for
    accuracy, area under ROC curve
  • May need to adapt the model compression approach
    for metrics other than RMSE
  • Expect to see
  • good performance for other metrics as well

95
Future Work: Model Complexity
  • Complexity of model varies from problem to
    problem
  • To accurately approximate a model, the mimic
    model needs to have similar complexity
  • For neural nets, number of hidden units is a
    measure of complexity
  • Things to do
  • For some problems, experiments with more hidden
    units
  • Experiments with more than one hidden layer
    (ADULT)
  • Expect to see
  • For some problems, more hidden units will help
  • For ADULT ???

96
Future Work: Munge
  • Two free parameters that must be set (the swap
    probability and the local variance scale)
  • We might not have picked optimal values
  • Different problems may have different optimal
    values
  • Compression experiments are very expensive
  • Things to do
  • Experiment with different parameter values
  • Try to find distance metric between datasets that
    expresses quality of data generated
  • Expect to see
  • Better synthetic data yields better compression
    with less data

97
Future Work: Active Learning
  • Too many examples → labeling is expensive
  • Too many examples → training is expensive
  • Things to do
  • Choose the most important synthetic examples
  • Retain only non-redundant examples generated by
    Munge
  • Modify Munge so that it generates fewer redundant
    examples
  • Expect to see
  • Active learning reduces amount of train data
    needed

98
Summary
  • Ensemble learning yields the most accurate models
  • Ensemble selection is the best ensemble method
  • Ensembles are sometimes too big and too slow
  • Compress complex ensemble into simpler ANN
  • 97% of accuracy retained
  • 2000 times smaller
  • 1000 times faster
  • Potentially useful measure of model complexity?
  • Compression separates the process that learns the
    function from the data from the model used at
    runtime to make predictions

99
Thank You. Questions?
100
(No Transcript)
101
Hs Results
102
Letter.P2 Results
103
Medis Results
104
Medis Results
105
Mg Results
106
Mg Results
107
Slac Results
108
Slac Results