Title: Model Compression
1Model Compression
- Rich Caruana
- Computer Science
- Cornell University
- joint work with Cristian Bucila and Alex Niculescu-Mizil
2Outline
- Motivation
- Ensemble learning usually most accurate
- Ensemble models can be large and slow
- Model compression
- Where does data come from?
- Experimental results
- Related work
- Future work
- Summary
3Supervised Learning
- Major Goals
- Accurate Models
- Easy to train
- Fast to train
- Can deal with many data types
- Can deal with many performance criteria
- Does not require too much human expertise
- Compact, easy to use models
- Intelligible models
- Fast predictions
- Confidences for predictions
- Explanations for predictions
4Normalized Scores for ES
5Ensemble Selection Works, But Is It Worth It?
- Best of best of best yields 20% reduction in loss compared to boosted trees
- Accuracy or AUC increase from 88% to 90%
- RMS decrease from 0.25 to 0.20
- Typically 10% reduction in loss compared to best model above
- Accuracy or AUC increase from 90% to 91%
- RMS decrease from 0.20 to 0.18
- Overall reduction in loss can be 30%, which is significant
6Computational Cost
- Have to train multiple models anyway
- models can be trained in parallel
- different packages, different machines, at different times, by different people
- just generate and collect (no optimization necessary, no test sets)
- saves human effort -- no need to examine/optimize models
- 48 hours on 10 workstations to train 2000 models with 5k train sets
- model library can be built before optimization metric is known
- anytime selection -- no need to wait for all models
- Ensemble Selection is cheap
- each iteration, consider adding 2000 models to ensemble
- adding model is simple unweighted averaging of predictions
- caching makes this very efficient
- compute performance metric when each model is added
- for 250 iterations, evaluate 250 × 2000 = 500,000 ensembles
- 1 minute on workstation if metric is not expensive
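The selection loop described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that every library model stays eligible in each iteration (selection with replacement, consistent with 250 × 2000 evaluations), that predictions are scored on a held-out selection set, and that lower metric values are better. The names `ensemble_selection`, `library`, and `rmse` are hypothetical.

```python
import math

def rmse(preds, labels):
    """Root-mean-squared error between predictions and 0/1 labels."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))

def ensemble_selection(library, labels, iterations, metric=rmse):
    """Greedy forward selection; returns the chosen model names (repeats allowed).

    library: dict mapping model name -> list of predictions on the selection set.
    """
    running_sum = [0.0] * len(labels)   # cached sum of predictions chosen so far
    chosen = []
    for _ in range(iterations):
        k = len(chosen) + 1             # ensemble size if one more model is added
        best_name, best_score = None, float("inf")
        for name, preds in library.items():
            # Unweighted average after adding this model, built from the cache.
            candidate = [(s + p) / k for s, p in zip(running_sum, preds)]
            score = metric(candidate, labels)
            if score < best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, library[best_name])]
    return chosen

# Toy library of three "models" scored on four selection-set examples.
labels = [1, 0, 1, 0]
library = {
    "a": [0.9, 0.3, 0.6, 0.1],
    "b": [0.6, 0.1, 0.9, 0.4],
    "c": [0.5, 0.5, 0.5, 0.5],
}
picked = ensemble_selection(library, labels, iterations=2)
```

Because only the running sum is stored, evaluating a candidate costs one pass over the cached predictions rather than re-running any model, which is why the metric itself dominates the cost.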
7Ensemble Selection
- Good news
- A carefully selected ensemble that combines many models outperforms boosting, bagging, random forests, SVMs, and neural nets (because it builds on top of them)
- Bad news
- The ensembles are too big, too slow, too cumbersome to use for most applications
8Best Ensembles are Big & Ugly!
- Best ensemble for one problem/metric has 422 models
- 72 boosted trees (28,642 individual decision trees!)
- 1 random forest (1024 decision trees)
- 5 bagged trees (100 decision trees in each model)
- 44 neural nets (2,200 hidden units total, >100,000 weights)
- 115 knn models (both large and expensive!)
- 38 SVMs (100s of support vectors in each model)
- 26 boosted stump models (36,184 stumps total -- could compress)
- 122 individual decision trees
9Best Ensembles are Big & Slow!
- Size
- Best single models: 1.41 MB
- Ensemble selection: 550.29 MB
- Speed (to classify 10,000 examples)
- Best single model: 93.37 secs / 10k
- Ensemble selection: 5396.27 secs / 10k
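To put the figures on this slide in proportion, the overhead of the ensemble relative to the best single model can be computed directly. The variable names are illustrative; the numbers are the ones quoted above.

```python
single_size, ensemble_size = 1.41, 550.29      # model size, as quoted on the slide
single_secs, ensemble_secs = 93.37, 5396.27    # seconds to classify 10,000 examples

size_ratio = ensemble_size / single_size   # roughly 390 times larger
time_ratio = ensemble_secs / single_secs   # roughly 58 times slower
```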
10- Can't we make the ensembles smaller, faster, and easier to use by eliminating some base-level models?
11What Models are Used in Ensembles?
12What Models are Used in Ensembles?
13Summary of Models Used by ES
- Most ensembles use 10-100 of the 2000 models
- Different models are selected for different problems
- Different models are selected for different metrics
- Most ensembles use a diversity of model types
- Most ensembles use different parameter settings
- Selected Models often make sense
- Neural nets for RMS, Cross-Entropy
- Max-margin methods for Accuracy
- Large k in knn for AUC
14Motivation: Model Compression
- Unfortunately, the best ensembles are not suitable for many applications
- PDAs (storage space is important)
- Cell phones (storage space)
- Hearing aids (storage space; speed is important because of power restrictions)
- Search engines like Google (speed)
- Image recognition applications (speed)
- Our solution: Model Compression
- Models perform as well as the best ensembles, but are small and fast enough to be used
15Solution: Model Compression
- Train simple model to mimic the complex model
- Pass large amounts of unlabeled data (synthetic data points or real unlabeled data) through ensemble and collect predictions
- 100,000 to 10,000,000 synthetic training points
- Extensional representation of the ensemble model
- Train copycat model on this large synthetic train set to mimic the high-performance ensemble
- Train neural net to mimic ensemble
- Potential to not only perform as well as target ensemble, but possibly outperform it
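The two steps on this slide (label unlabeled points with the ensemble, then train a small net on those labels) can be sketched end to end in plain Python. Everything concrete here is an assumption made for illustration: the "ensemble" is a stand-in built from 100 random decision stumps, the data is two-dimensional and uniform, the mimic is a one-hidden-layer tanh net with 8 units trained by plain SGD on squared error, and only 4,000 points are used instead of the 100,000 or more quoted above.

```python
import math
import random

rng = random.Random(0)

# Stand-in for a large ensemble: unweighted average of 100 random decision stumps.
stumps = [(rng.randrange(2), rng.uniform(-1, 1)) for _ in range(100)]

def ensemble_predict(x):
    return sum(1.0 if x[f] > t else 0.0 for f, t in stumps) / len(stumps)

# Step 1: pass unlabeled points through the ensemble and collect its predictions.
unlabeled = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(4000)]
targets = [ensemble_predict(x) for x in unlabeled]

# Step 2: train a small one-hidden-layer net on (point, ensemble prediction) pairs.
H, LR = 8, 0.05
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [rng.uniform(-0.5, 0.5) for _ in range(H)]
b2 = 0.0

def forward(x):
    hidden = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(H)]
    return sum(w2[j] * hidden[j] for j in range(H)) + b2, hidden

for _ in range(10):                                  # epochs of SGD
    for x, t in zip(unlabeled, targets):
        out, hidden = forward(x)
        err = out - t                                # gradient of 0.5 * err**2
        for j in range(H):
            grad_pre = err * w2[j] * (1.0 - hidden[j] ** 2)   # backprop through tanh
            w2[j] -= LR * err * hidden[j]
            w1[j][0] -= LR * grad_pre * x[0]
            w1[j][1] -= LR * grad_pre * x[1]
            b1[j] -= LR * grad_pre
        b2 -= LR * err

# Compare the mimic with the ensemble on fresh points it was not trained on.
fresh = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]

def rmse_to_ensemble(predict):
    return math.sqrt(sum((predict(x) - ensemble_predict(x)) ** 2 for x in fresh) / len(fresh))

mimic_rmse = rmse_to_ensemble(lambda x: forward(x)[0])
mean_target = sum(targets) / len(targets)
baseline_rmse = rmse_to_ensemble(lambda x: mean_target)   # constant predictor
```

The mimic never sees true labels, only the ensemble's outputs, so its training set can be made as large as the supply of unlabeled or synthetic points allows.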
16Why Mimic with Neural Nets?
- Decision trees do not work well
- synthetic data must be very large because of recursive partitioning
- mimic decision trees are enormous (depth > 1000 and > 10^6 nodes), making them expensive to store and compute
- single tree does not seem to model ensemble accurately enough
- SVMs
- number of support vectors increases quickly with complexity
- Artificial neural nets
- can model complex functions with modest number of hidden units
- can compress millions of training cases into thousands of weights
- expensive to train, but execution cost low (just matrix multiplies)
- models with few thousand weights have small footprint
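The footprint argument can be made concrete by counting the weights of a fully connected one-hidden-layer net. The sketch assumes a single output unit and 4-byte floats, neither of which is stated in the slides; the 104 inputs and 256 hidden units are the largest values mentioned elsewhere in this deck (binarized Adult, and the experimental setup). The helper name `mlp_weight_count` is hypothetical.

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights, including biases, in a fully connected one-hidden-layer net."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

n_weights = mlp_weight_count(n_inputs=104, n_hidden=256)
size_kb = n_weights * 4 / 1024   # assuming 4 bytes per weight
```

Even at this upper end the net has on the order of tens of thousands of weights, and prediction is two small matrix-vector products.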
17Unlabeled Data?
- Assume original labeled training set is small
- But we need a large train set to train the mimic ANN
- Should come from same distribution as train data
- Learned model must focus on most important regions in space
- For some domains unlabeled data is available
- Text, web, images, ...
- If not available, we need to generate synthetic data
- Random
- NBE
- Munge
18Synthetic Data: True Distribution
19Synthetic Data: Small Sample
20Synthetic Data: Random
- Values for attributes are generated randomly from their univariate distribution
21Synthetic Data: Random
- Values for attributes are generated randomly from their univariate distribution
22Synthetic Data: Random
- Values for attributes are generated randomly from their univariate distribution
- The conditional structure of the data is lost
- Many generated examples cover uninteresting regions of the space
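A small sketch shows why independent per-attribute sampling loses the conditional structure. It assumes the univariate distribution of each attribute is approximated by resampling the values observed in the train set; the slides do not say how the marginals are estimated, and `random_synthetic` is a hypothetical name.

```python
import random

def random_synthetic(train, n, rng):
    """Draw each attribute independently from its empirical marginal in `train`."""
    columns = list(zip(*train))   # one tuple of observed values per attribute
    return [[rng.choice(col) for col in columns] for _ in range(n)]

# Two perfectly correlated attributes: every real example lies on the diagonal.
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
synthetic = random_synthetic(train, 1000, random.Random(0))

# Count generated points that fall off the diagonal, where no real data lives.
off_diagonal = sum(1 for a, b in synthetic if a != b)
```

Each marginal is reproduced faithfully, yet a large share of the generated points land in regions the true distribution never visits.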
23Synthetic Data: NBE
- Estimate the joint distribution from the train set
24Synthetic Data: NBE
- Estimate the joint distribution from the train set
- NBE (Naïve Bayes Estimation) algorithm
- Lowd and Domingos, 2005
- Code for learning and sampling available
25- These don't work well enough.
- Had to develop a new, better method.
26- These don't work well enough.
- Had to develop a new, better method.
- Munging
- 1. To imperfectly transform information. 2. To modify data in a way that cannot be described succinctly.
27Munging
28Munging
29Munging
30Munging
31Munging
32Munging
33Munging
34Munging
35Munging
36Munging
37Munging
38Munging
39Munging
40Munging
41Munging
42Munging
43Munging
44Munging
45Munging
46Munging
47Munging
48Munging
49Munging
50Munging
51Munging
52Munging
53Munging
54Munging
55Synthetic Data: Munge
56Synthetic Data: Munge
57Synthetic Data: Munge
58Synthetic Data: Munge
59Synthetic Data
60Now That We Have a Method to Generate Data, Let's Do Some Compression
61Experimental Setup: Datasets
62Experimental Setup
- Target model: Ensemble Selection
- Mimic model: neural net
- Up to 256 hidden units
- Synthetic data
- Up to 400,000 examples
- Methods
- Random
- NBE
- Munge
- Unlabeled vs. Synthetic
63Average Results by Size
64Average Results by Size
65Average Results by Size
66Average Results by Size
67Average Results by Size
68Average Results by Size
69Letter.P1 Results
70Hs Results
71Average Results by HU
72Letter.P1 Results
73Letter.P2 Results
74Letter Results
- Letter.p1: Distinguish letter O from the rest
- Letter.p2: Distinguish letters A-M from N-Z
75It Doesn't Always Work As Well As We'd Like, Yet!
76Covtype Results
77Covtype Results
78Covtype Results
79Covtype Results
- More hidden units necessary to get a better mimic model
- More Munge data also needed
- Performance on TRUE DIST data is very good, so may get better performance if better synthetic data can be generated
80Adult Results
81Adult Results
82Adult Results
- More Munge data or more hidden units doesn't seem to help much
- Adult has a few high-arity nominal attributes that, when binarized, increase the number of attributes from 14 to 104 sparse binary attributes
- Neural nets may not be well suited for this problem?
- Munge may not be effective in generating good pseudo data for Adult?
83RMSE Results 400K, 256 HU
RATIO (MUNGE ANN) / (ENSEMBLE ANN)
84We're Retaining 97% of Accuracy of Target Model, but How Are We Doing on Compression?
85Size of Models (MB)
RATIO ENSEMBLE / MUNGE
86Execution Time of Models
Time in seconds to classify 10,000 examples
RATIO ENSEMBLE / MUNGE
87Summary of Compression Results
- Neural nets trained to mimic high-performing ensemble selection models
- on average, capture more than 97% of the performance of target model
- perform much better than any ANN we could train on original data
- More than 2000 times smaller than target ensemble
- More than 1000 times faster than target ensemble
88Related Work
- Neural Nets Approximator (Zeng and Martinez, 2000)
- Used same general approach
- Only pseudo data used to train the neural net
- Trained a neural net to model ensemble of neural nets
- Target model not nearly as complex as ES
89Related Work
- CMM (Combine Multiple Models) (Domingos, 1997)
- Goal: improve accuracy and stability of base classifier (C4.5 rules) without losing comprehensibility
- Create ensemble of base classifiers
- Train a base classifier on original data + extra data
- Generate extra data to be labeled by ensemble
- Method for generating extra data is specific to C4.5 rules
90Related Work
- TREPAN (Craven and Shavlik, 1996)
- Extract tree-structured representations of trained neural nets
- Used the original train set the nets were trained on
- Generated synthetic data at every node in the tree
- Learning rules from neural nets
- Towell and Shavlik, 1992; Craven and Shavlik, 1993, 1994
91Related Work
- Pruning adaptive boosting (Margineantu and Dietterich, 2000)
- To compress the ensemble, retain only some of the models it contains
- DECORATE (Melville and Mooney, 2003)
- Use extra data to increase the diversity of base classifiers in order to build a better ensemble
- Data generated randomly from each attribute's marginal distribution (similarly to our Random algorithm)
92What Still Needs to Be Done?
93Future Work: Other Mimic Models
- Neural nets are not the only possible mimic models
- Other learning methods may provide insight into effectiveness of model compression
- Things to do
- Use decision trees, SVMs, k-nearest neighbor models to mimic Ensemble Selection
- Expect to see
- Decision trees grow too large, need too much data
- Knn too slow
- SVMs need too many support vectors
94Future Work: Other Target Metrics
- Key feature of Ensemble Selection: can be optimized for different metrics (RMSE, ROC, ACC, Precision, ...)
- Important that compressed models are good on target metric
- If the squared error between target model and mimic neural net is small enough, performance on target metric should be similar
- Things to do
- Use neural nets to mimic ES optimized for accuracy, area under ROC curve
- May need to adapt the model compression approach for metrics other than RMSE
- Expect to see
- good performance for other metrics as well
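One way to see why a close fit should carry over to a thresholded metric such as accuracy: if the mimic never deviates from the target by more than eps, the two can only disagree on a 0/1 decision for examples where the target's prediction lies within eps of the threshold. The sketch below checks this on made-up numbers. It uses the maximum deviation rather than the average squared error named on the slide, which is a simplifying assumption, and `decision_flips` is a hypothetical helper.

```python
def decision_flips(target_preds, mimic_preds, threshold=0.5):
    """Return (max deviation, fraction of flipped decisions, fraction near threshold)."""
    n = len(target_preds)
    eps = max(abs(t - m) for t, m in zip(target_preds, mimic_preds))
    flips = sum((t > threshold) != (m > threshold)
                for t, m in zip(target_preds, mimic_preds))
    near = sum(abs(t - threshold) <= eps for t in target_preds)
    return eps, flips / n, near / n

# Illustrative predictions only; not taken from any experiment.
target = [0.92, 0.55, 0.48, 0.10, 0.71]
mimic = [0.90, 0.49, 0.50, 0.12, 0.70]
eps, flip_rate, near_rate = decision_flips(target, mimic)
```

The flip rate can never exceed the near-threshold rate, so shrinking the deviation shrinks the set of examples on which the mimic's accuracy can differ from the target's.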
95Future Work: Model Complexity
- Complexity of model varies from problem to problem
- To accurately approximate a model, the mimic model needs to have similar complexity
- For neural nets, number of hidden units is a measure of complexity
- Things to do
- For some problems, experiments with more hidden units
- Experiments with more than one hidden layer (ADULT)
- Expect to see
- For some problems, more hidden units will help
- For ADULT ???
96Future Work: Munge
- Two free parameters that must be set
- We might not have picked optimal values
- Different problems may have different optimal values
- Compression experiments are very expensive
- Things to do
- Experiment with different parameter values
- Try to find distance metric between datasets that expresses quality of data generated
- Expect to see
- Better synthetic data yields better compression with less data
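The Munge slides in this transcript are figures only, so the procedure itself is not spelled out here. The sketch below follows a common reading of the published description and should be treated as an assumption throughout: the two free parameters are taken to be a probability `p` of perturbing each attribute and a local variance parameter `s`, each example is paired with its nearest neighbour, and selected continuous attributes are redrawn from a normal centred on the other example's value with standard deviation |difference| / s. Nominal attributes, which would be swapped rather than resampled, are omitted, and the size multiplier `k` is likewise an assumed name.

```python
import math
import random

def munge(train, k, p, s, rng):
    """Continuous-attribute sketch of Munge; returns k * len(train) unlabeled examples."""
    synthetic = []
    for _ in range(k):
        working = [list(e) for e in train]   # fresh copy of the train set each pass
        for i, e in enumerate(working):
            # Nearest neighbour of e among the other examples (Euclidean distance).
            j = min((m for m in range(len(working)) if m != i),
                    key=lambda m: math.dist(e, working[m]))
            nb = working[j]
            for a in range(len(e)):
                if rng.random() < p:
                    sd = abs(e[a] - nb[a]) / s
                    # Each value is redrawn around the other example's old value.
                    e[a], nb[a] = rng.gauss(nb[a], sd), rng.gauss(e[a], sd)
        synthetic.extend(working)
    return synthetic

base = [[0.0, 0.0], [1.0, 0.5], [4.0, 4.0]]
generated = munge(base, k=3, p=0.5, s=1.0, rng=random.Random(1))
```

Under this reading, a larger `s` keeps generated points closer to the original pairs and a larger `p` perturbs more attributes per example, which is the kind of trade-off the parameter experiments listed above would explore.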
97Future Work: Active Learning
- Too many examples → labeling is expensive
- Too many examples → training is expensive
- Things to do
- Choosing the most important synthetic examples
- Retain only non-redundant examples generated by Munge
- Modify Munge so that it generates less redundant examples
- Expect to see
- Active learning reduces amount of train data needed
98Summary
- Ensemble learning yields most accurate models
- Ensemble selection is best ensemble method
- Ensembles sometimes are too big and too slow
- Compress complex ensemble into simpler ANN
- 97% of accuracy retained
- 2000 times smaller
- 1000 times faster
- Potentially useful measure of model complexity?
- Compression separates how function is learned from data and the model used at runtime to make predictions
99Thank You. Questions?
100(No Transcript)
101Hs Results
102Letter.P2 Results
103Medis Results
104Medis Results
105Mg Results
106Mg Results
107Slac Results
108Slac Results