Model Compression presentation

Title: Model Compression

1
Model Compression

Rich Caruana
Computer Science
Cornell University
Joint work with Cristian Bucila and Alex
Niculescu-Mizil

2
Outline

Motivation
Ensemble learning usually most accurate
Ensemble models can be large and slow
Model compression
Where does data come from?
Experimental results
Related work
Future work
Summary

3
Supervised Learning

Major Goals
Accurate Models
Easy to train
Fast to train
Can deal with many data types
Can deal with many performance criteria
Does not require too much human expertise
Compact, easy to use models
Intelligible models
Fast predictions
Confidences for predictions
Explanations for predictions

4
Normalized Scores for ES
5
Ensemble Selection Works, But Is It Worth It?

Best of best of best yields 20% reduction in
loss compared to boosted trees
Accuracy or AUC increase from 88% to 90%
RMS decrease from 0.25 to 0.20
Typically 10% reduction in loss compared to best
model above
Accuracy or AUC increase from 90% to 91%
RMS decrease from 0.20 to 0.18
Overall reduction in loss can be 30%, which is
significant
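
The percentages on this slide are relative reductions in loss, not absolute gains. A quick check of the arithmetic, assuming loss means 1 - accuracy for accuracy/AUC (the slide does not define it) and that the RMS figures chain from 0.25 to 0.18:

```python
def relative_reduction(old_loss, new_loss):
    """Fraction by which the loss shrinks when moving from old_loss to new_loss."""
    return (old_loss - new_loss) / old_loss

# Accuracy/AUC 88% -> 90% means loss 0.12 -> 0.10 (assumed definition of loss).
acc_step = relative_reduction(1 - 0.88, 1 - 0.90)
# RMS 0.25 -> 0.20, the first step on the slide.
rms_step = relative_reduction(0.25, 0.20)
# Both steps together: RMS 0.25 -> 0.18.
rms_total = relative_reduction(0.25, 0.18)

print(f"{acc_step:.1%} {rms_step:.1%} {rms_total:.1%}")
```

The results land near the rounded 20% and 30% figures quoted above.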

6
Computational Cost

Have to train multiple models anyway
models can be trained in parallel
different packages, different machines, at
different times, by different people
just generate and collect (no optimization
necessary, no test sets)
saves human effort -- no need to examine/optimize
models
48 hours on 10 workstations to train 2000
models with 5k train sets
model library can be built before optimization
metric is known
anytime selection -- no need to wait for all
models
Ensemble Selection is cheap
each iteration, consider adding 2000 models to
ensemble
adding model is simple unweighted averaging of
predictions
caching makes this very efficient
compute performance metric when each model is
added
for 250 iterations, evaluate 250 × 2000 = 500,000
ensembles
1 minute on workstation if metric is not
expensive
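
The selection loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes a lower-is-better metric (RMSE here), selection with replacement, and a hypothetical `library` dict holding each model's predictions on a held-out selection set. The running sum is the cache that makes each candidate evaluation cheap.

```python
import math

def rmse(pred, y):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y))

def ensemble_selection(library, y, iterations, metric=rmse):
    """Greedy forward selection; returns the chosen model names (repeats allowed).

    library: dict name -> list of predictions (hypothetical layout).
    """
    running_sum = [0.0] * len(y)   # cached sum of predictions of selected models
    selected = []
    for _ in range(iterations):
        k = len(selected) + 1
        best_name, best_score = None, float("inf")
        for name, preds in library.items():
            # Candidate ensemble is the unweighted average of selected models plus this one.
            candidate = [(s + p) / k for s, p in zip(running_sum, preds)]
            score = metric(candidate, y)
            if score < best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, library[best_name])]
    return selected

y = [0.0, 1.0, 1.0, 0.0]
library = {
    "high": [0.2, 1.2, 1.2, 0.2],    # biased upward
    "low": [-0.2, 0.8, 0.8, -0.2],   # biased downward
    "flat": [0.5, 0.5, 0.5, 0.5],
}
chosen = ensemble_selection(library, y, iterations=2)
```

With 2000 models in the library and 250 iterations, the inner loop runs 500,000 times, matching the count on the slide.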

7
Ensemble Selection

Good news
A carefully selected ensemble that combines many
models outperforms boosting, bagging, random
forests, SVMs, and neural nets, (because it
builds on top of them)
Bad news
The ensembles are too big, too slow, too
cumbersome to use for most applications

8
Best Ensembles are Big & Ugly!

Best ensemble for one problem/metric has 422
models
72 boosted trees (28,642 individual decision
trees!)
1 random forest (1024 decision trees)
5 bagged trees (100 decision trees in each model)
44 neural nets (2,200 hidden units total,
>100,000 weights)
115 knn models (both large and expensive!)
38 SVMs (100s of support vectors in each model)
26 boosted stump models (36,184 stumps total --
could compress)
122 individual decision trees

9
Best Ensembles are Big & Slow!

Size
Best single model: 1.41 MB
Ensemble selection: 550.29 MB
Speed (to classify 10,000 examples)
Best single model: 93.37 secs / 10k
Ensemble selection: 5396.27 secs / 10k

Can't we make the ensembles smaller, faster, and
easier to use by eliminating some base-level
models?
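
To put the size and speed figures on this slide in proportion, the ratios can be computed directly. The variable names are illustrative only:

```python
single_mb, ensemble_mb = 1.41, 550.29
single_secs, ensemble_secs = 93.37, 5396.27   # seconds per 10,000 examples

size_ratio = ensemble_mb / single_mb
time_ratio = ensemble_secs / single_secs
print(f"ensemble is about {size_ratio:.0f}x larger and {time_ratio:.0f}x slower")
```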

11
What Models are Used in Ensembles?
12
What Models are Used in Ensembles?
13
Summary of Models Used by ES

Most ensembles use 10-100 of the 2000 models
Different models are selected for different
problems
Different models are selected for different
metrics
Most ensembles use a diversity of model types
Most ensembles use different parameter settings
Selected Models often make sense
Neural nets for RMS, Cross-Entropy
Max-margin methods for Accuracy
Large k in knn for AUC

14
Motivation: Model Compression

Unfortunately, ensembles are not suitable for many
applications
PDAs (storage space is important)
Cell phones (storage space)
Hearing aids (storage space; speed is important
because of power restrictions)
Search engines like Google (speed)
Image recognition applications (speed)
Our solution: Model Compression
Models perform as well as the best ensembles, but
small and fast enough to be used

15
Solution: Model Compression

Train simple model to mimic the complex model
Pass large amounts of unlabeled data (synthetic
data points or real unlabeled data) through
ensemble and collect predictions
100,000 to 10,000,000 synthetic training points
Extensional representation of the ensemble model
Train copycat model on this large synthetic train
set to mimic the high-performance ensemble
Train neural net to mimic ensemble
Potential to not only perform as well as target
ensemble, but possibly outperform it
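
The recipe on this slide can be sketched end to end in a few lines. Everything below is a toy stand-in under stated assumptions: `ensemble_predict` is a hypothetical "ensemble" (an average of three fixed functions of one input), the unlabeled data is uniform random, and the mimic is a small one-hidden-layer tanh net fit by plain SGD on squared error. It shows the shape of the pipeline, not the experimental setup from the talk.

```python
import math
import random

def ensemble_predict(x):
    """Hypothetical stand-in for a large ensemble: average of three base models."""
    members = (math.sin(3 * x), x * x, 0.5 * x)
    return sum(members) / len(members)

def train_mimic(xs, targets, hidden=8, epochs=100, lr=0.05, seed=0):
    """Fit a one-hidden-layer tanh net to the ensemble's outputs with SGD."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(xs, targets):
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            err = sum(w2[j] * h[j] for j in range(hidden)) + b2 - t
            for j in range(hidden):
                grad_h = err * w2[j] * (1 - h[j] ** 2)   # uses w2 before its update
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err

    def predict(x):
        return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(hidden)) + b2
    return predict

rng = random.Random(1)
unlabeled = [rng.uniform(-1, 1) for _ in range(500)]       # unlabeled or synthetic points
pseudo_labels = [ensemble_predict(x) for x in unlabeled]   # labels come from the ensemble
mimic = train_mimic(unlabeled, pseudo_labels)

# Compare the mimic to the ensemble on a grid, against a trivial predict-zero baseline.
grid = [i / 50 - 1 for i in range(101)]
mimic_rmse = math.sqrt(sum((mimic(x) - ensemble_predict(x)) ** 2 for x in grid) / len(grid))
baseline_rmse = math.sqrt(sum(ensemble_predict(x) ** 2 for x in grid) / len(grid))
```

Note that no true labels appear anywhere: the mimic only ever sees the ensemble's predictions, which is what makes unlabeled or synthetic data usable.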

16
Why Mimic with Neural Nets?

Decision trees do not work well
synthetic data must be very large because of
recursive partitioning
mimic decision trees are enormous (depth > 1000
and > 10^6 nodes), making them expensive to store
and compute
single tree does not seem to model ensemble
accurately enough
SVMs
number of support vectors increases quickly with
complexity
Artificial Neural nets
can model complex functions with a modest number of
hidden units
can compress millions of training cases into
thousands of weights
expensive to train, but execution cost low (just
matrix multiplies)
models with few thousand weights have small
footprint
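
The footprint claim is easy to check for a fully connected net with one hidden layer. The sizes below (20 inputs, 64 hidden units, 32-bit floats) are hypothetical and chosen only for illustration:

```python
def mlp_weight_count(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases in a fully connected net with one hidden layer."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

n_weights = mlp_weight_count(20, 64)   # hypothetical problem size
bytes_needed = n_weights * 4           # assuming 32-bit floats
print(n_weights, bytes_needed)
```

A net of this size needs only a few kilobytes, and a prediction is two small matrix-vector products.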

17
Unlabeled Data?

Assume original labeled training set is small
But we need a large train set to train the mimic
ANN
Should come from same distribution as train data
Learned model must focus on most important
regions in space
For some domains unlabeled data is available
Text, web, images, ...
If not available, we need to generate synthetic
data
Random
NBE
Munge

18
Synthetic Data: True Distribution
19
Synthetic Data: Small Sample
20-22
Synthetic Data: Random

Values for attributes are generated randomly from
their univariate distribution
The conditional structure of the data is lost
Many generated examples cover uninteresting
regions of the space
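
A minimal sketch of this Random generator, assuming each attribute's univariate distribution is approximated by resampling the values observed in the train set (the slides do not say how the marginals are estimated). The toy data shows the conditional structure being lost:

```python
import random

def random_synthetic(train_rows, n_samples, seed=0):
    """Draw each attribute independently from its empirical univariate distribution."""
    rng = random.Random(seed)
    columns = list(zip(*train_rows))   # one tuple of observed values per attribute
    return [tuple(rng.choice(col) for col in columns) for _ in range(n_samples)]

# Toy train set with two perfectly dependent attributes: the second is 10x the first.
train = [(1, 10), (2, 20), (3, 30), (4, 40)]
synthetic = random_synthetic(train, n_samples=1000)

# Count synthetic rows that violate the dependency present in every real row.
broken = sum(1 for a, b in synthetic if b != 10 * a)
```

Every generated value is plausible on its own, but many rows combine values that never occur together in the real data.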

23-24
Synthetic Data: NBE

Estimate the joint distribution from the train
set
NBE (Naïve Bayes Estimation) algorithm
Lowd and Domingos, 2005
Code for learning and sampling available

These don't work well enough.
Had to develop a new, better method.

Munging
1. To imperfectly transform information.
2. To modify data in a way that cannot be described succinctly.

27-54
Munging (figure-only slides)
55-58
Synthetic Data: Munge (figure-only slides)
59
Synthetic Data
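
The Munge slides are figures only, so the transcript does not spell out the procedure. The following is a rough sketch of a munge-style generator under loudly stated assumptions: attributes are continuous, each example is paired with its nearest neighbour, and with probability `p` an attribute is redrawn from a Gaussian centred on the neighbour's value with standard deviation |difference| / `s`. The parameter names `p` and `s` and the exact update rule are assumptions for illustration, not taken from these slides.

```python
import random

def munge(train_rows, p=0.5, s=1.0, seed=0):
    """One pass of a munge-style generator (continuous attributes only, >= 2 rows).

    p: probability of perturbing each attribute (assumed parameter).
    s: local variance parameter; larger s keeps values closer (assumed parameter).
    """
    rng = random.Random(seed)
    out = []
    for i, e in enumerate(train_rows):
        # Nearest neighbour by squared Euclidean distance, excluding the example itself.
        j = min((k for k in range(len(train_rows)) if k != i),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(e, train_rows[k])))
        neighbour = train_rows[j]
        new_e = list(e)
        for a in range(len(e)):
            if rng.random() < p:
                sd = abs(e[a] - neighbour[a]) / s
                # New value is drawn around the neighbour's value for this attribute.
                new_e[a] = rng.gauss(neighbour[a], sd)
        out.append(tuple(new_e))
    return out

train = [(0.0, 0.0), (1.0, 1.2), (5.0, 4.8), (5.5, 5.1)]
synthetic = munge(train, p=0.5, s=1.0)
```

Calling the function repeatedly with different seeds would grow the synthetic set to any desired size while keeping new points near the observed data.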
60
Now That We Have a Method to Generate Data, Let's Do Some Compression
61
Experimental Setup: Datasets
62
Experimental Setup

Target model: Ensemble Selection
Mimic model: neural net
Up to 256 hidden units
Synthetic data
Up to 400,000 examples
Methods
Random
NBE
Munge
Unlabeled vs. Synthetic

63-68
Average Results by Size (figure-only slides)
69
Letter.P1 Results
70
Hs Results
71
Average Results by HU
72
Letter.P1 Results
73
Letter.P2 Results
74
Letter Results

Letter.p1: Distinguish letter O from the rest
Letter.p2: Distinguish letters A-M from N-Z
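
The two binary problems can be written as label functions over the letter class. Which side counts as the positive class is an assumption here, since the slide does not say:

```python
def letter_p1(letter):
    """Letter.p1: 1 if the letter is O, else 0 (positive class is assumed)."""
    return 1 if letter.upper() == "O" else 0

def letter_p2(letter):
    """Letter.p2: 1 if the letter is in A-M, 0 if in N-Z (positive class is assumed)."""
    return 1 if "A" <= letter.upper() <= "M" else 0

alphabet = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
p1_positives = sum(map(letter_p1, alphabet))   # letters labelled positive under p1
p2_positives = sum(map(letter_p2, alphabet))   # letters labelled positive under p2
```

Under these definitions p1 is a heavily unbalanced problem (one letter against 25) while p2 splits the alphabet in half.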

75
It Doesn't Always Work As Well As We'd Like, Yet!
76-79
Covtype Results

More hidden units necessary to get a better
mimic model
More Munge data also needed
Performance on TRUE DIST data is very good, so
may get better performance if better
synthetic data can be generated

80-82
Adult Results

More Munge data or more hidden units doesn't
seem to help much
Adult has a few high-arity nominal attributes
that, when binarized, increase the number of
attributes from 14 to 104 sparse binary
attributes
Neural nets may not be well suited for this
problem?
Munge may not be effective in generating good
pseudo data for adult?
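
The attribute blow-up mentioned on this slide comes from binarizing (one-hot encoding) nominal attributes. A small sketch with hypothetical rows and column choices, not the actual Adult preprocessing:

```python
def binarize(rows, nominal_columns):
    """One-hot encode the given nominal columns; other columns pass through unchanged."""
    levels = {c: sorted({row[c] for row in rows}) for c in nominal_columns}
    encoded = []
    for row in rows:
        out = []
        for c, value in enumerate(row):
            if c in levels:
                out.extend(1 if value == level else 0 for level in levels[c])
            else:
                out.append(value)
        encoded.append(out)
    return encoded

# Hypothetical rows: (age, occupation, country).
rows = [(39, "clerical", "US"), (50, "exec", "US"),
        (38, "cleaner", "Cuba"), (28, "exec", "India")]
encoded = binarize(rows, nominal_columns={1, 2})
```

Three original attributes become seven here, and each nominal attribute contributes exactly one 1 per row, so most inputs are zero. A nominal attribute with dozens of values inflates the input layer in the same way.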

83
RMSE Results: 400K, 256 HU
RATIO = (MUNGE - ANN) / (ENSEMBLE - ANN)
84
We're Retaining 97% of Accuracy of Target
Model, but How Are We Doing on Compression?
85
Size of Models (MB)
RATIO = ENSEMBLE / MUNGE
86
Execution Time of Models
Time in seconds to classify 10,000 examples
RATIO = ENSEMBLE / MUNGE
87
Summary of Compression Results

Neural nets trained to mimic high-performing
ensemble selection models
on average, capture more than 97% of the performance of
target model
perform much better than any ANN we could train
on original data
More than 2000 times smaller than target ensemble
More than 1000 times faster than target ensemble

88
Related Work

Neural Nets Approximator: Zeng and Martinez,
2000
Used same general approach
Only pseudo data used to train the neural net
Trained a neural net to model ensemble of neural
nets
Target model not nearly as complex as ES

89
Related Work

CMM (Combine Multiple Models): Domingos, 1997
Goal: improve accuracy and stability of base
classifier (C4.5 rules) without losing
comprehensibility
Create ensemble of base classifiers
Train a base classifier on original data + extra
data
Generate extra data to be labeled by ensemble
Method for generating extra data is specific to C4.5
rules

90
Related Work

TREPAN: Craven and Shavlik, 1996
Extract tree-structured representations of
trained neural nets
Used the original train set the nets were trained
on
Generated synthetic data at every node in the
tree
Learning rules from neural nets:
Towell and Shavlik, 1992; Craven and Shavlik,
1993, 1994

91
Related Work

Pruning adaptive boosting: Margineantu and
Dietterich, 2000
To compress the ensemble, retain only some of the
models it contains
DECORATE: Melville and Mooney, 2003
Use extra data to increase the diversity of base
classifiers in order to build a better ensemble
Data generated randomly from each attribute's
marginal distribution (similar to our Random
algorithm)

92
What Still Needs to Be Done?
93
Future Work: Other Mimic Models

Neural nets are not the only possible mimic models
Other learning methods may provide insight into
effectiveness of model compression
Things to do
Use Decision Trees, SVMs, k-nearest neighbor
models to mimic Ensemble Selection
Expect to see
Decision trees grow too large, need too much data
Knn too slow
SVMs need too many support vectors

94
Future Work: Other Target Metrics

Key feature of Ensemble Selection: can be
optimized for different metrics (RMSE, ROC, ACC,
Precision, ...)
Important that compressed models are good on target
metric
If the squared error between target model and
mimic neural net is small enough, performance on
target metric should be similar
Things to do
Use neural nets to mimic ES optimized for
accuracy, area under ROC curve
May need to adapt the model compression approach
for metrics other than RMSE
Expect to see
good performance for other metrics as well
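
For RMSE the claim about squared error can be made precise with the triangle inequality. The notation is introduced here for illustration and is not from the slides: y is the vector of true labels, f the target ensemble's predictions, and g the mimic's predictions on the same n examples.

```latex
\left|\,\mathrm{RMSE}(g, y) - \mathrm{RMSE}(f, y)\,\right| \;\le\; \mathrm{RMSE}(g, f),
\qquad
\mathrm{RMSE}(u, v) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} (u_i - v_i)^2}
```

So a mimic whose RMSE against the ensemble is small cannot be much worse than the ensemble on the true labels. No comparable bound follows automatically for accuracy or area under the ROC curve, which is consistent with the note that the approach may need adapting for other metrics.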

95
Future Work: Model Complexity

Complexity of model varies from problem to
problem
To accurately approximate a model, the mimic
model needs to have similar complexity
For neural nets, number of hidden units is a
measure of complexity
Things to do
For some problems, experiments with more hidden
units
Experiments with more than one hidden layer
(ADULT)
Expect to see
For some problems, more hidden units will help
For ADULT ???

96
Future Work: Munge

Two free parameters that must be set
We might not have picked optimal values
Different problems may have different optimal
values
Compression experiments are very expensive
Things to do
Experiment with different parameter values
Try to find distance metric between datasets that
expresses quality of data generated
Expect to see
Better synthetic data yields better compression
with less data

97
Future Work: Active Learning

Too many examples → labeling is expensive
Too many examples → training is expensive
Things to do
Choose the most important synthetic examples
Retain only non-redundant examples generated by
Munge
Modify Munge so that it generates fewer redundant
examples
Expect to see
Active learning reduces amount of train data
needed
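
One simple way to read "retain only non-redundant examples" is a greedy distance filter. This is only a sketch of that idea: the slides do not define redundancy, so the Euclidean distance and the `min_dist` threshold are assumptions.

```python
def drop_redundant(points, min_dist):
    """Greedily keep a point only if it is at least min_dist from every point kept so far."""
    kept = []
    for p in points:
        if all(sum((a - b) ** 2 for a, b in zip(p, q)) >= min_dist ** 2 for q in kept):
            kept.append(p)
    return kept

synthetic = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (1.0, 1.02), (3.0, 0.0)]
kept = drop_redundant(synthetic, min_dist=0.1)
```

Fewer retained points means fewer calls to the slow ensemble for labels and a smaller training set for the mimic. The comparison against all kept points is quadratic, so a large synthetic set would need a spatial index.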

98
Summary

Ensemble learning yields most accurate models
Ensemble selection is best ensemble method
Ensembles sometimes are too big and too slow
Compress complex ensemble into simpler ANN
97% of accuracy retained
2000 times smaller
1000 times faster
Potentially useful measure of model complexity?
Compression separates how function is learned
from data and the model used at runtime to make
predictions

99
Thank You. Questions?
101
Hs Results
102
Letter.P2 Results
103
Medis Results
104
Medis Results
105
Mg Results
106
Mg Results
107
Slac Results
108
Slac Results
