Title: Multitask Learning
 1Spooky Stuff in Metric Space 
2Spooky Stuff: Data Mining in Metric Space
- Rich Caruana 
- Alex Niculescu 
- Cornell University
3Motivation 1 
4Motivation 1: Pneumonia Risk Prediction
5Motivation 1: Many Learning Algorithms
- Neural nets 
- Logistic regression 
- Linear perceptron 
- K-nearest neighbor 
- Decision trees 
- ILP (Inductive Logic Programming) 
- SVMs (Support Vector Machines) 
- Bagging X 
- Boosting X 
- Rule learners (C2, ...)
- Ripper 
- Random Forests (forests of decision trees) 
- Gaussian Processes 
- Bayes Nets 
- ...
- No single learning method (or small set of methods) dominates the others
6Motivation 2 
7Motivation 2: SLAC B/Bbar
- Particle accelerator generates B/Bbar particles
- Use machine learning to classify tracks as B or Bbar
- Domain-specific performance measure: SLQ-Score
- A 5% increase in SLQ can save $1M in accelerator time
- SLAC researchers tried various DM/ML methods
- Good, but not great, SLQ performance
- We tried standard methods, got similar results
- We studied the SLQ metric
- similar to probability calibration
- tried bagged probabilistic decision trees (good on C-Section)
8Motivation 2: Bagged Probabilistic Trees
- Draw N bootstrap samples of data
- Train tree on each sample => N trees
- Final prediction = average prediction of N trees
Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + ... + 0.31) / # Trees = 0.24
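The three steps above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the base learner here is a hypothetical one-split "stump" on a single feature (standing in for a full probabilistic decision tree), and the toy data, `fit_stump`, and `bagged_predictor` are all assumptions made for the example.

```python
import random


def fit_stump(rows):
    """Fit a one-split probabilistic 'tree' on (x, y) pairs with y in {0, 1}.

    Each leaf stores the fraction of positives that fell into it.
    """
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        if not right:              # largest x: nothing left to split off
            continue
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right)
        # Squared error of predicting the leaf rate for every case.
        err = (sum((y - p_left) ** 2 for y in left)
               + sum((y - p_right) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, p_left, p_right)
    if best is None:               # all x identical: predict the base rate
        rate = sum(y for _, y in rows) / len(rows)
        return lambda x: rate
    _, t, p_left, p_right = best
    return lambda x: p_left if x <= t else p_right


def bagged_predictor(rows, n_trees, seed=0):
    """Train n_trees stumps on bootstrap samples; predict their average."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # draw with replacement
        trees.append(fit_stump(sample))
    return lambda x: sum(tree(x) for tree in trees) / len(trees)


# Hypothetical toy data: P(y = 1 | x) rises roughly linearly with x.
data = [(i / 100, 1 if (i * 37) % 100 < i else 0) for i in range(100)]
predict = bagged_predictor(data, n_trees=25)
print(predict(0.1), predict(0.9))
```

A single stump can only output two distinct probabilities; averaging many stumps trained on different bootstrap samples yields a smoother set of predicted probabilities, which is the effect the slide attributes to bagging.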
9Motivation 2: Improves Calibration by an Order of Magnitude
(Figure: calibration of a single tree, poor, vs. 100 bagged trees, excellent)
10Motivation 2: Significantly Improves SLQ
(Figure: SLQ of 100 bagged trees vs. a single tree)
11Motivation 2
- Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?
12Motivation 3 
 13Motivation 3 
 14Scary Stuff
- In ideal world 
- Learn model that predicts correct conditional 
 probabilities (Bayes optimal)
- Yield optimal performance on any reasonable 
 metric
- In real world 
- Finite data 
- 0/1 targets instead of conditional probabilities 
- Hard to learn this ideal model 
- Don't have good metrics for recognizing ideal model
- Ideal model isn't always needed
- In practice
- Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
- Each metric represents different tradeoffs
- Because of this, usually important to optimize to appropriate metric
15Scary Stuff 
 16Scary Stuff 
 17In this work we compare nine commonly used 
performance metrics by applying data mining to 
the results of a massive empirical study
- Goals 
- Discover relationships between performance 
 metrics
- Are the metrics really that different? 
- If you optimize to metric X, also get good perf 
 on metric Y?
- Need to optimize to metric Y, which metric X 
 should you optimize to?
- Which metrics are more/less robust? 
- Design new, better metrics?
18Ten Binary Classification Performance Metrics
- Threshold Metrics 
- Accuracy 
- F-Score 
- Lift 
- Ordering/Ranking Metrics 
- ROC Area 
- Average Precision 
- Precision/Recall Break-Even Point 
- Probability Metrics 
- Root-Mean-Squared-Error 
- Cross-Entropy 
- Probability Calibration 
- SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
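As a worked illustration of the SAR formula, the sketch below combines the three ingredients for a tiny hypothetical set of predictions. It assumes that "Squared Error" means root-mean-squared error (the probability metric listed above), that accuracy uses a threshold of 0.5, and that ROC area is computed as the probability that a random positive outranks a random negative; none of these choices is confirmed by the slides.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of cases whose thresholded prediction matches the 0/1 target."""
    hits = sum((p >= threshold) == (y == 1) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def rms_error(y_true, y_prob):
    """Root-mean-squared difference between predicted probability and target."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true))


def roc_area(y_true, y_prob):
    """P(random positive is scored above random negative); ties count 1/2."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def sar(y_true, y_prob, threshold=0.5):
    """SAR = ((1 - squared error) + accuracy + ROC area) / 3."""
    return ((1.0 - rms_error(y_true, y_prob))
            + accuracy(y_true, y_prob, threshold)
            + roc_area(y_true, y_prob)) / 3.0


# Hypothetical predictions for four cases.
y = [1, 1, 0, 0]
p = [0.9, 0.4, 0.6, 0.1]
print(accuracy(y, p), roc_area(y, p), round(sar(y, p), 4))  # 0.5 0.75 0.6066
```

Because all three terms lie in [0, 1] with 1 as the best value, SAR does too; a model that predicts every target exactly scores 1.0.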
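19Accuracy
            Predicted 1      Predicted 0
True 1      a (correct)      b (incorrect)
True 0      c (incorrect)    d (correct)
The threshold on the model's output separates Predicted 1 from Predicted 0.
accuracy = (a + d) / (a + b + c + d)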
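The four cells and the accuracy formula can be computed directly from targets and model outputs. The sketch below is an illustration only; the function names and the toy data are hypothetical, and the rule "predict 1 when the score is strictly above the threshold" is an assumption.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (a, b, c, d) in the slide's layout.

    a = true 1, predicted 1    b = true 1, predicted 0
    c = true 0, predicted 1    d = true 0, predicted 0
    """
    a = b = c = d = 0
    for y, s in zip(y_true, scores):
        predicted_one = s > threshold
        if y == 1:
            if predicted_one:
                a += 1
            else:
                b += 1
        else:
            if predicted_one:
                c += 1
            else:
                d += 1
    return a, b, c, d


def accuracy_from_counts(a, b, c, d):
    """accuracy = (a + d) / (a + b + c + d)."""
    return (a + d) / (a + b + c + d)


# Hypothetical targets and model outputs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
a, b, c, d = confusion_counts(y_true, scores, threshold=0.5)
print(a, b, c, d, accuracy_from_counts(a, b, c, d))  # 2 1 1 4 0.75
```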
20Lift
- not interested in accuracy on entire dataset
- want accurate predictions for 5%, 10%, or 20% of dataset
- don't care about remaining 95%, 90%, 80%, resp.
- typical application: marketing
- how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)
21Lift
            Predicted 1    Predicted 0
True 1      a              b
True 0      c              d
The threshold on the model's output separates Predicted 1 from Predicted 0.
22lift = 3.5 if mailings sent to 20% of the customers
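The lift formula itself did not survive in this transcript. The sketch below assumes the usual definition in the a, b, c, d notation, lift = [a / (a + b)] / [(a + c) / (a + b + c + d)]: the fraction of all positives captured above the threshold, divided by the fraction of the dataset above the threshold. The function name and the toy data are hypothetical.

```python
def lift_at_fraction(y_true, scores, fraction):
    """Lift when the top `fraction` of cases (by score) is predicted positive."""
    n = len(y_true)
    k = max(1, round(fraction * n))            # cases predicted 1, i.e. a + c
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    a = sum(y for _, y in ranked[:k])          # positives above the threshold
    total_pos = sum(y_true)                    # a + b
    return (a / total_pos) / (k / n)


# Hypothetical data: 10 customers, 4 of whom respond (y = 1).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(lift_at_fraction(y_true, scores, 0.20))  # top 2 hold 2 of 4 responders: 0.5 / 0.2
```

Under this definition, a lift of 3.5 at 20% means the top-scored 20% of customers contains 3.5 x 20% = 70% of all responders; predicting the whole dataset positive always gives a lift of 1.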
23Precision/Recall, F, Break-Even Pt
- F: harmonic average of precision and recall
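In terms of the confusion-matrix cells (a = true 1 predicted 1, b = true 1 predicted 0, c = true 0 predicted 1), precision is a / (a + c) and recall is a / (a + b). The sketch below assumes the unweighted harmonic mean (often called F1), since the slide does not mention a weighting; the counts are hypothetical.

```python
def precision_recall_f(a, b, c):
    """Precision, recall, and their harmonic mean from confusion-matrix cells.

    Assumes a > 0 so that neither denominator is zero.
    """
    precision = a / (a + c)      # how many returned cases are truly positive
    recall = a / (a + b)         # how many positives the model returns
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 30 hits, 10 misses, 20 false alarms.
print(precision_recall_f(30, 10, 20))
```

The harmonic mean is dominated by the smaller of the two values, so a model cannot obtain a high F-score by trading away nearly all precision for recall, or the reverse.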
24(Figure annotations: better performance, worse performance)
25(Confusion matrices: equivalent terminology)
            Predicted 1            Predicted 0
True 1      true positive (TP)     false negative (FN)
True 0      false positive (FP)    true negative (TN)

            Predicted 1                 Predicted 0
True 1      hits, P(pr1|tr1)            misses, P(pr0|tr1)
True 0      false alarms, P(pr1|tr0)    correct rejections, P(pr0|tr0)
 26ROC Plot and ROC Area
- Receiver Operator Characteristic 
- Developed in WWII to statistically model false 
 positive and false negative detections of radar
 operators
- Better statistical foundations than most other 
 measures
- Standard measure in medicine and biology 
- Becoming more popular in ML 
- Sweep threshold and plot 
- TPR vs. FPR 
- Sensitivity vs. 1-Specificity 
- P(true|true) vs. P(true|false)
- Sensitivity = a/(a+b) = Recall = LIFT numerator
- 1 - Specificity = 1 - d/(c+d)
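The threshold sweep described above can be sketched as follows. This is a minimal illustration with hypothetical function names and data: each distinct score is used as a threshold (a case is predicted 1 when its score is at or above it), each threshold contributes one (1 - specificity, sensitivity) point, and the area is taken with the trapezoid rule.

```python
def roc_points(y_true, scores):
    """Sweep the threshold over every distinct score; return (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]                       # threshold above every score
    for t in sorted(set(scores), reverse=True):
        # a = true 1 predicted 1, c = true 0 predicted 1 at this threshold
        a = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        c = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((c / n_neg, a / n_pos))   # (1 - specificity, sensitivity)
    return points


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical targets and scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
curve = roc_points(y_true, scores)
print(curve)
print(area_under(curve))  # 0.75
```

Only the ordering of the scores matters here: applying any strictly increasing transformation to `scores` leaves the curve and its area unchanged.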
27diagonal line is random prediction 
 28Calibration
- Good calibration 
- If 1000 x's have pred(x) = 0.2, about 200 should be positive
29Calibration
- Model can be accurate but poorly calibrated 
- good threshold with uncalibrated probabilities 
- Model can have good ROC but be poorly calibrated 
- ROC insensitive to scaling/stretching 
- only ordering has to be correct, not 
 probabilities themselves
- Model can have very high variance, but be well 
 calibrated
- Model can be stupid, but be well calibrated 
- Calibration is a real oddball
30Measuring Calibration
- Bucket method 
- In each bucket 
- measure observed c-sec rate 
- predicted c-sec rate (average of probabilities) 
- if observed c-sec rate similar to predicted c-sec rate => good calibration in that bucket
(Figure: ten buckets on [0, 1] with edges 0.0, 0.1, ..., 1.0 and midpoints 0.05, 0.15, ..., 0.95)
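The bucket method can be sketched as below, assuming ten equal-width buckets on [0, 1] as in the figure. Summarizing the per-bucket gaps as a case-weighted mean absolute difference is an assumption made for this example, not necessarily the calibration score used in the study; the function names are hypothetical.

```python
def calibration_table(y_true, y_prob, n_buckets=10):
    """Per-bucket (low edge, high edge, count, observed rate, predicted rate)."""
    buckets = [[] for _ in range(n_buckets)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_buckets), n_buckets - 1)   # p = 1.0 joins the last bucket
        buckets[idx].append((y, p))
    rows = []
    for i, cases in enumerate(buckets):
        if not cases:
            continue                                   # skip empty buckets
        observed = sum(y for y, _ in cases) / len(cases)
        predicted = sum(p for _, p in cases) / len(cases)
        rows.append((i / n_buckets, (i + 1) / n_buckets, len(cases), observed, predicted))
    return rows


def calibration_error(rows):
    """Case-weighted mean |observed - predicted| over the non-empty buckets."""
    total = sum(n for _, _, n, _, _ in rows)
    return sum(n * abs(obs - pred) for _, _, n, obs, pred in rows) / total


# The example from the Calibration slide: 1000 cases predicted 0.2, 200 positive.
y_true = [1] * 200 + [0] * 800
y_prob = [0.2] * 1000
rows = calibration_table(y_true, y_prob)
print(rows, calibration_error(rows))
```

In this example every case lands in the [0.2, 0.3) bucket, where the observed rate and the average predicted probability are both 0.2, so the calibration error is essentially zero even though the model gives every case the same prediction.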
 31Calibration Plot 
 32Experiments 
 33Base-Level Learning Methods
- Decision trees 
- K-nearest neighbor 
- Neural nets 
- SVMs 
- Bagged Decision Trees 
- Boosted Decision Trees 
- Boosted Stumps 
- Each optimizes different things 
- Each best in different regimes 
- Each algorithm has many variations and free 
 parameters
- Generate about 2000 models on each test problem
34Data Sets
- 7 binary classification data sets 
- Adult 
- Cover Type 
- Letter.p1 (balanced) 
- Letter.p2 (unbalanced) 
- Pneumonia (University of Pittsburgh) 
- Hyper Spectral (NASA Goddard Space Center) 
- Particle Physics (Stanford Linear Accelerator) 
- 4k train sets
- Large final test sets (usually 20k) 
35Massive Empirical Comparison
- 7 base-level learning methods
- x 100s of parameter settings per method
- = about 2000 models per problem
- x 7 test problems
- = 14,000 models
- x 10 performance metrics
- = 140,000 model performance evaluations
36COVTYPE Calibration vs. Accuracy 
 37Multi Dimensional Scaling 
 38Scaling, Ranking, and Normalizing
- Problem 
- some metrics, 1.00 is best (e.g. ACC) 
- some metrics, 0.00 is best (e.g. RMS) 
- some metrics, baseline is 0.50 (e.g. AUC) 
- some problems/metrics, 0.60 is excellent 
 performance
- some problems/metrics, 0.99 is poor performance 
- Solution 1: Normalized Scores
- baseline performance => 0.00
- best observed performance => 1.00 (proxy for Bayes optimal)
- puts all metrics on equal footing
- Solution 2: Scale by Standard Deviation
- Solution 3: Rank Correlation
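Solution 1 amounts to a linear rescaling that sends baseline performance to 0 and the best observed performance to 1. A minimal sketch follows; the function name and every number in the example are hypothetical.

```python
def normalized_score(score, baseline, best):
    """Map baseline -> 0.0 and best observed -> 1.0 on a linear scale.

    Works for metrics where higher is better (ACC, AUC) and for metrics
    where lower is better (RMS), because the sign of (best - baseline)
    flips along with the direction of improvement.
    """
    return (score - baseline) / (best - baseline)


# Hypothetical AUC: baseline 0.50, best observed 0.90, this model 0.85.
print(normalized_score(0.85, baseline=0.50, best=0.90))   # about 0.875
# Hypothetical RMS: baseline 0.50, best observed 0.25, this model 0.30.
print(normalized_score(0.30, baseline=0.50, best=0.25))   # about 0.8
```

After normalization both metrics read the same way: 0 is no better than baseline and 1 matches the best model seen on that problem, which is what makes scores comparable across metrics and problems.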
39Multi Dimensional Scaling
- Find low-dimension embedding of 10x14,000 data 
- The 10 metrics span a 2-5 dimension subspace
40Multi Dimensional Scaling
- Look at 2-D MDS plots 
- Scaled by standard deviation 
- Normalized scores 
- MDS of rank correlations 
- MDS on each problem individually 
- MDS averaged across all problems
41 2-D Multi-Dimensional Scaling
42 2-D Multi-Dimensional Scaling
(Figure panels: Normalized Scores Scaling; Rank-Correlation Distance)
43(Figure panels: Adult, Covertype, Hyper-Spectral, Letter, Medis, SLAC)
 44Correlation Analysis
- 2000 performances for each metric on each problem 
- Correlation between all pairs of metrics 
- 10 metrics 
- 45 pairwise correlations 
- Average of correlations over 7 test problems 
- Standard correlation 
- Rank correlation 
- Present rank correlation here
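The pairwise analysis can be sketched as below. The slides do not say which rank correlation was used; this example assumes Spearman's (the Pearson correlation of ranks, with tied values sharing an average rank). The metric names are taken from the slides, but the scores and function names are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Standard (Pearson) correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


def pairwise_rank_correlations(scores_by_metric):
    """Rank correlation for every unordered pair of metrics."""
    return {(m1, m2): spearman(scores_by_metric[m1], scores_by_metric[m2])
            for m1, m2 in combinations(sorted(scores_by_metric), 2)}


# Hypothetical scores of four models on three metrics.
perf = {"ACC": [0.80, 0.85, 0.78, 0.90],
        "AUC": [0.82, 0.88, 0.80, 0.93],
        "RMS": [0.40, 0.35, 0.42, 0.30]}
print(pairwise_rank_correlations(perf))
```

With 10 metrics, `combinations` yields the 45 pairs mentioned above. In this toy example ACC and AUC order the models identically, while RMS orders them in reverse because lower RMS is better; in practice one would flip such metrics (or use normalized scores) before comparing.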
45Rank Correlations
- Correlation analysis consistent with MDS analysis 
- Ordering metrics have high correlations to each 
 other
- ACC, AUC, RMS have best correlations of metrics 
 in each metric class
- RMS has good correlation to other metrics 
- SAR has best correlation to other metrics
46Summary
- 10 metrics span 2-5 Dim subspace 
- Consistent results across problems and scalings 
- Ordering Metrics Cluster: AUC ~ APR ~ BEP
- CAL far from Ordering Metrics
- CAL nearest to RMS/MXE
- RMS ~ MXE, but RMS much more centrally located
- Threshold Metrics ACC and FSC do not cluster as 
 tightly as ordering metrics and RMS/MXE
- Lift behaves more like Ordering than Threshold 
 metrics
- Old friends ACC, AUC, and RMS most representative 
- New SAR metric is good, but not much better than 
 RMS
47New Resources
- Want to borrow 14,000 models? 
- margin analysis 
- comparison to new algorithm X 
- ...
- PERF code: software that calculates 2 dozen performance metrics
- Accuracy (at different thresholds) 
- ROC Area and ROC plots 
- Precision and Recall plots 
- Break-even-point, F-score, Average Precision 
- Squared Error 
- Cross-Entropy 
- Lift 
- ...
- Currently, most metrics are for boolean classification problems
- We are willing to add new metrics and new capabilities
- Available at http://www.cs.cornell.edu/caruana
48Future Work 
 49Future/Related Work
- Ensemble method optimizes any metric (ICML04) 
- Get good probs from Boosted Trees (AISTATS05) 
- Comparison of learning algs on metrics (ICML06) 
- First step in analyzing different performance 
 metrics
- Develop new metrics with better properties 
- SAR is a good general purpose metric 
- Does optimizing to SAR yield better models? 
- but RMS nearly as good 
- attempts to make SAR better did not help much 
- Extend to multi-class or hierarchical problems 
 where evaluating performance is more difficult
50Thank You. 
 51Spooky Stuff in Metric Space 
 52Which learning methods perform best on each 
metric? 
 53Normalized Scores Best Single Models
- SVM predictions transformed to posterior 
 probabilities via Platt Scaling
- SVM and ANN tied for first place; Bagged Trees nearly as good
- Boosted Trees win 5 of 6 Threshold and Rank metrics, but yield lousy probs!
- Boosting weaker stumps does not compare to 
 boosting full trees
- KNN and Plain Decision Trees usually not 
 competitive (with 4k train sets)
- Other interesting things. See papers.
54Platt Scaling
- SVM predictions: [-inf, +inf]
- Probability metrics require [0, 1]
- Platt scaling transforms SVM preds by fitting a sigmoid
- This gives SVM good probability performance
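A minimal sketch of the sigmoid fit follows. It assumes the sigmoid is p = 1 / (1 + exp(-(w * f + b))) fitted by plain gradient descent on cross-entropy; Platt's original formulation writes the same curve as 1 / (1 + exp(A f + B)) (so w = -A, b = -B) and adds target smoothing and a more careful optimizer, both omitted here. The function names, step count, learning rate, and data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, steps=2000, lr=0.1):
    """Fit p = sigmoid(w * score + b) by gradient descent on mean cross-entropy."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(w * s + b) - y     # derivative of the loss w.r.t. the logit
            grad_w += err * s
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical held-out SVM outputs (unbounded margins) and 0/1 targets.
svm_out = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, -0.3, 0.3]
targets = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
w, b = fit_platt(svm_out, targets)
for f in (-2.0, 0.0, 2.0):
    print(f, sigmoid(w * f + b))
```

The fitted sigmoid is monotone, so it changes the predicted values without changing their ordering: threshold-free ranking metrics such as ROC area are unaffected, while probability metrics can improve.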
55Outline
- Motivation The One True Model 
- Ten Performance Metrics 
- Experiments 
- Multidimensional Scaling (MDS) Analysis 
- Correlation Analysis 
- Learning Algorithm vs. Metric 
- Summary 
56Base-Level Learners
- Each optimizes different things 
- ANN: minimize squared error or cross-entropy (good for probs)
- SVM, Boosting: optimize margin (good for accuracy, poor for probs)
- DT: optimize info gain
- KNN: ?
- Each best in different regimes
- SVM: high dimensional data
- DT, KNN: large data sets
- ANN: non-linear prediction from many correlated features
- Each algorithm has many variations and free parameters
- SVM: margin parameter, kernel, kernel parameters (gamma, ...)
- ANN: # hidden units, # hidden layers, learning rate, early stopping point
- DT: splitting criterion, pruning options, smoothing options, ...
- KNN: K, distance metric, distance weighted averaging, ...
- Generate about 2000 models on each test problem
57Motivation
- Holy Grail of Supervised Learning 
- One True Model (a.k.a. Bayes Optimal Model) 
- Predicts correct conditional probability for each 
 case
- Yields optimal performance on all reasonable 
 metrics
- Hard to learn given finite data 
- train sets rarely have conditional probs, usually 
 just 0/1 targets
- Isn't always necessary
- Many Different Performance Metrics
- ACC, AUC, CXE, RMS, PRE/REC, ...
- Each represents different tradeoffs
- Usually important to optimize to appropriate metric
- Not all metrics are created equal
58Motivation
- In an ideal world 
- Learn model that predicts correct conditional 
 probabilities
- Yield optimal performance on any reasonable 
 metric
- In real world 
- Finite data 
- 0/1 targets instead of conditional probabilities 
- Hard to learn this ideal model 
- Don't have good metrics for recognizing ideal model
- Ideal model isn't always necessary
- In practice
- Do learning using many different metrics: ACC, AUC, CXE, RMS, ...
- Each metric represents different tradeoffs 
- Because of this, usually important to optimize to 
 appropriate metric
59Accuracy
- Target: 0/1, -1/+1, True/False, ...
- Prediction = f(inputs) = f(x): 0/1 or Real
- Threshold: f(x) > thresh => 1, else => 0
- threshold(f(x)): 0/1
- accuracy = # right / # total
- p("correct"): p(threshold(f(x)) = target)
60Precision and Recall
- Typically used in document retrieval 
- Precision 
- how many of the returned documents are correct 
- precision(threshold) 
- Recall 
- how many of the positives does the model return 
- recall(threshold) 
- Precision/Recall Curve: sweep thresholds
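Sweeping the threshold and locating the break-even point can be sketched as below. This is an illustration with hypothetical names and data: each distinct score serves as a threshold (cases scoring at or above it are "returned"), and the break-even point is approximated by the curve point where precision and recall are closest, since an exact crossing need not occur at any observed threshold.

```python
def precision_recall_curve(y_true, scores):
    """Return (threshold, precision, recall) per distinct score, highest first."""
    n_pos = sum(y_true)
    curve = []
    for t in sorted(set(scores), reverse=True):
        returned = [y for y, s in zip(y_true, scores) if s >= t]
        hits = sum(returned)                       # returned cases that are positive
        curve.append((t, hits / len(returned), hits / n_pos))
    return curve


def break_even_point(curve):
    """Curve point where precision and recall are closest to each other."""
    return min(curve, key=lambda row: abs(row[1] - row[2]))


# Hypothetical retrieval scores for six documents, three of them relevant.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = precision_recall_curve(y_true, scores)
for row in curve:
    print(row)
print(break_even_point(curve))   # threshold 0.7, precision = recall = 2/3
```

Lowering the threshold can only keep recall the same or raise it, while precision may move in either direction, which is why the curve is usually plotted rather than summarized at a single threshold.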
61Precision/Recall
            Predicted 1    Predicted 0
True 1      a              b
True 0      c              d
The threshold on the model's output separates Predicted 1 from Predicted 0.