Title: A Black-Box approach to machine learning
1. A Black-Box approach to machine learning
2. Why do we need learning?
- Computers need functions that map highly variable data:
  - Speech recognition: audio signal -> words
  - Image analysis: video signal -> objects
  - Bio-informatics: micro-array images -> gene function
  - Data mining: transaction logs -> customer classification
- For accuracy, functions must be tuned to fit the data source.
- For real-time processing, function computation has to be very fast.
3. The complexity/accuracy tradeoff
[Figure: error as a function of complexity, illustrating the complexity/accuracy tradeoff]
4. The speed/flexibility tradeoff
[Figure: flexibility vs. speed, with Matlab code, Java code, machine code, digital hardware, and analog hardware placed along the tradeoff]
5. Theory vs. Practice
- Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in all situations.
  - I prove theorems.
- Practitioner: I want a real-time algorithm that performs well on my problem.
  - I experiment.
- My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components.
  - I do both.
6. Plan of talk
- The black-box approach
- Boosting
- Alternating decision trees
- A commercial application
- Boosting the margin
- Confidence rated predictions
- Online learning
7. The black-box approach
- Statistical models are not generators, they are predictors.
- A predictor is a function from an observation X to an action Z.
- After the action is taken, an outcome Y is observed, which implies a loss L (a real-valued number).
- Goal: find a predictor with small loss (in expectation, with high probability, or cumulatively). A minimal interface sketch follows this list.
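The following is a minimal sketch of this predictor/loss interface in Python; the names Predictor, predict, evaluate and loss are illustrative assumptions, not from the talk.

from typing import Callable, TypeVar

X = TypeVar("X")   # observation space
Z = TypeVar("Z")   # action space
Y = TypeVar("Y")   # outcome space

class Predictor:
    """A predictor maps an observation x to an action z."""
    def predict(self, x: X) -> Z:
        raise NotImplementedError

def evaluate(predictor: Predictor,
             data: list,
             loss: Callable[[Z, Y], float]) -> float:
    """Average loss of a predictor on (observation, outcome) pairs."""
    total = 0.0
    for x, y in data:
        z = predictor.predict(x)   # take an action
        total += loss(z, y)        # observe the outcome, incur a loss
    return total / len(data)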
8. Main software components
We assume the predictor will be applied to examples similar to those on which it was trained.
9. Learning in a system
[Block diagram: sensor data flows into a learning system, which produces an action]
10. Special case: Classification
Observation X: an arbitrary (measurable) space.
The label takes one of K values; usually K=2 (binary classification).
11. Batch learning for binary classification
12. Boosting
13. A weighted training set
14. A weak learner
[Block diagram: a weighted training set is fed to the weak learner, which outputs a weak rule h]
15. The boosting process
16. Adaboost
17. Main property of Adaboost
- If the advantages of the weak rules over random guessing are g1, g2, ..., gT, then the training error of the final rule is at most exp(-2(g1^2 + g2^2 + ... + gT^2)). (A code sketch of Adaboost and this bound follows.)
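A minimal sketch of Adaboost in Python, assuming binary labels in {-1, +1} and a user-supplied weak learner; the function and variable names are illustrative, not from the talk.

import numpy as np

def adaboost(X, y, weak_learner, T):
    """Adaboost for labels y in {-1, +1}.

    weak_learner(X, y, w) must return a rule h mapping an example to {-1, +1}
    whose weighted error under the weights w is below 1/2; its advantage
    g = 1/2 - error is what drives the bound quoted above.
    """
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # distribution over training examples
    rules, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, w)
        pred = np.array([h(x) for x in X])
        eps = float(np.sum(w[pred != y]))    # weighted training error of h
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)    # up-weight misclassified examples
        w = w / w.sum()
        rules.append(h)
        alphas.append(alpha)

    def final_rule(x):
        return 1 if sum(a * h(x) for a, h in zip(alphas, rules)) >= 0 else -1

    return final_rule

With advantages g_t = 1/2 - eps_t, the bound above says the fraction of training examples misclassified by final_rule is at most exp(-2 * (g_1^2 + ... + g_T^2)).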
18. Boosting block diagram
19. What is a good weak learner?
- The set of weak rules (features) should be:
  - Flexible enough to be (weakly) correlated with most conceivable relations between the feature vector and the label.
  - Simple enough to allow an efficient search for a rule with non-trivial weighted training error.
  - Small enough to avoid over-fitting.
- Calculation of the prediction from the observations should be very fast. (A decision-stump example is sketched below.)
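One concrete choice that meets these criteria, given here as an illustrative assumption rather than the specific learner used in the talk, is a decision stump: threshold a single coordinate and pick the sign that minimizes the weighted error. It plugs directly into the adaboost sketch above.

import numpy as np

def stump_learner(X, y, w):
    """Exhaustively search (coordinate, threshold, sign) stumps and return the
    one with the smallest weighted error. X: (n, d) array, y in {-1, +1},
    w: nonnegative weights summing to 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                          # coordinate to threshold
        for thr in np.unique(X[:, j]):          # candidate thresholds
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = float(np.sum(w[pred != y]))
                if err < best_err:
                    best_err, best = err, (j, float(thr), sign)
    j, thr, sign = best
    return lambda x: sign if x[j] > thr else -sign

The stump family is flexible (any single-coordinate threshold), cheap enough to search exhaustively, and very fast to evaluate at prediction time.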
20. Alternating decision trees
Freund & Mason, 1997
21. Decision Trees
[Figure: a small decision tree splitting on X > 3 and then Y > 5, with leaves labeled +1 and -1]
22. A decision tree as a sum of weak rules
[Figure: the same tree rewritten as a sum of weak rules with real-valued contributions such as -0.2, 0.1, -0.1, 0.2, -0.3]
23. An alternating decision tree
[Figure: an alternating decision tree with real-valued contributions (e.g. -0.2, 0.7) at its prediction nodes; a sketch of how such a tree is scored follows]
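A minimal sketch of how such a tree is scored, under the assumption that each rule carries a precondition (the path of splitter conditions above it), a condition, and two real-valued contributions; this representation is an illustration, not the exact data structure from the Freund and Mason paper.

def adt_score(x, base_score, rules):
    """Score an instance with an alternating decision tree.

    rules: list of (precondition, condition, score_true, score_false), where
    precondition(x) and condition(x) return booleans. A rule contributes only
    when its precondition (the path above it in the tree) is satisfied.
    """
    score = base_score
    for precondition, condition, score_true, score_false in rules:
        if precondition(x):
            score += score_true if condition(x) else score_false
    return score

def adt_classify(x, base_score, rules):
    # The sign of the score is the classification; its magnitude can be read
    # as a measure of confidence in the prediction.
    return 1 if adt_score(x, base_score, rules) >= 0 else -1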
24. Example: Medical diagnostics
- Cleve dataset from the UC Irvine database.
- Heart-disease diagnostics (+1 = healthy, -1 = sick).
- 13 features from tests (real-valued and discrete).
- 303 instances.
25. AD-tree for heart-disease diagnostics
[Figure: the learned AD-tree; a total score > 0 is classified as healthy, < 0 as sick]
26. Commercial Deployment
27. AT&T buisosity problem
Freund, Mason, Rogers, Pregibon & Cortes, 2000
- Distinguish business from residence customers using call-detail information (time of day, length of call).
- 230M telephone numbers; the label is unknown for 30% of them.
- 260M calls per day.
- Required computer resources:
  - Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock).
  - Significant: calculating the classification for 70M customers.
  - Negligible: learning (2 hours on 10K training examples on an off-line computer).
28. AD-tree for buisosity
29. AD-tree (Detail)
30. Quantifiable results
- At 94% accuracy, increased coverage from 44% to 56%.
- Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
31. Adaboost's resistance to over-fitting
- Why statisticians find Adaboost interesting.
32. A very curious phenomenon
Boosting decision trees: using <10,000 training examples we fit >2,000,000 parameters.
33. Large margins
Thesis: large margins => reliable predictions.
Very similar to SVM.
34. Experimental Evidence
35. Theorem
Schapire, Freund, Bartlett & Lee, Annals of Statistics, 1998.
H: a set of binary functions with VC-dimension d.
There is no dependence on the number of combined functions! (The form of the bound is paraphrased below.)
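The bound itself appears as a formula image in the original slides; the following LaTeX is a paraphrase of the theorem's form (constants absorbed into the O-term), written from the published statement in Schapire, Freund, Bartlett and Lee (1998).

% f: a convex combination of functions from H; S: m i.i.d. training examples;
% theta > 0: a margin threshold; with probability at least 1 - delta:
\Pr_{\mathcal{D}}\!\left[ y f(x) \le 0 \right]
  \;\le\;
  \Pr_{S}\!\left[ y f(x) \le \theta \right]
  + O\!\left( \sqrt{ \frac{d \log^2(m/d)}{m\,\theta^2} + \frac{\log(1/\delta)}{m} } \right)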
36. Idea of Proof
37. Confidence-rated predictions
- Agreement gives confidence (see the sketch below).
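A minimal illustration of the "agreement gives confidence" idea, assuming an ensemble of binary classifiers; the majority margin and abstention threshold here are illustrative, not the algorithm of the following slides.

def predict_with_confidence(classifiers, x):
    """Predict +1/-1 when the classifiers agree strongly; abstain otherwise.

    classifiers: list of functions mapping x to {-1, +1}.
    Returns (+1, -1, or 0 for abstain) together with the agreement level.
    """
    votes = [h(x) for h in classifiers]
    mean = sum(votes) / len(votes)        # in [-1, +1]
    if mean > 0.5:
        return +1, mean
    if mean < -0.5:
        return -1, mean
    return 0, mean                        # classifiers disagree: low confidence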
38. A motivating example
39. The algorithm
Freund, Mansour & Schapire, 2001
40. Suggested tuning
Suppose H is a finite set.
[The resulting bound appears as a formula on the original slide]
41. Confidence Rating block diagram
42. Face Detection
Viola & Jones, 1999
- Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
43. Using confidence to save time
The detector combines 6000 simple features using Adaboost.
In most boxes, only 8-9 features are calculated before the box can be discarded. (A sketch of this early-rejection idea follows.)
[Figure: features are evaluated one at a time, e.g. Feature 1, then Feature 2]
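A minimal sketch of that early-rejection idea, assuming each boosted feature adds to a running score and a box is dropped as soon as the score falls below a per-stage rejection threshold; the interface is illustrative, not Viola and Jones' exact cascade.

def cascade_detect(box, stages):
    """Evaluate boosted feature stages in order, stopping early when the
    accumulated score falls below the stage's rejection threshold.

    stages: list of (feature_fn, weight, reject_threshold) triples.
    Returns True if the box survives all stages (candidate face).
    """
    score = 0.0
    for feature_fn, weight, reject_threshold in stages:
        score += weight * feature_fn(box)
        if score < reject_threshold:
            return False          # confidently rejected: stop computing features
    return True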
44. Using confidence to train car detectors
45. Original image vs. difference image
46. Co-training
Blum and Mitchell, 1998
[Block diagram: on highway images, a partially trained classifier based on the raw B/W image and a partially trained classifier based on the difference image provide labels for each other; a sketch of the loop follows]
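A minimal sketch of the co-training loop, assuming the two views are the raw B/W image and the difference image, and classifiers exposing fit and predict_proba methods of my own naming; the confidence threshold is illustrative, in the spirit of Blum and Mitchell's scheme rather than the exact highway-data procedure.

def co_train(clf_a, clf_b, labeled, unlabeled, rounds, threshold=0.9):
    """Two classifiers, each trained on its own view, label confident
    unlabeled examples for each other.

    labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b).
    clf_a / clf_b: objects with fit(pairs) and predict_proba(view) -> (label, confidence).
    """
    pool_a = list(labeled)      # training pool for the view-A classifier
    pool_b = list(labeled)      # training pool for the view-B classifier
    for _ in range(rounds):
        clf_a.fit([(a, label) for (a, b), label in pool_a])
        clf_b.fit([(b, label) for (a, b), label in pool_b])
        still_unlabeled = []
        for a, b in unlabeled:
            label_a, conf_a = clf_a.predict_proba(a)
            label_b, conf_b = clf_b.predict_proba(b)
            if conf_a >= threshold:              # A is confident: teach B
                pool_b.append(((a, b), label_a))
            elif conf_b >= threshold:            # B is confident: teach A
                pool_a.append(((a, b), label_b))
            else:
                still_unlabeled.append((a, b))   # neither is confident yet
        unlabeled = still_unlabeled
    return clf_a, clf_b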
47. Co-Training Results
Levin, Freund & Viola, 2002
48. Selective sampling
[Block diagram: a partially trained classifier selects which unlabeled data to request labels for]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
49. Online learning
50. Online learning
So far, the only statistical assumption was that the data is generated IID.
Can we get rid of that assumption?
Yes, if we consider prediction as a repeated game.
Suppose we have a set of experts; we believe one is good, but we don't know which one.
51. Online prediction game
52. A very simple example
- Binary classification.
- N experts.
- One expert is known to be perfect.
- Algorithm: predict like the majority of the experts that have made no mistake so far.
- Bound: at most log2(N) mistakes in total. (A sketch of this halving algorithm follows.)
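A minimal sketch of this "halving" strategy; the expert and stream interfaces are assumptions made for the illustration.

def halving_algorithm(experts, stream):
    """Online prediction with N experts, one of which never errs.

    experts: list of functions mapping an observation to a prediction in {-1, +1}.
    stream: iterable of (observation, true_label) pairs, revealed one at a time.
    Each mistake removes at least half of the surviving experts, so the total
    number of mistakes is at most log2(len(experts)).
    """
    alive = list(range(len(experts)))     # indices of experts with no mistakes yet
    mistakes = 0
    for x, y in stream:
        votes = [experts[i](x) for i in alive]
        prediction = 1 if sum(votes) >= 0 else -1   # majority of surviving experts
        if prediction != y:
            mistakes += 1
        # drop every surviving expert that just erred
        alive = [i for i, v in zip(alive, votes) if v == y]
    return mistakes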
53. History of online learning
- Littlestone & Warmuth
- Vovk
- Vovk and Shafer's recent book "Probability and Finance: It's Only a Game!"
- Innumerable contributions from many fields: Hannan, Blackwell, Davison, Gallager, Cover, Barron, Foster & Vohra, Fudenberg & Levine, Feder & Merhav, Shtarkov, Rissanen, Cesa-Bianchi, Lugosi, Blum, Freund, Schapire, Valiant, Auer.
54. Lossless compression
X: arbitrary input space.
Y in {0, 1}.
Entropy, lossless compression, MDL.
- Statistical likelihood, standard probability theory.
55. Bayesian averaging
Folk theorem in information theory (paraphrased below):
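The folk theorem referred to here is, I take it, the standard log-loss guarantee for Bayesian averaging; the following LaTeX is a paraphrase with a uniform prior over N predictors, not necessarily the exact statement from the slide.

% Bayes mixture with uniform prior over predictors P_1, ..., P_N,
% evaluated on a sequence y_1, ..., y_T:
-\log\!\left( \frac{1}{N} \sum_{i=1}^{N} P_i(y_1,\dots,y_T) \right)
  \;\le\;
  \min_{i}\bigl( -\log P_i(y_1,\dots,y_T) \bigr) + \log N

In words: the cumulative log-loss of the uniform mixture is within log N of the best single predictor in hindsight, because the mixture probability is at least 1/N times the best predictor's probability.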
56. Game-theoretical loss
X: an arbitrary space.
57. Learning in games
Freund and Schapire, 1994
An algorithm which knows T in advance guarantees cumulative loss within roughly sqrt(T log N) of the best of the N experts. (A sketch of the exponential-weights strategy follows.)
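A minimal sketch of an exponential-weights (Hedge-style) strategy with losses assumed to lie in [0, 1]; the learning rate is a common textbook choice for a known horizon T, not necessarily the exact tuning from the Freund and Schapire paper.

import math

def exponential_weights(expert_losses, T, N):
    """Play a distribution over N experts for T rounds; losses in [0, 1].

    expert_losses(t) must return a list of the N experts' losses at round t.
    Returns (algorithm_loss, best_expert_loss).
    """
    eta = math.sqrt(2.0 * math.log(N) / T)   # learning rate for a known horizon T
    weights = [1.0] * N
    total_loss = 0.0
    cumulative = [0.0] * N
    for t in range(T):
        z = sum(weights)
        p = [w / z for w in weights]                             # mixture over experts
        losses = expert_losses(t)
        total_loss += sum(pi * li for pi, li in zip(p, losses))  # expected loss this round
        for i in range(N):
            cumulative[i] += losses[i]
            weights[i] *= math.exp(-eta * losses[i])             # downweight lossy experts
    return total_loss, min(cumulative)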
58. Multi-arm bandits
Auer, Cesa-Bianchi, Freund & Schapire, 1995
We describe an algorithm that guarantees regret of order sqrt(T N log N), even though only the loss of the chosen arm is observed.
59. Why isn't online learning practical?
- Prescriptions are too similar to the Bayesian approach.
- Implementing low-level learning requires a large number of experts.
- Computation increases linearly with the number of experts.
- Potentially very powerful for combining a few high-level experts.
60. Online learning for detector deployment
The detector can be adaptive!
[Block diagram: the deployment pipeline with an online-learning (OL) component]
61. Summary
- By combining predictors we can:
  - Improve accuracy.
  - Estimate prediction confidence.
  - Adapt on-line.
- To make machine learning practical:
  - Speed up the predictors.
  - Concentrate human feedback on hard cases.
  - Fuse data from several sources.
  - Share predictor libraries.