Inductive Transfer Retrospective: Transcript and Presenter's Notes


1
Inductive Transfer Retrospective Review
  • Rich Caruana
  • Computer Science Department
  • Cornell University

2
Inductive Transfer a.k.a.
  • Bias Learning
  • Multitask learning
  • Learning (Internal) Representations
  • Learning-to-learn
  • Lifelong learning
  • Continual learning
  • Speedup learning
  • Hints
  • Hierarchical Bayes

3
  • Rich Sutton, 1994 Constructive Induction Workshop:
  • Everyone knows that good representations are key
    to 99% of good learning performance. Why then has
    constructive induction, the science of finding
    good representations, been able to make only
    incremental improvements in performance?
  • People can learn amazingly fast because they
    bring good representations to the problem,
    representations they learned on previous
    problems. For people, then, constructive
    induction does make a large difference in
    performance.
  • The standard machine learning methodology is to
    consider a single concept to be learned. That
    itself is the crux of the problem.
  • This is not the way to study constructive
    induction! The standard one-concept learning
    task will never do this for us and must be
    abandoned. Instead we should look to natural
    learning systems, such as people, to get a better
    sense of the real task facing them. When we do
    this, I think we find the key difference that,
    for all practical purposes, people face not one
    task, but a series of tasks. The different tasks
    have different solutions, but they often share
    the same useful representations.
  • If you can come to the nth task with an
    excellent representation learned from the
    preceding n-1 tasks, then you can learn
    dramatically faster than a system that does not
    use constructive induction. A system without
    constructive induction will learn no faster on
    the nth task than on the 1st.

4
Transfer through the Ages
  • 1986: Sejnowski & Rosenberg: NETtalk
  • 1990: Dietterich, Hild & Bakiri: ID3 vs. NETtalk
  • 1990: Suddarth, Kergosien & Holden: rule injection (ANNs)
  • 1990: Abu-Mostafa: hints (ANNs)
  • 1991: Dean Pomerleau: ALVINN output representation (ANNs)
  • 1991: Lorien Pratt: speedup learning (ANNs)
  • 1992: Sharkey & Sharkey: speedup learning (ANNs)
  • 1992: Mark Ring: continual learning
  • 1993: Rich Caruana: MTL (ANNs, KNN, DT)
  • 1993: Thrun & Mitchell: EBNN
  • 1994: Virginia de Sa: minimizing disagreement
  • 1994: Jonathan Baxter: representation learning (and theory)
  • 1994: Thrun & Mitchell: learning one more thing
  • 1994: J. Schmidhuber: learning how to learn learning strategies

5
  • 1994: Dietterich & Bakiri: ECOC outputs
  • 1995: Breiman & Friedman: Curds & Whey
  • 1995: Sebastian Thrun: LLL (learning-to-learn, lifelong learning)
  • 1996: Danny Silver: parallel transfer (ANNs)
  • 1996: O'Sullivan & Thrun: task clustering (KNN)
  • 1996: Caruana & de Sa: inputs better as outputs (ANNs)
  • 1997: Munro & Parmanto: committee machines (ANNs)
  • 1998: Blum & Mitchell: co-training
  • 2002: Ben-David, Gehrke & Schuller: theoretical framework
  • 2003: Bakker & Heskes: Bayesian MTL (and task clustering)
  • 2004: Tony Jebara: MTL in SVMs (feature and kernel selection)
  • 2004: Pontil & Micchelli: kernels for MTL
  • 2004: Lawrence & Platt: MTL in GPs (informative vector machine)
  • 2005: Yu, Tresp & Schwaighofer: MTL in GPs
  • 2005: Liao & Carin: MTL for RBF networks

6
A Quick Romp Through Some Stuff
7
1 Task vs. 2 Tasks vs. 4 Tasks
8
STL vs. MTL Learning Curves
courtesy Joseph O'Sullivan
9
STL vs. MTL Learning Curves
10
A Different Kind of Learning Curve
11
MTL for Bayes Net Structure Learning
[Figure: overlapping Bayes net structures learned for Yeast 1, Yeast 2, and Yeast 3]
  • Bayes Nets for these three species overlap
    significantly
  • Learn structures from data for each species
    separately? No.
  • Learn one structure for all three species? No.
  • Bias learning to favor shared structure while
    allowing some differences? Yes -- makes the most
    of limited data.

12
When to Use Inductive Transfer?
  • multiple tasks occur naturally
  • using future to predict present
  • time series
  • decomposable tasks
  • multiple error metrics
  • focus of attention
  • different data distributions for same/similar
    problems
  • hierarchical tasks
  • some input features work better as outputs

13
Multiple Tasks Occur Naturally
  • Mitchell's Calendar Apprentice (CAP)
  • time-of-day (9:00am, 9:30am, ...)
  • day-of-week (M, T, W, ...)
  • duration (30min, 60min, ...)
  • location (Tom's office, Dean's office, 5409, ...)

14
Using Future to Predict Present
  • medical domains
  • autonomous vehicles and robots
  • time series
  • stock market
  • economic forecasting
  • weather prediction
  • spatial series
  • many more

15
Decomposable Tasks
  • DireOutcome = ICU ∨ Complication ∨ Death

[Figure: shared INPUTS feeding the decomposed outcome outputs]
16
Focus of Attention
Single-Task ALVINN
Multi-Task ALVINN
17
Different Data Distributions
  • Hospital 1: 50 cases, rural (Ithaca)
  • Hospital 2: 500 cases, mature urban (Des Moines)
  • Hospital 3: 1000 cases, elderly suburbs (Florida)
  • Hospital 4: 5000 cases, young urban (LA, SF)

18
Some Inputs are Better as Outputs
19
And many more uses of Xfer
20
A Few Issues That Arise With Xfer
21
Issue 1: Interference
22
Issue 1: Interference
23
Issue 2: Task Selection/Weighting
  • Analogous to feature selection
  • Correlation between tasks:
  • heuristic works well in practice
  • very suboptimal
  • Wrapper-based methods (sketch below):
  • expensive
  • benefit from single tasks can be too small to
    detect reliably
  • does not examine tasks in sets
  • Task weighting: MTL ⇒ one model for all tasks
  • main task vs. all tasks
  • even harder than task selection
  • but yields best results

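A minimal sketch of the wrapper approach, assuming hypothetical train_mtl and main_val_score callables (not from the deck): an extra task is kept only when adding it improves main-task validation performance, which makes both the expense and the one-task-at-a-time blind spot above concrete.

    def wrapper_select(train_mtl, main_val_score, main_task, extra_tasks):
        """Greedy wrapper-based task selection (hypothetical interface:
        train_mtl(main, extras) -> model, main_val_score(model) -> float).
        Keeps an extra task only if it improves main-task validation score;
        tasks are tried one at a time, so benefits that appear only for
        sets of tasks are missed."""
        chosen = []
        best = main_val_score(train_mtl(main_task, chosen))
        for task in extra_tasks:
            score = main_val_score(train_mtl(main_task, chosen + [task]))
            if score > best:
                best, chosen = score, chosen + [task]
        return chosen
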
24
Issue 3: Parallel vs. Serial Transfer
  • Where possible, use parallel transfer
  • All info about a task is in the training set, not
    necessarily in a model trained on that training set
  • Information useful to other tasks can be lost
    training one task at a time
  • Tasks often benefit each other mutually
  • When serial is necessary, implement it via parallel
    task rehearsal (sketch below)
  • Storing all experience is not always feasible

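A minimal sketch of serial-via-parallel rehearsal, assuming earlier models expose a predict() method (an assumed interface, not from the deck): instead of storing old training sets, the frozen earlier models label the new task's inputs, and one parallel MTL net is then trained on the real new targets plus those pseudo-targets.

    def rehearsal_training_set(old_models, X_new, y_new):
        """Serial transfer via parallel task rehearsal (sketch; predict()
        is an assumed interface). Earlier tasks are rehearsed through
        pseudo-targets produced by their frozen models, so no old
        training data needs to be stored."""
        pseudo_targets = [m.predict(X_new) for m in old_models]
        # Train one parallel MTL model on [y_new] + pseudo_targets over X_new.
        return X_new, [y_new] + pseudo_targets
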
25
Issue 4: Psychological Plausibility
?
26
Issue 5: Xfer vs. Hierarchical Bayes
  • Is Xfer just regularization/smoothing?
  • Yes and No
  • Yes:
  • Similar models for different problem instances,
    e.g., similar stocks, data distributions, ...
  • No:
  • Focus of attention
  • Task selection/clustering/rehearsal

27
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)

28
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)

29
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)
  • related ≠ correlated (e.g., A+B, A−B)

30
Why Doesn't Xfer Rule the Earth?
  • Tabula rasa learning surprisingly effective
  • the UCI problem

31
Use Some Features as Outputs
32
Why Doesn't Xfer Rule the Earth?
  • Xfer opportunities abound in real problems
  • Somewhat easier with ANNs (and Bayes nets)
  • Death is in the details:
  • Xfer often hurts more than it helps if not careful
  • Some important tricks are counterintuitive:
  • don't share too much
  • give tasks breathing room
  • focus on one task at a time
  • Tabula rasa learning surprisingly effective
  • the UCI problem

33
What Needs to be Done?
  • Have algs for ANN, KNN, DT, SVM, GP, BN, ...
  • Better prescription of where to use Xfer
  • Public data sets
  • Comparison of Methods
  • Inductive Transfer Competition?
  • Task selection, task weighting, task clustering
  • Explicit (TC) vs. Implicit (backprop) Xfer
  • Theory/definition of task relatedness

34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
Kinds of Transfer
  • Human Expertise
  • Constraints
  • Hints (monotonicity, smoothness, ...)
  • Parallel
  • Multitask Learning
  • Serial
  • Learning-To-Learn
  • Serial via parallel (rehearsal)

39
Motivating Example
  • 4 tasks defined on eight bits B1-B8
  • all tasks ignore input bits B7-B8 (stand-in sketch below)

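The deck does not define the four tasks, so the sketch below uses hypothetical stand-ins: random Boolean functions of B1-B6 only, which guarantees every task ignores B7-B8 by construction.

    import itertools
    import random

    random.seed(0)

    # Four hypothetical stand-in tasks: random Boolean functions of B1-B6
    # only, so by construction every task ignores input bits B7 and B8.
    TRUTH_TABLES = [
        {bits: random.randint(0, 1)
         for bits in itertools.product((0, 1), repeat=6)}
        for _ in range(4)
    ]

    def task_output(t, b):
        """b is the 8-bit input (B1, ..., B8); only B1-B6 are consulted."""
        return TRUTH_TABLES[t][tuple(b[:6])]
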
40
Goals of MTL
  • improve predictive accuracy
  • not intelligibility
  • not learning speed
  • exploit background knowledge
  • applicable to many learning methods
  • exploit strength of current learning methods
  • surprisingly good tabula rasa performance

41
Problem 2: 1D-Doors
  • color camera on Xavier robot
  • main tasks: doorknob location and door type
  • 8 extra tasks (training signals collected by
    mouse)
  • doorway width
  • location of doorway center
  • location of left jamb, right jamb
  • location of left and right edges of door

42
Predicting Pneumonia Risk
43
Predicting Pneumonia Risk
[Figure: In-Hospital Attributes (RBC Count, Blood pO2, Albumin, White Count) alongside the Pneumonia Risk target]
44
Pneumonia 1: Medis
45
Pneumonia 1: Results
[Results figure; values: -10.8, -11.8, -6.2, -6.9, -5.7]
46
Use imputed values for missing lab tests as
extra inputs?
47
Pneumonia 1: Feature Nets
48
Pneumonia 2: Results
MTL reduces error >10%
49
Related?
  • Ideal:
  • Func(MainTask, ExtraTask, Alg) = 1
  • iff
  • Alg(MainTask + ExtraTask) > Alg(MainTask)
  • unrealistic
  • try all extra tasks (or all combinations)?
  • need heuristics to help us find potentially
    useful extra tasks to use for MTL
  • Related Tasks

50
Related?
  • related ⇒ helps learning (e.g., copy tasks)

51
Related?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)

52
Related?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)
  • related ≠ correlated (e.g., A+B, A−B)

53
120 Synthetic Tasks
  • backprop net not told how tasks are related,
    but ... (generator sketch below)
  • 120 Peaks Functions: A, B, C, D, E, F ∈ (0.0, 1.0)
  • P001: If (A > 0.5) Then B, Else C
  • P002: If (A > 0.5) Then B, Else D
  • P014: If (A > 0.5) Then E, Else C
  • P024: If (B > 0.5) Then A, Else F
  • P120: If (F > 0.5) Then E, Else D

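The count follows from choosing one selector variable and an ordered pair of distinct branch variables from the six: 6 × 5 × 4 = 120. A minimal generator sketch (function and variable names assumed):

    import itertools
    import random

    def peaks_tasks():
        """All 120 Peaks functions: pick a selector variable and an ordered
        pair of distinct branch variables (6 * 5 * 4 = 120 triples)."""
        return [
            (lambda x, s=s, hi=hi, lo=lo: x[hi] if x[s] > 0.5 else x[lo])
            for s, hi, lo in itertools.permutations(range(6), 3)
        ]

    tasks = peaks_tasks()                    # 120 training signals for one MTL net
    x = [random.random() for _ in range(6)]  # A..F drawn from (0.0, 1.0)
    targets = [f(x) for f in tasks]
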
54
  • MTL nets cluster tasks by function

55
Peaks Functions Clustering
56
Focus of Attention
  • 1D-ALVINN
  • centerline
  • left and right edges of road
  • removing centerlines from 1D-ALVINN images hurts
    MTL accuracy more than STL accuracy

57
Some Inputs are Better as Outputs
  • MainTask = Sigmoid(A) + Sigmoid(B)
  • A, B ∈ (0.0, 1.0)
  • inputs A and B coded via a 10-bit binary code
    (data sketch below)

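A minimal data-generator sketch; the exact discretization behind the 10-bit code is an assumption. The 20 coded bits make poor inputs, while the raw A and B values can be attached as extra outputs:

    import math
    import random

    random.seed(0)

    def to_bits(v, n_bits=10):
        """Code v in (0, 1) as an n-bit binary expansion (assumed scheme)."""
        q = int(v * (2 ** n_bits - 1))
        return [(q >> i) & 1 for i in range(n_bits)]

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def make_example():
        a, b = random.random(), random.random()
        x = to_bits(a) + to_bits(b)           # 20 binary inputs
        y_main = sigmoid(a) + sigmoid(b)      # MainTask
        return x, y_main, [a, b]              # raw A, B reused as extra outputs
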
58
Inputs Better as Outputs: Results
59
MTL in K-Nearest Neighbor
  • Most learning methods can do MTL:
  • shared representation
  • combine performance of extra tasks
  • control the effect of extra tasks
  • MTL in K-Nearest Neighbor:
  • shared representation = distance metric
  • MTLPerf = (1 − λ)·MainPerf + λ·(Σ ExtraPerf)
    (sketch below)

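A hedged numpy sketch of that criterion: the shared representation is a vector of per-feature weights in the distance metric, each task is scored by leave-one-out performance, and lam stands in for λ. The search over the weights (e.g., hill climbing) is omitted.

    import numpy as np

    def weighted_knn(w, X, y, xq, k=5):
        """KNN regression under the shared, feature-weighted metric."""
        d = np.sqrt(((X - xq) ** 2 * w).sum(axis=1))
        return y[np.argsort(d)[:k]].mean()

    def loo_perf(w, X, y, k=5):
        """Leave-one-out performance = negative mean squared error."""
        idx = np.arange(len(X))
        errs = [(weighted_knn(w, X[idx != i], y[idx != i], X[i], k) - y[i]) ** 2
                for i in idx]
        return -np.mean(errs)

    def mtl_perf(w, X, y_main, y_extras, lam=0.3):
        """MTLPerf = (1 - lam) * MainPerf + lam * (sum of ExtraPerf)."""
        extra = sum(loo_perf(w, X, ye) for ye in y_extras)
        return (1 - lam) * loo_perf(w, X, y_main) + lam * extra
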
60
Summary
  • inductive transfer improves learning
  • >15 problem types where MTL is applicable:
  • using the future to predict the present
  • multiple metrics
  • focus of attention
  • different data populations
  • using inputs as extra tasks
  • . . . (at least 10 more)
  • most real-world problems fit one of these

61
Summary/Contributions
  • applied MTL to a dozen problems, some not created
    for MTL
  • MTL helps most of the time
  • benefits range from 5-40%
  • ways to improve MTL/Backprop:
  • learning rate optimization
  • private hidden layers
  • MTL Feature Nets
  • MTL nets do unsupervised learning/clustering
  • algorithms for MTL: ANN, KNN, SVMs, DTs

62
Open Problems
  • output selection
  • scale to 1000s of extra tasks
  • compare to Bayes Nets
  • theory of MTL
  • task weighting
  • features as both inputs and extra outputs

63
Features as Both Inputs & Outputs
  • some features help when used as inputs
  • some of those also help when used as outputs
  • get both benefits in one net?

64
Summary/Contributions
  • focus on main task improves performance
  • >15 problem types where MTL is applicable:
  • using the future to predict the present
  • multiple metrics
  • focus of attention
  • different data populations
  • using inputs as extra tasks
  • . . . (at least 10 more)
  • most real-world problems fit one of these

65
Summary/Contributions
  • applied MTL to a dozen problems, some not created
    for MTL
  • MTL helps most of the time
  • benefits range from 5-40%
  • ways to improve MTL/Backprop
  • learning rate optimization
  • private hidden layers
  • MTL Feature Nets
  • MTL nets do unsupervised clustering
  • algs for MTL: kNN and MTL Decision Trees

66
Future MTL Work
  • output selection
  • scale to 1000s of extra tasks
  • theory of MTL
  • compare to Bayes Nets
  • task weighting
  • features as both inputs and extra outputs

67
Inputs as Outputs: DNA Domain
  • given a sequence of 60 DNA nucleotides, predict
    whether the sequence is I→E, E→I, or neither
  • ... ACAGTACGTTGCATTACCCTCGTT ...
    ↓
    I→E, E→I, neither
  • nucleotides A, C, G, T coded with 3 bits
    (coding sketch below)
  • 3 × 60 = 180 inputs + 3 binary outputs

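A minimal encoder sketch; the particular 3-bit pattern per nucleotide is hypothetical, since the deck only fixes the width (3 bits each, hence 3 × 60 = 180 inputs):

    # Hypothetical 3-bit code per nucleotide; only the width (3 bits) comes
    # from the slide. 60 nucleotides * 3 bits = 180 binary inputs.
    CODE = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 1, 1)}
    LABELS = ("I->E", "E->I", "neither")      # 3 binary (one-per-class) outputs

    def encode_sequence(seq):
        assert len(seq) == 60
        return [bit for nt in seq for bit in CODE[nt]]    # length 180

    def encode_label(label):
        return [int(label == name) for name in LABELS]    # length 3
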
68
Making MTL/Backprop Better
  • Better training algorithm
  • learning rate optimization
  • Better architectures
  • private hidden layers (overfitting in hidden unit
    space)
  • using features as both inputs and outputs
  • combining MTL with Feature Nets

69
Private Hidden Layers
  • many tasks need many hidden units
  • many hidden units ⇒ hidden unit selection problem
  • allow sharing, but without too many hidden units?
    (architecture sketch below)

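A minimal numpy forward-pass sketch of one plausible reading of the slide: each task reads the hidden layer shared by all tasks (where transfer happens) plus a small private hidden layer of its own, so no single task has to pick through the full pool of shared hidden units. Layer sizes and the tanh nonlinearity are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

    class PrivateHiddenNet:
        """Shared hidden layer (transfer) + one small private layer per task."""

        def __init__(self, n_in, n_shared, n_private, n_tasks):
            self.shared = layer(n_in, n_shared)
            self.private = [layer(n_in, n_private) for _ in range(n_tasks)]
            self.heads = [layer(n_shared + n_private, 1) for _ in range(n_tasks)]

        def forward(self, x):
            Ws, bs = self.shared
            h_shared = np.tanh(x @ Ws + bs)
            outputs = []
            for (Wp, bp), (Wo, bo) in zip(self.private, self.heads):
                h = np.concatenate([h_shared, np.tanh(x @ Wp + bp)])
                outputs.append((h @ Wo + bo).item())
            return outputs

    net = PrivateHiddenNet(n_in=8, n_shared=16, n_private=4, n_tasks=4)
    print(net.forward(rng.uniform(size=8)))    # one prediction per task
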
70
Related Work
  • Sejnowski & Rosenberg 1986: NETtalk
  • Pratt, Mostow 1991-94: serial transfer in bp nets
  • Suddarth & Kergosien 1990: 1st MTL in bp nets
  • Abu-Mostafa 1990-95: catalytic hints
  • Abu-Mostafa, Baxter 92, 95: transfer in PAC models
  • Dietterich, Hild & Bakiri 90, 95: bp vs. ID3
  • Pomerleau, Baluja: other uses of hidden layers
  • Munro 1996: extra tasks to decorrelate experts
  • Breiman 1995: Curds & Whey
  • de Sa 1995: minimizing disagreement
  • Thrun & Mitchell 1994, 96: EBNN
  • O'Sullivan & Mitchell (now): EBNN + MTL + Robot

71
MTL vs. EBNN on Robot Problem
courtesy Joseph O'Sullivan
72
Theoretical Models of Parallel Xfer
  • PAC models based on VC-dim or MDL
  • unreasonable assumptions
  • fixed size hidden layers
  • all tasks generated by one hidden layer
  • backprop is ideal search procedure
  • predictions do not fit observations
  • have to add hidden units
  • main problems
  • can't take behavior of backprop into account
  • not enough is known about capacity of backprop
    nets

73
Learning Rate Optimization
  • optimize learning rates of extra tasks
    (search sketch below)
  • goal is to maximize generalization of the main task
  • ignore performance of extra tasks
  • expensive!
  • performance on extra tasks improves 9%!

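A hedged random-search sketch with hypothetical train and main_val_error callables (the deck does not specify the search procedure): sample per-extra-task learning rates, retrain, and keep whatever minimizes main-task validation error, ignoring how well the extra tasks themselves are learned.

    import random

    def optimize_extra_task_rates(train, main_val_error, n_extras, trials=50):
        """Random search over per-extra-task learning rates (assumed
        interface: train(rates) -> model, main_val_error(model) -> float).
        Only main-task generalization is optimized; extra-task error is
        ignored, which is why the method is expensive but effective."""
        best_rates, best_err = None, float("inf")
        for _ in range(trials):
            rates = [10 ** random.uniform(-3.0, 0.0) for _ in range(n_extras)]
            err = main_val_error(train(rates))
            if err < best_err:
                best_rates, best_err = rates, err
        return best_rates
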
74
MTL Feature Nets
75
Acknowledgements
  • advisors: Mitchell & Simon
  • committee: Pomerleau & Dietterich
  • CEHC: Cooper, Fine, Buchanan, et al.
  • co-authors: Baluja, de Sa, Freitag
  • robot Xavier: O'Sullivan, Simmons
  • discussion: Fahlman, Moore, Touretzky
  • funding: NSF, ARPA, DEC, CEHC, JPRC
  • SCS/CMU: a great place to do research
  • spouse: Diane