Inductive Transfer Retrospective: Transcript and Presenter's Notes


1
Inductive Transfer Retrospective Review
  • Rich Caruana
  • Computer Science Department
  • Cornell University

2
Inductive Transfer a.k.a.
  • Bias Learning
  • Multitask learning
  • Learning (Internal) Representations
  • Learning-to-learn
  • Lifelong learning
  • Continual learning
  • Speedup learning
  • Hints
  • Hierarchical Bayes

3
  • Rich Sutton, 1994 Constructive Induction Workshop:
  • Everyone knows that good representations are key
    to 99% of good learning performance. Why then has
    constructive induction, the science of finding
    good representations, been able to make only
    incremental improvements in performance?
  • People can learn amazingly fast because they
    bring good representations to the problem,
    representations they learned on previous
    problems. For people, then, constructive
    induction does make a large difference in
    performance.
  • The standard machine learning methodology is to
    consider a single concept to be learned. That
    itself is the crux of the problem.
  • This is not the way to study constructive
    induction! The standard one-concept learning
    task will never do this for us and must be
    abandoned. Instead we should look to natural
    learning systems, such as people, to get a better
    sense of the real task facing them. When we do
    this, I think we find the key difference that,
    for all practical purposes, people face not one
    task, but a series of tasks. The different tasks
    have different solutions, but they often share
    the same useful representations.
  • If you can come to the nth task with an
    excellent representation learned from the
    preceding n-1 tasks, then you can learn
    dramatically faster than a system that does not
    use constructive induction. A system without
    constructive induction will learn no faster on
    the nth task than on the 1st.

4
Transfer through the Ages
  • 1986: Sejnowski & Rosenberg: NETtalk
  • 1990: Dietterich, Hild & Bakiri: ID3 vs. NETtalk
  • 1990: Suddarth, Kergosien & Holden: rule injection (ANNs)
  • 1990: Abu-Mostafa: hints (ANNs)
  • 1991: Dean Pomerleau: ALVINN output representation (ANNs)
  • 1991: Lorien Pratt: speedup learning (ANNs)
  • 1992: Sharkey & Sharkey: speedup learning (ANNs)
  • 1992: Mark Ring: continual learning
  • 1993: Rich Caruana: MTL (ANNs, KNN, DT)
  • 1993: Thrun & Mitchell: EBNN
  • 1994: Virginia de Sa: minimizing disagreement
  • 1994: Jonathan Baxter: representation learning (and theory)
  • 1994: Thrun & Mitchell: learning one more thing
  • 1994: J. Schmidhuber: learning how to learn learning strategies

5
  • 1994: Dietterich & Bakiri: ECOC outputs
  • 1995: Breiman & Friedman: Curds & Whey
  • 1995: Sebastian Thrun: LLL (learning-to-learn, lifelong learning)
  • 1996: Danny Silver: parallel transfer (ANNs)
  • 1996: O'Sullivan & Thrun: task clustering (KNN)
  • 1996: Caruana & de Sa: inputs better as outputs (ANNs)
  • 1997: Munro & Parmanto: committee machines (ANNs)
  • 1998: Blum & Mitchell: co-training
  • 2002: Ben-David, Gehrke & Schuller: theoretical framework
  • 2003: Bakker & Heskes: Bayesian MTL (and task clustering)
  • 2004: Tony Jebara: MTL in SVMs (feature and kernel selection)
  • 2004: Pontil & Micchelli: kernels for MTL
  • 2004: Lawrence & Platt: MTL in GPs (informative vector machine)
  • 2005: Yu, Tresp & Schwaighofer: MTL in GPs
  • 2005: Liao & Carin: MTL for RBF networks

6
A Quick Romp Through Some Stuff
7
1 Task vs. 2 Tasks vs. 4 Tasks
8
STL vs. MTL Learning Curves
courtesy Joseph O'Sullivan
9
STL vs. MTL Learning Curves
10
A Different Kind of Learning Curve
11
MTL for Bayes Net Structure Learning
[Figure: overlapping Bayes net structures learned for Yeast 1, Yeast 2, and Yeast 3]
  • Bayes Nets for these three species overlap
    significantly
  • Learn structures from data for each species
    separately? No.
  • Learn one structure for all three species? No.
  • Bias learning to favor shared structure while
    allowing some differences? Yes -- makes the most
    of limited data.

12
When to Use Inductive Transfer?
  • multiple tasks occur naturally
  • using future to predict present
  • time series
  • decomposable tasks
  • multiple error metrics
  • focus of attention
  • different data distributions for same/similar
    problems
  • hierarchical tasks
  • some input features work better as outputs

13
Multiple Tasks Occur Naturally
  • Mitchell's Calendar Apprentice (CAP)
  • time-of-day (9:00am, 9:30am, ...)
  • day-of-week (M, T, W, ...)
  • duration (30min, 60min, ...)
  • location (Tom's office, Dean's office, 5409, ...)

14
Using Future to Predict Present
  • medical domains
  • autonomous vehicles and robots
  • time series
  • stock market
  • economic forecasting
  • weather prediction
  • spatial series
  • many more

15
Decomposable Tasks
  • DireOutcome = ICU ∨ Complication ∨ Death

[Figure: shared INPUTS feeding the decomposed outcome outputs]
16
Focus of Attention
Single-Task ALVINN
Multi-Task ALVINN
17
Different Data Distributions
  • Hospital 1: 50 cases, rural (Ithaca)
  • Hospital 2: 500 cases, mature urban (Des Moines)
  • Hospital 3: 1000 cases, elderly suburbs (Florida)
  • Hospital 4: 5000 cases, young urban (LA, SF)

18
Some Inputs are Better as Outputs
19
And many more uses of Xfer
20
A Few Issues That Arise With Xfer
21
Issue 1: Interference
22
Issue 1: Interference
23
Issue 2: Task Selection/Weighting
  • Analogous to feature selection
  • Correlation between tasks:
  • heuristic works well in practice
  • very suboptimal
  • Wrapper-based methods (sketch below):
  • expensive
  • benefit from single tasks can be too small to
    detect reliably
  • does not examine tasks in sets
  • Task weighting: MTL ⇒ one model for all tasks
  • main task vs. all tasks
  • even harder than task selection
  • but yields best results

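A minimal sketch of the wrapper approach, assuming hypothetical train_mtl and main_val_score callables (not from the deck): an extra task is kept only when adding it improves main-task validation performance, which makes both the expense and the one-task-at-a-time blind spot above concrete.

    def wrapper_select(train_mtl, main_val_score, main_task, extra_tasks):
        """Greedy wrapper-based task selection (hypothetical interface:
        train_mtl(main, extras) -> model, main_val_score(model) -> float).
        Keeps an extra task only if it improves main-task validation score;
        tasks are tried one at a time, so benefits that appear only for
        sets of tasks are missed."""
        chosen = []
        best = main_val_score(train_mtl(main_task, chosen))
        for task in extra_tasks:
            score = main_val_score(train_mtl(main_task, chosen + [task]))
            if score > best:
                best, chosen = score, chosen + [task]
        return chosen
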
24
Issue 3: Parallel vs. Serial Transfer
  • Where possible, use parallel transfer
  • All info about a task is in the training set, not
    necessarily in a model trained on that training set
  • Information useful to other tasks can be lost
    training one task at a time
  • Tasks often benefit each other mutually
  • When serial is necessary, implement it via parallel
    task rehearsal (sketch below)
  • Storing all experience is not always feasible

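A minimal sketch of serial-via-parallel rehearsal, assuming earlier models expose a predict() method (an assumed interface, not from the deck): instead of storing old training sets, the frozen earlier models label the new task's inputs, and one parallel MTL net is then trained on the real new targets plus those pseudo-targets.

    def rehearsal_training_set(old_models, X_new, y_new):
        """Serial transfer via parallel task rehearsal (sketch; predict()
        is an assumed interface). Earlier tasks are rehearsed through
        pseudo-targets produced by their frozen models, so no old
        training data needs to be stored."""
        pseudo_targets = [m.predict(X_new) for m in old_models]
        # Train one parallel MTL model on [y_new] + pseudo_targets over X_new.
        return X_new, [y_new] + pseudo_targets
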
25
Issue 4: Psychological Plausibility
?
26
Issue 5: Xfer vs. Hierarchical Bayes
  • Is Xfer just regularization/smoothing?
  • Yes and No
  • Yes:
  • Similar models for different problem instances,
    e.g., similar stocks, data distributions, ...
  • No:
  • Focus of attention
  • Task selection/clustering/rehearsal

27
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)

28
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)

29
Issue 6: What Does "Related" Mean?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)
  • related ≠ correlated (e.g., A+B, A−B)

30
Why Doesn't Xfer Rule the Earth?
  • Tabula rasa learning surprisingly effective
  • the UCI problem

31
Use Some Features as Outputs
32
Why Doesn't Xfer Rule the Earth?
  • Xfer opportunities abound in real problems
  • Somewhat easier with ANNs (and Bayes nets)
  • Death is in the details:
  • Xfer often hurts more than it helps if not careful
  • Some important tricks are counterintuitive:
  • don't share too much
  • give tasks breathing room
  • focus on one task at a time
  • Tabula rasa learning surprisingly effective
  • the UCI problem

33
What Needs to be Done?
  • Have algs for ANN, KNN, DT, SVM, GP, BN, ...
  • Better prescription of where to use Xfer
  • Public data sets
  • Comparison of Methods
  • Inductive Transfer Competition?
  • Task selection, task weighting, task clustering
  • Explicit (TC) vs. Implicit (backprop) Xfer
  • Theory/definition of task relatedness

34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
Kinds of Transfer
  • Human Expertise
  • Constraints
  • Hints (monotonicity, smoothness, ...)
  • Parallel
  • Multitask Learning
  • Serial
  • Learning-To-Learn
  • Serial via parallel (rehearsal)

39
Motivating Example
  • 4 tasks defined on eight bits B1-B8
  • all tasks ignore input bits B7-B8 (stand-in sketch below)

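The deck does not define the four tasks, so the sketch below uses hypothetical stand-ins: random Boolean functions of B1-B6 only, which guarantees every task ignores B7-B8 by construction.

    import itertools
    import random

    random.seed(0)

    # Four hypothetical stand-in tasks: random Boolean functions of B1-B6
    # only, so by construction every task ignores input bits B7 and B8.
    TRUTH_TABLES = [
        {bits: random.randint(0, 1)
         for bits in itertools.product((0, 1), repeat=6)}
        for _ in range(4)
    ]

    def task_output(t, b):
        """b is the 8-bit input (B1, ..., B8); only B1-B6 are consulted."""
        return TRUTH_TABLES[t][tuple(b[:6])]
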
40
Goals of MTL
  • improve predictive accuracy
  • not intelligibility
  • not learning speed
  • exploit background knowledge
  • applicable to many learning methods
  • exploit strength of current learning methods
  • surprisingly good tabula rasa performance

41
Problem 2: 1D-Doors
  • color camera on Xavier robot
  • main tasks: doorknob location and door type
  • 8 extra tasks (training signals collected by
    mouse)
  • doorway width
  • location of doorway center
  • location of left jamb, right jamb
  • location of left and right edges of door

42
Predicting Pneumonia Risk
43
Predicting Pneumonia Risk
[Figure: In-Hospital Attributes (RBC Count, Blood pO2, Albumin, White Count) alongside the Pneumonia Risk target]
44
Pneumonia 1: Medis
45
Pneumonia 1: Results
[Results figure; values: -10.8, -11.8, -6.2, -6.9, -5.7]
46
Use imputed values for missing lab tests as
extra inputs?
47
Pneumonia 1: Feature Nets
48
Pneumonia 2: Results
MTL reduces error >10%
49
Related?
  • Ideal:
  • Func(MainTask, ExtraTask, Alg) = 1
  • iff
  • Alg(MainTask + ExtraTask) > Alg(MainTask)
  • unrealistic
  • try all extra tasks (or all combinations)?
  • need heuristics to help us find potentially
    useful extra tasks to use for MTL
  • Related Tasks

50
Related?
  • related ⇒ helps learning (e.g., copy tasks)

51
Related?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)

52
Related?
  • related ⇒ helps learning (e.g., copy task)
  • helps learning ⇏ related (e.g., noise task)
  • related ≠ correlated (e.g., A+B, A−B)

53
120 Synthetic Tasks
  • backprop net not told how tasks are related,
    but ... (generator sketch below)
  • 120 Peaks Functions: A, B, C, D, E, F ∈ (0.0, 1.0)
  • P001: If (A > 0.5) Then B, Else C
  • P002: If (A > 0.5) Then B, Else D
  • P014: If (A > 0.5) Then E, Else C
  • P024: If (B > 0.5) Then A, Else F
  • P120: If (F > 0.5) Then E, Else D

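The count follows from choosing one selector variable and an ordered pair of distinct branch variables from the six: 6 × 5 × 4 = 120. A minimal generator sketch (function and variable names assumed):

    import itertools
    import random

    def peaks_tasks():
        """All 120 Peaks functions: pick a selector variable and an ordered
        pair of distinct branch variables (6 * 5 * 4 = 120 triples)."""
        return [
            (lambda x, s=s, hi=hi, lo=lo: x[hi] if x[s] > 0.5 else x[lo])
            for s, hi, lo in itertools.permutations(range(6), 3)
        ]

    tasks = peaks_tasks()                    # 120 training signals for one MTL net
    x = [random.random() for _ in range(6)]  # A..F drawn from (0.0, 1.0)
    targets = [f(x) for f in tasks]
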
54
  • MTL nets cluster tasks by function

55
Peaks Functions Clustering
56
Focus of Attention
  • 1D-ALVINN
  • centerline
  • left and right edges of road
  • removing centerlines from 1D-ALVINN images hurts
    MTL accuracy more than STL accuracy

57
Some Inputs are Better as Outputs
  • MainTask = Sigmoid(A) + Sigmoid(B)
  • A, B ∈ (0.0, 1.0)
  • inputs A and B coded via a 10-bit binary code
    (data sketch below)

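A minimal data-generator sketch; the exact discretization behind the 10-bit code is an assumption. The 20 coded bits make poor inputs, while the raw A and B values can be attached as extra outputs:

    import math
    import random

    random.seed(0)

    def to_bits(v, n_bits=10):
        """Code v in (0, 1) as an n-bit binary expansion (assumed scheme)."""
        q = int(v * (2 ** n_bits - 1))
        return [(q >> i) & 1 for i in range(n_bits)]

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def make_example():
        a, b = random.random(), random.random()
        x = to_bits(a) + to_bits(b)           # 20 binary inputs
        y_main = sigmoid(a) + sigmoid(b)      # MainTask
        return x, y_main, [a, b]              # raw A, B reused as extra outputs
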
58
Inputs Better as Outputs: Results
59
MTL in K-Nearest Neighbor
  • Most learning methods can do MTL:
  • shared representation
  • combine performance of extra tasks
  • control the effect of extra tasks
  • MTL in K-Nearest Neighbor:
  • shared representation = distance metric
  • MTLPerf = (1 − λ)·MainPerf + λ·(Σ ExtraPerf)
    (sketch below)

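A hedged numpy sketch of that criterion: the shared representation is a vector of per-feature weights in the distance metric, each task is scored by leave-one-out performance, and lam stands in for λ. The search over the weights (e.g., hill climbing) is omitted.

    import numpy as np

    def weighted_knn(w, X, y, xq, k=5):
        """KNN regression under the shared, feature-weighted metric."""
        d = np.sqrt(((X - xq) ** 2 * w).sum(axis=1))
        return y[np.argsort(d)[:k]].mean()

    def loo_perf(w, X, y, k=5):
        """Leave-one-out performance = negative mean squared error."""
        idx = np.arange(len(X))
        errs = [(weighted_knn(w, X[idx != i], y[idx != i], X[i], k) - y[i]) ** 2
                for i in idx]
        return -np.mean(errs)

    def mtl_perf(w, X, y_main, y_extras, lam=0.3):
        """MTLPerf = (1 - lam) * MainPerf + lam * (sum of ExtraPerf)."""
        extra = sum(loo_perf(w, X, ye) for ye in y_extras)
        return (1 - lam) * loo_perf(w, X, y_main) + lam * extra
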
60
Summary
  • inductive transfer improves learning
  • >15 problem types where MTL is applicable:
  • using the future to predict the present
  • multiple metrics
  • focus of attention
  • different data populations
  • using inputs as extra tasks
  • . . . (at least 10 more)
  • most real-world problems fit one of these

61
Summary/Contributions
  • applied MTL to a dozen problems, some not created
    for MTL
  • MTL helps most of the time
  • benefits range from 5-40%
  • ways to improve MTL/Backprop:
  • learning rate optimization
  • private hidden layers
  • MTL Feature Nets
  • MTL nets do unsupervised learning/clustering
  • algorithms for MTL: ANN, KNN, SVMs, DTs

62
Open Problems
  • output selection
  • scale to 1000s of extra tasks
  • compare to Bayes Nets
  • theory of MTL
  • task weighting
  • features as both inputs and extra outputs

63
Features as Both Inputs & Outputs
  • some features help when used as inputs
  • some of those also help when used as outputs
  • get both benefits in one net?

64
Summary/Contributions
  • focus on main task improves performance
  • >15 problem types where MTL is applicable:
  • using the future to predict the present
  • multiple metrics
  • focus of attention
  • different data populations
  • using inputs as extra tasks
  • . . . (at least 10 more)
  • most real-world problems fit one of these

65
Summary/Contributions
  • applied MTL to a dozen problems, some not created
    for MTL
  • MTL helps most of the time
  • benefits range from 5-40%
  • ways to improve MTL/Backprop
  • learning rate optimization
  • private hidden layers
  • MTL Feature Nets
  • MTL nets do unsupervised clustering
  • algs for MTL: kNN and MTL Decision Trees

66
Future MTL Work
  • output selection
  • scale to 1000s of extra tasks
  • theory of MTL
  • compare to Bayes Nets
  • task weighting
  • features as both inputs and extra outputs

67
Inputs as Outputs: DNA Domain
  • given a sequence of 60 DNA nucleotides, predict
    whether the sequence is I→E, E→I, or neither
  • ... ACAGTACGTTGCATTACCCTCGTT ...
    ↓
    I→E, E→I, neither
  • nucleotides A, C, G, T coded with 3 bits
    (coding sketch below)
  • 3 × 60 = 180 inputs + 3 binary outputs

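A minimal encoder sketch; the particular 3-bit pattern per nucleotide is hypothetical, since the deck only fixes the width (3 bits each, hence 3 × 60 = 180 inputs):

    # Hypothetical 3-bit code per nucleotide; only the width (3 bits) comes
    # from the slide. 60 nucleotides * 3 bits = 180 binary inputs.
    CODE = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 1, 1)}
    LABELS = ("I->E", "E->I", "neither")      # 3 binary (one-per-class) outputs

    def encode_sequence(seq):
        assert len(seq) == 60
        return [bit for nt in seq for bit in CODE[nt]]    # length 180

    def encode_label(label):
        return [int(label == name) for name in LABELS]    # length 3
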
68
Making MTL/Backprop Better
  • Better training algorithm
  • learning rate optimization
  • Better architectures
  • private hidden layers (overfitting in hidden unit
    space)
  • using features as both inputs and outputs
  • combining MTL with Feature Nets

69
Private Hidden Layers
  • many tasks need many hidden units
  • many hidden units ⇒ hidden unit selection problem
  • allow sharing, but without too many hidden units?
    (architecture sketch below)

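A minimal numpy forward-pass sketch of one plausible reading of the slide: each task reads the hidden layer shared by all tasks (where transfer happens) plus a small private hidden layer of its own, so no single task has to pick through the full pool of shared hidden units. Layer sizes and the tanh nonlinearity are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

    class PrivateHiddenNet:
        """Shared hidden layer (transfer) + one small private layer per task."""

        def __init__(self, n_in, n_shared, n_private, n_tasks):
            self.shared = layer(n_in, n_shared)
            self.private = [layer(n_in, n_private) for _ in range(n_tasks)]
            self.heads = [layer(n_shared + n_private, 1) for _ in range(n_tasks)]

        def forward(self, x):
            Ws, bs = self.shared
            h_shared = np.tanh(x @ Ws + bs)
            outputs = []
            for (Wp, bp), (Wo, bo) in zip(self.private, self.heads):
                h = np.concatenate([h_shared, np.tanh(x @ Wp + bp)])
                outputs.append((h @ Wo + bo).item())
            return outputs

    net = PrivateHiddenNet(n_in=8, n_shared=16, n_private=4, n_tasks=4)
    print(net.forward(rng.uniform(size=8)))    # one prediction per task
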
70
Related Work
  • Sejnowski & Rosenberg 1986: NETtalk
  • Pratt, Mostow 1991-94: serial transfer in bp nets
  • Suddarth & Kergosien 1990: 1st MTL in bp nets
  • Abu-Mostafa 1990-95: catalytic hints
  • Abu-Mostafa, Baxter 92, 95: transfer in PAC models
  • Dietterich, Hild & Bakiri 90, 95: bp vs. ID3
  • Pomerleau, Baluja: other uses of hidden layers
  • Munro 1996: extra tasks to decorrelate experts
  • Breiman 1995: Curds & Whey
  • de Sa 1995: minimizing disagreement
  • Thrun & Mitchell 1994, 96: EBNN
  • O'Sullivan & Mitchell (now): EBNN + MTL + Robot

71
MTL vs. EBNN on Robot Problem
courtesy Joseph O'Sullivan
72
Theoretical Models of Parallel Xfer
  • PAC models based on VC-dim or MDL
  • unreasonable assumptions
  • fixed size hidden layers
  • all tasks generated by one hidden layer
  • backprop is ideal search procedure
  • predictions do not fit observations
  • have to add hidden units
  • main problems
  • can't take behavior of backprop into account
  • not enough is known about capacity of backprop
    nets

73
Learning Rate Optimization
  • optimize learning rates of extra tasks
    (search sketch below)
  • goal is to maximize generalization of the main task
  • ignore performance of extra tasks
  • expensive!
  • performance on extra tasks improves 9%!

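A hedged random-search sketch with hypothetical train and main_val_error callables (the deck does not specify the search procedure): sample per-extra-task learning rates, retrain, and keep whatever minimizes main-task validation error, ignoring how well the extra tasks themselves are learned.

    import random

    def optimize_extra_task_rates(train, main_val_error, n_extras, trials=50):
        """Random search over per-extra-task learning rates (assumed
        interface: train(rates) -> model, main_val_error(model) -> float).
        Only main-task generalization is optimized; extra-task error is
        ignored, which is why the method is expensive but effective."""
        best_rates, best_err = None, float("inf")
        for _ in range(trials):
            rates = [10 ** random.uniform(-3.0, 0.0) for _ in range(n_extras)]
            err = main_val_error(train(rates))
            if err < best_err:
                best_rates, best_err = rates, err
        return best_rates
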
74
MTL Feature Nets
75
Acknowledgements
  • advisors: Mitchell & Simon
  • committee: Pomerleau & Dietterich
  • CEHC: Cooper, Fine, Buchanan, et al.
  • co-authors: Baluja, de Sa, Freitag
  • robot Xavier: O'Sullivan, Simmons
  • discussion: Fahlman, Moore, Touretzky
  • funding: NSF, ARPA, DEC, CEHC, JPRC
  • SCS/CMU: a great place to do research
  • spouse: Diane