Title: Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Slide 1: Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
- Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
- Stanford AI Lab
Slide 2: Motivation
- Problem: with few instances, learned models aren't robust
- Task: shape modeling
[Figure: mean shape and principal components at ±1 standard deviation]
Slide 3: Transfer Learning
- Shape is stabilized, but doesn't look like an elephant
- Can we use rhinos to help elephants?
[Figure: mean shape and principal components at ±1 standard deviation]
Slide 4: Hierarchical Bayes
[Figure: directed hierarchy]
- Root: P(θ_root)
- Classes: P(θ_Elephant | θ_root), P(θ_Rhino | θ_root)
- Data: P(Data_Elephant | θ_Elephant), P(Data_Rhino | θ_Rhino)
Slide 5: Goals
- Transfer between related classes
- Range of settings, tasks
- Probabilistic motivation
- Multilevel, complex hierarchies
- Simple, efficient computation
- Automatically learn what to transfer
Slide 6: Hierarchical Bayes
- Compute the full posterior P(Θ | D)
- P(θ_c | θ_root) must be conjugate to P(D_c | θ_c)
- Problem: often we can't perform the full Bayesian computations
Slide 7: Approximate Point Estimation
- The best parameters are good enough; we don't need the full distribution
- Empirical Bayes
- Point estimation
- Other approximations: approximating the posterior as a normal, sampling, etc.
Slide 8: More Issues: Multiple Levels
- Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
- Exception: Thibaux and Jordan (05)
Slide 9: More Issues: Restrictive Priors
- Example: inverse-Wishart
- Pseudocount restriction: n > d (N = number of samples, d = dimension)
- If d is large and N is small, the signal from the prior overwhelms the data
- We show experiments with N = 3, d = 20
Slide 10: Alternative: Shrinkage
- McCallum et al. (98):
  1. Compute the maximum likelihood estimate at each node
  2. Shrink each node toward its parent: a linear combination θ ← αθ + (1 − α)θ_parent (sketched in code below)
- α is chosen by cross-validation
- Pros
  - Simple to compute
  - Handles multiple levels
- Cons
  - Naive heuristic for transfer
  - Averaging is not always appropriate
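A minimal sketch of this shrinkage recipe, assuming a toy dict-based tree and an α already chosen by cross-validation (all names here are hypothetical, not from the talk):

```python
import numpy as np

def shrink_hierarchy(node, alpha):
    """Shrinkage (McCallum et al. 98): after computing a maximum
    likelihood estimate 'theta_ml' at every node, shrink each child
    toward its (already shrunk) parent, top-down."""
    for child in node['children']:
        # Linear combination of the child's ML estimate and the parent.
        child['theta'] = alpha * child['theta_ml'] + (1 - alpha) * node['theta']
        shrink_hierarchy(child, alpha)

# Toy two-class hierarchy: a root with elephant- and rhino-like children.
root = {'theta': np.array([0.5, 0.5]), 'children': [
    {'theta_ml': np.array([0.9, 0.1]), 'children': []},
    {'theta_ml': np.array([0.3, 0.7]), 'children': []},
]}
shrink_hierarchy(root, alpha=0.7)    # alpha from cross-validation
print(root['children'][0]['theta'])  # -> [0.78 0.22]
```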
Slide 11: Undirected HB Reformulation
- Probabilistic Abstraction Hierarchies (Segal et al. 01)
- Defines an undirected Markov random field over (Θ, D)
Slide 12: Undirected Probabilistic Model
[Figure: undirected model; θ_root is linked to θ_Elephant and θ_Rhino by Divergence potentials (edge strength β high or low), and each class parameter is linked to its data by an F_data potential]
- Divergence: encourages parameters to be similar to their parents
- F_data: encourages parameters to explain the data (the combined objective is written out below)
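In symbols, the MRF corresponds to maximizing a sum of data terms minus divergence penalties along the hierarchy edges (a schematic form consistent with the slides; β_c is the strength of the edge from class c to its parent pa(c)):

```latex
\max_{\Theta}\; \sum_{c} F_{\mathrm{data}}(\theta_c;\, D_c)
\;-\; \sum_{c} \beta_c\, \mathrm{Div}\!\big(\theta_c,\, \theta_{\mathrm{pa}(c)}\big)
```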
Slide 13: Purpose of Reformulation
- Easy to specify
  - F_data can be a likelihood, a classification objective, or another objective
  - Divergence can be an L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
  - No conjugacy or proper-prior restrictions
- Easy to optimize
  - Optimization over Θ is convex if F_data is concave and the Divergence is convex
Slide 14: Application: Text Categorization
- Task: categorize documents (Newsgroup20 dataset)
- Bag-of-words model
- F_data: multinomial log-likelihood (regularized); µ_i represents the frequency of word i
- Divergence: L2 norm
(A toy sketch of this objective follows.)
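A toy sketch of such an objective for a two-class hierarchy, assuming natural (log-frequency) parameters so the multinomial log-likelihood stays concave, with an L2 divergence to a shared root. This is an illustration of the idea, not the paper's exact regularized objective:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_objective(params, counts, lam, d):
    """-(sum_c multinomial log-lik(eta_c) - lam * ||eta_c - eta_root||^2),
    where eta are natural (log-frequency) parameters; minimizing this
    is a convex problem."""
    etas = params.reshape(len(counts) + 1, d)  # row 0: root; rows 1..: classes
    obj = -0.01 * np.sum(etas[0] ** 2)         # small ridge keeps root bounded
    for c, n in enumerate(counts, start=1):
        obj += n @ etas[c] - n.sum() * logsumexp(etas[c])  # concave log-lik
        obj -= lam * np.sum((etas[c] - etas[0]) ** 2)      # convex divergence
    return -obj

d = 5                                      # toy vocabulary size
counts = [np.array([8., 1., 1., 0., 0.]),  # per-class word counts
          np.array([6., 2., 1., 1., 0.])]
res = minimize(neg_objective, np.zeros((len(counts) + 1) * d),
               args=(counts, 1.0, d))
probs = np.exp(res.x.reshape(3, d)[1])
print(probs / probs.sum())                 # smoothed word frequencies, class 1
```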
Slide 15: Baselines
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. 98, with hierarchy): θ ← αθ + (1 − α)θ_parent
Slide 16: Can It Handle Multiple Levels?
[Figure: newsgroup topic classification. Classification rate (0.35-0.7) vs. total number of training instances (75-375) for Max Likelihood (no regularization), Shrinkage, Regularized Max Likelihood, and Undirected HB]
Slide 17: Application: Shape Modeling
- Mammals dataset (Fink, 05)
- Task: learn shape (density estimation; evaluated by test likelihood)
- Instances represented by 60 x-y coordinates of landmarks on the outline
- F_data: likelihood of the mean landmark locations and the covariance over landmarks, plus regularization
- Divergence: L2 norm over mean and variance (one possible form is spelled out below)
[Figure: mean shape and principal components]
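One natural reading of "L2 norm over mean and variance" (the exact form is my assumption, not stated on the slide): for child mean and covariance (μ_c, Σ_c) and parent (μ_p, Σ_p),

```latex
\mathrm{Div}\!\big((\mu_c, \Sigma_c),\, (\mu_p, \Sigma_p)\big)
= \|\mu_c - \mu_p\|_2^2 \;+\; \|\Sigma_c - \Sigma_p\|_F^2
```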
Slide 18: Does Hierarchy Help?
[Figure: mammal pairs. Delta log-loss per instance relative to Regularized Max Likelihood vs. total number of training instances (6-30), for the pairs Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, and Llama-Rhino]
- Unregularized max likelihood and shrinkage: much worse, not shown
Slide 19: Transfer
- Not all parameters deserve equal sharing
Slide 20: Degrees of Transfer
- Split θ into subcomponents θ[i], each with its own weight λ_i
- Allows different strengths for different subcomponents and child-parent pairs (see the objective below)
- How do we estimate all these λ parameters?
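Schematically, the divergence penalty is split so that each subcomponent gets its own degree-of-transfer weight:

```latex
\max_{\Theta}\; \sum_{c} F_{\mathrm{data}}(\theta_c;\, D_c)
\;-\; \sum_{c}\sum_{i} \lambda_{c,i}\,
\mathrm{Div}\!\big(\theta_c[i],\, \theta_{\mathrm{pa}(c)}[i]\big)
```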
Slide 21: Learning Degrees of Transfer
- Bootstrap approach (sketched in code below)
  - If θ_c[i] and θ_parent[i] have a consistent relationship, we want to encourage them to be similar
  - Define d_i = θ_c[i] − θ_parent[i]
  - We want to estimate the variance of d_i across all possible datasets
  - Select random subsets of the data and let σ_i² be the empirical variance of d_i
  - If θ_c[i] and θ_parent[i] have a consistent relationship (low variance), λ_i = 1/σ_i² strongly encourages similarity
- Hyper-prior approach (next)
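A minimal sketch of the bootstrap recipe, assuming the per-subcomponent estimate is a simple mean and that the weight is the inverse empirical variance (consistent with the 1/λ axis on Slide 27); all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_dot(child_data, parent_data, n_subsets=50, frac=0.7):
    """Estimate degrees of transfer: over random subsets of the data,
    measure d = (child estimate - parent estimate) per subcomponent;
    a low empirical variance of d means a consistent relationship,
    so lambda = 1/var(d) enforces strong transfer there."""
    diffs = []
    for _ in range(n_subsets):
        sub_c = rng.choice(child_data, size=int(frac * len(child_data)))
        sub_p = rng.choice(parent_data, size=int(frac * len(parent_data)))
        diffs.append(sub_c.mean(axis=0) - sub_p.mean(axis=0))
    return 1.0 / np.var(np.array(diffs), axis=0)

# Toy data: the first two dimensions track the parent consistently,
# the last two do not, so they should receive small lambdas.
child = rng.normal(0.0, 1.0, size=(30, 4))
parent = child + rng.normal(0.0, [0.1, 0.1, 2.0, 2.0], size=(30, 4))
print(bootstrap_dot(child, parent))
```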
Slide 22: Degrees of Transfer
- Bootstrap approach
  - If we use an L2 norm for our Div terms, the model resembles a product of Gaussian priors, with the degree of transfer λ playing the role of an inverse variance (see the identity below)
  - If we were to fix the value of λ, this would be empirical Bayes
  - Undirected empirical Bayes estimation
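Concretely, exponentiating an L2 divergence term gives an isotropic Gaussian prior on the child parameters, with the degree of transfer λ setting the inverse variance:

```latex
\exp\!\big(-\lambda\, \|\theta_c - \theta_{\mathrm{pa}(c)}\|_2^2\big)
\;\propto\; \mathcal{N}\!\Big(\theta_c;\; \theta_{\mathrm{pa}(c)},\; \tfrac{1}{2\lambda}\, I\Big)
```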
Slide 23: Learning Degrees of Transfer
- Bootstrap approach
  - If θ_c[i] and θ_parent[i] have a consistent relationship, we want to encourage them to be similar
- Hyper-prior approach
  - Bayesian idea: put a prior on λ
  - Add λ to the optimization as a parameter, along with Θ
  - Concretely: an inverse-Gamma prior on the degree of transfer (forcing it to be positive); the joint objective is sketched below
  - If the likelihood is concave, the entire optimization problem remains convex!
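Schematically, the hyper-prior approach adds log P(λ) terms to the same point-estimation objective and optimizes Θ and λ jointly (with P an inverse-Gamma, as the slide says):

```latex
\max_{\Theta,\ \lambda > 0}\; \sum_{c} F_{\mathrm{data}}(\theta_c;\, D_c)
\;-\; \sum_{c,i} \lambda_{c,i}\, \mathrm{Div}\!\big(\theta_c[i],\, \theta_{\mathrm{pa}(c)}[i]\big)
\;+\; \sum_{c,i} \log P(\lambda_{c,i})
```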
Slide 24: Do Degrees of Transfer Help?
[Figure: mammal pairs. Delta log-loss per instance for Hyperprior vs. the Regularized Max Likelihood baseline, against total number of training instances (6-30), for the same ten pairs as in Slide 18]
Slide 25: Do Degrees of Transfer Help?
- Const: no degrees of transfer; Hyperprior and Bootstrap: use degrees of transfer
[Figure: Bison-Rhino. Delta log-loss per instance vs. total number of training instances (6-30) for Hyperprior, Bootstrap, RegML, Const, and θ_root]
Slide 26: Do Degrees of Transfer Help?
- Const: >100 bits worse, not shown
[Figure: Llama-Rhino. Delta log-loss per instance vs. total number of training instances (6-30) for Hyperprior, Bootstrap, θ_root, and RegML]
Slide 27: Degrees of Transfer
- Distribution of DOT coefficients using the hyper-prior
[Figure: histogram of 1/λ values from 0 to 50; small 1/λ means stronger transfer, large 1/λ means weaker transfer]
Slide 28: Summary
- Transfer between related classes
- Range of settings, tasks
- Probabilistic motivation
- Multilevel, complex hierarchies
- Simple, efficient computation
- Refined transfer of components
Slide 29: Future Work
- Non-tree hierarchies (multiple inheritance), e.g., the Gene Ontology (GO) network or the WordNet hierarchy; the general undirected model doesn't require a tree structure
- Block degrees of transfer
- Structure learning
- Part discovery