Title: Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Slide 1: Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
- Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
- Computer Science Dept., Stanford University
- UAI 2008
Presented by Haojun Chen, August 1st, 2008
Slide 2: Outline
- Background and motivation
- Undirected transfer hierarchies
- Experiments
- Degree of transfer coefficients
- Experiments
- Summary
Slide 3: Background (1/2)
- Transfer learning
  - Data from similar tasks/distributions are used to compensate for the sparsity of training data in the primary class or task
  - Example: use rhinos to help learn an elephant's shape
Resources: http://velblod.videolectures.net/2008/pascal2/uai08_helsinki/packer_cpe/uai08_packer_cpe_01.ppt
Slide 4: Background (2/2)
- Hierarchical Bayes (HB) framework
  - A principled approach to transfer learning
  - Joint distribution over the observed data and all class parameters:

      P(\mathcal{D}, \theta) = P(\theta_{root}) \prod_{c} P(\theta_c \mid \theta_{pa(c)}) \prod_{c} P(\mathcal{D}_c \mid \theta_c)

    where pa(c) denotes the parent of class c in the hierarchy
[Figure: example of a hierarchical Bayes parameterization.]
Slide 5: Motivation
- In practice, MAP point estimation is desirable, since full Bayesian computation can be difficult and computationally demanding
- Efficient point estimation is often out of reach in standard hierarchical Bayes models, because common conjugate priors such as the Dirichlet or normal-inverse-Wishart lead to objectives that are not convex in the parameters
- This paper proposes an undirected hierarchical Bayes (HB) reformulation that allows efficient point estimation
Slide 6: Undirected HB Reformulation
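The slide's body was a figure that did not survive extraction. As a sketch based on the paper's formulation, the undirected reformulation drops the directed conditional priors and scores all parameters with a single unnormalized objective that trades data fit against child-parent divergence penalties:

    \max_{\theta} \; F_{data}(\mathcal{D} \mid \theta) \;-\; \sum_{c} \mathrm{Div}\big(\theta_c, \theta_{pa(c)}\big)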
Slide 7: Purpose of Reformulation
- Easy to specify
  - Fdata can be a likelihood, classification, or other objective
  - Divergence can be L1 norm, L2 norm, ε-insensitive loss, KL divergence, etc.
  - No conjugacy or proper-prior restrictions
- Easy to optimize
  - Convex over θ if Fdata is concave and Divergence is convex
Slide 8: Experiment: Text categorization
20 Newsgroups dataset
- Bag-of-words model
- Fdata: regularized multinomial log-likelihood, Σ_i n_i log θ_i, where n_i is the frequency of word i
- Divergence: L2 norm
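Below is a minimal runnable sketch of this setup (the toy counts, lam, and all variable names are assumptions for illustration, not taken from the slides). It instantiates Slide 7's recipe: Fdata is a concave multinomial log-likelihood and the divergence is a convex L2 penalty to the parent, so point estimation is a convex program over the simplex-constrained parameters.

    import numpy as np
    from scipy.optimize import minimize

    # Toy hierarchy: one root and two child classes over a 5-word vocabulary.
    # counts[c][i] = frequency of word i in class c's training data.
    counts = np.array([[4., 1., 0., 2., 3.],
                       [0., 2., 5., 1., 1.]])
    C, V = counts.shape
    lam = 1.0  # strength of the child-parent divergence penalty

    def neg_objective(x):
        theta = x.reshape(C + 1, V)      # row 0: root, rows 1..C: children
        root, children = theta[0], theta[1:]
        loglik = np.sum(counts * np.log(children + 1e-12))  # F_data (concave)
        div = lam * np.sum((children - root) ** 2)          # L2 divergence (convex)
        return -(loglik - div)                              # minimize the negative

    # Each of the C+1 parameter vectors must lie on the probability simplex.
    cons = [{"type": "eq",
             "fun": lambda x, k=k: x.reshape(C + 1, V)[k].sum() - 1.0}
            for k in range(C + 1)]
    x0 = np.full((C + 1) * V, 1.0 / V)
    res = minimize(neg_objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * ((C + 1) * V), constraints=cons)
    print(res.x.reshape(C + 1, V).round(3))

The children are pulled toward the shared root estimate; increasing lam strengthens the transfer.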
Slide 9: Text categorization: Result
Baselines: maximum likelihood at each node (no hierarchy); cross-validated regularization (no hierarchy); Shrinkage (McCallum et al. '98, with hierarchy)
[Figure: Newsgroup topic classification rate (y-axis, 0.35-0.7) vs. total number of training instances (x-axis, 75-375) for Max Likelihood (no regularization), Regularized Max Likelihood, Shrinkage, and Undirected HB.]
Slide 10: Experiment: Shape Modeling
Mammals dataset (Fink, '05)
- Task: density estimation (test likelihood)
- Instances represented by 60 x-y coordinates of landmarks on the animal's outline
- Fdata: regularized Gaussian log-likelihood, parameterized by the mean landmark location and the covariance over landmarks
- Divergence: L2 norm over mean and variance
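The slide's equation is missing from the extraction; only its callouts (mean landmark location, covariance over landmarks, regularization) survived. A plausible instantiation of "L2 norm over mean and variance", with assumed symbols \mu_c for class c's mean landmark vector and \sigma^2_c for its landmark variances, is:

    \mathrm{Div}\big(\theta_c, \theta_{pa(c)}\big) = \|\mu_c - \mu_{pa(c)}\|_2^2 + \|\sigma^2_c - \sigma^2_{pa(c)}\|_2^2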
Slide 11: Undirected HB Shape Modeling: Result
[Figure: Delta log-loss per instance (y-axis, -350 to 50) vs. total number of training instances (x-axis, 6-30), relative to Regularized Max Likelihood, for mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
Slide 12: Problem in Transfer
Not all parameters deserve equal sharing
Slide 13: Degrees of Transfer (DOT)
The divergence is split into subcomponents with weights λ, so that different strengths are allowed for different subcomponents and child-parent pairs:
- λ → 0 forces parameters to agree
- λ → ∞ allows parameters to be flexible
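One way to write the resulting objective, consistent with these limits (a sketch; the subscript convention is assumed here, with s indexing parameter subcomponents), is to scale each subcomponent's divergence by 1/λ:

    \max_{\theta} \; F_{data}(\mathcal{D} \mid \theta) \;-\; \sum_{c} \sum_{s} \frac{1}{\lambda_{c,s}} \mathrm{Div}\big(\theta_{c,s}, \theta_{pa(c),s}\big)

so that λ_{c,s} → 0 makes the penalty infinite (forcing agreement) while λ_{c,s} → ∞ removes it (allowing flexibility).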
Slide 14: Estimation of DOT Parameters
- Hyper-prior approach
  - Bayesian idea: put a prior on λ and add it to the optimization as a parameter alongside θ
  - Concretely: an inverse-Gamma prior (λ is forced to be positive)
[Figure: prior on the degree of transfer.]
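As a sketch of what the hyper-prior contributes to the objective (α and β are generic inverse-Gamma shape and scale hyperparameters, not taken from the slides), each λ adds the log of an inverse-Gamma density:

    \log p(\lambda) = -(\alpha + 1) \log \lambda - \frac{\beta}{\lambda} + \mathrm{const}, \qquad \lambda > 0

and λ is then optimized jointly with θ.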
Slide 15: DOT Shape Modeling: Result
[Figure: Delta log-loss per instance (y-axis, -15 to 15) vs. total number of training instances (x-axis, 6-30) for the Hyperprior approach, relative to Regularized Max Likelihood, for mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
Slide 16: Distribution of DOT coefficients
[Figure: Histogram of DOT coefficients under the hyper-prior approach (x-axis: 1/λ, 0-50; y-axis: count, 0-20), with θ_root marked; larger 1/λ indicates stronger transfer and smaller 1/λ weaker transfer, per the λ → 0 limit on Slide 13.]
Slide 17: Summary
- An undirected reformulation of the hierarchical Bayes framework is proposed for efficient convex point estimation
- Different degrees of transfer for different parameters are introduced, so that some parts of the distribution can be transferred to a greater extent than others