Convex Point Estimation using Undirected Bayesian Transfer Hierarchies (Presentation Transcript)

1
Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
  • Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
  • Computer Science Dept.
  • Stanford University
  • UAI 2008

Presented by Haojun Chen, August 1st, 2008
2
Outline
  • Background and motivation
  • Undirected transfer hierarchies
  • Experiments
  • Degree of transfer coefficients
  • Experiments
  • Summary

3
Background (1/2)
  • Transfer learning
  • Data from similar tasks/distributions is used to compensate for the sparsity of training data in the primary class or task

Example: use rhinos to help learn an elephant's shape
Resources: http://velblod.videolectures.net/2008/pascal2/uai08_helsinki/packer_cpe/uai08_packer_cpe_01.ppt
4
Background (2/2)
  • Hierarchical Bayes (HB) framework
  • A principled approach to transfer learning

The joint distribution over the observed data and all class parameters is

  $P(\mathcal{D}, \boldsymbol{\theta}) = P(\theta_{\mathrm{root}}) \prod_{c \neq \mathrm{root}} P(\theta_c \mid \theta_{\mathrm{pa}(c)}) \prod_c P(\mathcal{D}_c \mid \theta_c)$

where $\mathcal{D}_c$ is the data for class $c$ and $\mathrm{pa}(c)$ is its parent in the hierarchy.
[Figure: Example of a hierarchical Bayes parameterization.]
5
Motivation
  • In practice, MAP point estimation is desirable, because full Bayesian computation can be difficult and computationally demanding
  • Efficient point estimation may not be achievable in many standard hierarchical Bayes models, because many common conjugate priors, such as the Dirichlet or the normal-inverse-Wishart, are not log-concave in the parameters, so the MAP objective is not convex
  • This paper proposes an undirected hierarchical Bayes (HB) reformulation that allows efficient point estimation
6
Undirected HB Reformulation
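A sketch of the reformulated objective, with notation assumed from the surrounding slides (the exact weighting in the paper may differ): all class parameters are estimated jointly by maximizing a data term minus unnormalized child-parent divergence penalties,

  $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; F_{\mathrm{data}}(\mathcal{D}; \boldsymbol{\theta}) - \sum_c \mathrm{Div}\big(\theta_c, \theta_{\mathrm{pa}(c)}\big)$

where each divergence term may carry its own weight (the degrees of transfer introduced later in the talk). Because the potentials need not be normalized or conjugate, any concave $F_{\mathrm{data}}$ and convex $\mathrm{Div}$ can be plugged in.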
7
Purpose of Reformulation
  • Easy to specify
  • F_data can be a likelihood, a classification objective, or another objective
  • Divergence can be the L1 norm, L2 norm, ε-insensitive loss, KL divergence, etc.
  • No conjugacy or proper-prior restrictions
  • Easy to optimize
  • Convex over θ if F_data is concave and Divergence is convex (a minimal sketch follows below)

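To make the convexity claim concrete, here is a minimal sketch in Python, assuming a Gaussian F_data (concave in the class means) and a squared-L2 child-parent divergence (convex); the class names, toy data, and weight lam are illustrative, not from the paper:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x_a = rng.normal(1.0, 1.0, size=5)    # sparse data for the primary class
    x_b = rng.normal(1.2, 1.0, size=50)   # richer data for a related class

    lam = 1.0  # illustrative divergence weight (strength of transfer)

    def neg_objective(theta):
        """Negative of F_data - lam * Divergence; convex, so any local
        minimum found by a standard solver is the global point estimate."""
        mu_a, mu_b, mu_parent = theta
        # Concave F_data: Gaussian log-likelihood of each class mean
        f_data = -0.5 * np.sum((x_a - mu_a) ** 2) - 0.5 * np.sum((x_b - mu_b) ** 2)
        # Convex divergence: squared L2 distance of each child to the parent
        div = (mu_a - mu_parent) ** 2 + (mu_b - mu_parent) ** 2
        return -(f_data - lam * div)

    res = minimize(neg_objective, x0=np.zeros(3))
    print("mu_a, mu_b, mu_parent:", res.x)

With only five samples, mu_a is pulled toward the shared parent, and hence toward the better-estimated mu_b, which is exactly the transfer effect the hierarchy is meant to provide.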
8
Experiment: Text Categorization
Newsgroup20 dataset
  • Bag-of-words model
  • F_data: regularized multinomial log-likelihood, $\sum_i n_i \log \theta_i$, where $n_i$ is the frequency of word $i$
  • Divergence: L2 norm

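Assembled for this experiment, the objective would take roughly the following form (a sketch; the exact regularizer and the weights $w_c$ are assumptions):

  $\max_{\boldsymbol{\theta}} \; \sum_c \sum_i n_{c,i} \log \theta_{c,i} \; - \; \sum_c w_c \, \lVert \theta_c - \theta_{\mathrm{pa}(c)} \rVert_2^2$, subject to each $\theta_c$ lying on the probability simplex,

a concave data term minus a convex penalty over a convex constraint set, and hence a convex problem.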
9
Text Categorization: Results
Baselines: maximum likelihood at each node (no hierarchy); cross-validated regularization (no hierarchy); Shrinkage (McCallum et al. '98, with hierarchy).
[Figure: Newsgroup topic classification rate (0.35 to 0.7) vs. total number of training instances (75 to 375), comparing Max Likelihood (no regularization), Shrinkage, Regularized Max Likelihood, and Undirected HB.]
10
Experiment: Shape Modeling
Mammals dataset (Fink, '05)
  • Density estimation, evaluated by test likelihood
  • Instances are represented by the x-y coordinates of 60 landmarks on the animal's outline
  • F_data: Gaussian log-likelihood, with a mean landmark location, a covariance over landmarks, and regularization
  • Divergence: L2 norm over the mean and variance
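In the same template (a sketch with assumed notation), each class $c$ fits a Gaussian to its stacked landmark vectors $x_m$, and the divergence compares child and parent parameters componentwise:

  $F_{\mathrm{data}} = \sum_{m \in c} \log \mathcal{N}(x_m; \mu_c, \Sigma_c)$ (plus regularization),  $\mathrm{Div} = \lVert \mu_c - \mu_{\mathrm{pa}(c)} \rVert_2^2 + \lVert \Sigma_c - \Sigma_{\mathrm{pa}(c)} \rVert_F^2$

reading "L2 norm over mean and variance" as a squared L2 distance on the means plus a Frobenius (entrywise L2) distance on the covariances.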
11
Undirected HB: Shape Modeling Results
[Figure: Mammal pairs. Delta log-loss per instance relative to Regularized Max Likelihood (-350 to 50) vs. total number of training instances (6 to 30) for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino.]
12
Problem in Transfer
Not all parameters deserve equal sharing
13
Degrees of Transfer (DOT)
The divergence is split into subcomponents with per-component weights λ, so that different strengths are allowed for different subcomponents and child-parent pairs:
  • λ → 0 forces parameters to agree
  • λ → ∞ allows parameters to be flexible
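One way to write the weighted divergence that is consistent with these limits (an assumed form; $k$ indexes the parameter subcomponents):

  $\sum_c \sum_k \frac{1}{\lambda_{c,k}} \, \mathrm{Div}\big(\theta_{c,k}, \theta_{\mathrm{pa}(c),k}\big)$

so $\lambda_{c,k} \to 0$ makes the penalty on subcomponent $k$ infinitely strong, forcing agreement with the parent, while $\lambda_{c,k} \to \infty$ removes the penalty and leaves the subcomponent flexible.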
14
Estimation of DOT Parameters
  • Hyper-prior approach
  • Bayesian idea: put a prior on λ and add it to the optimization as a parameter alongside θ
  • Concretely, an inverse-Gamma prior (λ is forced to be positive)

[Figure: Prior on degree of transfer.]
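For reference, the inverse-Gamma density with shape $\alpha$ and scale $\beta$ (hyperparameters shown for illustration) is

  $p(\lambda; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, \lambda^{-(\alpha+1)} e^{-\beta/\lambda}, \quad \lambda > 0$

so placing it on each degree-of-transfer coefficient adds $-(\alpha+1)\log\lambda - \beta/\lambda$ (up to a constant) to the log objective while keeping $\lambda$ positive.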
15
DOT: Shape Modeling Results
[Figure: Mammal pairs. Delta log-loss per instance of the Hyperprior approach relative to Regularized Max Likelihood (-15 to 15) vs. total number of training instances (6 to 30), for the same ten mammal pairs.]
16
Distribution of DOT Coefficients
[Figure: Histogram of DOT coefficients under the hyperprior approach (x-axis: 1/λ, 0 to 50; y-axis: count, 0 to 20); the root coefficient q_root is marked, and annotations indicate the directions of stronger and weaker transfer.]
17
Summary
  • An undirected reformulation of the hierarchical Bayes framework is proposed for efficient convex point estimation
  • Different degrees of transfer are introduced for different parameters, so that some parts of the distribution can be transferred to a greater extent than others