Regularized Adaptation: Theory, Algorithms and Applications

1
Regularized Adaptation: Theory, Algorithms and Applications

Xiao Li
Electrical Engineering Department
University of Washington

2
Roadmap

Introduction
Theoretical results
A Bayesian fidelity prior for adaptation
Generalization error bounds
Regularized adaptation algorithms
SVM and MLP adaptation
Experiments on vowel and object classification
The application to the Vocal Joystick
Conclusions and future work

3
Inductive Learning

Given
a set of m samples $(x_i, y_i) \sim p(x, y)$
a decision function space $F: X \to \{\pm 1\}$
Goal
learn a decision function that minimizes
the expected error
In practice
minimize the empirical error
while applying a regularization strategy to
achieve good generalization performance
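
The formulas on this slide were not preserved in the transcript. In standard notation, and assuming the 0-1 loss (an assumption; the symbol $R_{\mathrm{emp}}$ matches the one used later in the talk), the expected and empirical errors can be written as:

```latex
R(f) = \mathbb{E}_{(x,y)\sim p(x,y)}\big[\mathbb{1}\{f(x)\neq y\}\big],
\qquad
R_{\mathrm{emp}}(f) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{1}\{f(x_i)\neq y_i\}
```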

4
Why Is Regularization Helpful?

Learning theory says
Frequentist: Vapnik's VC bound expresses the capacity term $\Phi$ as a
function of the VC dimension of $F$
Bayesian: the Occam's Razor bound expresses $\Phi$ as
a function of the prior probability of $f$
Accuracy-regularization
We want to minimize the empirical error as well
as the capacity
Frequentist: support vector machines
Bayesian: Bayesian model selection

5
Adaptive Learning

Two related yet different distributions
Training
target (test-time)
Given
An unadapted model
Adaptation data (labeled)
Goal
Learn an adapted model that is as close as
possible to our desired model
Notes
Assume sufficient training data but limited
adaptation data
Training data is not preserved

6
Scenarios

Customization
Speech recognition: speaker adaptation
Handwriting recognition: writer adaptation
Language processing: domain adaptation
Evolutionary environments
Spam filtering
Incremental/sequential learning
Start from a simple or rough model and refine
incrementally

7
Practical Work on Adaptation

Gaussian mixture models (GMMs)
MAP (Gauvain 94), MLLR (Leggetter 95)
Support vector machines (SVMs)
Boosting-like approach (Matic 93)
Weighted combination of old support vectors and
adaptation data (Wu 04)
Multi-layer perceptrons (MLPs)
Shared internal representation (Baxter 95,
Caruana 97, Stadermann 05)
Linear input network (Neto 95)
Conditional maximum entropy models
Gaussian prior (Chelba 04)

8
This Work Seeks Answers to

A unified and principled approach to adaptation
applicable to a variety of classifiers
amenable to variations in the amount of
adaptation data
Quantitative relationships between
the generalization error bound (or sample
complexity bound) and
the divergence between training and target
distributions

10
Bayesian Fidelity Prior

Adaptation objective
$R_{\mathrm{emp}}(f)$: empirical error on the adaptation
data
$p_{\mathrm{fid}}(f)$: Bayesian fidelity prior
Fidelity prior
How likely a classifier is given a training
distribution (rather than a training set; this is the key
difference from hierarchical Bayes approaches,
e.g. Baxter 97)
Applicable to different classifiers
Relates to the KL-divergence
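
The objective itself did not survive in the transcript. One form consistent with the two terms named on this slide is a penalized empirical error; the trade-off coefficient $\lambda$ is a hypothetical symbol introduced here, and the exact expression on the original slide may differ:

```latex
\min_{f \in F} \; R_{\mathrm{emp}}(f) \;-\; \lambda \, \ln p_{\mathrm{fid}}(f), \qquad \lambda > 0
```

Under this reading, a classifier that the fidelity prior considers unlikely pays a larger penalty, so the adapted model stays close to what the training distribution supports unless the adaptation data argue otherwise.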

11
Generative Models

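Generative models $p(x, y \mid f)$
Classification
Posterior
Assume $f^{\mathrm{tr}}$ and $f^{\mathrm{ad}}$ are the true models
generating the training and target distributions
respectively, i.e. $p(x, y \mid f^{\mathrm{tr}}) = p^{\mathrm{tr}}(x, y)$ and $p(x, y \mid f^{\mathrm{ad}}) = p^{\mathrm{ad}}(x, y)$
Note that this assumption is justifiable if the
function space contains the true models and if we
use the log likelihood loss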

(Slide callout: standard prior, chosen before training)

12
Fidelity Prior for Generative Models

Key result
where $\beta > 0$
Implication
Fidelity prior at the desired model
We are more likely to learn our desired model
using the fidelity prior than using the standard
prior

13
Instantiations

To compute the fidelity prior
assuming a uniform standard prior, this prior
is determined by the KL-divergence
In cases where the KL-divergence does not have a closed
form, we use an upper bound instead (hence a
lower bound on the prior)
Gaussian models
The fidelity prior is a normal-Wishart
distribution
Mixture models
An upper bound on the KL-divergence (using log
sum inequality)
Hidden Markov models
An upper bound on the KL-divergence (Silva 06)
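
As a rough illustration of the Gaussian and mixture cases above, the sketch below computes the closed-form KL-divergence between two univariate Gaussians and the log-sum-inequality upper bound for two mixtures whose components are matched by index. The univariate setting, the index matching, and all function names are assumptions made for the example; they are not taken from the slides.

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, var0) || N(mu1, var1) ) for univariate Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def kl_mixture_upper_bound(w0, mus0, vars0, w1, mus1, vars1):
    """Log-sum-inequality bound with components matched by index:
    KL(sum_k w0_k p_k || sum_k w1_k q_k) <= KL(w0 || w1) + sum_k w0_k KL(p_k || q_k)."""
    w0, w1 = np.asarray(w0, float), np.asarray(w1, float)
    kl_weights = np.sum(w0 * np.log(w0 / w1))
    kl_comps = kl_gauss(np.asarray(mus0, float), np.asarray(vars0, float),
                        np.asarray(mus1, float), np.asarray(vars1, float))
    return kl_weights + np.sum(w0 * kl_comps)

def mixture_pdf(w, mus, vars_):
    """Return a function evaluating a univariate Gaussian mixture density."""
    w, mus, vars_ = (np.asarray(a, float) for a in (w, mus, vars_))
    def pdf(x):
        x = np.asarray(x, float)[:, None]
        comp = np.exp(-0.5 * (x - mus) ** 2 / vars_) / np.sqrt(2 * np.pi * vars_)
        return comp @ w
    return pdf

def kl_numeric(pdf0, pdf1, grid):
    """Riemann-sum approximation of KL(pdf0 || pdf1) on a uniform grid."""
    p, q = pdf0(grid), pdf1(grid)
    return np.sum(p * np.log(p / q)) * (grid[1] - grid[0])

if __name__ == "__main__":
    trained = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])   # weights, means, variances
    adapted = ([0.3, 0.7], [-0.5, 1.5], [1.5, 0.8])
    grid = np.linspace(-12.0, 12.0, 20001)
    print("numerical KL :", kl_numeric(mixture_pdf(*trained), mixture_pdf(*adapted), grid))
    print("upper bound  :", kl_mixture_upper_bound(*trained, *adapted))
```

Because the bound is never smaller than the true divergence, using it in place of the KL-divergence gives a lower bound on the prior, as the slide notes.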

14
Discriminative Models

A unified view of SVMs, MLPs, CRFs, etc.
Affine classifiers in a transformed space: $f = (w, b)$
Classification
Conditional likelihood (for binary case)

15
Discriminative Models (cont.)

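Conditional models $p(y \mid x, f)$
Classification
Posterior
Assume $f^{\mathrm{tr}}$ and $f^{\mathrm{ad}}$ are the true models
generating the training and target conditional
distributions respectively, i.e. $p(y \mid x, f^{\mathrm{tr}}) = p^{\mathrm{tr}}(y \mid x)$ and $p(y \mid x, f^{\mathrm{ad}}) = p^{\mathrm{ad}}(y \mid x)$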

16
Fidelity Prior for Conditional Models

Again a divergence
where $\beta > 0$
What if we do not know $p^{\mathrm{tr}}(x, y)$?
We seek an upper bound on the KL-divergence and
hence a lower bound on the prior
Key result
where

18
Occam's Razor Bound for Adaptation

For a countable function space
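
The bound on this slide was not preserved. As a hedged illustration, the sketch below evaluates the standard relaxed form of the Occam's Razor bound for a countable function space, $R(f) \le R_{\mathrm{emp}}(f) + \sqrt{(\ln(1/p(f)) + \ln(1/\delta)) / (2m)}$, and shows how a prior that puts more mass on the desired classifier tightens it. The function name and the example prior values are hypothetical.

```python
import math

def occam_bound(emp_error, prior_prob, m, delta=0.05):
    """Relaxed Occam's Razor bound for a countable function space.

    With probability at least 1 - delta over m i.i.d. samples,
    R(f) <= R_emp(f) + sqrt((ln(1/p(f)) + ln(1/delta)) / (2 m)).
    """
    if not (0.0 < prior_prob <= 1.0 and 0.0 < delta < 1.0 and m > 0):
        raise ValueError("need 0 < prior_prob <= 1, 0 < delta < 1 and m > 0")
    complexity = math.log(1.0 / prior_prob) + math.log(1.0 / delta)
    return emp_error + math.sqrt(complexity / (2.0 * m))

if __name__ == "__main__":
    m = 200  # hypothetical number of adaptation samples
    # Hypothetical priors on the desired classifier: a uniform standard prior
    # over one million classifiers vs. a prior concentrated near it.
    for name, prior in [("standard prior", 1e-6), ("concentrated prior", 1e-1)]:
        print(name, occam_bound(emp_error=0.10, prior_prob=prior, m=m))
```

The only term that depends on the prior is $\ln(1/p(f))$, which is why a prior that is large at the desired model reduces the number of adaptation samples needed for a given bound.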

19
[Figure: bound using standard prior and bounds using divergence prior; axis label: m]

20
PAC-Bayesian Bounds for Adaptation

For both countable and uncountable function
spaces
Choice of prior p( f ) and posterior q( f )
$D(q(f) \,\|\, p(f))$ and the stochastic error are
easily computable
Use $p_{\mathrm{fid}}(f)$ or its related forms as the prior
Choose q( f ) to have the same parametric form
Examples
Gaussian models
Linear classifier
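
For the linear-classifier example, a common choice (assumed here, not confirmed by the slides) is an isotropic Gaussian prior centered on the unadapted weights and a posterior of the same form centered on the adapted weights. The divergence then reduces to a scaled squared distance, which can be plugged into a PAC-Bayesian bound of Langford's form, $\mathrm{kl}(R_{\mathrm{emp}}(q) \,\|\, R(q)) \le (D(q \| p) + \ln((m+1)/\delta)) / m$. All names below are hypothetical.

```python
import numpy as np

def kl_isotropic_gaussians(w_post, w_prior, sigma=1.0):
    """KL( N(w_post, sigma^2 I) || N(w_prior, sigma^2 I) ) = ||w_post - w_prior||^2 / (2 sigma^2)."""
    diff = np.asarray(w_post, float) - np.asarray(w_prior, float)
    return float(diff @ diff) / (2.0 * sigma ** 2)

def pac_bayes_rhs(kl, m, delta=0.05):
    """Right-hand side of the PAC-Bayesian bound: (D(q || p) + ln((m + 1) / delta)) / m."""
    return (kl + np.log((m + 1) / delta)) / m

if __name__ == "__main__":
    w_unadapted = np.array([1.0, -2.0, 0.5])   # hypothetical unadapted weights
    w_adapted = np.array([1.2, -1.8, 0.4])     # hypothetical adapted weights
    m = 100                                    # hypothetical adaptation set size
    kl_fid = kl_isotropic_gaussians(w_adapted, w_unadapted)
    kl_std = kl_isotropic_gaussians(w_adapted, np.zeros(3))
    print("prior centered on unadapted model:", pac_bayes_rhs(kl_fid, m))
    print("prior centered at zero           :", pac_bayes_rhs(kl_std, m))
```

When the adapted weights lie closer to the unadapted model than to the origin, the prior centered on the unadapted model yields the smaller complexity term.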

22
Algorithms Derived from the Fidelity Prior

Generative Models
Relation to MAP adaptation
Conditional Models
Log linear models
We focus on SVMs and MLPs

23
Regularized SVM Adaptation

Optimization objective
Globally optimal solution
Regularized: fix the old support vectors and
their coefficients
Extended regularized: update the coefficients of the old
support vectors as well
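
The optimization objective was not preserved in the transcript. The sketch below illustrates the general idea on a linear SVM only: the hinge loss on the adaptation data is traded off against the squared distance to the unadapted weights. The linear (non-kernel) setting, the subgradient solver, and every name and default value are assumptions made for illustration; this is not the kernelized, globally optimal formulation described in the talk.

```python
import numpy as np

def adapt_linear_svm(X, y, w_tr, b_tr, lam=1.0, lr=0.05, epochs=200):
    """Subgradient descent on
        (lam / 2) * ||w - w_tr||^2 + mean_i max(0, 1 - y_i (w . x_i + b)),
    starting from the unadapted model (w_tr, b_tr). Labels y are +1 / -1."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    w_tr = np.asarray(w_tr, float)
    w, b = w_tr.copy(), float(b_tr)
    n = len(y)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                      # samples violating the margin
        hinge_grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / n
        hinge_grad_b = -y[active].sum() / n
        w -= lr * (lam * (w - w_tr) + hinge_grad_w)  # pull toward w_tr + fit data
        b -= lr * hinge_grad_b
    return w, b

if __name__ == "__main__":
    # Hypothetical adaptation data whose boundary is orthogonal to the old one.
    X_ad = np.array([[0.0, 2.0], [0.0, -2.0], [0.0, 3.0], [0.0, -3.0]])
    y_ad = np.array([1.0, -1.0, 1.0, -1.0])
    w, b = adapt_linear_svm(X_ad, y_ad, w_tr=[1.0, 0.0], b_tr=0.0, lam=0.1)
    print("adapted weights:", w, "bias:", b)
```

A large `lam` keeps the adapted model close to the unadapted one, which suits very small adaptation sets; a small `lam` approaches retraining on the adaptation data alone.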

24
Algorithms in Comparison

Unadapted
Retrained
Use adaptation data only
Boosted (Matic 93)
Select adaptation data misclassified by the
unadapted model
Combine with old support vectors
Bootstrapped (proposed in thesis)
Train a seed classifier using adaptation data
only
Select old support vectors correctly classified
by the seed classifier
Combine with adaptation data
Regularized and Extended regularized

25
Regularized MLP Adaptation

Optimization objective for a multi-class,
two-layer MLP
$W_{h2o}$ and $W_{i2h}$: the hidden-to-output and
input-to-hidden layer weight matrices, respectively
$\|W\|$ is the L2 norm
$R_{\mathrm{emp}}(f)$: cross-entropy, corresponding to log
loss
Locally optimal solution found using
back-propagation
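
The objective formula is missing from the transcript. A plausible reading, sketched below, is cross-entropy on the adaptation data plus squared-L2 penalties on how far each weight matrix moves from its unadapted value. The separate coefficients `lam` and `nu`, the tanh hidden units, and the omission of bias terms are assumptions made for the example.

```python
import numpy as np

def regularized_mlp_objective(X, Y, W_i2h, W_h2o, W_i2h_tr, W_h2o_tr, lam=0.1, nu=0.1):
    """Cross-entropy of a two-layer MLP plus penalties toward the unadapted weights:
        R_emp(f) + (lam / 2) ||W_h2o - W_h2o_tr||^2 + (nu / 2) ||W_i2h - W_i2h_tr||^2.
    X: (n, d) inputs, Y: (n, K) one-hot labels."""
    hidden = np.tanh(X @ W_i2h)                        # hidden activations
    logits = hidden @ W_h2o
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    cross_entropy = -np.mean(np.sum(Y * log_probs, axis=1))
    penalty = (0.5 * lam * np.sum((W_h2o - W_h2o_tr) ** 2)
               + 0.5 * nu * np.sum((W_i2h - W_i2h_tr) ** 2))
    return cross_entropy + penalty

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 4))
    Y = np.eye(3)[[0, 1, 2, 0, 1]]
    W1_tr, W2_tr = rng.normal(size=(4, 6)), rng.normal(size=(6, 3))
    print(regularized_mlp_objective(X, Y, W1_tr, W2_tr, W1_tr, W2_tr))
```

In this form, letting `nu` grow very large freezes the input-to-hidden layer (only the last layer is retrained), while letting `lam` grow very large freezes the hidden-to-output layer.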

26
Algorithms in Comparison

Unadapted
Retrained
Start from randomly initialized weights and train
with weight decay
Linear input network (Neto 95)
Add a linear transformation in the input space
Retrained speaker-independent (Neto 95)
Start from the unadapted model; train both layers
Retrained last layer (Baxter 95, Caruana 97,
Stadermann 05)
Start from the unadapted model; only train the last
layer
Retrained first layer (proposed in thesis)
Start from the unadapted model; only train the first
layer
Regularized
Note that all of the above (except retrained) can be
considered special cases of the regularized approach

28
Experimental Paradigm

Goal
To compare adaptation algorithms for a given
classifier
Not to compare adaptation algorithms across
classifiers
Procedure
Train an unadapted model on training set
Adapt (with supervision) and evaluate via n-fold
CV on test set
Select regularization coefficients on the dev set
Corpora
VJ vowel dataset (Kilanski 06)
NORB image dataset (LeCun 04)

29
VJ Vowel Dataset

Task
8 vowel classes
Frame-level classification error rate
Speaker adaptation
Data allocation
Training set: 21 speakers, 420K samples
For SVM, we randomly selected 80K samples for
training
Test set: 10 speakers, 200 samples
Dev set: 4 speakers, 80 samples
Features
182 dimensions: 7 frames of MFCC + delta features
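
The 182-dimensional feature vector is consistent with 26 values per frame (for example 13 MFCCs plus 13 deltas) concatenated over 7 frames, since 7 x 26 = 182. The per-frame split, the symmetric window of 3 frames on each side, and the edge padding in the sketch below are assumptions; the slide states only the total dimension and the number of frames.

```python
import numpy as np

def stack_frames(features, context=3):
    """Concatenate each frame with `context` frames on both sides.

    features: (n_frames, n_dims) array of per-frame features.
    Returns an (n_frames, (2 * context + 1) * n_dims) array; the first and
    last frames are repeated to fill the window at the edges.
    """
    features = np.asarray(features, float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.hstack(windows)

if __name__ == "__main__":
    per_frame = np.random.default_rng(0).normal(size=(100, 26))  # hypothetical MFCC + delta frames
    stacked = stack_frames(per_frame, context=3)
    print(stacked.shape)
```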

30
SVM Adaptation

RBF kernel (std = 10), optimized for training and
fixed for adaptation
Mean and std. dev. over 10 speakers; results in red are
significant at the p < 0.001 level

31
MLP Adaptation (I)

50 hidden nodes
Mean and std. dev over 10 speakers

32
MLP Adaptation (II)

Varying number of vowel classes available in
adaptation data

33
NORB Image Dataset

Task
5 object classes
Lighting condition adaptation
Data allocation
Training set: 2700 samples
Test set: 2700 samples
Features
32x32 raw images

34
SVM Adaptation

RBF kernel (std = 500), optimized for training and
fixed for adaptation
Mean and std. dev over 6 lighting conditions

35
MLP Adaptation

30 hidden nodes
Mean and std. dev over 6 lighting conditions

37
Why the Vocal Joystick

Computer interfaces for individuals with
motor impairments
Head tracking
Eye-gaze tracking
Brain-computer interfaces
Expensive and error prone
Speech is a natural solution, but
Most suitable for discrete commands
Or, requires a more complex syntax

38
What Is the Vocal Joystick

A voice-based interface
that produces real-time, continuous signals to control
standard computing devices and robotic arms
Acoustic Parameters
Vowel quality
Loudness
Pitch
Discrete sound identity
VJ mouse