1
Functional Analytic Approach to Model Selection
  • Masashi Sugiyama (sugi@cs.titech.ac.jp)
  • Department of Computer Science,
  • Tokyo Institute of Technology, Tokyo, Japan

2
Regression Problem
Learning target function: $f(x)$
Learned function: $\hat{f}(x)$
Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = f(x_i) + \epsilon_i$
($\epsilon_i$: noise)
From $\{(x_i, y_i)\}_{i=1}^{n}$, obtain $\hat{f}$ such
that it is as close to $f$ as possible.
3
Typical Method of Learning
  • Linear regression model: $\hat{f}(x) = \sum_{j=1}^{p} \alpha_j \varphi_j(x)$
  • ($\{\alpha_j\}$: parameters to be learned, $\{\varphi_j\}$: basis functions)
  • Ridge regression: $\min_{\alpha} \bigl[ \sum_{i=1}^{n} (\hat{f}(x_i) - y_i)^2 + \lambda \|\alpha\|^2 \bigr]$
  • ($\lambda$: ridge parameter; a minimal code sketch follows)
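As a concrete illustration, here is a minimal sketch of ridge regression with a fixed basis; the Gaussian basis, the toy data, and all parameter values are assumptions made for this example rather than the deck's actual setup.

```python
import numpy as np

def design_matrix(x, centers, width=1.0):
    """Phi[i, j] = phi_j(x_i) for Gaussian basis functions (an assumed choice)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi a - y||^2 + lam * ||a||^2 over the parameter vector a."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

# Toy data (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=50)
y = np.sinc(x) + 0.1 * rng.standard_normal(50)

centers = np.linspace(-3.0, 3.0, 10)   # basis function centers
Phi = design_matrix(x, centers)
alpha = ridge_fit(Phi, y, lam=0.1)     # lam: the ridge parameter
```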
4
Model Selection
[Figure: target function vs. learned function for models that are too simple, appropriate, and too complex]
The choice of the model heavily affects the
learned function.
("Model" refers to, e.g., the ridge parameter $\lambda$.)
5
Ideal Model Selection
Determine the model such that a certain
generalization error $G = \|\hat{f} - f\|^2$ is minimized.
($G$: the "badness" of $\hat{f}$)
6
Practical Model Selection
However, the generalization error cannot
be directly calculated since it includes the unknown
learning target function $f$.
(This is not true for Bayesian model selection using
the evidence.)
Determine the model such that an estimator $\hat{G}$
of the generalization error is minimized.
We want an accurate estimator $\hat{G}$.
7
Two Approaches to Estimating Generalization Error
(1)
Try to obtain unbiased estimators: $E[\hat{G}] = G$
  • $C_P$ (Mallows, 1973)
  • Cross-validation (a minimal sketch follows below)
  • Akaike information criterion
  • (Akaike, 1974) etc.

Interested in typical-case performance
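As an example of this family, leave-one-out cross-validation has a closed form for linear smoothers such as ridge regression via the diagonal of the hat matrix. This sketch reuses the hypothetical design_matrix / ridge_fit helpers (and Phi, y) from the earlier example.

```python
import numpy as np

def loo_error(Phi, y, lam):
    """Closed-form leave-one-out squared error for ridge regression."""
    p = Phi.shape[1]
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # hat matrix
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

# Choose the ridge parameter that minimizes the LOO estimate.
lams = np.logspace(-3, 2, 30)
best_lam = min(lams, key=lambda lam: loo_error(Phi, y, lam))
```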
8
Two Approaches to Estimating Generalization Error
(2)
Try to obtain probabilistic upper bounds
  • VC bound (Vapnik & Chervonenkis, 1974)
  • Span bound (Chapelle & Vapnik, 2000)
  • Concentration bound
  • (Bousquet & Elisseeff, 2001) etc.

with probability $1 - \delta$
Interested in worst-case performance
9
Popular Choices ofGeneralization Measure
  • Risk: $\int \mathrm{loss}(\hat{f}(x), f(x))\, p(x)\, dx$
  • e.g., the squared loss $(\hat{f}(x) - f(x))^2$
  • Kullback-Leibler divergence:
    $\mathrm{KL}(q \,\|\, \hat{q}) = \int q(x) \log \frac{q(x)}{\hat{q}(x)}\, dx$

($q$: target density, $\hat{q}$: learned density)
10
Concerns in Existing Methods
  • The approximations used often require a large
    (infinite) number of training examples for their
    justification (asymptotic approximation),
  • so they do not work with small samples.
  • The generalization measure is integrated over
    the input distribution $p(x)$ from which training examples
    are drawn,
  • so they cannot be used for transduction
  • (estimating the error at a single point of interest).

11
Our Interests
  • We are interested in
  • Estimating the generalization error with accuracy
    guaranteed for small (finite) samples
  • Estimating the transduction error
  • (the error at a point of interest)
  • Investigating the role of unlabeled samples
    (input samples $x$ without output values $y$)

12
Our Generalization Measure
  • Functional Hilbert space $H$
  • We assume $f, \hat{f} \in H$ and measure the error by
    $G = \|\hat{f} - f\|^2$

($\|\cdot\|$: norm in the function space $H$)
13
Generalization Measure in Functional Hilbert Space
  • A functional Hilbert space is specified by
  • Set of functions which span the space,
  • Inner product (and norm).
  • Given a set of functions, we can design the inner
    product (and therefore the generalization
    measure) as desired.

14
Examples of the Norm
  • Weighted distance in the input domain:
    $\|f\|^2 = \int w(x)\, |f(x)|^2\, dx$   ($w$: weight function)
  • Weighted distance in the Fourier domain:
    $\|f\|^2 = \int \tilde{w}(\omega)\, |F(\omega)|^2\, d\omega$   ($F$: Fourier transform of $f$, $\tilde{w}$: weight function)
  • Sobolev norm:
    $\|f\|^2 = \sum_{k=0}^{s} \int |f^{(k)}(x)|^2\, dx$   ($f^{(k)}$: $k$-th derivative of $f$)
15
Interesting Features
With the weighted distance and weight function $w(x)$:
  • When $w(x)$ equals the density $p(x)$ of test input
    points, we can use unlabeled samples
    for estimating the generalization error
  • For transductive inference (given a test point
    $x_0$), concentrate $w$ at $x_0$
  • For interpolation and extrapolation, let $w$ cover the desired input region

16
Goal of My Talk
  • I suppose that you like the generalization
    measure defined in the functional Hilbert space:
    $G = \|\hat{f} - f\|^2$ ($\|\cdot\|$: norm in the function space).
  • The goal of my talk is to give a method for
    estimating this generalization error.
17
Function Spaces for Learning
  • For further discussion, we have to specify the
    class of function spaces.
  • We want the class to be as unrestrictive as possible.
  • A general function space such as $L_2$ is not
    suitable for learning problems because the value of
    a function at a point is not specified in $L_2$.

Two functions that differ only at a single point $x_0$
have different values at $x_0$,
but they are treated as the same function in $L_2$.
18
Reproducing Kernel Hilbert Space
  • A function space that is rather general, and in
    which the value of a function at a point is specified, is
    the reproducing kernel Hilbert space (RKHS).
  • An RKHS $H$ has the reproducing kernel $K(x, x')$:
  • For any fixed $x'$,
  • $K(\cdot, x')$ is a function of the first argument in $H$
  • For any function $f$ in $H$ and any $x'$,
    $\langle f, K(\cdot, x') \rangle = f(x')$

($\langle \cdot, \cdot \rangle$: inner product in the function space;
a small numerical check follows)
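A minimal numerical check of the reproducing property in a finite-dimensional RKHS: take the space spanned by a trigonometric basis, with $\langle f, g \rangle$ defined as the Euclidean inner product of coefficient vectors; the basis and coefficients below are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Feature map for span{1, cos x, sin x, cos 2x, sin 2x} (assumed basis)."""
    return np.array([1.0, np.cos(x), np.sin(x), np.cos(2 * x), np.sin(2 * x)])

def kernel(x, xp):
    """Reproducing kernel K(x, x') = <phi(x), phi(x')> of this space."""
    return phi(x) @ phi(xp)

a = np.array([0.5, -1.0, 2.0, 0.3, 0.0])   # coefficients of some f in the space
def f(x):
    return a @ phi(x)

x0 = 0.7
k_x0 = phi(x0)            # coefficient vector of the function K(., x0)
inner = a @ k_x0          # <f, K(., x0)> in this space
assert np.isclose(inner, f(x0))   # reproducing property: equals point evaluation
```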
19
Formulation of Learning Problem
  • Specified RKHS $H$
  • Fixed training input points $\{x_i\}_{i=1}^{n}$
  • Noise $\epsilon_i$: mean $0$, variance $\sigma^2$
  • Linear estimation: $\hat{f} = X y$
  • ($X$: linear learning operator),
  • e.g., ridge regression for the linear model
    $\hat{f}(x) = \sum_j \alpha_j \varphi_j(x)$
    with basis functions $\{\varphi_j\}$ in $H$
20
Sampling Operator
  • For any RKHS $H$, there exists a linear operator $A$
    from $H$ to $\mathbb{R}^n$ such that
    $Af = (f(x_1), \dots, f(x_n))^{\top}$
  • Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$

($\otimes$: Neumann-Schatten product; for vectors,
$(u \otimes v)\, f = \langle f, v \rangle\, u$;
$e_i$: $i$-th standard basis vector in $\mathbb{R}^n$; written out below)
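In symbols, the reproducing property makes the point evaluations linear (a reconstruction of the slide's formula under the notation above):

$$
A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i),
\qquad
(Af)_i = \langle f, K(\cdot, x_i) \rangle = f(x_i),
\qquad\text{i.e.}\qquad
Af = \bigl(f(x_1), \dots, f(x_n)\bigr)^{\top}.
$$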
21
Formulation
[Diagram: the learning target function $f$ lives in the RKHS $H$; the sampling operator $A$ (always linear) maps it to the sample value space $\mathbb{R}^n$, where the noise $\epsilon$ is added to give $y$; the learning operator $X$ (assumed linear) maps $y$ back to $H$ as the learned function $\hat{f}$; the generalization error is measured in $H$.]
22
Expected Generalization Error
  • We are interested in typical performance, so we
    estimate the expected generalization error over
    the noise: $E_{\epsilon} \|\hat{f} - f\|^2$
    ($E_{\epsilon}$: expectation over the noise)
  • We do not take the expectation over the input points
  • Data-dependent!
  • We do not assume that the training input points
    follow the test input distribution
  • Advantageous in active learning!
23
Bias / Variance Decomposition
For $\hat{f} = Xy$ with $y = Af + \epsilon$, the error decomposes as

$E_{\epsilon}\,\|\hat{f} - f\|^2 = \underbrace{\|XAf - f\|^2}_{\text{bias}} + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}}$

($E_{\epsilon}$: expectation over the noise, $\sigma^2$: noise variance,
$X^{*}$: adjoint of $X$; a short derivation follows).
The variance term is computable, so we want to estimate the bias!
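Spelled out for $\hat{f} = Xy$, $y = Af + \epsilon$, $E_{\epsilon}\epsilon = 0$, $E_{\epsilon}\epsilon\epsilon^{\top} = \sigma^2 I$ (a derivation consistent with the labels above):

$$
E_{\epsilon}\,\|\hat{f} - f\|^2
 = \|E_{\epsilon}\hat{f} - f\|^2 + E_{\epsilon}\,\|\hat{f} - E_{\epsilon}\hat{f}\|^2
 = \underbrace{\|XAf - f\|^2}_{\text{bias}}
 + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}},
$$

since $E_{\epsilon}\hat{f} = XAf$, $\hat{f} - E_{\epsilon}\hat{f} = X\epsilon$, and $E_{\epsilon}\|X\epsilon\|^2 = \sigma^2\,\mathrm{tr}(X^{*}X)$.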
24
Tricks for Estimating Bias
Sugiyama & Ogawa (Neural Computation, 2001)
  • Suppose we have a linear operator $X_u$
  • that gives an unbiased estimate $\hat{f}_u = X_u y$ of $f$:
    $E_{\epsilon} \hat{f}_u = f$

($E_{\epsilon}$: expectation over the noise)
We use $\hat{f}_u$ for estimating the bias of $\hat{f}$.
25
Unbiased Estimator of Bias
$\|\hat{f} - \hat{f}_u\|^2$ is a rough estimate of the bias $\|E_{\epsilon}\hat{f} - f\|^2$;
subtracting its own noise contribution,
$\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X - X_u)(X - X_u)^{*}\bigr)$,
gives an unbiased estimator of the bias.
26
Subspace Information Criterion (SIC)
$\mathrm{SIC} = \underbrace{\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X - X_u)(X - X_u)^{*}\bigr)}_{\text{estimate of bias}} + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}}$

SIC is an unbiased estimator of the expected
generalization error with finite samples:
$E_{\epsilon}\,\mathrm{SIC} = E_{\epsilon}\|\hat{f} - f\|^2$
(a code sketch follows)
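A minimal sketch of SIC for finite-dimensional ridge regression, assuming the generalization error is the Euclidean distance between coefficient vectors, $X_u$ is the generalized-inverse (BLUE) estimator, and the noise variance sigma2 is known; the published SIC covers general RKHS norms, so this is an illustration rather than the full method. Phi and y are the hypothetical objects from the earlier examples.

```python
import numpy as np

def sic(Phi, y, lam, sigma2):
    """Subspace information criterion for ridge regression (sketch).

    Assumes the error ||alpha_hat - alpha||^2 in coefficient space and
    requires Phi to have full column rank so that the BLUE exists.
    """
    n, p = Phi.shape
    X = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # ridge learning matrix
    Xu = np.linalg.pinv(Phi)                                   # unbiased (BLUE) estimator
    d = X @ y - Xu @ y                                         # alpha_hat - alpha_hat_u
    D = X - Xu
    bias_estimate = d @ d - sigma2 * np.trace(D @ D.T)         # unbiased bias estimate
    variance = sigma2 * np.trace(X @ X.T)
    return bias_estimate + variance

# Model selection: pick the ridge parameter minimizing SIC.
lams = np.logspace(-3, 2, 30)
best_lam = min(lams, key=lambda lam: sic(Phi, y, lam, sigma2=0.01))
```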
27
Obtaining Unbiased Estimate
We need $X_u$ that gives an unbiased estimate of the
learning target $f$.
$X_u$ exists if and only if $\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$
span the entire space $H$. When this is
satisfied, $X_u$ is given by $A^{\dagger}$
($A^{\dagger}$: generalized inverse of the sampling operator $A$).
We can enjoy all the features! (Unlabeled
samples, transductive inference, etc.)
28
Example of Using SIC: Standard Linear Regression
  • Learning target function: $f(x) = \sum_{j=1}^{p} \theta_j \varphi_j(x)$,
  • where $\{\theta_j\}$ are unknown
  • Regression model: $\hat{f}(x) = \sum_{j=1}^{p} \hat{\theta}_j \varphi_j(x)$,
  • where $\{\hat{\theta}_j\}$ are estimated linearly from $y$
  • (e.g., ridge regression)

29
Example (cont.)
  • Generalization measure: the weighted norm
    $\|\hat{f} - f\|^2 = \int w(x)\,(\hat{f}(x) - f(x))^2\, dx$
    ($w$: weight function)
  • If the $n \times p$ design matrix ($p$: number of basis functions) has
    rank $p$, then the best linear unbiased
    estimator (BLUE) always exists
  • In this case, SIC provides an unbiased estimate
    of the above generalization error
30
Applicability of SIC
  • However, the design matrix
    has rank $p$ only if $p \le n$
    ($p$: number of basis functions, $n$: number of training examples)
  • Therefore, the target function must be
    included in a rather small model

The range of application of SIC is rather limited.
31
When Unbiased Estimate Does Not Exist
Sugiyama & Müller (JMLR, 2002)
  • $X_u$ exists if and only if $\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$
    span the whole space $H$.
  • When this condition is not fulfilled, let us
    restrict ourselves to finding a learning result
    function from a subspace $S$, not from the
    entire RKHS $H$.
32
Essential Generalization Error
For $\hat{f} \in S$, the error splits into an essential part and an
irrelevant constant (see the decomposition below);
$f$ is just replaced by its orthogonal projection $Pf$ onto $S$.
Essentially, we are estimating the projection $Pf$.
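Since $\hat{f} \in S$ and $Pf - f$ is orthogonal to $S$, the Pythagorean theorem gives:

$$
\|\hat{f} - f\|^2
 = \underbrace{\|\hat{f} - Pf\|^2}_{\text{essential}}
 + \underbrace{\|Pf - f\|^2}_{\text{irrelevant (independent of the model)}}.
$$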
33
Unbiased Estimate of Projection
  • If a linear operator $X_u$ that gives an unbiased
    estimate of the projection $Pf$ of the learning target
    ($E_{\epsilon} X_u y = Pf$) is available, then SIC is an unbiased
    estimator of the essential generalization error.

Such $X_u$ exists if and only if the subspace $S$
is included in the span of
$\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$,
e.g., for the kernel regression model
$\hat{f}(x) = \sum_{i=1}^{n} \hat{\theta}_i K(x, x_i)$.
34
Restriction
  • However, another restriction arises:
  • if the generalization measure is designed as
    desired, we have to use the kernel function
    induced by the generalization measure.

35
Restriction (cont.)
  • On the other hand,
  • if a desired kernel function is used, then we
    have to use the generalization measure induced by
    the kernel,
  • e.g., the generalization measure in the Gaussian
    RKHS heavily penalizes high-frequency
    components.

36
Summary of Usage of SIC
  • SIC essentially has two modes.
  • For rather restricted linear regression, SIC has
    several interesting properties:
  • Unlabeled samples can be utilized for estimating
    the prediction error (expected test error).
  • Any weighted error measure can be used,
  • e.g., interpolation, extrapolation, transductive
    inference.
  • For kernel regression, SIC can always be applied.
    However, the kernel-induced generalization measure
    must be employed.

37
Simulation (1) Setting
  • Trigonometric polynomial RKHS:
  • Span $\{1, \cos x, \sin x, \dots, \cos Nx, \sin Nx\}$
  • Generalization measure: a weighted norm as in the examples above
  • Learning target function:
  • a sinc-like function in $H$

38
Simulation (1) Setting (cont.)
  • Training examples: $y_i = f(x_i) + \epsilon_i$
  • Ridge regression is used for learning
    (a code sketch of this setup follows after this list)
  • Number of training examples:
  • Noise variance:
  • Number of trials:
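Putting the pieces together, here is a sketch of how one run of such a simulation could be wired up with the hypothetical helpers defined earlier (ridge_fit and sic); the basis order, sample size, and noise level below are illustrative assumptions, not the values used in the deck.

```python
import numpy as np

def trig_basis(x, N=5):
    """Design matrix for span{1, cos kx, sin kx : k = 1..N} (assumed order N)."""
    cols = [np.ones_like(x)]
    for k in range(1, N + 1):
        cols.extend([np.cos(k * x), np.sin(k * x)])
    return np.stack(cols, axis=1)

rng = np.random.default_rng(1)
n, sigma2 = 100, 0.04                      # assumed sample size and noise variance
x = rng.uniform(-np.pi, np.pi, size=n)
y = np.sinc(x) + np.sqrt(sigma2) * rng.standard_normal(n)   # sinc-like target

Phi = trig_basis(x)
lams = np.logspace(-4, 2, 50)
best_lam = min(lams, key=lambda lam: sic(Phi, y, lam, sigma2))  # sic() from above
alpha = ridge_fit(Phi, y, best_lam)        # final learned coefficients
```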

39
Simulation (1-a): Using Unlabeled Samples
  • We estimate the prediction error using 1000
    unlabeled samples

40
Results: Unlabeled Samples
[Figure: estimated and true prediction errors plotted against the ridge parameter; values can be negative since some constants are ignored.]
41
Results: Unlabeled Samples (cont.)
[Figure: model selection results plotted against the ridge parameter.]
42
Simulation (1-b): Transduction
  • We estimate the test error
  • at a single test point

43
Results: Transduction
[Figure: estimated and true test errors at the test point, plotted against the ridge parameter.]
44
Results: Transduction (cont.)
[Figure: transduction results plotted against the ridge parameter.]
45
Simulation (2) Infinite Dimensional RKHS
  • Gaussian RKHS
  • Learning target function: the sinc function
  • Training examples: $y_i = f(x_i) + \epsilon_i$
  • We estimate the generalization error in this RKHS

46
Results: Gaussian RKHS
[Figure: estimated and true errors plotted against the ridge parameter.]
47
Results: Gaussian RKHS (cont.)
[Figure: model selection results plotted against the ridge parameter.]
48
Simulation (3) DELVE Data Sets
  • Gaussian RKHS
  • We choose the ridge parameter by
  • SIC
  • Leave-one-out cross-validation
  • An empirical Bayesian method
  • (marginal likelihood maximization; Akaike, 1980)
  • Performance is compared by the test error
49
Normalized Test Errors
Red: best or comparable (95% t-test)
[Table: normalized test errors on the DELVE data sets.]
50
Image Restoration
Sugiyama et al. (IEICE Trans., 2001); Sugiyama &
Ogawa (Signal Processing, 2002)
Restoration filter with a parameter to be tuned,
e.g., a Gaussian filter or a regularization filter.
[Figure: a degraded image restored with parameter values that are too large, too small, and appropriate.]
We would like to determine the parameter values
appropriately.
51
Formulation
[Diagram: the original image in a Hilbert space is mapped by a degradation operator, with noise added, to the observed image in a Hilbert space; a restoration filter maps the observed image to the restored image.]
52
Results with Regularization Filter
[Figure: original images, degraded images, and images restored using SIC.]
53
Precipitation Estimation
Moro & Sugiyama (IEICE General Conf., 2001)
  • Estimating future precipitation from past
    precipitation and weather radar data.
  • Our method with SIC won the 1st prize in
    estimation accuracy in the IEICE Precipitation
    Estimation Contest 2001:

1st: TokyoTech, MSE 0.71
2nd: KyuTech, MSE 0.75
3rd: Chiba Univ, MSE 0.93
4th: MSE 1.18
(Precipitation and weather radar data
from the IEICE Precipitation Estimation Contest 2001)
54
References (Fundamentals of SIC)
  • Proposing the concept of SIC:
  • Sugiyama, M. & Ogawa, H. Subspace information
    criterion for model selection. Neural
    Computation, vol.13, no.8, pp.1863-1889, 2001.
  • Performance evaluation of SIC:
  • Sugiyama, M. & Ogawa, H. Theoretical and
    experimental evaluation of the subspace
    information criterion. Machine Learning, vol.48,
    no.1/2/3, pp.25-50, 2002.

55
References (SIC for Particular Learning Methods)
  • SIC for regularization learning:
  • Sugiyama, M. & Ogawa, H. Optimal design of
    regularization term and regularization parameter
    by subspace information criterion. Neural
    Networks, vol.15, no.3, pp.349-361, 2002.
  • SIC for sparse regressors:
  • Tsuda, K., Sugiyama, M. & Müller, K.-R.
    Subspace information criterion for non-quadratic
    regularizers --- Model selection for sparse
    regressors. IEEE Transactions on Neural
    Networks, vol.13, no.1, pp.70-80, 2002.

56
References (Applications of SIC)
  • Applying SIC to image restoration:
  • Sugiyama, M., Imaizumi, D. & Ogawa, H.
    Subspace information criterion for image
    restoration --- Optimizing parameters in linear
    filters. IEICE Transactions on Information and
    Systems, vol.E84-D, no.9, pp.1249-1256, Sep.
    2001.
  • Sugiyama, M. & Ogawa, H. A unified method for
    optimizing linear image restoration filters.
    Signal Processing, vol.82, no.11, pp.1773-1787,
    2002.
  • Applying SIC to precipitation estimation:
  • Moro, S. & Sugiyama, M. Estimation of
    precipitation from meteorological radar data. In
    Proceedings of the 2001 IEICE General Conference,
    SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29,
    2001.

57
References (Extensions of SIC)
  • Extending the range of application of SIC:
  • Sugiyama, M. & Müller, K.-R. The subspace
    information criterion for infinite dimensional
    hypothesis spaces. Journal of Machine Learning
    Research, vol.3 (Nov), pp.323-359, 2002.
  • Further improving SIC:
  • Sugiyama, M. Improving precision of the subspace
    information criterion. IEICE Transactions on
    Fundamentals (to appear).
  • Sugiyama, M., Kawanabe, M. & Müller, K.-R.
    Trading variance reduction with unbiasedness ---
    The regularized subspace information criterion
    for robust model selection (submitted).

58
Conclusions
  • We formulated the regression problem from a
    functional analytic point of view.
  • Within this framework, we gave a generalization
    error estimator called the subspace information
    criterion (SIC).
  • The unbiasedness of SIC is guaranteed even with
    finite samples.
  • We did not take the expectation over the training
    sample points, so SIC may be more data-dependent.

59
Conclusions (cont.)
  • SIC essentially has two modes:
  • For rather restricted linear regression,
  • SIC has several interesting properties:
  • Unlabeled samples can be utilized for estimating
    the prediction error.
  • Any weighted error measure can be used, e.g.,
    interpolation, extrapolation, transductive
    inference.
  • For kernel regression, SIC can always be applied.
    However, the kernel-induced generalization measure
    must be employed.