1
Functional Analytic Approach to Model Selection
  • Masashi Sugiyama (sugi@cs.titech.ac.jp)
  • Department of Computer Science,
  • Tokyo Institute of Technology, Tokyo, Japan

2
Regression Problem
Learning target function: $f(x)$
Learned function: $\hat{f}(x)$
Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = f(x_i) + \epsilon_i$
($\epsilon_i$: noise)
From $\{(x_i, y_i)\}_{i=1}^{n}$, obtain $\hat{f}$ such
that it is as close to $f$ as possible.
3
Typical Method of Learning
  • Linear regression model: $\hat{f}(x) = \sum_{j=1}^{p} \alpha_j \varphi_j(x)$
  • ($\{\alpha_j\}$: parameters to be learned, $\{\varphi_j\}$: basis functions)
  • Ridge regression: $\min_{\alpha} \bigl[ \sum_{i=1}^{n} (\hat{f}(x_i) - y_i)^2 + \lambda \|\alpha\|^2 \bigr]$
  • ($\lambda$: ridge parameter; a minimal code sketch follows)
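As a concrete illustration, here is a minimal sketch of ridge regression with a fixed basis; the Gaussian basis, the toy data, and all parameter values are assumptions made for this example rather than the deck's actual setup.

```python
import numpy as np

def design_matrix(x, centers, width=1.0):
    """Phi[i, j] = phi_j(x_i) for Gaussian basis functions (an assumed choice)."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi a - y||^2 + lam * ||a||^2 over the parameter vector a."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

# Toy data (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=50)
y = np.sinc(x) + 0.1 * rng.standard_normal(50)

centers = np.linspace(-3.0, 3.0, 10)   # basis function centers
Phi = design_matrix(x, centers)
alpha = ridge_fit(Phi, y, lam=0.1)     # lam: the ridge parameter
```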
4
Model Selection
[Figure: target function vs. learned function for models that are too simple, appropriate, and too complex]
The choice of the model heavily affects the
learned function.
("Model" refers to, e.g., the ridge parameter $\lambda$.)
5
Ideal Model Selection
Determine the model such that a certain
generalization error $G = \|\hat{f} - f\|^2$ is minimized.
($G$: the "badness" of $\hat{f}$)
6
Practical Model Selection
However, the generalization error cannot
be directly calculated since it includes the unknown
learning target function $f$.
(This is not true for Bayesian model selection using
the evidence.)
Determine the model such that an estimator $\hat{G}$
of the generalization error is minimized.
We want an accurate estimator $\hat{G}$.
7
Two Approaches to Estimating Generalization Error
(1)
Try to obtain unbiased estimators: $E[\hat{G}] = G$
  • $C_P$ (Mallows, 1973)
  • Cross-validation (a minimal sketch follows below)
  • Akaike information criterion
  • (Akaike, 1974) etc.

Interested in typical-case performance
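As an example of this family, leave-one-out cross-validation has a closed form for linear smoothers such as ridge regression via the diagonal of the hat matrix. This sketch reuses the hypothetical design_matrix / ridge_fit helpers (and Phi, y) from the earlier example.

```python
import numpy as np

def loo_error(Phi, y, lam):
    """Closed-form leave-one-out squared error for ridge regression."""
    p = Phi.shape[1]
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # hat matrix
    residuals = y - H @ y
    return np.mean((residuals / (1.0 - np.diag(H))) ** 2)

# Choose the ridge parameter that minimizes the LOO estimate.
lams = np.logspace(-3, 2, 30)
best_lam = min(lams, key=lambda lam: loo_error(Phi, y, lam))
```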
8
Two Approaches to Estimating Generalization Error
(2)
Try to obtain probabilistic upper bounds
  • VC bound (Vapnik & Chervonenkis, 1974)
  • Span bound (Chapelle & Vapnik, 2000)
  • Concentration bound
  • (Bousquet & Elisseeff, 2001) etc.

with probability $1 - \delta$
Interested in worst-case performance
9
Popular Choices ofGeneralization Measure
  • Risk: $\int \mathrm{loss}(\hat{f}(x), f(x))\, p(x)\, dx$
  • e.g., the squared loss $(\hat{f}(x) - f(x))^2$
  • Kullback-Leibler divergence:
    $\mathrm{KL}(q \,\|\, \hat{q}) = \int q(x) \log \frac{q(x)}{\hat{q}(x)}\, dx$

($q$: target density, $\hat{q}$: learned density)
10
Concerns in Existing Methods
  • The approximations used often require a large
    (infinite) number of training examples for their
    justification (asymptotic approximation),
  • so they do not work with small samples.
  • The generalization measure is integrated over
    the input distribution $p(x)$ from which training examples
    are drawn,
  • so they cannot be used for transduction
  • (estimating the error at a single point of interest).

11
Our Interests
  • We are interested in
  • Estimating the generalization error with accuracy
    guaranteed for small (finite) samples
  • Estimating the transduction error
  • (the error at a point of interest)
  • Investigating the role of unlabeled samples
    (input samples $x$ without output values $y$)

12
Our Generalization Measure
  • Functional Hilbert space $H$
  • We assume $f, \hat{f} \in H$ and measure the error by
    $G = \|\hat{f} - f\|^2$

($\|\cdot\|$: norm in the function space $H$)
13
Generalization Measure in Functional Hilbert Space
  • A functional Hilbert space is specified by
  • Set of functions which span the space,
  • Inner product (and norm).
  • Given a set of functions, we can design the inner
    product (and therefore the generalization
    measure) as desired.

14
Examples of the Norm
  • Weighted distance in the input domain:
    $\|f\|^2 = \int w(x)\, |f(x)|^2\, dx$   ($w$: weight function)
  • Weighted distance in the Fourier domain:
    $\|f\|^2 = \int \tilde{w}(\omega)\, |F(\omega)|^2\, d\omega$   ($F$: Fourier transform of $f$, $\tilde{w}$: weight function)
  • Sobolev norm:
    $\|f\|^2 = \sum_{k=0}^{s} \int |f^{(k)}(x)|^2\, dx$   ($f^{(k)}$: $k$-th derivative of $f$)
15
Interesting Features
With the weighted distance and weight function $w(x)$:
  • When $w(x)$ equals the density $p(x)$ of test input
    points, we can use unlabeled samples
    for estimating the generalization error
  • For transductive inference (given a test point
    $x_0$), concentrate $w$ at $x_0$
  • For interpolation and extrapolation, let $w$ cover the desired input region

16
Goal of My Talk
  • I suppose that you like the generalization
    measure defined in the functional Hilbert space:
    $G = \|\hat{f} - f\|^2$ ($\|\cdot\|$: norm in the function space).
  • The goal of my talk is to give a method for
    estimating this generalization error.
17
Function Spaces for Learning
  • For further discussion, we have to specify the
    class of function spaces.
  • We want the class to be as unrestrictive as possible.
  • A general function space such as $L_2$ is not
    suitable for learning problems because the value of
    a function at a point is not specified in $L_2$.

Two functions that differ only at a single point $x_0$
have different values at $x_0$,
but they are treated as the same function in $L_2$.
18
Reproducing Kernel Hilbert Space
  • A function space that is rather general, and in
    which the value of a function at a point is specified, is
    the reproducing kernel Hilbert space (RKHS).
  • An RKHS $H$ has the reproducing kernel $K(x, x')$:
  • For any fixed $x'$,
  • $K(\cdot, x')$ is a function of the first argument in $H$
  • For any function $f$ in $H$ and any $x'$,
    $\langle f, K(\cdot, x') \rangle = f(x')$

($\langle \cdot, \cdot \rangle$: inner product in the function space;
a small numerical check follows)
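A minimal numerical check of the reproducing property in a finite-dimensional RKHS: take the space spanned by a trigonometric basis, with $\langle f, g \rangle$ defined as the Euclidean inner product of coefficient vectors; the basis and coefficients below are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Feature map for span{1, cos x, sin x, cos 2x, sin 2x} (assumed basis)."""
    return np.array([1.0, np.cos(x), np.sin(x), np.cos(2 * x), np.sin(2 * x)])

def kernel(x, xp):
    """Reproducing kernel K(x, x') = <phi(x), phi(x')> of this space."""
    return phi(x) @ phi(xp)

a = np.array([0.5, -1.0, 2.0, 0.3, 0.0])   # coefficients of some f in the space
def f(x):
    return a @ phi(x)

x0 = 0.7
k_x0 = phi(x0)            # coefficient vector of the function K(., x0)
inner = a @ k_x0          # <f, K(., x0)> in this space
assert np.isclose(inner, f(x0))   # reproducing property: equals point evaluation
```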
19
Formulation of Learning Problem
  • Specified RKHS $H$
  • Fixed training input points $\{x_i\}_{i=1}^{n}$
  • Noise $\epsilon_i$: mean $0$, variance $\sigma^2$
  • Linear estimation: $\hat{f} = X y$
  • ($X$: linear learning operator),
  • e.g., ridge regression for the linear model
    $\hat{f}(x) = \sum_j \alpha_j \varphi_j(x)$
    with basis functions $\{\varphi_j\}$ in $H$
20
Sampling Operator
  • For any RKHS $H$, there exists a linear operator $A$
    from $H$ to $\mathbb{R}^n$ such that
    $Af = (f(x_1), \dots, f(x_n))^{\top}$
  • Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$

($\otimes$: Neumann-Schatten product; for vectors,
$(u \otimes v)\, f = \langle f, v \rangle\, u$;
$e_i$: $i$-th standard basis vector in $\mathbb{R}^n$; written out below)
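In symbols, the reproducing property makes the point evaluations linear (a reconstruction of the slide's formula under the notation above):

$$
A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i),
\qquad
(Af)_i = \langle f, K(\cdot, x_i) \rangle = f(x_i),
\qquad\text{i.e.}\qquad
Af = \bigl(f(x_1), \dots, f(x_n)\bigr)^{\top}.
$$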
21
Formulation
[Diagram: the learning target function $f$ lives in the RKHS $H$; the sampling operator $A$ (always linear) maps it to the sample value space $\mathbb{R}^n$, where the noise $\epsilon$ is added to give $y$; the learning operator $X$ (assumed linear) maps $y$ back to $H$ as the learned function $\hat{f}$; the generalization error is measured in $H$.]
22
Expected Generalization Error
  • We are interested in typical performance, so we
    estimate the expected generalization error over
    the noise: $E_{\epsilon} \|\hat{f} - f\|^2$
    ($E_{\epsilon}$: expectation over the noise)
  • We do not take the expectation over the input points
  • Data-dependent!
  • We do not assume that the training input points
    follow the test input distribution
  • Advantageous in active learning!
23
Bias / Variance Decomposition
For $\hat{f} = Xy$ with $y = Af + \epsilon$, the error decomposes as

$E_{\epsilon}\,\|\hat{f} - f\|^2 = \underbrace{\|XAf - f\|^2}_{\text{bias}} + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}}$

($E_{\epsilon}$: expectation over the noise, $\sigma^2$: noise variance,
$X^{*}$: adjoint of $X$; a short derivation follows).
The variance term is computable, so we want to estimate the bias!
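Spelled out for $\hat{f} = Xy$, $y = Af + \epsilon$, $E_{\epsilon}\epsilon = 0$, $E_{\epsilon}\epsilon\epsilon^{\top} = \sigma^2 I$ (a derivation consistent with the labels above):

$$
E_{\epsilon}\,\|\hat{f} - f\|^2
 = \|E_{\epsilon}\hat{f} - f\|^2 + E_{\epsilon}\,\|\hat{f} - E_{\epsilon}\hat{f}\|^2
 = \underbrace{\|XAf - f\|^2}_{\text{bias}}
 + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}},
$$

since $E_{\epsilon}\hat{f} = XAf$, $\hat{f} - E_{\epsilon}\hat{f} = X\epsilon$, and $E_{\epsilon}\|X\epsilon\|^2 = \sigma^2\,\mathrm{tr}(X^{*}X)$.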
24
Tricks for Estimating Bias
Sugiyama & Ogawa (Neural Computation, 2001)
  • Suppose we have a linear operator $X_u$
  • that gives an unbiased estimate $\hat{f}_u = X_u y$ of $f$:
    $E_{\epsilon} \hat{f}_u = f$

($E_{\epsilon}$: expectation over the noise)
We use $\hat{f}_u$ for estimating the bias of $\hat{f}$.
25
Unbiased Estimator of Bias
$\|\hat{f} - \hat{f}_u\|^2$ is a rough estimate of the bias $\|E_{\epsilon}\hat{f} - f\|^2$;
subtracting its own noise contribution,
$\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X - X_u)(X - X_u)^{*}\bigr)$,
gives an unbiased estimator of the bias.
26
Subspace Information Criterion (SIC)
$\mathrm{SIC} = \underbrace{\|\hat{f} - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}\bigl((X - X_u)(X - X_u)^{*}\bigr)}_{\text{estimate of bias}} + \underbrace{\sigma^2\,\mathrm{tr}(X^{*}X)}_{\text{variance}}$

SIC is an unbiased estimator of the expected
generalization error with finite samples:
$E_{\epsilon}\,\mathrm{SIC} = E_{\epsilon}\|\hat{f} - f\|^2$
(a code sketch follows)
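A minimal sketch of SIC for finite-dimensional ridge regression, assuming the generalization error is the Euclidean distance between coefficient vectors, $X_u$ is the generalized-inverse (BLUE) estimator, and the noise variance sigma2 is known; the published SIC covers general RKHS norms, so this is an illustration rather than the full method. Phi and y are the hypothetical objects from the earlier examples.

```python
import numpy as np

def sic(Phi, y, lam, sigma2):
    """Subspace information criterion for ridge regression (sketch).

    Assumes the error ||alpha_hat - alpha||^2 in coefficient space and
    requires Phi to have full column rank so that the BLUE exists.
    """
    n, p = Phi.shape
    X = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)  # ridge learning matrix
    Xu = np.linalg.pinv(Phi)                                   # unbiased (BLUE) estimator
    d = X @ y - Xu @ y                                         # alpha_hat - alpha_hat_u
    D = X - Xu
    bias_estimate = d @ d - sigma2 * np.trace(D @ D.T)         # unbiased bias estimate
    variance = sigma2 * np.trace(X @ X.T)
    return bias_estimate + variance

# Model selection: pick the ridge parameter minimizing SIC.
lams = np.logspace(-3, 2, 30)
best_lam = min(lams, key=lambda lam: sic(Phi, y, lam, sigma2=0.01))
```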
27
Obtaining Unbiased Estimate
We need $X_u$ that gives an unbiased estimate of the
learning target $f$.
$X_u$ exists if and only if $\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$
span the entire space $H$. When this is
satisfied, $X_u$ is given by $A^{\dagger}$
($A^{\dagger}$: generalized inverse of the sampling operator $A$).
We can enjoy all the features! (Unlabeled
samples, transductive inference, etc.)
28
Example of Using SIC: Standard Linear Regression
  • Learning target function: $f(x) = \sum_{j=1}^{p} \theta_j \varphi_j(x)$,
  • where $\{\theta_j\}$ are unknown
  • Regression model: $\hat{f}(x) = \sum_{j=1}^{p} \hat{\theta}_j \varphi_j(x)$,
  • where $\{\hat{\theta}_j\}$ are estimated linearly from $y$
  • (e.g., ridge regression)

29
Example (cont.)
  • Generalization measure: the weighted norm
    $\|\hat{f} - f\|^2 = \int w(x)\,(\hat{f}(x) - f(x))^2\, dx$
    ($w$: weight function)
  • If the $n \times p$ design matrix ($p$: number of basis functions) has
    rank $p$, then the best linear unbiased
    estimator (BLUE) always exists
  • In this case, SIC provides an unbiased estimate
    of the above generalization error
30
Applicability of SIC
  • However, the design matrix
    has rank $p$ only if $p \le n$
    ($p$: number of basis functions, $n$: number of training examples)
  • Therefore, the target function must be
    included in a rather small model

The range of application of SIC is rather limited.
31
When Unbiased Estimate Does Not Exist
Sugiyama & Müller (JMLR, 2002)
  • $X_u$ exists if and only if $\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$
    span the whole space $H$.
  • When this condition is not fulfilled, let us
    restrict ourselves to finding a learning result
    function from a subspace $S$, not from the
    entire RKHS $H$.
32
Essential Generalization Error
For $\hat{f} \in S$, the error splits into an essential part and an
irrelevant constant (see the decomposition below);
$f$ is just replaced by its orthogonal projection $Pf$ onto $S$.
Essentially, we are estimating the projection $Pf$.
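Since $\hat{f} \in S$ and $Pf - f$ is orthogonal to $S$, the Pythagorean theorem gives:

$$
\|\hat{f} - f\|^2
 = \underbrace{\|\hat{f} - Pf\|^2}_{\text{essential}}
 + \underbrace{\|Pf - f\|^2}_{\text{irrelevant (independent of the model)}}.
$$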
33
Unbiased Estimate of Projection
  • If a linear operator $X_u$ that gives an unbiased
    estimate of the projection $Pf$ of the learning target
    ($E_{\epsilon} X_u y = Pf$) is available, then SIC is an unbiased
    estimator of the essential generalization error.

Such $X_u$ exists if and only if the subspace $S$
is included in the span of
$\{K(\cdot, x_1), \dots, K(\cdot, x_n)\}$,
e.g., for the kernel regression model
$\hat{f}(x) = \sum_{i=1}^{n} \hat{\theta}_i K(x, x_i)$.
34
Restriction
  • However, another restriction arises:
  • if the generalization measure is designed as
    desired, we have to use the kernel function
    induced by the generalization measure.

35
Restriction (cont.)
  • On the other hand,
  • if a desired kernel function is used, then we
    have to use the generalization measure induced by
    the kernel,
  • e.g., the generalization measure in the Gaussian
    RKHS heavily penalizes high-frequency
    components.

36
Summary of Usage of SIC
  • SIC essentially has two modes.
  • For rather restricted linear regression, SIC has
    several interesting properties:
  • Unlabeled samples can be utilized for estimating
    the prediction error (expected test error).
  • Any weighted error measure can be used,
  • e.g., interpolation, extrapolation, transductive
    inference.
  • For kernel regression, SIC can always be applied.
    However, the kernel-induced generalization measure
    must be employed.

37
Simulation (1) Setting
  • Trigonometric polynomial RKHS:
  • Span $\{1, \cos x, \sin x, \dots, \cos Nx, \sin Nx\}$
  • Generalization measure: a weighted norm as in the examples above
  • Learning target function:
  • a sinc-like function in $H$

38
Simulation (1) Setting (cont.)
  • Training examples: $y_i = f(x_i) + \epsilon_i$
  • Ridge regression is used for learning
    (a code sketch of this setup follows after this list)
  • Number of training examples:
  • Noise variance:
  • Number of trials:
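Putting the pieces together, here is a sketch of how one run of such a simulation could be wired up with the hypothetical helpers defined earlier (ridge_fit and sic); the basis order, sample size, and noise level below are illustrative assumptions, not the values used in the deck.

```python
import numpy as np

def trig_basis(x, N=5):
    """Design matrix for span{1, cos kx, sin kx : k = 1..N} (assumed order N)."""
    cols = [np.ones_like(x)]
    for k in range(1, N + 1):
        cols.extend([np.cos(k * x), np.sin(k * x)])
    return np.stack(cols, axis=1)

rng = np.random.default_rng(1)
n, sigma2 = 100, 0.04                      # assumed sample size and noise variance
x = rng.uniform(-np.pi, np.pi, size=n)
y = np.sinc(x) + np.sqrt(sigma2) * rng.standard_normal(n)   # sinc-like target

Phi = trig_basis(x)
lams = np.logspace(-4, 2, 50)
best_lam = min(lams, key=lambda lam: sic(Phi, y, lam, sigma2))  # sic() from above
alpha = ridge_fit(Phi, y, best_lam)        # final learned coefficients
```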

39
Simulation (1-a): Using Unlabeled Samples
  • We estimate the prediction error using 1000
    unlabeled samples

40
Results: Unlabeled Samples
[Figure: estimated and true prediction errors plotted against the ridge parameter; values can be negative since some constants are ignored.]
41
Results: Unlabeled Samples (cont.)
[Figure: model selection results plotted against the ridge parameter.]
42
Simulation (1-b): Transduction
  • We estimate the test error
  • at a single test point

43
Results: Transduction
[Figure: estimated and true test errors at the test point, plotted against the ridge parameter.]
44
Results: Transduction (cont.)
[Figure: transduction results plotted against the ridge parameter.]
45
Simulation (2) Infinite Dimensional RKHS
  • Gaussian RKHS
  • Learning target function: the sinc function
  • Training examples: $y_i = f(x_i) + \epsilon_i$
  • We estimate the generalization error in this RKHS

46
Results: Gaussian RKHS
[Figure: estimated and true errors plotted against the ridge parameter.]
47
Results: Gaussian RKHS (cont.)
[Figure: model selection results plotted against the ridge parameter.]
48
Simulation (3) DELVE Data Sets
  • Gaussian RKHS
  • We choose the ridge parameter by
  • SIC
  • Leave-one-out cross-validation
  • An empirical Bayesian method
  • (marginal likelihood maximization; Akaike, 1980)
  • Performance is compared by the test error
49
Normalized Test Errors
Red: best or comparable (95% t-test)
[Table: normalized test errors on the DELVE data sets.]
50
Image Restoration
Sugiyama et al. (IEICE Trans., 2001); Sugiyama &
Ogawa (Signal Processing, 2002)
Restoration filter with a parameter to be tuned,
e.g., a Gaussian filter or a regularization filter.
[Figure: a degraded image restored with parameter values that are too large, too small, and appropriate.]
We would like to determine the parameter values
appropriately.
51
Formulation
[Diagram: the original image in a Hilbert space is mapped by a degradation operator, with noise added, to the observed image in a Hilbert space; a restoration filter maps the observed image to the restored image.]
52
Results with Regularization Filter
[Figure: original images, degraded images, and images restored using SIC.]
53
Precipitation Estimation
Moro & Sugiyama (IEICE General Conf., 2001)
  • Estimating future precipitation from past
    precipitation and weather radar data.
  • Our method with SIC won the 1st prize in
    estimation accuracy in the IEICE Precipitation
    Estimation Contest 2001:

1st: TokyoTech, MSE 0.71
2nd: KyuTech, MSE 0.75
3rd: Chiba Univ, MSE 0.93
4th: MSE 1.18
(Precipitation and weather radar data
from the IEICE Precipitation Estimation Contest 2001)
54
References (Fundamentals of SIC)
  • Proposing the concept of SIC:
  • Sugiyama, M. & Ogawa, H. Subspace information
    criterion for model selection. Neural
    Computation, vol.13, no.8, pp.1863-1889, 2001.
  • Performance evaluation of SIC:
  • Sugiyama, M. & Ogawa, H. Theoretical and
    experimental evaluation of the subspace
    information criterion. Machine Learning, vol.48,
    no.1/2/3, pp.25-50, 2002.

55
References (SIC for Particular Learning Methods)
  • SIC for regularization learning:
  • Sugiyama, M. & Ogawa, H. Optimal design of
    regularization term and regularization parameter
    by subspace information criterion. Neural
    Networks, vol.15, no.3, pp.349-361, 2002.
  • SIC for sparse regressors:
  • Tsuda, K., Sugiyama, M. & Müller, K.-R.
    Subspace information criterion for non-quadratic
    regularizers --- Model selection for sparse
    regressors. IEEE Transactions on Neural
    Networks, vol.13, no.1, pp.70-80, 2002.

56
References (Applications of SIC)
  • Applying SIC to image restoration:
  • Sugiyama, M., Imaizumi, D. & Ogawa, H.
    Subspace information criterion for image
    restoration --- Optimizing parameters in linear
    filters. IEICE Transactions on Information and
    Systems, vol.E84-D, no.9, pp.1249-1256, Sep.
    2001.
  • Sugiyama, M. & Ogawa, H. A unified method for
    optimizing linear image restoration filters.
    Signal Processing, vol.82, no.11, pp.1773-1787,
    2002.
  • Applying SIC to precipitation estimation:
  • Moro, S. & Sugiyama, M. Estimation of
    precipitation from meteorological radar data. In
    Proceedings of the 2001 IEICE General Conference,
    SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29,
    2001.

57
References (Extensions of SIC)
  • Extending the range of application of SIC:
  • Sugiyama, M. & Müller, K.-R. The subspace
    information criterion for infinite dimensional
    hypothesis spaces. Journal of Machine Learning
    Research, vol.3 (Nov), pp.323-359, 2002.
  • Further improving SIC:
  • Sugiyama, M. Improving precision of the subspace
    information criterion. IEICE Transactions on
    Fundamentals (to appear).
  • Sugiyama, M., Kawanabe, M. & Müller, K.-R.
    Trading variance reduction with unbiasedness ---
    The regularized subspace information criterion
    for robust model selection (submitted).

58
Conclusions
  • We formulated the regression problem from a
    functional analytic point of view.
  • Within this framework, we gave a generalization
    error estimator called the subspace information
    criterion (SIC).
  • The unbiasedness of SIC is guaranteed even with
    finite samples.
  • We did not take the expectation over the training
    sample points, so SIC may be more data-dependent.

59
Conclusions (cont.)
  • SIC essentially has two modes:
  • For rather restricted linear regression,
  • SIC has several interesting properties:
  • Unlabeled samples can be utilized for estimating
    the prediction error.
  • Any weighted error measure can be used, e.g.,
    interpolation, extrapolation, transductive
    inference.
  • For kernel regression, SIC can always be applied.
    However, the kernel-induced generalization measure
    must be employed.