1
Dimensionality reduction
  • Alexis Boukouvalas
  • Work in collaboration with D. M. Maniyar and D.
    Cornford

2
Goals
  • Develop methods for dimensionality reduction of
    the input and/or output space of models.
  • To gain an understanding, initially use a toy
    dataset to compare existing methods.
  • Later on, apply the methods to real-world models.
  • The goal is to extend the methods to work with a
    high number of variables (of the order of 10^5).

3
Methods
  • Feature Selection
  • Also known as screening in the statistical
    literature.
  • Select the p most relevant of the original k
    variables.
  • The meaning of the variables is preserved, so the
    method's results are interpretable.
  • Projective methods
  • Variables are transformed: X → F(X).
  • Transformations can be linear or non-linear.
  • Interpretation is non-trivial, especially for
    non-linear mappings.

4
Toy data set (1)
  • Generate N base vectors x of dimensionality d by
    sampling a Latin hypercube. Normalize the data.
  • Evaluate the generative model g(.).
  • Corrupt the model output with independent,
    identically distributed Gaussian noise. Initially
    the noise variance is set to 0.1 × the signal
    variance.
  • Screening: augment with extra noise dimensions
    e = Bx + input noise.
  • The noise is always N(0, I). The B matrix is
    described on the next slide.
  • Projection: project to a higher-dimensional space
    using x → WF(x).
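As a rough illustration, a minimal Python sketch of these generation steps might look as follows; the sizes N and d, the placeholder model g(.) and the seeds are assumptions for the sketch, not the values used in the experiments.

    import numpy as np
    from scipy.stats import qmc

    N, d = 200, 3                               # assumed sizes for the sketch
    sampler = qmc.LatinHypercube(d=d, seed=0)
    X = sampler.random(n=N)                     # N base vectors sampled from a Latin hypercube
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # normalise the data

    def g(X):                                   # placeholder generative model g(.)
        return np.sin(X).sum(axis=1)

    signal = g(X)
    noise_var = 0.1 * signal.var()              # noise variance = 0.1 * signal variance
    y = signal + np.random.default_rng(0).normal(scale=np.sqrt(noise_var), size=N)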

5
Toy data set (2)
  • Screening: the B matrix determines the correlation
    between the noise and the model variables.
  • B = 0 constructs noise variables that are
    uncorrelated with the model variables.
  • k randomly selected rows have a single non-zero
    entry, corresponding to the noise variable being
    linearly correlated with a single model variable.
    Currently k = 0.5 × the number of noise variables
    and the coefficient is set to 0.5.
  • Same as above, but two elements of the k rows are
    non-zero, k = 0.8 × the number of noise variables,
    and the coefficients are randomly taken from the
    set {-0.2, -0.5, 0.5, 0.7}.
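A minimal sketch of the first screening case above (one non-zero entry of 0.5 per selected row); the sizes, the seed and the uniform stand-in for the normalised base vectors are assumptions for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, n_noise = 200, 3, 3                   # assumed sizes for the sketch
    X = rng.uniform(size=(N, d))                # stands in for the normalised base vectors

    B = np.zeros((n_noise, d))                  # B = 0: noise uncorrelated with model variables
    k = int(0.5 * n_noise)                      # k = 0.5 * number of noise variables
    rows = rng.choice(n_noise, size=k, replace=False)
    cols = rng.choice(d, size=k)
    B[rows, cols] = 0.5                         # one non-zero entry of 0.5 per selected row

    E = X @ B.T + rng.standard_normal((N, n_noise))   # extra dimensions e = Bx + N(0, I) noise
    X_aug = np.hstack([X, E])                   # model dimensions plus noise dimensions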

6
Toy data set (3)
  • Projection: project into a higher-dimensional
    space of dimension q.
  • x → WF(x)
  • W is a q × d weight matrix and F(·) are basis
    functions responsible for the projection mapping.
    A typical choice of projection mapping is Radial
    Basis Functions (RBFs).
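A minimal sketch of the RBF projection, with the number of centres chosen as d so that W is q × d as stated above; q, the length scale and the random weights are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, q = 200, 3, 10                        # assumed sizes; q is the higher target dimension
    X = rng.uniform(size=(N, d))                # stands in for the normalised base vectors
    centres = X[rng.choice(N, size=d, replace=False)]   # d RBF centres, so W is q x d
    lengthscale = 1.0                           # assumed RBF width

    def F(X):
        # Gaussian RBF activations: F(x)_j = exp(-||x - c_j||^2 / (2 * lengthscale^2))
        sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * lengthscale ** 2))

    W = rng.standard_normal((q, d))             # q x d weight matrix
    X_proj = F(X) @ W.T                         # projected inputs x -> WF(x), shape (N, q)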

7
Toy data set - extensions
  • Different noise models
  • Correlated
  • Multiplicative
  • Non-linear interactions of noise variables with
    model variables
  • Mix screening and projection

8
Feature Selection
  • Variable selection methods have been broadly
    divided into three categories:
  • Variable ranking. Input variables are ranked
    according to the prediction accuracy of each
    input calculated against the model output.
  • Wrapper methods. The emulator is used to assess
    the predictive power of subsets of variables.
  • Embedded methods. For both variable ranking and
    wrapper methods, the emulator is considered a
    perfect black box. In embedded methods, the
    variable selection is done as part of the
    training of the emulator.

9
Wrapper Methods
  • Forward selection, where variables are
    progressively incorporated into larger and larger
    subsets (see the sketch after this list).
  • Backward elimination, which proceeds in the
    opposite direction.
  • Efroymson's algorithm, a.k.a. stepwise selection:
    proceed as in forward selection, but after each
    variable is added, check whether any of the
    selected variables can be deleted without
    significantly affecting the residual sum of
    squares (RSS).
  • Exhaustive search, where all possible subsets are
    considered.
  • Branch and bound: eliminate subset choices as
    early as possible. E.g. with variables A-Z, if the
    RSS of the {A, B} subset is 100, then the C-Z
    branch need not be followed if the RSS using all
    of the C-Z variables is > 100.
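A minimal sketch of greedy forward selection with a linear model (the LinFS flavour used in the experiments later); the fixed subset size as a stopping rule is an assumption, and the linear-regression RSS stands in for whatever score an emulator-based variant would use.

    import numpy as np

    def forward_select(X, y, n_select):
        """Greedily add the variable that most reduces the linear-regression RSS."""
        selected, remaining = [], list(range(X.shape[1]))

        def rss(cols):
            A = np.column_stack([X[:, cols], np.ones(len(y))])   # design matrix with intercept
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            return float(((y - A @ coef) ** 2).sum())

        for _ in range(n_select):
            best = min(remaining, key=lambda j: rss(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. forward_select(X, y, 3) returns the indices of the 3 selected variables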

10
Embedded methods
  • An embedded method commonly employed in the
    context of Gaussian Processes is Automatic
    Relevance Determination (ARD), where the
    characteristic length scales l determine the
    relevance of each input.
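A minimal sketch of ARD using scikit-learn's GP (the library choice and the toy data are assumptions; the talk does not state the implementation used): fit an anisotropic RBF kernel and read input relevance off the learned length scales, short length scale meaning relevant input.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 6))              # e.g. 3 relevant + 3 noise inputs
    y = np.sin(X[:, :3]).sum(axis=1) + 0.1 * rng.standard_normal(200)

    kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

    lengthscales = gp.kernel_.k1.length_scale   # one length scale per input dimension
    ranking = np.argsort(lengthscales)          # most relevant (shortest) first
    top3 = ranking[:3]                          # e.g. keep the top 3, as in the experiments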

11
Preliminary experiments
  • The following algorithms were used in the
    experiments:
  • BaseRelevant: baseline run using the relevant
    dimensions only. The RMSE was obtained by
    training a GP on the relevant dimensions. This
    value can be interpreted as the optimal RMSE
    value.
  • BaseAll: baseline run using all the dimensions,
    i.e. relevant + extra. Again the RMSE was
    obtained by training a GP on this set. The
    difference BaseAll - BaseRelevant is a measure of
    the effect of the extra variables on the
    predictive accuracy of the GP.
  • CorrCoef: Pearson correlation coefficient. A
    variable ranking is performed using formula (10)
    and the top 3 variables are selected and used to
    train a GP (a sketch follows after this list).
  • LinFS: employ a forward-selection subset
    selection strategy using a multivariate linear
    regression model. The RMSE is obtained by
    evaluating the selected subset with a multiple
    linear regression model.
  • GPFS: again employ forward selection to generate
    subsets, but use a GP rather than a linear model.
  • ARD: employ the ARD method to rank the input
    variables and select the top 3 to train a GP
    model.
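A minimal sketch of the CorrCoef ranking; formula (10) itself is not reproduced in the transcript, so a plain absolute Pearson correlation coefficient is used as a stand-in, and the toy data shape is an assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 6))              # assumed toy inputs: 3 relevant + 3 noise
    y = np.sin(X[:, :3]).sum(axis=1) + 0.1 * rng.standard_normal(200)

    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top3 = np.argsort(scores)[::-1][:3]         # highest absolute correlation first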

12
Experiment 1: No correlation
  • 200 observations, 3 model dimensions, 6 total

Algorithm Variables Selected RMSE Elapsed time
BaseRelevant 1,2,3 0.9128 1.44142
BaseAll 1,2,3,4,5,6 1.0473 1.60529
CorrCoef 1,4,2(,3,5,6) 2.1642 1.50487
LinFS 1,4,2 2.7803 0.134283
GPFS 1,2,3 0.9092 18.2017
ARD 1,2,3 0.9134 5.56684
13
Experiment 2: Two-variable correlation
  • 200 observations, 3 model dimensions, 6 total

Algorithm Variables Selected RMSE Elapsed time
BaseRelevant 1,2,3 0.9111 1.42363
BaseAll 1,2,3,4,5,6 1.0633 1.66093
CorrCoef 1,4,5(,2,6,3) 2.6794 1.31676
LinFS 1,4,6 2.8083 0.143308
GPFS 1,2,3 0.9274 19.0051
ARD 1,2,3 1.0076 5.0611
14
Experiment 3: ARD
  • Initial results for high-dimensional input with
    two-variable correlation: 100 model inputs, 500
    noise dimensions, 500 observations.

Length Input Number
31.8373 361
18.7081 501
14.2097 296
12.7581 51
12.3160 456
11.8689 496
11.3176 166
10.2424 310
10.2220 420
9.6192 325
9.0732 363
8.6898 53
8.5453 347
7.9338 419
7.8201 294
7.8017 188
7.4327 103
7.3760 13
7.1526 572
7.0997 478
6.9481 393
6.6417 187
15
Summary of Experiments
  • The best-performing methods are GPFS and ARD,
    which usually find the optimal subset. However,
    the GPFS method is on average more than three
    times slower than ARD.
  • The CorrCoef and LinFS methods are
    computationally inexpensive but provide
    unsatisfactory results.
  • Even for simple mapping functions (sin(x)), ARD
    breaks down on underdetermined systems where the
    number of observations < the number of dimensions.

16
Research Directions
  • Batch hierarchical screening
  • Explore the potential of partitioning the input
    space into groups of inputs, applying screening
    methods on the groups and combining the important
    inputs
  • Some work has already been done for linear models
    (Gabriel and Pan, 1979).
  • Group the variables such that if two variables
    Xi and Xj are in different groups, then their
    regression sums of squares (RSS) are additive,
    i.e. if Si is the reduction in RSS from including
    Xi and Sj the reduction from including Xj, then
    including both Xi and Xj gives Si,j = Si + Sj.

17
Research directions (2)
  • Coupled Emulation
  • Separate emulators for different outputs, linked
    with some model for the covariance.
  • Connections to sequential methods to handle large
    datasets. Linked to Sequential Sparse GPs?
  • Projective methods in conjunction with feature
    selection.

18
Projective methods
From van der Maaten et al. (2007).
19
But
  • Van der Maaten et al. (2007) compared the
    non-linear methods to linear ones and found the
    non-linear methods no better. The reasons they
    propose relate to the curse of dimensionality,
    overfitting of local models, and others.

20
References
  • L.J.P. van der Maaten, E.O. Postma, and H.J. van
    den Herik. Dimensionality Reduction: A Comparative
    Review. 2007.
  • André Elisseeff and Isabelle Guyon. An
    Introduction to Variable and Feature Selection.
    Journal of Machine Learning Research, 3:1157-1182,
    2003.