1
Dimensionality reduction
  • Alexis Boukouvalas
  • Work in collaboration with D. M. Maniyar and D.
    Cornford

2
Goals
  • Develop methods for dimensionality reduction of
    the input and/or output space of models.
  • To gain an understanding, initially use a toy
    dataset to compare existing methods.
  • Later on, apply the methods to real-world models.
  • The goal is to extend the methods to work with a
    high number of variables (of the order of 10^5).

3
Methods
  • Feature Selection
  • Also known as screening in the statistical
    literature.
  • Select the p most relevant of the original k
    variables.
  • The meaning of the variables is preserved, so the
    method's results are interpretable.
  • Projective methods
  • Variables are transformed: X → F(X).
  • Transformations can be linear or non-linear.
  • Interpretation is non-trivial, especially for
    non-linear mappings.

4
Toy data set (1)
  • Generate N base vectors x of dimensionality d by
    sampling a Latin hypercube. Normalize the data.
  • Evaluate the generative model g(.).
  • Corrupt the model output with independent,
    identically distributed Gaussian noise. Initially
    the noise variance is set to 0.1 × the signal
    variance.
  • Screening: augment with extra noise dimensions
    e = Bx + input noise.
  • The noise is always N(0, I). The B matrix is
    described on the next slide.
  • Projection: project to a higher-dimensional space
    using x → WF(x).
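As a rough illustration, a minimal Python sketch of these generation steps might look as follows; the sizes N and d, the placeholder model g(.) and the seeds are assumptions for the sketch, not the values used in the experiments.

    import numpy as np
    from scipy.stats import qmc

    N, d = 200, 3                               # assumed sizes for the sketch
    sampler = qmc.LatinHypercube(d=d, seed=0)
    X = sampler.random(n=N)                     # N base vectors sampled from a Latin hypercube
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # normalise the data

    def g(X):                                   # placeholder generative model g(.)
        return np.sin(X).sum(axis=1)

    signal = g(X)
    noise_var = 0.1 * signal.var()              # noise variance = 0.1 * signal variance
    y = signal + np.random.default_rng(0).normal(scale=np.sqrt(noise_var), size=N)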

5
Toy data set (2)
  • Screening: the B matrix determines the correlation
    between the noise and the model variables.
  • B = 0 constructs noise variables that are
    uncorrelated with the model variables.
  • k randomly selected rows have a single non-zero
    entry, corresponding to the noise variable being
    linearly correlated with a single model variable.
    Currently k = 0.5 × the number of noise variables
    and the coefficient is set to 0.5.
  • Same as above, but two elements of the k rows are
    non-zero, k = 0.8 × the number of noise variables,
    and the coefficients are randomly taken from the
    set {-0.2, -0.5, 0.5, 0.7}.
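A minimal sketch of the first screening case above (one non-zero entry of 0.5 per selected row); the sizes, the seed and the uniform stand-in for the normalised base vectors are assumptions for the sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, n_noise = 200, 3, 3                   # assumed sizes for the sketch
    X = rng.uniform(size=(N, d))                # stands in for the normalised base vectors

    B = np.zeros((n_noise, d))                  # B = 0: noise uncorrelated with model variables
    k = int(0.5 * n_noise)                      # k = 0.5 * number of noise variables
    rows = rng.choice(n_noise, size=k, replace=False)
    cols = rng.choice(d, size=k)
    B[rows, cols] = 0.5                         # one non-zero entry of 0.5 per selected row

    E = X @ B.T + rng.standard_normal((N, n_noise))   # extra dimensions e = Bx + N(0, I) noise
    X_aug = np.hstack([X, E])                   # model dimensions plus noise dimensions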

6
Toy data set (3)
  • Projection: project into a higher-dimensional
    space of dimension q.
  • x → WF(x)
  • W is a q × d weight matrix and F(·) are basis
    functions responsible for the projection mapping.
    A typical choice of projection mapping is Radial
    Basis Functions (RBFs).
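A minimal sketch of the RBF projection, with the number of centres chosen as d so that W is q × d as stated above; q, the length scale and the random weights are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, q = 200, 3, 10                        # assumed sizes; q is the higher target dimension
    X = rng.uniform(size=(N, d))                # stands in for the normalised base vectors
    centres = X[rng.choice(N, size=d, replace=False)]   # d RBF centres, so W is q x d
    lengthscale = 1.0                           # assumed RBF width

    def F(X):
        # Gaussian RBF activations: F(x)_j = exp(-||x - c_j||^2 / (2 * lengthscale^2))
        sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * lengthscale ** 2))

    W = rng.standard_normal((q, d))             # q x d weight matrix
    X_proj = F(X) @ W.T                         # projected inputs x -> WF(x), shape (N, q)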

7
Toy data set - extensions
  • Different noise models
  • Correlated
  • Multiplicative
  • Non-linear interactions of noise variables with
    model variables
  • Mix screening and projection

8
Feature Selection
  • Variable selection methods have been broadly
    divided into three categories:
  • Variable ranking. Input variables are ranked
    according to the prediction accuracy of each
    input calculated against the model output.
  • Wrapper methods. The emulator is used to assess
    the predictive power of subsets of variables.
  • Embedded methods. For both variable ranking and
    wrapper methods, the emulator is considered a
    perfect black box. In embedded methods, the
    variable selection is done as part of the
    training of the emulator.

9
Wrapper Methods
  • Forward selection, where variables are
    progressively incorporated into larger and larger
    subsets (see the sketch after this list).
  • Backward elimination, which proceeds in the
    opposite direction.
  • Efroymson's algorithm, a.k.a. stepwise selection:
    proceed as in forward selection, but after each
    variable is added, check whether any of the
    selected variables can be deleted without
    significantly affecting the residual sum of
    squares (RSS).
  • Exhaustive search, where all possible subsets are
    considered.
  • Branch and bound: eliminate subset choices as
    early as possible. E.g. with variables A-Z, if the
    RSS of the {A, B} subset is 100, then the C-Z
    branch need not be followed if the RSS using all
    of the C-Z variables is > 100.
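A minimal sketch of greedy forward selection with a linear model (the LinFS flavour used in the experiments later); the fixed subset size as a stopping rule is an assumption, and the linear-regression RSS stands in for whatever score an emulator-based variant would use.

    import numpy as np

    def forward_select(X, y, n_select):
        """Greedily add the variable that most reduces the linear-regression RSS."""
        selected, remaining = [], list(range(X.shape[1]))

        def rss(cols):
            A = np.column_stack([X[:, cols], np.ones(len(y))])   # design matrix with intercept
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            return float(((y - A @ coef) ** 2).sum())

        for _ in range(n_select):
            best = min(remaining, key=lambda j: rss(selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. forward_select(X, y, 3) returns the indices of the 3 selected variables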

10
Embedded methods
  • An embedded method commonly employed in the
    context of Gaussian Processes is Automatic
    Relevance Determination (ARD), where the
    characteristic length scales l determine the
    relevance of each input.
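A minimal sketch of ARD using scikit-learn's GP (the library choice and the toy data are assumptions; the talk does not state the implementation used): fit an anisotropic RBF kernel and read input relevance off the learned length scales, short length scale meaning relevant input.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 6))              # e.g. 3 relevant + 3 noise inputs
    y = np.sin(X[:, :3]).sum(axis=1) + 0.1 * rng.standard_normal(200)

    kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

    lengthscales = gp.kernel_.k1.length_scale   # one length scale per input dimension
    ranking = np.argsort(lengthscales)          # most relevant (shortest) first
    top3 = ranking[:3]                          # e.g. keep the top 3, as in the experiments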

11
Preliminary experiments
  • The following algorithms were used in the
    experiments:
  • BaseRelevant: baseline run using the relevant
    dimensions only. The RMSE was obtained by
    training a GP on the relevant dimensions. This
    value can be interpreted as the optimal RMSE
    value.
  • BaseAll: baseline run using all the dimensions,
    i.e. relevant + extra. Again the RMSE was
    obtained by training a GP on this set. The
    difference BaseAll - BaseRelevant is a measure of
    the effect of the extra variables on the
    predictive accuracy of the GP.
  • CorrCoef: Pearson correlation coefficient. A
    variable ranking is performed using formula (10)
    and the top 3 variables are selected and used to
    train a GP (a sketch follows after this list).
  • LinFS: employ a forward-selection subset
    selection strategy using a multivariate linear
    regression model. The RMSE is obtained by
    evaluating the selected subset with a multiple
    linear regression model.
  • GPFS: again employ forward selection to generate
    subsets, but use a GP rather than a linear model.
  • ARD: employ the ARD method to rank the input
    variables and select the top 3 to train a GP
    model.
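A minimal sketch of the CorrCoef ranking; formula (10) itself is not reproduced in the transcript, so a plain absolute Pearson correlation coefficient is used as a stand-in, and the toy data shape is an assumption.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 6))              # assumed toy inputs: 3 relevant + 3 noise
    y = np.sin(X[:, :3]).sum(axis=1) + 0.1 * rng.standard_normal(200)

    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    top3 = np.argsort(scores)[::-1][:3]         # highest absolute correlation first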

12
Experiment 1: No correlation
  • 200 observations, 3 model dimensions, 6 total

Algorithm Variables Selected RMSE Elapsed time
BaseRelevant 1,2,3 0.9128 1.44142
BaseAll 1,2,3,4,5,6 1.0473 1.60529
CorrCoef 1,4,2(,3,5,6) 2.1642 1.50487
LinFS 1,4,2 2.7803 0.134283
GPFS 1,2,3 0.9092 18.2017
ARD 1,2,3 0.9134 5.56684
13
Experiment 2: Two-variable correlation
  • 200 observations, 3 model dimensions, 6 total

Algorithm Variables Selected RMSE Elapsed time
BaseRelevant 1,2,3 0.9111 1.42363
BaseAll 1,2,3,4,5,6 1.0633 1.66093
CorrCoef 1,4,5(,2,6,3) 2.6794 1.31676
LinFS 1,4,6 2.8083 0.143308
GPFS 1,2,3 0.9274 19.0051
ARD 1,2,3 1.0076 5.0611
14
Experiment 3: ARD
  • Initial results for high-dimensional input with
    two-variable correlation: 100 model inputs, 500
    noise dimensions, 500 observations.

Length Input Number
31.8373 361
18.7081 501
14.2097 296
12.7581 51
12.3160 456
11.8689 496
11.3176 166
10.2424 310
10.2220 420
9.6192 325
9.0732 363
8.6898 53
8.5453 347
7.9338 419
7.8201 294
7.8017 188
7.4327 103
7.3760 13
7.1526 572
7.0997 478
6.9481 393
6.6417 187
15
Summary of Experiments
  • The best-performing methods are GPFS and ARD,
    which usually find the optimal subset. However,
    the GPFS method is on average more than three
    times slower than ARD.
  • The CorrCoef and LinFS methods are
    computationally inexpensive but provide
    unsatisfactory results.
  • Even for simple mapping functions (sin(x)), ARD
    breaks down on underdetermined systems where the
    number of observations < the number of dimensions.

16
Research Directions
  • Batch hierarchical screening
  • Explore the potential of partitioning the input
    space into groups of inputs, applying screening
    methods on the groups and combining the important
    inputs
  • Some work has already been done for linear models
    (Gabriel and Pan, 1979).
  • Group the variables such that if two variables
    Xi and Xj are in different groups, then their
    regression sums of squares (RSS) are additive,
    i.e. if Si is the reduction in RSS from including
    Xi and Sj the reduction from including Xj, then
    including both Xi and Xj gives Si,j = Si + Sj.

17
Research directions (2)
  • Coupled Emulation
  • Separate emulators for different outputs, linked
    with some model for the covariance.
  • Connections to sequential methods to handle large
    datasets. Linked to Sequential Sparse GPs?
  • Projective methods in conjunction with feature
    selection.

18
Projective methods
From van der Maaten et al. (2007).
19
But
  • Van der Maaten et al. (2007) compared the
    non-linear methods to linear ones and found the
    non-linear methods no better. The reasons they
    propose relate to the curse of dimensionality,
    overfitting of local models, and others.

20
References
  • L.J.P. van der Maaten, E.O. Postma, and H.J. van
    den Herik. Dimensionality Reduction: A Comparative
    Review. 2007.
  • André Elisseeff and Isabelle Guyon. An
    Introduction to Variable and Feature Selection.
    Journal of Machine Learning Research, 3:1157-1182,
    2003.