Title: Dimensionality reduction
1. Dimensionality reduction
- Alexis Boukouvalas
- Work in collaboration with D. M. Maniyar and D. Cornford
2. Goals
- Develop methods for dimensionality reduction of the input and/or output space of models.
- To gain an understanding, initially use a toy dataset to compare existing methods.
- Later on, utilise the methods on real-world models.
- The goal is to extend the methods to work with a high number of variables, of the order of 10^5.
3. Methods
- Feature selection
  - Also known as screening in the statistical literature.
  - Select the p most relevant of the original k variables.
  - The meaning of the variables is preserved, so the method's results are interpretable.
- Projective methods
  - Variables are transformed, X -> F(X).
  - Transformations can be linear or non-linear.
  - Interpretation is non-trivial, especially for non-linear mappings.
4. Toy data set (1)
- Generate N base vectors x of dimensionality d by sampling a Latin hypercube. Normalise the data.
- Evaluate the generative model g(.).
- Corrupt the model output with independent, identically distributed Gaussian noise. Initially the noise variance is set to 0.1 * signal variance.
- Screening: augment with extra noise dimensions of the form e + Bx, where x is the input and e is noise.
  - e is always N(0, I); the B matrix is described on the next slide.
- Projection: project to a higher-dimensional space using x -> W F(x).
- A sketch of this generator is given below.
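A minimal sketch of the toy-data generator described above, using NumPy/SciPy. The particular generative model g(.) used here (a sum of sines) and the constants are illustrative assumptions, not necessarily those used in the experiments.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
N, d = 200, 3                      # observations, model (relevant) dimensions

# Base vectors from a Latin hypercube, normalised to zero mean / unit variance
X = qmc.LatinHypercube(d=d, seed=0).random(N)
X = (X - X.mean(axis=0)) / X.std(axis=0)

def g(X):
    """Illustrative generative model (assumed): a simple non-linear map."""
    return np.sin(X).sum(axis=1)

y_clean = g(X)
# Corrupt with i.i.d. Gaussian noise of variance 0.1 * signal variance
y = y_clean + rng.normal(scale=np.sqrt(0.1 * y_clean.var()), size=N)
```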
5. Toy data set (2)
- Screening: the B matrix determines the correlation between the noise and the model variables (see the sketch after this list).
  - B = 0 constructs noise variables that are uncorrelated with the model variables.
  - k randomly selected rows have a single non-zero entry, corresponding to the noise variable being linearly correlated with a single model variable. Currently k = 0.5 * (number of noise variables) and the coefficient is set to 0.5.
  - Same as the previous case, but two elements of the k rows are non-zero, k = 0.8 * (number of noise variables), and the coefficients are randomly taken from the set {-0.2, -0.5, 0.5, 0.7}.
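A sketch of the three B-matrix settings above, assuming extra dimensions of the form e + Bx with e ~ N(0, I); the function and mode names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_B(n_noise, d, mode="uncorrelated"):
    """Build B for one of the three screening settings described above."""
    B = np.zeros((n_noise, d))
    if mode == "uncorrelated":              # B = 0: noise independent of the model variables
        return B
    if mode == "single":                    # k = 0.5 * n_noise rows, one entry set to 0.5
        k = int(0.5 * n_noise)
        rows = rng.choice(n_noise, size=k, replace=False)
        B[rows, rng.integers(0, d, size=k)] = 0.5
    elif mode == "double":                  # k = 0.8 * n_noise rows, two entries per row
        k = int(0.8 * n_noise)
        rows = rng.choice(n_noise, size=k, replace=False)
        for r in rows:
            cols = rng.choice(d, size=2, replace=False)
            B[r, cols] = rng.choice([-0.2, -0.5, 0.5, 0.7], size=2)
    return B

def screening_dims(X, B):
    """Extra noise dimensions e + Bx for each row x of X."""
    e = rng.standard_normal((X.shape[0], B.shape[0]))
    return e + X @ B.T
```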
6. Toy data set (3)
- Projection: project into a higher-dimensional space of dimension q via x -> W F(x).
- W is a q x d weight matrix and F(.) are basis functions responsible for the projection mapping. A typical choice of projection mapping is to use Radial Basis Functions (RBF); a sketch follows below.
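A sketch of the projection x -> W F(x) with Gaussian RBF basis functions. The number of centres, the bandwidth and the random W are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_features(X, centres, width=1.0):
    """Gaussian RBF activations F(x) for every row of X."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * width ** 2))

def project(X, q, n_centres=10, width=1.0):
    """Map d-dimensional inputs to q dimensions via x -> W F(x)."""
    d = X.shape[1]
    centres = rng.standard_normal((n_centres, d))   # RBF centres
    W = rng.standard_normal((q, n_centres))         # weight matrix
    return rbf_features(X, centres, width) @ W.T    # shape (N, q)
```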
7. Toy data set - extensions
- Different noise models
  - Correlated
  - Multiplicative
- Non-linear interactions of the noise variables with the model variables
- Mix screening and projection
8. Feature selection
- Variable selection methods have been broadly categorised into three categories:
  - Variable ranking: input variables are ranked according to the prediction accuracy of each input, calculated against the model output.
  - Wrapper methods: the emulator is used to assess the predictive power of subsets of variables.
  - Embedded methods: in variable ranking and wrapper methods the emulator is treated as a perfect black box; in embedded methods, variable selection is done as part of the training of the emulator.
9. Wrapper methods
- Forward selection: variables are progressively incorporated into larger and larger subsets (see the sketch after this list).
- Backward elimination: proceeds in the opposite direction.
- Efroymson's algorithm, aka stepwise selection: proceed as in forward selection, but after each variable is added, check whether any of the selected variables can be deleted without significantly affecting the RSS.
- Exhaustive search: all possible subsets are considered.
- Branch and bound: eliminate subset choices as early as possible. E.g. with variables A-Z, if the RSS of the subset {A, B} is 100, then the C-Z branch need not be followed if the RSS using all of the C-Z variables is > 100.
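A minimal sketch of the forward-selection wrapper: greedily add the variable that most reduces the RSS of a fitted model. Ordinary least squares is used as the scoring model here (the LinFS setting); a GP regressor could be substituted for the GPFS variant. Helper names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X, y, cols):
    """Residual sum of squares of the model fitted on the given columns."""
    pred = model.fit(X[:, cols], y).predict(X[:, cols])
    return ((y - pred) ** 2).sum()

def forward_selection(X, y, n_select, model=None):
    model = model or LinearRegression()
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = {j: rss(model, X, y, selected + [j]) for j in remaining}
        best = min(scores, key=scores.get)          # variable with the lowest RSS
        selected.append(best)
        remaining.remove(best)
    return selected
```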
10. Embedded methods
- An embedded method commonly employed in the context of Gaussian Processes is Automatic Relevance Determination (ARD), where the characteristic length scales l determine the relevance of each input (see the sketch below).
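A sketch of ARD-style ranking with a GP, using scikit-learn's anisotropic RBF kernel as a stand-in for the GP implementation used in the slides; inputs with short fitted length scales are treated as most relevant.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def ard_ranking(X, y):
    """Rank inputs by their fitted ARD length scales (shortest = most relevant)."""
    d = X.shape[1]
    kernel = RBF(length_scale=np.ones(d)) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    length_scales = gp.kernel_.k1.length_scale      # one length scale per input
    return np.argsort(length_scales)                # most relevant first
```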
11. Preliminary experiments
- The following algorithms were used in the experiments:
  - BaseRelevant: baseline run using the relevant dimensions only. The RMSE was obtained by training a GP on the relevant dimensions. This value can be interpreted as the optimal RMSE value.
  - BaseAll: baseline run using all the dimensions, i.e. relevant + extra. Again the RMSE was obtained by training a GP on this set. The difference BaseAll - BaseRelevant is a measure of the effect of the extra variables on the predictive accuracy of the GP.
  - CorrCoef: Pearson correlation coefficient. A variable ranking is performed using formula (10), and the top 3 variables are selected and used to train a GP (see the sketch after this list).
  - LinFS: employ a forward-selection subset-selection strategy using a multivariate linear regression model. The RMSE is obtained by evaluating the selected subset with a multiple linear regression model.
  - GPFS: again employ forward selection to generate subsets, but use a GP rather than a linear model.
  - ARD: employ the ARD method to rank the input variables and select the top 3 to train a GP model.
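A sketch of the CorrCoef baseline: rank the inputs by absolute Pearson correlation with the output and keep the top 3; the function name is illustrative.

```python
import numpy as np

def corrcoef_ranking(X, y, n_select=3):
    """Indices of the n_select inputs most correlated (in absolute value) with y."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(-np.abs(r))[:n_select]
```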
12. Experiment 1: no correlation
- 200 observations, 3 model dimensions, 6 total

Algorithm      Variables selected   RMSE     Elapsed time
BaseRelevant   1,2,3                0.9128   1.44142
BaseAll        1,2,3,4,5,6          1.0473   1.60529
CorrCoef       1,4,2 (,3,5,6)       2.1642   1.50487
LinFS          1,4,2                2.7803   0.134283
GPFS           1,2,3                0.9092   18.2017
ARD            1,2,3                0.9134   5.56684
13. Experiment 2: two-variable correlation
- 200 observations, 3 model dimensions, 6 total

Algorithm      Variables selected   RMSE     Elapsed time
BaseRelevant   1,2,3                0.9111   1.42363
BaseAll        1,2,3,4,5,6          1.0633   1.66093
CorrCoef       1,4,5 (,2,6,3)       2.6794   1.31676
LinFS          1,4,6                2.8083   0.143308
GPFS           1,2,3                0.9274   19.0051
ARD            1,2,3                1.0076   5.0611
14. Experiment 3: ARD
- Initial results for high-dimensional input with two-variable correlation: 100 model inputs, 500 noise dimensions, 500 observations.
Length     Input number
31.8373    361
18.7081    501
14.2097    296
12.7581    51
12.3160    456
11.8689    496
11.3176    166
10.2424    310
10.2220    420
9.6192     325
9.0732     363
8.6898     53
8.5453     347
7.9338     419
7.8201     294
7.8017     188
7.4327     103
7.3760     13
7.1526     572
7.0997     478
6.9481     393
6.6417     187
15. Summary of experiments
- The best-performing methods are GPFS and ARD, which usually find the optimal subset. However, the GPFS method is on average more than three times slower than ARD.
- The CorrCoef and LinFS methods are computationally inexpensive but give unsatisfactory results.
- Even for simple mapping functions (sin(x)) on underdetermined systems, where the number of observations < number of dimensions, ARD breaks down.
16. Research directions
- Batch hierarchical screening
  - Explore the potential of partitioning the input space into groups of inputs, applying screening methods to the groups and combining the important inputs.
  - Some work has already been done for linear models (Gabriel and Pan 1979).
  - Group the variables such that if two variables Xi, Xj are in different groups, then their regression sums of squares (RSS) are additive, i.e. if Si is the reduction in RSS from including Xi and Sj the reduction from including Xj, then including both Xi and Xj gives S_ij = Si + Sj (see the numerical check after this list).
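A small numerical check of the additivity condition S_ij = Si + Sj: with (approximately) orthogonal predictors, the RSS reductions from Xi and Xj add up when both are included. The data are synthetic and purely illustrative.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(3)
n = 500
Xi, Xj = rng.standard_normal(n), rng.standard_normal(n)   # approximately orthogonal
y = 1.5 * Xi - 2.0 * Xj + rng.standard_normal(n)

def rss_reduction(cols):
    """Reduction in RSS relative to the intercept-only model."""
    X = np.column_stack([np.ones(n)] + cols)
    resid = y - X @ lstsq(X, y, rcond=None)[0]
    base = ((y - y.mean()) ** 2).sum()
    return base - (resid ** 2).sum()

S_i, S_j = rss_reduction([Xi]), rss_reduction([Xj])
S_ij = rss_reduction([Xi, Xj])
print(S_i + S_j, S_ij)          # approximately equal when Xi, Xj are orthogonal
```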
17. Research directions (2)
- Coupled emulation
  - Separate emulators for different outputs, linked with some model for the covariance.
- Connections to sequential methods to handle large datasets. Linked to sequential sparse GPs?
- Projective methods in conjunction with feature selection.
18. Projective methods
(Figure from van der Maaten et al., 2007.)
19. But...
- However, van der Maaten et al. (2007) compared the non-linear methods to the linear ones and found them no better. The reasons they propose relate to the curse of dimensionality, overfitting of local models, and others.
20. References
- L.J.P. van der Maaten, E.O. Postma, H.J. van den Herik. Dimensionality Reduction: A Comparative Review. 2007.
- Isabelle Guyon, André Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3:1157-1182, 2003.