Title: Seeking Interpretable Models for High Dimensional Data
1. Seeking Interpretable Models for High Dimensional Data
Bin Yu
Statistics Department and EECS Department
University of California, Berkeley
http://www.stat.berkeley.edu/~binyu
2. Characteristics of Modern Data Problems
- Goal: efficient use of data for
  - Prediction
  - Interpretation (proxy: sparsity)
- Large number of variables
  - The number of variables (p) in data sets is large
  - Sample sizes (n) have not increased at the same pace
- Scientific opportunities
  - New findings in different scientific fields
3. Today's Talk
- Understanding early visual cortex area V1 through fMRI
- Occam's Razor
- Lasso: linear regression and Gaussian graphical models
- Learning about V1 through sparse models (pre-selection via correlation; linear, nonlinear, and graphical models)
- Future work
4. Understanding the visual pathway
The Gallant Lab at UC Berkeley is a leading vision lab.
Da Vinci (1452-1519).
Mapping of different visual cortex areas: Polyak (1957).
The small grey area at middle left is V1.
5. Understanding the visual pathway through fMRI
One goal at the Gallant Lab: understand how natural images relate to fMRI signals.
6. Gallant Lab in Nature News
- Published online 5 March 2008 | Nature | doi:10.1038/news.2008.650
- "Mind-reading with a brain scan": Brain activity can be decoded using magnetic resonance imaging. By Kerri Smith.
- Scientists have developed a way of decoding someone's brain activity to determine what they are looking at.
- "The problem is analogous to the classic 'pick a card, any card' magic trick," says Jack Gallant, a neuroscientist at the University of California in Berkeley, who led the study.
7. Stimuli
8. Stimulus to fMRI response
- Natural image stimuli drawn randomly from a database of 11,499 images
- Experiment designed so that responses from different presentations are nearly independent
- fMRI response is pre-processed and roughly Gaussian
9. Gabor Wavelet Pyramid
10. Features
11. Neural (fMRI) encoding for visual cortex V1
- Predictors: p = 10,921 features of an image
- Response: (preprocessed) fMRI signal at a voxel
- n = 1,750 samples
- Goal: understanding the human visual system
  - an interpretable (sparse) model is desired
  - good prediction is necessary
- Minimization of an empirical loss (e.g., L2) alone leads to
  - an ill-posed computational problem, and
  - bad prediction
12. Linear Encoding Model by the Gallant Lab
- Data
  - X: p = 10,921 dimensions (features)
  - Y: fMRI signal
  - n = 1,750 training samples
- Separate linear model for each voxel via ε-L2 boosting (or Lasso); a minimal fitting sketch follows below
- Fitted model tested on 120 validation samples
- Performance measured by correlation
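A minimal sketch of this per-voxel workflow, with scikit-learn's Lasso standing in for ε-L2 boosting (array names and shapes are our assumptions, not the Lab's code):

import numpy as np
from sklearn.linear_model import Lasso

def fit_voxel_models(X_train, Y_train, X_val, Y_val, alpha=0.05):
    """Fit one sparse linear model per voxel and score it by validation correlation."""
    n_voxels = Y_train.shape[1]
    models, scores = [], []
    for v in range(n_voxels):
        model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, Y_train[:, v])
        models.append(model)
        pred = model.predict(X_val)
        scores.append(np.corrcoef(pred, Y_val[:, v])[0, 1])  # performance = correlation (cc)
    return models, np.array(scores)

# X_train: 1750 x 10921 Gabor features; Y_train: 1750 x (number of voxels);
# X_val, Y_val: the 120 held-out validation samples.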
13. Modeling history at the Gallant Lab
- Prediction on the validation set is the benchmark
- Methods tried: neural nets, SVMs, ε-L2 boosting (Lasso)
- Among models with similar predictions, the simpler (sparser) models from ε-L2 boosting are preferred for interpretation
- This practice reflects a general trend in statistical machine learning -- moving from prediction alone to simpler/sparser models for interpretation, faster computation, or data transmission.
14. Occam's Razor
William of Ockham, 14th-century English logician and Franciscan friar.
Principle of Parsimony: "Entities must not be multiplied beyond necessity." (Wikipedia)
15. Occam's Razor via Model Selection in Linear Regression
- Maximum likelihood (ML) is least squares (LS) under the Gaussian assumption
- There are 2^p submodels
- ML goes for the largest submodel, with all p predictors
- The largest model often gives bad prediction when p is large
16. Model Selection Criteria
Akaike (1973, 1974) and Mallows' Cp use estimated prediction error to choose a model; Schwarz (1978) proposed BIC. Both types are penalized LS criteria, with penalties that grow with model size. Rissanen's Minimum Description Length (MDL) principle gives rise to many different criteria; the two-part code leads to BIC.
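In their standard penalized least-squares forms (textbook versions, not copied from the slide), with S a submodel, RSS_S its residual sum of squares, and \hat\sigma^2 an estimate of the noise variance:
\[
\mathrm{AIC}:\ \min_{S}\ n\log\!\big(\mathrm{RSS}_S/n\big) + 2|S|, \qquad
\mathrm{C}_p:\ \min_{S}\ \mathrm{RSS}_S + 2\hat\sigma^2 |S|, \qquad
\mathrm{BIC}:\ \min_{S}\ n\log\!\big(\mathrm{RSS}_S/n\big) + |S|\log n .
\]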
17. Model Selection for the image-fMRI problem
- For the linear encoding model, the number of submodels is 2^p = 2^10,921
- Combinatorial search is too expensive, and often not necessary
- A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso) -- a third-generation computational method in statistics and machine learning.
18. Lasso: the L1 norm as a penalty
- The L1 penalty is defined on the coefficient vector β
- Used initially with the L2 loss:
  - Signal processing: Basis Pursuit (Chen and Donoho, 1994)
  - Statistics: Non-Negative Garrote (Breiman, 1995)
  - Statistics: LASSO (Tibshirani, 1996)
- Properties of the Lasso
  - Sparsity (variable selection) and regularization
  - Convexity (convex relaxation of the L0 penalty)
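Written out (the standard formulation these references share):
\[
\hat\beta(\lambda) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\ \frac{1}{2n}\,\|Y - X\beta\|_2^2 \;+\; \lambda \sum_{j=1}^{p} |\beta_j| .
\]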
19. Lasso computation and evaluation
- The right tuning parameter λ is unknown, so a path of solutions is needed (discretized or continuous); see the path sketch below
- Initially: a quadratic program (QP) solved for each λ on a grid
- Later: path-following algorithms such as
  - homotopy, by Osborne et al. (2000)
  - LARS, by Efron et al. (2004)
- Theoretical studies: much recent work on the Lasso in terms of
  - L2 prediction error
  - L2 error of the parameter estimate
  - model selection consistency
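A minimal sketch of computing a discretized Lasso path with scikit-learn (simulated data, not the Lab's; features assumed standardized):

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]          # 5 relevant predictors
y = X @ beta + 0.5 * rng.standard_normal(n)

# alphas: grid of penalty levels; coefs: (p x n_alphas) coefficient path
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
n_selected = (np.abs(coefs) > 1e-8).sum(axis=0)  # sparsity along the path
for a, k in list(zip(alphas, n_selected))[::20]:
    print(f"lambda={a:.4f}  nonzero coefficients={k}")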
20. Model Selection Consistency of the Lasso
- Set-up
  - Linear regression model: Y = Xβ + ε
  - n observations and p predictors
- Assume condition (A) on the design (see the reconstruction below)
- Knight and Fu (2000) showed L2 estimation consistency under (A).
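The formulas on this slide are images in the original; a standard reconstruction consistent with Knight and Fu (2000) and Zhao and Yu (2006) (our rendering, not a verbatim copy) is:
\[
Y = X\beta^* + \varepsilon, \qquad \varepsilon_i \ \text{i.i.d.},\ \ \mathbb{E}\varepsilon_i = 0,\ \ \mathrm{Var}(\varepsilon_i) = \sigma^2,
\]
\[
\text{(A)}:\quad C_n := \tfrac{1}{n} X^\top X \;\longrightarrow\; C \succ 0 .
\]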
21. Model Selection Consistency of the Lasso
- p small, n large (Zhao and Yu, 2006): assume (A) and regularity conditions.
- Then, roughly, the Irrepresentable Condition (an inequality on a 1 x (p - s) vector, where s is the number of relevant predictors) is necessary and sufficient for model selection consistency; there is some ambiguity when equality holds.
- Related work: Tropp (06), Meinshausen and Buhlmann (06), Zou (06), Wainwright (06)
- Population version of the condition <=> model selection consistency
22. Irrepresentable condition (s = 2, p = 3): geometry
23. Consistency of the Lasso for Model Selection
- Interpretation of the condition: regress the irrelevant predictors on the relevant predictors. If the L1 norm of the regression coefficients is
  - larger than 1, the Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values;
  - smaller than 1, the Lasso can distinguish the irrelevant predictor from the relevant predictors.
- Sufficient conditions (verifiable); a numeric check is sketched below:
  - Constant correlation
  - Power-decay correlation
  - Bounded correlation
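A small numeric check of the (strong, population) irrepresentable condition, using the standard formulation |C_{S^c S} C_{SS}^{-1} sign(β_S)|_∞ < 1 with C the predictor covariance; the constant-correlation example below is ours:

import numpy as np

def irrepresentable_ok(C, support, sign_s):
    """Check the strong irrepresentable condition for covariance C, support set S, and sign pattern of beta_S."""
    S = np.array(support)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)
    C_ss = C[np.ix_(S, S)]
    C_scs = C[np.ix_(Sc, S)]
    vals = np.abs(C_scs @ np.linalg.solve(C_ss, sign_s))
    return bool(vals.max() < 1), vals

# Constant-correlation design with rho = 0.4, s = 2 relevant predictors, p = 3
p, rho = 3, 0.4
C = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
ok, vals = irrepresentable_ok(C, support=[0, 1], sign_s=np.array([1.0, 1.0]))
print(ok, vals)  # here the maximum is about 0.57 < 1, so the condition holds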
24. Model Selection Consistency of the Lasso (p >> n)
- Consistency also holds for s and p growing with n, assuming
  - bounds on the max and min eigenvalues of the design matrix
  - the smallest nonzero coefficient is bounded away from zero
  - the irrepresentable condition
- Gaussian noise: Wainwright (06)
- Finite 2k-th moment noise: Zhao and Yu (06)
25. Gaussian Graphical Model
- The Gaussian inverse covariance characterizes conditional independencies (and hence the graphical model).
(Figure: a graph and its inverse covariance.)
26. L1-penalized log Gaussian likelihood
Given n i.i.d. observations of X, estimate the inverse covariance by maximizing the L1-penalized Gaussian log-likelihood: Yuan and Lin (06). Banerjee et al. (08) and Friedman et al. (07): fast algorithms based on block descent. Ravikumar, Wainwright, Raskutti, Yu (2008): sufficient conditions for model selection consistency (p >> n) for sub-Gaussian and also heavy-tailed distributions.
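In its standard form this is the log-determinant program: minimize tr(Σ̂ Θ) − log det Θ + λ‖Θ‖₁ over positive definite Θ. A minimal sketch using scikit-learn's GraphicalLasso, one implementation in this block-descent family (not the authors' code; the chain-graph example is ours):

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 10
# Sparse precision matrix on a chain: nonzeros only on the diagonal and first off-diagonals
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=500)

model = GraphicalLasso(alpha=0.1).fit(X)
est_edges = np.abs(model.precision_) > 1e-3     # estimated graph structure
print(est_edges.astype(int))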
27. Success probability: dependence on n and p (Gaussian)
- Edge covariances as specified in the simulation; each point is an average over 100 trials.
- The curves stack up in the second plot, so n / log p controls model selection.
28. Success probability: dependence on model complexity K and n
- Chain graph with p = 120 nodes.
- Curves from left to right have increasing values of K.
- Models with larger K thus require more samples n for the same probability of success.
29. Back to the image-fMRI problem: a linear sparse encoding model on complex cells
- The Gallant Lab's approach
  - Separate linear model for each voxel: Y = Xb + e
  - Model fitting via ε-L2 boosting, with stopping chosen by cross-validation
  - X: p = 10,921 dimensions (features, or complex cells)
  - n = 1,750 training samples
- Fitted model tested on 120 validation samples (not used in fitting)
- Performance measured by correlation (cc)
30. Our story on the image-fMRI problem
- Episode 1: linear
  - pre-selection by correlation
  - localization
- Fitting details
  - Regularization parameter selected by 5-fold cross-validation
  - L2 boosting applied to all 10,000 features -- L2 boosting is the method of choice in the Gallant Lab
  - Other methods applied to 500 features pre-selected by correlation (a pre-selection sketch follows below)
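A minimal sketch of correlation-based pre-selection (array names are our assumptions):

import numpy as np

def preselect_by_correlation(X, y, k=500):
    """Keep the k features most correlated (in absolute value) with the voxel response y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    keep = np.argsort(-np.abs(corr))[:k]
    return keep, corr

# Usage for one voxel v: keep, corr = preselect_by_correlation(X_train, Y_train[:, v], k=500)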
31. Other methods
- k = 1: semi OLS (theoretically better than OLS)
- k = 0: ridge
- k = -1: semi OLS (inverse)
- k = 1, -1 are "semi-supervised" because they use the population covariance, which is estimated from the whole image database.
32. Validation correlation
33. Comparison of the feature locations
(Figure: feature locations selected by the semi methods vs. ε-L2 boosting.)
34. Our Story (cont.)
- Episode 2: nonlinear, via iV-SpAM (machine learning stage)
- Additive Models (Hastie and Tibshirani, 1990)
  - Sparse
  - High dimensional: p >>> n
- SpAM (Sparse Additive Models), by Ravikumar, Lafferty, Liu, Wasserman (2007)
  - Related work: COSSO, Lin and Zhang (2006)
- Model: Y = sum_j f_j(X_j) + error, with f_j = 0 for most j
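For reference, the SpAM population objective in Ravikumar et al.'s formulation (our rendering of the standard form, not copied from the slide):
\[
\min_{f_1,\dots,f_p}\ \mathbb{E}\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2 \;+\; \lambda \sum_{j=1}^{p} \sqrt{\mathbb{E}\, f_j(X_j)^2},
\]
where the group-style penalty drives whole component functions f_j to zero, giving a sparse additive fit.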
35. SpAM V1 encoding model (1,331 voxels from V1) (Ravikumar, Vu, Gallant Lab, BY; NIPS 08)
- For each voxel:
  - Start with the 10,921 complex cells (features)
  - Pre-select 100 complex cells via correlation
  - Fit SpAM to the 100 complex cells, with AIC stopping
  - Pool the SpAM-fitted complex cells according to location and frequency to form pooled complex cells
  - Fit SpAM to the 100 complex cells plus the pooled complex cells, with AIC stopping
36. Prediction performance (R²)
(Figure: prediction R², with inset region.)
Median improvement: 12%.
37. Nonlinearities
Compressive effect (finite energy supply). A common nonlinearity for each voxel?
38. Identical Nonlinearity
- Identical V-SpAM model (iV-SpAM) (Ravikumar et al., 08): the selected features share an identical nonlinearity, and the departure from it is small or sparse.
39. Identical nonlinearity vs. linearity: R² prediction
Median improvement: 16%.
40. Episode 3: nonlinearity via power transformations (classical stage)
Power transformation of features: plot a predictor vs. the response. The feature is skewed while the response is Gaussian; after transformation (4th root) we get better correlation!
(Figure: response vs. feature scatterplots, before and after the transformation.)
41. Make it Gaussian
- Histograms of Gabor wavelet features, by spatial frequency.
(Figure: histograms under the transformations none, log, 4th root, and square root, at spatial frequencies 1, 2, 4, 8, 16, 32.)
42. Gaussianization of features improves prediction
- Gaussianized features are used to train (using L2 epsilon-boosting); a transform sketch follows below
  - F_T = F^(1/2) if F is in levels 2, 3 (4 x 4 and 8 x 8)
  - F_T = F^(1/4) if F is in levels 4, 5 (16 x 16 and 32 x 32)
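A minimal sketch of this level-dependent power transform (the feature layout and variable names are our assumptions; Gabor amplitudes are taken to be nonnegative):

import numpy as np

def gaussianize(features, levels):
    """Apply the level-dependent power transforms: square root for levels 2-3, 4th root for levels 4-5."""
    out = features.astype(float).copy()
    mid = np.isin(levels, [2, 3])    # 4x4 and 8x8 pyramid levels
    fine = np.isin(levels, [4, 5])   # 16x16 and 32x32 pyramid levels
    out[:, mid] = np.sqrt(out[:, mid])
    out[:, fine] = out[:, fine] ** 0.25
    return out

# features: n x p matrix of Gabor feature values; levels: length-p array of pyramid levels
# X_T = gaussianize(X, levels)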
43. Episode 4: localized predictors
Correlation of predictors with voxels, at multiple resolutions.
(Figure: correlation maps for Voxel 1 and Voxel 2.)
- Max correlation value at each location (over the 8 orientations)
- Purple: strong (> 0.4); red: weak.
- A specific strong location appears across resolutions, for each voxel.
44. Localized prediction
- Each voxel is related to an area of the image.
- Localization Algorithm (a code sketch follows below):
  - For each level 2 to 5:
    - for each location, find the orientation giving the max correlation;
    - this generates 4 maps, from 4 x 4 to 32 x 32.
  - For each map, smooth using a Gaussian kernel and find the peak correlation.
  - Levels 2, 3: use only features from the peak of correlation (8 features each).
  - Levels 4, 5: also include the 4 and 12 neighboring locations (40 and 104 features).
  - On these features, train a linear prediction function using L2 epsilon-boosting.
- 10 times faster than the global fit.
- Similar prediction on almost all voxels.
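A rough sketch of the localization step under stated assumptions (the array layout and smoothing bandwidth are hypothetical; scipy's gaussian_filter stands in for whatever smoother was used):

import numpy as np
from scipy.ndimage import gaussian_filter

def localize(corr):
    """corr: (side, side, n_orientations) array of feature-voxel correlations at one pyramid level.
    Returns the grid location with the highest smoothed max-over-orientation correlation."""
    cmax = corr.max(axis=2)                     # max correlation over the 8 orientations
    smooth = gaussian_filter(cmax, sigma=1.0)   # smooth the map with a Gaussian kernel
    return np.unravel_index(np.argmax(smooth), smooth.shape)

# For levels 2-3, keep only the 8 orientation features at the returned peak location;
# for levels 4-5, also include the 4 or 12 neighboring locations,
# then fit a sparse linear model (e.g., L2 epsilon-boosting or Lasso) on the kept features.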
45. After localization: view the responses of a single voxel
- It is easy to distinguish stimuli that induce strong vs. weak responses in the voxel when we look at the local area.
(Figure: the voxel's location in the full image, with example strong and weak stimuli.)
46. Localization does not work for all voxels
Generally prediction does not deteriorate. However, there are a few voxels for which local prediction is not as good.
Red: global minus local > 0.05. Blue: local minus global > 0.05.
47. Who is responsible? (on-going)
- Large regions of features are correlated with the outcome
- The image features are correlated
  - between close locations
  - between levels of the pyramid
  - between neighboring gradients
- Which of the features are responsible, and which are just tagging along?
- Can we distinguish between correlation among features and causation of the voxel response?
48. Some unlocalized voxels
- Two neighboring voxels can be highly correlated because
  - each is influenced by both areas, or
  - one influences the other.
(Figure: Area A with voxel 1 and voxel 2.)
49. Episode 5: sparse graphical models (on-going)
- Estimated apart, each voxel depends on both areas.
- Estimated together, each voxel depends on only one area.
- Estimated inverse covariance entries are shown below.
(Figure: entries linking Voxel 1 and Voxel 2 to Areas A and B, estimated separately vs. estimated together.)
50. Episode 6: building a neighbor model (on-going)
- Voxels with poor localization are probably influenced by neighboring voxels.
- For each voxel 0, take its 6 neighboring voxels (v = 1, ..., 6).
- L(v) are the local (transformed) predictors for voxel v.
- Build a sparse linear model for voxel v based on L(v).
- Add the 6 predicted values from the neighbor models to the local (transformed) predictors for voxel 0.
- We call this the neighbor model.
51. Prediction (Neighbor compared to Global)
(Figure: prediction comparison, with inset region.)
We don't lose prediction power.
52. Summary
- L1-penalized minimization: model selection consistency
  - The irrepresentable condition comes from the KKT conditions and the L1 penalty.
  - Effective sample size n / log p? Not always; it depends on the tail of the relevant distribution.
  - Model complexity parameters matter.
53. Summary (cont.)
- Understanding fMRI signals in V1 via sparse modeling
  - Discovered a nonlinear compressive effect in the V1 fMRI response
  - Sparse Gaussian graphical models to relate selected features and voxels (on-going)
- General lessons
  - Algorithmic EDA -- off-the-shelf sparse modeling techniques to reduce problem size and point to nonlinearity
  - Subject-matter based models -- simpler (power transform) and more meaningful models (local and neighbor sparse graphical models)
54. Future work
- V1 voxels: the graphical model gives us a way to borrow strength across voxels
  - improved prediction based on graphical models?
  - improved decoding? (decoding the images seen by the subject from the fMRI signals)
- Higher visual area voxels: no good models yet from the lab
  - hope: improved encoding models (hence making decoding possible)
  - next: the V4 challenge, that is, how to build models for V4? Better image representations.
55. Acknowledgements
Co-authors: P. Zhao and N. Meinshausen (Lasso); P. Ravikumar, M. Wainwright, G. Raskutti (graphical model); P. Ravikumar, V. Vu, Y. Benjamini (fMRI problem).
Gallant Lab: K. Kay, T. Naselaris, J. Gallant (fMRI problem).
Funding: NSF-IGMS Grant (06-07, in the Gallant Lab), Guggenheim Fellowship (06, in the Gallant Lab), NSF DMS Grant, NSF VIGRE Grant, ARO Grant.
56. Shared nonlinearity vs. nonlinearity: sparsity
Recall that 100 original features were pre-selected for V-SpAM and Shared V-SpAM. On average:
- Linear model: 100 predictors selected from 10,000.
- V-SpAM: 70 predictors (original and pooled); better prediction than linear.
- Shared V-SpAM: sparser (13 predictors) and better prediction than V-SpAM.
57. Using neighboring voxels for prediction
- Voxels with poor localization are probably influenced by neighboring voxels.
- Response of voxel v (a reconstruction of the model is given below):
- L(v) are the feature indexes in a localized area of the image.
- The X_i are the (transformed) feature values.
- N(v) are the neighboring voxels (at most 6).
- We will use the neighbors' fitted values as additional predictors.
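The model formula on this slide is an image in the original; a plausible rendering consistent with the description above (the coefficient symbols are ours):
\[
Y_v \;=\; \sum_{i \in L(v)} \beta_i X_i \;+\; \sum_{u \in N(v)} \gamma_u \hat Y_u \;+\; \varepsilon_v ,
\]
where \hat Y_u is the fitted localized prediction for neighboring voxel u.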
58. Using neighboring voxels for prediction
- Algorithm (a code sketch follows below)
  - Phase 1: for every voxel,
    - find the best location;
    - train a linear prediction function on the localized features (using L2 epsilon-boosting);
    - generate predictions.
    - (Up to this point, the same as the Localized algorithm.)
  - Phase 2: for every voxel,
    - find the neighboring voxels in the brain;
    - retrain a linear prediction function on the localized features plus the predictions for the neighboring voxels.
- Note: the models are nested: Local ⊂ Neighbor ⊂ Global.
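A compact sketch of the two-phase procedure under stated assumptions (scikit-learn's cross-validated Lasso stands in for L2 epsilon-boosting; the voxel and neighbor indexing is hypothetical):

import numpy as np
from sklearn.linear_model import LassoCV

def fit_sparse(X, y):
    # Stand-in for L2 epsilon-boosting: a cross-validated Lasso fit
    return LassoCV(cv=5).fit(X, y)

def neighbor_model(local_X, Y, neighbors):
    """local_X[v]: n x p_v localized (transformed) features for voxel v;
    Y[:, v]: responses; neighbors[v]: list of neighboring voxel indices."""
    n, V = Y.shape
    # Phase 1: localized model and fitted values for every voxel
    phase1 = [fit_sparse(local_X[v], Y[:, v]) for v in range(V)]
    yhat = np.column_stack([m.predict(local_X[v]) for v, m in enumerate(phase1)])
    # Phase 2: refit each voxel on its localized features plus the neighbor predictions
    phase2 = []
    for v in range(V):
        Xv = np.hstack([local_X[v], yhat[:, neighbors[v]]])
        phase2.append(fit_sparse(Xv, Y[:, v]))
    return phase1, phase2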
59. Sparse covariance estimation via the L1 penalty
Banerjee, El Ghaoui, d'Aspremont (08)
60. Model selection consistency (Ravikumar et al., 08)
Assume the irrepresentable condition below holds, and either
1. X is sub-Gaussian (with some parameter), with effective sample size n / log p, or
2. X has finite 4m-th moments, with a correspondingly larger effective sample size requirement.
Then, with high probability as n tends to infinity, the correct model is chosen.
61. Model selection consistency
Ravikumar, Wainwright, Raskutti, Yu (08) give sufficient conditions for model selection consistency, stated in terms of the Hessian of the log-determinant objective and a model complexity measure K defined from it.
63. Proof Techniques
- Primal-dual witness
  - Useful for proving equalities (difficult) by showing only inequalities (easier)
- Fixed-point analysis
  - Useful for bounding a zero of a complicated functional (e.g., the gradient)
64. Primal-dual witness: sketch
- Construct a candidate primal-dual pair (see the outline below).
- This is a proof technique, not a practical algorithm.
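For concreteness, the shape of the argument in the Lasso case (a standard rendering, not taken from the slides): fix the true support S, solve the restricted problem, and verify that the implied dual variable is strictly feasible off S.
\[
\hat\beta_S = \arg\min_{\beta_S}\ \frac{1}{2n}\|Y - X_S\beta_S\|_2^2 + \lambda\|\beta_S\|_1, \qquad \hat\beta_{S^c} = 0,
\]
\[
\hat z_{S^c} = \frac{1}{\lambda n}\, X_{S^c}^\top\big(Y - X_S\hat\beta_S\big), \qquad \text{check}\ \ \|\hat z_{S^c}\|_\infty < 1 .
\]
If the strict inequality holds, the constructed pair satisfies the full KKT conditions, so it is a Lasso solution whose support is contained in S.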
65. Sparsity of the Inverse Correlation (S)
- In a multivariate Gaussian distribution, S signifies conditional independence:
  - if S_ij = 0, then (i, j) are conditionally independent (conditioned on all other variables).
- Example: a 4 x 4 grid in which each pixel is directly (positively) correlated with its 4 neighbors. Then all pixels are correlated, but the inverse correlation is sparse; a numeric illustration follows below.
(Figure: the correlation and inverse correlation matrices; color scales: red < 0 < purple, and 0 < red < green < purple.)
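A small numeric illustration of this point (the parameterization, with partial correlation 0.2 between grid neighbors, is our assumption):

import numpy as np

# 4x4 grid Gaussian Markov random field: the precision (inverse correlation) matrix has
# nonzero entries only between grid neighbors, yet the implied correlation matrix is dense.
side = 4
p = side * side
Theta = np.eye(p)
for i in range(side):
    for j in range(side):
        a = i * side + j
        for di, dj in [(0, 1), (1, 0)]:
            ni, nj = i + di, j + dj
            if ni < side and nj < side:
                b = ni * side + nj
                Theta[a, b] = Theta[b, a] = -0.2  # negative precision entry <=> positive partial correlation
Sigma = np.linalg.inv(Theta)
corr = Sigma / np.sqrt(np.outer(np.diag(Sigma), np.diag(Sigma)))
print((np.abs(Theta) > 1e-10).sum(), "nonzero entries in the precision matrix")
print((np.abs(corr) > 1e-10).sum(), "nonzero entries in the correlation matrix")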
66. Sparsity of the Inverse Correlation (cont.)
- We want to keep only the true direct effects.
- Sparsity of structure in images is assumed.
- Example (cont.): estimate the correlation and inverse correlation from 1,000 samples; a sketch follows below.
(Figure: the inverse sample correlation vs. the sparse inverse correlation (penalty 0.1); color scales: red < 0, orange = 0, purple > 0, and red < yellow < 0 < green < purple.)
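Continuing the grid example: a sketch of the estimation step, with scikit-learn's GraphicalLasso as a stand-in for whichever penalized solver was used (the edge weight 0.2 and sample size follow the earlier sketch and the slide):

import numpy as np
from sklearn.covariance import GraphicalLasso

# Rebuild the 4x4 grid precision matrix from the previous sketch
side = 4
p = side * side
edges = [(i * side + j, i * side + j + 1) for i in range(side) for j in range(side - 1)]
edges += [(i * side + j, (i + 1) * side + j) for i in range(side - 1) for j in range(side)]
Theta = np.eye(p)
for a, b in edges:
    Theta[a, b] = Theta[b, a] = -0.2
Sigma = np.linalg.inv(Theta)

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=1000)
X = (X - X.mean(axis=0)) / X.std(axis=0)                    # standardize, so we work with correlations
inv_sample = np.linalg.inv(np.corrcoef(X, rowvar=False))    # dense, noisy inverse sample correlation
sparse_inv = GraphicalLasso(alpha=0.1).fit(X).precision_    # L1-penalized estimate: many exact zeros
print("nonzeros:", (np.abs(inv_sample) > 1e-3).sum(), "vs", (np.abs(sparse_inv) > 1e-3).sum())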
69. Acknowledgements
Co-authors: P. Ravikumar, M. Wainwright, G. Raskutti (graphical model); P. Ravikumar, V. Vu, Y. Benjamini, K. Kay, T. Naselaris, J. Gallant (fMRI problem).
Funding: NSF-IGMS Grant (06-07, in the Gallant Lab), Guggenheim Fellowship (06), NSF DMS Grant, ARO Grant.
70. Spatial RFs and Tuning Curves