Prediction using Side Information - PowerPoint PPT Presentation

About This Presentation
Title:

Prediction using Side Information

Description:

Bin Yu, Department of Statistics, UC Berkeley. Joint work with Peng Zhao, Guilherme Rocha, and Vince Vu – PowerPoint PPT presentation

Slides: 57
Transcript and Presenter's Notes

Title: Prediction using Side Information


1
Prediction using Side Information
  • Bin Yu
  • Department of Statistics, UC Berkeley
  • Joint work with Peng Zhao, Guilherme Rocha,
    and Vince Vu

2
Outline
  • Motivation
  • Background
  • Penalization methods (building in side
    information through penalty)
  • L1-penalty (sparsity as the side information)
  • Group and hierarchy as side information
  • Composite Absolute Penalty (CAP)
  • Building blocks: Lγ-norm regularization
  • Definition
  • Interpretation
  • Algorithms
  • Examples and Results
  • Unlabeled data as side information:
    semi-supervised learning
  • Motivating example: image-fMRI problem
    in neuroscience
  • Penalty based on population covariance
    matrix
  • Theoretical result to compare with OLS
  • Experimental results on image-fMRI data

3
Characteristics of Modern Data Set Problems
  • Goal: efficient use of data for
  • Prediction
  • Interpretation
  • Larger number of variables
  • Number of variables (p) in data sets is large
  • Sample sizes (n) have not increased at same pace
  • Scientific opportunities
  • New findings in different scientific fields

4
Regression and classification
  • Data
  • Example: image-fMRI problem
  • Predictor: ~11,000
    features of an image
  • Response: (preprocessed)
    fMRI signal at a voxel
  • n = 1750 samples
  • Minimization of an empirical loss (e.g. L2)
    leads to
  • ill-posed computational problem, and
  • bad prediction

5
Regularization improves prediction
  • Penalization -- linked to computation
  • L2 (numerical stability: ridge, SVM)
  • Model selection (sparsity, combinatorial
    search)
  • L1 (sparsity, convex optimization)
  • Early stopping (the tuning parameter is
    computational)
  • Neural nets
  • Boosting
  • Hierarchical modeling (computational
    considerations)

6
Lasso: the L1-norm as a penalty
  • The L1 penalty on coefficients β is
    T(β) = ||β||1 = Σj |βj|
  • Used initially with the L2 loss
  • Signal processing: Basis Pursuit (Chen &
    Donoho, 1994)
  • Statistics: Non-Negative Garrote (Breiman, 1995)
  • Statistics: LASSO (Tibshirani, 1996)
  • Properties of the Lasso
  • Sparsity (variable selection)
  • Convexity (convex relaxation of the L0-penalty)
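A minimal sketch of the sparsity property on simulated data, using scikit-learn's Lasso solver (an illustration, not the presenters' code): at a moderate penalty, most coefficients are set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]              # only 3 of 20 predictors are active
y = X @ beta + 0.5 * rng.normal(size=n)

fit = Lasso(alpha=0.1).fit(X, y)         # L2 loss + lambda * ||beta||_1
n_selected = int(np.sum(fit.coef_ != 0))
print(n_selected)                        # far fewer than p coefficients survive
```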

7
Lasso: the L1-norm as a penalty
  • Computation
  • the right tuning parameter is unknown, so the
    path is needed
  • (discretized or continuous)
  • Initially: a quadratic program (QP) is solved
    for each value of λ on a grid
  • Later: path-following algorithms
  • homotopy by Osborne et al. (2000)
  • LARS by Efron et al. (2004)
  • Theoretical studies: much recent work on the
    Lasso
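The continuous, piecewise-linear Lasso path mentioned above can be traced in one pass with LARS; a sketch using scikit-learn's `lars_path` on simulated data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

# method="lasso" gives the LARS variant whose knots form the Lasso path
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # (p, number of knots): coefficients at each breakpoint
```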

8
General Penalization Methods
  • Given data
  • Xi: a p-dimensional predictor
  • Yi: a response variable
  • The parameters β are defined by the penalized
    problem
  • β̂(λ) = argmin_β L(Y, X; β) + λ T(β)
  • where
  • L is the empirical loss function
  • T is a penalty function
  • λ is a tuning parameter

9
Beyond Sparsity of Individual Predictors: Natural
Structures among Predictors
  • Rationale: side information might be
    available, and/or
  • additional regularization is needed beyond the
    Lasso for p >> n
  • Groups
  • Genes belonging to the same pathway
  • Categorical variables represented by dummies
  • Polynomial terms from the same variable
  • Noisy measurements of the same variable.
  • Hierarchy
  • Multi-resolution/wavelet models
  • Interaction terms in factorial analysis (ANOVA)
  • Order selection in Markov Chain models

10
Composite Absolute Penalties (CAP): Overview
  • The CAP family of penalties
  • Highly customizable
  • ability to perform grouped selection
  • ability to perform hierarchical selection
  • Computational considerations
  • Feasibility: convexity
  • Efficiency: piecewise linearity in some cases
  • Define groups according to structure
  • Combine properties of Lγ-norm penalties
  • Encompass and go beyond existing work
  • Elastic Net (Zou & Hastie, 2005)
  • Group Lasso (Yuan & Lin, 2006)
  • Blockwise Sparse Regression (Kim, Kim & Kim,
    2006)

11
Composite Absolute Penalties: Review of Lγ
Regularization
  • Given data and a loss
    function
  • Lγ regularization
  • Penalty: T(β) = Σj |βj|^γ
  • Estimate: β̂(λ) = argmin_β L(Y, X; β) + λ T(β)
  • where λ > 0 is a tuning parameter
  • For the squared-error loss function
  • Hoerl & Kennard (1970): Ridge (γ = 2)
  • Frank & Friedman (1993): Bridge (general γ)
  • LASSO (1996): (γ = 1)
  • SCAD (Fan and Li, 1999): (γ < 1)

12
Composite Absolute Penalties: Definition
  • The CAP parameter estimate is given by
  • β̂(λ) = argmin_β L(Y, X; β) + λ T(β)
  • Gk, k = 1,…,K: indices of the k-th pre-defined
    group
  • βGk: the corresponding vector of coefficients
  • γk: group norm, Nk = ||βGk||γk
  • γ0: overall norm, T(β) = Σk Nk^γ0
  • groups may overlap (hierarchical selection)
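A direct numpy transcription of this penalty (an illustrative sketch; the function name and example values are mine):

```python
import numpy as np

def cap_penalty(beta, groups, group_norms, gamma0):
    """Composite Absolute Penalty: T(beta) = sum_k N_k^gamma0,
    with group norms N_k = ||beta_{G_k}||_{gamma_k}.
    Groups are index arrays and may overlap (hierarchical selection)."""
    norms = [np.linalg.norm(beta[g], ord=gk)
             for g, gk in zip(groups, group_norms)]
    return float(np.sum(np.power(norms, gamma0)))

beta = np.array([1.0, -2.0, 0.0, 3.0])
groups = [np.array([0, 1]), np.array([2, 3])]

# gamma_k = 2, gamma0 = 1: a grouped-Lasso-style penalty
print(cap_penalty(beta, groups, [2, 2], 1.0))            # sqrt(5) + 3
# gamma_k = inf, gamma0 = 1: the L1/L-inf combination used by iCAP
print(cap_penalty(beta, groups, [np.inf, np.inf], 1.0))  # 2 + 3 = 5
```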

13
Composite Absolute Penalties: A Bayesian
Interpretation
  • For non-overlapping groups
  • Prior on group norms
  • Prior on individual coefficients

14
Composite Absolute Penalties: Group Selection
  • Tailoring T(β) for group selection
  • Define non-overlapping groups
  • Set γk > 1 for all k ≠ 0
  • The group norm γk tunes similarity within its
    group
  • γk > 1 causes all variables in group k to be
    included/excluded together
  • Set γ0 = 1
  • This yields grouped sparsity
  • γk = 2 has been studied by Yuan and Lin (Group
    Lasso, 2006)

15
Composite Absolute Penalties: Hierarchical
Structures
  • Tailoring T(β) for hierarchical structure
  • Set γ0 = 1
  • Set γi > 1 for all i
  • Groups overlap
  • If β2 appears in all groups where β1 is included,
  • then X2 enters the model after X1
  • As an example

16
Composite Absolute PenaltiesHierarchical
Structures
  • Represent Hierarchy by a directed graph
  • Then construct penalty by
  • For graph above, ?01, ?r?

17
Composite Absolute PenaltiesComputation
  • CAP with general L? norms
  • Approximate algorithms available for tracing
    regularization path
  • Two examples
  • Rosset (2004)
  • Boosted Lasso (Zhao and Yu, 2004) BLASSO
  • CAP with L1L? norms
  • Exact algorithms for tracing regularization path
  • Some applications
  • Grouped Selection iCAP
  • Hierarchical Selection hiCAP for ANOVA and
    wavelets

18
iCAP: Degrees of Freedom (DFs) for Tuning
Parameter Selection
  • Two ways to select the tuning parameter in iCAP
  • 1. Cross-validation
  • 2. Model selection criterion: AICc
  • where the DF used is a generalization of Zou et
    al. (2004)'s df for the Lasso to iCAP
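A sketch of criterion-based tuning along a path, using one common AICc form for Gaussian models (the presenters' generalized-df formula for iCAP is not reproduced here, and the path values below are made up purely for illustration):

```python
import numpy as np

def aicc(rss, n, df):
    """One common AICc form for a Gaussian model, with `df` playing the
    role of effective degrees of freedom (e.g. a Lasso/iCAP-style df)."""
    return n * np.log(rss / n) + 2 * df + 2 * df * (df + 1) / (n - df - 1)

# hypothetical residual sums of squares and dfs along a regularization path
n = 100
rss_path = np.array([120.0, 80.0, 60.0, 55.0, 54.0])
df_path = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

scores = [aicc(r, n, d) for r, d in zip(rss_path, df_path)]
best = int(np.argmin(scores))
print(best)   # index of the tuning parameter minimizing AICc
```

The criterion trades fit (RSS) against complexity (df), so the minimizer sits between the loosest and tightest fits.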

19
Simulation Studies (p > n, partially adaptive
grouping): Summary of Results
  • Good prediction accuracy
  • Extra structure results in non-trivial reduction
    of model error
  • Sparsity/Parsimony
  • Less sparse models in the ℓ0 sense
  • Sparser in terms of degrees of freedom
  • Estimated degrees of freedom (Group, iCAP only)
  • Good choices for regularization parameter
  • AICc model errors close to CV

20
ANOVA Hierarchical Selection: Simulation Setup
  • 55 variables (10 main effects, 45 interactions)
  • 121 observations
  • 200 replications in results that follow

21
ANOVA Hierarchical Selection: Model Error
22
Summary on CAP: Group and Hierarchical Sparsity
  • CAP penalties
  • Are built from Lγ blocks
  • Allow incorporation of different structures to
    fitted model
  • Group of variables
  • Hierarchy among predictors
  • Algorithms
  • Approximation using BLasso for general CAP
    penalties
  • Exact and efficient for particular cases (L2
    loss, L1 and L∞ norms)
  • Choice of regularization parameter λ
  • Cross-validation
  • AICc for particular cases (L2 loss, L1 and L∞
    norms)

23
Regularization using unlabeled data:
semi-supervised learning
Motivating example: image-fMRI problem in
neuroscience (Gallant Lab at UCB)
Goal: to understand how natural images
relate to fMRI signals
24
Stimuli
  • Natural image stimuli

25
Stimulus to response
  • Natural image stimuli drawn randomly from a
    database of 11,499 images
  • Experiment designed so that responses from
    different presentations are nearly independent
  • Response is pre-processed and roughly Gaussian

26
Gabor wavelet pyramid
27
Features
28
Linear model
  • A separate linear model for each voxel
  • Y = Xβ + ε
  • Model fitting
  • X: p = 10,921 dimensions (features)
  • n = 1750 training samples
  • Fitted model tested on 120 validation samples
  • Performance measured by correlation

29
Ordinary Least Squares (OLS)
  • Minimize the empirical squared-error risk
  • Notice that the OLS estimate is a function of the
    estimated covariance of X (Sxx) and the estimated
    covariance of X with Y (Sxy): β̂_OLS = Sxx⁻¹ Sxy
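This plug-in view is easy to verify numerically (a sketch on simulated data, without an intercept for simplicity): solving Sxx b = Sxy reproduces the least-squares fit exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + rng.normal(size=n)

Sxx = X.T @ X / n                 # sample covariance of X
Sxy = X.T @ y / n                 # sample covariance of X with Y
beta_plugin = np.linalg.solve(Sxx, Sxy)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_plugin, beta_ls))   # True: OLS is the plug-in estimate
```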

30
OLS
  • The sample covariance matrix of X is often nearly
    singular, so inversion is ill-posed.
  • Some existing solutions
  • Ridge regression
  • Pseudo-inverse (or truncated SVD)
  • Lasso (closely related to L2boosting -- current
    method at Gallant Lab)

31
Semi-supervised
  • Abundant unlabeled data are available
  • samples from the marginal distribution of X
  • Book on semi-supervised learning (2006) (eds.
    Chapelle, Schölkopf, and Zien)
  • Statistical Science article (2007) (Liang,
    Mukherjee, and West)
  • Image-fMRI: images in the database are unlabeled
    data
  • Semi-supervised linear regression
  • Use
  • labeled data (Xi, Yi), i = 1,…,n, and
  • unlabeled data Xi, i = n+1,…,n+m, to fit

32
Semi-supervised
  • Does the marginal distribution of X play a role?
  • For a fixed design X, the marginal distribution
    of X plays no role.
  • Brown (1990) shows that the OLS estimate of the
    intercept is inadmissible if X is assumed random.

33
Refining OLS
  • The unknown parameter satisfies β = Σxx⁻¹ Σxy
  • So OLS can be seen as a plug-in estimate for this
    equation
  • Can we plug in an improved estimate of Σxx?

34
A first approach
  • Suppose the population covariance of X is known
  • (an infinite amount of unlabeled data)
  • Use a linear combination of the sample and
    population covariances.
  • Ledoit and Wolf (2004) considered convex
    combinations of the sample covariance and another
    matrix from a parametric model
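A sketch of such a combination (the identity target and the fixed weight 0.2 are chosen only for illustration): with p > n the sample covariance is singular, and the convex combination restores invertibility.

```python
import numpy as np

def shrink_cov(s_sample, s_target, alpha):
    """Convex combination of the sample covariance with a target matrix,
    in the spirit of Ledoit & Wolf (2004); alpha in [0, 1]."""
    return (1.0 - alpha) * s_sample + alpha * s_target

rng = np.random.default_rng(3)
n, p = 30, 50                     # p > n: sample covariance is singular
X = rng.normal(size=(n, p))
S = X.T @ X / n
S_shrunk = shrink_cov(S, np.eye(p), 0.2)

# the smallest eigenvalue moves off zero, so inversion is well-posed
print(np.linalg.eigvalsh(S)[0], np.linalg.eigvalsh(S_shrunk)[0])
```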

35
Semi-supervised OLS
Plugging in the improved estimate of Sxx, we get
semi-OLS
36
Semi-supervised OLS
  • Equivalent to penalized least squares
  • Equivalent to ridge regression on pre-whitened
    covariates

37
Spectrally semi-supervised OLS
  • Ridge regression on (W, Y) is just a
    transformation of the spectrum of W, where W has
    a spectral decomposition
  • More generally, one can consider arbitrary
    transformations of the spectrum of W
  • Resulting estimator

38
Spectrally semi-supervised OLS
  • Examples
  • OLS
  • h(s) = 1/s
  • Semi-OLS: ridge on pre-whitened predictors
  • h(s) = 1/(s + a)
  • Truncated SVD on pre-whitened predictors (PCA
    regression)
  • h(s) = 1/s if s > c, otherwise 0
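All three examples fit one template: eigendecompose the (pre-whitened) sample covariance and apply h to its spectrum. A numpy sketch (variable names are mine; W stands for the pre-whitened design):

```python
import numpy as np

def spectral_estimator(W, y, h):
    """beta_hat = V h(S) V' (W'y / n), where W'W/n = V diag(s) V'.
    h(s) = 1/s gives OLS; h(s) = 1/(s + a) gives ridge;
    h(s) = (1/s) for s > c, else 0, gives truncated SVD / PCA regression."""
    n = W.shape[0]
    s, V = np.linalg.eigh(W.T @ W / n)
    return V @ (h(s) * (V.T @ (W.T @ y / n)))

rng = np.random.default_rng(4)
n, p = 100, 5
W = rng.normal(size=(n, p))
y = W @ np.ones(p) + rng.normal(size=n)

ols = spectral_estimator(W, y, lambda s: 1.0 / s)
ridge = spectral_estimator(W, y, lambda s: 1.0 / (s + 0.1))
print(np.allclose(ols, np.linalg.lstsq(W, y, rcond=None)[0]))   # True
```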

39
Large n,p asymptotic MSPE
  • Assumptions
  • Σ non-degenerate
  • Z = X Σ^(-1/2) is n-by-p with IID entries
    satisfying
  • mean 0, variance 1
  • finite 4th moment
  • h is a bounded function
  • βᵀ Σxx β / σ² has a finite limit SNR² as p, n
    tend to ∞
  • p/n has a finite, strictly positive limit ρ

40
Large n,p MSPE
  • Theorem
  • The mean squared prediction error satisfies
  • where Fρ is the Marchenko-Pastur law with index ρ
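The Marchenko-Pastur law appearing in the theorem can be checked empirically (a simulation sketch): for IID standardized entries and p/n = ρ, the eigenvalues of Z'Z/n concentrate on the support [(1-√ρ)², (1+√ρ)²].

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 2000, 500                  # rho = p/n = 0.25
Z = rng.normal(size=(n, p))       # IID entries, mean 0, variance 1
eigs = np.linalg.eigvalsh(Z.T @ Z / n)

rho = p / n
lo, hi = (1 - rho ** 0.5) ** 2, (1 + rho ** 0.5) ** 2
print(eigs.min(), eigs.max())     # close to the support edges 0.25 and 2.25
```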

41
Consequences
  • Asymptotically optimal h
  • Asymptotically better than OLS and truncated SVD
  • Reminiscent of the shrinkage factor in the
    James-Stein estimate
  • SNR might be easily estimated

42
Back to the image-fMRI problem
  • Fitting details
  • Regularization parameter selected by 5-fold
    cross-validation
  • L2 boosting applied to all 10,000 features
    (L2 boosting is the method of choice
    in the Gallant Lab)
  • Other methods applied to 500 features
    pre-selected by correlation

43
Other methods
  • k = 1: semi-OLS (theoretically better than OLS)
  • k = 0: ridge
  • k = -1: semi-OLS (inverse)

44
Validation correlation
voxel   OLS     L2 boost   Semi-OLS   Ridge   Semi-OLS inverse (k = -1)
1       0.434   0.738      0.572      0.775   0.773
2       0.228   0.731      0.518      0.725   0.731
3       0.320   0.754      0.607      0.741   0.754
4       0.303   0.764      0.549      0.742   0.737
5       0.357   0.742      0.544      0.763   0.763
6       0.325   0.696      0.520      0.726   0.734
7       0.427   0.739      0.579      0.728   0.736
8       0.512   0.741      0.608      0.747   0.743
9       0.247   0.746      0.571      0.754   0.753
10      0.458   0.751      0.606      0.761   0.758
11      0.288   0.778      0.610      0.801   0.801
45
Features used by L2 boosting
46
500 features used by semi-methods
47
Comparison of the feature locations
[Figure: feature locations for the semi-methods
and for L2 boosting]
48
Further work
  • Image-fMRI problem based on a linear model
  • Compare methods for other voxels
  • Use fewer features for the semi-methods?
  • (average number of features for L2 boosting: 120;
    features for semi-methods: 500, by design)
  • Interpretation of the results of different
    methods
  • Theoretical results for ridge and semi-inverse
    OLS?
  • Image-fMRI problem: non-linear modeling
  • understanding the image space (clusters?
    manifolds?)
  • different linear models on different clusters
    (manifolds)?
  • non-linear models on different clusters
    (manifolds)?

49
  • CAP code: www.stat.berkeley.edu/yugroup
  • Paper: www.stat.berkeley.edu/binyu
  • to appear in the Annals of
    Statistics
  • Thanks: Gallant Lab at UC Berkeley

50
Proof Ingredients
  • It can be shown that the MSPE decomposes into a
    BIAS term and a VARIANCE term
  • Results from random matrix theory can be applied
  • The BIAS term is a quadratic form in the sample
    covariance matrix
  • The VARIANCE term is an integral with respect to
    the empirical spectral distribution of the sample
    covariance matrix

51
Proof Ingredients
  • For bounded g and a unit p-vector v
  • 1.
  • 2.
  • (1) was shown in Silverstein (1989) and
    strengthened in Bai, Miao, and Pan (2007)
  • (2) is the Marchenko-Pastur result; the strongest
    version is in Bai and Silverstein (1995)

52
Grouping Examples, Case 1: Settings
  • Goals
  • Comparison of different group norms
  • Comparison of CV against AICc
  • Model: Y = Xβ + ε
  • Settings

β Coefficient Profile
53
Grouping Example, Case 1: LASSO vs. iCAP Sample
Paths
[Figure: LASSO and iCAP coefficient paths;
x-axis: number of steps, y-axis: normalized
coefficients]
54
Grouping Example, Case 1: Comparison of Norms and
Clusterings
[Figure: 10-fold CV model error]
55
ANOVA Hierarchical Selection: Number of Selected
Terms
56
(No Transcript)