Title: Regression analysis
1Regression analysis
Relating two data matrices/tables to each other
Purpose: prediction and interpretation
Y-data
X-data
2Typical examples
- Spectroscopy: Predict chemistry from spectral measurements
- Product development: Relating sensory to chemistry data
- Marketing: Relating sensory data to consumer preferences
3Topics covered
- Simple linear regression
- The selectivity problem: a reason why multivariate methods are needed
- The collinearity problem: a reason why data compression is needed
- The outlier problem: why and how to detect
4Simple linear regression
- One y and one x. Use x to predict y.
- Use a linear model/equation and fit it by least
squares
5Data structure
X-variable: 2 4 1 . . .
Y-variable: 7 6 8 . . .
Objects: same number in the x- and y-columns
6Least squares (LS) used for estimation of regression coefficients
$y = b_0 + b_1 x + e$
(Figure: simple linear regression of y on x, showing the fitted line with intercept $b_0$ and slope $b_1$)
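A minimal sketch of the least-squares fit for the model y = b0 + b1 x + e, using the standard closed-form estimates. The function name `fit_line` and the four data points are made up for illustration and are not taken from the slides.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + e."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-products divided by sum of squares of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustrative data only: one x-column and one y-column, same number of objects.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1)
```

The residuals e are what is left of y after the fitted line is subtracted; with an intercept in the model they sum to zero.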
7Regression analysis
Interpretation
Outliers?
Pre-processing
8The selectivity problem: a reason why multivariate methods are needed
9Can be used for several Ys also
10Multiple linear regression
- Provides
  - predicted values
  - regression coefficients
  - diagnostics
- If there are many highly collinear variables
  - unstable regression equations
  - coefficients that are difficult to interpret (many and unstable)
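The three outputs listed above can be sketched with NumPy's least-squares solver. This is one possible illustration, not the software used in the course; `fit_mlr` is a hypothetical helper and the data are constructed so that the true coefficients are known.

```python
import numpy as np


def fit_mlr(X, y):
    """Multiple linear regression by least squares, with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(y)), X])
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef          # predicted values
    residuals = y - y_hat          # a basic diagnostic
    return coef, y_hat, residuals


# Illustrative, noise-free data: y = 1 + 2*x1 - 0.5*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
coef, y_hat, residuals = fit_mlr(X, y)
print(coef)
```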
11Collinearity, the problem of correlated X-variables
$y = b_0 + b_1 x_1 + b_2 x_2 + e$
Regression in this case is fitting a plane to the data (open circles)
The two x-variables are highly correlated. This leads to an unstable equation/plane (in the direction with little variability)
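The instability can be made concrete with the standard least-squares result that the coefficient covariance is sigma^2 (X'X)^-1. The sketch below compares the resulting standard errors for a nearly collinear pair of x-variables and an independent pair. The simulated data, the noise level `sigma` and the helper `coef_std` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2_collinear = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
x2_independent = rng.normal(size=n)             # unrelated to x1


def coef_std(x1, x2, sigma=1.0):
    """Standard errors of b1 and b2 when the noise e has standard deviation sigma."""
    X = np.column_stack([np.ones(len(x1)), x1, x2])
    cov = sigma ** 2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))[1:]            # skip the intercept


se_collinear = coef_std(x1, x2_collinear)
se_independent = coef_std(x1, x2_independent)
correlation = np.corrcoef(x1, x2_collinear)[0, 1]
print(correlation, se_collinear, se_independent)
```

With the collinear pair the data barely vary in the direction x1 - x2, so the tilt of the plane in that direction is poorly determined and the standard errors grow accordingly.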
12Possible solutions
- Select the most important wavelengths/variables (stepwise methods)
- Compress the variables to the most dominating dimensions (PCR, PLS)
- We will concentrate on the latter (can be combined)
13Data compression
- We will first discuss the situation with one y-variable
- Focus on ideas and principles
- Provides regression equation (as above) and plots for interpretation
14Model for data compression methods
Centred X and y
$X = T P^{T} + E$
$y = T q + f$
T: scores, carrier of information from X to y
P, q: loadings
E, f: residuals (noise)
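One way to obtain T, P and E for the X-part of the model is a singular value decomposition of the centred X, which gives principal component scores and loadings. This sketch assumes PCA-type components (PLS computes its scores differently); the function name and the random data are illustrative only.

```python
import numpy as np


def pca_scores_loadings(X, n_components):
    """Decompose centred X as T @ P.T + E using the first principal components."""
    Xc = X - X.mean(axis=0)                       # centre each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]    # scores
    P = Vt[:n_components].T                       # loadings (orthonormal columns)
    E = Xc - T @ P.T                              # residuals
    return T, P, E


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
T, P, E = pca_scores_loadings(X, n_components=2)
print(T.shape, P.shape, E.shape)
```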
15Regression by data compression
PC1
Regression on scores
16(Figure: three diagrams. MLR relates x1, x2, x3, x4 directly to y; PCR and PLS relate x1, x2, x3, x4 to y through the scores t1 and t2)
17PCR and PLS
- For each factor/component
  - PCR: maximize variance of linear combinations of X
  - PLS: maximize covariance between linear combinations of X and y
- Each factor is subtracted before the next is computed
18Principal component regression (PCR)
- Uses principal components
- Solves the collinearity problem, stable solutions
- Provides plots for interpretation (scores and loadings)
- Well understood
- Outlier diagnostics
- Easy to modify
- But uses only X to determine components
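A minimal PCR sketch under the usual assumptions: centre X and y, take the first principal components of X, regress y on the scores, and map the result back to coefficients for the original x-variables. The names `pcr_fit` and `pcr_predict` and the simulated data are hypothetical.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression; returns intercept b0 and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Components are determined from X alone; y plays no part in this step.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                     # loadings
    T = Xc @ P                                  # scores
    # The scores are orthogonal, so each q_a is a separate simple regression.
    q = (T.T @ yc) / (s[:n_components] ** 2)
    b = P @ q                                   # coefficients for the x-variables
    b0 = y_mean - x_mean @ b
    return b0, b


def pcr_predict(X, b0, b):
    return b0 + X @ b


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.0 + X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=20)
b0_2, b_2 = pcr_fit(X, y, n_components=2)   # compressed model
b0_3, b_3 = pcr_fit(X, y, n_components=3)   # all components retained
print(b_2, b_3)
```

When every component is retained nothing has been compressed, and the PCR coefficients coincide with those of ordinary multiple linear regression.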
20PLS-regression
- Easy to compute
- Stable solutions
- Provides scores and loadings
- Often fewer components than PCR
- Sometimes better predictions
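For one y-variable, PLS can be sketched with the classical NIPALS-style loop: each weight vector is proportional to X'y (the direction of maximum covariance with y), and each factor is subtracted before the next is computed. This is one common formulation, not necessarily the variant used in the course; `pls1_fit`, `rss` and the simulated data are illustrative assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """PLS regression for a single y; returns intercept b0 and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    n_vars = X.shape[1]
    W = np.zeros((n_vars, n_components))   # weights
    P = np.zeros((n_vars, n_components))   # X-loadings
    q = np.zeros(n_components)             # y-loadings
    for a in range(n_components):
        w = Xk.T @ yk                      # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                         # scores for this factor
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, p)           # subtract the factor from X
        yk = yk - q[a] * t                 # and from y
        W[:, a], P[:, a] = w, p
    b = W @ np.linalg.solve(P.T @ W, q)    # coefficients for the original x-variables
    b0 = y_mean - x_mean @ b
    return b0, b


def rss(X, y, b0, b):
    """Residual sum of squares on the given data."""
    return float(np.sum((y - b0 - X @ b) ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.0 + X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=20)
b0_1, b_1 = pls1_fit(X, y, n_components=1)
b0_3, b_3 = pls1_fit(X, y, n_components=3)
print(rss(X, y, b0_1, b_1), rss(X, y, b0_3, b_3))
```

Because y steers the choice of each component, a few PLS components may capture what PCR needs more components for. The calibration fit shown here always improves with more components, so it cannot be used to choose their number.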
21PCR and PLS for several Y-variables
- PCR is computed for each Y. Each Y is regressed onto the principal components
- PLS: The algorithm is easily modified. Maximises covariance between linear combinations of X and Y.
- For both methods: Regression equations and plots
22Validation is important
- Measure quality of the predictor
- Determine A, the number of components
- Compare methods
23Prediction testing
Calibration: Estimate coefficients
Testing/validation: Predict y, using the coefficients
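The two steps can be sketched as follows: coefficients are estimated on the calibration samples only, and the test samples are used only to measure prediction error. Ordinary least squares stands in for the predictor here, and the simulated data and the 25/15 split are arbitrary choices for illustration.

```python
import numpy as np


def rmsep(y_measured, y_predicted):
    """Root mean square error of prediction."""
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.sqrt(np.mean((y_measured - y_predicted) ** 2)))


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)

# Calibration: estimate coefficients from the calibration samples only.
X_cal, y_cal = X[:25], y[:25]
design_cal = np.column_stack([np.ones(len(y_cal)), X_cal])
coef, *_ = np.linalg.lstsq(design_cal, y_cal, rcond=None)

# Testing/validation: predict y for samples that were not used in calibration.
X_test, y_test = X[25:], y[25:]
y_pred = np.column_stack([np.ones(len(y_test)), X_test]) @ coef
error = rmsep(y_test, y_pred)
print(error)
```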
24Cross-validation
25Validation
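- Compute RMSEP, the root mean square error of prediction: $\mathrm{RMSEP} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$
- Plot RMSEP versus the number of components
- Choose the number of components with the best RMSEP properties
- Compare for different methods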
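A sketch of how the RMSEP curve might be computed by leave-one-out cross-validation, with PCR as the method. The simulated X has one dominant direction that is unrelated to y and a weaker one that carries the y-information, so one component is not enough. All names and data here are assumptions for illustration.

```python
import numpy as np


def pcr_coefficients(X, y, n_components):
    """PCR intercept and coefficients for the original x-variables."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T
    T = Xc @ P
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)
    b = P @ q
    return y_mean - x_mean @ b, b


def loo_rmsep(X, y, n_components):
    """Leave-one-out cross-validated RMSEP for a given number of components."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # calibrate without sample i
        b0, b = pcr_coefficients(X[keep], y[keep], n_components)
        errors[i] = y[i] - (b0 + X[i] @ b)       # predict the left-out sample
    return float(np.sqrt(np.mean(errors ** 2)))


rng = np.random.default_rng(2)
s1 = 3.0 * rng.normal(size=30)                   # dominant variation, unrelated to y
s2 = rng.normal(size=30)                         # weaker variation, related to y
mixing = np.array([[1, 1, 1, 1, 1, 1], [1, -1, 1, -1, 1, -1]]) / np.sqrt(6.0)
X = np.column_stack([s1, s2]) @ mixing + 0.01 * rng.normal(size=(30, 6))
y = 2.0 * s2 + 0.05 * rng.normal(size=30)

curve = {a: loo_rmsep(X, y, a) for a in range(1, 7)}
best = min(curve, key=curve.get)
print(curve, best)
```

In practice the curve is often flat near its minimum, and a smaller number of components with nearly the same RMSEP may be preferred over the exact minimiser.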
26RMSEP
MLR
NIR calibration of protein in wheat: 6 NIR wavelengths, 12 calibration samples, 26 test samples
27Estimation error
Model error
Conceptual illustration of important phenomena
28Prediction vs. cross-validation
- Prediction testing: Prediction ability of the predictor at hand. Requires much data.
- Cross-validation: Property of the method. Better for smaller data sets.
29Validation
- One should also plot measured versus predicted y-value
- Correlation can be computed, but can sometimes be misleading
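A small constructed example of why correlation alone can mislead: predictions with a systematic slope and offset error still correlate perfectly with the measured values, while the prediction error is large. The numbers are invented for illustration.

```python
import numpy as np

y_measured = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
# Predictions with the wrong slope and offset, but perfectly linear in y.
y_predicted = 2.0 * y_measured - 8.0

r = np.corrcoef(y_measured, y_predicted)[0, 1]
rmsep = np.sqrt(np.mean((y_measured - y_predicted) ** 2))
print(r, rmsep)
```

A plot of measured versus predicted y would show the points on a straight line that is far from the 45-degree line, which the correlation coefficient cannot reveal.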
30Example, plot of y versus predicted y
Plot of measured and predicted protein, NIR calibration
31Outlier detection
- Instrument error or noise
- Drift of signal (over time)
- Misprints
- Samples outside normal range (different
population)
32Outlier detection
- Outliers can be detected because there is a
  - Model for spectral data ($X = T P^{T} + E$)
  - Model for relationship between X and y ($y = T q + f$)
33Outlier detection tools
- Residuals
  - X and y-residuals
  - X-residuals as before; the y-residual is the difference between measured and predicted y
- Leverage
  - $h_i$
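The tools listed above can be sketched for a PCR-type model. The leverage formula used here, h_i = 1/N + sum over components of t_ia^2 / (t_a' t_a), is a commonly used score-based definition and is an assumption, since the slide's own formula is not in the transcript. The function name and the simulated data, with one sample deliberately shifted, are illustrative.

```python
import numpy as np


def outlier_diagnostics(X, y, n_components):
    """Leverage, X-residual size and y-residual per sample for a PCR-type model."""
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]     # scores
    P = Vt[:n_components].T                        # loadings
    q = (T.T @ yc) / (s[:n_components] ** 2)
    E = Xc - T @ P.T                               # X-residuals
    f = yc - T @ q                                 # y-residuals (measured minus predicted)
    # Assumed definition: h_i = 1/N + sum_a t_ia^2 / (t_a' t_a).
    leverage = 1.0 / n + np.sum(T ** 2 / (s[:n_components] ** 2), axis=1)
    x_residual_norm = np.sqrt(np.sum(E ** 2, axis=1))
    return leverage, x_residual_norm, f


rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
X[7] += 15.0                                       # one sample far outside the normal range
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 0.1 * rng.normal(size=20)
leverage, x_residual_norm, y_residual = outlier_diagnostics(X, y, n_components=2)
print(int(np.argmax(leverage)))
```

High leverage flags samples that are extreme within the model, while large X- or y-residuals flag samples that the model does not describe; the two are usually inspected together.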