Title: Chapter 4: Linear Methods for Classification
Chapter 4: Linear Methods for Classification
- Linear regression of an indicator matrix
- Linear discriminant analysis
- Logistic regression
- Separating hyperplanes
In this chapter, decision boundaries are linear.
4.2. Linear regression of an indicator matrix
Indicator response matrix: a response G with K classes is coded as an N x K indicator matrix Y, with Y_{ik} = 1 if observation i is in class k and 0 otherwise.
Example: 2 groups (K = 2) and 5 observations (N = 5), with observations 1 and 5 in group 1 and observations 2, 3 and 4 in group 2, giving
Y = (1 0; 0 1; 0 1; 0 1; 1 0)  (one row per observation, one column per group).
Fit a linear regression model to each column of Y (see Chapter 3 for linear regression):
Ŷ = X (X^T X)^{-1} X^T Y = X B,
where X is the model matrix (a column of 1's followed by the p inputs) and B is the (p+1) x K matrix of least-squares coefficients.
Classification of a new observation x:
- compute the fitted output f(x)^T = (1, x^T) B, a vector with one component per class
- identify the largest component and classify accordingly: Ĝ(x) = argmax_k f_k(x)
Justification: the regression fit is an estimate of the conditional expectation E(Y_k | X = x) = Pr(G = k | X = x), so it is natural to put x in group k when f_k(x) is the maximum over the classes G_k.
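A minimal numpy sketch of this procedure on the K = 2, N = 5 example above (the single input variable and its values are made up purely for illustration):

```python
import numpy as np

# Toy inputs for the 5 observations (one feature); values are illustrative only.
x = np.array([0.2, 1.1, 1.4, 0.9, 0.1])
X = np.column_stack([np.ones_like(x), x])        # model matrix with intercept column

# Indicator response matrix Y: observations 1 and 5 in group 1, observations 2-4 in group 2.
Y = np.array([[1, 0], [0, 1], [0, 1], [0, 1], [1, 0]], dtype=float)

# Least-squares fit to each column of Y: B = (X^T X)^{-1} X^T Y.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Classify a new observation: compute the fitted output and take the largest component.
x_new = np.array([1.0, 0.5])                     # (1, x) for a new point x = 0.5
f_hat = x_new @ B                                # length-K vector of fitted values
print("fitted values:", f_hat, "-> class", np.argmax(f_hat) + 1)
```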
Is linear regression a good estimate of the conditional expectation?
Problem: the fitted values can be negative or greater than 1 (in particular if the prediction is made outside the hull of the training data), so they are awkward to interpret as probabilities.
BUT the resulting classifications are often good.
Solution: linear regression onto a basis expansion h(X) of the inputs (see Chapter 5).
A more simplistic viewpoint: construct a target t_k for each class, where t_k is the kth column of the K x K identity matrix, fit the linear model by least squares,
min_B Σ_i || y_i - [(1, x_i^T) B]^T ||^2,
and classify a new observation to the closest target: Ĝ(x) = argmin_k || f(x) - t_k ||^2.
Problem: with K ≥ 3 classes, some classes can be masked by others (their fitted function never dominates).
Solution: a quadratic rather than linear fit (more generally, polynomial terms up to degree K - 1), as in the sketch below.
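A small sketch of the masking effect, assuming three well-separated classes along a single input (the data are simulated only to illustrate the point): with the linear fit the middle class is essentially never predicted, while adding a quadratic term recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three classes spread along a line: class means at -4, 0 and +4 (illustrative data).
x = np.concatenate([rng.normal(m, 1.0, 50) for m in (-4.0, 0.0, 4.0)])
g = np.repeat([0, 1, 2], 50)
Y = np.eye(3)[g]                                        # indicator response matrix

def fit_and_classify(H):
    """Least-squares fit of the indicator matrix on basis H, then argmax classification."""
    B, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return np.argmax(H @ B, axis=1)

H_lin = np.column_stack([np.ones_like(x), x])           # linear basis
H_quad = np.column_stack([np.ones_like(x), x, x**2])    # basis with quadratic term added

for name, H in [("linear", H_lin), ("quadratic", H_quad)]:
    pred = fit_and_classify(H)
    # With the linear basis the middle class is (almost) never predicted: it is masked.
    print(name, "training error:", np.mean(pred != g))
```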
4.3. Linear discriminant analysis
Let f_k(x) be the density of X in class G = k and π_k the prior probability of class k (Σ_k π_k = 1); by Bayes' theorem, Pr(G = k | X = x) = f_k(x) π_k / Σ_l f_l(x) π_l.
If the f_k(x) are Gaussian and the classes have a common covariance matrix Σ, the log-ratio
log [ Pr(G = k | X = x) / Pr(G = l | X = x) ] = log(π_k / π_l) - (1/2)(μ_k + μ_l)^T Σ^{-1} (μ_k - μ_l) + x^T Σ^{-1} (μ_k - μ_l)
is linear in x, so the decision boundaries are linear.
Discriminant function: δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
Classification: Ĝ(x) = argmax_k δ_k(x), with π_k, μ_k and Σ estimated from the training data.
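A numpy sketch of LDA with plug-in estimates of π_k, μ_k and Σ; the function and variable names are illustrative, not from the book.

```python
import numpy as np

def lda_fit(X, g):
    """Estimate class priors, class means and the pooled covariance from training data."""
    classes = np.unique(g)
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in classes])
    means = np.array([X[g == k].mean(axis=0) for k in classes])
    # Pooled within-class covariance (denominator N - K).
    Sigma = sum((X[g == k] - m).T @ (X[g == k] - m) for k, m in zip(classes, means))
    Sigma /= (N - len(classes))
    return classes, priors, means, Sigma

def lda_predict(x, classes, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    deltas = [x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pi)
              for m, pi in zip(means, priors)]
    return classes[np.argmax(deltas)]
```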
Remarks
- with 2 classes, linear discriminant analysis gives the same coefficient direction as classification by linear least squares
- with more than 2 classes, LDA avoids the masking problems of the regression approach
- if the classes do not share a common covariance matrix, the quadratic terms no longer cancel and we obtain quadratic discriminant analysis (QDA)
Regularized discriminant analysis (RDA)
A compromise between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
Regularized covariance matrices: Σ_k(α) = α Σ_k + (1 - α) Σ, with α in [0, 1], where Σ is the pooled covariance matrix used in LDA.
α is determined by cross-validation.
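A short sketch of the regularized covariance matrices of RDA; the choice of α by cross-validation is only indicated in a comment.

```python
import numpy as np

def rda_covariances(X, g, alpha):
    """Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma_pooled, alpha in [0, 1]."""
    classes = np.unique(g)
    Sigma_k = {k: np.cov(X[g == k], rowvar=False) for k in classes}
    # Pooled covariance as used by LDA (weighted by class sizes).
    N = len(g)
    Sigma = sum((np.sum(g == k) - 1) * Sigma_k[k] for k in classes) / (N - len(classes))
    return {k: alpha * Sigma_k[k] + (1 - alpha) * Sigma for k in classes}

# alpha = 1 gives the separate QDA covariances, alpha = 0 the pooled LDA covariance;
# in practice alpha would be chosen by cross-validation over a grid such as np.linspace(0, 1, 11).
```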
Computations
Computations for LDA and QDA are simplified by diagonalising the covariance matrices (eigen-decomposition Σ = U D U^T).
Algorithm:
- sphere the data X (using the eigen-decomposition of the common covariance matrix): X* = D^{-1/2} U^T X, so that the common covariance estimate becomes the identity
- classify in the transformed space to the closest class centroid, after correcting for the class priors log π_k
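A sketch of the sphering step based on the eigen-decomposition Σ = U D U^T (names are illustrative).

```python
import numpy as np

def sphere(X, Sigma):
    """Transform the rows of X so that the common covariance estimate becomes the identity."""
    eigvals, U = np.linalg.eigh(Sigma)           # Sigma = U diag(eigvals) U^T
    W = U @ np.diag(1.0 / np.sqrt(eigvals))      # whitening matrix: apply U^T then D^{-1/2}
    return X @ W                                 # X* with identity covariance

# In the sphered space, classification assigns x* to the closest class centroid,
# after correcting for the class prior probabilities log(pi_k).
```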
Reduced-rank linear discriminant analysis
Fisher: find the linear combination Z = a^T X such that the between-class variance is maximized relative to the within-class variance, i.e. maximize the Rayleigh quotient
max_a (a^T B a) / (a^T W a),
where B is the between-class covariance matrix and W the within-class covariance matrix.
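One way to compute the discriminant directions is as generalized eigenvectors of (B, W); this sketch uses scipy's symmetric generalized eigensolver rather than the explicit sphering of W used in the book, but the resulting directions are the same.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, g):
    """Maximize a^T B a / a^T W a: generalized eigenvectors of (B, W)."""
    classes = np.unique(g)
    mu = X.mean(axis=0)
    means = np.array([X[g == k].mean(axis=0) for k in classes])
    # Between-class scatter B and within-class scatter W.
    B = sum(np.sum(g == k) * np.outer(m - mu, m - mu) for k, m in zip(classes, means))
    W = sum((X[g == k] - m).T @ (X[g == k] - m) for k, m in zip(classes, means))
    eigvals, A = eigh(B, W)                      # solves B a = lambda W a
    order = np.argsort(eigvals)[::-1]            # largest Rayleigh quotient first
    return A[:, order]                           # columns are the discriminant directions
```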
4.4. Logistic regression
The model is specified by K - 1 log-odds or logit transformations:
log [ Pr(G = k | X = x) / Pr(G = K | X = x) ] = β_{k0} + β_k^T x, for k = 1, ..., K - 1.
Fitting logistic regression models
Usually by maximum likelihood, with the Newton-Raphson algorithm used to solve the score equations.
Example, K = 2 (2 groups): write p(x; β) = Pr(G = 1 | X = x) and encode the response as y_i = 1 when g_i = 1 and y_i = 0 when g_i = 2. The log-likelihood is
ℓ(β) = Σ_i [ y_i β^T x_i - log(1 + e^{β^T x_i}) ].
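A compact sketch of the Newton-Raphson iterations for the two-class case, maximizing the log-likelihood above (this is the iteratively reweighted least squares form discussed under inference below).

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Newton-Raphson for binary logistic regression; X includes a column of 1's."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities p(x_i; beta)
        W = p * (1 - p)                          # diagonal of the weight matrix
        # Score: X^T (y - p); Hessian: -X^T W X. Newton update:
        H = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    return beta
```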
Example: South African heart disease
Because of correlation between the set of predictors, there are some surprising results: variables that look important on their own are not included in the fitted logistic model.
Quadratic approximations and inference
- the quadratic approximation to the deviance gives the Pearson chi-square statistic
- if the model is correct, the maximum-likelihood estimator is consistent (it converges to the true β) and asymptotically normally distributed
- model building can use the Rao score test and the Wald test
Connection with least squares: the maximum-likelihood estimates of the logistic regression coefficients are the coefficients of a weighted least squares fit, β = (X^T W X)^{-1} X^T W z, with adjusted response z = Xβ + W^{-1}(y - p) and weights W = diag{p(x_i)(1 - p(x_i))}.
Differences between LDA and logistic regression
The two models have the same (linear logit) form, BUT they differ in the way the coefficients are estimated.
Logistic regression is more general and makes fewer assumptions (the marginal density of X is left arbitrary), so it is more robust; BUT the two give very similar results in practice.
4.5. Separating hyperplanes
Perceptrons: classifiers that compute a linear combination of the inputs and return its sign.
A hyperplane or affine set L is defined by the equation f(x) = β_0 + β^T x = 0 (a line in R^2).
Properties:
- β / ||β|| is the vector normal to the surface L
- for any point x_0 in L, β^T x_0 = -β_0
- the signed distance of any point x to L is given by (β^T x + β_0) / ||β|| = f(x) / ||f'(x)||
Rosenblatt's perceptron learning algorithm
Tries to find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary:
minimize D(β, β_0) = -Σ_{i in M} y_i (x_i^T β + β_0),
where M is the index set of misclassified points (responses coded y_i in {-1, +1}).
The algorithm uses stochastic gradient descent to minimize this piecewise linear criterion.
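A sketch of the perceptron updates under stochastic gradient descent, assuming responses coded y_i in {-1, +1} and a fixed learning rate ρ.

```python
import numpy as np

def perceptron(X, y, n_epochs=100, rho=1.0):
    """Stochastic gradient descent on D(beta, beta0) = -sum_{i in M} y_i (x_i^T beta + beta0)."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:    # point is currently misclassified (in M)
                beta = beta + rho * yi * xi      # gradient step for this single point
                beta0 = beta0 + rho * yi
    return beta, beta0
```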
Optimal separating hyperplanes
Find the hyperplane that separates the two classes and maximizes the margin, i.e. the distance from the decision boundary to the closest training point of either class.
Advantages over Rosenblatt's algorithm:
- unique solution
- better classification performance on test data
(Figure: comparison with the least squares boundary and with 2 solutions found by the perceptron algorithm from different random starts.)
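The optimal separating hyperplane can be computed in practice as a hard-margin support-vector classifier; as an illustrative sketch (not the book's own computation), a linear SVM with a very large cost parameter approximates it.

```python
import numpy as np
from sklearn.svm import SVC

# Separable toy data: two Gaussian clouds (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.repeat([-1, 1], 50)

# A linear SVM with a very large C approximates the hard-margin (maximum-margin) solution.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = clf.coef_[0], clf.intercept_[0]
margin = 1.0 / np.linalg.norm(beta)              # distance from the boundary to the closest point
print("beta:", beta, "beta0:", beta0, "margin:", margin)
```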