Title: Maximum Likelihood Modeling with Gaussian Distributions for Classification
1. Maximum Likelihood Modeling with Gaussian Distributions for Classification
R. A. Gopinath, IBM T. J. Watson Research Center, ICASSP 1998
Presenter: Winston Lee
2. References
- S. R. Searle, Matrix Algebra Useful for Statistics, John Wiley and Sons, 1982.
- N. Kumar and A. G. Andreou, "Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition," Speech Communication, 1998.
- E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.
- R. A. Gopinath, "Constrained Maximum Likelihood Modeling with Gaussian Distributions."
- M. J. F. Gales, "Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition," Cambridge University, Cambridge, U.K., Tech. Rep. CUED/F-INFENG/TR291, 1997.
- M. J. F. Gales, "Semi-Tied Covariance Matrices for Hidden Markov Models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
- G. Saon et al., "Maximum Likelihood Discriminant Feature Spaces," in Proc. ICASSP 2000.
3. Outline
- Abstract
- Introduction
- ML Modeling
- Linear Transformation
- Single Class
- Multiclass
4. Abstract
- Maximum Likelihood (ML) modeling of multiclass data for classification often suffers from the following problems: a) data insufficiency, implying over-trained or unreliable models; b) large storage requirements; c) large computational requirements; and/or d) ML not being discriminating between classes. Sharing parameters across classes (or constraining the parameters) clearly tends to alleviate the first three problems.
- In this paper we show that in some cases it can also lead to better discrimination (as evidenced by reduced misclassification error). The parameters considered are the means and variances of the Gaussians and linear transformations of the feature space (or equivalently of the Gaussian means).
- Some constraints on the parameters are shown to lead to Linear Discriminant Analysis (a well-known result) while others are shown to lead to optimal feature spaces (a relatively new result). Applications of some of these ideas to the speech recognition problem are also given.
5. Introduction
- Why Gaussians when modeling data?
  - Any distribution can be approximated by Gaussian mixtures.
  - A rich set of mathematical results is available for Gaussians.
- How to model the labeled training data well for classification?
  - Assumption: the training and test data have the same distribution.
  - So, just model the training data as well as possible.
  - Criterion: the Maximum Likelihood (ML) principle, i.e., search for the parameters that maximize the likelihood of the sample.
- The main idea
  - In constrained ML modeling (e.g., diagonal covariance), there are optimal feature spaces.
  - Kumar's HLDA.
  - Gales' efficient algorithm.
6. ML Modeling
- Let the labeled training data be $\{(x_n, l_n)\}_{n=1}^{N}$, where $N$ is the number of data samples, $x_n \in \mathbb{R}^d$ ($d$ the dimension), and $l_n \in \{1, \dots, J\}$ is the class label. The likelihood of the training data is given by
  $L(\{\mu_j, \Sigma_j\}) = \prod_{n=1}^{N} \mathcal{N}(x_n; \mu_{l_n}, \Sigma_{l_n}).$
- The idea is to choose the parameters $\mu_j$ and $\Sigma_j$ so as to maximize $L$.
7. ML Modeling (cont.)
- The log-likelihood can be expressed as follows, in terms of the sample mean $\bar{x}_j$ and sample covariance $S_j$ of class $j$:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left[ d\log(2\pi) + \log|\Sigma_j| + \mathrm{tr}(\Sigma_j^{-1} S_j) + (\bar{x}_j - \mu_j)^\top \Sigma_j^{-1} (\bar{x}_j - \mu_j) \right],$
  where $\bar{x}_j = \frac{1}{N_j}\sum_{n: l_n = j} x_n$ and $S_j = \frac{1}{N_j}\sum_{n: l_n = j} (x_n - \bar{x}_j)(x_n - \bar{x}_j)^\top$.
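As an illustration of these estimates, here is a minimal numpy sketch (the synthetic data, labels, and function names are mine, not from the paper) that computes the per-class sample means and covariances and the resulting log-likelihood:

```python
import numpy as np

def class_stats(X, labels):
    """Per-class ML estimates: sample mean and (biased) sample covariance of each class."""
    stats = {}
    for j in np.unique(labels):
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)                      # sample mean of class j
        Sj = (Xj - mu_j).T @ (Xj - mu_j) / len(Xj)  # ML sample covariance of class j
        stats[j] = (len(Xj), mu_j, Sj)
    return stats

def log_likelihood(X, labels, stats):
    """Total log-likelihood of the labeled data under the per-class Gaussian models."""
    d = X.shape[1]
    total = 0.0
    for j, (Nj, mu_j, Sj) in stats.items():
        diff = X[labels == j] - mu_j
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sj), diff)
        total -= 0.5 * (Nj * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sj))) + quad.sum())
    return total

# Toy usage with two synthetic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.5, (120, 3))])
labels = np.array([0] * 100 + [1] * 120)
stats = class_stats(X, labels)
print(log_likelihood(X, labels, stats))   # at the ML estimates, quad.sum() equals Nj * d
```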
8. Linear Transformation
- Now consider linearly transforming the samples from each class: $y = A_j x$, where $A_j$ is a nonsingular $d \times d$ matrix.
- $y$ can also be modeled with Gaussians. Why?
- Proof: the moment generating function of a normal distribution $\mathcal{N}(\mu, \Sigma)$ is $M_x(t) = \exp(t^\top \mu + \frac{1}{2} t^\top \Sigma t)$. The moment generating function of $y = A_j x$ is $M_y(t) = M_x(A_j^\top t) = \exp(t^\top A_j \mu + \frac{1}{2} t^\top A_j \Sigma A_j^\top t)$, so $y$ is Gaussian with new mean $A_j \mu$ and new covariance $A_j \Sigma A_j^\top$.
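A quick numerical check of this fact, a sketch with arbitrary made-up values of mu, Sigma, and A; with enough samples the empirical moments of y should match A mu and A Sigma A^T up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)     # an arbitrary valid covariance
A = rng.normal(size=(d, d))         # nonsingular with probability 1

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T                         # y = A x applied to every sample

# Empirical mean and covariance of y versus the predicted A mu and A Sigma A^T.
print(np.allclose(Y.mean(axis=0), A @ mu, atol=0.05))
print(np.allclose(np.cov(Y.T, bias=True), A @ Sigma @ A.T, atol=0.3))
```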
9. Linear Transformation (cont.)
- The problem of scaling
  - The choice of $A_j$ can make the likelihoods of a test data sample difficult to compare: the likelihood of data from class $j$ may be made arbitrarily large.
- Two approaches to compare likelihoods
  - First, ensure that, for every class, the likelihood of the data corresponding to that class is the same in the original and transformed spaces (implying $|\det A_j| = 1$).
  - Second, only consider the likelihood in the original space even though the data is modeled in the transformed space. (Kumar's starting point in his paper.)
12. Linear Transformation (cont.)
- Why model the transformed data $y = Ax$ rather than $x$ itself?
  - If the data is modeled using full-covariance Gaussians, it makes no difference.
  - How about diagonal or block-diagonal Gaussians?
  - If we directly remove the off-diagonal entries, the correlations in the data cannot be accounted for explicitly (Ljolje, 1994).
  - The transformations can be used to find the basis in which this structural constraint on the variances is more valid, as evidenced from the data.
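A small sketch of this point: for strongly correlated data, the diagonal constraint is a poor fit in the original basis but costs nothing in the eigenbasis of the sample covariance (an orthogonal, hence unimodular, rotation chosen here purely for illustration; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
# Strongly correlated 2-D data: a diagonal Gaussian is a poor fit in this basis.
Sigma = np.array([[4.0, 3.5],
                  [3.5, 4.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)
S = np.cov(X.T, bias=True)

# With a diagonal model the ML value is governed by log|diag(S)|;
# the smaller it is, the better the diagonal fit (see the single-class slides).
def diag_logdet(S):
    return np.sum(np.log(np.diag(S)))

# Rotate into the eigenbasis of S (an orthogonal, hence unimodular, transform).
eigvals, E = np.linalg.eigh(S)
A = E.T
S_rot = A @ S @ A.T

print("log|diag(S)|, original basis   :", diag_logdet(S))
print("log|diag(ASA^T)|, eigenbasis   :", diag_logdet(S_rot))
print("log|S| (full-covariance bound) :", np.log(np.linalg.det(S)))
```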
13. Single Class
- Ignore the class labels and model the entire data with one Gaussian $\mathcal{N}(\mu, \Sigma)$.
- By ML estimation, $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ (the sample mean and sample covariance of all the data), and the ML value of the training data is
  $\log L = -\frac{N}{2}\left( d\log(2\pi) + \log|S| + d \right).$
- On average, each sample contributes $-\frac{1}{2}\left( d\log(2\pi) + \log|S| + d \right)$ to the ML value.
14. Single Class ML Estimates
15. Single Class Linear Transformation
- Consider a global non-singular linear transformation of the data, $y = Ax$.
- By ML estimation, $\hat{\mu}_y = A\bar{x}$ and $\hat{\Sigma}_y = ASA^\top$, and the ML value of the training data is
  $\log L_y = -\frac{N}{2}\left( d\log(2\pi) + \log|ASA^\top| + d \right) = \log L_x - N\log|\det A|.$
- If $|\det A| = 1$, then $\log L_y = \log L_x$. Such a matrix is called unimodular, or a volume-preserving linear transformation.
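A short numerical check of this invariance, using random synthetic data and a randomly generated matrix rescaled so that |det A| = 1 (all values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([1.0, 2.0, 3.0]), size=2000)

def ml_value(X):
    """Single full-covariance Gaussian ML value: -N/2 (d log 2pi + log|S| + d)."""
    N, d = X.shape
    S = np.cov(X.T, bias=True)
    return -0.5 * N * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S)) + d)

# Build a unimodular (volume-preserving) matrix by rescaling a random matrix to |det A| = 1.
d = X.shape[1]
A = rng.normal(size=(d, d))
A = A / np.abs(np.linalg.det(A)) ** (1.0 / d)

print(np.isclose(abs(np.linalg.det(A)), 1.0))
print(ml_value(X), ml_value(X @ A.T))    # identical up to floating-point round-off
```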
16. Single Class Linear Transformation (MLE)
17. Single Class Diagonal Covariance
- If we are constrained to use a diagonal covariance model, the ML estimators will be $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = \mathrm{diag}(S)$.
- The ML value is
  $\log L_{\mathrm{diag}} = -\frac{N}{2}\left( d\log(2\pi) + \log|\mathrm{diag}(S)| + d \right).$
- Because of the diagonal constraint, the ML value cannot exceed the unconstrained (full-covariance) ML value, which implies $|\mathrm{diag}(S)| \ge |S|$: another proof of Hadamard's inequality.
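A tiny numerical illustration (with a random positive-definite matrix standing in for the sample covariance) of Hadamard's inequality and the resulting per-sample likelihood loss:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
L = rng.normal(size=(d, d))
S = L @ L.T + 0.1 * np.eye(d)     # a random positive-definite "sample covariance"

# Hadamard's inequality for positive-definite S: det(S) <= product of diagonal entries,
# i.e. log|S| <= log|diag(S)|, so the diagonal constraint can only lower the ML value.
logdet_full = np.log(np.linalg.det(S))
logdet_diag = np.sum(np.log(np.diag(S)))
print(logdet_full <= logdet_diag + 1e-12)        # True

# Per-sample loss in log-likelihood caused by the diagonal constraint.
print(0.5 * (logdet_diag - logdet_full))
```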
18. Single Class Diagonal Covariance (MLE)
19. Single Class Diagonal Covariance (LT)
- If one linearly transforms the data and models it using a diagonal Gaussian, the ML value is
  $\log L(A) = -\frac{N}{2}\left( d\log(2\pi) + \log|\mathrm{diag}(ASA^\top)| + d \right).$
- One can maximize the ML value over $A$ to obtain the best feature space in which to model with the diagonal covariance constraint.
- Note that $A$ is still supposed to be unimodular.
- One optimal choice of $A$ is $A = E^\top$, where $S = E \Lambda E^\top$ is the eigendecomposition of $S$: then $ASA^\top = \Lambda$ is already diagonal, so there is no loss in likelihood!
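A sketch of this choice, assuming the eigenvector-rotation reading of the slide's "no loss in likelihood" remark (the matrix S below is a random made-up sample covariance):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
L = rng.normal(size=(d, d))
S = L @ L.T + 0.5 * np.eye(d)     # sample covariance of the (single-class) data

# Take A to be the transposed eigenvector matrix of S: it is orthogonal
# (so |det A| = 1, i.e. unimodular) and it diagonalizes S exactly.
eigvals, E = np.linalg.eigh(S)
A = E.T
S_t = A @ S @ A.T

# In the new basis the diagonal constraint costs nothing:
# log|diag(A S A^T)| equals log|S|.
print(np.isclose(np.sum(np.log(np.diag(S_t))), np.log(np.linalg.det(S))))
print(np.isclose(abs(np.linalg.det(A)), 1.0))
```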
20. Multiclass
- In this case the training data is modeled with a Gaussian for each class $j$: one can split the data into $J$ classes and model each one separately.
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$, $\hat{\Sigma}_j = S_j$.
- The ML value:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|S_j| + d \right).$
- Note that there is no interaction between the classes, and therefore unconstrained ML modeling is not discriminating.
- For the same reason, unimodular transformations cannot help obtain better classification here.
21. Multiclass Diagonal Covariance
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$, $\hat{\Sigma}_j = \mathrm{diag}(S_j)$.
- The ML value:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(S_j)| + d \right).$
- If one linearly transforms the data with a unimodular $A$ and models it using diagonal Gaussians, the ML value is
  $\log L(A) = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A S_j A^\top)| + d \right).$
- The main task is to choose $A$.
22. Multiclass Some Issues
- If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable (the MLE of each covariance may deviate substantially from the population covariance).
- The storage requirement for the model: each class needs a full $d \times d$ covariance, i.e., $d(d+1)/2$ parameters per class.
- The computational requirement: evaluating a full-covariance Gaussian likelihood costs $O(d^2)$ per class.
- The parameters for each class are obtained independently; the ML principle does not allow for discrimination between classes.
23. Multiclass Some Issues (cont.)
- Sharing parameters across classes
  - reduces the number of parameters,
  - reduces storage requirements,
  - reduces computational requirements,
  - but comes with a loss in likelihood.
- That it "is more discriminating, leading to better classifiers" is hard to justify in general. But we can appeal to Fisher's criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating.
24. Multiclass Some Issues (cont.)
- Solution
  - We can globally transform the data with a unimodular matrix $A$ and model the transformed data with diagonal Gaussians. (There is a loss in likelihood here too.)
  - Among all possible transformations $A$, we choose the one that incurs the least loss in likelihood. (In essence, we find a linearly transformed, shared feature space in which the diagonal Gaussian assumption is most valid.)
25. Multiclass Equal Covariance
- Here all the covariances are assumed to be equal: $\Sigma_j = \Sigma$ for all $j$ (the equal covariance of each class).
- Notice that the per-class sample covariances $S_j$ need not be equal; the equality constraint is imposed on the model, not assumed of the data a priori.
- The ML estimators are derived on the next slide.
26. Multiclass Equal Covariance (cont.)
- The ML estimator of the mean is the class sample mean, $\hat{\mu}_j = \bar{x}_j$.
- The ML estimate of the shared covariance is the within-class scatter
  $\hat{\Sigma} = W = \frac{1}{N}\sum_{j=1}^{J} N_j S_j.$
27. Multiclass Equal Covariance (cont.)
- Each sample on average contributes $-\frac{1}{2}\left( d\log(2\pi) + \log|W| + d \right)$ to the log-likelihood, so the ML value is
  $\log L = -\frac{N}{2}\left( d\log(2\pi) + \log|W| + d \right).$
- The sample covariance of the entire data decomposes as $T = W + B$, where $W$ is the within-class scatter and $B = \frac{1}{N}\sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^\top$ is the between-class scatter.
- From this decomposition one can derive a bound relating the constrained and unconstrained ML values; the argument uses the fact that the geometric mean is smaller than or equal to the arithmetic mean.
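A short sketch (class means, shared covariance, and counts invented for illustration) verifying the scatter decomposition and the equal-covariance ML estimate:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two classes sharing one covariance but with different means.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], Sigma, 300),
               rng.multivariate_normal([3, 1], Sigma, 500)])
labels = np.array([0] * 300 + [1] * 500)
N, d = X.shape

xbar = X.mean(axis=0)
W = np.zeros((d, d))            # within-class scatter = ML estimate of the shared covariance
B = np.zeros((d, d))            # between-class scatter
for j in np.unique(labels):
    Xj = X[labels == j]
    Nj, mj = len(Xj), Xj.mean(axis=0)
    W += (Xj - mj).T @ (Xj - mj) / N
    B += Nj / N * np.outer(mj - xbar, mj - xbar)

T = (X - xbar).T @ (X - xbar) / N      # sample covariance of the entire data
print(np.allclose(T, W + B))           # the decomposition T = W + B
print(np.log(np.linalg.det(W)) <= np.log(np.linalg.det(T)))  # |W| <= |T|
```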
28. Multiclass Equal Covariance Clusters
- Classes are organized into clusters, and each cluster is modeled with either a single mean, or a collection of means, and a single covariance.
- In the former case, the data can be relabeled using cluster labels, and ML estimates and ML values can be obtained as before for the full-covariance multiclass case.
- In the latter case, the data can be split into K groups, and this essentially becomes the equal-covariance problem for each group.
29. Multiclass Diagonal Covariance Clusters
- Again classes are grouped into clusters. Each cluster $k$ is modeled with diagonal Gaussians in its own transformed feature space $y = A_k x$.
- The ML estimators in the original feature space: $\hat{\mu}_j = \bar{x}_j$ and $\hat{\Sigma}_j = A_k^{-1}\,\mathrm{diag}(A_k S_j A_k^\top)\,A_k^{-\top}$ for class $j$ in cluster $k$.
- The ML value:
  $\log L(\{A_k\}) = -\frac{1}{2}\sum_{k}\sum_{j \in k} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A_k S_j A_k^\top)| + d \right).$
- One can choose the best feature space for each class cluster by maximizing over the $A_k$.
- Notice that the $A_k$ for each class cluster is obtained independently.
30. Multiclass One Cluster
- When the number of clusters is one, there is a single global transformation $A$ and the classes are modeled as diagonal Gaussians in this feature space.
- Note that we have three sets of parameters to optimize: the transform $A$, the means $\mu_j$, and the diagonal covariances $\Sigma_j$.
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$ and $\hat{\Sigma}_j = \mathrm{diag}(A S_j A^\top)$ in the transformed space.
- The ML value:
  $\log L(A) = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A S_j A^\top)| + d \right) + N \log|\det A|.$
- The optimal $A$ can be obtained by optimizing the objective function on the next slide.
31. Multiclass One Cluster (cont.)
- Optimization: the numerical approach.
- The objective function (dropping constants):
  $F(A) = N \log|\det A| - \frac{1}{2}\sum_{j=1}^{J} N_j \log|\mathrm{diag}(A S_j A^\top)|.$
- Differentiating $F$ with respect to $A$, we get the derivative
  $G = N A^{-\top} - \sum_{j=1}^{J} N_j\, \mathrm{diag}(A S_j A^\top)^{-1} A S_j.$
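A minimal sketch of this numerical approach: the function name mllt_objective_and_grad, the random class statistics, and the naive backtracking ascent loop are all mine, standing in for the proper numerical optimizers the slide alludes to.

```python
import numpy as np

def mllt_objective_and_grad(A, Ns, Ss):
    """F(A) = N log|det A| - 1/2 sum_j N_j log|diag(A S_j A^T)| and its gradient G.
    Ns: per-class sample counts; Ss: per-class sample covariances (original space)."""
    N = sum(Ns)
    F = N * np.log(abs(np.linalg.det(A)))
    G = N * np.linalg.inv(A).T
    for Nj, Sj in zip(Ns, Ss):
        ASA = A @ Sj @ A.T
        dinv = 1.0 / np.diag(ASA)
        F -= 0.5 * Nj * np.sum(np.log(np.diag(ASA)))
        G -= Nj * (dinv[:, None] * (A @ Sj))      # diag(A S_j A^T)^{-1} A S_j
    return F, G

# Toy problem: four classes with random full covariances.
rng = np.random.default_rng(7)
d = 3
Ss = []
for _ in range(4):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.2 * np.eye(d))
Ns = [200, 150, 300, 250]

# Crude gradient ascent with backtracking; real systems use better optimizers.
A = np.eye(d)
step = 1e-3
for it in range(500):
    F, G = mllt_objective_and_grad(A, Ns, Ss)
    A_new = A + step * G / sum(Ns)                # scale-normalized ascent step
    F_new, _ = mllt_objective_and_grad(A_new, Ns, Ss)
    if F_new > F:
        A = A_new
    else:
        step *= 0.5                               # shrink the step if F did not improve
print(mllt_objective_and_grad(A, Ns, Ss)[0])
```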
32. Multiclass One Cluster (cont.)
- Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques and a full matrix, $S_j$, to be stored for each class.
- A more efficient algorithm from (Gales, TR291, 1997) can be used.
33. Multiclass Gales' Approach
- Notation: $S_j$ is the sample covariance of class $j$, $\sigma_{j,i}^2$ is the $i$-th entry of the diagonal variance of class $j$, $a_i$ is the $i$-th row vector of $A$, and $c_i$ is the $i$-th row vector of the matrix of cofactors of $A$.
- Algorithm
  1. Estimate the means (sample means), which are independent of the other model parameters.
  2. Use the current estimate of the transform $A$ to estimate the set of class-specific diagonal variances: $\sigma_{j,i}^2 = a_i S_j a_i^\top$.
  3. Estimate the transform $A$ using the current set of diagonal covariances, updating one row at a time:
     $a_i = c_i G_i^{-1} \sqrt{\frac{N}{c_i G_i^{-1} c_i^\top}}, \qquad G_i = \sum_{j=1}^{J} \frac{N_j}{\sigma_{j,i}^2} S_j.$
  4. Go to 2) until convergence, or an appropriate criterion is satisfied.
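A compact sketch of this alternating scheme, following the row-wise update usually attributed to Gales; the function name gales_mllt, the toy covariances, and the fixed iteration counts are mine, and this is a sketch under those assumptions rather than the report's reference implementation.

```python
import numpy as np

def gales_mllt(Ss, Ns, n_outer=20, n_row_passes=5):
    """Alternating estimation of a shared transform A for diagonal-covariance modeling.
    Ss: per-class sample covariances S_j (original space); Ns: per-class counts."""
    d = Ss[0].shape[0]
    N = float(sum(Ns))
    A = np.eye(d)
    for _ in range(n_outer):
        # Step 2: class-specific diagonal variances sigma^2_{j,i} = a_i S_j a_i^T.
        sigma2 = np.array([[A[i] @ Sj @ A[i] for i in range(d)] for Sj in Ss])  # shape [J, d]
        # Step 3: re-estimate A row by row given the diagonal variances.
        for _ in range(n_row_passes):
            for i in range(d):
                G_i = sum(Nj / sigma2[j, i] * Sj
                          for j, (Nj, Sj) in enumerate(zip(Ns, Ss)))
                c_i = np.linalg.det(A) * np.linalg.inv(A).T[i]  # i-th row of the cofactor matrix
                v = c_i @ np.linalg.inv(G_i)
                A[i] = v * np.sqrt(N / (v @ c_i))               # a_i = c_i G_i^{-1} sqrt(N / c_i G_i^{-1} c_i^T)
    return A

# Toy usage with three random full-covariance classes.
rng = np.random.default_rng(8)
d = 3
Ss = []
for _ in range(3):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.3 * np.eye(d))
Ns = [400, 250, 350]
print(gales_mllt(Ss, Ns))
```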
34. Multiclass Gales' Approach (Appendix)
- Let's go back to the log-likelihood of the training data, written with an explicit transform $A$ and class-specific diagonal variances $\sigma_{j,i}^2$:
  $\log L = \sum_{j=1}^{J} N_j \left[ \log|\det A| - \frac{d}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{d} \log \sigma_{j,i}^2 - \frac{1}{2}\sum_{i=1}^{d} \frac{a_i S_j a_i^\top}{\sigma_{j,i}^2} \right].$
35. Multiclass Gales' Approach (Appendix)
- Differentiating with respect to $\sigma_{j,i}^2$ and equating to zero gives
  $\hat{\sigma}_{j,i}^2 = a_i S_j a_i^\top \quad \text{(a scalar)}.$
36. MLLT vs. HDA
- The maximum likelihood linear transformation (MLLT) aims at minimizing the loss in likelihood between full- and diagonal-covariance Gaussian models (Gopinath, 1998; Gales, 1999).
- The objective is to find a transformation that maximizes the log-likelihood difference of the data, i.e.,
  $\hat{A} = \arg\max_A \left\{ N \log|\det A| - \frac{1}{2}\sum_{j=1}^{J} N_j \log|\mathrm{diag}(A S_j A^\top)| \right\}.$
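To make the "loss in likelihood" concrete, here is a small sketch that evaluates, for a candidate transform A, the gap between diagonal- and full-covariance modeling that MLLT seeks to minimize; the name mllt_loss and the toy class statistics are illustrative, not from the papers.

```python
import numpy as np

def mllt_loss(A, Ns, Ss):
    """Loss in log-likelihood from the diagonal constraint in the space y = A x,
    relative to full-covariance modeling; MLLT picks A to minimize this quantity."""
    loss = 0.0
    for Nj, Sj in zip(Ns, Ss):
        ASA = A @ Sj @ A.T
        loss += 0.5 * Nj * (np.sum(np.log(np.diag(ASA))) - np.log(np.linalg.det(ASA)))
    return loss  # >= 0 by Hadamard's inequality; 0 iff every A S_j A^T is diagonal

# Toy statistics for three classes.
rng = np.random.default_rng(9)
d = 3
Ss = []
for _ in range(3):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.2 * np.eye(d))
Ns = [300, 200, 400]
print(mllt_loss(np.eye(d), Ns, Ss))   # loss with no transform (identity A)
```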