Title: Maximum Likelihood Modeling with Gaussian Distributions for Classification
1. Maximum Likelihood Modeling with Gaussian Distributions for Classification
R. A. Gopinath, IBM T. J. Watson Research Center, ICASSP 1998
Presenter: Winston Lee
2. References
- S. R. Searle, Matrix Algebra Useful for Statistics, John Wiley and Sons, 1982.
- N. Kumar and A. G. Andreou, "Heteroscedastic Discriminant Analysis and Reduced Rank HMMs for Improved Speech Recognition," Speech Communication, 1998.
- E. Alpaydin, Introduction to Machine Learning, The MIT Press, 2004.
- R. A. Gopinath, "Constrained Maximum Likelihood Modeling with Gaussian Distributions."
- M. J. F. Gales, "Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition," Cambridge University, Cambridge, U.K., Tech. Rep. CUED/F-INFENG/TR291, 1997.
- M. J. F. Gales, "Semi-Tied Covariance Matrices for Hidden Markov Models," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 272-281, 1999.
- G. Saon et al., "Maximum Likelihood Discriminant Feature Spaces," in Proc. ICASSP 2000.
3. Outline
- Abstract
- Introduction
- ML Modeling
- Linear Transformation
- Single Class
- Multiclass
4. Abstract
- Maximum Likelihood (ML) modeling of multiclass data for classification often suffers from the following problems: a) data insufficiency, implying over-trained or unreliable models; b) large storage requirements; c) large computational requirements; and/or d) ML not being discriminating between classes. Sharing parameters across classes (or constraining the parameters) clearly tends to alleviate the first three problems.
- In this paper we show that in some cases it can also lead to better discrimination (as evidenced by reduced misclassification error). The parameters considered are the means and variances of the Gaussians and linear transformations of the feature space (or equivalently of the Gaussian means).
- Some constraints on the parameters are shown to lead to Linear Discriminant Analysis (a well-known result) while others are shown to lead to optimal feature spaces (a relatively new result). Applications of some of these ideas to the speech recognition problem are also given.
5. Introduction
- Why Gaussians when modeling data?
  - Any distribution can be approximated by Gaussian mixtures.
  - A rich set of mathematical results is available for Gaussians.
- How to model the labeled training data well for classification?
  - Assumption: the training and test data have the same distribution.
  - So, just model the training data as well as possible.
  - Criterion: the Maximum Likelihood (ML) principle, i.e., search for the parameters that maximize the likelihood of the sample.
- The main idea
  - In constrained ML modeling (e.g., diagonal covariance), there are optimal feature spaces.
  - Kumar's HLDA.
  - Gales' efficient algorithm.
6. ML Modeling
- Let the labeled training data be $\{(x_n, l_n)\}_{n=1}^{N}$, where $N$ is the number of data samples, $x_n \in \mathbb{R}^d$ ($d$ the dimension), and $l_n \in \{1, \dots, J\}$ is the class label. The likelihood of the training data is given by
  $L(\{\mu_j, \Sigma_j\}) = \prod_{n=1}^{N} \mathcal{N}(x_n; \mu_{l_n}, \Sigma_{l_n}).$
- The idea is to choose the parameters $\mu_j$ and $\Sigma_j$ so as to maximize $L$.
7. ML Modeling (cont.)
- The log-likelihood can be expressed as follows, in terms of the sample mean $\bar{x}_j$ and sample covariance $S_j$ of class $j$:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left[ d\log(2\pi) + \log|\Sigma_j| + \mathrm{tr}(\Sigma_j^{-1} S_j) + (\bar{x}_j - \mu_j)^\top \Sigma_j^{-1} (\bar{x}_j - \mu_j) \right],$
  where $\bar{x}_j = \frac{1}{N_j}\sum_{n: l_n = j} x_n$ and $S_j = \frac{1}{N_j}\sum_{n: l_n = j} (x_n - \bar{x}_j)(x_n - \bar{x}_j)^\top$.
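As an illustration of these estimates, here is a minimal numpy sketch (the synthetic data, labels, and function names are mine, not from the paper) that computes the per-class sample means and covariances and the resulting log-likelihood:

```python
import numpy as np

def class_stats(X, labels):
    """Per-class ML estimates: sample mean and (biased) sample covariance of each class."""
    stats = {}
    for j in np.unique(labels):
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)                      # sample mean of class j
        Sj = (Xj - mu_j).T @ (Xj - mu_j) / len(Xj)  # ML sample covariance of class j
        stats[j] = (len(Xj), mu_j, Sj)
    return stats

def log_likelihood(X, labels, stats):
    """Total log-likelihood of the labeled data under the per-class Gaussian models."""
    d = X.shape[1]
    total = 0.0
    for j, (Nj, mu_j, Sj) in stats.items():
        diff = X[labels == j] - mu_j
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sj), diff)
        total -= 0.5 * (Nj * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sj))) + quad.sum())
    return total

# Toy usage with two synthetic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.5, (120, 3))])
labels = np.array([0] * 100 + [1] * 120)
stats = class_stats(X, labels)
print(log_likelihood(X, labels, stats))   # at the ML estimates, quad.sum() equals Nj * d
```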
8. Linear Transformation
- Now consider linearly transforming the samples from each class: $y = A_j x$, where $A_j$ is a nonsingular $d \times d$ matrix.
- $y$ can also be modeled with Gaussians. Why?
- Proof: the moment generating function of a normal distribution $\mathcal{N}(\mu, \Sigma)$ is $M_x(t) = \exp(t^\top \mu + \frac{1}{2} t^\top \Sigma t)$. The moment generating function of $y = A_j x$ is $M_y(t) = M_x(A_j^\top t) = \exp(t^\top A_j \mu + \frac{1}{2} t^\top A_j \Sigma A_j^\top t)$, so $y$ is Gaussian with new mean $A_j \mu$ and new covariance $A_j \Sigma A_j^\top$.
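A quick numerical check of this fact, a sketch with arbitrary made-up values of mu, Sigma, and A; with enough samples the empirical moments of y should match A mu and A Sigma A^T up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)     # an arbitrary valid covariance
A = rng.normal(size=(d, d))         # nonsingular with probability 1

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T                         # y = A x applied to every sample

# Empirical mean and covariance of y versus the predicted A mu and A Sigma A^T.
print(np.allclose(Y.mean(axis=0), A @ mu, atol=0.05))
print(np.allclose(np.cov(Y.T, bias=True), A @ Sigma @ A.T, atol=0.3))
```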
9. Linear Transformation (cont.)
- The problem of scaling
  - The choice of $A_j$ can make the likelihoods of a test data sample difficult to compare: the likelihood of data from class $j$ may be made arbitrarily large.
- Two approaches to compare likelihoods
  - First, ensure that, for every class, the likelihood of the data corresponding to that class is the same in the original and transformed spaces (implying $|\det A_j| = 1$).
  - Second, only consider the likelihood in the original space even though the data is modeled in the transformed space. (Kumar's starting point in his paper.)
12. Linear Transformation (cont.)
- Why model the transformed data $y = Ax$ rather than $x$ itself?
  - If the data is modeled using full-covariance Gaussians, it makes no difference.
  - How about diagonal or block-diagonal Gaussians?
  - If we directly remove the off-diagonal entries, the correlations in the data cannot be accounted for explicitly (Ljolje, 1994).
  - The transformations can be used to find the basis in which this structural constraint on the variances is more valid, as evidenced from the data.
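A small sketch of this point: for strongly correlated data, the diagonal constraint is a poor fit in the original basis but costs nothing in the eigenbasis of the sample covariance (an orthogonal, hence unimodular, rotation chosen here purely for illustration; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(2)
# Strongly correlated 2-D data: a diagonal Gaussian is a poor fit in this basis.
Sigma = np.array([[4.0, 3.5],
                  [3.5, 4.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)
S = np.cov(X.T, bias=True)

# With a diagonal model the ML value is governed by log|diag(S)|;
# the smaller it is, the better the diagonal fit (see the single-class slides).
def diag_logdet(S):
    return np.sum(np.log(np.diag(S)))

# Rotate into the eigenbasis of S (an orthogonal, hence unimodular, transform).
eigvals, E = np.linalg.eigh(S)
A = E.T
S_rot = A @ S @ A.T

print("log|diag(S)|, original basis   :", diag_logdet(S))
print("log|diag(ASA^T)|, eigenbasis   :", diag_logdet(S_rot))
print("log|S| (full-covariance bound) :", np.log(np.linalg.det(S)))
```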
13. Single Class
- Ignore the class labels and model the entire data with one Gaussian $\mathcal{N}(\mu, \Sigma)$.
- By ML estimation, $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ (the sample mean and sample covariance of all the data), and the ML value of the training data is
  $\log L = -\frac{N}{2}\left( d\log(2\pi) + \log|S| + d \right).$
- On average, each sample contributes $-\frac{1}{2}\left( d\log(2\pi) + \log|S| + d \right)$ to the ML value.
14. Single Class ML Estimates
15. Single Class Linear Transformation
- Consider a global non-singular linear transformation of the data, $y = Ax$.
- By ML estimation, $\hat{\mu}_y = A\bar{x}$ and $\hat{\Sigma}_y = ASA^\top$, and the ML value of the training data is
  $\log L_y = -\frac{N}{2}\left( d\log(2\pi) + \log|ASA^\top| + d \right) = \log L_x - N\log|\det A|.$
- If $|\det A| = 1$, then $\log L_y = \log L_x$. Such a matrix is called unimodular, or a volume-preserving linear transformation.
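A short numerical check of this invariance, using random synthetic data and a randomly generated matrix rescaled so that |det A| = 1 (all values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([1.0, 2.0, 3.0]), size=2000)

def ml_value(X):
    """Single full-covariance Gaussian ML value: -N/2 (d log 2pi + log|S| + d)."""
    N, d = X.shape
    S = np.cov(X.T, bias=True)
    return -0.5 * N * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S)) + d)

# Build a unimodular (volume-preserving) matrix by rescaling a random matrix to |det A| = 1.
d = X.shape[1]
A = rng.normal(size=(d, d))
A = A / np.abs(np.linalg.det(A)) ** (1.0 / d)

print(np.isclose(abs(np.linalg.det(A)), 1.0))
print(ml_value(X), ml_value(X @ A.T))    # identical up to floating-point round-off
```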
16. Single Class Linear Transformation (MLE)
17. Single Class Diagonal Covariance
- If we are constrained to use a diagonal covariance model, the ML estimators will be $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = \mathrm{diag}(S)$.
- The ML value is
  $\log L_{\mathrm{diag}} = -\frac{N}{2}\left( d\log(2\pi) + \log|\mathrm{diag}(S)| + d \right).$
- Because of the diagonal constraint, the ML value cannot exceed the unconstrained (full-covariance) ML value, which implies $|\mathrm{diag}(S)| \ge |S|$: another proof of Hadamard's inequality.
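A tiny numerical illustration (with a random positive-definite matrix standing in for the sample covariance) of Hadamard's inequality and the resulting per-sample likelihood loss:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
L = rng.normal(size=(d, d))
S = L @ L.T + 0.1 * np.eye(d)     # a random positive-definite "sample covariance"

# Hadamard's inequality for positive-definite S: det(S) <= product of diagonal entries,
# i.e. log|S| <= log|diag(S)|, so the diagonal constraint can only lower the ML value.
logdet_full = np.log(np.linalg.det(S))
logdet_diag = np.sum(np.log(np.diag(S)))
print(logdet_full <= logdet_diag + 1e-12)        # True

# Per-sample loss in log-likelihood caused by the diagonal constraint.
print(0.5 * (logdet_diag - logdet_full))
```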
18. Single Class Diagonal Covariance (MLE)
19. Single Class Diagonal Covariance (LT)
- If one linearly transforms the data and models it using a diagonal Gaussian, the ML value is
  $\log L(A) = -\frac{N}{2}\left( d\log(2\pi) + \log|\mathrm{diag}(ASA^\top)| + d \right).$
- One can maximize the ML value over $A$ to obtain the best feature space in which to model with the diagonal covariance constraint.
- Note that $A$ is still supposed to be unimodular.
- One optimal choice of $A$ is $A = E^\top$, where $S = E \Lambda E^\top$ is the eigendecomposition of $S$: then $ASA^\top = \Lambda$ is already diagonal, so there is no loss in likelihood!
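A sketch of this choice, assuming the eigenvector-rotation reading of the slide's "no loss in likelihood" remark (the matrix S below is a random made-up sample covariance):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
L = rng.normal(size=(d, d))
S = L @ L.T + 0.5 * np.eye(d)     # sample covariance of the (single-class) data

# Take A to be the transposed eigenvector matrix of S: it is orthogonal
# (so |det A| = 1, i.e. unimodular) and it diagonalizes S exactly.
eigvals, E = np.linalg.eigh(S)
A = E.T
S_t = A @ S @ A.T

# In the new basis the diagonal constraint costs nothing:
# log|diag(A S A^T)| equals log|S|.
print(np.isclose(np.sum(np.log(np.diag(S_t))), np.log(np.linalg.det(S))))
print(np.isclose(abs(np.linalg.det(A)), 1.0))
```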
20. Multiclass
- In this case the training data is modeled with a Gaussian for each class $j$: one can split the data into $J$ classes and model each one separately.
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$, $\hat{\Sigma}_j = S_j$.
- The ML value:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|S_j| + d \right).$
- Note that there is no interaction between the classes, and therefore unconstrained ML modeling is not discriminating.
- For the same reason, unimodular transformations cannot help obtain better classification here.
21. Multiclass Diagonal Covariance
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$, $\hat{\Sigma}_j = \mathrm{diag}(S_j)$.
- The ML value:
  $\log L = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(S_j)| + d \right).$
- If one linearly transforms the data with a unimodular $A$ and models it using diagonal Gaussians, the ML value is
  $\log L(A) = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A S_j A^\top)| + d \right).$
- The main task is to choose $A$.
22. Multiclass Some Issues
- If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable (the MLE of each covariance may deviate substantially from the population covariance).
- The storage requirement for the model: each class needs a full $d \times d$ covariance, i.e., $d(d+1)/2$ parameters per class.
- The computational requirement: evaluating a full-covariance Gaussian likelihood costs $O(d^2)$ per class.
- The parameters for each class are obtained independently; the ML principle does not allow for discrimination between classes.
23. Multiclass Some Issues (cont.)
- Sharing parameters across classes
  - reduces the number of parameters,
  - reduces storage requirements,
  - reduces computational requirements,
  - but comes with a loss in likelihood.
- That it "is more discriminating, leading to better classifiers" is hard to justify in general. But we can appeal to Fisher's criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating.
24. Multiclass Some Issues (cont.)
- Solution
  - We can globally transform the data with a unimodular matrix $A$ and model the transformed data with diagonal Gaussians. (There is a loss in likelihood here too.)
  - Among all possible transformations $A$, we choose the one that incurs the least loss in likelihood. (In essence, we find a linearly transformed, shared feature space in which the diagonal Gaussian assumption is most valid.)
25. Multiclass Equal Covariance
- Here all the covariances are assumed to be equal: $\Sigma_j = \Sigma$ for all $j$ (the equal covariance of each class).
- Notice that the per-class sample covariances $S_j$ need not be equal; the equality constraint is imposed on the model, not assumed of the data a priori.
- The ML estimators are derived on the next slide.
26. Multiclass Equal Covariance (cont.)
- The ML estimator of the mean is the class sample mean, $\hat{\mu}_j = \bar{x}_j$.
- The ML estimate of the shared covariance is the within-class scatter
  $\hat{\Sigma} = W = \frac{1}{N}\sum_{j=1}^{J} N_j S_j.$
27. Multiclass Equal Covariance (cont.)
- Each sample on average contributes $-\frac{1}{2}\left( d\log(2\pi) + \log|W| + d \right)$ to the log-likelihood, so the ML value is
  $\log L = -\frac{N}{2}\left( d\log(2\pi) + \log|W| + d \right).$
- The sample covariance of the entire data decomposes as $T = W + B$, where $W$ is the within-class scatter and $B = \frac{1}{N}\sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^\top$ is the between-class scatter.
- From this decomposition one can derive a bound relating the constrained and unconstrained ML values; the argument uses the fact that the geometric mean is smaller than or equal to the arithmetic mean.
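A short sketch (class means, shared covariance, and counts invented for illustration) verifying the scatter decomposition and the equal-covariance ML estimate:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two classes sharing one covariance but with different means.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
X = np.vstack([rng.multivariate_normal([0, 0], Sigma, 300),
               rng.multivariate_normal([3, 1], Sigma, 500)])
labels = np.array([0] * 300 + [1] * 500)
N, d = X.shape

xbar = X.mean(axis=0)
W = np.zeros((d, d))            # within-class scatter = ML estimate of the shared covariance
B = np.zeros((d, d))            # between-class scatter
for j in np.unique(labels):
    Xj = X[labels == j]
    Nj, mj = len(Xj), Xj.mean(axis=0)
    W += (Xj - mj).T @ (Xj - mj) / N
    B += Nj / N * np.outer(mj - xbar, mj - xbar)

T = (X - xbar).T @ (X - xbar) / N      # sample covariance of the entire data
print(np.allclose(T, W + B))           # the decomposition T = W + B
print(np.log(np.linalg.det(W)) <= np.log(np.linalg.det(T)))  # |W| <= |T|
```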
28. Multiclass Equal Covariance Clusters
- Classes are organized into clusters, and each cluster is modeled with either a single mean, or a collection of means, and a single covariance.
- In the former case, the data can be relabeled using cluster labels, and ML estimates and ML values can be obtained as before for the full-covariance multiclass case.
- In the latter case, the data can be split into K groups, and this essentially becomes the equal-covariance problem for each group.
29. Multiclass Diagonal Covariance Clusters
- Again classes are grouped into clusters. Each cluster $k$ is modeled with diagonal Gaussians in its own transformed feature space $y = A_k x$.
- The ML estimators in the original feature space: $\hat{\mu}_j = \bar{x}_j$ and $\hat{\Sigma}_j = A_k^{-1}\,\mathrm{diag}(A_k S_j A_k^\top)\,A_k^{-\top}$ for class $j$ in cluster $k$.
- The ML value:
  $\log L(\{A_k\}) = -\frac{1}{2}\sum_{k}\sum_{j \in k} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A_k S_j A_k^\top)| + d \right).$
- One can choose the best feature space for each class cluster by maximizing over the $A_k$.
- Notice that the $A_k$ for each class cluster is obtained independently.
30. Multiclass One Cluster
- When the number of clusters is one, there is a single global transformation $A$ and the classes are modeled as diagonal Gaussians in this feature space.
- Note that we have three sets of parameters to optimize: the transform $A$, the means $\mu_j$, and the diagonal covariances $\Sigma_j$.
- The ML estimators: $\hat{\mu}_j = \bar{x}_j$ and $\hat{\Sigma}_j = \mathrm{diag}(A S_j A^\top)$ in the transformed space.
- The ML value:
  $\log L(A) = -\frac{1}{2}\sum_{j=1}^{J} N_j \left( d\log(2\pi) + \log|\mathrm{diag}(A S_j A^\top)| + d \right) + N \log|\det A|.$
- The optimal $A$ can be obtained by optimizing the objective function on the next slide.
31. Multiclass One Cluster (cont.)
- Optimization: the numerical approach.
- The objective function (dropping constants):
  $F(A) = N \log|\det A| - \frac{1}{2}\sum_{j=1}^{J} N_j \log|\mathrm{diag}(A S_j A^\top)|.$
- Differentiating $F$ with respect to $A$, we get the derivative
  $G = N A^{-\top} - \sum_{j=1}^{J} N_j\, \mathrm{diag}(A S_j A^\top)^{-1} A S_j.$
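A minimal sketch of this numerical approach: the function name mllt_objective_and_grad, the random class statistics, and the naive backtracking ascent loop are all mine, standing in for the proper numerical optimizers the slide alludes to.

```python
import numpy as np

def mllt_objective_and_grad(A, Ns, Ss):
    """F(A) = N log|det A| - 1/2 sum_j N_j log|diag(A S_j A^T)| and its gradient G.
    Ns: per-class sample counts; Ss: per-class sample covariances (original space)."""
    N = sum(Ns)
    F = N * np.log(abs(np.linalg.det(A)))
    G = N * np.linalg.inv(A).T
    for Nj, Sj in zip(Ns, Ss):
        ASA = A @ Sj @ A.T
        dinv = 1.0 / np.diag(ASA)
        F -= 0.5 * Nj * np.sum(np.log(np.diag(ASA)))
        G -= Nj * (dinv[:, None] * (A @ Sj))      # diag(A S_j A^T)^{-1} A S_j
    return F, G

# Toy problem: four classes with random full covariances.
rng = np.random.default_rng(7)
d = 3
Ss = []
for _ in range(4):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.2 * np.eye(d))
Ns = [200, 150, 300, 250]

# Crude gradient ascent with backtracking; real systems use better optimizers.
A = np.eye(d)
step = 1e-3
for it in range(500):
    F, G = mllt_objective_and_grad(A, Ns, Ss)
    A_new = A + step * G / sum(Ns)                # scale-normalized ascent step
    F_new, _ = mllt_objective_and_grad(A_new, Ns, Ss)
    if F_new > F:
        A = A_new
    else:
        step *= 0.5                               # shrink the step if F did not improve
print(mllt_objective_and_grad(A, Ns, Ss)[0])
```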
32. Multiclass One Cluster (cont.)
- Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques and a full matrix, $S_j$, to be stored for each class.
- A more efficient algorithm from (Gales, TR291, 1997) can be used.
33. Multiclass Gales' Approach
- Notation: $S_j$ is the sample covariance of class $j$, $\sigma_{j,i}^2$ is the $i$-th entry of the diagonal variance of class $j$, $a_i$ is the $i$-th row vector of $A$, and $c_i$ is the $i$-th row vector of the matrix of cofactors of $A$.
- Algorithm
  1. Estimate the means (sample means), which are independent of the other model parameters.
  2. Use the current estimate of the transform $A$ to estimate the set of class-specific diagonal variances: $\sigma_{j,i}^2 = a_i S_j a_i^\top$.
  3. Estimate the transform $A$ using the current set of diagonal covariances, updating one row at a time:
     $a_i = c_i G_i^{-1} \sqrt{\frac{N}{c_i G_i^{-1} c_i^\top}}, \qquad G_i = \sum_{j=1}^{J} \frac{N_j}{\sigma_{j,i}^2} S_j.$
  4. Go to 2) until convergence, or an appropriate criterion is satisfied.
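A compact sketch of this alternating scheme, following the row-wise update usually attributed to Gales; the function name gales_mllt, the toy covariances, and the fixed iteration counts are mine, and this is a sketch under those assumptions rather than the report's reference implementation.

```python
import numpy as np

def gales_mllt(Ss, Ns, n_outer=20, n_row_passes=5):
    """Alternating estimation of a shared transform A for diagonal-covariance modeling.
    Ss: per-class sample covariances S_j (original space); Ns: per-class counts."""
    d = Ss[0].shape[0]
    N = float(sum(Ns))
    A = np.eye(d)
    for _ in range(n_outer):
        # Step 2: class-specific diagonal variances sigma^2_{j,i} = a_i S_j a_i^T.
        sigma2 = np.array([[A[i] @ Sj @ A[i] for i in range(d)] for Sj in Ss])  # shape [J, d]
        # Step 3: re-estimate A row by row given the diagonal variances.
        for _ in range(n_row_passes):
            for i in range(d):
                G_i = sum(Nj / sigma2[j, i] * Sj
                          for j, (Nj, Sj) in enumerate(zip(Ns, Ss)))
                c_i = np.linalg.det(A) * np.linalg.inv(A).T[i]  # i-th row of the cofactor matrix
                v = c_i @ np.linalg.inv(G_i)
                A[i] = v * np.sqrt(N / (v @ c_i))               # a_i = c_i G_i^{-1} sqrt(N / c_i G_i^{-1} c_i^T)
    return A

# Toy usage with three random full-covariance classes.
rng = np.random.default_rng(8)
d = 3
Ss = []
for _ in range(3):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.3 * np.eye(d))
Ns = [400, 250, 350]
print(gales_mllt(Ss, Ns))
```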
34. Multiclass Gales' Approach (Appendix)
- Let's go back to the log-likelihood of the training data, written with an explicit transform $A$ and class-specific diagonal variances $\sigma_{j,i}^2$:
  $\log L = \sum_{j=1}^{J} N_j \left[ \log|\det A| - \frac{d}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{d} \log \sigma_{j,i}^2 - \frac{1}{2}\sum_{i=1}^{d} \frac{a_i S_j a_i^\top}{\sigma_{j,i}^2} \right].$
35. Multiclass Gales' Approach (Appendix)
- Differentiating with respect to $\sigma_{j,i}^2$ and equating to zero gives
  $\hat{\sigma}_{j,i}^2 = a_i S_j a_i^\top \quad \text{(a scalar)}.$
36. MLLT vs. HDA
- The maximum likelihood linear transformation (MLLT) aims at minimizing the loss in likelihood between full- and diagonal-covariance Gaussian models (Gopinath, 1998; Gales, 1999).
- The objective is to find a transformation that maximizes the log-likelihood difference of the data, i.e.,
  $\hat{A} = \arg\max_A \left\{ N \log|\det A| - \frac{1}{2}\sum_{j=1}^{J} N_j \log|\mathrm{diag}(A S_j A^\top)| \right\}.$
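To make the "loss in likelihood" concrete, here is a small sketch that evaluates, for a candidate transform A, the gap between diagonal- and full-covariance modeling that MLLT seeks to minimize; the name mllt_loss and the toy class statistics are illustrative, not from the papers.

```python
import numpy as np

def mllt_loss(A, Ns, Ss):
    """Loss in log-likelihood from the diagonal constraint in the space y = A x,
    relative to full-covariance modeling; MLLT picks A to minimize this quantity."""
    loss = 0.0
    for Nj, Sj in zip(Ns, Ss):
        ASA = A @ Sj @ A.T
        loss += 0.5 * Nj * (np.sum(np.log(np.diag(ASA))) - np.log(np.linalg.det(ASA)))
    return loss  # >= 0 by Hadamard's inequality; 0 iff every A S_j A^T is diagonal

# Toy statistics for three classes.
rng = np.random.default_rng(9)
d = 3
Ss = []
for _ in range(3):
    M = rng.normal(size=(d, d))
    Ss.append(M @ M.T + 0.2 * np.eye(d))
Ns = [300, 200, 400]
print(mllt_loss(np.eye(d), Ns, Ss))   # loss with no transform (identity A)
```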