Maximum Likelihood Modeling with Gaussian Distributions for Classification

1
Maximum Likelihood Modeling with Gaussian
Distributions for Classification
R. A. Gopinath, IBM T. J. Watson Research
Center, ICASSP 1998
Presenter: Winston Lee
2
References
  • S. R. Searle, Matrix Algebra Useful for
    Statistics, John Wiley and Sons, 1982.
  • N. Kumar and A. G. Andreou, Heteroscedastic
    Discriminant Analysis and Reduced Rank HMMs for
    Improved Speech Recognition, Speech
    Communication, 1998.
  • E. Alpaydin, Introduction to Machine Learning,
    The MIT Press, 2004.
  • R. A. Gopinath, Constrained Maximum Likelihood
    Modeling with Gaussian Distributions.
  • M. J. F. Gales, Maximum Likelihood Linear
    Transformations for HMM-based Speech
    Recognition, Cambridge University, Cambridge, U.K.,
    Tech. Rep. CUED/F-INFENG/TR291, 1997.
  • M. J. F. Gales, Semi-Tied Covariance Matrices
    for Hidden Markov Models, IEEE Transactions on
    Speech and Audio Processing, 7:272-281, 1999.
  • G. Saon, et al., Maximum Likelihood
    Discriminant Feature Spaces, in Proc. ICASSP,
    2000.

3
Outline
  • Abstract
  • Introduction
  • ML Modeling
  • Linear Transformation
  • Single Class
  • Multiclass

4
Abstract
  • Maximum Likelihood (ML) modeling of multiclass
    data for classification often suffers from the
    following problems: a) data insufficiency,
    implying over-trained or unreliable models; b)
    large storage requirement; c) large computational
    requirement; and/or d) ML is not discriminating
    between classes. Sharing parameters across
    classes (or constraining the parameters) clearly
    tends to alleviate the first three problems.
  • In this paper we show that in some cases it can
    also lead to better discrimination (as evidenced
    by reduced misclassification error). The
    parameters considered are the means and variances
    of the Gaussians and linear transformations of
    the feature space (or equivalently the Gaussian
    means).
  • Some constraints on the parameters are shown to
    lead to Linear Discriminant Analysis (a
    well-known result) while others are shown to lead
    to optimal feature spaces (a relatively new
    result). Applications of some of these ideas to
    the speech recognition problem are also given.

5
Introduction
  • Why Gaussians when modeling data?
  • Any distribution can be approximated by Gaussian
    mixtures.
  • A rich set of mathematical results is available
    for Gaussians.
  • How to model the labeled training data well for
    classification?
  • Assumption: the training and test data have the
    same distribution.
  • So, just model the training data as well as
    possible.
  • Criterion: the Maximum Likelihood (ML) principle.
  • The main idea:
  • In constrained ML modeling (e.g., diagonal
    covariance), there are optimal feature spaces.
  • Kumar's HLDA.
  • Gales' efficient algorithm.

ML principle: search for the parameters that maximize the
likelihood of the sample.
6
ML Modeling
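  • The likelihood of the training data
    $\{(x_i, l_i)\}_{i=1}^{N}$ is given by
    $p(x_1^N; \{\mu_j\}, \{\Sigma_j\}) = \prod_{i=1}^{N} \mathcal{N}(x_i; \mu_{l_i}, \Sigma_{l_i})$.
  • The idea is to choose the parameters $\{\mu_j\}$ and
    $\{\Sigma_j\}$ so as to maximize
    $p(x_1^N; \{\mu_j\}, \{\Sigma_j\})$.

$l_i$: label of sample $x_i$
$N$: number of data samples
$j = 1, \ldots, J$: class
$d$: dimension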
7
ML Modeling (cont.)
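  • $\log p(x_1^N; \{\mu_j\}, \{\Sigma_j\})$ can be expressed as
    follows:
    $-\frac{Nd}{2} \log(2\pi) - \sum_{j=1}^{J} \frac{N_j}{2} \left[ \mathrm{tr}(\Sigma_j^{-1} S_j) + (m_j - \mu_j)^T \Sigma_j^{-1} (m_j - \mu_j) + \log|\Sigma_j| \right]$

$m_j$: sample mean of class j
$S_j$: sample covariance of class j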
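The rewriting in terms of sample statistics can be checked numerically. The following is a minimal sketch, assuming the standard Gaussian log-likelihood and the biased (1/N_j) sample covariance for each class; the function names and the toy data are illustrative and not taken from the slides.

```python
import numpy as np

def loglik_direct(X, labels, mus, Sigmas):
    """Sum of per-sample Gaussian log-densities, each under its own class."""
    d = X.shape[1]
    total = 0.0
    for x, j in zip(X, labels):
        diff = x - mus[j]
        _, logdet = np.linalg.slogdet(Sigmas[j])
        quad = diff @ np.linalg.solve(Sigmas[j], diff)
        total += -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    return total

def loglik_from_stats(X, labels, mus, Sigmas):
    """Same quantity, using only N_j, m_j and S_j of each class."""
    N, d = X.shape
    total = -0.5 * N * d * np.log(2 * np.pi)
    for j in range(len(mus)):
        Xj = X[labels == j]
        Nj = len(Xj)
        m = Xj.mean(axis=0)                       # sample mean of class j
        S = np.cov(Xj, rowvar=False, bias=True)   # sample covariance of class j
        Sinv = np.linalg.inv(Sigmas[j])
        dm = m - mus[j]
        _, logdet = np.linalg.slogdet(Sigmas[j])
        total -= 0.5 * Nj * (np.trace(Sinv @ S) + dm @ Sinv @ dm + logdet)
    return total

# Toy data: two classes, arbitrary (not ML) model parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
labels = np.arange(60) % 2
mus = [np.zeros(3), np.ones(3)]
Sigmas = [np.eye(3), 2.0 * np.eye(3)]
direct = loglik_direct(X, labels, mus, Sigmas)
from_stats = loglik_from_stats(X, labels, mus, Sigmas)
```

The two values agree up to round-off, which is why only the counts, means, and covariances of each class are needed in the rest of the derivation.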
8
Linear Transformation
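  • Now consider linearly transforming the samples
    from each class: $y_i = A_{l_i} x_i$.
  • The transformed data $y_i$ can also be modeled with
    Gaussians. Why?
  • Proof: the moment generating function of the normal
    distribution $\mathcal{N}(\mu, \Sigma)$ is
    $M_x(t) = \exp(t^T \mu + \tfrac{1}{2} t^T \Sigma t)$. The
    moment generating function of $y = Ax$ is
    $M_y(t) = M_x(A^T t) = \exp(t^T A \mu + \tfrac{1}{2} t^T A \Sigma A^T t)$,
    which is that of $\mathcal{N}(A \mu, A \Sigma A^T)$.

$A_j$: nonsingular d x d matrix
$A_j \Sigma_j A_j^T$: new covariance
$A_j \mu_j$: new mean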
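The same relations hold exactly for the sample statistics, which is what the later ML estimates rely on. A small sketch, assuming one transformation A applied to every sample (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated samples, one per row
A = rng.normal(size=(3, 3))                               # nonsingular with probability 1
Y = X @ A.T                                               # y_i = A x_i

m = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
m_y = Y.mean(axis=0)
S_y = np.cov(Y, rowvar=False, bias=True)

mean_ok = np.allclose(m_y, A @ m)        # new mean is A m
cov_ok = np.allclose(S_y, A @ S @ A.T)   # new covariance is A S A^T
```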
9
Linear Transformation (cont.)
  • The problem of scaling:
  • The choice of $A_j$ can make the likelihoods of a
    test data sample difficult to compare.
  • The likelihood of data from class j may be made
    arbitrarily large.
  • Two approaches to compare likelihoods:
  • First, ensure that $|A_j| = 1$ for every
    class, so that the likelihood of the data
    corresponding to each class is the same in the
    original and transformed spaces.
  • Second, consider only the likelihood in the
    original space even though the data is modeled in
    the transformed space. (This is the starting point
    of Kumar's paper.)

10
Linear Transformation (cont.)
  • For the first approach

11
Linear Transformation (cont.)
  • For the second approach
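Both approaches rest on the change-of-variables rule for densities: a density modeled on Ax, when evaluated in the original space, picks up the Jacobian factor |det A|. The sketch below assumes a single Gaussian and an arbitrary nonsingular A; `gauss_logpdf` is an illustrative helper, not a function from the paper.

```python
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    """Log-density of a multivariate normal at a single point x."""
    d = len(x)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

rng = np.random.default_rng(2)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma = L @ L.T + np.eye(d)          # a valid covariance matrix
A = rng.normal(size=(d, d))          # nonsingular with probability 1
x = rng.normal(size=d)

log_p_x = gauss_logpdf(x, mu, Sigma)                     # modeled in the original space
log_p_y = gauss_logpdf(A @ x, A @ mu, A @ Sigma @ A.T)   # modeled in the transformed space
log_abs_det = np.linalg.slogdet(A)[1]                    # log|det A|
# Likelihood in the original space = transformed-space likelihood times |det A|.
gap = log_p_x - (log_p_y + log_abs_det)
```

When |det A| = 1 the correction term vanishes, so the two approaches give the same number for a unimodular transformation.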

12
Linear Transformation (cont.)
  • Why model the transformed data $A_j x$ rather than
    $x$?
  • If the data is modeled using full-covariance
    Gaussians, it makes no difference.
  • How about diagonal or block-diagonal Gaussians?
  • If we directly remove the off-diagonal entries,
    the correlations in the data cannot be accounted
    for explicitly (Ljolje, 1994).
  • The transformations can be used to find the basis
    in which this structural constraint on the
    variances is more valid as evidenced from the
    data.

13
Single Class
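  • Ignore the class labels and model the entire
    data with one Gaussian.
  • By ML estimation, $\hat{\mu} = m$ and $\hat{\Sigma} = S$
    (the sample mean and sample covariance of the
    entire data), and the ML value of the training
    data is $p^*(x_1^N) = (2\pi e)^{-Nd/2} |S|^{-N/2}$.
  • On average each sample contributes a factor
    $(2\pi e)^{-d/2} |S|^{-1/2}$ to the ML value.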

The ML estimator
14
Single Class ML Estimates
  • ML estimation
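A sketch of the standard derivation, assuming $m$ and $S$ denote the sample mean and the (1/N-normalized) sample covariance of the entire data:

```latex
\begin{aligned}
\log p(x_1^N;\mu,\Sigma)
  &= -\frac{N}{2}\Big[\, d\log(2\pi) + \log|\Sigma|
     + \operatorname{tr}(\Sigma^{-1}S)
     + (m-\mu)^T\Sigma^{-1}(m-\mu) \Big], \\
\frac{\partial}{\partial \mu} = 0
  &\;\Rightarrow\; \Sigma^{-1}(m-\mu) = 0
   \;\Rightarrow\; \hat{\mu} = m = \frac{1}{N}\sum_{i=1}^{N} x_i, \\
\frac{\partial}{\partial \Sigma^{-1}} = 0 \ (\text{at } \mu = m)
  &\;\Rightarrow\; \Sigma - S = 0
   \;\Rightarrow\; \hat{\Sigma} = S = \frac{1}{N}\sum_{i=1}^{N} (x_i-m)(x_i-m)^T, \\
\log p^*(x_1^N)
  &= -\frac{N}{2}\Big[\, d\log(2\pi e) + \log|S| \Big].
\end{aligned}
```

The last line uses $\operatorname{tr}(S^{-1}S) = d$, which is where the factor $e$ in $2\pi e$ comes from.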

15
Single Class Linear Transformation
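  • Consider a global nonsingular linear
    transformation of the data: $y = Ax$.
  • By ML estimation, $\hat{\mu} = Am$ and
    $\hat{\Sigma} = A S A^T$, and
    the ML value of the training data is
    $p^*(y_1^N) = (2\pi e)^{-Nd/2} |A S A^T|^{-N/2} = p^*(x_1^N) / |A|^N$.
  • If $|A| = 1$, then $p^*(y_1^N) = p^*(x_1^N)$.

Such a matrix is called a unimodular, or
volume-preserving, linear transformation.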
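The effect of a global transformation on the ML value can be verified directly. This is a minimal sketch assuming the closed form of the maximized full-covariance log-likelihood, -(N/2)(d log(2πe) + log|S|); `max_loglik` is an illustrative name.

```python
import numpy as np

def max_loglik(X):
    """Maximized full-covariance Gaussian log-likelihood of the rows of X."""
    N, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)
    return -0.5 * N * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(S)[1])

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
N = len(X)
A = rng.normal(size=(3, 3))
log_abs_det = np.linalg.slogdet(A)[1]

# A general nonsingular A shifts the ML log-likelihood by -N log|det A|.
shift = max_loglik(X @ A.T) - max_loglik(X)

# Rescaling A so that |det A| = 1 removes the shift entirely.
A_unimodular = A / np.exp(log_abs_det / 3)
shift_unimodular = max_loglik(X @ A_unimodular.T) - max_loglik(X)
```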
16
Single Class Linear Transformation (MLE)
  • ML estimation

17
Single Class Diagonal Covariance
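  • If we are constrained to use a diagonal
    covariance model, the ML estimators will be
    $\hat{\mu} = m$ and $\hat{\Sigma} = \mathrm{diag}(S)$.
  • The ML value:
    $p^*_{\mathrm{diag}}(x_1^N) = (2\pi e)^{-Nd/2} |\mathrm{diag}(S)|^{-N/2}$.
  • Because of the diagonal constraint,
    $p^*_{\mathrm{diag}}(x_1^N) \le p^*(x_1^N)$, i.e.,
    $|\mathrm{diag}(S)| \ge |S|$.

Another proof of Hadamard's inequality.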
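The loss in likelihood caused by the diagonal constraint is easy to measure on data. A short sketch, assuming correlated toy features so that the inequality is strict (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
N, d = X.shape
S = np.cov(X, rowvar=False, bias=True)

logdet_full = np.linalg.slogdet(S)[1]          # log|S|
logdet_diag = np.sum(np.log(np.diag(S)))       # log|diag(S)|

const = d * np.log(2 * np.pi * np.e)
ll_full = -0.5 * N * (const + logdet_full)     # full-covariance ML log-likelihood
ll_diag = -0.5 * N * (const + logdet_diag)     # diagonal-covariance ML log-likelihood
loss = ll_full - ll_diag                       # nonnegative by Hadamard's inequality
```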
18
Single Class Diagonal Covariance (MLE)
  • ML estimation

19
Single Class Diagonal Covariance (LT)
  • If one linearly transforms the data and models
    it using a diagonal Gaussian, the ML value is
    $(2\pi e)^{-Nd/2} |\mathrm{diag}(A S A^T)|^{-N/2}$.
  • One can maximize the ML value over A to obtain
    the best feature space in which to model with the
    diagonal covariance constraint.
  • Note that A is still supposed to be unimodular.
  • One optimal choice of A: the matrix whose rows are
    the eigenvectors of $S$, for which $A S A^T$ is
    diagonal.

No loss in likelihood!
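A sketch of this optimal choice, assuming A is built from the orthonormal eigenvectors returned by `numpy.linalg.eigh` (so |det A| = 1):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated features
S = np.cov(X, rowvar=False, bias=True)

_, V = np.linalg.eigh(S)      # columns of V are orthonormal eigenvectors of S
A = V.T                       # rows of A are the eigenvectors
D = A @ S @ A.T               # diagonal up to round-off

off_diag = np.max(np.abs(D - np.diag(np.diag(D))))
abs_det_A = abs(np.linalg.det(A))
logdet_diag = np.sum(np.log(np.diag(D)))     # log|diag(A S A^T)|
logdet_full = np.linalg.slogdet(S)[1]        # log|S|
```

Because log|diag(A S A^T)| equals log|S| for this A, the diagonal model in the rotated space reaches the full-covariance ML value.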
20
Multiclass
  • In this case the training data is modeled with a
    Gaussian for each class $j = 1, \ldots, J$. One can
    split the data into J classes and model each one
    separately.
  • The ML estimators: $\hat{\mu}_j = m_j$ and
    $\hat{\Sigma}_j = S_j$.
  • The ML value:
    $(2\pi e)^{-Nd/2} \prod_{j=1}^{J} |S_j|^{-N_j/2}$.
  • Note that there is no interaction between the
    classes and therefore unconstrained ML modeling
    is not discriminating.
  • For the same reason, unimodular transformations
    cannot help in better classification.

21
Multiclass Diagonal Covariance
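  • The ML estimators: $\hat{\mu}_j = m_j$ and
    $\hat{\Sigma}_j = \mathrm{diag}(S_j)$.
  • The ML value:
    $(2\pi e)^{-Nd/2} \prod_{j=1}^{J} |\mathrm{diag}(S_j)|^{-N_j/2}$.
  • If one linearly transforms the data with a
    unimodular A and models it using diagonal
    Gaussians, the ML value is
    $(2\pi e)^{-Nd/2} \prod_{j=1}^{J} |\mathrm{diag}(A S_j A^T)|^{-N_j/2}$.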

The main task is to choose A.
22
Multiclass Some Issues
  • If the sample size for each class is not large
    enough then the ML parameter estimates may have
    large variance and hence be unreliable.
  • The storage requirement for the model
  • The computational requirement
  • The parameters for each class are obtained
    independently: the ML principle does not allow for
    discrimination between classes.

MLE
population covariance
23
Multiclass Some Issues (cont.)
  • Parameter sharing across classes:
  • reduces the number of parameters
  • reduces storage requirements
  • reduces computational requirements
  • is more discriminating, leading to better
    classifiers.
  • But we can appeal to Fisher's criterion of LDA
    and a result of Campbell to argue that sometimes
    constrained ML modeling is discriminating.

but comes with a loss in likelihood
hard to justify
24
Multiclass Some Issues (cont.)
  • Solution:
  • We can globally transform the data with a
    unimodular matrix A and model the transformed
    data with diagonal Gaussians. (There is a loss in
    likelihood too.)
  • Among all possible transformations A, we can
    choose the one that incurs the least loss in
    likelihood. (In essence we will find a linearly
    transformed (shared) feature space in which the
    diagonal Gaussian assumption is most valid.)

25
Multiclass Equal Covariance
  • Here all the covariances are assumed to be
    equal.
  • Notice that the ML estimate of covariance for
    each class is not supposed to be equal a priori.
  • The ML estimators: $\hat{\mu}_j = m_j$ and
    $\hat{\Sigma} = W = \sum_{j=1}^{J} \frac{N_j}{N} S_j$.

equal covariance of each class
26
Multiclass Equal Covariance (cont.)
  • ML estimation

ML estimator of mean
within-class scatter as the ML estimate of
covariance
27
Multiclass Equal Covariance (cont.)
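  • Each sample on average contributes a factor
    $(2\pi e)^{-d/2} |W|^{-1/2}$ to the
    likelihood, and the ML value is
    $(2\pi e)^{-Nd/2} |W|^{-N/2}$.
  • Because the equal-covariance model is a
    constrained version of the full-covariance
    multiclass model, its ML value cannot be larger,
    and we can prove
    $|W| \ge \prod_{j=1}^{J} |S_j|^{N_j/N}$.
  • The sample covariance of the entire data:
    $S = W + B$.
  • Because the single-Gaussian model is in turn a
    constrained version of the equal-covariance
    model, we can derive $|S| \ge |W|$.

The geometric mean is smaller than or equal to the
arithmetic mean.
$W = \sum_{j} \frac{N_j}{N} S_j$: within-class scatter
$B = \sum_{j} \frac{N_j}{N} (m_j - m)(m_j - m)^T$: between-class scatter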
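The scatter decomposition and the two determinant inequalities can be checked on toy data. A minimal sketch, assuming N_j/N-weighted within- and between-class scatter matrices and 1/N-normalized covariances; the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
J, d, per_class = 3, 3, 50
X = np.vstack([
    rng.normal(loc=rng.normal(size=d), scale=1.0 + j, size=(per_class, d))
    for j in range(J)
])
labels = np.repeat(np.arange(J), per_class)
N = len(X)

m = X.mean(axis=0)
S_total = np.cov(X, rowvar=False, bias=True)    # covariance of the entire data
W = np.zeros((d, d))                            # within-class scatter
B = np.zeros((d, d))                            # between-class scatter
weighted_logdet = 0.0                           # sum_j (N_j/N) log|S_j|
for j in range(J):
    Xj = X[labels == j]
    w = len(Xj) / N
    mj = Xj.mean(axis=0)
    Sj = np.cov(Xj, rowvar=False, bias=True)
    W += w * Sj
    B += w * np.outer(mj - m, mj - m)
    weighted_logdet += w * np.linalg.slogdet(Sj)[1]

decomposition_ok = np.allclose(S_total, W + B)
logdet_W = np.linalg.slogdet(W)[1]
logdet_total = np.linalg.slogdet(S_total)[1]
```

In log form the chain reads log|S| >= log|W| >= sum_j (N_j/N) log|S_j|, which orders the ML values of the single-Gaussian, equal-covariance, and full multiclass models.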
28
Multiclass Equal Covariance Clusters
  • Classes are organized into clusters and each
    cluster modeled with a single mean or collection
    of means and a single covariance.
  • In the former case, the data can be relabeled using
    cluster labels, and ML estimates and
    ML values can be obtained as before for the
    full-covariance multiclass case.
  • In the latter case, the data can be split into K
    groups, in which case this essentially becomes
    the equal-covariance problem for each group.

29
Multiclass Diagonal Covariance Clusters
  • Again classes are grouped into clusters. Each
    cluster is modeled with a diagonal Gaussian in a
    transformed feature space.
  • The ML estimators in the original feature
    space
  • The ML value
  • One can choose the best feature space for each
    class cluster by maximizing over the
    transformation $A_k$ of that cluster.
  • Notice that the transformation $A_k$ for each
    class cluster is obtained independently.

30
Multiclass One Cluster
  • When the number of clusters is one, there is a
    single global transformation $A$ and the classes are
    modeled as diagonal Gaussians in this feature
    space.
  • Note that three sets of parameters need to be
    optimized: the class means, the class-specific
    diagonal variances, and the transformation $A$.
  • The ML estimators
  • The ML value
  • The optimal A can be obtained by optimization as
    follows

objective function
31
Multiclass One Cluster (cont.)
  • Optimization: the numerical approach
  • The objective function F
  • Differentiating F with respect to A gives the
    derivative G
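A sketch of one common form of this objective and its derivative, assuming F(A) = sum_j (N_j/2) log|diag(A S_j A^T)| - N log|det A| (the negative log-likelihood up to a constant, with the ML diagonal variances plugged in). The function names are illustrative, and the finite-difference check is only a sanity test of the analytic gradient.

```python
import numpy as np

def objective(A, S_list, N_list):
    """F(A) = sum_j (N_j/2) log|diag(A S_j A^T)| - N log|det A|."""
    N = sum(N_list)
    F = -N * np.linalg.slogdet(A)[1]
    for S, Nj in zip(S_list, N_list):
        F += 0.5 * Nj * np.sum(np.log(np.diag(A @ S @ A.T)))
    return F

def gradient(A, S_list, N_list):
    """G = dF/dA = sum_j N_j diag(A S_j A^T)^{-1} A S_j - N A^{-T}."""
    N = sum(N_list)
    G = -N * np.linalg.inv(A).T
    for S, Nj in zip(S_list, N_list):
        AS = A @ S
        var = np.sum(AS * A, axis=1)       # diagonal of A S_j A^T
        G += Nj * AS / var[:, None]
    return G

rng = np.random.default_rng(7)
d = 3
S_list = []
for _ in range(2):
    L = rng.normal(size=(d, d))
    S_list.append(L @ L.T + np.eye(d))     # toy class covariances
N_list = [40, 60]
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))

# Central finite-difference check of one entry of the gradient.
eps = 1e-6
E = np.zeros((d, d))
E[0, 1] = eps
numeric = (objective(A + E, S_list, N_list) - objective(A - E, S_list, N_list)) / (2 * eps)
analytic = gradient(A, S_list, N_list)[0, 1]
```

With F and G available, any gradient-based optimizer could be applied, at the cost of keeping every S_j in memory.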

32
Multiclass One Cluster (cont.)
  • Directly optimizing the objective function is
    nontrivial: it requires numerical optimization
    techniques and the full sample covariance matrix,
    $S_j$, to be stored for each class.
  • A more efficient algorithm from (Gales, TR291,
    1997) can be used.

33
Multiclass Gales Approach
  • Algorithm:
  1) Estimate the mean (sample mean), which is
     independent of the other model parameters.
  2) Use the current estimate of the transform A to
     estimate the set of class-specific diagonal
     variances.
  3) Estimate the transform A using the current set of
     diagonal variances.
  4) Go to 2) until convergence, or until an
     appropriate criterion is satisfied.

sample covariance of class j
The i-th entry of diagonal variance of class j
the i-th row vector of A
The i-th row vector of the cofactors
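The iteration can be sketched as follows. This assumes the row-by-row update usually associated with Gales' semi-tied covariance work, a_i = c_i G_i^{-1} sqrt(N / (c_i G_i^{-1} c_i^T)) with G_i = sum_j (N_j / sigma_{j,i}^2) S_j; the function names, the identity initialization, and the fixed iteration count are illustrative choices. The inputs are the class sample covariances, so step 1 (the sample means) is taken as already done.

```python
import numpy as np

def objective(A, S_list, N_list):
    """Negative log-likelihood up to a constant, with ML diagonal variances plugged in."""
    N = sum(N_list)
    F = -N * np.linalg.slogdet(A)[1]
    for S, Nj in zip(S_list, N_list):
        F += 0.5 * Nj * np.sum(np.log(np.diag(A @ S @ A.T)))
    return F

def estimate_transform(S_list, N_list, n_iter=20):
    d = S_list[0].shape[0]
    N = sum(N_list)
    A = np.eye(d)
    history = [objective(A, S_list, N_list)]
    for _ in range(n_iter):
        # Step 2: class-specific diagonal variances for the current A.
        variances = [np.diag(A @ S @ A.T).copy() for S in S_list]
        # Step 3: update A one row at a time with the variances held fixed.
        for i in range(d):
            G_i = sum(Nj / var[i] * S for S, Nj, var in zip(S_list, N_list, variances))
            c_i = np.linalg.det(A) * np.linalg.inv(A)[:, i]   # i-th row of cofactors
            v = np.linalg.solve(G_i, c_i)                      # G_i^{-1} c_i^T (G_i is symmetric)
            A[i] = v * np.sqrt(N / (c_i @ v))
        history.append(objective(A, S_list, N_list))
    return A, history

# Toy usage with three random class covariances.
rng = np.random.default_rng(8)
S_list = []
for _ in range(3):
    L = rng.normal(size=(3, 3))
    S_list.append(L @ L.T + 0.1 * np.eye(3))
N_list = [30, 50, 20]
A, history = estimate_transform(S_list, N_list)
```

Each pass only needs the matrices G_i and the cofactor rows, and the recorded objective should not increase from one pass to the next.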
34
Multiclass Gales Approach (Appendix)
  • Let's go back to the log-likelihood of the
    training data

35
Multiclass Gales Approach (Appendix)
  • Differentiating with respect to the i-th row
    vector of A and equating to zero

scalar
36
MLLT vs. HDA
  • The maximum likelihood linear transformation
    (MLLT) aims at minimizing the loss in likelihood
    between full and diagonal covariance Gaussian
    models.
  • (Gopinath, 1998)
  • (Gales, 1999)
  • The objective is to find a transformation that
    maximizes the log likelihood difference of the
    data