Title: Linear Models I
1. Linear Models (I)
2. Review of Information Theory
- What is information?
- What is entropy?
- Average information
- Minimum coding length
- Important inequality
[Figure: distribution for generating symbols vs. distribution for coding symbols]
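A minimal Python sketch (not from the slides) of these ideas: the entropy of the generating distribution is the minimum average coding length, and coding with a mismatched distribution never does better, which is presumably the inequality the slide refers to. The distributions p and q below are illustrative.

import math

def entropy(p):
    """Entropy H(p) = -sum_i p_i log2 p_i: the minimum average coding length."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_code_length(p, q):
    """Average code length when symbols drawn from p are coded with lengths -log2 q_i
    (the cross-entropy); it is never smaller than the entropy H(p)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]      # generating distribution
q = [0.25, 0.25, 0.25, 0.25]       # coding distribution
print(entropy(p))                  # 1.75 bits
print(expected_code_length(p, q))  # 2.0 bits >= 1.75 bits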
3. Review of Information Theory (cont'd)
- Mutual information
- Measures the correlation between two random variables
- Symmetric
- Kullback-Leibler (KL) distance
- Measures the difference between two distributions
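A short Python sketch (illustrative, not from the slides) of both quantities; note that the KL distance is not symmetric, while mutual information is.

import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log2(p_i / q_i); zero iff p equals q, and not symmetric."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) = KL(p(x,y) || p(x)p(y)) for a joint distribution given as a 2-D table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric: ~0.53 vs ~0.74 bits

joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))                  # ~0.28 bits > 0: X and Y are dependent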
4. Outline
- Classification problems
- Information theory for text classification
- Gaussian generative
- Naïve Bayes
- Logistic regression
5. Classification Problems
- Given input X = (x1, x2, ..., xm)
- Predict the class label y
- y ∈ {-1, +1}: binary classification problems
- y ∈ {1, 2, 3, ..., c}: multi-class classification problems
- Goal: need to learn the function f: X → y
6. Examples of Classification Problems
- Text categorization
- Input features: words such as campaigning, efforts, Iowa, Democrats, ...
- Class label: politics vs. non-politics
- Image classification
- Input features: color histogram, texture distribution, edge distribution, ...
- Class label: bird image vs. non-bird image
7. Learning Setup for Classification Problems
- Training examples
- {(x1, y1), (x2, y2), ..., (xn, yn)}
- Independent and Identically Distributed (i.i.d.)
- Training examples are drawn from the same distribution as the testing examples
- Goal
- Find a model or a function that is consistent with the training data
8. Information Theory for Text Classification
[Figure: distribution for generating symbols vs. distribution for coding symbols]
- If the coding distribution is similar to the generating distribution → short coding length → good compression rate
9. Compression Algorithm for TC
[Figure: a new document whose true topic is Sports is compressed with Compression Model M1 (Politics), giving 16K bits, and with Compression Model M2 (Sports), giving 10K bits; the shorter code identifies the topic as Sports.]
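Below is a minimal Python sketch of this compression-based classifier, assuming per-class unigram coding models with add-one smoothing; the toy documents and the smoothing choice are illustrative, not from the slides.

import math
from collections import Counter

def train_coding_model(documents):
    """Fit a unigram coding distribution from all words of one class's documents
    (add-one smoothing so unseen words still get a finite code length)."""
    counts = Counter(word for doc in documents for word in doc.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def code_length_bits(model, document):
    """Coding length of a document: sum of -log2 p(word) under the class model."""
    return sum(-math.log2(model(w)) for w in document.split())

# Toy corpora standing in for the Politics / Sports classes (illustrative only).
politics_model = train_coding_model(["democrats campaigning iowa", "election campaign efforts"])
sports_model = train_coding_model(["team wins game", "player scores goal in game"])

new_doc = "the team scores in the game"
bits = {"Politics": code_length_bits(politics_model, new_doc),
        "Sports": code_length_bits(sports_model, new_doc)}
print(min(bits, key=bits.get))   # the class whose model compresses the document best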
10. Probabilistic Models for Classification Problems
- Apply statistical inference methods
- Key: finding the best parameters θ
- Maximum likelihood estimation (MLE) approach
- Log-likelihood of the data
- Find the parameters θ that maximize the log-likelihood
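A small Python sketch of MLE on a toy problem (a Bernoulli coin model; the data are illustrative, not from the slides): the θ that maximizes the log-likelihood coincides with the closed-form MLE.

import math

def log_likelihood(theta, flips):
    """Log-likelihood of i.i.d. coin flips (1 = heads) under a Bernoulli(theta) model."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in flips)

flips = [1, 0, 1, 1, 0, 1, 1, 1]          # illustrative data

# Grid search over theta; the maximizer matches the closed-form MLE (the sample mean).
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, flips))
print(theta_hat, sum(flips) / len(flips))  # both 0.75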
11. Generative Models
- Do not directly estimate p(y|x; θ)
- Use Bayes' rule
- Estimate p(x|y; θ) instead of p(y|x; θ)
- Why p(x|y; θ)?
- Most well-known distributions have the form p(x|θ)
- Allocate a separate set of parameters for each class: θ → {θ_1, θ_2, ..., θ_c}
- p(x|y; θ) → p(x|θ_y)
- Describes the characteristic input patterns of each class y
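A minimal Python sketch of classification by Bayes' rule: combine per-class likelihoods p(x|θ_y) with class priors p(y) to get the posterior. The likelihood functions below are hypothetical placeholders.

def posterior(x, class_priors, class_likelihoods):
    """Bayes' rule: p(y|x) = p(x|theta_y) p(y) / sum_y' p(x|theta_y') p(y').
    class_likelihoods maps each class y to a function x -> p(x | theta_y)."""
    joint = {y: class_likelihoods[y](x) * class_priors[y] for y in class_priors}
    evidence = sum(joint.values())
    return {y: v / evidence for y, v in joint.items()}

# Illustrative usage with hypothetical per-class likelihood functions:
priors = {"+1": 0.5, "-1": 0.5}
likelihoods = {"+1": lambda x: 0.8 if x > 0 else 0.2,
               "-1": lambda x: 0.3 if x > 0 else 0.7}
print(posterior(1.0, priors, likelihoods))   # {'+1': ~0.73, '-1': ~0.27}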
12. Gaussian Generative Model (I)
- Assume a Gaussian model for each class
- One-dimensional case: p(x|y) = N(x; μ_y, σ_y²)
- Results for MLE: μ̂_y is the sample mean and σ̂_y² the sample variance of the class-y examples
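A minimal Python sketch of the one-dimensional MLE result (the closed form is standard; the toy class values below are illustrative).

def fit_gaussian(values):
    """MLE for a 1-D Gaussian: the sample mean and the (biased) sample variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

# Two toy classes: one Gaussian is fitted per class.
print(fit_gaussian([1.0, 1.2, 0.9, 1.1]))   # (1.05, 0.0125)
print(fit_gaussian([3.0, 3.4, 2.8, 3.2]))   # (3.1, 0.05)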
13. Example
- Height histogram for males and females
- Using a Gaussian generative model
- P(male | 1.8) = ?, P(female | 1.4) = ?
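A hypothetical worked version of this example: plug two class-conditional Gaussians and equal class priors into Bayes' rule. All numeric values here are made up for illustration, not taken from the slides.

import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical fitted parameters (heights in meters).
mu_m, var_m = 1.78, 0.004      # male height model
mu_f, var_f = 1.63, 0.004      # female height model
prior_m = prior_f = 0.5

def p_male_given(x):
    """Posterior P(male | x) by Bayes' rule with the two class-conditional Gaussians."""
    num = gaussian_pdf(x, mu_m, var_m) * prior_m
    den = num + gaussian_pdf(x, mu_f, var_f) * prior_f
    return num / den

print(p_male_given(1.8))       # close to 1: 1.80 m is far more likely under the male model
print(1 - p_male_given(1.4))   # P(female | 1.4): also close to 1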
14. Gaussian Generative Model (II)
- Consider multiple input features
- X = (x1, x2, ..., xm)
- Multivariate Gaussian distribution
- Σ_y is an m×m covariance matrix
- Results for MLE
- Problem
- Singularity of Σ_y; too many parameters
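A small numpy sketch showing the multivariate MLE and why Σ_y becomes singular when there are fewer examples than features; the random data are purely illustrative.

import numpy as np

def fit_multivariate_gaussian(X):
    """MLE for a multivariate Gaussian: sample mean vector and m x m sample covariance."""
    mu = X.mean(axis=0)
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]
    return mu, sigma

# With fewer examples than features, the covariance estimate has rank at most n - 1,
# so it is singular and cannot be inverted inside the Gaussian density.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))         # 20 examples, 100 features (illustrative)
mu, sigma = fit_multivariate_gaussian(X)
print(np.linalg.matrix_rank(sigma))    # 19 < 100: singular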
15. Overfitting Issue
- Complex model
- Insufficient training data
- Consider a classification problem with multiple inputs
- 100 input features
- 5 classes
- 1000 training examples
- The total number of parameters for a full Gaussian model is
- 5 means → 500 parameters
- 5 covariance matrices → 50,000 parameters
- 50,500 parameters → insufficient training data
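A quick check of the counts above (the slide counts every entry of each full m×m covariance matrix, not just the m(m+1)/2 free entries of a symmetric matrix):

features, classes, examples = 100, 5, 1000
mean_params = classes * features             # 5 mean vectors of length 100
cov_params = classes * features * features   # 5 full 100 x 100 covariance matrices
print(mean_params, cov_params, mean_params + cov_params)
# 500 50000 50500 -- far more parameters than the 1000 training examples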
16-19. Another Example of Overfitting (figures)
20. Naïve Bayes
- Reduce the model complexity
- Restrict the covariance matrix Σ_y to be diagonal
- Simplified Gaussian distribution
- Feature independence assumption
- Naïve Bayes assumption
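A minimal numpy sketch of Gaussian Naïve Bayes under these assumptions: a diagonal covariance means one mean and one variance per feature per class, so the class-conditional density factorizes into a product of one-dimensional Gaussians. The toy data are illustrative.

import numpy as np

def fit_naive_bayes(X, y):
    """Gaussian Naive Bayes: per class, one mean and one variance per feature
    (2m parameters per class instead of m + m^2), plus a class prior."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return model

def log_posterior_scores(model, x):
    """log p(x|theta_c) + log p(c), with p(x|theta_c) a product of 1-D Gaussians."""
    scores = {}
    for c, (mu, var, prior) in model.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = log_lik + np.log(prior)
    return scores

# Illustrative 2-feature data for two classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_naive_bayes(X, y)
scores = log_posterior_scores(model, np.array([1.1, 2.0]))
print(max(scores, key=scores.get))   # predicts class 0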
21. Naïve Bayes
- A terrible estimator for the class-conditional density p(x|y)
- But a very reasonable estimator for the posterior p(y|x)
- Why?
- The ratio of likelihoods is what matters
- Naïve Bayes does a reasonable job of estimating this ratio
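A short derivation (standard, reconstructed rather than taken from the slides) of why only the ratio of the two class-conditional likelihoods matters for the binary decision:

\[
p(y=+1 \mid x)
= \frac{p(x \mid y=+1)\,p(y=+1)}{p(x \mid y=+1)\,p(y=+1) + p(x \mid y=-1)\,p(y=-1)}
= \frac{1}{1 + \dfrac{p(x \mid y=-1)\,p(y=-1)}{p(x \mid y=+1)\,p(y=+1)}}
\]

so the posterior, and hence the predicted label, depends on the two likelihoods only through their ratio; estimation errors in each p(x|y) that roughly cancel in the ratio do not change the decision.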
22. The Ratio of Likelihood
- Binary classification
- Both classes share the same variance
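A sketch of the one-dimensional derivation under these assumptions (reconstructed; the slide's own derivation is not reproduced here):

\[
\log \frac{p(x \mid y=+1)}{p(x \mid y=-1)}
= \frac{-(x-\mu_{+1})^2 + (x-\mu_{-1})^2}{2\sigma^2}
= \frac{\mu_{+1}-\mu_{-1}}{\sigma^2}\, x + \frac{\mu_{-1}^2 - \mu_{+1}^2}{2\sigma^2}
\]

which is linear in x: the quadratic x² terms cancel exactly because both classes share the same σ².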
23. Decision Boundary
- Gaussian generative models end up finding a linear decision boundary
- Why not find it directly?
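This question points toward logistic regression, which the outline lists next: model the posterior with a linear function of x directly. Stated here for reference (a standard result, not quoted from the slides), the shared-variance Gaussian model implies a posterior of exactly this logistic form:

\[
p(y=+1 \mid x) = \frac{1}{1 + \exp\!\big(-(w^{\top} x + b)\big)}
\]

where logistic regression learns w and b directly instead of first estimating class-conditional Gaussians.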