Title: Review
Comparison of Different Classification Models
- The goal of all classifiers
  - Predict the class label y for an input x
  - Estimate p(y|x)
K Nearest Neighbor (kNN) Approach
- What is the appropriate size for the neighborhood N(x)?
  - Leave-one-out approach
- Weighted K nearest neighbor
  - The neighborhood is defined through a weight function
  - Estimate p(y|x)
  - How to estimate the appropriate value for σ²?
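A minimal sketch of the weighted-kNN estimate of p(y|x), assuming a Gaussian weight function exp(-||x - x_i||² / 2σ²) (a common choice; the slide's exact weight function is not shown here):

```python
import numpy as np

def weighted_knn_proba(X_train, y_train, x, sigma2=1.0):
    """Estimate p(y|x): weight every training example by a Gaussian
    kernel around x, then normalize the per-class weight sums."""
    w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma2))
    classes = np.unique(y_train)
    p = np.array([w[y_train == c].sum() for c in classes])
    return classes, p / p.sum()
```

Unlike plain kNN, every training example contributes, so there is no hard choice of K; the bandwidth σ² plays that role instead.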
Weighted K Nearest Neighbor
- Leave-one-out maximum likelihood
  - Estimate the leave-one-out probability
  - Compute the leave-one-out likelihood of the training data
  - Search for the optimal σ² by maximizing the leave-one-out likelihood
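The steps above can be sketched as follows (a grid search standing in for whatever search procedure the slides use, with the same Gaussian-weight assumption):

```python
import numpy as np

def loo_log_likelihood(X, y, sigma2):
    """Leave-one-out log-likelihood of the training labels under the
    weighted-kNN estimate of p(y|x)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2.0 * sigma2))
    np.fill_diagonal(W, 0.0)  # leave each point out of its own estimate
    ll = 0.0
    for i in range(len(y)):
        p_i = W[i, y == y[i]].sum() / W[i].sum()
        ll += np.log(max(p_i, 1e-300))
    return ll

def best_sigma2(X, y, grid):
    """Pick sigma^2 from a grid by maximizing the LOO likelihood."""
    return max(grid, key=lambda s2: loo_log_likelihood(X, y, s2))
```

An overly large σ² flattens the weights toward the class priors, which the leave-one-out likelihood penalizes on well-separated data.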
Gaussian Generative Model
- p(y|x) ∝ p(x|y) p(y)  (posterior ∝ likelihood × prior)
- Estimate p(x|y) and p(y)
- Allocate a separate set of parameters for each class
  - θ = {θ₁, θ₂, …, θ_c}
  - p(x|y, θ) = p(x|θ_y)
- Maximum likelihood estimation
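A sketch of the full pipeline, assuming a multivariate Gaussian for each p(x|θ_y) (the small ridge added to the covariance is a numerical safeguard, not part of the MLE):

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """MLE per class: prior p(y), mean and covariance of p(x|y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(y),
                     Xc.mean(axis=0),
                     np.cov(Xc, rowvar=False, bias=True)
                     + 1e-6 * np.eye(X.shape[1]))
    return params

def posterior(params, x):
    """p(y|x) ∝ p(x|y) p(y) via Bayes' rule, in log space."""
    scores = {}
    for c, (prior, mu, cov) in params.items():
        diff = x - mu
        # log N(x; mu, cov) up to a constant shared by all classes
        log_like = -0.5 * (np.log(np.linalg.det(cov))
                           + diff @ np.linalg.solve(cov, diff))
        scores[c] = np.log(prior) + log_like
    m = max(scores.values())
    unnorm = {c: np.exp(s - m) for c, s in scores.items()}
    Z = sum(unnorm.values())
    return {c: v / Z for c, v in unnorm.items()}
```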
Gaussian Generative Model
- Difficult to estimate p(x|y) if x is of high dimensionality
  - Naïve Bayes
  - Essentially a linear model
- How to make a Gaussian generative model discriminative?
  - (μ_m, Σ_m) of each class are based only on the data belonging to that class → lack of discriminative power
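The Naïve Bayes workaround can be sketched with per-dimension Gaussians (a standard instantiation; the slides may use a different factorized form):

```python
import numpy as np

def fit_gnb(X, y):
    """Naive Bayes: factor p(x|y) into independent 1-D Gaussians per
    dimension, avoiding a full covariance in high dimensions."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (len(Xc) / len(y), Xc.mean(axis=0), Xc.var(axis=0) + 1e-6)
    return stats

def predict_gnb(stats, x):
    """argmax_y  log p(y) + sum_d log N(x_d; mu_{y,d}, var_{y,d})."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in stats.items():
        score = np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                             + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = c, score
    return best
```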
Gaussian Generative Model
- Maximum likelihood estimation
- Bound optimization algorithm
- We have decomposed the interaction of parameters between different classes
- Question: how to handle x with multiple features?
Logistic Regression Model
- A linear decision boundary w·x + b
- A probabilistic model p(y|x)
- Maximum likelihood approach for estimating the weights w and threshold b
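A minimal sketch of the maximum-likelihood fit by gradient ascent (batch gradient ascent is one choice of optimizer; the slides may use another):

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit p(y=1|x) = sigmoid(w·x + b) by gradient ascent on the
    mean log-likelihood, with labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * (X.T @ (y - p)) / len(y)  # gradient w.r.t. w
        b += lr * np.mean(y - p)            # gradient w.r.t. b
    return w, b
```

The gradient (y - p)·x is zero exactly when the predicted probabilities match the labels on average, which is the maximum-likelihood condition.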
Logistic Regression Model
- Overfitting issue
  - Example: text classification
  - Words that appear in only one document will be assigned infinitely large weights
- Solution: regularization
Non-linear Logistic Regression Model
- Kernelize logistic regression model
Non-linear Logistic Regression Model
- Hierarchical Mixture of Experts model
  - Group linear classifiers into a tree structure
  - Products generate nonlinearity in the prediction function
Non-linear Logistic Regression Model
- It could be a rough assumption that all data points can be fitted by a single linear model
- But it is usually appropriate to assume a locally linear model
- kNN can be viewed as a localized model without any parameters
- Can we extend the kNN approach by introducing a localized linear model?
Localized Logistic Regression Model
- Similar to weighted kNN
  - Weight each training example by the weight function
  - Build a logistic regression model using the weighted examples
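A sketch of this idea, again assuming a Gaussian kernel around the query point as the weight function:

```python
import numpy as np

def localized_logreg(X, y, x_query, sigma2=1.0, lr=0.1, epochs=500):
    """Localized logistic regression: weight each training example by a
    Gaussian kernel centered at the query, fit a weighted logistic
    model, and return p(y=1|x_query)."""
    k = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * sigma2))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # each example's gradient contribution is scaled by its kernel weight
        w += lr * (X.T @ (k * (y - p))) / k.sum()
        b += lr * np.sum(k * (y - p)) / k.sum()
    return 1.0 / (1.0 + np.exp(-(x_query @ w + b)))
```

Note that, like kNN, a fresh model is fitted per query, which trades training cost at prediction time for local flexibility.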
Conditional Exponential Model
- An extension of the logistic regression model to the multi-class case
- A different set of weights w_y and threshold b_y for each class y
- Translation invariance
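The model and its translation invariance in a few lines (the max-shift trick below is a standard numerical-stability device, and it works precisely because of that invariance):

```python
import numpy as np

def softmax_proba(W, b, x):
    """Conditional exponential model:
    p(y|x) = exp(w_y·x + b_y) / sum_y' exp(w_y'·x + b_y').
    Adding the same constant to every score leaves p(y|x) unchanged."""
    scores = W @ x + b
    z = np.exp(scores - scores.max())  # shift by the max for stability
    return z / z.sum()
```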
Maximum Entropy Model
- Finding the simplest model that matches the data
- Iterative scaling methods for optimization
Support Vector Machine
- Classification margin
- Maximum margin principle
  - Separate the data as far as possible from the decision boundary
- Two objectives
  - Minimize the classification error over the training data
  - Maximize the classification margin
- Support vectors
  - Only the support vectors have an impact on the location of the decision boundary
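The two objectives can be combined into a single regularized hinge loss; a subgradient-descent sketch (one of several ways to train an SVM, not necessarily the formulation on the slides):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.05):
    """Minimize  lam/2 ||w||^2 + mean(max(0, 1 - y(w·x + b)))  with
    labels y in {-1, +1}. Points with margin >= 1 contribute no
    gradient -- only margin violators (support-vector candidates)
    move the boundary."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1.0          # margin violators
        w -= lr * (lam * w - X[viol].T @ y[viol] / len(y))
        b -= lr * (-np.sum(y[viol]) / len(y))
    return w, b
```

The `lam` term maximizes the margin (smaller ||w|| means a wider margin), while the hinge term penalizes training errors.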
Support Vector Machine
- Separable case
- Noisy case
Logistic Regression Model vs. Support Vector Machine
- Logistic regression model
- Support vector machine
- Logistic regression differs from the support vector machine only in the loss function
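The two loss functions, written as functions of the margin m = y(w·x + b), make the comparison concrete:

```python
import numpy as np

def logistic_loss(m):
    """Logistic regression's loss: log(1 + e^{-m})."""
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    """SVM's loss: max(0, 1 - m), exactly zero once the margin
    exceeds 1."""
    return np.maximum(0.0, 1.0 - m)
```

Both decrease with the margin; the key difference is that the hinge loss is exactly zero past margin 1, while the logistic loss never reaches zero.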
Kernel Tricks
- Introducing nonlinearity into discriminative models
- Diffusion kernel
  - A graph Laplacian L for local similarity
  - Propagates local similarity information into a global one
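A sketch of the diffusion kernel K = exp(-βL) from an adjacency matrix, computed via eigendecomposition (valid because L is symmetric; β is a diffusion-rate parameter assumed here):

```python
import numpy as np

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel K = exp(-beta * L), with L = D - A the graph
    Laplacian of adjacency matrix A. Since all eigenvalues of K are
    e^{-beta*lambda} > 0, K is positive definite."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)           # L is symmetric
    return (vecs * np.exp(-beta * vals)) @ vecs.T
```

Matrix exponentiation sums walks of all lengths, which is how purely local edge similarities become a global similarity.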
Fisher Kernel
- Derive a kernel function from a generative model
- Key idea
  - Map a point x in the original input space into the model space
  - The similarity of two data points is measured in the model space
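A toy sketch for a 1-D Gaussian generative model, using the identity matrix in place of the inverse Fisher information (a common simplification; the full Fisher kernel includes that normalization):

```python
import numpy as np

def fisher_score(x, mu, var):
    """Fisher score U_x = grad_theta log N(x; mu, var) for
    theta = (mu, var): the map from input space into model space."""
    return np.array([(x - mu) / var,
                     ((x - mu) ** 2 - var) / (2 * var ** 2)])

def fisher_kernel(x1, x2, mu, var):
    """Similarity measured in model space: inner product of the
    Fisher scores (identity approximation to the Fisher information)."""
    return fisher_score(x1, mu, var) @ fisher_score(x2, mu, var)
```

Two points that pull the model parameters in the same direction get a large positive kernel value; points that pull in opposite directions get a negative one.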
Kernel Methods in Generative Models
- Usually, kernels can be introduced into a generative model through a Gaussian process
  - Define a kernelized covariance matrix
  - Positive semi-definite, similar to Mercer's condition
Multi-class SVM
- SVMs can only handle two-class outputs
- One-against-all
  - Learn N SVMs
  - SVM 1 learns Output = 1 vs. Output ≠ 1
  - SVM 2 learns Output = 2 vs. Output ≠ 2
  - …
  - SVM N learns Output = N vs. Output ≠ N
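The one-against-all scheme, sketched with a simple hinge-loss subgradient SVM as the base learner (the base learner and its hyperparameters are illustrative choices):

```python
import numpy as np

def train_one_vs_all(X, y, n_classes, lam=0.01, epochs=200, lr=0.05):
    """Learn one linear SVM per class c, treating class c as +1 and
    every other class as -1."""
    models = []
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            viol = t * (X @ w + b) < 1.0  # margin violators
            w -= lr * (lam * w - X[viol].T @ t[viol] / len(t))
            b -= lr * (-np.sum(t[viol]) / len(t))
        models.append((w, b))
    return models

def predict_one_vs_all(models, x):
    """Pick the class whose SVM assigns the largest score w·x + b."""
    return int(np.argmax([w @ x + b for w, b in models]))
```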
Error-Correcting Output Codes (ECOC)
- Encode each class into a bit vector
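The decoding step can be sketched in a few lines (the 3-class, 4-bit code matrix in the test is an invented example, not the one from the slide):

```python
import numpy as np

def ecoc_decode(code_matrix, bits):
    """ECOC decoding: each row of code_matrix is one class's bit
    vector; classify by minimum Hamming distance to the bit vector
    predicted by the binary classifiers."""
    dists = np.sum(code_matrix != bits, axis=1)
    return int(np.argmin(dists))
```

With enough Hamming distance between the rows, a few misfiring binary classifiers still decode to the right class, which is the error-correcting property.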
Ordinal Regression
- A special class of multi-class classification problems
- There is a natural ordinal relationship between the classes
- Maximum margin principle
  - The computation of the margin involves multiple classes
Decision Tree
From slides of Andrew Moore
- A greedy approach for generating a decision tree
  - Choose the most informative feature
    - Using mutual information measurements
  - Split the data set according to the values of the selected feature
  - Recurse until each data item is classified correctly
- Attributes with real values
  - Quantize the real value into a discrete one
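The "most informative feature" criterion can be sketched as the mutual information between a discrete feature and the labels:

```python
import numpy as np

def entropy(y):
    """Entropy H(Y) of a discrete label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    """Mutual information I(X; Y) = H(Y) - H(Y|X) between a discrete
    feature x and the labels y: the greedy split criterion."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain
```

The greedy step simply evaluates `information_gain` for every candidate feature and splits on the argmax.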
Decision Tree
- The overfitting problem
- Tree pruning
  - Reduced-error pruning
  - Rule post-pruning
Generalized Decision Tree
Each node is a linear classifier
(Figures: a decision tree with simple data partition vs. a decision tree using classifiers for data partition)