Title: Statistical Classification
Classification Problems
- Given an input X = (x1, x2, ..., xm)
- Predict the class label y ∈ Y
- Y = {-1, +1}: binary classification problems
- Y = {1, 2, 3, ..., c}: multi-class classification problems
- Goal: learn the function f: X → Y
Examples of Classification Problems: Text Categorization
- Input features X: word frequencies, e.g. (campaigning, 1), (democrats, 2), (basketball, 0), ...
- Class label y:
  - y = +1: politics
  - y = -1: non-politics
- Example document: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, ..." Topic: politics or non-politics?
Examples of Classification Problems: Image Classification
- Input features X: color histogram, e.g. (red, 1004), (blue, 23000), ...
- Class label y:
  - y = +1: bird image
  - y = -1: non-bird image
- Which images are birds, and which are not?
Classification Problems
- How to obtain f?
- Learn the classification function f from examples
Learning from Examples
- Training examples: D = {(x1, y1), (x2, y2), ..., (xn, yn)}
- Independent and Identically Distributed (i.i.d.):
  - Each training example is drawn independently from the same underlying source
  - Training examples are similar to testing examples
Learning from Examples
- Given the training examples D
- Goal: learn a classification function f(x): X → Y that is consistent with the training examples
- What is the easiest way to do it?
K Nearest Neighbor (kNN) Approach
- Classify a query point by the majority vote of its k nearest training examples, as in the sketch below.
- How many neighbors should we count?
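As a concrete illustration, here is a minimal kNN classifier sketch in NumPy; the names (knn_predict, X_train, ...) are illustrative choices, not notation from the slides.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two clusters labeled -1 and +1
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, +1, +1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=3))  # -> -1
```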
Cross Validation
- Divide the training examples into two sets:
  - a training set (80%) and a validation set (20%)
- Predict the class labels of the examples in the validation set using the examples in the training set
- Choose the number of neighbors k that maximizes the classification accuracy on the validation set (see the sketch below)
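A minimal sketch of this selection procedure, assuming the knn_predict helper from the earlier sketch is in scope; the 80/20 split and the candidate k values are illustrative.

```python
import numpy as np

def choose_k_by_validation(X, y, candidate_ks=(1, 3, 5, 7), seed=0):
    """Pick the k with the best accuracy on a held-out 20% validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    train_idx, val_idx = idx[:n_train], idx[n_train:]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        preds = [knn_predict(X[train_idx], y[train_idx], X[i], k=k) for i in val_idx]
        acc = float(np.mean(np.array(preds) == y[val_idx]))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```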
Leave-One-Out Method
- For k = 1, 2, ..., K:
  - Err(k) = 0
  - Select a training data point and hide its class label
  - Use the remaining data and the given k to predict the class label of the held-out point
  - Err(k) = Err(k) + 1 if the predicted label differs from the true label
  - Repeat until every training example has been tested
- Choose the k whose Err(k) is minimal
- Example result: Err(1) = 3, Err(2) = 2, Err(3) = 6, so choose k = 2 (see the sketch below)
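A short sketch of the leave-one-out loop, again assuming the knn_predict helper defined earlier is in scope.

```python
import numpy as np

def leave_one_out_errors(X, y, max_k=5):
    """Err(k) = number of training points misclassified when held out."""
    err = {k: 0 for k in range(1, max_k + 1)}
    for i in range(len(X)):
        X_rest = np.delete(X, i, axis=0)   # hide example i
        y_rest = np.delete(y, i)
        for k in err:
            if knn_predict(X_rest, y_rest, X[i], k=k) != y[i]:
                err[k] += 1
    return err  # choose the k with the smallest err[k]
```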
Probabilistic Interpretation of kNN
- Estimate the probability Pr(y|x) around the location of x (see the sketch below):
  - Pr(y|x) ≈ (number of data points with class y in the neighborhood of x) / (total number of data points in the neighborhood of x)
- Bias and variance tradeoff:
  - A small neighborhood → large variance → unreliable estimation
  - A large neighborhood → large bias → inaccurate estimation
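A minimal sketch of this neighborhood estimate of Pr(y|x); the function name knn_class_probs is an illustrative choice.

```python
import numpy as np

def knn_class_probs(X_train, y_train, x_query, k=5):
    """Estimate Pr(y | x_query) as the label fractions among the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    # small k -> high variance, large k -> high bias
    return dict(zip(labels, counts / k))
```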
Weighted kNN
- Weight the contribution of each close neighbor based on its distance
- Weight function: w(x, x_i) = exp(-||x - x_i||² / (2σ²))
- Prediction: Pr(y|x) = Σ_i w(x, x_i) I(y_i = y) / Σ_i w(x, x_i)  (sketched below)
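A sketch of weighted kNN under the Gaussian-weight form written above; the names are illustrative.

```python
import numpy as np

def weighted_knn_probs(X_train, y_train, x_query, sigma2=1.0):
    """Pr(y | x_query) with Gaussian weights w_i = exp(-||x - x_i||^2 / (2 sigma^2))."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma2))          # closer points get larger weights
    return {label: float(np.sum(w[y_train == label]) / np.sum(w))
            for label in np.unique(y_train)}

def weighted_knn_predict(X_train, y_train, x_query, sigma2=1.0):
    probs = weighted_knn_probs(X_train, y_train, x_query, sigma2)
    return max(probs, key=probs.get)
```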
Estimate σ² in the Weight Function
- Leave-one-out cross validation:
  - The training dataset D is divided into two sets:
    - Validation set: {(x1, y1)}
    - Training set: D_{-1} = D − {(x1, y1)}
  - Compute Pr(y1 | x1, D_{-1})
Estimate σ² in the Weight Function
- Pr(y | x1, D_{-1}) is a function of σ²
Estimate σ² in the Weight Function
- In general, for the i-th example we have:
  - Validation set: {(x_i, y_i)}
  - Training set: D_{-i} = D − {(x_i, y_i)}
- Estimate σ² by maximizing the leave-one-out log-likelihood l(σ²) = Σ_i log Pr(y_i | x_i, D_{-i}), as sketched below
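A sketch of the leave-one-out log-likelihood as a function of σ², reusing the weighted_knn_probs helper from the earlier sketch; maximizing it (on a grid, or by the gradient ascent described next) selects the kernel width.

```python
import numpy as np

def loo_log_likelihood(X, y, sigma2, eps=1e-12):
    """l(sigma^2) = sum_i log Pr(y_i | x_i, D_{-i})."""
    ll = 0.0
    for i in range(len(X)):
        X_rest = np.delete(X, i, axis=0)   # D_{-i}: all examples except i
        y_rest = np.delete(y, i)
        probs = weighted_knn_probs(X_rest, y_rest, X[i], sigma2)
        ll += np.log(probs.get(y[i], 0.0) + eps)
    return ll
```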
Optimization
- The log-likelihood l(σ²) is a DC function (a difference of two convex functions)
Challenges in Optimization
- Convex functions are the easiest to optimize
- Single-mode (unimodal) functions are the second easiest
- Multi-mode (multimodal) functions are difficult to optimize
Gradient Ascent
Gradient Ascent (cont'd)
- Compute the derivative of l(θ), i.e., l'(θ)
- Update: θ ← θ + t · l'(θ)
- How do we decide the step size t?
Gradient Ascent: Line Search
- Excerpt from the slides by Stephen Boyd
Gradient Ascent
- Stopping criterion: |l'(θ)| ≤ ε, where ε is a predefined small value
- Start from θ0; define α, β, and ε
- Repeat until the stopping criterion is met:
  - Compute l'(θ)
  - Choose the step size t via backtracking line search (sketched below)
  - Update θ ← θ + t · l'(θ)
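A minimal sketch of gradient ascent with backtracking line search, under the assumption that α and β on the slide are the usual backtracking parameters; the gradient is taken numerically here for a one-dimensional parameter such as σ².

```python
import numpy as np

def numerical_grad(f, theta, h=1e-5):
    """Central-difference estimate of l'(theta)."""
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

def gradient_ascent(f, theta0, alpha=0.3, beta=0.8, eps=1e-6, max_iter=200):
    theta = theta0
    for _ in range(max_iter):
        g = numerical_grad(f, theta)
        if abs(g) <= eps:                    # stopping criterion |l'(theta)| <= eps
            return theta
        t = 1.0
        # backtracking: shrink t until a sufficient increase is achieved
        while f(theta + t * g) < f(theta) + alpha * t * g * g:
            t *= beta
        theta += t * g
    return theta

# Example: maximize l(theta) = -(theta - 2)^2, whose maximum is at theta = 2
print(gradient_ascent(lambda th: -(th - 2.0) ** 2, theta0=0.0))
```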
ML = Statistics + Optimization
- Modeling: Pr(y | x; θ)
  - θ denotes the parameter(s) of the model
- Search for the best parameter θ:
  - Maximum likelihood estimation
  - Construct a log-likelihood function l(θ)
  - Search for the optimal solution θ*
Instance-Based Learning (Ch. 8)
- Key idea: just store all the training examples
- k Nearest Neighbor:
  - Given a query example, take a vote among its k nearest neighbors (if the target function is discrete-valued)
  - Take the mean of the f values of the k nearest neighbors (if the target function is real-valued), as in the sketch below
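A short sketch of the real-valued (regression) variant; the names are illustrative.

```python
import numpy as np

def knn_regress(X_train, f_train, x_query, k=3):
    """Predict f(x_query) as the mean f value of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(f_train[nearest]))
```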
When to Consider Nearest Neighbor?
- Lots of training data
- Fewer than 20 attributes per example
- Advantages:
  - Training is very fast
  - Can learn complex target functions
  - Does not lose information
- Disadvantages:
  - Slow at query time
  - Easily fooled by irrelevant attributes
KD Tree for NN Search
- Each node contains:
  - Children information
  - The tightest box that bounds all the data points within the node
NN Search by KD Tree
- [Figures: step-by-step illustration of nearest-neighbor search with a KD tree]
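As a runnable counterpart to the figures, here is a small KD-tree nearest-neighbor query using SciPy's cKDTree; using SciPy is an implementation choice of this sketch, not the hand-built tree described on the slides.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 2))      # 10,000 points in the unit square
tree = cKDTree(points)                # recursively splits space into bounding boxes

query = np.array([0.5, 0.5])
dist, idx = tree.query(query, k=1)    # nearest-neighbor search with box pruning
print(idx, dist, points[idx])
```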
Curse of Dimensionality
- Imagine instances described by 20 attributes, but only 2 are relevant to the target function
- Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional
- Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and the nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is d(p, N) = (1 − (1/2)^(1/N))^(1/p), which approaches 1 as p grows, so the nearest neighbor ends up nearly as far away as the boundary of the ball (a Monte Carlo check follows below).
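A quick Monte Carlo check of this effect: sample N points uniformly in a p-dimensional unit ball and average the distance from the origin to the closest one (drawing radii proportional to U^(1/p) is a standard way to sample uniformly from a ball).

```python
import numpy as np

def nearest_distance_in_ball(N, p, trials=200, seed=0):
    """Average distance from the origin to the nearest of N uniform points in a p-dim unit ball."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(trials):
        # uniform directions times radii ~ U^(1/p) gives uniform points in the ball
        g = rng.normal(size=(N, p))
        dirs = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.random(N) ** (1.0 / p)
        pts = dirs * radii[:, None]
        out.append(np.linalg.norm(pts, axis=1).min())
    return float(np.mean(out))

for p in (1, 2, 5, 10, 20):
    print(p, round(nearest_distance_in_ball(N=500, p=p), 3))
# the nearest point drifts toward the boundary (distance -> 1) as p grows
```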