Title: David Newman, UC Irvine, Lecture 5: Classification 1
1 CS 277 Data Mining
Lecture 5: Classification (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Project proposal due next Tuesday (Oct 16)
- Homework 2 (text classification) available Tuesday
3 Homework 1 comments
- Overall, good
- Please, no handwritten work
- Graphs/plots/charts
- ALWAYS label the x and y axes
- include a title
- choose an appropriate type: histogram or line plot?
- Use precise, formal language
- don't be chatty or informal
- less is more
- Complexity analysis
- define variables
- don't use constants
- state assumptions
- Solutions/comments are in the hw1 web directory: hw1.solutions.txt
- let's review
4 Today
- Lecture: Classification
- Link: CAIDA
- www.caida.org
- www.caida.org/tools/visualization/walrus/gallery1/
5 Nearest Neighbor Classifiers
- kNN: Select the k nearest neighbors to x from the training data and take the majority class of these neighbors (see the sketch below)
- k is a parameter
- Small k: noisier estimates; large k: smoother estimates
- Best value of k often chosen by cross-validation
- pseudo-code on whiteboard
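A minimal sketch of the kNN rule just described, assuming a NumPy feature matrix `X_train`, a label vector `y_train`, and Euclidean distance (the names are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Predict the class of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)              # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class among the neighbors
```

Choosing k then amounts to running this predictor under cross-validation for several candidate values of k and keeping the one with the lowest validation error.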
6 Train and test data
[Figure: data matrix of word features W with a class-label column, split row-wise into Dtrain and Dtest]
7 GOLDEN RULE FOR PREDICTION
- NEVER LET YOUR MODEL SEE YOUR TEST DATA
8 Train, hold-out and test data
[Figure: the same data matrix of word features W with a class-label column, now split row-wise into Dtrain, Dholdout and Dtest]
9 Decision Tree Classifiers
- Widely used in practice
- Can handle both real-valued and nominal inputs (unusual)
- Good with high-dimensional data
- Similar algorithms to those used in constructing regression trees
- Historically, developed in both statistics and computer science
- Statistics
- Breiman, Friedman, Olshen and Stone, CART, 1984
- Computer science
- Quinlan, ID3, C4.5 (1980s-1990s)
- Try it out in Weka (its implementation of C4.5 is called J48)
10 Decision Tree Example
[Figure: training data plotted on Income vs Debt axes]
11 Decision Tree Example
[Figure: first split, Income > t1, drawn as a boundary at Income = t1]
12 Decision Tree Example
[Figure: second split, Debt > t2, added at Debt = t2]
13 Decision Tree Example
[Figure: third split, Income > t3, added at Income = t3]
14 Decision Tree Example
[Figure: the completed partition of the Income/Debt space by the three splits]
Note: tree decision boundaries are piecewise linear and axis-parallel
15 Decision Tree Example (2)
16 Decision tree example (cont.)
17 Decision tree example (cont.)
[Figure caption: Highest information gain; creates a pure node]
18 Decision tree example (cont.)
[Figure caption: Lowest information gain; all resulting nodes have near-equal yes/no counts]
19 Decision Tree Pseudocode
node = tree_design(Data = {X, C})
  for i = 1 to d
    quality_variable(i) = quality_score(X_i, C)
  end
  node = {X_split, threshold} for the maximum of quality_variable
  {Data_right, Data_left} = split(Data, X_split, threshold)
  if node == leaf
    return(node)
  else
    node_right = tree_design(Data_right)
    node_left  = tree_design(Data_left)
  end
end
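A rough, runnable Python rendering of this pseudocode for binary (0/1) class labels, using misclassification rate as the quality score and a greedy threshold search; the helper names mirror the pseudocode and are assumptions, not any standard library API:

```python
import numpy as np

def best_split(x_col, c):
    """Best single threshold on one variable, scored as 1 - misclassification error."""
    best_score, best_t = -1.0, None
    for t in np.unique(x_col)[:-1]:                    # candidate thresholds
        left, right = c[x_col <= t], c[x_col > t]
        # predict the majority class on each side and count the errors
        err = (min(left.sum(), len(left) - left.sum()) +
               min(right.sum(), len(right) - right.sum())) / len(c)
        if 1.0 - err > best_score:
            best_score, best_t = 1.0 - err, t
    return best_score, best_t

def tree_design(X, c, min_size=5):
    """Greedy recursive tree construction, mirroring the slide's pseudocode."""
    if len(c) <= min_size or len(np.unique(c)) == 1:   # stopping rule -> leaf node
        return {"leaf": True, "label": int(round(c.mean()))}
    splits = [best_split(X[:, i], c) for i in range(X.shape[1])]
    i = int(np.argmax([s for s, _ in splits]))         # variable with the maximum quality score
    score, t = splits[i]
    if t is None:                                      # no usable split on any variable
        return {"leaf": True, "label": int(round(c.mean()))}
    mask = X[:, i] <= t                                # split(Data, X_split, threshold)
    return {"leaf": False, "var": i, "threshold": float(t),
            "left": tree_design(X[mask], c[mask], min_size),
            "right": tree_design(X[~mask], c[~mask], min_size)}
```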
20 Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! Trees lack stability against small perturbations of the data.
Figure from Duda, Hart & Stork, Chap. 8
21 How to Choose the Right-Sized Tree?
[Figure: predictive error vs size of decision tree; error on training data keeps decreasing with tree size, while error on test data reaches a minimum over an ideal range of tree sizes]
22 Choosing a Good Tree for Prediction
- General idea
- grow a large tree
- prune it back to create a family of subtrees
- weakest-link pruning
- score the subtrees and pick the best one (see the sketch below)
- Massive data sizes (e.g., n = 100k data points)
- use the training data set to fit a set of trees
- use a validation data set to score the subtrees
- Smaller data sizes (e.g., n = 1k or less)
- use cross-validation
- use explicit penalty terms (e.g., Bayesian methods)
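One possible sketch of this procedure using scikit-learn, which implements weakest-link (cost-complexity) pruning; here the pruned subtrees are scored by cross-validation, and the validation-set variant would simply score each pruned tree on Dholdout instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def choose_tree(X_train, y_train, cv=5):
    """Grow a large tree, enumerate its cost-complexity pruned subtrees,
    and pick the pruning level with the best cross-validated accuracy."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in path.ccp_alphas:                       # one alpha per subtree in the family
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        score = cross_val_score(tree, X_train, y_train, cv=cv).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    # refit the selected subtree on all of the training data
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
```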
23 Example: Spam Email Classification
- Data set (from the UCI Machine Learning Archive)
- 4601 email messages from 1999
- Manually labeled as spam (60%), non-spam (40%)
- 54 features: percentage of words/characters in the email matching a specific word/character (NOT BAG OF WORDS)
- business, address, internet, free, george, !, etc.
- Average/longest/sum lengths of uninterrupted sequences of CAPS
- Error rates (Hastie, Tibshirani, Friedman, 2001); a rough sketch of this setup follows the list
- Training: 3056 emails, Testing: 1536 emails
- Decision tree: 8.7%
- Logistic regression: 7.6%
- Naïve Bayes: ~10% (typically)
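A rough sketch of this kind of experiment, assuming the raw spambase.data file from the UCI archive has been downloaded locally (the file name, split, and pruning level are assumptions, so the resulting error will not exactly match the numbers quoted above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# spambase.data: one row per email, feature columns followed by a 0/1 spam label
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# hold out 1536 emails for testing, as in the slide's setup
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1536, random_state=0)

tree = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=0).fit(X_train, y_train)
print("decision tree test error: %.3f" % (1.0 - tree.score(X_test, y_test)))
```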
26 Treating Missing Data in Trees
- Missing values are common in practice
- Approaches to handling missing values
- During training
- Ignore rows with missing values (inefficient)
- During testing
- Send the example being classified down both branches and average the predictions (see the sketch below)
- Replace missing values with an imputed value (can be suboptimal)
- Other approaches
- Treat missing as a unique value (useful if missing values are correlated with the class)
- Surrogate splits method
- Search for and store surrogate variables/splits during training
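A small sketch of the "send it down both branches" idea at test time, written against the nested-dict trees built by the tree_design sketch earlier; the equal branch weights are an assumption (CART-style implementations weight by the fraction of training data that went each way, or fall back on stored surrogate splits):

```python
import math

def predict_with_missing(node, x):
    """Return P(class 1) for example x, averaging over both subtrees
    whenever the split variable is missing (None or NaN)."""
    if node["leaf"]:
        return float(node["label"])                      # leaf label is 0 or 1
    v = x[node["var"]]
    if v is None or (isinstance(v, float) and math.isnan(v)):
        # missing value: average the predictions of both branches
        return 0.5 * (predict_with_missing(node["left"], x) +
                      predict_with_missing(node["right"], x))
    branch = "left" if v <= node["threshold"] else "right"
    return predict_with_missing(node[branch], x)
```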
27 Other Issues with Classification Trees
- Why use binary splits (for real-valued data)?
- Multiway splits can be used, but cause fragmentation
- Linear combination splits?
- can produce small improvements
- optimization is much more difficult (need weights and a split point)
- trees become much less interpretable
- Model instability
- A small change in the data can lead to a completely different tree
- Model averaging techniques (like bagging) can be useful
- Tree bias
- Poor at approximating non-axis-parallel boundaries
- Producing rule sets from tree models (e.g., C5.0)
28 Why Trees are widely used in Practice
- Can handle high-dimensional data
- builds a model using one dimension at a time
- Can handle any type of input variable
- categorical, real-valued, etc.
- most other methods require data of a single type (e.g., only real-valued)
- Trees are (somewhat) interpretable
- a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test
29 Limitations of Trees
- Representational bias
- classification: piecewise linear boundaries, parallel to the axes
- regression: piecewise constant surfaces
- High variance
- trees can be unstable as a function of the sample
- e.g., a small change in the data -> a completely different tree
- causes two problems
- 1. High variance contributes to prediction error
- 2. High variance reduces interpretability
- Trees are good candidates for model combining
- Often used with boosting and bagging
- Trees do not scale well to massive data sets (e.g., N in the millions)
- repeated random access of subsets of the data
30 Evaluating Classification Results
- Summary statistics
- Empirical estimate of the score function on test data: error rate, accuracy, etc.
- More detailed breakdown
- Confusion matrix
- Can be quite useful in detecting systematic errors
- Detection vs. false-alarm plots (2 classes)
- Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
- For each possible threshold, calculate
- Detection rate = fraction of class 1 detected
- False alarm rate = fraction of class 2 detected
- Plot y (detection rate) versus x (false alarm rate); a small sketch of this computation follows the list
- Also known as ROC curves (related: precision-recall and specificity/sensitivity plots)
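A minimal sketch of building such a plot from real-valued classifier scores, assuming the two classes are labeled 1 and 2 as in the bullets above:

```python
import numpy as np

def detection_false_alarm(scores, labels):
    """Sweep a threshold over the scores and return arrays of
    (false-alarm rate, detection rate) pairs, ready to plot."""
    det, fa = [], []
    for t in np.sort(np.unique(scores))[::-1]:           # from strict to permissive thresholds
        flagged = scores >= t                             # predicted class 1 at this threshold
        det.append(np.mean(flagged[labels == 1]))         # fraction of class 1 detected
        fa.append(np.mean(flagged[labels == 2]))          # fraction of class 2 falsely flagged
    return np.array(fa), np.array(det)

# usage: fa, det = detection_false_alarm(scores, labels); plt.plot(fa, det)
```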
31 Naïve Bayes Text Classification
- K classes c_1, ..., c_K
- Class-conditional probabilities: p(d | c_k) = probability of document d given class c_k
- = Prod_i p(w_i | c_k), the naïve conditional-independence assumption over words w_i
- Posterior class probabilities (by Bayes' rule): p(c_k | d) proportional to p(d | c_k) p(c_k) (see the sketch below)
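A minimal numerical sketch of computing these posteriors for one document under the multinomial model, assuming the class priors and per-class word probabilities have already been estimated from Dtrain (all variable names are illustrative):

```python
import numpy as np

def naive_bayes_posterior(word_counts, log_prior, log_word_prob):
    """Posterior p(c_k | d) for one document d.
    word_counts: length-V vector of word counts in d
    log_prior: length-K vector of log p(c_k)
    log_word_prob: K x V matrix of log p(w_i | c_k)"""
    # log p(c_k) + sum_i count_i * log p(w_i | c_k), i.e. log p(c_k) + log p(d | c_k) up to a constant
    log_post = log_prior + log_word_prob @ word_counts
    log_post -= log_post.max()                  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()                    # normalize so the K posteriors sum to 1
```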
32 Naïve Bayes Text Classification
- Multivariate Bernoulli / binary model
- Multinomial model
33 Naïve Bayes Text Classification