Title: David Newman, UC Irvine, Lecture 5: Classification 1
1 CS 277 Data Mining
Lecture 5: Classification (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2 Notices
- Project proposal due next Tuesday (Oct 16)
- Homework 2 (text classification) available Tuesday
3 Homework 1 comments
- Overall, good
- Please, no handwritten work
- Graphs/plots/charts
- ALWAYS label the x and y axes
- include a title
- choose an appropriate type: histogram or line plot?
- Use precise, formal language
- don't be chatty or informal
- less is more
- Complexity analysis
- define variables
- don't use constants
- state assumptions
- Solutions/comments are in the hw1 web directory: hw1.solutions.txt
- let's review
4 Today
- Lecture: Classification
- Link: CAIDA
- www.caida.org
- www.caida.org/tools/visualization/walrus/gallery1/
5 Nearest Neighbor Classifiers
- kNN: Select the k nearest neighbors to x from the training data and take the majority class of these neighbors (see the sketch below)
- k is a parameter
- Small k: noisier estimates; large k: smoother estimates
- Best value of k often chosen by cross-validation
- pseudo-code on whiteboard
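A minimal sketch of the kNN rule just described, assuming a NumPy feature matrix `X_train`, a label vector `y_train`, and Euclidean distance (the names are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Predict the class of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)              # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class among the neighbors
```

Choosing k then amounts to running this predictor under cross-validation for several candidate values of k and keeping the one with the lowest validation error.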
6 Train and test data
[Figure: data matrix of word features W with a class-label column, split row-wise into Dtrain and Dtest]
7 GOLDEN RULE FOR PREDICTION
- NEVER LET YOUR MODEL SEE YOUR TEST DATA
8 Train, hold-out and test data
[Figure: the same data matrix of word features W with a class-label column, now split row-wise into Dtrain, Dholdout and Dtest]
9 Decision Tree Classifiers
- Widely used in practice
- Can handle both real-valued and nominal inputs (unusual)
- Good with high-dimensional data
- Similar algorithms to those used in constructing regression trees
- Historically, developed in both statistics and computer science
- Statistics
- Breiman, Friedman, Olshen and Stone, CART, 1984
- Computer science
- Quinlan, ID3, C4.5 (1980s-1990s)
- Try it out in Weka (its implementation of C4.5 is called J48)
10 Decision Tree Example
[Figure: training data plotted on Income vs Debt axes]
11 Decision Tree Example
[Figure: first split, Income > t1, drawn as a boundary at Income = t1]
12 Decision Tree Example
[Figure: second split, Debt > t2, added at Debt = t2]
13 Decision Tree Example
[Figure: third split, Income > t3, added at Income = t3]
14 Decision Tree Example
[Figure: the completed partition of the Income/Debt space by the three splits]
Note: tree decision boundaries are piecewise linear and axis-parallel
15 Decision Tree Example (2)
16 Decision tree example (cont.)
17 Decision tree example (cont.)
[Figure caption: Highest information gain; creates a pure node]
18 Decision tree example (cont.)
[Figure caption: Lowest information gain; all resulting nodes have near-equal yes/no counts]
19 Decision Tree Pseudocode
node = tree_design(Data = {X, C})
  for i = 1 to d
    quality_variable(i) = quality_score(X_i, C)
  end
  node = {X_split, threshold} for the maximum of quality_variable
  {Data_right, Data_left} = split(Data, X_split, threshold)
  if node == leaf
    return(node)
  else
    node_right = tree_design(Data_right)
    node_left  = tree_design(Data_left)
  end
end
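A rough, runnable Python rendering of this pseudocode for binary (0/1) class labels, using misclassification rate as the quality score and a greedy threshold search; the helper names mirror the pseudocode and are assumptions, not any standard library API:

```python
import numpy as np

def best_split(x_col, c):
    """Best single threshold on one variable, scored as 1 - misclassification error."""
    best_score, best_t = -1.0, None
    for t in np.unique(x_col)[:-1]:                    # candidate thresholds
        left, right = c[x_col <= t], c[x_col > t]
        # predict the majority class on each side and count the errors
        err = (min(left.sum(), len(left) - left.sum()) +
               min(right.sum(), len(right) - right.sum())) / len(c)
        if 1.0 - err > best_score:
            best_score, best_t = 1.0 - err, t
    return best_score, best_t

def tree_design(X, c, min_size=5):
    """Greedy recursive tree construction, mirroring the slide's pseudocode."""
    if len(c) <= min_size or len(np.unique(c)) == 1:   # stopping rule -> leaf node
        return {"leaf": True, "label": int(round(c.mean()))}
    splits = [best_split(X[:, i], c) for i in range(X.shape[1])]
    i = int(np.argmax([s for s, _ in splits]))         # variable with the maximum quality score
    score, t = splits[i]
    if t is None:                                      # no usable split on any variable
        return {"leaf": True, "label": int(round(c.mean()))}
    mask = X[:, i] <= t                                # split(Data, X_split, threshold)
    return {"leaf": False, "var": i, "threshold": float(t),
            "left": tree_design(X[mask], c[mask], min_size),
            "right": tree_design(X[~mask], c[~mask], min_size)}
```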
20 Decision Trees are not stable
Moving just one example slightly may lead to quite different trees and space partitions! Trees lack stability against small perturbations of the data.
Figure from Duda, Hart & Stork, Chap. 8
21 How to Choose the Right-Sized Tree?
[Figure: predictive error vs size of decision tree; error on training data keeps decreasing with tree size, while error on test data reaches a minimum over an ideal range of tree sizes]
22 Choosing a Good Tree for Prediction
- General idea
- grow a large tree
- prune it back to create a family of subtrees
- weakest-link pruning
- score the subtrees and pick the best one (see the sketch below)
- Massive data sizes (e.g., n = 100k data points)
- use the training data set to fit a set of trees
- use a validation data set to score the subtrees
- Smaller data sizes (e.g., n = 1k or less)
- use cross-validation
- use explicit penalty terms (e.g., Bayesian methods)
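One possible sketch of this procedure using scikit-learn, which implements weakest-link (cost-complexity) pruning; here the pruned subtrees are scored by cross-validation, and the validation-set variant would simply score each pruned tree on Dholdout instead:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def choose_tree(X_train, y_train, cv=5):
    """Grow a large tree, enumerate its cost-complexity pruned subtrees,
    and pick the pruning level with the best cross-validated accuracy."""
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in path.ccp_alphas:                       # one alpha per subtree in the family
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        score = cross_val_score(tree, X_train, y_train, cv=cv).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    # refit the selected subtree on all of the training data
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
```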
23 Example: Spam Email Classification
- Data set (from the UCI Machine Learning Archive)
- 4601 email messages from 1999
- Manually labeled as spam (60%), non-spam (40%)
- 54 features: percentage of words/characters in the email matching a specific word/character (NOT BAG OF WORDS)
- business, address, internet, free, george, !, etc.
- Average/longest/sum lengths of uninterrupted sequences of CAPS
- Error rates (Hastie, Tibshirani, Friedman, 2001); a rough sketch of this setup follows the list
- Training: 3056 emails, Testing: 1536 emails
- Decision tree: 8.7%
- Logistic regression: 7.6%
- Naïve Bayes: ~10% (typically)
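A rough sketch of this kind of experiment, assuming the raw spambase.data file from the UCI archive has been downloaded locally (the file name, split, and pruning level are assumptions, so the resulting error will not exactly match the numbers quoted above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# spambase.data: one row per email, feature columns followed by a 0/1 spam label
data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]

# hold out 1536 emails for testing, as in the slide's setup
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1536, random_state=0)

tree = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=0).fit(X_train, y_train)
print("decision tree test error: %.3f" % (1.0 - tree.score(X_test, y_test)))
```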
26 Treating Missing Data in Trees
- Missing values are common in practice
- Approaches to handling missing values
- During training
- Ignore rows with missing values (inefficient)
- During testing
- Send the example being classified down both branches and average the predictions (see the sketch below)
- Replace missing values with an imputed value (can be suboptimal)
- Other approaches
- Treat missing as a unique value (useful if missing values are correlated with the class)
- Surrogate splits method
- Search for and store surrogate variables/splits during training
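A small sketch of the "send it down both branches" idea at test time, written against the nested-dict trees built by the tree_design sketch earlier; the equal branch weights are an assumption (CART-style implementations weight by the fraction of training data that went each way, or fall back on stored surrogate splits):

```python
import math

def predict_with_missing(node, x):
    """Return P(class 1) for example x, averaging over both subtrees
    whenever the split variable is missing (None or NaN)."""
    if node["leaf"]:
        return float(node["label"])                      # leaf label is 0 or 1
    v = x[node["var"]]
    if v is None or (isinstance(v, float) and math.isnan(v)):
        # missing value: average the predictions of both branches
        return 0.5 * (predict_with_missing(node["left"], x) +
                      predict_with_missing(node["right"], x))
    branch = "left" if v <= node["threshold"] else "right"
    return predict_with_missing(node[branch], x)
```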
27 Other Issues with Classification Trees
- Why use binary splits (for real-valued data)?
- Multiway splits can be used, but cause fragmentation
- Linear combination splits?
- can produce small improvements
- optimization is much more difficult (need weights and a split point)
- trees become much less interpretable
- Model instability
- A small change in the data can lead to a completely different tree
- Model averaging techniques (like bagging) can be useful
- Tree bias
- Poor at approximating non-axis-parallel boundaries
- Producing rule sets from tree models (e.g., C5.0)
28 Why Trees are widely used in Practice
- Can handle high-dimensional data
- builds a model using one dimension at a time
- Can handle any type of input variable
- categorical, real-valued, etc.
- most other methods require data of a single type (e.g., only real-valued)
- Trees are (somewhat) interpretable
- a domain expert can read off the tree's logic
- Tree algorithms are relatively easy to code and test
29 Limitations of Trees
- Representational bias
- classification: piecewise linear boundaries, parallel to the axes
- regression: piecewise constant surfaces
- High variance
- trees can be unstable as a function of the sample
- e.g., a small change in the data -> a completely different tree
- causes two problems
- 1. High variance contributes to prediction error
- 2. High variance reduces interpretability
- Trees are good candidates for model combining
- Often used with boosting and bagging
- Trees do not scale well to massive data sets (e.g., N in the millions)
- repeated random access of subsets of the data
30 Evaluating Classification Results
- Summary statistics
- Empirical estimate of the score function on test data: error rate, accuracy, etc.
- More detailed breakdown
- Confusion matrix
- Can be quite useful in detecting systematic errors
- Detection vs. false-alarm plots (2 classes)
- Binary classifier with a real-valued output for each example, where higher means more likely to be class 1
- For each possible threshold, calculate
- Detection rate = fraction of class 1 detected
- False alarm rate = fraction of class 2 detected
- Plot y (detection rate) versus x (false alarm rate); a small sketch of this computation follows the list
- Also known as ROC curves (related: precision-recall and specificity/sensitivity plots)
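A minimal sketch of building such a plot from real-valued classifier scores, assuming the two classes are labeled 1 and 2 as in the bullets above:

```python
import numpy as np

def detection_false_alarm(scores, labels):
    """Sweep a threshold over the scores and return arrays of
    (false-alarm rate, detection rate) pairs, ready to plot."""
    det, fa = [], []
    for t in np.sort(np.unique(scores))[::-1]:           # from strict to permissive thresholds
        flagged = scores >= t                             # predicted class 1 at this threshold
        det.append(np.mean(flagged[labels == 1]))         # fraction of class 1 detected
        fa.append(np.mean(flagged[labels == 2]))          # fraction of class 2 falsely flagged
    return np.array(fa), np.array(det)

# usage: fa, det = detection_false_alarm(scores, labels); plt.plot(fa, det)
```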
31 Naïve Bayes Text Classification
- K classes c_1, ..., c_K
- Class-conditional probabilities: p(d | c_k) = probability of document d given class c_k
- = Prod_i p(w_i | c_k), the naïve conditional-independence assumption over words w_i
- Posterior class probabilities (by Bayes' rule): p(c_k | d) proportional to p(d | c_k) p(c_k) (see the sketch below)
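A minimal numerical sketch of computing these posteriors for one document under the multinomial model, assuming the class priors and per-class word probabilities have already been estimated from Dtrain (all variable names are illustrative):

```python
import numpy as np

def naive_bayes_posterior(word_counts, log_prior, log_word_prob):
    """Posterior p(c_k | d) for one document d.
    word_counts: length-V vector of word counts in d
    log_prior: length-K vector of log p(c_k)
    log_word_prob: K x V matrix of log p(w_i | c_k)"""
    # log p(c_k) + sum_i count_i * log p(w_i | c_k), i.e. log p(c_k) + log p(d | c_k) up to a constant
    log_post = log_prior + log_word_prob @ word_counts
    log_post -= log_post.max()                  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()                    # normalize so the K posteriors sum to 1
```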
32 Naïve Bayes Text Classification
- Multivariate Bernoulli / binary model
- Multinomial model
33 Naïve Bayes Text Classification