1
Classification
Based in part on Chapter 10 of Hand, Mannila,
Smyth and Chapter 7 of Han and Kamber.
David Madigan
2
Predictive Modeling
Goal: learn a mapping y = f(x; θ)
Need: 1. A model structure
      2. A score function
      3. An optimization strategy
Categorical y ∈ {c1, ..., cm}: classification
Real-valued y: regression
Note: usually assume c1, ..., cm are mutually
exclusive and exhaustive
3
Probabilistic Classification
Let p(ck) = prob. that a randomly chosen object
comes from ck. Objects from ck have density
p(x | ck, θk) (e.g., MVN). Then
p(ck | x) ∝ p(x | ck, θk) p(ck)
Bayes Error Rate
  • The best possible error rate; a lower bound on the
    error rate of any classifier
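As a concrete illustration of both ideas, here is a minimal Python sketch for two made-up univariate Gaussian classes (the densities, priors and integration grid are assumptions for illustration, not values from the slides): the posterior follows the rule above, and the Bayes error rate is the expected probability of the less likely class under p(x).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: p(x|c1)=N(0,1), p(x|c2)=N(2,1), p(c1)=p(c2)=0.5
priors = np.array([0.5, 0.5])
dists = [norm(0.0, 1.0), norm(2.0, 1.0)]

def posterior(x):
    """p(ck | x) ∝ p(x | ck, theta_k) p(ck), normalised over k."""
    joint = np.array([d.pdf(x) * p for d, p in zip(dists, priors)])
    return joint / joint.sum()

print(posterior(1.0))   # at the midpoint the two posteriors are equal

# Bayes error rate: probability of the less likely class, integrated over p(x).
xs = np.linspace(-8, 10, 20001)
mix = sum(p * d.pdf(xs) for d, p in zip(dists, priors))                 # p(x)
post = np.vstack([d.pdf(xs) * p for d, p in zip(dists, priors)]) / mix  # p(ck | x)
bayes_err = np.trapz((1 - post.max(axis=0)) * mix, xs)
print(f"Bayes error rate ≈ {bayes_err:.3f}")
```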

4
Bayes error rate about 6% (figure)
5
Classifier Types
  • Discrimination: direct mapping from x to {c1, ..., cm}
    - e.g. perceptron, SVM, CART
  • Regression: model p(ck | x)
    - e.g. logistic regression, CART
  • Class-conditional: model p(x | ck, θk)
    - e.g. Bayesian classifiers, LDA
6
Simple Two-Class Perceptron
Define the linear score h(x) = w'x. Classify as class 1
if h(x) > 0, class 2 otherwise. Score function:
misclassification errors on the training data.
For training, replace the class-2 xj's by -xj; now we
need h(x) > 0 for every training point.
Initialize the weight vector. Repeat one or more
times: for each training data point xi, if the
point is correctly classified, do nothing; else
add xi to the weight vector.
Guaranteed to converge when there is perfect
separation.
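A minimal numpy sketch of this training loop under the sign-flip convention described above (the toy data and the epoch limit are my own illustrative choices):

```python
import numpy as np

def perceptron_train(X, y, n_epochs=100):
    """Two-class perceptron. X: (n, p) data with a constant column for the bias,
    y: labels in {1, 2}. Class-2 points are negated so we only need w @ x > 0."""
    Z = np.where((y == 2)[:, None], -X, X)      # replace class-2 xj by -xj
    w = np.zeros(X.shape[1])                    # initialize weight vector
    for _ in range(n_epochs):
        updated = False
        for z in Z:
            if w @ z <= 0:                      # misclassified (or on the boundary)
                w = w + z                       # else: do nothing
                updated = True
        if not updated:                         # converged under perfect separation
            break
    return w

# Toy linearly separable data (bias column of ones prepended).
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([1, 1, 2, 2])
w = perceptron_train(X, y)
print(np.where(X @ w > 0, 1, 2))   # recovers the training labels
```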
7
Linear Discriminant Analysis
K classes; X is an n × p data matrix.
p(ck | x) ∝ p(x | ck, θk) p(ck)
Could model each class density as multivariate
normal: p(x | ck, θk) = N(μk, Σk).
LDA assumes a common covariance matrix, Σk = Σ for all k.
Then the log posterior odds between any two classes
are linear in x.
8
Linear Discriminant Analysis (cont.)
It follows that the classifier should predict the class
with the largest linear discriminant function δk(x).
If we don't assume the Σk's are identical, we get
Quadratic Discriminant Analysis (QDA).
9
Linear Discriminant Analysis (cont.)
Can estimate the LDA parameters via maximum
likelihood
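A small numpy sketch of those maximum-likelihood estimates and the resulting linear discriminant functions (function and variable names are mine, not from the slides): priors are class frequencies, means are class sample means, and the common covariance is the pooled within-class covariance.

```python
import numpy as np

def lda_fit(X, y):
    """ML estimates for LDA: priors pi_k, means mu_k, pooled covariance Sigma."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([(y == k).mean() for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
                for i, k in enumerate(classes)) / n          # ML divisor n
    return classes, priors, means, np.linalg.inv(Sigma)

def lda_predict(X, classes, priors, means, Sigma_inv):
    """delta_k(x) = x' Sigma^-1 mu_k - 0.5 mu_k' Sigma^-1 mu_k + log pi_k (linear in x)."""
    scores = (X @ Sigma_inv @ means.T
              - 0.5 * np.sum(means @ Sigma_inv * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(scores, axis=1)]
```

The pooled covariance here uses the ML divisor n; replacing it with n - K gives the usual unbiased estimate.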
10
(No Transcript)
11
(No Transcript)
12
(Figure: LDA vs. QDA decision boundaries)
13
(No Transcript)
14
(No Transcript)
15
LDA (cont.)
  • Fisher is optimal if the classes are MVN with a
    common covariance matrix
  • Computational complexity O(mp²n)

16
Logistic Regression
Note that LDA is linear in x.
Linear logistic regression looks the same:
log [p(ck | x) / p(cl | x)] = β0 + β'x.
But the estimation procedure for the
coefficients is different: LDA maximizes the joint
likelihood of (y, X); logistic regression maximizes the
conditional likelihood of y | X. Usually similar
predictions.
17
Logistic Regression MLE
For the two-class case the likelihood is
L(β) = Πi p(xi; β)^yi (1 - p(xi; β))^(1-yi), where p(x; β) = 1/(1 + exp(-β'x)).
To maximize it we need to solve the (non-linear) score
equations Σi xi (yi - p(xi; β)) = 0.
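One standard way to solve those score equations is Newton-Raphson (equivalently, iteratively reweighted least squares). A minimal sketch with synthetic data purely for illustration (the fixed iteration count and the true coefficients are my own choices):

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Newton-Raphson for two-class logistic regression.
    Score: X'(y - p); information: X'WX with W = diag(p(1-p))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        score = X.T @ (y - p)
        info = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(info, score)   # Newton step
    cov = np.linalg.inv(info)                        # asymptotic covariance of beta-hat
    return beta, np.sqrt(np.diag(cov))               # coefficients and standard errors

# Synthetic check: true beta = (-1, 2)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-1.0, 2.0])))))
beta_hat, se = logistic_mle(X, y)
print(beta_hat, se)
```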
18
Logistic Regression Modeling
South African Heart Disease Example (y = MI)

             Coef.    S.E.    Z score (Wald)
Intercept   -4.130   0.964   -4.285
sbp          0.006   0.006    1.023
Tobacco      0.080   0.026    3.034
ldl          0.185   0.057    3.219
Famhist      0.939   0.225    4.178
Obesity     -0.035   0.029   -1.187
Alcohol      0.001   0.004    0.136
Age          0.043   0.010    4.184
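A table like this can be produced with standard GLM software. The sketch below assumes the South African heart disease data are available locally as a CSV with these column names and a numeric famhist indicator; the file name and layout are assumptions, not given on the slide.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file: one row per subject, binary outcome `chd` (MI),
# famhist assumed already coded 0/1.
df = pd.read_csv("SAheart.csv")
X = sm.add_constant(df[["sbp", "tobacco", "ldl", "famhist", "obesity", "alcohol", "age"]])
fit = sm.Logit(df["chd"], X).fit()
print(fit.summary())   # reports Coef., S.E. and the Wald z-score (Coef. / S.E.)
```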
19
Tree Models
  • Easy to understand
  • Can handle mixed data, missing values, etc.
  • Sequential fitting method can be sub-optimal
  • Usually grow a large tree and prune it back
    rather than attempt to optimally stop the growing
    process

20
(No Transcript)
21
Training Dataset
This follows an example from Quinlan's ID3 (the data table itself is shown as a figure).
22
Output: A Decision Tree for buys_computer

age?
  <30:    student?
            no:  no
            yes: yes
  30..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
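The fitted tree can be read directly as a prediction rule; here it is transcribed as a small Python function (a direct rewrite of the tree above, nothing more):

```python
def buys_computer(age: str, student: str, credit_rating: str) -> str:
    """Decision rule from the tree above. age is one of '<30', '30..40', '>40';
    student is 'yes'/'no'; credit_rating is 'fair'/'excellent'."""
    if age == "<30":
        return "yes" if student == "yes" else "no"
    if age == "30..40":
        return "yes"
    # age > 40: decision depends on credit rating
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<30", "yes", "fair"))       # -> yes
print(buys_computer(">40", "no", "excellent"))   # -> no
```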
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Confusion matrix
27
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning (majority voting is employed for
    classifying the leaf)
  • There are no samples left
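A compact sketch of that greedy, top-down procedure for categorical attributes (the node representation and helper names are mine; the attribute-selection measure `score` could be the information gain defined on the next slides):

```python
from collections import Counter

def build_tree(rows, attributes, target, score):
    """Greedy top-down induction.
    rows: list of dicts; attributes: dict {name: list of possible values};
    score(rows, attr, target): attribute-selection measure (higher is better)."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping: all samples in one class, or no remaining attributes (majority vote).
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: score(rows, a, target))
    remaining = {a: v for a, v in attributes.items() if a != best}
    tree = {"attribute": best, "branches": {}}
    for value in attributes[best]:
        subset = [r for r in rows if r[best] == value]
        # Stopping: no samples left in a branch -> label with the parent majority.
        tree["branches"][value] = (build_tree(subset, remaining, target, score)
                                   if subset else majority)
    return tree
```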

28
Information Gain (ID3/C4.5)
  • Select the attribute with the highest information
    gain
  • Assume there are two classes, P and N
  • Let the set of examples S contain p elements of
    class P and n elements of class N
  • The amount of information needed to decide if an
    arbitrary example in S belongs to P or N is
    defined as
    I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

e.g. I(0.5, 0.5) = 1; I(0.9, 0.1) = 0.47;
I(0.99, 0.01) = 0.08
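A short Python check of this definition against the quoted values (the function name is mine):

```python
import math

def I(p, n):
    """Information needed to classify an example from a set with p positives
    and n negatives: the binary entropy of the class proportions (in bits)."""
    total = p + n
    return -sum(f / total * math.log2(f / total) for f in (p, n) if f > 0)

print(round(I(0.5, 0.5), 2), round(I(0.9, 0.1), 2), round(I(0.99, 0.01), 2))
# -> 1.0 0.47 0.08
```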
29
Information Gain in Decision Tree Induction
  • Assume that using attribute A a set S will be
    partitioned into sets S1, S2, ..., Sv
  • If Si contains pi examples of P and ni examples
    of N, the entropy, or the expected information
    needed to classify objects in all subtrees Si, is
    E(A) = Σi [(pi + ni)/(p + n)] I(pi, ni)
  • The encoding information that would be gained by
    branching on A is Gain(A) = I(p, n) - E(A)

30
Attribute Selection by Information Gain
Computation
  • Class P: buys_computer = "yes"
  • Class N: buys_computer = "no"
  • I(p, n) = I(9, 5) = 0.940
  • Compute the entropy for age, E(age)
  • Hence Gain(age) = I(p, n) - E(age)
  • Similarly for the other attributes
31
Gini Index (IBM IntelligentMiner)
  • If a data set T contains examples from n classes,
    the gini index gini(T) is defined as
    gini(T) = 1 - Σj pj², where pj is the relative
    frequency of class j in T
  • If a data set T is split into two subsets T1 and
    T2 with sizes N1 and N2 respectively, the gini
    index of the split data is defined as
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
  • The attribute that provides the smallest gini_split(T)
    is chosen to split the node
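A direct transcription of these two definitions (class counts are passed in; the function names and the example split are mine):

```python
def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative class frequencies in T."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted gini of a binary split: (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

# Example: splitting a 9-yes / 5-no node into (6 yes, 1 no) and (3 yes, 4 no).
print(round(gini([9, 5]), 3), round(gini_split([6, 1], [3, 4]), 3))
```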

32
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • Results in poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not
    split a node if this would result in the goodness
    measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown
    tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training
    data to decide which is the best pruned tree

33
Approaches to Determine the Final Tree Size
  • Separate training (2/3) and testing (1/3) sets
  • Use cross validation, e.g., 10-fold cross
    validation
  • Use minimum description length (MDL) principle
  • halting growth of the tree when the encoding is
    minimized
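A sketch of the cross-validation approach using scikit-learn's cost-complexity pruning path (this substitutes sklearn's pruning machinery for the postpruning described on the previous slide, and uses a bundled dataset purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a large tree, get the sequence of progressively pruned trees (via ccp_alpha),
# and pick the one with the best 10-fold cross-validated accuracy.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
scores = [(alpha,
           cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                           X, y, cv=10).mean())
          for alpha in path.ccp_alphas]
best_alpha, best_score = max(scores, key=lambda t: t[1])
print(best_alpha, best_score)
```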

34
Nearest Neighbor Methods
  • k-NN assigns an unknown object to the most common
    class of its k nearest neighbors
  • Choice of k? (bias-variance tradeoff again)
  • Choice of metric?
  • Need all the training data to be present to classify a
    new point ("lazy" method)
  • Surprisingly strong asymptotic results (e.g. the
    asymptotic error rate of 1-NN is no more than
    twice the Bayes error rate)
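A minimal numpy sketch of the k-NN rule with Euclidean distance (choice of k and of the metric are exactly the tuning questions above; the toy data are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Assign x to the most common class among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean metric
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
y_train = np.array([1, 1, 2, 2])
print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))   # -> 2
```

Being a lazy method, all of X_train is kept around and scanned at prediction time.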

35
Flexible Metric NN Classification
36
Naïve Bayes Classification
Recall p(ck | x) ∝ p(x | ck) p(ck). Now
suppose the features are conditionally independent
given the class:
p(x | ck) = Πj p(xj | ck).
Then p(ck | x) ∝ p(ck) Πj p(xj | ck).
Equivalently, the log posterior odds are the log prior
odds plus a sum of per-feature "weights of evidence":
log [p(c1 | x)/p(c2 | x)] = log [p(c1)/p(c2)] + Σj log [p(xj | c1)/p(xj | c2)]

(Graphical model: class node C with arrows to x1, x2, ..., xp)
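A small sketch of the weights-of-evidence form for two classes and binary features (the feature probabilities and priors here are made up for illustration):

```python
import numpy as np

# p(x_j = 1 | c) for each of 3 binary features, under class 1 and class 2 (illustrative).
p_feat = np.array([[0.8, 0.3, 0.6],    # class 1
                   [0.2, 0.4, 0.7]])   # class 2
prior = np.array([0.5, 0.5])

def log_posterior_odds(x):
    """log p(c1|x)/p(c2|x) = log prior odds + sum_j weight of evidence of x_j."""
    lik = np.where(x == 1, p_feat, 1 - p_feat)   # p(x_j | c_k) per class and feature
    woe = np.log(lik[0]) - np.log(lik[1])        # per-feature weights of evidence
    return np.log(prior[0] / prior[1]) + woe.sum()

x = np.array([1, 0, 1])
print(log_posterior_odds(x))   # > 0 means class 1 is more probable
```

Summing the per-feature terms is what the evidence balance sheet on the next slide displays row by row.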
37
Evidence Balance Sheet
38
Naïve Bayes (cont.)
  • Despite the crude conditional independence
    assumption, works well in practice (see Friedman,
    1997 for a partial explanation)
  • Can be further enhanced with boosting, bagging,
    model averaging, etc.
  • Can relax the conditional independence
    assumptions in myriad ways (Bayesian networks)

39
Dietterich (1999) Analysis of 33 UCI datasets