Title: Classification
1. Classification
2. Classification task
- Input: a training set of tuples, each labeled with one class label
- Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
3. What is Classification
- Data classification is a two-step process
  - first step: a model is built describing a predetermined set of data classes or concepts
  - second step: the model is used for classification
- Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
- Data tuples are also referred to as samples, examples, or objects
4. Train and test
- The tuples (examples, samples) are divided into a training set and a test set
- The classification model is built in two steps
  - training - build the model from the training set
  - test - check the accuracy of the model using the test set
5. Train and test
- Kinds of models
  - if-then rules
  - logical formulae
  - decision trees
- Accuracy of models
  - the known class of test samples is matched against the class predicted by the model
  - accuracy rate: the percentage of test set samples correctly classified by the model
6. Training step
[Diagram: training data + classification algorithm → classifier (model), e.g.: if age < 31 or Car Type = Sports then Risk = High]
7. Test step
[Diagram: the classifier (model) is applied to the test data]
8. Classification (prediction)
[Diagram: the classifier (model) is applied to new data]
9. Classification vs. Prediction
- There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends
  - classification predicts categorical labels
  - prediction models continuous-valued functions
10. Comparing Classification Methods (1)
- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data
- Speed: the computation costs involved in generating and using the model
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values
11. Comparing Classification Methods (2)
- Scalability: the ability to construct the model efficiently given large amounts of data
- Interpretability: the level of understanding and insight that is provided by the model
- Simplicity
  - decision tree size
  - rule compactness
- Domain-dependent quality indicators
12. Problem formulation
- Given records in the database with class labels, find a model for each class.
[Example decision tree: Age < 31? yes → Risk = High; no → Car Type is sports? yes → Risk = High; no → Risk = Low]
13. Classification techniques
- Decision Tree Classification
- Bayesian Classifiers
- Neural Networks
- Statistical Analysis
- Genetic Algorithms
- Rough Set Approach
- k-nearest neighbor classifiers
14. Classification by Decision Tree Induction
- A decision tree is a tree structure, where
  - each internal node denotes a test on an attribute,
  - each branch represents the outcome of the test,
  - leaf nodes represent classes or class distributions
[Example tree: Age < 31? - Y → Risk = High; N → Car Type is sports? - Y → Risk = High; N → Risk = Low]
15. Decision Tree Induction (1)
- A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.
- Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned
16. Decision Tree Induction (2)
- Basic algorithm: a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- Many variants
  - from machine learning (ID3, C4.5)
  - from statistics (CART)
  - from pattern recognition (CHAID)
- Main difference: the split criterion
17. Decision Tree Induction (3)
- The algorithm consists of two phases
  - Build an initial tree from the training data such that each leaf node is pure
  - Prune this tree to increase its accuracy on test data
18. Tree Building
- In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
- The form of the split used to partition the data depends on the type of the attribute used in the split
  - for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
  - for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊂ domain(A)
19. Tree Building Algorithm
Make Tree (Training Data T):
    Partition(T)

Partition(Data S):
    if (all points in S are in the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
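A minimal Python sketch of this recursive partitioning scheme, assuming numeric attributes and using the Gini impurity introduced a few slides later as the split criterion (all function and variable names here are illustrative, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold, weighted impurity) of the best binary split."""
    best = None
    for a in range(len(rows[0])):
        for x in sorted({r[a] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[a] <= x]
            right = [l for r, l in zip(rows, labels) if r[a] > x]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (a, x, score)
    return best

def partition(rows, labels):
    """Recursively partition S until each partition holds a single class."""
    if len(set(labels)) == 1:                         # all points in S are in the same class
        return {"class": labels[0]}
    best = best_split(rows, labels)                   # evaluate splits, keep the best one
    if best is None:                                  # no useful split left: majority class
        return {"class": Counter(labels).most_common(1)[0][0]}
    a, x, _ = best
    left = [(r, l) for r, l in zip(rows, labels) if r[a] <= x]
    right = [(r, l) for r, l in zip(rows, labels) if r[a] > x]
    return {"split": (a, x),
            "left": partition([r for r, _ in left], [l for _, l in left]),
            "right": partition([r for r, _ in right], [l for _, l in right])}
```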
20. Tree Building Algorithm
- While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
- To evaluate the goodness of a split, several splitting indices have been proposed
21. Split Criteria
- Gini index (CART, SPRINT)
  - select the attribute that minimizes the impurity of a split
- Information gain (ID3, C4.5)
  - entropy is used to measure the impurity of a split
  - select the attribute that maximizes the entropy reduction
- χ² contingency table statistic (CHAID)
  - measures the correlation between each attribute and the class label
  - select the attribute with maximal correlation
22. Gini index (1)
- Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.
- [Diagram: training set → classifier (model)]
- The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories
23. Gini index (2)
- SPRINT algorithm
Partition(Data S):
    if (all points in S are of the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition S into S1 and S2
    Partition(S1)
    Partition(S2)

Initial call: Partition(Training Data)
24. Gini index (3)
- Definition
  - gini(S) = 1 - Σ pj²
  - where
    - S is a data set containing examples from n classes
    - pj is the relative frequency of class j in S
- E.g. for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
  - p_pos = p/(p + n), p_neg = n/(n + p)
  - gini(S) = 1 - p_pos² - p_neg²
25. Gini index (4)
- If dataset S is split into S1 and S2, then the splitting index is defined as follows
  - gini_SPLIT(S) = (p1 + n1)/(p + n) · gini(S1) + (p2 + n2)/(p + n) · gini(S2)
  - where p1, n1 (p2, n2) denote the number of Pos-elements and Neg-elements in the dataset S1 (S2), respectively.
- In this definition the "best" split point is the one with the lowest value of the gini_SPLIT index.
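A small Python sketch of gini(S) and gini_SPLIT as defined above (function names and the toy labels are my own):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split of S into S1 and S2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Two classes, Pos and Neg: values close to 0 indicate nearly pure partitions.
print(gini(["Pos", "Pos", "Neg"]))           # 1 - (2/3)^2 - (1/3)^2 = 0.444...
print(gini_split(["Pos", "Pos"], ["Neg"]))   # (2/3)*0 + (1/3)*0 = 0.0
```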
26. Example (1)
[Training-set table: six car-insurance records with attributes Age, Car Type, and the class label Risk]
27. Example (1)
[Attribute list for Age and attribute list for Car Type]
28. Example (2)
- Possible values of a split point for the Age attribute are
  - Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68
- G(Age ≤ 17) = 1 - (1² + 0²) = 0
- G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
- G_SPLIT = (1/6) · 0 + (5/6) · (12/25) = 2/5
29. Example (3)
- G(Age ≤ 20) = 1 - (1² + 0²) = 0
- G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
- G_SPLIT = (2/6) · 0 + (4/6) · (1/2) = 1/3
- G(Age ≤ 23) = 1 - (1² + 0²) = 0
- G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - (1/9) - (4/9) = 4/9
- G_SPLIT = (3/6) · 0 + (3/6) · (4/9) = 2/9
30. Example (4)
- G(Age ≤ 32) = 1 - ((3/4)² + (1/4)²) = 1 - 10/16 = 6/16 = 3/8
- G(Age > 32) = 1 - ((1/2)² + (1/2)²) = 1/2
- G_SPLIT = (4/6) · (3/8) + (2/6) · (1/2) = 1/4 + 1/6 = 5/12
- The lowest value of G_SPLIT is for Age ≤ 23, thus we place the split point at Age = (23 + 32)/2 = 27.5
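A quick way to check these numbers: the six (Age, Risk) records below are reconstructed so that they reproduce the G values quoted in the example; the original training table itself is not shown on these slides.

```python
# Ages and risk labels reconstructed from the G values above.
ages  = [17, 20, 23, 32, 43, 68]
risks = ["High", "High", "High", "Low", "High", "Low"]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

for threshold in ages[:-1]:                    # Age <= 68 puts everything on one side
    left  = [r for a, r in zip(ages, risks) if a <= threshold]
    right = [r for a, r in zip(ages, risks) if a > threshold]
    g_split = len(left) / 6 * gini(left) + len(right) / 6 * gini(right)
    print(f"Age <= {threshold}: G_SPLIT = {g_split:.3f}")
# The minimum (2/9 = 0.222) is reached at Age <= 23, giving the split point (23 + 32)/2 = 27.5.
```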
31. Example (5)
Decision tree after the first split of the example set:
[Age ≤ 27.5 → Risk = High; Age > 27.5 → Risk = Low]
32. Example (6)
- Attribute lists are divided at the split point.
[Attribute lists for Age ≤ 27.5 and attribute lists for Age > 27.5]
33. Example (7)
- Evaluating splits for categorical attributes: we have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.
- G(Car type ∈ {sport}) = 1 - 1² - 0² = 0
- G(Car type ∈ {family}) = 1 - 0² - 1² = 0
- G(Car type ∈ {truck}) = 1 - 0² - 1² = 0
34. Example (8)
- G(Car type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
- G(Car type ∈ {sport, truck}) = 1/2
- G(Car type ∈ {family, truck}) = 1 - 0² - 1² = 0
- G_SPLIT(Car type ∈ {sport}) = (1/3) · 0 + (2/3) · 0 = 0
- G_SPLIT(Car type ∈ {family}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
- G_SPLIT(Car type ∈ {truck}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
- G_SPLIT(Car type ∈ {sport, family}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
- G_SPLIT(Car type ∈ {sport, truck}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
- G_SPLIT(Car type ∈ {family, truck}) = (2/3) · 0 + (1/3) · 0 = 0
35. Example (9)
- The lowest value of G_SPLIT is for Car type ∈ {sport}, thus this is our split point.
Decision tree after the second split of the example set:
[Age ≤ 27.5 → Risk = High; Age > 27.5 → Car type ∈ {sport} → Risk = High; Car type ∈ {family, truck} → Risk = Low]
36. Information Gain (1)
- The information gain measure is used to select the test attribute at each node in the tree
- The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
- This attribute minimizes the information needed to classify the samples in the resulting partitions
37. Information Gain (2)
- Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes Ci (for i = 1, ..., m)
- Let si be the number of samples of S in class Ci
- The expected information needed to classify a given sample is given by
  - I(s1, s2, ..., sm) = - Σ pi log2(pi)
- where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.
38. Information Gain (3)
- Let attribute A have v distinct values, a1, a2, ..., av. Attribute A can be used to partition S into S1, S2, ..., Sv, where Sj contains those samples in S that have value aj of A
- If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S
39. Information Gain (4)
- Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by
  - E(A) = Σj (s1j + s2j + ... + smj)/s · I(s1j, s2j, ..., smj)
- The smaller the entropy value, the greater the purity of the subset partitions.
40. Information Gain (5)
- The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S.
- Note that for a given subset Sj,
  - I(s1j, s2j, ..., smj) = - Σi pij log2(pij)
- where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci
41. Information Gain (6)
- The encoding information that would be gained by branching on A is
  - Gain(A) = I(s1, s2, ..., sm) - E(A)
- Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
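A sketch of I(...), E(A) and Gain(A) in Python (names are illustrative; attribute values and class labels are passed as parallel lists):

```python
import math
from collections import Counter

def info(labels):
    """I(s1, ..., sm) = -sum pi * log2(pi), with pi estimated by si/s."""
    s = len(labels)
    return -sum((si / s) * math.log2(si / s) for si in Counter(labels).values())

def entropy_of_attribute(values, labels):
    """E(A): weighted information of the subsets induced by attribute A."""
    s = len(labels)
    e = 0.0
    for a in set(values):
        subset = [l for v, l in zip(values, labels) if v == a]
        e += (len(subset) / s) * info(subset)
    return e

def gain(values, labels):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(labels) - entropy_of_attribute(values, labels)
```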
42. Example (1)
[Training-set table: the customer database with attributes age, income, student, credit_rating, and the class label buys_computer]
43. Example (2)
- Let us consider the training set of tuples taken from the customer database.
- The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).
  - C1 corresponds to yes: s1 = 9
  - C2 corresponds to no: s2 = 5
- I(s1, s2) = I(9, 5) = -9/14 · log2(9/14) - 5/14 · log2(5/14) = 0.94
44. Example (3)
- Next, we need to compute the entropy of each attribute. Let us start with the attribute age
  - for age < 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - for age 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
  - for age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
45. Example (4)
- The entropy of age is
  - E(age) = 5/14 · I(s11, s21) + 4/14 · I(s12, s22) + 5/14 · I(s13, s23) = 0.694
- The gain in information from such a partitioning would be
  - Gain(age) = I(s1, s2) - E(age) = 0.246
46. Example (5)
- Similarly, we can compute
  - Gain(income) = 0.029,
  - Gain(student) = 0.151, and
  - Gain(credit_rating) = 0.048
- Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
47. Example (6)
[Tree after the first split: root node age with branches <30, 31..40, >40; the <30 and >40 partitions still contain both classes (buys_computer = yes and no), while the 31..40 partition contains only yes]
48. Example (7)
[Final decision tree:
age <30 → student? (no → no, yes → yes)
age 31..40 → yes
age >40 → credit_rating? (excellent → no, fair → yes)]
49. Entropy vs. Gini index
- Entropy tends to find groups of classes that add up to 50% of the data
- The Gini index tends to isolate the largest class from all other classes
[Example: a node with class A 40%, B 30%, C 20%, D 10%. An entropy-based split (if age < 65) separates {A 40%, D 10%} from {B 30%, C 20%}; a Gini-based split (if age < 40) separates {A 40%} from {B 30%, C 20%, D 10%}]
50. Tree pruning
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
- Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data
51. Tree pruning
- Prepruning approach (stopping): the tree is pruned by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples
- Postpruning approach (pruning): removes branches from a fully grown tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches
52. Extracting Classification Rules from Decision Trees
- The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
- One rule is created for each path from the root to a leaf node
- Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent
53. Extracting Classification Rules from Decision Trees
- The decision tree of Example (7) can be converted to the following classification rules
  - IF age < 30 AND student = no THEN buys_computer = no
  - IF age < 30 AND student = yes THEN buys_computer = yes
  - IF age = 31..40 THEN buys_computer = yes
  - IF age > 40 AND credit_rating = excellent THEN buys_computer = no
  - IF age > 40 AND credit_rating = fair THEN buys_computer = yes
54. Other Classification Methods
- There are a number of classification methods in the literature
  - Bayesian classifiers
  - Neural-network classifiers
  - k-nearest neighbor classifiers
  - Association-based classifiers
  - Rough and fuzzy sets
55. Classification Based on Concepts from Association Rule Mining
- We may apply a quantitative rule mining approach to discover classification rules: associative classification
- It mines rules of the form condset ⇒ y, where condset is a set of items (or attribute-value pairs) and y is a class label.
56. Bayesian classifiers
- A Bayesian classifier is a statistical classifier. It can predict the probability that a given sample belongs to a particular class.
- Bayesian classification is based on Bayes' theorem of a-posteriori probability.
- Let X be a data sample whose class label is unknown. Each sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).
- The classification problem may be formulated using a-posteriori probabilities as follows: determine P(C|X), the probability that the sample X belongs to a specified class C.
- P(C|X) is the a-posteriori probability of C conditioned on X.
57. Bayesian classifiers
- Example
  - Given a set of samples describing credit applicants, P(Risk = low | Age = 38, Marital_Status = divorced, Income = low, Children = 2) is the probability that a credit applicant X = (38, divorced, low, 2) is a low credit risk applicant.
- The idea of Bayesian classification is to assign to a new unknown sample X the class label C such that P(C|X) is maximal.
58. Bayesian classifiers
- The main problem is how to estimate the a-posteriori probability P(C|X).
- By Bayes' theorem
  - P(C|X) = (P(X|C) · P(C)) / P(X)
- where P(C) is the a-priori probability of C, that is the probability that any given sample belongs to the class C, P(X|C) is the a-posteriori probability of X conditioned on C, and P(X) is the a-priori probability of X.
- In our example, P(X|C) is the probability of X = (38, divorced, low, 2) given the class Risk = low, P(C) is the probability of the class C, and P(X) is the probability that the sample is X = (38, divorced, low, 2).
59. Bayesian classifiers
- Suppose a training database D consists of n samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i, for i = 1, ..., m.
- Let s_i denote the number of samples of D in class C_i.
- The Bayesian classifier assigns an unknown sample X to the class C_i that maximizes P(C_i|X). Since P(X) is constant for all classes, the class C_i for which P(C_i|X) is maximized is the class C_i for which P(X|C_i) · P(C_i) is maximized.
- P(C_i) may be estimated by s_i/n (the relative frequency of class C_i), or we may assume that all classes have the same probability: P(C_1) = P(C_2) = ... = P(C_m).
60. Bayesian classifiers
- The main problem is how to compute P(X|C_i).
- Given a large dataset with many predictor attributes, it would be very expensive to compute P(X|C_i); therefore, to reduce the cost of computing it, the assumption of class conditional independence (in other words, the attribute independence assumption) is made.
- The assumption states that there are no dependencies among predictor attributes, which leads to the following formula
  - P(X|C_i) = Π_{j=1..n} P(x_j|C_i)
61. Bayesian classifiers
- The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the dataset:
  - if the j-th attribute is categorical, then P(x_j|C_i) is estimated as the relative frequency of samples of class C_i having value x_j for the j-th attribute,
  - if the j-th attribute is continuous, then P(x_j|C_i) is estimated through a Gaussian density function.
- Due to the class conditional independence assumption, the Bayesian classifier is also known as the naive Bayesian classifier.
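A minimal naive Bayesian classifier along these lines, for categorical attributes only; the toy credit data, the absence of smoothing for unseen values, and all names are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(C_i) and P(x_j | C_i) as relative frequencies (categorical attributes)."""
    n = len(samples)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(Counter)                     # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1
    return prior, cond

def classify(x, prior, cond):
    """Return the class C_i maximizing P(C_i) * prod_j P(x_j | C_i)."""
    best_class, best_score = None, -1.0
    for c, p_c in prior.items():
        n_c = sum(cond[(c, 0)].values())            # number of training samples in class c
        score = p_c
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / n_c          # relative frequency; 0 if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up credit data in the spirit of the slides' example
samples = [("divorced", "low"), ("married", "high"), ("single", "low"), ("married", "low")]
labels  = ["low", "low", "high", "low"]             # credit risk
prior, cond = train_naive_bayes(samples, labels)
print(classify(("divorced", "low"), prior, cond))   # -> "low"
```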
62. Bayesian classifiers
- The assumption makes computation possible. Moreover, when the assumption is satisfied, the naive Bayesian classifier is optimal, that is, it is the most accurate classifier in comparison with all other classifiers.
- However, the assumption is seldom satisfied in practice, since attributes are usually correlated.
- Several attempts are being made to apply Bayesian analysis without assuming attribute independence. The resulting models are called Bayesian networks or Bayesian belief networks.
- Bayesian belief networks combine Bayesian analysis with causal relationships between attributes.
63. k-nearest neighbor classifiers
- The nearest neighbor classifier belongs to the family of instance-based learning methods.
- Instance-based learning methods differ from the classification methods discussed earlier in that they do not build a classifier until a new unknown sample needs to be classified.
- Each training sample is described by an n-dimensional vector representing a point in an n-dimensional space called the pattern space. When a new unknown sample has to be classified, a distance function is used to determine the member of the training set that is closest to the unknown sample.
64. k-nearest neighbor classifiers
- Once the nearest training sample is located in the pattern space, its class label is assigned to the unknown sample.
- The main drawback of this approach is that it is very sensitive to noisy training samples.
- The common solution to this problem is to adopt the k-nearest neighbor strategy.
- When a new unknown sample has to be classified, the classifier searches the pattern space for the k training samples which are closest to the unknown sample. These k training samples are called the k "nearest neighbors" of the unknown sample, and the most common class label among the k "nearest neighbors" is assigned to the unknown sample.
- To find the k "nearest neighbors" of the unknown sample, a multidimensional index can be used (e.g. R-tree, Pyramid tree, etc.).
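A brute-force k-nearest neighbor sketch with Euclidean distance (a multidimensional index such as an R-tree would replace the linear scan; the toy data is illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, labels, x, k=3):
    """Assign to x the most common class among its k nearest training samples."""
    # Brute-force scan of the pattern space; an index structure would speed this up.
    neighbors = sorted(zip(training, labels), key=lambda tl: euclidean(tl[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage with two numeric attributes
training = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (4.8, 5.2)]
labels   = ["Low", "Low", "High", "High"]
print(knn_classify(training, labels, (1.1, 0.9), k=3))   # -> "Low"
```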
65. k-nearest neighbor classifiers
- Two different issues need to be addressed regarding the k-nearest neighbor method
  - the distance function, and
  - the transformation from a sample to a point in the pattern space.
- The first issue is how to define the distance function. If the attributes are numeric, most k-nearest neighbor classifiers use the Euclidean distance.
- Instead of the Euclidean distance, we may also apply other distance metrics such as the Manhattan distance, the maximum of dimensions, or the Minkowski distance.
66. k-nearest neighbor classifiers
- The second issue is how to transform a sample into a point in the pattern space.
- Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have a larger scale or higher variability.
- A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1].
67. k-nearest neighbor classifiers
- This solution is sensitive to the outlier problem, since a single outlier could cause virtually all other values to be contained in a small subrange.
- Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation.
- Recently, another approach has been proposed which consists in applying a robust space transformation called the Donoho-Stahel estimator; the estimator has some important and useful properties that make it very attractive for different data mining applications.
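A sketch of the two simpler transformations mentioned above, min-max normalization into [0, 1] and standardization by mean and standard deviation (the Donoho-Stahel estimator is not shown):

```python
def min_max_normalize(values):
    """Scale values into the range [0, 1]; sensitive to outliers."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ages = [17, 20, 23, 32, 43, 68]
print(min_max_normalize(ages))   # 17 -> 0.0, 68 -> 1.0
print(standardize(ages))         # zero mean, unit variance
```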
68. Classifier accuracy
- The accuracy of a classifier on a given test set of samples is defined as the percentage of test samples correctly classified by the classifier, and it measures the overall performance of the classifier.
- Note that the accuracy of the classifier is not estimated on the training dataset, since it would not be a good indicator of future accuracy on new data.
- The reason is that the classifier generated from the training dataset tends to overfit the training data, and any estimate of the classifier's accuracy based on that data will be overoptimistic.
69. Classifier accuracy
- In other words, the classifier is more accurate on the data that was used to train it, but very likely it will be less accurate on an independent set of data.
- To predict the accuracy of the classifier on new data, we need to assess its accuracy on an independent dataset that played no part in the formation of the classifier.
- This dataset is called the test set.
- It is important to note that the test dataset should not be used in any way to build the classifier.
70. Classifier accuracy
- There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing.
- If there is a lot of sample data, the following simple holdout method is usually applied.
- The given set of samples is randomly partitioned into two independent sets, a training set and a test set (typically, 70% of the data is used for training, and the remaining 30% is used for testing).
- Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of accuracy on new data.
71. Classifier accuracy
- In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both the training and the test set is approximately the same as in the initial data set.
- This procedure is called stratification.
72. Testing large dataset
[Diagram: the available examples are divided randomly into a training set (70%), used to develop one tree, and a test set (30%), used to check its accuracy]
73. Classifier accuracy
- If the amount of data for training and testing is limited, the problem is how to use this limited amount of data for training, to get a good classifier, and for testing, to obtain a correct estimate of the classifier's accuracy.
- The standard and very common technique for measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation.
- In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal, mutually exclusive subsets, called folds: S_1, S_2, ..., S_k.
74. Classifier accuracy
- Training and testing are performed k times. At each iteration, one fold is used for testing while the remaining k-1 folds are used for training. So, at the end, each fold has been used exactly once for testing and k-1 times for training.
- The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of samples N in the initial dataset.
- Often, the k-fold cross-validation technique is combined with stratification and is then called stratified k-fold cross-validation.
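A sketch of k-fold cross-validation; `build_classifier` stands for any training routine that returns a classification function, and the simple round-robin fold assignment assumes the data has been shuffled beforehand:

```python
def k_fold_accuracy(samples, labels, build_classifier, k=10):
    """Estimate accuracy as (total correct over k iterations) / N."""
    data = list(zip(samples, labels))
    folds = [data[i::k] for i in range(k)]            # k roughly equal, mutually exclusive folds
    correct = 0
    for i in range(k):
        test = folds[i]                                # one fold for testing
        train = [d for j, f in enumerate(folds) if j != i for d in f]   # k-1 folds for training
        classify = build_classifier(train)             # any training routine
        correct += sum(1 for x, y in test if classify(x) == y)
    return correct / len(data)
```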
75. Testing small dataset: cross-validation
[Diagram: repeat 10 times - the available examples are divided into a training set (90%), used to develop 10 different trees, and a test set (10%), used to check accuracy]
76. Classifier accuracy
- There are many other methods of estimating classifier accuracy on a particular dataset.
- Two popular methods are leave-one-out cross-validation and bootstrapping.
- Leave-one-out cross-validation is simply N-fold cross-validation, where N is the number of samples in the initial dataset.
- At each iteration, a single sample from the dataset is left out for testing, and the remaining samples are used for training. The result of testing is either success or failure.
- The results of all N evaluations, one for each sample from the dataset, are averaged, and that average represents the final accuracy estimate.
77. Classifier accuracy
- Bootstrapping is based on sampling with replacement.
- The initial dataset is sampled N times, where N is the total number of samples in the dataset, with replacement, to form another set of N samples for training.
- Since some samples in this new "set" will be repeated, some samples from the initial dataset will not appear in this training set; these samples form the test set.
- Both of these estimation methods are especially interesting for estimating classifier accuracy on small datasets. In practice, the standard and most popular technique for estimating classifier accuracy is stratified tenfold cross-validation.
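A sketch of the bootstrap split described above (sampling N times with replacement for training, and using the never-drawn samples as the test set); the helper name and fixed seed are illustrative:

```python
import random

def bootstrap_split(data, seed=0):
    """Sample N training examples with replacement; unsampled examples form the test set."""
    random.seed(seed)
    n = len(data)
    train_idx = [random.randrange(n) for _ in range(n)]   # N draws with replacement
    test_idx = set(range(n)) - set(train_idx)             # samples never drawn
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    return train, test
```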
78. Requirements
- Focus on mega-induction
- Handle both continuous and categorical data
- No restrictions on
  - the number of examples
  - the number of attributes
  - the number of classes
79. Applications
- Treatment effectiveness
- Credit approval
- Store location
- Target marketing
- Insurance companies (fraud detection)
- Telecommunication companies (client classification)