Title: Classification
1. Classification
2. Classification task
- Input: a training set of tuples, each labeled with one class label
- Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
- The model can be used to predict the class of new tuples, for which the class label is missing or unknown
3. What is Classification
- Data classification is a two-step process
  - first step: a model is built describing a predetermined set of data classes or concepts
  - second step: the model is used for classification
- Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
- Data tuples are also referred to as samples, examples, or objects
4. Train and test
- The tuples (examples, samples) are divided into a training set and a test set
- The classification model is built in two steps
  - training - build the model from the training set
  - test - check the accuracy of the model using the test set
5. Train and test
- Kinds of models
  - if-then rules
  - logical formulae
  - decision trees
- Accuracy of models
  - the known class of test samples is matched against the class predicted by the model
  - accuracy rate: the percentage of test set samples correctly classified by the model
6. Training step
[Diagram: training data + classification algorithm → classifier (model), e.g.: if age < 31 or Car Type = Sports then Risk = High]
7. Test step
[Diagram: the classifier (model) is applied to the test data]
8. Classification (prediction)
[Diagram: the classifier (model) is applied to new data]
9. Classification vs. Prediction
- There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends
  - classification predicts categorical labels
  - prediction models continuous-valued functions
10. Comparing Classification Methods (1)
- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data
- Speed: the computation costs involved in generating and using the model
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values
11. Comparing Classification Methods (2)
- Scalability: the ability to construct the model efficiently given large amounts of data
- Interpretability: the level of understanding and insight that is provided by the model
- Simplicity
  - decision tree size
  - rule compactness
- Domain-dependent quality indicators
12. Problem formulation
- Given records in the database with class labels, find a model for each class.
[Example decision tree: Age < 31? yes → Risk = High; no → Car Type is sports? yes → Risk = High; no → Risk = Low]
13. Classification techniques
- Decision Tree Classification
- Bayesian Classifiers
- Neural Networks
- Statistical Analysis
- Genetic Algorithms
- Rough Set Approach
- k-nearest neighbor classifiers
14. Classification by Decision Tree Induction
- A decision tree is a tree structure, where
  - each internal node denotes a test on an attribute,
  - each branch represents the outcome of the test,
  - leaf nodes represent classes or class distributions
[Example tree: Age < 31? - Y → Risk = High; N → Car Type is sports? - Y → Risk = High; N → Risk = Low]
15. Decision Tree Induction (1)
- A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.
- Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned
16. Decision Tree Induction (2)
- Basic algorithm: a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- Many variants
  - from machine learning (ID3, C4.5)
  - from statistics (CART)
  - from pattern recognition (CHAID)
- Main difference: the split criterion
17. Decision Tree Induction (3)
- The algorithm consists of two phases
  - Build an initial tree from the training data such that each leaf node is pure
  - Prune this tree to increase its accuracy on test data
18. Tree Building
- In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
- The form of the split used to partition the data depends on the type of the attribute used in the split
  - for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
  - for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊂ domain(A)
19. Tree Building Algorithm
Make Tree (Training Data T):
    Partition(T)

Partition(Data S):
    if (all points in S are in the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
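A minimal Python sketch of this recursive partitioning scheme, assuming numeric attributes and using the Gini impurity introduced a few slides later as the split criterion (all function and variable names here are illustrative, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold, weighted impurity) of the best binary split."""
    best = None
    for a in range(len(rows[0])):
        for x in sorted({r[a] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[a] <= x]
            right = [l for r, l in zip(rows, labels) if r[a] > x]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (a, x, score)
    return best

def partition(rows, labels):
    """Recursively partition S until each partition holds a single class."""
    if len(set(labels)) == 1:                         # all points in S are in the same class
        return {"class": labels[0]}
    best = best_split(rows, labels)                   # evaluate splits, keep the best one
    if best is None:                                  # no useful split left: majority class
        return {"class": Counter(labels).most_common(1)[0][0]}
    a, x, _ = best
    left = [(r, l) for r, l in zip(rows, labels) if r[a] <= x]
    right = [(r, l) for r, l in zip(rows, labels) if r[a] > x]
    return {"split": (a, x),
            "left": partition([r for r, _ in left], [l for _, l in left]),
            "right": partition([r for r, _ in right], [l for _, l in right])}
```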
20. Tree Building Algorithm
- While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
- To evaluate the goodness of a split, several splitting indices have been proposed
21. Split Criteria
- Gini index (CART, SPRINT)
  - select the attribute that minimizes the impurity of a split
- Information gain (ID3, C4.5)
  - entropy is used to measure the impurity of a split
  - select the attribute that maximizes the entropy reduction
- χ² contingency table statistic (CHAID)
  - measures the correlation between each attribute and the class label
  - select the attribute with maximal correlation
22. Gini index (1)
- Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.
- [Diagram: training set → classifier (model)]
- The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories
23. Gini index (2)
- SPRINT algorithm
Partition(Data S):
    if (all points in S are of the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition S into S1 and S2
    Partition(S1)
    Partition(S2)

Initial call: Partition(Training Data)
24. Gini index (3)
- Definition
  - gini(S) = 1 - Σ pj²
  - where
    - S is a data set containing examples from n classes
    - pj is the relative frequency of class j in S
- E.g. for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
  - p_pos = p/(p + n), p_neg = n/(n + p)
  - gini(S) = 1 - p_pos² - p_neg²
25. Gini index (4)
- If dataset S is split into S1 and S2, then the splitting index is defined as follows
  - gini_SPLIT(S) = (p1 + n1)/(p + n) · gini(S1) + (p2 + n2)/(p + n) · gini(S2)
  - where p1, n1 (p2, n2) denote the number of Pos-elements and Neg-elements in the dataset S1 (S2), respectively.
- In this definition the "best" split point is the one with the lowest value of the gini_SPLIT index.
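A small Python sketch of gini(S) and gini_SPLIT as defined above (function names and the toy labels are my own):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split of S into S1 and S2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Two classes, Pos and Neg: values close to 0 indicate nearly pure partitions.
print(gini(["Pos", "Pos", "Neg"]))           # 1 - (2/3)^2 - (1/3)^2 = 0.444...
print(gini_split(["Pos", "Pos"], ["Neg"]))   # (2/3)*0 + (1/3)*0 = 0.0
```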
26. Example (1)
[Training-set table: six car-insurance records with attributes Age, Car Type, and the class label Risk]
27. Example (1)
[Attribute list for Age and attribute list for Car Type]
28. Example (2)
- Possible values of a split point for the Age attribute are
  - Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68
- G(Age ≤ 17) = 1 - (1² + 0²) = 0
- G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
- G_SPLIT = (1/6) · 0 + (5/6) · (12/25) = 2/5
29. Example (3)
- G(Age ≤ 20) = 1 - (1² + 0²) = 0
- G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
- G_SPLIT = (2/6) · 0 + (4/6) · (1/2) = 1/3
- G(Age ≤ 23) = 1 - (1² + 0²) = 0
- G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - (1/9) - (4/9) = 4/9
- G_SPLIT = (3/6) · 0 + (3/6) · (4/9) = 2/9
30. Example (4)
- G(Age ≤ 32) = 1 - ((3/4)² + (1/4)²) = 1 - 10/16 = 6/16 = 3/8
- G(Age > 32) = 1 - ((1/2)² + (1/2)²) = 1/2
- G_SPLIT = (4/6) · (3/8) + (2/6) · (1/2) = 1/4 + 1/6 = 5/12
- The lowest value of G_SPLIT is for Age ≤ 23, thus we place the split point at Age = (23 + 32)/2 = 27.5
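A quick way to check these numbers: the six (Age, Risk) records below are reconstructed so that they reproduce the G values quoted in the example; the original training table itself is not shown on these slides.

```python
# Ages and risk labels reconstructed from the G values above.
ages  = [17, 20, 23, 32, 43, 68]
risks = ["High", "High", "High", "Low", "High", "Low"]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

for threshold in ages[:-1]:                    # Age <= 68 puts everything on one side
    left  = [r for a, r in zip(ages, risks) if a <= threshold]
    right = [r for a, r in zip(ages, risks) if a > threshold]
    g_split = len(left) / 6 * gini(left) + len(right) / 6 * gini(right)
    print(f"Age <= {threshold}: G_SPLIT = {g_split:.3f}")
# The minimum (2/9 = 0.222) is reached at Age <= 23, giving the split point (23 + 32)/2 = 27.5.
```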
31. Example (5)
Decision tree after the first split of the example set:
[Age ≤ 27.5 → Risk = High; Age > 27.5 → Risk = Low]
32. Example (6)
- Attribute lists are divided at the split point.
[Attribute lists for Age ≤ 27.5 and attribute lists for Age > 27.5]
33. Example (7)
- Evaluating splits for categorical attributes: we have to evaluate the splitting index for each of the 2^N combinations, where N is the cardinality of the categorical attribute.
- G(Car type ∈ {sport}) = 1 - 1² - 0² = 0
- G(Car type ∈ {family}) = 1 - 0² - 1² = 0
- G(Car type ∈ {truck}) = 1 - 0² - 1² = 0
34. Example (8)
- G(Car type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
- G(Car type ∈ {sport, truck}) = 1/2
- G(Car type ∈ {family, truck}) = 1 - 0² - 1² = 0
- G_SPLIT(Car type ∈ {sport}) = (1/3) · 0 + (2/3) · 0 = 0
- G_SPLIT(Car type ∈ {family}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
- G_SPLIT(Car type ∈ {truck}) = (1/3) · 0 + (2/3) · (1/2) = 1/3
- G_SPLIT(Car type ∈ {sport, family}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
- G_SPLIT(Car type ∈ {sport, truck}) = (2/3) · (1/2) + (1/3) · 0 = 1/3
- G_SPLIT(Car type ∈ {family, truck}) = (2/3) · 0 + (1/3) · 0 = 0
35. Example (9)
- The lowest value of G_SPLIT is for Car type ∈ {sport}, thus this is our split point.
Decision tree after the second split of the example set:
[Age ≤ 27.5 → Risk = High; Age > 27.5 → Car type ∈ {sport} → Risk = High; Car type ∈ {family, truck} → Risk = Low]
36. Information Gain (1)
- The information gain measure is used to select the test attribute at each node in the tree
- The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
- This attribute minimizes the information needed to classify the samples in the resulting partitions
37. Information Gain (2)
- Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes Ci (for i = 1, ..., m)
- Let si be the number of samples of S in class Ci
- The expected information needed to classify a given sample is given by
  - I(s1, s2, ..., sm) = - Σ pi log2(pi)
- where pi is the probability that an arbitrary sample belongs to class Ci and is estimated by si/s.
38. Information Gain (3)
- Let attribute A have v distinct values, a1, a2, ..., av. Attribute A can be used to partition S into S1, S2, ..., Sv, where Sj contains those samples in S that have value aj of A
- If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S
39. Information Gain (4)
- Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by
  - E(A) = Σj (s1j + s2j + ... + smj)/s · I(s1j, s2j, ..., smj)
- The smaller the entropy value, the greater the purity of the subset partitions.
40. Information Gain (5)
- The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S.
- Note that for a given subset Sj,
  - I(s1j, s2j, ..., smj) = - Σi pij log2(pij)
- where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci
41. Information Gain (6)
- The encoding information that would be gained by branching on A is
  - Gain(A) = I(s1, s2, ..., sm) - E(A)
- Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
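A sketch of I(...), E(A) and Gain(A) in Python (names are illustrative; attribute values and class labels are passed as parallel lists):

```python
import math
from collections import Counter

def info(labels):
    """I(s1, ..., sm) = -sum pi * log2(pi), with pi estimated by si/s."""
    s = len(labels)
    return -sum((si / s) * math.log2(si / s) for si in Counter(labels).values())

def entropy_of_attribute(values, labels):
    """E(A): weighted information of the subsets induced by attribute A."""
    s = len(labels)
    e = 0.0
    for a in set(values):
        subset = [l for v, l in zip(values, labels) if v == a]
        e += (len(subset) / s) * info(subset)
    return e

def gain(values, labels):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(labels) - entropy_of_attribute(values, labels)
```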
42. Example (1)
[Training-set table: the customer database with attributes age, income, student, credit_rating, and the class label buys_computer]
43. Example (2)
- Let us consider the training set of tuples taken from the customer database.
- The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).
  - C1 corresponds to yes: s1 = 9
  - C2 corresponds to no: s2 = 5
- I(s1, s2) = I(9, 5) = -9/14 · log2(9/14) - 5/14 · log2(5/14) = 0.94
44. Example (3)
- Next, we need to compute the entropy of each attribute. Let us start with the attribute age
  - for age < 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - for age 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
  - for age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
45. Example (4)
- The entropy of age is
  - E(age) = 5/14 · I(s11, s21) + 4/14 · I(s12, s22) + 5/14 · I(s13, s23) = 0.694
- The gain in information from such a partitioning would be
  - Gain(age) = I(s1, s2) - E(age) = 0.246
46. Example (5)
- Similarly, we can compute
  - Gain(income) = 0.029,
  - Gain(student) = 0.151, and
  - Gain(credit_rating) = 0.048
- Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
47. Example (6)
[Tree after the first split: root node age with branches <30, 31..40, >40; the <30 and >40 partitions still contain both classes (buys_computer = yes and no), while the 31..40 partition contains only yes]
48. Example (7)
[Final decision tree:
age <30 → student? (no → no, yes → yes)
age 31..40 → yes
age >40 → credit_rating? (excellent → no, fair → yes)]
49. Entropy vs. Gini index
- Entropy tends to find groups of classes that add up to 50% of the data
- The Gini index tends to isolate the largest class from all other classes
[Example: a node with class A 40%, B 30%, C 20%, D 10%. An entropy-based split (if age < 65) separates {A 40%, D 10%} from {B 30%, C 20%}; a Gini-based split (if age < 40) separates {A 40%} from {B 30%, C 20%, D 10%}]
50. Tree pruning
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
- Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data
51. Tree pruning
- Prepruning approach (stopping): the tree is pruned by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples
- Postpruning approach (pruning): removes branches from a fully grown tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches
52. Extracting Classification Rules from Decision Trees
- The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
- One rule is created for each path from the root to a leaf node
- Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent
53. Extracting Classification Rules from Decision Trees
- The decision tree of Example (7) can be converted to the following classification rules
  - IF age < 30 AND student = no THEN buys_computer = no
  - IF age < 30 AND student = yes THEN buys_computer = yes
  - IF age = 31..40 THEN buys_computer = yes
  - IF age > 40 AND credit_rating = excellent THEN buys_computer = no
  - IF age > 40 AND credit_rating = fair THEN buys_computer = yes
54. Other Classification Methods
- There are a number of classification methods in the literature
  - Bayesian classifiers
  - Neural-network classifiers
  - k-nearest neighbor classifiers
  - Association-based classifiers
  - Rough and fuzzy sets
55. Classification Based on Concepts from Association Rule Mining
- We may apply a quantitative rule mining approach to discover classification rules: associative classification
- It mines rules of the form condset ⇒ y, where condset is a set of items (or attribute-value pairs) and y is a class label.
56. Bayesian classifiers
- A Bayesian classifier is a statistical classifier. It can predict the probability that a given sample belongs to a particular class.
- Bayesian classification is based on Bayes' theorem of a-posteriori probability.
- Let X be a data sample whose class label is unknown. Each sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).
- The classification problem may be formulated using a-posteriori probabilities as follows: determine P(C|X), the probability that the sample X belongs to a specified class C.
- P(C|X) is the a-posteriori probability of C conditioned on X.
57. Bayesian classifiers
- Example
  - Given a set of samples describing credit applicants, P(Risk = low | Age = 38, Marital_Status = divorced, Income = low, Children = 2) is the probability that a credit applicant X = (38, divorced, low, 2) is a low credit risk applicant.
- The idea of Bayesian classification is to assign to a new unknown sample X the class label C such that P(C|X) is maximal.
58. Bayesian classifiers
- The main problem is how to estimate the a-posteriori probability P(C|X).
- By Bayes' theorem
  - P(C|X) = (P(X|C) · P(C)) / P(X)
- where P(C) is the a-priori probability of C, that is the probability that any given sample belongs to the class C, P(X|C) is the a-posteriori probability of X conditioned on C, and P(X) is the a-priori probability of X.
- In our example, P(X|C) is the probability of X = (38, divorced, low, 2) given the class Risk = low, P(C) is the probability of the class C, and P(X) is the probability that the sample is X = (38, divorced, low, 2).
59. Bayesian classifiers
- Suppose a training database D consists of n samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i, for i = 1, ..., m.
- Let s_i denote the number of samples of D in class C_i.
- The Bayesian classifier assigns an unknown sample X to the class C_i that maximizes P(C_i|X). Since P(X) is constant for all classes, the class C_i for which P(C_i|X) is maximized is the class C_i for which P(X|C_i) · P(C_i) is maximized.
- P(C_i) may be estimated by s_i/n (the relative frequency of class C_i), or we may assume that all classes have the same probability: P(C_1) = P(C_2) = ... = P(C_m).
60. Bayesian classifiers
- The main problem is how to compute P(X|C_i).
- Given a large dataset with many predictor attributes, it would be very expensive to compute P(X|C_i); therefore, to reduce the cost of computing it, the assumption of class conditional independence (in other words, the attribute independence assumption) is made.
- The assumption states that there are no dependencies among predictor attributes, which leads to the following formula
  - P(X|C_i) = Π_{j=1..n} P(x_j|C_i)
61. Bayesian classifiers
- The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the dataset:
  - if the j-th attribute is categorical, then P(x_j|C_i) is estimated as the relative frequency of samples of class C_i having value x_j for the j-th attribute,
  - if the j-th attribute is continuous, then P(x_j|C_i) is estimated through a Gaussian density function.
- Due to the class conditional independence assumption, the Bayesian classifier is also known as the naive Bayesian classifier.
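A minimal naive Bayesian classifier along these lines, for categorical attributes only; the toy credit data, the absence of smoothing for unseen values, and all names are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(C_i) and P(x_j | C_i) as relative frequencies (categorical attributes)."""
    n = len(samples)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(Counter)                     # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1
    return prior, cond

def classify(x, prior, cond):
    """Return the class C_i maximizing P(C_i) * prod_j P(x_j | C_i)."""
    best_class, best_score = None, -1.0
    for c, p_c in prior.items():
        n_c = sum(cond[(c, 0)].values())            # number of training samples in class c
        score = p_c
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / n_c          # relative frequency; 0 if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with made-up credit data in the spirit of the slides' example
samples = [("divorced", "low"), ("married", "high"), ("single", "low"), ("married", "low")]
labels  = ["low", "low", "high", "low"]             # credit risk
prior, cond = train_naive_bayes(samples, labels)
print(classify(("divorced", "low"), prior, cond))   # -> "low"
```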
62. Bayesian classifiers
- The assumption makes computation possible. Moreover, when the assumption is satisfied, the naive Bayesian classifier is optimal, that is, it is the most accurate classifier in comparison with all other classifiers.
- However, the assumption is seldom satisfied in practice, since attributes are usually correlated.
- Several attempts are being made to apply Bayesian analysis without assuming attribute independence. The resulting models are called Bayesian networks or Bayesian belief networks.
- Bayesian belief networks combine Bayesian analysis with causal relationships between attributes.
63. k-nearest neighbor classifiers
- The nearest neighbor classifier belongs to the family of instance-based learning methods.
- Instance-based learning methods differ from the classification methods discussed earlier in that they do not build a classifier until a new unknown sample needs to be classified.
- Each training sample is described by an n-dimensional vector representing a point in an n-dimensional space called the pattern space. When a new unknown sample has to be classified, a distance function is used to determine the member of the training set that is closest to the unknown sample.
64. k-nearest neighbor classifiers
- Once the nearest training sample is located in the pattern space, its class label is assigned to the unknown sample.
- The main drawback of this approach is that it is very sensitive to noisy training samples.
- The common solution to this problem is to adopt the k-nearest neighbor strategy.
- When a new unknown sample has to be classified, the classifier searches the pattern space for the k training samples which are closest to the unknown sample. These k training samples are called the k "nearest neighbors" of the unknown sample, and the most common class label among the k "nearest neighbors" is assigned to the unknown sample.
- To find the k "nearest neighbors" of the unknown sample, a multidimensional index can be used (e.g. R-tree, Pyramid tree, etc.).
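A brute-force k-nearest neighbor sketch with Euclidean distance (a multidimensional index such as an R-tree would replace the linear scan; the toy data is illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training, labels, x, k=3):
    """Assign to x the most common class among its k nearest training samples."""
    # Brute-force scan of the pattern space; an index structure would speed this up.
    neighbors = sorted(zip(training, labels), key=lambda tl: euclidean(tl[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage with two numeric attributes
training = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (4.8, 5.2)]
labels   = ["Low", "Low", "High", "High"]
print(knn_classify(training, labels, (1.1, 0.9), k=3))   # -> "Low"
```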
65. k-nearest neighbor classifiers
- Two different issues need to be addressed regarding the k-nearest neighbor method
  - the distance function, and
  - the transformation from a sample to a point in the pattern space.
- The first issue is how to define the distance function. If the attributes are numeric, most k-nearest neighbor classifiers use the Euclidean distance.
- Instead of the Euclidean distance, we may also apply other distance metrics such as the Manhattan distance, the maximum of dimensions, or the Minkowski distance.
66. k-nearest neighbor classifiers
- The second issue is how to transform a sample into a point in the pattern space.
- Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have a larger scale or higher variability.
- A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1].
67. k-nearest neighbor classifiers
- This solution is sensitive to the outlier problem, since a single outlier could cause virtually all other values to be contained in a small subrange.
- Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation.
- Recently, another approach has been proposed which consists in applying a robust space transformation called the Donoho-Stahel estimator; the estimator has some important and useful properties that make it very attractive for different data mining applications.
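A sketch of the two simpler transformations mentioned above, min-max normalization into [0, 1] and standardization by mean and standard deviation (the Donoho-Stahel estimator is not shown):

```python
def min_max_normalize(values):
    """Scale values into the range [0, 1]; sensitive to outliers."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ages = [17, 20, 23, 32, 43, 68]
print(min_max_normalize(ages))   # 17 -> 0.0, 68 -> 1.0
print(standardize(ages))         # zero mean, unit variance
```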
68. Classifier accuracy
- The accuracy of a classifier on a given test set of samples is defined as the percentage of test samples correctly classified by the classifier, and it measures the overall performance of the classifier.
- Note that the accuracy of the classifier is not estimated on the training dataset, since it would not be a good indicator of future accuracy on new data.
- The reason is that the classifier generated from the training dataset tends to overfit the training data, and any estimate of the classifier's accuracy based on that data will be overoptimistic.
69. Classifier accuracy
- In other words, the classifier is more accurate on the data that was used to train it, but very likely it will be less accurate on an independent set of data.
- To predict the accuracy of the classifier on new data, we need to assess its accuracy on an independent dataset that played no part in the formation of the classifier.
- This dataset is called the test set.
- It is important to note that the test dataset should not be used in any way to build the classifier.
70. Classifier accuracy
- There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing.
- If there is a lot of sample data, the following simple holdout method is usually applied.
- The given set of samples is randomly partitioned into two independent sets, a training set and a test set (typically, 70% of the data is used for training, and the remaining 30% is used for testing).
- Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of accuracy on new data.
71. Classifier accuracy
- In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both the training and the test set is approximately the same as in the initial data set.
- This procedure is called stratification.
72. Testing large dataset
[Diagram: the available examples are divided randomly into a training set (70%), used to develop one tree, and a test set (30%), used to check its accuracy]
73. Classifier accuracy
- If the amount of data for training and testing is limited, the problem is how to use this limited amount of data for training, to get a good classifier, and for testing, to obtain a correct estimate of the classifier's accuracy.
- The standard and very common technique for measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation.
- In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal, mutually exclusive subsets, called folds: S_1, S_2, ..., S_k.
74. Classifier accuracy
- Training and testing are performed k times. At each iteration, one fold is used for testing while the remaining k-1 folds are used for training. So, at the end, each fold has been used exactly once for testing and k-1 times for training.
- The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of samples N in the initial dataset.
- Often, the k-fold cross-validation technique is combined with stratification and is then called stratified k-fold cross-validation.
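A sketch of k-fold cross-validation; `build_classifier` stands for any training routine that returns a classification function, and the simple round-robin fold assignment assumes the data has been shuffled beforehand:

```python
def k_fold_accuracy(samples, labels, build_classifier, k=10):
    """Estimate accuracy as (total correct over k iterations) / N."""
    data = list(zip(samples, labels))
    folds = [data[i::k] for i in range(k)]            # k roughly equal, mutually exclusive folds
    correct = 0
    for i in range(k):
        test = folds[i]                                # one fold for testing
        train = [d for j, f in enumerate(folds) if j != i for d in f]   # k-1 folds for training
        classify = build_classifier(train)             # any training routine
        correct += sum(1 for x, y in test if classify(x) == y)
    return correct / len(data)
```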
75. Testing small dataset: cross-validation
[Diagram: repeat 10 times - the available examples are divided into a training set (90%), used to develop 10 different trees, and a test set (10%), used to check accuracy]
76. Classifier accuracy
- There are many other methods of estimating classifier accuracy on a particular dataset.
- Two popular methods are leave-one-out cross-validation and bootstrapping.
- Leave-one-out cross-validation is simply N-fold cross-validation, where N is the number of samples in the initial dataset.
- At each iteration, a single sample from the dataset is left out for testing, and the remaining samples are used for training. The result of testing is either success or failure.
- The results of all N evaluations, one for each sample from the dataset, are averaged, and that average represents the final accuracy estimate.
77. Classifier accuracy
- Bootstrapping is based on sampling with replacement.
- The initial dataset is sampled N times, where N is the total number of samples in the dataset, with replacement, to form another set of N samples for training.
- Since some samples in this new "set" will be repeated, some samples from the initial dataset will not appear in this training set; these samples form the test set.
- Both of these estimation methods are especially interesting for estimating classifier accuracy on small datasets. In practice, the standard and most popular technique for estimating classifier accuracy is stratified tenfold cross-validation.
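A sketch of the bootstrap split described above (sampling N times with replacement for training, and using the never-drawn samples as the test set); the helper name and fixed seed are illustrative:

```python
import random

def bootstrap_split(data, seed=0):
    """Sample N training examples with replacement; unsampled examples form the test set."""
    random.seed(seed)
    n = len(data)
    train_idx = [random.randrange(n) for _ in range(n)]   # N draws with replacement
    test_idx = set(range(n)) - set(train_idx)             # samples never drawn
    train = [data[i] for i in train_idx]
    test = [data[i] for i in test_idx]
    return train, test
```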
78. Requirements
- Focus on mega-induction
- Handle both continuous and categorical data
- No restrictions on
  - the number of examples
  - the number of attributes
  - the number of classes
79. Applications
- Treatment effectiveness
- Credit approval
- Store location
- Target marketing
- Insurance companies (fraud detection)
- Telecommunication companies (client classification)