Title: Classification
1. Classification

2. Classification
- Input: a training set of records, each labeled with one class label
- Output: a model (classifier) that classifies new records
- The model can be used to predict the class of new cases, for which the class is unknown
3. What is Classification?
- Data classification is a two-step process
- Step 1: a model is built describing a predetermined set of data classes or concepts
- Step 2: the model is used for classification
- Each record is assumed to belong to a predefined class, as determined by one of the attributes, called the decision attribute
- Data records are also referred to as samples, examples, or objects
4. Training and Testing
- The records (examples, samples) are divided into a training data set and a test data set
- The classification model is built in two steps:
- Training: build the model from the training set
- Testing: check the accuracy of the model using the test set
5. Training Step

[Diagram: training data + classification algorithm → classifier (model)]

if Age < 31 or Car Type = Sports then Risk = High
6. Testing Step

[Diagram: the classifier (model) is applied to the test data]
7. Classification

[Diagram: the classifier (model) is applied to new data to predict class labels]
8. Measures for Comparing Classification Models
- Predictive accuracy: the ability of the model to correctly predict the class label of new data
- Speed: the computation costs involved in generating and using the model
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values
9. Measures for Comparing Classification Models (cont.)
- Scalability: the ability to construct the model efficiently given large amounts of data
- Interpretability: the level of understanding and insight that is provided by the model
- Simplicity:
- decision tree size
- rule compactness
- Domain-dependent quality indicators
10. An Example Model: Decision Tree
- Given records in the database with class labels, find a model for each class.

Age < 31?
├─ Yes: Risk = High
└─ No: Car Type = sports?
   ├─ Yes: Risk = High
   └─ No: Risk = Low
11. Kinds of Classification Models
- Decision Trees
- Bayesian Classifiers
- Neural Networks
- Statistical Analysis Models
- Genetic Algorithms
- Rough Set-based models
- k-nearest neighbor classifiers
12. Classification by Decision Trees
- A decision tree is a tree structure, where
- each internal node denotes a test on an attribute,
- each branch represents the outcome of the test,
- leaf nodes represent classes or class distributions

Age < 31?
├─ Y: High
└─ N: Car Type = sports?
   ├─ Y: High
   └─ N: Low
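The example tree above is equivalent to a chain of nested conditionals. A minimal sketch (the function and attribute names are illustrative, not from the slides):

```python
def classify_risk(age, car_type):
    """Classify risk using the slides' example decision tree."""
    if age < 31:
        return "High"          # left branch: Age < 31 is high risk
    if car_type == "sports":
        return "High"          # older drivers with sports cars: still high risk
    return "Low"               # everyone else: low risk

print(classify_risk(25, "sedan"))   # High
print(classify_risk(40, "sports"))  # High
print(classify_risk(40, "sedan"))   # Low
```

Each internal node of the tree becomes one `if` test, and each leaf becomes a `return` of a class label.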
13. Decision Tree Induction
- A decision tree is a class discriminator that recursively partitions the training data set until each partition consists entirely or dominantly of examples of one class.
- Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned.
14. Decision Tree Induction (cont. 1)
- Basic algorithm: a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- Many variants:
- from machine learning (ID3, C4.5)
- from statistics (CART)
- from pattern recognition (CHAID)
- Main difference: the split criterion
15. Decision Tree Induction (cont. 2)
- The algorithm consists of two phases:
- Tree building: build an initial tree from the training data such that each leaf node is pure
- Tree pruning: prune this tree to increase its accuracy on test data
16. Tree Building Part of the Algorithm

MakeTree(Training Data T)
  Partition(T)

Partition(Data S)
  if (all points in S are in the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
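The pseudocode above can be sketched in Python. This is a minimal illustration, assuming numeric attributes and binary splits scored by the Gini index (the function names and data layout are my own, not from the slides):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Evaluate splits on every attribute; return the (attribute, threshold) pair
    that minimizes the weighted Gini impurity of the two partitions."""
    best, best_score, n = None, float("inf"), len(rows)
    for attr in range(len(rows[0])):
        for threshold in {r[attr] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[attr] < threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] >= threshold]
            if not left or not right:           # split must produce two parts
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_score, best = score, (attr, threshold)
    return best

def partition(rows, labels):
    """Recursively partition S until each node is pure, as in the pseudocode."""
    if len(set(labels)) == 1:                   # all points in the same class
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:                           # no useful split: majority leaf
        return {"leaf": max(set(labels), key=labels.count)}
    attr, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[attr] < threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[attr] >= threshold]
    return {
        "split": (attr, threshold),
        "left": partition(*map(list, zip(*left))),    # Partition(S1)
        "right": partition(*map(list, zip(*right))),  # Partition(S2)
    }

# Example: one numeric attribute (age); a split at 35 separates the classes.
tree = partition([[25], [35], [45], [20]], ["High", "Low", "Low", "High"])
```

A production implementation would add pruning and handle categorical attributes; this sketch only mirrors the recursion shown on the slide.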
17. Choosing the Best Split
- While growing the tree, the goal at each node is to determine the split point that "best" divides the training data belonging to that node
- To evaluate the goodness of a split, several splitting criteria have been proposed
18. Splitting Criteria
- Gini index (CART, SPRINT)
- select the attribute that minimizes the impurity of a split
- Information gain (ID3, C4.5)
- to measure the impurity of a split, use entropy
- select the attribute that maximizes the entropy reduction
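Information gain can be sketched directly from its definition: the entropy of the parent node minus the weighted entropy of the children. A minimal illustration (function names are my own):

```python
import math

def entropy(labels):
    """Entropy of a class distribution; 0 for a pure node, 1 bit for a 50/50 binary node."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A 50/50 parent split into two pure children gives the maximum gain of 1 bit.
print(information_gain(["P", "P", "N", "N"], [["P", "P"], ["N", "N"]]))  # 1.0
```

ID3 evaluates this gain for every candidate attribute and picks the attribute with the largest value.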
19. An Example of Splitting Criteria: the Gini Index
- Definition:
- gini(S) = 1 - Σj pj²
- where
- S is a data set containing examples from n classes
- pj is the relative frequency of class j in S
- E.g., for two classes, Positive and Negative, and a dataset S with p Positive elements and n Negative elements:
- ppositive = p/(p+n), pnegative = n/(n+p)
- gini(S) = 1 - ppositive² - pnegative²
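The definition translates directly into code. A small sketch over class counts (the function name is illustrative):

```python
def gini(class_counts):
    """gini(S) = 1 - sum(p_j^2), where p_j is the relative frequency of class j."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

# Two classes, as in the slide's example: p Positive and n Negative elements.
print(gini([5, 5]))   # 0.5  (maximal impurity for two classes)
print(gini([10, 0]))  # 0.0  (pure set)
```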
20. Gini Index (cont.)
- If dataset S is split into S1 and S2, then the splitting index is defined as follows:
- giniSPLIT(S) = (p1 + n1)/(p + n) · gini(S1) + (p2 + n2)/(p + n) · gini(S2),
- where p1, n1 (p2, n2) denote the numbers of Positive and Negative elements in the dataset S1 (S2), respectively.
- By this definition, the "best" split point is the one with the lowest value of the giniSPLIT index.
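The giniSPLIT formula above can be computed for the two-class case as follows (a sketch; function names are my own):

```python
def gini(p, n):
    """Gini index of a two-class set with p Positive and n Negative elements."""
    total = p + n
    return 1 - (p / total) ** 2 - (n / total) ** 2

def gini_split(p1, n1, p2, n2):
    """Weighted Gini of the split S -> S1, S2; the best split minimizes this."""
    total = p1 + n1 + p2 + n2
    return ((p1 + n1) / total * gini(p1, n1)
            + (p2 + n2) / total * gini(p2, n2))

# A perfect split (each side pure) scores 0; a split that preserves the
# parent's 50/50 class mix on both sides keeps the parent's impurity of 0.5.
print(gini_split(5, 0, 0, 5))  # 0.0
print(gini_split(3, 3, 2, 2))  # 0.5
```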
21. Tree Pruning
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
- Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify test data.
22. Tree Pruning Methods
- Prepruning approach (stopping):
- the tree is pruned by halting its construction early (i.e., by deciding not to further split or partition the subset of training samples).
- Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset's samples.
- Postpruning approach (pruning):
- removes branches from a fully grown tree.
- A tree node is pruned by removing its branches.
- The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its examples.
23. Testing Classifier Accuracy
- If there is plenty of sample data, the following simple holdout method is usually applied.
- The given set of samples is randomly partitioned into two independent sets, a training set and a test set
- typically 70% of the data is used for training, and the remaining 30% is used for testing
- the accuracy of the classifier on the test set gives a good indication of its accuracy on new data.
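The holdout method amounts to a single random 70/30 partition. A minimal sketch (the function name and fixed seed are my own choices):

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Randomly partition samples into independent training and test sets."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = samples[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # 70 30
```

The classifier is then built on `train` only, and its accuracy is measured on the held-out `test` set.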
24. Applications of Classification Models
- Treatment effectiveness
- Credit Approval
- Target marketing
- Insurance company (fraud detection)
- Telecommunication company (client classification)
25. Exercise
- Try to find an ID3 program on the internet and run it on some sample data.