Title: Classification
1. Classification

2. Classification
- Input: a training set of records, each labeled with one class label
- Output: a model (classifier) that classifies new records
- The model can be used to predict the class of new cases, for which the class is unknown
3. What is Classification?
- Data classification is a two-step process
- Step 1: a model is built describing a predetermined set of data classes or concepts
- Step 2: the model is used for classification
- Each record is assumed to belong to a predefined class, as determined by one of the attributes, called the decision attribute
- Data records are also referred to as samples, examples, or objects
4. Training and Testing
- The records (examples, samples) are divided into a training data set and a test data set
- The classification model is built in two steps:
- Training: build the model from the training set
- Testing: check the accuracy of the model using the test set
5. Training Step

[Diagram: training data + classification algorithm → classifier (model)]

if Age < 31 or Car Type = Sports then Risk = High
6. Testing Step

[Diagram: the classifier (model) is applied to the test data]
7. Classification

[Diagram: the classifier (model) is applied to new data to predict class labels]
8. Measures for Comparing Classification Models
- Predictive accuracy: the ability of the model to correctly predict the class label of new data
- Speed: the computation costs involved in generating and using the model
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values
9. Measures for Comparing Classification Models (cont.)
- Scalability: the ability to construct the model efficiently given large amounts of data
- Interpretability: the level of understanding and insight that is provided by the model
- Simplicity:
- decision tree size
- rule compactness
- Domain-dependent quality indicators
10. An Example Model: Decision Tree
- Given records in the database with class labels, find a model for each class.

Age < 31?
├─ Yes: Risk = High
└─ No: Car Type = sports?
   ├─ Yes: Risk = High
   └─ No: Risk = Low
11. Kinds of Classification Models
- Decision Trees
- Bayesian Classifiers
- Neural Networks
- Statistical Analysis Models
- Genetic Algorithms
- Rough Set-based models
- k-nearest neighbor classifiers
12. Classification by Decision Trees
- A decision tree is a tree structure, where
- each internal node denotes a test on an attribute,
- each branch represents the outcome of the test,
- leaf nodes represent classes or class distributions

Age < 31?
├─ Y: High
└─ N: Car Type = sports?
   ├─ Y: High
   └─ N: Low
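The example tree above is equivalent to a chain of nested conditionals. A minimal sketch (the function and attribute names are illustrative, not from the slides):

```python
def classify_risk(age, car_type):
    """Classify risk using the slides' example decision tree."""
    if age < 31:
        return "High"          # left branch: Age < 31 is high risk
    if car_type == "sports":
        return "High"          # older drivers with sports cars: still high risk
    return "Low"               # everyone else: low risk

print(classify_risk(25, "sedan"))   # High
print(classify_risk(40, "sports"))  # High
print(classify_risk(40, "sedan"))   # Low
```

Each internal node of the tree becomes one `if` test, and each leaf becomes a `return` of a class label.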
13. Decision Tree Induction
- A decision tree is a class discriminator that recursively partitions the training data set until each partition consists entirely or dominantly of examples of one class.
- Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned.
14. Decision Tree Induction (cont. 1)
- Basic algorithm: a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- Many variants:
- from machine learning (ID3, C4.5)
- from statistics (CART)
- from pattern recognition (CHAID)
- Main difference: the split criterion
15. Decision Tree Induction (cont. 2)
- The algorithm consists of two phases:
- Tree building: build an initial tree from the training data such that each leaf node is pure
- Tree pruning: prune this tree to increase its accuracy on test data
16. Tree Building Part of the Algorithm

MakeTree(Training Data T)
  Partition(T)

Partition(Data S)
  if (all points in S are in the same class) then
    return
  for each attribute A do
    evaluate splits on attribute A
  use the best split found to partition S into S1 and S2
  Partition(S1)
  Partition(S2)
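The pseudocode above can be sketched in Python. This is a minimal illustration, assuming numeric attributes and binary splits scored by the Gini index (the function names and data layout are my own, not from the slides):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Evaluate splits on every attribute; return the (attribute, threshold) pair
    that minimizes the weighted Gini impurity of the two partitions."""
    best, best_score, n = None, float("inf"), len(rows)
    for attr in range(len(rows[0])):
        for threshold in {r[attr] for r in rows}:
            left = [l for r, l in zip(rows, labels) if r[attr] < threshold]
            right = [l for r, l in zip(rows, labels) if r[attr] >= threshold]
            if not left or not right:           # split must produce two parts
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_score, best = score, (attr, threshold)
    return best

def partition(rows, labels):
    """Recursively partition S until each node is pure, as in the pseudocode."""
    if len(set(labels)) == 1:                   # all points in the same class
        return {"leaf": labels[0]}
    split = best_split(rows, labels)
    if split is None:                           # no useful split: majority leaf
        return {"leaf": max(set(labels), key=labels.count)}
    attr, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[attr] < threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[attr] >= threshold]
    return {
        "split": (attr, threshold),
        "left": partition(*map(list, zip(*left))),    # Partition(S1)
        "right": partition(*map(list, zip(*right))),  # Partition(S2)
    }

# Example: one numeric attribute (age); a split at 35 separates the classes.
tree = partition([[25], [35], [45], [20]], ["High", "Low", "Low", "High"])
```

A production implementation would add pruning and handle categorical attributes; this sketch only mirrors the recursion shown on the slide.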
17. Choosing the Best Split
- While growing the tree, the goal at each node is to determine the split point that "best" divides the training data belonging to that node
- To evaluate the goodness of a split, several splitting criteria have been proposed
18. Splitting Criteria
- Gini index (CART, SPRINT)
- select the attribute that minimizes the impurity of a split
- Information gain (ID3, C4.5)
- to measure the impurity of a split, use entropy
- select the attribute that maximizes the entropy reduction
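Information gain can be sketched directly from its definition: the entropy of the parent node minus the weighted entropy of the children. A minimal illustration (function names are my own):

```python
import math

def entropy(labels):
    """Entropy of a class distribution; 0 for a pure node, 1 bit for a 50/50 binary node."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    """Entropy reduction achieved by splitting `parent` into `children`."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A 50/50 parent split into two pure children gives the maximum gain of 1 bit.
print(information_gain(["P", "P", "N", "N"], [["P", "P"], ["N", "N"]]))  # 1.0
```

ID3 evaluates this gain for every candidate attribute and picks the attribute with the largest value.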
19. An Example of Splitting Criteria: the Gini Index
- Definition:
- gini(S) = 1 - Σj pj²
- where
- S is a data set containing examples from n classes
- pj is the relative frequency of class j in S
- E.g., for two classes, Positive and Negative, and a dataset S with p Positive elements and n Negative elements:
- ppositive = p/(p+n), pnegative = n/(n+p)
- gini(S) = 1 - ppositive² - pnegative²
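The definition translates directly into code. A small sketch over class counts (the function name is illustrative):

```python
def gini(class_counts):
    """gini(S) = 1 - sum(p_j^2), where p_j is the relative frequency of class j."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

# Two classes, as in the slide's example: p Positive and n Negative elements.
print(gini([5, 5]))   # 0.5  (maximal impurity for two classes)
print(gini([10, 0]))  # 0.0  (pure set)
```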
20. Gini Index (cont.)
- If dataset S is split into S1 and S2, then the splitting index is defined as follows:
- giniSPLIT(S) = (p1 + n1)/(p + n) · gini(S1) + (p2 + n2)/(p + n) · gini(S2),
- where p1, n1 (p2, n2) denote the numbers of Positive and Negative elements in the dataset S1 (S2), respectively.
- By this definition, the "best" split point is the one with the lowest value of the giniSPLIT index.
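The giniSPLIT formula above can be computed for the two-class case as follows (a sketch; function names are my own):

```python
def gini(p, n):
    """Gini index of a two-class set with p Positive and n Negative elements."""
    total = p + n
    return 1 - (p / total) ** 2 - (n / total) ** 2

def gini_split(p1, n1, p2, n2):
    """Weighted Gini of the split S -> S1, S2; the best split minimizes this."""
    total = p1 + n1 + p2 + n2
    return ((p1 + n1) / total * gini(p1, n1)
            + (p2 + n2) / total * gini(p2, n2))

# A perfect split (each side pure) scores 0; a split that preserves the
# parent's 50/50 class mix on both sides keeps the parent's impurity of 0.5.
print(gini_split(5, 0, 0, 5))  # 0.0
print(gini_split(3, 3, 2, 2))  # 0.5
```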
21. Tree Pruning
- When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
- Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify test data.
22. Tree Pruning Methods
- Prepruning approach (stopping):
- the tree is pruned by halting its construction early (i.e., by deciding not to further split or partition the subset of training samples).
- Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset's samples.
- Postpruning approach (pruning):
- removes branches from a fully grown tree.
- A tree node is pruned by removing its branches.
- The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its examples.
23. Testing Classifier Accuracy
- If there is plenty of sample data, the following simple holdout method is usually applied.
- The given set of samples is randomly partitioned into two independent sets, a training set and a test set
- typically 70% of the data is used for training, and the remaining 30% is used for testing
- the accuracy of the classifier on the test set gives a good indication of its accuracy on new data.
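The holdout method amounts to a single random 70/30 partition. A minimal sketch (the function name and fixed seed are my own choices):

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=0):
    """Randomly partition samples into independent training and test sets."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = samples[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(100)))
print(len(train), len(test))  # 70 30
```

The classifier is then built on `train` only, and its accuracy is measured on the held-out `test` set.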
24. Applications of Classification Models
- Treatment effectiveness
- Credit Approval
- Target marketing
- Insurance company (fraud detection)
- Telecommunication company (client classification)
25. Exercise
- Try to find an ID3 program on the internet and run it on some sample data.