1
Classification
2
Classification
  • Input: a training set of records, each labeled
    with one class label
  • Output: a model (classifier) that classifies
    new, unlabeled records
  • The model can be used to predict the class of
    new cases, for which the class is unknown

3
What is Classification?
  • Data classification is a two-step process
  • Step 1: a model is built describing a
    predetermined set of data classes or concepts
  • Step 2: the model is used for classification
  • Each record is assumed to belong to a predefined
    class, as determined by one of the attributes,
    called the decision attribute
  • Data records are also referred to as samples,
    examples, or objects

4
Training and Testing
  • The records (examples, samples) are divided into
    a training data set and a test data set
  • The classification model is built in two steps
  • Training: build the model from the training set
  • Testing: check the accuracy of the model using
    the test set

5
Training Step
Training data → Classification algorithm → Classifier (model)
Example rule: if Age < 31 or Car Type = Sports then Risk = High
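The learned rule can be run directly as a predicate. A minimal
sketch in Python; the attribute names age and car_type are
illustrative, not fixed by the slides:

    def risk(age: int, car_type: str) -> str:
        # Rule produced by the training step:
        # if Age < 31 or Car Type = Sports then Risk = High, else Low.
        if age < 31 or car_type == "Sports":
            return "High"
        return "Low"

    print(risk(25, "Sedan"))   # High (age < 31)
    print(risk(45, "Sports"))  # High (sports car)
    print(risk(45, "Sedan"))   # Low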
6
Testing Step
Test data → Classifier (model) → accuracy estimate
7
Classification
New data → Classifier (model) → predicted class
8
Measures for Comparing Classification Models
  • Predictive accuracy: the ability of the model
    to correctly predict the class label of new data
  • Speed: the computational costs involved in
    generating and using the model
  • Robustness: the ability of the model to make
    correct predictions given noisy data or data
    with missing values

9
Measures for Comparing Classification Models
(cont.)
  • Scalability: the ability to construct the model
    efficiently given large amounts of data
  • Interpretability: the level of understanding and
    insight that is provided by the model
  • Simplicity
  • decision tree size
  • rule compactness
  • Domain-dependent quality indicators

10
An Example Model - Decision Tree
  • Given records in the database with class
    labels, find a model for each class.

Age < 31?
  Y: High
  N: Car Type = sports?
       Y: High
       N: Low
11
Kinds of Classification Models
  • Decision Trees
  • Bayesian Classifiers
  • Neural Networks
  • Statistical Analysis Models
  • Genetic Algorithms
  • Rough Set-based models
  • k-nearest neighbor classifiers

12
Classification by Decision Trees
  • A decision tree is a tree structure, where
  • each internal node denotes a test on an
    attribute,
  • each branch represents the outcome of the test,
  • leaf nodes represent classes or class
    distributions

Age < 31?
  Y: High
  N: Car Type = sports?
       Y: High
       N: Low
13
Decision Tree Induction
  • A decision tree is a class discriminator that
    recursively partitions the training data set
    until each partition consists entirely or
    dominantly of examples of one class.
  • Each non-leaf node of the tree contains a split
    point, which is a test on one or more attributes
    and determines how the data is partitioned

14
Decision Tree Induction (cont.1)
  • Basic algorithm: a greedy algorithm that
    constructs decision trees in a top-down,
    recursive, divide-and-conquer manner.
  • Many variants
  • from machine learning (ID3, C4.5)
  • from statistics (CART)
  • from pattern recognition (CHAID)
  • Main difference: the split criterion

15
Decision Tree Induction (cont.2)
  • The algorithm consists of two phases
  • Tree building: build an initial tree from the
    training data such that each leaf node is pure
  • Tree pruning: prune this tree to increase its
    accuracy on test data

16
Tree Building Part of the Algorithm
    MakeTree(Training Data T):
        Partition(T)

    Partition(Data S):
        if all points in S are in the same class then
            return
        for each attribute A do
            evaluate splits on attribute A
        use the best split found to partition S into S1 and S2
        Partition(S1)
        Partition(S2)
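A runnable sketch of this recursive scheme in Python, assuming
two classes, numeric attributes, and binary splits; the helpers
best_split and gini are illustrative names, not part of the
slides' algorithm:

    from collections import Counter

    def gini(labels):
        # Impurity of a label set: 1 - sum of squared class frequencies.
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    def best_split(rows, labels):
        # Evaluate every (attribute, threshold) split; keep the one with
        # the lowest weighted impurity (the giniSPLIT index of a later slide).
        best = None
        for a in range(len(rows[0])):
            for v in sorted({r[a] for r in rows}):
                left = [i for i, r in enumerate(rows) if r[a] < v]
                right = [i for i, r in enumerate(rows) if r[a] >= v]
                if not left or not right:
                    continue
                score = (len(left) * gini([labels[i] for i in left]) +
                         len(right) * gini([labels[i] for i in right])) / len(rows)
                if best is None or score < best[0]:
                    best = (score, a, v, left, right)
        return best

    def partition(rows, labels):
        # Partition(S): stop on a pure node, otherwise split and recurse.
        if len(set(labels)) == 1:
            return labels[0]
        best = best_split(rows, labels)
        if best is None:  # no useful split left: majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        _, a, v, left, right = best
        return {"attr": a, "threshold": v,
                "left": partition([rows[i] for i in left],
                                  [labels[i] for i in left]),
                "right": partition([rows[i] for i in right],
                                   [labels[i] for i in right])}

    # Toy data: columns are (age, is_sports_car); labels are risk classes.
    data = [[25, 0], [28, 1], [45, 1], [50, 0], [60, 0]]
    print(partition(data, ["High", "High", "High", "Low", "Low"]))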

17
Choosing the best split
  • While growing the tree, the goal at each node
    is to determine the split point that "best"
    divides the training data belonging to that node
  • To evaluate the goodness of a split, several
    splitting criteria have been proposed

18
Splitting Criteria
  • Gini index (CART, SPRINT)
  • select the attribute that minimizes the
    impurity of a split
  • Information gain (ID3, C4.5)
  • entropy is used to measure the impurity of a
    split
  • select the attribute that maximizes the
    reduction in entropy
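A minimal sketch of the information-gain criterion in Python,
assuming class labels are given as lists; the names entropy and
information_gain are illustrative:

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy of a label set: -sum(p_j * log2(p_j)) over classes j.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(labels).values())

    def information_gain(parent, children):
        # Entropy reduction achieved by splitting `parent` into `children`.
        n = len(parent)
        weighted = sum(len(ch) / n * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    labels = ["High"] * 3 + ["Low"] * 2
    split = [["High", "High"], ["High", "Low", "Low"]]
    print(information_gain(labels, split))  # ≈ 0.42: impurity reduced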

19
An Example of Splitting Criteria - Gini index
  • Definition
  • gini(S) = 1 - Σj pj²  (the sum runs over all
    classes j)
  • where
  • S is a data set containing examples from n
    classes
  • pj is the relative frequency of class j in S
  • E.g. two classes, Positive and Negative, and
    dataset S with p Positive elements and n
    Negative elements
  • p_positive = p/(p+n), p_negative = n/(n+p)
  • gini(S) = 1 - p_positive² - p_negative²
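A direct transcription of this definition in Python, assuming
the examples' class labels are given as a list:

    from collections import Counter

    def gini(labels):
        # gini(S) = 1 - sum of squared relative class frequencies.
        total = len(labels)
        return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

    # As in the example: p = 3 Positive and n = 2 Negative elements.
    S = ["Positive"] * 3 + ["Negative"] * 2
    print(gini(S))  # 1 - 0.6**2 - 0.4**2 = 0.48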

20
Gini index (cont.)
  • If dataset S is split into S1 and S2, then the
    splitting index is defined as follows
  • giniSPLIT(S) = (p1+n1)/(p+n) · gini(S1)
    + (p2+n2)/(p+n) · gini(S2)
  • where p1, n1 (p2, n2) denote the numbers of
    Positive and Negative elements in the dataset
    S1 (S2), respectively.
  • In this definition the "best" split point is
    the one with the lowest value of the giniSPLIT
    index.
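A sketch of the giniSPLIT computation for the two-class case,
working directly from the (p, n) counts; the function names are
illustrative:

    def gini_two_class(p, n):
        # gini(S) for a set with p Positive and n Negative elements.
        total = p + n
        return 1.0 - (p / total) ** 2 - (n / total) ** 2

    def gini_split(p1, n1, p2, n2):
        # giniSPLIT(S) = (p1+n1)/(p+n)*gini(S1) + (p2+n2)/(p+n)*gini(S2)
        total = p1 + n1 + p2 + n2
        return ((p1 + n1) / total * gini_two_class(p1, n1) +
                (p2 + n2) / total * gini_two_class(p2, n2))

    # Splitting 3 Positive / 2 Negative into a pure S1 and a mixed S2:
    print(gini_split(2, 0, 1, 2))  # ≈ 0.267; lower means a better split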

21
Tree Pruning
  • When a decision tree is built, many of the
    branches will reflect anomalies in the training
    data due to noise or outliers.
  • Tree pruning methods typically use statistical
    measures to remove the least reliable branches,
    generally resulting in
  • faster classification, and
  • an improvement in the ability of the tree to
    correctly classify test data

22
Tree Pruning Methods
  • Prepruning approach (stopping)
  • a tree is pruned by halting its construction
    early (i.e. by deciding not to further split or
    partition the subset of training samples).
  • Upon halting, the node becomes a leaf. The leaf
    holds the most frequent class among the subset
    samples
  • Postpruning approach (pruning)
  • removes branches from a fully grown tree.
  • A tree node is pruned by removing its branches;
    the lowest unpruned node becomes a leaf and is
    labeled with the most frequent class among its
    examples
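One concrete postpruning method is cost-complexity pruning as
implemented in scikit-learn; the slides do not prescribe a
specific pruning algorithm, so this is a stand-in, and it assumes
scikit-learn is installed (the iris dataset replaces real data).
Growing a full tree and refitting with increasing ccp_alpha
removes the least reliable branches first:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    # Effective alphas at which the fully grown tree would be pruned.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_tr, y_tr)
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0,
                                      ccp_alpha=alpha).fit(X_tr, y_tr)
        print(f"alpha={alpha:.4f} leaves={tree.get_n_leaves()} "
              f"test accuracy={tree.score(X_te, y_te):.3f}")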

23
Testing Classifier Accuracy
  • If there is a lot of sample data, then the
    following simple holdout method is usually
    applied.
  • The given set of samples is randomly partitioned
    into two independent sets,
  • a training set and a test set
  • 70% of the data is used for training, and the
    remaining 30% is used for testing
  • the accuracy of the classifier on the test set
    gives a good indication of its accuracy on new
    data.
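A sketch of this 70/30 holdout evaluation in Python, assuming
scikit-learn is installed; the iris dataset and the decision-tree
learner are stand-ins for the slides' examples:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Random 70/30 partition into independent training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)

    model = DecisionTreeClassifier().fit(X_train, y_train)
    print(f"test-set accuracy: {model.score(X_test, y_test):.3f}")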

24
Applications of Classification Models
  • Treatment effectiveness
  • Credit Approval
  • Target marketing
  • Insurance companies (fraud detection)
  • Telecommunications companies (client
    classification)

25
Exercise
  • Try to find an ID3 program on the internet and
    run it with some sample data.