DATA MINING : CLASSIFICATION - PowerPoint PPT Presentation

About This Presentation
Title:

DATA MINING : CLASSIFICATION

Slides: 23
Provided by: 62974

Transcript and Presenter's Notes

Title: DATA MINING : CLASSIFICATION


1
DATA MINING: CLASSIFICATION
2
Classification: Definition
  • Classification is a form of supervised learning.
  • It uses training sets that have correct answers
    (class label attributes).
  • A model is created by running the algorithm on
    the training data.
  • Test the model. If accuracy is low, regenerate
    the model after changing features or
    reconsidering samples.
  • Use the model to identify a class label for
    incoming new data.

3
Applications
  • Classifying credit card transactions as
    legitimate or fraudulent.
  • Classifying secondary structures of protein as
    alpha-helix, beta-sheet, or random coil.
  • Categorizing news stories as finance, weather,
    entertainment, sports, etc.

4
Classification: A two-step process
  • Model construction: describing a set of
    predetermined classes.
  • Each sample is assumed to belong to a predefined
    class, as determined by the class label
    attribute.
  • The set of samples used for model construction is
    the training set.
  • The model is represented as classification rules,
    decision trees, or mathematical formulae.

5
  • Model usage: classifying future or unknown
    objects.
  • Estimate the accuracy of the model.
  • The known label of each test sample is compared
    with the classified result from the model.
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model.
  • The test set is independent of the training set.
  • If the accuracy is acceptable, use the model to
    classify data samples whose class labels are not
    known.
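
The accuracy estimate described above can be sketched in a few lines of Python. This is only an illustration: `accuracy_rate`, `toy_model`, and the four test records are hypothetical and do not come from the slides; any callable that maps features to a class label could stand in for the model.

```python
def accuracy_rate(model, test_samples):
    """Percentage of test samples whose predicted label matches the known label."""
    correct = sum(1 for features, label in test_samples if model(features) == label)
    return 100.0 * correct / len(test_samples)

# Hypothetical model: flags a transaction as fraudulent above a fixed amount.
def toy_model(features):
    (amount,) = features
    return "fraudulent" if amount > 1000 else "legitimate"

# Hypothetical labeled test set, kept separate from any training data.
test_set = [
    ((50,), "legitimate"),
    ((5000,), "fraudulent"),
    ((1200,), "legitimate"),   # the toy model gets this one wrong
    ((20,), "legitimate"),
]

print(accuracy_rate(toy_model, test_set))  # 75.0
```

If the rate is too low for the task, the model would be regenerated before being used on unlabeled samples.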

6
Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
7
Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
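
As a minimal sketch, the rule learned in the model construction step (tenured if rank is professor or years exceed 6) can be written as a function and applied to the unseen sample. The function name `is_tenured` and the explicit "no" outcome for samples the rule does not cover are assumptions made for illustration.

```python
def is_tenured(rank, years):
    """Apply the learned rule: professor rank OR more than 6 years means tenured."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"  # assumed default when the rule does not fire

name, rank, years = ("Jeff", "Professor", 4)
print(name, is_tenured(rank, years))  # Jeff yes
```

Jeff has only 4 years, but the rank condition alone satisfies the OR rule.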
8
Classification techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Neural Networks
  • Bayesian Classification
  • Support Vector Machines

9
Algorithm for decision tree induction
  • Basic algorithm
  • Tree is constructed in a top-down
    recursive divide-and-conquer manner.
  • At start, all the training examples are at the
    root.
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance).
  • Examples are partitioned recursively based on
    selected attributes.
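
The basic algorithm can be sketched as a short recursive function. The slides do not say how the splitting attribute is selected, so information gain is used here as an assumption; the names (`build_tree`, `classify`) and the four-row dataset are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on one categorical attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer induction over categorical attributes."""
    if len(set(labels)) == 1:          # all examples in one class: make a leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {"attr": best, "branches": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["branches"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return tree

def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached.
    Raises KeyError for an attribute value never seen in training."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

# Hypothetical training examples; all of them start at the root.
rows = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
]
labels = ["yes", "yes", "no", "no"]

tree = build_tree(rows, labels, ["student", "credit"])
print(tree["attr"])                                            # student
print(classify(tree, {"student": "no", "credit": "fair"}))     # no
```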

10
Example of Decision Tree
Training Dataset
11
Output: A Decision Tree for buys_computer
12
Advantages of decision tree based classification
  • Inexpensive to construct.
  • Extremely fast at classifying unknown records.
  • Easy to interpret for small-sized trees.
  • Accuracy is comparable to other classification
    techniques for many simple data sets.

13
Enhancements to basic decision tree induction
  • Allow for continuous-valued attributes
      • Dynamically define new discrete-valued
        attributes that partition the continuous
        attribute value into a discrete set of
        intervals
  • Handle missing attribute values
      • Assign the most common value of the attribute
      • Assign a probability to each of the possible
        values
  • Attribute construction
      • Create new attributes based on existing ones
        that are sparsely represented
      • This reduces fragmentation, repetition, and
        replication
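
Two of these enhancements can be illustrated with small helpers. This is a sketch under assumptions: the cut points are fixed by hand here rather than chosen dynamically by the induction algorithm, and the function names and sample values are hypothetical.

```python
from collections import Counter

def discretize(value, cut_points, interval_labels):
    """Map a continuous value to the label of the interval it falls in.
    interval_labels must have one more entry than cut_points."""
    for cut, label in zip(cut_points, interval_labels):
        if value <= cut:
            return label
    return interval_labels[-1]

def fill_missing(values):
    """Replace missing entries (None) with the most common observed value."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

# Hypothetical age intervals and credit ratings.
print(discretize(35, [30, 40], ["<=30", "31-40", ">40"]))    # 31-40
print(fill_missing(["fair", None, "fair", "excellent"]))
# ['fair', 'fair', 'fair', 'excellent']
```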

14
Potential Problem
  • Overfitting: the generated model fits the
    training data but does not apply to new incoming
    data.
  • Possible causes: too little training data, not
    covering many cases, or wrong assumptions.
  • Overfitting results in decision trees that are
    more complex than necessary
  • Training error no longer provides a good estimate
    of how well the tree will perform on previously
    unseen records
  • Need new ways for estimating errors

15
How to avoid Overfitting
  • Two ways to avoid overfitting are pre-pruning
    and post-pruning.
  • Pre-pruning
      • Stop the algorithm before it becomes a fully
        grown tree.
      • Stop if all instances belong to the same
        class.
      • Stop if the number of instances is less than
        some user-specified threshold.
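
The two pre-pruning stop conditions can be expressed as a single check that a tree-growing routine would call before splitting a node. The name `should_stop` and the parameter `min_instances` are hypothetical, chosen only for this sketch.

```python
def should_stop(labels, min_instances):
    """Return True when a node should become a leaf instead of being split."""
    all_same_class = len(set(labels)) <= 1
    too_few_instances = len(labels) < min_instances
    return all_same_class or too_few_instances

print(should_stop(["yes", "yes", "yes"], min_instances=2))   # True: one class
print(should_stop(["yes", "no"], min_instances=5))           # True: too few
print(should_stop(["yes", "no", "yes", "no", "yes"], 5))     # False: keep splitting
```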

16
  • Post-pruning
  • Grow decision tree to its entirety.
  • Trim the nodes of the decision tree in a
    bottom-up fashion.
  • If generalization error improves after trimming,
    replace sub-tree by a leaf node.
  • Class label of leaf node is determined from
    majority class of instances in the sub-tree.
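
A minimal sketch of bottom-up post-pruning follows. The slides do not say how generalization error is estimated, so this version assumes a separate validation set (often called reduced-error pruning). The nested-dict tree format, the function names, and all records are hypothetical.

```python
from collections import Counter

def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree

def errors(tree, rows, labels):
    """Number of rows the (sub)tree misclassifies."""
    return sum(classify(tree, r) != lab for r, lab in zip(rows, labels))

def prune(tree, train_rows, train_labels, val_rows, val_labels):
    """Prune children first, then try replacing this sub-tree with a leaf.
    Modifies the tree in place and returns the (possibly replaced) node."""
    if not isinstance(tree, dict) or not train_labels:
        return tree
    attr = tree["attr"]
    for value in list(tree["branches"]):
        t = [i for i, r in enumerate(train_rows) if r[attr] == value]
        v = [i for i, r in enumerate(val_rows) if r[attr] == value]
        tree["branches"][value] = prune(
            tree["branches"][value],
            [train_rows[i] for i in t], [train_labels[i] for i in t],
            [val_rows[i] for i in v], [val_labels[i] for i in v])
    # Leaf label: majority class of the training instances in this sub-tree.
    leaf = Counter(train_labels).most_common(1)[0][0]
    leaf_errors = sum(lab != leaf for lab in val_labels)
    # Replace only if the estimated error strictly improves.
    return leaf if leaf_errors < errors(tree, val_rows, val_labels) else tree

# Hypothetical fully grown tree; the "credit" split was fitted to one noisy record.
full_tree = {"attr": "student", "branches": {
    "yes": {"attr": "credit", "branches": {"fair": "yes", "excellent": "no"}},
    "no": "no"}}
train_rows = [{"student": "yes", "credit": "fair"}, {"student": "yes", "credit": "fair"},
              {"student": "yes", "credit": "excellent"}, {"student": "no", "credit": "fair"}]
train_labels = ["yes", "yes", "no", "no"]
val_rows = [{"student": "yes", "credit": "excellent"}, {"student": "yes", "credit": "fair"},
            {"student": "no", "credit": "excellent"}]
val_labels = ["yes", "yes", "no"]

pruned = prune(full_tree, train_rows, train_labels, val_rows, val_labels)
print(pruned)  # {'attr': 'student', 'branches': {'yes': 'yes', 'no': 'no'}}
```

The "credit" sub-tree is replaced by the leaf "yes" because that lowers the validation error from 1 to 0, while the root is kept because collapsing it would raise the error.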

17
Bayesian Classification Algorithm
  • Let X be a data sample whose class label is
    unknown
  • Let H be a hypothesis that X belongs to class C
  • For classification problems, determine P(H|X):
    the probability that the hypothesis holds given
    the observed data sample X
  • P(H): prior probability of hypothesis H (i.e. the
    initial probability before we observe any data,
    reflects the background knowledge)
  • P(X): probability that sample data is observed
  • P(X|H): probability of observing the sample X,
    given that the hypothesis holds
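
These quantities are tied together by Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X). A small sketch, with probability values invented purely for illustration:

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical values: P(H) = 0.5, P(X|H) = 0.2, P(X) = 0.25
print(posterior(0.5, 0.2, 0.25))  # 0.4
```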

18
Training dataset for Bayesian Classification
Classes: C1: buys_computer = yes; C2: buys_computer = no
Data sample X = (age < 30, income = medium,
student = yes, credit_rating = fair)
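
The training table for this slide did not survive in the transcript, so the sketch below uses a small made-up dataset, not the original one. It shows how a naive Bayesian classifier would score each class for a sample by multiplying the class prior by the per-attribute conditional probabilities (P(X) is the same for every class, so it can be left out of the comparison). Function names are hypothetical, and no smoothing is applied, so an unseen attribute value gives a score of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute) -> value counts
    for row, label in zip(rows, labels):
        for attr, value in row.items():
            value_counts[(label, attr)][value] += 1
    return class_counts, value_counts

def classify(sample, class_counts, value_counts):
    """Return the class maximizing P(Ci) * product of P(x_k | Ci), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                    # prior P(Ci)
        for attr, value in sample.items():
            score *= value_counts[(label, attr)][value] / count  # P(x_k | Ci)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical records (NOT the slide's training table).
rows = [
    {"student": "yes", "credit_rating": "fair"},
    {"student": "yes", "credit_rating": "excellent"},
    {"student": "no", "credit_rating": "fair"},
    {"student": "no", "credit_rating": "excellent"},
    {"student": "yes", "credit_rating": "fair"},
]
labels = ["yes", "yes", "no", "no", "yes"]   # buys_computer

class_counts, value_counts = train_naive_bayes(rows, labels)
predicted, scores = classify({"student": "yes", "credit_rating": "fair"},
                             class_counts, value_counts)
print(predicted)  # yes
```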
19
Advantages and Disadvantages of Bayesian
Classification
  • Advantages
      • Easy to implement
      • Good results obtained in most cases
  • Disadvantages
      • The assumption of class-conditional
        independence causes a loss of accuracy.
      • In practice, dependencies exist among
        variables.
      • E.g., hospital patients: Profile (age, family
        history, etc.), Symptoms (fever, cough, etc.),
        Disease (lung cancer, diabetes, etc.)
      • Dependencies among these cannot be modeled by
        a naive Bayesian classifier.

20
Conclusion
  • Training data is an important factor in building
    a model in supervised algorithms.
  • The classification results generated by the
    algorithms (Naïve Bayes, Decision Tree, Neural
    Networks) are not considerably different from
    one another.
  • Different classification algorithms can take
    different time to train and build models.
  • Mechanical classification is faster

22
Thank you !!!