Data Mining - PowerPoint PPT Presentation

Title:

Data Mining

Provided by: ifmAcTzs
Transcript and Presenter's Notes

Title: Data Mining


1
Data Mining
  • Instructor: Bajuna Salehe
  • Email: bajunar_at_yahoo.com
  • Web: http://www.ifm.ac.tz/staff/bajuna/courses

Classification and Prediction
2
Classification and Prediction
  • Classification and prediction are two forms of
    data analysis that can be used to extract models
    describing important data classes or to predict
    future data trends. Such analysis can help
    provide us with a better understanding of the
    data at large.

3
An example application
  • An emergency room in a hospital measures 17
    variables (e.g., blood pressure, age, etc.) of
    newly admitted patients.
  • A decision is needed on whether to put a new
    patient in an intensive-care unit (ICU).
  • Due to the high cost of the ICU, those patients
    who may survive less than a month are given
    higher priority.
  • Problem: to predict high-risk patients and
    discriminate them from low-risk patients.

4
Another application
  • A credit card company receives thousands of
    applications for new cards. Each application
    contains information about an applicant:
  • age
  • marital status
  • annual salary
  • outstanding debts
  • credit rating
  • etc.
  • Problem: to decide whether an application should
    be approved, i.e., to classify applications into
    two categories, approved and not approved.

5
Machine learning and our focus
  • Like human learning from past experiences.
  • A computer does not have experiences.
  • A computer system learns from data, which
    represent some past experiences of an
    application domain.
  • Our focus: learn a target function that can be
    used to predict the values of a discrete class
    attribute, e.g., approved or not approved, and
    high-risk or low-risk.
  • The task is commonly called supervised learning,
    classification, or inductive learning.

6
Classification and Prediction
  • Whereas classification predicts categorical
    (discrete, unordered) labels, prediction models
    continuous-valued functions.

7
Classification and Prediction
  • For example, we can build a classification model
    to categorize bank loan applications as either
    safe or risky, or a prediction model to predict
    the expenditures in dollars of potential
    customers on computer equipment given their
    income and occupation.

8
Classification
  • Classification is the process of finding a model
    (or function) that describes and distinguishes
    data classes or concepts, for the purpose of
    being able to use the model to predict the class
    of objects whose class label is unknown.
  • The derived model is based on the analysis of a
    set of training data (i.e., data objects whose
    class label is known).

9
What is Classification
  • Classification is the task of assigning objects
    to their respective categories.
  • Examples include classifying email messages as
    spam or non-spam based upon the message header
    and content, and classifying galaxies based upon
    their respective shapes.

10
What is Classification
  • Classification can provide valuable support for
    informed decision making in an organisation.
  • For example, suppose a mobile phone company would
    like to promote a new cell-phone product to the
    public. Instead of mass mailing the promotional
    catalog to everyone, the company may be able to
    reduce the campaign cost by targeting only a
    small segment of the population.

11
What is Classification
  • It may classify each person as a potential buyer
    or non-buyer based on their personal information
    such as income, occupation, lifestyle, and credit
    ratings.

12
Discrete Data
  • Discrete Data: A set of data is said to be
    discrete if the values / observations belonging
    to it are distinct and separate, i.e. they can be
    counted (1, 2, 3, ...). Examples might include
    the number of kittens in a litter; the number of
    patients in a doctor's surgery; the number of
    flaws in one metre of cloth; gender (male,
    female); blood group (O, A, B, AB).

13
Discrete Data
  • Any data measurements that are not quantified on
    an infinitely divisible numeric scale. Includes
    items like counts, proportions, ratios, or
    percentage of a characteristic (i.e. sex, loan
    forms, department attendance, etc.) that have
    measurements like pass or fail, leak or no leak,
    small, medium, or large, go or no go tests.
    (SixSigma.com Dictionary)

14
Continuous Data
  • Continuous/Variable Data: A set of data is said
    to be continuous if the values / observations
    belonging to it may take on any value within a
    finite or infinite interval. You can count, order
    and measure continuous data. Examples: height,
    weight, temperature, the amount of sugar in an
    orange, the time required to run a mile.

15
Continuous Data
  • Variable data have real numbers in the
    measurement, such as 2.34 or 2.55 (i.e. data that
    can be measured on a continuous scale).

16
Categorical Data
  • Categorical Data: A set of data is said to be
    categorical if the values or observations
    belonging to it can be sorted according to
    category. Each value is chosen from a set of
    non-overlapping categories. For example, shoes in
    a cupboard can be sorted according to colour; the
    characteristic 'colour' can have non-overlapping
    categories 'black', 'brown', 'red' and 'other'.
    People have the characteristic of 'gender' with
    categories 'male' and 'female'.

17
Nominal Data
  • Nominal Data: A set of data is said to be
    nominal if the values / observations belonging to
    it can be assigned a code in the form of a number
    where the numbers are simply labels. You can
    count but not order or measure nominal data. For
    example, in a data set males could be coded as 0
    and females as 1; the marital status of an
    individual could be coded as Y if married, N if
    single.

18
Ordinal Data
  • Ordinal Data: A set of data is said to be
    ordinal if the values / observations belonging to
    it can be ranked (put in order) or have a rating
    scale attached. You can count and order, but not
    measure, ordinal data.

19
Ordinal Data
  • The categories for an ordinal set of data have a
    natural order. For example, suppose a group of
    people were asked to taste varieties of biscuit
    and classify each biscuit on a rating scale of 1
    to 5, representing strongly dislike, dislike,
    neutral, like, strongly like. A rating of 5
    indicates more enjoyment than a rating of 4,
    so such data are ordinal.
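
The difference between nominal and ordinal codes can be sketched in a few lines of Python. This is only an illustration: the names GENDER_CODES and RATING_SCALE and the sample responses are assumptions made up for this sketch, not part of the course material.

```python
from collections import Counter

# Nominal: the numbers are labels only; their order carries no meaning.
GENDER_CODES = {"male": 0, "female": 1}

# Ordinal: the codes follow the natural order of the categories.
RATING_SCALE = {
    "strongly dislike": 1,
    "dislike": 2,
    "neutral": 3,
    "like": 4,
    "strongly like": 5,
}

# Hypothetical survey responses, invented for illustration.
genders = ["male", "female", "female"]
ratings = ["like", "neutral", "strongly like"]

# Counting is meaningful for both nominal and ordinal data.
gender_counts = Counter(GENDER_CODES[g] for g in genders)

# Ordering is meaningful only for ordinal data.
ratings_sorted = sorted(ratings, key=RATING_SCALE.get)

print(gender_counts)
print(ratings_sorted)
```

Sorting the gender codes would run without error, but the result would carry no meaning, which is the practical sense of "count but not order".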

20
Preliminaries
  • The input data for a classification task is given
    in the form of a collection of records.
  • Each record, also known as an instance or
    example, is characterised by a tuple (x, y),
    where x is the attribute set and y is the class
    label.
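
One possible way to hold such (x, y) records in Python is sketched below. The Record type, the attribute names, and the two sample records are hypothetical and chosen only to illustrate the tuple structure.

```python
from typing import NamedTuple


class Record(NamedTuple):
    x: dict  # attribute set: attribute name -> value
    y: str   # class label


# Hypothetical records, invented for illustration.
records = [
    Record(x={"body_temperature": "warm-blooded",
              "gives_birth": True,
              "can_fly": False},
           y="mammal"),
    Record(x={"body_temperature": "warm-blooded",
              "gives_birth": False,
              "can_fly": True},
           y="bird"),
]

# A record unpacks into its attribute set and its class label.
attribute_set, class_label = records[0]
print(attribute_set, class_label)
```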

21
Preliminaries
  • Table 1. Vertebrate Data Set

22
Preliminaries
  • The table on the previous slide shows a sample
    data set used for classifying vertebrates into
    one of the following categories: mammal, bird,
    fish, reptile, or amphibian.
  • The attribute set includes properties of a
    vertebrate such as its body temperature, skin
    cover, method of reproduction, ability to fly and
    ability to live in water.

23
Preliminaries
  • The attribute set may contain discrete and
    continuous features; in the table above,
    however, the attribute set contains mostly
    discrete values.
  • The class label, on the other hand, must be a
    discrete attribute.
  • This is a key characteristic that distinguishes
    classification from another predictive modeling
    task known as regression, where y is a continuous
    attribute.

24
What is Classification
  • Classification can be described as a task of
    assigning objects to one of several predefined
    categories.
  • Input: attribute set (x) → Classification model →
    Output: class label (y)
  • The diagram shows classification as the task
    of mapping an input attribute set x into its
    class label y.

25
Simple Definition
  • Classification is the task of learning a target
    function f that maps each attribute set x into
    one of the pre-defined class labels y.
  • The target function is also known informally as a
    classification model.
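
To make the idea of a target function concrete, here is a minimal sketch of a function f that maps an attribute set x to a class label y. The rules are hand-written and deliberately simplified for illustration (a real model would be learned from training data), and the attribute names are assumptions rather than the columns of Table 1.

```python
def f(x: dict) -> str:
    """Map an attribute set x to one of the predefined class labels.

    Hand-written, simplified rules for illustration only; they are
    not learned from data and ignore many real-world exceptions.
    """
    if x["body_temperature"] == "warm-blooded":
        return "mammal" if x["gives_birth"] else "bird"
    if x["skin_cover"] == "scales":
        return "fish" if x["lives_in_water"] else "reptile"
    return "amphibian"


# Hypothetical unknown record: a cold-blooded, scaly land animal.
unknown = {
    "body_temperature": "cold-blooded",
    "skin_cover": "scales",
    "gives_birth": False,
    "lives_in_water": False,
}
predicted_label = f(unknown)
print(predicted_label)
```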

26
Usefulness of Classification Model
  • A classification model is useful for the
    following purposes:
  • It may serve as an explanatory tool to
    distinguish between objects of different classes
    (Descriptive Modeling).
  • It may also be used to predict the class label of
    unknown records (Predictive Modeling). Consider
    the table below.

27
Usefulness of Classification Model
  • A classification model can be treated as a black
    box that automatically assigns a class label when
    presented with the attribute set of an unknown
    record.
  • Example: you may be given the characteristics of
    a creature known as a Gila monster.

28
Usefulness of Classification Model
  • By building a classification model from the data
    set shown in Table 1, you may use the model to
    determine the class to which the creature
    belongs.
  • Classification models are most suited for
    predicting or describing data sets with binary or
    nominal target attributes.

29
Classification Prediction
  • Classification
      • Predicts categorical class labels
      • Constructs a model from the training set and
        the values (class labels) of a classifying
        attribute, then uses that model to classify
        new data
  • Prediction
      • Models continuous-valued functions, i.e.,
        predicts unknown or missing values
  • Typical Applications
      • Credit approval
      • Target marketing
      • Medical diagnosis
      • Treatment effectiveness analysis

30
Classification Techniques
31
Classification Technique
  • A classification technique is a systematic
    approach for building classification models from
    an input data set.
  • Examples of classification techniques include:
  • Decision Tree Classifiers
  • Rule-Based Classifiers
  • Neural Networks
  • Support Vector Machines
  • Naive Bayes Classifiers
  • Nearest-Neighbor Classifiers

32
Classification Technique
  • Each technique employs a learning algorithm to
    identify a model that best fits the relationship
    between the attribute set and class label of the
    input data (produces outputs consistent with the
    class labels of the input data).
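
As one illustration of a learning algorithm, the sketch below implements a very small nearest-neighbor classifier (one of the techniques listed earlier). It is a sketch under stated assumptions, not the course's reference implementation: the function names, the attribute names, and the training records are all hypothetical, and distance is simply the number of attributes on which two records disagree.

```python
def hamming_distance(a: dict, b: dict) -> int:
    """Count the attributes on which two attribute sets disagree."""
    return sum(a[key] != b[key] for key in a)


def nearest_neighbor_predict(training_set, x):
    """Return the class label of the training record closest to x."""
    _, label = min(training_set,
                   key=lambda record: hamming_distance(record[0], x))
    return label


# Hypothetical training records (attribute set, class label).
training_set = [
    ({"warm_blooded": True, "gives_birth": True,
      "can_fly": False, "lives_in_water": False}, "mammal"),
    ({"warm_blooded": True, "gives_birth": False,
      "can_fly": True, "lives_in_water": False}, "bird"),
    ({"warm_blooded": False, "gives_birth": False,
      "can_fly": False, "lives_in_water": True}, "fish"),
]

# The model reproduces the class labels of its own training records.
train_predictions = [nearest_neighbor_predict(training_set, x)
                     for x, _ in training_set]

# A previously unseen record: warm-blooded, gives birth, lives in water.
unseen = {"warm_blooded": True, "gives_birth": True,
          "can_fly": False, "lives_in_water": True}
unseen_prediction = nearest_neighbor_predict(training_set, unseen)
print(train_predictions, unseen_prediction)
```

Here the unseen record differs from the mammal record on one attribute, from the fish record on two, and from the bird record on three, so the closest training label is the one returned.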

33
Classification Technique
  • A good classification model must predict
    correctly the class labels of records it has
    never seen before.
  • Building models with good generalization
    capability, i.e., models that accurately predict
    the class labels of previously unseen records, is
    therefore a key objective of the learning
    algorithm.

34
General Approach to Solve a Classification Problem
  • A general strategy for solving a classification
    problem is as follows:
  • First, the input data is divided into two
    disjoint sets, known as the training set and the
    test set.
  • The training set will be used for building a
    classification model.
  • The induced model is later applied to the test
    set to predict the class label of each test
    record.
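
The division into disjoint training and test sets can be sketched as follows. The function name split_records, the 30% test fraction, and the fixed seed are assumptions chosen for this illustration; the slides do not prescribe a particular split ratio.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into disjoint train/test lists."""
    shuffled = list(records)               # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (attribute value, class label) pairs.
records = [(i, i % 2) for i in range(10)]
training_set, test_set = split_records(records)

print(len(training_set), len(test_set))
```

Shuffling before splitting avoids a test set that only reflects the order in which the records happened to be stored.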

35
Why are we dividing the data into two sets?
  • This strategy of dividing the data into
    independent training and test sets allows us to
    obtain an unbiased estimate of the performance of
    a model on previously unseen records.
  • The figure on the next slide depicts this
    general approach.

36
General Approach to Solve a Classification Problem
37
Performance Measurement of Model
  • Evaluation of the performance of a classification
    model is based upon the number of test records
    predicted correctly and wrongly by the model.
  • The counts are tabulated in a table known as a
    confusion matrix.

38
Performance Measurement of Model
  • Table 2 depicts the confusion matrix for a binary
    classification problem.

39
Performance Measurement of Model
  • Each entry $f_{ij}$ in this table denotes the
    number of records from class i predicted to be
    of class j.
  • For instance, $f_{01}$ is the number of records
    from class 0 wrongly predicted as class 1.
  • Based on the entries in the confusion matrix, the
    total number of correct predictions made by the
    model is $(f_{11} + f_{00})$ and the total number
    of wrong predictions is $(f_{10} + f_{01})$.
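
The counting described above can be sketched in Python for a binary problem with classes 0 and 1. The function name and the label lists are hypothetical and serve only to show how the four entries are tallied.

```python
def confusion_matrix(actual, predicted):
    """Return counts f[i][j]: records of class i predicted as class j."""
    f = [[0, 0], [0, 0]]
    for i, j in zip(actual, predicted):
        f[i][j] += 1
    return f


# Hypothetical test-set labels and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[1][1] + f[0][0]  # f11 + f00
wrong = f[1][0] + f[0][1]    # f10 + f01
print(f, correct, wrong)
```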

40
Performance Measurement of Model
  • Although a confusion matrix provides the
    information needed to determine how well a
    classification model performs, it is useful to
    summarize this information in a single number.
  • This would make it more convenient to compare the
    performance of different classification models.

41
Performance Measurement of Model
  • There are several performance metrics available
    for doing this. One of the most popular metrics
    is model accuracy, which is defined as
  • Accuracy:
    $$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
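
As a quick check of the accuracy definition, the sketch below computes it from the four confusion-matrix counts. The counts used in the example are hypothetical.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct."""
    return (f11 + f00) / (f11 + f10 + f01 + f00)


# Hypothetical counts: 6 correct and 2 wrong predictions out of 8.
model_accuracy = accuracy(f11=3, f10=1, f01=1, f00=3)
print(model_accuracy)
```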

42
Performance Measurement of Model
  • Equivalently, the performance of a model can be
    expressed in terms of its error rate given by the
    following equation
  • Error rate:
    $$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
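
The word "equivalently" can be justified with a short derivation. Writing N for the total number of predictions (a symbol introduced here for brevity), accuracy and error rate always sum to 1. The numeric counts in the last line are hypothetical.

```latex
\begin{align*}
N &= f_{11} + f_{10} + f_{01} + f_{00} \\
\text{Accuracy} + \text{Error rate}
  &= \frac{f_{11} + f_{00}}{N} + \frac{f_{10} + f_{01}}{N}
   = \frac{N}{N} = 1 \\
\text{Error rate} &= 1 - \text{Accuracy} \\
\text{e.g. } f_{11} = 3,\ f_{10} = 1,\ f_{01} = 1,\ f_{00} = 3:
  \quad \text{Error rate} &= \frac{1 + 1}{8} = 0.25 = 1 - 0.75
\end{align*}
```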

43
  • ?

44
Decision Trees