Chapter 4: Classification and Scoring

UIC - CS 594. B. Liu.

Provided by: dis12
Learn more at: https://www.cs.uic.edu
1
Chapter 4: Classification and Scoring
2
An example application
  • An emergency room in a hospital measures 17
    variables (e.g., blood pressure, age, etc.) of
    newly admitted patients. A decision has to be
    made whether to put the patient in an
    intensive-care unit. Due to the high cost of ICU
    care, patients who may survive less than a month
    are given higher priority. The problem is to
    predict high-risk patients and discriminate them
    from low-risk patients.

3
Another application
  • A credit card company typically receives
    thousands of applications for new cards. Each
    application contains information on several
    attributes, such as annual salary, outstanding
    debts, age, etc. The problem is to categorize
    applicants into those who have good credit,
    those who have bad credit, and those who fall
    into a gray area (thus requiring further human
    analysis).

4
Classification
  • Data: There are k attributes A1, ..., Ak. Each
    tuple (case or example) is described by values of
    the attributes and a class label.
  • Goal: To learn rules or to build a model that can
    be used to predict the classes of new (or future
    or test) cases.
  • The data used for building the model is called
    the training data.

5
An example data set
6
Classification: A Two-Step Process
  • Model construction: describing a set of
    predetermined classes based on a training set. It
    is also called learning.
  • Each tuple/sample is assumed to belong to a
    predefined class.
  • The model is represented as classification rules,
    decision trees, or mathematical formulae.
  • Model usage: classifying future test
    data/objects.
  • Estimate the accuracy of the model:
  • The known label of each test example is compared
    with the classified result from the model.
  • The accuracy rate is the percentage of test cases
    that are correctly classified by the model.
  • If the accuracy is acceptable, use the model to
    classify data tuples whose class labels are not
    known.
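
The accuracy estimate described on this slide can be sketched in a few lines. This is only an illustration: the function name `accuracy` and the toy label lists are assumptions, not part of the original slides.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test cases whose predicted class matches the known class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test set: known labels vs. labels produced by some model.
known = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
print(accuracy(known, predicted))  # 3 of the 4 test cases agree
```

Multiplying the result by 100 gives the accuracy rate as a percentage.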

7
Classification Process (1): Model Construction
Classification algorithms are applied to the training data to produce a classifier (model), e.g.:
IF rank = professor OR years > 6 THEN tenured = yes
8
Classification Process (2): Use the Model in Prediction
New case: (Jeff, Professor, 4). Tenured?
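
As a small illustration of model usage, the rule learned in step (1) can be applied to the new case. The function name `is_tenured` and the case-insensitive rank comparison are assumptions made for this sketch.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the learned rule: IF rank = professor OR years > 6 THEN tenured = yes."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"


# The new case from the slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, "tenured?", is_tenured(rank, years))
```

The rank test alone is enough to fire the rule here, even though 4 years is below the threshold.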
9
Supervised vs. Unsupervised Learning
  • Supervised learning: classification is seen as
    supervised learning from examples.
  • Supervision: The training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the classes of the observations/cases.
  • New data is classified based on the training set.
  • Unsupervised learning (clustering):
  • The class labels of the training data are unknown.
  • Given a set of measurements, observations, etc.,
    the aim is to establish the existence of
    classes or clusters in the data.

10
Evaluating Classification Methods
  • Predictive accuracy
  • Speed and scalability
  • time to construct the model
  • time to use the model
  • Robustness: handling noise and missing values
  • Scalability: efficiency in disk-resident
    databases
  • Interpretability
  • understandability and insight provided by the model
  • Compactness of the model: size of the tree, or
    the number of rules.

11
Different classification techniques
  • There are many techniques for classification
  • Decision trees
  • Naïve Bayesian classifiers
  • Using association rules
  • Neural networks
  • Logistic regression
  • and many more ...

12
Building a decision tree: an example training dataset
13
Output: A Decision Tree for buys_computer
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
14
Inducing a decision tree
  • There are many possible trees
  • let's try it on a credit data set
  • How do we find the most compact one
  • that is consistent with the data?
  • Why the most compact?
  • Occam's razor principle

15
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (we will talk about
    continuous-valued attributes later)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning:
  • All examples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning; majority voting is employed for
    classifying the leaf
  • There are no examples left
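
The greedy, top-down procedure above can be sketched as follows. This is a minimal illustration for categorical attributes, not the exact ID3/C4.5 implementation: the function names, the nested-dict tree representation, and the four-row toy data set are all assumptions. Because branches are created only for attribute values that actually occur in the current partition, the "no examples left" case does not arise in this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:          # all examples belong to the same class
        return labels[0]
    if not attrs:                      # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return tree


def classify(tree, row):
    """Follow the branches that match the row until a leaf is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


# Hypothetical toy data: the class depends only on the "student" attribute.
rows = [
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels, ["student", "credit"])
print(tree)
```

On this toy data the only informative attribute is `student`, so the tree has a single test at the root.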

16
Building a compact tree
  • The key to building a decision tree is deciding
    which attribute to branch on.
  • The heuristic is to choose the attribute with the
    maximum Information Gain based on information
    theory.
  • Another explanation is to reduce uncertainty as
    much as possible.

17
Information theory
  • Information theory provides a mathematical basis
    for measuring the information content.
  • To understand the notion of information, think
    about it as providing the answer to a question,
    for example, whether a coin will come up heads.
  • If one already has a good guess about the answer,
    then the actual answer is less informative.
  • If one already knows that the coin is rigged so
    that it will come up heads with probability
    0.99, then a message (advance information) about
    the actual outcome of a flip is worth less than
    it would be for an honest coin.

18
Information theory (cont.)
  • For a fair (honest) coin, you have no
    information, and you are willing to pay more (say,
    in terms of $) for advance information: the less
    you know, the more valuable the information.
  • Information theory uses this same intuition, but
    instead of measuring the value of information in
    dollars, it measures information content in
    bits. One bit of information is enough to answer
    a yes/no question about which one has no idea,
    such as the flip of a fair coin.

19
Information theory
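  • In general, if the possible answers vi have
    probabilities P(vi), then the information content
    I (entropy) of the actual answer is given by
    $I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$
  • For example, for the tossing of a fair coin we
    get
    $I(\tfrac{1}{2}, \tfrac{1}{2}) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$ bit
  • If the coin is loaded to give 99% heads we get
    I = 0.08 bits, and as the probability of heads
    goes to 1, the information of the actual answer
    goes to 0.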
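
The coin figures on this slide can be checked numerically. The sketch below assumes the usual entropy definition (sum of -P log2 P over the possible answers); the function name `information_content` is an illustrative choice.

```python
import math


def information_content(probabilities):
    """Entropy in bits of a distribution over possible answers.

    Terms with probability 0 are skipped, since they contribute nothing.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


fair = information_content([0.5, 0.5])      # honest coin
loaded = information_content([0.99, 0.01])  # coin rigged to give 99% heads
print(fair, round(loaded, 2))
```

As the probability of heads approaches 1, the second term vanishes and the first approaches 0, which is why the information of the actual answer goes to 0.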

20
Back to decision tree learning
  • For a given example, what is the correct
    classification?
  • We may think of a decision tree as conveying
    information about the classification of examples
    in the table (of examples)
  • The entropy measure characterizes the (im)purity
    of an arbitrary collection of examples.

21
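Attribute Selection Measure: Information Gain (ID3/C4.5)
  • S contains si tuples of class Ci for i = 1, ...,
    m, and s = s1 + ... + sm tuples in total
  • The information (entropy) required to classify
    an arbitrary tuple is
    $I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
  • Assume a set of training examples, S. If we make
    attribute A, with v values, the root of the
    current tree, this will partition S into v
    subsets S1, ..., Sv, where subset Sj contains sij
    tuples of class Ci. The expected information
    needed to complete the tree after making A the
    root is
    $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$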

22
Information gain
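  • The information gained by branching on attribute A is
    $Gain(A) = I(s_1, \ldots, s_m) - E(A)$,
    where I(s1, ..., sm) is the entropy of S and E(A) is the
    expected information still needed after branching on A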
  • We will choose the attribute with the highest
    information gain to branch the current tree.

23
Attribute Selection by info gain
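  • Class P: buys_computer = yes
  • Class N: buys_computer = no
  • I(p, n) = I(9, 5) = 0.940
  • Compute the entropy for age:
    $E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$
  • Here $\frac{5}{14} I(2, 3)$ means age <= 30 has 5 out of 14
    samples, with 2 yeses and 3 nos. Hence
    $Gain(age) = I(p, n) - E(age) = 0.246$
  • Similarly, Gain(income) = 0.029, Gain(student) = 0.151,
    and Gain(credit_rating) = 0.048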
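
The calculation for age can be reproduced from class counts alone. In the sketch below, the (yes, no) counts for age <= 30 come from the slide; the counts for the other two age groups, (4, 0) and (3, 2), are assumed from the standard buys_computer example, since the data table is not reproduced in this transcript. The function names are illustrative.

```python
import math


def info(*counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """Weighted entropy of the subsets produced by branching on an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


# (yes, no) counts per age value; the last two tuples are assumptions.
age_partitions = [(2, 3), (4, 0), (3, 2)]

before = info(9, 5)                      # entropy of the whole training set
after = expected_info(age_partitions)    # expected entropy after splitting on age
gain = before - after
print(round(before, 3), round(after, 3), round(gain, 3))
```

Computed without intermediate rounding, the gain comes out at roughly 0.25; subtracting the two rounded entropies can differ from this in the third decimal place.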

24
We build the following tree
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
25
Extracting Classification Rules from Trees
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction. The leaf node holds the class
    prediction
  • Rules are easier for humans to understand
  • Example:
  • IF age = "<=30" AND student = "no" THEN
    buys_computer = "no"
  • IF age = "<=30" AND student = "yes" THEN
    buys_computer = "yes"
  • IF age = "31..40" THEN buys_computer = "yes"
  • IF age = ">40" AND credit_rating = "excellent"
    THEN buys_computer = "no"
  • IF age = ">40" AND credit_rating = "fair" THEN
    buys_computer = "yes"
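
Rule extraction is a walk over every root-to-leaf path. The sketch below assumes the tree is stored as nested dicts of the form {attribute: {value: subtree}}, which is a representation chosen for illustration; `extract_rules` is a hypothetical helper, and only the age/student fragment of the tree is used.

```python
def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: holds the class prediction
        body = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {body} THEN {target} = "{tree}"']
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        # Each attribute-value pair on the path is added to the conjunction.
        rules.extend(extract_rules(subtree, target, conditions + ((attribute, value),)))
    return rules


# Fragment of the buys_computer tree (the age > 40 branch is omitted here).
fragment = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
    }
}
rules = extract_rules(fragment, "buys_computer")
for rule in rules:
    print(rule)
```

The number of rules always equals the number of leaves, so a more compact tree also yields a shorter rule list.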

26
Avoid Overfitting in Classification
  • Overfitting: A tree may overfit the training
    data
  • Good accuracy on training data but poor on test
    examples
  • Too many branches, some of which may reflect
    anomalies due to noise or outliers
  • Two approaches to avoid overfitting:
  • Prepruning: Halt tree construction early
  • Difficult to decide when to halt
  • Postpruning: Remove branches from a fully grown
    tree to get a sequence of progressively pruned
    trees.
  • This method is commonly used (based on a validation
    set, a statistical estimate, or MDL)

27
Enhancements to basic decision tree induction
  • Allow for continuous-valued attributes
  • Dynamically define new discrete-valued attributes
    that partition the continuous attribute value
    into a discrete set of intervals
  • Handle missing attribute values
  • Assign the most common value of the attribute
  • Assign probability to each of the possible values
  • Attribute construction
  • Create new attributes based on existing ones that
    are sparsely represented.
  • This reduces fragmentation, repetition, and
    replication

28
Bayesian Classification Why?
  • Probabilistic learning: Classification learning
    can also be seen as computing $P(C = c \mid d)$, i.e.,
    given a data tuple d, what is the probability
    that d is of class c? (C is the class attribute.)
  • How?

29
Naïve Bayesian Classifier
  • Let A1 through Ak be attributes with discrete
    values. They are used to predict a discrete class
    C.
  • Given an example with observed attribute values
    a1 through ak.
  • The prediction is the class c such that
  • $P(C = c \mid A_1 = a_1 \wedge \cdots \wedge A_k = a_k)$
  • is maximal.

30
Compute Probabilities
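  • By Bayes' rule, the above can be expressed as
    $P(C = c \mid A_1 = a_1 \wedge \cdots \wedge A_k = a_k) = \frac{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c) \, P(C = c)}{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k)}$
  • $P(C = c)$ can be easily estimated from training
    data.
  • $P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k)$ is irrelevant for decision
    making since it is the same for every class value
    c.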

31
Computing probabilities
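  • We only need $P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c)$, which can
    be written as
  • $P(A_1 = a_1 \mid A_2 = a_2 \wedge \cdots \wedge A_k = a_k, C = c) \times P(A_2 = a_2 \wedge \cdots \wedge A_k = a_k \mid C = c)$
  • Recursively, the second factor above can be
    written in the same way, and so on.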

32
Computing probabilities
  • Now suppose we assume that all attributes are
    conditionally independent given the class c.
    Formally, we assume
  • $P(A_1 = a_1 \mid A_2 = a_2 \wedge \cdots \wedge A_k = a_k, C = c) = P(A_1 = a_1 \mid C = c)$
  • and so on for A2 through Ak.
  • We are done.
  • How do we estimate $P(A_1 = a_1 \mid C = c)$?

33
Training dataset
Class C1: buys_computer = yes; C2: buys_computer = no
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
34
An Example
  • Compute P(Ai = ai | C = c) for each class:
  • P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
  • P(age <= 30 | buys_computer = no) = 3/5 = 0.6
  • P(income = medium | buys_computer = yes) = 4/9 = 0.444
  • P(income = medium | buys_computer = no) = 2/5 = 0.4
  • P(student = yes | buys_computer = yes) = 6/9 = 0.667
  • P(student = yes | buys_computer = no) = 1/5 = 0.2
  • P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  • P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
  • X = (age <= 30, income = medium, student = yes,
    credit_rating = fair)
  • P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667
    x 0.667 = 0.044
  • P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x
    0.4 = 0.019
  • P(X | C = c) P(C = c):
    P(X | buys_computer = yes) P(buys_computer = yes) = 0.028
  • P(X | buys_computer = no) P(buys_computer = no) = 0.007
  • X belongs to class buys_computer = yes
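
The arithmetic of this example can be replayed directly. The conditional probabilities below are the fractions listed on the slide; the class priors 9/14 and 5/14 are taken from the 9 yes / 5 no split used in the information-gain example. The dictionary layout and names are illustrative assumptions.

```python
import math

# P(attribute value | class) for the sample X, as fractions from the slide.
conditional = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no": {"age<=30": 3 / 5, "income=medium": 2 / 5,
           "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}
# Class priors P(C = c): 9 of the 14 training tuples are "yes", 5 are "no".
prior = {"yes": 9 / 14, "no": 5 / 14}

# Naive Bayes score: P(X | C = c) * P(C = c), with P(X | C = c) taken as the
# product of the per-attribute conditionals (the independence assumption).
scores = {c: prior[c] * math.prod(conditional[c].values()) for c in prior}
predicted = max(scores, key=scores.get)
print({c: round(s, 3) for c, s in scores.items()}, predicted)
```

The denominator P(X) is never computed because it is the same for both classes and does not affect which score is larger.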

35
On Naïve Bayesian Classifier
  • Advantages
  • Easy to implement
  • Good results obtained in many applications
  • Disadvantages
  • Assumption: class conditional independence,
    therefore loss of accuracy when the assumption is
    not true.
  • Practically, dependencies exist
  • How to deal with these dependencies?
  • Bayesian Belief Networks

36
Use of Association Rules: Classification
  • Classification: mine a small set of rules
    existing in the data to form a classifier or
    predictor.
  • It has a target attribute (on the right-hand
    side): the class attribute
  • Association: has no fixed target, but we can fix
    a target.

37
Class Association Rules (CARs)
  • Mining rules with a fixed target
  • The right-hand side of the rules is fixed to a
    single attribute, which can have a number of
    values
  • E.g., X = a, Y = d → Class = yes
  • X = b → Class = no
  • We call such rules class association rules

38
Mining Class Association Rules
  • Itemset in class association rules:
  • <condset, class_value>
  • condset: a set of items
  • item: an attribute-value pair, e.g.,
  • attribute1 = a
  • class_value: a value of the class attribute
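
To make the <condset, class_value> notion concrete, the sketch below enumerates condsets by brute force and keeps those that form strong rules. It is not the modified Apriori algorithm used by CBA; the support and confidence thresholds are the usual association-rule notions and are assumed here, and the function name, parameters, and four-row data set are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_cars(rows, class_attr, min_support, min_confidence, max_len=2):
    """Brute-force search for class association rules condset -> class_value."""
    condset_count = Counter()   # how many rows contain each condset
    ruleitem_count = Counter()  # how many rows contain <condset, class_value>
    for row in rows:
        items = sorted((a, v) for a, v in row.items() if a != class_attr)
        for size in range(1, max_len + 1):
            for condset in combinations(items, size):
                condset_count[condset] += 1
                ruleitem_count[(condset, row[class_attr])] += 1
    rules = []
    for (condset, class_value), count in ruleitem_count.items():
        support = count / len(rows)
        confidence = count / condset_count[condset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((condset, class_value, support, confidence))
    return rules


# Hypothetical table-form data with attributes X, Y and a class attribute.
rows = [
    {"X": "a", "Y": "d", "Class": "yes"},
    {"X": "a", "Y": "d", "Class": "yes"},
    {"X": "b", "Y": "d", "Class": "no"},
    {"X": "b", "Y": "e", "Class": "no"},
]
rules = mine_cars(rows, "Class", min_support=0.5, min_confidence=1.0)
for condset, class_value, support, confidence in rules:
    lhs = ", ".join(f"{a} = {v}" for a, v in condset)
    print(f"{lhs} -> Class = {class_value} (sup={support}, conf={confidence})")
```

Brute-force enumeration grows exponentially with the number of attributes, which is why a level-wise search such as Apriori is used in practice.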

39
Classification Based on Associations (CBA)
  • Two steps
  • Find all class association rules
  • Using a modified Apriori algorithm
  • Build a classifier
  • There can be many ways, e.g.,
  • Choose a small set of rules to cover the data
  • Numeric attributes need to be discretized.

40
Advantages of the CBA Model
  • One algorithm performs 3 tasks
  • mine class association rules
  • build an accurate classifier (or predictor)
  • mine normal association rules
  • by treating class as a dummy in
  • <condset, class_value>
  • then condset = itemset

41
Advantages of the CBA Model
  • Existing classification systems use
  • Table data.
  • CBA can build classifiers using either
  • Table form data or
  • Transaction form data (sparse data)
  • CBA is able to find rules that existing
    classification systems cannot.

42
Assoc. Rules can be Used in Many Ways for
Prediction
  • We have so many rules
  • Select a subset of rules
  • Using Bayesian probability together with the rules
  • Using rule combinations
  • A number of systems have been designed and
    implemented.

43
Other classification techniques
  • Support vector machines
  • Logistic regression
  • K-nearest neighbor
  • Neural networks
  • Genetic algorithms
  • Etc.

44
How to Estimate Classification Accuracy or Error Rates
  • Partition: training-and-testing
  • use two independent data sets, e.g., training set
    (2/3), test set (1/3)
  • used for data sets with a large number of examples
  • Cross-validation
  • divide the data set into k subsamples
  • use k-1 subsamples as training data and one
    subsample as test data: k-fold cross-validation
  • for data sets of moderate size
  • leave-one-out for small data sets
45
Scoring the data
  • Scoring is related to classification.
  • Normally, we are only interested in a single class
    (called the positive class), e.g., the buyers class
    in a marketing database.
  • Instead of assigning each test example a definite
    class, scoring assigns a probability estimate
    (PE) to indicate the likelihood that the example
    belongs to the positive class.

46
Ranking and lift analysis
  • After each example is given a score, we can rank
    all examples according to their PEs.
  • We then divide the data into n (say 10) bins. A
    lift curve can be drawn according to how many
    positive examples are in each bin. This is called
    lift analysis.
  • Classification systems can be used for scoring.
    They need to produce a probability estimate.
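
The ranking and binning steps can be sketched as below. The function name `lift_table`, the 0/1 label encoding (1 = positive class), and the ten scored examples are assumptions for illustration; the sketch also assumes at least one positive example exists.

```python
def lift_table(scores, labels, n_bins=10):
    """Rank examples by score and report positives per bin.

    Returns a list of (bin number, positives in bin, cumulative fraction of
    all positives captured so far).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    n = len(ranked)
    table, cumulative = [], 0
    for b in range(n_bins):
        start, end = b * n // n_bins, (b + 1) * n // n_bins
        positives = sum(label for _, label in ranked[start:end])
        cumulative += positives
        table.append((b + 1, positives, cumulative / total_positives))
    return table


# Hypothetical probability estimates and true labels for 10 test examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
table = lift_table(scores, labels, n_bins=5)
for bin_number, positives, captured in table:
    print(bin_number, positives, captured)
```

Plotting the cumulative fraction against the bin number gives the lift curve; a good scorer captures most positives in the first few bins.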

47
Lift curve
[Figure: lift curve over bins 1 to 10]