1
Classification
2
Classification task
  • Input: a training set of tuples, each labeled
    with one class label
  • Output: a model (classifier) that assigns a class
    label to each tuple based on the other attributes
  • The model can be used to predict the class of new
    tuples, for which the class label is missing or
    unknown

3
What is Classification
  • Data classification is a two-step process
  • first step: a model is built describing a
    predetermined set of data classes or concepts
  • second step: the model is used for classification
  • Each tuple is assumed to belong to a predefined
    class, as determined by one of the attributes,
    called the class label attribute
  • Data tuples are also referred to as samples,
    examples, or objects

4
Train and test
  • The tuples (examples, samples) are divided into a
    training set and a test set
  • The classification model is built in two steps
  • training - build the model from the training set
  • test - check the accuracy of the model using the
    test set

5
Train and test
  • Kinds of models
  • if-then rules
  • logical formulae
  • decision trees
  • Accuracy of models
  • the known class of each test sample is matched
    against the class predicted by the model
  • accuracy = rate of test set samples correctly
    classified by the model

6
Training step
[Diagram: training data + classification algorithm -> classifier (model)]
if age < 31 or Car Type = Sports then Risk = High
7
Test step
[Diagram: test data -> classifier (model)]
8
Classification (prediction)
[Diagram: new data -> classifier (model)]
9
Classification vs. Prediction
  • There are two forms of data analysis that can be
    used to extract models describing data classes or
    to predict future data trends
  • classification predicts categorical labels
  • prediction models continuous-valued functions

10
Comparing Classification Methods (1)
  • Predictive accuracy: the ability of the model to
    correctly predict the class label of new or
    previously unseen data
  • Speed: the computation cost involved in
    generating and using the model
  • Robustness: the ability of the model to make
    correct predictions given noisy data or data with
    missing values

11
Comparing Classification Methods (2)
  • Scalability: the ability to construct the model
    efficiently given a large amount of data
  • Interpretability: the level of understanding and
    insight that is provided by the model
  • Simplicity
  • decision tree size
  • rule compactness
  • Domain-dependent quality indicators

12
Problem formulation
  • Given records in the database with class labels,
    find a model for each class.

[Decision tree: Age < 31? Y -> Risk = High; N -> Car Type is sports? Y -> Risk = High, N -> Risk = Low]
13
Classification techniques
  • Decision Tree Classification
  • Bayesian Classifiers
  • Neural Networks
  • Statistical Analysis
  • Genetic Algorithms
  • Rough Set Approach
  • k-nearest neighbor classifiers

14
Classification by Decision Tree Induction
  • A decision tree is a tree structure, where
  • each internal node denotes a test on an
    attribute,
  • each branch represents the outcome of the test,
  • leaf nodes represent classes or class
    distributions

[Decision tree: Age < 31? Y -> Risk = High; N -> Car Type is sports? Y -> Risk = High, N -> Risk = Low]
15
Decision Tree Induction (1)
  • A decision tree is a class discriminator that
    recursively partitions the training set until
    each partition consists entirely or dominantly of
    examples from one class.
  • Each non-leaf node of the tree contains a split
    point, which is a test on one or more attributes
    and determines how the data is partitioned

16
Decision Tree Induction (2)
  • Basic algorithm: a greedy algorithm that
    constructs decision trees in a top-down,
    recursive, divide-and-conquer manner.
  • Many variants
  • from machine learning (ID3, C4.5)
  • from statistics (CART)
  • from pattern recognition (CHAID)
  • Main difference: the split criterion

17
Decision Tree Induction (3)
  • The algorithm consists of two phases
  • Build an initial tree from the training data such
    that each leaf node is pure
  • Prune this tree to increase its accuracy on test
    data

18
Tree Building
  • In the growth phase the tree is built by
    recursively partitioning the data until each
    partition is either "pure" (contains members of
    the same class) or sufficiently small.
  • The form of the split used to partition the data
    depends on the type of the attribute used in the
    split
  • for a continuous attribute A, splits are of the
    form value(A) < x, where x is a value in the
    domain of A
  • for a categorical attribute A, splits are of the
    form value(A) ∈ X, where X ⊂ domain(A)

19
Tree Building Algorithm
  Make Tree(Training Data T)
      Partition(T)

  Partition(Data S)
      if (all points in S are in the same class) then
          return
      for each attribute A do
          evaluate splits on attribute A
      use the best split found to partition S into S1 and S2
      Partition(S1)
      Partition(S2)
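A minimal, runnable Python sketch of the Partition procedure above, assuming numeric attributes and using the Gini index (introduced on the following slides) as a placeholder split criterion; the record format and function names are illustrative, not from the slides:

    # Sketch of Make Tree / Partition for numeric attributes.
    from collections import Counter

    def gini(labels):
        # gini(S) = 1 - sum of squared class frequencies
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(rows, labels):
        # "evaluate splits on attribute A" for every attribute and threshold
        best = None
        for a in range(len(rows[0])):
            for t in sorted(set(r[a] for r in rows)):
                left = [y for r, y in zip(rows, labels) if r[a] <= t]
                right = [y for r, y in zip(rows, labels) if r[a] > t]
                if not left or not right:
                    continue
                score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
                if best is None or score < best[2]:
                    best = (a, t, score)
        return best

    def partition(rows, labels):
        if len(set(labels)) == 1:                    # pure partition -> leaf
            return {"class": labels[0]}
        split = best_split(rows, labels)
        if split is None:                            # no useful split -> majority leaf
            return {"class": Counter(labels).most_common(1)[0][0]}
        a, t, _ = split
        left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
        right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
        return {"attr": a, "thr": t,
                "left": partition([r for r, _ in left], [y for _, y in left]),
                "right": partition([r for r, _ in right], [y for _, y in right])}

    # e.g. partition([[17], [20], [23], [32]], ["High", "High", "High", "Low"])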

20
Tree Building Algorithm
  • While growing the tree, the goal at each node is
    to determine the split point that "best" divides
    the training records belonging to that leaf
  • To evaluate the goodness of a split, several
    splitting indices have been proposed

21
Split Criteria
  • Gini index (CART, SPRINT)
  • select the attribute that minimizes the impurity
    of a split
  • Information gain (ID3, C4.5)
  • measure the impurity of a split using entropy
  • select the attribute that maximizes the entropy
    reduction
  • χ² contingency table statistic (CHAID)
  • measures the correlation between each attribute
    and the class label
  • select the attribute with maximal correlation

22
Gini index (1)
  • We are given a sample training set where each
    record represents a car-insurance applicant, and
    we want to build a model of what makes an
    applicant a high or low insurance risk.

[Diagram: training set -> classifier (model)]
The model built can be used to screen future
insurance applicants by classifying them into the
High or Low risk categories
23
Gini index (2)
  • SPRINT algorithm

  Partition(Data S)
      if (all points in S are of the same class) then
          return
      for each attribute A do
          evaluate splits on attribute A
      use the best split found to partition S into S1 and S2
      Partition(S1)
      Partition(S2)

  Initial call: Partition(Training Data)

24
Gini index (3)
  • Definition
  • gini(S) = 1 - Σj pj²
  • where
  • S is a data set containing examples from n
    classes
  • pj is the relative frequency of class j in S
  • E.g. two classes, Pos and Neg, and dataset S with
    p Pos-elements and n Neg-elements:
  • ppos = p/(p+n), pneg = n/(n+p)
  • gini(S) = 1 - ppos² - pneg²

25
Gini index (4)
  • If dataset S is split into S1 and S2, then the
    splitting index is defined as follows
  • giniSPLIT(S) = (p1 + n1)/(p + n) · gini(S1) +
    (p2 + n2)/(p + n) · gini(S2)
  • where p1, n1 (p2, n2) denote the numbers of
    Pos-elements and Neg-elements in the dataset S1
    (S2), respectively.
  • In this definition the "best" split point is the
    one with the lowest value of the giniSPLIT index.
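These two definitions translate directly into a couple of Python helpers (function names are mine; the example call uses the Age ≤ 17 split from the slides that follow):

    def gini(p, n):
        # gini(S) = 1 - p_pos^2 - p_neg^2 for a set with p Pos- and n Neg-elements
        total = p + n
        return 1.0 - (p / total) ** 2 - (n / total) ** 2

    def gini_split(p1, n1, p2, n2):
        # giniSPLIT(S) = (p1+n1)/(p+n) * gini(S1) + (p2+n2)/(p+n) * gini(S2)
        total = p1 + n1 + p2 + n2
        return ((p1 + n1) / total) * gini(p1, n1) + ((p2 + n2) / total) * gini(p2, n2)

    # Age <= 17 split from the example below: 1 High | 3 High, 2 Low
    print(gini_split(1, 0, 3, 2))   # (1/6)*0 + (5/6)*(12/25) = 0.4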

26
Example (1)
Training set
27
Example (1)
Attribute list for Age
Attribute list for Car Type
28
Example (2)
  • Possible values of a split point for the Age
    attribute are
  • Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43,
    Age ≤ 68

G(Age ≤ 17) = 1 - (1² + 0²) = 0
G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
GSPLIT = (1/6)·0 + (5/6)·(12/25) = 2/5
29
Example (3)
G(Age ≤ 20) = 1 - (1² + 0²) = 0
G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (2/6)·0 + (4/6)·(1/2) = 1/3

G(Age ≤ 23) = 1 - (1² + 0²) = 0
G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6)·0 + (3/6)·(4/9) = 2/9
30
Example (4)
G(Age ≤ 32) = 1 - ((3/4)² + (1/4)²) = 1 - 10/16 = 6/16 = 3/8
G(Age > 32) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (4/6)·(3/8) + (2/6)·(1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is for Age ≤ 23, thus we
have a split point at Age = (23 + 32)/2 = 27.5
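A quick numerical check of these split values, using the per-side (High, Low) counts read off the slides above (self-contained; the helper names are mine):

    def g(h, l):
        # Gini impurity of a partition with h High and l Low examples
        t = h + l
        return 1 - (h / t) ** 2 - (l / t) ** 2

    def g_split(left, right):
        # weighted Gini index of a binary split
        t = sum(left) + sum(right)
        return sum(left) / t * g(*left) + sum(right) / t * g(*right)

    splits = {
        "Age <= 17": ((1, 0), (3, 2)),   # GSPLIT = 2/5
        "Age <= 20": ((2, 0), (2, 2)),   # GSPLIT = 1/3
        "Age <= 23": ((3, 0), (1, 2)),   # GSPLIT = 2/9 (lowest)
        "Age <= 32": ((3, 1), (1, 1)),   # GSPLIT = 5/12
    }
    for name, (left, right) in splits.items():
        print(name, round(g_split(left, right), 4))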
31
Example (5)
Decision tree after the first split of the
example set:
[Age ≤ 27.5 -> Risk = High; Age > 27.5 -> Risk = Low]
32
Example (6)
Attribute lists are divided at the split point:
attribute lists for Age ≤ 27.5 and attribute lists
for Age > 27.5
33
Example (7)
Evaluating splits for categorical attributes:
we have to evaluate the splitting index for each of
the 2^N combinations, where N is the number of
distinct values of the categorical attribute.
G(Car type ∈ {sport}) = 1 - 1² - 0² = 0
G(Car type ∈ {family}) = 1 - 0² - 1² = 0
G(Car type ∈ {truck}) = 1 - 0² - 1² = 0
34
Example (8)
  • G(Car type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
  • G(Car type ∈ {sport, truck}) = 1/2
  • G(Car type ∈ {family, truck}) = 1 - 0² - 1² = 0
  • GSPLIT(Car type ∈ {sport}) = (1/3)·0 + (2/3)·0 = 0
  • GSPLIT(Car type ∈ {family}) = (1/3)·0 + (2/3)·(1/2) = 1/3
  • GSPLIT(Car type ∈ {truck}) = (1/3)·0 + (2/3)·(1/2) = 1/3
  • GSPLIT(Car type ∈ {sport, family}) = (2/3)·(1/2) + (1/3)·0 = 1/3
  • GSPLIT(Car type ∈ {sport, truck}) = (2/3)·(1/2) + (1/3)·0 = 1/3
  • GSPLIT(Car type ∈ {family, truck}) = (2/3)·0 + (1/3)·0 = 0

35
Example (9)
  • The lowest value of GSPLIT is for Car type ∈
    {sport}, thus this is our split point. Decision
    tree after the second split of the example set:

[Age ≤ 27.5 -> Risk = High; Age > 27.5 -> Car type ∈ {sport} -> Risk = High, Car type ∈ {family, truck} -> Risk = Low]
36
Information Gain (1)
  • The information gain measure is used to select
    the test attribute at each node in the tree
  • The attribute with the highest information gain
    (or greatest entropy reduction) is chosen as the
    test attribute for the current node
  • This attribute minimizes the information needed
    to classify the samples in the resulting
    partitions

37
Information Gain (2)
  • Let S be a set consisting of s data samples.
    Suppose the class label attribute has m distinct
    values defining m classes, Ci (for i = 1, ..., m)
  • Let si be the number of samples of S in class Ci
  • The expected information needed to classify a
    given sample is given by
  • I(s1, s2, ..., sm) = - Σi pi log2(pi)
  • where pi is the probability that an arbitrary
    sample belongs to class Ci and is estimated by
    si/s.

38
Information Gain (3)
  • Let attribute A have v distinct values, a1, a2,
    ..., av. Attribute A can be used to partition S
    into S1, S2, ..., Sv, where Sj contains those
    samples in S that have value aj of A
  • If A were selected as the test attribute, then
    these subsets would correspond to the branches
    grown from the node containing the set S

39
Information Gain (4)
  • Let sij be the number of samples of class Ci in
    subset Sj. The entropy, or expected information
    based on the partitioning into subsets by A, is
    given by
  • E(A) = Σj [(s1j + s2j + ... + smj)/s] · I(s1j, s2j, ..., smj)
  • The smaller the entropy value, the greater the
    purity of the subset partitions.

40
Information Gain (5)
  • The term (s1j + s2j + ... + smj)/s acts as the
    weight of the jth subset and is the number of
    samples in the subset (i.e. having value aj of A)
    divided by the total number of samples in S. Note
    that for a given subset Sj,
  • I(s1j, s2j, ..., smj) = - Σi pij log2(pij)
  • where pij = sij/|Sj| is the probability that a
    sample in Sj belongs to class Ci

41
Information Gain (6)
  • The encoding information that would be gained by
    branching on A is
  • Gain(A) = I(s1, s2, ..., sm) - E(A)
  • Gain(A) is the expected reduction in entropy
    caused by knowing the value of attribute A
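The three formulas above map directly onto a few lines of Python (function and variable names are mine):

    import math

    def info(counts):
        # I(s1, ..., sm) = -sum_i p_i * log2(p_i), with p_i = s_i / s
        s = sum(counts)
        return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

    def entropy(subset_counts):
        # E(A) = sum_j (s_1j + ... + s_mj)/s * I(s_1j, ..., s_mj),
        # where subset_counts[j] holds the per-class counts of subset S_j
        s = sum(sum(c) for c in subset_counts)
        return sum((sum(c) / s) * info(c) for c in subset_counts)

    def gain(class_counts, subset_counts):
        # Gain(A) = I(s1, ..., sm) - E(A)
        return info(class_counts) - entropy(subset_counts)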

42
Example (1)
43
Example (2)
  • Let us consider the following training set of
    tuples taken from the customer database.
  • The class label attribute, buys_computer, has
    two distinct values (yes, no), therefore there
    are two classes (m = 2).
  • C1 corresponds to yes: s1 = 9
  • C2 corresponds to no: s2 = 5
  • I(s1, s2) = I(9, 5) = -9/14·log2(9/14) -
    5/14·log2(5/14) = 0.94

44
Example (3)
  • Next, we need to compute the entropy of each
    attribute. Let us start with the attribute age
  • for age < 30:
  • s11 = 2, s21 = 3, I(s11, s21) = 0.971
  • for age = 31..40:
  • s12 = 4, s22 = 0, I(s12, s22) = 0
  • for age > 40:
  • s13 = 3, s23 = 2, I(s13, s23) = 0.971

45
Example (4)
  • The entropy of age is
  • E(age) = 5/14·I(s11, s21) + 4/14·I(s12, s22) +
    5/14·I(s13, s23) = 0.694
  • The gain in information from such a partitioning
    would be
  • Gain(age) = I(s1, s2) - E(age) = 0.246

46
Example (5)
  • Similarly, we can compute
  • Gain(income) = 0.029,
  • Gain(student) = 0.151, and
    Gain(credit_rating) = 0.048
  • Since age has the highest information gain among
    the attributes, it is selected as the test
    attribute. A node is created and labeled with
    age, and branches are grown for each of the
    attribute's values.
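Plugging the per-branch counts for age into these formulas reproduces the numbers above; only Gain(age) can be checked here, since the full training table is not reproduced in this transcript (the small info() helper is repeated so the snippet runs on its own):

    import math

    def info(counts):
        s = sum(counts)
        return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

    overall = [9, 5]                          # 9 "yes", 5 "no"
    age_branches = [[2, 3], [4, 0], [3, 2]]   # age < 30, age = 31..40, age > 40

    i_total = info(overall)                                               # about 0.940
    e_age = sum((sum(b) / sum(overall)) * info(b) for b in age_branches)  # about 0.694
    print(round(i_total, 3), round(e_age, 3), round(i_total - e_age, 3))
    # Gain(age) = 0.247 here; the slide's 0.246 subtracts the rounded 0.94 and 0.694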

47
Example (6)
[Decision tree after the first split: age < 30 -> buys_computer ∈ {yes, no}; age = 31..40 -> buys_computer = yes; age > 40 -> buys_computer ∈ {yes, no}]
48
Example (7)
[Final decision tree: age < 30 -> student? (no -> no, yes -> yes); age = 31..40 -> yes; age > 40 -> credit_rating? (excellent -> no, fair -> yes)]
49
Entropy vs. Gini index
  • Entropy tends to find groups of classes that add
    up to 50% of the data
  • The Gini index tends to isolate the largest class
    from all other classes

Class distribution: A 40%, B 30%, C 20%, D 10%
[Entropy-style split (if age < 65): {A 40%, D 10%} vs. {B 30%, C 20%}]
[Gini-style split (if age < 40): {A 40%} vs. {B 30%, C 20%, D 10%}]
50
Tree pruning
  • When a decision tree is built, many of the
    branches will reflect anomalies in the training
    data due to noise or outliers.
  • Tree pruning methods typically use statistical
    measures to remove the least reliable branches,
    generally resulting in faster classification and
    an improvement in the ability of the tree to
    correctly classify independent test data

51
Tree pruning
  • Prepruning approach (stopping): a tree is pruned
    by halting its construction early (i.e. by
    deciding not to further split or partition the
    subset of training samples). Upon halting, the
    node becomes a leaf. The leaf holds the most
    frequent class among the subset samples
  • Postpruning approach (pruning): removes branches
    from a fully grown tree. A tree node is pruned by
    removing its branches. The lowest unpruned node
    becomes a leaf and is labeled with the most
    frequent class among its former branches

52
Extracting Classification Rules from Decision
Trees
  • The knowledge represented in decision trees can
    be extracted and represented in the form of
    classification IF-THEN rules.
  • One rule is created for each path from the root
    to a leaf node
  • Each attribute-value pair along a given path
    forms a conjunction in the rule antecedent; the
    leaf node holds the class prediction, forming the
    rule consequent

53
Extracting Classification Rules from Decision
Trees
  • The decision tree of Example (7) can be converted
    into the following classification rules:
  • IF age < 30 AND student = no
    THEN buys_computer = no
  • IF age < 30 AND student = yes
    THEN buys_computer = yes
  • IF age = 31..40
    THEN buys_computer = yes
  • IF age > 40 AND credit_rating = excellent
    THEN buys_computer = no
  • IF age > 40 AND credit_rating = fair
    THEN buys_computer = yes

54
Other Classification Methods
  • There are a number of classification methods in
    the literature
  • Bayesian classifiers
  • Neural-network classifiers
  • K-nearest neighbor classifiers
  • Association-based classifiers
  • Rough and fuzzy sets

55
Classification Based on Concepts from Association
Rule Mining
  • We may apply a quantitative rule mining approach
    to discover classification rules (associative
    classification)
  • It mines rules of the form condset ⇒ y, where
    condset is a set of items (or attribute-value
    pairs) and y is a class label.

56
Bayesian classifers
  • A Bayesian classifier is a statistical
    classifier. It can predict the probability that a
    given sample belongs to a particular class.
  • Bayesian classification is based on the Bayes
    theorem of a-posteriori probability.
  • Let X be a data sample whose class label is
    unknown. Each sample is represented by an
    n-dimensional vector, X = (x1, x2, ..., xn).
  • The classification problem may be formulated
    using a-posteriori probabilities as follows:
    determine P(C|X), the probability that the sample
    X belongs to a specified class C.
  • P(C|X) is the a-posteriori probability of C
    conditioned on X.

57
Bayesian classifers
  • Example
  • Given a set of samples describing credit
    applicants, P(Risk = low | Age = 38,
    Marital_Status = divorced, Income = low,
    Children = 2) is the probability that the credit
    applicant X = (38, divorced, low, 2) is a low
    credit risk.
  • The idea of Bayesian classification is to assign
    to a new unknown sample X the class label C such
    that P(C|X) is maximal.

58
Bayesian classifers
  • The main problem is how to estimate the
    a-posteriori probability P(C|X)
  • By the Bayes theorem:
  • P(C|X) = (P(X|C) · P(C)) / P(X),
  • where P(C) is the a-priori probability of C, that
    is, the probability that any given sample belongs
    to the class C, P(X|C) is the a-posteriori
    probability of X conditioned on C, and P(X) is
    the a-priori probability of X.
  • In our example, P(X|C) is the probability of
    X = (38, divorced, low, 2) given the class
    Risk = low, P(C) is the probability of the class
    C, and P(X) is the probability of the sample
    X = (38, divorced, low, 2).

59
Bayesian classifers
  • Suppose a training database D consists of n
    samples, and suppose the class label attribute
    has m distinct values defining m distinct classes
    C_i, for i = 1, ..., m.
  • Let s_i denote the number of samples of D in
    class C_i.
  • The Bayesian classifier assigns an unknown sample
    X to the class C_i that maximizes P(C_i|X). Since
    P(X) is constant for all classes, the class C_i
    for which P(C_i|X) is maximized is the class C_i
    for which P(X|C_i) · P(C_i) is maximized.
  • P(C_i) may be estimated by s_i/n (the relative
    frequency of the class C_i), or we may assume
    that all classes have the same probability:
    P(C_1) = P(C_2) = ... = P(C_m).

60
Bayesian classifers
  • The main problem is how to compute P(C_i|X)
  • Given a large dataset with many predictor
    attributes, it would be very expensive to compute
    P(C_i|X), therefore, to reduce the cost of
    computing P(C_i|X), the assumption of class
    conditional independence (in other words, the
    attribute independence assumption) is made.
  • The assumption states that there are no
    dependencies among predictor attributes, which
    leads to the following formula:
  • P(X|C_i) = ∏j=1..k P(x_j|C_i)

61
Bayesian classifers
  • The probabilities P(x_1|C_i), P(x_2|C_i), ...,
    P(x_k|C_i) can be estimated from the dataset:
  • if the j-th attribute is categorical, then
    P(x_j|C_i) is estimated as the relative frequency
    of samples of the class C_i having value x_j for
    the j-th attribute,
  • if the j-th attribute is continuous, then
    P(x_j|C_i) is estimated through a Gaussian
    density function.
  • Due to the class conditional independence
    assumption, the Bayesian classifier is also known
    as the naive Bayesian classifier.
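A compact naive Bayesian classifier for categorical attributes, following the last two slides (relative-frequency estimates under the class conditional independence assumption); the data and names are made up for illustration:

    from collections import Counter, defaultdict

    def train_naive_bayes(samples, labels):
        # estimate P(C_i) and P(x_j | C_i) by relative frequencies
        n = len(labels)
        prior = {c: cnt / n for c, cnt in Counter(labels).items()}
        cond = defaultdict(Counter)            # (class, attribute index) -> value counts
        for x, c in zip(samples, labels):
            for j, v in enumerate(x):
                cond[(c, j)][v] += 1
        return prior, cond

    def classify(x, prior, cond):
        # pick the class C_i maximizing P(C_i) * prod_j P(x_j | C_i)
        best, best_p = None, -1.0
        for c, p_c in prior.items():
            p = p_c
            for j, v in enumerate(x):
                counts = cond[(c, j)]
                p *= counts[v] / sum(counts.values())   # 0 if the value was never seen
                # (a continuous attribute would use a Gaussian density here instead)
            if p > best_p:
                best, best_p = c, p
        return best

    # toy credit-risk records: (marital status, income)
    X = [("divorced", "low"), ("married", "high"), ("single", "low"), ("married", "low")]
    y = ["high", "low", "high", "low"]
    prior, cond = train_naive_bayes(X, y)
    print(classify(("divorced", "low"), prior, cond))   # high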

62
Bayesian classifers
  • The assumption makes the computation feasible.
    Moreover, when the assumption is satisfied, the
    naive Bayesian classifier is optimal, that is, it
    is the most accurate classifier in comparison
    with all other classifiers.
  • However, the assumption is seldom satisfied in
    practice, since attributes are usually
    correlated.
  • Several attempts have been made to apply Bayesian
    analysis without assuming attribute independence.
    The resulting models are called Bayesian networks
    or Bayesian belief networks
  • Bayesian belief networks combine Bayesian
    analysis with causal relationships between
    attributes.

63
k-nearest neighbor classifiers
  • The nearest neighbor classifier belongs to
    instance-based learning methods.
  • Instance-based learning methods differ from the
    other classification methods discussed earlier in
    that they do not build a classifier until a new
    unknown sample needs to be classified.
  • Each training sample is described by an
    n-dimensional vector representing a point in an
    n-dimensional space called the pattern space.
    When a new unknown sample has to be classified, a
    distance function is used to determine the member
    of the training set which is closest to the
    unknown sample.

64
k-nearest neighbor classifiers
  • Once the nearest training sample is located in
    the pattern space, its class label is assigned to
    the unknown sample.
  • The main drawback of this approach is that it is
    very sensitive to noisy training samples.
  • The common solution to this problem is to adopt
    the k-nearest neighbor strategy.
  • When a new unknown sample has to be classified,
    the classifier searches the pattern space for the
    k training samples which are closest to the
    unknown sample. These k training samples are
    called the k "nearest neighbors" of the unknown
    sample, and the most common class label among the
    k "nearest neighbors" is assigned to the unknown
    sample.
  • To find the k "nearest neighbors" of the unknown
    sample, a multidimensional index is used (e.g. an
    R-tree, a Pyramid tree, etc.).

65
k-nearest neighbor classifiers
  • Two different issues need to be addressed
    regarding the k-nearest neighbor method
  • the distance function, and
  • the transformation from a sample to a point in
    the pattern space.
  • The first issue is to define the distance
    function. If the attributes are numeric, most
    k-nearest neighbor classifiers use Euclidean
    distance.
  • Instead of the Euclidean distance, we may also
    apply other distance metrics like Manhattan
    distance, maximum of dimensions, or Minkowski
    distance.
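A minimal k-nearest neighbor classifier with Euclidean distance and majority voting; for brevity it uses a linear scan instead of a multidimensional index, and the toy data is made up:

    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(query, points, labels, k=3):
        # find the k training samples closest to the query and take a majority vote
        ranked = sorted(zip(points, labels), key=lambda pl: euclidean(pl[0], query))
        neighbours = [label for _, label in ranked[:k]]
        return Counter(neighbours).most_common(1)[0][0]

    X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
    y = ["low", "low", "high", "high"]
    print(knn_classify((1.1, 0.9), X, y, k=3))   # low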

66
k-nearest neighbor classifiers
  • The second issue is how to transform a sample to
    a point in the pattern space.
  • Note that different attributes may have different
    scales and units, and different variability.
    Thus, if the distance metric is used directly,
    the effects of some attributes might be dominated
    by other attributes that have larger scale or
    higher variability.
  • A simple solution to this problem is to weight
    the various attributes. One common approach is to
    normalize all attribute values into the range
    [0, 1].

67
k-nearest neighbor classifiers
  • This solution is sensitive to the outliers
    problem since a single outlier could cause
    virtually all other values to be contained in a
    small subrange.
  • Another common approach is to apply a
    standarization transformation, such as
    subtracting the mean from the value of each
    attribute and then dividing by its standard
    deviation.
  • Recently, another approach was proposed which
    consists in applying the robust space
    transformation called Donoho-Stahel estimator
    the estimator has some important and useful
    properties that make the estimator very
    attractive for different data mining applications.
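The two transformations mentioned on the last two slides, min-max scaling into [0, 1] and mean/standard-deviation standardization, take only a few lines (plain Python, no libraries):

    import statistics

    def min_max(values):
        # scale attribute values into the range [0, 1]
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    def standardize(values):
        # subtract the mean, then divide by the standard deviation (z-scores)
        mu, sigma = statistics.mean(values), statistics.pstdev(values)
        return [(v - mu) / sigma if sigma else 0.0 for v in values]

    ages = [17, 20, 23, 32, 43, 68]
    print(min_max(ages))       # values in [0, 1]
    print(standardize(ages))   # values with mean 0 and unit standard deviation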

68
Classifier accuracy
  • The accuracy of a classifier on a given test set
    of samples is defined as the percentage of test
    samples correctly classified by the classifier,
    and it measures the overall performance of the
    classifier.
  • Note that the accuracy of the classifier is not
    estimated on the training dataset, since it would
    not be a good indicator of the future accuracy
    on new data.
  • The reason is that the classifier generated from
    the training dataset tends to overfit the
    training data, and any estimate of the
    classifier's accuracy based on that data will be
    overoptimistic.

69
Classifier accuracy
  • In other words, the classifier is more accurate
    on the data that was used to train it, but it
    will very likely be less accurate on an
    independent set of data.
  • To predict the accuracy of the classifier on new
    data, we need to assess its accuracy on an
    independent dataset that played no part in the
    formation of the classifier.
  • This dataset is called the test set
  • It is important to note that the test dataset
    should not be used in any way to build the
    classifier.

70
Classifier accuracy
  • There are several methods for estimating
    classifier accuracy. The choice of a method
    depends on the amount of sample data available
    for training and testing.
  • If there is a lot of sample data, then the
    following simple holdout method is usually
    applied.
  • The given set of samples is randomly partitioned
    into two independent sets, a training set and a
    test set (typically, 70% of the data is used for
    training, and the remaining 30% is used for
    testing)
  • Provided that both sets of samples are
    representative, the accuracy of the classifier on
    the test set will give a good indication of
    accuracy on new data.

71
Classifier accuracy
  • In general, it is difficult to say whether a
    given set of samples is representative or not,
    but at least we may ensure that the random
    sampling of the data set is done in such a way
    that the class distribution of samples in both
    training and test set is approximately the same
    as that in the initial data set.
  • This procedure is called stratification
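A sketch of the holdout method with the stratification just described: 70% of each class is sampled for training and the rest kept for testing (the function names and the data layout are assumptions, not a library API):

    import random
    from collections import defaultdict

    def stratified_holdout(samples, labels, train_fraction=0.7, seed=0):
        # random 70/30 split that keeps the class distribution roughly the
        # same in the training and test sets
        rng = random.Random(seed)
        by_class = defaultdict(list)
        for x, c in zip(samples, labels):
            by_class[c].append((x, c))
        train, test = [], []
        for items in by_class.values():
            rng.shuffle(items)
            cut = int(round(train_fraction * len(items)))
            train.extend(items[:cut])
            test.extend(items[cut:])
        rng.shuffle(train)
        rng.shuffle(test)
        return train, test

    def accuracy(classifier, test_set):
        # percentage of test samples the classifier labels correctly
        correct = sum(1 for x, c in test_set if classifier(x) == c)
        return 100.0 * correct / len(test_set)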

72
Testing a large dataset
[Diagram: the available examples are divided randomly into a training set (70%) used to develop one tree and a test set (30%) used to check its accuracy]
73
Classifier accuracy
  • If the amount of data for training and testing is
    limited, the problem is how to use this limited
    amount of data for training, to get a good
    classifier, and for testing, to obtain a correct
    estimate of the classifier's accuracy.
  • The standard and very common technique for
    measuring the accuracy of a classifier when the
    amount of data is limited is k-fold
    cross-validation
  • In k-fold cross-validation, the initial set of
    samples is randomly partitioned into k
    approximately equal, mutually exclusive subsets,
    called folds: S_1, S_2, ..., S_k.

74
Classifier accuracy
  • Training and testing are performed k times. At
    each iteration, one fold is used for testing
    while the remaining k-1 folds are used for
    training. So, at the end, each fold has been used
    exactly once for testing and k-1 times for
    training.
  • The accuracy estimate is the overall number of
    correct classifications from the k iterations
    divided by the total number of samples N in the
    initial dataset.
  • Often, the k-fold cross-validation technique is
    combined with stratification and is then called
    stratified k-fold cross-validation
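A sketch of k-fold cross-validation as described above; train_and_test stands for building a classifier on the k-1 training folds and counting correct predictions on the held-out fold, so it is passed in as a function (all names are illustrative):

    import random
    from collections import Counter

    def k_fold_accuracy(samples, labels, train_and_test, k=10, seed=0):
        # each fold is used exactly once for testing and k-1 times for training
        data = list(zip(samples, labels))
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]         # k roughly equal folds
        correct = 0
        for i in range(k):
            test = folds[i]
            train = [d for j, f in enumerate(folds) if j != i for d in f]
            correct += train_and_test(train, test)
        return 100.0 * correct / len(data)             # overall accuracy in percent

    def majority_rule(train, test):
        # toy stand-in: always predict the most frequent training class
        majority = Counter(c for _, c in train).most_common(1)[0][0]
        return sum(1 for _, c in test if c == majority)

    print(k_fold_accuracy(range(20), ["a"] * 14 + ["b"] * 6, majority_rule, k=5))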

75
Testing a small dataset - cross-validation
[Diagram: repeated 10 times - 90% of the available examples form the training set (used to develop 10 different trees) and the remaining 10% form the test set (used to check accuracy)]
76
Classifier accuracy
  • There are many other methods of estimating
    classifier accuracy on a particular dataset
  • Two popular methods are leave-one-out
    cross-validation and bootstrapping
  • Leave-one-out cross-validation is simply N-fold
    cross-validation, where N is the number of
    samples in the initial dataset
  • At each iteration, a single sample from the
    dataset is left out for testing, and the
    remaining samples are used for training. The
    result of each test is either success or failure.
  • The results of all N evaluations, one for each
    sample from the dataset, are averaged, and that
    average represents the final accuracy estimate.

77
Classifier accuracy
  • Bootstrapping is based on sampling with
    replacement
  • The initial dataset is sampled N times, where N
    is the total number of samples in the dataset,
    with replacement, to form a training set of N
    samples.
  • Since some samples in this new "set" will be
    repeated, some samples from the initial dataset
    will not appear in the training set; these
    samples form the test set.
  • Both estimation methods are interesting
    especially for estimating classifier accuracy on
    small datasets. In practice, the standard and
    most popular technique for estimating a
    classifier's accuracy is stratified tenfold
    cross-validation
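A sketch of the bootstrap split: the dataset is sampled N times with replacement to form the training set, and the samples that were never drawn become the test set (names are illustrative):

    import random

    def bootstrap_split(samples, labels, seed=0):
        rng = random.Random(seed)
        n = len(samples)
        drawn = [rng.randrange(n) for _ in range(n)]    # N draws with replacement
        train = [(samples[i], labels[i]) for i in drawn]
        unused = sorted(set(range(n)) - set(drawn))     # on average about 36.8% of the data
        test = [(samples[i], labels[i]) for i in unused]
        return train, test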

78
Requirements
  • Focus on mega-induction
  • Handle both continuous and categorical data
  • No restriction on
  • number of examples
  • number of attributes
  • number of classes

79
Applications
  • Treatment effectiveness
  • Credit Approval
  • Store location
  • Target marketing
  • Insurance company (fraud detection)
  • Telecommunication company (client classification)