Classification: Definition

1
Classification: Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes; one of
    the attributes is the class.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with the training set
    used to build the model and the test set used to
    validate it.

2
Classification Process (1): Model Construction
[Figure: training data fed into a classification algorithm to learn the model]
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
3
Classification Process (2): Use the Model in
Prediction
[Figure: the model applied to unseen test data, e.g., (Jeff, Professor, 4) -> Tenured?]
4
Classification by Decision Tree Induction
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
    distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree: classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree

5
Example of a Decision Tree
[Figure: training data table and the decision tree model built from it;
internal nodes are the splitting attributes]
Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES
           Married -> NO
6
Another Example of Decision Tree
[Same training data: categorical Refund and MarSt, continuous TaxInc, class Cheat]
MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES
There could be more than one tree that fits the
same data!
7
Apply Model to Test Data
[Figure: the test record alongside the decision tree]
Start from the root of tree.
8
Apply Model to Test Data
[Figure: the root test (Refund) is applied to the test record]
9
Apply Model to Test Data
[Figure: the record follows the branch matching its Refund value]
10
Apply Model to Test Data
[Figure: the next test on the path is applied to the record]
11
Apply Model to Test Data
[Figure: the record moves one branch further down the tree]
12
Apply Model to Test Data
[Figure: the record reaches a leaf]
Assign Cheat to No
13
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning; majority voting is employed for
    classifying the leaf
  • There are no samples left

14
General Structure of Hunt's Algorithm
  • Let Dt be the set of training records that reach
    a node t
  • General Procedure
  • If Dt contains records that all belong to the same
    class yt, then t is a leaf node labeled as yt
  • If Dt is an empty set, then t is a leaf node
    labeled by the default class, yd
  • If Dt contains records that belong to more than
    one class, use an attribute test to split the
    data into smaller subsets. Recursively apply the
    procedure to each subset.

[Figure: node t holding record set Dt, about to be split]
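A minimal recursive sketch of this procedure in Python; the dict-based record
and tree representation and the choose_split placeholder are illustrative
assumptions, not part of the original slides:

def choose_split(records, attributes):
    # Placeholder heuristic: pick the first remaining attribute. A real
    # implementation would pick, e.g., the attribute with highest gain.
    return attributes[0]

def hunt(records, attributes, default_class):
    # Dt is empty: leaf labeled with the default class yd
    if not records:
        return {'leaf': default_class}
    classes = [r['class'] for r in records]
    # all records in Dt belong to the same class yt: leaf labeled yt
    if len(set(classes)) == 1:
        return {'leaf': classes[0]}
    majority = max(set(classes), key=classes.count)
    # no attributes left to test: fall back to the majority class
    if not attributes:
        return {'leaf': majority}
    attr = choose_split(records, attributes)
    remaining = [a for a in attributes if a != attr]
    tree = {'test': attr, 'branches': {}}
    # one branch per outcome of the attribute test; recurse on each subset
    for value in set(r[attr] for r in records):
        subset = [r for r in records if r[attr] == value]
        tree['branches'][value] = hunt(subset, remaining, majority)
    return tree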
15
Hunt's Algorithm
[Figure: step-by-step growth of the tree on the training data, starting from a
single node that predicts the majority class "Don't Cheat"]
16
Tree Induction
  • Greedy strategy.
  • Split the records based on an attribute test that
    optimizes a certain criterion.
  • Issues
  • Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
  • Determine when to stop splitting

17
How to Specify Test Condition?
  • Depends on attribute types
  • Nominal
  • Ordinal
  • Continuous
  • Depends on number of ways to split
  • 2-way split
  • Multi-way split

18
Splitting Based on Nominal Attributes
  • Multi-way split: use as many partitions as
    distinct values.
  • Binary split: divides values into two subsets;
    need to find optimal partitioning.

19
Splitting Based on Continuous Attributes
  • Different ways of handling
  • Discretization to form an ordinal categorical
    attribute
  • Static: discretize once at the beginning
  • Dynamic: ranges can be found by equal interval
    bucketing, equal frequency bucketing
    (percentiles), or clustering
  • Binary decision: (A < v) or (A >= v)
  • consider all possible splits and find the best
    cut
  • can be more compute intensive

20
Splitting Based on Continuous Attributes
[Figure: a binary split (A < v vs. A >= v) contrasted with a multi-way split into ranges]
21
Information Gain (ID3/C4.5)
  • Select the attribute with the highest information
    gain
  • Assume there are two classes, P and N
  • Let the set of examples S contain p elements of
    class P and n elements of class N
  • The amount of information needed to decide if an
    arbitrary example in S belongs to P or N is
    defined as
    I(p, n) = -(p/(p+n)) log(p/(p+n)) - (n/(p+n)) log(n/(p+n))

22
Refund Attribute Information Gain
  • Class C (Cheat) contains 3 tuples
  • Class NC (Don't Cheat) contains 7 tuples
  • I(C, NC) = -3/10 log(3/10) - 7/10 log(7/10) = .2653
    (base-10 logs are used in this and the next slides)
  • Check attribute Refund: the Yes value contains 3 NC
    and 0 C; the No value contains 4 NC and 3 C
  • For Yes: -3/3 log(3/3) - 0/3 log(0/3) = 0
  • For No: 4/7 log(7/4) + 3/7 log(7/3) = .2966
  • E(Refund) = 3/10 × 0 + 7/10 × .2966 = .2076
  • Gain = I(C, NC) - E(Refund) = .0577

23
Marital Status Information Gain
  • Check attribute MS: the Single, Divorced value
    contains 3 NC and 3 C; the Married value contains
    4 NC and 0 C
  • For Single, Divorced: 3/6 log(6/3) + 3/6 log(6/3) = .30
  • For Married: -4/4 log(4/4) - 0/4 log(0/4) = 0
  • E(MS) = 6/10 × .30 + 4/10 × 0 = .18
  • Gain = I(C, NC) - E(MS) = .0853

24
Taxable Income (TI) Information Gain
  • Suppose that taxable income is discretized into
    (0, 75], (75, 100], (100, 1000000]
  • First segment contains 3 NC, 0 C
  • Second segment contains 1 NC, 3 C
  • Third segment contains 3 NC, 0 C
  • For 1st segment: -3/3 log(3/3) - 0/3 log(0/3) = 0
  • For 2nd segment: 1/4 log(4/1) + 3/4 log(4/3) = .2442
  • For 3rd segment we also obtain 0
  • E(TI) = 3/10 × 0 + 4/10 × .2442 + 3/10 × 0 = .0977
  • Gain = I(C, NC) - E(TI) = .1676

25
Information Gain in Decision Tree Induction
  • Assume that using attribute A a set S will be
    partitioned into sets S1, S2, ..., Sv
  • If Si contains pi examples of P and ni examples
    of N, the entropy, or the expected information
    needed to classify objects in all subtrees Si, is
    E(A) = SUM over i of (pi + ni)/(p + n) × I(pi, ni)
  • The encoding information that would be gained by
    branching on A is
    Gain(A) = I(p, n) - E(A)

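As a concrete check of these formulas, here is a small Python sketch (the
helper names are mine) that reproduces the Refund numbers of slide 22, which
use base-10 logarithms:

import math

def entropy(p, n, base=10):
    # I(p, n): information needed to decide class P vs. N for a set with
    # p examples of P and n of N (the worked slides use base-10 logs).
    total = p + n
    return -sum(c / total * math.log(c / total, base)
                for c in (p, n) if c > 0)

def expected_info(partitions, base=10):
    # E(A) = sum_i (p_i + n_i)/(p + n) * I(p_i, n_i) over the subsets
    # produced by splitting on attribute A.
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * entropy(p, n, base)
               for p, n in partitions)

# Refund: the Yes branch holds 0 C / 3 NC, the No branch 3 C / 4 NC.
i_cnc = entropy(3, 7)                        # I(C, NC)  = 0.2653
e_ref = expected_info([(0, 3), (3, 4)])      # E(Refund) = 0.2076
print(i_cnc - e_ref)                         # Gain      = 0.0577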
26
RID | Age    | Income | Student | Credit    | Class: buys_computer
  1 | <=30   | high   | no      | fair      | no
  2 | <=30   | high   | no      | excellent | no
  3 | 31..40 | high   | no      | fair      | yes
  4 | >40    | medium | no      | fair      | yes
  5 | >40    | low    | yes     | fair      | yes
  6 | >40    | low    | yes     | excellent | no
  7 | 31..40 | low    | yes     | excellent | yes
  8 | <=30   | medium | no      | fair      | no
  9 | <=30   | low    | yes     | fair      | yes
 10 | >40    | medium | yes     | fair      | yes
 11 | <=30   | medium | yes     | excellent | yes
 12 | 31..40 | medium | no      | excellent | yes
 13 | 31..40 | high   | yes     | fair      | yes
 14 | >40    | medium | no      | excellent | no
27
Attribute Selection by Information Gain
Computation
  • Class P: buys_computer = "yes"
  • Class N: buys_computer = "no"
  • I(p, n) = I(9, 5) = 0.940   (base-2 logs)
  • Compute the entropy for age:
    E(age) = 5/14 I(2,3) + 4/14 I(4,0) + 5/14 I(3,2) = 0.694
  • Hence Gain(age) = I(p, n) - E(age) = 0.246
  • Similarly, Gain(income) = 0.029,
    Gain(student) = 0.151, Gain(credit_rating) = 0.048

28
Extracting Classification Rules from Trees
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction
  • The leaf node holds the class prediction
  • Rules are easier for humans to understand
  • Example
  • IF age = '<=30' AND student = 'no' THEN
    buys_computer = 'no'
  • IF age = '<=30' AND student = 'yes' THEN
    buys_computer = 'yes'
  • IF age = '31..40' THEN buys_computer = 'yes'
  • IF age = '>40' AND credit_rating = 'excellent'
    THEN buys_computer = 'no'
  • IF age = '>40' AND credit_rating = 'fair' THEN
    buys_computer = 'yes'

29
Stopping Criteria for Tree Induction
  • Stop expanding a node when all the records belong
    to the same class
  • Stop expanding a node when all the records have
    similar attribute values
  • Early termination (to be discussed later)

30
Table to be classified

Name    | BodyTemp | SkinCover | GivesBirth | Aquatic | Aerial | HasLegs | Hibernates | Class
Human   | warm     | hair      | yes        | no      | no     | yes     | no         | mammal
Python  | cold     | scales    | no         | no      | no     | no      | yes        | reptile
Salmon  | cold     | scales    | no         | yes     | no     | no      | no         | fish
Whale   | warm     | hair      | yes        | yes     | no     | no      | no         | mammal
Frog    | cold     | none      | no         | semi    | no     | yes     | yes        | amphibian
Komodo  | cold     | scales    | no         | no      | no     | yes     | no         | reptile
Eel     | cold     | scales    | no         | yes     | no     | no      | no         | fish
Bat     | warm     | hair      | yes        | no      | yes    | yes     | yes        | mammal
Pigeon  | warm     | feathers  | no         | no      | yes    | yes     | no         | bird
Cat     | warm     | fur       | yes        | no      | no     | yes     | no         | mammal
Leopard | warm     | fur       | yes        | no      | no     | yes     | no         | mammal
Shark   | cold     | scales    | yes        | yes     | no     | no      | no         | fish
Turtle  | cold     | scales    | no         | semi    | no     | yes     | no         | reptile
Penguin | warm     | feathers  | no         | semi    | no     | yes     | no         | bird
31
Decision Tree Based Classification
  • Advantages
  • Inexpensive to construct
  • Extremely fast at classifying unknown records
  • Easy to interpret for small-sized trees
  • Accuracy is comparable to other classification
    techniques for many simple data sets

32
Practical Issues of Classification
  • Underfitting and Overfitting
  • Missing Values
  • Costs of Classification

33
Underfitting and Overfitting (Example)
[Figure: 500 circular and 500 triangular data points.
Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5]
34
Underfitting and Overfitting
[Figure: training vs. test error as model complexity grows; overfitting appears
where test error rises while training error keeps falling]
Underfitting: when the model is too simple, both
training and test errors are large
35
Overfitting due to Noise
Decision boundary is distorted by noise point
36
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the
diagram makes it difficult to correctly predict
the class labels in that region: an insufficient
number of training records in the region causes
the decision tree to predict the test examples
using other training records that are irrelevant
to the classification task
37
Notes on Overfitting
  • Overfitting results in decision trees that are
    more complex than necessary
  • Training error no longer provides a good estimate
    of how well the tree will perform on previously
    unseen records
  • Need new ways for estimating errors

38
Minimum Description Length (MDL)
  • Cost(Model, Data) = Cost(Data|Model) + Cost(Model)
  • Cost is the number of bits needed for encoding.
  • Search for the least costly model.
  • Cost(Data|Model) encodes the misclassification
    errors.
  • Cost(Model) uses node encoding (number of
    children) plus splitting condition encoding.

39
Metrics for Performance Evaluation
  • Focus on the predictive capability of a model
  • Rather than how fast it classifies or builds
    models, scalability, etc.
  • Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d

a: TP (true positive), b: FN (false negative),
c: FP (false positive), d: TN (true negative)
40
Metrics for Performance Evaluation

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)

  • Most widely-used metric:
    Accuracy = (a + d) / (a + b + c + d)
             = (TP + TN) / (TP + TN + FP + FN)
41
Limitation of Accuracy
  • Consider a 2-class problem
  • Number of Class 0 examples: 9990
  • Number of Class 1 examples: 10
  • If the model predicts everything to be class 0,
    accuracy is 9990/10000 = 99.9%
  • Accuracy is misleading because the model does not
    detect any class 1 example

42
Cost Matrix

                     PREDICTED CLASS
         C(i|j)      Class=Yes   Class=No
ACTUAL   Class=Yes   C(Yes|Yes)  C(No|Yes)
CLASS    Class=No    C(Yes|No)   C(No|No)

C(i|j): cost of misclassifying a class j example as
class i
43
Computing Cost of Classification

Cost Matrix:
                     PREDICTED CLASS
         C(i|j)      +      -
ACTUAL   +           -1     100
CLASS    -           1      0

Model M1:            PREDICTED CLASS
                     +      -
ACTUAL   +           150    40
CLASS    -           60     250
Accuracy = 80%, Cost = 3910

Model M2:            PREDICTED CLASS
                     +      -
ACTUAL   +           250    45
CLASS    -           5      200
Accuracy = 90%, Cost = 4255
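A short Python sketch (dict layout is mine) that reproduces these cost figures
from the two confusion matrices:

def total_cost(confusion, cost):
    # Sum count(actual, predicted) * C(predicted|actual) over all cells.
    return sum(confusion[a][p] * cost[a][p]
               for a in confusion for p in confusion[a])

cost = {'+': {'+': -1, '-': 100}, '-': {'+': 1, '-': 0}}
m1 = {'+': {'+': 150, '-': 40}, '-': {'+': 60, '-': 250}}
m2 = {'+': {'+': 250, '-': 45}, '-': {'+': 5, '-': 200}}
print(total_cost(m1, cost))   # 3910
print(total_cost(m2, cost))   # 4255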
44
Cost vs. Accuracy

Count:               PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d

Cost:                PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   p          q
CLASS    Class=No    q          p
45
Cost-Sensitive Measures
  • Precision p = a / (a + c); biased towards
    C(Yes|Yes) and C(Yes|No)
  • Recall r = a / (a + b); biased towards
    C(Yes|Yes) and C(No|Yes)
  • F-measure F = 2rp / (r + p) = 2a / (2a + b + c);
    biased towards all except C(No|No)

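A small Python sketch of these measures (function name mine, with a guard for
empty denominators); on the skewed example of the earlier slide, accuracy looks
excellent while recall exposes the failure:

def metrics(a, b, c, d):
    # a = TP, b = FN, c = FP, d = TN, as in the confusion matrix above.
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c) if a + c else 0.0   # C(Yes|Yes), C(Yes|No)
    recall = a / (a + b) if a + b else 0.0      # C(Yes|Yes), C(No|Yes)
    f_measure = 2 * a / (2 * a + b + c)         # harmonic mean of p and r
    return accuracy, precision, recall, f_measure

print(metrics(0, 10, 0, 9990))   # accuracy 0.999, but recall 0.0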
46
Bayesian Classification: Why?
  • Probabilistic learning: calculate explicit
    probabilities for hypotheses; among the most
    practical approaches to certain types of learning
    problems
  • Incremental: each training example can
    incrementally increase/decrease the probability
    that a hypothesis is correct. Prior knowledge
    can be combined with observed data.
  • Probabilistic prediction: predict multiple
    hypotheses, weighted by their probabilities
  • Standard: even when Bayesian methods are
    computationally intractable, they can provide a
    standard of optimal decision making against which
    other methods can be measured

47
Naive Bayesian Classification Example
  • Discretization of Height is as follows:
  • (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9],
    (1.9, 2.0], (2.0, ∞)
  • P(Short) = 4/15 = .267
  • P(Medium) = 8/15 = .533
  • P(Tall) = 3/15 = .2
  • P(M|Short) = 1/4 = .25
  • P(M|Medium) = 2/8 = .25
  • P(M|Tall) = 3/3 = 1.0
  • P(F|Short) = 3/4 = .75
  • P(F|Medium) = 6/8 = .75
  • P(F|Tall) = 0/3 = 0.0
  • P((0,1.6]|Short) = 2/4 = .5
  • P((1.6,1.7]|Short) = 2/4 = .5
  • P((1.9,2.0]|Tall) = 1/3 = .333
  • P((2.0,∞)|Tall) = 2/3 = .667
  • P((1.7,1.8]|Medium) = 3/8 = .375
  • P((1.8,1.9]|Medium) = 4/8 = .5
  • P((1.9,2.0]|Medium) = 1/8 = .125

Id | Gender | Height | Class
 1 | F      | 1.6    | Short
 2 | M      | 2.0    | Tall
 3 | F      | 1.9    | Medium
 4 | F      | 1.88   | Medium
 5 | F      | 1.7    | Short
 6 | M      | 1.85   | Medium
 7 | F      | 1.6    | Short
 8 | M      | 1.7    | Short
 9 | M      | 2.2    | Tall
10 | M      | 2.1    | Tall
11 | F      | 1.8    | Medium
12 | M      | 1.95   | Medium
13 | F      | 1.9    | Medium
14 | F      | 1.8    | Medium
15 | F      | 1.75   | Medium
48
Naive Bayesian Classification Example
(Training data: the same Gender/Height table as on the previous slide.)
  • Consider tuple t = <16, M, 1.95>
  • Bayesian rule:
    P(Short|t) = P(t|Short) P(Short) / P(t)
  • P(t) is a constant for any class.
  • P(t|Short) = P(M|Short) × P((1.9,2.0]|Short)
    = .25 × 0 = 0
  • P(Short) = .267, so P(Short|t) = 0/P(t)
  • P(t|Medium)
    = P(M|Medium) × P((1.9,2.0]|Medium)
    = .25 × .125 = .031
  • P(Medium|t) = .031 × .533/P(t) = .016/P(t)
  • Similarly, P(Tall|t) = .333 × .2/P(t) = .066/P(t)
  • Thus, t is Tall

49
Estimating A-Posteriori Probabilities
  • Bayes theorem:
    P(C|X) = P(X|C) P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative frequency of class C samples
  • C such that P(C|X) is maximum = C such that
    P(X|C) P(C) is maximum
  • Problem: computing P(X|C) is infeasible!

50
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
    P(x1, ..., xk|C) = P(x1|C) × ... × P(xk|C)
  • If the i-th attribute is categorical, P(xi|C) is
    estimated as the relative frequency of samples
    having value xi as the i-th attribute in class C
  • If the i-th attribute is continuous, P(xi|C) is
    typically estimated through a Gaussian density
    function
  • Computationally easy in both cases

51
Play-tennis example: estimating P(xi|C)

Outlook:
P(sunny|p) = 2/9      P(sunny|n) = 3/5
P(overcast|p) = 4/9   P(overcast|n) = 0
P(rain|p) = 3/9       P(rain|n) = 2/5
Temperature:
P(hot|p) = 2/9        P(hot|n) = 2/5
P(mild|p) = 4/9       P(mild|n) = 2/5
P(cool|p) = 3/9       P(cool|n) = 1/5
Humidity:
P(high|p) = 3/9       P(high|n) = 4/5
P(normal|p) = 6/9     P(normal|n) = 2/5
Windy:
P(true|p) = 3/9       P(true|n) = 3/5
P(false|p) = 6/9      P(false|n) = 2/5

P(p) = 9/14
P(n) = 5/14
52
Play-tennis example: classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p)
    = 3/9 × 2/9 × 3/9 × 6/9 × 9/14 = 0.010582
  • P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n)
    = 2/5 × 2/5 × 4/5 × 2/5 × 5/14 = 0.018286
  • Sample X is classified in class n (don't play)

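The same computation as a Python sketch; the table of estimates is copied from
the previous slide and the variable names are illustrative:

# Class-conditional estimates P(xi|class), taken from the tables above.
cond = {
    'p': {'rain': 3/9, 'hot': 2/9, 'high': 3/9, 'false': 6/9},
    'n': {'rain': 2/5, 'hot': 2/5, 'high': 4/5, 'false': 2/5},
}
prior = {'p': 9/14, 'n': 5/14}

def score(x, c):
    # P(X|C) P(C) under the naive independence assumption
    result = prior[c]
    for value in x:
        result *= cond[c][value]
    return result

x = ('rain', 'hot', 'high', 'false')
print(score(x, 'p'))   # 0.010582...
print(score(x, 'n'))   # 0.018286... -> classify X as n (don't play)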
53
Instance-Based Methods
  • Instance-based learning
  • Store training examples and delay the processing
    (lazy evaluation) until a new instance must be
    classified
  • Typical approaches
  • k-nearest neighbor approach
  • Instances represented as points in a Euclidean
    space.
  • Locally weighted regression
  • Constructs local approximation
  • Case-based reasoning
  • Uses symbolic representations and knowledge-based
    inference

54
Other Classification Methods
  • k-nearest neighbor classifier
  • case-based reasoning
  • Rough set approach
  • Fuzzy set approaches

55
The k-Nearest Neighbor Algorithm
  • All instances correspond to points in the n-D
    space.
  • The nearest neighbors are defined in terms of
    Euclidean distance.
  • The target function could be discrete- or real-
    valued.
  • For discrete-valued functions, k-NN returns the
    most common value among the k training examples
    nearest to xq.

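A minimal Python sketch of the discrete-valued case; the toy 2-D training
points are invented for illustration:

import math
from collections import Counter

def knn_classify(query, training, k=3):
    # training is a list of (point, label) pairs; return the most common
    # label among the k examples nearest to the query in Euclidean distance.
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), 'A'), ((1, 2), 'A'), ((5, 5), 'B'), ((6, 5), 'B')]
print(knn_classify((2, 1), training, k=3))   # 'A'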
56
Discussion on the k-NN Algorithm
  • The k-NN algorithm for continuous-valued target
    functions
  • Calculate the mean values of the k nearest
    neighbors
  • Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k
    neighbors according to their distance to the
    query point xq giving greater weight to closer
    neighbors
  • Robust to noisy data by averaging k-nearest
    neighbors
  • Curse of dimensionality: the distance between
    neighbors can be dominated by irrelevant
    attributes.
  • To overcome it, stretch the axes or eliminate
    the least relevant attributes.

57
Remarks on Lazy vs. Eager Learning
  • Instance-based learning: lazy evaluation
  • Decision-tree and Bayesian classification: eager
    evaluation
  • Key differences
  • Lazy methods may consider the query instance xq
    when deciding how to generalize beyond the
    training data D
  • Eager methods cannot, since they have already
    chosen a global approximation by the time they
    see the query
  • Efficiency: lazy methods spend less time training
    but more time predicting
  • Accuracy
  • Lazy methods effectively use a richer hypothesis
    space, since they use many local linear functions
    to form an implicit global approximation to the
    target function
  • Eager methods must commit to a single hypothesis
    that covers the entire instance space

58
Rough Set Approach
  • Rough sets are used to approximately or roughly
    define equivalence classes
  • A rough set for a given class C is approximated
    by two sets a lower approximation (certain to be
    in C) and an upper approximation (cannot be
    described as not belonging to C)
  • Finding the minimal subsets (reducts) of
    attributes (for feature reduction) is NP-hard but
    a discernibility matrix is used to reduce the
    computation intensity

59
Fuzzy Set Approaches
  • Fuzzy logic uses truth values between 0.0 and 1.0
    to represent the degree of membership (such as
    using fuzzy membership graph)
  • Attribute values are converted to fuzzy values
  • e.g., income is mapped into the discrete
    categories low, medium, high with fuzzy values
    calculated
  • For a given new sample, more than one fuzzy value
    may apply
  • Each applicable rule contributes a vote for
    membership in the categories
  • Typically, the truth values for each predicted
    category are summed

60
What Is Prediction?
  • Prediction is similar to classification
  • First, construct a model
  • Second, use model to predict unknown value
  • Major method for prediction is regression
  • Linear and multiple regression
  • Non-linear regression
  • Prediction is different from classification
  • Classification predicts categorical class
    labels
  • Prediction models continuous-valued functions

61
Predictive Modeling in Databases
  • Predictive modeling: predict data values or
    construct generalized linear models based on
    the database data.
  • One can only predict value ranges or category
    distributions
  • Method outline
  • Minimal generalization
  • Attribute relevance analysis
  • Generalized linear model construction
  • Prediction
  • Determine the major factors which influence the
    prediction
  • Data relevance analysis: uncertainty measurement,
    entropy analysis, expert judgement, etc.
  • Multi-level prediction: drill-down and roll-up
    analysis

62
Linear Regression for Prediction
  • Given a set of tuples (x1, y1), (x2, y2), ...,
    (xs, ys)
  • Assume that Y = A + BX
  • B = SUM (xi - E(X))(yi - E(Y)) / SUM (xi - E(X))^2
  • A = E(Y) - B E(X)

63
Linear Regression for Prediction

X (Years of Experience) | Y (Salary)
 3 | 30
 8 | 57
 9 | 64
13 | 72
 3 | 36
 6 | 43
11 | 59
21 | 90
 1 | 20
16 | 83

  • E(X) = 9.1, E(Y) = 55.4
  • B = [(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4)
    + ...] / [(3 - 9.1)^2 + ...] = 3.5
  • A = 55.4 - (3.5 × 9.1) = 23.6
  • Y = 23.6 + 3.5 X

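A short Python sketch of this fit on the same data; it yields B ≈ 3.54 and
A ≈ 23.2, which the slide rounds to B = 3.5 (and hence A = 23.6):

# Least-squares fit: B = S_xy / S_xx, A = E(Y) - B * E(X).
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

ex = sum(xs) / len(xs)    # E(X) = 9.1
ey = sum(ys) / len(ys)    # E(Y) = 55.4
sxy = sum((x - ex) * (y - ey) for x, y in zip(xs, ys))
sxx = sum((x - ex) ** 2 for x in xs)
b = sxy / sxx             # ~3.54
a = ey - b * ex           # ~23.2
print(f"Y = {a:.1f} + {b:.1f} X")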
64
Linear Regression Using a Minimal-Error Approach
  • Y = A + BX + e, where e is the error when Y is
    replaced by A + BX
  • L = SUM e^2 = SUM (y - A - Bx)^2
  • Take the derivative with respect to A and B
    respectively:
  • dL/dA = -2 SUM yi + 2NA + 2B SUM xi = 0
  • dL/dB = -2 SUM (yi - A - Bxi)(xi) = 0
  • A = (SUM yi - B SUM xi)/N
  • B = (SUM (xi yi) - (SUM xi SUM yi)/N) /
    (SUM (xi^2) - (SUM xi)^2/N)

65
Linear regression using minimal error
(Same X/Y experience-and-salary data as on the previous slide.)
  • A = 17.91, B = 4.12
  • Y = 17.91 + 4.12 X

66
Regression Analysis and Log-Linear Models in
Prediction
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into
    the above.
  • Log-linear models
  • The multi-way table of joint probabilities is
    approximated by a product of lower-order tables.
  • Probability: p(a, b, c, d) = αab βac γad δbcd

67
Locally Weighted Regression
  • Construct an explicit approximation to f over a
    local region surrounding query instance xq.
  • Locally weighted linear regression
  • The target function f is approximated near xq
    using a linear function
  • Minimize the squared error with a
    distance-decreasing weight K
  • Use the gradient descent training rule
  • In most cases, the target function is
    approximated by a constant, linear, or quadratic
    function.

68
Prediction: Numerical Data
[Figure: regression fit to numerical data]
69
Prediction: Categorical Data
[Figure: prediction of a categorical attribute]
70
Classification Accuracy: Estimating Error Rates
  • Partition: training-and-testing
  • Use two independent data sets, e.g., training set
    (2/3), test set (1/3)
  • Used for data sets with a large number of samples
  • Cross-validation
  • Divide the data set into k subsamples
  • Use k-1 subsamples as training data and one
    subsample as test data: k-fold cross-validation
  • For data sets of moderate size
  • Bootstrapping (leave-one-out)
  • For small data sets

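A sketch of the k-fold split in Python (index-based; the function name is
mine):

import random

def k_fold_indices(n, k, seed=0):
    # Shuffle the n sample indices, slice them into k folds, and let each
    # fold serve once as the test set while the rest form the training set.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With k = n this degenerates to leave-one-out, suitable for small data sets.
for train, test in k_fold_indices(10, 5):
    pass  # fit the classifier on train, evaluate on test, average the k errors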
71
Boosting and Bagging
  • Boosting increases classification accuracy
  • Applicable to decision trees or Bayesian
    classifiers
  • Learn a series of classifiers, where each
    classifier in the series pays more attention to
    the examples misclassified by its predecessor
  • Boosting requires only linear time and constant
    space

72
Boosting Technique (II): Algorithm
  • Assign every example an equal weight 1/N
  • For t = 1, 2, ..., T do
  • Obtain a hypothesis (classifier) h(t) under w(t)
  • Calculate the error of h(t) and re-weight the
    examples based on the error
  • Normalize w(t+1) to sum to 1
  • Output a weighted sum of all the hypotheses, with
    each hypothesis weighted according to its
    accuracy on the training set

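A Python sketch of this loop. The slide leaves the re-weighting unspecified;
the alpha-based update below follows AdaBoost, one standard instantiation, and
the learn callback (which returns a predict function) is an assumption:

import math

def boost(examples, labels, learn, T):
    # 'learn(examples, labels, w)' must fit a weak classifier under the
    # weight vector w and return it as a callable predict(x) -> label.
    n = len(examples)
    w = [1.0 / n] * n                    # every example starts with weight 1/N
    ensemble = []
    for _ in range(T):
        h = learn(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:       # perfect or no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)   # vote weight from accuracy
        # Increase weights of misclassified examples, decrease the rest,
        # then normalize w to sum to 1.
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, h))
    def predict(x):                      # weighted vote; labels in {-1, +1}
        s = sum(a * h(x) for a, h in ensemble)
        return 1 if s >= 0 else -1
    return predict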
73
Summary
  • Classification is an extensively studied problem
    (mainly in statistics, machine learning, and
    neural networks)
  • Classification is probably one of the most widely
    used data mining techniques with a lot of
    extensions
  • Scalability is still an important issue for
    database applications thus combining
    classification with database techniques should be
    a promising topic
  • Research directions: classification of
    non-relational data, e.g., text, spatial, and
    multimedia data