Title: Classification: Definition
1. Classification: Definition
- Given a collection of records (the training set).
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it, as sketched below.
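As a concrete illustration of the train/test split described above, here is a minimal Python sketch; the record format, the 1/3 test fraction, and the fixed seed are illustrative assumptions, not part of the slides.

```python
import random

def train_test_split(records, test_fraction=1/3, seed=42):
    """Randomly partition a list of records into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

# Toy usage: each record is (attributes..., class label)
data = [("professor", 7, "yes"), ("assistant", 3, "no"), ("associate", 6, "no")]
train, test = train_test_split(data)
print(len(train), len(test))
```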
2. Classification Process (1): Model Construction
- A classification algorithm is applied to the training data to build the model (classifier).
- Example of a learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
3. Classification Process (2): Use the Model in Prediction
- The model is applied to unseen data to predict the class, e.g., for the record (Jeff, Professor, 4): Tenured?
4. Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
5. Example of a Decision Tree
- The slide shows the training data and a decision tree model built from it. Splitting attributes: Refund, MarSt, TaxInc.
  - Refund = Yes -> leaf NO; Refund = No -> test MarSt
  - MarSt = Married -> leaf NO; MarSt = Single, Divorced -> test TaxInc
  - TaxInc < 80K -> leaf NO; TaxInc >= 80K -> leaf YES
6. Another Example of a Decision Tree
- Same training data (Refund and MarSt are categorical, TaxInc is continuous, Cheat is the class), different tree:
  - MarSt = Married -> leaf NO; MarSt = Single, Divorced -> test Refund
  - Refund = Yes -> leaf NO; Refund = No -> test TaxInc
  - TaxInc < 80K -> leaf NO; TaxInc >= 80K -> leaf YES
- There could be more than one tree that fits the same data!
7-12. Apply Model to Test Data
- Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value (Refund, then MarSt, then TaxInc).
- For the test record with Refund = No and MarSt = Married, the Married branch ends in a leaf: assign Cheat to No (see the sketch below).
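A minimal Python sketch of this traversal; encoding the slides' example tree as nested if-tests is my choice, and the test record (Refund = No, Married, 80K) is the one walked through on these slides.

```python
def classify(record):
    """Walk the example decision tree from slide 5 for one test record."""
    if record["Refund"] == "Yes":
        return "No"                      # leaf: Cheat = No
    if record["MarSt"] == "Married":
        return "No"                      # leaf: Cheat = No
    # Single or Divorced: test taxable income
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))             # -> "No", i.e. assign Cheat to No
```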
13. Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left
14. General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t
- General procedure (a sketch follows below)
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
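A minimal Python sketch of the general procedure; how the test attribute is chosen is left naive here (the first unused attribute), whereas the following slides select it by information gain. The record encoding and toy data are mine.

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Minimal sketch of Hunt's algorithm; records are (dict_of_attributes, label) pairs."""
    if not records:                                    # Dt is empty -> default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                          # all records in the same class
        return labels[0]
    if not attributes:                                 # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                               # naive choice of the test attribute
    majority = Counter(labels).most_common(1)[0][0]
    subtree = {}
    for value in {rec[attr] for rec, _ in records}:    # one branch per attribute value
        subset = [(rec, label) for rec, label in records if rec[attr] == value]
        subtree[value] = hunt(subset, attributes[1:], majority)
    return (attr, subtree)

# Toy usage
records = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
           ({"Refund": "No", "MarSt": "Married"}, "No"),
           ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(records, ["Refund", "MarSt"], default_class="No"))
```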
15. Hunt's Algorithm (applied to the Cheat data)
- The slide walks through the algorithm on the training data, starting from a single leaf labeled with the majority class "Don't Cheat" and refining it by successive splits.
16. Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting
17. How to Specify the Test Condition?
- Depends on attribute type
  - Nominal
  - Ordinal
  - Continuous
- Depends on the number of ways to split
  - 2-way split
  - Multi-way split
18. Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
19. Splitting Based on Continuous Attributes
- Different ways of handling
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut (sketched below)
    - Can be more compute-intensive
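A sketch of the exhaustive search for the best binary cut A < v. Using midpoints between consecutive sorted values as candidate cuts and base-2 entropy as the splitting criterion are my assumptions (any impurity measure could be used), and the example values are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every midpoint between consecutive distinct sorted values as the cut A < v."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (v, gain)
    return best   # (cut point, information gain)

# Toy usage (made-up income values and class labels)
income = [60, 70, 75, 85, 90, 95, 100, 120]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No"]
print(best_binary_split(income, labels))
```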
20. Splitting Based on Continuous Attributes
21. Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Assume there are two classes, P and N
- Let the set of examples S contain p elements of class P and n elements of class N
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as I(p, n) = -(p/(p+n)) log(p/(p+n)) - (n/(p+n)) log(n/(p+n)) (any fixed log base works; the worked examples on the next slides use base 10)
22. Refund Attribute: Information Gain
- Class C (Cheat) contains 3 tuples; class NC (No Cheat) contains 7 tuples
- I(C, NC) = -3/10 log(3/10) - 7/10 log(7/10) = .2653
- Check attribute Refund: the Yes value contains 3 NC and 0 C; the No value contains 4 NC and 3 C
- For Yes: -3/3 log(3/3) - 0 = 0
- For No: 4/7 log(7/4) + 3/7 log(7/3) = .2966
- E(Refund) = 3/10 x 0 + 7/10 x .2966 = .2076
- Gain = I(C, NC) - E(Refund) = .0577
23. Marital Status: Information Gain
- Check attribute MS: the Single/Divorced value contains 3 NC and 3 C; the Married value contains 4 NC and 0 C
- For Single/Divorced: 3/6 log(6/3) + 3/6 log(6/3) = .30
- For Married: -4/4 log(4/4) - 0 = 0
- E(MS) = 6/10 x .30 + 4/10 x 0 = .18
- Gain = I(C, NC) - E(MS) = .0853
24. Taxable Income (TI): Information Gain
- Suppose that taxable income is discretized into (0, 75], (75, 100], (100, 1000000]
- The first segment contains 3 NC and 0 C; the second segment contains 1 NC and 3 C; the third segment contains 3 NC and 0 C
- For the 1st segment: -3/3 log(3/3) - 0 = 0
- For the 2nd segment: 1/4 log(4/1) + 3/4 log(4/3) = .2442
- For the 3rd segment we also obtain 0
- E(TI) = 3/10 x 0 + 4/10 x .2442 + 3/10 x 0 = .0977
- Gain = I(C, NC) - E(TI) = .1676
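A quick arithmetic check of slides 22-24 in Python (these slides use base-10 logarithms; the ranking of the attributes does not depend on the base). The small discrepancy for marital status comes from intermediate rounding on the slide.

```python
import math

def info(counts):
    """I(...) with base-10 logs, as in the worked examples (0 * log 0 is treated as 0)."""
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts if c > 0)

i_total = info([3, 7])                               # 3 Cheat, 7 No-Cheat -> 0.2653

def gain(partition):
    """partition: list of [C, NC] counts, one entry per branch of the split."""
    n = sum(sum(part) for part in partition)
    expected = sum(sum(part) / n * info(part) for part in partition)
    return i_total - expected

print(round(gain([[0, 3], [3, 4]]), 4))              # Refund: Yes / No          -> 0.0577
print(round(gain([[3, 3], [0, 4]]), 4))              # MarSt: Single+Div / Marr. -> 0.0847 (slide rounds E to .18, giving .0853)
print(round(gain([[0, 3], [3, 1], [0, 3]]), 4))      # TaxInc: three segments    -> 0.1676
```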
25. Information Gain in Decision Tree Induction
- Assume that using attribute A a set S will be partitioned into sets S1, S2, ..., Sv
- If Si contains pi examples of P and ni examples of N, the entropy E(A), i.e., the expected information needed to classify objects in all subtrees Si, is given below
- The encoding information that would be gained by branching on A is Gain(A) = I(p, n) - E(A)
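Written out in standard notation (this simply restates the definitions used on the preceding slides):

```latex
\[
I(p,n) = -\frac{p}{p+n}\,\log\frac{p}{p+n} - \frac{n}{p+n}\,\log\frac{n}{p+n},
\qquad
E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p+n}\, I(p_i, n_i),
\qquad
\mathrm{Gain}(A) = I(p,n) - E(A)
\]
```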
26. Training Data (buys_computer)
RID  Age     Income  Student  Credit_rating  Class: buys_computer
1    <=30    high    no       fair           no
2    <=30    high    no       excellent      no
3    31..40  high    no       fair           yes
4    >40     medium  no       fair           yes
5    >40     low     yes      fair           yes
6    >40     low     yes      excellent      no
7    31..40  low     yes      excellent      yes
8    <=30    medium  no       fair           no
9    <=30    low     yes      fair           yes
10   >40     medium  yes      fair           yes
11   <=30    medium  yes      excellent      yes
12   31..40  medium  no       excellent      yes
13   31..40  high    yes      fair           yes
14   >40     medium  no       excellent      no
27. Attribute Selection by Information Gain Computation
- Class P: buys_computer = yes (9 tuples)
- Class N: buys_computer = no (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (a check is sketched below)
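A sketch of the computation for age on the table from slide 26, using base-2 logarithms (which is what I(9, 5) = 0.940 corresponds to); the resulting gain for age is not stated on the slide but follows from the table.

```python
import math
from collections import Counter

# buys_computer training set from slide 26, RIDs 1..14: (age, class)
ages    = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40", "<=30",
           "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
classes = ["no", "no", "yes", "yes", "yes", "no", "yes", "no",
           "yes", "yes", "yes", "yes", "yes", "no"]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

i_total = info(classes)                       # I(9, 5) = 0.940
e_age = sum(ages.count(v) / len(ages) * info([c for a, c in zip(ages, classes) if a == v])
            for v in set(ages))
print(round(i_total, 3), round(e_age, 3), round(i_total - e_age, 3))
# -> 0.94  0.694  0.247, i.e. Gain(age) is about 0.25
```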
28. Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  - IF age = '<=30' AND student = 'no' THEN buys_computer = 'no'
  - IF age = '<=30' AND student = 'yes' THEN buys_computer = 'yes'
  - IF age = '31..40' THEN buys_computer = 'yes'
  - IF age = '>40' AND credit_rating = 'excellent' THEN buys_computer = 'no'
  - IF age = '>40' AND credit_rating = 'fair' THEN buys_computer = 'yes'
29. Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)
30. Table to be Classified
Name     Body Temp  Skin Cover  Gives Birth  Aquatic  Aerial  Has Legs  Hibernates  Class
Human    warm       hair        yes          no       no      yes       no          mammal
Python   cold       scales      no           no       no      no        yes         reptile
Salmon   cold       scales      no           yes      no      no        no          fish
Whale    warm       hair        yes          yes      no      no        no          mammal
Frog     cold       none        no           semi     no      yes       yes         amphibian
Komodo   cold       scales      no           no       no      yes       no          reptile
Eel      cold       scales      no           yes      no      no        no          fish
Bat      warm       hair        yes          no       yes     yes       yes         mammal
Pigeon   warm       feathers    no           no       yes     yes       no          bird
Cat      warm       fur         yes          no       no      yes       no          mammal
Leopard  warm       fur         yes          no       no      yes       no          mammal
Shark    cold       scales      yes          yes      no      no        no          fish
Turtle   cold       scales      no           semi     no      yes       no          reptile
Penguin  warm       feathers    no           semi     no      yes       no          bird
31. Decision Tree Based Classification
- Advantages
  - Inexpensive to construct
  - Extremely fast at classifying unknown records
  - Easy to interpret for small-sized trees
  - Accuracy is comparable to other classification techniques for many simple data sets
32. Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
33. Underfitting and Overfitting (Example)
- 500 circular and 500 triangular data points.
- Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
- Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
34. Underfitting and Overfitting
- The plot contrasts training and test error as model complexity grows: overfitting sets in when the test error starts to increase while the training error keeps decreasing.
- Underfitting: when the model is too simple, both training and test errors are large.
35. Overfitting due to Noise
- The decision boundary is distorted by a noise point.
36. Overfitting due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
37. Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
- We need new ways of estimating errors
38. Minimum Description Length (MDL)
- Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
- Cost is the number of bits needed for encoding
- Search for the least costly model
- Cost(Data | Model) encodes the misclassification errors
- Cost(Model) uses node encoding (number of children) plus splitting-condition encoding
39. Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL   Class=Yes       a           b
  CLASS    Class=No        c           d

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
40. Metrics for Performance Evaluation (cont.)
- Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN) is the most widely used metric

                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL   Class=Yes     a (TP)      b (FN)
  CLASS    Class=No      c (FP)      d (TN)
41. Limitation of Accuracy
- Consider a 2-class problem
  - Number of Class 0 examples: 9990
  - Number of Class 1 examples: 10
- If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any Class 1 example
42. Cost Matrix
                       PREDICTED CLASS
  C(i|j)               Class=Yes     Class=No
  ACTUAL   Class=Yes   C(Yes|Yes)    C(No|Yes)
  CLASS    Class=No    C(Yes|No)     C(No|No)

  C(i|j): cost of misclassifying a class j example as class i
43. Computing the Cost of Classification
Cost matrix C(i|j) (actual class in rows, predicted class in columns):
            +      -
  +        -1    100
  -         1      0

Model M1 confusion matrix (actual in rows, predicted in columns):
            +      -
  +       150     40
  -        60    250
  Accuracy = 80%, Cost = 3910

Model M2 confusion matrix:
            +      -
  +       250     45
  -         5    200
  Accuracy = 90%, Cost = 4255 (a check of these numbers follows below)
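A quick check of the accuracy and cost figures for M1 and M2; the dictionary encoding of the matrices, keyed by (actual, predicted), is mine.

```python
def accuracy_and_cost(confusion, cost_matrix):
    """confusion and cost_matrix are dicts keyed by (actual, predicted)."""
    total = sum(confusion.values())
    correct = confusion[("+", "+")] + confusion[("-", "-")]
    cost = sum(confusion[k] * cost_matrix[k] for k in confusion)
    return correct / total, cost

cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
m1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
m2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

print(accuracy_and_cost(m1, cost))   # -> accuracy 0.8, cost 3910
print(accuracy_and_cost(m2, cost))   # -> accuracy 0.9, cost 4255
```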
44. Cost vs. Accuracy
Count (actual in rows, predicted in columns):
              Class=Yes   Class=No
  Class=Yes       a           b
  Class=No        c           d

Cost (same layout):
              Class=Yes   Class=No
  Class=Yes       p           q
  Class=No        q           p

- With this symmetric cost matrix, cost is a linear function of accuracy: letting N = a + b + c + d,
  Cost = p(a + d) + q(b + c) = qN - (q - p)(a + d) = N [q - (q - p) x Accuracy]
45. Cost-Sensitive Measures
- Precision is biased towards C(Yes|Yes) and C(Yes|No)
- Recall is biased towards C(Yes|Yes) and C(No|Yes)
- F-measure is biased towards all except C(No|No)
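The measures referred to above, in the notation of the confusion matrix on slide 39 (a = TP, b = FN, c = FP, d = TN); these are the standard definitions:

```latex
\[
\text{Precision } p = \frac{a}{a+c}, \qquad
\text{Recall } r = \frac{a}{a+b}, \qquad
\text{F-measure } F = \frac{2rp}{r+p} = \frac{2a}{2a+b+c}
\]
```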
46. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
47. Naive Bayesian Classification: Example
- Discretization of Height: (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], (2.0, inf)
- Priors: P(Short) = 4/15 = .267, P(Medium) = 8/15 = .533, P(Tall) = 3/15 = .2
- Gender: P(M|Short) = 1/4 = .25, P(M|Medium) = 2/8 = .25, P(M|Tall) = 3/3 = 1.0
  P(F|Short) = 3/4 = .75, P(F|Medium) = 6/8 = .75, P(F|Tall) = 0/3 = 0.0
- Height: P((0,1.6]|Short) = 2/4 = .5, P((1.6,1.7]|Short) = 2/4 = .5
  P((1.9,2.0]|Tall) = 1/3 = .333, P((2.0,inf)|Tall) = 2/3 = .666
  P((1.7,1.8]|Medium) = 3/8 = .375, P((1.8,1.9]|Medium) = 4/8 = .5, P((1.9,2.0]|Medium) = 1/8 = .125
ID  Gender  Height  Class
1   F       1.6     Short
2   M       2.0     Tall
3   F       1.9     Medium
4   F       1.88    Medium
5   F       1.7     Short
6   M       1.85    Medium
7   F       1.6     Short
8   M       1.7     Short
9   M       2.2     Tall
10  M       2.1     Tall
11  F       1.8     Medium
12  M       1.95    Medium
13  F       1.9     Medium
14  F       1.8     Medium
15  F       1.75    Medium
48. Naive Bayesian Classification: Example (cont.)
(Training data as on the previous slide.)
- Consider the tuple t = <16, M, 1.95>
- Bayes rule: P(Short|t) = P(t|Short) P(Short) / P(t), where P(t) is the same constant for every class
- P(t|Short) = P(M|Short) x P((1.9,2.0]|Short) = .25 x 0 = 0, so P(Short|t) = 0
- P(t|Medium) = P(M|Medium) x P((1.9,2.0]|Medium) = .25 x .125 = .031,
  so P(Medium|t) = .031 x .533 / P(t) = .016 / P(t)
- Similarly, P(Tall|t) = 1.0 x .333 x .2 / P(t) = .066 / P(t)
- Thus, t is classified as Tall (a check is sketched below)
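A small Python check of this computation, using the probabilities from slide 47; the height 1.95 falls in the bin (1.9, 2.0].

```python
priors = {"Short": 4/15, "Medium": 8/15, "Tall": 3/15}
p_male = {"Short": 1/4, "Medium": 2/8, "Tall": 3/3}          # P(Gender = M | class)
p_bin_19_20 = {"Short": 0/4, "Medium": 1/8, "Tall": 1/3}     # P(Height in (1.9, 2.0] | class)

# Test tuple t = <16, M, 1.95>: score each class by P(t | class) * P(class)
scores = {c: p_male[c] * p_bin_19_20[c] * priors[c] for c in priors}
print(scores)                       # Short: 0.0, Medium: ~0.017, Tall: ~0.067
print(max(scores, key=scores.get))  # -> "Tall"
```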
49. Estimating A-Posteriori Probabilities
- Bayes theorem: P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- Choose C such that P(C|X) is maximum, i.e., C such that P(X|C) P(C) is maximum
- Problem: computing P(X|C) directly is infeasible!
50. Naïve Bayesian Classification
- Naïve assumption: attribute independence, so P(x1, ..., xk|C) = P(x1|C) x ... x P(xk|C)
- If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C
- If the i-th attribute is continuous, P(xi|C) is typically estimated with a density function (e.g., a Gaussian)
- Computationally easy in both cases
51. Play-Tennis Example: Estimating P(xi|C)
- outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
- temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
- humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 2/5
- windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
- Priors: P(p) = 9/14, P(n) = 5/14
52. Play-Tennis Example: Classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 x 2/9 x 3/9 x 6/9 x 9/14 = 0.010582
- P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 x 2/5 x 4/5 x 2/5 x 5/14 = 0.018286
- Sample X is classified in class n (don't play); the check below reproduces the numbers
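A quick check of the two products in Python:

```python
# Conditional probabilities from the previous slide, for X = <rain, hot, high, false>
p_yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p) * P(p)
p_no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n) * P(n)
print(round(p_yes, 6), round(p_no, 6))           # 0.010582  0.018286 -> classify as n
```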
53. Instance-Based Methods
- Instance-based learning: store training examples and delay the processing (lazy evaluation) until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference
54. Other Classification Methods
- k-nearest neighbor classifier
- Case-based reasoning
- Rough set approach
- Fuzzy set approaches
55. The k-Nearest Neighbor Algorithm
- All instances correspond to points in an n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function can be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point xq (a sketch follows below)
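A minimal k-NN sketch in Python; the 2-D toy points and k = 3 are illustrative assumptions.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of (point, label) with points as equal-length tuples of numbers."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage with 2-D points
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
print(knn_classify((1.1, 1.0), train, k=3))   # -> "A"
```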
56. Discussion of the k-NN Algorithm
- For continuous-valued target functions, k-NN returns the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors
- Robust to noisy data by averaging over the k nearest neighbors
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
57. Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences
  - A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  - An eager method cannot, since it has already chosen its global approximation before seeing the query
- Efficiency: lazy methods need less time for training but more time for prediction
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
58. Rough Set Approach
- Rough sets are used to approximately or "roughly" define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computational effort
59. Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., via a fuzzy membership graph)
- Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories low, medium, high, with fuzzy membership values calculated
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed
60. What Is Prediction?
- Prediction is similar to classification
  - First, construct a model
  - Second, use the model to predict unknown values
  - The major method for prediction is regression: linear and multiple regression, non-linear regression
- Prediction is different from classification
  - Classification predicts a categorical class label
  - Prediction models continuous-valued functions
61. Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
- One can only predict value ranges or category distributions
- Method outline
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors that influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
62. Linear Regression for Prediction
- Given a set of tuples (x1, y1), (x2, y2), ..., (xs, ys)
- Assume that Y = A + B X
- B = SUM (xi - E(X)) (yi - E(Y)) / SUM (xi - E(X))^2
- A = E(Y) - B E(X)
63. Linear Regression for Prediction: Example
X (Years of Experience)   Y (Salary)
 3   30
 8   57
 9   64
13   72
 3   36
 6   43
11   59
21   90
 1   20
16   83

- E(X) = 9.1, E(Y) = 55.4
- B = [(3 - 9.1)(30 - 55.4) + (8 - 9.1)(57 - 55.4) + ...] / [(3 - 9.1)^2 + ...] = 3.5
- A = 55.4 - 3.5 x 9.1 = 23.6
- Y = 23.6 + 3.5 X
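A quick check of this fit in Python; note that the slide rounds B to 3.5 before computing A, which is why it reports A = 23.6 rather than the unrounded 23.2.

```python
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

ex, ey = sum(x) / len(x), sum(y) / len(y)                     # E(X) = 9.1, E(Y) = 55.4
b = sum((xi - ex) * (yi - ey) for xi, yi in zip(x, y)) / sum((xi - ex) ** 2 for xi in x)
a = ey - b * ex
print(round(ex, 1), round(ey, 1), round(b, 2), round(a, 1))   # 9.1  55.4  3.54  23.2
```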
64. Linear Regression Using a Minimal-Error Approach
- Y = A + B X + e, where e is the error when Y is replaced by A + B X
- L = SUM e^2 = SUM (y - A - B x)^2
- Take the derivative of L with respect to A and B respectively and set it to zero:
  - dL/dA = -2 SUM (yi - A - B xi) = 0
  - dL/dB = -2 SUM (yi - A - B xi) xi = 0
- Solving: A = (SUM yi - B SUM xi) / N and B = (SUM xi yi - SUM xi SUM yi / N) / (SUM xi^2 - (SUM xi)^2 / N)
65. Linear Regression Using Minimal Error: Example
(Same X/Y data as on slide 63.)
- Evaluating the closed-form solution above on this data gives the same line as before: A = 23.6 and B = 3.5 (23.2 and 3.54 without intermediate rounding), i.e., Y = 23.6 + 3.5 X
66. Regression Analysis and Log-Linear Models in Prediction
- Multiple regression: Y = b0 + b1 X1 + b2 X2 + ...
- Many nonlinear functions can be transformed into the form above
- Log-linear models
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = alpha_ab x beta_ac x gamma_ad x delta_bcd, where the Greek factors are the lower-order interaction terms
67. Locally Weighted Regression
- Construct an explicit approximation to f over a local region surrounding the query instance xq
- Locally weighted linear regression
  - The target function f is approximated near xq using a linear function
  - Minimize the squared error with a distance-decreasing weight K, using the gradient descent training rule
- In most cases, the target function is approximated by a constant, linear, or quadratic function
68. Prediction: Numerical Data
69. Prediction: Categorical Data
70. Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - Use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - Used for data sets with a large number of samples
- Cross-validation
  - Divide the data set into k subsamples
  - Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation, sketched below)
  - For data sets of moderate size
- Bootstrapping (leave-one-out)
  - For small data sets
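A minimal sketch of the k-fold split described above; the shuffling and the round-robin assignment of indices to folds are implementation choices of mine.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs: each of the k subsamples serves once as test data."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Toy usage: 3-fold split of 9 samples
for train, test in k_fold_indices(9, k=3):
    print(sorted(test))
```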
71. Boosting and Bagging
- Boosting increases classification accuracy
  - Applicable to decision trees or Bayesian classifiers
  - Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
  - Boosting requires only linear time and constant space
72. Boosting Technique (II): Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T do
  - Obtain a hypothesis (classifier) h(t) under the current weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (a sketch follows below)
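A sketch of this loop in the AdaBoost style, with a one-dimensional decision stump as the weak learner; the stump, the particular re-weighting rule, and the toy data are my assumptions rather than anything prescribed by the slide.

```python
import math

def train_stump(xs, ys, w):
    """Pick the threshold/direction on a 1-D feature that minimizes the weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            pred = [sign if x <= thr else -sign for x in xs]
            err = sum(wi for wi, p, yi in zip(w, pred, ys) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best                                   # (weighted error, threshold, sign)

def boost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                             # every example starts with weight 1/N
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # hypothesis weight from its error
        ensemble.append((alpha, thr, sign))
        # re-weight: increase the weight of misclassified examples, then normalize to sum 1
        w = [wi * math.exp(-alpha * yi * (sign if x <= thr else -sign))
             for wi, x, yi in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (sign if x <= thr else -sign) for alpha, thr, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy usage: labels in {+1, -1}
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = boost(xs, ys)
print([predict(model, x) for x in xs])            # predictions of the boosted ensemble
```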
73. Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data