Title:
1. Classifiers
- R&D project by
- Aditya M Joshi
- adityaj@cse.iitb.ac.in
- IIT Bombay
Under the guidance of Prof. Pushpak Bhattacharyya, pushpakbh@gmail.com, IIT Bombay
2. Overview
3. Introduction to Classification
4. What is classification?
A machine learning task that deals with identifying the class to which an instance belongs. A classifier performs classification.
Input to the classifier: a test instance described by attributes (a1, a2, ..., an), e.g.
- (Age, Marital status, Health status, Salary)
- (Perceptive inputs)
- (Textual features: N-grams)
Output: a discrete-valued class label, e.g.
- Category of document? Politics, Movies, Biology
- Issue loan? Yes, No
- Steer? Left, Straight, Right
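To make the attributes-to-label mapping concrete, here is a minimal sketch in Python (not part of the original slides); the toy loan data and the choice of a decision-tree classifier from scikit-learn are assumptions purely for illustration.

# A classifier maps a test instance's attributes to a discrete class label.
# The "issue loan?" data below is invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each instance: (age, salary in thousands); label: 1 = issue loan, 0 = do not.
X_train = [[25, 30], [40, 90], [35, 60], [50, 120], [23, 20]]
y_train = [0, 1, 1, 1, 0]

clf = DecisionTreeClassifier().fit(X_train, y_train)

test_instance = [[30, 70]]          # attributes of a new applicant
print(clf.predict(test_instance))   # a discrete-valued class label, e.g. [1]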
5. Classification learning
- Training phase: learning the classifier from the available data, using a labeled training set
- Testing phase: testing how well the classifier performs, using a testing set
6. Generating datasets
- Methods (a sketch follows this list)
- Holdout (2/3 training, 1/3 testing)
- Cross-validation (n-fold)
- Divide into n parts
- Train on (n-1) parts, test on the remaining one
- Repeat for the different combinations
- Bootstrapping
- Select random samples (with replacement) to form the training set
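A small sketch of the three methods above, using NumPy and scikit-learn; the toy arrays, the 2/3 vs. 1/3 split and the five folds are illustrative choices.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(30).reshape(15, 2)   # toy data: 15 instances, 2 attributes
y = np.array([0, 1] * 7 + [0])     # toy labels

# Holdout: 2/3 training, 1/3 testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# n-fold cross-validation: divide into n parts, train on n-1, test on the rest
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit a classifier on X[train_idx], evaluate it on X[test_idx]

# Bootstrapping: sample with replacement to form the training set
rng = np.random.default_rng(0)
boot_idx = rng.choice(len(X), size=len(X), replace=True)
X_boot, y_boot = X[boot_idx], y[boot_idx]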
7. Evaluating classifiers
- Outcome
- Accuracy
- Confusion matrix
- If cost-sensitive, the expected cost of classification (attribute test cost + misclassification cost)
- etc.
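A quick sketch of the first two outcomes listed above, accuracy and a confusion matrix, counted directly from (actual, predicted) pairs; the predictions are made up for illustration.

# Accuracy and a confusion matrix from true and predicted labels.
from collections import Counter

y_true = ['yes', 'no', 'yes', 'yes', 'no', 'no']
y_pred = ['yes', 'no', 'no',  'yes', 'yes', 'no']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# confusion[(actual, predicted)] = count
confusion = Counter(zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2f}")        # 0.67
for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual:3s} predicted={predicted:3s} count={count}")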
8. Decision Trees
9. Example tree
Intermediate nodes: attributes
Edges: attribute-value tests
Leaf nodes: class predictions
Example algorithms: ID3, C4.5, SPRINT, CART
Diagram from Han-Kamber
10. Decision Tree schematic
[Schematic: the training data set (attributes a1 ... a6) is split recursively into subsets X, Y and Z. An impure node selects the best attribute and continues splitting; a pure node becomes a leaf node labelled with its class (e.g., RED).]
11. Decision Tree: Issues
- How to avoid overfitting?
- Problem: the classifier performs well on training data, but fails to give good results on test data
- Example: a split on the primary key gives pure nodes and good accuracy on training, but not on testing
- Alternatives
- Pre-prune: halt construction at a certain level of the tree / level of purity
- Post-prune: remove a node if the error rate remains the same without it; repeat the process for all nodes in the decision tree
- How does the type of attribute affect the split?
- Discrete-valued: each branch corresponds to a value
- Continuous-valued: each branch may be a range of values (e.g. splits may be age < 30, 30 < age < 50, age > 50), aimed at maximizing the gain/gain ratio
- How to determine the attribute for the split?
- Alternatives
- Information gain: Gain(A, S) = Entropy(S) - Σ_j (|S_j| / |S|) × Entropy(S_j)  (see the sketch after this list)
- Other options: gain ratio, etc.
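A sketch of the information-gain computation in Python; the tiny loan table and the attribute names are invented for illustration.

# Gain(A, S) = Entropy(S) - sum_j (|S_j| / |S|) * Entropy(S_j)
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, label):
    """rows: list of dicts; attribute: column to split on; label: class column."""
    gain = entropy([r[label] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

data = [
    {"age": "young", "salary": "low",  "loan": "no"},
    {"age": "young", "salary": "high", "loan": "yes"},
    {"age": "old",   "salary": "low",  "loan": "yes"},
    {"age": "old",   "salary": "high", "loan": "yes"},
]
print(information_gain(data, "age", "loan"))     # ID3/C4.5 pick the attribute with highest gain
print(information_gain(data, "salary", "loan"))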
12. Lazy learners
13. Lazy learners
- Lazy: do not create a model of the training instances in advance
- When an instance arrives for testing, run the algorithm to get the class prediction
- Example: K-nearest neighbour classifier (K-NN classifier)
- "One is known by the company one keeps"
14. K-NN classifier schematic
- For a test instance,
- Calculate distances from the training points
- Find the K nearest neighbours (say, K = 3)
- Assign the class label based on a majority vote (a minimal sketch follows)
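A minimal K-NN sketch in Python, assuming Euclidean distance and a made-up two-class training set.

# Compute distances to all training points, take the K nearest, and vote.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_X, train_y, test_point, k=3):
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: dist(pair[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]          # majority class label

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ['red', 'red', 'red', 'blue', 'blue', 'blue']
print(knn_predict(train_X, train_y, (2, 2)))   # 'red'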
15. K-NN classifier: Issues
- How good is it?
- Susceptible to noisy values
- Slow because of the distance calculations
- Alternate approaches:
- Distances to representative points only
- Partial distance
- Any other modifications?
- Alternatives:
- Weighted attributes to decide the final label
- Assign the distance to missing values as <max>
- K = 1 returns the class label of the nearest neighbour
- How to determine the value of K?
- Alternative:
- Determine K experimentally. The K that gives minimum error is selected.
- How to make real-valued predictions?
- Alternative:
- Average the values returned by the K nearest neighbours
- How to determine distances between values of categorical attributes?
- Alternatives:
- Boolean distance (0 if same, 1 if different)
- Differential grading (e.g. for weather, "drizzling" and "rainy" are closer than "rainy" and "sunny")
16. Decision Lists
17. Decision Lists
- A sequence of boolean functions that leads to a result
- if h1(y) = 1 then set f(y) = c1
- else if h2(y) = 1 then set f(y) = c2
- ... else set f(y) = cn
f(y) = c_j, where j = min { i : h_i(y) = 1 } if such an i exists; 0 otherwise
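A sketch of evaluating such a decision list in Python; the boolean tests and class values are hypothetical.

# Return c_j for the first (lowest-index) h_i with h_i(y) = 1, else a default of 0.
def decision_list_predict(rules, y, default=0):
    """rules: ordered list of (h_i, c_i) pairs, where h_i is a boolean test."""
    for h, c in rules:
        if h(y):
            return c
    return default

# Hypothetical boolean tests over a dict-valued instance, for illustration only.
rules = [
    (lambda y: y["salary"] > 100, "approve"),
    (lambda y: y["age"] < 25,     "reject"),
]
print(decision_list_predict(rules, {"salary": 60, "age": 22}))  # 'reject'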
18. Decision List example
[Schematic: a test instance is passed through a sequence of (h_i, c_i) units; the first unit whose h_i fires outputs its class label c_i.]
19. Decision List learning
[Schematic: rules (h_k, c_k) are appended to the list R while the covered examples are removed from the training set S.]
- Start from a set of candidate feature functions
- For each h_i, let Q_i = P_i ∪ N_i = { examples with h_i = 1 }, where P_i and N_i are the positive and negative examples it covers
- Select h_k, the feature with the highest utility U_i = max( |P_i| - pn·|N_i| , |N_i| - pp·|P_i| )
- Its label c_k is 1 if |P_k| - pn·|N_k| > |N_k| - pp·|P_k|, else 0
(A rough sketch of this loop follows.)
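A rough sketch of this greedy loop under the slide's notation; the penalty factors pn and pp, the rule cap, and the feature functions are assumptions, and details may differ from the algorithm in the original slides.

def learn_decision_list(examples, features, pn=1.0, pp=1.0, max_rules=10):
    """examples: list of (x, label) with label in {0, 1}; features: boolean functions."""
    R = []                                    # the decision list being built
    S = list(examples)
    while S and len(R) < max_rules:
        best = None
        for h in features:
            Q = [(x, y) for x, y in S if h(x)]            # Q_i: examples where h fires
            P = sum(1 for _, y in Q if y == 1)            # |P_i|
            N = len(Q) - P                                # |N_i|
            utility = max(P - pn * N, N - pp * P)         # U_i
            if Q and (best is None or utility > best[0]):
                label = 1 if P - pn * N > N - pp * P else 0
                best = (utility, h, label)
        if best is None:                      # Q_k empty for every feature: stop
            break
        _, h_k, c_k = best
        R.append((h_k, c_k))                  # append the (h_k, c_k) unit to R
        S = [(x, y) for x, y in S if not h_k(x)]          # remove covered examples
    return R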
20. Decision List: Issues
- Pruning?
- h_i is not required if:
- c_i = c_(r+1), i.e. it predicts the same class as the final default rule, and
- there is no h_j (j > i) such that Q_i ∩ Q_j ≠ ∅
- Accuracy / complexity tradeoff?
- Size of R determines complexity (the length of the list)
- Whether S still contains examples of both classes determines accuracy (purity)
- What is the terminating condition?
- Size of R (an upper threshold)
- Q_k is null
- S contains examples of the same class only
21. Probabilistic classifiers
22. Probabilistic classifiers: NB
- Based on Bayes' rule: P(Ci | X) ∝ P(X | Ci) P(Ci)
- Naïve Bayes: conditional independence assumption, so P(X | Ci) = Π_k P(x_k | Ci)
23. Naïve Bayes: Issues
- How are different types of attributes handled?
- Discrete-valued: P(X | Ci) is computed according to the formula above (a product of per-attribute probabilities)
- Continuous-valued: assume a Gaussian distribution; plug the attribute's mean and variance into the density and use that as P(X | Ci)
- Problems due to sparsity of data?
- Problem: probabilities for some values may be zero
- Solution: Laplace smoothing. For each attribute value, update the probability m / n as (m + 1) / (n + k), where k = |domain of values| (sketched below)
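A sketch of Laplace smoothing as stated above, with made-up attribute values.

# An observed count of m out of n becomes (m + 1) / (n + k), k = |domain of values|.
from collections import Counter

def smoothed_probabilities(values, domain):
    counts = Counter(values)
    n, k = len(values), len(domain)
    return {v: (counts[v] + 1) / (n + k) for v in domain}

# "critical" never occurs, yet it still gets a non-zero probability.
observed = ["convenient", "convenient", "less_conv"]
print(smoothed_probabilities(observed, ["convenient", "less_conv", "critical"]))
# {'convenient': 0.5, 'less_conv': 0.333..., 'critical': 0.166...}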
24. Probabilistic classifiers: BBN
- Bayesian belief networks: attributes ARE dependent
- A directed acyclic graph and conditional probability tables
- An added term for the conditional probability between attributes
Diagram from Han-Kamber
25. BBN learning
- When the network structure is known:
- Input: network topology of the BBN
- Output: calculate the entries in the conditional probability table
- When the network structure is not known:
- ??? (see the next slide)
26. Learning structure of BBN
- Use Naïve Bayes as a basis pattern
- Add edges as required
- Examples of algorithms: TAN, K2
[Diagram: an example network over the nodes Loan, Age, Family status and Marital status]
27. Artificial Neural Networks
28. Artificial Neural Networks
- Based on the biological concept of neurons
- Structure of a fundamental unit of an ANN:
[Diagram: inputs x1 ... xn enter with weights w1 ... wn, plus a threshold weight w0; the output is the activation function p(v), where p(v) = sgn(w0 + w1·x1 + ... + wn·xn)]
29. Perceptron learning algorithm
- Initialize the values of the weights
- Apply training instances and get the output
- Update the weights according to the update rule: wi ← wi + η (t - o) xi
- Repeat till it converges
- Can represent linearly separable functions only
(η: learning rate, t: target output, o: observed output; a sketch follows)
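A sketch of the algorithm with the sgn unit from the previous slide; the learning rate, the epoch cap and the boolean OR example are illustrative choices.

# Initialize weights, apply instances, update with w_i <- w_i + eta*(t - o)*x_i.
def train_perceptron(data, eta=0.1, max_epochs=100):
    """data: list of (x_vector, target) with target in {-1, +1}."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)                     # w[0] is the threshold weight w0
    for _ in range(max_epochs):
        converged = True
        for x, t in data:
            v = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if v >= 0 else -1         # sgn activation
            if o != t:
                converged = False
                w[0] += eta * (t - o)
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if converged:
            break
    return w

# Linearly separable example: the boolean OR function.
or_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_data))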
30. Sigmoid perceptron
- Basis for multilayer feedforward networks
31. Multilayer feedforward networks
Input layer, hidden layer(s) and output layer
Diagram from Han-Kamber
32. Backpropagation
- Apply training instances as input and produce the output
- Update the weights in the reverse direction (from the output layer back towards the input layer) as follows
Diagram from Han-Kamber
33. ANN: Issues
- Addition of momentum. But why?
- Choosing the learning factor: a small learning factor means many iterations are required; a large learning factor means the learner may skip the global minimum
- What are the types of learning approaches?
- Deterministic: update the weights after summing up the errors over all examples
- Stochastic: update the weights per example
- Learning the structure of the network
- Construct a complete network
- Prune using heuristics:
- Remove edges with weights nearly zero
- Remove edges if the removal does not affect accuracy
34. Support vector machines
35. Support vector machines
[Diagram: a maximum separating-margin classifier. The separating hyperplane w·x + b = 0 lies midway between the margin hyperplanes at +1 and -1; the training points lying on the margins are the support vectors.]
36. SVM training
Minimize (1/2) ||w||^2 subject to yi (w·xi + b) - 1 ≥ 0 for all i
The Lagrange multipliers are zero for data instances other than the support vectors
In the dual formulation, the data appears only through the dot product of xk and xl
37. Focussing on the dot product
- For points that are not linearly separable, we plan to map them to a higher-dimensional (and linearly separable) space
- Computing the dot product in that space can be time-consuming. Therefore, we use kernel functions
38. Kernel functions
- Without having to know the non-linear mapping, apply a kernel function directly to the original inputs
- This reduces the number of computations required to generate the Q_kl values
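One commonly used kernel, given here as an example choice rather than the specific kernel on the original slide: the RBF (Gaussian) kernel, which equals a dot product in an implicit higher-dimensional space.

# K(x_k, x_l) = exp(-gamma * ||x_k - x_l||^2), computed without an explicit mapping.
import numpy as np

def rbf_kernel(x_k, x_l, gamma=0.5):
    diff = np.asarray(x_k) - np.asarray(x_l)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([1.0, 2.0], [2.0, 0.0]))   # kernel value for two instances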
39. Testing SVM
[Schematic: a test instance is fed to the trained SVM, which outputs the class label.]
40. SVM: Issues
- SVMs are immune to the removal of non-support-vector points
- What if n classes are to be predicted?
- Problem: SVMs deal with two-class classification
- Solution: have multiple SVMs, one for each class
41. Combining classifiers
42. Combining Classifiers
- Ensemble learning
- Use a combination of models for prediction
- Bagging: majority votes
- Boosting: attention to the weak instances
- Goal: an improved combined model
43. Bagging
[Schematic: from the training dataset D, samples D_1 ... D_n are drawn at random (possibly bootstrap sampling with replacement); a classifier learning scheme builds one model M_i per sample, and the class label of a test-set instance is decided by a majority vote over M_1 ... M_n. A sketch follows.]
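A sketch of the bagging loop in the schematic; the decision-tree base learner, the number of models and the toy data are assumptions.

# Bootstrap samples D_1..D_n, one model per sample, majority vote at test time.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, x_test, n_models=5, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)  # sample D_i
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx]) # model M_i
        votes.append(model.predict(x_test.reshape(1, -1))[0])
    return Counter(votes).most_common(1)[0][0]                           # majority vote

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(bagging_predict(X, y, np.array([5.5, 5.5])))   # expected: 1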
44. Boosting (AdaBoost)
[Schematic: instance weights are initialized to 1/d. A sample D_i is drawn from the training dataset D based on the instance weights (possibly bootstrap sampling with replacement); a classifier learning scheme builds model M_i and its error is computed. The weights of correctly classified instances are multiplied by error / (1 - error). If the error exceeds 0.5, the model is abandoned and a new sample is drawn. The class label of a test-set instance is decided by a weighted vote over M_1 ... M_n. The weight update is sketched below.]
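A sketch of just the weight-update step in the schematic: weights start at 1/d, correctly classified instances are scaled by error / (1 - error), and the weights are renormalized; the resampling and voting steps are omitted.

def update_weights(weights, correct, error):
    """weights: current instance weights; correct: booleans, one per instance."""
    factor = error / (1 - error)            # < 1 whenever error < 0.5
    new_w = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]       # renormalize so the weights sum to 1

d = 4
weights = [1 / d] * d                        # initialize weights to 1/d
correct = [True, True, False, True]          # outcome of model M_1
print(update_weights(weights, correct, error=0.25))
# the misclassified instance ends up with a relatively larger weight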
45. The last slice
46. Data preprocessing
- Attribute subset selection
- Select a subset of the total attributes to reduce complexity
- Dimensionality reduction
- Transform instances into smaller (lower-dimensional) instances
47. Attribute subset selection
- Information gain measure for attribute selection, as in decision trees
- Stepwise forward selection / backward elimination of attributes
48. Dimensionality reduction
- Dimensionality: the number of attributes of a data instance
- High dimensionality implies high computational complexity
- Given an instance x in p dimensions, compute s = Wx, where W is a k × p transformation matrix; s is the instance in k dimensions (k < p)
49. Principal Component Analysis
- Computes k orthonormal vectors: the principal components
- Essentially provides a new set of axes, in decreasing order of variance
[Schematic: the eigenvector matrix (p × p) is computed from the data (p × n); its first k eigenvectors form the k principal components (k × p), and multiplying them with the data (p × n) gives the reduced representation (k × n). A sketch follows.]
Diagram from Han-Kamber
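A sketch of this pipeline with NumPy, following the p × n layout in the schematic (columns are instances); the random data is illustrative.

# Eigenvectors of the covariance of the mean-centred data; keep the first k
# as the principal components (the k x p matrix W) and project with s = W x.
import numpy as np

def pca_transform(X, k):
    """X: p x n data matrix (columns are instances); returns the k x n projection."""
    X_centred = X - X.mean(axis=1, keepdims=True)
    cov = np.cov(X_centred)                       # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    W = eigvecs[:, order[:k]].T                   # k x p: the first k PCs
    return W @ X_centred                          # k x n reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))                      # p = 5 attributes, n = 20 instances
print(pca_transform(X, k=2).shape)                # (2, 20)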
50. Weka: Weka Demo
51. Weka: Weka Demo
- Collection of ML algorithms
- Get it from http://www.cs.waikato.ac.nz/ml/weka/
- ARFF format
- Weka Explorer
52. ARFF file format
- @RELATION nursery
- @ATTRIBUTE children numeric
- @ATTRIBUTE housing {convenient, less_conv, critical}
- @ATTRIBUTE finance {convenient, inconv}
- @ATTRIBUTE social {nonprob, slightly_prob, problematic}
- @ATTRIBUTE health {recommended, priority, not_recom}
- @ATTRIBUTE pr_val {recommend, priority, not_recom, very_recom, spec_prior}
- @DATA
- 3,less_conv,convenient,slightly_prob,recommended,spec_prior
Name of the relation
Attribute definitions
Data instances: comma-separated, each on a new line
(A sketch of loading such a file from Python follows.)
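A small sketch of reading an ARFF file like this one from Python; it assumes a local file named 'nursery.arff' and uses SciPy's ARFF reader (nominal values come back as byte strings).

from scipy.io import arff

data, meta = arff.loadarff('nursery.arff')   # hypothetical local file
print(meta)                                  # relation name and attribute definitions
print(data[0])                               # first data instance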
53. Parts of Weka
- Explorer: basic interface to run ML algorithms
- Experimenter: for comparing experiments across different algorithms
- Knowledge Flow: similar to a workflow, customized to one's needs
54. Weka demo
55. Key References
- Data Mining: Concepts and Techniques. Han and Kamber, Morgan Kaufmann Publishers, 2006.
- Machine Learning. Tom Mitchell, McGraw Hill Publications.
- Data Mining: Practical Machine Learning Tools and Techniques. Witten and Frank, Morgan Kaufmann Publishers, 2005.
56. End of slideshow
57. Extra slides 1
- Difference between decision lists and decision trees:
- In lists, functions are tested sequentially (more than one attribute at a time)
- In trees, attributes are tested sequentially
- Lists may not require complete coverage of the values of an attribute
- In trees, all values of an attribute correspond to at least one branch of the attribute split
58. Learning structure of BBN
- K2 algorithm:
- Consider the nodes in an order
- For each node, calculate the utility of adding an edge from previous nodes to this one
- TAN:
- Use Naïve Bayes as the baseline network
- Add different edges to the network based on utility
- Examples of algorithms: TAN, K2
59. Delta rule
- The delta rule enables convergence to a best fit even if the points are not linearly separable
- Uses gradient descent to search the hypothesis space