Title: Classification
1. Classification
- A task of induction to find patterns
2. Outline
- Data and its format
- Problem of Classification
- Learning a classifier
- Different approaches
- Key issues
3. Data and its format
- Data
- attribute-value pairs
- with/without class
- Data type
- continuous/discrete
- nominal
- Data format
- Flat
- If not flat, what should we do?
4. Sample data
5. Induction from databases
- Inferring knowledge from data
- The task of deduction
- infer information that is a logical consequence of querying a database
- Who conducted this class before?
- Which courses are attended by Mary?
- Deductive databases extending the RDBMS
6. Classification
- It is one type of induction
- data with class labels
- Examples:
- If weather is rainy, then no golf
- If ...
- If ...
7. Different approaches
- There exist many techniques
- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayesian classifiers
- Support Vector Machines
- Ensemble methods
- Semi-supervised
- and many more ...
8. A decision tree
9. Inducing a decision tree
- There are many possible trees
- let's try it on the golfing data
- How to find the most compact one
- that is consistent with the data?
- Why the most compact?
- Occam's razor principle
- Issue of efficiency w.r.t. optimality
10. Information gain and Entropy
- Entropy: H(S) = -sum_i p_i log2(p_i), where p_i is the proportion of class i in node S
- Information gain: the difference between the entropy of the node before and after splitting (see the sketch below)
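A minimal sketch of how these two quantities can be computed; the function names and the tiny golf-style rows are illustrative, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the node before splitting minus the weighted entropy
    of the partitions induced by attribute `attr`."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

# Toy usage on two hypothetical golf-style rows
rows = [{"Outlook": "sunny", "Play": "no"}, {"Outlook": "overcast", "Play": "yes"}]
print(information_gain(rows, "Outlook", "Play"))   # 1.0: the split removes all uncertainty
```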
11. Building a compact tree
- The key to building a decision tree: which attribute to choose in order to branch
- The heuristic is to choose the attribute with the maximum information gain (IG)
- Another explanation is to reduce uncertainty as much as possible
12. Learn a decision tree
- The tree learned from the golf data (figure, redrawn as text; a builder sketch follows):
- Outlook = sunny: Humidity = high -> NO, Humidity = normal -> YES
- Outlook = overcast: YES
- Outlook = rain: Wind = strong -> NO, Wind = weak -> YES
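A minimal recursive ID3-style sketch, assuming nominal attributes; it greedily splits on the attribute with the largest information gain, the heuristic described on the previous slide. The handful of golf-style rows is hypothetical, not the full table from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Attribute with maximum information gain on these rows."""
    def gain(attr):
        remainder = sum(
            cnt / len(rows) * entropy([r[target] for r in rows if r[attr] == v])
            for v, cnt in Counter(r[attr] for r in rows).items())
        return entropy([r[target] for r in rows]) - remainder
    return max(attributes, key=gain)

def id3(rows, attributes, target):
    """Return a nested-dict tree; leaves are class labels."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # pure node, or nothing left to split on
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {attr: {v: id3([r for r in rows if r[attr] == v], rest, target)
                   for v in {r[attr] for r in rows}}}

# Hypothetical mini golf data
rows = [
    {"Outlook": "sunny", "Humidity": "high", "Play": "no"},
    {"Outlook": "sunny", "Humidity": "normal", "Play": "yes"},
    {"Outlook": "overcast", "Humidity": "high", "Play": "yes"},
    {"Outlook": "rain", "Humidity": "normal", "Play": "yes"},
]
print(id3(rows, ["Outlook", "Humidity"], "Play"))
```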
13. Issues of Decision Trees
- Number of values of an attribute
- Your solution?
- When to stop
- Data fragmentation problem
- Any solution?
- Mixed data types
- Scalability
14. Rules and Tree stumps
- Generating rules from decision trees
- One path is a rule
- We can do better. Why?
- Tree stumps and 1R (sketched below)
- For each attribute value, determine a default class (# of values = # of rules)
- Calculate the number of errors for each rule
- Find the total number of errors for that attribute's rule set
- Choose the rule set that has the fewest errors
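A minimal 1R sketch, assuming nominal attributes; the rows and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """1R: for each attribute, build one rule per value (predict the majority
    class); keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)            # attribute value -> class counts
        for r in rows:
            counts[r[attr]][r[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best                                   # (attribute, {value: class}, total errors)

rows = [
    {"Outlook": "sunny", "Windy": "no", "Play": "no"},
    {"Outlook": "sunny", "Windy": "yes", "Play": "no"},
    {"Outlook": "overcast", "Windy": "no", "Play": "yes"},
    {"Outlook": "rain", "Windy": "yes", "Play": "no"},
]
print(one_r(rows, ["Outlook", "Windy"], "Play"))
```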
15. K-Nearest Neighbor
- One of the most intuitive classification algorithms
- An unseen instance's class is determined by its nearest neighbor
- The problem is that it is sensitive to noise
- Instead of using one neighbor, we can use k neighbors (see the sketch below)
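A minimal k-NN sketch assuming numeric features and Euclidean distance; the 2-D points are made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among the k nearest
    labelled points. `train` is a list of (feature_vector, class) pairs."""
    neighbors = sorted(train, key=lambda xc: math.dist(xc[0], query))[:k]
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "yes"), ((1.2, 0.8), "yes"),
         ((5.0, 5.0), "no"), ((4.8, 5.1), "no")]
print(knn_predict(train, (1.1, 1.0), k=3))   # two of the three nearest neighbors are "yes"
```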
16. K-NN
- New problems
- How large should k be?
- lazy learning: does it learn?
- large storage
- A toy example (noise, majority)
- How good is k-NN?
- How to compare
- Speed
- Accuracy
17. Naïve Bayes Classifier
- This is a direct application of Bayes' rule
- P(C|X) = P(X|C)P(C)/P(X)
- X: a vector of (x1, x2, ..., xn)
- That's the best classifier we can build
- But, there are problems
- There are only a limited number of instances
- How to estimate P(X|C)?
- Your suggestions?
18. NBC (2)
- Assume conditional independence between the xi's
- We have
- P(C|X) = P(x1|C) ... P(xi|C) ... P(xn|C) P(C)
- What's missing? Is it really correct?
- An example (see the sketch below)
- How good is it in reality?
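A minimal Naïve Bayes sketch for nominal attributes, estimating P(C) and each P(xi|C) by simple counting; the rows are hypothetical, and there is no smoothing, so an unseen attribute value drives the whole product to zero (one answer to the estimation question above is to add smoothing).

```python
from collections import Counter, defaultdict

def train_nb(rows, attributes, target):
    """Collect class priors P(C) and per-class value counts for P(xi | C)."""
    prior = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)              # cond[(attr, class)][value] = count
    for r in rows:
        for a in attributes:
            cond[(a, r[target])][r[a]] += 1
    return prior, cond, len(rows)

def predict_nb(model, attributes, instance):
    """Pick the class C maximizing P(C) * prod_i P(xi | C)."""
    prior, cond, n = model
    best_class, best_score = None, -1.0
    for c, nc in prior.items():
        score = nc / n
        for a in attributes:
            score *= cond[(a, c)][instance[a]] / nc   # zero if the value was never seen with C
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [
    {"Outlook": "sunny", "Windy": "no", "Play": "no"},
    {"Outlook": "overcast", "Windy": "no", "Play": "yes"},
    {"Outlook": "rain", "Windy": "yes", "Play": "no"},
    {"Outlook": "rain", "Windy": "no", "Play": "yes"},
]
attrs = ["Outlook", "Windy"]
model = train_nb(rows, attrs, "Play")
print(predict_nb(model, attrs, {"Outlook": "rain", "Windy": "no"}))
```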
19. No Free Lunch
- If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning or classification method over another.
- http://en.wikipedia.org/wiki/No-Free-Lunch_theorems
- What does it indicate?
- Or is it easy to choose a good classifier for your application?
- Again, there is no off-the-shelf solution for a reasonably challenging application.
20. Ensemble Methods
- Motivation
- Stability
- Model generation
- Bagging (Bootstrap Aggregating) (see the sketch after this list)
- Boosting
- Model combination
- Majority voting
- Meta learning
- Stacking (using different types of classifiers)
- Examples (classify-ensemble.ppt)
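A minimal bagging-plus-majority-voting sketch; the base learner here is a toy 1-nearest-neighbour on a single numeric feature, chosen only to keep the example self-contained, and the data points are made up.

```python
import random
from collections import Counter

def bagging_train(rows, learn, n_models=11, seed=0):
    """Train n_models models, each on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [learn([rng.choice(rows) for _ in rows]) for _ in range(n_models)]

def bagging_predict(models, predict, instance):
    """Combine the models by simple majority voting."""
    votes = Counter(predict(m, instance) for m in models)
    return votes.most_common(1)[0][0]

# Toy base learner: 1-nearest neighbour on one numeric feature
learn = lambda sample: list(sample)                             # the "model" memorizes its sample
predict = lambda model, x: min(model, key=lambda fc: abs(fc[0] - x))[1]

rows = [(1.0, "yes"), (1.2, "yes"), (1.1, "yes"), (4.9, "no"), (5.2, "no")]
models = bagging_train(rows, learn, n_models=5)
print(bagging_predict(models, predict, 1.05))
```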
21. AdaBoost.M1 (from the Weka book)
Model generation
- Assign equal weight to each training instance
- For t iterations:
- Apply the learning algorithm to the weighted dataset, store the resulting model
- Compute the model's error e on the weighted dataset
- If e = 0 or e > 0.5: terminate model generation
- For each instance in the dataset:
- If classified correctly by the model, multiply the instance's weight by e/(1-e)
- Normalize the weights of all instances
Classification
- Assign weight 0 to all classes
- For each of the t models (or fewer):
- For the class this model predicts, add -log(e/(1-e)) to this class's weight
- Return the class with the highest weight
- (A runnable sketch of this procedure follows.)
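A Python sketch of the pseudocode above, assuming a weighted base learner is supplied as a function; the toy threshold-stump learner and the 1-D data are illustrative only.

```python
import math
from collections import defaultdict

def adaboost_m1(rows, labels, learn, t=10):
    """AdaBoost.M1, following the pseudocode above. `learn(rows, labels, weights)`
    must return a predict(x) function trained on the weighted data."""
    n = len(rows)
    weights = [1.0 / n] * n                        # equal initial weights
    models = []
    for _ in range(t):
        model = learn(rows, labels, weights)
        e = sum(w for x, y, w in zip(rows, labels, weights) if model(x) != y)
        if e == 0 or e > 0.5:                      # terminate model generation
            break
        for i, (x, y) in enumerate(zip(rows, labels)):
            if model(x) == y:                      # shrink weights of correctly classified instances
                weights[i] *= e / (1 - e)
        s = sum(weights)
        weights = [w / s for w in weights]         # renormalize
        models.append((model, e))
    return models

def adaboost_predict(models, x):
    """Each model votes for its predicted class with weight -log(e/(1-e))."""
    votes = defaultdict(float)
    for model, e in models:
        votes[model(x)] += -math.log(e / (1 - e))
    return max(votes, key=votes.get)

# Toy weighted base learner: the best single-threshold stump on a 1-D feature
def stump(rows, labels, weights):
    classes = sorted(set(labels))
    best = None
    for thr in sorted(set(rows)):
        for lo, hi in ((classes[0], classes[1]), (classes[1], classes[0])):
            err = sum(w for x, y, w in zip(rows, labels, weights)
                      if (lo if x < thr else hi) != y)
            if best is None or err < best[0]:
                best = (err, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda x: lo if x < thr else hi

rows = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = ["yes", "yes", "no", "no", "yes", "no"]
models = adaboost_m1(rows, labels, stump, t=5)
print(adaboost_predict(models, 1.2))
```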
22. Using many different classifiers
- We have learned some basic and often-used classifiers
- There are many more out there.
- Regression
- Discriminant analysis
- Neural networks
- Support vector machines
- Pick the most suitable one for an application
- Where to find all these classifiers?
- Don't reinvent the wheel that is not as round
- We will likely come back to classification and discuss support vector machines as requested
23. Assignment 3
- Pick one of your favorite software packages (feel free to use any at your disposal, as we discussed in class)
- Use the mushroom dataset found at the UC Irvine Machine Learning Repository
- Run a decision tree induction algorithm to get the following:
- Use resubstitution error to measure
- Use 10-fold cross-validation to measure
- Show the confusion matrix for the above two error measures
- Summarize and report your observations and conjectures, if any
- Submit a hardcopy report on Wednesday 3/1/06
24. Classification via Neural Networks
- (Figure: a perceptron; weighted inputs are summed and passed through a squashing function)
25. What can a perceptron do?
- Neuron as a computing device
- To separate linearly separable points
- Nice things about a perceptron
- distributed representation
- local learning
- weight adjusting
26. Linear threshold unit
- Basic concepts: projection, thresholding
- (Figure) Input vectors whose projection onto W exceeds the threshold evoke output 1; example values: W = (.11, .6), L = (.7, .7), threshold = .5 (a worked sketch follows)
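A small sketch of the projection-and-threshold computation; the specific numbers are read from the garbled figure and may not match the original slide exactly.

```python
def ltu(w, x, theta):
    """Linear threshold unit: output 1 if the projection w.x exceeds the threshold."""
    activation = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if activation > theta else 0

# Assumed figure values: W = (.11, .6), L = (.7, .7), threshold = .5
print(ltu((0.11, 0.6), (0.7, 0.7), 0.5))   # activation = 0.497, below the threshold -> output 0
```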
27. E.g. 1: Solution region for the AND problem
- Find a weight vector that satisfies all the constraints
- AND problem: (0,0) -> 0, (0,1) -> 0, (1,0) -> 0, (1,1) -> 1
28. E.g. 2: Solution region for the XOR problem?
- XOR problem: (0,0) -> 0, (0,1) -> 1, (1,0) -> 1, (1,1) -> 0
29. Learning by error reduction
- Perceptron learning algorithm (see the sketch below)
- If the activation level of the output unit is 1 when it should be 0, reduce the weight on the link to the ith input unit by rLi, where Li is the ith input value and r is a learning rate
- If the activation level of the output unit is 0 when it should be 1, increase the weight on the link to the ith input unit by rLi
- Otherwise, do nothing
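A minimal sketch of this update rule, with the threshold adjusted alongside the weights; training it on the AND problem from the earlier slide is an illustrative choice.

```python
def train_perceptron(data, epochs=20, r=0.1):
    """Perceptron learning rule as described above; targets are 0 or 1."""
    n = len(data[0][0])
    w = [0.0] * n
    theta = 0.0                                  # threshold, adjusted like a weight
    for _ in range(epochs):
        for x, target in data:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            if out == 1 and target == 0:         # fired when it should not: reduce weights
                w = [wi - r * xi for wi, xi in zip(w, x)]
                theta += r
            elif out == 0 and target == 1:       # failed to fire: increase weights
                w = [wi + r * xi for wi, xi in zip(w, x)]
                theta -= r
    return w, theta

# The AND problem from slide 27
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(and_data)
for x, t in and_data:
    print(x, 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0)   # should reproduce AND
```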
30. Multi-layer perceptrons
- Using the chain rule, we can back-propagate the errors for a multi-layer perceptron (see the sketch below)
- (Figure: a network with an input layer, a hidden layer, and an output layer)
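A minimal numpy sketch of back-propagation for a one-hidden-layer network with sigmoid units; the XOR task, layer sizes, and learning rate are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)              # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)              # hidden -> output
lr = 0.5

for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule applied layer by layer (squared-error loss)
    d_out = (out - y) * out * (1 - out)                    # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)                     # error propagated back to the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())    # should move toward [0, 1, 1, 0]
```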