Title: Classification and Prediction
Classification, Regression, and Prediction
- Classification
  - Predicts categorical class labels
  - Constructs a model from a training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- Regression
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- Prediction
  - Covers both classification and regression
  - Sometimes refers only to regression (e.g., in the textbook)
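The difference between the two tasks can be sketched in a few lines. This is only an illustration under stated assumptions: the data values, the salary attribute, and the helper names `classify_1nn` and `fit_line` are hypothetical and do not come from the slides. The classifier returns a categorical label, while the regression model returns a continuous value.

```python
# Hypothetical data: years of service with a categorical target (tenured)
# and a continuous target (salary, in thousands).
years = [1, 2, 3, 7, 8, 9]
tenured = ["no", "no", "no", "yes", "yes", "yes"]
salary = [50.0, 55.0, 60.0, 80.0, 85.0, 90.0]

def classify_1nn(x, xs, labels):
    """Classification: return the label of the nearest training value."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return labels[nearest]

def fit_line(xs, ys):
    """Regression: least-squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

label = classify_1nn(6, years, tenured)  # a class label
a, b = fit_line(years, salary)
estimate = a + b * 6                     # a continuous value
print(label, round(estimate, 2))
```

Both functions consume the same input attribute; only the type of the predicted target differs.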
Classification: A Two-Step Process
- Step 1. Model construction: describing a set of predetermined classes
  - The set of tuples used for model construction is the training set
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The model is represented as classification rules, decision trees, or mathematical formulae
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
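As a rough sketch of what model construction means, the snippet below searches for a rule of the same shape as the one above by brute force over a small training set. The training tuples and the function names (`make_rule`, `training_accuracy`, `construct_model`) are assumptions made for illustration, and exhaustive search stands in for a real classification algorithm such as decision tree induction.

```python
from itertools import product

# Hypothetical training set: (name, rank, years, tenured).
training_set = [
    ("Mike", "assistant prof", 3, "no"),
    ("Mary", "assistant prof", 7, "yes"),
    ("Bill", "professor", 2, "yes"),
    ("Jim", "associate prof", 7, "yes"),
    ("Dave", "assistant prof", 6, "no"),
    ("Anne", "associate prof", 3, "no"),
]

def make_rule(rank, threshold):
    """Build the rule: IF rank = <rank> OR years > <threshold> THEN 'yes'."""
    def rule(r, y):
        return "yes" if r == rank or y > threshold else "no"
    return rule

def training_accuracy(rule, rows):
    """Fraction of training tuples whose class label the rule reproduces."""
    correct = sum(rule(r, y) == label for _, r, y, label in rows)
    return correct / len(rows)

def construct_model(rows):
    """Try every (rank, threshold) pair seen in the training set and
    keep the pair with the highest training accuracy."""
    ranks = sorted({r for _, r, _, _ in rows})
    thresholds = sorted({y for _, _, y, _ in rows})
    return max(
        product(ranks, thresholds),
        key=lambda p: training_accuracy(make_rule(*p), rows),
    )

best_rank, best_threshold = construct_model(training_set)
print(f"IF rank = '{best_rank}' OR years > {best_threshold} THEN tenured = 'yes'")
```

The class label attribute (`tenured`) drives the search: a candidate rule is scored only by how well it reproduces the known labels of the training tuples.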
Classification: A Two-Step Process (continued)
- Step 2. Model usage: classifying future or unknown objects
  - Estimate the predictive accuracy of the model
    - The known label of each test sample is compared with the model's classification
    - The accuracy rate is the percentage of test set samples that the model classifies correctly
    - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Unseen tuple: (Jeff, Professor, 4)
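Step 2 can be sketched as follows, using the tenure rule above (professor, or more than 6 years) as the model. The labelled test tuples are hypothetical and serve only to show how the accuracy rate is computed; the function name `classify` is likewise an assumption.

```python
def classify(rank, years):
    """The tenure rule: professor OR more than 6 years -> tenured."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Hypothetical labelled test set, kept separate from any training data:
# (name, rank, years, known label).
test_set = [
    ("Tom", "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),
    ("George", "professor", 5, "yes"),
    ("Joseph", "assistant prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's output.
correct = sum(classify(rank, years) == label for _, rank, years, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")

# Classify the unseen tuple (Jeff, Professor, 4).
print(classify("professor", 4))
```

With these assumed tuples the rule misclassifies one of four test samples, and it labels the unseen tuple as tenured because the rank condition holds even though the years condition does not.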
Classification Process (1): Model Construction
[Figure: Training Data -> Classification Algorithms -> Classifier (Model)]
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use Model in Prediction
[Figure: Classifier (Model)]
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Supervised versus Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
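The contrast can be illustrated with a minimal sketch on one-dimensional data. Everything here is an assumption for illustration: the measurements, the class names, and the choice of a nearest-centroid classifier and a two-cluster k-means as representatives of each setting.

```python
# Hypothetical 1-D measurements.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]  # used only when supervised

def mean(xs):
    return sum(xs) / len(xs)

# Supervised: labels accompany the training data, so one centroid per
# known class can be computed directly and new data assigned to a class.
def train_supervised(xs, ys):
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(centroids[c] - x))

# Unsupervised: no labels; k-means (k = 2) has to discover the groups.
def two_means(xs, iterations=10):
    centers = [min(xs), max(xs)]  # simple deterministic starting centers
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups

centroids = train_supervised(values, labels)
print(predict(centroids, 4.5))  # a named class
print(two_means(values))        # unnamed clusters
```

The supervised model returns a class name that was present in the training labels, whereas the clustering result is just a partition of the data whose groups carry no names.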
Classification and Prediction: Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary
Issues (1): Data Preparation
- Data cleaning
  - Preprocess data to reduce noise (e.g., by smoothing) and handle missing values (e.g., substitute the most commonly occurring value)
  - Helps to reduce confusion during learning
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize data (to higher-level concepts) and/or normalize it (scale values so that they fall within a specified range)
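Two of the preparation steps above, filling missing values with the most common value and normalizing into a specified range, can be sketched as below. The function names `fill_missing` and `min_max_normalize`, the use of `None` as the missing-value marker, and the sample columns are assumptions; relevance analysis is not covered by this sketch.

```python
from collections import Counter

def fill_missing(values, missing=None):
    """Data cleaning: replace missing entries with the most common value."""
    known = [v for v in values if v is not missing]
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v is missing else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Hypothetical attribute columns.
ranks = ["professor", None, "assistant prof", "professor"]
years = [2, 7, 6, 3]

print(fill_missing(ranks))
print(min_max_normalize(years))
```

Min-max scaling is only one way to normalize; it is used here because it maps directly onto the "fall within a specified range" description.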
Issues (2): Evaluating Classification Methods
- Predictive accuracy
  - Ability to predict the class label correctly
- Speed
  - Time to construct the model
  - Time to use the model
- Robustness
  - Making correct predictions given noise and missing values
- Scalability
  - Constructing the model efficiently as the data size grows
- Interpretability
  - Understanding and insight provided by the model
- Goodness of rules
  - Decision tree size
  - Compactness of classification rules
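The accuracy and speed criteria above can be measured with a small harness such as the following sketch. The majority-class model, the function names `construct_model` and `evaluate`, and the label lists are hypothetical; the deliberately trivial model only keeps the example short. Robustness, scalability, and interpretability are not measured here.

```python
import time
from collections import Counter

def construct_model(labels):
    """A deliberately trivial model: always predict the majority class."""
    return Counter(labels).most_common(1)[0][0]

def evaluate(train_labels, test_labels):
    start = time.perf_counter()
    model = construct_model(train_labels)
    build_seconds = time.perf_counter() - start  # time to construct the model

    start = time.perf_counter()
    predictions = [model for _ in test_labels]
    use_seconds = time.perf_counter() - start    # time to use the model

    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return {
        "accuracy": correct / len(test_labels),
        "build_seconds": build_seconds,
        "use_seconds": use_seconds,
    }

report = evaluate(["yes", "yes", "no"], ["yes", "no", "yes", "yes"])
print(report)
```

Reporting construction time and usage time separately mirrors the two speed sub-criteria, since a model that is slow to build may still be fast to apply.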