Title: Outline
1Outline
- Midterm Review
- Linear Models
- Perceptron algorithm
- Support vector machines
- Multilayer Perceptrons
- Instance-based Learning
- Decision Trees
- Bayesian Decision Theory
- Classification rules
2Announcement
- The midterm exam will be on March 30, 2006
- It will be open book and open notes
- A calculator may be needed, so remember to bring one with you
3Pattern Recognition
- The problem is defined as follows
- We have a set of examples, each represented by a set of attributes
- Each example also has a label
4Hand-Written Digit Recognition
- For homework 1, we recognized ten handwritten digits
5Pattern Recognition Problem Statement
- Now we want to predict a decision or a value based on an input pattern's attributes
- For linear models, we have one weight per attribute, and the output is a linear combination of the attributes
6Linear Model as a Neural Network
Figure 4.10 (b)
7Binary Classification
- There are two different labels, class 1 and class 2
- When an example i is from class 1, y(i) = 1
- When an example i is from class 2, y(i) = -1
- In this case, we use the sign of x to decide which class
- If x > 0, then we predict the first class
- Otherwise, we predict the second class.
8McCulloch and Pitts Model
- The linear model is equivalent to a simple neuron model, known as the McCulloch and Pitts model
- It is a simple model of a binary threshold unit
- The model neuron first computes a weighted sum of its inputs plus a bias
- It outputs one if the weighted sum is larger than zero and zero otherwise (a sketch follows below)
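The weights for the AND and OR gates on the following slides appear only in the figures; the values below are one possible choice, shown in a minimal Python sketch of a binary threshold unit.

    import numpy as np

    def threshold_unit(inputs, weights, bias):
        # McCulloch and Pitts style unit: output 1 if the weighted sum
        # of the inputs plus the bias is larger than zero, else 0
        return 1 if np.dot(weights, inputs) + bias > 0 else 0

    # AND gate of two inputs (fires only when both inputs are 1)
    for a1 in (0, 1):
        for a2 in (0, 1):
            print(a1, a2, threshold_unit([a1, a2], weights=[1, 1], bias=-1.5))

    # OR gate of two inputs (fires when at least one input is 1)
    for a1 in (0, 1):
        for a2 in (0, 1):
            print(a1, a2, threshold_unit([a1, a2], weights=[1, 1], bias=-0.5))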
9McCulloch and Pitts Model cont.
10Linear Models cont.
- This implements an AND gate of two inputs
11Linear Models cont.
- This implements an OR gate of two inputs
12Linear Models cont.
13Linear Models - cont.
- How to design a linear model for XOR?
14Linear Models cont.
- How to design an adder of two two-bit binary numbers?
15Perceptron Learning Rule
Figure 4.10 (a)
16Example
- We have one instance in class 1 (a1 = 1 and a2 = 1) and one in class 2 (a1 = -1 and a2 = 1)
17Perceptron Learning Rule
Perceptron Convergence Theorem: If the training samples are linearly separable, then the sequence of weight vectors produced by the above algorithm will terminate at a solution vector in a finite number of steps. (A sketch of the learning rule is given below.)
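The update rule itself appears only in the figure; the sketch below assumes the standard perceptron rule (add y(i) times the attribute vector of a misclassified instance, with a learning rate of 1) and reuses the two-instance example from the previous slide.

    import numpy as np

    def perceptron_train(A, y, max_epochs=100):
        # A: one row per instance, with a leading 1 for the bias term
        # y: labels in {+1, -1}
        w = np.zeros(A.shape[1])
        for _ in range(max_epochs):
            updated = False
            for a_i, y_i in zip(A, y):
                if y_i * np.dot(w, a_i) <= 0:   # misclassified (or on the boundary)
                    w = w + y_i * a_i           # move the boundary toward this instance
                    updated = True
            if not updated:                     # every instance classified correctly
                break
        return w

    # (a1 = 1, a2 = 1) in class 1 and (a1 = -1, a2 = 1) in class 2
    A = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, 1.0]])
    y = np.array([1, -1])
    print(perceptron_train(A, y))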
18Linear Models cont.
- How do we handle multi-class cases?
- For many applications, there are often more than two classes
- There are two ways to devise multi-category classifiers using binary linear models
- One against the rest (c linear models)
- One against another (c(c-1)/2 linear models)
19Multi-category Case One against the Rest
- If there are c classes,
- We train c linear models, one for each class against the rest
- Then we classify an instance as one class if it is the only class whose x > 0
- Otherwise, it is ambiguous
20Multi-category Case One against the Rest
21Multi-category Case One against the Other
- If there are c classes, we train one linear model for each pair of classes c1 and c2 (c1 < c2)
- There will be c(c-1)/2 total linear models
- To classify an unknown instance, we apply each of the c(c-1)/2 linear models
- For c1 against c2, if x > 0, then class c1 receives one vote
- Otherwise, class c2 receives one vote
- The unknown instance is classified as the class that receives the most votes
22Multi-category Case One against the Other
23Multiple Response Linear Models for Multiple
Class Classification
- Instead of combining binary classifiers, we can also learn one linear model for each class
- That is, if we have C different classes, we will have C different linear models, each given by its own weights
24Multiple Response Linear Models
25Multiple Response Linear Models
- Classification
- Given an instance, we compute the response from
each class
26Multiple Response Linear Models
- Classification continued
- It is classified as the class whose x is the
maximum - In Matlab,
27Multiple Response Linear Models
- Learning the weights
- Generalized perceptron learning rule based on Kessler's construction
28Decision Boundaries and Decision Regions
- The final result of any classifier is to
partition the attribute space into regions of
classes - Decision regions are regions of the same class in
the attribute space - Decision boundaries are the boundaries between
decision regions of different classes
29Decision Boundary of a Linear Model
- If there are only two classes, the decision
boundary is given by
30The Margin of a Linear Model
- The margin of a linear model with respect to a
training set is the minimum distance of the
instances to the decision boundary
31Maximum Margin
32Maximum Margin Hyperplane
- Support vectors are the instances that are
closest to the maximum margin hyperplane - Note that the decision boundary does not depend
on instances that are not support vectors - They can be deleted without changing the maximum
margin hyperplane - In other words, the support vectors uniquely
define the linear model that has the largest
margin
33Maximum Margin Hyperplane
- Now we change the notation a little bit
- We have two classes and we label each training instance as either 1 (the first class) or -1
- Then the maximum margin hyperplane is x = b + Σ_i α_i y(i) a(i)·a, where the sum runs over the support vectors
- Here the a(i) are the support vectors and a is the attribute vector of a test instance
34How to Determine the Parameters
- It requires solving a standard class of optimization problems, known as constrained quadratic optimization
- There are off-the-shelf software packages for
solving these problems - Matlab provides quadprog for quadratic
programming - Sequential minimal optimization
35Support Vector Machines cont.
- Kernel methods can be used to solve problems that are not linearly separable
- Mapping to a high-dimensional space is done implicitly using a kernel function, which can be used to compute the inner product in the high-dimensional space directly, without performing the mapping (an illustration follows below)
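The slides do not specify a kernel, so the illustration below uses a simple quadratic kernel: the kernel value computed in the original two-dimensional space equals the inner product of the explicit three-dimensional feature vectors, so the mapping never has to be carried out.

    import numpy as np

    def quadratic_kernel(x, z):
        # K(x, z) = (x . z)^2, computed directly in the input space
        return np.dot(x, z) ** 2

    def phi(x):
        # explicit feature map for the quadratic kernel in two dimensions
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(quadratic_kernel(x, z))       # (1*3 + 2*(-1))^2 = 1
    print(np.dot(phi(x), phi(z)))       # 9 - 12 + 4 = 1, the same value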
36Support Vector Machines cont.
- Support vector machines with a kernel function
- Linear support vector machine
- Kernel support vector machine
37Digital Computers
- All digital computers consist of digital gates to
control the dataflow and perform computation - Any digital computer can be simulated by a
network of neurons
38Neural Network Architecture
- Neural network architecture
- Consists of a set of neurons
- The inputs are connected to each of the neurons
in the input layer - The outputs are connected to each of the neurons
in the output layer - Neurons are inter-connected following some
patterns
39A More General Neuron
40Transfer Functions
41Multiple Layers of Neurons
Note that there is no standard definition of the number of layers in a neural network
42Notations
- We use a superscript to specify the layer a neuron belongs to
- We use a subscript to specify which neuron within a layer
- For weights, we need double subscript indices
- The first index specifies the neuron within the layer
- The second index specifies the neuron in the preceding layer
- w^m_ij means the weight between neuron i in the mth layer and neuron j in the (m-1)th layer
43Example
44Elementary Decision Boundaries
First Boundary
Second Boundary
First Subnetwork
45Elementary Decision Boundaries
Third Boundary
Fourth Boundary
Second Subnetwork
46Total Network
47XOR Example
48Multilayer Neural Networks
- Given the expressive power of multilayer neural
networks, the key question now is how to learn
the weights (parameters) from a set of training
examples
49Multilayer Neural Networks cont.
- Training error
- Here we consider the training error of an input
pattern to be the sum over the output units of the
squared difference between the desired output and
the actual output of each output unit - Note that w is a vector consisting of all the
parameters in the multilayer neural network
50Back Propagation Algorithm
- Initialize the weights to small random values
- Choose a training pattern (aq, tq) and apply aq to the input layer, i.e., let a0 = aq and t = tq
- Propagate the input forwards through the network
- Compute the deltas (errors) for the output layer
51Back Propagation Algorithm cont.
- Compute the deltas (errors) for the preceding
layers by propagating the errors backwards
- Update all the connection weights accordingly
- Go back to step 2 and repeat for the next pattern (a sketch of these steps is given below)
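The update equations appear only in the figures; the sketch below is one minimal version of these steps for a network with one sigmoid hidden layer, a linear output layer, squared-error loss, and an assumed learning rate of 0.1.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def backprop_step(a0, t, W1, b1, W2, b2, lr=0.1):
        # Forward propagation
        a1 = sigmoid(W1 @ a0 + b1)
        a2 = W2 @ a1 + b2                        # network output (linear layer)
        # Deltas (errors): output layer first, then propagate backwards
        s2 = -2 * (t - a2)                       # derivative of the squared error
        s1 = (a1 * (1 - a1)) * (W2.T @ s2)       # sigmoid derivative times back-propagated error
        # Update all the connection weights (gradient descent step)
        W2 -= lr * np.outer(s2, a1)
        b2 -= lr * s2
        W1 -= lr * np.outer(s1, a0)
        b1 -= lr * s1
        return W1, b1, W2, b2

    # One input, two hidden units, one output, small random initial weights
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)
    W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
    W1, b1, W2, b2 = backprop_step(np.array([1.0]), np.array([0.5]), W1, b1, W2, b2)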
52Backpropagation Algorithm cont.
53Example Function Approximation
54Network
55Initial Conditions
56Forward Propagation
57Transfer Function Derivatives
58Backpropagation
59Weight Update
60Practical Issues
- While in theory a multiple layer neural network
with nonlinear transfer functions trained using
backpropagation is sufficient to approximate any
function or solve any recognition problem, there
are practical issues - What is the optimal architecture for a particular
problem/application? - What is the performance on unknown test data?
- Will the network converge to a good solution?
- How long does it take to train a network?
61Choice of Architecture
1-3-1 network responses shown for i = 1, 2, 4, 8 (figure)
62Choice of Network Architecture
Responses of 1-2-1, 1-3-1, 1-5-1, and 1-4-1 networks (figure)
63The Issue of Generalization
- We are interested in whether a neural network trained only on a training set will work well on novel, unseen test data
- For example, for face recognition, a neural network that can only recognize the people in the training set is not very useful
- Generalization is one of the most fundamental problems in neural networks and many other recognition techniques
64Improving Generalization
- Heuristics
- A neural network should have fewer parameters
than the number of data points in the training
set - More domain specific knowledge
- Cross validation
- Divide the labeled examples into training and
validation sets
- Stop training when the error on the validation set increases (a sketch of this early-stopping procedure follows below)
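A rough sketch of early stopping is shown below; train_one_epoch, validation_error, get_weights, and set_weights are placeholders for whatever training code is actually used.

    def train_with_early_stopping(network, train_set, val_set, max_epochs=1000, patience=5):
        # Keep the weights with the lowest validation error seen so far and
        # stop once the validation error has not improved for `patience` epochs.
        best_error = float("inf")
        best_weights = network.get_weights()
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            network.train_one_epoch(train_set)         # placeholder training step
            error = network.validation_error(val_set)  # placeholder validation check
            if error < best_error:
                best_error, best_weights = error, network.get_weights()
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                              # validation error keeps rising
        network.set_weights(best_weights)
        return network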
65Cross Validation
66Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different
initial conditions - Learning rate may be too large
67Convergence Issues
- The convergence can be slow
- Practical techniques
- Variations of basic backpropagation algorithms
- Heuristics
- Momentum
- Variable learning rate
- Conjugate gradient
- Second-order methods
- Newton's method
- Levenberg-Marquardt algorithm
68Momentum
- Momentum
- A heuristic term to speed up the training
- The momentum parameter must be between 0 and 1; a value of 0.9 is often chosen (see the sketch below)
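The exact update formula on the slide is not preserved in the transcript; one common formulation of the momentum heuristic is sketched below.

    def momentum_update(w, velocity, grad, lr=0.1, momentum=0.9):
        # Blend the previous step with the current negative gradient so that
        # consistent directions accumulate speed and oscillations are damped.
        velocity = momentum * velocity - lr * grad
        w = w + velocity
        return w, velocity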
69Momentum Backpropagation
70Derivation of Backpropagation Algorithm
- The backpropagation algorithm is not a magic algorithm
- It is essentially a gradient descent algorithm that minimizes the training error with respect to the neural network parameters (i.e., the weights and biases)
- Understanding this allows us to derive new algorithms for different neural networks
71Gradient Descent
Choose the next step so that the function decreases: F(x_k + Δx_k) < F(x_k)
For small changes in x we can approximate F(x) by the first-order expansion F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k, where g_k = ∇F(x) evaluated at x = x_k
If we want the function to decrease, we need g_k^T Δx_k < 0
We can maximize the decrease by choosing Δx_k = -α g_k, i.e., a step in the direction of the negative gradient
72Example
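The worked example on this slide is not preserved in the transcript; the minimal illustration below applies x_{k+1} = x_k - α * gradient to an assumed quadratic function.

    import numpy as np

    # Minimize F(x) = x1^2 + 2*x2^2 by repeatedly stepping against the gradient.
    def grad_F(x):
        return np.array([2 * x[0], 4 * x[1]])

    x = np.array([2.0, 1.0])
    alpha = 0.1
    for k in range(25):
        x = x - alpha * grad_F(x)       # x_{k+1} = x_k - alpha * gradient
    print(x)                            # approaches the minimum at (0, 0)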
73Gradient Descent
74Backpropagation for a Neuron
75Neural Network Based Face Detection
76Instance-Based Learning
77The Nearest Neighbor Rule
- Suppose we have D = {a(1), ..., a(n)} labeled training instances
- Let a' in D be the closest point to instance a, which needs to be classified
- The nearest neighbor rule is to assign a the label associated with a'
78The Nearest Neighbor Rule cont.
79The Nearest Neighbor Rule cont.
The nearest neighbor rule leads to a partition of
the attribute space into Voronoi cells
80The k-Nearest Neighbor Rule
- An extension of the nearest neighbor rule
- The k-nearest neighbor rule classifies a by
assigning it the label most frequently
represented among the k nearest samples
- In other words, given a, we find the k nearest labeled samples. The label that appears most often is assigned to a (a sketch follows below).
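A minimal sketch of the k-nearest neighbor rule using Euclidean distance; the training points are taken from the example on the next slide, plus one extra made-up point so that there are more than k instances.

    import numpy as np
    from collections import Counter

    def knn_classify(a, train_attrs, train_labels, k=3):
        # Majority vote among the k training instances closest to a
        distances = np.linalg.norm(train_attrs - a, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    train_attrs = np.array([[0.10, 0.28], [0.12, 0.20], [0.15, 0.35], [0.50, 0.50]])
    train_labels = ["w2", "w2", "w1", "w1"]
    print(knn_classify(np.array([0.10, 0.25]), train_attrs, train_labels))   # "w2"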
81The k-Nearest Neighbor Rule cont.
- Example
- k = 3 (odd value) and a = (0.10, 0.25)^T
- The closest vectors to a, with their labels, are
- (0.10, 0.28, ω2), (0.12, 0.20, ω2), (0.15, 0.35, ω1)
- One voting scheme assigns the label ω2 to a since ω2 is the most frequently represented
82The k-Nearest Neighbor Rule cont.
83Computational Complexity of the
k-Nearest-Neighbor Rule
- Straightforward implementation: O(dn)
- More efficient implementations
- Tree-based data structures
- Editing
84K-d Tree
- A k-d tree is a generalization of a binary search
tree in high dimensions - Each internal node in a k-d tree is associated
with a hyper-rectangle and a hyper-plane
orthogonal to one of the coordinate axes - The hyper-plane splits the hyper-rectangle into
two parts, which are associated with the child
nodes - The partitioning process goes on until the number
of data points in the hyper-rectangle falls below
some given threshold
85K-d Tree cont.
86K-d Tree cont.
- For a given query point, the algorithm works by
first descending the tree to find the data points
lying in the cell that contains the query point - Then it examines surrounding cells if they
overlap the ball centered at the query point and
the closest data point so far
87Condensing
- Aim is to reduce the number of training samples
- Retain only the samples that are needed to define
the decision boundary
88Dataset Reduction Editing
- Training data may contain noise and overlapping classes
- Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries
- Results in homogeneous clusters of points
89Wilson Editing
- Remove points that do not agree with the majority
of their k nearest neighbours
90Decision Trees
- Decision trees are based on the idea that we can
classify a pattern through a sequence of
questions
- This is known as a 20-questions approach, very similar to the strategy of the Guess Who game
91Decision Tree Example
92Constructing Decision Trees
- As with constructing other tree structures, we use a recursive procedure
- First we select an attribute for the root node and create a branch for each possible attribute value
- Then we split the instances into subsets, one for each branch extending from the node
- Finally we repeat recursively for each branch, using only the instances that reach that branch
- Stop if all instances have the same class
93The Weather Data
94Which Attribute to Select?
95Criterion for Attribute Selection
- Which is the best attribute?
- We want to get the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets
- Strategy: choose the attribute that gives the greatest information gain
96Computing Information
- Measure information in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (can involve fractions of bits!)
- Formula for computing the entropy (given below)
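The formula itself is not preserved in the transcript; the standard definition used with decision trees is presumably the following.

    entropy(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n

    info([c_1, \ldots, c_n]) = entropy\left(\frac{c_1}{c_1 + \cdots + c_n}, \ldots, \frac{c_n}{c_1 + \cdots + c_n}\right)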
97Example Attribute Outlook
- Outlook = Sunny
- Outlook = Overcast
- Outlook = Rainy
- Expected information for the attribute (worked out below)
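The numbers on this slide are not preserved in the transcript; assuming the standard weather-data counts (Sunny: 2 yes / 3 no, Overcast: 4 / 0, Rainy: 3 / 2), the computation would be the following, which matches the 0.693 bits used on the next slide.

    info([2,3]) = entropy(2/5, 3/5) = -(2/5)\log_2(2/5) - (3/5)\log_2(3/5) \approx 0.971 \text{ bits}
    info([4,0]) = 0 \text{ bits}
    info([3,2]) \approx 0.971 \text{ bits}
    info([2,3],[4,0],[3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) \approx 0.693 \text{ bits}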
98Computing Information Gain
- Information gain: the difference between the information before splitting and the information after splitting
- Information gain for the attributes of the weather data:
gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
99Continuing to Split
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
100Final Decision Tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
101Highly-branching Attributes
- Attributes with a large number of values can be
problematic - Subsets are more likely to be pure if there is a
large number of values - Information gain is biased towards choosing
attributes with a large number of values - This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
102Weather data with ID code
103Tree Stump for ID code Attribute
- Entropy of split
- Information gain is maximal for ID code (namely
0.940 bits)
104Gain Ratio
- Gain ratio: a modification of the information gain that reduces its bias
- Gain ratio takes the number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account
- Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much info we need to tell which branch an instance belongs to)
105Computing the gain ratio
- Example: intrinsic information for ID code
- The value of an attribute decreases as its intrinsic information gets larger
- Definition of gain ratio
- Example (see the worked values below)
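The formulas and numbers on this slide are not preserved in the transcript; assuming the standard 14-instance weather data, the worked values would be:

    gain\_ratio(A) = \frac{gain(A)}{intrinsic\_info(A)}
    intrinsic\_info(ID\ code) = info([1,1,\ldots,1]) = 14 \times \left(-\frac{1}{14}\log_2\frac{1}{14}\right) = \log_2 14 \approx 3.807 \text{ bits}
    gain\_ratio(ID\ code) = 0.940 / 3.807 \approx 0.247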
106Gain ratios for Weather Data
107Numeric Attributes
- Standard method: binary splits
- E.g., temp < 45
- Unlike nominal attributes, every numeric attribute has many possible split points
- The solution is a straightforward extension
- Evaluate the info gain (or another measure) for every possible split point of the attribute
- Choose the best split point
- The info gain for the best split point is the info gain for the attribute
- Computationally more demanding
108Example
- Split on the temperature attribute
- E.g., temperature < 71.5: yes/4, no/2; temperature ≥ 71.5: yes/5, no/3
- info([4,2],[5,3]) = (6/14) info([4,2]) + (8/14) info([5,3]) = 0.939 bits
- Place split points halfway between values
- Can evaluate all split points in one pass! (a sketch of this evaluation follows below)
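A minimal sketch of evaluating candidate split points for a numeric attribute (not the lecture's exact procedure): try a split halfway between each pair of adjacent sorted values and keep the one with the lowest expected information.

    import numpy as np

    def info(counts):
        # entropy of a class-count vector, in bits
        p = np.asarray(counts, dtype=float)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    def best_numeric_split(values, labels):
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        classes = sorted(set(labels))
        best = (None, float("inf"))
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue
            split = (values[i] + values[i - 1]) / 2.0       # halfway between adjacent values
            left = [np.sum(labels[:i] == c) for c in classes]
            right = [np.sum(labels[i:] == c) for c in classes]
            expected = (i * info(left) + (len(values) - i) * info(right)) / len(values)
            if expected < best[1]:
                best = (split, expected)
        return best          # (split point, expected information in bits)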
109Pruning
- Prevent overfitting to noise in the data
- Prune the decision tree
- Two strategies
- Postpruning: take a fully-grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when the information becomes unreliable
- Postpruning is preferred in practice; prepruning can stop too early
110Statistical Pattern Recognition
- We have a set of training examples for each class
we are interested in - We want to design/learn a classifier that can
classify an input into one of the classes - Note the examples are influenced by factors that
are random in nature - While the classifiers we considered give us a
definite answer, note that the problem is
intrinsically probabilistic
- In other words, we should estimate the probability that an input belongs to a particular class rather than give a definite answer
111The Illustration Example
112Review Probability Theory cont.
- Conditional probability
- The law of total probability
- Bayes rule
113Bayesian Decision Theory
- We have two pattern classes w1 and w2
- Prior probability
- Our prior knowledge of how likely we are to get class w1 or w2
- Class conditional probability
- Class conditional probability densities p(x | w1) and p(x | w2)
114Bayesian Decision Theory cont.
- Bayes formula
- In English, it can be expressed as
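The formula on the slide is not preserved in the transcript; in the usual notation it reads:

    P(w_i \mid x) = \frac{p(x \mid w_i)\, P(w_i)}{p(x)}, \qquad p(x) = \sum_j p(x \mid w_j)\, P(w_j)

    \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}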
115Bayesian Decision Theory cont.
- Minimum error classification
- Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2
- P(error | x) = min[P(w1 | x), P(w2 | x)]
116Example
- Number of fins of sea bass and salmon
117Example cont.
- Estimated class conditional
118Example cont.
- Prior probabilities
- Priors come from prior knowledge
- For example, if a fish expert says there are twice as many salmon as sea bass, what are the priors?
- How do we estimate priors from data?
119Bayesian Decision Theory cont.
- Classification results are often associated with
actions - Decisions with different risks
- For example, a typical user can tolerate a spam
being classified incorrectly as a ham more than a
ham being classified incorrectly as a spam
120Bayesian Decision Theory cont.
- Conditional risk
- Bayesian decision theory
- Take the action that has the minimum conditional
risk
121General Case
- In general, we can have more than two classes and
a number of actions - The Bayesian decision theory is the same
- For each class, we compute its posterior probability
- For minimum error rate classification, we decide on the class with the largest posterior probability
- If we have actions with different risks, we take the action that has the least conditional risk
122Bayesian Decision Theory cont.
- Bayes decision rule for minimum error rate
- Decide wi if P(wi | x) ≥ P(wj | x) for all j ≠ i
- In case of ties, you can break the tie any way
you like - The probability of error is given by
123Bayesian Decision Theory General Case
- Suppose that there are c categories
- w1, w2, ..., wc
- and there are d actions associated with
recognition labels - a1, a2, ..., ad
- Loss function
- The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
- Written as
124Bayesian Decision Theory cont.
- Conditional risk
- Bayes decision rule
- For a given x, select the action ai for which the
conditional risk is minimum
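The formulas for this slide are not preserved in the transcript; in the standard notation, with loss λ(a_i | w_j) for taking action a_i when the true class is w_j, they read:

    R(a_i \mid x) = \sum_{j=1}^{c} \lambda(a_i \mid w_j)\, P(w_j \mid x)

    \text{Bayes decision rule: choose } a^{*} = \arg\min_{a_i} R(a_i \mid x)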
125Naïve Bayesian Classifier
- When there is more than one feature, the class conditional probability distribution can be complicated and expensive to model
- One way to overcome this problem is to assume that all the features are statistically independent given the class
- The resulting Bayesian classifier is called the naïve Bayesian classifier (a sketch follows below)
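A minimal sketch of a naïve Bayesian classifier for discrete features, with add-one smoothing added so that unseen feature values do not zero out the product; the toy data at the bottom are made up.

    import numpy as np
    from collections import defaultdict

    def train_naive_bayes(X, y):
        # Estimate priors P(c) and per-feature conditionals P(x_j = v | c)
        classes = sorted(set(y))
        priors = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
        totals = {c: sum(1 for label in y if label == c) for c in classes}
        counts = {c: defaultdict(lambda: defaultdict(int)) for c in classes}
        values = defaultdict(set)                 # distinct values seen for each feature
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                counts[yi][j][v] += 1
                values[j].add(v)
        return classes, priors, counts, totals, values

    def classify_naive_bayes(x, model):
        classes, priors, counts, totals, values = model
        best, best_score = None, -np.inf
        for c in classes:
            score = np.log(priors[c])             # log P(c)
            for j, v in enumerate(x):             # plus sum_j log P(x_j | c), assuming independence
                score += np.log((counts[c][j][v] + 1) / (totals[c] + len(values[j])))
            if score > best_score:
                best, best_score = c, score
        return best

    X = [(1, 0), (1, 1), (0, 0), (0, 1)]
    y = ["spam", "spam", "ham", "ham"]
    print(classify_naive_bayes((1, 1), train_naive_bayes(X, y)))   # "spam"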
126Training Set
- We need to have two sets of email messages
- Spam, consisting of known spam messages
- Note that the definition of spam is subjective
- Ham, consisting of known non-spam messages
- You can generate the sets based on your email
messages - You can also obtain publicly available spam
corpora
127Bayesian Spam Filtering
- Training stage
- Tokenize each message in the training sets
- This step can affect the performance
significantly - Create two tables, one for spam class and one for
non-spam class, counting the number of times each
token occurred in the corresponding corpus
- Estimate the probability that a message is spam given that it contains a particular word, using the following formula, resulting in another table
128Estimating Probability (Pseudo Code)
- nbad = the number of spam messages in the spam corpus
- ngood = the number of non-spam messages in the non-spam corpus
- For each word in the combined list (consisting of the words of both corpora)
- g = the number of occurrences in the non-spam corpus
- b = the number of occurrences in the spam corpus
- If 2g + b < 5, then
- Delete the word from the list
- Otherwise
- Its probability is given by the formula sketched below
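The formula itself is not preserved in the transcript; a common version, following Paul Graham's "A Plan for Spam" (which this pseudocode closely resembles), is sketched below. The doubling of g and the 0.01/0.99 clamps are part of that formulation and may differ from the lecture's exact formula.

    def word_spam_probability(b, g, nbad, ngood):
        # b: occurrences of the word in the spam corpus
        # g: occurrences of the word in the non-spam corpus
        if 2 * g + b < 5:
            return None                           # too rare: drop the word from the table
        p_spam = min(1.0, b / nbad)
        p_ham = min(1.0, 2.0 * g / ngood)         # ham counts doubled to bias against false positives
        return max(0.01, min(0.99, p_spam / (p_spam + p_ham)))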
129Bayesian Spam Filtering
- Testing stage
- When a new message arrives, it is first tokenized the same way the training messages are
- Find the fifteen tokens that are most certain, where the certainty of a token is measured by its entropy (the smaller the entropy, the greater the certainty)
- This can be done by ranking tokens according to |p - 0.5| (the farther p is from 0.5, the more certain the token)
130Testing
- Combined probability
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
- This translates to
combined = (p1 p2 ... p15) / (p1 p2 ... p15 + (1 - p1)(1 - p2) ... (1 - p15))
131Testing
- Thresholding
- If the combined probability is larger than a threshold, the message is classified as spam; otherwise, it is classified as ham
- The threshold is an important parameter
- By varying the threshold, we will get a curve called the ROC (receiver operating characteristic) curve
132Statistical Pattern Recognition
133Classification Using Rules
- Perform classification using if-then rules
- Classification rule: r = <a, c>
- a: antecedent, c: consequent
- Rules may be generated from other techniques (DT, NN) or generated directly
- Algorithms: 1R, PRISM
134Generating Rules from DTs
135Generating Rules Example
136Inferring Rudimentary Rules
- 1R algorithm
- It is an algorithm to find simple rules
- It often comes up with quite good rules for
characterizing the structures in the data
1371R Algorithm
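The slide's own pseudocode is not preserved in the transcript; a minimal sketch of 1R is given below, assuming instances are stored as attribute-value dictionaries.

    from collections import Counter, defaultdict

    def one_r(instances, labels, attributes):
        # For each attribute: one rule per value (predict that value's most
        # frequent class); keep the attribute whose rules make the fewest errors.
        best_attr, best_rules, best_errors = None, None, None
        for attr in attributes:
            value_counts = defaultdict(Counter)
            for inst, label in zip(instances, labels):
                value_counts[inst[attr]][label] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in value_counts.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                         for c in value_counts.values())
            if best_errors is None or errors < best_errors:
                best_attr, best_rules, best_errors = attr, rules, errors
        return best_attr, best_rules, best_errors

    # e.g. one_r(weather_instances, play_labels,
    #            ["Outlook", "Temperature", "Humidity", "Windy"])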
1381R Algorithm Example
139PRISM Algorithm
140The contact lenses data
141Example contact lens data cont.
- Rule we seek
- Possible tests
142Modified rule and resulting data
- Rule with best test added
- Instances covered by modified rule
143Further refinement
- Current state
- Possible tests
144Modified rule and resulting data
- Rule with best test added
- Instances covered by modified rule
145Further refinement
- Current state
- Possible tests
- Tie between the first and the fourth test
- We choose the one with greater coverage
146The result
- Final rule
- Second rule for recommending hard lenses (built from instances not covered by the first rule)
- These two rules cover all hard lenses
- The process is repeated with the other two classes
147Meta learning schemes
- Basic idea: build different experts and let them vote
- Advantage
- Often improves predictive performance
- Disadvantage
- Produces output that is very hard to analyze
- Schemes
- Bagging
- Boosting
- Stacking
- Error-correcting output codes
- These schemes apply to both classification and numeric prediction
148Bias-variance decomposition
- To analyze how much any specific training set affects performance
- Assume infinitely many classifiers, built from different training sets of size n
- For any learning scheme,
- Bias = expected error of the combined classifier on new data
- Variance = expected error due to the particular training set used
- Total expected error = bias + variance
149Bagging
- Combining predictions by voting/averaging
- Simplest way!
- Each model receives equal weight
- Idealized version
- Sample several training sets of size n (instead of just having one training set of size n)
- Build a classifier for each training set
- Combine the classifiers' predictions
- If the learning scheme is unstable, bagging almost always improves performance
- A small change in the training data can make a big change in the model (e.g., decision trees)
150Bagging classifiers
Model generation
- Let n be the number of instances in the training data
- For each of t iterations:
- Sample n instances from the training set (with replacement)
- Apply the learning algorithm to the sample
- Store the resulting model
Classification
- For each of the t models: predict the class of the instance using the model
- Return the class that is predicted most often
- Also uses voting/averaging
- Weights models according to performance
- Iterative new models are influenced by
performance of previously built ones - Encourage new model become an expert for
instances misclassified by earlier models - Intuitive justification models should be experts
that complement each other - Several variants
152AdaBoost
153AdaBoost cont.
- The final/component classifier is given by
- The error on the training set of the final
classifier is bounded to be
154AdaBoost cont.
155(No Transcript)