Title: Outline
1Outline
- Midterm Review
- Linear Models
- Perceptron algorithm
- Support vector machines
- Multilayer Perceptrons
- Instance-based Learning
- Decision Trees
- Bayesian Decision Theory
- Classification rules
2Announcement
- The midterm exam will be on March 30, 2006
- It will be open book and open notes
- A calculator may be needed, so remember to bring one with you
3Pattern Recognition
- The problem is defined as follows
- We have a set of examples, each represented by a set of attributes
- Each example also has a label
4Hand-Written Digit Recognition
- For homework 1, we recognized ten handwritten digits
5Pattern Recognition Problem Statement
- Now we want to predict a decision or a value based on an input pattern's attributes
- For linear models, we have one weight per attribute, and the output is a linear combination of the attributes
6Linear Model as a Neural Network
Figure 4.10 (b)
7Binary Classification
- There are two different labels, class 1 and class 2
- When an example i is from class 1, y(i) = 1
- When an example i is from class 2, y(i) = -1
- In this case, we use the sign of x to decide which class
- If x > 0, then we predict the first class
- Otherwise, we predict the second class.
8McCulloch and Pitts Model
- The linear model is equivalent to a simple neuron model, known as the McCulloch and Pitts model
- It is a simple model of a binary threshold unit
- The model neuron first computes a weighted sum of its inputs plus a bias
- It outputs one if the weighted sum is larger than zero and zero otherwise (a sketch follows below)
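The weights for the AND and OR gates on the following slides appear only in the figures; the values below are one possible choice, shown in a minimal Python sketch of a binary threshold unit.

    import numpy as np

    def threshold_unit(inputs, weights, bias):
        # McCulloch and Pitts style unit: output 1 if the weighted sum
        # of the inputs plus the bias is larger than zero, else 0
        return 1 if np.dot(weights, inputs) + bias > 0 else 0

    # AND gate of two inputs (fires only when both inputs are 1)
    for a1 in (0, 1):
        for a2 in (0, 1):
            print(a1, a2, threshold_unit([a1, a2], weights=[1, 1], bias=-1.5))

    # OR gate of two inputs (fires when at least one input is 1)
    for a1 in (0, 1):
        for a2 in (0, 1):
            print(a1, a2, threshold_unit([a1, a2], weights=[1, 1], bias=-0.5))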
9McCulloch and Pitts Model cont.
10Linear Models cont.
- This implements an AND gate of two inputs
11Linear Models cont.
- This implements an OR gate of two inputs
12Linear Models cont.
13Linear Models - cont.
- How to design a linear model for XOR?
14Linear Models cont.
- How to design an adder of two two-bit binary numbers?
15Perceptron Learning Rule
Figure 4.10 (a)
16Example
- We have one instance in class 1 (a1 = 1 and a2 = 1) and one in class 2 (a1 = -1 and a2 = 1)
17Perceptron Learning Rule
Perceptron Convergence Theorem: If the training samples are linearly separable, then the sequence of weight vectors produced by the above algorithm will terminate at a solution vector in a finite number of steps. (A sketch of the learning rule is given below.)
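The update rule itself appears only in the figure; the sketch below assumes the standard perceptron rule (add y(i) times the attribute vector of a misclassified instance, with a learning rate of 1) and reuses the two-instance example from the previous slide.

    import numpy as np

    def perceptron_train(A, y, max_epochs=100):
        # A: one row per instance, with a leading 1 for the bias term
        # y: labels in {+1, -1}
        w = np.zeros(A.shape[1])
        for _ in range(max_epochs):
            updated = False
            for a_i, y_i in zip(A, y):
                if y_i * np.dot(w, a_i) <= 0:   # misclassified (or on the boundary)
                    w = w + y_i * a_i           # move the boundary toward this instance
                    updated = True
            if not updated:                     # every instance classified correctly
                break
        return w

    # (a1 = 1, a2 = 1) in class 1 and (a1 = -1, a2 = 1) in class 2
    A = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, 1.0]])
    y = np.array([1, -1])
    print(perceptron_train(A, y))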
18Linear Models cont.
- How do we handle multi-class cases?
- For many applications, there are often more than two classes
- There are two ways to devise multi-category classifiers using binary linear models
- One against the rest (c linear models)
- One against another (c(c-1)/2 linear models)
19Multi-category Case One against the Rest
- If there are c classes,
- We train c linear models, one for each class against the rest
- Then we classify an instance as one class if it is the only class whose x > 0
- Otherwise, it is ambiguous
20Multi-category Case One against the Rest
21Multi-category Case One against the Other
- If there are c classes, we train one linear model for each pair of classes c1 and c2 (c1 < c2)
- There will be c(c-1)/2 total linear models
- To classify an unknown instance, we apply each of the c(c-1)/2 linear models
- For c1 against c2, if x > 0, then class c1 receives one vote
- Otherwise, class c2 receives one vote
- The unknown instance is classified as the class that receives the most votes
22Multi-category Case One against the Other
23Multiple Response Linear Models for Multiple
Class Classification
- Instead of combining binary classifiers, we can also learn one linear model for each class
- That is, if we have C different classes, we will have C different linear models, each given by its own weights
24Multiple Response Linear Models
25Multiple Response Linear Models
- Classification
- Given an instance, we compute the response from
each class
26Multiple Response Linear Models
- Classification continued
- It is classified as the class whose x is the
maximum - In Matlab,
27Multiple Response Linear Models
- Learning the weights
- Generalized perceptron learning rule based on Kessler's construction
28Decision Boundaries and Decision Regions
- The final result of any classifier is to
partition the attribute space into regions of
classes - Decision regions are regions of the same class in
the attribute space - Decision boundaries are the boundaries between
decision regions of different classes
29Decision Boundary of a Linear Model
- If there are only two classes, the decision
boundary is given by
30The Margin of a Linear Model
- The margin of a linear model with respect to a
training set is the minimum distance of the
instances to the decision boundary
31Maximum Margin
32Maximum Margin Hyperplane
- Support vectors are the instances that are
closest to the maximum margin hyperplane - Note that the decision boundary does not depend
on instances that are not support vectors - They can be deleted without changing the maximum
margin hyperplane - In other words, the support vectors uniquely
define the linear model that has the largest
margin
33Maximum Margin Hyperplane
- Now we change the notation a little bit
- We have two classes and we label each training instance as either 1 (the first class) or -1
- Then the maximum margin hyperplane is x = b + Σ_i α_i y(i) a(i)·a, where the sum runs over the support vectors
- Here the a(i) are the support vectors and a is the attribute vector of a test instance
34How to Determine the Parameters
- It requires solving a standard class of optimization problems, known as constrained quadratic optimization
- There are off-the-shelf software packages for
solving these problems - Matlab provides quadprog for quadratic
programming - Sequential minimal optimization
35Support Vector Machines cont.
- Kernel methods can be used to solve problems that are not linearly separable
- Mapping to a high-dimensional space is done implicitly using a kernel function, which can be used to compute the inner product in the high-dimensional space directly, without performing the mapping (an illustration follows below)
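The slides do not specify a kernel, so the illustration below uses a simple quadratic kernel: the kernel value computed in the original two-dimensional space equals the inner product of the explicit three-dimensional feature vectors, so the mapping never has to be carried out.

    import numpy as np

    def quadratic_kernel(x, z):
        # K(x, z) = (x . z)^2, computed directly in the input space
        return np.dot(x, z) ** 2

    def phi(x):
        # explicit feature map for the quadratic kernel in two dimensions
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(quadratic_kernel(x, z))       # (1*3 + 2*(-1))^2 = 1
    print(np.dot(phi(x), phi(z)))       # 9 - 12 + 4 = 1, the same value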
36Support Vector Machines cont.
- Support vector machines with a kernel function
- Linear support vector machine
- Kernel support vector machine
37Digital Computers
- All digital computers consist of digital gates to
control the dataflow and perform computation - Any digital computer can be simulated by a
network of neurons
38Neural Network Architecture
- Neural network architecture
- Consists of a set of neurons
- The inputs are connected to each of the neurons
in the input layer - The outputs are connected to each of the neurons
in the output layer - Neurons are inter-connected following some
patterns
39A More General Neuron
40Transfer Functions
41Multiple Layers of Neurons
Note that there is no standard definition of the number of layers in a neural network
42Notations
- We use a superscript to specify the layer a neuron belongs to
- We use a subscript to specify which neuron within a layer
- For weights, we need double subscript indices
- The first index specifies the neuron within the layer
- The second index specifies the neuron in the preceding layer
- w^m_ij means the weight between neuron i in the mth layer and neuron j in the (m-1)th layer
43Example
44Elementary Decision Boundaries
First Boundary
Second Boundary
First Subnetwork
45Elementary Decision Boundaries
Third Boundary
Fourth Boundary
Second Subnetwork
46Total Network
47XOR Example
48Multilayer Neural Networks
- Given the expressive power of multilayer neural
networks, the key question now is how to learn
the weights (parameters) from a set of training
examples
49Multilayer Neural Networks cont.
- Training error
- Here we consider the training error of an input
pattern to be the sum over the output units of the
squared difference between the desired output and
the actual output of each output unit - Note that w is a vector consisting of all the
parameters in the multilayer neural network
50Back Propagation Algorithm
- Initialize the weights to small random values
- Choose a training pattern (aq, tq) and apply aq to the input layer, i.e., let a0 = aq and t = tq
- Propagate the input forwards through the network
- Compute the deltas (errors) for the output layer
51Back Propagation Algorithm cont.
- Compute the deltas (errors) for the preceding
layers by propagating the errors backwards
- Update all the connection weights accordingly
- Go back to step 2 and repeat for the next pattern (a sketch of these steps is given below)
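The update equations appear only in the figures; the sketch below is one minimal version of these steps for a network with one sigmoid hidden layer, a linear output layer, squared-error loss, and an assumed learning rate of 0.1.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def backprop_step(a0, t, W1, b1, W2, b2, lr=0.1):
        # Forward propagation
        a1 = sigmoid(W1 @ a0 + b1)
        a2 = W2 @ a1 + b2                        # network output (linear layer)
        # Deltas (errors): output layer first, then propagate backwards
        s2 = -2 * (t - a2)                       # derivative of the squared error
        s1 = (a1 * (1 - a1)) * (W2.T @ s2)       # sigmoid derivative times back-propagated error
        # Update all the connection weights (gradient descent step)
        W2 -= lr * np.outer(s2, a1)
        b2 -= lr * s2
        W1 -= lr * np.outer(s1, a0)
        b1 -= lr * s1
        return W1, b1, W2, b2

    # One input, two hidden units, one output, small random initial weights
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(2, 1)), rng.normal(size=2)
    W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
    W1, b1, W2, b2 = backprop_step(np.array([1.0]), np.array([0.5]), W1, b1, W2, b2)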
52Backpropagation Algorithm cont.
53Example Function Approximation
54Network
55Initial Conditions
56Forward Propagation
57Transfer Function Derivatives
58Backpropagation
59Weight Update
60Practical Issues
- While in theory a multiple layer neural network
with nonlinear transfer functions trained using
backpropagation is sufficient to approximate any
function or solve any recognition problem, there
are practical issues - What is the optimal architecture for a particular
problem/application? - What is the performance on unknown test data?
- Will the network converge to a good solution?
- How long does it take to train a network?
61Choice of Architecture
1-3-1 network responses shown for i = 1, 2, 4, 8 (figure)
62Choice of Network Architecture
Responses of 1-2-1, 1-3-1, 1-5-1, and 1-4-1 networks (figure)
63The Issue of Generalization
- We are interested in whether a neural network trained only on a training set will work well on novel, unseen test data
- For example, for face recognition, a neural network that can only recognize the people in the training set is not very useful
- Generalization is one of the most fundamental problems in neural networks and many other recognition techniques
64Improving Generalization
- Heuristics
- A neural network should have fewer parameters
than the number of data points in the training
set - More domain specific knowledge
- Cross validation
- Divide the labeled examples into training and
validation sets
- Stop training when the error on the validation set increases (a sketch of this early-stopping procedure follows below)
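A rough sketch of early stopping is shown below; train_one_epoch, validation_error, get_weights, and set_weights are placeholders for whatever training code is actually used.

    def train_with_early_stopping(network, train_set, val_set, max_epochs=1000, patience=5):
        # Keep the weights with the lowest validation error seen so far and
        # stop once the validation error has not improved for `patience` epochs.
        best_error = float("inf")
        best_weights = network.get_weights()
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            network.train_one_epoch(train_set)         # placeholder training step
            error = network.validation_error(val_set)  # placeholder validation check
            if error < best_error:
                best_error, best_weights = error, network.get_weights()
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                              # validation error keeps rising
        network.set_weights(best_weights)
        return network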
65Cross Validation
66Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different
initial conditions - Learning rate may be too large
67Convergence Issues
- The convergence can be slow
- Practical techniques
- Variations of basic backpropagation algorithms
- Heuristics
- Momentum
- Variable learning rate
- Conjugate gradient
- Second-order methods
- Newton's method
- Levenberg-Marquardt algorithm
68Momentum
- Momentum
- A heuristic term to speed up the training
- The momentum parameter must be between 0 and 1; a value of 0.9 is often chosen (see the sketch below)
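The exact update formula on the slide is not preserved in the transcript; one common formulation of the momentum heuristic is sketched below.

    def momentum_update(w, velocity, grad, lr=0.1, momentum=0.9):
        # Blend the previous step with the current negative gradient so that
        # consistent directions accumulate speed and oscillations are damped.
        velocity = momentum * velocity - lr * grad
        w = w + velocity
        return w, velocity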
69Momentum Backpropagation
70Derivation of Backpropagation Algorithm
- The backpropagation algorithm is not a magic algorithm
- It is essentially a gradient descent algorithm that minimizes the training error with respect to the neural network parameters (i.e., the weights and biases)
- Understanding this allows us to derive new algorithms for different neural networks
71Gradient Descent
Choose the next step so that the function decreases: F(x_k + Δx_k) < F(x_k)
For small changes in x we can approximate F(x) by the first-order expansion F(x_k + Δx_k) ≈ F(x_k) + g_k^T Δx_k, where g_k = ∇F(x) evaluated at x = x_k
If we want the function to decrease, we need g_k^T Δx_k < 0
We can maximize the decrease by choosing Δx_k = -α g_k, i.e., a step in the direction of the negative gradient
72Example
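The worked example on this slide is not preserved in the transcript; the minimal illustration below applies x_{k+1} = x_k - α * gradient to an assumed quadratic function.

    import numpy as np

    # Minimize F(x) = x1^2 + 2*x2^2 by repeatedly stepping against the gradient.
    def grad_F(x):
        return np.array([2 * x[0], 4 * x[1]])

    x = np.array([2.0, 1.0])
    alpha = 0.1
    for k in range(25):
        x = x - alpha * grad_F(x)       # x_{k+1} = x_k - alpha * gradient
    print(x)                            # approaches the minimum at (0, 0)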
73Gradient Descent
74Backpropagation for a Neuron
75Neural Network Based Face Detection
76Instance-Based Learning
77The Nearest Neighbor Rule
- Suppose we have D = {a(1), ..., a(n)} labeled training instances
- Let a' in D be the closest point to instance a, which needs to be classified
- The nearest neighbor rule is to assign a the label associated with a'
78The Nearest Neighbor Rule cont.
79The Nearest Neighbor Rule cont.
The nearest neighbor rule leads to a partition of
the attribute space into Voronoi cells
80The k-Nearest Neighbor Rule
- An extension of the nearest neighbor rule
- The k-nearest neighbor rule classifies a by
assigning it the label most frequently
represented among the k nearest samples
- In other words, given a, we find the k nearest labeled samples. The label that appears most often is assigned to a (a sketch follows below).
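A minimal sketch of the k-nearest neighbor rule using Euclidean distance; the training points are taken from the example on the next slide, plus one extra made-up point so that there are more than k instances.

    import numpy as np
    from collections import Counter

    def knn_classify(a, train_attrs, train_labels, k=3):
        # Majority vote among the k training instances closest to a
        distances = np.linalg.norm(train_attrs - a, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    train_attrs = np.array([[0.10, 0.28], [0.12, 0.20], [0.15, 0.35], [0.50, 0.50]])
    train_labels = ["w2", "w2", "w1", "w1"]
    print(knn_classify(np.array([0.10, 0.25]), train_attrs, train_labels))   # "w2"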
81The k-Nearest Neighbor Rule cont.
- Example
- k = 3 (odd value) and a = (0.10, 0.25)^T
- The closest vectors to a, with their labels, are
- (0.10, 0.28, ω2), (0.12, 0.20, ω2), (0.15, 0.35, ω1)
- One voting scheme assigns the label ω2 to a since ω2 is the most frequently represented
82The k-Nearest Neighbor Rule cont.
83Computational Complexity of the
k-Nearest-Neighbor Rule
- Straightforward implementation: O(dn)
- More efficient implementations
- Tree-based data structures
- Editing
84K-d Tree
- A k-d tree is a generalization of a binary search
tree in high dimensions - Each internal node in a k-d tree is associated
with a hyper-rectangle and a hyper-plane
orthogonal to one of the coordinate axes - The hyper-plane splits the hyper-rectangle into
two parts, which are associated with the child
nodes - The partitioning process goes on until the number
of data points in the hyper-rectangle falls below
some given threshold
85K-d Tree cont.
86K-d Tree cont.
- For a given query point, the algorithm works by
first descending the tree to find the data points
lying in the cell that contains the query point - Then it examines surrounding cells if they
overlap the ball centered at the query point and
the closest data point so far
87Condensing
- Aim is to reduce the number of training samples
- Retain only the samples that are needed to define
the decision boundary
88Dataset Reduction Editing
- Training data may contain noise and overlapping classes
- Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries
- Results in homogeneous clusters of points
89Wilson Editing
- Remove points that do not agree with the majority
of their k nearest neighbours
90Decision Trees
- Decision trees are based on the idea that we can
classify a pattern through a sequence of
questions
- This is known as a 20-questions approach, very similar to the strategy of the Guess Who game
91Decision Tree Example
92Constructing Decision Trees
- As with constructing other tree structures, we use a recursive procedure
- First we select an attribute for the root node and create a branch for each possible attribute value
- Then we split the instances into subsets, one for each branch extending from the node
- Finally we repeat recursively for each branch, using only the instances that reach that branch
- Stop if all instances have the same class
93The Weather Data
94Which Attribute to Select?
95Criterion for Attribute Selection
- Which is the best attribute?
- We want to get the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets
- Strategy: choose the attribute that gives the greatest information gain
96Computing Information
- Measure information in bits
- Given a probability distribution, the info required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (can involve fractions of bits!)
- Formula for computing the entropy (given below)
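The formula itself is not preserved in the transcript; the standard definition used with decision trees is presumably the following.

    entropy(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n

    info([c_1, \ldots, c_n]) = entropy\left(\frac{c_1}{c_1 + \cdots + c_n}, \ldots, \frac{c_n}{c_1 + \cdots + c_n}\right)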
97Example Attribute Outlook
- Outlook = Sunny
- Outlook = Overcast
- Outlook = Rainy
- Expected information for the attribute (worked out below)
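The numbers on this slide are not preserved in the transcript; assuming the standard weather-data counts (Sunny: 2 yes / 3 no, Overcast: 4 / 0, Rainy: 3 / 2), the computation would be the following, which matches the 0.693 bits used on the next slide.

    info([2,3]) = entropy(2/5, 3/5) = -(2/5)\log_2(2/5) - (3/5)\log_2(3/5) \approx 0.971 \text{ bits}
    info([4,0]) = 0 \text{ bits}
    info([3,2]) \approx 0.971 \text{ bits}
    info([2,3],[4,0],[3,2]) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) \approx 0.693 \text{ bits}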
98Computing Information Gain
- Information gain: the difference between the information before splitting and the information after splitting
- Information gain for the attributes of the weather data:
gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
99Continuing to Split
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
100Final Decision Tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further
101Highly-branching Attributes
- Attributes with a large number of values can be
problematic - Subsets are more likely to be pure if there is a
large number of values - Information gain is biased towards choosing
attributes with a large number of values - This may result in overfitting (selection of an
attribute that is non-optimal for prediction)
102Weather data with ID code
103Tree Stump for ID code Attribute
- Entropy of split
- Information gain is maximal for ID code (namely
0.940 bits)
104Gain Ratio
- Gain ratio: a modification of the information gain that reduces its bias
- Gain ratio takes the number and size of branches into account when choosing an attribute
- It corrects the information gain by taking the intrinsic information of a split into account
- Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much info we need to tell which branch an instance belongs to)
105Computing the gain ratio
- Example: intrinsic information for ID code
- The value of an attribute decreases as its intrinsic information gets larger
- Definition of gain ratio
- Example (see the worked values below)
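The formulas and numbers on this slide are not preserved in the transcript; assuming the standard 14-instance weather data, the worked values would be:

    gain\_ratio(A) = \frac{gain(A)}{intrinsic\_info(A)}
    intrinsic\_info(ID\ code) = info([1,1,\ldots,1]) = 14 \times \left(-\frac{1}{14}\log_2\frac{1}{14}\right) = \log_2 14 \approx 3.807 \text{ bits}
    gain\_ratio(ID\ code) = 0.940 / 3.807 \approx 0.247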
106Gain ratios for Weather Data
107Numeric Attributes
- Standard method: binary splits
- E.g., temp < 45
- Unlike nominal attributes, every numeric attribute has many possible split points
- The solution is a straightforward extension
- Evaluate the info gain (or another measure) for every possible split point of the attribute
- Choose the best split point
- The info gain for the best split point is the info gain for the attribute
- Computationally more demanding
108Example
- Split on the temperature attribute
- E.g., temperature < 71.5: yes/4, no/2; temperature ≥ 71.5: yes/5, no/3
- info([4,2],[5,3]) = (6/14) info([4,2]) + (8/14) info([5,3]) = 0.939 bits
- Place split points halfway between values
- Can evaluate all split points in one pass! (a sketch of this evaluation follows below)
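A minimal sketch of evaluating candidate split points for a numeric attribute (not the lecture's exact procedure): try a split halfway between each pair of adjacent sorted values and keep the one with the lowest expected information.

    import numpy as np

    def info(counts):
        # entropy of a class-count vector, in bits
        p = np.asarray(counts, dtype=float)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    def best_numeric_split(values, labels):
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        classes = sorted(set(labels))
        best = (None, float("inf"))
        for i in range(1, len(values)):
            if values[i] == values[i - 1]:
                continue
            split = (values[i] + values[i - 1]) / 2.0       # halfway between adjacent values
            left = [np.sum(labels[:i] == c) for c in classes]
            right = [np.sum(labels[i:] == c) for c in classes]
            expected = (i * info(left) + (len(values) - i) * info(right)) / len(values)
            if expected < best[1]:
                best = (split, expected)
        return best          # (split point, expected information in bits)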
109Pruning
- Prevent overfitting to noise in the data
- Prune the decision tree
- Two strategies
- Postpruning: take a fully-grown decision tree and discard unreliable parts
- Prepruning: stop growing a branch when the information becomes unreliable
- Postpruning is preferred in practice; prepruning can stop too early
110Statistical Pattern Recognition
- We have a set of training examples for each class
we are interested in - We want to design/learn a classifier that can
classify an input into one of the classes - Note the examples are influenced by factors that
are random in nature - While the classifiers we considered give us a
definite answer, note that the problem is
intrinsically probabilistic
- In other words, we should estimate the probability that an input belongs to a particular class rather than give a definite answer
111The Illustration Example
112Review Probability Theory cont.
- Conditional probability
- The law of total probability
- Bayes rule
113Bayesian Decision Theory
- We have two pattern classes w1 and w2
- Prior probability
- Our prior knowledge of how likely we are to get class w1 or w2
- Class conditional probability
- Class conditional probability densities p(x | w1) and p(x | w2)
114Bayesian Decision Theory cont.
- Bayes formula
- In English, it can be expressed as
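The formula on the slide is not preserved in the transcript; in the usual notation it reads:

    P(w_i \mid x) = \frac{p(x \mid w_i)\, P(w_i)}{p(x)}, \qquad p(x) = \sum_j p(x \mid w_j)\, P(w_j)

    \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}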
115Bayesian Decision Theory cont.
- Minimum error classification
- Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2
- P(error | x) = min[P(w1 | x), P(w2 | x)]
116Example
- Number of fins of sea bass and salmon
117Example cont.
- Estimated class conditional
118Example cont.
- Prior probabilities
- Priors come from prior knowledge
- For example, if a fish expert says there are twice as many salmon as sea bass, what are the priors?
- How do we estimate priors from data?
119Bayesian Decision Theory cont.
- Classification results are often associated with
actions - Decisions with different risks
- For example, a typical user can tolerate a spam
being classified incorrectly as a ham more than a
ham being classified incorrectly as a spam
120Bayesian Decision Theory cont.
- Conditional risk
- Bayesian decision theory
- Take the action that has the minimum conditional
risk
121General Case
- In general, we can have more than two classes and
a number of actions - The Bayesian decision theory is the same
- For each class, we compute its posterior probability
- For minimum error rate classification, we decide on the class with the largest posterior probability
- If we have actions with different risks, we take the action that has the least conditional risk
122Bayesian Decision Theory cont.
- Bayes decision rule for minimum error rate
- Decide wi if P(wi | x) ≥ P(wj | x) for all j ≠ i
- In case of ties, you can break the tie any way
you like - The probability of error is given by
123Bayesian Decision Theory General Case
- Suppose that there are c categories
- w1, w2, ..., wc
- and there are d actions associated with
recognition labels - a1, a2, ..., ad
- Loss function
- The loss function states exactly how costly each action is and is used to convert a probability determination into a decision
- Written as
124Bayesian Decision Theory cont.
- Conditional risk
- Bayes decision rule
- For a given x, select the action ai for which the
conditional risk is minimum
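The formulas for this slide are not preserved in the transcript; in the standard notation, with loss λ(a_i | w_j) for taking action a_i when the true class is w_j, they read:

    R(a_i \mid x) = \sum_{j=1}^{c} \lambda(a_i \mid w_j)\, P(w_j \mid x)

    \text{Bayes decision rule: choose } a^{*} = \arg\min_{a_i} R(a_i \mid x)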
125Naïve Bayesian Classifier
- When there is more than one feature, the class conditional probability distribution can be complicated and expensive to model
- One way to overcome this problem is to assume that all the features are statistically independent given the class
- The resulting Bayesian classifier is called the naïve Bayesian classifier (a sketch follows below)
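A minimal sketch of a naïve Bayesian classifier for discrete features, with add-one smoothing added so that unseen feature values do not zero out the product; the toy data at the bottom are made up.

    import numpy as np
    from collections import defaultdict

    def train_naive_bayes(X, y):
        # Estimate priors P(c) and per-feature conditionals P(x_j = v | c)
        classes = sorted(set(y))
        priors = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
        totals = {c: sum(1 for label in y if label == c) for c in classes}
        counts = {c: defaultdict(lambda: defaultdict(int)) for c in classes}
        values = defaultdict(set)                 # distinct values seen for each feature
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                counts[yi][j][v] += 1
                values[j].add(v)
        return classes, priors, counts, totals, values

    def classify_naive_bayes(x, model):
        classes, priors, counts, totals, values = model
        best, best_score = None, -np.inf
        for c in classes:
            score = np.log(priors[c])             # log P(c)
            for j, v in enumerate(x):             # plus sum_j log P(x_j | c), assuming independence
                score += np.log((counts[c][j][v] + 1) / (totals[c] + len(values[j])))
            if score > best_score:
                best, best_score = c, score
        return best

    X = [(1, 0), (1, 1), (0, 0), (0, 1)]
    y = ["spam", "spam", "ham", "ham"]
    print(classify_naive_bayes((1, 1), train_naive_bayes(X, y)))   # "spam"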
126Training Set
- We need to have two sets of email messages
- Spam, consisting of known spam messages
- Note that the definition of spam is subjective
- Ham, consisting of known non-spam messages
- You can generate the sets based on your email
messages - You can also obtain publicly available spam
corpora
127Bayesian Spam Filtering
- Training stage
- Tokenize each message in the training sets
- This step can affect the performance
significantly - Create two tables, one for spam class and one for
non-spam class, counting the number of times each
token occurred in the corresponding corpus
- Estimate the probability that a message is spam given that it contains a particular word, using the following formula, resulting in another table
128Estimating Probability (Pseudo Code)
- nbad = the number of spam messages in the spam corpus
- ngood = the number of non-spam messages in the non-spam corpus
- For each word in the combined list (consisting of the words of both corpora)
- g = the number of occurrences in the non-spam corpus
- b = the number of occurrences in the spam corpus
- If 2g + b < 5, then
- Delete the word from the list
- Otherwise
- Its probability is given by the formula sketched below
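The formula itself is not preserved in the transcript; a common version, following Paul Graham's "A Plan for Spam" (which this pseudocode closely resembles), is sketched below. The doubling of g and the 0.01/0.99 clamps are part of that formulation and may differ from the lecture's exact formula.

    def word_spam_probability(b, g, nbad, ngood):
        # b: occurrences of the word in the spam corpus
        # g: occurrences of the word in the non-spam corpus
        if 2 * g + b < 5:
            return None                           # too rare: drop the word from the table
        p_spam = min(1.0, b / nbad)
        p_ham = min(1.0, 2.0 * g / ngood)         # ham counts doubled to bias against false positives
        return max(0.01, min(0.99, p_spam / (p_spam + p_ham)))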
129Bayesian Spam Filtering
- Testing stage
- When a new message arrives, it is first tokenized the same way the training messages are
- Find the fifteen tokens that are most certain, where the certainty of a token is measured by its entropy (the smaller the entropy, the greater the certainty)
- This can be done by ranking tokens according to |p - 0.5| (the farther p is from 0.5, the more certain the token)
130Testing
- Combined probability
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
- This translates to
combined = (p1 p2 ... p15) / (p1 p2 ... p15 + (1 - p1)(1 - p2) ... (1 - p15))
131Testing
- Thresholding
- If the combined probability is larger than a threshold, the message is classified as spam; otherwise, it is classified as ham
- The threshold is an important parameter
- By varying the threshold, we will get a curve called the ROC (receiver operating characteristic) curve
132Statistical Pattern Recognition
133Classification Using Rules
- Perform classification using if-then rules
- Classification rule: r = <a, c>
- a: antecedent, c: consequent
- Rules may be generated from other techniques (DT, NN) or generated directly
- Algorithms: 1R, PRISM
134Generating Rules from DTs
135Generating Rules Example
136Inferring Rudimentary Rules
- 1R algorithm
- It is an algorithm to find simple rules
- It often comes up with quite good rules for
characterizing the structures in the data
1371R Algorithm
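The slide's own pseudocode is not preserved in the transcript; a minimal sketch of 1R is given below, assuming instances are stored as attribute-value dictionaries.

    from collections import Counter, defaultdict

    def one_r(instances, labels, attributes):
        # For each attribute: one rule per value (predict that value's most
        # frequent class); keep the attribute whose rules make the fewest errors.
        best_attr, best_rules, best_errors = None, None, None
        for attr in attributes:
            value_counts = defaultdict(Counter)
            for inst, label in zip(instances, labels):
                value_counts[inst[attr]][label] += 1
            rules = {v: c.most_common(1)[0][0] for v, c in value_counts.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                         for c in value_counts.values())
            if best_errors is None or errors < best_errors:
                best_attr, best_rules, best_errors = attr, rules, errors
        return best_attr, best_rules, best_errors

    # e.g. one_r(weather_instances, play_labels,
    #            ["Outlook", "Temperature", "Humidity", "Windy"])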
1381R Algorithm Example
139PRISM Algorithm
140The contact lenses data
141Example contact lens data cont.
- Rule we seek
- Possible tests
142Modified rule and resulting data
- Rule with best test added
- Instances covered by modified rule
143Further refinement
- Current state
- Possible tests
144Modified rule and resulting data
- Rule with best test added
- Instances covered by modified rule
145Further refinement
- Current state
- Possible tests
- Tie between the first and the fourth test
- We choose the one with greater coverage
146The result
- Final rule
- Second rule for recommending hard lenses (built from instances not covered by the first rule)
- These two rules cover all hard lenses
- The process is repeated with the other two classes
147Meta learning schemes
- Basic idea: build different experts and let them vote
- Advantage
- Often improves predictive performance
- Disadvantage
- Produces output that is very hard to analyze
- Schemes
- Bagging
- Boosting
- Stacking
- Error-correcting output codes
- These schemes apply to both classification and numeric prediction
148Bias-variance decomposition
- To analyze how much any specific training set affects performance
- Assume infinitely many classifiers, built from different training sets of size n
- For any learning scheme,
- Bias = expected error of the combined classifier on new data
- Variance = expected error due to the particular training set used
- Total expected error = bias + variance
149Bagging
- Combining predictions by voting/averaging
- Simplest way!
- Each model receives equal weight
- Idealized version
- Sample several training sets of size n (instead of just having one training set of size n)
- Build a classifier for each training set
- Combine the classifiers' predictions
- If the learning scheme is unstable, bagging almost always improves performance
- A small change in the training data can make a big change in the model (e.g., decision trees)
150Bagging classifiers
Model generation
- Let n be the number of instances in the training data
- For each of t iterations:
- Sample n instances from the training set (with replacement)
- Apply the learning algorithm to the sample
- Store the resulting model
Classification
- For each of the t models: predict the class of the instance using the model
- Return the class that is predicted most often
- Also uses voting/averaging
- Weights models according to performance
- Iterative new models are influenced by
performance of previously built ones - Encourage new model become an expert for
instances misclassified by earlier models - Intuitive justification models should be experts
that complement each other - Several variants
152AdaBoost
153AdaBoost cont.
- The final/component classifier is given by
- The error on the training set of the final
classifier is bounded to be
154AdaBoost cont.
155(No Transcript)