Title: Cooperating Intelligent Systems
1. Cooperating Intelligent Systems
- Statistical learning methods
- Chapter 20, AIMA
- (only ANNs and SVMs)
2. Artificial neural networks
- The brain is a pretty intelligent system.
- Can we copy it?
- There are approx. 10^11 neurons in the brain.
- There are approx. 23×10^9 neurons in the male cortex (females have about 15% fewer).
3. The simple model
- The McCulloch-Pitts model (1943)
[Figure: a neuron with inputs x1, x2, x3, weights w1, w2, w3 and a bias weight w0.]
y = g(w0 + w1*x1 + w2*x2 + w3*x3)
Image from Neuroscience: Exploring the Brain by Bear, Connors, and Paradiso
4. Transfer functions g(z)
- The logistic function: g(z) = 1 / (1 + e^(-z))
- The Heaviside (step) function: g(z) = 1 if z >= 0, else 0
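As a concrete illustration of the model on slides 3-4 (not taken from the course material; all names are my own), a minimal Java sketch of a McCulloch-Pitts-style unit with both transfer functions:

// Minimal sketch of a McCulloch-Pitts-style unit (illustrative only, not the lab code).
public class SimpleUnit {

    // The logistic transfer function: g(z) = 1 / (1 + e^(-z))
    static double logistic(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // The Heaviside transfer function: g(z) = 1 if z >= 0, else 0
    static double heaviside(double z) {
        return z >= 0.0 ? 1.0 : 0.0;
    }

    // The weighted sum z = w0 + w1*x1 + ... + wn*xn
    static double activation(double w0, double[] w, double[] x) {
        double z = w0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return z;
    }

    public static void main(String[] args) {
        double[] w = {1.0, 1.0};   // weights w1, w2
        double w0 = -1.5;          // bias chosen so the unit computes AND of binary inputs
        double[] x = {1.0, 1.0};
        double z = activation(w0, w, x);
        System.out.println("logistic(z)  = " + logistic(z));
        System.out.println("heaviside(z) = " + heaviside(z));
    }
}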
5. The simple perceptron
- With a {-1, +1} representation.
- Traditionally (early 60s) trained with perceptron learning.
6. Perceptron learning
- Desired output: f(n)
- Repeat until no errors are made anymore:
  - Pick a random example (x(n), f(n)).
  - If the classification is correct, i.e. if y(x(n)) = f(n), then do nothing.
  - If the classification is wrong, then update the parameters as shown below (η, the learning rate, is a small positive number).
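The update formula itself did not survive the conversion. For the {-1, +1} representation used on slide 5, the standard perceptron rule (assumed here) is, for a misclassified example:

w_0 \leftarrow w_0 + \eta\, f(n), \qquad w_i \leftarrow w_i + \eta\, f(n)\, x_i(n)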
7. Example: Perceptron learning
[Plot: the AND function in the (x1, x2) plane. Initial values; η = 0.3.]
8. Example: Perceptron learning
[Plot: this example is correctly classified, no action.]
9. Example: Perceptron learning
[Plot: this example is incorrectly classified, learning action.]
10. Example: Perceptron learning
[Plot: this example is incorrectly classified, learning action.]
11. Example: Perceptron learning
[Plot: this example is correctly classified, no action.]
12. Example: Perceptron learning
[Plot: this example is incorrectly classified, learning action.]
13. Example: Perceptron learning
[Plot: this example is incorrectly classified, learning action.]
14. Example: Perceptron learning
[Plot: the AND function, final solution.]
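The walk-through above can be reproduced in a few lines of code. Below is a minimal, self-contained Java sketch (my own illustration, not the course code), using {-1, +1} targets, a sign-like activation and η = 0.3 as on slide 7:

// Perceptron learning on the AND function (illustrative sketch).
public class PerceptronAnd {
    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};  // input patterns
        double[] f = {-1, -1, -1, 1};                     // AND targets in {-1, +1}
        double w0 = 0.0, w1 = 0.0, w2 = 0.0;              // initial parameters
        double eta = 0.3;                                 // learning rate

        boolean anyError = true;
        while (anyError) {                                // repeat until no errors are made
            anyError = false;
            for (int n = 0; n < x.length; n++) {
                double z = w0 + w1 * x[n][0] + w2 * x[n][1];
                double y = (z >= 0) ? 1 : -1;             // perceptron output
                if (y != f[n]) {                          // wrong: learning action
                    w0 += eta * f[n];
                    w1 += eta * f[n] * x[n][0];
                    w2 += eta * f[n] * x[n][1];
                    anyError = true;
                }                                         // correct: no action
            }
        }
        System.out.println("Final solution: w0=" + w0 + ", w1=" + w1 + ", w2=" + w2);
    }
}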
15. Perceptron learning
- Perceptron learning is guaranteed to find a solution in finite time, if a solution exists.
- Perceptron learning cannot be generalized to more complex networks.
- Better to use gradient descent, based on formulating an error and differentiable functions.
16. Gradient search
- The learning rate (η) is set heuristically.
[Plot: the error surface E(W) versus the weights W; from the current point W(k), go downhill.]
- W(k+1) = W(k) + ΔW(k)
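The definition of the step ΔW(k) is missing from the dump; in plain gradient descent it is

\Delta W(k) = -\eta\, \nabla_{W} E\big(W(k)\big)

i.e. move a small step in the direction of steepest descent of the error.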
17. The Multilayer Perceptron (MLP)
- Combine several single-layer perceptrons.
- Each single-layer perceptron uses a sigmoid function, e.g. the logistic function.
[Figure: a layered network mapping the input to the output.]
- Can be trained using gradient descent.
18. Example: One hidden layer
- Can approximate any continuous function.
- q(z): sigmoid or linear,
- f(z): sigmoid.
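The network equation was an image in the original. With the notation of the slide (f for the hidden units, q for the output unit), a one-hidden-layer MLP of this form can be written (my indexing):

y(\mathbf{x}) \;=\; q\Big( v_0 + \sum_j v_j\, f\big( w_{j0} + \sum_i w_{ji}\, x_i \big) \Big)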
19. Example of computing the gradient
- What we need to do is to compute the gradient of the error with respect to every weight.
- We have the complete equation for the network.
20. Example of computing the gradient
[The derivation on these two slides was shown as equations/images; a sketch is given below.]
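Under the assumptions above (one hidden layer, squared-error cost E = ½(y − t)², h_j = f(b_j) the output of hidden unit j), the chain rule gives gradients of the form

\frac{\partial E}{\partial v_j} = (y - t)\, q'(a)\, h_j,
\qquad
\frac{\partial E}{\partial w_{ji}} = (y - t)\, q'(a)\, v_j\, f'(b_j)\, x_i

where a = v_0 + \sum_j v_j h_j and b_j = w_{j0} + \sum_i w_{ji} x_i. This is a sketch of the standard calculation, not necessarily the exact notation used on the slides.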
21. When should you stop learning?
- After a set number of learning epochs.
- When the change in the gradient becomes smaller than a certain number.
- Validation data: early stopping.
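As an illustration of the validation-data criterion (the names and the simulated error curve below are my own, not from the lab code), a minimal early-stopping loop looks like this:

// Early-stopping skeleton: stop when the validation error has not improved for 'patience' epochs.
public class EarlyStopping {

    // Stand-in for the validation error after epoch e (decreases, then rises as the model overfits).
    static double simulatedValidationError(int e) {
        return Math.exp(-e / 20.0) + 0.002 * e;
    }

    public static void main(String[] args) {
        int patience = 10;                     // epochs to wait without improvement
        double bestErr = Double.MAX_VALUE;
        int bestEpoch = 0;

        for (int e = 0; e < 500; e++) {
            // in practice: train one epoch here, then measure the error on the validation set
            double valErr = simulatedValidationError(e);
            if (valErr < bestErr) {            // improvement: remember this epoch (and save the weights)
                bestErr = valErr;
                bestEpoch = e;
            } else if (e - bestEpoch >= patience) {
                System.out.println("Stop at epoch " + e + "; best model was from epoch " + bestEpoch);
                break;
            }
        }
    }
}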
22. RPROP (Resilient PROPagation)
- Parameter update rule.
- Learning rate update rule.
- No parameter tuning, unlike standard backpropagation!
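The two update rules were images; the standard RPROP rules (Riedmiller & Braun, 1993), which this slide presumably summarizes, are

\Delta_{ij}(t) =
\begin{cases}
\min\big(\eta^{+}\Delta_{ij}(t-1),\ \Delta_{\max}\big) & \text{if } \partial E/\partial w_{ij}(t)\cdot \partial E/\partial w_{ij}(t-1) > 0\\
\max\big(\eta^{-}\Delta_{ij}(t-1),\ \Delta_{\min}\big) & \text{if } \partial E/\partial w_{ij}(t)\cdot \partial E/\partial w_{ij}(t-1) < 0\\
\Delta_{ij}(t-1) & \text{otherwise}
\end{cases}

w_{ij} \leftarrow w_{ij} - \operatorname{sign}\!\big(\partial E/\partial w_{ij}(t)\big)\, \Delta_{ij}(t)

with the default values \eta^{+} = 1.2 and \eta^{-} = 0.5, which is why no tuning is needed.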
23. Use this to determine
- Number of hidden nodes
- Which input signals to use
- If a pre-processing strategy is good or not
- Etc...
- Variability typically induced by:
  - Varying train and test data sets
  - Random initial model parameters
24. Support vector machines
25. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
26. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
Choose the line with the largest margin: the large margin classifier.
[Plot: the separating line with its margin marked.]
27. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
Choose the line with the largest margin: the large margin classifier.
[Plot: the separating line with its margin marked; the points lying on the margin are the support vectors.]
28. Computing the margin
- The plane separating the two classes is defined by w^T x + a = 0.
- The dashed planes (through the closest examples of each class) are given by w^T x + a = ±b.
[Figure: the separating plane, the normal vector w, and the margin.]
29. Computing the margin
- Divide by b: define the new w = w/b and a = a/b, so that the dashed planes become w^T x + a = ±1.
- We have thereby defined a scale for w and b.
[Figure: the separating plane, the normal vector w, and the margin.]
30. Computing the margin
- Take a point x on one dashed plane; the closest point on the other dashed plane is x + λw.
- We have w^T x + a = -1 and w^T (x + λw) + a = +1, which gives λ = 2 / (w^T w) and a margin |λw| = 2 / |w|.
[Figure: the two dashed planes, the points x and x + λw, and the margin.]
31. Linear classifier on a linearly separable problem
- Maximizing the margin is equal to minimizing |w|, subject to the constraints:
  w^T x(n) + a ≥ +1 for all examples in the positive class,
  w^T x(n) + a ≤ -1 for all examples in the negative class.
- This is a quadratic programming problem; the constraints can be included with Lagrange multipliers.
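With targets t(n) ∈ {-1, +1} encoding the two classes, the problem can be written compactly as

\min_{w,\,a}\ \tfrac{1}{2}\,\|w\|^{2}
\quad \text{subject to} \quad
t(n)\,\big(w^{T}x(n) + a\big) \ge 1 \ \text{ for all } n

(this standard form is assumed; the slide states the two constraints separately for each class).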
32. Quadratic programming problem
- Minimize the cost (the Lagrangian L_p).
- The minimum of L_p occurs at the maximum of the Wolfe dual L_D.
- Only scalar products appear in the cost. IMPORTANT!
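The expressions themselves were images. The standard primal Lagrangian and Wolfe dual for this problem (assumed here) are

L_{p} = \tfrac{1}{2}\,\|w\|^{2} - \sum_{n} \alpha_{n}\Big[t(n)\big(w^{T}x(n) + a\big) - 1\Big], \qquad \alpha_{n} \ge 0

L_{D} = \sum_{n} \alpha_{n} - \tfrac{1}{2}\sum_{n}\sum_{m} \alpha_{n}\alpha_{m}\, t(n)\,t(m)\; x(n)^{T}x(m),
\qquad \text{subject to } \sum_{n}\alpha_{n} t(n) = 0,\ \alpha_{n} \ge 0

Note that the data enter L_D only through the scalar products x(n)^T x(m), which is the point emphasized on the slide.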
33. Linear Support Vector Machine
- Test phase: the predicted output for a new input x.
- Still only scalar products in the expression.
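The prediction formula did not survive the conversion; in the same notation it is

y(x) = \operatorname{sign}\Big(\sum_{n}\alpha_{n}\, t(n)\, x(n)^{T}x \;+\; a\Big)

where the sum effectively runs only over the support vectors (the examples with α_n > 0), and again only scalar products appear.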
34. Example: Robot color vision (Competition 1999)
- Classify the Lego pieces into red, blue, and yellow.
- Classify white balls, the black sideboard, and the green carpet.
35. What the camera sees (RGB space)
[Plot: pixel clusters in RGB space, labeled Yellow, Red, and Green.]
36. Mapping RGB (3D) to rgb (2D)
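The mapping itself is not spelled out in the text, but it is presumably the usual chromaticity normalization r = R/(R+G+B), g = G/(R+G+B) (b = 1 - r - g is then redundant, so 2D suffices). A small Java sketch, my own illustration:

// Map an RGB pixel to normalized rg chromaticity coordinates (illustrative sketch).
public class RgbNormalize {

    static double[] toNormalizedRg(int R, int G, int B) {
        double sum = R + G + B;
        if (sum == 0) return new double[] {1.0 / 3.0, 1.0 / 3.0};  // black pixel: no chromaticity
        return new double[] {R / sum, G / sum};                    // intensity is divided out
    }

    public static void main(String[] args) {
        double[] rg = toNormalizedRg(200, 150, 40);                // a yellowish pixel
        System.out.println("r = " + rg[0] + ", g = " + rg[1]);
    }
}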
37. Lego in normalized rgb space
[Plot: the training data in the normalized (x1, x2) plane.]
- Input is 2D.
- Output is 6D: red, blue, yellow, green, black, white.
38. MLP classifier
- E_train = 0.21, E_test = 0.24
- 2-3-1 MLP, Levenberg-Marquardt
- Training time (150 epochs): 51 seconds
39. SVM classifier
- E_train = 0.19, E_test = 0.20
- SVM with g = 1000
- Training time: 22 seconds
40. Lab 4: Digit recognition
- Inputs (digits) are provided as 32x32 bitmaps. The task is to investigate how well these handwritten digits can be recognized by neural networks.
- The assignment includes changing the program code to answer:
  - How good is the generalization performance? (test data error)
  - Can pre-processing improve performance?
  - What is the best configuration of the network?
41. AppTrain (training code)
public AppTrain() {
    // create a new network of given size
    nn = new NN(32 * 32, 10, seed);

    // create the matrix holding the data and read the data into it;
    // each row contains 32*32 + 1 integers
    file = new TFile("digits.dat");
    System.out.println(file.rows() + " digits have been loaded");

    double[] input = new double[32 * 32];
    double[] target = new double[10];

    // the training session (below) is iterative
    for (int e = 0; e < nEpochs; e++) {
        // reset the error accumulated over each training epoch
        double err = 0;

        // in each epoch, go through all examples/tuples/digits.
        // note: all examples are here used for training, consequently no systematic testing;
        // you may consider dividing the data set into training, testing and validation sets.
        for (int p = 0; p < file.rows(); p++) {
            for (int i = 0; i < 32 * 32; i++)
                input[i] = file.values[p][i];

            // the last value on each row contains the target (0-9);
            // convert it to a double target vector
            for (int i = 0; i < 10; i++) {
                if (file.values[p][32 * 32] == i)
                    target[i] = 1;
                else
                    target[i] = 0;
            }

            // present a sample, calculate errors and adjust weights
            err += nn.train(input, target, eta);
        }
        System.out.println("Epoch " + e + " finished with error " + err / file.rows());
    }

    // save network weights in a file for later use, e.g. in AppDigits
    nn.save("network.m");
}
42. classify (classification method)
/**
 * classify
 * @param map the bitmap on the screen
 * @return int, the most likely digit (0-9) according to the network
 */
public int classify(boolean[][] map) {
    double[] input = new double[32 * 32];
    for (int c = 0; c < map.length; c++) {
        for (int r = 0; r < map[c].length; r++) {
            if (map[c][r])                        // bit set
                input[r * map[r].length + c] = 1;
            else
                input[r * map[r].length + c] = 0;
        }
    }

    // activate the network, produce the output vector
    double[] output = nn.feedforward(input);
    // alternative version, assumes that the network has been trained on an 8x8 map
    // double[] output = nn.feedforward(to8x8(input));

    double highscore = 0;
    int highscoreIndex = 0;
    // print out each output value (gives an idea of the network's support for each digit)
    System.out.println("--------------");
    for (int k = 0; k < 10; k++) {
        System.out.println(k + " " + (double) ((int) (output[k] * 1000) / 1000.0));
        if (output[k] > highscore) {
            highscore = output[k];
            highscoreIndex = k;
        }
    }
    System.out.println("--------------");
    return highscoreIndex;
}