Title: Supervised Learning
1 Supervised Learning
Business School, Institute of Business Informatics
www.wi.hs-wismar.de/laemmel
U.laemmel_at_wi.hs-wismar.de
2 Neural Networks
- Idea
- Artificial Neural Networks
- Supervised Learning
- Unsupervised Learning
- Data Mining and other Techniques
3 Supervised Learning
- Feed-Forward Networks
- Perceptron, AdaLinE, TLU
- Multi-Layer networks
- Backpropagation Algorithm
- Pattern recognition
- Data preparation
- Examples
- Bank Customer
- Customer Relationship
4 Connections
- Feed-forward
- Input layer
- Hidden layer
- Output layer
- Feed-back / auto-associative
- from the (output) layer back to previous (hidden/input) layers
- all neurons fully connected to each other: Hopfield network
5 Perceptron, Adaline, TLU
- one layer of trainable links only
- Adaline: adaptive linear element
- TLU: threshold linear unit
- a class of neural networks with a special architecture
6 Papert, Minsky and the Perceptron - History
"Once upon a time two daughter sciences were born to the new science of cybernetics. One sister was natural, with features inherited from the study of the brain, from the way nature does things. The other was artificial, related from the beginning to the use of computers. But Snow White was not dead. What Minsky and Papert had shown the world as proof was not the heart of the princess; it was the heart of a pig." (Seymour Papert, 1988)
7 Perception
Perception: the first step of recognition; becoming aware of something via the senses.
8 Perceptron
- Input layer
- binary input, passed through
- no trainable links
- Propagation function: net_j = Σ_i o_i·w_ij
- Activation function: o_j = a_j = 1 if net_j ≥ θ_j, 0 otherwise
- A perceptron can learn every function it can represent, in finite time
  (perceptron convergence theorem, F. Rosenblatt).
9 Linearly separable
- Neuron j should output 0 iff neurons 1 and 2 have the same value (o1 = o2), otherwise 1
- net_j = o1·w1j + o2·w2j
- 0·w1j + 0·w2j < θj
- 0·w1j + 1·w2j ≥ θj
- 1·w1j + 0·w2j ≥ θj
- 1·w1j + 1·w2j < θj
[Figure: the separating line determined by w1j, w2j and θj in the (o1, o2) plane]
10 Linearly separable
- net_j = o1·w1j + o2·w2j; the threshold condition net_j = θj describes a line in a 2-dimensional space
- the line would have to divide the plane so that (0,1) and (1,0) lie in a different half-plane than (0,0) and (1,1); no such line exists (see the short derivation below)
- the network cannot solve the problem
- a perceptron can represent only some functions
- → a neural network representing the XOR function needs hidden neurons
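Why no such line exists can be seen by combining the four constraints from slide 9; this short derivation is an addition for clarity, not part of the original slide:

```latex
\underbrace{w_{2j} \ge \theta_j}_{(0,1)} \;\wedge\; \underbrace{w_{1j} \ge \theta_j}_{(1,0)}
\;\Rightarrow\; w_{1j} + w_{2j} \ge 2\theta_j,
\qquad
\underbrace{\theta_j > 0}_{(0,0)} \;\wedge\; \underbrace{w_{1j} + w_{2j} < \theta_j}_{(1,1)}
```

Together these give 2θj ≤ w1j + w2j < θj, hence θj < 0, contradicting θj > 0, so no weights and threshold satisfy all four constraints.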
11Learning is easy
- while input pattern ? ? do begin
- next input patter calculate output
- for each j in OutputNeurons do
- if oj?tj then
- if oj0 then output0, but 1 expected
- for each i in InputNeurons do
wijwijoi - else if oj1 then output1, but 0 expected
- for each i in InputNeurons do
wijwij-oi - end
repeat until desired behaviour
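A minimal Java sketch of this learning rule. Class and method names are my own, and the AND function with a fixed threshold of 1.5 is used as a linearly separable example, since a single perceptron cannot learn XOR:

```java
// Perceptron with binary inputs and the weight-update rule from slide 11.
public class Perceptron {
    private final double[] w;        // one weight w_ij per input neuron
    private final double theta;      // fixed threshold θ_j (not trained here)

    public Perceptron(int inputs, double theta) {
        this.w = new double[inputs];
        this.theta = theta;
    }

    // o_j = 1 if net_j = Σ_i o_i·w_ij ≥ θ_j, else 0
    public int output(int[] o) {
        double net = 0.0;
        for (int i = 0; i < w.length; i++) net += o[i] * w[i];
        return net >= theta ? 1 : 0;
    }

    // one learning step: w_ij := w_ij + o_i if 1 was expected, w_ij := w_ij - o_i if 0 was expected
    public void train(int[] o, int t) {
        if (output(o) == t) return;                 // nothing to do
        for (int i = 0; i < w.length; i++) w[i] += (t == 1 ? o[i] : -o[i]);
    }

    public static void main(String[] args) {
        int[][] patterns = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] targets = {0, 0, 0, 1};               // AND function
        Perceptron p = new Perceptron(2, 1.5);
        for (int epoch = 0; epoch < 20; epoch++)
            for (int k = 0; k < patterns.length; k++) p.train(patterns[k], targets[k]);
        for (int k = 0; k < patterns.length; k++)
            System.out.println(java.util.Arrays.toString(patterns[k]) + " -> " + p.output(patterns[k]));
    }
}
```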
12 Exercise
- Decoding
- input: binary code of a digit
- output: unary representation - as many 1-digits as the digit represents, e.g. 5 → 1 1 1 1 1
- architecture?
13 Exercise
- Decoding
- input: binary code of a digit
- output: classification - 0 → 1st neuron, 1 → 2nd neuron, ..., 5 → 6th neuron, ...
- architecture?
14 Exercises
- Look at the EXCEL file of the decoding problem.
- Implement (in PASCAL/Java) a 4-10 perceptron which transforms the binary representation of a digit (0..9) into a decimal number. Implement the learning algorithm and train the network.
- Which task can be learned faster? (Unary representation or classification)
15 Exercises
- Develop a perceptron for the recognition of the digits 0..9 (pixel representation). Input layer: 3x7 input neurons. Use the SNNS or JavaNNS.
- Can we recognize numbers greater than 9 as well?
- Develop a perceptron for the recognition of capital letters (input layer 5x7).
16 Multi-layer Perceptron
Overcomes the limits of a perceptron:
- several trainable layers
- a two-layer perceptron can classify convex polygons
- a three-layer perceptron can classify any set
multi-layer perceptron = feed-forward network = backpropagation network
17 Multi-layer feed-forward network
18 Feed-Forward Network
19 Evaluation of the net output in a feed-forward network
20 Backpropagation Learning Algorithm
- supervised learning
- the error is a function of the weights: E(W) = E(w1, w2, ..., wn)
- we are looking for a minimal error
- minimal error = hollow in the error surface
- backpropagation uses the gradient for weight adaptation (see the standard formulation below)
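The slide graphics did not survive extraction; as a point of reference, a standard formulation of the error function and the gradient-descent step (with t_j the teaching output, o_j the calculated output, and η the learning rate) is:

```latex
E(W) = \frac{1}{2}\sum_{p}\sum_{j}\bigl(t_{pj} - o_{pj}\bigr)^{2},
\qquad
\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}
```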
21 Error curve
[Figure: error surface over two weights, weight1 and weight2]
22 Problem
[Figure: network with input layer, hidden layer, output and teaching output]
- error in the output layer = difference between output and teaching output
- error in a hidden layer?
23 Gradient descent
- Gradient
- vector orthogonal to the contour lines of a surface, pointing in the direction of the steepest slope
- the derivative of a function in a certain direction is the projection of the gradient onto this direction
[Figure: example of an error curve over a weight w_i]
24 Example: Newton Approximation
- calculation of the square root of 5
- f(x) = x² - 5
- x0 = 2
- x1 = ½(x0 + 5/x0) = 2.25
- x2 = ½(x1 + 5/x1) = 2.2361
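For context: the iteration on this slide is the general Newton step x_{n+1} = x_n - f(x_n)/f'(x_n) applied to f(x) = x² - 5; this intermediate step is added here, it is not on the slide:

```latex
x_{n+1} = x_n - \frac{x_n^{2} - 5}{2x_n} = \tfrac{1}{2}\Bigl(x_n + \frac{5}{x_n}\Bigr)
```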
25 Backpropagation - Learning
- gradient-descent algorithm
- supervised learning: an error signal is used for weight adaptation
- error signal δ_j
- teaching output minus calculated output, if j is an output neuron
- weighted sum of the error signals of the successor neurons, if j is a hidden neuron
- weight adaptation: Δw_ij = η·δ_j·o_i
- η: learning rate
- δ_j: error signal
26 Standard Backpropagation Rule
- gradient descent requires the derivative of the activation function
- logistic function:
  f'_act(net_j) = f_act(net_j)·(1 - f_act(net_j)) = o_j·(1 - o_j)
- the error signal δ_j is therefore
  δ_j = o_j·(1 - o_j)·(t_j - o_j) for an output neuron,
  δ_j = o_j·(1 - o_j)·Σ_k δ_k·w_jk for a hidden neuron
  (a code sketch follows below)
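A minimal Java sketch of these formulas for a 2-2-1 network trained on XOR (the exercise from slide 34). The class name, the use of bias inputs, the learning rate and the number of epochs are my own illustrative choices:

```java
// Backpropagation with logistic activation for a 2-2-1 network, following slides 25/26:
// delta_out = o(1-o)(t-o), delta_hidden = o(1-o)*sum_k delta_k*w_jk, w_ij += eta*delta_j*o_i.
import java.util.Random;

public class BackpropXor {
    static final double ETA = 0.5;                   // learning rate η
    static double[][] wIH = new double[3][2];        // input(+bias) → hidden weights
    static double[] wHO = new double[3];             // hidden(+bias) → output weights

    static double logistic(double net) { return 1.0 / (1.0 + Math.exp(-net)); }

    // forward pass; fills h[0..1] with the hidden outputs (h[2] is the bias) and returns the net output
    static double forward(double[] o, double[] h) {
        for (int j = 0; j < 2; j++) {
            double net = 0;
            for (int i = 0; i < 3; i++) net += o[i] * wIH[i][j];
            h[j] = logistic(net);
        }
        h[2] = 1.0;
        double net = 0;
        for (int j = 0; j < 3; j++) net += h[j] * wHO[j];
        return logistic(net);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        for (int i = 0; i < 3; i++) {                // small random initial weights
            wHO[i] = rnd.nextDouble() - 0.5;
            for (int j = 0; j < 2; j++) wIH[i][j] = rnd.nextDouble() - 0.5;
        }
        double[][] in = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] target = {0, 1, 1, 0};              // XOR

        for (int epoch = 0; epoch < 20000; epoch++) {
            for (int p = 0; p < in.length; p++) {
                double[] o = {in[p][0], in[p][1], 1.0};   // index 2 = constant bias input
                double[] h = new double[3];
                double out = forward(o, h);

                // error signals (slide 26)
                double deltaOut = out * (1 - out) * (target[p] - out);
                double[] deltaH = new double[2];
                for (int j = 0; j < 2; j++)
                    deltaH[j] = h[j] * (1 - h[j]) * deltaOut * wHO[j];

                // weight adaptation (slide 25): w_ij := w_ij + η·δ_j·o_i
                for (int j = 0; j < 3; j++) wHO[j] += ETA * deltaOut * h[j];
                for (int j = 0; j < 2; j++)
                    for (int i = 0; i < 3; i++) wIH[i][j] += ETA * deltaH[j] * o[i];
            }
        }
        for (int p = 0; p < in.length; p++)
            System.out.printf("%.0f XOR %.0f -> %.3f%n", in[p][0], in[p][1],
                    forward(new double[]{in[p][0], in[p][1], 1.0}, new double[3]));
    }
}
```

With a suitable random initialisation the four outputs approach 0, 1, 1, 0.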
27 Backpropagation
- Examples
- XOR (Excel)
- Bank Customer
28 Backpropagation - Problems
29 Backpropagation - Problems
- A: flat plateau
- weight adaptation is slow
- finding a minimum takes a lot of time
- B: oscillation in a narrow gorge
- it jumps from one side to the other and back
- C: leaving a minimum
- if the modification in one training step is too high, the minimum can be lost
30 Solutions: looking at the values
- change the parameters of the logistic function in order to get other values
- the modification of a weight depends on the output: if o_i = 0, no modification takes place
- if we use binary input we probably have a lot of zero values: change {0, 1} into {-½, ½} or {-1, 1}
- use another activation function, e.g. tanh, and use values in -1..1
31 Solution: Quickprop
- assumption: the error curve is a square (quadratic) function
- calculate the vertex of the parabola (see the update formula below)
- S(t): slope of the error curve at time t
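Stepping to the vertex of the parabola fitted through the last two slopes gives the textbook Quickprop update; the formula itself did not survive the extraction, so this is the standard form rather than a reconstruction of the slide:

```latex
\Delta w_{ij}(t) = \frac{S(t)}{S(t-1) - S(t)}\,\Delta w_{ij}(t-1),
\qquad
S(t) = \frac{\partial E}{\partial w_{ij}}(t)
```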
32 Resilient Propagation (RPROP)
- the sign and the size of the weight modification are calculated separately
- b_ij(t): size of the modification
  b_ij(t) = b_ij(t-1)·η⁺  if S(t-1)·S(t) > 0
          = b_ij(t-1)·η⁻  if S(t-1)·S(t) < 0
          = b_ij(t-1)     otherwise
- η⁺ > 1: both slopes have the same sign → bigger step; 0 < η⁻ < 1: the signs differ → smaller step
  Δw_ij(t) = -b_ij(t)            if S(t-1) > 0 and S(t) > 0
           = +b_ij(t)            if S(t-1) < 0 and S(t) < 0
           = -Δw_ij(t-1)         if S(t-1)·S(t) < 0   (*)
           = -sgn(S(t))·b_ij(t)  otherwise
- (*) S(t) is set to 0, so that at time (t+1) the 4th case will be applied
  (a per-weight code sketch follows below)
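A per-weight Java sketch of this update rule. The field names, the initial step size 0.1, and the values η⁺ = 1.2, η⁻ = 0.5 are common choices of mine, not given on the slide:

```java
// RPROP update for a single weight, following the case distinction on slide 32.
public class RpropWeight {
    double w = 0.0;          // the weight w_ij
    double b = 0.1;          // step size b_ij(t), initial value assumed
    double prevS = 0.0;      // previous slope S(t-1) = dE/dw
    double prevDw = 0.0;     // previous weight change Δw_ij(t-1)
    static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;

    // s = current slope S(t) of the error curve with respect to this weight
    void update(double s) {
        double dw;
        if (prevS * s > 0) {                 // same sign: enlarge the step
            b *= ETA_PLUS;
            dw = -Math.signum(s) * b;
        } else if (prevS * s < 0) {          // sign change: the minimum was jumped over
            b *= ETA_MINUS;
            dw = -prevDw;                    // take the last step back
            s = 0;                           // so the 4th case applies at time t+1
        } else {                             // prevS == 0 or s == 0
            dw = -Math.signum(s) * b;
        }
        w += dw;
        prevDw = dw;
        prevS = s;
    }

    public static void main(String[] args) {
        RpropWeight rw = new RpropWeight();
        // toy error E(w) = (w - 3)^2 with slope S = 2(w - 3); the weight approaches 3
        for (int t = 0; t < 30; t++) rw.update(2 * (rw.w - 3));
        System.out.println("weight after 30 RPROP steps: " + rw.w);
    }
}
```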
33 Limits of the Learning Algorithm
- it is not a model of biological learning
- there is no teaching output in natural learning
- there are no such feedback paths in a natural neural network (at least nobody has discovered them yet)
- training an ANN is rather time consuming
34 Exercise - JavaNNS
- Implement a feed-forward network consisting of 2 input neurons, 2 hidden neurons and one output neuron. Train the network so that it simulates the XOR function.
- Implement a 4-2-4 network which works like the identity function (Encoder-Decoder network). Try other versions: 4-3-4, 8-4-8, ... What can you say about the training effort?
35 Pattern Recognition
[Figure: network with input layer, 1st hidden layer, 2nd hidden layer and output layer]
36 Example: Pattern Recognition
JavaNNS example "Font"
37 Font Example
- input: 24x24 pixel array
- output layer: 75 neurons, one neuron for each character
- digits
- letters (lower case, capital)
- separators and operator characters
- two hidden layers of 4x6 neurons each
- all neurons of a row of the input layer are linked to one neuron of the first hidden layer
- all neurons of a column of the input layer are linked to one neuron of the second hidden layer
38 Exercise
- load the network font_untrained
- train the network, use various learning algorithms
  (look at the SNNS documentation for the parameters and their meaning)
- Backpropagation: η = 2.0
- Backpropagation with momentum: η = 0.8, mu = 0.6, c = 0.1
- Quickprop: η = 0.1, mg = 2.0, n = 0.0001
- Rprop: η = 0.6
- use various values for learning parameter, momentum, and noise
- learning parameter: 0.2, 0.3, 0.5, 1.0
- momentum: 0.9, 0.7, 0.5, 0.0
- noise: 0.0, 0.1, 0.2
39 Example: Bank Customer
A1: credit history, A2: debt, A3: collateral, A4: income
- the network architecture depends on the coding of input and output
- How can we code values like "good", "bad", 1, 2, 3, ...?
40 Data Pre-processing
- objectives
- prospects of better results
- adaptation to algorithms
- data reduction
- troubleshooting
- methods
- selection and integration
- completion
- transformation
- normalization
- coding
- filter
41 Selection and Integration
- unification of data (different origins)
- selection of attributes/features
- reduction
- omit obviously non-relevant data
- all values are equal
- key values
- meaning not relevant
- data protection
42 Completion / Cleaning
- missing values
- ignore / omit the attribute
- add values
- manually
- global constant ("missing value")
- average
- most probable value
- remove the data set
- noisy data
- inconsistent data
43 Transformation
- Normalization
- Coding
- Filter
44 Normalization of values
- normalization, equally distributed
- in the range [0, 1], e.g. for the logistic function:
  act = (x - minValue) / (maxValue - minValue)
- in the range [-1, 1], e.g. for the activation function tanh:
  act = (x - minValue) / (maxValue - minValue) · 2 - 1
- logarithmic normalization:
  act = (ln(x) - ln(minValue)) / (ln(maxValue) - ln(minValue))
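The three normalizations as small Java helpers; this is a sketch, and the method names and example values are my own:

```java
// Normalizations from slide 44.
public class Normalization {
    // range [0,1], e.g. for the logistic activation function
    static double toUnitRange(double x, double min, double max) {
        return (x - min) / (max - min);
    }
    // range [-1,1], e.g. for tanh
    static double toSymmetricRange(double x, double min, double max) {
        return (x - min) / (max - min) * 2.0 - 1.0;
    }
    // logarithmic normalization; assumes x, min, max > 0
    static double logNormalize(double x, double min, double max) {
        return (Math.log(x) - Math.log(min)) / (Math.log(max) - Math.log(min));
    }

    public static void main(String[] args) {
        System.out.println(toUnitRange(2500, 1000, 5000));      // 0.375
        System.out.println(toSymmetricRange(2500, 1000, 5000)); // -0.25
    }
}
```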
45 Binary Coding of nominal values I
- no order relation, n values
- n neurons, each neuron represents one and only one value
- example: red, blue, yellow, white, black →
  red = 1,0,0,0,0; blue = 0,1,0,0,0; yellow = 0,0,1,0,0; ...
- disadvantage: n neurons are necessary → lots of zeros in the input
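A minimal sketch of this 1-of-n coding in Java; class and method names are illustrative, not from the slides:

```java
// 1-of-n ("one-hot") coding of a nominal attribute, as described on slide 45.
import java.util.List;

public class OneHotCoder {
    private final List<String> values;
    public OneHotCoder(List<String> values) { this.values = values; }

    // returns a vector with a single 1 at the position of the given value
    public double[] encode(String value) {
        double[] v = new double[values.size()];
        int idx = values.indexOf(value);
        if (idx < 0) throw new IllegalArgumentException("unknown value: " + value);
        v[idx] = 1.0;
        return v;
    }

    public static void main(String[] args) {
        OneHotCoder color = new OneHotCoder(List.of("red", "blue", "yellow", "white", "black"));
        System.out.println(java.util.Arrays.toString(color.encode("blue"))); // [0.0, 1.0, 0.0, 0.0, 0.0]
    }
}
```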
46 Bank Customer
Are these customers good ones?
No.  Credit history  Debt  Collateral  Income
1    bad             high  adequate    3
2    good            low   adequate    2
47 The Problem: A Mailing Action
Data Mining Cup 2002
- mailing action of a company
- special offer
- estimated annual income per customer
- given
- 10,000 sets of customer data, containing 1,000 cancellers (training)
- problem
- test set containing another 10,000 customer data sets
- Who will cancel? Whom to send an offer?
customer        will cancel   will not cancel
gets an offer      43.80           66.30
gets no offer       0.00           72.00
48 Mailing Action: Aim?
customer        will cancel   will not cancel
gets an offer      43.80           66.30
gets no offer       0.00           72.00
- no mailing action:
  9,000 × 72.00 = 648,000
- everybody gets an offer:
  1,000 × 43.80 + 9,000 × 66.30 = 640,500
- maximum (100% correct classification):
  1,000 × 43.80 + 9,000 × 72.00 = 691,800
49 Goal Function: Lift
customer        will cancel   will not cancel
gets an offer      43.80           66.30
gets no offer       0.00           72.00
- basis: no mailing action = 9,000 × 72.00
- goal: extra income
- lift_M = 43.80·c_M + 66.30·nk_M - 72.00·nk_M,
  where c_M is the number of cancellers and nk_M the number of non-cancellers in the mailing set M
  (cancellers in M earn 43.80 instead of 0.00; non-cancellers in M earn 66.30 instead of 72.00; a code sketch follows below)
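A small Java sketch of this goal function, assuming the reading above of c_M and nk_M as counts within the mailing set M; the names are mine:

```java
// Lift (extra income compared with no mailing) of a chosen mailing set M, slide 49.
public class MailingLift {
    // cM: cancellers in M who get an offer, nkM: non-cancellers in M who get an offer
    static double lift(int cM, int nkM) {
        return 43.80 * cM + 66.30 * nkM - 72.00 * nkM;
    }

    public static void main(String[] args) {
        // Sending the offer to everybody (1,000 cancellers, 9,000 loyal customers):
        // 43,800 - 51,300 = -7,500, matching the difference 640,500 - 648,000 on slide 48.
        System.out.printf("%.2f%n", lift(1000, 9000));
    }
}
```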
50 Data
[Figure: customer data table - 32 input attributes, <important results>, missing values]
51 Feed-Forward Network: What to do?
- train the net with the training set (10,000 records)
- test the net using the test set (another 10,000)
- classify all 10,000 customers into cancellers or loyal customers
- evaluate the additional income
52 Results
Data Mining Cup 2002
- gain
- additional income of the mailing action, if the target group was chosen according to the analysis
53 Review: Students' Project
- copy of the Data Mining Cup
- real data
- known results
- contest
- wishes
- engineering approach to data mining
- real data for teaching purposes
54 Data Mining Cup 2007
- started on April 10.
- check-out couponing
- Who will get a rebate coupon?
- 50,000 data sets for training
55 Data
56 DMC 2007
- 75% of the training data have the output N(o)
- i.e. a classification has to be more than 75% correct to be useful!
- first experiments: no success
- deadline: May 31st
57 Optimization of Neural Networks
- objectives
- good results in an application: better generalisation (improve correctness)
- faster processing of patterns (improve efficiency)
- good presentation of the results (improve comprehension)
58 Ability to generalize
- a trained net can classify data (out of the same class as the learning data) that it has never seen before
- the aim of every ANN development
- network too large
- all training patterns are memorized
- no ability to generalize
- network too small
- the rules of pattern recognition cannot be learned (simple example: perceptron and XOR)
59 Development of an NN application
60 Possible Changes
- architecture of the NN
- size of the network
- shortcut connections
- partially connected layers
- remove/add links
- receptive areas
- find the right parameter values
- learning parameter
- size of layers
- using genetic algorithms
61 Memory Capacity
Number of patterns a network can store without generalisation
- to figure out the memory capacity:
- change the output layer: output layer := copy of the input layer
- train the network with an increasing number of random patterns
- error becomes small: the network stores all patterns
- error remains: the network cannot store all patterns
- in between: memory capacity
62 Memory Capacity - Experiment
- the output layer is a copy of the input layer
- training set consisting of n random patterns
- error
- error ≈ 0: the network can store more than n patterns
- error >> 0: the network cannot store n patterns
- memory capacity n: error > 0, while the error is ≈ 0 for n-1 patterns and >> 0 for n+1 patterns
63 Layers Not Fully Connected
- partially connected (e.g. 75%)
- remove links if their weight has been near 0 for several training steps
- build new connections (by chance)
64 Summary
- Feed-forward networks
- Perceptron (has limits)
- Learning is math
- Backpropagation is a backpropagation-of-error algorithm
- works like gradient descent
- activation functions: logistic, tanh
- applications in Data Mining and Pattern Recognition
- data preparation is important
- finding an appropriate architecture