Supervised Learning - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Supervised Learning


1
Supervised Learning
Business School, Institute of Business Informatics
  • Uwe Lämmel

www.wi.hs-wismar.de/laemmel
U.laemmel@wi.hs-wismar.de
2
Neural Networks
  • Idea
  • Artificial Neural Networks
  • Supervised Learning
  • Unsupervised Learning
  • Data Mining and other Techniques

3
Supervised Learning
  • Feed-Forward Networks
  • Perceptron, Adaline, TLU
  • Multi-layer networks
  • Backpropagation Algorithm
  • Pattern recognition
  • Data preparation
  • Examples
  • Bank Customer
  • Customer Relationship

4
Connections
  • Feed-forward
  • Input layer
  • Hidden layer
  • Output layer
  • Feed-back / auto-associative
  • From (output) layer back to previous
    (hidden/input) layer
  • All neurons fully connected to each other

Hopfield network
5
Perceptron, Adaline, TLU
  • one layer of trainable links only
  • Adaline: adaptive linear element
  • TLU: threshold linear unit
  • a class of neural networks with a special
    architecture

6
Papert, Minsky and Perceptron - History
"Once upon a time two daughter sciences were born
to the new science of cybernetics. One sister
was natural, with features inherited from the
study of the brain, from the way nature does
things. The other was artificial, related from
the beginning to the use of computers. But
Snow White was not dead. What Minsky and Papert
had shown the world as proof was not the heart of
the princess; it was the heart of a pig." Seymour
Papert, 1988
7
Perception
Perception: the first step of recognition, becoming
aware of something via the senses
8
Perceptron
  • Input layer
  • binary input, passed through,
  • no trainable links
  • Propagation function: net_j = Σ_i o_i · w_ij
  • Activation function: o_j = a_j = 1 if net_j ≥ θ_j,
    0 otherwise  (see the Java sketch below)
  • A perceptron can learn every function it can
    represent, and it learns it in finite time
    (perceptron convergence theorem, F. Rosenblatt)
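A minimal Java sketch of the propagation and activation functions above. The class name, the example inputs, weights and the threshold value are illustrative assumptions, not part of the slide.

```java
// One perceptron neuron j:
// net_j = sum_i o_i * w_ij ;  o_j = 1 if net_j >= theta_j, else 0
public class PerceptronNeuron {

    // propagation function: weighted sum of the (binary) inputs
    static double net(double[] o, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < o.length; i++) {
            sum += o[i] * w[i];
        }
        return sum;
    }

    // activation function: threshold comparison
    static int output(double[] o, double[] w, double theta) {
        return net(o, w) >= theta ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] input = {1, 0, 1};           // binary input pattern (example values)
        double[] weights = {0.5, -0.3, 0.8};  // illustrative weights
        System.out.println(output(input, weights, 1.0));  // 1.3 >= 1.0 -> prints 1
    }
}
```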

9
Linear separable
  • Neuron j should be 0 iff both neurons 1 and 2
    have the same value (o1 = o2), otherwise 1
  • net_j = o1·w1j + o2·w2j
  • 0·w1j + 0·w2j < θ_j
  • 0·w1j + 1·w2j ≥ θ_j
  • 1·w1j + 0·w2j ≥ θ_j
  • 1·w1j + 1·w2j < θ_j

[Figure: neuron j with threshold θ_j and incoming weights w1j, w2j]
10
Linear separable
  • net_j = o1·w1j + o2·w2j
  • → a line in a 2-dimensional space
  • no line divides the plane so that (0,1) and (1,0)
    lie on one side and (0,0) and (1,1) on the other
  • → the network cannot solve the problem
  • a perceptron can represent only some functions
  • → a neural network representing the XOR function
    needs hidden neurons

11
Learning is easy
  • while input patterns are left do begin
  • next input pattern; calculate output
  • for each j in OutputNeurons do
  • if o_j ≠ t_j then
  • if o_j = 0 then (output = 0, but 1 expected)
  • for each i in InputNeurons do
    w_ij := w_ij + o_i
  • else if o_j = 1 then (output = 1, but 0 expected)
  • for each i in InputNeurons do
    w_ij := w_ij − o_i
  • end

repeat until desired behaviour
(a runnable Java version of this rule follows below)
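The same error-correction rule as a runnable Java sketch. The training task (logical OR) and the fixed threshold are assumptions chosen only to make the example self-contained.

```java
import java.util.Arrays;

// Perceptron learning rule from the slide: if the output is 0 but 1 was expected,
// add the input to the weights; if it is 1 but 0 was expected, subtract it.
public class PerceptronLearning {

    public static void main(String[] args) {
        // Illustrative task: learn the logical OR function (linearly separable).
        int[][] patterns = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] targets = {0, 1, 1, 1};
        double[] w = new double[2];
        double theta = 0.5;                    // fixed threshold (assumption for this sketch)

        boolean allCorrect;
        do {                                   // repeat until desired behaviour
            allCorrect = true;
            for (int p = 0; p < patterns.length; p++) {
                int[] o = patterns[p];
                double net = o[0] * w[0] + o[1] * w[1];
                int out = net >= theta ? 1 : 0;
                if (out != targets[p]) {       // o_j != t_j
                    allCorrect = false;
                    int sign = (out == 0) ? +1 : -1;   // add or subtract the input
                    for (int i = 0; i < w.length; i++) {
                        w[i] += sign * o[i];
                    }
                }
            }
        } while (!allCorrect);

        System.out.println("learned weights: " + Arrays.toString(w));
    }
}
```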
12
Exercise
  • Decoding
  • input: binary code of a digit
  • output: unary representation, as many 1-digits
    as the digit represents, e.g. 5 → 1 1 1 1 1
  • architecture

13
Exercise
  • Decoding
  • input: binary code of a digit
  • output: classification (0 → 1st neuron, 1 → 2nd
    neuron, ..., 5 → 6th neuron, ...)
  • architecture

14
Exercises
  1. Look at the EXCEL-file of the decoding problem
  2. Implement (in PASCAL/Java) a 4-10-Perceptron
    which transforms a binary representation of a
    digit (0..9) into a decimal number. Implement
    the learning algorithm and train the network.
  3. Which task can be learned faster?
    (unary representation or classification)

15
Exercises
  1. Develop a perceptron for the recognition of the
    digits 0..9 (pixel representation). Input layer:
    3x7 input neurons. Use SNNS or JavaNNS.
  2. Can we recognize numbers greater than 9 as well?
  3. Develop a perceptron for the recognition of
    capital letters (input layer: 5x7).

16
Multi-layer Perceptron
Overcomes the limits of the perceptron
  • several trainable layers
  • a two-layer perceptron can classify convex
    polygons
  • a three-layer perceptron can classify any sets

multi-layer perceptron = feed-forward
network = backpropagation network
17
Multi-layer feed-forward network
18
Feed-Forward Network
19
Evaluation of the net output in a feed-forward
network
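To illustrate how the net output is evaluated layer by layer, here is a small Java sketch of a forward pass with the logistic activation function; the layer sizes and weight values are arbitrary example assumptions.

```java
// Forward pass through a feed-forward network: each layer's output is the
// weighted sum of the previous layer's outputs, passed through the logistic function.
public class ForwardPass {

    static double logistic(double net) {
        return 1.0 / (1.0 + Math.exp(-net));
    }

    // Computes the output of one layer; w[j][i] is the weight from input i to neuron j.
    static double[] layerOutput(double[] input, double[][] w) {
        double[] out = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            double net = 0.0;
            for (int i = 0; i < input.length; i++) {
                net += input[i] * w[j][i];
            }
            out[j] = logistic(net);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] input = {1.0, 0.0};                       // 2 input neurons
        double[][] hiddenW = {{0.4, -0.7}, {0.9, 0.2}};    // 2 hidden neurons
        double[][] outputW = {{1.2, -0.8}};                // 1 output neuron
        double[] hidden = layerOutput(input, hiddenW);
        double[] output = layerOutput(hidden, outputW);
        System.out.println("network output: " + output[0]);
    }
}
```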
20
Backpropagation-Learning Algorithm
  • supervised learning
  • the error is a function of the weights: E(W) =
    E(w1, w2, ..., wn)
  • we are looking for a minimal error
  • minimal error = hollow in the error surface
  • backpropagation uses the gradient for the weight
    adaptation

21
Error curve
[Figure: error surface over the two weights weight1 and weight2]
22
Problem
[Figure: network with input layer, hidden layer, output layer, and teaching output]
  • error in the output layer:
    difference between output and teaching output
  • error in a hidden layer?
23
Gradient descent
  • Gradient
  • vector orthogonal to a (level) surface, pointing
    in the direction of the steepest slope
  • the derivative of a function in a certain direction
    is the projection of the gradient onto this
    direction

example of an error curve over a single weight wi
24
Example: Newton Approximation
  • calculation of a root, here √5  (see the sketch below)
  • f(x) = x² − 5
  • x0 = 2
  • x1 = ½·(x0 + 5/x0) = 2.25
  • x2 = ½·(x1 + 5/x1) ≈ 2.2361
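The same iteration in a few lines of Java (Newton's method applied to f(x) = x² − 5, i.e. the approximation of √5):

```java
// Newton approximation of sqrt(5): x_{n+1} = 1/2 * (x_n + 5 / x_n)
public class NewtonSqrt {
    public static void main(String[] args) {
        double x = 2.0;                       // start value
        for (int n = 0; n < 5; n++) {
            x = 0.5 * (x + 5.0 / x);          // one Newton step for f(x) = x^2 - 5
            System.out.println(x);            // 2.25, 2.2361..., ...
        }
    }
}
```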

25
Backpropagation - Learning
  • gradient-descent algorithm
  • supervised learning: an error signal is used for
    the weight adaptation
  • error signal δ_j:
  • teaching output − calculated output, if j is an
    output neuron
  • weighted sum of the error signals of the successor
    neurons, if j is a hidden neuron
  • weight adaptation: Δw_ij = η · o_i · δ_j
  • η: learning rate
  • δ: error signal

26
Standard-Backpropagation Rule
  • gradient descent uses the derivative of the
    activation function
  • logistic function:
    f'_act(net_j) = f_act(net_j)·(1 − f_act(net_j))
    = o_j·(1 − o_j)
  • the error signal δ_j is therefore:
    δ_j = o_j·(1 − o_j)·(t_j − o_j), if j is an output neuron
    δ_j = o_j·(1 − o_j)·Σ_k δ_k·w_jk, if j is a hidden neuron
    (see the sketch below)
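A Java sketch of the two error-signal cases and of the weight adaptation Δw_ij = η·o_i·δ_j. It shows only the per-neuron rule, not a complete training loop; all variable names and example numbers are assumptions for the sketch.

```java
// Backpropagation error signals for the logistic activation function:
//   output neuron:  delta_j = o_j * (1 - o_j) * (t_j - o_j)
//   hidden neuron:  delta_j = o_j * (1 - o_j) * sum_k(delta_k * w_jk)
public class BackpropRule {

    static double deltaOutput(double o, double t) {
        return o * (1 - o) * (t - o);
    }

    static double deltaHidden(double o, double[] deltaSucc, double[] wToSucc) {
        double sum = 0.0;
        for (int k = 0; k < deltaSucc.length; k++) {
            sum += deltaSucc[k] * wToSucc[k];   // weighted error of the successor neurons
        }
        return o * (1 - o) * sum;
    }

    // weight adaptation: w_ij := w_ij + eta * o_i * delta_j
    static double adaptWeight(double wij, double eta, double oi, double deltaj) {
        return wij + eta * oi * deltaj;
    }

    public static void main(String[] args) {
        double delta = deltaOutput(0.8, 1.0);          // output 0.8, teaching output 1.0
        System.out.println(adaptWeight(0.3, 0.5, 1.0, delta));
    }
}
```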

27
Backpropagation
  • Examples
  • XOR (Excel)
  • Bank Customer

28
Backpropagation - Problems
29
Backpropagation-Problems
  • A: a flat plateau
  • weight adaptation is slow
  • finding a minimum takes a lot of time
  • B: oscillation in a narrow gorge
  • it jumps from one side to the other and back
  • C: leaving a minimum
  • if the modification in one training step is too
    large, the minimum can be missed

30
Solutions: looking at the values
  • change the parameters of the logistic function in
    order to get other values
  • the modification of a weight depends on the output:
    if o_i = 0, no modification will take place
  • if we use binary input, we probably have a lot of
    zero values: change {0, 1} into {−½, ½} or {−1, 1}
  • use another activation function, e.g. tanh, and use
    values in −1..1

31
Solution: Quickprop
  • assumption: the error curve is a quadratic function
    (a parabola)
  • calculate the vertex of the curve (see the sketch below)

slope of the error curve
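A minimal sketch of the resulting step, assuming the usual parabola fit through the last two slopes S(t−1) and S(t); this follows the textbook Quickprop formula and is not necessarily the exact SNNS implementation.

```java
// Quickprop: assume the error curve over one weight is a parabola and jump to its vertex.
// delta_w(t) = S(t) / (S(t-1) - S(t)) * delta_w(t-1)
public class QuickpropStep {

    static double step(double slopePrev, double slopeNow, double deltaWPrev) {
        return slopeNow / (slopePrev - slopeNow) * deltaWPrev;
    }

    public static void main(String[] args) {
        // Example: the slope shrank from 0.6 to 0.2 after a step of 0.1
        // -> the vertex is estimated 0.05 further in the same direction.
        System.out.println(step(0.6, 0.2, 0.1));   // prints 0.05 (= 0.2 / 0.4 * 0.1)
    }
}
```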
32
Resilient Propagation (RPROP)
  • sign and size of the weight modification are
    calculated separately; b_ij(t) = size of the
    modification, S(t) = slope of the error curve
  • b_ij(t) = b_ij(t−1)·η+  if S(t−1)·S(t) > 0
    b_ij(t) = b_ij(t−1)·η−  if S(t−1)·S(t) < 0
    b_ij(t) = b_ij(t−1)     otherwise
  • η+ > 1: both slopes have the same sign → bigger step
    0 < η− < 1: the slopes differ → smaller step
  • Δw_ij(t) = −b_ij(t)            if S(t−1) > 0 and S(t) > 0
    Δw_ij(t) = +b_ij(t)            if S(t−1) < 0 and S(t) < 0
    Δw_ij(t) = −Δw_ij(t−1)         if S(t−1)·S(t) < 0  (*)
    Δw_ij(t) = −sgn(S(t))·b_ij(t)  otherwise
  • (*) S(t) is then set to 0, so that at time (t+1) the
    4th case will be applied  (see the sketch below)
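The rule above as a Java sketch for a single weight. The values η+ = 1.2 and η− = 0.5 are common defaults and an assumption here, as is the initial step size.

```java
// RPROP for one weight: the sign of the slope S decides the direction,
// the step size b grows or shrinks depending on whether the slope changed sign.
public class RpropStep {
    static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;   // common defaults (assumption)

    double b = 0.1;          // current step size b_ij(t)
    double prevSlope = 0.0;  // S(t-1)
    double prevDeltaW = 0.0; // previous weight change

    // Returns the weight change delta_w_ij(t) for the current slope S(t).
    double update(double slope) {
        double deltaW;
        if (prevSlope * slope > 0) {            // same sign -> bigger step
            b *= ETA_PLUS;
            deltaW = -Math.signum(slope) * b;
        } else if (prevSlope * slope < 0) {     // sign changed -> smaller step, step back
            b *= ETA_MINUS;
            deltaW = -prevDeltaW;
            slope = 0.0;                        // (*) so that the last case applies next time
        } else {                                // one of the slopes is zero
            deltaW = -Math.signum(slope) * b;
        }
        prevSlope = slope;
        prevDeltaW = deltaW;
        return deltaW;
    }

    public static void main(String[] args) {
        RpropStep r = new RpropStep();
        System.out.println(r.update(0.4));   // first step:  -0.1
        System.out.println(r.update(0.3));   // same sign:   -0.12
        System.out.println(r.update(-0.2));  // sign change: step back (+0.12)
    }
}
```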

33
Limits of the Learning Algorithm
  • it is not a model of biological learning
  • there is no teaching output in natural learning
  • there are no such feedback paths in a natural neural
    network (at least none has been discovered yet)
  • training an ANN is rather time-consuming

34
Exercise - JavaNNS
  • Implement a feed-forward network consisting of 2
    input neurons, 2 hidden neurons and one output
    neuron. Train the network so that it simulates
    the XOR function.
  • Implement a 4-2-4 network which works like the
    identity function (encoder-decoder network).
    Try other versions: 4-3-4, 8-4-8, ... What can
    you say about the training effort?

35
Pattern Recognition
[Figure: layered network - input layer, 1st hidden layer, 2nd hidden layer, output layer]
36
Example: Pattern Recognition
JavaNNS example: Font
37
Font Example
  • input: 24x24 pixel array
  • output layer: 75 neurons, one neuron for each
    character
  • digits
  • letters (lower case, capital)
  • separators and operator characters
  • two hidden layers of 4x6 neurons each
  • all neurons of a row of the input layer are linked
    to one neuron of the first hidden layer
  • all neurons of a column of the input layer are
    linked to one neuron of the second hidden layer

38
Exercise
  • load the network font_untrained
  • train the network, use various learning
    algorithms
  • (look at the SNNS documentation for the
    parameters and their meaning)
  • Backpropagation: η = 2.0
  • Backpropagation with momentum: η = 0.8, μ = 0.6,
    c = 0.1
  • Quickprop: η = 0.1, μ = 2.0, ν = 0.0001
  • Rprop: η = 0.6
  • use various values for the learning parameter,
    momentum, and noise
  • learning parameter: 0.2, 0.3, 0.5, 1.0
  • momentum: 0.9, 0.7, 0.5, 0.0
  • noise: 0.0, 0.1, 0.2

39
Example: Bank Customer
A1: credit history, A2: debt, A3: collateral,
A4: income
  • the network architecture depends on the coding of
    input and output
  • how can we code values like "good", "bad", 1, 2,
    3, ...?

40
Data Pre-processing
  • objectives
  • prospects of better results
  • adaptation to algorithms
  • data reduction
  • troubleshooting
  • methods
  • selection and integration
  • completion
  • transformation
  • normalization
  • coding
  • filter

41
Selection and Integration
  • unification of data (different origins)
  • selection of attributes/features
  • reduction
  • omit obviously non-relevant data
  • all values are equal
  • key values
  • meaning not relevant
  • data protection

42
Completion / Cleaning
  • Missing values
  • ignore / omit attribute
  • add values
  • manual
  • global constant (missing value)
  • average
  • highly probable value
  • remove data set
  • noisy data
  • inconsistent data

43
Transformation
  • Normalization
  • Coding
  • Filter

44
Normalization of values
  • normalization, equally distributed
  • in the range [0, 1],
    e.g. for the logistic function:
    act = (x − minValue) / (maxValue − minValue)
  • in the range [−1, 1],
    e.g. for the activation function tanh:
    act = (x − minValue) / (maxValue − minValue)·2 − 1
  • logarithmic normalization:
    act = (ln(x) − ln(minValue)) / (ln(maxValue) − ln(minValue))
    (see the sketch below)
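The three normalization formulas as small Java helpers; the method names are chosen for this sketch only.

```java
// Normalization of an input value x into [0,1], [-1,1], or logarithmically.
public class Normalization {

    static double toUnitRange(double x, double min, double max) {
        return (x - min) / (max - min);                    // for the logistic function
    }

    static double toSymmetricRange(double x, double min, double max) {
        return (x - min) / (max - min) * 2.0 - 1.0;        // for tanh
    }

    static double logarithmic(double x, double min, double max) {
        return (Math.log(x) - Math.log(min)) / (Math.log(max) - Math.log(min));
    }

    public static void main(String[] args) {
        System.out.println(toUnitRange(3.0, 1.0, 5.0));       // 0.5
        System.out.println(toSymmetricRange(3.0, 1.0, 5.0));  // 0.0
        System.out.println(logarithmic(100.0, 1.0, 10000.0)); // 0.5
    }
}
```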

45
Binary Coding of nominal values I
  • no order relation, n values
  • n neurons,
  • each neuron represents one and only one value
  • example: red, blue, yellow, white, black →
    (1,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0), ...
  • disadvantage: n neurons necessary → lots of
    zeros in the input  (see the sketch below)
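A short Java sketch of this 1-of-n coding; the colour list is the example from the slide.

```java
import java.util.Arrays;
import java.util.List;

// 1-of-n (binary) coding of a nominal value: one neuron per value, exactly one is 1.
public class OneHotCoding {

    static int[] encode(String value, List<String> values) {
        int[] code = new int[values.size()];
        code[values.indexOf(value)] = 1;    // the neuron for this value fires
        return code;
    }

    public static void main(String[] args) {
        List<String> colours = Arrays.asList("red", "blue", "yellow", "white", "black");
        System.out.println(Arrays.toString(encode("blue", colours))); // [0, 1, 0, 0, 0]
    }
}
```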

46
Bank Customer
Are these customers good ones?

  Customer   Credit history   Debt   Collateral   Income
  1          bad              high   adequate     3
  2          good             low    adequate     2
47
The Problem: A Mailing Action
Data Mining Cup 2002
  • mailing action of a company
  • special offer
  • estimated annual income per customer
  • given:
  • 10,000 sets of customer data, containing 1,000
    cancellers (training)
  • problem:
  • a test set containing 10,000 customer data sets
  • Who will cancel? Whom should we send an offer?

  customer        will cancel   will not cancel
  gets an offer   43.80         66.30
  gets no offer    0.00         72.00
48
Mailing Action: Aim?

  customer        will cancel   will not cancel
  gets an offer   43.80         66.30
  gets no offer    0.00         72.00

  • no mailing action:
    9,000 x 72.00 = 648,000
  • everybody gets an offer:
    1,000 x 43.80 + 9,000 x 66.30 = 640,500
  • maximum (100 % correct classification):
    1,000 x 43.80 + 9,000 x 72.00 = 691,800
    (see the sketch below)
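The three scenarios as a small Java calculation using the numbers from the table (1,000 cancellers, 9,000 loyal customers):

```java
// Expected income for the three mailing scenarios.
public class MailingScenarios {
    public static void main(String[] args) {
        int cancellers = 1000, loyal = 9000;
        double noMailing   = loyal * 72.00;                          // 648,000
        double mailToAll   = cancellers * 43.80 + loyal * 66.30;     // 640,500
        double perfectPick = cancellers * 43.80 + loyal * 72.00;     // 691,800
        System.out.printf("%.2f %.2f %.2f%n", noMailing, mailToAll, perfectPick);
    }
}
```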

49
Goal Function: Lift

  customer        will cancel   will not cancel
  gets an offer   43.80         66.30
  gets no offer    0.00         72.00

  • basis: no mailing action = 9,000 x 72.00
  • goal: extra income
  • lift_M = 43.80·c_M + 66.30·nk_M − 72.00·nk_M
    (c_M: cancellers in the mailing group M,
    nk_M: non-cancellers in M; see the sketch below)
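The lift of a chosen mailing group as a Java helper, following the formula above. Reading c_M as the cancellers and nk_M as the non-cancellers in the mailed group, and the example group sizes, are assumptions of this sketch.

```java
// Extra income (lift) of a mailing group M compared with sending no offers at all:
// lift_M = 43.80 * c_M + 66.30 * nk_M - 72.00 * nk_M
public class MailingLift {

    static double lift(int cancellersMailed, int nonCancellersMailed) {
        return 43.80 * cancellersMailed + (66.30 - 72.00) * nonCancellersMailed;
    }

    public static void main(String[] args) {
        // Example: mail 800 true cancellers and 500 loyal customers.
        System.out.println(lift(800, 500));   // 35040 - 2850 = 32190
    }
}
```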

50
Data
[Figure: sample of the data table - 32 input attributes per customer, important results marked, missing values]
51
Feed-Forward Network: What to do?
  • train the net with the training set (10,000 customers)
  • test the net using the test set (another 10,000)
  • classify all 10,000 test customers as cancellers or
    loyal customers
  • evaluate the additional income

52
Results
Data Mining Cup 2002
  • gain
  • additional income from the mailing action, if the
    target group is chosen according to the analysis

53
Review: Students' Project
  • a copy of the Data Mining Cup
  • real data
  • known results
  • contest
  • wishes:
  • an engineering approach to data mining
  • real data for teaching purposes

54
Data Mining Cup 2007
  • started on April 10.
  • check-out couponing
  • Who will get a rebate coupon?
  • 50,000 data sets for training

55
Data
56
DMC2007
  • 75 % of the outputs are class N (no)
  • i.e. a useful classification has to be better than 75 %!
  • first experiments: no success
  • deadline: May 31st

57
Optimization of Neural Networks
  • objectives
  • good results in an application = better
    generalisation (improve correctness)
  • faster processing of patterns (improve
    efficiency)
  • good presentation of the results (improve
    comprehension)

58
Ability to generalize
  • a trained net can classify data (from the
    same class as the learning data) that it has
    never seen before
  • the aim of every ANN development
  • network too large:
  • all training patterns are learned by heart
  • no ability to generalize
  • network too small:
  • the rules of the pattern recognition cannot be
    learned (simple example: perceptron and XOR)

59
Development of an NN-application
60
Possible Changes
  • architecture of the NN
  • size of the network
  • shortcut connections
  • partially connected layers
  • remove/add links
  • receptive areas
  • find the right parameter values
  • learning parameter
  • size of layers
  • using genetic algorithms

61
Memory Capacity
Number of patterns a network can store without
generalisation
  • to figure out the memory capacity:
  • change the output layer: output layer = copy of
    the input layer
  • train the network with an increasing number of
    random patterns
  • error becomes small → the network stores all patterns
  • error remains → the network cannot store all
    patterns
  • in between: memory capacity

62
Memory Capacity - Experiment
  • the output layer is a copy of the input layer
  • training set consisting of n random patterns
  • error:
  • error ≈ 0 → the network can store more than n
    patterns
  • error >> 0 → the network cannot store n patterns
  • memory capacity n: error > 0 for n patterns,
    error ≈ 0 for n−1 patterns, and error >> 0 for n+1
    patterns  (see the sketch below)
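A sketch of the experiment loop. The TrainableNet interface and its two methods are purely hypothetical placeholders standing in for whatever simulator is actually used (e.g. JavaNNS); only the overall procedure follows the slide.

```java
import java.util.Random;

// Memory-capacity experiment: train an auto-associative net on a growing number of
// random patterns until the training error no longer goes to (nearly) zero.
public class MemoryCapacityExperiment {

    // Hypothetical interface standing in for the simulator (JavaNNS, SNNS, ...).
    interface TrainableNet {
        void reset();
        double trainUntilConvergence(double[][] patterns);  // returns the remaining error
    }

    static int memoryCapacity(TrainableNet net, int inputSize, double tolerance) {
        Random rnd = new Random(42);
        for (int n = 1; ; n++) {
            double[][] patterns = new double[n][inputSize];
            for (double[] p : patterns)
                for (int i = 0; i < inputSize; i++)
                    p[i] = rnd.nextInt(2);              // random binary pattern, output = input
            net.reset();
            if (net.trainUntilConvergence(patterns) > tolerance) {
                return n - 1;                           // n patterns no longer fit -> capacity n-1
            }
        }
    }
}
```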

63
Layers Not Fully Connected
  • partially connected (e.g. 75 %)
  • remove links whose weight has stayed near 0 for
    several training steps
  • build new connections (at random)

64
Summary
  • Feed-forward networks
  • the perceptron (has limits)
  • learning is mathematics
  • backpropagation is a "backpropagation of error"
    algorithm
  • it works like gradient descent
  • activation functions: logistic, tanh
  • applications in data mining and pattern recognition
  • data preparation is important
  • finding an appropriate architecture