Activation Functions

Transcript and Presenter's Notes
1
Activation Functions
2
Log Sigmoid
3
Local Gradient
4
Local Gradient
  • Thus the weight change is
  • Where is the local gradient y max and where is
    it minimum?

5
Hyperbolic Tangent
6
Hyperbolic Tangent
7
Local Gradient
  • Thus the weight change is
  • Where is the local gradient y max and where is
    it minimum?

8
Momentum
  • As the learning rate increases, the network trains
    faster but may go unstable
  • Often a momentum term is used to speed up training
    while keeping it from going unstable

9
Momentum
  • We want the learning rate small for accuracy and
    large for speed of convergence, but if it is too
    large, training becomes unstable.
  • How can this be accomplished?

10
Momentum
  • What does the momentum term do?
  • When the gradient has a constant sign from
    iteration to iteration, the momentum term gets
    larger
  • When the gradient has opposite signs from
    iteration to iteration, the momentum term gets
    smaller
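
A minimal Matlab sketch of the momentum update described above; gradE (a function returning the error gradient at w), eta, and alpha are illustrative assumptions, not part of the original slides:

    eta   = 0.1;                  % learning rate
    alpha = 0.9;                  % momentum constant
    w     = randn(10,1);          % example weight vector
    dw    = zeros(size(w));       % previous weight change
    for n = 1:1000
        g  = gradE(w);            % assumed: gradient of the error w.r.t. w
        dw = alpha*dw - eta*g;    % same-sign gradients let dw build up;
        w  = w + dw;              % alternating signs make it shrink
    end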

11
Heuristic Improvements
  • Stopping Criteria
  • How do we know when we have reached the correct
    answer?
  • A necessary condition for minimum error is that
    the gradient = 0, i.e. w(n+1) = w(n)
  • Note this condition is not sufficient since we
    may have a local minimum

12
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged
    when the Euclidean norm of the gradient vector is
    sufficiently small
  • This is problematic since one must compute the
    gradient vector, and it will be slow

13
Heuristic Improvements
  • Stopping Criteria Example: gradient vector norm
  • Let's say we have a logistic function for the 5
    neurons in the output layer, with a = 1 (I'm
    using 5 for this example)
  • For each neuron i, compute out_i(1 - out_i)k_i
  • Then if ...
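
A sketch of the norm-based test from slide 12; gradE, w, and the tolerance value are illustrative assumptions:

    tol = 1e-4;                       % illustrative threshold
    g   = gradE(w);                   % full gradient vector
    if norm(g) < tol                  % Euclidean norm sufficiently small
        disp('BP considered converged')
    end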

14
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged
    when the absolute rate of change of the average
    squared error per epoch is sufficiently small
    (typically 0.1 to 1 percent per epoch)
  • May result in premature stopping

15
Heuristic Improvements
  • Stopping Criteria
  • The BP algorithm is considered to have converged
    when its generalization performance stops improving
  • Generalization performance is found by testing
    the network on a representative set of data not
    used to train on

16
Heuristic Improvements
  • On-line, Stochastic or sequential mode training
    all are synonymous for the author
  • On-line: input a value once and train on it
    (update)
  • Stochastic: randomly select from the pool of
    training samples and update (train on it)

17
Heuristic Improvements
  • In Matlab we have trainb, trainr, and trains
  • trainb
  • trains a network with weight and bias learning
    rules with batch updates. The weights and biases
    are updated at the end of an entire pass through
    the input data. Inputs are presented in random
    order
  • trainr
  • trains a network with weight and bias learning
    rules with incremental updates after each
    presentation of an input. Inputs are presented in
    random order.

18
Heuristic Improvements
  • In Matlab we have trainb, trainr, and trains
  • trains
  • Trains a network with weight and bias learning
    rules with incremental updates after each
    presentation of an input. Inputs are presented in
    sequential order.

19
Heuristic Improvements
  • Batch Mode
  • With each input, accumulate the update values for
    each weight; after all inputs (an epoch) have been
    presented, update the weights (as sketched below)
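
The difference between the modes as a Matlab sketch; P (inputs, one column per example), T (targets), eta, and the hypothetical gradient function bpGrad are assumptions for illustration:

    % Sequential / stochastic mode: update after each example, random order
    for k = randperm(size(P,2))
        w = w - eta * bpGrad(w, P(:,k), T(:,k));
    end

    % Batch mode: accumulate over the whole epoch, then update once
    g = zeros(size(w));
    for k = 1:size(P,2)
        g = g + bpGrad(w, P(:,k), T(:,k));
    end
    w = w - eta * g;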

20
Summary of BP Steps
  • Initialize
  • With no prior information, pick weights from a
    uniform distribution with mean 0 and range +/- 1;
    this causes the weights to fall in the linear
    region of the sigmoid (see the sketch after this list)
  • Presentation of training examples
  • Randomly pick values, sequential mode (easiest)
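
A sketch of the initialization in the first bullet; fanIn and nNeurons are illustrative names for one layer's dimensions:

    fanIn    = 8;
    nNeurons = 5;
    W = 2*rand(nNeurons, fanIn) - 1;   % uniform on [-1, 1], mean 0
    b = 2*rand(nNeurons, 1)     - 1;   % biases treated the same way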

21
Summary of BP Steps
  • Forward Computation
  • Present input vector and compute output
  • Backward Computation
  • Compute the δs (local gradients) as ...

22
Adjustment
23
Heuristic Improvements
  • Maximize information content of samples
  • Examples contain maximum information when they
    result in the largest training error, or are
    radically different from previous examples
  • Generally this is simulated by presenting the
    training samples randomly.
  • Randomize for each epoch

24
Heuristic Improvements
  • Activation Function
  • Generally learns faster if it is antisymmetric
  • Not true of logsigmoid, but true of Tanh

25
Heuristic Improvements
  • Activation Function (contd)
  • The next figure is the log sigmoid which does not
    meet the criterion
  • The figure after that is the Tanh which is
    antisymmetric

26
(No Transcript)
27
(No Transcript)
28
Heuristic Improvements
  • Activation Function (contd)
  • For the Tanh function, empirical studies have
    shown the following values for a and b to be
    appropriate

29
Heuristic Improvements
  • Activation Function (contd)
  • Note that

30
Heuristic Improvements
  • Target (output) values
  • Choose within range of possible output values
  • Really the targets should be some small value ε
    less than the maximum neuron output
  • For the tanh, with a = 1.7159, choose ε = 0.7159,
    and then the targets can be +/- 1 (see the tanh slide)

31
Heuristic Improvements
  • Input range problems
  • If dealing with a person's height (meters) and
    weight (lbs), then the weight will overpower
    the height
  • Also, it is not good if one input ranges over both
    negative and positive values while another takes
    only negative or only positive values

32
Heuristic Improvements
  • Preprocess the inputs so that each has
  • An average of 0 or
  • Its average is small compared to its standard
    deviation
  • Consider the case of all positive values
  • All weights must change in the same direction and
    this will give a zig-zag traversal of the error
    surface which can be very slow.
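
A sketch of this preprocessing, assuming X holds one sample per row and one input variable per column:

    X0 = X  - repmat(mean(X),  size(X,1), 1);   % zero-mean columns
    Xs = X0 ./ repmat(std(X0), size(X,1), 1);   % scale by standard deviation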

33
Heuristic Improvements
  • If possible, input values should be uncorrelated
  • Can be done using principal components analysis

34
Principal Component Analysis
35
Topics covered
  • Standard Deviation
  • Variance
  • Covariance
  • Correlation
  • Eigenvectors
  • Eigenvalues
  • PCA
  • Application of PCA - Eigenfaces

36
Standard Deviation
  • Statistics analyzing data sets in terms of the
    relationships between the individual points
  • Standard Deviation is a measure of the spread of
    the data
  • Two example data sets: [0 8 12 20] and [8 9 11 12]
  • Calculation: the average distance from the mean of
    the data set to a point
  •   s = sqrt( Σi=1..n (Xi - X̄)² / (n - 1) )
  • Denominator of n-1 for sample and n for entire
    population

37
Standard Deviation
  • For example
  • [0 8 12 20] has s = 8.32
  •   sqrt( ((0-10)² + (8-10)² + (12-10)² + (20-10)²) / 3 ) = 8.32
  • [8 9 11 12] has s = 1.82
  • [10 10 10 10] has s = 0
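
The same examples checked in Matlab (std uses the n - 1 denominator):

    a = [0 8 12 20];
    sqrt(sum((a - mean(a)).^2) / (length(a) - 1))   % 8.3267, from the formula
    std(a)                                          % same value
    std([8 9 11 12])                                % 1.8257
    std([10 10 10 10])                              % 0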

38
Variance
  • Another measure of the spread of the data in a
    data set
  • Calculation
  •   s² = Σi=1..n (Xi - X̄)² / (n - 1)
  • Why have both variance and SD to calculate the
    spread of data?
  • Variance is claimed to be the original
    statistical measure of spread of data. However,
    its units are squared, e.g. cm², which is
    unrealistic for expressing heights or other
    measures. Hence SD, the square root of
    variance, was born.

39
Covariance
  • Variance: a measure of the deviation from the mean
    for points in one dimension, e.g. heights
  • Covariance is a measure of how much each of the
    dimensions varies from the mean with respect to
    each other.
  • Covariance is measured between 2 dimensions to
    see if there is a relationship between the 2
    dimensions, e.g. number of hours studied vs. marks
    obtained.
  • The covariance between one dimension and itself
    is the variance

40
Covariance
  •   variance(X)    = Σi=1..n (Xi - X̄)(Xi - X̄) / (n - 1)
  •   covariance(X,Y) = Σi=1..n (Xi - X̄)(Yi - Ȳ) / (n - 1)
  • So, if you had a 3-dimensional data set (x,y,z),
    then you could measure the covariance between the
    x and y dimensions, the y and z dimensions, and
    the x and z dimensions. Measuring the covariance
    between x and x , or y and y , or z and z would
    give you the variance of the x , y and z
    dimensions respectively.

41
Variance Covariance - Matlab
>> x = [0 8 12 20; 8 9 11 12; 10 10 10 10]
>> var(x)
ans =
    28     1     1    28        (note: 28 is var([0 8 10]))
>> C = cov(x)
C =
    28     5    -5   -28
     5     1    -1    -5
    -5    -1     1     5
   -28    -5     5    28

42
Variance Covariance - Matlab
>> C(2,3)
ans =
    -1

43
Covariance
  • What is the interpretation of covariance
    calculations?
  • e.g. a 2-dimensional data set
  • x = number of hours studied for a subject
  • y = marks obtained in that subject
  • the covariance value is, say, 104.53
  • what does this value mean?

44
Covariance
  • Exact value is not as important as its sign.
  • A positive value of covariance indicates both
    dimensions increase or decrease together e.g. as
    the number of hours studied increases, the marks
    in that subject increase.
  • A negative value indicates while one increases
    the other decreases, or vice-versa e.g. active
    social life at RIT vs performance in CS dept.
  • If covariance is zero the two dimensions are
    independent of each other e.g. heights of
    students vs the marks obtained in a subject

45
Covariance
  • Why bother with calculating covariance when we
    could just plot the 2 values to see their
    relationship?
  • Covariance calculations are used to find
    relationships between dimensions in high
    dimensional data sets (usually greater than 3)
    where visualization is difficult.

46
Covariance Matrix
  • Representing Covariance between dimensions as a
    matrix e.g. for 3 dimensions
  •        [ cov(x,x)  cov(x,y)  cov(x,z) ]
     C  =  [ cov(y,x)  cov(y,y)  cov(y,z) ]
           [ cov(z,x)  cov(z,y)  cov(z,z) ]
  • The diagonal is the variances of x, y and z
  • cov(x,y) = cov(y,x), hence the matrix is symmetrical
    about the diagonal
  • n-dimensional data will result in an n × n covariance
    matrix
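
A sketch showing that the matrix above is just the definition applied column-wise; D is an illustrative observations-by-dimensions matrix:

    D  = randn(100, 3);                        % example 3-dimensional data
    D0 = D - repmat(mean(D), size(D,1), 1);    % subtract the column means
    C  = (D0' * D0) / (size(D,1) - 1);         % cov(x,x) ... cov(z,z)
    max(max(abs(C - cov(D))))                  % essentially zero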

47
Correlation
  • A positive correlation means that as x increases,
    so does y, and vice versa
  • Of the following plots, which has the highest
    correlation?

48
(No Transcript)
49
Transformation matrices
  • Consider
  •   [ 2  3 ] [ 3 ]   =   [ 12 ]   =   4 x [ 3 ]
      [ 2  1 ] [ 2 ]       [  8 ]           [ 2 ]
  • The square transformation matrix transforms (3,2)
    from its original location. Now if we were to
    take a multiple of (3,2)
  •   2 x [ 3 ]   =   [ 6 ]
          [ 2 ]       [ 4 ]
  •   [ 2  3 ] [ 6 ]   =   [ 24 ]   =   4 x [ 6 ]
      [ 2  1 ] [ 4 ]       [ 16 ]           [ 4 ]
50
Transformation matrices
  • Scale vector (3,2) by a value 2 to get (6,4)
  • Multiply by the square transformation matrix
  • We see the result is still a multiple of 4.
  • WHY?
  • A vector consists of both length and direction.
    Scaling a vector only changes its length and not
    its direction. This is an important observation
    in the transformation of matrices leading to
    formation of eigenvectors and eigenvalues.
  • Irrespective of how much we scale (3,2) by, the
    solution is always a multiple of 4.

51
eigenvalue problem
  • The eigenvalue problem is any problem having the
    following form
  •   A · v = λ · v
  • A: an n × n matrix
  • v: an n × 1 non-zero vector
  • λ: a scalar
  • Any value of λ for which this equation has a
    solution is called the eigenvalue of A and the vector
    v which corresponds to this value is called the
    eigenvector of A.

52
eigenvalue problem
  •   [ 2  3 ] [ 3 ]   =   [ 12 ]   =   4 x [ 3 ]
      [ 2  1 ] [ 2 ]       [  8 ]           [ 2 ]
  •   A · v = λ · v
  • Therefore, (3,2) is an eigenvector of the square
    matrix A and 4 is an eigenvalue of A
  • Given matrix A, how can we calculate the
    eigenvectors and eigenvalues for A?
53
Calculating eigenvectors eigenvalues
  • Given A · v = λ · v
  • A · v - λ · I · v = 0
  • (A - λ · I) · v = 0
  • Finding the roots of det(A - λ · I) = 0 will give the
    eigenvalues, and for each of these eigenvalues
    there will be an eigenvector
  • Example

54
Calculating eigenvectors eigenvalues
  • If A  =  [  0   1 ]
             [ -2  -3 ]
  • Then A - λ·I  =  [  0   1 ]  -  [ λ  0 ]  =  [ -λ     1   ]
                     [ -2  -3 ]     [ 0  λ ]     [ -2   -3-λ ]
  • det(A - λ·I)  =  -λ(-3-λ) + 2  =  λ² + 3λ + 2  =  0
  • This gives us 2 eigenvalues:
  • λ1 = -1 and λ2 = -2

55
Calculating eigenvectors eigenvalues
  • For λ1 the eigenvector is
  • (A - λ1 · I) · v1 = 0
  •   [  1   1 ] [ v11 ]   =   [ 0 ]
      [ -2  -2 ] [ v12 ]       [ 0 ]
  • v11 + v12 = 0   and   -2·v11 - 2·v12 = 0
  • v11 = -v12
  • Therefore the first eigenvector is any column
    vector in which the two elements have equal
    magnitude and opposite sign

56
Calculating eigenvectors eigenvalues
  • Therefore eigenvector v1 is
  •   v1  =  k1 [  1 ]
                [ -1 ]
  • where k1 is some constant. Similarly we find
    eigenvector v2
  •   v2  =  k2 [  1 ]
                [ -2 ]
  • And the eigenvalues are λ1 = -1 and λ2 = -2
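
The hand calculation can be checked with Matlab's eig:

    A = [0 1; -2 -3];
    [V, E] = eig(A)
    % the diagonal of E holds the eigenvalues -1 and -2; the columns of V
    % are unit-length multiples of [1; -1] and [1; -2], matching v1 and v2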

57
Properties of eigenvectors and eigenvalues
  • Note that irrespective of how much we scale (3,2)
    by, the solution is always a multiple of 4.
  • Eigenvectors can only be found for square
    matrices, and not every square matrix has
    eigenvectors.
  • Given an n × n matrix, we can find n eigenvectors

58
Properties of eigenvectors and eigenvalues
  • All eigenvectors of a matrix are perpendicular to
    each other, no matter how many dimensions we have
  • In practice eigenvectors are normalized to have
    unit length. Since the length of an eigenvector
    does not affect our calculations, we prefer to keep
    them standard by scaling them to have a length of
    1, e.g.
  • For eigenvector (3,2):
  •   length = sqrt(3² + 2²) = sqrt(13)
  •   [ 3 ]  /  sqrt(13)   =   [ 3/sqrt(13) ]
      [ 2 ]                    [ 2/sqrt(13) ]

59
Matlab
>> A = [0 1; 2 3]
A =
     0     1
     2     3
>> [v,d] = eig(A)
v =
   -0.8719   -0.2703
    0.4896   -0.9628
d =
   -0.5616         0
         0    3.5616

>> help eig
    [V,D] = EIG(X) produces a diagonal matrix D of eigenvalues and a
    full matrix V whose columns are the corresponding eigenvectors so
    that X*V = V*D.
60
PCA
  • principal components analysis (PCA) is a
    technique that can be used to simplify a dataset
  • It is a linear transformation that chooses a new
    coordinate system for the data set such that
  • greatest variance by any projection of the data
    set comes to lie on the first axis (then called
    the first principal component), the second
    greatest variance on the second axis, and so on.
  • PCA can be used for reducing dimensionality by
    eliminating the later principal components.

61
PCA
  • By finding the eigenvalues and eigenvectors of
    the covariance matrix, we find that the
    eigenvectors with the largest eigenvalues
    correspond to the dimensions that have the
    strongest correlation in the dataset.
  • This is the principal component.
  • PCA is a useful statistical technique that has
    found application in
  • fields such as face recognition and image
    compression
  • finding patterns in data of high dimension.
  • Reducing dimensionality of data

62
PCA process STEP 1
  • Subtract the mean from each of the data
    dimensions. All the x values have x̄ subtracted
    and all the y values have ȳ subtracted from them.
    This produces a data set whose mean is zero.
  • Subtracting the mean makes variance and
    covariance calculation easier by simplifying
    their equations. The variance and co-variance
    values are not affected by the mean value.

63
PCA process STEP 1
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • DATA                      ZERO MEAN DATA
        x      y                    x       y
       2.5    2.4                  .69     .49
       0.5    0.7                -1.31   -1.21
       2.2    2.9                  .39     .99
       1.9    2.2                  .09     .29
       3.1    3.0                 1.29    1.09
       2.3    2.7                  .49     .79
       2.0    1.6                  .19    -.31
       1.0    1.1                 -.81    -.81
       1.5    1.6                 -.31    -.31
       1.1    0.9                 -.71   -1.01
64
PCA process STEP 1
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
65
PCA process STEP 2
  • Calculate the covariance matrix
  •   cov  =  [ .616555556   .615444444 ]
              [ .615444444   .716555556 ]
  • Since the non-diagonal elements in this
    covariance matrix are positive, we should expect
    that both the x and y variables increase together.
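
Steps 1 and 2 reproduced in Matlab on the tutorial data (cov is unaffected by the mean, so D or the zero-mean data give the same matrix):

    D = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
         2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    D0 = D - repmat(mean(D), size(D,1), 1);   % zero-mean data of slide 63
    C  = cov(D0)                              % [.6166 .6154; .6154 .7166]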

66
PCA process STEP 3
  • Calculate the eigenvectors and eigenvalues of the
    covariance matrix
  •   eigenvalues   =  [ .0490833989 ]
                       [ 1.28402771  ]
  •   eigenvectors  =  [  .735178656   -.677873399 ]
                       [ -.677873399   -.735178656 ]

67
PCA process STEP 3
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • eigenvectors are plotted as diagonal dotted lines
    on the plot.
  • Note they are perpendicular to each other.
  • Note one of the eigenvectors goes through the
    middle of the points, like drawing a line of best
    fit.
  • The second eigenvector gives us the other, less
    important, pattern in the data, that all the
    points follow the main line, but are off to the
    side of the main line by some amount.

68
PCA process STEP 4
  • Reduce dimensionality and form feature vector
  • The eigenvector with the highest eigenvalue is
    the principal component of the data set.
  • In our example, the eigenvector with the largest
    eigenvalue was the one that pointed down the
    middle of the data.
  • Once eigenvectors are found from the covariance
    matrix, the next step is to order them by
    eigenvalue, highest to lowest. This gives you the
    components in order of significance.

69
PCA process STEP 4
  • Now, if you like, you can decide to ignore the
    components of lesser significance.
  • You do lose some information, but if the
    eigenvalues are small, you don't lose much
  • n dimensions in your data
  • calculate n eigenvectors and eigenvalues
  • choose only the first p eigenvectors
  • final data set has only p dimensions.

70
PCA process STEP 4
  • Feature Vector
  •   FeatureVector = (eig1 eig2 eig3 ... eign)
  • We can either form a feature vector with both of
    the eigenvectors
  •   [ -.677873399    .735178656 ]
      [ -.735178656   -.677873399 ]
  • or, we can choose to leave out the smaller, less
    significant component and only have a single
    column
  •   [ -.677873399 ]
      [ -.735178656 ]

71
PCA process STEP 5
  • Deriving the new data
  • FinalData = RowFeatureVector x RowZeroMeanData
  • RowFeatureVector is the matrix with the
    eigenvectors in the columns transposed so that
    the eigenvectors are now in the rows, with the
    most significant eigenvector at the top
  • RowZeroMeanData is the mean-adjusted data
    transposed, i.e. the data items are in each
    column, with each row holding a separate
    dimension.
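
Step 5 on the tutorial data, a sketch using the eigenvectors from slide 66 ordered by decreasing eigenvalue; D and D0 are the same tutorial data as in the earlier sketch:

    D  = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
          2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    D0 = D - repmat(mean(D), size(D,1), 1);
    RowFeatureVector = [-.677873399 -.735178656; ...   % most significant first
                         .735178656 -.677873399];
    RowZeroMeanData  = D0';                    % data items in columns
    FinalData = RowFeatureVector * RowZeroMeanData;
    FinalData'                                 % rows match slide 73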

72
PCA process STEP 5
  • FinalData is the final data set, with data items
    in columns, and dimensions along rows.
  • What will this give us? It will give us the
    original data solely in terms of the vectors we
    chose.
  • We have changed our data from being in terms of
    the axes x and y , and now they are in terms of
    our 2 eigenvectors.

73
PCA process STEP 5
  • FinalData transpose: dimensions along columns
          x               y
      -.827970186      .175115307
     -1.77758033      -.142857227
      -.992197494     -.384374989
      -.274210416     -.130417207
     -1.67580142       .209498461
      -.912949103     -.175282444
       .0991094375     .349824698
      1.14457216      -.0464172582
       .438046137     -.0177646297
      1.22382056       .162675287

74
PCA process STEP 5
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
75
Reconstruction of original Data
  • If we reduced the dimensionality, obviously, when
    reconstructing the data we would lose those
    dimensions we chose to discard. In our example
    let us assume that we considered only the x
    dimension
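
A sketch of the reduced reconstruction: keep only the first principal component, invert the transform (the transpose suffices because the eigenvector has unit length), and add the mean back on. D and D0 are the tutorial data as before:

    D  = [2.5 2.4; 0.5 0.7; 2.2 2.9; 1.9 2.2; 3.1 3.0; ...
          2.3 2.7; 2.0 1.6; 1.0 1.1; 1.5 1.6; 1.1 0.9];
    D0 = D - repmat(mean(D), size(D,1), 1);
    PC1        = [-.677873399; -.735178656];           % principal eigenvector
    FinalData  = PC1' * D0';                            % 1 x 10 scores
    ApproxData = (PC1 * FinalData)' + repmat(mean(D), size(D,1), 1);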

76
Reconstruction of original Data
http://kybele.psych.cornell.edu/edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
  • x
  • -.827970186
  • 1.77758033
  • -.992197494
  • -.274210416
  • -1.67580142
  • -.912949103
  • .0991094375
  • 1.14457216
  • .438046137
  • 1.22382056

77
Matlab PCA
  • Matlab has a function called princomp(x)
  • [COEFF,SCORE] = princomp(X) performs principal
    components analysis on the n-by-p data matrix X,
    and returns the principal component coefficients,
    also known as loadings. Rows of X correspond to
    observations, columns to variables. COEFF is a
    p-by-p matrix, each column containing
    coefficients for one principal component. The
    columns are in order of decreasing component
    variance. princomp centers X by subtracting off
    column means

78
Matlab PCA
  • [COEFF,SCORE] = princomp(X) returns SCORE, the
    principal component scores; that is, the
    representation of X in the principal component
    space. Rows of SCORE correspond to observations,
    columns to components.

79
Matlab PCA
>> D        (the zero-mean data from slide 63)
D =
    0.6900    0.4900
   -1.3100   -1.2100
    0.3900    0.9900
    0.0900    0.2900
    1.2900    1.0900
    0.4900    0.7900
    0.1900   -0.3100
   -0.8100   -0.8100
   -0.3100   -0.3100
   -0.7100   -1.0100

80
Matlab PCA
>> [A,B] = princomp(D)
A =
   -0.6779    0.7352     (see slide 66; note that these columns are ordered
   -0.7352   -0.6779      by highest eigenvalue, unlike slide 66)
B =
   -0.8280    0.1751     (see slide 73)
    1.7776   -0.1429
   -0.9922   -0.3844
   -0.2742   -0.1304
   -1.6758    0.2095
   -0.9129   -0.1753
    0.0991    0.3498
    1.1446   -0.0464
    0.4380   -0.0178
    1.2238    0.1627
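
The same result can be recovered from the covariance matrix directly, tying princomp back to the eigenvector calculation (a sketch, using the D entered on slide 79):

    [V, E]       = eig(cov(D));
    [evals, idx] = sort(diag(E), 'descend');       % order by eigenvalue
    COEFF2 = V(:, idx);                            % matches A up to sign
    SCORE2 = (D - repmat(mean(D), size(D,1), 1)) * COEFF2;   % matches B up to sign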

81
MATLAB DEMO
82
PCA applications -Eigenfaces
  • Eigenfaces are the eigenvectors of the covariance
    matrix of the probability distribution of the
    vector space of human faces
  • Eigenfaces are the standardized face
    ingredients derived from the statistical
    analysis of many pictures of human faces
  • A human face may be considered to be a
    combination of these standard faces

83
PCA applications -Eigenfaces
  • To generate a set of eigenfaces
  • Large set of digitized images of human faces is
    taken under the same lighting conditions.
  • The images are normalized to line up the eyes and
    mouths.
  • The eigenvectors of the covariance matrix of the
    statistical distribution of face image vectors
    are then extracted.
  • These eigenvectors are called eigenfaces.
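
A minimal eigenfaces sketch, assuming faces is an m-by-p matrix with one vectorized, aligned grayscale face image per row (the variable names are illustrative):

    meanFace   = mean(faces);                            % 1 x p average face
    A          = faces - repmat(meanFace, size(faces,1), 1);
    [U, S, V]  = svd(A, 'econ');                         % economy-size SVD
    eigenfaces = V;              % columns: eigenvectors of the covariance matrix
    weights    = A * V;          % each face as weights on the eigenfaces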

84
PCA applications -Eigenfaces
  • the principal eigenface looks like a bland
    androgynous average human face

http://en.wikipedia.org/wiki/Image:Eigenfaces.png
85
Eigenfaces Face Recognition
  • When properly weighted, eigenfaces can be summed
    together to create an approximate gray-scale
    rendering of a human face.
  • Remarkably few eigenvector terms are needed to
    give a fair likeness of most people's faces
  • Hence eigenfaces provide a means of applying data
    compression to faces for identification purposes.

86
Expert Object Recognition in Video Matt McEuen
87
EOR
  • Principal Component Analysis (PCA)
  • Based on covariance
  • Visual memory reconstruction
  • Images of cats and dogs are aligned so that the
    eyes are in the same position in every image

88
EOR
89
Back to Heuristic Improvements
90
Back to Heuristic Improvements
91
Heuristic Improvements
  • Initialization
  • Large initial weights will saturate neurons
  • All 0s for weights is also potentially bad
  • From the text, it can be shown (approximately,
    under certain conditions) that a good choice for
    weights is to select them randomly from a uniform
    distribution with

92
Number of Hidden Layers
  • Three layers suffice to implement any function
    with properly chosen transfer functions
  • Additional layers can help
  • It is easier for a four-layer net to learn
    translations than for a three-layer net.
  • Each layer can learn an invariance - maybe

93
Feature Detection
  • Hidden neurons play the role of feature detectors
  • Tend to transform the input vector space into a
    hidden or feature space
  • Each hidden neuron's output is then a measure of
    how well that feature is present in the current
    input

94
Generalization
  • Generalization is the term used to describe how
    well a NN correctly classifies a set of data that
    was not used as the training set.
  • One generally has 3 sets of data
  • Training
  • Validate (on-going generalization test)
  • Testing (this is the true error rate)

95
Generalization
  • A network generalizes well when it produces
    correct (or near correct) outputs for data it was
    not trained on.
  • Can overfit or overtrain
  • Generally we want to select the smoothest/simplest
    mapping of the function in the absence of prior
    knowledge (demo: nnd11gn)

96
Generalization
  • Influenced by four factors
  • Size of training set
  • How representative training set is of data
  • Neural network architecture
  • Physical complexity of problem at hand
  • Often the NN configuration or the training set is
    fixed, and so we have only the other two factors
    to work with

97
Generalization Over training
  • Over training
  • Overtraining occurs when the network has been
    trained to only minimize the error
  • The next slide shows a network in which a
    trigonometric function is being approximated
  • 1-3-1 denotes one input, 3 hidden layer neurons,
    and one output layer neuron
  • The fit is perfect at 4, but at 8 the error is
    lower yet the fit is poorer

98
(No Transcript)
99
Generalization Complexity of Network
  • In the next slide, as the number of hidden layer
    neurons goes from 1 to 5, the network does better.

100
(No Transcript)
101
Approximations of Functions
  • In general, for good generalization, the number
    of training samples N should be larger than the
    ratio of the total number of free parameters
    (weights) in the network to the mean-square value
    of the estimation error
  • Normally we want the simplest NN we can get

102
Generalization
  • A commonly used value is
  •   N = O(W/ε), where
  • O() is like Big-O
  • W = total number of weights
  • ε = fraction of classification errors permitted on
    test data
  • N = number of training samples
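
As an illustration of this rule of thumb, a network with W = 1,000 weights and a permitted error fraction ε = 0.1 (10%) would call for on the order of N ≈ 1000/0.1 = 10,000 training samples.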

103
Approximations of Functions
  • NN acts as a non-linear mapping from input to
    output space
  • Everywhere differentiable if all transfer
    functions are differentiable
  • What is the minimum number of hidden layers in a
    multilayer perceptron whose I/O mapping provides
    an approximate realization of any continuous
    mapping?

104
Approximations of Functions
  • Part of universal approximation theorem
  • This theorem states (in essence) that a NN with
    bounded, nonconstant, monotone increasing
    continuous transfer functions and one hidden
    layer can approximate any function
  • Says nothing about optimum in terms of learning
    time, ease of implementation, or generalization

105
Practical Considerations
  • For high dimensional spaces, it is often better
    to have 2-layer networks so that neurons in
    layers do not interact so much.

106
Cross Validation
  • Randomly divide data into training and testing
    sets
  • Further randomly divide training set into
    estimation and validation subsets.
  • Use validation set to test accuracy of model, and
    then test set for actual accuracy value.
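
A sketch of the split described above; X holds one sample per row, and the 80/20 proportions are illustrative, not prescribed by the slides:

    N        = size(X, 1);
    p        = randperm(N);
    nTst     = round(0.2 * N);
    testIdx  = p(1:nTst);                  % held-out test set
    trainIdx = p(nTst+1:end);              % training set
    nVal     = round(0.2 * numel(trainIdx));
    valIdx   = trainIdx(1:nVal);           % validation subset
    estIdx   = trainIdx(nVal+1:end);       % estimation subset used to fit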

107
Cross Validation leave-one-out
  • Train on everything but one and then test on it
  • Repeat for all partitions