Title: Overview of Back Propagation Algorithm
1. Overview of Back Propagation Algorithm
2. A Sample Network
3. Forward Operation
- The general feed-forward operation is given by the equation sketched below.
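In standard notation for a single-hidden-layer network (assumed here: inputs x_i, hidden units indexed by j, outputs z_k, activation function f), the feed-forward operation can be written as
\[
z_k \;=\; f\Bigl(\sum_{j=1}^{n_H} w_{kj}\, f\Bigl(\sum_{i=1}^{d} w_{ji}\, x_i + w_{j0}\Bigr) + w_{k0}\Bigr),
\qquad k = 1,\dots,c.
\]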
4. Back Propagation Algorithm
- The hidden-to-output weights can be learned by minimizing the error.
- The power of back-propagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights.
- We consider the error function and the update rule sketched below.
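A standard choice (assumed here) is the sum-of-squared-error criterion with gradient-descent updates:
\[
J(\mathbf{w}) \;=\; \tfrac{1}{2}\sum_{k=1}^{c}\bigl(t_k - z_k\bigr)^2,
\qquad
\Delta w \;=\; -\,\eta\,\frac{\partial J}{\partial w},
\]
where t_k is the target for output unit k, z_k its actual output, and \(\eta\) the learning rate.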
5. Hidden-to-output Weights
The chain rule, the sensitivity of unit k, and the resulting overall derivative are spelled out below.
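One way to write this derivation, under the notation assumed above (y_j is the output of hidden unit j and net_k the weighted input to output unit k):
\[
\frac{\partial J}{\partial w_{kj}} \;=\; \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}},
\qquad
\delta_k \;\equiv\; -\frac{\partial J}{\partial net_k} \;=\; (t_k - z_k)\, f'(net_k),
\qquad
\frac{\partial net_k}{\partial w_{kj}} \;=\; y_j,
\]
so overall \(\partial J / \partial w_{kj} = -\delta_k\, y_j\), giving the update \(\Delta w_{kj} = \eta\,\delta_k\, y_j\).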
6. Input-to-hidden Weights
The chain rule, the actual back-propagation step, and the resulting update rule are spelled out below.
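Under the same assumed notation, with x_i the inputs and net_j the weighted input to hidden unit j:
\[
\frac{\partial J}{\partial w_{ji}} \;=\; \frac{\partial J}{\partial y_j}\,\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}},
\qquad
\delta_j \;\equiv\; f'(net_j)\sum_{k=1}^{c} w_{kj}\,\delta_k,
\qquad
\Delta w_{ji} \;=\; \eta\,\delta_j\, x_i.
\]
The middle step is the actual back propagation: the output sensitivities \(\delta_k\) are passed back through the weights \(w_{kj}\) to form the hidden-unit sensitivity \(\delta_j\).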
7. Back Propagation of Sensitivity
- The sensitivity at a hidden unit is proportional to the weighted sum of the sensitivities at the output units.
- The output unit sensitivities are thus propagated back to the hidden units (a code sketch of one full update follows).
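To make the two update rules concrete, here is a minimal NumPy sketch of one training step for a one-hidden-layer network; the tanh activation, the omission of bias terms, and all names are illustrative assumptions, not the slides' own implementation.

```python
import numpy as np

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    """One gradient-descent step for a one-hidden-layer tanh network.

    x: input vector (d,), t: target vector (c,),
    W_hidden: (n_H, d) input-to-hidden weights,
    W_out: (c, n_H) hidden-to-output weights.
    """
    # Forward operation
    net_j = W_hidden @ x                              # hidden pre-activations
    y = np.tanh(net_j)                                # hidden outputs
    net_k = W_out @ y                                 # output pre-activations
    z = np.tanh(net_k)                                # network outputs

    # Sensitivities
    delta_k = (t - z) * (1.0 - z ** 2)                # output units: (t_k - z_k) f'(net_k)
    delta_j = (1.0 - y ** 2) * (W_out.T @ delta_k)    # hidden units: f'(net_j) * sum_k w_kj delta_k

    # Weight updates
    W_out = W_out + eta * np.outer(delta_k, y)        # delta w_kj = eta * delta_k * y_j
    W_hidden = W_hidden + eta * np.outer(delta_j, x)  # delta w_ji = eta * delta_j * x_i
    return W_hidden, W_out
```

Calling backprop_step repeatedly over a training set would implement online (per-sample) gradient descent.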
8. Training Hierarchical Feed-forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks
- ECCV 2008
- Kai Yu
- Presented by Shuiwang Ji
9. Transfer Learning
- Transfer learning, also known as multi-task learning, is a mechanism that improves generalization by leveraging shared domain-specific information contained in related tasks.
- In the setting considered in this paper, all tasks share the same input space.
10. General Formulation
- The main task to be learned has index m, with its own set of training examples.
- A neural network has a natural architecture for tackling this learning problem by minimizing the objective sketched below.
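A generic form of this objective, assuming a shared feature map phi(.; theta), task-specific output weights w_m, a per-example loss l, and a regularizer Omega (these symbols are assumptions, not the paper's own notation):
\[
\min_{\theta,\,\mathbf{w}_m}\;
\sum_{n} \ell\bigl(y^{m}_{n},\, \mathbf{w}_m^{\top}\phi(\mathbf{x}_n;\theta)\bigr)
\;+\; \gamma\,\Omega(\theta).
\]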
11. General Formulation
- The shared representation is learned by additionally introducing pseudo auxiliary tasks, each represented by learning a set of input-output pairs.
- The regularization term then becomes the pseudo-task fitting loss sketched below.
- A Bayesian perspective (skipped)
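A generic form of the resulting regularizer, assuming K pseudo-tasks with targets y^k_n, task-specific weights w_k, and a squared loss (again assumed symbols rather than the paper's notation):
\[
\Omega(\theta) \;=\;
\sum_{k=1}^{K}\sum_{n}\bigl(y^{k}_{n} - \mathbf{w}_k^{\top}\phi(\mathbf{x}_n;\theta)\bigr)^{2},
\]
so that fitting the pseudo-task targets constrains the shared feature map phi(.; theta).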
12. CNN for Transfer Learning
- Input: 140x140 pixel images, including R/G/B channels and additionally two channels Dx and Dy, which are the horizontal and vertical gradients of the gray intensities.
- C1 layer: 16 filters of size 16x16.
- P1 layer: max pooling over each 5x5 neighborhood.
- C2 layer: 256 filters of size 6x6, with connections of sparsity 0.5 between the 16 dimensions of the P1 layer and the 256 dimensions of the C2 layer.
- P2 layer: max pooling over each 5x5 neighborhood.
- Output layer: full connections between the (256 x 4 x 4) P2 features and the outputs (an architecture sketch follows this list).
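A minimal PyTorch sketch of this architecture, using the layer sizes above; the tanh nonlinearities and the use of full (rather than 0.5-sparse) P1-to-C2 connections are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TransferCNN(nn.Module):
    """Sketch of the slide's CNN: 5-channel 140x140 input -> C1 -> P1 -> C2 -> P2 -> output."""

    def __init__(self, num_outputs):
        super().__init__()
        self.c1 = nn.Conv2d(5, 16, kernel_size=16)   # 16 filters of 16x16 -> 16 x 125 x 125
        self.p1 = nn.MaxPool2d(5)                    # non-overlapping 5x5 pooling -> 16 x 25 x 25
        self.c2 = nn.Conv2d(16, 256, kernel_size=6)  # 256 filters of 6x6 (dense here) -> 256 x 20 x 20
        self.p2 = nn.MaxPool2d(5)                    # -> 256 x 4 x 4
        self.out = nn.Linear(256 * 4 * 4, num_outputs)

    def forward(self, x):                            # x: (batch, 5, 140, 140) with R/G/B/Dx/Dy channels
        x = self.p1(torch.tanh(self.c1(x)))
        x = self.p2(torch.tanh(self.c2(x)))
        return self.out(torch.flatten(x, 1))
```

In the transfer-learning setup, the output layer would expose one head per task (the main task plus the pseudo-tasks) on top of the shared P2 features.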
13. Generating Pseudo Tasks
- A pseudo-task is constructed by sampling a random 2D patch and using it as a template to form a local 2D filter that operates on every training image. The value assigned to an image under this task is the maximum over the result of this 2D convolution operation (see the sketch below).
- This simple construction is brittle to scale, translation, and slight intensity variations.
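A minimal NumPy/SciPy sketch of this construction; the 12x12 patch size and the way the template is sampled from a training image are illustrative assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def pseudo_task_value(image, patch):
    """Max response of sliding the patch template over the image."""
    response = correlate2d(image, patch, mode="valid")
    return response.max()

rng = np.random.default_rng(0)
images = rng.random((10, 140, 140))            # toy gray-scale training images
patch = images[3, 30:42, 50:62]                # randomly chosen 12x12 template
targets = [pseudo_task_value(im, patch) for im in images]  # pseudo-task labels
```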
14. Generating Pseudo Tasks
- Applying Gabor filters of 4 orientations and 16 scales results in 64 feature maps of size 104x104 for each image.
- A max-pooling operation is performed first within each non-overlapping 4x4 neighborhood and then within each band of two successive scales, resulting in 32 feature maps of size 26x26 for each image.
- A set of K RBF filters of size 7x7 with 4 orientations is then sampled and used as the parameters of the pseudo-tasks, resulting in 8 feature maps of size 20x20.
- Finally, max pooling is performed on the result across all the scales and within every non-overlapping 10x10 neighborhood, giving a 2x2 feature map which constitutes the value of this image under this pseudo-task (a pooling helper is sketched below).
- This yields 4K pseudo-tasks (K actual random patches, each contributing one value per quadrant of the image).
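The non-overlapping max pooling used at each stage can be sketched as follows; the function name and the example shapes are assumptions for illustration.

```python
import numpy as np

def max_pool(feature_map, n):
    """Max over non-overlapping n x n neighborhoods of a 2-D feature map."""
    h, w = feature_map.shape
    blocks = feature_map[: h - h % n, : w - w % n].reshape(h // n, n, w // n, n)
    return blocks.max(axis=(1, 3))

fmap = np.random.rand(20, 20)          # one 20x20 map from the RBF-filter stage
print(max_pool(fmap, 10).shape)        # (2, 2): one pseudo-task value per quadrant
```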
15. Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields, IJCV, in press
16. Results on Caltech-101
Testing one image (the forward operation) takes 0.18 seconds.
17. Gender and Ethnicity Recognition
18. First-layer Features
19. Convergence Rate