Title: Restricted Boltzmann Machine and Deep Belief Net
1 Restricted Boltzmann Machine and Deep Belief Net
- Wanli Ouyang
- wlouyang@ee.cuhk.edu.hk
Animations are available for illustration
2 Outline
- Short introduction on deep learning
- Short introduction on statistical models and graphical models
- Restricted Boltzmann Machine (RBM) and contrastive divergence
- Deep belief net (DBN)

RBM and DBN are statistical models.
Deep belief net is trained using RBM and CD.
Deep belief net is an unsupervised training algorithm for deep neural networks.
3 Good learning resources
- Webpages
  - Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
  - Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
  - MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
  - Deep Learning Tutorials: http://deeplearning.net/tutorial/
  - Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
  - Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
  - CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
- People
  - Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
  - Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
  - Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
  - Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
  - Yoshua Bengio: www.iro.umontreal.ca/~bengioy
  - Yann LeCun: http://yann.lecun.com/
  - Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
  - Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
- Acknowledgement
  - Many materials in this ppt are from these papers, tutorials, etc. (especially Hinton's and Frean's). Sorry for not listing them in full detail.
  - Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
4 [Timeline figure: neural network / back propagation (1986); deep belief net, Science (2006); speech (2011); 2012]

Deep learning since 2006:
- Unsupervised layer-wise pre-training
- Better designs for modeling and training (normalization, nonlinearity, dropout)
- Feature learning
- New development of computer architectures
  - GPU
  - Multi-core computer systems
- Large scale databases

Shallow models before that:
- Specific methods for specific tasks
- Hand-crafted features (GMM-HMM, SIFT, LBP, HOG)
- SVM, boosting, decision tree, KNN
- Loose tie with biological systems

Neural networks with back propagation (1986):
- Solve general learning problems
- Tied with biological systems
- But they were given up: hard to train, insufficient computational resources, small training sets, did not work well

Kruger et al., TPAMI 2013
5 Outline
- Short introduction on deep learning
- Short introduction on statistical models and graphical models
- Restricted Boltzmann Machine (RBM) and contrastive divergence
- Deep belief net (DBN)
6 Graphical model for statistics
- Conditional independence between random variables
- Given C, A and B are independent:
  - P(A, B | C) = P(A | C) P(B | C)
- P(A, B, C) = P(A, B | C) P(C) = P(A | C) P(B | C) P(C)
- Any two nodes are conditionally independent given the values of their parents.

[Figure: C = Smoker?, A = Has lung cancer, B = Has bronchitis]

http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
7 Directed and undirected graphical models
- Directed graphical model:
  - P(A, B, C) = P(A | C) P(B | C) P(C)
  - Any two nodes are conditionally independent given the values of their parents.
- Undirected graphical model:
  - P(A, B, C) = (1/Z) Ψ(B, C) Ψ(A, C)
  - Also called Markov Random Field (MRF)

[Figures: small directed and undirected graphs over nodes A, B, C, plus a directed graph that adds a node D]

P(A, B, C, D) = P(D | A, B) P(B | C) P(A | C) P(C)
8 Modeling an undirected model
- The joint distribution is a normalized product of potentials; the normalizer Z is the partition function.
- Example: P(A, B, C) = (1/Z) Ψ(B, C) Ψ(A, C)

[Figure: C = Is smoker?, A = Is healthy?, B = Has lung cancer, with edge weights w1 and w2]
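The formula on this slide did not survive extraction. As a sketch of the standard form it refers to (not a verbatim copy of the slide), an undirected model over variables x with energy E(x) is written as
\[
P(\mathbf{x}) = \frac{1}{Z}\exp\bigl(-E(\mathbf{x})\bigr), \qquad
Z = \sum_{\mathbf{x}} \exp\bigl(-E(\mathbf{x})\bigr),
\]
where Z is the partition function named above. Reading the figure's w1 and w2 as the parameters of the two pairwise potentials Ψ(A, C) and Ψ(B, C) is my assumption, not something stated on the slide.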
9 More directed and undirected models

[Figures: a hidden Markov model with hidden states h1, h2, h3 and observations y1, y2, y3; an MRF in 2D over a 3x3 grid of nodes A-I]
10 More directed and undirected models

[Figures: a directed graph over A, B, C, D; a hidden Markov model with hidden states h1, h2, h3 and observations y1, y2, y3]

P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2 | h1) P(h3 | h2) P(y1 | h1) P(y2 | h2) P(y3 | h3)
P(A, B, C, D) = P(A) P(B) P(C | B) P(D | A, B, C)
11 More directed and undirected models

12 Extended reading on graphical models
- Zoubin Ghahramani's video lecture on graphical models:
  http://videolectures.net/mlss07_ghahramani_grafm/
13 Outline
- Short introduction on deep learning
- Short introduction on statistical models and graphical models
- Restricted Boltzmann machine and contrastive divergence
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
- Deep belief net

[Callout on the slide: "A training algorithm for ..."]
14 Outline
- Short introduction on deep learning
- Short introduction on statistical models and graphical models
- Restricted Boltzmann machine and contrastive divergence
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
- Deep belief net

[Callout on the slide: "A specific, useful case of ..."]
15 Outline
- Short introduction on deep learning
- Short introduction on statistical models and graphical models
- Restricted Boltzmann machine and contrastive divergence
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
- Deep belief net
16 Product of Experts

[Figure: the product-of-experts distribution with its partition function and energy function labeled; an MRF in 2D over a 3x3 grid of nodes A-I as an example]
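The formula images on this slide are missing; the standard product-of-experts distribution that the "partition function" and "energy function" labels point to (following Hinton, 2002) is
\[
p(\mathbf{x}\mid\theta_1,\dots,\theta_M) = \frac{1}{Z}\prod_{m=1}^{M} f_m(\mathbf{x}\mid\theta_m),
\qquad
Z = \sum_{\mathbf{x}}\prod_{m=1}^{M} f_m(\mathbf{x}\mid\theta_m).
\]
Writing each expert as f_m(x | θ_m) = exp(-E_m(x)) turns the product into an energy-based model whose total energy is the sum of the experts' energies.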
17 Product of Experts
18 Products of experts versus mixture models
- Product of experts
  - An "and" operation
  - Sharper than a mixture
  - Each expert can constrain a different subset of dimensions.
- Mixture model, e.g. Gaussian mixture model
  - An "or" operation
  - A weighted sum of many density functions
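To make the "and" versus "or" contrast concrete (standard forms, not reproduced from the slide images):
\[
\text{product of experts: } p(\mathbf{x}) \propto \prod_m f_m(\mathbf{x}\mid\theta_m),
\qquad
\text{mixture: } p(\mathbf{x}) = \sum_m \alpha_m\, p_m(\mathbf{x}\mid\theta_m),\ \ \sum_m \alpha_m = 1.
\]
In the product, every expert must assign x reasonably high probability ("and"), which is why the combined density is sharper than any single expert; in the mixture, a single component suffices to make x likely ("or").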
19 Outline
- Basic background on statistical learning and graphical models
- Contrastive divergence and Restricted Boltzmann machine
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
- Deep belief net
20 Contrastive Divergence (CD)
- Probability
- Maximum Likelihood and gradient descent

[Formula images; labels: model dist., data dist., expectation]
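The formula images here are missing. As a sketch of the standard maximum-likelihood gradient for an energy-based model p(x; θ) = exp(-E(x; θ)) / Z(θ), which the "data dist.", "model dist." and "expectation" labels refer to:
\[
\frac{\partial \log p(\mathbf{x};\theta)}{\partial \theta}
= \Bigl\langle -\frac{\partial E}{\partial\theta} \Bigr\rangle_{\text{data}}
- \Bigl\langle -\frac{\partial E}{\partial\theta} \Bigr\rangle_{\text{model}}.
\]
Gradient ascent therefore lowers the energy of training data and raises the energy of configurations the model itself favors; the second expectation, taken under the model distribution, is the intractable term discussed on the next slide.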
21 Contrastive Divergence (CD)

P(A, B, C) = P(A | C) P(B | C) P(C)

[Figure: the exact gradient requires an expectation that is intractable to compute directly but tractable with Gibbs sampling (sample p(z1, z2, ..., zM)); the data-dependent term is easy to compute. Running the sampler to convergence gives an accurate but slow gradient; fast contrastive divergence with T = 1 gives an approximate but fast gradient toward the minimum.]
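Contrastive divergence replaces that intractable model expectation with an expectation over samples obtained after only T steps of Gibbs sampling started from the data (Hinton, 2002). For T = 1,
\[
\Delta\theta \;\propto\;
\Bigl\langle -\frac{\partial E}{\partial\theta} \Bigr\rangle_{\text{data}}
- \Bigl\langle -\frac{\partial E}{\partial\theta} \Bigr\rangle_{T=1},
\]
which is the "approximate but fast" gradient in the figure; running the chain to equilibrium would recover the accurate but slow maximum-likelihood gradient.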
22 Gibbs Sampling for graphical models

[Figure: a bipartite graph with hidden nodes h1-h5 and visible nodes x1-x3]

More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)
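As a reminder of the procedure (standard Gibbs sampling, stated here rather than copied from the slide): each variable is resampled from its conditional distribution given all the others. For the bipartite graph shown, the two layers can be resampled as blocks,
\[
\mathbf{h}^{(t)} \sim P(\mathbf{h}\mid\mathbf{x}^{(t-1)}), \qquad
\mathbf{x}^{(t)} \sim P(\mathbf{x}\mid\mathbf{h}^{(t)}),
\]
and iterating the two steps leaves the joint distribution invariant, so a long run yields samples from P(x, h).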
23 Convergence of Contrastive Divergence (CD)
- The fixed points of ML are not fixed points of CD, and vice versa.
  - CD is a biased learning algorithm.
  - But the bias is typically very small.
  - CD can be used for getting close to the ML solution, and then ML learning can be used for fine-tuning.
- It is not clear if CD learning converges (to a stable fixed point). As of 2005, a proof was not available.
  - Further theoretical results? Please inform us.

M. A. Carreira-Perpinan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005.
24 Outline
- Basic background on statistical learning and graphical models
- Contrastive divergence and Restricted Boltzmann machine
  - Product of experts
  - Contrastive divergence
  - Restricted Boltzmann Machine
- Deep belief net
25 Boltzmann Machine
- Undirected graphical model, with hidden nodes.

Boltzmann machine: E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh
26 Restricted Boltzmann Machine (RBM)

Boltzmann machine: E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh
- Undirected, loopy, layered
- RBM: E(x, h) = b'x + c'h + h'Wx

[Figure: a bipartite graph with hidden nodes h1-h5 and visible nodes x1-x3, connected through the weight matrix W; the joint distribution is normalized by the partition function]

P(x_j = 1 | h) = σ(b_j + W_j h)
P(h_i = 1 | x) = σ(c_i + W_i x)

Read the manuscript for details.
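A brief sketch of where these conditionals come from (standard RBM algebra, assuming the convention P(x, h) ∝ exp(b'x + c'h + h'Wx)): because the energy has no hidden-hidden terms, P(h | x) factorizes over the hidden units,
\[
P(\mathbf{h}\mid\mathbf{x}) = \prod_i P(h_i\mid\mathbf{x}), \qquad
P(h_i = 1\mid\mathbf{x}) = \frac{\exp(c_i + W_i\mathbf{x})}{1 + \exp(c_i + W_i\mathbf{x})} = \sigma(c_i + W_i\mathbf{x}),
\]
and the symmetric argument over the visible units gives P(x_j = 1 | h) = σ(b_j + W_j h).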
27 Restricted Boltzmann Machine (RBM)
- E(x, h) = b'x + c'h + h'Wx
- x = [x1, x2, ...]^T, h = [h1, h2, ...]^T
- Parameter learning
  - Maximum log-likelihood

Geoffrey E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14, 1771-1800 (2002).
28 CD for RBM

[CD weight-update formula image]

P(x_j = 1 | h) = σ(b_j + W_j h)
P(h_i = 1 | x) = σ(c_i + W_i x)
29 CD for RBM

[Figure: alternating Gibbs sampling between the visible units (x1, x2, ...) and the hidden units, using
P(h_i = 1 | x) = σ(c_i + W_i x) for the up-pass and
P(x_j = 1 | h) = σ(b_j + W_j h) for the down-pass]
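A minimal numpy sketch of the CD-1 update pictured above, for a binary RBM. This is my illustration, not the code released with the lecture; the function and variable names are hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr=0.1, rng=np.random):
    # x: (n_visible,) binary data vector
    # W: (n_hidden, n_visible) weights; b: visible biases; c: hidden biases
    # Positive phase: hidden probabilities and a sample, given the data
    p_h0 = sigmoid(c + W @ x)                        # P(h_i = 1 | x)
    h0 = (rng.rand(p_h0.size) < p_h0).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and back up
    p_x1 = sigmoid(b + W.T @ h0)                     # P(x_j = 1 | h)
    x1 = (rng.rand(p_x1.size) < p_x1).astype(float)
    p_h1 = sigmoid(c + W @ x1)
    # CD-1 updates: <h x'>_data - <h x'>_reconstruction
    W += lr * (np.outer(p_h0, x) - np.outer(p_h1, x1))
    b += lr * (x - x1)
    c += lr * (p_h0 - p_h1)
    return W, b, c

Sweeping cd1_update over the training vectors (in practice mini-batches with momentum and weight decay) is the whole RBM learning loop.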
30 RBM for classification

Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
31 RBM itself has many applications
- Multiclass classification
- Collaborative filtering
- Motion capture modeling
- Information retrieval
- Modeling natural images
- Segmentation

Y. Li, D. Tarlow, R. Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton, Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.
32 Outline
- Basic background on statistical learning and graphical models
- Contrastive divergence and Restricted Boltzmann machine
- Deep belief net (DBN)
  - Why deep learning?
  - Learning and inference
  - Applications
33 Belief Nets
- A belief net is a directed acyclic graph composed of random variables.

[Figure: random hidden causes at the top, visible effects at the bottom]
34 Deep Belief Net
- A belief net that is deep
- A generative model
  - P(x, h1, ..., hl) = p(x | h1) p(h1 | h2) ... p(h(l-2) | h(l-1)) p(h(l-1), hl)
- Used for unsupervised training of multi-layer deep models.

[Figure: layers h3, h2, h1, x]

Pixels -> edges -> local shapes -> object parts

P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
35 Why deep learning?

Pixels -> edges -> local shapes -> object parts

- The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
- An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
- Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.

T. Serre, et al., A quantitative theory of immediate visual recognition, Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33-56, 2007.
Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
36 Why deep learning?
- Linear regression, logistic regression: depth 1
- Kernel SVM: depth 2
- Decision tree: depth 2
- Boosting: depth 2
- The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NN with linear threshold units and positive weights.)

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
37 Example: sum-product network (SPN)

[Figure: a shallow sum-product network over inputs X1, ..., X5 with on the order of N·2^(N-1) parameters, versus a deep sum-product network over the same inputs with O(N) parameters]
38 Depth of existing approaches
- Boosting (2 layers)
  - L1: base learners
  - L2: vote or linear combination of layer 1
- Decision tree, LLE, KNN, kernel SVM (2 layers)
  - L1: matching degree to a set of local templates
  - L2: combine these degrees
- Brain: 5-10 layers
39 Why does a decision tree have depth 2?
- It relies on a partition of the input space.
- Local estimator: it relies on a partition of the input space and uses separate parameters for each region. Each region is associated with a leaf.
- It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
- The number of training samples needed is exponential in the number of input dimensions to achieve a fixed error rate.
40 Outline
- Basic background on statistical learning and graphical models
- Contrastive divergence and Restricted Boltzmann machine
- Deep belief net (DBN)
  - Why DBN?
  - Learning and inference
  - Applications
41 Deep Belief Net
- Inference problem: infer the states of the unobserved variables.
- Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: layers h3, h2, h1, x]

P(x, h1, h2, h3) = p(x | h1) p(h1 | h2) p(h2, h3)
42 Deep Belief Net
- Inference problem (the problem of explaining away)
- P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)

[Figure: a node C with parents A and B; a visible unit x1 in layer x with parents h11 and h12 in layer h1]

An example from the manuscript.
Solution: complementary priors
43 Deep Belief Net
- Inference problem (the problem of explaining away)
- Solution: complementary priors

[Figure: a deep belief net with layers x, h1, h2, h3, h4 and layer sizes 2000, 1000, 500, 30]
44 Deep Belief Net
- Explaining-away problem of inference (see the manuscript)
  - Solution: complementary priors, see the manuscript
- Learning problem
  - Greedy layer-by-layer RBM training (optimizes a lower bound) and fine-tuning
  - Contrastive divergence for RBM training

P(h_i = 1 | x) = σ(c_i + W_i x)

[Figure: train an RBM on (x, h1), then an RBM on (h1, h2), then an RBM on (h2, h3)]
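A compact sketch of the greedy procedure, assuming binary data and the same CD-1 update as in the earlier sketch; train_rbm and pretrain_dbn are hypothetical names, not functions from the DeepLearningToolbox.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=5, lr=0.1, seed=0):
    # Train one binary RBM with CD-1 (same update as in the earlier sketch).
    rng = np.random.RandomState(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.randn(n_hidden, n_visible)
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for x in data:
            p_h0 = sigmoid(c + W @ x)
            h0 = (rng.rand(n_hidden) < p_h0).astype(float)
            x1 = (rng.rand(n_visible) < sigmoid(b + W.T @ h0)).astype(float)
            p_h1 = sigmoid(c + W @ x1)
            W += lr * (np.outer(p_h0, x) - np.outer(p_h1, x1))
            b += lr * (x - x1)
            c += lr * (p_h0 - p_h1)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    # Greedy layer-wise pretraining: train an RBM, then feed its hidden
    # probabilities to the next RBM as "data".
    layers, rep = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(rep, n_hidden)
        layers.append((W, b, c))
        rep = sigmoid(c + rep @ W.T)   # deterministic up-pass to the next layer
    return layers

# e.g. layers = pretrain_dbn(binary_images, [500, 500, 2000])

Each RBM is trained on the hidden representation produced by the layer below, which is the layer-by-layer training referred to above; fine-tuning (with BP or a wake-sleep procedure) follows.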
45 Code reading
- It is much easier to read the DeepLearningToolbox for understanding DBN.
46 (No Transcript)
47 Deep Belief Net
- Why does greedy layer-wise learning work?
- It optimizes a lower bound (1) on the log-likelihood.
- When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1).

[Figure: layers x, h1, h2, h3; the (x, h1) RBM is kept fixed while the (h1, h2) RBM is trained]
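The bound (1) is an image in the original; the standard form it refers to (Hinton, Osindero and Teh, 2006) is
\[
\log P(\mathbf{x}) \;\ge\; \sum_{\mathbf{h}^1} Q(\mathbf{h}^1\mid\mathbf{x})
\bigl[\log P(\mathbf{x}\mid\mathbf{h}^1) + \log P(\mathbf{h}^1)\bigr]
+ H\bigl(Q(\mathbf{h}^1\mid\mathbf{x})\bigr),
\]
where Q(h1 | x) is the first RBM's posterior and H is its entropy. Freezing layer 1 freezes Q and P(x | h1), so training the upper layers can improve the bound only through the prior term P(h1), which is exactly the P(h1) mentioned above.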
48 Deep Belief Net and RBM
- An RBM can be considered as a DBN that has infinitely many layers.

[Figure: an infinite directed belief net with tied weights (..., x2, h1, x1, h0, x0) next to the equivalent RBM over (h0, x0)]
49 Pretrain, fine-tune and inference (autoencoder)
50 Pretrain, fine-tune and inference - 2
- y: identity or rotation degree

[Figure: pretraining and fine-tuning stages]
51 How many layers should we use?
- There might be no universally right depth
  - Bengio suggests that several layers are better than one
  - Results are robust against changes in the size of a layer, but the top layer should be big
  - A parameter: it depends on your task
- With enough narrow layers, we can model any distribution over binary vectors [1]

[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.

Copied from http://videolectures.net/mlss09uk_hinton_dbn/
52 Effect of Unsupervised Pre-training
- Erhan et al., AISTATS 2009
53 Effect of Depth

[Figure: results with pre-training vs. without pre-training]
54 Why unsupervised pre-training makes sense

[Figure: two generative scenarios. Left: hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway. Right: the image generates the label directly.]

If image-label pairs are generated the first way (stuff -> image, stuff -> label), it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.

If image-label pairs were generated the second way (image -> label), it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?
55 Beyond layer-wise pretraining
- Layer-wise pretraining is efficient but not optimal.
- It is possible to train the parameters of all layers jointly using a wake-sleep algorithm:
  - a bottom-up pass in a layer-wise manner,
  - a top-down pass, refitting the earlier layers.
56 Fine-tuning with a contrastive version of the wake-sleep algorithm
- After learning many layers of features, we can fine-tune the features to improve generation.
- 1. Do a stochastic bottom-up pass
  - Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
- 2. Do a few iterations of sampling in the top-level RBM
  - Adjust the weights in the top-level RBM.
- 3. Do a stochastic top-down pass
  - Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
57 Include lateral connections
- An RBM has no connections within a layer.
- This can be generalized.
- Lateral connections in the first layer [1]
  - Sampling from P(h | x) is still easy, but sampling from P(x | h) is more difficult.
- Lateral connections at multiple layers [2]
  - Generate more realistic images.
  - CD is still applicable, with small modifications.

[1] B. A. Olshausen and D. J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1?, Vision Research, vol. 37, pp. 3311-3325, December 1997.
[2] S. Osindero and G. E. Hinton, Modeling image patches with a directed hierarchy of Markov random fields, in NIPS, 2007.
58 Without lateral connections
59 With lateral connections
60 My data is real valued
- Map it to [0, 1] linearly: x <- ax + b
- Use another distribution
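For the first bullet, a trivial numpy helper (a hypothetical name, my illustration) that applies the linear map per feature:

import numpy as np

def to_unit_interval(data):
    # Map each feature linearly to [0, 1]: x <- a*x + b with
    # a = 1 / (max - min) and b = -min / (max - min).
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo + 1e-12)   # eps guards constant features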
61 My data has temporal dependency
62 My data has temporal dependency
63 I consider DBN as
- a statistical model that is used for unsupervised training of fully connected deep models,
- a directed graphical model that is approximated by fast learning and inference algorithms,
- a directed graphical model that is fine-tuned using a mature neural network learning approach: BP.
64 Outline
- Basic background on statistical learning and graphical models
- Contrastive divergence and Restricted Boltzmann machine
- Deep belief net (DBN)
  - Why DBN?
  - Learning and inference
  - Applications
65 Applications of deep learning
- Hand-written digit recognition
- Dimensionality reduction
- Information retrieval
- Segmentation
- Denoising
- Phone recognition
- Object recognition
- Object detection

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Welling, M., et al., Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
A. R. Mohamed, et al., Deep Belief Networks for phone recognition. NIPS 2009 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 2009.
66 Object recognition
- NORB
  - Logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
  - With extra unlabeled data (and the same amount of labeled data as before), the DBN achieves 5.2%.
67 Object recognition
ImageNet

Rank  Name         Error rate  Description
1     U. Toronto   0.15315     Deep learning
2     U. Tokyo     0.26172     Hand-crafted features and learning models. Bottleneck.
3     U. Oxford    0.26979     Hand-crafted features and learning models. Bottleneck.
4     Xerox/INRIA  0.27058     Hand-crafted features and learning models. Bottleneck.
68 Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
69 The training and test sets
- 11,000 unlabeled cases
- 100, 500, or 1000 labeled cases
- Face patches from new people (the test set)
70 The root mean squared error in the orientation when combining GPs with deep belief nets

                                            100 labels   500 labels   1000 labels
GP on the pixels                            22.2         17.2         16.3
GP on top-level features                    17.9         12.7         11.2
GP on top-level features with fine-tuning   15.2          7.2          6.4

Conclusion: the deep features are much better than the pixels. Fine-tuning helps a lot.
71 Deep Autoencoders (Hinton & Salakhutdinov, 2006)
- They always looked like a really nice way to do non-linear dimensionality reduction.
- But it is very difficult to optimize deep autoencoders using backpropagation.
- We now have a much better way to optimize them:
  - First train a stack of 4 RBMs
  - Then unroll them
  - Then fine-tune with backprop

[Architecture: encoder 28x28 -> 1000 neurons -> 500 neurons -> 250 neurons -> 30 linear units; decoder 250 neurons -> 500 neurons -> 1000 neurons -> 28x28]
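A sketch of the "unroll" step, assuming a stack of (W, b, c) triples from greedy pretraining as in the earlier sketch; autoencode is a hypothetical name, and fine-tuning with backprop through this network is not shown.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(rbm_stack, x):
    # rbm_stack: list of (W, b, c) tuples, W of shape (n_hidden, n_visible);
    # the top layer is treated as the linear code (30 units above).
    # Encoder: bottom-up pass through the stack
    h = x
    for i, (W, b, c) in enumerate(rbm_stack):
        pre = c + W @ h
        h = pre if i == len(rbm_stack) - 1 else sigmoid(pre)  # linear code units
    code = h
    # Decoder: top-down pass with the transposed ("unrolled") weights
    v = code
    for W, b, c in reversed(rbm_stack):
        v = sigmoid(b + W.T @ v)
    return code, v   # low-dimensional code and reconstruction of x

The decoder simply reuses each RBM's weights transposed, which is what unrolling means here; backprop fine-tuning then unties and adjusts all of the weights.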
72 Deep Autoencoders (Hinton & Salakhutdinov, 2006)

[Figure: real data, 30-D deep autoencoder reconstructions, 30-D PCA reconstructions]
73 A comparison of methods for compressing digit images to 30 real numbers

[Figure: real data, 30-D deep autoencoder, 30-D logistic PCA, 30-D PCA]
74 Representation of DBN
75 Our works
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
76 Pedestrian Detection

[Figures from the group's pedestrian detection papers: ICCV13, CVPR12, CVPR13, ICCV13]
77 Facial keypoint detection, CVPR13 (2% average error on LFPW)
Face parsing, CVPR12
Pedestrian parsing, ICCV13
78 Face Recognition and Face Attribute Recognition (LFW 96.45%)
- Face verification, ICCV13
- Recovering Canonical-View Face Images, ICCV13
- Face attribute recognition, ICCV13
79 Summary
- Deep belief net (DBN)
  - is a network with deep layers, which provides strong representation power;
  - is a generative model;
  - can be learned layer by layer with RBMs using contrastive divergence;
  - has many applications, and more applications are yet to be found.

Generative models explicitly or implicitly model the distribution of inputs and outputs. Discriminative models model the posterior probabilities directly.
80 DBN vs. SVM
- A very controversial topic
- Model
  - DBN is generative, SVM is discriminative. But fine-tuning of DBN is discriminative.
- Application
  - SVM is widely applied.
  - Researchers are expanding the application areas of DBN.
- Learning
  - DBN is non-convex and slow.
  - SVM is convex and fast (in the linear case).
- Which one is better?
  - Time will tell.
  - You can contribute.

Hinton: The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore's law.
81 (No Transcript)