Restricted Boltzmann Machine and Deep Belief Net

Transcript and Presenter's Notes

1
Restricted Boltzmann Machine and Deep Belief Net
  • Wanli Ouyang
  • wlouyang@ee.cuhk.edu.hk

Animation is available for illustration
2
Outline
  • Short introduction on deep learning
  • Short introduction on statistical models and
    Graphical model
  • Restricted Boltzmann Machine (RBM) and
    Contrastive divergence
  • Deep belief net (DBN)

RBM and DBN are statistical models.
A deep belief net is trained using RBMs and contrastive divergence (CD).
Deep belief net training is an unsupervised procedure for training deep neural networks.
3
Good learning resources
  • Webpages
  • Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
  • Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
  • MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
  • Deep Learning Tutorials: http://deeplearning.net/tutorial/
  • Hinton's Tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
  • Fergus's Tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
  • CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
  • People
  • Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
  • Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
  • Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
  • Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
  • Yoshua Bengio: www.iro.umontreal.ca/~bengioy
  • Yann LeCun: http://yann.lecun.com/
  • Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
  • Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
  • Acknowledgement
  • Many materials in this presentation are from these papers, tutorials, etc. (especially Hinton's and Frean's). Sorry for not listing them in full detail.

Dumitru Erhan, Aaron Courville, Yoshua Bengio.
Understanding Representations Learned in Deep
Architectures. Technical Report.
4
[Timeline: neural network / back-propagation (1986); deep belief net, Science paper (2006); speech (2011); 2012]

Since 2006 (deep learning):
  • Unsupervised layer-wise pre-training
  • Better designs for modeling and training (normalization, nonlinearity, dropout)
  • Feature learning
  • New developments in computer architectures
  • GPU
  • Multi-core computer systems
  • Large-scale databases
  • Loose tie with biological systems

Before 2006 (conventional approaches):
  • Shallow models
  • Specific methods for specific tasks
  • Hand-crafted features (GMM-HMM, SIFT, LBP, HOG)
  • SVM
  • Boosting
  • Decision tree
  • KNN

Neural networks (1986) aimed to solve general learning problems and were tied to biological systems, but they were given up because they were:
  • Hard to train
  • Short of computational resources
  • Limited by small training sets
  • Not working well

Kruger et al., TPAMI 2013
5
Outline
  • Short introduction on deep learning
  • Short introduction on statistical models and
    Graphical model
  • Restricted Boltzmann Machine (RBM) and
    Contrastive divergence
  • Deep belief net (DBN)

6
Graphical model for Statistics
  • Conditional independence between random variables
  • Given C, A and B are independent:
  • P(A, B | C) = P(A | C) P(B | C)
  • P(A, B, C) = P(A, B | C) P(C)
  •            = P(A | C) P(B | C) P(C)
  • Any two nodes are conditionally independent given
    the values of their parents.

[Figure: directed graph with C = "Smoker?", A = "Has lung cancer", B = "Has bronchitis"]
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
7
Directed and undirected graphical model
  • Directed graphical model
  • P(A, B, C) = P(A | C) P(B | C) P(C)
  • Any two nodes are conditionally independent given the values of their parents.
  • Undirected graphical model
  • P(A, B, C) ∝ Ψ(B, C) Ψ(A, C)
  • Also called Markov Random Field (MRF)

[Figures: directed and undirected graphs over A, B, C; a directed graph over A, B, C, D with
P(A, B, C, D) = P(D | A, B) P(B | C) P(A | C) P(C)]
8
Modeling undirected model
  • Probability: P(x) = (1/Z) ∏_c Ψ_c(x_c), where Z is the partition function
  • Example: P(A, B, C) ∝ Ψ(B, C) Ψ(A, C)

[Figure: undirected graph with C = "Is smoker?", A = "Is healthy", B = "Has lung cancer"; the edges A-C and B-C carry weights w1 and w2]
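As an illustration (not from the slides), a brute-force computation of the partition function Z for a tiny undirected model over binary A, B, C with pairwise potentials on the edges A-C and B-C; the agreement-style potentials and the weights w1, w2 are assumptions made for the example:

    import numpy as np
    from itertools import product

    w1, w2 = 1.5, 0.8   # assumed weights for the A-C and B-C potentials (illustration only)

    def unnormalized(a, b, c):
        # pairwise potentials that reward agreement between linked nodes
        return np.exp(w1 * (a == c) + w2 * (b == c))

    # partition function Z: sum of the unnormalized scores over all 2^3 joint configurations
    Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

    def prob(a, b, c):
        return unnormalized(a, b, c) / Z

    print(Z, prob(1, 1, 1), prob(1, 0, 0))

For larger models this sum is intractable, which is why the later slides resort to sampling.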
9
More directed and undirected models
[Figures: "MRF in 2D", a 3x3 grid of nodes A-I; a hidden Markov model with hidden states h1, h2, h3 and observations y1, y2, y3]
10
More directed and undirected models
[Figures: a directed graph over A, B, C, D; a hidden Markov model over h1-h3 with observations y1-y3]
P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2 | h1) P(h3 | h2) P(y1 | h1) P(y2 | h2) P(y3 | h3)
P(A, B, C, D) = P(A) P(B) P(C | B) P(D | A, B, C)
11
More directed and undirected models
12
Extended reading on graphical model
  • Zoubin Ghahramani's video lecture on graphical models:
  • http://videolectures.net/mlss07_ghahramani_grafm/

13
Outline
  • Short introduction on deep learning
  • Short introduction on statistical models and
    Graphical model
  • Restricted Boltzmann machine and Contrastive
    divergence
  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine
  • Deep belief net

(RBM and contrastive divergence provide a training algorithm for the deep belief net)
14
Outline
  • Short introduction on deep learning
  • Short introduction on statistical models and
    Graphical model
  • Restricted Boltzmann machine and Contrastive
    divergence
  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine
  • Deep belief net

(the RBM is a specific, useful case of a product of experts)
15
Outline
  • Short introduction on deep learning
  • Short introduction on statistical models and
    Graphical model
  • Restricted Boltzmann machine and Contrastive
    divergence
  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine
  • Deep belief net

16
Product of Experts
[Figure: the product-of-experts probability, with its partition function and energy function labelled; example: "MRF in 2D", a 3x3 grid of nodes A-I]
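The product-of-experts formula itself is not reproduced in the transcript; its standard form (Hinton, Neural Computation 2002), consistent with the partition-function and energy-function labels above, is

\[
p(\mathbf{x}\mid\theta_1,\dots,\theta_M)
  = \frac{\prod_m f_m(\mathbf{x}\mid\theta_m)}{\sum_{\mathbf{x}'}\prod_m f_m(\mathbf{x}'\mid\theta_m)},
\qquad f_m(\mathbf{x}\mid\theta_m)=e^{-E_m(\mathbf{x};\theta_m)},
\]

where the denominator is the partition function and the sum of the experts' energies is the energy function.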
17
Product of Experts
18
Products of experts versus Mixture model
  • Products of experts
  • "and" operation
  • Sharper than a mixture
  • Each expert can constrain a different subset of dimensions.
  • Mixture model, e.g. Gaussian mixture model
  • "or" operation
  • A weighted sum of many density functions
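A small NumPy illustration (not from the slides) of why a product of two experts is sharper than their mixture; the two Gaussians are arbitrary choices:

    import numpy as np

    def gauss(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    x = np.linspace(-4.0, 6.0, 1001)
    f1, f2 = gauss(x, 0.0, 1.5), gauss(x, 2.0, 1.5)   # two 1-D "experts"

    mixture = 0.5 * f1 + 0.5 * f2                 # "or": weighted sum, stays broad
    product = f1 * f2                             # "and": large only where both are large
    product /= product.sum() * (x[1] - x[0])      # renormalize (divide by the partition function)

    print(mixture.max(), product.max())           # the product's peak is higher and narrower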

19
Outline
  • Basic background on statistical learning and
    Graphical model
  • Contrastive divergence and Restricted Boltzmann
    machine
  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine
  • Deep belief net

20
Contrastive Divergence (CD)
  • Probability
  • Maximum Likelihood and gradient descent

(The gradient of the log-likelihood is the difference between an expectation under the data distribution and an expectation under the model distribution.)
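The gradient referred to above can be written as follows (a standard identity for energy-based models, reconstructed here rather than copied from the slide): for p(x; θ) = exp(-E(x; θ)) / Z(θ),

\[
\frac{\partial \log p(x;\theta)}{\partial\theta}
  = -\frac{\partial E(x;\theta)}{\partial\theta}
  + \mathbb{E}_{\tilde{x}\sim p(\cdot;\theta)}\!\left[\frac{\partial E(\tilde{x};\theta)}{\partial\theta}\right].
\]

Averaged over the training set, the first term is an expectation under the data distribution and the second an expectation under the model distribution; the model-distribution term is the intractable one that CD approximates with a few steps of Gibbs sampling.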
21
Contrastive Divergence (CD)
Example: P(A, B, C) = P(A | C) P(B | C) P(C)

  • Gradient of the likelihood: one term is easy to compute; the other is intractable, but it can be approximated by (tractable) Gibbs sampling from p(z1, z2, …, zM).
  • Fast contrastive divergence uses T = 1: a single sampling step gives an approximate but fast gradient in place of the accurate but slow gradient when moving toward the minimum of the negative log-likelihood.
22
Gibbs Sampling for graphical model
[Figure: bipartite graph with hidden nodes h1-h5 and visible nodes x1-x3]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML)
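A minimal NumPy sketch of block Gibbs sampling for a bipartite binary model like the one in the figure, assuming the sigmoid conditionals that the RBM slides introduce later:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gibbs_chain(W, b, c, n_steps=1000, seed=0):
        """Block Gibbs sampling: alternately sample all h given x, then all x given h."""
        rng = np.random.default_rng(seed)
        n_hid, n_vis = W.shape
        x = rng.integers(0, 2, n_vis).astype(float)        # arbitrary starting state
        h = np.zeros(n_hid)
        for _ in range(n_steps):
            h = (rng.random(n_hid) < sigmoid(c + W @ x)).astype(float)    # P(h_i = 1 | x)
            x = (rng.random(n_vis) < sigmoid(b + W.T @ h)).astype(float)  # P(x_j = 1 | h)
        return x, h   # after enough steps, approximately a sample from the joint P(x, h)

    W = np.random.default_rng(1).normal(0.0, 0.5, (5, 3))  # 5 hidden, 3 visible units
    x, h = gibbs_chain(W, np.zeros(3), np.zeros(5))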
23
Convergence of Contrastive divergence (CD)
  • The fixed points of ML are not fixed points of CD
    and vice versa.
  • CD is a biased learning algorithm.
  • But the bias is typically very small.
  • CD can be used for getting close to ML solution
    and then ML learning can be used for fine-tuning.
  • It is not clear whether CD learning converges (to a stable fixed point); as of 2005, no proof was available.
  • Further theoretical results? Please inform us

M. A. Carreira-Perpiñán and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005.
24
Outline
  • Basic background on statistical learning and
    Graphical model
  • Contrastive divergence and Restricted Boltzmann
    machine
  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine
  • Deep belief net

25
Boltzmann Machine
  • Undirected graphical model, with hidden nodes.

Boltzmann machine: E(x, h) = -b'x - c'h - h'Wx - x'Ux - h'Vh
26
Restricted Boltzmann Machine (RBM)
Boltzmann machine: E(x, h) = -b'x - c'h - h'Wx - x'Ux - h'Vh
  • Undirected, loopy, layered
  • The RBM drops the within-layer terms: E(x, h) = -b'x - c'h - h'Wx

[Figure: bipartite graph with hidden layer h = (h1, …, h5), visible layer x = (x1, x2, x3), and weights W; the normalizing constant is the partition function]

P(x_j = 1 | h) = σ(b_j + W'_j h)
P(h_i = 1 | x) = σ(c_i + W_i x)
where σ(·) is the logistic sigmoid. Read the manuscript for details.
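A minimal NumPy sketch of these quantities; the parameter shapes are my choice (W[i, j] couples h_i and x_j):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n_vis, n_hid = 3, 5
    W = rng.normal(0.0, 0.1, (n_hid, n_vis))   # W[i, j] couples h_i and x_j
    b = np.zeros(n_vis)                         # visible biases
    c = np.zeros(n_hid)                         # hidden biases

    def energy(x, h):
        # E(x, h) = -b'x - c'h - h'Wx
        return -(b @ x) - (c @ h) - h @ W @ x

    def p_h_given_x(x):
        # P(h_i = 1 | x) = sigmoid(c_i + W_i x), for all i at once
        return sigmoid(c + W @ x)

    def p_x_given_h(h):
        # P(x_j = 1 | h) = sigmoid(b_j + W'_j h), for all j at once
        return sigmoid(b + W.T @ h)

    x = np.array([1.0, 0.0, 1.0])
    h = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
    print(energy(x, h), p_h_given_x(x), p_x_given_h(h))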
27
Restricted Boltzmann Machine (RBM)
  • E(x, h) = -b'x - c'h - h'Wx
  • x = [x1, x2, ...]^T, h = [h1, h2, ...]^T
  • Parameter learning
  • Maximum Log-Likelihood

Geoffrey E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14, 1771–1800 (2002).
28
CD for RBM
  • CD for RBM, very fast!

CD
P(x_j = 1 | h) = σ(b_j + W'_j h)
P(h_i = 1 | x) = σ(c_i + W_i x)
29
CD for RBM
P(h_i = 1 | x) = σ(c_i + W_i x)
P(x_j = 1 | h) = σ(b_j + W'_j h)

[Figure: alternating Gibbs sampling between hidden units h1, h2 and visible units x1, x2]
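A sketch of the corresponding CD-1 parameter update in NumPy; the learning rate and sampling details follow the usual recipe rather than the slide's exact code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(W, b, c, x0, lr=0.1, rng=None):
        """One CD-1 update for a binary RBM. W: (n_hid, n_vis); x0: data batch (n, n_vis)."""
        rng = np.random.default_rng(0) if rng is None else rng
        # positive phase: hidden probabilities and samples given the data
        ph0 = sigmoid(x0 @ W.T + c)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # negative phase: one Gibbs step back to the visible layer and up again
        px1 = sigmoid(h0 @ W + b)
        x1 = (rng.random(px1.shape) < px1).astype(float)
        ph1 = sigmoid(x1 @ W.T + c)
        # approximate gradient: <h x'>_data - <h x'>_reconstruction
        n = x0.shape[0]
        W += lr * (ph0.T @ x0 - ph1.T @ x1) / n
        b += lr * (x0 - x1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
        return W, b, c

    rng = np.random.default_rng(1)
    W, b, c = rng.normal(0.0, 0.01, (4, 6)), np.zeros(6), np.zeros(4)
    batch = (rng.random((32, 6)) > 0.5).astype(float)
    W, b, c = cd1_update(W, b, c, batch, rng=rng)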
30
RBM for classification
  • y: classification label

Hugo Larochelle and Yoshua Bengio, Classification
using Discriminative Restricted Boltzmann
Machines, ICML 2008.
31
RBM itself has many applications
  • Multiclass classification
  • Collaborative filtering
  • Motion capture modeling
  • Information retrieval
  • Modeling natural images
  • Segmentation

Y. Li, D. Tarlow, R. Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton, Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
Larochelle, H., Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.
32
Outline
  • Basic background on statistical learning and
    Graphical model
  • Contrastive divergence and Restricted Boltzmann
    machine
  • Deep belief net (DBN)
  • Why deep learning?
  • Learning and inference
  • Applications

33
Belief Nets
  • A belief net is a directed acyclic graph composed
    of random variables.

random hidden cause
visible effect
34
Deep Belief Net
  • Belief net that is deep
  • A generative model
  • P(x, h1, …, hl) = P(x | h1) P(h1 | h2) … P(hl-2 | hl-1) P(hl-1, hl)
  • Used for unsupervised training of multi-layer deep models.

[Figure: DBN with visible layer x and hidden layers h1, h2, h3]
Pixels → edges → local shapes → object parts
P(x, h1, h2, h3) = P(x | h1) P(h1 | h2) P(h2, h3)
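A sketch of how one would generate from such a model (not spelled out on the slide), assuming binary units and omitting biases for brevity: run Gibbs sampling in the top-level RBM P(h2, h3), then sample top-down through the directed layers.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sample_from_dbn(W_list, n_gibbs=200, seed=0):
        """Ancestral sampling from a DBN (biases omitted for brevity).
        W_list[k] has shape (size of layer k+1, size of layer k); layer 0 is x."""
        rng = np.random.default_rng(seed)
        W_top = W_list[-1]                                   # the top two layers form an RBM
        below = rng.integers(0, 2, W_top.shape[1]).astype(float)
        for _ in range(n_gibbs):                             # Gibbs sampling in P(h^{l-1}, h^l)
            above = (rng.random(W_top.shape[0]) < sigmoid(W_top @ below)).astype(float)
            below = (rng.random(W_top.shape[1]) < sigmoid(W_top.T @ above)).astype(float)
        h = below
        for W in reversed(W_list[:-1]):                      # directed top-down pass: P(h^{k-1} | h^k)
            h = (rng.random(W.shape[1]) < sigmoid(W.T @ h)).astype(float)
        return h                                             # a sample of the visible layer x

    rng = np.random.default_rng(1)
    sizes = [6, 8, 8, 10]                                    # toy sizes for x, h1, h2, h3
    W_list = [rng.normal(0.0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    x_sample = sample_from_dbn(W_list)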
35
Why Deep learning?
Pixels → edges → local shapes → object parts
  • The mammal brain is organized in a deep
    architecture with a given input percept
    represented at multiple levels of abstraction,
    each level corresponding to a different area of
    cortex.
  • An architecture with insufficient depth can
    require many more computational elements,
    potentially exponentially more (with respect to
    input size), than architectures whose depth is
    matched to the task.
  • Since the number of computational elements one
    can afford depends on the number of training
    examples available to tune or select them, the
    consequences are not just computational but also
    statistical: poor generalization may be expected
    when using an insufficiently deep architecture
    for representing some functions.

T. Serre et al., "A quantitative theory of immediate visual recognition," Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2009.
36
Why Deep learning?
  • Linear regression, logistic regression: depth 1
  • Kernel SVM: depth 2
  • Decision tree: depth 2
  • Boosting: depth 2
  • The basic conclusion these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NNs with linear threshold units and positive weights.)

Yoshua Bengio, Learning Deep Architectures for
AI, Foundations and Trends in Machine Learning,
2009.
37
Example sum product network (SPN)
[Figure: a shallow sum-product network over X1-X5 with 2^(N-1) product nodes requires N·2^(N-1) parameters, while a deep sum-product network computes the same function with O(N) parameters]
38
Depth of existing approaches
  • Boosting (2 layers)
  • Layer 1: base learners
  • Layer 2: vote or linear combination of layer 1
  • Decision tree, LLE, KNN, kernel SVM (2 layers)
  • Layer 1: matching degree to a set of local templates
  • Layer 2: combine these degrees
  • Brain: 5-10 layers

39
Why does a decision tree have depth 2?
  • It relies on a partition of the input space.
  • It is a local estimator: it partitions the input space and uses separate parameters for each region; each region is associated with a leaf.
  • It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
  • The number of training samples needed is exponential in the number of dimensions to achieve a fixed error rate.

40
Outline
  • Basic background on statistical learning and
    Graphical model
  • Contrastive divergence and Restricted Boltzmann
    machine
  • Deep belief net (DBN)
  • Why DBN?
  • Learning and inference
  • Applications

41
Deep Belief Net
  • Inference problem: infer the states of the unobserved variables.
  • Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: DBN with visible layer x and hidden layers h1, h2, h3]
P(x, h1, h2, h3) = P(x | h1) P(h1 | h2) P(h2, h3)
42
Deep Belief Net
  • Inference problem (the problem of explaining
    away)

  • P(A, B | C) = P(A | C) P(B | C)
  • but P(h11, h12 | x1) ≠ P(h11 | x1) P(h12 | x1)

[Figures: directed graph over A, B, C illustrating conditional independence; a layer where hidden units h11 and h12 are both parents of x1, so they become dependent once x1 is observed]
An example from the manuscript
Solution: complementary priors
43
Deep Belief Net
  • Inference problem (the problem of explaining
    away)
  • Solution: complementary priors

[Figure: DBN with visible layer x and hidden layers h1 (2000 units), h2 (1000), h3 (500), h4 (30)]
44
Deep Belief Net
P(h_i = 1 | x) = σ(c_i + W_i x)
  • Explaining-away problem of inference (see the manuscript)
  • Solution: complementary priors, see the manuscript
  • Learning problem
  • Greedy layer-by-layer RBM training (optimizes a lower bound), followed by fine-tuning (a code sketch of the greedy procedure follows the figure below)
  • Contrastive divergence for RBM training



[Figure: greedy layer-wise training: train an RBM on (x, h1), then an RBM on (h1, h2), then an RBM on (h2, h3)]
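A compact NumPy sketch of the greedy procedure; the CD-1 details follow the usual recipe, and the hyper-parameters and layer sizes are placeholders:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_rbm_cd1(data, n_hid, lr=0.1, n_epochs=10, seed=0):
        """Train one binary RBM on `data` (n_samples x n_vis) with CD-1."""
        rng = np.random.default_rng(seed)
        n, n_vis = data.shape
        W = rng.normal(0.0, 0.01, (n_hid, n_vis))
        b, c = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(n_epochs):
            ph0 = sigmoid(data @ W.T + c)                      # P(h | x) on the data
            h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sampled hidden states
            px1 = sigmoid(h0 @ W + b)                          # one-step reconstruction
            ph1 = sigmoid(px1 @ W.T + c)
            W += lr * (ph0.T @ data - ph1.T @ px1) / n
            b += lr * (data - px1).mean(axis=0)
            c += lr * (ph0 - ph1).mean(axis=0)
        return W, b, c

    def greedy_pretrain(x, layer_sizes):
        """Train a stack of RBMs, each on the previous layer's hidden activations."""
        rbms, inp = [], x
        for n_hid in layer_sizes:
            W, b, c = train_rbm_cd1(inp, n_hid)
            rbms.append((W, b, c))
            inp = sigmoid(inp @ W.T + c)   # up-pass becomes the "data" for the next RBM
        return rbms

    data = (np.random.default_rng(1).random((200, 20)) > 0.5).astype(float)
    stack = greedy_pretrain(data, [32, 16, 8])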
45
Code reading
  • It is much easier to read the DeepLearningToolbox code for understanding DBNs.

46
(No Transcript)
47
Deep Belief Net
  • Why does greedy layer-wise learning work?
  • It optimizes a lower bound (1) on the log-likelihood (a standard form of this bound is given after the figure).
  • When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1).

[Figure: equation (1) and the stacked layers x, h1, h2, h3; the equation itself is not reproduced in the transcript]
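The bound labelled (1) in its standard form (Hinton, Osindero & Teh, 2006) is

\[
\log P(x)\;\ge\;\sum_{h^1} Q(h^1\mid x)\bigl[\log P(h^1)+\log P(x\mid h^1)\bigr]\;-\;\sum_{h^1} Q(h^1\mid x)\log Q(h^1\mid x),
\]

where Q(h^1 | x) is the posterior defined by the first-layer RBM. Freezing the first layer fixes Q and P(x | h^1), so training the next RBM on samples of h^1 improves the prior P(h^1), which cannot decrease the bound.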
48
Deep Belief Net and RBM
  • An RBM can be considered as a DBN with infinitely many layers.

[Figure: an RBM unrolled into an infinite directed belief net with tied weights, layers x0, h0, x1, h1, x2, …]
49
Pretrain, fine-tune and inference (autoencoder)
  • (BP: back-propagation)

50
Pretrain, fine-tune and inference - 2
  • y: identity or rotation degree

Pretraining
Fine-tuning
51
How many layers should we use?
  • There might be no universally right depth.
  • Bengio suggests that several layers are better than one.
  • Results are robust against changes in the size of a layer, but the top layer should be big.
  • Depth is a parameter; it depends on your task.
  • With enough narrow layers, we can model any distribution over binary vectors [1].

[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007.
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
52
Effect of Unsupervised Pre-training
  • Erhan et al., AISTATS 2009

53
Effect of Depth
[Figure: results with pre-training vs. without pre-training, as a function of depth]
54
Why unsupervised pre-training makes sense
[Figure: two generative scenarios: "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway, versus the image generating the label directly]
If image-label pairs are generated this way, it
makes sense to first learn to recover the stuff
that caused the image by inverting the high
bandwidth pathway.
If image-label pairs were generated this way, it
would make sense to try to go straight from
images to labels. For example, do the pixels
have even parity?
55
Beyond layer-wise pretraining
  • Layer-wise pretraining is efficient but not
    optimal.
  • It is possible to train parameters for all layers
    using a wake-sleep algorithm.
  • Bottom-up, in a layer-wise manner
  • Top-down, refitting the earlier layers

56
Fine-tuning with a contrastive version of the
wake-sleep algorithm
  • After learning many layers of features, we
    can fine-tune the features to improve generation.
  • 1. Do a stochastic bottom-up pass
  • Adjust the top-down weights to be good at
    reconstructing the feature activities in the
    layer below.
  • 2. Do a few iterations of sampling in the top
    level RBM
  • -- Adjust the weights in the top-level RBM.
  • 3. Do a stochastic top-down pass
  • Adjust the bottom-up weights to be good at
    reconstructing the feature activities in the
    layer above.

57
Include lateral connections
  • An RBM has no connections within a layer.
  • This can be generalized.
  • Lateral connections in the first layer [1]:
  • Sampling from P(h | x) is still easy, but sampling from P(x | h) is more difficult.
  • Lateral connections at multiple layers [2]:
  • Generate more realistic images.
  • CD is still applicable, with small modifications.

[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?", Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," NIPS 2007.
58
Without lateral connection
59
With lateral connection
60
My data is real valued
  • Make it [0, 1] linearly: x ← ax + b
  • Use another distribution
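"Use another distribution" usually means a Gaussian-Bernoulli RBM; its standard energy (from Hinton's practical guide, not spelled out on the slide) is

\[
E(x,h)=\sum_j\frac{(x_j-b_j)^2}{2\sigma_j^2}-\sum_i c_i h_i-\sum_{i,j}\frac{x_j}{\sigma_j}h_i W_{ij},
\]

which gives P(h_i = 1 | x) = σ(c_i + Σ_j W_ij x_j / σ_j) and a Gaussian P(x_j | h) with mean b_j + σ_j Σ_i h_i W_ij and variance σ_j².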

61
My data has temporal dependency
  • Static
  • Temporal

62
My data has temporal dependency
  • Static
  • Temporal

63
I consider DBN as
  • A statistical model that is used for unsupervised training of fully connected deep models
  • A directed graphical model that is approximated by fast learning and inference algorithms
  • A directed graphical model that is fine-tuned using a mature neural network learning approach: back-propagation (BP)

64
Outline
  • Basic background on statistical learning and
    Graphical model
  • Contrastive divergence and Restricted Boltzmann
    machine
  • Deep belief net (DBN)
  • Why DBN?
  • Learning and inference
  • Applications

65
Applications of deep learning
  • Hand written digits recognition
  • Dimensionality reduction
  • Information retrieval
  • Segmentation
  • Denoising
  • Phone recognition
  • Object recognition
  • Object detection

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Welling, M., et al., Exponential Family Harmoniums with an Application to Information Retrieval. NIPS 2004.
A. R. Mohamed, et al., Deep Belief Networks for phone recognition. NIPS 2009 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS 2009.
66
Object recognition
  • NORB
  • Error rates: logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
  • With extra unlabeled data (and the same amount of labeled data as before), the DBN achieves 5.2%.

67
Object recognition
ImageNet
Rank  Name         Error rate  Description
1     U. Toronto   0.15315     Deep learning
2     U. Tokyo     0.26172     Hand-crafted features and learning models. Bottleneck.
3     U. Oxford    0.26979     Hand-crafted features and learning models. Bottleneck.
4     Xerox/INRIA  0.27058     Hand-crafted features and learning models. Bottleneck.
68
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
69
The training and test sets
11,000 unlabeled cases
100, 500, or 1000 labeled cases
face patches from new people
70
The root mean squared error in the orientation when combining GPs with deep belief nets:

                                             100 labels   500 labels   1000 labels
GP on the pixels                             22.2         17.9         15.2
GP on top-level features                     17.2         12.7         7.2
GP on top-level features with fine-tuning    16.3         11.2         6.4
Conclusion: the deep features are much better than the pixels. Fine-tuning helps a lot.
71
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
  • They always looked like a really nice way to do
    non-linear dimensionality reduction
  • But it is very difficult to optimize deep
    autoencoders using backpropagation.
  • We now have a much better way to optimize them
  • First train a stack of 4 RBMs
  • Then unroll them.
  • Then fine-tune with backprop.

[Encoder/decoder architecture: 28×28 → 1000 → 500 → 250 → 30 linear units → 250 → 500 → 1000 → 28×28; a code sketch of the unrolling follows below]
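A NumPy sketch of the "unrolling" step, with random placeholder weights standing in for the greedily pre-trained RBM weights (biases omitted for brevity):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Layer sizes from the slide: 784 (28x28) -> 1000 -> 500 -> 250 -> 30 (linear code layer).
    sizes = [784, 1000, 500, 250, 30]
    rng = np.random.default_rng(0)
    # In the real method these weights come from the pre-trained stack of 4 RBMs;
    # random placeholders are used here.
    weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

    def encode(x):
        h = x
        for i, W in enumerate(weights):
            a = h @ W
            h = a if i == len(weights) - 1 else sigmoid(a)   # the 30-unit code layer is linear
        return h

    def decode(code):
        h = code
        for W in reversed(weights):          # "unrolling": the decoder reuses each W transposed
            h = sigmoid(h @ W.T)
        return h

    x = rng.random((5, 784))
    recon = decode(encode(x))
    # Fine-tuning with backprop would then minimize the reconstruction error between x and recon.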
72
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
[Figure: real data vs. 30-D deep autoencoder vs. 30-D PCA reconstructions]
73
A comparison of methods for compressing digit
images to 30 real numbers.
[Figure: real data vs. 30-D deep autoencoder vs. 30-D logistic PCA vs. 30-D PCA reconstructions]
74
Representation of DBN
75
Our work
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
76
Pedestrian Detection
ICCV13
CVPR12
CVPR13
ICCV13
77
Facial keypoint detection, CVPR 2013 (2% average error on LFPW)
Face parsing, CVPR12
Pedestrian parsing, ICCV13
78
Face Recognition and Face Attribute Recognition (LFW 96.45%)
Face verification, ICCV13
Recovering Canonical-View Face Images, ICCV13
Face attribute recognition, ICCV13
79
Summary
  • Deep belief net (DBN)
  • is a network with deep layers, which provides
    strong representation power
  • is a generative model
  • can be learned by layerwise RBM using Contrastive
    Divergence
  • has many applications, and more applications are yet to be found.

Generative models explicitly or implicitly model
the distribution of inputs and outputs.
Discriminative models model the posterior
probabilities directly.
80
DBN VS SVM
  • A very controversial topic
  • Model
  • DBN is generative, SVM is discriminative. But
    fine-tuning of DBN is discriminative
  • Application
  • SVM is widely applied.
  • Researchers are expanding the application area of
    DBN.
  • Learning
  • DBN is non-convex and slow
  • SVM is convex and fast (in linear case).
  • Which one is better?
  • Time will tell.
  • You can contribute

Hinton: "The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore's law."
81
(No Transcript)