Title: Bayesian Machine Learning for Signal Processing
1 Bayesian Machine Learning for Signal Processing
- Hagai T. Attias
- Golden Metallic, Inc., San Francisco, CA
- Tutorial
- 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006
2 ICA / BSS is 15 Years Old
- First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991
- First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997
- First conference on ICA/BSS: Helsinki, 2000
- Lesson drawn by many: ICA is a cool problem. Let's find many approaches to it and many places where it's useful.
- Lesson drawn by some: statistical machine learning is a cool framework. Let's use it to transform adaptive signal processing. ICA is a good start.
3 Noise Cancellation
4 From Noise Cancellation to ICA
[Figure: microphones picking up background TV interference; noise cancellation setup vs. ICA setup]
5 Noise Cancellation Derivation
- y = sensors, x = sources, n = time point
- y1(n) = x1(n) + w x2(n)
- y2(n) = x2(n)
- Joint probability distribution of observed sensor data:
  p(y) = px(x1 = y1 - w y2, x2 = y2)
- Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions v1, v2
- Observed data likelihood:
  L = log p(y) = -0.5 v1 (y1 - w y2)² + const.
- dL/dw = 0 → linear equation for w (worked numpy sketch below)
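Setting dL/dw = 0 and summing the likelihood over time points gives the closed-form solution w = Σn y1(n) y2(n) / Σn y2(n)². A minimal numpy sketch of this estimator on synthetic data (the signal names and numbers here are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x1 = rng.normal(0.0, 1.0, N)        # target source, zero-mean Gaussian
x2 = rng.normal(0.0, 1.0, N)        # noise source, zero-mean Gaussian
w_true = 0.7                        # unknown leakage coefficient

y1 = x1 + w_true * x2               # primary sensor: target + leaked noise
y2 = x2                             # reference sensor: noise only

# dL/dw = 0  =>  w = sum_n y1(n) y2(n) / sum_n y2(n)^2
w_hat = np.dot(y1, y2) / np.dot(y2, y2)
x1_hat = y1 - w_hat * y2            # cancelled output

print(f"estimated w = {w_hat:.3f} (true {w_true})")
```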
6 Noise Cancellation → ICA Derivation
- y = sensors, x = sources, n = time point
- y(n) = A x(n), A = square mixing matrix
- x(n) = G y(n), G = square unmixing matrix
- Probability distribution of observed sensor data:
  p(y) = |G| px(G y)
- Assume the sources are i.i.d. non-Gaussian
- Observed data likelihood:
  L = log p(y) = log |G| + log p(x1) + log p(x2)
- dL/dG = 0 → non-linear equation for G
7 Sensor Noise and Hidden Variables
- y = sensors, x = sources, u = noise, n = time point
- y(n) = A x(n) + u(n)
- x are now hidden variables: even if A is known, one cannot obtain x exactly from y
- However, one can compute the posterior probability of x conditioned on y (Gaussian-case sketch below):
  p(x|y) = p(y|x) p(x) / p(y)
  where p(y|x) = pu(y - A x)
- To learn A from data, one must use an expectation-maximization (EM) algorithm (and often approximate it)
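As a concrete special case, if the sources and the sensor noise are both zero-mean Gaussians with precisions V and Ψ (a simplifying assumption; the interesting models later in the tutorial use non-Gaussian sources), the posterior p(x|y) is Gaussian with precision V + AᵀΨA and mean (V + AᵀΨA)⁻¹ AᵀΨy. A hedged numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))          # known mixing matrix (3 sensors, 2 sources)
V = np.eye(2)                        # source precision (unit-variance sources)
Psi = 10.0 * np.eye(3)               # noise precision (low sensor noise)

x = rng.normal(size=2)
y = A @ x + rng.multivariate_normal(np.zeros(3), np.linalg.inv(Psi))

# Posterior p(x|y) = N(x | m, S^-1) with
#   S = V + A' Psi A   (posterior precision)
#   m = S^-1 A' Psi y  (posterior mean)
S = V + A.T @ Psi @ A
m = np.linalg.solve(S, A.T @ Psi @ y)

print("true x        :", x)
print("posterior mean:", m)
```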
8 Probabilistic Graphical Models
- Model the distribution of observed data
- Graph structure determines the probabilistic dependence between variables
- We focus on DAGs (directed acyclic graphs)
- Node = variable
- Arrow = probabilistic dependence
[Figure: a single node x with p(x); a node x pointing to a node y with p(y,x) = p(y|x) p(x)]
9 Linear Classification
- c = class label (discrete, multinomial)
- y = data (continuous, Gaussian)
- p(c) = πc, p(y|c) = N(y | µc, νc)
- Training set: pairs (y, c)
- Learn parameters by maximum likelihood:
  L = log p(y,c) = log p(y|c) + log p(c)
- Test set: y; classify using p(c|y) = p(y,c) / p(y) (sketch below)
[Figure: node c pointing to node y; p(y,c) = p(y|c) p(c)]
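A minimal numpy rendering of this classifier, assuming the class-conditional Gaussians are parameterized by means and covariances estimated by maximum likelihood; the function names are illustrative:

```python
import numpy as np

def fit_gaussian_classifier(Y, c, n_classes):
    """ML estimates: class priors pi_c, means mu_c, covariances Sigma_c."""
    pis, mus, covs = [], [], []
    for k in range(n_classes):
        Yk = Y[c == k]
        pis.append(len(Yk) / len(Y))
        mus.append(Yk.mean(axis=0))
        covs.append(np.cov(Yk.T) + 1e-6 * np.eye(Y.shape[1]))
    return np.array(pis), np.array(mus), np.array(covs)

def log_gauss(y, mu, cov):
    """log N(y | mu, cov)."""
    d = y - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(y) * np.log(2 * np.pi))

def classify(y, pis, mus, covs):
    """Return argmax_c p(c|y) = argmax_c [log p(y|c) + log p(c)]."""
    scores = [log_gauss(y, mus[k], covs[k]) + np.log(pis[k]) for k in range(len(pis))]
    return int(np.argmax(scores))
```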
10 Linear Regression
- x = predictor (continuous, Gaussian)
- y = dependent variable (continuous, Gaussian)
- p(x) = N(x | µ, ν), p(y|x) = N(y | Ax, λ)
- Training set: pairs (y, x)
- Learn parameters by maximum likelihood:
  L = log p(y,x) = log p(y|x) + log p(x)
- Test set: x; predict using p(y|x) (sketch below)
[Figure: node x pointing to node y; p(y,x) = p(y|x) p(x)]
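Maximizing Σn log p(yn|xn) over the training pairs gives the usual least-squares solution for A. A short numpy sketch under the assumption of zero-mean data (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, dx, dy = 500, 3, 2
X = rng.normal(size=(N, dx))                        # predictors, one row per sample
A_true = rng.normal(size=(dy, dx))
Y = X @ A_true.T + 0.1 * rng.normal(size=(N, dy))   # dependent variables + noise

# dL/dA = 0  =>  A = (sum_n y_n x_n') (sum_n x_n x_n')^{-1}
A_hat = (Y.T @ X) @ np.linalg.inv(X.T @ X)

# Prediction for a new x uses the mean of p(y|x) = N(y | A x, lambda)
x_new = rng.normal(size=dx)
y_pred = A_hat @ x_new
```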
11 Clustering
- c = class label (discrete, multinomial)
- y = data (continuous, Gaussian)
- p(c) = πc, p(y|c) = N(y | µc, νc)
- Training set: y
- p(y) is a mixture of Gaussians (MoG)
- Learn parameters by expectation maximization (EM) (sketch below)
- Test set: y; cluster using p(c|y) = p(y,c) / p(y)
- Limit of zero variance: vector quantization (VQ)
[Figure: node c pointing to node y; p(y,c) = p(y|c) p(c), but now c is hidden]
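A compact numpy sketch of EM for a mixture of Gaussians, using diagonal covariances for brevity. This is a standard textbook rendering rather than code from the tutorial: the E-step computes the responsibilities p(c|y), and the M-step re-estimates πc, µc, and the variances from them.

```python
import numpy as np

def mog_em(Y, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians with diagonal covariances."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.full(K, 1.0 / K)
    mu = Y[rng.choice(N, K, replace=False)]          # init means at random data points
    var = np.tile(Y.var(axis=0), (K, 1))

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(c=k | y_n)
        log_r = np.stack([
            np.log(pi[k])
            - 0.5 * np.sum(np.log(2 * np.pi * var[k]))
            - 0.5 * np.sum((Y - mu[k]) ** 2 / var[k], axis=1)
            for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: maximize the expected complete-data likelihood
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ Y) / Nk[:, None]
        var = np.stack([(r[:, k:k + 1] * (Y - mu[k]) ** 2).sum(axis=0) / Nk[k]
                        for k in range(K)]) + 1e-6
    return pi, mu, var, r
```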
12 Factor Analysis
- x = factors (continuous, Gaussian)
- y = data (continuous, Gaussian)
- p(x) = N(x | 0, I), p(y|x) = N(y | Ax, λ)
- Training set: y
- p(y) is Gaussian with covariance AAᵀ + λ⁻¹
- Learn parameters by expectation maximization (EM) (sketch below)
- Test set: y; obtain factors from the posterior p(x|y) = p(y,x) / p(y)
- Limit of zero noise: principal component analysis (PCA)
[Figure: node x pointing to node y; p(y,x) = p(y|x) p(x), but now x is hidden]
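For reference, EM for this model has a closed-form E-step (the posterior p(x|y) is Gaussian) and a closed-form M-step. The sketch below assumes diagonal noise and is a standard textbook rendering, not code from the tutorial:

```python
import numpy as np

def fa_em(Y, K, n_iter=200, seed=0):
    """EM for factor analysis: y = A x + noise, x ~ N(0, I), diagonal noise."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    A = rng.normal(size=(D, K))
    psi = Y.var(axis=0)                       # diagonal noise variances

    for _ in range(n_iter):
        # E-step: p(x | y_n) = N(m_n, S) with S = (I + A' Psi^-1 A)^-1
        Ainv = A / psi[:, None]               # Psi^-1 A
        S = np.linalg.inv(np.eye(K) + A.T @ Ainv)
        M = Y @ Ainv @ S                      # posterior means, one row per sample

        # M-step: re-estimate A and the noise variances
        Exx = N * S + M.T @ M                 # sum_n E[x x'] under the posterior
        A = (Y.T @ M) @ np.linalg.inv(Exx)
        psi = np.mean(Y ** 2 - Y * (M @ A.T), axis=0) + 1e-8
    return A, psi
```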
13 Dynamical Models
- Hidden Markov Model (Baum-Welch)
- State Space Model (Kalman smoothing)
- Switching State Space Model (intractable)
14 Probabilistic Inference
- Factor analysis model: p(x) = N(x | 0, I), p(y|x,A,λ) = N(y | Ax, λ)
- Nodes inside the frame: variables, vary in time
- Nodes outside the frame: parameters, constant in time
- Parameters have prior distributions p(A), p(λ)
- Bayesian inference: compute the full posterior distribution p(x,A,λ|y) over all hidden nodes conditioned on observed nodes
- Bayes rule: p(x,A,λ|y) = p(y|x,A,λ) p(x) p(A) p(λ) / p(y)
- In hidden variable models, the joint posterior can generally not be computed exactly; the normalization factor p(y) is intractable
[Figure: graphical model with parameter nodes A, λ outside the frame and variable nodes x → y inside it]
15 MAP and Maximum Likelihood
- MAP (maximum a posteriori): consider only the parameter values that maximize the posterior p(x,A,λ|y)
- This is the maximum likelihood method: compute A, λ that maximize L = log p(y|A,λ)
- However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient-based techniques, which are slow
- Solution: the EM algorithm
- Iterative algorithm; each iteration has an E-step and an M-step
- E-step: compute the posterior over hidden variables, p(x|y)
- M-step: maximize the complete data likelihood E log p(y,x,A,λ) w.r.t. the parameters A, λ; E = posterior average over x
16 Derivation of the EM Algorithm
- Instead of the likelihood L = log p(y), consider
  F(q) = E log p(y,x) - E log q(x|y)
  where q(x|y) is a trial posterior and E = average over x w.r.t. q
- Can show: F(q) = L - KL[ q(x|y) || p(x|y) ] ≤ L
- Hence F is upper bounded by L, and F = L when q = true posterior (see the identity below)
- EM performs an alternating maximization of F:
- The E-step maximizes F w.r.t. the posterior q
- The M-step maximizes F w.r.t. the parameters A, λ
- Hence EM performs maximum likelihood
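Written out, the identity behind these bullets (standard EM algebra, stated here for completeness) is

```latex
F(q) \;=\; \mathbb{E}_{q}\!\left[\log p(y,x)\right] \;-\; \mathbb{E}_{q}\!\left[\log q(x|y)\right]
     \;=\; \log p(y) \;-\; \mathrm{KL}\!\left[\, q(x|y) \,\big\|\, p(x|y) \,\right] \;\le\; L ,
```

with equality exactly when q(x|y) = p(x|y). The E-step drives the KL term to zero at the current parameters; the M-step then increases E_q[log p(y,x)] w.r.t. the parameters, so L never decreases across iterations.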
17 ICA by EM: MoG Sources
- Each source distribution p(x) is a 1-dim mixture of Gaussians
- The Gaussian labels s are hidden variables
- The data y = A x, hence x = G y are not hidden
- Likelihood: L = log |G| + log p(x)
- F(q) = log |G| + E log p(x,s) - E log q(s|y)
- E-step: q(s|y) = p(x,s) / z
- M-step: G ← G + ε (I - F(x) xᵀ) G (natural gradient; sketch below)
- F(x) is linear in x and q
- Can also learn the source parameters MoG1, MoG2 at the M-step
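A minimal sketch of the natural-gradient update G ← G + ε(I - F(x)xᵀ)G. For brevity it substitutes a fixed tanh nonlinearity for the MoG-derived F(x) described on this slide, so it illustrates the update rule rather than the full EM-with-MoG-sources algorithm:

```python
import numpy as np

def natural_gradient_ica(Y, n_iter=2000, lr=0.01, seed=0):
    """Square, noiseless ICA: x = G y, updated by the natural gradient.

    Assumes zero-mean observations Y of shape (n_channels, n_samples).
    Uses f(x) = tanh(x) as the score-like nonlinearity (a common choice for
    super-Gaussian sources); the tutorial's version derives F(x) from
    per-source mixtures of Gaussians instead.
    """
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    G = np.eye(d) + 0.01 * rng.normal(size=(d, d))

    for _ in range(n_iter):
        X = G @ Y                                  # current source estimates
        F = np.tanh(X)                             # stand-in for the MoG score
        # Natural gradient: G <- G + lr * (I - <F(x) x'>) G
        G += lr * (np.eye(d) - (F @ X.T) / N) @ G
    return G
```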
18 Noisy, Non-Square ICA: Independent Factor Analysis
- The Gaussian labels s are hidden variables
- The data y = A x + u, hence x are also hidden
- p(y|x) = N(y | Ax, λ)
- Likelihood L = log p(y) must marginalize over x, s
- F(q) = E log p(y,x,s) - E log q(x,s|y)
- E-step: q(x,s|y) = q(x|s,y) q(s|y)
- M-step: linear equations for A, λ
- Can also learn the source parameters MoG1, MoG2 at the M-step
- Convergence problem in low noise
19 Intractability of Inference
- In many models of interest the E-step is computationally intractable
- Switching state space model: the posterior over discrete states p(s|y) is exponential in time
- Independent factor analysis: the posterior over Gaussian labels is exponential in the number of sources
- Approximations must be made
- MAP approximation: consider only the most likely state configuration(s)
- Markov chain Monte Carlo: convergence may be quite slow and hard to determine
20 Variational EM
- Idea: use an approximate posterior which has a factorized form
- Example: switching state space model. Factorize the continuous states from the discrete states:
  p(x,s|y) ≈ q(x,s|y) = q(x|y) q(s|y)
- Make no other assumptions (e.g., functional forms)
- To derive, consider F(q) from the derivation of EM:
  F(q) = E log p(y,x,s) - E log q(x|y) - E log q(s|y)
- E performs posterior averaging w.r.t. q
- Maximize F alternately w.r.t. q(x|y) and q(s|y) (fixed-point equations below):
- q(x|y) = exp( E_s log p(y,x,s) ) / zs
- q(s|y) = exp( E_x log p(y,x,s) ) / zx
- This adds an internal loop to the E-step; the M-step is unchanged
- Convergence is guaranteed since F(q) is upper bounded by L
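Setting the functional derivatives of F to zero under this factorization gives the standard coupled fixed-point equations, written here in general form:

```latex
q(x|y) \;\propto\; \exp\!\Big( \mathbb{E}_{q(s|y)}\big[\log p(y,x,s)\big] \Big),
\qquad
q(s|y) \;\propto\; \exp\!\Big( \mathbb{E}_{q(x|y)}\big[\log p(y,x,s)\big] \Big),
```

which are iterated to convergence inside the E-step; each update can only increase F(q).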
21 Switching Model: 2 Variational Approximations
- Model: discrete switch states s(1), s(2), s(3) driving continuous states x(1), x(2), x(3)
[Figure: the model and the two factorized approximations of its posterior]
- Variational approximation I: Baum-Welch + Kalman; Gaussian, smoothed
- Variational approximation II: Baum-Welch + MoG; multimodal, not smoothed
22 IFA: 2 Variational Approximations
- Model
[Figure: the IFA model and the two factorized approximations of its posterior]
- Variational approximation I: source posterior is Gaussian, correlated
- Variational approximation II: source posterior is multimodal, independent
23 Model Order Selection
- How does one determine the optimal number of factors in FA?
- Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit
- The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood
  L = log p(y) = E log p(y,A) - E log p(A|y)
- Compute L for each number of factors; choose the number that maximizes L
- An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A, its precision α. Optimize the precisions by maximizing L; unnecessary columns will have α → infinity (this prior is written out below)
- Both approaches require computing the parameter posterior p(A|y), which is usually intractable
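The column-wise prior of the second approach is the automatic relevance determination (ARD) construction. One common form (stated here for concreteness, with d the data dimension; the tutorial does not spell out the update) is

```latex
p(A \mid \alpha) \;=\; \prod_{j} \mathcal{N}\!\big(a_j \,\big|\, 0,\; \alpha_j^{-1} I \big),
\qquad
\alpha_j \;\leftarrow\; \frac{d}{\;\mathbb{E}\big[\,\|a_j\|^2\,\big]\;},
```

so a column whose posterior mass shrinks toward zero drives its precision αj toward infinity and is effectively pruned from the model.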
24 Variational Bayesian EM
- Idea: use an approximate posterior which factorizes the parameters from the hidden variables
- Example: factor analysis
  p(x,A|y) ≈ q(x,A|y) = q(x|y) q(A|y)
- Make no other assumptions (e.g., functional forms)
- To derive, consider F(q) from the derivation of EM:
  F(q) = E log p(y,x,A) - E log q(x|y) - E log q(A|y)
- E performs posterior averaging w.r.t. q
- Maximize F alternately w.r.t. q(x|y) and q(A|y):
- E-step: q(x|y) = exp( E_A log p(y,x,A) ) / zA
- M-step: q(A|y) = exp( E_x log p(y,x,A) ) / zx
- Plus, maximize F w.r.t. the noise precision λ and hyperparameters α (MAP approximation)
25 VB Approximation for IFA
[Figure: factorized VB posterior over the source labels s1, s2, the sources x1, x2, and the mixing matrix A]
26 Conjugate Priors
- Which form should one choose for prior distributions?
- Conjugate prior idea: choose a prior such that the resulting posterior distribution has the same functional form as the prior
- Single Gaussian: the posterior over the mean is
  p(µ|y) = p(y|µ) p(µ) / p(y)
  conjugate prior is Gaussian (worked example below)
- Single Gaussian: the posterior over the mean and precision is
  p(µ,ν|y) ∝ p(y|µ,ν) p(µ,ν)
  conjugate prior is Normal-Wishart
- Factor analysis: the VB posterior over the mixing matrix is
  q(A|y) = exp( E_x log p(y,x|A) ) p(A) / z
  conjugate prior is Gaussian
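A worked instance of the first case (Gaussian prior on the mean of a Gaussian with known precision ν; all numbers illustrative): the posterior stays Gaussian, with precision and mean given by the usual precision-weighted combination.

```python
import numpy as np

rng = np.random.default_rng(3)

mu_true, nu = 2.0, 4.0                      # data model: y ~ N(mu, precision nu)
y = rng.normal(mu_true, 1.0 / np.sqrt(nu), size=50)

m0, p0 = 0.0, 1.0                           # conjugate Gaussian prior: mu ~ N(m0, 1/p0)

# Conjugacy: the posterior over mu is again Gaussian, with
#   precision pN = p0 + N*nu,  mean mN = (p0*m0 + nu*sum(y)) / pN
N = len(y)
pN = p0 + N * nu
mN = (p0 * m0 + nu * y.sum()) / pN

print(f"posterior over mu: N(mean={mN:.3f}, precision={pN:.1f})")
```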
27 Separation of Convolutive Mixtures of Speech Sources
- Blind separation methods use extremely simple models for source distributions
- Speech signals have a rich structure; models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
- One such model works in the windowed FFT domain (sketch below):
  x(n,k) = G(k) y(n,k)
  where n = frame index, k = frequency
- Train a MoG model on the x(n,k) such that different components capture different speech spectra
- Plug this model into IFA and use EM to obtain separation of convolutive mixtures
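The windowed-FFT formulation turns the convolutive problem into an instantaneous one per frequency bin: each bin k gets its own unmixing matrix G(k) applied frame by frame. The numpy sketch below only shows the transform and the per-bin application of a given set of G(k); learning the G(k) by the MoG/IFA machinery of this slide is not shown, and all names are illustrative:

```python
import numpy as np

def windowed_fft(y, frame_len=512, hop=256):
    """y: (n_mics, n_samples) -> spectra of shape (n_frames, n_freqs, n_mics)."""
    n_mics, n_samples = y.shape
    win = np.hanning(frame_len)
    starts = range(0, n_samples - frame_len + 1, hop)
    frames = np.stack([y[:, s:s + frame_len] * win for s in starts])  # (n_frames, n_mics, frame_len)
    return np.fft.rfft(frames, axis=-1).transpose(0, 2, 1)            # (n_frames, n_freqs, n_mics)

def apply_unmixing(Y_fk, G):
    """x(n,k) = G(k) y(n,k): G has shape (n_freqs, n_srcs, n_mics)."""
    # Sum over the microphone index for every frame n and frequency bin k.
    return np.einsum('kij,nkj->nki', G, Y_fk)
```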
28 Noise Suppression in Speech Signals
- Existing methods, based e.g. on spectral subtraction and array processing, often produce unsatisfactory noise suppression
- Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just from silent segments), and (3) work with one or more microphones
- Use a speech model in the windowed FFT domain
- λ(k) = noise precision per frequency (inverse spectrum)
29 Interference Suppression and Evoked Source Separation in MEG Data
- y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise
- Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
- Pre-stimulus: y(n) = B u(n) + v(n)
- Post-stimulus: y(n) = A x(n) + B u(n) + v(n)
- SEIFA is an extension of IFA to this case: model x by MoG, model u by Gaussians N(0,I), model v by Gaussian N(0,λ)
- Use pre-stimulus data to learn the interference mixing matrix B and noise precision λ; use post-stimulus data to learn the evoked mixing matrix A
- Use VB-EM to infer from data the optimal number of interference factors u and of evoked factors x; this also protects against overfitting
- Cleaned data: y = A x; contribution of factor j to sensor i: y_i = A_ij x_j
- Advantages over ICA: no need to discard information by dimensionality reduction; can exploit stimulus onset information; superior noise suppression
30 Stimulus Evoked Independent Factor Analysis
[Figure: pre-stimulus and post-stimulus graphical models with interference factors u, interference mixing matrix B, and sensor data y]
31 Brain Source Localization using MEG
- Problem: localize brain sources that respond to a stimulus
- The response model is simple: y(n) = F s(n) + v(n)
  F = lead field (known), s = brain voxel activity
- However, the number of voxels (3000-10000) is much larger than the number of sensors (100-300)
- One approach: fit multiple dipole sources; the cost is exponential in the number of sources
- Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model
  y(n) = F z(n) + A x(n) + v(n)
  where F = lead field for that voxel, z = voxel activity, x = response from all other active voxels
- Obtain a localization map by plotting <z(n)²> per voxel
- Superior results to existing (beamforming based) methods; can handle correlated sources
32 MEG Localization Model
[Figure: pre-stimulus and post-stimulus graphical models with voxel activity z, interference factors u, other-voxel responses x, mixing matrices A and B, lead field F, and sensor data y]
33 Conclusion
- Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
- Process:
  (1) design a probabilistic model that corresponds to the problem
  (2) use machinery for exact and approximate inference to learn the model from data, including model order
  (3) extend the model, e.g. by incorporating rich signal models, to improve performance
- Problems treated here: noise suppression, source separation, source localization
- Domains: speech, audio, biomedical data
- Domains outside this tutorial: image, video, text, coding, ...
- Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing