Title: Bayesian Machine Learning for Signal Processing
1 Bayesian Machine Learning for Signal Processing
- Hagai T. Attias
- Golden Metallic, Inc., San Francisco, CA
- Tutorial
- 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006
2 ICA / BSS is 15 Years Old
- First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991
- First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997
- First conference on ICA/BSS: Helsinki, 2000
- Lesson drawn by many: ICA is a cool problem. Let's find many approaches to it and many places where it's useful.
- Lesson drawn by some: statistical machine learning is a cool framework. Let's use it to transform adaptive signal processing. ICA is a good start.
3 Noise Cancellation
4 From Noise Cancellation to ICA
[Figure: microphones picking up background TV interference; noise cancellation setup vs. ICA setup]
5 Noise Cancellation Derivation
- y = sensors, x = sources, n = time point
- y1(n) = x1(n) + w x2(n)
- y2(n) = x2(n)
- Joint probability distribution of observed sensor data:
  p(y) = px(x1 = y1 - w y2, x2 = y2)
- Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions v1, v2
- Observed data likelihood:
  L = log p(y) = -0.5 v1 (y1 - w y2)² + const.
- dL/dw = 0 → linear equation for w (worked numpy sketch below)
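Setting dL/dw = 0 and summing the likelihood over time points gives the closed-form solution w = Σn y1(n) y2(n) / Σn y2(n)². A minimal numpy sketch of this estimator on synthetic data (the signal names and numbers here are illustrative, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x1 = rng.normal(0.0, 1.0, N)        # target source, zero-mean Gaussian
x2 = rng.normal(0.0, 1.0, N)        # noise source, zero-mean Gaussian
w_true = 0.7                        # unknown leakage coefficient

y1 = x1 + w_true * x2               # primary sensor: target + leaked noise
y2 = x2                             # reference sensor: noise only

# dL/dw = 0  =>  w = sum_n y1(n) y2(n) / sum_n y2(n)^2
w_hat = np.dot(y1, y2) / np.dot(y2, y2)
x1_hat = y1 - w_hat * y2            # cancelled output

print(f"estimated w = {w_hat:.3f} (true {w_true})")
```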
6 Noise Cancellation → ICA Derivation
- y = sensors, x = sources, n = time point
- y(n) = A x(n), A = square mixing matrix
- x(n) = G y(n), G = square unmixing matrix
- Probability distribution of observed sensor data:
  p(y) = |G| px(G y)
- Assume the sources are i.i.d. non-Gaussian
- Observed data likelihood:
  L = log p(y) = log |G| + log p(x1) + log p(x2)
- dL/dG = 0 → non-linear equation for G
7 Sensor Noise and Hidden Variables
- y = sensors, x = sources, u = noise, n = time point
- y(n) = A x(n) + u(n)
- x are now hidden variables: even if A is known, one cannot obtain x exactly from y
- However, one can compute the posterior probability of x conditioned on y (Gaussian-case sketch below):
  p(x|y) = p(y|x) p(x) / p(y)
  where p(y|x) = pu(y - A x)
- To learn A from data, one must use an expectation-maximization (EM) algorithm (and often approximate it)
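As a concrete special case, if the sources and the sensor noise are both zero-mean Gaussians with precisions V and Ψ (a simplifying assumption; the interesting models later in the tutorial use non-Gaussian sources), the posterior p(x|y) is Gaussian with precision V + AᵀΨA and mean (V + AᵀΨA)⁻¹ AᵀΨy. A hedged numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 2))          # known mixing matrix (3 sensors, 2 sources)
V = np.eye(2)                        # source precision (unit-variance sources)
Psi = 10.0 * np.eye(3)               # noise precision (low sensor noise)

x = rng.normal(size=2)
y = A @ x + rng.multivariate_normal(np.zeros(3), np.linalg.inv(Psi))

# Posterior p(x|y) = N(x | m, S^-1) with
#   S = V + A' Psi A   (posterior precision)
#   m = S^-1 A' Psi y  (posterior mean)
S = V + A.T @ Psi @ A
m = np.linalg.solve(S, A.T @ Psi @ y)

print("true x        :", x)
print("posterior mean:", m)
```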
8 Probabilistic Graphical Models
- Model the distribution of observed data
- Graph structure determines the probabilistic dependence between variables
- We focus on DAGs (directed acyclic graphs)
- Node = variable
- Arrow = probabilistic dependence
[Figure: a single node x with p(x); a node x pointing to a node y with p(y,x) = p(y|x) p(x)]
9 Linear Classification
- c = class label (discrete, multinomial)
- y = data (continuous, Gaussian)
- p(c) = πc, p(y|c) = N(y | µc, νc)
- Training set: pairs (y, c)
- Learn parameters by maximum likelihood:
  L = log p(y,c) = log p(y|c) + log p(c)
- Test set: y; classify using p(c|y) = p(y,c) / p(y) (sketch below)
[Figure: node c pointing to node y; p(y,c) = p(y|c) p(c)]
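A minimal numpy rendering of this classifier, assuming the class-conditional Gaussians are parameterized by means and covariances estimated by maximum likelihood; the function names are illustrative:

```python
import numpy as np

def fit_gaussian_classifier(Y, c, n_classes):
    """ML estimates: class priors pi_c, means mu_c, covariances Sigma_c."""
    pis, mus, covs = [], [], []
    for k in range(n_classes):
        Yk = Y[c == k]
        pis.append(len(Yk) / len(Y))
        mus.append(Yk.mean(axis=0))
        covs.append(np.cov(Yk.T) + 1e-6 * np.eye(Y.shape[1]))
    return np.array(pis), np.array(mus), np.array(covs)

def log_gauss(y, mu, cov):
    """log N(y | mu, cov)."""
    d = y - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(y) * np.log(2 * np.pi))

def classify(y, pis, mus, covs):
    """Return argmax_c p(c|y) = argmax_c [log p(y|c) + log p(c)]."""
    scores = [log_gauss(y, mus[k], covs[k]) + np.log(pis[k]) for k in range(len(pis))]
    return int(np.argmax(scores))
```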
10 Linear Regression
- x = predictor (continuous, Gaussian)
- y = dependent variable (continuous, Gaussian)
- p(x) = N(x | µ, ν), p(y|x) = N(y | Ax, λ)
- Training set: pairs (y, x)
- Learn parameters by maximum likelihood:
  L = log p(y,x) = log p(y|x) + log p(x)
- Test set: x; predict using p(y|x) (sketch below)
[Figure: node x pointing to node y; p(y,x) = p(y|x) p(x)]
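Maximizing Σn log p(yn|xn) over the training pairs gives the usual least-squares solution for A. A short numpy sketch under the assumption of zero-mean data (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, dx, dy = 500, 3, 2
X = rng.normal(size=(N, dx))                        # predictors, one row per sample
A_true = rng.normal(size=(dy, dx))
Y = X @ A_true.T + 0.1 * rng.normal(size=(N, dy))   # dependent variables + noise

# dL/dA = 0  =>  A = (sum_n y_n x_n') (sum_n x_n x_n')^{-1}
A_hat = (Y.T @ X) @ np.linalg.inv(X.T @ X)

# Prediction for a new x uses the mean of p(y|x) = N(y | A x, lambda)
x_new = rng.normal(size=dx)
y_pred = A_hat @ x_new
```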
11 Clustering
- c = class label (discrete, multinomial)
- y = data (continuous, Gaussian)
- p(c) = πc, p(y|c) = N(y | µc, νc)
- Training set: y
- p(y) is a mixture of Gaussians (MoG)
- Learn parameters by expectation maximization (EM) (sketch below)
- Test set: y; cluster using p(c|y) = p(y,c) / p(y)
- Limit of zero variance: vector quantization (VQ)
[Figure: node c pointing to node y; p(y,c) = p(y|c) p(c), but now c is hidden]
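A compact numpy sketch of EM for a mixture of Gaussians, using diagonal covariances for brevity. This is a standard textbook rendering rather than code from the tutorial: the E-step computes the responsibilities p(c|y), and the M-step re-estimates πc, µc, and the variances from them.

```python
import numpy as np

def mog_em(Y, K, n_iter=100, seed=0):
    """EM for a mixture of K Gaussians with diagonal covariances."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    pi = np.full(K, 1.0 / K)
    mu = Y[rng.choice(N, K, replace=False)]          # init means at random data points
    var = np.tile(Y.var(axis=0), (K, 1))

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = p(c=k | y_n)
        log_r = np.stack([
            np.log(pi[k])
            - 0.5 * np.sum(np.log(2 * np.pi * var[k]))
            - 0.5 * np.sum((Y - mu[k]) ** 2 / var[k], axis=1)
            for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: maximize the expected complete-data likelihood
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ Y) / Nk[:, None]
        var = np.stack([(r[:, k:k + 1] * (Y - mu[k]) ** 2).sum(axis=0) / Nk[k]
                        for k in range(K)]) + 1e-6
    return pi, mu, var, r
```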
12 Factor Analysis
- x = factors (continuous, Gaussian)
- y = data (continuous, Gaussian)
- p(x) = N(x | 0, I), p(y|x) = N(y | Ax, λ)
- Training set: y
- p(y) is Gaussian with covariance AAᵀ + λ⁻¹
- Learn parameters by expectation maximization (EM) (sketch below)
- Test set: y; obtain factors from the posterior p(x|y) = p(y,x) / p(y)
- Limit of zero noise: principal component analysis (PCA)
[Figure: node x pointing to node y; p(y,x) = p(y|x) p(x), but now x is hidden]
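For reference, EM for this model has a closed-form E-step (the posterior p(x|y) is Gaussian) and a closed-form M-step. The sketch below assumes diagonal noise and is a standard textbook rendering, not code from the tutorial:

```python
import numpy as np

def fa_em(Y, K, n_iter=200, seed=0):
    """EM for factor analysis: y = A x + noise, x ~ N(0, I), diagonal noise."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    A = rng.normal(size=(D, K))
    psi = Y.var(axis=0)                       # diagonal noise variances

    for _ in range(n_iter):
        # E-step: p(x | y_n) = N(m_n, S) with S = (I + A' Psi^-1 A)^-1
        Ainv = A / psi[:, None]               # Psi^-1 A
        S = np.linalg.inv(np.eye(K) + A.T @ Ainv)
        M = Y @ Ainv @ S                      # posterior means, one row per sample

        # M-step: re-estimate A and the noise variances
        Exx = N * S + M.T @ M                 # sum_n E[x x'] under the posterior
        A = (Y.T @ M) @ np.linalg.inv(Exx)
        psi = np.mean(Y ** 2 - Y * (M @ A.T), axis=0) + 1e-8
    return A, psi
```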
13 Dynamical Models
- Hidden Markov Model (Baum-Welch)
- State Space Model (Kalman smoothing)
- Switching State Space Model (intractable)
14 Probabilistic Inference
- Factor analysis model: p(x) = N(x | 0, I), p(y|x,A,λ) = N(y | Ax, λ)
- Nodes inside the frame: variables, vary in time
- Nodes outside the frame: parameters, constant in time
- Parameters have prior distributions p(A), p(λ)
- Bayesian inference: compute the full posterior distribution p(x,A,λ|y) over all hidden nodes conditioned on observed nodes
- Bayes rule: p(x,A,λ|y) = p(y|x,A,λ) p(x) p(A) p(λ) / p(y)
- In hidden variable models, the joint posterior can generally not be computed exactly; the normalization factor p(y) is intractable
[Figure: graphical model with parameter nodes A, λ outside the frame and variable nodes x → y inside it]
15 MAP and Maximum Likelihood
- MAP (maximum a posteriori): consider only the parameter values that maximize the posterior p(x,A,λ|y)
- This is the maximum likelihood method: compute A, λ that maximize L = log p(y|A,λ)
- However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient-based techniques, which are slow
- Solution: the EM algorithm
- Iterative algorithm; each iteration has an E-step and an M-step
- E-step: compute the posterior over hidden variables, p(x|y)
- M-step: maximize the complete data likelihood E log p(y,x,A,λ) w.r.t. the parameters A, λ; E = posterior average over x
16 Derivation of the EM Algorithm
- Instead of the likelihood L = log p(y), consider
  F(q) = E log p(y,x) - E log q(x|y)
  where q(x|y) is a trial posterior and E = average over x w.r.t. q
- Can show: F(q) = L - KL[ q(x|y) || p(x|y) ] ≤ L
- Hence F is upper bounded by L, and F = L when q = true posterior (see the identity below)
- EM performs an alternating maximization of F:
- The E-step maximizes F w.r.t. the posterior q
- The M-step maximizes F w.r.t. the parameters A, λ
- Hence EM performs maximum likelihood
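Written out, the identity behind these bullets (standard EM algebra, stated here for completeness) is

```latex
F(q) \;=\; \mathbb{E}_{q}\!\left[\log p(y,x)\right] \;-\; \mathbb{E}_{q}\!\left[\log q(x|y)\right]
     \;=\; \log p(y) \;-\; \mathrm{KL}\!\left[\, q(x|y) \,\big\|\, p(x|y) \,\right] \;\le\; L ,
```

with equality exactly when q(x|y) = p(x|y). The E-step drives the KL term to zero at the current parameters; the M-step then increases E_q[log p(y,x)] w.r.t. the parameters, so L never decreases across iterations.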
17 ICA by EM: MoG Sources
- Each source distribution p(x) is a 1-dim mixture of Gaussians
- The Gaussian labels s are hidden variables
- The data y = A x, hence x = G y are not hidden
- Likelihood: L = log |G| + log p(x)
- F(q) = log |G| + E log p(x,s) - E log q(s|y)
- E-step: q(s|y) = p(x,s) / z
- M-step: G ← G + ε (I - F(x) xᵀ) G (natural gradient; sketch below)
- F(x) is linear in x and q
- Can also learn the source parameters MoG1, MoG2 at the M-step
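A minimal sketch of the natural-gradient update G ← G + ε(I - F(x)xᵀ)G. For brevity it substitutes a fixed tanh nonlinearity for the MoG-derived F(x) described on this slide, so it illustrates the update rule rather than the full EM-with-MoG-sources algorithm:

```python
import numpy as np

def natural_gradient_ica(Y, n_iter=2000, lr=0.01, seed=0):
    """Square, noiseless ICA: x = G y, updated by the natural gradient.

    Assumes zero-mean observations Y of shape (n_channels, n_samples).
    Uses f(x) = tanh(x) as the score-like nonlinearity (a common choice for
    super-Gaussian sources); the tutorial's version derives F(x) from
    per-source mixtures of Gaussians instead.
    """
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    G = np.eye(d) + 0.01 * rng.normal(size=(d, d))

    for _ in range(n_iter):
        X = G @ Y                                  # current source estimates
        F = np.tanh(X)                             # stand-in for the MoG score
        # Natural gradient: G <- G + lr * (I - <F(x) x'>) G
        G += lr * (np.eye(d) - (F @ X.T) / N) @ G
    return G
```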
18 Noisy, Non-Square ICA: Independent Factor Analysis
- The Gaussian labels s are hidden variables
- The data y = A x + u, hence x are also hidden
- p(y|x) = N(y | Ax, λ)
- Likelihood L = log p(y) must marginalize over x, s
- F(q) = E log p(y,x,s) - E log q(x,s|y)
- E-step: q(x,s|y) = q(x|s,y) q(s|y)
- M-step: linear equations for A, λ
- Can also learn the source parameters MoG1, MoG2 at the M-step
- Convergence problem in low noise
19 Intractability of Inference
- In many models of interest the E-step is computationally intractable
- Switching state space model: the posterior over discrete states p(s|y) is exponential in time
- Independent factor analysis: the posterior over Gaussian labels is exponential in the number of sources
- Approximations must be made
- MAP approximation: consider only the most likely state configuration(s)
- Markov chain Monte Carlo: convergence may be quite slow and hard to determine
20 Variational EM
- Idea: use an approximate posterior which has a factorized form
- Example: switching state space model. Factorize the continuous states from the discrete states:
  p(x,s|y) ≈ q(x,s|y) = q(x|y) q(s|y)
- Make no other assumptions (e.g., functional forms)
- To derive, consider F(q) from the derivation of EM:
  F(q) = E log p(y,x,s) - E log q(x|y) - E log q(s|y)
- E performs posterior averaging w.r.t. q
- Maximize F alternately w.r.t. q(x|y) and q(s|y) (fixed-point equations below):
- q(x|y) = exp( E_s log p(y,x,s) ) / zs
- q(s|y) = exp( E_x log p(y,x,s) ) / zx
- This adds an internal loop to the E-step; the M-step is unchanged
- Convergence is guaranteed since F(q) is upper bounded by L
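Setting the functional derivatives of F to zero under this factorization gives the standard coupled fixed-point equations, written here in general form:

```latex
q(x|y) \;\propto\; \exp\!\Big( \mathbb{E}_{q(s|y)}\big[\log p(y,x,s)\big] \Big),
\qquad
q(s|y) \;\propto\; \exp\!\Big( \mathbb{E}_{q(x|y)}\big[\log p(y,x,s)\big] \Big),
```

which are iterated to convergence inside the E-step; each update can only increase F(q).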
21 Switching Model: 2 Variational Approximations
- Model: discrete switch states s(1), s(2), s(3) driving continuous states x(1), x(2), x(3)
[Figure: the model and the two factorized approximations of its posterior]
- Variational approximation I: Baum-Welch + Kalman; Gaussian, smoothed
- Variational approximation II: Baum-Welch + MoG; multimodal, not smoothed
22 IFA: 2 Variational Approximations
- Model
[Figure: the IFA model and the two factorized approximations of its posterior]
- Variational approximation I: source posterior is Gaussian, correlated
- Variational approximation II: source posterior is multimodal, independent
23 Model Order Selection
- How does one determine the optimal number of factors in FA?
- Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit
- The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood
  L = log p(y) = E log p(y,A) - E log p(A|y)
- Compute L for each number of factors; choose the number that maximizes L
- An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A, its precision α. Optimize the precisions by maximizing L; unnecessary columns will have α → infinity (this prior is written out below)
- Both approaches require computing the parameter posterior p(A|y), which is usually intractable
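The column-wise prior of the second approach is the automatic relevance determination (ARD) construction. One common form (stated here for concreteness, with d the data dimension; the tutorial does not spell out the update) is

```latex
p(A \mid \alpha) \;=\; \prod_{j} \mathcal{N}\!\big(a_j \,\big|\, 0,\; \alpha_j^{-1} I \big),
\qquad
\alpha_j \;\leftarrow\; \frac{d}{\;\mathbb{E}\big[\,\|a_j\|^2\,\big]\;},
```

so a column whose posterior mass shrinks toward zero drives its precision αj toward infinity and is effectively pruned from the model.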
24 Variational Bayesian EM
- Idea: use an approximate posterior which factorizes the parameters from the hidden variables
- Example: factor analysis
  p(x,A|y) ≈ q(x,A|y) = q(x|y) q(A|y)
- Make no other assumptions (e.g., functional forms)
- To derive, consider F(q) from the derivation of EM:
  F(q) = E log p(y,x,A) - E log q(x|y) - E log q(A|y)
- E performs posterior averaging w.r.t. q
- Maximize F alternately w.r.t. q(x|y) and q(A|y):
- E-step: q(x|y) = exp( E_A log p(y,x,A) ) / zA
- M-step: q(A|y) = exp( E_x log p(y,x,A) ) / zx
- Plus, maximize F w.r.t. the noise precision λ and hyperparameters α (MAP approximation)
25 VB Approximation for IFA
[Figure: factorized VB posterior over the source labels s1, s2, the sources x1, x2, and the mixing matrix A]
26 Conjugate Priors
- Which form should one choose for prior distributions?
- Conjugate prior idea: choose a prior such that the resulting posterior distribution has the same functional form as the prior
- Single Gaussian: the posterior over the mean is
  p(µ|y) = p(y|µ) p(µ) / p(y)
  conjugate prior is Gaussian (worked example below)
- Single Gaussian: the posterior over the mean and precision is
  p(µ,ν|y) ∝ p(y|µ,ν) p(µ,ν)
  conjugate prior is Normal-Wishart
- Factor analysis: the VB posterior over the mixing matrix is
  q(A|y) = exp( E_x log p(y,x|A) ) p(A) / z
  conjugate prior is Gaussian
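A worked instance of the first case (Gaussian prior on the mean of a Gaussian with known precision ν; all numbers illustrative): the posterior stays Gaussian, with precision and mean given by the usual precision-weighted combination.

```python
import numpy as np

rng = np.random.default_rng(3)

mu_true, nu = 2.0, 4.0                      # data model: y ~ N(mu, precision nu)
y = rng.normal(mu_true, 1.0 / np.sqrt(nu), size=50)

m0, p0 = 0.0, 1.0                           # conjugate Gaussian prior: mu ~ N(m0, 1/p0)

# Conjugacy: the posterior over mu is again Gaussian, with
#   precision pN = p0 + N*nu,  mean mN = (p0*m0 + nu*sum(y)) / pN
N = len(y)
pN = p0 + N * nu
mN = (p0 * m0 + nu * y.sum()) / pN

print(f"posterior over mu: N(mean={mN:.3f}, precision={pN:.1f})")
```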
27 Separation of Convolutive Mixtures of Speech Sources
- Blind separation methods use extremely simple models for source distributions
- Speech signals have a rich structure; models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
- One such model works in the windowed FFT domain (sketch below):
  x(n,k) = G(k) y(n,k)
  where n = frame index, k = frequency
- Train a MoG model on the x(n,k) such that different components capture different speech spectra
- Plug this model into IFA and use EM to obtain separation of convolutive mixtures
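The windowed-FFT formulation turns the convolutive problem into an instantaneous one per frequency bin: each bin k gets its own unmixing matrix G(k) applied frame by frame. The numpy sketch below only shows the transform and the per-bin application of a given set of G(k); learning the G(k) by the MoG/IFA machinery of this slide is not shown, and all names are illustrative:

```python
import numpy as np

def windowed_fft(y, frame_len=512, hop=256):
    """y: (n_mics, n_samples) -> spectra of shape (n_frames, n_freqs, n_mics)."""
    n_mics, n_samples = y.shape
    win = np.hanning(frame_len)
    starts = range(0, n_samples - frame_len + 1, hop)
    frames = np.stack([y[:, s:s + frame_len] * win for s in starts])  # (n_frames, n_mics, frame_len)
    return np.fft.rfft(frames, axis=-1).transpose(0, 2, 1)            # (n_frames, n_freqs, n_mics)

def apply_unmixing(Y_fk, G):
    """x(n,k) = G(k) y(n,k): G has shape (n_freqs, n_srcs, n_mics)."""
    # Sum over the microphone index for every frame n and frequency bin k.
    return np.einsum('kij,nkj->nki', G, Y_fk)
```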
28 Noise Suppression in Speech Signals
- Existing methods, based e.g. on spectral subtraction and array processing, often produce unsatisfactory noise suppression
- Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just from silent segments), and (3) work with one or more microphones
- Use a speech model in the windowed FFT domain
- λ(k) = noise precision per frequency (inverse spectrum)
29 Interference Suppression and Evoked Source Separation in MEG Data
- y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise
- Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
- Pre-stimulus: y(n) = B u(n) + v(n)
- Post-stimulus: y(n) = A x(n) + B u(n) + v(n)
- SEIFA is an extension of IFA to this case: model x by MoG, model u by Gaussians N(0,I), model v by Gaussian N(0,λ)
- Use pre-stimulus data to learn the interference mixing matrix B and noise precision λ; use post-stimulus data to learn the evoked mixing matrix A
- Use VB-EM to infer from data the optimal number of interference factors u and of evoked factors x; this also protects against overfitting
- Cleaned data: y = A x; contribution of factor j to sensor i: y_i = A_ij x_j
- Advantages over ICA: no need to discard information by dimensionality reduction; can exploit stimulus onset information; superior noise suppression
30 Stimulus Evoked Independent Factor Analysis
[Figure: pre-stimulus and post-stimulus graphical models with interference factors u, interference mixing matrix B, and sensor data y]
31 Brain Source Localization using MEG
- Problem: localize brain sources that respond to a stimulus
- The response model is simple: y(n) = F s(n) + v(n)
  F = lead field (known), s = brain voxel activity
- However, the number of voxels (3000-10000) is much larger than the number of sensors (100-300)
- One approach: fit multiple dipole sources; the cost is exponential in the number of sources
- Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model
  y(n) = F z(n) + A x(n) + v(n)
  where F = lead field for that voxel, z = voxel activity, x = response from all other active voxels
- Obtain a localization map by plotting <z(n)²> per voxel
- Superior results to existing (beamforming based) methods; can handle correlated sources
32 MEG Localization Model
[Figure: pre-stimulus and post-stimulus graphical models with voxel activity z, interference factors u, other-voxel responses x, mixing matrices A and B, lead field F, and sensor data y]
33 Conclusion
- Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
- Process:
  (1) design a probabilistic model that corresponds to the problem
  (2) use machinery for exact and approximate inference to learn the model from data, including model order
  (3) extend the model, e.g. by incorporating rich signal models, to improve performance
- Problems treated here: noise suppression, source separation, source localization
- Domains: speech, audio, biomedical data
- Domains outside this tutorial: image, video, text, coding, ...
- Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing