Title: Variational Bayesian Methods for Audio Indexing
1. Variational Bayesian Methods for Audio Indexing
- Fabio Valente, Christian Wellekens
- Institut Eurecom
2. Outline
- Generalities on speaker clustering
- Model selection/BIC
- Variational learning
- Variational model selection
- Results
3. Speaker clustering
- Many applications (speaker indexing, speech recognition) require clustering segments with the same characteristics, e.g. speech from the same speaker.
- Goal: grouping together speech segments of the same speaker.
- Fully connected (ergodic) HMM topology with a duration constraint; each state represents a speaker (a minimal sketch of such a topology follows below).
- When the number of speakers is not known, it must be estimated with a model selection criterion (e.g. BIC).
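As an illustration only (not the authors' code), a minimal numpy sketch of such a duration-constrained ergodic topology; expanding each speaker into a chain of min_dur sub-states and the 0.9 self-loop weight are assumptions:

import numpy as np

def ergodic_transitions(n_speakers, min_dur, stay=0.9):
    # Each speaker state is expanded into a left-to-right chain of
    # min_dur sub-states, so no speaker segment can be shorter than
    # min_dur frames.
    n = n_speakers * min_dur
    A = np.zeros((n, n))
    for s in range(n_speakers):
        first = s * min_dur
        last = first + min_dur - 1
        for k in range(first, last):
            A[k, k + 1] = 1.0                  # forced advance along the chain
        exits = [t * min_dur for t in range(n_speakers) if t != s]
        if exits:
            A[last, last] = stay               # remain with the same speaker
            for e in exits:                    # jump to another speaker's entry state
                A[last, e] = (1.0 - stay) / len(exits)
        else:
            A[last, last] = 1.0                # degenerate single-speaker case
    return A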
4. Model selection
Given data Y and model m, the optimal model maximizes the posterior p(m|Y).
If the prior over models is uniform, the decision depends only on p(Y|m) (a.k.a. the marginal likelihood).
Bayesian modeling assumes distributions over the parameters.
The criterion is thus the marginal likelihood, which is prohibitive to compute for some models (HMM, GMM).
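The formulas on this slide were images and did not survive extraction; a standard reconstruction, consistent with the slide text:

m^* = \arg\max_m p(m|Y), \qquad p(m|Y) \propto p(Y|m)\, p(m)

p(Y|m) = \int p(Y|\theta, m)\, p(\theta|m)\, d\theta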
5. Bayesian information criterion (BIC)
First-order approximation obtained from the Laplace approximation of the marginal likelihood (Schwarz, 1978).
Generally, the penalty is multiplied by a constant (threshold).
BIC does not depend on parameter distributions!
Asymptotically (large n), BIC converges to the log-marginal likelihood.
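A standard reconstruction of the missing formula (d_m free parameters, n frames, lambda the threshold constant):

\mathrm{BIC}(m) = \log p(Y|\hat{\theta}_m, m) - \lambda\, \frac{d_m}{2} \log n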
6. Variational Learning
Introduce an approximate variational distribution q(theta) over the parameters.
Applying Jensen's inequality gives a lower bound on the log-marginal likelihood.
Maximization of ln p(Y|m) is then replaced by maximization of this bound, the variational free energy.
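A standard reconstruction of the missing bound:

\ln p(Y|m) = \ln \int q(\theta)\, \frac{p(Y, \theta|m)}{q(\theta)}\, d\theta \;\geq\; \int q(\theta) \ln \frac{p(Y, \theta|m)}{q(\theta)}\, d\theta = F_m[q(\theta)]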
7. Variational Learning with hidden variables
Sometimes model optimization requires hidden variables (e.g. the state sequence in EM).
If x denotes the hidden variables, the free energy can be written over a joint variational posterior q(x, theta).
Independence hypothesis: q(x, theta) = q(x) q(theta).
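A standard reconstruction of the missing free energy (the integral over x becomes a sum when x is discrete, e.g. an HMM state sequence):

F_m[q(x, \theta)] = \int\!\!\int q(x, \theta) \ln \frac{p(Y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta, \qquad q(x, \theta) = q(x)\, q(\theta)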
8. EM-like algorithm
Under the independence hypothesis q(x, theta) = q(x) q(theta), the free energy can be maximized by an iterative EM-like algorithm:
E-step: update q(x) given the current q(theta).
M-step: update q(theta) given the current q(x).
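A standard reconstruction of the missing update equations (angle brackets denote expectations under the subscripted distribution):

\text{E-step:} \quad q(x) \propto \exp \langle \ln p(Y, x|\theta, m) \rangle_{q(\theta)}

\text{M-step:} \quad q(\theta) \propto p(\theta|m)\, \exp \langle \ln p(Y, x|\theta, m) \rangle_{q(x)}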
9. VB Model selection
In the same way, an approximate posterior distribution over models q(m) can be defined.
Maximizing the free energy w.r.t. q(m) yields a closed-form solution.
Model selection is then based on the free energy F_m: the best model maximizes q(m).
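A standard reconstruction of the missing solution:

q(m) \propto p(m)\, \exp(F_m), \qquad m^* = \arg\max_m q(m)

With a uniform prior p(m), selecting the best model reduces to selecting the largest free energy F_m.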
10. Experimental framework
- BN-96 Hub4 evaluation data set
- Initialize a model with N speakers (states) and train the system using VB and ML (or VB and MAP with a UBM).
- Reduce the speaker number from N-1 to 1 and train using VB and ML (or MAP).
- Score the N models with VB and BIC and choose the best one (a sketch of this loop follows the score list).
- Three scores:
- Best score
- Selected score (with VB or BIC)
- Score obtained with the known speaker number
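A schematic of the selection loop described above, as an illustration only; train_vb and free_energy are hypothetical stand-ins for the actual training and scoring routines:

def select_speaker_number(features, n_max):
    # Train one model per candidate speaker count, score each with the
    # variational free energy, and keep the best-scoring model.
    candidates = []
    for n in range(n_max, 0, -1):                  # N, N-1, ..., 1 speakers
        model = train_vb(features, n_speakers=n)   # VB training (stand-in)
        f = free_energy(model, features)           # model-selection score
        candidates.append((f, n, model))
    best_f, best_n, best_model = max(candidates, key=lambda c: c[0])
    return best_n, best_model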
- Results given in terms of:
- Acp: average cluster purity
- Asp: average speaker purity
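The deck does not define these metrics; assuming the standard purity definitions (n_{ij} = frames of speaker j in cluster i, n_{i.} and n_{.j} the marginals, N the total frame count), with K the geometric mean used on the threshold slide:

acp = \frac{1}{N} \sum_i \sum_j \frac{n_{ij}^2}{n_{i.}}, \qquad asp = \frac{1}{N} \sum_j \sum_i \frac{n_{ij}^2}{n_{.j}}, \qquad K = \sqrt{acp \cdot asp}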
11. Experiments I
12. Experiments II
13. Dependence on threshold
K as a function of the threshold.
Speaker number as a function of the threshold.
14. Free Energy vs. BIC
15. Experiments III
16. Experiments IV
17. Conclusions and Future Work
- VB uses the free energy for both parameter learning and model selection.
- VB generalizes both the ML and MAP learning frameworks.
- VB outperforms ML/BIC on 3 of the 4 BN files.
- VB outperforms MAP/BIC on 4 of the 4 BN files.
- Future work: repeat the experiments on other databases (e.g. NIST speaker diarization).
18. Thanks for your attention!
19. Data vs. Gaussian components
Final number of Gaussian components as a function of the amount of data for each speaker.
20. Experiments (file 1)
21. Experiments (file 2)
22. Experiments (file 3)
23. Experiments (file 4)