1 Intra-Class Variability Modeling for Speech Processing
- Dr. Hagai Aronowitz
- IBM Haifa Research Lab
- Presentation is available online at http://aronowitzh.googlepages.com/
2 Speech Classification: Proposed framework
- Given labeled training segments from two classes, classify unlabeled test segments
- Classification framework:
- Represent speech segments in segment-space
- Learn a classifier in segment-space:
- SVMs
- NNs
- Bayesian classifiers
3 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
4 GMM based speaker recognition
Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
Assuming frame independence:
- Estimate Pr(y_t | S) for every frame y_t given speaker S
- Train a universal background model (UBM) GMM using EM
- For every target speaker S, train a GMM G_S by applying MAP adaptation to the UBM (a code sketch follows below)
[Figure: UBM component means µ1, µ2, µ3 in the R^26 MFCC feature space, MAP-adapted into speaker models Q1 (speaker 1) and Q2 (speaker 2)]
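Since the slide only names the two training steps, here is a minimal sketch of a GMM-UBM recipe in Python: EM training of the UBM followed by mean-only MAP adaptation with a relevance factor. The function names, the relevance factor of 16, and the use of scikit-learn are illustrative assumptions, not the system described in the talk.

```python
# Minimal GMM-UBM sketch (illustrative, not the author's exact recipe).
# Feature extraction (MFCCs, VAD, warping) is assumed to happen elsewhere.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=2048):
    """background_frames: (N, D) MFCC frames pooled over many speakers."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=50)
    ubm.fit(background_frames)
    return ubm

def map_adapt_means(ubm, speaker_frames, relevance=16.0):
    """Classic mean-only MAP adaptation (relevance-factor form)."""
    post = ubm.predict_proba(speaker_frames)          # (T, G) responsibilities
    n_g = post.sum(axis=0)                            # soft counts per Gaussian
    f_g = post.T @ speaker_frames                     # first-order statistics (G, D)
    alpha = (n_g / (n_g + relevance))[:, None]        # adaptation coefficients
    e_g = f_g / np.maximum(n_g[:, None], 1e-10)       # posterior means per Gaussian
    return alpha * e_g + (1.0 - alpha) * ubm.means_   # adapted means
```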
5 GMM Based Algorithm - Analysis
- Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
- GMM scoring is inefficient: linear in the length of the audio
- GMM scoring does not support indexing
6 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
7 Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4
Definitions:
- X: training session for the target speaker
- Y: test session
- Q: GMM trained for X
- P: GMM trained for Y
Goal: compute Pr(Y | Q) using GMMs P and Q only
- Motivation
- Efficient speaker recognition and indexing
- More accurate modeling
8 Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4
Negative cross entropy:
(1)
- Approximating the cross entropy between two GMMs:
- Matching-based lower bound [Aronowitz 2004]
- Unscented-transform based approximation [Goldberger & Aronowitz 2005]
- Other options in [Hershey 2007]
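The equation labelled (1) did not survive extraction. A plausible reconstruction, consistent with the ACE formulation in [14] and with the frame-independence assumption of slide 4: the per-frame log-likelihood of a long test session Y (whose frames follow the GMM P) under the target GMM Q converges to the negative cross entropy between P and Q.

```latex
% Hedged reconstruction of Eq. (1)
\frac{1}{T}\log \Pr(Y \mid Q)
  \;=\; \frac{1}{T}\sum_{t=1}^{T}\log \Pr(y_t \mid Q)
  \;\xrightarrow[\;T\to\infty\;]{}\;
  \int p(y)\,\log q(y)\,dy \;=\; -H(P;Q)
```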
9 Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4
Matching-based approximation:
(2)
Assuming weights and covariance matrices are speaker independent (plus some approximations):
(3)
The mapping T is induced:
(4)
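Equations (2)-(4) were also lost in extraction, but the induced mapping T is the GMM mean supervector. The sketch below is an assumption based on the ACE paper [14] rather than a verbatim reproduction of the slide: a MAP-adapted GMM is mapped to a weight- and covariance-normalized supervector, so that the approximated cross entropy reduces, up to additive constants, to a squared Euclidean distance between supervectors.

```python
import numpy as np

def supervector(adapted_means, ubm_weights, ubm_diag_covs):
    """Map a mean-only MAP-adapted GMM to a normalized supervector T(X).
    adapted_means: (G, D), ubm_weights: (G,), ubm_diag_covs: (G, D).
    Each mean is scaled by sqrt(w_g) / sigma_g so that squared distances
    between supervectors match the weighted, covariance-normalized term
    in the approximated cross entropy."""
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_diag_covs)
    return scaled.ravel()                      # dimension G * D

def ace_score(sv_train, sv_test):
    """Higher is more similar: negative half squared distance between
    supervectors, i.e. the data-dependent part of the approximation."""
    diff = sv_train - sv_test
    return -0.5 * float(diff @ diff)
```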
10 Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4
Results
Figure and table taken from H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", IEEE Trans. on Audio, Speech, and Language Processing, September 2007.
11 Other Mapping Techniques
- Anchor modeling projection [Sturim 2001]: efficient but inaccurate
- MLLR transforms [Stolcke 2005]: accurate but inefficient
- Kernel-PCA-based mapping [Aronowitz 2007c]: accurate and efficient
- Given a set of objects and a kernel function (a dot product between each pair of objects), it finds a mapping of the objects into R^n which preserves the kernel function.
12 Kernel-PCA Based Mapping
[Diagram: sessions x and y in session space are mapped by the kernel-induced mapping and K-PCA into feature space; each image f(x), f(y) decomposes into coordinates T_x, T_y in the common speaker subspace (R^n, spanned by the anchor sessions) and components u_x, u_y in the speaker-unique subspace]
13 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
14 Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
- The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- channel, noise
- language
- stress, emotion, aging
- The frame independence assumption does not hold in these cases!
(1)
Instead, we can use a more relaxed assumption:
(2)
which leads to:
(3)
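Equations (1)-(3) did not survive extraction. A hedged reconstruction, consistent with the session-GMM generative model on the next slide (frames are independent only given a per-session GMM, which is itself drawn once per session from a speaker-dependent distribution):

```latex
% (1) Classic frame-independence assumption
\Pr(y_1,\dots,y_T \mid S) \;=\; \prod_{t=1}^{T}\Pr(y_t \mid S)

% (2) Relaxed assumption: frames are independent only given the session GMM Q
\Pr(y_1,\dots,y_T \mid Q) \;=\; \prod_{t=1}^{T}\Pr(y_t \mid Q)

% (3) Resulting speaker likelihood: marginalize over the session GMM
\Pr(y_1,\dots,y_T \mid S) \;=\; \int \Pr(Q \mid S)\,\prod_{t=1}^{T}\Pr(y_t \mid Q)\,dQ
```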
15 Old vs. New Generative Models
[Diagram comparing the two models]
- Old model: a speaker is a GMM; the frame sequence is generated independently from it
- New model: a speaker is a PDF over GMM space; a session GMM is drawn for each session, and the frame sequence is then generated independently from that session GMM
16 Session-GMM Space
[Diagram: session-GMM space, in which the GMMs for sessions A and B of speaker 1 lie close together, separated from the session GMMs of speakers 2 and 3]
17 Modeling in Session-GMM space 1/2
Recall the mapping T induced by the GMM approximation analysis:
- T(X) is called a supervector
- A speaker is modeled by a multivariate normal distribution in supervector space
(3)
- A typical dimension of the covariance matrix is 50,000 x 50,000
- The covariance is estimated robustly using PCA regularization: it is assumed to be a low-rank matrix with an additional non-zero (noise) diagonal
18 Modeling in Session-GMM Space 2/2: Estimating the covariance matrix
[Diagram: session supervectors of speakers 1, 2, and 3 in supervector space; subtracting each speaker's mean maps the sessions into a common delta-supervector space, where the pooled deltas are used to estimate the intra-speaker covariance]
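A minimal sketch of the estimation step illustrated above, under the assumptions stated on the previous slide: intra-speaker variability is learned from delta supervectors (each session supervector minus its speaker's mean), and the covariance is regularized as a low-rank PCA component plus an isotropic noise diagonal. The function name and the choice of rank are illustrative.

```python
import numpy as np

def intra_speaker_covariance(supervectors, speaker_ids, rank=50):
    """supervectors: (N, K) session supervectors; speaker_ids: length-N labels.
    Returns (U, lam, sigma2): top-`rank` eigenvectors/eigenvalues of the
    delta-supervector covariance plus a residual noise variance, so that
    Cov ~= U diag(lam) U^T + sigma2 * I."""
    supervectors = np.asarray(supervectors, dtype=float)
    deltas = []
    for spk in set(speaker_ids):
        idx = [i for i, s in enumerate(speaker_ids) if s == spk]
        sessions = supervectors[idx]
        deltas.append(sessions - sessions.mean(axis=0))   # remove the speaker mean
    deltas = np.vstack(deltas)                            # pooled delta supervectors

    # PCA via SVD of the pooled deltas (avoids forming the K x K covariance)
    _, sing, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = sing ** 2 / len(deltas)
    U = vt[:rank].T                                       # (K, rank) principal axes
    lam = eigvals[:rank]
    sigma2 = eigvals[rank:].mean() if len(eigvals) > rank else 1e-6
    return U, lam, sigma2
```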
19 Experimental Setup
Datasets:
- The intra-speaker covariance is estimated from the NIST-2006-SRE corpus
- Evaluation is done on the NIST-2004-SRE corpus
System description:
- ETSI MFCC (13 cepstra + 13 delta-cepstra)
- Energy-based voice activity detector
- Feature warping
- 2048 Gaussians
- Target models are adapted from the GI-UBM
- ZT-norm score normalization
20 Results
38% reduction in EER
21 Other Modeling Techniques
- NAP + SVMs [Campbell 2006]
- Factor Analysis [Kenny 2005]
- Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm:
- Model each supervector as s + u
- s ∈ S: common speaker subspace
- u ∈ U: speaker unique subspace
- S is spanned by a set of development supervectors (700 speakers)
- U is the orthogonal complement of S in supervector space
- Intra-speaker variability is modeled separately in S and in U
- U was found to be more discriminative than S
- EER was reduced by 44% compared to the baseline GMM
22 Kernel-PCA Based Modeling
[Diagram: the same decomposition as on slide 12: sessions x and y mapped into feature space and split between the common speaker subspace (R^n) and the speaker-unique subspace]
23 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
24 Trainable Speaker Diarization [Aronowitz 2007d]
- Goals:
- Detect speaker changes (speaker segmentation)
- Cluster speaker segments (speaker clustering)
- Motivation for the new method:
- Current algorithms do not exploit available training data (besides tuning thresholds, etc.)!
- Method:
- Explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric employed by the change-detection / clustering algorithms.
25 Speaker recognition on pairs of 3s segments
- Dev data:
- BNAD05 (5 hr): Arabic, broadcast news
- Eval data:
- BNAT05: Arabic, broadcast news (207 target models, 6756 test segments)

System / EER (%):
- Anchor modeling (baseline): 15.1
- Anchor modeling + kernel-based scoring: 10.8
- Kernel-PCA projection (CSS): 8.8
- Kernel-PCA projection (CSS) + inter-segment variability modeling: 7.4
26 Speaker Diarization System: Experiments
- Speaker change detection:
- 2 adjacent sliding windows (3s each)
- Speaker verification scoring and normalization
- Speaker clustering:
- Speaker verification scoring and normalization
- Bottom-up clustering (a simplified sketch follows below)
- Speaker Error Rate (SER) on BNAT05:
- Anchor modeling (baseline): 12.9%
- Kernel-PCA based method: 7.9%
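A highly simplified sketch of the pipeline above: adjacent-window change scoring followed by bottom-up (agglomerative) clustering of segment supervectors. Plain Euclidean distance stands in for the intra-speaker-variability-aware verification scoring and normalization the slide refers to; thresholds and names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def change_scores(window_vectors):
    """window_vectors: (N, K) supervectors of consecutive 3s windows.
    Returns a dissimilarity score between every pair of adjacent windows;
    peaks above a tuned threshold are hypothesized speaker changes."""
    diffs = np.diff(window_vectors, axis=0)
    return np.linalg.norm(diffs, axis=1)

def cluster_segments(segment_vectors, distance_threshold):
    """Bottom-up (agglomerative) clustering of segment supervectors.
    Returns one integer speaker label per segment."""
    condensed = pdist(segment_vectors, metric="euclidean")
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")
```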
27 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
28 Summary 1/2
- A method for mapping speech segments into a GMM supervector space was described
- Intra-speaker inter-session variability is modeled in GMM supervector space
- Speaker recognition:
- EER was reduced by 38% on the NIST-2004 SRE
- A corresponding kernel-PCA based approach reduces EER by 44%
- Speaker diarization:
- SER for speaker diarization was reduced by 39%
29 Summary 2/2: Algorithms based on the proposed framework
- Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
- Speaker diarization (who spoke when) [Aronowitz 2007d]
- VAD (voice activity detection) [Aronowitz 2007a]
- Language identification [Noor & Aronowitz 2006]
- Gender identification [Bocklet 2008]
- Age detection [Bocklet 2008]
- Channel/bandwidth classification [Aronowitz 2007d]
30 Bibliography 1/2
[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, 91-108.
[2] D. E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP, 2001.
[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004.
[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005.
[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP, 2005.
[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech, 2005.
[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005.
[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.
31 Bibliography 2/2
[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, "Efficient Language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006.
[11] W. M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP, 2006.
[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP, 2007.
[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", IEEE Trans. on Audio, Speech, and Language Processing, September 2007.
[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech, 2007.
[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech, 2007.
[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP, 2008.
32 Thanks!
- Presentation is available online at http://aronowitzh.googlepages.com/
33
34 Kernel-PCA Based Mapping 2/5
[Diagram: anchor sessions and sessions x, y in session space are mapped by f(·) into a dot-product feature space via the kernel trick]
Goals:
- Map sessions into feature space
- Model in feature space
35 Kernel-PCA Based Mapping 3/5
- Given:
- a kernel K
- n anchor sessions A_1, ..., A_n
- Find an orthonormal basis for the subspace spanned by the mapped anchor sessions
- Method (sketched in code below):
- Compute the eigenvectors of the centralized kernel matrix k_{i,j} = K(A_i, A_j)
- Normalize the eigenvectors by the square roots of the corresponding eigenvalues
- The resulting vectors v_1, ..., v_n are the requested basis
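A minimal numpy sketch of the recipe above: centre the kernel matrix, eigendecompose it, and scale the eigenvectors by inverse square-root eigenvalues so the corresponding feature-space basis vectors have unit norm. Centring of kernel rows for new test sessions is glossed over, and all names are illustrative.

```python
import numpy as np

def kpca_basis(kernel_matrix, eps=1e-10):
    """Kernel-PCA basis from an (n, n) kernel matrix over the anchor sessions.
    Returns projection coefficients `alphas` (n, m): column j is the eigenvector
    of the centred kernel matrix scaled by 1/sqrt(eigenvalue), so that the
    corresponding feature-space basis vector has unit norm."""
    n = kernel_matrix.shape[0]
    one = np.full((n, n), 1.0 / n)
    centred = (kernel_matrix - one @ kernel_matrix - kernel_matrix @ one
               + one @ kernel_matrix @ one)               # double-centring
    eigvals, eigvecs = np.linalg.eigh(centred)            # ascending order
    keep = eigvals > eps                                   # drop null directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    return eigvecs / np.sqrt(eigvals)                      # (n, m) coefficients

def project(kernel_row, alphas):
    """Map a new session into R^m given its kernel values against the anchors.
    kernel_row: (n,) vector [K(x, A_1), ..., K(x, A_n)] (assumed pre-centred)."""
    return kernel_row @ alphas
```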
36 Kernel-PCA Based Mapping 4/5
- Common speaker subspace S: spanned by the mapped anchor sessions
- Speaker unique subspace U: the orthogonal complement of S in feature space
- Given sessions x, y, each image f(x) may be uniquely represented as the sum of a component T_x in S and a component u_x in U
- T is a mapping x → R^n with the property K(x, y) = ⟨T_x, T_y⟩ + ⟨u_x, u_y⟩
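Continuing the sketch (and reusing the hypothetical kpca_basis/project helpers from above), both halves of the decomposition can be computed purely from kernel evaluations: T_x is the K-PCA coordinate vector, and the unique-subspace inner product ⟨u_x, u_y⟩ is the residual of the kernel after removing the common-subspace part.

```python
import numpy as np

def common_and_unique(kernel_xy, row_x, row_y, alphas):
    """Split the kernel between sessions x and y into the part explained by
    the common speaker subspace and the residual from the unique subspace.
    kernel_xy: K(x, y); row_x, row_y: (n,) kernel rows against the anchors
    (assumed pre-centred); alphas: (n, m) coefficients from kpca_basis.
    Returns (t_x, t_y, residual) with K(x, y) ~= t_x @ t_y + residual."""
    t_x = row_x @ alphas                       # coordinates of x in the common subspace
    t_y = row_y @ alphas
    residual = kernel_xy - float(t_x @ t_y)    # <u_x, u_y>, the unique-subspace part
    return t_x, t_y, residual
```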
37 Kernel-PCA Based Mapping 5/5
[Diagram: the same picture as slides 12 and 22: sessions x, y mapped into feature space and decomposed into coordinates T_x, T_y in the common speaker subspace (R^n) and components u_x, u_y in the speaker-unique subspace]
38 Modeling in Segment-GMM Supervector Space
[Diagram: the frame sequences of segments 1, 2, ..., n are mapped into segment-GMM supervector space, where speech, silence, and music segments form separate clusters]
39 Segmental Modeling for Audio Segmentation
- Goal:
- Segment audio accurately and robustly into speech / silence / music segments
- Novel idea:
- Acoustic modeling is usually done on a frame basis
- Segmentation/classification is usually done on a segment basis (using smoothing)
- Why not explicitly model whole segments? (a toy sketch follows below)
- Note: speaker, noise, music context, channel, etc. are constant during a segment
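To make the whole-segment idea concrete, here is a deliberately simple, hypothetical sketch: each segment is represented by a supervector (e.g. via the mapping T sketched earlier), and one Gaussian per class with a shared diagonal covariance classifies test segments. This is an illustrative stand-in, not the system evaluated on the next two slides.

```python
import numpy as np

class SegmentalClassifier:
    """Toy segment-level classifier for speech / silence / music.
    Each training segment is represented by a supervector; every class gets a
    Gaussian with a shared diagonal covariance, and test segments are assigned
    to the highest-scoring class."""
    def fit(self, segment_vectors, labels):
        segment_vectors = np.asarray(segment_vectors, dtype=float)
        self.classes_ = sorted(set(labels))
        self.means_ = {c: segment_vectors[[l == c for l in labels]].mean(axis=0)
                       for c in self.classes_}
        self.var_ = segment_vectors.var(axis=0) + 1e-6     # shared diagonal covariance
        return self

    def predict(self, segment_vectors):
        segment_vectors = np.asarray(segment_vectors, dtype=float)
        scores = {c: -0.5 * (((segment_vectors - m) ** 2) / self.var_).sum(axis=1)
                  for c, m in self.means_.items()}
        stacked = np.stack([scores[c] for c in self.classes_], axis=1)
        return [self.classes_[i] for i in stacked.argmax(axis=1)]
```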
40 Speech / Silence Segmentation Results 1/2
EVAL06 (FA = 24.2 @ FR = 0.25)

System: EER | FA @ FR = 0.5 | FR @ FA = 1
- GMM baseline: 2.9 | 7.9 | 29.6
- Segmental: 1.7 | 5.1 | 2.7
- Error reduction: 41% | 35% | 91%
41 Speech / Silence Segmentation Results 2/2
EVAL06 (FA = 69 @ FR = 0.25)

System: EER | FA @ FR = 0.5 | FR @ FA = 1
- GMM baseline: 1.43 | 3.4 | 3.2
- Segmental: 1.27 | 2.0 | 1.9
- Error reduction: 11% | 41% | 41%
42 LID in Session Space
[Diagram: training and test sessions in session space, with English, French, and Arabic sessions forming separate clusters]
43 LID in Session Space - Algorithm
- Front end: shifted delta cepstra (SDC)
- Represent every train/test session by a GMM supervector
- Train a linear SVM to classify GMM supervectors (sketched below)
- Results:
- EER = 4.1% on the NIST-03 eval (30 sec sessions)
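A minimal sketch of the SVM step, assuming session supervectors are already computed (for instance with the supervector helper sketched earlier); scikit-learn's LinearSVC and C = 1.0 are illustrative choices, not the configuration reported above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_lid_svm(train_supervectors, train_languages):
    """Linear SVM on session supervectors for language identification.
    train_supervectors: (N, K); train_languages: length-N labels such as
    "English", "French", "Arabic". Multiclass one-vs-rest is handled
    internally by LinearSVC."""
    svm = LinearSVC(C=1.0)
    svm.fit(np.asarray(train_supervectors), train_languages)
    return svm

# Usage: languages = train_lid_svm(train_sv, train_lang).predict(test_sv)
```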
44 Anchor Modeling Projection
Given anchor models λ_1, ..., λ_n and a session X = x_1, ..., x_F
Projection: each coordinate is the average normalized log-likelihood of the session's frames under one anchor model
Uses of the projection:
- Speaker indexing [Sturim et al., 2001]
- Intersession variability modeling in projected space [Collet et al., 2005]
- Speaker clustering [Reynolds et al., 2004]
- Speaker segmentation [Collet et al., 2006]
- Language identification [Noor and Aronowitz, 2006]
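A small sketch of the projection, assuming the anchor models are scikit-learn GaussianMixture objects; normalizing each anchor's average frame log-likelihood by a UBM is one plausible reading of "normalized" on the slide, not a confirmed detail.

```python
import numpy as np

def anchor_projection(frames, anchor_gmms, ubm):
    """Project a session onto anchor models: one coordinate per anchor,
    the average per-frame log-likelihood normalized by the UBM log-likelihood.
    frames: (F, D); anchor_gmms: list of fitted GaussianMixture models;
    ubm: fitted GaussianMixture used for normalization (assumed detail)."""
    frames = np.asarray(frames, dtype=float)
    ubm_ll = ubm.score(frames)                 # mean per-frame log-likelihood
    return np.array([g.score(frames) - ubm_ll for g in anchor_gmms])
```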
45 Intra-Class Variability Modeling: Introduction
- The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- Noise
- Channel
- Language
- Changing speaker characteristics: stress, emotion, aging
- The frame independence assumption does not hold in these cases!
(1)
Instead, we get:
(2)