1 Intra-Class Variability Modeling for Speech Processing
- Dr. Hagai Aronowitz
- IBM Haifa Research Lab
- Presentation is available online at http://aronowitzh.googlepages.com/
2 Speech Classification: Proposed framework
- Given labeled training segments from two classes, classify unlabeled test segments
- Classification framework:
- Represent speech segments in segment-space
- Learn a classifier in segment-space:
- SVMs
- NNs
- Bayesian classifiers
3 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
4 GMM based speaker recognition
Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
Assuming frame independence:
- Estimate Pr(y_t | S) for every frame y_t given speaker S
- Train a universal background model (UBM) GMM using EM
- For every target speaker S, train a GMM G_S by applying MAP adaptation to the UBM (a code sketch follows below)
[Figure: UBM component means µ1, µ2, µ3 in the R^26 MFCC feature space, MAP-adapted into speaker models Q1 (speaker 1) and Q2 (speaker 2)]
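Since the slide only names the two training steps, here is a minimal sketch of a GMM-UBM recipe in Python: EM training of the UBM followed by mean-only MAP adaptation with a relevance factor. The function names, the relevance factor of 16, and the use of scikit-learn are illustrative assumptions, not the system described in the talk.

```python
# Minimal GMM-UBM sketch (illustrative, not the author's exact recipe).
# Feature extraction (MFCCs, VAD, warping) is assumed to happen elsewhere.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=2048):
    """background_frames: (N, D) MFCC frames pooled over many speakers."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=50)
    ubm.fit(background_frames)
    return ubm

def map_adapt_means(ubm, speaker_frames, relevance=16.0):
    """Classic mean-only MAP adaptation (relevance-factor form)."""
    post = ubm.predict_proba(speaker_frames)          # (T, G) responsibilities
    n_g = post.sum(axis=0)                            # soft counts per Gaussian
    f_g = post.T @ speaker_frames                     # first-order statistics (G, D)
    alpha = (n_g / (n_g + relevance))[:, None]        # adaptation coefficients
    e_g = f_g / np.maximum(n_g[:, None], 1e-10)       # posterior means per Gaussian
    return alpha * e_g + (1.0 - alpha) * ubm.means_   # adapted means
```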
5 GMM Based Algorithm - Analysis
- Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
- GMM scoring is inefficient: linear in the length of the audio
- GMM scoring does not support indexing
6 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
7 Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4
Definitions:
- X: training session for the target speaker
- Y: test session
- Q: GMM trained for X
- P: GMM trained for Y
Goal: compute Pr(Y | Q) using GMMs P and Q only
- Motivation
- Efficient speaker recognition and indexing
- More accurate modeling
8 Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4
Negative cross entropy:
(1)
- Approximating the cross entropy between two GMMs:
- Matching-based lower bound [Aronowitz 2004]
- Unscented-transform based approximation [Goldberger & Aronowitz 2005]
- Other options in [Hershey 2007]
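The equation labelled (1) did not survive extraction. A plausible reconstruction, consistent with the ACE formulation in [14] and with the frame-independence assumption of slide 4: the per-frame log-likelihood of a long test session Y (whose frames follow the GMM P) under the target GMM Q converges to the negative cross entropy between P and Q.

```latex
% Hedged reconstruction of Eq. (1)
\frac{1}{T}\log \Pr(Y \mid Q)
  \;=\; \frac{1}{T}\sum_{t=1}^{T}\log \Pr(y_t \mid Q)
  \;\xrightarrow[\;T\to\infty\;]{}\;
  \int p(y)\,\log q(y)\,dy \;=\; -H(P;Q)
```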
9 Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4
Matching-based approximation:
(2)
Assuming weights and covariance matrices are speaker independent (plus some approximations):
(3)
The mapping T is induced:
(4)
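Equations (2)-(4) were also lost in extraction, but the induced mapping T is the GMM mean supervector. The sketch below is an assumption based on the ACE paper [14] rather than a verbatim reproduction of the slide: a MAP-adapted GMM is mapped to a weight- and covariance-normalized supervector, so that the approximated cross entropy reduces, up to additive constants, to a squared Euclidean distance between supervectors.

```python
import numpy as np

def supervector(adapted_means, ubm_weights, ubm_diag_covs):
    """Map a mean-only MAP-adapted GMM to a normalized supervector T(X).
    adapted_means: (G, D), ubm_weights: (G,), ubm_diag_covs: (G, D).
    Each mean is scaled by sqrt(w_g) / sigma_g so that squared distances
    between supervectors match the weighted, covariance-normalized term
    in the approximated cross entropy."""
    scaled = np.sqrt(ubm_weights)[:, None] * adapted_means / np.sqrt(ubm_diag_covs)
    return scaled.ravel()                      # dimension G * D

def ace_score(sv_train, sv_test):
    """Higher is more similar: negative half squared distance between
    supervectors, i.e. the data-dependent part of the approximation."""
    diff = sv_train - sv_test
    return -0.5 * float(diff @ diff)
```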
10 Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4
Results
Figure and table taken from H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", IEEE Trans. on Audio, Speech, and Language Processing, September 2007.
11 Other Mapping Techniques
- Anchor modeling projection [Sturim 2001]: efficient but inaccurate
- MLLR transforms [Stolcke 2005]: accurate but inefficient
- Kernel-PCA-based mapping [Aronowitz 2007c]: accurate and efficient
- Given a set of objects and a kernel function (a dot product between each pair of objects), it finds a mapping of the objects into R^n which preserves the kernel function.
12 Kernel-PCA Based Mapping
[Diagram: sessions x and y in session space are mapped by the kernel-induced mapping and K-PCA into feature space; each image f(x), f(y) decomposes into coordinates T_x, T_y in the common speaker subspace (R^n, spanned by the anchor sessions) and components u_x, u_y in the speaker-unique subspace]
13 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
14 Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
- The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- channel, noise
- language
- stress, emotion, aging
- The frame independence assumption does not hold in these cases!
(1)
Instead, we can use a more relaxed assumption:
(2)
which leads to:
(3)
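Equations (1)-(3) did not survive extraction. A hedged reconstruction, consistent with the session-GMM generative model on the next slide (frames are independent only given a per-session GMM, which is itself drawn once per session from a speaker-dependent distribution):

```latex
% (1) Classic frame-independence assumption
\Pr(y_1,\dots,y_T \mid S) \;=\; \prod_{t=1}^{T}\Pr(y_t \mid S)

% (2) Relaxed assumption: frames are independent only given the session GMM Q
\Pr(y_1,\dots,y_T \mid Q) \;=\; \prod_{t=1}^{T}\Pr(y_t \mid Q)

% (3) Resulting speaker likelihood: marginalize over the session GMM
\Pr(y_1,\dots,y_T \mid S) \;=\; \int \Pr(Q \mid S)\,\prod_{t=1}^{T}\Pr(y_t \mid Q)\,dQ
```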
15 Old vs. New Generative Models
[Diagram comparing the two models]
- Old model: a speaker is a GMM; the frame sequence is generated independently from it
- New model: a speaker is a PDF over GMM space; a session GMM is drawn for each session, and the frame sequence is then generated independently from that session GMM
16 Session-GMM Space
[Diagram: session-GMM space, in which the GMMs for sessions A and B of speaker 1 lie close together, separated from the session GMMs of speakers 2 and 3]
17 Modeling in Session-GMM space 1/2
Recall the mapping T induced by the GMM approximation analysis:
- T(X) is called a supervector
- A speaker is modeled by a multivariate normal distribution in supervector space
(3)
- A typical dimension of the covariance matrix is 50,000 x 50,000
- The covariance is estimated robustly using PCA regularization: it is assumed to be a low-rank matrix with an additional non-zero (noise) diagonal
18 Modeling in Session-GMM Space 2/2: Estimating the covariance matrix
[Diagram: session supervectors of speakers 1, 2, and 3 in supervector space; subtracting each speaker's mean maps the sessions into a common delta-supervector space, where the pooled deltas are used to estimate the intra-speaker covariance]
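A minimal sketch of the estimation step illustrated above, under the assumptions stated on the previous slide: intra-speaker variability is learned from delta supervectors (each session supervector minus its speaker's mean), and the covariance is regularized as a low-rank PCA component plus an isotropic noise diagonal. The function name and the choice of rank are illustrative.

```python
import numpy as np

def intra_speaker_covariance(supervectors, speaker_ids, rank=50):
    """supervectors: (N, K) session supervectors; speaker_ids: length-N labels.
    Returns (U, lam, sigma2): top-`rank` eigenvectors/eigenvalues of the
    delta-supervector covariance plus a residual noise variance, so that
    Cov ~= U diag(lam) U^T + sigma2 * I."""
    supervectors = np.asarray(supervectors, dtype=float)
    deltas = []
    for spk in set(speaker_ids):
        idx = [i for i, s in enumerate(speaker_ids) if s == spk]
        sessions = supervectors[idx]
        deltas.append(sessions - sessions.mean(axis=0))   # remove the speaker mean
    deltas = np.vstack(deltas)                            # pooled delta supervectors

    # PCA via SVD of the pooled deltas (avoids forming the K x K covariance)
    _, sing, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = sing ** 2 / len(deltas)
    U = vt[:rank].T                                       # (K, rank) principal axes
    lam = eigvals[:rank]
    sigma2 = eigvals[rank:].mean() if len(eigvals) > rank else 1e-6
    return U, lam, sigma2
```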
19 Experimental Setup
Datasets:
- The intra-speaker covariance is estimated from the NIST-2006-SRE corpus
- Evaluation is done on the NIST-2004-SRE corpus
System description:
- ETSI MFCC (13 cepstra + 13 delta-cepstra)
- Energy-based voice activity detector
- Feature warping
- 2048 Gaussians
- Target models are adapted from the GI-UBM
- ZT-norm score normalization
20 Results
38% reduction in EER
21 Other Modeling Techniques
- NAP + SVMs [Campbell 2006]
- Factor Analysis [Kenny 2005]
- Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm:
- Model each supervector as s + u
- s ∈ S: common speaker subspace
- u ∈ U: speaker unique subspace
- S is spanned by a set of development supervectors (700 speakers)
- U is the orthogonal complement of S in supervector space
- Intra-speaker variability is modeled separately in S and in U
- U was found to be more discriminative than S
- EER was reduced by 44% compared to the baseline GMM
22 Kernel-PCA Based Modeling
[Diagram: the same decomposition as on slide 12: sessions x and y mapped into feature space and split between the common speaker subspace (R^n) and the speaker-unique subspace]
23 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
24 Trainable Speaker Diarization [Aronowitz 2007d]
- Goals:
- Detect speaker changes (speaker segmentation)
- Cluster speaker segments (speaker clustering)
- Motivation for the new method:
- Current algorithms do not exploit available training data (besides tuning thresholds, etc.)!
- Method:
- Explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric employed by the change-detection / clustering algorithms.
25 Speaker recognition on pairs of 3s segments
- Dev data:
- BNAD05 (5 hr): Arabic, broadcast news
- Eval data:
- BNAT05: Arabic, broadcast news (207 target models, 6756 test segments)

System / EER (%):
- Anchor modeling (baseline): 15.1
- Anchor modeling + kernel-based scoring: 10.8
- Kernel-PCA projection (CSS): 8.8
- Kernel-PCA projection (CSS) + inter-segment variability modeling: 7.4
26 Speaker Diarization System: Experiments
- Speaker change detection:
- 2 adjacent sliding windows (3s each)
- Speaker verification scoring and normalization
- Speaker clustering:
- Speaker verification scoring and normalization
- Bottom-up clustering (a simplified sketch follows below)
- Speaker Error Rate (SER) on BNAT05:
- Anchor modeling (baseline): 12.9%
- Kernel-PCA based method: 7.9%
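A highly simplified sketch of the pipeline above: adjacent-window change scoring followed by bottom-up (agglomerative) clustering of segment supervectors. Plain Euclidean distance stands in for the intra-speaker-variability-aware verification scoring and normalization the slide refers to; thresholds and names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def change_scores(window_vectors):
    """window_vectors: (N, K) supervectors of consecutive 3s windows.
    Returns a dissimilarity score between every pair of adjacent windows;
    peaks above a tuned threshold are hypothesized speaker changes."""
    diffs = np.diff(window_vectors, axis=0)
    return np.linalg.norm(diffs, axis=1)

def cluster_segments(segment_vectors, distance_threshold):
    """Bottom-up (agglomerative) clustering of segment supervectors.
    Returns one integer speaker label per segment."""
    condensed = pdist(segment_vectors, metric="euclidean")
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")
```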
27 Outline: Intra-Class Variability Modeling for Speech Processing
- 1 Introduction to GMM based classification
- 2 Mapping speech segments into segment space
- 3 Intra-class variability modeling
- 4 Speaker diarization
- 5 Summary
28 Summary 1/2
- A method for mapping speech segments into a GMM supervector space was described
- Intra-speaker inter-session variability is modeled in GMM supervector space
- Speaker recognition:
- EER was reduced by 38% on the NIST-2004 SRE
- A corresponding kernel-PCA based approach reduces EER by 44%
- Speaker diarization:
- SER for speaker diarization was reduced by 39%
29 Summary 2/2: Algorithms based on the proposed framework
- Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
- Speaker diarization (who spoke when) [Aronowitz 2007d]
- VAD (voice activity detection) [Aronowitz 2007a]
- Language identification [Noor & Aronowitz 2006]
- Gender identification [Bocklet 2008]
- Age detection [Bocklet 2008]
- Channel/bandwidth classification [Aronowitz 2007d]
30 Bibliography 1/2
[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, 91-108.
[2] D. E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP, 2001.
[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004.
[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005.
[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP, 2005.
[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech, 2005.
[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005.
[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.
31 Bibliography 2/2
[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, "Efficient Language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006.
[11] W. M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP, 2006.
[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP, 2007.
[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", IEEE Trans. on Audio, Speech, and Language Processing, September 2007.
[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech, 2007.
[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech, 2007.
[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP, 2008.
32 Thanks!
- Presentation is available online at http://aronowitzh.googlepages.com/
33
34 Kernel-PCA Based Mapping 2/5
[Diagram: anchor sessions and sessions x, y in session space are mapped by f(·) into a dot-product feature space via the kernel trick]
Goals:
- Map sessions into feature space
- Model in feature space
35 Kernel-PCA Based Mapping 3/5
- Given:
- a kernel K
- n anchor sessions A_1, ..., A_n
- Find an orthonormal basis for the subspace spanned by the mapped anchor sessions
- Method (sketched in code below):
- Compute the eigenvectors of the centralized kernel matrix k_{i,j} = K(A_i, A_j)
- Normalize the eigenvectors by the square roots of the corresponding eigenvalues
- The resulting vectors v_1, ..., v_n are the requested basis
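A minimal numpy sketch of the recipe above: centre the kernel matrix, eigendecompose it, and scale the eigenvectors by inverse square-root eigenvalues so the corresponding feature-space basis vectors have unit norm. Centring of kernel rows for new test sessions is glossed over, and all names are illustrative.

```python
import numpy as np

def kpca_basis(kernel_matrix, eps=1e-10):
    """Kernel-PCA basis from an (n, n) kernel matrix over the anchor sessions.
    Returns projection coefficients `alphas` (n, m): column j is the eigenvector
    of the centred kernel matrix scaled by 1/sqrt(eigenvalue), so that the
    corresponding feature-space basis vector has unit norm."""
    n = kernel_matrix.shape[0]
    one = np.full((n, n), 1.0 / n)
    centred = (kernel_matrix - one @ kernel_matrix - kernel_matrix @ one
               + one @ kernel_matrix @ one)               # double-centring
    eigvals, eigvecs = np.linalg.eigh(centred)            # ascending order
    keep = eigvals > eps                                   # drop null directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    return eigvecs / np.sqrt(eigvals)                      # (n, m) coefficients

def project(kernel_row, alphas):
    """Map a new session into R^m given its kernel values against the anchors.
    kernel_row: (n,) vector [K(x, A_1), ..., K(x, A_n)] (assumed pre-centred)."""
    return kernel_row @ alphas
```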
36 Kernel-PCA Based Mapping 4/5
- Common speaker subspace S: spanned by the mapped anchor sessions
- Speaker unique subspace U: the orthogonal complement of S in feature space
- Given sessions x, y, each image f(x) may be uniquely represented as the sum of a component T_x in S and a component u_x in U
- T is a mapping x → R^n with the property K(x, y) = ⟨T_x, T_y⟩ + ⟨u_x, u_y⟩
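Continuing the sketch (and reusing the hypothetical kpca_basis/project helpers from above), both halves of the decomposition can be computed purely from kernel evaluations: T_x is the K-PCA coordinate vector, and the unique-subspace inner product ⟨u_x, u_y⟩ is the residual of the kernel after removing the common-subspace part.

```python
import numpy as np

def common_and_unique(kernel_xy, row_x, row_y, alphas):
    """Split the kernel between sessions x and y into the part explained by
    the common speaker subspace and the residual from the unique subspace.
    kernel_xy: K(x, y); row_x, row_y: (n,) kernel rows against the anchors
    (assumed pre-centred); alphas: (n, m) coefficients from kpca_basis.
    Returns (t_x, t_y, residual) with K(x, y) ~= t_x @ t_y + residual."""
    t_x = row_x @ alphas                       # coordinates of x in the common subspace
    t_y = row_y @ alphas
    residual = kernel_xy - float(t_x @ t_y)    # <u_x, u_y>, the unique-subspace part
    return t_x, t_y, residual
```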
37 Kernel-PCA Based Mapping 5/5
[Diagram: the same picture as slides 12 and 22: sessions x, y mapped into feature space and decomposed into coordinates T_x, T_y in the common speaker subspace (R^n) and components u_x, u_y in the speaker-unique subspace]
38 Modeling in Segment-GMM Supervector Space
[Diagram: the frame sequences of segments 1, 2, ..., n are mapped into segment-GMM supervector space, where speech, silence, and music segments form separate clusters]
39 Segmental Modeling for Audio Segmentation
- Goal:
- Segment audio accurately and robustly into speech / silence / music segments
- Novel idea:
- Acoustic modeling is usually done on a frame basis
- Segmentation/classification is usually done on a segment basis (using smoothing)
- Why not explicitly model whole segments? (a toy sketch follows below)
- Note: speaker, noise, music context, channel, etc. are constant during a segment
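To make the whole-segment idea concrete, here is a deliberately simple, hypothetical sketch: each segment is represented by a supervector (e.g. via the mapping T sketched earlier), and one Gaussian per class with a shared diagonal covariance classifies test segments. This is an illustrative stand-in, not the system evaluated on the next two slides.

```python
import numpy as np

class SegmentalClassifier:
    """Toy segment-level classifier for speech / silence / music.
    Each training segment is represented by a supervector; every class gets a
    Gaussian with a shared diagonal covariance, and test segments are assigned
    to the highest-scoring class."""
    def fit(self, segment_vectors, labels):
        segment_vectors = np.asarray(segment_vectors, dtype=float)
        self.classes_ = sorted(set(labels))
        self.means_ = {c: segment_vectors[[l == c for l in labels]].mean(axis=0)
                       for c in self.classes_}
        self.var_ = segment_vectors.var(axis=0) + 1e-6     # shared diagonal covariance
        return self

    def predict(self, segment_vectors):
        segment_vectors = np.asarray(segment_vectors, dtype=float)
        scores = {c: -0.5 * (((segment_vectors - m) ** 2) / self.var_).sum(axis=1)
                  for c, m in self.means_.items()}
        stacked = np.stack([scores[c] for c in self.classes_], axis=1)
        return [self.classes_[i] for i in stacked.argmax(axis=1)]
```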
40 Speech / Silence Segmentation Results 1/2
EVAL06 (FA = 24.2 @ FR = 0.25)

System: EER | FA @ FR = 0.5 | FR @ FA = 1
- GMM baseline: 2.9 | 7.9 | 29.6
- Segmental: 1.7 | 5.1 | 2.7
- Error reduction: 41% | 35% | 91%
41 Speech / Silence Segmentation Results 2/2
EVAL06 (FA = 69 @ FR = 0.25)

System: EER | FA @ FR = 0.5 | FR @ FA = 1
- GMM baseline: 1.43 | 3.4 | 3.2
- Segmental: 1.27 | 2.0 | 1.9
- Error reduction: 11% | 41% | 41%
42 LID in Session Space
[Diagram: training and test sessions in session space, with English, French, and Arabic sessions forming separate clusters]
43 LID in Session Space - Algorithm
- Front end: shifted delta cepstra (SDC)
- Represent every train/test session by a GMM supervector
- Train a linear SVM to classify GMM supervectors (sketched below)
- Results:
- EER = 4.1% on the NIST-03 eval (30 sec sessions)
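A minimal sketch of the SVM step, assuming session supervectors are already computed (for instance with the supervector helper sketched earlier); scikit-learn's LinearSVC and C = 1.0 are illustrative choices, not the configuration reported above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_lid_svm(train_supervectors, train_languages):
    """Linear SVM on session supervectors for language identification.
    train_supervectors: (N, K); train_languages: length-N labels such as
    "English", "French", "Arabic". Multiclass one-vs-rest is handled
    internally by LinearSVC."""
    svm = LinearSVC(C=1.0)
    svm.fit(np.asarray(train_supervectors), train_languages)
    return svm

# Usage: languages = train_lid_svm(train_sv, train_lang).predict(test_sv)
```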
44 Anchor Modeling Projection
Given anchor models λ_1, ..., λ_n and a session X = x_1, ..., x_F
Projection: each coordinate is the average normalized log-likelihood of the session's frames under one anchor model
Uses of the projection:
- Speaker indexing [Sturim et al., 2001]
- Intersession variability modeling in projected space [Collet et al., 2005]
- Speaker clustering [Reynolds et al., 2004]
- Speaker segmentation [Collet et al., 2006]
- Language identification [Noor and Aronowitz, 2006]
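A small sketch of the projection, assuming the anchor models are scikit-learn GaussianMixture objects; normalizing each anchor's average frame log-likelihood by a UBM is one plausible reading of "normalized" on the slide, not a confirmed detail.

```python
import numpy as np

def anchor_projection(frames, anchor_gmms, ubm):
    """Project a session onto anchor models: one coordinate per anchor,
    the average per-frame log-likelihood normalized by the UBM log-likelihood.
    frames: (F, D); anchor_gmms: list of fitted GaussianMixture models;
    ubm: fitted GaussianMixture used for normalization (assumed detail)."""
    frames = np.asarray(frames, dtype=float)
    ubm_ll = ubm.score(frames)                 # mean per-frame log-likelihood
    return np.array([g.score(frames) - ubm_ll for g in anchor_gmms])
```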
45 Intra-Class Variability Modeling: Introduction
- The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
- Noise
- Channel
- Language
- Changing speaker characteristics: stress, emotion, aging
- The frame independence assumption does not hold in these cases!
(1)
Instead, we get:
(2)