Dr. Hagai Aronowitz - PowerPoint PPT Presentation

1
Intra-Class Variability Modeling for Speech
Processing
  • Dr. Hagai Aronowitz
  • IBM Haifa Research Lab
  • Presentation is available online at
    http://aronowitzh.googlepages.com/

2
Speech Classification: Proposed framework
  • Given labeled training segments from two classes,
    classify unlabeled test segments
  • Classification framework
    • Represent speech segments in segment-space
    • Learn a classifier in segment-space
      • SVMs
      • NNs
      • Bayesian classifiers

3
Outline Intra-Class Variability Modeling for
Speech Processing
  • 1 Introduction to GMM based classification
  • 2 Mapping speech segments into segment space
  • 3 Intra-class variability modeling
  • 4 Speaker diarization
  • 5 Summary

4
GMM based speaker recognition
Text-Independent Speaker Recognition: GMM-Based
Algorithm [Reynolds 1995]
Assuming frame independence
  • Estimate Pr(y_t | S)
  • Train a universal background model (UBM) GMM
    using EM
  • For every target speaker S: train a GMM G_S by
    applying MAP-adaptation
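The MAP-adaptation step can be illustrated with a minimal sketch. It assumes diagonal covariances, adapts the means only, and uses a relevance factor of 16; none of these choices is stated on the slide, and the function name is hypothetical.

```python
import numpy as np

def map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars, relevance=16.0):
    """Return MAP-adapted means for a diagonal-covariance GMM.

    frames:      (T, D) feature vectors of one target speaker
    ubm_weights: (G,)   UBM mixture weights
    ubm_means:   (G, D) UBM component means
    ubm_vars:    (G, D) UBM diagonal variances
    relevance:   assumed relevance factor (not given in the slides)
    """
    # Log-density of every frame under every Gaussian, shape (T, G).
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss

    # Normalise to per-frame component posteriors (stable softmax).
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    counts = post.sum(axis=0)                                  # soft counts, (G,)
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]

    # Components that saw more data move further away from the UBM mean.
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * ubm_means
```

Components that receive almost no frames keep their UBM means, which is what keeps the adapted model usable when the training session is short.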

[Figure: the UBM and the speaker models Q1 (speaker 1) and Q2 (speaker 2), with means µ1, µ2, µ3, in the R^26 MFCC feature space]
5
GMM Based Algorithm - Analysis
  • Invalid frame independence assumption: factors
    such as channel, emotion, lexical variability,
    and speaker aging cause frame dependency
  • GMM scoring is inefficient: linear in the length
    of the audio
  • GMM scoring does not support indexing

6
Outline Intra-Class Variability Modeling for
Speech Processing
  • 1 Introduction to GMM based classification
  • 2 Mapping speech segments into segment space
  • 3 Intra-class variability modeling
  • 4 Speaker diarization
  • 5 Summary

7
Mapping Speech Segments into Segment Space: GMM
scoring approximation 1/4
Definitions
  • X: training session for target speaker
  • Y: test session
  • Q: GMM trained for X
  • P: GMM trained for Y
Goal: compute Pr(Y | Q) using GMMs P and Q only
  • Motivation
    • Efficient speaker recognition and indexing
    • More accurate modeling
8
Mapping Speech Segments into Segment Space: GMM
scoring approximation 2/4
Negative cross entropy
(1)
  • Approximating the cross entropy between two GMMs
    • Matching based lower bound [Aronowitz 2004]
    • Unscented-transform based approximation
      [Goldberger & Aronowitz 2005]
    • Other options in [Hershey 2007]
9
Mapping Speech Segments into Segment Space: GMM
scoring approximation 3/4
Matching based approximation
(2)
Assuming weights and covariance matrices are
speaker independent (+ some approximations)
(3)
Mapping T is induced
(4)
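The equations on this slide are missing from the transcript. The sketch below shows one common way such a mapping can arise, assuming diagonal covariances, shared weights, and a one-to-one matching between the Gaussians of P and Q: the approximated score becomes, up to a constant, a negative squared Euclidean distance between scaled mean vectors. The function names and the exact scaling are assumptions, not the author's definitions.

```python
import numpy as np

def supervector(means, weights, variances):
    """Stack GMM means into one vector, scaling Gaussian g by sqrt(w_g) / sigma_g.

    means:     (G, D) component means of a session GMM
    weights:   (G,)   shared (speaker-independent) mixture weights
    variances: (G, D) shared diagonal variances
    """
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(variances)
    return scaled.ravel()

def approx_score(means_p, means_q, weights, variances):
    """Approximate score (up to an additive constant) between GMMs P and Q."""
    diff = (supervector(means_p, weights, variances)
            - supervector(means_q, weights, variances))
    return -0.5 * float(diff @ diff)
```

With this scaling, the squared distance between two supervectors equals the weight-averaged Mahalanobis distance between matched Gaussian means, so standard vector-space tools can operate on whole sessions.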
10
Mapping Speech Segments into Segment Space: GMM
scoring approximation 4/4
Results
Figure and Table taken from H. Aronowitz, D.
Burshtein, "Efficient Speaker Recognition Using
Approximated Cross Entropy (ACE)", in IEEE
Trans. on Audio, Speech & Language Processing,
September 2007.
11
Other Mapping Techniques
  • Anchor modeling projection [Sturim 2001]
    • efficient but inaccurate
  • MLLR transforms [Stolcke 2005]
    • accurate but inefficient
  • Kernel-PCA-based mapping [Aronowitz 2007c]
    • Given a set of objects and a kernel function
      (a dot product between each pair of objects),
      finds a mapping of the objects into Rn
      which preserves the kernel function
    • accurate and efficient

12
Kernel-PCA Based Mapping
[Figure: K-PCA mapping. Labels: session space (x, y), anchor sessions, kernel-induced feature space (f(x), f(y)), common speaker subspace Rn (Tx, Ty), speaker unique subspace (ux, uy)]
13
Outline Intra-Class Variability Modeling for
Speech Processing
  • 1 Introduction to GMM based classification
  • 2 Mapping speech segments into segment space
  • 3 Intra-class variability modeling
  • 4 Speaker diarization
  • 5 Summary

14
Intra-Class Variability Modeling [Aronowitz
2005b]: Introduction
  • The classic GMM algorithm does not explicitly
    model intra-speaker inter-session variability
    • channel, noise
    • language
    • stress, emotion, aging
  • The frame independence assumption does not hold
    in these cases!

(1)
Instead, we can use a more relaxed assumption
(2)
which leads to
(3)
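The three numbered equations are missing from this transcript. A plausible reconstruction, assuming the two-stage generative picture in which a speaker first generates a session GMM Q and the session then generates its frames, is sketched below. The symbols are assumptions chosen to match the notation used elsewhere in the deck.

```latex
\begin{align}
% classic assumption: frames are independent given the speaker S
\Pr(Y \mid S) &= \prod_{t} \Pr(y_t \mid S) \\
% relaxed assumption: frames are independent given the session GMM Q
\Pr(Y \mid Q) &= \prod_{t} \Pr(y_t \mid Q) \\
% consequence: integrate over the session GMMs the speaker may generate
\Pr(Y \mid S) &= \int \Pr(Q \mid S)\,\Pr(Y \mid Q)\,dQ
\end{align}
```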
15
Old vs. New Generative Models
Old Model
  • a GMM
  • Frame sequence (generated independently)
New Model
  • a PDF over GMM space
  • Session GMM (a GMM)
  • Frame sequence (generated independently)
16
Session-GMM Space
[Figure: session-GMM space, showing the GMMs for session A and session B of speaker 1 and the regions of speakers 1, 2, and 3]
17
Modeling in Session-GMM Space 1/2
Recall mapping T induced by the GMM approximation
analysis
  • The image of a session GMM under T is called a
    supervector
  • A speaker is modeled by a multivariate normal
    distribution in supervector space

(3)
  • A typical dimension of the covariance matrix is
    50,000 × 50,000
  • The covariance matrix is estimated robustly using
    PCA + regularization: the covariance is assumed
    to be a low-rank matrix with an additional
    non-zero (noise) diagonal

18
Modeling in Session-GMM Space 2/2: Estimating
covariance matrix
[Figure: sessions of speakers 1, 2, and 3 shown in supervector space and in delta supervector space]
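A low-rank-plus-diagonal covariance estimate of this kind can be sketched as follows. The sketch assumes that delta supervectors are formed by subtracting each speaker's mean supervector and that the noise term is isotropic. Both are illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np

def delta_supervectors(supervectors, speaker_ids):
    """Subtract each speaker's mean supervector from that speaker's sessions."""
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)
    deltas = np.empty_like(X)
    for spk in np.unique(ids):
        mask = ids == spk
        deltas[mask] = X[mask] - X[mask].mean(axis=0)
    return deltas

def low_rank_plus_diagonal(deltas, rank, noise_floor=1e-6):
    """PCA of the deltas: keep `rank` directions and add an isotropic noise term."""
    n, dim = deltas.shape
    _, sing, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = sing ** 2 / n
    basis = vt[:rank].T                       # (dim, rank) principal directions
    kept = eigvals[:rank]                     # variance along each kept direction
    # Variance outside the kept directions, spread evenly over all dimensions.
    noise = max(eigvals[rank:].sum() / dim, noise_floor)
    return basis, kept, noise

def dense_covariance(basis, kept, noise):
    """Materialise the covariance; only practical for small dimensions."""
    dim = basis.shape[0]
    return (basis * kept) @ basis.T + noise * np.eye(dim)
```

At supervector dimensions in the tens of thousands the dense matrix would not be formed; the factored form (basis, kept, noise) is what one would keep and apply.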
19
Experimental Setup
Datasets
  • The intra-speaker covariance matrix is estimated
    from the NIST-2006-SRE corpus
  • Evaluation is done on the NIST-2004-SRE corpus

System description
  • ETSI MFCC (13-cep + 13-delta-cep)
  • Energy based voice activity detector
  • Feature warping
  • 2048 Gaussians
  • Target models are adapted from GI-UBM
  • ZT-norm score normalization

20
Results
38% reduction in EER
21
Other Modeling Techniques
  • NAP+SVMs [Campbell 2006]
  • Factor Analysis [Kenny 2005]
  • Kernel-PCA [Aronowitz 2007c]

Kernel-PCA based algorithm
  • Model each supervector as s + u
    • s ∈ S: common speaker subspace
    • u ∈ U: speaker unique subspace
  • S is spanned by a set of development
    supervectors (700 speakers)
  • U is the orthogonal complement of S in
    supervector space
  • Intra-speaker variability is modeled separately
    in S and in U
  • U was found to be more discriminative than S
  • EER was reduced by 44% compared to baseline GMM

22
Kernel-PCA Based Modeling
[Figure: K-PCA modeling. Labels: session space (x, y), anchor sessions, kernel-induced feature space (f(x), f(y)), common speaker subspace Rn (Tx, Ty), speaker unique subspace (ux, uy)]
23
Outline Intra-Class Variability Modeling for
Speech Processing
  • 1 Introduction to GMM based classification
  • 2 Mapping speech segments into segment space
  • 3 Intra-class variability modeling
  • 4 Speaker diarization
  • 5 Summary

24
Trainable Speaker Diarization [Aronowitz 2007d]
  • Goals
    • Detect speaker changes (speaker segmentation)
    • Cluster speaker segments (speaker clustering)
  • Motivation for new method
    • Current algorithms do not exploit available
      training data!
      (besides tuning thresholds, etc.)
  • Method
    • Explicitly model inter-segment intra-speaker
      variability from labeled training data, and use
      it in the metric used by change-detection /
      clustering algorithms.

25
Speaker recognition on pairs of 3s segments
  • Dev data
    • BNAD05 (5hr) - Arabic, broadcast news
  • Eval data
    • BNAT05 - Arabic, broadcast news (207 target
      models, 6756 test segments)

System                                                            EER (%)
Anchor modeling (baseline)                                        15.1
Anchor modeling - Kernel based scoring                            10.8
Kernel-PCA projection (CSS)                                       8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling  7.4
26
Speaker Diarization System: Experiments
  • Speaker change detection
    • 2 adjacent sliding windows (3s each)
    • Speaker verification scoring + normalization
  • Speaker clustering
    • Speaker verification scoring + normalization
    • Bottom-up clustering
  • Speaker Error Rate (SER) on BNAT05
    • Anchor modeling (baseline): 12.9
    • Kernel-PCA based method: 7.9

27
Outline Intra-Class Variability Modeling for
Speech Processing
  • 1 Introduction to GMM based classification
  • 2 Mapping speech segments into segment space
  • 3 Intra-class variability modeling
  • 4 Speaker diarization
  • 5 Summary

28
Summary 1/2
  • A method for mapping speech segments into a GMM
    supervector space was described
  • Intra-speaker inter-session variability is
    modeled in GMM supervector space
  • Speaker recognition
    • EER was reduced by 38% on the NIST-2004 SRE
    • A corresponding kernel-PCA based approach reduces
      EER by 44%
  • Speaker diarization
    • SER for speaker diarization was reduced by 39%.

29
Summary 2/2: Algorithms based on the proposed
framework
  • Speaker recognition [Aronowitz 2005b], [Aronowitz
    2007c]
  • Speaker diarization (who spoke when) [Aronowitz
    2007d]
  • VAD (voice activity detection) [Aronowitz 2007a]
  • Language identification [Noor & Aronowitz 2006]
  • Gender identification [Bocklet 2008]
  • Age detection [Bocklet 2008]
  • Channel/bandwidth classification [Aronowitz 2007d]

30
Bibliography 1/2
[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, 91-108, 1995.
[2] D.E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP, 2001.
[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004.
[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005.
[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP, 2005.
[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech, 2005.
[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005.
[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.
31
Bibliography 2/2
[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, "Efficient language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006.
[11] W.M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP, 2006.
[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP, 2007.
[13] J.R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech, 2007.
[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech, 2007.
[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP, 2008.
32
Thanks!
  • Presentation is available online at
    http://aronowitzh.googlepages.com/

33
  • Backup slides

34
Kernel-PCA Based Mapping 2/5
[Figure: sessions x and y and the anchor sessions in session space are mapped by f() to f(x) and f(y) in the dot-product feature space via the kernel trick]
Goals
  • Map sessions into feature space
  • Model in feature space
35
Kernel-PCA Based Mapping 3/5
  • Given
    • kernel K
    • n anchor sessions
  • Find an orthonormal basis for the subspace spanned
    by the anchor sessions in feature space
  • Method
    • Compute eigenvectors of the centralized
      kernel-matrix k_{i,j} = K(A_i, A_j).
    • Normalize eigenvectors by square-roots of
      corresponding eigenvalues → v_i
    • The normalized vectors v_i define the
      requested basis
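The method on this slide follows the usual kernel-PCA recipe, which can be sketched as below. This is a generic illustration, not the author's implementation: the function name, the tolerance for discarding near-zero eigenvalues, and the choice to return the anchor coordinates are all assumptions.

```python
import numpy as np

def kpca_anchor_coordinates(K, tol=1e-10):
    """Map n anchor sessions to real coordinates from their kernel matrix.

    K: (n, n) symmetric kernel matrix with K[i, j] = K(A_i, A_j).
    Returns (coords, alphas): coords is (n, r), alphas is (n, r).
    """
    n = K.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    K_centered = centering @ K @ centering          # centralized kernel matrix

    eigvals, eigvecs = np.linalg.eigh(K_centered)
    keep = eigvals > tol * eigvals.max()            # drop null directions

    # Dividing by sqrt(eigenvalue) makes the implied feature-space basis orthonormal.
    alphas = eigvecs[:, keep] / np.sqrt(eigvals[keep])

    # Coordinates of the anchors themselves in that basis.
    coords = K_centered @ alphas
    return coords, alphas
```

The mapping preserves the kernel in the sense that dot products between the returned coordinates reproduce the centralized kernel matrix, so `coords @ coords.T` matches `K_centered` up to numerical error.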

36
Kernel-PCA Based Mapping 4/5
Common speaker subspace - Speaker unique
subspace -
  • Given sessions x, y, f(x) and f(y) may be
    uniquely represented as

T is a mapping x → Rn with the property
37
Kernel-PCA Based Mapping 5/5
[Figure: K-PCA mapping. Labels: session space (x, y), anchor sessions, feature space (f(x), f(y)), common speaker subspace Rn (Tx, Ty), speaker unique subspace (ux, uy)]
38
Modeling in Segment-GMM Supervector Space
[Figure: frame sequences of segments 1, 2, ..., n mapped into segment-GMM supervector space, with regions for speech, silence, and music]
39
Segmental Modeling for Audio Segmentation
  • Goal
    • Segment audio accurately and robustly into
      speech / silence / music segments.
  • Novel idea
    • Acoustic modeling is usually done on a
      frame-basis.
    • Segmentation/classification is usually done on a
      segment-basis (using smoothing).
    • Why not explicitly model whole segments?
    • Note: speaker, noise, music-context, channel
      (etc.) are constant during a segment.

40
Speech / Silence Segmentation Results 1/2
System            EER    FA @ FR=0.5   FR @ FA=1
EVAL06            FA=24.2 @ FR=0.25
GMM baseline      2.9    7.9           29.6
Segmental         1.7    5.1           2.7
Error reduction   41%    35%           91%
41
Speech / Silence Segmentation Results 2/2
System            EER    FA @ FR=0.5   FR @ FA=1
EVAL06            FA=69 @ FR=0.25
GMM baseline      1.43   3.4           3.2
Segmental         1.27   2.0           1.9
Error reduction   11%    41%           41%
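The error-reduction rows in the two segmentation tables are consistent with relative reductions with respect to the GMM baseline, rounded to whole percents. A small check, with a hypothetical helper name:

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, rounded to the nearest integer."""
    return round(100.0 * (baseline - improved) / baseline)

# (baseline, segmental) pairs copied from the two results tables.
results_1 = [(2.9, 1.7), (7.9, 5.1), (29.6, 2.7)]
results_2 = [(1.43, 1.27), (3.4, 2.0), (3.2, 1.9)]

reductions_1 = [relative_reduction(b, s) for b, s in results_1]
reductions_2 = [relative_reduction(b, s) for b, s in results_2]
print(reductions_1)  # [41, 35, 91]
print(reductions_2)  # [11, 41, 41]
```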
42
LID in Session Space
[Figure: training sessions for English, French, and Arabic and a test session, shown in session space]
43
LID in Session Space - Algorithm
  • Front end: shifted delta cepstrum (SDC).
  • Represent every train/test session by a GMM
    super-vector.
  • Train a linear SVM to classify GMM super-vectors.
  • Results
    • EER = 4.1% on the NIST-03 Eval (30 sec sessions).

44
Anchor Modeling Projection
Given anchor models λ1,...,λn and session X =
x1,...,xF
Projection: average normalized log-likelihood
  • Speaker indexing [Sturim et al., 2001]
  • Intersession variability modeling in projected
    space [Collet et al., 2005]
  • Speaker clustering [Reynolds et al., 2004]
  • Speaker segmentation [Collet et al., 2006]
  • Language identification [Noor and Aronowitz, 2006]
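The projection formula itself is missing from the transcript. A minimal sketch of an anchor-model projection is given below, assuming diagonal-covariance GMM anchors and normalization by subtracting the average log-likelihood under a background model. That normalization and the function names are assumptions.

```python
import numpy as np

def avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (F, D) under a diagonal GMM."""
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_mix = np.log(weights) + log_gauss                 # (F, G)
    peak = log_mix.max(axis=1, keepdims=True)             # for a stable log-sum-exp
    frame_ll = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return float(frame_ll.mean())

def anchor_projection(frames, anchors, background):
    """Project a session onto n anchor models.

    anchors:    list of (weights, means, variances) tuples, one per anchor model
    background: a (weights, means, variances) tuple used for normalization
    Returns an (n,) vector of normalized average log-likelihoods.
    """
    base = avg_log_likelihood(frames, *background)
    return np.array([avg_log_likelihood(frames, *a) - base for a in anchors])
```

Each session, whatever its length, becomes a fixed-size vector with one coordinate per anchor model, which is what allows indexing and clustering to work in the projected space.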

45
Intra-Class Variability Modeling: Introduction
  • The classic GMM algorithm does not explicitly
    model intra-speaker inter-session variability
    • Noise
    • Channel
    • Language
    • Changing speaker characteristics: stress,
      emotion, aging
  • The frame independence assumption does not hold
    in these cases!

(1)
Instead, we get
(2)