Title: Acoustic Vector Re-sampling for GMMSVM-Based Speaker Verification
1Acoustic Vector Re-sampling for GMMSVM-Based
Speaker Verification
- Man-Wai MAK and Wei RAO
- The Hong Kong Polytechnic University
- enmwmak_at_polyu.edu.hk
- http://www.eie.polyu.edu.hk/mwmak/
2Outline
- GMM-UBM for Speaker Verification
- GMM-SVM for Speaker Verification
- Data-Imbalance Problem in GMM-SVM
- Utterance Partitioning for GMM-SVM
- Experiments on NIST SRE
3Speaker Verification
- To verify the identity of a claimant based on
his/her own voice
"Is this Mary's voice?"
"I am Mary"
4Verification Process
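[Figure: verification flow for the claim "I'm John". Feature extraction feeds John's model (built from John's voiceprint) and the impostor model (built from impostors' voiceprints); the impostor-model score is subtracted from John's score, and score normalization and decision making compare the result with a decision threshold to accept or reject the claim.]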
5Acoustic Features
- Speech is a continuous evolution of the vocal tract
- Need to extract a sequence of spectra or a sequence of spectral coefficients
- Use a sliding window: 25 ms window, 10 ms shift
[Figure: MFCC extraction chain: Log|X(ω)| → DCT → MFCC]
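A minimal Python sketch of the sliding-window step, assuming an 8 kHz sampling rate (an assumption; the slides give only the 25 ms window and 10 ms shift). The function name `frame_signal` is hypothetical, and the spectral analysis that turns each frame into MFCCs is omitted.

```python
def frame_signal(samples, sample_rate, win_ms=25.0, shift_ms=10.0):
    """Split a list of samples into overlapping fixed-length frames."""
    win = int(round(sample_rate * win_ms / 1000.0))      # samples per window
    shift = int(round(sample_rate * shift_ms / 1000.0))  # samples per shift
    frames = []
    start = 0
    # Keep only complete windows; a trailing partial window is dropped.
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += shift
    return frames


signal = [0.0] * 8000            # one second of audio at the assumed 8 kHz
frames = frame_signal(signal, 8000)
print(len(frames), len(frames[0]))
```

Each frame would then yield one acoustic vector, so an utterance becomes a sequence of vectors at a rate of roughly 100 per second.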
6GMM-UBM for Speaker Verification
- The acoustic vectors (MFCCs) of speaker s are modeled by a probability density function, a Gaussian mixture model (GMM), whose parameters (mixture weights, mean vectors, and covariance matrices) are specific to speaker s
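For reference, a GMM density in its standard form is sketched below. The symbol names are assumptions rather than the slides' own notation: M mixtures, weights λ_i, mean vectors μ_i, covariance matrices Σ_i, and Λ^(s) for the full parameter set of speaker s.

```latex
\Lambda^{(s)} = \left\{ \lambda_i^{(s)}, \boldsymbol{\mu}_i^{(s)}, \boldsymbol{\Sigma}_i^{(s)} \right\}_{i=1}^{M},
\qquad
p\left(\mathbf{x} \mid \Lambda^{(s)}\right)
  = \sum_{i=1}^{M} \lambda_i^{(s)} \,
    \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_i^{(s)}, \boldsymbol{\Sigma}_i^{(s)}\right),
\qquad
\sum_{i=1}^{M} \lambda_i^{(s)} = 1
```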
7GMM-UBM for Speaker Verification
- The acoustic vectors of a general population are
modeled by another GMM called the universal
background model (UBM)
8GMM-UBM for Speaker Verification
[Figure: the enrollment utterance X(s) of the client speaker and the universal background model are combined by MAP adaptation to produce the client speaker model]
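The MAP step can be sketched as follows, assuming mean-only relevance MAP in the style of Reynolds et al. (2000) with diagonal covariances. The function names and the toy one-dimensional UBM are hypothetical; the default relevance factor of 16 matches the value listed in the experimental setup.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture given frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)                      # subtract the max for stability
    exps = [math.exp(l - peak) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Adapt the UBM means towards the data; weights and variances stay fixed."""
    n_mix, dim = len(weights), len(means[0])
    counts = [0.0] * n_mix                     # soft counts n_i
    sums = [[0.0] * dim for _ in range(n_mix)]
    for x in frames:
        for i, p in enumerate(posteriors(x, weights, means, variances)):
            counts[i] += p
            for d in range(dim):
                sums[i][d] += p * x[d]
    adapted = []
    for i in range(n_mix):
        alpha = counts[i] / (counts[i] + relevance)   # adaptation coefficient
        data_mean = ([s / counts[i] for s in sums[i]]
                     if counts[i] > 0.0 else means[i])
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(data_mean, means[i])])
    return adapted


# Toy example: a 2-mixture, 1-D UBM and 16 identical enrollment frames.
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [10.0]]
ubm_vars = [[1.0], [1.0]]
frames = [[1.0]] * 16
client_means = map_adapt_means(frames, ubm_weights, ubm_means, ubm_vars)
print(client_means)
```

In the toy example, the mixture that explains the data moves halfway towards the data mean, while the mixture that sees almost no data stays at its UBM value.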
9GMM-UBM Scoring
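- 2-class hypothesis problem
  - H0: MFCC sequence X(c) comes from the true speaker
  - H1: MFCC sequence X(c) comes from an impostor
- Verification score is a likelihood ratio
[Figure: feature extraction feeds the speaker model and the background model; the background-model score is subtracted from the speaker-model score, and the difference is used for the decision]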
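In the log domain, the likelihood ratio becomes a difference of log-likelihoods. The sketch below uses assumed notation: Λ^(s) for the client speaker model, Λ^(ubm) for the background model, and θ for the decision threshold.

```latex
S\left(X^{(c)}\right)
  = \log p\left(X^{(c)} \mid \Lambda^{(s)}\right)
  - \log p\left(X^{(c)} \mid \Lambda^{(\mathrm{ubm})}\right),
\qquad
\text{accept } H_0 \text{ if } S\left(X^{(c)}\right) > \theta,
\text{ otherwise accept } H_1
```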
11GMM-SVM for Speaker Verification
[Figure: Feature Extraction → MAP Adaptation (from the UBM) → Mean Stacking → GMM supervector; the whole chain maps an utterance to a supervector]
12GMM-SVM Scoring
[Figure: GMM-SVM scoring. Feature extraction and the UBM are used to compute the GMM-supervector of target speaker s, the GMM-supervectors of the background speakers, and the GMM-supervector of claimant c; all of them feed SVM scoring.]
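In generic SVM terms, the score of a claimant's supervector is a kernel expansion over the training supervectors. The symbols below are assumptions, not the slides' notation: α_i are the Lagrange multipliers, y_i is +1 for the target speaker's supervector and -1 for background supervectors, b is the bias, and x_c and x_i are the supervectors of the claimant and of training utterance i.

```latex
f(\mathbf{x}_c) = \sum_{i \in \mathcal{S}} \alpha_i \, y_i \, K(\mathbf{x}_c, \mathbf{x}_i) + b
```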
13GMM-UBM Scoring Vs. GMM-SVM Scoring
GMM-UBM
GMM-SVM
Normalized GMM-supervector of the claimant's
utterance
Normalized GMM-supervector of the target speaker's
utterance
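A minimal sketch of one common normalization, assuming the GMM-supervector kernel of Campbell et al. (2006) with diagonal covariances: each adapted mean is scaled by the square root of its mixture weight and by the inverse standard deviation, then the scaled means are stacked. The function names and toy values are hypothetical.

```python
import math


def normalized_supervector(weights, means, variances):
    """Stack sqrt(weight) * mean / sqrt(variance) over all mixtures."""
    sv = []
    for w, mean, var in zip(weights, means, variances):
        sv.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mean, var))
    return sv


def linear_kernel(a, b):
    """Inner product of two normalized supervectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-mixture, 2-D model (weights and variances would come from the UBM).
weights = [0.25, 0.75]
means = [[2.0, 4.0], [0.0, 1.0]]
variances = [[4.0, 1.0], [1.0, 1.0]]
sv = normalized_supervector(weights, means, variances)
print(len(sv), linear_kernel(sv, sv))
```

With this scaling, the kernel between two utterances reduces to a plain inner product of their normalized supervectors.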
15Data Imbalance in GMM-SVM
- For each target speaker, we only have one utterance (GMM-supervector) from the target speaker and many utterances from the background speakers.
- So, we have a highly imbalanced learning problem.
Only one training vector from the target speaker
16Data Imbalance in GMM-SVM
Orientation of the decision boundary depends
mainly on impostor-class data
17Data Imbalance in GMM-SVM
A 3-dimensional two-class example illustrating that the SVM decision plane is largely governed by the impostor-class supervectors.
[Figure labels: Impostor Class; Speaker Class; region in which the target-speaker vector can be located without changing the orientation of the decision plane]
19Utterance Partitioning
- Partition an enrollment utterance of a target
speaker into a number of sub-utterances, with each
sub-utterance producing one GMM-supervector.
20Utterance Partitioning
[Figure: the target speaker's enrollment utterance and the background speakers' utterances each pass through feature extraction, then MAP adaptation (from the UBM) and mean stacking; the resulting supervectors are used in SVM training to give the SVM of target speaker s]
21Length-Representation Trade-off
- When the number of partitions increases, the length of each sub-utterance decreases.
- If the utterance length is too short, the supervectors of the sub-utterances will be almost the same as that of the UBM.
[Figure label: supervector corresponding to the UBM]
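One way to see the trade-off, assuming relevance MAP with adaptation coefficient α = n / (n + r) as in Reynolds et al. (2000): splitting an utterance into N parts divides each mixture's soft count n by roughly N, so α shrinks and the adapted means stay closer to the UBM means. The soft count of 64 below is a made-up illustrative value, and the function name is hypothetical.

```python
def adaptation_coefficient(soft_count, relevance=16.0):
    """Weight given to the data mean in relevance-MAP adaptation."""
    return soft_count / (soft_count + relevance)


full_count = 64.0   # hypothetical soft count of one mixture on the full utterance
alphas = {}
for n_partitions in (1, 4, 16):
    alphas[n_partitions] = adaptation_coefficient(full_count / n_partitions)
    print(n_partitions, alphas[n_partitions])
```

An α near 0 means the sub-utterance supervector is almost the UBM supervector, which is the failure mode described above.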
22Utterance Partitioning with Acoustic Vector
Resampling (UP-AVR)
Goal: Increase the number of sub-utterances without compromising their representation power
Procedure of UP-AVR
- 1. Randomly rearrange the sequence of acoustic vectors in an utterance
- 2. Partition the acoustic vectors of the utterance into N segments
- 3. If Step 1 and Step 2 are repeated R times, we obtain RN + 1 target-speaker supervectors (RN from the sub-utterances plus one from the full utterance)
[Figure: MFCC sequence before randomization and after randomization]
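The procedure above can be sketched in Python as follows. The function name `up_avr`, the fixed random seed, and the choice to give leftover frames to the last segment are assumptions made for illustration; each returned group of frames would then go through MAP adaptation and mean stacking to produce one supervector.

```python
import random


def up_avr(frames, n_partitions, n_resamplings, seed=0):
    """Return the frame groups for UP-AVR: full utterance plus R*N segments."""
    rng = random.Random(seed)
    groups = [list(frames)]                  # the full utterance itself
    for _ in range(n_resamplings):
        shuffled = list(frames)
        rng.shuffle(shuffled)                # Step 1: randomize frame order
        size = len(shuffled) // n_partitions
        for k in range(n_partitions):        # Step 2: cut into N segments
            start = k * size
            # The last segment absorbs any leftover frames.
            end = start + size if k < n_partitions - 1 else len(shuffled)
            groups.append(shuffled[start:end])
    return groups


# Toy utterance of 10 one-dimensional acoustic vectors, N = 4, R = 5.
utterance = [[float(t)] for t in range(10)]
groups = up_avr(utterance, n_partitions=4, n_resamplings=5)
print(len(groups))
```

Because every segment is drawn from the whole utterance after shuffling, each one samples frames from all parts of the recording rather than from one contiguous stretch.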
23Utterance Partitioning with Acoustic Vector
Resampling (UP-AVR)
[Figure: UP-AVR training flow. The target speaker's enrollment utterance and the background speakers' utterances each go through feature extraction and index randomization, then MAP adaptation (from the UBM) and mean stacking, then SVM training, which yields the SVM of target speaker s.]
24Utterance Partitioning with Acoustic Vector
Resampling (UP-AVR)
- Characteristics of supervectors created by UP-AVR
  - Average pairwise distance between sub-utt SVs is larger than the average pairwise distance between sub-utt SVs and the full-utt SV.
  - Average pairwise distance between the speaker class's sub-utt SVs and the impostor class's SVs is smaller than the average pairwise distance between the speaker class's full-utt SV and the impostor class's SVs.
[Figure legend: impostor class; speaker class; sub-utt supervector; full-utt supervector]
25Nuisance Attribute Projection
Nuisance Attribute Projection (NAP) [Solomonoff et
al., ICASSP 2005]
Goal: To reduce the effect of session variability
Starting point: the GMM-supervector kernel
Each supervector has a speaker-dependent part and a
session-dependent part (h)
Remove the session-dependent part (h) by removing
the sub-space that causes the session variability
Sub-space representing session variability: defined
by V
The new kernel is evaluated on supervectors with this sub-space removed
26Nuisance Attribute Projection
Nuisance Attribute Projection (NAP) [Solomonoff et
al., ICASSP 2005]
Sub-space representing session variability: defined
by V
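A minimal sketch of the projection, assuming the commonly used form P = I - V Vᵀ in which the columns of V are orthonormal. Both the projection form and the function name `nap_project` are assumptions made for illustration; estimating V from multi-session data is not shown.

```python
def nap_project(supervector, basis):
    """Apply (I - V V^T) to a supervector; basis holds orthonormal columns of V."""
    projected = list(supervector)
    for v in basis:
        # Component of the original supervector along this nuisance direction.
        coeff = sum(a * b for a, b in zip(v, supervector))
        projected = [p - coeff * a for p, a in zip(projected, v)]
    return projected


# Toy example: 3-D supervector, one nuisance direction along the first axis.
sv = [3.0, 4.0, 5.0]
V = [[1.0, 0.0, 0.0]]
clean = nap_project(sv, V)
print(clean)
```

The projected supervector has no component left along any direction in V, so a linear kernel computed on projected supervectors ignores variation inside that sub-space.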
27Enrollment Process of GMM-SVM with UP-AVR
[Figure: enrollment flow. MFCCs of an utterance from target-speaker s → Resampling/Partitioning → MAP and Mean Stacking (using the UBM) → session-dependent supervectors → NAP → session-independent supervectors → SVM Training → SVM of target-speaker s]
28Verification Process of GMM-SVM with UP-AVR
[Figure: verification flow. MFCCs of a test utterance from claimant c → MAP and Mean Stacking (using the UBM) → session-dependent supervector → NAP → session-independent supervector → SVM Scoring (with the SVM of target-speaker s) → score → T-Norm (using the Tnorm models) → normalized score]
29T-Norm (Auckenthaler, 2000)
Goal: To shift and scale the verification scores
so that a global decision threshold can be used
for all speakers
[Figure: the supervector from the test utterance is scored by T-Norm SVM 1 through T-Norm SVM R (SVM Scoring); the mean and standard deviation of these scores are computed and used in a Z-norm step]
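The normalization step can be sketched as follows, assuming the usual T-norm form: subtract the mean of the cohort scores and divide by their standard deviation. The function name is hypothetical, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics


def t_norm(raw_score, cohort_scores):
    """Shift and scale a raw score using the T-norm cohort's score statistics."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.pstdev(cohort_scores)   # population standard deviation
    return (raw_score - mu) / sigma


# Toy example: scores of one test utterance against five T-norm SVMs.
cohort = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = t_norm(5.0, cohort)
print(normalized)
```

Because the cohort statistics are recomputed for every test utterance, the normalized scores of different speakers and utterances become comparable against a single threshold.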
31Experiments
Speech Data
- Evaluations on NIST SRE 2002 and 2004
- NIST SRE 2002
  - Use NIST01 for computing the UBMs, impostor-class supervectors of SVMs, Tnorm models, and NAP parameters
  - 2983 true-speaker trials and 36287 impostor attempts
  - 2-min utterances for training and about 1-min utterances for test
- NIST SRE 2004
  - Use the Fisher corpus for computing UBMs, impostor-class supervectors of SVMs, and Tnorm models
  - NIST99 and NIST00 for computing NAP parameters
  - 2386 true-speaker trials and 23838 impostor attempts
  - 5-min utterances for training and testing
32Experiments
Features and Models
- 12 MFCCs + 12 ΔMFCCs with feature warping
- 1024-mixture GMMs for GMM-UBM
- 256-mixture GMMs for GMM-SVM
- MAP relevance factor: 16
- 300 impostor-class supervectors for GMM-SVM
- 200 T-norm models
- 64-dim session variability subspace (NAP corank,
rank of V)
33Results
No. of mixtures in GMM-SVM (NIST02)
Threshold below which the variances of features
are deemed too small
Normalized
Large number of features with small variance
34Results
Effects of NAP on Different NIST SRE
Large eigenvalues mean large session variation
35Results
Effect of NAP Corank on Performance
No NAP
36Results
Comparing discriminative power of GMM-SVM and
GMM-SVM with UP-AVR
37Results
EER and MinDCF vs. No. of Target-Speaker
Supervectors
NIST02
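For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-acceptance rate equals the false-rejection rate. The sketch below is a simple threshold sweep over made-up scores; the function name and the tie-breaking rule are assumptions, and evaluation toolkits typically interpolate more carefully.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Made-up scores for illustration only.
targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
print(eer)
```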
38Results
Varying the number of resamplings (R) and the number
of partitions (N)
NIST02
39Results
NIST02
40Experiments and Results
Performance on NIST02
EER = 9.39%
EER = 9.05%
EER = 8.16%
41Experiments and Results
Performance on NIST04
GMM-UBM: EER = 16.05%
GMM-SVM: EER = 10.42%
GMM-SVM w/ UP-AVR: EER = 9.46%
42References
- S.X. Zhang and M.W. Mak, "Optimized Discriminative Kernel for SVM Scoring and its Application to Speaker Verification", IEEE Trans. on Neural Networks, to appear.
- M.W. Mak and W. Rao, "Utterance Partitioning with Acoustic Vector Resampling for GMM-SVM Speaker Verification", Speech Communication, vol. 53 (1), Jan. 2011, pp. 119-130.
- M.W. Mak and W. Rao, "Acoustic Vector Resampling for GMMSVM-Based Speaker Verification", Interspeech 2010, Sept. 2010, Makuhari, Japan, pp. 1449-1452.
- S.Y. Kung, M.W. Mak, and S.H. Lin, Biometric Authentication: A Machine Learning Approach, Prentice Hall, 2005.
- W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification", IEEE Signal Processing Letters, vol. 13, pp. 308-311, 2006.
- D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, 2000.