Title: Phoneme-Based Speaker Verification using Adapted Gaussian Mixture Models
MSc. Thesis by Yossi Bar-Yosef
Supervised by Prof. Yuval Bistritz
Outline
- Speaker Verification - Introduction
- GMM and Adapted GMM
- Phoneme-Based SV Configurations
- Results
- Phoneme Selection
- Conclusion
Introduction to SV
[Figure: enrollment speech for each speaker is used to produce a trained model for each speaker]
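Introduction to SV
Likelihood Ratio Test (LRT)
$$\Lambda(X) = \frac{p(X \mid H_0)}{p(X \mid H_1)} \begin{cases} \ge \theta, & \text{accept } H_0 \\ < \theta, & \text{reject } H_0 \end{cases}$$
$H_0$: X is from the claimed speaker.
$H_1$: X is not from the claimed speaker.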
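In practice the LRT is usually evaluated in the log domain, where the ratio becomes a difference of log-likelihoods compared against log θ. A minimal sketch, assuming the two log-likelihoods have already been computed (for example by a speaker model and a background model); the function name `lrt_decision` and the example numbers are hypothetical.

```python
def lrt_decision(log_p_h0: float, log_p_h1: float, log_theta: float) -> bool:
    """Return True (accept H0: X is from the claimed speaker) when
    log p(X|H0) - log p(X|H1) >= log(theta)."""
    llr = log_p_h0 - log_p_h1  # log-likelihood ratio, log Lambda(X)
    return llr >= log_theta


# Hypothetical log-likelihoods, threshold theta = 1 (log theta = 0)
print(lrt_decision(-1200.0, -1215.0, log_theta=0.0))  # True
print(lrt_decision(-1200.0, -1190.0, log_theta=0.0))  # False
```

Working with log-likelihoods avoids underflow, since likelihoods of long feature sequences are extremely small numbers.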
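Introduction to SV
- Receiver Operating Characteristic (ROC) curve
[Figure: ROC curve of False Acceptance rate vs. False Rejection rate; the Equal Error Rate (EER) is the operating point at which the two rates are equal]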
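The EER can be estimated from a set of scored trials by sweeping the decision threshold and finding where the false acceptance and false rejection rates meet. A rough sketch, assuming higher scores favour the claimed speaker and a trial is accepted when score >= threshold; `compute_eer` is a hypothetical helper, and real evaluations often interpolate between thresholds instead of picking the closest one.

```python
import numpy as np


def compute_eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Approximate the EER by trying every observed score as a threshold."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False acceptance: impostor trials accepted; false rejection: genuine trials rejected
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # threshold where the two rates are closest
    return float((far[idx] + frr[idx]) / 2)


genuine = np.array([2.0, 3.0, 4.0, 5.0])   # made-up scores of true-speaker trials
impostor = np.array([0.0, 1.0, 2.5, 3.5])  # made-up scores of impostor trials
print(compute_eer(genuine, impostor))      # 0.25
```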
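GMM
- Gaussian Mixture Models (GMM)
- For x, a D-dimensional feature vector, a GMM is a weighted sum of M Gaussian densities:
$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, g_i(x), \qquad \sum_{i=1}^{M} w_i = 1,$$
$$g_i(x) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma_i \rvert^{1/2}} \exp\!\left(-\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i)\right),$$
where $\lambda = \{w_i, \mu_i, \Sigma_i\}_{i=1}^{M}$.
- The log-likelihood of a sequence of T feature vectors, $X = \{x_1, \dots, x_T\}$, is
$$\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda).$$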
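The log-likelihood of a GMM over T feature vectors, the sum over frames of log Σ_i w_i g_i(x_t), can be sketched in NumPy. This assumes diagonal covariance matrices (a common choice in speaker verification, not stated on the slide) and uses the log-sum-exp trick to avoid underflow; the function name and array layout are illustrative.

```python
import numpy as np


def gmm_log_likelihood(X, weights, means, variances):
    """Sum over frames of log p(x_t | lambda) for a diagonal-covariance GMM.

    X: (T, D) feature vectors; weights: (M,); means, variances: (M, D).
    """
    X = np.atleast_2d(X)
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]  # (T, M, D)
    # log g_i(x_t) for every frame and mixture component, shape (T, M)
    log_g = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_wg = np.log(weights)[None, :] + log_g
    # log-sum-exp over the M components for numerical stability
    m = log_wg.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_wg - m).sum(axis=1))
    return float(frame_ll.sum())


w = np.array([0.5, 0.5])
mu = np.array([[0.0], [0.0]])
var = np.ones((2, 1))
print(gmm_log_likelihood(np.zeros((2, 1)), w, mu, var))  # about -1.8379
```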
GMM
- Expectation-Maximization (EM)
- An iterative algorithm that refines the model parameters to increase the likelihood of the training data
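One EM iteration for a GMM alternates an expectation step (posterior probability of each component for each frame) and a maximization step (re-estimating weights, means and variances). A minimal sketch, assuming diagonal covariances; the variance floor `var_floor` and the guard against empty components are illustrative safeguards, not details taken from the thesis.

```python
import numpy as np


def em_step(X, weights, means, variances, var_floor=1e-6):
    """One EM iteration for a diagonal-covariance GMM; returns updated parameters."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]
    log_g = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_wg = np.log(weights)[None, :] + log_g
    # E-step: posteriors Pr(i | x_t), shape (T, M), normalised per frame
    post = np.exp(log_wg - log_wg.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n = post.sum(axis=0)                  # soft counts n_i
    n_safe = np.maximum(n, 1e-10)         # avoid division by zero for empty components
    # M-step: re-estimate weights, means and variances
    new_weights = n / T
    new_means = (post.T @ X) / n_safe[:, None]
    new_vars = (post.T @ X ** 2) / n_safe[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)  # variance flooring
    return new_weights, new_means, new_vars
```

Typical use repeats `em_step` until the log-likelihood stops improving noticeably or a fixed number of iterations is reached.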
Adapted GMM
- Adjusts the current model parameters to new data.
- The expectation step is identical to that of EM.
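Adapted GMM
- The adaptation coefficients depend on the data:
$$\alpha_i = \frac{n_i}{n_i + r}, \qquad n_i = \sum_{t=1}^{T} \Pr(i \mid x_t),$$
where r is the relevance factor.
[Figure: a GMM combined with new training data yields the adapted GMM]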
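Adaptation can be sketched for the mixture means in the style of MAP adaptation of a background GMM: the E-step statistics are computed exactly as in EM, and each mean moves toward the new data in proportion to the coefficient α_i = n_i / (n_i + r), where n_i is the soft count of frames assigned to component i and r is the relevance factor. This sketch adapts only the means and assumes diagonal covariances; the default r = 16 and the name `map_adapt_means` are assumptions, and the thesis may also adapt weights or variances.

```python
import numpy as np


def map_adapt_means(X, weights, means, variances, r=16.0):
    """Adapt the means of a diagonal-covariance GMM to new data X (T, D)."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]
    log_g = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(variances), axis=1)[None, :]
                    + np.sum(diff ** 2 / variances[None, :, :], axis=2))
    log_wg = np.log(weights)[None, :] + log_g
    # Expectation step, identical to EM: posteriors Pr(i | x_t)
    post = np.exp(log_wg - log_wg.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    n = post.sum(axis=0)                                # soft counts n_i
    Ex = (post.T @ X) / np.maximum(n, 1e-10)[:, None]   # E_i(x), first-order statistics
    alpha = n / (n + r)                                 # data-dependent adaptation coefficients
    # Components that saw much data move toward E_i(x); unseen ones keep their prior mean
    return alpha[:, None] * Ex + (1.0 - alpha[:, None]) * means
```

The relevance factor controls how much data a component needs before its mean departs noticeably from the original model.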
Phoneme-Based SV Configurations
[Figure: speech segmented into phonemes, e.g. /r/, /ah/, /n/]
Phoneme-Based SV Configurations
[Figure: model diagrams of configurations Cfg1 to Cfg5, built from UBM, SM, PDUBM and PDSM models]
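The configuration diagrams themselves are not recoverable from the text, but the general idea of scoring with phoneme-dependent model pairs can be illustrated. A hypothetical sketch, assuming every test frame carries a phoneme label and each phoneme has its own PDSM/PDUBM pair exposed as a function returning per-frame log-likelihoods; `pd_llr_score` is an illustrative name, and this is not a reconstruction of any of Cfg1 to Cfg5.

```python
from typing import Callable, Dict, Sequence

import numpy as np

# Hypothetical model interface: (T, D) frames -> (T,) per-frame log-likelihoods
FrameLogLik = Callable[[np.ndarray], np.ndarray]


def pd_llr_score(X: np.ndarray, labels: Sequence[str],
                 pdsm: Dict[str, FrameLogLik],
                 pdubm: Dict[str, FrameLogLik]) -> float:
    """Average per-frame log-likelihood ratio, each frame scored by the
    model pair of its own phoneme."""
    labels = np.asarray(labels)
    total, count = 0.0, 0
    for ph in np.unique(labels):
        if ph not in pdsm or ph not in pdubm:
            continue  # phonemes without a model pair (e.g. not selected) are skipped
        Xp = X[labels == ph]
        total += float(np.sum(pdsm[ph](Xp) - pdubm[ph](Xp)))
        count += len(Xp)
    return total / count if count else 0.0
```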
Results
[Figure: UBM-SM and World Bkg configurations (UBM, SM); results for Clean Speech and Telephone Speech]
Results
[Figure: UBM, SM, PDUBM, PDSM; results for Clean Speech and Telephone Speech]
Results
[Figure: UBM, SM, PDUBM, PDSM; results for Clean Speech and Telephone Speech]
Results
[Figure: PDUBM, PDSM; results for Clean Speech and Telephone Speech]
Results
[Figure: UBM, SM, PDUBM, PDSM; results for Clean Speech and Telephone Speech]
Results
[Figure: UBM, SM, PDUBM, PDSM; results for Clean Speech and Telephone Speech]
Results
- Configurations Comparison
Clean Speech
20 sec training speech
Results
- Configurations Comparison
Telephone Speech
20 sec training speech
Results
- Configurations Comparison
Clean Speech
10 sec training speech
Results
- Configurations Comparison
Telephone Speech
10 sec training speech
Results
- Configurations Comparison
Results
- Configurations Comparison
Phoneme Selection
- Clean speech
- 20 sec training
Phoneme Selection
- Telephone speech
- 20 sec training
Phoneme Selection
Phoneme set of size $N = 38$
- Exhaustive search over all phoneme subsets: required number of evaluations $2^N = 274{,}877{,}906{,}944$
- Knockout procedure: requires N evaluations, $N = 38$; required number of evaluations $N(N+1)/2 - 1 = 740$
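The two evaluation counts can be checked numerically, and a knockout-style selection loop sketched alongside them. This assumes that each knockout round tries removing every remaining phoneme once and discards the one whose removal gives the lowest EER, stopping when one phoneme is left; under that assumption the rounds cost N, N-1, ..., 2 evaluations, which sums to N(N+1)/2 - 1. The `eer_of` callback is a hypothetical stand-in for a full verification experiment on a phoneme subset.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def knockout_selection(
    phonemes: Iterable[str],
    eer_of: Callable[[FrozenSet[str]], float],
) -> Tuple[List[Tuple[str, float]], int]:
    """Greedy backward elimination; returns (knocked-out phoneme, EER) pairs
    in removal order and the total number of subset evaluations."""
    current = frozenset(phonemes)
    removed: List[Tuple[str, float]] = []
    evaluations = 0
    while len(current) > 1:
        # Try removing each remaining phoneme: len(current) evaluations this round
        candidates = [(eer_of(current - {p}), p) for p in sorted(current)]
        evaluations += len(candidates)
        best_eer, knocked_out = min(candidates)
        current = current - {knocked_out}
        removed.append((knocked_out, best_eer))
    return removed, evaluations


N = 38
print(2 ** N)                # exhaustive search: 274877906944
print(N * (N + 1) // 2 - 1)  # knockout: 740

# Toy usage with a made-up EER function
phonemes = [f"ph{i}" for i in range(N)]
removed, evaluations = knockout_selection(phonemes, lambda s: 0.01 * len(s))
print(evaluations)           # 740
```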
Phoneme Selection
Clean speech
[Figure: UBM-SM, order 128 (UBM, SM)]
[Figure: Cfg5, order 128 (UBM, SM, PDUBM, PDSM)]
Phoneme Selection
Clean speech
[Figure: Cfg4, order 128 (UBM, SM, PDUBM, PDSM)]
[Figure: Cfg3, order 8 (UBM, SM, PDUBM, PDSM)]
Phoneme Selection
Telephone speech
[Figure: UBM-SM, order 128 (UBM, SM)]
[Figure: Cfg5, order 128 (UBM, SM, PDUBM, PDSM)]
Phoneme Selection
Telephone speech
[Figure: Cfg4, order 128 (UBM, SM, PDUBM, PDSM)]
[Figure: Cfg3, order 18 (UBM, SM, PDUBM, PDSM)]
Phoneme Selection
[Figure panels: Clean Speech, Telephone Speech]
Phoneme Selection
[Figure: UBM, SM; PDUBM, PDSM]
Conclusion
- The coupling principle in model generation and scoring
- Adaptation of a well-trained speaker model
- Lower storage-cost system
- Phoneme selection by the Knockout Rejection procedure