Title: An Overview of Statistical Pattern Recognition Techniques for Speaker Verification*
1 An Overview of Statistical Pattern Recognition Techniques for Speaker Verification
- Group: Arecio Junior, Isabela Cota, Michelle Moreira, Renan Vilas Novas
- Institute of Computing, Unicamp - MO447 Digital Forensics
- Fazel and Chakrabartty (2011)
2 Popularity
- Speaker verification is a popular biometric identification technique used for authenticating and monitoring human subjects through their speech signal.
- It is attractive for two main reasons:
  - It does not require direct contact with the individual, thus avoiding the hurdle of perceived invasiveness.
  - It does not require deployment of specialized signal transducers, as microphones are now ubiquitous on most portable devices.
3 Popularity
4 Applications of Speaker Verification and Recognition
- Forensics and tele-commerce: the objective is to automatically authenticate speakers of interest from their conversation over a voice channel (telephone or wireless phone).
- Multimedia web portals (Facebook, YouTube, etc.): searching these multimedia documents for metadata such as the topic of discussion or participant names and genders requires automated technology like speaker verification and recognition.
5 Types of Speaker Recognition Systems
- Traditionally, speaker verification systems are classified into two categories based on the constraints imposed on the authentication process:
  - Text-dependent systems
  - Text-independent systems
6 Text-dependent
- Users are assumed to be cooperative and to use identical pass-phrases during the training and testing phases.
- Speech recognition is used in text-dependent speaker verification.
7 Text-independent
- No vocabulary constraints are imposed on the training and testing phases.
- The reference utterance (what is spoken in training) and the test utterance (what is uttered in actual use) may have completely different content.
8 Fundamentals of Speech Based Biometrics
- Speech is produced when air from the lungs passes through the:
  - throat
  - vocal cords
  - mouth
  - nasal tract
Martins, I. Carbone, A. Pinto, A. Silva and A. Teixeira, "European Portuguese MRI based speech production studies," Speech Commun., vol. 50, no. 11-12, pp. 925-952, 2008.
9 Fundamentals of Speech Based Biometrics
- Different positions of the lips, tongue and palate create different sound patterns and give rise to the physiological and spectral properties of the speech signal, such as:
  - pitch
  - tone
  - volume
10 Fundamentals of Speech Based Biometrics
- The combination of these properties is typically considered unique to the speaker because they are modulated by the size and shape of the mouth, vocal and nasal tract, along with the size, shape and tension of the vocal cords.
11 Fundamentals of Speech Based Biometrics
- Spectrograms corresponding to a sample utterance ("fifty-six thirty-five seventy-two") for a male and a female speaker:
  - the horizontal axis represents time
  - the vertical axis corresponds to frequency
  - the color map represents the magnitude of the spectrogram
- The parameters of the speech signal that make it unique for each person are pitch, formants (F1-F3), and prosody (labeled in the figure).
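The spectrogram described above can be computed with a short-time Fourier transform. The sketch below is our own illustration, not code from the deck; the function name `spectrogram` and the frame sizes are assumptions chosen for clarity.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    win = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([signal[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies; transpose so time runs horizontally
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

For a pure 1 kHz tone sampled at 8 kHz, the energy concentrates in the bin nearest 1 kHz, matching the bright horizontal band one would see in the figure.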
12 Architecture of a Speaker Verification System
- Operation of a speaker verification system typically consists of two phases:
  - Enrollment: parameters of a speaker-specific statistical model are determined using annotated (pre-labeled) speech data.
  - Verification: an unknown speech sample is authenticated using the trained speaker-specific model.
13 Architecture of a Speaker Verification System
14 Speech Acquisition
15 Feature Extraction
16 Importance
- The heart of a speaker recognition system is feature extraction (Ningaal and Ahmad 2006).
17 Challenges for Speaker Verification
- Limited quantity and quality of training data
- Intra-speaker variability
- Mismatch between recording conditions during the enrollment and the verification phases
18 Model of additive and channel noise
19 Techniques for robust speaker verification systems
20 Challenges
21 Techniques
- Mel-Frequency Cepstral Coefficients (MFCC)
  - Feature widely used in automatic speech recognition. It was introduced by Davis and Mermelstein in 1980 and has been state-of-the-art ever since.
- RASTA-PLP (RelAtive SpecTrAl - Perceptual Linear Predictive)
  - Robust technique which mimics some characteristics of human auditory perception in order to improve the speech signal representation.
22 Mel-Frequency Cepstral Coefficients (MFCC)
(Block diagram of the MFCC pipeline: calculate the mel scale; frame the signal; windowing; Fast Fourier Transform (FFT); Discrete Cosine Transform (DCT); mel coefficients. "Amplitude" and "Frequency" label the plot axes.)
23 Mel-Frequency Cepstral Coefficients (MFCC)
- Steps to calculate:
  - Initially it is necessary to obtain the mel-frequency scale, which is linear up to 1000 Hz and logarithmic above (Deller et al. 2000).
- The formula for converting a frequency f (in Hz) to the mel scale is:

  mel(f) = 2595 · log10(1 + f / 700)
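The mel conversion above (and its inverse, needed later to place filterbank edges) is a one-line formula; a minimal sketch with hypothetical function names:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that with these constants, 1000 Hz maps to almost exactly 1000 mel, which is the conventional anchor point of the scale.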
24 Mel-Frequency Cepstral Coefficients (MFCC)
(Figure: framing of the signal, frequency vs. time)
- Frame the signal into short frames (Deller et al. 2000).
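The MFCC steps from the last three slides can be strung together in one toy pipeline. This is a sketch under our own assumptions (frame/hop sizes, 26 filters, 13 coefficients are common but illustrative choices), not the deck's implementation:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Toy MFCC pipeline: frame -> window -> |FFT| -> mel filterbank -> log -> DCT."""
    # 1. Frame the signal into short overlapping frames (25 ms / 10 ms at 16 kHz).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # 2. Apply a Hamming window to each frame.
    frames = frames * np.hamming(frame_len)
    # 3. Magnitude spectrum via the FFT.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 4. Triangular filters spaced uniformly on the mel scale.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 5. Log filterbank energies, then a DCT-II to decorrelate them.
    energies = np.log(spec @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct_mat = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2.0 * n_filters))
    return energies @ dct_mat.T  # shape: (n_frames, n_ceps)
```

One second of audio at 16 kHz yields 98 frames of 13 coefficients each, which become the feature vectors fed to the speaker models discussed later.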
25 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
- RASTA (RelAtive SpecTrAl): analyzes the perception of human hearing, taking time into account.
- PLP (Perceptual Linear Predictive): analyzes the frequency response of the communication channel.
- Combining the two techniques results in a model more robust to such simulated channel variation.
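The core of RASTA is a band-pass filter applied to each log-spectral trajectory over time; since a fixed channel adds a constant offset in the log domain, removing the DC component removes the channel. The coefficients below follow a standard published form of the filter, but the deck does not give them, so treat them as an assumption:

```python
def rasta_filter(traj):
    """Apply a standard form of the RASTA band-pass filter to one
    log-spectral trajectory (a list of floats over time).

    Difference equation (assumed standard coefficients):
        y[n] = 0.98*y[n-1] + 0.1*(2*x[n] + x[n-1] - x[n-3] - 2*x[n-4])
    """
    x_hist = [0.0] * 5  # x[n] .. x[n-4], zero-padded history
    y_prev = 0.0
    out = []
    for x in traj:
        x_hist = [x] + x_hist[:4]
        y = 0.98 * y_prev + 0.1 * (2 * x_hist[0] + x_hist[1]
                                   - x_hist[3] - 2 * x_hist[4])
        out.append(y)
        y_prev = y
    return out
```

Feeding in a constant trajectory (a pure channel offset) produces an output that decays toward zero, which is exactly the channel robustness the slide claims.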
26 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
(Figure: cepstral coefficients)
27 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
28 Software for phonetic analyses
- Praat
  - A program for doing phonetic analyses and sound manipulations (Boersma and Weenink 2012).
  - Offers a general, extremely flexible tool in its Edit... function to visualize and extract information from a sound object, for example:
    - General analysis (waveform, intensity, spectrogram, pitch, duration)
29 Praat
30 Praat
(Figure: duration of vowels /i/ and /I/ for an American speaker and a Brazilian speaker (Gomes 2014))
31 Praat
(Figure: word stress exchange in "POLICE" for a Brazilian speaker and an American speaker (Gomes 2014))
32 Speaker Modeling
33 Types of Models for Speaker Verification
- Generative models: based on capturing the statistical properties of speaker-specific speech signals.
- Discriminative models: optimized to minimize the error on a set of genuine and impostor training samples.
34 Generative Models
- Training typically involves data specific to the target speakers.
- Training is focused on capturing the empirical probability density function of the acoustic feature vectors.
- Examples: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM)
35 Discriminative Models
- Training involves data corresponding to both the target and impostor speakers.
- Training is focused on estimating the parameters of the manifold which distinguishes the features of the target speakers from those of the impostor speakers.
- Examples: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs)
36 Gaussian Mixture Models
- A GMM is composed of a finite mixture of multivariate Gaussian components.
- It estimates a general probability density function for speaker verification.
37 Gaussian Mixture Models
- Before training, the means of the Gaussians are uniformly spaced, and the variances and weights are chosen to be equal.
- After training, the means and variances of the Gaussians align themselves with the data cluster centers, and the weights capture the prior probability of the data.
38 Gaussian Mixture Models
- Advantages:
  - Training is relatively fast
  - Models can be scaled and updated to add new speakers with relative ease
- Disadvantages:
  - By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors
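To make the GMM slides concrete, here is how a trained mixture scores a speech sample: each frame's likelihood is a weighted sum of Gaussian densities, and verification compares the target model against a background model. This sketch uses scalar features for brevity (real systems use MFCC vectors and multivariate, typically UBM-adapted, GMMs); all names are illustrative.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a scalar observation x under a 1-D Gaussian mixture."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return math.log(total)

def verification_score(frames, target, background):
    """Average log-likelihood ratio of the target model over the background model.

    target and background are (weights, means, variances) tuples; a positive
    score supports the claimed identity.
    """
    return sum(gmm_loglik(x, *target) - gmm_loglik(x, *background)
               for x in frames) / len(frames)
```

The decision threshold on this score is exactly what the score-normalization and performance slides below are about.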
39 Support Vector Machines
- Supervised learning model based on the construction of a set of hyperplanes in a high-dimensional space, which can be used for classification.
40 Support Vector Machines
41 Support Vector Machines
- Provides good verification performance even with relatively few data points in the training set.
- The learning ability of the classifier is controlled by a regularizer in the SVM training, which determines the trade-off between its complexity and its generalization performance.
- Good out-of-sample performance.
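The regularizer mentioned above appears explicitly in the SVM objective: minimize lam*||w||^2 plus the hinge loss over training points. A minimal linear SVM trained by full-batch subgradient descent shows the mechanics (our own toy sketch; practical speaker systems use kernel SVMs, often over GMM supervectors):

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the regularized hinge loss.

    lam is the regularizer from the slide: larger lam favors a simpler
    (smaller-norm) hyperplane over fitting every training point.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wi for wi in w]  # gradient of the regularization term
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin contributes hinge subgradient
                gw = [gwj - yi * xj for gwj, xj in zip(gw, xi)]
                gb -= yi
        w = [wj - lr * gwj for wj, gwj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

With genuine samples labeled +1 and impostor samples -1, the learned hyperplane directly implements the discriminative accept/reject decision.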
42 Score Normalization
- Reduces score variability across different channel conditions.
- Adapts the speaker-dependent threshold.
- Assumes that impostor scores follow a Gaussian distribution whose mean and standard deviation depend on the speaker model and/or the test utterance.
43 Score Normalization
- Zero Normalization (Z-norm): the speaker model is tested against a set of speech signals produced by impostors, resulting in an impostor similarity score distribution. Performed offline.
- Test Normalization (T-norm): parameters are estimated using a test utterance. Performed online.
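Both normalizations apply the same standardization formula, s' = (s - mu_I) / sigma_I; they differ only in how the impostor statistics are gathered (offline per speaker model for Z-norm, online per test utterance against cohort models for T-norm). A minimal sketch:

```python
import statistics

def z_norm(raw_score, impostor_scores):
    """Standardize a trial score with impostor score statistics.

    For Z-norm the impostor scores come from scoring the claimed speaker's
    model against impostor speech offline; for T-norm the same formula is
    applied with scores of the test utterance against cohort impostor models.
    """
    mu = statistics.mean(impostor_scores)
    sigma = statistics.stdev(impostor_scores)
    return (raw_score - mu) / sigma
```

After normalization, scores from different speakers and conditions share a common impostor distribution, so a single global threshold works.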
44 Score Normalization
- Handset Test Normalization (HT-norm): parameters are estimated by testing each test utterance against handset-dependent impostor models.
- Channel Normalization (C-norm): parameters are estimated by testing each speaker model against a handset- or channel-dependent set of impostors. During testing, the type of handset or channel related to the test utterance is first detected.
45 Performance of Speaker Verification Systems
- False acceptance and false rejection are functions of the decision threshold.
- Detection Cost Function (DCF), used by the National Institute of Standards and Technology (NIST).
- Equal Error Rate (EER).
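Since both error rates are functions of the threshold, the EER can be found empirically by sweeping the threshold over the observed scores and locating the point where false acceptance and false rejection balance. A minimal sketch (function name and the tie-breaking convention are our assumptions):

```python
def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return the
    operating point where FAR and FRR are closest (their average there)."""
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]
```

For perfectly separated score sets the EER is 0; overlapping genuine and impostor scores push it up toward 0.5.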
46 Our Implementation
- Won't be focused on robustness to background noise and channel effects.
- Dataset comprised of 60 people, with 19 audio samples each, captured by the same device.
- No Z-norm or T-norm, since we don't have different handsets and channels.
47 Prior Work
- Campbell, William M., Douglas E. Sturim, and Douglas A. Reynolds. "Support vector machines using GMM supervectors for speaker verification." IEEE Signal Processing Letters 13.5 (2006): 308-311.
  - MFCC
  - GMM to generate supervectors
  - KL divergence (Super L2)
  - Space inner product (Super Linear)
48 Prior Work
- KL divergence
- Space inner product
- Best result: GMM Super Linear
49 References
- Boersma, Paul and Weenink, David (2012). Praat: doing phonetics by computer. Available from http://www.fon.hum.uva.nl/praat.
- Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366.
- Deller, J. R., et al. (2000). Discrete-Time Processing of Speech Signals. IEEE Press, p. 936.
- Gomes, M.L.C. (2014). O Uso do Programa Praat para Compreensão do Jeitinho Brasileiro de Falar Inglês: Uma Experiência de um Grupo de Estudos. IV Congresso Internacional da Abrapui (Associação Brasileira de Professores Universitários de Inglês). Language and Literature in the Age of Technology, Maceió, Alagoas.
- Ningaal, I.Z. and Ahmad, A.M. (2006). The Fundamental of Feature Extraction in Speaker Recognition: A Review. Proceedings of the Postgraduate Annual Research Seminar, Faculty of Computer Science and Information System, University of Technology Malaysia.
- Schroeder, M.R. (1991). Recognition of Complex Acoustic Signals. In Life Sciences Research Report 5, T.H. Bullock, Ed., p. 324, Abakon Verlag, Berlin.
- Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America, Vol. 33, No. 2, pp. 248-248.