Title: An Overview of Statistical Pattern Recognition Techniques for Speaker Verification*
1 An Overview of Statistical Pattern Recognition Techniques for Speaker Verification
- Group: Arecio Junior, Isabela Cota, Michelle Moreira, Renan Vilas Novas
- Institute of Computing, Unicamp - MO447 Digital Forensics
- Fazel and Chakrabartty (2011)
2 Popularity
- Speaker verification is a popular biometric identification technique used for authenticating and monitoring human subjects through their speech signal.
- It is attractive for two main reasons:
  - It does not require direct contact with the individual, thus avoiding the hurdle of perceived invasiveness.
  - It does not require deployment of specialized signal transducers, as microphones are now ubiquitous on most portable devices.
3 Popularity
4 Applications of Speaker Verification and Recognition
- Forensics and tele-commerce: the objective is to automatically authenticate speakers of interest from their conversation over a voice channel (telephone or wireless phone).
- Multimedia web portals (Facebook, YouTube, etc.): searching these multimedia documents for metadata such as the topic of discussion or participant names and genders requires automated technology like speaker verification and recognition.
5 Types of Speaker Recognition Systems
- Traditionally, speaker verification systems are classified into two categories based on the constraints imposed on the authentication process:
  - Text-dependent systems
  - Text-independent systems
6 Text-dependent
- Users are assumed to be cooperative and to use identical pass-phrases during the training and testing phases.
- Speech recognition is used in text-dependent speaker verification.
7 Text-independent
- No vocabulary constraints are imposed on the training and testing phases.
- The reference utterance (what is spoken in training) and the test utterance (what is uttered in actual use) may have completely different content.
8 Fundamentals of Speech Based Biometrics
- Speech is produced when air from the lungs passes through the:
  - throat
  - vocal cords
  - mouth
  - nasal tract
Martins, I. Carbone, A. Pinto, A. Silva and A. Teixeira, "European Portuguese MRI based speech production studies," Speech Commun., vol. 50, no. 11-12, pp. 925-952, 2008.
9 Fundamentals of Speech Based Biometrics
- Different positions of the lips, tongue and palate create different sound patterns and give rise to the physiological and spectral properties of the speech signal, such as:
  - pitch
  - tone
  - volume
10 Fundamentals of Speech Based Biometrics
- The combination of these properties is typically considered unique to the speaker because they are modulated by the size and shape of the mouth, vocal and nasal tract, along with the size, shape and tension of the vocal cords.
11 Fundamentals of Speech Based Biometrics
- Spectrograms corresponding to a sample utterance ("fifty-six thirty-five seventy-two") for a male and a female speaker:
  - the horizontal axis represents time
  - the vertical axis corresponds to frequency
  - the color map represents the magnitude of the spectrogram
- The parameters of the speech signal that make it unique for each person are pitch, formants (F1-F3), and prosody (labeled in the figure).
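The spectrogram described above can be computed with a short-time Fourier transform. The sketch below is our own illustration, not code from the deck; the function name `spectrogram` and the frame sizes are assumptions chosen for clarity.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    win = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([signal[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies; transpose so time runs horizontally
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

For a pure 1 kHz tone sampled at 8 kHz, the energy concentrates in the bin nearest 1 kHz, matching the bright horizontal band one would see in the figure.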
12 Architecture of a Speaker Verification System
- Operation of a speaker verification system typically consists of two phases:
  - Enrollment: parameters of a speaker-specific statistical model are determined using annotated (pre-labeled) speech data.
  - Verification: an unknown speech sample is authenticated using the trained speaker-specific model.
13 Architecture of a Speaker Verification System
14 Speech Acquisition
15 Feature Extraction
16 Importance
- The heart of a speaker recognition system is feature extraction (Ningaal and Ahmad 2006).
17 Challenges for Speaker Verification
- Limited quantity and quality of training data
- Intra-speaker variability
- Mismatch between recording conditions during the enrollment and the verification phases
18 Model of additive and channel noise
19 Techniques for robust speaker verification systems
20 Challenges
21 Techniques
- Mel-Frequency Cepstral Coefficients (MFCC)
  - Feature widely used in automatic speech recognition. It was introduced by Davis and Mermelstein in 1980 and has been state-of-the-art ever since.
- RASTA-PLP (RelAtive SpecTrAl - Perceptual Linear Predictive)
  - Robust technique which mimics some characteristics of human auditory perception in order to improve the speech signal representation.
22 Mel-Frequency Cepstral Coefficients (MFCC)
(Block diagram of the MFCC pipeline: calculate the mel scale; frame the signal; windowing; Fast Fourier Transform (FFT); Discrete Cosine Transform (DCT); mel coefficients. "Amplitude" and "Frequency" label the plot axes.)
23 Mel-Frequency Cepstral Coefficients (MFCC)
- Steps to calculate:
  - Initially it is necessary to obtain the mel-frequency scale, which is linear up to 1000 Hz and logarithmic above (Deller et al. 2000).
- The formula for converting a frequency f (in Hz) to the mel scale is:

  mel(f) = 2595 · log10(1 + f / 700)
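The mel conversion above (and its inverse, needed later to place filterbank edges) is a one-line formula; a minimal sketch with hypothetical function names:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that with these constants, 1000 Hz maps to almost exactly 1000 mel, which is the conventional anchor point of the scale.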
24 Mel-Frequency Cepstral Coefficients (MFCC)
(Figure: framing of the signal, frequency vs. time)
- Frame the signal into short frames (Deller et al. 2000).
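The MFCC steps from the last three slides can be strung together in one toy pipeline. This is a sketch under our own assumptions (frame/hop sizes, 26 filters, 13 coefficients are common but illustrative choices), not the deck's implementation:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Toy MFCC pipeline: frame -> window -> |FFT| -> mel filterbank -> log -> DCT."""
    # 1. Frame the signal into short overlapping frames (25 ms / 10 ms at 16 kHz).
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # 2. Apply a Hamming window to each frame.
    frames = frames * np.hamming(frame_len)
    # 3. Magnitude spectrum via the FFT.
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 4. Triangular filters spaced uniformly on the mel scale.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel2hz(np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 5. Log filterbank energies, then a DCT-II to decorrelate them.
    energies = np.log(spec @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct_mat = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2.0 * n_filters))
    return energies @ dct_mat.T  # shape: (n_frames, n_ceps)
```

One second of audio at 16 kHz yields 98 frames of 13 coefficients each, which become the feature vectors fed to the speaker models discussed later.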
25 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
- RASTA (RelAtive SpecTrAl): analyzes the perception of human hearing, taking time into account.
- PLP (Perceptual Linear Predictive): analyzes the frequency response of the communication channel.
- Combining the two techniques results in a model more robust to such simulated channel variation.
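The core of RASTA is a band-pass filter applied to each log-spectral trajectory over time; since a fixed channel adds a constant offset in the log domain, removing the DC component removes the channel. The coefficients below follow a standard published form of the filter, but the deck does not give them, so treat them as an assumption:

```python
def rasta_filter(traj):
    """Apply a standard form of the RASTA band-pass filter to one
    log-spectral trajectory (a list of floats over time).

    Difference equation (assumed standard coefficients):
        y[n] = 0.98*y[n-1] + 0.1*(2*x[n] + x[n-1] - x[n-3] - 2*x[n-4])
    """
    x_hist = [0.0] * 5  # x[n] .. x[n-4], zero-padded history
    y_prev = 0.0
    out = []
    for x in traj:
        x_hist = [x] + x_hist[:4]
        y = 0.98 * y_prev + 0.1 * (2 * x_hist[0] + x_hist[1]
                                   - x_hist[3] - 2 * x_hist[4])
        out.append(y)
        y_prev = y
    return out
```

Feeding in a constant trajectory (a pure channel offset) produces an output that decays toward zero, which is exactly the channel robustness the slide claims.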
26 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
(Figure: cepstral coefficients)
27 RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
28 Software for phonetic analyses
- Praat
  - A program for doing phonetic analyses and sound manipulations (Boersma and Weenink 2012).
  - Offers a general, extremely flexible tool in its Edit... function to visualize and extract information from a sound object, for example:
    - General analysis (waveform, intensity, spectrogram, pitch, duration)
29 Praat
30 Praat
(Figure: duration of vowels /i/ and /I/ for an American speaker and a Brazilian speaker (Gomes 2014))
31 Praat
(Figure: word stress exchange in "POLICE" for a Brazilian speaker and an American speaker (Gomes 2014))
32 Speaker Modeling
33 Types of Models for Speaker Verification
- Generative models: based on capturing the statistical properties of speaker-specific speech signals.
- Discriminative models: optimized to minimize the error on a set of genuine and impostor training samples.
34 Generative Models
- Training typically involves data specific to the target speakers.
- Training is focused on capturing the empirical probability density function of the acoustic feature vectors.
- Examples: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM)
35 Discriminative Models
- Training involves data corresponding to both the target and impostor speakers.
- Training is focused on estimating the parameters of the manifold which distinguishes the features of the target speakers from those of the impostor speakers.
- Examples: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs)
36 Gaussian Mixture Models
- A GMM is composed of a finite mixture of multivariate Gaussian components.
- It estimates a general probability density function for speaker verification.
37 Gaussian Mixture Models
- Before training, the means of the Gaussians are uniformly spaced, and the variances and weights are chosen to be equal.
- After training, the means and variances of the Gaussians align themselves with the data cluster centers, and the weights capture the prior probability of the data.
38 Gaussian Mixture Models
- Advantages:
  - Training is relatively fast
  - Models can be scaled and updated to add new speakers with relative ease
- Disadvantages:
  - By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors
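To make the GMM slides concrete, here is how a trained mixture scores a speech sample: each frame's likelihood is a weighted sum of Gaussian densities, and verification compares the target model against a background model. This sketch uses scalar features for brevity (real systems use MFCC vectors and multivariate, typically UBM-adapted, GMMs); all names are illustrative.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of a scalar observation x under a 1-D Gaussian mixture."""
    total = 0.0
    for w, m, v in zip(weights, means, variances):
        total += w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return math.log(total)

def verification_score(frames, target, background):
    """Average log-likelihood ratio of the target model over the background model.

    target and background are (weights, means, variances) tuples; a positive
    score supports the claimed identity.
    """
    return sum(gmm_loglik(x, *target) - gmm_loglik(x, *background)
               for x in frames) / len(frames)
```

The decision threshold on this score is exactly what the score-normalization and performance slides below are about.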
39 Support Vector Machines
- Supervised learning model based on the construction of a set of hyperplanes in a high-dimensional space, which can be used for classification.
40 Support Vector Machines
41 Support Vector Machines
- Provides good verification performance even with relatively few data points in the training set.
- The learning ability of the classifier is controlled by a regularizer in the SVM training, which determines the trade-off between its complexity and its generalization performance.
- Good out-of-sample performance.
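The regularizer mentioned above appears explicitly in the SVM objective: minimize lam*||w||^2 plus the hinge loss over training points. A minimal linear SVM trained by full-batch subgradient descent shows the mechanics (our own toy sketch; practical speaker systems use kernel SVMs, often over GMM supervectors):

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the regularized hinge loss.

    lam is the regularizer from the slide: larger lam favors a simpler
    (smaller-norm) hyperplane over fitting every training point.
    """
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wi for wi in w]  # gradient of the regularization term
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # point inside the margin contributes hinge subgradient
                gw = [gwj - yi * xj for gwj, xj in zip(gw, xi)]
                gb -= yi
        w = [wj - lr * gwj for wj, gwj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

With genuine samples labeled +1 and impostor samples -1, the learned hyperplane directly implements the discriminative accept/reject decision.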
42 Score Normalization
- Reduces score variability across different channel conditions.
- Adapts the speaker-dependent threshold.
- Assumes that impostor scores follow a Gaussian distribution whose mean and standard deviation depend on the speaker model and/or the test utterance.
43 Score Normalization
- Zero Normalization (Z-norm): the speaker model is tested against a set of speech signals produced by impostors, resulting in an impostor similarity score distribution. Performed offline.
- Test Normalization (T-norm): parameters are estimated using a test utterance. Performed online.
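Both normalizations apply the same standardization formula, s' = (s - mu_I) / sigma_I; they differ only in how the impostor statistics are gathered (offline per speaker model for Z-norm, online per test utterance against cohort models for T-norm). A minimal sketch:

```python
import statistics

def z_norm(raw_score, impostor_scores):
    """Standardize a trial score with impostor score statistics.

    For Z-norm the impostor scores come from scoring the claimed speaker's
    model against impostor speech offline; for T-norm the same formula is
    applied with scores of the test utterance against cohort impostor models.
    """
    mu = statistics.mean(impostor_scores)
    sigma = statistics.stdev(impostor_scores)
    return (raw_score - mu) / sigma
```

After normalization, scores from different speakers and conditions share a common impostor distribution, so a single global threshold works.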
44 Score Normalization
- Handset Test Normalization (HT-norm): parameters are estimated by testing each test utterance against handset-dependent impostor models.
- Channel Normalization (C-norm): parameters are estimated by testing each speaker model against a handset- or channel-dependent set of impostors. During testing, the type of handset or channel related to the test utterance is first detected.
45 Performance of Speaker Verification Systems
- False acceptance and false rejection are functions of the decision threshold.
- Detection Cost Function (DCF), used by the National Institute of Standards and Technology (NIST).
- Equal Error Rate (EER).
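Since both error rates are functions of the threshold, the EER can be found empirically by sweeping the threshold over the observed scores and locating the point where false acceptance and false rejection balance. A minimal sketch (function name and the tie-breaking convention are our assumptions):

```python
def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return the
    operating point where FAR and FRR are closest (their average there)."""
    best = None
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]
```

For perfectly separated score sets the EER is 0; overlapping genuine and impostor scores push it up toward 0.5.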
46 Our Implementation
- Won't be focused on robustness to background noise and channel effects.
- Dataset comprised of 60 people, with 19 audio samples each, captured by the same device.
- No Z-norm or T-norm, since we don't have different handsets and channels.
47 Prior Work
- Campbell, William M., Douglas E. Sturim, and Douglas A. Reynolds. "Support vector machines using GMM supervectors for speaker verification." IEEE Signal Processing Letters 13.5 (2006): 308-311.
  - MFCC
  - GMM to generate supervectors
  - KL divergence (Super L2)
  - Space inner product (Super Linear)
48 Prior Work
- KL divergence
- Space inner product
- Best result: GMM Super Linear
49 References
- Boersma, Paul and Weenink, David (2012). Praat: doing phonetics by computer. Available from http://www.fon.hum.uva.nl/praat.
- Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, pp. 357-366.
- Deller, J. R., et al. (2000). Discrete-Time Processing of Speech Signals. IEEE Press, p. 936.
- Gomes, M.L.C. (2014). O Uso do Programa Praat para Compreensão do Jeitinho Brasileiro de Falar Inglês: Uma Experiência de um Grupo de Estudos. IV Congresso Internacional da Abrapui (Associação Brasileira de Professores Universitários de Inglês). Language and Literature in the Age of Technology, Maceió, Alagoas.
- Ningaal, I.Z. and Ahmad, A.M. (2006). The Fundamental of Feature Extraction in Speaker Recognition: A Review. Proceedings of the Postgraduate Annual Research Seminar, Faculty of Computer Science and Information System, University of Technology Malaysia.
- Schroeder, M.R. (1991). Recognition of Complex Acoustic Signals. In Life Sciences Research Report 5, T.H. Bullock, Ed., p. 324, Abakon Verlag, Berlin.
- Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America, Vol. 33, No. 2, pp. 248-248.