An Overview of Statistical Pattern Recognition Techniques for Speaker Verification


1
An Overview of Statistical Pattern Recognition
Techniques for Speaker Verification
Group: Arecio Junior, Isabela Cota, Michelle Moreira, Renan Vilas Novas
Institute of Computing, Unicamp
MO447 Digital Forensics
(Fazel and Chakrabartty, 2011)
2
Popularity
  • Speaker verification is a popular biometric
    identification technique used for authenticating
    and monitoring human subjects using their speech
    signal.
  • It is attractive for two main reasons:
  • It does not require direct contact with the individual, thus avoiding the hurdle of perceived invasiveness.
  • It does not require deployment of specialized signal transducers, as microphones are now ubiquitous on portable devices.

4
Applications of Speaker Verification and Recognition
Forensics and tele-commerce: the objective is to automatically authenticate speakers of interest using their conversation over a voice channel (telephone or wireless phone).
Multimedia web portals (Facebook, YouTube, etc.): searching for metadata such as topic of discussion or participant names and genders in these multimedia documents requires automated technology like speaker verification and recognition.
5
Types of Speaker Recognition Systems
  • Traditionally, speaker verification systems are classified into two categories based on the constraints imposed on the authentication process:
  • Text-dependent systems
  • Text-independent systems

6
Text-dependent
Users are assumed to be cooperative and to use identical pass-phrases during the training and testing phases. Speech recognition is used in text-dependent speaker verification.
 
7
Text-independent
No vocabulary constraints are imposed on the training and testing phases. The reference (what is spoken in training) and the test (what is uttered in actual use) utterances may have completely different content.
 
8
Fundamentals of Speech Based Biometrics
  • Speech is produced when air from the lungs passes
    through
  • throat
  • vocal cords
  • mouth
  • nasal tract

Martins, I. Carbone, A. Pinto, A. Silva and A. Teixeira, "European Portuguese MRI based speech production studies," Speech Commun., vol. 50, no. 11-12, pp. 925-952, 2008.
9
Fundamentals of Speech Based Biometrics
  • Different positions of the lips, tongue and palate create different sound patterns and give rise to the physiological and spectral properties of the speech signal, such as:
  • pitch
  • tone
  • volume

10
Fundamentals of Speech Based Biometrics
  • The combination of these properties is typically
    considered unique to the speaker because they are
    modulated by the size and shape of the mouth,
    vocal and nasal tract along with the size, shape
    and tension of the vocal cords.

11
Fundamentals of Speech Based Biometrics
  • Spectrograms corresponding to a sample utterance, "fifty-six thirty-five seventy-two", for a male and a female speaker:
  • the horizontal axis represents time
  • the vertical axis corresponds to frequency
  • the color map represents the magnitude of the spectrogram

Different parameters of the speech signal that make it unique for each person are pitch, formants (F1-F3), and prosody (labeled in the spectrograms).
12
Architecture of a Speaker Verification System
  • Operation of a speaker verification system typically consists of two phases:
  • Enrollment: parameters of a speaker-specific statistical model are determined using annotated (pre-labeled) speech data
  • Verification: an unknown speech sample is authenticated using the trained speaker-specific model

13
Architecture of a Speaker Verification System
14
Speech Acquisition
15
Feature Extraction
16
Importance
"The heart of a speaker recognition system is feature extraction" (Ningaal and Ahmad 2006)
17
Challenges for Speaker Verification
  • Limited quantity and quality of training data
  • Intra-speaker variability
  • Mismatch between recording conditions during the enrollment and the verification phases

18
Model of additive and channel noise
19
Techniques for robust speaker verification systems
21
Techniques
  • Mel-Frequency Cepstral Coefficients (MFCC)
  • A feature widely used in automatic speech recognition. It was introduced by Davis and Mermelstein in 1980 and has been state-of-the-art ever since.
  • RASTA-PLP (RelAtive SpecTrAl - Perceptual Linear Predictive)
  • A robust technique which mimics some characteristics of human auditory perception in order to improve the speech signal representation.

22
Mel-Frequency Cepstral Coefficients (MFCC)
  • Steps to calculate:

Calculate the mel scale
Frame the signal
Windowing
Fast Fourier Transform (FFT)
Discrete Cosine Transform (DCT)
Mel coefficients
23
Mel-Frequency Cepstral Coefficients (MFCC)
  • Steps to calculate:
  • Initially it is necessary to obtain the Mel-frequency scale (linear up to 1000 Hz and logarithmic above) (Deller et al. 2000).

The formula for converting from frequency f (in Hz) to the mel scale is

Mel(f) = 2595 log10(1 + f / 700)
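A minimal pure-Python sketch of this conversion (using the common 2595 · log10(1 + f/700) form of the mel mapping):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (roughly linear below
    1 kHz and logarithmic above)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # close to 1000 mel by construction
```

In an MFCC front end this mapping is used to place the centers of the triangular mel filterbank.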
24
Mel-Frequency Cepstral Coefficients (MFCC)
  • Frame the signal into short frames (Deller et al. 2000)
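The framing and windowing steps can be sketched in pure Python; the frame length and hop size below are illustrative values for 16 kHz audio, not taken from the slides:

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping fixed-length frames
    for short-time analysis."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def hamming(n):
    """Hamming window of length n, applied to each frame before the FFT
    to reduce spectral leakage."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

# Example: 400-sample frames (25 ms at 16 kHz) with a 160-sample hop (10 ms).
sig = [math.sin(0.01 * t) for t in range(16000)]
frames = frame_signal(sig, 400, 160)
win = hamming(400)
windowed = [[s * w for s, w in zip(f, win)] for f in frames]
```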

25
RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
  • RASTA (RelAtive SpecTrAl): analyzes the perception of human hearing over time.
  • PLP (Perceptual Linear Predictive): analyzes the frequency response of the communication channel.
  • Combining the two techniques results in a model more robust to such simulated channel variation.

26
RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
Cepstral Coefficients
27
RASTA-PLP (RelAtive SpecTrAl Perceptual Linear Predictive) (Schroeder 1976)
28
Software for phonetic analyses
  • Praat
  • A program for doing phonetic analyses and sound manipulations (Boersma and Weenink 2012).
  • Offers an extremely flexible general tool, the Edit... function, to visualize and extract information from a sound object. For example:
  • General analysis (waveform, intensity, spectrogram, pitch, duration)

29
Praat
30
Praat
Duration of the vowels /i/ and /I/, American vs. Brazilian speakers (Gomes 2014)
31
Praat
Word stress exchange in "POLICE", Brazilian vs. American speakers (Gomes 2014)
32
Speaker Modeling
33
Types of Models for Speaker Verification
Generative models: based on capturing the statistical properties of speaker-specific speech signals.
Discriminative models: optimized to minimize the error on a set of genuine and impostor training samples.
34
Generative Models
- Training typically involves data specific to the target speakers
- Training focuses on capturing the empirical probability density function corresponding to the acoustic feature vectors
- Examples: Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs)
35
Discriminative Models
- Training involves data corresponding to both the target and impostor speakers
- Training focuses on estimating the parameters of the manifold which distinguishes the features of the target speakers from the features of the impostor speakers
- Examples: Support Vector Machines (SVMs), Artificial Neural Networks (ANNs)
36
Gaussian Mixture Models
A GMM is composed of a finite mixture of multivariate Gaussian components. It estimates a general probability density function for speaker verification.
37
Gaussian Mixture Models
Before training, the means of the Gaussians are uniformly spaced and the variances and weights are chosen to be the same. After training, the means and variances of the Gaussians align themselves with the data cluster centers, and the weights capture the a priori probability of the data.
38
Gaussian Mixture Models
Advantages:
- Training is relatively fast
- Models can be scaled and updated to add new speakers with relative ease
Disadvantages:
- By construction, GMMs are static models that do not take into account the dynamics inherent in the speech vectors
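As a minimal sketch of how a trained GMM scores speech data: the toy parameters below are 1-D and purely illustrative, whereas a real verifier uses multivariate components whose parameters come from EM training.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of a scalar observation x under a 1-D Gaussian
    mixture: a weighted sum of Gaussian densities, then a log."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        total += w * math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return math.log(total)

# Toy two-component "speaker model" (illustrative, untrained parameters).
weights = [0.6, 0.4]
means = [0.0, 3.0]
variances = [1.0, 0.5]

# Frame-level log-likelihoods are summed (log domain) to score an utterance.
utterance = [0.1, 2.9, 0.4]
score = sum(gmm_log_likelihood(x, weights, means, variances) for x in utterance)
```

A verification decision then compares such a score (usually normalized by a background model) against a threshold.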
39
Support Vector Machines
Supervised learning model based on the
construction of a set of hyperplanes in a
high-dimensional space, which can be used for
classification
40
Support Vector Machines
41
Support Vector Machines
- It provides good verification performance even with relatively few data points in the training set
- The learning ability of the classifier is controlled by a regularizer in the SVM training (which determines the trade-off between its complexity and its generalization performance)
- Good out-of-sample performance
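A minimal sketch of a soft-margin linear SVM trained by Pegasos-style sub-gradient descent, on toy 2-D data; real speaker-verification SVMs typically operate on high-dimensional feature or supervector spaces, and the data and hyperparameters here are illustrative assumptions:

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Soft-margin linear SVM via sub-gradient descent on the hinge loss.
    Labels must be +1 (target speaker) or -1 (impostor). The bias term is
    omitted for brevity, so features are assumed roughly zero-centered."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1.0:
                # inside the margin: shrink w (regularizer) and step toward the point
                w = [(1.0 - eta * lam) * wj + eta * yi * xj
                     for wj, xj in zip(w, xi)]
            else:
                # correctly classified with margin: only the regularizer acts
                w = [(1.0 - eta * lam) * wj for wj in w]
    return w

def decide(w, x):
    """Accept (+1) or reject (-1) based on the sign of the decision value."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1

# Toy linearly separable data: target-speaker vs. impostor feature vectors.
X = [[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
```

The regularization constant `lam` plays exactly the role the slide describes: it trades model complexity against generalization.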
42
Score Normalization
- Reduces score variability across different channel conditions
- Adapts the speaker-dependent threshold
- Assumes that the impostor scores follow a Gaussian distribution whose mean and standard deviation depend on the speaker model and/or the test utterance
43
Score Normalization
Zero Normalization (Z-norm)
- The speaker model is tested against a set of speech signals produced by impostors, resulting in an impostor similarity score distribution
- Offline
Test Normalization (T-norm)
- Parameters are estimated using the test utterance
- Online
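Both normalizations reduce to the same standardization formula; only where the impostor statistics come from differs. A minimal sketch with illustrative score values:

```python
import math

def znorm(raw_score, impostor_scores):
    """Z-norm: standardize a raw verification score using the mean and
    standard deviation of the speaker model's scores on impostor speech
    (computed offline). T-norm applies the same formula, but the statistics
    come from scoring the test utterance against impostor models online."""
    n = len(impostor_scores)
    mu = sum(impostor_scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in impostor_scores) / n)
    return (raw_score - mu) / sigma

# Illustrative numbers: impostor trials score low, a genuine trial higher.
impostor_scores = [-2.1, -1.8, -2.5, -2.0, -1.9]
normalized = znorm(-0.5, impostor_scores)
```

After normalization, a single speaker-independent threshold can be applied to the standardized scores.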
44
Score Normalization
Handset Test Normalization (HT-norm)
- Parameters are estimated by testing each test utterance against handset-dependent impostor models
Channel Normalization (C-norm)
- Parameters are estimated by testing each speaker model against a handset- or channel-dependent set of impostors
- During testing, the type of handset or channel related to the test utterance is first detected
45
Performance of Speaker Verification System
False acceptance and false rejection are functions of the decision threshold.
Detection Cost Function (DCF): the weighted cost measure used by the National Institute of Standards and Technology (NIST).
Equal Error Rate (EER): the operating point where the false acceptance rate equals the false rejection rate.
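A minimal sketch of estimating the EER from lists of genuine and impostor trial scores by sweeping the threshold (the score values are toy numbers):

```python
def far_frr(genuine, impostor, threshold):
    """False acceptance and false rejection rates at one threshold;
    scores at or above the threshold are accepted."""
    far = sum(1 for s in impostor if s >= threshold) / len(impostor)
    frr = sum(1 for s in genuine if s < threshold) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a candidate threshold and return the
    average of FAR and FRR where they are closest (approximates the EER)."""
    best = None
    for t in sorted(set(genuine + impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0)
    return best[1]

# Toy score lists: higher means "more likely the claimed speaker".
genuine = [2.0, 2.5, 1.8, 3.0, 0.5]
impostor = [0.2, -0.5, 1.9, 0.0, -1.0]
eer = equal_error_rate(genuine, impostor)
```

The DCF generalizes this by weighting the two error rates with application-dependent costs and priors instead of forcing them to be equal.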
 
46
Our Implementation
- Won't be focused on robustness to background noise and channel
- Dataset comprised of 60 people, with 19 audio samples each, captured by the same device
- Z-norm or T-norm, since we don't have different handsets and channels
47
Prior Work
Campbell, William M., Douglas E. Sturim, and Douglas A. Reynolds. "Support vector machines using GMM supervectors for speaker verification." IEEE Signal Processing Letters, vol. 13, no. 5 (2006), pp. 308-311.
- MFCC
- GMM to generate supervectors
- KL divergence (Super L2)
- Supervector-space inner product (Super Linear)
48
Prior Work
- KL divergence
- Supervector-space inner product
- Best result: GMM Super Linear
49
References
  • Boersma, Paul and Weenink, David (2012). Praat: doing phonetics by computer. Available from http://www.fon.hum.uva.nl/praat.
  • Davis, S. and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366.
  • Deller, J. R., et al. (2000). Discrete-Time Processing of Speech Signals. IEEE Press, p. 936.
  • Gomes, M.L.C. (2014). O Uso do Programa Praat para Compreensão do Jeitinho Brasileiro de Falar Inglês: Uma Experiência de um Grupo de Estudos. IV Congresso Internacional da Abrapui (Associação Brasileira de Professores Universitários de Inglês): Language and Literature in the Age of Technology, Maceió, Alagoas.
  • Ningaal, I.Z. and Ahmad, A.M. (2006). The Fundamental of Feature Extraction in Speaker Recognition: A Review. Proceedings of the Postgraduate Annual Research Seminar. Faculty of Computer Science and Information System, University of Technology Malaysia.
  • Schroeder, M.R. (1991). Recognition of Complex Acoustic Signals. In Life Sciences Research Report 5, T.H. Bullock, Ed., p. 324, Abakon Verlag, Berlin.
  • Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands. The Journal of the Acoustical Society of America, vol. 33, no. 2, p. 248.