Automatic speaker recognition overview by Bhusan Chettri - PowerPoint PPT Presentation

About This Presentation

In this presentation, Bhusan Chettri provides an overview of voice authentication systems based on automatic speaker verification technology. He provides background on both the traditional approaches to modelling speakers and current deep learning based approaches. A brief introduction to how these systems can be manipulated is also provided.


Transcript and Presenter's Notes

Recognising a person using voice: Automatic
speaker recognition and AI

Bhusan Chettri gives an overview of the
technology behind voice authentication using a
computer.

So, what is Automatic Speaker Recognition?
Automatic Speaker Recognition is the task of
recognizing humans through their voice by using
a computer. It generally comprises two tasks:
speaker identification and speaker verification.
Speaker identification involves finding the
correct person from a given pool of known
speakers or voices. A speaker identification
system usually comprises a set of N speakers who
are already registered in the system, and only
these N speakers can access it. Speaker
verification, on the other hand, involves
verifying whether a person is who he/she claims
to be using their voice sample. These systems
are further classified into two categories
depending upon the level of user cooperation:
(1) text dependent and (2) text independent. In
text dependent applications, the system has
prior knowledge of the spoken text and therefore
expects the same utterance at test time (the
deployment phase). For example, a pass-phrase
such as "My voice is my password" will be used
both during speaker enrollment (registration)
and during deployment (when the system is
running). In contrast, text independent systems
have no prior knowledge of the lexical content,
and are therefore much more complex than text
dependent ones.
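
The difference between identification and verification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each speaker is taken to be already summarised by a fixed-length vector, and the names `enrolled`, `identify` and `verify`, the toy vectors, the cosine similarity score and the 0.8 threshold are all invented for the example rather than taken from the presentation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(test_vec, enrolled):
    """Identification: pick the best match among the N registered speakers."""
    return max(enrolled, key=lambda name: cosine(test_vec, enrolled[name]))

def verify(test_vec, claimed, enrolled, threshold=0.8):
    """Verification: accept or reject one claimed identity."""
    return cosine(test_vec, enrolled[claimed]) >= threshold

# Hypothetical speaker models (toy 3-dimensional vectors).
enrolled = {
    "alice": [1.0, 0.1, 0.0],
    "bob": [0.0, 1.0, 0.2],
    "carol": [0.1, 0.0, 1.0],
}
test_vec = [0.9, 0.2, 0.05]  # toy vector for an incoming utterance

print("identified as:", identify(test_vec, enrolled))
print("claim 'alice' accepted:", verify(test_vec, "alice", enrolled))
print("claim 'bob' accepted:", verify(test_vec, "bob", enrolled))
```

Identification compares the test sample against every registered speaker, whereas verification compares it against a single claimed identity and needs a threshold to decide.
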
So how does the speaker verification algorithm
work? How are such systems trained and deployed?
Bhusan Chettri explains that the first thing
needed to build automatic speaker recognition
systems is data: a large amount of speech
collected from hundreds or thousands of speakers
across varied acoustic conditions. A picture
helps to illustrate the methodology, as a
picture is worth a thousand words. The block
diagram shown below summarises a typical speaker
verification system. It consists of a speaker
enrollment phase (Fig a) and a speaker
verification phase (Fig b).
The role of a feature extraction module is to
transform the raw speech signal into some
representation (features) that retains speaker
specific attributes useful to the downstream
components in building speaker models. The
enrollment phase comprises offline and online
modes of building models. During the offline
mode, background models are trained on features
computed from a large speech collection
representing a diverse population of speakers.
The online mode comprises building a target
speaker model using features computed from the
target speaker's speech. Usually, training the
target speaker model from scratch is avoided
because learning reliable model parameters
requires a sufficiently large amount of speech
data, which is usually not available for every
individual speaker. To overcome this, the
parameters of a pretrained background model
representing the speaker population are adapted
using the speaker data yielding a reliable
speaker model estimate. During the speaker
verification phase, for a given test speech
utterance, the claimed speaker's model and the
background model (representing the world of all
other possible speakers) are used to derive a
confidence score. The decision logic module then
makes a binary decision: it either accepts the
claimed identity as a genuine speaker or rejects
it as an impostor based on some decision
threshold.
(a) Speaker enrollment phase. The goal here is to
build speaker specific models by adapting a
background model which is trained on a large
speech database.
(b) Speaker verification phase. For a given
speech utterance the system obtains a
verification score and makes a decision whether
to accept or reject the claimed identity.
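
The two phases can be sketched with a toy one-dimensional Gaussian mixture model. This is a simplified illustration, not the presenter's implementation: it assumes a background model that is already trained, adapts only the component means (a common simplification), scores with an average log-likelihood ratio, and uses an assumed relevance factor of 16 and an assumed threshold of 0. All function names and data values are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def component_logs(x, weights, means, variances):
    """Log of (weight * density) for every mixture component."""
    return [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame under a GMM (log-sum-exp for stability)."""
    logs = component_logs(x, weights, means, variances)
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Enrollment: shift the background means towards the speaker's frames."""
    counts = [0.0] * len(means)  # soft frame counts per component
    sums = [0.0] * len(means)    # posterior-weighted sums of frames
    for x in frames:
        logs = component_logs(x, weights, means, variances)
        top = max(logs)
        post = [math.exp(v - top) for v in logs]
        total = sum(post)
        for c, p in enumerate(post):
            counts[c] += p / total
            sums[c] += (p / total) * x
    adapted = []
    for c, ubm_mean in enumerate(means):
        if counts[c] == 0.0:
            adapted.append(ubm_mean)  # no data seen: keep the background mean
            continue
        alpha = counts[c] / (counts[c] + relevance)  # data vs. prior weight
        adapted.append(alpha * sums[c] / counts[c] + (1.0 - alpha) * ubm_mean)
    return adapted

def llr_score(frames, speaker_means, weights, ubm_means, variances):
    """Verification: average log-likelihood ratio, speaker model vs. background."""
    ratios = [gmm_loglik(x, weights, speaker_means, variances)
              - gmm_loglik(x, weights, ubm_means, variances) for x in frames]
    return sum(ratios) / len(ratios)

# Toy background model, assumed to have been trained offline.
ubm_weights, ubm_means, ubm_vars = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

# Enrollment frames for a hypothetical target speaker.
enrol = [1.8, 2.1, 2.4, 1.9, 2.2, -1.6, -1.4, -1.5] * 5
speaker_means = map_adapt_means(enrol, ubm_weights, ubm_means, ubm_vars)

genuine = llr_score([2.0, 2.3, -1.5, 1.9], speaker_means,
                    ubm_weights, ubm_means, ubm_vars)
impostor = llr_score([0.2, -0.4, 0.5, -0.1], speaker_means,
                     ubm_weights, ubm_means, ubm_vars)

threshold = 0.0  # assumed operating point
for name, score in [("genuine", genuine), ("impostor", impostor)]:
    print(name, round(score, 3), "accept" if score >= threshold else "reject")
```

Because the speaker model starts from the background model, even a small amount of enrollment speech yields usable parameters: components that see little data stay close to their background values.
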
How has the state of the art changed, driven by
big data and AI? Bhusan Chettri explains that
there has been a big paradigm shift in the way we
build these systems. To bring clarity on this,
Dr. Bhusan Chettri summarises the recent
advances in the state of the art in two broad
categories: (1) traditional approaches and (2)
deep learning (and big data) approaches.

Traditional methods. By traditional methods he
refers to approaches driven by the Gaussian
mixture model - universal background model
(GMM-UBM) that were adopted in the automatic
speaker verification (ASV) literature until deep
learning techniques became popular in the field.
Mel-frequency cepstral coefficients (MFCCs) were
popular frame-level feature representations
used in speaker verification. From short-term
MFCC feature vectors, utterance-level features
such as i-vectors are often derived, and these
have shown state-of-the-art performance in
speaker verification. Background models such as
the universal background model (UBM) and the
total variability (T) matrix are learned in an
offline phase using a large collection of speech
data.
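
The frame-level MFCC features mentioned above can be sketched in plain Python. This is a deliberately simplified version for illustration: it omits steps such as pre-emphasis and liftering, uses a slow naive DFT instead of an FFT, and the frame size, hop, filter count and test signal are assumed values. Practical systems would rely on an optimised signal processing library.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def split_frames(signal, size, hop):
    """Cut the signal into overlapping short-term frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    """Hamming-window the frame, then take a naive DFT power spectrum."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):    # rising edge of the triangle
            filt[k] = (k - left) / (centre - left)
        for k in range(centre, right):   # falling edge of the triangle
            filt[k] = (right - k) / (right - centre)
        bank.append(filt)
    return bank

def mfcc(signal, sample_rate, size=128, hop=64, n_filters=20, n_ceps=13):
    """Return one vector of n_ceps cepstral coefficients per frame."""
    bank = mel_filterbank(n_filters, size, sample_rate)
    features = []
    for frame in split_frames(signal, size, hop):
        spec = power_spectrum(frame)
        # Log mel filter bank energies (small constant avoids log(0)).
        log_e = [math.log(sum(f * s for f, s in zip(filt, spec)) + 1e-10)
                 for filt in bank]
        # DCT-II decorrelates the log energies into cepstral coefficients.
        ceps = [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                    for j, e in enumerate(log_e))
                for c in range(n_ceps)]
        features.append(ceps)
    return features

# Toy signal: 0.1 s of two sine tones sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate)
          + 0.5 * math.sin(2.0 * math.pi * 1200.0 * t / sample_rate)
          for t in range(800)]
features = mfcc(signal, sample_rate)
print(len(features), "frames x", len(features[0]), "coefficients")
```

The output is a sequence of short vectors, one per frame, so its length depends on the utterance duration. Utterance-level representations such as i-vectors exist to turn that variable-length sequence into a single fixed-length vector.
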
The UBM and T matrix are used to compute
i-vector representations (an i-vector is simply
a fixed-length vector representing a
variable-length speech utterance). The training
process involves learning model (target or
background) parameters from training data. As
for modelling techniques, vector quantization
(VQ) was one of the earliest approaches used to
represent a speaker, after which Gaussian
mixture models (GMMs), an extension of VQ
methods, and support vector machines became
popular methods for speaker modelling. The
traditional approach also includes training an
i-vector extractor (GMM-UBM, T matrix) on MFCCs
and using a probabilistic linear discriminant
analysis (PLDA) backend for scoring.

Deep learning methods. In deep learning
based approaches for ASV, features are often
learned in a data-driven manner directly from
the raw speech signal or from some intermediate
speech representation such as filter bank
energies. Handcrafted features, for example
MFCCs, are also often used as input to train deep
neural network (DNN) based ASV systems. Features
learned from DNNs are often used to build
traditional ASV systems. Researchers have used
the output from the penultimate layer of a
pre-trained DNN as features to train a
traditional i-vector PLDA setup (replacing
i-vectors with DNN features). Extracting
bottleneck features (the output of a hidden layer
with a relatively small number of units) from a
DNN to train a GMM-UBM system, which uses the
log-likelihood ratio for scoring, is also
common. Utterance-level discriminative
features, so-called embeddings, extracted from
pre-trained DNNs have become popular recently,
demonstrating good results. End-to-end modelling
approaches have also been extensively studied in
speaker verification, showing promising results.
In this setting, both feature learning and model
training are jointly optimised from the raw
speech input. A wide range of neural
architectures has been studied for speaker
verification. These include feed-forward neural
networks, commonly referred to as deep neural
networks (DNNs), convolutional neural networks
(CNNs), recurrent neural networks, and attention
models. Training background models in deep
learning approaches can be thought of as a
pretraining phase where network parameters are
trained on a large dataset. Speaker models are
then derived by adapting the pretrained model
parameters using speaker-specific data, in much
the same way as in a traditional GMM-UBM system.
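
To make the idea of utterance-level embeddings concrete, the sketch below pools frame-level network outputs into one fixed-length vector and compares two utterances with cosine similarity. Everything here is an assumption for illustration: the "pre-trained DNN" is replaced by a random linear layer with a ReLU, mean and standard deviation pooling is just one common choice that the presentation does not specify, and the data are random numbers rather than speech features.

```python
import math
import random

def frame_network(frame, weights):
    """Stand-in for a pre-trained DNN layer: linear projection plus ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, frame))) for row in weights]

def embed(frames, weights):
    """Pool frame-level outputs (mean and std) into one fixed-length vector."""
    hidden = [frame_network(f, weights) for f in frames]
    n, dim = len(hidden), len(weights)
    means = [sum(h[d] for h in hidden) / n for d in range(dim)]
    stds = [math.sqrt(sum((h[d] - means[d]) ** 2 for h in hidden) / n)
            for d in range(dim)]
    return means + stds  # length is 2 * dim, whatever the utterance length

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

random.seed(0)
feat_dim, hidden_dim = 5, 8

# Hypothetical "pre-trained" weights (random here, learned in practice).
weights = [[random.gauss(0.0, 1.0) for _ in range(feat_dim)]
           for _ in range(hidden_dim)]

# Two toy utterances of different lengths (30 and 75 frames).
short_utt = [[random.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(30)]
long_utt = [[random.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in range(75)]

emb_short = embed(short_utt, weights)
emb_long = embed(long_utt, weights)
print(len(emb_short), len(emb_long))
print("similarity:", round(cosine(emb_short, emb_long), 3))
```

Both utterances map to vectors of the same length even though they contain different numbers of frames, which is what allows a simple backend such as cosine similarity or PLDA to compare them.
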
So, Dr. Bhusan Chettri, tell us: where is this
technology being used? What are its
applications? It can be used across a wide range
of domains, such as (a) access control:
voice-based access control systems; (b) banking
applications, for authenticating a transaction;
(c) personalisation in mobile devices, and
locking/unlocking a vehicle door (or starting
and stopping the engine) for a specific user.
Are they safe and secure? Are they prone to any
manipulation when they are deployed? Bhusan
Chettri further explains that although current
advances in algorithms, aided by big data, have
shown remarkable state-of-the-art results, these
systems are not 100% secure. They are prone to
spoofing attacks, in which an attacker
manipulates a voice to sound like a registered
user and gains illegitimate access to the
system. The ASV community has recently been
promoting a significant amount of research in
this direction.
References
  • 1. Bhusan Chettri, scholar and personal website.
  • 2. M. Sahidullah et al. Introduction to Voice
    Presentation Attack Detection and Recent
    Advances, 2019.
  • 3. Bhusan Chettri. Voice biometric system
    security: Design and analysis of countermeasures
    for replay attacks. PhD thesis, Queen Mary
    University of London, August 2020.
  • 4. ASVspoof: The automatic speaker verification
    spoofing and countermeasures challenge website.
