Fooling a computer that is trained to recognise human voices - overview by Bhusan Chettri - PowerPoint PPT Presentation

About This Presentation
Title:

Fooling a computer that is trained to recognise human voices - overview by Bhusan Chettri

Description:

In this article Dr. Bhusan Chettri provides an overview of how voice authentication systems can be compromised through spoofing attacks. He adds: "a spoofing attack refers to the process of making an unauthorised attempt to break into someone else's authentication system, either using synthetic voices produced through AI technology, by performing mimicry, or by simply replaying pre-recorded voice samples of the target user."


Transcript and Presenter's Notes



1
Voice authentication systems: are they secure? Can AI be used to fool them?
Bhusan Chettri explains how voice authentication systems can be fooled using AI and how they can be protected. Although today's speaker verification systems, driven by deep learning and big data, show superior performance in verifying a speaker, they are not secure: they are prone to spoofing attacks. In this article Dr. Bhusan Chettri gives an overview of the technology used for spoofing a voice authentication system that uses automatic speaker verification (ASV) technology.
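To make the threat concrete, the sketch below shows, in Python, the core decision an ASV system makes: compare a test utterance against an enrolled speaker and accept the claim if a similarity score crosses a threshold. This is a minimal illustration and an assumption about the general shape of such systems, not any particular deployed one; the embeddings, dimensionality and threshold are placeholders, and real systems extract embeddings with deep neural networks.

# Minimal sketch (not any specific system): the accept/reject decision at the
# core of speaker verification. Embeddings are assumed to come from some
# pre-trained deep model; here they are just NumPy vectors.
import numpy as np

def cosine_score(enrolled_embedding: np.ndarray, test_embedding: np.ndarray) -> float:
    """Similarity between the enrolled speaker and the test utterance."""
    num = float(np.dot(enrolled_embedding, test_embedding))
    den = float(np.linalg.norm(enrolled_embedding) * np.linalg.norm(test_embedding))
    return num / den

def verify(enrolled_embedding, test_embedding, threshold=0.7):
    """Accept the claimed identity if the score exceeds a tuned threshold.
    A spoofing attack succeeds when a fake utterance pushes this score
    above the threshold (a false acceptance)."""
    return cosine_score(enrolled_embedding, test_embedding) >= threshold

# Toy usage with random vectors standing in for real speaker embeddings.
rng = np.random.default_rng(0)
enrolled = rng.standard_normal(192)
test = enrolled + 0.1 * rng.standard_normal(192)   # same speaker, slight variation
print(verify(enrolled, test))                       # True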
Spoofing attacks in ASV: an overview by Dr Bhusan Chettri
A spoofing attack (or presentation attack) involves illegitimate access to the personal data of a targeted user. These attacks are performed on a biometric system to provoke an increase in its false acceptance rate. The security threats posed by such attacks are now well acknowledged within the speech community. As identified in the ISO/IEC 30107-1 standard, a biometric system can potentially be attacked at nine different points. Fig. 1 provides a summary of this. The first two attack points are of specific interest, as they are particularly vulnerable in terms of enabling an adversary to inject spoofed biometric data. These two points are commonly referred to as physical access (PA) and logical access (LA) attacks. As illustrated in the figure, PA attacks involve a presentation attack at the sensor (the microphone, in the case of ASV), while LA attacks involve modifying biometric samples to bypass the sensor. Text-to-speech and voice conversion techniques are used to produce artificial speech that bypasses an ASV system; these two methods are examples of LA attacks. On the other hand, mimicry and playing back speech recordings (replay) are examples of PA attacks.
2
Figure 1: Possible locations (ISO/IEC, 2016) to attack an ASV system: (1) microphone point, (2) transmission point, (3) override feature extractor, (4) modify features, (5) override classifier, (6) modify speaker database, (7) modify biometric reference, (8) modify score, and (9) override decision.
Below, Bhusan Chettri provides a brief summary of the four different spoofing methods used to fool an ASV system.
1. Mimicry (or impersonation). This form of attack involves an attacker attempting to modify their voice characteristics to sound like a target speaker. In other words, the attacker aims to transform their lexical and prosodic properties to sound as close as possible to the target speaker. This form of attack can therefore be highly effective when the attacker's voice is already similar to the target speaker's, since less effort is required to adjust the attacker's voice than in cases where the two voices are less alike. In other words, the success of mimicry attacks often depends on the quality of the impersonated voice, suggesting that professional impersonators may be better at mimicking a target speaker's voice than inexperienced impersonators. Research has shown that successful attackers were able to shift their F0 (fundamental frequency), and sometimes their formants, close to those of the target speaker.
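As a rough illustration of that last point, the hedged sketch below compares the mean F0 of an impersonation attempt against a target speaker's recording using librosa's pyin pitch tracker. The file names are placeholders, and this single-number comparison is far simpler than what real analyses of mimicry attacks involve.

# Hedged sketch: comparing an impersonator's mean F0 to a target speaker's.
# File names are placeholders; librosa's pyin pitch tracker extracts F0.
import numpy as np
import librosa

def mean_f0(path: str, sr: int = 16000) -> float:
    y, sr = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    return float(np.nanmean(f0[voiced_flag]))        # average over voiced frames only

target_f0 = mean_f0("target_speaker.wav")            # placeholder paths
attacker_f0 = mean_f0("impersonation_attempt.wav")
print(f"F0 gap: {abs(target_f0 - attacker_f0):.1f} Hz")   # smaller gap = closer mimicry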
2. Speech synthesis
3
Speech synthesis, or text-to-speech (TTS), is a method for generating speech from a given text input that sounds as natural and intelligible as possible. It has a wide range of applications, including spoken dialogue systems, speech-to-speech translation, assisting people with vocal disorders, and automatic e-book reading, to name a few. Text analysis and speech waveform generation are the two main components of a typical TTS system. The text analysis component analyses the input text and produces a sequence of phonemes defining the linguistic specification of the text. Using these phonemes, the speech waveform generation module produces the speech waveform. In end-to-end deep learning frameworks, however, speech waveforms are generated directly from the input text.
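The toy sketch below mirrors that two-stage structure: text analysis producing a phoneme sequence, followed by waveform generation. It is purely illustrative and entirely made up: the tiny lexicon stands in for a real grapheme-to-phoneme front end, and the sine tones stand in for a learned acoustic model and vocoder.

# Toy illustration (hypothetical, not a real TTS engine) of the two classic
# TTS stages described above: text analysis -> phonemes -> waveform.
import numpy as np

# 1) Text analysis: a minimal hand-made lexicon standing in for a real
#    grapheme-to-phoneme front end.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_analysis(text: str) -> list:
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON.get(word, ["SIL"]))      # unknown words -> silence
    return phonemes

# 2) Waveform generation: map each phoneme to a short tone. A real back end
#    would predict spectral features and run a neural vocoder instead.
def generate_waveform(phonemes: list, sr: int = 16000) -> np.ndarray:
    t = np.linspace(0, 0.1, int(0.1 * sr), endpoint=False)   # 100 ms per phoneme
    segments = [np.sin(2 * np.pi * (200 + 20 * (hash(p) % 30)) * t) for p in phonemes]
    return np.concatenate(segments)

audio = generate_waveform(text_analysis("hello world"))
print(audio.shape)   # (12800,) for 8 phonemes at 16 kHz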
3. Voice conversion. Voice conversion (VC) aims at converting the voice of one speaker into that of another. In the context of ASV spoofing, the source voice corresponds to the attacker and is converted to that of a target speaker to fool an ASV system. Typical VC systems operate directly on the speech signals of the source and target speakers, using a parallel corpus of the two speakers (speaking the same utterances) on which a transformation function is learned to convert the attacker's acoustic parameters to those of the target speaker. Applications of VC technology include producing natural-sounding voices for people with speech disabilities and voice dubbing in the entertainment industry, to name a few.
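The sketch below illustrates that recipe in its simplest possible form: an affine mapping from source features to target features, fit by least squares on frames that are assumed to be already time-aligned. This is a simplification for illustration only; real systems would align frames with dynamic time warping and learn GMM- or neural-network-based mappings, and the random arrays here merely stand in for aligned acoustic features such as MFCCs.

# Simplified sketch of the parallel-corpus VC recipe described above:
# learn a map from attacker (source) features to target-speaker features.
import numpy as np

def fit_linear_conversion(source_feats: np.ndarray, target_feats: np.ndarray) -> np.ndarray:
    """source_feats, target_feats: (n_frames, n_dims), assumed frame-aligned."""
    X = np.hstack([source_feats, np.ones((source_feats.shape[0], 1))])   # bias column -> affine map
    W, *_ = np.linalg.lstsq(X, target_feats, rcond=None)
    return W                                                             # (n_dims + 1, n_dims)

def convert(source_feats: np.ndarray, W: np.ndarray) -> np.ndarray:
    X = np.hstack([source_feats, np.ones((source_feats.shape[0], 1))])
    return X @ W                           # converted features, closer to the target speaker

# Toy usage with random features standing in for aligned MFCCs.
rng = np.random.default_rng(1)
src, tgt = rng.standard_normal((500, 20)), rng.standard_normal((500, 20))
W = fit_linear_conversion(src, tgt)
print(convert(src, W).shape)               # (500, 20)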
4. Replay attacks. A replay spoofing attack involves playing back recorded speech samples of a target (enrolled) speaker to bypass an ASV system. This type of attack requires physical transmission of the spoofed speech through the system's microphone, shown as point 1 in Fig. 1. Replay is the simplest form of spoofing attack: it can be implemented using smartphones and does not require specific expertise in either speech processing or machine learning. Bonafide (genuine) speech is speech spoken by a target speaker during enrolment (or the verification phase) and acquired by the ASV system's microphone. Replayed speech, on the other hand, is the signal obtained by playing back pre-recorded bonafide speech, which is then acquired by the system's microphone. The acoustic environment for the acquisition of the bonafide speech and of the replayed speech can be the same, in situations where an attacker manages to launch the attack from the same physical space. In practice, however, the acoustic space is usually different (e.g. a different closed room or office with no background noise), as an attacker would not want to risk getting caught while launching such an attack. Therefore, the factors of interest in detecting replay attacks are the changes and noise induced in the bonafide speech by the loudspeaker of the playback device, the recording device, and the acoustic environment where the replay attack is carried out.
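The hedged sketch below simulates those channel effects on a toy signal (limited loudspeaker bandwidth, room reverberation, recording noise) and compares high-frequency energy before and after the simulated replay. It is only an intuition-builder: the filter cutoff, reverb tail and noise level are made-up numbers, and real countermeasures learn such cues from data rather than relying on one fixed ratio.

# Hedged sketch: the playback loudspeaker, replay room and recording device
# all leave traces on replayed speech. Simulate a replay channel and compare
# high-frequency energy before and after.
import numpy as np
from scipy import signal

def simulate_replay(bonafide: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Loudspeaker approximated by a low-pass filter (limited bandwidth).
    b, a = signal.butter(4, 4000, btype="low", fs=sr)
    x = signal.lfilter(b, a, bonafide)
    # Replay room approximated by a short exponentially decaying reverb tail.
    rng = np.random.default_rng(0)
    rir = np.exp(-np.linspace(0, 8, int(0.2 * sr))) * rng.standard_normal(int(0.2 * sr))
    rir[0] = 1.0
    x = signal.fftconvolve(x, rir, mode="full")[: len(bonafide)]
    # Recording device noise floor.
    return x + 0.005 * rng.standard_normal(len(x))

def high_freq_ratio(x: np.ndarray, sr: int = 16000, cutoff: int = 4000) -> float:
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spec = np.abs(np.fft.rfft(x)) ** 2
    return float(spec[freqs >= cutoff].sum() / spec.sum())

rng = np.random.default_rng(2)
bonafide = rng.standard_normal(16000)                 # stand-in for one second of speech
replayed = simulate_replay(bonafide)
print(high_freq_ratio(bonafide), high_freq_ratio(replayed))   # ratio drops after replay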
4
Therefore, it is very important to secure these systems from being manipulated. For this, spoofing countermeasure solutions are often integrated within the verification pipeline, and voice spoofing countermeasures are currently an active research topic within the speech research community. In the next article, Dr Bhusan Chettri will talk more about how AI and big data can be used to design anti-spoofing solutions that protect voice authentication systems from spoofing attacks.
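One common way such integration is often described (an assumption here, not a description of any specific deployed system) is a tandem decision: the countermeasure gates the ASV decision, so an utterance is accepted only if it is judged both bonafide and spoken by the claimed speaker. A minimal sketch:

# Minimal sketch of a tandem ASV + countermeasure (CM) decision. Scores and
# thresholds are placeholders; real systems calibrate both on held-out data.
def tandem_decision(asv_score: float, cm_score: float,
                    asv_threshold: float = 0.7, cm_threshold: float = 0.5) -> bool:
    is_target = asv_score >= asv_threshold        # "same speaker" decision
    is_bonafide = cm_score >= cm_threshold        # "not spoofed" decision
    return is_target and is_bonafide

print(tandem_decision(asv_score=0.85, cm_score=0.9))   # genuine target -> accepted
print(tandem_decision(asv_score=0.85, cm_score=0.1))   # convincing spoof -> rejected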
  • References
  • 1. Bhusan Chettri: scholar and personal website.
  • 2. M. Sahidullah et al. Introduction to Voice Presentation Attack Detection and Recent Advances, 2019.
  • 3. Bhusan Chettri. Voice biometric system security: Design and analysis of countermeasures for replay attacks. PhD thesis, Queen Mary University of London, August 2020.
  • 4. ASVspoof: The automatic speaker verification spoofing and countermeasures challenge website.
