Title: Lecture 16 Speaker Recognition
1 Lecture 16 Speaker Recognition
Information College, Shandong University at Weihai
2 Definition
- Method of recognizing a person from his/her voice.
- Depends on speaker-specific characteristics.
- Determines whether a specified speaker is speaking in a given segment of speech.
- This task is the one closest to biometric identification using speech.
3 Voice is a Popular Biometric
- Voice Biometric
- Natural signal to produce
- Does not require a specialized input device
- Can be used on site or remotely
- Telephone banking, voice mail browsing, etc.
- Security
- Keys, card, ...
- Passwords, PIN, ...
- Fingerprints, voiceprints, iris prints
4 Similar Tasks
- Speaker Verification
- Extracts information from the stream of speech.
- Verifies that a person is who she/he claims to be.
- One-to-one comparison.
- Speaker Recognition
- Extracts information from the stream of speech.
- Assigns an identity to the voice of an unknown person.
- One-to-many comparison.
- Speech Recognition
- Extracts information from the stream of speech.
- Figures out what a person is saying.
5 Task of Today
- Speech Recognition
- History
- Scheme
- Speaker Features
- Methods
6 Recognition Milestones
- 1920, first electromechanical toy 'Radio Rex' (Elmwood Co.)
- Late 1940s, US Defense, automatic translation machine
- The project failed, but sparked research at MIT, CMU, and commercial institutions.
- During the 1950s, the first system capable of recognizing digits spoken over the telephone was developed by Bell Labs.
- 1962, Shoebox from IBM
- In the early 1970s, the HARPY system, capable of recognizing sentences with a limited grammar, was developed at Carnegie Mellon University.
- HARPY required as much computing power as 50 contemporary computers.
- Moreover, the system recognized discrete speech, where words are separated by longer pauses than usual.
7 Recognition Milestones
- In the 1980s, significant progress in speech recognition technology
- Word error rates continued to drop by a factor of 2 every two years.
- IBM, 1985: real-time recognition of isolated words from a set of 20,000 after 20 minutes of training, with an error rate < 5%.
- AT&T: call routing system, speaker-independent word-spotting technology, a few key phrases.
- Several very large vocabulary dictation systems
- Require speakers to pause between words.
- Better for specific domains.
- In the 1990s
- VoiceBroker deployed by Charles Schwab, a stock brokerage, in 1996.
- ViaVoice by IBM, first distributed with the now almost forgotten operating system OS/2, in 1996.
- 1997, Dragon introduced NaturallySpeaking, the first continuous speech recognition package.
- Today
- Airline reservations with British Airways
- Train reservations for Amtrak
- Weather forecasts, telephone directory information
8 Terminology of Speech Recognition
- Speaker Dependent Recognition
- The recognition system is designed to work with
just one or a small number of individual speakers
- Speaker Independent Recognition
- These systems are designed to work with all the
speakers from a given linguistic community
9 Terminology of Speech Recognition
- Large Vocabulary Recognition
- Examples are domain-specific recognition systems, such as those used by medical consultants for dictating notes on their ward rounds.
- It is very difficult to make accurate large-vocabulary, speaker-independent systems.
- Small Vocabulary Recognition
- Typically recognition of a few keywords, such as digits or a set of commands.
- Example: voice-operated telephone number dialing
10 Terminology of Speech Recognition
- Isolated Word Recognition
- Systems which can only recognize individual words that are preceded and followed by relatively long periods of silence
- Connected Word Recognition
- Systems which can recognize a limited sequence of words spoken in succession (e.g. "ninety-eight thirty-five four thousand")
- Continuous Speech Recognition
- These systems can recognize speech as it occurs, in real time. Such systems usually work with large vocabularies, but with moderate accuracy.
11 Speech Recognition Scheme
- Three steps are performed in ANY speech recognition system (a minimal sketch follows below):
- Feature extraction
- Measurement of similarity
- Decision making
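The following is a minimal sketch of how these three steps might be wired together; the function names and the crude log-energy feature and threshold are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def extract_features(signal, sr):
    """Step 1: derive a compact representation of the waveform
    (here just per-frame log energy, purely for illustration;
    real systems use MFCCs plus delta coefficients)."""
    frame = int(0.025 * sr)  # 25 ms frames
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, frame)]
    return np.array([[np.log(np.sum(f ** 2) + 1e-10)] for f in frames])

def similarity(test_pattern, reference_pattern):
    """Step 2: measure similarity between the test pattern and one reference
    pattern (negative mean frame distance; real systems use DTW, HMMs, ...)."""
    n = min(len(test_pattern), len(reference_pattern))
    return -np.mean(np.linalg.norm(test_pattern[:n] - reference_pattern[:n], axis=1))

def decide(test_pattern, references, threshold=-5.0):
    """Step 3: pick the best-matching reference word and accept/reject it."""
    scores = {word: similarity(test_pattern, ref) for word, ref in references.items()}
    best_word = max(scores, key=scores.get)
    return best_word if scores[best_word] > threshold else None  # None = reject
```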
12 Recognition Systems
[Block diagram: speech → feature extraction → test pattern → pattern matching against reference patterns → decision rule → accept/reject]
- Feature extraction derives a compact representation of the speech waveform, e.g. cepstral coefficients c_0(t), ..., c_M(t) together with their delta (Δc) and delta-delta (Δ²c) coefficients.
- Pattern matching finds the word with the greatest similarity to the input speech; it is constrained in many ways, e.g. by the rules of language (grammar), spelling, and possible pronunciations.
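As a sketch of the feature-extraction box, the cepstral coefficients c_0(t), ..., c_M(t) and their Δ and Δ² streams can be computed, for example, with the librosa library; the file name and parameter values below are assumptions:

```python
import numpy as np
import librosa

# Load an utterance at 16 kHz (file name is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: roughly c_0(t) ... c_12(t).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Delta and delta-delta (acceleration) coefficients.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# One feature vector per frame: 39 dimensions (13 static + 13 delta + 13 delta-delta).
features = np.vstack([mfcc, delta, delta2]).T
print(features.shape)  # (num_frames, 39)
```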
13 Speech Model Features
14 Speaker Recognition Features
- The features are low-level speech signal representation parameters that convey complete information about the signal.
- High-level characteristics like accent, intonation, etc. are encoded within this representation in a very complex and cryptic manner.
- The features contain speaker-dependent components.
- Uniqueness and permanence of the features are problematic.
15 Questions
- Do features that uniquely characterize people exist?
- Uniqueness and permanence of most of the features used in biometric systems have not been proven.
- Is the human ability to identify a person a limit that no automatic system can overcome?
- Automated systems might be able to identify people better than the average person can. In practice, however, expert systems do not perform the task better than the experts who built them.
16 Questions
- How important are the algorithms, versus the knowledge of features and their relationships, for achieving high identification accuracy?
- Knowledge of features and their relationships is fundamental for accurate biometric systems. The algorithms play an important but still secondary role in the process, as no algorithm can compensate for the lack of adequate features.
17 Speaker Models
- Used to represent the speaker-specific information conveyed in the feature vectors
- Several different modeling techniques have been applied:
- Template Matching
- Nearest Neighbor
- Neural Networks
- Hidden Markov Models
- State-of-the-art speaker recognition algorithms are based on statistical models of short-term acoustic measurements of the input speech signal.
18 Speaker Models
- The first and earliest idea: use long-term averages of acoustic features (spectrum, pitch); see the sketch below.
- The goal is to average out the factors causing intra-speaker variation and leave only the speaker-dependent component.
- Drawback: requires a long speech utterance (> 20 s)
- Training a speaker-dependent (SD) model for each speaker:
- Explicit segmentation: HMM
- Implicit segmentation: VQ, GMM
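A minimal sketch of the long-term-average idea, assuming each utterance has already been converted to a matrix of per-frame feature vectors (e.g. MFCCs, as in the earlier snippet); names such as `enroll_utterances` are hypothetical:

```python
import numpy as np

def long_term_average(frames):
    """Average the per-frame feature vectors over the whole utterance,
    washing out phonetic content and keeping a speaker-level summary."""
    return frames.mean(axis=0)

def identify(test_frames, enrolled):
    """enrolled: dict speaker_name -> long-term average vector.
    Returns the enrolled speaker whose average is closest (Euclidean)."""
    probe = long_term_average(test_frames)
    return min(enrolled, key=lambda spk: np.linalg.norm(probe - enrolled[spk]))

# Enrollment from (hypothetical) long utterances, one per speaker:
# enrolled = {name: long_term_average(frames) for name, frames in enroll_utterances.items()}
```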
19 Speaker Models
- HMM
- Advantage: text-independent
- Drawback: a significant increase in computational complexity
- VQ
- Advantage: unsupervised clustering
- Drawback: text-dependent
- GMM
- Advantages: text-independent, probabilistic framework (robust), computationally efficient, easy to implement (a sketch follows below)
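A sketch of GMM-based text-independent speaker identification using scikit-learn, assuming `train_frames` and `test_frames` hold per-frame feature matrices (hypothetical names); practical systems typically add a universal background model and MAP adaptation on top of this:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_frames, n_components=16):
    """train_frames: dict speaker_name -> (num_frames, dim) feature matrix.
    Fit one diagonal-covariance GMM per enrolled speaker."""
    models = {}
    for name, frames in train_frames.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(frames)
        models[name] = gmm
    return models

def identify(test_frames, models):
    """Score the test utterance against every speaker GMM and return the
    speaker with the highest average per-frame log-likelihood."""
    scores = {name: gmm.score(test_frames) for name, gmm in models.items()}
    return max(scores, key=scores.get)
```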
20 Speaker Models
- Discriminative Neural Network
- Models the decision function that best discriminates between speakers (a sketch follows below)
- Advantage
- Fewer parameters and higher performance compared to the VQ model.
- Drawback
- The network must be retrained when a new speaker is added to the system.
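A minimal sketch of the discriminative idea using a small scikit-learn neural network; `frames_per_speaker` and the layer size are illustrative assumptions, and retraining from scratch when a speaker is added reflects the drawback above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_discriminative_model(frames_per_speaker):
    """frames_per_speaker: dict speaker_name -> (num_frames, dim) feature matrix.
    Train one network that directly separates the enrolled speakers;
    adding a new speaker means rebuilding X, y and retraining."""
    names = sorted(frames_per_speaker)
    X = np.vstack([frames_per_speaker[n] for n in names])
    y = np.concatenate([[i] * len(frames_per_speaker[n]) for i, n in enumerate(names)])
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X, y)
    return clf, names

def identify(test_frames, clf, names):
    """Average the per-frame class probabilities and pick the most likely speaker."""
    probs = clf.predict_proba(test_frames).mean(axis=0)
    return names[int(np.argmax(probs))]
```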
21 Progress
[Chart: "State of the Art Speech Recognition" — word error rate (Easy to Hard tasks) over 1985–1995; the set of dominant modeling techniques grows from VQ and NN, to HMM/VQ/NN, to GMM/HMM/VQ/NN.]
22 VQ Example
[Figure: codebooks for Speaker A and Speaker B in a two-dimensional acoustic space (Acoustic Space 1 vs. Acoustic Space 2); the test sample has less distortion for A than for B.]
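A minimal sketch of the VQ decision in the figure, using k-means codebooks from scikit-learn; `frames_A`, `frames_B`, and `test_frames` are hypothetical per-frame feature matrices:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(frames, codebook_size=32):
    """Cluster a speaker's training frames; the cluster centroids form the codebook."""
    return KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(frames)

def avg_distortion(test_frames, codebook):
    """Quantize each test frame to its nearest codeword and average the distances."""
    dists = codebook.transform(test_frames)  # distances to every codeword
    return dists.min(axis=1).mean()          # nearest-codeword distortion

# Decision rule from the figure: pick the speaker whose codebook
# gives the smaller average distortion on the test sample.
# codebook_A, codebook_B = train_codebook(frames_A), train_codebook(frames_B)
# speaker = "A" if avg_distortion(test_frames, codebook_A) < avg_distortion(test_frames, codebook_B) else "B"
```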
23 HMM Example
Each word in the vocabulary is represented by phonemes. Each phoneme is modeled as an HMM. A word model is constructed by combining the HMMs for its phonemes (a sketch follows below).
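A minimal numpy sketch of building a word model by chaining left-to-right phoneme HMMs; the state counts, probabilities, and the example phoneme sequence are made-up illustrations, and only the transition structure is shown (emission distributions omitted):

```python
import numpy as np

def phoneme_hmm(n_states=3, p_stay=0.6):
    """Left-to-right transition matrix for one phoneme: each state either
    stays (p_stay) or moves to the next state (1 - p_stay)."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay
        if i + 1 < n_states:
            A[i, i + 1] = 1.0 - p_stay
    return A

def word_hmm(phonemes):
    """Concatenate phoneme HMMs into one word-level transition matrix:
    the exit of each phoneme feeds the first state of the next one."""
    blocks = [phoneme_hmm() for _ in phonemes]
    n = sum(b.shape[0] for b in blocks)
    A = np.zeros((n, n))
    offset = 0
    for k, b in enumerate(blocks):
        s = b.shape[0]
        A[offset:offset + s, offset:offset + s] = b
        if k + 1 < len(blocks):
            # Last state of this phoneme transitions into the next phoneme.
            A[offset + s - 1, offset + s] = 1.0 - A[offset + s - 1, offset + s - 1]
        offset += s
    # The final state's leftover probability mass is the word-exit probability.
    return A

# e.g. a (hypothetical) model for the word "six" as /s/ /ih/ /k/ /s/:
# A_word = word_hmm(["s", "ih", "k", "s"])
```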
24 Gaussian Mixture Model (GMM)
[Figure: GMM used at the state level in speech recognition.]
25 Gaussian Mixture Model (GMM)
[Figure: GMM used to model speaker k in speaker recognition.]
26 Limits
- The best-performing algorithms for text-independent speaker verification use Gaussian Mixture Models (GMMs), i.e. single-state HMMs.
- The linguistic structure of the speech signal is not taken into account, and all sounds are represented by a single model.
- The sequential information is ignored.
- There is a recent trend toward using high-level features.
- Large Vocabulary Continuous Speech Recognition systems
- Good results for a small set of languages
- Need huge amounts of annotated speech data (an enormous amount of time and human effort)
- Language- and task-dependent