Title: Text-Constrained Speaker Recognition
1. Text-Constrained Speaker Recognition Using Hidden Markov Models
Kofi A. Boakye, International Computer Science Institute
2. Outline
- Introduction
- Design and System Description
- Initial Results
- System Enhancements
- More words
- Higher order cepstra
- Cepstral Mean Subtraction
- Conclusions
- Future Work
3. Introduction
- Speaker Recognition Problem: determine whether a spoken segment was produced by the putative target speaker
- Also referred to as Speaker Verification/Authentication
4. Introduction
Method of solution requires two phases
- Similar to speech recognition, though what was noise there (inter-speaker variability) is now the signal
[Diagram: training phase and testing phase, with claimed identity "Sally"]
5. Introduction
- Also like speech recognition, different domains exist
- Two major divisions:
  - Text-dependent/Text-constrained
    - Highly constrained text spoken by the person
    - Examples: fixed phrase, prompted phrase
  - Text-independent
    - Unconstrained text spoken by the person
    - Example: conversational speech
6. Introduction
- Text-dependent systems can have high performance because of the input constraints
  - More of the acoustic variation arises from speaker distinctions (vs. phones)
- Text-independent systems have greater flexibility
7. Introduction
Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?
Answer: Yes!
8. Introduction
Idea: Limit the words of interest to a select group
- Words should have high frequency in the domain
- Words should have high speaker-discriminative quality
What kinds of words match these criteria for conversational speech?
1) Discourse markers (like, well, now)
2) Filled pauses (um, uh)
3) Backchannels (yeah, right, uhhuh, ...)
These words are fairly spontaneous and represent an involuntary speaking style (Heck, WS2002)
9. Design
Likelihood Ratio Detector: Λ = p(X|S) / p(X|UBM)
- Task is a detection problem, so use a likelihood ratio detector
- In implementation, the log-likelihood ratio is used
- Decision rule: Λ > T accept, Λ < T reject
[Diagram: signal → Feature Extraction → Speaker Model (obtained by adapting the Background Model) and Background Model; the two model scores form Λ, which is compared against the threshold T]
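The detector is straightforward to express in code. Below is a minimal sketch of frame-averaged log-likelihood-ratio scoring, using scikit-learn GMMs as stand-ins for the speaker and background models (the actual system uses HMM-UBM detectors built with HTK); the function name, the default threshold, and the frame averaging are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr_detect(features, speaker_model, background_model, threshold=0.0):
    """Log-likelihood ratio detection: accept if log p(X|S) - log p(X|UBM) > T.

    features: (num_frames, num_ceps) array of cepstral features X.
    speaker_model / background_model: fitted GaussianMixture objects.
    Returns (accept, llr), with the LLR averaged over frames.
    """
    # score_samples gives the per-frame log-likelihood under each model
    llr = np.mean(speaker_model.score_samples(features)
                  - background_model.score_samples(features))
    return llr > threshold, llr
```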
10. Design
- State-of-the-art speaker recognition systems use Gaussian Mixture Models (GMMs)
- A speaker's acoustic space is represented by a many-component mixture of Gaussians
[Diagram: mixture densities in feature space for speaker 1 and speaker 2]
11. Design
- Speaker models are obtained via adaptation of a Universal Background Model (UBM)
  - Probabilistically align the target training data to the UBM mixture states
  - Update the mixture weights, means, and variances based on the occupancy counts of the mixtures
- Gives very good performance, but...
[Diagram: target training data aligned to the UBM mixtures]
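A sketch of how such adaptation can be carried out is shown below: relevance-MAP adaptation of the UBM means in the Reynolds style. Only the means are adapted here, and the relevance factor of 16 is a conventional choice rather than a value taken from the slides.

```python
import numpy as np

def map_adapt_means(ubm, data, relevance=16.0):
    """Relevance-MAP adaptation of UBM mixture means.

    ubm: fitted sklearn GaussianMixture acting as the UBM.
    data: (num_frames, dim) target training features.
    Returns adapted means; weights and variances are kept from the UBM here.
    """
    # Probabilistically align frames to the UBM mixture components
    post = ubm.predict_proba(data)                 # (T, M) responsibilities
    n_k = post.sum(axis=0)                         # soft occupancy counts
    # First-order statistics and per-mixture data mean estimates
    f_k = post.T @ data                            # (M, dim)
    e_k = f_k / np.maximum(n_k, 1e-10)[:, None]
    # Interpolate UBM means toward the data means according to occupancy
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * e_k + (1.0 - alpha) * ubm.means_
```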
12. Design
- Concern: GMMs use a bag-of-frames approach
  - Frames are assumed to be independent
  - Sequential information is not really utilized
- Alternative: use HMMs
  - Do the likelihood test on the output of a recognizer, which is an accumulated log-probability score
  - A text-independent system has been analyzed (Weber et al. from Dragon Systems)
  - Let's try a text-dependent one!
13. System
Word-level HMM-UBM detectors
[Diagram: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, ..., HMM-UBM N → Combination → Λ]
Topology: left-to-right HMM with self-loops and no skips
- 4 Gaussian components per state
- Number of states related to the number of phones and the median number of frames for the word
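As a rough illustration, the sketch below builds the transition matrix of such a left-to-right topology and picks a state count from the phone count and median duration. The self-loop probability and the frames-per-state heuristic are assumptions; the slides do not give the exact rule.

```python
import numpy as np

def left_to_right_transitions(num_states, self_loop_prob=0.6):
    """Transition matrix for a left-to-right HMM with self-loops and no skips."""
    A = np.zeros((num_states, num_states))
    for s in range(num_states):
        A[s, s] = self_loop_prob
        if s + 1 < num_states:
            A[s, s + 1] = 1.0 - self_loop_prob   # advance to the next state only
        else:
            A[s, s] = 1.0                        # final state loops until exit
    return A

def num_states_for_word(num_phones, median_frames, frames_per_state=3):
    """Heuristic tying the state count to phone count and median word duration."""
    return max(num_phones,
               min(median_frames // frames_per_state, 3 * num_phones))
```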
14. System
- HMMs implemented using the HMM Toolkit (HTK)
  - Used for speech recognition
- Input features were 12 mel-cepstra, first differences, and the zeroth-order cepstrum (energy parameter)
- Adaptation: means were adapted using Maximum A Posteriori (MAP) adaptation
  - In cases of no adaptation data, the UBM was used, so the LLR score cancels (contributes zero)
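For concreteness, here is a rough stand-in for that front end using librosa (the original used HTK): 12 mel-cepstra plus their first differences plus c0. The sampling rate, window, hop, and filterbank sizes are assumptions for telephone-band speech.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """12 mel-cepstra + first differences + c0 (energy-like), one row per frame."""
    y, sr = librosa.load(wav_path, sr=8000)        # telephone-band speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=256, hop_length=80, n_mels=24)
    c0, c1_12 = mfcc[:1], mfcc[1:13]               # c0 kept separate from c1..c12
    deltas = librosa.feature.delta(c1_12)          # first differences
    return np.vstack([c1_12, deltas, c0]).T        # (num_frames, 25)
```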
15. Word Selection
13 words:
- Discourse markers: actually, anyway, like, see, well, now
- Filled pauses: um, uh
- Backchannels: yeah, yep, okay, uhhuh, right
These words account for approximately 8% of total tokens
16. Recognition Task
NIST Extended Data Evaluation
- Training on 1, 2, 4, 8, and 16 complete conversation sides; testing on one side (side duration ~2.5 mins)
- Uses the Switchboard I corpus
  - Conversational telephone speech
- Cross-validation method where the data is partitioned: test on one partition, use the others for background models and normalization
- For this project: splits 4-6 for background, split 1 for testing, with 8-conversation training
17. Scoring
LLR(X) = log p(X|S) - log p(X|UBM)
- Target score: output of the adapted HMM scoring a forced-alignment recognition of the word from the true transcripts (aligned via the SRI recognizer)
- UBM score: output of the non-adapted HMM scoring the same forced alignment
Normalizations:
- Frame normalization
- Word normalization: average of word-level frame normalizations
- N-best normalization: frame normalization on the n best-matching (i.e., highest log-probability) words
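A compact sketch of the three normalizations is given below, assuming each scored word token is represented as a (total LLR, frame count) pair; the function names and the use of per-frame scores to rank the n-best words are illustrative assumptions.

```python
import numpy as np

def frame_norm(word_scores):
    """Frame normalization: total LLR divided by the total frame count.

    word_scores: list of (llr, num_frames) pairs, one per scored word token,
    where llr = log p(X|S) - log p(X|UBM) from the forced alignment.
    """
    total_llr = sum(llr for llr, _ in word_scores)
    total_frames = sum(n for _, n in word_scores)
    return total_llr / total_frames

def word_norm(word_scores):
    """Word normalization: average of per-word frame-normalized LLRs."""
    return np.mean([llr / n for llr, n in word_scores])

def nbest_norm(word_scores, n):
    """N-best normalization: frame normalization over the n best-scoring words."""
    best = sorted(word_scores, key=lambda s: s[0] / s[1], reverse=True)[:n]
    return frame_norm(best)
```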
18. Initial Results
Observations:
1) Frame-norm result ≈ word-norm result
2) EER of n-best decreases with increasing n
   - Suggests a benefit from an increase in data
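The equal error rate (EER) used throughout is the operating point where the miss rate equals the false-alarm rate. A simple way to compute it from lists of target and impostor scores is sketched below; the threshold sweep is a basic illustration, not the NIST scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: error rate at the threshold where false alarms equal misses."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 2.0, None
    for t in thresholds:
        miss = np.mean(target_scores < t)        # targets rejected
        fa = np.mean(impostor_scores >= t)       # impostors accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer
```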
19. Initial Results
Comparable results: the Sturim et al. text-dependent GMM system yielded an EER of 1.3%
- Larger word pool (50 words)
- Channel normalization
20. Initial Results
Observations:
- EERs for most words lie in a small range around 7%
  - Suggests that the words, as a group, share some qualities
  - The last two may differ greatly, partly because of data scarcity
- Best word (yeah) yielded an EER of 4.63%, compared with 2.87% for all words combined
21. System Enhancements
22. System Enhancements: New Words
Some discourse markers and backchannels are bigrams
6 additional words (bigrams):
- Discourse markers: you_know, you_see, i_think, i_mean
- Backchannels: i_see, i_know
Total coverage of 10% with these additional words
23. System Enhancements: New Words
Results:
- EER reduced from 2.87% to 2.53%
- Significant reduction, especially given the size of the coverage increase
24. System Enhancements: New Words
Results:
- Observations:
  - Well-performing bigrams have comparable EERs
  - Poorly-performing bigrams suffer from a paucity of data
    - Suggests the possibility of a frequency threshold for performance
25. System Enhancements: More Cepstra
Idea: Higher-order cepstra may possess more variability that can be used for speaker discrimination
- Input features modified from 12 to 19 mel-cepstra
26. System Enhancements: More Cepstra
Results:
- EER reduced from 2.87% to 1.88%
27. System Enhancements: CMS
- Idea: The channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it
- Common approach is to perform Cepstral Mean Subtraction (CMS)
- Convolutional effects in the time domain become additive effects in the log power domain:
  - X(ω,t) = S(ω,t) C(ω,t)
  - log |X(ω,t)|² = log |S(ω,t)|² + log |C(ω,t)|²
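A minimal sketch of per-utterance CMS on a cepstral feature matrix follows; applying it per conversation side is a common convention, though the slides do not specify the normalization window.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: (num_frames, num_ceps). Removing the mean in the cepstral
    (log-spectral) domain cancels a stationary convolutional channel,
    since log |X|^2 = log |S|^2 + log |C|^2.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```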
28. System Enhancements: CMS
Results:
- EER reduced from 2.87% to 1.35%
- Poor performance in the low false-alarm region
  - Possibly due to the small number of data points
  - May also have removed useful channel information
29. System Enhancements: Combined System
Results:
- The "grab bag" system combining all enhancements yields an EER of 1.01%
- Suffers from the same problem of poor performance at low false-alarm rates
30. Conclusions
- Well-performing text-dependent speaker recognition in an unconstrained speech domain is very feasible
- The benefit of sequential information appears to have been established
- The benefits of higher-order cepstra and CMS as input features have been demonstrated
31. Future Work
- Analyze performance with ASR output
- Closer analysis of the relation between word frequency and performance
- More words!
- Normalizations (Hnorm, Tnorm)
- Examine the influence of word context (e.g., "well" as a discourse marker vs. as an adverb)
32. Fin