Title: Text-Constrained Speaker Recognition
1. Text-Constrained Speaker Recognition Using Hidden Markov Models
Kofi A. Boakye, International Computer Science Institute
2. Outline
- Introduction
- Design and System Description
- Initial Results
- System Enhancements
- More words
- Higher order cepstra
- Cepstral Mean Subtraction
- Conclusions
- Future Work
3. Introduction
- Speaker Recognition Problem: determine whether a spoken segment was produced by the putative target speaker
- Also referred to as Speaker Verification/Authentication
4. Introduction
Method of solution requires two phases
- Similar to speech recognition, though what was noise there (inter-speaker variability) is now the signal
[Diagram: training phase and testing phase, with claimed identity "Sally"]
5. Introduction
- Also like speech recognition, different domains exist
- Two major divisions:
  - Text-dependent/Text-constrained
    - Highly constrained text spoken by the person
    - Examples: fixed phrase, prompted phrase
  - Text-independent
    - Unconstrained text spoken by the person
    - Example: conversational speech
6. Introduction
- Text-dependent systems can have high performance because of the input constraints
  - More of the acoustic variation arises from speaker distinctions (vs. phones)
- Text-independent systems have greater flexibility
7. Introduction
Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?
Answer: Yes!
8. Introduction
Idea: Limit the words of interest to a select group
- Words should have high frequency in the domain
- Words should have high speaker-discriminative quality
What kinds of words match these criteria for conversational speech?
1) Discourse markers (like, well, now)
2) Filled pauses (um, uh)
3) Backchannels (yeah, right, uhhuh, ...)
These words are fairly spontaneous and represent an involuntary speaking style (Heck, WS2002)
9. Design
Likelihood Ratio Detector: Λ = p(X|S) / p(X|UBM)
- Task is a detection problem, so use a likelihood ratio detector
- In implementation, the log-likelihood ratio is used
- Decision rule: Λ > T accept, Λ < T reject
[Diagram: signal → Feature Extraction → Speaker Model (obtained by adapting the Background Model) and Background Model; the two model scores form Λ, which is compared against the threshold T]
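The detector is straightforward to express in code. Below is a minimal sketch of frame-averaged log-likelihood-ratio scoring, using scikit-learn GMMs as stand-ins for the speaker and background models (the actual system uses HMM-UBM detectors built with HTK); the function name, the default threshold, and the frame averaging are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def llr_detect(features, speaker_model, background_model, threshold=0.0):
    """Log-likelihood ratio detection: accept if log p(X|S) - log p(X|UBM) > T.

    features: (num_frames, num_ceps) array of cepstral features X.
    speaker_model / background_model: fitted GaussianMixture objects.
    Returns (accept, llr), with the LLR averaged over frames.
    """
    # score_samples gives the per-frame log-likelihood under each model
    llr = np.mean(speaker_model.score_samples(features)
                  - background_model.score_samples(features))
    return llr > threshold, llr
```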
10. Design
- State-of-the-art speaker recognition systems use Gaussian Mixture Models (GMMs)
- A speaker's acoustic space is represented by a many-component mixture of Gaussians
[Diagram: mixture densities in feature space for speaker 1 and speaker 2]
11. Design
- Speaker models are obtained via adaptation of a Universal Background Model (UBM)
  - Probabilistically align the target training data to the UBM mixture states
  - Update the mixture weights, means, and variances based on the occupancy counts of the mixtures
- Gives very good performance, but...
[Diagram: target training data aligned to the UBM mixtures]
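A sketch of how such adaptation can be carried out is shown below: relevance-MAP adaptation of the UBM means in the Reynolds style. Only the means are adapted here, and the relevance factor of 16 is a conventional choice rather than a value taken from the slides.

```python
import numpy as np

def map_adapt_means(ubm, data, relevance=16.0):
    """Relevance-MAP adaptation of UBM mixture means.

    ubm: fitted sklearn GaussianMixture acting as the UBM.
    data: (num_frames, dim) target training features.
    Returns adapted means; weights and variances are kept from the UBM here.
    """
    # Probabilistically align frames to the UBM mixture components
    post = ubm.predict_proba(data)                 # (T, M) responsibilities
    n_k = post.sum(axis=0)                         # soft occupancy counts
    # First-order statistics and per-mixture data mean estimates
    f_k = post.T @ data                            # (M, dim)
    e_k = f_k / np.maximum(n_k, 1e-10)[:, None]
    # Interpolate UBM means toward the data means according to occupancy
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * e_k + (1.0 - alpha) * ubm.means_
```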
12. Design
- Concern: GMMs use a bag-of-frames approach
  - Frames are assumed to be independent
  - Sequential information is not really utilized
- Alternative: use HMMs
  - Do the likelihood test on the output of a recognizer, which is an accumulated log-probability score
  - A text-independent system has been analyzed (Weber et al. from Dragon Systems)
  - Let's try a text-dependent one!
13. System
Word-level HMM-UBM detectors
[Diagram: signal → Word Extractor → HMM-UBM 1, HMM-UBM 2, ..., HMM-UBM N → Combination → Λ]
Topology: left-to-right HMM with self-loops and no skips
- 4 Gaussian components per state
- Number of states related to the number of phones and the median number of frames for the word
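As a rough illustration, the sketch below builds the transition matrix of such a left-to-right topology and picks a state count from the phone count and median duration. The self-loop probability and the frames-per-state heuristic are assumptions; the slides do not give the exact rule.

```python
import numpy as np

def left_to_right_transitions(num_states, self_loop_prob=0.6):
    """Transition matrix for a left-to-right HMM with self-loops and no skips."""
    A = np.zeros((num_states, num_states))
    for s in range(num_states):
        A[s, s] = self_loop_prob
        if s + 1 < num_states:
            A[s, s + 1] = 1.0 - self_loop_prob   # advance to the next state only
        else:
            A[s, s] = 1.0                        # final state loops until exit
    return A

def num_states_for_word(num_phones, median_frames, frames_per_state=3):
    """Heuristic tying the state count to phone count and median word duration."""
    return max(num_phones,
               min(median_frames // frames_per_state, 3 * num_phones))
```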
14. System
- HMMs implemented using the HMM Toolkit (HTK)
  - Used for speech recognition
- Input features were 12 mel-cepstra, first differences, and the zeroth-order cepstrum (energy parameter)
- Adaptation: means were adapted using Maximum A Posteriori (MAP) adaptation
  - In cases of no adaptation data, the UBM was used, so the LLR score cancels (contributes zero)
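For concreteness, here is a rough stand-in for that front end using librosa (the original used HTK): 12 mel-cepstra plus their first differences plus c0. The sampling rate, window, hop, and filterbank sizes are assumptions for telephone-band speech.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """12 mel-cepstra + first differences + c0 (energy-like), one row per frame."""
    y, sr = librosa.load(wav_path, sr=8000)        # telephone-band speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=256, hop_length=80, n_mels=24)
    c0, c1_12 = mfcc[:1], mfcc[1:13]               # c0 kept separate from c1..c12
    deltas = librosa.feature.delta(c1_12)          # first differences
    return np.vstack([c1_12, deltas, c0]).T        # (num_frames, 25)
```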
15. Word Selection
13 words:
- Discourse markers: actually, anyway, like, see, well, now
- Filled pauses: um, uh
- Backchannels: yeah, yep, okay, uhhuh, right
These words account for approximately 8% of total tokens
16. Recognition Task
NIST Extended Data Evaluation
- Training on 1, 2, 4, 8, and 16 complete conversation sides; testing on one side (side duration ~2.5 mins)
- Uses the Switchboard I corpus
  - Conversational telephone speech
- Cross-validation method where the data is partitioned: test on one partition, use the others for background models and normalization
- For this project: splits 4-6 for background, split 1 for testing, with 8-conversation training
17. Scoring
LLR(X) = log p(X|S) - log p(X|UBM)
- Target score: output of the adapted HMM scoring a forced-alignment recognition of the word from the true transcripts (aligned via the SRI recognizer)
- UBM score: output of the non-adapted HMM scoring the same forced alignment
Normalizations:
- Frame normalization
- Word normalization: average of word-level frame normalizations
- N-best normalization: frame normalization on the n best-matching (i.e., highest log-probability) words
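A compact sketch of the three normalizations is given below, assuming each scored word token is represented as a (total LLR, frame count) pair; the function names and the use of per-frame scores to rank the n-best words are illustrative assumptions.

```python
import numpy as np

def frame_norm(word_scores):
    """Frame normalization: total LLR divided by the total frame count.

    word_scores: list of (llr, num_frames) pairs, one per scored word token,
    where llr = log p(X|S) - log p(X|UBM) from the forced alignment.
    """
    total_llr = sum(llr for llr, _ in word_scores)
    total_frames = sum(n for _, n in word_scores)
    return total_llr / total_frames

def word_norm(word_scores):
    """Word normalization: average of per-word frame-normalized LLRs."""
    return np.mean([llr / n for llr, n in word_scores])

def nbest_norm(word_scores, n):
    """N-best normalization: frame normalization over the n best-scoring words."""
    best = sorted(word_scores, key=lambda s: s[0] / s[1], reverse=True)[:n]
    return frame_norm(best)
```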
18. Initial Results
Observations:
1) Frame-norm result ≈ word-norm result
2) EER of n-best decreases with increasing n
   - Suggests a benefit from an increase in data
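The equal error rate (EER) used throughout is the operating point where the miss rate equals the false-alarm rate. A simple way to compute it from lists of target and impostor scores is sketched below; the threshold sweep is a basic illustration, not the NIST scoring tool.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: error rate at the threshold where false alarms equal misses."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 2.0, None
    for t in thresholds:
        miss = np.mean(target_scores < t)        # targets rejected
        fa = np.mean(impostor_scores >= t)       # impostors accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer
```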
19. Initial Results
Comparable results: the Sturim et al. text-dependent GMM system yielded an EER of 1.3%
- Larger word pool (50 words)
- Channel normalization
20. Initial Results
Observations:
- EERs for most words lie in a small range around 7%
  - Suggests that the words, as a group, share some qualities
  - The last two may differ greatly, partly because of data scarcity
- Best word (yeah) yielded an EER of 4.63%, compared with 2.87% for all words combined
21. System Enhancements
22. System Enhancements: New Words
Some discourse markers and backchannels are bigrams
6 additional words (bigrams):
- Discourse markers: you_know, you_see, i_think, i_mean
- Backchannels: i_see, i_know
Total coverage of 10% with these additional words
23. System Enhancements: New Words
Results:
- EER reduced from 2.87% to 2.53%
- Significant reduction, especially given the size of the coverage increase
24. System Enhancements: New Words
Results:
- Observations:
  - Well-performing bigrams have comparable EERs
  - Poorly-performing bigrams suffer from a paucity of data
    - Suggests the possibility of a frequency threshold for performance
25. System Enhancements: More Cepstra
Idea: Higher-order cepstra may possess more variability that can be used for speaker discrimination
- Input features modified from 12 to 19 mel-cepstra
26. System Enhancements: More Cepstra
Results:
- EER reduced from 2.87% to 1.88%
27. System Enhancements: CMS
- Idea: The channel response may introduce undesirable variability (e.g., the same speaker on different handsets), so try to remove it
- Common approach is to perform Cepstral Mean Subtraction (CMS)
- Convolutional effects in the time domain become additive effects in the log power domain:
  - X(ω,t) = S(ω,t) C(ω,t)
  - log |X(ω,t)|² = log |S(ω,t)|² + log |C(ω,t)|²
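A minimal sketch of per-utterance CMS on a cepstral feature matrix follows; applying it per conversation side is a common convention, though the slides do not specify the normalization window.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: (num_frames, num_ceps). Removing the mean in the cepstral
    (log-spectral) domain cancels a stationary convolutional channel,
    since log |X|^2 = log |S|^2 + log |C|^2.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```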
28. System Enhancements: CMS
Results:
- EER reduced from 2.87% to 1.35%
- Poor performance in the low false-alarm region
  - Possibly due to the small number of data points
  - May also have removed useful channel information
29. System Enhancements: Combined System
Results:
- The "grab bag" system combining all enhancements yields an EER of 1.01%
- Suffers from the same problem of poor performance at low false-alarm rates
30. Conclusions
- Well-performing text-dependent speaker recognition in an unconstrained speech domain is very feasible
- The benefit of sequential information appears to have been established
- The benefits of higher-order cepstra and CMS as input features have been demonstrated
31. Future Work
- Analyze performance with ASR output
- Closer analysis of the relation between word frequency and performance
- More words!
- Normalizations (Hnorm, Tnorm)
- Examine the influence of word context (e.g., "well" as a discourse marker vs. as an adverb)
32. Fin