Vowel and Consonant Confusion in Noise: Analysis and Comparison of Acoustic Models using Statistical Signal Processing Techniques

Jeremiah J. Remus and Leslie M. Collins
Department of Electrical and Computer Engineering, Duke University, Durham, NC
Poster TU19
INTRODUCTION
Cochlear implants have been shown to restore hearing, with varying degrees of success, in severely deafened patients. As a result of the limited frequency resolution possible with cochlear implants, certain speech tokens are confused in patterns that depend on the ambient noise level. In our previous work, speech recognition data for vowels and consonants were collected from a group of normal-hearing subjects over a range of noise levels using two acoustic cochlear implant models. The models used during the listening experiments were an eight-analysis, eight-presentation-filter (8F) model analogous to the Continuous Interleaved Sampling (CIS) speech processor, and a twenty-analysis, six-presentation-filter (6/20F) model similar to the Spectral Peak (SPEAK) speech processor. Information transmission analysis [Miller and Nicely, J. Acoust. Soc. Am. 27, 338-352 (1955)] was performed to assess patterns of information loss with increasing noise level and to compare these patterns across models and noise mitigation algorithms. Information transmission analyses support the phenomenological bases for confusions, but do not attempt to identify the signal structure, or the change in signal structure with noise, that is responsible for the underlying confusions. An analytic approach that explains token confusions in terms of signal structure could be useful in the design and analysis of such listening experiments. The intention of this study is to model trends in the speech recognition results from these listening experiments using several statistics-based methods. The methods for generating predictions of speech recognition scores are all based on signal processing operations that use information present in the speech waveform for different signal and acoustic model configurations. The results of the analyses were arranged as confusion matrices and compared with the listening experiment results to investigate similarities between the theoretical and experimental outcomes. Results indicate that some signal and prediction models produce confusion matrices that share performance characteristics with the listening experiment results. The robustness of these results is also considered.
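To make the information transmission analysis concrete, here is a minimal Python sketch that estimates relative information transmission from a confusion matrix of response counts (rows: stimuli, columns: responses). The function name and the maximum-likelihood probability estimates are our own assumptions; Miller and Nicely's per-feature groupings are not reproduced here.

```python
import numpy as np

def relative_information_transmission(counts):
    """Estimate relative information transmitted, I(X;Y) / H(X),
    from a confusion matrix of counts (rows: stimuli, cols: responses)."""
    p = counts / counts.sum()                  # joint probabilities p(i, j)
    px = p.sum(axis=1, keepdims=True)          # stimulus marginals p(i)
    py = p.sum(axis=0, keepdims=True)          # response marginals p(j)
    nz = p > 0                                 # skip zero cells to avoid log(0)
    mutual_info = np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz]))
    stimulus_entropy = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return mutual_info / stimulus_entropy
```

Applied to a confusion matrix like those shown below, values near 1 indicate nearly lossless transmission and values near 0 indicate responses largely independent of the stimulus.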
Methods Used for Confusion Prediction

- Token Envelope Cross Correlation
  - Features: unmapped discrete envelopes
  - Decision metric: cross correlation between token envelopes
- Dynamic Time Warping
  - Features: mapped discrete envelopes and mapped mel-cepstrum coefficients
  - Differences between feature vectors calculated using Euclidean distance
  - Decision metric: least cost mapping through the distance matrix (see the sketch after this list)
- Hidden Markov Models
  - Features: mapped mel-cepstrum coefficients
  - Model parameters: M (number of Gaussian mixtures), Q (number of model states)
  - Performance evaluated two ways: (1) log likelihood of real tokens; (2) log likelihood of observations generated from the HMMs
  - Decision metric: log likelihood
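As a sketch of the dynamic time warping decision metric listed above, the following Python function computes a least-cost mapping through a matrix of Euclidean frame-to-frame distances. The function name dtw_distance, the classic step pattern, and the absence of path normalization are assumptions; the poster does not specify these details.

```python
import numpy as np

def dtw_distance(a, b):
    """Least-cost alignment of two feature sequences (frames x dims)
    using Euclidean frame-to-frame distances."""
    n, m = len(a), len(b)
    # Local distance matrix: Euclidean distance between every pair of frames.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Classic step pattern: diagonal match, insertion, deletion.
            cost[i, j] = d[i - 1, j - 1] + min(cost[i - 1, j - 1],
                                               cost[i - 1, j],
                                               cost[i, j - 1])
    return float(cost[n, m])
```

A small least-cost value indicates two tokens whose feature trajectories align closely, and hence a pair more likely to be confused.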
Listening Experiment Results
- 8F Acoustic Model
  - Eight analysis / eight presentation filters (p-of-p)
  - Logarithmically spaced from 150 Hz to 6450 Hz
- 6/20F Acoustic Model
  - Six presentation / twenty analysis filters (n-of-m)
  - Linearly spaced below 1600 Hz, logarithmically spaced above 1600 Hz, spanning 250 Hz to 10,823 Hz
- Speech processed in 2-millisecond windows
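As a small illustration of the 8F filter spacing described above, the sketch below generates logarithmically spaced band edges over 150 to 6450 Hz. The use of np.geomspace, and of band edges rather than center frequencies, is an assumption, since the poster does not give the exact filter definitions.

```python
import numpy as np

def log_spaced_band_edges(low_hz=150.0, high_hz=6450.0, n_bands=8):
    """Edges of n_bands logarithmically spaced analysis filters,
    spanning the 8F model's 150-6450 Hz range."""
    return np.geomspace(low_hz, high_hz, n_bands + 1)

print(np.round(log_spaced_band_edges(), 1))  # 9 edges for 8 contiguous bands
```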
Near Prediction of Confusions
- Ideally, the prediction methods would reproduce the listening experiment trends:
  - 6/20F vowel tokens more separable than 8F vowel tokens
  - Consonant tokens for both models separable at nearly the same level
  - Consonants about as separable as 8F vowels
- Dynamic time warping provided the most reliable results
[Figures (Listening Experiment Results): "Vowel Recognition Scores" and "Consonant Recognition Scores" plot percent correct versus SNR (dB) and in quiet for the 8F and 6/20F models, with statistically significant differences marked; companion panels rank tokens from least to most recognized and highlight the least and most confused tokens.]
[Figure: Average confusion distance between tokens, computed using DTW; a large distance between tokens suggests confusions are less likely.]
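A brief usage sketch of how the average confusion distance in this figure could be assembled from the dtw_distance function sketched earlier; averaging over all exemplar pairs is an assumed scheme, as the poster does not specify the averaging.

```python
import numpy as np

def average_confusion_distance(tokens_a, tokens_b):
    """Mean DTW distance over all exemplar pairs from two token classes;
    a large average distance suggests the pair is less likely confused."""
    return float(np.mean([dtw_distance(a, b)
                          for a in tokens_a for b in tokens_b]))
```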
Near prediction: one of the two (three for consonants) most confused tokens matches one of the two/three predicted most confused tokens. The hidden Markov model used for all prediction performance measures has 3 Gaussian mixtures and 6 states, and calculates the log likelihood using real tokens (method 1).
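To make this configuration concrete, here is a minimal sketch of method 1 (log likelihood of real tokens) using the hmmlearn library; the library choice, function names, and data layout (lists of mel-cepstral frame arrays per token) are assumptions, not the poster's implementation.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # assumed library; the poster names none

def score_tokens(train_tokens, test_tokens):
    """Train one GMM-HMM per token (M = 3 mixtures, Q = 6 states, as above)
    and score test tokens by log likelihood (method 1: real tokens)."""
    models = {}
    for token, seqs in train_tokens.items():
        X = np.vstack(seqs)                    # stack mel-cepstral frame arrays
        lengths = [len(s) for s in seqs]       # frames per training exemplar
        models[token] = GMMHMM(n_components=6, n_mix=3).fit(X, lengths)
    # Log likelihood of each test token under every token model; tokens whose
    # scores are similar across models are candidates for confusion.
    return {token: {name: model.score(x) for name, model in models.items()}
            for token, x in test_tokens.items()}
```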
6/20F Vowel Confusion Matrix (rows: played token; columns: response)
had hawed head heard heed hid hood hud who'd
had 459 2 74 2 0 2 0 0 1
hawed 1 529 1 1 1 1 1 4 1
head 47 0 404 1 4 75 2 2 5
heard 0 3 2 532 0 0 2 0 1
heed 1 0 5 3 508 7 3 3 10
hid 6 0 20 1 5 506 1 0 1
hood 1 5 0 3 0 2 496 12 21
hud 1 35 0 17 0 0 1 486 0
who'd 1 11 4 1 1 1 80 5 436
8F Vowel Confusion Matrix (rows: played token; columns: response)

had hawed head heard heed hid hood hud who'd
had 296 33 142 18 6 10 9 22 4
hawed 3 480 4 12 1 2 5 32 1
head 54 14 289 30 14 88 22 19 10
heard 13 10 36 364 13 16 32 18 38
heed 13 1 25 22 352 54 24 6 43
hid 22 6 103 26 36 294 19 13 21
hood 6 22 15 81 17 9 311 22 57
hud 16 136 7 11 3 3 12 344 8
who'd 7 8 3 56 25 5 72 11 353
Performance Predicting Token Recognition Rankings

- Comparison of token recognition rates predicted using the three methods (discrete envelopes, dynamic time warping, hidden Markov models) with results from the listening experiment (% correct). Token length was also included to observe its relationship to, and potential effect on, the methods.
- The pattern of vowel recognition rates was best predicted by hidden Markov models. Consonants were equally well predicted by hidden Markov models and dynamic time warping.
6/20F Consonant Confusion Matrix (rows: played token; columns: response)
b d f g j k m n p s sh t v z
b 120 14 86 9 2 14 106 25 66 3 2 39 48 6
d 4 343 4 77 38 16 2 7 5 3 20 10 8 3
f 28 2 162 6 0 47 105 23 57 8 7 20 67 8
g 11 21 10 369 7 34 5 52 5 4 5 11 4 2
j 1 3 0 8 511 6 1 2 1 2 3 2 0 0
k 1 0 2 17 2 444 3 22 1 2 2 37 7 0
m 40 12 39 8 0 1 216 33 21 2 3 13 128 24
n 9 29 8 36 24 17 15 352 8 5 3 27 5 2
p 31 5 59 14 11 13 143 14 164 6 1 56 13 10
s 0 1 1 0 0 1 0 1 0 500 7 0 0 29
sh 2 1 1 1 5 1 0 0 1 2 524 1 0 1
t 6 6 3 15 8 85 5 110 14 11 9 261 2 5
v 77 14 26 56 4 13 48 14 36 1 1 8 236 6
z 3 3 1 6 5 4 1 1 1 13 7 4 0 491
8F Consonant Confusion Matrix (rows: played token; columns: response)

b d f g j k m n p s sh t v z
b 119 24 95 23 8 17 31 31 50 26 9 30 56 21
d 12 275 7 75 34 17 8 8 6 11 7 12 12 56
f 52 27 157 20 8 20 30 38 47 28 3 21 62 27
g 16 24 12 362 26 15 14 12 13 8 9 9 14 6
j 3 9 9 13 442 3 5 5 9 6 8 14 4 10
k 20 17 15 11 3 391 9 19 18 4 2 21 7 3
m 10 12 22 8 8 20 264 118 8 11 1 14 32 12
n 5 10 4 10 8 5 77 365 10 7 5 14 15 5
p 34 7 109 8 2 42 25 20 179 31 1 45 33 4
s 1 1 3 1 4 4 6 3 1 376 79 5 2 54
sh 3 4 4 3 33 6 4 3 1 30 436 5 2 6
t 15 19 68 17 9 80 8 45 51 25 4 177 13 9
v 36 34 22 16 5 21 37 14 18 1 5 17 278 36
z 5 13 7 7 11 4 4 5 2 27 8 7 12 428
CONCLUSIONS

- The results show that statistical signal
processing methods utilizing the speech waveform
can identify trends in token confusion.
Consideration of other factors influencing
subject responses, such as noise characteristics
and experimental setup, might allow for more
accurate prediction of confusions.
- The data support the hypothesis that the mel-cepstrum representation of the speech signal contains sufficient information about potential token confusions to indicate trends. However, the temporal envelope was not much better for confusion prediction than token length.
This research is supported by the National
Science Foundation grant NSF-BES-00-85370