Title: SPEECH DECOMPOSITION AND INTELLIGIBILITY
1. SPEECH ENHANCEMENT BASED ON TRANSIENT SPEECH INFORMATION
Sungyub Yoo (1), J. Robert Boston (1,2), John D. Durrant (2),
Kristie Kovacyk (2), Stacey Karn (2), Susan Shaiman (2),
Amro El-Jaroudi (1), Ching-Chung Li (1)
Departments of (1) Electrical and Computer Engineering and
(2) Communication Science and Disorders, University of Pittsburgh,
Pittsburgh, PA 15261, USA
Support from Grant N000140310277 from the Office
of Naval Research
2. Introduction
- The purpose of this project was to investigate the role of speech transitions in speech intelligibility.
- Transitions around consonants, between vowels and consonants, and within vowels are important acoustic cues for speech intelligibility.
- These transitions are difficult to isolate using fixed-frequency filters.
- A time-varying, data-adaptive filter algorithm was developed to emphasize transitions in speech and to overcome the limitations of traditional fixed-filter analysis.
- The effect of amplifying these transitions on speech intelligibility was examined.
3. Speech Decomposition
- $S(t) = S_{QSS}(t) + S_{tran}(t)$
- The quasi-steady-state (QSS) component, $S_{QSS}(t)$, includes most of the energy of the speech, primarily energy in vowels and hubs of consonants.
- The transient component, $S_{tran}(t)$, is intended to capture the energy of transitions between vowels and consonants and within vowels.
- In this presentation:
  1. The energy and intelligibility of $S_{QSS}(t)$ and $S_{tran}(t)$ are compared to those of the original speech.
  2. The intelligibility of the original speech is compared to that of speech enhanced by adding an amplified $S_{tran}(t)$ component.
4. Overview of Algorithm
- Speech is highpass filtered at 700 Hz.
- The QSS speech component is obtained as the sum of the outputs of three adaptive time-varying filters, designed to extract the steady-state segments of the largest formant components from the highpass filtered speech.
- The transient component is obtained as the difference between the highpass filtered speech and the QSS component.
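The summation and subtraction steps of this overview can be sketched in a few lines of Python. This is only an illustration: it assumes the highpass filtered speech and the three filter outputs are already available as equal-length lists of samples, and the function name `split_qss_transient` and its arguments are hypothetical. The adaptive time-varying filters themselves are not implemented here.

```python
def split_qss_transient(highpass, formant_outputs):
    """Return (qss, transient) from highpass speech and per-filter outputs.

    highpass: samples of the highpass filtered speech (assumed given).
    formant_outputs: one list of samples per time-varying filter
    (three in the slides); assumed given, not computed here.
    """
    n = len(highpass)
    if any(len(out) != n for out in formant_outputs):
        raise ValueError("all signals must have the same length")
    # QSS component: sample-wise sum of the filter outputs.
    qss = [sum(out[i] for out in formant_outputs) for i in range(n)]
    # Transient component: highpass speech minus the QSS component.
    transient = [highpass[i] - qss[i] for i in range(n)]
    return qss, transient
```

By construction, adding the two returned components sample by sample gives back the highpass filtered speech.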
5. The time-varying filter
- Each time-varying filter combines an all-zero filter (AZF), a dynamic tracking filter (DTF) developed by Rao and Kumaresan, and a time-varying bandpass filter.
- The output of each time-varying bandpass filter is intended to estimate activity in one speech formant.
- The AZF is updated from the FM information of the other time-varying filters.
- FM and envelope information estimated by linear prediction in spectral domain (LPSD) analysis is used to set the center frequency and bandwidth of each time-varying bandpass filter.
6. Decomposition Details
- Center frequency of the time-varying bandpass filter
  - Based on FM information from LPSD
- Bandwidth of the time-varying bandpass filter
  - Based on envelope information from LPSD
- Parameters required for the time-varying bandpass filters
  - Maximum bandwidth: 900 Hz
  - Filter activation threshold: 15 dB
- Parameters were selected to remove as much of the QSS energy from the original speech as possible, while maintaining reasonable intelligibility in the transient component.
7-8. Synthetic Signal Example
[Figure: waveforms (left) and spectrograms (right) of the decomposed synthetic signal. Panels: original speech, QSS component, transient component.]
9. Decomposition Example: "Pike"
[Figure: speech waveforms (left), spectrograms (middle), and speech waveforms with equalized energy (right) for the word "Pike" spoken by a female speaker. Labels: Algorithm output, Original, HPF, QSS, Transient.]
10. Energy and Intelligibility in Components
- 300 CVC words (from the NU-6 list) were highpass filtered and processed to obtain QSS and transient components.
(p < 0.05 for pairwise comparisons with other components)
- Word recognition rates were measured in 5 subjects as speech amplitude was increased above auditory threshold.
- PBmax (asymptotic word recognition) was based on a fit to an ogive for each subject.
- The QSS component had a significantly lower PBmax than the other components, while the transient component had approximately the same PBmax as the original and highpass components.
11. Speech Enhancement
- Motivation for speech enhancement
  - The transition information has low energy and is particularly susceptible to noise. It is critical to speech perception.
  - Selectively amplifying this component may improve the recognition of speech in noise.
- $S_{enh}(t) = k\,[S_{orig}(t) + 12\,S_{tran}(t)]$
- Procedure
  - Speech sounds were decomposed, and the transient component was amplified and then recombined with the original (base) speech.
  - Energy adjustment constant $k$: the energy of the enhanced speech was adjusted to equal the energy of the original speech.
  - The intelligibility of these two speech versions was evaluated using the modified rhyme protocol.
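The recombination and energy adjustment can be sketched as follows. This is a minimal illustration, assuming that the enhanced speech is the original plus the transient component scaled by 12, with a single constant k applied to the sum so that its energy matches that of the original. The names `energy` and `enhance` are hypothetical, and energy is taken here as the sum of squared samples.

```python
import math

def energy(x):
    """Sum of squared samples (the energy measure assumed in this sketch)."""
    return sum(v * v for v in x)

def enhance(original, transient, gain=12.0):
    """Add the amplified transient to the original, then rescale the sum
    so its energy equals that of the original. Returns (enhanced, k)."""
    if len(original) != len(transient):
        raise ValueError("signals must have the same length")
    mixed = [o + gain * t for o, t in zip(original, transient)]
    e_mixed = energy(mixed)
    if e_mixed == 0.0:
        # Nothing to rescale; k is reported as 0 by convention here.
        return mixed, 0.0
    # Energy adjustment constant k.
    k = math.sqrt(energy(original) / e_mixed)
    return [k * m for m in mixed], k
```

Because k scales the whole sum, the original speech is attenuated slightly whenever adding the amplified transient raises the total energy.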
12. Psychoacoustic Procedure
- Subjects sat in a sound-attenuated booth, and test words were delivered monaurally through headphones.
- A trial used one set of six words, where each word in the set rhymes with the others.
- At the beginning of the trial, a target word appeared on the computer monitor and remained until all six rhyming words in the set were presented.
- The subjects were asked to click on OK as soon as they heard the target word.
13. Psychoacoustic Test - Speech Enhancement Test
- Test material: three hundred monosyllabic words (150 sets for original speech and 150 sets for enhanced speech) spoken by a male speaker.
- Test words were presented at six different SNR levels (-25 dB, -20 dB, -15 dB, -10 dB, -5 dB, and 0 dB) in speech-weighted background noise.
- Eleven volunteer subjects with negative otologic histories and hearing sensitivity of 15 dB HL or better by conventional audiometry (250 Hz to 8 kHz) were tested.
- The percentage of words correctly recognized in each condition was recorded for each subject, and paired differences between recognition of original and enhanced speech were analyzed.
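Presenting a word at a given SNR amounts to scaling the noise relative to the speech before mixing. The sketch below shows one common way to do this, assuming SNR is defined as the ratio of mean speech power to mean noise power over the whole signal. The slides do not state how SNR was computed or how the speech-weighted noise was generated, so the function names and this definition are assumptions.

```python
import math

def mean_power(x):
    """Mean of squared samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech power / noise power equals snr_db (in dB),
    then add it to the speech. Returns (mixture, scaled_noise)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must not be silent")
    # Noise power needed for the requested SNR.
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / p_noise)
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise
```

Calling `mix_at_snr(word, noise, -15.0)` would produce a mixture in which the noise power is about 31.6 times the speech power.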
14. Paired Difference Results - Speech Enhancement
Differences (enhanced speech minus original speech) of means and 95% confidence intervals of word recognition rates (p < 0.05)
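A paired-difference mean and its 95% confidence interval can be computed per SNR level as sketched below. The slides do not say how the intervals were obtained; this sketch assumes a Student's t interval on the per-subject differences. The function name is hypothetical, and the default critical value 2.228 is the two-sided 95% value for 10 degrees of freedom, so it applies only to eleven paired scores.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 10 degrees of freedom
# (eleven subjects). Assumption: a t-based interval is appropriate.
T_CRIT_DF10 = 2.228

def paired_difference_ci(enhanced, original, t_crit=T_CRIT_DF10):
    """Return (mean_difference, (lower, upper)) for paired scores.

    enhanced, original: per-subject recognition rates at one SNR level,
    listed in the same subject order.
    """
    if len(enhanced) != len(original):
        raise ValueError("score lists must have the same length")
    diffs = [e - o for e, o in zip(enhanced, original)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = t_crit * sem
    return mean_diff, (mean_diff - half_width, mean_diff + half_width)
```

An interval that excludes zero corresponds to a difference that is significant at the 0.05 level under these assumptions.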
15. Actual Word Recognition Rates - Speech Enhancement
Means and 95% confidence intervals of word recognition rates for original (blue) and enhanced (red) speech
16.
- A new dynamic method to emphasize transition information in speech has been developed.
- The algorithm isolates a component of the speech signal (the QSS component) that appears not to be critical to intelligibility.
- Results suggest that transient components make a significant contribution to speech intelligibility.
- Emphasis of transient information can enhance speech in noise at low SNRs.