SPEECH DECOMPOSITION AND INTELLIGIBILITY

1
SPEECH ENHANCEMENT BASED ON TRANSIENT SPEECH INFORMATION
Sungyub Yoo¹, J. Robert Boston¹,², John D. Durrant², Kristie Kovacyk², Stacey Karn², Susan Shaiman², Amro El-Jaroudi¹, Ching-Chung Li¹
Departments of ¹Electrical and Computer Engineering and ²Communication Science and Disorders, University of Pittsburgh, Pittsburgh, PA 15261, USA
Supported by Grant N000140310277 from the Office of Naval Research.
2
Introduction
  • The purpose of this project was to investigate
    the role of speech transitions in speech
    intelligibility.
  • Transitions around consonants, between vowels and
    consonants, and within vowels are important
    acoustic cues for speech intelligibility.
  • These transitions are difficult to isolate using
    fixed-frequency filters.
  • A time-varying, data-adaptive filter algorithm
    was developed to emphasize transitions in speech,
    overcoming the limitations of traditional
    fixed-filter analysis.
  • The effects of amplifying these transitions on
    speech intelligibility were examined.

3
Speech Decomposition
  • $S(t) = S_{\mathrm{QSS}}(t) + S_{\mathrm{tran}}(t)$, where
    $S(t)$ is the highpass-filtered speech.
  • The quasi-steady-state (QSS) component,
    $S_{\mathrm{QSS}}(t)$, includes most of the energy of
    the speech, primarily the energy in vowels and in
    the hubs of consonants.
  • The transient component, $S_{\mathrm{tran}}(t)$, is
    intended to capture the energy of transitions
    between vowels and consonants and within vowels.
  • In this presentation:
      1. The energy and intelligibility of
         $S_{\mathrm{QSS}}(t)$ and $S_{\mathrm{tran}}(t)$ are
         compared with those of the original speech.
      2. The intelligibility of the original speech is
         compared with that of speech enhanced by adding
         an amplified $S_{\mathrm{tran}}(t)$ component.

4
Overview of Algorithm
  • Speech is highpass filtered at 700 Hz.
  • The QSS speech component is obtained as the sum
    of the outputs of three adaptive time-varying
    filters, designed to extract the steady-state
    segments of the largest formant components from
    the highpass filtered speech.
  • The transient component is obtained as the
    difference between the highpass filtered speech
    and the QSS component.
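The decomposition steps above can be sketched in Python with NumPy. This is a minimal sketch, not the authors' implementation: the highpass filter is assumed to be an ideal FFT brick-wall filter (the slides give only the 700 Hz cutoff), and the three adaptive time-varying filters are replaced by hypothetical fixed bands (`fixed_band`) purely to show how the components are assembled. Because the transient component is defined as a residual, the QSS and transient components always sum exactly to the highpass-filtered speech.

```python
import numpy as np

def fft_bandlimit(x, fs, lo_hz=0.0, hi_hz=None):
    """Zero-phase brick-wall filter: keep only FFT bins with lo_hz <= f < hi_hz."""
    x = np.asarray(x, dtype=float)
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    hi_hz = np.inf if hi_hz is None else hi_hz
    X[(f < lo_hz) | (f >= hi_hz)] = 0.0
    return np.fft.irfft(X, n=len(x))

def highpass_700(x, fs):
    """Assumption: an ideal 700 Hz highpass stands in for the study's HPF."""
    return fft_bandlimit(x, fs, lo_hz=700.0)

def fixed_band(lo_hz, hi_hz):
    """Hypothetical fixed stand-in for one adaptive time-varying formant filter."""
    return lambda x, fs: fft_bandlimit(x, fs, lo_hz, hi_hz)

def decompose(x, fs, qss_filters):
    """Return (highpass, QSS, transient), with transient = highpass - QSS."""
    s_hpf = highpass_700(x, fs)
    # QSS component: sum of the filter outputs applied to the highpass speech
    s_qss = sum(filt(s_hpf, fs) for filt in qss_filters)
    # Transient component: whatever the QSS filters did not capture
    s_tran = s_hpf - s_qss
    return s_hpf, s_qss, s_tran
```

For example, `decompose(x, fs, [fixed_band(700, 1400), fixed_band(1400, 2500), fixed_band(2500, 3500)])` returns three arrays; in the actual algorithm each filter would instead follow one formant over time.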

5
The time-varying filter
  • Each time-varying filter combines an all-zero
    filter (AZF), a dynamic tracking filter (DTF) -
    developed by Rao and Kumaresan - and a
    time-varying bandpass filter.
  • The output of each time-varying bandpass filter
    is intended to estimate activity in one speech
    formant.
  • The AZF is updated using the FM information from
    the other time-varying filters.
  • FM and envelope information, estimated by linear
    prediction in the spectral domain (LPSD), is used
    to set the center frequency and bandwidth of each
    time-varying bandpass filter.
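A sketch of the all-zero filter idea, under stated assumptions: following our reading of Rao and Kumaresan's approach, we assume the AZF ahead of one filter places a conjugate pair of zeros on the unit circle at each frequency currently tracked by the other filters, so competing formants are suppressed before tracking. The unit-gain normalization at the filter's own frequency is also an assumption, and the names `azf_coeffs` and `all_zero_filter` are hypothetical.

```python
import numpy as np

def azf_coeffs(other_freqs_hz, fs):
    """FIR taps with a conjugate zero pair on the unit circle at each given frequency."""
    a = np.array([1.0])
    for f in other_freqs_hz:
        theta = 2.0 * np.pi * f / fs
        # 1 - 2cos(theta) z^-1 + z^-2 has zeros at exp(+j*theta) and exp(-j*theta)
        a = np.convolve(a, [1.0, -2.0 * np.cos(theta), 1.0])
    return a

def all_zero_filter(x, other_tracks_hz, own_track_hz, fs):
    """Hypothetical time-varying AZF.

    other_tracks_hz: shape (len(x), K), frequencies tracked by the K other filters.
    own_track_hz:    shape (len(x),), this filter's frequency (used only for gain).
    """
    x = np.asarray(x, dtype=float)
    other_tracks_hz = np.asarray(other_tracks_hz, dtype=float)
    own_track_hz = np.asarray(own_track_hz, dtype=float)
    n_taps = 2 * other_tracks_hz.shape[1] + 1
    xp = np.concatenate([np.zeros(n_taps - 1), x])  # assume silence before the signal
    y = np.empty_like(x)
    for n in range(len(x)):
        a = azf_coeffs(other_tracks_hz[n], fs)  # zeros follow the other tracks
        # Assumed normalization: unit magnitude response at this filter's own frequency
        w = 2.0 * np.pi * own_track_hz[n] / fs
        gain = abs(np.sum(a * np.exp(-1j * w * np.arange(n_taps))))
        gain = max(gain, 1e-12)  # guard: own frequency coinciding with a nulled one
        # y[n] = sum_m a[m] * x[n - m]
        y[n] = np.dot(a, xp[n:n + n_taps][::-1]) / gain
    return y
```

Because the zeros lie exactly on the unit circle, a steady tone at another filter's frequency is cancelled completely once the FIR window is full.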

6
Decomposition Details
  • Center frequency of each time-varying bandpass
    filter: based on FM information from LPSD.
  • Bandwidth of each time-varying bandpass filter:
    based on envelope information from LPSD.
  • Parameters required for the time-varying bandpass
    filters:
      • Maximum bandwidth: 900 Hz
      • Filter activation threshold: 15 dB
  • Parameters were selected to remove as much of
    the QSS energy from the original speech as
    possible while maintaining reasonable
    intelligibility in the transient component.
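The bandpass stage can be sketched as a two-pole resonator whose coefficients are recomputed at every sample from center-frequency and bandwidth tracks. Only the 900 Hz bandwidth cap and the 15 dB threshold come from the slide; the resonator form, the unit-gain scaling, the name `tv_bandpass`, and especially the meaning of the activation threshold (here: output is passed only where the envelope lies within 15 dB of its maximum) are illustrative assumptions. The frequency, bandwidth, and envelope tracks are taken as given inputs rather than estimated by LPSD.

```python
import numpy as np

MAX_BW_HZ = 900.0     # maximum bandwidth (from the slide)
THRESHOLD_DB = 15.0   # filter activation threshold (from the slide; meaning assumed)

def tv_bandpass(x, fc_hz, bw_hz, env_db, fs,
                max_bw_hz=MAX_BW_HZ, threshold_db=THRESHOLD_DB):
    """Hypothetical time-varying bandpass filter: a two-pole resonator
    whose center frequency and bandwidth are updated at every sample."""
    x = np.asarray(x, dtype=float)
    fc = np.asarray(fc_hz, dtype=float)
    bw = np.minimum(np.asarray(bw_hz, dtype=float), max_bw_hz)  # cap bandwidth
    env = np.asarray(env_db, dtype=float)
    # Assumed activation rule: pass output only where the envelope is
    # within threshold_db of its maximum
    active = env >= env.max() - threshold_db
    y = np.zeros_like(x)
    y1 = y2 = 0.0  # resonator state: previous two outputs
    for n in range(len(x)):
        r = np.exp(-np.pi * bw[n] / fs)       # pole radius from bandwidth
        theta = 2.0 * np.pi * fc[n] / fs      # pole angle from center frequency
        # Scale so that the magnitude response is 1 at the pole angle
        g = (1.0 - r) * np.sqrt(1.0 - 2.0 * r * np.cos(2.0 * theta) + r * r)
        yn = g * x[n] + 2.0 * r * np.cos(theta) * y1 - r * r * y2
        y2, y1 = y1, yn
        y[n] = yn if active[n] else 0.0
    return y
```

The resonator keeps running while gated off, so its state stays continuous when the filter is reactivated; that is a design choice of this sketch, not something the slides specify.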

7
Synthetic Signal Example
8
Synthetic Signal Example
[Figure: waveforms (left) and spectrograms (right) of a decomposed synthetic signal; panels: original speech, QSS component, transient component]
9
Decomposition Example: "Pike"
[Figure: speech waveforms (left), spectrograms (middle), and speech waveforms with equalized energy (right) for the word "pike" spoken by a female speaker; panels: original, highpass-filtered (HPF), QSS, and transient signals]
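The equal-energy display and the component energy comparison reduce to simple arithmetic. A minimal sketch, assuming energy means the sum of squared samples (the slides do not define it); the helper names are hypothetical. Note that the QSS and transient components are not guaranteed to be orthogonal, so their energy percentages need not add up to that of the highpass-filtered speech.

```python
import numpy as np

def energy(x):
    """Signal energy as the sum of squared samples."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x * x))

def equalize_energy(x, reference):
    """Scale x so that its energy equals that of reference (for side-by-side display)."""
    return np.asarray(x, dtype=float) * np.sqrt(energy(reference) / energy(x))

def energy_percent(components, original):
    """Energy of each named component as a percentage of the original's energy."""
    e0 = energy(original)
    return {name: 100.0 * energy(c) / e0 for name, c in components.items()}
```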
10
Energy and Intelligibility in Components
  • 300 CVC words (from the NU-6 list) were highpass
    filtered and processed to obtain QSS and
    transient components.
[Figure: energy and intelligibility of each component (p < 0.05 for pairwise comparisons with other components)]
  • Word recognition rates were measured in five
    subjects as the speech level was increased above
    the auditory threshold.
  • PBmax (the asymptotic word recognition rate) was
    estimated for each subject from an ogive fitted
    to that subject's recognition scores.
  • The QSS component had a significantly lower PBmax
    than the other components, while the transient
    component had approximately the same PBmax as the
    original and highpass-filtered components.
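PBmax can be estimated by fitting a sigmoid to each subject's scores. The slides say only that an ogive was fitted; this sketch assumes a three-parameter logistic, P(L) = PBmax / (1 + exp(-(L - L50)/s)), and fits it by grid search over L50 and s, solving for PBmax by linear least squares at each grid point (clipped to 0-100 %). The function names and grids are hypothetical.

```python
import numpy as np

def logistic_shape(level_db, l50, slope):
    """Unit-height logistic ogive (assumed functional form)."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(level_db, dtype=float) - l50) / slope))

def fit_pbmax(level_db, pct_correct, l50_grid, slope_grid):
    """Grid search over (l50, slope); PBmax solved by least squares at each point.

    Returns (pbmax, l50, slope) minimizing the sum of squared errors.
    """
    y = np.asarray(pct_correct, dtype=float)
    best = None
    for l50 in l50_grid:
        for s in slope_grid:
            f = logistic_shape(level_db, l50, s)
            # For a fixed shape the model is linear in PBmax; the clipped
            # least-squares solution is the constrained optimum on [0, 100]
            pb = float(np.clip(np.dot(f, y) / np.dot(f, f), 0.0, 100.0))
            sse = float(np.sum((y - pb * f) ** 2))
            if best is None or sse < best[0]:
                best = (sse, pb, l50, s)
    return best[1], best[2], best[3]
```

Solving PBmax in closed form keeps the search two-dimensional; a nonlinear least-squares routine would be the usual choice for finer estimates.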

11
Speech Enhancement
  • Motivation for speech enhancement
  • Transition information is critical to speech
    perception, but it has low energy and is
    therefore particularly susceptible to noise.
  • Selectively amplifying this component may
    improve the recognition of speech in noise.
  • $S_{\mathrm{enh}}(t) = k\,[S_{\mathrm{orig}}(t) + 12\,S_{\mathrm{tran}}(t)]$
  • Procedure
  • Speech sounds were decomposed, and the
    transient component was amplified and then
    recombined with the original (base) speech.
  • Energy adjustment constant k: the energy of the
    enhanced speech was adjusted to equal the energy
    of the original speech.
  • The intelligibility of the original and enhanced
    versions was evaluated using the modified rhyme
    protocol.
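A sketch of the enhancement step, assuming that k scales the whole sum S_orig(t) + 12·S_tran(t) and that energy is the sum of squared samples. Under that reading, k follows directly from the energy-matching condition: k = sqrt(E_orig / E_mix), where E_mix is the energy of the unscaled sum. The function name `enhance` is hypothetical, and both signals are assumed to be time-aligned arrays of equal length.

```python
import numpy as np

TRANSIENT_GAIN = 12.0  # amplification factor shown on the slide

def enhance(s_orig, s_tran, gain=TRANSIENT_GAIN):
    """Return (k * (s_orig + gain * s_tran), k), with k preserving the original energy."""
    s_orig = np.asarray(s_orig, dtype=float)
    mixed = s_orig + gain * np.asarray(s_tran, dtype=float)
    # Choose k so that sum((k * mixed)**2) == sum(s_orig**2)
    k = np.sqrt(np.sum(s_orig ** 2) / np.sum(mixed ** 2))
    return k * mixed, k
```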

12
Psychoacoustic Procedure
  • Subjects sat in a sound-attenuated booth, and
    test words were delivered monaurally through
    headphones.
  • Each trial used one set of six words that rhyme
    with one another.
  • At the beginning of the trial, a target word
    appeared on the computer monitor and remained
    until all six rhyming words in the set had been
    presented.
  • The subjects were asked to click on OK as soon
    as they heard the target word.

13
Psychoacoustic Test - Speech Enhancement Test
  • Test material: three hundred monosyllabic words
    (150 sets for original speech and 150 sets for
    enhanced speech) spoken by a male speaker.
  • Test words were presented in speech-weighted
    background noise at six SNRs (-25, -20, -15,
    -10, -5, and 0 dB).
  • Eleven volunteer subjects with negative otologic
    histories and hearing sensitivity of 15 dB HL or
    better on conventional audiometry (250 Hz to
    8 kHz) were tested.
  • The percentage of words correctly recognized in
    each condition was recorded for each subject, and
    paired differences between recognition of the
    original and enhanced speech were analyzed.
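Presenting words at a target SNR amounts to scaling the noise before adding it. A minimal sketch, assuming SNR is computed from the mean power of the whole speech token and of the noise; the slides do not say how levels were measured or how the speech-weighted noise was generated, so the noise here is simply an input array, and `mix_at_snr` is a hypothetical name.

```python
import numpy as np

SNRS_DB = [-25, -20, -15, -10, -5, 0]  # test SNRs from the slide

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so 10*log10(P_speech / P_noise) == snr_db, then add it.

    Assumes noise is at least as long as speech; extra samples are dropped.
    Returns (mixture, scaled_noise).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise, scale * noise
```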

14
Paired Difference Results - Speech Enhancement
[Figure: paired differences (enhanced minus original speech) in word recognition rate: means and 95% confidence intervals (p < 0.05)]
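The paired-difference analysis can be sketched as a mean difference with a t-based 95% confidence interval. This is an assumption about the analysis: the slides report means, 95% confidence intervals, and p < 0.05 but do not name the test. The default critical value 2.228 is the two-sided 95% t value for 10 degrees of freedom, matching 11 subjects; `paired_diff_ci` is a hypothetical name.

```python
import math
import statistics

def paired_diff_ci(enhanced, original, t_crit=2.228):
    """Mean paired difference (enhanced - original) and its 95% CI.

    t_crit defaults to the two-sided 95% t value for df = 10 (11 subjects);
    pass a different value for another number of subjects.
    """
    if len(enhanced) != len(original):
        raise ValueError("paired data must have equal length")
    diffs = [e - o for e, o in zip(enhanced, original)]
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean
    return mean, (mean - t_crit * sem, mean + t_crit * sem)
```

With a two-sided paired t test, an interval that excludes zero corresponds to p < 0.05.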
15
Actual Word Recognition Rates - Speech Enhancement
[Figure: means and 95% confidence intervals of word recognition rates for original (blue) and enhanced (red) speech]
16
Discussion
  • A new dynamic method to emphasize transition
    information in speech has been developed.
  • The algorithm isolates a component of the speech
    signal (the QSS component) that appears not to be
    critical to intelligibility.
  • Results suggest that transient components make a
    significant contribution to speech
    intelligibility.
  • Emphasizing transient information can improve the
    intelligibility of speech in noise at low SNRs.

17
  • Thank You.