Reverberation-robust automatic speech recognition using missing data techniques

1
Reverberation-robust automatic speech recognition
using missing data techniques
  • Guy J. Brown1, Kalle Palomäki2 and Jon Barker1
  • 1 Department of Computer Science, University of
    Sheffield, UK
  • 2 Laboratory of Acoustics and Audio Signal
    Processing, Helsinki University of Technology,
    Finland
  • g.brown_at_dcs.shef.ac.uk, kalle.palomaki_at_hut.fi,
    j.barker_at_dcs.shef.ac.uk

2
Overview
  • Missing data approach to ASR
  • The reverberation problem
  • System description
  • Acoustic features
  • Reverberation mask estimation
  • Spectral normalisation
  • Feature combination
  • ASR results
  • Conclusions

3
Missing data approach to ASR
  • Devised by Cooke et al. (2001) as a means of
    handling additive noise in ASR.
  • Motivated by observation that listeners are able
    to recognise speech even when parts of the
    spectrum are rendered unreliable (by noise) or
    removed (by filtering).
  • Adapt a hidden Markov model (HMM) classifier to
    cope with missing or unreliable features.
  • Requires that reliable acoustic features are
    labelled.

4
The time-frequency mask
  • Likelihood f(x_s | C) associated with
    spectral features x_s cannot be computed directly
  • Partition x_s into reliable part x_{s,r} and
    unreliable part x_{s,u}
  • Compute estimate of likelihood f(x_s | C)
    by using x_{s,r} directly and
    exploiting bounds on x_{s,u} (bounded
    marginalization)
  • Provide a time-frequency mask showing reliable
    regions.
  • Mask may be binary (as used here) or real-valued.
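
The bounded marginalization step can be illustrated with a minimal sketch. It assumes a single diagonal-covariance Gaussian per state and takes the bounds on each unreliable feature to be 0 and the observed value; both choices, and all function names, are assumptions made for illustration and are not taken from the slides.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gauss_cdf(x, mean, var):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def bounded_marginal_loglik(x, mask, mean, var, floor=1e-300):
    """Log-likelihood of one frame under one diagonal Gaussian.

    Reliable channels (mask == 1) contribute the usual density.
    Unreliable channels (mask == 0) contribute the probability mass
    between 0 and the observed value (assumed bounds).
    """
    total = 0.0
    for xi, mi, mu, v in zip(x, mask, mean, var):
        if mi:
            total += log_gauss(xi, mu, v)
        else:
            mass = gauss_cdf(xi, mu, v) - gauss_cdf(0.0, mu, v)
            total += math.log(max(mass, floor))  # floor avoids log(0)
    return total
```

With an all-ones mask the function reduces to the ordinary Gaussian log-likelihood, which is a quick sanity check on the sketch.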

5
Example time-frequency mask
[Figure: rate map for clean speech; rate map for noisy speech; mask]

6
Characteristics of room reverberation
  • Room impulse response comprises early and late
    (higher-order) reflections.
  • Early reflections:
      • Sparse, highly correlated with speech
      • May enhance intelligibility by increasing
        loudness of speech
      • May cause spectral deviation due to comb
        filtering and varying characteristics of surface
        absorption
  • Higher-order reflections:
      • Dense, poorly correlated with original speech
      • More like additive noise

7
Missing data ASR and reverberation
  • We use the following approach for reverberated
    speech:
  • Use spectral normalisation to deal with
    distortion caused by early reflections
  • Treat late reverberation as additive noise, and
    apply standard missing data techniques.
  • Identify spectral features which are relatively
    uncontaminated by reverberation and contain
    strong speech energy.
  • Approach based on modulation filtering.

8
System architecture
[Figure: block diagram with the stages reverberated speech, auditory filterbank, rate map, reverberation mask, spectral normalisation, missing data speech recogniser]

9
Acoustic features
  • Missing data approach requires spectral features,
    so local time-frequency regions can be selected
    in the mask.
  • Features derived from an auditory model:
      • Filterbank consisting of 32 gammatone filters,
        centre frequencies between 50 Hz and 3850 Hz on
        ERB scale
      • Envelope of each filter extracted and smoothed by
        first-order lowpass filter with a time constant
        of 8 ms.
      • Smoothed envelope is sampled at 10 ms intervals
        and cube root compressed to give a rate map.
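
The steps after the filterbank can be sketched as follows. The gammatone filters themselves are omitted: the sketch assumes their envelopes are already available as a non-negative array. The ERB-rate formula is the Glasberg and Moore one, which is an assumption about which ERB scale the authors used, and the one-pole smoother is one common way to realise a first-order lowpass filter.

```python
import numpy as np

def erb_rate(f_hz):
    """Glasberg and Moore ERB-rate scale (assumed variant)."""
    return 21.4 * np.log10(0.00437 * f_hz + 1.0)

def inverse_erb_rate(e):
    """Map an ERB-rate value back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def centre_frequencies(n=32, lo=50.0, hi=3850.0):
    """n centre frequencies equally spaced on the ERB-rate scale."""
    return inverse_erb_rate(np.linspace(erb_rate(lo), erb_rate(hi), n))

def rate_map(envelopes, fs, tau=0.008, hop=0.010):
    """Turn filter envelopes of shape (channels, samples) into a rate map.

    Each envelope is smoothed by a one-pole lowpass filter with time
    constant tau, sampled every hop seconds and cube root compressed.
    Returns an array of shape (channels, frames).
    """
    envelopes = np.asarray(envelopes, dtype=float)
    a = np.exp(-1.0 / (tau * fs))            # one-pole filter coefficient
    smoothed = np.empty_like(envelopes)
    state = np.zeros(envelopes.shape[0])
    for n in range(envelopes.shape[1]):
        state = a * state + (1.0 - a) * envelopes[:, n]
        smoothed[:, n] = state
    step = int(round(hop * fs))              # samples per 10 ms frame
    return np.cbrt(smoothed[:, ::step])
```

For example, `rate_map(env, fs=8000)` on one second of envelopes returns 100 frames per channel.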

10
Reverberation mask estimation
  • Identify acoustic features that contain strong
    speech energy and are relatively unaffected by
    reverberation.
  • Detect modulations in the speech range by
    filtering each channel of the rate map with a
    modulation filter, pass band between 1.5 Hz and
    8.2 Hz.
  • Apply threshold to modulation-filtered rate map
    y_m(i,j):
      m(i,j) = 1 if y_m(i,j) > θ(j), and 0 otherwise
    where m(i,j) is mask at time i and frequency j,
    and θ(j) is a frequency-dependent threshold.
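
A minimal sketch of the thresholding step, assuming the modulation-filtered rate map is stored as a (time, frequency) array and that a feature is marked reliable when it exceeds the threshold for its channel. The function name and array layout are illustrative.

```python
import numpy as np

def reverberation_mask(y_m, theta):
    """Binary mask from a modulation-filtered rate map.

    y_m:   array of shape (frames, channels)
    theta: one threshold per channel, shape (channels,)
    Returns an integer array with 1 for reliable, 0 for unreliable.
    """
    y_m = np.asarray(y_m, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return (y_m > theta[np.newaxis, :]).astype(int)
```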

11
Form of the modulation filter
  • The modulation filter h(n) has the following
    form: h(n) = h_lp(n) * h_diff(n), where * denotes
    convolution
  • h_lp(n) is a linear phase low pass filter which
    detects modulations in the speech range.
  • h_diff(n) is a differentiator which emphasizes
    abrupt onsets, which are likely to correspond to
    direct sound and early reflections.
  • Overall filter h(n) is band pass, with 3 dB
    cutoff points at 1.5 Hz and 8.2 Hz.
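
A band-pass modulation filter of this kind can be sketched by cascading a low pass filter with a differentiator. The design below is an assumption made for illustration: a 51-tap Hamming-windowed sinc low pass filter and a first difference, applied at the 100 Hz frame rate implied by 10 ms frames. It is not tuned to reproduce the 1.5 Hz and 8.2 Hz cutoff points exactly.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, num_taps=51):
    """Linear-phase (symmetric) windowed-sinc low pass filter."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = 2.0 * cutoff_hz / fs * np.sinc(2.0 * cutoff_hz / fs * n)
    h *= np.hamming(num_taps)
    return h / h.sum()                      # unit gain at 0 Hz

def modulation_filter(fs=100.0, cutoff_hz=8.2, num_taps=51):
    """Cascade of the low pass part and a first-difference differentiator."""
    h_lp = lowpass_fir(cutoff_hz, fs, num_taps)
    h_diff = np.array([1.0, -1.0])          # assumed differentiator
    return np.convolve(h_lp, h_diff)

def gain_at(h, f_hz, fs):
    """Magnitude response of FIR filter h at frequency f_hz."""
    n = np.arange(len(h))
    return abs(np.sum(h * np.exp(-2j * np.pi * f_hz * n / fs)))
```

Each channel of the rate map would then be filtered along time, for instance with `np.convolve(channel, modulation_filter(), mode="same")`. The differentiator removes the 0 Hz component and the low pass part removes fast modulations, so the cascade is band pass.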

12
Form of the modulation filter (contd)
13
Example reverberation mask estimation
[Figure panels: rate map (CF = 103 Hz); rate map filtered by low pass part; rate map filtered by entire modulation filter; estimated reliable regions (solid line) and unreliable regions (dotted line)]

14
Spectral normalisation
  • Need to compensate for spectral distortion caused
    by room impulse response, but with partial
    information.
  • Features known to be unreliable should not be
    included in the normalisation process.
  • We use an utterance-based normalisation scheme.
  • Normalisation factor for each channel is the mean
    of the L largest reliable features in that
    channel.
  • Generally set L to M/D, where M is the number of
    time frames in the rate map and D is a constant
    parameter.
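
The normalisation rule can be sketched as below. The sketch assumes that normalising means dividing each channel by its factor, that L is M/D rounded down (with a minimum of 1), and that a channel with no reliable features is left unscaled. The default value of D is purely illustrative, since the slides do not give one.

```python
import numpy as np

def normalise_rate_map(rate_map, mask, D=5):
    """Utterance-based normalisation using only reliable features.

    rate_map, mask: arrays of shape (frames, channels); mask is binary.
    Returns the normalised rate map and the per-channel factors.
    """
    rate_map = np.asarray(rate_map, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    M, n_channels = rate_map.shape
    L = max(1, M // D)                       # number of features to average
    factors = np.ones(n_channels)
    for j in range(n_channels):
        reliable = rate_map[mask[:, j], j]   # reliable features in channel j
        if reliable.size:
            top = np.sort(reliable)[-L:]     # the L largest (fewer if scarce)
            if top.mean() > 0.0:
                factors[j] = top.mean()
    return rate_map / factors, factors
```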

15
Example of mask estimation
[Figure: rate map of unreverberated speech with a priori mask; rate map of reverberated speech (T60 = 1.7 s) with reverberation mask. Axes: time (seconds) against frequency (kHz).]

16
Recent work: feature combination
  • MD approach requires spectral features, which are
    more correlated than cepstral features
  • Problems with modeling spectral data may reduce
    baseline ASR performance, offsetting the gain in
    robustness
  • Compute estimate of likelihood f(x_s | C)
    for spectral features x_s with missing
    data
  • Compute likelihood f(x_c | C) for (complete)
    cepstral features x_c
  • Combine likelihoods from the two feature streams
    as weighted average in the log domain
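
The combination rule amounts to a weighted average of the two log-likelihoods. In the sketch below the stream weight and its default value are assumptions, as the slides do not state how the weight was chosen.

```python
def combine_log_likelihoods(log_f_spectral, log_f_cepstral, weight=0.5):
    """Weighted average of two stream log-likelihoods.

    weight is the (assumed) weight of the spectral missing data stream;
    the cepstral stream receives 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * log_f_spectral + (1.0 - weight) * log_f_cepstral
```

Setting the weight to 1 recovers the spectral missing data system alone, and setting it to 0 recovers the cepstral system alone.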

17
Evaluation
  • Compare against missing data using a priori
    masks:
      • Measure difference between each element in the
        rate map for clean and reverberation-contaminated
        speech
      • Only set mask elements to unity if this
        difference lies within a threshold value (tuned
        for each condition).
  • Compare against Kingsbury's HMM-MLP recogniser:
      • Hidden Markov model / multilayer perceptron
        architecture
      • Uses PLP and modulation-filtered spectrogram
        features.
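
The a priori mask rule can be sketched as follows, assuming that "difference" means the absolute difference between time-aligned clean and reverberated rate maps. The function name is illustrative.

```python
import numpy as np

def a_priori_mask(clean, reverberated, threshold):
    """Mask element is 1 where the reverberated rate map stays within
    threshold of the clean rate map (absolute difference assumed)."""
    clean = np.asarray(clean, dtype=float)
    reverberated = np.asarray(reverberated, dtype=float)
    return (np.abs(reverberated - clean) < threshold).astype(int)
```

Such a mask needs the clean signal, so it serves as a reference for how well an estimated mask could do and cannot be used in deployment.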

18
Results
  • Test set of 1001 utterances drawn from Aurora 2
    corpus.
  • Connected digits (1-9 plus oh and zero).
  • Missing data HMM system trained on rate maps and
    deltas.
  • Reverberated using recorded room impulse
    responses.

19
Conclusions
  • Missing data techniques can be used to tackle the
    problem of reverberation.
  • Detection of modulations in the speech range is a
    key element of our approach.
  • Advantage of the missing data framework:
    different mask estimation rules can be selected
    dynamically to deal with varying acoustic
    environments.
  • May be important for mobile devices.
  • Experiments with a priori masks suggest good
    potential; the most recent results are superior to
    Kingsbury's system.