Title: Perceptual compensation for reverberation: computer modelling
1 Perceptual compensation for reverberation: computer modelling
- Guy Brown and Amy Beeston
- Department of Computer Science
- University of Sheffield
2 Overview
- Work in this period (WP2)
  - Development of a computer model of perceptual compensation for reverberation
  - Metrics for assessing the amount of reverberation present
  - Models based on nonlinear cochlear filterbank
- Work in the next period
  - WP2: within-band mechanisms
  - WP4: exploiting statistics of naturally occurring sounds
3 Work in this period
4 Modelling approach
- Aim to build computer models that replicate the performance of human subjects in specific perceptual experiments
- Current focus is compensation for effects of reverberation in the sir/stir continuum (Watkins, 2005)
- Could the auditory efferent system play a role?
- Modelling the efferent system provides a useful framework for the modelling effort
  - factors that determine amount of efferent suppression
  - effect of linear vs. nonlinear filtering
  - within-band vs. across-band processing
5 Watkins (2005) sir/stir experiment
[Figure: a context phrase precedes a test word; the sir/stir category boundary is recorded in three conditions]
- Clean speech context and test
- Test reverberated: more sir responses
- Test and context reverberated: compensation, more stir responses
6 Watkins (2005) experiment 5
- Compensation measured with contexts reverberated by time-reversed room impulse responses
  - slowly decaying tails do not occur at offsets
  - slowly growing heads added at onsets
  - is compensation related to reverberation tails?
- Compensation measured with time-reversed speech carrier contexts
  - do words need to be identifiable for compensation to occur?
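The reversed-reverberation manipulation can be illustrated with a short sketch. This is not the stimulus-generation code used in the experiment: the impulse response here is synthetic (exponentially decaying noise), and the names `synthetic_rir` and `reverberate`, the sampling rate and the decay time are all assumptions chosen for illustration.

```python
import numpy as np

def synthetic_rir(fs=16000, t60=0.7, seed=0):
    """Exponentially decaying noise: a crude stand-in for a recorded room impulse response."""
    rng = np.random.default_rng(seed)
    n = int(fs * t60)
    t = np.arange(n) / fs
    decay = 10.0 ** (-3.0 * t / t60)  # amplitude falls by 60 dB over t60 seconds
    return rng.standard_normal(n) * decay

def reverberate(signal, rir, reverse=False):
    """Convolve a signal with an impulse response, optionally time-reversed."""
    h = rir[::-1] if reverse else rir
    return np.convolve(signal, h)

fs = 16000
click = np.zeros(fs // 10)
click[len(click) // 2] = 1.0  # single impulse as a test signal
rir = synthetic_rir(fs)

forward = reverberate(click, rir)                # sharp onset, slowly decaying tail
reverse = reverberate(click, rir, reverse=True)  # slowly growing head, sharp offset
```

Both outputs contain the same energy; only the temporal arrangement differs, which is what makes the manipulation a test of whether tails matter.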
7 Experiment 5 pictorially
[Figure: four context conditions]
- forward speech carrier: forward reverb, reverse reverb
- reversed speech carrier: forward reverb, reverse reverb
8 Results from Watkins (2005) experiment 5
- Compensation markedly reduced in reversed reverberation conditions
  - for both forward and reversed speech carrier
- Compensation remains substantial in forward and reversed speech carrier conditions
  - slight reduction in compensation with reversed speech carrier
- Conclusions
  - Reverberation tails appear to be important for compensation
  - Intelligibility of the carrier is not required
9 Putative role of auditory efferent processing
- Reverberating a speech signal reduces its dynamic range
- Compensation could be characterised as restoration of dynamic range?
- Efferent system implicated in control of dynamic range via a closed-loop feedback mechanism (Guinan & Gifford, 1988)
- Evidence that compensation effects are primarily within-frequency-channel
  - Feedback from efferent system appears to be fairly narrowly tuned
- Plausible time scales
  - Efferent feedback is sluggish, in the range 100-200 ms. Also long-term effects over tens of seconds (Sridhar et al., 1997).
10 Main features of computer model
- Medial olivocochlear (MOC) system is known to exert a suppressive influence on the basilar membrane
- Adapted models of Ferry & Meddis (2007) and Ghitza (2007)
- Suppression modelled by attenuation in nonlinear path of dual-resonance-nonlinear (DRNL) filterbank
- Amount of efferent suppression determined by metric applied to auditory nerve response
  - May be within-channel or across-channel
  - Metric computed over appropriate time scale
- Eventually this should be implemented as a closed-loop feedback system (currently not)
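A heavily simplified sketch of how attenuation in a compressive nonlinear path behaves. It keeps only a broken-stick compression stage and omits the bandpass filters and the parallel linear path of the full DRNL; the parameter values `a`, `b`, `c` and the function names are hypothetical, and placing the attenuation at the input of the nonlinearity is an assumption of this sketch.

```python
import numpy as np

def broken_stick(x, a=1000.0, b=1.0, c=0.25):
    """Broken-stick compression: linear at low levels, compressive at high levels."""
    mag = np.abs(x)
    return np.sign(x) * np.minimum(a * mag, b * mag ** c)

def nonlinear_path(x, attenuation_db=0.0):
    """Attenuate the input to the nonlinear path, then compress (hypothetical placement)."""
    gain = 10.0 ** (-attenuation_db / 20.0)
    return broken_stick(gain * np.asarray(x, dtype=float))

# Effect of 10 dB attenuation at a low and a high input amplitude
for amplitude in (1e-6, 1.0):
    ratio = nonlinear_path([amplitude])[0] / nonlinear_path([amplitude], 10.0)[0]
    print(amplitude, 20.0 * np.log10(ratio))
```

In this sketch the attenuation shifts the input-output function along the level axis: below the compression knee the output drops by the full attenuation, whereas in the compressive region it drops by only a fraction `c` of it.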
11 Schematic of the model
[Figure: Stimulus -> auditory periphery (outer/middle ear -> DRNL -> hair cell) -> AN response -> framing -> DCT (optional) -> DTW-based recogniser. Efferent system: a metric computed on the AN response sets the efferent attenuation applied to the DRNL.]
12 Dual resonance nonlinear filterbank (DRNL)
- Originally proposed by Meddis, O'Mard and Lopez-Poveda (2001), human parameters from Meddis (2006)
- Efferent attenuation introduced by Ferry and Meddis (2007)
13 Simple hair cell model
- Hair cell is a simple threshold and rate limiter as described by Messing (2007).
- Half-wave rectification.
- Linear output between threshold and saturated firing rate.
- Tuned to approximate low spontaneous rate AN fibres.
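A minimal sketch of a threshold-and-rate-limiter stage of this kind. The function name and the values of `threshold`, `saturation` and `max_rate` are placeholders, not the tuned values used in the model.

```python
import numpy as np

def simple_hair_cell(bm, threshold=0.1, saturation=1.0, max_rate=200.0):
    """Map basilar-membrane output to firing rate (hypothetical parameter values)."""
    rectified = np.maximum(np.asarray(bm, dtype=float), 0.0)  # half-wave rectification
    limited = np.clip(rectified, threshold, saturation)        # floor at threshold, ceiling at saturation
    # Linear growth from 0 at threshold to max_rate at saturation
    return max_rate * (limited - threshold) / (saturation - threshold)

rates = simple_hair_cell([-1.0, 0.05, 0.55, 2.0])
```

Inputs below threshold (including all negative values) give zero rate, and inputs above the saturation point give `max_rate`.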
14 Rate-level response: effect of suppression
15 Auditory spectrograms: effect of suppression
16 Control of efferent attenuation
- Intend to implement a closed-loop system in which a metric is applied to AN response over a sliding window and used to determine efferent attenuation.
- Currently
  - Only investigated metrics applied to pooled AN response over all channels; within-channel mechanisms will be addressed next.
  - Estimated metric over 1 s preceding the target word, and then used this to set the efferent attenuation for the remainder of the stimulus
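The current open-loop scheme can be sketched as follows. The array layout (channels by frames), pooling by summation across channels, and the names `context_metric`, `target_onset_s` and `frame_rate_hz` are assumptions made for illustration.

```python
import numpy as np

def context_metric(an_response, target_onset_s, frame_rate_hz, metric, window_s=1.0):
    """Apply `metric` to the pooled AN response in the window preceding the target word.

    an_response: array of shape (channels, frames), assumed layout.
    """
    end = int(round(target_onset_s * frame_rate_hz))
    start = max(0, end - int(round(window_s * frame_rate_hz)))
    pooled = an_response[:, start:end].sum(axis=0)  # pool across channels (assumed: sum)
    return metric(pooled)

# Toy input: 4 channels, 300 frames at 200 frames/s, target word starting at 1.2 s
an = np.ones((4, 300))
an[:, 240:] = 100.0  # the target region must not influence the estimate
value = context_metric(an, target_onset_s=1.2, frame_rate_hz=200.0, metric=np.mean)
```

The returned value would then be mapped to a single attenuation that is held fixed for the rest of the stimulus; a closed-loop version would instead recompute it over a sliding window.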
17 Template-based speech recogniser
- Use a template-based speech recogniser based on dynamic time warping (DTW) and cosine distance.
- Templates are sir and stir from extreme ends of the dry (unreverberated) continuum
- DTW not necessary at this stage, but later intend to model both fast and slow cases in Watkins' study.
- Features for recognition were either
  - Firing rate, computed at 5 ms intervals over 20 ms Hann window
  - DCT-transformed firing rate (15 coefficients, not including the first)
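A compact sketch of the feature extraction and template matching described above, written with NumPy only. The function names are hypothetical, the DCT is an unnormalised DCT-II, and reading "15 coefficients, not including the first" as coefficients 1 to 15 is an assumption.

```python
import numpy as np

def frame_rate_features(rate, fs, hop_s=0.005, win_s=0.020):
    """Hann-weighted mean firing rate per frame. rate: (channels, samples) -> (frames, channels)."""
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    w = np.hanning(win)
    w = w / w.sum()
    n_frames = 1 + (rate.shape[1] - win) // hop
    return np.stack([rate[:, i * hop:i * hop + win] @ w for i in range(n_frames)])

def dct_features(frames, n_coeffs=15):
    """DCT-II across channels, keeping coefficients 1..n_coeffs (the first is dropped)."""
    n = frames.shape[1]
    k = np.arange(n_coeffs + 1)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return (frames @ basis.T)[:, 1:]

def cosine_distance(u, v, eps=1e-12):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def dtw_distance(a, b):
    """Accumulated cosine distance along the best warping path between two feature sequences."""
    na, nb = len(a), len(b)
    cost = np.full((na + 1, nb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = cosine_distance(a[i - 1], b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[na, nb]

def classify(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(features, templates[name]))
```

With templates stored as a dictionary, for example `{"sir": sir_features, "stir": stir_features}`, `classify` returns the label of the closer template.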
18 Configuration of the model
- Outer/middle ear and DRNL modified from code supplied by Ray Meddis and Robert Ferry (Essex University)
- 80 frequency channels
- Best frequencies in range 100 Hz to 8 kHz, log spaced
- Stimuli presented at level of 56 dB SPL
- Implemented in MATLAB
- Invested some time developing framework for running simulations on Sheffield's computing grid
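For illustration, the channel spacing can be reproduced in a few lines. The model itself is implemented in MATLAB; this Python sketch assumes purely logarithmic spacing with both endpoints included.

```python
import numpy as np

# 80 best frequencies from 100 Hz to 8 kHz, equally spaced on a log axis
best_frequencies = np.geomspace(100.0, 8000.0, num=80)

# Adjacent channels differ by a constant frequency ratio
ratio = best_frequencies[1] / best_frequencies[0]
```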
19 Scoring the category boundary
- Watkins (2005) characterised listeners' responses in terms of shifts in the category boundary
  - 11 steps of continuum presented three times
  - category boundary computed as (total number of sir responses)/3 - 0.5, giving step between -0.5 and 10.5
- Model produces same output for each presentation of 11 steps
  - 11 steps of continuum presented once
  - category boundary computed as (total number of sir responses) - 0.5, giving same range of steps
  - quantisation of step is coarser for the model (whole steps rather than thirds of a step)
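The two scoring rules are the same formula with a different number of presentations, as this small sketch shows (the function name is hypothetical).

```python
def category_boundary(n_sir_responses, n_presentations):
    """Category boundary in continuum steps: sir responses per presentation, minus 0.5."""
    return n_sir_responses / n_presentations - 0.5

# Listeners: 11 steps presented three times, so 0..33 sir responses
listener_range = (category_boundary(0, 3), category_boundary(33, 3))

# Model: 11 steps presented once, so 0..11 sir responses
model_range = (category_boundary(0, 1), category_boundary(11, 1))
```

Both ranges run from -0.5 to 10.5, but the listener score can move in thirds of a step whereas the model score moves in whole steps.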
20 Manual tuning of efferent attenuation (rate)
[Figure: category boundaries for listeners (Watkins, 2005) and for the computer model]
21 Category boundary vs. attenuation (rate)
[Figure: panels for forward and reverse speech carrier, each with forward and reverse reverberation]
22 Efferent attenuation vs mean-to-peak ratio
- Need a function that maps the amount of reverberation in the context to the efferent attenuation
- Should be
  - insensitive to time reversal of speech
  - sensitive to reversal of impulse response
- Mean-to-peak ratio of pooled AN response gives reasonable linear fit
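A sketch of the metric and of fitting a straight line from metric to attenuation. The function names are hypothetical and the numbers in the example are made-up placeholders, not values from the model or from listeners.

```python
import numpy as np

def mean_to_peak_ratio(pooled):
    """Mean of the pooled AN response divided by its peak (0 if the response is silent)."""
    pooled = np.asarray(pooled, dtype=float)
    peak = pooled.max()
    return pooled.mean() / peak if peak > 0 else 0.0

def fit_linear_map(metric_values, attenuations_db):
    """Least-squares fit of attenuation = m * metric + c."""
    m, c = np.polyfit(metric_values, attenuations_db, deg=1)
    return m, c

# Placeholder data only: manually tuned attenuations paired with context metrics
example_metrics = np.array([0.10, 0.15, 0.20, 0.25])
example_attenuations = 2.0 * example_metrics + 1.0
m, c = fit_linear_map(example_metrics, example_attenuations)
```

Intuitively, reverberation fills the gaps between peaks in the response, which would raise the mean relative to the peak; that is the property this metric is intended to capture.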
23 Results: autonomous model (firing rate)
attenuation = m(metric) + c
- Not a good fit to listener data, although for the near (0.32 m) context we get a shift in boundary when the test word is reverberated, and the right pattern of compensation for the far (10 m) test word.
24 Results: autonomous model (DCT)
attenuation = m(metric) + c
- Better fit of general pattern, but size of category boundary changes is not well matched
25 Other reverb metrics: kurtosis
- Kurtosis is $\mu_4 / \sigma^4$, where $\mu_4$ is the fourth central moment and $\sigma$ is the standard deviation
- Measures the peakedness of the p.d.f. of a random variable
- Used as a reverberation metric in a number of blind dereverberation studies
- Reverberated speech has a lower kurtosis (more Gaussian)
- Not well correlated with efferent attenuation (e.g., kurtosis is affected by reversing the context)
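The definition can be checked numerically with a short sketch. The Laplacian samples are only a stand-in for a peaky signal, and summing independent copies is a crude illustration of why mixing many delayed contributions pushes a signal towards Gaussian statistics; neither is taken from the model.

```python
import numpy as np

def kurtosis(x):
    """Fourth central moment divided by the squared variance (equals 3 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    mu4 = np.mean(centred ** 4)
    variance = np.mean(centred ** 2)
    return mu4 / variance ** 2

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(200_000)
peaky = rng.laplace(size=200_000)                   # heavier tails than a Gaussian
mixed = rng.laplace(size=(8, 200_000)).sum(axis=0)  # sum of 8 independent peaky signals

k_gaussian, k_peaky, k_mixed = kurtosis(gaussian), kurtosis(peaky), kurtosis(mixed)
```

The mixed signal has a kurtosis between that of the peaky signal and the Gaussian value, consistent with the statement that reverberated speech is more Gaussian.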
26 Other reverb metrics: offset density
- Reverberation reduces the occurrence of sharp offsets in the temporal envelope
- Offset density = number of consecutive TF bins that differ by more than x dB in value
- Coherence across frequency is enforced
- Reasonable low-order fit should be possible
27 Other reverb metrics: measuring tails
- A metric that emphasises reverberant tails may provide a better fit, sensitive to reverse-reverberation conditions
- Mean-to-peak ratio of negative part of differentiated temporal envelope
- Not well correlated with efferent attenuation (somewhat sensitive to reversal of context)
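One possible reading of this tail metric, as a sketch: differentiate the envelope, keep only the decreases, and take their mean-to-peak ratio. The function name and the toy envelopes are assumptions for illustration.

```python
import numpy as np

def tail_metric(envelope):
    """Mean-to-peak ratio of the falling part of the differentiated envelope."""
    diff = np.diff(np.asarray(envelope, dtype=float))
    falling = np.maximum(-diff, 0.0)  # magnitude of decreases only; rises are ignored
    peak = falling.max()
    return falling.mean() / peak if peak > 0 else 0.0

# Toy envelopes: an abrupt offset versus a slowly decaying tail
abrupt = np.concatenate([np.ones(50), np.zeros(50)])
slow = np.concatenate([np.ones(50), np.exp(-np.arange(50) / 10.0)])

abrupt_value = tail_metric(abrupt)
slow_value = tail_metric(slow)
reversed_value = tail_metric(slow[::-1])  # the tail becomes a growing head
```

In this toy case the slowly decaying tail scores higher than the abrupt offset, and the time-reversed envelope scores zero because it contains no decreases.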
28 Conclusions
- The mean-to-peak ratio of the pooled AN response gives a reverberation metric that is easy to map (linearly) to efferent attenuation
- Key problem with the current model is steep function relating category boundary to amount of attenuation
  - Are the templates appropriate?
  - Could matching be done another way?
- But
  - Currently this works on pooled AN response
  - Maybe a good fit to listener data cannot be obtained unless efferent attenuation is adjusted within individual channels?
29 Work in the next period: WP2 within-band mechanisms
30 Summary of planned WP2 work
- Within-channel mechanisms
- Reverberation metrics that emphasize tails
- Comparison of linear and nonlinear models
- Distance metrics
- Frequency-dependent suppression
- Model of efferent processing as a front-end for
automatic speech recognition
31 Distance metrics
- Templates are currently matched using cosine distance based on firing rate or DCT features.
- Perceptually motivated metrics may be more appropriate
- A number of metrics in the literature emphasize formant peak frequencies (applied to vowel identification)
  - Weighted spectral slope metric (Klatt, 1982)
  - Weighted level metric (Assmann & Summerfield, 1989)
  - Weighted negative second differential metric (Assmann & Summerfield, 1989)
32 Frequency-dependent suppression
- Guinan & Gifford (1988) find a fall-off in the effect of efferent-induced threshold shift at low BFs (data from cat)
- This will improve representation of low-frequency speech structure when efferent attenuation is high
33 Model of efferent processing as a front-end for ASR
- Have applied model of efferent processing to ASR in work with Ray Meddis and Robert Ferry (Essex)
  - Open loop, complex hair cell
  - Efferent suppression improves representation of speech in broadband noise, and hence recognition accuracy
- Evaluate reverberation robustness (of simpler model) in next period
34 Work in the next period: WP4 exploiting statistics of naturally occurring sounds
35 Summary of previous work
- Previously in work with Kalle Palomäki we have used missing data to handle reverberation in ASR
  - Train acoustic models for ASR on clean speech
  - Compute a binary time-frequency (TF) mask in which TF units dominated by speech are labelled as reliable and TF units dominated by reverberation are labelled as unreliable
  - Treat reliable and unreliable regions differently during ASR decoding using missing feature theory (Cooke et al.)
- Hynek (yesterday): are the least reverberated regions the most reliable?
36 Reverberation mask and oracle mask
[Figure: time-frequency panels (axes: time, frequency) showing clean speech 98415, reverberated speech (T60 = 0.7 s, S/R distance = 3.05 m), the oracle mask (15 dB criterion) and the reverberation mask]
37 Schematic of current system
[Figure: block diagram. Boxes: reverberated speech, spectral features, spectral normalization, missing data speech recognizer, auditory filterbank, firing rate, modulation filter, local slope of envelope, blurredness, GMM classifier, mask. Shadowed boxes indicate processing in multiple frequency channels.]
38 Mask derived from GMM classifier
- For the development set of 300 utterances in 4 reverberation conditions
  - Compute oracle masks that maximize speech recognition accuracy
  - Probability densities for reliable regions $P_{r,j}$ and unreliable regions $P_{u,j}$ in oracle masks are modelled with 3-mixture GMMs, trained using EM with feature vectors $x_{ij}$
  - GMMs trained separately for each channel
- During testing, for feature vector $x_{ij}$ set mask as
  - $\mathrm{mask}_{ij} = 1$ iff $P_{r,j}(x_{ij}) > P_{u,j}(x_{ij})$, 0 otherwise
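The decision rule can be sketched for a single channel as below. EM training is omitted; the GMM parameters are hypothetical placeholders, diagonal covariances are an assumption, and densities are compared in the log domain for numerical stability.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM.

    x: (n, d) feature vectors; weights: (k,); means, variances: (k, d).
    """
    x = np.atleast_2d(x)
    diff = x[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_w = np.log(weights) + log_comp  # (n, k): weighted component log densities
    top = log_w.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_w - top).sum(axis=1, keepdims=True)))[:, 0]

def mask_for_channel(features, reliable_gmm, unreliable_gmm):
    """mask = 1 where the reliable model is more likely than the unreliable one, else 0."""
    reliable = gmm_log_density(features, *reliable_gmm)
    unreliable = gmm_log_density(features, *unreliable_gmm)
    return (reliable > unreliable).astype(int)

# Hypothetical 3-mixture models for one channel with 2-dimensional features
weights = np.array([0.5, 0.3, 0.2])
reliable_gmm = (weights, np.array([[2.0, 2.0], [3.0, 1.0], [1.0, 3.0]]), np.ones((3, 2)))
unreliable_gmm = (weights, -np.array([[2.0, 2.0], [3.0, 1.0], [1.0, 3.0]]), np.ones((3, 2)))

mask = mask_for_channel(np.array([[2.0, 2.0], [-2.0, -2.0]]), reliable_gmm, unreliable_gmm)
```

Running the rule over every frame of every channel, each with its own pair of models, yields the full binary mask.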
39 Training procedure (for each channel)
40 Speech recognition experiments
- Spoken digit strings from Aurora 2.0 corpus
- HMM-based ASR system trained on clean training section of Aurora corpus
- Word models trained for each digit (1-9 plus oh and zero) plus silence and short pause models
- For testing, reverberated speech obtained by convolving Aurora clean1 utterances with recorded room impulse responses
41 Results (from ASA meeting 6/08)
- Competitive with Kingsbury's hybrid HMM-MLP system
- Oracle mask gives upper limit on performance
- Gap between oracle mask and reverb mask performance suggests that reverberation masking algorithm can be further tuned
42 Planned work for WP4 (1)
- What are the cues to reliable time-frequency (TF) regions?
- Use the corpus of binaural room impulse responses (BRIRs) recorded at Reading
- Collect statistics that relate reliable TF regions to cues that listeners may use to determine the amount of reverberation present
  - pitch structure
  - interaural coherence (requires binaural model)
  - low-frequency modulation
  - measure of reverberant tails
43 Planned work for WP4 (2)
- Contextual effects are also important
  - The salience of a cue depends on its local context (as in the sir/stir case)
  - Auditory (and visual) perception based on relative values
  - Hermansky & Morgan (1994): "How else could you see the black-and-white movie on a white screen?"
- Can this contextual information be captured in statistical models?
44 Sir/stir experiment with large vocab ASR (Kalle Palomäki)
- Preliminary experiment with existing large vocabulary recogniser at HUT
- Justifications
  - Technical: large vocabulary ASR is an obvious technological application
  - Scientific: large vocabulary ASR has learned phoneme models close to humans
- Using well-tested ASR system that works reasonably well on broadcast news data
- Recogniser was forced to choose from two sentences
  - Next you'll get stir to click on
  - Next you'll get sir to click on
45 Results (Kalle Palomäki)
- Test on mildly reverberated signals (context near, test near)
  - For steps 0-6, sir sentence chosen
  - For steps 7-10, stir sentence chosen
- Indicates that recogniser can correctly align phonemes to corresponding utterances
- Free recognition too hard
- The far reverberant case was also too hard in this preliminary test
46 Suggestions for further work
- American English recogniser, should be British English
- Sampling rate was only 8 kHz
- We have Wall Street Journal corpus in British English at 16 kHz; need some work to train it
- Measure phoneme or state probability near st in stir-sir
- Transform phoneme boundary test to an intelligibility test
  - Apply SRT test and measure intelligibility after change in acoustic conditions
- Could also compare ASR performance against intelligibility