Title: Linking Computational Auditory Scene Analysis with Missing Data Recognition of Speech

1. Linking Computational Auditory Scene Analysis with Missing Data Recognition of Speech
- Guy J. Brown
- Department of Computer Science, University of Sheffield - g.brown_at_dcs.shef.ac.uk
- Collaborators:
  - Kalle Palomäki, University of Sheffield and Helsinki University of Technology
  - DeLiang Wang, The Ohio State University
2. Introduction
- Human speech perception is remarkably robust, even in the presence of interfering sounds and reverberation.
- In contrast, automatic speech recognition (ASR) is very problematic in such conditions:
  - Error rates of humans are much lower than those of machines even in quiet, and error rates of current recognisers increase substantially at noise levels which have little effect on human listeners (Lippmann, 1997).
- Can we improve ASR performance by taking an approach that models auditory processing more closely?
3. Auditory processing in ASR
- Until recently, the influence of auditory processing on ASR has been largely limited to the front end:
  - Noise-robust feature vectors, e.g. RASTA-PLP, modulation-filtered spectrograms.
- Can auditory processing be applied in the recogniser itself?
- Cooke et al. (2001) suggest that speech perception is robust because listeners can recognise speech from a partial description, i.e. with missing data.
- Modify a conventional recogniser to deal with missing or unreliable features.
4. Missing data approach to ASR
- The aim of ASR is to assign an acoustic vector Y to a class W such that the posterior probability P(W|Y) is maximised:
  - P(W|Y) ∝ P(Y|W) P(W), where P(Y|W) is the acoustic model and P(W) is the language model.
- If components of Y are unreliable or missing, P(Y|W) cannot be computed as usual.
- Solution: partition Y into reliable parts Yr and unreliable parts Yu, and use the marginal distribution P(Yr|W) (see the sketch below).
- Provide a time-frequency mask showing the reliable regions.
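As an illustration of the marginalisation step, the sketch below scores one frame of features against a single diagonal-covariance Gaussian state using only the reliable components. The optional bounded treatment of the unreliable components (assuming they lie between zero and the observed mixture energy) is one variant from the missing data literature; the function name, parameters and defaults are illustrative, not the exact recogniser implementation.

```python
import numpy as np
from scipy.stats import norm

def missing_data_log_likelihood(y, mask, mean, var, y_max=None):
    """Log-likelihood of one feature frame under a diagonal-covariance
    Gaussian state, using only the reliable components.

    y    : observed feature vector (one time frame of the rate map)
    mask : boolean vector, True where the feature is reliable
    mean, var : Gaussian state parameters (per feature dimension)
    y_max: if given, unreliable components are treated as bounded in
           [0, y_max] (bounded marginalisation); otherwise they are
           simply marginalised out (dropped).
    """
    std = np.sqrt(var)
    # Reliable components contribute their usual Gaussian log-density.
    ll = norm.logpdf(y[mask], mean[mask], std[mask]).sum()
    if y_max is not None:
        # Bounded marginalisation: the masked-out speech energy is unknown
        # but cannot exceed the observed mixture energy.
        u = ~mask
        p = norm.cdf(y_max[u], mean[u], std[u]) - norm.cdf(0.0, mean[u], std[u])
        ll += np.log(np.maximum(p, 1e-300)).sum()
    return ll
```

In decoding, this score replaces the usual state likelihood, so states are compared using the reliable evidence only.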
5. Missing data mask
[Figure: rate map (frequency vs. time) and the corresponding missing data mask (frequency vs. time).]
6. Binaural hearing and ASA
- The spatial location of sound sources is encoded by:
  - Interaural time difference (ITD)
  - Interaural level difference (ILD)
  - Spectral (pinna) cues
- Intelligibility of masked speech is improved if the speech and masker originate from different locations in space (Spieth, 1954).
- Gestalt principle of similarity/proximity: events that arise from a similar location are grouped.
7. Binaural processor for missing data ASR
- Assumptions:
  - Two sound sources: speech and an interfering sound.
  - Sources are spatialised by filtering with realistic head-related impulse responses (HRIRs).
  - Reverberation may be present.
- Key features of the system:
  - Components of the same source are identified by common azimuth.
  - Azimuth is estimated from ITD, with an ILD constraint.
  - A spectral normalisation technique handles convolutional distortion due to HRIR filtering and reverberation.
8. Block diagram of the system
[Block diagram: stages include the auditory filterbank, envelope extraction, the precedence model, cross-correlation, grouping by common azimuth, and the missing data ASR back end.]
9. Stimulus generation
- Speech and noise sources are located in a virtual room at the same height but different azimuthal angles.
- The transfer function of the path between source and ears is modelled by a binaural room impulse response (see the sketch below).
- The impulse response has three components:
  - Surface reflections, estimated by the image model
  - An air propagation filter (assuming 50% relative humidity)
  - The head-related impulse response (HRIR)
- Surface absorption is altered to vary the reverberation time.
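A minimal sketch of how such stimuli could be generated, assuming the left/right binaural room impulse responses for each source position are already available (e.g. from the image model combined with the HRIR); the SNR convention used here (measured on the left-ear signals) and the function names are assumptions for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialise_and_mix(speech, noise, brir_speech, brir_noise, snr_db):
    """Convolve each source with its binaural room impulse response and
    mix the two binaural signals at the requested SNR.

    brir_* : arrays of shape (n_taps, 2) holding the left/right impulse
             responses for that source position.
    """
    def binaural(x, brir):
        # One convolution per ear, stacked as (n_samples, 2).
        return np.stack([fftconvolve(x, brir[:, ch]) for ch in (0, 1)], axis=1)

    s = binaural(speech, brir_speech)
    n = binaural(noise, brir_noise)
    n_frames = min(len(s), len(n))
    s, n = s[:n_frames], n[:n_frames]
    # Scale the noise so the left-ear signals reach the target SNR.
    gain = np.sqrt(np.sum(s[:, 0] ** 2) / np.sum(n[:, 0] ** 2)) * 10 ** (-snr_db / 20)
    return s + gain * n
```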
10. Virtual room
[Figure: virtual room (length 6 m, width 4 m, height 3 m) containing the speech source and the noise source at different azimuths.]
11. Auditory periphery
- Cochlear frequency analysis is modelled by a bank of 32 gammatone filters; filter outputs are rectified and cube-root compressed.
- The instantaneous envelope is computed.
- The envelope is smoothed and downsampled to obtain "rate map" feature vectors for the recogniser (see the sketch below).
[Figure: rate map (frequency vs. time).]
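A simplified sketch of the rate map computation, assuming the gammatone filterbank outputs are already available. It rectifies and smooths each channel with a leaky integrator rather than computing the true instantaneous (Hilbert) envelope, and the frame period and time constant are illustrative defaults, not the values of the original system.

```python
import numpy as np
from scipy.signal import lfilter

def rate_map(bm, fs, frame_ms=10.0, tau_ms=8.0):
    """Compute a 'rate map' from gammatone filterbank outputs.

    bm       : array (n_channels, n_samples) of filter outputs
    fs       : sample rate in Hz
    frame_ms : frame period of the feature vectors (illustrative default)
    tau_ms   : time constant of the smoothing filter (illustrative default)
    """
    # Half-wave rectify and cube-root compress each channel.
    x = np.cbrt(np.maximum(bm, 0.0))
    # Smooth with a first-order leaky integrator to approximate the envelope.
    a = np.exp(-1.0 / (fs * tau_ms / 1000.0))
    env = lfilter([1.0 - a], [1.0, -a], x, axis=1)
    # Downsample by keeping one frame every frame_ms milliseconds.
    hop = int(round(fs * frame_ms / 1000.0))
    return env[:, ::hop]                      # shape (n_channels, n_frames)
```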
12. A model of precedence processing
- A simple model of a complex phenomenon!
- Create an inhibitory signal by lowpass filtering the envelope with
  - h_lp(t) = A t exp(-t/α)
- The inhibited auditory nerve response r(t,f) is given by
  - r(t,f) = ⌈a(t,f) − G (h_lp(t) ∗ env(t,f))⌉
- where a(t,f) is the auditory nerve response, ∗ denotes convolution, ⌈·⌉ denotes half-wave rectification and G determines the strength of inhibition (see the sketch below).
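A sketch of the inhibition stage corresponding to the equations above; the filter length and the values of A and α used here are placeholders rather than the parameters of the original model.

```python
import numpy as np
from scipy.signal import fftconvolve

def precedence_inhibit(a, env, fs, G=1.0, alpha_ms=10.0, A=1.0):
    """Apply delayed inhibition to a simulated auditory nerve response.

    a, env : arrays (n_channels, n_samples) of fine structure and envelope
    G      : strength of inhibition (G = 0 disables the precedence model)
    alpha_ms, A : parameters of the inhibitory lowpass filter
                  h_lp(t) = A * t * exp(-t / alpha); placeholder values.
    """
    t = np.arange(int(0.05 * fs)) / fs                   # 50 ms filter support
    h_lp = A * t * np.exp(-t / (alpha_ms / 1000.0))
    out = np.zeros(a.shape, dtype=float)
    for c in range(a.shape[0]):
        # Inhibitory signal: lowpass-filtered envelope of this channel.
        inhibition = fftconvolve(env[c], h_lp)[: a.shape[1]]
        # Subtract scaled inhibition and half-wave rectify.
        out[c] = np.maximum(a[c] - G * inhibition, 0.0)
    return out
```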
13. Output from the precedence model
[Figure: three panels over 0-50 ms (amplitude vs. time): channel envelope and fine time structure, the inhibitory signal, and the inhibited fine structure.]
14. Azimuth estimation
- Estimate ITD by computing a cross-correlation in each frequency band.
- Form a cross-correlogram (CCG): a two-dimensional plot of ITD against frequency band.
- Sum across frequency to give a pooled cross-correlogram (see the sketch below).
- Warp the ITD axis to an azimuth axis, since HRIR-filtered sounds show only a weak frequency dependence in ITD.
- Sharpen the CCG by replacing local peaks with narrow Gaussians (the "skeleton" CCG), akin to lateral inhibition.
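A sketch of the per-frame cross-correlogram and its pooled summary, assuming the precedence-processed left and right channel signals for one frame are given; warping to azimuth and skeleton sharpening are omitted, and the lag range is an illustrative choice.

```python
import numpy as np

def cross_correlogram(left, right, fs, max_itd_ms=1.0):
    """Frame-level cross-correlogram: normalised cross-correlation of the
    left and right signals in each frequency channel, over a range of lags.

    left, right : arrays (n_channels, n_samples) for one time frame
    Returns the CCG (n_channels, n_lags), the pooled CCG (summed over
    frequency) and the lag axis in seconds.
    """
    max_lag = int(round(max_itd_ms * 1e-3 * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    n_ch, n = left.shape
    ccg = np.zeros((n_ch, len(lags)))
    for c in range(n_ch):
        l, r = left[c], right[c]
        denom = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2)) + 1e-12
        for k, lag in enumerate(lags):
            if lag >= 0:
                ccg[c, k] = np.dot(l[lag:], r[: n - lag]) / denom
            else:
                ccg[c, k] = np.dot(l[: n + lag], r[-lag:]) / denom
    pooled = ccg.sum(axis=0)            # pooled cross-correlogram
    return ccg, pooled, lags / fs
```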
15. Cross-correlogram (ITD)
[Figure: cross-correlogram (channel centre frequency vs. interaural time difference) for a mixture of male and female speech; male speech at 20 deg azimuth, female speech at -20 deg.]
16. Skeleton cross-correlogram (azimuth)
[Figure: skeleton cross-correlogram (channel centre frequency vs. azimuth in degrees) for the same mixture; male speech at 20 deg azimuth, female speech at -20 deg.]
17. Grouping by common azimuth
- Locate the source azimuths from the pooled CCG.
- For each channel i at each time frame j, set the mask to 1 iff
  - C(i,j,θs) > C(i,j,θn) and C(i,j,θs) > Q
- where C(i,j,θ) is the cross-correlogram, θs is the azimuth of the speech, θn is the azimuth of the noise and Q is a threshold.
- Motivation: select channels for the missing data mask in which the speech dominates the noise and the energy is not too low (see the sketch below).
- Hint given: the system knows that θs > θn, i.e. which of the two located azimuths belongs to the speech.
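A sketch of the mask-estimation rule above, assuming the cross-correlogram has already been warped onto an azimuth axis; the array shapes and names are illustrative.

```python
import numpy as np

def azimuth_mask(ccg, az_axis, theta_s, theta_n, Q):
    """Estimate a missing data mask by grouping on common azimuth.

    ccg     : azimuth-warped CCG, shape (n_channels, n_frames, n_azimuths)
    az_axis : azimuth in degrees corresponding to the last axis of ccg
    theta_s, theta_n : azimuths of the speech and noise sources
    Q       : threshold excluding low-energy / weakly correlated channels
    """
    s = np.argmin(np.abs(az_axis - theta_s))   # index of the speech azimuth
    n = np.argmin(np.abs(az_axis - theta_n))   # index of the noise azimuth
    c_s, c_n = ccg[:, :, s], ccg[:, :, n]
    # Mask element is 1 where speech dominates and the response exceeds Q.
    return ((c_s > c_n) & (c_s > Q)).astype(int)
```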
18. ILD constraint
- Compute the interaural level difference as
  - ILD(i,j) = 10 log10 [engR(i,j) / engL(i,j)]
- where engk(i,j) is the energy in channel i at time frame j for ear k.
- Store the ideal ILD for a particular azimuth in a lookup table.
- Cross-check the observed ILD against the ideal ILD for the observed azimuth; if they do not agree to within 0.5 dB, set the mask element to zero (see the sketch below).
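A sketch of the ILD cross-check, assuming the per-channel ear energies and an ideal-ILD lookup table (indexed by channel and azimuth) are available; the names and shapes are illustrative.

```python
import numpy as np

def apply_ild_constraint(mask, eng_left, eng_right, az_frame, ild_table, tol_db=0.5):
    """Zero mask elements whose observed ILD disagrees with the ideal ILD
    stored for the estimated azimuth.

    eng_left, eng_right : per-channel energies, shape (n_channels, n_frames)
    az_frame  : estimated azimuth index for each frame
    ild_table : ideal ILD in dB, shape (n_channels, n_azimuths), assumed to
                be measured in advance from HRIR-filtered calibration signals
    """
    # Observed ILD in dB (small constant avoids division by zero).
    ild = 10.0 * np.log10((eng_right + 1e-12) / (eng_left + 1e-12))
    ideal = ild_table[:, az_frame]                 # (n_channels, n_frames)
    agrees = np.abs(ild - ideal) <= tol_db
    return mask * agrees
```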
19. Spectral energy normalisation
- HRIR filtering and reverberation introduce convolutional distortion.
- Features are usually normalised by the mean and variance in each frequency band; but what if data is missing?
- The current approach is simple: normalise by the mean of the N largest reliable feature values Yr in each channel (see the sketch below).
- Motivation: features that have high energy and are marked as reliable should be least affected by the noise background.
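A sketch of the normalisation rule; the choice of N here is illustrative rather than the value used in the reported experiments.

```python
import numpy as np

def normalise_reliable(ratemap, mask, n_largest=20):
    """Per-channel spectral energy normalisation using reliable features only.

    Each channel is divided by the mean of its N largest feature values that
    the missing data mask marks as reliable.
    """
    out = np.array(ratemap, dtype=float)
    for c in range(ratemap.shape[0]):
        reliable = ratemap[c, mask[c] == 1]
        if reliable.size == 0:
            continue                              # no reliable data: leave channel unchanged
        top = np.sort(reliable)[-n_largest:]      # the N largest reliable values
        out[c] /= top.mean() + 1e-12
    return out
```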
20. A priori mask
- To assess the limits of the missing data approach, we employ an a priori mask.
- It is derived by measuring the difference between the rate map for clean speech and its noise/reverberation-contaminated counterpart.
- Mask elements are set to 1 only if this difference lies within a threshold value (tuned for each condition); see the sketch below.
- This should give near-optimal performance.
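A sketch of the a priori mask construction; expressing the clean/noisy difference in dB is an assumption made here for convenience, and in the reported experiments the threshold was tuned separately for each condition.

```python
import numpy as np

def a_priori_mask(ratemap_clean, ratemap_noisy, threshold_db=3.0):
    """'Oracle' mask: mark a time-frequency element as reliable when the
    noisy rate map deviates from the clean one by less than a threshold.
    """
    diff_db = 10.0 * np.log10((ratemap_noisy + 1e-12) / (ratemap_clean + 1e-12))
    return (np.abs(diff_db) <= threshold_db).astype(int)
```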
21. Masks estimated by binaural grouping
[Figure: rate maps, the mask estimated by the binaural processor, and the a priori mask, for a mixture of speech (20 deg azimuth) and an interfering talker (-20 deg azimuth) at 0 dB SNR. Top: anechoic. Bottom: T60 reverberation time of 0.3 sec.]
22. Evaluation
- Hidden Markov model (HMM) recogniser, modified for the missing data approach.
- Tested on 240 utterances from the TIDigits connected digit corpus.
- 12 word-level HMMs (silence, "oh", "zero" and "1" to "9").
- Noise intrusions from Cooke's (1993) corpus: male speaker and rock music.
- Baseline recogniser for comparison, trained on mel-frequency cepstral coefficients (MFCCs) and their derivatives.
23. Example sounds
- "one five zero zero six", male speaker, anechoic
- With T60 reverberation time of 0.3 sec
- With an interfering male speaker, 0 dB SNR, anechoic, 40 degrees azimuth separation
- Two speakers, T60 reverberation time of 0.3 sec
24. Effect of reverberation (anechoic)
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems; male speech masker, 40 degrees separation.]
25. Effect of reverberation (small office)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems; male speech masker, 40 degrees separation.]
26. Effect of spatial separation (10 deg)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]
27. Effect of spatial separation (20 deg)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]
28. Effect of spatial separation (40 deg)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]
29. Effect of noise source (rock music)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]
30. Effect of noise source (male speech)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]
31. Effect of precedence processing
[Figure: results without inhibition (G = 0.0) and with inhibition (G = 1.0).]
32. Summary of results
- The binaural missing data system is more robust than a conventional MFCC-based recogniser when interfering sounds and reverberation are present.
- The performance of the binaural system depends on the angular separation between the sources.
- Source characteristics influence the performance of the binaural system; it is most helpful when the spectra of the speech and interfering sound substantially overlap.
- The performance of the binaural system is close to that of the a priori masks in anechoic conditions; there is room for improvement elsewhere.
33. Conclusions and future work
- The combination of a binaural model and the missing data framework appears promising.
- However, it is still far from matching human performance.
- Major outstanding issues:
  - A better model of precedence processing
  - Source identification (top-down constraints)
  - Source selection (the role of attention)
  - Moving sound sources
  - More complex acoustic environments
35. Precedence effect
- A group of phenomena which underlie the ability of listeners to localise sound sources in reverberant spaces.
- The direct sound is followed by reflections, but listeners usually report that the source originates from the direction corresponding to the first wavefront.
- Usually explained by delayed inhibition, which suppresses location information from about 1 ms after the onset of an abrupt sound.
36. Full set of example sounds
- "one five zero zero six", male speaker, anechoic
- With T60 reverberation time of 0.3 sec (small office)
- With T60 reverberation time of 0.45 sec (larger office)
- With an interfering male speaker, 0 dB SNR, anechoic, 40 degrees azimuth separation
- Two speakers, T60 reverberation time of 0.3 sec
- Two speakers, T60 reverberation time of 0.45 sec
37. Effect of reverberation (larger office)
- Reverberation time 0.45 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems; male speech masker, 40 degrees separation.]
38. Effect of noise source (female speech)
- Reverberation time 0.3 sec
[Figure: accuracy vs. signal-to-noise ratio (dB) for the a priori, binaural and MFCC systems.]