Title: An Auditory Scene Analysis Approach to Speech Segregation
1. An Auditory Scene Analysis Approach to Speech Segregation
- DeLiang Wang
- Perception and Neurodynamics Lab
- The Ohio State University
2. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
3. Real-world audition
- What?
- Source type
- Speech
- message
- speaker
- age, gender, linguistic origin, mood
- Music
- Car passing by
- Where?
- Left, right, up, down
- How close?
- Channel characteristics
- Environment characteristics
- Room configuration
- Ambient noise
4. Humans versus machines
- Additionally, car noise is not a very effective speech masker
- At 10 dB
- At 0 dB
- Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognizers (around 40% with noise adaptation)
- Source: Lippmann (1997)
5. Speech segregation problem
- In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis
- Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limits
- They suffer from the assumption of configuration stationarity
- They cannot deal with single-microphone mixtures or situations where multiple sounds arrive from close directions
- Most speech enhancement methods developed for the monaural situation can deal with only stationary acoustic interference
6. Auditory scene analysis (Bregman '90)
- Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
- Ball-room problem, Helmholtz, 1863 ("complicated beyond conception")
- Cocktail-party problem, Cherry '53
- Two conceptual processes of auditory scene analysis (ASA):
- Segmentation: decompose the acoustic mixture into sensory elements (segments)
- Grouping: combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source
7. Computational auditory scene analysis
- Computational ASA (CASA) systems approach sound separation based on ASA principles
- Weintraub '85; Cooke '93; Brown & Cooke '94; Ellis '96; Wang & Brown '99
- CASA progress: monaural segregation with minimal assumptions
- CASA challenges:
- Broadband high-frequency mixtures
- Reliable pitch tracking of noisy speech
- Unvoiced speech
8. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
9. Resolved and unresolved harmonics
- For voiced speech, lower harmonics are resolved while higher harmonics are not
- For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
- Our model (Hu & Wang '04) applies different grouping mechanisms to low-frequency and high-frequency signals
- Low-frequency signals are grouped based on periodicity and temporal continuity
- High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity
10. Diagram of the Hu-Wang model
11. Cochleogram: auditory peripheral model
- Spectrogram
- Plot of log energy across time and frequency (linear frequency scale)
- Cochleogram
- Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compression operations (log or cube root) (see the sketch below)
- Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
- Previous work suggests better resilience to noise than the spectrogram
(Figure: spectrogram and cochleogram of the same utterance)
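As a concrete illustration, here is a minimal cochleogram sketch in Python. It is not the code of the Hu-Wang system: the gammatone impulse response and ERB spacing follow the standard Patterson/Glasberg-Moore formulations, while the channel count, frame size, and cube-root compression are illustrative choices.

```python
import numpy as np
from scipy.signal import fftconvolve

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.064, order=4, b=1.019):
    """Impulse response of a 4th-order gammatone filter centered at fc."""
    t = np.arange(int(duration * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleogram(x, fs, n_channels=128, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    """Gammatone filtering followed by framewise cube-root compression."""
    # Center frequencies equally spaced on the ERB-rate scale (quasi-logarithmic)
    erb_rate = lambda f: 21.4 * np.log10(0.00437 * f + 1.0)
    inv_erb_rate = lambda e: (10 ** (e / 21.4) - 1.0) / 0.00437
    cfs = inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(cfs):
        r = fftconvolve(x, gammatone_ir(fc, fs), mode="same")  # cochlear filtering
        for m in range(n_frames):
            seg = r[m * fhop : m * fhop + flen]
            cg[c, m] = np.sum(seg ** 2) ** (1.0 / 3.0)  # cube-root compression
    return cfs, cg
```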
12. Mid-level auditory representations
- Mid-level representations form the basis for segment formation and subsequent grouping
- The correlogram extracts periodicity and AM from simulated auditory nerve firing patterns
- The summary correlogram is used to identify global pitch
- Cross-channel correlation between adjacent correlogram channels identifies regions that are excited by the same harmonic or formant
13. Correlogram
- Short-term autocorrelation of the output of each frequency channel of the cochleogram (see the sketch below)
- Peaks in the summary correlogram indicate pitch periods (F0)
- A standard model of pitch perception
(Figure: correlogram and summary correlogram of a double vowel, showing the two F0s)
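A minimal correlogram sketch, assuming `responses` is an (n_channels, n_samples) array of per-channel gammatone filter outputs (e.g., the `r` signals computed inside the cochleogram sketch above); the default window of 320 samples and maximum lag of 200 samples correspond to 20 ms and 12.5 ms at 16 kHz and are illustrative.

```python
import numpy as np

def correlogram(responses, frame=320, hop=160, max_lag=200):
    """Short-term autocorrelation of each cochlear channel.
    Returns acg with shape (n_channels, n_frames, max_lag)."""
    n_ch, n = responses.shape
    n_frames = 1 + (n - frame - max_lag) // hop
    acg = np.zeros((n_ch, n_frames, max_lag))
    for c in range(n_ch):
        for m in range(n_frames):
            s = m * hop
            win = responses[c, s : s + frame]
            for lag in range(max_lag):
                acg[c, m, lag] = np.dot(win, responses[c, s + lag : s + lag + frame])
    return acg

def summary_correlogram(acg):
    """Sum over channels; peaks over the lag axis at each frame indicate
    candidate pitch periods (F0)."""
    return acg.sum(axis=0)
```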
14. Cross-channel correlation
(Figure: (a) correlogram and cross-channel correlation of the hair cell response to clean speech; (b) corresponding representations for response envelopes)
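A sketch of the cross-channel correlation step, reusing the `acg` array from the correlogram sketch above. The normalization follows the usual zero-mean, unit-norm definition; the exact treatment in the Hu-Wang system may differ.

```python
import numpy as np

def cross_channel_correlation(acg):
    """Correlation between the normalized autocorrelations of adjacent
    channels; high values mark regions excited by the same harmonic
    or formant. Returns an array of shape (n_channels - 1, n_frames)."""
    a = acg - acg.mean(axis=2, keepdims=True)            # zero-mean over lag
    a = a / (np.linalg.norm(a, axis=2, keepdims=True) + 1e-12)
    return np.einsum("cml,cml->cm", a[:-1], a[1:])       # dot product over lag
```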
15. Initial segregation
- Segments are formed based on temporal continuity and cross-channel correlation
- Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones
- Initial grouping into a foreground (target) stream and a background stream according to global pitch, using the oscillatory correlation model of Wang and Brown (1999)
16. Pitch tracking
- Pitch periods of the target speech are estimated from the segregated speech stream
- Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints (see the sketch below):
- Target pitch should agree with the periodicity of the time-frequency units in the initial speech stream
- Pitch periods change smoothly, thus allowing for verification and interpolation
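A minimal sketch of the smoothness constraint: frames whose pitch period jumps too far from the previous frame are marked unreliable, then filled by interpolation from reliable neighbors. The 20% change bound and the linear interpolation are assumptions for illustration, not the exact procedure of the model.

```python
import numpy as np

def verify_and_interpolate(periods, reliable, max_rel_change=0.2):
    """periods: per-frame pitch period estimates (samples); reliable: boolean
    mask of frames whose period agrees with the initial stream's periodicity."""
    p = periods.astype(float).copy()
    ok = reliable.copy()
    for m in range(1, len(p)):
        if ok[m] and ok[m - 1] and abs(p[m] - p[m - 1]) > max_rel_change * p[m - 1]:
            ok[m] = False                       # violates the smooth-change constraint
    good = np.flatnonzero(ok)
    if good.size == 0:                          # assumes at least one reliable frame
        return p
    return np.interp(np.arange(len(p)), good, p[good])  # re-estimate bad frames
```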
17. Pitch tracking example
- (a) Global pitch (line: pitch track of clean speech) for a mixture of target speech and a cocktail-party intrusion
- (b) Estimated target pitch
18. T-F unit labeling
- In the low-frequency range:
- A time-frequency (T-F) unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
- In the high-frequency range:
- Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combinational tones (Helmholtz, 1863)
- A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch (see the sketches below)
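Sketches of the two labeling rules. The thresholds `theta` and `tol` are assumed values; the published model uses specific criteria that these only approximate.

```python
def label_low_freq(acg_unit, target_period, theta=0.85):
    """Low-frequency rule: assign the unit to the target if its
    autocorrelation at the target pitch period is close to the maximum
    of its autocorrelation over all lags."""
    return acg_unit[target_period] / (acg_unit.max() + 1e-12) > theta

def label_high_freq(am_rate_hz, target_period, fs, tol=0.15):
    """High-frequency rule: assign the unit to the target if its AM
    repetition rate matches the target F0 within a relative tolerance."""
    f0 = fs / target_period                 # target F0 in Hz
    return abs(am_rate_hz - f0) / f0 < tol
```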
19. AM example
- (a) The output of a gammatone filter (center frequency 2.6 kHz) in response to clean speech
- (b) The corresponding autocorrelation function
20. AM repetition rates
- To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered
- The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response (see the sketch below)
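A sketch of this estimate: half-wave rectification, bandpass filtering around a plausible F0 range, then fitting a single sinusoid to the unit's signal by gradient descent on the squared error. The passband, learning rate, iteration count, and normalization are illustrative and may need tuning.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def am_signal(response, fs, band=(50.0, 550.0)):
    """Half-wave rectify a filter response, then bandpass filter it to
    keep envelope fluctuations in a plausible F0 range (band is assumed)."""
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, np.maximum(response, 0.0))

def am_rate(x, fs, f_init, iters=500, lr=0.05):
    """Fit x[n] ~= A*sin(2*pi*f*n/fs + phi) by gradient descent;
    the fitted f (Hz) is the AM repetition rate of the unit."""
    x = x / (np.abs(x).max() + 1e-12)       # normalize so one lr suits all units
    n = np.arange(len(x))
    A, f, phi = np.sqrt(2.0) * x.std(), float(f_init), 0.0
    for _ in range(iters):
        arg = 2 * np.pi * f * n / fs + phi
        err = A * np.sin(arg) - x           # residual of the sinusoidal model
        cos = A * np.cos(arg)
        A -= lr * np.mean(err * np.sin(arg))
        phi -= lr * np.mean(err * cos)
        f -= lr * np.mean(err * cos * n / fs)  # constant factors folded into lr
    return f
```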
21. Final segregation
- New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e., common AM). They are then grouped into the foreground stream according to AM repetition rates
- Other units are grouped according to temporal and spectral continuity
22. Ideal binary mask for performance evaluation
- Within a T-F unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise (see the sketch below)
- Motivation: auditory masking, whereby the stronger signal masks the weaker one within a critical band
- We have suggested using ideal binary masks as ground truth for CASA performance evaluation
- Consistent with recent speech intelligibility results (Roman et al. '03; Brungart et al. '05)
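A minimal sketch of the mask itself, assuming the target and interference are available before mixing and `target_energy`/`noise_energy` are their per-unit energies (e.g., from the cochleogram front end above):

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy):
    """IBM: 1 where the target is stronger than the interference within
    a T-F unit, 0 otherwise (i.e., a 0 dB local criterion)."""
    return (target_energy > noise_energy).astype(np.uint8)
```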
23. Ideal binary mask illustration
24. Voiced speech segregation example
25. Systematic SNR results
(Chart: output SNR in dB for the Hu-Wang model)
- Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see next slide)
- Average SNR gain: 12.3 dB, which is 5.2 dB better than the Wang-Brown model (1999) and 6.4 dB better than the spectral subtraction method
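Such SNR figures can be computed as below. A common convention in this line of work (an assumption here, not stated on the slide) is to take the reference to be the target resynthesized through an all-one mask, so that resynthesis artifacts do not count against the system.

```python
import numpy as np

def snr_db(reference, estimate):
    """Output SNR in dB: reference energy over residual-error energy."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

# SNR gain of a system: snr_db(ref, segregated) - snr_db(ref, mixture)
```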
26. CASA progress on voiced speech segregation
- 100-mixture set used by Cooke (1993)
- 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: cocktail party, N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)
(Audio demos: original mixtures of voiced speech with telephone, male, and female intrusions, and the corresponding segregated outputs of Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004))
27. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
28. Segmentation and unvoiced speech segregation
- To deal with unvoiced speech segregation, we (Hu & Wang '04) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech
- The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source
- This definition of segmentation does not distinguish between voiced and unvoiced sounds
- Segmentation is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
- The segmentation strategy is therefore based on onset and offset analysis
29. Scale-space analysis for auditory segmentation
- From a computational standpoint, auditory segmentation is similar to image (visual) segmentation
- Visual segmentation: finding bounding contours of visual objects
- Auditory segmentation: finding onset and offset fronts of segments
- Onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation, and proceeds in three stages (a sketch of the detection stage follows this list):
- Smoothing
- Onset/offset detection and onset/offset front matching
- Multiscale integration
30. Example of auditory segmentation
31. Speech segregation
- The general strategy for speech segregation is to first segregate voiced speech using the pitch cue, and then deal with unvoiced speech
- To segregate unvoiced speech, we perform auditory segmentation, and then group the segments that correspond to unvoiced speech
32. Segment classification
- For nonspeech interference, grouping is in fact a classification task: classify segments as either speech or non-speech
- The following features are used for classification:
- Spectral envelope
- Segment duration
- Segment intensity
- Training data:
- Speech: the training part of the TIMIT database
- Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.
- A Gaussian mixture model is trained for each phoneme, and for interference as well, which provides the basis for a likelihood ratio test (see the sketch below)
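A sketch of this scheme with scikit-learn GMMs. The feature extraction, component count, and decision threshold are assumptions; the slide only fixes the overall structure of per-phoneme models, an interference model, and a likelihood ratio test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(speech_feats_by_phoneme, interference_feats, n_components=8):
    """Fit one GMM per phoneme (features drawn from the TIMIT training
    part) plus a single GMM for the interference class."""
    phoneme_gmms = {ph: GaussianMixture(n_components).fit(x)
                    for ph, x in speech_feats_by_phoneme.items()}
    noise_gmm = GaussianMixture(n_components).fit(interference_feats)
    return phoneme_gmms, noise_gmm

def is_speech_segment(segment_feats, phoneme_gmms, noise_gmm, threshold=0.0):
    """Likelihood ratio test: keep a segment as speech when its
    best-matching phoneme model beats the interference model."""
    speech_ll = max(g.score(segment_feats) for g in phoneme_gmms.values())
    return speech_ll - noise_gmm.score(segment_feats) > threshold
```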
33. Example of segregating fricatives/affricates
- Utterance: "That noise problem grows more annoying each day"
- Interference: crowd noise with music
- (IBM: ideal binary mask)
34. Example of segregating stops
- Utterance: "A good morrow to you, my boy"
- Interference: rain
35. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
36. How does the auditory system perform ASA?
- Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system
- Binding problem: how are these features combined to form a perceptual whole (stream)?
- Hierarchies of feature-detecting cells exist, but they do not seem to constitute a solution to the binding problem
37. Oscillatory correlation theory for ASA
- Neural oscillators are used to represent auditory features
- Oscillators representing features of the same source are synchronized, and are desynchronized from those representing different sources
- Originally proposed by von der Malsburg & Schneider (1986), and further developed by Wang (1996)
- Supported by growing experimental evidence
38. Oscillatory correlation representation
39. Oscillatory correlation for ASA
- LEGION dynamics (Terman & Wang '95) provides a computational foundation for the oscillatory correlation theory
- The utility of oscillatory correlation has been demonstrated for speech segregation (Wang & Brown '99), modeling auditory attention (Wrigley & Brown '04), etc.
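For concreteness, a single relaxation oscillator of the kind used in LEGION can be simulated as below; the equations follow the Terman-Wang form, and the parameter values and initial conditions are illustrative. In the full network, excitatory coupling added to the input synchronizes oscillators within a segment, while a global inhibitor desynchronizes different segments.

```python
import numpy as np

def relaxation_oscillator(I=0.8, T=100.0, dt=0.01, eps=0.02, gamma=6.0, beta=0.1):
    """Euler simulation of dx/dt = 3x - x^3 + 2 - y + I and
    dy/dt = eps*(gamma*(1 + tanh(x/beta)) - y). With I > 0 the unit
    produces relaxation oscillations alternating between an active
    and a silent phase."""
    steps = int(T / dt)
    x, y = np.empty(steps), np.empty(steps)
    x[0], y[0] = -1.0, 1.0
    for k in range(steps - 1):
        x[k + 1] = x[k] + dt * (3 * x[k] - x[k] ** 3 + 2 - y[k] + I)
        y[k + 1] = y[k] + dt * eps * (gamma * (1 + np.tanh(x[k] / beta)) - y[k])
    return x, y
```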
40. Summary
- CASA approach to monaural speech segregation
- Performs substantially better than previous CASA systems for voiced speech segregation
- The AM cue and target pitch tracking are important for the performance improvement
- Early steps toward unvoiced speech segregation
- Auditory segmentation based on onset/offset analysis
- Segregation using speech classification
- Oscillatory correlation theory for ASA
41. Acknowledgment
- Joint work with Guoning Hu
- Funded by AFOSR/AFRL and NSF