Title: An Auditory Scene Analysis Approach to Speech Segregation
1. An Auditory Scene Analysis Approach to Speech Segregation
- DeLiang Wang
- Perception and Neurodynamics Lab
- The Ohio State University
2. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
3. Real-world audition
- What?
- Source type
- Speech
- message
- speaker
- age, gender, linguistic origin, mood
- Music
- Car passing by
- Where?
- Left, right, up, down
- How close?
- Channel characteristics
- Environment characteristics
- Room configuration
- Ambient noise
4. Humans versus machines
- Additionally, car noise is not a very effective speech masker
- At 10 dB
- At 0 dB
- Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognizers (around 40% with noise adaptation)
- Source: Lippmann (1997)
5. Speech segregation problem
- In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis
- Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limits
- They suffer from the assumption of configuration stationarity
- They cannot deal with single-microphone mixtures or situations where multiple sounds arrive from close directions
- Most speech enhancement methods developed for the monaural situation can deal with only stationary acoustic interference
6. Auditory scene analysis (Bregman '90)
- Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
- Ball-room problem, Helmholtz, 1863 ("complicated beyond conception")
- Cocktail-party problem, Cherry '53
- Two conceptual processes of auditory scene analysis (ASA):
- Segmentation: decompose the acoustic mixture into sensory elements (segments)
- Grouping: combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source
7. Computational auditory scene analysis
- Computational ASA (CASA) systems approach sound separation based on ASA principles
- Weintraub '85; Cooke '93; Brown & Cooke '94; Ellis '96; Wang & Brown '99
- CASA progress: monaural segregation with minimal assumptions
- CASA challenges:
- Broadband high-frequency mixtures
- Reliable pitch tracking of noisy speech
- Unvoiced speech
8. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
9. Resolved and unresolved harmonics
- For voiced speech, lower harmonics are resolved while higher harmonics are not
- For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
- Our model (Hu & Wang '04) applies different grouping mechanisms to low-frequency and high-frequency signals
- Low-frequency signals are grouped based on periodicity and temporal continuity
- High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity
10. Diagram of the Hu-Wang model
11. Cochleogram: auditory peripheral model
- Spectrogram
- Plot of log energy across time and frequency (linear frequency scale)
- Cochleogram
- Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compression operations (log or cube root) (see the sketch below)
- Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
- Previous work suggests better resilience to noise than the spectrogram
(Figure: spectrogram and cochleogram of the same utterance)
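As a concrete illustration, here is a minimal cochleogram sketch in Python. It is not the code of the Hu-Wang system: the gammatone impulse response and ERB spacing follow the standard Patterson/Glasberg-Moore formulations, while the channel count, frame size, and cube-root compression are illustrative choices.

```python
import numpy as np
from scipy.signal import fftconvolve

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.064, order=4, b=1.019):
    """Impulse response of a 4th-order gammatone filter centered at fc."""
    t = np.arange(int(duration * fs)) / fs
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleogram(x, fs, n_channels=128, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    """Gammatone filtering followed by framewise cube-root compression."""
    # Center frequencies equally spaced on the ERB-rate scale (quasi-logarithmic)
    erb_rate = lambda f: 21.4 * np.log10(0.00437 * f + 1.0)
    inv_erb_rate = lambda e: (10 ** (e / 21.4) - 1.0) / 0.00437
    cfs = inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(cfs):
        r = fftconvolve(x, gammatone_ir(fc, fs), mode="same")  # cochlear filtering
        for m in range(n_frames):
            seg = r[m * fhop : m * fhop + flen]
            cg[c, m] = np.sum(seg ** 2) ** (1.0 / 3.0)  # cube-root compression
    return cfs, cg
```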
12. Mid-level auditory representations
- Mid-level representations form the basis for segment formation and subsequent grouping
- The correlogram extracts periodicity and AM from simulated auditory nerve firing patterns
- The summary correlogram is used to identify global pitch
- Cross-channel correlation between adjacent correlogram channels identifies regions that are excited by the same harmonic or formant
13. Correlogram
- Short-term autocorrelation of the output of each frequency channel of the cochleogram (see the sketch below)
- Peaks in the summary correlogram indicate pitch periods (F0)
- A standard model of pitch perception
(Figure: correlogram and summary correlogram of a double vowel, showing the two F0s)
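A minimal correlogram sketch, assuming `responses` is an (n_channels, n_samples) array of per-channel gammatone filter outputs (e.g., the `r` signals computed inside the cochleogram sketch above); the default window of 320 samples and maximum lag of 200 samples correspond to 20 ms and 12.5 ms at 16 kHz and are illustrative.

```python
import numpy as np

def correlogram(responses, frame=320, hop=160, max_lag=200):
    """Short-term autocorrelation of each cochlear channel.
    Returns acg with shape (n_channels, n_frames, max_lag)."""
    n_ch, n = responses.shape
    n_frames = 1 + (n - frame - max_lag) // hop
    acg = np.zeros((n_ch, n_frames, max_lag))
    for c in range(n_ch):
        for m in range(n_frames):
            s = m * hop
            win = responses[c, s : s + frame]
            for lag in range(max_lag):
                acg[c, m, lag] = np.dot(win, responses[c, s + lag : s + lag + frame])
    return acg

def summary_correlogram(acg):
    """Sum over channels; peaks over the lag axis at each frame indicate
    candidate pitch periods (F0)."""
    return acg.sum(axis=0)
```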
14. Cross-channel correlation
(Figure: (a) correlogram and cross-channel correlation of the hair cell response to clean speech; (b) corresponding representations for response envelopes)
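A sketch of the cross-channel correlation step, reusing the `acg` array from the correlogram sketch above. The normalization follows the usual zero-mean, unit-norm definition; the exact treatment in the Hu-Wang system may differ.

```python
import numpy as np

def cross_channel_correlation(acg):
    """Correlation between the normalized autocorrelations of adjacent
    channels; high values mark regions excited by the same harmonic
    or formant. Returns an array of shape (n_channels - 1, n_frames)."""
    a = acg - acg.mean(axis=2, keepdims=True)            # zero-mean over lag
    a = a / (np.linalg.norm(a, axis=2, keepdims=True) + 1e-12)
    return np.einsum("cml,cml->cm", a[:-1], a[1:])       # dot product over lag
```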
15. Initial segregation
- Segments are formed based on temporal continuity and cross-channel correlation
- Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones
- Initial grouping into a foreground (target) stream and a background stream according to global pitch, using the oscillatory correlation model of Wang and Brown (1999)
16. Pitch tracking
- Pitch periods of the target speech are estimated from the segregated speech stream
- Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints (see the sketch below):
- Target pitch should agree with the periodicity of the time-frequency units in the initial speech stream
- Pitch periods change smoothly, thus allowing for verification and interpolation
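A minimal sketch of the smoothness constraint: frames whose pitch period jumps too far from the previous frame are marked unreliable, then filled by interpolation from reliable neighbors. The 20% change bound and the linear interpolation are assumptions for illustration, not the exact procedure of the model.

```python
import numpy as np

def verify_and_interpolate(periods, reliable, max_rel_change=0.2):
    """periods: per-frame pitch period estimates (samples); reliable: boolean
    mask of frames whose period agrees with the initial stream's periodicity."""
    p = periods.astype(float).copy()
    ok = reliable.copy()
    for m in range(1, len(p)):
        if ok[m] and ok[m - 1] and abs(p[m] - p[m - 1]) > max_rel_change * p[m - 1]:
            ok[m] = False                       # violates the smooth-change constraint
    good = np.flatnonzero(ok)
    if good.size == 0:                          # assumes at least one reliable frame
        return p
    return np.interp(np.arange(len(p)), good, p[good])  # re-estimate bad frames
```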
17. Pitch tracking example
- (a) Global pitch (line: pitch track of clean speech) for a mixture of target speech and a cocktail-party intrusion
- (b) Estimated target pitch
18. T-F unit labeling
- In the low-frequency range:
- A time-frequency (T-F) unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
- In the high-frequency range:
- Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combinational tones (Helmholtz, 1863)
- A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch (see the sketches below)
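Sketches of the two labeling rules. The thresholds `theta` and `tol` are assumed values; the published model uses specific criteria that these only approximate.

```python
def label_low_freq(acg_unit, target_period, theta=0.85):
    """Low-frequency rule: assign the unit to the target if its
    autocorrelation at the target pitch period is close to the maximum
    of its autocorrelation over all lags."""
    return acg_unit[target_period] / (acg_unit.max() + 1e-12) > theta

def label_high_freq(am_rate_hz, target_period, fs, tol=0.15):
    """High-frequency rule: assign the unit to the target if its AM
    repetition rate matches the target F0 within a relative tolerance."""
    f0 = fs / target_period                 # target F0 in Hz
    return abs(am_rate_hz - f0) / f0 < tol
```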
19. AM example
- (a) The output of a gammatone filter (center frequency 2.6 kHz) in response to clean speech
- (b) The corresponding autocorrelation function
20. AM repetition rates
- To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered
- The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response (see the sketch below)
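A sketch of this estimate: half-wave rectification, bandpass filtering around a plausible F0 range, then fitting a single sinusoid to the unit's signal by gradient descent on the squared error. The passband, learning rate, iteration count, and normalization are illustrative and may need tuning.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def am_signal(response, fs, band=(50.0, 550.0)):
    """Half-wave rectify a filter response, then bandpass filter it to
    keep envelope fluctuations in a plausible F0 range (band is assumed)."""
    b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, np.maximum(response, 0.0))

def am_rate(x, fs, f_init, iters=500, lr=0.05):
    """Fit x[n] ~= A*sin(2*pi*f*n/fs + phi) by gradient descent;
    the fitted f (Hz) is the AM repetition rate of the unit."""
    x = x / (np.abs(x).max() + 1e-12)       # normalize so one lr suits all units
    n = np.arange(len(x))
    A, f, phi = np.sqrt(2.0) * x.std(), float(f_init), 0.0
    for _ in range(iters):
        arg = 2 * np.pi * f * n / fs + phi
        err = A * np.sin(arg) - x           # residual of the sinusoidal model
        cos = A * np.cos(arg)
        A -= lr * np.mean(err * np.sin(arg))
        phi -= lr * np.mean(err * cos)
        f -= lr * np.mean(err * cos * n / fs)  # constant factors folded into lr
    return f
```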
21. Final segregation
- New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e., common AM). They are then grouped into the foreground stream according to AM repetition rates
- Other units are grouped according to temporal and spectral continuity
22. Ideal binary mask for performance evaluation
- Within a T-F unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise (see the sketch below)
- Motivation: auditory masking, whereby the stronger signal masks the weaker one within a critical band
- We have suggested using ideal binary masks as ground truth for CASA performance evaluation
- Consistent with recent speech intelligibility results (Roman et al. '03; Brungart et al. '05)
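A minimal sketch of the mask itself, assuming the target and interference are available before mixing and `target_energy`/`noise_energy` are their per-unit energies (e.g., from the cochleogram front end above):

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy):
    """IBM: 1 where the target is stronger than the interference within
    a T-F unit, 0 otherwise (i.e., a 0 dB local criterion)."""
    return (target_energy > noise_energy).astype(np.uint8)
```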
23. Ideal binary mask illustration
24. Voiced speech segregation example
25. Systematic SNR results
(Chart: output SNR in dB for the Hu-Wang model)
- Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see next slide)
- Average SNR gain: 12.3 dB, which is 5.2 dB better than the Wang-Brown model (1999) and 6.4 dB better than the spectral subtraction method
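Such SNR figures can be computed as below. A common convention in this line of work (an assumption here, not stated on the slide) is to take the reference to be the target resynthesized through an all-one mask, so that resynthesis artifacts do not count against the system.

```python
import numpy as np

def snr_db(reference, estimate):
    """Output SNR in dB: reference energy over residual-error energy."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

# SNR gain of a system: snr_db(ref, segregated) - snr_db(ref, mixture)
```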
26. CASA progress on voiced speech segregation
- 100-mixture set used by Cooke (1993)
- 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: cocktail party, N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)
(Audio demos: original mixtures of voiced speech with telephone, male, and female intrusions, and the corresponding segregated outputs of Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004))
27. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
28. Segmentation and unvoiced speech segregation
- To deal with unvoiced speech segregation, we (Hu & Wang '04) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech
- The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source
- This definition of segmentation does not distinguish between voiced and unvoiced sounds
- Segmentation is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
- The segmentation strategy is therefore based on onset and offset analysis
29. Scale-space analysis for auditory segmentation
- From a computational standpoint, auditory segmentation is similar to image (visual) segmentation
- Visual segmentation: finding bounding contours of visual objects
- Auditory segmentation: finding onset and offset fronts of segments
- Onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation, and proceeds in three stages (a sketch of the detection stage follows this list):
- Smoothing
- Onset/offset detection and onset/offset front matching
- Multiscale integration
30. Example of auditory segmentation
31. Speech segregation
- The general strategy for speech segregation is to first segregate voiced speech using the pitch cue, and then deal with unvoiced speech
- To segregate unvoiced speech, we perform auditory segmentation, and then group the segments that correspond to unvoiced speech
32. Segment classification
- For nonspeech interference, grouping is in fact a classification task: classify segments as either speech or non-speech
- The following features are used for classification:
- Spectral envelope
- Segment duration
- Segment intensity
- Training data:
- Speech: the training part of the TIMIT database
- Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.
- A Gaussian mixture model is trained for each phoneme, and for interference as well, which provides the basis for a likelihood ratio test (see the sketch below)
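A sketch of this scheme with scikit-learn GMMs. The feature extraction, component count, and decision threshold are assumptions; the slide only fixes the overall structure of per-phoneme models, an interference model, and a likelihood ratio test.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(speech_feats_by_phoneme, interference_feats, n_components=8):
    """Fit one GMM per phoneme (features drawn from the TIMIT training
    part) plus a single GMM for the interference class."""
    phoneme_gmms = {ph: GaussianMixture(n_components).fit(x)
                    for ph, x in speech_feats_by_phoneme.items()}
    noise_gmm = GaussianMixture(n_components).fit(interference_feats)
    return phoneme_gmms, noise_gmm

def is_speech_segment(segment_feats, phoneme_gmms, noise_gmm, threshold=0.0):
    """Likelihood ratio test: keep a segment as speech when its
    best-matching phoneme model beats the interference model."""
    speech_ll = max(g.score(segment_feats) for g in phoneme_gmms.values())
    return speech_ll - noise_gmm.score(segment_feats) > threshold
```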
33. Example of segregating fricatives/affricates
- Utterance: "That noise problem grows more annoying each day"
- Interference: crowd noise with music
- (IBM: ideal binary mask)
34. Example of segregating stops
- Utterance: "A good morrow to you, my boy"
- Interference: rain
35. Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA
36. How does the auditory system perform ASA?
- Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system
- Binding problem: how are these features combined to form a perceptual whole (stream)?
- Hierarchies of feature-detecting cells exist, but they do not seem to constitute a solution to the binding problem
37. Oscillatory correlation theory for ASA
- Neural oscillators are used to represent auditory features
- Oscillators representing features of the same source are synchronized, and are desynchronized from those representing different sources
- Originally proposed by von der Malsburg & Schneider (1986), and further developed by Wang (1996)
- Supported by growing experimental evidence
38. Oscillatory correlation representation
39. Oscillatory correlation for ASA
- LEGION dynamics (Terman & Wang '95) provides a computational foundation for the oscillatory correlation theory
- The utility of oscillatory correlation has been demonstrated for speech segregation (Wang & Brown '99), modeling auditory attention (Wrigley & Brown '04), etc.
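For concreteness, a single relaxation oscillator of the kind used in LEGION can be simulated as below; the equations follow the Terman-Wang form, and the parameter values and initial conditions are illustrative. In the full network, excitatory coupling added to the input synchronizes oscillators within a segment, while a global inhibitor desynchronizes different segments.

```python
import numpy as np

def relaxation_oscillator(I=0.8, T=100.0, dt=0.01, eps=0.02, gamma=6.0, beta=0.1):
    """Euler simulation of dx/dt = 3x - x^3 + 2 - y + I and
    dy/dt = eps*(gamma*(1 + tanh(x/beta)) - y). With I > 0 the unit
    produces relaxation oscillations alternating between an active
    and a silent phase."""
    steps = int(T / dt)
    x, y = np.empty(steps), np.empty(steps)
    x[0], y[0] = -1.0, 1.0
    for k in range(steps - 1):
        x[k + 1] = x[k] + dt * (3 * x[k] - x[k] ** 3 + 2 - y[k] + I)
        y[k + 1] = y[k] + dt * eps * (gamma * (1 + np.tanh(x[k] / beta)) - y[k])
    return x, y
```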
40. Summary
- CASA approach to monaural speech segregation
- Performs substantially better than previous CASA systems for voiced speech segregation
- The AM cue and target pitch tracking are important for the performance improvement
- Early steps toward unvoiced speech segregation
- Auditory segmentation based on onset/offset analysis
- Segregation using speech classification
- Oscillatory correlation theory for ASA
41. Acknowledgment
- Joint work with Guoning Hu
- Funded by AFOSR/AFRL and NSF