Title: HEARING AND SPEECH BY HUMANS AND MACHINES
1. HEARING AND SPEECH BY HUMANS AND MACHINES
- Richard Stern
- Robust Speech Recognition Group
- Carnegie Mellon University
- Telephone: (412) 268-2535
- Fax: (412) 268-3890
- rms_at_cs.cmu.edu
- http://www.cs.cmu.edu/rms
- 18-493 Electroacoustics
- November 15, 2004
2. Introduction
- Conventional signal processing schemes for speech recognition are motivated more by knowledge of speech production than by knowledge of speech perception.
- Nevertheless, the auditory system does a lot of interesting things!
- In this talk I will:
  - Talk a bit about some basic findings in auditory physiology and perception
  - Briefly review conventional signal processing
  - Talk a bit about how knowledge of perception is starting to impact on how we design signal processing for speech recognition
3. The source-filter model of speech production
- A useful model for representing the generation of speech sounds
[Figure: source-filter model diagram; surviving labels: Amplitude, pn]
4. The speech spectrogram
5. Separating the vocal-tract excitation from the filter
- Original speech
- Speech with 75-Hz excitation
- Speech with 150-Hz excitation
- Speech with noise excitation
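The idea behind these examples can be sketched in a few lines: hold the vocal-tract filter fixed and swap the excitation. This is a minimal illustration, not the processing used to make the demonstrations in the talk. The formant frequencies, bandwidths, sampling rate, and all function names below are assumptions chosen for illustration.

```python
import math
import random

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Coefficients (a1, a2) of a two-pole resonator:
    y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius, always < 1
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle
    return -2.0 * r * math.cos(theta), r * r

def vocal_tract_filter(x, formants, fs):
    """Pass x through a cascade of two-pole resonators, one per formant."""
    signal = list(x)
    for freq_hz, bandwidth_hz in formants:
        a1, a2 = resonator_coeffs(freq_hz, bandwidth_hz, fs)
        y1 = y2 = 0.0
        out = []
        for sample in signal:
            v = sample - a1 * y1 - a2 * y2
            out.append(v)
            y2, y1 = y1, v
        signal = out
    return signal

def impulse_train(f0_hz, n_samples, fs):
    """Voiced excitation: one unit impulse per pitch period
    (the period is rounded to a whole number of samples)."""
    period = int(round(fs / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def noise(n_samples, seed=0):
    """Unvoiced excitation: Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

FS = 8000                      # assumed sampling rate in Hz
N = 4000                       # half a second of signal
# Hypothetical (frequency, bandwidth) pairs in Hz, loosely vowel-like
FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

# Same filter, three different excitations
voiced_75 = vocal_tract_filter(impulse_train(75.0, N, FS), FORMANTS, FS)
voiced_150 = vocal_tract_filter(impulse_train(150.0, N, FS), FORMANTS, FS)
whispered = vocal_tract_filter(noise(N), FORMANTS, FS)
```

All three outputs share the same spectral envelope (the filter), while the pitch or noisiness comes entirely from the excitation.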
6. Basic auditory anatomy
- Structures involved in auditory processing
7. Excitation along the basilar membrane
- Some of von Békésy's (1960) measurements of motion along the basilar membrane
- Comment: Different locations are most sensitive to different frequencies
8. Transient response of auditory-nerve fibers
- Histograms of response to tone bursts (Kiang et al., 1965)
- Comment: Onsets and offsets produce overshoot
9. Frequency response of auditory-nerve fibers: tuning curves
- Threshold level for auditory-nerve response to tones
- Note dependence of bandwidth on center frequency and asymmetry of response
10. Typical response of auditory-nerve fibers as a function of stimulus level
- Typical response of auditory-nerve fibers to tones as a function of intensity
- Comment: Saturation and limited dynamic range
11. Synchronized auditory-nerve response to low-frequency tones
- Comment: Response remains synchronized over a wide range of intensities
12. Comments on synchronized auditory response
- Nerve fibers synchronize to fine structure at low frequencies, signal envelopes at high frequencies
- Synchrony clearly important for auditory localization
- Synchrony now believed important for monaural processing of complex signals as well
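One common way to put a number on synchrony is the vector strength (synchronization index) of spike times relative to a tone. The slides do not name a specific measure, so treat this as an assumed illustration; the function name and the synthetic spike trains are hypothetical.

```python
import cmath
import math

def vector_strength(spike_times_s, freq_hz):
    """Vector strength: 1.0 when every spike falls at the same phase of the
    tone, near 0.0 when spikes are spread evenly over the cycle."""
    if not spike_times_s:
        raise ValueError("need at least one spike time")
    total = sum(cmath.exp(2j * math.pi * freq_hz * t) for t in spike_times_s)
    return abs(total) / len(spike_times_s)

TONE_HZ = 500.0
# Synthetic example 1: one spike per cycle, always at the same phase
locked = [k / TONE_HZ for k in range(100)]
# Synthetic example 2: spike phase advances by 1/100 cycle per spike,
# so the 100 spikes cover the cycle uniformly
spread = [(k + k / 100.0) / TONE_HZ for k in range(100)]

print(vector_strength(locked, TONE_HZ))
print(vector_strength(spread, TONE_HZ))
```

The first value is essentially 1 and the second essentially 0, apart from floating-point rounding.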
13. Lateral suppression in auditory processing
- Auditory-nerve response to pairs of tones
- Comment: Lateral suppression enhances local contrast in frequency
14. Auditory masking patterns
- Masking produced by narrowband noise at 410 Hz
- Comment: Asymmetries in auditory-nerve patterns preserved
15. Auditory frequency selectivity: critical bands
- Measurements of psychophysical filter bandwidth by various methods
- Comments:
  - Bandwidth increases with center frequency
  - Solid curve is Equivalent Rectangular Bandwidth (ERB)
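As a rough numerical companion to the ERB curve, the sketch below uses the widely quoted Glasberg and Moore (1990) approximation, ERB = 24.7 (4.37 F + 1) with F in kHz. Whether the solid curve on the slide was drawn from exactly this formula is an assumption.

```python
def erb_bandwidth_hz(center_hz):
    """Equivalent rectangular bandwidth in Hz at a given center frequency,
    using the Glasberg and Moore (1990) approximation (assumed here)."""
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)

# Bandwidth grows with center frequency
for fc in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f"{fc:7.0f} Hz -> ERB {erb_bandwidth_hz(fc):6.1f} Hz")
```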
16. Three perceptual auditory frequency scales
- Bark scale
- Mel scale
- ERB scale
17. Comparison of normalized perceptual frequency scales
- Bark scale (in blue), Mel scale (in red), and ERB scale (in green)
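A comparison of this kind can be reproduced approximately with closed-form fits to the three scales. The formulas below are common textbook approximations (the 2595 log10 form for Mel, Traunmüller's fit for Bark, and the Glasberg and Moore ERB-rate formula). The slides do not say which variants or which normalization were plotted, so both are assumptions here.

```python
import math

def hz_to_mel(f_hz):
    """Mel scale, common 2595*log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """Bark scale, Traunmueller (1990) approximation."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def hz_to_erb_rate(f_hz):
    """ERB-rate scale, Glasberg and Moore (1990) approximation."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def normalized(scale, f_hz, f_max_hz=8000.0):
    """Map a scale onto [0, 1] over 0..f_max_hz (assumed normalization)."""
    lo, hi = scale(0.0), scale(f_max_hz)
    return (scale(f_hz) - lo) / (hi - lo)

for f in (250.0, 1000.0, 4000.0):
    row = [normalized(s, f) for s in (hz_to_bark, hz_to_mel, hz_to_erb_rate)]
    print(f, [round(v, 3) for v in row])
```

After normalization the three curves have similar shapes: roughly linear at low frequencies and compressive at high frequencies.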
18. Forward and backward masking
- Masking can have an effect even if target and masker are not simultaneously presented
- Forward masking: masker precedes target
- Backward masking: target precedes masker
- Examples:
  - Introduction
  - Backward masking
  - Forward masking
19. The loudness of sounds
- Equal loudness contours (Fletcher-Munson curves)
20. Summary of basic auditory physiology and perception
- Major physiological attributes:
  - Frequency analysis in parallel channels
  - Preservation of temporal fine structure
  - Limited dynamic range in individual channels
  - Enhancement of temporal contrast (at onsets and offsets)
  - Enhancement of spectral contrast (at adjacent frequencies)
- Most major physiological attributes have psychophysical correlates
- Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition
21. Conventional ASR signal processing: MFCCs
- Segment incoming waveform into frames
- Compute frequency response for each frame using DFTs
- Multiply magnitude of frequency response by triangular weighting functions to produce 25-40 channels
- Compute log of weighted magnitudes for each channel
- Take inverse discrete cosine transform (DCT) of the log weighted magnitudes across channels, producing 14 cepstral coefficients for each frame
- Calculate delta and double-delta coefficients
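The steps above can be sketched for a single frame using only the standard library. This is an illustrative sketch, not the front end used in the talk: the Hamming window, the unnormalized DCT-II, the filter-edge placement, the simple two-frame delta, and every function name are assumptions. The naive DFT is slow and is only there to keep the example self-contained.

```python
import cmath
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def magnitude_spectrum(frame):
    """Hamming-window the frame, then return |DFT| for bins 0..N/2."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular weights whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(fs / 2.0)
    edges_mel = [top_mel * i / (n_filters + 1) for i in range(n_filters + 2)]
    edges_bin = [mel_to_hz(m) * n_fft / fs for m in edges_mel]  # fractional bins
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges_bin[j - 1], edges_bin[j], edges_bin[j + 1]
        weights = []
        for k in range(n_bins):
            if lo < k <= mid:
                weights.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                weights.append((hi - k) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mfcc_frame(frame, fs, n_filters=40, n_ceps=14):
    """Log mel-weighted magnitudes followed by a DCT across channels."""
    mags = magnitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), fs)
    log_energies = [math.log(sum(w * m for w, m in zip(weights, mags)) + 1e-10)
                    for weights in bank]
    return [sum(e * math.cos(math.pi * q * (j + 0.5) / n_filters)
                for j, e in enumerate(log_energies))
            for q in range(n_ceps)]

def deltas(ceps_frames):
    """Simplified delta: half the difference between the next and previous
    frames, with the first and last frames replicated at the edges."""
    n = len(ceps_frames)
    out = []
    for t in range(n):
        prev = ceps_frames[max(t - 1, 0)]
        nxt = ceps_frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

FS = 8000                                   # assumed sampling rate in Hz
frame = [math.sin(2.0 * math.pi * 440.0 * n / FS) for n in range(256)]
ceps = mfcc_frame(frame, FS)                # 14 coefficients for this frame
```

Double-delta coefficients follow by applying `deltas` to its own output. Production systems typically use an FFT and a regression-based delta over a longer window.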
22. AN EXAMPLE: DERIVING MFCC COEFFICIENTS
23. THE MEL WEIGHTING FUNCTIONS
24. THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
25. THE CEPSTRAL COEFFICIENTS
26. LOG SPECTRA RECOVERED FROM CEPSTRA
27. COMPARING SPECTRAL REPRESENTATIONS
- Panels: ORIGINAL SPEECH, MEL LOG MAGS, AFTER CEPSTRA
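The "after cepstra" view can be understood as follows: keeping only the low-order cepstral coefficients and inverting the DCT gives back a smoothed version of the mel log spectrum. The sketch below shows this on made-up log-magnitude values. The 40-channel and 14-coefficient sizes follow the MFCC slide; the data and function names are hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II across channels."""
    n = len(x)
    return [sum(v * math.cos(math.pi * q * (j + 0.5) / n)
                for j, v in enumerate(x))
            for q in range(n)]

def inverse_dct2(c, n_out):
    """Invert dct2 for n_out channels. If fewer than n_out coefficients are
    supplied, the missing high-order terms are treated as zero."""
    return [c[0] / n_out + (2.0 / n_out) * sum(
                c[q] * math.cos(math.pi * q * (j + 0.5) / n_out)
                for q in range(1, len(c)))
            for j in range(n_out)]

N_CHANNELS = 40
N_CEPS = 14
# Made-up mel log magnitudes: a slow trend plus fast ripple
log_mel = [math.sin(0.3 * j) + 0.2 * math.sin(2.5 * j)
           for j in range(N_CHANNELS)]

cepstra = dct2(log_mel)
exact = inverse_dct2(cepstra, N_CHANNELS)              # all coefficients kept
smoothed = inverse_dct2(cepstra[:N_CEPS], N_CHANNELS)  # first 14 only
```

`exact` reproduces `log_mel` to rounding error, while `smoothed` keeps the overall level and slowly varying shape and discards variation across channels that is too fast for 14 coefficients to represent.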
28. Comments on the MFCC representation
- It's very blurry compared to a wideband spectrogram!
- Aspects of auditory processing represented:
  - Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)
    - Wavelet schemes exploit time-frequency resolution better
  - Nonlinear amplitude response
- Aspects of auditory processing NOT represented:
  - Detailed timing structure
  - Lateral suppression
  - Enhancement of temporal contrast
  - Other auditory nonlinearities
29. Speech representation using mean rate
- Representation of vowels by Young and Sachs using mean rate
- Mean rate representation does not preserve spectral information
30. Speech representation using average localized synchrony measure
- Representation of vowels by Young and Sachs using ALSR
31. The importance of timing information
- Re-analysis of Young-Sachs data by Searle
- Temporal processing captures dominant formants in a spectral region
32. Paths to the realization of temporal fine structure in speech
- Correlograms (Slaney and Lyon)
- Computations based on interval processing:
  - Seneff's Generalized Synchrony Detector (GSD) model
  - Ghitza's Ensemble Interval Histogram (EIH) model
  - D.C. Kim's Zero Crossing Peak Analysis (ZCPA) model
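To give a flavor of interval processing, the sketch below builds a histogram of inverse intervals between successive upward zero crossings of a single channel's output. It is a deliberately simplified illustration and does not implement the GSD, EIH, or ZCPA models. The function names, the 50-Hz bin width, and the test tone are assumptions.

```python
import math

def upward_zero_crossings(x, fs):
    """Times (s) of upward zero crossings, linearly interpolated between
    the two samples that straddle each crossing."""
    times = []
    for n in range(1, len(x)):
        if x[n - 1] < 0.0 <= x[n]:
            frac = -x[n - 1] / (x[n] - x[n - 1])
            times.append((n - 1 + frac) / fs)
    return times

def interval_frequency_histogram(x, fs, bin_width_hz=50.0):
    """Count inverse intervals (1 / time between successive upward
    crossings), grouped into bins of bin_width_hz."""
    times = upward_zero_crossings(x, fs)
    hist = {}
    for t0, t1 in zip(times, times[1:]):
        freq = 1.0 / (t1 - t0)
        center = bin_width_hz * round(freq / bin_width_hz)
        hist[center] = hist.get(center, 0) + 1
    return hist

FS = 16000                                   # assumed sampling rate in Hz
# 0.1 s of a 500-Hz tone standing in for one band-pass channel output
tone = [math.sin(2.0 * math.pi * 500.0 * n / FS + 0.3) for n in range(1600)]
hist = interval_frequency_histogram(tone, FS)
dominant = max(hist, key=hist.get)           # bin with the most intervals
```

For a channel dominated by one spectral component, the histogram peaks at that component's frequency, which is the sense in which timing can capture the dominant formant in a spectral region.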
33. A typical auditory model of the 1980s: The Seneff model
34. Recognition accuracy using the Seneff model (Ohshima, 1994)
- Comment: CDCN performs just as well as the Seneff model
35. Computational complexity of Seneff model
- Number of multiplications per ms of speech
- Comment: Auditory computation is extremely expensive
36. A very simple model based on timing information
37. Response of timing model to chirp stimuli
38. Response of timing model to speech stimuli
39. Summary
- Major physiological attributes:
  - Frequency analysis in parallel channels
  - Preservation of temporal fine structure
  - Limited dynamic range in individual channels
  - Enhancement of temporal contrast (at onsets and offsets)
  - Enhancement of spectral contrast (at adjacent frequencies)
- Most major physiological attributes have psychophysical correlates
- We are trying to capture important attributes of the representation for recognition, and we believe that this may help performance in noise and competing signals.