HEARING AND SPEECH BY HUMANS AND MACHINES

1
HEARING AND SPEECH BY HUMANS AND MACHINES
  • Richard Stern
  • Robust Speech Recognition Group
  • Carnegie Mellon University
  • Telephone: (412) 268-2535
  • Fax: (412) 268-3890
  • rms@cs.cmu.edu
  • http://www.cs.cmu.edu/rms
  • 18-493 Electroacoustics
  • November 15, 2004

2
Introduction
  • Conventional signal processing schemes for speech
    recognition are motivated more by knowledge of
    speech production than by knowledge of speech
    perception.
  • Nevertheless, the auditory system does a lot of
    interesting things!
  • In this talk I will:
  • Talk a bit about some basic findings in auditory
    physiology and perception
  • Briefly review conventional signal processing
  • Talk a bit about how knowledge of perception is
    starting to influence how we design signal
    processing for speech recognition

3
The source-filter model of speech production
  • A useful model for representing the generation of
    speech sounds

4
The speech spectrogram
5
Separating the vocal-tract excitation from the
filter
  • Original speech
  • Speech with 75-Hz excitation
  • Speech with 150-Hz excitation
  • Speech with noise excitation

6
Basic auditory anatomy
  • Structures involved in auditory processing

7
Excitation along the basilar membrane
  • Some of von Békésy's (1960) measurements of
    motion along the basilar membrane
  • Comment: Different locations are most sensitive
    to different frequencies

8
Transient response of auditory-nerve fibers
  • Histograms of response to tone bursts (Kiang et
    al., 1965)
  • Comment: Onsets and offsets produce overshoot

9
Frequency response of auditory-nerve fibers:
tuning curves
  • Threshold level for auditory-nerve response to
    tones
  • Note dependence of bandwidth on center frequency
    and asymmetry of response

10
Typical response of auditory-nerve fibers as a
function of stimulus level
  • Typical response of auditory-nerve fibers to
    tones as a function of intensity
  • Comment: Saturation and limited dynamic range
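Saturation and limited dynamic range are often illustrated with a sigmoidal rate-level function. The sketch below is one such illustration, not a fit to the data on the slide: the spontaneous rate, saturation rate, midpoint, and slope are all assumed values, and `rate_level` is a hypothetical name.

```python
import math

def rate_level(level_db, spont=10.0, max_rate=200.0, midpoint_db=30.0, slope_db=5.0):
    """Illustrative sigmoidal rate-level function (spikes/s versus level in dB).

    Every parameter value here is an assumption chosen for illustration.
    """
    growth = 1.0 / (1.0 + math.exp(-(level_db - midpoint_db) / slope_db))
    return spont + (max_rate - spont) * growth

# The rate changes quickly near the midpoint and barely at all at high levels.
for level in range(0, 101, 20):
    print(level, round(rate_level(level), 1))
```

With these assumed parameters, most of the change in firing rate is confined to a few tens of dB around the midpoint, which is the sense in which the dynamic range of a single fiber is limited.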

11
Synchronized auditory-nerve response to
low-frequency tones
  • Comment: Response remains synchronized over a
    wide range of intensities

12
Comments on synchronized auditory response
  • Nerve fibers synchronize to fine structure at
    low frequencies, signal envelopes at high
    frequencies
  • Synchrony clearly important for auditory
    localization
  • Synchrony now believed important for monaural
    processing of complex signals as well

13
Lateral suppression in auditory processing
  • Auditory-nerve response to pairs of tones
  • Comment: Lateral suppression enhances local
    contrast in frequency

14
Auditory masking patterns
  • Masking produced by narrowband noise at 410 Hz
  • Comment: Asymmetries in auditory-nerve patterns
    are preserved

15
Auditory frequency selectivity: critical bands
  • Measurements of psychophysical filter bandwidth
    by various methods
  • Comments:
  • Bandwidth increases with center frequency
  • Solid curve is Equivalent Rectangular Bandwidth
    (ERB)
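The transcript does not say which ERB fit the solid curve uses. One widely used expression is the Glasberg and Moore (1990) formula, ERB(f) = 24.7 (4.37 f / 1000 + 1) Hz, sketched below as an assumption; `erb_hz` is a hypothetical name.

```python
def erb_hz(fc_hz):
    """Equivalent Rectangular Bandwidth in Hz at center frequency fc_hz.

    Uses the Glasberg and Moore (1990) fit; assuming this is the curve
    on the slide is a guess, not something the source states.
    """
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

# Bandwidth grows with center frequency, roughly in proportion above about 1 kHz.
for fc in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(fc, round(erb_hz(fc), 1), round(fc / erb_hz(fc), 2))
```

The last printed column is the ratio of center frequency to bandwidth, which makes the "bandwidth increases with center frequency" comment easy to check numerically.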

16
Three perceptual auditory frequency scales
  • Bark scale
  • Mel scale
  • ERB scale
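Several closed-form approximations exist for each of these scales. The sketch below uses three common ones: the O'Shaughnessy mel formula, the Zwicker and Terhardt Bark approximation, and the Glasberg and Moore ERB-rate formula. Which variants the talk used is not stated, so these are assumptions, as is the normalization by the value at 8 kHz.

```python
import math

def hz_to_mel(f_hz):
    """Mel scale (O'Shaughnessy form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """Bark scale (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore form)."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)

def normalized(scale, f_hz, f_max_hz=8000.0):
    """Scale value divided by its value at f_max_hz (an assumed normalization)."""
    return scale(f_hz) / scale(f_max_hz)

for f in (100.0, 500.0, 1000.0, 4000.0, 8000.0):
    row = [round(normalized(s, f), 3) for s in (hz_to_bark, hz_to_mel, hz_to_erb_rate)]
    print(f, row)
```

All three functions are close to linear at low frequencies and compressive at high frequencies, so they look broadly similar once normalized.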

17
Comparison of normalized perceptual frequency
scales
  • Bark scale (in blue), Mel scale (in red), and ERB
    scale (in green)

18
Forward and backward masking
  • Masking can have an effect even if target and
    masker are not simultaneously presented
  • Forward masking - masker precedes target
  • Backward masking - target precedes masker
  • Examples
  • Introduction
  • Backward masking
  • Forward masking

19
The loudness of sounds
  • Equal loudness contours (Fletcher-Munson curves)

20
Summary of basic auditory physiology and
perception
  • Major physiological attributes:
  • Frequency analysis in parallel channels
  • Preservation of temporal fine structure
  • Limited dynamic range in individual channels
  • Enhancement of temporal contrast (at onsets and
    offsets)
  • Enhancement of spectral contrast (at adjacent
    frequencies)
  • Most major physiological attributes have
    psychophysical correlates
  • Most physiological and psychophysical effects are
    not preserved in conventional representations for
    speech recognition

21
Conventional ASR signal processing: MFCCs
  • Segment incoming waveform into frames
  • Compute frequency response for each frame using
    DFTs
  • Multiply magnitude of frequency response by
    triangular weighting functions to produce 25-40
    channels
  • Compute log of weighted magnitudes for each
    channel
  • Take inverse discrete cosine transform (DCT) of
    the log weighted magnitudes across channels,
    producing 14 cepstral coefficients for each frame
  • Calculate delta and double-delta coefficients
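The steps above can be sketched end to end using only the Python standard library. This is a minimal illustration, not the implementation used in the talk: the sampling rate, frame length, hop, Hamming window, log floor, test signal, and two-frame delta are all assumptions, and the cosine transform is written as a DCT-II taken across the channel index, a common choice for this step.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def magnitude_spectrum(frame):
    """Hamming-window one frame and return |DFT| for bins 0..N/2."""
    n = len(frame)
    x = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
         for i, s in enumerate(frame)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular weights with center frequencies equally spaced in mel."""
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * fs / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc_frame(frame, fs, n_filters=40, n_ceps=14):
    mags = magnitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), fs)
    # Weighted magnitude per channel, then log (the floor avoids log(0)).
    log_e = [math.log(max(sum(w * m for w, m in zip(row, mags)), 1e-10))
             for row in bank]
    # Cosine transform across the channel index gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * q * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e)) for q in range(n_ceps)]

def deltas(seq):
    """Two-frame difference along time; real systems often use a regression window."""
    out = []
    for t in range(len(seq)):
        prev, nxt = seq[max(t - 1, 0)], seq[min(t + 1, len(seq) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

fs, frame_len, hop = 8000, 256, 80   # 32 ms frames, 10 ms hop (assumed values)
signal = [math.sin(2 * math.pi * 440 * t / fs) + 0.5 * math.sin(2 * math.pi * 1320 * t / fs)
          for t in range(fs // 10)]
frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, hop)]
cepstra = [mfcc_frame(f, fs) for f in frames]
delta = deltas(cepstra)
double_delta = deltas(delta)
```

The direct DFT here is slow but keeps every step visible; a practical system would use an FFT and compute the filterbank once rather than per frame.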

22
AN EXAMPLE: DERIVING MFCC COEFFICIENTS
23
THE MEL WEIGHTING FUNCTIONS
24
THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
25
THE CEPSTRAL COEFFICIENTS
26
LOG SPECTRA RECOVERED FROM CEPSTRA
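Recovering a log spectrum from a truncated set of cepstral coefficients amounts to inverting the cosine transform with the discarded coefficients set to zero. The sketch below shows the effect on a synthetic 40-channel log spectrum; the test spectrum, the orthonormal scaling, and the names `dct2`, `idct2`, and `roughness` are assumptions made for illustration.

```python
import math

def dct2(x):
    """Orthonormal DCT-II."""
    n = len(x)
    out = []
    for q in range(n):
        scale = math.sqrt(1.0 / n) if q == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi * q * (j + 0.5) / n)
                               for j, v in enumerate(x)))
    return out

def idct2(c, n):
    """Inverse of dct2 using only the coefficients supplied; the rest count as zero."""
    out = []
    for j in range(n):
        total = 0.0
        for q, cq in enumerate(c):
            scale = math.sqrt(1.0 / n) if q == 0 else math.sqrt(2.0 / n)
            total += scale * cq * math.cos(math.pi * q * (j + 0.5) / n)
        out.append(total)
    return out

def roughness(x):
    """Sum of absolute channel-to-channel differences."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))

# Synthetic log spectrum: two smooth peaks plus fine channel-to-channel ripple.
log_mel = [math.exp(-((j - 10) / 6.0) ** 2)
           + 0.6 * math.exp(-((j - 28) / 5.0) ** 2)
           + 0.3 * (-1) ** j
           for j in range(40)]

cepstrum = dct2(log_mel)
exact = idct2(cepstrum, 40)          # all 40 coefficients: original is recovered
smoothed = idct2(cepstrum[:14], 40)  # first 14 only: fine detail is lost

print(round(roughness(log_mel), 2), round(roughness(smoothed), 2))
```

Keeping only the low-order coefficients preserves the broad envelope and discards rapid variation across channels, which is why the recovered spectra look smoother than the mel log magnitudes they came from.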
27
COMPARING SPECTRAL REPRESENTATIONS
  • Panels: original speech, mel log magnitudes,
    after cepstra

28
Comments on the MFCC representation
  • It's very blurry compared to a wideband
    spectrogram!
  • Aspects of auditory processing represented:
  • Frequency selectivity and spectral bandwidth
    (but using a constant analysis window duration!)
  • Wavelet schemes exploit time-frequency resolution
    better
  • Nonlinear amplitude response
  • Aspects of auditory processing NOT represented:
  • Detailed timing structure
  • Lateral suppression
  • Enhancement of temporal contrast
  • Other auditory nonlinearities

29
Speech representation using mean rate
  • Representation of vowels by Young and Sachs using
    mean rate
  • Mean rate representation does not preserve
    spectral information

30
Speech representation using average localized
synchrony measure
  • Representation of vowels by Young and Sachs using
    ALSR

31
The importance of timing information
  • Re-analysis of Young-Sachs data by Searle
  • Temporal processing captures dominant formants in
    a spectral region

32
Paths to the realization of temporal fine
structure in speech
  • Correlograms (Slaney and Lyon)
  • Computations based on interval processing
  • Seneff's Generalized Synchrony Detector (GSD)
    model
  • Ghitza's Ensemble Interval Histogram (EIH) model
  • D.C. Kim's Zero Crossing Peak Analysis (ZCPA)
    model
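The common thread in interval-based approaches is that frequency can be estimated from the time between events in the waveform rather than from a spectrum. The sketch below shows that idea in its simplest form, a histogram of inverse zero-crossing intervals for a single tone. It is not an implementation of the GSD, EIH, or ZCPA models, which add filterbanks and further processing; the tone frequency, sampling rate, and bin width are assumed values.

```python
import math

def upward_zero_crossings(x, fs):
    """Times (s) of upward zero crossings, linearly interpolated between samples."""
    times = []
    for n in range(1, len(x)):
        if x[n - 1] < 0.0 <= x[n]:
            frac = -x[n - 1] / (x[n] - x[n - 1])
            times.append((n - 1 + frac) / fs)
    return times

def interval_frequencies(times):
    """Inverse of each interval between successive crossings, in Hz."""
    return [1.0 / (b - a) for a, b in zip(times, times[1:]) if b > a]

def histogram(freqs, bin_hz, max_hz):
    """Count frequency estimates in bins of width bin_hz from 0 to max_hz."""
    counts = [0] * int(max_hz / bin_hz)
    for f in freqs:
        k = int(f / bin_hz)
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts

fs = 16000
tone = [math.sin(2.0 * math.pi * 520.0 * n / fs + 0.3) for n in range(fs // 10)]
freqs = interval_frequencies(upward_zero_crossings(tone, fs))
counts = histogram(freqs, bin_hz=50.0, max_hz=4000.0)
peak_bin = counts.index(max(counts))   # index of the most populated 50-Hz bin
```

For a clean tone every interval gives nearly the same estimate, so the histogram has a single populated bin; for speech passed through a bank of bandpass filters, the same computation in each channel tends to pile up counts near the dominant formant in that channel.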

33
A typical auditory model of the 1980s: The Seneff
model
34
Recognition accuracy using the Seneff model
(Ohshima, 1994)
  • Comment: CDCN performs just as well as the Seneff
    model

35
Computational complexity of Seneff model
  • Number of multiplications per ms of speech
  • Comment: Auditory computation is extremely
    expensive

36
A very simple model based on timing information
37
Response of timing model to chirp stimuli
38
Response of timing model to speech stimuli
39
Summary
  • Major physiological attributes:
  • Frequency analysis in parallel channels
  • Preservation of temporal fine structure
  • Limited dynamic range in individual channels
  • Enhancement of temporal contrast (at onsets and
    offsets)
  • Enhancement of spectral contrast (at adjacent
    frequencies)
  • Most major physiological attributes have
    psychophysical correlates
  • We are trying to capture important attributes
    of the representation for recognition, and we
    believe that this may help performance in noise
    and with competing signals.
