Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine
1
Biologically Inspired Noise-Robust Speech
Recognition for Both Man and Machine
  • Mark D. Skowronski
  • Ph.D. Proposal
  • University of Florida
  • Gainesville, FL, USA

2
Outline
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

3
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

4
Biological Inspiration
Example of Read Speech
  • Wall Street Journal/Broadcast news readings
  • Untrained human listeners vs Cambridge HTK LVCSR
    system

5
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

6
Speech Enhancement
  • Motivations
  • Noisy cell phone conversations
  • Power-constrained transducers
  • Public address systems in noisy environments

What can you do when turning up the volume is not
an option?
7
The Lombard Effect
Lombard Effect: changes in vocal characteristics
produced by a speaker in the presence of
background noise.
  • Amplitude increases.
  • Duration increases.
  • Pitch increases.
  • Formant frequencies increase.
  • High-freq to low-freq energy ratio increases.
  • Intelligibility increases.

8
Psychoacoustic Experiments
Speech contains regions of relatively high
information content, and emphasis of these
regions increases perceived intelligibility.
  • Fletcher (1953): under low-pass or high-pass
    filtering, phonemes varied in robustness to the
    filtering process, with vowels being the most robust.
  • Miller and Nicely (1955): adding AWGN to speech
    affects place of articulation and frication most,
    less so voicing and nasality.
  • Furui (1986): truncated vowels in consonant-vowel
    pairs decreased dramatically in intelligibility
    beyond a certain point of truncation; these
    points correspond to spectrally dynamic regions.

9
Solution: Energy Redistribution
We redistribute energy from regions of low
information content to regions of high
information content while conserving overall
energy across words.
We partition speech into voiced/unvoiced regions
using the Spectral Flatness Measure (SFM), where
Xj(k) is the magnitude of the short-term Fourier
transform of the jth speech window of length N.
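The SFM formula itself was an equation image in the deck and did not survive extraction. A minimal numpy sketch of the standard definition (ratio of the geometric mean to the arithmetic mean of the power spectrum), offered here as an assumption about what the slide showed:

```python
import numpy as np

def spectral_flatness(x, n_fft=512):
    """Spectral Flatness Measure of one speech window: geometric mean
    over arithmetic mean of the power spectrum.  Near 1 for noise-like
    (unvoiced) frames, near 0 for harmonic (voiced) frames."""
    X = np.abs(np.fft.rfft(x, n_fft))     # |X_j(k)|, magnitude spectrum
    p = X**2 + 1e-12                      # power spectrum, floored to avoid log(0)
    geo = np.exp(np.mean(np.log(p)))      # geometric mean
    arith = np.mean(p)                    # arithmetic mean
    return geo / arith

# White noise is spectrally flat; a pure tone is not.
rng = np.random.default_rng(0)
noise = rng.standard_normal(512)
tone = np.sin(2 * np.pi * 50 * np.arange(512) / 512)
print(spectral_flatness(noise) > spectral_flatness(tone))  # True
```

Thresholding this value per window gives the voiced/unvoiced partition the slide describes.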
10
Listening Test
Confusable set test, from Junqua:
  Set I: f, s, x, yes
  Set II: a, h, k, 8
  Set III: b, c, d, e, g, p, t, v, z, 3
  Set IV: m, n
  • 500 trials, forced decision
  • 3 algorithms (control, ERVU, HPF)
  • 0 dB and -10 dB SNR, AWGN
  • unlimited playback over headphones
  • 26 participants, 30-45 minutes

11
Listening Test Results
-10 dB SNR, white noise
Errors decreased 20% compared to control.
12
Energy Redistribution Summary
  • Biologically inspired
  • Lombard Effect says how to modify.
  • Psychoacoustic experiments say where to modify.
  • Increases intelligibility while maintaining
    naturalness and conserving energy.
  • Naturalness elegantly preserved by retaining
    spectral and temporal cues.
  • Effective because everyday speech is not clearly
    enunciated.

13
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

14
ASR Introduction
Automatic Speech Recognition is the extraction of
linguistic information from an utterance of
speech (speech-to-text).
  • Isolated/continuous speech
  • Speaker-dependent/independent operation
  • Word/Phoneme recognition unit
  • Vocabulary size and perplexity

Input → Feature Extraction → Classification
15
Input
Example utterance: "seven"
Information carried: phonetic content, gender, age,
emotion, pitch, accent, physical state,
additive/channel noise
16
Feature Extraction
Goal: emphasize phonetic information over other
characteristics.
  • Acoustic: formant frequencies, bandwidths
  • Model-based: linear prediction
  • Filter-bank based: mel frequency cepstral coefficients (MFCC)

Provides dimensionality reduction on
quasi-stationary windows.
17
Hidden Markov Model
[Diagram: the utterance "one" shown in the time domain, state space, and feature space]
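The slide itself is a diagram; as a reminder of the machinery, here is a minimal forward-algorithm sketch for a two-state left-to-right HMM with discrete observations. All probabilities below are made-up illustrations, not values from the deck:

```python
import numpy as np

# Toy left-to-right HMM: 2 states, 2 discrete observation symbols.
A = np.array([[0.7, 0.3],    # state-transition probabilities
              [0.0, 1.0]])   # left-to-right: no backward jumps
B = np.array([[0.9, 0.1],    # P(observation symbol | state)
              [0.2, 0.8]])
pi = np.array([1.0, 0.0])    # always start in state 0

def forward_loglik(obs):
    """Log-likelihood of an observation sequence via the forward pass."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum())

# A sequence matching the model's left-to-right emission pattern
# scores higher than its reverse.
print(forward_loglik([0, 0, 1, 1]) > forward_loglik([1, 1, 0, 0]))  # True
```

In word-model ASR, one such HMM is trained per vocabulary word and the recognizer picks the model with the highest likelihood.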
18
MFCC Algorithm
MFCC is the most widely used speech feature extractor:
x(t) → FFT → Mel-scaled filter bank → Log energy → DCT → cepstral coefficients
(example utterance: "seven")
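The block diagram above can be sketched in a few lines of numpy/scipy; window length, filter count, and sample rate below are illustrative choices, not the deck's settings:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0**(m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, fs=8000, f_lo=0.0, f_hi=4000.0):
    """Triangular filters with center frequencies equally spaced in mel;
    each triangle's endpoints are the centers of its neighbors."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    """One analysis window -> cepstra: |FFT| -> filter bank -> log -> DCT."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)),
                             2 * (fb.shape[1] - 1)))
    energies = fb @ mag**2
    return dct(np.log(energies + 1e-12), type=2, norm='ortho')[:n_ceps]

rng = np.random.default_rng(0)
fb = mel_filterbank()
ceps = mfcc_frame(rng.standard_normal(400), fb)
print(ceps.shape)  # (13,)
```

HFCC, introduced later in the deck, keeps this pipeline and changes only how the filter bandwidths are chosen.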
19
DCT vs Eigenvectors
[Figure: spectra of eigenvectors from log energy of filtered speech vs. spectra of DCT basis vectors]
Average spectral difference < 15%.
20
MFCC Filter Bank
  • Design parameters: filter bank frequency range,
    number of filters.
  • Center freqs equally-spaced in mel frequency.
  • Triangle endpoints set by center freqs of
    adjacent filters.

Although filter spacing is determined by
perceptual mel frequency scale, bandwidth is set
more for convenience than by biological
motivation.
21
Human Factor Cepstral Coefficients
  • Decouple filter bandwidth from filter bank design
    parameters.
  • Set filter width according to the critical
    bandwidth of the human auditory system.
  • Use Moore and Glasberg approximation of critical
    bandwidth, defined in Equivalent Rectangular
    Bandwidth (ERB).

fc is the critical band center frequency in kHz.
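The approximation itself appeared as an equation image. The commonly cited Moore and Glasberg polynomial form, with $f_c$ in kHz, is (supplied here from the literature, not recovered from the slide):

```latex
\mathrm{ERB}(f_c) = 6.23\, f_c^{2} + 93.39\, f_c + 28.52 \;\text{Hz}
```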
22
ASR Experiments Review
  • Isolated English digits zero through nine
    from TI-46 corpus, 8 male speakers,
  • HMM word models, 8 states per model, diagonal
    covariance matrix,
  • Three MFCC versions (different filter banks),
  • Several degrees of freedom,
  • Linear ERB scale factor.

23
ASR Results
White noise (local SNR), HFCC vs. DM
24
ASR Results
White noise (global SNR), HFCC vs. DM, linear ERB
scale factor (E-factor).
25
HFCC Conclusions
  • Added biologically inspired bandwidth to filter
    bank of popular speech feature extractor.
  • Decoupled bandwidth from other filter bank design
    parameters.
  • Demonstrated superior noise-robust performance of
    new feature extractor.
  • Demonstrated advantages of wider filters.

26
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

27
HMM Limitations
  • HMMs are piecewise-stationary, while speech is
    continuous and nonstationary.
  • Assumes frames of speech are i.i.d.
  • State pdf estimates are data-driven.

HMMs make no claim of modeling biology.
28
Novel Classifiers
  • Deng's trended HMM.
  • Rabiner's autoregression HMM.
  • Morgan's HMM/neural network hybrid.
  • Robinson's recurrent neural network.
  • Wismüller's self-organizing map.
  • Herrmann's transient attractor network.
  • Maass' dynamic synapse MLP.
  • Berger's dynamic synapse RNN.

29
Freeman's Chaotic Model
  • Biologically inspired nonlinear dynamic model of
    cortical signal processing, from rabbit olfactory
    neo-cortex experiments.
  • A hierarchical network of oscillators that are
    locally stable and globally chaotic.
  • Demonstrated as classifier of static patterns.
  • Represents a radical departure from current
    classifier paradigms.

30
KI Model
  • Smallest element in network hierarchy.
  • a, b: constants
  • xi(t): state variable
  • N: number of states
  • Wij: weight from state i to state j
  • Q: asymmetric sigmoid
  • Ii(t): input to state i.
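The governing equation was a figure in the original deck. The standard second-order form of Freeman's K0/KI dynamics from the literature (an assumption about what the slide showed, matching the symbols above) is:

```latex
\frac{1}{ab}\,\ddot{x}_i(t) + \frac{a+b}{ab}\,\dot{x}_i(t) + x_i(t)
  = \sum_{j \neq i} W_{ij}\, Q\!\big(x_j(t)\big) + I_i(t)
```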

31
Reduced KII Network
  • The locally stable element is the KII network.
  • m(t): excitatory mitral cell
  • g(t): inhibitory granule cell
  • Weights: Kmg > 0, Kgm < 0
  • N pairs in parallel
  • Mitral cells fully connected
  • Granule cells fully connected
  • Input I(t) into the excitatory cell.

32
KII Simulations
[Plots of m(t) and g(t)]
The reduced KII reaches a steady-state point attractor
or a limit cycle, depending on Kmg and Kgm.
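The simulation panels were figures in the deck. Below is a forward-Euler sketch of a single excitatory-inhibitory (mitral/granule) pair with Freeman-style second-order node dynamics; whether it settles to a point or oscillates depends on the coupling gains, as the slide states. All parameter values (a, b, Kmg, Kgm, the sigmoid parameter q) are illustrative assumptions, not the deck's:

```python
import numpy as np

# Illustrative rate constants and couplings (assumed, not from the deck).
a, b = 0.22, 0.72      # node rate constants
Kmg, Kgm = 1.5, -1.5   # mitral->granule excitatory, granule->mitral inhibitory

def Q(x, q=5.0):
    """Freeman-style asymmetric sigmoid, clipped below for numerical safety."""
    return np.where(x > -q, q * (1.0 - np.exp(-(np.exp(x) - 1.0) / q)), -1.0)

def simulate(I=1.0, dt=0.01, steps=20000):
    """Euler-integrate one mitral/granule pair:
    (1/ab) x'' + ((a+b)/ab) x' + x = coupling + input."""
    m = mp = g = gp = 0.0            # states and their first derivatives
    traj = np.empty((steps, 2))
    for t in range(steps):
        drive_m = Kgm * Q(g) + I     # granule inhibits mitral; I excites it
        drive_g = Kmg * Q(m)         # mitral excites granule
        mpp = a * b * (drive_m - m) - (a + b) * mp
        gpp = a * b * (drive_g - g) - (a + b) * gp
        m, mp = m + dt * mp, mp + dt * mpp
        g, gp = g + dt * gp, gp + dt * gpp
        traj[t] = (m, g)
    return traj

traj = simulate()
print(np.all(np.isfinite(traj)))  # True: the trajectory stays bounded
```

Sweeping Kmg and Kgm and inspecting the (m, g) trajectory shows the transition between a fixed point and sustained oscillation.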
33
  • Introduction
  • Biologically inspired algorithms
  • Speech Energy Redistribution
  • Features: Human Factor Cepstral Coefficients
  • Classifier: Nonlinear dynamic systems
  • Future work

34
Work Completed
  • Developed biologically inspired algorithms
  • Energy redistribution combines Lombard Effect
    (how) with psychoacoustic experimental results
    (where) to increase speech intelligibility.
  • Human factor cepstral coefficients combine the
    existing speech front end (MFCC) with critical
    bandwidth information (ERB).
  • Published 3 papers, and submitted 3 more, on
    novel algorithms.
  • Literature survey on novel speech classifiers,
    and simulations of nonlinear Freeman model.

35
Work Proposed
  1. Compare HFCC to human speech recognition using a
    rhyming test in ASR experiments.
  2. Measure the effects of ERVU in ASR experiments.
  3. Analyze the HFCC algorithm, accounting for the
    nonlinear log() function.
  4. Experiment with other bandwidth functions besides
    ERB or scaled ERB.
  5. Quantify the tradeoff between spectral resolution and
    noise smoothing for HFCC using synthetic data.

36
Work Proposed, Cont'd
  1. Build on the reduced KII network results recently
    reported by CNEL suggesting the network can
    operate as a content-addressable memory (CAM).
  2. Investigate alternative information storage
    strategies to CAM, focusing on inherent
    time-varying nature of dynamic system (coupling
    theory is intriguing).
  3. Expand literature search to areas outside speech
    recognition experiments that use nonlinear
    dynamic (chaotic) systems for information
    processing/storage, with emphasis on applications
    with time-varying signals.

37
Work Proposed, Cont'd
  1. Consider alternative roles for nonlinear dynamics:
    embedding extracted features for the HFCC/HMM system,
    trajectory tracking in the spirit of Deng's trended HMM.
  2. Demonstrate classification of static vowel
    patterns (vowel phonemes) with novel classifier,
    in presence of noise.
  3. Demonstrate classification of time-varying
    signals (isolated English digits, rhyming test
    corpus), in noisy environments.