1
Relating Reliability in Phonetic Feature Streams
to Noise Robustness
  • Alex Park
  • August 26th, 2003

2
Overview
  • Motivation for using a layered, phonetic feature stream approach
  • Building a recognizer based on phonetic features
    – MFCC-based GMM feature detectors (baseline)
    – Sample feature stream outputs
    – Training a digit recognizer using concatenated feature streams as input
  • Robust alternatives for the voicing feature stream module
    – Saul sinusoid detector
    – Autocorrelation
    – GMM classifier using alternative features
  • Evaluation of stream reliability using distortion between clean and noisy speech
    – Hard question: what is ground truth for continuous measurements?
  • Relating stream extraction reliability to word recognition accuracy
  • Conclusions and Future Work

Introduction
3
Motivation
  • Failure of recognizers in noise is due to mismatch between the features observed in training and testing
  • To reduce mismatch, we can evaluate and optimize the reliability of the features presented to the acoustic models at a middle layer
  • Current recognizers typically use one set of front-end features to train acoustic models at the phone level
  • Typical front-end features can only be evaluated by looking at WER, which is influenced by many factors; global optimization can mask serious inconsistencies in the speech representation under noise
  • Phonetic features can change asynchronously, especially in spontaneous speech
  • Why phonetic features?
    – Perceivable by humans and relevant to speech
    – Several examples of phonetic feature/phone class detection exist: bursts (Niyogi 2002), nasality (Glass 1986), voicing (Saul 2003)
    – Other researchers have recently proposed acoustic modelling frameworks based on related feature streams (articulatory, acoustic, distinctive): e.g., articulatory (Livescu 2003, Metze 2002) and acoustic (Kirchhoff 2002)
  • Why not?

Introduction
4
Training MFCC GMM Feature Classifiers
  • Sparse set of 6 phonetic features chosen for simplicity
    – For a less constrained task, more features should probably be used
    – More extensive training data would also improve the quality of each feature detector
  • For each feature F, train two GMMs, p(x|F) and p(x|¬F), using frame-level MFCC feature vectors
    – Trained on 410 TIMIT sentences from 40 speakers (126k frames)
  • Use Bayes' rule (with equal priors) to determine posterior probabilities, which are computed every 10 ms
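The deck does not name a toolkit for this step; as a minimal sketch of the train-then-posterior computation, here is one way it could look with scikit-learn's GaussianMixture (the mixture count, the 13-dimensional MFCC stand-ins, and the random dummy data are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-ins for MFCC frames labeled +F / -F; in the deck these
# would come from the transcribed TIMIT training data (126k frames total).
X_pos = np.random.randn(2000, 13)
X_neg = np.random.randn(2000, 13)

# One GMM for frames where feature F is present, one for frames where it is absent
gmm_pos = GaussianMixture(n_components=10, covariance_type='diag').fit(X_pos)
gmm_neg = GaussianMixture(n_components=10, covariance_type='diag').fit(X_neg)

def feature_posterior(X):
    """Posterior p(F | x) per 10 ms frame, via Bayes' rule with equal priors."""
    log_pos = gmm_pos.score_samples(X)   # log p(x | F)
    log_neg = gmm_neg.score_samples(X)   # log p(x | not F)
    # p(F | x) = p(x|F) / (p(x|F) + p(x|notF)) when the priors are equal
    return 1.0 / (1.0 + np.exp(log_neg - log_pos))
```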

[Diagram: transcribed speech (training data) is used to train p(x|F) and p(x|¬F) for each feature; at test time the GMM pair yields a posterior probability stream.]

Feature        TIMIT labels
Frication      s, sh, z, zh, f, th, ...
Rounding       w, ow, uw, ...
Nasal          n, m, ng, ...
Liquid/Glide   el, l, uw, ...
Burst          g, k, p, ...
Voice          aa, ae, ah, ...
Stream Recognizer
5
Sample Outputs: MFCC-based Streams
  • Feature streams for an Aurora utterance ("six three five seven one zero four")

Stream Recognizer
6
Recognizer Training
  • Phonetic feature posterior probability outputs used as feature vectors to train an Aurora HMM recognizer
  • Standard training script included with the Aurora 2 evaluation (8,440 clean training utterances)
  • Eleven whole-word models and one silence model
    – 18 states each, 3 mixtures, 6-dimensional diagonal Gaussian emission probabilities
  • Probably not an optimal model structure for the given feature set
    – Also, used HCompV instead of HInit with time-aligned transcriptions
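The deck trained with HTK (HCompV flat start); purely as an illustrative sketch of the stated model structure, the same whole-word training could be written with hmmlearn's GMMHMM (the training_data layout, dummy arrays, and iteration count are assumptions):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# Hypothetical layout: word -> list of (T_i, 6) posterior-stream arrays.
# Dummy data stands in for the 8,440 Aurora clean training utterances.
training_data = {'one': [np.random.rand(100, 6) for _ in range(10)]}

models = {}
for word, utts in training_data.items():
    X = np.vstack(utts)                     # stack all frames of all utterances
    lengths = [u.shape[0] for u in utts]    # per-utterance frame counts
    # 18 states, 3 mixtures, diagonal Gaussians, as described on the slide
    hmm = GMMHMM(n_components=18, n_mix=3, covariance_type='diag', n_iter=20)
    models[word] = hmm.fit(X, lengths)
```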

[Diagram: clean training data passes through the feature extraction modules; the concatenated feature vectors train whole-word HMMs ("one", "two", ..., "oh").]
Stream Recognizer
7
Preliminary recognition results
  • Tested across all 4 noise conditions and 7 SNR levels on Aurora test set A
  • Accuracy is 88% on clean data (91% was obtained earlier using 9 feature streams, but the set was reduced to 6 for simplicity)
  • Poor performance compared to the Aurora baseline, but interesting considering the sparsity of the feature set used to train the HMMs
  • Many factors should be addressed to improve the stream-based recognizer:
    – More feature streams
    – Deltas and delta-deltas (sketched below)
    – Relationship between feature streams
    – Discriminative lexical ability for different word models
    – Noise compensation in feature extraction
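For the delta item above, appending deltas and delta-deltas is a small amount of code; a minimal sketch using the standard regression formula (the window half-width N=2 is an assumption):

```python
import numpy as np

def add_deltas(feats, N=2):
    """Append delta and delta-delta coefficients to a (T, D) feature matrix,
    using the standard regression over +/-N neighboring frames."""
    def delta(x):
        denom = 2 * sum(n * n for n in range(1, N + 1))
        T = len(x)
        padded = np.pad(x, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
        return sum(n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
                   for n in range(1, N + 1)) / denom
    d = delta(feats)
    return np.hstack([feats, d, delta(d)])   # result is (T, 3D)
```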

Stream Recognizer
8
A closer look: Stream corruption under noise
  • Effect of noise on the output of the MFCC-based voicing feature module

[Figure: p(Voice) output of the MFCC-based voicing module at three noise levels.]
Voicing Module
9
In search of a better voicing module
  • Several possible alternatives to the MFCC-based voicing module:
    – Autocorrelation (AutoCorr; sketched below the table)
    – Sinusoid uncertainty (Saul, 2003)
    – Alternative GMM classifier (AltGMM)
      · trained like the MFCC classifier, but using the above features
      · 6-dimensional, 10-mixture diagonal Gaussians each for p(x|F) and p(x|¬F)
  • Evaluated voicing detection using the phonetic transcription as reference
  • In clean conditions, the MFCC GMM has the best detection performance
  • Is this the best module to use?

Method      Equal Error Rate (%)
GMM         11.14
Sinusoid    18.11
AutoCorr    16.78
AltGMM      24.84
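A minimal sketch of the AutoCorr alternative listed above: the peak of the normalized autocorrelation within the pitch lag range serves as a per-frame voicing score (the 8 kHz sample rate and the 60-400 Hz pitch range are assumptions):

```python
import numpy as np

def autocorr_voicing(frame, fs=8000, f_min=60.0, f_max=400.0):
    """Voicing score in [0, 1]: peak normalized autocorrelation of the frame
    over lags corresponding to plausible pitch periods."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if ac[0] <= 0.0:                      # silent frame: no energy
        return 0.0
    ac = ac / ac[0]                       # normalize so lag 0 equals 1
    lo = int(fs / f_max)                  # shortest plausible pitch period
    hi = min(int(fs / f_min), len(ac) - 1)
    return float(np.clip(ac[lo:hi + 1].max(), 0.0, 1.0))
```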
Voicing Module
10
Evaluating stream robustness
  • Several problems with using global frame detection accuracy to rate module performance:
    – Would like a continuous measure of voicing (degree of voicing) instead of a binary decision
    – Ground truth is hard to come by: "voiced" phone labels are not necessarily voiced!
  • To evaluate reliability, try using the distortion between the clean and noisy voicing probability for the same utterance
    – For each frame, measure the difference between the clean estimate, fc(t), and the noisy estimate, fn(t)
    – If |fc(t) − fn(t)| > 0.2, label f(t) as a gross error
    – If |fc(t) − fn(t)| < 0.2, use |fc(t) − fn(t)| as a measure of the distortion caused by noise
  • N.B. Consistency doesn't guarantee accuracy; we still need to check
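The measure above transcribes directly into code; a sketch operating on the per-frame voicing posteriors of one utterance under clean and noisy conditions:

```python
import numpy as np

def stream_distortion(f_clean, f_noisy, thresh=0.2):
    """Compare clean and noisy posterior streams frame by frame.

    Returns (gross error rate, average distortion): the fraction of frames
    with |fc(t) - fn(t)| > thresh, and the mean |fc(t) - fn(t)| over the
    remaining frames."""
    diff = np.abs(f_clean - f_noisy)
    gross = diff > thresh
    avg = diff[~gross].mean() if (~gross).any() else 0.0
    return gross.mean(), avg
```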

Voicing Module
11
Distortion Comparison
  • Compared the frame distortion for the voicing modules at each noise level:
    – Percentage of frames labelled as gross errors (distortion > 0.2)
    – Average distortion for the remaining frames (distortion < 0.2)
  • Despite higher performance on clean data, the MFCC module is the most erratic
  • In terms of consistency, the AltGMM module outperforms the MFCC module in noise

Voicing Module
12
A better voicing module?
  • Output of the AltGMM module trained on AutoCorr and SinUn features

[Figure: p(Voice) output of the AltGMM module at three noise levels.]
Voicing Module
13
Recognition Performance Comparison
  • Trained 3 additional recognizers, one for each alternative voicing module
  • Performed recognition experiments to compare the voicing modules
  • No significant difference in accuracy at any noise level
  • Need to perform additional experiments to understand the effect of the voicing modules on recognition

[Diagram: the test utterance passes through the feature extraction modules, with the voicing module swapped out, to produce the feature vector.]
Recognition Experiments
14
Oracle Experiment
  • What happens if we assume the voicing module is perfectly reliable?
    – i.e., same output under any noise condition
  • Accuracy is not improved over the normal scenario
    – Having a robust voicing feature alone is not enough to improve recognition
    – Corruption of the other feature streams is likely skewing the overall acoustic model scores
  • How can we isolate the contribution of this feature stream? (see the sketch below)
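In code, the oracle condition amounts to splicing the clean voicing stream into the otherwise noisy feature matrix before decoding; a sketch (the position of the voicing stream in the 6-dimensional vector is an assumption):

```python
import numpy as np

VOICING = 5   # assumed index of the voicing stream in the 6-dim feature vector

def oracle_features(noisy_feats, clean_feats):
    """Replace the noisy voicing stream with its clean counterpart.

    Both inputs are (T, 6) matrices of posterior streams for the same
    utterance, computed from the noisy and clean signals respectively."""
    feats = noisy_feats.copy()
    feats[:, VOICING] = clean_feats[:, VOICING]
    return feats
```

The inverse oracle on the next slide is the mirror image: start from the clean matrix and splice in the noisy voicing stream.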

[Diagram: oracle setup: the voicing module sees the clean utterance, while the other feature extraction modules see the noisy test utterance.]
Recognition Experiments
15
Inverse Oracle Experiment
  • Assume the other feature streams are computed consistently
  • Allow the voicing module to contribute its actual output
  • Significant difference in performance between the 4 voicing modules
    – Even with 5 of 6 clean features, the MFCC voicing module degrades quickly in noise
  • Recognition performance of each method is correlated with the distortion results

[Diagram: inverse oracle setup: the voicing module sees the noisy test utterance, while the other feature extraction modules see the clean utterance.]
Recognition Experiments
16
Conclusions and Future Work
  • A small set of phonetic features can obtain somewhat high (88%) recognition accuracy for a constrained digit task, even when integrated in a non-optimal manner (HMM)
  • Reliable extraction of feature streams is essential for robust recognition
  • Combining statistical training with feature-specific measurements can improve the reliability of feature stream extraction
  • Even if the other 5 streams are computed perfectly, corrupting voicing can drastically degrade recognition accuracy
  • Future work:
    – Integrate feature streams with a more appropriate acoustic modelling layer (e.g., feature-based graphical models or DBNs)
    – Optimize individual feature stream modules with relevant measurements
      · Nasality: broad F1 bandwidth, low spectral slope in the F1–F2 region, stable low-frequency energy
      · Rounding: low F1 and F2
      · Retroflex: low F3, rising formants
    – Combine feature streams with an SNR-based measure of reliability
  • Lots to be done!

Conclusions
17
References
  • J.R. Glass and V.W. Zue (1986). "Detection and Recognition of Nasal Consonants in American English," in Proc. ICASSP '86, Tokyo, Japan.
  • P. Niyogi and M.M. Sondhi (2002). "Detecting Stop Consonants in Continuous Speech," J. Acoust. Soc. Am., vol. 111, p. 1063.
  • L.K. Saul, D.D. Lee, C.L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
  • K. Kirchhoff, G.A. Fink, and G. Sagerer (2002). "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
  • K. Livescu, J.R. Glass, and J. Bilmes (2003). "Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks," to be presented at Eurospeech '03, Geneva, Switzerland.
  • F. Metze and A. Waibel (2002). "A Flexible Stream Architecture for ASR Using Articulatory Features," in Proc. ICSLP '02, Denver, Colorado.

Conclusions
18
Band-limited Sinusoid Fitting (Saul 2003)
  • Filter bandwidths allow at least one filter to resolve single harmonics
  • Frames of the filtered signals are fit with a sinusoid of frequency ω and error u
  • At each step, the lowest u gives the voicing probability; ω gives the pitch estimate
  • The algorithm is fast and gives accurate pitch tracks
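The sketch below is not Saul's band-limited algorithm itself (which fits sinusoids to filterbank outputs and runs much faster); it is only a brute-force illustration of the underlying fit-and-residual idea, where a low normalized residual u suggests voicing. The frequency grid, frame handling, and sample rate are assumptions:

```python
import numpy as np

def sinusoid_fit(frame, fs=8000, freqs=np.arange(60.0, 400.0, 2.0)):
    """Least-squares fit of a*cos(2*pi*f*t) + b*sin(2*pi*f*t) to one frame
    for each candidate frequency f; return (best f, normalized residual u)."""
    t = np.arange(len(frame)) / fs
    energy = float(np.dot(frame, frame)) + 1e-12
    best_f, best_u = freqs[0], np.inf
    for f in freqs:
        basis = np.column_stack([np.cos(2 * np.pi * f * t),
                                 np.sin(2 * np.pi * f * t)])
        coef, _, _, _ = np.linalg.lstsq(basis, frame, rcond=None)
        resid = frame - basis @ coef
        u = float(np.dot(resid, resid)) / energy  # fraction of unexplained energy
        if u < best_u:
            best_f, best_u = f, u
    return best_f, best_u
```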

Extra Slide
Voicing Module
19
Supp. recognition results I (Actual streams)
Extra Slide
Stream Recognizer
20
Supp. recognition results II (Oracle voice)
Extra Slide
Stream Recognizer
21
Supp. recognition results III (Inv. Oracle voice)
Extra Slide
Stream Recognizer
22
Supp. distortion results I (Gross error rate)
Extra Slide
Stream Recognizer
23
Supp. distortion results II (Avg Frame Distortion)
Extra Slide
Stream Recognizer