1
PERFORMANCE ANALYSIS OF AURORA LARGE VOCABULARY BASELINE SYSTEM
URL: www.isip.msstate.edu/projects/ies/publications/conferences/
2
Abstract
In this paper, we present the design and analysis of the baseline
recognition system used for the ETSI Aurora large vocabulary (ALV)
evaluation. The experimental paradigm is presented along with the
results from a number of experiments designed to minimize the
computational requirements of the system. The ALV baseline system
achieved a WER of 14.0% on the standard 5K Wall Street Journal task,
and required 4 xRT for training and 15 xRT for decoding (on an
800 MHz Pentium processor). It is shown that increasing the sampling
frequency from 8 kHz to 16 kHz improves performance significantly
only for the noisy test conditions. Utterance detection resulted in
significant improvements only on the noisy conditions for the
mismatched training conditions. Use of the DSR standard VQ-based
compression algorithm did not result in a significant degradation.
Model mismatch and microphone mismatch resulted in relative increases
in WER of 300% and 200%, respectively.
3
Motivation
  • ALV goal was at least a 25% relative improvement
    over the baseline MFCC front end (this metric is
    worked out below)
  • Develop a generic baseline LVCSR system with no
    front-end-specific tuning
  • Benchmark the baseline MFCC front end using the
    generic LVCSR system on six focus conditions:
    sampling frequency reduction, utterance
    detection, feature-vector compression, model
    mismatch, microphone variation, and additive noise
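
As a point of reference for the 25% goal: relative improvement is
(WER_baseline - WER_new) / WER_baseline. Against the 14.0% WER of the
ALV baseline reported in the abstract, a 25% relative improvement
therefore corresponds to roughly 14.0% x 0.75 = 10.5% WER.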

4
ALV Baseline System Development
  • Standard context-dependent cross-word HMM-based
    system
  • Acoustic models: state-tied, 16-mixture cross-word
    triphones
  • Language model: WSJ0 5K bigram
  • Search: Viterbi one-best using lexical trees for
    N-gram cross-word decoding (a minimal Viterbi
    sketch follows this list)
  • Lexicon: based on CMUlex
  • Performance: 8.3% WER at 85 xRT
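
To make the search step concrete, here is a minimal sketch (our own,
in Python/NumPy, not code from the paper) of Viterbi one-best decoding
over a single HMM's log-probabilities. The real system layers lexical
trees, cross-word triphone contexts, and bigram LM scores on top of
this recursion, all of which the sketch omits.

    import numpy as np

    def viterbi_one_best(log_trans, log_obs, log_init):
        # log_trans: (S, S) state-transition log-probs
        # log_obs:   (T, S) per-frame acoustic log-likelihoods
        # log_init:  (S,)   initial-state log-probs
        T, S = log_obs.shape
        delta = log_init + log_obs[0]       # best score ending in each state
        back = np.zeros((T, S), dtype=int)  # backpointers for traceback
        for t in range(1, T):
            scores = delta[:, None] + log_trans   # (prev, next) candidates
            back[t] = np.argmax(scores, axis=0)
            delta = scores[back[t], np.arange(S)] + log_obs[t]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):       # trace the one-best path back
            path.append(int(back[t, path[-1]]))
        return path[::-1]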

5
ETSI WI007 Front End
  • The baseline HMM system used an ETSI standard
    MFCC-based front end (a feature-extraction sketch
    follows this list)
  • Zero-mean debiasing
  • 10 ms frame duration
  • 25 ms Hamming window
  • Absolute energy
  • 12 cepstral coefficients
  • First and second derivatives
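
A minimal sketch of this style of front end. The frame/window sizes,
energy term, and coefficient count come from the slide; the 23-filter
mel filterbank and 512-point FFT are our assumptions, zero-mean
debiasing is approximated as per-utterance DC removal, and the
first/second derivative computation is left out.

    import numpy as np
    from scipy.fftpack import dct

    def mel_filterbank(fs, nfft, n_filters):
        # Triangular filters spaced evenly on the mel scale up to fs/2.
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
        bins = np.floor((nfft + 1) * edges / fs).astype(int)
        fbank = np.zeros((n_filters, nfft // 2 + 1))
        for i in range(1, n_filters + 1):
            l, c, r = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        return fbank

    def wi007_style_features(signal, fs=16000, n_ceps=12,
                             n_filters=23, nfft=512):
        signal = signal - np.mean(signal)   # zero-mean debiasing
        shift, length = int(0.010 * fs), int(0.025 * fs)  # 10 ms / 25 ms
        window = np.hamming(length)
        fbank = mel_filterbank(fs, nfft, n_filters)
        feats = []
        for start in range(0, len(signal) - length + 1, shift):
            frame = signal[start:start + length] * window
            energy = np.log(np.sum(frame ** 2) + 1e-10)  # absolute energy
            power = np.abs(np.fft.rfft(frame, nfft)) ** 2
            logmel = np.log(fbank @ power + 1e-10)
            ceps = dct(logmel, type=2, norm='ortho')[1:n_ceps + 1]
            feats.append(np.concatenate(([energy], ceps)))
        # first and second derivatives would be appended per frame here
        return np.array(feats)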

6
Real-Time Reduction
  • Derived from the ISIP WSJ0 system (with CMS)
  • Aurora-4 database terminal filtering resulted in
    marginal degradation
  • The ETSI WI007 front end is 14% worse (no CMS)
  • ALV baseline system performance: 14.0% WER
  • Real time: 4 xRT for training and 15 xRT for
    decoding on an 800 MHz Pentium (xRT is defined in
    the sketch below)
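
For clarity, xRT (times real time) is the ratio of compute time to the
duration of the audio processed, so 15 xRT means 15 seconds of CPU per
second of speech. A minimal timing harness (the names are illustrative,
not from the paper):

    import time

    def real_time_factor(process, audio, audio_seconds):
        # process: any callable that trains on or decodes `audio`
        start = time.perf_counter()
        process(audio)
        elapsed = time.perf_counter() - start
        return elapsed / audio_seconds   # e.g. 15.0 => 15 xRT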


7
Aurora-4 Database
  • Acoustic Training
  • Derived from the 5000-word WSJ0 task
  • TrS1 (clean) and TrS2 (multi-condition)
  • Clean plus 6 noise conditions
  • Randomly chosen SNR between 10 and 20 dB (noise
    mixing at a target SNR is sketched after this
    list)
  • 2 microphone conditions (Sennheiser and
    secondary)
  • 2 sampling frequencies: 16 kHz and 8 kHz
  • G.712 filtering at 8 kHz and P.341 filtering at
    16 kHz
  • Development and Evaluation Sets
  • Derived from the WSJ0 Evaluation and Development
    sets
  • 14 test sets for each
  • 7 recorded on the Sennheiser microphone, 7 on the
    secondary
  • Clean plus 6 noise conditions
  • Randomly chosen SNR between 5 and 15 dB
  • G.712 filtering at 8 kHz and P.341 filtering at
    16 kHz
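
A minimal sketch of mixing noise into speech at a randomly drawn SNR,
as in the set construction above. The G.712/P.341 channel filters are
not reproduced, and all names here are illustrative.

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        # Scale noise so 10*log10(P_speech / P_noise) equals snr_db;
        # `noise` must be at least as long as `speech`.
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise[:len(speech)] ** 2)
        scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + scale * noise[:len(speech)]

    rng = np.random.default_rng()
    snr = rng.uniform(10.0, 20.0)   # training range; tests use 5-15 dB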

8
Sampling Frequency Reduction
  • Perfectly-matched condition (TrS1 and TS1): no
    significant degradation
  • Mismatched conditions (TrS1 and TS2-TS14): no
    clear trend
  • Matched conditions (TrS2 and TS1-TS14):
    significant degradation on noisy conditions
    recorded on the Sennheiser microphone (TS3-TS8)

9
Utterance Detection
  • Perfectly-matched condition (TrS1 and TS1): no
    significant improvement
  • Mismatched conditions (TrS1 and TS2-TS14):
    significant improvement due to a reduction in
    insertions (an endpointing sketch follows this
    list)
  • Matched conditions (TrS2 and TS1-TS14): no
    significant improvement
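
Utterance detection here amounts to trimming non-speech frames so that
noise outside the utterance cannot produce insertions. A minimal
energy-threshold endpointing sketch (the threshold and names are our
assumptions, not the evaluation's actual detector):

    import numpy as np

    def endpoints(frame_energy_db, threshold_db=-40.0):
        # Return (first, last) indices of frames above the energy
        # threshold; frames outside this span are trimmed pre-decoding.
        active = np.where(frame_energy_db > threshold_db)[0]
        if active.size == 0:
            return None              # no speech detected
        return int(active[0]), int(active[-1])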

10
Feature-vector Compression
  • Sampling-frequency-specific codebooks: 8 kHz and
    16 kHz (split-VQ encoding is sketched after this
    list)
  • Perfectly-matched condition (TrS1 and TS1): no
    significant degradation
  • Mismatched conditions (TrS1 and TS2-TS14): no
    significant degradation
  • Matched conditions (TrS2 and TS1-TS14):
    significant degradation on a few matched
    conditions: TS3, 8, 9, 10, and 12 at 16 kHz and
    TS7 and 12 at 8 kHz
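
The DSR compression in ES 201 108 is a split vector quantizer: the
feature vector is divided into small sub-vectors, and each is
transmitted as the index of its nearest codeword. A minimal sketch of
the encoding step (the split and codebook sizes are illustrative, not
the standard's exact tables):

    import numpy as np

    def split_vq_encode(features, codebooks):
        # features:  (T, D) feature frames
        # codebooks: list of (K_i, d_i) arrays covering consecutive
        #            slices of the D-dimensional feature vector
        indices, offset = [], 0
        for cb in codebooks:
            sub = features[:, offset:offset + cb.shape[1]]
            # nearest codeword per frame under Euclidean distance
            dist = np.linalg.norm(sub[:, None, :] - cb[None, :, :],
                                  axis=2)
            indices.append(np.argmin(dist, axis=1))
            offset += cb.shape[1]
        return np.stack(indices, axis=1)   # (T, n_subvectors) indices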

11
Model Mismatch
  • Perfectly-matched condition (TrS1 and TS1): best
    performance
  • Mismatched conditions (TrS1 and TS2-TS14):
    significant degradations
  • Matched conditions (TrS2 and TS1-TS14): better
    than mismatched conditions

12
Microphone Variation
  • Train on the Sennheiser microphone, evaluate on
    the secondary microphone
  • Perfectly-matched condition (TrS1 and TS1):
    optimal performance
  • Mismatched condition (TrS1 and TS8): significant
    degradation
  • Matched conditions: less severe degradation when
    samples of the secondary microphone are seen
    during training

13
Additive Noise

14
Summary and Conclusions
  • Presented a WSJ0-based LVCSR system that runs at
    4 xRT for training and 15 xRT for decoding on an
    800 MHz Pentium
  • Reduction in benchmarking time from 1034 to 203
    days
  • An increase in sampling frequency from 8 kHz to
    16 kHz results in significant improvement only on
    matched noisy test conditions
  • Utterance detection resulted in significant
    improvements only on the noisy conditions for the
    mismatched training conditions
  • VQ-based compression is robust in the DSR
    environment
  • Exposing models to different noise and microphone
    conditions improves speech recognition
    performance in adverse conditions

15
Available Resources

16
Brief Bibliography
  • N. Parihar, Performance Analysis of Advanced
    Front Ends, M.S. Dissertation, Mississippi State
    University, December 2003.
  • N. Parihar and J. Picone, "An Analysis of the
    Aurora Large Vocabulary Evaluation," Eurospeech
    2003, pp. 337-340, Geneva, Switzerland, September
    2003.
  • N. Parihar and J. Picone, "DSR Front End LVCSR
    Evaluation - AU/384/02," Aurora Working Group,
    European Telecommunications Standards Institute,
    December 6, 2002.
  • D. Pearce, "Overview of Evaluation Criteria for
    Advanced Distributed Speech Recognition," ETSI
    STQ-Aurora DSR Working Group, October 2001.
  • G. Hirsch, "Experimental Framework for the
    Performance Evaluation of Speech Recognition
    Front-ends in a Large Vocabulary Task," ETSI
    STQ-Aurora DSR Working Group, December 2002.
  • ETSI ES 201 108 v1.1.2, "Distributed Speech
    Recognition; Front-end Feature Extraction
    Algorithm; Compression Algorithm," ETSI, April
    2000.