An Analysis of the Aurora Large Vocabulary Evaluation
1
An Analysis of the Aurora Large Vocabulary Evaluation
EUROSPEECH 2003
  • Authors
  • Naveen Parihar and Joseph Picone
  • Inst. for Signal and Info. Processing
  • Dept. Electrical and Computer Eng.
  • Mississippi State University
  • Contact Information
  • Box 9571
  • Mississippi State University
  • Mississippi State, Mississippi 39762
  • Tel: 662-325-8335
  • Fax: 662-325-2298

Email: {parihar, picone}@isip.msstate.edu
URL: isip.msstate.edu/publications/conferences/eurospeech/2003/evaluation/
2
INTRODUCTION
ABSTRACT
In this paper, we analyze the results of the
recent Aurora large vocabulary evaluations (ALV).
Two consortia submitted proposals on speech
recognition front ends for this evaluation
(1) Qualcomm, ICSI, and OGI (QIO), and
(2) Motorola, France Telecom, and Alcatel (MFA).
These front ends used a variety of noise
reduction techniques including discriminative
transforms, feature normalization, voice activity
detection, and blind equalization. Participants
used a common speech recognition engine to
post-process their features. In this paper, we
show that the results of this evaluation were not
significantly impacted by suboptimal recognition
system parameter settings. Without any front end
specific tuning, the MFA front end outperforms
the QIO front end by 9.6% relative. With tuning,
the relative performance gap increases to 15.8%.
Both the mismatched microphone and additive noise
evaluation conditions resulted in a significant
degradation in performance for both front ends.
3
INTRODUCTION
MOTIVATION
ALV Evaluation Results
  • ALV goal was at least a 25% relative improvement
    over the baseline MFCC front end
  • Two consortia participated
  • QIO: Qualcomm, ICSI, OGI
  • MFA: Motorola, France Telecom, Alcatel
  • Generic baseline LVCSR system with no front end
    specific tuning
  • Would front end specific tuning change the
    rankings?

4
EVALUATION PARADIGM
THE AURORA 4 DATABASE
  • Acoustic Training
  • Derived from the 5000-word WSJ0 task
  • TS1 (clean) and TS2 (multi-condition)
  • Clean plus 6 noise conditions
  • Randomly chosen SNR between 10 and 20 dB (noise mixing sketched below)
  • 2 microphone conditions (Sennheiser and secondary)
  • 2 sample frequencies: 16 kHz and 8 kHz
  • G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
  • Development and Evaluation Sets
  • Derived from the WSJ0 Evaluation and Development sets
  • 14 test sets for each
  • 7 recorded on the Sennheiser mic., 7 on the secondary mic.
  • Clean plus 6 noise conditions
  • Randomly chosen SNR between 5 and 15 dB
  • G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
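As a rough illustration of how such noisy conditions can be produced, the sketch below mixes a noise recording into a clean utterance at a target SNR drawn from the stated range. This is a generic recipe, not the Aurora-4 corpus-preparation code; the function name and the power-ratio SNR definition are assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture `clean + g*noise` has `snr_db` dB SNR.

    Hypothetical helper, not the Aurora-4 preparation code; SNR is
    defined here as the ratio of average signal power to noise power.
    """
    # Tile/crop the noise to match the utterance length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose g so that 10*log10(p_clean / (g**2 * p_noise)) == snr_db.
    g = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + g * noise

# Training conditions: SNR drawn uniformly between 10 and 20 dB.
rng = np.random.default_rng(0)
snr_db = rng.uniform(10.0, 20.0)
```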

5
EVALUATION PARADIGM
BASELINE LVCSR SYSTEM
  • Standard context-dependent cross-word HMM-based system
  • Acoustic models: state-tied 4-mixture cross-word triphones
  • Language model: WSJ0 5K bigram
  • Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
  • Lexicon: based on CMUlex
  • Real time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium

6
EVALUATION PARADIGM
WI007 ETSI MFCC FRONT END
  • The baseline HMM system used an ETSI standard MFCC-based front end
  • Zero-mean debiasing
  • 10 ms frame duration
  • 25 ms Hamming window
  • Absolute energy
  • 12 cepstral coefficients
  • First and second derivatives

[Block diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Energy and Cepstral Analysis]
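The exact front end is specified by the ETSI WI007 standard; purely to illustrate the stages listed above (debiasing, pre-emphasis, 25 ms Hamming windows at a 10 ms shift, mel filterbank, cepstral analysis), here is a minimal numpy/scipy sketch. The filterbank size and pre-emphasis coefficient are generic defaults, not the standard's constants.

```python
import numpy as np
from scipy.fft import dct, rfft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filt, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filt + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(sig, sr=8000, n_ceps=12, n_filt=23):
    sig = sig - np.mean(sig)                            # zero-mean debiasing
    sig = np.append(sig[0], sig[1:] - 0.97 * sig[:-1])  # pre-emphasis
    flen, fshift = int(0.025 * sr), int(0.010 * sr)     # 25 ms window, 10 ms shift
    fb = mel_filterbank(n_filt, flen, sr)
    win = np.hamming(flen)
    feats = []
    for i in range(1 + (len(sig) - flen) // fshift):
        frame = sig[i * fshift : i * fshift + flen] * win
        power = np.abs(rfft(frame)) ** 2                # Fourier analysis
        logmel = np.log(fb @ power + 1e-10)             # mel filterbank
        ceps = dct(logmel, type=2, norm='ortho')[1 : n_ceps + 1]
        energy = np.log(np.sum(frame ** 2) + 1e-10)     # frame energy term
        feats.append(np.concatenate(([energy], ceps)))
    return np.asarray(feats)  # first/second derivatives are appended afterwards
```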
7
FRONT END PROPOSALS
QIO FRONT END
Qualcomm, ICSI, OGI (QIO) front end
  • 10 ms frame duration
  • 25 ms analysis window
  • 15 RASTA-like filtered cepstral coefficients (filter sketched below)
  • MLP-based VAD
  • Mean and variance normalization
  • First and second derivatives

[Block diagram: Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → DCT → Mean/Variance Normalization, with an MLP-based VAD]
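Two of the listed components lend themselves to a compact sketch: the classic RASTA band-pass filter applied along each feature trajectory, and per-utterance mean/variance normalization. The RASTA coefficients below are the widely published ones (Hermansky and Morgan); whether the QIO front end uses exactly these values is not stated here.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_feats):
    """Classic RASTA band-pass filter applied along the time axis.

    Suppresses slowly varying (channel) and very rapidly varying
    components of each log-spectral/cepstral trajectory.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_feats, axis=0)

def mvn(feats, eps=1e-10):
    """Per-utterance mean and variance normalization.

    `feats` is a (frames x dims) array; each dimension is shifted to
    zero mean and scaled to unit variance, removing stationary channel
    effects and reducing additive-noise mismatch.
    """
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
```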
8
FRONT END PROPOSALS
MFA FRONT END
  • 10 ms frame duration
  • 25 ms analysis window
  • Mel-warped Wiener filter based noise reduction (gain rule sketched below)
  • Energy-based VADNest
  • Waveform processing to enhance SNR
  • Weighted log-energy
  • 12 cepstral coefficients
  • Blind equalization (cepstral domain)
  • VAD based on acceleration of various energy-based measures
  • First and second derivatives
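The core of Wiener-filter noise reduction is a per-frequency gain derived from estimates of the noisy and noise power spectra. The sketch below shows only that gain rule; the actual MFA front end applies a two-stage, mel-warped variant with VADNest-driven noise estimation, and the spectral floor value here is an assumption.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=0.1):
    """Frequency-domain Wiener gain G = S / (S + N), with a spectral floor.

    `noisy_psd` and `noise_psd` are per-frame power spectra; in practice
    the noise estimate comes from averaging over VAD-flagged non-speech
    frames (the role VADNest plays in the MFA front end).
    """
    clean_est = np.maximum(noisy_psd - noise_psd, 0.0)  # spectral subtraction
    gain = clean_est / (clean_est + noise_psd + 1e-10)  # Wiener rule
    return np.maximum(gain, floor)                      # limit musical noise
```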

9
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING
  • Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors.
  • Tuning parameters (the decoder-level ones are combined in the scoring sketch below)
  • State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
  • Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
  • Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
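The last two parameters enter the decoder's path score in the usual way: the language model log-probability is scaled, and a fixed penalty is added per hypothesized word. A hypothetical illustration; the parameter values are placeholders, not the tuned settings from the evaluation.

```python
def path_score(acoustic_logprob, lm_logprob, n_words,
               lm_scale=15.0, word_penalty=-5.0):
    """Combined Viterbi path score for one hypothesis.

    A larger lm_scale trusts the bigram LM more relative to the
    acoustic models; a more negative word_penalty discourages
    insertions (at the cost of more deletions).
    """
    return acoustic_logprob + lm_scale * lm_logprob + n_words * word_penalty
```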

10
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING - QIO
  • Parameter tuning
  • Clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
  • 8 kHz sampling frequency
  • 7.5% relative improvement

11
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING - MFA
  • Parameter tuning
  • Clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
  • 8 kHz sampling frequency
  • 9.4% relative improvement
  • Ranking is still the same (14.9% vs. 12.5% WER; see the note below)!
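For reference, the relative-improvement figures quoted throughout are the reduction in WER divided by the reference WER:

```python
def relative_improvement(wer_ref, wer_new):
    """Relative WER reduction in percent."""
    return 100.0 * (wer_ref - wer_new) / wer_ref

# With the rounded post-tuning WERs above, (14.9 - 12.5) / 14.9 is
# roughly 16% relative; the 15.8% figure on the next slide presumably
# comes from the unrounded error rates.
```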

12
EXPERIMENTAL RESULTS
COMPARISON OF TUNING
  • Same ranking: the relative performance gap increased from 9.6% to 15.8%
  • On TS1, the MFA front end is significantly better on all 14 test sets (MAPSSWE, p < 0.1; see the sketch below)
  • On TS2, the MFA front end is significantly better only on test sets 5 and 14
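NIST's MAPSSWE test compares two systems on matched segments of the same test data. A simplified sketch, assuming per-segment word-error counts are already available (the real test also specifies how segments are drawn around error regions):

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_test(errors_a, errors_b):
    """Simplified matched-pairs test in the spirit of MAPSSWE.

    `errors_a` / `errors_b` are per-segment word-error counts for the
    two systems on the same segments.  Returns the two-sided p-value
    for the hypothesis that the mean per-segment difference is zero.
    """
    d = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)) + 1e-12)
    return 2.0 * (1.0 - norm.cdf(abs(z)))
```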

13
EXPERIMENTAL RESULTS
MICROPHONE VARIATION
  • Train on the Sennheiser mic., evaluate on the secondary mic.
  • Matched conditions result in optimal performance
  • Significant degradation for all front ends on
    mismatched conditions
  • Both QIO and MFA provide improved robustness
    relative to MFCC baseline

14
EXPERIMENTAL RESULTS
ADDITIVE NOISE
15
SUMMARY AND CONCLUSIONS
WHAT HAVE WE LEARNED?
  • Front end specific parameter tuning did not result in a significant change in overall performance (MFA still outperforms QIO)
  • Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline
  • Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
  • WER is still high (around 35%); further research on noise-robust front ends is needed

16
SUMMARY AND CONCLUSIONS
AVAILABLE RESOURCES
17
SUMMARY AND CONCLUSIONS
BRIEF BIBLIOGRAPHY
  • N. Parihar, "Performance Analysis of Advanced Front Ends," M.S. Dissertation, Mississippi State University, December 2003.
  • N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, "Performance Analysis of the Aurora Large Vocabulary Baseline System," submitted to Eurospeech 2003, Geneva, Switzerland, September 2003.
  • N. Parihar and J. Picone, "DSR Front End LVCSR Evaluation - AU/384/02," Aurora Working Group, European Telecommunications Standards Institute, December 6, 2002.
  • D. Pearce, "Overview of Evaluation Criteria for Advanced Distributed Speech Recognition," ETSI STQ-Aurora DSR Working Group, October 2001.
  • G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task," ETSI STQ-Aurora DSR Working Group, December 2002.
  • ETSI ES 201 108 v1.1.2, "Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," ETSI, April 2000.

18
SUMMARY AND CONCLUSIONS
BIOGRAPHY
  • Naveen Parihar is an M.S. student in Electrical Engineering in the Department of Electrical and Computer Engineering at Mississippi State University. He currently leads the Core Speech Technology team developing a state-of-the-art public-domain speech recognition system. Mr. Parihar's research interests lie in the development of discriminative algorithms for better acoustic modeling and feature extraction. Mr. Parihar is a student member of the IEEE.
  • Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from the Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE and a registered Professional Engineer.