Using Speech Recognition to Predict VoIP Quality

Provided by: henningsc
1
Using Speech Recognition to Predict VoIP Quality
  • Wenyu Jiang
  • IRT Lab
  • April 3, 2002

2
Introduction to Voice Quality
  • Quality factors in Voice over IP (VoIP)
  • Packet loss, delay, and jitter
  • Choice of voice codec
  • Quality metric: Mean Opinion Score (MOS)
  • Widely used
  • Human based
  • Time consuming
  • Labor intensive
  • Results not available in real time

3
Motivation
  • Features of a speech recognizer
  • Automatic speech recognition (ASR), no human
    listeners needed
  • Accuracy of recognition is apparently coupled
    with the quality of input speech
  • Recognition can be done in real-time, allowing
    online quality monitoring.
  • Recognition performance may be related to speech
    intelligibility as well as quality.

4
Related Work
  • ITU-T E-model (G.107/G.108)
  • An analytical model for estimating perceived
    quality
  • Provides loss-to-MOS mapping for some common
    codecs (G.729, G.711, G.723.1).
  • Chernick et al. study speech recognition
    performance with the DoD-CELP codec
  • Effect of bit error rate instead of packet loss
  • Phoneme (instead of word) recognition ratio
  • Some MOS results, but not accurate enough

5
Experiment Setup
  • Speech recognition engine
  • IBM ViaVoice on Linux
  • Wrote software for both voice model training and
    performance testing
  • Training and Testing
  • 2 scripts: Script 1 for training, Script 2 for testing.
  • 2 speakers, A and B; both read both scripts.
  • Script 2 is split into 25 audio clips, with 5
    clips per loss condition (0%, 2%, 5%, 10%, 15%)
  • Codec: G.729
  • Training uses G.729-processed audio

6
Experiment Setup, contd.
  • Performance metric
  • Absolute word recognition ratio Rabs: fraction of
    words recognized correctly
  • Relative word recognition ratio:
    Rrel(p) = Rabs(p) / Rabs(0)
  • p is the packet loss probability
  • MOS listening tests: 22 listeners
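
The two ratios can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the experiments. It assumes Rabs is the fraction of script words recognized correctly and that Rrel(p) = Rabs(p) / Rabs(0), which is consistent with the later lookup of Rabs(0) per speaker. The slides do not say how recognized words were aligned with the script, so the in-order matching via `difflib` and the function names `absolute_ratio` and `relative_ratio` are assumptions.

```python
from difflib import SequenceMatcher


def absolute_ratio(reference: str, recognized: str) -> float:
    """Rabs: fraction of reference words the recognizer got right."""
    ref_words = reference.lower().split()
    hyp_words = recognized.lower().split()
    if not ref_words:
        raise ValueError("reference text is empty")
    # Assumption: a word counts as recognized when it falls inside an
    # in-order matching block between the two word sequences.
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


def relative_ratio(r_abs_at_p: float, r_abs_at_zero: float) -> float:
    """Rrel(p) = Rabs(p) / Rabs(0): normalizes out the speaker's baseline."""
    if r_abs_at_zero <= 0:
        raise ValueError("baseline Rabs(0) must be positive")
    return r_abs_at_p / r_abs_at_zero
```

Dividing by the zero-loss baseline is what lets two speakers with very different absolute accuracy be compared on one scale.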

7
Recognition Ratio vs. MOS
  • Both MOS and Rabs decrease as the loss probability p increases
  • Eliminating the intermediate variable p then gives a
    direct mapping from Rabs to MOS
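
One way to eliminate p is to pair the MOS and the recognition ratio measured at the same loss rate and keep only the pairs. The sketch below does that and then predicts MOS for a new ratio by piecewise-linear interpolation. The interpolation step, the clamping at the ends, and the names `build_mapping` and `predict_mos` are assumptions for illustration; the slides do not state how the mapping curve was obtained, and no measured values are reproduced here.

```python
from bisect import bisect_left


def build_mapping(mos_by_loss, ratio_by_loss):
    """Pair MOS and recognition ratio measured at the same loss rate p,
    then drop p, leaving (ratio, MOS) points sorted by ratio."""
    shared = sorted(set(mos_by_loss) & set(ratio_by_loss))
    return sorted((ratio_by_loss[p], mos_by_loss[p]) for p in shared)


def predict_mos(mapping, ratio):
    """Piecewise-linear interpolation over the (ratio, MOS) points,
    clamped to the end points outside the measured range."""
    ratios = [r for r, _ in mapping]
    if ratio <= ratios[0]:
        return mapping[0][1]
    if ratio >= ratios[-1]:
        return mapping[-1][1]
    # First point whose ratio is >= the query; its left neighbour is
    # strictly smaller, so the denominator below is never zero.
    i = bisect_left(ratios, ratio)
    (r0, m0), (r1, m1) = mapping[i - 1], mapping[i]
    return m0 + (m1 - m0) * (ratio - r0) / (r1 - r0)
```

Both inputs to `build_mapping` are dictionaries keyed by loss probability, so only loss conditions present in both measurements contribute a point.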

8
Properties of ASR Performance
  • When loss probability is low
  • Recognition ratio changes slowly
  • Possibly due to robustness in ViaVoice
  • MOS prediction is less accurate in this range
  • Importance of voice training method
  • Training audio should use the same codec as testing audio

9
Speaker Dependence in ASR
  • ViaVoice SDK cites 90% accuracy for
  • An average speaker without a heavy accent
  • Sampling at 22 kHz, PCM linear-16
  • For speaker A, we achieved
  • About 42% accuracy with no packet loss
  • Reasons
  • 8 kHz sampling and G.729 compression
  • Accent and talking speed
  • This does not interfere with MOS prediction, but we
    need to check for speaker dependence

10
Speaker Dependence Check
  • Absolute recognition ratio is dependent on the speaker
  • 70% for speaker B, but 42% for speaker A
  • But the relative recognition ratio Rrel is
    universal and speaker-independent

11
Rrel as Universal MOS Predictor
  • Mapping from relative recognition ratio Rrel to
    MOS

12
Human Recognition Results
  • Listeners are asked to transcribe what they hear
    in addition to MOS grading.
  • Human recognition result curves are less smooth
    than MOS curves.

13
Human Results, contd.
  • Two flat regions in loss-human curve
  • 2-5% loss (some loss but not very high)
  • 10-15% loss (loss is already too high)
  • Mapping between machine and human recognition
    performance

14
Application Scenarios
  • Sender transmits a pre-recorded audio clip of a
    speaker known to receiver.
  • Receiver does the following
  • Looks up Rabs(0) for this speaker
  • Performs speech recognition
  • Compares the result with the original text and computes Rrel
  • No need to store the original audio clip
  • Just the text is sufficient, so less storage is needed
  • Need not know packet loss probability
  • Suitable for e2e black-box measurements
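
The receiver-side steps can be sketched as a single function. This is an illustration under stated assumptions, not the presenter's implementation: `recognize` stands in for whatever speech engine is available (it is a hypothetical callable, not a ViaVoice API), `baseline_rabs` is a hypothetical per-speaker table of Rabs(0), and the in-order word matching via `difflib` is an assumed scoring rule.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def measure_rrel(
    audio: bytes,
    speaker: str,
    original_text: str,
    baseline_rabs: Dict[str, float],
    recognize: Callable[[bytes], str],
) -> float:
    """Return Rrel for one received clip of a known speaker."""
    # Step 1: look up Rabs(0), the speaker's zero-loss baseline.
    r_abs_zero = baseline_rabs[speaker]
    # Step 2: run speech recognition on the received audio.
    recognized = recognize(audio)
    # Step 3: compare with the stored text (no original audio needed).
    ref = original_text.lower().split()
    hyp = recognized.lower().split()
    blocks = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_matching_blocks()
    r_abs = sum(block.size for block in blocks) / len(ref)
    return r_abs / r_abs_zero
```

The function never takes the packet loss probability as an input, which matches the black-box, end-to-end use described above.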

15
Conclusions
  • Evaluation of speech recognition performance as a
    MOS predictor
  • Used ViaVoice speech engine
  • Performance metric: word recognition ratio
  • The relative word recognition ratio is a
    universal, speaker-independent metric
  • Also analyzed human recognition performance
  • Future work: evaluate other codecs, e.g., G.726,
    GSM.