On Improvement of CI-based GMM Selection in Sphinx 3
Arthur Chan, Ravishankar Mosur, Alexander Rudnicky
Computer Science Department, Carnegie Mellon University
- CMU Sphinx is an open source speech recognition system.
- Recent development (Sphinx 3.6) has focused on building a real-time continuous HMM system and speaker adaptation.
- In this work, we describe improvements to GMM computation that reduce computation in the Viterbi search by 10-30% on different tasks.
- The algorithms are freely available at www.cmusphinx.org.
Context-Independent Senone-Based GMM Selection (CIGMMS)
Three Enhancements
1. Bound the number of CD GMMs to be computed per frame.
2. When the best Gaussian index (BGI) of the previous frame is available and the CD senone is outside the beam, compute the CD GMM score using only the previous frame's BGI. Motivation: the current BGI is a good approximation of the GMM score, and the previous BGI is a good approximation of the current BGI.
3. Use a tightened CI beam size every N frames. Motivation: similar to dropping senone computation every N frames and reusing previous-frame scores (Chan 2004), which significantly reduced computation but impacted accuracy. Narrowing the CI beam size every N frames still preserves the very best scoring senones in the current frame, which improves accuracy, and using a tightening factor provides more flexible control.
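A minimal Python sketch of how these three enhancements could sit on top of the per-frame selection loop. The function names (eval_full, eval_with_bgi), the -1 convention for a missing BGI, and the parameter names are illustrative assumptions, not Sphinx 3's actual interfaces.

```python
import numpy as np

def enhanced_cigmms(frame, ci_scores, cd_to_ci, eval_full, eval_with_bgi,
                    prev_bgi, beam_width, max_cd, tighten_every=4,
                    tighten_factor=0.5):
    """CIGMMS selection loop with the three enhancements (sketch):
    1. bound the number of CD GMMs evaluated in full (max_cd),
    2. for out-of-beam CD senones, reuse the previous frame's best
       Gaussian index (prev_bgi) instead of backing off to the CI score,
    3. every `tighten_every` frames, shrink the CI beam by
       `tighten_factor` so only the very best CI senones admit full
       CD evaluation.
    """
    width = beam_width
    if frame % tighten_every == 0:          # Enhancement 3: tightened CI beam
        width *= tighten_factor
    threshold = ci_scores.max() - width     # CI beam threshold (log domain)

    cd_scores = np.empty(len(cd_to_ci))
    bgi = np.array(prev_bgi)                # carry BGIs forward by default
    full_evals = 0
    for s, ci in enumerate(cd_to_ci):
        if ci_scores[ci] >= threshold and full_evals < max_cd:
            # Enhancement 1: full CD evaluation, bounded by max_cd
            cd_scores[s], bgi[s] = eval_full(s)
            full_evals += 1
        elif prev_bgi[s] >= 0:
            # Enhancement 2: score only the previous frame's best Gaussian
            cd_scores[s] = eval_with_bgi(s, prev_bgi[s])
        else:
            cd_scores[s] = ci_scores[ci]    # fall back to the CI score
    return cd_scores, bgi
```

Here max_cd and tighten_factor play the roles of the bound and tightening factor described above, and a value of -1 in prev_bgi marks a senone with no stored BGI.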
Summary: Technique for Gaussian Computation Speed-Up (Lee 2001, Chan 2004)
Idea: use the CI senone score as an approximate score.
Procedure:
1. Compute all CI scores and form a beam (the CI beam) from the highest score.
2. For all CD senones:
   a. If the base CI score is within the beam, compute the detailed CD score.
   b. Else, back off to the CI score.
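A minimal Python sketch of this baseline procedure. The names (eval_cd_gmm, cd_to_ci) and the data layout are illustrative assumptions rather than Sphinx 3's actual API; the expensive CD GMM evaluation is left as a callable.

```python
import numpy as np

def basic_cigmms(ci_scores, cd_to_ci, eval_cd_gmm, beam_width):
    """Baseline CIGMMS: compute a CD senone score in detail only if its
    parent CI senone falls within a beam of the best CI score;
    otherwise back off to the CI score itself.

    ci_scores   : array of CI senone log-scores for the current frame
    cd_to_ci    : cd_to_ci[s] = index of the CI senone backing CD senone s
    eval_cd_gmm : callable(s) -> detailed CD GMM log-score (expensive)
    beam_width  : positive log-domain beam width
    """
    threshold = ci_scores.max() - beam_width      # CI beam threshold
    cd_scores = np.empty(len(cd_to_ci))
    for s, ci in enumerate(cd_to_ci):
        if ci_scores[ci] >= threshold:            # inside the CI beam
            cd_scores[s] = eval_cd_gmm(s)         # detailed CD computation
        else:                                     # outside the beam
            cd_scores[s] = ci_scores[ci]          # back off to CI score
    return cd_scores
```

A wider beam_width admits more detailed CD evaluations; a narrower one trades accuracy for speed, which is the trade-off the enhancements above refine.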
Issues of the Basic CIGMMS
Issue 1: Unpredictable per-frame performance. With beam search, the number of CD scores computed varies a great deal from frame to frame.
Issue 2: Poor pruning characteristics. A large number of CD scores fall back to the same CI score, so pruning becomes less effective.
Experimental Results
Task (vocab. size) | Comm. (2k)         | WSJ5k (5k)        | ICSI (11k)
Baseline           | 12.85 WER, 0.89xRT | 6.73 WER, 0.64xRT | 34.45 WER, 1.10xRT
Method 1           | 12.84 WER, 0.73xRT | 6.73 WER, 0.64xRT | 35.35 WER, 0.93xRT
Method 2           | 12.84 WER, 0.64xRT | 6.73 WER, 0.63xRT | 35.35 WER, 0.88xRT
Method 3           | 13.11 WER, 0.56xRT | 6.90 WER, 0.59xRT | 36.43 WER, 0.73xRT
Table: Word error rates (%) and execution times.
Assumptions in Enhancement 2
BGIs in adjacent frames are usually the same. But how often? (It depends on the GMM size.)
GMM size | 1   | 2    | 4    | 8    | 16   | 32
Comm.    | 100 | 93.1 | 88.2 | 84.5 | 80.7 | 76.2
WSJ5K    | 100 | 90.7 | 84.7 | 80.3 | N/A  | N/A
ICSI     | 100 | 89.5 | 82.6 | 77.3 | 71.7 | 64.6
Table: Percentages of adjacent-frame BGIs that are the same.
Conclusions: Adjacent BGIs are quite consistent (even in noisy tasks), but less consistent for the top-scoring senones (not shown in the table), which leads to Enhancement 3.
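The table above could be produced by a measurement of the following kind (an illustrative Python sketch, not the actual scripts behind these numbers): for each senone, compare its best Gaussian index in each frame with its index in the previous frame.

```python
import numpy as np

def adjacent_bgi_agreement(bgi_per_frame):
    """bgi_per_frame: array of shape (n_frames, n_senones) holding the
    best Gaussian index of each senone at each frame.  Returns the
    percentage of (frame, senone) pairs whose BGI equals the BGI of
    the same senone in the previous frame."""
    bgi = np.asarray(bgi_per_frame)
    same = bgi[1:] == bgi[:-1]      # compare each frame with the previous one
    return 100.0 * same.mean()

# Tiny usage example: with 1-Gaussian GMMs every BGI is 0, so agreement is 100%.
print(adjacent_bgi_agreement(np.zeros((10, 5), dtype=int)))   # -> 100.0
```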
Summary: Cumulative speedup of up to 37% with only a slight increase in WER.