Title: Juicer: A weighted finite-state transducer speech decoder
1. Juicer: A weighted finite-state transducer speech decoder
- D. Moore (1), J. Dines (1), M. Magimai Doss (1), J. Vepa (1), O. Cheng (1) and T. Hain (2)
- (1) IDIAP Research Institute
- (2) Department of Computer Science, University of Sheffield
2. Overview
- The speech decoding problem
- Why develop another decoder?
- WFST theory and practice
- What is Juicer?
- Benchmarking experiments
- The future of Juicer
3. The speech decoding problem
- Given a recording and models of speech and language, generate a text transcription of what was said
[Diagram: the recording and the models feed into the decoder, which outputs the transcription "She had your dark suit."]
6. The speech decoding problem
- ASR system building blocks
  - Grammar: N-gram language model
  - Lexical knowledge: pronunciation dictionary
  - Phonetic knowledge: context dependency, phonological rules
  - Acoustic knowledge: state distributions
Naive combination of these knowledge sources leads to a large, inefficient representation of the search space
7. The speech decoding problem
- The main issue in decoding is carrying out an efficient search of the space defined by the knowledge sources
- Two ways we can do this:
  - Avoid performing redundant search
  - Don't pursue unpromising hypotheses
- An additional issue: flexibility of the decoder
8. Why develop another decoder?
- Need for a state-of-the-art speech decoder that is also suitable for ongoing research
- At present, such software is not freely available to the research community
- Open-source development and distribution framework
9. WFST theory and practice
- A WFST maps sequences of input symbols to sequences of output symbols
- Each transition's input:output label pair has an associated weight
- In the example (transducer diagram not shown): input sequence I = a b c d maps to output sequence O = X Y Z W, with the path weight a function of all transition weights along that path, f(0.1, 0.2, 0.5, 0.1)
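To make this concrete, here is a minimal sketch in Python (illustrative only, not Juicer code) of a WFST stored as a transition table. It assumes the tropical semiring, so the path weight f is simply the sum of transition weights; states, labels and weights are taken from the example above.

```python
# Minimal WFST sketch: states are integers, and each (state, input
# label) pair indexes one transition. Weights combine by addition
# (tropical semiring); this mirrors the a b c d -> X Y Z W example.
arcs = {
    (0, "a"): ("X", 0.1, 1),  # (output label, weight, next state)
    (1, "b"): ("Y", 0.2, 2),
    (2, "c"): ("Z", 0.5, 3),
    (3, "d"): ("W", 0.1, 4),
}

def transduce(arcs, inputs, start=0):
    """Map an input sequence to its output sequence and path weight."""
    state, outputs, weight = start, [], 0.0
    for symbol in inputs:
        out, w, state = arcs[(state, symbol)]
        outputs.append(out)
        weight += w
    return outputs, weight

print(transduce(arcs, ["a", "b", "c", "d"]))
# -> (['X', 'Y', 'Z', 'W'], 0.9) up to floating-point rounding
```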
10. WFST theory and practice: WFST operations
- Composition: combination of transducers (see the sketch after this list)
- Determinisation: at most one transition per input label leaving each state
- Minimisation: least number of states and transitions
- Weight pushing to aid in minimisation
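As a rough sketch of the first of these operations (my own illustration, not how the 3rd-party toolkits implement it): composing two epsilon-free WFSTs pairs up their states and matches the output labels of the first against the input labels of the second, adding the weights.

```python
def compose(arcs_a, arcs_b, start=(0, 0)):
    """Naive composition of two epsilon-free WFSTs.

    Each transducer is a list of arcs (src, in, out, weight, dst).
    A composed state is a pair (state in A, state in B); an arc of
    the result reads A's input label and emits B's output label.
    """
    result, frontier, seen = [], [start], {start}
    while frontier:
        sa, sb = frontier.pop()
        for pa, ia, oa, wa, qa in arcs_a:
            if pa != sa:
                continue
            for pb, ib, ob, wb, qb in arcs_b:
                if pb != sb or ib != oa:
                    continue
                dst = (qa, qb)
                result.append(((sa, sb), ia, ob, wa + wb, dst))
                if dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return result
```

Real toolkits add epsilon handling, lazy expansion and semiring generality on top of this core loop.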
11. WFST theory and practice: Composition
12. WFST theory and practice: Determinisation
13. WFST theory and practice: Weight pushing and minimisation
14. WFST theory and practice: WFSTs and speech decoding
- ASR system building blocks
  - Grammar
  - Lexical knowledge
  - Phonetic knowledge
  - Acoustic knowledge
- Each of these knowledge sources has a WFST representation
15. WFST theory and practice: WFSTs and speech decoding
- Requires some special considerations
- The composition of lexicon and grammar cannot be determinised as-is, nor can the context dependency transducer; auxiliary disambiguation symbols are needed, for example when two distinct words share the same pronunciation
- The network is built as C ∘ det(L ∘ G), where L, G and C are the WFSTs for the lexicon, grammar and context dependency
16. WFST theory and practice: WFSTs and speech decoding
- Pros
  - Flexibility
  - Simple decoder architecture
  - Optimised search space
- Cons
  - Transducer size
  - Knowledge sources are fixed during composition
  - Only knowledge sources expressible as WFSTs can be used
17. What is Juicer?
- A time-synchronous Viterbi decoder (sketched below)
- Tools for WFST construction
- An interface between 3rd-party FSM toolkits
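A rough illustration of the first bullet (a sketch under simplifying assumptions, not Juicer's actual implementation): time-synchronous Viterbi search advances every active network state by one frame at a time, combining each arc's transducer weight with the acoustic score of the state distribution attached to it. The arc structure and pdf_id naming below are hypothetical.

```python
import math

def viterbi_step(active, arcs, frame_scores):
    """Advance all active hypotheses by one frame.

    active:       dict state -> best cost (negative log score) so far
    arcs:         dict state -> list of (weight, next_state, pdf_id)
    frame_scores: dict pdf_id -> acoustic cost for the current frame
    """
    nxt = {}
    for state, cost in active.items():
        for weight, dst, pdf_id in arcs.get(state, []):
            new_cost = cost + weight + frame_scores[pdf_id]
            if new_cost < nxt.get(dst, math.inf):  # keep the best path
                nxt[dst] = new_cost
    return nxt
```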
18. What is Juicer? Decoder
- Pruning: beam search, histogram pruning (see the sketch below)
- 1-best output: word and model timing information
- Lattice generation: phone-level lattice output
- The state-to-phone transducer is not optimised; it is incorporated at run time
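To make the pruning bullet concrete, here is a minimal sketch of beam and histogram pruning applied to the active hypotheses between frames (again illustrative; the beam width and max_active defaults are placeholder values, not Juicer's):

```python
import heapq

def prune(active, beam=200.0, max_active=7000):
    """Prune active hypotheses (dict state -> cost, lower is better).

    Beam pruning discards states whose cost exceeds the best cost
    plus the beam; histogram pruning then caps how many states
    survive, keeping only the best-scoring ones.
    """
    if not active:
        return active
    best = min(active.values())
    survivors = {s: c for s, c in active.items() if c <= best + beam}
    if len(survivors) > max_active:
        kept = heapq.nsmallest(max_active, survivors.items(),
                               key=lambda item: item[1])
        survivors = dict(kept)
    return survivors
```

In a time-synchronous loop this would be applied to the hypothesis set returned by each viterbi_step call.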
19. What is Juicer? WFST tools
- gramgen: word-loop, word-pair and N-gram language models
- lexgen: multiple pronunciations
- cdgen: monophone, word-internal n-phone and cross-word triphone models; HTK CDHMM and hybrid HMM/ANN model support
- build-wfst: composition, determinisation and minimisation using 3rd-party tools (AT&T, MIT)
20. Benchmarking experiments
- Experiments were conducted in order to:
  - compare with existing state-of-the-art decoders
  - assess the current capabilities and limitations of the decoder
  - guide future development and research directions
21. Benchmarking experiments: 20k Wall Street Journal task
- Equivalent performance at wide beam settings
- HDecode wins out at narrow beam widths
- This is only part of the story
22. Benchmarking experiments: but what's the catch?
- Composition of large static networks:
  - can be practically infeasible due to memory limitations
  - is slow
  - may not always be necessary
23. Benchmarking experiments: AMI Meeting Room Recogniser
- Decoding for the NIST Rich Transcription evaluations
- Juicer uses pruned LMs
- Good trade-off between RTF and WER at the chosen operating point
24. The future of Juicer
- Further benchmarking
  - Testing against HDecode
  - Trade-off between pruned LMs and performance
- Added capabilities
  - On-the-fly network expansion
  - Word lattice generation
  - Support for MLLR transforms, feature transforms
- Distribution and support
  - Currently only available to AMI and IM2 partners
25. Summary
- Today I have presented:
  - WFST theory and practice
  - the Juicer tools and decoder
  - preliminary experiments
- But more importantly, we hope to have generated interest in Juicer