Title: SPEECH RECOGNITION SEARCH
1 SPEECH RECOGNITION SEARCH
2 Bayesian Recognition including acoustic and linguistic information sources
- W : word sequence
- x_1^T : feature vector sequence
- recognized word sequence: W* = argmax_W P(W | x_1^T) = argmax_W P(x_1^T | W) P(W)
3 STRUCTURE for ISOLATED WORD SYSTEMS
Language model acts as postprocessor on acoustic match
- Word likelihoods: P(w_1 | x_1^T), P(w_2 | x_1^T), ..., P(w_M | x_1^T)
- HMM scores: P(x_1^T | w_1), P(x_1^T | w_2), ..., P(x_1^T | w_M)
4 STRUCTURE for LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
Acoustic model and language model drive an integrated search process
[Block diagram: speech signal s(t) → Feature Extraction → feature sequence x_1^T → Large Vocabulary Search Engine, driven by the Acoustic Model and the Language Model → ranked sentence likelihoods P(S_1 | x_1^T), P(S_2 | x_1^T), ...]
5 DECODING FOR SMALL VOCABULARY SYSTEMS WITH SHORT-SPAN LMs
6 Search Space for IWR with full word HMMs
7 Search Space for IWR with full word HMMs
- search space is defined by
- - all states in all words, as derived from the (word/phonetic) lexicon
- - a few additional states for marking beginning and end (of sentence)
- - additional connections for feeding a priori word probabilities (LM information)
8 Time Synchronous Viterbi Search
- at each time t
- - compute a new set of likelihoods P(s | x_0 ... x_t) for all states s in the search space (see the Viterbi sketch below)
- - the new likelihoods are based on the values at time t-1 and the new observation x_t
[Figure: trellis with observations x_{t-1}, x_t along the time axis and the states of the search space along the other axis]
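A minimal sketch of such a time-synchronous Viterbi update in Python/NumPy, assuming a log-domain HMM with a transition matrix and per-frame emission scores (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def viterbi_decode(log_A, emission_logprobs, log_pi):
    """log_A[i, j]      : log P(state j | state i)
       emission_logprobs: (T, N) log P(x_t | state j) for each frame t
       log_pi           : (N,)  log prior of state j at t = 0"""
    T, N = emission_logprobs.shape
    score = log_pi + emission_logprobs[0]      # best log score per state
    backptr = np.zeros((T, N), dtype=int)      # predecessor for backtracking
    for t in range(1, T):
        # for every target state, pick the best predecessor at time t-1
        cand = score[:, None] + log_A          # (N, N): predecessor x target
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(N)] + emission_logprobs[t]
    # backtrack the single best state sequence
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(np.max(score))
```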
9 Time Synchronous Recognizer with pruning
- limit the number of states considered for computation to a set of active states
- active states at time t are those that can be reached from the active states at time t-1
- after computation of likelihoods at time t
- - rank all active states according to likelihood
- - PRUNE away the less likely states (see the pruning sketch below)
- if there are no active states left in a word, abandon the word as a whole
[Figure: example in which word3 has no active states left and is abandoned]
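A minimal sketch of the pruning step, assuming the active states are held in a dictionary from state id to the best partial-path log score (names are illustrative; a real decoder stores much richer information per state):

```python
def prune_active_states(active, beam=200.0, max_active=10_000):
    """active: dict mapping state id -> log score of the best partial path."""
    if not active:
        return active
    best = max(active.values())
    # beam pruning: drop states whose score falls too far below the best one
    survivors = {s: v for s, v in active.items() if v >= best - beam}
    # rank (histogram) pruning: keep at most max_active states
    if len(survivors) > max_active:
        kept = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:max_active]
        survivors = dict(kept)
    return survivors
```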
10 Continuous Word Recognition
[Figure: continuous word recognition search space with word models connected between sentence start node S and end node E]
11 Continuous Word Recognition
- words are expanded into their states as in isolated word recognition
- word connections contain word transition probabilities (limited to bigram probabilities)
12 One-Pass Dynamic Programming
- Use standard trellis
- Allow transitions from word ends to word starts where the LM allows (see the word-transition sketch below)
- Use standard Viterbi beam search
- Recognition of the word string
- - by backtracking (requires maintenance of the full search history)
- - by adding word history as extra information into the nodes
- - a decision is made on the best possible history at start-up time of a new word
- - intrinsic bigram limitation
- Conclusion
- - well suited for small and medium vocabulary tasks with simple language models
- - not suited for large vocabulary recognition with complex language models
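A minimal sketch of the word-transition step in one-pass DP decoding, assuming each word HMM exposes a final state with a best log score at the current frame (the data layout and names are hypothetical):

```python
import math

def word_transitions(word_end_scores, bigram_logprob, words):
    """word_end_scores: dict word -> log score of the best path ending in that
       word's final state at the current frame.
       bigram_logprob(prev, nxt): log P(nxt | prev) from the language model.
       Returns, for each word, the best score and predecessor with which it
       may be (re)started at the next frame."""
    start_scores = {}
    for nxt in words:
        best_score, best_prev = -math.inf, None
        for prev, score in word_end_scores.items():
            cand = score + bigram_logprob(prev, nxt)   # acoustic + LM score
            if cand > best_score:
                best_score, best_prev = cand, prev
        # only the single best predecessor survives: the intrinsic bigram
        # limitation mentioned on the slide
        start_scores[nxt] = (best_score, best_prev)
    return start_scores
```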
13 DECODING FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
14 Search for Phoneme Based Recognizers
[Block diagram: speech signal → Feature Extraction → sequence of observations → Acoustic Matching by Dynamic Programming (using the Acoustic Models) → phoneme hypotheses → Large Vocabulary Search (using the Linguistic Models) → ranked list of recognized words and sentences, e.g. 0.0 recognize speech, -6.3 wreck a nice beach, -8.8 recognize beach, -9.7 wreck a nice peach]
15 Search for Phoneme Based Recognizers
- Output of Acoustic Match: dense phoneme graph
- - many phoneme hypotheses in parallel need to be considered
- - begin- and end-times of competing hypotheses may not coincide
- Linguistic Match: matches the phoneme hypotheses against all sentence hypotheses suggested by
- - phonetic lexicon
- - language model
- ISSUES
- - the number of possible sentence hypotheses is huge (infinite)
- - the phonetic lexicon may not provide enough pronunciation variants
- SOLUTION
- - prototypical systems integrate acoustic match and linguistic match into a single search that tries to find the most likely sentence for a stream of features
- - sentence hypotheses are built up word by word, i.e. new words are hypothesized when a word end state is reached (see the sketch below)
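A minimal sketch of the data objects involved, under the assumption that phoneme hypotheses carry begin/end frames and sentence hypotheses are extended by one word at a time (field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class PhonemeHypothesis:
    phoneme: str
    begin_frame: int      # begin/end times of competitors need not coincide
    end_frame: int
    acoustic_logprob: float

@dataclass
class SentenceHypothesis:
    words: list = field(default_factory=list)   # words recognized so far
    end_frame: int = 0
    logprob: float = 0.0                         # combined acoustic + LM score

    def extend(self, word: str, end_frame: int, delta_logprob: float):
        """Return a new hypothesis with one more word appended."""
        return SentenceHypothesis(self.words + [word], end_frame,
                                  self.logprob + delta_logprob)
```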
16 Search in Large Vocabulary Continuous Recognition
- GOAL: find the sentence with the highest likelihood, given the observed features
- REALITY: the number of possible sentences is so huge that only a fraction of all hypotheses can be evaluated
- SOLUTION: quickly select those few hypotheses that seem to be likely candidates to become winners
17 Issues in Large Vocabulary Continuous Recognition
- SEARCH STRATEGY
- - ORDER in which different hypotheses are evaluated
- - BREADTH FIRST: TIME SYNCHRONOUS VITERBI BEAM SEARCH
- - BEST FIRST (A*): STACK DECODING
- PRUNING STRATEGY
- - pruning parameters and criteria: beam threshold, number of hypotheses to be considered
- - is there a chance that the best path gets pruned away?
- DATA REPRESENTATION
- - ACOUSTIC MODEL
- - LEXICON
- - LANGUAGE MODEL
- - SEARCH HYPOTHESIS
18 Lexical Trees for large vocabulary tasks
- Flat (linear) dictionary
- - each word is an entry by itself
- - full computation for each word
- - computation proportional to Nwords x avg nbr of phonemes per word
- - computation proportional to Nwords
- Tree structured dictionaries (see the sketch below)
- - organize the lexicon as a tree
- - computation is shared between words with the same initial set of phonemes
- - total number of nodes in the search network is much smaller than for linearly organized lexicons
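A minimal sketch of building such a lexical (prefix) tree over phoneme sequences, assuming a toy pronunciation lexicon (a real tree would also carry HMM state information at each node):

```python
def build_lexical_tree(lexicon):
    """lexicon: dict word -> list of phonemes.
       Returns a nested dict: phoneme -> subtree, with word identities stored
       at the nodes where a pronunciation ends."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})            # shared prefixes reuse nodes
        node.setdefault("#words", []).append(word)    # word-end marker
    return root

lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "speed":  ["s", "p", "iy", "d"],
    "beach":  ["b", "iy", "ch"],
}
tree = build_lexical_tree(lexicon)
# "speech" and "speed" share the nodes for s, p, iy, so the acoustic
# computation up to that point is done only once.
```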
19 Lexical Trees for large vocabulary tasks
20 Dynamic Tree Expansion
- Works with lexical trees
- When an end node of the tree is reached, a new lexical tree can be added to the search space
- Different pruning criteria for internal nodes vs. end nodes
- Serves as the basic algorithm for typical decoders in LVCSR
- By virtue of the sharing of the initial node in a tree, it is not possible to model cross-word coarticulation in a single pass
21 Dynamic Tree Expansion
22 Dynamic Combination of Lexicon and LM
- Maintaining multiple copies of the lexical tree wastes memory.
- Use of a longer span LM is not possible.
- Solution (see the hypothesis sketch below)
- - Reuse a single lexical tree.
- - A hypothesis is the combination of a lexicon position and a language model history.
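A minimal sketch of such a hypothesis object, assuming the shared lexical tree is addressed by node ids and the LM history is the last n-1 words of an n-gram model (names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Hypothesis:
    tree_node: int                # id of the current node in the shared lexical tree
    lm_history: Tuple[str, ...]   # e.g. last n-1 words for an n-gram LM
    # frozen + hashable: hypotheses with the same (node, history) pair can be
    # recombined, keeping only the best-scoring one

active: dict[Hypothesis, float] = {}   # hypothesis -> best log score so far
```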
23 Dynamic Combination of Lexicon and LM
24 Multi-pass algorithms
- A number of essential features in LVCSR drastically increase the search space
- - cross-word triphone models
- - long span language models
- A multi-pass algorithm can overcome this problem with the following strategy
- FIRST PASS
- - use simplified assumptions such that an efficient search algorithm can be used to explore the FULL search space
- - use a highly conservative pruning strategy
- - retain possible solutions in a graph or as an N-best list of sentences
- SUBSEQUENT PASS(ES) (see the rescoring sketch below)
- - use detailed acoustic and language models
- - use a different search algorithm (sometimes exhaustive) on the retained graph
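A minimal sketch of a second pass that rescores a first-pass N-best list with a more detailed language model (the score combination and the LM weight value are illustrative assumptions):

```python
def rescore_nbest(nbest, detailed_lm_logprob, lm_weight=10.0):
    """nbest: list of (sentence, acoustic_logprob) pairs from the first pass.
       detailed_lm_logprob(sentence): log probability under the long-span LM.
       Returns the list re-ranked by the combined score."""
    rescored = []
    for sentence, ac_logprob in nbest:
        total = ac_logprob + lm_weight * detailed_lm_logprob(sentence)
        rescored.append((sentence, total))
    return sorted(rescored, key=lambda st: st[1], reverse=True)
```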