Title: SPEECH RECOGNITION SEARCH
1 SPEECH RECOGNITION SEARCH
2 Bayesian Recognition including acoustic and linguistic information sources
- W : word sequence
- x_1^T : feature vector sequence
- recognized word sequence: W* = argmax_W P(W | x_1^T) = argmax_W P(x_1^T | W) P(W)
3 STRUCTURE for ISOLATED WORD SYSTEMS
Language model acts as postprocessor on acoustic match
- Word likelihoods: P(w_1 | x_1^T), P(w_2 | x_1^T), ..., P(w_M | x_1^T)
- HMM scores: P(x_1^T | w_1), P(x_1^T | w_2), ..., P(x_1^T | w_M)
4 STRUCTURE for LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
Acoustic model and language model drive an integrated search process
[Block diagram: speech signal s(t) → Feature Extraction → feature sequence x_1^T → Large Vocabulary Search Engine, driven by the Acoustic Model and the Language Model → ranked sentence likelihoods P(S_1 | x_1^T), P(S_2 | x_1^T), ...]
5 DECODING FOR SMALL VOCABULARY SYSTEMS WITH SHORT-SPAN LMs
6 Search Space for IWR with full word HMMs
7 Search Space for IWR with full word HMMs
- search space is defined by
- - all states in all words, as derived from the (word/phonetic) lexicon
- - a few additional states for marking beginning and end (of sentence)
- - additional connections for feeding a priori word probabilities (LM information)
8 Time Synchronous Viterbi Search
- at each time t
- - compute a new set of likelihoods P(s | x_0 ... x_t) for all states s in the search space (see the Viterbi sketch below)
- - the new likelihoods are based on the values at time t-1 and the new observation x_t
[Figure: trellis with observations x_{t-1}, x_t along the time axis and the states of the search space along the other axis]
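A minimal sketch of such a time-synchronous Viterbi update in Python/NumPy, assuming a log-domain HMM with a transition matrix and per-frame emission scores (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def viterbi_decode(log_A, emission_logprobs, log_pi):
    """log_A[i, j]      : log P(state j | state i)
       emission_logprobs: (T, N) log P(x_t | state j) for each frame t
       log_pi           : (N,)  log prior of state j at t = 0"""
    T, N = emission_logprobs.shape
    score = log_pi + emission_logprobs[0]      # best log score per state
    backptr = np.zeros((T, N), dtype=int)      # predecessor for backtracking
    for t in range(1, T):
        # for every target state, pick the best predecessor at time t-1
        cand = score[:, None] + log_A          # (N, N): predecessor x target
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(N)] + emission_logprobs[t]
    # backtrack the single best state sequence
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path)), float(np.max(score))
```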
9 Time Synchronous Recognizer with pruning
- limit the number of states considered for computation to a set of active states
- active states at time t are those that can be reached from the active states at time t-1
- after computation of likelihoods at time t
- - rank all active states according to likelihood
- - PRUNE away the less likely states (see the pruning sketch below)
- if there are no active states left in a word, abandon the word as a whole
[Figure: example in which word3 has no active states left and is abandoned]
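A minimal sketch of the pruning step, assuming the active states are held in a dictionary from state id to the best partial-path log score (names are illustrative; a real decoder stores much richer information per state):

```python
def prune_active_states(active, beam=200.0, max_active=10_000):
    """active: dict mapping state id -> log score of the best partial path."""
    if not active:
        return active
    best = max(active.values())
    # beam pruning: drop states whose score falls too far below the best one
    survivors = {s: v for s, v in active.items() if v >= best - beam}
    # rank (histogram) pruning: keep at most max_active states
    if len(survivors) > max_active:
        kept = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)[:max_active]
        survivors = dict(kept)
    return survivors
```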
10 Continuous Word Recognition
[Figure: continuous word recognition search space with word models connected between sentence start node S and end node E]
11 Continuous Word Recognition
- words are expanded into their states as in isolated word recognition
- word connections contain word transition probabilities (limited to bigram probabilities)
12 One-Pass Dynamic Programming
- Use standard trellis
- Allow transitions from word ends to word starts where the LM allows (see the word-transition sketch below)
- Use standard Viterbi beam search
- Recognition of the word string
- - by backtracking (requires maintenance of the full search history)
- - by adding word history as extra information into the nodes
- - a decision is made on the best possible history at start-up time of a new word
- - intrinsic bigram limitation
- Conclusion
- - well suited for small and medium vocabulary tasks with simple language models
- - not suited for large vocabulary recognition with complex language models
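A minimal sketch of the word-transition step in one-pass DP decoding, assuming each word HMM exposes a final state with a best log score at the current frame (the data layout and names are hypothetical):

```python
import math

def word_transitions(word_end_scores, bigram_logprob, words):
    """word_end_scores: dict word -> log score of the best path ending in that
       word's final state at the current frame.
       bigram_logprob(prev, nxt): log P(nxt | prev) from the language model.
       Returns, for each word, the best score and predecessor with which it
       may be (re)started at the next frame."""
    start_scores = {}
    for nxt in words:
        best_score, best_prev = -math.inf, None
        for prev, score in word_end_scores.items():
            cand = score + bigram_logprob(prev, nxt)   # acoustic + LM score
            if cand > best_score:
                best_score, best_prev = cand, prev
        # only the single best predecessor survives: the intrinsic bigram
        # limitation mentioned on the slide
        start_scores[nxt] = (best_score, best_prev)
    return start_scores
```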
13 DECODING FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
14 Search for Phoneme Based Recognizers
[Block diagram: speech signal → Feature Extraction → sequence of observations → Acoustic Matching by Dynamic Programming (using the Acoustic Models) → phoneme hypotheses → Large Vocabulary Search (using the Linguistic Models) → ranked list of recognized words and sentences, e.g. 0.0 recognize speech, -6.3 wreck a nice beach, -8.8 recognize beach, -9.7 wreck a nice peach]
15 Search for Phoneme Based Recognizers
- Output of Acoustic Match: dense phoneme graph
- - many phoneme hypotheses in parallel need to be considered
- - begin- and end-times of competing hypotheses may not coincide
- Linguistic Match: matches the phoneme hypotheses against all sentence hypotheses suggested by
- - phonetic lexicon
- - language model
- ISSUES
- - the number of possible sentence hypotheses is huge (infinite)
- - the phonetic lexicon may not provide enough pronunciation variants
- SOLUTION
- - prototypical systems integrate acoustic match and linguistic match into a single search that tries to find the most likely sentence for a stream of features
- - sentence hypotheses are built up word by word, i.e. new words are hypothesized when a word end state is reached (see the sketch below)
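A minimal sketch of the data objects involved, under the assumption that phoneme hypotheses carry begin/end frames and sentence hypotheses are extended by one word at a time (field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class PhonemeHypothesis:
    phoneme: str
    begin_frame: int      # begin/end times of competitors need not coincide
    end_frame: int
    acoustic_logprob: float

@dataclass
class SentenceHypothesis:
    words: list = field(default_factory=list)   # words recognized so far
    end_frame: int = 0
    logprob: float = 0.0                         # combined acoustic + LM score

    def extend(self, word: str, end_frame: int, delta_logprob: float):
        """Return a new hypothesis with one more word appended."""
        return SentenceHypothesis(self.words + [word], end_frame,
                                  self.logprob + delta_logprob)
```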
16 Search in Large Vocabulary Continuous Recognition
- GOAL: find the sentence with the highest likelihood, given the observed features
- REALITY: the number of possible sentences is so huge that only a fraction of all hypotheses can be evaluated
- SOLUTION: quickly select those few hypotheses that seem to be likely candidates to become winners
17 Issues in Large Vocabulary Continuous Recognition
- SEARCH STRATEGY
- - ORDER in which different hypotheses are evaluated
- - BREADTH FIRST: TIME SYNCHRONOUS VITERBI BEAM SEARCH
- - BEST FIRST (A*): STACK DECODING
- PRUNING STRATEGY
- - pruning parameters and criteria: beam threshold, number of hypotheses to be considered
- - is there a chance that the best path gets pruned away?
- DATA REPRESENTATION
- - ACOUSTIC MODEL
- - LEXICON
- - LANGUAGE MODEL
- - SEARCH HYPOTHESIS
18 Lexical Trees for large vocabulary tasks
- Flat (linear) dictionary
- - each word is an entry by itself
- - full computation for each word
- - computation proportional to Nwords x avg nbr of phonemes per word
- - computation proportional to Nwords
- Tree structured dictionaries (see the sketch below)
- - organize the lexicon as a tree
- - computation is shared between words with the same initial set of phonemes
- - total number of nodes in the search network is much smaller than for linearly organized lexicons
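A minimal sketch of building such a lexical (prefix) tree over phoneme sequences, assuming a toy pronunciation lexicon (a real tree would also carry HMM state information at each node):

```python
def build_lexical_tree(lexicon):
    """lexicon: dict word -> list of phonemes.
       Returns a nested dict: phoneme -> subtree, with word identities stored
       at the nodes where a pronunciation ends."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.setdefault(ph, {})            # shared prefixes reuse nodes
        node.setdefault("#words", []).append(word)    # word-end marker
    return root

lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "speed":  ["s", "p", "iy", "d"],
    "beach":  ["b", "iy", "ch"],
}
tree = build_lexical_tree(lexicon)
# "speech" and "speed" share the nodes for s, p, iy, so the acoustic
# computation up to that point is done only once.
```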
19 Lexical Trees for large vocabulary tasks
20 Dynamic Tree Expansion
- Works with lexical trees
- When an end node of the tree is reached, a new lexical tree can be added to the search space
- Different pruning criteria for internal nodes vs. end nodes
- Serves as the basic algorithm for typical decoders in LVCSR
- By virtue of the sharing of the initial node in a tree, it is not possible to model cross-word coarticulation in a single pass
21 Dynamic Tree Expansion
22 Dynamic Combination of Lexicon and LM
- Maintaining multiple copies of the lexical tree wastes memory.
- Use of a longer span LM is not possible.
- Solution (see the hypothesis sketch below)
- - Reuse a single lexical tree.
- - A hypothesis is the combination of a lexicon position and a language model history.
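A minimal sketch of such a hypothesis object, assuming the shared lexical tree is addressed by node ids and the LM history is the last n-1 words of an n-gram model (names are illustrative):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Hypothesis:
    tree_node: int                # id of the current node in the shared lexical tree
    lm_history: Tuple[str, ...]   # e.g. last n-1 words for an n-gram LM
    # frozen + hashable: hypotheses with the same (node, history) pair can be
    # recombined, keeping only the best-scoring one

active: dict[Hypothesis, float] = {}   # hypothesis -> best log score so far
```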
23 Dynamic Combination of Lexicon and LM
24 Multi-pass algorithms
- A number of essential features in LVCSR drastically increase the search space
- - cross-word triphone models
- - long span language models
- A multi-pass algorithm can overcome this problem with the following strategy
- FIRST PASS
- - use simplified assumptions such that an efficient search algorithm can be used to explore the FULL search space
- - use a highly conservative pruning strategy
- - retain possible solutions in a graph or as an N-best list of sentences
- SUBSEQUENT PASS(ES) (see the rescoring sketch below)
- - use detailed acoustic and language models
- - use a different search algorithm (sometimes exhaustive) on the retained graph
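A minimal sketch of a second pass that rescores a first-pass N-best list with a more detailed language model (the score combination and the LM weight value are illustrative assumptions):

```python
def rescore_nbest(nbest, detailed_lm_logprob, lm_weight=10.0):
    """nbest: list of (sentence, acoustic_logprob) pairs from the first pass.
       detailed_lm_logprob(sentence): log probability under the long-span LM.
       Returns the list re-ranked by the combined score."""
    rescored = []
    for sentence, ac_logprob in nbest:
        total = ac_logprob + lm_weight * detailed_lm_logprob(sentence)
        rescored.append((sentence, total))
    return sorted(rescored, key=lambda st: st[1], reverse=True)
```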