Transcript and Presenter's Notes

1
HUMAN AND SYSTEMS ENGINEERING
Confidence Measures Based on Word Posteriors and
Word Graphs
Sridhar Raghavan (graduate research assistant) and
Joseph Picone (Professor, Human and Systems
Engineering), Electrical and Computer Engineering
URL: www.isip.msstate.edu/publications/seminars/msstate/2005/confidence/
2
Abstract
  • Confidence measures using word posteriors
  • There is a strong need to determine the
    confidence of a word hypothesis in LVCSR
    systems, because conventional Viterbi decoding
    generates only the overall one-best sequence,
    while the performance of a speech recognition
    system is measured by word error rate, not
    sentence error rate.
  • A good estimate of the confidence level is the
    word posterior probability.
  • The word posteriors can be computed from a word
    graph.
  • A forward-backward algorithm can be used to
    compute the link posteriors.

3
  • Foundation

The equation for computing the posterior of a
word is as follows [F. Wessel]:
The idea here is to sum up the posterior
probabilities of all the word hypothesis
sequences that contain the word w with the same
start and end times.
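A minimal LaTeX sketch of this definition, assuming
Wessel's notation where [w; s, t] denotes the word
w with start time s and end time t, and x_1^T is
the acoustic observation sequence (these symbols
are assumptions, not shown on the slide):

\[
P([w; s, t] \mid x_1^T) \;=\;
\sum_{\substack{W \;:\; [w; s, t] \,\in\, W}} P(W \mid x_1^T)
\]

where the sum runs over all word sequences W in the
word graph that contain the hypothesis [w; s, t].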
4
  • Foundation continued

We cannot compute the above posterior directly,
so we decompose it into a likelihood and priors
using Bayes' rule.
[Figure: a node N in a word graph with six incoming
and two outgoing links.]
There are 6 different ways to reach the node N
and 2 different ways to leave it, so we need both
the forward probability and the backward
probability to obtain a good estimate of the
probability of passing through N; this is where
the forward-backward algorithm comes into the
picture.
Hence the value in the numerator has to be
computed using the forward-backward algorithm.
The denominator term is simply the sum of the
numerator over all words w occurring in the same
time instant.
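Written out with Bayes' rule, a sketch under the
same assumed notation as above:

\[
P([w; s, t] \mid x_1^T) \;=\;
\frac{\displaystyle\sum_{W \,\ni\, [w; s, t]} P(x_1^T \mid W)\, P(W)}
     {\displaystyle\sum_{W'} P(x_1^T \mid W')\, P(W')}
\]

The numerator is accumulated with the
forward-backward algorithm; the denominator is the
same quantity summed over all competing word
sequences through the graph.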
5
  • Scaling

Scaling is used to obtain a flatter posterior
distribution so that the distribution is not
dominated by the best path [G. Evermann].
Experimentally it has been determined that
1/(language model scale factor) is a good value
for scaling down the acoustic model score. So the
language model scale factor is fixed at unity and
the acoustic scores are scaled down as follows:
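A sketch of this scaling, writing λ for the
language model scale factor (the symbol is an
assumption):

\[
\tilde{p}(x \mid w) \;=\; p(x \mid w)^{1/\lambda}
\]

i.e. the language model scale is kept at unity and
each acoustic likelihood is raised to the power
1/λ before the posteriors are computed.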
6
  • How to combine word-posteriors?
  • The word posteriors corresponding to the
    same word have to be combined in order to obtain
    a confidence estimate. There are several ways to
    do this; some of the methods are as follows:
  • Sum up the posteriors of all the identical words
    that fall within the same time frame (see the
    sketch after the note below).
  • Build a confusion network, where the entire
    lattice is mapped onto a single linear graph
    whose links pass through all the nodes in the
    same order.

[Figure: a full lattice network over the words
this, is, the, a, quest, guest, test, sense,
sentence and sil, and the compacted confusion
network derived from it.]
Note: the redundant silence edges can be fused
together in the full lattice network before
computing the forward-backward probabilities.
This saves a lot of computation if there are
many silence edges in the lattice.
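A minimal sketch of the first combination method
listed above (summing the posteriors of identical
words covering the same time frame). The tuple
layout and frame-level bookkeeping are assumptions
for illustration, not the format of any particular
lattice tool:

```python
from collections import defaultdict

def combine_word_posteriors(links):
    """links: iterable of (word, start_frame, end_frame, posterior)
    tuples taken from the word graph (illustrative field layout).

    Returns a dict mapping (word, frame) -> summed posterior: the
    confidence of `word` at a frame is the sum of the posteriors of
    all links carrying that word which span the frame."""
    confidence = defaultdict(float)
    for word, start, end, posterior in links:
        for frame in range(start, end + 1):
            confidence[(word, frame)] += posterior
    return confidence
```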
7
  • Some challenges during posterior rescoring!

Word posteriors are not a very good estimate of
confidence when the WER on the data is very high.
This is described in the paper by G. Evermann and
P.C. Woodland, "Large Vocabulary Decoding and
Confidence Estimation using Word Posterior
Probabilities". The reason is that the posteriors
are overestimated, since the words in the lattice
are not the full set of possible words; when the
WER is poor, the lattice contains many wrong
hypotheses. In such a case the depth of the
lattice becomes a critical factor in determining
the effectiveness of the confidence measure. The
paper cited above describes two techniques to
address this problem: 1. a decision-tree based
technique, and 2. a neural-network based
technique. Different confidence measure
techniques are judged on a metric known as
normalized cross entropy (NCE).
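For reference, the usual definition of NCE (a
standard formulation from the literature, not
spelled out on the slide): with n hypothesized
words of which n_c are correct, p_c = n_c / n, and
confidence ĉ(w) assigned to each word,

\[
\mathrm{NCE} \;=\;
\frac{H_{\max}
  + \sum_{w\ \text{correct}} \log_2 \hat{c}(w)
  + \sum_{w\ \text{incorrect}} \log_2\bigl(1 - \hat{c}(w)\bigr)}
 {H_{\max}},
\qquad
H_{\max} = -n_c \log_2 p_c - (n - n_c)\log_2(1 - p_c).
\]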
8
  • How can we compute the word posterior from a
    word graph?

A word posterior is a probability that is
computed by considering a word's acoustic score,
its language model score, and its position and
history in a particular path through the word
graph.
An example of a word graph is given below. Note
that the nodes are the start/stop times and the
links are the words. The goal is to determine the
link posterior probabilities. Every link holds
an acoustic score and a language model
probability.
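A minimal sketch of how such a word graph could be
represented in code; the class and field names are
illustrative assumptions, not the lattice format
used by the ISIP system:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Link:
    word: str              # the word carried by this link
    start_node: int        # node index (the word's start time)
    end_node: int          # node index (the word's end time)
    acoustic_score: float  # log acoustic likelihood of the segment
    lm_score: float        # log language model probability

@dataclass
class WordGraph:
    links: List[Link]
    start_node: int        # first node (usually utterance-initial silence)
    final_node: int        # last node (usually utterance-final silence)
```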
9
  • Example

Let us consider the example shown below.
[Figure: example word graph with likelihoods on
the links; nodes outlined in red occur at the same
time instant.]
The values on the links are the likelihoods. Some
nodes are outlined in red to indicate that they
occur at the same time instant.
10
  • Forward-backward algorithm

We will use a forward-backward type algorithm to
determine the link probabilities. The equations
used to compute the alphas and betas for an HMM
are as follows (from any speech textbook).
Computing alphas, Step 1: Initialization.
In a conventional HMM forward-backward algorithm
we would perform the following:
We need to use a slightly modified version of the
above equation for processing a word graph. The
emission probability is the acoustic score, and
the initial probability is taken as 1, since we
always begin with a silence. In our implementation
we simply initialize the first alpha value with a
constant; since we work in the log domain, we
assign the first alpha value as 0.
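As a sketch, the conventional HMM initialization
and the modified word-graph version described
above (standard textbook notation assumed):

\[
\text{HMM: } \alpha_1(i) = \pi_i\, b_i(o_1),
\qquad
\text{word graph: } \alpha(n_{\text{start}}) = 1
\;\;(\log \alpha = 0).
\]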
11
  • Forward-backward algorithm continued

The α for the first node is 1.
Step 2: Induction.
The alpha values computed in the previous step
are used to compute the alphas of the succeeding
nodes. Note: unlike in HMMs, where we move from
left to right at fixed intervals of time, here we
move from one node to the next closest node.
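A sketch of the induction step, with the HMM form
first and the word-graph form used here second;
p_ac is the (scaled) acoustic likelihood of the
word on a link and P_lm its language model
probability (symbols assumed):

\[
\alpha_{t+1}(j) = \Bigl[\sum_i \alpha_t(i)\, a_{ij}\Bigr] b_j(o_{t+1}),
\qquad
\alpha(n) = \sum_{(n' \to n) = w}
\alpha(n')\; p_{\text{ac}}(w)^{1/\lambda}\; P_{\text{lm}}(w).
\]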
12
  • Forward-backward algorithm continued

Let us see the computation of the alphas from
node 2 onwards; the alpha for node 1 was
initialized to 1 in the previous step.
[Figure: word graph fragment for nodes 1-4 (links
sil, this, is) showing link likelihoods such as
2/6, 3/6 and 4/6 and the resulting alpha values,
e.g. α = 1, α = 0.5, α = 0.5025 and
α = 1.675E-03.]
The alpha calculation continues in this manner
for all the remaining nodes.
The forward-backward calculation on word graphs
is similar to the calculation used on HMMs, but
in word graphs the transition matrix is populated
by the language model probabilities and the
emission probability corresponds to the acoustic
score.
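A minimal Python sketch of this forward pass in
the log domain. The link tuple layout and the
topological ordering requirement are assumptions
of this sketch, not taken from the slides:

```python
import math
from collections import defaultdict

def log_add(x, y):
    """Stable log-domain addition: log(e^x + e^y)."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))

def forward(links, start_node):
    """Forward pass over a word graph in the log domain.

    `links` is a list of (from_node, to_node, log_acoustic, log_lm)
    tuples, assumed sorted in topological order.  The LM probability
    plays the role of the HMM transition probability and the scaled
    acoustic score that of the emission probability, as described
    above."""
    log_alpha = defaultdict(lambda: -math.inf)
    log_alpha[start_node] = 0.0  # alpha of the first node is 1, log 1 = 0
    for src, dst, log_ac, log_lm in links:
        log_alpha[dst] = log_add(log_alpha[dst],
                                 log_alpha[src] + log_ac + log_lm)
    return log_alpha
```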
13
  • Forward-backward algorithm continued

Once we have computed the alphas using the
forward algorithm, we begin the beta computation
using the backward algorithm. The backward
algorithm is similar to the forward algorithm,
but we start from the last node and proceed from
right to left.
Step 1: Initialization.
Step 2: Induction.
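A sketch of the corresponding backward equations,
HMM form first and word-graph form second, under
the same assumed notation:

\[
\beta_T(i) = 1,
\quad
\beta_t(i) = \sum_j a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j);
\qquad
\beta(n_{\text{final}}) = 1,
\quad
\beta(n) = \sum_{(n \to n') = w}
p_{\text{ac}}(w)^{1/\lambda}\; P_{\text{lm}}(w)\; \beta(n').
\]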
14
  • Forward-backward algorithm continued

Let us see the computation of the beta values,
working backwards from node 14.
[Figure: word graph fragment for nodes 11-15
(links sil, sense, sentence) showing link
likelihoods such as 1/6, 4/6 and 5/6 and the
resulting beta values, e.g. β = 1, β = 0.833,
β = 0.1667, β = 5.55E-3 and β = 1.66E-3.]
15
  • Forward-backward algorithm continued

Node 11: in a similar manner we obtain the beta
values for all the nodes down to node 1.
We can compute the probabilities on the links
(between two nodes) as follows. Let us call this
link probability G. Then G(t-1, t) is computed as
the product α(t-1) · β(t) · a_ij, where a_ij is
the transition (language model) probability of
the link. These values give the unnormalized
posterior probabilities of the word on the link,
considering all possible paths through the link.
16
  • Word graph showing the computed alphas and betas

This is the word graph with every node annotated
with its corresponding alpha and beta values.
[Figure: word graph annotated with values such as
α = 1.675E-5, β = 4.61E-9 and α = 2.79E-8,
β = 2.766E-6.]
The assumption here is that the probability of
occurrence of any word is 0.01, i.e. a loop
grammar with 100 words.
17
  • Link probabilities calculated from alphas and
    betas

The following word graph shows the links with
their corresponding link posterior probabilities
(not yet normalized).
By choosing the links with the maximum posterior
probability we can be certain that we have
included the most probable words in the final
sequence.
18
  • Normalization

The normalization of the posteriors is performed
by dividing the computed gamma (the numerator) by
the sum of the probabilities of all paths through
the lattice. This normalizing value is a
byproduct of the forward-backward algorithm.
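A sketch of the normalized link posterior,
assuming the sum over all paths is read off as the
alpha of the final node (equivalently the beta of
the start node), with the same assumed symbols as
before:

\[
P([w; s, t] \mid x_1^T) \;=\;
\frac{\alpha(n_s)\; p_{\text{ac}}(w)^{1/\lambda}\;
      P_{\text{lm}}(w)\; \beta(n_t)}
     {\alpha(n_{\text{final}})}.
\]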
19
  • Some Alternate approaches

The normalization of the posteriors is done by
dividing each value by the sum of the posterior
probabilities of all the words in the specific
time instant. This example suffers from the fact
that the lattice is not deep enough, so
normalization may push the values of some of the
links to 1. This is also seen in the example that
follows, and the phenomenon is explained in the
paper by G. Evermann and P.C. Woodland.
The paper by F. Wessel ("Confidence Measures for
Large Vocabulary Continuous Speech Recognition")
describes alternate techniques for computing the
posterior, because the drawback of the approach
described above is that the lattice has to be
very deep to provide sufficient links in the
same time instant. To overcome this problem one
can use a soft time margin instead of a hard
margin, which is achieved by considering words
that overlap to a certain degree. However, the
author states that normalization then no longer
works, since the probabilities are not summed
within the same time frame and hence can total
more than unity. The author therefore suggests
an approach in which the posteriors are computed
frame by frame so that normalization remains
possible. In the end it was found that the
frame-by-frame normalization did not perform
significantly better than the overlapping
time-marks approach.
Instead of using the probabilities directly as
described above, one can use their logarithms so
that the multiplications are converted to
additions. We can also use the acoustic and
language model scores directly from the ASR's
output lattice.
20
  • Logarithmic computations

We can compute the probabilities in the log
domain; the only difference is that we add the
numbers instead of multiplying them. To add two
values in the log domain we use the following log
trick: log(x + y) = log(x) + log(1 + y/x). The
logarithmic alphas and betas are shown at every
node in the lattice below.
[Figure: lattice annotated with log-domain values
such as α = -3.5833, β = -3.5833 and α = -1.7916,
β = -5.375.]
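As a quick worked example with the log-domain
values shown above (natural logarithms assumed):

\[
\log(a + b) = \log a + \log\bigl(1 + e^{\log b - \log a}\bigr)
\approx -1.7916 + \log\bigl(1 + e^{-1.7917}\bigr)
\approx -1.7916 + 0.154 \approx -1.637.
\]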
21
  • Using it on a real application

Using the algorithm in a real application: we
need to perform word spotting without using a
language model, i.e. we can only use a loop
grammar.
In order to spot the word of interest we
construct a loop grammar with just this one word.
The final one-best hypothesis will then consist
of a sequence of the same word repeated N times.
The challenge is to determine which of these N
words actually corresponds to the word of
interest. This is achieved by computing the link
posterior probabilities and selecting the one
with the maximum value.
22
  • 1-best output from the word spotter

The recognizer puts out the following output
(start frame, end frame, word, acoustic score):
0000 0023 !SENT_START -1433.434204
0023 0081 BIG -4029.476440
0081 0176 BIG -6402.677246
0176 0237 BIG -4080.437500
0237 0266 !SENT_END -1861.777344
We have to determine which of the three instances
of the word is actually present.
23
Lattice from one of the utterances
For this example we have to spot the word BIG
in an utterance that consists of three words
(BIG TIED GOD). All the links in the output
lattice contain the word BIG. The values on
the links are the acoustic likelihoods in the log
domain, so the forward-backward computation just
involves adding these numbers in a systematic
manner.
24
  • Alphas and betas for the lattice

The initial probability at both end nodes is 1
(the alpha at the start node and the beta at the
final node), so its logarithmic value is 0. The
language model probability of the word is also 1,
since it is the only word in the loop grammar.
25
  • Link posterior calculation

It is observed that we obtain greater
discrimination between confidence levels if we
also multiply the final probability by the
likelihood of the link itself, in addition to the
corresponding alphas and betas. In this example
we add the likelihood, since we are in the log
domain.
[Figure: word-spotting lattice from sent_start
(node 0) to sent_end (node 8) with log-domain
link posteriors such as G = -17781, -17859,
-17942, -18061, -21382, -25690, -31152 and
-67344 on the links.]
26
  • Inference from the link posteriors

The link from node 1 to node 5 corresponds to the
first word instance in time, while the links from
5 to 6 and 6 to 7 correspond to the second and
third instances respectively. It is very clear
from the link posterior values that the first
instance of the word BIG has a much higher
probability than the other two.
27
  • References
  • F. Wessel, R. Schlüter, K. Macherey, and H. Ney,
    "Confidence Measures for Large Vocabulary
    Continuous Speech Recognition," IEEE Transactions
    on Speech and Audio Processing, vol. 9, no. 3,
    pp. 288-298, March 2001.
  • F. Wessel, K. Macherey, and R. Schlüter, "Using
    Word Probabilities as Confidence Measures," in
    Proc. ICASSP'97.
  • G. Evermann and P.C. Woodland, "Large Vocabulary
    Decoding and Confidence Estimation using Word
    Posterior Probabilities," in Proc. ICASSP 2000,
    pp. 2366-2369, Istanbul.
  • X. Huang, A. Acero, and H.W. Hon, Spoken Language
    Processing - A Guide to Theory, Algorithm, and
    System Development, Prentice Hall, ISBN
    0-13-022616-5, 2001.
  • J. Deller, et al., Discrete-Time Processing of
    Speech Signals, Macmillan Publishing Co., ISBN
    0-7803-5386-2, 2000.