Title: MULTILAYER PERCEPTRONS
1. MULTI-LAYER PERCEPTRONS
[Diagram: an MLP with an input layer, a hidden layer and an output layer, with weights h_km and w_mn between the layers.]
2. [Diagram: forward computation through the net, showing inputs, intermediate values and outputs, with each unit applying a sigmoid function; d_n is the desired output and z_n the actual output of output unit n.]
3. Hybrid MLP-HMM System
With a hybrid system the MLP is used to obtain the observation probability matrix.
Typically the input consists of a vector of spectral features for time t together with four additional vectors on each side, i.e. the feature vectors for times t-40ms, t-30ms, t-20ms, t-10ms, t, t+10ms, t+20ms, t+30ms, t+40ms.
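This context window is straightforward to assemble in code. Below is a minimal sketch, assuming hypothetical 10 ms frames held in a NumPy array `frames` with one row of spectral features per frame; the function name `context_window` and the feature dimensions are illustrative only.

```python
import numpy as np

# Sketch of assembling the MLP input from a context window of spectral frames,
# assuming hypothetical 10 ms frames stored as rows of `frames` (one row per
# frame, one column per spectral feature). All names here are illustrative.
def context_window(frames, t, half_width=4):
    """Stack the frame at time t with half_width frames on each side,
    i.e. t-40ms ... t+40ms at a 10 ms frame rate."""
    T = len(frames)
    window = []
    for offset in range(-half_width, half_width + 1):
        idx = min(max(t + offset, 0), T - 1)   # repeat edge frames at the ends
        window.append(frames[idx])
    return np.concatenate(window)              # single input vector for the MLP

frames = np.random.randn(100, 13)              # e.g. 100 frames of 13 features
x_t = context_window(frames, t=50)
print(x_t.shape)                               # (117,) = 9 frames x 13 features
```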
4. The output consists of one output unit for each phone. By constraining the values of all the output units to sum to 1, the net can be used to compute the probability of a state j given an observation O_t, i.e. P(q_j | O_t).
The MLP computes P(q_j | O_t), but what we want for our HMM is b_j(O_t), i.e. P(O_t | q_j). We can obtain this using Bayes' rule, viz.
P(x | y) = P(y | x) P(x) / P(y)
so that
P(O_t | q_j) = P(q_j | O_t) P(O_t) / P(q_j)
where P(q_j | O_t) is the output of the MLP and P(q_j) is the total probability of a given state, summing over all observations.
5. Therefore, we can compute
P(q_j | O_t) / P(q_j) = P(O_t | q_j) / P(O_t)
which is a scaled likelihood. This is just as good as the regular likelihood, since the probability of the observation P(O_t) is a constant during recognition.
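As a concrete illustration, here is a minimal sketch of the division by the state priors, assuming the MLP posteriors P(q_j | O_t) and the priors P(q_j) are already available as NumPy arrays; the names `posteriors`, `priors` and `scaled_likelihoods` and the numbers are illustrative.

```python
import numpy as np

# Minimal sketch of converting MLP outputs to scaled likelihoods for the HMM,
# assuming `posteriors` holds P(q_j | O_t) (one row per frame, one column per
# state) and `priors` holds P(q_j) estimated from the training data.
def scaled_likelihoods(posteriors, priors):
    # P(O_t | q_j) / P(O_t) = P(q_j | O_t) / P(q_j)
    return posteriors / priors

posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])       # MLP outputs for two frames
priors     = np.array([0.5, 0.3, 0.2])         # state priors P(q_j)
print(scaled_likelihoods(posteriors, priors))  # used in place of b_j(O_t)
```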
6. Comparison: DTW vs HMM vs MLP

Training:
- DTW: easy to train (average of prototypes)
- HMM: more costly to train (complex training algorithm)
- MLP: very costly to train (back-propagation can be slow to converge)

Recognition:
- DTW: costly to use (computationally intensive dynamic programming)
- HMM: less costly to use (Viterbi algorithm is very efficient)
- MLP: cheap to use (forward computation through the net is fast)

Vocabulary:
- DTW: independent words (separate templates); easy to add or delete words
- HMM: independent words (separate models); easy to add or delete words
- MLP: entire vocabulary; hard to add or delete words

Speaker dependence:
- DTW: speaker dependent (close to speech data)
- HMM: less speaker dependent (structural model of word generation)
- MLP: less speaker dependent (perceptrons can generalise)
7. Sentence Formation
[Block diagram: the front-end produces feature vectors; a pattern matcher using word or phoneme models gives the probability of a word sequence; high-level interpretation using a language model yields the most likely sentence.]
8. Bayes' Rule
Sentence recognition: with O the observations (feature vectors) and W the sentence,
P(W | O) = P(O | W) P(W) / P(O)
The denominator P(O) is independent of the words being considered as possible matches, i.e. independent of the sentence.
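A toy sketch of the resulting decision rule is given below: the candidate sentences and their scores are made-up numbers, and the point is only that the denominator P(O) never enters the comparison.

```python
# Toy illustration of the decision rule implied by Bayes' rule: choose the
# sentence W maximising P(O | W) * P(W); P(O) is the same for every candidate
# and can be ignored. All probabilities below are made-up numbers.
candidates = {
    "recognise speech":   {"acoustic": 1e-10, "language": 3e-4},
    "wreck a nice beach": {"acoustic": 2e-10, "language": 1e-7},
}

best = max(candidates,
           key=lambda w: candidates[w]["acoustic"] * candidates[w]["language"])
print(best)   # "recognise speech": the language model outweighs the acoustic difference
```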
9. The final term comes from the language model:
- bigram: P(word | previous word)
- trigram: P(word | previous two words)
A large (sparse) matrix is obtained by counting adjacent occurrences of words in pairs or triples in a large corpus (typically 10^7 words) of on-line text. The number of combinations is potentially very large: a vocabulary of 10^5 words gives > 10^10 possible pairs and 10^15 possible triples.
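A minimal sketch of this counting is shown below, using a toy corpus rather than the 10^7-word text assumed above; the helper `bigram_prob` is illustrative and uses the plain maximum-likelihood estimate.

```python
from collections import Counter

# Sketch of building the (sparse) bigram count matrix from a corpus.
# The toy corpus below is illustrative; a real one would be ~10^7 words.
corpus = "i want chinese food i want english food i want food".split()

unigram_counts = Counter(corpus)
bigram_counts  = Counter(zip(corpus, corpus[1:]))   # adjacent word pairs

def bigram_prob(w, prev):
    """Maximum-likelihood estimate P(w | prev) = C(prev, w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("want", "i"))       # 3/3 = 1.0 in this toy corpus
print(bigram_prob("chinese", "want")) # 1/3
```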
10. Most of the possibilities do not occur in the training corpus because
- either they can't occur (prohibited in the language)
- or they could occur but don't.
We want to assign a non-zero probability to all n-grams that are possible in the given context, including ones that have not occurred in the training data even though they look the same as those that can't occur in the language, viz. a zero count in the matrix.
In practice the matrix is smoothed and normalised to give a probability distribution.
11. P(I want Chinese food) = (.0023)(0.32)(.0049)(.56) ≈ 0.0002
12. Add-One Smoothing
Unigram: the maximum-likelihood estimate is p_i = c_i / N, where c_i is the count of word w_i and N is the total number of word tokens.
Then add-one smoothing of the probabilities is defined by
p_i* = (c_i + 1) / (N + V)
where V is the total number of word types in the language.
13. Bigram
The maximum-likelihood estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) is redefined as
P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
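A sketch of the redefined bigram estimate, again on a toy corpus; `addone_bigram_prob` is an illustrative helper, not a library function.

```python
from collections import Counter

# Minimal add-one (Laplace) smoothing sketch for bigrams, following
# P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V).
# The toy corpus is illustrative only.
corpus = "i want chinese food i want english food".split()

unigram_counts = Counter(corpus)
bigram_counts  = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)               # number of word types in the vocabulary

def addone_bigram_prob(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(addone_bigram_prob("want", "i"))   # seen bigram: (2 + 1) / (2 + 5)
print(addone_bigram_prob("food", "i"))   # unseen bigram gets a non-zero probability
```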
15. Witten-Bell Discounting
Model the probability of seeing a zero-frequency N-gram by the probability of seeing an N-gram for the first time.
The count of first-time N-grams is the number T of N-gram types we saw in the data, since we had to see each type for the first time exactly once.
So we can estimate the total probability of all the zero-count N-grams as
T / (N + T)
where N is the total number of N-gram tokens.
We want to divide this total probability of unseen N-grams up amongst all the zero-count N-grams.
16. Unigram
Let Z be the total number of N-gram types with count zero, and divide the total probability of unseen N-grams up equally among them:
p_i* = T / (Z (N + T))   if c_i = 0
We have introduced T / (N + T) as the total probability of unseen N-grams, and this has to come from somewhere, so we discount the probability of the seen N-grams:
p_i* = c_i / (N + T)   if c_i > 0
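The unigram case can be sketched as follows, assuming a hypothetical closed vocabulary so that Z is known; the corpus, vocabulary and names are illustrative. The check at the end confirms the discounted estimates still sum to 1.

```python
from collections import Counter

# Witten-Bell discounting sketch for unigrams, following the definitions above:
# N = number of tokens, T = number of observed types, Z = number of zero-count types.
# Seen:   p*(w) = c(w) / (N + T)
# Unseen: p*(w) = T / (Z * (N + T))
corpus = "i want chinese food i want english food".split()
vocabulary = {"i", "want", "chinese", "food", "english", "british", "thai"}

counts = Counter(corpus)
N = len(corpus)                              # tokens
T = len(counts)                              # seen types
Z = len(vocabulary) - T                      # unseen types ("british", "thai")

def witten_bell_prob(w):
    if counts[w] > 0:
        return counts[w] / (N + T)           # discounted seen probability
    return T / (Z * (N + T))                 # share of the unseen mass T/(N+T)

total = sum(witten_bell_prob(w) for w in vocabulary)
print(round(total, 10))                      # 1.0: the estimates form a distribution
```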
17. Bigram
Condition the counts on the preceding word: with T(w_x) the number of distinct bigram types beginning with w_x and N(w_x) the number of bigram tokens beginning with w_x, the total probability of unseen bigrams following w_x is T(w_x) / (N(w_x) + T(w_x)).
Distribute this equally between the Z(w_x) unseen bigrams following w_x.
18. P(I want Chinese food) = (.0023)(0.32)(.0049)(.56) ≈ 0.0002
19. Probabilistic context-free grammar

Rule             Probability
S  → NP VP       1
NP → pn          0.2
NP → det n       0.4
NP → NP PP       0.4
PP → prep NP     1
VP → v NP        0.8
VP → VP PP       0.2

S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, pn = pronoun, n = noun, v = verb, prep = preposition, det = determiner
20. He bought a dog with three pounds

Using the rule probabilities above, one parse attaches the PP to the VP ("bought ... with three pounds"):

(S (NP (pn He))
   (VP (VP (v bought) (NP (det a) (n dog)))
       (PP (prep with) (NP (det three) (n pounds)))))

P(sentence) = 1 x 0.2 x 0.2 x 0.8 x 0.4 x 1.0 x 0.4 = 0.00512
21. Is there another parse tree for the sentence "He bought a dog with three pounds"?

Yes: the PP can instead attach to the NP ("a dog with three pounds"):

(S (NP (pn He))
   (VP (v bought)
       (NP (NP (det a) (n dog))
           (PP (prep with) (NP (det three) (n pounds))))))

P(sentence) = 1 x 0.2 x 0.8 x 0.4 x 0.4 x 1.0 x 0.4 = 0.01024
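A small sketch of these calculations: each parse is written as the list of rules it uses, and its probability is the product of the rule probabilities from the grammar. The rule strings and function names are illustrative, not from any parsing library.

```python
# Score the two parse trees above under the PCFG rule probabilities.
rule_prob = {
    "S -> NP VP": 1.0,   "NP -> pn": 0.2,      "NP -> det n": 0.4,
    "NP -> NP PP": 0.4,  "PP -> prep NP": 1.0,
    "VP -> v NP": 0.8,   "VP -> VP PP": 0.2,
}

def parse_probability(rules):
    p = 1.0
    for r in rules:
        p *= rule_prob[r]
    return p

# Parse 1: the PP attaches to the VP ("bought ... with three pounds").
parse1 = ["S -> NP VP", "NP -> pn", "VP -> VP PP", "VP -> v NP",
          "NP -> det n", "PP -> prep NP", "NP -> det n"]
# Parse 2: the PP attaches to the NP ("a dog with three pounds").
parse2 = ["S -> NP VP", "NP -> pn", "VP -> v NP", "NP -> NP PP",
          "NP -> det n", "PP -> prep NP", "NP -> det n"]

print(round(parse_probability(parse1), 5))   # 0.00512
print(round(parse_probability(parse2), 5))   # 0.01024, the more probable reading
```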