MULTILAYER PERCEPTRONS

Transcript and Presenter's Notes



1
MULTI-LAYER PERCEPTRONS
[Figure: network diagram with an input layer, a hidden layer and an output layer; h_km are the input-to-hidden weights and w_mn the hidden-to-output weights.]
2
[Figure: forward computation through the network, from inputs to intermediate (hidden) values to outputs, with the sigmoid function as the activation; d_n = desired output, z_n = actual output.]
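As a concrete illustration (not from the slides), the sketch below runs a forward pass through a two-layer network with sigmoid activations, reusing the slide's h_km and w_mn as the two weight matrices; the layer sizes and the squared-error measure between d_n and z_n are assumptions.

import numpy as np

def sigmoid(a):
    # sigmoid activation used at both layers
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, h_km, w_mn):
    # x: input vector (K,); h_km: (K, M) input-to-hidden weights;
    # w_mn: (M, N) hidden-to-output weights
    y = sigmoid(x @ h_km)   # intermediate (hidden) values
    z = sigmoid(y @ w_mn)   # actual outputs z_n
    return z

def sum_squared_error(d, z):
    # error between desired outputs d_n and actual outputs z_n (assumed criterion)
    return 0.5 * np.sum((d - z) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=4)   # arbitrary sizes for the illustration
z = mlp_forward(x, rng.normal(size=(4, 3)), rng.normal(size=(3, 2)))
print(z, sum_squared_error(np.array([1.0, 0.0]), z))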
3
Hybrid MLP-HMM System
In a hybrid system the MLP is used to obtain the observation probability matrix.
Typically the input consists of a vector of spectral features for a time t together with four additional vectors on each side, i.e. vectors of features for times t-40ms, t-30ms, t-20ms, t-10ms, t, t+10ms, t+20ms, t+30ms, t+40ms.
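A minimal sketch of how such an input could be assembled (the 13-dimensional feature vectors, the edge padding by repeating the first/last frame, and the function name are assumptions made for illustration):

import numpy as np

def context_window(features, t, width=4):
    # features: (num_frames, feat_dim) spectral features at a 10 ms frame shift
    # returns the stacked vector for frames t-width .. t+width (9 frames here),
    # repeating the first/last frame at the utterance edges
    num_frames = features.shape[0]
    idx = np.clip(np.arange(t - width, t + width + 1), 0, num_frames - 1)
    return features[idx].reshape(-1)

feats = np.random.default_rng(0).normal(size=(100, 13))  # e.g. 13 features per frame
x_t = context_window(feats, t=50)
print(x_t.shape)   # (117,) = 9 frames x 13 features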
4
The output consists of one output unit for each phone. By constraining the values of all the output units to sum to 1, the net can be used to compute the probability of a state j given an observation O_t, i.e. P(q_j | O_t).
The MLP computes P(q_j | O_t), but what we want for our HMM is b_j(O_t), i.e. P(O_t | q_j). We can obtain this using Bayes Rule, viz. P(x|y) = P(y|x) P(x) / P(y), which gives

P(O_t | q_j) = P(q_j | O_t) P(O_t) / P(q_j)

where P(q_j | O_t) is the output of the MLP and P(q_j) is the total probability of a given state, summing over all observations.
5
Therefore, we can compute

P(O_t | q_j) / P(O_t) = P(q_j | O_t) / P(q_j)

which is a scaled likelihood. This is just as good as the regular likelihood, since the probability of the observation P(O_t) is a constant during recognition.
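A sketch of this conversion (assuming, as is common but not stated on the slide, that the state priors P(q_j) are estimated from frame-level state counts in the training data):

import numpy as np

def scaled_likelihoods(posteriors, state_priors):
    # posteriors: (num_frames, num_states) MLP outputs P(q_j | O_t), summing to 1 per frame
    # state_priors: (num_states,) estimates of P(q_j)
    # returns P(q_j | O_t) / P(q_j) = P(O_t | q_j) / P(O_t), the scaled likelihoods
    return posteriors / state_priors

posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])   # assumed to come from frame-level state counts
print(scaled_likelihoods(posteriors, priors))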
6
Comparison of DTW, HMM and MLP

Training cost
  DTW: easy to train (average of prototypes)
  HMM: more costly to train (complex training algorithm)
  MLP: very costly to train (back propagation can be slow to converge)

Cost of use
  DTW: costly to use (computationally intensive dynamic programming)
  HMM: less costly to use (Viterbi algorithm is very efficient)
  MLP: cheap to use (forward computation through the net is fast)

Vocabulary
  DTW: independent words (separate templates), so easy to add or delete words
  HMM: independent words (separate models), so easy to add or delete words
  MLP: entire vocabulary in one net, so hard to add or delete words

Speaker dependence
  DTW: speaker dependent (close to the speech data)
  HMM: less speaker dependent (structural model of word generation)
  MLP: less speaker dependent (perceptrons can generalise)
7
Sentence Formation
[Figure: block diagram of the recogniser. A front-end converts speech into feature vectors; a pattern matcher, driven by word or phoneme models, turns these into the probability of a word sequence; a high-level interpretation stage, driven by the language model, outputs the most likely sentence.]
8
Bayes Rule for sentence recognition:

P(sentence | observations) = P(observations | sentence) P(sentence) / P(observations)

where the observations are the feature vectors. The denominator is independent of the words being considered as possible matches, i.e. independent of the sentence.
9
The final term comes from the language model:
  bigram: P(word | previous word)
  trigram: P(word | previous two words)
A large (sparse) matrix is obtained by counting adjacent occurrences of words in pairs or triples in a large corpus (typically 10^7 words) of on-line text. The number of combinations is potentially very large: a vocabulary of 10^5 words gives 10^10 possible pairs and 10^15 possible triples.
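A minimal sketch of how such a count matrix is built (the toy corpus, the sentence-boundary tokens <s> and </s>, and the dictionary-based sparse storage are assumptions made for illustration):

from collections import defaultdict

def bigram_counts(sentences):
    # count adjacent word pairs; only observed pairs are stored (sparse)
    pairs = defaultdict(int)
    unigrams = defaultdict(int)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for w in words:
            unigrams[w] += 1
        for w1, w2 in zip(words, words[1:]):
            pairs[(w1, w2)] += 1
    return pairs, unigrams

pairs, unigrams = bigram_counts(["I want Chinese food", "I want food"])
print(pairs[("i", "want")], unigrams["i"])   # 2 2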
10
  • Most of the possibilities do not occur in the training corpus because
  • either they can't occur (they are prohibited in the language)
  • or they could occur but don't.

We want to assign a non-zero probability to all n-grams that are possible in the given context, including ones that have not occurred in the training data, even though in the matrix they look the same (a zero count) as those that can't occur in the language.
In practice the matrix is smoothed and normalised to give a probability distribution.
11
P(I want Chinese food) = (.0023)(0.32)(.0049)(.56) = 0.0002
12
Add-One Smoothing
Unigram: the unsmoothed estimate of a unigram probability is P(w_i) = c_i / N, where c_i is the count of word w_i and N is the total number of word tokens.
Then add-one smoothing of the probabilities is defined by

P(w_i) = (c_i + 1) / (N + V)

where V is the total number of word types in the language.
13
Bigram: the unsmoothed estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) is redefined as

P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
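A short sketch of add-one smoothing for bigrams (the toy counts and the helper name add_one_bigram_prob are invented for illustration):

def add_one_bigram_prob(w1, w2, bigram_counts, unigram_counts):
    # P(w2 | w1) with add-one smoothing; V = number of word types
    V = len(unigram_counts)
    c_pair = bigram_counts.get((w1, w2), 0)   # zero for unseen pairs
    c_prev = unigram_counts.get(w1, 0)
    return (c_pair + 1) / (c_prev + V)

unigram_counts = {"i": 2, "want": 2, "chinese": 1, "food": 2}
bigram_counts = {("i", "want"): 2, ("want", "chinese"): 1,
                 ("chinese", "food"): 1, ("want", "food"): 1}
print(add_one_bigram_prob("i", "want", bigram_counts, unigram_counts))      # seen pair
print(add_one_bigram_prob("want", "pizza", bigram_counts, unigram_counts))  # unseen pair, still > 0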
14
(No Transcript)
15
Witten-Bell Discounting
Model the probability of seeing a zero-frequency N-gram by the probability of seeing an N-gram for the first time.
The count of first-time N-grams is the number of N-gram types T we saw in the data, since we had to see each type for the first time exactly once.
So we can estimate the total probability of all the zero-count N-grams as

T / (N + T)

where N is the total number of N-gram tokens. We want to divide this total probability of unseen N-grams up amongst all the zero-count N-grams.
16
Unigram
Let Z be the number of N-gram types with zero count, and divide the total probability of unseen N-grams equally among them:

p*_i = T / (Z (N + T))   if c_i = 0

We have introduced T/(N+T) as the total probability of unseen N-grams, and this has to come from somewhere, so we discount the probability of the seen N-grams:

p*_i = c_i / (N + T)   if c_i > 0
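A sketch of the unigram case (the toy counts and vocabulary are invented; the code simply shares the T/(N+T) mass equally among the Z unseen types and discounts the seen ones, as above):

def witten_bell_unigram(counts, vocabulary):
    # counts: observed word counts; vocabulary: all word types in the language
    N = sum(counts.values())                       # number of tokens
    T = sum(1 for c in counts.values() if c > 0)   # number of seen types
    Z = sum(1 for w in vocabulary if counts.get(w, 0) == 0)  # zero-count types
    def prob(w):
        c = counts.get(w, 0)
        # seen types are discounted to c/(N+T); the freed mass T/(N+T)
        # is shared equally among the Z unseen types
        return c / (N + T) if c > 0 else T / (Z * (N + T))
    return prob

prob = witten_bell_unigram({"the": 5, "cat": 2, "sat": 1},
                           vocabulary={"the", "cat", "sat", "dog", "mat"})
print(prob("the"), prob("dog"))   # seen vs. unseen word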
17
Bigram
The same idea is applied conditioned on the previous word: the total probability of unseen bigrams that start with w_{n-1} is T(w_{n-1}) / (N(w_{n-1}) + T(w_{n-1})), where T(w_{n-1}) and N(w_{n-1}) count the bigram types and tokens beginning with w_{n-1}. Distribute this equally between the Z(w_{n-1}) unseen bigrams.
18
P(I want Chinese food) = (.0023)(0.32)(.0049)(.56) = 0.0002
19
Probabilistic context-free grammar

Rule            Probability
S → NP VP       1
NP → pn         0.2
NP → det n      0.4
NP → NP PP      0.4
PP → prep NP    1
VP → v NP       0.8
VP → VP PP      0.2

S = sentence, pn = pronoun, NP = Noun Phrase, n = noun, VP = Verb Phrase, prep = preposition, PP = Prepositional Phrase, v = verb, det = determiner
20
He bought a dog with three pounds

S → NP VP 1, NP → pn 0.2, NP → det n 0.4, NP → NP PP 0.4, PP → prep NP 1, VP → v NP 0.8, VP → VP PP 0.2

[Parse tree with the PP attached to the VP:
  [S [NP [pn He]] [VP [VP [v bought] [NP [det a] [n dog]]] [PP [prep with] [NP [det three] [n pounds]]]]] ]

P(sentence) = 1 x 0.2 x 0.2 x 0.8 x 0.4 x 1.0 x 0.4 = 0.00512
21
Is there another parse tree for the sentence "He bought a dog with three pounds"?

S → NP VP 1, NP → pn 0.2, NP → det n 0.4, NP → NP PP 0.4, PP → prep NP 1, VP → v NP 0.8, VP → VP PP 0.2

[Parse tree with the PP attached to the NP:
  [S [NP [pn He]] [VP [v bought] [NP [NP [det a] [n dog]] [PP [prep with] [NP [det three] [n pounds]]]]]] ]

P(sentence) = 1 x 0.2 x 0.8 x 0.4 x 0.4 x 1.0 x 0.4 = 0.01024
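A sketch that scores both parses by multiplying the probabilities of the rules they use; the representation of rules as (lhs, rhs) tuples and the function name are choices made for this illustration, not part of the slides.

from math import prod

RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("pn",)): 0.2,
    ("NP", ("det", "n")): 0.4,
    ("NP", ("NP", "PP")): 0.4,
    ("PP", ("prep", "NP")): 1.0,
    ("VP", ("v", "NP")): 0.8,
    ("VP", ("VP", "PP")): 0.2,
}

def parse_probability(rules_used):
    # probability of a parse = product of the probabilities of its rules
    return prod(RULE_PROBS[r] for r in rules_used)

# Parse 1: the PP attaches to the VP ("bought ... with three pounds")
parse1 = [("S", ("NP", "VP")), ("NP", ("pn",)), ("VP", ("VP", "PP")),
          ("VP", ("v", "NP")), ("NP", ("det", "n")),
          ("PP", ("prep", "NP")), ("NP", ("det", "n"))]
# Parse 2: the PP attaches to the NP ("a dog with three pounds")
parse2 = [("S", ("NP", "VP")), ("NP", ("pn",)), ("VP", ("v", "NP")),
          ("NP", ("NP", "PP")), ("NP", ("det", "n")),
          ("PP", ("prep", "NP")), ("NP", ("det", "n"))]

print(parse_probability(parse1))   # ~0.00512
print(parse_probability(parse2))   # ~0.01024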