Title: Knowledge-based ASR
2 The definition of sound segments in phonetics and speech technology
Intensive course, Centre of Information Society Technologies, Sofia University St. Kliment Ohridski, 18-22 February 2002
Jacques Koreman (jkoreman_at_coli.uni-sb.de)
Bistra Andreeva (andreeva_at_coli.uni-sb.de)
Trajan Iliev (t_iliev_at_fmi.uni-sofia.bg)
Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany
3 Overview of the course
- Monday: introduction and discussion of student projects
- Tuesday: 9-11 ASR techniques: hidden Markov modelling (JK); 11-13 A formal description of Bulgarian sound segments (BA)
- Wednesday: 9-11 ASR techniques: neural networks (JK); 11-13 The acoustic description of speech signals (BA)
- Thursday: 9-11 The segmentation of speech sounds using NNs (TI); 11-13 Sound segments and their boundaries (BA)
- Friday: discussion of student projects
5 The definition of sound segments in phonetics
and speech technology
6 Goal of ASR systems
- Automatic speech recognition (ASR) systems take the microphone signal as their input and recognise utterances as a sequence of words.
- In order to achieve this, speech sounds must be recognised from the signal and matched to (a sequence of) words in the lexicon. By doing this, sound sequences which do not constitute a sequence of words are excluded from the search.
7 Goal of ASR systems
- Since the microphone signal (a record of the variations in air pressure which constitute the sound) is not a very informative representation, we first derive some sort of spectral representation from the signal. The spectrum does contain all the information we need to identify speech sounds.
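A minimal sketch of deriving such a spectral representation, assuming numpy and illustrative frame sizes (400-sample frames with a 160-sample hop, i.e. 25 ms and 10 ms at 16 kHz):

    import numpy as np

    def log_spectrogram(signal, frame_len=400, hop=160):
        """Slice the signal into overlapping frames; log-magnitude spectrum per frame."""
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            frames.append(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))
        return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)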
8 Microphone signal: spectrogram
[Spectrogram of an utterance with aligned phone labels]
9 Finding sounds in the spectrum
- Two problems must be solved to find sounds in the speech signal:
- segmentation: slicing the signal into sounds
- identification: determining which sounds were spoken
10 Segmentation
- Segmentation is difficult, because speech is produced in a single flow, i.e. there are no pauses between the words and the articulators are constantly moving from one position to the next.
- In some sound types the movement is intrinsic to the sound: glides and diphthongs. Only when people speak slowly do we find so-called steady states in the signal. The interpretation of the movement depends on context, accentuation and speaking rate.
Have you ever tried to identify the word or phone boundaries in a language you do not know?!
11 Identification
- Identification is also difficult, because no sound is ever produced the same way twice. This is due to differences between speakers (and even the same speaker will never produce a sound exactly the same way twice), context, accent and dialect, accentuation, situation (e.g. formal or informal), etc.
- Example: pan - span - ban
- Conclusion: a sound cannot always be identified on the basis of fixed cues!
12 The history of ASR systems
- The first ASR systems were knowledge-based, i.e. they used phonetic knowledge about the realisation of speech sounds to identify them in the signal. The best results were achieved if broad sound classes were detected in a first step, and then a set of matching word candidates was selected, on the basis of which a fine search for distinguishing phonetic properties was then carried out.
13 The history of ASR systems
- Although a few people still work with knowledge-based systems, most of the systems used nowadays are stochastic, i.e. they do not attempt to find specific phonetic characteristics of speech sounds in an all-or-none approach, but use probabilities of general spectral properties to compute a model for each speech sound.
- For this reason, we shall only discuss stochastic modelling techniques in this course.
14 References: knowledge-based ASR
- Broad, D. and Shoup, J. (1975). Concepts for acoustic phonetic recognition. In D. Reddy (ed.), Speech Recognition. New York: Academic Press.
- Zue, V. (1990). The use of speech knowledge in automatic speech recognition. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, pp. 200-213. San Mateo: Morgan Kaufmann Publishers, Inc.
15 References: knowledge-based ASR
- Stevens, K. (2000). From acoustic cues to segments, features and words. Proc. Int. Conf. on Spoken Lang. Proc. (ICSLP2000), Beijing.
- Reetz (1999). Converting speech signals to phonological features. Proc. of the XIVth Conf. of Phonetic Sciences (ICPhS99), San Francisco, 1733-1736.
- Lahiri (2000). Underspecified recognition. Proc. of the Conf. on Laboratory Phonology (LabPhon2000).
17 The definition of sound segments in phonetics and speech technology
- ASR techniques: hidden Markov modelling
18 Hidden Markov modelling
Hidden Markov modelling is a stochastic technique, which means that it models the variation in the signal by using probabilities. Usually, each sound (sometimes each word or word sequence) is represented by a hidden Markov model (HMM). Whole utterances are then modelled as a sequence of words, each of which constitutes a sequence of sounds. The sequence of words with the highest probability is recognised.
19 Hidden Markov modelling
The a-priori probabilities of the words (lexicon) and of word sequences (language model) play an important role in computing the most likely sequence of HMMs to have generated an acoustic signal. I shall come back to this later. But let us first look at how the match between an acoustic signal and a sequence of HMMs is determined.
20 Markov modelling (stochastic modelling)
- Markov models consist of states which are connected by transitions.
- When the automaton is in a specific state, it emits a symbol (e.g. an acoustic vector).
- Each transition between two states has a probability associated with it.
- Let's first look at a simple example, in which the states are represented by containers with coloured balls.
21 MMs: a simple example
- We start in state S, which does not emit a symbol. From there, we go to state 1 with probability 1.
- There we take a black ball from the container.
22 MMs: a simple example
- Then we either continue to the 2nd state (p = 0.4) and take a red ball, or we go to state 1 again (self-loop) and take another black ball from the container.
- We continue until we get to state E and have collected a sequence of coloured balls.
23 MMs: a simple example
- We can now compute the probability that the model displayed below generated the shown sequence of observations as
1 × 0.6 × 0.4 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.7 × 0.7 × 0.7 × 0.3
(in fact, we should have written × 1 for each ball taken from the container)
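The same computation as a sketch in Python: the probability of the observation sequence is the product of the transition probabilities along the path (each emission contributes a factor of 1, since every state emits balls of a single colour):

    from math import prod

    # transition probabilities along the path through the model above
    transitions = [1, 0.6, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.3]
    print(prod(transitions))  # probability that the model generated the sequence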
24 Hidden Markov modelling
- Hidden Markov models (HMMs) differ from Markov models in that the emissions cannot be attributed unambiguously to a particular state.
- In our example this would be the case if all three (emitting) containers were filled with red, black and yellow balls.
- The percentage of balls of the different colours can be different for the three containers, so that the colour emissions have different probabilities for each of the three states.
25 HMMs: a simple example
- We start in state S, which does not emit a symbol. From there, we go to state 1 with probability 1.
- There we take a ball from the container, which can now be red, black or yellow.
26 HMMs: a simple example
- Then we go on to the 2nd state (p = 0.4) and take a ball from the container, or we go to state 1 again and take another ball from that container.
- We continue until we get to state E and have collected a sequence of coloured balls.
27 HMMs: hidden states
- When we see a sequence of coloured balls, it is impossible to recognise with certainty in which state (from which container) each ball was taken. The states are hidden; that is why we speak of "hidden" Markov modelling. Possible state sequences include:
1 1 1 1 1 2 2 2 2 3 3 3
1 1 1 2 2 2 2 2 3 3 3 3
etc.
28 HMMs: speech recognition
- Sequence of coloured balls = sequence of acoustic frames (parameter vectors).
- It is the task of the ASR system to identify the sequence of states which is most likely to have generated/emitted the frame sequence representing an utterance. This depends on the transition and emission probabilities of the states.
29 HMMs: transitions
- In ASR, left-to-right models (as in the previous graphical representations) are used, because the acoustic events are ordered in time. Vowels, for instance, are often thought of as a sequence of onset transition, steady state and offset transition.
- If a model is trained for pauses, transitions are allowed from each state to any other state (including self-loops), because the sequence of acoustic events is random (ergodic model).
30 HMMs: emissions
- Emissions can be described using:
- a vector codebook: a fixed number of vectors is used to represent the acoustic space. They are related to states by state-specific emission probabilities.
- Gaussian mixtures: the variation in the acoustic realisation in each state is described by a normal distribution (see the sketch below).
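A sketch of the Gaussian option, assuming a diagonal covariance (one mean and variance vector per state; numpy assumed):

    import numpy as np

    def emission_likelihood(x, mean, var):
        """Likelihood of acoustic vector x under a state's diagonal Gaussian."""
        return float(np.prod(np.exp(-0.5 * (x - mean) ** 2 / var)
                             / np.sqrt(2 * np.pi * var)))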
31 HMMs: complex models
- More complex models are also used:
- parallel states and multiple mixtures can capture the variation in the realisation of speech sounds (speaker, dialect, context, etc.) more effectively.
- generalised triphones describe a speech sound in different contexts. The contexts are grouped (e.g. according to place of articulation, or on the basis of data-driven clustering techniques). The grouping reduces the requirements in terms of the size of the training corpus.
32 HMMs: speech recognition
- There are several state sequences which can generate the same signal (frame sequence). The state sequence with the highest probability is found using the Viterbi algorithm (sketched below).
- This is done for all HMMs. The HMM leading to the highest probability is recognised.
- Since we do not usually want to recognise a single speech sound, but a sequence of speech sounds, the optimisation is performed over sequences of HMMs.
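A minimal Viterbi sketch under simplifying assumptions (a single HMM, uniform entry probabilities; all names are illustrative):

    import numpy as np

    def viterbi(obs_probs, trans):
        """obs_probs: (T, N) emission probabilities per frame and state;
        trans: (N, N) transition probabilities. Returns best path and probability."""
        T, N = obs_probs.shape
        delta = np.zeros((T, N))            # best path probability so far
        psi = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = obs_probs[0]             # uniform entry assumed for simplicity
        for t in range(1, T):
            for j in range(N):
                scores = delta[t - 1] * trans[:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]] * obs_probs[t, j]
        path = [int(np.argmax(delta[-1]))]  # backtrack from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())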
33 HMMs: lexicon and language model
- HMMs are now used for continuous, spontaneous speech recognition. Besides acoustic (hidden Markov) models, we also need a lexicon and a language model.
- In the lexicon, all the words (or morphemes) which the system must be able to recognise are listed with their pronunciation.
- In the language model, all the possible combinations of lexical entries are described.
34 HMMs: lexicon
- The lexicon entries consist of an orthographic word and its realisation in terms of a sequence of HMMs for speech sounds.
- In order to better cope with variation in the pronunciation of words, pronunciation variants are sometimes added to the lexicon which include reductions, epentheses and assimilations (see the sketch below).
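As a sketch, such a lexicon entry could be stored as a mapping from the orthographic word to its pronunciation variants (SAMPA-style phone labels; the variants are illustrative):

    # orthographic word -> pronunciation variants (phone sequences);
    # the first entry is the canonical form, the others are reduced variants
    lexicon = {
        "and": [["{", "n", "d"],   # canonical form
                ["@", "n"],        # reduced variant
                ["n"]],            # strongly reduced variant
    }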
35 HMMs: lexicon
- Phonological processes can lead to a change in the identity of a speech sound:
- deletion
- insertion
- assimilation
36 Phonological variation (HMMs: lexicon)
- deletion
- A speech sound which is present in the so-called canonical form (lexicon form) is not realised.
- ..., isn't it?
- (G.) Fährst du mit dem Bus?
37 Phonological variation (HMMs: lexicon)
- insertion
- A speech sound which is not present in the canonical form (lexicon form) is inserted.
- tense - tents
- (G.) Gans - Ganz
38 Phonological variation (HMMs: lexicon)
39 HMMs: lexicon
- assimilation
- The phonological identity of a speech sound changes under the influence of the (segmental or prosodic) context in which it occurs.
- input, often pronounced i[m]put (cf. the spelling of immediate)
- but not in: sometimes, some guys
40 HMMs: lexicon
- Pronunciation variants in the lexicon reduce the distance between an acoustic realisation and the lexical entry.
- At the same time, however, the distance between lexical entries becomes smaller, which can lead to the misrecognition of words. For this reason, we often add only the most frequent pronunciation variants, e.g. for function words, to improve recognition.
41 HMMs: language model
- The language model can be implemented as:
- a rule system: these have the advantage that they can lead to a better understanding of the linguistic properties of utterances.
- a probabilistic system: n-gram probabilities are computed for word sequences (see the sketch below). They generalise less and need a lot of training data. If the test condition matches the training well (text type, lexical domain, etc.), they describe the observed speech behaviour very well.
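A sketch of the probabilistic option: maximum-likelihood bigram estimates from a toy corpus (real systems smooth these counts):

    from collections import Counter

    def bigram_probs(corpus):
        """Estimate P(w2 | w1) from whitespace-tokenised sentences."""
        history, bigrams = Counter(), Counter()
        for sentence in corpus:
            words = sentence.split()
            history.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        return {pair: c / history[pair[0]] for pair, c in bigrams.items()}

    probs = bigram_probs(["recognise speech now", "recognise speech well"])
    print(probs[("recognise", "speech")])  # 1.0 in this toy corpus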
42 HMMs: language model
- Orthographically distinguishable utterances can have identical acoustic realisations. The language model can choose one of two possible readings dependent on the probabilities of the word sequences.
- Example: [ɹɛkəɡnaɪz spiːtʃ]
Recognise speech
Wreck a nice beach
43 HMMs: language model
- Orthographically distinguishable utterances can have identical acoustic realisations. The language model can choose one of two possible readings dependent on the probabilities of the word sequences.
- Example: [ɡɛtəpəteɪtəʊklɒk]
Get up at eight o'clock
Get a potato clock
44 HMMs: applications
- HMM systems are used in:
- information systems (travel information)
- hands-free telephony
- spoken input, e.g. in navigation systems
- aids for the handicapped
- dictation systems, e.g. NaturallySpeaking (Dragon), ViaVoice (IBM), FreeSpeech (Philips)
45 References
- Van Alphen, P. and van Bergem, D. (1989). Markov models and their application in speech recognition. Proceedings Institute of Phonetic Sciences, University of Amsterdam 13, 1-26.
- Holmes, J. (1988). Speech Synthesis and Recognition (Ch. 8). Wokingham (Berks.): Van Nostrand Reinhold, 129-152.
46 References
- Cox, S. (1988). Hidden Markov models for automatic speech recognition: theory and application. Br. Telecom Techn. Journal 6(2), 105-115.
- Lee, K.-F. (1989). Hidden Markov modelling: past, present, future. Proc. Eurospeech 1989, vol. 1, 148-155.
48 The definition of sound segments in phonetics and speech technology
- ASR techniques: neural networks
49 Neural networks
- Artificial neural networks are particularly suited for the classification of input signals, e.g. to recognise which sound an acoustic frame belongs to.
- They are not well suited to integrating information over time.
50 NNs: biological basis
- The building blocks of a neural net are based on the functionality of biological nerve cells, as they are found in the brain (about 10^10 neurons).
- As a simplifying statement we could say that a nerve cell does no more (nor any less) than compute a weighted sum of its inputs and create an output dependent on this weighted sum.
51 NNs: biological neuron
[Diagram of a neuron: dendrites, cell body (soma), axon, synapses on the cell body; a membrane surrounds the neuron; activation propagates along the axon]
52 NNs: biological neuron
- An axon carries the signal. It is long and can split itself several times. The ends of an axon are connected to dendrites or to the cell body of another nerve cell by means of synapses.
- Once the threshold for the electric potential of a synapse is exceeded, the impulse propagates across the synapse.
- The threshold of a synapse changes if it is rarely/frequently activated.
53 NNs: artificial neuron
- Like a biological neuron, an artificial neuron has one or more inputs (cf. dendrites and axons directly connected to the cell body).
- The function of a synapse is described by the activation function.
- The output is generated on the basis of the activation.
54 NNs: artificial neuron
input vector x = (x1, x2, ..., xn)
weights vector w = (w1, w2, ..., wn)
activation z = F(x, w)
output value y = f(z)
[Diagram: the inputs x1 ... xn, weighted by w1 ... wn, feed into the neuron, which computes the activation z = F(x, w) and emits the output y = f(z)]
55 NNs: artificial neuron
Just a single neuron can distinguish two categories. A simple NN consisting of one single neuron can determine, for instance, whether water is contaminated or not (see the sketch below). This would be the case when specific measures exceed a threshold.
y = output
z = activation, dependent on inputs x and weights w
f = function (linear, threshold, sigmoid, etc.)
[Graph: the output y = f(z) plotted against the activation z]
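The single-neuron example as a sketch (the weights and the threshold are invented for illustration):

    def neuron(x, w, threshold=1.0):
        z = sum(xi * wi for xi, wi in zip(x, w))  # activation z = F(x, w)
        return 1 if z > threshold else 0          # output y = f(z), threshold function

    # two water-quality measures; output 1 could mean "contaminated"
    print(neuron([0.8, 0.9], [1.0, 0.7]))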
56 NNs: 2 main types
A NN is built up from single neurons. Two main types of neural networks are distinguished, which we shall discuss hereafter:
- multi-layer perceptrons (MLPs)
- Kohonen networks
57 NNs: MLP
By combining many neurons, the NN can learn very complex relationships. A standard MLP consists of the following layers (see the sketch below):
- input layer: the number of input units equals the number of signal parameters
- hidden layer (or layers): usually configured so that each unit is connected to all units in the neighbouring layers
- output layer: the number of units equals the number of categories to be distinguished.
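A sketch of the forward pass through such an MLP (sigmoid units; the layer sizes are illustrative):

    import numpy as np

    def forward(x, layers):
        """layers: list of (weight_matrix, bias) pairs, one per layer."""
        for W, b in layers:
            x = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # sigmoid activation
        return x  # output layer: one value per category

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(8, 12)), np.zeros(8)),  # hidden layer 1
              (rng.normal(size=(8, 8)), np.zeros(8)),   # hidden layer 2
              (rng.normal(size=(5, 8)), np.zeros(5))]   # 5 output categories
    print(forward(rng.normal(size=12), layers))         # 12 signal parameters in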
58 MLPs: graphic display
[Diagram: input layer → hidden layer 1 → hidden layer 2 → output layer]
59 MLPs: connections
The weights of the connections between the units are learnt in training. Learning rules determine how the (initially random) weights are optimised dependent on the distance to the required output (supervised learning). Because of the connections between the units, computation with NNs is also called "connectionism". Since the information is processed by many units in parallel, the expression "parallel distributed processing" (PDP) is also used.
60 MLPs: time
NNs are very suitable for categorising single input frames. Change over time is not handled so well. Several solutions to this problem have been suggested:
- input frame plus several context frames
- time-delay NNs
61 NNs: Kohonen networks
- In MLPs the output computed by the NN for each input is compared with the required output (supervised). On the basis of the difference between computed and required output, the weights of the connections between the units are adapted (usually by backpropagation).
- Kohonen networks, on the other hand, are unsupervised. For this reason they are also called self-organising.
62 NNs: Kohonen networks
- Kohonen networks consist only of an input and an output layer (competitive layer).
- The units in the input layer are connected to all the units in the output layer.
- All connections are weighted.
63 Kohonen networks: training
- At the start, the weights of each unit are initialised with a vector of small random values (the vector size equals the number of input parameters).
- In training, the unit whose weights are closest (Euclidean distance) to the input vector wins.
- The connection weights of the winning neuron are adapted in the direction of the input vector, without using information about the required output (unsupervised learning).
64 Kohonen networks: weights
- The unit's weights are adapted so that they better predict the input vector. A similar, but smaller adaptation is made for units which are close to the winning neuron in the self-organising map, while units which are farther away are inhibited (Mexican-hat function); see the sketch below.
- In this way, clusters build up in the network, which organise information in a topographical/phonotopic way.
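One training step as a sketch; a Gaussian neighbourhood stands in for the Mexican-hat function (a common simplification which omits the inhibition of distant units):

    import numpy as np

    def train_step(weights, grid, x, lr=0.1, radius=1.0):
        """weights: (units, dim) connection weights; grid: (units, 2) map positions."""
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # competition
        d = np.linalg.norm(grid - grid[winner], axis=1)  # distance on the map
        h = np.exp(-d ** 2 / (2 * radius ** 2))          # neighbourhood strength
        weights += lr * h[:, None] * (x - weights)       # move towards the input
        return winner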
65 Kohonen networks: calibration
- At the end of training, the Kohonen network is calibrated: to each input vector (e.g. of acoustic parameters) the required output (e.g. the speech sound) is attached.
- For each neuron, a list is created of the speech sounds by which it has been activated, together with the number of times each speech sound activated the neuron.
- At the end of the calibration, a probability is computed for each neuron that it was activated by each speech sound (see the sketch below).
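A sketch of the calibration bookkeeping (labels and array shapes are illustrative):

    from collections import Counter, defaultdict
    import numpy as np

    def calibrate(samples, weights):
        """samples: (input_vector, speech_sound) pairs; weights: (units, dim)."""
        counts = defaultdict(Counter)
        for x, sound in samples:
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            counts[winner][sound] += 1  # how often each sound activated the unit
        return {unit: {s: c / sum(ctr.values()) for s, c in ctr.items()}
                for unit, ctr in counts.items()}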
66 Kohonen nets: graphic display
67 Kohonen nets: graphic display
Phonotopic map calibrated with speech sounds (part)
68 Kohonen nets: graphic display
Phonotopic map calibrated with speech sounds (part)
69 Hybrid systems
- As we said at the beginning, neural nets are good at discriminating, but bad at modelling time.
- One way of overcoming this problem is by using a hybrid system, in which the output of the neural net is used as input to hidden Markov modelling.
70 Hybrid systems
[Diagram: two hybrid architectures; spectral parameters → Kohonen network → phone, and spectral parameters → MLP → phone]
71 References
- Lippmann, R.P. (1989). Review of neural networks for speech recognition. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, 374-392. San Mateo: Morgan Kaufmann.
- Ritter, H., Martinez, T. and Schulten, K. (1992). Neural Computation and Self-Organizing Maps, Ch. 2-4. Bonn: Addison-Wesley.
- Kohonen, T. (1988). The neural phonetic typewriter. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, 413-424. San Mateo: Morgan Kaufmann.
73 Introduction to automatic speech recognition
- Dynamic Time Warping
- Summer semester 2001
- Jacques Koreman
74 Dynamic time warping (DTW)
- Because finding invariant acoustic cues for sounds in the signal has turned out to be very difficult, many systems use a general pattern-matching approach.
- The first pattern-matching systems used dynamic time warping (DTW), also called dynamic programming.
- The recognition unit was usually the word (i.e. no continuous speech recognition!).
75 The principle
- For each word that is to be recognised, a reference pattern (template) is stored, with which all input signals are compared. The reference pattern which is most similar to the input signal is recognised.
- Advantage: coarticulatory effects which cause variation in the realisation of sounds (a problem for knowledge-based speech recognition) are modelled along with the pattern.
76 The reference pattern
- Since we know that the raw sound pressure wave which can be recorded with the microphone can differ strongly from realisation to realisation, a different representation is used for the reference pattern. It consists of parameters which represent the change in the distribution of energy across the frequency spectrum (cf. spectrogram).
77 The reference pattern: sampling rate
- In the past, the parameters were derived once every 10 or 20 ms because of memory requirements. Nowadays memory capacity is usually no longer a problem and a vector of parameters is computed every 5 ms (sometimes even once per millisecond).
78 The reference pattern: parameters
- The parametrisation of the signal can consist of:
- filterbank parameters (linear: purely acoustic description)
- filterbank parameters (logarithmic: perceptual modelling)
- LPC analysis parameters (modelling of the production system)
79 The reference pattern: example
Output of a 9-channel filterbank for one realisation of the word three and two realisations of the word eight. As expected, the two realisations of eight are more similar (Holmes, p. 104).
80 The comparison
Each reference pattern is compared with the input signal. There may be differences in loudness, intonation and temporal realisation. DTW is above all suited to modelling temporal differences.
- The effect of loudness differences (which in most cases are not relevant for the distinction between words) is reduced when the energy is represented logarithmically.
81 The comparison
- Pitch is normally not relevant for the distinction between words (apart from microprosodic effects). It is smoothed by choosing broad frequency bands and using a longer time window, so that the individual periods are pooled.
- Temporal differences arise from different speaking rates; local differences (within the word) can also occur.
82 The comparison
Differences in speaking rate can be compensated for by stretching the input pattern (if it was spoken fast) or compressing it (if it was spoken slowly) before the distance measure is computed. Possible alignments ("matchings")
83 The comparison: visualisation
Acoustic patterns and comparison measure: (a), (b) acoustic patterns x and y, (c) no temporal alignment, (d) linear and (e) non-linear alignment (Huang et al., p. 72)
84 The comparison: the diagonal
If the test word is plotted on the x-axis and the reference pattern on the y-axis, the diagonal represents the perfect path, since it means that the two word realisations are exactly alike. Every deviation from the diagonal is penalised. The sum of these deviations (for the energy in filter bands, often Euclidean distances between reference frame and input frame) yields a distance measure between the input word and the reference pattern.
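A minimal DTW sketch: accumulate the Euclidean frame distances along the best monotonic path (only non-negative slopes) between input and reference pattern:

    import numpy as np

    def dtw_distance(test, ref):
        """test: (T, dim) input frames; ref: (R, dim) reference frames."""
        T, R = len(test), len(ref)
        D = np.full((T + 1, R + 1), np.inf)  # accumulated distances
        D[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, R + 1):
                cost = np.linalg.norm(test[i - 1] - ref[j - 1])
                # predecessors allow only left-to-right, monotonic steps
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[T, R]  # distance between input word and reference pattern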
85 The comparison: visualisation
DTW path for a test and a reference word (Holmes, p. 116)
86 The comparison: constraints
- Negative slopes are not allowed (the order of realisation is always left to right).
- Side paths whose distance value becomes too large are cut off (pruning).
87 Problems
- The choice of the reference pattern for a word can strongly influence recognition.
- The assumption that the start and end points of the word are found correctly does not always hold.
- DTW is not very well suited to the recognition of continuously spoken speech, since every frame can be the start of a new word, so that the number of comparisons explodes (e.g. six teenagers versus sixteen ages).
88 References
- Holmes, J. (1988). Speech Synthesis and Recognition (Ch. 7). Wokingham (Berks.): Van Nostrand Reinhold.
- Holmes, J. (1991). Spracherkennung und Sprachsynthese (Ch. 7). München: Oldenburg.
- Huang, X.-D., Ariki, Y. and Jack, M. (1990). Hidden Markov Models for Speech Recognition, pp. 70-78. Edinburgh: Edinburgh University Press.