Title: Knowledge-based ASR
2 The definition of sound segments in phonetics and speech technology
Intensive course, Centre of Information Society Technologies, Sofia University St. Kliment Ohridski, 18-22 February 2002
Jacques Koreman (jkoreman_at_coli.uni-sb.de)
Bistra Andreeva (andreeva_at_coli.uni-sb.de)
Trajan Iliev (t_iliev_at_fmi.uni-sofia.bg)
Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany
3 Overview of the course
- Monday: introduction and discussion of student projects
- Tuesday: 9-11 ASR techniques: hidden Markov modelling (JK); 11-13 A formal description of Bulgarian sound segments (BA)
- Wednesday: 9-11 ASR techniques: neural networks (JK); 11-13 The acoustic description of speech signals (BA)
- Thursday: 9-11 The segmentation of speech sounds using NNs (TI); 11-13 Sound segments and their boundaries (BA)
- Friday: discussion of student projects
5 The definition of sound segments in phonetics
and speech technology
6 Goal of ASR systems
- Automatic speech recognition (ASR) systems take the microphone signal as their input and recognise utterances as a sequence of words.
- In order to achieve this, speech sounds must be recognised from the signal and matched to (a sequence of) words in the lexicon. By doing this, sound sequences which do not constitute a sequence of words are excluded from the search.
7 Goal of ASR systems
- Since the microphone signal (a record of the variations in air pressure which constitute the sound) is not a very informative representation, we first derive some sort of spectral representation from the signal. The spectrum does contain all the information we need to identify speech sounds.
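A minimal sketch of deriving such a spectral representation, assuming numpy and illustrative frame sizes (400-sample frames with a 160-sample hop, i.e. 25 ms and 10 ms at 16 kHz):

    import numpy as np

    def log_spectrogram(signal, frame_len=400, hop=160):
        """Slice the signal into overlapping frames; log-magnitude spectrum per frame."""
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            frames.append(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))
        return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)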
8 Microphone signal: spectrogram
[Spectrogram of an utterance with aligned phone labels]
9 Finding sounds in the spectrum
- Two problems must be solved to find sounds in the speech signal:
- segmentation: slicing the signal into sounds
- identification: determining which sounds were spoken
10 Segmentation
- Segmentation is difficult, because speech is produced in a single flow, i.e. there are no pauses between the words and the articulators are constantly moving from one position to the next.
- In some sound types the movement is intrinsic to the sound: glides and diphthongs. Only when people speak slowly do we find so-called steady states in the signal. The interpretation of the movement depends on context, accentuation and speaking rate.
Have you ever tried to identify the word or phone boundaries in a language you do not know?!
11 Identification
- Identification is also difficult, because no sound is ever produced the same way twice. This is due to differences between speakers (and even the same speaker will never produce a sound exactly the same way twice), context, accent and dialect, accentuation, situation (e.g. formal or informal), etc.
- Example: pan - span - ban
- Conclusion: a sound cannot always be identified on the basis of fixed cues!
12 The history of ASR systems
- The first ASR systems were knowledge-based, i.e. they used phonetic knowledge about the realisation of speech sounds to identify them in the signal. The best results were achieved if broad sound classes were detected in a first step, and then a set of matching word candidates was selected, on the basis of which a fine search for distinguishing phonetic properties was then carried out.
13 The history of ASR systems
- Although a few people still work with knowledge-based systems, most of the systems used nowadays are stochastic, i.e. they do not attempt to find specific phonetic characteristics of speech sounds in an all-or-none approach, but use probabilities of general spectral properties to compute a model for each speech sound.
- For this reason, we shall only discuss stochastic modelling techniques in this course.
14 References: knowledge-based ASR
- Broad, D. and Shoup, J. (1975). Concepts for acoustic phonetic recognition. In D. Reddy (ed.), Speech Recognition. New York: Academic Press.
- Zue, V. (1990). The use of speech knowledge in automatic speech recognition. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, pp. 200-213. San Mateo: Morgan Kaufmann Publishers, Inc.
15 References: knowledge-based ASR
- Stevens, K. (2000). From acoustic cues to segments, features and words. Proc. Int. Conf. on Spoken Lang. Proc. (ICSLP2000), Beijing.
- Reetz (1999). Converting speech signals to phonological features. Proc. of the XIVth Conf. of Phonetic Sciences (ICPhS99), San Francisco, 1733-1736.
- Lahiri (2000). Underspecified recognition. Proc. of the Conf. on Laboratory Phonology (LabPhon2000).
17 The definition of sound segments in phonetics and speech technology
- ASR techniques: hidden Markov modelling
18 Hidden Markov modelling
Hidden Markov modelling is a stochastic technique, which means that it models the variation in the signal by using probabilities. Usually, each sound (sometimes each word or word sequence) is represented by a hidden Markov model (HMM). Whole utterances are then modelled as a sequence of words, each of which constitutes a sequence of sounds. The sequence of words with the highest probability is recognised.
19 Hidden Markov modelling
The a-priori probabilities of the words (lexicon) and of word sequences (language model) play an important role in computing the most likely sequence of HMMs to have generated an acoustic signal. I shall come back to this later. But let us first look at how the match between an acoustic signal and a sequence of HMMs is determined.
20 Markov modelling (stochastic modelling)
- Markov models consist of states which are connected by transitions.
- When the automaton is in a specific state, it emits a symbol (e.g. an acoustic vector).
- Each transition between two states has a probability associated with it.
- Let's first look at a simple example, in which the states are represented by containers with coloured balls.
21 MMs: a simple example
- We start in state S, which does not emit a symbol. From there, we go to state 1 with probability 1.
- There we take a black ball from the container.
22 MMs: a simple example
- Then we either continue to the 2nd state (p = 0.4) and take a red ball, or we go to state 1 again (self-loop) and take another black ball from the container.
- We continue until we get to state E and have collected a sequence of coloured balls.
23 MMs: a simple example
- We can now compute the probability that the model displayed below generated the shown sequence of observations as
1 × 0.6 × 0.4 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.7 × 0.7 × 0.7 × 0.3
(in fact, we should have written × 1 for each ball taken from the container)
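The same computation as a sketch in Python: the probability of the observation sequence is the product of the transition probabilities along the path (each emission contributes a factor of 1, since every state emits balls of a single colour):

    from math import prod

    # transition probabilities along the path through the model above
    transitions = [1, 0.6, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.3]
    print(prod(transitions))  # probability that the model generated the sequence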
24 Hidden Markov modelling
- Hidden Markov models (HMMs) differ from Markov models in that the emissions cannot be attributed unambiguously to a particular state.
- In our example this would be the case if all three (emitting) containers were filled with red, black and yellow balls.
- The percentage of balls of the different colours can be different for the three containers, so that the colour emissions have different probabilities for each of the three states.
25 HMMs: a simple example
- We start in state S, which does not emit a symbol. From there, we go to state 1 with probability 1.
- There we take a ball from the container, which can now be red, black or yellow.
26 HMMs: a simple example
- Then we go on to the 2nd state (p = 0.4) and take a ball from the container, or we go to state 1 again and take another ball from that container.
- We continue until we get to state E and have collected a sequence of coloured balls.
27 HMMs: hidden states
- When we see a sequence of coloured balls, it is impossible to recognise with certainty in which state (from which container) each ball was taken. The states are hidden; that is why we speak of "hidden" Markov modelling. Possible state sequences include:
1 1 1 1 1 2 2 2 2 3 3 3
1 1 1 2 2 2 2 2 3 3 3 3
etc.
28 HMMs: speech recognition
- Sequence of coloured balls = sequence of acoustic frames (parameter vectors).
- It is the task of the ASR system to identify the sequence of states which is most likely to have generated/emitted the frame sequence representing an utterance. This depends on the transition and emission probabilities of the states.
29 HMMs: transitions
- In ASR, left-to-right models (as in the previous graphical representations) are used, because the acoustic events are ordered in time. Vowels, for instance, are often thought of as a sequence of onset transition, steady state and offset transition.
- If a model is trained for pauses, transitions are allowed from each state to any other state (including self-loops), because the sequence of acoustic events is random (ergodic model).
30 HMMs: emissions
- Emissions can be described using:
- a vector codebook: a fixed number of vectors is used to represent the acoustic space. They are related to states by state-specific emission probabilities.
- Gaussian mixtures: the variation in the acoustic realisation in each state is described by a normal distribution (see the sketch below).
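A sketch of the Gaussian option, assuming a diagonal covariance (one mean and variance vector per state; numpy assumed):

    import numpy as np

    def emission_likelihood(x, mean, var):
        """Likelihood of acoustic vector x under a state's diagonal Gaussian."""
        return float(np.prod(np.exp(-0.5 * (x - mean) ** 2 / var)
                             / np.sqrt(2 * np.pi * var)))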
31 HMMs: complex models
- More complex models are also used:
- parallel states and multiple mixtures can capture the variation in the realisation of speech sounds (speaker, dialect, context, etc.) more effectively.
- generalised triphones describe a speech sound in different contexts. The contexts are grouped (e.g. according to place of articulation, or on the basis of data-driven clustering techniques). The grouping reduces the requirements in terms of the size of the training corpus.
32 HMMs: speech recognition
- There are several state sequences which can generate the same signal (frame sequence). The state sequence with the highest probability is found using the Viterbi algorithm (sketched below).
- This is done for all HMMs. The HMM leading to the highest probability is recognised.
- Since we do not usually want to recognise a single speech sound, but a sequence of speech sounds, the optimisation is performed over sequences of HMMs.
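A minimal Viterbi sketch under simplifying assumptions (a single HMM, uniform entry probabilities; all names are illustrative):

    import numpy as np

    def viterbi(obs_probs, trans):
        """obs_probs: (T, N) emission probabilities per frame and state;
        trans: (N, N) transition probabilities. Returns best path and probability."""
        T, N = obs_probs.shape
        delta = np.zeros((T, N))            # best path probability so far
        psi = np.zeros((T, N), dtype=int)   # backpointers
        delta[0] = obs_probs[0]             # uniform entry assumed for simplicity
        for t in range(1, T):
            for j in range(N):
                scores = delta[t - 1] * trans[:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]] * obs_probs[t, j]
        path = [int(np.argmax(delta[-1]))]  # backtrack from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1], float(delta[-1].max())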
33 HMMs: lexicon and language model
- HMMs are now used for continuous, spontaneous speech recognition. Besides acoustic (hidden Markov) models, we also need a lexicon and a language model.
- In the lexicon, all the words (or morphemes) which the system must be able to recognise are listed with their pronunciation.
- In the language model, all the possible combinations of lexical entries are described.
34 HMMs: lexicon
- The lexicon entries consist of an orthographic word and its realisation in terms of a sequence of HMMs for speech sounds.
- In order to better cope with variation in the pronunciation of words, pronunciation variants are sometimes added to the lexicon which include reductions, epentheses and assimilations (see the sketch below).
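As a sketch, such a lexicon entry could be stored as a mapping from the orthographic word to its pronunciation variants (SAMPA-style phone labels; the variants are illustrative):

    # orthographic word -> pronunciation variants (phone sequences);
    # the first entry is the canonical form, the others are reduced variants
    lexicon = {
        "and": [["{", "n", "d"],   # canonical form
                ["@", "n"],        # reduced variant
                ["n"]],            # strongly reduced variant
    }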
35 HMMs: lexicon
- Phonological processes can lead to a change in the identity of a speech sound:
- deletion
- insertion
- assimilation
36 Phonological variation (HMMs: lexicon)
- deletion
- A speech sound which is present in the so-called canonical form (lexicon form) is not realised.
- ..., isn't it?
- (G.) Fährst du mit dem Bus?
37 Phonological variation (HMMs: lexicon)
- insertion
- A speech sound which is not present in the canonical form (lexicon form) is inserted.
- tense - tents
- (G.) Gans - Ganz
38 Phonological variation (HMMs: lexicon)
39 HMMs: lexicon
- assimilation
- The phonological identity of a speech sound changes under the influence of the (segmental or prosodic) context in which it occurs.
- input, often pronounced i[m]put (cf. the spelling of immediate)
- but not in: sometimes, some guys
40 HMMs: lexicon
- Pronunciation variants in the lexicon reduce the distance between an acoustic realisation and the lexical entry.
- At the same time, however, the distance between lexical entries becomes smaller, which can lead to the misrecognition of words. For this reason, we often add only the most frequent pronunciation variants, e.g. for function words, to improve recognition.
41 HMMs: language model
- The language model can be implemented as:
- a rule system: these have the advantage that they can lead to a better understanding of the linguistic properties of utterances.
- a probabilistic system: n-gram probabilities are computed for word sequences (see the sketch below). They generalise less and need a lot of training data. If the test condition matches the training well (text type, lexical domain, etc.), they describe the observed speech behaviour very well.
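A sketch of the probabilistic option: maximum-likelihood bigram estimates from a toy corpus (real systems smooth these counts):

    from collections import Counter

    def bigram_probs(corpus):
        """Estimate P(w2 | w1) from whitespace-tokenised sentences."""
        history, bigrams = Counter(), Counter()
        for sentence in corpus:
            words = sentence.split()
            history.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        return {pair: c / history[pair[0]] for pair, c in bigrams.items()}

    probs = bigram_probs(["recognise speech now", "recognise speech well"])
    print(probs[("recognise", "speech")])  # 1.0 in this toy corpus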
42 HMMs: language model
- Orthographically distinguishable utterances can have identical acoustic realisations. The language model can choose one of two possible readings dependent on the probabilities of the word sequences.
- Example: [ɹɛkəɡnaɪz spiːtʃ]
Recognise speech
Wreck a nice beach
43 HMMs: language model
- Orthographically distinguishable utterances can have identical acoustic realisations. The language model can choose one of two possible readings dependent on the probabilities of the word sequences.
- Example: [ɡɛtəpəteɪtəʊklɒk]
Get up at eight o'clock
Get a potato clock
44 HMMs: applications
- HMM systems are used in:
- information systems (travel information)
- hands-free telephony
- spoken input, e.g. in navigation systems
- aids for the handicapped
- dictation systems, e.g. NaturallySpeaking (Dragon), ViaVoice (IBM), FreeSpeech (Philips)
45 References
- Van Alphen, P. and van Bergem, D. (1989). Markov models and their application in speech recognition. Proceedings Institute of Phonetic Sciences, University of Amsterdam 13, 1-26.
- Holmes, J. (1988). Speech Synthesis and Recognition (Ch. 8). Wokingham (Berks.): Van Nostrand Reinhold, 129-152.
46 References
- Cox, S. (1988). Hidden Markov models for automatic speech recognition: theory and application. Br. Telecom Techn. Journal 6(2), 105-115.
- Lee, K.-F. (1989). Hidden Markov modelling: past, present, future. Proc. Eurospeech 1989, vol. 1, 148-155.
48 The definition of sound segments in phonetics and speech technology
- ASR techniques: neural networks
49 Neural networks
- Artificial neural networks are particularly suited for the classification of input signals, e.g. to recognise which sound an acoustic frame belongs to.
- They are not well suited to integrating information over time.
50 NNs: biological basis
- The building blocks of a neural net are based on the functionality of biological nerve cells, as they are found in the brain (about 10^10 neurons).
- As a simplifying statement we could say that a nerve cell does no more (nor any less) than compute a weighted sum of its inputs and create an output dependent on this weighted sum.
51 NNs: biological neuron
[Diagram of a neuron: dendrites, cell body (soma), axon, synapses on the cell body; a membrane surrounds the neuron; activation propagates along the axon]
52 NNs: biological neuron
- An axon carries the signal. It is long and can split itself several times. The ends of an axon are connected to dendrites or to the cell body of another nerve cell by means of synapses.
- Once the threshold for the electric potential of a synapse is exceeded, the impulse propagates across the synapse.
- The threshold of a synapse changes if it is rarely/frequently activated.
53 NNs: artificial neuron
- Like a biological neuron, an artificial neuron has one or more inputs (cf. dendrites and axons directly connected to the cell body).
- The function of a synapse is described by the activation function.
- The output is generated on the basis of the activation.
54 NNs: artificial neuron
input vector x = (x1, x2, ..., xn)
weights vector w = (w1, w2, ..., wn)
activation z = F(x, w)
output value y = f(z)
[Diagram: the inputs x1 ... xn, weighted by w1 ... wn, feed into the neuron, which computes the activation z = F(x, w) and emits the output y = f(z)]
55 NNs: artificial neuron
Just a single neuron can distinguish two categories. A simple NN consisting of one single neuron can determine, for instance, whether water is contaminated or not (see the sketch below). This would be the case when specific measures exceed a threshold.
y = output
z = activation, dependent on inputs x and weights w
f = function (linear, threshold, sigmoid, etc.)
[Graph: the output y = f(z) plotted against the activation z]
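The single-neuron example as a sketch (the weights and the threshold are invented for illustration):

    def neuron(x, w, threshold=1.0):
        z = sum(xi * wi for xi, wi in zip(x, w))  # activation z = F(x, w)
        return 1 if z > threshold else 0          # output y = f(z), threshold function

    # two water-quality measures; output 1 could mean "contaminated"
    print(neuron([0.8, 0.9], [1.0, 0.7]))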
56 NNs: 2 main types
A NN is built up from single neurons. Two main types of neural networks are distinguished, which we shall discuss hereafter:
- multi-layer perceptrons (MLPs)
- Kohonen networks
57 NNs: MLP
By combining many neurons, the NN can learn very complex relationships. A standard MLP consists of the following layers (see the sketch below):
- input layer: the number of input units equals the number of signal parameters
- hidden layer (or layers): usually configured so that each unit is connected to all units in the neighbouring layers
- output layer: the number of units equals the number of categories to be distinguished.
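A sketch of the forward pass through such an MLP (sigmoid units; the layer sizes are illustrative):

    import numpy as np

    def forward(x, layers):
        """layers: list of (weight_matrix, bias) pairs, one per layer."""
        for W, b in layers:
            x = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # sigmoid activation
        return x  # output layer: one value per category

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(8, 12)), np.zeros(8)),  # hidden layer 1
              (rng.normal(size=(8, 8)), np.zeros(8)),   # hidden layer 2
              (rng.normal(size=(5, 8)), np.zeros(5))]   # 5 output categories
    print(forward(rng.normal(size=12), layers))         # 12 signal parameters in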
58 MLPs: graphic display
[Diagram: input layer → hidden layer 1 → hidden layer 2 → output layer]
59 MLPs: connections
The weights of the connections between the units are learnt in training. Learning rules determine how the (initially random) weights are optimised dependent on the distance to the required output (supervised learning). Because of the connections between the units, computation with NNs is also called "connectionism". Since the information is processed by many units in parallel, the expression "parallel distributed processing" (PDP) is also used.
60 MLPs: time
NNs are very suitable for categorising single input frames. Change over time is not handled so well. Several solutions to this problem have been suggested:
- input frame plus several context frames
- time-delay NNs
61 NNs: Kohonen networks
- In MLPs the output computed by the NN for each input is compared with the required output (supervised). On the basis of the difference between computed and required output, the weights of the connections between the units are adapted (usually by backpropagation).
- Kohonen networks, on the other hand, are unsupervised. For this reason they are also called self-organising.
62 NNs: Kohonen networks
- Kohonen networks consist only of an input and an output layer (competitive layer).
- The units in the input layer are connected to all the units in the output layer.
- All connections are weighted.
63 Kohonen networks: training
- At the start, the weights of each unit are initialised with a vector of small random values (the vector size equals the number of input parameters).
- In training, the unit whose weights are closest (Euclidean distance) to the input vector wins.
- The connection weights of the winning neuron are adapted in the direction of the input vector, without using information about the required output (unsupervised learning).
64 Kohonen networks: weights
- The unit's weights are adapted so that they better predict the input vector. A similar, but smaller adaptation is made for units which are close to the winning neuron in the self-organising map, while units which are farther away are inhibited (Mexican-hat function); see the sketch below.
- In this way, clusters build up in the network, which organise information in a topographical/phonotopic way.
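One training step as a sketch; a Gaussian neighbourhood stands in for the Mexican-hat function (a common simplification which omits the inhibition of distant units):

    import numpy as np

    def train_step(weights, grid, x, lr=0.1, radius=1.0):
        """weights: (units, dim) connection weights; grid: (units, 2) map positions."""
        winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # competition
        d = np.linalg.norm(grid - grid[winner], axis=1)  # distance on the map
        h = np.exp(-d ** 2 / (2 * radius ** 2))          # neighbourhood strength
        weights += lr * h[:, None] * (x - weights)       # move towards the input
        return winner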
65 Kohonen networks: calibration
- At the end of training, the Kohonen network is calibrated: to each input vector (e.g. of acoustic parameters) the required output (e.g. the speech sound) is attached.
- For each neuron, a list is created of the speech sounds by which it has been activated, together with the number of times each speech sound activated the neuron.
- At the end of the calibration, a probability is computed for each neuron that it was activated by each speech sound (see the sketch below).
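A sketch of the calibration bookkeeping (labels and array shapes are illustrative):

    from collections import Counter, defaultdict
    import numpy as np

    def calibrate(samples, weights):
        """samples: (input_vector, speech_sound) pairs; weights: (units, dim)."""
        counts = defaultdict(Counter)
        for x, sound in samples:
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            counts[winner][sound] += 1  # how often each sound activated the unit
        return {unit: {s: c / sum(ctr.values()) for s, c in ctr.items()}
                for unit, ctr in counts.items()}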
66 Kohonen nets: graphic display
67 Kohonen nets: graphic display
Phonotopic map calibrated with speech sounds (part)
68 Kohonen nets: graphic display
Phonotopic map calibrated with speech sounds (part)
69 Hybrid systems
- As we said at the beginning, neural nets are good at discriminating, but bad at modelling time.
- One way of overcoming this problem is by using a hybrid system, in which the output of the neural net is used as input to hidden Markov modelling.
70 Hybrid systems
[Diagram: two hybrid architectures; spectral parameters → Kohonen network → phone, and spectral parameters → MLP → phone]
71 References
- Lippmann, R.P. (1989). Review of neural networks for speech recognition. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, 374-392. San Mateo: Morgan Kaufmann.
- Ritter, H., Martinez, T. and Schulten, K. (1992). Neural Computation and Self-Organizing Maps, Ch. 2-4. Bonn: Addison-Wesley.
- Kohonen, T. (1988). The neural phonetic typewriter. In A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, 413-424. San Mateo: Morgan Kaufmann.
73 Introduction to automatic speech recognition
- Dynamic Time Warping
- Summer semester 2001
- Jacques Koreman
74 Dynamic time warping (DTW)
- Because finding invariant acoustic cues for sounds in the signal has turned out to be very difficult, many systems use a general pattern-matching approach.
- The first pattern-matching systems used dynamic time warping (DTW), also called dynamic programming.
- The recognition unit was usually the word (i.e. no continuous speech recognition!).
75 The principle
- For each word that is to be recognised, a reference pattern (template) is stored, with which all input signals are compared. The reference pattern which is most similar to the input signal is recognised.
- Advantage: coarticulatory effects which cause variation in the realisation of sounds (a problem for knowledge-based speech recognition) are modelled along with the pattern.
76 The reference pattern
- Since we know that the raw sound pressure wave which can be recorded with the microphone can differ strongly from realisation to realisation, a different representation is used for the reference pattern. It consists of parameters which represent the change in the distribution of energy across the frequency spectrum (cf. spectrogram).
77 The reference pattern: sampling rate
- In the past, the parameters were derived once every 10 or 20 ms because of memory requirements. Nowadays memory capacity is usually no longer a problem and a vector of parameters is computed every 5 ms (sometimes even once per millisecond).
78 The reference pattern: parameters
- The parametrisation of the signal can consist of:
- filterbank parameters (linear: purely acoustic description)
- filterbank parameters (logarithmic: perceptual modelling)
- LPC analysis parameters (modelling of the production system)
79 The reference pattern: example
Output of a 9-channel filterbank for one realisation of the word three and two realisations of the word eight. As expected, the two realisations of eight are more similar (Holmes, p. 104).
80 The comparison
Each reference pattern is compared with the input signal. There may be differences in loudness, intonation and temporal realisation. DTW is above all suited to modelling temporal differences.
- The effect of loudness differences (which in most cases are not relevant for the distinction between words) is reduced when the energy is represented logarithmically.
81 The comparison
- Pitch is normally not relevant for the distinction between words (apart from microprosodic effects). It is smoothed by choosing broad frequency bands and using a longer time window, so that the individual periods are pooled.
- Temporal differences arise from different speaking rates; local differences (within the word) can also occur.
82 The comparison
Differences in speaking rate can be compensated for by stretching the input pattern (if it was spoken fast) or compressing it (if it was spoken slowly) before the distance measure is computed. Possible alignments ("matchings")
83 The comparison: visualisation
Acoustic patterns and comparison measure: (a), (b) acoustic patterns x and y, (c) no temporal alignment, (d) linear and (e) non-linear alignment (Huang et al., p. 72)
84 The comparison: the diagonal
If the test word is plotted on the x-axis and the reference pattern on the y-axis, the diagonal represents the perfect path, since it means that the two word realisations are exactly alike. Every deviation from the diagonal is penalised. The sum of these deviations (for the energy in filter bands, often Euclidean distances between reference frame and input frame) yields a distance measure between the input word and the reference pattern.
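A minimal DTW sketch: accumulate the Euclidean frame distances along the best monotonic path (only non-negative slopes) between input and reference pattern:

    import numpy as np

    def dtw_distance(test, ref):
        """test: (T, dim) input frames; ref: (R, dim) reference frames."""
        T, R = len(test), len(ref)
        D = np.full((T + 1, R + 1), np.inf)  # accumulated distances
        D[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, R + 1):
                cost = np.linalg.norm(test[i - 1] - ref[j - 1])
                # predecessors allow only left-to-right, monotonic steps
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[T, R]  # distance between input word and reference pattern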
85 The comparison: visualisation
DTW path for a test and a reference word (Holmes, p. 116)
86 The comparison: constraints
- Negative slopes are not allowed (the order of realisation is always left to right).
- Side paths whose distance value becomes too large are cut off (pruning).
87 Problems
- The choice of the reference pattern for a word can strongly influence recognition.
- The assumption that the start and end points of the word are found correctly does not always hold.
- DTW is not very well suited to the recognition of continuously spoken speech, since every frame can be the start of a new word, so that the number of comparisons explodes (e.g. six teenagers versus sixteen ages).
88 References
- Holmes, J. (1988). Speech Synthesis and Recognition (Ch. 7). Wokingham (Berks.): Van Nostrand Reinhold.
- Holmes, J. (1991). Spracherkennung und Sprachsynthese (Ch. 7). München: Oldenburg.
- Huang, X.-D., Ariki, Y. and Jack, M. (1990). Hidden Markov Models for Speech Recognition, pp. 70-78. Edinburgh: Edinburgh University Press.