Title: Computer-implemented neural model of speech production and speech acquisition
1. Computer-implemented neural model of speech production and speech acquisition
- Bernd J. Kröger
- Department of Phoniatrics, Pedaudiology, and Communication Disorders, Medical Faculty of the Aachen University, RWTH Aachen, Germany
For more literature see http://www.speechtrainer.eu
2. Aachen University, Germany
Bernd J. Kröger (bkroeger_at_ukaachen.de); download of references: www.speechtrainer.eu
3. Aachen University (Hospital): 3-Tesla MR scanner (Philips)
4. Overview
- Introduction
- The 3D model
- The neural control component
- Modeling speech acquisition
- Speech perception experiments
- Results and further work
6. Speech Production
[Diagram: speech production separated into two parts]
- the controller (technical term) = the control module: the central nervous system
- the controlled system / plant = the articulatory and acoustic module: the articulators and cavities
7. Note (Birkholz model, 2006)
- We have a lot of knowledge concerning the plant:
  - articulatory geometries
  - speech acoustics
- We have much less knowledge concerning the neural control of speech articulation.
9. Neural computer-implemented models of production and acquisition, including articulation and acoustics
- Neural models of speech production
  - e.g. Levelt, Dell → neurolinguistic models
  - e.g. Guenther (2006) → neural model of the sensorimotor processes of speech production (articulation)
  - newer models: see this session (including my talk)
- Neural models of speech acquisition
  - focusing on pre-linguistic phases (babbling), i.e. the time interval before the start of the vocabulary spurt (Guenther 1995, Guenther et al. 2006, Bailly 1997)
- This talk
  - I will present results of my work on neural modeling of speech production and acquisition, which is related to the Guenther approach.
10. Overview
- Introduction
- The 3D vocal tract model as front-end part of a neural model of speech production
- The neural control component
- Modeling speech acquisition
- Speech perception experiments
- Results and further work
11. 3D Articulatory Model (Birkholz et al. 2006)
- 11 wireframe meshes representing the
  - upper cover (palate, velum, pharynx wall, …) (4 meshes)
  - lower cover (mandible, pharynx, …) (3 meshes)
  - upper and lower teeth, lips, tongue (4 meshes)
- belongs to the group of geometrical models, in comparison to statistical or biomechanical vocal tract models
12. 3D Articulatory Model (Birkholz et al. 2006)
- complete model:
  - upper and lower cover (light gray)
  - tongue and lips (dark gray)
  - upper and lower row of teeth (black)
- The model is based on 3D static and 2D dynamic MRI data of one speaker of Standard German (JD, ZAS, Berlin).
See http://www.vocaltractlab.de
13. High-Quality Acoustic Model (Birkholz et al. 2006)
- comprises modeling of the
  - subglottal tract (13 tube sections)
  - glottis (2 tube sections)
  - pharyngeal and oral tract (40 tube sections with individual lengths, not necessarily equidistant cylinder tubes)
  - nasal tract (19 tube sections) and 4 sections for the paranasal sinuses
- implemented as an equivalent electrical circuit
14. Overview
- Introduction
- The 3D model
- The neural control component
- Modeling speech acquisition
- Results
- Further work
15. The Neural Control Component: Some Basics
- Ensembles / groups of neurons representing motor, sensory, or other states constitute the central levels of the neural control component → neural maps
16. Motor states and sensory states are each based on a set of parameters
- somatosensory information: proprioceptive, tactile
- auditory information
17. Articulatory or lower-level motor parameters
10 parameters (defined a priori in a geometrical articulatory model)
- control positions of articulators relative to the positions of other articulators
- joint coordinates (vs. spatial coordinates)
18. JAA = lower jaw angle
JAA influences the position of
- the lower jaw
- the tongue body
- the tongue tip
- the lower lip
20. Proprioceptive parameters
Flesh-point locations: absolute positions of end articulators with respect to the hard palate (or cranial system / skull) → tract variables
→ spatial coordinates (vs. joint coordinates)
These lie on the border between sensory and higher-level motor representations.
21. Tactile parameters
[Figure: contact pattern for a velar closure [g].]
Contact area of movable articulators (tongue body, tongue tip, lips) with the vocal tract walls (regions: lower pharyngeal, upper pharyngeal, velar, palatal, postalveolar, alveolar).
23. Auditory parameters
[Figure: the 3D model yields an area function and a perimeter function, from which the transfer function and the formants F1, F2, F3 are computed.]
Auditory parameters: Bark values of F1, F2, F3.
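The deck does not state which Hz-to-Bark formula the model uses; Traunmüller's (1990) approximation is one common choice and can serve as a sketch:

```python
def hz_to_bark(f_hz: float) -> float:
    """Convert a frequency in Hz to the Bark scale using Traunmüller's
    approximation (an assumption; the deck does not name a formula)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Rough formant values for an [a]-like vowel (illustrative only)
for name, f in [("F1", 700.0), ("F2", 1200.0), ("F3", 2600.0)]:
    print(name, round(hz_to_bark(f), 2))
```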
24. Parameter values are directly coded by neural activations in neural maps
1) Sub-groups of neurons are defined for each parameter.
2) All sub-groups of neurons for all parameters define the whole neural map.
A specific activation of the 4 neurons of a sub-group occurs for each parameter value.
[Example: motor parameter map with the sub-groups LIH, LIP, TTA, TTL, TBA, TBL, HLH, HLV, JAA, VEH.]
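The slides only say that 4 neurons per parameter carry the value; the exact coding scheme is not given. One plausible sketch is graded (Gaussian) tuning over the parameter range, with the widths and parameter values below chosen purely for illustration:

```python
import numpy as np

def code_parameter(value: float, n_neurons: int = 4, width: float = 0.2) -> np.ndarray:
    """Code one normalized parameter value (0..1) as a graded activation
    pattern over its sub-group of neurons (Gaussian tuning curves are an
    assumption; the slides only specify 4 neurons per parameter)."""
    preferred = np.linspace(0.0, 1.0, n_neurons)  # each neuron's preferred value
    return np.exp(-((value - preferred) ** 2) / (2.0 * width ** 2))

# A whole motor map is the concatenation of all parameter sub-groups
params = {"JAA": 0.3, "LIH": 0.8}                 # hypothetical normalized values
motor_map = np.concatenate([code_parameter(v) for v in params.values()])
print(motor_map.round(2))
```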
25. The Neural Control Component: Some Basics
- Ensembles / groups of neurons representing motor, sensory, or other states constitute the central levels of the neural control component → neural maps
- Neural networks connect these neural maps → neural mappings
The organization of the whole control component:
26. The whole neural control component
[Diagram after Guenther et al. (2006): the cortical level (cerebral cortex, cortico-cortical mappings) above the subcortical and peripheral level.]
27. The neural control component
[Diagram: on the cortical level, a sound map / syllable map / word map is linked via a sound-to-motor mapping to the motor map (joint coordinates), and via sound-to-sensory mappings to the somatosensory map (proprioceptive and tactile maps) and the auditory map. On the subcortical and peripheral level, the motor state passes through neuro-muscular processing to the articulatory state, the articulatory signal, and the acoustic signal; somatosensory and auditory processing return the current state (efference copy, error signal), with processing delays of roughly 0 to 30 ms marked along the path → feedback subsystem.]
28. The neural control component
[Same diagram as on slide 27.] → Later on during speech acquisition, the feed-forward control subsystem becomes more and more active.
29. Again: differentiation of maps vs. mappings (a sample neural network)
- map: ensemble of neurons representing different (sensory or motor) states (neural representation)
- mapping: association of appropriate or related states in different maps (has to be trained / learned → adjustment of link weights)
[Figure: map 1 (e.g. sensory) → mapping from map 1 to map 2 → map 2 (e.g. motor)]
Two types of neural networks were used:
30. One-layer feed-forward networks (unidirectional)
- input neuron layer (map): sensory parameters (proprioceptive parameters TTx/TTy, HYx/HYy, TBx/TBy, JAy, LId, VEx, ULx)
- links, each neuron with every other neuron (mapping)
- output neuron layer (map): motor parameters (joint coordinates LIH, LIP, TTA, TTL, TBA, TBL, HLH, HLV, JAA, VEH)
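A one-layer feed-forward mapping of this kind can be sketched as a single weight matrix trained with the delta rule; the dimensions below are toy stand-ins (the actual net described later maps 2518 somatosensory neurons to 40 motor neurons), and the linear "plant" is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out, n_patterns = 20, 10, 200            # toy sizes, not the model's

# Synthetic training set: sensory input patterns and the motor
# patterns they should be associated with (a linear toy relation)
W_true = rng.normal(size=(n_in, n_out))
X = rng.uniform(size=(n_patterns, n_in))         # sensory states
Y = X @ W_true                                   # associated motor states

# One-layer feed-forward net: one link from each input neuron to each
# output neuron, trained with batch gradient descent (delta rule)
W = np.zeros((n_in, n_out))
for _ in range(5000):                            # batch training cycles
    error = X @ W - Y
    W -= 0.1 * X.T @ error / n_patterns

print(float(np.abs(X @ W - Y).mean()))           # residual training error
```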
31. Self-organizing maps (SOMs)
- SOM neurons form the central part of the self-organizing map (the central layer of neurons, the SOM layer: a 2D map representing an ensemble of neurons of the cerebral cortex)
- input neurons (in terms of Kohonen) → map 1 and map 2 (in our terms): neurons representing e.g. sensory and motor states
- links and link weights → part of the mapping
[Figure: map 1 and map 2 feed into the SOM layer.]
32. Self-organizing maps (SOMs)
- SOM neurons form the self-organizing map (the central layer of neurons, the SOM layer: a cortical 2D map) → part of the mapping in our approach
- input neurons (in terms of Kohonen) → map 1 and map 2 (in our terms): neurons representing e.g. sensory and motor states
- link weights → part of the mapping
Both maps (input neurons in terms of Kohonen) can be interpreted as input or output maps in our approach.
[Figure: map 1 and map 2 feed into the SOM layer.]
Note: this type of network is not unidirectional but bi-/multi-directional.
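A minimal sketch of such a bidirectional SOM, reusing the training settings quoted later in the deck (random initial weights in [0.4, 0.6], rectangular neighborhood, decaying radius and learning rate) on toy data; the data and dimensions here are stand-ins, not the model's real training set:

```python
import numpy as np

rng = np.random.default_rng(1)

side = 10                                 # 10x10 SOM layer
dim_sensory, dim_motor = 2, 9             # e.g. (F1, F2) and 9 joint coordinates
dim = dim_sensory + dim_motor

# Link weights from every SOM neuron to both maps (sensory + motor),
# initialized randomly within [0.4, 0.6] as stated on the slides
W = rng.uniform(0.4, 0.6, size=(side, side, dim))
coords = np.dstack(np.meshgrid(np.arange(side), np.arange(side), indexing="ij"))

def train_step(x, lr, radius):
    """One Kohonen step: find the best-matching unit (winner takes all)
    for the joint sensory+motor pattern x, then pull the winner and its
    rectangular neighborhood towards x."""
    dist = np.linalg.norm(W - x, axis=2)
    bmu = np.unravel_index(np.argmin(dist), dist.shape)
    nb = np.max(np.abs(coords - np.array(bmu)), axis=2) <= radius
    W[nb] += lr * (x - W[nb])

# Toy stand-in for the 540 proto-vocalic training states
training_set = rng.uniform(size=(540, dim))
lr, radius = 0.1, 5.0
for epoch in range(5):
    for x in training_set:
        train_step(x, lr, radius)
        radius *= 0.999                   # radius decay (slide value)
    lr *= 0.99                            # learning-rate decay (slide value)

def recall_motor(sensory):
    """Bidirectional use: match only on the sensory dimensions and read
    the motor part of the winner's weight vector."""
    d = np.linalg.norm(W[:, :, :dim_sensory] - sensory, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    return W[bmu][dim_sensory:]

print(recall_motor(np.array([0.5, 0.5])).shape)
```

Because the weight vector of each SOM neuron spans both maps, the same trained network can be read out in either direction (sensory → motor or motor → sensory), which is the bi-/multi-directionality noted on the slide.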
33. The Neural Control Component: Some Basics
- Ensembles of neurons representing motor, sensory, or other states constitute the central levels of the neural control component → neural maps
- Neural networks connect these neural maps → neural mappings
- Neural networks must be trained, i.e. they have to learn something during speech acquisition (and later on).
34. Example for learning / training: How does it work, e.g. during babbling? (Babbling = exploring your vocal tract)
Sensory information (somatosensory: proprioceptive, tactile; auditory) can be generated by the 3D model for each articulatory or motor state.
35-39. Learning during babbling
Produce a lot of random motor states. For each articulatory or motor state, the 3D model generates the related sensory information (somatosensory: proprioceptive, tactile; auditory), which is associated with the motor state.
40. Learning during babbling
Sensory information (somatosensory: proprioceptive, tactile; auditory representations) can be generated by the 3D model for each articulatory or motor state → this forms a training set (e.g. 4,000 states) of related motor and sensory states, from which one can be predicted from the other.
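The babbling procedure on these slides (random motor states, sensory consequences generated by the model, pairs collected as a training set) can be outlined as follows; `vocal_tract_model` below is a toy placeholder, not the real 3D articulatory/acoustic model:

```python
import numpy as np

rng = np.random.default_rng(2)

def vocal_tract_model(motor_state):
    """Placeholder for the 3D model: maps a motor state to
    (somatosensory, auditory) information. Here just a fixed nonlinear
    toy mapping; the real model computes geometry and acoustics."""
    A = np.sin(np.outer(np.arange(1, 7), np.arange(1, len(motor_state) + 1)))
    sensory = np.tanh(A @ motor_state)
    return sensory[:4], sensory[4:]      # (somatosensory, auditory) parts

# Babbling: produce a lot of random motor states and let the model
# generate the related sensory information -> motor/sensory training set
training_set = []
for _ in range(4000):                    # e.g. 4,000 states, as on the slide
    motor = rng.uniform(size=10)         # the 10 motor parameters
    somato, auditory = vocal_tract_model(motor)
    training_set.append((motor, somato, auditory))

print(len(training_set))
```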
41. Overview
- Introduction
- The 3D model
- The neural control component
- Modeling speech acquisition
- Speech perception experiments
- Results and further work
42. Modeling speech acquisition: a simple approach
- Babbling (not language-specific)
  - Mouthing
  - Proto-vocalic articulation
  - Proto-gestures
- Imitation
  - Vowels
All these phases overlap more or less in time.
43. Modeling speech acquisition: first phase of babbling → Mouthing (silent mouthing)
44. The neural control component
[Diagram as before.] A silent mouthing training set was designed for training the somatosensory-to-motor mapping.
45. Training set: silent mouthing
- combination of min, (mid,) and max values of all 10 motor parameters (Kröger et al. 2006c, DAGA Braunschweig)
- double closures and non-physiological articulations are avoided (separate subsets for lips and tongue)
→ 4608 patterns of training data
46. Training
- Design of the net: one-layer feed-forward; 2518 input neurons (somatosensory), 40 output neurons (motor) → ca. 2000 links
- Set of 4608 patterns of training data → min-max combination training set, silent mouthing
- 5,000 cycles of batch training
- → mean error of ca. 10% for the prediction of a motor state from its somatosensory state (Kröger et al. 2006b, ISSP, Ubatuba, Brazil)
- Software: Java version of SNNS (Stuttgart Neural Network Simulator), http://www-ra.informatik.uni-tuebingen.de/SNNS/
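The min-max combination idea behind the silent-mouthing set can be sketched with `itertools.product`; the closure filter below is a deliberately simplified stand-in for the deck's actual constraints (which, together with the lip/tongue subsets, yield the 4608 patterns), so the resulting count differs:

```python
from itertools import product

PARAMS = ["LIH", "LIP", "TTA", "TTL", "TBA", "TBL", "HLH", "HLV", "JAA", "VEH"]
LEVELS = (0.0, 0.5, 1.0)                 # min, mid, max (normalized)

def has_double_closure(state):
    """Simplified stand-in for the deck's constraints: treat a closure
    parameter at its maximum as a full closure and discard states that
    close the tract at more than one place at once."""
    return sum(state[p] == 1.0 for p in ("LIP", "TTA", "TBA")) > 1

states = []
for combo in product(LEVELS, repeat=len(PARAMS)):
    state = dict(zip(PARAMS, combo))
    if not has_double_closure(state):
        states.append(state)

print(len(states))                       # all min/mid/max combos minus filtered ones
```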
47. Training results: some features of motor equivalence (despite the prediction error of ca. 10%)
[Figure: labial, apical, and dorsal closures, each produced with a low and a high position of the lower jaw.]
In each column the somatosensory values are the same (except for the jaw parameter) → acoustically relevant closures are kept despite strong jaw perturbation.
49. Modeling speech acquisition: next phase of babbling → Proto-vocalic articulation
50. The neural control component
[Diagram as before.] A proto-vocalic training set was designed for training the auditory-to-motor mapping.
51. Proto-vocalic training set (→ 540 states)
- start with a front-high [i], a back-high [u], and a low [a] gesture
- use articulatory constraints: only two degrees of freedom remain, giving the 2D plane
- → the training set covers the whole (continuous, language-independent) vowel space
52. Proto-vocalic training set
Continuous articulatory space for proto-vocalic articulation; 540 states in three subsets:
- from [u] to [a] and [i]
- from [a] to [i] and [u]
- from [i] to [u] and [a]
53. Using SOMs for the auditory-to-motor mapping
- 10×10 = 100 neurons form the self-organizing map layer
- topology: rectangular ordering (not hexagonal)
- 40 input neurons → 4000 link weights
  - 36 neurons representing the 9 joint-coordinate parameters (motor representation without velum)
  - 4 neurons representing F1 and F2 (auditory representation)
- 200 cycles × 524 training patterns = 104,800 training steps
- → mean error of 0.6% for the prediction of an articulatory state (Kröger et al. 2006b)
- → a very precise mapping for predicting proto-vocalic motor states from auditory states (F1, F2)!
- standard SOM learning algorithm (Kohonen 1995 and 2001)
- initialization: random distribution of link weights within the interval [0.4, 0.6]
- SOM update radius: 5 neurons, rectangular neighborhood function; radius decay: 0.999
- learning-rate factor: 0.1; learning-rate decay factor: 0.99
- mode: winner takes all
54. Training results for the auditory-to-motor mapping: node plot of SOM link weights
[Plot: display of the auditory link-weight values (F1, F2, normalized 0.0-1.0) for each neuron; the SOM neurons form a cortical map with [i], [a], and [u] at the extremes.]
- The whole range of training data is covered.
- The topology is preserved (not many doublings/folds).
55. Training results for the auditory-to-motor mapping: bar plot of the same SOM link weights
[Plot: simply a different display of the auditory link-weight values, one bar chart per neuron of the cortical map; the [i], [a], and [u] regions are visible.]
56. Training results for the auditory-to-motor mapping: bar plot of SOM weights
1) continuous F1-F2 transitions for the [i]-[a] and [u]-[a] paths
2) ordering of proto-vocalic states with respect to phonetic criteria (high-low, front-back): phonetotopy (cp. the fMRI study of Obleser et al. 2007)
57. The neural control component
[Diagram as before.] The mappings trained so far handle static articulation.
58. Modeling speech acquisition
- Babbling
  - Mouthing (static)
  - Proto-vocalic articulation (static)
  - Proto-gestures (dynamic)
- Imitation
  - Vowels
59. The auditory-to-motor mapping for vocal-tract-closing gestures
Each training pattern is a whole gesture! Starting from different proto-vocalic positions, the complete closing gesture (motor state, sampled at t0 = 0 msec … t4 = 20 msec) is associated with the whole formant pattern (auditory state; frequency axis in Bark, 0-20).
60. The auditory-to-motor mapping for closing gestures: 10×10 SOM
- 10×10 = 100 neurons form the SOM neuron layer
- 59 input neurons → 5800 link weights
  - 56 neurons representing the formant transition (F1, F2, F3, their time deviation, and length)
  - 3 neurons representing the closure-related articulator (labial / apical / dorsal) and 4 neurons representing the starting vowel (TBx, TBy)
- set of 22 × 9 = 198 training patterns
  - closing-gestures training set based on 22 proto-vocalic articulatory states
  - production of 9 closures: 3× labial, 3× apical, 3× dorsal
- 500 cycles × 198 training patterns = 99,000 training steps
- → mean error of 0.8% for the prediction of the motor state from its formant pattern
- standard SOM learning algorithm (Kohonen 1995 and 2001)
- initialization: random distribution of link weights within the interval [0.4, 0.6]
- SOM update radius: 3 neurons, rectangular neighborhood function; radius decay: 0.999
- learning-rate factor: 0.1; learning-rate decay factor: 0.99
- mode: winner takes all
61. Bar plot of the SOM for closing / proto-VC gestures
[10×10 SOM; columns 1-3 show the SOM link weights for the motor parameter "closure-related articulator".]
62. The SOM is capable of learning different types of formant transitions for different starting vowels
[10×10 SOM; columns 1-3: SOM link weights for the motor parameter "closure-related articulator"; formant transitions: auditory SOM link weights.]
- clear separation of the motor states with respect to the closure-related articulator
63. SOM training 2 (instance 2)
Columns 1-3: closure-related articulator. Columns 4-5: SOM link weights for the motor parameter "initial vocalic state" (TBx, TBy): a: TBy → 0; i: TBx, TBy → 1; u: TBx → 0, TBy → 1.
65. SOM training 2 (brain 2)
Columns 1-3: constriction-forming articulator. Columns 4-5: SOM link weights for the motor parameter "vowel" (TBx, TBy): a: TBy → 0; i: TBx, TBy → 1; u: TBx → 0, TBy → 1.
[Map regions are labeled ba, bi, bu, da, di, du, ga, gi, gu.]
→ phonetotopic ordering with respect to the initial proto-vocalic state
66. Modeling speech acquisition
Babbling (mouthing, proto-vocalic articulation, proto-gestures) was non-language-specific learning of proto-vowels and proto-consonantal gestures. The feedback loop is now trained to a certain degree, which allows imitation → language-specific training (imitation: vowels).
67. The neural control component
[Diagram as before; the feedback loop is highlighted.] Imitation: language-specific V and VC training.
68. Training set: mother's vowels
Hypothetical language with 7 vocalic phonemes /i/, /e/, /ɛ/, /a/, /ɔ/, /o/, /u/; 100 items per vowel phoneme.
[F1-F2 plot of the phoneme clouds.]
69. Auditory mapping (F1, F2) for babbling and imitation (brain 2)
15×15 SOM; input: 4 F1/F2 neurons; 500 cycles × 540 training patterns = 270,000 training steps.
[F1-F2 plot:] the nodes of the SOM are continuously distributed over the whole vowel space (corners [i], [a], [u]).
70. Auditory mapping (F1, F2) for babbling and imitation
15×15 SOM; input: 4 F1/F2 neurons; 500 cycles × 540 training patterns = 270,000 training steps.
[F1-F2 plot with the 7 phoneme regions:] the nodes of the SOM are continuously distributed over the whole vowel space, with a shift towards the phonemic regions (perceptual magnet effect?) → concentration of net nodes at the phonemic regions.
71. 2nd example: 5-vowel system (auditory link weights)
Hypothetical language with 5 vocalic phonemes /i/, /e/, /a/, /o/, /u/.
[F1-F2 plot.] → concentration of net nodes at the phonemic regions
72. Vocalic training data
Hypothetical language with 5 vocalic phonemes /i/, /e/, /a/, /o/, /u/; broader phoneme clouds.
[F1-F2 plot.]
73. Bar plot for the same V-SOM (phonemic link weights)
Hypothetical language with 5 vocalic phonemes /i/, /e/, /a/, /o/, /u/ → clear separation of the phonemic regions.
By the way: the phonetic ordering of vowel phonemes within the map is given; it results from the ordering of nodes in the F1-F2 plane trained during babbling → phonetotopy!
74. Results of our speech acquisition modeling
- Self-organizing maps are useful for modeling the higher-level mappings: sensory-to-motor, phonemic-to-sensory (and -to-motor).
- But: the mappings shown thus far within the schematic diagram of our model are unidirectional.
75. The neural control component
[Diagram as before.] It should be modified with respect to the fact that we are using SOMs.
76. Implications for the model from doing speech acquisition
- Self-organizing maps are very successful for modeling the higher-level mappings: sensory-to-motor, lexical-to-sensory (and -to-motor).
- But:
  - The mappings shown thus far within the schematic diagram of our model are unidirectional.
  - The SOM neuron layers themselves are not represented within the schematic diagram.
77. The neural control component
[Diagram as before.] It should be modified with respect to the SOMs: the mappings are unidirectional, and the SOM layers are not represented here!
80. The neural control component (revised)
[Diagram: a phonological map (sound map / syllable map / word map) is now connected to a central phonetic map with sound- or syllable-specific layers (V, C, CV, VC, CVC) representing the SOM layers; this central phonetic map links to the motor map (joint coordinates), the somatosensory map (tactile, proprioceptive), and the auditory map. Subcortical and peripheral processing as before.]
81. The neural control component: introduce multidirectional mappings
[Diagram, task V (static): activation within the central phonetic map (sound- or syllable-specific: V, C, CV, VC, CVC) leads to co-activation of all levels: phonemic, sensory, motor. The auditory, somatosensory, and motor states are static.]
82. The neural control component
[Diagram, task VC (dynamic): activation within the central phonetic map leads to co-activation of all levels; the auditory, somatosensory, and motor states are dynamic.]
83. Now:
How is the production of a sound or syllable processed within this (new) model (after speech acquisition)?
84. The neural control component
[Diagram, task V: co-activation of all levels provides an internal idea of how the phoneme should sound and feel. The internal auditory and somatosensory states are compared with the external (fed-back) states, and corrections are made if needed.]
85. Overview
- Introduction
- The 3D model
- Motor and sensory representations
- The neural control component
- Modeling speech acquisition
- Speech perception experiments using the production model
- Results and further work
86. Idea
Due to the multidirectional mappings introduced, this production model can also be used as a perception model.
87. The neural control component
[Diagram, task V: an external acoustic signal enters via auditory processing and activates the central phonetic map.] Note: no co-activation of motor states is needed, but it may occur.
88. Question
Do we get typical effects of vowel and consonant perception using this path within the production model?
Start with a consonant perception experiment: does categorical perception occur?
89. Experiment: consonantal categorical perception
1) Generate a stimulus continuum [ab] - [ad] - [ag].
2) Train 20 different instances of the model (using different initial link-weight settings and a different ordering of the training stimuli) → 20 virtual listeners, doing identification and discrimination.
90. The neural control component
[Diagram as before.] Identification and discrimination are done on the level of the phonetic map (SOM).
91. Example: phonetic map for VC (brain 2)
Identification: the neuron with the highest degree of activation. Discrimination: the city-block distance of the neurons activated for stimuli A and B.
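The two measures can be sketched directly on a toy SOM grid (random weights stand in for a trained phonetic map; stimulus dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

GRID = 10                                 # 10x10 phonetic map (SOM)
W = rng.uniform(size=(GRID, GRID, 4))     # toy link weights to the auditory map

def activated_neuron(stimulus):
    """Identification: the neuron with the highest degree of activation,
    i.e. the best-matching unit for the auditory stimulus."""
    dist = np.linalg.norm(W - stimulus, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def discrimination(stim_a, stim_b):
    """Discrimination: city-block (Manhattan) distance between the map
    positions of the neurons activated for stimuli A and B."""
    (ra, ca), (rb, cb) = activated_neuron(stim_a), activated_neuron(stim_b)
    return abs(int(ra) - int(rb)) + abs(int(ca) - int(cb))

a, b = rng.uniform(size=4), rng.uniform(size=4)
print(discrimination(a, a), discrimination(a, b))
```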
92. Identification and discrimination scores: consonants
[Plot: percentage of identification / discrimination (0-100%) along the /b/-/d/-/g/ stimulus continuum; 20 listeners; calculated discrimination on the basis of (phonological) identification.]
CP: lower discrimination scores within the phoneme regions in comparison to the phoneme boundaries.
93. VC-SOM (brain 1)
[Map with apical, labial, and dorsal regions.] Are different instances of the model really different?
94. VC-SOM (brain 2)
[Map with dorsal, labial, and apical regions, arranged differently.] Are different instances of the model really different?
95. Vowels: categorical perception(?)
Stimulus continuum [i] - [e] - [a] → 13 stimuli (green dots in the F1-F2 plane). Training of 20 instances of the model (→ 20 listeners). Identification: the neuron with the highest degree of activation. Discrimination: the city-block distance of the neurons activated for stimuli A and B.
96. Example: phonetic map for V
Identification: the neuron with the highest degree of activation. Discrimination: the city-block distance of the neurons activated for stimuli A and B.
97. Identification and discrimination scores: vowels
[Plot: percentage of identification / discrimination (0-100%) along the /i/-/e/-/a/ stimulus continuum; 20 listeners.]
Only a slightly higher discrimination score at one phoneme boundary → CP??
And: a much higher percentage of measured than calculated discrimination within all vocalic phoneme regions, in comparison to the consonants.
98. Identification and discrimination scores: consonants
[The /b/-/d/-/g/ plot from slide 92, repeated for comparison.]
99. Calculated discrimination
- "Calculated" discrimination is discrimination based exclusively on differences in (phonemic) identification.
- If calculated ≈ really perceived (measured) discrimination → the whole acoustic content of a speech item is used for phonemic processing.
- If measured (perceived) > calculated discrimination → these sounds include additional acoustic information (e.g. vowels: phonemic and phonetic vowel quality).
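One simple way to operationalize "calculated" discrimination (the deck does not spell out its exact formula, so this is an assumption): the probability that two stimuli receive different phoneme labels, given their identification probability distributions:

```python
def calculated_discrimination(p_ident_a, p_ident_b):
    """Discrimination predicted from identification alone: the probability
    that stimuli A and B are assigned different phoneme labels, given their
    identification probability distributions (one possible formalization;
    the talk does not state its exact formula)."""
    p_same = sum(pa * pb for pa, pb in zip(p_ident_a, p_ident_b))
    return 1.0 - p_same

# Within a phoneme region: both stimuli almost surely labeled /b/
print(calculated_discrimination([0.9, 0.1, 0.0], [0.9, 0.1, 0.0]))
# Across a phoneme boundary: /b/ vs. /d/
print(calculated_discrimination([0.9, 0.1, 0.0], [0.1, 0.9, 0.0]))
```

Under this formalization, calculated discrimination is low inside phoneme regions and high at boundaries, matching the consonant pattern on the plots; vowels exceeding it indicates extra, non-phonemic acoustic information.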
100. Results: modeling perception using the production model
- The production model yields typical results of categorical perception.
- Consonants [b], [d], [g] are strongly perceived in a categorical way (i.e. strongly encoded).
- Vowels are less categorical.
101. Overview
- Introduction
- The 3D model
- Motor and sensory representations
- The neural control component
- Modeling speech acquisition
- Speech perception experiments
- Results and further work
102. Results
- A computer-implemented neural model of speech production and speech acquisition based on the Guenther (2006) approach has been introduced.
- Training of vowel and consonant production (voiced plosives in CV syllables) has been illustrated.
- This production model shows effects of speech perception straightforwardly.
- Using SOMs for cortical mappings straightforwardly leads to phonetotopy (cp. the results of the imaging experiments by Obleser et al. 2007).
103. Further work: general problems with models
- Their power is limited:
  - if the model is successful → real human processes have to be similar? (not necessarily)
  - if the model fails → the natural process has to be different? (not necessarily)
- So: the model should be validated by data (e.g. imaging studies).
- On the other hand: a model can be taken as a starting point for developing hypotheses for experiments!
104. Further work
- Include VC syllables, CVC, …
- Include other types of consonants: fricatives, nasals, …
- Include canonical babbling
- Include comprehension and imitation of first words
- Very important:
  - separation of higher-level and lower-level motor states
  - more realistic types of neural mappings (but the principle of self-organization is important!)
  - solving the normalization problem: the difference between the caretaker's and the toddler's vocal tract
105. Neural model of speech production (Kröger 2007)
[Diagram: from the mental lexicon and syllabification, a phonological plan is formed (cortical; link to comprehension). Frequent syllables are retrieved directly as motor plans (premotor, frontal lobe); infrequent syllables are assembled via auditory-phonetic processing (auditory map; temporal lobe, high-order and primary areas); prosody is added; somatosensory processing involves the parietal lobe (high-order and primary areas). Motor execution (control and corrections) involves the cerebellum, basal ganglia, and thalamus (subcortical) and the primary motor cortex (motor state). Subcortical and peripheral: auditory and somatosensory receptors and preprocessing (skin, ears, and sensory pathways); muscles and articulators (tongue, lips, jaw, velum) produce the articulatory state, the articulatory signal, and the acoustic signal.]
106. Acknowledgments
Many thanks to …
- Peter Birkholz, Department of Computer Science, University of Rostock, for developing and implementing the 3D articulatory model (PhD thesis)
- Jim Kannampuzha, student of Computer Science at RWTH Aachen, for implementing the neural control model
- the German Research Council for supporting this work (grants KR 1439/10-1 and KR 1439/13-1)
- Georg Heike and Christiane Neuschaefer-Rube for always supporting my work
107. Thanks for your attention!
Bernd J. Kröger, email: bkroeger_at_ukaachen.de, homepage and literature: http://www.speechtrainer.eu
Dona nobis pacem
108. References (see also http://www.speechtrainer.eu)
- Birkholz P, Jackel D, Kröger BJ (2006) Development and control of a 3D vocal tract model. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), Toulouse, France, pp. 873-876
- Bullock D, Grossberg S, Guenther FH (1993) A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. Journal of Cognitive Neuroscience 5: 408-435
- Guenther FH, Gjaja MN (1996) The perceptual magnet effect as an emergent property of neural map formation. Journal of the Acoustical Society of America 100: 1111-1121
- Guenther FH, Ghosh SS, Tourville JA (2006) Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96: 280-301
- Kandel ER, Schwartz JH, Jessell TM (2000) Principles of neural science. McGraw-Hill, New York
- Kohonen T (2001) Self-organizing maps. Springer, Berlin, 3rd edition
- Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006a) Modeling sensory-to-motor mappings using neural nets and a 3D articulatory speech synthesizer. Proceedings of INTERSPEECH 2006, Pittsburgh, Pennsylvania
- Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006b) Proceedings of the International Seminar on Speech Production (ISSP), Ubatuba, Brazil
- Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006c) Spatial-to-joint coordinate mapping in a neural model of speech production. Proceedings of the annual meeting of the German Acoustical Society (DAGA), Braunschweig, Germany (see also http://www.speechtrainer.eu)
- Oller DK, Eilers RE, Neal AR, Schwartz HK (1999) Precursors to speech in infancy: the prediction of speech and language disorders. Journal of Communication Disorders 32: 223-245
- Saltzman EL, Munhall KG (1989) A dynamic approach to gestural patterning in speech production. Ecological Psychology 1: 333-382