Title: Phonetic features in ASR
1Phonetic features in ASR
- Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22-26 March 1999
- Jacques Koreman, Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany. E-mail: jkoreman_at_coli.uni-sb.de
2Organisation of the course
- Tuesday to Friday
- First half of each session: theory
- Second half of each session: practice
- Interruptions invited!!!
3Overview of the course
- 1. Variability in the signal
- 2. Phonetic features in ASR
- 3. Deriving phonetic features from the acoustic signal by a Kohonen network
- 4. ICSLP98: Exploiting transitions and focussing on linguistic properties for ASR
- 5. ICSLP98: Do phonetic features help to improve consonant identification in ASR?
4The goal of ASR systems
- Input: spectral description of the microphone signal, typically
  - energy in band-pass filters
  - LPC coefficients
  - cepstral coefficients
- Output: linguistic units, usually phones or phonemes (on the basis of which words can be recognised)
5Variability in the signal (1)
- Main problem in ASR: variability in the input signal
  Example: /k/ has very different realisations in different contexts. Its place of articulation varies from velar before back vowels to pre-velar before front vowels (compare your own articulation of "keep" and "cool").
6Variability in the signal (2)
- Main problem in ASR: variability in the input signal
  Example: /g/ in canonical form is sometimes realised as a fricative or approximant, e.g. intervocalically (OE regen > E rain). In Danish, this happens to all intervocalic voiced plosives; in addition, voiceless plosives become voiced.
7Variability in the signal (3)
- Main problem in ASR: variability in the input signal
  Example: /h/ has very different realisations in different contexts. It can be considered a voiceless realisation of the surrounding vowels (spectrograms: ihi, aha, uhu).
8Variability in the signal (3a)
(spectrograms with segment labels: i h i, a h a, u h u)
9Variability in the signal (4)
- Main problem in ASR: variability in the input signal
  Example: deletion of segments due to articulatory overlap. Friction is superimposed on the vowel signal.
- (spectrogram: G. System)
10Variability in the signal (4a)
(spectrogram with segment labels: d e p0 s i m a l z Y b0 s p0 t e m)
11Variability in the signal (5)
- Main problem in ASR: variability in the input signal
  Example: the same vowel /a/ is realised differently depending on its context.
- (spectrograms: aba, ada, aga)
12Variability in the signal (5a)
(spectrograms of aba, ada, aga; segment labels: a, b0, b, d, g)
13Modelling variability
- Hidden Markov models can represent the variable
signal characteristics of phones
(diagram: hidden Markov model with start state S and end state E; transition probabilities p1, p2, p3 and 1-p1, 1-p2, 1-p3)
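The transition structure in the diagram can be sketched in a few lines. This is only an illustration under stated assumptions: a three-state left-to-right topology in which p1, p2, p3 are taken to be the self-loop probabilities and 1-p1, 1-p2, 1-p3 the probabilities of moving on to the next state (the slide does not say which is which), and the numeric values are invented.

```python
def left_to_right_transitions(self_loop_probs):
    """Build the transition matrix of a left-to-right HMM.

    Row i holds the probabilities of leaving emitting state i.
    The extra last column stands for the non-emitting end state E.
    """
    n = len(self_loop_probs)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i, p in enumerate(self_loop_probs):
        matrix[i][i] = p            # ASSUMPTION: p_i is the self-loop
        matrix[i][i + 1] = 1.0 - p  # move on to the next state (or to E)
    return matrix


def expected_duration_frames(self_loop_probs):
    """Expected number of frames spent in the model: sum of 1 / (1 - p_i)."""
    return sum(1.0 / (1.0 - p) for p in self_loop_probs)


# Hypothetical values for p1, p2, p3 (not taken from the slides).
P = [0.5, 0.75, 0.5]
A = left_to_right_transitions(P)
DURATION = expected_duration_frames(P)
```

The self-loops are what lets one phone model absorb realisations of different durations: a higher p_i means the model expects to stay longer in that part of the phone.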
14Lexicon and language model (1)
- Linguistic knowledge about phone sequences (lexicon, language model) improves word recognition
- Without linguistic knowledge, phone accuracy is low
15Lexicon and language model (2)
- Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed.
- Example: r??????sp???
  - Recognise speech
  - Wreck a nice beach
16Lexicon and language model (3)
- Using a lexicon and/or language model is not a top-down solution to all problems: sometimes pragmatic knowledge is needed.
- Example: ???????????????
  - Get up at eight o'clock
  - Get a potato clock
17CONCLUSIONS
- The acoustic parameters (e.g. MFCC) are very variable.
- We must try to improve phone accuracy by extracting linguistic information.
- Rationale: word recognition rates will increase if phone accuracy improves
- BUT: not all our problems can be solved
18Phonetic features in ASR
- Assumption: phone accuracy can be improved by deriving phonetic features from the spectral representation of the speech signal
- What are phonetic features?
19A phonetic description of sounds
20A phonetic description of sounds
- The articulation of consonants
velum ( soft palate)
tongue
21A phonetic description of sounds
- The articulation of vowels
22Phonetic features IPA
- IPA (International Phonetic Alphabet) chart
  - consonants and vowels
  - only phonemic distinctions
  (http://www.arts.gla.ac.uk/IPA/ipa.html)
23The IPA chart (consonants)
24The IPA chart (other consonants)
25The IPA chart (non-pulm. cons.)
26The IPA chart (vowels)
27The IPA chart (diacritics)
28IPA features (obstruents)
         lab den alv pal vel uvu glo plo fri nas lat apr tri voi
p0         0   0   0   0   0  -1  -1   1   0   0   0   0   0  -1
b0         0  -1   0   0   0  -1  -1   1   0   0   0   0   0   1
p          1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1
t         -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1
k         -1  -1  -1  -1   1  -1  -1   1  -1  -1  -1  -1  -1  -1
b          1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1   1
d         -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1   1
g         -1  -1  -1  -1   1  -1  -1   1  -1  -1  -1  -1  -1   1
f          1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
T         -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
s         -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
S         -1  -1   1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
C         -1  -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1
x         -1  -1  -1  -1   1  -1  -1  -1   1  -1  -1  -1  -1  -1
vfri       1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
vapr       1   1  -1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
Dfri      -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
z         -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
Z         -1  -1   1   1  -1  -1  -1  -1   1  -1  -1  -1  -1   1
29IPA features (sonorants)
         lab den alv pal vel uvu glo plo fri nas lat apr tri voi
m          1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
n         -1  -1   1  -1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
J         -1  -1  -1   1  -1  -1  -1  -1  -1   1  -1  -1  -1   1
N         -1  -1  -1  -1   1  -1  -1  -1  -1   1  -1  -1  -1   1
l         -1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1   1
L         -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1   1  -1   1
rret      -1  -1   1  -1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
ralv      -1  -1   1  -1  -1  -1  -1  -1  -1  -1  -1  -1   1   1
Ruvu      -1  -1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1   1
j         -1  -1  -1   1  -1  -1  -1  -1  -1  -1  -1   1  -1   1
w          1  -1  -1  -1   1  -1  -1  -1  -1  -1  -1   1  -1   1
h         -1  -1  -1  -1  -1  -1   1  -1   1  -1  -1  -1  -1  -1
...        0   0   0   0   0   0   0   0   0   0   0   0   0   0
A zero value is assigned to all vowel features (not listed here)
30IPA features (vowels)
         mid ope fro cen rou
i         -1  -1   1  -1  -1
I         -1  -1   1   1  -1
y         -1  -1   1  -1   1
Y         -1  -1   1   1   1
u         -1  -1  -1  -1   1
U         -1  -1  -1   1   1
e          1  -1   1  -1  -1
2          1  -1   1  -1   1
o          1  -1  -1  -1   1
O          1   1  -1  -1   1
V          1   1  -1  -1  -1
Q         -1   1  -1  -1   1
Uschwa     1  -1  -1   1   1
?         -1   1   1  -1  -1
a         -1   1   1   1  -1
A         -1   1  -1  -1  -1
E          1   1   1  -1  -1
9          1   1   1  -1   1
3          1   1   1   1  -1
@          1   1  -1   1  -1
6         -1   1  -1   1  -1
A zero value is assigned to all consonant features (not listed here)
31Phonetic features
- Phonetic features
  - different systems (JFH, SPE, art. feat.)
  - distinction between natural classes which undergo the same phonological processes
32SPE features (obstruents)
         cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
p0         1  -1  -1  -1  -1   0   0   0  -1   0   0  -1  -1  -1  -1   1
b0         1  -1  -1  -1  -1   0   0   0  -1   0   0  -1   1  -1  -1  -1
p          1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1  -1  -1  -1   1
b          1  -1  -1  -1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1  -1
tden       1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
t          1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1  -1  -1  -1   1
d          1  -1  -1  -1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1  -1
k          1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1  -1  -1  -1   1
g          1  -1  -1  -1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1  -1
f          1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1  -1  -1   1   1
vfri       1  -1  -1  -1  -1  -1   0  -1  -1   1  -1   1   1  -1   1  -1
T          1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1  -1   1
Dfri       1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1  -1  -1
s          1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1  -1  -1   1   1
z          1  -1  -1  -1  -1  -1   0  -1  -1   1   1   1   1  -1   1  -1
S          1  -1  -1  -1  -1   1   0  -1  -1  -1   1   1  -1  -1   1   1
33SPE features (sonorants)
         cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
m          1  -1   1   1  -1  -1   0  -1  -1   1  -1  -1   1  -1  -1   0
n          1  -1   1   1  -1  -1   0  -1  -1   1   1  -1   1  -1  -1   0
J          1  -1   1   1  -1   1   0  -1  -1  -1  -1  -1   1  -1  -1   0
N          1  -1   1   1  -1   1   0   1  -1  -1  -1  -1   1  -1  -1   0
l          1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1   1  -1   0
L          1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1   1  -1   0
ralv       1  -1  -1   1  -1  -1   0  -1  -1   1   1   1   1  -1  -1   0
Ruvu       1  -1  -1   1  -1  -1   0   1  -1  -1  -1   1   1  -1  -1   0
rret       1  -1  -1   1  -1  -1   0  -1  -1  -1   1   1   1  -1  -1   0
j         -1  -1  -1   1  -1   1   0  -1  -1  -1  -1   1   1  -1  -1   0
vapr      -1  -1  -1   1  -1  -1   0  -1  -1   1  -1   1   1  -1  -1   0
w         -1  -1  -1   1  -1   1   0   1   1   1  -1   1   1  -1  -1   0
h         -1  -1  -1   1   1  -1   0  -1  -1  -1  -1   1  -1  -1  -1   0
XXX        0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
34SPE features (vowels)
         cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten
i         -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1   1
I         -1   1  -1   1  -1   1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
e         -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
E         -1   1  -1   1  -1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
?         -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1  -1
a         -1   1  -1   1   1  -1  -1  -1  -1  -1  -1   1   1  -1  -1   1
y         -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1   1
Y         -1   1  -1   1  -1   1  -1  -1   1  -1  -1   1   1  -1  -1  -1
2         -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1   1
9         -1   1  -1   1  -1  -1  -1  -1   1  -1  -1   1   1  -1  -1  -1
A         -1   1  -1   1   1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
Q         -1   1  -1   1   1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
V         -1   1  -1   1  -1  -1  -1   1  -1  -1  -1   1   1  -1  -1  -1
O         -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1  -1
o         -1   1  -1   1  -1  -1  -1   1   1  -1  -1   1   1  -1  -1   1
U         -1   1  -1   1  -1   1  -1   1   1  -1  -1   1   1  -1  -1  -1
35CONCLUSION
- Different feature matrices have different
implications for relations between phones
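A small sketch of what this means in practice, using the rows for p, t and k copied from the IPA and SPE feature matrices on the preceding slides. Counting differing features as a distance measure is an assumption made here for illustration; it is not a measure used in the course.

```python
# Rows copied from the IPA matrix (lab den alv pal vel uvu glo plo fri nas lat apr tri voi).
IPA = {
    "p": [1, -1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
    "t": [-1, -1, 1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
    "k": [-1, -1, -1, -1, 1, -1, -1, 1, -1, -1, -1, -1, -1, -1],
}
# Rows copied from the SPE matrix (cns syl nas son low hig cen bac rou ant cor cnt voi lat str ten).
SPE = {
    "p": [1, -1, -1, -1, -1, -1, 0, -1, -1, 1, -1, -1, -1, -1, -1, 1],
    "t": [1, -1, -1, -1, -1, -1, 0, -1, -1, 1, 1, -1, -1, -1, -1, 1],
    "k": [1, -1, -1, -1, -1, 1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1],
}


def differing_features(matrix, a, b):
    """Number of features on which phones a and b have different values."""
    return sum(1 for x, y in zip(matrix[a], matrix[b]) if x != y)


for pair in [("p", "t"), ("t", "k"), ("p", "k")]:
    print(pair, differing_features(IPA, *pair), differing_features(SPE, *pair))
```

In the IPA matrix the three plosives are equally far apart (two features each), whereas in the SPE matrix p and t differ in one feature only and t and k in four, so the two systems imply different degrees of similarity between the same phones.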
36Kohonen networks
- Kohonen networks are unsupervised neural networks
- Our Kohonen networks take vectors of acoustic parameters (MFCC_E_D) as input and output phonetic feature vectors
- Network size: 50 x 50 neurons
37Training the Kohonen network
- 1. Self-organisation results in a phonotopic map
- 2. Phone calibration attaches an array of phones to each winning neuron
- 3. Feature calibration replaces the array of phones by an array of phonetic feature vectors
- 4. Averaging of the phonetic feature vectors for each neuron
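Steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the course software: step 1 (self-organisation) is assumed to have been done already, so the winning neuron for each labelled training frame is given, and the function name, the neuron coordinates and the two-feature toy vectors (lab, voi) are hypothetical.

```python
from collections import defaultdict


def calibrate(winners_and_phones, phone_features):
    """Return {neuron: averaged phonetic feature vector}.

    winners_and_phones: (winning neuron, phone label) for each training frame
    phone_features:     {phone label: phonetic feature vector}
    """
    # Step 2: phone calibration, attach the phones to each winning neuron.
    phones_per_neuron = defaultdict(list)
    for neuron, phone in winners_and_phones:
        phones_per_neuron[neuron].append(phone)

    averaged = {}
    for neuron, phones in phones_per_neuron.items():
        # Step 3: feature calibration, replace phones by feature vectors.
        vectors = [phone_features[p] for p in phones]
        # Step 4: average the feature vectors attached to this neuron.
        averaged[neuron] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return averaged


# Toy example with two features (lab, voi); neuron coordinates are invented.
toy_features = {"p": [1, -1], "b": [1, 1], "t": [-1, -1]}
toy_frames = [((0, 0), "p"), ((0, 0), "b"), ((0, 1), "t")]
toy_map = calibrate(toy_frames, toy_features)
```

A neuron that wins for both p and b ends up with lab = 1 and voi = 0, i.e. it is certain about place but undecided about voicing.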
38Mapping with the Kohonen network
- The acoustic parameter vector belonging to one frame activates a neuron
- The weighted average of the phonetic feature vectors attached to the winning neuron and its K nearest neurons is output
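The mapping step could look roughly like this. The slide does not specify how "nearest" is measured or how the average is weighted, so this sketch makes two assumptions that are also marked in the code: the K nearest neurons are the next-best matches in acoustic space, and the weights are inverse distances. All names are hypothetical.

```python
import math


def map_frame(frame, codebook, neuron_features, k=3):
    """Map one acoustic frame onto a phonetic feature vector.

    codebook:        {neuron: weight vector in acoustic parameter space}
    neuron_features: {neuron: averaged phonetic feature vector}
    """
    # Rank all neurons by their distance to the frame; the first is the winner.
    ranked = sorted((math.dist(frame, w), n) for n, w in codebook.items())
    # ASSUMPTION: "K nearest" means the K next-best matches in acoustic space.
    selected = ranked[: k + 1]
    # ASSUMPTION: inverse-distance weighting (the slide gives no scheme).
    weights = [1.0 / (d + 1e-9) for d, _ in selected]
    total = sum(weights)

    dim = len(neuron_features[selected[0][1]])
    output = [0.0] * dim
    for weight, (_, neuron) in zip(weights, selected):
        for i, value in enumerate(neuron_features[neuron]):
            output[i] += weight * value / total
    return output


# Toy one-dimensional example with invented values.
toy_codebook = {"a": [0.0], "b": [1.0], "c": [10.0]}
toy_feats = {"a": [1.0], "b": [-1.0], "c": [0.0]}
winner_only = map_frame([0.0], toy_codebook, toy_feats, k=0)
halfway = map_frame([0.5], toy_codebook, toy_feats, k=1)
```

A frame that lies halfway between two neurons with opposite feature values receives a value near zero, which is how the output can express uncertainty about a feature.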
39Advantages of Kohonen networks
- Reduction of feature dimensions possible
- Mapping onto linguistically meaningful dimensions (phonetically less severe confusions)
- Many-to-one mapping allows mapping of different allophones (acoustic variability) onto the same phonetic feature values
- Automatic and fast mapping
40Disadvantages of Kohonen networks
- They need to be trained on manually segmented and labelled material
- BUT: cross-language training has been shown to be successful
41Hybrid ASR system
(system diagram; box labels: MFCCs, energy; delta parameters; Kohonen network; phonetic features; BASELINE; hidden Markov modelling; lexicon; language model; phone)
42CONCLUSION
- Acoustic-phonetic mapping extracts linguistically
relevant information from the variable input
signal.
43ICSLP98
Exploiting transitions and focussing on linguistic properties for ASR
Jacques Koreman, William J. Barry, Bistra Andreeva
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany
44INTRODUCTION
45INTRODUCTION
No lexicon or language model
The controlled experiments presented here reflect
our general aim of using phonetic knowledge to
improve the ASR system architecture. In order to
evaluate the effect of the changes in bottom-up
processing, no lexicon or language model is used.
Both improve phone identification in a top-down
manner by preventing the identification of
inadmissible words (lexical gaps or phonotactic
restrictions) or word sequences.
46DATA
Texts
English, German, Italian and Dutch texts from the EUROM0 database, read by 2 male and 2 female speakers per language
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
47DATA
Signals
- 12 mel-frequency cepstral coefficients (MFCCs)
- energy
- corresponding delta parameters
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
16 kHz microphone signals
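The front-end settings translate into sample counts as follows: at 16 kHz a 15 ms window is 240 samples and a 5 ms step is 80 samples. The sketch below shows only pre-emphasis, framing and windowing; applying pre-emphasis to the whole signal before framing and the standard Hamming coefficients 0.54 and 0.46 are assumptions, and the MFCC computation itself is not shown.

```python
import math

SAMPLE_RATE = 16000   # Hz
WINDOW_MS = 15        # Hamming window length
STEP_MS = 5           # step size
PRE_EMPHASIS = 0.97


def frames(signal):
    """Pre-emphasise the signal and cut it into Hamming-windowed frames."""
    if not signal:
        return []
    # Pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
    emphasised = [signal[0]] + [
        signal[n] - PRE_EMPHASIS * signal[n - 1] for n in range(1, len(signal))
    ]
    win_len = SAMPLE_RATE * WINDOW_MS // 1000   # 240 samples
    step = SAMPLE_RATE * STEP_MS // 1000        # 80 samples
    window = [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (win_len - 1))
        for n in range(win_len)
    ]
    result = []
    for start in range(0, len(emphasised) - win_len + 1, step):
        chunk = emphasised[start:start + win_len]
        result.append([s * w for s, w in zip(chunk, window)])
    return result


one_second = frames([1.0] * SAMPLE_RATE)
```

With these settings successive frames overlap by two thirds, so each stretch of signal contributes to three consecutive parameter vectors.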
48DATA
Labels
- Intervocalic consonants labelled with SAMPA symbols, except plosives and affricates, which are divided into closure and frication subphone units
- 35-ms vowel transitions labelled as
  i_lab, alv_O (experiment 1)
  V_lab, alv_V (experiment 2)
  where lab, alv = consonants generalized across place; V = generalized vowel
49EXPERIMENT 1 SYSTEM
50EXPERIMENT 1 RESULTS
51EXPERIMENT 1 CONCLUSIONS
- When vowel transitions are used:
  - consonant identification rate improves
  - place is better identified
  - manner is identified worse, because hidden Markov models for vowel transitions generalize across all consonants sharing the same place of articulation (solution: do not pool consonants sharing the same place of articulation)
- Vowel transitions can be exploited for identification of the consonant, particularly its place of articulation
52EXPERIMENT 2 SYSTEM
(system diagram; box labels: MFCCs, energy; delta parameters; Kohonen network; phonetic features; BASELINE; hidden Markov modelling; lexicon; language model; consonant; C)
53EXPERIMENT 2 RESULTS
54EXPERIMENT 2 CONCLUSIONS
- When acoustic-phonetic mapping is applied
- consonant identification rate improves strongly
- place better identified
- manner better identified
- phonetic features better address linguistically
relevant information than acoustic parameters
55EXPERIMENT 3 SYSTEM
(system diagram; box labels: MFCCs, energy; delta parameters; Kohonen network; phonetic features; BASELINE; hidden Markov modelling; lexicon; language model; consonant; Voffset - C - Vonset; C)
56EXPERIMENT 3 RESULTS
57EXPERIMENT 3 CONCLUSIONS
When transitions are used for acoustic-phonetic mapping:
- consonant identification rate does not improve
- place identification improves slightly
- manner identification rate decreases slightly
Possible reasons why vowel transitions do not increase the identification rate:
- the baseline identification rate is already high
- vowel transitions are undertrained in the Kohonen networks
58INTERPRETATION (1)
- The greatest improvement in consonant identification is achieved in experiment 2. By mapping acoustically different realisations of consonants onto more similar phonetic features, the input to hidden Markov modelling becomes more homogeneous, leading to a higher consonant identification rate.
- Using vowel transitions also leads to a higher consonant identification rate in experiment 1. It was shown that particularly the consonant's place is identified better. These findings confirm the importance of transitions as known from perceptual experiments.
59INTERPRETATION (2)
- The additional use of vowel transitions when acoustic-phonetic mapping is applied does not improve the identification results. Two possible explanations for this have been suggested:
  - the identification rates are high anyway when mapping is applied, so that it is less likely that large improvements are found
  - the generalized vowel transitions are undertrained in the Kohonen networks, because the intrinsically variable frames are spread over a larger area in the phonotopic map
- The latter interpretation is currently being verified by Sibylle Kötzer by applying the methodology to a larger database (TIMIT).
60REFERENCES (1)
Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th Eurospeech, 1411-1414.
Cassidy, S. & Harrington, J. (1995). The place of articulation distinction in voiced oral stops: evidence from burst spectra and formant transitions. Phonetica 52, 263-284.
Delattre, P., Liberman, A. & Cooper, F. (1955). Acoustic loci and transitional cues for consonants. JASA 27(4), 769-773.
Furui, S. (1986). On the role of spectral transitions for speech perception. JASA 80(4), 1016-1025.
Koreman, J., Andreeva, B. & Barry, W.J. (1998). Do phonetic features help to improve consonant identification in ASR? Proc. ICSLP.
61REFERENCES (2)
Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3, 83-109. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Koreman, J., Erriquez, A. & Barry, W.J. (to appear). On the selective use of acoustic parameters for consonant identification. PHONUS 4. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Stevens, K. & Blumstein, S. (1978). Invariant cues for place of articulation in stop consonants. JASA 64(5), 1358-1368.
62SUMMARY
- Acoustic-phonetic mapping by a Kohonen network
improves consonant identification rates.
63ICSLP98
Do phonetic features help to improve consonant identification in ASR?
Jacques Koreman, Bistra Andreeva, William J. Barry
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany
64INTRODUCTION
Variation in the acoustic signal is not a problem for human perception, but causes inhomogeneity in the phone models for ASR, leading to poor consonant identification. We should
"directly target the linguistic information in the signal and ... minimize other extra-linguistic information that may yield large speech variability" (Bitar & Espy-Wilson 1995a, p. 1411).
Bitar & Espy-Wilson do this by using a knowledge-based event-seeking approach for extracting phonetic features from the microphone signal on the basis of acoustic cues. We propose an acoustic-phonetic mapping procedure on the basis of a Kohonen network.
65DATA
Texts
English, German, Italian and Dutch texts from the EUROM0 database, read by 2 male and 2 female speakers per language
66DATA
Signals
- 12 mel-frequency cepstral coefficients (MFCCs)
- energy
- corresponding delta parameters
Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
16 kHz microphone signals
67DATA (1)
Labels
The consonants were transcribed with SAMPA symbols, except:
- plosives and affricates are subdivided into a closure (p0 = voiceless closure; b0 = voiced closure) and a burst-plus-aspiration (p, t, k) or frication part (f, s, S, z, Z)
- Italian geminates were pooled with non-geminates to prevent undertraining of geminate consonants
- The Dutch voiced velar fricative ?, which only occurs in some dialects, was pooled with its voiceless counterpart x to prevent undertraining
68DATA (2)
Labels
- SAMPA symbols are phonemic within a language, but can represent different allophones cross-linguistically. These were relabelled as shown in the table below

SAMPA  allophone  label  description        language
r      ?          rapr   alv. approx.       English
r      r          ralv   alveolar trill     It., Dutch
r      ?          Ruvu   uvular trill       G., Dutch
v      ?          vapr   labiod. approx.    German
v      v          vfri   vd. labiod. fric.  E., It., NL
w      ?          vapr   labiod. approx.    Dutch
w      w          w      bilab. approx.     Engl., It.
69SYSTEM ARCHITECTURE
(system diagram; box labels: MFCCs, energy; delta parameters; Kohonen network; phonetic features; BASELINE; hidden Markov modelling; lexicon; language model; consonant; C)
70CONFUSIONS BASELINE
(by Attilio Erriquez)
phonetic categories: manner, place, voicing
(figure legend: 1 category wrong; 2 categories wrong; 3 categories wrong)
71CONFUSIONS MAPPING
(by Attilio Erriquez)
phonetic categories: manner, place, voicing
(figure legend: 1 category wrong; 2 categories wrong; 3 categories wrong)
72ACIS
The Average Correct Identification Score
compensates for the number of occurrences in the
database, giving each consonant equal weight. It
is the total of all percentage numbers along the
diagonal of the confusion matrix divided by the
number of consonants.
Baseline system: 31.22
Mapping system: 68.47
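The definition of ACIS can be written out as a short sketch. The confusion counts are invented to show the effect of the equal weighting, and the orientation of the matrix (rows = produced consonant, columns = identified consonant) is an assumption.

```python
def to_row_percentages(counts):
    """Convert a confusion matrix of counts into row-wise percentages."""
    return [[100.0 * c / sum(row) for c in row] for row in counts]


def acis(percentages):
    """Sum of the diagonal percentages divided by the number of consonants."""
    n = len(percentages)
    return sum(percentages[i][i] for i in range(n)) / n


# Hypothetical counts: rows = produced consonant, columns = identified consonant.
counts = [
    [8, 2],     # rare consonant: 10 tokens, 80 % correct
    [50, 50],   # frequent consonant: 100 tokens, 50 % correct
]
score = acis(to_row_percentages(counts))
overall = 100.0 * (8 + 50) / (10 + 100)   # plain token accuracy, for comparison
```

In this toy case ACIS is 65.0 while the plain token accuracy is about 52.7, because the rare consonant counts as much as the frequent one.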
73BASELINE SYSTEM
- good identification of language-specific phones
  - reason: acoustic homogeneity
- poor identification of other phones
74MAPPING SYSTEM
- good identification, also of acoustically variable phones
  - reason: variable acoustic parameters are mapped onto homogeneous, distinctive phonetic features
75AFFRICATES (1)
cons.  correct (baseline)  correct (mapping)  language
pf       0.0               100.0              German
f        1.2                64.4              all
ts       0.0                72.2              German, It.
s        3.1                64.7              all
t?       0.0                40.2              E., G., It.
?       78.1                90.6              all
dz       0.0                70.3              Italian
z       10.4                50.5              all
d?      28.0                96.0              English, It.
?       no intervocalic realisations
76AFFRICATES (2)
- affricates, although restricted to fewer languages, are recognised poorly in the baseline system
- reason: they are broken up into closure and frication segments, which are trained separately in the Kohonen networks; these segments occur in all languages and are acoustically variable, leading to poor identification
- this is corroborated by the poor identification rates for fricatives in the baseline system (exception: /?/, which only occurs rarely)
- after mapping, both fricatives and affricates are identified well
77APMS
APMS = phonetic misidentification coefficient / sum of the misidentification percentages
The Average Phonetic Misidentification Score gives a measure of the severity of the consonant confusions in terms of phonetic features. The phonetic misidentification coefficient is the sum of all products of the misidentification percentages (in the non-diagonal cells) times the number of misidentified phonetic categories (manner, place and voicing). It is divided by the total of all the percentage numbers in the non-diagonal cells.
Baseline system: 1.79
Mapping system: 1.57
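As a worked example of this definition, the sketch below computes APMS for an invented three-consonant confusion matrix. The category labels, the percentages and the function names are hypothetical; only the formula follows the slide.

```python
# Hypothetical (manner, place, voicing) labels for three consonants.
CATEGORIES = {
    "p": ("plosive", "labial", "voiceless"),
    "b": ("plosive", "labial", "voiced"),
    "s": ("fricative", "alveolar", "voiceless"),
}


def categories_wrong(produced, identified):
    """Number of phonetic categories (manner, place, voicing) that differ."""
    pairs = zip(CATEGORIES[produced], CATEGORIES[identified])
    return sum(1 for a, b in pairs if a != b)


def apms(labels, percentages):
    """Phonetic misidentification coefficient / sum of misidentification %."""
    coefficient = 0.0
    total = 0.0
    for i, produced in enumerate(labels):
        for j, identified in enumerate(labels):
            if i == j:
                continue  # only the non-diagonal cells count
            pct = percentages[i][j]
            coefficient += pct * categories_wrong(produced, identified)
            total += pct
    return coefficient / total


labels = ["p", "b", "s"]
# Invented row percentages: rows = produced, columns = identified.
confusions = [
    [70.0, 20.0, 10.0],
    [30.0, 60.0, 10.0],
    [0.0, 0.0, 100.0],
]
value = apms(labels, confusions)   # (20*1 + 10*2 + 30*1 + 10*3) / 70
```

The score lies between 1 (every confusion differs in one category only) and 3 (every confusion differs in manner, place and voicing), so lower values mean phonetically less severe confusions.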
78APMS
APMS = phonetic misidentification coefficient / sum of the misidentification percentages
- after mapping, the incorrectly identified consonant is on average closer to the phonetic identity of the consonant which was produced
- reason: the Kohonen network is able to extract linguistically distinctive phonetic features which allow for a better separation of the consonants in hidden Markov modelling.
79CONSONANT CONFUSIONS
BASELINE
MAPPING
80CONCLUSIONS
Acoustic-phonetic mapping helps to address
linguistically relevant information in the speech
signal, ignoring extra-linguistic sources of
variation. The advantages of mapping are
reflected in the two measures which we have
presented
- ACIS shows that mapping leads to better consonant
identification rates for all except a few of the
language-specific consonants. The improvement can
be put down to the system's ability to map
acoustically variable consonant realisations to
more homogeneous phonetic feature vectors.
81CONCLUSIONS
Acoustic-phonetic mapping helps to address
linguistically relevant information in the speech
signal, ignoring extra-linguistic sources of
variation. The advantages of mapping are
reflected in the two measures which we have
presented
- APMS shows that the confusions which occur in the
mapping experiment are less severe than in the
baseline experiment from a phonetic point of
view. There are fewer confusions on the phonetic
dimensions manner, place and voicing when mapping
is applied, because the system focuses on
distinctive information in the acoustic signals.
82REFERENCES (1)
Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th European Conference on Speech Communication and Technology, 1411-1414.
Bitar, N. & Espy-Wilson, C. (1995b). A signal representation of speech based on phonetic features. Proc. 5th Annual Dual-Use Techn. and Applications Conf., 310-315.
Kirchhoff, K. (1996). Syllable-level desynchronisation of phonetic features for speech recognition. Proc. ICSLP, 2274-2276.
Dalsgaard, P. (1992). Phoneme label alignment using acoustic-phonetic features and Gaussian probability density functions. Computer Speech and Language 6, 303-329.
83REFERENCES (2)
Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3, 83-109. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Koreman, J., Barry, W.J. & Andreeva, B. (1998). Exploiting transitions and focussing on linguistic properties for ASR. Proc. ICSLP (these proceedings).
84SUMMARY
Acoustic-phonetic mapping leads to fewer and
phonetically less severe consonant confusions.
85THE END