Phonetic features in ASR

Transcript and Presenter's Notes
1
Phonetic features in ASR
  • Intensive course, Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, 22-26 March 1999
  • Jacques Koreman, Institute of Phonetics, University of the Saarland, P.O. Box 15 11 50, D-66041 Saarbrücken, Germany. E-mail: jkoreman@coli.uni-sb.de

2
Organisation of the course
  • Tuesday - Friday: first half of each session theory, second half of each session practice
  • Interruptions invited!!!

3
Overview of the course
  • 1. Variability in the signal
  • 2. Phonetic features in ASR
  • 3. Deriving phonetic features from the acoustic
    signal by a Kohonen network
  • 4. ICSLP98: Exploiting transitions and focussing on linguistic properties for ASR
  • 5. ICSLP98: Do phonetic features help to improve consonant identification in ASR?

4
The goal of ASR systems
  • Input: spectral description of the microphone signal, typically
    - energy in band-pass filters
    - LPC coefficients
    - cepstral coefficients
  • Output: linguistic units, usually phones or phonemes (on the basis of which words can be recognised)

5
Variability in the signal (1)
  • Main problem in ASR: variability in the input signal. Example: /k/ has very different realisations in different contexts. Its place of articulation varies from velar before back vowels to pre-velar before front vowels (own articulation of keep, cool).

6
Variability in the signal (2)
  • Main problem in ASR: variability in the input signal. Example: /g/ in its canonical form is sometimes realised as a fricative or approximant, e.g. intervocalically (OE. regen > E. rain). In Danish, this happens to all intervocalic voiced plosives; also, voiceless plosives become voiced.

7
Variability in the signal (3)
  • Main problem in ASR: variability in the input signal. Example: /h/ has very different realisations in different contexts. It can be considered as a voiceless realisation of the surrounding vowels. (spectrograms: 'ihi', 'aha', 'uhu')

8
Variability in the signal (3a)

(Spectrograms of 'ihi', 'aha' and 'uhu', with segment labels.)
9
Variability in the signal (4)
  • Main problem in ASR: variability in the input signal. Example: deletion of segments due to articulatory overlap. Friction is superimposed on the vowel signal.
  • (spectrogram: G. System)

10
Variability in the signal (4a)
(Spectrogram with segment labels: d, e, p0, s, i, m, a, l, z, Y, b0, s, p0, t, e, m.)
11
Variability in the signal (5)
  • Main problem in ASR: variability in the input signal. Example: the same vowel /a/ is realised differently depending on its context.
  • (spectrograms: 'aba', 'ada', 'aga')

12
Variability in the signal (5a)
(Spectrograms of 'aba', 'ada' and 'aga', with segment labels.)
13
Modelling variability
  • Hidden Markov models can represent the variable
    signal characteristics of phones

(Diagram: a left-to-right hidden Markov model with entry state S, exit state E and three emitting states; each emitting state i has a self-loop probability pi and a transition probability 1-pi to the next state.)
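To make the topology concrete, here is a minimal sketch (my own illustration in Python, not taken from the course material) of the transition structure suggested by the diagram above; the self-loop probabilities p1, p2 and p3 are arbitrary example values.

```python
import numpy as np

# Minimal sketch of a 3-state left-to-right phone HMM with entry state S and
# exit state E, as in the diagram above. The self-loop probabilities are
# arbitrary illustrative values, not taken from the presentation.
p1, p2, p3 = 0.6, 0.7, 0.5

# State order: S, 1, 2, 3, E. Row i holds the transition probabilities out of state i.
A = np.array([
    [0.0, 1.0, 0.0,      0.0,      0.0     ],  # S -> first emitting state
    [0.0, p1,  1.0 - p1, 0.0,      0.0     ],  # state 1: stay (p1) or move on (1-p1)
    [0.0, 0.0, p2,       1.0 - p2, 0.0     ],  # state 2
    [0.0, 0.0, 0.0,      p3,       1.0 - p3],  # state 3 -> E
    [0.0, 0.0, 0.0,      0.0,      1.0     ],  # E (absorbing)
])

assert np.allclose(A.sum(axis=1), 1.0)  # every row is a probability distribution
```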
14
Lexicon and language model (1)
  • Linguistic knowledge about phone sequences
    (lexicon, language model) improves word
    recognition
  • Without linguistic knowledge, low phone accuracy

15
Lexicon and language model (2)
  • Using a lexicon and/or language model is not a top-down solution to all problems; sometimes pragmatic knowledge is needed.
  • Example: r??????sp???

Recognise speech
Wreck a nice beach
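As a toy illustration of what the language model contributes (invented probabilities, not part of the presentation), a bigram model can score competing word sequences for the same acoustic input; the slide's point is that even then, pragmatic knowledge may be needed to pick the intended reading.

```python
import math

# Toy bigram language model for two acoustically similar hypotheses.
# All probabilities are invented purely for illustration.
bigram_logprob = {
    ("<s>", "recognise"): math.log(1e-4), ("recognise", "speech"): math.log(5e-2),
    ("<s>", "wreck"): math.log(1e-5), ("wreck", "a"): math.log(1e-1),
    ("a", "nice"): math.log(2e-2), ("nice", "beach"): math.log(1e-3),
}

def lm_score(words):
    """Sum of bigram log probabilities over a word sequence."""
    return sum(bigram_logprob[(w1, w2)] for w1, w2 in zip(["<s>"] + words, words))

print(lm_score(["recognise", "speech"]))          # about -12.2
print(lm_score(["wreck", "a", "nice", "beach"]))  # about -24.6, so this LM prefers the first
```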
16
Lexicon and language model (3)
  • Using a lexicon and/or language model is not a top-down solution to all problems; sometimes pragmatic knowledge is needed.
  • Example: ???????????????

Get up at eight o'clock
Get a potato clock
17
CONCLUSIONS
  • The acoustic parameters (e.g. MFCC) are very
    variable.
  • We must try to improve phone accuracy by
    extracting linguistic information.
  • Rationale: word recognition rates will increase if phone accuracy improves
  • BUT: not all our problems can be solved

18
Phonetic features in ASR
  • Assumption: phone accuracy can be improved by
    deriving phonetic features from the spectral
    representation of the speech signal
  • What are phonetic features?

19
A phonetic description of sounds
  • The articulatory organs

20
A phonetic description of sounds
  • The articulation of consonants

(Diagram labels: velum (= soft palate), tongue.)
21
A phonetic description of sounds
  • The articulation of vowels

22
Phonetic features IPA
  • IPA (International Phonetic Alphabet) chart
    - consonants and vowels
    - only phonemic distinctions
    (http://www.arts.gla.ac.uk/IPA/ipa.html)

23
The IPA chart (consonants)
24
The IPA chart (other consonants)
25
The IPA chart (non-pulm. cons.)
26
The IPA chart (vowels)
27
The IPA chart (diacritics)
28
IPA features (obstruents)
(Table: feature values +1 / -1 / 0 for the obstruents p0, b0, p, t, k, b, d, g, f, T, s, S, C, x, vfri, vapr, Dfri, z, Z on the IPA-based features voi, lab, den, alv, pal, vel, uvu, glo, plo, fri, nas, lat, apr, tri.)
29
IPA features (sonorants)
(Table: feature values +1 / -1 / 0 for the sonorants m, n, J, N, l, L, rret, ralv, Ruvu, j, w, h on the same IPA-based features; a final row '...' has all features set to zero.)
A zero value is assigned to all vowel features (not listed here)
30
IPA features (vowels)
(Table: feature values +1 / -1 for the vowels i, I, y, Y, u, U, e, 2, o, O, V, Q, Uschwa, a, A, E, 9, 3, @, 6 on five vowel features.)
A zero value is assigned to all consonant features (not listed here)
31
Phonetic features
  • Phonetic features
    - different systems (JFH, SPE, art. feat.)
    - distinction between natural classes which undergo the same phonological processes
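A small sketch of the natural-class idea (the feature values below are simplified examples, not rows from the IPA or SPE matrices on the following slides): phones that share a feature specification can be selected, and treated by a phonological process, as a group.

```python
# Illustrative sketch: selecting a natural class from a feature matrix.
# The feature values are simplified examples, not the course's IPA/SPE values.
MATRIX = {
    "p": {"voiced": -1, "plosive": +1, "fricative": -1, "nasal": -1},
    "b": {"voiced": +1, "plosive": +1, "fricative": -1, "nasal": -1},
    "s": {"voiced": -1, "plosive": -1, "fricative": +1, "nasal": -1},
    "z": {"voiced": +1, "plosive": -1, "fricative": +1, "nasal": -1},
    "m": {"voiced": +1, "plosive": -1, "fricative": -1, "nasal": +1},
}

def natural_class(**spec):
    """Return all phones whose feature values match the given specification."""
    return [phone for phone, feats in MATRIX.items()
            if all(feats[f] == v for f, v in spec.items())]

print(natural_class(voiced=-1))              # ['p', 's']: the voiceless obstruents
print(natural_class(voiced=+1, plosive=+1))  # ['b']: the voiced plosives
```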

32
SPE features (obstruents)
        cns  syl  nas  son  low  hig  cen  bac  rou  ant  cor  cnt  voi  lat  str  ten
  p0      1   -1   -1   -1   -1    0    0    0   -1    0    0   -1   -1   -1   -1    1
  b0      1   -1   -1   -1   -1    0    0    0   -1    0    0   -1    1   -1   -1   -1
  p       1   -1   -1   -1   -1   -1    0   -1   -1    1   -1   -1   -1   -1   -1    1
  b       1   -1   -1   -1   -1   -1    0   -1   -1    1   -1   -1    1   -1   -1   -1
  tden    1   -1   -1   -1   -1   -1    0   -1   -1    1    1   -1   -1   -1   -1    1
  t       1   -1   -1   -1   -1   -1    0   -1   -1    1    1   -1   -1   -1   -1    1
  d       1   -1   -1   -1   -1   -1    0   -1   -1    1    1   -1    1   -1   -1   -1
  k       1   -1   -1   -1   -1    1    0    1   -1   -1   -1   -1   -1   -1   -1    1
  g       1   -1   -1   -1   -1    1    0    1   -1   -1   -1   -1    1   -1   -1   -1
  f       1   -1   -1   -1   -1   -1    0   -1   -1    1   -1    1   -1   -1    1    1
  vfri    1   -1   -1   -1   -1   -1    0   -1   -1    1   -1    1    1   -1    1   -1
  T       1   -1   -1   -1   -1   -1    0   -1   -1    1    1    1   -1   -1   -1    1
  Dfri    1   -1   -1   -1   -1   -1    0   -1   -1    1    1    1    1   -1   -1   -1
  s       1   -1   -1   -1   -1   -1    0   -1   -1    1    1    1   -1   -1    1    1
  z       1   -1   -1   -1   -1   -1    0   -1   -1    1    1    1    1   -1    1   -1
  S       1   -1   -1   -1   -1    1    0   -1   -1   -1    1    1   -1   -1    1    1

33
SPE features (sonorants)
        cns  syl  nas  son  low  hig  cen  bac  rou  ant  cor  cnt  voi  lat  str  ten
  m       1   -1    1    1   -1   -1    0   -1   -1    1   -1   -1    1   -1   -1    0
  n       1   -1    1    1   -1   -1    0   -1   -1    1    1   -1    1   -1   -1    0
  J       1   -1    1    1   -1    1    0   -1   -1   -1   -1   -1    1   -1   -1    0
  N       1   -1    1    1   -1    1    0    1   -1   -1   -1   -1    1   -1   -1    0
  l       1   -1   -1    1   -1   -1    0   -1   -1    1    1    1    1    1   -1    0
  L       1   -1   -1    1   -1    1    0   -1   -1   -1   -1    1    1    1   -1    0
  ralv    1   -1   -1    1   -1   -1    0   -1   -1    1    1    1    1   -1   -1    0
  Ruvu    1   -1   -1    1   -1   -1    0    1   -1   -1   -1    1    1   -1   -1    0
  rret    1   -1   -1    1   -1   -1    0   -1   -1   -1    1    1    1   -1   -1    0
  j      -1   -1   -1    1   -1    1    0   -1   -1   -1   -1    1    1   -1   -1    0
  vapr   -1   -1   -1    1   -1   -1    0   -1   -1    1   -1    1    1   -1   -1    0
  w      -1   -1   -1    1   -1    1    0    1    1    1   -1    1    1   -1   -1    0
  h      -1   -1   -1    1    1   -1    0   -1   -1   -1   -1    1   -1   -1   -1    0
  XXX     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

34
SPE features (vowels)
        cns  syl  nas  son  low  hig  cen  bac  rou  ant  cor  cnt  voi  lat  str  ten
  i      -1    1   -1    1   -1    1   -1   -1   -1   -1   -1    1    1   -1   -1    1
  I      -1    1   -1    1   -1    1   -1   -1   -1   -1   -1    1    1   -1   -1   -1
  e      -1    1   -1    1   -1   -1   -1   -1   -1   -1   -1    1    1   -1   -1    1
  E      -1    1   -1    1   -1   -1   -1   -1   -1   -1   -1    1    1   -1   -1   -1
  ?      -1    1   -1    1    1   -1   -1   -1   -1   -1   -1    1    1   -1   -1   -1
  a      -1    1   -1    1    1   -1   -1   -1   -1   -1   -1    1    1   -1   -1    1
  y      -1    1   -1    1   -1    1   -1   -1    1   -1   -1    1    1   -1   -1    1
  Y      -1    1   -1    1   -1    1   -1   -1    1   -1   -1    1    1   -1   -1   -1
  2      -1    1   -1    1   -1   -1   -1   -1    1   -1   -1    1    1   -1   -1    1
  9      -1    1   -1    1   -1   -1   -1   -1    1   -1   -1    1    1   -1   -1   -1
  A      -1    1   -1    1    1   -1   -1    1   -1   -1   -1    1    1   -1   -1   -1
  Q      -1    1   -1    1    1   -1   -1    1    1   -1   -1    1    1   -1   -1   -1
  V      -1    1   -1    1   -1   -1   -1    1   -1   -1   -1    1    1   -1   -1   -1
  O      -1    1   -1    1   -1   -1   -1    1    1   -1   -1    1    1   -1   -1   -1
  o      -1    1   -1    1   -1   -1   -1    1    1   -1   -1    1    1   -1   -1    1
  U      -1    1   -1    1   -1    1   -1    1    1   -1   -1    1    1   -1   -1   -1

35
CONCLUSION
  • Different feature matrices have different
    implications for relations between phones
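One illustrative way to see this (toy feature systems, not the matrices from the previous slides): counting the features on which two phones disagree gives different distances, and hence different phone relations, under different matrices.

```python
# Illustrative sketch: how far apart two phones are depends on the feature matrix.
# Both matrices below are toy examples, not the IPA-based or SPE matrices.
matrix_a = {"t": (-1, +1, -1), "d": (+1, +1, -1), "n": (+1, -1, +1)}  # (voi, plo, nas)
matrix_b = {"t": (-1, +1),     "d": (+1, +1),     "n": (+1, +1)}      # (voi, cns)

def feature_distance(matrix, x, y):
    """Number of features on which two phones disagree in a given matrix."""
    return sum(a != b for a, b in zip(matrix[x], matrix[y]))

print(feature_distance(matrix_a, "d", "n"))  # 2: /d/ and /n/ are kept well apart
print(feature_distance(matrix_b, "d", "n"))  # 0: the same pair is not distinguished here
```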

36
Kohonen networks
  • Kohonen networks are unsupervised neural networks
  • Our Kohonen networks take vectors of acoustic
    parameters (MFCC_E_D) as input and output
    phonetic feature vectors
  • Network size: 50 x 50 neurons

37
Training the Kohonen network
  • 1. Self-organisation results in a phonotopic map
  • 2. Phone calibration attaches array of phones to
    each winning neuron
  • 3. Feature calibration replaces array of phones
    by array of phonetic feature vectors
  • 4. Averaging of phonetic feature vectors for each
    neuron
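A minimal sketch of these four steps (my own numpy illustration; the map size follows the previous slide, but the learning rate, neighbourhood function and other details are assumptions, and steps 2-4 are collapsed into a single calibration pass):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: self-organisation of a 50 x 50 map on acoustic parameter vectors.
GRID, DIM = 50, 26   # 26 = 12 MFCCs + energy + delta parameters, as in the course data
weights = rng.normal(size=(GRID, GRID, DIM))
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID), indexing="ij"), axis=-1)

def train_step(x, lr=0.1, radius=5.0):
    """One self-organising update: move the winner and its map neighbours towards x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    winner = np.unravel_index(dist.argmin(), dist.shape)
    grid_dist = np.linalg.norm(coords - np.array(winner), axis=-1)
    neighbourhood = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
    weights[...] += lr * neighbourhood * (x - weights)

# Steps 2-4 (collapsed): each winning neuron collects the phonetic feature vectors
# of the labelled frames that activate it, and these are then averaged per neuron.
def calibrate(frames, feature_vectors):
    sums = np.zeros((GRID, GRID, feature_vectors.shape[1]))
    counts = np.zeros((GRID, GRID, 1))
    for x, f in zip(frames, feature_vectors):
        winner = np.unravel_index(np.linalg.norm(weights - x, axis=-1).argmin(), (GRID, GRID))
        sums[winner] += f
        counts[winner] += 1
    return sums / np.maximum(counts, 1)   # averaged phonetic feature vector per neuron
```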

38
Mapping with the Kohonen network
  • The acoustic parameter vector belonging to one frame activates a neuron
  • A weighted average of the phonetic feature vectors attached to the winning neuron and the K nearest neurons is output
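A sketch of this mapping step (again only an illustration; the presentation does not specify the weighting scheme, so inverse-distance weights over the winner and its K nearest codebook vectors are an assumption):

```python
import numpy as np

def map_frame(x, weights, neuron_features, k=5):
    """Map one acoustic frame onto a phonetic feature vector.

    weights:         (GRID, GRID, DIM) codebook of the trained Kohonen map
    neuron_features: (GRID, GRID, F) averaged phonetic feature vector per neuron
    The output is a weighted average over the winning neuron and its k nearest
    neighbours; inverse-distance weighting is an assumption, not the course's choice.
    """
    dist = np.linalg.norm(weights - x, axis=-1).ravel()
    nearest = np.argsort(dist)[:k + 1]                  # winner plus k nearest neurons
    w = 1.0 / (dist[nearest] + 1e-6)                    # closer neurons weigh more
    feats = neuron_features.reshape(-1, neuron_features.shape[-1])[nearest]
    return (w[:, None] * feats).sum(axis=0) / w.sum()
```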

39
Advantages of Kohonen networks
  • Reduction of feature dimensions possible
  • Mapping onto linguistically meaningful dimensions
    (phonetically less severe confusions)
  • Many-to-one mapping allows mapping of different
    allophones (acoustic variability) onto the same
    phonetic feature values
  • automatic and fast mapping

40
Disadvantages of Kohonen networks
  • They need to be trained on manually segmented and
    labelled material
  • BUT cross-language training has been shown to be successful

41
Hybrid ASR system
(Block diagram: MFCCs, energy and delta parameters either feed directly into hidden Markov modelling (BASELINE) or are first mapped by Kohonen networks onto phonetic features; hidden Markov modelling, with a phone lexicon and language model, outputs a phone.)
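Schematically, the two processing paths in this diagram can be written as follows (the function names are placeholders for illustration, not the original implementation):

```python
# Illustrative sketch of the two processing paths in the hybrid system.
# The callables are placeholders; the original implementation is not shown here.

def baseline_path(frames, hmm_decoder):
    """BASELINE: acoustic parameters go straight into hidden Markov modelling."""
    return hmm_decoder(frames)                      # -> phone sequence

def mapping_path(frames, kohonen_map, hmm_decoder):
    """Hybrid path: frames are first mapped onto phonetic feature vectors."""
    feature_frames = [kohonen_map(x) for x in frames]
    return hmm_decoder(feature_frames)              # -> phone sequence
```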
42
CONCLUSION
  • Acoustic-phonetic mapping extracts linguistically
    relevant information from the variable input
    signal.

43
ICSLP98

Exploiting transitions and focussing on linguistic properties for ASR

Jacques Koreman, William J. Barry, Bistra Andreeva
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany
44
INTRODUCTION
45
INTRODUCTION
No lexicon or language model
The controlled experiments presented here reflect
our general aim of using phonetic knowledge to
improve the ASR system architecture. In order to
evaluate the effect of the changes in bottom-up
processing, no lexicon or language model is used.
Both improve phone identification in a top-down
manner by preventing the identification of
inadmissible words (lexical gaps or phonotactic
restrictions) or word sequences.
46
DATA
Texts
English, German, Italian and Dutch texts from the
EUROM0 database, read by 2 male and 2 female speakers per language
47
DATA
Signals
  • 12 mel-frequency cepstral coefficients (MFCCs)
  • energy
  • corresponding delta parameters

Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
16 kHz microphone signals
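A sketch of such a front-end with the settings above (librosa is my choice of tool here; the presentation does not say how the parameters were computed):

```python
import numpy as np
import librosa

def front_end(wav_path):
    """12 MFCCs + energy + delta parameters, roughly matching the settings above.

    Using librosa is an assumption; the original toolchain is not named.
    """
    y, sr = librosa.load(wav_path, sr=16000)          # 16 kHz microphone signal
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # pre-emphasis 0.97

    n_fft = int(0.015 * sr)                           # 15 ms Hamming window
    hop = int(0.005 * sr)                             # 5 ms step size
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft,
                                hop_length=hop, window="hamming")
    # Keep c1..c12 as the 12 MFCCs and use frame log energy instead of c0.
    energy = np.log(np.maximum(
        librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop) ** 2, 1e-10))
    feats = np.vstack([mfcc[1:13], energy])           # 13 static parameters
    deltas = librosa.feature.delta(feats)             # corresponding delta parameters
    return np.vstack([feats, deltas]).T               # shape: (frames, 26)
```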
48
DATA
Labels
  • Intervocalic consonants labelled with SAMPA
    symbols, except plosives and affricates, which
    are divided into closure and frication subphone
    units
  • 35-ms vowel transitions labelled as i_lab, alv_O (experiment 1) and V_lab, alv_V (experiment 2)

where lab, alv = consonants generalized across place; V = generalized vowel
49
EXPERIMENT 1 SYSTEM
50
EXPERIMENT 1 RESULTS
51
EXPERIMENT 1 CONCLUSIONS
  • When vowel transitions are used
  • consonant identification rate improves
  • place better identified
  • manner identified worse, because hidden Markov models for vowel transitions generalize across all consonants sharing the same place of articulation (solution: do not pool consonants sharing the same place of articulation)
  • vowel transitions can be exploited for
    identification of the consonant, particularly its
    place of articulation

52
EXPERIMENT 2 SYSTEM
(Block diagram: MFCCs, energy and delta parameters either feed directly into hidden Markov modelling (BASELINE) or are first mapped by Kohonen networks onto phonetic features; hidden Markov modelling, with lexicon and language model, outputs a consonant.)
53
EXPERIMENT 2 RESULTS
54
EXPERIMENT 2 CONCLUSIONS
  • When acoustic-phonetic mapping is applied
  • consonant identification rate improves strongly
  • place better identified
  • manner better identified
  • phonetic features address linguistically relevant information better than acoustic parameters do

55
EXPERIMENT 3 SYSTEM
(Block diagram: as in experiment 2, but the Kohonen networks map the Voffset - C - Vonset sequence, i.e. the vowel transitions are included in the acoustic-phonetic mapping; the output is a consonant.)
56
EXPERIMENT 3 RESULTS
57
EXPERIMENT 3 CONCLUSIONS
When transitions are used for acoustic-phonetic
mapping
  • consonant identification rate does not improve
  • place identification improves slightly
  • manner identification rate decreases slightly

Vowel transitions do not increase the identification rate, because the baseline identification rate is already high and/or the vowel transitions are undertrained in the Kohonen networks.
58
INTERPRETATION (1)
  • The greatest improvement in consonant
    identification is achieved in experiment 2. By
    mapping acoustically different realisations of
    consonants onto more similar phonetic features,
    the input to hidden Markov modelling becomes more
    homogeneous, leading to a higher consonant
    identification rate.
  • Using vowel transitions also leads to a higher consonant identification rate in experiment 1. It was shown that particularly the consonant's place is identified better. The findings confirm the importance of transitions known from perceptual experiments.

59
INTERPRETATION (2)
  • The additional use of vowel transitions when acoustic-phonetic mapping is applied does not improve the identification results. Two possible explanations for this have been suggested:
  • the identification rates are high anyway when mapping is applied, so that it is less likely that large improvements are found
  • the generalized vowel transitions are undertrained in the Kohonen networks, because the intrinsically variable frames are spread over a larger area in the phonotopic map.
  • The latter interpretation is currently being verified by Sibylle Kötzer by applying the methodology to a larger database (TIMIT).

60
REFERENCES (1)
Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th Eurospeech, 1411-1414.
Cassidy, S. & Harrington, J. (1995). The place of articulation distinction in voiced oral stops: evidence from burst spectra and formant transitions. Phonetica 52, 263-284.
Delattre, P., Liberman, A. & Cooper, F. (1955). Acoustic loci and transitional cues for consonants. JASA 27(4), 769-773.
Furui, S. (1986). On the role of spectral transitions for speech perception. JASA 80(4), 1016-1025.
Koreman, J., Andreeva, B. & Barry, W.J. (1998). Do phonetic features help to improve consonant identification in ASR? Proc. ICSLP.
61
REFERENCES (2)
Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3, 83-109. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Koreman, J., Erriquez, A. & Barry, W.J. (to appear). On the selective use of acoustic parameters for consonant identification. PHONUS 4. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Stevens, K. & Blumstein, S. (1978). Invariant cues for place of articulation in stop consonants. JASA 64(5), 1358-1368.
62
SUMMARY
  • Acoustic-phonetic mapping by a Kohonen network
    improves consonant identification rates.

63
ICSLP98

Do phonetic features help to improve consonant identification in ASR?

Jacques Koreman, Bistra Andreeva, William J. Barry
Institute of Phonetics, University of the Saarland, Saarbrücken, Germany
64
INTRODUCTION

Variation in the acoustic signal is not a problem for human perception, but causes inhomogeneity in the phone models for ASR, leading to poor consonant identification. We should "directly target the linguistic information in the signal and ... minimize other extra-linguistic information that may yield large speech variability" (Bitar & Espy-Wilson 1995a, p. 1411). Bitar & Espy-Wilson do this by using a knowledge-based event-seeking approach for extracting phonetic features from the microphone signal on the basis of acoustic cues. We propose an acoustic-phonetic mapping procedure on the basis of a Kohonen network.
65
DATA
Texts
English, German, Italian and Dutch texts from the
EUROM0 database, read by 2 male and 2 female speakers per language
66
DATA
Signals
  • 12 mel-frequency cepstral coefficients (MFCCs)
  • energy
  • corresponding delta parameters

Hamming window: 15 ms; step size: 5 ms; pre-emphasis: 0.97
16 kHz microphone signals
67
DATA (1)
Labels
The consonants were transcribed with SAMPA
symbols, except
  • plosives and affricates are subdivided into a closure (p0 = voiceless closure, b0 = voiced closure) and a burst-plus-aspiration part (p, t, k) or frication part (f, s, S, z, Z)
  • Italian geminates were pooled with non-geminates
    to prevent undertraining of geminate consonants
  • The Dutch voiced velar fricative ?, which only
    occurs in some dialects, was pooled with its
    voiceless counterpart x to prevent undertraining

68
DATA (2)
Labels
  • SAMPA symbols are phonemic within a language, but
    can represent different allophones
    cross-linguistically. These were relabelled as
    shown in the table below

SAMPA   allophone   label   description          language
r       ?           rapr    alv. approx.         English
        r           ralv    alveolar trill       It., Dutch
        ?           Ruvu    uvular trill         G., Dutch
v       ?           vapr    labiod. approx.      German
        v           vfri    vd. labiod. fric.    E., It., NL
w       ?           vapr    labiod. approx.      Dutch
        w           w       bilab. approx.       Engl., It.
69
SYSTEM ARCHITECTURE
(Block diagram: MFCCs, energy and delta parameters either feed directly into hidden Markov modelling (BASELINE) or are first mapped by Kohonen networks onto phonetic features; hidden Markov modelling, with lexicon and language model, outputs a consonant.)
70
CONFUSIONS BASELINE
(Confusion matrix figure by Attilio Erriquez; legend: phonetic categories manner, place, voicing - 1 category wrong, 2 categories wrong, 3 categories wrong.)
71
CONFUSIONS MAPPING
(Confusion matrix figure by Attilio Erriquez; legend: phonetic categories manner, place, voicing - 1 category wrong, 2 categories wrong, 3 categories wrong.)
72
ACIS
The Average Correct Identification Score
compensates for the number of occurrences in the
database, giving each consonant equal weight. It
is the total of all percentage numbers along the
diagonal of the confusion matrix divided by the
number of consonants.
Baseline system: 31.22
Mapping system: 68.47
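A sketch of the ACIS computation as defined above, with an invented 3-consonant confusion matrix for illustration:

```python
import numpy as np

def acis(confusions):
    """Average Correct Identification Score.

    `confusions` is a square matrix of percentages (rows: produced consonant,
    columns: identified consonant, each row summing to 100). ACIS is the mean
    of the diagonal percentages, so every consonant gets equal weight
    regardless of how often it occurs in the database.
    """
    return np.asarray(confusions, dtype=float).diagonal().mean()

# Invented 3-consonant confusion matrix, percentages per row.
example = [[80, 15,  5],
           [30, 60, 10],
           [25, 25, 50]]
print(acis(example))  # (80 + 60 + 50) / 3 = 63.33...
```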
73
BASELINE SYSTEM
  • good identification of language-specific phones
  • reason: acoustic homogeneity
  • poor identification of other phones

74
MAPPING SYSTEM
  • good identification, also of acoustically
    variable phones
  • reason: variable acoustic parameters are mapped onto homogeneous, distinctive phonetic features

75
AFFRICATES (1)
cons.   % correct (baseline)   % correct (mapping)   language
pf              0.0                  100.0           German
f               1.2                   64.4           all
ts              0.0                   72.2           German, It.
s               3.1                   64.7           all
t?              0.0                   40.2           E., G., It.
?              78.1                   90.6           all
dz              0.0                   70.3           Italian
z              10.4                   50.5           all
d?             28.0                   96.0           English, It.
?              no intervocalic realisations
76
AFFRICATES (2)
  • affricates, although restricted to fewer
    languages, are recognised poorly in the baseline
    system
  • reason: they are broken up into closure and frication segments, which are trained separately in the Kohonen networks; these segments occur in all languages and are acoustically variable, leading to poor identification
  • this is corroborated by the poor identification rates for fricatives in the baseline system (exception: /?/, which occurs only rarely)
  • after mapping, both fricatives and affricates are
    identified well

77
APMS
APMS = (sum of the phonetic misidentification coefficients) / (sum of the misidentification percentages)
The Average Phonetic Misidentification Score
gives a measure of the severity of the consonant
confusions in terms of phonetic features. The
numerator is the sum of all products of the
misidentification percentages (in the
non-diagonal cells) times the number of
misidentified phonetic categories (manner, place
and voicing). It is divided by the total of all
the percentage numbers in the non-diagonal cells.
Baseline system: 1.79
Mapping system: 1.57
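A sketch of the APMS computation as defined above (the percentages and category counts below are invented for illustration; in the paper the category counts come from comparing the manner, place and voicing of the two consonants):

```python
def apms(confusions, categories_wrong):
    """Average Phonetic Misidentification Score.

    `confusions` maps (produced, identified) pairs to misidentification
    percentages (the non-diagonal cells); `categories_wrong` maps the same
    pairs to how many of the categories manner, place and voicing differ.
    APMS = sum(percentage * categories wrong) / sum(percentages).
    """
    total = sum(p * categories_wrong[pair] for pair, p in confusions.items())
    return total / sum(confusions.values())

# Invented example: /p/ misheard as /b/ (voicing only wrong) or as /g/
# (voicing and place wrong).
confusions = {("p", "b"): 12.0, ("p", "g"): 3.0}
categories_wrong = {("p", "b"): 1, ("p", "g"): 2}
print(apms(confusions, categories_wrong))  # (12*1 + 3*2) / 15 = 1.2
```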
78
APMS
APMS = (sum of the phonetic misidentification coefficients) / (sum of the misidentification percentages)
  • after mapping, an incorrectly identified consonant is on average closer to the phonetic identity of the consonant which was produced
  • reason: the Kohonen network is able to extract linguistically distinctive phonetic features which allow for a better separation of the consonants in hidden Markov modelling.

79
CONSONANT CONFUSIONS
BASELINE
MAPPING
80
CONCLUSIONS
Acoustic-phonetic mapping helps to address
linguistically relevant information in the speech
signal, ignoring extra-linguistic sources of
variation. The advantages of mapping are
reflected in the two measures which we have
presented
  • ACIS shows that mapping leads to better consonant
    identification rates for all except a few of the
    language-specific consonants. The improvement can
be put down to the system's ability to map
    acoustically variable consonant realisations to
    more homogeneous phonetic feature vectors.

81
CONCLUSIONS
Acoustic-phonetic mapping helps to address
linguistically relevant information in the speech
signal, ignoring extra-linguistic sources of
variation. The advantages of mapping are
reflected in the two measures which we have
presented
  • APMS shows that the confusions which occur in the
    mapping experiment are less severe than in the
    baseline experiment from a phonetic point of
    view. There are fewer confusions on the phonetic
    dimensions manner, place and voicing when mapping
    is applied, because the system focuses on
    distinctive information in the acoustic signals.

82
REFERENCES (1)
Bitar, N. & Espy-Wilson, C. (1995a). Speech parameterization based on phonetic features: application to speech recognition. Proc. 4th European Conference on Speech Communication and Technology, 1411-1414.
Bitar, N. & Espy-Wilson, C. (1995b). A signal representation of speech based on phonetic features. Proc. 5th Annual Dual-Use Techn. and Applications Conf., 310-315.
Kirchhoff, K. (1996). Syllable-level desynchronisation of phonetic features for speech recognition. Proc. ICSLP, 2274-2276.
Dalsgaard, P. (1992). Phoneme label alignment using acoustic-phonetic features and Gaussian probability density functions. Computer Speech and Language 6, 303-329.
83
REFERENCES (2)
Koreman, J., Barry, W.J. & Andreeva, B. (1997). Relational phonetic features for consonant identification in a hybrid ASR system. PHONUS 3, 83-109. Saarbrücken (Germany): Institute of Phonetics, University of the Saarland.
Koreman, J., Barry, W.J. & Andreeva, B. (1998). Exploiting transitions and focussing on linguistic properties for ASR. Proc. ICSLP (these proceedings).
84
SUMMARY
Acoustic-phonetic mapping leads to fewer and
phonetically less severe consonant confusions.
85
THE END