Title: A Universal Human Machine Speech Interaction
A Universal Human Machine Speech
Interaction Language for Robust Speech
Recognition Applications
Ebru Arisoy, Levent M. Arslan
Boğaziçi University, Electrical and Electronics Engineering Department, Istanbul, Turkey
INTRODUCTION
Statement of the Problem: Automatic speech recognition systems are prone to errors when there are confusable words in the dictionary.
Proposed Solution: Create a human machine speech interaction language (HUMSIL) with acoustically orthogonal words.
THE DESIGN OF THE NEW LANGUAGE
Phonetic Alphabet
- 29 natural languages are examined in the IPA handbook [7].
- The most common phonemes (included in at least 70% of these languages), in descending order:
  Consonants: /m/, /n/, /k/, /t/, /l/, /b/, /d/, /p/, /s/, /g/, /f/, /y/, /z/
  Vowels: /i/, /u/, /a/, /o/, /e/
THE DESIGN OF THE NEW LANGUAGE
- Phonetic Alphabet (Contd.)
- Vowels
  - The vowels /i/, /u/, and /a/ exist in 250 of 317 languages [8].
  - /a/, /i/, and /u/ lie at the three corners of the vowel triangle.
  - /a/, /e/, /i/, /o/, and /u/ are the most common vowels.
[Figure: The vowel triangle [9], with corners IY (/i/), A (/a/), and OO (/u/)]
THE DESIGN OF THE NEW LANGUAGE
- Phonetic Alphabet (Contd.)
- Vowels
  - /a/, /i/, and /u/ have the lowest error rates in the confusion matrix [10].
  - The phoneme /u/ may have variations in different languages and even in different words.
  - Based on these facts, we select the phonemes /a/, /i/, and /o/ as the vowels of our minimal alphabet.
THE DESIGN OF THE NEW LANGUAGE
- Phonetic Alphabet (Contd.)
- Consonants
  - A perceptual study [11] found that the phoneme groups /ptk/ and /bdg/ have a very high rate of within-group confusions.
  - Therefore, taking one or two phonemes from each group may result in better recognition performance.
  - 83% of all languages have some kind of /s/ sound.
  - The next most frequent is the voiced counterpart of /s/, namely /z/ [8].
  - The voiceless forms of the cognate pairs are heard more successfully than the voiced forms (/s/ > /z/ and /f/ > /v/) [12].
THE DESIGN OF THE NEW LANGUAGE
- Phonetic Alphabet (Contd.)
- Consonants
  - The bilabial nasal /m/ appears in almost 300 languages [13].
  - The presence of /m/ in a language implies the presence of /n/ in 99.3% of cases [8].
  - The confusion rate between /m/ and /n/ is the highest among consonant pairs [14].
In light of these facts, the final version of our minimal alphabet includes the phonemes /a/, /i/, and /o/ as vowels and /b/, /t/, /k/, /s/, /f/, and /n/ as consonants.
CHOICE OF THE WORDS FOR HUMSIL
- Main Considerations in the Design Process
  - Acoustic orthogonality
  - The factors affecting human learning of new words in a second language:
    - number of syllables within a word
    - familiarity of the word to the speaker
- Acoustic orthogonality: The new words are selected such that they are perceptually as distant from each other as possible in the acoustic space.
- Number of syllables within a word: Equal numbers of two-syllable and three-syllable words are selected for the new digit vocabulary.
- Familiarity of the word to the speaker: We prefer to use unfamiliar words, since multi-nationality is a more important criterion.
CHOICE OF THE WORDS FOR HUMSIL
Flowchart of the Vocabulary Design Process
[Figure: flowchart with the following stages]
- Common Phoneme Set: /a/, /i/, /o/, /b/, /t/, /k/, /s/, /f/, /n/
- Syllable Constraints: Consonant-Vowel Rule
- Initial Vocabulary (All of the Possible Words): 18 one-syllable words, 324 two-syllable words, 5832 three-syllable words
- Word Selection Algorithm (Phoneme Similarity): Phoneme String Distance
- Candidate Vocabulary Sets: Vocabulary Sets after the Application of Word Selection Algorithm
- Acoustic Similarity: Acoustic Distance
- Best Vocabulary Set: New Digit Vocabulary
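The initial vocabulary sizes in the flowchart follow from the consonant-vowel rule: 6 consonants times 3 vowels give 18 syllables, hence 18^2 = 324 two-syllable and 18^3 = 5832 three-syllable words. The following is a minimal sketch of that enumeration, assuming every syllable is exactly one consonant followed by one vowel. The function names are illustrative and not taken from the original work.

```python
from itertools import product

# Minimal alphabet from the slides.
CONSONANTS = ["b", "t", "k", "s", "f", "n"]
VOWELS = ["a", "i", "o"]


def syllables():
    """All consonant-vowel syllables (assumed CV rule)."""
    return [c + v for c, v in product(CONSONANTS, VOWELS)]


def words(n_syllables):
    """All words made of exactly n_syllables CV syllables."""
    return ["".join(parts) for parts in product(syllables(), repeat=n_syllables)]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "syllable(s):", len(words(n)), "words")
```

All ten HUMSIL digit words reported later in the slides (for example /biko/ and /fibata/) are members of these generated sets.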
CHOICE OF THE WORDS FOR HUMSIL
Word Selection Algorithm
Phoneme String Distance: The phoneme string distance is a metric of how alike two strings are to each other [16].
Acoustic Distance: The acoustic distance between two phonemes is defined as in equation (1).
For every substitution operation, the acoustic distance between the actual phoneme and the substituted phoneme is calculated using (1), and these values are summed to find the total acoustic distance between a word pair.
Operation list between the strings "intention" and "execution":
  intention
  delete i          -> ntention
  substitute n by e -> etention
  substitute t by x -> exention
  insert u          -> exenution
  substitute n by c -> execution
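The operation list above can be recovered with a standard minimum edit distance computation. The sketch below assumes unit cost for insertion, deletion, and substitution; the slides do not state the costs used. It returns the distance together with one minimum-cost operation list, which may differ from the one shown above while having the same length. The substitution entries are the ones for which an acoustic distance would be looked up. The function name `edit_operations` is illustrative.

```python
def edit_operations(source, target):
    """Return (distance, ops) for one minimum-cost alignment, unit costs assumed."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match or substitution

    # Backtrace from the bottom-right cell to collect the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        differs = i > 0 and j > 0 and source[i - 1] != target[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(differs):
            if differs:
                ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


if __name__ == "__main__":
    distance, operations = edit_operations("intention", "execution")
    print(distance)
    for op in operations:
        print(op)
```

With unit costs the distance between "intention" and "execution" is 5, which matches the five operations listed on the slide.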
CHOICE OF THE WORDS FOR HUMSIL
Word Selection Algorithm (Contd.)
- Phoneme string distance determines the level of similarity between two words.
- Acoustic distance determines the most orthogonal word pairs.
- The aim is to select word pairs that have large string distances and are as distant as possible from each other in the acoustic space.
- The first word of our new vocabulary is selected randomly from the generated two-syllable words.
- The second word is selected such that it has the highest string distance from the first word.
- The words from the third to the tenth are selected such that the minimum of the string distances between the newly selected word and the previously selected words is the highest.
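The selection steps above amount to a greedy max-min (farthest-point) procedure. The sketch below is one possible reading, not the authors' implementation. It assumes unit-cost edit distance as the phoneme string distance and, for brevity, draws all words from the two-syllable pool only; how the original work mixes two-syllable and three-syllable words is not specified here. All function names are hypothetical.

```python
import random
from itertools import product

CONSONANTS = "btksfn"
VOWELS = "aio"


def string_distance(a, b):
    """Unit-cost edit distance (assumed stand-in for phoneme string distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def two_syllable_pool():
    """All 324 consonant-vowel-consonant-vowel words."""
    syl = [c + v for c, v in product(CONSONANTS, VOWELS)]
    return [s1 + s2 for s1, s2 in product(syl, repeat=2)]


def select_vocabulary(pool, size, first_word):
    """Greedy max-min selection: each new word maximizes its minimum
    distance to the words selected so far (ties go to the first in pool order)."""
    selected = [first_word]
    # min_dist[w] = distance from w to its closest already-selected word
    min_dist = {w: string_distance(w, first_word) for w in pool}
    while len(selected) < size:
        best = max(pool, key=lambda w: min_dist[w])
        selected.append(best)
        for w in pool:
            min_dist[w] = min(min_dist[w], string_distance(w, best))
    return selected


if __name__ == "__main__":
    pool = two_syllable_pool()
    first = random.choice(pool)  # the first word is chosen at random
    print(select_vocabulary(pool, 10, first))
```

Running the procedure once per possible first word would yield one candidate vocabulary set per starting point.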
CHOICE OF THE WORDS FOR HUMSIL
Word Selection Algorithm (Contd.)
- All candidate vocabulary sets are selected using the algorithm.
- For all the vocabulary sets, the effect of acoustic distance is added to the phoneme string distances.
- The minimum of these total distances is found for each set.
- The vocabulary set having the maximum of these minimum total distances is selected as the best vocabulary set.
[Figure: Explanation of the algorithm for the selection process of the fourth word. For each of the generated two-syllable words (Word 1 to Word 324), the minimum of the distances between that word and the three previously selected words is found. The maximum of these minimum distances is then found and the fourth word is selected.]
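The final step, choosing the best set among the candidates, can be sketched as a max-min over pairwise total distances. The acoustic distance of equation (1) is not reproduced in these slides, so the code below uses a clearly hypothetical placeholder: `toy_phoneme_distance` and its values are invented for illustration only, and `toy_total_distance` simply compares equal-length words position by position.

```python
from itertools import combinations

# Hypothetical grouping used only by the placeholder distance below.
TOY_STOPS = {"b", "t", "k"}


def toy_phoneme_distance(p, q):
    """Placeholder for equation (1): invented values, NOT the original metric."""
    if p == q:
        return 0.0
    if p in TOY_STOPS and q in TOY_STOPS:
        return 0.5  # assumed to be more confusable
    return 1.0


def toy_total_distance(a, b):
    """String part (1 per mismatch) plus toy acoustic part, equal-length words only."""
    if len(a) != len(b):
        raise ValueError("this toy distance only handles equal-length words")
    return sum(1.0 + toy_phoneme_distance(p, q) for p, q in zip(a, b) if p != q)


def min_pairwise(vocabulary, total_distance):
    """Smallest total distance over all word pairs in one vocabulary set."""
    return min(total_distance(a, b) for a, b in combinations(vocabulary, 2))


def best_vocabulary(candidate_sets, total_distance):
    """Candidate set whose closest word pair is farthest apart."""
    return max(candidate_sets, key=lambda v: min_pairwise(v, total_distance))


if __name__ == "__main__":
    candidates = [["biko", "bika", "nana"], ["biko", "nana", "fofi"]]
    print(best_vocabulary(candidates, toy_total_distance))
```

Passing the distance as a parameter keeps the max-min logic independent of whichever acoustic metric is actually used.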
EVALUATIONS
Proposed Digit Set
  Digit  English  Turkish  HUMSIL
  0      zero     sıfır    /biko/
  1      one      bir      /nana/
  2      two      iki      /fofi/
  3      three    üç       /siti/
  4      four     dört     /toso/
  5      five     beş      /babisi/
  6      six      altı     /titaba/
  7      seven    yedi     /kobati/
  8      eight    sekiz    /satabo/
  9      nine     dokuz    /fibata/
Recognition Experiments
- Two recognition experiments are performed.
- The telephone speech database of GVZ Speech Technologies is used to train the HMMs for recognition.
- The training data does not contain recordings of the new vocabulary.
- The training data contains only Turkish utterances spoken by native Turkish speakers.
- Recordings are taken in a noisy office environment.
- A low quality microphone and a low sampling rate (8 kHz) were used in the recordings.
EVALUATIONS
Recognition Experiments (Contd.)
Experiment I
- Test recordings of English, Turkish, and HUMSIL digits are taken from 30 Turkish people (15 females and 15 males) whose second language is English.
- Error Rates
  Language  Error rate (%)
  English   25.6
  Turkish   4.6
  HUMSIL    1.3
EVALUATIONS
Recognition Experiments (Contd.)
Experiment II
- Test recordings of English and HUMSIL digits are taken from 30 multinational speakers (15 females and 15 males). 10 of them were native English speakers.
- Error Rates
  Language  Error rate (%)
  English   37.0
  HUMSIL    4.0
CONCLUSIONS
- A new human-machine speech interaction language (HUMSIL) is proposed for the confusable word pair problem in speech recognition applications.
- A recognition experiment is performed with Turkish speakers in their mother tongue, their second language, and the new language.
- In HUMSIL, an error rate reduction of 71.7% compared to Turkish and 94.9% compared to English is observed.
- The same experiment is performed with multinational speakers.
- An error rate reduction of 89.1% compared to English is observed.
- The main disadvantage of our idea is that people have to learn new words.
- However, acoustically similar words in existing languages will always degrade the performance of SR engines under noisy conditions and for speakers with heavy accents.
- Therefore, we think that the proposed idea provides a good alternative solution to this problem.
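The reductions quoted above are relative error rate reductions. The short sketch below recomputes them, assuming the Experiment I error rates 25.6, 4.6, and 1.3 correspond to English, Turkish, and HUMSIL respectively, and the Experiment II rates 37.0 and 4.0 to English and HUMSIL. The function name is illustrative.

```python
def relative_reduction(baseline_rate, new_rate):
    """Relative error rate reduction in percent of new_rate versus baseline_rate."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Experiment I error rates in percent (assumed order: English, Turkish, HUMSIL).
english_1, turkish_1, humsil_1 = 25.6, 4.6, 1.3
# Experiment II error rates in percent (assumed order: English, HUMSIL).
english_2, humsil_2 = 37.0, 4.0

if __name__ == "__main__":
    print("vs Turkish (Exp. I): ", round(relative_reduction(turkish_1, humsil_1), 1))
    print("vs English (Exp. I): ", round(relative_reduction(english_1, humsil_1), 1))
    print("vs English (Exp. II):", round(relative_reduction(english_2, humsil_2), 1))
```

Under these assumptions the Experiment I figures come out as 71.7 and 94.9, matching the slide. The rounded Experiment II rates give about 89.2, slightly above the reported 89.1, presumably because the reported value was derived from unrounded error rates.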
REFERENCES
1. Hemphill, C. T., Agarwal, R., Muthusamy, Y. K., and Gong, Y. Voice-Driven Information Access in the Automobile. IEEE Vehicular Technology Society News, August (2000) 8-11.
2. Arslan, L. M. and Hansen, J. H. L. Likelihood Decision Boundary Estimation between HMM Pairs in Speech Recognition. IEEE Trans. on Acoust., Speech, and Signal Processing, 6(4) (1998) 410-414.
3. Schubert, K. (ed.). Interlinguistics: Aspects of the Science of Planned Languages. Trends in Linguistics, Studies and Monographs 42. Mouton de Gruyter, Berlin and New York (1989) 10.
4. MacKenzie, I. S. and Zhang, S. The Immediate Usability of Graffiti. Proc. of Graphics Interface '97 (1997) 129-137.
5. Fromkin, V. and Rodman, R. An Introduction to Language. Holt, Rinehart and Winston, Inc., Orlando (1998).
6. Deller, J. R., Proakis, J. G., and Hansen, J. H. L. Discrete-Time Processing of Speech Signals. Macmillan Publishing Company (1993).
7. IPA. Handbook of the International Phonetic Association. Cambridge University Press (1999).
8. Maddieson, I. Patterns of Sounds. Cambridge University Press (1984).
9. Rabiner, L. R. and Schafer, R. W. Digital Processing of Speech Signals. Prentice Hall (1978).
10. Forgie, J. W. and Forgie, C. D. Results Obtained from a Vowel Recognition Computer Program. The Journal of the Acoustical Society of America, 31(11) (1959) 1480-1489.
11. Miller, G. A. and Nicely, P. E. An Analysis of Perceptual Confusions Among Some English Consonants. The Journal of the Acoustical Society of America, 27(2) (1955) 338-352.
12. House, A. S., Williams, C. E., Hecker, M. H. L., and Kryter, K. D. Articulation-Testing Methods: Consonantal Differentiation with a Closed-Response Set. The Journal of the Acoustical Society of America, 37(1) (1965).
13. Odlin, T. Cross-linguistic Influence in Language Learning. Cambridge University Press (1989).
14. Roe, D. B. and Riley, M. D. Prediction of Word Confusabilities for Speech Recognition. ICSLP, Yokohama (1994) 227-230.
15. Arslan, L. M. A New Universal Language for Speech Recognition Applications. IEEE Proc. ICASSP, Istanbul, Turkey (2000).
16. Jurafsky, D. and Martin, J. H. Speech and Language Processing. Prentice Hall (2000).