Title: English across Taiwan (EAT) Application on Mixed-Language Accented Speech Recognition
English across Taiwan (EAT) Application on
Mixed-Language Accented Speech Recognition
- Presenter: Prof. Jhing-Fa Wang
- Jhing-Fa Wang, Chi-Feng Li, Jia-Ching Wang, Shun-Chieh Lin, and Chung-Hsien Wu
- National Cheng Kung University, Tainan, Taiwan
- O-COCOSDA 2007, Dec. 4-6, Hanoi
OUTLINE
- 1. INTRODUCTION
- 2. MANDARIN AND ENGLISH MIXED-LANGUAGE SPEECH RECOGNITION (MLSR) SYSTEM
- 3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
- 4. CONCLUSIONS AND FUTURE WORK
1.1 Background and Motivation
- Speech recognition technology has recently reached a higher level of performance and robustness, allowing it to be deployed in a number of real-world environments, such as mobile phones and toys.
- As global communication and multiethnic societies grow, multilingualism frequently occurs in speech content, so multilingual speech recognition systems have become increasingly desirable.
1.2 Objective of Proposed Work
- Our research mainly aims to build a Mandarin and English mixed-language speech recognition (MLSR) system.
- The MLSR system can easily be reused in different command-and-control applications by changing the dictionary description and grammar for each new task.
- Finally, the speaker-independent mixed-language speech recognition system is also implemented on embedded PDAs.
1.3 Current Research and Products
- Mandarin speech recognition
  - IBM ViaVoice
  - Penpower Voicewriter
  - NCKU VenusDictate
- English speech recognition
  - AT&T Labs
  - Microsoft
  - IBM
  - CMU
2. Framework of the Proposed System
[System diagram: Input Speech -> Front-end Signal Processing -> Feature Vectors -> Search Algorithm -> Output Sentence; Speech Corpora (Chinese/English) -> Acoustic Model Training -> Mixed-Language Acoustic Models; the Grammar also feeds the Search Algorithm]
- 1. Front-end Signal Processing: Silence Removal, Voice Activity Detection, Feature Extraction
- 2. Example Input Sentence: ??PRADA??? (a mixed Mandarin-English utterance containing "PRADA")
- 3. Acoustic Models: Ch_m-UW-AN-Zh_m-_at_-P-R-AA-D-AH-D-_at_-_at_-M-OW
- 4. Lexicon: (Ch_m-UW-AN), (D-_at_), (Zh_m-_at_), (_at_), (P-R-AA-D-AH) = PRADA, (M-OW)
- 5. Grammar: (?) (?) (PRADA) (?) (?) (?)
2.1 Feature Extraction and Front-End Signal Processing
- Features and transforms
- An MFCC (Mel-Frequency Cepstral Coefficient) analysis is performed for each signal frame, and the following coefficients are extracted.
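As an illustration of the MFCC analysis described above, the sketch below computes cepstral coefficients with NumPy: pre-emphasis, Hamming-windowed framing, power spectrum, mel filterbank, log compression, and a DCT. The frame sizes, filter count, and 0.97 pre-emphasis factor are common defaults, not the exact values used in this system.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Simplified MFCC sketch (not HTK's exact HCopy pipeline)."""
    # 1. Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Frame the signal and apply a Hamming window.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 3. Power spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filterbank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i-1], bins[i], bins[i+1]
        for k in range(l, c):
            fbank[i-1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i-1, k] = (r - k) / max(r - c, 1)
    # 5. Log filterbank energies, then DCT-II to decorrelate into cepstra.
    fb_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return fb_energy @ dct.T   # shape: (n_frames, n_ceps)
```

With one second of 16 kHz audio this yields 98 frames of 13 coefficients each.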
2.1 Feature Extraction and Front-End Signal Processing (cont.)
- Silence Removal
- Silence removal is used to remove silent parts of the speech signal, which contain little or no speech-specific information.
- End-point Detection
- The goal of end-point detection (EPD for short) is to identify the important part of the audio signal for further processing. Hence EPD is also known as "speech detection" or "voice activity detection" (VAD for short).
2.2.1 Training and Testing Corpora
- Corpora of English across Taiwan (EAT)
- The EAT project prepared 600 recording sheets. Each sheet had 80 reading sentences, including:
  - English long sentences
  - English short sentences
  - English words
2.2.1 Training and Testing Corpora (cont.)
- Corpora of Mandarin speech data Across Taiwan (MAT-400)
2.2.2 Acoustic Model Training
- We developed the bilingual acoustic model for Mandarin and English using the following steps:
- Step 1. Group the phones into acoustically and phonetically similar clusters for each language.
- Step 2. Develop Gaussian acoustic models for all the monophones.
- Step 3. Calculate the dissimilarity of each Mandarin phone to all the English phones in the same group. If the value is below a threshold, the Mandarin phone is mapped to that English phone; otherwise, both phones are modeled separately in the bilingual system.
- Step 4. After obtaining the list of phones for the bilingual system, the lexicon for Mandarin is edited using the mapping rules obtained.
2.2.2 Mixed Multilingual Phone Inventory for English and Mandarin
- Mixed multilingual phone inventory for English and Mandarin, based on the phonetic feature definitions of Chomsky and Halle [10].
2.2.2 Mixed-Language Phone Set (cont.)
2.3 Training Phase of the Proposed System
2.4 Recognition Phase of the Proposed System
- Acoustic Models, Lexicon, Grammar
2.4 Recognition Phase of the Proposed System (cont.)
- Tree Lexicon (A Pronunciation Dictionary)
- The dictionary provides an association between the words used in the task grammar and the acoustic models, which are composed of sub-word (phonetic, syllabic, etc.) units.
[Lexicon trees for English and Mandarin]
- English pronunciations are referenced from the CMU Pronouncing Dictionary.
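A tree lexicon can be sketched as a prefix tree (trie) over phone sequences: words that share initial phones share branches, so the decoder expands each prefix only once. This minimal sketch uses plain dictionaries; a real decoder attaches acoustic-model states to the arcs, and the example entries below are illustrative.

```python
def build_tree_lexicon(lexicon):
    """Build a prefix tree from a dict of word -> phone sequence.
    Inner nodes are keyed by phone; the special key "#words" lists the
    words whose pronunciation ends at that node."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})   # shared prefixes share nodes
        node.setdefault("#words", []).append(word)
    return root
```

For example, entries such as `{"PRADA": ["P","R","AA","D","AH"], "PRE": ["P","R","IY"]}` share the `P -> R` branch before diverging.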
2.4 Recognition Phase of the Proposed System (cont.)
- The Task Grammar
- Grammar: the union of words and phrases that constrains the range of input or output in the voice application.
2.4 Recognition Phase of the Proposed System (cont.)
- Viterbi Beam Search
- Viterbi search is essentially a dynamic programming algorithm. It is well known that a complete Viterbi search is computationally intensive.
- Most large-vocabulary ASR systems therefore rely on beam search to cut down the computation cost.
- In other words, we do not need to search the entire Viterbi trellis to find the optimal path. Instead, we limit the number of branch-out candidates (which is proportional to the computation cost) according to a certain heuristic.
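The pruning idea can be sketched on a plain HMM trellis; HVite applies the same principle over a lexicon-tree network rather than a dense state lattice, and the beam width here is an assumed value.

```python
import numpy as np

def viterbi_beam(log_trans, log_emit, log_init, beam_width=5.0):
    """Viterbi decoding with beam pruning.
    log_trans: (S, S) log transition probabilities,
    log_emit:  (T, S) per-frame log-likelihoods,
    log_init:  (S,)   log initial-state probabilities.
    Returns the best state path and its log score."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # Beam pruning: deactivate states far below the frame's best score,
        # so they never branch out at the next frame.
        score[score < score.max() - beam_width] = -np.inf
        cand = score[:, None] + log_trans        # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # Backtrace the surviving best path.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```

With a wide beam this returns the exact Viterbi path; narrowing the beam trades a small risk of search error for a large drop in active states.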
3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
- The Cambridge University Hidden Markov Model Toolkit (HTK) was used to implement the mixed-language speech recognizer.
HTK HMM Training Tools
- The training phase starts by applying the HTK initialization tool (HInit) to a predefined simple prototype model. Using all the available training data and repeated Viterbi alignment, this tool provides initial estimates of the HMM parameters.
- Next, the HRest tool is used to provide more accurate parameter estimates using the Baum-Welch algorithm.
Recognition Process
Porting HVite to the Embedded System (Windows Mobile 5.0)
- The HTK tools were ported to the embedded system (OS: Windows Mobile 5.0, Acer N300).
- HCopy: feature extraction
- HVite: Viterbi search
- Therefore, the two important parts of the MLSR, the feature extraction core and the decoding core, can run on the embedded system.
3.1.1 Configuration of the Embedded System (Acer N300)
- Performance
  - Samsung S3C2440 processor at 400 MHz
- System memory
  - 64 MB mobile SDRAM for system operation
  - 128 MB of flash memory for the operating system, embedded applications, user applications and storage
- Microsoft Windows Mobile™ Version 5.0 Software for Pocket PC, Premium Edition with Microsoft Outlook 2002
3.1.2 eMbedded Visual C++ 4.0 (eVC 4.0)
- The Microsoft eMbedded Visual Tools 4.0 deliver a free and complete desktop development environment for creating applications and system components for Windows-powered devices, including the Pocket PC and Handheld PC.
- Microsoft eMbedded Visual C++ enables playing sounds on Pocket PCs. Functions such as PlaySound, MessageBeep, waveOutWrite and other waveOut functions are available.
3.1.3 System Interface
3.2.1 Experiment Setup
- EAT
- From EAT, after some selection, there are a total of 8,375 wave files for training, including English long sentences, English short sentences and English words. The corpus contains 19,221 words. This contributes about 5.33 hours of continuous speech.
- MAT
- From MAT-400, we use the MATDB-4 (1200) and MATDB-5 (400) categories as our training corpora. There are a total of 15,400 wave files for training, including words of 2 to 4 syllables and phonetically balanced sentences. This contributes about 9.65 hours of continuous speech.
3.2.1 Experiment Setup (cont.)
- We pick 100 sentences from EAT and another 100 sentences from MAT-400 as test pattern A, which is not included in the training corpora.
- Test pattern B
- Test pattern C
3.2.2 Experimental Results
- Insertion errors (ins), deletion errors (del) and substitution errors (sub) were considered. The phone recognition accuracy (acc) was estimated as acc = (N - del - sub - ins) / N, where N is the total number of phones in the reference transcription.
- The monophone approach adopts direct combination of the Mandarin and English language-dependent phones (MIX) and language-independent IPA phones (IPA).
3.2.2 Experimental Results (cont.)
Test pattern B
Test pattern C
4. Conclusions and Future Work
- A prototype of mixed-language accented speech recognition for Mandarin and English has been established and presented.
- The experimental results show that the proposed system achieves 70-80% lexicon recognition accuracy.
- In the future, more languages, such as Taiwanese, may also be integrated into the prototype system.
- Finally, the speech recognizer will also be implemented in a hands-free environment with learning capability, together with a VLSI hardware realization.
References
- [1] Ladefoged P., Local J. and Shockey L., editors. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, U.K., 1999.
- [2] Kohler J. Comparing three methods to create multilingual phone models for vocabulary independent speech recognition tasks. In Proc. ESCA-NATO Tutorial and Research Workshop on Multi-lingual Interoperability in Speech Technology, pp. 79-84, Sept. 1999.
- [3] Kohler J. Multilingual phone models for vocabulary-independent speech recognition tasks. Speech Communication, 35(1-2):21-30, Aug. 2001.
- [4] Vihola M., Harju M., Salmela P., Suontausta J. and Savela J. Two dissimilarity measures for HMMs and their application in phoneme model clustering. In Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, 2002.
- [5] Harju M., Salmela P., Leppanen J., Viikki O. and Saarinen J. Comparing parameter tying techniques for multilingual acoustic modelling. In Proc. European Conference on Speech Communication and Technology, pp. 2729-2732, Aalborg, Denmark, Sept. 2001.
- [6] Andersen O., Dalsgaard P. and Barry W. On the use of data-driven clustering technique for identification of poly- and mono-phonemes for four European languages. In Proc. International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 121-124, Adelaide, Australia, Apr. 1994.
- [7] Imperl B. and Horvat B. The clustering algorithm for the definition of multilingual set of context dependent speech models. In Proc. European Conference on Speech Communication and Technology, pp. 887-890, Budapest, Hungary, 1999.
- [8] Zgank A., Imperl B. and Johansen F. Crosslingual speech recognition with multilingual acoustic models based on agglomerative and tree-based triphone clustering. In Proc. European Conference on Speech Communication and Technology, pp. 2725-2728, Aalborg, Denmark, Sept. 2001.
- [9] Turunen E. Survey of theory and applications of Lukasiewicz-Pavelka fuzzy logic. In di Nola A. and Gerla G., editors, Lectures on Soft Computing and Fuzzy Logic, Advances in Soft Computing, pp. 313-337. Physica-Verlag, Heidelberg, 2001.
- [10] Chomsky N. and Halle M. The Sound Pattern of English. New York: Harper & Row, 1968.
- [11] Wang H. C. MAT: A project to collect Mandarin speech data through telephone networks. Computational Linguistics and Chinese Language Processing, vol. 2, no. 1, pp. 73-90, 1997.
References (cont.)
- [12] Seide F. and Wang N. J. C. Phonetic modeling in the Philips Chinese continuous-speech recognition system. In Proc., 1998.
- [13] Chen Y. J., Wu C.-H. et al. Generation of robust phonetic set and decision tree for Mandarin using chi-square testing. Speech Communication, vol. 38(3-4), pp. 349-364, 2002.
- [14] Young S. et al. The HTK Book (v3.2). Cambridge University Engineering Dept., 2002.
- [15] Karjalainen M. Kommunikaatioakustiikka [Communication Acoustics]. Technical Report 51, Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing, Espoo, Finland, 1999. Preprint, in Finnish.
- [16] Rabiner L. Fundamentals of Speech Recognition. PTR Prentice-Hall Inc., New Jersey, 1993.
- [17] Wang H. C. MAT: A project to collect Mandarin speech data through telephone networks. Computational Linguistics and Chinese Language Processing, vol. 2, no. 1, pp. 73-90, 1997.
- [18] Tseng C. Y. A phonetically oriented speech database for Mandarin Chinese. In Proc. ICPhS '95, Stockholm, pp. 326-329, 1995.
- [19] Lee C.-H., Rabiner L. et al. Acoustic modeling for large vocabulary speech recognition. Computer Speech and Language, vol. 4, pp. 127-165, 1990.
- [20] Gauvain J. L., Lamel L. F., Adda G. and Adda-Decker M. Speaker-independent continuous speech dictation. Speech Communication, vol. 15(1-2), pp. 21-37, 1994.
- [21] Huang C. L. and Wu C.-H. Phone set generation based on acoustic and contextual analysis for multilingual speech recognition. Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C., 2007.
- [22] Huang C. L. and Wu C.-H. Generation of phonetic units for mixed-language speech recognition based on acoustic and contextual analysis. Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C., 2007.
- [23] Yu S., Hu S., Zhang S. and Xu B. Chinese-English bilingual speech recognition. Hi-Tech Innovation Center, Institute of Automation, Chinese Academy of Sciences, Beijing, P. R. China, 2003.
- [24] Ma C. Y. and Fung P. Using English phoneme models for Chinese speech recognition. Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology (HKUST), Hong Kong.
Thank you!