1
English across Taiwan (EAT) Application on
Mixed-Language Accented Speech Recognition
  • Presenter Prof. Jhing-Fa Wang
  • Jhing-Fa Wang, Chi-Feng Li, Jia-Ching Wang,
    Shun-Chieh Lin, and Chung-Hsien Wu
  • National Cheng Kung University, Tainan, Taiwan
  • O-COCOSDA 2007, Hanoi, Dec. 4-6

2
OUTLINE
  • 1. INTRODUCTION
  • 2. MANDARIN AND ENGLISH MIXED-LANGUAGE
    SPEECH RECOGNITION (MLSR) SYSTEM
  • 3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
  • 4. CONCLUSIONS AND FUTURE WORK

3
1.1 Background and Motivation
  • Speech recognition technology has recently
    reached a higher level of performance and
    robustness, allowing it to be deployed in a
    number of real-world environments, such as mobile
    phones and toys.
  • As global communication and multiethnic societies grow,
    multilingualism frequently occurs in speech content, so multilingual
    speech recognition systems have become increasingly desirable.

4
1.2 Objective of Proposed Work
  • Our research mainly aims to build a Mandarin and English
    mixed-language speech recognition (MLSR) system.
  • The MLSR system can easily be adapted to different command-and-control
    applications by changing the dictionary description and grammar for
    each new task.
  • Finally, the mixed-language speaker-independent speech recognition
    system is also implemented as an embedded system on PDAs.

5
1.3 Current Research and Products
  • Mandarin speech recognition
  • IBM ViaVoice
  • Penpower Voicewriter
  • NCKU VenusDictate
  • English speech recognition
  • AT&T Labs
  • Microsoft
  • IBM
  • CMU

6
2 Framework of the Proposed System
  • Training Phase: Speech Corpora (Chinese/English) -> Acoustic Model
    Training -> Mixed-Language Acoustic Models
  • Recognition Phase: Input Speech -> Front-end Signal Processing ->
    Feature Vectors -> Search Algorithm (with Grammar and Lexicon) ->
    Output Sentence
  • 1. Front-end Signal Processing: silence removal, voice activity
    detection, feature extraction
  • 2. Example input sentence: 穿著PRADA的惡魔 ("The Devil Wears Prada")
  • 3. Acoustic models: Ch_m-UW-AN-Zh_m-@-P-R-AA-D-AH-D-@-@-M-OW
  • 4. Lexicon: (Ch_m-UW-AN) 穿, (Zh_m-@) 著, (P-R-AA-D-AH) PRADA,
    (D-@) 的, (@) 惡, (M-OW) 魔
  • 5. Grammar: (穿) (著) (PRADA) (的) (惡) (魔)
7
2.1 Feature Extraction and Front-End Signal
Processing
  • Features and transforms
  • An MFCC (Mel-Frequency Cepstral Coefficient) analysis is performed
    for each signal frame, and the resulting cepstral coefficients are
    extracted as the feature vector.
8
2.1 Feature Extraction and Front-End Signal
Processing
  • Silence Removal
  • Silence removal discards the silent parts of the speech signal, which
    contain little or no speech-specific information.
  • End-point Detection
  • The goal of end-point detection (EPD) is to identify the meaningful
    part of the audio signal for further processing. EPD is therefore
    also known as "speech detection" or "voice activity detection" (VAD).
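A minimal energy-based end-point detector in the spirit of the
description above might look like this; the frame sizes, the noise-floor
estimate from the leading frames, and the threshold factor are all
assumptions of this sketch:

```python
import numpy as np

def detect_endpoints(signal, sr=16000, frame_ms=25, hop_ms=10, factor=3.0):
    """Return (start, end) sample indices of the detected speech region,
    assuming the leading frames contain only background noise."""
    flen = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + flen] for i in range(0, len(signal) - flen, hop)]
    energy = np.array([np.sum(np.asarray(f, dtype=float) ** 2)
                       for f in frames])

    # Estimate the noise floor from the first few frames, set a threshold
    noise = energy[:5].mean()
    threshold = noise * factor
    voiced = np.where(energy > threshold)[0]
    if len(voiced) == 0:
        return None  # no speech found
    start = voiced[0] * hop
    end = voiced[-1] * hop + flen
    return start, end
```

A production detector would add hangover smoothing and a minimum-duration
constraint so that short pauses inside an utterance are not cut out.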

9
2.2.1 Training and Testing Corpora
  • Corpora of English Across Taiwan (EAT)
  • The EAT project prepared 600 recording sheets. Each
    sheet had 80 reading sentences, including
  • English long sentences
  • English short sentences
  • English words

10
2.2.1 Training and Testing Corpora
  • Corpora of Mandarin speech data Across Taiwan
    (MAT-400)

11
2.2.2 Acoustic Model Training
  • We developed the bilingual acoustic model for Mandarin and English
    using the following steps:
  • Step 1. Group the phones into acoustically and phonetically similar
    clusters for each language.
  • Step 2. Develop Gaussian acoustic models for all the monophones.

12
  • Step 3. Calculate the dissimilarity of each Mandarin phone to all the
    English phones in the same cluster. If the value is below a
    threshold, the Mandarin source phone is mapped to that English phone;
    otherwise, both phones are modeled separately in the bilingual
    system.
  • Step 4. After obtaining the phone list for the bilingual system, the
    Mandarin lexicon is edited using the mapping rules obtained.
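Steps 3 and 4 can be sketched as below. The slides do not specify the
dissimilarity measure, so a Bhattacharyya distance between
diagonal-covariance Gaussians stands in for it here; the phone models
and the threshold value are likewise illustrative:

```python
import numpy as np

def bhattacharyya(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians
    (a stand-in for the paper's phone-model dissimilarity measure)."""
    m1, v1, m2, v2 = map(np.asarray, (m1, v1, m2, v2))
    v = (v1 + v2) / 2.0
    term1 = 0.125 * np.sum((m1 - m2) ** 2 / v)
    term2 = 0.5 * np.log(np.prod(v) / np.sqrt(np.prod(v1) * np.prod(v2)))
    return term1 + term2

def map_phones(mandarin, english, threshold=0.5):
    """Map each Mandarin phone to the closest English phone if their
    dissimilarity is below the threshold; otherwise keep the Mandarin
    phone as a separate model in the bilingual phone set."""
    mapping = {}
    for zh, (zm, zv) in mandarin.items():
        best, best_d = None, float("inf")
        for en, (em, ev) in english.items():
            d = bhattacharyya(zm, zv, em, ev)
            if d < best_d:
                best, best_d = en, d
        mapping[zh] = best if best_d < threshold else zh
    return mapping
```

Each phone is represented here as a `(mean, variance)` pair; in the real
system the distance would be computed between the HMM state
distributions of the clustered monophone models.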

13
2.2.2 Mixed Multilingual Phone Inventory for English and Mandarin
Mixed multilingual phone inventory for English and Mandarin (based on
Chomsky and Halle's definitions [10]).
14
2.2.2 Mixed-Language Phone Set (cont.)
15
2.3 Training Phase of the Proposed System
  • Acoustic Models
  • Training Phase

16
2.4 Recognition Phase of the Proposed System
  • Recognition Phase

Acoustic Models, Lexicon, Grammar
17
2.4 Recognition Phase of the Proposed System
  • Tree Lexicon (A Pronunciation Dictionary)
  • The dictionary provides an association between the words used in the
    task grammar and the acoustic models, which are composed of sub-word
    (phonetic, syllabic, etc.) units.

English
  • Reference: the CMU Pronouncing Dictionary
Mandarin
  • Based on MPS (Mandarin Phonetic Symbols)
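The tree lexicon can be sketched as a phone-level prefix tree in which
words sharing a pronunciation prefix share arcs, so the decoder scores
the shared acoustic models only once; the entries and phone labels below
are illustrative:

```python
class TreeLexicon:
    """Phone-level prefix tree (trie) for a pronunciation dictionary."""

    def __init__(self):
        self.root = {}

    def add(self, word, phones):
        """Insert a word under its phone sequence; shared prefixes
        reuse the same nodes."""
        node = self.root
        for p in phones:
            node = node.setdefault(p, {})
        # "#words" marks word ends (assumes no phone is named "#words")
        node.setdefault("#words", []).append(word)

    def lookup(self, phones):
        """Return the words whose pronunciation is exactly `phones`."""
        node = self.root
        for p in phones:
            if p not in node:
                return []
            node = node[p]
        return node.get("#words", [])

lex = TreeLexicon()
# Mixed-language entries, echoing the PRADA example (labels illustrative)
lex.add("PRADA", ["P", "R", "AA", "D", "AH"])
lex.add("PRAY", ["P", "R", "EY"])
```

Here "PRADA" and "PRAY" share the P-R prefix arcs, which is exactly the
saving a tree-structured lexicon gives the search.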

18
2.4 Recognition Phase of the Proposed System
  • The Task Grammar
  • Grammar: the set of words or phrases used to constrain the range of
    input or output in the voice application.
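As an illustration, a command-control task grammar in HTK's HParse
notation could look like the following; the vocabulary here is invented
for the example, not taken from the paper's task:

```text
$object  = PRADA | COFFEE | MUSIC;
$action  = FIND | PLAY | BUY;
( SENT-START $action $object SENT-END )
```

HParse expands such a grammar into the word network that constrains the
Viterbi search during recognition.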

19
2.4 Recognition Phase of the Proposed System
  • Viterbi Beam Search
  • Viterbi search is essentially a dynamic
    programming algorithm. It is well known that a
    complete Viterbi search is computation intensive.
  • Most large-vocabulary ASR systems rely on beam search to cut down
    the computation cost.
  • In other words, we do not need to search the entire Viterbi trellis
    to find the optimal path. Instead, we limit the number of branch-out
    candidates (which is proportional to the computation cost) according
    to a heuristic.
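The pruning idea can be sketched as a Viterbi pass that, at each frame,
extends only states whose score lies within a fixed log-probability beam
of that frame's best score; the HMM in the usage below is a toy example,
not a model from the paper:

```python
import numpy as np

def viterbi_beam(log_A, log_B, log_pi, beam=5.0):
    """Viterbi decoding with beam pruning.

    log_A:  (N, N) log transition probabilities
    log_B:  (T, N) per-frame log observation likelihoods
    log_pi: (N,)   log initial probabilities
    Returns the best state sequence as a list of state indices.
    """
    T, N = log_B.shape
    score = log_pi + log_B[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # Prune: keep only states within `beam` of the best path score
        best = score.max()
        pruned = np.where(score >= best - beam, score, -np.inf)
        # Extend all surviving paths by one transition
        cand = pruned[:, None] + log_A          # cand[i, j]: i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_B[t]
    # Trace back the best path from the final frame
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A tighter beam prunes more states and runs faster, at the risk of
discarding the path that would have become optimal later.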

20
3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
  • The Cambridge University Hidden Markov Models
    toolkit (HTK) was used for the implementation of
    the Mixed-Language Speech Recognizer.

21
HTK HMM training tool
  • The training phase starts by applying an HTK initialization tool to
    a predefined simple prototype model.
  • This tool uses all the available training data and, by applying
    Viterbi alignment repeatedly, provides initial estimates of the HMM
    parameters.
  • The HRest tool is then used to obtain more accurate parameter
    estimates using the Baum-Welch algorithm.

22
Recognition Process
  • Configuration File
  • HMMs
  • Grammar
  • Dictionary
  • Phone set

23
Port HVite to Embedded System (WM 5.0)
  • Port the HTK tools to an embedded system (OS: Windows Mobile 5.0) on
    the Acer N300.
  • HCopy: feature extraction
  • HVite: Viterbi search
  • Therefore, the two key parts of the MLSR, the feature-extraction
    core and the decoding core, can run on the embedded system.

24
3.1.1 Configuration of Embedded System (Acer N300)
  • Performance
  • Samsung S3C2440 processor at 400 MHz
  • System memory
  • 64 MB mobile SDRAM for system operation.
  • 128 MB of flash memory for operating system,
    embedded applications, user applications and
    storage
  • Microsoft Windows Mobile(TM) Version 5.0 software for Pocket PC,
    Premium Edition with Microsoft Outlook 2002

25
3.1.2 eMbedded Visual C++ 4.0 (EVC 4.0)
  • The Microsoft eMbedded Visual Tools 4.0 deliver a free and complete
    desktop development environment for creating applications and system
    components for Windows-powered devices, including the Pocket PC and
    Handheld PC.
  • Microsoft eMbedded Visual C++ enables playing sounds on Pocket PCs.
    Functions such as PlaySound, MessageBeep, waveOutWrite and the other
    waveOut functions are available.

26
3.1.3 System Interface
27
3.2.1 Experiment Setup
  • EAT
  • In EAT, after some selection, there are 8,375 wave files in total,
    including English long sentences, English short sentences and
    English words for training. The corpus contains 19,221 words and
    contributes about 5.33 hours of continuous speech.
  • MAT
  • In MAT-400, we use the MATDB-4 (1200) and MATDB-5 (400) categories
    as our training corpora. There are 15,400 wave files in total,
    including words of 2 to 4 syllables and phonetically balanced
    sentences for training. This contributes about 9.65 hours of
    continuous speech.

28
3.2.1 Experiment Setup
  • We pick 100 sentences from EAT and another 100 sentences from
    MAT-400 as Test pattern A; these are not included in the training
    corpora.

[Tables: Test pattern B, Test pattern C]
29
3.2.2 Experimental Results
  • Insertion errors ( ins ), deletion errors ( del )
    and substitution errors ( sub ), were considered.
    The phone recognition accuracies for ( acc ) was
    estimated as
  • The monophone approach adopts direct combination
    of Mandarin and English language-dependent phones
    (MIX) and language-dependent IPA phones (IPA).
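The insertion, deletion and substitution counts come from a Levenshtein
alignment of the reference and recognized phone strings; a sketch, with
illustrative phone sequences, follows:

```python
def error_counts(ref, hyp):
    """Align a reference and hypothesis phone sequence by edit distance
    and return (sub, del, ins, acc) with acc = (N - sub - del - ins)/N."""
    m, n = len(ref), len(hyp)
    # d[i][j] = (cost, subs, dels, inss) for ref[:i] vs hyp[:j]
    d = [[(0, 0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = (i, 0, i, 0)          # all deletions
    for j in range(1, n + 1):
        d[0][j] = (j, 0, 0, j)          # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            sub, dele, inse = d[i - 1][j - 1], d[i - 1][j], d[i][j - 1]
            best = min(sub, dele, inse)  # lowest cost wins
            if best is sub:
                d[i][j] = (sub[0] + 1, sub[1] + 1, sub[2], sub[3])
            elif best is dele:
                d[i][j] = (dele[0] + 1, dele[1], dele[2] + 1, dele[3])
            else:
                d[i][j] = (inse[0] + 1, inse[1], inse[2], inse[3] + 1)
    _, s, dl, ins = d[m][n]
    acc = (m - s - dl - ins) / m
    return s, dl, ins, acc
```

For example, recognizing "P R AA T AH" against the reference
"P R AA D AH" yields one substitution and an accuracy of 0.8.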

30
3.2.2 Experimental Results (cont.)
[Tables: Test pattern B, Test pattern C]
31
4. Conclusions and Future Works
  • A prototype of mixed-language accented speech recognition for
    Mandarin and English has been established and presented.
  • The experimental results show that the proposed system achieves
    70-80% lexicon recognition accuracy.
  • In the future, more languages, such as Taiwanese, may also be
    integrated into the prototype system.
  • Finally, the speech recognizer will also be implemented in a
    hands-free environment with learning capability, and a VLSI hardware
    realization is planned.

32
References
  • [1] Ladefoged P., Local J. and Shockey L., editors. Handbook of the
    International Phonetic Association: A Guide to the Use of the
    International Phonetic Alphabet. Cambridge University Press, U.K.,
    1999.
  • [2] Kohler J. Comparing three methods to create multilingual phone
    models for vocabulary independent speech recognition tasks. In Proc.
    ESCA-NATO Tutorial and Research Workshop on Multi-lingual
    Interoperability in Speech Technology, pp. 79-84, Sept. 1999.
  • [3] Kohler J. Multilingual phone models for vocabulary-independent
    speech recognition tasks. Speech Communication, 35(1-2):21-30, Aug.
    2001.
  • [4] Vihola M., Harju M., Salmela P., Suontausta J. and Savela J. Two
    dissimilarity measures for HMMs and their application in phoneme
    model clustering. In Proceedings of the International Conference on
    Acoustics, Speech and Signal Processing, Orlando, USA, 2002.
  • [5] Harju M., Salmela P., Leppanen J., Viikki O. and Saarinen J.
    Comparing parameter tying techniques for multilingual acoustic
    modelling. In Proceedings of the European Conference on Speech
    Communication and Technology, pp. 2729-2732, Aalborg, Denmark, Sept.
    2001.
  • [6] Andersen O., Dalsgaard P. and Barry W. On the use of data-driven
    clustering technique for identification of poly- and mono-phonemes
    for four European languages. In Proceedings of the International
    Conference on Acoustics, Speech and Signal Processing, volume 1, pp.
    121-124, Adelaide, Australia, Apr. 1994.
  • [7] Imperl B. and Horvat B. The clustering algorithm for the
    definition of a multilingual set of context dependent speech models.
    In Proceedings of the European Conference on Speech Communication
    and Technology, pp. 887-890, Budapest, Hungary, 1999.
  • [8] Zgank A., Imperl B. and Johansen F. Crosslingual speech
    recognition with multilingual acoustic models based on agglomerative
    and tree-based triphone clustering. In Proceedings of the European
    Conference on Speech Communication and Technology, pp. 2725-2728,
    Aalborg, Denmark, Sept. 2001.
  • [9] Turunen E. Survey of theory and applications of
    Lukasiewicz-Pavelka fuzzy logic. In di Nola A. and Gerla G.,
    editors, Lectures on Soft Computing and Fuzzy Logic. Advances in
    Soft Computing, pp. 313-337. Physica-Verlag, Heidelberg, 2001.
  • [10] Chomsky N. and Halle M. The Sound Pattern of English. New York:
    Harper & Row, 1968.
  • [11] Wang H. C. MAT - A project to collect Mandarin speech data
    through telephone networks. Computational Linguistics and Chinese
    Language Processing, vol. 2, no. 1, pp. 73-90, 1997.

33
References
  • [12] Seide F. and Wang N. J. C. Phonetic modeling in the Philips
    Chinese continuous-speech recognition system. In Proc., 1998.
  • [13] Chen Y. J., Wu C.-H. et al. Generation of robust phonetic set
    and decision tree for Mandarin using chi-square testing. Speech
    Communication, Vol. 38(3-4), pp. 349-364, 2002.
  • [14] Young S. et al. The HTK Book (V3.2). Cambridge University
    Engineering Dept., 2002.
  • [15] Karjalainen M. Kommunikaatioakustiikka [Communication
    Acoustics]. Technical Report 51, Helsinki University of Technology,
    Laboratory of Acoustics and Audio Signal Processing, Espoo, Finland,
    1999. Preprint, in Finnish.
  • [16] Rabiner L. Fundamentals of Speech Recognition. PTR
    Prentice-Hall Inc., New Jersey, 1993.
  • [17] Wang H. C. MAT - A project to collect Mandarin speech data
    through telephone networks. Computational Linguistics and Chinese
    Language Processing, vol. 2, no. 1, pp. 73-90, 1997.
  • [18] Tseng C. Y. A phonetically oriented speech database for
    Mandarin Chinese. In Proc. ICPhS'95, Stockholm, pp. 326-329, 1995.
  • [19] Lee C.-H., Rabiner L. et al. Acoustic modeling for large
    vocabulary speech recognition. Computer Speech and Language, Vol. 4,
    pp. 127-165, 1990.
  • [20] Gauvain J. L., Lamel L. F., Adda G. and Adda-Decker M. Speaker
    independent continuous speech dictation. Speech Communication, Vol.
    15(1-2), pp. 21-37, 1994.
  • [21] Huang C. L. and Wu C.-H. Phone set generation based on acoustic
    and contextual analysis for multilingual speech recognition.
    Department of Computer Science and Information Engineering, National
    Cheng Kung University, Tainan, Taiwan, R.O.C., 2007.
  • [22] Huang C. L. and Wu C.-H. Generation of phonetic units for
    mixed-language speech recognition based on acoustic and contextual
    analysis. Department of Computer Science and Information
    Engineering, National Cheng Kung University, Tainan, Taiwan,
    R.O.C., 2007.
  • [23] Yu S., Hu S., Zhang S. and Xu B. Chinese-English bilingual
    speech recognition. Hi-Tech Innovation Center, Institute of
    Automation, Chinese Academy of Sciences, Beijing, P. R. China, 2003.
  • [24] Ma C. Y. and Fung P. Using English phoneme models for Chinese
    speech recognition. Human Language Technology Center, Department of
    Electrical and Electronic Engineering, Hong Kong University of
    Science and Technology (HKUST), Hong Kong.

34
Thank you !!