Title: English across Taiwan (EAT) Application on Mixed-Language Accented Speech Recognition
English across Taiwan (EAT) Application on
Mixed-Language Accented Speech Recognition
- Presenter: Prof. Jhing-Fa Wang
- Jhing-Fa Wang, Chi-Feng Li, Jia-Ching Wang, Shun-Chieh Lin, and Chung-Hsien Wu
- National Cheng Kung University, Tainan, Taiwan
- O-COCOSDA 2007, Dec. 4-6, Hanoi
OUTLINE
- 1. INTRODUCTION
- 2. MANDARIN AND ENGLISH MIXED-LANGUAGE SPEECH RECOGNITION (MLSR) SYSTEM
- 3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
- 4. CONCLUSIONS AND FUTURE WORK
1.1 Background and Motivation
- Speech recognition technology has recently reached a higher level of performance and robustness, allowing it to be deployed in a number of real-world environments, such as mobile phones and toys.
- As global communication and multiethnic societies grow, multilingualism frequently occurs in speech content, so multilingual speech recognition systems have become increasingly desirable.
1.2 Objective of Proposed Work
- Our research mainly aims to build a Mandarin and English mixed-language speech recognition (MLSR) system.
- The MLSR system can easily be reused in different command-and-control applications by changing the dictionary description and grammar for each new task.
- Finally, the speaker-independent mixed-language speech recognition system is also implemented on embedded PDAs.
1.3 Current Research and Products
- Mandarin speech recognition
  - IBM ViaVoice
  - Penpower Voicewriter
  - NCKU VenusDictate
- English speech recognition
  - AT&T Labs
  - Microsoft
  - IBM
  - CMU
2. Framework of the Proposed System
[System diagram: Input Speech -> Front-end Signal Processing -> Feature Vectors -> Search Algorithm -> Output Sentence; Speech Corpora (Chinese/English) -> Acoustic Model Training -> Mixed-Language Acoustic Models; the Grammar also feeds the Search Algorithm]
- 1. Front-end Signal Processing: Silence Removal, Voice Activity Detection, Feature Extraction
- 2. Example Input Sentence: ??PRADA??? (a mixed Mandarin-English utterance containing "PRADA")
- 3. Acoustic Models: Ch_m-UW-AN-Zh_m-_at_-P-R-AA-D-AH-D-_at_-_at_-M-OW
- 4. Lexicon: (Ch_m-UW-AN), (D-_at_), (Zh_m-_at_), (_at_), (P-R-AA-D-AH) = PRADA, (M-OW)
- 5. Grammar: (?) (?) (PRADA) (?) (?) (?)
2.1 Feature Extraction and Front-End Signal Processing
- Features and transforms
- An MFCC (Mel-Frequency Cepstral Coefficient) analysis is performed for each signal frame, and the following coefficients are extracted.
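As an illustration of the MFCC analysis described above, the sketch below computes cepstral coefficients with NumPy: pre-emphasis, Hamming-windowed framing, power spectrum, mel filterbank, log compression, and a DCT. The frame sizes, filter count, and 0.97 pre-emphasis factor are common defaults, not the exact values used in this system.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Simplified MFCC sketch (not HTK's exact HCopy pipeline)."""
    # 1. Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Frame the signal and apply a Hamming window.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i*hop : i*hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 3. Power spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Triangular mel filterbank.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i-1], bins[i], bins[i+1]
        for k in range(l, c):
            fbank[i-1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i-1, k] = (r - k) / max(r - c, 1)
    # 5. Log filterbank energies, then DCT-II to decorrelate into cepstra.
    fb_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return fb_energy @ dct.T   # shape: (n_frames, n_ceps)
```

With one second of 16 kHz audio this yields 98 frames of 13 coefficients each.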
2.1 Feature Extraction and Front-End Signal Processing (cont.)
- Silence Removal
- Silence removal is used to remove silent parts of the speech signal, which contain little or no speech-specific information.
- End-point Detection
- The goal of end-point detection (EPD for short) is to identify the important part of the audio signal for further processing. Hence EPD is also known as "speech detection" or "voice activity detection" (VAD for short).
2.2.1 Training and Testing Corpora
- Corpora of English across Taiwan (EAT)
- The EAT project prepared 600 recording sheets. Each sheet had 80 reading sentences, including:
  - English long sentences
  - English short sentences
  - English words
2.2.1 Training and Testing Corpora (cont.)
- Corpora of Mandarin speech data Across Taiwan (MAT-400)
2.2.2 Acoustic Model Training
- We developed the bilingual acoustic model for Mandarin and English using the following steps:
- Step 1. Group the phones into acoustically and phonetically similar clusters for each language.
- Step 2. Develop Gaussian acoustic models for all the monophones.
- Step 3. Calculate the dissimilarity of each Mandarin phone to all the English phones in the same group. If the value is below a threshold, the Mandarin phone is mapped to that English phone; otherwise, both phones are modeled separately in the bilingual system.
- Step 4. After obtaining the list of phones for the bilingual system, the lexicon for Mandarin is edited using the mapping rules obtained.
2.2.2 Mixed Multilingual Phone Inventory for English and Mandarin
- Mixed multilingual phone inventory for English and Mandarin, based on the phonetic feature definitions of Chomsky and Halle [10].
2.2.2 Mixed-Language Phone Set (cont.)
2.3 Training Phase of the Proposed System
2.4 Recognition Phase of the Proposed System
- Acoustic Models, Lexicon, Grammar
2.4 Recognition Phase of the Proposed System (cont.)
- Tree Lexicon (A Pronunciation Dictionary)
- The dictionary provides an association between the words used in the task grammar and the acoustic models, which are composed of sub-word (phonetic, syllabic, etc.) units.
[Lexicon trees for English and Mandarin]
- English pronunciations are referenced from the CMU Pronouncing Dictionary.
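A tree lexicon can be sketched as a prefix tree (trie) over phone sequences: words that share initial phones share branches, so the decoder expands each prefix only once. This minimal sketch uses plain dictionaries; a real decoder attaches acoustic-model states to the arcs, and the example entries below are illustrative.

```python
def build_tree_lexicon(lexicon):
    """Build a prefix tree from a dict of word -> phone sequence.
    Inner nodes are keyed by phone; the special key "#words" lists the
    words whose pronunciation ends at that node."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})   # shared prefixes share nodes
        node.setdefault("#words", []).append(word)
    return root
```

For example, entries such as `{"PRADA": ["P","R","AA","D","AH"], "PRE": ["P","R","IY"]}` share the `P -> R` branch before diverging.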
2.4 Recognition Phase of the Proposed System (cont.)
- The Task Grammar
- Grammar: the union of words and phrases that constrains the range of input or output in the voice application.
2.4 Recognition Phase of the Proposed System (cont.)
- Viterbi Beam Search
- Viterbi search is essentially a dynamic programming algorithm. It is well known that a complete Viterbi search is computationally intensive.
- Most large-vocabulary ASR systems therefore rely on beam search to cut down the computation cost.
- In other words, we do not need to search the entire Viterbi trellis to find the optimal path. Instead, we limit the number of branch-out candidates (which is proportional to the computation cost) according to a certain heuristic.
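The pruning idea can be sketched on a plain HMM trellis; HVite applies the same principle over a lexicon-tree network rather than a dense state lattice, and the beam width here is an assumed value.

```python
import numpy as np

def viterbi_beam(log_trans, log_emit, log_init, beam_width=5.0):
    """Viterbi decoding with beam pruning.
    log_trans: (S, S) log transition probabilities,
    log_emit:  (T, S) per-frame log-likelihoods,
    log_init:  (S,)   log initial-state probabilities.
    Returns the best state path and its log score."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # Beam pruning: deactivate states far below the frame's best score,
        # so they never branch out at the next frame.
        score[score < score.max() - beam_width] = -np.inf
        cand = score[:, None] + log_trans        # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # Backtrace the surviving best path.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```

With a wide beam this returns the exact Viterbi path; narrowing the beam trades a small risk of search error for a large drop in active states.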
3. IMPLEMENTATION AND EXPERIMENTAL RESULTS
- The Cambridge University Hidden Markov Model Toolkit (HTK) was used to implement the mixed-language speech recognizer.
HTK HMM Training Tools
- The training phase starts by applying the HTK initialization tool (HInit) to a predefined simple prototype model. Using all the available training data and repeated Viterbi alignment, this tool provides initial estimates of the HMM parameters.
- Next, the HRest tool is used to provide more accurate parameter estimates using the Baum-Welch algorithm.
Recognition Process
Porting HVite to the Embedded System (Windows Mobile 5.0)
- The HTK tools were ported to the embedded system (OS: Windows Mobile 5.0, Acer N300).
- HCopy: feature extraction
- HVite: Viterbi search
- Therefore, the two important parts of the MLSR, the feature extraction core and the decoding core, can run on the embedded system.
3.1.1 Configuration of the Embedded System (Acer N300)
- Performance
  - Samsung S3C2440 processor at 400 MHz
- System memory
  - 64 MB mobile SDRAM for system operation
  - 128 MB of flash memory for the operating system, embedded applications, user applications and storage
- Microsoft Windows Mobile™ Version 5.0 Software for Pocket PC, Premium Edition with Microsoft Outlook 2002
3.1.2 eMbedded Visual C++ 4.0 (eVC 4.0)
- The Microsoft eMbedded Visual Tools 4.0 deliver a free and complete desktop development environment for creating applications and system components for Windows-powered devices, including the Pocket PC and Handheld PC.
- Microsoft eMbedded Visual C++ enables playing sounds on Pocket PCs. Functions such as PlaySound, MessageBeep, waveOutWrite and other waveOut functions are available.
3.1.3 System Interface
3.2.1 Experiment Setup
- EAT
- From EAT, after some selection, there are a total of 8,375 wave files for training, including English long sentences, English short sentences and English words. The corpus contains 19,221 words. This contributes about 5.33 hours of continuous speech.
- MAT
- From MAT-400, we use the MATDB-4 (1200) and MATDB-5 (400) categories as our training corpora. There are a total of 15,400 wave files for training, including words of 2 to 4 syllables and phonetically balanced sentences. This contributes about 9.65 hours of continuous speech.
3.2.1 Experiment Setup (cont.)
- We pick 100 sentences from EAT and another 100 sentences from MAT-400 as test pattern A, which is not included in the training corpora.
- Test pattern B
- Test pattern C
3.2.2 Experimental Results
- Insertion errors (ins), deletion errors (del) and substitution errors (sub) were considered. The phone recognition accuracy (acc) was estimated as acc = (N - del - sub - ins) / N, where N is the total number of phones in the reference transcription.
- The monophone approach adopts direct combination of the Mandarin and English language-dependent phones (MIX) and language-independent IPA phones (IPA).
3.2.2 Experimental Results (cont.)
Test pattern B
Test pattern C
4. Conclusions and Future Work
- A prototype of mixed-language accented speech recognition for Mandarin and English has been established and presented.
- The experimental results show that the proposed system achieves 70-80% lexicon recognition accuracy.
- In the future, more languages, such as Taiwanese, may also be integrated into the prototype system.
- Finally, the speech recognizer will also be implemented in a hands-free environment with learning capability, together with a VLSI hardware realization.
References
- [1] Ladefoged P., Local J. and Shockey L., editors. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, U.K., 1999.
- [2] Kohler J. Comparing three methods to create multilingual phone models for vocabulary independent speech recognition tasks. In Proc. ESCA-NATO Tutorial and Research Workshop on Multi-lingual Interoperability in Speech Technology, pp. 79-84, Sept. 1999.
- [3] Kohler J. Multilingual phone models for vocabulary-independent speech recognition tasks. Speech Communication, 35(1-2):21-30, Aug. 2001.
- [4] Vihola M., Harju M., Salmela P., Suontausta J. and Savela J. Two dissimilarity measures for HMMs and their application in phoneme model clustering. In Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, USA, 2002.
- [5] Harju M., Salmela P., Leppanen J., Viikki O. and Saarinen J. Comparing parameter tying techniques for multilingual acoustic modelling. In Proc. European Conference on Speech Communication and Technology, pp. 2729-2732, Aalborg, Denmark, Sept. 2001.
- [6] Andersen O., Dalsgaard P. and Barry W. On the use of data-driven clustering technique for identification of poly- and mono-phonemes for four European languages. In Proc. International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 121-124, Adelaide, Australia, Apr. 1994.
- [7] Imperl B. and Horvat B. The clustering algorithm for the definition of multilingual set of context dependent speech models. In Proc. European Conference on Speech Communication and Technology, pp. 887-890, Budapest, Hungary, 1999.
- [8] Zgank A., Imperl B. and Johansen F. Crosslingual speech recognition with multilingual acoustic models based on agglomerative and tree-based triphone clustering. In Proc. European Conference on Speech Communication and Technology, pp. 2725-2728, Aalborg, Denmark, Sept. 2001.
- [9] Turunen E. Survey of theory and applications of Lukasiewicz-Pavelka fuzzy logic. In di Nola A. and Gerla G., editors, Lectures on Soft Computing and Fuzzy Logic, Advances in Soft Computing, pp. 313-337. Physica-Verlag, Heidelberg, 2001.
- [10] Chomsky N. and Halle M. The Sound Pattern of English. New York: Harper & Row, 1968.
- [11] Wang H. C. MAT: A project to collect Mandarin speech data through telephone networks. Computational Linguistics and Chinese Language Processing, vol. 2, no. 1, pp. 73-90, 1997.
References (cont.)
- [12] Seide F. and Wang N. J. C. Phonetic modeling in the Philips Chinese continuous-speech recognition system. In Proc., 1998.
- [13] Chen Y. J., Wu C.-H. et al. Generation of robust phonetic set and decision tree for Mandarin using chi-square testing. Speech Communication, vol. 38(3-4), pp. 349-364, 2002.
- [14] Young S. et al. The HTK Book (v3.2). Cambridge University Engineering Dept., 2002.
- [15] Karjalainen M. Kommunikaatioakustiikka [Communication Acoustics]. Technical Report 51, Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing, Espoo, Finland, 1999. Preprint, in Finnish.
- [16] Rabiner L. Fundamentals of Speech Recognition. PTR Prentice-Hall Inc., New Jersey, 1993.
- [17] Wang H. C. MAT: A project to collect Mandarin speech data through telephone networks. Computational Linguistics and Chinese Language Processing, vol. 2, no. 1, pp. 73-90, 1997.
- [18] Tseng C. Y. A phonetically oriented speech database for Mandarin Chinese. In Proc. ICPhS '95, Stockholm, pp. 326-329, 1995.
- [19] Lee C.-H., Rabiner L. et al. Acoustic modeling for large vocabulary speech recognition. Computer Speech and Language, vol. 4, pp. 127-165, 1990.
- [20] Gauvain J. L., Lamel L. F., Adda G. and Adda-Decker M. Speaker-independent continuous speech dictation. Speech Communication, vol. 15(1-2), pp. 21-37, 1994.
- [21] Huang C. L. and Wu C.-H. Phone set generation based on acoustic and contextual analysis for multilingual speech recognition. Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C., 2007.
- [22] Huang C. L. and Wu C.-H. Generation of phonetic units for mixed-language speech recognition based on acoustic and contextual analysis. Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C., 2007.
- [23] Yu S., Hu S., Zhang S. and Xu B. Chinese-English bilingual speech recognition. Hi-Tech Innovation Center, Institute of Automation, Chinese Academy of Sciences, Beijing, P. R. China, 2003.
- [24] Ma C. Y. and Fung P. Using English phoneme models for Chinese speech recognition. Human Language Technology Center, Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology (HKUST), Hong Kong.
Thank you!