Speech to Speech Machine Translation (S2SMT) presentation



Title: Speech to Speech Machine Translation (S2SMT)

1
Speech to Speech Machine Translation (S2SMT)

Kapita Selekta, 26 November 2005
Suyanto

2
Overview

Motivation
S2SMT System
Applications
Conclusion
Discussion

3
VerbMobil
4
Motivation

6,500 living languages (www.ling.gu.se)
Translation market (Donald Barabé 2003)
8 billion global market
Doubling every five years

5
Motivation (cont.)
6
Motivation (cont.)
Customer service department of a company in China
with 10,000 employees (Chinese and English)

Category             Words/Day  Translation Time (hrs)  Percentage (%)
Customer Feedback      235,000                     350              39
Prospective Orders     150,000                     228              25
Technical Support      156,000                     235              26
Dealer Feedback         60,000                      93              10
Total                  601,000                     906             100
7
Questions

Is it possible to develop S2SMT to address these problems?
What are the challenges?

8
S2SMT
Source Language Utterance → (ASR) → Source Language Text → (MT) → Target Language Text → (TTS) → Target Language Utterance
Kurt Godden 2002
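The four boxes on this slide form a cascade: recognition, then translation, then synthesis. The Python sketch below only shows how data flows between the stages. `recognize`, `translate`, `synthesize`, `Utterance` and the two-word lexicon are hypothetical stand-ins, not a real ASR, MT or TTS API.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Stand-in for audio: a language code plus the words 'spoken' (hypothetical)."""
    lang: str
    words: list[str]


# Hypothetical toy lexicon for the MT stub (Indonesian -> English).
TOY_LEXICON = {"selamat": "good", "pagi": "morning"}


def recognize(utt: Utterance) -> str:
    """ASR stub: source-language utterance -> source-language text."""
    return " ".join(utt.words)


def translate(text: str) -> str:
    """MT stub: word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(TOY_LEXICON.get(w, w) for w in text.split())


def synthesize(text: str, lang: str) -> Utterance:
    """TTS stub: target-language text -> target-language utterance."""
    return Utterance(lang=lang, words=text.split())


def s2smt(utt: Utterance, target_lang: str) -> Utterance:
    """Cascade: ASR, then MT, then TTS; each stage sees only the previous output."""
    source_text = recognize(utt)
    target_text = translate(source_text)
    return synthesize(target_text, target_lang)


out = s2smt(Utterance("id", ["selamat", "pagi"]), "en")
print(out)  # Utterance(lang='en', words=['good', 'morning'])
```

Each stage sees only the previous stage's output, so a recognition error passes straight into MT. You can observe this by feeding a wrong word list to `s2smt`.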
9
ASR (Automatic Speech Recognition)
10
ASR Challenges

Co-articulation
Speaker independence
Dialect variations
Non-native speakers
Spontaneous speech
Disfluencies
Out-of-vocabulary words
Noise robustness
Convolutive: recording/transmission conditions
Additive: recording environment, transmission SNR
Intra-speaker variability: stress, age, humor
Prosody
Intonation, stress, and phrase boundaries
Emotion

11
ASR approaches

Word-based ASR (1970s)
Recognize a word as a whole pattern
For special purposes: isolated digits, connected words
How many words do you need to develop an application?
Syllable-based ASR
Recognize a word as a set of syllable patterns
In English, there are about 10,000 syllables.
Phoneme-based ASR (widely used today)
Recognize a word as a set of phoneme patterns
Suitable for general purposes
In English, there are about 50 phonemes.
In Indonesian, there are only 32.

12
Phoneme-based ASR

Today, it is the most realistic approach.
In English, we need to recognize only about 50 phonemes.
To develop a speech corpus, we should design a sentence set with balanced tri-phone coverage (e.g., sil-a+sil, c-a+sil, c-a+n).
For 50 phonemes, we need about 125,000 tri-phones (50^3).
For 10,000 syllables, we would need 10^12 tri-syllables (complicated!)
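This is a minimal sketch of how to count the tri-phone coverage of a candidate sentence set. It assumes the slide's tri-phone labels follow the HTK-style left-centre+right convention, so sil-a+sil is the phoneme a between two silences. The phoneme transcriptions are hypothetical. A real corpus design would then pick the sentences that maximise the number of distinct tri-phones covered; that step is not shown.

```python
from collections import Counter


def triphones(phones: list[str]) -> list[str]:
    """Return 'left-centre+right' labels for a phoneme sequence.

    The sequence is padded with silence ('sil') at both ends, so the first
    and last phonemes get 'sil' as their outer context.
    """
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


def coverage(sentences: list[list[str]]) -> Counter:
    """Count how often each tri-phone occurs in a candidate sentence set."""
    counts = Counter()
    for phones in sentences:
        counts.update(triphones(phones))
    return counts


# Hypothetical, simplified phoneme transcriptions.
corpus = [["a"], ["s", "a", "t", "u"], ["t", "u", "a"]]
counts = coverage(corpus)
print(triphones(["c", "a", "n"]))  # ['sil-c+a', 'c-a+n', 'a-n+sil']
print(len(counts))                 # 8 distinct tri-phones covered
print(50 ** 3)                     # 125000 possible tri-phones for 50 phonemes
```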

13
Today's ASR
Phoneme-based approach using statistical models (HMM or hybrid HMM/ANN) for acoustics and linguistics
Large vocabulary, speaker independent
T. Dutoit 2002
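To make "statistical models (HMM)" concrete, here is a small log-space Viterbi decoder, the standard algorithm for finding the most likely hidden state sequence in an HMM. The two-state silence/speech model, the quantised "low"/"high" energy observations and all the probabilities are hypothetical. A real recogniser uses phoneme HMMs over continuous acoustic features, or ANN outputs in the hybrid case.

```python
import math

# Toy two-state HMM (hypothetical numbers): silence vs. speech,
# observed through a quantised frame energy ("low" / "high").
STATES = ("sil", "speech")
START = {"sil": 0.8, "speech": 0.2}
TRANS = {"sil": {"sil": 0.7, "speech": 0.3},
         "speech": {"sil": 0.2, "speech": 0.8}}
EMIT = {"sil": {"low": 0.9, "high": 0.1},
        "speech": {"low": 0.2, "high": 0.8}}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for `obs` (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path ending in state s at frame t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path


path = viterbi(["low", "high", "high", "low"], STATES, START, TRANS, EMIT)
print(path)  # ['sil', 'speech', 'speech', 'sil']
```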
14
Language Model
n-gram models (the trigram is widely used for ASR)
The probability of a sentence is estimated from the conditional probabilities of each word given the n-1 preceding words (T. Dutoit 2002):
$$P(\text{The red hat linux}) \approx P(\text{The} \mid \_,\_)\, P(\text{red} \mid \text{The},\_)\, P(\text{hat} \mid \text{red},\text{The})\, P(\text{linux} \mid \text{hat},\text{red})$$
Resolves confusions caused by co-articulation/dialect: hat, had, head, heat, hate ...
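This is a minimal sketch of the trigram estimate above. It uses maximum-likelihood (relative-frequency) counts from a hypothetical four-sentence corpus, with "_" padding the sentence start as in the slide. There is no smoothing and no end-of-sentence token. Real systems add smoothing or back-off, because without it any unseen trigram gives probability 0, as the last call shows.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories; '_' pads the sentence start."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        padded = ["_", "_"] + words
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            hist[(padded[i - 2], padded[i - 1])] += 1
    return tri, hist


def sentence_prob(words, tri, hist):
    """P(w1..wn) as the product of MLE estimates P(w_i | w_{i-2}, w_{i-1}).

    An unseen history or unseen trigram makes the whole product 0.
    """
    padded = ["_", "_"] + words
    p = 1.0
    for i in range(2, len(padded)):
        h = (padded[i - 2], padded[i - 1])
        if hist[h] == 0:
            return 0.0
        p *= tri[h + (padded[i],)] / hist[h]
    return p


# Hypothetical toy training corpus.
corpus = [s.split() for s in ["the red hat linux", "the red hat",
                              "the red head", "the old hat"]]
tri, hist = train_trigram(corpus)
print(round(sentence_prob("the red hat linux".split(), tri, hist), 3))  # 0.5
print(round(sentence_prob("the red hat".split(), tri, hist), 3))        # 0.5
print(round(sentence_prob("the red head".split(), tri, hist), 3))       # 0.25
print(sentence_prob("the red head linux".split(), tri, hist))           # 0.0
```

With these counts, the model prefers "the red hat" over the acoustically similar "the red head".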
15
Language Model (cont.)

An example in Bahasa Indonesia.
Satu kantornya
Satukan TOR-nya
If the sentences are preceded by "tolong" (please):
Tolong satu kantornya
Tolong satukan TOR-nya
A 3-gram model is better than a 2-gram model here
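One common way to use such a model is to rescore an ASR N-best list. Each hypothesis gets its acoustic log-score plus a weighted language-model log-probability, and the highest total wins. In this sketch, every trigram probability, the acoustic scores, the `UNSEEN` floor and `lm_weight` are hypothetical values chosen only to show the mechanics. With these values, the trigram context after "tolong" decides between two acoustically near-identical hypotheses.

```python
import math

# Hypothetical trigram probabilities P(w3 | w1, w2); '_' pads the sentence start.
TRIGRAM = {
    ("_", "_", "tolong"): 0.01,
    ("_", "tolong", "satu"): 0.001,
    ("_", "tolong", "satukan"): 0.01,
    ("tolong", "satu", "kantornya"): 0.001,
    ("tolong", "satukan", "TOR-nya"): 0.05,
}
UNSEEN = 1e-6  # hypothetical floor for unseen trigrams (stand-in for real back-off)


def lm_logprob(words):
    """Sum of log trigram probabilities over the padded word sequence."""
    padded = ["_", "_"] + words
    return sum(math.log(TRIGRAM.get(tuple(padded[i - 2:i + 1]), UNSEEN))
               for i in range(2, len(padded)))


def rescore(nbest, lm_weight=1.0):
    """Pick the (words, acoustic_logprob) pair with the best combined score."""
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))


# Two acoustically near-identical ASR hypotheses (hypothetical acoustic log-scores).
nbest = [("tolong satu kantornya".split(), -120.0),
         ("tolong satukan TOR-nya".split(), -120.5)]
print(" ".join(rescore(nbest)[0]))                 # tolong satukan TOR-nya
print(" ".join(rescore(nbest, lm_weight=0.0)[0]))  # tolong satu kantornya
```

Setting `lm_weight=0.0` leaves only the acoustic scores, and the other hypothesis wins.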

16
IBM trigram example
17
IBM trigram example (cont.)
18
Language Model (cont.)

Advantages
Robust and efficient
Increases accuracy from 85% to 97% (T. Dutoit)

Problems
Limited to local linguistic structure
A vocabulary of size V has V^n possible n-grams
e.g., 20,000 words give 8 trillion (20,000^3 = 8 × 10^12) possible trigrams!

19
ASR Performance
T. Dutoit 2002 - Faculté Polytechnique de Mons, Belgium
20
Today's ASR

Large Vocabulary Continuous Speech Recognition (LVCSR)
Minimum vocabulary: 10,000 words
Continuous speech
Speaker independent

21
Machine Translation (MT)
22
MT Challenges

Orthographic Variations
Ambiguous spelling
Ambiguous word boundaries
Lexical Ambiguity
Eat → essen (human) vs. fressen (animal)
Arabic كتب (ktb) → he-wrote vs. it-was-written vs. books

23
MT Challenges

Morphological Variations
Affixation vs. root+pattern
write → written
kill → killed
do → done
Translation Divergences

24
MT Approaches

Grammar-based
- Interlingua-based
- Transfer-based
Direct
- Example-based
- Statistical

MT Pyramid
Nizar Habash 2004 - Columbia University
25
Multi-Engine Machine Translation

Idea: take the output from different translation engines and get an overall better translation
Get the best from different worlds
High quality but low coverage from translation memory and interlingua systems
High coverage but lower quality from statistical systems
How to get a better translation?
Select one translation, i.e., work at the sentence level
Create a new one, i.e., combine partial translations from different engines into a new translation
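The "select one translation" strategy can be sketched with a simple consensus criterion: pick the engine output that agrees most, on average, with the other engines' outputs. This bag-of-words F1 agreement is only one possible selection criterion, not necessarily the one used in the work cited on this slide. The engine names and outputs are hypothetical. The second strategy, building a new translation from partial outputs, needs alignment between hypotheses and is not shown.

```python
from collections import Counter


def overlap_f1(a: list[str], b: list[str]) -> float:
    """Bag-of-words F1 between two token lists (a crude similarity measure)."""
    common = sum((Counter(a) & Counter(b)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(b)
    return 2 * precision * recall / (precision + recall)


def select_consensus(outputs: dict[str, str]) -> str:
    """Sentence-level selection: return the engine whose output agrees most,
    on average, with the outputs of the other engines."""
    tokens = {name: text.split() for name, text in outputs.items()}
    if len(tokens) == 1:
        return next(iter(tokens))

    def agreement(name: str) -> float:
        others = [n for n in tokens if n != name]
        return sum(overlap_f1(tokens[name], tokens[o]) for o in others) / len(others)

    return max(tokens, key=agreement)


# Hypothetical outputs from three engines for one source sentence.
outputs = {
    "memory": "please merge the documents",
    "interlingua": "please merge the documents now",
    "statistical": "please one the office",
}
print(select_consensus(outputs))  # memory
```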

26
Text-to-Speech (TTS)
27
TTS Challenges

Accurate automatic phonetization (not just dictionary look-up)
Prosody generation (i.e., intonation and phoneme durations) must be coherent; it is easy to produce unnatural prosody
Synthesize phoneme sequences with corresponding
prosody
Co-articulation!
Segmental quality should be maintained after
pitch and duration modification
Engineering
Low design and maintenance cost
Low computational and memory cost
Easy adaptation to other languages

28
TTS Diagram
T. Dutoit 2002 - Faculté Polytechnique de Mons, Belgium
29
Automatic Phonetization
30
Automatic Phonetization
More complex than that!
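This minimal sketch shows the usual structure: look a word up in an exception dictionary first, and fall back to letter-to-sound rules otherwise. The lexicon entry, the digraph rules and the X-SAMPA-like symbols are hypothetical, loosely modelled on Indonesian spelling.

```python
# Hypothetical exception lexicon: entries the spelling rules would get wrong,
# e.g. an acronym that is read letter by letter.
LEXICON = {"TOR": ["t", "e", "o", "e", "r"]}

# Hypothetical letter-to-sound rules for digraphs (X-SAMPA-like symbols);
# any other letter maps to itself.
DIGRAPHS = {"ng": "N", "ny": "J", "sy": "S", "kh": "x"}


def letter_to_sound(word: str) -> list[str]:
    """Greedy left-to-right rules: try a two-letter digraph first, else one letter."""
    word = word.lower()
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones


def phonetize(word: str) -> list[str]:
    """Dictionary look-up first; fall back to letter-to-sound rules."""
    if word in LEXICON:
        return LEXICON[word]
    return letter_to_sound(word)


for w in ["tolong", "nyanyi", "TOR"]:
    print(w, phonetize(w))
```

Even this toy shows why the problem is harder than a look-up. Indonesian spelling uses the single letter "e" for more than one vowel (for example a schwa and /e/), and these rules cannot tell them apart without extra knowledge such as a lexicon or morphology.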
31
Intonation

Why ups and downs?
Stress (word level) → accent (phrase level)
Even a slight modification → unnatural

32
Phoneme Duration

Not constant
Not fixed for a given phoneme
Linked to intonation (longer on accented
syllables)
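One classic way to make duration depend on context is Klatt's rule-based duration model, used here for illustration: DUR = MIN + (INH - MIN) × PRCNT / 100, where PRCNT combines the percentages of all rules that apply. All durations and percentages below are hypothetical. The point is that the same phoneme gets different durations, and accented syllables come out longer.

```python
# Hypothetical inherent (INH) and minimum (MIN) durations in milliseconds.
INHERENT_MS = {"a": 130, "o": 120, "t": 75}
MINIMUM_MS = {"a": 60, "o": 55, "t": 40}

# Hypothetical rule percentages (100 = no change).
ACCENTED_PCT = 140      # lengthen phonemes in accented syllables
PHRASE_FINAL_PCT = 120  # lengthen phonemes at the end of a phrase


def duration_ms(phone: str, accented: bool = False, phrase_final: bool = False) -> float:
    """Klatt-style rule: DUR = MIN + (INH - MIN) * PRCNT / 100.

    PRCNT starts at 100 and is multiplied by the percentage of every rule
    that applies, so only the compressible part (INH - MIN) is scaled.
    """
    prcnt = 100.0
    if accented:
        prcnt = prcnt * ACCENTED_PCT / 100
    if phrase_final:
        prcnt = prcnt * PHRASE_FINAL_PCT / 100
    inh, mn = INHERENT_MS[phone], MINIMUM_MS[phone]
    return mn + (inh - mn) * prcnt / 100


for accented, final in [(False, False), (True, False), (True, True)]:
    print(accented, final, round(duration_ms("a", accented, final), 1))
```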

33
Applications of S2SMT
Joy (Ying Zhang) 2003 CMU
34
VerbMobil
35
AT&T
36
S2SMT Advantages

Data transmission
Voice format to text format
GSM 8 kbps to 40 bps (a 200-fold reduction!)
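The reduction factor is the ratio of the two bit rates. The second expression adds an assumption that is not on the slide, 8 bits per text character, to show how much text a 40 bps channel can carry.

```latex
\frac{8\ \text{kbps}}{40\ \text{bps}} = \frac{8000\ \text{bit/s}}{40\ \text{bit/s}} = 200,
\qquad
\frac{40\ \text{bit/s}}{8\ \text{bit/character}} = 5\ \text{characters/s}
```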

Can you put S2SMT on the client?
37
S2SMT Disadvantages

The original voice of the speaker?
Unnatural intonation and emotion
Delay

38
Conclusion

At the moment, it is possible to develop S2SMT for small, special-purpose domains, e.g., reservations, help desks, etc.
The main problem is ASR
MT and TTS are reasonably acceptable
Many challenges remain in S2SMT

39
Discussion

Coupling of ASR and MT

Takezawa et al. 98
40
Discussion

Personalized TTS

Takezawa et al. 98
41
Discussion

How about Bahasa Indonesia?
Population: 240 million people
PT TELKOM has 30 million (12.5%) customers (fixed and wireless)
Other operators have millions of customers (say, 20 million)
Prospective market for S2SMT???

42
Discussion

TELKOM RisTI, ATR (www.atr.jp), and STTTelkom are
developing Indonesian text and speech corpus
Text corpus
5,000 sentences
Extracted from news and application domain
Speech corpus
400 people (200 male and 200 female)
4 dialects: Javanese, Sundanese, Jakarta, Batak
4 age categories: 18-23, 24-35, 36-50, 51-60
We need 100 students to record the utterances!!!

43
Discussion

Target Applications (not translation)
E-governance (status tracking for IMB (building permit), PBB (land and building tax), KTP (identity card), etc.)
Billing info
Audio conference (Reservation)
Tele Home-Security
Telecommunication system for the deaf and speech-impaired
To develop S2SMT, we need experts in linguistics, computer science, electronics engineering, communications, etc.
