Title: The Use of Speech in Speech-to-Speech Translation
1 The Use of Speech in Speech-to-Speech Translation
- Andrew Rosenberg
- 8/31/06
- Weekly Speech Lab Talk
2 Candidacy Exam Organization
- Use and Meaning of Intonation
- Automatic Analysis of Intonation
- Applications
  - Speech-to-Speech Translation
  - L2 Learning Systems
3 The Use of Speech in Speech-to-Speech Translation
- The Use of Prosodic Event Information
  - On the Use of Prosody in a Speech-to-Speech Translator (Strom et al. 1997)
  - A Japanese-to-English Speech Translation System: ATR-MATRIX (Takezawa et al. 1998)
- Cascaded / Loosely Coupled Approaches
  - Janus-III: Speech-to-Speech Translation in Multiple Languages (Lavie et al. 1997)
  - A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation (Zhang et al. 2004)
- Integrated / Tightly Coupled Approaches
  - Finite-State Speech-to-Speech Translation (Vidal 1997)
  - On the Integration of Speech Recognition and Statistical Machine Translation (Matusov et al. 2005)
  - Coupling vs. Unifying Modeling Techniques for Speech-to-Speech Translation (Gao 2003)
5 On the Use of Prosody in a Speech-to-Speech Translator (Strom et al. 1997)
- INTARC: German-English translator produced for the VERBMOBIL project.
  - Spontaneous, limited domain (appointment scheduling)
  - 80 minutes of prosodically labeled speech
- Phrase Boundary (PB) Detector
  - Gaussian classifier based on F0, energy and time features with a 4-syllable window (acc. 80.76%)
- Focus Detector
  - Rule-based approach: identifies the location of the steepest F0 decline (acc. 78.5%)
- Syntactic parsing search space is reduced by 65%
  - Baseline syntactic parsing uses
    - Decoder factor: product of acoustic and bi-gram scores
    - Grammar factor: grammar model probability of a parse using the hypothesized word
  - Prosody factor: 4-gram model of prosodic events (focus and PB)
- Semantic parsing search space is reduced by 24.7%
  - The semantic grammar was augmented, labeling rules as segment-connecting (SC) and segment-internal (SI)
  - SC rules are applied when there is a PB between segments; SI rules are applied when there is not.
  - Ideal phrase boundaries reduced the number of hypotheses by 65.4% (analysis trees by 41.9%)
  - Automatically hypothesized PBs required a backoff mechanism to handle errors and PBs that are not aligned with grammatical phrase boundaries.
- Prosodically driven translation is used when deep transfer (translation) fails
  - A focused word determines (probabilistically) a dialog act, which is translated based on available information from the word chain.
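The rule-based focus detector described on this slide can be pictured with a minimal sketch. It assumes each syllable is summarized by a single F0 value in Hz, which is a simplification made here for illustration. The function name `steepest_f0_decline` and the sample contour are hypothetical and do not come from Strom et al.

```python
from typing import List, Optional


def steepest_f0_decline(f0_per_syllable: List[float]) -> Optional[int]:
    """Return the index of the syllable where the steepest F0 fall begins.

    Assumption (not from the paper): one F0 value per syllable, in Hz.
    Returns None when the contour never falls.
    """
    best_index: Optional[int] = None
    best_drop = 0.0
    for i in range(len(f0_per_syllable) - 1):
        # Positive drop means F0 falls from syllable i to syllable i + 1.
        drop = f0_per_syllable[i] - f0_per_syllable[i + 1]
        if drop > best_drop:
            best_drop = drop
            best_index = i
    return best_index


# Made-up contour: the largest fall starts at the third syllable.
contour = [180.0, 195.0, 210.0, 150.0, 145.0, 140.0]
focus_syllable = steepest_f0_decline(contour)
```

A real detector would work on a smoothed frame-level contour and would need to handle unvoiced regions; this sketch only shows the "steepest decline" rule itself.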
6 A Japanese-to-English Speech Translation System: ATR-MATRIX (Takezawa et al. 1998)
- Limited domain translation system (Hotel Reservations)
- Cascaded approach
  - ASR: sequential model, 2k word vocabulary
  - MT: syntactically driven, 12k word vocabulary
  - TTS: CHATR (now unit selection, then concatenative)
- Early Example of Interactive Speech-to-Speech Translation
  - When the system has low confidence in either recognition or MT outputs, it prompts the user for corrections.
- Speech Information is used in three ways in ATR-MATRIX
  - Voice Selection
    - Based on the source voice, either a male or female voice is used for synthesis
  - Hypothesized phrase boundaries
    - Using pause information along with POS N-gram information, the source utterance is divided into meaningful chunks for translation.
  - Phrase Final Behavior
    - If phrase final rise is detected, it is passed to the MT module as a lexical item potentially indicating a question.
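The phrase-final behavior above, where a detected rise is handed to MT as if it were a word, might look roughly like the following. This is a sketch under stated assumptions: the marker string `<RISE>`, the three-frame tail, and the 20 Hz threshold are all invented here, since the slide does not give the actual token or detection rule used in ATR-MATRIX.

```python
from typing import List

RISE_TOKEN = "<RISE>"  # hypothetical marker; the real lexical item is not given in the source


def has_final_rise(f0: List[float], tail: int = 3, min_rise_hz: float = 20.0) -> bool:
    """True when mean F0 over the last `tail` frames exceeds the mean of the rest by min_rise_hz."""
    if len(f0) <= tail:
        return False
    body = f0[:-tail]
    end = f0[-tail:]
    return (sum(end) / len(end)) - (sum(body) / len(body)) >= min_rise_hz


def annotate_for_mt(words: List[str], f0: List[float]) -> List[str]:
    """Append the rise marker so a downstream MT module can treat it like any other word."""
    if has_final_rise(f0):
        return words + [RISE_TOKEN]
    return list(words)


rising = annotate_for_mt(["you", "have", "a", "room"],
                         [120.0, 118.0, 115.0, 117.0, 150.0, 165.0, 180.0])
falling = annotate_for_mt(["you", "have", "a", "room"],
                          [120.0, 118.0, 115.0, 117.0, 110.0, 105.0, 100.0])
```

Passing the rise as a token keeps the MT interface purely lexical, so the translation module needs no separate prosody channel.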
8 Janus-III: Speech-to-Speech Translation in Multiple Languages (Lavie et al. 1997)
- Interlingua and Frame-Slot based Spanish-English translation
  - Limited domain (conference registration)
  - Spontaneous speech
- Cascaded Approach
- Two semantic parse techniques
  - GLR Interlingua parsing (transcript 82.9%, ASR 54%)
    - Manually constructed grammar to parse input into interlingua
    - Robust: does not require grammatically correct input
    - Search for the maximal subset covered by the grammar
    - Generation is performed by an interlingua generator
  - Phoenix (transcript 76.3%, ASR 48.6%)
    - Identifies key concepts and their structure
    - Parsing grammar contains specific patterns which represent domain concepts
    - The patterns are then compiled into a recursive transition network
    - Each concept has one or more fixed phrasings in the target language
  - Phoenix is used as a backoff when GLR fails.
    - Transcript 83.3%, ASR 63.6%
- Late stage disambiguation
  - Multiple translations are processed through the whole system.
  - Translation hypothesis selection occurs just before generation, using scores from recognition, parsing and discourse processing.
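The backoff arrangement on this slide (try GLR first, fall back to Phoenix when it fails) reduces to a simple control pattern. The sketch below shows only that pattern. The two toy parsers are hypothetical stand-ins that bear no resemblance to the real GLR or Phoenix parsers, and "failure" is modeled here as returning `None`, which is an assumption.

```python
from typing import Callable, Optional

Parser = Callable[[str], Optional[str]]


def translate_with_backoff(utterance: str, primary: Parser, fallback: Parser) -> Optional[str]:
    """Use the primary parser's result; call the fallback only when the primary returns None."""
    result = primary(utterance)
    if result is not None:
        return result
    return fallback(utterance)


# Toy stand-ins (hypothetical): the "strict" parser only accepts input ending in a period.
def toy_strict_parser(utterance: str) -> Optional[str]:
    return "STRICT:" + utterance if utterance.endswith(".") else None


def toy_robust_parser(utterance: str) -> Optional[str]:
    return "ROBUST:" + utterance


clean = translate_with_backoff("i want to register.", toy_strict_parser, toy_robust_parser)
noisy = translate_with_backoff("uh i want register", toy_strict_parser, toy_robust_parser)
```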
9 A Unified Approach in Speech Translation: Integrating Features of Speech Recognition and Machine Translation (Zhang et al. 2004)
- Process many hypotheses, then select one.
- In a cascaded architecture
  - HMM-based ASR produces N-best recognition hypotheses
  - IBM Model 4 MT processes all N.
  - Rescore MT hypotheses based on a weighted log-linear combination of ASR and MT features.
  - Construct the feature weight model by optimizing a translation distance metric (mWER, mPER, BLEU, NIST)
- Experiment Results
  - Corpus: 162k/510/508 Japanese-English parallel sentences
  - Baseline: no optimization of MT features
  - Substantial improvement was obtained by optimizing feature weights based on the distance metric
  - Additional improvement was achieved by including ASR features
  - Translation of N-best ASR hypotheses improved sentence translation accuracy of incorrectly recognized 1-best hypotheses by 7.5%
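The rescoring step above, a weighted log-linear combination of ASR and MT features over N-best candidates, can be sketched as follows. The feature names, values, and weights are invented for illustration. In the approach described on the slide the weights would be tuned against a translation metric; here they are simply fixed by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    translation: str
    features: Dict[str, float]  # log-domain feature values from ASR and MT (made up here)


def loglinear_score(candidate: Candidate, weights: Dict[str, float]) -> float:
    """Weighted sum of log-domain features; features without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in candidate.features.items())


def rescore(nbest: List[Candidate], weights: Dict[str, float]) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(nbest, key=lambda c: loglinear_score(c, weights))


nbest = [
    Candidate("i would like a room", {"asr_acoustic": -120.0, "asr_lm": -18.0, "mt_model": -25.0}),
    Candidate("i would like a broom", {"asr_acoustic": -118.0, "asr_lm": -24.0, "mt_model": -31.0}),
]

# Hand-picked weights (assumption); tuning them is the optimization step named on the slide.
combined = rescore(nbest, {"asr_acoustic": 0.1, "asr_lm": 0.5, "mt_model": 1.0})
acoustic_only = rescore(nbest, {"asr_acoustic": 1.0})
```

With these made-up numbers the acoustically best candidate is not the one selected once language model and translation scores are included, which is the situation where processing all N hypotheses pays off.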
11 Finite-State Speech-to-Speech Translation (Vidal 1997)
- FSTs can naturally be applied to translation.
- FSTs for statistical MT can be learned from parallel corpora. (OSTIA)
- Speech input is handled in two ways
  - Baseline cascaded approach
  - Integrated approach
    - Create an FST on text, replace each edge with an acoustic model of the lexical item
- A major drawback of using this approach is the large training data requirement.
  - Align the source and target utterances, reducing their asynchronicity
  - Cluster lexical items, reducing the vocabulary size
- Proof of concept experiment
  - Text: 30 lexical items used in 16k paired sentences (Spanish-English)
    - Greater than 99% translation accuracy is achieved
  - Speech: 50k/400 (training/testing) paired utterances, spoken by 4 speakers
    - Best performance: 97.2% translation accuracy, 97.4% recognition accuracy
    - Requires inclusion of source and target 4-gram LMs in FST training.
- Travel domain experiment
  - Text: 600 lexical items in 169k/2k paired sentences
    - 0.7% translation WER w/ categorization, 13.3% WER w/o
  - Speech: 336 test utterances (3k words) spoken by 4 speakers
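To make "FSTs can naturally be applied to translation" concrete, here is a minimal hand-written transducer for a three-word Spanish phrase. It is a toy under stated assumptions: the states, vocabulary, and the `transduce` helper are invented here, and nothing is learned from data (OSTIA would infer such a machine from a parallel corpus). Note how output for "cama" is delayed until the adjective arrives, which is one way a transducer copes with the word-order asynchrony mentioned above.

```python
from typing import Dict, List, Optional, Tuple

# (state, source word) -> (target words emitted on this edge, next state)
Transitions = Dict[Tuple[int, str], Tuple[List[str], int]]

TOY_FST: Transitions = {
    (0, "una"): (["a"], 1),
    (1, "cama"): ([], 2),                       # emit nothing yet: English needs the adjective first
    (2, "doble"): (["double", "bed"], 3),
    (2, "individual"): (["single", "bed"], 3),
}
FINAL_STATES = {3}


def transduce(source: List[str], fst: Transitions = TOY_FST) -> Optional[List[str]]:
    """Follow one edge per source word; return the output only if a final state is reached."""
    state = 0
    output: List[str] = []
    for word in source:
        step = fst.get((state, word))
        if step is None:
            return None  # no edge for this word in this state
        emitted, state = step
        output.extend(emitted)
    return output if state in FINAL_STATES else None


translation = transduce("una cama doble".split())
```

In the integrated approach on the slide, each edge's source word would be replaced by an acoustic model of that word, so the same machine would be searched directly from speech.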
12 On the Integration of Speech Recognition and Statistical Machine Translation (Matusov et al. 2005)
- Use word lattices weighted by HMM ASR scores as input to a weighted FST for translation
- Noisy Channel Model
  - Using an alignment model, A
  - Instead of modeling the alignment, search for the best alignment
- Evaluation
  - Material: 4 parallel corpora
    - Spontaneous speech in the travel domain
    - 3k-66k paired sentences in Italian-English, Spanish-English and Spanish-Catalan
    - Vocabulary size: 1.7k-15k words
  - On all metrics (mWER, mPER, BLEU, NIST), the input conditions rank as follows, best first
    - Correct text
    - Word lattice w/ acoustic scores
    - Fully integrated ASR and MT (FUB Italian-English only)
    - Word lattice w/o acoustic scores
    - Single best ASR hypothesis (lower mPER than lattice w/o scores on FUB I-E)
  - Denser ASR lattices yield reduced translation WER (on FUB Italian-English)
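The noisy channel bullets above can be read against a standard textbook-style decomposition. This is a sketch, not necessarily the notation or exact formulation of Matusov et al. The symbols are assumptions introduced here: x is the acoustic observation sequence, f the source-language word sequence, e the target-language word sequence, and A the alignment. The second line assumes the acoustics depend on e only through f.

```latex
\begin{align}
\hat{e} &= \arg\max_{e} \; p(e \mid x) \\
        &= \arg\max_{e} \; p(e) \sum_{f} p(f \mid e)\, p(x \mid f) \\
        &= \arg\max_{e} \; p(e) \sum_{f} \sum_{A} p(f, A \mid e)\, p(x \mid f) \\
        &\approx \arg\max_{e} \; \max_{f,\, A} \; p(e)\, p(f, A \mid e)\, p(x \mid f)
\end{align}
```

The last line replaces the sums over f and A with a maximization, which corresponds to searching for the best alignment rather than summing over all of them. The factor p(x | f) is where lattice acoustic scores would enter.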
13 Coupling vs. Unifying Modeling Techniques for Speech-to-Speech Translation (Gao 2003)
- Application of direct modeling to ASR, with the goal of direct modeling of interlingua text for MT.
  - A direct model of target text from source acoustics could also be constructed using this approach.
- Composing models (e.g., noisy channel models) can lead to local or sub-optimal solutions
  - Direct Modeling tries to avoid these by creating a single maximum entropy model
    - p(text | acoustics, ...)
  - Direct modeling can also include other non-independent observations (features).
- Major considerations
  - To simplify computational complexity, acoustic features are quantized.
  - Since the feature vector can get very large, reliable feature selection is necessary.
    - In preliminary experiments, 150M features were reduced to 500K via feature selection
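A single maximum entropy model over quantized observations, as described above, has the general form of a normalized exponential of weighted feature functions. The sketch below illustrates that form only. The bin edges, the two feature functions, the labels, and the weights are all hypothetical, and no training or feature selection is shown.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Observation = Tuple[int, ...]
Feature = Callable[[Observation, str], float]


def quantize(value: float, bin_edges: Sequence[float]) -> int:
    """Map a continuous acoustic value to a bin index (number of edges it meets or exceeds)."""
    return sum(value >= edge for edge in bin_edges)


def maxent_posterior(obs: Observation, labels: List[str],
                     features: List[Feature], weights: List[float]) -> Dict[str, float]:
    """p(label | obs) proportional to exp(sum_i weight_i * feature_i(obs, label))."""
    scores = {y: sum(w * f(obs, y) for f, w in zip(features, weights)) for y in labels}
    top = max(scores.values())  # subtract the max before exponentiating, for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: v / z for y, v in exp_scores.items()}


# Hypothetical features: one ties a quantized acoustic bin to a label, one is a label prior.
features: List[Feature] = [
    lambda obs, y: 1.0 if obs[0] == 1 and y == "yes" else 0.0,
    lambda obs, y: 1.0 if y == "no" else 0.0,
]
weights = [2.0, 0.5]

obs = (quantize(0.8, [0.5, 1.0]),)
posterior = maxent_posterior(obs, ["yes", "no"], features, weights)
```

Because the features are arbitrary functions of the observation and the label, overlapping or non-independent evidence can be added without changing the model form, which is the property the slide highlights.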
15 Thank you.