Title: Coupling between ASR and MT in Speech-to-Speech Translation
1. Coupling between ASR and MT in Speech-to-Speech Translation
- Arthur Chan
- Prepared for the Advanced Machine Translation Seminar
2. This Seminar
- Introduction (6 slides)
- Ringger's categorization of coupling between ASR and NLU (7 slides)
- Interfaces in Loose Coupling
  - 1-best and N-best (5 slides)
  - Lattices / Confusion Network / Confidence Estimation (12 slides)
  - Results from the literature
- Tight Coupling
  - Ney's theory and 2 methods of implementation (14 slides)
  - (Sorry, without FST approaches.)
- Some "As Is" Ideas on This Topic
3. History of this presentation
- V1
  - Draft finished on Mar 1st
- Tanja's comments
  - Direct modeling could be skipped.
  - We could focus on more ASR-related issues.
  - Issues in MT search could be ignored.
4. History of this presentation (cont.)
- V2 - V4
  - Followed Tanja's comments; finished on Mar 19th.
- Reviewers' comments
  - Ney's search formulation is too difficult to follow.
  - The FST-based tight coupling method is important; we should cover it.
- V5
  - Reviewed another 5 papers solely on the issue of FST-based tight coupling ("true" coupling).
5. 6 papers on Coupling of Speech-to-Speech Translation
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- Casacuberta et al., "Architectures for speech-to-speech translation using finite-state models," in Proc. Workshop on Speech-to-Speech Translation, 2002.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.
6. A Conceptual Model of Speech-to-Speech Translation
- [Pipeline diagram: source waveforms -> Speech Recognizer -> decoding result(s) -> Machine Translator -> translation -> Speech Synthesizer -> target waveforms]
7. Motivation of Tight Coupling between ASR and MT
- The 1-best output of ASR could be wrong
- MT could benefit from the wide range of supplementary information provided by ASR
  - N-best list
  - Lattice
  - Sentence-/word-based confidence scores
    - E.g. word posterior probability
  - Confusion network
    - Or consensus decoding (Mangu 1999)
- MT quality may depend on the WER of ASR (?)
8. Scope of this talk
- [Pipeline diagram as on slide 6; the focus is the interface between the Speech Recognizer and the Machine Translator: 1-best? N-best? Lattice? Confusion network?]
9. Topics Covered Today
- The concept of coupling
  - Tightness of coupling between ASR and Technology X (Ringger 95)
- Two questions
  - What could ASR provide in loose coupling?
    - Discussion of interfaces between ASR and MT in loose coupling
  - What is the status of tight coupling?
    - Ney's formulation
10. Topics not covered
- Direct modeling
  - Uses features from both ASR and MT
  - Sometimes referred to as ASR and MT unification
- Implications of the MT search algorithms for the coupling
- Generation of speech from text
  - The presenter doesn't know enough about it.
11. The Concept of Coupling
12. Classification of Coupling of ASR and Natural Language Understanding (NLU)
- Proposed in Ringger 95, Harper 94
- 3 dimensions of ASR/NLU coupling
  - Complexity of the search algorithm
    - Simple N-gram?
  - Incrementality of the coupling
    - On-line? Left-to-right?
  - Tightness of the coupling
    - Tight? Loose? Semi-tight?
13. Tightness of Coupling
- [Diagram: a spectrum of coupling tightness ranging from Tight through Semi-Tight to Loose]
14. Notes
- Semi-tight coupling could appear as
  - a feedback loop between ASR and Technology X for the whole utterance of speech, or
  - a feedback loop between ASR and Technology X for every frame.
- The Ringger system
  - A good way to understand how speech-based systems are developed
15. Example 1: LM
- Suppose someone asserts that ASR has to be used with 13-grams.
- In tight coupling,
  - a search will be devised to find the word sequence with the best combined acoustic score and 13-gram likelihood.
- In loose coupling,
  - a simple search will be used to generate some outputs (N-best list, lattice, etc.),
  - the 13-gram will then be used to rescore those outputs.
- In semi-tight coupling,
  - 1. a simple search will be used to generate results,
  - 2. the 13-gram will be applied at word ends only (but the exact history will not be stored).
- (A sketch of the loose-coupling rescoring step follows this list.)
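As a concrete illustration of the loose-coupling case above, here is a minimal sketch of N-best rescoring with a higher-order LM. The `Hypothesis` class, the `lm_logprob` callable, and the toy data are hypothetical stand-ins; a real system would plug in its own decoder output and LM.

```python
# Minimal sketch of loose-coupling LM rescoring (illustrative names only).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Hypothesis:
    words: List[str]
    acoustic_score: float  # log-domain score from the first-pass decoder

def rescore(nbest: List[Hypothesis],
            lm_logprob: Callable[[List[str]], float],
            lm_weight: float = 1.0) -> List[Hypothesis]:
    """Loose coupling: the decoder output is fixed; a higher-order LM only re-ranks it."""
    return sorted(nbest,
                  key=lambda h: h.acoustic_score + lm_weight * lm_logprob(h.words),
                  reverse=True)

# Usage with a toy stand-in LM (a real system would plug in the 13-gram here):
nbest = [Hypothesis(["recognize", "speech"], -10.2),
         Hypothesis(["wreck", "a", "nice", "beach"], -9.8)]
best = rescore(nbest, lm_logprob=lambda words: -0.5 * len(words))[0]
```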
16. Example 2: Higher-order AM
- Segmental models assume the observation probability is not conditionally independent across frames.
- Suppose someone asserts that a segmental model is better than a plain HMM.
- Tight coupling: direct search for the best word sequence using the segmental model.
- Loose coupling: use the segmental model to rescore.
- Semi-tight coupling: a hybrid HMM-segmental model algorithm?
17. Summary of Coupling between ASR and NLU
18. Implications for ASR/MT coupling
- The categorization generalizes over many systems
- Loose coupling
  - Any system which uses 1-best, N-best, lattice, or other inputs for one-way module communication
  - (Bertoldi 2005)
  - CMU system (Saleem 2004)
- Tight coupling
  - (Ney 1999)
  - (Matusov 2005)
  - (Casacuberta 2002)
- Semi-tight coupling
  - (Quan 2005)
19. Interfaces in Loose Coupling: 1-best and N-best
20. Perspectives
- ASR outputs
  - 1-best results
  - N-best results
  - Lattice
  - Consensus network
  - Confidence scores
- How does ASR generate these outputs?
- Why are they generated?
- What if there are multiple ASRs?
  - (and what if their results are combined?)
- Note: at this point we are talking about the state lattice, not the word lattice.
21. Origin of the 1-best
- Decoding in HMM-based ASR
  - Searching for the best path in a huge lattice of HMM states
- 1-best ASR result
  - The best path one could find from backtracking
- State lattice in ASR (next page)
22. [Figure: state lattice in ASR (no transcript available)]
23. Notes on the 1-best in ASR
- Most of the time, the 1-best = the 1-best word sequence
- Why?
  - In LVCSR, storing the backtracking pointer table for the full state sequence takes a lot of memory (even nowadays)
  - Compare this with the number of per-frame scores that would need to be stored
- Usually a backtrack pointer stores
  - the previous words before the current word
- Clever implementations dynamically allocate the backtracking pointer table.
- (A traceback sketch follows.)
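To make the backtrack-pointer discussion concrete, here is a minimal sketch of word-level backpointers and the traceback that recovers the 1-best word sequence. The field names are illustrative and not taken from any particular decoder.

```python
# Sketch of word-level backtrack pointers (names are illustrative, not from any toolkit).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BackPointer:
    word: str
    end_frame: int
    score: float
    prev: Optional["BackPointer"]  # points to the previous word entry, not the full state path

def traceback(last: BackPointer) -> List[str]:
    """Follow word-level back pointers from the final entry to recover the 1-best word sequence."""
    words: List[str] = []
    node: Optional[BackPointer] = last
    while node is not None:
        words.append(node.word)
        node = node.prev
    return list(reversed(words))
```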
24. What is an N-best list?
- Traceback not only from the 1st best, but also from the 2nd best, 3rd best, etc.
- Pathways
  - Directly from the search backtrack pointer table
    - Exact N-best algorithm (Chow 90)
    - Word-pair N-best algorithm (Chow 91)
    - A search using the Viterbi score as heuristic (Chow 92)
  - Generate a lattice first, then generate the N-best list from the lattice
25. Interfaces in Loose Coupling: Lattice, Consensus Network and Confidence Estimation
26. What is a Lattice?
- A word-based lattice
  - A compact representation of the state lattice
  - Only word nodes (or links) are involved
- Difference between N-best and lattice
  - A lattice can be a compact representation of an N-best list.
- (A minimal data-structure sketch follows.)
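A minimal sketch of the word-lattice structure described above: a DAG of word links, each carrying the word identity and the scores attached by the decoder. Field names are hypothetical.

```python
# Minimal word-lattice representation as a DAG of word links (illustrative field names).
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordLink:
    word: str
    start_node: int        # lattice node where the word hypothesis begins
    end_node: int          # lattice node where it ends
    acoustic_score: float  # log-domain acoustic score of this word hypothesis
    lm_score: float        # log-domain LM score attached by the decoder

@dataclass
class Lattice:
    links: List[WordLink] = field(default_factory=list)

    def outgoing(self, node: int) -> List[WordLink]:
        """All word links leaving a given lattice node."""
        return [l for l in self.links if l.start_node == node]
```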
27. [Figure: example word lattice (no transcript available)]
28. How is a lattice generated?
- From the decoding backtracking pointer table
  - Only record all the links between word nodes.
- From an N-best list
  - Becomes a compact representation of the N-best list
  - Sometimes spurious links will be introduced
29. How is a lattice generated when there are phone contexts at the word ends?
- Very complicated when phonetic context is involved
  - Not only the word ends need to be stored, but also the phone contexts.
  - The lattice carries the word identity as well as the contexts.
  - The lattice can become very large.
30. How is this resolved?
- Some use only approximate triphones to generate the lattice in the first stage (BBN)
- Some generate the lattice with full CD phones but convert it back to a context-independent lattice (RWTH)
- Some use the lattice with full CD phone contexts directly (RWTH)
31. What do ASR folks do when the lattice is still too large?
- Use some criteria to prune the lattice.
- Example criteria
  - Word posterior probability
  - Application of another LM or AM, then filtering
  - General confidence score
  - Maximum lattice density
    - (number of word hypotheses in the lattice / number of words actually spoken)
- Or generate an even more compact representation than the lattice
  - E.g. a consensus network
- (A posterior-pruning sketch follows.)
32. Conclusions on lattices
- Lattice generation itself can be a complicated issue.
- Sometimes what the post-processing stage (e.g. MT) will get is a pre-filtered, pre-processed result.
33. Confusion Network and Consensus Hypothesis
- Confusion network
  - Or sausage network
  - Or consensus network
34. Special Properties (?)
- More local than a lattice
  - One can apply simple criteria to find the best results
  - E.g. consensus decoding applies word posterior probabilities on the confusion network.
- More tractable
  - In terms of size
- Found to be useful in
  - ?
  - ?
35. How is a consensus network generated?
- From the lattice
- Summary of Mangu's algorithm
  - Intra-word clustering
  - Inter-word clustering
- (A crude clustering sketch follows.)
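A very crude stand-in for the clustering idea, assuming each lattice link already carries a time span and a posterior: links that overlap in time are merged into the same confusion slot. Mangu's actual algorithm performs explicit intra-word and inter-word clustering with a similarity function; this sketch only conveys the flavor.

```python
# Crude sketch of grouping lattice links into confusion slots by time overlap
# (not Mangu's algorithm; illustrative only).
from typing import Dict, List, Tuple

def confusion_slots(links: List[Tuple[str, float, float, float]]  # (word, t_start, t_end, posterior)
                    ) -> List[Dict[str, float]]:
    slots: List[Tuple[float, float, Dict[str, float]]] = []  # (t_start, t_end, word -> posterior mass)
    for word, ts, te, post in sorted(links, key=lambda l: l[1]):
        for i, (s, e, bin_) in enumerate(slots):
            if ts < e and te > s:                             # time overlap -> same slot
                bin_[word] = bin_.get(word, 0.0) + post
                slots[i] = (min(s, ts), max(e, te), bin_)
                break
        else:
            slots.append((ts, te, {word: post}))
    return [bin_ for _, _, bin_ in slots]
```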
36. Notes on the Consensus Network
- Time information might not be preserved in the confusion network.
- The similarity function directly affects the final output of the consensus network.
37. Other ways to generate a confusion network
- From the N-best list
  - Using ROVER
  - A mixture of voting and adding word confidences
38. Confidence Measures
- Anything other than the likelihood which could tell whether the answer is useful
- E.g.
  - Word posterior probability
    - P(W|A)
    - Usually computed using lattices
  - Language model backoff mode
  - Other posterior probabilities (frame, sentence)
- (A sketch of lattice-based posterior computation follows.)
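The sketch referenced above: link posteriors computed by a forward-backward pass over a word lattice. Structures and names are illustrative; a real implementation works in the log domain, applies acoustic/LM scaling, and (following Wessel 98) aggregates links with the same word and overlapping times into word posteriors.

```python
# Sketch of lattice link posteriors via forward-backward (illustrative structures).
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Link:
    word: str
    start: int       # start node id (nodes assumed numbered in topological order)
    end: int         # end node id
    logscore: float  # combined acoustic + LM log score of the link

def link_posteriors(links: List[Link], start_node: int, end_node: int) -> List[Tuple[Link, float]]:
    nodes = sorted({n for l in links for n in (l.start, l.end)})
    out = defaultdict(list)
    for l in links:
        out[l.start].append(l)

    # Forward pass: alpha[n] = total probability mass of all paths from start_node to n.
    alpha: Dict[int, float] = defaultdict(float)
    alpha[start_node] = 1.0
    for n in nodes:
        for l in out[n]:
            alpha[l.end] += alpha[n] * math.exp(l.logscore)

    # Backward pass: beta[n] = total probability mass of all paths from n to end_node.
    beta: Dict[int, float] = defaultdict(float)
    beta[end_node] = 1.0
    for n in reversed(nodes):
        for l in out[n]:
            beta[n] += math.exp(l.logscore) * beta[l.end]

    total = alpha[end_node]
    # Posterior of a link = mass of all paths through it, normalized by the total path mass.
    return [(l, alpha[l.start] * math.exp(l.logscore) * beta[l.end] / total) for l in links]
```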
39. Interfaces in Loose Coupling: Results from the Literature
40. General remarks
- Coupling in SST is still pretty new
- Papers were chosen according to whether certain ASR outputs have been used
- Other techniques, such as direct modeling, might be mixed into the papers.
41. N-best list (Quan 2005)
- Using the N-best list for reranking
- The interpolation weights of AM and TM are then optimized.
- Summary
  - Reranking gives improvements.
- (A log-linear reranking sketch follows.)
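The reranking sketch referenced above: a log-linear combination of model scores, with the ASR acoustic score treated as one more feature of each translation hypothesis. Feature names and weights are illustrative, not those of Quan 2005.

```python
# Sketch of N-best reranking with interpolated model scores (log-linear combination).
from typing import Dict, List

def rerank(nbest: List[Dict[str, float]], weights: Dict[str, float]) -> List[Dict[str, float]]:
    """Each hypothesis is a dict of log-domain feature scores (e.g. 'am', 'tm', 'lm').

    The combined score is the weighted sum of features; the weights are typically
    tuned on a development set (e.g. toward BLEU)."""
    def combined(h: Dict[str, float]) -> float:
        return sum(weights[k] * h[k] for k in weights)
    return sorted(nbest, key=combined, reverse=True)

# Usage: the acoustic score from ASR is one more feature of the translation hypothesis.
nbest = [{"am": -120.0, "tm": -35.2, "lm": -20.1},
         {"am": -118.5, "tm": -37.0, "lm": -19.4}]
print(rerank(nbest, weights={"am": 0.4, "tm": 1.0, "lm": 0.8})[0])
```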
42. Lattices: CMU results (Saleem 2004)
- Summary of results
  - Lattice word error rate improves as lattice density increases.
  - Lattice density and the weight on acoustic scores turn out to be important parameters to tune.
    - Values that are too large or too small can hurt.
43. LWER against lattice density [figure]
44. Modified BLEU scores against lattice density [figure]
45. Optimal density and score weight based on utterance length [figure]
46. Consensus Network
- Bertoldi 2005 is probably the only work on a confusion-network-based method
- Summary of results
  - When direct modeling is applied, the consensus network doesn't beat the N-best method.
  - The authors argue for the speed and simplicity of the algorithm.
47. Confidence: Does it help?
- According to Zhang 2006, yes.
- Confidence measure (CM) filtering is used to filter out unnecessary results in the N-best list.
- Note: the approaches used are quite different.
48. Conclusions on Loose Coupling
- ASR can give a rich set of outputs.
- It is still unknown what type of output should be used in the pipeline.
- There currently seems to be a lack of comprehensive experimental studies on which method is best.
- The usage of confusion networks and confidence estimation seems to be under-explored.
49. Tight Coupling: Theory and Practice
50. Theory (Ney 1999)
- Bayes rule: ê = argmax_e Pr(e|x) = argmax_e Pr(e) · Pr(x|e)
- Introduce the source word sequence f as a hidden variable: Pr(x|e) = Σ_f Pr(f, x|e)
- Bayes rule (chain rule) again: Pr(f, x|e) = Pr(f|e) · Pr(x|f, e)
- Assume x doesn't depend on the target language: Pr(x|f, e) ≈ Pr(x|f)
- Sum to max: ê ≈ argmax_e Pr(e) · max_f Pr(f|e) · Pr(x|f)
51. Layman's point of view
- Three factors
  - Pr(e): target language model
  - Pr(f|e): translation model
  - Pr(x|f): acoustic model
- Note: an assumption has been made that only the best matching f for each e is used.
52. Comparison with SR
- In SR
  - Pr(f): source language model
- In tight coupling
  - Pr(f|e), Pr(e): translation model and target language model
53. Algorithmic Point of View
- Brute-force method: instead of incorporating an LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e)
  - => Very complicated
- The backup slides in this presentation have details about Ney's implementations.
- (A brute-force sketch of the criterion follows.)
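The brute-force sketch referenced above, enumerating candidate target sentences and lattice paths to evaluate Ney's criterion directly. All scoring functions are hypothetical placeholders; the point is only to show the criterion, since exhaustive enumeration is exactly what makes the real search hard.

```python
# Brute-force illustration of Ney's tight-coupling criterion:
#   ê = argmax_e  log Pr(e) + max_{f in lattice paths} [ log Pr(f|e) + log Pr(x|f) ]
# All model callables are placeholders supplied by the caller.
from typing import Callable, List, Tuple

def tight_couple_decode(
    target_candidates: List[List[str]],                  # candidate target sentences e
    lattice_paths: List[Tuple[List[str], float]],        # (source path f, log Pr(x|f)) pairs
    lm_logprob: Callable[[List[str]], float],            # log Pr(e)
    tm_logprob: Callable[[List[str], List[str]], float]  # log Pr(f|e)
) -> List[str]:
    def score(e: List[str]) -> float:
        best_f = max(tm_logprob(f, e) + ac for f, ac in lattice_paths)  # max over f
        return lm_logprob(e) + best_f
    return max(target_candidates, key=score)
```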
54. Experimental Results in Matusov, Kanthak and Ney 2005
- Summary of the results
  - Translation quality is only improved by tight coupling when the lattice density is not high.
  - As in Saleem 2004, incorporation of acoustic scores helps.
55. Conclusion: Possible Issues of Tight Coupling
- Possibilities
  - In SR, the source n-gram LM is very close to the best configuration.
  - The complexity of the algorithm is too high; approximation is still necessary to make it work.
  - When the tight-coupling criterion is used, it is possible that the LM and the TM need to be jointly estimated.
  - The current approaches still haven't really implemented tight coupling.
  - There might be bugs in the programs.
56. Conclusion
- Two major issues in the coupling of SST were discussed
- In loose coupling
  - Consensus networks and confidence scoring are still not fully utilized
- In tight coupling
  - The approach seems to be haunted by the very high complexity of search algorithm construction
57. Discussion
58. The End. Thanks.
59. Literature
- R. Zhang and G. Kikui, "Integration of speech recognition and machine translation: Speech recognition word lattice translation," Speech Communication, vol. 48, issues 3-4, 2006.
- H. Ney, "Speech translation: Coupling of recognition and translation," in Proc. ICASSP, 1999.
- E. Matusov, S. Kanthak, and H. Ney, "On the integration of speech recognition and statistical machine translation," in Proc. Interspeech, 2005.
- S. Saleem, S. C. Jou, S. Vogel, and T. Schultz, "Using word lattice information for a tighter coupling in speech translation systems," in Proc. ICSLP, 2004.
- V. H. Quan et al., "Integrated N-best re-ranking for spoken language translation," in Proc. Eurospeech, 2005.
- N. Bertoldi and M. Federico, "A new decoder for spoken language translation based on confusion networks," in Proc. IEEE ASRU Workshop, 2005.
- L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Computer Speech and Language, 14(4), pp. 373-400, 2000.
- E. Ringger, "A Robust Loose Coupling for Speech Recognition and Natural Language Understanding," 1995.
60. Backup Slides
61. Ney 99's Formulation of SST's Search
62. Assumptions in Modeling
- Alignment models (HMM)
- Acoustic modeling
  - The speech recognizer will produce a word graph.
  - Each link carrying a word hypothesis covers a portion of the acoustic scores. (The notation in the paper is confusing.)
63. Lexicon Modeling
- Further assumptions beyond the standard IBM models
  - The target word is assumed to be dependent on the previous word
  - So, in fact, a source LM is actually there.
64. First Implementation: Local Average Assumption
- Local average assumption
  - P(x|e) is used to capture the local characteristics of the acoustics.
65. Justification for Using the Local Average Assumption
- Rephrased from the author (p. 3, para 2)
  - Lexicon modeling and language modeling cause f_{j-1}, f_j, f_{j+1} to appear in the math.
  - In other words, it is too complicated to carry out exactly.
- Computational advantage: the local score can be obtained from the word graph alone, before translation
  - => A full translation strategy can still be carried out
66. Computation of P(x|e)
- Makes use of the best source sequence
- Also refers to Wessel 98
  - A commonly used word posterior probability algorithm for lattices
  - A forward-backward-like procedure is used
67. Second Method: Monotone Alignment Assumption - Network
68. Monotone Alignment Assumption: Formula for Text Input
- A closed-form solution exists via dynamic programming: O(J · E²)
- (A DP sketch follows.)
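The DP sketch referenced above, in a deliberately simplified form: each source position either stays with the current target word or starts a new one, which gives the O(J · E²) behavior. The lexicon and bigram LM are passed in as placeholders and are not Ney's exact models.

```python
# Simplified sketch of the monotone-alignment DP for text input (O(J * E^2)):
# Q[e] = best log score of covering the first j source words with e as the last target word.
import math
from typing import Callable, Dict, List

def monotone_dp(source: List[str],
                target_vocab: List[str],
                lex_logprob: Callable[[str, str], float],  # log p(f_j | e)
                lm_logprob: Callable[[str, str], float]    # log p(e | e_prev), bigram LM
                ) -> float:
    NEG = -math.inf
    # Initialization: the first source word must start some target word.
    Q: Dict[str, float] = {e: lm_logprob(e, "<s>") + lex_logprob(source[0], e)
                           for e in target_vocab}
    for j in range(1, len(source)):
        newQ: Dict[str, float] = {}
        for e in target_vocab:
            stay = Q.get(e, NEG)                                            # f_j also aligned to e
            start = max(Q[e2] + lm_logprob(e, e2) for e2 in target_vocab)   # f_j starts new word e
            newQ[e] = max(stay, start) + lex_logprob(source[j], e)
        Q = newQ
    return max(Q[e] + lm_logprob("</s>", e) for e in target_vocab)
```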
69. Monotone Alignment Assumption: Formula for Speech Input
70. How to make the monotone assumption work?
- Words need to be reordered
  - As part of the search strategy.
- Is the acoustic model assumption used?
  - I.e., are we talking about the word lattice or still the state lattice?
  - Don't know; it seems we are actually talking about the word lattice.
    - Supported by Matusov 2005