Title: Compact WFSA-based Language Model and Its Application in Statistical Machine Translation
1 Compact WFSA-based Language Model and Its Application in Statistical Machine Translation
- Xiaoyin Fu, Wei Wei, Shixiang Lu, Dengfeng Ke, Bo Xu
- Interactive Digital Media Technology Research Center, CASIA
2 Outline
- Task
- Problems
- Solution
- Our Approach
- Results
- Conclusion
4 Task
- N-gram language model
  - assigns probabilities to strings of words or tokens
  - let w_1^L denote a string of L tokens over a fixed vocabulary
- Smoothing techniques
  - back-off
- Definition
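The formula under the "Definition" bullet did not survive extraction; the standard back-off recursion the slide refers to (matching the back-off bullet above) has the form:

```latex
P(w_i \mid w_{i-n+1}^{i-1}) =
\begin{cases}
  \hat{P}(w_i \mid w_{i-n+1}^{i-1}) & \text{if } w_{i-n+1}^{i} \text{ is in the model,} \\
  \alpha(w_{i-n+1}^{i-1}) \; P(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise,}
\end{cases}
```

where \hat{P} is the smoothed probability stored for a seen n-gram and \alpha is the back-off weight of the shortened context.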
6 Problems
- Queries in the trie structure
- Useless queries
  - problems in forward queries
  - problems in back-off queries
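The two problems can be seen in a minimal sketch of a conventional trie-based back-off query (the dict-as-trie and node layout here are assumptions, not the paper's implementation): every forward query walks the structure from scratch, and every back-off restarts the lookup with a shortened context, repeating work already done.

```python
def query(trie, context, word):
    """Return log10 P(word | context) with back-off."""
    node = trie.get(tuple(context) + (word,))
    if node is not None:
        return node["prob"]                  # forward query found the n-gram
    if not context:
        return -99.0                         # hypothetical OOV floor
    parent = trie.get(tuple(context))
    penalty = parent["backoff"] if parent else 0.0
    # Back-off query: drop the oldest context word and start over,
    # re-looking-up prefixes the forward query already touched.
    return penalty + query(trie, context[1:], word)

# Toy 2-gram model (log10 probabilities and back-off weights)
lm = {("a",): {"prob": -0.5, "backoff": -0.3},
      ("b",): {"prob": -0.7, "backoff": 0.0},
      ("a", "b"): {"prob": -0.2, "backoff": 0.0}}

query(lm, ["a"], "b")   # hit: -0.2
query(lm, ["b"], "a")   # back-off: 0.0 + P(a) = -0.5
```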
8 Solution
- Another point of view: treat LM querying as
  - a random procedure
  - a continuous process
- Benefit
  - speeds up forward queries
  - speeds up back-off queries
- Goal
  - fast
  - compact
10 Our Approaches
- FAST
- WFSA
  - a 5-tuple M = (Q, Σ, I, F, δ)
- Definition
  - Q: a set of states
  - I: a set of initial states
  - F: a set of final states
  - Σ: an alphabet of the input and output labels
  - δ ⊆ Q × (Σ ∪ {ε}) × Q: a transition relation
11 Our Approaches
- FAST
- WFSA
  - a 5-tuple M = (Q, Σ, I, F, δ)
- Example (figure)
12 Our Approaches (figure only)
13 Our Approaches
- Compact
  - trie structure
  - sorted arrays
  - link indexes
14 Our Approaches
- WFSA-based LM over the trie structure
- Note
  - Tf: triggered by a forward query (consumes an input word)
  - Tb: triggers spontaneously, without any input
    - fires when the query reaches the leaves
    - carries out back-off queries
- Mapping onto the trie
  - Q: the nodes of the trie
  - I: the root of the trie
  - F: every node of the trie except the root
  - Σ: the alphabet of the input sentences
  - δ: forward transitions Tf and roll-back transitions Tb
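The WFSA view above can be sketched as follows (names and the OOV floor are assumptions, not the paper's code): the decoder holds a current state, a trie node standing for the matched context, and feeds words one at a time; Tf consumes the word, while Tb fires spontaneously, adding a back-off weight and moving to a shorter-context state, so a back-off never restarts from the root.

```python
class State:
    """One WFSA state = one trie node (hypothetical minimal layout)."""
    def __init__(self, prob=0.0, backoff=0.0):
        self.prob = prob          # log10 n-gram probability
        self.backoff = backoff    # log10 back-off weight
        self.forward = {}         # Tf: input word -> next state
        self.rollback = None      # Tb: state for the shortened context

def step(state, word, root):
    """Consume one word; return (log10 prob, next state).
    Tb fires spontaneously (no input) until a Tf transition for `word`
    exists, accumulating back-off weights along the way."""
    penalty = 0.0
    while word not in state.forward:
        if state is root:                    # nothing left to roll back to
            return penalty - 99.0, root      # hypothetical OOV floor
        penalty += state.backoff             # Tb: spontaneous roll-back
        state = state.rollback
    nxt = state.forward[word]                # Tf: forward transition
    return penalty + nxt.prob, nxt

# Toy 2-gram model: states for contexts "a", "b", and n-gram "a b"
root = State()
a, b, ab = State(-0.5, -0.3), State(-0.7, 0.0), State(-0.2)
root.forward = {"a": a, "b": b}
a.forward, a.rollback = {"b": ab}, root
b.rollback, ab.rollback = root, b
```

Scoring a sentence is then a chain of `step` calls starting from `root`, with the returned state reused as the next query's starting point.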
15 Our Approaches (figure only)
16 Our Approaches
- Node record, standard trie: Probability | Back-off | Index
- Node record, WFSA: Probability | Back-off | Index | Roll-back index
17 Our Approaches
- Node record, standard trie: Probability | Back-off | Index
- Node record, WFSA: Probability | Back-off | Index | Roll-back index
- Cross layer
18-28 Our Approaches (figures only)
29 Our Approaches
- For HPB (hierarchical phrase-based) SMT
  - a source sentence triggers a huge number of LM queries
    - tens of millions
  - most of these are repetitive
  - remedy: a hash cache
30 Our Approaches
- For HPB SMT: hash cache
  - small and fast
    - hash size: 24 bits (16M entries)
  - simple operations
    - additive operations
    - bitwise operations
  - hash cleared for each sentence
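A minimal sketch of such a cache, following the slide's constants (24-bit table, 16M slots, cleared per sentence); the mixing function itself is an assumption, restricted to the additions and bit operations the slide mentions:

```python
class QueryCache:
    """Direct-mapped cache from n-gram word-id tuples to log probabilities."""

    def __init__(self, bits=24):              # 2**24 = 16M entries
        self.size = 1 << bits
        self.keys = [None] * self.size
        self.vals = [0.0] * self.size

    @staticmethod
    def _mix(word_ids):
        h = 0
        for w in word_ids:                    # additive + bitwise only
            h = (h + w) & 0xFFFFFFFF
            h ^= ((h << 5) | (h >> 3)) & 0xFFFFFFFF
        return h

    def get(self, word_ids):
        slot = self._mix(word_ids) & (self.size - 1)
        if self.keys[slot] == tuple(word_ids):
            return self.vals[slot]
        return None                           # miss: query the LM instead

    def put(self, word_ids, prob):
        slot = self._mix(word_ids) & (self.size - 1)
        self.keys[slot] = tuple(word_ids)     # overwrite on collision
        self.vals[slot] = prob

    def clear(self):                          # called once per source sentence
        self.keys = [None] * self.size
        self.vals = [0.0] * self.size
```

Storing the full key alongside the value makes collisions harmless (a mismatched key is just a miss), which is why a simple, fast mix suffices here.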
32 Results
- Setup
  - LM toolkit: SRILM
  - Decoder: hierarchical phrase-based translation system
  - Test data: IWSLT-07 (489 sentences), NIST-06 (1664 sentences)
- Training data

| Task | Model | Sentences | Chinese words | English words |
|---|---|---|---|---|
| IWSLT-07 | TM¹ | 0.38M | 3.0M | 3.1M |
| IWSLT-07 | LM² | 1.3M | - | 15.2M |
| NIST-06 | TM³ | 3.4M | 64M | 70M |
| NIST-06 | LM⁴ | 14.3M | - | 377M |

¹ The parallel corpus of BTEC (Basic Traveling Expression Corpus) and CJK (China-Japan-Korea corpus). ² The English corpus of BTEC, CJK and CWMT2008. ³ LDC2002E18, LDC2002T01, LDC2003E07, LDC2003E14, LDC2003T17, LDC2004T07, LDC2004T08, LDC2005T06, LDC2005T10, LDC2005T34, LDC2006T04, LDC2007T09. ⁴ LDC2007T07.
33 Results
- Storage space
  - storage sizes increase by about 35%
  - growth is linear in the number of trie nodes
  - acceptable
- Comparison of LM size between SRILM and WFSA:

| Task | n-grams | SRILM (MB) | WFSA (MB) | Δ (%) |
|---|---|---|---|---|
| IWSLT-07 | 4 | 65.7 | 89.1 | 35.6 |
| IWSLT-07 | 5 | 89.8 | 119.5 | 33.1 |
| NIST-06 | 4 | 860.3 | 1190.4 | 38.4 |
| NIST-06 | 5 | 998.5 | 1339.7 | 34.2 |
34 Results
- Query speed
  - WFSA cuts query time by about 60% for 4-grams and about 70% for 5-grams
  - WFSA + cache: up to about 75% faster than SRILM
- Query times:

| n-grams | Method | IWSLT-07 (s) | NIST-06 (s) |
|---|---|---|---|
| 4 | SRILM | 163 | 15433 |
| 4 | WFSA | 70 | 6251 |
| 4 | WFSA+cache | 42 | 3907 |
| 5 | SRILM | 261 | 25172 |
| 5 | WFSA | 85 | 7944 |
| 5 | WFSA+cache | 59 | 6128 |
35 Results
- Analysis: repetitive queries and back-off queries in SMT (4-gram)
  - back-off queries occur very frequently
  - most of these queries are repetitive
  - hence the WFSA-based LM speeds up queries effectively

| Task | Back-off (%) | Repetitive (%) |
|---|---|---|
| IWSLT-07 | 60.5 | 95.5 |
| NIST-06 | 60.3 | 96.4 |
37 Conclusion
- A faster WFSA-based LM
  - faster forward queries
  - faster back-off queries
- A compact WFSA-based LM
  - trie structure
- A simple caching technique
  - for the SMT system
- Other fields
  - speech recognition
  - information retrieval
38 Thanks!