Title: ICASSP 05
References
- Rapid Language Model Development Using External Resources for New Spoken Dialog Domains - Ruhi Sarikaya (IBM), Agustin Gravano (Columbia University), Yuqing Gao (IBM)
- Maximum Entropy Based Generic Filter for Language Model Adaptation - Dong Yu, Milind Mahajan, Peter Mau, Alex Acero (Microsoft)
- Language Model Estimation for Optimizing End-to-End Performance of a Natural Language Call Routing System - Vaibhava Goel, Hong-Kwang Jeff Kuo, Sabine Deligne, Cheng Wu (IBM)
Introduction
- LM adaptation consists of four steps:
- 1. Collect task-specific adaptation data
- 2. Normalize the data (abbreviations, dates and times, punctuation)
- 3. Analyze the adaptation data and build a task-specific LM
- 4. Interpolate the task-specific LM with the task-independent LM (see the sketch below)
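As a concrete illustration of step 4, here is a minimal sketch of the interpolation, assuming both LMs have been reduced to word-probability lookups; the weight lam = 0.3 is a hypothetical value that would normally be tuned on held-out in-domain data (real systems interpolate full n-gram models, but the combination rule is the same).

    # Minimal sketch of step 4: interpolating a task-specific LM with a
    # task-independent LM. Both models are represented as plain unigram
    # probability dictionaries purely for illustration.
    def interpolate(p_task, p_general, lam=0.3):
        """P_adapted(w) = lam * P_task(w) + (1 - lam) * P_general(w).
        `lam` is a hypothetical weight, normally tuned on held-out data."""
        vocab = set(p_task) | set(p_general)
        return {w: lam * p_task.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
                for w in vocab}

    # Toy usage with made-up probabilities.
    p_task = {"flight": 0.4, "book": 0.3, "to": 0.3}
    p_general = {"the": 0.5, "to": 0.3, "flight": 0.2}
    print(round(interpolate(p_task, p_general)["flight"], 2))  # 0.4*0.3 + 0.2*0.7 = 0.26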
Introduction (cont.)
- Language modeling research has concentrated in two directions:
- 1. Improving the language model probability estimation
- 2. Obtaining additional training material
- The largest data set is the World Wide Web (WWW): more than 4 billion pages
Introduction (cont.)
- Using web data for language modeling involves:
- Query generation
- Filtering the relevant text from the retrieved pages
- Web counts are certainly less sparse than the counts in a corpus of a fixed size
- Web counts are also likely to be significantly noisier than counts obtained from a carefully cleaned and normalized corpus
- Retrieval unit: whole document vs. sentence (utterance)
Building an LM for a new domain
- In practice, when we start to build an SDS (spoken dialog system) for a new domain, the amount of in-domain data for the target domain is usually small
- Definitions:
- Static resource: corpora collected for other tasks
- Dynamic resource: web data
Flow diagram for collecting relevant data
Generating search queries
- Google is used as the search engine
- The more specific a query is, the more relevant the retrieved pages are
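The slide does not spell out how the queries are generated, so the sketch below only illustrates the general idea of turning frequent in-domain n-grams into quoted phrase queries (quoting is what makes a query more specific); the function and parameter names are my own.

    from collections import Counter

    def generate_queries(utterances, n=3, top_k=10):
        """Turn the most frequent in-domain word n-grams into quoted search
        queries. Quoted phrases act as exact-match constraints, so longer
        n-grams yield more specific queries and more relevant pages."""
        counts = Counter()
        for utt in utterances:
            words = utt.lower().split()
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return ['"%s"' % " ".join(gram) for gram, _ in counts.most_common(top_k)]

    # Toy in-domain utterances from a flight-booking dialog domain.
    utts = ["i want to book a flight to boston",
            "book a flight to new york for tomorrow",
            "i need to book a flight to boston on monday"]
    print(generate_queries(utts, n=3, top_k=3))  # three most frequent trigrams, quoted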
Similarity-based sentence selection
- Retrieved sentences are scored with machine translation's BLEU (BiLingual Evaluation Understudy) metric:
- BLEU = BP * exp( sum_{n=1..N} w_n * log p_n )
- N is the maximum n-gram length; w_n and p_n are the corresponding weight and precision, respectively; BP is the brevity penalty:
- BP = 1 if c > r, and exp(1 - r/c) otherwise, where r and c are the lengths of the reference and candidate sentences, respectively
- The selection threshold is 0.08
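A minimal sketch of the selection step, under simplifying assumptions: a smoothed sentence-level BLEU with uniform weights is computed against each in-domain sentence, and a retrieved sentence is kept if its best score exceeds the 0.08 threshold from the slide (the smoothing and the use of the maximum over references are my own choices, not necessarily the paper's).

    import math
    from collections import Counter

    def sentence_bleu(candidate, reference, max_n=4):
        """Simplified sentence-level BLEU: geometric mean of modified n-gram
        precisions p_n with uniform weights w_n = 1/N, times the brevity
        penalty BP = min(1, exp(1 - r/c))."""
        cand, ref = candidate.split(), reference.split()
        log_prec = 0.0
        for n in range(1, max_n + 1):
            cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
            total = max(sum(cand_ngrams.values()), 1)
            # add-one smoothing so one empty n-gram order does not zero the score
            log_prec += math.log((overlap + 1.0) / (total + 1.0)) / max_n
        bp = min(1.0, math.exp(1.0 - len(ref) / max(len(cand), 1)))
        return bp * math.exp(log_prec)

    def select_sentences(web_sentences, in_domain, threshold=0.08):
        """Keep a retrieved sentence if its best score against any in-domain
        sentence exceeds the threshold."""
        return [s for s in web_sentences
                if max(sentence_bleu(s, r) for r in in_domain) > threshold]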
Experimental results
- SCLM: language model built from the static corpora
- WWW-20 / WWW-100: a predefined limit of 20 / 100 pages per sentence
E-mail corpus
- Dictated and non-dictated
Filtering the corpus
- Filtering out these non-dictated texts is not easy in general
- Hand-crafted rules (e.g. regular expressions) have limitations (see the example below):
- They do not generalize well to situations we have not encountered
- Rules are usually language dependent
- Developing and testing rules is very costly
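For concreteness, here is the kind of hand-crafted rule set the slide is criticizing, sketched for English e-mail text; the patterns are illustrative only, and the last example shows how easily non-dictated text slips through.

    import re

    # A few hand-crafted rules of the kind criticized above: drop quoted
    # replies, signature separators, and forwarded headers from e-mail text.
    # Every new client, language, or quoting convention needs another
    # pattern, which is why this approach scales poorly.
    NON_DICTATED = [
        re.compile(r"^\s*>"),                           # quoted reply line
        re.compile(r"^--\s*$"),                         # signature separator
        re.compile(r"^(from|to|subject|sent):", re.I),  # forwarded mail header
    ]

    def keep_line(line):
        return not any(p.search(line) for p in NON_DICTATED)

    mail = ["Hi Bob, let us meet at noon.", "> On Monday you wrote:", "-- ", "John"]
    print([l for l in mail if keep_line(l)])
    # ['Hi Bob, let us meet at noon.', 'John']  <- the bare signature name slips through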
Maximum Entropy-based filter
- Consider the filtering task as a labeling problem: segment the adaptation data into two categories
- Category D (dictated text): text which should be used for LM adaptation
- Category N (non-dictated text): text which should not be used for LM adaptation
- The text is divided into a sequence of text units (such as lines) t_1, ..., t_n, where t_i is a text unit and l_i is the label associated with t_i
Label dependency
- Assume that the labels of the text units are independent of each other given the complete sequence of text units:
- P(l_1, ..., l_n | t_1, ..., t_n) = prod_i P(l_i | t_1, ..., t_n)
- We further assume that the label for a given unit depends only on the units in a surrounding window of k units, with k = 1:
- P(l_i | t_1, ..., t_n) ≈ P(l_i | t_{i-k}, ..., t_{i+k})
Classification Model
- A MaxEnt model has the form
- P(l_i | t_{i-k}, ..., t_{i+k}) = exp( sum_j lambda_j * f_j(l_i, t_{i-k}, ..., t_{i+k}) ) / Z(t_{i-k}, ..., t_{i+k})
- where Z is the normalizing sum over both labels, the f_j are feature functions, and lambda = (lambda_1, ..., lambda_m) is the vector of model parameters
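A minimal sketch of such a binary MaxEnt classifier over text units. With two labels the model reduces to logistic regression; the features below and the gradient-ascent training loop are illustrative stand-ins, not the paper's RawCompact / EOS / OOV features or its training procedure.

    import math
    import re

    def features(line):
        """Illustrative binary features of a text unit (a line)."""
        words = line.split()
        return {
            "ends_with_punct": float(bool(re.search(r"[.!?]$", line.strip()))),
            "starts_quoted": float(line.lstrip().startswith(">")),
            "short_line": float(len(words) < 3),
            "bias": 1.0,
        }

    def prob_dictated(line, lam):
        """P(D | line) under the two-class MaxEnt model (logistic form)."""
        score = sum(lam.get(f, 0.0) * v for f, v in features(line).items())
        return 1.0 / (1.0 + math.exp(-score))

    def train(lines, labels, epochs=200, lr=0.5):
        """Gradient ascent on the conditional log-likelihood; y = 1 for
        category D (dictated), y = 0 for category N (non-dictated)."""
        lam = {}
        for _ in range(epochs):
            for line, y in zip(lines, labels):
                err = y - prob_dictated(line, lam)
                for f, v in features(line).items():
                    lam[f] = lam.get(f, 0.0) + lr * err * v
        return lam

    lines = ["Please send me the updated report by Friday.",
             "> On Tue, Alice wrote:", "Thanks, I will review it today.", "-- John"]
    lam = train(lines, [1, 0, 1, 0])
    print(round(prob_dictated("Could you confirm the meeting time?", lam), 2))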
Classification Model (cont.)
Features
Space Splitting
Evaluation
- The evaluated system uses only the features RawCompact, EOS, and OOV
- No space splitting
- Filtering is especially important and effective for adaptation data with a high percentage of non-dictated text (U2)
Efficient linear combination for distant n-gram models
- David Langlois, Kamel Smaili, Jean-Paul Haton
- EUROSPEECH 2003, pp. 409-412
Introduction
- Classical n-gram model
- Distant language models
Modelization of distance in SLM
- Cache model (self-relationship)
- The cache model deals with the self-relationship between a word present in the history and itself: if a word is frequent in the history, it has more chance to appear once again
Modelization of distance in SLM (cont.)
- Trigger model: the relationship between two words
- It deals with couples of words v → w such that if v (the triggering word) is in the history, w (the triggered word) has more chance to appear
- But, in fact, the majority of triggers are self-triggers (v → v): a word triggers itself
d-n-gram model
- A d-n-gram model predicts w_i from an (n-1)-word history located d positions further back in the sentence, so the 0-n-gram model is the classical n-gram model
- N_d(.) is the discounted count of events at distance d
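To make the distance concrete, here is a small sketch of collecting distance-d bigram events, where d = 0 gives the ordinary adjacent bigram; the raw counts below stand in for the discounted counts N_d(.) used by the model.

    from collections import Counter

    def distant_bigram_counts(words, d):
        """Count pairs (w[i-1-d], w[i]): the predicted word and a context word
        that sits d positions further back than the usual bigram context.
        With d = 0 this is the classical adjacent-bigram count; the paper
        uses discounted counts, raw counts are shown only for illustration."""
        gap = d + 1
        return Counter((words[i - gap], words[i]) for i in range(gap, len(words)))

    words = "i would like to book a flight to boston".split()
    print(distant_bigram_counts(words, 0)[("to", "book")])    # adjacent bigram -> 1
    print(distant_bigram_counts(words, 2)[("would", "book")]) # two words skipped -> 1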
Evaluation
- Vocabulary: 20k words
- Training set: 38M words
- Development set: 2M words
- Test set: 2M words
- Baseline: classical n-gram models

Model      Perplexity
unigram    739.9
bigram     132.4
trigram    97.8
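As a reminder, the perplexity reported throughout these experiments is the exponentiated average negative log-probability that the model assigns to the test words; a minimal sketch:

    import math

    def perplexity(word_log_probs):
        """PP = exp( -(1/N) * sum_i ln p(w_i | h_i) ), computed from the
        per-word log-probabilities the model assigns on the test set."""
        return math.exp(-sum(word_log_probs) / len(word_log_probs))

    # Sanity check: a uniform model over a 20k-word vocabulary has PP = 20000.
    print(round(perplexity([math.log(1.0 / 20000)] * 5)))  # 20000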
Integration of distant n-gram models
- A distant n-gram model cannot be used alone, since it takes into account only a part of the history
- Perplexity is 717 for n = 2 and d = 4
- Several models, with distances up to d, are combined with the baseline model
- Improvement: 7.1
- Improvement: 3.1
- The utility of distant n-gram models decreases with the distance: a distance greater than 2 does not provide more information
- Distant trigrams lead to an improvement, but it is smaller than for distant bigrams, probably because of the overlap between the history of the d-trigram and that of the (d-1)-trigram
Backoff smoothing
- Configurations compared: b_u_z, db_u_z, and their combination (b_u_z)+(db_u_z)
- Improvements: 7.9 and 11.6
Combination weights
- A unique weight
- Model weights that depend on each history (the class of each sub-history)
Combination of distant n-gram models
- In order to combine K models M_1, ..., M_K, a set of weights lambda_1, ..., lambda_K is defined and the combination is expressed by
- P(w | h) = sum_{k=1..K} lambda_k * P_{M_k}(w | h), with sum_k lambda_k = 1
- The development corpus is not sufficient to estimate a huge number of parameters
- Therefore, histories are classified and a weight is set for each class (see the sketch below)
Classification
- Break the history into several parts (sub-histories). Each sub-history is analyzed in order to estimate its importance in terms of prediction, and is then put into a class
- Such a class is directly linked to the value of the sub-history frequency
- A class gathers all sub-histories which have approximately the same frequency
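A small sketch of the bucketing idea: each sub-history is mapped to a class according to its training-set frequency, and each class later receives its own interpolation weight; the logarithmic binning below is an assumption for illustration, since the slide only says "approximately the same frequency".

    import math
    from collections import Counter

    def history_class(subhistory, counts, num_classes=8):
        """Map a sub-history to a class index based on its training-set
        frequency, so that sub-histories with roughly the same frequency
        share a class (and hence a weight). Log-scale binning is assumed."""
        freq = counts.get(subhistory, 0)
        return min(int(math.log2(freq + 1)), num_classes - 1)

    words = "book a flight book a flight to boston book a ticket".split()
    counts = Counter((words[i - 1],) for i in range(1, len(words)))  # one-word sub-histories
    print(history_class(("book",), counts))    # frequent sub-history -> class 2
    print(history_class(("boston",), counts))  # rare sub-history -> class 1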
- 8000 classes: perplexity 115.4, a 12.8% improvement over the bigram baseline (132.4) and a 5.3% improvement over the single-weight combination (121.9)
- 4000 classes: perplexity 85.2, a 12.8% improvement over the trigram baseline (97.8) and a 1.5% improvement over the single-weight combination (86.5)