Title: An Empirical Study on Language Model Adaptation
1. An Empirical Study on Language Model Adaptation
Jianfeng Gao, Hisami Suzuki (Microsoft Research); Wei Yuan (Shanghai Jiao Tong University)
Presented by Patty Liu
2. Outline
- Introduction
- The Language Model and the Task of IME
- Related Work
- LM Adaptation Methods
- Experimental Results
- Discussion
- Conclusion and Future Work
3. Introduction
- Language model adaptation attempts to adjust the parameters of an LM so that it will perform well on a particular domain of data.
- In particular, we focus on the so-called cross-domain LM adaptation paradigm: adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.
- The LM adaptation methods investigated here fall into two categories:
  - (1) Maximum a posteriori (MAP): linear interpolation
  - (2) Discriminative training: boosting, perceptron, and minimum sample risk
4. The Language Model and the Task of IME
- IME (Input Method Editor): the user first inputs phonetic strings, which are then converted into appropriate word strings by software.
- Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is provided directly by the user. Moreover, we can assume a unique mapping from the word string W to the phonetic string A in IME, that is, P(A | W) = 1.
- From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.
5. Related Work (1/3)
- I. Measuring Domain Similarity
  - L: a language
  - p: the true underlying probability distribution of L
  - m: another distribution (e.g., an SLM) which attempts to model p
  - H(p, m): the cross entropy of p with respect to m,
    H(p, m) = - lim_{n→∞} (1/n) Σ p(w1 ... wn) log m(w1 ... wn), summed over word strings w1 ... wn in L
  - w1 ... wn: a word string in L
6. Related Work (2/3)
- However, in reality, the underlying p is never known and the corpus size is never infinite. We therefore assume that L is an ergodic and stationary process, and approximate the cross entropy by calculating it for a sufficiently large n instead of taking the limit:
  H(p, m) ≈ -(1/n) log m(w1 ... wn)
- The cross entropy takes into account both the similarity between two distributions (given by the KL divergence) and the entropy of the corpus in question.
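This approximation can be made concrete with a short sketch. It is only illustrative: `model.logprob` is an assumed interface returning log2 m(w | history) for an arbitrary SLM, not an API from the paper.

```python
def cross_entropy(corpus_words, model, order=3):
    """Approximate H(p, m) as -(1/n) * log2 m(w1 ... wn), relying on the
    ergodic/stationary assumption described above (n = corpus length)."""
    total_logprob = 0.0
    history = []
    for w in corpus_words:
        # model.logprob is an assumed interface: log2 P(w | last order-1 words)
        total_logprob += model.logprob(w, tuple(history[-(order - 1):]))
        history.append(w)
    return -total_logprob / len(corpus_words)  # bits per word; 2 ** H is the perplexity
```

The lower the cross entropy of a domain's text under a model trained on the background data, the more similar that domain is to the background domain.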
7. Related Work (3/3)
- II. LM Adaptation Methods
  - MAP: adjust the parameters of the background model so as to maximize the likelihood of the adaptation data.
  - Discriminative training methods: use the adaptation data to directly minimize the errors made in it by the background model.
- These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.
8. LM Adaptation Methods: Linear Interpolation (LI)
- I. The Linear Interpolation Method
  P(w | h) = λ P_B(w | h) + (1 - λ) P_A(w | h)
  - P_B(w | h): the probability of the background model
  - P_A(w | h): the probability of the adaptation model
  - h: the history, which corresponds to the two preceding words for a trigram model
- For simplicity, we chose a single λ for all histories and tuned it on held-out data.
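A minimal sketch of LI, assuming the two component models are given as callables `p_background(w, h)` and `p_adapt(w, h)` (illustrative names, not from the paper):

```python
import math

def p_interpolated(w, h, p_background, p_adapt, lam):
    """P(w | h) = lam * P_B(w | h) + (1 - lam) * P_A(w | h); one lam for all histories."""
    return lam * p_background(w, h) + (1.0 - lam) * p_adapt(w, h)

def tune_lambda(heldout, p_background, p_adapt, grid=tuple(i / 10 for i in range(1, 10))):
    """Choose the single lambda minimizing negative log-likelihood on held-out
    (history, word) pairs -- a simple grid search stand-in for the tuning step."""
    def nll(lam):
        return -sum(math.log(p_interpolated(w, h, p_background, p_adapt, lam))
                    for h, w in heldout)
    return min(grid, key=nll)
```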
9. LM Adaptation Methods: Problem Definition of Discriminative Training Methods (1/3)
- II. Discriminative Training Methods
  - Problem Definition
10. LM Adaptation Methods: Problem Definition of Discriminative Training Methods (2/3)
- This framework views IME as a ranking problem, where the model gives each candidate conversion a ranking score, not a probability. We therefore do not evaluate the LMs obtained using discriminative training via perplexity.
11. LM Adaptation Methods: Problem Definition of Discriminative Training Methods (3/3)
- W_i^R: the reference transcript for the i-th training sample
- Er(W_i^R, W): an error function, which is an edit distance function in this case
- Sample risk SR(λ): the sum of error counts over the training samples
- Discriminative training methods strive to minimize the sample risk by optimizing the model parameters λ. However, SR(λ) cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined.
- Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate SR(λ) by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize SR(λ) directly.
12. LM Adaptation Methods: The Boosting Algorithm (1/2)
- (i) The Boosting Algorithm
- Margin: M(W_i^R, W) = Score(W_i^R, A_i) - Score(W, A_i)
- A ranking error: an incorrect candidate conversion W gets a higher score than the correct conversion W_i^R.
- RLoss = Σ_i Σ_{W ∈ GEN(A_i)} [M(W_i^R, W) ≤ 0], where [π] = 1 if π is true, and 0 otherwise.
- Optimizing RLoss directly is NP-complete.
- → Optimize its upper bound instead, ExpLoss = Σ_i Σ_{W ∈ GEN(A_i)} exp(-M(W_i^R, W)).
- ExpLoss is convex.
13. LM Adaptation Methods: The Boosting Algorithm (2/2)
- C_+: a value increasing exponentially with the sum of the margins of pairs over the set where the feature f_d is seen in W_i^R but not in W
- C_-: the corresponding value related to the sum of margins over the set where f_d is seen in W but not in W_i^R
- ε: a smoothing factor (whose value is optimized on held-out data)
- Z: a normalization constant
14. LM Adaptation Methods: The Perceptron Algorithm (1/2)
- (ii) The Perceptron Algorithm
- Delta rule: after decoding each training sample, each feature weight is moved toward the reference and away from the current best candidate,
  λ_d ← λ_d + η (f_d(W_i^R, A_i) - f_d(W*, A_i))
- This is a form of stochastic approximation.
15. LM Adaptation Methods: The Perceptron Algorithm (2/2)
- The averaged perceptron algorithm: the final model uses the average of the weight vectors over all training updates, which is more robust than the last weight vector.
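A compact sketch of the averaged perceptron in this ranking setting, under the same illustrative data layout as before (each sample pairs the reference feature vector with the feature vectors of all candidate conversions):

```python
from collections import defaultdict

def averaged_perceptron(samples, epochs=5, eta=1.0):
    """samples: list of (ref_feats, candidate_feats_list); feature dicts map
    feature id -> count.  Applies the delta rule against the current best
    candidate and returns the average of the weight vector over all steps."""
    lam = defaultdict(float)
    lam_sum = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for ref_feats, candidates in samples:
            best = max(candidates,
                       key=lambda f: sum(lam.get(d, 0.0) * v for d, v in f.items()))
            if best != ref_feats:  # a ranking error: move weights toward the reference
                for d in set(ref_feats) | set(best):
                    lam[d] += eta * (ref_feats.get(d, 0) - best.get(d, 0))
            steps += 1
            for d, v in lam.items():
                lam_sum[d] += v
    return {d: v / steps for d, v in lam_sum.items()}
```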
16. LM Adaptation Methods: MSR (1/7)
- (iii) The Minimum Sample Risk Method
- Conceptually, MSR operates like any multidimensional function optimization approach (see the sketch after this list):
  - The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, by adjusting the parameter of the selected feature while keeping all other parameters fixed.
  - Then, from there, SR is minimized along the second direction to its minimum, and so on.
  - The procedure cycles through the whole set of directions as many times as necessary, until SR stops decreasing.
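A sketch of this outer loop; `line_search` and `sample_risk` are assumed helpers (a line search consistent with the slides that follow is sketched after slide 19):

```python
def minimum_sample_risk(features, samples, lam, line_search, sample_risk, max_passes=10):
    """Coordinate descent over features: minimize SR along one direction at a time
    with a line search, holding all other weights fixed, and cycle through the
    feature set until the sample risk stops decreasing."""
    best_risk = sample_risk(samples, lam)
    for _ in range(max_passes):
        improved = False
        for d in features:
            value, risk = line_search(samples, lam, d)  # best weight for d and its SR
            if risk < best_risk:
                lam[d] = value
                best_risk = risk
                improved = True
        if not improved:
            break
    return lam, best_risk
```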
17. LM Adaptation Methods: MSR (2/7)
- This simple method can work properly under two assumptions:
  - First, there exists an implementation of line search that efficiently optimizes the function along one direction.
  - Second, the number of candidate features is not too large, and they are not highly correlated.
- However, neither of the assumptions holds in our case.
  - First of all, Er(.) in SR(λ) is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures; a grid search has to be used instead. However, there are problems with a simple grid search: using a large grid could miss the optimal solution, whereas using a fine-grained grid would lead to a very slow algorithm.
  - Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.
18. LM Adaptation Methods: MSR (3/7)
- Active candidate of a group
  - W: a candidate word string, W ∈ GEN(A)
  - The score of a candidate can be split as Score(W, A) = λ_d f_d(W, A) + Σ_{d'≠d} λ_{d'} f_{d'}(W, A).
- Since in our case f_d(W, A) takes integer values (f_d is the count of a particular n-gram in W), we can group the candidates in GEN(A) using f_d so that candidates in each group have the same value of f_d.
- In each group, we define the candidate with the highest value of the remaining score Σ_{d'≠d} λ_{d'} f_{d'}(W, A) as the active candidate of the group, because no matter what value λ_d takes, only this candidate could be selected as the top-ranked conversion within its group.
19. LM Adaptation Methods: MSR (4/7)
- Grid Line Search
- By finding the active candidates, we can reduce GEN(A) to a much smaller list of active candidates. We can find a set of intervals for λ_d, within each of which a particular active candidate will be selected as the top-ranked conversion W*.
- As a result, for each training sample, we obtain a sequence of intervals and their corresponding Er(.) values. The optimal λ_d value can then be found by traversing the sequence and taking the midpoint of the interval with the lowest Er(.) value.
- By merging the sequences of intervals of all training samples in the training set, we obtain a global sequence of intervals as well as their corresponding sample risk. We can then find the optimal λ_d value as well as the minimal sample risk by traversing the global interval sequence.
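A sketch of such a line search for one feature d. For brevity it collects the λ_d values where the top candidate of some sample can change (intersections of the per-candidate score lines), then evaluates the sample risk at the midpoint of every resulting interval; the paper's traversal of precomputed per-sample Er values is more efficient, and the data layout here is illustrative:

```python
def grid_line_search(samples, lam, d, edit_distance):
    """Line search over lam[d].  Each sample is (reference, candidates) and each
    candidate is (word_string, feats); Score(W, A) = rest + lam[d] * f_d(W, A)."""
    def parts(feats):
        rest = sum(lam.get(k, 0.0) * v for k, v in feats.items() if k != d)
        return rest, feats.get(d, 0.0)

    boundaries = set()
    for _, candidates in samples:
        lines = [parts(f) for _, f in candidates]
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                (r1, x1), (r2, x2) = lines[i], lines[j]
                if x1 != x2:                      # parallel score lines never swap ranks
                    boundaries.add((r2 - r1) / (x1 - x2))
    grid = sorted(boundaries)
    if not grid:
        probes = [lam.get(d, 0.0)]
    else:                                         # interval midpoints plus the two ends
        probes = ([grid[0] - 1.0] +
                  [(a + b) / 2.0 for a, b in zip(grid, grid[1:])] +
                  [grid[-1] + 1.0])

    def risk(value):
        trial = dict(lam)
        trial[d] = value
        total = 0
        for reference, candidates in samples:
            best, _ = max(candidates,
                          key=lambda c: sum(trial.get(k, 0.0) * v for k, v in c[1].items()))
            total += edit_distance(reference, best)
        return total

    best_value = min(probes, key=risk)
    return best_value, risk(best_value)
```

To plug this into the coordinate-descent sketch above, one would first bind the distance function, e.g. `functools.partial(grid_line_search, edit_distance=levenshtein)`.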
20. LM Adaptation Methods: MSR (5/7)
- Feature Subset Selection
- Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.
- Effectiveness of a feature: how much the sample risk can be reduced by optimizing the weight of that feature alone (using the line search above).
- CC(f_k, f_l): the cross-correlation coefficient between two features f_k and f_l.
21. LM Adaptation Methods: MSR (6/7)
22. LM Adaptation Methods: MSR (7/7)
- D: the number of all candidate features
- K: the number of features in the resulting model, with K much smaller than D
- According to the feature selection method:
  - Step 1 computes the effectiveness for each of the D candidate features.
  - Step 4 would require estimating CC between each selected feature and every remaining candidate feature.
- Therefore, we only estimate the value of CC between each of the selected features and each of the top remaining features with the highest effectiveness. This greatly reduces the number of CC estimates required.
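A greedy sketch consistent with the selection idea above. `effectiveness` and `cross_corr` are assumed callables (e.g., the sample-risk reduction from a single-feature line search, and the cross-correlation coefficient CC); the threshold and the top-N cutoff are illustrative choices, not values from the paper:

```python
def select_features(candidate_ids, effectiveness, cross_corr, k, top_n=1000, max_cc=0.9):
    """Greedily pick up to k features: walk the candidates in order of decreasing
    effectiveness, but only among the top_n most effective ones, and skip any
    feature that is too strongly cross-correlated with an already selected one."""
    ranked = sorted(candidate_ids, key=effectiveness, reverse=True)[:max(top_n, k)]
    selected = []
    for d in ranked:
        if len(selected) >= k:
            break
        if all(abs(cross_corr(d, s)) < max_cc for s in selected):
            selected.append(d)
    return selected
```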
23. Experimental Results (1/3)
- I. Data
- The data used in our experiments stems from five distinct sources of text.
- Different sizes of adaptation training data were also used, to show how the amount of adaptation data affects the performance of the various adaptation methods.
24. Experimental Results (2/3)
- II. Computing Domain Characteristics
- (i) The similarity between two domains: cross entropy
  - Not symmetric.
  - Self entropy (the diversity of the corpus) increases in the following order: N < Y < E < T < S.
25. Experimental Results (3/3)
- III. Results of LM Adaptation
- We trained our baseline trigram model on our background (Nikkei) corpus.
26. Discussion (1/6)
- I. Domain Similarity and CER
- The more similar the adaptation domain is to the
background domain, the better the CER results.
27. Discussion (2/6)
- II. Domain Similarity and the Robustness of Adaptation Methods
- The discriminative methods outperform LI in most cases.
- The performance of LI is greatly influenced by domain similarity. Such a limitation is not observed with the discriminative methods.
28. Discussion (3/6)
- III. Adaptation Data Size and CER Reduction
- X-axis: self entropy
- Y-axis: the improvement in CER reduction
- A positive correlation between the diversity of the adaptation corpus and the benefit of having more training data available.
- An intuitive explanation: the less diverse the adaptation data, the fewer distinct training examples will be included for discriminative training.
29. Discussion (4/6)
- IV. Domain Characteristics and Error Ratios
- Error ratio (ER): a metric which measures the side effects of a new model, defined as the number of errors found only in the new (adapted) model divided by the number of errors corrected by the new model.
- ER = 0 if the adapted model introduces no new errors.
- ER < 1 if the adapted model makes CER improvements.
- ER = 1 if the CER improvement is zero (i.e., the adapted model makes as many new mistakes as it corrects old mistakes).
- ER > 1 when the adapted model has worse CER performance than the baseline model.
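Under this reading (ER = errors introduced by the adapted model divided by baseline errors it corrects), a minimal sketch; representing errors as sets of (sentence id, position) identifiers is an illustrative simplification:

```python
def error_ratio(baseline_errors, adapted_errors):
    """ER = |errors only in the adapted model| / |baseline errors it corrects|.
    Both arguments are sets of error identifiers, e.g. (sentence_id, position)."""
    introduced = adapted_errors - baseline_errors
    corrected = baseline_errors - adapted_errors
    if not corrected:
        return float('inf') if introduced else 0.0
    return len(introduced) / len(corrected)
```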
30. Discussion (5/6)
- RER: relative error rate reduction, i.e., the relative CER difference between the background and adapted models.
- A discriminative method (in this case MSR) is superior to linear interpolation, not only in terms of CER reduction but also in having fewer side effects.
31. Discussion (6/6)
- Although the boosting and perceptron algorithms have the same CER for Yomiuri and TuneUp in Table III, the perceptron is better in terms of ER. This may be due to the use of an exponential loss function in the boosting algorithm, which is less robust against noisy data.
- Corpus diversity: the less stylistically diverse the corpus, the more consistent it is within the domain.
32. Conclusion and Future Work
- Conclusion
  - (1) Cross-domain similarity (cross entropy) correlates with the CER of all models.
  - (2) Diversity (self entropy) correlates with the utility of more adaptation training data for discriminative training methods.
- Future work: an online learning scenario.