Transcript and Presenter's Notes

Title: An Empirical Study on Language Model Adaptation


1
An Empirical Study on Language Model Adaptation
Jianfeng Gao, Hisami Suzuki (Microsoft Research); Wei Yuan (Shanghai Jiao Tong University)
Presented by Patty Liu
2
Outline
  • Introduction
  • The Language Model and the Task of IME
  • Related Work
  • LM Adaptation Methods
  • Experimental Results
  • Discussion
  • Conclusion and Future Work

3
Introduction
  • Language model adaptation attempts to adjust the
    parameters of a LM so that it will perform well
    on a particular domain of data.
  • In particular, we focus on the so-called
    cross-domain LM adaptation paradigm, that is, to
    adapt a LM trained on one domain (background
    domain) to a different domain (adaptation
    domain), for which only a small amount of
    training data is available.
  • The LM adaptation methods investigated here can
    be grouped into two categories:
  • (1) Maximum a posteriori (MAP): linear
    interpolation
  • (2) Discriminative training: boosting, perceptron,
    and minimum sample risk (MSR)

4
The Language Model and the Task of IME
  • IME (Input Method Editor): users first input
    phonetic strings, which are then converted into
    appropriate word strings by software.
  • Unlike speech recognition, there is no acoustic
    ambiguity in IME, since the phonetic string is
    provided directly by users. Moreover, we can
    assume a unique mapping from W to A in IME, that
    is, P(A | W) = 1.
  • From the perspective of LM adaptation, IME faces
    the same problem that speech recognition does:
    the quality of the model depends heavily on the
    similarity between the training data and the test
    data.
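  • (A sketch of the decision rule this setup implies, written in my own
    notation as an assumption; it is not reproduced from the slides.)

      W^* = \arg\max_W P(W \mid A)
          = \arg\max_W P(W)\, P(A \mid W)
          = \arg\max_{W:\ \mathrm{reading}(W) = A} P(W)   % P(A|W) = 1 exactly when A is the unique reading of W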

5
Related Work (1/3)
  • I. Measuring Domain Similarity
  • L: a language
  • p: the true underlying probability distribution of L
  • q: another distribution (e.g., an SLM) which
    attempts to model L
  • H(p, q): the cross entropy of p with
    respect to q,
      H(p, q) = - lim_{n→∞} (1/n) Σ_{x_1…x_n} p(x_1…x_n) log q(x_1…x_n)
  • x_1…x_n: a word string in L

6
Related Work (2/3)
  • However, in reality, the underlying distribution p
    is never known and the corpus size is never
    infinite. We therefore make the assumption that
    the language source is an ergodic and stationary
    process, and approximate the cross entropy by
    calculating it for a sufficiently large n instead
    of calculating it in the limit.
  • The cross entropy takes into account both the
    similarity between two distributions (given by KL
    divergence) and the entropy of the corpus in
    question.
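  • A minimal sketch of the finite-n approximation described above (my own
    illustration, not from the slides): the cross entropy is estimated as the
    average negative log probability that the model q assigns to a large
    corpus; the lm.logprob(word, history) interface is a hypothetical stand-in
    for whatever SLM toolkit is used.

      def cross_entropy(lm, corpus, order=3):
          # Approximate H(p, q) by the average negative log2-probability that
          # model q (here: lm) assigns to a large sample drawn from the domain.
          total_logprob = 0.0
          n_words = 0
          for sentence in corpus:                      # corpus: list of token lists
              history = ["<s>"] * (order - 1)
              for word in sentence + ["</s>"]:
                  total_logprob += lm.logprob(word, tuple(history))  # hypothetical API
                  history = (history + [word])[-(order - 1):]
                  n_words += 1
          return -total_logprob / n_words              # bits per word if logprob is log2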

7
Related Work (3/3)
  • II. LM Adaptation Methods
  • MAP: adjust the parameters of the background
    model
  • → maximize the likelihood of the adaptation
    data
  • Discriminative training methods using
    adaptation data → directly minimize the errors in
    it made by the background model
  • These techniques have been applied successfully
    to language modeling in non-adaptation as well as
    adaptation scenarios for speech recognition.

8
LM Adaptation Methods - LI
  • I. The Linear Interpolation Method
  • P_B(w | h): the probability of the background model
  • P_A(w | h): the probability of the adaptation model
  • h: the history, which corresponds to the two
    preceding words
  • The interpolated model:
      P(w | h) = λ P_B(w | h) + (1 - λ) P_A(w | h)
  • For simplicity, we chose a single λ for
    all histories and tuned it on held-out data
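  • A minimal sketch of the interpolation and a simple grid search for λ (my
    own illustration; p_bg and p_ad are hypothetical callables returning the
    background and adaptation model probabilities, and the grid is not the
    tuning procedure used in the paper).

      import math

      def interpolated_prob(w, h, p_bg, p_ad, lam):
          # P(w | h) = lam * P_B(w | h) + (1 - lam) * P_A(w | h)
          return lam * p_bg(w, h) + (1 - lam) * p_ad(w, h)

      def tune_lambda(heldout, p_bg, p_ad,
                      grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
          # Pick the single lambda that minimizes held-out negative log-likelihood.
          def neg_loglik(lam):
              return -sum(math.log(interpolated_prob(w, h, p_bg, p_ad, lam))
                          for (w, h) in heldout)
          return min(grid, key=neg_loglik)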

9
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (1/3)
  • II. Discriminative Training Methods
  • Problem Definition

10
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (2/3)
  • This formulation views IME as a ranking problem,
    in which the model assigns a ranking score rather
    than probabilities. We therefore do not evaluate
    the LM obtained using discriminative training via
    perplexity.

11
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (3/3)
  • W_R: the reference transcript
  • Er(W_R, W): an error function, which is an
    edit distance function in this case
  • Sample risk SR: the sum of the error counts Er(·)
    over the training samples
  • Discriminative training methods strive to
    minimize the sample risk by optimizing the model
    parameters λ. However, SR cannot be optimized
    easily, since Er(·) is a piecewise constant (or
    step) function of λ and its gradient is
    undefined.
  • Therefore, discriminative methods apply different
    approaches that optimize it approximately. The
    boosting and perceptron algorithms approximate SR
    by loss functions that are suitable for
    optimization, while MSR uses a simple heuristic
    training procedure to minimize SR directly.
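  • A minimal sketch of the sample risk computation (my own illustration; the
    feature representation and the edit_distance helper are assumptions).

      def score(feats, lam):
          # Linear ranking score: dot product of feature values and parameters.
          return sum(lam[j] * v for j, v in feats.items())

      def sample_risk(training_set, lam, edit_distance):
          # SR(lam): sum of error counts of the top-ranked conversion per sample.
          # SR is a step function of lam: small parameter changes only matter
          # when they flip which candidate is ranked first.
          total_errors = 0
          for reference, candidates in training_set:   # candidates: [(word_string, feats), ...]
              best_string, _ = max(candidates, key=lambda c: score(c[1], lam))
              total_errors += edit_distance(reference, best_string)
          return total_errors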

12
LM Adaptation Methods - The Boosting Algorithm
(1/2)
  • (i) The Boosting Algorithm
  • Margin: M(W_R, W) = Score(W_R, λ) - Score(W, λ)
  • A ranking error: an incorrect candidate
    conversion W gets a higher score than the
    correct conversion W_R
  • RLoss: the number of ranking errors,
      RLoss(λ) = Σ_i Σ_W δ(M(W_R, W)), where δ(M) = 1 if M ≤ 0, and 0
    otherwise
  • Optimizing RLoss directly is NP-complete
  • → boosting instead optimizes its upper bound,
      ExpLoss(λ) = Σ_i Σ_W exp(-M(W_R, W))
  • ExpLoss is convex
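  • A minimal sketch of ExpLoss (my own illustration; the data layout is an
    assumption). Because exp(-M) ≥ 1 whenever the margin M ≤ 0, summing it over
    all pairs upper-bounds the ranking-error count RLoss.

      import math

      def dot(feats, lam):
          return sum(lam[j] * v for j, v in feats.items())

      def exp_loss(training_set, lam):
          # Sum over samples and incorrect candidates of exp(-margin).
          total = 0.0
          for reference_feats, incorrect_candidates in training_set:
              ref_score = dot(reference_feats, lam)
              for cand_feats in incorrect_candidates:
                  margin = ref_score - dot(cand_feats, lam)
                  total += math.exp(-margin)
          return total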

13
LM Adaptation Methods - The Boosting Algorithm
(2/2)
  • The per-feature update involves the following quantities:
  • a value increasing exponentially with the
    sum of the margins of pairs over the
    set where the feature is seen in the correct
    conversion W_R but not in the candidate W
  • the corresponding value related to the sum of margins
    over the set where the feature is seen in W but not
    in W_R
  • a smoothing factor (whose value is
    optimized on held-out data)
  • a normalization constant

14
LM Adaptation Methods - The Perceptron Algorithm
(1/2)
  • (ii) The Perceptron Algorithm
  • delta rule
  • stochastic approximation

15
LM Adaptation Methods - The Perceptron Algorithm
(2/2)
  • averaged perceptron algorithm
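  • A minimal sketch of an averaged perceptron for this ranking setting (my
    own illustration under assumed data structures; it is not the exact update
    given by the slide's equations).

      def dot(feats, lam):
          return sum(lam[j] * v for j, v in feats.items())

      def averaged_perceptron(training_set, n_features, epochs=5, eta=1.0):
          # Delta-rule style update: move weights toward the reference conversion's
          # features and away from the current top-ranked candidate's, then return
          # the parameters averaged over all update steps.
          lam = [0.0] * n_features
          lam_sum = [0.0] * n_features
          steps = 0
          for _ in range(epochs):
              for ref_feats, candidates in training_set:   # feats: {feature_index: value}
                  best_feats = max(candidates, key=lambda f: dot(f, lam))
                  if best_feats != ref_feats:
                      for j, v in ref_feats.items():
                          lam[j] += eta * v
                      for j, v in best_feats.items():
                          lam[j] -= eta * v
                  for j in range(n_features):
                      lam_sum[j] += lam[j]
                  steps += 1
          return [s / steps for s in lam_sum]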

16
LM Adaptation Methods - MSR (1/7)
  • (iii) The Minimum Sample Risk Method
  • Conceptually, MSR operates like any
    multidimensional function optimization approach
    (see the sketch after this list):
  • - The first direction (i.e., feature) is
    selected and SR is minimized along that direction
    using a line search, that is, adjusting the
    parameter of the selected feature while keeping
    all other parameters fixed.
  • - Then, from there, SR is minimized along the
    second direction to its minimum, and so on.
  • - The procedure cycles through the whole set of
    directions as many times as necessary, until SR
    stops decreasing.
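  • A minimal sketch of this coordinate-wise procedure (my own illustration;
    sample_risk and line_search_1d are assumed callables, the latter standing
    in for the grid line search described on the later slides).

      def msr_train(feature_ids, lam, sample_risk, line_search_1d, max_cycles=10):
          # Minimize SR one direction (feature) at a time, cycling through all
          # directions until the sample risk stops decreasing.
          best_sr = sample_risk(lam)
          for _ in range(max_cycles):
              for k in feature_ids:                   # all other parameters stay fixed
                  lam[k] = line_search_1d(lam, k)
              new_sr = sample_risk(lam)
              if new_sr >= best_sr:
                  break                               # SR stopped decreasing
              best_sr = new_sr
          return lam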

17
LM Adaptation Methods - MSR (2/7)
  • This simple method can work properly under two
    assumptions.
  • - First, there exists an implementation of line
    search that efficiently optimizes the function
    along one direction.
  • - Second, the number of candidate features is
    not too large, and they are not highly
    correlated.
  • However, neither of the assumptions holds in our
    case.
  • - First of all, Er(·) in the sample risk
    is a step function of λ, and thus cannot be
    optimized directly by regular gradient-based
    procedures - a grid search has to be used
    instead. However, there are problems with simple
    grid search: using a large grid could miss the
    optimal solution, whereas using a fine-grained
    grid would lead to a very slow algorithm.
  • - Second, in the case of LM, there are millions
    of candidate features, some of which are highly
    correlated with each other.

18
LM Adaptation Methods - MSR (3/7)
  • Active candidate of a group
  • W: a candidate word string
  • Since in our case the feature value f(W) takes
    integer values (f(W) is the count of a particular
    n-gram in W), we can group the candidates using
    f(W) so that candidates in each group have the
    same value of f(W).
  • In each group, we define the candidate with the
    highest score from all the other features as the
    active candidate of the group, because no matter
    what value the parameter of f takes, only this
    candidate could be selected as the top-ranked
    conversion according to the ranking score.

19
LM Adaptation Methods - MSR (4/7)
  • Grid Line Search
  • By finding the active candidates, we can reduce
    the full candidate list to a much smaller list of
    active candidates. We can find a set of intervals
    for the parameter being adjusted, within each of
    which a particular active candidate will be
    selected as the top-ranked conversion.
  • As a result, for each training sample, we obtain
    a sequence of intervals and their corresponding
    error counts. The optimal parameter value can then
    be found by traversing the sequence and taking
    the midpoint of the interval with the lowest
    error count.
  • By merging the sequences of intervals of all
    training samples in the training set, we obtain a
    global sequence of intervals as well as their
    corresponding sample risk. We can then find the
    optimal parameter value as well as the minimal
    sample risk by traversing the global interval
    sequence (see the sketch below).
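  • A minimal sketch of the interval-based search (my own, simplified
    illustration: each training sample is assumed to be pre-processed into
    (lo, hi, error_count) intervals over the parameter being adjusted, and the
    optimum is read off the merged boundaries).

      def best_parameter_value(per_sample_intervals):
          # per_sample_intervals: one list of (lo, hi, err) triples per sample.
          # Collect all interval boundaries, then evaluate the total sample risk
          # at the midpoint of every merged interval and keep the best midpoint.
          boundaries = sorted({b for intervals in per_sample_intervals
                               for (lo, hi, _) in intervals for b in (lo, hi)})

          def risk_at(x):
              total = 0
              for intervals in per_sample_intervals:
                  for lo, hi, err in intervals:
                      if lo <= x < hi:
                          total += err
                          break
              return total

          midpoints = [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]
          return min(midpoints, key=risk_at) if midpoints else 0.0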

20
LM Adaptation Methods - MSR (5/7)
  • Feature Subset Selection
  • Reducing the number of features is essential for
    two reasons: to reduce computational complexity
    and to ensure the generalization property of the
    linear model.
  • The effectiveness of each candidate feature
  • The cross-correlation coefficient between two
    features (see the sketch below)
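  • A minimal sketch of the cross-correlation coefficient between two features
    (my own illustration using the standard Pearson form over training samples;
    the exact definition on the slide is not reproduced).

      import math

      def cross_correlation(f_i_values, f_j_values):
          # Pearson correlation of two features' values across training samples.
          n = len(f_i_values)
          mean_i = sum(f_i_values) / n
          mean_j = sum(f_j_values) / n
          cov = sum((a - mean_i) * (b - mean_j)
                    for a, b in zip(f_i_values, f_j_values))
          var_i = sum((a - mean_i) ** 2 for a in f_i_values)
          var_j = sum((b - mean_j) ** 2 for b in f_j_values)
          denom = math.sqrt(var_i * var_j)
          return cov / denom if denom else 0.0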

21
LM Adaptation Methods - MSR (6/7)
22
LM Adaptation Methods - MSR (7/7)
  • The number of all candidate features
  • The number of features in the resulting
    model
  • In the feature selection method:
  • - Step 1 is performed for each of the candidate
    features
  • - Step 4 requires estimates of the
    cross-correlation coefficient
  • Therefore, we only estimate the cross-correlation
    between each of the selected features and each of
    the top remaining features with the highest
    effectiveness. This greatly reduces the number of
    cross-correlation estimates required.
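  • A minimal, simplified sketch of this reduction (my own illustration;
    effectiveness and correlation are assumed callables, and top_n stands in
    for the cutoff on the slide): correlations are only estimated between
    selected features and the most effective remaining candidates.

      def select_features(candidates, effectiveness, correlation,
                          n_select, top_n=1000, max_corr=0.9):
          # Greedy selection in order of effectiveness; a candidate is skipped if
          # it is highly correlated with an already-selected feature. Only the
          # top_n most effective candidates are ever correlation-checked, which
          # bounds the number of correlation estimates.
          ranked = sorted(candidates, key=effectiveness, reverse=True)[:top_n]
          selected = []
          for f in ranked:
              if len(selected) >= n_select:
                  break
              if any(correlation(f, g) > max_corr for g in selected):
                  continue          # redundant with an already-selected feature
              selected.append(f)
          return selected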

23
Experimental Results (1/3)
  • I. Data
  • The data used in our experiments stems from five
    distinct sources of text.
  • Different sizes of adaptation training data were
    also used, to show how the amount of adaptation
    data affects the performance of the various
    adaptation methods.

24
Experimental Results (2/3)
  • II. Computing Domain Characteristics
  • (i) The similarity between two domains: cross
    entropy
  • - not symmetric
  • - self entropy (the diversity of the corpus)
    increases in the following order: N < Y < E < T < S

25
Experimental Results (3/3)
  • III. Results of LM Adaptation
  • We trained our baseline trigram model on our
    background (Nikkei) corpus.

26
Discussion (1/6)
  • I. Domain Similarity and CER
  • The more similar the adaptation domain is to the
    background domain, the better the CER results.

27
Discussion (2/6)
  • II. Domain Similarity and the Robustness of
    Adaptation Methods
  • The discriminative methods outperform LI in most
    cases.
  • The performance of LI is greatly influenced by
    domain similarity. Such a limitation is not
    observed with the discriminative methods.

28
Discussion (3/6)
  • III. Adaptation Data Size and CER Reduction
  • X-axis: self entropy
  • Y-axis: the improvement in CER reduction
  • a positive correlation between the diversity of
    the adaptation corpus and the benefit of having
    more training data available
  • An intuitive explanation: the less diverse the
    adaptation data, the fewer distinct training
    examples will be included for discriminative
    training.

29
Discussion (4/6)
  • IV. Domain Characteristics and Error Ratios
  • The error ratio (ER) metric measures the side
    effects of a new model:
      ER = E_new / E_corrected
  • E_new: the number of errors found only in the
    new (adaptation) model
  • E_corrected: the number of errors corrected by
    the new model
  • ER = 0 if the adapted model introduces no new
    errors
  • ER < 1 if the adapted model makes CER
    improvements
  • ER = 1 if the CER improvement is zero (i.e.,
    the adapted model makes as many new mistakes as
    it corrects old mistakes)
  • ER > 1 when the adapted model has worse CER
    performance than the baseline model
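  • A minimal sketch of how ER could be computed from error bookkeeping (my
    own illustration; the set-based representation and the symbol names above
    are assumptions).

      def error_ratio(baseline_errors, adapted_errors):
          # ER = (errors found only in the adapted model) /
          #      (baseline errors the adapted model corrected).
          # baseline_errors / adapted_errors: sets of error positions or IDs.
          new_errors = adapted_errors - baseline_errors
          corrected = baseline_errors - adapted_errors
          return len(new_errors) / len(corrected) if corrected else 0.0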

30
Discussion (5/6)
  • RER: relative error rate reduction, i.e., the CER
    difference between the background and adapted
    models
  • A discriminative method (in this case MSR) is
    superior to linear interpolation, not only in
    terms of CER reduction but also in having fewer
    side effects.

31
Discussion (6/6)
  • Although the boosting and perceptron algorithms
    have the same CER for Yomiuri and TuneUp from
    Table III, the perceptron is better in terms of
    ER. This may be due to the use of an exponential
    loss function in the boosting algorithm, which is
    less robust against noisy data.
  • Corpus diversity: the less stylistically diverse
    the corpus, the more consistent it is within the
    domain.

32
Conclusion and Future Work
  • Conclusion
  • (1) cross-domain similarity (cross entropy)
    correlates with the CER of all models
  • (2) diversity (self entropy) correlates with the
    utility of more adaptation training data for
    discriminative training methods
  • Future Work: an online learning scenario