Transcript and Presenter's Notes

Title: An Empirical Study on Language Model Adaptation


1
An Empirical Study on Language Model Adaptation
Jianfeng Gao, Hisami Suzuki (Microsoft Research); Wei Yuan (Shanghai Jiao Tong University)
Presented by Patty Liu
2
Outline
  • Introduction
  • The Language Model and the Task of IME
  • Related Work
  • LM Adaptation Methods
  • Experimental Results
  • Discussion
  • Conclusion and Future Work

3
Introduction
  • Language model adaptation attempts to adjust the
    parameters of a LM so that it will perform well
    on a particular domain of data.
  • In particular, we focus on the so-called
    cross-domain LM adaptation paradigm, that is, to
    adapt a LM trained on one domain (background
    domain) to a different domain (adaptation
    domain), for which only a small amount of
    training data is available.
  • The LM adaptation methods investigated here can
    be grouped into two categories:
  • (1) Maximum a posteriori (MAP): linear
    interpolation
  • (2) Discriminative training: boosting, perceptron,
    and minimum sample risk (MSR)

4
The Language Model and the Task of IME
  • IME (Input Method Editor): users first input
    phonetic strings, which are then converted into
    appropriate word strings by software.
  • Unlike speech recognition, there is no acoustic
    ambiguity in IME, since the phonetic string is
    provided directly by users. Moreover, we can
    assume a unique mapping from W to A in IME, that
    is, P(A | W) = 1.
  • From the perspective of LM adaptation, IME faces
    the same problem that speech recognition does:
    the quality of the model depends heavily on the
    similarity between the training data and the test
    data.
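  • (A sketch of the decision rule this setup implies, written in my own
    notation as an assumption; it is not reproduced from the slides.)

      W^* = \arg\max_W P(W \mid A)
          = \arg\max_W P(W)\, P(A \mid W)
          = \arg\max_{W:\ \mathrm{reading}(W) = A} P(W)   % P(A|W) = 1 exactly when A is the unique reading of W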

5
Related Work (1/3)
  • I. Measuring Domain Similarity
  • L: a language
  • p: the true underlying probability distribution of L
  • q: another distribution (e.g., an SLM) which
    attempts to model L
  • H(p, q): the cross entropy of p with
    respect to q,
      H(p, q) = - lim_{n→∞} (1/n) Σ_{x_1…x_n} p(x_1…x_n) log q(x_1…x_n)
  • x_1…x_n: a word string in L

6
Related Work (2/3)
  • However, in reality, the underlying distribution p
    is never known and the corpus size is never
    infinite. We therefore make the assumption that
    the language source is an ergodic and stationary
    process, and approximate the cross entropy by
    calculating it for a sufficiently large n instead
    of calculating it in the limit.
  • The cross entropy takes into account both the
    similarity between two distributions (given by KL
    divergence) and the entropy of the corpus in
    question.
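  • A minimal sketch of the finite-n approximation described above (my own
    illustration, not from the slides): the cross entropy is estimated as the
    average negative log probability that the model q assigns to a large
    corpus; the lm.logprob(word, history) interface is a hypothetical stand-in
    for whatever SLM toolkit is used.

      def cross_entropy(lm, corpus, order=3):
          # Approximate H(p, q) by the average negative log2-probability that
          # model q (here: lm) assigns to a large sample drawn from the domain.
          total_logprob = 0.0
          n_words = 0
          for sentence in corpus:                      # corpus: list of token lists
              history = ["<s>"] * (order - 1)
              for word in sentence + ["</s>"]:
                  total_logprob += lm.logprob(word, tuple(history))  # hypothetical API
                  history = (history + [word])[-(order - 1):]
                  n_words += 1
          return -total_logprob / n_words              # bits per word if logprob is log2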

7
Related Work (3/3)
  • II. LM Adaptation Methods
  • MAP: adjust the parameters of the background
    model
  • → maximize the likelihood of the adaptation
    data
  • Discriminative training methods using
    adaptation data → directly minimize the errors in
    it made by the background model
  • These techniques have been applied successfully
    to language modeling in non-adaptation as well as
    adaptation scenarios for speech recognition.

8
LM Adaptation Methods - LI
  • I. The Linear Interpolation Method
  • P_B(w | h): the probability of the background model
  • P_A(w | h): the probability of the adaptation model
  • h: the history, which corresponds to the two
    preceding words
  • The interpolated model:
      P(w | h) = λ P_B(w | h) + (1 - λ) P_A(w | h)
  • For simplicity, we chose a single λ for
    all histories and tuned it on held-out data
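  • A minimal sketch of the interpolation and a simple grid search for λ (my
    own illustration; p_bg and p_ad are hypothetical callables returning the
    background and adaptation model probabilities, and the grid is not the
    tuning procedure used in the paper).

      import math

      def interpolated_prob(w, h, p_bg, p_ad, lam):
          # P(w | h) = lam * P_B(w | h) + (1 - lam) * P_A(w | h)
          return lam * p_bg(w, h) + (1 - lam) * p_ad(w, h)

      def tune_lambda(heldout, p_bg, p_ad,
                      grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
          # Pick the single lambda that minimizes held-out negative log-likelihood.
          def neg_loglik(lam):
              return -sum(math.log(interpolated_prob(w, h, p_bg, p_ad, lam))
                          for (w, h) in heldout)
          return min(grid, key=neg_loglik)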

9
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (1/3)
  • II. Discriminative Training Methods
  • Problem Definition

10
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (2/3)
  • This formulation views IME as a ranking problem,
    in which the model assigns a ranking score rather
    than probabilities. We therefore do not evaluate
    the LM obtained using discriminative training via
    perplexity.

11
LM Adaptation Methods - Problem Definition of
Discriminative Training Methods (3/3)
  • W_R: the reference transcript
  • Er(W_R, W): an error function, which is an
    edit distance function in this case
  • Sample risk SR: the sum of the error counts Er(·)
    over the training samples
  • Discriminative training methods strive to
    minimize the sample risk by optimizing the model
    parameters λ. However, SR cannot be optimized
    easily, since Er(·) is a piecewise constant (or
    step) function of λ and its gradient is
    undefined.
  • Therefore, discriminative methods apply different
    approaches that optimize it approximately. The
    boosting and perceptron algorithms approximate SR
    by loss functions that are suitable for
    optimization, while MSR uses a simple heuristic
    training procedure to minimize SR directly.
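  • A minimal sketch of the sample risk computation (my own illustration; the
    feature representation and the edit_distance helper are assumptions).

      def score(feats, lam):
          # Linear ranking score: dot product of feature values and parameters.
          return sum(lam[j] * v for j, v in feats.items())

      def sample_risk(training_set, lam, edit_distance):
          # SR(lam): sum of error counts of the top-ranked conversion per sample.
          # SR is a step function of lam: small parameter changes only matter
          # when they flip which candidate is ranked first.
          total_errors = 0
          for reference, candidates in training_set:   # candidates: [(word_string, feats), ...]
              best_string, _ = max(candidates, key=lambda c: score(c[1], lam))
              total_errors += edit_distance(reference, best_string)
          return total_errors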

12
LM Adaptation Methods - The Boosting Algorithm
(1/2)
  • (i) The Boosting Algorithm
  • Margin: M(W_R, W) = Score(W_R, λ) - Score(W, λ)
  • A ranking error: an incorrect candidate
    conversion W gets a higher score than the
    correct conversion W_R
  • RLoss: the number of ranking errors,
      RLoss(λ) = Σ_i Σ_W δ(M(W_R, W)), where δ(M) = 1 if M ≤ 0, and 0
    otherwise
  • Optimizing RLoss directly is NP-complete
  • → boosting instead optimizes its upper bound,
      ExpLoss(λ) = Σ_i Σ_W exp(-M(W_R, W))
  • ExpLoss is convex
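  • A minimal sketch of ExpLoss (my own illustration; the data layout is an
    assumption). Because exp(-M) ≥ 1 whenever the margin M ≤ 0, summing it over
    all pairs upper-bounds the ranking-error count RLoss.

      import math

      def dot(feats, lam):
          return sum(lam[j] * v for j, v in feats.items())

      def exp_loss(training_set, lam):
          # Sum over samples and incorrect candidates of exp(-margin).
          total = 0.0
          for reference_feats, incorrect_candidates in training_set:
              ref_score = dot(reference_feats, lam)
              for cand_feats in incorrect_candidates:
                  margin = ref_score - dot(cand_feats, lam)
                  total += math.exp(-margin)
          return total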

13
LM Adaptation Methods - The Boosting Algorithm
(2/2)
  • The per-feature update involves the following quantities:
  • a value increasing exponentially with the
    sum of the margins of pairs over the
    set where the feature is seen in the correct
    conversion W_R but not in the candidate W
  • the corresponding value related to the sum of margins
    over the set where the feature is seen in W but not
    in W_R
  • a smoothing factor (whose value is
    optimized on held-out data)
  • a normalization constant

14
LM Adaptation Methods - The Perceptron Algorithm
(1/2)
  • (ii) The Perceptron Algorithm
  • delta rule
  • stochastic approximation

15
LM Adaptation Methods - The Perceptron Algorithm
(2/2)
  • averaged perceptron algorithm
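  • A minimal sketch of an averaged perceptron for this ranking setting (my
    own illustration under assumed data structures; it is not the exact update
    given by the slide's equations).

      def dot(feats, lam):
          return sum(lam[j] * v for j, v in feats.items())

      def averaged_perceptron(training_set, n_features, epochs=5, eta=1.0):
          # Delta-rule style update: move weights toward the reference conversion's
          # features and away from the current top-ranked candidate's, then return
          # the parameters averaged over all update steps.
          lam = [0.0] * n_features
          lam_sum = [0.0] * n_features
          steps = 0
          for _ in range(epochs):
              for ref_feats, candidates in training_set:   # feats: {feature_index: value}
                  best_feats = max(candidates, key=lambda f: dot(f, lam))
                  if best_feats != ref_feats:
                      for j, v in ref_feats.items():
                          lam[j] += eta * v
                      for j, v in best_feats.items():
                          lam[j] -= eta * v
                  for j in range(n_features):
                      lam_sum[j] += lam[j]
                  steps += 1
          return [s / steps for s in lam_sum]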

16
LM Adaptation Methods - MSR (1/7)
  • (iii) The Minimum Sample Risk Method
  • Conceptually, MSR operates like any
    multidimensional function optimization approach
    (see the sketch after this list):
  • - The first direction (i.e., feature) is
    selected and SR is minimized along that direction
    using a line search, that is, adjusting the
    parameter of the selected feature while keeping
    all other parameters fixed.
  • - Then, from there, SR is minimized along the
    second direction to its minimum, and so on.
  • - The procedure cycles through the whole set of
    directions as many times as necessary, until SR
    stops decreasing.
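  • A minimal sketch of this coordinate-wise procedure (my own illustration;
    sample_risk and line_search_1d are assumed callables, the latter standing
    in for the grid line search described on the later slides).

      def msr_train(feature_ids, lam, sample_risk, line_search_1d, max_cycles=10):
          # Minimize SR one direction (feature) at a time, cycling through all
          # directions until the sample risk stops decreasing.
          best_sr = sample_risk(lam)
          for _ in range(max_cycles):
              for k in feature_ids:                   # all other parameters stay fixed
                  lam[k] = line_search_1d(lam, k)
              new_sr = sample_risk(lam)
              if new_sr >= best_sr:
                  break                               # SR stopped decreasing
              best_sr = new_sr
          return lam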

17
LM Adaptation Methods - MSR (2/7)
  • This simple method can work properly under two
    assumptions.
  • - First, there exists an implementation of line
    search that efficiently optimizes the function
    along one direction.
  • - Second, the number of candidate features is
    not too large, and they are not highly
    correlated.
  • However, neither of the assumptions holds in our
    case.
  • - First of all, Er(·) in the sample risk
    is a step function of λ, and thus cannot be
    optimized directly by regular gradient-based
    procedures - a grid search has to be used
    instead. However, there are problems with simple
    grid search: using a large grid could miss the
    optimal solution, whereas using a fine-grained
    grid would lead to a very slow algorithm.
  • - Second, in the case of LM, there are millions
    of candidate features, some of which are highly
    correlated with each other.

18
LM Adaptation Methods - MSR (3/7)
  • Active candidate of a group
  • W: a candidate word string
  • Since in our case the feature value f(W) takes
    integer values (f(W) is the count of a particular
    n-gram in W), we can group the candidates using
    f(W) so that candidates in each group have the
    same value of f(W).
  • In each group, we define the candidate with the
    highest score from all the other features as the
    active candidate of the group, because no matter
    what value the parameter of f takes, only this
    candidate could be selected as the top-ranked
    conversion according to the ranking score.

19
LM Adaptation Methods - MSR (4/7)
  • Grid Line Search
  • By finding the active candidates, we can reduce
    the full candidate list to a much smaller list of
    active candidates. We can find a set of intervals
    for the parameter being adjusted, within each of
    which a particular active candidate will be
    selected as the top-ranked conversion.
  • As a result, for each training sample, we obtain
    a sequence of intervals and their corresponding
    error counts. The optimal parameter value can then
    be found by traversing the sequence and taking
    the midpoint of the interval with the lowest
    error count.
  • By merging the sequences of intervals of all
    training samples in the training set, we obtain a
    global sequence of intervals as well as their
    corresponding sample risk. We can then find the
    optimal parameter value as well as the minimal
    sample risk by traversing the global interval
    sequence (see the sketch below).
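  • A minimal sketch of the interval-based search (my own, simplified
    illustration: each training sample is assumed to be pre-processed into
    (lo, hi, error_count) intervals over the parameter being adjusted, and the
    optimum is read off the merged boundaries).

      def best_parameter_value(per_sample_intervals):
          # per_sample_intervals: one list of (lo, hi, err) triples per sample.
          # Collect all interval boundaries, then evaluate the total sample risk
          # at the midpoint of every merged interval and keep the best midpoint.
          boundaries = sorted({b for intervals in per_sample_intervals
                               for (lo, hi, _) in intervals for b in (lo, hi)})

          def risk_at(x):
              total = 0
              for intervals in per_sample_intervals:
                  for lo, hi, err in intervals:
                      if lo <= x < hi:
                          total += err
                          break
              return total

          midpoints = [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]
          return min(midpoints, key=risk_at) if midpoints else 0.0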

20
LM Adaptation Methods - MSR (5/7)
  • Feature Subset Selection
  • Reducing the number of features is essential for
    two reasons: to reduce computational complexity
    and to ensure the generalization property of the
    linear model.
  • The effectiveness of each candidate feature
  • The cross-correlation coefficient between two
    features (see the sketch below)
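  • A minimal sketch of the cross-correlation coefficient between two features
    (my own illustration using the standard Pearson form over training samples;
    the exact definition on the slide is not reproduced).

      import math

      def cross_correlation(f_i_values, f_j_values):
          # Pearson correlation of two features' values across training samples.
          n = len(f_i_values)
          mean_i = sum(f_i_values) / n
          mean_j = sum(f_j_values) / n
          cov = sum((a - mean_i) * (b - mean_j)
                    for a, b in zip(f_i_values, f_j_values))
          var_i = sum((a - mean_i) ** 2 for a in f_i_values)
          var_j = sum((b - mean_j) ** 2 for b in f_j_values)
          denom = math.sqrt(var_i * var_j)
          return cov / denom if denom else 0.0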

21
LM Adaptation Methods - MSR (6/7)
22
LM Adaptation Methods - MSR (7/7)
  • The number of all candidate features
  • The number of features in the resulting
    model
  • In the feature selection method:
  • - Step 1 is performed for each of the candidate
    features
  • - Step 4 requires estimates of the
    cross-correlation coefficient
  • Therefore, we only estimate the cross-correlation
    between each of the selected features and each of
    the top remaining features with the highest
    effectiveness. This greatly reduces the number of
    cross-correlation estimates required.
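  • A minimal, simplified sketch of this reduction (my own illustration;
    effectiveness and correlation are assumed callables, and top_n stands in
    for the cutoff on the slide): correlations are only estimated between
    selected features and the most effective remaining candidates.

      def select_features(candidates, effectiveness, correlation,
                          n_select, top_n=1000, max_corr=0.9):
          # Greedy selection in order of effectiveness; a candidate is skipped if
          # it is highly correlated with an already-selected feature. Only the
          # top_n most effective candidates are ever correlation-checked, which
          # bounds the number of correlation estimates.
          ranked = sorted(candidates, key=effectiveness, reverse=True)[:top_n]
          selected = []
          for f in ranked:
              if len(selected) >= n_select:
                  break
              if any(correlation(f, g) > max_corr for g in selected):
                  continue          # redundant with an already-selected feature
              selected.append(f)
          return selected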

23
Experimental Results (1/3)
  • I. Data
  • The data used in our experiments stems from five
    distinct sources of text.
  • Different sizes of adaptation training data were
    also used, to show how the amount of adaptation
    data affects the performance of the various
    adaptation methods.

24
Experimental Results (2/3)
  • II. Computing Domain Characteristics
  • (i) The similarity between two domains: cross
    entropy
  • - not symmetric
  • - self entropy (the diversity of the corpus)
    increases in the following order: N < Y < E < T < S

25
Experimental Results (3/3)
  • III. Results of LM Adaptation
  • We trained our baseline trigram model on our
    background (Nikkei) corpus.

26
Discussion (1/6)
  • I. Domain Similarity and CER
  • The more similar the adaptation domain is to the
    background domain, the better the CER results.

27
Discussion (2/6)
  • II. Domain Similarity and the Robustness of
    Adaptation Methods
  • The discriminative methods outperform LI in most
    cases.
  • The performance of LI is greatly influenced by
    domain similarity. Such a limitation is not
    observed with the discriminative methods.

28
Discussion (3/6)
  • III. Adaptation Data Size and CER Reduction
  • X-axis: self entropy
  • Y-axis: the improvement in CER reduction
  • a positive correlation between the diversity of
    the adaptation corpus and the benefit of having
    more training data available
  • An intuitive explanation: the less diverse the
    adaptation data, the fewer distinct training
    examples will be included for discriminative
    training.

29
Discussion (4/6)
  • IV. Domain Characteristics and Error Ratios
  • The error ratio (ER) metric measures the side
    effects of a new model:
      ER = E_new / E_corrected
  • E_new: the number of errors found only in the
    new (adaptation) model
  • E_corrected: the number of errors corrected by
    the new model
  • ER = 0 if the adapted model introduces no new
    errors
  • ER < 1 if the adapted model makes CER
    improvements
  • ER = 1 if the CER improvement is zero (i.e.,
    the adapted model makes as many new mistakes as
    it corrects old mistakes)
  • ER > 1 when the adapted model has worse CER
    performance than the baseline model
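  • A minimal sketch of how ER could be computed from error bookkeeping (my
    own illustration; the set-based representation and the symbol names above
    are assumptions).

      def error_ratio(baseline_errors, adapted_errors):
          # ER = (errors found only in the adapted model) /
          #      (baseline errors the adapted model corrected).
          # baseline_errors / adapted_errors: sets of error positions or IDs.
          new_errors = adapted_errors - baseline_errors
          corrected = baseline_errors - adapted_errors
          return len(new_errors) / len(corrected) if corrected else 0.0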

30
Discussion (5/6)
  • RER: relative error rate reduction, i.e., the CER
    difference between the background and adapted
    models
  • A discriminative method (in this case MSR) is
    superior to linear interpolation, not only in
    terms of CER reduction but also in having fewer
    side effects.

31
Discussion (6/6)
  • Although the boosting and perceptron algorithms
    have the same CER for Yomiuri and TuneUp from
    Table III, the perceptron is better in terms of
    ER. This may be due to the use of an exponential
    loss function in the boosting algorithm, which is
    less robust against noisy data.
  • Corpus diversity: the less stylistically diverse
    the corpus, the more consistent it is within the
    domain.

32
Conclusion and Future Work
  • Conclusion
  • (1) cross-domain similarity (cross entropy)
    correlates with the CER of all models
  • (2) diversity (self entropy) correlates with the
    utility of more adaptation training data for
    discriminative training methods
  • Future Work: an online learning scenario