A Bit of Progress in Language Modeling, Extended Version


1
A Bit of Progress in Language Modeling, Extended Version
  • Presented by Louis-Tsai
  • Speech Lab, CSIE, NTNU
  • louis_at_csie.ntnu.edu.tw

2
Introduction: Overview
  • LM is the art of determining the probability of a
    sequence of words
  • Applications: speech recognition, optical character
    recognition, handwriting recognition, machine
    translation, spelling correction
  • Improvements covered here:
  • Higher-order n-grams
  • Skipping models
  • Clustering
  • Caching
  • Sentence-mixture models

3
Introduction: Technique Introductions
  • The goal of a LM is to determine the probability
    of a word sequence w1…wn, P(w1…wn)
  • Trigram assumption: P(w1…wn) ≈ ∏i P(wi | wi-2 wi-1)

4
Introduction: Technique Introductions
  • C(wi-2 wi-1 wi) represents the number of
    occurrences of wi-2 wi-1 wi in the training corpus,
    and similarly for C(wi-2 wi-1)
  • There are many three-word sequences that never
    occur; consider the sequence "party on Tuesday":
    what is P(Tuesday | party on)?
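To make the count-based trigram estimate concrete, here is a minimal sketch of the maximum-likelihood estimate C(wi-2 wi-1 wi) / C(wi-2 wi-1); the toy corpus is an invented example, not data from the experiments.

```python
from collections import defaultdict

# Invented toy corpus; real experiments use millions of words
corpus = "party on wednesday party on friday party on wednesday".split()

tri_counts = defaultdict(int)  # C(w_{i-2} w_{i-1} w_i)
bi_counts = defaultdict(int)   # C(w_{i-2} w_{i-1}), counted as trigram contexts
for i in range(2, len(corpus)):
    tri_counts[tuple(corpus[i - 2:i + 1])] += 1
    bi_counts[tuple(corpus[i - 2:i])] += 1

def p_mle(w, w2, w1):
    """Maximum-likelihood trigram estimate C(w2 w1 w) / C(w2 w1); 0 if unseen."""
    denom = bi_counts[(w2, w1)]
    return tri_counts[(w2, w1, w)] / denom if denom else 0.0
```

Note that `p_mle("tuesday", "party", "on")` is exactly 0 here, which is the zero-probability problem that smoothing addresses.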

5
Introduction: Smoothing
  • The training corpus might not contain any
    instances of the phrase, so C(party on Tuesday)
    would be 0, while there might still be 20
    instances of the phrase "party on", giving
    P(Tuesday | party on) = 0
  • Smoothing techniques take some probability away
    from some occurrences
  • Imagine we have "party on Stan Chen's birthday"
    in the training data, and it occurs only one time

6
Introduction: Smoothing
  • By taking some probability away from some words,
    such as "Stan", and redistributing it to other
    words, such as "Tuesday", zero probabilities can
    be avoided
  • Katz smoothing, Jelinek-Mercer smoothing (deleted
    interpolation), Kneser-Ney smoothing

7
Introduction: Higher-order n-grams
  • The most obvious extension to trigram models is
    to simply move to higher-order n-grams, such as
    four-grams and five-grams
  • There is a significant interaction between
    smoothing and n-gram order: higher-order n-grams
    work better with Kneser-Ney smoothing than with
    some other methods, especially Katz smoothing

8
Introduction: Skipping
  • We condition on a different context than the
    previous two words
  • Instead of computing P(wi | wi-2 wi-1), we compute,
    for example, P(wi | wi-3 wi-2)

9
Introduction: Clustering
  • Clustering (classing) models attempt to make use
    of the similarities between words
  • If we have seen occurrences of phrases like
    "party on Monday" and "party on Wednesday", then
    we might imagine that the word "Tuesday" is also
    likely to follow the phrase "party on"

10
Introduction: Caching
  • Caching models make use of the observation that
    if you use a word, you are likely to use it again

11
Introduction: Sentence Mixture
  • Sentence Mixture models make use of the
    observation that there are many different
    sentence types, and that making models for each
    type of sentence may be better than using one
    global model

12
Introduction: Evaluation
  • A LM that assigned equal probability to 100 words
    would have perplexity 100

13
Introduction: Evaluation
  • In general, the perplexity of a LM is equal to
    the geometric average of the inverse probability
    of the words measured on test data
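The definition above (perplexity as the geometric average of the inverse probabilities of the test words, entropy as its log2) can be sketched directly:

```python
import math

def perplexity(probs):
    """Geometric average of inverse probabilities:
    PP = (prod 1/p_i)^(1/N) = 2^(-(1/N) * sum log2 p_i)."""
    n = len(probs)
    return 2.0 ** (-sum(math.log2(p) for p in probs) / n)

def entropy(probs):
    """Entropy in bits per word: simply log2 of perplexity."""
    return math.log2(perplexity(probs))
```

For example, a model that assigns every test word probability 1/100 has perplexity 100 and entropy log2(100) ≈ 6.64 bits, matching the uniform-over-100-words example above.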

14
(No Transcript)
15
Introduction: Evaluation
  • The true model for any data source will have the
    lowest possible perplexity
  • The lower the perplexity of our model, the closer
    it is, in some sense, to the true model
  • Entropy is simply log2 of perplexity
  • Entropy is the average number of bits per word
    that would be necessary to encode the test data
    using an optimal coder

16
Introduction: Evaluation
  • entropy 5 → 4 means perplexity 32 → 16,
    a 50% reduction
  • entropy 5 → 4.5 means perplexity 32 → 2^4.5 ≈ 22.6,
    a 29.3% reduction

17
Introduction: Evaluation
  • Experiment corpus: 1996 NAB
  • Experiments performed at 4 different training
    data sizes: 100K words, 1M words, 10M words, 284M
    words
  • Heldout and test data taken from the 1994 WSJ
  • Heldout data: 20K words
  • Test data: 20K words
  • Vocabulary: 58,546 words

18
Smoothing: Simple Interpolation
  • Pinterp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) +
    μ P(wi | wi-1) + (1-λ-μ) P(wi), where 0 ≤ λ, μ ≤ 1
  • In practice, the uniform distribution is also
    interpolated in; this ensures that no word is
    assigned probability 0
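A minimal sketch of simple interpolation with the uniform distribution folded in; the weight values here are arbitrary illustrative choices, not tuned parameters (in practice they are optimized on heldout data):

```python
def interp(p_tri, p_bi, p_uni, vocab_size, lam=0.6, mu=0.3, eps=0.05):
    """Interpolate trigram, bigram, unigram, and uniform estimates.
    Weights sum to 1; the uniform term guarantees no word gets probability 0."""
    uniform = 1.0 / vocab_size
    return lam * p_tri + mu * p_bi + eps * p_uni + (1 - lam - mu - eps) * uniform
```

Even when all three n-gram estimates are 0 (an entirely unseen word in an unseen context), the result stays strictly positive thanks to the uniform component.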

19
Smoothing: Katz Smoothing
  • Katz smoothing is based on the Good-Turing
    formula
  • Let nr represent the number of n-grams that occur
    r times
  • Discount: an n-gram seen r times is treated as if
    it had been seen r* = (r+1) n(r+1) / nr times

20
Smoothing: Katz Smoothing
  • The discounts r* = (r+1) n(r+1) / nr leave some
    probability mass unassigned
  • Let N represent the total size of the training
    set; this left-over probability will be equal to
    n1/N (the total amount discounted sums to n1)
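The Good-Turing discounting step described above can be sketched as follows; the example counts in the test are invented toy data:

```python
from collections import Counter

def good_turing_discounts(counts):
    """Good-Turing: an n-gram seen r times is treated as if seen
    r* = (r+1) * n_{r+1} / n_r times, where n_r is the number of
    n-grams that occur exactly r times. The left-over probability
    mass reserved for unseen n-grams equals n_1 / N."""
    n_r = Counter(counts.values())       # n_r: how many n-grams occur r times
    N = sum(counts.values())             # total training-set size
    r_star = {r: (r + 1) * n_r[r + 1] / n_r[r] for r in n_r}
    return r_star, n_r[1] / N
```

Note the well-known wrinkle that r* is 0 whenever n(r+1) is 0; practical implementations (including Katz's) only discount low counts and smooth the n_r sequence.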
21
Smoothing: Katz Smoothing
  • Consider a bigram model of a phrase such as
    Pkatz(Francisco | on). Since the phrase "San
    Francisco" is fairly common, the unigram
    probability of "Francisco" will also be
    fairly high.
  • This means that using Katz smoothing, the
    probability Pkatz(Francisco | on) will also be
    fairly high. But the word "Francisco" occurs in
    exceedingly few contexts, and its probability of
    occurring in a new one is very low

22
Smoothing: Kneser-Ney Smoothing
  • KN smoothing uses a modified backoff distribution
    based on the number of contexts each word occurs
    in, rather than the number of occurrences of the
    word. Thus, the probability PKN(Francisco | on)
    would be fairly low, while for a word like
    "Tuesday" that occurs in many contexts,
    PKN(Tuesday | on) would be relatively high, even
    if the phrase "on Tuesday" did not occur in the
    training data

23
Smoothing: Kneser-Ney Smoothing
  • Backoff Kneser-Ney smoothing
  • where |{v : C(v wi) > 0}| is the number of
    distinct words v that wi can occur after in the
    training data, D is the discount, and α is a
    normalization constant such that the
    probabilities sum to 1

24
Smoothing: Kneser-Ney Smoothing
(Worked example on the slide: a toy token sequence over
the vocabulary V = {a, b, c, d}, contrasting raw counts
with the number-of-contexts counts used by Kneser-Ney.)
25
Smoothing: Kneser-Ney Smoothing
  • Interpolated models always combine both the
    higher-order and the lower-order distribution
  • Interpolated Kneser-Ney smoothing, where
    λ(wi-1) is a normalization constant such that the
    probabilities sum to 1
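An interpolated Kneser-Ney bigram along the lines described above can be sketched as follows (a single discount D, as in ordinary interpolated KN; the lower-order distribution counts distinct left contexts, not raw occurrences):

```python
from collections import defaultdict

def make_kn_bigram(tokens, D=0.75):
    """Interpolated Kneser-Ney bigram (sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    c_bi = defaultdict(int)
    c_uni = defaultdict(int)
    contexts_of = defaultdict(set)   # distinct words v seen before w
    followers_of = defaultdict(set)  # distinct words w seen after v
    for v, w in bigrams:
        c_bi[(v, w)] += 1
        c_uni[v] += 1
        contexts_of[w].add(v)
        followers_of[v].add(w)
    n_types = len(c_bi)  # total number of distinct bigram types

    def p_kn(w, v):
        # Continuation probability: contexts w appears in / bigram types
        p_cont = len(contexts_of[w]) / n_types
        if c_uni[v] == 0:
            return p_cont
        # lambda(v): normalizer so the distribution sums to 1
        lam = D * len(followers_of[v]) / c_uni[v]
        return max(c_bi[(v, w)] - D, 0) / c_uni[v] + lam * p_cont

    return p_kn
```

With the toy sequence "a b a b a c", the probabilities conditioned on "a" sum to 1, and "b" (seen twice after "a") outranks "c" (seen once), while the continuation counts keep unseen pairs from collapsing to 0.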

26
Smoothing: Kneser-Ney Smoothing
  • Multiple discounts: one for one counts, another
    for two counts, and another for three or more
    counts (a separate discount for every count would
    have too many parameters)
  • Modified Kneser-Ney smoothing uses these three
    discounts

27
Smoothing: Jelinek-Mercer Smoothing
  • Combines different n-gram orders by linearly
    interpolating all three models whenever computing
    a trigram probability

28
Smoothing: Absolute Discounting
  • Absolute discounting: subtract a fixed discount
    D < 1 from each nonzero count

29
Witten-Bell Discounting
  • Key concept, things seen once: use the count of
    things you've seen once to help estimate the
    count of things you've never seen
  • So we estimate the total probability mass of all
    the zero N-grams with the number of types divided
    by the number of tokens plus observed types

N = the number of tokens, T = observed types
30
Witten-Bell Discounting
  • T/(N+T) gives the total probability of unseen
    N-grams; we need to divide this up among all the
    zero-count N-grams
  • We could just choose to divide it equally

Z is the total number of N-grams with count zero
31
Witten-Bell Discounting
Alternatively, we can represent the smoothed
counts directly as
32
Witten-Bell Discounting
33
Witten-Bell Discounting
  • For bigrams: T = the number of bigram types, N =
    the number of bigram tokens
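The Witten-Bell scheme just described, with T/(N+T) of the mass given to unseen continuations and divided equally among the Z zero-count words, can be sketched for bigrams as follows (toy corpus and vocabulary are invented):

```python
from collections import defaultdict

def make_wb_bigram(tokens, vocab):
    """Witten-Bell bigram (sketch). For a context v:
    N = tokens observed after v, T = distinct types after v,
    Z = vocabulary words never seen after v."""
    c_bi = defaultdict(int)
    c_uni = defaultdict(int)
    types_after = defaultdict(set)
    for v, w in zip(tokens, tokens[1:]):
        c_bi[(v, w)] += 1
        c_uni[v] += 1
        types_after[v].add(w)

    def p_wb(w, v):
        N, T = c_uni[v], len(types_after[v])
        if N == 0:
            return 1.0 / len(vocab)  # unseen context: fall back to uniform
        Z = len(vocab) - T           # zero-count continuations
        if c_bi[(v, w)] > 0:
            return c_bi[(v, w)] / (N + T)
        return T / (Z * (N + T))     # unseen mass T/(N+T), split equally

    return p_wb
```

As with the Kneser-Ney sketch, the probabilities conditioned on any seen context sum to 1 by construction.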

34
20 words per sentence
35
Higher-order n-grams
  • Trigram P(wi | wi-2 wi-1) → five-gram
    P(wi | wi-4 wi-3 wi-2 wi-1)
  • In many cases, no sequence of the form
    wi-4 wi-3 wi-2 wi-1 will have been seen in the
    training data → back off to, or interpolate with,
    four-grams, trigrams, bigrams, or even unigrams
  • But in those cases where such a long sequence has
    been seen, it may be a good predictor of wi

36
(Figure: entropy by n-gram order; gains of .06, .02,
and .01 bits at the 284,000,000-word training size)
37
Higher-order n-grams
  • As we can see, the behavior for Katz smoothing is
    very different from the behavior for KN
    smoothing; the main cause of this difference was
    the use of backoff smoothing techniques, such as
    Katz smoothing, or even the backoff version of KN
    smoothing
  • Backoff smoothing techniques work poorly on low
    counts, especially one counts, and as the
    n-gram order increases, the number of one counts
    increases

38
Higher-order n-grams
  • Katz smoothing has its best performance around
    the trigram level, and actually gets worse as
    this level is exceeded
  • KN smoothing is essentially monotonic even
    through 20-grams
  • The plateau point for KN smoothing depends on the
    amount of training data available: small (100,000
    words) at the trigram level; full (284 million
    words) at 5-grams to 7-grams
  • (the 6-gram is .02 bits better than the 5-gram;
    the 7-gram is .01 bits better than the 6-gram)

39
Skipping
  • When considering a 5-gram context, there are many
    subsets of the 5-gram we could consider, such as
    P(wi | wi-4 wi-3 wi-1) or P(wi | wi-4 wi-2 wi-1)
  • If we have never seen "Show John a good time" but
    have seen "Show Stan a good time", a normal
    5-gram predicting P(time | show John a good)
    would back off to P(time | John a good) and from
    there to P(time | a good), which would have a
    relatively low probability
  • A skipping model of the form P(wi | wi-4 wi-2 wi-1)
    would assign high probability to P(time | show
    ____ a good)

40
Skipping
  • These skipping 5-grams are then interpolated with
    a normal 5-gram, forming models such as
    λ P(wi | wi-4 wi-3 wi-2 wi-1) +
    μ P(wi | wi-4 wi-3 wi-1) +
    (1-λ-μ) P(wi | wi-4 wi-2 wi-1)
    where 0 ≤ λ ≤ 1, 0 ≤ μ ≤ 1, and 0 ≤ (1-λ-μ) ≤ 1
  • Another (and more traditional) use for skipping
    is as a sort of poor man's higher-order n-gram.
    One can, for instance, create a model of the form
    λ P(wi | wi-2 wi-1) + μ P(wi | wi-3 wi-1) +
    (1-λ-μ) P(wi | wi-3 wi-2)
    in which no component probability depends on more
    than two previous words, but the overall
    probability is 4-gram-like, since it depends on
    wi-3, wi-2, and wi-1
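The interpolation of a regular model with two skipping variants can be sketched as below; the component probabilities and the default weights are placeholders, not values from the experiments.

```python
def skip_interp(p_full, p_skip1, p_skip2, lam=0.5, mu=0.3):
    """Interpolate a regular n-gram probability with two skipping
    variants, e.g. P(w | v w x y), P(w | v w _ y), P(w | v _ x y).
    All three weights must lie in [0, 1] and sum to 1."""
    assert 0 <= lam <= 1 and 0 <= mu <= 1 and 0 <= 1 - lam - mu <= 1
    return lam * p_full + mu * p_skip1 + (1 - lam - mu) * p_skip2
```

Because the weights sum to 1, interpolating proper distributions yields a proper distribution; the skipping components contribute whenever the full context was unseen (p_full near 0) but a skipped context was seen.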
41
Skipping
  • For a 5-gram skipping experiments, all contexts
    depended on at most the previous four words,
    wi-4, wi-3, wi-2,and wi-1, but used the four
    words in a variety of ways
  • For readability and conciseness, we define v
    wi-4, w wi-3, x wi-2, y wi-1

42
(No Transcript)
43
Skipping
  • First model: interpolated dependencies on vw_y and
    v_xy → does not work well on the smallest training
    data size, but is competitive for larger ones
  • In the second model, we add vwx_ to the first
    model → roughly .02 to .04 bits over the first
    model
  • Next, we add back in the dependencies on the
    missing words: xvwy, wvxy, and yvwx; that is, all
    models depend on the same variables, but with
    the interpolation order modified
  • e.g., by xvwy, we refer to a model of the form
    P(z | vwxy) interpolated with P(z | vw_y)
    interpolated with P(z | w_y) interpolated with
    P(z | y) interpolated with P(z)

44
Skipping
  • Interpolating together vwyx, vxyw, wxyv (based on
    vwxy): this model puts each of the four preceding
    words in the last position for one
    component → this model does not work as well as
    the previous two, leading us to conclude that the
    y word is by far the most important

45
Skipping
  • Interpolating together vwyx, vywx, yvwx, which
    put the y word in each possible position in the
    backoff model → this was overall the worst model,
    reconfirming the intuition that the y word is
    critical
  • Finally, we interpolate together vwyx, vxyw,
    wxyv, vywx, yvwx, xvwy, wvxy → the result is a
    marginal gain, less than 0.01 bits, over the
    best previous model

46
(No Transcript)
47
Skipping
  • 1-back word (y): xy, wy, vy, uy and ty
  • 4-gram level: xy, wy and wx
  • The improvement over 4-gram pairs was still
    marginal

48
Clustering
  • Consider a probability such as P(Tuesday | party
    on)
  • Perhaps the training data contains no instances
    of the phrase "party on Tuesday", although other
    phrases such as "party on Wednesday" and "party
    on Friday" do appear
  • We can put words into classes, such as the word
    "Tuesday" into the class WEEKDAY
  • P(Tuesday | party on WEEKDAY)

49
Clustering
  • When each word belongs to only one class, which
    is called hard clustering, this decomposition is
    a strict equality, a fact that can be trivially
    proven. Let Wi represent the cluster of word wi

(1)
50
Clustering
  • Since each word belongs to a single cluster,
    P(Wi | wi) = 1

(2)
Substituting (2) into (1) gives:
(3)
predictive clustering
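The predictive-clustering decomposition above, P(w | context) = P(W | context) × P(w | context, W), which is exact for hard clusters, can be sketched as follows; the cluster map and the two component models in the test are invented placeholders, not trained models:

```python
# Hypothetical hard clustering: each word belongs to exactly one class
cluster = {"monday": "WEEKDAY", "tuesday": "WEEKDAY", "stan": "NAME"}

def p_predictive(w, context, p_cluster_given_ctx, p_word_given_ctx_cluster):
    """P(w | context) = P(W | context) * P(w | context, W).
    For hard clusters this is an equality, not an approximation,
    since P(W | w) = 1 for the word's own cluster."""
    W = cluster[w]
    return p_cluster_given_ctx(W, context) * p_word_given_ctx_cluster(w, context, W)
```

The benefit is that both factors can be estimated from denser statistics: WEEKDAY after "party on" pools the counts of Monday, Tuesday, Wednesday, and so on.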
51
Clustering
  • Another type of clustering we can do is to
    cluster the words in the contexts. For instance,
    if "party" is in the class EVENT and "on" is in
    the class PREPOSITION, then we could write (4)
    or, more generally, (5). Combining (4) with (3)
    we get

(4)
(5)
fullibm clustering
52
Clustering
  • Use the approximation P(w | Wi-2 Wi-1 W) ≈
    P(w | W) to get (6). fullibm clustering uses more
    information than ibm clustering, so we assumed
    that it would lead to improvements (goodibm)

(6)
ibm clustering
53
Clustering
(7)
index clustering
  • Backoff/interpolation goes from
    P(Tuesday | party EVENT on PREPOSITION) to
    P(Tuesday | EVENT on PREPOSITION) to
    P(Tuesday | on PREPOSITION) to
    P(Tuesday | PREPOSITION) to
    P(Tuesday); since each word belongs to a single
    cluster, some of these steps are redundant

54
Clustering
  • C(party EVENT on PREPOSITION) = C(party on),
    C(EVENT on PREPOSITION) = C(EVENT on)
  • We generally write an index clustered model as (7)

fullibmpredict clustering
55
Clustering
  • indexpredict: combining index and predictive
    clustering
  • combinepredict: interpolating a normal trigram
    with a predictive clustered trigram

56
Clustering
  • allcombinenotop, which is an interpolation of a
    normal trigram, a fullibm-like model, an index
    model, a predictive model, a true fullibm model,
    and an indexpredict model

normal trigram
fullibm-like
index model
predictive
true fullibm
indexpredict
57
Clustering
  • allcombine, interpolates the predict-type models
    first at the cluster level, before interpolating
    with the word level model

normal trigram
fullibm-like
index model
predictive
true fullibm
indexpredict
58
baseline
59
Clustering
  • The value of clustering decreases as training
    data increases, since clustering is a technique
    for dealing with data sparseness
  • ibm clustering consistently works very well

60
(No Transcript)
61
Clustering
  • In Fig. 6 we show a comparison of several
    techniques using Katz smoothing and the same
    techniques with KN smoothing. The results are
    similar, with some interesting exceptions
  • indexpredict works well for the KN-smoothed
    model, but very poorly for the Katz-smoothed
    model
  • This shows that smoothing can have a significant
    effect on other techniques, such as clustering

62
Other ways to perform Clustering
  • Cluster groups of words instead of individual
    words
  • We could compute
  • For instance, in a trigram model, one could
    cluster contexts like "New York" and "Los
    Angeles" as CITY, and "on Wednesday" and "late
    tomorrow" as TIME

63
Finding Clusters
  • There is no need for the clusters used for
    different positions to be the same
  • ibm clustering: P(wi | Wi) P(Wi | Wi-2 Wi-1);
    Wi is the predictive cluster, Wi-1 and Wi-2 are
    conditional clusters
  • The predictive and conditional clusters can be
    different; consider the words "a" and "an": in
    general, "a" and "an" can follow the same words,
    and so, for predictive clustering, belong in the
    same cluster. But there are very few words that
    can follow both "a" and "an", so for conditional
    clustering, they belong in different clusters

64
Finding Clusters
  • The clusters are found automatically using a tool
    that attempts to minimize perplexity
  • For the conditional clusters, we try to minimize
    the perplexity of training data for a bigram of
    the form P(wi | Wi-1), which is equivalent to
    maximizing ∏i P(wi | Wi-1)

65
Finding Clusters
  • For the predictive clusters, we try to minimize
    the perplexity of training data of
    P(Wi | wi-1) P(wi | Wi)

Since P(Wi | wi) = 1, we have
P(wi | Wi) = P(Wi | wi) P(wi) / P(Wi) = P(wi) / P(Wi)
and
P(Wi | wi-1) = P(wi-1 | Wi) P(Wi) / P(wi-1)
so the product equals P(wi-1 | Wi) P(wi) / P(wi-1);
since P(wi) and P(wi-1) are independent of the
clustering, it suffices to maximize the product of
P(wi-1 | Wi)
66
Caching
  • If a speaker uses a word, it is likely that he
    will use the same word again in the near future
  • We could form a smoothed bigram or trigram from
    the previous words, and interpolate this with the
    standard trigram, where Ptricache(w | w1…wi-1) is
    a simple interpolated trigram model, using counts
    from the preceding words in the same document
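A unigram-cache sketch of this idea (the slide describes a smoothed trigram cache; a unigram cache keeps the example short, and the weight λ is an arbitrary illustrative value):

```python
from collections import Counter

def p_with_cache(w, p_static, history, lam=0.9):
    """Interpolate a static model probability with a unigram cache
    built from the preceding words of the same document: words used
    recently get their probability boosted."""
    cache = Counter(history)
    p_cache = cache[w] / len(history) if history else 0.0
    return lam * p_static + (1 - lam) * p_cache
```

A word that has already appeared in the document receives a higher probability than one with the same static probability that has not, which is exactly the caching observation.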

67
Caching
  • When interpolating three probabilities P1(w),
    P2(w), and P3(w), rather than use the standard
    three-way interpolation directly, we actually use
    a reparameterized form; this allows us to
    simplify the constraints of the search

68
Caching
  • Conditional caching: weight the trigram cache
    differently depending on whether or not we have
    previously seen the context

69
Caching
  • Assume that the more data we have, the more
    useful each cache is. Thus we make the cache
    weights linear functions of the amount of data in
    the cache
  • We always set the maxwordsweight parameter at or
    near 1,000,000, while assigning the multiplier a
    small value (100 or less)

70
Caching
  • Finally, we can try conditionally combining
    unigram, bigram, and trigram caches

71
(No Transcript)
72
Caching
  • As can be seen, caching is potentially one of the
    most powerful techniques we can apply, leading to
    performance improvements of up to 0.6 bits on
    small data. Even on large data, the improvement
    is still substantial, up to 0.23 bits
  • At all data sizes, the n-gram caches perform
    substantially better than the unigram cache, but
    which version of the n-gram cache is used appears
    to make only a small difference

73
Caching
  • It should be noted that all of these results
    assume that the previous words are known exactly
  • In a speech recognition system, it is possible
    for a cache to "lock in" errors: if "recognize
    speech" is misrecognized as "wreck a nice beach",
    then later "speech recognition" may come out as
    "beach wreck ignition", since the cached
    probability of "beach" has been significantly
    raised

74
Sentence Mixture Models
  • There may be several different sentence types
    within a corpus; these types could be grouped by
    topic, or style, or some other criterion
  • In WSJ data, we might assume that there are three
    types: financial market sentences (with a great
    deal of numbers and stock names), business
    sentences (promotions, demotions, mergers), and
    general news stories
  • Of course, in general, we do not know the
    sentence type until we have heard the sentence.
    Therefore, instead, we treat the sentence type as
    a hidden variable

75
Sentence Mixture Models
  • Let sj denote the condition that the sentence
    under consideration is a sentence of type j. Then
    the probability of the sentence, given that it is
    of type j, can be written as ∏i P(wi | wi-2 wi-1, sj)
  • Let s0 be a special context that is always true
  • Let there be S different sentence types
    (4 ≤ S ≤ 8); let λ0…λS be sentence interpolation
    parameters such that Σj λj = 1

76
Sentence Mixture Models
  • The overall probability of a sentence w1…wn is
    given by Eq. (8)
  • Eq. (8) can be read as saying that there is a
    hidden variable, the sentence type; the prior
    probability for each sentence type is λj
  • The probabilities P(wi | wi-2 wi-1, sj) may
    suffer from data sparsity, so they are linearly
    interpolated with the global model
    P(wi | wi-2 wi-1)

(8)
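The mixture of Eq. (8), a λ-weighted sum over sentence types of per-type trigram products, can be sketched as below; the per-type models and the λ weights in the test are invented placeholders, not trained components:

```python
def sentence_mixture_prob(sentence, lambdas, type_models):
    """P(w_1..w_n) = sum_j lambda_j * prod_i P(w_i | w_{i-2} w_{i-1}, s_j).
    The sentence type is a hidden variable with prior lambda_j;
    each type_models[j] maps (word, context_tuple) to a probability."""
    total = 0.0
    for lam, model in zip(lambdas, type_models):
        prod = 1.0
        for i, w in enumerate(sentence):
            ctx = tuple(sentence[max(0, i - 2):i])  # up to two previous words
            prod *= model(w, ctx)
        total += lam * prod
    return total
```

Summing over types outside the product (rather than mixing per word) is what makes this a sentence-level mixture: the whole sentence is scored under each type before the priors are applied.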
77
Sentence Mixture Models
  • Sentence types for the training data were found
    by using the same clustering program used for
    clustering words; in this case, we tried to
    minimize the sentence-cluster unigram
    perplexities
  • Let s(i) represent the sentence type assigned to
    the sentence that word i is part of. (All words
    in a given sentence are assigned the same type)
  • We tried to put sentences into clusters in such a
    way that ∏i P(wi | s(i)) was maximized

78
Relationship between training data size, n-gram
order, and number of types
79
80
Sentence Mixture Models
  • Note that we don't trust the results for 128
    mixtures. With 128 sentence types, there are 773
    parameters, and the system may not have had
    enough heldout data to accurately estimate the
    parameters
  • Ideally, we would run this experiment with a
    larger heldout set, but it already required 5.5
    days with 20,000 words, so this is impractical

81
Sentence Mixture Models
  • We suspected that sentence mixture models would
    be more useful on larger training data sizes:
    with 100,000 words, the gain is only .1 bits;
    with 284,000,000 words, it's nearly .3 bits
  • This bodes well for the future of sentence
    mixture models: as computers get faster and
    larger, training data sizes should also increase

82
Sentence Mixture Models
  • Both 5-gram and sentence mixture models attempt
    to model long-distance dependencies, so the
    improvement from their combination would be less
    than the sum of the individual improvements
  • In Fig. 8, for 100,000 and 1,000,000 words, the
    difference between the trigram and the 5-gram is
    very small, so the question is not very important
  • For 10,000,000 words and all training data, there
    is some negative interaction

So, approximately one third of the improvement
seems to be correlated
83
Combining techniques
  • We interpolate this clustered trigram with a
    normal 5-gram

84
Combining techniques
  • Interpolate the sentence-specific 5-gram model
    with the global 5-gram model, the three skipping
    models, and the two cache models

85
Combining techniques
  • Next, we define the analogous function for
    predicting words given clusters

86
Combining techniques
  • Now, we can write out our probability model

(9)
87
(No Transcript)
88
(No Transcript)
89
Experiment
  • In fact, without KN smoothing, 5-grams actually
    hurt at small and medium data sizes. This is a
    wonderful example of synergy
  • Caching is the largest gain at small and medium
    data sizes
  • Combined with KN smoothing, 5-grams are the
    largest gain at large data sizes

90
(No Transcript)
91
(No Transcript)
92
(No Transcript)