Title: Word clustering: Smaller models, faster training
1. Word clustering: Smaller models, faster training
- Joshua Goodman
- Microsoft Speech.Net /
- Microsoft Research
2. Quick Overview
- What are language models
- What are word clusters
- How word clusters make language models
- Smaller
- Faster
3. A bad language model
4. A bad language model
5. A bad language model
6. A bad language model
7. What's a Language Model?
- For our purposes today, a language model gives the probability of a word given its context:
- P(truth | and nothing but the) ≈ 0.2
- P(roof | and nuts sing on the) ≈ 0.00000001
- Useful for speech recognition, handwriting, OCR, etc.
8. The Trigram Approximation
- Assume each word depends only on the previous two words:
- P(the | whole truth and nothing but) ≈ P(the | nothing but)
9. Trigrams, continued
- Find probabilities by counting in real text: P(the | nothing but) ≈ C(nothing but the) / C(nothing but)
- Smoothing: we need to combine the trigram P(the | nothing but) with the bigram P(the | nothing) and the unigram P(the); otherwise there are too many things you've never seen (see the sketch below).
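A minimal sketch of the counting and interpolation idea in Python; the interpolation weights and function names are illustrative, since the slide does not specify the smoothing method:

```python
from collections import Counter

def train_counts(words):
    """Collect unigram, bigram, and trigram counts from a token list."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for i, w in enumerate(words):
        uni[w] += 1
        if i >= 1:
            bi[(words[i-1], w)] += 1
        if i >= 2:
            tri[(words[i-2], words[i-1], w)] += 1
    return uni, bi, tri, len(words)

def p_interp(w, u, v, uni, bi, tri, total, lams=(0.6, 0.3, 0.1)):
    """P(w | u v) as an interpolation of trigram, bigram, and unigram
    relative frequencies. The weights are illustrative, not from the talk."""
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p_uni = uni[w] / total if total else 0.0
    l3, l2, l1 = lams
    return l3 * p_tri + l2 * p_bi + l1 * p_uni
```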
10. Perplexity
- Perplexity: the standard measure of language model accuracy; lower is better (definition below).
- Corresponds to the average branching factor of the model.
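The standard definition behind this slide (not spelled out on it): for a test corpus of N words under a trigram model,

```latex
\mathrm{PP} \;=\; 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_{2} P(w_i \mid w_{i-2}, w_{i-1})}
```

so a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.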
11. Trigram Problems
- Models are potentially huge: similar in size to the training data
- Largest part of commercial recognizers
- Sophisticated variations can be slow to learn
- Maximum entropy could take weeks, months, or years!
12. Overview: Word clusters solve problems -- smaller, faster
- Background: What are word clusters?
- Word clusters for smaller models
- Use a clustering technique that leads to larger models, then prune
- Up to 3 times smaller at the same perplexity
- Word clusters for faster training of maximum entropy models
- Train two models, each of which predicts half as much. Up to 35 times faster training.
13. What are word clusters?
- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
- Similar to P(Monday | party on)
- Similar to P(Tuesday | celebration on)
- Put words in clusters:
- WEEKDAY = Sunday, Monday, Tuesday, ...
- EVENT = party, celebration, birthday, ...
14. Putting words into clusters
- One cluster per word: hard clustering (see the sketch below)
- WEEKDAY = Sunday, Monday, Tuesday, ...
- MONTH = January, February, April, May, June, ...
- Soft clustering (each word belongs to more than one cluster) is possible, but complicates things: you get fractional counts.
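One way to picture the distinction, as a small sketch (the words and weights are illustrative):

```python
# Hard clustering: every word maps to exactly one cluster.
hard_cluster = {
    "Sunday": "WEEKDAY", "Monday": "WEEKDAY", "Tuesday": "WEEKDAY",
    "January": "MONTH", "February": "MONTH",
}

# Soft clustering: a word can belong to several clusters with weights,
# which is why counts become fractional.
soft_cluster = {
    "May": {"MONTH": 0.7, "MODAL-VERB": 0.3},
}
```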
15. Clustering: how to get them
- Build them by hand
- Works OK when there is almost no data
- Part-of-speech (POS) tags
- Tends not to work as well as automatic clustering
- Automatic clustering
- Swap words between clusters to minimize perplexity
16. Clustering: automatic
- Minimize the perplexity of P(z|Y)
- Put words into clusters randomly
- Swap words between clusters whenever the overall perplexity of P(z|Y) goes down
- Doing this naively is very slow, but mathematical tricks speed it up (the naive version is sketched below)
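A heavily simplified sketch of the exchange idea, assuming `words` is the training corpus as a token list and `cluster` maps every word to an initial (e.g. random) cluster label; these names are illustrative, and a real implementation updates counts incrementally rather than re-scoring the corpus for every candidate swap:

```python
import math
from collections import Counter

def loglik(words, cluster):
    """Log-likelihood of the corpus under P(z | Y): each word is predicted
    from the cluster of the previous word (the criterion on the slide)."""
    pair = Counter((cluster[words[i - 1]], words[i]) for i in range(1, len(words)))
    prev = Counter(cluster[words[i - 1]] for i in range(1, len(words)))
    return sum(n * math.log(n / prev[Y]) for (Y, _), n in pair.items())

def exchange(words, cluster, n_passes=5):
    """Naive exchange clustering: try moving each word to every cluster and
    keep any move that raises the log-likelihood (i.e. lowers perplexity).
    Deliberately slow and simple, as the slide warns."""
    clusters = set(cluster.values())
    for _ in range(n_passes):
        for w in set(words):
            best, best_ll = cluster[w], loglik(words, cluster)
            for c in clusters:
                cluster[w] = c
                ll = loglik(words, cluster)
                if ll > best_ll:
                    best, best_ll = c, ll
            cluster[w] = best
    return cluster
```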
17. Clustering: fast
- Use top-down splitting: at each level, consider swapping each word between two clusters.
- Not bottom-up merging! (That considers all pairs of clusters!)
18. Clustering example
- Imagine the following counts:
- C(Tuesday | party on) = 0
- C(Wednesday | celebration before) = 100
- C(Tuesday | WEEKDAY) = 1000
- Then:
- P(Tuesday | party on) ≈ 0
- P(WEEKDAY | EVENT PREPOSITION) is large
- P(Tuesday | WEEKDAY) is large
- P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY) is large
19. Two actual WSJ clusters
- MONDAYS, FRIDAYS, THURSDAY, MONDAY, EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY, TENTERHOOKS, TUESDAY, SUNDAY
- CONDITION, PARTY, FESCO, CULT, NILSON, PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN, DEPARTMENT, PENH, GUILD
20. How to use clusters
- Let x, y, z be words and X, Y, Z the clusters of those words.
- P(z|xy) ≈ P(Z|XY) × P(z|Z)
- P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
- Much smoother, smaller model than the normal P(z|xy), but higher perplexity (sketch below).
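A sketch of how the decomposition might look in code, assuming the two component distributions P(Z | X Y) and P(z | Z) have already been estimated; all names here are illustrative:

```python
def p_clustered(z, x, y, cluster, p_cluster_trigram, p_word_given_cluster):
    """P(z | x y) ~= P(Z | X Y) * P(z | Z): predict the cluster from the
    cluster history, then the word from its own cluster."""
    X, Y, Z = cluster[x], cluster[y], cluster[z]
    return p_cluster_trigram(Z, X, Y) * p_word_given_cluster(z, Z)
```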
21. Predictive clustering
- IMPORTANT FACT (with no smoothing, etc.): we are using hard clusters, so if we know z then we know its cluster Z, and therefore P(z, Z | history) = P(z | history).
22. Predictive clustering
- Equality (with no smoothing, etc.): P(z | history) = P(Z | history) × P(z | history, Z)
- With smoothing, tends to be better
- May have trouble figuring out the probability P(Tuesday | party on), but can guess:
- P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY), or
- P(WEEKDAY | party on) × P(Tuesday | on WEEKDAY)
23. Compression: Introduction
- We have billions of words of training data.
- Most large-vocabulary models are limited by model size.
- The most important question in language modeling is: what is the best language model we can build that will fit in the available memory?
- Relatively little research.
- New results: up to a factor of 3 or more smaller than the previous state of the art at the same perplexity.
24. Compression overview
- Review previous techniques:
- Count cutoffs
- Stolcke pruning
- IBM clustering
- Describe new techniques (Stolcke pruning + predictive clustering)
- Show experimental results
- Up to a factor of 3 or more size decrease (at the same perplexity) versus Stolcke pruning.
25. Count cutoffs
- Simple, commonly used technique
- Just remove n-grams with small counts (sketch below)
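A minimal sketch of a count cutoff, with an illustrative cutoff value:

```python
def apply_cutoff(trigram_counts, cutoff=1):
    """Drop every trigram whose count is <= cutoff; the pruned entries
    fall back to the bigram/unigram estimates."""
    return {ngram: c for ngram, c in trigram_counts.items() if c > cutoff}
```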
26. Stolcke pruning
- Consider P(City | New York) vs. P(City | York)
- The probabilities are almost the same
- Pruning P(City | New York) has almost no cost, even though C(New York City) is big.
- Now consider pruning P(lightbulb | change a): it is much more likely than P(lightbulb | a), so pruning it is costly (the criterion is sketched below).
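The underlying criterion (stated here from memory of Stolcke's 1998 pruning paper, so treat the exact form as a sketch): remove an n-gram when the weighted change in log probability its removal causes,

```latex
D(p \,\|\, p') \;=\; \sum_{h,\,w} p(h, w)\,\bigl[\log p(w \mid h) \;-\; \log p'(w \mid h)\bigr]
```

is below a threshold, where p' is the model with that n-gram pruned and its probability supplied by backoff. P(City | New York) vs. P(City | York) is exactly a case where this difference is tiny despite a large count.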
27. IBM clustering
- Use P(z|xy) ≈ P(Z|XY) × P(z|Z)
- Don't interpolate with P(z|xy), of course
- The model is much smaller, but has higher perplexity.
- How does it compare to count cutoffs, etc.? No one ever tried the comparison!
28. Predictive clustering
- Predictive clustering: P(z|xy) = P(Z|xy) × P(z|xyZ)
- The model is actually larger than the original P(z|xy):
- For each original P(z|xy), we must store P(z|xyZ); in addition, we need P(Z|xy).
- A normal model stores:
- P(Sunday | party on), P(Monday | party on), P(Tuesday | party on), ...
- A clustered, pruned model stores:
- P(WEEKDAY | party on)
- P(Sunday | WEEKDAY), P(Monday | WEEKDAY), P(Tuesday | WEEKDAY), ...
29. Experiments
30. Different Clusterings
- Let x^j, x^k, x^l be alternate clusterings of x.
- Example:
- Tuesday^l = WEEKDAY
- Tuesday^j = DAYS-MONTHS-TIMES
- Tuesday^k = NOUNS
- You can think of l, j, and k as being the number of clusters.
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
31. Different Clusterings (continued)
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(WEEKDAY | party on) ≈ P(WEEKDAY | party^j on^j) = P(WEEKDAY | NOUN PREP)
- Or:
- P(WEEKDAY | party on) ≈ P(WEEKDAY | party^k on^k) = P(WEEKDAY | EVENT LOC-PREP)
32. Both Clustering
- P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(z | xy z^l) ≈ P(z | x^k y^k z^l)
- Substitute into predictive clustering:
- P(z | xy) = P(z^l | xy) × P(z | xy z^l) ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
33. Example
- P(z | xy) = P(z^l | xy) × P(z | xy z^l) ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
- P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY) ≈ P(WEEKDAY | NOUN PREP) × P(Tuesday | EVENT LOC-PREP WEEKDAY)
34. Size reduction
- P(z | xy) = P(z^l | xy) × P(z | xy z^l) ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
- The optimal setting for k is often very large, e.g. the whole vocabulary.
- The unpruned model is typically larger than the unclustered one, but smaller than the predictive one.
- The pruned model is smaller than the unclustered model and smaller than the predictive model at the same perplexity.
35. Experiments
36. WSJ (English) results -- relative
37. Chinese Newswire Results (with Jianfeng Gao, MSR Beijing)
38. Compression conclusion
- We can achieve up to a factor of 3 or more reduction at the same perplexity by using Both Clustering combined with Stolcke pruning.
- The model is surprising: it actually increases the model size and then prunes it down smaller.
- Results are similar for Chinese and English.
39. Maximum Entropy Speedups
- Many people think maximum entropy is the future of language modeling (not me anymore).
- Allows lots of different information to be combined.
- Very slow to train: weeks.
- Predictive cluster models are up to 35 times faster to train.
40. Maximum entropy overview
- Describe what maximum entropy is
- Explain how to train maxent models, and why it is slow
- Show how predictive clustering can speed it up
- Give experimental results showing a factor of 35 speedup
- Talk about application to other areas
41. Maximum Entropy: Introduction
- "I'm busy next weekend. We are having a big party on ___"
- How likely is "Friday"?
- Reasonably likely to start: 0.001
- "weekend" occurs nearby: 2 times as likely
- Previous word is "on": 3 times as likely
- Previous words are "party on": 5 times as likely
- 0.001 × 2 × 3 × 5 = 0.03
- Need to normalize: divide 0.03 by the sum of such scores over all words.
42. Maximum Entropy: what is it
- A product of many indicator functions
- f_j is an indicator: 1 if some condition holds, e.g. f_j(w, w_{i-2}, w_{i-1}) = 1 if w = Friday, w_{i-2} = party, w_{i-1} = on
- Can create bigrams, trigrams, skipping, caches, and triggers with the right indicator functions.
- Z_λ is a normalization constant (the model is written out below).
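Written out (the slide describes it only in words), the model has the usual product form, with λ_j the weight attached to indicator f_j and Z_λ the normalization constant:

```latex
P(w \mid w_{i-2}, w_{i-1}) \;=\;
\frac{1}{Z_{\lambda}(w_{i-2}, w_{i-1})}
\prod_{j} \lambda_j^{\,f_j(w,\, w_{i-2},\, w_{i-1})}
```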
43. Maximum entropy training
- How to get the λ's? An iterative EM-style algorithm.
- Requires computing the probability distribution in all training contexts. For each training context:
- Requires determining all indicators that might apply
- Requires computing the normalization constant
- Note that the number of indicators that can apply and the time to normalize are both bounded by a factor of the vocabulary size.
44. Example: "party on Tuesday"
- Consider "party on Friday":
- We need to compute P(Friday | party on), P(Tuesday | party on), P(fish | party on), etc.
- The number of trigram indicators (f_j's) that we need to consider is bounded by the vocabulary size.
- The number of words to normalize over is the vocabulary size.
45. Solution: Predictive Clustering
- Create two separate maximum entropy models: P(Z | wxy) and P(z | wxyZ).
- Imagine a 10,000-word vocabulary, 100 clusters, and 100 words per cluster.
- Time to train the first model is proportional to the number of clusters (100).
- Time to train the second model is proportional to the number of words per cluster (100).
- 10,000 / 200 = a 50 times speedup (arithmetic below).
46. Predictive clustering example
- Consider "party on Tuesday" and P(Z | wxy).
- We need to know P(WEEKDAY | party on), P(MONTH | party on), P(ANIMAL | party on), etc.
- The number of trigram indicators (f_j's) that we need to consider is bounded by the number of clusters.
- Normalize only over the number of clusters.
47. Predictive clustering example (continued)
- Consider "party on Tuesday" and P(z | wxyZ).
- We need to know P(Monday | party on WEEKDAY), P(Tuesday | party on WEEKDAY), etc. Note that P(fish | party on WEEKDAY) = 0.
- The number of trigram indicators (f_j's) we need to consider is bounded by the number of words in the cluster.
- Normalize only over the number of words in the cluster.
48. Improvements: testing
- May also speed up testing.
- If running the decoder with all words, then we need to compute P(z | wxy) for all z, and there is no speedup.
- If using maximum entropy as a postprocessing step, on a lattice or n-best list, it may still lead to speedups, since we only need to compute a few z's for each context wxy.
49. Maximum entropy results
50. Maximum entropy conclusions
- At 10,000 words of training data, predictive clusters hurt a little.
- At any larger size, they help.
- The amount they help increases as the training data size increases.
- Triple predictive clustering gives a factor of 35 over fast unigram at 10,000,000 words of training data.
- Perplexity actually decreases slightly, even with the faster training!
51. Overall conclusion: Predictive clustering → smaller, faster
- Clustering is a well known technique.
- Smaller: new ways of using clustering to reduce language model size, with up to a 50% reduction in size at the same perplexity.
- Faster: new ways of speeding up training for maximum entropy models.
52. Speedup applied to other areas
- Can apply to any problem with many outputs, not just words
- Example: collaborative filtering tasks
- This speedup can be used with most machine learning algorithms applied to problems with many outputs
- Examples: neural networks, decision trees
53. Neural Networks
- Imagine a neural network with a large number of outputs (10,000)
- Requires backpropagating one 1 and 9,999 0s
54. Maximum Entropy training: Inner loop
- For each word w in vocabulary:
- P_w ← 1
- next w
- For each non-zero f_j:
- P_w ← P_w × λ_j
- next j
- z ← Σ_w P_w
- For each word w in vocabulary:
- observed_w ← observed_w + P_w / z
- next w
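A minimal Python rendering of the inner loop above, assuming `active_features(context)` is a hypothetical helper that returns the (j, w) pairs whose indicator f_j fires for word w in this training context:

```python
def inner_loop(context, vocab, lam, active_features, observed):
    """One training context of the maxent inner loop (a sketch of the
    pseudocode on the slide)."""
    # P_w <- 1 for each word w in the vocabulary
    p = {w: 1.0 for w in vocab}
    # P_w <- P_w * lambda_j for each non-zero f_j
    for j, w in active_features(context):
        p[w] *= lam[j]
    # z <- sum of P_w over the vocabulary
    z = sum(p.values())
    # observed_w <- observed_w + P_w / z for each word w
    for w in vocab:
        observed[w] += p[w] / z
```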