Title: Maximum Entropy Model
1. Maximum Entropy Model
- LING 572
- Fei Xia
- 02/08/07
2. Topics in LING 572
- Easy:
- kNN, Rocchio, DT, DL
- Feature selection, binarization, system combination
- Bagging
- Self-training
3. Topics in LING 572
- Slightly more complicated
- Boosting
- Co-training
- Hard (to some people)
- MaxEnt
- EM
4. History
- The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.
- Introduced to the NLP area by Berger et al. (1996).
- Used in many NLP tasks: tagging, parsing, PP attachment, LM, etc.
5. Outline
- Main idea
- Modeling
- Training: estimating parameters
- Feature selection during training
- Case study
6. Main idea
7. Maximum Entropy
- Why maximum entropy?
- Maximize entropy = minimize commitment
- Model all that is known and assume nothing about what is unknown.
- Model all that is known: satisfy a set of constraints that must hold
- Assume nothing about what is unknown:
- choose the most uniform distribution
- => choose the one with maximum entropy
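For reference, the entropy being maximized is the standard Shannon entropy of the distribution p over events x:

H(p) = -\sum_{x} p(x) \log p(x)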
8. Ex1: Coin-flip example (Klein and Manning, 2003)
- Toss a coin: p(H) = p1, p(T) = p2.
- Constraint: p1 + p2 = 1
- Question: what's your estimate of p = (p1, p2)?
- Answer: choose the p that maximizes H(p)
(Plot: entropy H as a function of p1, with the point p1 = 0.3 marked.)
9. Coin-flip example (cont)
(Plot: entropy H over (p1, p2); the constraint p1 + p2 = 1.0 is a line in the (p1, p2) plane, and adding p1 = 0.3 picks out a single point on it.)
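A worked version of the coin example, using the constraints shown in the plots:

\max_{p} \; H(p) = -p_1 \log p_1 - p_2 \log p_2 \quad \text{s.t.} \quad p_1 + p_2 = 1 \;\;\Rightarrow\;\; p_1 = p_2 = \tfrac{1}{2}

Adding the further constraint p_1 = 0.3 leaves only one distribution: p_1 = 0.3, p_2 = 0.7.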
10. Ex2: An MT example (Berger et al., 1996)
Possible translations for the word "in": dans, en, à, au cours de, pendant
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
11. An MT example (cont)
Constraints: the five probabilities sum to 1, and p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
12. An MT example (cont)
Constraints: the constraints above, plus p(dans) + p(à) = 1/2
Intuitive answer: ?? (no longer obvious; we need a principled way to choose the most uniform distribution that satisfies all the constraints)
13. Ex3: POS tagging (Klein and Manning, 2003)
14. Ex3 (cont)
15. Ex4: overlapping features (Klein and Manning, 2003)
16. Modeling the problem
- Objective function: H(p)
- Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p).
- Question: how do we represent the constraints?
17. Modeling
18. Reference papers
- (Ratnaparkhi, 1997)
- (Ratnaparkhi, 1996)
- (Berger et al., 1996)
- (Klein and Manning, 2003)
- => Note: these papers use different notations.
19. The basic idea
- Goal: estimate p
- Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence").
20. Setting
- From training data, collect (a, b) pairs:
- a: the thing to be predicted (e.g., a class in a classification problem)
- b: the context
- Ex: POS tagging
- a = NN
- b = the words in a window and the previous two tags
- Learn the probability of each (a, b): p(a, b)
21. Features in POS tagging (Ratnaparkhi, 1996)
(Figure: example features, defined over the context (a.k.a. history) and the allowable classes.)
22. Features
- A feature (a.k.a. feature function or indicator function) is a binary-valued function on events: f_j: A x B -> {0, 1}
- A: the set of possible classes (e.g., tags in POS tagging)
- B: the space of contexts (e.g., neighboring words/tags in POS tagging)
- Ex: (a hypothetical example is sketched below)
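For illustration, a minimal Python sketch of such a binary feature function (the particular word "that" and tag DT are my choices for the example, not features from the slides):

def f_example(a, b):
    """Binary feature: 1 iff the predicted tag is DT and the current word is "that"."""
    return 1 if a == "DT" and b["current_word"] == "that" else 0

# Hypothetical events:
print(f_example("DT", {"current_word": "that"}))  # 1
print(f_example("NN", {"current_word": "that"}))  # 0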
23. Some notations
S: a finite training sample of events
p~(x): the observed probability of x in S
p(x): the model p's probability of x
f_j: the jth feature
E_p~[f_j]: the observed expectation of f_j (the empirical count of f_j)
E_p[f_j]: the model expectation of f_j
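Writing out the two expectations (standard definitions, with \tilde{p} the empirical distribution p~):

E_{\tilde{p}}[f_j] = \sum_{(a,b)} \tilde{p}(a,b)\, f_j(a,b), \qquad E_{p}[f_j] = \sum_{(a,b)} p(a,b)\, f_j(a,b)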
24. Constraints
- Model's feature expectation = observed feature expectation
- E_p[f_j] = E_p~[f_j] for each feature f_j
- How do we calculate the observed expectation E_p~[f_j]?
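The constraints, and the way the observed side is computed from the N training events:

E_{p}[f_j] = E_{\tilde{p}}[f_j], \quad j = 1, \ldots, k, \qquad E_{\tilde{p}}[f_j] = \frac{1}{N} \sum_{i=1}^{N} f_j(a_i, b_i)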
25. Training data => observed events
26. Restating the problem
The task: find p* s.t. p* = argmax_{p in P} H(p)
where P = { p : E_p[f_j] = E_p~[f_j], j = 1, ..., k }
Objective function: H(p)
Constraints: E_p[f_j] = E_p~[f_j], j = 1, ..., k
Add a feature
27. Questions
- Is P empty?
- Does p* exist?
- Is p* unique?
- What is the form of p*?
- How do we find p*?
28. What is the form of p*? (Ratnaparkhi, 1997)
Theorem: if p* is in P ∩ Q, then p* = argmax_{p in P} H(p), where Q is the family of models of log-linear (exponential) form.
Furthermore, p* is unique.
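Spelling out the family Q (the standard exponential form, with Z a normalizing constant):

Q = \Big\{ p : p(x) = \frac{1}{Z} \exp\Big(\sum_{j} \lambda_j f_j(x)\Big) \Big\}, \qquad Z = \sum_{x} \exp\Big(\sum_{j} \lambda_j f_j(x)\Big)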
29. Using Lagrange multipliers
Minimize A(p)
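One common way to set up A(p) (a sketch; sign conventions vary across the reference papers):

A(p) = -H(p) - \sum_j \lambda_j \big(E_p[f_j] - E_{\tilde{p}}[f_j]\big) - \gamma \Big(\sum_x p(x) - 1\Big)

Setting \partial A / \partial p(x) = 0 gives \log p(x) + 1 - \sum_j \lambda_j f_j(x) - \gamma = 0, i.e. p(x) \propto \exp\big(\sum_j \lambda_j f_j(x)\big), the exponential form above.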
30. Two equivalent forms
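The two forms are presumably the exponential (lambda) and product (alpha) parameterizations, which are related by \alpha_j = e^{\lambda_j}:

p(x) = \frac{1}{Z} \exp\Big(\sum_j \lambda_j f_j(x)\Big) \qquad \Longleftrightarrow \qquad p(x) = \pi \prod_j \alpha_j^{f_j(x)}, \quad \alpha_j = e^{\lambda_j}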
31. Relation to Maximum Likelihood
The log-likelihood of the empirical distribution p~, as predicted by a model q, is defined as L_p~(q) (see the formula below).
Theorem: if p* is in P ∩ Q, then p* = argmax_{q in Q} L_p~(q).
Furthermore, p* is unique.
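The log-likelihood referred to above is usually written as:

L_{\tilde{p}}(q) = \sum_{x} \tilde{p}(x) \log q(x)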
32. Summary (so far)
Goal: find p* in P, which maximizes H(p).
It can be proved that, when p* exists, it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample.
33. Summary (cont)
- Adding constraints (features)
- (Klein and Manning, 2003)
- Lower maximum entropy
- Raise maximum likelihood of data
- Bring the distribution further away from uniform
- Bring the distribution closer to data
34. Training
35. Algorithms
- Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
- Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
36. GIS setup
- Requirements for running GIS:
- Obey the form of the model and the constraints
- An additional constraint: for every event (a, b), the features must sum to the same constant C
Let C = max_{a,b} sum_j f_j(a, b)
Add a new (correction) feature f_{k+1}(a, b) = C - sum_j f_j(a, b)
37. GIS algorithm
- Compute d_j = E_p~[f_j], j = 1, ..., k+1
- Initialize λ_j^(1) (to any values, e.g., 0)
- Repeat until convergence:
- For each j:
- Compute p^(n)(a | b) under the current weights
- Compute the model expectation E_p(n)[f_j]
- Update λ_j^(n+1) = λ_j^(n) + (1/C) log( d_j / E_p(n)[f_j] )
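A minimal Python sketch of this loop (illustrative only: it assumes binary features given as Python functions, that the correction feature is already included so every event has the same total feature count C, and it uses the approximation from the next slide for the model expectation):

import math

def gis(events, classes, features, num_iters=100):
    """events: list of (a, b) pairs; classes: possible a values; features: list of f(a, b)."""
    N = len(events)
    C = sum(f(events[0][0], events[0][1]) for f in features)  # constant total feature count
    # Observed expectations d_j = (1/N) * sum_i f_j(a_i, b_i)
    d = [sum(f(a, b) for a, b in events) / N for f in features]
    lambdas = [0.0] * len(features)
    for _ in range(num_iters):
        model_exp = [0.0] * len(features)
        for _, b in events:
            # p(a | b) under the current weights
            scores = [math.exp(sum(l * f(a, b) for l, f in zip(lambdas, features)))
                      for a in classes]
            Z = sum(scores)
            for a, s in zip(classes, scores):
                p_ab = s / Z
                for j, f in enumerate(features):
                    model_exp[j] += p_ab * f(a, b) / N
        # GIS update
        for j in range(len(features)):
            if d[j] > 0 and model_exp[j] > 0:
                lambdas[j] += math.log(d[j] / model_exp[j]) / C
    return lambdas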
38. Approximation for calculating feature expectation
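The usual approximation replaces the sum over all contexts b with a sum over the contexts observed in training:

E_p[f_j] = \sum_{a,b} p(a, b)\, f_j(a, b) \;\approx\; \sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_j(a, b) \;=\; \frac{1}{N} \sum_{i=1}^{N} \sum_{a} p(a \mid b_i)\, f_j(a, b_i)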
39. Properties of GIS
- L(p^(n+1)) > L(p^(n))
- The sequence is guaranteed to converge to p*.
- The convergence can be very slow.
- The running time of each iteration is O(NPA):
- N: the training set size
- P: the number of classes
- A: the average number of features that are active for a given event (a, b).
40. Feature selection
41. Feature selection
- Throw in many features and let the machine select the weights
- Manually specify feature templates
- Problem: too many features
- An alternative: a greedy algorithm
- Start with an empty set S
- Add a feature at each iteration
42. Notation
With the feature set S: the model p_S
After adding a feature f: the model p_{S+f}
The gain in the log-likelihood of the training data: G_S(f) = L(p_{S+f}) - L(p_S)
43. Feature selection algorithm (Berger et al., 1996)
- Start with S being empty; thus p_S is uniform.
- Repeat until the gain is small enough:
- For each candidate feature f:
- Compute the model p_{S+f} using IIS
- Calculate the log-likelihood gain
- Choose the feature with the maximal gain, and add it to S
=> Problem: too expensive
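A minimal sketch of the greedy loop; train_model and log_likelihood are hypothetical callables supplied by the caller (e.g., an IIS trainer), not functions defined in the slides:

def greedy_feature_selection(candidates, data, train_model, log_likelihood, min_gain=1e-3):
    """Repeatedly add the candidate feature whose addition most increases
    the training log-likelihood; stop when the best gain is too small."""
    selected = []
    base_ll = log_likelihood(train_model(selected, data), data)
    while True:
        best_f, best_gain = None, min_gain
        for f in candidates:
            if f in selected:
                continue
            gain = log_likelihood(train_model(selected + [f], data), data) - base_ll
            if gain > best_gain:
                best_f, best_gain = f, gain
        if best_f is None:
            break
        selected.append(best_f)
        base_ll += best_gain
    return selected

The inner call to train_model for every candidate is exactly the "too expensive" step; the next slide's approximation avoids it.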
44. Approximating gains (Berger et al., 1996)
- Instead of recalculating all the weights, calculate only the weight of the new feature (holding the other weights fixed).
45. Training a MaxEnt Model
- Scenario 1: no feature selection during training
- Define feature templates
- Create the feature set
- Determine the optimal feature weights via GIS or IIS
- Scenario 2: feature selection during training
- Define feature templates
- Create the candidate feature set S
- At every iteration, choose the feature from S (with max gain) and determine its weight (or choose the top-n features and their weights).
46. Case study
47. POS tagging (Ratnaparkhi, 1996)
- Notation variation:
- f_j(a, b): a = class, b = context
- f_j(h_i, t_i): h_i = history for the ith word, t_i = tag for the ith word
- History: h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})
- Training data:
- Treat it as a list of (h_i, t_i) pairs.
- How many pairs are there?
48. Using a MaxEnt Model
- Modeling
- Training:
- Define feature templates
- Create the feature set
- Determine the optimal feature weights via GIS or IIS
- Decoding
49. Modeling
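In Ratnaparkhi (1996)-style tagging, the trained model defines a conditional distribution over tags given the history, and a tag sequence is scored as a product of these conditionals (a standard formulation, written here for reference):

p(t_i \mid h_i) = \frac{1}{Z(h_i)} \exp\Big(\sum_j \lambda_j f_j(h_i, t_i)\Big), \qquad p(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i)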
50. Training step 1: define feature templates
(Table of feature templates, each pairing a predicate on the history h_i with the tag t_i.)
51. Step 2: Create the feature set
- Collect all the features from the training data
- Throw away features that appear fewer than 10 times
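A toy Python sketch of steps 1-2, assuming a few illustrative templates (current word, previous tag, 3-character suffix; the actual template set in Ratnaparkhi (1996) is richer) and the cutoff of 10:

from collections import Counter

def extract_features(history, tag):
    """Instantiate a few illustrative feature templates for one (h_i, t_i) event."""
    word, prev_tag = history["word"], history["prev_tag"]
    return [
        ("word", word, tag),          # current word + tag
        ("prev_tag", prev_tag, tag),  # previous tag + tag
        ("suffix3", word[-3:], tag),  # 3-character suffix + tag
    ]

def build_feature_set(events, cutoff=10):
    """Count instantiated features over the training events and keep the frequent ones."""
    counts = Counter()
    for history, tag in events:
        counts.update(extract_features(history, tag))
    return {f for f, c in counts.items() if c >= cutoff}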
52. Step 3: determine the feature weights
- GIS
- Training time:
- Each iteration: O(NTA)
- N: the training set size
- T: the number of allowable tags
- A: average number of features that are active for a given (h, t)
- About 24 hours on an IBM RS/6000 Model 380.
- How many features?
53. Decoding: Beam search
- Generate tags for w_1, find the top N, and set s_1j accordingly, j = 1, 2, ..., N
- For i = 2 to n (n is the sentence length):
- For j = 1 to N:
- Generate tags for w_i, given s_(i-1)j as the previous tag context
- Append each tag to s_(i-1)j to make a new sequence
- Find the N highest-probability sequences generated above, and set s_ij accordingly, j = 1, ..., N
- Return the highest-probability sequence, s_n1 (a sketch of this procedure is given below).
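A minimal Python sketch of the beam search; tag_probs(sentence, i, prev_tags) is a hypothetical callable (not from the slides) that returns {tag: p(tag | history)} from the trained model:

def beam_search(sentence, tag_probs, beam_size):
    """Return the highest-probability tag sequence using a beam of size beam_size."""
    # Initialize the beam with the top-N tags for the first word.
    beams = [([t], p) for t, p in tag_probs(sentence, 0, []).items()]
    beams = sorted(beams, key=lambda x: -x[1])[:beam_size]
    for i in range(1, len(sentence)):
        candidates = []
        for seq, score in beams:
            # Extend each kept sequence with every possible tag for word i.
            for t, p in tag_probs(sentence, i, seq).items():
                candidates.append((seq + [t], score * p))
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
    return beams[0][0]  # highest-probability sequence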
54. Beam search
55. Viterbi search
56. Decoding (cont)
- Tags for words:
- Known words: use a tag dictionary
- Unknown words: try all possible tags
- Ex: "time flies like an arrow"
- Running time: O(NTAB)
- N: sentence length
- B: beam size
- T: tagset size
- A: average number of features that are active for a given event
57. Experiment results
58. Comparison with other learners
- vs. HMM: MaxEnt uses more context
- vs. SDT: MaxEnt does not split the data
- vs. TBL: MaxEnt is statistical and it provides probability distributions.
59. MaxEnt Summary
- Concept: choose the p* that maximizes entropy while satisfying all the constraints.
- Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data.
- Training: GIS or IIS, which can be slow.
- MaxEnt handles overlapping features well.
- In general, MaxEnt achieves good performance on many NLP tasks.
60. Additional slides
61. Ex4 (cont)
??
62. IIS algorithm
- Compute d_j = E_p~[f_j], j = 1, ..., k+1, and f#(a, b) = sum_j f_j(a, b)
- Initialize λ_j^(1) (to any values, e.g., 0)
- Repeat until convergence:
- For each j:
- Let Δλ_j be the solution to the IIS update equation (see below)
- Update λ_j^(n+1) = λ_j^(n) + Δλ_j
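The update equation that Δλ_j must solve, in the standard IIS form (Della Pietra et al., 1995), with f# the total feature count of an event:

\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_j(a, b)\, \exp\big(\Delta\lambda_j\, f^{\#}(a, b)\big) = E_{\tilde{p}}[f_j], \qquad f^{\#}(a, b) = \sum_{j} f_j(a, b)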
63. Calculating Δλ_j
If f#(a, b) = C is a constant for all (a, b),
then Δλ_j = (1/C) log( d_j / E_p(n)[f_j] ), and
GIS is the same as IIS.
Else
Δλ_j must be calculated numerically (e.g., by Newton's method).