Title: Maximum Entropy Model
1. Maximum Entropy Model
- LING 572
- Fei Xia
- 02/08/07
2. Topics in LING 572
- Easy:
- kNN, Rocchio, DT, DL
- Feature selection, binarization, system combination
- Bagging
- Self-training
3. Topics in LING 572
- Slightly more complicated
- Boosting
- Co-training
- Hard (to some people)
- MaxEnt
- EM
4. History
- The concept of Maximum Entropy can be traced back along multiple threads to Biblical times.
- Introduced to the NLP area by Berger et al. (1996).
- Used in many NLP tasks: tagging, parsing, PP attachment, LM, etc.
5. Outline
- Main idea
- Modeling
- Training: estimating parameters
- Feature selection during training
- Case study
6. Main idea
7. Maximum Entropy
- Why maximum entropy?
- Maximize entropy = minimize commitment
- Model all that is known and assume nothing about what is unknown.
- Model all that is known: satisfy a set of constraints that must hold
- Assume nothing about what is unknown:
- choose the most uniform distribution
- => choose the one with maximum entropy
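For reference, the entropy being maximized is the standard Shannon entropy of the distribution p over events x:

H(p) = -\sum_{x} p(x) \log p(x)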
8. Ex1: Coin-flip example (Klein and Manning, 2003)
- Toss a coin: p(H) = p1, p(T) = p2.
- Constraint: p1 + p2 = 1
- Question: what's your estimate of p = (p1, p2)?
- Answer: choose the p that maximizes H(p)
(Plot: entropy H as a function of p1, with the point p1 = 0.3 marked.)
9. Coin-flip example (cont)
(Plot: entropy H over (p1, p2); the constraint p1 + p2 = 1.0 is a line in the (p1, p2) plane, and adding p1 = 0.3 picks out a single point on it.)
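A worked version of the coin example, using the constraints shown in the plots:

\max_{p} \; H(p) = -p_1 \log p_1 - p_2 \log p_2 \quad \text{s.t.} \quad p_1 + p_2 = 1 \;\;\Rightarrow\;\; p_1 = p_2 = \tfrac{1}{2}

Adding the further constraint p_1 = 0.3 leaves only one distribution: p_1 = 0.3, p_2 = 0.7.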
10. Ex2: An MT example (Berger et al., 1996)
Possible translations for the word "in": dans, en, à, au cours de, pendant
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
11. An MT example (cont)
Constraints: the five probabilities sum to 1, and p(dans) + p(en) = 3/10
Intuitive answer: p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
12. An MT example (cont)
Constraints: the constraints above, plus p(dans) + p(à) = 1/2
Intuitive answer: ?? (no longer obvious; we need a principled way to choose the most uniform distribution that satisfies all the constraints)
13. Ex3: POS tagging (Klein and Manning, 2003)
14. Ex3 (cont)
15. Ex4: overlapping features (Klein and Manning, 2003)
16. Modeling the problem
- Objective function: H(p)
- Goal: among all the distributions that satisfy the constraints, choose the one, p*, that maximizes H(p).
- Question: how do we represent the constraints?
17. Modeling
18. Reference papers
- (Ratnaparkhi, 1997)
- (Ratnaparkhi, 1996)
- (Berger et al., 1996)
- (Klein and Manning, 2003)
- => Note: these papers use different notations.
19. The basic idea
- Goal: estimate p
- Choose the p with maximum entropy (or "uncertainty") subject to the constraints (or "evidence").
20. Setting
- From training data, collect (a, b) pairs:
- a: the thing to be predicted (e.g., a class in a classification problem)
- b: the context
- Ex: POS tagging
- a = NN
- b = the words in a window and the previous two tags
- Learn the probability of each (a, b): p(a, b)
21. Features in POS tagging (Ratnaparkhi, 1996)
(Figure: example features, defined over the context (a.k.a. history) and the allowable classes.)
22. Features
- A feature (a.k.a. feature function or indicator function) is a binary-valued function on events: f_j: A x B -> {0, 1}
- A: the set of possible classes (e.g., tags in POS tagging)
- B: the space of contexts (e.g., neighboring words/tags in POS tagging)
- Ex: (a hypothetical example is sketched below)
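For illustration, a minimal Python sketch of such a binary feature function (the particular word "that" and tag DT are my choices for the example, not features from the slides):

def f_example(a, b):
    """Binary feature: 1 iff the predicted tag is DT and the current word is "that"."""
    return 1 if a == "DT" and b["current_word"] == "that" else 0

# Hypothetical events:
print(f_example("DT", {"current_word": "that"}))  # 1
print(f_example("NN", {"current_word": "that"}))  # 0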
23. Some notations
S: a finite training sample of events
p~(x): the observed probability of x in S
p(x): the model p's probability of x
f_j: the jth feature
E_p~[f_j]: the observed expectation of f_j (the empirical count of f_j)
E_p[f_j]: the model expectation of f_j
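Writing out the two expectations (standard definitions, with \tilde{p} the empirical distribution p~):

E_{\tilde{p}}[f_j] = \sum_{(a,b)} \tilde{p}(a,b)\, f_j(a,b), \qquad E_{p}[f_j] = \sum_{(a,b)} p(a,b)\, f_j(a,b)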
24. Constraints
- Model's feature expectation = observed feature expectation
- E_p[f_j] = E_p~[f_j] for each feature f_j
- How do we calculate the observed expectation E_p~[f_j]?
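The constraints, and the way the observed side is computed from the N training events:

E_{p}[f_j] = E_{\tilde{p}}[f_j], \quad j = 1, \ldots, k, \qquad E_{\tilde{p}}[f_j] = \frac{1}{N} \sum_{i=1}^{N} f_j(a_i, b_i)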
25. Training data => observed events
26. Restating the problem
The task: find p* s.t. p* = argmax_{p in P} H(p)
where P = { p : E_p[f_j] = E_p~[f_j], j = 1, ..., k }
Objective function: H(p)
Constraints: E_p[f_j] = E_p~[f_j], j = 1, ..., k
Add a feature
27. Questions
- Is P empty?
- Does p* exist?
- Is p* unique?
- What is the form of p*?
- How do we find p*?
28. What is the form of p*? (Ratnaparkhi, 1997)
Theorem: if p* is in P ∩ Q, then p* = argmax_{p in P} H(p), where Q is the family of models of log-linear (exponential) form.
Furthermore, p* is unique.
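Spelling out the family Q (the standard exponential form, with Z a normalizing constant):

Q = \Big\{ p : p(x) = \frac{1}{Z} \exp\Big(\sum_{j} \lambda_j f_j(x)\Big) \Big\}, \qquad Z = \sum_{x} \exp\Big(\sum_{j} \lambda_j f_j(x)\Big)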
29. Using Lagrange multipliers
Minimize A(p)
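One common way to set up A(p) (a sketch; sign conventions vary across the reference papers):

A(p) = -H(p) - \sum_j \lambda_j \big(E_p[f_j] - E_{\tilde{p}}[f_j]\big) - \gamma \Big(\sum_x p(x) - 1\Big)

Setting \partial A / \partial p(x) = 0 gives \log p(x) + 1 - \sum_j \lambda_j f_j(x) - \gamma = 0, i.e. p(x) \propto \exp\big(\sum_j \lambda_j f_j(x)\big), the exponential form above.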
30. Two equivalent forms
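The two forms are presumably the exponential (lambda) and product (alpha) parameterizations, which are related by \alpha_j = e^{\lambda_j}:

p(x) = \frac{1}{Z} \exp\Big(\sum_j \lambda_j f_j(x)\Big) \qquad \Longleftrightarrow \qquad p(x) = \pi \prod_j \alpha_j^{f_j(x)}, \quad \alpha_j = e^{\lambda_j}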
31. Relation to Maximum Likelihood
The log-likelihood of the empirical distribution p~, as predicted by a model q, is defined as L_p~(q) (see the formula below).
Theorem: if p* is in P ∩ Q, then p* = argmax_{q in Q} L_p~(q).
Furthermore, p* is unique.
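The log-likelihood referred to above is usually written as:

L_{\tilde{p}}(q) = \sum_{x} \tilde{p}(x) \log q(x)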
32. Summary (so far)
Goal: find p* in P, which maximizes H(p).
It can be proved that, when p* exists, it is unique.
The model p* in P with maximum entropy is the model in Q that maximizes the likelihood of the training sample.
33. Summary (cont)
- Adding constraints (features)
- (Klein and Manning, 2003)
- Lower maximum entropy
- Raise maximum likelihood of data
- Bring the distribution further away from uniform
- Bring the distribution closer to data
34. Training
35. Algorithms
- Generalized Iterative Scaling (GIS): (Darroch and Ratcliff, 1972)
- Improved Iterative Scaling (IIS): (Della Pietra et al., 1995)
36. GIS setup
- Requirements for running GIS:
- Obey the form of the model and the constraints
- An additional constraint: for every event (a, b), the features must sum to the same constant C
Let C = max_{a,b} sum_j f_j(a, b)
Add a new (correction) feature f_{k+1}(a, b) = C - sum_j f_j(a, b)
37. GIS algorithm
- Compute d_j = E_p~[f_j], j = 1, ..., k+1
- Initialize λ_j^(1) (to any values, e.g., 0)
- Repeat until convergence:
- For each j:
- Compute p^(n)(a | b) under the current weights
- Compute the model expectation E_p(n)[f_j]
- Update λ_j^(n+1) = λ_j^(n) + (1/C) log( d_j / E_p(n)[f_j] )
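A minimal Python sketch of this loop (illustrative only: it assumes binary features given as Python functions, that the correction feature is already included so every event has the same total feature count C, and it uses the approximation from the next slide for the model expectation):

import math

def gis(events, classes, features, num_iters=100):
    """events: list of (a, b) pairs; classes: possible a values; features: list of f(a, b)."""
    N = len(events)
    C = sum(f(events[0][0], events[0][1]) for f in features)  # constant total feature count
    # Observed expectations d_j = (1/N) * sum_i f_j(a_i, b_i)
    d = [sum(f(a, b) for a, b in events) / N for f in features]
    lambdas = [0.0] * len(features)
    for _ in range(num_iters):
        model_exp = [0.0] * len(features)
        for _, b in events:
            # p(a | b) under the current weights
            scores = [math.exp(sum(l * f(a, b) for l, f in zip(lambdas, features)))
                      for a in classes]
            Z = sum(scores)
            for a, s in zip(classes, scores):
                p_ab = s / Z
                for j, f in enumerate(features):
                    model_exp[j] += p_ab * f(a, b) / N
        # GIS update
        for j in range(len(features)):
            if d[j] > 0 and model_exp[j] > 0:
                lambdas[j] += math.log(d[j] / model_exp[j]) / C
    return lambdas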
38. Approximation for calculating feature expectation
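The usual approximation replaces the sum over all contexts b with a sum over the contexts observed in training:

E_p[f_j] = \sum_{a,b} p(a, b)\, f_j(a, b) \;\approx\; \sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_j(a, b) \;=\; \frac{1}{N} \sum_{i=1}^{N} \sum_{a} p(a \mid b_i)\, f_j(a, b_i)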
39. Properties of GIS
- L(p^(n+1)) > L(p^(n))
- The sequence is guaranteed to converge to p*.
- The convergence can be very slow.
- The running time of each iteration is O(NPA):
- N: the training set size
- P: the number of classes
- A: the average number of features that are active for a given event (a, b).
40. Feature selection
41. Feature selection
- Throw in many features and let the machine select the weights
- Manually specify feature templates
- Problem: too many features
- An alternative: a greedy algorithm
- Start with an empty set S
- Add a feature at each iteration
42. Notation
With the feature set S: the model p_S
After adding a feature f: the model p_{S+f}
The gain in the log-likelihood of the training data: G_S(f) = L(p_{S+f}) - L(p_S)
43. Feature selection algorithm (Berger et al., 1996)
- Start with S being empty; thus p_S is uniform.
- Repeat until the gain is small enough:
- For each candidate feature f:
- Compute the model p_{S+f} using IIS
- Calculate the log-likelihood gain
- Choose the feature with the maximal gain, and add it to S
=> Problem: too expensive
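A minimal sketch of the greedy loop; train_model and log_likelihood are hypothetical callables supplied by the caller (e.g., an IIS trainer), not functions defined in the slides:

def greedy_feature_selection(candidates, data, train_model, log_likelihood, min_gain=1e-3):
    """Repeatedly add the candidate feature whose addition most increases
    the training log-likelihood; stop when the best gain is too small."""
    selected = []
    base_ll = log_likelihood(train_model(selected, data), data)
    while True:
        best_f, best_gain = None, min_gain
        for f in candidates:
            if f in selected:
                continue
            gain = log_likelihood(train_model(selected + [f], data), data) - base_ll
            if gain > best_gain:
                best_f, best_gain = f, gain
        if best_f is None:
            break
        selected.append(best_f)
        base_ll += best_gain
    return selected

The inner call to train_model for every candidate is exactly the "too expensive" step; the next slide's approximation avoids it.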
44. Approximating gains (Berger et al., 1996)
- Instead of recalculating all the weights, calculate only the weight of the new feature (holding the other weights fixed).
45. Training a MaxEnt Model
- Scenario 1: no feature selection during training
- Define feature templates
- Create the feature set
- Determine the optimal feature weights via GIS or IIS
- Scenario 2: feature selection during training
- Define feature templates
- Create the candidate feature set S
- At every iteration, choose the feature from S (with max gain) and determine its weight (or choose the top-n features and their weights).
46. Case study
47. POS tagging (Ratnaparkhi, 1996)
- Notation variation:
- f_j(a, b): a = class, b = context
- f_j(h_i, t_i): h_i = history for the ith word, t_i = tag for the ith word
- History: h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})
- Training data:
- Treat it as a list of (h_i, t_i) pairs.
- How many pairs are there?
48. Using a MaxEnt Model
- Modeling
- Training:
- Define feature templates
- Create the feature set
- Determine the optimal feature weights via GIS or IIS
- Decoding
49. Modeling
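In Ratnaparkhi (1996)-style tagging, the trained model defines a conditional distribution over tags given the history, and a tag sequence is scored as a product of these conditionals (a standard formulation, written here for reference):

p(t_i \mid h_i) = \frac{1}{Z(h_i)} \exp\Big(\sum_j \lambda_j f_j(h_i, t_i)\Big), \qquad p(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \prod_{i=1}^{n} p(t_i \mid h_i)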
50. Training step 1: define feature templates
(Table of feature templates, each pairing a predicate on the history h_i with the tag t_i.)
51. Step 2: Create the feature set
- Collect all the features from the training data
- Throw away features that appear fewer than 10 times
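A toy Python sketch of steps 1-2, assuming a few illustrative templates (current word, previous tag, 3-character suffix; the actual template set in Ratnaparkhi (1996) is richer) and the cutoff of 10:

from collections import Counter

def extract_features(history, tag):
    """Instantiate a few illustrative feature templates for one (h_i, t_i) event."""
    word, prev_tag = history["word"], history["prev_tag"]
    return [
        ("word", word, tag),          # current word + tag
        ("prev_tag", prev_tag, tag),  # previous tag + tag
        ("suffix3", word[-3:], tag),  # 3-character suffix + tag
    ]

def build_feature_set(events, cutoff=10):
    """Count instantiated features over the training events and keep the frequent ones."""
    counts = Counter()
    for history, tag in events:
        counts.update(extract_features(history, tag))
    return {f for f, c in counts.items() if c >= cutoff}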
52. Step 3: determine the feature weights
- GIS
- Training time:
- Each iteration: O(NTA)
- N: the training set size
- T: the number of allowable tags
- A: average number of features that are active for a given (h, t)
- About 24 hours on an IBM RS/6000 Model 380.
- How many features?
53. Decoding: Beam search
- Generate tags for w_1, find the top N, and set s_1j accordingly, j = 1, 2, ..., N
- For i = 2 to n (n is the sentence length):
- For j = 1 to N:
- Generate tags for w_i, given s_(i-1)j as the previous tag context
- Append each tag to s_(i-1)j to make a new sequence
- Find the N highest-probability sequences generated above, and set s_ij accordingly, j = 1, ..., N
- Return the highest-probability sequence, s_n1 (a sketch of this procedure is given below).
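A minimal Python sketch of the beam search; tag_probs(sentence, i, prev_tags) is a hypothetical callable (not from the slides) that returns {tag: p(tag | history)} from the trained model:

def beam_search(sentence, tag_probs, beam_size):
    """Return the highest-probability tag sequence using a beam of size beam_size."""
    # Initialize the beam with the top-N tags for the first word.
    beams = [([t], p) for t, p in tag_probs(sentence, 0, []).items()]
    beams = sorted(beams, key=lambda x: -x[1])[:beam_size]
    for i in range(1, len(sentence)):
        candidates = []
        for seq, score in beams:
            # Extend each kept sequence with every possible tag for word i.
            for t, p in tag_probs(sentence, i, seq).items():
                candidates.append((seq + [t], score * p))
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
    return beams[0][0]  # highest-probability sequence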
54. Beam search
55. Viterbi search
56. Decoding (cont)
- Tags for words:
- Known words: use a tag dictionary
- Unknown words: try all possible tags
- Ex: "time flies like an arrow"
- Running time: O(NTAB)
- N: sentence length
- B: beam size
- T: tagset size
- A: average number of features that are active for a given event
57. Experiment results
58. Comparison with other learners
- vs. HMM: MaxEnt uses more context
- vs. SDT: MaxEnt does not split the data
- vs. TBL: MaxEnt is statistical and it provides probability distributions.
59. MaxEnt Summary
- Concept: choose the p* that maximizes entropy while satisfying all the constraints.
- Max likelihood: p* is also the model within a model family that maximizes the log-likelihood of the training data.
- Training: GIS or IIS, which can be slow.
- MaxEnt handles overlapping features well.
- In general, MaxEnt achieves good performance on many NLP tasks.
60. Additional slides
61. Ex4 (cont)
??
62. IIS algorithm
- Compute d_j = E_p~[f_j], j = 1, ..., k+1, and f#(a, b) = sum_j f_j(a, b)
- Initialize λ_j^(1) (to any values, e.g., 0)
- Repeat until convergence:
- For each j:
- Let Δλ_j be the solution to the IIS update equation (see below)
- Update λ_j^(n+1) = λ_j^(n) + Δλ_j
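The update equation that Δλ_j must solve, in the standard IIS form (Della Pietra et al., 1995), with f# the total feature count of an event:

\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_j(a, b)\, \exp\big(\Delta\lambda_j\, f^{\#}(a, b)\big) = E_{\tilde{p}}[f_j], \qquad f^{\#}(a, b) = \sum_{j} f_j(a, b)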
63. Calculating Δλ_j
If f#(a, b) = C is a constant for all (a, b),
then Δλ_j = (1/C) log( d_j / E_p(n)[f_j] ), and
GIS is the same as IIS.
Else
Δλ_j must be calculated numerically (e.g., by Newton's method).