Title: Sequence Learning
1. Sequence Learning
- Sudeshna Sarkar
- 14 Aug 2008
2. Alternative graphical models for part-of-speech tagging
3. Different Models for POS Tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
4. Hidden Markov Model (HMM): Generative Modeling
- Source model: P(Y)
- Noisy channel: P(X | Y)
(Diagram: the label sequence y generates the observation sequence x.)
5. Dependency (1st order)
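The first-order dependency structure can be written out explicitly; this is the standard bigram-HMM factorization implied by the source/noisy-channel decomposition on the previous slide (with y0 a start symbol):

```latex
P(X, Y) = P(Y)\,P(X \mid Y)
        = \prod_{k=1}^{n} P(y_k \mid y_{k-1}) \; \prod_{k=1}^{n} P(x_k \mid y_k)
```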
6. Disadvantage of HMMs (1)
- No rich feature information
- Rich information is required
- When xk is complex
- When data for xk is sparse
- Example: POS tagging
- How to estimate P(wk | tk) for unknown words wk?
- Useful features
- Suffix, e.g., -ed, -tion, -ing, etc.
- Capitalization
- Generative model
- Parameter estimation maximizes the joint likelihood of training examples
7. Generative Models
- Hidden Markov models (HMMs) and stochastic grammars
- Assign a joint probability to paired observation and label sequences
- The parameters are typically trained to maximize the joint likelihood of training examples
8. Generative Models (contd)
- Difficulties and disadvantages
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations
9. Making use of rich domain features
- A learning algorithm is as good as its features.
- There are many useful features to include in a model
- Most of them aren't independent of each other
- Identity of word
- Ends in -shire
- Is capitalized
- Is head of noun phrase
- Is in a list of city names
- Is under node X in WordNet
- Word to left is verb
- Word to left is lowercase
- Is in bold font
- Is in hyperlink anchor
- Other occurrences in doc
10. Problems with a Richer Representation and a Generative Model
- These arbitrary features are not independent
- Overlapping and long-distance dependencies
- Multiple levels of granularity (words, characters)
- Multiple modalities (words, formatting, layout)
- Observations from past and future
- HMMs are generative models of the text
- Generative models do not easily handle these non-independent features. Two choices:
- Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
- Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
11. Discriminative Models
- We would prefer a conditional model P(y | x) instead of P(y, x)
- Can examine features, but is not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.
- Provide the ability to handle many arbitrary features.
12. Locally Normalized Conditional Sequence Models
Maximum Entropy Markov Models (McCallum, Freitag & Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi, 1996); SNoW-based Markov Model (Punyakanok & Roth, 2000)
(Diagram: two panels, "Generative (traditional HMM)" and "Conditional". In the generative model, states S_{t-1}, S_t, S_{t+1} emit observations O_{t-1}, O_t, O_{t+1}; in the conditional model, the observations feed into the state transitions.)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.
13. Locally Normalized Conditional Sequence Models
Maximum Entropy Markov Models (McCallum, Freitag & Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi, 1996); SNoW-based Markov Model (Punyakanok & Roth, 2000)
Or, more generally:
(Diagram: "Generative (traditional HMM)" versus "Conditional"; here each state transition in the conditional model may depend on the entire observation sequence, not just the current observation.)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.
14. Exponential Form for Next-State Function
(Diagram: a black-box classifier maps the previous state s_{t-1} and the current observation to a distribution over next states, using weighted features.)
Overall recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum likelihood (iterative scaling or conjugate gradient); see the formula below.
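The exponential form referred to in the slide title is not reproduced in the transcript; the standard per-state MaxEnt model from McCallum, Freitag & Pereira (2000) has the shape below, where s' is the previous state, o the current observation, and Z(o, s') normalizes over the possible next states s:

```latex
P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\Big( \sum_{k} \lambda_k \, f_k(o, s) \Big)
```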
15. Principle of Maximum Entropy
- The correct distribution P(s, o) is the one that maximizes entropy (uncertainty) subject to constraints
- Constraints represent evidence
- Given k features, constraints have the form shown below, i.e. the model's expectation for each feature should match the observed expectation
- Philosophy: when making inferences on the basis of partial information, any distribution other than the maximum-entropy one would amount to arbitrary assumptions of information that we do not have
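The constraint form the slide alludes to, for features f1, ..., fk, equates the model expectation with the empirical expectation:

```latex
E_{p}[f_i] \;=\; E_{\tilde{p}}[f_i], \qquad i = 1, \dots, k
```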
16. Maximum Entropy Classifier
- Conditional model p(y | x)
- Does not try to model p(x)
- Can work with complicated input features, since we do not need to model dependencies between them.
- Principle of maximum entropy
- We want a classifier that
- Matches feature constraints from training data
- Otherwise makes predictions that maximize entropy
- There is a unique exponential-family distribution that meets these criteria.
- Maximum entropy classifier
- p(y | x; θ): inference and learning
17. Indicator Features
- Feature functions f(x, y)
- f1(w, y) = 1 if the word is "Sarani" and y = Location
- f2(w, y) = 1 if the previous tag is Per-begin, the current word's suffix is "an", and y = Per-end
(A small sketch of such features is given below.)
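As a concrete illustration, here is a minimal Python sketch of indicator features in the spirit of f1 and f2 plugged into a MaxEnt classifier; the feature conditions, weights, and label set are made up for illustration:

```python
import math

# Hypothetical indicator features in the spirit of f1 and f2 above.
def f1(x, y):
    # fires when the current word is "Sarani" and the label is Location
    return 1 if x["word"] == "Sarani" and y == "Location" else 0

def f2(x, y):
    # fires when the previous tag is Per-begin, the word ends in "an",
    # and the label is Per-end
    return 1 if (x["prev_tag"] == "Per-begin"
                 and x["word"].endswith("an")
                 and y == "Per-end") else 0

FEATURES = [f1, f2]

def p_y_given_x(x, weights, labels):
    """MaxEnt classifier: p(y | x) = exp(sum_k w_k f_k(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Example with made-up weights and labels:
x = {"word": "Sarani", "prev_tag": "O"}
print(p_y_given_x(x, weights=[1.2, 0.8], labels=["Location", "Per-end", "O"]))
```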
18. Problems with the MaxEnt classifier
- It makes decisions at each point independently
19. MEMM
- Use a series of maximum entropy classifiers that know the previous label
- Define a Viterbi model of inference
- P(y | x) = ∏t P_{y_{t-1}}(y_t | x)
- Finding the most likely label sequence given an input sequence, and learning
- Combines the advantages of HMMs and maximum entropy (a decoding sketch follows below).
- But there is a problem.
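A minimal sketch of the Viterbi-style inference for P(y | x) = ∏t P_{y_{t-1}}(y_t | x), assuming hypothetical per-state classifiers `p_start` and `p_next` (e.g. MaxEnt models trained as above) that return {label: probability} dictionaries:

```python
import math

def memm_viterbi(xs, labels, p_start, p_next):
    """Most likely label sequence under P(y | x) = prod_t P_{y_{t-1}}(y_t | x).

    p_start(x) and p_next(prev_label, x) are assumed to be already-trained
    MaxEnt classifiers returning {label: probability} dictionaries.
    """
    # delta[y] = best log-probability of any label prefix ending in y
    delta = {y: math.log(p_start(xs[0]).get(y, 1e-12)) for y in labels}
    backpointers = []
    for x in xs[1:]:
        new_delta, ptr = {}, {}
        for y in labels:
            scores = {yp: delta[yp] + math.log(p_next(yp, x).get(y, 1e-12))
                      for yp in labels}
            best_prev = max(scores, key=scores.get)
            new_delta[y], ptr[y] = scores[best_prev], best_prev
        delta, backpointers = new_delta, backpointers + [ptr]
    # follow the back-pointers from the best final label
    y = max(delta, key=delta.get)
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```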
20. Maximum Entropy Markov Model
Label bias problem: the probabilities of the transitions leaving any given state must sum to one
21.
- In some state-space configurations, MEMMs essentially completely ignore the inputs
- Example of the label bias problem
- This is not a problem for HMMs, because the input is generated by the model.
22. Label Bias Example
(Diagram: a finite-state model in which state 0 branches on "r" to state 1, the "rib" path, with P = 0.75, or to state 4, the "rob" path, with P = 0.25.)
- Given: "rib" 3 times, "rob" 1 time
- Training: p(1 | 0, r) = 0.75, p(4 | 0, r) = 0.25
- Inference: see the worked example below
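Worked through on the standard rib/rob automaton (as on the later label-bias slides, states 1 and 4 are the two branches out of state 0, and every later state has a single outgoing transition, so per-state normalization gives those transitions probability 1 regardless of the observation):

```latex
P(\text{rib-path} \mid \texttt{rob}) = p(1 \mid 0, r) \cdot 1 \cdot 1 = 0.75
\qquad
P(\text{rob-path} \mid \texttt{rob}) = p(4 \mid 0, r) \cdot 1 \cdot 1 = 0.25
```

So Viterbi decoding prefers the "rib" labeling even when the middle observation is "o".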
23. Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs HMMs
(Diagram: the two chain structures over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: in the HMM the states generate the observations; in the CMM/MEMM the observations condition the state transitions.)
24. Random Field
25. CRF
- CRFs have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allows some transitions to vote more strongly than others, depending on the corresponding observations
26. Graphical comparison among HMMs, MEMMs, and CRFs
(Diagram: HMM, MEMM, and CRF graphical structures.)
27. Machine Learning: a Panacea?
- A machine learning method is as good as the feature set it uses
- Shift of focus from linguistic processing to feature set design
28. Features to use in IE
- Features are task dependent
- Good feature identification requires good knowledge of the domain, combined with automatic methods of feature selection.
29. Feature Examples
- Extraction of proteins and their interactions from biomedical literature (Mooney)
- For each token, they take the following as features (a sketch follows below):
- Current token
- Last 2 tokens and next 2 tokens
- Output of a dictionary-based tagger for these 5 tokens
- Suffixes for each of the 5 tokens (last 1, 2, and 3 characters)
- Class labels for the last 2 tokens
Example sentence: "Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein."
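A rough Python sketch of this per-token feature extraction (current token, a ±2 token window, suffixes, and the previous two class labels); the dictionary-tagger outputs are omitted, and suffixes are computed only for the current token to keep the example short:

```python
def token_features(tokens, i, prev_labels):
    """Features for token i, roughly following the list above.

    prev_labels holds the class labels already assigned to the previous two
    tokens; the dictionary-based tagger outputs are left out for brevity.
    """
    feats = {"token": tokens[i]}
    # last 2 and next 2 tokens, padded at the sentence boundaries
    for off in (-2, -1, 1, 2):
        j = i + off
        feats["token[%+d]" % off] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    # suffixes (last 1, 2, 3 characters) of the current token
    for n in (1, 2, 3):
        feats["suffix%d" % n] = tokens[i][-n:]
    # class labels of the previous two tokens
    feats["label[-1]"], feats["label[-2]"] = prev_labels
    return feats

sentence = ("Two potentially oncogenic cyclins , cyclin A and cyclin D1 , "
            "share common properties of subunit configuration").split()
print(token_features(sentence, 3, prev_labels=("O", "O")))
```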
30. More Feature Examples
- line, sentence, or paragraph features
- length
- is centered in page
- percent of non-alphabetics
- white-space aligns with next line
- containing sentence has two verbs
- grammatically contains a question
- contains links to authoritative pages
- emissions that are uncountable
- features at multiple levels of granularity
- Example word features
- identity of word
- is in all caps
- ends in -ski
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet or Cyc
- is in bold font
- is in hyperlink anchor
- features of past and future
- last person name was female
- next two words are "and Associates"
31. Indicator Features
- They're a little different from the typical supervised ML approach
- Limited to binary values
- Think of a feature as being on or off, rather than as a feature with a value
- Feature values are relative to an object/class pair, rather than being a function of the object alone.
- Typically there are lots and lots of features (hundreds of thousands of features is quite common.)
32. Feature Templates
- Example template: next word
- A feature template gives rise to |V| × |T| binary features
- Curse of dimensionality
- Overfitting
33. Feature Selection vs Extraction
- Feature selection: choosing the k < d important features, ignoring the remaining d − k
- Subset selection algorithms
- Feature extraction: project the original xi, i = 1, ..., d dimensions to new k < d dimensions zj, j = 1, ..., k
- Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
34. Feature Reduction
- Example domain: NER in Hindi (Sujan Saha)
- Feature value selection
- Feature value clustering
ACL 2008: Sujan Kumar Saha, Pabitra Mitra, Sudeshna Sarkar. Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER.
36.
- Better approach
- Discriminative model which models P(y | x) directly
- Maximize the conditional likelihood of training examples
37. Maximum Entropy Modeling
- N-gram model: probabilities depend on the previous few tokens.
- We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.)
- MaxEnt combines these features in a probabilistic model.
- The given features provide constraints on the model.
- We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy these constraints.
38. Maximum Entropy Markov Model
- Discriminative sub-models
- Unify the two parameters of the generative model into one conditional model
- Two parameters in the generative model: the parameter of the source model and the parameter of the noisy channel
- Unified conditional model
- Employ the maximum entropy principle
- → Maximum Entropy Markov Model
39. General Maximum Entropy Principle
- Model
- Model the distribution P(Y | X) with a set of features f1, f2, ..., fl defined on X and Y
- Idea
- Collect information about the features from training data
- Principle
- Model what is known
- Assume nothing else
- → Flattest distribution
- → Distribution with the maximum entropy
40. Example
- Example from (Berger et al., 1996)
- Model the translation of the word "in" from English to French
- Need to model P(French word)
- Constraints
- (1) Possible translations: dans, en, à, au cours de, pendant
- (2) dans or en is used 30% of the time
- (3) dans or à is used 50% of the time
(A worked maximum-entropy solution under a subset of these constraints is sketched below.)
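For intuition, with only constraints (1) and (2) the maximum-entropy solution can be written down by hand: it spreads probability as evenly as the constraints allow (adding constraint (3) requires a numerical solution):

```latex
p(\text{dans}) = p(\text{en}) = \frac{0.3}{2} = 0.15,
\qquad
p(\text{\`a}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{0.7}{3} \approx 0.233
```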
41. Features
- Features
- 0-1 indicator functions
- 1 if (x, y) satisfies a predefined condition
- 0 if not
- Example: POS tagging
42. Constraints
- Empirical information
- Statistics from the training data T
- Expected value
- From the distribution P(Y | X) we want to model
43. Maximum Entropy Objective
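The objective itself is not reproduced in the transcript; the standard statement of the conditional maximum-entropy problem and its exponential-family (log-linear) solution is:

```latex
% Maximize conditional entropy subject to the feature constraints
\max_{p}\; H(p) = -\sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x) \log p(y \mid x)
\quad \text{s.t.} \quad E_{p}[f_i] = E_{\tilde{p}}[f_i] \;\; \forall i
% The solution is the log-linear model
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_{\lambda}(x) = \sum_{y} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
```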
44. Dual Problem
- Dual problem
- Conditional model
- Maximum likelihood of the conditional data
- Solution
- Improved iterative scaling (IIS) (Berger et al., 1996)
- Generalized iterative scaling (GIS) (McCallum et al., 2000)
45. Maximum Entropy Markov Model
- Use the maximum entropy approach to model the 1st-order dependency
- Features
- Basic features (like the parameters in an HMM)
- Bigram (1st order) or trigram (2nd order) features from the source model
- State-output pair features (Xk = xk, Yk = yk)
- Advantage: can incorporate other advanced features of (xk, yk)
46. HMM vs MEMM (1st order)
(Diagram: dependency structures of the HMM and the Maximum Entropy Markov Model (MEMM); the corresponding factorizations are given below.)
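The two first-order factorizations being compared in the figure:

```latex
\text{HMM:}\quad  P(X, Y) = \prod_{k} P(y_k \mid y_{k-1})\, P(x_k \mid y_k)
\qquad
\text{MEMM:}\quad P(Y \mid X) = \prod_{k} P(y_k \mid y_{k-1}, x_k)
```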
47. Performance in POS Tagging
- POS tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al., 2001)
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
48. ME applications
- Part-of-speech (POS) tagging (Ratnaparkhi, 1996)
- P(POS tag | context)
- Information sources
- Word window (4)
- Word features (prefix, suffix, capitalization)
- Previous POS tags
49. ME applications
- Abbreviation expansion (Pakhomov, 2002)
- Information sources
- Word window (4)
- Document title
- Word sense disambiguation (WSD) (Chao & Dyer, 2002)
- Information sources
- Word window (4)
- Structurally related words (4)
- Sentence boundary detection (Reynar & Ratnaparkhi, 1997)
- Information sources
- Token features (prefix, suffix, capitalization, abbreviation)
- Word window (2)
50. Solution
- Global optimization
- Optimize parameters in a global model simultaneously, not in sub-models separately
- Alternatives
- Conditional random fields
- Application of the perceptron algorithm
51. Why ME?
- Advantages
- Combine multiple knowledge sources
- Local
- Word prefix, suffix, capitalization (POS: Ratnaparkhi, 1996)
- Word POS, POS class, suffix (WSD: Chao & Dyer, 2002)
- Token prefix, suffix, capitalization, abbreviation (sentence boundary: Reynar & Ratnaparkhi, 1997)
- Global
- N-grams (Rosenfeld, 1997)
- Word window
- Document title (Pakhomov, 2002)
- Structurally related words (Chao & Dyer, 2002)
- Sentence length, conventional lexicon (Och & Ney, 2002)
- Combine dependent knowledge sources
52. Why ME?
- Advantages
- Add additional knowledge sources
- Implicit smoothing
- Disadvantages
- Computational
- Expected value at each iteration
- Normalizing constant
- Overfitting
- Feature selection
- Cutoffs
- Basic Feature Selection (Berger et al., 1996)
53. Conditional Models
- Conditional probability P(label sequence y | observation sequence x), rather than the joint probability P(y, x)
- Specify the probability of possible label sequences given an observation sequence
- Allow arbitrary, non-independent features on the observation sequence X
- The probability of a transition between labels may depend on past and future observations
- Relax the strong independence assumptions made by generative models
54. Discriminative Models: Maximum Entropy Markov Models (MEMMs)
- Exponential model
- Given a training set X with label sequences Y:
- Train a model θ that maximizes P(Y | X, θ)
- For a new data sequence x, the predicted labels y maximize P(y | x, θ)
- Notice the per-state normalization
55. MEMMs (contd)
- MEMMs have all the advantages of conditional models
- Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass")
- Subject to the label bias problem
- Bias toward states with fewer outgoing transitions
56. Label Bias Problem
- P(1 and 2 | ro) = P(2 | 1 and ro) · P(1 | ro) = P(2 | 1 and o) · P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) · P(1 | ri) = P(2 | 1 and i) · P(1 | r)
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- In the training data, label value 2 is the only label value observed after label value 1
- Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
- Per-state normalization does not allow the required expectation
57. Solving the Label Bias Problem
- Change the state-transition structure of the model
- Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
- Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
58. Random Field
59. Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allows some transitions to vote more strongly than others, depending on the corresponding observations
60. Definition of CRFs
X is a random variable over data sequences to be labeled; Y is a random variable over the corresponding label sequences. (The formal definition is reproduced below.)
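The formal definition from Lafferty, McCallum & Pereira (2001): (X, Y) is a conditional random field when, conditioned on X, the variables Y_v obey the Markov property with respect to the graph G over Y:

```latex
p(Y_v \mid X,\, Y_w,\, w \neq v) \;=\; p(Y_v \mid X,\, Y_w,\, w \sim v)
```

where w ∼ v means that w and v are neighbours in G.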
61. Example of CRFs
62. Graphical comparison among HMMs, MEMMs, and CRFs
(Diagram: HMM, MEMM, and CRF graphical structures.)
63. Conditional Distribution
64. Conditional Distribution (contd)
- CRFs use an observation-dependent normalization Z(x) for the conditional distributions, as shown below
Z(x) is a normalization over the data sequence x
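For a linear-chain CRF this conditional distribution (the formula itself was not preserved in the transcript) takes the form:

```latex
p_{\lambda}(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
```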
65. Parameter Estimation for CRFs
- The paper provided iterative scaling algorithms
- These turn out to be very inefficient
- Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient
66. Training of CRFs (from Prof. Dietterich)
- Then, take the derivative of the above equation
- For training, the first two terms are easy to get.
- For example, for each λk, fk is a sequence of Boolean values over the training positions, such as 00101110100111.
- Its empirical count is just the total number of 1s in the sequence.
- The hardest part is how to calculate Z(x); a forward-algorithm sketch follows below.
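Z(x) does not have to be computed by brute-force enumeration: for a linear chain it falls out of the same forward recursion used for HMMs. A minimal NumPy/SciPy sketch, assuming the feature weights have already been folded into start and transition log-scores:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(start_scores, trans_scores):
    """log Z(x) for a linear-chain CRF via the forward recursion.

    start_scores[j]       : weighted feature sum for y_0 = j
    trans_scores[t, i, j] : weighted feature sum for the transition
                            y_t = i -> y_{t+1} = j (observation-dependent
                            features folded in), shape (T-1, K, K).
    Brute force sums over K**T label sequences; the recursion is O(T K^2).
    """
    alpha = start_scores                                  # log alpha at t = 0
    for t in range(trans_scores.shape[0]):
        # alpha_{t+1}[j] = logsumexp_i( alpha_t[i] + trans[t, i, j] )
        alpha = logsumexp(alpha[:, None] + trans_scores[t], axis=0)
    return logsumexp(alpha)

# Sanity check against brute-force enumeration on a tiny random example.
K, T = 3, 4
rng = np.random.default_rng(0)
start = rng.normal(size=K)
trans = rng.normal(size=(T - 1, K, K))
brute = logsumexp([start[y[0]] + sum(trans[t - 1, y[t - 1], y[t]] for t in range(1, T))
                   for y in np.ndindex(*([K] * T))])
assert np.isclose(log_partition(start, trans), brute)
```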
67. Training of CRFs (from Prof. Dietterich) (contd)
68. POS Tagging Experiments
69. POS Tagging Experiments (contd)
- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Added a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- OOV: out-of-vocabulary (not observed in the training set)
70. Summary
- Locally normalized discriminative models (MEMMs) are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well and demonstrate good performance