Title: Sequence Learning
1. Sequence Learning
- Sudeshna Sarkar
- 14 Aug 2008
2. Alternative graphical models for part-of-speech tagging
3. Different Models for POS Tagging
- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields
4. Hidden Markov Model (HMM): Generative Modeling
- Source model: P(Y)
- Noisy channel: P(X | Y)
(Diagram: the label sequence y generates the observation sequence x.)
5. Dependency (1st order)
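The first-order dependency structure can be written out explicitly; this is the standard bigram-HMM factorization implied by the source/noisy-channel decomposition on the previous slide (with y0 a start symbol):

```latex
P(X, Y) = P(Y)\,P(X \mid Y)
        = \prod_{k=1}^{n} P(y_k \mid y_{k-1}) \; \prod_{k=1}^{n} P(x_k \mid y_k)
```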
6. Disadvantage of HMMs (1)
- No rich feature information
- Rich information is required
- When xk is complex
- When data for xk is sparse
- Example: POS tagging
- How to estimate P(wk | tk) for unknown words wk?
- Useful features
- Suffix, e.g., -ed, -tion, -ing, etc.
- Capitalization
- Generative model
- Parameter estimation maximizes the joint likelihood of training examples
7. Generative Models
- Hidden Markov models (HMMs) and stochastic grammars
- Assign a joint probability to paired observation and label sequences
- The parameters are typically trained to maximize the joint likelihood of training examples
8. Generative Models (contd)
- Difficulties and disadvantages
- Need to enumerate all possible observation sequences
- Not practical to represent multiple interacting features or long-range dependencies of the observations
- Very strict independence assumptions on the observations
9. Making use of rich domain features
- A learning algorithm is as good as its features.
- There are many useful features to include in a model
- Most of them aren't independent of each other
- Identity of word
- Ends in -shire
- Is capitalized
- Is head of noun phrase
- Is in a list of city names
- Is under node X in WordNet
- Word to left is verb
- Word to left is lowercase
- Is in bold font
- Is in hyperlink anchor
- Other occurrences in doc
10. Problems with a Richer Representation and a Generative Model
- These arbitrary features are not independent
- Overlapping and long-distance dependencies
- Multiple levels of granularity (words, characters)
- Multiple modalities (words, formatting, layout)
- Observations from past and future
- HMMs are generative models of the text
- Generative models do not easily handle these non-independent features. Two choices:
- Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
- Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
11. Discriminative Models
- We would prefer a conditional model P(y | x) instead of P(y, x)
- Can examine features, but is not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.
- Provide the ability to handle many arbitrary features.
12. Locally Normalized Conditional Sequence Models
Maximum Entropy Markov Models (McCallum, Freitag & Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi, 1996); SNoW-based Markov Model (Punyakanok & Roth, 2000)
(Diagram: two panels, "Generative (traditional HMM)" and "Conditional". In the generative model, states S_{t-1}, S_t, S_{t+1} emit observations O_{t-1}, O_t, O_{t+1}; in the conditional model, the observations feed into the state transitions.)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.
13. Locally Normalized Conditional Sequence Models
Maximum Entropy Markov Models (McCallum, Freitag & Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi, 1996); SNoW-based Markov Model (Punyakanok & Roth, 2000)
Or, more generally:
(Diagram: "Generative (traditional HMM)" versus "Conditional"; here each state transition in the conditional model may depend on the entire observation sequence, not just the current observation.)
Standard belief propagation: the forward-backward procedure. Viterbi and Baum-Welch follow naturally.
14. Exponential Form for Next-State Function
(Diagram: a black-box classifier maps the previous state s_{t-1} and the current observation to a distribution over next states, using weighted features.)
Overall recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum likelihood (iterative scaling or conjugate gradient); see the formula below.
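The exponential form referred to in the slide title is not reproduced in the transcript; the standard per-state MaxEnt model from McCallum, Freitag & Pereira (2000) has the shape below, where s' is the previous state, o the current observation, and Z(o, s') normalizes over the possible next states s:

```latex
P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\Big( \sum_{k} \lambda_k \, f_k(o, s) \Big)
```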
15. Principle of Maximum Entropy
- The correct distribution P(s, o) is the one that maximizes entropy (uncertainty) subject to constraints
- Constraints represent evidence
- Given k features, constraints have the form shown below, i.e. the model's expectation for each feature should match the observed expectation
- Philosophy: when making inferences on the basis of partial information, any distribution other than the maximum-entropy one would amount to arbitrary assumptions of information that we do not have
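The constraint form the slide alludes to, for features f1, ..., fk, equates the model expectation with the empirical expectation:

```latex
E_{p}[f_i] \;=\; E_{\tilde{p}}[f_i], \qquad i = 1, \dots, k
```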
16. Maximum Entropy Classifier
- Conditional model p(y | x)
- Does not try to model p(x)
- Can work with complicated input features, since we do not need to model dependencies between them.
- Principle of maximum entropy
- We want a classifier that
- Matches feature constraints from training data
- Otherwise makes predictions that maximize entropy
- There is a unique exponential-family distribution that meets these criteria.
- Maximum entropy classifier
- p(y | x; θ): inference and learning
17. Indicator Features
- Feature functions f(x, y)
- f1(w, y) = 1 if the word is "Sarani" and y = Location
- f2(w, y) = 1 if the previous tag is Per-begin, the current word's suffix is "an", and y = Per-end
(A small sketch of such features is given below.)
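As a concrete illustration, here is a minimal Python sketch of indicator features in the spirit of f1 and f2 plugged into a MaxEnt classifier; the feature conditions, weights, and label set are made up for illustration:

```python
import math

# Hypothetical indicator features in the spirit of f1 and f2 above.
def f1(x, y):
    # fires when the current word is "Sarani" and the label is Location
    return 1 if x["word"] == "Sarani" and y == "Location" else 0

def f2(x, y):
    # fires when the previous tag is Per-begin, the word ends in "an",
    # and the label is Per-end
    return 1 if (x["prev_tag"] == "Per-begin"
                 and x["word"].endswith("an")
                 and y == "Per-end") else 0

FEATURES = [f1, f2]

def p_y_given_x(x, weights, labels):
    """MaxEnt classifier: p(y | x) = exp(sum_k w_k f_k(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(weights, FEATURES)))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

# Example with made-up weights and labels:
x = {"word": "Sarani", "prev_tag": "O"}
print(p_y_given_x(x, weights=[1.2, 0.8], labels=["Location", "Per-end", "O"]))
```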
18. Problems with the MaxEnt classifier
- It makes decisions at each point independently
19. MEMM
- Use a series of maximum entropy classifiers that know the previous label
- Define a Viterbi model of inference
- P(y | x) = ∏t P_{y_{t-1}}(y_t | x)
- Finding the most likely label sequence given an input sequence, and learning
- Combines the advantages of HMMs and maximum entropy (a decoding sketch follows below).
- But there is a problem.
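A minimal sketch of the Viterbi-style inference for P(y | x) = ∏t P_{y_{t-1}}(y_t | x), assuming hypothetical per-state classifiers `p_start` and `p_next` (e.g. MaxEnt models trained as above) that return {label: probability} dictionaries:

```python
import math

def memm_viterbi(xs, labels, p_start, p_next):
    """Most likely label sequence under P(y | x) = prod_t P_{y_{t-1}}(y_t | x).

    p_start(x) and p_next(prev_label, x) are assumed to be already-trained
    MaxEnt classifiers returning {label: probability} dictionaries.
    """
    # delta[y] = best log-probability of any label prefix ending in y
    delta = {y: math.log(p_start(xs[0]).get(y, 1e-12)) for y in labels}
    backpointers = []
    for x in xs[1:]:
        new_delta, ptr = {}, {}
        for y in labels:
            scores = {yp: delta[yp] + math.log(p_next(yp, x).get(y, 1e-12))
                      for yp in labels}
            best_prev = max(scores, key=scores.get)
            new_delta[y], ptr[y] = scores[best_prev], best_prev
        delta, backpointers = new_delta, backpointers + [ptr]
    # follow the back-pointers from the best final label
    y = max(delta, key=delta.get)
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```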
20. Maximum Entropy Markov Model
Label bias problem: the probabilities of the transitions leaving any given state must sum to one
21.
- In some state-space configurations, MEMMs essentially completely ignore the inputs
- Example of the label bias problem
- This is not a problem for HMMs, because the input is generated by the model.
22. Label Bias Example
(Diagram: a finite-state model in which state 0 branches on "r" to state 1, the "rib" path, with P = 0.75, or to state 4, the "rob" path, with P = 0.25.)
- Given: "rib" 3 times, "rob" 1 time
- Training: p(1 | 0, r) = 0.75, p(4 | 0, r) = 0.25
- Inference: see the worked example below
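Worked through on the standard rib/rob automaton (as on the later label-bias slides, states 1 and 4 are the two branches out of state 0, and every later state has a single outgoing transition, so per-state normalization gives those transitions probability 1 regardless of the observation):

```latex
P(\text{rib-path} \mid \texttt{rob}) = p(1 \mid 0, r) \cdot 1 \cdot 1 = 0.75
\qquad
P(\text{rob-path} \mid \texttt{rob}) = p(4 \mid 0, r) \cdot 1 \cdot 1 = 0.25
```

So Viterbi decoding prefers the "rib" labeling even when the middle observation is "o".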
23. Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs HMMs
(Diagram: the two chain structures over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: in the HMM the states generate the observations; in the CMM/MEMM the observations condition the state transitions.)
24. Random Field
25. CRF
- CRFs have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allows some transitions to vote more strongly than others, depending on the corresponding observations
26. Graphical comparison among HMMs, MEMMs, and CRFs
(Diagram: HMM, MEMM, and CRF graphical structures.)
27. Machine Learning: a Panacea?
- A machine learning method is as good as the feature set it uses
- Shift of focus from linguistic processing to feature set design
28. Features to use in IE
- Features are task dependent
- Good feature identification requires good knowledge of the domain, combined with automatic methods of feature selection.
29. Feature Examples
- Extraction of proteins and their interactions from biomedical literature (Mooney)
- For each token, they take the following as features (a sketch follows below):
- Current token
- Last 2 tokens and next 2 tokens
- Output of a dictionary-based tagger for these 5 tokens
- Suffixes for each of the 5 tokens (last 1, 2, and 3 characters)
- Class labels for the last 2 tokens
Example sentence: "Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein."
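A rough Python sketch of this per-token feature extraction (current token, a ±2 token window, suffixes, and the previous two class labels); the dictionary-tagger outputs are omitted, and suffixes are computed only for the current token to keep the example short:

```python
def token_features(tokens, i, prev_labels):
    """Features for token i, roughly following the list above.

    prev_labels holds the class labels already assigned to the previous two
    tokens; the dictionary-based tagger outputs are left out for brevity.
    """
    feats = {"token": tokens[i]}
    # last 2 and next 2 tokens, padded at the sentence boundaries
    for off in (-2, -1, 1, 2):
        j = i + off
        feats["token[%+d]" % off] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    # suffixes (last 1, 2, 3 characters) of the current token
    for n in (1, 2, 3):
        feats["suffix%d" % n] = tokens[i][-n:]
    # class labels of the previous two tokens
    feats["label[-1]"], feats["label[-2]"] = prev_labels
    return feats

sentence = ("Two potentially oncogenic cyclins , cyclin A and cyclin D1 , "
            "share common properties of subunit configuration").split()
print(token_features(sentence, 3, prev_labels=("O", "O")))
```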
30. More Feature Examples
- line, sentence, or paragraph features
- length
- is centered in page
- percent of non-alphabetics
- white-space aligns with next line
- containing sentence has two verbs
- grammatically contains a question
- contains links to authoritative pages
- emissions that are uncountable
- features at multiple levels of granularity
- Example word features
- identity of word
- is in all caps
- ends in -ski
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet or Cyc
- is in bold font
- is in hyperlink anchor
- features of past and future
- last person name was female
- next two words are "and Associates"
31. Indicator Features
- They're a little different from the typical supervised ML approach
- Limited to binary values
- Think of a feature as being on or off, rather than as a feature with a value
- Feature values are relative to an object/class pair, rather than being a function of the object alone.
- Typically there are lots and lots of features (hundreds of thousands of features is quite common.)
32. Feature Templates
- Example template: next word
- A feature template gives rise to |V| × |T| binary features
- Curse of dimensionality
- Overfitting
33. Feature Selection vs Extraction
- Feature selection: choosing the k < d important features, ignoring the remaining d − k
- Subset selection algorithms
- Feature extraction: project the original xi, i = 1, ..., d dimensions to new k < d dimensions zj, j = 1, ..., k
- Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
34. Feature Reduction
- Example domain: NER in Hindi (Sujan Saha)
- Feature value selection
- Feature value clustering
ACL 2008: Sujan Kumar Saha, Pabitra Mitra, Sudeshna Sarkar. Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER.
36.
- Better approach
- Discriminative model which models P(y | x) directly
- Maximize the conditional likelihood of training examples
37. Maximum Entropy Modeling
- N-gram model: probabilities depend on the previous few tokens.
- We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.)
- MaxEnt combines these features in a probabilistic model.
- The given features provide constraints on the model.
- We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy these constraints.
38. Maximum Entropy Markov Model
- Discriminative sub-models
- Unify the two parameters of the generative model into one conditional model
- Two parameters in the generative model: the parameter of the source model and the parameter of the noisy channel
- Unified conditional model
- Employ the maximum entropy principle
- → Maximum Entropy Markov Model
39. General Maximum Entropy Principle
- Model
- Model the distribution P(Y | X) with a set of features f1, f2, ..., fl defined on X and Y
- Idea
- Collect information about the features from training data
- Principle
- Model what is known
- Assume nothing else
- → Flattest distribution
- → Distribution with the maximum entropy
40. Example
- Example from (Berger et al., 1996)
- Model the translation of the word "in" from English to French
- Need to model P(French word)
- Constraints
- (1) Possible translations: dans, en, à, au cours de, pendant
- (2) dans or en is used 30% of the time
- (3) dans or à is used 50% of the time
(A worked maximum-entropy solution under a subset of these constraints is sketched below.)
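For intuition, with only constraints (1) and (2) the maximum-entropy solution can be written down by hand: it spreads probability as evenly as the constraints allow (adding constraint (3) requires a numerical solution):

```latex
p(\text{dans}) = p(\text{en}) = \frac{0.3}{2} = 0.15,
\qquad
p(\text{\`a}) = p(\text{au cours de}) = p(\text{pendant}) = \frac{0.7}{3} \approx 0.233
```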
41. Features
- Features
- 0-1 indicator functions
- 1 if (x, y) satisfies a predefined condition
- 0 if not
- Example: POS tagging
42. Constraints
- Empirical information
- Statistics from the training data T
- Expected value
- From the distribution P(Y | X) we want to model
43. Maximum Entropy Objective
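The objective itself is not reproduced in the transcript; the standard statement of the conditional maximum-entropy problem and its exponential-family (log-linear) solution is:

```latex
% Maximize conditional entropy subject to the feature constraints
\max_{p}\; H(p) = -\sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x) \log p(y \mid x)
\quad \text{s.t.} \quad E_{p}[f_i] = E_{\tilde{p}}[f_i] \;\; \forall i
% The solution is the log-linear model
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
\qquad
Z_{\lambda}(x) = \sum_{y} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
```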
44. Dual Problem
- Dual problem
- Conditional model
- Maximum likelihood of the conditional data
- Solution
- Improved iterative scaling (IIS) (Berger et al., 1996)
- Generalized iterative scaling (GIS) (McCallum et al., 2000)
45. Maximum Entropy Markov Model
- Use the maximum entropy approach to model the 1st-order dependency
- Features
- Basic features (like the parameters in an HMM)
- Bigram (1st order) or trigram (2nd order) features from the source model
- State-output pair features (Xk = xk, Yk = yk)
- Advantage: can incorporate other advanced features of (xk, yk)
46. HMM vs MEMM (1st order)
(Diagram: dependency structures of the HMM and the Maximum Entropy Markov Model (MEMM); the corresponding factorizations are given below.)
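The two first-order factorizations being compared in the figure:

```latex
\text{HMM:}\quad  P(X, Y) = \prod_{k} P(y_k \mid y_{k-1})\, P(x_k \mid y_k)
\qquad
\text{MEMM:}\quad P(Y \mid X) = \prod_{k} P(y_k \mid y_{k-1}, x_k)
```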
47. Performance in POS Tagging
- POS tagging
- Data set: WSJ
- Features
- HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al., 2001)
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
48. ME applications
- Part-of-speech (POS) tagging (Ratnaparkhi, 1996)
- P(POS tag | context)
- Information sources
- Word window (4)
- Word features (prefix, suffix, capitalization)
- Previous POS tags
49. ME applications
- Abbreviation expansion (Pakhomov, 2002)
- Information sources
- Word window (4)
- Document title
- Word sense disambiguation (WSD) (Chao & Dyer, 2002)
- Information sources
- Word window (4)
- Structurally related words (4)
- Sentence boundary detection (Reynar & Ratnaparkhi, 1997)
- Information sources
- Token features (prefix, suffix, capitalization, abbreviation)
- Word window (2)
50. Solution
- Global optimization
- Optimize parameters in a global model simultaneously, not in sub-models separately
- Alternatives
- Conditional random fields
- Application of the perceptron algorithm
51. Why ME?
- Advantages
- Combine multiple knowledge sources
- Local
- Word prefix, suffix, capitalization (POS: Ratnaparkhi, 1996)
- Word POS, POS class, suffix (WSD: Chao & Dyer, 2002)
- Token prefix, suffix, capitalization, abbreviation (sentence boundary: Reynar & Ratnaparkhi, 1997)
- Global
- N-grams (Rosenfeld, 1997)
- Word window
- Document title (Pakhomov, 2002)
- Structurally related words (Chao & Dyer, 2002)
- Sentence length, conventional lexicon (Och & Ney, 2002)
- Combine dependent knowledge sources
52. Why ME?
- Advantages
- Add additional knowledge sources
- Implicit smoothing
- Disadvantages
- Computational
- Expected value at each iteration
- Normalizing constant
- Overfitting
- Feature selection
- Cutoffs
- Basic Feature Selection (Berger et al., 1996)
53. Conditional Models
- Conditional probability P(label sequence y | observation sequence x), rather than the joint probability P(y, x)
- Specify the probability of possible label sequences given an observation sequence
- Allow arbitrary, non-independent features on the observation sequence X
- The probability of a transition between labels may depend on past and future observations
- Relax the strong independence assumptions made by generative models
54. Discriminative Models: Maximum Entropy Markov Models (MEMMs)
- Exponential model
- Given a training set X with label sequences Y:
- Train a model θ that maximizes P(Y | X, θ)
- For a new data sequence x, the predicted labels y maximize P(y | x, θ)
- Notice the per-state normalization
55. MEMMs (contd)
- MEMMs have all the advantages of conditional models
- Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass")
- Subject to the label bias problem
- Bias toward states with fewer outgoing transitions
56. Label Bias Problem
- P(1 and 2 | ro) = P(2 | 1 and ro) · P(1 | ro) = P(2 | 1 and o) · P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) · P(1 | ri) = P(2 | 1 and i) · P(1 | r)
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- In the training data, label value 2 is the only label value observed after label value 1
- Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
- Per-state normalization does not allow the required expectation
57. Solving the Label Bias Problem
- Change the state-transition structure of the model
- Not always practical to change the set of states
- Start with a fully-connected model and let the training procedure figure out a good structure
- Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)
58. Random Field
59. Conditional Random Fields (CRFs)
- CRFs have all the advantages of MEMMs without the label bias problem
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
- Undirected acyclic graph
- Allows some transitions to vote more strongly than others, depending on the corresponding observations
60. Definition of CRFs
X is a random variable over data sequences to be labeled; Y is a random variable over the corresponding label sequences. (The formal definition is reproduced below.)
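The formal definition from Lafferty, McCallum & Pereira (2001): (X, Y) is a conditional random field when, conditioned on X, the variables Y_v obey the Markov property with respect to the graph G over Y:

```latex
p(Y_v \mid X,\, Y_w,\, w \neq v) \;=\; p(Y_v \mid X,\, Y_w,\, w \sim v)
```

where w ∼ v means that w and v are neighbours in G.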
61. Example of CRFs
62. Graphical comparison among HMMs, MEMMs, and CRFs
(Diagram: HMM, MEMM, and CRF graphical structures.)
63. Conditional Distribution
64. Conditional Distribution (contd)
- CRFs use an observation-dependent normalization Z(x) for the conditional distributions, as shown below
Z(x) is a normalization over the data sequence x
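For a linear-chain CRF this conditional distribution (the formula itself was not preserved in the transcript) takes the form:

```latex
p_{\lambda}(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
```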
65. Parameter Estimation for CRFs
- The paper provided iterative scaling algorithms
- These turn out to be very inefficient
- Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient
66. Training of CRFs (from Prof. Dietterich)
- Then, take the derivative of the above equation
- For training, the first two terms are easy to get.
- For example, for each λk, fk is a sequence of Boolean values over the training positions, such as 00101110100111.
- Its empirical count is just the total number of 1s in the sequence.
- The hardest part is how to calculate Z(x); a forward-algorithm sketch follows below.
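Z(x) does not have to be computed by brute-force enumeration: for a linear chain it falls out of the same forward recursion used for HMMs. A minimal NumPy/SciPy sketch, assuming the feature weights have already been folded into start and transition log-scores:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(start_scores, trans_scores):
    """log Z(x) for a linear-chain CRF via the forward recursion.

    start_scores[j]       : weighted feature sum for y_0 = j
    trans_scores[t, i, j] : weighted feature sum for the transition
                            y_t = i -> y_{t+1} = j (observation-dependent
                            features folded in), shape (T-1, K, K).
    Brute force sums over K**T label sequences; the recursion is O(T K^2).
    """
    alpha = start_scores                                  # log alpha at t = 0
    for t in range(trans_scores.shape[0]):
        # alpha_{t+1}[j] = logsumexp_i( alpha_t[i] + trans[t, i, j] )
        alpha = logsumexp(alpha[:, None] + trans_scores[t], axis=0)
    return logsumexp(alpha)

# Sanity check against brute-force enumeration on a tiny random example.
K, T = 3, 4
rng = np.random.default_rng(0)
start = rng.normal(size=K)
trans = rng.normal(size=(T - 1, K, K))
brute = logsumexp([start[y[0]] + sum(trans[t - 1, y[t - 1], y[t]] for t in range(1, T))
                   for y in np.ndindex(*([K] * T))])
assert np.isclose(log_partition(start, trans), brute)
```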
67. Training of CRFs (from Prof. Dietterich) (contd)
68. POS Tagging Experiments
69. POS Tagging Experiments (contd)
- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Added a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
- OOV: out-of-vocabulary (not observed in the training set)
70. Summary
- Locally normalized discriminative models (MEMMs) are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well and demonstrate good performance