1
Sequence Learning
  • Sudeshna Sarkar
  • 14 Aug 2008

2
Alternative graphical models for part of speech
tagging
3
Different Models for POS tagging
  • HMM
  • Maximum Entropy Markov Models
  • Conditional Random Fields

4
Hidden Markov Model (HMM) Generative Modeling
Source Model P(Y)
Noisy Channel P(X | Y)
(Figure: noisy-channel diagram mapping a label sequence y to an observation sequence x)
5
Dependency (1st order)
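The first-order dependency this slide illustrates is the usual HMM factorization (a standard reconstruction; the slide's own figure did not survive the transcript):

  P(X, Y) = P(Y) P(X | Y) = \prod_k P(y_k | y_{k-1}) P(x_k | y_k)

where P(y_1 | y_0) is read as the initial-state probability.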
6
Disadvantage of HMMs (1)
  • No Rich Feature Information
  • Rich information is required
  • When xk is complex
  • When data for xk is sparse
  • Example: POS Tagging
  • How do we estimate P(wk | tk) for unknown words
    wk?
  • Useful features
  • Suffix, e.g., -ed, -tion, -ing, etc.
  • Capitalization
  • Generative Model
  • Parameter estimation maximizes the joint
    likelihood of training examples

7
Generative Models
  • Hidden Markov models (HMMs) and stochastic
    grammars
  • Assign a joint probability to paired observation
    and label sequences
  • The parameters are typically trained to maximize
    the joint likelihood of training examples

8
Generative Models (contd)
  • Difficulties and disadvantages
  • Need to enumerate all possible observation
    sequences
  • Not practical to represent multiple interacting
    features or long-range dependencies of the
    observations
  • Very strict independence assumptions on the
    observations

9
Making use of rich domain features
  • A learning algorithm is as good as its features.
  • There are many useful features to include in a
    model
  • Most of them aren't independent of each other
  • Identity of word
  • Ends in -shire
  • Is capitalized
  • Is head of noun phrase
  • Is in a list of city names
  • Is under node X in WordNet
  • Word to left is verb
  • Word to left is lowercase
  • Is in bold font
  • Is in hyperlink anchor
  • Other occurrences in doc

10
Problems with Richer Representation and a
Generative Model
  • These arbitrary features are not independent
  • Overlapping and long-distance dependencies
  • Multiple levels of granularity (words,
    characters)
  • Multiple modalities (words, formatting, layout)
  • Observations from past and future
  • HMMs are generative models of the text
  • Generative models do not easily handle these
    non-independent features. Two choices
  • Model the dependencies. Each state would have
    its own Bayes Net. But we are already starved
    for training data!
  • Ignore the dependencies. This causes
    over-counting of evidence (à la naïve Bayes).
    Big problem when combining evidence, as in
    Viterbi!

11
Discriminative Models
  • We would prefer a conditional model P(y | x)
    instead of P(y, x)
  • Can examine features, but not responsible for
    generating them.
  • Don't have to explicitly model their
    dependencies.
  • Don't waste modeling effort trying to generate
    what we are given at test time anyway.
  • Provide the ability to handle many arbitrary
    features.

12
Locally Normalized Conditional Sequence Model
Maximum Entropy Markov Models (McCallum, Freitag &
Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi,
1996); SNoW-based Markov Model (Punyakanok & Roth,
2000)
(Figure: two graphical models side by side, Generative (traditional HMM) and Conditional, each with states S at times t-1, t, t+1 connected by transitions, and observations O at times t-1, t, t+1)
Standard belief propagation: forward-backward
procedure. Viterbi and Baum-Welch follow
naturally.
13
Locally Normalized Conditional Sequence Model
Maximum Entropy Markov Models (McCallum, Freitag &
Pereira, 2000); MaxEnt POS Tagger (Ratnaparkhi,
1996); SNoW-based Markov Model (Punyakanok & Roth,
2000)
Or, more generally
(Figure: Generative (traditional HMM) vs. Conditional model; in the conditional variant each state depends on the entire observation sequence rather than a single observation)
Standard belief propagation: forward-backward
procedure. Viterbi and Baum-Welch follow
naturally.
14
Exponential Form for Next State Function
(Figure: at state s_{t-1}, the next-state function is a black-box classifier, an exponential model over weighted features)
Overall Recipe
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum
  likelihood (iterative scaling or conjugate
  gradient).
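The exponential form itself was an image in the original slide; the standard MEMM next-state function it refers to is

  P(s | s', o) = (1 / Z(o, s')) exp( \sum_k \lambda_k f_k(o, s) )

where s' is the previous state, o the current observation, f_k the features, \lambda_k their weights, and Z(o, s') the per-state normalizer.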
15
Principle of Maximum Entropy
  • The correct distribution P(s, o) is the one that
    maximizes entropy, or uncertainty, subject to
    constraints
  • Constraints represent evidence
  • Given k features, constraints take the form shown
    below, i.e. the model's expectation for each
    feature should match the observed expectation
  • Philosophy: making inferences on the basis of
    partial information without biasing the
    assignment would amount to arbitrary assumptions
    of information that we do not have

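In standard maximum entropy notation (a reconstruction of the constraint the slide refers to, not verbatim from the deck), each feature f_k must satisfy

  E_P[f_k] = E_{\tilde{P}}[f_k],  i.e.  \sum_{s,o} P(s, o) f_k(s, o) = \sum_{s,o} \tilde{P}(s, o) f_k(s, o)

where \tilde{P} is the empirical distribution over the training data.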
16
Maximum Entropy Classifier
  • Conditional model p(y | x)
  • Does not try to model p(x)
  • Can work with complicated input features since we
    do not need to model dependencies between them.
  • Principle of maximum entropy
  • We want a classifier that
  • Matches feature constraints from training data
  • Makes predictions that maximize entropy
  • There is a unique exponential family distribution
    that meets these criteria.
  • Maximum Entropy Classifier
  • p(y | x, θ): inference and learning

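That unique exponential family member has the familiar log-linear form (a standard reconstruction, since the slide's formula was an image):

  p(y | x; θ) = exp( \sum_k θ_k f_k(x, y) ) / \sum_{y'} exp( \sum_k θ_k f_k(x, y') )

Learning fits the weights θ_k by maximizing the conditional likelihood of the training data; inference picks the y with the highest p(y | x; θ).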
17
Indicator Features
  • Feature functions f(x, y)
  • f1(w, y) = 1 if the word is "Sarani" and y =
    Location
  • f2(w, y) = 1 if the previous tag is Per-begin,
    the current word has suffix "an", and y = Per-end

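A minimal Python sketch of such indicator features (the function and label names here are illustrative, not from the deck):

  def f1(words, i, y):
      # fires when the current word is "Sarani" and the candidate label is Location
      return 1 if words[i] == "Sarani" and y == "Location" else 0

  def f2(words, i, prev_y, y):
      # fires when the previous tag is Per-begin, the current word ends in "an",
      # and the candidate label is Per-end
      return 1 if prev_y == "Per-begin" and words[i].endswith("an") and y == "Per-end" else 0

  # Example: second token of "Gariahat Sarani" with candidate label Location
  print(f1(["Gariahat", "Sarani"], 1, "Location"))  # prints 1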
18
Problems with MaxEnt classifier
  • It makes decisions at each point independently

19
MEMM
  • Use a series of maximum entropy classifiers that
    know the previous label
  • Define a Viterbi model of inference
  • P(y | x) = ∏t Pyt-1(yt | x)
  • Finding the most likely label sequence given an
    input sequence and learning
  • Combines the advantages of HMM and maximum
    entropy.
  • But there is a problem.

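Decoding under this factorization is ordinary Viterbi with the local classifier plugged in. A compact sketch (local_prob is a stand-in for a trained maximum entropy classifier P(yt | yt-1, x); it is an assumed interface, not code from the deck):

  import math

  def viterbi(x, labels, local_prob):
      """Find the label sequence maximizing prod_t P(y_t | y_{t-1}, x)
      for a locally normalized model such as an MEMM."""
      T = len(x)
      # delta[t][y]: best log-score of any label prefix ending in y at position t
      delta = [{y: math.log(local_prob(None, x, 0, y)) for y in labels}]
      back = [{}]
      for t in range(1, T):
          delta.append({})
          back.append({})
          for y in labels:
              scores = {yp: delta[t - 1][yp] + math.log(local_prob(yp, x, t, y))
                        for yp in labels}
              best_prev = max(scores, key=scores.get)
              back[t][y] = best_prev
              delta[t][y] = scores[best_prev]
      # backtrack from the best final label
      best = max(delta[T - 1], key=delta[T - 1].get)
      path = [best]
      for t in range(T - 1, 0, -1):
          path.append(back[t][path[-1]])
      return list(reversed(path))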
20
Maximum Entropy Markov Model
Label bias problem: the probabilities of transitions
leaving any given state must sum to one
21
  • In some state space configurations, MEMMs
    essentially completely ignore the inputs
  • Example of label bias problem
  • This is not a problem for HMMs, because the input
    is generated by the model.

22
Label Bias Example
(Figure: finite-state example with two paths from the start state, one spelling "rib" and one spelling "rob"; the learned branch probabilities are P = 0.75 and P = 0.25)
  • Given "rib" 3 times, "rob" 1 time
  • Training: p(1 | 0, r) = 0.75, p(4 | 0, r) = 0.25
  • Inference

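Working the inference through (a reconstruction consistent with the numbers above, using states 0-1-2-3 for r-i-b and 0-4-5-3 for r-o-b): after the first branch every state has only one outgoing transition, so its local probability is 1 whatever the observation is. For the input "rob",

  P(path 0-1-2-3 | rob) = p(1 | 0, r) · p(2 | 1, o) · p(3 | 2, b) = 0.75 · 1 · 1 = 0.75
  P(path 0-4-5-3 | rob) = p(4 | 0, r) · p(5 | 4, o) · p(3 | 5, b) = 0.25 · 1 · 1 = 0.25

so the model prefers the "rib" path even though the middle observation is "o". This is the label bias problem.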
23
Conditional Markov Models (CMMs), aka MEMMs, aka
Maxent Taggers, vs HMMs
(Figure: graphical structures of a CMM/MEMM and an HMM, each over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1})
24
Random Field
25
CRF
  • CRFs have all the advantages of MEMMs without the
    label bias problem
  • MEMM uses a per-state exponential model for the
    conditional probabilities of next states given
    the current state
  • CRF has a single exponential model for the joint
    probability of the entire sequence of labels
    given the observation sequence
  • Undirected acyclic graph
  • Allows some transitions to vote more strongly than
    others, depending on the corresponding
    observations

26
Graphical comparison among HMMs, MEMMs and CRFs
(Figure: graphical structures of the HMM, MEMM, and CRF side by side)
27
Machine Learning a Panacea?
  • A machine learning method is as good as the
    feature set it uses
  • Shift focus from linguistic processing to feature
    set design

28
Features to use in IE
  • Features are task dependent
  • Good feature identification needs a good
    knowledge of the domain combined with automatic
    methods of feature selection.

29
Feature Examples
  • Extraction of proteins and their interactions
    from biomedical literature (Mooney)
  • For each token, they take the following as
    features
  • Current token
  • Last 2 tokens and next 2 tokens
  • Output of dictionary-based tagger for these 5
    tokens
  • Suffix for each of the 5 tokens (last 1, 2, and 3
    characters)
  • Class labels for last 2 tokens

Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
30
More Feature Examples
  • line, sentence, or paragraph features
  • length
  • is centered in page
  • percent of non-alphabetics
  • white-space aligns with next line
  • containing sentence has two verbs
  • grammatically contains a question
  • contains links to authoritative pages
  • emissions that are uncountable
  • features at multiple levels of granularity
  • Example word features
  • identity of word
  • is in all caps
  • ends in -ski
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet or Cyc
  • is in bold font
  • is in hyperlink anchor
  • features of past and future
  • last person name was female
  • next two words are "and Associates"

31
Indicator Features
  • They're a little different from the typical
    supervised ML approach
  • Limited to binary values
  • Think of a feature as being on or off rather than
    as a feature with a value
  • Feature values are relative to an object/class
    pair rather than being a function of the object
    alone.
  • Typically have lots and lots of features (100s of
    1000s of features is quite common.)

32
Feature Templates
  • Next word
  • A feature template gives rise to |V| x |T| binary
    features, one per word-tag combination (see the
    sketch below)
  • Curse of Dimensionality
  • Overfitting

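A minimal sketch of template expansion (the vocabulary and tag set are made up for illustration):

  vocab = ["the", "dog", "runs"]   # toy vocabulary, |V| = 3
  tags = ["DT", "NN", "VBZ"]       # toy tag set, |T| = 3

  def make_feature(word, tag):
      # indicator: 1 iff the next word is `word` and the current tag is `tag`
      return lambda next_word, y: int(next_word == word and y == tag)

  # the "next word" template expands into |V| x |T| binary features
  features = {(w, t): make_feature(w, t) for w in vocab for t in tags}
  print(len(features))                          # 9 = 3 x 3
  print(features[("dog", "NN")]("dog", "NN"))   # 1: the feature fires

With a real vocabulary of tens of thousands of words, even a handful of templates produces hundreds of thousands of binary features, which is where the dimensionality and overfitting concerns come from.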
33
Feature Selection vs Extraction
  • Feature selection: choosing k < d important
    features, ignoring the remaining d - k
  • Subset selection algorithms
  • Feature extraction: project the original
    xi, i = 1,...,d dimensions to
    new k < d dimensions, zj, j = 1,...,k
  • Principal components analysis (PCA), linear
    discriminant analysis (LDA), factor analysis (FA)

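A short sketch contrasting the two in code (scikit-learn is used here purely for illustration; the deck does not prescribe a library):

  import numpy as np
  from sklearn.feature_selection import SelectKBest, chi2
  from sklearn.decomposition import PCA

  X = np.abs(np.random.randn(100, 50))    # 100 samples, d = 50 features (non-negative for chi2)
  y = np.random.randint(0, 2, size=100)   # toy binary labels

  # Feature selection: keep k = 10 of the original d features unchanged
  X_sel = SelectKBest(chi2, k=10).fit_transform(X, y)

  # Feature extraction: project onto k = 10 new dimensions (linear combinations)
  X_ext = PCA(n_components=10).fit_transform(X)

  print(X_sel.shape, X_ext.shape)  # (100, 10) (100, 10)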
34
Feature Reduction
  • Example domain: NER in Hindi (Sujan Saha)
  • Feature Value Selection
  • Feature Value Clustering

ACL 2008: Sujan Kumar Saha, Pabitra Mitra, Sudeshna
Sarkar, "Word Clustering and Word Selection Based
Feature Reduction for MaxEnt Based Hindi NER"
35
(No Transcript)
36
  • Better Approach
  • Discriminative model which models P(y | x)
    directly
  • Maximize the conditional likelihood of training
    examples

37
Maximum Entropy modeling
  • N-gram model: probabilities depend on the
    previous few tokens.
  • We may identify a more heterogeneous set of
    features which contribute in some way to the
    choice of the current word (whether it is the
    first word in a story, whether the next word is
    "to", whether one of the last 5 words is a
    preposition, etc.)
  • Maxent combines these features in a probabilistic
    model.
  • The given features provide a constraint on the
    model.
  • We would like to have a probability distribution
    which, outside of these constraints, is as
    uniform as possible, i.e. has the maximum entropy
    among all models that satisfy these constraints.

38
Maximum Entropy Markov Model
  • Discriminative sub-models
  • Unify the two parameters of the generative model
    into one conditional model
  • Two parameters in the generative model:
  • the parameter in the source model
    and the parameter in the noisy channel
  • Unified conditional model
  • Employ the maximum entropy principle
  • Maximum Entropy Markov Model

39
General Maximum Entropy Principle
  • Model
  • Model the distribution P(Y | X) with a set of
    features f1, f2, ..., fl defined on X and Y
  • Idea
  • Collect information about the features from the
    training data
  • Principle
  • Model what is known
  • Assume nothing else
  • → Flattest distribution
  • → Distribution with the maximum entropy

40
Example
  • (Berger et al., 1996) example
  • Model the translation of the word "in" from
    English to French
  • Need to model P(French word)
  • Constraints
  • 1. Possible translations: dans, en, à, au cours
    de, pendant
  • 2. dans or en is used 30% of the time
  • 3. dans or à is used 50% of the time

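A small numerical sketch of the resulting maximum entropy distribution, solved with a generic optimizer (this is an illustration of the principle, not the algorithm used in the deck):

  import numpy as np
  from scipy.optimize import minimize

  words = ["dans", "en", "à", "au cours de", "pendant"]

  def neg_entropy(p):
      return np.sum(p * np.log(p + 1e-12))   # minimize negative entropy

  constraints = [
      {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
      {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 0.3
      {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à) = 0.5
  ]
  res = minimize(neg_entropy, np.full(5, 0.2), method="SLSQP",
                 bounds=[(0.0, 1.0)] * 5, constraints=constraints)
  print(dict(zip(words, np.round(res.x, 3))))

Outside the two stated constraints the optimizer spreads probability as evenly as it can, which is exactly the "assume nothing else" principle of the previous slide.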
41
Features
  • Features
  • 0-1 indicator functions
  • 1 if (x, y) satisfies a predefined condition
  • 0 if not
  • Example: POS Tagging

42
Constraints
  • Empirical Information
  • Statistics from training data T
  • Expected Value
  • From the distribution P(Y | X) we want to model
  • Constraints

43
Maximum Entropy Objective
  • Entropy
  • Maximization Problem

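The slide's formulas were images; in the usual conditional maximum entropy formulation they read (a standard reconstruction, not verbatim from the deck):

  H(P) = - \sum_x \tilde{P}(x) \sum_y P(y | x) \log P(y | x)

  maximize H(P)  subject to  E_P[f_k] = E_{\tilde{P}}[f_k] for k = 1, ..., l,  and  \sum_y P(y | x) = 1 for every x.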
44
Dual Problem
  • Dual Problem
  • Conditional model
  • Maximum likelihood of conditional data
  • Solution
  • Improved iterative scaling (IIS) (Berger et al.
    1996)
  • Generalized iterative scaling (GIS) (McCallum et
    al. 2000)

45
Maximum Entropy Markov Model
  • Use Maximum Entropy Approach to Model
  • 1st order
  • Features
  • Basic features (like the parameters in an HMM)
  • Bigram (1st order) or trigram (2nd order) in the
    source model
  • State-output pair feature (Xk = xk, Yk = yk)
  • Advantage: incorporate other advanced features on
    (xk, yk)

46
HMM vs MEMM (1st order)
(Figure: side-by-side graphical structures of the Maximum Entropy Markov Model (MEMM) and the HMM)
47
Performance in POS Tagging
  • POS Tagging
  • Data set: WSJ
  • Features
  • HMM features, spelling features (like -ed, -tion,
    -s, -ing, etc.)
  • Results (Lafferty et al. 2001)
  • 1st order HMM
  • 94.31% accuracy, 54.01% OOV accuracy
  • 1st order MEMM
  • 95.19% accuracy, 73.01% OOV accuracy

48
ME applications
  • Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
  • P(POS tag | context)
  • Information sources
  • Word window (4)
  • Word features (prefix, suffix, capitalization)
  • Previous POS tags

49
ME applications
  • Abbreviation expansion (Pakhomov, 2002)
  • Information sources
  • Word window (4)
  • Document title
  • Word Sense Disambiguation (WSD) (Chao & Dyer,
    2002)
  • Information sources
  • Word window (4)
  • Structurally related words (4)
  • Sentence Boundary Detection (Reynar &
    Ratnaparkhi, 1997)
  • Information sources
  • Token features (prefix, suffix, capitalization,
    abbreviation)
  • Word window (2)

50
Solution
  • Global Optimization
  • Optimize parameters in a global model
    simultaneously, not in sub-models separately
  • Alternatives
  • Conditional random fields
  • Application of perceptron algorithm

51
Why ME?
  • Advantages
  • Combine multiple knowledge sources
  • Local
  • Word prefix, suffix, capitalization (POS -
    (Ratnaparkhi, 1996))
  • Word POS, POS class, suffix (WSD - (Chao & Dyer,
    2002))
  • Token prefix, suffix, capitalization,
    abbreviation (Sentence Boundary - (Reynar &
    Ratnaparkhi, 1997))
  • Global
  • N-grams (Rosenfeld, 1997)
  • Word window
  • Document title (Pakhomov, 2002)
  • Structurally related words (Chao & Dyer, 2002)
  • Sentence length, conventional lexicon (Och & Ney,
    2002)
  • Combine dependent knowledge sources

52
Why ME?
  • Advantages
  • Add additional knowledge sources
  • Implicit smoothing
  • Disadvantages
  • Computational
  • Expected value at each iteration
  • Normalizing constant
  • Overfitting
  • Feature selection
  • Cutoffs
  • Basic Feature Selection (Berger et al., 1996)

53
Conditional Models
  • Conditional probability P(label sequence y |
    observation sequence x) rather than joint
    probability P(y, x)
  • Specify the probability of possible label
    sequences given an observation sequence
  • Allow arbitrary, non-independent features on the
    observation sequence X
  • The probability of a transition between labels
    may depend on past and future observations
  • Relax strong independence assumptions in
    generative models

54
Discriminative Models: Maximum Entropy Markov
Models (MEMMs)
  • Exponential model
  • Given a training set X with label sequences Y
  • Train a model θ that maximizes P(Y | X, θ)
  • For a new data sequence x, the predicted labels y
    maximize P(y | x, θ)
  • Notice the per-state normalization

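Concretely, per-state normalization means each local factor carries its own partition function (a standard reconstruction of the model referred to above):

  P(yt | yt-1, x) = (1 / Z(yt-1, x)) exp( \sum_k \lambda_k f_k(yt, yt-1, x) ),   Z(yt-1, x) = \sum_{y'} exp( \sum_k \lambda_k f_k(y', yt-1, x) )

It is this local Z(yt-1, x), recomputed at every state, that forces the outgoing probabilities of each state to sum to one and leads to the label bias problem discussed next.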
55
MEMMs (contd)
  • MEMMs have all the advantages of Conditional
    Models
  • Per-state normalization: all the mass that
    arrives at a state must be distributed among the
    possible successor states (conservation of score
    mass)
  • Subject to Label Bias Problem
  • Bias toward states with fewer outgoing transitions

56
Label Bias Problem
  • Consider this MEMM
  • P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro)
    = P(2 | 1 and o) P(1 | r)
  • P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri)
    = P(2 | 1 and i) P(1 | r)
  • Since P(2 | 1 and x) = 1 for all x,
    P(1 and 2 | ro) = P(1 and 2 | ri)
  • In the training data, label value 2 is the only
    label value observed after label value 1
  • Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for
    all x
  • However, we expect P(1 and 2 | ri) to be
    greater than P(1 and 2 | ro).
  • Per-state normalization does not allow the
    required expectation

57
Solve the Label Bias Problem
  • Change the state-transition structure of the
    model
  • Not always practical to change the set of states
  • Start with a fully-connected model and let the
    training procedure figure out a good structure
  • This precludes the use of prior structural
    knowledge, which is very valuable (e.g. in
    information extraction)

58
Random Field
59
Conditional Random Fields (CRFs)
  • CRFs have all the advantages of MEMMs without the
    label bias problem
  • MEMM uses a per-state exponential model for the
    conditional probabilities of next states given
    the current state
  • CRF has a single exponential model for the joint
    probability of the entire sequence of labels
    given the observation sequence
  • Undirected acyclic graph
  • Allows some transitions to vote more strongly than
    others, depending on the corresponding
    observations

60
Definition of CRFs
X is a random variable over data sequences to be
labeled; Y is a random variable over the
corresponding label sequences
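The defining property itself was a formula image in the slide; as stated by Lafferty, McCallum & Pereira (2001), (X, Y) is a conditional random field when, conditioned on X, the label variables obey the Markov property with respect to the graph G = (V, E):

  P(Yv | X, Yw, w ≠ v) = P(Yv | X, Yw, w ~ v)

where w ~ v means that w and v are neighbours in G.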
61
Example of CRFs
62
Graphical comparison among HMMs, MEMMs and CRFs
HMM MEMM CRF
63
Conditional Distribution
64
Conditional Distribution (contd)
  • CRFs use the observation-dependent
    normalization Z(x) for the conditional
    distributions

Z(x) is a normalization over the data sequence x
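For the linear-chain case the conditional distribution has the form (a standard reconstruction of the slide's lost formula, with vertex and edge features folded into one set f_k):

  P(y | x) = (1 / Z(x)) exp( \sum_t \sum_k \lambda_k f_k(yt-1, yt, x, t) ),   Z(x) = \sum_{y'} exp( \sum_t \sum_k \lambda_k f_k(y't-1, y't, x, t) )

Note that Z(x) sums over every possible label sequence y', which is why it is the expensive part of training.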
65
Parameter Estimation for CRFs
  • The paper provided iterative scaling algorithms
  • They turn out to be very inefficient
  • Prof. Dietterich's group applied a gradient
    descent algorithm, which is quite efficient

66
Training of CRFs (From Prof. Dietterich)
  • Then, take the derivative of the above equation
  • For training, the first two terms are easy to
    get.
  • For example, for each λk, fk is a sequence of
    Boolean values, such as 00101110100111.
  • Its empirical expectation is just the total
    number of 1s in the sequence.
  • The hardest thing is how to calculate Z(x)

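Z(x) is usually computed with the forward algorithm rather than by enumerating label sequences. A sketch (score(prev_y, y, x, t) stands for the weighted feature sum at position t; it is an assumed interface, not code from the deck, and log-space arithmetic is omitted for brevity):

  import numpy as np

  def partition_function(x, labels, score):
      """Z(x) = sum over all label sequences y of
      exp(sum_t score(y_{t-1}, y_t, x, t)), via the forward recursion."""
      T, L = len(x), len(labels)
      # alpha[j]: summed exp-scores of all label prefixes ending in labels[j]
      alpha = np.array([np.exp(score(None, labels[j], x, 0)) for j in range(L)])
      for t in range(1, T):
          trans = np.array([[np.exp(score(labels[i], labels[j], x, t))
                             for j in range(L)] for i in range(L)])
          alpha = alpha @ trans   # marginalize over the previous label
      return float(alpha.sum())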
67
Training of CRFs (From Prof. Dietterich) (contd)
  • Maximal cliques

68
POS tagging Experiments
69
POS tagging Experiments (contd)
  • Compared HMMs, MEMMs, and CRFs on Penn Treebank
    POS tagging
  • Each word in a given input sentence must be
    labeled with one of 45 syntactic tags
  • Add a small set of orthographic features: whether
    a spelling begins with a number or upper-case
    letter, whether it contains a hyphen, and whether
    it contains one of the following suffixes: -ing,
    -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
  • oov = out-of-vocabulary (not observed in the
    training set)

70
Summary
  • Locally normalized discriminative models (such as
    MEMMs) are prone to the label bias problem
  • CRFs provide the benefits of discriminative
    models
  • CRFs solve the label bias problem well, and
    demonstrate good performance