Global Inference in Learning for Natural Language Processing - PowerPoint PPT Presentation

Slides: 49
Transcript and Presenter's Notes


1
Global Inference in Learning for Natural Language
Processing
  • Vasin Punyakanok
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • Joint work with Dan Roth, Wen-tau Yih, and Dav
    Zimak

2
Story Comprehension
(ENGLAND, June, 1989) - Christopher Robin is
alive and well. He lives in England. He is the
same person that you read about in the book,
Winnie the Pooh. As a boy, Chris lived in a
pretty home called Cotchfield Farm. When Chris
was three years old, his father wrote a poem
about him. The poem was printed in a magazine
for others to read. Mr. Robin then wrote a book.
He made up a fairy tale land where Chris lived.
His friends were animals. There was a bear
called Winnie the Pooh. There was also an owl
and a young pig, called a piglet. All the
animals were stuffed toys that Chris owned. Mr.
Robin made them come to life with his words. The
places in the story were all near Cotchfield
Farm. Winnie the Pooh was written in 1925.
Children still love to read about Christopher
Robin and his animal friends. Most people don't
know he is a real person who is grown now. He
has written two books of his own. They tell what
it is like to be famous.
  • Who is Christopher Robin?
  • What did Mr. Robin do when Chris was three years
    old?
  • When was Winnie the Pooh written?
  • Why did Chris write two books of his own?

3
Stand Alone Ambiguity Resolution
  • Context Sensitive Spelling Correction
  • Illinois' bored of education → board
  • Word Sense Disambiguation
  • ...Nissan Car and truck plant is
  • divide life into plant and animal kingdom
  • Part of Speech Tagging
  • (This DT) (can N) (will MD) (rust V)
    DT,MD,V,N
  • Coreference Resolution
  • The dog bit the kid. He was taken to a hospital.
  • The dog bit the kid. He was taken to a
    veterinarian.

4
Textual Entailment
  • Eyeing the huge market potential, currently led
    by Google, Yahoo took over search company
    Overture Services Inc. last year.
  • Yahoo acquired Overture.
  • Question Answering
  • Who acquired Overture?

5
Inference and Learning
  • Global decisions in which several local decisions
    play a role but there are mutual dependencies on
    their outcome.
  • Learned classifiers for different sub-problems
  • Incorporate the classifiers' information, along
    with constraints, in making coherent decisions:
    decisions that respect the local classifiers as
    well as domain- and context-specific constraints.
  • Global inference for the best assignment to all
    variables of interest.

6
Text Chunking
x: The guy standing there is so tall
y: [NP The guy] [VP standing] [ADVP there] [VP is] [ADJP so tall]
7
Full Parsing
x: The guy standing there is so tall
y: [S [NP [NP The guy] [VP standing [ADVP there]]] [VP is [ADJP so tall]]]
8
Outline
  • Semantic Role Labeling Problem
  • Global Inference with Integer Linear Programming
  • Some Issues with Learning and Inference
  • Global vs Local Training
  • Utility of Constraints in the Inference
  • Conclusion

9
Semantic Role Labeling
  • I left my pearls to my daughter in my will .
  • [A0 I] left [A1 my pearls] [A2 to my daughter] [AM-LOC in
    my will] .
  • A0: Leaver
  • A1: Things left
  • A2: Benefactor
  • AM-LOC: Location

10
Semantic Role Labeling
  • PropBank [Palmer et al. 05] provides a large
    human-annotated corpus of semantic verb-argument
    relations.
  • It adds a layer of generic semantic labels to
    Penn Treebank II.
  • (Almost) all the labels are on the constituents
    of the parse trees.
  • Core arguments: A0-A5 and AA
  • different semantics for each verb
  • specified in the PropBank Frame files
  • 13 types of adjuncts labeled as AM-arg
  • where arg specifies the adjunct type

11
Semantic Role Labeling
12
The Approach
  • Pruning
  • Use heuristics to reduce the number of candidates
    (modified from [Xue & Palmer 04])
  • Argument Identification
  • Use a binary classifier to identify arguments
  • Argument Classification
  • Use a multiclass classifier to classify arguments
  • Inference
  • Infer the final output satisfying linguistic and
    structure constraints

13
Learning
  • Both argument identifier and argument classifier
    are trained phrase-based classifiers.
  • Features (some examples)
  • voice, phrase type, head word, path, chunk, chunk
    pattern, etc.; some make use of a full syntactic
    parse
  • Learning algorithm: SNoW
  • Sparse network of linear functions
  • weights learned by regularized Winnow
    multiplicative update rule with averaged weight
    vectors
  • Probability conversion is done via softmax
  • $p_i = \exp(act_i) / \sum_j \exp(act_j)$
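
The softmax conversion on this slide can be sketched as follows. This is a minimal illustration, not the SNoW implementation: the function name `softmax` and the sample activations are hypothetical, and subtracting the maximum activation is a standard numerical-stability step that the slide does not mention.

```python
import math

def softmax(activations):
    """Convert raw classifier activations into a probability distribution."""
    # Subtracting the max does not change the result but avoids overflow.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for three candidate argument types.
probs = softmax([2.0, 1.0, 0.1])
```

The resulting values sum to one and preserve the ranking of the activations, which is what allows them to be used as scores in the inference step.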

14
Inference
  • The output of the argument classifier often
    violates some constraints, especially when the
    sentence is long.
  • Finding the best legitimate output is formalized
    as an optimization problem and solved via Integer
    Linear Programming.
  • Input
  • The probability estimation (by the argument
    classifier)
  • Structural and linguistic constraints
  • Allows incorporating expressive (non-sequential)
    constraints on the variables (the argument
    types).

15
Integer Linear Programming Inference
  • For each argument $a_i$ and type $t$ (including null)
  • Set up a Boolean variable $a_{i,t}$ indicating whether
    $a_i$ is classified as $t$
  • Goal is to maximize
  • $\sum_i score(a_i = t)\, a_{i,t}$
  • Subject to the (linear) constraints
  • Any Boolean constraints can be encoded this way
  • If $score(a_i = t) = P(a_i = t)$, the objective is to
    find the assignment that maximizes the expected
    number of correct arguments while satisfying
    the constraints

16
Linear Constraints
  • No overlapping or embedding arguments
  • $\forall a_i, a_j$ that overlap or embed: $a_{i,null} + a_{j,null} \ge 1$
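
The effect of the overlap constraint can be sketched on a toy problem. Note the substitution: the slides solve the problem with Integer Linear Programming, whereas this sketch simply enumerates all assignments, which is only feasible for a handful of candidates. The type set, the scores, and the name `best_assignment` are all hypothetical.

```python
from itertools import product

TYPES = ["A0", "A1", "null"]  # assumed toy label set

def best_assignment(scores, overlapping_pairs):
    """Return the highest-scoring assignment that satisfies the overlap constraint.

    scores[i][t] is the score of labeling candidate i with type t.
    overlapping_pairs lists the (i, j) candidates that overlap or embed.
    """
    best, best_score = None, float("-inf")
    for assignment in product(TYPES, repeat=len(scores)):
        # At least one candidate of every overlapping pair must be null.
        if any(assignment[i] != "null" and assignment[j] != "null"
               for i, j in overlapping_pairs):
            continue
        total = sum(scores[i][t] for i, t in enumerate(assignment))
        if total > best_score:
            best, best_score = assignment, total
    return best, best_score

scores = [
    {"A0": 0.7, "A1": 0.2, "null": 0.1},
    {"A0": 0.1, "A1": 0.6, "null": 0.3},
    {"A0": 0.1, "A1": 0.5, "null": 0.4},
]
# Candidates 1 and 2 overlap, so they cannot both be labeled A1.
assignment, total = best_assignment(scores, overlapping_pairs=[(1, 2)])
```

Without the constraint each candidate would take its own best label and candidates 1 and 2 would both become A1; with it, the weaker of the two is pushed to null.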

17
Constraints
  • Constraints
  • No overlapping or embedding arguments
  • No duplicate argument classes for A0-A5
  • Exactly one V argument per predicate
  • If there is a C-V, there must be a V-A1-C-V pattern
  • If there is an R-arg, there must be arg somewhere
  • If there is a C-arg, there must be arg somewhere
    before
  • Each predicate can take only core arguments that
    appear in its frame file.
  • More specifically, we check for only the minimum
    and maximum ids
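
Some of the constraints listed above can be written as linear (in)equalities over the Boolean variables $a_{i,t}$. The encodings below are one plausible reading given as a sketch; the slides themselves only show the encoding of the overlap constraint, so the exact forms here are assumptions.

```latex
% Each candidate a_i takes exactly one type (including null):
\sum_{t} a_{i,t} = 1 \quad \forall i

% No duplicate argument classes for A0-A5:
\sum_{i} a_{i,t} \le 1 \quad \forall t \in \{\mathrm{A0}, \ldots, \mathrm{A5}\}

% If there is an R-arg, there must be arg somewhere:
\sum_{i} a_{i,t} \ge a_{j,\mathrm{R}\text{-}t} \quad \forall j
```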

18
SRL Results (CoNLL-2005)
  • Training: sections 02-21
  • Development: section 24
  • Test WSJ: section 23
  • Test Brown: from the Brown corpus (very small)

19
Inference with Multiple SRL systems
  • Goal is to maximize
  • $\sum_i score(a_i = t)\, a_{i,t}$
  • Subject to the (linear) constraints
  • Any Boolean constraints can be encoded this way
  • $score(a_i = t) = \sum_k f_k(a_i = t)$
  • If system k has no opinion on ai, use a prior
    instead
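
The score combination on this slide can be sketched as below. The data layout (one dictionary of type scores per system, `None` for a system with no opinion), the prior values, and the name `combined_score` are assumptions made for illustration only.

```python
def combined_score(system_scores, prior):
    """Sum the per-system scores for one candidate argument.

    system_scores: one entry per system, either a dict {type: score}
    or None when that system has no opinion on this candidate.
    prior: dict {type: score} used in place of a missing opinion.
    """
    return {
        t: sum((s if s is not None else prior)[t] for s in system_scores)
        for t in prior
    }

prior = {"A0": 0.2, "A1": 0.2, "null": 0.6}  # hypothetical prior
scores = combined_score(
    [{"A0": 0.8, "A1": 0.1, "null": 0.1}, None],  # second system is silent
    prior,
)
```

The combined scores then take the place of the single-system scores in the same constrained maximization.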

20
Results with Multiple Systems (CoNLL-2005)
21
Outline
  • Semantic Role Labeling Problem
  • Global Inference with Integer Linear Programming
  • Some Issues with Learning and Inference
  • Global vs Local Training
  • Utility of Constraints in the Inference
  • Conclusion

22
Learning and Inference
  • Training w/o constraints; testing: inference with constraints
  • IBT: inference-based training
  • [Figure: local classifiers f1(x), ..., f5(x) between input X and output Y]
  • Which one is better? When and why?
23
Comparisons of Learning Approaches
  • Coupling (IBT)
  • Optimize the true global objective function (This
    should be better in the limit)
  • Decoupling (LI)
  • More efficient
  • Reusability of classifiers
  • Modularity in training
  • No global examples required

24
Claims
  • When the local classification problems are
    easy, LI outperforms IBT.
  • Only when the local problems become difficult to
    solve in isolation does IBT outperform LI, and
    even then it needs a large enough number of
    training examples.
  • We will show experimental results and theoretical
    intuition to support our claims.

25
Perceptron-based Global Learning
[Figure: local classifiers f1(x), ..., f5(x) between input X and output Y]
26
Simulation
  • There are 5 local binary linear classifiers
  • Global classifier is also linear
  • $h(x) = \arg\max_{y \in C(Y)} \sum_i f_i(x, y_i)$
  • Constraints are randomly generated
  • The hypothesis is linearly separable at the
    global level given that the constraints are known
  • The separability level at the local level is
    varied
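
The two training regimes compared in this simulation can be sketched with plain perceptron updates. This is an illustrative sketch, not the experimental code: the scoring form $(2 y_i - 1)\, w_i \cdot x$ for binary labels, the toy data, and all function names are assumptions.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def infer(weights, x, feasible):
    """Constrained argmax: best feasible y under the summed local scores."""
    def score(y):
        return sum((2 * yi - 1) * dot(w, x) for w, yi in zip(weights, y))
    return max(feasible, key=score)

def update(w, x, yi):
    """Perceptron step pushing w toward label yi (0 or 1)."""
    sign = 2 * yi - 1
    return [wj + sign * xj for wj, xj in zip(w, x)]

def train_li(data, n_local, dim, epochs=10):
    """LI: train each local classifier on its own, ignoring the constraints."""
    weights = [[0.0] * dim for _ in range(n_local)]
    for _ in range(epochs):
        for x, y in data:
            for i in range(n_local):
                pred = 1 if dot(weights[i], x) > 0 else 0
                if pred != y[i]:
                    weights[i] = update(weights[i], x, y[i])
    return weights

def train_ibt(data, n_local, dim, feasible, epochs=10):
    """IBT: update only the components the constrained prediction gets wrong."""
    weights = [[0.0] * dim for _ in range(n_local)]
    for _ in range(epochs):
        for x, y in data:
            pred = infer(weights, x, feasible)
            for i in range(n_local):
                if pred[i] != y[i]:
                    weights[i] = update(weights[i], x, y[i])
    return weights

# Toy data: two local outputs that the constraints force to agree.
feasible = [(0, 0), (1, 1)]
data = [([1, 0], (1, 1)), ([-1, 0], (0, 0)),
        ([2, 1], (1, 1)), ([-2, 1], (0, 0))]
w_li = train_li(data, n_local=2, dim=2)
w_ibt = train_ibt(data, n_local=2, dim=2, feasible=feasible)
```

Both regimes use the same constrained inference at test time; they differ only in whether the constraints are visible during the weight updates.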

27
Bound Prediction
LI vs. IBT: the more identifiable the individual
problems are, the better the overall performance is
with LI
  • Local: $\le \epsilon_{opt} + \left( (d \log m + \log 1/\delta) / m \right)^{1/2}$
  • Global: $\le 0 + \left( (c d \log m + c^2 d + \log 1/\delta) / m \right)^{1/2}$

28
Relative Merits: SRL
[Figure: x-axis is the difficulty of the learning problem (features), from easy to hard]
29
Summary
  • When the local classification problems are
    easy, LI outperforms IBT.
  • Only when the local problems become difficult to
    solve in isolation does IBT outperform LI, and
    even then it needs a large enough number of
    training examples.
  • Why does inference help at all?

30
About Constraints
  • We always assume that global coherency is good
  • Constraints do help in real-world applications
  • Performance is usually measured at the level of
    local predictions
  • Depending on the performance metric, constraints
    can hurt

31
Results: Contribution of Expressive Constraints
[Roth & Yih 05]
  • Basic: learning with statistical constraints
    only
  • Additional constraints added at evaluation time
    (efficiency)

F1
Constraint         CRF-D    diff    CRF-ML   diff
basic (Viterbi)    69.14            66.46
no dup             69.74    0.60    67.10    0.64
cand               73.64    3.90    71.78    4.68
argument           73.71    0.07    71.71   -0.07
verb pos           73.78    0.07    71.72    0.01
disallow           73.91    0.13    71.94    0.22
32
Assumptions
  • $y = \langle y_1, \ldots, y_l \rangle$
  • Non-interactive classifiers $f_i(x, y_i)$: each
    classifier does not use as inputs the outputs of
    other classifiers
  • Inference is linear summation
  • $h_{un}(x) = \arg\max_{y \in Y} \sum_i f_i(x, y_i)$
  • $h_{con}(x) = \arg\max_{y \in C(Y)} \sum_i f_i(x, y_i)$
  • C(Y) always contains correct outputs
  • No assumption on the structure of constraints
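
The two inference rules under these assumptions can be sketched as follows, with the local scores for a fixed input given as a table. The representation `local_scores[i][label]` for $f_i(x, label)$, the explicit list standing in for C(Y), and the sample numbers are assumptions for illustration.

```python
def h_un(local_scores):
    """Unconstrained inference: each component picks its own best label."""
    return tuple(s.index(max(s)) for s in local_scores)

def h_con(local_scores, feasible):
    """Constrained inference: best summed score within the feasible set C(Y)."""
    return max(feasible,
               key=lambda y: sum(s[yi] for s, yi in zip(local_scores, y)))

# Hypothetical scores f_i(x, 0) and f_i(x, 1) for three binary components.
local_scores = [[0.2, 0.8], [0.6, 0.4], [0.55, 0.45]]
feasible = [(0, 0, 0), (1, 1, 1)]  # all components must agree

y_un = h_un(local_scores)
y_con = h_con(local_scores, feasible)
```

Because the classifiers are non-interactive, the unconstrained argmax decomposes per component, while the constrained one has to compare whole outputs.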

33
Performance Metrics
  • Zero-one loss
  • Mistakes are calculated in terms of global
    mistakes
  • y is wrong if any of yi is wrong
  • Hamming loss
  • Mistakes are calculated in terms of local mistakes
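
The two metrics can be written down directly. A minimal sketch, with hypothetical function names and outputs represented as tuples of labels:

```python
def zero_one_loss(y_true, y_pred):
    """1 if any component of the output is wrong, otherwise 0."""
    return 0 if tuple(y_true) == tuple(y_pred) else 1

def hamming_loss(y_true, y_pred):
    """Number of components that are wrong."""
    return sum(t != p for t, p in zip(y_true, y_pred))

y_true = (0, 0, 1, 1)
y_pred = (0, 1, 1, 0)
global_mistake = zero_one_loss(y_true, y_pred)
local_mistakes = hamming_loss(y_true, y_pred)
```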

34
Zero-One Loss
  • Constraints cannot hurt
  • Constraints never change a globally correct output
  • This is not true for Hamming Loss
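
A small constructed example shows how constraints can increase Hamming loss even though C(Y) contains the correct output. The scores below are made up for illustration: one classifier is confidently wrong, and the only feasible output that agrees with it is wrong everywhere.

```python
# f_i(x, label) for a fixed input x; index 0 or 1 is the label.
local_scores = [[0.0, 10.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
y_true = (0, 0, 0, 0)
feasible = [(0, 0, 0, 0), (1, 1, 1, 1)]  # C(Y), includes the correct output

def total(y):
    """Summed local score of a complete output."""
    return sum(s[yi] for s, yi in zip(local_scores, y))

def hamming(y):
    """Number of components that differ from the correct output."""
    return sum(t != p for t, p in zip(y_true, y))

y_un = tuple(s.index(max(s)) for s in local_scores)  # unconstrained argmax
y_con = max(feasible, key=total)                     # constrained argmax
```

Here the unconstrained output has one local mistake, while the constrained output has four; the zero-one loss is 1 in both cases, so only the Hamming loss gets worse.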

35
Boolean Cube
  • 4-bit binary outputs

[Figure: Boolean cube of the 4-bit outputs, layered from 4 mistakes down to 0 mistakes]
36
Hamming Loss
[Figure: output 0011]
37
Best Classifiers
38
When Constraints Cannot Hurt
  • $\gamma_i$: distance between the correct label and the
    2nd best label
  • $\delta_i$: distance between the predicted label and the
    correct label
  • $F_{correct} = \{ i \mid f_i \text{ is correct} \}$
  • $F_{wrong} = \{ i \mid f_i \text{ is wrong} \}$
  • Constraints cannot hurt if
  • $\forall i \in F_{correct}: \gamma_i > \sum_{j \in F_{wrong}} \delta_j$

39
An Empirical Investigation
  • SRL System
  • CoNLL-2005 WSJ test set

40
An Empirical Investigation
41
Good Classifiers
42
Bad Classifiers
43
Average Distance vs Gain in Hamming Loss
  • Good
  • High Loss → Low Score (Low Gain)

44
Good Classifiers
45
Bad Classifiers
46
Average Gain in Hamming Loss vs Distance
  • Good
  • High Score → Low Loss (High Gain)

47
Utility of Constraints
  • Constraints improve the performance because the
    classifiers are good
  • Good Classifiers
  • When the classifier is correct, there is a large
    margin between the correct label and the 2nd best
    label
  • When the classifier is wrong, the correct label
    is not far away from the predicted one

48
Conclusions
  • Show how global inference can be used
  • Semantic Role Labeling
  • Tradeoffs between Coupling vs. Decoupling
    learning and inference
  • Investigation of utility of constraints
  • The analyses are very preliminary
  • Average-case analysis for the tradeoffs between
    Coupling vs. Decoupling learning and inference
  • Better understanding for using constraints
  • More interactive classifiers
  • Different performance metrics, e.g. F1
  • Relation with margin