Title: Global Inference in Learning for Natural Language Processing
1 Global Inference in Learning for Natural Language Processing
- Vasin Punyakanok
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- Joint work with Dan Roth, Wen-tau Yih, and Dav Zimak
2 Story Comprehension
(ENGLAND, June, 1989) - Christopher Robin is
alive and well. He lives in England. He is the
same person that you read about in the book,
Winnie the Pooh. As a boy, Chris lived in a
pretty home called Cotchfield Farm. When Chris
was three years old, his father wrote a poem
about him. The poem was printed in a magazine
for others to read. Mr. Robin then wrote a book.
He made up a fairy tale land where Chris lived.
His friends were animals. There was a bear
called Winnie the Pooh. There was also an owl
and a young pig, called a piglet. All the
animals were stuffed toys that Chris owned. Mr.
Robin made them come to life with his words. The
places in the story were all near Cotchfield
Farm. Winnie the Pooh was written in 1925.
Children still love to read about Christopher
Robin and his animal friends. Most people don't
know he is a real person who is grown now. He
has written two books of his own. They tell what
it is like to be famous.
- Who is Christopher Robin?
- What did Mr. Robin do when Chris was three years old?
- When was Winnie the Pooh written?
- Why did Chris write two books of his own?
3 Stand Alone Ambiguity Resolution
- Context Sensitive Spelling Correction
  - Illinois bored of education -> board
- Word Sense Disambiguation
  - ...Nissan Car and truck plant is...
  - ...divide life into plant and animal kingdom...
- Part of Speech Tagging
  - (This DT) (can N) (will MD) (rust V); candidate tags: DT, MD, V, N
- Coreference Resolution
  - The dog bit the kid. He was taken to a hospital.
  - The dog bit the kid. He was taken to a veterinarian.
4 Textual Entailment
- Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year.
- Yahoo acquired Overture.
- Question Answering
  - Who acquired Overture?
5 Inference and Learning
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes.
- Learned classifiers for different sub-problems.
- Incorporate the classifiers' information, along with constraints, to make coherent decisions: decisions that respect the local classifiers as well as domain- and context-specific constraints.
- Global inference for the best assignment to all variables of interest.
6 Text Chunking
[Figure: shallow parse of x = "The guy standing there is so tall" into chunks y labeled NP, VP, ADVP, ADJP]
7 Full Parsing
[Figure: full parse tree y for x = "The guy standing there is so tall", with nodes S, NP, VP, ADJP, ADVP]
8 Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
- Global vs Local Training
- Utility of Constraints in the Inference
- Conclusion
9 Semantic Role Labeling
- I left my pearls to my daughter in my will.
- [I]A0 left [my pearls]A1 to [my daughter]A2 in [my will]AM-LOC.
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
10 Semantic Role Labeling
- PropBank [Palmer et al., 2005] provides a large human-annotated corpus of semantic verb-argument relations.
- It adds a layer of generic semantic labels to the Penn Treebank II.
- (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - different semantics for each verb
  - specified in the PropBank frame files
- 13 types of adjuncts labeled as AM-arg
  - where arg specifies the adjunct type
11 Semantic Role Labeling
12 The Approach
- Pruning
  - Use heuristics to reduce the number of candidates (modified from [Xue & Palmer, 2004])
- Argument Identification
  - Use a binary classifier to identify arguments
- Argument Classification
  - Use a multiclass classifier to classify arguments
- Inference
  - Infer the final output satisfying linguistic and structural constraints (the full pipeline is sketched below)
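A minimal sketch of this four-stage pipeline, assuming hypothetical pruner, identifier, classifier, and inference components (all names are illustrative placeholders, not the actual system's API):

```python
# Sketch of the SRL pipeline described above; every component name is a
# hypothetical placeholder standing in for the corresponding stage.

def label_semantic_roles(sentence, parse_tree, pruner, identifier, classifier, inference):
    # 1. Pruning: heuristically reduce the set of candidate constituents.
    candidates = pruner.prune_candidates(parse_tree)

    # 2. Argument identification: a binary classifier keeps likely arguments.
    arguments = [c for c in candidates if identifier.is_argument(sentence, c)]

    # 3. Argument classification: a multiclass classifier scores each label
    #    (including "null") for every surviving candidate.
    scores = {a: classifier.score_labels(sentence, a) for a in arguments}

    # 4. Inference: choose the best joint labeling that satisfies the
    #    linguistic and structural constraints (e.g., via ILP, see slide 15).
    return inference.best_assignment(scores)
```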
13 Learning
- Both the argument identifier and the argument classifier are trained as phrase-based classifiers.
- Features (some examples)
  - voice, phrase type, head word, path, chunk, chunk pattern, etc.; some make use of a full syntactic parse
- Learning Algorithm: SNoW
  - Sparse network of linear functions
  - weights learned by a regularized Winnow multiplicative update rule with averaged weight vectors
- Probability conversion is done via softmax (sketched below)
  - p_i = exp(act_i) / Σ_j exp(act_j)
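A tiny sketch of this softmax conversion (a generic, numerically stable version, not the exact SNoW implementation):

```python
import math

def softmax(activations):
    """Convert raw activations act_i into probabilities
    p_i = exp(act_i) / sum_j exp(act_j). Shifting by the maximum
    activation avoids overflow and does not change the result."""
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    z = sum(exps)
    return [e / z for e in exps]

# Example: activations for three candidate labels.
print(softmax([2.0, 0.5, -1.0]))
```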
14 Inference
- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming.
- Input
  - The probability estimates (from the argument classifier)
  - Structural and linguistic constraints
- Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
15 Integer Linear Programming Inference
- For each argument a_i and type t (including null)
  - set up a Boolean variable a_{i,t} indicating whether a_i is classified as t
- Goal is to maximize
  - Σ_{i,t} score(a_i = t) · a_{i,t}  (a solver sketch follows below)
- Subject to the (linear) constraints
  - Any Boolean constraint can be encoded this way
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of correct arguments and satisfies the constraints.
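A minimal sketch of this formulation using the PuLP modeling library; the candidate arguments, labels, and scores below are invented for illustration:

```python
import pulp

# Toy candidate arguments and label scores (illustrative only).
arguments = ["a0", "a1", "a2"]
labels = ["A0", "A1", "A2", "null"]
score = {("a0", "A0"): 0.7, ("a0", "null"): 0.3,
         ("a1", "A1"): 0.6, ("a1", "A2"): 0.3, ("a1", "null"): 0.1,
         ("a2", "A2"): 0.5, ("a2", "A1"): 0.4, ("a2", "null"): 0.1}

prob = pulp.LpProblem("srl_inference", pulp.LpMaximize)

# Boolean variable x[i, t] = 1 iff argument i is assigned label t.
x = {(i, t): pulp.LpVariable(f"x_{i}_{t}", cat="Binary")
     for i in arguments for t in labels}

# Objective: maximize the total score of the chosen assignment.
prob += pulp.lpSum(score.get((i, t), 0.0) * x[i, t]
                   for i in arguments for t in labels)

# Each candidate argument gets exactly one label (possibly null).
for i in arguments:
    prob += pulp.lpSum(x[i, t] for t in labels) == 1

prob.solve()
assignment = {i: t for i in arguments for t in labels
              if pulp.value(x[i, t]) == 1}
print(assignment)
```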
16 Linear Constraints
- No overlapping or embedding arguments
  - For all a_i, a_j that overlap or embed: a_{i,null} + a_{j,null} ≥ 1
17 Constraints
- Constraints (encodings of a few of these are sketched below)
  - No overlapping or embedding arguments
  - No duplicate argument classes for A0-A5
  - Exactly one V argument per predicate
  - If there is a C-V, there must be a V-A1-C-V pattern
  - If there is an R-arg, there must be an arg somewhere
  - If there is a C-arg, there must be an arg somewhere before
  - Each predicate can take only core arguments that appear in its frame file.
    - More specifically, we check only the minimum and maximum ids.
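Continuing the PuLP sketch from slide 15, here is how a few of these constraints could be written as linear (in)equalities over the same Boolean variables; the candidates and the overlapping pair are again invented:

```python
import pulp

arguments = ["a0", "a1", "a2"]
labels = ["A0", "A1", "V", "null"]
overlapping_pairs = [("a1", "a2")]  # assumed known from the candidate spans

prob = pulp.LpProblem("srl_constraints", pulp.LpMaximize)
x = {(i, t): pulp.LpVariable(f"x_{i}_{t}", cat="Binary")
     for i in arguments for t in labels}

# Each candidate gets exactly one label.
for i in arguments:
    prob += pulp.lpSum(x[i, t] for t in labels) == 1

# No overlapping or embedding arguments: at least one of any
# overlapping pair must be labeled null.
for i, j in overlapping_pairs:
    prob += x[i, "null"] + x[j, "null"] >= 1

# No duplicate argument classes for the core labels: each used at most once.
for t in ["A0", "A1"]:
    prob += pulp.lpSum(x[i, t] for i in arguments) <= 1

# Exactly one V argument per predicate.
prob += pulp.lpSum(x[i, "V"] for i in arguments) == 1
```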
18 SRL Results (CoNLL-2005)
- Training: sections 02-21
- Development: section 24
- Test WSJ: section 23
- Test Brown: from the Brown corpus (very small)
19 Inference with Multiple SRL Systems
- Goal is to maximize
  - Σ_{i,t} score(a_i = t) · a_{i,t}
- Subject to the (linear) constraints
  - Any Boolean constraint can be encoded this way
- score(a_i = t) = Σ_k f_k(a_i = t)  (a combination sketch follows below)
- If system k has no opinion on a_i, use a prior instead.
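A small sketch of this score combination; the per-system scores and the prior below are invented for illustration:

```python
def combined_score(candidate, label, system_scores, prior):
    """score(a_i = t) = sum_k f_k(a_i = t); a system with no opinion on
    this candidate contributes a prior score for the label instead."""
    total = 0.0
    for scores_k in system_scores:          # one score dict per SRL system
        total += scores_k.get((candidate, label), prior.get(label, 0.0))
    return total

# Example: two systems, one of which has no opinion on candidate "a1".
system_scores = [
    {("a0", "A0"): 0.9, ("a1", "A1"): 0.8},
    {("a0", "A0"): 0.7},                    # no opinion on "a1"
]
prior = {"A0": 0.2, "A1": 0.1, "null": 0.5}
print(combined_score("a1", "A1", system_scores, prior))
```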
20 Results with Multiple Systems (CoNLL-2005)
21 Outline
- Semantic Role Labeling Problem
- Global Inference with Integer Linear Programming
- Some Issues with Learning and Inference
- Global vs Local Training
- Utility of Constraints in the Inference
- Conclusion
22 Learning and Inference
- LI: training w/o constraints; testing: inference with constraints
- IBT: Inference-based Training
[Diagram: local classifiers f1(x), ..., f5(x) mapping input X to output Y]
Which one is better? When and why?
23 Comparisons of Learning Approaches
- Coupling (IBT)
  - Optimize the true global objective function (this should be better in the limit)
- Decoupling (LI)
  - More efficient
  - Reusability of classifiers
  - Modularity in training
  - No global examples required
24 Claims
- When the local classification problems are easy, LI outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform LI, and it needs a large enough number of training examples.
- Will show experimental results and theoretical intuition to support our claims.
25 Perceptron-based Global Learning
[Diagram: local classifiers f1(x), ..., f5(x) mapping input X to output Y]
26 Simulation
- There are 5 local binary linear classifiers.
- The global classifier is also linear:
  - h(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)  (see the sketch after this list)
- Constraints are randomly generated.
- The hypothesis is linearly separable at the global level, given that the constraints are known.
- The separability level at the local level is varied.
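A minimal sketch of this constrained global argmax with toy linear local classifiers and an invented constraint set C(Y):

```python
import itertools

import numpy as np

def global_predict(x, weights, feasible):
    """h(x) = argmax over y in C(Y) of sum_i f_i(x, y_i), where each local
    classifier f_i is linear, y_i is +1 or -1, and feasible(y) encodes C(Y)."""
    best_y, best_score = None, float("-inf")
    for y in itertools.product([-1, 1], repeat=len(weights)):
        if not feasible(y):
            continue                       # skip outputs outside C(Y)
        score = sum(y_i * (w @ x) for y_i, w in zip(y, weights))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Toy setup: 5 local linear classifiers over 3 features; the invented
# constraint allows at most two of the five labels to be +1.
rng = np.random.default_rng(0)
weights = [rng.normal(size=3) for _ in range(5)]
feasible = lambda y: sum(1 for y_i in y if y_i == 1) <= 2
print(global_predict(rng.normal(size=3), weights, feasible))
```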
27 Bound Prediction
- LI vs. IBT: the more identifiable the individual problems are, the better the overall performance is with LI.
- Local: err ≤ err_opt + O( ( (d log m + log 1/δ) / m )^(1/2) )
- Global: err ≤ 0 + O( ( (c·d log m + c²·d log 1/δ) / m )^(1/2) )
28 Relative Merits: SRL
[Figure: relative performance of LI vs. IBT as the difficulty of the learning problem (number of features) ranges from easy to hard]
29 Summary
- When the local classification problems are easy, LI outperforms IBT.
- Only when the local problems become difficult to solve in isolation does IBT outperform LI, and it needs a large enough number of training examples.
- Why does inference help at all?
30 About Constraints
- We always assume that global coherency is good.
- Constraints do help in real-world applications.
- Performance is usually measured at the level of local predictions.
- Depending on the performance metric, constraints can hurt.
31 Results: Contribution of Expressive Constraints [Roth & Yih, 2005]
- Basic: learning with statistical constraints only
- Additional constraints added at evaluation time (for efficiency)

F1 as each constraint is added (diff is relative to the previous row):

Constraint        CRF-D    diff    CRF-ML    diff
basic (Viterbi)   69.14     -      66.46      -
no dup            69.74    0.60    67.10     0.64
cand              73.64    3.90    71.78     4.68
argument          73.71    0.07    71.71    -0.07
verb pos          73.78    0.07    71.72     0.01
disallow          73.91    0.13    71.94     0.22
32 Assumptions
- y = ⟨y_1, ..., y_l⟩
- Non-interactive classifiers f_i(x, y_i): each classifier does not use as inputs the outputs of other classifiers.
- Inference is a linear summation:
  - h_un(x) = argmax_{y ∈ Y} Σ_i f_i(x, y_i)
  - h_con(x) = argmax_{y ∈ C(Y)} Σ_i f_i(x, y_i)
- C(Y) always contains the correct outputs.
- No assumption on the structure of the constraints.
33 Performance Metrics
- Zero-one loss
  - Mistakes are counted globally: y is wrong if any of the y_i is wrong.
- Hamming loss
  - Mistakes are counted locally: the loss is the number of wrong y_i (both metrics are illustrated in the sketch below).
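A tiny illustration of the two metrics on a made-up 4-label output:

```python
def zero_one_loss(gold, pred):
    """1 if any local label is wrong, 0 if the whole output is correct."""
    return int(any(g != p for g, p in zip(gold, pred)))

def hamming_loss(gold, pred):
    """Number of local labels that are wrong."""
    return sum(g != p for g, p in zip(gold, pred))

gold = ["A0", "V", "A1", "null"]
pred = ["A0", "V", "A2", "null"]
print(zero_one_loss(gold, pred))  # 1: the global output is wrong
print(hamming_loss(gold, pred))   # 1: only one local decision is wrong
```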
34 Zero-One Loss
- Constraints cannot hurt.
  - Constraints never alter a correct global output: since C(Y) contains the correct output, the constrained argmax still returns it whenever the unconstrained prediction was already correct.
- This is not true for Hamming loss.
35 Boolean Cube
[Figure: Boolean cube of outputs, with vertices grouped by their number of mistakes relative to the correct output, from 0 mistakes up to 4 mistakes]
36 Hamming Loss
[Figure: Hamming loss illustrated on the Boolean cube for the example output 0011]
37 Best Classifiers
38 When Constraints Cannot Hurt
- δ_i: distance between the correct label and the 2nd-best label
- ε_i: distance between the predicted label and the correct label
- F_correct = { i : f_i is correct }
- F_wrong = { i : f_i is wrong }
- Constraints cannot hurt if
  - ∀ i ∈ F_correct: δ_i > Σ_{i ∈ F_wrong} ε_i  (a small checker is sketched below)
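A small sketch that checks this sufficient condition from per-classifier score distances; the margin values below are invented:

```python
def constraints_cannot_hurt(margins_correct, gaps_wrong):
    """Sufficient condition from the slide: every correct classifier's
    margin delta_i (correct vs. 2nd-best score) must exceed the total
    gap sum over the wrong classifiers (predicted vs. correct score)."""
    total_wrong_gap = sum(gaps_wrong)
    return all(delta > total_wrong_gap for delta in margins_correct)

# Toy example: three correct classifiers with large margins, one wrong
# classifier whose prediction is close to the correct label.
print(constraints_cannot_hurt(margins_correct=[1.2, 0.9, 1.5],
                              gaps_wrong=[0.4]))  # True
```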
39 An Empirical Investigation
- SRL system
- CoNLL-2005 WSJ test set
40 An Empirical Investigation
41 Good Classifiers
42 Bad Classifiers
43 Average Distance vs. Gain in Hamming Loss
- Good: high loss -> low score (low gain)
44 Good Classifiers
45 Bad Classifiers
46 Average Gain in Hamming Loss vs. Distance
- Good: high score -> low loss (high gain)
47 Utility of Constraints
- Constraints improve the performance because the classifiers are good.
- Good classifiers:
  - When the classifier is correct, it allows a large margin between the correct label and the 2nd-best label.
  - When the classifier is wrong, the correct label is not far away from the predicted one.
48 Conclusions
- Showed how global inference can be used
  - Semantic Role Labeling
- Tradeoffs between coupling vs. decoupling learning and inference
- Investigation of the utility of constraints
- The analyses are very preliminary
- Average-case analysis of the tradeoffs between coupling vs. decoupling learning and inference
- Better understanding of using constraints
  - More interactive classifiers
  - Different performance metrics, e.g. F1
  - Relation with margin