Title: IE With Undirected Models: the saga continues
1 IE With Undirected Models: the saga continues
2 Announcements
- Upcoming assignments:
  - Mon 2/23: Toutanova et al.
  - Wed 2/25: Klein & Manning, intro to max-margin theory
  - Mon 3/1: no writeup due
  - Wed 3/3: project proposal due (1-2 pages per person)
- Spring break week, no class
3 Motivation for CMMs
[Figure: CMM/MEMM-style chain over states S(t-1), S(t), S(t+1) with observation features such as: identity of word, ends in -ski, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor. Example: the token "Wisniewski" fires "is Wisniewski", "part of noun phrase", "ends in -ski"; the other states are labeled O.]
Idea: replace the generative model in an HMM with a maxent model, where the state depends on the observations and the previous state.
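A minimal sketch of this idea, using scikit-learn's LogisticRegression as a stand-in for the maxent learner; the feature names and toy data below are invented for illustration and are not from the lecture:

```python
# CMM/MEMM-style local model: a maxent classifier for
# P(state_t | state_{t-1}, observation features).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: each example is (feature dict, current state).
train = [
    ({"word=Wisniewski": 1, "ends_in_ski": 1, "prev_state=O": 1}, "PERSON"),
    ({"word=the": 1, "prev_state=O": 1}, "O"),
    ({"word=in": 1, "prev_state=PERSON": 1}, "O"),
]

vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
y = [state for _, state in train]
maxent = LogisticRegression(max_iter=1000).fit(X, y)

# Tag left to right: the state predicted at position t-1 becomes the
# prev_state feature at position t (greedy decoding; a real tagger would
# use beam search or a Viterbi-style search over these local distributions).
x_t = vec.transform([{"word=Kowalski": 1, "ends_in_ski": 1, "prev_state=O": 1}])
print(dict(zip(maxent.classes_, maxent.predict_proba(x_t)[0])))
```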
4 Implications of the model
- Does this do what we want?
- Q: does Y(i-1) depend on X(i+1)?
- A node is conditionally independent of its non-descendants given its parents.
5 CRF model
[Figure: linear-chain CRF with label nodes y1, y2, y3, y4 and observation x.]
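For reference, the linear-chain CRF defines the label-sequence distribution globally rather than as a product of locally normalized conditionals (standard formula from the CRF literature, not transcribed from the slide):

```latex
P(y_1,\dots,y_T \mid x) \;=\;
  \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big( \sum_{t,k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \Big)
```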
6 Dependency Nets
7 Dependency nets
- Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.
- For the chain below, the local models are Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3).
[Figure: label nodes y1, y2, y3, y4 in a chain, all connected to observation x.]
- Learning is local, but inference is not, and need not be unidirectional (see the sketch below).
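A minimal sketch of the "local learning, global inference" point, assuming the four local conditionals above have already been trained as classifiers; the function shape and the ordered-Gibbs sweep below are my illustration, not the paper's procedure:

```python
import random

# local_conditional[i] is assumed to be a trained classifier that returns
# {label: probability} for node i given the current labels y of its
# Markov-blanket neighbours and the observation x, i.e. stand-ins for
# P(y1|x,y2), P(y2|x,y1,y3), P(y3|x,y2,y4), P(y4|x,y3).

def gibbs_sweeps(local_conditional, labels, x, n_sweeps=50, seed=0):
    """Ordered Gibbs sampling: repeatedly resample each node from its local model."""
    rng = random.Random(seed)
    y = {i: rng.choice(labels) for i in range(len(local_conditional))}
    for _ in range(n_sweeps):
        for i in range(len(local_conditional)):   # inference visits every node,
            dist = local_conditional[i](y, x)     # in both directions along the chain
            r, acc = rng.random(), 0.0
            for label, p in dist.items():         # draw y_i from P(y_i | blanket, x)
                acc += p
                if r <= acc:
                    y[i] = label
                    break
    return y
```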
8 Toutanova, Klein, Manning, Singer
- Dependency nets for POS tagging vs. CMMs.
- Maxent is used for the local conditional model.
- Goals:
  - An easy-to-train bidirectional model
  - A really good POS tagger
9 Toutanova et al.
- Don't use Gibbs sampling for inference; instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
- Example: D = {11, 11, 11, 12, 21, 33}. The ML state is 11, but P(a=1|b=1) P(b=1|a=1) = (3/4)(3/4) < 1 = P(a=3|b=3) P(b=3|a=3), so the product-of-local-conditionals score prefers 33.
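A few lines to check the arithmetic of this counterexample (my reconstruction of the slide's example):

```python
from collections import Counter

# Dataset D of (a, b) pairs; the most likely joint state is (1, 1).
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
pairs = Counter(D)
a_counts = Counter(a for a, _ in D)
b_counts = Counter(b for _, b in D)

def product_score(a, b):
    """Score used by the Viterbi-like search: P(a|b) * P(b|a)."""
    return (pairs[(a, b)] / b_counts[b]) * (pairs[(a, b)] / a_counts[a])

print(product_score(1, 1))   # (3/4)*(3/4) = 0.5625
print(product_score(3, 3))   # 1*1 = 1.0  -> beats the ML state (1, 1)
```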
10 Results with model
Final test-set results:
- MXPost: 47.6, 96.4, 86.2
- CRF: 95.7, 76.4
11 Klein & Manning: Conditional Structure vs. Estimation
12 Task 1: WSD (Word Sense Disambiguation)
"Bush's election-year ad campaign will begin this summer, with..." (sense1)
"Bush whacking is tiring but rewarding; who wants to spend all their time on marked trails?" (sense2)
Class is sense1/sense2; features are context words.
13 Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes the joint likelihood under the independence assumption.
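Written out (standard multinomial NB, not transcribed from the slide), the model and decision rule are:

```latex
P(s, \mathbf{o}) \;=\; P(s)\prod_{j} P(o_j \mid s),
\qquad
\hat{s} \;=\; \arg\max_{s} P(s \mid \mathbf{o})
        \;=\; \arg\max_{s} P(s)\prod_{j} P(o_j \mid s)
```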
14 Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize conditional likelihood (sound familiar?)
...or maybe the SenseEval score, or maybe even...
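For concreteness, the joint and conditional likelihood objectives over training data D (standard definitions; the other objective variants on the slide were figures and are not reproduced here):

```latex
\mathrm{JL}(\theta) \;=\; \sum_{(s,\mathbf{o}) \in D} \log P_\theta(s, \mathbf{o}),
\qquad
\mathrm{CL}(\theta) \;=\; \sum_{(s,\mathbf{o}) \in D} \log P_\theta(s \mid \mathbf{o})
      \;=\; \sum_{(s,\mathbf{o}) \in D} \log \frac{P_\theta(s, \mathbf{o})}{\sum_{s'} P_\theta(s', \mathbf{o})}
```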
15 Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning
- Optimize SCL, CL with conjugate gradient
- Also over non-deficient models (?) using Lagrange penalties to enforce a soft version of the deficiency constraint
  - I think this makes sure the non-conditional version is a valid probability
- Don't even try optimizing accuracy
- Penalty for extreme predictions in SCL
17 Task 2: POS Tagging
- Sequential problem
- Replace NB with an HMM model.
- Standard algorithms maximize joint likelihood
- Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
  - Is this true?
- Alternative is conditional structure (CMM)
18 Using conditional structure vs. maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e. the JL estimate equals the CL estimate for Pr(s|o) (see the factorization below).
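The worked factorization, as I reconstruct the slide's argument:

```latex
\Pr(\mathbf{s}, \mathbf{o}) \;=\; \Pr(\mathbf{s} \mid \mathbf{o})\,\Pr(\mathbf{o})
  \;=\; \Big(\prod_{i} \Pr(s_i \mid s_{i-1}, \mathbf{o})\Big)\,\Pr(\mathbf{o}),
```

and since the parameters of the conditional part do not appear in \(\Pr(\mathbf{o})\),

```latex
\arg\max_{\theta} \sum_{d} \log \Pr_\theta(\mathbf{s}_d \mid \mathbf{o}_d)\,\Pr(\mathbf{o}_d)
  \;=\; \arg\max_{\theta} \sum_{d} \log \Pr_\theta(\mathbf{s}_d \mid \mathbf{o}_d),
```

so maximizing joint likelihood and maximizing conditional likelihood give the same estimate of Pr(s|o).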
19 Task 2: POS Tagging
Experiments with a simple feature set:
- For a fixed model, CL is preferred to JL (CRF beats HMM).
- For a fixed objective, HMM is preferred to MEMM/CMM.
20 Error analysis for POS tagging
- Label bias is not the issue:
  - state-state dependencies are weak compared to observation-state dependencies
  - too much emphasis on the observation, not enough on previous states (observation bias)
  - put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
21 Error analysis for POS tagging
22 Background for next week: the last 20 years of learning theory
23 Milestones in learning theory
- Valiant 1984 CACM
  - Turing machines and Turing tests: formal analysis of AI problems
  - Chernoff bound shows that if error(h) > ε, then Prob(h is consistent with m examples) < δ
  - So given m examples, can afford to examine ~2^m hypotheses
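The standard calculation behind this claim (my addition, not on the slide): a single hypothesis with error above ε survives m independent examples with probability at most (1-ε)^m, and a union bound over the hypothesis class H gives

```latex
\Pr\big[\exists\, h \in H:\ \mathrm{err}(h) > \epsilon \ \wedge\ h \text{ consistent with } m \text{ examples}\big]
  \;\le\; |H|\,(1-\epsilon)^{m}
  \;\le\; |H|\,e^{-\epsilon m}
  \;\le\; \delta
\quad\text{when}\quad
  m \;\ge\; \tfrac{1}{\epsilon}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big),
```

so |H| can be exponential in m, which is the sense in which m examples let one examine roughly 2^m hypotheses.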
24 Milestones in learning theory
- Haussler AAAI86
  - Pick a small hypothesis from a large set
  - Given m examples, can learn a hypothesis of size O(m) bits
- Blumer, Ehrenfeucht, Haussler, Warmuth, STOC88
  - Generalize the notion of hypothesis size to VC-dimension.
25 More milestones...
- Littlestone MLJ88: Winnow algorithm
  - Learning a small hypothesis in many dimensions, in the mistake-bounded model
  - Mistake bound ≥ VC-dim.
- Blum COLT91
  - Learning over infinitely many attributes in the mistake-bounded model
  - Learning as compression as learning...
26 More milestones...
- Freund & Schapire 1996
  - Boosting C4.5, even to extremes, does not overfit the data (!?) -- how does this reconcile with Occam's razor?
- Vapnik's support vector machines
  - kernel representation of a function
  - true optimization in machine learning
- Boosting as iterative margin maximization
30 Comments
- For bag-of-words text, R² = number of words in the doc (see the note below)
- Vocabulary size does not matter
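These comments presumably refer to a margin-based mistake bound such as the perceptron bound, which depends on the example radius R and margin γ but not on the dimension (my interpretation; the supporting slides had no transcript):

```latex
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2},
\qquad
R^{2} \;=\; \max_{d}\lVert x_{d}\rVert^{2}
      \;=\; \max_{d}\,\#\{\text{distinct words in doc } d\}
\quad (\text{binary bag-of-words features}),
```

so the bound scales with document length, and the vocabulary size never appears.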