IE With Undirected Models: the saga continues

Provided by: csCmuEdu120
Learn more at: http://www.cs.cmu.edu

1
IE With Undirected Models: the saga continues
  • William W. Cohen
  • CALD

2
Announcements
  • Upcoming assignments
  • Mon 2/23: Toutanova et al.
  • Wed 2/25: Klein & Manning, intro to max margin
    theory
  • Mon 3/1: no writeup due
  • Wed 3/3: project proposal due
  • personnel 1-2 page
  • Spring break week, no class

3
Motivation for CMMs
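[Figure: a chain of states S_{t-1}, S_t, S_{t+1}, each linked to its observation O_{t-1}, O_t, O_{t+1}. The example observation is annotated with the features "is Wisniewski", "part of noun phrase", "ends in -ski".]
Candidate observation features:
  • identity of word
  • ends in -ski
  • is capitalized
  • is part of a noun phrase
  • is in a list of city names
  • is under node X in WordNet
  • is in bold font
  • is indented
  • is in hyperlink anchor
Idea: replace the generative model in the HMM with a
maxent model, where the state depends on the observations
and the previous state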
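A minimal sketch of the local maxent model behind a CMM, under stated assumptions: the state names, feature names, and weights below are made up for illustration and are not from the slides. It only shows the shape of the computation, a softmax over states whose score combines a previous-state term with observation-feature terms.

```python
import math

def maxent_state_distribution(weights, states, prev_state, features):
    """Return P(state | prev_state, observation features) as a dict.

    weights maps ("prev", prev_state, state) and (feature, state) to real
    numbers; missing entries count as 0. All names are hypothetical.
    """
    scores = {}
    for s in states:
        score = weights.get(("prev", prev_state, s), 0.0)
        for f in features:
            score += weights.get((f, s), 0.0)
        scores[s] = score
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores.values())
    exps = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# Hypothetical states and hand-set weights, for illustration only.
STATES = ["PERSON", "OTHER"]
WEIGHTS = {
    ("ends_in_ski", "PERSON"): 2.0,
    ("is_capitalized", "PERSON"): 1.0,
    ("prev", "OTHER", "OTHER"): 0.5,
}
dist = maxent_state_distribution(
    WEIGHTS, STATES, "OTHER", ["ends_in_ski", "is_capitalized"]
)
print(dist)
```

In a real tagger the weights would be fit by maximizing conditional likelihood rather than set by hand.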
4
Implications of the model
  • Does this do what we want?
  • Q: does Y_{i-1} depend on X_{i+1}?
  • A node is conditionally independent of its
    non-descendants given its parents

5
CRF model
[Figure: linear chain of labels y1, y2, y3, y4, all conditioned on the observation x.]
6
Dependency Nets
7
Dependency nets
  • Learning is simple and elegant (if you know each
    node's Markov blanket): just learn a
    probabilistic classifier for P(X | pa(X)) for each
    node X.

Pr(y1 | x, y2)
Pr(y2 | x, y1, y3)
Pr(y3 | x, y2, y4)
Pr(y4 | x, y3)
[Figure: labels y1, y2, y3, y4 in a chain, each also connected to the observation x.]
Learning is local, but inference is not, and need
not be unidirectional
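To make "learning is local, but inference is not" concrete, here is a rough sketch of Gibbs sampling over a four-label dependency net. The function `local_conditional` is a toy stand-in for a learned classifier (it ignores the observation x and just favors agreement with neighbors); it is an assumption for illustration, not anything from the slides.

```python
import random

def local_conditional(i, y):
    """Toy stand-in for a learned classifier: P(y[i] = 1 | neighbors of i).

    Each label tends to agree with its left and right neighbors. Add-one
    smoothing keeps the probability strictly inside (0, 1).
    """
    neighbors = [y[j] for j in (i - 1, i + 1) if 0 <= j < len(y)]
    return (sum(neighbors) + 1) / (len(neighbors) + 2)

def gibbs_marginals(n_labels=4, n_sweeps=2000, burn_in=200, seed=0):
    """Estimate P(y[i] = 1) by repeatedly resampling each label in turn."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n_labels)]
    counts = [0] * n_labels
    for sweep in range(n_sweeps):
        for i in range(n_labels):
            # Resample label i from its local conditional given the rest.
            y[i] = 1 if rng.random() < local_conditional(i, y) else 0
        if sweep >= burn_in:
            for i in range(n_labels):
                counts[i] += y[i]
    kept = n_sweeps - burn_in
    return [c / kept for c in counts]

marginals = gibbs_marginals()
print(marginals)
```

Each local conditional looks only at a label's Markov blanket, but the sampler has to sweep over every label many times before the estimates settle, which is the sense in which inference is global.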
8
Toutanova, Klein, Manning, Singer
  • Dependency nets for POS tagging vs CMMs.
  • Maxent is used for local conditional model.
  • Goals
  • An easy-to-train bidirectional model
  • A really good POS tagger

9
Toutanova et al
  • Don't use Gibbs sampling for inference; instead
    use a Viterbi variant (which is not guaranteed to
    produce the ML sequence)

D = { (1,1), (1,1), (1,1), (1,2), (2,1), (3,3) }
ML state: (1,1)
P(a=1 | b=1) P(b=1 | a=1) < 1 = P(a=3 | b=3) P(b=3 | a=3)
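The counterexample on this slide can be checked numerically. The sketch below estimates the two conditionals from counts in the six-pair data set and compares the most frequent pair with the pair that maximizes the product of local conditionals. The helper name `product_score` is introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction

# The six (a, b) observations from the slide.
D = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
joint = Counter(D)
count_a = Counter(a for a, _ in D)
count_b = Counter(b for _, b in D)

def product_score(a, b):
    """P(a | b) * P(b | a), both estimated from counts in D."""
    return Fraction(joint[(a, b)], count_b[b]) * Fraction(joint[(a, b)], count_a[a])

# Most frequent pair under the empirical joint distribution.
ml_state = joint.most_common(1)[0][0]
# Pair that maximizes the product of the two local conditionals.
best_by_product = max(set(D), key=lambda ab: product_score(*ab))

print(ml_state, product_score(1, 1))          # (1, 1) 9/16
print(best_by_product, product_score(3, 3))   # (3, 3) 1
```

So scoring by the product of local conditionals picks (3,3) even though (1,1) is the most likely joint state.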
10
Results with model
Final test-set results
MXPost: 47.6, 96.4, 86.2
CRF: 95.7, 76.4
11
Klein & Manning: Conditional Structure vs
Estimation
12
Task 1: WSD (Word Sense Disambiguation)
  • "Bush's election-year ad campaign will begin this
    summer, with..." (sense1)
  • "Bush whacking is tiring but rewarding -- who wants
    to spend all their time on marked trails?" (sense2)
  • Class is sense1/sense2, features are context words.
13
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use conditional rule to predict sense s from
context-word observations o. Standard NB
training maximizes joint likelihood under
independence assumption.
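A small sketch of Model 1, assuming add-alpha smoothing and a toy training set loosely echoing the two example sentences (both are assumptions, not from the slides). Relative-frequency counting is what maximizes joint likelihood for this model. Prediction then normalizes the joint over senses.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, alpha=1.0):
    """Relative-frequency (joint-likelihood) estimates with add-alpha smoothing."""
    sense_counts = Counter(s for s, _ in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        word_counts[sense].update(words)
        vocab.update(words)
    log_prior = {s: math.log(c / len(examples)) for s, c in sense_counts.items()}
    log_likelihood = {}
    for s in sense_counts:
        total = sum(word_counts[s].values()) + alpha * len(vocab)
        log_likelihood[s] = {
            w: math.log((word_counts[s][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def posterior(log_prior, log_likelihood, words):
    """P(sense | words); words outside the training vocabulary are ignored."""
    scores = {
        s: log_prior[s]
        + sum(log_likelihood[s][w] for w in words if w in log_likelihood[s])
        for s in log_prior
    }
    m = max(scores.values())
    exps = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(exps.values())
    return {s: e / z for s, e in exps.items()}

# Toy training data, invented for illustration.
TRAIN = [
    ("sense1", ["election", "campaign", "ad"]),
    ("sense1", ["campaign", "summer"]),
    ("sense2", ["trails", "whacking"]),
    ("sense2", ["trails", "tiring"]),
]
prior, likelihood = train_naive_bayes(TRAIN)
post = posterior(prior, likelihood, ["campaign", "election"])
print(post)
```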
14
Task 1: WSD (Word Sense Disambiguation)
Model 2: keep the same functional form, but maximize
conditional likelihood (sound familiar?),
or maybe SenseEval score,
or maybe even accuracy.
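The candidate objectives can be compared on a fixed model. The sketch below assumes one reading of the abbreviations used in this deck: JL is the summed log joint probability, CL the summed log conditional probability, SCL the sum of conditional probabilities (taken here as the SenseEval-style score), and Acc the count of correct argmax predictions. The parameter values and data are invented for illustration.

```python
import math

# Hypothetical fixed naive Bayes parameters.
PRIOR = {"sense1": 0.5, "sense2": 0.5}
WORD_PROB = {
    "sense1": {"campaign": 0.6, "trails": 0.1, "summer": 0.3},
    "sense2": {"campaign": 0.1, "trails": 0.7, "summer": 0.2},
}

def joint(sense, words):
    """P(sense, words) under the naive Bayes independence assumption."""
    p = PRIOR[sense]
    for w in words:
        p *= WORD_PROB[sense][w]
    return p

def conditional(sense, words):
    """P(sense | words), obtained by normalizing the joint over senses."""
    return joint(sense, words) / sum(joint(s, words) for s in PRIOR)

def objectives(data):
    """Evaluate the four objectives on labeled (sense, words) pairs."""
    jl = sum(math.log(joint(s, ws)) for s, ws in data)
    cl = sum(math.log(conditional(s, ws)) for s, ws in data)
    scl = sum(conditional(s, ws) for s, ws in data)
    acc = sum(max(PRIOR, key=lambda t: conditional(t, ws)) == s for s, ws in data)
    return {"JL": jl, "CL": cl, "SCL": scl, "Acc": acc}

DATA = [
    ("sense1", ["campaign", "summer"]),
    ("sense2", ["trails"]),
    ("sense2", ["campaign"]),
]
scores = objectives(DATA)
print(scores)
```

All four are functions of the same parameters, so the same functional form can be trained toward any of them. Acc is piecewise constant in the parameters, which is one reason it is awkward to optimize directly.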
15
Task 1: WSD (Word Sense Disambiguation)
  • Optimize JL with std NB learning
  • Optimize SCL, CL with conjugate gradient
  • Also over non-deficient models (?) using
    Lagrange penalties to enforce soft version of
    deficiency constraint
  • I think this makes sure non-conditional version
    is a valid probability
  • Don't even try optimizing accuracy
  • Penalty for extreme predictions in SCL

17
Task 2: POS Tagging
  • Sequential problem
  • Replace NB with HMM model.
  • Standard algorithms maximize joint likelihood
  • Claim: keeping the same model but maximizing
    conditional likelihood leads to a CRF
  • Is this true?
  • Alternative is conditional structure (CMM)

18
Using conditional structure vs maximizing
conditional likelihood
CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the
CMM model, adding dependencies between observations
does not change Pr(s|o), i.e., JL estimate = CL
estimate for Pr(s|o)
19
Task 2: POS Tagging
Experiments with a simple feature set:
  • For fixed model, CL is preferred to JL (CRF beats HMM)
  • For fixed objective, HMM is preferred to MEMM/CMM
20
Error analysis for POS tagging
  • Label bias is not the issue
  • state-state dependencies are weak compared to
    observation-state dependencies
  • too much emphasis on observation, not enough on
    previous states (observation bias)
  • put another way: label bias predicts
    overprediction of states with few outgoing
    transitions, or more generally, low entropy...

21
Error analysis for POS tagging
22
Background for next week: the last 20 years of
learning theory
23
Milestones in learning theory
  • Valiant 1984 CACM
  • Turing machines and Turing tests: formal analysis
    of AI problems
  • Chernoff bound shows that if error of h > ε, then
    Prob(h consistent with m examples) < δ
  • So given m examples, can afford to examine 2^m
    hypotheses
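The "m examples buys about 2^m hypotheses" rule of thumb can be sketched with the standard bound for a finite hypothesis class: a hypothesis with error above ε is consistent with m independent examples with probability at most (1 - ε)^m ≤ exp(-εm), and a union bound over |H| hypotheses gives the sample size below. The function name is hypothetical, and this is the textbook argument rather than necessarily the exact one presented in lecture.

```python
import math

def occam_sample_size(num_hypotheses, epsilon, delta):
    """A sufficient number of examples m so that
    num_hypotheses * exp(-epsilon * m) <= delta,
    i.e. m >= (ln|H| + ln(1/delta)) / epsilon.
    """
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)

# Example: hypotheses describable in 20 bits, 10% error, 95% confidence.
m = occam_sample_size(2 ** 20, epsilon=0.1, delta=0.05)
print(m)  # 169
```

The required m grows with ln|H|, so for fixed ε and δ the number of hypotheses one can afford to consider grows exponentially in m.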

24
Milestones in learning theory
  • Haussler, AAAI86
  • Pick a small hypothesis from a large set
  • Given m examples, can learn hypothesis of size
    O(m) bits
  • Blumer, Ehrenfeucht, Haussler, Warmuth, STOC88
  • Generalize notion of hypothesis size to
    VC-dimension.

25
More milestones....
  • Littlestone MLJ88 Winnow algorithm
  • Learning small hypothesis in many dimensions,
    in mistake bounded model
  • Mistake bound ≥ VCdim.
  • Blum COLT91
  • Learning over infinitely many attributes in
    mistake-bounded model
  • Learning as compression as learning...
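For readers who have not seen Winnow, here is a minimal sketch of one common variant (multiplicative promotion and demotion by a factor alpha, threshold equal to the number of features). The choices alpha = 2, threshold = n, and the toy target x0 OR x1 are assumptions for illustration. Littlestone's paper covers several variants and parameter settings.

```python
import itertools

def winnow_train(examples, n_features, alpha=2.0, max_epochs=100):
    """Train on (x, y) pairs, x a 0/1 tuple and y in {0, 1}.

    Returns (weights, mistakes). Weights start at 1; only weights of
    active features change, and only when a mistake is made.
    """
    weights = [1.0] * n_features
    threshold = float(n_features)
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in examples:
            pred = 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0
            if pred == y:
                continue
            clean_pass = False
            mistakes += 1
            # Promote on a missed positive, demote on a false positive.
            factor = alpha if y == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if clean_pass:
            break
    return weights, mistakes

def winnow_predict(weights, x):
    """Predict 1 when the weighted sum reaches the threshold n."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= len(weights) else 0

# Toy target: x0 OR x1 over 6 boolean features (4 are irrelevant).
N = 6
examples = [(x, int(x[0] or x[1])) for x in itertools.product((0, 1), repeat=N)]
weights, mistakes = winnow_train(examples, N)
print(weights, mistakes)
```

The point of the multiplicative update is that the number of mistakes grows with the number of relevant features and only logarithmically with the total number of features.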

26
More milestones....
  • Freund & Schapire 1996
  • boosting C4.5, even to extremes, does not overfit
    data (!?) -- how does this reconcile with Occam's
    razor?
  • Vapnik's support vector machines
  • kernel representation of a function
  • true optimization in machine learning
  • boosting as iterative margin maximization

30
Comments
  • For bag-of-words text, R^2 = #words in doc
  • Vocabulary size matters not
