Transcript and Presenter's Notes

Title: Information Extraction 2


1
Information Extraction 2
  • CSE 454
  • Based on Slides by
  • William W. Cohen
  • Carnegie Mellon University
  • Andrew McCallum
  • University of Massachusetts Amherst
  • From KDD 2003

2
Class Overview
Info Extraction
Self-Supervised
HMMs
INet Advertising
Security
Cloud Computing
Revisiting
Sliding Windows
3
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Slides from Cohen McCallum
4
What is Information Extraction
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation  CEO  Bill Gates  Microsoft  Gates  Microsoft
Bill Veghte  Microsoft  VP  Richard Stallman  founder  Free Software
Foundation




Slides from Cohen McCallum
5
Landscape of IE Tasks (1/4): Pattern Feature Domain
Text paragraphs without formatting
Grammatical sentences and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links
Tables
Slides from Cohen McCallum
6
Landscape of IE Tasks (2/4): Pattern Scope
Web site specific
Genre specific
Wide, non-specific
Formatting
Layout
Language
Amazon Book Pages
Resumes
University Names
Slides from Cohen McCallum
7
Landscape of IE Tasks (3/4): Pattern Complexity
E.g. word patterns:
Closed set: U.S. states
  He was born in Alabama
  The big Wyoming sky
Regular set: U.S. phone numbers
  Phone: (413) 545-1323
  The CALD main office can be reached at 412-268-1299
Complex pattern: U.S. postal addresses
  University of Arkansas, P.O. Box 140, Hope, AR 71802
  Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence:
Person names
  …was among the six houses sold by Hope Feldman that year.
  Pawel Opalinski, Software Engineer at WhizBang Labs.
Slides from Cohen McCallum
8
Landscape of IE Tasks (4/4): Pattern Combinations
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity:
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut
N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt
Named entity extraction (the single-entity case)
Slides from Cohen McCallum
9
Landscape of IE Models
Lexicons
  Abraham Lincoln was born in Kentucky.
  member of lexicon?  {Alabama, Alaska, …, Wisconsin, Wyoming}
…and beyond
Any of these models can be used to capture words,
formatting, or both.
Slides from Cohen McCallum
10
Landscape: Focus of this Tutorial
Pattern complexity
closed set
regular
complex
ambiguous
Pattern feature domain
words
words formatting
formatting
Pattern scope
site-specific
genre-specific
general
Pattern combinations
entity
binary
n-ary
Models
lexicon
regex
window
boundary
FSM
CFG
Slides from Cohen McCallum
11
Bayes Theorem
Thomas Bayes, 1702-1761
P(A | B) = P(B | A) P(A) / P(B)
11
12
Bayesian Categorization
  • Let the set of categories be {c1, c2, … cn}
  • Let E be a description of an instance.
  • Determine the category of E by computing, for each
    ci, P(ci | E) = P(ci) P(E | ci) / P(E)
  • P(E) can be determined since the categories are
    complete and disjoint: P(E) = Σi P(ci) P(E | ci)

12
13
Naïve Bayesian Motivation
  • Problem Too many possible instances (exponential
    in m) to estimate all P(E ci)
  • If we assume features of an instance are
    independent given the category (ci)
    (conditionally independent).
  • Therefore, we then only need to know P(ej ci)
    for each feature and category.

13
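The two slides above describe Bayesian categorization under the Naïve Bayes independence assumption. As a concrete illustration (not part of the original deck), here is a minimal Python sketch that classifies by the arg-max over ci of P(ci) ∏j P(ej | ci), using add-one smoothing and invented toy features:

import math
from collections import defaultdict

# Minimal Naive Bayes sketch: categories are complete and disjoint, and
# features e_j are assumed conditionally independent given the category c_i,
# so each category is scored by log P(c_i) + sum_j log P(e_j | c_i).

def train_naive_bayes(examples):
    # examples: list of (feature_list, category).
    # Returns priors and per-category feature likelihoods (add-one smoothing).
    cat_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for feats, cat in examples:
        cat_counts[cat] += 1
        for f in feats:
            feat_counts[cat][f] += 1
            vocab.add(f)
    total = sum(cat_counts.values())
    priors = {c: n / total for c, n in cat_counts.items()}
    likelihoods = {}
    for c in cat_counts:
        denom = sum(feat_counts[c].values()) + len(vocab) + 1
        likelihoods[c] = {f: (feat_counts[c][f] + 1) / denom for f in vocab}
        likelihoods[c]["<unk>"] = 1 / denom      # unseen features
    return priors, likelihoods

def classify(feats, priors, likelihoods):
    # Return arg max_c  log P(c) + sum_j log P(e_j | c).
    def score(c):
        lk = likelihoods[c]
        return math.log(priors[c]) + sum(
            math.log(lk.get(f, lk["<unk>"])) for f in feats)
    return max(priors, key=score)

# Toy usage with invented spam-style features:
examples = [(["nigeria", "widow"], "spam"), (["cse", "454"], "ham"),
            (["widow", "money"], "spam"), (["lecture", "454"], "ham")]
priors, likelihoods = train_naive_bayes(examples)
print(classify(["nigeria", "money"], priors, likelihoods))   # -> 'spam'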
14
Probabilistic Graphical Models
  • Nodes = Random Variables
  • Directed Edges = Causal Connections
  • Each node has an associated CPT (conditional probability
    table)

[Figure: a hidden Boolean state y (Spam?) with causal, probabilistic
dependencies P(xi | y = spam) and P(xi | y ≠ spam) to three observable
Boolean random variables x1 (Nigeria?), x2 (Widow?), x3 (CSE 454?).]
15
Recap: Naïve Bayes
  • Assumption: features are independent given the label
  • Generative classifier
  • Models the joint distribution p(x,y)
  • Inference
  • Learning = counting
  • Can we use it for IE directly?

Labels of neighboring words are dependent!
The article appeared in the Seattle Times.  (Seattle → city?)
Need to consider the sequence!
Word features: length, capitalization, suffix, …
16
Sliding Windows
Slides from Cohen McCallum
17
Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a
vibrant and popular discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and growth, machine
learning is evolving into a collection of related disciplines:
inductive concept acquisition, analytic learning in problem solving
(e.g. analogy, explanation-based learning), learning theory (e.g. PAC
learning), genetic algorithms, connectionist learning, hybrid systems,
and so on.
CMU UseNet Seminar Announcement
Slides from Cohen McCallum
18
Extraction by Sliding Window
(Same CMU UseNet seminar announcement as the previous slide; the
sliding window has moved to another candidate position.)
Slides from Cohen McCallum
19
Extraction by Sliding Window
(Same announcement; the window has advanced again.)
Slides from Cohen McCallum
20
Extraction by Sliding Window
(Same announcement; the window continues to slide.)
Slides from Cohen McCallum
21
A Naïve Bayes Sliding Window Model
Freitag 1997
… 00 pm  Place: Wean Hall Rm 5409  Speaker: Sebastian Thrun …

   w(t-m) … w(t-1) | w(t) … w(t+n) | w(t+n+1) … w(t+n+m)
        prefix          contents           suffix

Estimate Pr(LOCATION | window) using Bayes rule.
Try all reasonable windows (vary the length and position).
Assume independence for length, prefix, suffix, and content words.
Estimate from data quantities like Pr("Place" in prefix | LOCATION).
If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract
it. (A scoring sketch appears below.)
Other examples of sliding windows: Baluja et al 2000 (a decision tree
over individual words & their context)
Slides from Cohen McCallum
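A hedged Python sketch of the window-scoring idea on this slide (an illustration, not Freitag's actual system): enumerate candidate windows, score each under independently estimated length / prefix / suffix / content tables, and keep the windows above a threshold. The model-table names below are invented placeholders.

import math

def score_window(tokens, start, length, model):
    # log Pr(LOCATION | window), up to a constant, assuming independence of
    # the window length, the prefix word, the suffix word, and content words.
    prefix = tokens[start - 1] if start > 0 else "<s>"
    suffix = tokens[start + length] if start + length < len(tokens) else "</s>"
    s = math.log(model["p_len"].get(length, 1e-6))
    s += math.log(model["p_prefix"].get(prefix, 1e-6))
    s += math.log(model["p_suffix"].get(suffix, 1e-6))
    for w in tokens[start:start + length]:
        s += math.log(model["p_content"].get(w, 1e-6))
    return s

def extract(tokens, model, min_len, max_len, threshold):
    # Try all reasonable windows (vary position and length); return those
    # whose score is above the threshold, best first.
    hits = []
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            if start + length > len(tokens):
                break
            s = score_window(tokens, start, length, model)
            if s > threshold:
                hits.append((s, " ".join(tokens[start:start + length])))
    return sorted(hits, reverse=True)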
22
Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
(Same seminar announcement as on the preceding slides.)

Field         F1
Person Name   30
Location      61
Start Time    98
Slides from Cohen McCallum
23
Realistic sliding-window-classifier IE
  • What windows to consider?
  • all windows containing as many tokens as the
    shortest example, but no more tokens than the
    longest example
  • How to represent a classifier? It might
  • Restrict the length of window
  • Restrict the vocabulary or formatting used
    before/after/inside window
  • Restrict the relative order of tokens, etc.
  • Learning Method
  • SRV: Top-Down Rule Learning [Freitag, AAAI
    98]
  • Rapier: Bottom-Up [Califf & Mooney, AAAI
    99]

Slides from Cohen McCallum
24
Rapier results: precision/recall
Slides from Cohen McCallum
25
Rapier results vs. SRV
Slides from Cohen McCallum
26
Rule-learning approaches to sliding-window
classification: Summary
  • SRV, Rapier, and WHISK [Soderland, KDD 97]
  • Representations for classifiers allow restriction
    of the relationships between tokens, etc
  • Representations are carefully chosen subsets of
    even more powerful representations based on logic
    programming (ILP and Prolog)
  • Use of these heavyweight representations is
    complicated, but seems to pay off in results
  • Can simpler representations for classifiers work?

Slides from Cohen McCallum
27
BWI Learning to detect boundaries
Freitag & Kushmerick, AAAI 2000
  • Another formulation: learn 3 probabilistic
    classifiers
  • START(i) = Prob(position i starts a field)
  • END(j) = Prob(position j ends a field)
  • LEN(k) = Prob(an extracted field has length k)
  • Score a possible extraction (i,j) by
    START(i) × END(j) × LEN(j−i)
  • LEN(k) is estimated from a histogram
    (a scoring sketch appears below)

Slides from Cohen McCallum
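A minimal Python sketch of the scoring rule above, START(i) × END(j) × LEN(j−i). The start_prob and end_prob callables stand in for the boosted START/END detectors that BWI actually learns; len_hist is the length histogram.

def bwi_extract(tokens, start_prob, end_prob, len_hist, threshold):
    # Score every candidate span (i, j) by START(i) * END(j) * LEN(j - i).
    # start_prob(i), end_prob(j): position -> probability (detector stand-ins);
    # len_hist: dict mapping field length -> probability from a histogram.
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            score = start_prob(i) * end_prob(j) * len_hist.get(j - i, 0.0)
            if score > threshold:
                spans.append((score, i, j, " ".join(tokens[i:j])))
    return sorted(spans, reverse=True)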
28
BWI Learning to detect boundaries
  • BWI uses boosting to find detectors for START
    and END
  • Each weak detector has a BEFORE and AFTER pattern
    (on tokens before/after position i).
  • Each pattern is a sequence of
  • tokens and/or
  • wildcards like anyAlphabeticToken, anyNumber, …
  • The weak learner for patterns uses greedy search
    (plus lookahead) to repeatedly extend a pair of empty
    BEFORE, AFTER patterns

Slides from Cohen McCallum
29
BWI Learning to detect boundaries
Naïve Bayes:
Field        F1
Speaker      30
Location     61
Start Time   98
Slides from Cohen McCallum
30
Problems with Sliding Windows and Boundary
Finders
  • Decisions in neighboring parts of the input are
    made independently from each other.
  • Naïve Bayes Sliding Window may predict a seminar
    end time before the seminar start time.
  • It is possible for two overlapping windows to
    both be above threshold.
  • In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

Slides from Cohen McCallum
31
Landscape of IE Techniques (1/1): Models
Each model can capture words, formatting, or both
Slides from Cohen McCallum
32
Finite State Machines
Slides from Cohen McCallum
33
Hidden Markov Models (HMMs)
The standard sequence model in genomics, speech, NLP, …

Graphical model / finite state model:
[Hidden states S(t-1) → S(t) → S(t+1), connected by transitions; each
state emits an observation O(t-1), O(t), O(t+1).]
Generates: a state sequence and an observation sequence
o1 o2 o3 o4 o5 o6 o7 o8

Parameters: for all states S = {s1, s2, …}
  Start state probabilities P(s1)
  Transition probabilities P(st | st-1)
  Observation (emission) probabilities P(ot | st)
Training: maximize the probability of the training observations (w/
prior)
The emission distribution is usually a multinomial over an atomic,
fixed alphabet.
Slides from Cohen McCallum
34
Example The Dishonest Casino
  • A casino has two dice:
  • Fair die
  • P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
  • Loaded die
  • P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
  • P(6) = 1/2
  • The casino player switches back and forth between the fair
    and loaded die once every 20 turns
  • Game:
  • You bet $1
  • You roll (always with a fair die)
  • The casino player rolls (maybe with the fair die, maybe
    with the loaded die)
  • Highest number wins $2

Slides from Serafim Batzoglou
35
Question 1 Evaluation
  • GIVEN
  • A sequence of rolls by the casino player
  • 124552646214614613613666166466163661636616361
  • QUESTION
  • How likely is this sequence, given our model of
    how the casino works?
  • This is the EVALUATION problem in HMMs

Slides from Serafim Batzoglou
36
Question 2 Decoding
  • GIVEN
  • A sequence of rolls by the casino player
  • 1245526462146146136136661664661636616366163
  • QUESTION
  • What portion of the sequence was generated with
    the fair die, and what portion with the loaded
    die?
  • This is the DECODING question in HMMs

Slides from Serafim Batzoglou
37
Question 3 Learning
  • GIVEN
  • A sequence of rolls by the casino player
  • 124552646214614613613666166466163661636616361651
  • QUESTION
  • How loaded is the loaded die? How fair is the
    fair die? How often does the casino player change
    from fair to loaded, and back?
  • This is the LEARNING question in HMMs

Slides from Serafim Batzoglou
38
The dishonest casino model
[Two states, FAIR and LOADED; each stays put with probability 0.95 and
switches to the other with probability 0.05.]
FAIR:    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED:  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2
Slides from Serafim Batzoglou
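For concreteness, the dishonest-casino model on this slide can be written out as data, with a tiny sampler (a Python/NumPy sketch, not from the slides; the uniform start distribution is an assumption the slide does not state):

import numpy as np

states = ["FAIR", "LOADED"]
start = np.array([0.5, 0.5])              # assumed uniform start distribution
trans = np.array([[0.95, 0.05],           # from FAIR:   stay 0.95, switch 0.05
                  [0.05, 0.95]])          # from LOADED: switch 0.05, stay 0.95
emit = np.array([[1/6] * 6,                              # FAIR: each face 1/6
                 [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]])   # LOADED: 6 half the time

def sample(n_rolls, seed=0):
    # Generate a (state sequence, roll sequence) pair from the model.
    rng = np.random.default_rng(seed)
    s = rng.choice(2, p=start)
    ys, xs = [], []
    for _ in range(n_rolls):
        ys.append(states[s])
        xs.append(int(rng.choice(6, p=emit[s])) + 1)   # die face 1..6
        s = rng.choice(2, p=trans[s])
    return ys, xs

print(sample(20))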
39
What's this have to do with Info Extraction?
(The same dishonest-casino HMM as on the previous slide, repeated here
for comparison with the next slide.)
40
What's this have to do with Info Extraction?
[Same two-state structure, but the states are now TEXT and NAME:
self-transition probability 0.95, switching probability 0.05.]
TEXT:  P(the | T) = 0.003,  P(from | T) = 0.002, …
NAME:  P(Dan | N) = 0.005,  P(Sue | N) = 0.003, …
41
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Pedro Domingos spoke this example
sentence.
and a trained HMM
(with states: person name, location name, background)
Find the most likely state sequence (Viterbi):
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated
person-name state are extracted as a person name:
Person name: Pedro Domingos
(A small extraction sketch follows below.)
Slide by Cohen McCallum
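A small illustrative Python sketch (not from the slides) of that last extraction step: given tokens tagged with the Viterbi state sequence, collect the runs of tokens labeled by the designated person-name state. The state names here are assumptions.

def extract_person_names(tokens, state_seq, person_state="person name"):
    # Group consecutive tokens whose Viterbi state is the designated
    # person-name state into extracted names.
    names, current = [], []
    for tok, st in zip(tokens, state_seq):
        if st == person_state:
            current.append(tok)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

tokens = "Yesterday Pedro Domingos spoke this example sentence .".split()
state_seq = ["background", "person name", "person name", "background",
             "background", "background", "background", "background"]
print(extract_person_names(tokens, state_seq))   # ['Pedro Domingos']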
42
IE with Hidden Markov Models
  • For sparse extraction tasks
  • Separate HMM for each type of target
  • Each HMM should
  • Model entire document
  • Consist of target and non-target states
  • Not necessarily fully connected

42
Slide by Okan Basegmez
43
Or Combined HMM
  • Example Research Paper Headers

43
Slide by Okan Basegmez
44
HMM Example Nymble
Bikel, et al 1998, BBN IdentiFinder
Task: Named Entity Extraction
[State diagram: start-of-sentence → {Person, Org, (five other name
classes), Other} → end-of-sentence, with transition probabilities and
observation probabilities, each backed off to simpler models.]
Train on 500k words of news wire text.

Results:
Case    Language   F1
Mixed   English    93
Upper   English    91
Mixed   Spanish    90

Other examples of shrinkage for HMMs in IE:
[Freitag and McCallum 99]
Slide by Cohen McCallum
45
HMM Example Nymble
Bikel, et al 1998, BBN IdentiFinder
Task: Named Entity Extraction
Train on 500k words of news wire text.
(Same state diagram and results as the previous slide.)
Slide adapted from Cohen McCallum
46
Finite State Model
[The Nymble finite-state model (Person, Org, five other name classes,
and Other, between start-of-sentence and end-of-sentence) contrasted
with a single path through it: a state sequence y1 y2 y3 y4 y5 y6 …
aligned with the observations x1 x2 x3 x4 x5 x6 …]
47
Question 1 Evaluation
GIVEN:  a sequence of observations x1 x2 x3 x4 … xN
        and a trained HMM λ = (A, B, π)
QUESTION:  How likely is this sequence, given our HMM?  P(x | λ)
Why do we care?  We need it, in learning, to choose among competing
models!
48
Question 2 - Decoding
GIVEN:  a sequence of observations x1 x2 x3 x4 … xN
        and a trained HMM λ = (A, B, π)
QUESTION:  How do we choose the corresponding parse (state sequence)
y1 y2 y3 y4 … yN which best explains x1 x2 x3 x4 … xN?
There are several reasonable optimality criteria:
the single optimal sequence, average statistics for individual
states, …
49
A parse of a sequence
Given a sequence x = x1 … xN, a parse of x is a
sequence of states y = y1, …, yN
[Trellis: the states 1 (person), 2 (other), …, K (location) at each
position, over the observations x1 x2 x3 … xN.]
Slide by Serafim Batzoglou
50
Question 3 - Learning
GIVEN:  a sequence of observations x1 x2 x3 x4 … xN
QUESTION:  How do we learn the model parameters λ = (A, B, π)
which maximize P(x | λ)?
51
Three Questions
  • Evaluation
  • Forward algorithm
  • (Could also go other direction)
  • Decoding
  • Viterbi algorithm
  • Learning
  • Baum-Welch Algorithm (aka forward-backward)
  • A kind of EM (expectation maximization)

52
A Solution to 1 Evaluation
  • Given observations x = x1 … xN and an HMM λ, what is
    P(x)?
  • Enumerate every possible state sequence y = y1 … yN
  • Probability of x given a particular y
  • Probability of a particular y
  • Summing over all possible state sequences we get P(x)

2T multiplications per sequence
For a small HMM with T = 10 and N = 10 there are 10 billion
sequences!
N^T state sequences!
53
Solution to 1 Evaluation
  • Use Dynamic Programming

Define the forward variable αt(i): the probability that at
time t the state is Si and the partial
observation sequence x = x1 … xt has been emitted
54
Forward Variable αt(i)
The probability that the state at time t was Si and
the partial observation sequence x = x1 … xt has been seen
[Trellis: states 1 (person), 2 (other), …, K (location) at each time
step, over the observations x1 x2 x3 … xt.]
55
Forward Variable αt(i)
(Same trellis as the previous slide, now highlighting the cell for a
particular state Si at time t.)
56
Solution to 1 Evaluation
  • Use Dynamic Programming
  • Cache and reuse inner sums
  • Define forward variables αt(i):
  • the probability that at time t
  • the state is yt = Si
  • and the partial observation sequence x = x1 … xt has
    been emitted

57
The Forward Algorithm
58
The Forward Algorithm
  • INITIALIZATION:  α1(i) = πi bi(x1)
  • INDUCTION:  αt+1(j) = [ Σi αt(i) aij ] bj(xt+1)
  • TERMINATION:  P(x | λ) = Σi αN(i)
    (where aij = P(st+1 = Sj | st = Si), bj(x) = P(x | st = Sj),
    πi = P(s1 = Si))

Time: O(K²N)
Space: O(KN)
K = |S| = number of states, N = length of the sequence
(A code sketch follows below.)
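A minimal Python/NumPy sketch of the forward recursion above (an illustration, not reference code): start holds π, trans is the matrix A with entries aij, emit is the matrix B with emit[i, x] = bi(x), and the observations are integer-coded. A real implementation would scale or work in log space to avoid underflow.

import numpy as np

def forward(obs, start, trans, emit):
    # alpha[t, i] = P(x_1 .. x_t, y_t = S_i).
    # obs: integer-coded observations; start: (K,); trans: (K, K); emit: (K, M).
    K, N = len(start), len(obs)
    alpha = np.zeros((N, K))
    alpha[0] = start * emit[:, obs[0]]                 # initialization
    for t in range(1, N):                              # induction
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha, alpha[-1].sum()                      # termination: P(x) = sum_i alpha_N(i)

# e.g. with the casino arrays from the earlier sketch (faces coded 0..5):
# alpha, p_x = forward([r - 1 for r in rolls], start, trans, emit)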
59
The Backward Algorithm
60
The Backward Algorithm
  • INITIALIZATION:  βN(i) = 1
  • INDUCTION:  βt(i) = Σj aij bj(xt+1) βt+1(j)
  • TERMINATION:  P(x | λ) = Σi πi bi(x1) β1(i)

Time: O(K²N)
Space: O(KN)
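The backward recursion in the same conventions as the forward sketch above (again a hedged sketch with no scaling):

import numpy as np

def backward(obs, start, trans, emit):
    # beta[t, i] = P(x_{t+1} .. x_N | y_t = S_i).
    K, N = len(start), len(obs)
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                      # initialization
    for t in range(N - 2, -1, -1):                      # induction
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    p_x = np.sum(start * emit[:, obs[0]] * beta[0])     # termination
    return beta, p_x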
61
Three Questions
  • Evaluation
  • Forward algorithm
  • (Could also go other direction)
  • Decoding
  • Viterbi algorithm
  • Learning
  • Baum-Welch Algorithm (aka forward-backward)
  • A kind of EM (expectation maximization)

62
The Viterbi trellis looks like the forward trellis, but with δ's
(maximizations) in place of the α's (sums).
63
Solution to 2 - Decoding
  • Given x = x1 … xN and an HMM λ, what is the best parse y1
    … yN?
  • Several optimal solutions:
  • 1. The states that are individually most likely
  • 2. The single best state sequence: we want to find the
    sequence y1 … yN
  • such that P(x,y) is maximized:
  • y* = argmaxy P(x, y)
  • Again, we can use dynamic programming!

64
The Viterbi Algorithm
  • DEFINE:  δt(i) = max over y1 … yt−1 of
    P(y1 … yt−1, yt = Si, x1 … xt)
  • INITIALIZATION:  δ1(i) = πi bi(x1)
  • INDUCTION:  δt(j) = [ maxi δt−1(i) aij ] bj(xt),
    with back-pointer ψt(j) = argmaxi δt−1(i) aij
  • TERMINATION:  P* = maxi δN(i),  y*N = argmaxi δN(i)

Backtracking to get the state sequence y*:  y*t = ψt+1(y*t+1)
(A code sketch follows below.)
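A Python/NumPy sketch of the Viterbi recursion just stated, working in log probabilities and backtracking through the ψ pointers (same array conventions as the forward/backward sketches above):

import numpy as np

def viterbi(obs, start, trans, emit):
    # Most likely state sequence y* = argmax_y P(x, y), by dynamic programming.
    K, N = len(start), len(obs)
    log_t, log_e = np.log(trans), np.log(emit)          # zero probs become -inf
    delta = np.zeros((N, K))
    psi = np.zeros((N, K), dtype=int)
    delta[0] = np.log(start) + log_e[:, obs[0]]         # initialization
    for t in range(1, N):                               # induction
        cand = delta[t - 1][:, None] + log_t            # cand[i, j]: via state i
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + log_e[:, obs[t]]
    path = [int(delta[-1].argmax())]                    # termination
    for t in range(N - 1, 0, -1):                       # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()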
65
The Viterbi Algorithm
[Trellis over the observations x1 x2 … xj−1 xj … xT and states 1 … K;
each δj(i) is obtained by maximizing over the possible predecessors of
state i.]
Time: O(K²T), Space: O(KT)
Linear in the length of the sequence
Remember: δt(i) = probability of the most likely state sequence ending
in state Si at time t
Slides from Serafim Batzoglou
66
The Viterbi Algorithm
Pedro Domingos
66
67
Three Questions
  • Evaluation
  • Forward algorithm
  • (Could also go other direction)
  • Decoding
  • Viterbi algorithm
  • Learning
  • Baum-Welch Algorithm (aka forward-backward)
  • A kind of EM (expectation maximization)

68
Solution to 3 - Learning
  • Given x1 … xN, how do we learn λ = (A, B, π)
    to maximize P(x)?
  • Unfortunately, there is no known way to
    analytically find a global maximum, i.e. the λ* such
    that λ* = arg max P(o | λ)
  • But it is possible to find a local maximum: given
    an initial model λ, we can always find a model λ′
    such that P(o | λ′) ≥ P(o | λ)

69
Chicken & Egg Problem
  • If we knew the actual sequence of states
  • It would be easy to learn transition and emission
    probabilities
  • But we can't observe the states, so we don't!
  • If we knew the transition & emission probabilities
  • Then it'd be easy to estimate the sequence of
    states (Viterbi)
  • But we don't know them!

69
Slide by Daniel S. Weld
70
Simplest Version
  • Mixture of two distributions
  • Know the form of the distribution & variance (= 5)
  • Just need the mean of each distribution

70
Slide by Daniel S. Weld
71
Input Looks Like
[Plot: unlabeled data points on a number line, ticks at .01 .03 .05 .07 .09]
71
Slide by Daniel S. Weld
72
We Want to Predict
?
72
Slide by Daniel S. Weld
73
Chicken & Egg
  • Note that coloring the instances would be easy
  • if we knew the Gaussians.
73
Slide by Daniel S. Weld
74
Chicken & Egg
  • And finding the Gaussians would be easy
  • if we knew the coloring
74
Slide by Daniel S. Weld
75
Expectation Maximization (EM)
  • Pretend we do know the parameters
  • Initialize randomly: set the two means μ1 and μ2 to
    random values
75
Slide by Daniel S. Weld
76
Expectation Maximization (EM)
  • Pretend we do know the parameters
  • Initialize randomly
  • E step Compute probability of instance having
    each possible value of the hidden variable

76
Slide by Daniel S. Weld
77
Expectation Maximization (EM)
  • Pretend we do know the parameters
  • Initialize randomly
  • E step Compute probability of instance having
    each possible value of the hidden variable

77
Slide by Daniel S. Weld
78
Expectation Maximization (EM)
  • Pretend we do know the parameters
  • Initialize randomly
  • E step Compute probability of instance having
    each possible value of the hidden variable

M step Treating each instance as fractionally
having both values compute the new parameter
values
78
Slide by Daniel S. Weld
79
ML Mean of Single Gaussian
  • μML = argminμ Σi (xi − μ)²
79
Slide by Daniel S. Weld
80
Expectation Maximization (EM)
  • E step Compute probability of instance having
    each possible value of the hidden variable

M step Treating each instance as fractionally
having both values compute the new parameter
values
80
Slide by Daniel S. Weld
81
Expectation Maximization (EM)
  • E step Compute probability of instance having
    each possible value of the hidden variable

81
Slide by Daniel S. Weld
82
Expectation Maximization (EM)
  • E step Compute probability of instance having
    each possible value of the hidden variable

M step Treating each instance as fractionally
having both values compute the new parameter
values
82
Slide by Daniel S. Weld
83
Expectation Maximization (EM)
  • E step Compute probability of instance having
    each possible value of the hidden variable

M step Treating each instance as fractionally
having both values compute the new parameter
values
83
Slide by Daniel S. Weld
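The procedure sketched on the preceding slides can be written down directly for this simplest case: a mixture of two Gaussians with known, shared variance and equal mixing weights, where only the two means are unknown. A Python/NumPy sketch with made-up toy data (not the instructor's code):

import numpy as np

def em_two_gaussians(x, sigma, n_iters=50, seed=0):
    # EM for a mixture of two Gaussians with known, shared variance sigma**2
    # and equal mixing weights; only the two means are estimated.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    mu = rng.choice(x, size=2, replace=False)           # initialize randomly
    for _ in range(n_iters):
        # E step: responsibility of each component for each point
        # (the probability of each instance's hidden "color")
        logp = -(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: each instance counts fractionally toward both means
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

data = [0.01, 0.02, 0.03, 0.07, 0.08, 0.09]             # invented toy data
print(em_two_gaussians(data, sigma=0.02))               # means near 0.02 and 0.08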
84
EM for HMMs
  • E step Compute probability of instance having
    each possible value of the hidden variable
  • Compute the forward and backward probabilities
    for given model parameters and our observations

M step: Treating each instance as fractionally
having both values, compute the new parameter
values: re-estimate the model parameters by
simple counting
84
85
Summary - Learning
  • Use hill-climbing
  • Called the forward-backward (or Baum/Welch)
    algorithm
  • Idea
  • Use an initial parameter instantiation
  • Loop
  • Compute the forward and backward probabilities
    for given model parameters and our observations
  • Re-estimate the parameters
  • Until the estimates don't change much
    (a one-iteration sketch of this re-estimation follows below)
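A Python/NumPy sketch of one such re-estimation pass (one Baum-Welch / forward-backward iteration). It reuses the forward() and backward() functions sketched earlier, so those definitions are assumed to be in scope; a practical implementation would add scaling or log-space arithmetic and smoothing.

import numpy as np

def baum_welch_step(obs, start, trans, emit):
    # One EM iteration for an HMM: E step via forward/backward variables,
    # M step by normalized expected counts ("simple counting").
    K, M, N = trans.shape[0], emit.shape[1], len(obs)
    alpha, p_x = forward(obs, start, trans, emit)
    beta, _ = backward(obs, start, trans, emit)
    gamma = alpha * beta / p_x                           # gamma[t, i] = P(y_t = i | x)
    xi = np.zeros((K, K))                                # expected transition counts
    for t in range(N - 1):
        xi += (alpha[t][:, None] * trans *
               (emit[:, obs[t + 1]] * beta[t + 1])[None, :]) / p_x
    new_start = gamma[0]
    new_trans = xi / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros((K, M))
    for t in range(N):
        new_emit[:, obs[t]] += gamma[t]
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_start, new_trans, new_emit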

86
IE Resources
  • Data
  • RISE, http://www.isi.edu/~muslea/RISE/index.html
  • Linguistic Data Consortium (LDC)
  • Penn Treebank, Named Entities, Relations, etc.
  • http://www.biostat.wisc.edu/~craven/ie
  • http://www.cs.umass.edu/~mccallum/data
  • Code
  • TextPro, http://www.ai.sri.com/~appelt/TextPro
  • MALLET, http://www.cs.umass.edu/~mccallum/mallet
  • SecondString, http://secondstring.sourceforge.net/
  • Both
  • http://www.cis.upenn.edu/~adwait/penntools.html

Slides from Cohen McCallum
87
References
  • Bikel et al 1997 Bikel, D. Miller, S.
    Schwartz, R. and Weischedel, R. Nymble: a
    high-performance learning name-finder. In
    Proceedings of ANLP-97, pp. 194-201.
  • Califf Mooney 1999, Califf, M.E. Mooney, R.
    Relational Learning of Pattern-Match Rules for
    Information Extraction, in Proceedings of the
    Sixteenth National Conference on Artificial
    Intelligence (AAAI-99).
  • Cohen, Hurst, Jensen, 2002 Cohen, W. Hurst,
    M. Jensen, L. A flexible learning system for
    wrapping tables and lists in HTML documents.
    Proceedings of The Eleventh International World
    Wide Web Conference (WWW-2002)
  • Cohen, Kautz, McAllester 2000 Cohen, W Kautz,
    H. McAllester, D. Hardening soft information
    sources. Proceedings of the Sixth International
    Conference on Knowledge Discovery and Data Mining
    (KDD-2000).
  • Cohen, 1998 Cohen, W. Integration of
    Heterogeneous Databases Without Common Domains
    Using Queries Based on Textual Similarity, in
    Proceedings of ACM SIGMOD-98.
  • Cohen, 2000a Cohen, W. Data Integration using
    Similarity Joins and a Word-based Information
    Representation Language, ACM Transactions on
    Information Systems, 18(3).
  • Cohen, 2000b Cohen, W. Automatically Extracting
    Features for Concept Learning from the Web,
    Machine Learning: Proceedings of the Seventeenth
    International Conference (ML-2000).
  • Collins Singer 1999 Collins, M. and Singer,
    Y. Unsupervised models for named entity
    classification. In Proceedings of the Joint
    SIGDAT Conference on Empirical Methods in Natural
    Language Processing and Very Large Corpora, 1999.
  • De Jong 1982 De Jong, G. An Overview of the
    FRUMP System. In Lehnert, W. Ringle, M. H.
    (eds), Strategies for Natural Language
    Processing. Lawrence Erlbaum, 1982, 149-176.
  • Freitag 98 Freitag, D Information extraction
    from HTML application of a general machine
    learning approach, Proceedings of the Fifteenth
    National Conference on Artificial Intelligence
    (AAAI-98).
  • Freitag, 1999, Freitag, D. Machine Learning
    for Information Extraction in Informal Domains.
    Ph.D. dissertation, Carnegie Mellon University.
  • Freitag 2000, Freitag, D Machine Learning for
    Information Extraction in Informal Domains,
    Machine Learning 39(2/3) 99-101 (2000).
  • Freitag Kushmerick, 1999 Freitag, D
    Kushmerick, D. Boosted Wrapper Induction.
    Proceedings of the Sixteenth National Conference
    on Artificial Intelligence (AAAI-99)
  • Freitag McCallum 1999 Freitag, D. and
    McCallum, A. Information extraction using HMMs
    and shrinkage. In Proceedings AAAI-99 Workshop
    on Machine Learning for Information Extraction.
    AAAI Technical Report WS-99-11.
  • Kushmerick, 2000 Kushmerick, N Wrapper
    Induction efficiency and expressiveness,
    Artificial Intelligence, 118, pp. 15-68.
  • Lafferty, McCallum Pereira 2001 Lafferty,
    J. McCallum, A. and Pereira, F., Conditional
    Random Fields Probabilistic Models for
    Segmenting and Labeling Sequence Data, In
    Proceedings of ICML-2001.
  • Leek 1997 Leek, T. R. Information extraction
    using hidden Markov models. Master's thesis. UC
    San Diego.
  • McCallum, Freitag Pereira 2000 McCallum, A.
    Freitag, D. and Pereira. F., Maximum entropy
    Markov models for information extraction and
    segmentation, In Proceedings of ICML-2000
  • Miller et al 2000 Miller, S. Fox, H.
    Ramshaw, L. Weischedel, R. A Novel Use of
    Statistical Parsing to Extract Information from
    Text. Proceedings of the 1st Annual Meeting of
    the North American Chapter of the ACL (NAACL), p.
    226 - 233.

Slides from Cohen McCallum