Title: Toward Unified Models of Information Extraction and Data Mining
1. Toward Unified Models of Information Extraction and Data Mining
- Andrew McCallum
- Information Extraction and Synthesis Laboratory
- Computer Science Department
- University of Massachusetts Amherst
- Joint work with Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li
2. Goal
Improving our ability to mine actionable knowledge from unstructured text.
3. Pages Containing the Phrase "high tech job openings"
4. Extracting Job Openings from the Web
5. A Portal for Job Openings
6. Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
7. Data Mining the Extracted Job Information
8. IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k documents, several centuries old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries
9. What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT: For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying ...

Extracted mentions: Microsoft Corporation CEO Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
10.-12. What is Information Extraction?
(Slides 10-12 repeat the same passage as slide 9, building up the segmentation, classification, association, and clustering steps in turn. Slide 12 ends with the associated records extracted from the passage:)
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
13. Larger Context
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining → Actionable knowledge
- Data Mining discovers patterns: entity types, links / relations, events.
- Actionable knowledge: prediction, outlier detection, decision support.
14. Problem
- Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
- KD begins from a populated DB, unaware of where the data came from, or of its inherent uncertainties.
- IE is unaware of emerging patterns and regularities in the DB.
- The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
15. Solution: Uncertainty Info
Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining → Actionable knowledge
Now IE passes uncertainty information forward into the database, and emerging patterns from data mining flow back to inform IE.
- Data Mining discovers patterns: entity types, links / relations, events.
- Actionable knowledge: prediction, outlier detection, decision support.
16. Solution: Unified Model
Document collection → Spider → Filter → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining → Actionable knowledge
- Discover patterns: entity types, links / relations, events.
- Actionable knowledge: prediction, outlier detection, decision support.
17. Outline
- The need for unified IE and DM.
- Review of Conditional Random Fields for IE.
- Preliminary steps toward unification:
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)
- Conclusions
18. Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: graphical model / finite state model with states S(t-1), S(t), S(t+1), transitions between them, and observations O(t-1), O(t), O(t+1); the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.]
Parameters, for all states S = {s1, s2, ...}: start state probabilities P(s_t), transition probabilities P(s_t | s_{t-1}), and observation (emission) probabilities P(o_t | s_t).
Training: maximize the probability of the training observations (with a prior); the joint distribution is written out below.
The emission distribution is usually a multinomial over an atomic, fixed alphabet.
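For reference, the joint distribution an HMM defines over a state sequence s = s_1..s_T and observation sequence o = o_1..o_T is the standard factorization (stated here; it is not reproduced on the slide):

```latex
P(\mathbf{s}, \mathbf{o}) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t),
\qquad \text{with } P(s_1 \mid s_0) \text{ taken to be the start-state distribution.}
```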
19. IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Rich Caruana spoke this example sentence.
and a trained HMM with states such as person name, location name, and background,
find the most likely state sequence (Viterbi):
  Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Rich Caruana
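A minimal sketch of that extraction step: Viterbi decoding over a toy HMM, then pulling out the tokens labeled by the person-name state. The state set, probabilities, and the tiny name lexicon are made up for illustration; they are not the talk's actual model.

```python
import numpy as np

states = ["background", "person", "location"]
obs = "Yesterday Rich Caruana spoke this example sentence .".split()

# Toy (made-up) parameters: start and transition log-probabilities.
start = np.log([0.8, 0.1, 0.1])
trans = np.log([[0.8, 0.1, 0.1],   # from background
                [0.4, 0.5, 0.1],   # from person
                [0.4, 0.1, 0.5]])  # from location

PERSON_LEXICON = {"Rich", "Caruana"}

def log_emit(state, word):
    """Toy stand-in for P(word | state), using a tiny name lexicon."""
    if state == "person":
        p = 0.45 if word in PERSON_LEXICON else (0.05 if word[:1].isupper() else 0.001)
    elif state == "location":
        p = 0.01
    else:  # background
        p = 0.001 if word in PERSON_LEXICON else (0.02 if word[:1].isupper() else 0.12)
    return np.log(p)

# Viterbi dynamic program.
T, S = len(obs), len(states)
delta = np.full((T, S), -np.inf)
back = np.zeros((T, S), dtype=int)
for s in range(S):
    delta[0, s] = start[s] + log_emit(states[s], obs[0])
for t in range(1, T):
    for s in range(S):
        scores = delta[t - 1] + trans[:, s]
        back[t, s] = int(np.argmax(scores))
        delta[t, s] = scores[back[t, s]] + log_emit(states[s], obs[t])

# Backtrace, then extract the tokens tagged by the "person" state.
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(back[t, path[-1]])
path.reverse()
print([w for w, s in zip(obs, path) if states[s] == "person"])  # ['Rich', 'Caruana']
```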
20. We Want More than an Atomic View of Words
We would like a richer representation of text: many arbitrary, overlapping features of the words (sketched in code below), e.g.
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
[Figure: the same state/observation chain as before, with features such as is "Wisniewski", part of noun phrase, ends in "-ski" attached to the observation at time t.]
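A small illustration of how such overlapping features could be computed; the feature names mirror the slide, while the tiny city lexicon and the stubbed-out checks are assumptions for the sketch, not part of the talk.

```python
def token_features(tokens, t, city_names=frozenset({"Amherst", "Boston"})):
    """Return overlapping, non-independent features of token t in context.

    The lexicon here is a stand-in; a real system would plug in gazetteers,
    WordNet lookups, layout and formatting cues, and so on.
    """
    w = tokens[t]
    feats = {
        f"word={w.lower()}": 1.0,
        "ends-in-ski": float(w.lower().endswith("ski")),
        "is-capitalized": float(w[:1].isupper()),
        "in-city-list": float(w in city_names),
    }
    if t + 2 < len(tokens):
        feats["next-two-words=and-Associates"] = float(
            tokens[t + 1] == "and" and tokens[t + 2] == "Associates")
    return feats

print(token_features("He joined Wisniewski and Associates".split(), 2))
```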
21. Problems with a Richer Representation and a Joint Model
- These arbitrary features are not independent:
  - multiple levels of granularity (chars, words, phrases)
  - multiple dependent modalities (words, formatting, layout)
  - past and future
- Two choices:
  - Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes) -- a big problem when combining evidence, as in Viterbi!
  - Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
[Figure: two HMM-style chains of states S(t-1), S(t), S(t+1) over observations O(t-1), O(t), O(t+1), illustrating the two choices.]
22. Conditional Sequence Models
- We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
- Can examine features, but is not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.
23. From HMMs to Conditional Random Fields
Lafferty, McCallum & Pereira 2001
Joint (HMM): states S(t-1), S(t), S(t+1), ... generate observations O(t-1), O(t), O(t+1), ...
Conditional: the same chain of states is now conditioned on the observations, where the conditional distribution takes the form reconstructed below.
(This linear chain is a super-special case of Conditional Random Fields.)
Set the parameters by maximum likelihood, using a gradient-based optimization method on the log-likelihood L.
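The slide's own rendering of the equation did not survive conversion; in its standard linear-chain form the conditional model is:

```latex
P(\mathbf{s} \mid \mathbf{o}) \;=\; \frac{1}{Z(\mathbf{o})}
  \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
\qquad
Z(\mathbf{o}) \;=\; \sum_{\mathbf{s}'} \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big).
```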
24. Conditional Random Fields
Lafferty, McCallum & Pereira 2001
1. FSM special case: a linear chain among the unknowns S(t), S(t+1), S(t+2), S(t+3), S(t+4), with parameters tied across time steps, conditioned on O = {O(t), O(t+1), O(t+2), O(t+3), O(t+4)}.
2. In general, CRFs are a "conditionally-trained Markov network" with arbitrary structure among the unknowns.
3. Relational Markov Networks [Taskar, Abbeel & Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").
25. Training CRFs
The gradient of the log-likelihood is the feature count using the correct labels, minus the expected feature count using the predicted labels, minus a smoothing penalty (written out below).
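A standard form of the gradient the slide's three terms refer to, assuming a Gaussian prior with variance sigma^2 on the weights:

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
\;=\;
\underbrace{\sum_{i}\sum_{t} f_k\big(s^{(i)}_{t-1}, s^{(i)}_t, \mathbf{o}^{(i)}, t\big)}_{\text{feature count, correct labels}}
\;-\;
\underbrace{\sum_{i}\sum_{t}\sum_{s,s'} P_\Lambda\big(s_{t-1}{=}s,\, s_t{=}s' \mid \mathbf{o}^{(i)}\big)\, f_k\big(s, s', \mathbf{o}^{(i)}, t\big)}_{\text{expected feature count, predicted labels}}
\;-\;
\underbrace{\frac{\lambda_k}{\sigma^2}}_{\text{smoothing penalty}} .
```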
26. Linear-chain CRFs vs. HMMs
- Comparable computational efficiency for inference.
- Features may be arbitrary functions of any or all observations.
- Parameters need not fully specify the generation of observations, so less training data may be required.
- Easy to incorporate domain knowledge.
27. Main Point 1
Conditional probability sequence models give
great flexibility regarding features used, and
have efficient dynamic-programming-based
algorithms for inference.
28. Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995, at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk, with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------------
        Number of        Production of Milk and Milkfat 2/
Year    Milk Cows 1/     Per Milk Cow           Percentage of Fat      Total
                         Milk       Milkfat     in All Milk Produced   Milk      Milkfat
        (1,000 Head)     (--- Pounds ---)       (Percent)              (Million Pounds)
--------------------------------------------------------------------------------------
1993    9,589            15,704     575         3.66                   150,582   5,514.4
1994    9,500            16,175     592         3.66                   153,664   5,623.7
1995    9,461            16,451     602         3.66                   155,644   5,694.3
--------------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
29. Table Extraction from Government Reports
Pinto, McCallum, Wei & Croft, SIGIR 2003
100 documents from www.fedstats.gov
Labels (12 in all), assigned to each line by a CRF:
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ...
(The same milk report from the previous slide is the running example.)
Features:
- Percentage of digit chars
- Percentage of alpha chars
- Indented
- Contains 5 consecutive spaces
- Whitespace in this line aligns with the previous line
- ...
- Conjunctions of all previous features, at time offsets (0,0), (-1,0), (0,1), (1,2). (A feature sketch follows below.)
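A rough rendering of these per-line features and their conjunctions in code. The predicates and the conjunction scheme follow the slide's wording, but the exact definitions (e.g., how alignment is tested) are assumptions for illustration.

```python
import re

def line_features(lines, i):
    """Per-line features in the spirit of the slide's list (illustrative only)."""
    line = lines[i]
    n = max(len(line), 1)
    feats = {
        "pct-digit-chars": sum(c.isdigit() for c in line) / n,
        "pct-alpha-chars": sum(c.isalpha() for c in line) / n,
        "indented": float(line.startswith(" ")),
        "has-5-consecutive-spaces": float("     " in line),
    }
    if i > 0:
        # Crude alignment cue: do runs of whitespace start at the same columns?
        cols = lambda s: {m.start() for m in re.finditer(r"\s{2,}", s)}
        feats["whitespace-aligns-with-prev"] = float(bool(cols(line) & cols(lines[i - 1])))
    return feats

def with_conjunctions(feature_rows, i, offset_pairs=((0, 0), (-1, 0), (0, 1), (1, 2))):
    """Conjoin features across the time-offset pairs mentioned on the slide."""
    out = {}
    for a, b in offset_pairs:
        ja, jb = i + a, i + b
        if 0 <= ja < len(feature_rows) and 0 <= jb < len(feature_rows):
            for ka, va in feature_rows[ja].items():
                for kb, vb in feature_rows[jb].items():
                    out[f"({a}){ka}&({b}){kb}"] = va * vb
    return out
```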
30. Table Extraction Experimental Results
Pinto, McCallum, Wei & Croft, SIGIR 2003

                          Line labels        Table segments
                          (% correct)        (F1)
HMM                       65                 64
Stateless MaxEnt          85                 -
CRF w/out conjunctions    52                 68
CRF                       95                 92
Δ error                   85                 77
31. IE from Research Papers
McCallum et al. 1999
32. IE from Research Papers
Field-level F1:
- Hidden Markov Models (HMMs): 75.6  [Seymore, McCallum & Rosenfeld, 1999]
- Support Vector Machines (SVMs): 89.7  [Han, Giles, et al., 2003]
- Conditional Random Fields (CRFs): 93.9  [Peng & McCallum, 2004]
Δ error = 40
33. Main Point 2
Conditional Random Fields were more accurate in practice than a generative model on a research paper extraction task, and on others, including:
- a table extraction task
- noun phrase segmentation
- named entity extraction
- ...
34. Outline
- The need for unified IE and DM.
- Review of Conditional Random Fields for IE.
- Preliminary steps toward unification:
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)
- Conclusions
35.-36. IE in Context
Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mining (Prediction, Outlier detection, Decision support)
Supporting steps: create ontology, label training data, train extraction models.
(Slide 36 repeats the same diagram.)
37. Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty".
Input: a news article with named-entity "mentions" tagged, e.g. "Today Secretary of State Colin Powell met with ...", containing, in order, the mentions: Secretary of State Colin Powell, he, Condoleezza Rice, Mr Powell, she, Powell, President Bush, Rice, Bush.
Output: the number of entities, N = 3, and the partition of mentions:
1. Secretary of State Colin Powell, he, Mr. Powell, Powell
2. Condoleezza Rice, she, Rice
3. President Bush, Bush
38. Inside the Traditional Solution
Pair-wise affinity metric: should Mention (3) ". . . Mr Powell . . ." and Mention (4) ". . . Powell . . ." corefer (Y/N)?

Y/N   Feature                                              Weight
N     Two words in common                                  +29
Y     One word in common                                   +13
Y     "Normalized" mentions are string identical           +39
Y     Capitalized word in common                           +17
Y     > 50% character tri-gram overlap                     +19
N     < 25% character tri-gram overlap                     -34
Y     In same sentence                                     +9
Y     Within two sentences                                 +8
N     Further than 3 sentences apart                       -1
Y     "Hobbs Distance" of 0                                +12
N     Number of entities in between the two mentions = 4   -3
Y     Font matches                                         +1
Y     Default                                              -19
OVERALL SCORE: 98  >  threshold = 0
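A toy rendering of this decision rule: sum the weights of whichever features fire for a mention pair and merge if the total clears the threshold of 0. The weights echo the slide, but the feature predicates here are simplified stand-ins, and only a handful of them are implemented.

```python
# Illustrative weights echoing the slide; the predicates are simplified.
AFFINITY_WEIGHTS = {
    "two_words_in_common": 29,
    "one_word_in_common": 13,
    "normalized_strings_identical": 39,
    "capitalized_word_in_common": 17,
    "default": -19,
}

def normalize(mention):
    """Drop honorifics so 'Mr Powell' and 'Powell' normalize to the same string."""
    return " ".join(w for w in mention.split() if w not in {"Mr", "Mrs", "Ms", "Dr"})

def affinity(m1, m2, weights=AFFINITY_WEIGHTS):
    w1, w2 = set(m1.split()), set(m2.split())
    common = w1 & w2
    score = weights["default"]
    if len(common) >= 2:
        score += weights["two_words_in_common"]
    elif len(common) == 1:
        score += weights["one_word_in_common"]
    if normalize(m1) == normalize(m2):
        score += weights["normalized_strings_identical"]
    if any(w[:1].isupper() for w in common):
        score += weights["capitalized_word_in_common"]
    return score

# Merge iff the score clears the threshold of 0 (an independent pairwise decision).
score = affinity("Mr Powell", "Powell")
print(score, score > 0)
```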
39. The Problem
Pair-wise merging decisions are being made independently of each other:
- ". . . Mr Powell . . ." vs. ". . . Powell . . .": affinity = 98 → Y
- ". . . Mr Powell . . ." vs. ". . . she . . .": affinity = 11 → Y
- ". . . Powell . . ." vs. ". . . she . . .": affinity = -104 → N
Note that these three independent decisions are mutually inconsistent.
They should instead be made in relational dependence with each other: affinity measures are noisy and imperfect.
40. A Generative Model Solution
Russell 2001; Pasula et al. 2002
(Applied to citation matching, and to object correspondence in vision.)
[Figure: a generative model with per-entity variables (id, words, context; id, surname, distance, fonts, age, gender, ...) replicated N times.]
Among the issues: the number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus the model structure must be modified during inference (MCMC).
41. A Markov Random Field for Co-reference (MRF)
McCallum & Wellner, 2003
Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.
[Figure: three mentions -- ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." -- connected by Y/N decision variables, with edge weights +45 (Mr Powell-Powell), -30 (Powell-she), and +11 (Mr Powell-she).]
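The model on this slide can be written, roughly, as a conditional distribution over all pairwise coreference decisions at once; this is a reconstruction in the spirit of the McCallum & Wellner formulation, not the slide's exact notation:

```latex
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{i<j} \sum_{k} \lambda_k \, f_k(x_i, x_j, y_{ij})
            \;+\; \sum_{i<j<l} \lambda' \, f'(y_{ij}, y_{jl}, y_{il}) \Big),
```

where y_{ij} in {Y, N} says whether mentions x_i and x_j corefer, and the triangle term assigns inconsistent (non-transitive) configurations a weight of negative infinity.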
42. A Markov Random Field for Co-reference (MRF)
McCallum & Wellner, 2003
One consistent assignment: Mr Powell-Powell = N, Powell-she = N, Mr Powell-she = Y, scoring -(45) - (-30) + (11) = -4.
43. A Markov Random Field for Co-reference (MRF)
McCallum & Wellner, 2003
An inconsistent assignment: Mr Powell-Powell = Y, Powell-she = N, Mr Powell-she = Y. The triangle is intransitive, so on top of +(45) - (-30) + (11) the configuration receives the -infinity penalty: total -infinity.
44. A Markov Random Field for Co-reference (MRF)
McCallum & Wellner, 2003
The best consistent assignment: Mr Powell-Powell = Y, Powell-she = N, Mr Powell-she = N, scoring +(45) - (-30) - (11) = 64.
45. Inference in these MRFs: Graph Partitioning
Boykov, Veksler & Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross & Shi, 2002
[Figure: four mentions -- ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .", ". . . Condoleezza Rice . . ." -- with edge weights 45 (Mr Powell-Powell), 11 (Mr Powell-she), -30 (Powell-she), 10 (she-Condoleezza Rice), and -106 and -134 on the edges from the two Powell mentions to Condoleezza Rice.]
46.-48. Inference in these MRFs: Graph Partitioning
Boykov, Veksler & Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross & Shi, 2002
(Slides 46-48 repeat the same weighted graph while the partitioning is built up; the values -22 and 314 appear on the later slides as partitionings are scored.)
49. Markov Random Fields for Co-reference
- Train the edge weight function by maximum likelihood.
  - (The gradient can be approximated by Gibbs sampling, or by stochastic gradient ascent, e.g. voted perceptron.)
  - Given labeled training data in which the partitions are given, learn an affinity measure under which partitioning will reproduce those partitions.
- Interested in better algorithms for graph partitioning:
  - Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  - The action is in the interplay between positive and negative edges.
  - Currently using a modified version of "Correlation Clustering" [Bansal, Blum & Chawla, 2002] -- a very simple greedy algorithm (sketched below).
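A minimal sketch of the kind of greedy agglomerative partitioning the last bullet alludes to: start from singleton clusters and repeatedly merge the pair of clusters whose cross-cluster edge weights sum to the largest positive value. This illustrates the idea only; it is not the exact algorithm used in the paper, and the edge-weight assignment below only echoes the running example approximately.

```python
def greedy_partition(n_nodes, edge_weights):
    """Greedy agglomerative partitioning for graphs with +/- edge weights.

    edge_weights: dict mapping (i, j) with i < j to a real-valued affinity.
    Returns a list of clusters (sets of node indices).
    """
    clusters = [{i} for i in range(n_nodes)]

    def cross_weight(a, b):
        return sum(edge_weights.get((min(i, j), max(i, j)), 0.0)
                   for i in a for j in b)

    while True:
        best, best_pair = 0.0, None
        for ai in range(len(clusters)):
            for bi in range(ai + 1, len(clusters)):
                w = cross_weight(clusters[ai], clusters[bi])
                if w > best:
                    best, best_pair = w, (ai, bi)
        if best_pair is None:          # no merge has positive total weight
            return clusters
        ai, bi = best_pair
        clusters[ai] |= clusters[bi]
        del clusters[bi]

# Nodes: 0 = Mr Powell, 1 = Powell, 2 = she, 3 = Condoleezza Rice.
# Weights loosely echo the slides (exact -106/-134 edge assignment is a guess).
weights = {(0, 1): 45, (0, 2): 11, (1, 2): -30, (0, 3): -106, (1, 3): -134, (2, 3): 10}
print(greedy_partition(4, weights))   # [{0, 1}, {2, 3}]
```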
50. Co-reference Experimental Results
McCallum & Wellner, 2003
Proper noun co-reference, among nouns having coreferents.

DARPA ACE broadcast news transcripts, 117 stories (MUC-style F1):
- Single-link threshold: 91.65
- Best previous match [Morton]: 90.98
- MRFs: 93.96   (Δ error = 28)

DARPA MUC-6 newswire article corpus, 30 stories (MUC-style F1):
- Single-link threshold: 60.83
- Best previous match [Morton]: 88.83
- MRFs: 91.59   (Δ error = 24)
51. Outline
- The need for unified IE and DM.
- Review of Conditional Random Fields for IE.
- Preliminary steps toward unification:
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)
- Conclusions
52. Cascaded Predictions
Step 1: predict the segmentation (output prediction) from the Chinese characters (input observation). The part-of-speech and named-entity layers are not yet predicted.
53. Cascaded Predictions
Step 2: predict the part-of-speech tags (output prediction), treating the segmentation and the Chinese characters as input observations.
54. Cascaded Predictions
Step 3: predict the named-entity tags (output prediction), treating the part-of-speech tags, the segmentation, and the Chinese characters as input observations.
55. Joint Prediction: Cross-Product over Labels
O(|V| × 1485²) parameters; O(|o| × 1485²) running time.
3 × 45 × 11 = 1485 possible states, e.g. state label = (WordBeg, Noun, Person).
Segmentation × POS × NE (output prediction), over the Chinese characters (input observation).
56. Joint Prediction: Factorial CRF
O(|V| × 2785) parameters.
The named-entity tags, part-of-speech tags, and segmentation are all output predictions, made jointly over the Chinese characters (input observation).
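To make the counts on these two slides concrete, assuming 3 segmentation labels, 45 POS tags, and 11 NE tags as above (the decomposition of the 2785 figure is one consistent reading, with cotemporal factors on the Seg-POS and POS-NE pairs):

```latex
3 \times 45 \times 11 = 1485 \ \text{cross-product states}, \qquad 1485^2 \approx 2.2 \times 10^{6} \ \text{label-pair parameters};
```
```latex
\underbrace{3^2 + 45^2 + 11^2}_{\text{within-chain transitions}} \;+\; \underbrace{3\cdot 45 \,+\, 45\cdot 11}_{\text{between-chain factors}} \;=\; 2155 + 630 \;=\; 2785 .
```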
57. Linear-Chain to Factorial CRFs: Model Definition
Linear-chain: a single chain of labels y over the observations x.
Factorial: three coupled chains of labels u, v, w (e.g. segmentation, POS, NE) over the observations x,
where the distribution factorizes as shown below.
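The slide's equation image did not survive conversion; in the style of the linear-chain definition earlier in the talk, the factorial model can be written as

```latex
P(\mathbf{u}, \mathbf{v}, \mathbf{w} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \prod_{t} \Phi_u(u_{t-1}, u_t, \mathbf{x}, t)\,
            \Phi_v(v_{t-1}, v_t, \mathbf{x}, t)\,
            \Phi_w(w_{t-1}, w_t, \mathbf{x}, t)\,
            \Psi_{uv}(u_t, v_t, \mathbf{x}, t)\,
            \Psi_{vw}(v_t, w_t, \mathbf{x}, t),
```

with each factor of the usual exponential-family form, e.g. Φ(·) = exp(Σ_k λ_k f_k(·)).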
58. Linear-Chain to Factorial CRFs: Log-likelihood Training
The same maximum-likelihood training applies: a single chain y over x in the linear-chain case, and the three coupled chains u, v, w over x in the factorial case.
59. Dynamic CRFs
The undirected, conditionally-trained analogue of Dynamic Bayes Nets (DBNs). Variants include factorial, higher-order, and hierarchical structures.
60. Training CRFs
(As before: the gradient is the feature count using the correct labels, minus the expected feature count using the predicted labels, minus a smoothing penalty.)
61. Training DCRFs
The gradient has the same form as for general CRFs: feature count using correct labels, minus expected feature count using predicted labels, minus a smoothing penalty.
62. Inference (Exact): Junction Tree
For the two-chain (NP + POS) model, the max clique has 3 × 45 × 45 = 6,075 assignments.
63. Inference (Exact): Junction Tree
For the three-chain (NER + POS + SEG) model, the max clique has 3 × 45 × 45 × 11 = 66,825 assignments.
64. Inference (Approximate): Loopy Belief Propagation
[Figure: a loopy graph over nodes v1...v6, with messages such as m3(v2), m2(v1), m5(v4), m5(v6), m4(v5) passed between neighboring nodes.]
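The messages m_i(v_j) in the figure are computed with the usual sum-product update (standard loopy BP, stated here for reference):

```latex
m_{i \to j}(v_j) \;=\; \sum_{v_i} \phi_i(v_i)\, \psi_{ij}(v_i, v_j) \prod_{k \in \mathcal{N}(i) \setminus \{j\}} m_{k \to i}(v_i),
\qquad
b_i(v_i) \;\propto\; \phi_i(v_i) \prod_{k \in \mathcal{N}(i)} m_{k \to i}(v_i).
```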
65.-68. Inference (Approximate): Tree Re-parameterization
Wainwright, Jaakkola & Willsky 2001
[Figures: four successive tree re-parameterization updates on the example graph.]
69. Experiments: Simultaneous Noun-Phrase and Part-of-Speech Tagging
Each word carries both a noun-phrase chunk tag (B/I/O) and a part-of-speech tag, e.g.:
  Rockwell International Corp. 's Tulsa unit said it signed ...
  NP:  B I I B I I O O O
  POS: N N N O N N V O V
(and similarly for "a tentative agreement extending its contract with Boeing Co.")
- Data from the CoNLL Shared Task 2000 (newswire)
- 8,936 training instances
- 45 POS tags, 3 NP tags
- Features: word identity, capitalization, regexes, lexicons
70. Experiments: Simultaneous Noun-Phrase and Part-of-Speech Tagging
(Same example sentence as above.)
Two experiments:
- Compare exact and approximate inference.
- Compare noun-phrase segmentation F1 of:
  - Cascaded CRF → CRF
  - Cascaded Brill → CRF
  - Joint Factorial DCRF
71. Comparing Inference Algorithms
72. Noun-Phrase Experimental Results
73. Outline
- The need for unified IE and DM.
- Review of Conditional Random Fields for IE.
- Preliminary steps toward unification:
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)
- Conclusions
74. Citation Segmentation and Coreference
Two citation strings for the same paper:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
75. Citation Segmentation and Coreference
(The second citation, repeated from above.)
76. Citation Segmentation and Coreference
(Both citations again, connected by a Y/N coreference decision.)
The two tasks:
- Segment citation fields.
- Resolve coreferent papers.
77.-78. Incorrect Segmentation Hurts Coreference
(The same pair of citations, now with an incorrect field segmentation; the coreference decision, marked "?", becomes harder.)
Solution: perform segmentation and coreference jointly. Use segmentation uncertainty to improve coreference, and use coreference to improve segmentation.
79. Segmentation + Coreference Model
Observed citation o; CRF segmentation s.
80. Segmentation + Coreference Model
Add citation attributes c on top of the segmentation s of the observed citation o.
81. Segmentation + Coreference Model
Several citations, each with an observed string o, a CRF segmentation s, and citation attributes c.
82. Segmentation + Coreference Model
Pairwise coreference variables y connect the citation attributes c of each pair of citations.
83. Such a highly connected graph makes exact inference intractable, so ...
84. Approximate Inference 1
[Figure: a loopy graph over nodes v1...v6, with messages such as m1(v2), m2(v3), m3(v2), m2(v1) passed between nodes.]
85. Approximate Inference 1
- Loopy Belief Propagation: messages passed between nodes.
- Generalized Belief Propagation: messages passed between regions.
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
86. Approximate Inference 2
Iterated Conditional Modes (ICM) [Besag 1986]:
  v_6^{(i+1)} = argmax_{v_6} P(v_6 | v \ v_6)
87. Approximate Inference 2
Iterated Conditional Modes (ICM) [Besag 1986]:
  v_5^{(j+1)} = argmax_{v_5} P(v_5 | v \ v_5)
88. Approximate Inference 2
Iterated Conditional Modes (ICM) [Besag 1986]:
  v_4^{(k+1)} = argmax_{v_4} P(v_4 | v \ v_4)
But ICM is greedy, and easily falls into local minima.
89. Approximate Inference 2
- Iterated Conditional Modes (ICM) [Besag 1986]:
    v_4^{(k+1)} = argmax_{v_4} P(v_4 | v \ v_4)
- Iterated Conditional Sampling (ICS) (our proposal; related work?):
  Instead of keeping only the argmax, keep a sample of the top assignments of P(v_4 | v \ v_4), i.e. an N-best list (the top N values).
- A generalized version of this does exact inference on a region of several nodes at once. Here, a message grows only linearly with region size and N! (A sketch follows below.)
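A compact sketch contrasting the two ideas on a tiny made-up model: ICM keeps only the single argmax for each variable, while the N-best variant keeps the top-N values per variable and lets later variables condition on any combination of them. The variables, potentials, and update order are invented for illustration.

```python
import itertools

# A tiny pairwise model: variables a, b, c each take values 0/1/2, with
# made-up unary and pairwise log-potentials.
VALUES = (0, 1, 2)
UNARY = {"a": [0.0, 1.0, 0.2], "b": [0.5, 0.0, 0.8], "c": [0.1, 0.9, 0.0]}
PAIR = {("a", "b"): lambda x, y: -abs(x - y),        # prefer agreement
        ("b", "c"): lambda x, y: 0.5 if x == y else 0.0}

def local_score(var, val, assign):
    """Log-score of var=val given the neighbors currently set in `assign`."""
    s = UNARY[var][val]
    for (u, v), f in PAIR.items():
        if u == var and v in assign:
            s += f(val, assign[v])
        elif v == var and u in assign:
            s += f(assign[u], val)
    return s

def icm(assign, sweeps=5):
    """Iterated Conditional Modes: greedily reset each variable to its argmax."""
    for _ in range(sweeps):
        for var in assign:
            assign[var] = max(VALUES, key=lambda val: local_score(var, val, assign))
    return assign

def iterated_conditional_samples(order=("a", "b", "c"), n_best=2):
    """Keep the top-N values per variable instead of a single argmax,
    so later variables can condition on any combination of kept values."""
    beams = {}
    for var in order:
        contexts = [dict(zip(beams, combo))
                    for combo in itertools.product(*beams.values())] or [{}]
        scored = {val: max(local_score(var, val, ctx) for ctx in contexts)
                  for val in VALUES}
        beams[var] = sorted(scored, key=scored.get, reverse=True)[:n_best]
    return beams

print(icm({"a": 0, "b": 0, "c": 0}))
print(iterated_conditional_samples())
```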
90. Sample an N-best List from CRF Segmentation
Do exact inference over the linear-chain segmentation regions (o, s), and pass an N-best list of segmentations up to the coreference layer: citation attributes c, prototypes p, and pairwise coreference variables y.
91. Sample an N-best List from Viterbi
The citation attributes c are parameterized by the N-best lists; the pairwise variables y compare them.
92. Sample an N-best List from Viterbi
When calculating similarity with another citation, there is more opportunity to find correct, matching fields.
93. Results on 4 Sections of CiteSeer Citations
Coreference F1 performance:
- Average error reduction is 35%.
- "Optimal" makes the best use of the N-best list by using the true labels.
- This indicates that even more improvement can be obtained.
94. Outline
- The need for unified IE and DM.
- Review of Conditional Random Fields for IE.
- Preliminary steps toward unification:
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Segmentation and Co-ref (Iterated Conditional Samples)
- Conclusions
95. Conclusions
- Conditional Random Fields combine the benefits of:
  - conditional probability models (arbitrary features), and
  - Markov models (for sequences or other relations).
- Successes so far:
  - coreference analysis
  - factorial finite-state models
  - segmentation uncertainty aiding coreference
- Future work:
  - structure learning, semi-supervised learning
  - further tight integration of IE and Data Mining
  - application to Social Network Analysis
96. End of Talk