Toward Unified Models of Information Extraction and Data Mining PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Toward Unified Models of Information Extraction and Data Mining


1
Toward Unified Models of Information Extraction
and Data Mining
  • Andrew McCallum
  • Information Extraction and Synthesis Laboratory
  • Computer Science Department
  • University of Massachusetts Amherst
  • Joint work with
  • Aron Culotta, Charles Sutton, Ben Wellner,
    Khashayar Rohanimanesh, Wei Li

2
Goal
Improving our ability to mine actionable knowledge from unstructured text.
3
Pages Containing the Phrase "high tech job openings"
4
Extracting Job Openings from the Web
5
A Portal for Job Openings
6
Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
7
Data Mining the Extracted Job Information
8
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k documents, several centuries old: Qing Dynasty Archives - memos - newspaper articles - diaries
9
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT  For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying ...
Extracted entities: Microsoft Corporation, CEO, Bill Gates; Microsoft, Gates; Microsoft, Bill Veghte, Microsoft VP; Richard Stallman, founder, Free Software Foundation
10
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
(same news excerpt and extracted entities as the previous slide)
11
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
(same news excerpt and extracted entities as the previous slide)
12
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
(same news excerpt as the previous slides; the extracted entities are now clustered into a database table:)
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation
13
Larger Context
[Pipeline: Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (discover patterns: entity types, links/relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)]
14
Problem
  • Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
  • KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
  • IE is unaware of emerging patterns and regularities in the DB.
  • The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

15
Solution
Uncertainty Info
[Pipeline as on the previous slide - Document collection -> Spider -> Filter -> IE -> Database -> Data Mining -> Actionable knowledge - but IE now passes uncertainty info forward to data mining, and data mining passes emerging patterns back to IE.]
16
Solution
Unified Model
[Pipeline as before, but the database is replaced by a single probabilistic model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (discover patterns: entity types, links/relations, events), from document collection to actionable knowledge (Prediction, Outlier detection, Decision support).]
17
Outline
  • The need for unified IE and DM.
  • Review of Conditional Random Fields for IE.
  • Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Sampling)
  • Conclusions

18
Hidden Markov Models
HMMs are the standard sequence modeling tool in
genomics, music, speech, NLP,
[Figure: an HMM drawn both as a graphical model and as a finite state model - states S_{t-1}, S_t, S_{t+1} connected by transitions, each generating an observation O_{t-1}, O_t, O_{t+1}; the model generates a state sequence and an observation sequence, e.g. o1 o2 o3 o4 o5 o6 o7 o8.]
Parameters: for all states S = {s1, s2, ...}
- Start state probabilities P(s_t)
- Transition probabilities P(s_t | s_{t-1})
- Observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (w/ prior)
19
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Rich Caruana spoke this example
sentence.
and a trained HMM
[States: person name, location name, background]
Find the most likely state sequence (Viterbi):
Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name.
Person name: Rich Caruana
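To make the Viterbi step concrete, here is a minimal sketch in Python. The states, probability tables, and the capitalization-based emission model are illustrative toy values, not the model from the talk:

    import math

    # Toy HMM for person-name extraction; all numbers are made up for illustration.
    states = ["background", "person_name"]
    start_p = {"background": 0.9, "person_name": 0.1}
    trans_p = {"background": {"background": 0.8, "person_name": 0.2},
               "person_name": {"background": 0.4, "person_name": 0.6}}

    def emit_p(state, word):
        # Toy emission model: capitalized words are more likely under "person_name".
        if state == "person_name":
            return 0.3 if word[0].isupper() else 0.01
        return 0.05 if word[0].isupper() else 0.2

    def viterbi(words):
        # delta[t][s] = best log-score of any state sequence ending in state s at time t
        delta = [{s: math.log(start_p[s]) + math.log(emit_p(s, words[0])) for s in states}]
        back = [{}]
        for t in range(1, len(words)):
            delta.append({})
            back.append({})
            for s in states:
                prev = max(states, key=lambda r: delta[t - 1][r] + math.log(trans_p[r][s]))
                delta[t][s] = (delta[t - 1][prev] + math.log(trans_p[prev][s])
                               + math.log(emit_p(s, words[t])))
                back[t][s] = prev
        best = max(states, key=lambda s: delta[-1][s])
        path = [best]
        for t in range(len(words) - 1, 0, -1):   # trace back the best path
            path.append(back[t][path[-1]])
        return list(reversed(path))

    words = "Yesterday Rich Caruana spoke this example sentence .".split()
    print(list(zip(words, viterbi(words))))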
20
We want More than an Atomic View of Words
We would like a richer representation of text: many arbitrary, overlapping features of the words.
[Figure: HMM states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; the observed word "Wisniewski" carries many overlapping features, e.g. "part of noun phrase" and "ends in -ski".]
Example features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates".
21
Problems with a Richer Representation and a Joint Model
  • These arbitrary features are not independent.
  • Multiple levels of granularity (chars, words,
    phrases)
  • Multiple dependent modalities (words, formatting,
    layout)
  • Past future
  • Two choices

Choice 1: Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes) - a big problem when combining evidence, as in Viterbi!
Choice 2: Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
[Figure: two HMM-style graphical models over states S and observations O, illustrating the two choices.]
22
Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
  • It can examine features, but is not responsible for generating them.
  • We don't have to explicitly model their dependencies.
  • We don't waste modeling effort trying to generate what we are given at test time anyway.

23
From HMMs to Conditional Random Fields
Lafferty, McCallum, Pereira 2001
[Figure: the joint model (an HMM) over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}, next to the conditional model over the same variables, where:]
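The formula that followed "where" on the original slide is an image and is not recoverable from this transcript; for reference, the standard linear-chain form from Lafferty, McCallum & Pereira (2001) is:

    P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})}
        \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \Big),
    \qquad
    Z(\mathbf{o}) = \sum_{\mathbf{s}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, \mathbf{o}, t) \Big)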
(A super-special case of Conditional Random
Fields.)
Set parameters by maximum likelihood, using an optimization method on the gradient of the log-likelihood L.
24
Conditional Random Fields
Lafferty, McCallum, Pereira 2001
1. FSM special case: a linear chain among the unknowns, with parameters tied across time steps.
[Figure: states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4} conditioned on observations O = {O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}}]
2. In general, CRFs are a "conditionally-trained Markov network" with arbitrary structure among the unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").
25
Training CRFs
Gradient of the log-likelihood = (feature counts using the correct labels) - (expected feature counts using the predicted labels) - (smoothing penalty)
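Written out (the standard penalized log-likelihood gradient for CRFs, reconstructed here rather than copied from the slide image), the derivative for each weight lambda_k is the empirical feature count minus the expected feature count under the current model, minus a Gaussian-prior smoothing term:

    \frac{\partial L}{\partial \lambda_k}
      = \sum_{i,t} f_k\big(s^{(i)}_{t-1}, s^{(i)}_t, \mathbf{o}^{(i)}, t\big)
      - \sum_{i,t} \sum_{s,s'} P_\Lambda\big(s_{t-1}{=}s,\, s_t{=}s' \mid \mathbf{o}^{(i)}\big)\, f_k\big(s, s', \mathbf{o}^{(i)}, t\big)
      - \frac{\lambda_k}{\sigma^2}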
26
Linear-chain CRFs vs. HMMs
  • Comparable computational efficiency for inference
  • Features may be arbitrary functions of any or all
    observations
  • Parameters need not fully specify generation of observations; can require less training data
  • Easy to incorporate domain knowledge

27
Main Point 1
Conditional probability sequence models give
great flexibility regarding features used, and
have efficient dynamic-programming-based
algorithms for inference.
28
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

                 Milk Cows and Production of Milk and Milkfat:
                           United States, 1993-95
 ------------------------------------------------------------------------------
                         Production of Milk and Milkfat 2/
          Number of     -------------------------------------------------------
  Year    Milk Cows 1/   Per Milk Cow      Percentage of Fat        Total
                         Milk    Milkfat   in All Milk Produced   Milk      Milkfat
 ------------------------------------------------------------------------------
          1,000 Head     --- Pounds ---         Percent           Million Pounds

  1993      9,589       15,704     575           3.66            150,582   5,514.4
  1994      9,500       16,175     592           3.66            153,664   5,623.7
  1995      9,461       16,451     602           3.66            155,644   5,694.3
 ------------------------------------------------------------------------------
 1/ Average number during year, excluding heifers not yet fresh.
 2/ Excludes milk sucked by calves.

29
Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003 SIGIR
100 documents from www.fedstats.gov
Labels
CRF
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ... (12 in all)

(same government report excerpt as the previous slide)
Features
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5 consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2} (a sketch of such line features follows below)
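A minimal sketch of how per-line features like those above could be computed; the feature names, thresholds, and the alignment heuristic are illustrative, not the exact feature set from Pinto et al.:

    import re

    def line_features(line, prev_line=""):
        """Toy per-line features for table extraction (illustrative only)."""
        chars = [c for c in line if not c.isspace()]
        n = max(len(chars), 1)
        gaps = lambda s: {m.start() for m in re.finditer(r"\s{2,}", s)}
        return {
            "pct_digit": sum(c.isdigit() for c in chars) / n,
            "pct_alpha": sum(c.isalpha() for c in chars) / n,
            "indented": line.startswith("    "),
            "has_5_consecutive_spaces": " " * 5 in line,
            # Crude check that whitespace gaps start at the same columns as in
            # the previous line (a proxy for column alignment).
            "aligns_with_prev": bool(gaps(line) & gaps(prev_line)),
        }

    print(line_features(" 1994     9,500        16,175     592",
                        " 1993     9,589        15,704     575"))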

30
Table Extraction Experimental Results
Pinto, McCallum, Wei, Croft, 2003 SIGIR
                          Line labels (% correct)   Table segments (F1)
HMM                                  65                      64
Stateless MaxEnt                     85                       -
CRF w/out conjunctions               52                      68
CRF                                  95                      92
                                (Δ error 85%)           (Δ error 77%)
31
IE from Research Papers
McCallum et al 99
32
IE from Research Papers
Field-level F1:
  Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)     89.7   [Han, Giles, et al., 2003]
  Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
  (Δ error 40%)
33
Main Point 2
Conditional Random Fields were more accurate in practice than a generative model ... on a research paper extraction task ... and others, including: a table extraction task, noun phrase segmentation, named entity extraction, ...

34
Outline
  • The need for unified IE and DM.
  • Review of Conditional Random Fields for IE.
  • Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Sampling)
  • Conclusions

35
IE in Context
[Pipeline: Document collection -> Spider -> Filter by relevance -> IE (Segment, Classify, Associate, Cluster) -> Load DB -> Database -> Query, Search -> Data mining (Prediction, Outlier detection, Decision support); supporting steps: create ontology, label training data, train extraction models.]
36
IE in Context
(same pipeline diagram as the previous slide)
37
Coreference Resolution
AKA "record linkage", "database record
deduplication", "citation matching", "object
correspondence", "identity uncertainty"
Input: a news article, with named-entity "mentions" tagged:
  "Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . ."
Output: number of entities N = 3
  1. Secretary of State Colin Powell; he; Mr. Powell; Powell
  2. Condoleezza Rice; she; Rice
  3. President Bush; Bush
38
Inside the Traditional Solution
Pair-wise Affinity Metric
Mention (3): ". . . Mr Powell . . ."     Mention (4): ". . . Powell . . ."     Coreferent? Y/N

  Y/N  Feature                                                     Weight
  N    Two words in common                                             29
  Y    One word in common                                              13
  Y    "Normalized" mentions are string identical                      39
  Y    Capitalized word in common                                      17
  Y    > 50% character tri-gram overlap                                19
  N    ... character tri-gram overlap                                 -34
  Y    In same sentence                                                 9
  Y    Within two sentences                                             8
  N    Further than 3 sentences apart                                  -1
  Y    "Hobbs Distance" of entities in between two mentions: 0         12
  N    Number of entities in between two mentions: 4                   -3
  Y    Font matches                                                     1
  Y    Default                                                        -19
       OVERALL SCORE                                                   98   > threshold (0)
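A minimal sketch of this style of weighted-feature affinity score; the features and weights below are a small illustrative subset, not the learned values from the slide:

    def affinity(mention_a, mention_b, weights, default=-19):
        """Sum weighted boolean features of a mention pair; merge if score > threshold (0)."""
        a, b = set(mention_a.lower().split()), set(mention_b.lower().split())
        caps = lambda m: {w for w in m.split() if w[:1].isupper()}
        features = {
            "two_words_in_common": len(a & b) >= 2,
            "one_word_in_common": len(a & b) == 1,
            "string_identical": mention_a.lower() == mention_b.lower(),
            "capitalized_word_in_common": bool(caps(mention_a) & caps(mention_b)),
        }
        return default + sum(weights[f] for f, on in features.items() if on)

    weights = {"two_words_in_common": 29, "one_word_in_common": 13,
               "string_identical": 39, "capitalized_word_in_common": 17}
    score = affinity("Mr Powell", "Powell", weights)
    print(score, "merge" if score > 0 else "keep separate")

With these illustrative weights the pair "Mr Powell" / "Powell" scores 11, above the threshold of 0, so the two mentions would be merged.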
39
The Problem
Pair-wise merging decisions are being made independently from each other.
[Diagram: three mentions - ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." - with independent pairwise decisions: Mr Powell/Powell affinity 98 (Y), Powell/she affinity -104 (N), Mr Powell/she affinity 11 (Y).]
They should be made in relational dependence with each other.
Affinity measures are noisy and imperfect.
40
A Generative Model Solution
[Russell 2001; Pasula et al 2002] (Applied to citation matching, and object correspondence in vision.)
[Plate diagram over N entities; variables include id, words, context, surname, distance, fonts, age, gender, ...]
2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify the model structure during inference---MCMC.
41
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
[Diagram: mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with pairwise Y/N decision variables and edge weights: Mr Powell/Powell = 45, Powell/she = -30, Mr Powell/she = 11.]
Make pair-wise merging decisions in dependent relation to each other by: - calculating a joint probability - including all edge weights - adding dependence on consistent triangles.
42
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
[Same diagram; assignment: Mr Powell/Powell = N, Powell/she = N, Mr Powell/she = Y. Objective = -(45) - (-30) + (11) = -4.]
43
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
[Same diagram; assignment: Mr Powell/Powell = Y, Powell/she = N, Mr Powell/she = Y. This triangle is inconsistent, so the objective is -infinity.]
44
A Markov Random Field for Co-reference
(MRF)
[McCallum & Wellner, 2003]
[Same diagram; assignment: Mr Powell/Powell = Y, Powell/she = N, Mr Powell/she = N. Objective = (45) - (-30) - (11) = 64.]
45
Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
[Diagram: four mentions - ". . . Mr Powell . . .", ". . . Powell . . .", ". . . Condoleezza Rice . . .", ". . . she . . ." - fully connected with edge weights 45, -106, -30, -134, 11, 10.]
46
Inference in these MRFs: Graph Partitioning
(same graph as the previous slide)
47
Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
[The same graph, with a candidate partitioning whose objective value is -22.]
48
Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
[The same graph, with a candidate partitioning whose objective value is 314.]
49
Markov Random Fields for Co-reference
  • Train the edge weight function by maximum likelihood.
  • (Can approximate the gradient by Gibbs sampling, or by stochastic gradient ascent, e.g. voted perceptron.)
  • Given labeled training data in which the partitions are given, learn an affinity measure for which partitioning will reproduce those partitions.
  • Interested in better algorithms for graph partitioning:
  • Standard algorithms (e.g. Fiduccia-Mattheyses) do not apply with negative edge weights.
  • The action is in the interplay between positive and negative edges.
  • Currently using a modified version of "Correlation Clustering" [Bansal, Blum & Chawla, 2002] - a very simple greedy algorithm (sketched below).
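A minimal sketch of a greedy agglomerative partitioner in this spirit (not the exact algorithm from the paper): repeatedly merge the pair of clusters whose connecting edges have the largest positive total weight, and stop when no merge helps.

    def greedy_partition(mentions, weight):
        """weight(a, b) -> signed affinity between two mentions."""
        clusters = [{m} for m in mentions]
        while True:
            best_gain, best_pair = 0.0, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    gain = sum(weight(a, b) for a in clusters[i] for b in clusters[j])
                    if gain > best_gain:
                        best_gain, best_pair = gain, (i, j)
            if best_pair is None:        # no merge with positive total weight remains
                return clusters
            i, j = best_pair
            clusters[i] |= clusters[j]
            del clusters[j]

    # The three-mention example from the earlier diagram (weights 45, -30, 11).
    w = {frozenset({"Mr Powell", "Powell"}): 45,
         frozenset({"Powell", "she"}): -30,
         frozenset({"Mr Powell", "she"}): 11}
    print(greedy_partition(["Mr Powell", "Powell", "she"],
                           lambda a, b: w[frozenset({a, b})]))

On this toy graph the procedure recovers the partition {Mr Powell, Powell} / {she}, the same assignment that scored 64 in the earlier slides.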

50
Co-reference Experimental Results
[McCallum & Wellner, 2003]
Proper noun co-reference, among nouns having coreferents.

DARPA ACE broadcast news transcripts, 117 stories (MUC-style F1):
  Single-link threshold       91.65
  Best prev match [Morton]    90.98
  MRFs                        93.96   (Δ error 28%)

DARPA MUC-6 newswire article corpus, 30 stories (MUC-style F1):
  Single-link threshold       60.83
  Best prev match [Morton]    88.83
  MRFs                        91.59   (Δ error 24%)
51
Outline
  • The need for unified IE and DM.
  • Review of Conditional Random Fields for IE.
  • Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Sampling)
  • Conclusions

52
Cascaded Predictions
Named-entity tag
Part-of-speech
Segmentation (output prediction)
Chinese character (input observation)
53
Cascaded Predictions
Named-entity tag
Part-of-speech (output prediction)
Segmentation (input observation)
Chinese character (input observation)
54
Cascaded Predictions
Named-entity tag (output prediction)
Part-of-speech (input observation)
Segmentation (input observation)
Chinese character (input observation)
55
Joint Prediction: Cross-Product over Labels
O(V × 1485²) parameters
O(o × 1485²) running time
3 × 45 × 11 = 1485 possible states, e.g. state label = (Word-begin, Noun, Person)
Segmentation × POS × NE (output prediction)
Chinese character (input observation)
56
Joint Prediction: Factorial CRF
O(V × 2785) parameters
Named-entity tag (output prediction)
Part-of-speech (output prediction)
Segmentation (output prediction)
Chinese character (input observation)
57
Linear-Chain to Factorial CRFs: Model Definition
Linear-chain:
[Figure: a single label chain y over observations x]
Factorial:
[Figure: three label chains u, v, w over observations x, with within-chain and cotemporal connections]
where:
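The definition that followed "where" on the original slide is an image and is not recoverable here. In the spirit of the Dynamic CRF formulation, a factorial CRF with label chains u, v, w conditioned on input x can be written as a normalized product of within-chain and cotemporal clique potentials, each an exponentiated weighted sum of features:

    P(\mathbf{u}, \mathbf{v}, \mathbf{w} \mid \mathbf{x})
      = \frac{1}{Z(\mathbf{x})} \prod_{t}
        \Psi(u_t, u_{t+1}, \mathbf{x}, t)\,
        \Psi(v_t, v_{t+1}, \mathbf{x}, t)\,
        \Psi(w_t, w_{t+1}, \mathbf{x}, t)\,
        \Phi(u_t, v_t, \mathbf{x}, t)\,
        \Phi(v_t, w_t, \mathbf{x}, t),
    \qquad
    \Psi(\cdot) = \exp\Big( \sum_k \lambda_k f_k(\cdot) \Big)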
58
Linear-Chain to Factorial CRFs: Log-likelihood Training
[Figure: the same linear-chain (labels y over observations x) and factorial (chains u, v, w over x) structures; both are trained by maximizing the conditional log-likelihood.]
59
Dynamic CRFs: the undirected, conditionally-trained analogue to Dynamic Bayes Nets (DBNs)
Variants: Factorial, Higher-Order, Hierarchical
60
Training CRFs
Gradient of the log-likelihood = (feature counts using the correct labels) - (expected feature counts using the predicted labels) - (smoothing penalty)
Same form as for general CRFs.
61
Training DCRFs
Gradient of the log-likelihood = (feature counts using the correct labels) - (expected feature counts using the predicted labels) - (smoothing penalty)
Same form as for general CRFs.
62
Inference (Exact): Junction Tree
Max clique: 3 × 45 × 45 = 6,075 assignments
[Chains: NP, POS]
63
Inference (Exact): Junction Tree
Max clique: 3 × 45 × 45 × 11 = 66,825 assignments
[Chains: NER, POS, SEG]
64
Inference (Approximate): Loopy Belief Propagation
[Figure: a loopy graph over variables v1 ... v6, with messages such as m2(v1), m3(v2), m4(v5), m5(v4), m5(v6) passed between neighboring nodes.]
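For reference (the standard sum-product form, not copied from the slide), the message a node i passes to a neighboring node j, and the resulting local belief, are:

    m_{i \to j}(v_j) \propto \sum_{v_i} \phi_i(v_i)\, \psi_{ij}(v_i, v_j)
        \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(v_i),
    \qquad
    b_i(v_i) \propto \phi_i(v_i) \prod_{k \in N(i)} m_{k \to i}(v_i)

On a loopy graph these updates are iterated until they (hopefully) converge, and the resulting beliefs are only approximate.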
65
Inference (Approximate): Tree Re-parameterization
[Wainwright, Jaakkola, Willsky 2001]
66-68
Inference (Approximate): Tree Re-parameterization [Wainwright, Jaakkola, Willsky 2001] (figure continued across three slides)
69
Experiments: Simultaneous noun-phrase and part-of-speech tagging
[Example: the sentence "Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co." labeled with noun-phrase chunk tags (B, I, O) and part-of-speech tags.]
  • Data from CoNLL Shared Task 2000 (Newswire)
  • 8936 training instances
  • 45 POS tags, 3 NP tags
  • Features: word identity, capitalization, regexes, lexicons

70
Experiments: Simultaneous noun-phrase and part-of-speech tagging
(same example sentence as the previous slide)
  • Two experiments:
  • Compare exact and approximate inference
  • Compare noun phrase segmentation F1 of:
  • Cascaded CRF + CRF
  • Cascaded Brill + CRF
  • Joint Factorial DCRF

71
Comparing Inference Algorithms
72
Noun Phrase Experimental Results
73
Outline
  • The need for unified IE and DM.
  • Review of Conditional Random Fields for IE.
  • Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Sampling)
  • Conclusions

74
Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
75
Citation Segmentation and Coreference
(the second citation from the previous slide)
  • Segment citation fields

76
Citation Segmentation and Coreference
(the two citations from the previous slides, with a pairwise coreference question Y/N between them)
  • Segment citation fields
  • Resolve coreferent papers

77
Incorrect Segmentation Hurts Coreference
(the two citations again, with an uncertain match "?")
78
Incorrect Segmentation Hurts Coreference
(the two citations again, with an uncertain match "?")
Solution: perform segmentation and coreference jointly. Use segmentation uncertainty to improve coreference, and use coreference to improve segmentation.
79
Segmentation + Coreference Model
[Diagram: an observed citation o with CRF segmentation variables s.]
80
Segmentation + Coreference Model
[Diagram: observed citation o, CRF segmentation s, and citation attributes c.]
81
Segmentation + Coreference Model
[Diagram: three citations, each with observed string o, CRF segmentation s, and citation attributes c.]
82
Segmentation + Coreference Model
[Diagram: three citations, each with observed string o, CRF segmentation s, and citation attributes c; pairwise coreference variables y connect each pair of citations.]
83
  • Such a highly connected graph makes exact inference intractable, so ...

84
Approximate Inference 1
  • Loopy Belief Propagation - messages passed between nodes
[Diagram: variables v1 ... v6 with messages such as m1(v2), m2(v3), m3(v2), m2(v1) passed between neighboring nodes.]
85
Approximate Inference 1
  • Loopy Belief Propagation - messages passed between nodes
  • Generalized Belief Propagation - messages passed between regions
[Diagram: the same variables v1 ... v6, with region-to-region messages added.]
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
86
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]
[Diagram: variables v1 ... v6; update one variable at a time, e.g.]
v6^(i+1) = argmax over v6^(i) of P(v6^(i) | v \ v6^(i))
87
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]
[Diagram: variables v1 ... v6; update the next variable, e.g.]
v5^(j+1) = argmax over v5^(j) of P(v5^(j) | v \ v5^(j))
88
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]
[Diagram: variables v1 ... v6; update the next variable, e.g.]
v4^(k+1) = argmax over v4^(k) of P(v4^(k) | v \ v4^(k))
But this is greedy, and easily falls into local minima.
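A minimal sketch of ICM for a generic discrete model; `conditional(i, value, assignment)` is an assumed scoring function proportional to P(v_i = value | all other variables), and the variable domains are assumed finite:

    def icm(domains, conditional, assignment, max_sweeps=10):
        """Greedily set each variable to its best value given all of the others."""
        for _ in range(max_sweeps):
            changed = False
            for i, domain in enumerate(domains):
                best = max(domain, key=lambda value: conditional(i, value, assignment))
                if best != assignment[i]:
                    assignment[i] = best
                    changed = True
            if not changed:              # reached a local optimum (not necessarily global)
                break
        return assignment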
89
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]
  • Iterated Conditional Sampling (ICS) (our proposal; related work?)
  • Instead of passing only the argmax, pass a sample of the highest-scoring values of P(v4^(k) | v \ v4^(k)), i.e. an N-best list (the top N values).
[Diagram: as before, v4^(k+1) = argmax over v4^(k) of P(v4^(k) | v \ v4^(k)), but each variable now passes its top-N values.]
Can use a generalized version of this, doing exact inference on a region of several nodes at once. Here, a message grows only linearly with region size and N!
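A minimal sketch of the N-best idea, reusing the assumed `conditional` scoring function from the ICM sketch above: each variable keeps its top-N values rather than a single argmax, so that downstream decisions can consider alternatives.

    def ics(domains, conditional, assignment, n_best=5, max_sweeps=10):
        """Like ICM, but each variable propagates an N-best list of values."""
        n_best_lists = [[assignment[i]] for i in range(len(domains))]
        for _ in range(max_sweeps):
            for i, domain in enumerate(domains):
                scored = sorted(((conditional(i, value, assignment), value) for value in domain),
                                key=lambda pair: pair[0], reverse=True)
                n_best_lists[i] = [value for _, value in scored[:n_best]]
                assignment[i] = n_best_lists[i][0]   # condition later updates on the single best
        return assignment, n_best_lists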
90
Sample N-best List from CRF Segmentation
Do exact inference over these linear-chain regions and pass the N-best list to coreference.
[Diagram: citations with observed strings o, CRF segmentations s, citation attributes c, pairwise coreference variables y, and prototype variables p.]
91
Sample N-best List from Viterbi
[Diagram: the coreference model - observed strings o, segmentations s, attributes c, pairwise variables y - now parameterized by N-best lists of segmentations.]
92
Sample N-best List from Viterbi
When calculating similarity with another citation, there is more opportunity to find the correct, matching fields.
[Diagram: as before - observed strings o, segmentations s, attributes c, pairwise variables y.]
93
Results on 4 Sections of CiteSeer Citations
Coreference F1 performance
  • Average error reduction is 35%.
  • "Optimal" makes the best use of the N-best list by using the true labels.
  • This indicates that even more improvement can be obtained.

94
Outline
  • The need for unified IE and DM.
  • Review of Conditional Random Fields for IE.
  • Preliminary steps toward unification:
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Segmentation and Co-ref (Iterated Conditional Sampling)
  • Conclusions

95
Conclusions
  • Conditional Random Fields combine the benefits of
    - Conditional probability models (arbitrary features)
    - Markov models (for sequences or other relations)
  • Success in
    - Coreference analysis
    - Factorial finite state models
    - Segmentation uncertainty aiding coreference
  • Future work
    - Structure learning, semi-supervised learning
    - Further tight integration of IE and Data Mining
    - Application to Social Network Analysis

96
End of Talk