Title: Information Extraction, Data Mining and Joint Inference
1. Information Extraction, Data Mining and Joint Inference
- Andrew McCallum
- Computer Science Department
- University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Khashayar Rohanimanesh, Ben Wellner, Karl Schultz, Michael Hay, Michael Wick, David Mimno.
2. My Research
Building models that mine actionable knowledge from unstructured text.
3. From Text to Actionable Knowledge
[Pipeline figure] Document collection → Spider / Filter → IE (segment, classify, associate, cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support)
4. A Natural Language Processing Pipeline
POS tagging → Chunking → Parsing → Entity Recognition → Semantic Role Labeling → Anaphora Resolution → Pragmatics
5. Unified Natural Language Processing
The same stack (POS tagging, chunking, parsing, entity recognition, semantic role labeling, anaphora resolution, pragmatics), handled by unified, joint inference.
6. Problem
- Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
  - DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
  - IE is unaware of emerging patterns and regularities in the DB.
- The accuracy of both suffers, and significant mining of complex text sources is beyond reach.
7. Solution
[Pipeline figure as before, with two additions: IE passes uncertainty info forward to Data Mining, and Data Mining feeds emerging patterns back to IE]
8. Solution
[Figure: IE and Data Mining merged into a single unified probabilistic model spanning segment/classify/associate/cluster and pattern discovery, from document collection to actionable knowledge]
9. Scientific Questions
- What model structures will capture salient dependencies?
- Will joint inference actually improve accuracy?
- How to do inference in these large graphical models?
- How to do parameter estimation efficiently in these models, which are built from multiple large components?
- How to do structure discovery in these models?
11. Methods of Inference
- Exact: exhaustively explore all interpretations; feasible when the graphical model has low tree-width.
- Variational: represent the distribution with a simpler model that is close.
- Monte-Carlo: randomly (but cleverly) sample to explore interpretations.
12. Outline
- Examples of IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Segmentation and Co-ref (Sparse BP)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
- Semi-supervised Learning
- Demo: Rexa, a Web portal for researchers
13. Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: finite state model (transitions between states) and the corresponding graphical model over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
Generates: a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
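For reference, the joint distribution an HMM defines (the standard factorization; the slide shows it only graphically):

$$P(\mathbf{s}, \mathbf{o}) \;=\; \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$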
14. IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Tony Jebara spoke this example sentence.
and a trained HMM with states for person name, location name, and background, find the most likely state sequence (Viterbi):
  Yesterday [Tony Jebara] spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
  Person name: Tony Jebara
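A minimal Viterbi sketch for such a tagger. The three states match the slide, but the transition and start probabilities below are illustrative placeholders, not trained values:

```python
import numpy as np

# Toy HMM: states as on the slide; parameters are made-up placeholders.
states = ["background", "person", "location"]
start = np.log([0.8, 0.1, 0.1])
trans = np.log([[0.8, 0.1, 0.1],   # P(s_t | s_{t-1})
                [0.4, 0.5, 0.1],
                [0.4, 0.1, 0.5]])

def viterbi(log_emit):
    """log_emit[t, s] = log P(o_t | s). Returns the most likely state sequence."""
    T, S = log_emit.shape
    delta = start + log_emit[0]          # best log-prob of paths ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + trans + log_emit[t]  # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]
```

Running tokens through an emission model and then `viterbi` yields one state per token; contiguous "person" states are read off as the extracted name.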
15. We Want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g. identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates".
[Figure: the HMM chain S_{t-1}, S_t, S_{t+1} with overlapping observation features attached to O_t, e.g. "is Wisniewski", "part of noun phrase", "ends in -ski"]
16. Problems with Richer Representation and a Joint Model
- These arbitrary features are not independent:
  - multiple levels of granularity (chars, words, phrases)
  - multiple dependent modalities (words, formatting, layout)
  - past and future
- Two choices:
  1. Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
  2. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
[Figure: two graphical models over states S_{t-1..t+1} and observations O_{t-1..t+1}, one ignoring and one modeling the observation dependencies]
17. Conditional Sequence Models
- We prefer a model trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
- Can examine features, but is not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.
18. From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
[Figure: the same chain over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, drawn once as a directed (joint) model and once as an undirected (conditional) model]
Joint: $P(\mathbf{s},\mathbf{o}) = \prod_t P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$
Conditional: $P(\mathbf{s} \mid \mathbf{o}) = \frac{1}{Z(\mathbf{o})} \prod_t \Phi_s(s_{t-1}, s_t)\, \Phi_o(s_t, o_t)$, where $\Phi(\cdot) = \exp\big(\sum_k \lambda_k f_k(\cdot)\big)$.
(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on the gradient of the likelihood L.
19. (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of the output (sequence) given the input (sequence).
[Figure: finite state model with FSM states OTHER, PERSON, ORG, TITLE, ..., and the graphical model with output sequence y_{t-2}, ..., y_{t+1} over the input sequence "said Jones a Microsoft VP"]
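A compact sketch of the linear-chain CRF conditional probability, assuming the per-position feature scores have already been summed into unary and pairwise score tables (an illustration, not the paper's implementation):

```python
import numpy as np

def crf_log_prob(y, unary, pairwise):
    """log P(y|x) for a linear-chain CRF.

    unary[t, s]    = sum_k lambda_k f_k(y_t = s, x, t)   (precomputed)
    pairwise[a, b] = sum_k lambda_k f_k(y_{t-1} = a, y_t = b)
    """
    T, S = unary.shape
    # Unnormalized log-score of the labeled path.
    score = unary[0, y[0]]
    for t in range(1, T):
        score += pairwise[y[t - 1], y[t]] + unary[t, y[t]]
    # log Z(x) by the forward algorithm in log space.
    alpha = unary[0]
    for t in range(1, T):
        alpha = unary[t] + np.logaddexp.reduce(alpha[:, None] + pairwise, axis=0)
    log_Z = np.logaddexp.reduce(alpha)
    return score - log_Z
```

Maximum-likelihood training then just ascends the gradient of this quantity summed over labeled sequences.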
20. Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
            Milk Cows and Production of Milk and Milkfat:
                    United States, 1993-95
--------------------------------------------------------------------------
       :          :      Production of Milk and Milkfat 2/
       :  Number  :-------------------------------------------------------
 Year  :    of    :  Per Milk Cow    : Percentage :        Total
       :   Milk   :------------------: of Fat in  :-----------------------
       :  Cows 1/ :  Milk  : Milkfat : All Milk   :   Milk    : Milkfat
--------------------------------------------------------------------------
       : 1,000 Head   --- Pounds ---    Percent      Million Pounds
 1993  :   9,589     15,704     575      3.66       150,582     5,514.4
 1994  :   9,500     16,175     592      3.66       153,664     5,623.7
 1995  :   9,461     16,451     602      3.66       155,644     5,694.3
--------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
21. Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100 documents from www.fedstats.gov
Labels (12 in all):
- Non-Table
- Table Title
- Table Header
- Table Data Row
- Table Section Data Row
- Table Footnote
- ...
[The same government-report text and table as the previous slide, with each line labeled by the CRF]
Features:
- percentage of digit chars
- percentage of alpha chars
- indented
- contains 5+ consecutive spaces
- whitespace in this line aligns with the previous line
- ...
- conjunctions of all the previous features at time offsets (0,0), (-1,0), (0,1), (1,2)
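A sketch of how such line-level features might be computed; the feature names and the alignment heuristic below are paraphrased from the slide, not the paper's exact definitions:

```python
def line_features(line: str, prev: str) -> dict:
    """Binary/real-valued features for one line of a plain-text report."""
    n = max(len(line), 1)
    digits = sum(c.isdigit() for c in line)
    alphas = sum(c.isalpha() for c in line)
    return {
        "pct_digit": digits / n,
        "pct_alpha": alphas / n,
        "indented": line.startswith(" "),
        "gap5": "     " in line,   # contains 5+ consecutive spaces
        # Rough proxy for "whitespace aligns with previous line":
        # shares at least one whitespace column with the line above.
        "aligns_with_prev": any(
            line[i] == " " == prev[i]
            for i in range(min(len(line), len(prev)))
        ),
    }
```

Conjunctions at the listed time offsets would then be formed by pairing each line's features with those of its neighbors.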
22. Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                   Line labels        Table segments
                   (percent correct)  (F1)
HMM                65                 64
Stateless MaxEnt   85                 -
CRF                95                 92
23. IE from Research Papers [McCallum et al '99]
[Figure: screenshot of header and reference extraction from a research paper]
24. IE from Research Papers
Field-level F1:
- Hidden Markov Models (HMMs): 75.6 [Seymore, McCallum, Rosenfeld, 1999]
- Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
- Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
(error reduced by 40%)
25. Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND. CAPE TOWN 1996-08-22. South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:
- PER: Yayuk Basuki, Innocent Butare
- ORG: 3M, KDP, Cleveland
- LOC: Cleveland, Nirmal Hriday, The Oval
- MISC: Java, Basque, 1,000 Lakes Rally
26. Automatically Induced Features [McCallum & Li, 2003, CoNLL]

Index  Feature
0      inside-noun-phrase (o_{t-1})
5      stopword (o_t)
20     capitalized (o_{t+1})
75     word=the (o_t)
100    in-person-lexicon (o_{t-1})
200    word=in (o_{t+2})
500    word=Republic (o_{t+1})
711    word=RBI (o_t) AND header=BASEBALL
1027   header=CRICKET (o_t) AND in-English-county-lexicon (o_t)
1298   company-suffix-word (first-mention_{t+2})
4040   location (o_t) AND POS=NNP (o_t) AND capitalized (o_t) AND stopword (o_{t-1})
4474   word=the (o_{t-2}) AND word=of (o_t)
4945   moderately-rare-first-name (o_{t-1}) AND very-common-last-name (o_t)
27. Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]

Method                                                    F1
HMMs (BBN's Identifinder)                                 73
CRFs without feature induction                            83
CRFs with feature induction (based on likelihood gain)    90
28. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
29. 1. Jointly Labeling Cascaded Sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
[Figure: stacked chains, bottom to top: English words, part-of-speech, noun-phrase boundaries, named-entity tag]
31. 1. Jointly Labeling Cascaded Sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
[Figure: the same stacked chains, run as a pipeline]
But errors cascade: you must be perfect at every stage to do well.
32. 1. Jointly Labeling Cascaded Sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
[Figure: the stacked chains coupled into one factorial CRF]
Joint prediction of part-of-speech and noun-phrase boundaries in newswire, matching accuracy with only 50% of the training data.
Inference: loopy belief propagation.
33. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
34. 2. Jointly Labeling Distant Mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]
[Figure: "Senator Joe Green said today ...." ... "Green ran for ..." as a linear chain in which the dependency among similar, distant mentions is ignored]
35. 2. Jointly Labeling Distant Mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]
[Figure: the same sentences with a "skip" edge connecting the two Green mentions]
14% reduction in error on the most-repeated field in email seminar announcements.
Inference: tree-reparameterized BP [Wainwright et al, 2002]. See also [Finkel, et al, 2005].
36. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
37. 3. Joint Co-reference Among All Pairs: Affinity Matrix CRF
("Entity resolution", "object correspondence")
[Figure: mentions Mr. Hill, Dana Hill, Amy Hall, she, Dana connected by pairwise affinities]
25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering / graph partitioning
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004; Bansal, Blum, Chawla, 2002]
38. Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty".
Input: a news article with named-entity "mentions" tagged, e.g.
  "Today Secretary of State Colin Powell met with ... he ... Condoleezza Rice ... Mr Powell ... she ... Powell ... President Bush ... Rice ... Bush ..."
Output: number of entities N = 3:
1. Secretary of State Colin Powell; he; Mr. Powell; Powell
2. Condoleezza Rice; she; Rice
3. President Bush; Bush
39. Inside the Traditional Solution: Pair-wise Affinity Metric
Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . ." : coreferent (Y/N)?

Fires  Feature                                              Weight
N      Two words in common                                   29
Y      One word in common                                    13
Y      "Normalized" mentions are string identical            39
Y      Capitalized word in common                            17
Y      > 50% character tri-gram overlap                      19
N      < 25% character tri-gram overlap                     -34
Y      In same sentence                                       9
Y      Within two sentences                                   8
N      Further than 3 sentences apart                        -1
Y      "Hobbs Distance" < 3                                  11
N      Number of entities in between two mentions = 0        12
N      Number of entities in between two mentions > 4        -3
Y      Font matches                                           1
Y      Default                                              -19
OVERALL SCORE = 98 > threshold = 0
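The pairwise decision is simply a weighted feature sum compared against a threshold; a sketch using the weights and Y/N firings from the slide's example:

```python
# (fires, weight) pairs taken from the slide's example above.
features = [
    (False, 29),   # two words in common
    (True, 13),    # one word in common
    (True, 39),    # "normalized" mentions string identical
    (True, 17),    # capitalized word in common
    (True, 19),    # > 50% character tri-gram overlap
    (False, -34),  # < 25% character tri-gram overlap
    (True, 9),     # in same sentence
    (True, 8),     # within two sentences
    (False, -1),   # further than 3 sentences apart
    (True, 11),    # "Hobbs distance" < 3
    (False, 12),   # 0 entities in between
    (False, -3),   # > 4 entities in between
    (True, 1),     # font matches
    (True, -19),   # default (bias)
]

score = sum(w for fires, w in features if fires)
coreferent = score > 0        # threshold = 0
print(score, coreferent)      # 98 True, matching the slide
```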
40. Entity Resolution
[Figure: five mentions: Mr. Hill, Dana Hill, Amy Hall, she, Dana]
41. Entity Resolution
[Figure: one candidate grouping of the mentions into entities]
42. Entity Resolution
[Figure: an alternative grouping into entities]
43. Entity Resolution
[Figure: a third alternative grouping into entities]
44. The Problem
Independent pairwise affinity with connected components:
[Figure: the mention graph with pairwise C(oreferent)/N(ot-coreferent) decisions on each edge]
- Affinity measures are noisy and imperfect.
- Pair-wise merging decisions are being made independently of each other; they should be made jointly.
45. CRF for Co-reference [McCallum & Wellner, 2003, ICML]
[Figure: the same mention graph, with the pairwise C/N decisions coupled]
Make pair-wise merging decisions jointly by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.
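The model on this slide can be written as follows (reconstructed from McCallum & Wellner's papers; the notation may differ slightly from the original slide):

$$P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z_{\mathbf{x}}} \exp\Big( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) \;+\; \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$

where the second term assigns a large penalty to inconsistent triangles (e.g. y_{ij} = y_{jk} = C but y_{ik} = N).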
46. CRF for Co-reference [McCallum & Wellner, 2003, ICML]
[Figure: the full graph of coupled pairwise decisions, without annotations]
47. A Generative Model Solution [Russell 2001; Pasula et al 2002]
(Applied to citation matching, and to object correspondence in vision.)
[Figure: generative model with per-entity variables: id, words, context; id, surname, distance, fonts, age, gender, ...]
(2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify the model structure during inference, via MCMC.
48. CRF for Co-reference
[Figure: one labeling of the mention graph, with signed edge weights such as +(23), -(-55), -(-44), -(-23), +(11), +(17), -(-9), +(10), -(-22), +(4); summing the weights in agreement with the labels gives a score of 218]
49. CRF for Co-reference
[Figure: an alternative labeling that flips the she/Dana decision to N, giving a score of 210]
50. CRF for Co-reference
[Figure: a labeling that partitions the graph differently, giving a score of -12]
51. Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
Correlational clustering [Bansal, Blum; Demaine]
[Figure: the weighted mention graph, partitioned into entities]
52. Pairwise Affinity is not Enough
[Figure: the weighted mention graph from the previous slide]
53. Pairwise Affinity is not Enough
[Figure: the same graph with the weights removed, showing only the C/N decisions]
54. Pairwise Affinity is not Enough
[Figure: the same structure where four of the five mentions are the pronoun "she"; pairwise affinities alone cannot distinguish the groupings]
55. Pairwise Comparisons Are Not Enough: Examples
- ∀ mentions are pronouns?
- Entities have multiple attributes (name, email, institution, location); need to measure compatibility among them.
- Having 2 given names is common, but not 4 (e.g. Howard M. Dean / Martin, Dean / Howard Martin).
- Need to measure the size of the clusters of mentions.
- ∃ a pair of last-name strings that differ by more than 5?
- We need to ask ∀ and ∃ questions about a set of mentions.
- We want first-order logic!
56. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
57. Pairwise Affinity is not Enough
[Figure repeated: four "she" mentions and "Amy Hall" with pairwise C/N decisions]
58. Partition Affinity CRF
[Figure: mentions she, she, she, she, Amy Hall with factors over whole partitions rather than pairs]
Ask arbitrary questions about all entities in a partition with first-order logic...
...bringing together LOGIC and PROBABILITY.
59-62. Partition Affinity CRF
[Figure sequence: alternative partitionings of the same mentions, each scored by the partition-level factors]
63. This space complexity is common in probabilistic first-order logic models.
64. Markov Logic: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005; see also Paskin & Russell 2002; Taskar et al 2003]
[Figure: first-order formulae grounded into a ground Markov network]
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
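For a sense of scale (illustrative numbers, not from the slide): with n = 1,000 constants and clause arity r = 3, the ground network already has on the order of n^r = 10^9 ground clauses, far too many to instantiate explicitly.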
65. How can we perform inference and learning in models that cannot be grounded?
66. Inference in Weighted First-Order Logic: SAT Solvers
- Weighted SAT solvers [Kautz et al 1997]: require complete grounding of the network.
- LazySAT [Singla & Domingos 2006]: saves memory by only storing clauses that may become unsatisfied, but initialization still requires time O(n^r) to visit all ground clauses.
67. Inference in Weighted First-Order Logic: MCMC
- Gibbs sampling: difficult to move between high-probability configurations by changing single variables (although, consider MC-SAT [Poon & Domingos 06]).
- An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  - Two parts: proposal distribution and acceptance distribution.
  - Can be extended to partial configurations: only instantiate relevant variables.
  - Successfully used in BLOG models [Milch et al 2005].
  - Key advantage: can design arbitrary smart jumps.
68. Don't represent all alternatives...
69. ...just one at a time.
The proposal distribution makes a stochastic jump from one configuration to the next.
70. Model
First-order features over partitions, e.g. f_w(SamePerson(x)) within a partition and f_b(DifferentPerson(x, x')) between partitions.
[Figure: mentions Amy Hall, Dana Hill, she, Dana, she, Amy, she grouped into candidate partitions]
71-73. Proposal Distribution
[Figure sequence: a jump from configuration y to y', e.g. moving "Howard Martin" between the clusterings {Dean Martin, Howie Martin, Howard Martin} / {Howie Martin} and {Dean Martin, Howie Martin} / {Howard Martin, Dino}]
74. Metropolis-Hastings: Jump Acceptance Probability
- p(y')/p(y): the likelihood ratio. It is a ratio of P(Y|X), so Z_X cancels!
- q(y'→y)/q(y→y'): the proposal-distribution ratio. q gives the probability of proposing a move, and the ratio makes up for any biases in the proposal distribution.
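A minimal Metropolis-Hastings loop over clusterings, assuming an unnormalized model score (so Z_X never needs to be computed) and a user-supplied proposer; all function names here are hypothetical:

```python
import math
import random

def metropolis_hastings(y0, score, propose, steps=10_000):
    """score(y): unnormalized log P(y|x); Z_X cancels in the ratio.
    propose(y): returns (y_new, log q(y_new|y), log q(y|y_new))."""
    y = y0
    for _ in range(steps):
        y_new, log_q_fwd, log_q_rev = propose(y)
        # Log acceptance ratio: likelihood ratio times proposal correction.
        log_a = (score(y_new) - score(y)) + (log_q_rev - log_q_fwd)
        if math.log(random.random()) < min(0.0, log_a):
            y = y_new
    return y
```

Because only score differences appear, arbitrarily "smart" jump functions can be plugged in without re-deriving the acceptance rule.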
75. Learning in Probabilistic First-order Logic: Parameter Estimation (Weight Learning)
- Input:
  - first-order formulae, e.g. ∀x S(x) ⇒ T(x)
  - labeled data: constants {a, b, c} with S(a), T(a), S(b), T(b), S(c)
- Output: a weight for each formula, e.g.
  - ∀x S(x) ⇒ T(x): 0.67
  - ∀x,y Coreferent(x,y) ⇒ Pronoun(x): -2.3
76. Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the better configuration higher.
77. Parameter Estimation in Large State Spaces
- Most methods require calculating the gradient of the log-likelihood, P(y1, y2, y3, ... | x1, x2, x3, ...), which in turn requires expectations of marginals such as P(y1 | x1, x2, x3, ...).
- But getting marginal distributions by sampling can be inefficient due to the large sample space.
- Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output. But even finding the model's predicted best output is expensive.
- We propose Sample Rank [Culotta, Wick, Hall, McCallum, HLT 2007]: learn to rank intermediate solutions, e.g. P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...).
78. Ranking vs. Classification Training
- Instead of training (Powell, Mr. Powell, he) → YES and (Powell, Mr. Powell, she) → NO...
- ...rather: (Powell, Mr. Powell, he) > (Powell, Mr. Powell, she).
- In general, the higher-ranked example may contain errors: (Powell, Mr. Powell, George, he) > (Powell, Mr. Powell, George, she).
79. Ranking Parameter Update
In our experiments, we use a large-margin update based on MIRA [Crammer & Singer 2003]:
$W_{t+1} = \operatorname{argmin}_W \lVert W_t - W \rVert \quad \text{s.t.} \quad \text{Score}(Q'', W) - \text{Score}(Q', W) \geq 1$
(where Q'' is the configuration that should be ranked higher and Q' the one ranked lower).
80. Error-driven Training
- Input:
  - observed data O (input mentions)
  - true labeling P (true clustering)
  - prediction algorithm A (clustering algorithm)
  - initial weights W and initial prediction Q (initial clustering)
- Iterate until convergence:
  - Q' ← A(Q, W, O)   // merge clusters
  - if Q' introduces an error: UpdateWeights(Q', Q, P, O, W)
  - else: Q ← Q'
81. UpdateWeights(Q', Q, P, O, W): Learning to Rank Pairs of Predictions
- Using the truth P, generate a new Q'' that is a better modification of Q than Q' is.
- Update W s.t. Q'' ← A(Q, W, O), i.e. update the parameters so that Q'' is ranked higher than Q'.
82. Ranking Intermediate Solutions: Example
[Figure: a chain of candidate clusterings 1-5 with, at each step, the model's score change and the truth's score change: ΔModel = 3, ΔTruth = 0.3; ΔModel = -23, ΔTruth = -0.2; ΔModel = 10, ΔTruth = -0.1 → UPDATE; ΔModel = -10, ΔTruth = -0.1]
- Like the Perceptron: proof of convergence under marginal separability.
- More constrained than maximum likelihood: the parameters must correctly rank incorrect solutions!
83. Sample Rank Algorithm
1. Proposer.
2. Performance metric.
3. Inputs: an input sequence x and an initial (random) configuration.
4. Initialization: set the parameter vector.
5. Output: parameters.
6. Score function.
7. For t = 1, ..., T and i = 0, ..., n-1:
   - generate a training instance (a pair of neighboring configurations);
   - let y+ and y- be the best and worst configurations among them according to the performance metric;
   - if the model misranks y+ and y-, update the parameters.
8. End for.
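A compact Sample Rank sketch filling in the parts the slide shows only as images; `phi` (features), `propose`, and the performance metric `F` are assumptions, and the simple perceptron-style update stands in for the MIRA update used in the experiments:

```python
import numpy as np

def sample_rank(x, y, w, phi, propose, F, T=1000, lr=1.0):
    """Learn w by ranking pairs of neighboring configurations.
    phi(x, y): feature vector; F(y): performance metric (e.g. F1 vs. truth)."""
    for _ in range(T):
        y_new = propose(y)
        # Order the pair by the performance metric (skip exact ties below).
        better, worse = (y_new, y) if F(y_new) > F(y) else (y, y_new)
        d = phi(x, better) - phi(x, worse)
        # Update only when the model misranks the pair.
        if F(better) != F(worse) and w.dot(d) <= 0:
            w += lr * d
        y = y_new   # follow the proposal chain
    return w
```

Note that no partition function appears anywhere: only score differences between the two configurations are needed.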
84. Marginal Separability
For an input sequence x, a feature vector Φ(x, y), and a parameter vector w, a training set is called separable with margin γ > 0 if there exists some vector u with ||u|| = 1 such that u · (Φ(x, y+) - Φ(x, y-)) ≥ γ for every training pair.
85. Convergence Results
Theorem 1: For any training set that is separable with margin γ, the number of times the algorithm makes a ranking error is bounded.
86. Example 1
[Figure: an initial configuration, two candidate next states, and the goal configuration]
One next state has higher F1 but lower utility for leading to the goal configuration; the other has lower F1 but higher utility. Greedy next-state selection will pick the higher-F1 configuration (the greedy move) and may get stuck in a locally optimal solution.
87. Example 2
CLAIM: singletons are good states along paths to the goal.
Why? From singletons, a small number of merges can reach the goal (e.g. 8 moves from the goal rather than 13). But F1 is the harmonic mean of P and R, and singletons have almost zero R: a state can have low F1 yet high utility, or medium F1 yet low utility.
88. [Figure: 1. an initial configuration, e.g. {Obama, U.S., Palin, He, Biden}; 2. a proposed configuration, e.g. {Obama, Biden, Biden, Sen. Obama, He, She, Palin, U.S., The US, America} and {Biden, America, She, The US, Sen. Obama}]
It is not clear what choice of performance metric is more intuitive for intermediate configurations. How important is it that our model ranks these?
89. Delayed Feedback Problem
[Figure: configuration space, from the initial configuration to the goal configuration]
- Model the problem as a reinforcement-learning problem to address the delayed feedback problem and the noise in the performance metric.
- Define a cost function as the temporal difference of the performance metric between any two configurations: TD-error = F(y_{t+1}) - F(y_t).
- During training, use the Sample Rank algorithm to learn the cost function.
- During testing, use the approximate cost function learned during training in a reinforcement-learning setting.
90. Weighted Logics Techniques: Overview
- Metropolis-Hastings (MH) for inference:
  - Freely bake in domain knowledge about fruitful jumps; MH safely takes care of its biases.
  - Avoid the memory and time consumed by massive deterministic constraint factors: build jump functions that simply avoid illegal states.
- Sample Rank:
  - Don't train by likelihood of the completely correct solution...
  - ...train to properly rank intermediate configurations; the partition function (normalizer) cancels! ...plus other efficiencies.
91. Feature List
- Exact match/mismatch:
  - entity type
  - gender (requires lexicon)
  - number
  - case
  - entity text
  - entity head
  - entity modifier / numerical modifier
  - sentence
  - WordNet hypernym, synonym, antonym
- Other:
  - relative pronoun agreement
  - sentence distance in bins
  - partial text overlaps
- Quantification:
  - existential (∃): a gender mismatch? three different first names?
  - universal (∀): NER types match? named mentions string-identical?
- Filters (limit quantifiers to mention type): none; pronoun; nominal (description); proper (name)
92. Partition Affinity CRF Experiments [Culotta, Wick, Hall, McCallum, 2007]
B-Cubed F1 score on ACE 2004 noun coreference:

                     Likelihood-based   Rank-based
                     training           training
Partition affinity   69.2               79.3
Pairwise affinity    62.4               72.5

Better representation (partition vs. pairwise affinity) and better training (rank-based) each help; the combination is a new state of the art. To our knowledge, the best previously reported results were 65 (1997), 67 (2002), and 68 (2005).
93. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
94. Information Integration

Schema A:
First Name | Last Name | Contact
J.         | Smith     | 222-444-1337
J.         | Smith     | 444 1337
John       | Smith     | (1) 4321115555

Schema B:
Name            | Phone
John Smith U.S. | 222-444-1337
John D. Smith   | 444 1337
J Smiht         | 432-111-5555

Schema matching: {First Name, Last Name} (Schema A) ↔ Name (Schema B); Contact ↔ Phone.
Coreference: cluster the records into two entities, John 1 and John 2, e.g. {J. Smith / John Smith / J Smiht / John Smith} and {J. Smith / John D. Smith}.

Normalized DB:
Entity | Name          | Phone
523    | John Smith    | 222-444-1337
524    | John D. Smith | 432-111-5555
95. Data Integration
Schema 13 (Name | Phone) and Schema 25 (First | Last Name | Contact), with the same records as above:
schema matching → coref → canonicalization →

Result, normalized DB:
Entity | Name          | Phone
523    | John Smith    | 222-444-1337
524    | John D. Smith | 432-111-5555
96. Information Integration: A Family of Related Tasks
- GOAL: combine multiple heterogeneous sources into a single repository.
- Requires:
  - concept mapping across different representations (schema matching)
  - data deduplication across different repositories (coreference)
  - selecting a representation to store in the resulting DB (canonicalization / normalization)
97. Information Integration Steps
1. Schema matching: e.g. {First Name, Last Name} ↔ Name; Contact ↔ Phone.
2. Coreference: e.g. J. Smith ~ John Smith; Amanda ~ A. Jones.
3. Canonicalization: e.g. {J. Smith, John Smith, ...} → "John Smith"; {Amanda, A. Jones, ...} → "Amanda Jones".
98. Problems with a Pipeline
1. Data integration tasks are highly correlated.
2. Errors can propagate.
99. Schema Matching First
Doing schema matching first (e.g. {First Name, Last Name} ↔ Name, Contact ↔ Phone) provides evidence for coreference (e.g. J. Smith ~ John Smith, Amanda ~ A. Jones).
NEW FEATURES:
1. String identical: First Name + Last Name = Name
2. Same area-code 3-gram in Phone/Contact
3. ...
100. Coreference First
Doing coreference first (e.g. J. Smith ~ John Smith, Amanda ~ A. Jones) provides evidence for schema matching ({First Name, Last Name} ↔ Name, Contact ↔ Phone).
NEW FEATURES:
1. Field values are similar across coref'd records.
2. Phone = Contact has the same value for the J. Smith mentions.
3. ...
101. Problems with a Pipeline
1. Data integration tasks are highly correlated.
2. Errors can propagate.
102. Hazards of a Pipeline
1. Schema matching:
Table A (Name | Corporation): Amanda Jones | J. Smith & Sons; J. Smith | IBM
Table B (Full Name | Company Name): Amanda Jones | Smith & Sons; John Smith | IBM
[Figure: candidate mappings among Name, Corporation, Full Name, Phone, Company Name, Contact]
2. Coreferent?
A schema-matching mistake feeds directly into the coreference decisions: ERRORS PROPAGATE.
103. Canonicalization
Entity 87: {John Smith, J. Smith, J. Smith, J. Smiht, J.S Mith, Jonh smith, John} → coref → canonicalization → "John Smith"
Canonicalization typically occurs AFTER coreference.
- Desiderata:
  - Complete: contains all information (e.g. first + last)
  - Error-free: no typos (e.g. avoid "Smiht")
  - Central: represents all mentions (not "Mith")
Access to such features would be very helpful to coref.
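A rough sketch of picking a "central" canonical string by minimum total edit distance, one simple realization of the centrality desideratum (not the paper's exact canonicalization model):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def canonical(mentions):
    """The mention minimizing total edit distance to the others (a centroid)."""
    return min(mentions, key=lambda m: sum(edit_distance(m, o) for o in mentions))

print(canonical(["John Smith", "J. Smith", "J. Smiht", "Jonh smith", "John"]))
```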
104. Schema Matching
[Factor graph over attribute clusters x4...x8, with per-cluster factors (f5, f7, f8, ...), pairwise factors (f67, ...), and binary match variables (y5, y7, y8, y54, y67, ...)]
- x6 is a set of attributes {phone, contact, telephone}
- x7 is a set of attributes {last name, last name}
- f67 is a factor between x6/x7; y67 is a binary variable indicating a match (no)
- f7 is a factor over cluster x7; y7 is a binary variable indicating a match (yes)

105. Schema Matching
[The same factor graph extended with mention clusters x1, x2]
- x1 is a set of mentions {J. Smith, John, John Smith}
- x2 is a set of mentions {Amanda, A. Jones}
- f12 is a factor between x1/x2; y12 is a binary variable indicating a match (no)
- f1 is a factor over cluster x1; y1 is a binary variable indicating a match (yes)
- Entity/attribute factors omitted for clarity.

106. Schema Matching
[Figure: the combined factor graph over attribute clusters and mention clusters]
107. Canonicalization's Role
- For each attribute in each cluster there is a canonical-value variable, which takes on one value from all the possible values in the cluster.
- The canonical record is the tuple of all canonical attribute variables; there is one canonical record per coreference cluster.
- Canonical records are used to compute additional features.
108. Model Summary
- Conditional random field with two clustering components:
  - clusters of attributes (schema matching)
  - clusters of mentions (coref)
- Factors represent affinities between and among clusters (sets):
  - schema-matching affinity factors
  - coreference affinity factors
  - canonicalization factors
- Learning/inference is intractable.
109. Parameter Estimation
- Three sets of parameters: schema matching, coreference, canonicalization.
- Labeled data for schema matching and coreference; the canonicalization model is set to default string-edit parameters.
110. Parameter Estimation
[Figure: alternate between the two tasks]
- Setting coreference params: fix the schema-matching truth ({First Name, Last Name} ↔ Name; Contact ↔ Phone) and sample coreference training examples.
- Setting schema-matching params: fix the coreference truth (Entity 1 = {J. Smith, J. Smith, John Smith, John D. Smith}; Entity 2 = {A Jones, Amanda Jones}) and sample schema-matching training examples.
111. Inference (MAP/MPE)
GOAL: find a configuration that maximizes P(Y|X).
- Clustering inference for coreference and schema matching: greedy agglomerative.
- Inference for canonicalization: find the attribute-value centroid for each set.
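A sketch of greedy agglomerative MAP inference for the clustering tasks, assuming a model-provided `merge_score`; the stopping threshold of 0.5 is taken from the system-settings slide:

```python
def greedy_agglomerative(items, merge_score, threshold=0.5):
    """Repeatedly merge the highest-scoring cluster pair until no
    pair scores above the threshold."""
    clusters = [{i} for i in range(len(items))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = merge_score(clusters[i], clusters[j], items)
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:       # no merge beats the threshold: stop
            break
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters
```

In the joint setting, this same loop is run alternately over mention clusters and attribute clusters, with each task's current clusters feeding features to the other.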
112. Joint Inference
[Figure: iterations n-1, n, ...: alternate greedy agglomerative inference over coreference and schema matching]
113. EXPERIMENTS
114. Dataset
- Faculty and alumni listings from university websites, plus an IE system.
- 9 different schemas.
- 1400 mentions, 294 coreferent.
DEX IE Northwestern Fac UPenn Fac
First Name Name Name
Middle Name Title First Name
Last Name PhD Alma Mater Last Name
Title Research Interests JobDepartment
Department Office Address
Company Name E-mail
Home Phone
Office Phone
Fax Number
E-mail
116. Experiments
- Prune down to 200 singletons plus the 294 coreference clusters.
- Create training/testing sets, keeping the DEX schema in both; it has the most interesting cases of coreference.
- The sets are still disjoint, since a mapping is defined between two schemas.
117. Coreference Features
First-order quantifications/aggregations over:
- cosine distance (unweighted)
- cosine distance (TFIDF)
- cosine distance between mapped fields
- substring match between mapped fields
- all of the above comparisons, but using canonical records
118. Schema Matching Features
First-order quantifications/aggregations over:
- string identical
- substring matches
- TFIDF-weighted cosine distance
- all of the above, but between coreferent mentions only
119. System Settings
- MaxEnt piecewise training.
- Agglomerative search for sub-tasks; stopping threshold 0.5.
- Four iterations for joint inference.
120. Systems
- ISO: each task in isolation
- CASC: coref → schema matching
- CASC: schema matching → coref
- JOINT: coref + schema matching (our new work)
Each system is evaluated with and without joint canonicalization.
121. Coreference Results

                   Pair F1  Pair Prec  Pair Rec  MUC F1  MUC Prec  MUC Rec
No canon  ISO      72.7     88.9       61.5      75.0    88.9      64.9
No canon  CASC     64.0     66.7       61.5      65.7    66.7      64.9
No canon  JOINT    76.5     89.7       66.7      78.8    89.7      70.3
Canon     ISO      78.3     90.0       69.2      80.6    90.0      73.0
Canon     CASC     65.8     67.6       64.1      67.6    67.6      67.6
Canon     JOINT    81.7     90.6       74.4      84.1    90.6      74.4

Note: the cascade does worse than ISO.
122. Schema Matching Results

                   Pair F1  Pair Prec  Pair Rec  MUC F1  MUC Prec  MUC Rec
No canon  ISO      50.9     40.9       67.5      69.2    81.8      60.0
No canon  CASC     50.9     40.9       67.5      69.2    81.8      60.0
No canon  JOINT    68.9     100        52.5      69.6    100       53.3
Canon     ISO      50.9     40.9       67.5      69.2    81.8      60.0
Canon     CASC     52.3     41.8       70.0      74.1    83.3      66.7
Canon     JOINT    71.0     100        55.0      75.0    100       60.0

Note: the cascade is not as harmful here.
123. Summary and Future Work
- Towards a grand unified data integration model: integrate other info-integration tasks; apply to other families of problems.
- Joint modeling drastically improves results for related tasks.
- Future: try other learning/inference algorithms; more sophisticated canonicalization models; additional datasets.
124. Related Work (Joint Inference)
- Composition of Conditional Random Fields for Transfer Learning [Sutton 05]: various named-entity recognition tasks.
- Joint Inference in Information Extraction [Poon and Domingos 07]: coreference + segmentation.
126. Data
- 270 Wikipedia articles; 1000 paragraphs; 4700 relations; 52 relation types (JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, ...)
- Targeted for density of relations: the Bush/Kennedy/Manning/Coppola families and friends.
128. George W. Bush: his father is George H. W. Bush; his cousin is John Prescott Ellis. George H. W. Bush: his sister is Nancy Ellis Bush. Nancy Ellis Bush: her son is John Prescott Ellis.
So: Cousin = Father's Sister's Son.
129. "John Kerry ... celebrated with Stuart Forbes ...": likely a cousin.
130. Iterative DB Construction
"Joseph P. Kennedy, Sr. ... son John F. Kennedy ... with Rose Fitzgerald ..."

Name               Son
Joseph P. Kennedy  John F. Kennedy
Rose Fitzgerald    John F. Kennedy (0.3)
131. Results

        ME     CRF    RCRF   RCRF .9  RCRF .5  RCRF Truth  RCRF Truth .5
F1      .5489  .5995  .6100  .6008    .6136    .6791       .6363
Prec    .6475  .7019  .6799  .7177    .7095    .7553       .7343
Recall  .4763  .5232  .5531  .5166    .5406    .6169       .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features.
132. Examples of Discovered Relational Features
- Mother: Father → Wife
- Cousin: Mother → Husband → Nephew
- Friend: Education → Student
- Education: Father → Education
- Boss: Boss → Son
- MemberOf: Grandfather → MemberOf
- Competition: PoliticalParty → Member → Competition
135. Outline
- The need for IE and Data Mining; motivate joint inference
- Brief introduction to Conditional Random Fields
- Joint inference: information extraction examples
  - Joint Labeling of Cascaded Sequences (Belief Propagation)
  - Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Probability + First-order Logic, Co-ref on Entities (MCMC)
  - Joint Information Integration (MCMC + Sample Rank)
- Demo: Rexa, a Web portal for researchers
136. Data Mining the Research Literature
- Better understand the structure of our own research area.
- Structure helps us learn a new field.
- Aid collaboration.
- Map how ideas travel through social networks of researchers.
- Aids for hiring and finding reviewers!
- Measure the impact of papers or people.
137. Traditional Bibliometrics
- Analyzes a small amount of data (e.g. 19 articles from a single issue of a journal).
- Uses the journal as a proxy for research topic (but there is no journal for information extraction).
- Uses impact measures almost exclusively based on simple citation counts.
How can we use topic models to create new, interesting impact measures? Can we create a social network of scientific sub-fields?
138. Our Data
- Over 1.6 million research papers, gathered as part of the Rexa.info portal.
- Cross-linked references / citations.
139. Previous Systems
[Figure: screenshots of existing research-paper search portals]
141. Previous Systems
[Figure: a single entity type (research paper) and a single relation (cites)]
142. More Entities and Relations
[Figure: entities include research papers, people, grants, universities, venues, and groups; relations include cites and expertise]
156. Topical Transfer
Citation counts from one topic to another: map producers and consumers.
157. Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
- Topical citation counts
- Topical impact factors
- Topical longevity
- Topical precedence
- Topical diversity
- Topical transfer
158. Topical Transfer
Transfer from Digital Libraries to other topics:

Other topic      Cits  Paper title
Web Pages        31    Trawling the Web for Emerging Cyber-Communities (Kumar, Raghavan, ..., 1999)
Computer Vision  14    On Being Undigital with Digital Cameras: Extending the Dynamic ...
Video            12    Lessons Learned from the Creation and Deployment of a Terabyte Digital Video Library
Graphs           12    Trawling the Web for Emerging Cyber-Communities
Web Pages        11    WebBase: A Repository of Web Pages
159. Topical Diversity
Papers that had the most influence across many other fields...
160. Topical Diversity
Entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: example papers with high diversity vs. low diversity]
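Topical diversity as described here is just the entropy of the citing papers' topic distribution; a minimal sketch (the topic assignments are illustrative):

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy (in nats) of the topic distribution among citing papers."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

print(topical_diversity(["nlp", "vision", "nlp", "db"]))   # higher: spread across fields
print(topical_diversity(["nlp", "nlp", "nlp", "nlp"]))     # 0.0: low diversity
```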
161. Summary
- Joint inference is needed to avoid cascading errors in information extraction and data mining; this is among the most fundamental problems in NLP, data mining, ...
- It can be performed in CRFs:
  - cascaded sequences (Factorial CRFs)
  - distant correlations (Skip-chain CRFs)
  - co-reference (affinity-matrix CRFs)
  - logic + probability (made efficient by MCMC + Sample Rank)
  - information integration
- Rexa: a new research-paper search engine, mining the interactions in our community.
164. Outline
- Model / feature engineering
  - Brief review of IE with Conditional Random Fields
  - Flexibility to use non-independent features
- Inference
  - Entity resolution with probability + first-order logic
  - Resolution + canonicalization + schema mapping
  - Inference by Metropolis-Hastings
- Parameter estimation
  - Semi-supervised learning with label regularization
  - ...with feature labeling
  - Generalized Expectation criteria