Information Extraction, Data Mining & Joint Inference - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Information Extraction, Data Mining & Joint Inference

1
Information Extraction, Data Mining & Joint Inference
  • Andrew McCallum
  • Computer Science Department
  • University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta,
Khashayar Rohanemanesh, Ben Wellner, Karl
Schultz, Michael Hay, Michael Wick, David Mimno.
2
My Research
Building models that mine actionable knowledge from unstructured text.
3
From Text to Actionable Knowledge
(Pipeline figure: Document collection -> Spider -> Filter -> IE [Segment, Classify, Associate, Cluster] -> Database -> Data Mining [Discover patterns: entity types, links/relations, events] -> Actionable knowledge [Prediction, Outlier detection, Decision support].)
4
A Natural Language Processing Pipeline
Pragmatics / Anaphora Resolution / Semantic Role Labeling / Entity Recognition / Parsing / Chunking / POS tagging
5
Unified Natural Language Processing
Pragmatics / Anaphora Resolution / Semantic Role Labeling / Entity Recognition / Parsing / Chunking / POS tagging
Unified, joint inference.
6
Problem
  • Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
  • DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
  • IE is unaware of emerging patterns and regularities in the DB.
  • The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

7
Solution
(The same pipeline figure, with IE now passing Uncertainty Info forward to Data Mining, and Data Mining feeding Emerging Patterns back to IE.)
8
Solution
(The same pipeline figure, with IE and Data Mining merged into a single unified Probabilistic Model between the document collection and the actionable knowledge.)
9
Scientific Questions
  • What model structures will capture salient
    dependencies?
  • Will joint inference actually improve accuracy?
  • How to do inference in these large graphical
    models?
  • How to do parameter estimation efficiently in
    these models, which are built from multiple
    large components?
  • How to do structure discovery in these models?

10
Scientific Questions
  • What model structures will capture salient
    dependencies?
  • Will joint inference actually improve accuracy?
  • How to do inference in these large graphical
    models?
  • How to do parameter estimation efficiently in
    these models, which are built from multiple
    large components?
  • How to do structure discovery in these models?

11
Methods of Inference
  • Exact
    • Exhaustively explore all interpretations
    • Graphical model has low tree-width
  • Variational
    • Represent the distribution in a simpler model that is close
  • Monte-Carlo
    • Randomly (but cleverly) sample to explore interpretations

12
Outline
  • Examples of IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Segmentation and Co-ref (Sparse BP)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Semi-supervised Learning
  • Demo: Rexa, a Web portal for researchers

13
Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
(Figure: finite state model / graphical model: states S_{t-1}, S_t, S_{t+1} linked by transitions, each emitting an observation O_{t-1}, O_t, O_{t+1}.)
Generates: state sequence and observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
14
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Yoav Freund spoke this example sentence.
and a trained HMM
person name
location name
background
Find the most likely state sequence (Viterbi):
Yesterday Tony Jebara spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name = Tony Jebara
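For concreteness, here is a minimal sketch of the Viterbi decoding step just described, assuming a trained HMM given as log-probability tables (NumPy arrays); the interface and state names are illustrative, not tied to any particular system.

```python
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence under an HMM.

    log_start[s], log_trans[s, s'], log_emit[s, o] are log-probabilities.
    """
    T, S = len(obs), len(states)
    delta = np.full((T, S), -np.inf)    # best log-prob of a path ending in s at t
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # trace backpointers to recover the path
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]
```

Words aligned with the designated "person name" state are then extracted as a person name.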
15
We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words, e.g.: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates".
(Figure: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, with example features "is Wisniewski", "part of noun phrase", "ends in -ski" attached to one observation.)
16
Problems with Richer Representation and a Joint Model
  • These arbitrary features are not independent.
  • Multiple levels of granularity (chars, words, phrases)
  • Multiple dependent modalities (words, formatting, layout)
  • Past & future
  • Two choices:
Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
(Figure: two HMM variants over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: one ignoring, one modeling the dependencies among observation features.)
17
Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o).
  • Can examine features, but not responsible for generating them.
  • Don't have to explicitly model their dependencies.
  • Don't waste modeling effort trying to generate what we are given at test time anyway.
18
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
(Figure: the joint model P(s,o), an HMM over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}, contrasted with the conditional model P(s|o) over the same chain.)
where

  P(s|o) = (1/Z(o)) ∏_t exp( Σ_k λ_k f_k(s_{t-1}, s_t, o, t) )

(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on ∇L.
19
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence).
(Figure: finite state model / graphical model: FSM states form the output sequence y_{t-1} ... y_{t+3}, labeled OTHER PERSON OTHER ORG TITLE; observations x_{t-1} ... x_{t+3} form the input sequence "said Jones a Microsoft VP".)
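As a sketch of what "maximize conditional probability of output given input" means computationally, the snippet below scores a tag sequence under a linear-chain CRF; the unary and transition score matrices are assumed to be precomputed weighted feature sums (a simplification of the full feature machinery, not the paper's code).

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(y, unary, trans):
    """log P(y|x) for a linear-chain CRF.

    unary[t, s]: weighted feature sum for tag s at position t (depends on x)
    trans[i, j]: weighted feature sum for the tag transition i -> j
    """
    T, _ = unary.shape
    # Unnormalized score of the candidate tag sequence.
    score = unary[0, y[0]] + sum(
        trans[y[t - 1], y[t]] + unary[t, y[t]] for t in range(1, T))
    # log Z(x) via the forward algorithm over all tag sequences.
    alpha = unary[0].copy()
    for t in range(1, T):
        alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
    return score - logsumexp(alpha)
```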
20
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995, at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk, with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95

Year   Milk Cows 1/   Per Milk Cow          Pct. of Fat in      Total
       (1,000 Head)   Milk       Milkfat    All Milk Produced   Milk        Milkfat
                      --- Pounds ---        (Percent)           (Million Pounds)
1993   9,589          15,704     575        3.66                150,582     5,514.4
1994   9,500          16,175     592        3.66                153,664     5,623.7
1995   9,461          16,451     602        3.66                155,644     5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

21
Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003 SIGIR
100 documents from www.fedstats.gov
Labels
CRF
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ... (12 in all)

(The same government report text as on the previous slide, here labeled line-by-line by the CRF.)
Features
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5 consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, at time
    offsets (0,0), (-1,0), (0,1), (1,2).
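A sketch of how such line-level features might be computed; the feature names and the alignment heuristic are illustrative assumptions, not the exact features of the paper.

```python
import re

def line_features(line, prev_line=""):
    """A hypothetical subset of the per-line features listed above."""
    n = max(len(line), 1)
    feats = {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "has_5_spaces": "     " in line,
    }
    # Crude check: do runs of whitespace line up with the previous line?
    gaps = lambda s: {m.start() for m in re.finditer(r"\s{2,}", s)}
    feats["aligns_with_prev"] = bool(gaps(line) & gaps(prev_line))
    return feats
```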

22
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method             Line labels (% correct)   Table segments (F1)
HMM                65                        64
Stateless MaxEnt   85                        -
CRF                95                        92
23
IE from Research Papers
McCallum et al 99
24
IE from Research Papers
Field-level F1:
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
Error reduction ≈ 40%
25
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels   Examples
PER      Yayuk Basuki, Innocent Butare
ORG      3M, KDP, Cleveland
LOC      Cleveland, Nirmal Hriday, The Oval
MISC     Java, Basque, 1,000 Lakes Rally
26
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)
27
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                          F1
HMMs (BBN's Identifinder)       73
CRFs w/out Feature Induction    83
CRFs with Feature Induction     90
  (based on Likelihood Gain)
28
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

29
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
30
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
31
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade--must be perfect at every
stage to do well.
32
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and noun-phrase boundaries in newswire, matching accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation
33
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

34
2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004

Senator Joe Green said today .
Green ran for
Dependency among similar, distant mentions
ignored.
35
2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004

Senator Joe Green said today .
Green ran for
14% reduction in error on the most-repeated field in email seminar announcements.
Inference: Tree-reparameterized BP
[See also Finkel et al, 2005; Wainwright et al, 2002]
36
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

37
3. Joint co-reference among all pairs: Affinity Matrix CRF
Entity resolution / Object correspondence
(Figure: mentions Mr. Hill, Dana Hill, Amy Hall, she, Dana connected by pairwise affinity edges.)
25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering / graph partitioning
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Bansal, Blum, Chawla, 2002]
38
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
1. Secretary of State Colin Powell; he; Mr. Powell; Powell
2. Condoleezza Rice; she; Rice
3. President Bush; Bush
39
Inside the Traditional Solution
Pair-wise Affinity Metric: Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . ." : coreferent Y/N?

Y/N   Feature                                            Weight
N     Two words in common                                29
Y     One word in common                                 13
Y     "Normalized" mentions are string identical         39
Y     Capitalized word in common                         17
Y     > 50% character tri-gram overlap                   19
N     < 25% character tri-gram overlap                   -34
Y     In same sentence                                   9
Y     Within two sentences                               8
N     Further than 3 sentences apart                     -1
Y     "Hobbs Distance" < 3                               11
N     Number of entities in between two mentions = 0     12
N     Number of entities in between two mentions > 4     -3
Y     Font matches                                       1
Y     Default                                            -19
OVERALL SCORE = 98 > threshold = 0
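In code, the metric above reduces to summing the weights of whichever features fire for a mention pair and thresholding; the feature names and weights below are a hypothetical subset mirroring the table.

```python
def affinity_score(pair_feats, weights, threshold=0.0):
    """Sum weights of the firing features; coreferent iff the sum clears 0."""
    score = weights["default"] + sum(
        w for name, w in weights.items() if pair_feats.get(name))
    return score, score > threshold

weights = {"one_word_in_common": 13, "normalized_identical": 39,
           "capitalized_word_in_common": 17, "in_same_sentence": 9,
           "within_two_sentences": 8, "default": -19}
```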
40
Entity Resolution
(Figure: five mentions: Mr. Hill, Dana Hill, Amy Hall, she, Dana.)
41
Entity Resolution
(Figure: the same mentions grouped into candidate entities, one possible partition.)
42
Entity Resolution
(Figure: a second possible partition of the mentions into entities.)
43
Entity Resolution
(Figure: a third possible partition of the mentions into entities.)
44
The Problem
Independent pairwise affinity with connected components:
  • Affinity measures are noisy and imperfect.
  • Pair-wise merging decisions are being made independently from each other.
  • They should be made jointly.
(Figure: the mention graph over Mr. Hill, Dana Hill, Amy Hall, she, Dana with independently chosen C/N edge decisions.)
45
CRF for Co-reference
[McCallum & Wellner, 2003, ICML]
(Figure: fully connected mention graph over Mr. Hill, Dana Hill, Amy Hall, she, Dana with jointly assigned C/N edge labels.)
Make pair-wise merging decisions jointly by:
  - calculating a joint probability,
  - including all edge weights,
  - adding dependence on consistent triangles.
46
CRF for Co-reference
[McCallum & Wellner, 2003, ICML]
(Figure: the same jointly labeled mention graph.)
47
A Generative Model Solution
[Russell 2001; Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)
(Figure: generative plate model with N entities, each with id, words, context; mentions with id, surname, distance, fonts, age, gender, ...)
2) Number of entities is hard-coded into the model structure, but we are supposed to predict num entities! Thus we must modify model structure during inference---MCMC.
48
CRF for Co-reference
(Figure: the mention graph with weighted C/N decisions, e.g. +(23), -(-55), -(-44), -(-23), +(11), +(17), -(-9), +(10), -(-22), +(4); configuration score = 218.)
49
CRF for Co-reference
(Figure: the same graph with the she/Dana edge flipped to N; configuration score = 210.)
50
CRF for Co-reference
(Figure: the same graph with the C/N decisions inverted; configuration score = -12.)
51
Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
Correlational Clustering [Bansal & Blum; Demaine]
(Figure: the weighted mention graph; MAP inference corresponds to partitioning it.)
52
Pairwise Affinity is not Enough
(Figure: the weighted mention graph again; pairwise scores alone cannot resolve it.)
53
Pairwise Affinity is not Enough
(Figure: the mention graph with only the C/N decisions shown.)
54
Pairwise Affinity is not Enough
(Figure: a graph over four "she" mentions plus Amy Hall; pairwise affinities among pronouns are uninformative.)
55
Pairwise Comparisons Not Enough: Examples
  • ∀ mentions are pronouns?
  • Entities have multiple attributes (name, email, institution, location); need to measure compatibility among them.
  • Having 2 given names is common, but not 4.
  • e.g. Howard M. Dean / Martin, Dean / Howard Martin
  • Need to measure size of the clusters of mentions.
  • ∃ a pair of last-name strings that differ > 5?
  • We need to ask ∀, ∃ questions about a set of mentions.
  • We want first-order logic!

56
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

57
Pairwise Affinity is not Enough
(Figure: the graph over four "she" mentions and Amy Hall again, with pairwise C/N decisions.)
58
Partition Affinity CRF
(Figure: the mentions grouped into a candidate partition, scored as a whole.)
Ask arbitrary questions about all entities in a partition with first-order logic...
... bringing together LOGIC and PROBABILITY
59
Partition Affinity CRF
(Figure: a second candidate partition, scored as a whole.)
60
Partition Affinity CRF
(Figure: a third candidate partition.)
61
Partition Affinity CRF
(Figure: a fourth candidate partition.)
62
Partition Affinity CRF
(Figure: a fifth candidate partition.)
63
This space complexity is common in probabilistic
first-order logic models
64
Markov Logic: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005; Paskin & Russell 2002; Taskar et al 2003]
(Figure: first-order clauses and the ground Markov network they induce.)
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
65
How can we perform inference and learning in
models that cannot be grounded?
66
Inference in Weighted First-Order Logic: SAT Solvers
  • Weighted SAT solvers [Kautz et al 1997]
  • Requires complete grounding of the network
  • LazySAT [Singla & Domingos 2006]
  • Saves memory by only storing clauses that may become unsatisfied
  • Initialization still requires time O(n^r) to visit all ground clauses
67
Inference in Weighted First-Order Logic: MCMC
  • Gibbs Sampling
  • Difficult to move between high-probability configurations by changing single variables
  • Although, consider MC-SAT [Poon & Domingos 06]
  • An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  • Two parts: proposal distribution, acceptance distribution
  • Can be extended to partial configurations
  • Only instantiate relevant variables
  • Successfully used in BLOG models [Milch et al 2005]
  • Key advantage: can design arbitrary smart jumps
68
Don't represent all alternatives...
69
Don't represent all alternatives... just one at a time
Proposal Distribution
Stochastic Jump
70
Model
First-order features
(Figure: the mentions Amy Hall, Dana Hill, she, Dana, she, Amy, she in a candidate partition, scored by first-order feature functions: f_w over SamePerson(x, ...) within a cluster, f_b over DifferentPerson(x, x') across clusters.)
71
Proposal Distribution
y -> y'
(Figure: a jump from configuration y, containing the cluster {Dean Martin, Howie Martin, Howard Martin, Howie Martin}, to y', which splits it into {Dean Martin, Howie Martin} and {Howard Martin, Dino}.)
72
Proposal Distribution
(Figure: the reverse jump, y' -> y, merging the two clusters back together.)
73
Proposal Distribution
(Figure: another candidate jump between configurations.)
74
Metropolis-Hastings: Jump acceptance probability

  α(y -> y') = min( 1, [p(y')/p(y)] · [q(y' -> y) / q(y -> y')] )

  • p(y')/p(y): likelihood ratio
  • Ratio of P(Y|X): the partition function Z_X cancels!
  • q(y -> y'): proposal distribution
  • probability of proposing move y -> y'
  • the ratio makes up for any biases in the proposal distribution
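A minimal sketch of one such Metropolis-Hastings step over coreference configurations; log_score and propose are assumed callbacks (hypothetical names), and only score ratios appear, so the partition function is never computed.

```python
import math, random

def mh_step(y, log_score, propose):
    """One Metropolis-Hastings move.

    log_score(y): unnormalized log model score of configuration y.
    propose(y):   returns (y_new, log q(y_new|y), log q(y|y_new)).
    """
    y_new, log_q_fwd, log_q_rev = propose(y)
    log_alpha = (log_score(y_new) - log_score(y)) + (log_q_rev - log_q_fwd)
    if math.log(random.random()) < min(0.0, log_alpha):
        return y_new   # accept the jump
    return y           # reject; keep the current configuration
```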

75
Learning in Probabilistic First-order Logic: Parameter estimation (weight learning)
  • Input
  • First-order formulae: ∀x S(x) ⇒ T(x)
  • Labeled data: constants {a, b, c}; facts S(a), T(a), S(b), T(b), S(c)
  • Output
  • Weights for each formula: ∀x S(x) ⇒ T(x) : 0.67

∀x,y Coreferent(x,y) ⇒ Pronoun(x)
∀x,y Coreferent(x,y) ⇒ Pronoun(x) : -2.3
76
Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the
better configuration higher.
77
Parameter Estimation in Large State Spaces
  • Most methods require calculating the gradient of the log-likelihood P(y1, y2, y3, ... | x1, x2, x3, ...)...
  • ...which in turn requires expectations of marginals, e.g. P(y1 | x1, x2, x3, ...)
  • But getting marginal distributions by sampling can be inefficient due to the large sample space.
  • Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
  • But even finding the model's predicted best output is expensive.
  • We propose Sample Rank [Culotta, Wick, Hall, McCallum, HLT 2007]: learn to rank intermediate solutions, P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)

78
Ranking vs Classification Training
  • Instead of training: {Powell, Mr. Powell, he} --> YES; {Powell, Mr. Powell, she} --> NO
  • ...Rather: {Powell, Mr. Powell, he} > {Powell, Mr. Powell, she}
  • In general, the higher-ranked example may contain errors:

{Powell, Mr. Powell, George, he} > {Powell, Mr. Powell, George, she}
79
Ranking Parameter Update
In our experiments, we use a large-margin update based on MIRA [Crammer & Singer 2003]:

  W_{t+1} = argmin_W ||W_t - W||
  s.t. Score(Q'', W) - Score(Q', W) >= 1
80
Error-driven Training
  • Input
  • Observed data X                       // Input mentions
  • True labeling P                       // True clustering
  • Prediction algorithm A                // Clustering algorithm
  • Initial weights W, prediction Q       // Initial clustering
  • Iterate until convergence
  • Q' <- A(Q, W, O)                      // Merge clusters
  • If Q' introduces an error:
  • UpdateWeights(Q', Q, P, O, W)
  • Else Q <- Q'

81
UpdateWeights(Q', Q, P, O, W): Learning to Rank Pairs of Predictions
  • Using truth P, generate a new Q'' that is a better modification of Q than Q'.
  • Update W s.t. Q'' <- A(Q, W, O)
  • Update parameters so Q'' is ranked higher than Q'

82
Ranking Intermediate Solutions: Example
(Figure: a chain of intermediate configurations 1 -> 2 -> 3 -> 4 -> 5; each transition is scored by the model and by the truth.)
  Δ Model = +3,   Δ Truth = +0.3
  Δ Model = -23,  Δ Truth = -0.2
  Δ Model = +10,  Δ Truth = -0.1
  Δ Model = -10,  Δ Truth = -0.1
  => UPDATE
  • Like Perceptron: proof of convergence under Marginal Separability
  • More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!

83
Sample Rank Algorithm
  • 1. Proposer: generates a neighboring configuration from the current one
  • 2. Performance Metric: scores a configuration against the truth
  • 3. Inputs: input sequence x and an initial (random) configuration y^0
  • 4. Initialization: set the parameter vector w = 0
  • 5. Output: parameters w
  • 6. Score function: w · Φ(x, y)
  • 7. For t = 1,...,T and i = 0,...,n-1 do
  •    Generate a training instance: propose y^{i+1} from y^i
  •    Let y+ and y- be the best and worst configurations among y^i and y^{i+1} according to the performance metric.
  •    If w · Φ(x, y+) <= w · Φ(x, y-): update w so that y+ is ranked above y-
  •    end if
  • 8. end for
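A runnable sketch of this loop, under stated assumptions: propose, phi, and metric are user-supplied callbacks, and a plain perceptron step stands in for the MIRA-style large-margin update used in the experiments.

```python
import numpy as np

def sample_rank(x, y0, propose, phi, metric, w, T=1000, eta=0.1):
    """SampleRank training sketch.

    phi(x, y) -> feature vector; metric(y) -> performance vs. the truth;
    propose(y) -> a neighboring configuration.
    """
    y = y0
    for _ in range(T):
        y_new = propose(y)
        better, worse = (y_new, y) if metric(y_new) > metric(y) else (y, y_new)
        d = phi(x, better) - phi(x, worse)
        if w @ d <= 0:      # the model ranks the worse configuration higher
            w += eta * d    # nudge weights so the better one scores higher
        y = y_new           # the chain walks over proposals regardless
    return w
```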

84
Marginal Separability
  • For an input sequence x, a feature vector Φ(x, y), and a parameter vector w, define the score w · Φ(x, y) and ΔΦ(y+, y-) = Φ(x, y+) - Φ(x, y-).
  • A training set is called separable with margin δ > 0 if there exists some vector u with ||u|| = 1 such that u · ΔΦ(y+, y-) >= δ for every ranked pair.

85
Convergence Results
  • Theorem 1: For any training set that is separable with margin δ, the number of times the algorithm makes a ranking error is bounded.

86
Example 1
(Figure: configuration space with an initial configuration, a goal configuration, and a greedy move. One next state has higher F1 but lower utility for leading to the goal configuration; another has lower F1 but higher utility.)
Greedy next-state selection will pick the higher-F1 configuration and may get stuck in a locally optimal solution.
87
Example 2
CLAIM: Singletons are good states along paths to the goal.
Why? A small number of merges can reach the goal. But F1 is the harmonic mean of P and R, and singletons have almost zero R.
(Figure: a medium-F1 but low-utility configuration vs. a low-F1 but high-utility singleton configuration, 8 and 13 moves from the goal.)
88
It is not clear what choice of performance metric is more intuitive for intermediate configurations.
(Figure: 1. an initial configuration and 2. a proposed configuration, two clusterings of the mentions Obama, Biden, Sen. Obama, Palin, He, She, U.S., The US, America.)
How important is it that our model ranks these?
89
Delayed Feedback Problem
(Figure: configuration space, from an initial configuration to a goal configuration.)
  • Model the problem as a reinforcement-learning problem, to address the delayed feedback problem and noise in the performance metric.
  • Define a cost function as the temporal difference of the performance metric between any two configurations: TD-error = F(y_{t+1}) - F(y_t)
  • During training, use the Sample Rank algorithm for learning the cost function.
  • During test, use the approximate cost function learned during training in a reinforcement-learning setting of the problem.

90
Weighted Logics Techniques: Overview
  • Metropolis-Hastings (MH) for inference
  • Freely bake in domain knowledge about fruitful jumps; MH safely takes care of its biases.
  • Avoid the memory and time consumption of massive deterministic constraint factors: build jump functions that simply avoid illegal states.
  • Sample Rank
  • Don't train by likelihood of the completely correct solution...
  • ...train to properly rank intermediate configurations: the partition function (normalizer) cancels! ...plus other efficiencies

91
Feature List
  • Exact Match/Mis-Match
  • Entity type
  • Gender (requires lexicon)
  • Number
  • Case
  • Entity Text
  • Entity Head
  • Entity Modifier/Numerical Modifier
  • Sentence
  • WordNet hypernym, synonym, antonym
  • Other
  • Relative pronoun agreement
  • Sentence distance in bins
  • Partial text overlaps
  • Quantification
  • Existential: ∃ a gender mismatch? ∃ three different first names?
  • Universal: ∀ NER type match? ∀ named mentions string identical?
  • Filters (limit quantifiers to mention type)
  • None
  • Pronoun
  • Nominal (description)
  • Proper (name)

92
Partition Affinity CRF Experiments
[Culotta, Wick, Hall, McCallum, 2007]

                     Likelihood-based Training   Rank-based Training
Partition Affinity   69.2                        79.3
Pairwise Affinity    62.4                        72.5

(Better representation: rows; better training: columns. New state of the art.)
B-Cubed F1 Score on ACE 2004 Noun Coreference.
To our knowledge, the best previously reported results: 65 (1997), 67 (2002), 68 (2005).
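For reference, a small sketch of how the B-Cubed F1 reported here is computed; pred and gold are assumed to map every mention to a cluster id (a hypothetical interface).

```python
from collections import defaultdict

def b_cubed_f1(pred, gold):
    """B-Cubed F1 over mentions; pred/gold: mention -> cluster id."""
    def members(assign):
        by_cluster = defaultdict(set)
        for m, cid in assign.items():
            by_cluster[cid].add(m)
        return by_cluster
    p_c, g_c = members(pred), members(gold)
    precision = recall = 0.0
    for m in gold:
        overlap = len(p_c[pred[m]] & g_c[gold[m]])
        precision += overlap / len(p_c[pred[m]])
        recall += overlap / len(g_c[gold[m]])
    precision, recall = precision / len(gold), recall / len(gold)
    return 2 * precision * recall / (precision + recall)
```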
93
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

94
Information Integration
Schema A
First Name   Last Name   Contact
J.           Smith       222-444-1337
J.           Smith       444 1337
John         Smith       (1) 4321115555

Schema B
Name            Phone
John Smith      U.S. 222-444-1337
John D. Smith   444 1337
J Smiht         432-111-5555

Schema Matching
Schema A: First Name, Last Name; Contact
Schema B: Name; Phone

Coreference
The mentions J. Smith, J. Smith, John Smith, John Smith, John D. Smith, J Smiht are grouped into two entities (John 1, John 2).

Normalized DB
Entity   Name            Phone
523      John Smith      222-444-1337
524      John D. Smith   432-111-5555

95
Data Integration
Schema 13
First   Last Name   Contact
J.      Smith       222-444-1337
J.      Smith       444 1337
John    Smith       (1) 4321115555

Schema 25
Name            Phone
John Smith      U.S. 222-444-1337
John D. Smith   444 1337
J Smiht         432-111-5555

Schema Matching + Coref + Canonicalization

Result: Normalized DB
Entity   Name            Phone
523      John Smith      222-444-1337
524      John D. Smith   432-111-5555

96
Information IntegrationA Family of Related Tasks
  • GOAL
  • Combine multiple heterogeneous sources into a
    single repository.
  • Requires
  • Concept-mapping across different representations
    (schema matching)
  • Data deduplication across different repositories
    (coreference)
  • Selecting a representation to store in the
    resulting DB (canonicalization / normalization)

97
Information Integration Steps
(Figure: 1. Schema Matching aligns {First Name, Last Name} with Name and Contact with Phone; 2. Coreference links records such as J. Smith / John Smith and Amanda / A. Jones; 3. Canonicalization selects a single representative value for each entity, e.g. John Smith.)
98
Problems with a Pipeline
  • 1. Data integration tasks are highly correlated
  • 2. Errors can propagate

99
Schema Matching First
1. Schema Matching provides evidence for 2. Coreference
(Figure: with {First Name, Last Name} matched to Name and Contact matched to Phone, the J. Smith / John Smith and Amanda / A. Jones records can be compared field-by-field.)
NEW FEATURES:
1. String identical: First Name + Last Name = Name
2. Same area-code 3-gram in Phone/Contact
3. ...
100
Coreference First
1. Coreference provides evidence for 2. Schema Matching
(Figure: coreference links among the J. Smith / John Smith and Amanda / A. Jones records suggest which fields align.)
NEW FEATURES:
1. Field values similar across coref'd records
2. Phone = Contact takes the same value for the J. Smith mentions
3. ...
101
Problems with a Pipeline
  • 1. Data integration tasks are highly correlated
  • 2. Errors can propagate

102
Hazards of a Pipeline
1. Schema Matching

Table A
Name           Corporation
Amanda Jones   J. Smith & Sons
J. Smith       IBM

Table B
Full Name      Company Name
Amanda Jones   Smith & Sons
John Smith     IBM

(Candidate alignments: Name / Full Name, Corporation / Company Name, Phone / Contact.)
2. Coreferent?
ERRORS PROPAGATE
103
Canonicalization
Entity 87: {John Smith, J. Smith, J. Smith, J. Smiht, J.S Mith, Jonh smith, John}
Coref -> Canonicalization -> John Smith
Typically occurs AFTER coreference.
  • Desiderata:
  • Complete: contains all information (e.g. first + last)
  • Error-free: no typos (e.g. avoid "Smiht")
  • Central: represents all mentions (not "Mith")
Access to such features would be very helpful to Coref.
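One simple way to satisfy the completeness/error-free/centrality desiderata is to pick the medoid mention, i.e. the one minimizing total string distance to the rest; a sketch under that assumption (any string distance will do).

```python
from difflib import SequenceMatcher

def canonicalize(mentions, dist):
    """Return the medoid: the mention closest, in total, to all the others."""
    return min(mentions, key=lambda m: sum(dist(m, o) for o in mentions))

# A crude distance from difflib's similarity ratio:
dist = lambda a, b: 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
canonicalize(["John Smith", "J. Smith", "J. Smiht", "Jonh smith", "John"], dist)
```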
104
Schema Matching
(Figure: factor graph for schema matching, with attribute clusters x4 ... x8, pairwise match variables and factors such as y67 / f67, and per-cluster variables and factors such as y7 / f7.)
  • x6 is a set of attributes: {phone, contact, telephone}
  • x7 is a set of attributes: {last name, last name}
  • f67 is a factor between x6/x7
  • y67 is a binary variable indicating a match (no)
  • f7 is a factor over cluster x7
  • y7 is a binary variable indicating a match (yes)

105
Schema Matching
  • x1 is a set of mentions: {J. Smith, John, John Smith}
  • x2 is a set of mentions: {Amanda, A. Jones}
  • f12 is a factor between x1/x2
  • y12 is a binary variable indicating a match (no)
  • f1 is a factor over cluster x1
  • y1 is a binary variable indicating a match (yes)
  • Entity/attribute factors omitted for clarity

(Figure: the same factor graph, extended with the mention clusters x1 and x2.)
106
Schema Matching
(Figure: the joint factor graph over attribute clusters and mention clusters.)
107
Canonicalization's Role
  • For each attribute in each cluster, there is a
    Canonical value variable
  • Variable takes on one value from all possible
    values in cluster
  • Canonical record is the tuple of all canonical
    attribute variables
  • There is one canonical record per coreference
    cluster
  • Canonical records are used to compute additional
    features

108
Model Summary
  • Conditional random field
  • Two clustering components
  • Clusters of attributes (schema matching)
  • Clusters of mentions (coref)
  • Factors represent affinities between and among
    clusters (sets)
  • Schema matching affinity factors
  • Coreference affinity factors
  • Canonicalization factors
  • Learning/inference is intractable

109
Parameter Estimation
  • Three sets of parameters
  • Schema matching
  • Coreference
  • Canonicalization
  • Labeled data for
  • Schema matching
  • Coreference
  • Canonicalization model is set to default string
    edit parameters

110
Parameter Estimation
Setting Coreference Params: fix schema-matching truth, sample coreference training examples.
Setting Schema Match Params: fix coreference truth, sample schema-matching training examples.

Ground Truth Schema Matching
Schema A                 Schema B
First Name, Last Name    Name
Contact                  Phone

Ground Truth Coreference
Entity 1: J. Smith, J. Smith, John Smith, John D. Smith
Entity 2: A Jones, Amanda Jones
111
Inference (MAP/MPE)
GOAL: Find a configuration that maximizes P(Y|X).
Clustering inference for coreference and schema matching: greedy agglomerative.
Inference for canonicalization: find the attribute-value centroid for each set.
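A sketch of the greedy agglomerative loop used for both clustering sub-tasks; score is an assumed model-derived affinity between two clusters, and the 0.5 stopping threshold echoes the system settings reported below.

```python
def greedy_agglomerative(items, score, stop=0.5):
    """Repeatedly merge the highest-scoring cluster pair until none clears stop."""
    clusters = [{x} for x in items]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), score(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: pair[1])
        if best < stop:
            break
        clusters[i] |= clusters.pop(j)  # merge cluster j into cluster i
    return clusters
```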
112
Joint Inference
(Figure: joint inference alternates across iterations n-1, n, ...: 1. coreference by greedy agglomerative merging, then schema matching, each re-using the other's latest output.)
113
EXPERIMENTS
114
Dataset
  • Faculty and alumni listings from university
    websites, plus an IE system
  • 9 different schemas
  • 1400 mentions, 294 coreferent

115
Example Schemas
DEX IE          Northwestern Fac     UPenn Fac
First Name      Name                 Name
Middle Name     Title                First Name
Last Name       PhD Alma Mater       Last Name
Title           Research Interests   Job/Department
Department      Office Address
Company Name    E-mail
Home Phone
Office Phone
Fax Number
E-mail
116
Experiments
  • Prune down to 200 singletons plus the 294 coreference clusters
  • Create training/testing sets, keeping the DEX schema in both
  • Has the most interesting cases of coreference
  • Sets are still disjoint, since the mapping is defined between two schemas

117
Coreference Features
First order quantifications/aggregations over
  • Cosine distance (unweighted)
  • Cosine distance (TFIDF)
  • Cosine distance between mapped fields
  • Substring match between mapped fields
  • All of the above comparisons, but use canonical
    records

118
Schema Matching Features
First order quantifications/aggregations over
  • String identical
  • Sub string matches
  • TFIDF weighted cosine distance
  • All of the above, but between coreferent mentions only

119
System Settings
  • MaxEnt piecewise training
  • Agglomerative Search for sub-tasks
  • Stopping threshold 0.5
  • Four iterations for joint inference

120
Systems
  • ISO: each task in isolation
  • CASC: Coref -> Schema matching
  • CASC: Schema matching -> Coref
  • JOINT: Coref + Schema matching

Our new work
Each system is evaluated with and without Joint
canonicalization
121
Coreference Results
                  Pair F1   Pair Prec   Pair Recall   MUC F1   MUC Prec   MUC Recall
No Canon ISO      72.7      88.9        61.5          75.0     88.9       64.9
No Canon CASC     64.0      66.7        61.5          65.7     66.7       64.9
No Canon JOINT    76.5      89.7        66.7          78.8     89.7       70.3
Canon ISO         78.3      90.0        69.2          80.6     90.0       73.0
Canon CASC        65.8      67.6        64.1          67.6     67.6       67.6
Canon JOINT       81.7      90.6        74.4          84.1     90.6       74.4
Note: the cascade does worse than ISO.
122
Schema Matching Results
                  Pair F1   Pair Prec   Pair Recall   MUC F1   MUC Prec   MUC Recall
No Canon ISO      50.9      40.9        67.5          69.2     81.8       60.0
No Canon CASC     50.9      40.9        67.5          69.2     81.8       60.0
No Canon JOINT    68.9      100         52.5          69.6     100        53.3
Canon ISO         50.9      40.9        67.5          69.2     81.8       60.0
Canon CASC        52.3      41.8        70.0          74.1     83.3       66.7
Canon JOINT       71.0      100         55.0          75.0     100        60.0
Note: the cascade is not as harmful here.
123
Summary and Future Work
  • Towards a grand unified data integration model
  • Integrate other info integration tasks
  • Apply to other families of problems
  • Joint modeling drastically improves results for
    related tasks
  • Try other learning/inference algorithms
  • More sophisticated canonicalization models
  • Additional datasets

124
Related Work (Joint Inference)
  • Composition of Conditional Random Fields for
    Transfer Learning (Sutton 05)
  • Various named entity recognition tasks
  • Joint Inference in Information Extraction (Poon
    and Domingos 07)
  • Coreference + segmentation

125
(No Transcript)
126
Data
  • 270 Wikipedia articles
  • 1000 paragraphs
  • 4700 relations
  • 52 relation types
  • JobTitle, BirthDay, Friend, Sister, Husband,
    Employer, Cousin, Competition, Education,
  • Targeted for density of relations
  • Bush/Kennedy/Manning/Coppola families and friends

127
(No Transcript)
128
George W. Bush -> his father -> George H. W. Bush -> his cousin -> John Prescott Ellis
George H. W. Bush -> his sister -> Nancy Ellis Bush -> her son -> John Prescott Ellis
Cousin = Father's Sister's Son
129
John Kerry celebrated with Stuart Forbes: likely a cousin.
130
Iterative DB Construction
  • "Joseph P. Kennedy, Sr ... son John F. Kennedy ... with Rose Fitzgerald"

Name                Son
Joseph P. Kennedy   John F. Kennedy
Rose Fitzgerald     John F. Kennedy
(0.3)
131
Results
         ME      CRF     RCRF    RCRF .9   RCRF .5   RCRF Truth   RCRF Truth .5
F1       .5489   .5995   .6100   .6008     .6136     .6791        .6363
Prec     .6475   .7019   .6799   .7177     .7095     .7553        .7343
Recall   .4763   .5232   .5531   .5166     .5406     .6169        .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features
132
Examples of Discovered Relational Features
  • Mother: Father -> Wife
  • Cousin: Mother -> Husband -> Nephew
  • Friend: Education -> Student
  • Education: Father -> Education
  • Boss: Boss -> Son
  • MemberOf: Grandfather -> MemberOf
  • Competition: PoliticalParty -> Member -> Competition

133
(No Transcript)
134
(No Transcript)
135
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

136
Data Mining Research Literature
  • Better understand structure of our own research
    area.
  • Structure helps us learn a new field.
  • Aid collaboration
  • Map how ideas travel through social networks of
    researchers.
  • Aids for hiring and finding reviewers!
  • Measure impact of papers or people.

137
Traditional Bibliometrics
  • Analyses a small amount of data(e.g. 19 articles
    from a single issue of a journal)
  • Uses journal as a proxy for research
    topic(but there is no journal for information
    extraction)
  • Uses impact measures almost exclusively based on
    simple citation counts.

How can we use topic models to create new, interesting impact measures? Can we create a social network of scientific sub-fields?
138
Our Data
  • Over 1.6 million research papers, gathered as
    part of Rexa.info portal.
  • Cross linked references / citations.

139
Previous Systems
140
(No Transcript)
141
Previous Systems
Cites
Research Paper
142
More Entities and Relations
Expertise
Cites
Research Paper
Person
Grant
University
Venue
Groups
143
(No Transcript)
144
(No Transcript)
145
(No Transcript)
146
(No Transcript)
147
(No Transcript)
148
(No Transcript)
149
(No Transcript)
150
(No Transcript)
151
(No Transcript)
152
(No Transcript)
153
(No Transcript)
154
(No Transcript)
155
(No Transcript)
156
Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
157
Topical Bibliometric Impact Measures
Mann, Mimno, McCallum, 2006
  • Topical Citation Counts
  • Topical Impact Factors
  • Topical Longevity
  • Topical Precedence
  • Topical Diversity
  • Topical Transfer

158
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cits Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being Undigital with digital cameras extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase a repository of Web pages
159
Topical Diversity
Papers that had the most influence across many
other fields...
160
Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
(Figure: example papers with high vs. low topical diversity.)
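As a sketch, this diversity measure is just the entropy of the citing papers' topic distribution; citing_topics is an assumed list of the most likely topic of each citing paper.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy of the topic distribution among papers citing a given paper."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```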
161
Summary
  • Joint inference is needed to avoid cascading errors in information extraction and data mining.
  • Most fundamental problem in NLP, data mining, ...
  • Can be performed in CRFs:
  • Cascaded sequences (Factorial CRFs)
  • Distant correlations (Skip-chain CRFs)
  • Co-reference (Affinity-matrix CRFs)
  • Logic + Probability (made efficient by MCMC + Sample Rank)
  • Information Integration
  • Rexa: a new research-paper search engine, mining the interactions in our community.

162
(No Transcript)
163
(No Transcript)
164
Outline
  • Model / Feature Engineering
  • Brief review of IE w/ Conditional Random Fields
  • Flexibility to use non-independent features
  • Inference
  • Entity Resolution with Probability + First-order Logic
  • Resolution + Canonicalization + Schema Mapping
  • Inference by Metropolis-Hastings
  • Parameter Estimation
  • Semi-supervised Learning with Label Regularization
  • ...with Feature Labeling
  • Generalized Expectation criteria