Information Extraction, Data Mining & Joint Inference - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Information Extraction, Data Mining & Joint Inference

1
Information Extraction, Data Mining & Joint Inference
  • Andrew McCallum
  • Computer Science Department
  • University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta,
Khashayar Rohanemanesh, Ben Wellner, Karl
Schultz, Michael Hay, Michael Wick, David Mimno.
2
My Research
Building models that mine actionable knowledge from unstructured text.
3
From Text to Actionable Knowledge
(Pipeline figure: Document collection -> Spider -> Filter -> IE [Segment, Classify, Associate, Cluster] -> Database -> Data Mining [Discover patterns: entity types, links/relations, events] -> Actionable knowledge [Prediction, Outlier detection, Decision support].)
4
A Natural Language Processing Pipeline
Pragmatics / Anaphora Resolution / Semantic Role Labeling / Entity Recognition / Parsing / Chunking / POS tagging
5
Unified Natural Language Processing
Pragmatics / Anaphora Resolution / Semantic Role Labeling / Entity Recognition / Parsing / Chunking / POS tagging
Unified, joint inference.
6
Problem
  • Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
  • DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
  • IE is unaware of emerging patterns and regularities in the DB.
  • The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

7
Solution
(The same pipeline figure, with IE now passing Uncertainty Info forward to Data Mining, and Data Mining feeding Emerging Patterns back to IE.)
8
Solution
(The same pipeline figure, with IE and Data Mining merged into a single unified Probabilistic Model between the document collection and the actionable knowledge.)
9
Scientific Questions
  • What model structures will capture salient
    dependencies?
  • Will joint inference actually improve accuracy?
  • How to do inference in these large graphical
    models?
  • How to do parameter estimation efficiently in
    these models, which are built from multiple
    large components?
  • How to do structure discovery in these models?

10
Scientific Questions
  • What model structures will capture salient
    dependencies?
  • Will joint inference actually improve accuracy?
  • How to do inference in these large graphical
    models?
  • How to do parameter estimation efficiently in
    these models, which are built from multiple
    large components?
  • How to do structure discovery in these models?

11
Methods of Inference
  • Exact
    • Exhaustively explore all interpretations
    • Graphical model has low tree-width
  • Variational
    • Represent the distribution in a simpler model that is close
  • Monte-Carlo
    • Randomly (but cleverly) sample to explore interpretations

12
Outline
  • Examples of IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Joint Segmentation and Co-ref (Sparse BP)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Semi-supervised Learning
  • Demo: Rexa, a Web portal for researchers

13
Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
(Figure: finite state model / graphical model: states S_{t-1}, S_t, S_{t+1} linked by transitions, each emitting an observation O_{t-1}, O_t, O_{t+1}.)
Generates: state sequence and observation sequence o1 o2 o3 o4 o5 o6 o7 o8.
14
IE with Hidden Markov Models
Given a sequence of observations
Yesterday Yoav Freund spoke this example sentence.
and a trained HMM
person name
location name
background
Find the most likely state sequence (Viterbi):
Yesterday Tony Jebara spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name = Tony Jebara
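For concreteness, here is a minimal sketch of the Viterbi decoding step just described, assuming a trained HMM given as log-probability tables (NumPy arrays); the interface and state names are illustrative, not tied to any particular system.

```python
import numpy as np

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence for an observation sequence under an HMM.

    log_start[s], log_trans[s, s'], log_emit[s, o] are log-probabilities.
    """
    T, S = len(obs), len(states)
    delta = np.full((T, S), -np.inf)    # best log-prob of a path ending in s at t
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):       # trace backpointers to recover the path
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]
```

Words aligned with the designated "person name" state are then extracted as a person name.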
15
We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words, e.g.: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are "and Associates".
(Figure: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}, with example features "is Wisniewski", "part of noun phrase", "ends in -ski" attached to one observation.)
16
Problems with Richer Representation and a Joint Model
  • These arbitrary features are not independent.
  • Multiple levels of granularity (chars, words, phrases)
  • Multiple dependent modalities (words, formatting, layout)
  • Past & future
  • Two choices:
Ignore the dependencies. This causes over-counting of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
(Figure: two HMM variants over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: one ignoring, one modeling the dependencies among observation features.)
17
Conditional Sequence Models
  • We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o).
  • Can examine features, but not responsible for generating them.
  • Don't have to explicitly model their dependencies.
  • Don't waste modeling effort trying to generate what we are given at test time anyway.
18
From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
(Figure: the joint model P(s,o), an HMM over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}, contrasted with the conditional model P(s|o) over the same chain.)
where

  P(s|o) = (1/Z(o)) ∏_t exp( Σ_k λ_k f_k(s_{t-1}, s_t, o, t) )

(A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on ∇L.
19
(Linear Chain) Conditional Random Fields
Lafferty, McCallum, Pereira 2001
Undirected graphical model, trained to maximize conditional probability of output (sequence) given input (sequence).
(Figure: finite state model / graphical model: FSM states form the output sequence y_{t-1} ... y_{t+3}, labeled OTHER PERSON OTHER ORG TITLE; observations x_{t-1} ... x_{t+3} form the input sequence "said Jones a Microsoft VP".)
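As a sketch of what "maximize conditional probability of output given input" means computationally, the snippet below scores a tag sequence under a linear-chain CRF; the unary and transition score matrices are assumed to be precomputed weighted feature sums (a simplification of the full feature machinery, not the paper's code).

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(y, unary, trans):
    """log P(y|x) for a linear-chain CRF.

    unary[t, s]: weighted feature sum for tag s at position t (depends on x)
    trans[i, j]: weighted feature sum for the tag transition i -> j
    """
    T, _ = unary.shape
    # Unnormalized score of the candidate tag sequence.
    score = unary[0, y[0]] + sum(
        trans[y[t - 1], y[t]] + unary[t, y[t]] for t in range(1, T))
    # log Z(x) via the forward algorithm over all tag sequences.
    alpha = unary[0].copy()
    for t in range(1, T):
        alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
    return score - logsumexp(alpha)
```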
20
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995, at 19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk, with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat 2/: United States, 1993-95

Year   Milk Cows 1/   Per Milk Cow          Pct. of Fat in      Total
       (1,000 Head)   Milk       Milkfat    All Milk Produced   Milk        Milkfat
                      --- Pounds ---        (Percent)           (Million Pounds)
1993   9,589          15,704     575        3.66                150,582     5,514.4
1994   9,500          16,175     592        3.66                153,664     5,623.7
1995   9,461          16,451     602        3.66                155,644     5,694.3

1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

21
Table Extraction from Government Reports
Pinto, McCallum, Wei, Croft, 2003 SIGIR
100 documents from www.fedstats.gov
Labels
CRF
  • Non-Table
  • Table Title
  • Table Header
  • Table Data Row
  • Table Section Data Row
  • Table Footnote
  • ... (12 in all)

(The same government report text as on the previous slide, here labeled line-by-line by the CRF.)
Features
  • Percentage of digit chars
  • Percentage of alpha chars
  • Indented
  • Contains 5 consecutive spaces
  • Whitespace in this line aligns with prev.
  • ...
  • Conjunctions of all previous features, at time
    offsets (0,0), (-1,0), (0,1), (1,2).
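A sketch of how such line-level features might be computed; the feature names and the alignment heuristic are illustrative assumptions, not the exact features of the paper.

```python
import re

def line_features(line, prev_line=""):
    """A hypothetical subset of the per-line features listed above."""
    n = max(len(line), 1)
    feats = {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "has_5_spaces": "     " in line,
    }
    # Crude check: do runs of whitespace line up with the previous line?
    gaps = lambda s: {m.start() for m in re.finditer(r"\s{2,}", s)}
    feats["aligns_with_prev"] = bool(gaps(line) & gaps(prev_line))
    return feats
```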

22
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

Method             Line labels (% correct)   Table segments (F1)
HMM                65                        64
Stateless MaxEnt   85                        -
CRF                95                        92
23
IE from Research Papers
McCallum et al 99
24
IE from Research Papers
Field-level F1:
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
Error reduction ≈ 40%
25
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels   Examples
PER      Yayuk Basuki, Innocent Butare
ORG      3M, KDP, Cleveland
LOC      Cleveland, Nirmal Hriday, The Oval
MISC     Java, Basque, 1,000 Lakes Rally
26
Automatically Induced Features
[McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)
27
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]

Method                          F1
HMMs (BBN's Identifinder)       73
CRFs w/out Feature Induction    83
CRFs with Feature Induction     90
  (based on Likelihood Gain)
28
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

29
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
30
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
31
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade--must be perfect at every
stage to do well.
32
1. Jointly labeling cascaded sequences: Factorial CRFs
Sutton, Khashayar, McCallum, ICML 2004
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and noun-phrase boundaries in newswire, matching accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation
33
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

34
2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004

Senator Joe Green said today .
Green ran for
Dependency among similar, distant mentions
ignored.
35
2. Jointly labeling distant mentions: Skip-chain CRFs
Sutton, McCallum, SRL 2004

Senator Joe Green said today .
Green ran for
14% reduction in error on the most-repeated field in email seminar announcements.
Inference: Tree-reparameterized BP
[See also Finkel et al, 2005; Wainwright et al, 2002]
36
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

37
3. Joint co-reference among all pairs: Affinity Matrix CRF
Entity resolution / Object correspondence
(Figure: mentions Mr. Hill, Dana Hill, Amy Hall, she, Dana connected by pairwise affinity edges.)
25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering / graph partitioning
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Bansal, Blum, Chawla, 2002]
38
Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
1. Secretary of State Colin Powell; he; Mr. Powell; Powell
2. Condoleezza Rice; she; Rice
3. President Bush; Bush
39
Inside the Traditional Solution
Pair-wise Affinity Metric: Mention (3) ". . . Mr Powell . . ." vs. Mention (4) ". . . Powell . . ." : coreferent Y/N?

Y/N   Feature                                            Weight
N     Two words in common                                29
Y     One word in common                                 13
Y     "Normalized" mentions are string identical         39
Y     Capitalized word in common                         17
Y     > 50% character tri-gram overlap                   19
N     < 25% character tri-gram overlap                   -34
Y     In same sentence                                   9
Y     Within two sentences                               8
N     Further than 3 sentences apart                     -1
Y     "Hobbs Distance" < 3                               11
N     Number of entities in between two mentions = 0     12
N     Number of entities in between two mentions > 4     -3
Y     Font matches                                       1
Y     Default                                            -19
OVERALL SCORE = 98 > threshold = 0
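In code, the metric above reduces to summing the weights of whichever features fire for a mention pair and thresholding; the feature names and weights below are a hypothetical subset mirroring the table.

```python
def affinity_score(pair_feats, weights, threshold=0.0):
    """Sum weights of the firing features; coreferent iff the sum clears 0."""
    score = weights["default"] + sum(
        w for name, w in weights.items() if pair_feats.get(name))
    return score, score > threshold

weights = {"one_word_in_common": 13, "normalized_identical": 39,
           "capitalized_word_in_common": 17, "in_same_sentence": 9,
           "within_two_sentences": 8, "default": -19}
```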
40
Entity Resolution
(Figure: five mentions: Mr. Hill, Dana Hill, Amy Hall, she, Dana.)
41
Entity Resolution
(Figure: the same mentions grouped into candidate entities, one possible partition.)
42
Entity Resolution
(Figure: a second possible partition of the mentions into entities.)
43
Entity Resolution
(Figure: a third possible partition of the mentions into entities.)
44
The Problem
Independent pairwise affinity with connected components:
  • Affinity measures are noisy and imperfect.
  • Pair-wise merging decisions are being made independently from each other.
  • They should be made jointly.
(Figure: the mention graph over Mr. Hill, Dana Hill, Amy Hall, she, Dana with independently chosen C/N edge decisions.)
45
CRF for Co-reference
[McCallum & Wellner, 2003, ICML]
(Figure: fully connected mention graph over Mr. Hill, Dana Hill, Amy Hall, she, Dana with jointly assigned C/N edge labels.)
Make pair-wise merging decisions jointly by:
  - calculating a joint probability,
  - including all edge weights,
  - adding dependence on consistent triangles.
46
CRF for Co-reference
[McCallum & Wellner, 2003, ICML]
(Figure: the same jointly labeled mention graph.)
47
A Generative Model Solution
[Russell 2001; Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision)
(Figure: generative plate model with N entities, each with id, words, context; mentions with id, surname, distance, fonts, age, gender, ...)
2) Number of entities is hard-coded into the model structure, but we are supposed to predict num entities! Thus we must modify model structure during inference---MCMC.
48
CRF for Co-reference
(Figure: the mention graph with weighted C/N decisions, e.g. +(23), -(-55), -(-44), -(-23), +(11), +(17), -(-9), +(10), -(-22), +(4); configuration score = 218.)
49
CRF for Co-reference
(Figure: the same graph with the she/Dana edge flipped to N; configuration score = 210.)
50
CRF for Co-reference
(Figure: the same graph with the C/N decisions inverted; configuration score = -12.)
51
Inference in these MRFs: Graph Partitioning
[Boykov, Veksler, Zabih, 1999; Kolmogorov & Zabih, 2002; Yu, Cross, Shi, 2002]
Correlational Clustering [Bansal & Blum; Demaine]
(Figure: the weighted mention graph; MAP inference corresponds to partitioning it.)
52
Pairwise Affinity is not Enough
(Figure: the weighted mention graph again; pairwise scores alone cannot resolve it.)
53
Pairwise Affinity is not Enough
(Figure: the mention graph with only the C/N decisions shown.)
54
Pairwise Affinity is not Enough
(Figure: a graph over four "she" mentions plus Amy Hall; pairwise affinities among pronouns are uninformative.)
55
Pairwise Comparisons Not Enough: Examples
  • ∀ mentions are pronouns?
  • Entities have multiple attributes (name, email, institution, location); need to measure compatibility among them.
  • Having 2 given names is common, but not 4.
  • e.g. Howard M. Dean / Martin, Dean / Howard Martin
  • Need to measure size of the clusters of mentions.
  • ∃ a pair of last-name strings that differ > 5?
  • We need to ask ∀, ∃ questions about a set of mentions.
  • We want first-order logic!

56
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

57
Pairwise Affinity is not Enough
(Figure: the graph over four "she" mentions and Amy Hall again, with pairwise C/N decisions.)
58
Partition Affinity CRF
(Figure: the mentions grouped into a candidate partition, scored as a whole.)
Ask arbitrary questions about all entities in a partition with first-order logic...
... bringing together LOGIC and PROBABILITY
59
Partition Affinity CRF
(Figure: a second candidate partition, scored as a whole.)
60
Partition Affinity CRF
(Figure: a third candidate partition.)
61
Partition Affinity CRF
(Figure: a fourth candidate partition.)
62
Partition Affinity CRF
(Figure: a fifth candidate partition.)
63
This space complexity is common in probabilistic
first-order logic models
64
Markov Logic: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005; Paskin & Russell 2002; Taskar et al 2003]
(Figure: first-order clauses and the ground Markov network they induce.)
Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity.
65
How can we perform inference and learning in
models that cannot be grounded?
66
Inference in Weighted First-Order Logic: SAT Solvers
  • Weighted SAT solvers [Kautz et al 1997]
  • Requires complete grounding of the network
  • LazySAT [Singla & Domingos 2006]
  • Saves memory by only storing clauses that may become unsatisfied
  • Initialization still requires time O(n^r) to visit all ground clauses
67
Inference in Weighted First-Order Logic: MCMC
  • Gibbs Sampling
  • Difficult to move between high-probability configurations by changing single variables
  • Although, consider MC-SAT [Poon & Domingos 06]
  • An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006]
  • Two parts: proposal distribution, acceptance distribution
  • Can be extended to partial configurations
  • Only instantiate relevant variables
  • Successfully used in BLOG models [Milch et al 2005]
  • Key advantage: can design arbitrary smart jumps
68
Don't represent all alternatives...
69
Don't represent all alternatives... just one at a time
Proposal Distribution
Stochastic Jump
70
Model
First-order features
(Figure: the mentions Amy Hall, Dana Hill, she, Dana, she, Amy, she in a candidate partition, scored by first-order feature functions: f_w over SamePerson(x, ...) within a cluster, f_b over DifferentPerson(x, x') across clusters.)
71
Proposal Distribution
y -> y'
(Figure: a jump from configuration y, containing the cluster {Dean Martin, Howie Martin, Howard Martin, Howie Martin}, to y', which splits it into {Dean Martin, Howie Martin} and {Howard Martin, Dino}.)
72
Proposal Distribution
(Figure: the reverse jump, y' -> y, merging the two clusters back together.)
73
Proposal Distribution
(Figure: another candidate jump between configurations.)
74
Metropolis-Hastings: Jump acceptance probability

  α(y -> y') = min( 1, [p(y')/p(y)] · [q(y' -> y) / q(y -> y')] )

  • p(y')/p(y): likelihood ratio
  • Ratio of P(Y|X): the partition function Z_X cancels!
  • q(y -> y'): proposal distribution
  • probability of proposing move y -> y'
  • the ratio makes up for any biases in the proposal distribution
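A minimal sketch of one such Metropolis-Hastings step over coreference configurations; log_score and propose are assumed callbacks (hypothetical names), and only score ratios appear, so the partition function is never computed.

```python
import math, random

def mh_step(y, log_score, propose):
    """One Metropolis-Hastings move.

    log_score(y): unnormalized log model score of configuration y.
    propose(y):   returns (y_new, log q(y_new|y), log q(y|y_new)).
    """
    y_new, log_q_fwd, log_q_rev = propose(y)
    log_alpha = (log_score(y_new) - log_score(y)) + (log_q_rev - log_q_fwd)
    if math.log(random.random()) < min(0.0, log_alpha):
        return y_new   # accept the jump
    return y           # reject; keep the current configuration
```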

75
Learning in Probabilistic First-order Logic: Parameter estimation (weight learning)
  • Input
  • First-order formulae: ∀x S(x) ⇒ T(x)
  • Labeled data: constants {a, b, c}; facts S(a), T(a), S(b), T(b), S(c)
  • Output
  • Weights for each formula: ∀x S(x) ⇒ T(x) : 0.67

∀x,y Coreferent(x,y) ⇒ Pronoun(x)
∀x,y Coreferent(x,y) ⇒ Pronoun(x) : -2.3
76
Learning the Likelihood Ratio
Given a pair of configurations, learn to rank the
better configuration higher.
77
Parameter Estimation in Large State Spaces
  • Most methods require calculating the gradient of the log-likelihood P(y1, y2, y3, ... | x1, x2, x3, ...)...
  • ...which in turn requires expectations of marginals, e.g. P(y1 | x1, x2, x3, ...)
  • But getting marginal distributions by sampling can be inefficient due to the large sample space.
  • Alternative: Perceptron. Approximate the gradient from the difference between the true output and the model's predicted best output.
  • But even finding the model's predicted best output is expensive.
  • We propose Sample Rank [Culotta, Wick, Hall, McCallum, HLT 2007]: learn to rank intermediate solutions, P(y1=1, y2=0, y3=1, ... | ...) > P(y1=0, y2=0, y3=1, ... | ...)

78
Ranking vs Classification Training
  • Instead of training: {Powell, Mr. Powell, he} --> YES; {Powell, Mr. Powell, she} --> NO
  • ...Rather: {Powell, Mr. Powell, he} > {Powell, Mr. Powell, she}
  • In general, the higher-ranked example may contain errors:

{Powell, Mr. Powell, George, he} > {Powell, Mr. Powell, George, she}
79
Ranking Parameter Update
In our experiments, we use a large-margin update based on MIRA [Crammer & Singer 2003]:

  W_{t+1} = argmin_W ||W_t - W||
  s.t. Score(Q'', W) - Score(Q', W) >= 1
80
Error-driven Training
  • Input
  • Observed data X                       // Input mentions
  • True labeling P                       // True clustering
  • Prediction algorithm A                // Clustering algorithm
  • Initial weights W, prediction Q       // Initial clustering
  • Iterate until convergence
  • Q' <- A(Q, W, O)                      // Merge clusters
  • If Q' introduces an error:
  • UpdateWeights(Q', Q, P, O, W)
  • Else Q <- Q'

81
UpdateWeights(Q', Q, P, O, W): Learning to Rank Pairs of Predictions
  • Using truth P, generate a new Q'' that is a better modification of Q than Q'.
  • Update W s.t. Q'' <- A(Q, W, O)
  • Update parameters so Q'' is ranked higher than Q'

82
Ranking Intermediate Solutions: Example
(Figure: a chain of intermediate configurations 1 -> 2 -> 3 -> 4 -> 5; each transition is scored by the model and by the truth.)
  Δ Model = +3,   Δ Truth = +0.3
  Δ Model = -23,  Δ Truth = -0.2
  Δ Model = +10,  Δ Truth = -0.1
  Δ Model = -10,  Δ Truth = -0.1
  => UPDATE
  • Like Perceptron: proof of convergence under Marginal Separability
  • More constrained than Maximum Likelihood: parameters must correctly rank incorrect solutions!

83
Sample Rank Algorithm
  • 1. Proposer: generates a neighboring configuration from the current one
  • 2. Performance Metric: scores a configuration against the truth
  • 3. Inputs: input sequence x and an initial (random) configuration y^0
  • 4. Initialization: set the parameter vector w = 0
  • 5. Output: parameters w
  • 6. Score function: w · Φ(x, y)
  • 7. For t = 1,...,T and i = 0,...,n-1 do
  •    Generate a training instance: propose y^{i+1} from y^i
  •    Let y+ and y- be the best and worst configurations among y^i and y^{i+1} according to the performance metric.
  •    If w · Φ(x, y+) <= w · Φ(x, y-): update w so that y+ is ranked above y-
  •    end if
  • 8. end for
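A runnable sketch of this loop, under stated assumptions: propose, phi, and metric are user-supplied callbacks, and a plain perceptron step stands in for the MIRA-style large-margin update used in the experiments.

```python
import numpy as np

def sample_rank(x, y0, propose, phi, metric, w, T=1000, eta=0.1):
    """SampleRank training sketch.

    phi(x, y) -> feature vector; metric(y) -> performance vs. the truth;
    propose(y) -> a neighboring configuration.
    """
    y = y0
    for _ in range(T):
        y_new = propose(y)
        better, worse = (y_new, y) if metric(y_new) > metric(y) else (y, y_new)
        d = phi(x, better) - phi(x, worse)
        if w @ d <= 0:      # the model ranks the worse configuration higher
            w += eta * d    # nudge weights so the better one scores higher
        y = y_new           # the chain walks over proposals regardless
    return w
```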

84
Marginal Separability
  • For an input sequence x, a feature vector Φ(x, y), and a parameter vector w, define the score w · Φ(x, y) and ΔΦ(y+, y-) = Φ(x, y+) - Φ(x, y-).
  • A training set is called separable with margin δ > 0 if there exists some vector u with ||u|| = 1 such that u · ΔΦ(y+, y-) >= δ for every ranked pair.

85
Convergence Results
  • Theorem 1: For any training set that is separable with margin δ, the number of times the algorithm makes a ranking error is bounded.

86
Example 1
(Figure: configuration space with an initial configuration, a goal configuration, and a greedy move. One next state has higher F1 but lower utility for leading to the goal configuration; another has lower F1 but higher utility.)
Greedy next-state selection will pick the higher-F1 configuration and may get stuck in a locally optimal solution.
87
Example 2
CLAIM: Singletons are good states along paths to the goal.
Why? A small number of merges can reach the goal. But F1 is the harmonic mean of P and R, and singletons have almost zero R.
(Figure: a medium-F1 but low-utility configuration vs. a low-F1 but high-utility singleton configuration, 8 and 13 moves from the goal.)
88
It is not clear what choice of performance metric is more intuitive for intermediate configurations.
(Figure: 1. an initial configuration and 2. a proposed configuration, two clusterings of the mentions Obama, Biden, Sen. Obama, Palin, He, She, U.S., The US, America.)
How important is it that our model ranks these?
89
Delayed Feedback Problem
(Figure: configuration space, from an initial configuration to a goal configuration.)
  • Model the problem as a reinforcement-learning problem, to address the delayed feedback problem and noise in the performance metric.
  • Define a cost function as the temporal difference of the performance metric between any two configurations: TD-error = F(y_{t+1}) - F(y_t)
  • During training, use the Sample Rank algorithm for learning the cost function.
  • During test, use the approximate cost function learned during training in a reinforcement-learning setting of the problem.

90
Weighted Logics Techniques: Overview
  • Metropolis-Hastings (MH) for inference
  • Freely bake in domain knowledge about fruitful jumps; MH safely takes care of its biases.
  • Avoid the memory and time consumption of massive deterministic constraint factors: build jump functions that simply avoid illegal states.
  • Sample Rank
  • Don't train by likelihood of the completely correct solution...
  • ...train to properly rank intermediate configurations: the partition function (normalizer) cancels! ...plus other efficiencies

91
Feature List
  • Exact Match/Mis-Match
  • Entity type
  • Gender (requires lexicon)
  • Number
  • Case
  • Entity Text
  • Entity Head
  • Entity Modifier/Numerical Modifier
  • Sentence
  • WordNet hypernym, synonym, antonym
  • Other
  • Relative pronoun agreement
  • Sentence distance in bins
  • Partial text overlaps
  • Quantification
  • Existential: ∃ a gender mismatch? ∃ three different first names?
  • Universal: ∀ NER type match? ∀ named mentions string identical?
  • Filters (limit quantifiers to mention type)
  • None
  • Pronoun
  • Nominal (description)
  • Proper (name)

92
Partition Affinity CRF Experiments
[Culotta, Wick, Hall, McCallum, 2007]

                     Likelihood-based Training   Rank-based Training
Partition Affinity   69.2                        79.3
Pairwise Affinity    62.4                        72.5

(Better representation: rows; better training: columns. New state of the art.)
B-Cubed F1 Score on ACE 2004 Noun Coreference.
To our knowledge, the best previously reported results: 65 (1997), 67 (2002), 68 (2005).
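For reference, a small sketch of how the B-Cubed F1 reported here is computed; pred and gold are assumed to map every mention to a cluster id (a hypothetical interface).

```python
from collections import defaultdict

def b_cubed_f1(pred, gold):
    """B-Cubed F1 over mentions; pred/gold: mention -> cluster id."""
    def members(assign):
        by_cluster = defaultdict(set)
        for m, cid in assign.items():
            by_cluster[cid].add(m)
        return by_cluster
    p_c, g_c = members(pred), members(gold)
    precision = recall = 0.0
    for m in gold:
        overlap = len(p_c[pred[m]] & g_c[gold[m]])
        precision += overlap / len(p_c[pred[m]])
        recall += overlap / len(g_c[gold[m]])
    precision, recall = precision / len(gold), recall / len(gold)
    return 2 * precision * recall / (precision + recall)
```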
93
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

94
Information Integration
Schema A
First Name   Last Name   Contact
J.           Smith       222-444-1337
J.           Smith       444 1337
John         Smith       (1) 4321115555

Schema B
Name            Phone
John Smith      U.S. 222-444-1337
John D. Smith   444 1337
J Smiht         432-111-5555

Schema Matching
Schema A: First Name, Last Name; Contact
Schema B: Name; Phone

Coreference
The mentions J. Smith, J. Smith, John Smith, John Smith, John D. Smith, J Smiht are grouped into two entities (John 1, John 2).

Normalized DB
Entity   Name            Phone
523      John Smith      222-444-1337
524      John D. Smith   432-111-5555

95
Data Integration
Schema 13
First   Last Name   Contact
J.      Smith       222-444-1337
J.      Smith       444 1337
John    Smith       (1) 4321115555

Schema 25
Name            Phone
John Smith      U.S. 222-444-1337
John D. Smith   444 1337
J Smiht         432-111-5555

Schema Matching + Coref + Canonicalization

Result: Normalized DB
Entity   Name            Phone
523      John Smith      222-444-1337
524      John D. Smith   432-111-5555

96
Information IntegrationA Family of Related Tasks
  • GOAL
  • Combine multiple heterogeneous sources into a
    single repository.
  • Requires
  • Concept-mapping across different representations
    (schema matching)
  • Data deduplication across different repositories
    (coreference)
  • Selecting a representation to store in the
    resulting DB (canonicalization / normalization)

97
Information Integration Steps
(Figure: 1. Schema Matching aligns {First Name, Last Name} with Name and Contact with Phone; 2. Coreference links records such as J. Smith / John Smith and Amanda / A. Jones; 3. Canonicalization selects a single representative value for each entity, e.g. John Smith.)
98
Problems with a Pipeline
  • 1. Data integration tasks are highly correlated
  • 2. Errors can propagate

99
Schema Matching First
1. Schema Matching provides evidence for 2. Coreference
(Figure: with {First Name, Last Name} matched to Name and Contact matched to Phone, the J. Smith / John Smith and Amanda / A. Jones records can be compared field-by-field.)
NEW FEATURES:
1. String identical: First Name + Last Name = Name
2. Same area-code 3-gram in Phone/Contact
3. ...
100
Coreference First
1. Coreference provides evidence for 2. Schema Matching
(Figure: coreference links among the J. Smith / John Smith and Amanda / A. Jones records suggest which fields align.)
NEW FEATURES:
1. Field values similar across coref'd records
2. Phone = Contact takes the same value for the J. Smith mentions
3. ...
101
Problems with a Pipeline
  • 1. Data integration tasks are highly correlated
  • 2. Errors can propagate

102
Hazards of a Pipeline
1. Schema Matching

Table A
Name           Corporation
Amanda Jones   J. Smith & Sons
J. Smith       IBM

Table B
Full Name      Company Name
Amanda Jones   Smith & Sons
John Smith     IBM

(Candidate alignments: Name / Full Name, Corporation / Company Name, Phone / Contact.)
2. Coreferent?
ERRORS PROPAGATE
103
Canonicalization
Entity 87: {John Smith, J. Smith, J. Smith, J. Smiht, J.S Mith, Jonh smith, John}
Coref -> Canonicalization -> John Smith
Typically occurs AFTER coreference.
  • Desiderata:
  • Complete: contains all information (e.g. first + last)
  • Error-free: no typos (e.g. avoid "Smiht")
  • Central: represents all mentions (not "Mith")
Access to such features would be very helpful to Coref.
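One simple way to satisfy the completeness/error-free/centrality desiderata is to pick the medoid mention, i.e. the one minimizing total string distance to the rest; a sketch under that assumption (any string distance will do).

```python
from difflib import SequenceMatcher

def canonicalize(mentions, dist):
    """Return the medoid: the mention closest, in total, to all the others."""
    return min(mentions, key=lambda m: sum(dist(m, o) for o in mentions))

# A crude distance from difflib's similarity ratio:
dist = lambda a, b: 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
canonicalize(["John Smith", "J. Smith", "J. Smiht", "Jonh smith", "John"], dist)
```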
104
Schema Matching
(Figure: factor graph for schema matching, with attribute clusters x4 ... x8, pairwise match variables and factors such as y67 / f67, and per-cluster variables and factors such as y7 / f7.)
  • x6 is a set of attributes: {phone, contact, telephone}
  • x7 is a set of attributes: {last name, last name}
  • f67 is a factor between x6/x7
  • y67 is a binary variable indicating a match (no)
  • f7 is a factor over cluster x7
  • y7 is a binary variable indicating a match (yes)

105
Schema Matching
  • x1 is a set of mentions: {J. Smith, John, John Smith}
  • x2 is a set of mentions: {Amanda, A. Jones}
  • f12 is a factor between x1/x2
  • y12 is a binary variable indicating a match (no)
  • f1 is a factor over cluster x1
  • y1 is a binary variable indicating a match (yes)
  • Entity/attribute factors omitted for clarity

(Figure: the same factor graph, extended with the mention clusters x1 and x2.)
106
Schema Matching
(Figure: the joint factor graph over attribute clusters and mention clusters.)
107
Canonicalization's Role
  • For each attribute in each cluster, there is a
    Canonical value variable
  • Variable takes on one value from all possible
    values in cluster
  • Canonical record is the tuple of all canonical
    attribute variables
  • There is one canonical record per coreference
    cluster
  • Canonical records are used to compute additional
    features

108
Model Summary
  • Conditional random field
  • Two clustering components
  • Clusters of attributes (schema matching)
  • Clusters of mentions (coref)
  • Factors represent affinities between and among
    clusters (sets)
  • Schema matching affinity factors
  • Coreference affinity factors
  • Canonicalization factors
  • Learning/inference is intractable

109
Parameter Estimation
  • Three sets of parameters
  • Schema matching
  • Coreference
  • Canonicalization
  • Labeled data for
  • Schema matching
  • Coreference
  • Canonicalization model is set to default string
    edit parameters

110
Parameter Estimation
Setting Coreference Params: fix schema-matching truth, sample coreference training examples.
Setting Schema Match Params: fix coreference truth, sample schema-matching training examples.

Ground Truth Schema Matching
Schema A                 Schema B
First Name, Last Name    Name
Contact                  Phone

Ground Truth Coreference
Entity 1: J. Smith, J. Smith, John Smith, John D. Smith
Entity 2: A Jones, Amanda Jones
111
Inference (MAP/MPE)
GOAL: Find a configuration that maximizes P(Y|X).
Clustering inference for coreference and schema matching: greedy agglomerative.
Inference for canonicalization: find the attribute-value centroid for each set.
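A sketch of the greedy agglomerative loop used for both clustering sub-tasks; score is an assumed model-derived affinity between two clusters, and the 0.5 stopping threshold echoes the system settings reported below.

```python
def greedy_agglomerative(items, score, stop=0.5):
    """Repeatedly merge the highest-scoring cluster pair until none clears stop."""
    clusters = [{x} for x in items]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), score(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: pair[1])
        if best < stop:
            break
        clusters[i] |= clusters.pop(j)  # merge cluster j into cluster i
    return clusters
```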
112
Joint Inference
(Figure: joint inference alternates across iterations n-1, n, ...: 1. coreference by greedy agglomerative merging, then schema matching, each re-using the other's latest output.)
113
EXPERIMENTS
114
Dataset
  • Faculty and alumni listings from university
    websites, plus an IE system
  • 9 different schemas
  • 1400 mentions, 294 coreferent

115
Example Schemas
DEX IE          Northwestern Fac     UPenn Fac
First Name      Name                 Name
Middle Name     Title                First Name
Last Name       PhD Alma Mater       Last Name
Title           Research Interests   Job/Department
Department      Office Address
Company Name    E-mail
Home Phone
Office Phone
Fax Number
E-mail
116
Experiments
  • Prune down to 200 singletons plus the 294 coreference clusters
  • Create training/testing sets, keeping the DEX schema in both
  • Has the most interesting cases of coreference
  • Sets are still disjoint, since the mapping is defined between two schemas

117
Coreference Features
First order quantifications/aggregations over
  • Cosine distance (unweighted)
  • Cosine distance (TFIDF)
  • Cosine distance between mapped fields
  • Substring match between mapped fields
  • All of the above comparisons, but use canonical
    records

118
Schema Matching Features
First order quantifications/aggregations over
  • String identical
  • Sub string matches
  • TFIDF weighted cosine distance
  • All of the above, but between coreferent mentions only

119
System Settings
  • MaxEnt piecewise training
  • Agglomerative Search for sub-tasks
  • Stopping threshold 0.5
  • Four iterations for joint inference

120
Systems
  • ISO: each task in isolation
  • CASC: Coref -> Schema matching
  • CASC: Schema matching -> Coref
  • JOINT: Coref + Schema matching

Our new work
Each system is evaluated with and without Joint
canonicalization
121
Coreference Results
                  Pair F1   Pair Prec   Pair Recall   MUC F1   MUC Prec   MUC Recall
No Canon ISO      72.7      88.9        61.5          75.0     88.9       64.9
No Canon CASC     64.0      66.7        61.5          65.7     66.7       64.9
No Canon JOINT    76.5      89.7        66.7          78.8     89.7       70.3
Canon ISO         78.3      90.0        69.2          80.6     90.0       73.0
Canon CASC        65.8      67.6        64.1          67.6     67.6       67.6
Canon JOINT       81.7      90.6        74.4          84.1     90.6       74.4
Note: the cascade does worse than ISO.
122
Schema Matching Results
                  Pair F1   Pair Prec   Pair Recall   MUC F1   MUC Prec   MUC Recall
No Canon ISO      50.9      40.9        67.5          69.2     81.8       60.0
No Canon CASC     50.9      40.9        67.5          69.2     81.8       60.0
No Canon JOINT    68.9      100         52.5          69.6     100        53.3
Canon ISO         50.9      40.9        67.5          69.2     81.8       60.0
Canon CASC        52.3      41.8        70.0          74.1     83.3       66.7
Canon JOINT       71.0      100         55.0          75.0     100        60.0
Note: the cascade is not as harmful here.
123
Summary and Future Work
  • Towards a grand unified data integration model
  • Integrate other info integration tasks
  • Apply to other families of problems
  • Joint modeling drastically improves results for
    related tasks
  • Try other learning/inference algorithms
  • More sophisticated canonicalization models
  • Additional datasets

124
Related Work (Joint Inference)
  • Composition of Conditional Random Fields for
    Transfer Learning (Sutton 05)
  • Various named entity recognition tasks
  • Joint Inference in Information Extraction (Poon
    and Domingos 07)
  • Coreference + segmentation

125
(No Transcript)
126
Data
  • 270 Wikipedia articles
  • 1000 paragraphs
  • 4700 relations
  • 52 relation types
  • JobTitle, BirthDay, Friend, Sister, Husband,
    Employer, Cousin, Competition, Education,
  • Targeted for density of relations
  • Bush/Kennedy/Manning/Coppola families and friends

127
(No Transcript)
128
George W. Bush -> his father -> George H. W. Bush -> his cousin -> John Prescott Ellis
George H. W. Bush -> his sister -> Nancy Ellis Bush -> her son -> John Prescott Ellis
Cousin = Father's Sister's Son
129
John Kerry celebrated with Stuart Forbes: likely a cousin.
130
Iterative DB Construction
  • "Joseph P. Kennedy, Sr ... son John F. Kennedy ... with Rose Fitzgerald"

Name                Son
Joseph P. Kennedy   John F. Kennedy
Rose Fitzgerald     John F. Kennedy
(0.3)
131
Results
         ME      CRF     RCRF    RCRF .9   RCRF .5   RCRF Truth   RCRF Truth .5
F1       .5489   .5995   .6100   .6008     .6136     .6791        .6363
Prec     .6475   .7019   .6799   .7177     .7095     .7553        .7343
Recall   .4763   .5232   .5531   .5166     .5406     .6169        .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features
132
Examples of Discovered Relational Features
  • Mother: Father -> Wife
  • Cousin: Mother -> Husband -> Nephew
  • Friend: Education -> Student
  • Education: Father -> Education
  • Boss: Boss -> Son
  • MemberOf: Grandfather -> MemberOf
  • Competition: PoliticalParty -> Member -> Competition

133
(No Transcript)
134
(No Transcript)
135
Outline
  • The Need for IE and Data Mining.
  • Motivate Joint Inference
  • Brief introduction to Conditional Random Fields
  • Joint inference: Information Extraction Examples
  • Joint Labeling of Cascaded Sequences (Belief Propagation)
  • Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  • Joint Co-reference Resolution (Graph Partitioning)
  • Probability + First-order Logic, Co-ref on Entities (MCMC)
  • Joint Information Integration (MCMC + Sample Rank)
  • Demo: Rexa, a Web portal for researchers

136
Data Mining Research Literature
  • Better understand structure of our own research
    area.
  • Structure helps us learn a new field.
  • Aid collaboration
  • Map how ideas travel through social networks of
    researchers.
  • Aids for hiring and finding reviewers!
  • Measure impact of papers or people.

137
Traditional Bibliometrics
  • Analyses a small amount of data(e.g. 19 articles
    from a single issue of a journal)
  • Uses journal as a proxy for research
    topic(but there is no journal for information
    extraction)
  • Uses impact measures almost exclusively based on
    simple citation counts.

How can we use topic models to create new, interesting impact measures? Can we create a social network of scientific sub-fields?
138
Our Data
  • Over 1.6 million research papers, gathered as
    part of Rexa.info portal.
  • Cross linked references / citations.

139
Previous Systems
140
(No Transcript)
141
Previous Systems
Cites
Research Paper
142
More Entities and Relations
Expertise
Cites
Research Paper
Person
Grant
University
Venue
Groups
143
(No Transcript)
144
(No Transcript)
145
(No Transcript)
146
(No Transcript)
147
(No Transcript)
148
(No Transcript)
149
(No Transcript)
150
(No Transcript)
151
(No Transcript)
152
(No Transcript)
153
(No Transcript)
154
(No Transcript)
155
(No Transcript)
156
Topical Transfer
Citation counts from one topic to another.
Map producers and consumers
157
Topical Bibliometric Impact Measures
Mann, Mimno, McCallum, 2006
  • Topical Citation Counts
  • Topical Impact Factors
  • Topical Longevity
  • Topical Precedence
  • Topical Diversity
  • Topical Transfer

158
Topical Transfer
Transfer from Digital Libraries to other topics
Other topic Cits Paper Title
Web Pages 31 Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan,... 1999.
Computer Vision 14 On being Undigital with digital cameras extending the dynamic...
Video 12 Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs 12 Trawling the Web for Emerging Cyber-Communities
Web Pages 11 WebBase a repository of Web pages
159
Topical Diversity
Papers that had the most influence across many
other fields...
160
Topical Diversity
Entropy of the topic distribution among papers
that cite this paper (this topic).
(Figure: example papers with high vs. low topical diversity.)
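As a sketch, this diversity measure is just the entropy of the citing papers' topic distribution; citing_topics is an assumed list of the most likely topic of each citing paper.

```python
import math
from collections import Counter

def topical_diversity(citing_topics):
    """Entropy of the topic distribution among papers citing a given paper."""
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```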
161
Summary
  • Joint inference is needed to avoid cascading errors in information extraction and data mining.
  • Most fundamental problem in NLP, data mining, ...
  • Can be performed in CRFs:
  • Cascaded sequences (Factorial CRFs)
  • Distant correlations (Skip-chain CRFs)
  • Co-reference (Affinity-matrix CRFs)
  • Logic + Probability (made efficient by MCMC + Sample Rank)
  • Information Integration
  • Rexa: a new research-paper search engine, mining the interactions in our community.

162
(No Transcript)
163
(No Transcript)
164
Outline
  • Model / Feature Engineering
  • Brief review of IE w/ Conditional Random Fields
  • Flexibility to use non-independent features
  • Inference
  • Entity Resolution with Probability + First-order Logic
  • Resolution + Canonicalization + Schema Mapping
  • Inference by Metropolis-Hastings
  • Parameter Estimation
  • Semi-supervised Learning with Label Regularization
  • ...with Feature Labeling
  • Generalized Expectation criteria