Title: Graph-Based Methods for Open Domain Information Extraction
1. Graph-Based Methods for Open Domain Information Extraction
- William W. Cohen
- Machine Learning Dept. and Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
2. Traditional IE vs. Open Domain IE

Traditional IE:
- Goal: recognize people, places, companies, times, dates, ... in NL text
- Supervised learning from a corpus completely annotated with the target entity class (e.g., people)
- Linear-chain CRFs
- Language- and genre-specific extractors

Open domain IE:
- Goal: recognize arbitrary entity sets in text
- Minimal info about the entity class
  - Example 1: ICML, NIPS, ...
  - Example 2: machine learning conferences
- Semi-supervised learning from very large corpora (WWW)
- Graph-based learning methods
- Techniques are largely language-independent (!)
  - The graph abstraction fits many languages
3. Examples with three seeds
4. Outline
- History: open-domain IE by pattern-matching
- The bootstrapping-with-noise problem
- Bootstrapping as a graph walk
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity
  - Approach 2: a carefully-engineered graph derived from a huge corpus
5. History: Open-domain IE by pattern-matching (Hearst, 92)
- Start with seeds: NIPS, ICML
- Look through a corpus for certain patterns:
  - "... at NIPS, AISTATS, KDD and other learning conferences ..."
  - "... on the PC of KDD, SIGIR, and ..."
- Expand from seeds to new instances
- Repeat until ___
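As a sketch (not Hearst's original implementation), the expand-and-repeat loop might look like the following. The single regex template and the two-sentence toy corpus are invented for illustration; the second sentence completes the slide's truncated example with a hypothetical ending.

```python
import re

def expand_seeds(corpus, seeds, max_iters=3):
    """Hearst-style bootstrapping: find list patterns that contain a
    known seed, then add the other list members as new instances."""
    # One of Hearst's (1992) templates: "X, Y, Z(,) and other C"
    pattern = re.compile(r"([A-Z]\w+(?:, [A-Z]\w+)*),? and other")
    known = set(seeds)
    for _ in range(max_iters):
        new = set()
        for sentence in corpus:
            m = pattern.search(sentence)
            if not m:
                continue
            items = {x.strip() for x in m.group(1).split(",")}
            if items & known:           # the list mentions a seed...
                new |= items - known    # ...so its other members are candidates
        if not new:                     # "repeat until" nothing new is found
            break
        known |= new
    return known

corpus = [
    "at NIPS, AISTATS, KDD and other learning conferences",
    "on the PC of KDD, SIGIR, and other conferences",
]
print(sorted(expand_seeds(corpus, {"NIPS", "ICML"})))
# -> ['AISTATS', 'ICML', 'KDD', 'NIPS', 'SIGIR']
```

Note that SIGIR is only reached on the second iteration, after KDD has been promoted to a seed — this transitivity is exactly where noise compounds, which motivates the graph-walk view that follows.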
6. Bootstrapping as graph proximity
7. Outline
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity (with Einat Minkov, Nokia)
  - Approach 2: a carefully-engineered graph derived from a huge corpus (examples above) (with Richard Wang, CMU → ?)
8. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: dependency parse of "boys like playing with cars of all kinds", with edges nsubj, partmod, prep.with, prep.of, det and POS tags NN, VB, DT]
A dependency-parsed sentence is naturally represented as a tree.
9. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
A dependency-parsed corpus is naturally represented as a graph.
10. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
- Open IE goal:
  - Find coordinate terms (e.g., girl/boy, dolls/cars) in the graph, or
  - Find a similarity measure S such that S(girl, boy) is high
- What about off-the-shelf similarity measures?
  - PPR/RWR
  - Hitting time
  - Commute time
  - ...?
11. Personalized PageRank / RWR
- Input: a query language Q, and the graph (nodes, node types, edge labels, edge weights)
- Returns a list of nodes (of the requested type) ranked by graph walk probabilities
- Graph walk parameters: edge weights T, walk length K, and reset probability α
- M[x,y] = probability of reaching y from x in one step = the edge weight from x to y, divided by the total outgoing weight from x
- Personalized PageRank: the reset distribution is biased toward the initial (seed) distribution
- Approximate with power iteration, cut off after a fixed number of iterations K
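A minimal sketch of the power-iteration approximation described above; the function name and the toy two-node graph are illustrative, not from the talk's system.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, K=6):
    """Power-iteration approximation of personalized PageRank / RWR.

    edges: {node: [(neighbor, weight), ...]} -- the graph
    seeds: the reset distribution is uniform over these nodes
    alpha: reset probability; K: iteration (walk length) cutoff
    """
    v0 = {s: 1.0 / len(seeds) for s in seeds}
    v = dict(v0)
    for _ in range(K):
        nxt = {}
        for x, px in v.items():
            out = edges.get(x, [])
            total = sum(w for _, w in out)
            for y, w in out:
                # M[x,y] = edge weight from x to y over total outgoing weight
                nxt[y] = nxt.get(y, 0.0) + (1 - alpha) * px * (w / total)
        for s, ps in v0.items():
            # reset step: bias the walk back toward the seed distribution
            nxt[s] = nxt.get(s, 0.0) + alpha * ps
        v = nxt
    return v

scores = personalized_pagerank({"a": [("b", 1.0)], "b": [("a", 1.0)]}, ["a"])
```

With every node having outgoing edges, the scores remain a probability distribution, and the seed node keeps the largest share of the mass.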
12. [Figure: graph-walk fragment starting at "girls": mention / mention-1 and nsubj / nsubj-1 edges link term nodes (girls, like, boys) to mention nodes (girls1, like1, like2, boys2)]
13. [Figure: two walk paths from "girls" to "boys": one via mention, nsubj, nsubj-1, mention-1 edges (girls → girls1 → like1 → like2 → boys2 → boys), and one via mention, nsubj, partmod, mention-1 edges through "playing" (girls → girls1 → like1 → playing1 → ... → boys)]
14. [Figure: a walk path girls → girls1 → like1 → playing1 → dolls1 → dolls via mention, nsubj, and prep.with edges]
Useful, but not our goal here.
15. Learning a better similarity metric
- Task T (query class): seed words (girl, boy, ...)
- Queries q, a, b, ..., each with its own set of relevant answers
- GRAPH WALK: produces one ranked node list per query
[Figure: three ranked lists of nodes (ranks 1-50), one per query]
- Potential new instances of the target concept (doll, child, toddler, ...)
16. Learning methods
- Weight tuning: weights learned per edge type [Diligenti et al., 2005]
- Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
- Example features for the pair (boys, dolls):
  - nsubj → nsubj-inv
  - nsubj → partmod → prep.in
  - nsubj → partmod → partmod-inv → nsubj-inv
  - lexical unigrams: "like", "playing"
17. Learning methods: Path-Constrained Graph Walk
- PCW (summary): for each node x, learn
  - P(x → z | relevant(z), history(Vq, x))
  - history(Vq, x): the sequence of edge labels leading from Vq to x, with all histories stored in a tree
- Example paths from boys to dolls:
  - nsubj → nsubj-inv
  - nsubj → partmod → prep.in
  - nsubj → partmod → partmod-inv → nsubj-inv
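The PCW statistics can be sketched as below. The `PathTree` class, its method names, and the toy training paths are hypothetical; a real implementation stores histories in an actual prefix tree and uses the estimates to prune the walk, while here a dict keyed by label tuples stands in for the tree.

```python
from collections import defaultdict

class PathTree:
    """Estimate, for each history of edge labels leading from the query
    nodes Vq, how likely a walk with that history is to end at a
    relevant node."""

    def __init__(self):
        self.relevant = defaultdict(int)  # prefix -> count of relevant walks
        self.total = defaultdict(int)     # prefix -> count of all walks

    def observe(self, path, is_relevant):
        # path: sequence of edge labels from Vq to a retrieved node
        for i in range(1, len(path) + 1):
            prefix = tuple(path[:i])
            self.total[prefix] += 1
            if is_relevant:
                self.relevant[prefix] += 1

    def p_relevant(self, history):
        h = tuple(history)
        return self.relevant[h] / self.total[h] if self.total[h] else 0.0

pt = PathTree()
# Hypothetical training walks from "boys": one reaches a correct
# coordinate term, one reaches noise.
pt.observe(["nsubj", "nsubj-inv"], True)
pt.observe(["nsubj", "partmod", "prep.in"], False)
print(pt.p_relevant(["nsubj"]))               # 0.5 -- ambiguous after one step
print(pt.p_relevant(["nsubj", "nsubj-inv"]))  # 1.0
```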
18. City and person name extraction
- City names: Vq = sydney, stamford, greenville, los_angeles
- Person names: Vq = carter, dave_kingman, pedro_ramos, florio
- 10 queries (× 4 seeds) for each task
  - Train on queries q1-q5 / test on queries q6-q10
- Extract nodes of type NE
- Graph walk: 6 steps, uniform/learned weights
- Reranking: top 200 nodes (using learned weights)
- Path trees: 20 correct / 20 incorrect paths, threshold 0.5
19. MUC
[Figure: precision vs. rank curves for city names and person names]
20. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types: conj-and, prep-in, nn, appos (cities); subj, obj, poss, nn (persons)]
21. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types (conj-and, prep-in, nn, appos; subj, obj, poss, nn) and frequent paths: prep-in-inv → conj-and, nn-inv → nn (cities); nsubj → nsubj-inv, appos → nn-inv (persons)]
22. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types (conj-and, prep-in, nn, appos; subj, obj, poss, nn), frequent paths (prep-in-inv → conj-and, nn-inv → nn; nsubj → nsubj-inv, appos → nn-inv), and lexical features: LEX.based, LEX.downtown (cities); LEX.mr, LEX.president (persons)]
23. Vector-space models
- Co-occurrence vectors (counts, window ±2)
- Dependency vectors [Padó & Lapata, Comp Ling 07]
  - A path value function:
    - Length-based value: 1 / length(path)
    - Relation-based value: subj=5, obj=4, obl=3, gen=2, else=1
  - A context selection function:
    - Minimal: verbal predicate-argument (length 1)
    - Medium: + coordination, genitive constructions, noun compounds (length < 3)
    - Maximal: combinations of the above (length < 4)
  - A similarity function: cosine or Lin
- Only score the top nodes retrieved with reranking (1000 overall)
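A small sketch of the vector-space alternative: score candidates by cosine similarity over dependency-context vectors. The context keys and counts below are invented for illustration, with weights loosely following the relation-based value function above.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy dependency-context vectors: each key is a (relation, head) context,
# each value a path-weighted count (cf. subj=5, obj=4, ...). The counts
# are invented, not from a real corpus.
boys = {("nsubj", "like"): 5, ("nsubj", "play"): 5}
girls = {("nsubj", "like"): 5, ("nsubj", "dance"): 5}
cars = {("prep.with", "playing"): 3}

print(cosine(boys, girls))  # ~0.5 -- shared subject contexts
print(cosine(boys, cars))   # 0.0 -- no shared contexts
```

Coordinate terms like boys/girls share syntactic contexts, so they score high; unrelated terms share none.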
24. GWs vs. vector models
MUC
[Figure: precision vs. rank curves for city names and person names]
- The graph-based methods are best (syntax + learning)
25. GWs vs. vector models
MUC + AP
[Figure: precision vs. rank curves for city names and person names]
- The advantage of the graph-based models diminishes with the amount of data.
- This is hard to evaluate at high ranks (manual labeling).
26. Outline
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity (with Einat Minkov, CMU → Nokia)
  - Approach 2: a carefully-engineered graph derived from a huge corpus (with Richard Wang, CMU → ?)
27. Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 07)
- Basic ideas:
  - Dynamically build the graph using queries to the web
  - Constrain the graph to be as useful as possible
    - Be smart about queries
    - Be smart about patterns: use clever methods for finding meaningful structure on web pages
28. System Architecture
[Example expanded entity list: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung]
- Fetcher: download web pages from the Web that contain all the seeds
- Extractor: learn wrappers from web pages
- Ranker: rank entities extracted by wrappers
29. The Extractor
- Learn wrappers from web documents and seeds on the fly
  - Utilize semi-structured documents
- Wrappers are defined at the character level
  - Very fast
  - No tokenization required, thus language-independent
- Wrappers derived from document d are applied to d only
  - See the ICDM 2007 paper for details
30.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order:
  - ford1: "/ecnanif=ferh a< yllareneG .."
  - Ford2: ">drof/ecnanif=ferh a< yllareneG .."
  - honda1: "/ecnanif=ferh a< ot derapmoc"
  - Honda2: ">adnoh/ecnanif=ferh a< ot derapmoc"
- Organize these into a trie, tagging each node with a set of seeds
[Trie figure: the root (tagged f1, f2, h1, h2) branches into "/ecnanif=ferh a<" (tagged f1, h1), which continues to ".. yllareneG" (f1) and "ot derapmoc" (h1), and into ">" (tagged f2, h2), which continues to the reversed ford (f2) and honda (h2) prefixes]
31.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order
- Organize these into a trie, tagging each node with a set of seeds
- A left context for a valid wrapper is a node tagged with one instance of each seed
[Trie figure as on the previous slide: the nodes "/ecnanif=ferh a<" (tagged f1, h1) and ">" (tagged f2, h2) each carry one instance of each seed, so both are valid left contexts]
32.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order
- Organize these into a trie, tagging each node with a set of seeds
- A left context for a valid wrapper is a node tagged with one instance of each seed
- The corresponding right context is the longest common suffix of the corresponding seed instances
[Trie figure with right contexts: for left context "<a href=finance/", the seed instances are followed by ">Ford</a> sales" and ">Honda</a> while", giving right context ">"; for left context ">", both instances are followed by "</a>", giving right context "</a>"]
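The left/right-context construction can be sketched for a single occurrence of each seed. The function `common_affixes` and the toy document below are hypothetical (not the slide's page); the full extractor handles all occurrences of every seed via the trie over reversed prefixes.

```python
def common_affixes(doc, seeds):
    """Character-level wrapper learning (sketch): the left context is the
    longest string immediately preceding every seed, found by reversing
    the prefixes and taking their longest common prefix; the right
    context is the longest common prefix of what follows each seed."""
    def lcp(strings):
        # longest common prefix of all strings (min/max lexicographic trick)
        s, t = min(strings), max(strings)
        for i, c in enumerate(s):
            if c != t[i]:
                return s[:i]
        return s

    lefts = [doc[:doc.index(seed)][::-1] for seed in seeds]
    rights = [doc[doc.index(seed) + len(seed):] for seed in seeds]
    return lcp(lefts)[::-1], lcp(rights)

# Hypothetical semi-structured document:
doc = '<li><a href="ford.html">Ford</a></li><li><a href="honda.html">Honda</a></li>'
left, right = common_affixes(doc, ["Ford", "Honda"])
print(repr(left), repr(right))  # '.html">' '</a></li>'
```

Because the wrapper is a pure character-level affix pair, it works unchanged on Chinese or Japanese pages, which is what makes SEAL nearly language-independent.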
33. [Cartoon: noisy extractions announcing "I am noise" / "Me too!"]
34. The Ranker
- Rank candidate entity mentions based on similarity to the seeds
  - Noisy mentions should be ranked lower
- Random Walk with Restart (GW), as before
- What's the graph?
35. Building a Graph
[Figure: seeds (ford, nissan, toyota) --find--> documents (northpointcars.com, curryauto.com) --derive--> wrappers 1-4 --extract--> mentions with scores (chevrolet 22.5, volvo chicago 8.4, honda 26.1, acura 34.6, bmw pittsburgh 8.4)]
- A graph consists of a fixed set of:
  - Node types: seeds, document, wrapper, mention
  - Labeled directed edges: find, derive, extract
- Each edge asserts that a binary relation r holds
- Each edge has an inverse relation r-1 (the graph is cyclic)
- Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions
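A minimal sketch of this typed graph with inverse edges, plus a short uniform walk from the seed node. The node names and triples are illustrative, and the plain fixed-length walk stands in for the system's random walk with restart.

```python
from collections import defaultdict

def build_graph(triples):
    """SEAL-style typed graph: every (src, rel, dst) edge also gets an
    inverse edge (dst, rel-1, src), so walk probability can flow both
    ways: good wrappers <-> good extractions."""
    adj = defaultdict(list)
    for src, rel, dst in triples:
        adj[src].append((rel, dst))
        adj[dst].append((rel + "-1", src))
    return adj

def walk_scores(adj, start, steps=3):
    """Uniform random walk from the start node for a few steps;
    returns the visit distribution (a stand-in for RWR ranking)."""
    v = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in v.items():
            nbrs = adj[node]
            for _, dst in nbrs:
                nxt[dst] += p / len(nbrs)
        v = dict(nxt)
    return v

# Toy graph: two wrappers extract "honda"; only one extracts the noise.
adj = build_graph([
    ("seeds", "find", "doc1"),
    ("seeds", "find", "doc2"),
    ("doc1", "derive", "wrapper1"),
    ("doc2", "derive", "wrapper2"),
    ("wrapper1", "extract", "honda"),
    ("wrapper1", "extract", "noise-term"),
    ("wrapper2", "extract", "honda"),
])
scores = walk_scores(adj, "seeds")
print(scores["honda"] > scores["noise-term"])  # True
```

The mention extracted by two wrappers accumulates more walk probability than the noise extracted by one, which is exactly the mutual-reinforcement intuition on the slide.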
36. Evaluation Datasets: closed sets
37. Evaluation Method
- Mean Average Precision (MAP)
  - Commonly used for evaluating ranked lists in IR
  - Contains recall- and precision-oriented aspects
  - Sensitive to the entire ranking
  - Mean of the average precisions over the ranked lists, where for a ranked list L of extracted mentions:
    AP(L) = (1 / #true entities) × Σ_r Prec(r) × isFresh(r)
    with Prec(r) = precision at rank r, #true entities = total number of true entities in the dataset, and isFresh(r) = 1 iff (a) the extracted mention at rank r matches a true mention and (b) no other extracted mention at rank less than r is of the same entity as the one at r
- Evaluation procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP over the five ranked lists
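The AP computation above can be sketched as follows; the mention-to-entity mapping and the toy ranked lists are invented for illustration.

```python
def average_precision(ranked, truth):
    """AP for a ranked list of extracted mentions, per the slide: rank r
    contributes Prec(r) only when (a) the mention at r matches a true
    mention and (b) no higher-ranked mention maps to the same entity;
    dividing by the number of true entities penalizes missed entities."""
    seen_entities, correct, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = truth.get(mention)      # mention -> true entity, if any
        if entity is None:
            continue
        correct += 1                     # counts toward Prec(r)
        if entity not in seen_entities:  # entity is "fresh" at rank r
            seen_entities.add(entity)
            ap += correct / r            # Prec(r)
    return ap / len(set(truth.values()))

# Hypothetical dataset: two true entities, each with several mentions.
truth = {"NYC": "new_york", "New York": "new_york", "LA": "los_angeles"}
print(average_precision(["NYC", "New York", "LA", "junk"], truth))  # 1.0
print(average_precision(["junk", "NYC", "LA"], truth))              # 0.5833...
```

Note the second mention of new_york at rank 2 still counts toward Prec(r) but adds no fresh-entity term, so duplicate mentions of an already-found entity neither help nor hurt recall.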
38. Experimental Results: 3 seeds
- Vary: extractor × ranker × top N URLs
- Extractor:
  - E1: baseline extractor (longest common context for all seed occurrences)
  - E2: smarter extractor (longest common context for 1 occurrence of each seed)
- Ranker: EF = baseline (most frequent), GW = graph walk
- N URLs: 100, 200, 300
39. Side-by-side comparisons
[Talukdar, Brants, Liberman, Pereira, CoNLL 06]
40. Side-by-side comparisons
- EachMovie vs. WWW
- NIPS vs. WWW
[Ghahramani & Heller, NIPS 2005]
41. A limitation of the original SEAL
42. Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
- Makes several calls to SEAL; each call:
  - Expands a couple of seeds
  - Aggregates statistics
- Evaluate iSEAL using:
  - Two iterative processes: supervised vs. unsupervised (bootstrapping)
  - Two seeding strategies: fixed seed size vs. increasing seed size
  - Five ranking methods
43. iSEAL (Fixed Seed Size, Supervised)
- Start from the initial seeds
- Finally, rank nodes by proximity to the seeds in the full graph
- Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, ...
- Variant (Bootstrap): use high-confidence extractions when seeds run out
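The outer loop might be sketched as follows. Here `iseal` and `fake_expand` are hypothetical stand-ins, with `expand` playing the role of one SEAL call that maps a small seed batch to scored candidates.

```python
def iseal(expand, initial_seeds, seed_sizes=(2, 3, 4, 4), bootstrap=True):
    """iSEAL outer loop (sketch): several SEAL calls, each expanding a
    small seed batch; statistics are aggregated across calls. With
    bootstrap=True, high-confidence extractions become seeds once the
    supervised seeds run out."""
    scores = {}
    pool = list(initial_seeds)
    used = 0
    for size in seed_sizes:
        if used + size > len(pool):
            if not bootstrap:
                break
            # fall back on the best extractions not yet used as seeds
            pool += [e for e in sorted(scores, key=lambda e: -scores[e])
                     if e not in pool]
        batch, used = pool[used:used + size], used + size
        for entity, score in expand(batch).items():
            scores[entity] = scores.get(entity, 0.0) + score  # aggregate
    return sorted(scores, key=lambda e: -scores[e])

def fake_expand(batch):
    # stand-in for one SEAL call: always returns the same scored candidates
    return {"ford": 1.0, "honda": 0.9, "noise": 0.2}

print(iseal(fake_expand, ["ford", "honda", "toyota", "nissan", "bmw"],
            seed_sizes=(2, 3)))  # ['ford', 'honda', 'noise']
```

In the bootstrapping variant the seed quality depends on the ranker, which is why the choice of ranking method matters much more there than in the supervised case.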
44. Ranking Methods
- Random graph walk with restart
  [H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.]
- PageRank
  [L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.]
- Bayesian Sets (over a flattened graph)
  [Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.]
- Wrapper length: weights each item by the length of the common contextual string of that item and the seeds
- Wrapper frequency: weights each item by the number of wrappers that extract the item
49.
- Little difference between ranking methods in the supervised case (all seeds correct); large differences when bootstrapping
- Increasing seed size (2, 3, 4, 4, ...) makes all ranking methods improve steadily in the bootstrapping case
51. Current work
- Start with the name of a concept (e.g., "NFL teams")
- Look for (language-dependent) patterns:
  - "for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, ...)"
- Take the most frequent answers as seeds
- Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, ...
52. Datasets with concept names
54. Experimental results
- Direct use of text patterns
56. Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as seed); no seeds used here.
57. Comparison to Pasca (using web search queries; CIKM 07)
58. Comparison to WordNet + 30k
- Snow et al.: a series of experiments learning hyper/hyponyms
  - Bootstrap from WordNet examples
  - Use dependency-parsed free text
  - E.g., added 30k new instances with fairly high precision
  - Many are concepts + named-entity instances
- Experiments with ASIA on concepts from WordNet show a fairly common problem
  - E.g., "movies" gives as instances comedy, action/adventure, family, drama, ...
  - I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want
59. Comparison to WordNet + 30k
- Filter: a simulated sanity check
  - Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances
  - Run ASIA on each concept
  - Discard the result if less than 50% of the WordNet instances are in ASIA's output
61. Two More Systems to Compare to
- Van Durme & Pasca, 2008
  - Requires an English part-of-speech tagger
  - Analyzed 100 million cached Web documents in English (for many classes)
- Talukdar et al., 2008
  - Requires 5 seed instances as input (for each class)
  - Utilizes output from Van Durme's system and 154 million tables from the WebTables database (for many classes)
- ASIA
  - Does not require any part-of-speech tagger (nearly language-independent)
  - Supports multiple languages such as English, Chinese, and Japanese
  - Analyzes around 200-400 Web documents (for each class)
  - Requires only the class name as input
  - Given a class name, extraction usually finishes within a minute (including the network latency of fetching web pages)
62.
- Precisions of Talukdar's and Van Durme's systems were obtained from Figure 2 in [Talukdar et al., 2008].
63. (For your reference)
64. Top 10 Instances from ASIA
65. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph
66. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph, approach 1 (Minkov & Cohen, EMNLP 08)
  - Graph: a dependency-parsed corpus
  - Off-the-shelf distance metrics are not great
  - With a little bit of learning, results are significantly better than the state of the art on small corpora (e.g., a personal email corpus)
  - Results are competitive on 2M-word corpora
68. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph, approach 2 (Wang & Cohen, ICDM 07, 08)
  - Graph built on the fly with web queries
  - A good graph matters!
  - Off-the-shelf distance metrics work
    - Differences are minimal for clean seeds
    - Modest improvements from learning with clean seeds, e.g., reranking (not described here)
    - Bigger differences between similarity measures with noisy seeds
69. Thanks to
- DARPA PAL program (Minkov, Cohen, Wang)
- Yahoo! Research Labs (Minkov)
- Google Research Grant program (Wang)

Sponsored links: http://boowa.com (Richard's demo)