Title: Graph-Based Methods for Open Domain Information Extraction
1. Graph-Based Methods for Open Domain Information Extraction
- William W. Cohen
- Machine Learning Dept. and Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
2. Traditional IE vs. Open Domain IE

Traditional IE:
- Goal: recognize people, places, companies, times, dates, ... in NL text
- Supervised learning from a corpus completely annotated with the target entity class (e.g., people)
- Linear-chain CRFs
- Language- and genre-specific extractors

Open domain IE:
- Goal: recognize arbitrary entity sets in text
- Minimal info about the entity class
  - Example 1: ICML, NIPS, ...
  - Example 2: machine learning conferences
- Semi-supervised learning from very large corpora (WWW)
- Graph-based learning methods
- Techniques are largely language-independent (!)
  - The graph abstraction fits many languages
3. Examples with three seeds
4. Outline
- History: open-domain IE by pattern-matching
- The bootstrapping-with-noise problem
- Bootstrapping as a graph walk
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity
  - Approach 2: a carefully-engineered graph derived from a huge corpus
5. History: Open-domain IE by pattern-matching (Hearst, 92)
- Start with seeds: NIPS, ICML
- Look through a corpus for certain patterns:
  - "... at NIPS, AISTATS, KDD and other learning conferences ..."
  - "... on the PC of KDD, SIGIR, and ..."
- Expand from seeds to new instances
- Repeat until ___
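As a sketch (not Hearst's original implementation), the expand-and-repeat loop might look like the following. The single regex template and the two-sentence toy corpus are invented for illustration; the second sentence completes the slide's truncated example with a hypothetical ending.

```python
import re

def expand_seeds(corpus, seeds, max_iters=3):
    """Hearst-style bootstrapping: find list patterns that contain a
    known seed, then add the other list members as new instances."""
    # One of Hearst's (1992) templates: "X, Y, Z(,) and other C"
    pattern = re.compile(r"([A-Z]\w+(?:, [A-Z]\w+)*),? and other")
    known = set(seeds)
    for _ in range(max_iters):
        new = set()
        for sentence in corpus:
            m = pattern.search(sentence)
            if not m:
                continue
            items = {x.strip() for x in m.group(1).split(",")}
            if items & known:           # the list mentions a seed...
                new |= items - known    # ...so its other members are candidates
        if not new:                     # "repeat until" nothing new is found
            break
        known |= new
    return known

corpus = [
    "at NIPS, AISTATS, KDD and other learning conferences",
    "on the PC of KDD, SIGIR, and other conferences",
]
print(sorted(expand_seeds(corpus, {"NIPS", "ICML"})))
# -> ['AISTATS', 'ICML', 'KDD', 'NIPS', 'SIGIR']
```

Note that SIGIR is only reached on the second iteration, after KDD has been promoted to a seed — this transitivity is exactly where noise compounds, which motivates the graph-walk view that follows.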
6. Bootstrapping as graph proximity
7. Outline
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity (with Einat Minkov, Nokia)
  - Approach 2: a carefully-engineered graph derived from a huge corpus (examples above) (with Richard Wang, CMU → ?)
8. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: dependency parse of "boys like playing with cars of all kinds", with edges nsubj, partmod, prep.with, prep.of, det and POS tags NN, VB, DT]
A dependency-parsed sentence is naturally represented as a tree.
9. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
A dependency-parsed corpus is naturally represented as a graph.
10. Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
- Open IE goal:
  - Find coordinate terms (e.g., girl/boy, dolls/cars) in the graph, or
  - Find a similarity measure S such that S(girl, boy) is high
- What about off-the-shelf similarity measures?
  - PPR/RWR
  - Hitting time
  - Commute time
  - ...?
11. Personalized PageRank / RWR
- Input: a query language Q, and the graph (nodes, node types, edge labels, edge weights)
- Returns a list of nodes (of the requested type) ranked by graph walk probabilities
- Graph walk parameters: edge weights T, walk length K, and reset probability α
- M[x,y] = probability of reaching y from x in one step = the edge weight from x to y, divided by the total outgoing weight from x
- Personalized PageRank: the reset distribution is biased toward the initial (seed) distribution
- Approximate with power iteration, cut off after a fixed number of iterations K
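A minimal sketch of the power-iteration approximation described above; the function name and the toy two-node graph are illustrative, not from the talk's system.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, K=6):
    """Power-iteration approximation of personalized PageRank / RWR.

    edges: {node: [(neighbor, weight), ...]} -- the graph
    seeds: the reset distribution is uniform over these nodes
    alpha: reset probability; K: iteration (walk length) cutoff
    """
    v0 = {s: 1.0 / len(seeds) for s in seeds}
    v = dict(v0)
    for _ in range(K):
        nxt = {}
        for x, px in v.items():
            out = edges.get(x, [])
            total = sum(w for _, w in out)
            for y, w in out:
                # M[x,y] = edge weight from x to y over total outgoing weight
                nxt[y] = nxt.get(y, 0.0) + (1 - alpha) * px * (w / total)
        for s, ps in v0.items():
            # reset step: bias the walk back toward the seed distribution
            nxt[s] = nxt.get(s, 0.0) + alpha * ps
        v = nxt
    return v

scores = personalized_pagerank({"a": [("b", 1.0)], "b": [("a", 1.0)]}, ["a"])
```

With every node having outgoing edges, the scores remain a probability distribution, and the seed node keeps the largest share of the mass.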
12. [Figure: graph-walk fragment starting at "girls": mention / mention-1 and nsubj / nsubj-1 edges link term nodes (girls, like, boys) to mention nodes (girls1, like1, like2, boys2)]
13. [Figure: two walk paths from "girls" to "boys": one via mention, nsubj, nsubj-1, mention-1 edges (girls → girls1 → like1 → like2 → boys2 → boys), and one via mention, nsubj, partmod, mention-1 edges through "playing" (girls → girls1 → like1 → playing1 → ... → boys)]
14. [Figure: a walk path girls → girls1 → like1 → playing1 → dolls1 → dolls via mention, nsubj, and prep.with edges]
Useful, but not our goal here.
15. Learning a better similarity metric
- Task T (query class): seed words (girl, boy, ...)
- Queries q, a, b, ..., each with its own set of relevant answers
- GRAPH WALK: produces one ranked node list per query
[Figure: three ranked lists of nodes (ranks 1-50), one per query]
- Potential new instances of the target concept (doll, child, toddler, ...)
16. Learning methods
- Weight tuning: weights learned per edge type [Diligenti et al., 2005]
- Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
- Example features for the pair (boys, dolls):
  - nsubj → nsubj-inv
  - nsubj → partmod → prep.in
  - nsubj → partmod → partmod-inv → nsubj-inv
  - lexical unigrams: "like", "playing"
17. Learning methods: Path-Constrained Graph Walk
- PCW (summary): for each node x, learn
  - P(x → z | relevant(z), history(Vq, x))
  - history(Vq, x): the sequence of edge labels leading from Vq to x, with all histories stored in a tree
- Example paths from boys to dolls:
  - nsubj → nsubj-inv
  - nsubj → partmod → prep.in
  - nsubj → partmod → partmod-inv → nsubj-inv
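The PCW statistics can be sketched as below. The `PathTree` class, its method names, and the toy training paths are hypothetical; a real implementation stores histories in an actual prefix tree and uses the estimates to prune the walk, while here a dict keyed by label tuples stands in for the tree.

```python
from collections import defaultdict

class PathTree:
    """Estimate, for each history of edge labels leading from the query
    nodes Vq, how likely a walk with that history is to end at a
    relevant node."""

    def __init__(self):
        self.relevant = defaultdict(int)  # prefix -> count of relevant walks
        self.total = defaultdict(int)     # prefix -> count of all walks

    def observe(self, path, is_relevant):
        # path: sequence of edge labels from Vq to a retrieved node
        for i in range(1, len(path) + 1):
            prefix = tuple(path[:i])
            self.total[prefix] += 1
            if is_relevant:
                self.relevant[prefix] += 1

    def p_relevant(self, history):
        h = tuple(history)
        return self.relevant[h] / self.total[h] if self.total[h] else 0.0

pt = PathTree()
# Hypothetical training walks from "boys": one reaches a correct
# coordinate term, one reaches noise.
pt.observe(["nsubj", "nsubj-inv"], True)
pt.observe(["nsubj", "partmod", "prep.in"], False)
print(pt.p_relevant(["nsubj"]))               # 0.5 -- ambiguous after one step
print(pt.p_relevant(["nsubj", "nsubj-inv"]))  # 1.0
```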
18. City and person name extraction
- City names: Vq = sydney, stamford, greenville, los_angeles
- Person names: Vq = carter, dave_kingman, pedro_ramos, florio
- 10 queries (× 4 seeds) for each task
  - Train on queries q1-q5 / test on queries q6-q10
- Extract nodes of type NE
- Graph walk: 6 steps, uniform/learned weights
- Reranking: top 200 nodes (using learned weights)
- Path trees: 20 correct / 20 incorrect paths, threshold 0.5
19. MUC
[Figure: precision vs. rank curves for city names and person names]
20. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types: conj-and, prep-in, nn, appos (cities); subj, obj, poss, nn (persons)]
21. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types (conj-and, prep-in, nn, appos; subj, obj, poss, nn) and frequent paths: prep-in-inv → conj-and, nn-inv → nn (cities); nsubj → nsubj-inv, appos → nn-inv (persons)]
22. MUC
[Figure: precision vs. rank curves for city names and person names, annotated with frequent edge types (conj-and, prep-in, nn, appos; subj, obj, poss, nn), frequent paths (prep-in-inv → conj-and, nn-inv → nn; nsubj → nsubj-inv, appos → nn-inv), and lexical features: LEX.based, LEX.downtown (cities); LEX.mr, LEX.president (persons)]
23. Vector-space models
- Co-occurrence vectors (counts, window ±2)
- Dependency vectors [Padó & Lapata, Comp Ling 07]
  - A path value function:
    - Length-based value: 1 / length(path)
    - Relation-based value: subj=5, obj=4, obl=3, gen=2, else=1
  - A context selection function:
    - Minimal: verbal predicate-argument (length 1)
    - Medium: + coordination, genitive constructions, noun compounds (length < 3)
    - Maximal: combinations of the above (length < 4)
  - A similarity function: cosine or Lin
- Only score the top nodes retrieved with reranking (1000 overall)
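A small sketch of the vector-space alternative: score candidates by cosine similarity over dependency-context vectors. The context keys and counts below are invented for illustration, with weights loosely following the relation-based value function above.

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse count vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy dependency-context vectors: each key is a (relation, head) context,
# each value a path-weighted count (cf. subj=5, obj=4, ...). The counts
# are invented, not from a real corpus.
boys = {("nsubj", "like"): 5, ("nsubj", "play"): 5}
girls = {("nsubj", "like"): 5, ("nsubj", "dance"): 5}
cars = {("prep.with", "playing"): 3}

print(cosine(boys, girls))  # ~0.5 -- shared subject contexts
print(cosine(boys, cars))   # 0.0 -- no shared contexts
```

Coordinate terms like boys/girls share syntactic contexts, so they score high; unrelated terms share none.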
24. GWs vs. vector models
MUC
[Figure: precision vs. rank curves for city names and person names]
- The graph-based methods are best (syntax + learning)
25. GWs vs. vector models
MUC + AP
[Figure: precision vs. rank curves for city names and person names]
- The advantage of the graph-based models diminishes with the amount of data.
- This is hard to evaluate at high ranks (manual labeling).
26. Outline
- Open-domain IE as finding nodes near seeds on a graph
  - Approach 1: a natural graph derived from a smaller corpus + learned similarity (with Einat Minkov, CMU → Nokia)
  - Approach 2: a carefully-engineered graph derived from a huge corpus (with Richard Wang, CMU → ?)
27. Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 07)
- Basic ideas:
  - Dynamically build the graph using queries to the web
  - Constrain the graph to be as useful as possible
    - Be smart about queries
    - Be smart about patterns: use clever methods for finding meaningful structure on web pages
28. System Architecture
[Example expanded entity list: Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung]
- Fetcher: download web pages from the Web that contain all the seeds
- Extractor: learn wrappers from web pages
- Ranker: rank entities extracted by wrappers
29. The Extractor
- Learn wrappers from web documents and seeds on the fly
  - Utilize semi-structured documents
- Wrappers are defined at the character level
  - Very fast
  - No tokenization required, thus language-independent
- Wrappers derived from document d are applied to d only
  - See the ICDM 2007 paper for details
30.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order:
  - ford1: "/ecnanif=ferh a< yllareneG .."
  - Ford2: ">drof/ecnanif=ferh a< yllareneG .."
  - honda1: "/ecnanif=ferh a< ot derapmoc"
  - Honda2: ">adnoh/ecnanif=ferh a< ot derapmoc"
- Organize these into a trie, tagging each node with a set of seeds
[Trie figure: the root (tagged f1, f2, h1, h2) branches into "/ecnanif=ferh a<" (tagged f1, h1), which continues to ".. yllareneG" (f1) and "ot derapmoc" (h1), and into ">" (tagged f2, h2), which continues to the reversed ford (f2) and honda (h2) prefixes]
31.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order
- Organize these into a trie, tagging each node with a set of seeds
- A left context for a valid wrapper is a node tagged with one instance of each seed
[Trie figure as on the previous slide: the nodes "/ecnanif=ferh a<" (tagged f1, h1) and ">" (tagged f2, h2) each carry one instance of each seed, so both are valid left contexts]
32.
Document: ".. Generally <a href=finance/ford>Ford</a> sales compared to <a href=finance/honda>Honda</a> while <a href=finance/gm>General Motors</a> and <a href=finance/bentley>Bentley</a> .."
- Find the prefix of each seed occurrence and put it in reverse order
- Organize these into a trie, tagging each node with a set of seeds
- A left context for a valid wrapper is a node tagged with one instance of each seed
- The corresponding right context is the longest common suffix of the corresponding seed instances
[Trie figure with right contexts: for left context "<a href=finance/", the seed instances are followed by ">Ford</a> sales" and ">Honda</a> while", giving right context ">"; for left context ">", both instances are followed by "</a>", giving right context "</a>"]
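The left/right-context construction can be sketched for a single occurrence of each seed. The function `common_affixes` and the toy document below are hypothetical (not the slide's page); the full extractor handles all occurrences of every seed via the trie over reversed prefixes.

```python
def common_affixes(doc, seeds):
    """Character-level wrapper learning (sketch): the left context is the
    longest string immediately preceding every seed, found by reversing
    the prefixes and taking their longest common prefix; the right
    context is the longest common prefix of what follows each seed."""
    def lcp(strings):
        # longest common prefix of all strings (min/max lexicographic trick)
        s, t = min(strings), max(strings)
        for i, c in enumerate(s):
            if c != t[i]:
                return s[:i]
        return s

    lefts = [doc[:doc.index(seed)][::-1] for seed in seeds]
    rights = [doc[doc.index(seed) + len(seed):] for seed in seeds]
    return lcp(lefts)[::-1], lcp(rights)

# Hypothetical semi-structured document:
doc = '<li><a href="ford.html">Ford</a></li><li><a href="honda.html">Honda</a></li>'
left, right = common_affixes(doc, ["Ford", "Honda"])
print(repr(left), repr(right))  # '.html">' '</a></li>'
```

Because the wrapper is a pure character-level affix pair, it works unchanged on Chinese or Japanese pages, which is what makes SEAL nearly language-independent.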
33. [Cartoon: noisy extractions announcing "I am noise" / "Me too!"]
34. The Ranker
- Rank candidate entity mentions based on similarity to the seeds
  - Noisy mentions should be ranked lower
- Random Walk with Restart (GW), as before
- What's the graph?
35. Building a Graph
[Figure: seeds (ford, nissan, toyota) --find--> documents (northpointcars.com, curryauto.com) --derive--> wrappers 1-4 --extract--> mentions with scores (chevrolet 22.5, volvo chicago 8.4, honda 26.1, acura 34.6, bmw pittsburgh 8.4)]
- A graph consists of a fixed set of:
  - Node types: seeds, document, wrapper, mention
  - Labeled directed edges: find, derive, extract
- Each edge asserts that a binary relation r holds
- Each edge has an inverse relation r-1 (the graph is cyclic)
- Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions
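A minimal sketch of this typed graph with inverse edges, plus a short uniform walk from the seed node. The node names and triples are illustrative, and the plain fixed-length walk stands in for the system's random walk with restart.

```python
from collections import defaultdict

def build_graph(triples):
    """SEAL-style typed graph: every (src, rel, dst) edge also gets an
    inverse edge (dst, rel-1, src), so walk probability can flow both
    ways: good wrappers <-> good extractions."""
    adj = defaultdict(list)
    for src, rel, dst in triples:
        adj[src].append((rel, dst))
        adj[dst].append((rel + "-1", src))
    return adj

def walk_scores(adj, start, steps=3):
    """Uniform random walk from the start node for a few steps;
    returns the visit distribution (a stand-in for RWR ranking)."""
    v = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in v.items():
            nbrs = adj[node]
            for _, dst in nbrs:
                nxt[dst] += p / len(nbrs)
        v = dict(nxt)
    return v

# Toy graph: two wrappers extract "honda"; only one extracts the noise.
adj = build_graph([
    ("seeds", "find", "doc1"),
    ("seeds", "find", "doc2"),
    ("doc1", "derive", "wrapper1"),
    ("doc2", "derive", "wrapper2"),
    ("wrapper1", "extract", "honda"),
    ("wrapper1", "extract", "noise-term"),
    ("wrapper2", "extract", "honda"),
])
scores = walk_scores(adj, "seeds")
print(scores["honda"] > scores["noise-term"])  # True
```

The mention extracted by two wrappers accumulates more walk probability than the noise extracted by one, which is exactly the mutual-reinforcement intuition on the slide.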
36. Evaluation Datasets: closed sets
37. Evaluation Method
- Mean Average Precision (MAP)
  - Commonly used for evaluating ranked lists in IR
  - Contains recall- and precision-oriented aspects
  - Sensitive to the entire ranking
  - Mean of the average precisions over the ranked lists, where for a ranked list L of extracted mentions:
    AP(L) = (1 / #true entities) × Σ_r Prec(r) × isFresh(r)
    with Prec(r) = precision at rank r, #true entities = total number of true entities in the dataset, and isFresh(r) = 1 iff (a) the extracted mention at rank r matches a true mention and (b) no other extracted mention at rank less than r is of the same entity as the one at r
- Evaluation procedure (per dataset):
  1. Randomly select three true entities and use their first listed mentions as seeds
  2. Expand the three seeds obtained from step 1
  3. Repeat steps 1 and 2 five times
  4. Compute MAP over the five ranked lists
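The AP computation above can be sketched as follows; the mention-to-entity mapping and the toy ranked lists are invented for illustration.

```python
def average_precision(ranked, truth):
    """AP for a ranked list of extracted mentions, per the slide: rank r
    contributes Prec(r) only when (a) the mention at r matches a true
    mention and (b) no higher-ranked mention maps to the same entity;
    dividing by the number of true entities penalizes missed entities."""
    seen_entities, correct, ap = set(), 0, 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = truth.get(mention)      # mention -> true entity, if any
        if entity is None:
            continue
        correct += 1                     # counts toward Prec(r)
        if entity not in seen_entities:  # entity is "fresh" at rank r
            seen_entities.add(entity)
            ap += correct / r            # Prec(r)
    return ap / len(set(truth.values()))

# Hypothetical dataset: two true entities, each with several mentions.
truth = {"NYC": "new_york", "New York": "new_york", "LA": "los_angeles"}
print(average_precision(["NYC", "New York", "LA", "junk"], truth))  # 1.0
print(average_precision(["junk", "NYC", "LA"], truth))              # 0.5833...
```

Note the second mention of new_york at rank 2 still counts toward Prec(r) but adds no fresh-entity term, so duplicate mentions of an already-found entity neither help nor hurt recall.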
38. Experimental Results: 3 seeds
- Vary: extractor × ranker × top N URLs
- Extractor:
  - E1: baseline extractor (longest common context for all seed occurrences)
  - E2: smarter extractor (longest common context for 1 occurrence of each seed)
- Ranker: EF = baseline (most frequent), GW = graph walk
- N URLs: 100, 200, 300
39. Side-by-side comparisons
[Talukdar, Brants, Liberman, Pereira, CoNLL 06]
40. Side-by-side comparisons
- EachMovie vs. WWW
- NIPS vs. WWW
[Ghahramani & Heller, NIPS 2005]
41. A limitation of the original SEAL
42. Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
- Makes several calls to SEAL; each call:
  - Expands a couple of seeds
  - Aggregates statistics
- Evaluate iSEAL using:
  - Two iterative processes: supervised vs. unsupervised (bootstrapping)
  - Two seeding strategies: fixed seed size vs. increasing seed size
  - Five ranking methods
43. iSEAL (Fixed Seed Size, Supervised)
- Start from the initial seeds
- Finally, rank nodes by proximity to the seeds in the full graph
- Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, ...
- Variant (Bootstrap): use high-confidence extractions when seeds run out
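The outer loop might be sketched as follows. Here `iseal` and `fake_expand` are hypothetical stand-ins, with `expand` playing the role of one SEAL call that maps a small seed batch to scored candidates.

```python
def iseal(expand, initial_seeds, seed_sizes=(2, 3, 4, 4), bootstrap=True):
    """iSEAL outer loop (sketch): several SEAL calls, each expanding a
    small seed batch; statistics are aggregated across calls. With
    bootstrap=True, high-confidence extractions become seeds once the
    supervised seeds run out."""
    scores = {}
    pool = list(initial_seeds)
    used = 0
    for size in seed_sizes:
        if used + size > len(pool):
            if not bootstrap:
                break
            # fall back on the best extractions not yet used as seeds
            pool += [e for e in sorted(scores, key=lambda e: -scores[e])
                     if e not in pool]
        batch, used = pool[used:used + size], used + size
        for entity, score in expand(batch).items():
            scores[entity] = scores.get(entity, 0.0) + score  # aggregate
    return sorted(scores, key=lambda e: -scores[e])

def fake_expand(batch):
    # stand-in for one SEAL call: always returns the same scored candidates
    return {"ford": 1.0, "honda": 0.9, "noise": 0.2}

print(iseal(fake_expand, ["ford", "honda", "toyota", "nissan", "bmw"],
            seed_sizes=(2, 3)))  # ['ford', 'honda', 'noise']
```

In the bootstrapping variant the seed quality depends on the ranker, which is why the choice of ranking method matters much more there than in the supervised case.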
44. Ranking Methods
- Random graph walk with restart
  [H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its applications. In ICDM, 2006.]
- PageRank
  [L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998.]
- Bayesian Sets (over a flattened graph)
  [Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.]
- Wrapper length: weights each item by the length of the common contextual string of that item and the seeds
- Wrapper frequency: weights each item by the number of wrappers that extract the item
49.
- Little difference between ranking methods in the supervised case (all seeds correct); large differences when bootstrapping
- Increasing seed size (2, 3, 4, 4, ...) makes all ranking methods improve steadily in the bootstrapping case
51. Current work
- Start with the name of a concept (e.g., "NFL teams")
- Look for (language-dependent) patterns:
  - "for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, ...)"
- Take the most frequent answers as seeds
- Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, ...
52. Datasets with concept names
54. Experimental results
- Direct use of text patterns
56. Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as seed); no seeds used here.
57. Comparison to Pasca (using web search queries; CIKM 07)
58. Comparison to WordNet + 30k
- Snow et al.: a series of experiments learning hyper/hyponyms
  - Bootstrap from WordNet examples
  - Use dependency-parsed free text
  - E.g., added 30k new instances with fairly high precision
  - Many are concepts + named-entity instances
- Experiments with ASIA on concepts from WordNet show a fairly common problem
  - E.g., "movies" gives as instances comedy, action/adventure, family, drama, ...
  - I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want
59. Comparison to WordNet + 30k
- Filter: a simulated sanity check
  - Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances
  - Run ASIA on each concept
  - Discard the result if less than 50% of the WordNet instances are in ASIA's output
61. Two More Systems to Compare to
- Van Durme & Pasca, 2008
  - Requires an English part-of-speech tagger
  - Analyzed 100 million cached Web documents in English (for many classes)
- Talukdar et al., 2008
  - Requires 5 seed instances as input (for each class)
  - Utilizes output from Van Durme's system and 154 million tables from the WebTables database (for many classes)
- ASIA
  - Does not require any part-of-speech tagger (nearly language-independent)
  - Supports multiple languages such as English, Chinese, and Japanese
  - Analyzes around 200-400 Web documents (for each class)
  - Requires only the class name as input
  - Given a class name, extraction usually finishes within a minute (including the network latency of fetching web pages)
62.
- Precisions of Talukdar's and Van Durme's systems were obtained from Figure 2 in [Talukdar et al., 2008].
63. (For your reference)
64. Top 10 Instances from ASIA
65. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph
66. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph, approach 1 (Minkov & Cohen, EMNLP 08)
  - Graph: a dependency-parsed corpus
  - Off-the-shelf distance metrics are not great
  - With a little bit of learning, results are significantly better than the state of the art on small corpora (e.g., a personal email corpus)
  - Results are competitive on 2M-word corpora
68. Summary/Conclusions
- Open-domain IE as finding nodes near seeds on a graph, approach 2 (Wang & Cohen, ICDM 07, 08)
  - Graph built on the fly with web queries
  - A good graph matters!
  - Off-the-shelf distance metrics work
    - Differences are minimal for clean seeds
    - Modest improvements from learning with clean seeds, e.g., reranking (not described here)
    - Bigger differences between similarity measures with noisy seeds
69. Thanks to
- DARPA PAL program (Minkov, Cohen, Wang)
- Yahoo! Research Labs (Minkov)
- Google Research Grant program (Wang)

Sponsored links: http://boowa.com (Richard's demo)