Title: Language-Independent Set Expansion of Named Entities using the Web
1Language-Independent Set Expansion of Named
Entities using the Web
- Richard C. Wang and William W. Cohen
- Language Technologies Institute
- Carnegie Mellon University
- Pittsburgh, PA 15213 USA
2Outline
- Introduction
- System Architecture
- Fetcher
- Extractor
- Ranker
- Evaluation
- Conclusion
3What is Set Expansion?
- For example,
- Given a query: spit, boogers, ear wax
- Answer is: puke, toe jam, sweat, ...
- More formally,
- Given a small number of seeds $x_1, x_2, \ldots, x_k$ where each $x_i \in S_t$
- Answer is a listing of other probable elements $e_1, e_2, \ldots, e_n$ where each $e_i \in S_t$
- A well-known example of a web-based set expansion system is Google Sets
- http://labs.google.com/sets
4What is it used for?
- Derive features for
- Named Entity Recognition (Settles, 2004) (Talukdar, 2006)
- Expand true named entities in training set
- Utilize expanded names to assign features to words
- Concept Learning (Cohen, 2000)
- Given a set of instances, look in web pages for tables or lists that contain some of those instances
- Automatically extract features from those pages
- Define features over the instances found
- Relation Learning (Cafarella et al, 2005) (Etzioni et al, 2005)
- Extract items from tables or lists that contain given seeds
- Utilize extracted items and their contexts for learning relations
5Our Set Expander: SEAL
Set Expander for Any Language
- Features
- Independent of human/markup language
- Support seeds in English, Chinese, Japanese, Korean, ...
- Accept documents in HTML, XML, SGML, TeX, WikiML, ...
- Does not require pre-annotated training data
- Utilize readily-available corpus: World Wide Web
- Learns wrappers on the fly
- Based on two research contributions
- Automatic construction of wrappers
- Extracts lists of entities on semi-structured web pages
- Use of random graph walk
- Ranks extracted entities so that those most likely to be in the target set are ranked higher
6System Architecture
- Pentax
- Sony
- Kodak
- Minolta
- Panasonic
- Casio
- Leica
- Fuji
- Samsung
- Fetcher: download web pages from the Web
- Extractor: learn wrappers from web pages
- Ranker: rank entities extracted by wrappers
7The Fetcher
- Procedure
- Compose a search query using all seeds
- Use Google API to request the top N URLs
- We use N = 100, 200, and 300 for evaluation
- Fetch URLs by using a crawler
- Send fetched documents to the Extractor
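The Fetcher procedure can be sketched roughly as follows. This is a minimal illustration, not SEAL's actual code: `compose_query`, `fetch_documents`, and the `search` callable (a stand-in for the Google API) are hypothetical names, and quoting each seed as a phrase is an assumption.

```python
from urllib.request import urlopen


def compose_query(seeds):
    """Join all seeds into one query; quoting each seed is an assumption."""
    return " ".join('"{}"'.format(seed) for seed in seeds)


def default_download(url, timeout=10):
    """Download one page and decode it leniently as UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_documents(seeds, search, n=100, download=default_download):
    """Return {url: page text} for the top n URLs returned by `search`.

    `search(query, n)` is a hypothetical callable that returns a list of
    URLs; it stands in for whatever search API is available.
    """
    query = compose_query(seeds)
    documents = {}
    for url in search(query, n)[:n]:
        try:
            documents[url] = download(url)
        except OSError:
            continue  # skip pages that fail to download
    return documents
```

The resulting dictionary is what would be handed to the Extractor, one document at a time.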
8The Extractor
- Learn wrappers from web documents and seeds on the fly
- Utilize semi-structured documents
- Wrappers defined at character level
- No tokenization required, thus language-independent
- However, very specific, thus page-dependent
- Wrappers derived from document d are applied to d only
9Extractor E1 finds maximally-long contexts that
bracket all instances of every seed
It seems to be working but what if I add one
more instance of toyota?
It seems to be working too but how about a more
complex example?
10I am a noisy entity mention
Me too!
Can you find common contexts that bracket
all instances of every seed?
I guess not! Let's try out Extractor E2 and see
if it works
Extractor E2 finds maximally-long contexts that
bracket at least one instance of every seed
Hooray! It seems like Extractor E2 works! But how
do we get rid of those noisy entity mentions?
11Extractor Summary
- A wrapper consists of a pair of left (L) and right (R) context strings
- All strings between (but not containing) L and R are extracted
- Referred to as candidate entity mentions
- We compared two versions of wrappers
- Maximally-long contextual strings that bracket
- all instances of every seed (Extractor E1)
- at least one instance of every seed (Extractor E2)
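To make the difference between E1 and E2 concrete, here is a brute-force, character-level sketch. It is an illustration under assumptions, not the authors' algorithm: the function names are hypothetical, and E2 is approximated by trying every combination of one occurrence per seed, which is only practical for small pages.

```python
import re
from itertools import product
from os.path import commonprefix


def occurrences(doc, seed):
    """Character spans (start, end) of every occurrence of seed in doc."""
    starts = [m.start() for m in re.finditer("(?=" + re.escape(seed) + ")", doc)]
    return [(s, s + len(seed)) for s in starts]


def common_context(doc, spans):
    """Longest left and right strings shared by all the given spans."""
    left = commonprefix([doc[:s][::-1] for s, _ in spans])[::-1]
    right = commonprefix([doc[e:] for _, e in spans])
    return left, right


def learn_e1(doc, seeds):
    """E1: one wrapper that must bracket ALL instances of every seed."""
    per_seed = [occurrences(doc, seed) for seed in seeds]
    if not all(per_seed):
        return []
    left, right = common_context(doc, [span for spans in per_seed for span in spans])
    return [(left, right)] if left and right else []


def learn_e2(doc, seeds):
    """E2: wrappers that bracket AT LEAST ONE instance of every seed."""
    per_seed = [occurrences(doc, seed) for seed in seeds]
    wrappers = set()
    for combo in product(*per_seed):  # one occurrence per seed
        left, right = common_context(doc, combo)
        if left and right:
            wrappers.add((left, right))
    return sorted(wrappers)


def apply_wrapper(doc, wrapper):
    """Extract every string between L and R; R is matched as a lookahead."""
    left, right = wrapper
    pattern = re.escape(left) + "(.+?)(?=" + re.escape(right) + ")"
    return [m.group(1) for m in re.finditer(pattern, doc)]


doc = ("<ul><li>ford</li><li>nissan</li><li>toyota</li>"
       "<li>honda</li><li>acura</li></ul><p>I love toyota cars</p>")
seeds = ["ford", "nissan", "toyota"]
print(learn_e1(doc, seeds))  # the stray "toyota" in the paragraph breaks E1
for wrapper in learn_e2(doc, seeds):
    print(wrapper, apply_wrapper(doc, wrapper))
```

In this toy page the extra mention of toyota leaves E1 with no common context, while E2 still finds a wrapper. Because the contexts are maximally long, that wrapper is very specific: it misses the last list item, whose right context differs.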
12The Ranker
- Rank candidate entity mentions based on similarity to seeds
- Noisy mentions should be ranked lower
- We compare two methods for ranking
- Extracted Frequency (EF)
- Number of times an entity mention is extracted
- Random Graph Walk (GW)
- Probability of an entity mention node being reached in a graph (explained in next slide)
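As a small illustration of the Extracted Frequency baseline, the sketch below counts how often each mention is extracted across wrappers. The function name and the sample extractions are made up for the example.

```python
from collections import Counter


def rank_by_extracted_frequency(extractions):
    """Rank mentions by how many times any wrapper extracted them.

    `extractions` is a list of mention lists, one list per wrapper.
    """
    counts = Counter(m for mentions in extractions for m in mentions)
    return counts.most_common()


# Hypothetical output of three wrappers.
ranked = rank_by_extracted_frequency([
    ["honda", "acura"],
    ["honda", "bmw pittsburgh"],
    ["honda", "acura", "chevrolet"],
])
print(ranked)
```

A noisy mention that only one wrapper extracts ends up near the bottom, but EF ignores which documents and wrappers the counts came from; that is the information the graph walk uses.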
13Building a Graph
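[Figure: example graph. The seed node "ford, nissan, toyota" is linked by find edges to the documents northpointcars.com and curryauto.com, documents by derive edges to Wrappers 1-4, and wrappers by extract edges to mentions. Numbers shown beside the mentions: acura 34.6, honda 26.1, chevrolet 22.5, volvo chicago 8.4, bmw pittsburgh 8.4]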
- A graph consists of a fixed set of
- Node Types: seeds, document, wrapper, mention
- Labeled Directed Edges: find, derive, extract
- Each edge asserts that a binary relation r holds
- Each edge has an inverse relation r^-1 (graph is cyclic)
Minkov et al. Contextual Search and Name Disambiguation in Email using Graphs. SIGIR 2006
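One possible way to hold such a graph in memory is a nested mapping from node to relation to target nodes, with every edge stored together with its inverse. The sketch below assumes that layout; the helper names and the particular wrapper-to-mention links are illustrative and not taken from the figure.

```python
from collections import defaultdict


def add_edge(graph, source, relation, target):
    """Add a labeled edge and its inverse, so every step can be walked back."""
    graph[source][relation].add(target)
    graph[target][relation + "^-1"].add(source)


def build_graph(seeds, extractions):
    """Build the graph from {document URL: {wrapper id: [mentions]}}."""
    graph = defaultdict(lambda: defaultdict(set))
    seed_node = ", ".join(seeds)
    for url, wrappers in extractions.items():
        add_edge(graph, seed_node, "find", url)
        for wrapper_id, mentions in wrappers.items():
            add_edge(graph, url, "derive", wrapper_id)
            for mention in mentions:
                add_edge(graph, wrapper_id, "extract", mention)
    return graph


# Hypothetical extraction results; the actual links in the slide may differ.
extractions = {
    "northpointcars.com": {
        "Wrapper 1": ["honda", "acura"],
        "Wrapper 2": ["chevrolet", "acura"],
    },
    "curryauto.com": {
        "Wrapper 3": ["honda", "acura", "bmw pittsburgh"],
        "Wrapper 4": ["volvo chicago", "chevrolet"],
    },
}
graph = build_graph(["ford", "nissan", "toyota"], extractions)
print(sorted(graph["acura"]["extract^-1"]))
```

Storing the inverse edges is what makes the graph cyclic: a walk can go from a mention back to its wrappers and on to other mentions.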
14Random Graph Walk
- Probability of picking an edge relation r given a source node x
- Example nodes: curryauto.com, ..., wrapper 1, ..., honda, acura, ...
- Probability of picking a target node y given an edge relation r and source node x
- Edge relations: find, find^-1, derive, derive^-1, extract, extract^-1
- Legend: x, y, z are nodes; r is an edge relation; an edge from x to y has relation r
- Recursive computation of probability
- Stop probability (probability of staying at a node): 0.5
- Probability of reaching any node z from x
- Probability of continuing to node z from x
- Probability of staying at node x
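The walk can be approximated iteratively instead of recursively. The sketch below assumes that an edge relation is picked uniformly among those leaving x and that a target is picked uniformly among the nodes reachable through that relation; the slide does not state these distributions in the text, so treat them as assumptions. The tiny graph and the name `graph_walk` are illustrative.

```python
from collections import defaultdict


def graph_walk(edges, start, stop=0.5, steps=10):
    """Probability of stopping at each node within `steps` steps.

    `edges` maps node -> {relation: [target nodes]}. At each step the walker
    stays with probability `stop`; otherwise it picks a relation uniformly,
    then a target uniformly (both uniform choices are assumptions).
    """
    walking = {start: 1.0}        # probability mass still moving
    reached = defaultdict(float)  # probability mass that has stopped
    for _ in range(steps):
        moved = defaultdict(float)
        for x, mass in walking.items():
            reached[x] += stop * mass
            relations = edges.get(x, {})
            if not relations:  # dead end: the remaining mass stays here
                reached[x] += (1 - stop) * mass
                continue
            for targets in relations.values():
                share = (1 - stop) * mass / len(relations) / len(targets)
                for y in targets:
                    moved[y] += share
        walking = moved
    return dict(reached)


# A tiny hypothetical graph with inverse edges.
edges = {
    "seeds": {"find": ["doc"]},
    "doc": {"find^-1": ["seeds"], "derive": ["w1", "w2"]},
    "w1": {"derive^-1": ["doc"], "extract": ["honda", "acura"]},
    "w2": {"derive^-1": ["doc"], "extract": ["acura"]},
    "honda": {"extract^-1": ["w1"]},
    "acura": {"extract^-1": ["w1", "w2"]},
}
scores = graph_walk(edges, "seeds")
print(scores["acura"], scores["honda"])
```

Here acura is extracted by both wrappers and ends up with more probability than honda, which is the behaviour the ranker relies on to push well-supported mentions up.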
15Evaluation Datasets
16Evaluation Method
- Mean Average Precision
- Commonly used for evaluating ranked lists in IR
- Contains recall and precision-oriented aspects
- Sensitive to the entire ranking
- Mean of average precisions for each ranked list
$$\mathrm{AvgPrec}(L) = \frac{\sum_{r=1}^{|L|} \mathrm{Prec}(r) \times \mathrm{NewEntity}(r)}{\#\,\text{True Entities}}$$
- L: ranked list of extracted mentions; r: rank
- Prec(r): precision at rank r
- NewEntity(r): 1 if both of the following hold, otherwise 0
- (a) Extracted mention at r matches any true mention
- (b) There exists no other extracted mention at rank less than r that is of the same entity as the one at r
- # True Entities: total number of true entities in this dataset
- Evaluation Procedure (per dataset)
- 1. Randomly select three true entities and use their first listed mentions as seeds
- 2. Expand the three seeds obtained from step 1
- 3. Repeat steps 1 and 2 five times
- 4. Compute MAP for the five ranked lists
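A minimal sketch of this metric follows. It sums the precision at every rank where a mention matches a true entity that has not appeared earlier in the list, then divides by the number of true entities. Taking Prec(r) to be the fraction of the top r mentions that match any true mention is an assumption, as are the function names and sample data.

```python
def average_precision(ranked, mention_to_entity, num_true_entities):
    """Average precision of one ranked list of extracted mentions.

    `mention_to_entity` maps every true mention to its entity, so several
    mentions (e.g. aliases) can belong to the same entity.
    """
    seen_entities = set()
    matches = 0
    total = 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue  # condition (a) fails: not a true mention
        matches += 1
        if entity not in seen_entities:  # condition (b): entity is new
            seen_entities.add(entity)
            total += matches / r  # Prec(r), assumed definition
    return total / num_true_entities


def mean_average_precision(ranked_lists, mention_to_entity, num_true_entities):
    """Mean of the average precisions of several ranked lists."""
    scores = [average_precision(ranked, mention_to_entity, num_true_entities)
              for ranked in ranked_lists]
    return sum(scores) / len(scores)


# Hypothetical dataset with three true entities; "honda motor" is an alias.
mention_to_entity = {"honda": "honda", "honda motor": "honda",
                     "acura": "acura", "bmw": "bmw"}
ranked = ["honda", "junk", "honda motor", "acura"]
ap = average_precision(ranked, mention_to_entity, 3)
print(ap)
```

In the evaluation procedure, `mean_average_precision` would be called on the five ranked lists produced by the five random seed selections.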
17Experimental Results
Legend: Extractor, Ranker, Top N URLs
- Extractor: E1 = Extractor E1, E2 = Extractor E2
- Ranker: EF = Extracted Frequency, GW = Graph Walk
- N: 100, 200, 300
18Conclusion and Future Work
- Conclusion
- Unsupervised approach for expanding sets of named entities
- Domain and language independent
- SEAL performs better than Google Sets
- Higher Mean Average Precision on our datasets
- Handle not only English, but also Chinese and Japanese
- Future Work
- Learn from graphs to re-rank extracted mentions
- Bootstrap named entities by using extracted mentions in previous expansion as seeds
- Identify possible class names for expanded sets
- e.g. car makers, constellations, presidents
19References
20Top three mentions are the seeds
Try it out at http://rcwang.com/seal