Language-Independent Set Expansion of Named Entities using the Web - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Language-Independent Set Expansion of Named Entities using the Web
  • Richard C. Wang, William W. Cohen
  • Language Technologies Institute
  • Carnegie Mellon University
  • Pittsburgh, PA 15213 USA

2
Outline
  • Introduction
  • System Architecture
  • Fetcher
  • Extractor
  • Ranker
  • Evaluation
  • Conclusion

3
What is Set Expansion?
  • For example,
  • Given a query: spit, boogers, ear wax
  • Answer is: puke, toe jam, sweat, ...
  • More formally,
  • Given a small number of seeds x1, x2, ..., xk
    where each xi ∈ St
  • Answer is a listing of other probable elements
    e1, e2, ..., en where each ei ∈ St
  • A well-known example of a web-based set expansion
    system is Google Sets
  • http://labs.google.com/sets

4
What is it used for?
  • Derive features for
  • Named Entity Recognition (Settles, 2004)
    (Talukdar, 2006)
  • Expand true named entities in training set
  • Utilize expanded names to assign features to
    words
  • Concept Learning (Cohen, 2000)
  • Given a set of instances, look in web pages for
    tables or lists that contain some of those
    instances
  • Automatically extract features from those pages
  • Define features over the instances found
  • Relation Learning (Cafarella et al., 2005)
    (Etzioni et al., 2005)
  • Extract items from tables or lists that contain
    given seeds
  • Utilize extracted items and their contexts for
    learning relations

5
Our Set Expander: SEAL
Set Expander for Any Language
  • Features
  • Independent of human/markup language
  • Support seeds in English, Chinese, Japanese,
    Korean, ...
  • Accept documents in HTML, XML, SGML, TeX, WikiML, ...
  • Does not require pre-annotated training data
  • Utilize a readily-available corpus: the World Wide Web
  • Learns wrappers on the fly
  • Based on two research contributions
  • Automatic construction of wrappers
  • Extracts lists of entities on semi-structured
    web pages
  • Use of random graph walk
  • Ranks extracted entities so that those most
    likely to be in the target set are ranked higher

6
System Architecture
  • Pentax
  • Sony
  • Kodak
  • Minolta
  • Panasonic
  • Casio
  • Leica
  • Fuji
  • Samsung
  • Canon
  • Nikon
  • Olympus
  • Fetcher: downloads web pages from the Web
  • Extractor: learns wrappers from web pages
  • Ranker: ranks entities extracted by wrappers

7
The Fetcher
  • Procedure
  • Compose a search query using all seeds
  • Use the Google API to request the top N URLs
  • We use N = 100, 200, and 300 for evaluation
  • Fetch URLs by using a crawler
  • Send fetched documents to the Extractor
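The fetching procedure above can be sketched as follows; `search_top_urls` and `download` are hypothetical stand-ins for the Google API call and the crawler, neither of which is specified in the slides:

```python
from typing import Callable, List

def compose_query(seeds: List[str]) -> str:
    # Step 1: compose a single search query containing all seeds
    return " ".join(seeds)

def fetch_documents(seeds: List[str],
                    search_top_urls: Callable[[str, int], List[str]],
                    download: Callable[[str], str],
                    n: int = 100) -> List[str]:
    """Request the top-n URLs for the seed query, crawl each URL,
    and return the fetched documents for the Extractor."""
    urls = search_top_urls(compose_query(seeds), n)
    return [download(url) for url in urls]
```

Passing the search and crawl functions in as parameters keeps the sketch testable without network access.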

8
The Extractor
  • Learn wrappers from web documents and seeds on
    the fly
  • Utilize semi-structured documents
  • Wrappers defined at character level
  • No tokenization required, thus language-independent
  • However, very specific, thus page-dependent
  • Wrappers derived from a document d are applied to d
    only

9


Extractor E1 finds maximally-long contexts that
bracket all instances of every seed
It seems to be working but what if I add one
more instance of toyota?
It seems to be working too but how about a more
complex example?




10
I am a noisy entity mention
Me too!
Can you find common contexts that bracket
all instances of every seed?
I guess not! Let's try out Extractor E2 and see
if it works.
Extractor E2 finds maximally-long contexts that
bracket at least one instance of every seed
Hooray! It seems like Extractor E2 works! But how
do we get rid of those noisy entity mentions?
11
Extractor Summary
  • A wrapper consists of a pair of left (L) and
    right (R) context strings
  • All strings between (but not containing) L and R
    are extracted
  • Referred to as candidate entity mentions
  • We compared two versions of wrappers
  • Maximally-long contextual strings that bracket
  • all instances of every seed (Extractor E1)
  • at least one instance of every seed (Extractor E2)
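A simplified, character-level sketch of wrapper learning under the E1 criterion follows. The slides do not give the algorithm's details, so the fixed context-window size and the helper names are illustrative assumptions:

```python
import re
from typing import List, Tuple

def _common_prefix(strings: List[str]) -> str:
    # Longest prefix shared by every string (character level)
    s = strings[0]
    for t in strings[1:]:
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        s = s[:k]
    return s

def _common_suffix(strings: List[str]) -> str:
    # Longest shared suffix = longest shared prefix of the reversals
    return _common_prefix([s[::-1] for s in strings])[::-1]

def learn_wrapper(doc: str, seeds: List[str], window: int = 40) -> Tuple[str, str]:
    """Extractor E1 in miniature: the maximally-long left/right
    contexts shared by ALL instances of EVERY seed in this document."""
    lefts, rights = [], []
    for seed in seeds:
        for m in re.finditer(re.escape(seed), doc):
            lefts.append(doc[max(0, m.start() - window):m.start()])
            rights.append(doc[m.end():m.end() + window])
    if not lefts:
        return "", ""
    return _common_suffix(lefts), _common_prefix(rights)

def apply_wrapper(doc: str, left: str, right: str) -> List[str]:
    # Extract every string between (but not containing) left and right;
    # a lookahead keeps adjacent occurrences from swallowing each other.
    pattern = re.escape(left) + r"(.*?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, doc, flags=re.DOTALL)
```

On an HTML list page, two seeds are enough for the learned (L, R) pair to pick out the remaining list items, which is exactly the behavior the example slides walk through.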

12
The Ranker
  • Rank candidate entity mentions based on
    similarity to seeds
  • Noisy mentions should be ranked lower
  • We compare two methods for ranking
  • Extracted Frequency (EF)
  • # of times an entity mention is extracted
  • Random Graph Walk (GW)
  • Probability of an entity mention node being
    reached in a graph (explained in next slide)
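Ranking by Extracted Frequency is essentially counting; the normalization (strip/lowercase) below is an assumption, since the slides do not say how mentions are canonicalized:

```python
from collections import Counter
from typing import List

def rank_by_ef(extractions: List[str]) -> List[str]:
    """Extracted Frequency (EF): rank each candidate mention by the
    number of times any wrapper extracted it. Noisy mentions are
    usually extracted by few wrappers, so they sink in the ranking."""
    counts = Counter(m.strip().lower() for m in extractions)
    return [mention for mention, _ in counts.most_common()]
```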

13
Building a Graph
[Figure: the seed set (ford, nissan, toyota) is linked by "find"
edges to documents such as northpointcars.com and curryauto.com;
"derive" edges link documents to Wrappers 1-4; "extract" edges link
wrappers to mentions, here scored honda 26.1, acura 34.6,
chevrolet 22.5, volvo chicago 8.4, bmw pittsburgh 8.4]
  • A graph consists of a fixed set of
  • Node Types: seeds, document, wrapper, mention
  • Labeled Directed Edges: find, derive, extract
  • Each edge asserts that a binary relation r holds
  • Each edge has an inverse relation r⁻¹ (graph is
    cyclic)

Minkov et al. Contextual Search and Name
Disambiguation in Email using Graphs. SIGIR 2006
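The typed graph with inverse edges can be sketched as below; the node-naming scheme (`doc:`, `wrapper:`, `mention:` prefixes) is illustrative, not from the paper:

```python
from collections import defaultdict

# Each forward relation has an inverse r⁻¹, so every edge can be
# walked in both directions and the graph is cyclic (slide 13).
INVERSE = {"find": "find-1", "derive": "derive-1", "extract": "extract-1"}

class SealGraph:
    """Graph over seeds, documents, wrappers, and mentions."""
    def __init__(self):
        # edges[x][r] = targets reachable from node x via relation r
        self.edges = defaultdict(lambda: defaultdict(list))

    def add(self, x, relation, y):
        # Adding a forward edge also records its inverse
        self.edges[x][relation].append(y)
        self.edges[y][INVERSE[relation]].append(x)

g = SealGraph()
g.add("seeds:ford,nissan,toyota", "find", "doc:curryauto.com")
g.add("doc:curryauto.com", "derive", "wrapper:1")
g.add("wrapper:1", "extract", "mention:honda")
g.add("wrapper:1", "extract", "mention:acura")
```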
14
Random Graph Walk
A walker moves over the graph by first picking an edge relation r
uniformly from those leaving the current node x (find, find⁻¹,
derive, derive⁻¹, extract, or extract⁻¹), then picking a target
node y uniformly from the nodes reachable from x via r. At each
step the walker stays at its current node with stay probability
γ = 0.5. The probability of reaching any node z from x is then
computed recursively: the probability of staying at x (when z = x)
plus the probability of continuing to z through some neighbor y.
[Figure: walk over nodes such as curryauto.com, wrapper 1, and the
mentions honda, acura, ...; legend: nodes x, y, z; edge relation r;
an edge from x to y with relation r; stop probability γ]
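A minimal lazy-walk sketch of this computation (stay probability 0.5, relations and targets picked uniformly); the finite-step iteration and the plain-dict graph encoding are my simplifications, not the paper's implementation:

```python
def graph_walk_scores(edges, start, gamma=0.5, steps=50):
    """Probability that a walker starting at `start` is at each node.
    Per slide 14: at each step, stay put with probability gamma,
    otherwise pick an edge relation uniformly, then a target node
    uniformly. Nodes with no outgoing edges keep all their mass."""
    probs = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for x, p in probs.items():
            rels = edges.get(x, {})
            stay = gamma if rels else 1.0
            nxt[x] = nxt.get(x, 0.0) + stay * p
            for targets in rels.values():
                share = (1.0 - gamma) * p / (len(rels) * len(targets))
                for y in targets:
                    nxt[y] = nxt.get(y, 0.0) + share
        probs = nxt
    return probs
```

Mentions are then ranked by their probability: a mention reachable through many wrappers accumulates more mass than one extracted by a single wrapper, which is how noisy mentions end up ranked lower.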
15
Evaluation Datasets
16
Evaluation Method
  • Mean Average Precision
  • Commonly used for evaluating ranked lists in IR
  • Contains recall and precision-oriented aspects
  • Sensitive to the entire ranking
  • Mean of average precisions for each ranked list

AvgPrec(L) = (1 / |True Entities|) · Σ_r Prec(r) · isFresh(r)

Prec(r): precision at rank r
isFresh(r) = 1 iff (a) the extracted mention at rank r matches any
true mention, and (b) no other extracted mention at a rank less
than r is of the same entity as the one at r
where L: ranked list of extracted mentions, r: rank
  • Evaluation Procedure (per dataset)
  • Randomly select three true entities and use their
    first listed mentions as seeds
  • Expand the three seeds obtained from step 1
  • Repeat steps 1 and 2 five times
  • Compute MAP for the five ranked lists

True Entities: total number of true entities
in this dataset
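The AvgPrec computation can be sketched directly from conditions (a) and (b) above; representing each true entity as a set of acceptable mentions is an assumed encoding:

```python
def average_precision(ranked_mentions, true_entities):
    """true_entities: dict mapping each true entity to the set of its
    mentions. A rank r contributes Prec(r) only when the mention at r
    is correct (condition a) and its entity has not already appeared
    at a smaller rank (condition b)."""
    mention_to_entity = {m: e for e, ms in true_entities.items() for m in ms}
    seen, correct, total = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue
        correct += 1                  # counts toward Prec(r)
        if entity not in seen:        # "fresh": first time this entity appears
            seen.add(entity)
            total += correct / r      # Prec(r) at this rank
    return total / len(true_entities)

def mean_average_precision(ranked_lists, true_entities):
    # MAP: mean of the average precisions of the ranked lists
    return sum(average_precision(l, true_entities)
               for l in ranked_lists) / len(ranked_lists)
```

Dividing by the total number of true entities (not the number retrieved) is what gives MAP its recall-oriented aspect: entities never extracted silently lower the score.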
17
Experimental Results
Legend: Extractor: E1, E2; Ranker: EF = Extracted Frequency,
GW = Graph Walk; Top N URLs: N = 100, 200, 300
[Results chart: MAP for each Extractor / Ranker / N combination]
18
Conclusion and Future Work
  • Conclusion
  • Unsupervised approach for expanding sets of named
    entities
  • Domain and language independent
  • SEAL performs better than Google Sets
  • Higher Mean Average Precision on our datasets
  • Handle not only English, but also Chinese and
    Japanese
  • Future Work
  • Learn from graphs to re-rank extracted mentions
  • Bootstrap named entities by using extracted
    mentions in previous expansion as seeds
  • Identify possible class names for expanded sets
  • e.g., car makers, constellations, presidents

19
References
20
Top three mentions are the seeds
Try it out at http://rcwang.com/seal