Language-Independent Set Expansion of Named Entities using the Web - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Language-Independent Set Expansion of Named Entities using the Web
  • Richard C. Wang, William W. Cohen
  • Language Technologies Institute
  • Carnegie Mellon University
  • Pittsburgh, PA 15213 USA

2
Outline
  • Introduction
  • System Architecture
  • Fetcher
  • Extractor
  • Ranker
  • Evaluation
  • Conclusion

3
What is Set Expansion?
  • For example,
  • Given a query: spit, boogers, ear wax
  • Answer is: puke, toe jam, sweat, ...
  • More formally,
  • Given a small number of seeds x1, x2, ..., xk
    where each xi ∈ St
  • Answer is a listing of other probable elements
    e1, e2, ..., en where each ei ∈ St
  • A well-known example of a web-based set expansion
    system is Google Sets
  • http://labs.google.com/sets

4
What is it used for?
  • Derive features for
  • Named Entity Recognition (Settles, 2004)
    (Talukdar, 2006)
  • Expand true named entities in training set
  • Utilize expanded names to assign features to
    words
  • Concept Learning (Cohen, 2000)
  • Given a set of instances, look in web pages for
    tables or lists that contain some of those
    instances
  • Automatically extract features from those pages
  • Define features over the instances found
  • Relation Learning (Cafarella et al., 2005)
    (Etzioni et al., 2005)
  • Extract items from tables or lists that contain
    given seeds
  • Utilize extracted items and their contexts for
    learning relations

5
Our Set Expander: SEAL
Set Expander for Any Language
  • Features
  • Independent of human/markup language
  • Support seeds in English, Chinese, Japanese,
    Korean, ...
  • Accept documents in HTML, XML, SGML, TeX, WikiML, ...
  • Does not require pre-annotated training data
  • Utilize a readily-available corpus: the World Wide Web
  • Learns wrappers on the fly
  • Based on two research contributions
  • Automatic construction of wrappers
  • Extracts lists of entities on semi-structured
    web pages
  • Use of random graph walk
  • Ranks extracted entities so that those most
    likely to be in the target set are ranked higher

6
System Architecture
  • Pentax
  • Sony
  • Kodak
  • Minolta
  • Panasonic
  • Casio
  • Leica
  • Fuji
  • Samsung
  • Canon
  • Nikon
  • Olympus
  • Fetcher: downloads web pages from the Web
  • Extractor: learns wrappers from web pages
  • Ranker: ranks entities extracted by wrappers

7
The Fetcher
  • Procedure
  • Compose a search query using all seeds
  • Use the Google API to request the top N URLs
  • We use N = 100, 200, and 300 for evaluation
  • Fetch URLs by using a crawler
  • Send fetched documents to the Extractor
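The fetching procedure above can be sketched as follows; `search_top_urls` and `download` are hypothetical stand-ins for the Google API call and the crawler, neither of which is specified in the slides:

```python
from typing import Callable, List

def compose_query(seeds: List[str]) -> str:
    # Step 1: compose a single search query containing all seeds
    return " ".join(seeds)

def fetch_documents(seeds: List[str],
                    search_top_urls: Callable[[str, int], List[str]],
                    download: Callable[[str], str],
                    n: int = 100) -> List[str]:
    """Request the top-n URLs for the seed query, crawl each URL,
    and return the fetched documents for the Extractor."""
    urls = search_top_urls(compose_query(seeds), n)
    return [download(url) for url in urls]
```

Passing the search and crawl functions in as parameters keeps the sketch testable without network access.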

8
The Extractor
  • Learn wrappers from web documents and seeds on
    the fly
  • Utilize semi-structured documents
  • Wrappers defined at character level
  • No tokenization required, thus language-independent
  • However, very specific, thus page-dependent
  • Wrappers derived from a document d are applied to d
    only

9


Extractor E1 finds maximally-long contexts that
bracket all instances of every seed
It seems to be working but what if I add one
more instance of toyota?
It seems to be working too but how about a more
complex example?




10
I am a noisy entity mention
Me too!
Can you find common contexts that bracket
all instances of every seed?
I guess not! Let's try out Extractor E2 and see
if it works.
Extractor E2 finds maximally-long contexts that
bracket at least one instance of every seed
Hooray! It seems like Extractor E2 works! But how
do we get rid of those noisy entity mentions?
11
Extractor Summary
  • A wrapper consists of a pair of left (L) and
    right (R) context strings
  • All strings between (but not containing) L and R
    are extracted
  • Referred to as candidate entity mentions
  • We compared two versions of wrappers
  • Maximally-long contextual strings that bracket
  • all instances of every seed (Extractor E1)
  • at least one instance of every seed (Extractor E2)
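A simplified, character-level sketch of wrapper learning under the E1 criterion follows. The slides do not give the algorithm's details, so the fixed context-window size and the helper names are illustrative assumptions:

```python
import re
from typing import List, Tuple

def _common_prefix(strings: List[str]) -> str:
    # Longest prefix shared by every string (character level)
    s = strings[0]
    for t in strings[1:]:
        k = 0
        while k < min(len(s), len(t)) and s[k] == t[k]:
            k += 1
        s = s[:k]
    return s

def _common_suffix(strings: List[str]) -> str:
    # Longest shared suffix = longest shared prefix of the reversals
    return _common_prefix([s[::-1] for s in strings])[::-1]

def learn_wrapper(doc: str, seeds: List[str], window: int = 40) -> Tuple[str, str]:
    """Extractor E1 in miniature: the maximally-long left/right
    contexts shared by ALL instances of EVERY seed in this document."""
    lefts, rights = [], []
    for seed in seeds:
        for m in re.finditer(re.escape(seed), doc):
            lefts.append(doc[max(0, m.start() - window):m.start()])
            rights.append(doc[m.end():m.end() + window])
    if not lefts:
        return "", ""
    return _common_suffix(lefts), _common_prefix(rights)

def apply_wrapper(doc: str, left: str, right: str) -> List[str]:
    # Extract every string between (but not containing) left and right;
    # a lookahead keeps adjacent occurrences from swallowing each other.
    pattern = re.escape(left) + r"(.*?)(?=" + re.escape(right) + r")"
    return re.findall(pattern, doc, flags=re.DOTALL)
```

On an HTML list page, two seeds are enough for the learned (L, R) pair to pick out the remaining list items, which is exactly the behavior the example slides walk through.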

12
The Ranker
  • Rank candidate entity mentions based on
    similarity to seeds
  • Noisy mentions should be ranked lower
  • We compare two methods for ranking
  • Extracted Frequency (EF)
  • # of times an entity mention is extracted
  • Random Graph Walk (GW)
  • Probability of an entity mention node being
    reached in a graph (explained in next slide)
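Ranking by Extracted Frequency is essentially counting; the normalization (strip/lowercase) below is an assumption, since the slides do not say how mentions are canonicalized:

```python
from collections import Counter
from typing import List

def rank_by_ef(extractions: List[str]) -> List[str]:
    """Extracted Frequency (EF): rank each candidate mention by the
    number of times any wrapper extracted it. Noisy mentions are
    usually extracted by few wrappers, so they sink in the ranking."""
    counts = Counter(m.strip().lower() for m in extractions)
    return [mention for mention, _ in counts.most_common()]
```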

13
Building a Graph
[Figure: the seed set (ford, nissan, toyota) is linked by "find"
edges to documents such as northpointcars.com and curryauto.com;
"derive" edges link documents to Wrappers 1-4; "extract" edges link
wrappers to mentions, here scored honda 26.1, acura 34.6,
chevrolet 22.5, volvo chicago 8.4, bmw pittsburgh 8.4]
  • A graph consists of a fixed set of
  • Node Types: seeds, document, wrapper, mention
  • Labeled Directed Edges: find, derive, extract
  • Each edge asserts that a binary relation r holds
  • Each edge has an inverse relation r⁻¹ (graph is
    cyclic)

Minkov et al. Contextual Search and Name
Disambiguation in Email using Graphs. SIGIR 2006
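The typed graph with inverse edges can be sketched as below; the node-naming scheme (`doc:`, `wrapper:`, `mention:` prefixes) is illustrative, not from the paper:

```python
from collections import defaultdict

# Each forward relation has an inverse r⁻¹, so every edge can be
# walked in both directions and the graph is cyclic (slide 13).
INVERSE = {"find": "find-1", "derive": "derive-1", "extract": "extract-1"}

class SealGraph:
    """Graph over seeds, documents, wrappers, and mentions."""
    def __init__(self):
        # edges[x][r] = targets reachable from node x via relation r
        self.edges = defaultdict(lambda: defaultdict(list))

    def add(self, x, relation, y):
        # Adding a forward edge also records its inverse
        self.edges[x][relation].append(y)
        self.edges[y][INVERSE[relation]].append(x)

g = SealGraph()
g.add("seeds:ford,nissan,toyota", "find", "doc:curryauto.com")
g.add("doc:curryauto.com", "derive", "wrapper:1")
g.add("wrapper:1", "extract", "mention:honda")
g.add("wrapper:1", "extract", "mention:acura")
```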
14
Random Graph Walk
A walker moves over the graph by first picking an edge relation r
uniformly from those leaving the current node x (find, find⁻¹,
derive, derive⁻¹, extract, or extract⁻¹), then picking a target
node y uniformly from the nodes reachable from x via r. At each
step the walker stays at its current node with stay probability
γ = 0.5. The probability of reaching any node z from x is then
computed recursively: the probability of staying at x (when z = x)
plus the probability of continuing to z through some neighbor y.
[Figure: walk over nodes such as curryauto.com, wrapper 1, and the
mentions honda, acura, ...; legend: nodes x, y, z; edge relation r;
an edge from x to y with relation r; stop probability γ]
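A minimal lazy-walk sketch of this computation (stay probability 0.5, relations and targets picked uniformly); the finite-step iteration and the plain-dict graph encoding are my simplifications, not the paper's implementation:

```python
def graph_walk_scores(edges, start, gamma=0.5, steps=50):
    """Probability that a walker starting at `start` is at each node.
    Per slide 14: at each step, stay put with probability gamma,
    otherwise pick an edge relation uniformly, then a target node
    uniformly. Nodes with no outgoing edges keep all their mass."""
    probs = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for x, p in probs.items():
            rels = edges.get(x, {})
            stay = gamma if rels else 1.0
            nxt[x] = nxt.get(x, 0.0) + stay * p
            for targets in rels.values():
                share = (1.0 - gamma) * p / (len(rels) * len(targets))
                for y in targets:
                    nxt[y] = nxt.get(y, 0.0) + share
        probs = nxt
    return probs
```

Mentions are then ranked by their probability: a mention reachable through many wrappers accumulates more mass than one extracted by a single wrapper, which is how noisy mentions end up ranked lower.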
15
Evaluation Datasets
16
Evaluation Method
  • Mean Average Precision
  • Commonly used for evaluating ranked lists in IR
  • Contains recall and precision-oriented aspects
  • Sensitive to the entire ranking
  • Mean of average precisions for each ranked list

AvgPrec(L) = (1 / |True Entities|) · Σ_r Prec(r) · isFresh(r)

Prec(r): precision at rank r
isFresh(r) = 1 iff (a) the extracted mention at rank r matches any
true mention, and (b) no other extracted mention at a rank less
than r is of the same entity as the one at r
where L: ranked list of extracted mentions, r: rank
  • Evaluation Procedure (per dataset)
  • Randomly select three true entities and use their
    first listed mentions as seeds
  • Expand the three seeds obtained from step 1
  • Repeat steps 1 and 2 five times
  • Compute MAP for the five ranked lists

True Entities: total number of true entities
in this dataset
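The AvgPrec computation can be sketched directly from conditions (a) and (b) above; representing each true entity as a set of acceptable mentions is an assumed encoding:

```python
def average_precision(ranked_mentions, true_entities):
    """true_entities: dict mapping each true entity to the set of its
    mentions. A rank r contributes Prec(r) only when the mention at r
    is correct (condition a) and its entity has not already appeared
    at a smaller rank (condition b)."""
    mention_to_entity = {m: e for e, ms in true_entities.items() for m in ms}
    seen, correct, total = set(), 0, 0.0
    for r, mention in enumerate(ranked_mentions, start=1):
        entity = mention_to_entity.get(mention)
        if entity is None:
            continue
        correct += 1                  # counts toward Prec(r)
        if entity not in seen:        # "fresh": first time this entity appears
            seen.add(entity)
            total += correct / r      # Prec(r) at this rank
    return total / len(true_entities)

def mean_average_precision(ranked_lists, true_entities):
    # MAP: mean of the average precisions of the ranked lists
    return sum(average_precision(l, true_entities)
               for l in ranked_lists) / len(ranked_lists)
```

Dividing by the total number of true entities (not the number retrieved) is what gives MAP its recall-oriented aspect: entities never extracted silently lower the score.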
17
Experimental Results
Legend: Extractor: E1, E2; Ranker: EF = Extracted Frequency,
GW = Graph Walk; Top N URLs: N = 100, 200, 300
[Results chart: MAP for each Extractor / Ranker / N combination]
18
Conclusion and Future Work
  • Conclusion
  • Unsupervised approach for expanding sets of named
    entities
  • Domain and language independent
  • SEAL performs better than Google Sets
  • Higher Mean Average Precision on our datasets
  • Handle not only English, but also Chinese and
    Japanese
  • Future Work
  • Learn from graphs to re-rank extracted mentions
  • Bootstrap named entities by using extracted
    mentions in previous expansion as seeds
  • Identify possible class names for expanded sets
  • e.g., car makers, constellations, presidents

19
References
20
Top three mentions are the seeds
Try it out at http://rcwang.com/seal