Disambiguating Toponyms in News - PowerPoint PPT Presentation

1
Disambiguating Toponyms in News
  • Eric Garbin and Inderjeet Mani
  • Department of Linguistics
  • Georgetown University
  • egarbin_at_cox.net, im5_at_georgetown.edu

2
Motivation for Toponym Disambiguation
  • Toponyms (Place Names) are often ambiguous
  • 250 La Esperanzas in Colombia
  • 40 Buffalos in the U.S. (from GNIS)
  • 29 Berlins in the U.S. (from GNIS)
  • Grounding of spatial linguistic information can
    be useful in GIS systems
  • mapping to lat-longs, boundaries for map display
  • spatial reasoning related to geographical
    inclusion
  • GIS-related applications
  • Retrieve geographically-relevant news documents
  • Visualize on a map the places a document
    collection references
  • Answer a question using a geographical knowledge
    base

3
Toponym Questions
  • How ambiguous are toponyms in news?
  • Corpus study using English news
  • What resources can be used to build a
    disambiguator?
  • Internet-harvested gazetteer or toponym list
  • From www.worldatlas.com, www.world-gazetteer.com,
    and www.geonames.usgs.gov
  • Raw corpora
  • Human-annotated corpora
  • Given the lack of annotated data, how well can we
    learn from automatic annotation?
  • Noisy Annotation using LexScan and Heuristic
    Disambiguation Preferences
  • LexScan tags strings in text with entry
    information from gazetteer
  • Machine Learning from Noisy Annotation

4
Quantifying Toponym Ambiguity
  • 13,860 toponyms (out of a total of 27,649) have
    multiple entries in USGS Concise Gazetteer (GNIS)
  • 1827 of them found in one month of English
    Gigaword (MAC1)
  • 588 of those had discriminators within a local
    (5-word) window
  • i.e., discriminators matching gazetteer features
    that disambiguated the toponym: County, State,
    Type, Lat, Long, Elevation
  • Thus, only 1/3 of the 1827 toponyms found in MAC1
    that were ambiguous in GNIS had a local
    discriminator.
  • Discriminators are usually global (document
    level)
  • 73% of toponyms with a local discriminator at
    first mention lacked a local discriminator on all
    subsequent mentions

5
Resources: Gazetteer
  • Sources: World Gazetteer, World Atlas, GNIS
  • Coverage: cities with population > 0.5M,
    countries and country-capitals, U.S. states
  • Processing: make LexScan-readable, strip
    diacritics, expand aliases, merge entries,
    preference-based order
  • Result: merged gazetteer (WAG)
6
Resources: Corpora
[Diagram: corpora and their roles: DEV, TR TEST, GOLD]
7
Noisy Annotation LexScan
  • Creates a lexical trie from a file of
    phrase-entry pairs
  • Text is scanned, phrase looked up in the trie,
    phrase is tagged with G1,..,G4,CLASS
  • Aliases allowed, complete match required
  • Preferred sense indicated by file order, which is
    established by heuristic preferences
  • Scan MAC1 for two dozen very frequent toponyms
    ambiguous in WAG, e.g., Washington, Georgia, New
    York, Valencia
  • Prefer MAC1's most frequent sense for that
    toponym
  • e.g., Washington is predominantly a Capital in
    MAC1
  • For toponyms outside this most frequent set,
    prefer the most specific sense in MAC1
  • Capital > Ppl > Civil
  • since there are fewer Capitals than Populated
    Places, prefer Capitals to Populated Places
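A LexScan-style scanner can be sketched as a token trie with longest-match lookup, where keeping only the first-loaded tag for a phrase mirrors the preference-based file order. The phrases and tags below are illustrative:

```python
# Minimal sketch of a LexScan-style longest-match scanner. The trie maps
# token sequences to gazetteer tags; when a toponym has several entries,
# the first one loaded wins, mirroring the preference-ordered file.
def build_trie(pairs):
    trie = {}
    for phrase, tag in pairs:
        node = trie
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node.setdefault("$tag", tag)  # first entry = preferred sense
    return trie

def scan(tokens, trie):
    tags, i = [], 0
    while i < len(tokens):
        node, match = trie, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$tag" in node:
                match = (i, j + 1, node["$tag"])  # longest match so far
        if match:
            tags.append(match)
            i = match[1]
        else:
            i += 1
    return tags

trie = build_trie([("Washington", "CAPITAL"), ("Washington", "CIVIL"),
                   ("New York", "PPL")])
print(scan("He flew from New York to Washington".split(), trie))
# → [(3, 5, 'PPL'), (6, 7, 'CAPITAL')]
```

Because matching is longest-first, "New York" is tagged as one phrase rather than leaving "York" untagged, and the duplicate "Washington" entry never overrides the preferred CAPITAL sense.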

8
Feature Exploration
  • Features are terms found in a window around CLASS
    (±3 to ±20 words)
  • Are features collocated with CLASS more frequently
    than chance?
  • tf.idf-filtered features scored for pointwise
    mutual information with CLASS

9
Feature Exploration
(Terms found within ±3 words of CLASS = PPL)
  • Are features collocated significantly more with
    one CLASS than another?
  • 95% confidence if t > 1.645
  • city (t = 1.19) was added as a feature
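The PMI scoring step above can be sketched directly from co-occurrence counts; the counts below are toy values for illustration, not figures from the study:

```python
import math
from collections import Counter

# Sketch of pointwise mutual information between a context term and a
# toponym CLASS, computed from (term, class) co-occurrence counts
# harvested from windows around tagged toponyms. Toy counts only.
def pmi(term, cls, pair_counts, term_counts, class_counts, n):
    """PMI(term, cls) = log2( P(term, cls) / (P(term) * P(cls)) )."""
    p_joint = pair_counts[(term, cls)] / n
    p_term = term_counts[term] / n
    p_cls = class_counts[cls] / n
    return math.log2(p_joint / (p_term * p_cls))

pair_counts = Counter({("city", "PPL"): 40, ("city", "CIVIL"): 5,
                       ("county", "CIVIL"): 30})
term_counts = Counter({"city": 45, "county": 30})
class_counts = Counter({"PPL": 50, "CIVIL": 40})
n = 100  # total windows observed

print(round(pmi("city", "PPL", pair_counts, term_counts, class_counts, n), 2))
```

A positive score means the term co-occurs with the class more often than independence would predict, which is the filter the slide describes before the significance test.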

10
Features for Toponym Disambiguation
  • Toponym-Internal
  • "ARLINGTON, May 1" → AllCaps = true
  • "Mass.", "U.K.", "N.S.W." → Abbrev = true
  • Local Context
  • "city(-2) of(-1) Lagos" → "city" ∈ LeftPos2
  • "from Brussels to(+1) visit(+2) today(+3)" →
    "visit" ∈ W3Context
  • at least one PMI/TS term in window
  • Global Context
  • Sports news → PPL ∈ TagDiscourse
  • "<PPL>Boston</PPL>, Mass. ... park the car in
    Boston" → PPL ∈ CorefClass
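A minimal sketch of extracting these feature types for one toponym mention follows; the paper does not give its exact feature encoding, so the names and representations here are assumptions:

```python
import re

# Sketch of the feature families on this slide for one mention:
# toponym-internal (AllCaps, Abbrev), local context (signed window
# positions), and global context (classes seen in the document).
def extract_features(tokens, i, doc_classes, window=3):
    tok = tokens[i]
    feats = {
        "AllCaps": tok.isupper(),                        # e.g. ARLINGTON
        "Abbrev": bool(re.fullmatch(r"([A-Z][a-z]*\.)+", tok)),  # Mass., U.K.
    }
    # Local context: words at signed offsets around the toponym
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats[f"Pos{off:+d}"] = tokens[j].lower()
    # Global context: toponym classes mentioned elsewhere in the document
    feats["TagDiscourse"] = frozenset(doc_classes)
    return feats

print(extract_features("the city of Lagos is growing".split(), 3, {"PPL"}))
```

The CorefClass feature would additionally propagate a class from an earlier, locally disambiguated mention to later bare mentions of the same name; that step needs document-level bookkeeping and is omitted here.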

11
Machine Learning Results
  • Best performance on HAC is 78.5% (SVM)
  • Best results here (78.3%) shown for Ripper
  • Accuracy grows substantially with training set
    size
  • Combining MAC-ML with MAC-DEV improved accuracy
    on HAC (over MAC-DEV alone) by about 13% for k = 20

[Chart: accuracy vs. training data size (5.47M and 11.68M words)]
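Since the slides report Ripper results, a toy rule learner in that spirit can illustrate training on noisily annotated feature vectors. Real Ripper grows and prunes conjunctive rules; this single-test version is only a sketch, with invented toy data:

```python
from collections import Counter, defaultdict

# Toy sketch in the spirit of Ripper rule induction: rank single
# (feature, value) tests by precision on the (noisy) training set and
# fall back to the majority class. Training examples are invented.
def learn(examples):
    """examples: list of (feature_dict, class_label) pairs."""
    stats = defaultdict(Counter)
    for feats, cls in examples:
        for fv in feats.items():
            stats[fv][cls] += 1
    rules = []
    for fv, counts in stats.items():
        cls, hits = counts.most_common(1)[0]
        rules.append((hits / sum(counts.values()), fv, cls))
    rules.sort(key=lambda r: -r[0])       # most precise rules first
    default = Counter(c for _, c in examples).most_common(1)[0][0]
    return rules, default

def predict(feats, rules, default):
    for _, (f, v), cls in rules:
        if feats.get(f) == v:
            return cls
    return default

train = [({"Pos-1": "of", "AllCaps": False}, "PPL"),
         ({"Pos+1": "county", "AllCaps": False}, "CIVIL"),
         ({"Pos-1": "of", "AllCaps": True}, "PPL")]
rules, default = learn(train)
print(predict({"Pos-1": "of"}, rules, default))  # → PPL
```

With noisy annotation, low-precision rules induced from mistagged examples sink to the bottom of the list, which is one intuition for why accuracy keeps improving as more noisy data is added.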
12
Machine Learning Results (cont.)
  • Best performance on HAC is 78.5% (SVM)
  • Best results here (78.3%) shown for Ripper
  • More noisy training data shrinks the difference
    between performance on HAC and performance on the
    noisy annotated data
  • (e.g., MAC-DEV + MAC-ML)

[Chart: accuracy vs. training data size (5.47M and 11.68M words)]
13
Issues in Toponym Disambiguation
  • TagDiscourse (the set of toponym CLASSes
    mentioned in the document) was useful
  • Ignoring TagDiscourse during learning dropped the
    accuracy by nearly 9%
  • Indicates that prior mention of a class increases
    the likelihood of that class
  • Window Size
  • The increase in accuracy from more training data
    was smaller for smaller windows, e.g., 4.5% for k = 3
  • Increasing window size only lowered accuracy
  • e.g., increasing from ±3 to ±4, ±10, or ±20
    lowered accuracy by 5.7 percentage points on
    MAC-DEV

14
Related Work
  • Machine learning to induce Internet gazetteers
    (Uryupina, 2003)
  • Hand-coded rules and discourse features (Li,
    Srihari et al. 2003)
  • Minimally supervised approach of (Smith & Mann,
    2003)
  • Train on disambiguated toponyms, e.g., "Nashville, Tenn."
  • Expand training set with one-sense-per-discourse
    assumption
  • Test on news with disambiguators stripped out

15
Conclusion and Future Work
  • Results
  • More than 2/3 of toponyms found in a month of
    NYT that were ambiguous in a gazetteer lacked a
    local discriminator in the text
  • Merged internet data into a common gazetteer
    format
  • An unsupervised machine learner trained from noisy
    annotation had 78.5% disambiguation accuracy on a
    human-annotated news corpus
  • To be explored...
  • Cluster terms for data sparseness
  • Combine with minimally supervised approaches
    (Brill 1995) (Smith and Mann 2003)
  • Expand gazetteer and add geo-visualization system
  • Knowledge-poor approach can extend to
    multilingual extraction

16
Recall, Precision, F1 for W = 20
  • CIVIL
  • PPL

[Charts: recall, precision, F1 for CIVIL and PPL vs. training data size (5.47M and 11.68M words)]
17
Gazetteer Entry