Disambiguating Toponyms in News - PowerPoint PPT Presentation

1
Disambiguating Toponyms in News
  • Eric Garbin and Inderjeet Mani
  • Department of Linguistics
  • Georgetown University
  • egarbin_at_cox.net, im5_at_georgetown.edu

2
Motivation for Toponym Disambiguation
  • Toponyms (Place Names) are often ambiguous
  • 250 La Esperanzas in Colombia
  • 40 Buffalos in the U.S. (from GNIS)
  • 29 Berlins in the U.S. (from GNIS)
  • Grounding of spatial linguistic information can
    be useful in GIS systems
  • mapping to lat-longs, boundaries for map display
  • spatial reasoning related to geographical
    inclusion
  • GIS-related applications
  • Retrieve geographically-relevant news documents
  • Visualize on a map the places a document
    collection references
  • Answer a question using a geographical knowledge
    base

3
Toponym Questions
  • How ambiguous are toponyms in news?
  • Corpus study using English news
  • What resources can be used to build a
    disambiguator?
  • Internet-harvested gazetteer or toponym list
  • From www.worldatlas.com, www.world-gazetteer.com,
    and www.geonames.usgs.gov
  • Raw corpora
  • Human-annotated corpora
  • Given the lack of annotated data, how well can we
    learn from automatic annotation?
  • Noisy Annotation using LexScan and Heuristic
    Disambiguation Preferences
  • LexScan tags strings in text with entry
    information from gazetteer
  • Machine Learning from Noisy Annotation

4
Quantifying Toponym Ambiguity
  • 13,860 toponyms (out of a total of 27,649) have
    multiple entries in USGS Concise Gazetteer (GNIS)
  • 1827 of them found in one month of English
    Gigaword (MAC1)
  • 588 of those had discriminators within a local
    (5-word) window
  • i.e., discriminators matching gazetteer features
    that disambiguated the toponym: County, State,
    Type, Lat, Long, Elevation
  • Thus, only 1/3 of the 1827 toponyms found in MAC1
    that were ambiguous in GNIS had a local
    discriminator.
  • Discriminators are usually global (document
    level)
  • 73% of toponyms with a local discriminator at
    first mention lacked a local discriminator on all
    subsequent mentions

5
Resources: Gazetteer
  • Sources: World Gazetteer, World Atlas, GNIS
  • Coverage: cities with population > 0.5M,
    countries and country-capitals, U.S. states
  • Processing: make LexScan-readable, strip
    diacritics, expand aliases, merge entries,
    preference-based order
  • Result: merged gazetteer (WAG)
6
Resources: Corpora
[Diagram: corpora and their roles: DEV, TR TEST, GOLD]
7
Noisy Annotation LexScan
  • Creates a lexical trie from a file of
    phrase-entry pairs
  • Text is scanned, phrase looked up in the trie,
    phrase is tagged with G1,..,G4,CLASS
  • Aliases allowed, complete match required
  • Preferred sense indicated by file order, which is
    established by heuristic preferences
  • Scan MAC1 for two dozen very frequent toponyms
    ambiguous in WAG, e.g., Washington, Georgia, New
    York, Valencia
  • Prefer MAC1's most frequent sense for that
    toponym
  • e.g., Washington is predominantly a Capital in
    MAC1
  • For toponyms outside this most frequent set,
    prefer the most specific sense in MAC1
  • Capital > Ppl > Civil
  • since there are fewer Capitals than Populated
    Places, prefer Capitals to Populated Places
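A LexScan-style scanner can be sketched as a token trie with longest-match lookup, where keeping only the first-loaded tag for a phrase mirrors the preference-based file order. The phrases and tags below are illustrative:

```python
# Minimal sketch of a LexScan-style longest-match scanner. The trie maps
# token sequences to gazetteer tags; when a toponym has several entries,
# the first one loaded wins, mirroring the preference-ordered file.
def build_trie(pairs):
    trie = {}
    for phrase, tag in pairs:
        node = trie
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node.setdefault("$tag", tag)  # first entry = preferred sense
    return trie

def scan(tokens, trie):
    tags, i = [], 0
    while i < len(tokens):
        node, match = trie, None
        for j in range(i, len(tokens)):
            if tokens[j] not in node:
                break
            node = node[tokens[j]]
            if "$tag" in node:
                match = (i, j + 1, node["$tag"])  # longest match so far
        if match:
            tags.append(match)
            i = match[1]
        else:
            i += 1
    return tags

trie = build_trie([("Washington", "CAPITAL"), ("Washington", "CIVIL"),
                   ("New York", "PPL")])
print(scan("He flew from New York to Washington".split(), trie))
# → [(3, 5, 'PPL'), (6, 7, 'CAPITAL')]
```

Because matching is longest-first, "New York" is tagged as one phrase rather than leaving "York" untagged, and the duplicate "Washington" entry never overrides the preferred CAPITAL sense.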

8
Feature Exploration
  • Features are terms found in a window around CLASS
    (±3 to ±20 words)
  • Are features collocated with CLASS more frequently
    than chance?
  • tf.idf-filtered features scored for pointwise
    mutual information with CLASS

9
Feature Exploration
(Terms found within ±3 words of CLASS = PPL)
  • Are features collocated significantly more with
    one CLASS than another?
  • 95% confidence if t > 1.645
  • city (t = 1.19) was added as a feature
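The PMI scoring step above can be sketched directly from co-occurrence counts; the counts below are toy values for illustration, not figures from the study:

```python
import math
from collections import Counter

# Sketch of pointwise mutual information between a context term and a
# toponym CLASS, computed from (term, class) co-occurrence counts
# harvested from windows around tagged toponyms. Toy counts only.
def pmi(term, cls, pair_counts, term_counts, class_counts, n):
    """PMI(term, cls) = log2( P(term, cls) / (P(term) * P(cls)) )."""
    p_joint = pair_counts[(term, cls)] / n
    p_term = term_counts[term] / n
    p_cls = class_counts[cls] / n
    return math.log2(p_joint / (p_term * p_cls))

pair_counts = Counter({("city", "PPL"): 40, ("city", "CIVIL"): 5,
                       ("county", "CIVIL"): 30})
term_counts = Counter({"city": 45, "county": 30})
class_counts = Counter({"PPL": 50, "CIVIL": 40})
n = 100  # total windows observed

print(round(pmi("city", "PPL", pair_counts, term_counts, class_counts, n), 2))
```

A positive score means the term co-occurs with the class more often than independence would predict, which is the filter the slide describes before the significance test.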

10
Features for Toponym Disambiguation
  • Toponym-Internal
  • "ARLINGTON, May 1" → AllCaps = true
  • "Mass.", "U.K.", "N.S.W." → Abbrev = true
  • Local Context
  • "city(-2) of(-1) Lagos" → "city" ∈ LeftPos2
  • "from Brussels to(+1) visit(+2) today(+3)" →
    "visit" ∈ W3Context
  • at least one PMI/TS term in window
  • Global Context
  • Sports news → PPL ∈ TagDiscourse
  • "<PPL>Boston</PPL>, Mass. ... park the car in
    Boston" → PPL ∈ CorefClass
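A minimal sketch of extracting these feature types for one toponym mention follows; the paper does not give its exact feature encoding, so the names and representations here are assumptions:

```python
import re

# Sketch of the feature families on this slide for one mention:
# toponym-internal (AllCaps, Abbrev), local context (signed window
# positions), and global context (classes seen in the document).
def extract_features(tokens, i, doc_classes, window=3):
    tok = tokens[i]
    feats = {
        "AllCaps": tok.isupper(),                        # e.g. ARLINGTON
        "Abbrev": bool(re.fullmatch(r"([A-Z][a-z]*\.)+", tok)),  # Mass., U.K.
    }
    # Local context: words at signed offsets around the toponym
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats[f"Pos{off:+d}"] = tokens[j].lower()
    # Global context: toponym classes mentioned elsewhere in the document
    feats["TagDiscourse"] = frozenset(doc_classes)
    return feats

print(extract_features("the city of Lagos is growing".split(), 3, {"PPL"}))
```

The CorefClass feature would additionally propagate a class from an earlier, locally disambiguated mention to later bare mentions of the same name; that step needs document-level bookkeeping and is omitted here.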

11
Machine Learning Results
  • Best performance on HAC is 78.5% (SVM)
  • Best results here (78.3%) shown for Ripper
  • Accuracy grows substantially with training set
    size
  • Combining MAC-ML with MAC-DEV improved accuracy
    on HAC (over MAC-DEV alone) by about 13% for k = 20

[Chart: accuracy vs. training data size (5.47M and 11.68M words)]
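Since the slides report Ripper results, a toy rule learner in that spirit can illustrate training on noisily annotated feature vectors. Real Ripper grows and prunes conjunctive rules; this single-test version is only a sketch, with invented toy data:

```python
from collections import Counter, defaultdict

# Toy sketch in the spirit of Ripper rule induction: rank single
# (feature, value) tests by precision on the (noisy) training set and
# fall back to the majority class. Training examples are invented.
def learn(examples):
    """examples: list of (feature_dict, class_label) pairs."""
    stats = defaultdict(Counter)
    for feats, cls in examples:
        for fv in feats.items():
            stats[fv][cls] += 1
    rules = []
    for fv, counts in stats.items():
        cls, hits = counts.most_common(1)[0]
        rules.append((hits / sum(counts.values()), fv, cls))
    rules.sort(key=lambda r: -r[0])       # most precise rules first
    default = Counter(c for _, c in examples).most_common(1)[0][0]
    return rules, default

def predict(feats, rules, default):
    for _, (f, v), cls in rules:
        if feats.get(f) == v:
            return cls
    return default

train = [({"Pos-1": "of", "AllCaps": False}, "PPL"),
         ({"Pos+1": "county", "AllCaps": False}, "CIVIL"),
         ({"Pos-1": "of", "AllCaps": True}, "PPL")]
rules, default = learn(train)
print(predict({"Pos-1": "of"}, rules, default))  # → PPL
```

With noisy annotation, low-precision rules induced from mistagged examples sink to the bottom of the list, which is one intuition for why accuracy keeps improving as more noisy data is added.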
12
Machine Learning Results (cont.)
  • Best performance on HAC is 78.5% (SVM)
  • Best results here (78.3%) shown for Ripper
  • More noisy training data shrinks the difference
    between performance on HAC and performance on the
    noisy annotated data
  • (e.g., MAC-DEV + MAC-ML)

[Chart: accuracy vs. training data size (5.47M and 11.68M words)]
13
Issues in Toponym Disambiguation
  • TagDiscourse (the set of toponym CLASSes
    mentioned in the document) was useful
  • Ignoring TagDiscourse during learning dropped the
    accuracy by nearly 9%
  • Indicates that prior mention of a class increases
    the likelihood of that class
  • Window Size
  • The increase in accuracy from more training data
    was smaller for smaller windows, e.g., 4.5% for k = 3
  • Increasing window size only lowered accuracy
  • e.g., increasing from ±3 to ±4, ±10, or ±20
    lowered accuracy by 5.7 percentage points on
    MAC-DEV

14
Related Work
  • Machine learning to induce Internet gazetteers
    (Uryupina, 2003)
  • Hand-coded rules and discourse features (Li,
    Srihari et al. 2003)
  • Minimally supervised approach of (Smith & Mann,
    2003)
  • Train on disambiguated toponyms, e.g., "Nashville, Tenn."
  • Expand training set with one-sense-per-discourse
    assumption
  • Test on news with disambiguators stripped out

15
Conclusion and Future Work
  • Results
  • More than 2/3 of toponyms found in a month of
    NYT that were ambiguous in a gazetteer lacked a
    local discriminator in the text
  • Merged internet data into a common gazetteer
    format
  • An unsupervised machine learner trained from noisy
    annotation had 78.5% disambiguation accuracy on a
    human-annotated news corpus
  • To be explored...
  • Cluster terms for data sparseness
  • Combine with minimally supervised approaches
    (Brill 1995) (Smith and Mann 2003)
  • Expand gazetteer and add geo-visualization system
  • Knowledge-poor approach can extend to
    multilingual extraction

16
Recall, Precision, F1 for W = 20
  • CIVIL
  • PPL

[Charts: recall, precision, F1 for CIVIL and PPL vs. training data size (5.47M and 11.68M words)]
17
Gazetteer Entry