Title: Bootstrapping Toponym Classifiers
1Bootstrapping Toponym Classifiers
- David Smith and Gideon Mann
- Center for Language and Speech Processing
- Johns Hopkins University
2Exploiting Labeled Text
- Writers often insert explicit disambiguating
labels after place names
- A sideshow juggler in Virginia Beach, Va., and a
roller-coaster operator at Hershey Park in
Hershey, Pa., have also been victims of laser
harassment.
- Texts with these labels provide training and
testing examples for geographic name
disambiguation
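As a rough illustration of how such labels could be harvested, the sketch below pulls "Place, Abbrev." pairs out of running text with a regular expression. The abbreviation table and the function name labeled_toponyms are assumptions made for this example, not part of the original system, and a real harvester would need a full list of state and country labels.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (an assumption for this
# sketch); a real system would cover every state and country label.
STATE_ABBREVS = {
    "Va.": "Virginia",
    "Pa.": "Pennsylvania",
    "Tenn.": "Tennessee",
    "Ore.": "Oregon",
}

# One or more capitalized words, then a comma, a space, and a known abbreviation.
_LABEL_RE = re.compile(
    r"\b((?:[A-Z][a-z]+ )*[A-Z][a-z]+), ("
    + "|".join(re.escape(a) for a in STATE_ABBREVS)
    + r")"
)

def labeled_toponyms(text):
    """Return (place, state) pairs for every 'Place, Abbrev.' pattern in text."""
    return [(m.group(1), STATE_ABBREVS[m.group(2)])
            for m in _LABEL_RE.finditer(text)]

sentence = ("A sideshow juggler in Virginia Beach, Va., and a roller-coaster "
            "operator at Hershey Park in Hershey, Pa., have also been victims "
            "of laser harassment.")
print(labeled_toponyms(sentence))
```

Each extracted pair, together with the surrounding words, can then serve as one labeled training or testing context.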
3Classifier Construction
- Some lexical cues help predict a particular place
NASHVILLE , Tenn. - The home of country music is
singing the blues after the sale of its last
locally owned music publishing company to CBS
Records. Tree International Publishing, ranked as
Billboard magazine's No. 1 country music
publisher for the last 16 years, is being sold to
New York-based CBS for a reported 45 million to
50 million, The Tennessean reported today.
4Classifier Construction
- Other contexts provide cues as to general area,
e.g. state
GRANTS PASS, Ore. - ... As more and more federal
lands are set aside for spotted owls and other
types of wildlife and recreation areas, the land
available for perpetual commercial timber
management decreases...
PORTLAND, Ore. - Environmentalists trying to
protect the northern spotted owl cheered a
federal judge's decision halting logging on five
timber tracts...
5Classifier Construction
- Other disambiguation cues are more tenuous
Researchers at Harvard Medical School, and
Canada's London Regional Cancer in London,
Ontario correlated genetic changes found in these
tumors with their sensitivity to chemotherapy.
He accomplished that feat when he knocked down
two Luftwaffe fighters near Brunswick, Germany,
on May 8, 1944, on his final mission.
A sideshow juggler in Virginia Beach, Va., and a
roller-coaster operator at Hershey Park in
Hershey, Pa., have also been victims of laser
harassment.
6How Does This Help?
- Readily available, growing corpus
- Easy to apply WSD methods
- Easy to adapt to many genres
- The crucial questions
- Realism: Do contexts with labels well represent
contexts without labels?
- Generality: Does the method perform well in
different genres?
7Annotated Text
- Hand tag all toponyms to test coverage
All supplies for Rosecrans had to be brought from
<name key="tgn,7013959" reg="Nashville-Davidson
(inhabited place), Davidson, Tennessee, United
States, North and Central America"
type="place">Nashville</name>. The railroad
between this base and the army was in possession
of the government up to <name>Bridgeport</name>,
the point at which the road crosses to the south
side of the <name>Tennessee River</name> but
Bragg, holding Lookout and <name>Raccoon
mountains</name> west of <name>Chattanooga</name>,
commanded the railroad, the river and the
shortest and best wagon-roads, both south and
north of the <name>Tennessee</name>, between
<name>Chattanooga</name> and <name>Bridgeport</name>.
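To test on hand-tagged text, the tagged toponyms first have to be read back out of the markup. The sketch below assumes the annotations are XML-style name elements with an optional key attribute, as in the example above; the function name tagged_toponyms and the regex-based parsing are choices made for this illustration, and a full XML parser would be the safer option on real data.

```python
import re

# Matches <name ...>text</name>; attributes are optional.
NAME_RE = re.compile(r"<name(?P<attrs>[^>]*)>(?P<text>.*?)</name>", re.DOTALL)
# Matches attr="value" pairs inside the opening tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def tagged_toponyms(markup):
    """Return (surface form, gazetteer key or None) for each name element."""
    results = []
    for m in NAME_RE.finditer(markup):
        attrs = dict(ATTR_RE.findall(m.group("attrs")))
        surface = " ".join(m.group("text").split())  # collapse line wraps
        results.append((surface, attrs.get("key")))
    return results

sample = ('brought from <name key="tgn,7013959" type="place">Nashville</name>. '
          'in possession of the government up to <name>Bridgeport</name>, '
          'holding Lookout and <name>Raccoon\nmountains</name>')
print(tagged_toponyms(sample))
```

Names without a key are returned with None, which makes it easy to count how many tagged toponyms have a resolved gazetteer entry.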
8Corpus Statistics
9How Hard is the Task?
- Ambiguity's greatest hits
- 41 Oxfords
- 73 Springfields
- 91 Washingtons
- 97 Georgetowns
- Anything with Creek
- But not everywhere is this hard
10How Hard is the Task?
11How Hard is the Task?
12Experimental Setup
- Use labeled toponyms for training contexts
- Strip stopwords and build naïve Bayes classifiers
- In untagged text, test on labeled toponyms
- In tagged text, test on all toponyms
- Back-off to state or country models
- N.B. Baseline is most frequent referent for
toponym
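The setup above can be sketched as a small multinomial naive Bayes classifier over context words, with the most-frequent-referent baseline alongside it. This is a minimal illustration under stated assumptions: the stopword list, the add-one smoothing, the class name ToponymNB, and the toy training contexts are all invented for the example, and the backoff to state or country models is only marked with a comment rather than implemented.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list; the slides do not say which list was used.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on", "at"}

def tokens(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

class ToponymNB:
    """Naive Bayes over context words, with one set of referents per toponym."""

    def __init__(self):
        self.referent_counts = defaultdict(Counter)  # toponym -> referent -> contexts
        self.word_counts = defaultdict(Counter)      # (toponym, referent) -> word -> count
        self.vocab = set()

    def train(self, toponym, referent, context):
        self.referent_counts[toponym][referent] += 1
        for w in tokens(context):
            self.word_counts[(toponym, referent)][w] += 1
            self.vocab.add(w)

    def baseline(self, toponym):
        """Most frequent referent seen in training, or None if unseen."""
        counts = self.referent_counts.get(toponym)
        return counts.most_common(1)[0][0] if counts else None

    def classify(self, toponym, context):
        counts = self.referent_counts.get(toponym)
        if not counts:
            # A fuller system would back off to a state or country model here.
            return None
        total = sum(counts.values())
        vocab_size = len(self.vocab)
        words = tokens(context)
        best, best_score = None, -math.inf
        for referent, n in counts.items():
            wc = self.word_counts[(toponym, referent)]
            denom = sum(wc.values()) + vocab_size
            score = math.log(n / total)                  # log prior
            for w in words:
                score += math.log((wc[w] + 1) / denom)   # add-one smoothing
            if score > best_score:
                best, best_score = referent, score
        return best

# Toy training contexts, invented purely for illustration.
nb = ToponymNB()
nb.train("Springfield", "Springfield, Illinois", "state capitol legislature Lincoln")
nb.train("Springfield", "Springfield, Illinois", "Illinois legislature session")
nb.train("Springfield", "Springfield, Massachusetts", "basketball hall of fame")
print(nb.baseline("Springfield"))
print(nb.classify("Springfield", "the basketball hall"))
```

Comparing classify against baseline on held-out labeled contexts gives the kind of relative improvement figure reported in the evaluation.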
13Evaluation
14Discussion
- Improvements from naïve Bayes are small
- 1.2 relative improvements on untagged data
- But results on unseen toponyms could improve with backoff
- Degraded performance on tagged data
- Lack of many context clues
- Broader range of toponym references
- Strong locality effects not modeled (cf. Smith
and Crane 2001)
15Future Work
- Apply other methods from WSD
- Incorporate proximity along with hierarchical
information
- Hand tag data to confirm conclusions in news
genre
- Disambiguation for extending gazetteers, personal
name disambiguation, and event detection