Title: Treatment of Duplicates in the ADL Gazetteer
1Treatment of Duplicates in the ADL Gazetteer
- Jordan Hastings, Linda Hill
- Alexandria Digital Library Project, Department of Geography, University of California, Santa Barbara
2Introduction (1)
- What is a gazetteer?
- Spatial dictionary of named and typed features located in the environment.
- Traditional: Appendix in Atlas
- Digital: Computer Database
3Introduction (2)
- What are duplicates?
- Features that are somehow conflicted re names, types, or locations
- One feature - many names
- One name - many features
4Introduction (3)
- What is the ADL Gazetteer?
- http://www.alexandria.ucsb.edu/gazetteer
- Key access component for digital geodata
- Pilot implementation of publicly-accessible placename (feature) database service
- Fundamental GI Science research activity
5Outline of Talk
- Tour of gazetteer-related issues in California-Nevada, esp. Lake Tahoe
- Discussion of approach to resolving issues regarding duplicates
- Demonstration of software that implements the approach
7California and Nevada
Source Data: ESRI ArcView 3.2
10Names and Features (4)
- Lake Bigler, thru 1920s
- Lake Bonpland (also Bondland), thru 1890s
- Da-ow-a-ga, thru 1850s
15Discussion (1) Definitions
- DEF Feature: Humanly recognizable, persistent phenomenon in the environment
- Each feature integrates and interrelates three different kinds of attributes (with special issues)
- Location (framework scale, accuracy)
- Name (linguistics, culture)
- Type (taxonomy, ontology)
- DEF Gazetteer: Database of Features, i.e., a spatial dictionary, continually evolving
16Discussion (2) Approach
- Multiple metrics of feature similarity
- Geospatial
- Proximity (familiarity)
- Containment (hierarchy)
- Textual
- Notation (as written)
- Diction (as spoken)
- Weighted combinations of these metrics
17Discussion (3) Specifics
- Geospatial Metrics (w/Subtleties)
- Great Circle Distance
- Bounding Box Topology (Polygons may not be better!)
- Inside
- Near-to
- both scaled areally
[Figure labels: Twin Lakes, Half Moon Lake]
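The two geospatial metrics named on this slide can be sketched as follows. This is a minimal illustration, not the ADL implementation: the function names, the `(west, south, east, north)` box layout, and the "near-to" rule (center distance compared against the larger box diagonal, as one possible reading of "scaled areally") are all assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def bbox_inside(inner, outer):
    """True if box `inner` lies entirely within box `outer`.
    Boxes are (west, south, east, north) in degrees (hypothetical layout)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def bbox_center(box):
    """Center of a box as (lat, lon)."""
    west, south, east, north = box
    return ((south + north) / 2, (west + east) / 2)

def bbox_diagonal_km(box):
    """Corner-to-corner length of a box, used here as its size."""
    west, south, east, north = box
    return great_circle_km(south, west, north, east)

def bbox_near(a, b, factor=1.0):
    """True if the box centers are within `factor` times the larger diagonal.
    The scaling rule is an illustrative assumption, not the slide's formula."""
    (lat_a, lon_a), (lat_b, lon_b) = bbox_center(a), bbox_center(b)
    reach = factor * max(bbox_diagonal_km(a), bbox_diagonal_km(b))
    return great_circle_km(lat_a, lon_a, lat_b, lon_b) <= reach
```

Scaling the "near" threshold by box size means two large lakes a few kilometers apart can count as near, while two small ponds at the same separation do not.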
18Discussion (4) Specifics
- Textual Metrics (w/Subtleties)
- Hamming Distance (hd)
- hd(Lake, Pond) = 4
- Edit Distance (ed)
- ed(Lake, Lakes) = 1
- Soundex (sdx)
- sdx(Pyramid Lake) = P653
- sdx(Lake Tahoe) = T000
Soundex codes: 1 = B, P, F, V; 2 = C, S, K, G, J, Q, X, Z; 3 = D, T; 4 = L; 5 = M, N; 6 = R
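The three textual metrics can be sketched in a few lines each. This is a minimal version under stated assumptions: the edit distance is the standard Levenshtein variant, the Soundex follows the common American rules using the letter table above, and the function names are illustrative. The slide's codes P653 and T000 are consistent with coding the specific word ("Pyramid", "Tahoe") rather than the generic term "Lake"; that reading is an inference, not something the slide states.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Letter-to-digit table, as listed on the slide.
_GROUPS = {"1": "BPFV", "2": "CSKGJQXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
SOUNDEX_CODES = {c: d for d, letters in _GROUPS.items() for c in letters}

def soundex(word):
    """Four-character Soundex code for a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes; vowels do
            last = digit
    return (code + "000")[:4]

print(hamming("Lake", "Pond"))         # 4
print(edit_distance("Lake", "Lakes"))  # 1
print(soundex("Pyramid"), soundex("Tahoe"))  # P653 T000
```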
19Discussion (5) Specifics
- Canonical Names
- Tahoe
- Lake Tahoe -> Tahoe, Lake
- Tahoe, Lake
- but
- Lake Bigler -> Bigler, Lake
- Big Frog Lake -> Big Frog, Lake or Frog, Big, Lake?
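One way to produce the inverted forms shown above is to move a generic term behind the specific part of the name. The sketch below is a hedged illustration only: the `GENERIC_TERMS` set and the function name are hypothetical, and for a name such as "Big Frog Lake" it simply picks the first alternative ("Big Frog, Lake"), which the slide itself leaves as an open question.

```python
# Hypothetical list of generic terms; a real gazetteer would draw on its type scheme.
GENERIC_TERMS = {"lake", "mount", "point", "cape"}

def canonical_name(name):
    """Invert a name so the specific part leads: 'Lake Tahoe' -> 'Tahoe, Lake'."""
    words = name.split()
    if len(words) > 1 and words[0].lower() in GENERIC_TERMS:
        # Leading generic term: move it to the end.
        return " ".join(words[1:]) + ", " + words[0]
    if len(words) > 1 and words[-1].lower() in GENERIC_TERMS:
        # Trailing generic term: split it off with a comma.
        return " ".join(words[:-1]) + ", " + words[-1]
    return name

for n in ("Tahoe", "Lake Tahoe", "Lake Bigler", "Big Frog Lake"):
    print(n, "->", canonical_name(n))
```

Names that canonicalize to the same string become candidate duplicates on the name axis, while "Lake Bigler" still differs from "Lake Tahoe" and needs the other metrics to be linked.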
20Demonstration (1) - Background
- GNIS Dataset: http://geoname.usgs.gov
- Public product of USGS / Mapping Div. for BGN
- Centroid point features, from many 1:100K maps
- Web-accessible, updated ad hoc
- GDT Dataset: http://www.gdt1.com
- Private product, sold into logistics mapping firms
- Polygon and line features, from DLGs, other sources
- CD-publication (75 for U.S.), updated quarterly
21demo
22Demonstration (2) - Processing
- De-Duping GNIS
- 1) By Name -- sampled thru Cs
- 2) By Location -- sampled thru Cs
- Full (prior) results viewed
- Matching GNIS to GDT
- 3) By Combination -- run to completion
- Statistics, Metrics discussed and reviewed
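The first two passes (by name, then by location) can be pictured as simple grouping steps. The sketch below is an assumption about how such passes might look, not the demonstrated software: records are hypothetical `(id, name, lat, lon)` tuples, the coordinates in the example are made up for illustration, and the location pass uses a coarse grid that will miss near neighbors falling on opposite sides of a cell boundary.

```python
import math
from collections import defaultdict

def group_by_name(records):
    """Pass 1: bucket records whose names match after case and whitespace folding."""
    groups = defaultdict(list)
    for rec_id, name, lat, lon in records:
        key = " ".join(name.lower().split())
        groups[key].append(rec_id)
    # Only buckets with more than one record are duplicate candidates.
    return {k: v for k, v in groups.items() if len(v) > 1}

def group_by_location(records, cell_deg=0.01):
    """Pass 2: bucket records whose centroids fall in the same grid cell."""
    groups = defaultdict(list)
    for rec_id, name, lat, lon in records:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        groups[cell].append(rec_id)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Illustrative records only; these are not actual GNIS entries.
records = [
    (1, "Lake Tahoe", 39.096, -120.032),
    (2, "Lake Bigler", 39.097, -120.033),
    (3, "lake  tahoe", 39.155, -120.145),
    (4, "Pyramid Lake", 40.037, -119.575),
]
print(group_by_name(records))      # records 1 and 3 share a name
print(group_by_location(records))  # records 1 and 2 share a grid cell
```

The two passes flag different pairs here, which is why a combined pass is needed to match across datasets.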
23Summary (1)
- Features Cover a Large Territory
- Crisp or Diffuse
- Compact or Extended
- Tangible or Abstract
- Naming Features is Human Necessity
- Linguistic Reference
- Identity and Ownership
- Navigation and Wayfinding
24Summary (2)
- Feature Names are Numerous and Various
- Polynymous, multi-lingual
- Suffused with linguistic conventions
- Time-variant
- Feature Locations also Numerous and Various
- Projected, multi-scale
- Obscured by cartographic conventions
- Time-variant
- Types, too, can be Numerous and Various
25Summary (3)
- Automatic Recognition of Duplicates
- Essential to gazetteer construction
- Relies on both geospatial and textual metrics; weighting of combinations is subjective
- Results in multiple characterizations for a single feature in many (most) cases, with database and visualization implications
- Gazetteers pushing at the limits of GIS
- spatially, temporally, and ontologically
26Observations
- Features are subjective, not objective
- Duplicate features are not problems, but clews to important subtleties
- No right answer to feature-izing the environment. Features vary
- Spatially (scale)
- Temporally
- Culturally (socially)
- Cognitively (personally)
27end
28Future Work
- Widening beyond California and Nevada
- Adjusting metrics weights, regionally
- Testing computational costs/benefits of polygon vs. bounding box calculations
- Exploring database mechanisms to deal with complexity of gazetteer knowledge
- Implementing in Web-mapping GIS
29Feature Types (1)
- Dependable Type System
- Because Features are Objects
- Because Human Mind Categorizes
- Types present in Taxonomy
- Hierarchy is Natural in Environment
- Because Human Mind Categorizes
30Feature Types (2) Examples
- Cultural Environment
- Nations -> States -> Provinces -> Districts
31Feature Types (2) - Examples
- Physical Environment
- Watersources: Springs --> Seeps
- Watercourses: Rivers --> Streams --> Creeks
- Waterbodies: Lakes --> Ponds --> Sloughs, ?Glaciers
32Feature Types (2)
- Type Examples
- Cultural Environment
- Nations -> States -> Provinces -> Districts
- Physical Environment
- Watersources: Springs, Seeps
- Watercourses: Rivers, Streams, Creeks
- Waterbodies: Lakes, Ponds, Sloughs, ?Glaciers
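A type hierarchy like this supports a simple "closeness" measure between two feature types: the number of steps between them through their nearest shared ancestor. The sketch below assumes a tree built from the physical-environment examples above, with each listed type placed directly under its category; that tree shape, the omission of the uncertain "?Glaciers" entry, and the function names are all assumptions for illustration.

```python
# Child -> parent links (hypothetical tree built from the slide's examples).
PARENT = {
    "Watersources": "Physical Environment",
    "Watercourses": "Physical Environment",
    "Waterbodies": "Physical Environment",
    "Springs": "Watersources", "Seeps": "Watersources",
    "Rivers": "Watercourses", "Streams": "Watercourses", "Creeks": "Watercourses",
    "Lakes": "Waterbodies", "Ponds": "Waterbodies", "Sloughs": "Waterbodies",
}

def ancestors(feature_type):
    """The type itself followed by each successive parent up to the root."""
    chain = [feature_type]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def type_distance(a, b):
    """Edges on the path between two types via their nearest common ancestor.
    Returns None when the types share no ancestor in the tree."""
    depth_in_b = {t: i for i, t in enumerate(ancestors(b))}
    for steps_up, t in enumerate(ancestors(a)):
        if t in depth_in_b:
            return steps_up + depth_in_b[t]
    return None

print(type_distance("Lakes", "Ponds"))   # siblings under Waterbodies
print(type_distance("Lakes", "Creeks"))  # related only through the root
```

A small distance (Lakes vs. Ponds) suggests two records could describe the same feature despite differing type labels; a large one makes a duplicate less plausible.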
33Fundaments (1)
- Definition, Gazetteer: A spatial dictionary of named and typed features in the environment
- Implications
- Features uniquely identified
- Searchable by name and type
- Also searchable geospatially
34Fundaments (2)
- Duplicates: An approximate notion
- Firm types, close in hierarchy
- Locations close: dependent on scale
- Names close: dependent on language or not at all
- All aspects variant in time
35Fundaments (3)
- Database Implications / Support
- Custom Datatypes
- Hierarchy
- Geometry
- Multiple Attribution (unlimited)
- Names
- Locations
- Efficient Geospatial Processing
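The "multiple attribution" requirement, one feature carrying any number of names and locations, can be sketched as a small record structure. All class and field names below are hypothetical and do not reflect the ADL schema; the example names and date ranges are the historical Lake Tahoe names listed earlier in the talk, and locations are left empty rather than invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Name:
    text: str
    in_use_through: Optional[str] = None  # free-text period, e.g. "1920s"

@dataclass
class Location:
    bbox: Tuple[float, float, float, float]  # (west, south, east, north), assumed layout
    source: Optional[str] = None

@dataclass
class Feature:
    feature_id: str
    type_path: List[str]  # position in the type hierarchy, root first
    names: List[Name] = field(default_factory=list)          # unlimited names
    locations: List[Location] = field(default_factory=list)  # unlimited locations

    def has_name(self, text):
        """Case-insensitive lookup across all of the feature's names."""
        return any(n.text.lower() == text.lower() for n in self.names)

tahoe = Feature(
    feature_id="example-1",  # made-up identifier
    type_path=["Physical Environment", "Waterbodies", "Lakes"],
    names=[
        Name("Lake Tahoe"),
        Name("Lake Bigler", in_use_through="1920s"),
        Name("Lake Bonpland", in_use_through="1890s"),
        Name("Da-ow-a-ga", in_use_through="1850s"),
    ],
)
print(tahoe.has_name("lake bigler"))
```

Keeping names and locations as lists on one feature lets resolved duplicates be merged without discarding the variant names that made them look different.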
36Approach (1)
- Independent Measures of Duplicates
- 1. Type Thesaurus Metrics
- Inter-feature: hierarchy, explicit linkages
- 2. Geospatial Metrics
- Intra-feature: size, compactness,
- Inter-feature: distance, overlap,
- 3. Geonomial Metrics
- Intra-feature: NL translation not considered yet
- Intra-feature: stemming, soundex, substitution
37Approach (2)
- Unified Assessment of Duplicates
- Weighted Combination of Measures
- 1 Type
- 2 Location(s)
- 3 Name(s)
- Geographic Visualization, over Maps
- Final Authority of Human Cataloger
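The unified assessment can be sketched as a weighted mean of the three measures followed by a triage step that leaves uncertain cases to the cataloger. This is a minimal sketch under assumptions: each measure is taken to be a similarity already scaled to [0, 1], and the default weights, thresholds, and labels are placeholders, since the talk notes that the weighting is subjective.

```python
def combined_score(type_sim, location_sim, name_sim, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of type, location, and name similarities (each in [0, 1])."""
    w_type, w_loc, w_name = weights
    total = w_type + w_loc + w_name
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_type * type_sim + w_loc * location_sim + w_name * name_sim) / total

def triage(score, accept=0.8, reject=0.3):
    """Route a candidate pair; thresholds are illustrative placeholders."""
    if score >= accept:
        return "likely duplicate"
    if score <= reject:
        return "likely distinct"
    return "human review"  # the cataloger has final authority

# Same type, overlapping location, unrelated names (e.g. a renamed lake).
score = combined_score(1.0, 0.9, 0.1, weights=(1.0, 2.0, 1.0))
print(round(score, 3), triage(score))
```

Weighting location more heavily than name, as in the usage line, is one way a renamed feature could still surface for review rather than being dismissed on its name alone.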
38Gazetteer Duplicates Processing Cycle
random features
prep
grouped features
rework
42Gazetteer Duplicates Processing Cycle
[Diagram labels: random features, prep, grouped features, rework, weigh, review, accepted, suspended, post, feature database, reject, trash]
43end
45Tour (TBD)