Title: Preservation of Digital Geospatial Data: Challenges and Opportunities. Steve Morris, Head of Digital Library Initiatives, North Carolina State University Libraries
1Preservation of Digital Geospatial Data
Challenges and Opportunities
Steve Morris
Head of Digital Library Initiatives
North Carolina State University Libraries
NARA Meeting
Dec. 14, 2005
2Outline
- Digital Geospatial Data Types
- Risks to Digital Geospatial Data
- Overview of NC Geospatial Data Archiving Project
- Preservation Challenges and Possible Solutions
3Geospatial data types: Vector data
4Geospatial data types: Satellite imagery
5Geospatial data types: Aerial imagery
6Geospatial data types: Aerial imagery
7Geospatial data types: Aerial imagery
8Geospatial data types: Tabular data (w/vector)
9Time series vector data: Parcel Boundary Changes 2001-2004, North Raleigh, NC
10Time series: Ortho imagery, Vicinity of Raleigh-Durham International Airport 1993-2002
11Today's geospatial data as tomorrow's cultural heritage
12Risks to Digital Geospatial Data
.shp
.mif
.gml
.e00
.dwg
.dgn
.bsb
.bil
.sid
13Risks to Digital Geospatial Data
- Producer focus on current data
- Time-versioned content generally not archived
- Future support of data formats in question
- Vast range of data formats in use--complex
- Shift to streaming data for access
- Archives have been a by-product of providing access
- Preservation metadata requirements
- Descriptive, administrative, technical, DRM
- Geodatabases
- Complex functionality
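The range of formats listed on the previous slide suggests a first practical step: knowing which formats a collection actually contains. The following is a minimal sketch, assuming a plain list of file paths; the function name `inventory_formats` and the sample paths are hypothetical, and the format names are the common meanings of the extensions shown above.

```python
from collections import Counter
from pathlib import PurePath

# Common names for the extensions listed on the "Risks" slide.
FORMAT_NAMES = {
    ".shp": "ESRI shapefile",
    ".mif": "MapInfo Interchange Format",
    ".gml": "Geography Markup Language",
    ".e00": "ArcInfo interchange (export) file",
    ".dwg": "AutoCAD drawing",
    ".dgn": "MicroStation design file",
    ".bsb": "BSB raster nautical chart",
    ".bil": "Band-interleaved-by-line raster",
    ".sid": "MrSID compressed raster",
}


def inventory_formats(paths):
    """Count files per known format; other extensions are grouped by suffix."""
    counts = Counter()
    for p in paths:
        ext = PurePath(p).suffix.lower()
        label = FORMAT_NAMES.get(ext, f"unrecognized ({ext or 'no extension'})")
        counts[label] += 1
    return counts


# Hypothetical file listing, for illustration only.
sample = [
    "wake/parcels.shp",
    "wake/parcels.SHP.bak",
    "durham/roads.e00",
    "ortho/tile_01.sid",
    "ortho/tile_02.sid",
]
for label, n in sorted(inventory_formats(sample).items()):
    print(n, label)
```

A count like this only identifies formats by extension; it says nothing about format version or whether supporting files are complete.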
14NC Geospatial Data Archiving Project
- Partnership between university library (NCSU) and state agency (NCCGIA)
- Focus on state and local geospatial content in North Carolina (state demonstration)
- Tied to NC OneMap initiative, which provides for seamless access to data, metadata, and inventory information
- Objective: engage existing state/federal geospatial data infrastructures in preservation
15Targeted Content
- Resource Types
- GIS vector (point/line/polygon) data
- Digital orthophotography
- Digital maps
- Tabular data (e.g. assessment data)
- Content Producers
- Mostly state, local, regional agencies
- Some university, not-for-profit, commercial
- Selected local federal projects
16Local Government GIS Archival Issues
- Data resources are highly distributed and subject to frequent update
- More detailed, current, accurate than federal/state data resources
- North Carolina local agency GIS environment
- 100 counties, 95 with GIS
- 85 counties with high resolution orthophotography
- Growing number of municipal systems
- Value: $162 million plus investment (est. in 2003)
17Work plan in a Nutshell
- Work from existing data inventories
- NC OneMap Data Sharing Agreements as the blanket, individual agreements as the quilt
- Partnership: work with existing geospatial data infrastructures (state and federal)
- Technical approach
- METS with FGDC, PREMIS?, GeoDRM?
- DSpace now, then re-ingest to a different environment
- Web services consumption for archival development
18NCGDAP Philosophy of Engagement
Provide feedback to producer organizations/inform state geospatial infrastructure
Take the data as is, in the manner in which it can be obtained
Wrangle and archive data
Note the "Project" in North Carolina Geospatial Data Archiving Project: the process, the learning experience, and the engagement with geospatial data infrastructures are more important than the archive
19Big Challenges
- Format migration paths
- Management of data versions over time
- Preservation metadata
- Harnessing geospatial web services
- Preserving cartographic representation
- Keeping content repository-agnostic
- Preserving geodatabases
- More
20Vector Data Format Issues
- Vector data much more complicated than image data
- Archiving vs. permanent access
- An open pile of XML might make an archive, but if using it requires a team of programmers to do digital archaeology then it does not provide permanent access
- Piles of XML need to be widely understood piles
- GML: need widely accepted application schemas (like OSMM?)
- The Geodatabase conundrum
- Export feature classes, and lose topology, annotation, relationships, etc.
- or use the Geodatabase as the primary archival platform (some are now thinking this way)
21GIS Software Used: NC Local Agencies
Source: NC OneMap Data Inventory 2004
22Vector Data Format Options
- Option A: use an open format and have a really unfortunate transformation and limited vendor support for the output object
- Option B: use closed format but retain the original content and count on short- and medium-term vendor support.
- Option C: do both to buy time and look for an open, ASCII-based solution. (watch GML activity)
- No sweet spot, just an evolving and changing mix of flawed options that are used in combination.
23Geography Markup Language Issues
- GML still more useful as a transfer format than an archival format, support limited even for transfer
- Permanent access requirements
- profiles and application schemas widely understood and supported, avoid requiring digital archaeology
- role of GML Simple Features Profile?
- Assessing formats for preservation: sustainability factors, quality and functionality factors
- Apply same approach to GML profiles and application schemas?
24Geography Markup Language Issues
- Plans for environmental scan of existing GML profiles and application schemas; for each schema or profile:
- schema name (e.g. OSMM, top10NL, ESRI GML, LandGML)
- responsible agency; schema has official government status?
- GML version; known unsupported GML components
- schema history; known interoperation with other schemas
- vendor support; translator support; stability over time
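The scan questions listed on this slide could be captured in a simple record so that unanswered items stay visible. This is only a sketch: the class name `GmlSchemaRecord` and its field names are hypothetical, chosen to mirror the slide's list, and no real values are filled in for any schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GmlSchemaRecord:
    """One row of the environmental scan; None means 'not yet known'."""
    schema_name: str
    responsible_agency: Optional[str] = None
    official_government_status: Optional[bool] = None
    gml_version: Optional[str] = None
    unsupported_gml_components: List[str] = field(default_factory=list)
    schema_history: Optional[str] = None
    interoperates_with: List[str] = field(default_factory=list)
    vendor_support: List[str] = field(default_factory=list)
    translator_support: List[str] = field(default_factory=list)
    stability_notes: Optional[str] = None

    def unanswered(self):
        """Return the names of scan questions whose value is still None."""
        return [name for name, value in vars(self).items() if value is None]


# Start a record for one of the schemas named on the slide; everything
# beyond the name is left unknown rather than guessed.
record = GmlSchemaRecord(schema_name="OSMM")
print(record.unanswered())
```

List-valued fields start empty rather than None, so an empty list cannot be distinguished from "none found"; a real scan would probably need to track that difference.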
25Managing Time-versioned Content
26Managing Time-versioned Content
- Many local agency data layers continuously updated
- E.g., some county cadastral data updated daily; older versions not generally available
- Individual versioned datasets will wander off from the archive
- How do users get current metadata/DRM/object from a versioned dataset found in the wild?
- How do we certify concurrency and agreement between the metadata and the data?
27Managing Time-versioned Content
- Can we manage the relationship loosely using a persistent identifier link to a parent object?
[Diagram: Persistent ID Resolver, Parent Object Manager, and five version objects]
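One way to picture the loose relationship between versions and a parent object is sketched below. It assumes, purely for illustration, that each archived version is registered with a SHA-256 checksum so a copy found in the wild can be matched back to its parent and to the parent's current metadata. The class names, the identifier string, and the checksum approach are assumptions, not the project's design.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Version:
    label: str    # e.g. a capture date
    sha256: str   # fixity value of the archived file


@dataclass
class ParentObject:
    persistent_id: str
    current_metadata: str
    versions: List[Version] = field(default_factory=list)


class Resolver:
    """Maps persistent IDs to parent objects and checksums to versions."""

    def __init__(self):
        self._parents: Dict[str, ParentObject] = {}
        self._by_checksum: Dict[str, str] = {}

    def add_parent(self, parent: ParentObject) -> None:
        self._parents[parent.persistent_id] = parent

    def add_version(self, persistent_id: str, label: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._parents[persistent_id].versions.append(Version(label, digest))
        self._by_checksum[digest] = persistent_id
        return digest

    def resolve(self, persistent_id: str) -> Optional[ParentObject]:
        return self._parents.get(persistent_id)

    def identify(self, data: bytes) -> Optional[Tuple[ParentObject, str]]:
        """Match a stray file to (parent, version label), or None if unknown."""
        digest = hashlib.sha256(data).hexdigest()
        pid = self._by_checksum.get(digest)
        if pid is None:
            return None
        parent = self._parents[pid]
        label = next(v.label for v in parent.versions if v.sha256 == digest)
        return parent, label


# Hypothetical identifier and contents, for illustration only.
resolver = Resolver()
resolver.add_parent(ParentObject("example:wake-parcels", "metadata as of latest update"))
resolver.add_version("example:wake-parcels", "2001", b"parcel data 2001")
resolver.add_version("example:wake-parcels", "2004", b"parcel data 2004")
print(resolver.identify(b"parcel data 2004"))
```

A checksum match only works for byte-identical copies; a file that was re-exported or edited after leaving the archive would need an embedded identifier instead.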
28Preservation Metadata Issues
- FGDC Metadata
- Many flavors, incoming metadata needs processing
- Cross-walk elements to PREMIS, MODS?
- Metadata wrapper/Content packaging
- METS (Metadata Encoding and Transmission Standard) vs. other industry solutions
- Need a geospatial industry solution for the METS-like problem
- GeoDRM a likely trigger; wrapper to enforce licensing (MPEG 21 references in OGIS Web Services 3)
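To make the "metadata wrapper" idea concrete, here is a minimal sketch of a METS document that embeds an FGDC record in a descriptive metadata section and points to the data files. It uses only Python's standard library. The function name `wrap_in_mets`, the object ID, the `USE="archive"` value, and the file names are hypothetical, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(ns, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{ns}}}{tag}"


def wrap_in_mets(object_id, fgdc_root, file_hrefs):
    """Wrap an FGDC metadata element and a list of file references in METS."""
    mets = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})

    # Descriptive metadata: the FGDC record travels inside the wrapper.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    md_wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "FGDC"})
    ET.SubElement(md_wrap, q(METS_NS, "xmlData")).append(fgdc_root)

    # File inventory and a flat structural map that references every file.
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    group = ET.SubElement(file_sec, q(METS_NS, "fileGrp"), {"USE": "archive"})
    struct_map = ET.SubElement(mets, q(METS_NS, "structMap"))
    div = ET.SubElement(struct_map, q(METS_NS, "div"), {"DMDID": "DMD1"})
    for i, href in enumerate(file_hrefs, start=1):
        file_id = f"FILE{i}"
        f = ET.SubElement(group, q(METS_NS, "file"), {"ID": file_id})
        ET.SubElement(f, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): href})
        ET.SubElement(div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return mets


# Placeholder FGDC record and file names, for illustration only.
fgdc = ET.fromstring("<metadata><idinfo/></metadata>")
doc = wrap_in_mets("example-object-1", fgdc, ["parcels.shp", "parcels.shx", "parcels.dbf"])
print(ET.tostring(doc, encoding="unicode"))
```

Preservation and rights metadata (the PREMIS and GeoDRM questions on this slide) would go in an administrative metadata section, which this sketch leaves out.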
29Metadata Availability
30Harnessing Geospatial Web Services
36Geospatial Web Service Types
- Image services
- Deliver image resulting from query against underlying data
- Limited opportunity for analysis
- Feature services
- Stream actual feature data, greater opportunity for data analysis
- Other
- Geocoding services
- Routing
- etc.
38Geospatial Web Services Rights Issues
Example: Desktop GIS-accessible ArcIMS
- 39 of 100 NC counties have desktop GIS-accessible ArcIMS services
- It is difficult to know how many of these counties actually expect users to either
- A) access data through desktop GIS for viewing only, or
- B) extract and download data
39Harnessing Geospatial Web Services
- Automated content identification
- capabilities files, registries, catalog services
- WMS (Web Map Service) for batch extraction of image atlases
- last ditch capture option
- preserve cartographic representation
- retain records of decision-making process
- feature services (WFS) later.
- Rights issues in the web services space are ambiguous
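Batch extraction of an image atlas from a WMS server amounts to issuing one GetMap request per tile of a grid. The sketch below only builds the request URLs, using WMS 1.1.1 GetMap parameters; it does not contact any server. The function name `getmap_urls`, the endpoint, the layer name, and the bounding box are hypothetical, and it assumes the base URL carries no query string of its own.

```python
from urllib.parse import urlencode


def getmap_urls(base_url, layers, bbox, rows, cols,
                tile_px=1024, srs="EPSG:4326", fmt="image/png"):
    """Return one WMS 1.1.1 GetMap URL per tile of a rows x cols grid."""
    minx, miny, maxx, maxy = bbox
    dx = (maxx - minx) / cols
    dy = (maxy - miny) / rows
    # Keep pixels square in map units so tiles are not stretched.
    height_px = max(1, round(tile_px * dy / dx))
    urls = []
    for r in range(rows):
        for c in range(cols):
            tile = (minx + c * dx, miny + r * dy,
                    minx + (c + 1) * dx, miny + (r + 1) * dy)
            params = {
                "SERVICE": "WMS",
                "VERSION": "1.1.1",
                "REQUEST": "GetMap",
                "LAYERS": ",".join(layers),
                "STYLES": "",
                "SRS": srs,
                "BBOX": ",".join(f"{v:.6f}" for v in tile),
                "WIDTH": tile_px,
                "HEIGHT": height_px,
                "FORMAT": fmt,
            }
            urls.append(f"{base_url}?{urlencode(params)}")
    return urls


# Hypothetical endpoint, layer, and extent, for illustration only.
for url in getmap_urls("https://example.org/wms", ["parcels"],
                       (-79.0, 35.5, -78.0, 36.0), rows=1, cols=2):
    print(url)
```

Storing the request URL alongside each captured image would help retain a record of how the atlas was produced. Whether a given service permits this kind of harvesting is the rights question raised on the slide.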
40Web mash-ups and the New Mainstream Geospatial
Web Services
41Preserving Cartographic Representation
42Preserving Cartographic Representation
- The true counterpart of the old map is not the GIS dataset, but rather the cartographic representation that builds on that data
- Intellectual choices about symbolization, layer combinations
- Data models, analysis, annotations
- Cartographic representation typically encoded in proprietary files (.avl, .lyr, .apr, .mxd) that do not lend themselves well to migration
- Symbologies have meaning to particular communities at particular points in time; preserving information about symbol sets and their meaning is a different problem
43Preserving Cartographic Representation
- Image-based approaches
- Generate images using Map Book or similar tools
- Harvest existing atlas images
- Capture atlases from WMS servers
- Export layouts or maps to image
- Vector-based approaches
- Store explicitly in the data format (e.g. Feature Class Representation in ArcGIS 9.2)
- Archive and upward-migrate existing files: .avl, .apr, .lyr, .mxd, etc.
- SVG, VML or other XML approaches
- Other?
44Preserving Cartographic Representation
45Preserving Cartographic Representation
46Repository Architecture Issues
- Interest in how geospatial content interacts with widely available digital repository software
- Focus on salient, domain-specific issues
- Challenge: remain repository agnostic
- Avoid imprinting on repository software environment
- Preservation package should not be the same as the ingest object of the first environment
- Tension between exploiting repository software features vs. becoming software dependent
47Preserving Geodatabases
- Spatial databases in general vs. ESRI Geodatabase format
- Not just data layers and attributes; also topology, annotation, relationships, behaviors
- ESRI Geodatabase archival issues
- XML Export, Geodatabase History, File Geodatabase, Geodatabase Replication
- Some looking to Geodatabase as archival platform (in addition to feature class export)
48Geodatabase Availability
- Local agencies, especially municipalities, are increasingly turning to the ESRI Geodatabase format to manage geospatial data.
- According to the 2003 Local Government GIS Data Inventory, 10.0% of all county framework data and 32.7% of all municipal framework data were managed in that format.
49Evolving Geodatabase Handling Approaches
Project Stage | Planned Approach
Original Proposal (Nov. 2003) | Export feature classes as shapefiles; archive Geodatabases less than 2 GB in size
Finalized Work Plan (Dec. 2004) | Also export content as Geodatabase XML
Possible Future Work Plan Changes | Explore maintenance of some archival content in Geodatabase form; explore Geodatabase replication as an archive development approach; archive Geodatabases of unlimited size
50Efficient Content Replication
- Content replication also needed for
- Disaster preparedness
- State and federal data improvement projects
- Aggregation by regional geospatial web service providers
- WFS, e.g. efficiency in complete content transfer?
- Rsync-like function, plus rights management, inventory processes, metadata management, informed by data update cycles
- Archiving delta files vs. complete replication: need to avoid requiring digital archaeology in the future
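The "rsync-like function" can be approximated, at its simplest, by comparing checksum manifests of the producer's holdings and the archive's copy. The sketch below is an illustration under that assumption, not real rsync (which works at the block level). It plans transfers of whole files, so each archived version stays complete instead of being stored as a delta. The function names and sample contents are hypothetical.

```python
import hashlib


def manifest(files):
    """Map relative path -> SHA-256 hex digest, given a dict of path -> bytes."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def plan_transfer(source_manifest, archive_manifest):
    """Compare two manifests and list what a replication pass would touch."""
    added = sorted(p for p in source_manifest if p not in archive_manifest)
    changed = sorted(p for p in source_manifest
                     if p in archive_manifest
                     and source_manifest[p] != archive_manifest[p])
    removed = sorted(p for p in archive_manifest if p not in source_manifest)
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical holdings, for illustration only.
producer = manifest({"parcels.shp": b"2005 edits", "roads.shp": b"roads", "zoning.shp": b"zoning"})
archive = manifest({"parcels.shp": b"2004 edits", "roads.shp": b"roads", "soils.shp": b"soils"})
print(plan_transfer(producer, archive))
```

In an archival setting, files reported as changed or removed would presumably be retained as earlier versions, not overwritten or deleted; the plan only identifies what differs.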
51Points of Engagement with the Open Geospatial
Consortium (OGC)
- GML for archiving
- GeoDRM -- Adding preservation use cases
- Content Packaging -- Industry solution?
- Web Services Context Documents
- Can we save data state as well as application state?
- Content Replication
- Is this layer in the architecture?
- Persistent Identifiers
- Persistent Identifiers
52Project Outcomes
- Demonstration archive
- Outreach activity: planting seeds
- International, national, state, local, commercial
- Learning experience, informing:
- Spatial data infrastructure
- Commercial vendors (data/software/consulting)
- Repository software communities
- Metadata practice (both GIS and preservation)
- Rights management developments
- Data and interoperability standards
53Content Identification and Selection
- Work from NC OneMap Data Inventory
- Combine with inventory information from various state agencies and from previous NCSU efforts
- Develop methodology for selecting from among early, middle, and late stage products
- Develop criteria for time series development
- Investigate use of emerging Open Geospatial Consortium technologies in data identification
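As one example of using OGC technologies for data identification, a WMS capabilities document lists the layers a service offers. The sketch below parses a small WMS 1.1.1-style capabilities document with Python's standard library and returns the named layers. The sample document and its layer names are invented for illustration; a WMS 1.3.0 document uses an XML namespace and would need qualified tag names.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed WMS 1.1.1-style capabilities document.
SAMPLE_CAPABILITIES = """
<WMT_MS_Capabilities version="1.1.1">
  <Service><Name>OGC:WMS</Name><Title>Example county service</Title></Service>
  <Capability>
    <Layer>
      <Title>Example county</Title>
      <Layer queryable="1"><Name>parcels</Name><Title>Parcels</Title></Layer>
      <Layer><Name>ortho2002</Name><Title>Orthophotos 2002</Title></Layer>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>
"""


def list_named_layers(capabilities_xml):
    """Return (name, title) for every requestable layer in the document."""
    root = ET.fromstring(capabilities_xml)
    layers = []
    for layer in root.iter("Layer"):
        # Container layers have a Title but no Name and cannot be requested.
        name = layer.findtext("Name")
        if name:
            layers.append((name, layer.findtext("Title", default="")))
    return layers


for name, title in list_named_layers(SAMPLE_CAPABILITIES):
    print(name, "-", title)
```

Comparing such a layer list against an existing data inventory is one possible way to spot content that the inventory missed.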
54Content Acquisition
- Work from NC OneMap Data Sharing Agreements as a starting point (the blanket)
- Secure individual agreements (the quilt)
- Investigate use of OGC technologies in capture
- Explore use of METS as a metadata wrapper
- Ingest FGDC metadata; crosswalk to MODS? PREMIS?
- Maybe METS DRM short term, GeoDRM long term
- Consider links to services, version management
- Get the geospatial community to tackle the content packaging problem (maybe MPEG 21?)
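A crosswalk from FGDC metadata to MODS can be expressed as a table of source and target element paths. The sketch below maps four FGDC elements (title, originator, publication date, abstract) to plausible MODS counterparts. The mapping is illustrative only and is not an endorsed crosswalk; the function name and the sample record values are hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# (FGDC path under <metadata>, MODS element path). Illustrative mapping only.
CROSSWALK = [
    ("idinfo/citation/citeinfo/title", ("titleInfo", "title")),
    ("idinfo/citation/citeinfo/origin", ("name", "namePart")),
    ("idinfo/citation/citeinfo/pubdate", ("originInfo", "dateIssued")),
    ("idinfo/descript/abstract", ("abstract",)),
]


def fgdc_to_mods(fgdc_root):
    """Copy mapped FGDC element values into a new MODS record."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    for fgdc_path, mods_path in CROSSWALK:
        for src in fgdc_root.findall(fgdc_path):
            if not (src.text and src.text.strip()):
                continue  # skip empty source elements
            target = mods
            for tag in mods_path:
                target = ET.SubElement(target, f"{{{MODS_NS}}}{tag}")
            target.text = src.text.strip()
    return mods


# Placeholder FGDC record, for illustration only.
SAMPLE_FGDC = """
<metadata>
  <idinfo>
    <citation><citeinfo>
      <origin>Example County GIS</origin>
      <pubdate>2004</pubdate>
      <title>Example parcels</title>
    </citeinfo></citation>
    <descript><abstract>Placeholder abstract.</abstract></descript>
  </idinfo>
</metadata>
"""
record = fgdc_to_mods(ET.fromstring(SAMPLE_FGDC))
print(ET.tostring(record, encoding="unicode"))
```

Because incoming FGDC records come in many flavors, a real workflow would likely normalize them before applying any mapping of this kind.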
55Partnership Building
- Work within context of the NC OneMap initiative
- State, local, federal partnership
- State expression of the National Map
- Defined characteristic: Historic and temporal data will be maintained and available
- Advisory Committee drawn from the NC Geographic Information Coordinating Council subcommittees
- Seek external partners
- National States Geographic Information Council
- FGDC Historical Data Committee
- more
56Content Retention and Transfer
- Ingest into DSpace
- Explore how geospatial content interacts with existing digital repository software environments
- Investigate re-ingest into a second platform
- Challenge: keep the collection repository-agnostic
- Start to define format migration paths
- Special problem: geodatabases
- Pursue long-term solution
- Roles of data producing agencies, state agencies, NC OneMap, NCSU
57Project Status
- Completing inventory analysis stage
- Storage system and backup deployed
- DSpace deployed to production
- Metadata workflow finalized
- Ingest workflow near finalization
- Content migration workflow near finalization
- Regional site visits planned for coming months
- Wide range of outreach/collaboration: FGDC, ESRI, EDINA (JISC), USGS, OGC, TRB, etc.
- Pilot project: georegistering digital archival geologic maps
58Questions?
Contact: Steve Morris, Head, Digital Library Initiatives, NCSU Libraries
ph: (919) 515-1361
Steven_Morris_at_ncsu.edu