Title: Challenges and Solutions for Digital Geospatial Data Preservation
1Challenges and Solutions for Digital Geospatial Data Preservation
Jeff Essic
Geospatial Data Services Librarian
North Carolina State University Libraries
Digital Preservation Summit, Indiana State University
May 21, 2008
2NC Geospatial Data Archiving Project
- Partnership between university library (NCSU) and state agency (NCCGIA), with Library of Congress under the National Digital Information Infrastructure and Preservation Program (NDIIPP)
- One of 8 initial NDIIPP collection building partnerships
- Focus on state and local geospatial content in North Carolina (state demonstration)
- Tied to NC OneMap initiative, which provides for seamless access to data, metadata, and inventories
3NCGDAP Goals
- Repository Goal
- Capture at-risk data
- Explore technical and organizational challenges
- Project End Goal
- Data Producers: Improved temporal data management practices
- Archives: More efficient means of acquiring and preserving data
- Progress towards best practices
4NCGDAP Specifics
- Funding
- $520,000 for 2005-2007
- $500,000 for 18-month extension
- Staff
- 1.5 at NCSU
- Approx. same at NCCGIA
5Selected Geospatial Data Archive Projects
6Outline
- Key Geospatial Data Types
- Risks to Digital Geospatial Data
- Value in Temporal/Historical Geospatial Data
- Archiving Challenges
- Solutions in Progress
7Key Geospatial Content Types
8Data Types Digital Orthophotography
- All 100 NC counties with orthos
- 1-5 flight years per county
- 30-300 GB per flight
9Geospatial Data Types Vector GIS
- County, municipal, state
- Detailed, accurate, current
- Frequently updated
- Cadastral (tax parcels)
- Street centerlines
- Zoning
- Topographic contours
- School, sheriff, fire
- Voting precincts
- More
10Imagery: durable, static, simple structure, mostly open formats
Vector data: volatile, frequent update, complex structure, mostly proprietary formats
Downtown Raleigh near State Capitol, 2005 Wake County Ortho
11Data Types Spatial Databases
- Vector, raster, and tabular data
- Relationships
- Behaviors
- Annotation
- Data Models
12Geospatial Data Types Cartographic
- GIS Software
- Software project file (.mxd, .apr, ...)
- Data layer file (.avl, .lyr, ...)
- PDF, GeoPDF map exports
- Web Services-based representations
13Other Geospatial Data Types Place-based Data
Street Views
Oblique Imagery
3D Images
Tax Dept. Photos
- Present-day value in location-based services and mobile applications
- Future value for cultural heritage, descriptions of places
14Other Geospatial Data Types Web 2.0 Content
15Geospatial Data Compelling Issues
- Dynamic content
- Constantly updated information
- Data versioning
- Digital object complexity
- Spatially enabled databases
- Complicated, multi-component formats
- Proprietary formats
16Risks to Geospatial Data
17Digital Preservation Points of Failure
- Data is not saved, or
- can't be found, or
- media is obsolete, or
- media is corrupt, or
- format is obsolete, or
- file is corrupt, or
- meaning is lost
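Two of the failure points above ("can't be found" and "file is corrupt") can be detected mechanically with a fixity audit. The following is a minimal sketch, assuming SHA-256 checksums recorded at acquisition time; the function names and the manifest layout are illustrative assumptions, not part of the NCGDAP workflow described in this presentation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Files listed in the manifest that are no longer present.
        "missing": sorted(set(manifest) - set(current)),
        # Files still present whose content no longer matches.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
    }
```

A check like this says nothing about obsolete formats or lost meaning; those failure points need format documentation and metadata rather than checksums.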
18Risks to Geospatial Data
- Producer focus on current data
- Data overwrite as common practice
- Future support of data formats in question
- No open, supported format for vector data
- Shift to web services-based access
- Data becoming more ephemeral
- Inadequate or nonexistent metadata
- Impedes discovery and use
- Increasing use of spatial databases for data management
- The whole is greater than the sum of the parts
19Value in Historical/Temporal Geospatial Data
20Value in Older Data Cultural Heritage
Future uses of data are difficult to anticipate
(as with Sanborn Maps)
21Application Impervious Surface Change Mapping
A.
B.
2002 Impervious
2004 Aerial Photography
C.
D.
2004 Impervious Update
2004 Impervious using 2002 Mask
22Application Shoreline Change Mapping
23Application Identifying Land Use Changes
1993
1998
1999
2005
2002
Use case Land use and impervious surface change
analysis
25Preservation Challenges
26Challenge Data Capture
2006 Frequency of Capture Survey targeting North Carolina counties and municipalities
Response: yes 65.3%, no 34.7% (out of 57.6% response rate)
27Challenge Data Capture
- Industry focus on latest and greatest data
- Industry temporally-impaired from the point of view of data availability, software support, etc.
- Loss of memory about the data
- Of superseded county orthophoto flights in NC:
- Only 22 recorded in the state's GIS inventory
- Only 30 accessible through county map servers
Some older inventories only available through Internet Archive
28Survey of current archiving practice among NC
counties and municipalities
- "All of our data is kept monthly for 1 year, i.e., September 2006 tape will be overwritten September 2007."
- "I do a weekly backup of existing data but it is overwriting the previously saved data."
- "All of our data is archived daily, then weekly, then monthly, and yearly."
- "No emphasis on historical data here. We just try to keep from losing data completely."
- "Very minimal hardware to work with and no money."
29Survey of current archiving practice among NC
counties and municipalities
- "We are only an emerging GIS. But it is my intention that ALL data will be archived."
- "Getting ready to implement this type of archiving of data."
- "I have not done this, but it does seem like a good idea!"
- "I do not see why this can not be incorporated with disaster recovery. Don't you think you would foster greater support?"
Tremendous data producer interest in digitizing and georeferencing old analog imagery and maps
30Challenge Preservation Metadata
Results from a 2006 survey of all 100 NC counties
and 25 largest NC municipalities
31Challenge Vector Data Formats
- No widely-supported, open vector formats for geospatial data
- Spatial Data Transfer Standard (SDTS) not widely supported
- Geography Markup Language (GML): diversity of application schemas and profiles a challenge for permanent access
- Spatial Databases
- The whole is more than the sum of the parts, and the whole is very difficult to preserve
- Can export individual data layers for curation, but relationships and context are lost
32Problem Multiple choice for format type,
coordinate system, tiling scheme
33Challenge Digital Object Complexity
- Files
- Multi-file dataset
- Georeferencing
- Metadata file
- Symbols file
- Additional documentation
- License
- Disclaimer
- More
- Metadata
- FGDC
- Acquisition metadata
- Transfer metadata
- Ingest metadata
- Archive rights
- Archive processes
- Collection metadata
- Series metadata
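The multi-file nature of a dataset can be made concrete with a completeness check run at acquisition time. The sketch below assumes Shapefile deliveries, where the published Esri specification requires the .shp, .shx, and .dbf files; treating .prj (georeferencing) and .shp.xml (metadata) sidecars as "recommended" is an assumption of this example, as are the function names.

```python
from collections import defaultdict
from pathlib import Path

# The three files every shapefile needs, per the published Esri specification.
REQUIRED = {".shp", ".shx", ".dbf"}
# Sidecars this sketch treats as preservation-relevant (an assumption, not a standard).
RECOMMENDED = {".prj", ".xml"}


def group_by_dataset(filenames):
    """Group file names by dataset stem, e.g. 'parcels.shp' -> 'parcels'."""
    groups = defaultdict(set)
    for name in filenames:
        path = Path(name)
        # 'parcels.shp.xml' is the metadata sidecar for 'parcels.shp'.
        if name.lower().endswith(".shp.xml"):
            groups[path.name[: -len(".shp.xml")]].add(".xml")
        else:
            groups[path.stem].add(path.suffix.lower())
    return groups


def report_gaps(filenames):
    """Return {dataset: {'missing_required': [...], 'missing_recommended': [...]}}."""
    report = {}
    for stem, suffixes in group_by_dataset(filenames).items():
        if not suffixes & REQUIRED:
            continue  # no shapefile component at all, e.g. a license or readme
        report[stem] = {
            "missing_required": sorted(REQUIRED - suffixes),
            "missing_recommended": sorted(RECOMMENDED - suffixes),
        }
    return report
```

For example, a delivery containing `streets.shp` and `streets.dbf` but no `streets.shx` would be flagged before ingest, while the contact at the producing agency can still be reached.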
34Challenge Cartographic Representation
Counterpart to the map is not just the dataset
but also models, symbolization, classification,
annotation, etc.
35Challenge Geospatial Web Services
USGS nat_haz ArcIMS Service, 7 May 2008, 11:05 am
36Carrboro, NC Population 17,797 (2005 est.)
24 downloadable GIS data layers
6 web mapping applications
4 OGC WMS services (web services)
9 downloadable PDF map layers
37Other Challenges
- Rights management
- Data versioning
- Semantic issues
- Large scale content transfer
- Integrating older analog data
- More
38Solutions in Progress
39Different Ways to Approach Preservation
- Technical solutions: How do we preserve acquired content over the long term?
- Cultural/Organizational solutions: How do we make the data more preservable, and more prone to be preserved, from point of production?
Current use and data sharing requirements, not archiving needs, are most likely to drive improved preservability of content and improvement of metadata
40Different Ways to Approach Preservation
- Technical solutions: How do we archive acquired content over the long term?
- Build data repositories not just as an end in itself but also as a catalyst for discussion within the data community
- Develop repository ingest workflows; create technical points of engagement with other NDIIPP preservation projects and build on collective learning experience
41Different Ways to Approach Preservation
- Cultural/Organizational solutions: How do we make the data more preservable, and more prone to be archived, from point of production?
- Engage data producer community and spatial data infrastructure through outreach and engagement; influence practice
- Sell the problem to software vendors and standards development
- Find overlap with more compelling business problems: disaster preparedness, business continuity, road building, etc.
- Start a discussion about roles at the local, state, and federal level
42Content Identification
Technical Solution Data Repository
43Formal Inventory Processes
- Alleviate contact fatigue on part of local agencies
- 20 different NC state agencies contact local agencies for data; also, federal/regional agencies
- Geospatial data is complex, requiring lengthy inventory process
- Must capture descriptive, technical, and administrative information related to the data
- Make the inventory available as a sharable data store
44What do Inventories Offer to Archives?
- Data Availability Information
- Detailed information by data layer
- Contact Information
- Minimal Metadata
- Descriptive, technical, administrative
- Rights Information
- Document Technical Environment
- Software used, formats, transfer methods
- Future Data Development Plans
45Detailed Information About Data
Source NC OneMap Data Inventory 2004
46Inventories as Source of MetadataExample
Surface Water
47Content Selection
48Selection Issues
- Most content is already at some level of risk
- Early-Middle-Late Stage issues
- Middle stage is usually the sweet spot, e.g. TIFF orthophotos vs. raw images or compressed images
- Also added-value products: digital maps, cartographic representation
- Digital maps: record or not?
- Frequency of capture
49Time series vector data Parcel Boundary Changes
2001-2004, North Raleigh, NC
Continuously updated data: Frequency of snapshots? Different for various framework layers?
50Sept. 2006 Frequency of Capture Survey
- Survey objective
- Document current practices for obtaining archival snapshots of county/municipal geospatial vector data layers
- Seek guidance about frequency of capture
- Survey topics
- General questions about data archiving practice
- Specific questions about parcels, street centerlines, jurisdictional boundaries, and zoning
- Survey subjects
- All 100 counties and 25 municipalities
- 58% response rate
- Survey conducted September 2006
51Frequency of Capture Survey
52Data Capture Survey Results Overview
- Two-thirds of responding agencies create and retain periodic snapshots
- Long-term retention more common in counties with larger populations
- Storage environments vary, with servers and CD-ROMs most common
- Offsite storage (or both onsite and offsite) is used by nearly half of the respondents
- Popularity of historic images has resulted in scanning and geo-referencing of hardcopy aerial photos among one-third of the respondents
53Survey Observations
- Process of survey formulation and implementation helped to socialize the problem of archiving data
- Local innovation needs to be mined further to inform development of best practices
- Business drivers for archiving need more study (e.g., stated adherence to retention policy)
- Exposure to peer practice encourages archiving
- Pronounced local interest in scanning/rectifying older analog maps and imagery
54Content Exchange
55Solutions Content Exchange Infrastructure
- High volume of state/federal requests for local data
- Solving the present-day problems of data sharing is a pre-requisite to solving the problem of long-term access
- Leveraging more compelling business reasons to put the data in motion (disaster preparedness, business continuity, highway construction, census, ...)
- Content exchange networks
- Minimize need to make contact
- Add technical, administrative, descriptive metadata
- Establish rights and provenance
56Solutions Content Exchange Infrastructure
- Nov. 2007: NC Geographic Information Coordinating Council (GICC)
- Ten Recommendations in Support of Geospatial Data Sharing released
- Recommendation: Establish archive and long term data access strategies
- Suggested best practices include: Establish a policy and procedure for the provision of access to historic data, especially for framework data layers.
- http://www.ncgicc.org/CurrentActivities/TenRecommendationsinSupportofGeospatialData/tabid/156/Default.aspx
57Solutions Get the Data in Motion
- Harvesting use cases for older data as part of
outreach
Survey of current archiving practice among NC
counties and municipalities
58Solutions Getting the Data in Motion
- Important Objectives
- Minimize Direct Contact
- Document Data
- Clarify Rights
- Routinize Transfers
- Leverage other business uses that put data in motion
- Continuity of operations
- Highway Planning
- Floodplain Mapping
Most costly part of archive development is identifying, negotiating acquisition, and then transferring data
59Solutions Getting the Data in Motion
- NC GIS Inventory
- Efficient data identification
- Adding preservation elements
- Orthophoto Data Distribution System
- "Sneakernet" transfer of large quantities of imagery
- NC OneMap Data Download and Viewer
- Public access
- Data visualization
- Street Centerline Data Distribution System
- Efficient transfer of data from 100 counties, with metadata and clarified rights
- http://www.ncstreetmap.com
60Solutions County and City GIS Data Directories
- Tracking data, map servers, and web services since 2000
- Ranked 3rd in traffic among entry points to library website
- Persistent identifiers
- Usage tracking
- IDs used in other sites
- Peers compare activities
- Community help in site maintenance
61Repository Development
62General Workflow
- Receive Data from Agency
- Copy data from agency source to NCSU workstation
- Create DSpace collection space for the data
- Create administrative metadata
- Process geospatial metadata
- Scan geospatial formats and migrate to archival format
- Ingest original and archival data objects, and geospatial administrative metadata to DSpace
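The first steps of the workflow above (copy from the agency source, then create administrative metadata) can be sketched as follows. This is only an illustration under stated assumptions: the directory naming scheme, the `admin_metadata.json` file, and its field names are hypothetical, and no DSpace API is called.

```python
import json
import shutil
from datetime import date
from pathlib import Path


def stage_delivery(source: Path, staging_root: Path, agency: str, received: date) -> Path:
    """Copy an agency delivery into staging and write an administrative metadata record."""
    # Hypothetical naming scheme: <agency>_<YYYY-MM-DD>
    target = staging_root / f"{agency}_{received.isoformat()}"
    original = target / "original"
    # Keep the delivery untouched in its own subdirectory, separate from
    # anything later derived from it (e.g. migrated archival formats).
    shutil.copytree(source, original)

    files = sorted(p for p in original.rglob("*") if p.is_file())
    record = {
        "agency": agency,
        "received": received.isoformat(),
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "files": [p.relative_to(original).as_posix() for p in files],
    }
    (target / "admin_metadata.json").write_text(json.dumps(record, indent=2))
    return target
```

Keeping the original objects apart from migrated copies mirrors the final workflow step, where both the original and the archival data objects are ingested.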
63Repository Status
- Acquired 4 TB of data with more on the way
- Disk space being used initially for data staging
- Inventorying
- In the process of ingesting content into DSpace
- Metadata generation
64Summary
Technical Solution Data Repository
65Data Capture Challenge Implemented Solutions
- Downloading or acquiring low hanging fruit
- Frequency based on FOC survey
- Tapping into existing content exchange networks
- Orthophoto sneakernet
- NC OneMap
- NCStreetmaps.org
- Floodplain Mapping data distribution
- Others
66Preservation Metadata Challenge Implemented
Solutions
- Creating our own based on
- Non-standard documentation
- Inventories
- Personal information exchanges
- Data context
- Clues, memory, and other sleuthing
67Vector Data Formats and Complexity Challenges
Implemented Solutions
- Converting and Preserving data in Shapefile format
- Not ideal, but
- Specifications are published
- Stable, widely accepted and known format
- Ingest content into DSpace object model
- Exportability, Transfer, Extraction, and Conversion being tested
68Cartographic Representation Challenge
Implemented Solutions
- Scanned, georeferenced, and compressed over 286 NC geologic maps, in cooperation with NC Geologic Survey
1:31,680  1:430,000  1:500,000  1:2.5 M
69Geospatial Web Services Challenge Implemented
Solutions
- Still searching
- WMS (Web Map Service)
- Can only capture derived static images, losing the underlying data intelligence
- Possible use for agent-based image atlas creation
- WFS (Web Feature Service)
- Transfers actual vector data as GML
- Not widely deployed; variation in configuration
- Scalability for bulk transfer questionable
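The "image atlas" idea for WMS can be illustrated by tiling an area of interest and building one GetMap request per tile. The sketch below uses the standard WMS 1.1.1 GetMap parameters; the endpoint URL, layer name, grid size, and image dimensions are hypothetical, and the code only constructs URLs without contacting any server.

```python
from urllib.parse import urlencode


def tile_bboxes(minx, miny, maxx, maxy, cols, rows):
    """Split a bounding box into a cols x rows grid of (minx, miny, maxx, maxy) tiles."""
    dx = (maxx - minx) / cols
    dy = (maxy - miny) / rows
    return [
        (minx + c * dx, miny + r * dy, minx + (c + 1) * dx, miny + (r + 1) * dy)
        for r in range(rows)
        for c in range(cols)
    ]


def getmap_url(base_url, layer, bbox, width=1024, height=1024, srs="EPSG:4326"):
    """Build a WMS 1.1.1 GetMap request URL for one tile."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "SRS": srs,
        "BBOX": ",".join(f"{v:.6f}" for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, for illustration only.
urls = [
    getmap_url("https://example.org/wms", "parcels", bbox)
    for bbox in tile_bboxes(-80.0, 35.0, -78.0, 36.0, cols=2, rows=1)
]
```

Each URL returns a rendered picture of the layer, which is the limitation noted above: the atlas preserves how the data looked, not the features and attributes behind it.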
70Engaging Spatial Data Infrastructure
Cultural/Organizational Solution Engaging Others
71NC Spatial Data Infrastructure NC OneMap
- NC OneMap is a next generation mechanism to coordinate and disseminate geographic information in North Carolina and interact with the NSDI.
- Objectives
- Build a common understanding of North Carolina data resources
- Enable widespread access and distribution of geospatial data
72NC OneMap
- Objectives (cont.)
- Develop ongoing data inventory for all geospatial data holdings
- http://nc.gisinventory.net
- Develop content standards for key data themes
- NC Geographic Information Coordinating Council (GICC)
- One of the defined characteristics of NC OneMap is that "Historic and temporal data will be maintained and available."
73Points of Engagement with Spatial Data
Infrastructure
- Framework data communities
- Snapshot frequency, naming schemes, classification, GML application schemas, format strategies
- Metadata standards and outreach
- Persistent identifiers, versioning, feedback on metadata quality
- Content replication/transfer
- For data improvement projects, disaster preparedness, aggregation by regional service providers, and archives
- Where does archiving and preservation fit in?
74Archival and Long Term Access Working Group
- Initiated by NC Geographic Information Coordinating Council in 2008 to address growing concerns of state and local agencies about long-term access to data
- Federal, state, regional, and local agency representation
- Key focus
- Best practices for data snapshots and retention
- State Archives processes: appraisal, selection, retention schedules, etc.
- Who, What, Why, When, Where, How
- Promising outcome of NCGDAP: multiple parties and levels discussing data archiving on their own.
75Regional Partnerships
- Focused on development of shared infrastructure for cultivating access to data
- Becoming test beds for innovation in the area of data sharing and data management, including archiving
76NDIIPP Multi-State Geospatial Project
- Lead organizations: North Carolina Center for Geographic Information Analysis (NCCGIA) and State Archives of NC
- Partners
- Leading state geospatial organizations of Kentucky and Utah
- State Archives of Kentucky and Utah
- NCSU Libraries in catalytic/advisory role
- State-to-state and geo-to-Archives collaboration
- 2 year project: Nov. 2007-Dec. 2009
- Archives as part of Spatial Data Infrastructure
77Engaging Industry
78Cultural Changing Industry Thinking
- Is the geospatial industry temporally-impaired?
- Lack of access to older data
- Lack of tool/model support for temporal analysis
- Metadata: poor support for changing data
- Education: building class projects around available data (i.e., not temporal)
- Increased interest now in temporal applications?
- Increased demand for temporal data?
- Improved tool support: ArcGIS 9.2 animation tools, Geodatabase History, etc.
79Project Status
What About Commercial Data?
Cultivating a commercial market for older data.
Part of permanent access is marketing,
advertising, and putting older data into the path
of the user
80Conclusions
81Conclusions
- Supporting temporal analysis requirements gets more attention than archiving and preservation
- Leverage existing infrastructure
- Current data sharing needs drive infrastructure improvements that help archiving
- Leverage business needs that are more compelling than preservation (e.g., continuity of operations)
- Facilitate stakeholder ownership of the solutions
- Mine state and local archiving innovations
82Slide Presentation
Temporarily at http://www4.ncsu.edu/jfessic/DPW08.ppt
Later, permanently linked at http://www.lib.ncsu.edu/ncgdap
Steve Morris, Head, Digital Library Initiatives, NCSU Libraries, ph (919) 515-1361, Steven_Morris_at_ncsu.edu
Jeff Essic, Geospatial Data Services Librarian, NCSU Libraries, ph (919) 515-5698, Jeff_Essic_at_ncsu.edu