1. Investigating Metadata for Long-Lived Geospatial Resources: An Exploration
By Nancy J. Hoebelheinrich, Metadata Coordinator
Digital Library Systems & Services
Stanford Digital Repository
2. To Be Discussed
- The Study
- Question asked
- Methodology used
- MD standards: strengths / weaknesses
- Conclusions & Recommendations
- Future work needed
3. The Study
- NGDA Project
- Pertinent project objective
- Collect and archive major segments of at-risk digital geospatial data and images
- Partners / Backgrounds / Areas of experience
- UCSB: Alexandria Digital Library / Presentation
- Stanford Libraries: Stanford Digital Repository / Preservation
- Differences in experiences gave rise to study question
4. Study Question
- What metadata is needed for long-lived geospatial data formats?
- Grounded in previous studies
- Hunolt paper for USGCRP Office
- Digital Preservation Coalition (UK)
- NSF
- OAIS Reference model
- Duerr, Parsons articles
- OCLC / RLG Preservation studies
5. Methodology used
- Evaluate four fairly typical geospatial data formats
- Shapefiles, DOQQs, DRGs & Landsat 7 satellite images (preliminary)
- Compare / contrast 3 different approaches to documenting
- FGDC Content Standard
- CIESIN Geospatial Electronic Records
- PREMIS
6. Categories of information about the resources
- Environment (computer platforms)
- Semantic Underpinnings
- Domain specific terminology
- Provenance
- Data trustworthiness
- Data quality
- Appropriate use
7. Environment (computing platform)
- Definition: characteristics of the hw / sw configuration that allow a resource to function properly
- Function could be defined as
- Rendering
- Viewing
- Using
- May need to be repeated
8. Environment, cont.
- All 3 systems have means for documenting these characteristics
- Both PREMIS and GER provide more granularity & parsability, e.g., creatingApplication, sw, hw name, versions & dependencies, environment type, etc.
- FGDC uses technical prerequisites & native data set
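The environment characteristics above could be recorded along these lines. This is a minimal sketch, assuming element names such as environmentPurpose, swName, swVersion, swDependency and hwName as recalled from the PREMIS data dictionary. It is not validated against any PREMIS schema version, and the software and hardware values are hypothetical. One record is built per function, since the description may need to be repeated.

```python
import xml.etree.ElementTree as ET


def build_environment(purpose, sw_name, sw_version, sw_dependency, hw_name):
    """Build a PREMIS-style <environment> element for one function.

    Element names are assumed from the PREMIS data dictionary; the result
    is not checked against any PREMIS schema.
    """
    env = ET.Element("environment")
    ET.SubElement(env, "environmentPurpose").text = purpose

    software = ET.SubElement(env, "software")
    ET.SubElement(software, "swName").text = sw_name
    ET.SubElement(software, "swVersion").text = sw_version
    ET.SubElement(software, "swDependency").text = sw_dependency

    hardware = ET.SubElement(env, "hardware")
    ET.SubElement(hardware, "hwName").text = hw_name
    return env


# Hypothetical values; one record per function (render, view, use) because
# the environment description may need to be repeated.
environments = [
    build_environment(purpose, "Example GIS Viewer", "1.0",
                      "Example OS 5.1", "x86 workstation")
    for purpose in ("render", "view", "use")
]

for env in environments:
    print(ET.tostring(env, encoding="unicode"))
```

Keeping software and hardware in separate, named sub-elements is what makes the record parsable by machine, in contrast to a single free-text prerequisites field.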
9. Semantic Underpinnings
- Detailed concepts
- Meaning or essence of data
- Significance of data, i.e., why preserve it?
- Purpose or function served by data
- Intended community
- FGDC & GER have fairly extensive set, particularly GER
- PREMIS NOT descriptive or domain specific, so not covered
10. Domain specific terminology
- For geospatial, particularly valuable
- Keywords associated with data themes
- Spatial coverage
- Time period
- Stratum coverage / place names
- GER & FGDC cover
- PREMIS NOT descriptive MD
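As an illustration of the domain terms listed above, the following sketch pulls theme keywords, place names, spatial coverage and time period out of an FGDC-style record. The tag short names (themekey, placekey, westbc, begdate and so on) are assumed from the FGDC content standard, and the sample record and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical FGDC-style fragment; tag short names are assumed from the
# content standard and the values are invented for illustration.
FGDC_SAMPLE = """
<metadata>
  <idinfo>
    <timeperd><timeinfo><rngdates>
      <begdate>19990101</begdate><enddate>19991231</enddate>
    </rngdates></timeinfo></timeperd>
    <spdom><bounding>
      <westbc>-124.4</westbc><eastbc>-114.1</eastbc>
      <northbc>42.0</northbc><southbc>32.5</southbc>
    </bounding></spdom>
    <keywords>
      <theme><themekt>None</themekt><themekey>hydrography</themekey></theme>
      <place><placekt>None</placekt><placekey>California</placekey></place>
    </keywords>
  </idinfo>
</metadata>
"""


def summarize_domain_terms(fgdc_xml):
    """Collect theme keywords, place names, bounding box and time period."""
    idinfo = ET.fromstring(fgdc_xml).find("idinfo")
    bounding = idinfo.find("spdom/bounding")
    return {
        "themes": [e.text for e in idinfo.findall("keywords/theme/themekey")],
        "places": [e.text for e in idinfo.findall("keywords/place/placekey")],
        "bbox": {
            side: float(bounding.findtext(side))
            for side in ("westbc", "eastbc", "northbc", "southbc")
        },
        "period": (
            idinfo.findtext("timeperd/timeinfo/rngdates/begdate"),
            idinfo.findtext("timeperd/timeinfo/rngdates/enddate"),
        ),
    }


terms = summarize_domain_terms(FGDC_SAMPLE)
print(terms)
```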
11. Provenance
- Detailed concepts
- Info about the events, parameters & source data associated w/ construction of data set prior to ingestion
- Source of data
- Changes made to data inside the preservation archive
- FGDC, GER & PREMIS all cover the first 2
- FGDC does NOT cover the last
12. Provenance, cont.
- Greater level of granularity / parsability in PREMIS using Object, Event & Agent entities
- See Example for Rumsey Historical Map Image Collection about descriptive MD transformation
13. Use of PREMIS Event Data Elements
- Example Event 1
- Transform of descriptive MD from MS Access db to XML MODS
- Why this event?
- In case of questions from outside data provider
- Retain singular scripts & transform mechanisms
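An event like Example Event 1 might be recorded roughly as follows. This is only a sketch: the element names (eventIdentifier, eventType, eventDateTime, eventDetail, linkingAgentIdentifier, linkingObjectIdentifier) are assumed from the PREMIS data dictionary and are not validated against a schema, and every identifier and date below is hypothetical. It is not a reconstruction of the excerpt shown in the presentation.

```python
import xml.etree.ElementTree as ET


def add_identifier(parent, tag, id_type, id_value):
    """Append a <tag> holding <tagType> and <tagValue> children."""
    wrapper = ET.SubElement(parent, tag)
    ET.SubElement(wrapper, tag + "Type").text = id_type
    ET.SubElement(wrapper, tag + "Value").text = id_value
    return wrapper


def build_event(event_id, event_type, date_time, detail, agent_id, object_id):
    """Build a PREMIS-style <event> element (names assumed, not validated)."""
    event = ET.Element("event")
    add_identifier(event, "eventIdentifier", "local", event_id)
    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = date_time
    ET.SubElement(event, "eventDetail").text = detail
    # The agent is the script or mechanism that performed the transform.
    add_identifier(event, "linkingAgentIdentifier", "local", agent_id)
    # The object is the metadata record produced by the transform.
    add_identifier(event, "linkingObjectIdentifier", "local", object_id)
    return event


# Hypothetical identifiers and date; only the detail text follows the slide.
event = build_event(
    event_id="event-001",
    event_type="metadata transformation",
    date_time="2007-01-01T00:00:00",
    detail="Transform of descriptive MD from MS Access db to XML MODS",
    agent_id="transform-script-001",
    object_id="mods-record-001",
)
print(ET.tostring(event, encoding="unicode"))
```

Linking the event to the transforming script as an agent is one way to retain the transform mechanism so that later questions from the data provider can be answered.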
14. PREMIS Event Excerpt (v1.1)
15. Use of PREMIS Event Data Elements
- Example Event 2
- Merge c:\temp\states1 c:\temp\states2 c:\temp\USA
- (includes process merge and data sources)
- Advantage: can describe events once in repository, unlike FGDC, but
- Can include if prior to ingestion?
- Why this event?
- Important to describe processes during different phases of lifecycle, even prior to ingestion
- Not being able to do so is problematic for geospatial resources
- Is best practice issue for this domain
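To make the open question about pre-ingest events concrete, here is a small data-structure sketch, loosely modeled on the PREMIS Event entity. The before_ingest flag and the "source" / "outcome" roles are hypothetical additions for illustration, not PREMIS elements, and reading states1 and states2 as inputs and USA as the output is an assumption about the merge example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LinkedObject:
    """An object touched by an event, with its role in that event."""
    identifier: str
    role: str  # assumed vocabulary: "source" or "outcome"


@dataclass
class Event:
    """A lifecycle event, loosely modeled on the PREMIS Event entity."""
    event_type: str
    detail: str
    before_ingest: bool  # hypothetical flag; not a PREMIS element
    linked_objects: List[LinkedObject] = field(default_factory=list)

    def identifiers_with_role(self, role):
        """Return identifiers of linked objects that play the given role."""
        return [o.identifier for o in self.linked_objects if o.role == role]


# Paths follow the merge example; assigning roles to them is an assumption.
merge_event = Event(
    event_type="merge",
    detail=r"Merge c:\temp\states1 c:\temp\states2 c:\temp\USA",
    before_ingest=True,
    linked_objects=[
        LinkedObject(r"c:\temp\states1", "source"),
        LinkedObject(r"c:\temp\states2", "source"),
        LinkedObject(r"c:\temp\USA", "outcome"),
    ],
)

print(merge_event.identifiers_with_role("source"))
print(merge_event.identifiers_with_role("outcome"))
```

Recording the data sources as linked objects, rather than only in free text, keeps the process and its inputs queryable for every phase of the lifecycle.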
16. Data trustworthiness
- Detailed concepts
- Who are parties responsible for creation, development, storage, maintenance of data set
- Where is data located
- How is data available
- What important factors about the data should be
preserved
17. Data trustworthiness, cont.
- Parties
- Location of data
- Factors to preserve
- FGDC, only originator; GER & PREMIS more granular & parsable
- GER & FGDC seem more specific & less inclusive (only POV of distributor) for last 2
18. Data Quality
- Detailed concepts
- General condition statement
- Accuracy of the data
- Fidelity of relationships within the data set
- Accuracy of measurements of the data
- FGDC has tags, but are very specific
- GER not much coverage here
19. Appropriate use
- Detailed concepts
- Legal use and liability statements
- Technical characteristics that impact use
- FGDC & PREMIS have, GER NOT
- FGDC NOT, GER & PREMIS have means of linking to format registry info
- More about format registries, later
20. PREMIS Significant properties
- Way within PREMIS to document
- Data trustworthiness: data creator / provider reliable & authentic
- Data quality: describing completeness, logical consistency, attribute accuracy
- Data provenance: processes & sources for dataset understandable & reliable
- Appropriate use: understanding of the specific needs of the designated community?
- Other important factors to preserve
- More work needs to be done in this area
21. Strengths & weaknesses: FGDC
- Rich in detail
- Specificity for the geospatial domain
- Ubiquity
- Very complex & laborious to complete
- Poor means for describing relationships among file components of a digital resource
- No way to describe digital resource once within preservation archive
22. Strengths & weaknesses: GER
- Focus on archiving
- Comprehensive
- Little known as yet
- No data dictionary, so unclear how to apply tags (cardinality, repeatability, etc.)
- Relational DB format
- Unclear if and/or how to describe digital resource once within preservation archive
23. Strengths & weaknesses: PREMIS
- Applicable at many levels of digital resource: abstract & physical
- Capability for describing relationships among file components of digital resource
- Capability for describing digital resource during its entire lifecycle within the preservation archive
- Generic & focused upon preservation
- Not specific enough for geospatial
- Does not include critical semantics or descriptive information important for using digital resource
- Fairly young specification; unclear how to document significant properties for digital resources
24. Recommendations
- Use of content standard (e.g., FGDC, or ISO when it replaces FGDC)
- Best used for semantics, domain specific terminology
- PREMIS
- Best used for management of resources over time using
- PREMIS Object entity
- PREMIS Event entity
- PREMIS Agent entity
- Useful to package resources & metadata together to facilitate tracking of aggregation of resource(s), MD, resource structure & file inventory, e.g., METS
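The packaging recommendation could look something like the skeleton below: a METS wrapper with a descriptive section typed FGDC, a provenance section typed PREMIS, and a file inventory for the components of one shapefile. This is a sketch under assumptions: the section IDs and file names are hypothetical, the metadata wrappers are left empty where real FGDC and PREMIS records would go, and the output is not validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Return a METS tag name in ElementTree's {namespace}tag form."""
    return f"{{{METS_NS}}}{tag}"


def build_package(component_files):
    """Build a METS skeleton tying metadata sections to a file inventory."""
    mets = ET.Element(m("mets"))

    # Descriptive, domain-specific metadata (content standard).
    dmd = ET.SubElement(mets, m("dmdSec"), ID="DMD1")
    dmd_wrap = ET.SubElement(dmd, m("mdWrap"), MDTYPE="FGDC")
    ET.SubElement(dmd_wrap, m("xmlData"))  # an FGDC record would go here

    # Preservation metadata (objects, events, agents).
    amd = ET.SubElement(mets, m("amdSec"), ID="AMD1")
    prov = ET.SubElement(amd, m("digiprovMD"), ID="PROV1")
    prov_wrap = ET.SubElement(prov, m("mdWrap"), MDTYPE="PREMIS")
    ET.SubElement(prov_wrap, m("xmlData"))  # PREMIS records would go here

    # File inventory and resource structure.
    file_grp = ET.SubElement(ET.SubElement(mets, m("fileSec")), m("fileGrp"))
    struct_map = ET.SubElement(mets, m("structMap"))
    div = ET.SubElement(struct_map, m("div"), TYPE="dataset", DMDID="DMD1")
    for index, name in enumerate(component_files, start=1):
        file_id = f"FILE{index}"
        file_el = ET.SubElement(file_grp, m("file"), ID=file_id, ADMID="PROV1")
        ET.SubElement(
            file_el,
            m("FLocat"),
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
        ET.SubElement(div, m("fptr"), FILEID=file_id)
    return mets


# Hypothetical component files of a single shapefile.
package = build_package(["states.shp", "states.shx", "states.dbf"])
print(ET.tostring(package, encoding="unicode"))
```

Because every file entry points back to the provenance section and the structure map points to the descriptive section, the aggregation of resource, metadata and file inventory can be tracked from one document.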
25. Issues & Challenges
- What if domain specific MD is not available?
- If not, how can one get important info from data creators?
- How to determine what is truly necessary for use of data sets?
- Establishment of geospatial format registries
- Getting buy-in from geospatial domains for use of vocabularies, etc. (see Global Spatial Data Infrastructure http://www.gsdi.org/Default.asp )
- More research needs to be done on significant properties like that done by JISC & DPC studies, e.g., SPs of vector images
26. Future directions for NGDA Project
- Further investigation of other geospatial formats including more vector based data such as
- layers of the National Atlas
- National Map (sections of California)
- Landsat 7 ETM imagery
- Derived data sets from Stanford faculty
27. Future directions, cont.
- Format Registry investigation: what should be included in a format registry for geospatial?
- Contact with key vendors, e.g., ESRI, Safe Software, etc.
- Monitoring what others are doing with e-science & social science data sets, e.g.,
- NCSU, Johns Hopkins
- National Archives of Australia (NAA)
- JISC and DPC in the UK
- NDIIPP US Multi-state project
- Those using new DDI v 3.0 schema
28. References, contact info
- JISC & DPC studies on significant properties http://www.dpconline.org/graphics/events/080407workshop.html
- See Duce and Nielsen papers
- Full paper available at http://www.ngda.org/research.php
- National Geospatial Digital Archive http://www.ngda.org/index.php
- Examples of METS with PREMIS on METS public wiki
- http://www.socialtext.net/mim-2006/index.cgi?profile_playground
29. Questions? / comments?
- Nancy J. Hoebelheinrich
- nhoebel_at_stanford.edu
- John Banning jwbanning_at_gmail.com