Title: Digital Data Preservation in Astronomy: A Collaboration Among Libraries, Publishers, and the Virtual
1Digital Data Preservation in Astronomy: A
Collaboration Among Libraries, Publishers, and
the Virtual Observatory
A pilot project aimed at preserving, curating,
and enabling access to digital data and
associated electronic journals content.
- Robert Hanisch, Space Telescope Science Institute
- Sayeed Choudhury, Tim DiLauro, Alex Szalay, and
Ethan Vishniac, The Johns Hopkins University
- Julie Steffen, University of Chicago Press
- Teresa Ehling, Cornell University
- Robert Milkey, American Astronomical Society
- Ray Plante, National Center for Supercomputing
Applications
2Outline for Presentation
- The Virtual Observatory
- Data in Astronomy
- The data preservation problem
- A scenario
- Past experience and research
- Approach
- A prototype project
3The Virtual Observatory
- The Virtual Observatory enables new science by
greatly enhancing access to data and computing
resources. The VO makes it easy to locate,
retrieve, and analyze data from archives and
catalogs worldwide.
- The VO is about data discovery, access, and
integration.
- The VO is NOT a huge centralized data repository.
- The VO provides standard protocols for obtaining
data from distributed collections.
- The VO is national (US NVO) and international
(IVOA).
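As a rough illustration of what a standard protocol buys, the sketch below builds request URLs in the style of the IVOA Simple Cone Search protocol, which takes a sky position and search radius as RA, DEC, and SR in decimal degrees. The endpoint URLs and the helper name cone_search_url are assumptions made for this example and do not come from the slides; no request is actually sent.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a cone-search style request URL (RA, DEC, SR in decimal degrees)."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("RA must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be in [-90, 90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")

    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append the query correctly whether or not the base URL already has one.
    if base_url.endswith(("?", "&")):
        separator = ""
    elif "?" in base_url:
        separator = "&"
    else:
        separator = "?"
    return f"{base_url}{separator}{query}"


# Hypothetical endpoints: the same three parameters work for every collection.
ENDPOINTS = {
    "archive 1": "https://archive1.example.org/scs",
    "survey 1": "https://survey1.example.org/scs?catalog=main",
}
urls = {name: cone_search_url(base, 83.633, 22.014, 0.1)
        for name, base in ENDPOINTS.items()}
```

The point of the sketch is that the caller learns one set of parameters and reuses it against every distributed collection that speaks the protocol.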
4Without VO
n services, n interfaces
[Diagram labels: astronomer; archives 1 to 3; services 1 to 3; surveys 1 to 3]
5With VO
n services, 1 interface
[Diagram labels: astronomer; VO; archives 1 to 3; services 1 to 3; surveys 1 to 3]
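The contrast between the two diagrams can be sketched as an adapter layer: each back end keeps its own native call, and a thin wrapper exposes all of them through one search method. This is only an illustration of the "n services, 1 interface" idea, not a description of how the VO is implemented; the class SingleInterface and the two back-end functions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical native back ends, each with its own calling convention.
def archive_lookup(target: str) -> List[dict]:
    return [{"object": target, "file": "image.fits"}]


def survey_query(name: str, max_rows: int = 10) -> List[Tuple[str, float]]:
    return [(name, 12.5)][:max_rows]


Adapter = Callable[[str], List[str]]


class SingleInterface:
    """One search() call in front of any number of registered services."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Adapter] = {}

    def register(self, source: str, adapter: Adapter) -> None:
        self._adapters[source] = adapter

    def search(self, target: str) -> Dict[str, List[str]]:
        # Fan the same request out to every registered source.
        return {source: adapter(target)
                for source, adapter in self._adapters.items()}


vo = SingleInterface()
# Each adapter translates the common request into the back end's native call.
vo.register("archive 1", lambda t: [row["file"] for row in archive_lookup(t)])
vo.register("survey 1", lambda t: [f"{n}: {v}" for n, v in survey_query(t)])
results = vo.search("M1")
```

Adding a fourth archive means writing one more adapter; the astronomer's side of the interface does not change.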
6Why is Astronomy Data Special?
- It has no commercial value
- No privacy concerns
- Can freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional (with confidence intervals)
- Spatial
- Temporal
- Diverse and distributed
- Many different instruments from many different
places and many different times
- The questions are interesting
- There is a lot of it (soon petabytes)
7Data Flow (Levels of Data)
8The data preservation problem
- Research communities publish peer-reviewed
journal papers that describe highly processed
data.
- Long-term preservation and curation systems for
digital journal content, including the digital
data presented only graphically, are not
currently in place.
- The research cannot be verified and the results
cannot be easily compared to other data in order
to broaden impact.
- Public funds invested in scientific research do
not have maximum return on investment. Essential
legacy datasets may be lost.
9Storyboard
10Storyboard
11Storyboard
[Screenshot options: Save as FITS; Copy to my VOSpace; Display in Aladin]
15Astronomy Digital Image Library
16ADIL query
17ADIL query
- ADIL is great, but:
- Data capture and curation are separate from
manuscript processing
- Data access is not integrated into the journals
- Data management is centralized
18Repository-related Research
- Digital Library framework comprises
service-oriented architecture with repositories
as foundation, especially for digital
preservation
- Archive Ingest and Handling Test (AIHT) through
Library of Congress NDIIPP
- A Technology Analysis of Repositories and Service
Integration (funded by Mellon Foundation)
- Project STORE (Source to Output Repositories)
19Approach
- Integrate digital data management into the
publication process (data capture, review,
metadata tagging and validation, storage).
- Exploit emerging information technology standards
for managing distributed data collections,
including digital journals.
- Provide multiple access methods to digital data
to maximize visibility and re-use.
- Exploit information management and curation
experience in the university libraries and build
on long-term institutional commitments to
preservation.
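The metadata tagging and validation step in the first bullet could look something like the sketch below, run when a dataset is submitted alongside a manuscript. The required field names and the list of accepted formats are assumptions chosen for illustration; the slides do not define a metadata schema.

```python
from typing import Dict, List

# Hypothetical schema: these field names are illustrative, not from the project.
REQUIRED_FIELDS = ("title", "creator", "article_id", "data_format")
ALLOWED_FORMATS = {"FITS", "VOTABLE"}  # assumed set of accepted formats


def validate_metadata(record: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    fmt = record.get("data_format")
    if isinstance(fmt, str) and fmt.strip() and fmt.strip().upper() not in ALLOWED_FORMATS:
        problems.append(f"unsupported data format: {fmt}")
    return problems


good = {"title": "Figure 3 image", "creator": "A. Author",
        "article_id": "example-0001", "data_format": "fits"}
bad = {"title": "", "creator": "A. Author", "data_format": "JPEG"}

good_problems = validate_metadata(good)
bad_problems = validate_metadata(bad)
```

Returning a list of problems instead of raising on the first one lets an editorial tool show the author everything that needs fixing in a single pass.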
20Components
- Publication
- Editorial Process
- Data capture
- Metadata capture and validation
- Links
- Identifiers
- Library
- Curation
- Preservation
- Data Storage Appliance (three instances shown)
- Metadata database
- Digital data objects
- Ancillary information
- Replication services
- VOSpace
- Data Access
- VO portals
- Journal portals
- Other after-market distributors
- Registry
- Logging
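One routine task implied by replicated storage appliances is a fixity check: confirming that every replica of a digital data object is still byte-for-byte identical. The sketch below compares SHA-256 digests across sites and flags any copy that disagrees with the majority. The site names and the function find_divergent_replicas are hypothetical, and a real system would read objects from storage and not hold them in memory.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def checksum(data: bytes) -> str:
    """SHA-256 digest of one stored object, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def find_divergent_replicas(replicas: Dict[str, bytes]) -> List[str]:
    """Return the sites whose copy differs from the most common digest."""
    if not replicas:
        return []
    digests = {site: checksum(data) for site, data in replicas.items()}
    # With an even split, the digest counted first is treated as the majority.
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(site for site, digest in digests.items()
                  if digest != majority_digest)


# Hypothetical replicas of one data object held at three sites.
original = b"SIMPLE  =                    T / example payload"
replicas = {
    "site_a": original,
    "site_b": original,
    "site_c": original.replace(b"T", b"F"),  # simulated corruption
}
divergent = find_divergent_replicas(replicas)
```

With three copies, a single damaged replica can be identified and repaired from the two that agree, which is one reason replication is usually paired with periodic checksum audits.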
21A prototype project
- Implement an end-to-end prototype using astronomy
scholarly publications as a test-bed.
- Understand operational costs and develop a
long-term business plan for preservation of
peer-reviewed journal content and associated
supporting data.
- Develop associated policies affecting data
accessibility (e.g., move toward requiring
digital data availability as a requirement for
publication).
- Utilize commodity open-source technologies and
partner with the Virtual Observatory to maximize
return on investment, flexibility, and adaptability.
- Evaluate the long-term impact on citations and
productivity resulting from having ready access
to digital data.
22A prototype project
- Tasks
- metadata definition
- content management tool evaluation/selection
(Fedora)
- physical storage and replication
- publication process revisions and testing
- policy development
- business model development
- Shared technology development/deployment
with National Virtual Observatory
23Current collaborators
- The Johns Hopkins University-Sheridan Libraries,
Edinburgh University Library, University of
Washington Library, and Cornell University Library
(information management and curation)
- The National Virtual Observatory project
(representatives from JHU, Space Telescope
Science Institute, and the National Center for
Supercomputing Applications)
- American Astronomical Society (journals, editors)
- The University of Chicago Press (publisher for
the AAS journals)
24Status
- Support from
- UK JISC (Joint Information Systems Committee) and
CURL (Consortium of Research Libraries in the
British Isles)
- US Institute of Museum and Library Services
- Support committed from
- Microsoft
- SPARC (Scholarly Publishing and Academic
Resources Coalition)
- TeraGrid
- NVO
- Development has started