Long-Lived Data Collections

Research data archive (RDA) - 95% LLDC. Meteorological and ... 20 & 60 GB/Cart. 200 GB/cart. Multi-phased plan, 2 years. 9/3/09. Steven Worley, NCAR/SCD ... – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Long-Lived Data Collections
  • Outline
  • Data Archiving
  • Data Maintenance
  • Data Migration

2
Data Archiving
  • At NCAR
  • Research data archive (RDA) - 95% LLDC
  • Meteorological and physical oceanographic data
  • Built over 35 years
  • 500 datasets, 25 TB, growing daily
  • Nine data stewards (graduate degrees in meteorology/oceanography)

3
Data Archiving
[Figure: Monthly Mean Air Temperature at 2 m]
  • One example: ERA-40
  • Global atmospheric reanalysis
  • 1957-2002
  • Many reference frames
  • Pressure surfaces
  • Isentropic, ...
  • Many resolutions
  • 2.5°
  • Spectral, N80
  • Expect O(1000) users

4
Data Archiving
  • Practices and Policies
  • Save two copies
  • Offsite backup
  • under different management system
  • Time stable attributes
  • No proprietary data formats
  • Access software in basic languages
  • Fortran, C, ...
  • Minimize software dependence on complex libraries
  • E.g. netCDF, HDF

5
Data Archiving
  • Practices and Policies, continued
  • Shared responsibility across agencies
  • For large collections, e.g. ERA-40, 35 TB
  • Two-step archive plan
  • 1st: Data stays with the PI, who distributes it
  • Applies stewardship, QC, analysis, documentation
  • 2nd: Mature data is transferred to an archive center
  • Long-term preservation and continued access
  • Should an archive plan be part of an NSF proposal?
  • Fits into broader impacts
  • Include data formats and metadata

6
Data Archiving
  • Practices and Policies, continued
  • Data compression
  • Important for efficient storage and transport
  • Use open standards
  • Submission to a data center
  • Early submission advantages
  • Data is captured before funding runs out
  • Unburdens the PI from data management
  • Greater sharing leads to more scientific knowledge gains
  • Disadvantages
  • PI's first-evaluation rights
  • Not a problem now
  • Authorization and authentication
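
The data compression bullets on this slide (efficient storage and transport, open standards) can be illustrated with a short round trip. This is a minimal sketch, assuming gzip as the open format; the slide does not name a specific compression scheme, and the sample payload is invented for illustration.

```python
import gzip

def compress_roundtrip(payload: bytes):
    """Compress with gzip (an openly specified format) and restore it again."""
    packed = gzip.compress(payload)
    restored = gzip.decompress(packed)
    return packed, restored

# Hypothetical, highly repetitive sample payload.
sample = b"monthly mean air temperature at 2m\n" * 1000
packed, restored = compress_roundtrip(sample)

# Lossless: the restored bytes must match the original exactly.
assert restored == sample
print(f"original {len(sample)} bytes, compressed {len(packed)} bytes")
```

Because the format is openly documented, a reader written decades later in a basic language can still unpack the files, which fits the archive's preference for minimal software dependence.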

7
Data Maintenance
  • Practices and Policies
  • Use a change control system for all transactions
  • Creation
  • File additions, fixes, replacements
  • Metadata updates
  • The data and metadata remain tightly linked
  • Note: this system itself must be viable for decades
  • Same principles as the archive
  • Employ science data stewards
  • Additional insurance for accurate data preservation
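
One way to picture the change control idea on this slide is an append-only transaction log kept as plain text, so that the log itself stays readable for decades. The sketch below is an illustration only: the record fields, the `log_transaction` and `history` helpers, the dataset identifiers and the JSON Lines layout are all assumptions, not a description of the RDA system.

```python
import io
import json
from datetime import datetime, timezone

# Hypothetical transaction kinds, named after the bullets on this slide.
KINDS = {"creation", "file_addition", "file_fix",
         "file_replacement", "metadata_update"}

def log_transaction(log, dataset_id, kind, detail):
    """Append one change record as a line of JSON to an open text stream."""
    if kind not in KINDS:
        raise ValueError(f"unknown transaction kind: {kind}")
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset_id,
        "kind": kind,
        "detail": detail,
    }
    log.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def history(log_text, dataset_id):
    """Return all records for one dataset, in the order they were logged."""
    records = (json.loads(line) for line in log_text.splitlines() if line)
    return [r for r in records if r["dataset"] == dataset_id]

# Usage with an in-memory stream; a real log would be an append-mode file.
log = io.StringIO()
log_transaction(log, "ds-example-1", "creation", "initial load")
log_transaction(log, "ds-example-1", "metadata_update", "corrected units")
print([r["kind"] for r in history(log.getvalue(), "ds-example-1")])
```

Keeping data changes and metadata changes in the same log is one simple way to keep the two tightly linked.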

8
Data Maintenance
  • Practices and Policies
  • Do data integrity checks
  • Monitor all network transfers for faults
  • Receipt and reconciliation reports
  • Many checks: byte counts, test files, comparisons
  • Keep user information current
  • Changes trigger web page updates
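
The integrity checks listed above (byte counts and comparisons, summarized in a reconciliation report) might look like the following. This is a minimal sketch: the use of SHA-256 as the comparison method and the `file_fingerprint` and `reconcile` names are assumptions, since the slide does not say which checks NCAR actually runs.

```python
import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    """Return (byte_count, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            count += len(chunk)
    return count, digest.hexdigest()

def reconcile(sent, received):
    """Compare sender and receiver fingerprints, keyed by file name.

    Both arguments map name -> (byte_count, checksum).
    Returns a report mapping name -> status string.
    """
    report = {}
    for name, (size, checksum) in sent.items():
        if name not in received:
            report[name] = "missing"
        elif received[name][0] != size:
            report[name] = "byte count mismatch"
        elif received[name][1] != checksum:
            report[name] = "checksum mismatch"
        else:
            report[name] = "ok"
    return report
```

The byte count is checked first because it is cheap and catches truncated transfers; the checksum then catches corruption that leaves the size unchanged.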

9
Data Maintenance
  • Practices and Policies - Concerns
  • Fact: huge collections of web-based documentation
  • Text, images, links
  • Embedded scripting (e.g. JavaScript)
  • HOW DO YOU ARCHIVE WEB SITES?
  • Access content 20 years from now?
  • Data in DBMSs
  • Software dependent
  • Not viable for LLDCs: a technology trap

10
Data Maintenance
  • Practices and Policies
  • Use standard metadata
  • Version control
  • Lineage documentation
  • Publication documentation
  • Preservation status
  • LLDCs are seldom static
  • New metadata, data corrections, new links
  • Need flexible maintenance methods

11
Data Migration
  • Example: SCD/NCAR Mass Storage System (MSS)
  • RDA plus MUCH MORE
  • NCAR supercomputers
  • NCAR data analysis machines
  • Other NCAR/UCAR divisions and programs
  • How much of the data is an LLDC?
  • Ongoing debate with our users/scientists
  • Data storage policies?

12
Data Migration
  • Scale of the problem (reference date 01/21/2004)
  • 21.5 million files, 1.7 PB
  • Growth: 50 TB/month total
  • 1 million file moves per month
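
A few derived quantities help put these figures in perspective. The arithmetic below is a sketch that assumes decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes), a 30-day month, and linear growth at the quoted rate; none of these assumptions is stated on the slide.

```python
# Figures quoted on the slide.
FILES = 21.5e6                    # number of files
TOTAL_BYTES = 1.7e15              # 1.7 PB, decimal units assumed
GROWTH_BYTES_PER_MONTH = 50e12    # 50 TB/month, decimal units assumed
MOVES_PER_MONTH = 1e6             # file moves per month

# Average file size in megabytes.
mean_file_mb = TOTAL_BYTES / FILES / 1e6

# Monthly growth as a percentage of the current holdings.
monthly_growth_pct = 100 * GROWTH_BYTES_PER_MONTH / TOTAL_BYTES

# Months until the holdings double, if growth stays linear at this rate.
months_to_double = TOTAL_BYTES / GROWTH_BYTES_PER_MONTH

# Average file moves per day, assuming a 30-day month.
moves_per_day = MOVES_PER_MONTH / 30

print(f"mean file size   : {mean_file_mb:.1f} MB")
print(f"monthly growth   : {monthly_growth_pct:.1f} %")
print(f"months to double : {months_to_double:.0f}")
print(f"file moves per day: {moves_per_day:.0f}")
```

Under these assumptions the average file is roughly 79 MB, the holdings grow by about 3% per month, and they would double in about 34 months.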

13
Data Migration
[Figure: NCAR MSS 1986-2003]

14
Data Migration
  • History
  • Since 1986, 5 migrations
  • All tape media
  • NCAR MSS software, scalable
  • Software and system changes may trigger migrations
  • Future
  • Media replacement
  • 20 and 60 GB/cart. to 200 GB/cart.
  • Multi-phased plan, 2 years
  • Multi-phased plan, 2 years
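
To get a feel for what a media replacement of this kind involves, the sketch below counts cartridges at each capacity and the average rewrite rate for a two-year plan. It assumes, purely for illustration, that the whole 1.7 PB holding quoted earlier in the talk is rewritten, that cartridges are filled completely, and that no new data arrives during the migration; the slide makes none of these claims.

```python
def cartridges(total_gb, capacity_gb):
    """Whole cartridges needed to hold total_gb (ceiling division)."""
    return -(-total_gb // capacity_gb)

TOTAL_GB = 1_700_000   # 1.7 PB, decimal units assumed (illustrative volume)
PLAN_MONTHS = 24       # "multi-phased plan, 2 years"

old_20 = cartridges(TOTAL_GB, 20)     # if everything sat on 20 GB cartridges
old_60 = cartridges(TOTAL_GB, 60)     # if everything sat on 60 GB cartridges
new_200 = cartridges(TOTAL_GB, 200)   # after moving to 200 GB cartridges

# Average rewrite rate needed to finish within the plan, in TB per month.
rate_tb_per_month = TOTAL_GB / 1000 / PLAN_MONTHS

print(old_20, old_60, new_200)
print(f"{rate_tb_per_month:.1f} TB/month")
```

Under these assumptions the rewrite rate is about 71 TB per month, which is on top of the 50 TB per month of new data quoted earlier.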

15
Data Migration
  • Migration factors
  • Done interleaved with normal operations
  • Almost continuous now
  • Tape life cycle probably 6-10 years
  • BUT, nominal service may be 3-5 years
  • Allow for migration time
  • Option: extend nominal service
  • Deploy dedicated migration system
  • Unlikely: too expensive

16
Data Migration
  • Practices and policies
  • Need to define data life cycle at creation time
  • E.g. if the retention period is 5 years, no migration is necessary
  • Recognize: a difficult decision for the scientist
  • May not be known a priori
  • Allow for adjustable retention period
  • Allow for peer review
  • Advantage
  • Use the full life cycle of the media
  • Disadvantage
  • Complex storage systems
  • Various media types and end-of-life dates
  • Recognize: LLDCs (if irreplaceable data) must be migrated
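
The link between a retention period set at creation time and the number of migrations can be sketched as a small calculation. The `migrations_needed` function and its rule (one migration each time the media's nominal service period runs out before the retention period ends) are an illustrative assumption, not NCAR policy.

```python
import math

def migrations_needed(retention_years, media_service_years):
    """Number of media migrations over a data set's retention period.

    retention_years=None stands for "preserve in perpetuity" and
    returns math.inf, since the data must outlive every medium.
    """
    if media_service_years <= 0:
        raise ValueError("media_service_years must be positive")
    if retention_years is None:
        return math.inf
    # The first service period needs no migration; each later one needs one.
    return max(0, math.ceil(retention_years / media_service_years) - 1)

# Retention of 5 years on media with 5 years of nominal service: no migration.
print(migrations_needed(5, 5))
# Retention of 20 years on the same media: three migrations.
print(migrations_needed(20, 5))
# Irreplaceable data kept in perpetuity: migration never ends.
print(migrations_needed(None, 5))
```

An adjustable retention period only changes the first argument, so a peer review that extends retention immediately shows how many extra migrations it commits the archive to.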

17
Conclusions
  • Need an archive plan for LLDCs
  • Maintain LLDCs with data stewards and curation experts
  • Need integrated data migration plans and data retention policies
  • If LLDCs are irreplaceable data, preserve in perpetuity