Title: Preserving Scientific Data
1Preserving Scientific Data
- Jamie Shiers, Information Technology Department,
CERN, Geneva, Switzerland
2Agenda
- Motivation for preserving scientific data: examples from a range of sciences
- Volume of data involved and related issues
- Some concrete archiving examples from Particle Physics
- Remaining challenges
- Conclusions
3Motivation
- Climate data: in an era when climate change is hotly debated, the motivations appear clear
- Medical data: important for understanding issues such as historical pandemics, cross-species diseases etc. (e.g. avian flu, HIV)
- Cosmological data: plays a vital role in our evolving understanding of the Universe. The astrophysics community has an explicit policy (data is made public after 1 year; data volume doubles each year)
- Particle Physics data: similar arguments apply. Will we ever be able to build similar accelerators to those of today? If we lose this data, what of our scientific heritage? There is also a need to look at old data for a signal that should have been seen (this has happened several times)
4Standard Cosmology: good model from 0.01 sec after the Big Bang, supported by considerable observational evidence
Elementary Particle Physics: from the Standard Model into the unknown, towards energies of 1 TeV and beyond (the Terascale)
Towards Quantum Gravity: from the unknown into the unknown...
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
5Issues
- How much data is involved?
- Preserving the bits
- Understanding the bits
6How much data is involved?
- In 1998, the following estimates were made regarding the data from LEP (1989-2000) that should be kept:
Experiment Analysis dataset Reconstructable dataset
ALEPH 250GB 1-2TB
DELPHI 2-6TB
L3 500GB 5TB
OPAL 300GB 1-2TB
- By today's standards, these data volumes are trivial
- Even though the total volume of data at the LHC is much, much higher, the data that must be kept beyond the life of the machine (2007 to 2020) will be easily handled by then
- The LHC will generate some 15PB of data per year!
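As a rough illustration of the scale quoted above, the yearly rate can be turned into a cumulative figure. This is a minimal sketch in Python, assuming a constant 15PB per year over 2007 to 2020 inclusive. The function name and the constant-rate assumption are illustrative and not from the talk. The result estimates the data generated, not the subset that must be kept beyond the life of the machine.

```python
# Inputs quoted on the slide; the constant yearly rate is an assumption.
PB_PER_YEAR = 15
FIRST_YEAR, LAST_YEAR = 2007, 2020


def cumulative_volume_pb(first_year, last_year, pb_per_year):
    """Total volume in PB for a constant rate over an inclusive year range."""
    years = last_year - first_year + 1
    return years * pb_per_year


total_pb = cumulative_volume_pb(FIRST_YEAR, LAST_YEAR, PB_PER_YEAR)
print(f"{total_pb} PB generated between {FIRST_YEAR} and {LAST_YEAR}")
```

Under these assumptions the total is a few hundred PB, which is several orders of magnitude above the LEP figures in the table.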
7The LHC machine - Overview
8The size of HEP detectors
[Figure: ATLAS and CMS shown alongside Bld. 40]
9Understanding the bits
- In the mid-1990s, a successful re-analysis of 10-year-old data from the JADE collaboration at the PETRA accelerator at DESY was made
- A sub-set of the data was found abandoned in an office corner. The programs to read the data were in an obsolete language and were unusable. The data format was proprietary (but de-codable).
- This provided valuable input into the LEP data archive
- Data format: will this be readable in 5 / 10 / 100 years? 1000?
- Programs: languages / operating systems / hardware platforms have very short life-spans wrt an archive
- Metadata: essential to understand what the data means
- The best solution to date is a so-called Museum system, but this is still a very short-term solution wrt even Einstein, let alone Tycho Brahe, Kepler and Newton
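The format, program and metadata points above can be made concrete with a small descriptive record stored next to each archived file. The following is a minimal sketch in Python, not a description of any scheme used for the LEP archive. The field names, `build_manifest` and `sha256_of` are hypothetical. Such a record only captures what someone thinks to write down, so it does not by itself solve the problem of understanding the data.

```python
import hashlib
import json
import os
import platform
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(path, data_format, reader_software, description):
    """Collect hypothetical descriptive fields for one archived file."""
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256_of(path),
        "data_format": data_format,          # what the bits encode
        "reader_software": reader_software,  # what is needed to read them
        "written_on": platform.platform(),   # OS / hardware at write time
        "description": description,          # what the data means
    }


# Usage with a throwaway file standing in for real data.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(b"example event record")
manifest = build_manifest(
    tmp.name,
    data_format="hypothetical-format v1",
    reader_software="hypothetical-reader 0.1",
    description="Placeholder description of what the records mean",
)
os.remove(tmp.name)
print(json.dumps(manifest, indent=2))
```

The checksum lets a later reader confirm the bits are unchanged; the remaining fields record the environment that the bullets above identify as short-lived.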
10Preserving the bits
- Lifetimes of Particle Physics experiments are extremely long! Currently measured in decades
- Ironically, one of the solutions proposed for the LEP data archive (the then-current proposal for the LHC) was later abandoned (technical / commercial reasons)
- This necessitated a triple migration:
  - Of 300TB of data between storage media
  - Of the same data from one data format to another
  - Of the accompanying processing codes
- In the end, the exercise took around 2 months per 100TB of data migrated, as well as a significant amount of effort (1 FTE / 100TB) and hardware resources
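The quoted rates can be applied to the 300TB volume as simple arithmetic. This is a minimal sketch in Python, assuming that time and effort scale linearly with volume and that the data is migrated sequentially; neither assumption is stated in the talk, and `migration_estimate` is a hypothetical helper. The FTE figure is used exactly as quoted, without assuming a time unit.

```python
def migration_estimate(volume_tb, months_per_100tb=2.0, fte_per_100tb=1.0):
    """Scale the quoted per-100TB rates linearly to a given volume."""
    units = volume_tb / 100.0
    return {
        "months": units * months_per_100tb,
        "fte": units * fte_per_100tb,
    }


estimate = migration_estimate(300)
print(estimate)
```

For 300TB this gives about 6 months and 3 FTE under the stated assumptions; migrating in parallel would shorten the elapsed time but not the effort.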
11Outstanding Issues
- There are no data formats, programming languages, computing hardware or operating systems with lifetimes that can be guaranteed beyond the short term
- Virtual machine technology may extend an environment's (see above) natural life, perhaps doubling it
- Reducing the data into a much simplified and widely-used format can have significant advantages, but only allows restricted analyses to be performed
- Preserving the detailed knowledge of the experimental apparatus is beyond current technology: it would require extreme discipline on the part of the researchers as well as major advances in the understanding and description of metadata
12Conclusions
- As long as advances in storage capacity continue, there are no significant issues related to the volume of scientific data that must be kept
- Periodic migration between different types of storage media must be foreseen
- Specific storage formats must also be catered for: this can require much more significant (time-consuming and expensive) migrations
- By far the biggest problem concerns understanding the data: there is currently no clear solution in this domain
13References
- LEP Data archive
  - 1997: http://s.web.cern.ch/s/sticklan/www/archive/
  - 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
  - 2003: http://cern.ch/pfeiffer/LEP-Data-Archive/proposal/ProposalForTheLEPDataArchive.html
  - http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf
- Lisbon workshop
  - http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
  - http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf
- COMPASS / HARP data migrations
  - http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
  - http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
  - http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0
14Acknowledgements
- The following people provided material and / or pointers for this talk (knowingly or otherwise):
- LEP Data Archive coordinators
  - David Stickland, David.Stickland_at_cern.ch (L3)
  - Andreas Pfeiffer, Andreas.Pfeiffer_at_cern.ch
  - Marcello Maggi, Marcello.Maggi_at_ba.infn.it (ALEPH)
- COMPASS / HARP migrations
  - Andrea Valassi, Andrea.Valassi_at_cern.ch
- ERPANET/CODATA Workshop
  - Jürgen Knobloch, Juergen.Knobloch_at_cern.ch
15The End