SJC CGW2 - PowerPoint PPT Presentation

About This Presentation
Title:

SJC CGW2

Description:

Catering for a whole community. Curation Issues. File formats, complexity and specialisation ... Catering for a whole community. What data is worth storing? ... – PowerPoint PPT presentation

Slides: 30
Provided by: simon175
Category:
Tags: sjc | catering | cgw2 | uk


Transcript and Presenter's Notes

Title: SJC CGW2


1
Chemistry research data in the modern age: A clear need for curation expertise
Simon Coles, School of Chemistry, University of Southampton, U.K.
s.j.coles_at_soton.ac.uk
2
Data Generation
Data Collection
Synthesis
Publication
Data Workup
Data Processing
3
Data Types
RAW data: G bytes
DERIVED data: M bytes
RESULTS data: k bytes
Lab / Institution
Subject Repository / Data Centre / Public Domain
4
Incentives and Drivers
  • Chemists don't think about their data!
  • They need to understand that their data is
    valuable and has a use beyond that of an
    immediate gain, before they will consider
    curation issues.
  • So what are the incentives and drivers?
  • Data Management
  • Data Deluge
  • Publishing Data
  • Validation, Assessment and Peer Review
  • Re-analysing Data
  • Data Reuse and Derivative Studies
  • Publishing and Funding Mandates

5
Curation Incentives - Data Management, Deluge and Publishing
"Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers, or if they did might not know what those numbers meant."
"Lost in some research assistant's computer, the data are often irretrievable or an undecipherable string of digits."
"To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data."
"Data from Big Science is easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science."
"Lost in a Sea of Science Data", S. Carlson, The Chronicle of Higher Education (23/06/2006)
6
Curation Incentives - Data Management, Deluge and Publishing
2,000,000
30,000,000
450,000
7
Curation Incentives - Data Management, Deluge and Publishing
8
Separating Data from Interpretations
Intellect and Interpretation (journal article, report, etc.)
Underlying data (Institutional data repository)
9
The eCrystals Data Repository
An Institutional Repository
http://ecrystals.chem.soton.ac.uk
10
The Repository for the Laboratory
Create new compound
Deposit
Add experiment data and metadata
11
Curation Incentives - Validation and Peer Review
12
Curation Incentives - Raw Data Re-analysis
Good data
Difficult data
You never know when data might have to be revisited or new innovations will allow re-interpretation!
13
Curation Incentives - Funding and/or Publishing Mandates
  • Mandates to store / make data available
  • RCUK statement

14
Curation Incentives - Derivative Science
  • Starting points for new science
  • Derivation of knowledgebases

15
Curation Issues
  • Need to engage stakeholders throughout the whole
    research data lifecycle
  • Instrument manufacturers,
  • scientists,
  • archivists,
  • librarians,
  • subject repositories,
  • data centres,
  • publishers,
  • funders,
  • data miners and information providers

16
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
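
One common way to detect data corruption and bit rot is to record a checksum for every file at deposit time and recompute it later. The following is a minimal sketch of that idea, not a description of how any particular repository works; the function names (`sha256_of`, `build_manifest`, `changed_files`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

A manifest built when the data is deposited can be stored alongside it; running `changed_files` on a schedule then flags any file whose contents have silently changed or disappeared. Reading in chunks keeps memory use flat even for gigabyte-scale raw data.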

17
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
  • Future proofing
  • Technology developments
  • eScience

18
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
  • Catering for a whole community

19
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
  • Catering for a whole community
  • What data is worth storing?
  • Estimated that the real cost of a crystal
    structure is 75 - 100 (200)
  • But what about the cost of producing the
    crystal?
  • Priceless!
  • The crystal was synthesised in a specialised
    laboratory, by highly trained researchers under a
    specific research program
  • A laboratory, researcher or scheme of work is a
    transient or evolving entity
  • As much data as possible must be acquired and
    future-proofed whilst the analyst has the
    substance to hand

20
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
  • Catering for a whole community
  • What data is worth storing?
  • Provenance, workflow and rights protection

21
Curation Issues
  • File formats, complexity and specialisation
  • Data corruption and bit rot
  • Quantity of data
  • Catering for a whole community
  • What data is worth storing?
  • Provenance, workflow and protection of rights
  • Available expertise, library/information services
    structure
  • Cost and policy
  • Business models
  • Subject librarian model - working closely with
    practitioners
  • New funding/structure models to support open data
    as OA takes off
  • Working group to assess the volume and diversity
    of research data
  • JISC funded survey - Cost of preserving research
    data
  • Commercialisation of knowledge derived from
    collections of data

22
Dealing with Data Report, June 2007
Recommendations 1
  • JISC should develop a Data Audit Framework to
    enable all universities and colleges to carry out
    an audit of departmental data collections,
    awareness, policies and practice
  • Each Higher Education Institution should
    implement an Institutional Data Management,
    Preservation and Sharing Policy, which recommends
    data deposit in an appropriate open access data
    repository and/or data centre where these exist.

23
Institutional Structure
  • Encourage restructuring through strategic funding
  • Rechannel existing funding routes
  • Financial structure: money for self-archiving or
    OA publishing
  • Physical structure: embed LIS/curation staff in
    departments for advocacy; they need to go native.
  • Library / Information services need to be
    introspective / reinvent

24
Advocacy
  • Younger digital generation
  • Elders will not listen
  • Method to engage at departmental level
  • Funders undervaluing work need enlightening

25
Funding
  • Small science
  • Low budget / funding
  • Hypo publishing
  • Unsupported
  • Initially target areas that are safe, i.e. no
    sensitive data

26
Practice
  • Small science vs big science
  • Instrumentation vs manual
  • Automate data capture
  • Heterogeneity/variety in practice
  • Problems same in industry

27
Tools
  • Seamless
  • Simple to use
  • Low barrier to use
  • Integrated into familiar environment
  • Self describing (generate provenance and
    preservation metadata in the background)
  • Tagging / controlled vocab tools / servers
  • Vocab checking
  • Browser tools (familiar to youth)
  • Thin client tools: repository lite. Minimal
    management. Highly distributed repositories
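
As an illustration of the "self describing" point above, a capture tool could write a small metadata record next to each data file without asking the user for anything beyond what it already knows. This is a hedged sketch only: the `describe` function, the `.meta.json` sidecar naming, and the field names are all hypothetical and do not follow any specific metadata standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def describe(data_file: Path, operator: str, instrument: str) -> Path:
    """Write a JSON sidecar next to data_file and return the sidecar path."""
    payload = data_file.read_bytes()
    record = {
        "file": data_file.name,
        "size_bytes": len(payload),
        # Checksum recorded at capture time, usable for later integrity checks.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "instrument": instrument,
    }
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

In practice the operator and instrument values would come from the login session and the instrument software rather than from the user, which is what keeps the barrier to use low.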

28
eInfrastructure
  • Semantic / controlled vocabulary central services

29
Economic models and value
  • Data is NOT valueless once published (EPSRC train
    of thought)
  • What is the value of departmental-level data?
    This is not necessarily monetary
  • Department, institution, individual, data centre,
    pharma, government, research council, public,
    third party services/businesses
  • We undervalue data
  • Subject repository economic sustainability
  • Evidence to back up advocacy