Title: Digital Curation 101
1- Digital Curation 101
- Taster
- Joy Davidson, Associate Director, DCC
british.editor_at_erpanet.org
- Sarah Higgins, Standards Advisor, DCC
Sarah.Higgins_at_ed.ac.uk
Funded by
This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 UK:
Scotland License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/
or send a letter to Creative Commons,
543 Howard Street, 5th Floor,
San Francisco, California, 94105, USA.
2- DC 101 aims and objectives
- Data management and curation are becoming
increasingly integral to successful research or
digitisation bids. Using the context of beginning
a new research bid, this short course aims to
introduce participants to the DCC Curation
Lifecycle Model as a means of contextualising the
range and nature of roles and activities required
to maintain access to data over time. While the
DCC Curation Lifecycle Model is sequential, it is
flexible and allows users to start at any point
in the model, or to start addressing issues which
have had lower priority, depending on their
current needs.
- Ultimately, tools and approaches will evolve over
time, but if participants understand the bigger
picture they will be in a better position to make
critical decisions that best reflect their
individual needs. The course will introduce
participants to some of the tools and approaches
and provide them with pointers to further
information and support.
- The course is aimed at researchers, content
creators and those who support them. We hope that
participants leave the course equipped to explain
why data curation is important and what roles
they have to play in the process.
3- What is curation?
- Data have importance as the evidential base for
scholarly conclusions, and for the validation of
those conclusions, a basic tenet of which is
reproducibility.
- Curation is the active management and appraisal
of data over the lifecycle of scholarly and
scientific interest; it is the key to
reproducibility and reuse. This adds value
through the provision of context and linkage,
placing emphasis on 'publishing' data in ways
that ease reuse, with implications for metadata
and interoperability.
- Data curation is part of good research and
content management practice.
4- Why Curate?
- Curation brings immediate and longer-term
benefits:
- Access to reliable, working data both for the
creator and users
- Compliance with funding body and research council
mandates on data sharing, management and access
- Independent validation of research findings
- Reliable lab and field electronic notebooks
through trustworthy capture
- Large amounts of data can be developed and
analysed across different locations by
maintaining consistency in working practices and
interpretations
- Relationship management between different
versions of dynamic or evolving datasets is
easier
- Facilitated linkage with related research and
between primary, secondary and tertiary data
- Knowledge and data originating from short-term
research projects do not become obsolete or
inaccessible when funding expires
- Innovative combinations of data sets become
possible, e.g. combined historic biodiversity data
and GIS data can be used to investigate trends in
ecosystem development.
5- Lifecycle approach to curation
- digital materials are fragile and susceptible to
change from technological advances from creation
onwards
- activities (or lack of) at each lifecycle stage
influence ability to manage and preserve
materials in subsequent stages
- reliable re-use of digital materials is only
possible if materials are curated in such a way
that their authenticity and integrity are
retained
- requires significant input and buy-in from the
range of stakeholders: creators, curators, IT
staff, management
- helps maximise initial investment made in
creating or gathering data
- supports verification of provenance
- facilitates continuity of service
- From Pennock, Maureen, Digital Curation: A
Life-Cycle Approach to Managing and Preserving
Usable Digital Information (2007)
6- The DCC Curation Lifecycle Model
- Provides a graphical high-level overview of the
stages required for successful curation and
preservation of data.
- It can be used to plan activities within an
organisation or consortium to ensure all
necessary stages are undertaken, each in the
correct sequence.
- Full Lifecycle Actions
- Sequential Actions
- Occasional Actions
- http://www.dcc.ac.uk/lifecycle-model/
7- Researchers and content creators tend to focus on
- conceptualise
- create or receive
- ingest
- store
- access, use and reuse
- data
- description
- community watch and participation
8- Researchers and content creators tend to focus less on
- appraise and select
- dispose
- preservation action
- transform
- representation information
- preservation planning
- curate and preserve
- migrate
- reappraise
9- Conceptualise
- Conceive and plan the creation of data,
including capture method and storage options.
- Researchers:
- define a research question
- begin to design the experiment
- seek funding
- conceive and plan the creation of data
- consider capture methods and storage options
- identify research collaborators
- identify potential subjects
- Roles: researcher, funding bodies, publishers, IT
department, ethics panel
- Plan with digital curation in mind!
- Decisions made at the Conceptualise stage impact
on every other stage of the lifecycle.
10- Specific issues to consider for the Conceptualise stage
- Research design and workflows: what do you want
to do?
- What storage needs do you anticipate? Does
your institution have the capacity for this?
Will you keep raw or derived data or both?
- Will you make use of any existing data? Will you
need to obtain rights to use it?
- Do you want your data to interoperate with other
datasets? If so, how will you ensure that this is
possible?
- What are the funder's requirements regarding
curation and preservation? Will they pay for
curation activity?
- Will the research involve any legal restrictions
on the use of and access to the data?
- Are there any data protection issues that will
require data cleaning before the data can be
accessed and used?
- Do you require ethical approval from your
institution or funder? Will this have any impact
on the data's potential use and reuse?
- Do you need to calibrate data capture devices?
Will this need to occur at multiple sites?
- Will the data be released under Creative Commons
or Science Commons licenses?
- Are there likely to be any embargoes on data
publication?
11- Create or Receive
- Create data including administrative,
descriptive, structural and technical metadata.
Preservation metadata may also be added at the
time of creation. OR Receive data, in
accordance with documented collecting policies,
from data creators, other archives, repositories
or data centres, and if required assign
appropriate metadata.
- Roles: researchers, information specialists,
technical support
- Ensure data are curation ready!
- Be careful: data may be irreplaceable.
- Capture context for long-term reuse and
comprehensibility.
- Clearly identify IPR at an early stage. This can
become murky later in the process.
12- Specific issues to consider for the Create or Receive stage
- What do you want people to be able to do with the
data you are generating?
- What do you not want people to be able to do with
the data?
- Are there any variations between data capture
tools located at different sites? How will you
ensure that these are recorded/addressed?
Consistency of testing and data acquisition is
crucial.
- Will you be adhering to any content, syntax, and
structure standards? Are these easily available
for use by everyone on the project team?
- Who will have rights over any collaboratively
generated data (e.g., databases)?
- Who will record contextual metadata, and how?
- What level of data quality do you need to
achieve? How will you ensure this level is
achieved across all partners?
- Will you make use of any ontologies to facilitate
data integration?
- Will you make use of any data collection
policies?
- How will you handle file naming and version
control?
- Do you have access to training and support for
any/all of the above?
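One way to make the file naming and version control question concrete is to agree a naming pattern and generate names from it rather than typing them by hand. The following is a minimal sketch in Python, assuming a hypothetical convention of project, description, ISO date and zero-padded version number; the function names and the pattern itself are illustrative assumptions, not a DCC recommendation.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYY-MM-DD>_v<NN>.<ext>
_VERSIONED = re.compile(r"^(?P<stem>.+)_v(?P<version>\d+)\.(?P<ext>[a-z0-9]+)$")


def _slug(text):
    """Lowercase the text and collapse runs of other characters into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def versioned_name(project, description, created, version, extension):
    """Build a file name that follows the assumed project convention."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    ext = extension.lstrip(".").lower()
    stem = f"{_slug(project)}_{_slug(description)}_{created.isoformat()}"
    return f"{stem}_v{version:02d}.{ext}"


def next_version(existing_names, stem, extension):
    """Return the next unused version number for files sharing a stem."""
    versions = []
    for name in existing_names:
        match = _VERSIONED.match(name)
        if match and match["stem"] == stem and match["ext"] == extension:
            versions.append(int(match["version"]))
    return max(versions, default=0) + 1
```

For example, `versioned_name("Soil Survey", "Site A pH", date(2009, 3, 14), 2, ".CSV")` returns `soil-survey_site-a-ph_2009-03-14_v02.csv`. Whatever pattern a project chooses, the point is that every partner site applies the same one.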
13- Ingest and Store
- Transfer data to an archive, repository, data
centre or other custodian. Adhere to documented
guidance, policies or legal requirements. Store
the data in a secure manner adhering to relevant
standards.
- Data is transferred to a curation environment
such as an institutional repository or a
subject-based repository.
- Roles: information specialists, repository
managers, researchers
- Prepare data for long-term storage, access and
continuity! Storage may be a dedicated data
repository or a folder on a shared drive, but
must be considered, secure and adhere to relevant
standards.
14- Specific issues to consider for the Ingest and Store stages
- Does the data have sufficient metadata? If more
is required, who will be responsible for
providing it?
- Will the data require additional cleaning before
it can be ingested into the repository?
- Will frequent access to the data be required? If
so, this could affect the storage choices.
- What level of responsibility does the repository
indicate it will take on with regard to
stewardship?
- Does the repository accept your data formats? If
not, will any normalisation processes be applied
on deposit of non-preferred formats?
- Does the repository outsource any of its
activity? Could this have an impact on your data?
- Does the repository have sufficient resources and
policies in place?
- Once ingest is complete, is there a formal
acknowledgement that the transfer of custody has
occurred?
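The question of whether data has sufficient metadata can be checked mechanically before deposit. The sketch below, in Python, assumes a hypothetical list of required fields; a real list would come from the receiving repository's own deposit guidance, and the field names here are illustrative only.

```python
# Hypothetical minimum set of fields; the repository's deposit policy
# should define the real list.
REQUIRED_FIELDS = (
    "title",
    "creator",
    "date_created",
    "description",
    "format",
    "rights",
)


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None, empty strings and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def ready_for_ingest(record, required=REQUIRED_FIELDS):
    """True when every required field has a non-blank value."""
    return not missing_metadata(record, required)
```

A record with a blank `description` and no `rights` entry would be reported as missing exactly those two fields, which tells the project who still needs to supply what before transfer.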
15- Access, Use and Reuse
- Ensure that data is accessible to both
designated users and reusers, on a day-to-day
basis. This may be in the form of publicly
available published information. Robust access
controls and authentication procedures may be
applicable.
- Roles: repository managers, researchers
- Ensure access and continuity!
16- Specific issues to consider for the Access, Use and Reuse stage
- Are the intended users of the data able to access
it and make use of it? i.e., are they able to use
the data in the way that you originally intended
them to use it? What about non-intended users?
- Are there any restrictions on access and reuse?
Ensure that these are communicated to the
repository staff.
- Researchers should work with repository managers
to develop suitable access policies and terms for
use of the data.
- If you are planning on making your data freely
accessible for reuse, have you supplied enough
context to enable its reliable reuse?
- Are there adequate finding aids to help locate and
retrieve your data within the repository?
- Is the data practically interoperable with other
datasets? Does it need to be?
17- Appraise and Select
- Evaluate data and select for long-term curation
and preservation. Adhere to documented guidance,
policies or legal requirements.
- Researchers and content creators, along with
information specialists, use quality checks to
identify and evaluate data for long-term
curation
- must be legal, appropriate, and valuable
- may include data objects, metadata, and
contextual information.
- Roles: researchers, information specialists,
funding bodies
- Develop robust policies! The 'keep everything'
approach quickly becomes unviable. As the volume
of curated data increases, efficient search and
retrieval become more difficult.
18- Specific issues to consider for the Appraise and Select stage
- Does the data meet the data quality metrics
identified by both the researchers and the
archive? Who will be responsible for the final
decision? Can errors in the data remain
undetected at this stage, and cause problems at
later stages?
- Has enough contextual information been collected
to make an informed decision about which data to
keep?
- What is the minimum you need to keep for your
data findings and publications to be supported
over time?
- Are there any data that you, by law, are not
allowed to keep? How will they be destroyed and
what evidence will you be able to provide to
support this if necessary?
- Do you have any schedule for re-appraisal over
time?
- Do you have access to expertise in your project
staff or at your institution to assist with
selection and appraisal?
- Your initial bid is a good place to start as
you'll have clearly indicated what outputs you
planned to produce.
- Does your selection and appraisal fit in with
your funding body requirements? What do they
expect you to keep and where does it need to be
kept?
19- Preservation action
- Undertake actions to ensure long-term
preservation and retention of the authoritative
nature of data. Preservation actions should
ensure that data remains authentic, reliable and
usable while maintaining its integrity. Actions
include data cleaning, validation, assigning
preservation metadata, assigning representation
information and ensuring acceptable data
structures or file formats.
- Roles: information specialists, preservation
practitioners, repository managers
- Community Watch activities can be very helpful at
this stage to identify imminent risks to data.
20- Specific issues to consider for the Preservation Action stage
- Does the repository participate in community
watch and ongoing preservation planning activity?
- Does the repository manager know what the
significant properties of your data are? If not,
some preservation actions could alter them.
- Are any preservation actions undertaken
transparent and documented?
- Does the repository have legal rights to
undertake preservation actions at all?
- Does the researcher require notification of any
preservation actions that may affect the intended
use of the data? If so, have mechanisms been set
in place to facilitate this?
- If certain actions are recommended, are they
suitable for your data? If not, are repository
staff aware of any restrictions?
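As an illustration of what "transparent and documented" preservation actions might look like in practice, the Python sketch below records each action in an append-only log together with SHA-256 checksums of the file before and after the action. The use of checksums, the log layout and the function names are assumptions made for this example; they are not prescribed by the lifecycle model.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_action(log_path, source, result, action, agent):
    """Append one preservation action to a JSON-lines log and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,  # e.g. "format migration"
        "agent": agent,    # person or service that carried it out
        "source": {"file": str(source), "sha256": sha256_of(source)},
        "result": {"file": str(result), "sha256": sha256_of(result)},
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def verify(path, expected_sha256):
    """Check that a file still matches a previously recorded checksum."""
    return sha256_of(path) == expected_sha256
```

Re-running `verify` against the logged checksum at a later date shows whether a file has changed since the action was recorded, and the log itself gives researchers a record of what has been done to their data over time.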
21- Transform
- Create new data from the original, for example
by migration into a different format or by
creating a subset, by selection or query, to
create newly derived results, perhaps for
publication.
- New data may be generated from the original
- by format migration
- through integration with other data
- by new analyses and techniques applied within or
across disciplines
- Roles: researchers
- New uses for curated data! Derivative data, new
visualisations or enhancements feed back into the
Conceptualise and Create stages of the lifecycle,
which then starts anew.
22- Specific issues to consider for the Transform stage
- Metadata aggregation to join up with other
datasets: this integration of data drives new
curation requirements.
- Image normalisation and automated analysis
create a variety of new contextual and
provenance information.
- If transformations or derivatives are produced
(e.g. noise reduction), they must be accompanied by
appropriate metadata.
- Use community standards for recording provenance
to safeguard against fast-changing techniques.
- Does the community have sufficient support in
transformation actions?
- Is more value gained from producing new data or
from transforming old data in new ways?
23- More information on all these stages is in the workshop packs!
- info_at_dcc.ac.uk
24- Tools and resources to help with the DCC Curation Lifecycle stages
- Conceptualise
- DCC Policy Pages http://www.dcc.ac.uk/resource/curation-policies
- Check our handy table as a starting point to make
sure you are aware of any curation related
requirements for your particular funding body. If
your funding body is not in our table, please get
in touch with us so that we can add their policy
details.
- DCC Helpdesk http://www.dcc.ac.uk/helpdesk
- If you need further assistance at this stage,
please don't hesitate to drop us a line via our
helpdesk and we'll make every effort to support
your curation activity.
25- Tools and resources to help with the DCC Curation Lifecycle stages
- Create or Receive
- DCC DIFFUSE
- You might wish to consult the DCC DIFFUSE
database for standards frameworks related to your
area of research. We strongly encourage the
contribution of standards frameworks for specific
domains from our user community to help ensure
that this is a community-driven resource.
http://www.dcc.ac.uk/diffuse/
- DCC Technology and Standards Watch papers
http://www.dcc.ac.uk/resource
- AHDS advice on creating digital resources
http://www.ahds.ac.uk/creating/index.htm/
26- Tools and resources to help with the DCC Curation Lifecycle stages
- Ingest and Store
- AHDS recommended stable formats for different
types of data http://www.ahds.ac.uk/depositing/deposit-formats.htm
- Access, Use, Re-use
- DCC Resource Centre http://www.dcc.ac.uk/resource
- DCC Helpdesk http://www.dcc.ac.uk/helpdesk
- DCC Legal Blog http://dccblawg.blogspot.com/
- DCC Briefing Papers (particularly Data
Protection) http://www.dcc.ac.uk/resource/briefing-papers/
27- Tools and resources to help with the DCC Curation Lifecycle stages
- Appraise and Select
- Data Audit Framework tool http://www.data-audit.eu/
- DCC Briefing Paper and Curation Manual chapter on
Appraisal and Selection
http://www.dcc.ac.uk/resource/briefing-papers/
http://www.dcc.ac.uk/resource/curation-manual/chapters/
- US Geological Survey selection and appraisal
toolkit
28- Tools and resources to help with the DCC Curation Lifecycle stages
- Preservation Action
- Dr. Manfred Thaller's Fileshooter tool
- Good for assessing file format robustness using
your own success metrics.
- http://github.com/mcarden/shotgun/blob/39761fdd190faa47e9be09901782cda6d9f4f687/shotGun.h
- PLANETS Testbed and Methodology
http://www.planets-project.eu/
- DCC Curation Manual
http://www.dcc.ac.uk/resource/curation-manual/chapters/
- Transform
- DCC Briefing Papers (particularly
Interoperability) http://www.dcc.ac.uk/resource/briefing-papers/
29- CHECKLISTS Conceptualise
- Get into the habit of equating data
curation with good research.
- Know what your funding body expects you to do
with your data and for how long. Assess your
ability to meet these expectations
(i.e., do you need additional funding or staff?)
- Determine intellectual property rights from
the outset and ensure they are documented.
- Identify any anticipated publication
requirements (embargoes, restrictions on
publishing over multiple sites)
- Identify and document specific roles and
responsibilities as early as possible.
30- CHECKLISTS Create and/or Receive
31- CHECKLISTS Ingest and Store
32- CHECKLISTS Access and Reuse
- Know what you want users to be able to do with
your data and for how long.
- Pin down and communicate the significant
properties of your data.
- Ensure that any restrictions on access and use
are communicated and respected.
- Ensure that you provide enough context for your
data to be located and used, either by the
originally designated user community or by new
users over time.
- Ensure you clearly articulate any citation
requirements and usage statistics that you
require at the point of ingest so that
repository managers know how your data should be
cited if it is reused.
33- CHECKLISTS Appraise and Select (1)
34- CHECKLISTS Appraise and Select (2)
35- CHECKLISTS Preservation Action
- Know what you want people to be able to do with
your data: this will impact many aspects
(formats selected for long-term storage,
compression, etc.)
- Pin down the significant properties of your
data and communicate them: make sure that the
people carrying out preservation actions know
what they are. This might be through metadata or
other means.
- Don't be afraid to be critical when reviewing
best practice and recommended approaches. They
might work for the specific scenario for which
they were created but not for you. Do you know
the criteria used to rate things like preferred
formats?
- Document preservation actions so that people
know what has been done to the data over time.
- Once you've gone through the exercise of
producing a sound data management plan, you'll
be able to reuse many aspects of it, so each
project data management plan will not need the
same level of effort to complete.