Digital Curation 101 - PowerPoint PPT Presentation

About This Presentation
Title:

Digital Curation 101

Slides: 36
Provided by: dcc8

Transcript and Presenter's Notes

Title: Digital Curation 101


1
  • Digital Curation 101
  • Taster
  • Joy Davidson, Associate Director, DCC
    british.editor_at_erpanet.org
  • Sarah Higgins, Standards Advisor, DCC
    Sarah.Higgins_at_ed.ac.uk

This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 UK:
Scotland License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/
or send a letter to Creative Commons, 543 Howard Street,
5th Floor, San Francisco, California, 94105, USA.
2
  • DC 101 aims and objectives
  • Data management and curation are becoming
    increasingly integral to successful research or
    digitisation bids. Using the context of beginning
    a new research bid, this short course aims to
    introduce participants to the DCC Curation
    Lifecycle Model as a means of contextualising the
    range and nature of roles and activities required
    to maintain access to data over time. While the
    DCC Curation Lifecycle Model is sequential, it is
    flexible and allows users to start at any point
    in the model, or start to address issues which
    have had lower priority, depending on their
    current needs.
  • Ultimately, tools and approaches will evolve over
    time, but if participants understand the bigger
    picture they will be in a better position to make
    critical decisions that best reflect their
    individual needs. The course will introduce
    participants to some of the tools and approaches
    and provide them with pointers to further
    information and support.
  • The course is aimed at researchers, content
    creators and those who support them. We hope that
    participants leave the course equipped to explain
    why data curation is important and what roles
    they have to play in the process.

3
  • What is curation?
  • Data have importance as the evidential base for
    scholarly conclusions, and for the validation of
    those conclusions, a basic tenet of which is
    reproducibility.
  • Curation is the active management and appraisal
    of data over the lifecycle of scholarly and
    scientific interest; it is the key to
    reproducibility and reuse. This adds value
    through the provision of context and linkage,
    placing emphasis on 'publishing' data in ways
    that ease reuse, with implications for metadata
    and interoperability.
  • Data curation is part of good research and
    content management practice.

4
  • Why Curate?
  • Curation brings immediate and longer-term
    benefits:
  • Access to reliable, working data both for the
    creator and users
  • Compliance with funding body and research council
    mandates on data sharing, management and access
  • Independent validation of research findings
  • Reliable lab and field electronic notebooks
    through trustworthy capture
  • Large amounts of data can be developed and
    analysed across different locations by
    maintaining consistency in working practices and
    interpretations
  • Relationship management between different
    versions of dynamic or evolving datasets is
    easier
  • Facilitated linkage with related research and
    between primary, secondary and tertiary data
  • Knowledge and data originating from short-term
    research projects do not become obsolete or
    inaccessible when funding expires
  • Innovative combinations of datasets become
    possible, e.g. combined historic biodiversity
    data and GIS data can be used to investigate
    trends in ecosystem development.

5
  • Lifecycle approach to curation
  • digital materials are fragile and susceptible to
    change from technological advances from creation
    onwards
  • activities (or lack of) at each lifecycle stage
    influence ability to manage and preserve
    materials in subsequent stages
  • reliable re-use of digital materials is only
    possible if materials are curated in such a way
    that their authenticity and integrity are
    retained
  • requires significant input and buy-in from the
    range of stakeholders: creators, curators, IT
    staff, management
  • helps maximise initial investment made in
    creating or gathering data
  • supports verification of provenance
  • facilitates continuity of service
  • From Pennock, Maureen, Digital Curation: A
    Life-Cycle Approach to Managing and Preserving
    Usable Digital Information (2007)

6
  • The DCC Curation Lifecycle Model
  • Provides a graphical, high-level overview of the
    stages required for successful curation and
    preservation of data.
  • It can be used to plan activities within an
    organisation or consortium to ensure all
    necessary stages are undertaken, each in the
    correct sequence.
  • Full Lifecycle Actions
  • Sequential Actions
  • Occasional Actions
  • http://www.dcc.ac.uk/lifecycle-model/

7
  • Researchers and content creators tend to focus
    on
  • conceptualise
  • create or receive
  • ingest
  • store
  • access, use and reuse
  • data
  • description
  • community watch and participation

8
  • Researchers and content creators tend to focus
    less on
  • appraise and select
  • dispose
  • preservation action
  • transform
  • representation information
  • preservation planning
  • curate and preserve
  • migrate
  • reappraise

9
  • Conceptualise
  • Conceive and plan the creation of data,
    including capture method and storage options.
  • Researchers:
  • define a research question
  • begin to design the experiment
  • seek funding
  • conceive and plan the creation of data
  • consider capture methods and storage options
  • identify research collaborators
  • identify potential subjects
  • Roles: researcher, funding bodies, publishers, IT
    department, ethics panel
  • Plan with digital curation in mind!
  • Decisions made at the Conceptualise stage impact
    on every other stage of the lifecycle.

10
  • Specific issues to consider for the Conceptualise
    stage
  • Research design and workflows: what do you want
    to do?
  • What storage needs do you anticipate? Does
    your institution have the capacity for this?
    Will you keep raw or derived data or both?
  • Will you make use of any existing data? Will you
    need to obtain rights to use it?
  • Do you want your data to interoperate with other
    datasets? If so, how will you ensure that this is
    possible?
  • What are the funder's requirements regarding
    curation and preservation? Will they pay for
    curation activity?
  • Will the research involve any legal restrictions
    on the use and access to the data?
  • Are there any data protection issues that will
    require data cleaning before the data can be
    accessed and used?
  • Do you require ethical approval from your
    institution or funder? Will this have any impact
    on the data's potential use and reuse?
  • Do you need to calibrate data capture devices?
    Will this need to occur at multiple sites?
  • Will the data be released under Creative Commons
    or Science Commons licenses?
  • Are there likely to be any embargoes on data
    publication?

11
  • Create or Receive
  • Create data including administrative,
    descriptive, structural and technical metadata.
    Preservation metadata may also be added at the
    time of creation. OR Receive data, in
    accordance with documented collecting policies,
    from data creators, other archives, repositories
    or data centres, and if required assign
    appropriate metadata.
  • Roles: researchers, information specialists,
    technical support
  • Ensure data are curation ready!
  • Be careful - data may be irreplaceable
  • Capture context for long-term reuse and
    comprehensibility.
  • Clearly identify IPR at an early stage. This can
    become murky later in the process.

12
  • Specific issues to consider for the Create or
    Receive stage
  • What do you want people to be able to do with the
    data you are generating?
  • What do you not want people to be able to do with
    the data?
  • Are there any variations between data capture
    tools located at different sites? How will you
    ensure that these are recorded/addressed?
    Consistency of testing and data acquisition is
    crucial.
  • Will you be adhering to any content, syntax, and
    structure standards? Are these easily available
    for use by everyone on the project team?
  • Who will have rights over any collaboratively
    generated data (e.g. databases)?
  • Who will record contextual metadata, and how?
  • What level of data quality do you need to
    achieve? How will you ensure this level is
    achieved across all partners?
  • Will you make use of any ontologies to facilitate
    data integration?
  • Will you make use of any data collection
    policies?
  • How will you handle file naming and version
    control?
  • Do you have access to training and support for
    any/all of the above?
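
As an illustration of the file naming and version control question above, here is a minimal sketch of one possible convention. The pattern `<project>_<description>_<YYYYMMDD>_v<NN>.<ext>` and the helper names `build_filename` and `next_version` are hypothetical, chosen only for this example; a project team would agree its own scheme.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<description>_<YYYYMMDD>_v<NN>.<ext>
# Lower-case letters, digits and hyphens only, so names sort and parse reliably.
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_filename(project, description, day, version, ext):
    """Assemble a file name and check it against the convention."""
    name = f"{project}_{description}_{day:%Y%m%d}_v{version:02d}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def next_version(existing_name, day):
    """Return the name for the next version of an existing file."""
    match = NAME_PATTERN.match(existing_name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {existing_name}")
    return build_filename(
        match["project"],
        match["description"],
        day,
        int(match["version"]) + 1,
        match["ext"],
    )


if __name__ == "__main__":
    first = build_filename("soil", "ph-readings", date(2008, 3, 14), 1, "csv")
    print(first)
    print(next_version(first, date(2008, 4, 2)))
```

Keeping the version number in the name means earlier versions are never overwritten, which makes it easier to manage the relationship between versions of an evolving dataset.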

13
  • Ingest and Store
  • Transfer data to an archive, repository, data
    centre or other custodian. Adhere to documented
    guidance, policies or legal requirements. Store
    the data in a secure manner adhering to relevant
    standards.
  • Data is transferred to a curation environment
    such as an institutional repository or a
    subject-based repository.
  • Roles: information specialists, repository
    managers, researchers
  • Prepare data for long-term storage, access and
    continuity! Storage may be a dedicated data
    repository or a folder on a shared drive, but
    must be considered, secure and adhere to relevant
    standards.

14
  • Specific issues to consider for the Ingest and
    Store stages
  • Does the data have sufficient metadata? If more
    is required, who will be responsible for
    providing it?
  • Will the data require additional cleaning before
    it can be ingested into the repository?
  • Will frequent access to the data be required? If
    so, this could affect the storage choices.
  • What level of responsibility does the repository
    indicate it will take on with regard to
    stewardship?
  • Does the repository accept your data formats? If
    not, will there be any normalisation processes
    that may occur with the deposit of non-preferred
    formats?
  • Does the repository outsource any of its
    activity? Could this have an impact on your data?
  • Does the repository have sufficient resources and
    policies in place?
  • Once ingest is complete, is there a formal
    acknowledgement that the transfer of custody has
    occurred?
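
One way to support a formal acknowledgement of transfer is for the depositor and the repository to compare checksums before and after ingest. The sketch below is an assumption about how that might be done, not a DCC procedure: the function names `build_manifest` and `verify_manifest` are hypothetical, and SHA-256 is simply one widely available hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to the directory) to its checksum."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(directory, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(directory)
    problems = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(relative_path)
    return sorted(problems)
```

The depositor would run `build_manifest` before transfer and send the result with the data; the repository would run `verify_manifest` on its copy, and an empty list is evidence that every file arrived unchanged.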

15
  • Access, Use and Reuse
  • Ensure that data is accessible to both
    designated users and reusers, on a day-to-day
    basis. This may be in the form of publicly
    available published information. Robust access
    controls and authentication procedures may be
    applicable.
  • Roles: repository managers, researchers
  • Ensure access and continuity!

16
  • Specific issues to consider for the Access, Use
    and Reuse stage
  • Are the intended users of the data able to access
    it and make use of it? i.e., are they able to use
    the data in the way that you originally intended
    them to use it? What about non-intended users?
  • Are there any restrictions on access and reuse?
    Ensure that these are communicated to the
    repository staff.
  • Researchers should work with repository managers
    to develop suitable access policies and terms for
    use of the data
  • If you are planning on making your data freely
    accessible for reuse, have you supplied enough
    context to enable its reliable reuse?
  • Are there adequate finding aids to help locate and
    retrieve your data within the repository?
  • Is the data practically interoperable with other
    datasets? Does it need to be?

17
  • Appraise and Select
  • Evaluate data and select for long-term curation
    and preservation. Adhere to documented guidance,
    policies or legal requirements.
  • Researchers and content creators, along with
    information specialists, use quality checks to
    identify and evaluate data for long-term
    curation. Selected data:
  • must be legal, appropriate, and valuable
  • may include data objects, metadata, and
    contextual information.
  • Roles: researchers, information specialists,
    funding bodies
  • Develop robust policies! The 'keep everything'
    approach quickly becomes unviable. As the volume
    of curated data increases, efficient search and
    retrieval become more difficult.

18
  • Specific issues to consider for the Appraise and
    Select stage
  • Does the data meet the data quality metrics
    identified by both the researchers and the
    archive? Who will be responsible for the final
    decision? Can errors in the data remain
    undetected at this stage, and cause problems at
    later stages?
  • Has enough contextual information been collected
    to make an informed decision about which data to
    keep?
  • What is the minimum you need to keep for your
    data findings and publications to be supported
    over time?
  • Are there any data that you, by law, are not
    allowed to keep? How will it be destroyed and
    what evidence will you be able to provide to
    support this if necessary?
  • Do you have any schedule for re-appraisal over
    time?
  • Do you have access to expertise in your project
    staff or at your institution to assist with
    selection and appraisal?
  • Your initial bid is a good place to start as
    you'll have clearly indicated what outputs you
    planned to produce.
  • Does your selection and appraisal fit in with
    your funding body requirements? What do they
    expect you to keep and where does it need to be
    kept?

19
  • Preservation action
  • Undertake actions to ensure long-term
    preservation and retention of the authoritative
    nature of data. Preservation actions should
    ensure that data remains authentic, reliable and
    usable while maintaining its integrity. Actions
    include data cleaning, validation, assigning
    preservation metadata, assigning representation
    information and ensuring acceptable data
    structures or file formats.
  • Roles: information specialists, preservation
    practitioners, repository managers
  • Community Watch activities can be very helpful at
    this stage to identify imminent risks to data.

20
  • Specific issues to consider for the Preservation
    Action stage
  • Does the repository participate in community
    watch and ongoing preservation planning activity?
  • Does the repository manager know what the
    significant properties of your data are? If not,
    some preservation actions can alter the
    significant properties.
  • Are any preservation actions undertaken
    transparent and documented?
  • Does the repository have legal rights to
    undertake preservation actions at all?
  • Does the researcher require notification of any
    preservation actions that may affect the intended
    use of the data? If so, have mechanisms been set
    in place to facilitate this?
  • If certain actions are recommended, are they
    suitable for your data? If not, are repository
    staff aware of any restrictions?

21
  • Transform
  • Create new data from the original, for example
    by migration into a different format or by
    creating a subset, by selection or query, to
    create newly derived results, perhaps for
    publication.
  • New data may be generated from the original
  • by format migration
  • through integration with other data
  • by new analyses and techniques applied within or
    across disciplines
  • Roles: researchers
  • New uses for curated data! Derivative data, new
    visualisations or enhancements feed back into the
    Conceptualise and Create stages of the lifecycle
    which then starts anew.

22
  • Specific issues to consider for the Transform
    stage
  • Metadata aggregation to join up with other
    datasets: this integration of data drives new
    curation requirements.
  • Image normalisation and automated analysis
    create a variety of new contextual and
    provenance information
  • If transformations or derivatives are produced
    (e.g. noise reduction) they must be accompanied by
    appropriate metadata
  • Use community standards for recording provenance
    to safeguard against fast changing techniques.
  • Does the community have sufficient support in
    transformation actions?
  • Is more value gained from producing new data or
    from transforming old data in new ways?
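
To make the point about accompanying derivatives with appropriate metadata concrete, the following is a minimal sketch of a provenance record written alongside a derived file. The `ProvenanceRecord` class and its field names are hypothetical and are not taken from any community standard; in practice these fields would be mapped onto whichever provenance standard the community has adopted.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now():
    """Current time in UTC as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one transformation applied to a data file."""
    source_file: str
    derived_file: str
    action: str
    parameters: dict = field(default_factory=dict)
    performed_by: str = "unknown"
    performed_at: str = field(default_factory=_utc_now)


def to_json(record):
    """Serialise the record so it can be stored next to the derived file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = ProvenanceRecord(
        source_file="scan_001.tif",
        derived_file="scan_001_denoised.tif",
        action="noise reduction",
        parameters={"filter": "median", "kernel": 3},
        performed_by="j.smith",
    )
    print(to_json(record))
```

Recording the parameters as well as the action is what lets a later user judge whether the derived data is still fit for their purpose.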

23
  • More information on all these stages is in the
    workshop packs!
  • info_at_dcc.ac.uk

24
  • Tools and resources to help with the DCC Curation
    Lifecycle stages
  • Conceptualise
  • DCC Policy Pages
    http://www.dcc.ac.uk/resource/curation-policies
  • Check our handy table as a starting point to make
    sure you are aware of any curation related
    requirements for your particular funding body. If
    your funding body is not in our table, please get
    in touch with us so that we can add their policy
    details.
  • DCC Helpdesk http://www.dcc.ac.uk/helpdesk
  • If you need further assistance at this stage,
    please don't hesitate to drop us a line via our
    helpdesk and we'll make every effort to support
    your curation activity.

25
  • Tools and resources to help with the DCC Curation
    Lifecycle stages
  • Create or Receive
  • DCC DIFFUSE
  • You might wish to consult the DCC DIFFUSE
    database for standards frameworks related to your
    area of research. We strongly encourage the
    contribution of standards frameworks for specific
    domains from our user community to help ensure
    that this is a community-driven resource.
    http://www.dcc.ac.uk/diffuse/
  • DCC Technology and Standards Watch papers
    http://www.dcc.ac.uk/resource
  • AHDS advice on creating digital resources
    http://www.ahds.ac.uk/creating/index.htm/

26
  • Tools and resources to help with the DCC Curation
    Lifecycle stages
  • Ingest and Store
  • AHDS recommended stable formats for different
    types of data
    http://www.ahds.ac.uk/depositing/deposit-formats.htm
  • Access, Use, Re-use
  • DCC Resource Centre http://www.dcc.ac.uk/resource
  • DCC Helpdesk http://www.dcc.ac.uk/helpdesk
  • DCC Legal Blog http://dccblawg.blogspot.com/
  • DCC Briefing Papers (particularly Data
    Protection)
    http://www.dcc.ac.uk/resource/briefing-papers/

27
  • Tools and resources to help with the DCC Curation
    Lifecycle stages
  • Appraise and Select
  • Data Audit Framework tool
    http://www.data-audit.eu/
  • DCC Briefing Paper and Curation Manual chapter on
    Appraisal and Selection
    http://www.dcc.ac.uk/resource/briefing-papers/
    http://www.dcc.ac.uk/resource/curation-manual/chapters/
  • US Geological Survey selection and appraisal
    toolkit

28
  • Tools and resources to help with the DCC Curation
    Lifecycle stages
  • Preservation Action
  • Dr. Manfred Thaller's Fileshooter tool
  • Good for assessing file format robustness using
    your own success metrics.
  • http://github.com/mcarden/shotgun/blob/39761fdd190faa47e9be09901782cda6d9f4f687/shotGun.h
  • PLANETS Testbed and Methodology
    http://www.planets-project.eu/
  • DCC Curation Manual
    http://www.dcc.ac.uk/resource/curation-manual/chapters/
  • Transform
  • DCC Briefing Papers (particularly
    Interoperability)
    http://www.dcc.ac.uk/resource/briefing-papers/

29
  • CHECKLISTS Conceptualise
  • Get into the habit of equating data
    curation with good research.
  • Know what your funding body expects you to do
    with your data and for how long. Assess your
    ability to meet these expectations
    (i.e. do you need additional funding or staff?)
  • Determine intellectual property rights from
    the outset and ensure they are documented.
  • Identify any anticipated publication
    requirements (embargoes, restrictions on
    publishing over multiple sites)
  • Identify and document specific roles and
    responsibilities as early as possible.

30
  • CHECKLISTS Create and/or Receive

31
  • CHECKLISTS Ingest and Store

32
  • CHECKLISTS Access and Reuse
  • Know what you want users to be able to do with
    your data and for how long.
  • Pin down and communicate the significant
    properties of your data.
  • Ensure that any restrictions on access and use
    are communicated and respected.
  • Ensure that you provide enough context for your
    data to be located and used, either by the
    originally designated user community or by new
    users over time.
  • Ensure you clearly articulate any citation
    requirements and usage statistics that you
    require at the point of ingest so that
    repository managers know how your data should be
    cited if it is reused.

33
  • CHECKLISTS Appraise and Select (1)

34
  • CHECKLISTS Appraise and Select (2)

35
  • CHECKLISTS Preservation Action
  • Know what you want people to be able to do with
    your data; this will impact many aspects
    (formats selected for long-term storage,
    compression, etc.)
  • Pin down the significant properties of your
    data and communicate them; make sure that the
    people carrying out preservation actions know
    what they are. This might be through metadata or
    other means.
  • Don't be afraid to be critical when reviewing
    best practice and recommended approaches. They
    might work for the specific scenario for which
    they were created but not for you. Do you know
    the criteria used to rate things like preferred
    formats?
  • Document preservation actions so that people
    know what has been done to the data over time.
  • Once you've gone through the exercise of
    producing a sound data management plan, you'll
    be able to reuse many aspects of it, so each
    project data management plan will not need the
    same level of effort to complete.