Title: JISCSURFCNImtgmay05
1 From research data to new knowledge: a lifecycle approach
Dr Liz Lyon, Director, UKOLN, University of Bath, UK
JISC/SURF/CNI Conference, May 2005, Amsterdam
UKOLN is supported by
www.bath.ac.uk
www.ukoln.ac.uk
a centre of expertise in digital information
management
2 Overview
- Scholarly communications in flux
- e-Research and the diversity of data
- Repositories: meta-functionality
- Realising the link to learning: eBank UK
- Providing value-added services
- Enabling knowledge extraction and post-processing
- Look at (some of) the issues en route
3 1. Scholarly communications in flux
4 A medieval scriptorium
5 The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003.
[Diagram] Labels:
- Presentation services: subject, media-specific, data, commercial portals
- Searching, harvesting, embedding
- Resource discovery, linking, embedding
- Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
- Aggregator services: national, commercial
- Data analysis, transformation, mining, modelling
- Harvesting metadata
- Research and e-Science workflows
- Repositories: institutional, e-prints, subject, data, learning objects
- Deposit / self-archiving
- Validation
- Publication
- Peer-reviewed publications: journals, conference proceedings
6 [Diagram] Labels:
- Presentation services: subject, media-specific, data, commercial portals
- Searching, harvesting, embedding
- Resource discovery, linking, embedding
- Aggregator services: national, commercial
- Learning object creation, re-use
- Harvesting metadata
- Learning and Teaching workflows
- Repositories: institutional, e-prints, subject, data, learning objects
- Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
- Deposit / self-archiving
- Validation
- Peer-reviewed publications: journals, conference proceedings
- Quality assurance bodies
7 [Diagram] Labels:
- Presentation services: subject, media-specific, data, commercial portals
- Searching, harvesting, embedding
- Resource discovery, linking, embedding
- Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
- Aggregator services: national, commercial
- Data analysis, transformation, mining, modelling
- Learning object creation, re-use
- Harvesting metadata
- Learning and Teaching workflows
- Research and e-Science workflows
- Repositories: institutional, e-prints, subject, data, learning objects
- Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
- Deposit / self-archiving
- Validation
- Publication
- Peer-reviewed publications: journals, conference proceedings
- Quality assurance bodies
8 2. e-Research and the diversity of data
9 Assuring permanent open access to the records of science and the humanities?
- Long term access to primary data
- Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
- Changing research paradigm: data-driven science, big science
- Observational data, simulations, large-scale experimentation, computations
- Multi-media resources, statistical data, surveys, geo-spatial data
10 Diversity of data collections
- Very large, relatively homogeneous: Large Hadron Collider (LHC) outputs from CERN
- Smaller, heterogeneous and richer collections: World Data Centre for Solar-terrestrial Physics, CCLRC
- Small-scale laboratory results: jumping robots project at the University of Bath
- Population survey data: UK Biobank
- Highly sensitive, personal data: patient care records
11 Taxonomy of data collections
- Research collections: jumping robots
- Community collections: FlyBase at Indiana (with UC Berkeley)
- Reference collections: Protein Data Bank
- Source: NSF Long-Lived Digital Data Collections, draft report, March 2005
12 Taxonomy of data collections
Evolution
- Research collections: jumping robots
- Community collections: FlyBase at Indiana (with UC Berkeley)
- Reference collections: Protein Data Bank
- Source: NSF Long-Lived Digital Data Collections, draft report, March 2005
13 Repository evolution
- 1971: Research collection, <12 files
- 2005: Reference collection, >2700 structures deposited in 6 months
14 1. Issues: research data as content
- Sharing it!
- Data diversity
- Homo- or heterogeneous
- Raw and derived / processed
- Sensitivity
- Fast or slow growth in volume
- Repository evolution
- Likelihood to scale up (from bytes to petabytes)
- Quality assurance (from the start)
- Community-based standards development (folksonomies)
- Build robust services
15 3. Repositories: meta-functionality
16 eBank UK: linking research data to learning
- JISC-funded: September 2003; Phase 2: February 2005
- UKOLN at the University of Bath (lead), University of Southampton, University of Manchester
- Exemplar e-Science testbed: Comb-e-Chem
- Grid-enabled combinatorial chemistry
- Crystallography, laser and surface chemistry examples
- Development of an e-Lab using pervasive computing technology
- National Crystallography Service
- Resource Discovery Network / PSIgate physical sciences portal
- http://www.ukoln.ac.uk/projects/ebank-uk/
17 [Diagram] Labels:
- Presentation services: subject, media-specific, data, commercial portals
- Searching, harvesting, embedding
- Resource discovery, linking, embedding
- Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
- Data analysis, transformation, mining, modelling
- Learning object creation, re-use
- Aggregator services: eBank UK
- Harvesting metadata
- Learning and Teaching workflows
- Research and e-Science workflows
- Repositories: institutional, e-prints, subject, data, learning objects
- Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
- Deposit / self-archiving
- Validation
- Publication
- Peer-reviewed publications: journals, conference proceedings
- Quality assurance bodies
18 Data Flow in eBank UK
[Diagram] Labels: Create; Data files; Metadata; Institutional repository; OAI-PMH; eBank aggregator; Index and Search
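The flow named on this slide (metadata created in an institutional repository, exposed over OAI-PMH, then indexed by an aggregator for searching) can be illustrated with a minimal sketch. It is not the eBank UK implementation: the sample response, the example.org identifiers, and the helper names `parse_records`, `build_index` and `search` are all hypothetical, and the records are assumed to use the simple `oai_dc` (Dublin Core) format. Only the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response; every identifier and value is invented.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:101</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure one</dc:title>
          <dc:creator>Researcher, A.</dc:creator>
          <dc:subject>crystallography</dc:subject>
          <dc:identifier>http://repository.example.org/101</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repository.example.org:102</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure two</dc:title>
          <dc:creator>Researcher, B.</dc:creator>
          <dc:subject>crystallography</dc:subject>
          <dc:subject>surface chemistry</dc:subject>
          <dc:identifier>http://repository.example.org/102</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract a few Dublin Core fields from each harvested record."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # e.g. a deleted record, which carries no metadata
            continue
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "creators": [e.text for e in dc.findall("dc:creator", NS)],
            "subjects": [e.text for e in dc.findall("dc:subject", NS)],
            "links": [e.text for e in dc.findall("dc:identifier", NS)],
        })
    return records


def build_index(records):
    """Map lower-cased subjects and title words to record identifiers."""
    index = defaultdict(set)
    for rec in records:
        for term in rec["subjects"] + rec["title"].split():
            index[term.lower()].add(rec["id"])
    return index


def search(index, term):
    """Return the sorted identifiers of records matching a term."""
    return sorted(index.get(term.lower(), set()))


records = parse_records(SAMPLE_RESPONSE)
index = build_index(records)
print(search(index, "crystallography"))
```

In this sketch the aggregator holds only metadata. The `links` field points back to the repository that holds the data files, which is the property that lets a harvested record lead a searcher to the underlying data.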
19 Comb-e-Chem Project
[Diagram] Labels: Video; Simulation; Properties; Analysis; Structures Database; Diffractometer; X-Ray e-Lab; Properties e-Lab; Grid Middleware
20 (No transcript)
21 The digital repository: ecrystals.chem.soton.ac.uk
Acknowledgement: Simon Coles
22 Access to the underlying data
23 Harvesting: OAIster
24 Aggregating: search and discover
25 Linking to publications
26 eBank embedded in a science portal
27 eBank Phase 2: linking to learning
- Embedding in e-Learning processes
- Evaluating the pedagogical benefits
- MChem course
- Chemical informatics course
28 2. Issues: generic data models, metadata schema and terminology
- Validation against other schema
- CCLRC Scientific Data Model Version 2
- Complex digital objects and packaging options
- METS
- MPEG-21 DIDL
- Terminologies
- Domain: crystallography
- Inter-disciplinary: e.g. biomaterials
- Metadata enhancement: subject keyword additions to datasets based on knowledge of keywords in related publications
- Meaningful resource discovery?
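The metadata enhancement idea on this slide (adding subject keywords to a dataset from the keywords of related publications) could look roughly like the sketch below. The record layout, the field names `cites_datasets` and `subjects_added`, and the optional controlled `vocabulary` filter are assumptions made for illustration; the slides do not describe how eBank UK performs this step.

```python
def enrich_subjects(dataset, publications, vocabulary=None):
    """Return a copy of a dataset record with subject keywords borrowed
    from the publications that cite it.

    `vocabulary`, if given, is a set of lower-cased terms; keywords outside
    it are ignored (an assumed safeguard, not something the slides specify).
    """
    original = list(dataset.get("subjects", []))
    seen = {s.lower() for s in original}
    added = []
    for pub in publications:
        if dataset["id"] not in pub.get("cites_datasets", []):
            continue  # publication is not related to this dataset
        for keyword in pub.get("keywords", []):
            key = keyword.lower()
            if vocabulary is not None and key not in vocabulary:
                continue
            if key not in seen:
                seen.add(key)
                added.append(key)
    enriched = dict(dataset)
    enriched["subjects"] = original + added
    # Keep machine-added terms separate so they can be reviewed or removed.
    enriched["subjects_added"] = added
    return enriched


# Hypothetical records, invented for this example.
dataset = {"id": "ds-1", "subjects": ["crystallography"]}
publications = [
    {"id": "pub-1", "cites_datasets": ["ds-1"],
     "keywords": ["Crystallography", "hydrogen bonding"]},
    {"id": "pub-2", "cites_datasets": ["ds-9"], "keywords": ["lasers"]},
]

enriched = enrich_subjects(dataset, publications)
print(enriched["subjects"])
```

Recording the added terms separately is one way to address the slide's closing question: whether enhanced metadata gives meaningful resource discovery can only be judged if the borrowed keywords remain distinguishable from those the depositor supplied.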
29 3. Issues: linking and identifiers
- Links to individual datasets within an experiment
- Links to all datasets associated with an experiment or a data collection
- Links to derived eprints and published literature
- Context sensitive linking: find me
- Datasets by this author / creator
- Datasets related to this subject
- Learning objects by this author / creator
- Learning objects related to this subject
- Identifiers and persistence
- generic
- domain: International Chemical Identifier (InChI code)
- Resource discovery: Google Scholar?
- Provenance: authenticity, authority, integrity?
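The context-sensitive "find me" links listed on this slide amount to a small set of queries over typed, identified resources. The sketch below is illustrative only: the `Resource` fields, the `kind` values, the identifiers and the two query functions are hypothetical and are not taken from eBank UK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    identifier: str            # assumed to be a persistent identifier
    kind: str                  # "dataset", "eprint" or "learning_object"
    creators: tuple = ()
    subjects: tuple = ()
    derived_from: tuple = ()   # identifiers of the datasets this item builds on


def find(resources, kind, creator=None, subject=None):
    """'Find me' resources of one kind by this creator and/or on this subject."""
    hits = []
    for res in resources:
        if res.kind != kind:
            continue
        if creator is not None and creator not in res.creators:
            continue
        if subject is not None and subject not in res.subjects:
            continue
        hits.append(res.identifier)
    return hits


def derived_eprints(resources, dataset_id):
    """Follow the link from a dataset to the eprints derived from it."""
    return [res.identifier for res in resources
            if res.kind == "eprint" and dataset_id in res.derived_from]


# Hypothetical catalogue, invented for this example.
catalogue = [
    Resource("ds-1", "dataset", ("Researcher, A.",), ("crystallography",)),
    Resource("ds-2", "dataset", ("Researcher, B.",), ("crystallography",)),
    Resource("ep-1", "eprint", ("Researcher, A.",), ("crystallography",), ("ds-1",)),
    Resource("lo-1", "learning_object", ("Researcher, A.",), ("crystallography",)),
]

print(find(catalogue, "dataset", creator="Researcher, A."))
print(find(catalogue, "learning_object", subject="crystallography"))
print(derived_eprints(catalogue, "ds-1"))
```

Every query here depends on the identifiers staying stable, which is why the slide pairs linking with identifiers and persistence.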
30 4. Issues: embedding and workflow
- Into the crystallographic publishing community: International Union of Crystallography
- Into the chemistry research workflow
- SMART TEA Digital Lab Book e-synthesis Lab
- Other analytical techniques and instrumentation
- Into the curriculum and e-Learning workflows
- MChem course
- Undergraduate Chemical Informatics courses
31 Repositories and digital curation
- Data preservation: static. For later use?
- Data curation: dynamic. In use now (and the future)?
- Maintaining and adding value to a trusted body of digital information for current and future use
32 Provide value-added services
- Annotation
- e-Lab books (Smart Tea Project in chemistry)
- Gene and protein sequences
33 Enable post-processing and knowledge extraction
- The acquisition of newly-derived information and knowledge from repository content
- Run complex algorithms over primary datasets
- Mining (data, text, structures)
- Modelling (economic, climate, mathematical, biological)
- Analysis (statistical, lexical, pattern matching, gene)
- Presentation (visualisation, rendering)
34 (No transcript)
35 5. Issues: knowledge services
- Layered over repositories
- Annotation
- Mining, modelling, analysis
- Visualisation
- Across multiple repositories
- Grid enabled applications
- Highly distributed, dynamic and collaborative
- Associated with curatorial responsibility
- UK Digital Curation Centre: http://www.dcc.ac.uk
36 Issues: summary
- Research data is diverse, increasing rapidly in volume and complexity
- Repository collections are dynamic and evolve
- Technical challenges associated with interoperability, persistence, provenance, resource discovery and infrastructure provision
- Embedding in workflow is critical: scholarly communications, research practice, learning
- Knowledge extraction tools will generate new discoveries based on repository content
- Repository solutions must scale: M2M processing will become the norm