Enhancing access to research data: the crystallography challenge

Provided by: monicaduke

Transcript and Presenter's Notes


1
Enhancing access to research data: the e-Science
project eBank UK
UKOLN is supported by
2005-09-01 www.ukoln.ac.uk
www.bath.ac.uk
A centre of expertise in digital information
management
2
Enhancing access to research data: overview
  • E-Science: impact of digital technologies on
    research process
  • Scholarly knowledge cycle and publication
    bottleneck
  • eBank project: applying digital library
    techniques to support data curation in
    crystallography
  • Services, metadata, issues phase 3

3
Changes in research process
  • Increasing data volumes from eScience /
    Grid-enabled / cyber-infrastructure applications,
    big science, data-driven science
  • Changing research methods: high-throughput
    technologies, automation, smart labs
  • Potential for re-use of data, new
    inter-disciplinary research
  • Different types of data: observational data,
    experimental data, computational data; different
    stewardship and long-term access requirements

4
Diversity of data collections
  • Very large, relatively homogeneous:
    Large Hadron Collider (LHC)
    outputs from CERN
  • Smaller, heterogeneous and richer collections:
    World Data Centre for Solar-terrestrial
    Physics, CCLRC
  • Small-scale laboratory results:
    jumping robots project at the
    University of Bath
  • Population survey data: UK Biobank
  • Highly sensitive, personal data: patient care
    records

5
Taxonomy of data collections
  • Research collections: jumping robots
  • Community collections: Flybase at Indiana (with
    UC Berkeley)
  • Reference collections: Protein Data Bank
  • Source: NSF Long-Lived Digital Data Collections,
    draft report, March 2005

6
Repository evolution
  • 1971: Research collection, <12 files
  • 2005: Reference collection, >2700
    structures deposited in 6 months
7
1. Issues: research data as content
  • Sharing or not: Open Access to data?
  • Data diversity
  • Homo- or heterogeneous
  • Raw and derived / processed
  • Sensitivity
  • Fast or slow growth in volume
  • Repository evolution
  • Likelihood to scale up (from bytes to petabytes)
  • Quality assurance (from the start)
  • Community-based standards development
  • Relationship between institutional and subject
    repositories
  • Build robust services

8
[Diagram. Recoverable labels:]
  • Data creation / capture / gathering: laboratory
    experiments, Grids, fieldwork, surveys, media
  • Data analysis, transformation, mining, modelling
  • Research & e-Science workflows
  • Learning & Teaching workflows
  • Learning object creation, re-use
  • Deposit / self-archiving
  • Repositories: institutional, e-prints, subject,
    data, learning objects
  • Harvesting metadata
  • Aggregator services: national, commercial
  • Presentation services: subject, media-specific,
    data, commercial portals
  • Institutional presentation services: portals,
    Learning Management Systems, u/g, p/g courses,
    modules
  • Searching, harvesting, embedding
  • Resource discovery, linking, embedding
  • Validation
  • Publication
  • Peer-reviewed publications: journals, conference
    proceedings
  • Quality assurance bodies
9
The data deluge: crystallography
10
Data overload: the publication bottleneck
2,000,000
25,000,000
300,000
11
Current Publishing Process
  • Journal articles: aims, ideas, context,
    conclusions; only most significant data
  • Raw underlying data required by peers not
    readily available

12
Context: existing data repositories
  • National data archives
      - UK Data Archive, Arts and Humanities Data
        Service, US National Archives and Records
        Administration (NARA), Atlas Datastore
  • Discipline specific archives
      - GenBank, Protein Data Bank
  • Crystallography archives
      - Cambridge Crystallographic Data Centre (Cambridge
        Structural Database), Indiana University
        Molecular Structure Center (Crystal Data Server,
        Reciprocal Net), FIZ Karlsruhe (Inorganic
        crystals), Toth Information Systems (CRYSTMET)
  • Journals require deposit of data to support
    articles
      - Typically deposit of summary data; partial
        coverage

13
eBank UK project overview
  • JISC funded in 2003, now in Phase 2 to 2006
  • Joint effort between crystallographers, computer
    scientists, digital library researchers
  • Investigating contribution of existing digital
    library technologies to enable publication at
    source
  • Partners have interest in dissemination of
    chemistry research data, open access, OAI,
    institutional repositories
    http://www.ukoln.ac.uk/projects/ebank-uk/

14
eBank project team
  • University of Bath, UKOLN (lead)
      - Monica Duke, Rachel Heery, Traugott Koch, Liz
        Lyon
  • University of Southampton, School of Chemistry
      - Simon Coles, Jeremy Frey, Mike Hursthouse
  • University of Southampton, School of Electronics
    and Computer Science
      - Leslie Carr, Chris Gutteridge
  • University of Manchester, PSIgate (physical
    sciences portal in RDN)
      - John Blunden-Ellis

15
eBank phase one achievements
  • Gathered requirements from crystallographers
  • Established pilot institutional repository for
    crystallography data at Southampton with web
    interface
  • Developed a demonstrator aggregator service at
    UKOLN (CCDC exploring aggregation service)
  • Developed appropriate schema
  • Demonstrated a search interface as an embedded
    service at PSIgate portal
  • Demonstrated an added value service linking
    research data to papers (one-off)

16
Institutional repositories: publication at source
  • Institution establishes repository(s)
  • Institution pro-actively supports deposit process
  • OAI provides basis for interoperability
  • Potential for added value services
  • And/or international subject-based archives?

17
Crystallography: a good fit
  • Crystallography has well defined data creation
    workflow
  • Tradition of sharing using standard file format
      - Crystallographic Information File (CIF)
  • What about other chemistry sub-disciplines? Other
    scientific disciplines?

18
eBank UK e-Science testbed: Comb-e-Chem
  • Grid-enabled combinatorial chemistry
  • Crystallography, laser and surface chemistry
    examples
  • Development of an e-Lab using pervasive computing
    technology
  • National Crystallography Service at Southampton

19
Comb-e-Chem Project
[Diagram. Recoverable labels: Video, Simulation,
Properties, Analysis, Structures Database,
Diffractometer, X-Ray e-Lab, Properties e-Lab,
Grid Middleware]
20
Crystallography workflow
  • Initialisation: mount new sample on
    diffractometer, set up data collection
  • Collection: collect data
  • Processing: process and correct images
  • Solution: solve structures
  • Refinement: refine structure
  • CIF: produce CIF (Crystallographic Information
    File)
  • Validation: chemical and crystallographic checks
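
The workflow above is a fixed, ordered sequence of stages, which is part of what makes crystallography a candidate for automated deposit. A minimal sketch of how the stages might be modelled in code follows; the `Stage` enum and both helper functions are illustrative names, not part of the eBank software.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """Stages of the crystallography workflow, in the order listed on the slide."""
    INITIALISATION = 1
    COLLECTION = 2
    PROCESSING = 3
    SOLUTION = 4
    REFINEMENT = 5
    CIF = 6
    VALIDATION = 7


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.VALIDATION:
        return None
    return Stage(stage + 1)


def stages_completed(current: Stage) -> List[Stage]:
    """List every stage up to and including `current`."""
    return [s for s in Stage if s <= current]
```

For example, `next_stage(Stage.COLLECTION)` gives `Stage.PROCESSING`, and `stages_completed(Stage.PROCESSING)` lists the first three stages.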

21
Data Collection
22
Data Flow in eBank UK
[Diagram. Recoverable labels: Create, Data files,
Metadata, Institutional repository, OAI-PMH,
eBank aggregator, Index and Search]
23
Southampton digital repository
http://ecrystals.chem.soton.ac.uk
24
Access to ALL underlying data
25
Harvesting: OAIster
26
OAI-PMH harvesting and aggregating
eBank aggregator at UKOLN:
http://eprints-uk.rdn.ac.uk/ebank-demo/
Demonstrating potential for linking between data
and journal article
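
As a rough illustration of what OAI-PMH harvesting involves, the sketch below builds a `ListRecords` request URL and pulls record identifiers out of a response. The `verb` and `metadataPrefix` parameters and the response namespace come from the OAI-PMH 2.0 specification; the base URL, the function names and the sample response are made up for the example and are not eBank output. Resumption tokens and error responses are not handled.

```python
from typing import List
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml: str) -> List[str]:
    """Extract the identifier from every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = f".//{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"
    return [el.text for el in root.findall(path)]


# Hand-written stand-in for a repository response (hypothetical identifiers).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repository.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repository.example.org:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
```

An aggregator would fetch the URL returned by `list_records_url(...)`, then index the metadata of each record it finds.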
27
Embedded search service at PSIgate
PSIgate subject gateway service provider
28
Schema for records made available for harvesting
  • Data holding (collection of files associated with
    experiment)
      - Qualified Dublin Core data elements plus
        additional chemical properties: chemical
        formula, International Chemical Identifier
        (InChI), compound class
  • Individual data files
      - Separate records for stage status of each file
  • Description set wrapped into one XML record using
    METS
  • Research metadata/data as a complex object
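
To make the shape of such a record concrete, here is a minimal sketch that assembles a Dublin Core style description of a data holding with Python's standard library. The two Dublin Core namespaces are the real ones; the `EBANK` namespace and the element names `chemicalFormula`, `inchi` and `compoundClass` are assumptions for illustration only, since the slides do not give the actual schema, and the METS wrapping is omitted.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
EBANK = "http://example.org/ebank/terms/"  # hypothetical namespace


def build_record(identifier: str, formula: str, inchi: str,
                 compound_class: str,
                 referenced_by: Optional[str] = None) -> ET.Element:
    """Build a description of one data holding as an XML element tree."""
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC}}}identifier").text = identifier
    ET.SubElement(record, f"{{{DC}}}type").text = "CrystalStructure"
    # Chemical properties: element names below are illustrative, not the real schema.
    ET.SubElement(record, f"{{{EBANK}}}chemicalFormula").text = formula
    ET.SubElement(record, f"{{{EBANK}}}inchi").text = inchi
    ET.SubElement(record, f"{{{EBANK}}}compoundClass").text = compound_class
    if referenced_by is not None:
        # Optional link to a publication that cites this data holding.
        ET.SubElement(record, f"{{{DCTERMS}}}isReferencedBy").text = referenced_by
    return record


example = build_record(
    identifier="http://repository.example.org/holding/1",  # made-up URL
    formula="C6H6",
    inchi="InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # benzene
    compound_class="aromatic hydrocarbon",
)
```

Calling `ET.tostring(example, encoding="unicode")` serialises the record for inspection.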

29
eBank data model
[Diagram. Recoverable labels:]
  • Institutional repositories
  • Dataset (three instances)
  • Crystal structure (data holding)
  • Crystal structure report (HTML)
  • ebank_dc record (XML)
  • Eprint manifestation (e.g. PDF)
  • Eprint jump-off page (HTML)
  • Eprint oai_dc record (XML)
  • Deposit
  • dc:type CrystalStructure
  • dc:type Eprint and/or Text
  • dc:identifier
  • dcterms:references
  • dcterms:isReferencedBy
  • Harvesting: OAI-PMH oai_dc
  • Harvesting: OAI-PMH ebank_dc
  • Harvesting: OAI-PMH oai_dc, ebank_dc
  • eBank UK aggregator service
  • ePrint UK aggregator service
  • Other aggregators and services
  • Linking
Model input: Andy Powell, UKOLN.
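
The linking in this model can be sketched as a lookup over harvested records. The sketch assumes, from the usual Dublin Core meaning of the two terms, that an eprint record carries `dcterms:references` pointing at a dataset identifier and that a dataset record carries `dcterms:isReferencedBy` pointing back; the dictionary layout, function names and identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def eprints_for_dataset(dataset_id: str, eprints: List[Record]) -> List[str]:
    """Return identifiers of eprints whose 'references' list includes dataset_id."""
    return [str(e["identifier"]) for e in eprints
            if dataset_id in e.get("references", [])]


def missing_backlinks(datasets: List[Record],
                      eprints: List[Record]) -> List[Tuple[str, str]]:
    """Find (eprint, dataset) pairs where the eprint cites a dataset
    that has no matching 'isReferencedBy' entry."""
    by_id = {str(d["identifier"]): d for d in datasets}
    gaps = []
    for e in eprints:
        for target in e.get("references", []):
            dataset = by_id.get(target)
            if dataset is not None and e["identifier"] not in dataset.get("isReferencedBy", []):
                gaps.append((str(e["identifier"]), target))
    return gaps


# Made-up records standing in for harvested metadata.
DATASETS = [
    {"identifier": "data:1", "isReferencedBy": ["eprint:9"]},
    {"identifier": "data:2", "isReferencedBy": []},
]
EPRINTS = [
    {"identifier": "eprint:9", "references": ["data:1", "data:2"]},
]
```

An aggregator could use a check like `missing_backlinks` to spot links that exist in only one direction.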
30
Creating the metadata
  • Potential to embed deposit and disseminate into
    workflow of chemist in automated way

31
eBank phase two work areas
  • Sub-disciplines of chemistry, earth sciences,
    engineering
  • Pursue generic data model
  • Use of identifiers for citing datasets
  • Subject approach to discovering research data
    (keywords, classification, ontology)
  • Access to research data in teaching and learning
    context
  • Liaise with other digital repository initiatives

32
Related UK projects
  • National e-Science Centre (NeSC)
  • NERC Data Grid (Atmospheric and Oceanographic
    Data Centres)
  • JISC Digital Repositories Programme
      - Spectra (experimental chemistry, high volume
        ingestion)
      - R4L (lab equipment, metadata generation)
      - CLADDIER (citation, identifiers, linking)
      - StORe (data and publication repository links)
      - GRADE (reuse of geospatial data)

33
2. Issues: generic data models, metadata schema,
terminology
  • Validation against generic schema
      - CCLRC Scientific Data Model Version 2
  • Complex digital objects and packaging options
      - METS
      - MPEG-21 DIDL
  • Terminologies
      - Domain: crystallography
      - Inter-disciplinary, e.g. biomaterials
  • Metadata enhancement: subject keyword additions
    to datasets based on related publications
  • Meaningful resource discovery?

34
3. Issues: linking
  • Links to individual datasets within an experiment
  • Links to all datasets associated with an
    experiment or a data collection
  • Links to derived eprints and published literature
  • Context sensitive linking: find me
      - Datasets by this author / creator
      - Datasets related to this subject
      - Learning objects by this author / creator
      - Learning objects related to this subject
  • Identifiers and persistence
      - generic
      - domain: International Chemical Identifier
        (InChI code)
  • Resource discovery: Google Scholar?
  • Provenance: authenticity, authority, integrity?

35
4. Issues: identifiers
  • Identifiers and persistence
      - generic: DOI, PURL, Handle, ARK
      - domain: International Chemical Identifier (InChI)
  • Resolution lookup
  • Resource discovery: Google Scholar?
  • Granularity (metadata, linking)?
  • Provenance: authenticity, authority, integrity?
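
Services that link or resolve identifiers first need to tell the schemes apart. The sketch below guesses the scheme of an identifier string from its surface form. The patterns are rough heuristics chosen for the example (the `hdl:` and `doi:` prefixes and the `purl.` host check are assumptions about how identifiers are written), not validators for any of the schemes.

```python
import re

# Ordered list of (scheme name, pattern); first match wins.
_PATTERNS = [
    ("InChI", re.compile(r"^InChI=1S?/")),
    ("DOI", re.compile(r"^(doi:)?10\.\d{4,9}/\S+$", re.IGNORECASE)),
    ("Handle", re.compile(r"^hdl:\S+/\S+$", re.IGNORECASE)),
    ("ARK", re.compile(r"^ark:/?\d{5,}/\S+$", re.IGNORECASE)),
    ("PURL", re.compile(r"^https?://purl\.[\w.-]+/\S+$", re.IGNORECASE)),
]


def identifier_scheme(value: str) -> str:
    """Guess which identifier scheme `value` belongs to, or 'unknown'."""
    value = value.strip()
    for name, pattern in _PATTERNS:
        if pattern.match(value):
            return name
    return "unknown"
```

For instance, `identifier_scheme("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")` returns `"InChI"`, while a plain string with no recognisable prefix returns `"unknown"`.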

36
5. Issues: embedding and workflow
  • Into the crystallographic publishing community
      - International Union of Crystallography
  • Into the chemistry research workflow
      - SMART TEA Digital Lab Book e-synthesis Lab
      - Other analytical techniques and instrumentation
  • Into the curriculum and e-Learning workflows
      - MChem course
      - Undergraduate Chemical Informatics courses

37
For the future
  • Who provides added value services?
      - Authority files, automated subject indexing,
        annotation, data mining, visualisation
  • What are the preservation issues?
      - UK Digital Curation Centre: http://www.dcc.ac.uk
      - National Science Board draft report on long-lived
        data collections:
        http://www.nsf.gov/nsb/meetings/2005/LLDDC_draftreport.pdf
  • How to manage complex object descriptions within
    OAI?
  • Digital curation of research data presents new
    roles for scientists, computer scientists, data
    managers.

38
Repositories and digital curation
  • Data preservation: static; for later use?
  • Data curation: dynamic; in use now (and the
    future)? Maintaining and adding value to a
    trusted body of digital information for current
    and future use
39
Provide value-added services
  • Annotation
  • e-Lab books (Smart Tea Project in chemistry)
  • Gene and protein sequences

40
Enable post-processing and knowledge extraction
  • The acquisition of newly-derived information and
    knowledge from repository content
  • Run complex algorithms over primary datasets
  • Mining (data, text, structures)
  • Modelling (economic, climate, mathematical,
    biological)
  • Analysis (statistical, lexical, pattern
    matching, gene)
  • Presentation (visualisation, rendering)

41
6. Issues: knowledge services
  • Layered over repositories
  • Annotation
  • Mining, modelling, analysis
  • Visualisation
  • Across multiple repositories
  • Grid enabled applications
  • Highly distributed, dynamic and collaborative
  • Associated with curatorial responsibility
  • UK Digital Curation Centre: http://www.dcc.ac.uk

42
Issues summary
  • Research data is diverse, increasing rapidly in
    volume and complexity
  • Repository collections are dynamic and evolve
  • Technical challenges associated with
    interoperability, persistence, provenance,
    resource discovery and infrastructure provision
  • Embedding in workflow is critical: scholarly
    communications, research practice, learning
  • Knowledge extraction tools will generate new
    discoveries based on repository content
  • Repository solutions must scale: M2M processing
    will become the norm

43
Project homepage: http://www.ukoln.ac.uk/projects/ebank-uk/
Duke, M. et al. Enhancing access to research data: the challenge of crystallography. JCDL 2005.
http://www.ukoln.ac.uk/projects/ebank-uk/dissemination/jcdl2005/preprint.pdf
Acknowledgement to all project partners for their
contributions to this presentation.