Title: DSpace 2.x Architecture Roadmap
1. DSpace 2.x Architecture Roadmap
- Robert Tansley
- DSpace Technical Lead, HP
2. Overview
- Why a DSpace 2.x?
- Proposed Target Architecture
- Example Deployments
- Proposed Migration Path
3. Why a DSpace 2.x?
4. DSpace 1.x
- Breadth-first implementation of institutional repository
- Provides all required functionality to start capturing digital assets
- Widened awareness and understanding of digital preservation problem
5. Key areas for improvement
- Modularity
- Digital Preservation
- Scalability
6. Modularity
- Current APIs are low-level, somewhat ad hoc
- Difficult to keep stable
- Difficult to implement enhanced/alternative functionality behind them
- Changing a particular aspect of functionality involves changing the UI as well as the underlying business logic module
- e.g. Workflow review pages are very specific to the current Workflow Manager module functionality
7. Modularity
- Heavy inter-dependence
- e.g. Modules use the same DB tables: a change in one module means you have to change others that use the same tables
- No real plug-in mechanism
- Managing a modification alongside evolving core DSpace code can be tricky
8. DSpace 1 series architecture
9. Making a change
10. Proposed new modular approach
- Modules provide own UI
- Modules do not directly share data, e.g. DB tables
- Inter-module communication via defined APIs
- Many modules then don't need APIs, e.g. browse UI
11. Proposed new modular approach
- UIs glued together by UI framework
- Framework provides navigation tools, look and feel, internationalisation, localisation
12. Proposed new modular approach
- Modules can depend on APIs
13. Proposed new modular approach
- Modules can implement two APIs
- e.g. an LDAP integration module could implement the E-person API and the authorisation API, as in the sketch below
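A minimal Java sketch of the idea, assuming hypothetical interface names (the actual DSpace 2.x APIs are not yet defined):

```java
// Hypothetical API shapes; names are illustrative, not the actual DSpace 2.x APIs.
interface EPersonAPI {
    EPerson findByEmail(String email);
}

interface AuthorisationAPI {
    boolean isAuthorised(EPerson who, String action, String objectId);
}

class EPerson {
    final String email;
    EPerson(String email) { this.email = email; }
}

// One module satisfying both APIs, each backed by the same LDAP directory.
class LdapModule implements EPersonAPI, AuthorisationAPI {
    public EPerson findByEmail(String email) {
        // Look the person up in LDAP (omitted) and return a lightweight record.
        return new EPerson(email);
    }

    public boolean isAuthorised(EPerson who, String action, String objectId) {
        // Decide from LDAP group membership (omitted); deny by default.
        return false;
    }
}
```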
14. Digital preservation
- Use of relational database optimised for access
- Metadata is separate from bitstreams
- Database corruption would make archive very difficult to reconstruct
- Hard to extend metadata schema support
- Custom schema difficult for other apps to access
15. Scalability
- Some limits on scalability in 1.x, e.g.
- Browse code
- Supports multiple file systems, but not ideal
- Largely limited to single server
- Mirroring difficult
- Metadata in database, bitstreams on file system
- Extraction non-trivial
16. Proposed approach
- Refactor storage: asset store
- Metadata in standard format and bitstreams stored in the same place
- AIP becomes a more tangible concept
- Aids preservation: no reliance on particular software
- Aids scalability: easier to manage storage and distribution
- Easier to move around
17. Summary
18. Proposed Target Architecture
19. Target architecture overview
20. Asset store
21. Asset store
- Corresponds to OAIS Archival Storage
- Contains only Archival Information Packages (AIPs)
- Not e-people records, in-progress submissions etc.
- AIPs consist of:
- Metadata serialisation
- Bitstreams
- AIP checksum
22. Object model
23. Example AIP (item)
- How it might look in a file system:
- aip-identifier/
- metadata.xml (current metadata serialisation)
- 184BE84F293342 (bitstream 1; the filename is the checksum, as sketched below)
- 3F9AD0389CB821 (bitstream 2)
- 330F925A1D0386 (bitstream 3)
- checksum (checksum of the whole AIP)
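A short Java sketch of the "filename is the checksum" convention above; the choice of MD5 is an assumption for illustration:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class BitstreamName {
    // Derive a bitstream's storage name from its checksum, as in the listing above.
    static String checksumName(Path bitstream) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");  // assumed algorithm
        try (InputStream in = Files.newInputStream(bitstream)) {
            byte[] buf = new byte[8192];
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();  // e.g. "184BE84F293342..." as above
    }
}
```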
24. Asset store API
25. Asset store API
- Standardised Java API for DSpace asset stores
- May be different implementations:
- Simple file system
- Enterprise reference information store
- Grid-based, e.g. SRB
- SAN
- Allows creation, retrieval, update etc. of AIPs (one possible Java shape is sketched below)
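One possible Java shape for such an API; method names and the Aip type are assumptions, not the finalised interface:

```java
import java.io.InputStream;
import java.util.Date;
import java.util.List;

// Hypothetical standardised asset store API; implementations could sit on a
// simple file system, an enterprise reference information store, SRB, a SAN, etc.
interface AssetStore {
    String store(Aip aip);                             // deposit; returns the new AIP identifier
    Aip retrieve(String aipId);
    void put(String aipId, Aip aip);                   // create or replace (update)
    void delete(String aipId);
    List<String> identifiersModifiedSince(Date when);  // supports polling clients
}

// Hypothetical view of an AIP, mirroring the example layout earlier.
interface Aip {
    InputStream metadataSerialisation();   // metadata.xml
    List<String> bitstreamChecksums();     // bitstream filenames (= checksums)
    String checksum();                     // checksum of the whole AIP
}
```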
26. Scaling up
- Easy to replicate AIPs and asset stores
- Enables serving larger numbers of users
- Aids preservation: multiple copies, more robust
28. Scaling up
- Two DSpaces can easily keep synchronised
- Something as simple as a periodic rsync can do the job (see the sketch below)
- Exact mechanism would depend on asset store
- File system, enterprise reference information store, SRB etc.
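Using the hypothetical AssetStore sketched above, one-way synchronisation might look like this; for a file-system store, a periodic rsync of the AIP directories achieves the same effect:

```java
import java.util.Date;

public class StoreSync {
    // Copy AIPs changed at the source since the last sync over to the mirror.
    static Date syncOnce(AssetStore source, AssetStore mirror, Date since) {
        Date started = new Date();
        for (String aipId : source.identifiersModifiedSince(since)) {
            mirror.put(aipId, source.retrieve(aipId));  // create or replace
        }
        return started;  // caller records this as the next 'since'
    }
}
```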
29. What about clashes?
- We're dealing with reference information
- DSpace is not an authoring system
- Not work-in-progress, often-updated material
- Same AIP being updated by two different DSpace instances in the same day is unlikely
- Can flag as a conflict for manual resolution
- Exception: items being added to the same collection
- Simple to resolve: merge the additions (see the sketch below)
- Just make sure IDs are unique!
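A sketch of that merge, assuming globally unique item identifiers; resolution then reduces to a set union:

```java
import java.util.Set;
import java.util.TreeSet;

public class CollectionMerge {
    // Two instances both added items to the same collection; merge the additions.
    static Set<String> merge(Set<String> itemsAtSiteA, Set<String> itemsAtSiteB) {
        Set<String> merged = new TreeSet<>(itemsAtSiteA);
        merged.addAll(itemsAtSiteB);  // safe only because item IDs never collide
        return merged;
    }
}
```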
30. What about search indices?
- Modules may maintain indices or caches of information from AIPs in the asset store
- e.g. the browse UI, Lucene index
- Modules keep indices or caches up-to-date by periodically polling the asset store API
- Similar to incremental harvesting in OAI-PMH (sketched below)
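A sketch of such a polling module, reusing the hypothetical AssetStore and Aip interfaces from the asset store API slide:

```java
import java.util.Date;

public class IndexPoller {
    private final AssetStore store;
    private Date lastPoll = new Date(0);  // epoch: the first run indexes everything

    IndexPoller(AssetStore store) { this.store = store; }

    // Run periodically (e.g. nightly): index only AIPs changed since the last
    // poll, much like an OAI-PMH incremental harvest.
    void pollOnce() {
        Date started = new Date();
        for (String aipId : store.identifiersModifiedSince(lastPoll)) {
            reindex(store.retrieve(aipId));
        }
        lastPoll = started;
    }

    private void reindex(Aip aip) {
        // Parse metadata.xml and update the Lucene/browse entries (omitted).
    }
}
```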
31. Why the polling approach?
- Polling is simpler to implement than real-time notification
- Implementing a custom asset store is easier
- More scalable: can control when indexing occurs
- Big sync might mean several indices updating at once
- End-users might not see deposits appear in the search/browse indices immediately. However:
- Doesn't happen anyway if any workflow review is needed
- Needn't take more than overnight to happen
- Reference information is not time-critical data
32. DSpace modular architecture
- Some modules have APIs, some do not
- Modules may have dependencies
- i.e. module X depends on an implementation of API Y (see the sketch below)
- Modules may use an RDBMS but do not share tables
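A minimal, purely illustrative sketch of such a dependency mechanism: a registry hands module X some implementation of API Y without binding it to a concrete class:

```java
import java.util.HashMap;
import java.util.Map;

class ModuleRegistry {
    private final Map<Class<?>, Object> implementations = new HashMap<>();

    // A module registers the API it implements.
    <T> void provide(Class<T> api, T impl) {
        implementations.put(api, impl);
    }

    // Module X declares a dependency on API Y and receives whatever implements it.
    <T> T require(Class<T> api) {
        Object impl = implementations.get(api);
        if (impl == null) {
            throw new IllegalStateException("No implementation of " + api.getName());
        }
        return api.cast(impl);
    }
}
```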
33. UI framework
34. UI framework
- Glues together UIs of different modules
- Provides navigation tools, stylesheets, skin
- Internationalisation, localisation
- User authentication
- Cocoon provides most of the above functionality
- Easy to add the rest
35. Exposing services via Tomcat, e.g. OAI-PMH
36. Core DSpace modules and APIs
37. Content management API
- Similar to existing org.dspace.content API
- Provides a procedural way to manipulate AIPs (a sketch follows below)
- Implementation may cache some information in an RDBMS
- e.g. Community/collection/item structure
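A hypothetical sketch of that procedural style, loosely echoing org.dspace.content; all names are assumptions:

```java
import java.io.InputStream;

// Assumed facade for manipulating AIPs procedurally.
interface ContentManager {
    String createItem(String collectionId);   // returns the new item's AIP identifier
    void setMetadata(String itemId, String field, String value);
    void addBitstream(String itemId, InputStream data, String name);
}

// Example use: a deposit that ends up as a single AIP in the asset store.
class DepositExample {
    static void deposit(ContentManager cm, InputStream pdf) {
        String item = cm.createItem("collection-123");  // hypothetical identifier
        cm.setMetadata(item, "dc.title", "An example deposit");
        cm.addBitstream(item, pdf, "paper.pdf");
    }
}
```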
38. Extending metadata
- Pull out the pieces of the search UI, submit UI, and item display related to Dublin Core into a separate module
- Allow other similar modules for dealing with other schemas and extensions
- Start with simple property/value support (sketched below)
- SIMILE will provide richer functionality
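A sketch of the simple property/value starting point: a flat multimap that schema-specific modules could later build on:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PropertyValueMetadata {
    // property -> one or more values, e.g. "dc.title" -> ["..."]
    private final Map<String, List<String>> fields = new HashMap<>();

    public void add(String property, String value) {
        fields.computeIfAbsent(property, k -> new ArrayList<>()).add(value);
    }

    public List<String> get(String property) {
        return fields.getOrDefault(property, new ArrayList<>());
    }
}
```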
39. Security
- Similar to DSpace 1.x
- Modules running within a DSpace instance are trusted
- Not worrying about malicious code for now
- Modules and the UI framework are responsible for authenticating the end-user as an e-person
- Modules and the asset store implementation must invoke the authorisation API as appropriate (see the sketch below)
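A sketch of a module invoking the authorisation API before touching content, reusing the hypothetical AuthorisationAPI and EPerson from the LDAP example; the exception type is also an assumption:

```java
class NotAuthorisedException extends RuntimeException {
    NotAuthorisedException(String message) { super(message); }
}

class GuardedUpdate {
    private final AuthorisationAPI authz;

    GuardedUpdate(AuthorisationAPI authz) { this.authz = authz; }

    // Every content-changing operation checks authorisation first.
    void updateAip(EPerson user, String aipId) {
        if (!authz.isAuthorised(user, "WRITE", aipId)) {
            throw new NotAuthorisedException(user.email + " may not write " + aipId);
        }
        // ... perform the update via the asset store API (omitted) ...
    }
}
```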
40. Summary
- Refactor storage: content in AIPs (metadata + bitstreams)
- Easier to share/mirror AIPs with periodic synchronisation
- Modules do OAI-PMH-style incremental harvests to keep indices/caches up to date
- Benefit: increased scalability and preserve-ability
- Cost: new/changed AIPs aren't instantly indexed
- Often not the case anyway (workflow reviews)
- Reference information (not time-critical)
41. Summary
- Modular architecture
- Modules responsible for own UI and data
- Modules inter-communicate via defined APIs
- UI framework provides Web UI glue (Cocoon)
- Dependency mechanism to allow plug-in functionality
- Benefit: vastly improved modularity
- Essential for our diverse community of users
- Cost: implementing modules might take more effort
- Unavoidable but manageable price of modularity
- Different from current approach: migration non-trivial
- Those who haven't changed DSpace 1 much will have an easy upgrade path
- Does anyone really like servlets/JSPs?
42. Example Deployments
43. Standard deployment
44. Web services module
45. LDAP-based e-people and authorisation
46. Mirrored asset store
47. Shared asset store
48. Separate ingest and access instances
49. DSpace on SRB
50. SIMILE
51. Proposed Migration Path
52. Stage 1: Build asset store
- Decide on AIP metadata serialisation
- Build asset store
- Integrate asset store w/DSpace 1.x
- Either build synchronisation tool, or
- Replace CM API (org.dspace.content) -- trickier
53. Stage 2: Build 2.0
- Design and build modular infrastructure (dependencies etc.)
- Define the APIs
- Port/implement 1.x functionality
- Release this as 2.0
- Institutions can port their code to the 2.0 architecture, and swap over
54. Stage 3: 2.x and beyond
- DSpace 2.1
- Authorisation policy expression in AIPs
- XQuery API
- DSpace 2.2
- Federation
- DSpace 2.3
- Integrate SIMILE components