Title: DSpace 2.x Architecture Roadmap
1. DSpace 2.x Architecture Roadmap
- Robert Tansley
- DSpace Technical Lead, HP
2. Overview
- Why a DSpace 2.x?
- Proposed Target Architecture
- Example Deployments
- Proposed Migration Path
3. Why a DSpace 2.x?
4. DSpace 1.x
- Breadth-first implementation of institutional repository
- Provides all required functionality to start capturing digital assets
- Widened awareness and understanding of digital preservation problem
5. Key areas for improvement
- Modularity
- Digital Preservation
- Scalability
6. Modularity
- Current APIs are low-level, somewhat ad hoc
- Difficult to keep stable
- Difficult to implement enhanced/alternative functionality behind them
- Changing a particular aspect of functionality involves changing the UI as well as the underlying business logic module
- e.g. Workflow review pages are very specific to the current Workflow Manager module functionality
7. Modularity
- Heavy inter-dependence
- e.g. Modules use the same DB tables: a change in one module means you have to change others that use the same tables
- No real plug-in mechanism
- Managing a modification alongside evolving core DSpace code can be tricky
8. DSpace 1 series architecture
9. Making a change
10. Proposed new modular approach
- Modules provide own UI
- Modules do not directly share data, e.g. DB tables
- Inter-module communication via defined APIs
- Many modules then don't need APIs, e.g. browse UI
11. Proposed new modular approach
- UIs glued together by UI framework
- Framework provides navigation tools, look and feel, internationalisation, localisation
12. Proposed new modular approach
- Modules can depend on APIs
13. Proposed new modular approach
- Modules can implement two APIs
- e.g. an LDAP integration module could implement the E-person API and the authorisation API, as in the sketch below
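A minimal Java sketch of the idea, assuming hypothetical interface names (the actual DSpace 2.x APIs are not yet defined):

```java
// Hypothetical API shapes; names are illustrative, not the actual DSpace 2.x APIs.
interface EPersonAPI {
    EPerson findByEmail(String email);
}

interface AuthorisationAPI {
    boolean isAuthorised(EPerson who, String action, String objectId);
}

class EPerson {
    final String email;
    EPerson(String email) { this.email = email; }
}

// One module satisfying both APIs, each backed by the same LDAP directory.
class LdapModule implements EPersonAPI, AuthorisationAPI {
    public EPerson findByEmail(String email) {
        // Look the person up in LDAP (omitted) and return a lightweight record.
        return new EPerson(email);
    }

    public boolean isAuthorised(EPerson who, String action, String objectId) {
        // Decide from LDAP group membership (omitted); deny by default.
        return false;
    }
}
```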
14. Digital preservation
- Use of relational database optimised for access
- Metadata is separate from bitstreams
- Database corruption would make archive very difficult to reconstruct
- Hard to extend metadata schema support
- Custom schema difficult for other apps to access
15. Scalability
- Some limits on scalability in 1.x, e.g.
- Browse code
- Supports multiple file systems, but not ideal
- Largely limited to single server
- Mirroring difficult
- Metadata in database, bitstreams on file system
- Extraction non-trivial
16. Proposed approach
- Refactor storage: asset store
- Metadata in standard format and bitstreams stored in the same place
- AIP becomes a more tangible concept
- Aids preservation: no reliance on particular software
- Aids scalability: easier to manage storage and distribution
- Easier to move around
17. Summary
18. Proposed Target Architecture
19. Target architecture overview
20. Asset store
21. Asset store
- Corresponds to OAIS Archival Storage
- Contains only Archival Information Packages (AIPs)
- Not e-people records, in-progress submissions etc.
- AIPs consist of:
- Metadata serialisation
- Bitstreams
- AIP checksum
22. Object model
23. Example AIP (item)
- How it might look in a file system:
- aip-identifier/
- metadata.xml (current metadata serialisation)
- 184BE84F293342 (bitstream 1; the filename is the checksum, as sketched below)
- 3F9AD0389CB821 (bitstream 2)
- 330F925A1D0386 (bitstream 3)
- checksum (checksum of the whole AIP)
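A short Java sketch of the "filename is the checksum" convention above; the choice of MD5 is an assumption for illustration:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class BitstreamName {
    // Derive a bitstream's storage name from its checksum, as in the listing above.
    static String checksumName(Path bitstream) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");  // assumed algorithm
        try (InputStream in = Files.newInputStream(bitstream)) {
            byte[] buf = new byte[8192];
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();  // e.g. "184BE84F293342..." as above
    }
}
```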
24. Asset store API
25. Asset store API
- Standardised Java API for DSpace asset stores
- May be different implementations:
- Simple file system
- Enterprise reference information store
- Grid-based, e.g. SRB
- SAN
- Allows creation, retrieval, update etc. of AIPs (one possible Java shape is sketched below)
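One possible Java shape for such an API; method names and the Aip type are assumptions, not the finalised interface:

```java
import java.io.InputStream;
import java.util.Date;
import java.util.List;

// Hypothetical standardised asset store API; implementations could sit on a
// simple file system, an enterprise reference information store, SRB, a SAN, etc.
interface AssetStore {
    String store(Aip aip);                             // deposit; returns the new AIP identifier
    Aip retrieve(String aipId);
    void put(String aipId, Aip aip);                   // create or replace (update)
    void delete(String aipId);
    List<String> identifiersModifiedSince(Date when);  // supports polling clients
}

// Hypothetical view of an AIP, mirroring the example layout earlier.
interface Aip {
    InputStream metadataSerialisation();   // metadata.xml
    List<String> bitstreamChecksums();     // bitstream filenames (= checksums)
    String checksum();                     // checksum of the whole AIP
}
```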
26. Scaling up
- Easy to replicate AIPs and asset stores
- Enables serving larger numbers of users
- Aids preservation: multiple copies, more robust
28. Scaling up
- Two DSpaces can easily keep synchronised
- Something as simple as a periodic rsync can do the job (see the sketch below)
- Exact mechanism would depend on asset store
- File system, enterprise reference information store, SRB etc.
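Using the hypothetical AssetStore sketched above, one-way synchronisation might look like this; for a file-system store, a periodic rsync of the AIP directories achieves the same effect:

```java
import java.util.Date;

public class StoreSync {
    // Copy AIPs changed at the source since the last sync over to the mirror.
    static Date syncOnce(AssetStore source, AssetStore mirror, Date since) {
        Date started = new Date();
        for (String aipId : source.identifiersModifiedSince(since)) {
            mirror.put(aipId, source.retrieve(aipId));  // create or replace
        }
        return started;  // caller records this as the next 'since'
    }
}
```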
29. What about clashes?
- We're dealing with reference information
- DSpace is not an authoring system
- Not work-in-progress, often-updated material
- Same AIP being updated by two different DSpace instances in the same day is unlikely
- Can flag as a conflict for manual resolution
- Exception: items being added to the same collection
- Simple to resolve: merge the additions (see the sketch below)
- Just make sure IDs are unique!
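A sketch of that merge, assuming globally unique item identifiers; resolution then reduces to a set union:

```java
import java.util.Set;
import java.util.TreeSet;

public class CollectionMerge {
    // Two instances both added items to the same collection; merge the additions.
    static Set<String> merge(Set<String> itemsAtSiteA, Set<String> itemsAtSiteB) {
        Set<String> merged = new TreeSet<>(itemsAtSiteA);
        merged.addAll(itemsAtSiteB);  // safe only because item IDs never collide
        return merged;
    }
}
```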
30. What about search indices?
- Modules may maintain indices or caches of information from AIPs in the asset store
- e.g. the browse UI, Lucene index
- Modules keep indices or caches up-to-date by periodically polling the asset store API
- Similar to incremental harvesting in OAI-PMH (sketched below)
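A sketch of such a polling module, reusing the hypothetical AssetStore and Aip interfaces from the asset store API slide:

```java
import java.util.Date;

public class IndexPoller {
    private final AssetStore store;
    private Date lastPoll = new Date(0);  // epoch: the first run indexes everything

    IndexPoller(AssetStore store) { this.store = store; }

    // Run periodically (e.g. nightly): index only AIPs changed since the last
    // poll, much like an OAI-PMH incremental harvest.
    void pollOnce() {
        Date started = new Date();
        for (String aipId : store.identifiersModifiedSince(lastPoll)) {
            reindex(store.retrieve(aipId));
        }
        lastPoll = started;
    }

    private void reindex(Aip aip) {
        // Parse metadata.xml and update the Lucene/browse entries (omitted).
    }
}
```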
31. Why the polling approach?
- Polling is simpler to implement than real-time notification
- Implementing a custom asset store is easier
- More scalable: can control when indexing occurs
- Big sync might mean several indices updating at once
- End-users might not see deposits appear in the search/browse indices immediately. However:
- Doesn't happen anyway if any workflow review is needed
- Needn't take more than overnight to happen
- Reference information is not time-critical data
32. DSpace modular architecture
- Some modules have APIs, some do not
- Modules may have dependencies
- i.e. module X depends on an implementation of API Y (see the sketch below)
- Modules may use an RDBMS but do not share tables
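A minimal, purely illustrative sketch of such a dependency mechanism: a registry hands module X some implementation of API Y without binding it to a concrete class:

```java
import java.util.HashMap;
import java.util.Map;

class ModuleRegistry {
    private final Map<Class<?>, Object> implementations = new HashMap<>();

    // A module registers the API it implements.
    <T> void provide(Class<T> api, T impl) {
        implementations.put(api, impl);
    }

    // Module X declares a dependency on API Y and receives whatever implements it.
    <T> T require(Class<T> api) {
        Object impl = implementations.get(api);
        if (impl == null) {
            throw new IllegalStateException("No implementation of " + api.getName());
        }
        return api.cast(impl);
    }
}
```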
33. UI framework
34. UI framework
- Glues together UIs of different modules
- Provides navigation tools, stylesheets, skin
- Internationalisation, localisation
- User authentication
- Cocoon provides most of the above functionality
- Easy to add the rest
35. Exposing services via Tomcat, e.g. OAI-PMH
36. Core DSpace modules and APIs
37. Content management API
- Similar to existing org.dspace.content API
- Provides a procedural way to manipulate AIPs (a sketch follows below)
- Implementation may cache some information in an RDBMS
- e.g. Community/collection/item structure
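A hypothetical sketch of that procedural style, loosely echoing org.dspace.content; all names are assumptions:

```java
import java.io.InputStream;

// Assumed facade for manipulating AIPs procedurally.
interface ContentManager {
    String createItem(String collectionId);   // returns the new item's AIP identifier
    void setMetadata(String itemId, String field, String value);
    void addBitstream(String itemId, InputStream data, String name);
}

// Example use: a deposit that ends up as a single AIP in the asset store.
class DepositExample {
    static void deposit(ContentManager cm, InputStream pdf) {
        String item = cm.createItem("collection-123");  // hypothetical identifier
        cm.setMetadata(item, "dc.title", "An example deposit");
        cm.addBitstream(item, pdf, "paper.pdf");
    }
}
```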
38. Extending metadata
- Pull out the pieces of the search UI, submit UI, and item display related to Dublin Core into a separate module
- Allow other similar modules for dealing with other schemas and extensions
- Start with simple property/value support (sketched below)
- SIMILE will provide richer functionality
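A sketch of the simple property/value starting point: a flat multimap that schema-specific modules could later build on:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PropertyValueMetadata {
    // property -> one or more values, e.g. "dc.title" -> ["..."]
    private final Map<String, List<String>> fields = new HashMap<>();

    public void add(String property, String value) {
        fields.computeIfAbsent(property, k -> new ArrayList<>()).add(value);
    }

    public List<String> get(String property) {
        return fields.getOrDefault(property, new ArrayList<>());
    }
}
```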
39. Security
- Similar to DSpace 1.x
- Modules running within a DSpace instance are trusted
- Not worrying about malicious code for now
- Modules and the UI framework are responsible for authenticating the end-user as an e-person
- Modules and the asset store implementation must invoke the authorisation API as appropriate (see the sketch below)
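A sketch of a module invoking the authorisation API before touching content, reusing the hypothetical AuthorisationAPI and EPerson from the LDAP example; the exception type is also an assumption:

```java
class NotAuthorisedException extends RuntimeException {
    NotAuthorisedException(String message) { super(message); }
}

class GuardedUpdate {
    private final AuthorisationAPI authz;

    GuardedUpdate(AuthorisationAPI authz) { this.authz = authz; }

    // Every content-changing operation checks authorisation first.
    void updateAip(EPerson user, String aipId) {
        if (!authz.isAuthorised(user, "WRITE", aipId)) {
            throw new NotAuthorisedException(user.email + " may not write " + aipId);
        }
        // ... perform the update via the asset store API (omitted) ...
    }
}
```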
40. Summary
- Refactor storage: content in AIPs (metadata + bitstreams)
- Easier to share/mirror AIPs with periodic synchronisation
- Modules do OAI-PMH-style incremental harvests to keep indices/caches up to date
- Benefit: increased scalability and preserve-ability
- Cost: new/changed AIPs aren't instantly indexed
- Often not the case anyway (workflow reviews)
- Reference information (not time-critical)
41. Summary
- Modular architecture
- Modules responsible for own UI and data
- Modules inter-communicate via defined APIs
- UI framework provides Web UI glue (Cocoon)
- Dependency mechanism to allow plug-in functionality
- Benefit: vastly improved modularity
- Essential for our diverse community of users
- Cost: implementing modules might take more effort
- Unavoidable but manageable price of modularity
- Different from current approach: migration non-trivial
- Those who haven't changed DSpace 1 much will have an easy upgrade path
- Does anyone really like servlets/JSPs?
42. Example Deployments
43. Standard deployment
44. Web services module
45. LDAP-based e-people and authorisation
46. Mirrored asset store
47. Shared asset store
48. Separate ingest and access instances
49. DSpace on SRB
50. SIMILE
51. Proposed Migration Path
52. Stage 1: Build asset store
- Decide on AIP metadata serialisation
- Build asset store
- Integrate asset store w/DSpace 1.x
- Either build synchronisation tool, or
- Replace CM API (org.dspace.content) -- trickier
53. Stage 2: Build 2.0
- Design and build modular infrastructure (dependencies etc.)
- Define the APIs
- Port/implement 1.x functionality
- Release this as 2.0
- Institutions can port their code to the 2.0 architecture, and swap over
54. Stage 3: 2.x and beyond
- DSpace 2.1
- Authorisation policy expression in AIPs
- XQuery API
- DSpace 2.2
- Federation
- DSpace 2.3
- Integrate SIMILE components