Much of the learning about the constituent UPS archives occ - PowerPoint PPT Presentation

Provided by: centralebi
Learn more at: https://www.cs.odu.edu

1
herbert van de sompel, michael nelson, thomas
krichel
the UPS protoproto project
UPS 1 Meeting Santa Fe - October 21st 1999
2
description
the UPS protoproto
the data exchange framework
3
  • UPS: enable cross-archive end-user services
  • protoproto
  • facilitate discussions
  • identify issues involved in creating
    cross-archive services
  • experiment with digital object concepts for
    archive material
  • does not claim to be a solution
  • protoproto is multi-disciplinary
  • a special instance of cross-archive
  • there is a market
  • promotional value

4
  • coordination: herbert van de sompel, michael
    nelson, thomas krichel
  • involvement of
  • Old Dominion U / NASA Langley
  • U of Surrey
  • U of Ghent
  • Los Alamos National Laboratory - Library
  • Russian Academy of Science - Siberian branch

5
  • Los Alamos National Laboratory - Research Library
  • JISC eLib WoPEc project

6
  • metadata only
  • full text remains at archives
  • static dumps obtained ca. July 99

              the arXiv  CogPrints   NACA  NCSTRL  NDLTD   RePEc    Total
objects          85,223        742  3,036  29,184  1,590  73,367  193,142
full-text        85,223        659  3,036   9,084    951  13,582  112,535
!organization    17,983         14    100      93      1   2,453
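
The totals in the table can be checked by summing the per-archive counts. The short sketch below does that and also derives the share of records with full text per archive. It assumes that the "full-text" row counts records whose full text is available, which is a reading of the slide rather than something it states.

```python
# Counts copied from the slide's table (static dumps obtained ca. July 99).
objects = {
    "arXiv": 85_223, "CogPrints": 742, "NACA": 3_036,
    "NCSTRL": 29_184, "NDLTD": 1_590, "RePEc": 73_367,
}
full_text = {
    "arXiv": 85_223, "CogPrints": 659, "NACA": 3_036,
    "NCSTRL": 9_084, "NDLTD": 951, "RePEc": 13_582,
}

total_objects = sum(objects.values())      # agrees with the slide: 193,142
total_full_text = sum(full_text.values())  # agrees with the slide: 112,535


def coverage(archive: str) -> float:
    """Share of an archive's records that have full text (assumed meaning)."""
    return full_text[archive] / objects[archive]


for name in objects:
    print(f"{name:10s} {coverage(name):6.1%}")
print(f"{'Total':10s} {total_full_text / total_objects:6.1%}")
```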
7
        the arXiv  CogPrints  NACA   NCSTRL   NDLTD  RePEc
format  internal   internal   Refer  RFC1807  MARC   ReDIF
8
  • Getting metadata out of archives
  • not all archives support metadata extraction
  • some archives have undocumented metadata
    extraction procedures
  • not all archives support rich criteria for
    extraction
  • single dump concept only
  • Intellectual property and use rights not always
    clear

9
  • Metadata has problems with
  • record duplication
  • crucial missing fields
  • internal errors
  • ambiguous references to people, places, and
    publications
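
Two of the problems listed above, missing fields and record duplication, lend themselves to simple automated checks. The following is a minimal sketch, not the procedure used in the project: the list of required fields, the record layout, and the title-based matching rule are all assumptions made for illustration, and the sample records are invented.

```python
import re
from collections import defaultdict

# Hypothetical choice of "crucial" fields; the slides do not name them.
REQUIRED_FIELDS = ("title", "authors", "date")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def title_key(record: dict) -> str:
    """Lower-case the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", record.get("title", "").lower()).strip()


def duplicate_candidates(records: list) -> list:
    """Group record ids whose normalized titles coincide."""
    groups = defaultdict(list)
    for record in records:
        key = title_key(record)
        if key:
            groups[key].append(record["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented sample records.
records = [
    {"id": "a1", "title": "A Sample Preprint!", "authors": ["A. Author"], "date": "1999"},
    {"id": "b7", "title": "a sample  preprint", "authors": ["A. Author"], "date": ""},
    {"id": "c3", "title": "Another Paper", "authors": [], "date": "1998"},
]

for record in records:
    print(record["id"], missing_fields(record))
print(duplicate_candidates(records))
```

Title matching only proposes candidates; a person or a stricter rule would still have to confirm each pair.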

10
  • all datasets converted to ReDIF
  • essential to have a single format for the
    creation of services
  • supply by archives in a single format was not
    realistic
  • no downgrading of data
  • data enhancements
  • creation of unique identifier
  • addition of raw subject-classification
  • normalization of publication types

11
  • creation of archives for ReDIF-ed metadata
  • using intelligent digital objects: buckets

RePEc
arXiv
NCSTRL
12
  • Buckets were chosen to study the implications of
    using rich, intelligent objects in UPS
  • Buckets are
  • DL protocol / system independent
  • self-contained and mobile
  • handle their own display, enforcement of terms
    and conditions, and dissemination of their
    contents
  • designed for bundling multiple data
    representations and data instance types
  • The aggregative nature of buckets is well suited
    for adding value-added services at the object
    level

13
  • NCSTRL digital library service
  • indexing buckets in archives by requesting their
    metadata
  • enhanced user-interface
  • NCSTRL search results point at buckets
  • buckets auto-display
  • buckets provide link to full-text in native
    archive

14
  • UPS contains 193K objects
  • using buckets consumed inodes (60 inodes per
    bucket)
  • filesystem reformatted with more generous amount
    of inodes
  • Solaris and Dienst conflict
  • Dienst wants each object in a publishing
    authority to be in a single directory
  • Solaris has a hard limit of 32K objects in a
    directory
  • resolution: use many (100) authorities for UPS
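
The numbers on this slide can be worked through directly. The sketch below computes the inode demand and the smallest number of authorities that would fit under the directory limit, then shows one possible way to spread objects over 100 authorities. It assumes "32K" means 32 x 1024; the hash-based assignment and the `upsNN` authority names are hypothetical, since the slide does not say how objects were distributed.

```python
import math
import zlib

OBJECTS = 193_142        # total objects, from the holdings table
INODES_PER_BUCKET = 60   # from the slide
DIR_LIMIT = 32 * 1024    # "32K" objects per directory; assuming K = 1024
AUTHORITIES = 100        # from the slide

inodes_needed = OBJECTS * INODES_PER_BUCKET
min_authorities = math.ceil(OBJECTS / DIR_LIMIT)
average_per_authority = OBJECTS / AUTHORITIES


def authority_for(object_id: str) -> str:
    """Pick an authority by hashing the identifier (illustrative scheme)."""
    index = zlib.crc32(object_id.encode("utf-8")) % AUTHORITIES
    return f"ups{index:02d}"


print(f"inodes needed: {inodes_needed:,}")
print(f"minimum authorities under the limit: {min_authorities}")
print(f"average objects per authority: {average_per_authority:,.0f}")
```

With 100 authorities the average load is far below the directory limit, which leaves room for an uneven distribution.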

15
  • integrate the archives with the traditional
    communication mechanism
  • context-sensitive linking to deliver extended
    services via SFX technology

16
evaluate metadata
system A
17
(No Transcript)
18
  • buckets for arXiv, NCSTRL and RePEc are
    SFX-aware
  • Cogprints, NACA, NDLTD not SFX-aware
  • SLAC/SPIRES is SFX-aware
  • linking services for preprint metadata for
    published version

19
  • will be available starting at the beginning of November
  • UPS list will be notified
  • disclaimer: not a production system

http://ups.cs.odu.edu:8000/
http://ups.cs.odu.edu
20
  • data exchange framework
  • data provision vs. data implementation
  • central searching, distributed archives
  • need for a framework by which archives can
    describe themselves
  • content
  • terms and conditions
  • protocols, criteria supported to extract
    (meta)data
  • metadata scheme, subject classification scheme,
    material-type scheme, ...
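
One way to picture such a self-description is as a small machine-readable record that an archive publishes about itself. The sketch below is a hypothetical structure, not a format proposed in the slides: the class and field names are invented to mirror the bullets above, and every value except RePEc's use of ReDIF (taken from the format table) is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ArchiveDescription:
    """Hypothetical self-description record mirroring the slide's bullets."""
    name: str
    content: str
    terms_and_conditions: str
    extraction_protocols: list = field(default_factory=list)
    extraction_criteria: list = field(default_factory=list)
    metadata_scheme: str = "unspecified"
    subject_classification_scheme: str = "unspecified"
    material_type_scheme: str = "unspecified"


# Placeholder values, except the metadata scheme from the format table.
example = ArchiveDescription(
    name="RePEc",
    content="unspecified",
    terms_and_conditions="unspecified",
    extraction_criteria=["subject classification", "accession date"],
    metadata_scheme="ReDIF",
)

print(json.dumps(asdict(example), indent=2))
```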

21
  • need for an identifier scheme for archives and
    archive objects
  • (cf. ISSN, ISBN, DOI)
  • metadata quality obstructs the creation of
    services
  • desirable to extend metadata with citation
    information
  • smart objects
  • archived objects that are active, not passive

22
  • Providing data
  • publishing into an archive
  • providing methods for metadata harvesting
  • provide non-technical context for sharing
    information also
  • Implementing Data
  • harvest metadata from providers
  • implement user interface to data
  • Even if provided by the same DL, these are
    distinct functions

23
[Diagram: two providers. One has only an input interface and a native end-user interface: no machine-based way to extract metadata. The other also has a native harvesting interface: machine and user interfaces for extracting metadata.]
24
[Diagram: an implementor with a native end-user interface (input and harvesting interfaces optional) and two providers, each with a native harvesting interface and an input interface; the providers' native end-user interface is optional (e.g., RePEc).]
25
  • Much of the learning about the constituent UPS
    archives occurred out of band
  • Given an unknown archive, we should be able to
    algorithmically determine the archive's
    metadata...

Where possible, the harvesting interface should
provide the same criteria as the end-user
interface
[Diagram: a provider with a native harvesting interface, an input interface, and a native end-user interface.]
26
  • Recommended criteria for metadata extraction
  • subject classification
  • accession date
  • publication date
  • Criteria for archive description
  • metadata formats employed
  • contact information for archive
  • publication type scheme
  • identifier scheme
  • subject classification scheme

27
  • Useful in
  • reference linking
  • can be used in citations
  • resolving duplications
  • UPS duplications were removed by hand
  • tracking publication lifecycle
  • Need the ability for an object to have multiple
    unique identifiers
  • organization, discipline, etc.
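
An object with several unique identifiers needs some way to recognize that they all name the same thing. The sketch below uses a simple alias table for this; the class, its methods, and the identifier strings are hypothetical and only illustrate the idea of resolving organizational and disciplinary identifiers to one object.

```python
class IdentifierRegistry:
    """Map every known identifier of an object to one canonical identifier."""

    def __init__(self) -> None:
        self._canonical = {}

    def register(self, canonical: str, *aliases: str) -> None:
        """Record that all given identifiers name the same object."""
        identifiers = (canonical, *aliases)
        # Check first so a conflict leaves the registry unchanged.
        for identifier in identifiers:
            owner = self._canonical.get(identifier, canonical)
            if owner != canonical:
                raise ValueError(f"{identifier} already names {owner}")
        for identifier in identifiers:
            self._canonical[identifier] = canonical

    def resolve(self, identifier: str) -> str:
        """Return the canonical identifier; raises KeyError if unknown."""
        return self._canonical[identifier]

    def same_object(self, first: str, second: str) -> bool:
        return self.resolve(first) == self.resolve(second)


# Invented identifiers: one per organization and one per discipline.
registry = IdentifierRegistry()
registry.register("ups:0001", "org:example-lab:tr-42", "discipline:physics:0042")
registry.register("ups:0002", "org:example-lab:tr-43")

print(registry.same_object("org:example-lab:tr-42", "discipline:physics:0042"))
print(registry.same_object("ups:0001", "ups:0002"))
```

With such a table, two records that carry different identifiers for the same object can be recognized as duplicates by lookup instead of by hand.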

28
  • Premise: Objects are more important than the
    archives that hold them
  • SODA: Smart Objects, Dumb Archives
  • Objects should be the canonical authority for
  • metadata
  • contents
  • use
  • Objects should be able to grow and change
  • correct metadata
  • add new formats
  • add new services
  • reflect the lifecycle of the object

29
  • It would be beneficial if the archived objects
    could be heterogeneous
  • with their own look-and-feel
  • unique functionality / services
  • e.g., the data archiving needs of an atmospheric
    scientist can be different than that of a
    computer scientist, engineer or medical
    researcher
  • yet maintain a standard API for
  • extracting metadata
  • content retrieval
  • resource discovery on the object
  • terms and conditions
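
The combination of heterogeneous objects and a standard API maps naturally onto an abstract interface with differing implementations. The sketch below is one hypothetical rendering of that idea: the class and method names are invented to match the four bullets above and are not the bucket API.

```python
from abc import ABC, abstractmethod


class ArchivedObject(ABC):
    """Hypothetical common interface; names are illustrative only."""

    @abstractmethod
    def metadata(self) -> dict: ...        # extracting metadata

    @abstractmethod
    def formats(self) -> list: ...         # resource discovery on the object

    @abstractmethod
    def content(self, fmt: str) -> bytes: ...  # content retrieval

    @abstractmethod
    def terms(self) -> str: ...            # terms and conditions


class Preprint(ArchivedObject):
    def __init__(self, title: str, pdf: bytes) -> None:
        self._title, self._pdf = title, pdf

    def metadata(self) -> dict:
        return {"title": self._title, "type": "preprint"}

    def formats(self) -> list:
        return ["pdf"]

    def content(self, fmt: str) -> bytes:
        if fmt != "pdf":
            raise KeyError(fmt)
        return self._pdf

    def terms(self) -> str:
        return "free to read"


class Dataset(ArchivedObject):
    def __init__(self, title: str, files: dict) -> None:
        self._title, self._files = title, files

    def metadata(self) -> dict:
        return {"title": self._title, "type": "dataset"}

    def formats(self) -> list:
        return sorted(self._files)

    def content(self, fmt: str) -> bytes:
        return self._files[fmt]

    def terms(self) -> str:
        return "contact the author before reuse"


def summarize(obj: ArchivedObject) -> str:
    """Works on any object, whatever its internal layout."""
    return f"{obj.metadata()['title']}: {', '.join(obj.formats())} ({obj.terms()})"


print(summarize(Preprint("Sample Preprint", b"%PDF")))
print(summarize(Dataset("Sample Data", {"csv": b"a,b", "netcdf": b"\x00"})))
```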

30
  • A strong distinction between the provision of
    data, and the implementation of data
  • also, a socio-legal context for sharing metadata
  • Open, self-describing archives
  • A universal, unique identifier name space
  • Archived objects with more intelligence and
    flexibility