Persistent Archive for the NSDL

Slides: 12
Provided by: reag7
Learn more at: https://www.erpanet.org

Transcript and Presenter's Notes

1
Persistent Archive for the NSDL
Reagan W. Moore, Charlie Cowart
University of California, San Diego
San Diego Supercomputer Center
{moore, charliec}@sdsc.edu
http://www.npaci.edu/DICE/
2
Persistent Archive Team
  • Reagan Moore
  • Sheau Yen Chen
  • Charles Cowart
  • George Kremenek
  • Erdem Kulrul
  • Richard Marciano
  • Arcot Rajasekar
  • Michael Wan

3
Status
  • Architecture design
  • Choice of web crawler
  • Demonstration
  • Proof of concept

4
Architecture
  • Built on existing tools
  • Retrieve metadata
      • OAI metadata harvester
  • Retrieve digital entities
      • Web crawler
  • Organize and archive digital entities
      • Data grid
  • Provide access
      • OAI and HTTP interfaces

5
OAI Interfaces
  • OAI service provider interface
      • Used Tom Kalts (U Mass) OAI harvester classes
      • Initiate connection
      • Retrieve metadata as XML
      • Parse XML into objects
  • OAI data provider interface
      • Custom CGI interface to SRB/MCAT written in C
      • Parses OAI2 requests and generates SRB client
        calls
      • Transforms from SRB objects to XML
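
The harvesting steps listed for the service provider side (initiate connection, retrieve metadata as XML, parse XML into objects) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the Java harvester classes the slide refers to. The `HarvestedRecord` class and the function names are hypothetical; the `verb`, `metadataPrefix`, and `from` parameters and the XML namespaces are those of OAI-PMH 2.0 with unqualified Dublin Core.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


@dataclass
class HarvestedRecord:
    """Hypothetical container for one harvested metadata record."""
    identifier: str                 # OAI identifier from the record header
    datestamp: str                  # datestamp from the record header
    title: str                      # first dc:title, or "" if absent
    urls: list = field(default_factory=list)  # dc:identifier values that are URLs


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL for an OAI data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Parse a ListRecords response into HarvestedRecord objects."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        if header is None:
            continue
        urls = [
            e.text for e in rec.iterfind(".//dc:identifier", NS)
            if e.text and e.text.startswith("http")
        ]
        records.append(HarvestedRecord(
            identifier=header.findtext("oai:identifier", default="", namespaces=NS),
            datestamp=header.findtext("oai:datestamp", default="", namespaces=NS),
            title=rec.findtext(".//dc:title", default="", namespaces=NS),
            urls=urls,
        ))
    return records
```

The URLs collected in `urls` are the kind of input a crawler would need; fetching the response over HTTP and following resumption tokens are left out of this sketch.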

6
Web Crawler
  • HTML crawler choice
      • WGET (GNU)
      • WebBase (Stanford)
      • HTML/XML translator (SDSC)
  • Capabilities
      • Parallelized for performance
      • Recursively crawl web site
      • Build link graph structure
      • Translation of links to logical name space
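
The "crawl web site" and "build link graph" capabilities can be illustrated with a small stand-in written in Python. It is not WGET or WebBase: it is a sequential, breadth-first sketch with a depth limit, and it assumes a caller-supplied `fetch(url)` function that returns HTML text or `None`. All names here are hypothetical. Links to other hosts are recorded in the graph but not followed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect raw href/src attribute values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free, de-duplicated links in page order."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, out = set(), []
    for raw in parser.links:
        absolute = urldefrag(urljoin(base_url, raw))[0]
        if absolute not in seen:
            seen.add(absolute)
            out.append(absolute)
    return out


def crawl(seed, fetch, max_depth=2):
    """Crawl from seed and return a link graph {page URL: [outgoing links]}.

    Pages up to max_depth links away from the seed are fetched; only links
    on the seed's host are followed.
    """
    site = urlparse(seed).netloc
    graph = {}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue
        html = fetch(url)
        links = extract_links(url, html) if html is not None else []
        graph[url] = links
        if depth < max_depth:
            for link in links:
                if urlparse(link).netloc == site and link not in graph:
                    queue.append((link, depth + 1))
    return graph
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; a parallel version would distribute the queue across workers.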

7
Data Grid
  • Organize retrieved digital entities
      • Snapshot based (time)
      • Support for compound documents
      • Conversion of all internal URL links to SRB
        URL links, with an associated SRB logical
        name space for digital entities
  • Manage storage of digital entities
      • Store on disk / archive at SDSC; could be
        replicated to any other site
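
The conversion of internal URL links into links within a logical name space can be sketched as below. This is an assumption-laden illustration in Python: the collection root `/nsdl-archive`, the `srb:` link prefix, and the path layout (snapshot, then host, then path) are all hypothetical and are not taken from SRB itself. Query strings are ignored for brevity, and links to pages outside the archived set are left untouched.

```python
import re
from urllib.parse import urljoin, urlparse

ARCHIVE_ROOT = "/nsdl-archive"  # hypothetical collection root

# Matches href="..." or src='...' and captures the attribute, quote, and URL.
LINK_ATTR = re.compile(
    r'(?P<attr>\b(?:href|src)\s*=\s*)(?P<q>["\'])(?P<url>.*?)(?P=q)',
    re.IGNORECASE,
)


def logical_name(url, snapshot):
    """Map an original URL to a logical path inside a snapshot collection."""
    parts = urlparse(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"  # give directory-style URLs a file name
    if not path.startswith("/"):
        path = "/" + path
    return f"{ARCHIVE_ROOT}/{snapshot}/{parts.netloc}{path}"


def rewrite_links(html, page_url, archived, snapshot):
    """Point links to archived pages at their logical names.

    archived is the set of absolute URLs stored in this snapshot;
    any other link (for example, an external one) is preserved as-is.
    """
    def replace(match):
        target = urljoin(page_url, match.group("url"))
        if target not in archived:
            return match.group(0)
        new_url = "srb:" + logical_name(target, snapshot)
        return f'{match.group("attr")}{match.group("q")}{new_url}{match.group("q")}'

    return LINK_ATTR.sub(replace, html)
```

Keying the path on the snapshot label means two crawls of the same page at different times land in different collections, which matches the snapshot-based organization described above.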

8
Implementation
  • URL list generation from harvesting of NSDL
    repository
  • Crawl and retrieve digital entities into a
    buffer area
  • Archive into snapshot organized collections
  • Flags / time stamps for changed data for OAI
    based retrieval
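
One way to maintain the flags and time stamps for changed data is to compare a content digest against the previous crawl. The sketch below is a hypothetical illustration in Python, not the SRB/MCAT mechanism: the catalog is a plain dictionary, and the field names `digest`, `datestamp`, and `changed` are assumptions.

```python
import hashlib
from datetime import datetime, timezone


def record_snapshot(catalog, url, content, crawl_time):
    """Record one crawled entity; return True if it is new or changed.

    catalog maps URL -> {"digest", "datestamp", "changed"}. The datestamp
    is only advanced when the content digest differs from the last crawl.
    """
    digest = hashlib.sha256(content).hexdigest()
    entry = catalog.get(url)
    if entry is not None and entry["digest"] == digest:
        entry["changed"] = False  # unchanged since the previous snapshot
        return False
    catalog[url] = {"digest": digest, "datestamp": crawl_time, "changed": True}
    return True


def changed_since(catalog, since):
    """URLs whose content changed at or after `since` (for a dated OAI request)."""
    return sorted(u for u, e in catalog.items() if e["datestamp"] >= since)


if __name__ == "__main__":
    catalog = {}
    first = datetime(2002, 5, 1, tzinfo=timezone.utc)
    second = datetime(2002, 6, 1, tzinfo=timezone.utc)
    record_snapshot(catalog, "http://example.org/a", b"v1", first)
    record_snapshot(catalog, "http://example.org/a", b"v1", second)
    print(changed_since(catalog, second))
```

Because unchanged entities keep their earlier datestamp, a harvester asking for records from a given date receives only the entities that actually differ.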

9
Demonstration
  • Register digital entity by original URL
  • Store DC metadata
  • Crawl based on text file of desired URLs
  • Tested on LoC American Memory collection
  • Currently crawl two levels
  • Manages CGI redirection
  • Organize compound documents
  • Add SRB links for redirection
  • Preserve external web links
  • Display results using inQ interface to SRB
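
The first demonstration steps (read a text file of desired URLs, register each digital entity under its original URL, store its DC metadata) might look like the following in Python. The file format (one URL per line, lines starting with `#` treated as comments) and the registry structure are assumptions made for illustration only.

```python
def read_url_list(text):
    """Return the unique URLs in a text file's contents, in file order.

    Assumed format: one URL per line; blank lines and lines that start
    with '#' are skipped.
    """
    urls, seen = [], set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line in seen:
            continue
        seen.add(line)
        urls.append(line)
    return urls


def register_entity(registry, original_url, dc_metadata):
    """Register a digital entity keyed by its original URL.

    dc_metadata is a dict of Dublin Core element name -> value; it is
    copied so later changes by the caller do not alter the registry.
    """
    registry[original_url] = {"dc": dict(dc_metadata)}
    return registry[original_url]
```

Keying the registry on the original URL lets a later lookup go from a harvested metadata record to the archived copy without an extra identifier scheme.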

10
SDSC Storage Resource Broker and Meta-data Catalog
[Figure: SRB architecture diagram. Labels recovered from the slide:]
  • Application
  • Common APIs
  • Access APIs: Linux I/O, OAI, DLL / Python, Java,
    NT Browsers, GridFTP
  • Consistency Management / Authorization-Authentication
  • Prime Server
  • Logical Name Space, Latency Management, Data
    Transport, Metadata Transport
  • Storage Abstraction, Catalog Abstraction
  • Databases: DB2, Oracle, Sybase
  • Servers
  • HRM

11
General Information
  • http://www.npaci.edu/DICE