Repository Synchronization Using NNTP and SMTP

Provided by: Michael50
Learn more at: https://www.cs.odu.edu

1
Repository Synchronization Using NNTP and SMTP
  • Michael L. Nelson, Joan A. Smith, Martin Klein
  • Old Dominion University
  • Norfolk VA
  • www.cs.odu.edu/mln,jsmit,mklein
  • DLF Spring 2006
  • Austin TX
  • April 10-12, 2006

2
Preservation Fortress Model
Five Easy Steps for Preservation
  • Get a lot of
  • Buy a lot of disks, machines, tapes, etc.
  • Hire an army of staff
  • Load a small amount of data
  • Look upon my archive ye Mighty, and despair!

image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
3
Alternate Models of Preservation
  • Lazy Preservation
  • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
  • Find a good enough replacement web page
  • Web Server Based Preservation
  • Use Apache modules to create archival-ready
    resources
  • Shared Infrastructure Preservation
  • Push your content to sites that might preserve it

image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
4
Shared, Existing Infrastructure
  • Can we (re)use existing installed network
    infrastructure for preservation purposes?

Who has the Bigger Fortress?
5
Experiment Simulation
  • Inject the contents of an OAI-PMH repository
    directly into
  • Email (SMTP)
  • Usenet News (NNTP)
  • Instrument existing email, news servers
  • Use mod_oai (www.modoai.org) to do resource
    harvesting
  • complex object formats (e.g. MPEG-21 DIDL) used
    to encode the resources as lumps of XML
  • results are generalizable to any repository
    system
  • Analyze testbed, simulate very large collections
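
The harvesting step above can be sketched as building an OAI-PMH ListRecords request. This is a minimal illustration, not the authors' script: the base URL is a hypothetical example.org address, and the `oai_didl` metadata prefix is an assumption about how the DIDL format is exposed.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_didl",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL.

    metadata_prefix="oai_didl" is an assumed prefix for MPEG-21 DIDL
    output; from_date optionally limits the harvest to recent changes.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical base URL, for illustration only.
url = list_records_url("http://www.example.org/modoai", from_date="2006-04-01")
print(url)
```

Fetching that URL and splitting the response into individual records is left out; only the request shape is shown.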

6
Test Repository
  • Website with 72 files
  • HTML, PDF, PNG, JPEG, GIF
  • 1KB - 1.5 MB
  • Used a script to harvest the MPEG-21 DIDLs, and
    then
  • attach to outbound email mesgs
  • post to a moderated newsgroup (repository.odu.test1)
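
The two delivery paths above might look roughly like this with Python's standard `email` package. It is a sketch under stated assumptions: the `X-Repository-Identifier` header, the addresses, the record identifier, and the empty DIDL element are all hypothetical placeholders, and nothing is actually sent or posted.

```python
from email.message import EmailMessage

# Placeholder record; a real DIDL would wrap the resource and its metadata.
DIDL = b'<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"/>'


def attach_to_email(msg: EmailMessage, identifier: str, didl_xml: bytes) -> EmailMessage:
    """Piggyback one harvested record on an outbound message."""
    msg["X-Repository-Identifier"] = identifier  # hypothetical header name
    msg.add_attachment(didl_xml, maintype="application", subtype="xml",
                       filename="record.xml")
    return msg


def build_news_post(identifier: str, didl_xml: bytes, newsgroup: str) -> EmailMessage:
    """Wrap one record as a news article (same header format as mail)."""
    post = EmailMessage()
    post["From"] = "harvester@example.org"  # hypothetical poster address
    post["Newsgroups"] = newsgroup
    post["Subject"] = identifier
    post.set_content(didl_xml.decode("utf-8"))
    return post


mail = EmailMessage()
mail["From"] = "alice@example.org"
mail["To"] = "bob@example.org"
mail["Subject"] = "Lunch?"
mail.set_content("See you at noon.")
attach_to_email(mail, "oai:example.org:item-1", DIDL)

# Group name as it appears to read on the slide.
post = build_news_post("oai:example.org:item-1", DIDL, "repository.odu.test1")
```

A moderated group would also need the moderator's approval step, which this sketch does not model.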

7
General Architecture
8
Email
9
Adding Email Attachments / Headers
outgoing mail
incoming mail
10
Email Headers
11
SMTP Overhead
diminishing returns for skipping mesgs
1 sec penalty per mesg
12
Email Traffic @ mail.cs.odu.edu
  • 30 days of traffic
  • 505,987 mesgs
  • 4081 unique hosts
  • daily
  • mean 16,866
  • std dev 5147

P(x) = a x^{-b}; we measured b = 1.6
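
To get a feel for the fit, here is a small sketch that reads it as a power law P(x) = a x^(-b) with b = 1.6. The slide does not say what x is; treating x as the popularity rank of a destination host (1 to 4081) is an assumption made only for this illustration, and the sampled traffic is synthetic.

```python
import random

N_HOSTS = 4081        # unique hosts observed over 30 days
DAILY_MEAN = 16_866   # mean messages per day
B = 1.6               # measured exponent


def power_law_weights(n_hosts: int, b: float) -> list:
    """Normalized weights proportional to rank**(-b) for ranks 1..n_hosts."""
    raw = [rank ** -b for rank in range(1, n_hosts + 1)]
    total = sum(raw)  # 1/total plays the role of the constant a
    return [w / total for w in raw]


weights = power_law_weights(N_HOSTS, B)

# Draw one synthetic day of destinations (seeded for repeatability).
rng = random.Random(0)
destinations = rng.choices(range(1, N_HOSTS + 1), weights=weights, k=DAILY_MEAN)
distinct_hosts_reached = len(set(destinations))
```

Under this reading a handful of top-ranked hosts receive most of the messages, so records piggybacked on mail reach popular hosts quickly and rarely contacted hosts slowly.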
13
News
14
News Posting
15
News Overhead
16
News Policies
17
Simulation Parameters
  • Repository
  • 100,000 items
  • 1MB/item
  • 100 daily additions
  • 400 daily updates
  • Time
  • 2000 days (5.5 years)
  • Email
  • granularity = 1
  • follows ODU power law example
  • News
  • servers hold contents for 30 days
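
A back-of-envelope reading of these parameters, not the authors' simulator: the repository's size over time, and the minimum posting rate needed so that every item is reposted before a 30-day news server expires it. The "repost everything once per retention window" policy is an assumption made for this arithmetic.

```python
import math

INITIAL_ITEMS = 100_000
DAILY_ADDITIONS = 100
DAILY_UPDATES = 400
DAYS = 2000
NEWS_RETENTION_DAYS = 30


def repository_size(day: int) -> int:
    """Items in the repository after `day` days of additions."""
    return INITIAL_ITEMS + DAILY_ADDITIONS * day


def min_daily_posts(day: int) -> int:
    """Posts per day needed to repost every item once per retention window."""
    return math.ceil(repository_size(day) / NEWS_RETENTION_DAYS)


print(DAILY_ADDITIONS + DAILY_UPDATES)  # 500 changed items per day
print(repository_size(DAYS))            # 300000 items at the end of the run
print(min_daily_posts(0))               # 3334 posts/day at the start
print(min_daily_posts(DAYS))            # 10000 posts/day at the end
print(round(DAYS / 365, 1))             # 5.5 years
```

At 1MB/item, the final rate corresponds to roughly 10 GB of postings per day just to keep the full repository alive on the news server.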

18
NNTP Results
19
Email Results (Without Memory)
20
Email Results (With Memory)
21
Discussion
  • We've examined the worst case scenario
  • large, active repository
  • sending contents by-value
  • Optimizations / Alternatives
  • smaller, less dynamic repositories
  • sending contents by-reference
  • use for repository discovery, not for content
    interchange
  • instead of sending GetRecord results, send
    Identify results and let interested parties
    return to your site with proper harvesters
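
The by-value versus by-reference trade-off can be made concrete by comparing serialized message sizes. This is a sketch only: the `X-Repository-Identify` header name, the addresses, and the example.org base URL are hypothetical, and the 1 MB payload is filler standing in for a GetRecord result.

```python
from email.message import EmailMessage

BASE_URL = "http://www.example.org/modoai"  # hypothetical repository baseURL


def base_message() -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = "Lunch?"
    msg.set_content("See you at noon.")
    return msg


def by_value(record: bytes) -> EmailMessage:
    """Carry the full record (e.g. a GetRecord result) as an attachment."""
    msg = base_message()
    msg.add_attachment(record, maintype="application", subtype="xml",
                       filename="record.xml")
    return msg


def by_reference() -> EmailMessage:
    """Carry only a pointer: the repository's Identify request URL."""
    msg = base_message()
    msg["X-Repository-Identify"] = f"{BASE_URL}?verb=Identify"  # hypothetical header
    return msg


record = b"x" * 1_000_000  # filler for a 1 MB item
value_size = len(by_value(record).as_bytes())
reference_size = len(by_reference().as_bytes())
print(value_size, reference_size)
```

The by-value message ends up larger than the record itself because the attachment is base64-encoded, while the by-reference message grows by a single header line.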

22
Summary
  • Shared, existing infrastructure can be used to
    push content to unknown preservation partners
  • exploiting not just hardware infrastructure, but
    human communication patterns for resource
    discovery as well
  • While not possessing ideal DL/Archival
    capabilities, these methods are congruent with
    standard web practices
  • Gmail, Google Groups, etc. will always have more
    disks than you