Title: Repository Synchronization Using NNTP and SMTP
1Repository Synchronization Using NNTP and SMTP
- Michael L. Nelson, Joan A. Smith, Martin Klein
- Old Dominion University
- Norfolk VA
- www.cs.odu.edu/~{mln,jsmit,mklein}
- DLF Spring 2006
- Austin TX
- April 10-12, 2006
2Preservation Fortress Model
Five Easy Steps for Preservation
- Get a lot of
- Buy a lot of disks, machines, tapes, etc.
- Hire an army of staff
- Load a small amount of data
- Look upon my archive ye Mighty, and despair!
image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
3Alternate Models of Preservation
- Lazy Preservation
- Let Google, IA et al. preserve your website
- Just-In-Time Preservation
- Find a good enough replacement web page
- Web Server Based Preservation
- Use Apache modules to create archival-ready resources
- Shared Infrastructure Preservation
- Push your content to sites that might preserve it
image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
4Shared, Existing Infrastructure
- Can we (re)use existing installed network infrastructure for preservation purposes?
Who has the Bigger Fortress?
5Experiment Simulation
- Inject the contents of an OAI-PMH repository directly into:
- Email (SMTP)
- Usenet News (NNTP)
- Instrument existing email, news servers
- Use mod_oai (www.modoai.org) to do resource harvesting
- complex object formats (e.g. MPEG-21 DIDL) used to encode the resources as lumps of XML
- results are generalizable to any repository system
- Analyze testbed, simulate very large collections
6Test Repository
- Website with 72 files
- HTML, PDF, PNG, JPEG, GIF
- 1KB - 1.5 MB
- Used a script to harvest the MPEG-21 DIDLs, and then
- attach to outbound email mesgs
- post to a moderated newsgroup (repository.odu.test1)
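The "attach to outbound email" step can be sketched as follows. This is a minimal illustration using Python's standard `email` package, not the script used in the experiment. The addresses, the file name, the `X-Repo-BaseURL` header, and the placeholder XML are all assumptions made for the example; a real payload would be the DIDL document returned by the harvest.

```python
from email.message import EmailMessage


def attach_record(msg: EmailMessage, record_xml: bytes, filename: str) -> EmailMessage:
    """Attach one harvested record (XML bytes) to an existing outbound message."""
    msg.add_attachment(
        record_xml, maintype="application", subtype="xml", filename=filename
    )
    return msg


# An ordinary outbound message (addresses are placeholders).
msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.org"
msg["Subject"] = "Lunch on Friday?"
msg.set_content("See you at noon.")

# Hypothetical header pointing back at the source repository.
msg["X-Repo-BaseURL"] = "http://www.example.org/modoai"

# Placeholder standing in for a harvested MPEG-21 DIDL document.
record = b'<?xml version="1.0"?><record>harvested resource goes here</record>'
attach_record(msg, record, "record-0001.xml")

attachments = list(msg.iter_attachments())
```

The human-readable body is untouched; the repository content simply rides along as an extra MIME part that a receiving site may choose to keep.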
7General Architecture
8Email
9Adding Email Attachments / Headers
outgoing mail
incoming mail
10Email Headers
11SMTP Overhead
diminishing returns for skipping mesgs
1 sec penalty per mesg
12Email Traffic @ mail.cs.odu.edu
- 30 days of traffic
- 505,987 mesgs
- 4081 unique hosts
- daily
- mean 16,866
- std dev 5147
$P(x) = a x^{-b}$; we measured $b = 1.6$
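To give a feel for a power law with exponent 1.6, the sketch below normalizes $x^{-b}$ over a finite set of ranks. It assumes, for illustration only, that $x$ is the rank of a receiving host among the 4081 unique hosts and that the mean daily volume of 16,866 messages is spread according to those weights; the slide itself does not define $x$, so treat that reading as hypothetical.

```python
def power_law_weights(n_hosts: int, b: float = 1.6) -> list[float]:
    """Return normalized weights proportional to rank ** -b for ranks 1..n_hosts."""
    raw = [rank ** -b for rank in range(1, n_hosts + 1)]
    total = sum(raw)  # normalizing fixes the constant a = 1 / total
    return [w / total for w in raw]


# Figures taken from the traffic slide; their use as ranks is an assumption.
N_HOSTS = 4081
MEAN_DAILY_MESSAGES = 16_866

weights = power_law_weights(N_HOSTS)
expected_daily = [MEAN_DAILY_MESSAGES * w for w in weights]

# Share of all traffic going to the ten highest-ranked hosts.
top10_share = sum(weights[:10])
```

Under this reading, a few hosts receive most of the mail and therefore most of the attached repository content, while the long tail of hosts is reached only rarely.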
13News
14News Posting
15News Overhead
16News Policies
17Simulation Parameters
- Repository
- 100,000 items
- 1MB/item
- 100 daily additions
- 400 daily updates
- Time
- 2000 days (5.5 years)
- Email
- granularity = 1
- follows ODU power law example
- News
- servers hold contents for 30 days
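The parameters above imply some simple back-of-the-envelope figures, sketched here. Two assumptions are made for illustration: that 100,000 items is the starting size of the repository, and that every daily update touches a distinct item. This is plain arithmetic on the listed parameters, not a reproduction of the simulation.

```python
INITIAL_ITEMS = 100_000      # assumed to be the size on day 0
ITEM_MB = 1                  # 1 MB per item
DAILY_ADDS = 100
DAILY_UPDATES = 400
DAYS = 2000
NEWS_RETENTION_DAYS = 30     # news servers hold contents for 30 days


def repo_items(day: int) -> int:
    """Number of items in the repository after `day` days of additions."""
    return INITIAL_ITEMS + DAILY_ADDS * day


# Volume of new or changed content that must be pushed each day.
daily_change_mb = (DAILY_ADDS + DAILY_UPDATES) * ITEM_MB

# Repository size at the end of the simulated period.
final_items = repo_items(DAYS)

# Upper bound on changed items posted within one news retention window.
window_changes = (DAILY_ADDS + DAILY_UPDATES) * NEWS_RETENTION_DAYS
```

With these assumptions the repository triples in size over the run, while one 30-day window of changes covers at most 15,000 items, so the changes alone never put more than a small fraction of the repository on a news server at once.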
18NNTP Results
19Email Results (Without Memory)
20Email Results (With Memory)
21Discussion
- We've examined the worst case scenario
- large, active repository
- sending contents by-value
- Optimizations / Alternatives
- smaller, less dynamic repositories
- sending contents by-reference
- use for repository discovery, not for content interchange
- instead of sending GetRecord results, send Identify results and let interested parties return to your site with proper harvesters
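The difference between the two OAI-PMH requests mentioned above can be sketched as URL construction. `Identify` and `GetRecord` are standard OAI-PMH verbs; the base URL, the record identifier, and the `oai_didl` metadata prefix below are assumptions chosen for the example.

```python
from urllib.parse import urlencode


def oai_request(base_url: str, verb: str, **args: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


BASE_URL = "http://www.example.org/modoai"  # hypothetical repository baseURL

# By-reference: a short pointer that describes the repository itself.
identify_url = oai_request(BASE_URL, "Identify")

# By-value: the request whose (potentially large) response would be sent along.
get_record_url = oai_request(
    BASE_URL,
    "GetRecord",
    identifier="http://www.example.org/index.html",  # hypothetical identifier
    metadataPrefix="oai_didl",                       # assumed metadata format
)
```

Sending only something like `identify_url` keeps each message small and leaves the heavy lifting to whichever party decides to come back and harvest.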
22Summary
- Shared, existing infrastructure can be used to push content to unknown preservation partners
- exploiting not just hardware infrastructure, but human communication patterns for resource discovery as well
- While not possessing ideal DL/Archival capabilities, these methods are congruent with standard web practices
- Gmail, Google Groups, etc. will always have more disks than you