Title: Managing the Rhizome:
1Managing the Rhizome: METS for Web Archiving
Leslie Myrick, NYU
CNI Fall 2004 Task Force Meeting, Portland, OR, 6-7 December 2004
2Outline for Today
- Peculiarities of websites as complex digital objects
- How METS is particularly suited to address the challenges of encapsulating website objects
- How METS could be used to manage and even navigate website objects in an archive
3The problem
- WWW an increasingly important vehicle for dissemination of information
- Tools and infrastructure perhaps as volatile as the material we're collecting
- Ensure that fugitive materials will be captured, preserved, made accessible
4Early Implementers of Large-Scale Web Archiving
Repositories
- Internet Archive Wayback Machine
- National Library of Australia
- PANDORA
- National Library of Sweden
- Kulturarw3
5Major Web Archiving Initiatives
- Nordic Web Archive (NWA)
- UK Web Archiving Consortium (UKWAC)
- International Internet Preservation Consortium (IIPC)
- National Digital Information Infrastructure and Preservation Program (NDIIPP)
- Half of Partnerships involve web content
- The Web at Risk: CDL, UNT, NYU
6Political Communications Web Archive Project
(PCWA)
- Under auspices of CRL and Mellon
- Participants: Cornell University, Stanford University, UT Austin, NYU
- Focus: SE Asia, Sub-Saharan Africa, Latin America, Western Europe
- Radical or NGO political born-digital ephemera
- Content: Internet Archive (.arc files of website snapshots culled from Alexa crawls)
7Websites as Moving Target
- Volatility
- BBC Site Banner: Updated Every Minute of Every Day
- Ephemerality
- Peter Lyman: Average lifespan of webpage is 44 days.
- Entire websites disappear at alarming rate as well
8Political Websites: ephemerality + import = urgency
- Candidates in the Nigerian Elections of April 2003 (PCWA) 8/37
- 135 Candidates' websites in California Recall campaign of 2003 (CDL)
- Defunct Federal Agencies (CyberCemetery at UNT)
9An enterprise fraught with questions
- How do we collect it before it disappears or radically changes?
- How do we define what we are collecting?
- How do we manage the curiosity cabinet of MIME Types we ingest?
- How do we supply effective metadata for management, preservation and access?
10Basic METS Recipe
- metsHdr
- fileSec
- structMap
- structLink
- dmdSec
- amdSec
- behaviorSec
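The recipe can be sketched as an empty METS skeleton. This is a minimal illustration only: the `mets_skeleton` helper and the label value are hypothetical, the sections are emitted in the order the METS schema declares them rather than the order listed on the slide, and the result is not schema-valid until the sections are filled in.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("METS", METS_NS)

# The seven sections from the slide, in METS schema order
# (the slide lists them in discussion order instead).
SECTIONS = ["metsHdr", "dmdSec", "amdSec", "fileSec",
            "structMap", "structLink", "behaviorSec"]


def mets_skeleton(label):
    """Return a bare <METS:mets> element with one empty child per section."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    for name in SECTIONS:
        ET.SubElement(root, f"{{{METS_NS}}}{name}")
    return root


# Hypothetical label, chosen only for illustration.
skeleton = mets_skeleton("www.apgawomen.org")
print(ET.tostring(skeleton, encoding="unicode"))
```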
11Web Archiving Challenges I: Definition and Taxonomy
- Definition of the object website and its boundaries
- Complexities of website structure(s)
- Complex symphonic nature of a webpage itself
12Definition and boundaries
- Website as a structured aggregate of files
- related by hyperlinking or embedding
- Capture and treatment of near files
- .css, .js, icons that may live on another server or domain but are necessary to render the page
- Treatment of external links
- and of files at the end of those links
13Which website structure?
- Physical file structure from host server as represented by captured mirror?
- Logical tree structure?
- entry page as parent, other pages as children
- Hyperlink structure?
- All of the above? or some combination?
14Webpage as symphonic
- HTML wrapper around embedded data streams and hyperlinks, all rendered in parallel
- embedded multimedia or Flash
- image SRCs
- javascript, .css
- HREFs
15<METS:fileSec> (the easy bit)
16How to inventory captured files?
- All the files harvested in a single snapshot
- Single <fileGrp>
- Sorted according to resource type?
- Incremental harvest (trickier)
- Handled by multiple fileGrps?
- Separate fileGrps for initial snapshot, migrated
or refreshed files?
17METS File inventory
<METS:fileSec>
  <METS:fileGrp>
    <METS:file ID="FID18" MIMETYPE="text/html" ADMID="ADM1">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/" />
    </METS:file>
    <METS:file ID="FID113" MIMETYPE="text/html" ADMID="ADM2">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/officers.htm" />
    </METS:file>
    <METS:file ID="FID120" MIMETYPE="text/html" ADMID="ADM3">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/calender.htm" />
    </METS:file>
    <METS:file ID="FID154" MIMETYPE="text/html" ADMID="ADM4">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/newsarchives.htm" />
    </METS:file>
    <METS:file ID="FID1059" MIMETYPE="text/html" ADMID="ADM5">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/home.htm" />
    </METS:file>
    ...
  </METS:fileGrp>
</METS:fileSec>
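An inventory like the one above could be generated from a list of harvested files. The following is a minimal sketch, not the tooling used in the project: `build_file_sec` is a hypothetical helper, and it assumes the single-fileGrp option with the IDs, MIME types and URLs already known.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_file_sec(files):
    """Build a <METS:fileSec> holding a single <METS:fileGrp>.

    `files` is an iterable of (file_id, mime_type, adm_id, url) tuples.
    """
    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    for file_id, mime_type, adm_id, url in files:
        file_el = ET.SubElement(
            grp, f"{{{METS_NS}}}file",
            {"ID": file_id, "MIMETYPE": mime_type, "ADMID": adm_id})
        ET.SubElement(
            file_el, f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url})
    return file_sec


# Two entries copied from the slide's inventory.
harvest = [
    ("FID18", "text/html", "ADM1", "www.apgawomen.org/"),
    ("FID113", "text/html", "ADM2", "www.apgawomen.org/officers.htm"),
]
file_sec = build_file_sec(harvest)
print(ET.tostring(file_sec, encoding="unicode"))
```

Sorting by resource type or handling an incremental harvest would mean emitting several fileGrp elements instead of one; the sketch does not attempt that.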
18<METS:structMap>
19METS <structMap> View of a Website in a Nutshell
- Flattened logical tree hierarchy of three levels
- <div> root entry page, index.html
- <div> each HTML page
- <div> each hyperlink on that page to a page internal to the site
20METS <structMap> view of an HTML page
- HTML wrapper around parallel elements: page itself, embedded files and hyperlinks
- <div> for the HTML page
- <fptr>
- <par>
- <area> for HTML page and each embedded parallel element -- .css, .js, images etc.
- (with ID-IDREF to file ID in fileSec)
- <div> for each (internal) hyperlinked page
21Simple <structMap> Example
- Nigerian Election (April 2003) Testbed
- APGA Women's Website
- Entry HTML page with
- two embedded flash files
- one href around
- an embedded image to non-flash home.
23<html>
<head><title>index</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>
<body bgcolor="#000000">
<table width="100%">
<tr><td>
<div align="center"><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0" width="700" height="150">
<embed src="notjust.swf" quality=high pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" width="700" height="150">
</embed> </object></div>
</td></tr> </table>
<table width="100%"> <tr> <td>
<div align="center"><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0" width="600" height="64">
<embed src="apgawnew.swf" quality=high pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" width="600" height="64"> </embed>
</object></div> </td> </tr> </table>
<p>&nbsp;</p><div align="center"> <table width="85%">
<tr><td> <div align="right"><a href="home.htm"><img src="enterarrow.gif" width="80" height="27" border="0"></a></div>
</td></tr> </table> </div> </body> </html>
24 <METS:div DMDID="DM1" TYPE="web page" ID="page18" LABEL="http://dlibdev.nyu.edu/webarchive/metstest/www.apgawomen.org/index.html">
  <METS:fptr>
    <METS:par>
      <METS:area FILEID="FID18"/>    <!-- index.html -->
      <METS:area FILEID="FID1036"/>  <!-- notjust.swf -->
      <METS:area FILEID="FID1043"/>  <!-- apgawnew.swf -->
      <METS:area FILEID="FID1075"/>  <!-- enterarrow.gif -->
    </METS:par>
  </METS:fptr>
  <METS:div TYPE="hyperlink" ID="LINK1" LABEL="home">
    <METS:fptr>
      <METS:area BEGIN="000" BETYPE="BYTE" END="111" FILEID="FID18"/>
    </METS:fptr>
  </METS:div>
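The ID-IDREF link between an <area> in the structMap and a <file> in the fileSec can be followed programmatically. A minimal sketch, assuming a cut-down METS document: `files_for_page` is a hypothetical helper, and the URL given for FID1036 is assumed, since the slides do not show that file's FLocat.

```python
import xml.etree.ElementTree as ET

NS = {"METS": "http://www.loc.gov/METS/",
      "xlink": "http://www.w3.org/1999/xlink"}
DIV = "{%s}div" % NS["METS"]
HREF = "{%s}href" % NS["xlink"]

# Cut-down sample; the FID1036 location is an assumption for illustration.
SAMPLE = """\
<METS:mets xmlns:METS="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:fileSec><METS:fileGrp>
    <METS:file ID="FID18" MIMETYPE="text/html">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/"/>
    </METS:file>
    <METS:file ID="FID1036" MIMETYPE="application/x-shockwave-flash">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/notjust.swf"/>
    </METS:file>
  </METS:fileGrp></METS:fileSec>
  <METS:structMap>
    <METS:div TYPE="web page" ID="page18">
      <METS:fptr><METS:par>
        <METS:area FILEID="FID18"/>
        <METS:area FILEID="FID1036"/>
      </METS:par></METS:fptr>
      <METS:div TYPE="hyperlink" ID="LINK1" LABEL="home">
        <METS:fptr><METS:area FILEID="FID18"/></METS:fptr>
      </METS:div>
    </METS:div>
  </METS:structMap>
</METS:mets>
"""


def files_for_page(mets_root, page_id):
    """Return the locations of the files rendered in parallel for one page <div>."""
    locations = {
        f.get("ID"): f.find("METS:FLocat", NS).get(HREF)
        for f in mets_root.iterfind(".//METS:file", NS)
    }
    page = next(d for d in mets_root.iter(DIV) if d.get("ID") == page_id)
    # Only the <par> directly under this page's <fptr>; nested hyperlink
    # <div>s carry their own <fptr> and are deliberately skipped.
    areas = page.findall("METS:fptr/METS:par/METS:area", NS)
    return [locations[a.get("FILEID")] for a in areas]


root = ET.fromstring(SAMPLE)
print(files_for_page(root, "page18"))
```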
25<METS:structLink>
26LC Flattened Structure: structMap and structLink
<METS:div DMDID="DM01" TYPE="wcwebpage" ID="page18" LABEL="http://dlibdev.nyu.edu/webarchive/metstest/www.apgawomen.org/index.html">
  <METS:fptr>
    <METS:par>
      <METS:area FILEID="FID18"/>
      <METS:area FILEID="FID1036"/>
      <METS:area FILEID="FID1043"/>
      <METS:area FILEID="FID1075"/>
    </METS:par>
  </METS:fptr>
</METS:div>
<METS:structLink>
  <METS:smLink from="page18" to="page1059"/>
  <METS:smLink from="page1059" to="page154"/>
  <METS:smLink from="page1059" to="page237"/>
  <METS:smLink from="page1059" to="page398"/>
27Mapping Hyperlink Structure, Redux
- <div> in structMap (obliquely) cross-referenced to <smLink> in structLink
- <METS:structLink>
- <METS:smLink from="LINK1" to="page1059" xlink:title="home"/>
- <METS:smLink from="LINK2" to="page113" xlink:title="officers"/>
- <METS:smLink from="LINK3" to="page102" xlink:title="calendar"/>
- </METS:structLink>
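The cross-reference between hyperlink <div>s and <smLink>s can be resolved into a page-to-page link graph. A minimal sketch under stated assumptions: `page_link_graph` is a hypothetical helper, the nesting of LINK2 and LINK3 under page1059 is assumed (the slides do not say which page owns them), and the unqualified from/to attributes follow the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"METS": "http://www.loc.gov/METS/"}
DIV = "{%s}div" % NS["METS"]

# The placement of LINK2 and LINK3 under page1059 is an assumption.
SAMPLE = """\
<METS:mets xmlns:METS="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:structMap>
    <METS:div TYPE="web page" ID="page18">
      <METS:div TYPE="hyperlink" ID="LINK1" LABEL="home"/>
    </METS:div>
    <METS:div TYPE="web page" ID="page1059">
      <METS:div TYPE="hyperlink" ID="LINK2" LABEL="officers"/>
      <METS:div TYPE="hyperlink" ID="LINK3" LABEL="calendar"/>
    </METS:div>
  </METS:structMap>
  <METS:structLink>
    <METS:smLink from="LINK1" to="page1059" xlink:title="home"/>
    <METS:smLink from="LINK2" to="page113" xlink:title="officers"/>
    <METS:smLink from="LINK3" to="page102" xlink:title="calendar"/>
  </METS:structLink>
</METS:mets>
"""


def page_link_graph(mets_root):
    """Map each page <div> ID to the page IDs its hyperlink <div>s point at."""
    # Step 1: record which page <div> owns each hyperlink <div>.
    owner = {}
    for page in mets_root.iter(DIV):
        for child in page.findall("METS:div", NS):
            if child.get("TYPE") == "hyperlink":
                owner[child.get("ID")] = page.get("ID")
    # Step 2: follow each smLink from its hyperlink <div> to the target page.
    graph = defaultdict(list)
    for sm_link in mets_root.iterfind(".//METS:smLink", NS):
        source_page = owner.get(sm_link.get("from"))
        if source_page is not None:
            graph[source_page].append(sm_link.get("to"))
    return dict(graph)


root = ET.fromstring(SAMPLE)
print(page_link_graph(root))
```

A viewer could walk such a graph to offer navigation inside the archive without touching the live web.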
28Web Archiving Challenges II: Extracted vs Human-Catalogued Metadata
- Lack of influence over content production
- More importantly, embedded metadata
- Technical metadata seen as safe because it can be programmatically extracted from the file itself
- Metadata embedded by producers of web pages, e.g. <title>, <meta> tags, questionable at best
- Do we want to take descriptive metadata wholesale from <title>, <meta> tags?
- Really?
30The Case of the Purloined Metadata
31The Case of the Purloined Metadata, continued
<snip>
<HTML>
<!-- saved from url=(0041)http://www.sport.de/spart/sk1/ski006.php3 -->
<HEAD>
<TITLE>Bienvenue sur le site de Front Social</TITLE>
<META CONTENT="text/html; charset=windows-1252" HTTP-EQUIV="Content-Type">
<META CONTENT="Sport sports Baseball Basketball Beach-Volleyball Bob Boxen Bundesliga Bundesligavereine Championsleague DEL DFB DFB-Pokal Eishockey Ergebnisse Europameisterschaft Europapokal Fernsehen Football Formel1 Formel3 Fußball Golf Hallenmasters Handball Hockey Inline-Skating Leichtathletik Motorbike Motorrad Motorsport Nationalmannschaft NBA NFL NHL Reiten Rodeln Schwimmen Skifahren Skispringen Snowboard Sportarten Sportnachrichten Surfen Tennis Tischtennis Turniere Uefa-Cup US Open Vereine Volleyball Wassersport WBA WBC WBO Weltmeisterschaft Weltrangliste Wimbledon Fußball Motorsport Radsport Volleyball Sport Eishockey Skisport Boxen Handball Leichtathletik Pferdesport Schwimmen" NAME="keywords">
<META CONTENT="Sport Sportnachrichten Sportvereine Ergebnisse Tabellen Ranglisten Bundesliga DEL Formel 1 Tennis" NAME="description">
<META CONTENT="thu, 30 mar 2000 12:00:00 GMT" HTTP-EQUIV="date">
<SCRIPT language="JavaScript" SRC="sport_fichiers/sidiscript.js">
<SCRIPT language="JavaScript"> <!-- var on = "/ima/pfeil_weiss2.gif" var off = "/ima/pfeil_weiss.gif"
</snip>
32<METS:dmdSec>
33Case study: Metadata from an Alexa .arc
- Typical Alexa / IA SIP: .arc and .dat files along with byte-offset .ndx file
- IA .arc: 100 MB .gz archive packed with files from web crawl along with server's HTTP response headers for each file.
34Typical Internet Archive .arc snippet
<snip>
crawler's file header:
http://www.apgawomen.org:80/calender.htm 63.241.136.203 20030417223125 text/html 2570
http headers:
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html
file itself:
<html> <head> <title>calender</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> </head> <body bgcolor="#FFFFFF">
</snip>
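The three layers of the record above (crawler's header line, HTTP headers, file itself) can be pulled apart with a few lines of code. A minimal sketch, not a full .arc reader: `parse_arc_record` is a hypothetical helper, it assumes a single already-decompressed record with line endings normalized to "\n", and the field names follow the ARC URL-record layout (URL, IP address, archive date, content type, archive length).

```python
from datetime import datetime

ARC_FIELDS = ("url", "ip_address", "archive_date", "content_type", "archive_length")


def parse_arc_record(record):
    """Split one ARC record into crawler metadata, HTTP headers and file body."""
    header_line, _, rest = record.partition("\n")
    meta = dict(zip(ARC_FIELDS, header_line.split(" ")))
    meta["archive_date"] = datetime.strptime(meta["archive_date"], "%Y%m%d%H%M%S")
    meta["archive_length"] = int(meta["archive_length"])

    # HTTP response: status line, header lines, a blank line, then the file.
    head, _, body = rest.partition("\n\n")
    status_line, *header_lines = head.splitlines()
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return meta, status_line, headers, body


# Abbreviated copy of the record shown on the slide.
SAMPLE = (
    "http://www.apgawomen.org:80/calender.htm 63.241.136.203 "
    "20030417223125 text/html 2570\n"
    "HTTP/1.1 200 OK\n"
    "Date: Thu, 17 Apr 2003 21:35:43 GMT\n"
    "Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510\n"
    "Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT\n"
    "Content-Length: 2299\n"
    "Content-Type: text/html\n"
    "\n"
    "<html> <head> <title>calender</title> </head> </html>\n"
)
meta, status, headers, body = parse_arc_record(SAMPLE)
print(meta["url"], meta["archive_date"].isoformat(), headers["last-modified"])
```

The crawler line and the HTTP headers are candidates for technical metadata; anything taken from the body (such as the title) is producer-supplied and needs the scrutiny discussed earlier.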
35What is extractable (dmdSec)?
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html
<html> <head> <title>calender</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> </head> <body bgcolor="#FFFFFF">
</snip>
36Website-Level MODS
<mods:mods>
  <mods:titleInfo>
    <mods:title>Website of the APGA Women</mods:title>
  </mods:titleInfo>
  <mods:genre>Web site</mods:genre>
  <mods:originInfo>
    <mods:dateCaptured encoding="iso8601">20030417</mods:dateCaptured>
  </mods:originInfo>
  <mods:language authority="iso639-2b">eng</mods:language>
  <mods:physicalDescription>
    <mods:internetMediaType>text/html</mods:internetMediaType>
    <mods:internetMediaType>image/jpg</mods:internetMediaType>
    <mods:internetMediaType>image/gif</mods:internetMediaType>
    <mods:internetMediaType>application/msword</mods:internetMediaType>
    <mods:internetMediaType>application/x-shockwave-flash</mods:internetMediaType>
  </mods:physicalDescription>
  <mods:abstract>Supports the All Progressive Grand Alliance political party (APGA). Information on the APGA presidential candidate, Chief Chukwuemeka Odumegwu-Ojukwu. Based in Kennesaw, Georgia.</mods:abstract>
  <mods:subject>
    <mods:topic>Political Parties</mods:topic>
    <mods:geographic>Africa</mods:geographic>
    <mods:geographic>Nigeria</mods:geographic>
  </mods:subject>
  <mods:relatedItem type="host">
    <mods:titleInfo>
      <mods:title>CRL Political Web Archiving Project</mods:title>
    </mods:titleInfo>
    <mods:identifier type="uri">http://www.crl.edu/content/PolitWeb.htm</mods:identifier>
  </mods:relatedItem>
  <mods:identifier displayLabel="Archived site" type="uri">http://dlibdev.nyu.edu/webarchive/metstest/apgawomen/20030417/www.agpawomen.org/</mods:identifier>
</mods:mods>
37MINERVA MODS Display
38<METS:amdSec> <METS:techMD>
39Technical Metadata Sources (.arc)
- Alexa, Heritrix crawler frontier application
- writes metadata about the harvest itself, the .arc file
- Host server's HTTP response headers
- metadata about the host server, files recorded
- Captured files themselves
- file headers, IPTC headers -- human input
- Post-processing with JHOVE, ImageMagick etc.
40ImageMagick dump for Mao1925.jpg
- Image: Mao1925.jpg
- Format: JPEG (Joint Photographic Experts Group JFIF format)
- Geometry: 142x185
- Class: DirectClass
- Type: true color
- Depth: 8 bits-per-pixel component
- Colors: 11423
- Resolution: 300x300 pixels
- Filesize: 8115b
- Interlace: Plane
- Background Color: grey100
- Border Color: #DFDFDF
- Matte Color: grey74
- Iterations: 0
- Compression: JPEG
- signature: 8c173bd33c3e5667d27e51aee539afcd58ccbc8d4a11ab76b127408905f598fd
- Tainted: False
41<mix:mix>
  <mix:BasicImageParameters>
    <mix:Format>
      <mix:MIMEType>image/jpeg</mix:MIMEType>
      <mix:ByteOrder>little-endian</mix:ByteOrder>
      <mix:Compression>
        <mix:CompressionScheme>5</mix:CompressionScheme>
        <mix:CompressionLevel>0</mix:CompressionLevel>
      </mix:Compression>
      <mix:PhotometricInterpretation>
        <mix:ColorSpace/>
      </mix:PhotometricInterpretation>
    </mix:Format>
    <mix:File>
      <mix:ImageIdentifier>perso.magic.fr/images/Mao1925.jpg</mix:ImageIdentifier>
      <mix:FileSize>8115</mix:FileSize>
    </mix:File>
    <mix:PreferredPresentation/>
  </mix:BasicImageParameters>
  <mix:ImageCreation/>
  <mix:ImagingPerformanceAssessment>
    <mix:SpatialMetrics>
      <mix:ImageWidth>142</mix:ImageWidth>
      <mix:ImageLength>185</mix:ImageLength>
    </mix:SpatialMetrics>
    <mix:Energetics>
      <mix:BitsPerSample>8</mix:BitsPerSample>
    </mix:Energetics>
  </mix:ImagingPerformanceAssessment>
  <mix:ChangeHistory/>
</mix:mix>
42Web Archiving Challenges III: Structuring and Managing Versions
- Version control-related storage and access issues in a continuous archive
- Creator-driven changes: successive harvests and versions
- Especially tricky with incremental harvest
- Repository-driven changes: refreshing, migration
43Modeling Website Objects with METS in a
Continuous Archive
- One possibility
- Root level METS (web site X as intellectual object) with <mptr>s down to
- Intermediary METS (web site X as harvested on April 17, 2003) with <mptr>s down to
- Leaf node METS (single web page in web site X harvested on April 17, 2003)
44APGA Women Websites
[Diagram: three harvests of the site (April 17, 2003; December 12, 2003; February 2, 2004), each shown with its captured pages: home.html, about.html, officers.html, news.html]
45APGA Women Websites
[Diagram: the three harvests: April 17, 2003; December 12, 2003; February 2, 2004]
46Aggregator / Single Capture Model
- METS for top level aggregation that uses <mptr>s to point to either another intermediary aggregator or to more than one captured version(s) of a web site.
- METS for single standalone captured site, whether part of successive harvests or a one-off capture.
47METS Website Aggregator
- Contains single MODS record describing the aggregation as an intellectual object
- e.g. Election 2004: JohnKerry.com (Oct 1-Nov 3)
- Contains no fileSec or structLink
- Contains TBD: digiProv, rights in amdSec
- Consists of a root <div> for the aggregation
- nesting <div>s with <mptr>s to each subsidiary aggregation or captured version
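The structMap of such an aggregator can be sketched as follows. This is an illustration under stated assumptions: `build_aggregator` is a hypothetical helper, the capture file names and the TYPE value are invented, and the MODS record and amdSec described above are left out for brevity.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_aggregator(label, captures):
    """Build an aggregator METS: a root <div> nesting one <div> + <mptr> per capture.

    `captures` is an iterable of (capture_label, href_of_subsidiary_mets) pairs.
    No fileSec or structLink is written, matching the aggregator description.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    root_div = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div",
        {"TYPE": "aggregation", "LABEL": label})  # TYPE value is an assumption
    for capture_label, href in captures:
        div = ET.SubElement(root_div, f"{{{METS_NS}}}div", {"LABEL": capture_label})
        ET.SubElement(
            div, f"{{{METS_NS}}}mptr",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
    return mets


# Harvest dates from the slides; the file names are hypothetical.
aggregator = build_aggregator("APGA Women website", [
    ("April 17, 2003", "apgawomen-20030417.xml"),
    ("December 12, 2003", "apgawomen-20031212.xml"),
    ("February 2, 2004", "apgawomen-20040202.xml"),
])
print(ET.tostring(aggregator, encoding="unicode"))
```

Each <mptr> target may itself be another aggregator or a single-capture METS, which is what allows the same captures to be grouped by site or by harvest date.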
48MINERVA Election 2004
[Diagram: captures grouped by candidate (Kerry, Nader, Bush), each with harvests on Nov 1, Nov 2 and Nov 3]
49MINERVA Election 2004
[Diagram: captures grouped by harvest date (November 1, 2 and 3, 2004), each with Kerry, Nader and Bush]
52How websites escape from archives
- External links left live
- Internal links not parsed out of FLASH
- Internal links not parsed out of javascript
- .php (etc) files not converted to static HTML
- .js runners or applets with date() functions not
disabled
53Sealing the archive
- What Crawlers Can Do
- leave external links live? Or create custom 404s?
- rewrite internal links to relative links
- repair producer-generated relative links
- rewrite dynamic extensions e.g. .php to .html
- successfully parse out javascript, FLASH URLs
54Sealing the Archive
- What Viewer Applications can do
- PANDAS
- METS Viewer
57PANDORA Treatment of External Links
<h1>External Links to African Websites</h1>
<p><b>African News links</b>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/news.html"><br> Latest African news</a><br>
<a href="/external.html?link=kahn.interaccess.com/intelweb/africa.html">More African news sources</a></p>
<p><b>General comprehensive resource links on Africa </b>
<a href="/external.html?link=www.columbia.edu/cu/libraries/indiv/area/Africa/"><br> Columbia University - African Studies Internet Resources</a>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/guide.html"><br> African South of the Sahara internet resources</a><br>
<a href="/external.html?link=www.sas.upenn.edu/African_Studies/Home_Page/AFR_GIDE.html"> Electronic Guide for African Resources on the Internet - University of Pennsylvania</a><br>
<a href="/external.html?link=www.africa.com/">Africa.com</a><br>
<a href="/external.html?link=www.sourceafrica.com/">Source Africa</a><br>
<a href="/external.html?link=www.africapolicy.org/">African Policy Information Centre</a><br>
<a href="/external.html?link=www.cc.utah.edu/pks1019">University of Utah - Africa Homepage</a> <br>
<a href="/external.html?link=www.fordham.edu/halsall/africa/africasbook.html">African History Internet Sourcebook</a><br>
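The href pattern in the markup above can be imitated with a small rewriting pass. This is a rough sketch, not PANDORA's actual code: `seal_external_links` and its regular expression are hypothetical, only double-quoted href attributes are handled, and query strings are passed through without encoding.

```python
import re
from urllib.parse import urlsplit

# Assumption: hrefs are double-quoted; real-world HTML needs a proper parser.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def seal_external_links(html, site_host, gateway="/external.html?link="):
    """Route every http(s) href that leaves `site_host` through a gateway page."""
    def rewrite(match):
        parts = urlsplit(match.group(1))
        if parts.scheme not in ("http", "https") or parts.netloc == site_host:
            # Relative, internal, or non-web link: leave untouched.
            return match.group(0)
        target = parts.netloc + parts.path
        if parts.query:
            target += "?" + parts.query
        return f'href="{gateway}{target}"'
    return HREF_RE.sub(rewrite, html)


page = '<a href="http://www.africa.com/">Africa.com</a> <a href="home.htm">home</a>'
sealed = seal_external_links(page, "www.apgawomen.org")
print(sealed)
```

The gateway page can then decide whether to warn the user, show a custom 404, or pass the request on to the live web.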
58METS Viewer
59METS Viewer External Links
60METS Strengths / Websites
- Suitability as SIP, AIP, DIP
- ease of conversion between Packages (XSLT)
- Cross-referencing of structMap and structLink
- easy implementation using bottom line (XSLT)
- Open Source Community Support
- Implementations from XSLT to Java Apps
- Emergence of Profiles for Interoperability
- Harmonization with LO Metadata schemas
61Repositories examining use of METS to manage web
materials
- OCLC Digital Archive
- MIT CWSpace (DSpace, OCW)
- Ultimately decided upon IMS-CP
- NDIIPP The Web at Risk Partnership
- CDL Digital Preservation Repository
- NYU may use DSpace
- UNT may use CDL repository instance
62CDL Digital Preservation Repository
- METS-ready at HTML web page level
- In process of defining full Web Archive Data Model (WADO)
- And the metadata to facilitate ingest, retention and interchange
- Will support METS at website level
63DSpace
- Can ingest web material at HTML page level
- Can bundle all the resources for a page
- METS Exporter structMap-less as yet
- Will support METS import, archiving and export at
website level (?)
64Links
- http://dlibdev.nyu.edu:8083/xmldev/servlet/SaxonServlet?source=nigerian-root.xml&style=modspage4.xsl
- http://dlibdev.nyu.edu:8083/xmldev/servlet/SaxonServlet?source=apgawomen-root.xml&style=modspage3.xsl
- http://dlibdev.nyu.edu:8083/xmldev/servlet/frames/apgawomen20040202-dspace.xml
65For More Information
- Political Communications Web Archive Project
- http://www.crl.edu/content/PolitWeb.htm
- NDIIPP The Web at Risk Partnership
- http://www.digitalpreservation.gov/about/pr_093004.html
- IIPC
- http://netpreserve.org/about/index.php
- Heritrix Crawler
- http://crawler.archive.org/
66Contact to Chat More
- leslie.myrick_at_nyu.edu
- Leslie Myrick
- NYU Digital Library Team