Case Study: Using METS as a DIP to Navigate Archived Websites

1
Case Study: Using METS as a DIP to Navigate
Archived Websites
Leslie Myrick, NYU
METS Opening Day / UK, The British Library, 12 July 2004
2
Political Communications Web Archive Project
(PCWA)
  • Under auspices of CRL and Mellon
  • Participants: Cornell University, Stanford
    University, UT Austin, NYU
  • Focus: SE Asia, Sub-Saharan Africa, Latin
    America, Western Europe
  • Radical political born-digital ephemera
  • Content: Internet Archive (.arc files)

3
Today's Topics
  • Background: challenges of web archiving
  • How METS can address some of these challenges
  • How to construct a METS website object
  • How METS instances can be used to control and
    navigate website objects in an archive

4
Basic METS Recipe
  • fileSec
  • structMap
  • structLink
  • dmdSec
  • amdSec

5
Web Archiving Challenges I: Definition and
Taxonomy
  • Definition of the object "website" and its
    boundaries
  • what to do with external links? near files?
  • Complex nature of website structure
  • which structure?
  • Complex symphonic nature of a web page itself

6
<METS:fileSec>
7
File inventory
<METS:fileSec>
  <METS:fileGrp>
    <METS:file ID="FID18" MIMETYPE="text/html" ADMID="ADM1">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/" />
    </METS:file>
    <METS:file ID="FID113" MIMETYPE="text/html" ADMID="ADM2">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/officers.htm" />
    </METS:file>
    <METS:file ID="FID120" MIMETYPE="text/html" ADMID="ADM3">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/calender.htm" />
    </METS:file>
    <METS:file ID="FID154" MIMETYPE="text/html" ADMID="ADM4">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/newsarchives.htm" />
    </METS:file>
    <METS:file ID="FID1059" MIMETYPE="text/html" ADMID="ADM5">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/home.htm" />
    </METS:file>
    ...
  </METS:fileGrp>
</METS:fileSec>
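
A file inventory like the one above could be generated from a crawl listing. The following is a minimal sketch, not the project's actual tooling: the function name `build_file_sec` and the tuple layout are assumptions made for illustration, and only two of the files from the slide are included.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_file_sec(files):
    """Build a METS fileSec from (file_id, mime_type, adm_id, url) tuples.

    Hypothetical helper: the tuple layout is an assumption, not part of METS.
    """
    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    for file_id, mime_type, adm_id, url in files:
        # One <file> per harvested file, pointing at its technical metadata.
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            {"ID": file_id, "MIMETYPE": mime_type, "ADMID": adm_id},
        )
        # <FLocat> records where the file lives.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url},
        )
    return file_sec


inventory = [
    ("FID18", "text/html", "ADM1", "www.apgawomen.org/"),
    ("FID113", "text/html", "ADM2", "www.apgawomen.org/officers.htm"),
]
file_sec = build_file_sec(inventory)
print(ET.tostring(file_sec, encoding="unicode"))
```

Registering the `METS` and `xlink` prefixes keeps the serialized output close to the form shown on the slide.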
8
<METS:structMap>
9
(No Transcript)
10
<html>
<head><title>index</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head>
<body bgcolor="#000000">
<table width="100%">
<tr><td>
<div align="center"><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
  codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0"
  width="700" height="150">
<embed src="notjust.swf" quality=high
  pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"
  type="application/x-shockwave-flash" width="700" height="150">
</embed> </object></div>
</td></tr> </table>
<table width="100%"> <tr>
<td> <div align="center"><object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"
  codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0"
  width="600" height="64">
<embed src="apgawnew.swf" quality=high
  pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash"
  type="application/x-shockwave-flash" width="600" height="64">
</embed>
</object></div> </td> </tr> </table>
<p>&nbsp;</p>
<div align="center"> <table width="85%">
<tr><td> <div align="right"><a href="home.htm"><img src="enterarrow.gif"
  width="80" height="27" border="0"></a></div>
</td></tr> </table> </div>
</body> </html>
11
METS structMap view of an HTML page
  • HTML wrapper around embedded files and
    hyperlinks
  • <div> for the HTML page
  • <fptr>
  • <par>
  • <area> for the page and for each embedded parallel
    element: .css, .js, images, etc.
  • (with ID-IDREF to file ID in fileSec)
  • <div> for each (internal) hyperlinked page

12
<METS:div DMDID="DM1" TYPE="web page" ID="page18"
    LABEL="http://dlibdev.nyu.edu/webarchive/metstest/www.apgawomen.org/index.html">
  <METS:fptr>
    <METS:par>
      <METS:area FILEID="FID18"/>     index.html
      <METS:area FILEID="FID1036"/>   notjust.swf
      <METS:area FILEID="FID1043"/>   apgawnew.swf
      <METS:area FILEID="FID1075"/>   enterarrow.gif
    </METS:par>
  </METS:fptr>
  <METS:div TYPE="hyperlink" ID="LINK1" LABEL="home">
    <METS:fptr>
      <METS:area BEGIN="000" BETYPE="BYTE" END="111" FILEID="FID18"/>
    </METS:fptr>
  </METS:div>
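
The split between embedded files (the `<area>` entries inside `<par>`) and hyperlinks (the nested hyperlink `<div>`s) can be derived by scanning the page source. Below is a minimal sketch using Python's standard `html.parser`; the class name `PageScanner` and the choice of tags treated as embedded are assumptions for illustration, and the sample markup is a shortened version of the index page shown earlier.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect embedded resources and hyperlinks from one HTML page."""

    # Tag -> attribute that names an embedded file (an illustrative subset).
    EMBED_ATTRS = {"img": "src", "embed": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.embedded = []    # candidates for <area> entries inside <par>
        self.hyperlinks = []  # candidates for TYPE="hyperlink" <div>s

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hyperlinks.append(attrs["href"])
        elif tag in self.EMBED_ATTRS and attrs.get(self.EMBED_ATTRS[tag]):
            self.embedded.append(attrs[self.EMBED_ATTRS[tag]])


html = """<html><body>
<object><embed src="notjust.swf" width="700" height="150"></embed></object>
<object><embed src="apgawnew.swf" width="600" height="64"></embed></object>
<a href="home.htm"><img src="enterarrow.gif" border="0"></a>
</body></html>"""

scanner = PageScanner()
scanner.feed(html)
print(scanner.embedded)    # ['notjust.swf', 'apgawnew.swf', 'enterarrow.gif']
print(scanner.hyperlinks)  # ['home.htm']
```

Each embedded file would then be looked up in the fileSec to find its FILEID, and each hyperlink would become a hyperlink `<div>` under the page's `<div>`.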
13
METS structMap for a Website
  • Flattened logical tree hierarchy
  • <div> entry page, index.html
  • <div> each HTML page
  • <div> each hyperlink to a page internal to
    the site

14
METS DB view of site structure
15
DB View of Page Structure
16
DB View of Embedded Elements
17
<METS:structLink>
18
Mapping Hyperlink Structure
  • <div>s (via div ID) in structMap cross-referenced
    to <smLink>s in structLink
  • <METS:structLink>
  • <METS:smLink from="LINK1" to="page1059"
    xlink:title="home"/>
  • <METS:smLink from="LINK2" to="page113"
    xlink:title="officers"/>
  • <METS:smLink from="LINK3" to="page102"
    xlink:title="calendar"/>
  • </METS:structLink>
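
A viewer can turn the structLink into a lookup table from hyperlink div to target page div. The sketch below follows the attribute names as they appear on the slide (unqualified `from` and `to`); the published METS schema qualifies them as `xlink:from` and `xlink:to`, so a real instance may need that change. The function name `link_table` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"METS": "http://www.loc.gov/METS/"}
XLINK_NS = "http://www.w3.org/1999/xlink"

fragment = """
<METS:structLink xmlns:METS="http://www.loc.gov/METS/"
                 xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:smLink from="LINK1" to="page1059" xlink:title="home"/>
  <METS:smLink from="LINK2" to="page113" xlink:title="officers"/>
  <METS:smLink from="LINK3" to="page102" xlink:title="calendar"/>
</METS:structLink>
"""


def link_table(struct_link):
    """Map each hyperlink div ID to (target page div ID, link title)."""
    table = {}
    for sm_link in struct_link.findall("METS:smLink", NS):
        table[sm_link.get("from")] = (
            sm_link.get("to"),
            sm_link.get(f"{{{XLINK_NS}}}title"),
        )
    return table


links = link_table(ET.fromstring(fragment))
print(links["LINK1"])  # ('page1059', 'home')
```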

19
Web Archiving Challenges II: Extracted vs.
Human-catalogued Metadata
  • Lack of influence over content production
  • Questionable embedded metadata from producers of
    web pages, e.g. <title>, <meta> tags
  • Technical metadata is safe because it can be
    programmatically extracted from the file itself
  • Do we want to take descriptive metadata wholesale
    from <title>, <meta> tags?
  • Really?

20
(No Transcript)
21
The Case of the Purloined Metadata
22
The Case of the Purloined Metadata, continued
<snip>
<HTML>
<!-- saved from url=(0041)http://www.sport.de/spart/sk1/ski006.php3 -->
<HEAD>
<TITLE>Bienvenue sur le site de Front Social</TITLE>
<META CONTENT="text/html; charset=windows-1252" HTTP-EQUIV="Content-Type">
<META CONTENT="Sport sports Baseball Basketball
Beach-Volleyball Bob Boxen Bundesliga
Bundesligavereine Championsleague DEL DFB
DFB-Pokal Eishockey Ergebnisse Europameisterschaft
Europapokal Fernsehen Football Formel1 Formel3
Fußball Golf Hallenmasters Handball Hockey
Inline-Skating Leichtathletik Motorbike Motorrad
Motorsport Nationalmannschaft NBA NFL NHL Reiten
Rodeln Schwimmen Skifahren Skispringen Snowboard
Sportarten Sportnachrichten Surfen Tennis
Tischtennis Turniere Uefa-Cup US Open Vereine
Volleyball Wassersport WBA WBC WBO
Weltmeisterschaft Weltrangliste Wimbledon Fußball
Motorsport Radsport Volleyball Sport Eishockey
Skisport Boxen Handball Leichtathletik
Pferdesport Schwimmen" NAME="keywords">
<META CONTENT="Sport Sportnachrichten Sportvereine
Ergebnisse Tabellen Ranglisten Bundesliga DEL
Formel 1 Tennis" NAME="description">
<META CONTENT="thu, 30 mar 2000 12:00:00 GMT" HTTP-EQUIV="date">
<SCRIPT language="JavaScript" SRC="sport_fichiers/sidiscript.js">
<SCRIPT language="JavaScript">
<!--
var on = "/ima/pfeil_weiss2.gif";
var off = "/ima/pfeil_weiss.gif";
</snip>
23
Whence Web Archive Metadata?
  • Programmatically extractable metadata provided by
    crawlers
  • Found in logs, .arc and .dat files, and the files
    themselves
  • Balance to be struck between automated metadata
    extraction and human cataloguing (especially for
    descriptive metadata)

24
<METS:dmdSec>
25
Case study: Metadata from an Alexa .arc
  • Typical Alexa / IA SIP: .arc and .dat files
    along with byte offset .ndx file
  • IA .arc: 100 MB .gz archive file packed with
    files from the web crawl, along with the server's
    HTTP response headers for each file.

26
Typical IA .arc snippet
<snip>
crawler's file header:
http://www.apgawomen.org:80/calender.htm 63.241.136.203 20030417223125 text/html 2570
http headers:
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html
file itself:
<html> <head> <title>calender</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head> <body bgcolor="#FFFFFF">
</snip>
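
The three layers in the snippet above (crawler header line, HTTP response headers, file content) can be pulled apart programmatically. The following is a minimal sketch under stated assumptions: it handles a single uncompressed text record whose header line has the five space-separated fields shown on the slide, and the function name `parse_arc_record` is hypothetical. Real .arc files are gzip-compressed and hold many records.

```python
from datetime import datetime

record = """\
http://www.apgawomen.org:80/calender.htm 63.241.136.203 20030417223125 text/html 2570
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
Content-Length: 2299
Content-Type: text/html

<html> <head> <title>calender</title> </head> </html>
"""


def parse_arc_record(text):
    """Split one record into crawler fields, HTTP headers, and body."""
    head, _, body = text.partition("\n\n")
    lines = head.splitlines()
    # Crawler's file header: URL, IP address, capture timestamp, MIME type, length.
    url, ip, timestamp, mime_type, length = lines[0].split(" ")
    headers = {}
    for line in lines[2:]:  # lines[1] is the HTTP status line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {
        "url": url,
        "ip": ip,
        "captured": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "mime_type": mime_type,
        "length": int(length),
        "status": lines[1],
        "headers": headers,
        "body": body,
    }


parsed = parse_arc_record(record)
print(parsed["captured"].isoformat())       # 2003-04-17T22:31:25
print(parsed["headers"]["Last-Modified"])   # Sun, 26 Jan 2003 04:05:37 GMT
```

The capture timestamp and URL feed the descriptive metadata, while the server and content headers are candidates for technical metadata.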
27
What is extractable (dmdSec)?
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html
<html> <head> <title>calender</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head> <body bgcolor="#FFFFFF">
</snip>
28
LC Metadata Object Description Schema
29
MINERVA MODS Display
30
Top-Level MODS
<mods:mods>
  <mods:titleInfo>
    <mods:title>Website of the APGA Women</mods:title>
  </mods:titleInfo>
  <mods:genre>Web site</mods:genre>
  <mods:originInfo>
    <mods:dateCaptured encoding="iso8601">20030417</mods:dateCaptured>
  </mods:originInfo>
  <mods:language authority="iso639-2b">eng</mods:language>
  <mods:physicalDescription>
    <mods:internetMediaType>text/html</mods:internetMediaType>
    <mods:internetMediaType>image/jpg</mods:internetMediaType>
    <mods:internetMediaType>image/gif</mods:internetMediaType>
    <mods:internetMediaType>application/msword</mods:internetMediaType>
    <mods:internetMediaType>application/x-shockwave-flash</mods:internetMediaType>
  </mods:physicalDescription>
  <mods:abstract>Supports the All Progressive Grand
    Alliance political party (APGA). Information on
    the APGA presidential candidate,
    Chief Chukwuemeka Odumegwu-Ojukwu.
    Based in Kennesaw, Georgia.</mods:abstract>
  <mods:subject>
    <mods:topic>Political Parties</mods:topic>
    <mods:geographic>Africa</mods:geographic>
    <mods:geographic>Nigeria</mods:geographic>
  </mods:subject>
  <mods:relatedItem type="host">
    <mods:titleInfo>
      <mods:title>CRL Political Web Archiving Project</mods:title>
    </mods:titleInfo>
    <mods:identifier type="uri">http://www.crl.edu/content/PolitWeb.htm</mods:identifier>
  </mods:relatedItem>
  <mods:identifier displayLabel="Archived site"
    type="uri">http://dlibdev.nyu.edu/webarchive/metstest/apgawomen/20030417/www.agpawomen.org/</mods:identifier>
</mods:mods>
31
Page-Level MODS
<METS:dmdSec ID="DM1">
  <METS:mdWrap MDTYPE="MODS">
    <METS:xmlData>
      <mods:mods>
        <mods:titleInfo>
          <mods:title>officers</mods:title>
        </mods:titleInfo>
        <mods:originInfo>
          <mods:dateCaptured>20030417223125</mods:dateCaptured>
        </mods:originInfo>
        <mods:identifier type="uri">www.apgawomen.org/officers.htm</mods:identifier>
        <mods:physicalDescription>
          <mods:extent>3252</mods:extent>
        </mods:physicalDescription>
        <mods:genre>Web page</mods:genre>
      </mods:mods>
    </METS:xmlData>
  </METS:mdWrap>
</METS:dmdSec>
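
Every value in the page-level record above is machine-extractable (page title, capture timestamp, URL, byte count), so such records can be generated rather than catalogued by hand. A minimal sketch follows; the helper name `page_mods` and its parameters are assumptions for illustration, and the METS `dmdSec` wrapper is omitted for brevity.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def page_mods(title, captured, uri, extent):
    """Build a page-level MODS record from extracted values (hypothetical helper)."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title
    origin_info = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin_info, q("dateCaptured")).text = captured
    ET.SubElement(mods, q("identifier"), {"type": "uri"}).text = uri
    physical = ET.SubElement(mods, q("physicalDescription"))
    ET.SubElement(physical, q("extent")).text = str(extent)
    ET.SubElement(mods, q("genre")).text = "Web page"
    return mods


record = page_mods(
    title="officers",
    captured="20030417223125",
    uri="www.apgawomen.org/officers.htm",
    extent=3252,
)
print(ET.tostring(record, encoding="unicode"))
```

The title here is taken straight from the page's `<title>` tag, which is exactly the kind of value the earlier slides suggest reviewing before trusting.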
32
<METS:amdSec> <METS:techMD>
33
Technical Metadata Sources (.arc)
  • Crawler frontier application
  • metadata about the harvest itself, the archive
    file
  • Host server's HTTP response headers
  • metadata about the host server, files
  • Captured files themselves
  • file headers; IPTC headers (human input)
  • Post-processing with ImageMagick etc.

34
ImageMagick dump for Mao1925.jpg
  • Image: Mao1925.jpg
  • Format: JPEG (Joint Photographic Experts Group
    JFIF format)
  • Geometry: 142x185
  • Class: DirectClass
  • Type: true color
  • Depth: 8 bits-per-pixel component
  • Colors: 11423
  • Resolution: 300x300 pixels
  • Filesize: 8115b
  • Interlace: Plane
  • Background Color: grey100
  • Border Color: #DFDFDF
  • Matte Color: grey74
  • Iterations: 0
  • Compression: JPEG
  • signature:
    8c173bd33c3e5667d27e51aee539afcd58ccbc8d4a11ab76b127408905f598fd
  • Tainted: False

35
NISO Metadata for Images in XML Schema (MIX)
36
<mix:mix>
  <mix:BasicImageParameters>
    <mix:Format>
      <mix:MIMEType>image/jpeg</mix:MIMEType>
      <mix:ByteOrder>little-endian</mix:ByteOrder>
      <mix:Compression>
        <mix:CompressionScheme>5</mix:CompressionScheme>
        <mix:CompressionLevel>0</mix:CompressionLevel>
      </mix:Compression>
      <mix:PhotometricInterpretation>
        <mix:ColorSpace/>
      </mix:PhotometricInterpretation>
    </mix:Format>
    <mix:File>
      <mix:ImageIdentifier>perso.magic.fr/images/Mao1925.jpg</mix:ImageIdentifier>
      <mix:FileSize>8115</mix:FileSize>
    </mix:File>
    <mix:PreferredPresentation/>
  </mix:BasicImageParameters>
  <mix:ImageCreation/>
  <mix:ImagingPerformanceAssessment>
    <mix:SpatialMetrics>
      <mix:ImageWidth>142</mix:ImageWidth>
      <mix:ImageLength>185</mix:ImageLength>
    </mix:SpatialMetrics>
    <mix:Energetics>
      <mix:BitsPerSample>8</mix:BitsPerSample>
    </mix:Energetics>
  </mix:ImagingPerformanceAssessment>
  <mix:ChangeHistory/>
</mix:mix>
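
The MIX values above correspond to fields in the ImageMagick dump for the same image. The following is a rough sketch of that mapping, assuming the dump is available as "Key: value" lines; the function names and the choice of fields are illustrative, and output from other ImageMagick versions may be laid out differently.

```python
import re

dump = """\
Image: Mao1925.jpg
Format: JPEG (Joint Photographic Experts Group JFIF format)
Geometry: 142x185
Depth: 8 bits-per-pixel component
Filesize: 8115b
"""


def parse_dump(text):
    """Turn 'Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def to_mix_values(fields):
    """Pick out the values that feed MIX elements (an illustrative subset)."""
    width, length = fields["Geometry"].split("x")
    return {
        "ImageWidth": int(width),
        "ImageLength": int(length),
        "BitsPerSample": int(fields["Depth"].split()[0]),
        "FileSize": int(re.sub(r"\D", "", fields["Filesize"])),
    }


mix_values = to_mix_values(parse_dump(dump))
print(mix_values)
# {'ImageWidth': 142, 'ImageLength': 185, 'BitsPerSample': 8, 'FileSize': 8115}
```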
37
Web Archiving Challenges III: Structuring and
Managing Versions
  • Version control-related storage and access issues
    in a continuous archive
  • Creator-driven changes: successive harvests and
    versions
  • Repository-driven changes: refreshing, migration,
    other changes

38
Modeling Website Objects with METS in a
Continuous Archive
  • One possibility:
  • Root level METS (web site X as intellectual
    object) with <mptr>s down to
  • Intermediary METS (web site X as harvested on
    April 17, 2003) with <mptr>s down to
  • Leaf node METS (single web page in web site X
    harvested on April 17, 2003)

39
APGA Women Websites
[Diagram: three captures of the site (April 17, 2003; December 12, 2003; February 2, 2004). Each capture contains home, about, and officers pages; one capture also contains a news page.]
40
APGA Women Websites
April 17, 2003
December 12, 2003
February 2, 2004
41
Aggregator / Single Capture Model
  • METS for top level aggregation that uses <mptr>s
    to point to either another intermediary
    aggregator or to more than one captured version
    of a web site.
  • METS for single standalone captured site, whether
    part of successive harvests or a one-off capture.

42
METS Website Aggregator
  • Contains single MODS record describing the
    aggregation as an intellectual object
  • e.g. Election 2004: JohnKerry.com (Nov 1-10)
  • Contains no amdSec, fileSec or structLink
  • Contains a root <div> for the aggregation
  • nesting <div>s with <mptr>s to each subsidiary
    aggregation or captured version
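
An aggregator of this shape is small enough to generate directly. The sketch below builds only the structMap with its root `<div>` and nested `<div>`/`<mptr>` pairs; the MODS record is left out, the function name `build_aggregator` is an assumption, and the `mets.xml` paths are hypothetical placeholders rather than locations used by the project.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_aggregator(label, children):
    """Build an aggregator METS whose divs point to subsidiary METS files."""
    mets = ET.Element(f"{{{METS_NS}}}mets")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})
    for child_label, href in children:
        child_div = ET.SubElement(
            root_div, f"{{{METS_NS}}}div", {"LABEL": child_label}
        )
        # <mptr> hands off to another METS document (aggregator or capture).
        ET.SubElement(
            child_div,
            f"{{{METS_NS}}}mptr",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
    return mets


# Hypothetical file locations, one METS document per capture date.
captures = [
    ("April 17, 2003", "apgawomen/20030417/mets.xml"),
    ("December 12, 2003", "apgawomen/20031212/mets.xml"),
    ("February 2, 2004", "apgawomen/20040202/mets.xml"),
]
aggregator = build_aggregator("APGA Women Websites", captures)
print(ET.tostring(aggregator, encoding="unicode"))
```

Because the aggregator carries no fileSec, amdSec, or structLink, adding a new harvest only means appending one more `<div>` with its `<mptr>`.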

43
MINERVA Election 2004
[Diagram: aggregation by candidate (Kerry, Bush, Nader), each with captures for Nov 1, Nov 2, and Nov 3.]
44
MINERVA Election 2004
[Diagram: aggregation by date (November 1, 2, and 3, 2004), each with captures for Kerry, Nader, and Bush.]
45
(No Transcript)
46
(No Transcript)
47
Web Archiving Challenges IV: Keeping archived
websites hermetically sealed
48
How websites escape from archives
  • External links
  • Internal links not parsed out of FLASH
  • Internal links not parsed out of javascript
  • .php files not converted to static HTML
  • .js runners or applets with date() functions

49
Sealing the archive
  • What Crawlers Can Do
  • rewrite internal links to relative links
  • repair producer-generated relative links
  • leave external links live? Or create custom 404s?
  • rewrite dynamic extensions e.g. .php to .html
  • successfully parse out javascript, FLASH URLs
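
Several of these crawler-side rewrites can be expressed as a single link-rewriting function. The sketch below is illustrative only: the function name `rewrite_link` is hypothetical, the `/external.html?link=` prefix is borrowed from the PANDORA example later in this presentation, and query strings and fragments are simply dropped. Links embedded in JavaScript or Flash are out of scope here.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_link(href, site_host, page_dir="/"):
    """Rewrite one href so the archived copy stays inside the archive."""
    parts = urlsplit(href)
    host = parts.hostname  # None for relative links
    if host and host != site_host:
        # External link: route it through a notice page instead of leaving it live.
        return "/external.html?link=" + parts.netloc + parts.path
    path = parts.path or "/"
    if path.endswith(".php"):
        # Dynamic page captured as static HTML.
        path = path[: -len(".php")] + ".html"
    if host:
        # Absolute internal link: make it relative to the current page's directory.
        return posixpath.relpath(path, start=page_dir)
    return path


site = "www.apgawomen.org"
print(rewrite_link("http://www.apgawomen.org/officers.htm", site))  # officers.htm
print(rewrite_link("http://www.africa.com/", site))  # /external.html?link=www.africa.com/
print(rewrite_link("news.php", site))                # news.html
```

A link written without a scheme, such as `www.africa.com/`, has no detectable host and is treated as relative, so a real crawler would need extra heuristics for that case.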

50
Sealing the Archive
  • What Applications can do
  • PANDAS
  • METS Viewer

51
(No Transcript)
52
(No Transcript)
53
PANDORA Treatment of External Links
<h1>External Links to African Websites</h1>
<p><b>African News links</b>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/news.html"><br> Latest African news</a><br>
<a href="/external.html?link=kahn.interaccess.com/intelweb/africa.html">More African news sources</a></p>
<p><b>General comprehensive resource links on Africa </b>
<a href="/external.html?link=www.columbia.edu/cu/libraries/indiv/area/Africa/"><br> Columbia University - African Studies Internet Resources</a>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/guide.html"><br> African South of the Sahara internet resources</a><br>
<a href="/external.html?link=www.sas.upenn.edu/African_Studies/Home_Page/AFR_GIDE.html"> Electronic Guide for African Resources on the Internet - University of Pennsylvania</a><br>
<a href="/external.html?link=www.africa.com/">Africa.com</a><br>
<a href="/external.html?link=www.sourceafrica.com/">Source Africa</a><br>
<a href="/external.html?link=www.africapolicy.org/">African Policy Information Centre</a><br>
<a href="/external.html?link=www.cc.utah.edu/pks1019">University of Utah - Africa Homepage</a> <br>
<a href="/external.html?link=www.fordham.edu/halsall/africa/africasbook.html">African History Internet Sourcebook</a><br>
54
METS Viewer
55
METS Viewer External Links