Title: Persistently identifying Web site content
1 Persistently identifying Web site content
- Future-proofing Institutional Web sites: DCC and Wellcome Library workshop
2 Contents
- context
- functional requirements
- issues raised
- practical suggestions
- note: not going to look at any particular solutions in any detail (PURLs, DOIs, Handles, ARKs, ...)
3 Context: institutional Web sites
- institutional Web sites are:
- heterogeneous, i.e. wide variety of content, managed/unmanaged, formal/informal
- primarily accessed via mainstream Web browsers, but that may change over time
- dynamic, i.e. content is regularly added (and changed and removed!)
- closely tied to the institution, and institutions are liable to change!
4 Context: man vs. machine
- identifiers serve a human and machine/software purpose
- person: "here's one I found earlier", e.g. using del.icio.us or Connotea
- machine: "is this the same as that?"
- worth remembering that machines tend to be fairly stupid
- e.g. if some people use the PURL and some use the corresponding URL, then del.icio.us won't spot that their entries are about the same thing
- in most cases, being able to resolve the identifier is helpful to both people and machines
- in most cases, the longer an identifier lasts, the better, even after the resolution service breaks!
5 Context: what is being identified
- the most important question in any discussion about identifiers is "what is being identified?"
- in the case of institutional Web sites:
- the site
- significant parts of the site
- static documents, individual images, etc.
- dynamic services
- ...
- some possibility for confusion here
- e.g. what does http://www.bris.ac.uk/ identify?
- but in the case of institutional Web sites, people usually do the right thing and what is being identified is obvious from the context
6 Context: works vs. manifestations
- one key aspect is whether the identifier is for an abstract "work" or a particular "manifestation" of that work
- there are some scenarios in which it is necessary to identify the work
- in other cases, it is necessary to identify a particular manifestation of the work
- beginning to see this problem in the development of eprint archives and institutional repositories
"Crystal Studio is a recommended resource for the
teaching of crystallography at undergraduate
level."
"To perform this exercise you will need a copy of
Crystal Studio version 5.0 (versions 4.0 Lite and
4.0 Professional do not support the required
options)."
7 Functional requirements
- the JISC IE technical standards document says:
Every significant item that is made available
through a JISC IE network service should be
assigned a URI that is reasonably persistent.
This means that item URIs should not be expected
to break for a period of 10-15 years after they
have first been used. For this reason, JISC IE
service components should not hardcode file
format, server technology, service organisational
structure or other information that is likely to
change over a 10-15 year period into item URIs.
If items become unavailable during that period,
then the URI should resolve to a Web page that
explains why the item is no longer available and
what actions the end-user can take to obtain a
copy of the item or similar resources.
Furthermore, item URIs should not contain
end-user-specific information, i.e. all item URIs
should work for all end-users (albeit allowing
for appropriate authentication challenges to be
inserted into the process by which the URI is
resolved).
http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/
8 What should be identified?
- "every significant item"
- what does that mean?
- every resource that people are likely to want to cite persistently?
- there might be stuff on institutional Web sites that we don't need to cite persistently
- but often difficult to pre-judge what is significant and what isn't
- and judgements about significance and required level of persistence may come from outside the institution
9 What does "reasonably persistent" mean?
- notion of persistence is application dependent
- perhaps helpful to think about a 15-20 year timeframe?
- longer than the Web has been around to date
- solutions for a 20 year period may well last longer
- "forever" is too long
- what will have changed in 20 years' time?
- technology: HTML replaced? HTTP replaced? DNS replaced? URI system replaced?
- organisations: mergers, closures, new institutions, new government departments, etc.
- people: deaths, retirements, etc.
- countries!
10 What does "break" mean?
- what does it mean for an identifier to break?
- need to differentiate between the breakage of services on the identifier and breakage of the identifier itself
- most obvious services on identifiers are resolution services
- "give me a representation of the identified thing"
- known as "dereferencing" in W3C documentation
- resolution services can break (by design or by accident) but the identifier may live on and remain useful
- the identifier itself only breaks when all parties (including software systems) have forgotten what it identified, or when parties no longer agree about what it identifies (e.g. if it gets re-assigned)
11 Usability issues
- the only good long-term identifier is a good short-term identifier
- unless identifiers work well now, they won't turn into persistent identifiers, because they won't be used at all
- what does "work well" mean (particularly in the context of institutional Web sites)?
- conformant with current Internet standards
- usable in Web browsers (without additional plug-ins, i.e. usable by everyone)
- meaningful to people
- resolvable
- simple to assign and maintain
- low cost (in terms of money and time)
12 Interim conclusions
- identifiers for content on institutional Web sites should be URIs
- why? because the URI is the global and unambiguous standard for identifiers on the Internet
- http URIs are better than any other form of URI
- why? because they work in current Internet tools, particularly Web browsers
- built-in resolution mechanism
- easy to assign and low-cost (typically!)
13 http URI problems?
- but http URIs tend to break, don't they?
- note: usually it is the resolution service that breaks (i.e. they stop working as locators); this doesn't necessarily imply that they stop functioning as identifiers, though the two may be closely related
- reasons for fragility of http URI resolution examined later
- but poor design and lack of commitment are often to blame
- not necessarily the case that one can apply generic Internet-wide findings about http URI breakage to institutional Web sites
- attempts at more persistent forms of identifier are often based on moving away from direct ties to HTTP and/or introducing a level of indirection
14 How indirection works (or not?)
- populate resolution service tables with identifier -> locator mappings (and possibly other metadata)
- DOI: 10.1000/182 -> http://www.doi.org/hb.html
- Handle: 4263537/4002 -> http://www.handle.net/documentation.html
- ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131 -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171
- PURL: http://purl.org/net/ukoln -> http://www.ukoln.ac.uk/
- typically used as the basis for HTTP redirects, e.g.
- http://dx.doi.org/10.1000/182 -> http://www.doi.org/hb.html
- http://hdl.handle.net/4263537/4002 -> http://www.handle.net/documentation.html
- etc.
- helps to ensure persistence but:
- HTTP redirects not handled very well by browsers
- end-user is typically left using the non-persistent URI?
- need commitment to maintain resolver services and tables
- introduces a second (at least) identifier for each resource
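A minimal sketch of the indirection described on this slide, assuming an in-memory table and a 302 redirect. The names RESOLVER_TABLE, resolve and move are hypothetical and are not taken from any real DOI, Handle, ARK or PURL software; the replacement locator in the usage note is also made up.

```python
# Hypothetical resolver table: persistent identifier -> current locator.
# The two entries reuse the DOI and Handle examples from the slide.
RESOLVER_TABLE = {
    "10.1000/182": "http://www.doi.org/hb.html",
    "4263537/4002": "http://www.handle.net/documentation.html",
}


def resolve(identifier, table=RESOLVER_TABLE):
    """Return an HTTP-style (status, headers) pair for an identifier.

    A known identifier yields a redirect to its current locator;
    an unknown one yields 404. The choice of 302 is an assumption.
    """
    locator = table.get(identifier)
    if locator is None:
        return 404, {}
    return 302, {"Location": locator}


def move(identifier, new_locator, table):
    """Record that a resource has moved; the identifier itself is unchanged."""
    table[identifier] = new_locator
```

When a resource moves, only the table entry changes, e.g. move("10.1000/182", "http://example.org/new-home.html", table). The identifier that people have cited stays the same, which is the whole point of the indirection, but someone has to keep the table up to date.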
15 What about uniqueness?
- the same identifier should not be assigned to more than one resource
- a resource may have more than one identifier assigned to it, but this should be avoided as far as possible
- e.g. the DOI 10.1000/182 can be encoded as a URI in several ways
- http://dx.doi.org/10.1000/182, doi:10.1000/182, urn:doi:10.1000/182 and info:doi/10.1000/182
- therefore, DOI-aware applications need to have knowledge of these encodings hard-coded into them (partly because the DOI itself is just a string, but also because nothing in the URI specification indicates that the URI encodings are equivalent)
- though within a domain this may become the norm (e.g. Google Scholar, Crossref, Connotea, etc.)
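What "hard-coded knowledge of these encodings" might look like can be sketched as follows. This is an illustration only: the prefix list covers just the four encodings named on this slide, extract_doi and same_doi are hypothetical names, and percent-encoding is ignored.

```python
# The four URI encodings of a DOI listed on the slide. Nothing in the URI
# specification says these are equivalent, so the list has to be hard-coded.
DOI_PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)


def extract_doi(uri):
    """Return the DOI string carried by a URI, or None if no known prefix matches."""
    lowered = uri.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            # DOI names are treated as case-insensitive here.
            return lowered[len(prefix):]
    return None


def same_doi(uri_a, uri_b):
    """True if both URIs are known encodings of the same DOI."""
    doi_a = extract_doi(uri_a)
    return doi_a is not None and doi_a == extract_doi(uri_b)
```

A plain string comparison treats http://dx.doi.org/10.1000/182 and info:doi/10.1000/182 as different identifiers; same_doi treats them as equal, but only because the prefixes were listed by hand.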
16 ARK system
- ARKs are worthy of note since they are http URIs
- and therefore meet many of the usability requirements outlined earlier
- ARKs clearly flag an institutional commitment to persistence
- the identifier owner (often the resource owner) commits to maintaining ARK services and associated metadata
- no reliance on third-party resolver
- but they suffer from the HTTP redirect problem
- and ultimately may lead to multiple URIs being
assigned to a single resource
17 Anatomy of http URIs
http://www.somewhere.ac.uk/physics/index.cfm?name=about
http://www.somewhere.ac.uk/chemistry/report.rtf
- http URI scheme: URI persistence not reliant on HTTP protocol, but is reliant on continued registration and management of the scheme (and of the URI spec. itself!)
- DNS domain name: persistence reliant on continued ownership and management of the DNS domain name (and the DNS!)
- Component hierarchy, often organisationally based: persistence reliant on continued management of component structure, i.e. not re-using old components
- Server technology: change of technology may enforce change of URI, leading to multiple URIs for same resource (with no simple mechanism for determining equivalence)
- File format: inappropriate if identifier is for the "work" rather than the "manifestation", because changing the format will result in a new URI
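The parts labelled on this slide can be pulled out of a URI with Python's standard urllib.parse module. The sketch below is illustrative: the anatomy function and its dictionary keys are hypothetical names, and the query string of the first example URI is assumed to read name=about.

```python
import posixpath
from urllib.parse import urlsplit


def anatomy(uri):
    """Split an http URI into the parts discussed on the slide."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    leaf = segments[-1] if segments else ""
    extension = posixpath.splitext(leaf)[1].lstrip(".")
    return {
        "scheme": parts.scheme,        # the http URI scheme
        "domain": parts.netloc,        # the DNS domain name
        "hierarchy": segments[:-1],    # component hierarchy, often organisational
        "extension": extension,        # may expose server technology or file format
        "query": parts.query,
    }


first = anatomy("http://www.somewhere.ac.uk/physics/index.cfm?name=about")
second = anatomy("http://www.somewhere.ac.uk/chemistry/report.rtf")
```

Here first["extension"] is "cfm" (a server technology) and second["extension"] is "rtf" (a file format), while the "hierarchy" values, ["physics"] and ["chemistry"], reflect departmental structure.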
18 Improving persistence of http URIs
- choose long-lived DNS domain names, e.g. try to avoid details of internal organisational structure
- partition URI components by function rather than by organisational structure, because structure is likely to change
- avoid exposing Web server technology in URIs (Cold Fusion, PHP, etc.), to allow changes to technology without URI proliferation and resolver breakage
- avoid embedding details of document format into URIs, unless a particular manifestation is being identified
- avoid embedding end-user or session information into URIs, so that they can be shared between people
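Some of these guidelines can be checked mechanically. The following is a rough sketch, not a complete rule set: the extension and query-key lists are assumptions chosen for illustration, persistence_warnings is a hypothetical name, and the guidelines about domain names and organisational structure need human judgement and are not covered.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit

# Illustrative lists only; a real site would maintain its own.
TECH_EXTENSIONS = {"cfm", "php", "asp", "jsp", "cgi"}
FORMAT_EXTENSIONS = {"rtf", "doc", "pdf", "htm", "html"}
SESSION_KEYS = {"sid", "sessionid", "phpsessid", "jsessionid", "user"}


def persistence_warnings(uri):
    """Return a list of reasons why a URI may not persist well."""
    parts = urlsplit(uri)
    ext = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    warnings = []
    if ext in TECH_EXTENSIONS:
        warnings.append("exposes server technology: ." + ext)
    elif ext in FORMAT_EXTENSIONS:
        warnings.append("embeds document format: ." + ext)
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        if key.lower() in SESSION_KEYS:
            warnings.append("embeds end-user or session information: " + key)
    return warnings
```

A format warning is only a prompt to think: embedding the format is fine when the URI is meant to identify a particular manifestation rather than the work.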
19 Conclusions and recommendations
- persistent identifiers require persistent commitment from the institution (and third-parties)
- need to determine what "persistent" means in practice (on the basis that "forever" is unrealistic)
- http URIs can be made more persistent if they are constructed and managed sensibly
- use of DOIs/Handles/ARKs/PURLs may be appropriate (particularly where domain practice is clear)
- but need to be clear about cost/benefits and institutional and third-party commitment to maintaining resolver tables and associated services
- where these are used, always and only use the http form of URI (e.g. http://dx.doi.org/10.1000/182)
20 Questions