Persistently identifying Web site content - PowerPoint PPT Presentation

Slides: 21

Provided by: andyp74

Transcript and Presenter's Notes

1
Persistently identifying Web site content

Future-proofing Institutional Web sites
DCC and Wellcome Library workshop

2
Contents

context
functional requirements
issues raised
practical suggestions
note: not going to look at any particular solutions in any detail (PURLs, DOIs, Handles, ARKs, ...)

3
Context - institutional Web sites

institutional Web sites are
heterogeneous - i.e. wide variety of content, managed/unmanaged, formal/informal
primarily accessed via mainstream Web browsers - but that may change over time
dynamic - i.e. content is regularly added (and changed and removed!)
closely tied to the institution - and institutions are liable to change!

4
Context - man vs. machine

identifiers serve a human and a machine/software purpose
person: "here's one I found earlier", e.g. using del.icio.us or Connotea
machine: "is this the same as that?"
worth remembering that machines tend to be fairly stupid
e.g. if some people use the PURL and some use the corresponding URL, then del.icio.us won't spot that their entries are about the same thing
in most cases, being able to resolve the identifier is helpful to both people and machines
in most cases, the longer an identifier lasts, the better - even after the resolution service breaks!

5
Context - what is being identified

the most important question in any discussion about identifiers is "what is being identified?"
in the case of institutional Web sites
the site
significant parts of the site
static documents, individual images, etc.
dynamic services
some possibility for confusion here
e.g. what does http://www.bris.ac.uk/ identify?
but in the case of institutional Web sites, people usually do the right thing, and what is being identified is obvious from the context

6
Context - works vs. manifestations

one key aspect is whether the identifier is for an abstract "work" or a particular "manifestation" of that work
there are some scenarios in which it is necessary to identify the work
e.g. "Crystal Studio is a recommended resource for the teaching of crystallography at undergraduate level."
in other cases, it is necessary to identify a particular manifestation of the work
e.g. "To perform this exercise you will need a copy of Crystal Studio version 5.0 (versions 4.0 Lite and 4.0 Professional do not support the required options)."
beginning to see this problem in the development of eprint archives and institutional repositories

7
Functional requirements

the JISC IE technical standards document says:

"Every significant item that is made available through a JISC IE network service should be assigned a URI that is reasonably persistent. This means that item URIs should not be expected to break for a period of 10-15 years after they have first been used. For this reason, JISC IE service components should not hardcode file format, server technology, service organisational structure or other information that is likely to change over a 10-15 year period into item URIs. If items become unavailable during that period, then the URI should resolve to a Web page that explains why the item is no longer available and what actions the end-user can take to obtain a copy of the item or similar resources. Furthermore, item URIs should not contain end-user-specific information, i.e. all item URIs should work for all end-users (albeit allowing for appropriate authentication challenges to be inserted into the process by which the URI is resolved)."
http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/
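
One way to read the "no longer available" requirement is as a tombstone page served at the original URI. The sketch below is illustrative only: the item paths, the registry layout and the choice of HTTP status 410 (Gone) are assumptions, not something the JISC IE document specifies.

```python
from http import HTTPStatus

# Hypothetical item registry: URI path -> record. Withdrawn items keep their
# entry so the URI still resolves to an explanation rather than a bare 404.
ITEMS = {
    "/items/annual-report": {"available": True, "body": "Annual report ..."},
    "/items/old-survey": {
        "available": False,
        "reason": "withdrawn at the owner's request",
        "alternative": "contact the library for a copy",
    },
}


def resolve(path):
    """Return (status, body) for an item URI path."""
    record = ITEMS.get(path)
    if record is None:
        # The URI was never assigned, so there is nothing to explain.
        return HTTPStatus.NOT_FOUND, "No item has been assigned this URI."
    if record["available"]:
        return HTTPStatus.OK, record["body"]
    # Assumption: 410 Gone plus an explanatory page; the quoted text only
    # requires that the URI resolves to a page explaining what happened.
    body = (
        f"This item is no longer available: {record['reason']}. "
        f"To obtain a copy or a similar resource: {record['alternative']}."
    )
    return HTTPStatus.GONE, body
```

The point of the design is that withdrawing an item changes its record, not its URI, so existing citations keep leading somewhere useful.
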
8
What should be identified?

"every significant item"
what does that mean?
every resource that people are likely to want to cite persistently?
there might be stuff on institutional Web sites that we don't need to cite persistently
but often difficult to pre-judge what is significant and what isn't
and judgements about significance and required level of persistence may come from outside the institution

9
What does "reasonably persistent" mean?

notion of persistence is application dependent
perhaps helpful to think about a 15-20 year timeframe?
longer than the Web has been around to date
solutions for a 20 year period may well last longer
"forever" is too long
what will have changed in 20 years' time?
technology - HTML replaced? HTTP replaced? DNS replaced? URI system replaced?
organisations - mergers, closures, new institutions, new government departments, etc.
people - deaths, retirements, etc.
countries!

10
What does "break" mean?

what does it mean for an identifier to break?
need to differentiate between the breakage of services on the identifier and breakage of the identifier itself
most obvious services on identifiers are resolution services
"give me a representation of the identified thing"
known as "dereferencing" in W3C documentation
resolution services can break (by design or by accident) but the identifier may live on and remain useful
the identifier itself only breaks when all parties (including software systems) have forgotten what it identified, or when parties no longer agree about what it identifies (e.g. if it gets re-assigned)

11
Usability issues

the only good long-term identifier is a good short-term identifier
unless identifiers work well now, they won't turn into persistent identifiers, because they won't be used at all
what does "work well" mean (particularly in the context of institutional Web sites)?
conformant with current Internet standards
usable in Web browsers (without additional plug-ins - i.e. usable by everyone)
meaningful to people
resolvable
simple to assign and maintain
low cost (in terms of money and time)

12
Interim conclusions

identifiers for content on institutional Web sites should be URIs
why? because the URI is the global and unambiguous standard for identifiers on the Internet
http URIs are better than any other form of URI
why? because they work in current Internet tools, particularly Web browsers
built-in resolution mechanism
easy to assign and low-cost (typically!)

13
http URI problems?

but http URIs tend to break, don't they?
note: usually it is the resolution service that breaks (i.e. they stop working as locators) - this doesn't necessarily imply that they stop functioning as identifiers, though the two may be closely related
reasons for fragility of http URI resolution examined later
but poor design and lack of commitment often to blame
not necessarily the case that one can apply generic Internet-wide findings about http URI breakage to institutional Web sites
attempts at more persistent forms of identifier often based on moving away from direct ties to HTTP and/or introducing a level of indirection

14
How indirection works (or not?)

populate resolution service tables with identifier -> locator mappings (and possibly other metadata)
DOI: 10.1000/182 -> http://www.doi.org/hb.html
Handle: 4263537/4002 -> http://www.handle.net/documentation.html
ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131 -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171
PURL: http://purl.org/net/ukoln -> http://www.ukoln.ac.uk/
typically used as the basis for HTTP redirects, e.g.
http://dx.doi.org/10.1000/182 -> http://www.doi.org/hb.html
http://hdl.handle.net/4263537/4002 -> http://www.handle.net/documentation.html
etc.
helps to ensure persistence, but
HTTP redirects not handled very well by browsers - end-user is typically left using the non-persistent URI
need commitment to maintain resolver services and tables
introduces a second (at least) identifier for each resource
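
The table-plus-redirect mechanism above can be sketched in a few lines. This is a minimal illustration, not how any particular resolver is implemented: the function names are hypothetical, DOIs and Handles share one table purely for brevity, and the 302 status is an assumption because the slide does not say which redirect status is used.

```python
# Identifier -> locator mappings, taken from the examples on this slide.
RESOLVER_TABLE = {
    "10.1000/182": "http://www.doi.org/hb.html",
    "4263537/4002": "http://www.handle.net/documentation.html",
}


def resolve(identifier, table=RESOLVER_TABLE):
    """Return (status, headers) as an HTTP resolver might."""
    locator = table.get(identifier)
    if locator is None:
        return 404, {}
    # Assumption: a 302 redirect pointing at the current locator.
    return 302, {"Location": locator}


def relocate(identifier, new_locator, table=RESOLVER_TABLE):
    """Record a move: only the table entry changes, never the identifier."""
    table[identifier] = new_locator
```

When a resource moves, `relocate` updates the mapping and existing citations of the identifier keep working, which is the persistence benefit. The costs listed above are visible too: someone has to keep the table current, and the `Location` value is a second URI for the same resource.
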

15
What about uniqueness?

the same identifier should not be assigned to more than one resource
a resource may have more than one identifier assigned to it, but this should be avoided as far as possible
e.g. the DOI "10.1000/182" can be encoded as a URI in several ways
http://dx.doi.org/10.1000/182, doi:10.1000/182, urn:doi:10.1000/182 and info:doi/10.1000/182
therefore, DOI-aware applications need to have knowledge of these encodings hard-coded into them (partly because the DOI itself is just a string, but also because nothing in the URI specification indicates that the URI encodings are equivalent)
though within a domain this may become the norm (e.g. Google Scholar, Crossref, Connotea, etc.)
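
To make "hard-coded knowledge of these encodings" concrete, here is a rough sketch of what a DOI-aware application ends up carrying around. The function names are hypothetical, and the sketch deliberately ignores case-folding and percent-encoding.

```python
# Prefixes for the four URI encodings listed on this slide.
DOI_URI_PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)


def bare_doi(uri):
    """Return the DOI string inside a known URI encoding, or None."""
    for prefix in DOI_URI_PREFIXES:
        if uri.startswith(prefix):
            return uri[len(prefix):]
    return None


def same_doi(a, b):
    """True if both URIs are known encodings of the same DOI."""
    doi_a, doi_b = bare_doi(a), bare_doi(b)
    return doi_a is not None and doi_a == doi_b
```

A generic tool that compares URIs as plain strings has no such prefix list, so it treats the four encodings as four unrelated identifiers.
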

16
ARK system

ARKs are worthy of note since they are http URIs
and therefore meet many of the usability requirements outlined earlier
ARKs clearly flag an institutional commitment to persistence
the identifier owner (often the resource owner) commits to maintaining ARK services and associated metadata
no reliance on third-party resolver
but they suffer from the HTTP redirect problem and ultimately may lead to multiple URIs being assigned to a single resource

17
Anatomy of http URIs

example URIs: http://www.somewhere.ac.uk/physics/index.cfm?name=about and http://www.somewhere.ac.uk/chemistry/report.rtf
http URI scheme - URI persistence not reliant on HTTP protocol, but is reliant on continued registration and management of the scheme (and of the URI spec. itself!)
DNS domain name - persistence reliant on continued ownership and management of the DNS domain name (and the DNS!)
Component hierarchy, often organisationally based - persistence reliant on continued management of component structure, i.e. not re-using old components
Server technology - change of technology may enforce change of URI, leading to multiple URIs for same resource (with no simple mechanism for determining equivalence)
File format - inappropriate if identifier is for the "work" rather than the "manifestation", because changing the format will result in a new URI

18
Improving persistence of http URIs

choose long-lived DNS domain names, e.g. try to avoid details of internal organisational structure
partition URI components by function rather than by organisational structure - because structure is likely to change
avoid exposing Web server technology in URIs (Cold Fusion, PHP, etc.) - to allow changes to technology without URI proliferation and resolver breakage
avoid embedding details of document format into URIs, unless a particular manifestation is being identified
avoid embedding end-user or session information into URIs, so that they can be shared between people
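
Some of these guidelines can be checked mechanically. The sketch below is a rough lint over a single URI; the extension and parameter lists are illustrative assumptions rather than a recommended set, and the function name is hypothetical.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

# Illustrative lists only; a real site would choose its own.
TECHNOLOGY_EXTENSIONS = {".cfm", ".php", ".asp", ".jsp"}
FORMAT_EXTENSIONS = {".rtf", ".doc", ".pdf", ".html"}
SESSION_PARAMETERS = {"sessionid", "sid", "phpsessid", "user"}


def persistence_warnings(uri):
    """Return warnings about features that may make an http URI fragile."""
    parts = urlsplit(uri)
    warnings = []
    # The extension of the last path segment hints at technology or format.
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension in TECHNOLOGY_EXTENSIONS:
        warnings.append(f"exposes server technology ({extension})")
    if extension in FORMAT_EXTENSIONS:
        warnings.append(f"embeds document format ({extension})")
    # Query parameter names that look like per-user or per-session state.
    query_names = {name.lower() for name in parse_qs(parts.query)}
    if query_names & SESSION_PARAMETERS:
        warnings.append("carries end-user or session information")
    return warnings
```

A format warning is only a prompt for judgement, since embedding the format is fine when a particular manifestation is being identified. Whether path components follow function or organisational structure cannot be detected this way and still needs a human decision.
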

19
Conclusions and recommendations

persistent identifiers require persistent commitment from the institution (and third-parties)
need to determine what "persistent" means in practice (on the basis that "forever" is unrealistic)
http URIs can be made more persistent if they are constructed and managed sensibly
use of DOIs/Handles/ARKs/PURLs may be appropriate (particularly where domain practice is clear)
but need to be clear about cost/benefits and institutional and third-party commitment to maintaining resolver tables and associated services
where these are used, always and only use the http form of URI (e.g. http://dx.doi.org/10.1000/182)