Persistently identifying Web site content - PowerPoint PPT Presentation

Provided by: andyp74

Transcript and Presenter's Notes

Title: Persistently identifying Web site content


1
Persistently identifying Web site content
  • Future-proofing Institutional Web sites: DCC and
    Wellcome Library workshop

2
Contents
  • context
  • functional requirements
  • issues raised
  • practical suggestions
  • note: not going to look at any particular
    solutions in any detail (PURLs, DOIs, Handles,
    ARKs, ...)

3
Context - institutional Web sites
  • institutional Web sites are:
  • heterogeneous, i.e. wide variety of content,
    managed/unmanaged, formal/informal
  • primarily accessed via mainstream Web browsers,
    but that may change over time
  • dynamic, i.e. content is regularly added (and
    changed and removed!)
  • closely tied to the institution, and
    institutions are liable to change!

4
Context - man vs. machine
  • identifiers serve a human and machine/software
    purpose
  • person: "here's one I found earlier", e.g. using
    del.icio.us or Connotea
  • machine: "is this the same as that?"
  • worth remembering that machines tend to be fairly
    stupid
  • e.g. if some people use the PURL and some use the
    corresponding URL, then del.icio.us won't spot
    that their entries are about the same thing
  • in most cases, being able to resolve the
    identifier is helpful to both people and machines
  • in most cases, the longer an identifier lasts,
    the better, even after the resolution service
    breaks!

5
Context - what is being identified
  • the most important question in any discussion
    about identifiers is "what is being identified?"
  • in the case of institutional Web sites:
  • the site
  • significant parts of the site
  • static documents, individual images, etc.
  • dynamic services
  • some possibility for confusion here
  • e.g. what does http://www.bris.ac.uk/ identify?
  • but in the case of institutional Web sites,
    people usually do the right thing and what is
    being identified is obvious from the context

6
Context - works vs. manifestations
  • one key aspect is whether the identifier is for
    an abstract "work" or a particular
    "manifestation" of that work
  • there are some scenarios in which it is necessary
    to identify the work
  • in other cases, it is necessary to identify a
    particular manifestation of the work
  • beginning to see this problem in the development
    of eprint archives and institutional repositories

"Crystal Studio is a recommended resource for the
teaching of crystallography at undergraduate
level."

"To perform this exercise you will need a copy of
Crystal Studio version 5.0 (versions 4.0 Lite and
4.0 Professional do not support the required
options)."

7
Functional requirements
  • the JISC IE technical standards document says:

Every significant item that is made available
through a JISC IE network service should be
assigned a URI that is reasonably persistent.
This means that item URIs should not be expected
to break for a period of 10-15 years after they
have first been used. For this reason, JISC IE
service components should not hardcode file
format, server technology, service organisational
structure or other information that is likely to
change over a 10-15 year period into item URIs.
If items become unavailable during that period,
then the URI should resolve to a Web page that
explains why the item is no longer available and
what actions the end-user can take to obtain a
copy of the item or similar resources.
Furthermore, item URIs should not contain
end-user-specific information, i.e. all item URIs
should work for all end-users (albeit allowing
for appropriate authentication challenges to be
inserted into the process by which the URI is
resolved).
http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/standards/

8
What should be identified?
  • every significant item
  • what does that mean?
  • every resource that people are likely to want to
    cite persistently?
  • there might be stuff on institutional Web sites
    that we don't need to cite persistently
  • but often difficult to pre-judge what is
    significant and what isn't
  • and judgements about significance and required
    level of persistence may come from outside the
    institution

9
What does "reasonably persistent" mean?
  • notion of persistence is application dependent
  • perhaps helpful to think about a 15-20 year
    timeframe?
  • longer than the Web has been around to date
  • solutions for 20 year period may well last longer
  • forever is too long
  • what will have changed in 20 years' time?
  • technology - HTML replaced? HTTP replaced? DNS
    replaced? URI system replaced?
  • organisations - mergers, closures, new
    institutions, new government departments, etc.
  • people - deaths, retirements, etc.
  • countries!

10
What does "break" mean?
  • what does it mean for an identifier to break?
  • need to differentiate between the breakage of
    services on the identifier and breakage of the
    identifier itself
  • most obvious services on identifiers are
    resolution services
  • "give me a representation of the identified
    thing"
  • known as "dereferencing" in W3C documentation
  • resolution services can break (by design or by
    accident) but the identifier may live on and
    remain useful
  • the identifier itself only breaks when all
    parties (including software systems) have
    forgotten what it identified, or when parties no
    longer agree about what it identifies (e.g. if it
    gets re-assigned)

11
Usability issues
  • the only good long-term identifier is a good
    short-term identifier
  • unless identifiers work well now, they won't
    turn into persistent identifiers because they
    won't be used at all
  • what does "work well" mean (particularly in the
    context of institutional Web sites)?
  • conformant with current Internet standards
  • usable in Web browsers (without additional
    plug-ins - i.e. usable by everyone)
  • meaningful to people
  • resolvable
  • simple to assign and maintain
  • low cost (in terms of money and time)

12
Interim conclusions
  • identifiers for content on institutional Web
    sites should be URIs
  • why? because the URI is the global and
    unambiguous standard for identifiers on the
    Internet
  • http URIs are better than any other form of URI
  • why? because they work in current Internet tools,
    particularly Web browsers
  • built-in resolution mechanism
  • easy to assign and low-cost (typically!)

13
http URI problems?
  • but http URIs tend to break, don't they?
  • note: usually it is the resolution service that
    breaks (i.e. they stop working as locators) -
    this doesn't necessarily imply that they stop
    functioning as identifiers, though the two may be
    closely related
  • reasons for fragility of http URI resolution
    examined later
  • but poor design and lack of commitment are often
    to blame
  • not necessarily the case that one can apply
    generic Internet-wide findings about http URI
    breakage to institutional Web sites
  • attempts at more persistent forms of identifier
    often based on moving away from direct ties to
    HTTP and/or introducing a level of indirection

14
How indirection works (or not?)
  • populate resolution service tables with
    identifier -> locator mappings (and possibly
    other metadata)
  • DOI: 10.1000/182 -> http://www.doi.org/hb.html
  • Handle: 4263537/4002 ->
    http://www.handle.net/documentation.html
  • ARK: http://ark.nlm.nih.gov/ark:/12025/pm10611131
    -> http://brain.oxfordjournals.org/cgi/content/full/123/1/171
  • PURL: http://purl.org/net/ukoln ->
    http://www.ukoln.ac.uk/
  • typically used as the basis for HTTP redirects,
    e.g.
  • http://dx.doi.org/10.1000/182 ->
    http://www.doi.org/hb.html
  • http://hdl.handle.net/4263537/4002 ->
    http://www.handle.net/documentation.html
  • etc.
  • helps to ensure persistence, but:
  • HTTP redirects not handled very well by browsers
    - end-user is typically left using the
    non-persistent URI
  • need commitment to maintain resolver services and
    tables
  • introduces a second (at least) identifier for
    each resource
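
The indirection described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any real DOI, Handle, ARK or PURL resolver is implemented: the names RESOLVER_TABLE, Response, resolve and move_resource are hypothetical, and the two table entries are simply the example mappings from the slide.

```python
from typing import NamedTuple, Optional


class Response(NamedTuple):
    """A simplified HTTP response: status code plus optional Location header."""
    status: int
    location: Optional[str]


# Hypothetical resolver table: persistent identifier -> current locator.
# The entries reuse the example mappings given on the slide.
RESOLVER_TABLE = {
    "10.1000/182": "http://www.doi.org/hb.html",
    "4263537/4002": "http://www.handle.net/documentation.html",
}


def resolve(identifier: str) -> Response:
    """Answer a lookup with a redirect to the current locator, or 404 if unknown."""
    locator = RESOLVER_TABLE.get(identifier)
    if locator is None:
        return Response(404, None)
    return Response(302, locator)


def move_resource(identifier: str, new_locator: str) -> None:
    """Record a new location; the identifier that people cite is unchanged."""
    RESOLVER_TABLE[identifier] = new_locator
```

Persistence here depends entirely on someone continuing to run resolve() and to call move_resource() whenever content moves, which is the commitment to maintaining resolver services and tables noted above.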

15
What about uniqueness?
  • the same identifier should not be assigned to
    more than one resource
  • a resource may have more than one identifier
    assigned to it but this should be avoided as far
    as possible
  • e.g. the DOI 10.1000/182 can be encoded as a
    URI in several ways:
  • http://dx.doi.org/10.1000/182, doi:10.1000/182,
    urn:doi:10.1000/182 and info:doi/10.1000/182
  • therefore, DOI-aware applications need to have
    knowledge of these encodings hard-coded into them
    (partly because the DOI itself is just a string,
    but also because nothing in the URI specification
    indicates that the URI encodings are equivalent)
  • though within a domain this may become the norm
    (e.g. Google Scholar, Crossref, Connotea, etc.)
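
As a rough illustration of what "hard-coded knowledge of these encodings" looks like in a DOI-aware application, the Python sketch below maps the four encodings from this slide onto one http form. The function names and the prefix list are assumptions made for this example; the list is not a complete or authoritative set of DOI encodings.

```python
from typing import Optional

# The encodings listed on the slide. A real application would have to
# maintain this list by hand, because nothing in the URI syntax itself
# says that these forms are equivalent.
_DOI_PREFIXES = (
    "http://dx.doi.org/",
    "urn:doi:",
    "info:doi/",
    "doi:",
)


def canonical_doi_uri(uri: str) -> Optional[str]:
    """Return the http form of a DOI URI, or None if the URI is not a known DOI encoding."""
    lowered = uri.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = uri[len(prefix):]
            return "http://dx.doi.org/" + doi
    return None


def same_doi(a: str, b: str) -> bool:
    """True if both URIs are recognised encodings of the same DOI string."""
    canonical_a = canonical_doi_uri(a)
    return canonical_a is not None and canonical_a == canonical_doi_uri(b)
```

A generic tool that compares URIs as plain strings would treat all four encodings as different identifiers, which is the uniqueness problem this slide describes.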

16
ARK system
  • ARKs are worthy of note since they are http
    URIs
  • and therefore meet many of the usability
    requirements outlined earlier
  • ARKs clearly flag an institutional commitment to
    persistence
  • the identifier owner (often the resource owner)
    commits to maintaining ARK services and
    associated metadata
  • no reliance on third-party resolver
  • but they suffer from the HTTP redirect problem
  • and ultimately may lead to multiple URIs being
    assigned to a single resource

17
Anatomy of http URIs
Example URIs:
http://www.somewhere.ac.uk/physics/index.cfm?name=about
http://www.somewhere.ac.uk/chemistry/report.rtf
  • http URI scheme: URI persistence is not reliant
    on the HTTP protocol, but is reliant on continued
    registration and management of the scheme (and of
    the URI spec. itself!)
  • DNS domain name: persistence is reliant on
    continued ownership and management of the DNS
    domain name (and the DNS!)
  • Component hierarchy, often organisationally
    based: persistence is reliant on continued
    management of the component structure, i.e. not
    re-using old components
  • Server technology: a change of technology may
    force a change of URI, leading to multiple URIs
    for the same resource (with no simple mechanism
    for determining equivalence)
  • File format: inappropriate if the identifier is
    for the work rather than the manifestation,
    because changing the format will result in a new
    URI

18
Improving persistence of http URIs
  • choose long-lived DNS domain names, e.g. try to
    avoid details of internal organisational
    structure
  • partition URI components by function rather
    than by organisational structure - because
    structure is likely to change
  • avoid exposing Web server technology in URIs
    (Cold Fusion, PHP, etc.) - to allow changes to
    technology without URI proliferation and resolver
    breakage
  • avoid embedding details of document format into
    URIs, unless particular manifestation is being
    identified
  • avoid embedding end-user or session information
    into URIs so that they can be shared between
    people

19
Conclusions and recommendations
  • persistent identifiers require persistent
    commitment from the institution (and
    third-parties)
  • need to determine what "persistent" means in
    practice (on the basis that "forever" is
    unrealistic)
  • http URIs can be made more persistent if they
    are constructed and managed sensibly
  • use of DOIs/Handles/ARKs/PURLs may be appropriate
    (particularly where domain practice is clear)
  • but need to be clear about cost/benefits and
    institutional and third-party commitment to
    maintaining resolver tables and associated
    services
  • where these are used, always and only use the
    http form of URI (e.g.
    http://dx.doi.org/10.1000/182)

20
Questions