Title: Proxy data objects
1Proxy data objects
provide optional linking to external objects and
reusable data interfaces, from which Darwin- and
LinneanCore-like protocol interfaces can be
constructed
(Talk held Wednesday, 2004-10-13, at the TDWG
2004 meeting in Christchurch, New Zealand)
2(Submitted Abstract)
- Ideally, biodiversity data are expressed using
well-defined object types (with generally
accepted and intuitive property and object
composition concepts) and all objects required in
relations are available in digital form,
identifiable through resolvable globally unique
identifiers. Reality is different: a) Complex
models like ABCD, SDD, TCS etc. are under debate
and not necessarily fully stable. A consequence
is that simplified "Cores" like DarwinCore or
LinneanCore are proposed. b) Most required data
are not digitized at all. Wherever data should
naturally refer to other biodiversity domains,
alternative models are required to either link to
something external or provide a sufficient
internal object definition. Such a definition
also involves a simplified set of core elements.
- I therefore propose to combine the two problems
and define relatively simple data interfaces,
that can serve both for protocol/query purposes
and for the definition of local proxy objects
(which either link to external entities or provide
a local definition). Data interfaces shield the
complexity of a fuller model (i.e., the full model
can be treated as a black box). They should be
rough enough to fit several models, but also
detailed enough to allow the definition of proxy
data (i.e., make a substantial semantic
definition). Data interfaces are often implicitly
used in current practice: 'Specify' uses a
simplified literature and nomenclature interface,
DarwinCore contains name, identification and
geographical location interfaces, Taxon Concept
Schema contains interfaces for literature and
specimen, and Linnean Core has a literature
interface. Agreeing on a common set of such
interface concepts would make it possible to build
Darwin- and LinneanCore, as well as much of SDD,
TCS, etc., from the same building blocks, drastically
reducing the investment needed for a full global
biodiversity information system.
3Outline
- Linking and proxy objects
- Data interface definitions
- DarwinCore and LinneanCore from Interfaces
4Linking
- Organism-interaction data are primarily a 4-tuple of links to external objects
  - Organism name 1 (e.g., a fungus)
  - Organism name 2 (e.g., a plant)
  - Geographical location
  - Publication / data source reference
- plus non-linking data
  - Interaction type (controlled vocabulary)
  - Reference detail (page, document-fragment, ...)
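As a rough illustration, the 4-tuple plus its non-linking data could be modelled as a small record type. This is only a sketch: the names `Link` and `InteractionRecord`, all field names, the `urn:example:` identifiers, and the vocabulary term are hypothetical and not taken from any TDWG standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical link to an external object."""
    identifier: str  # e.g. an LSID, DOI, or URL (made-up values below)
    label: str       # human-readable text kept next to the identifier


@dataclass(frozen=True)
class InteractionRecord:
    # The four links to external objects
    organism_1: Link
    organism_2: Link
    location: Link
    source: Link
    # Non-linking data
    interaction_type: str                   # term from a controlled vocabulary
    reference_detail: Optional[str] = None  # page, document fragment, ...


record = InteractionRecord(
    organism_1=Link("urn:example:name:1", "a fungus"),
    organism_2=Link("urn:example:name:2", "a plant"),
    location=Link("urn:example:place:1", "a place"),
    source=Link("urn:example:publication:1", "a publication"),
    interaction_type="parasitism",  # illustrative vocabulary term
    reference_detail="p. 12",
)
```

Keeping a human-readable label next to each identifier is one simple way to retain some meaning if an identifier later stops resolving.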
5Linking
- Similarly, SDD descriptions need links to express
  - Taxon / class name
  - Specimen unit
  - Geographical scope
  - Publication
  - Contributing IPR agents
- Plus terminology definitions: multiple description projects sharing a common terminology
6Two kinds of links?
- Convenience links (see-also links)
  - GenBank entries for a specimen/taxon name
  - Link to illustrating image/video
  - ...
  - If the link breaks, the information content of the main document is usually left intact.
- Defining links
  - Link to taxon / specimen in a description
  - Link to cited publication
  - ...
  - If the link breaks, the information content of the main document is lost or severely damaged.
→ Recovery mechanism is highly desirable
→ Simplest: preserve some human-readable semantics
7Requirements
- A link should be
  - Stable
    - URLs are not stable
    - PURLs unmanageable? (usually just xURLs: extended-life URLs)
  - Resolvable
    - Many GUID / URN schemes are not
    - LSIDs and ... are both stable and resolvable
- Links used in primary scientific data must further
  - Offer a recovery mechanism if a link that is stable in principle vanishes nevertheless!
8Not required, but desirable
- Object identity
  - DOI defines identity, but is expensive
  - At least, it should be discoverable whether a link defines object identity or not (do multiple URLs / LSIDs exist for the same object?)
9The Problem
- For quite some time, the default will be that there is no external object
  - Either not yet
    - Digitization of specimens
    - Older literature
  - Or unreliable
    - Recent literature
- Moreover, there will permanently be objects that are temporarily not yet available
  - Science creates new
    - Specimens still in private collection
    - Taxon names / concepts to be published
    - etc.
10So, now we have
[Diagram: Link client, Linked object, Local recovery semantics]
Or
[Diagram: Linking client, Linked object (unavailable), Local replacement]
11Often also caching desirable
[Diagram: Link client, Linked object, Locally cached data, Local recovery semantics]
- Caching: linked data temporarily unavailable
- Recovery: linked data permanently unavailable
12Simplify?
[Diagram: Link client, Linked object, Proxy object holding locally cached interface data (incl. recovery)]
13A data proxy
- May link to external data providers, especially for knowledge domains outside of the scope of the current dataset
- Supports several object linking mechanisms involving globally unique identifiers and resolving mechanisms (e.g., DOI, LSID, URL)
- Can replace links in cases where objects are not (or perhaps not yet) available from an external data source
- For existing links: a minimalized data interface is cached on the assumption that access is asynchronous, slow, or may be temporarily unavailable
- For local proxies: the same data interface allows a relatively simple, local object definition to decouple processes
- Provides cached data and semantics to human readers, allowing recovery even if a link has become permanently broken
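The behaviour listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a proposed implementation: `ProxyObject`, `refresh`, the resolver callables, and the example DOI-like string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProxyObject:
    """Hypothetical proxy: optional external link plus locally held interface data."""
    interface_data: Dict[str, str]     # cached or locally defined interface fields
    external_id: Optional[str] = None  # e.g. DOI, LSID, or URL; None for a local proxy

    def is_local(self) -> bool:
        """True if this proxy replaces a link rather than caching one."""
        return self.external_id is None

    def refresh(self, resolver: Callable[[str], Dict[str, str]]) -> bool:
        """Update the cache from the external source; keep the cache on failure."""
        if self.external_id is None:
            return False
        try:
            self.interface_data = dict(resolver(self.external_id))
            return True
        except LookupError:
            return False  # link unavailable: cached data remains for recovery


def broken_resolver(identifier: str) -> Dict[str, str]:
    # Stand-in for a link that has vanished
    raise LookupError(identifier)


publication = ProxyObject(
    {"author": "A. Author", "year": "2004", "title": "An example title"},
    external_id="doi:10.0000/example",  # made-up identifier
)
publication.refresh(broken_resolver)  # fails, cached fields stay readable
local_name = ProxyObject({"full_name": "Aus bus"})  # local proxy, no link at all
```

The same class serves both cases from the list: a cached link (`external_id` set) and a purely local definition (`external_id` left empty).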
14Outline
- Linking and proxy objects
- Data interface definitions
- DarwinCore and LinneanCore from Interfaces
15Interface: good choice of term?
- Not user-interface!
- Interface is used similarly to its use in object-oriented programming
  - However, no methods in data interfaces
  - Persistence interface: properties/fields/structures like collections
- Data interfaces provide an additional abstraction layer on top of the public object model
- Data interfaces allow programming against the interface instead of against the full object model
- Is there a better term?
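To make "programming against the interface" concrete, here is a small sketch using Python's `typing.Protocol` to declare a fields-only interface. The interface name, its three fields, and both implementing classes are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


class PublicationInterface(Protocol):
    """Data interface: fields only, no methods."""
    author: str
    year: str
    title: str


@dataclass
class FullPublication:
    """Stand-in for a much richer object model (treated as a black box)."""
    author: str
    year: str
    title: str
    journal: str
    volume: str
    pages: str


@dataclass
class LocalPublicationProxy:
    """Local proxy that carries nothing but the interface fields."""
    author: str
    year: str
    title: str


def cite(pub: PublicationInterface) -> str:
    # Uses only the interface fields, so it works for either class
    return f"{pub.author} ({pub.year}): {pub.title}"


full = FullPublication("A. Author", "2004", "An example title", "J. Ex.", "1", "1-2")
proxy = LocalPublicationProxy("A. Author", "2004", "An example title")
same_citation = cite(full) == cite(proxy)
```

`cite` never touches the journal, volume, or page fields, so changes to the full model outside the interface do not affect it.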
16Complexity
- Complex schemata are certainly necessary, but are no solution for projects where a knowledge domain is of secondary relevance!
17Goals
- Abstract / encapsulate complexity
  - Reduced size
  - Cover 80% of interaction needs
  - React more flexibly when changes in the main object model occur
- Formalize (and standardize) reality
  - Most software needs peripheral objects from other knowledge domains and treats them lightly
  - TCS, SDD, ABCD/DarwinCore, LinneanCore
  - Example: Specify contains reduced name and publication data sets with editors for them, essentially proxy objects with a set of interface fields
18Abstraction layers
[Diagram: Complex object model/schema]
19Data interface requirements
- Should be stable, so applications programming against the interface do not break
- Define mapping from complex standard to interface
- Where the interface is used as a proxy (entering data), the reverse mapping should also be defined
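The two mappings can be sketched as a pair of functions. Everything here is assumed for illustration: the nested "full model" layout, the flat interface fields, and the placeholder name "Aus bus" do not come from TCS, LinneanCore, or any other schema.

```python
from typing import Any, Dict


def to_interface(full: Dict[str, Any]) -> Dict[str, str]:
    """Map a (hypothetical) complex name record to the flat interface fields."""
    name = full["name"]
    return {
        "genus": name["genus"],
        "epithet": name["epithet"],
        "author": name["authorship"]["author"],
        "year": name["authorship"]["year"],
    }


def from_interface(iface: Dict[str, str]) -> Dict[str, Any]:
    """Reverse mapping: build a minimal complex record from data entered via a proxy."""
    return {
        "name": {
            "genus": iface["genus"],
            "epithet": iface["epithet"],
            "authorship": {"author": iface["author"], "year": iface["year"]},
        }
    }


full_record = {
    "name": {
        "genus": "Aus",
        "epithet": "bus",
        "authorship": {"author": "Author", "year": "1900"},
    },
    "status_notes": "only present in the full model",  # dropped by to_interface
}

interface_record = to_interface(full_record)
round_trip = to_interface(from_interface(interface_record))
```

In this sketch the interface fields survive a round trip unchanged, while fields that exist only in the full model are lost; that is the price of the simplification.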
20Outline
- Linking and proxy objects
- Data interface definitions
- DarwinCore and LinneanCore from Interfaces
21Modular interface/protocol schemata
[Diagram: Darwin-Core and Linnean-Core assembled from interface blocks; the taxon name and specimen curatorial interfaces each appear twice]
- Taxon name interface (full/atomized taxon name)
- Specimen curatorial interface (collection/subcoll., access. no., ...)
- Publication interface (author, year, title, ...)
- Geographical location interface (place description, geogr. coord., gazetteer link, ...)
22Full schema partly using interfaces
[Diagram: Full taxon concept schema and Linnean-Core, partly using interface blocks; the specimen curatorial and publication interfaces each appear twice]
- Taxon name interface (full/atomized taxon name)
- Taxon name complete
- Specimen curatorial interface (collection/subcoll., access. no., ...)
- Publication interface (author, year, title, ...)
23OO schema or flattened?
[Diagram: Darwin-Core with Geographical location interface (place description, geogr. coord., gazetteer link, ...)]
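The question of an object-oriented versus a flattened schema can be illustrated with a small helper that turns a nested record, composed of interface blocks, into a flat one. The helper `flatten`, the underscore naming scheme, and all field names and values are hypothetical and are not actual DarwinCore terms.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested interface blocks into one level of prefixed field names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested block, prefixing its fields with the block name
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Nested ("OO") form: one sub-record per interface
nested = {
    "taxon_name": {"full_name": "Aus bus"},
    "location": {
        "place_description": "an example place",
        "coordinates": {"latitude": "0.0", "longitude": "0.0"},
    },
}

flat = flatten(nested)
```

The nested form keeps each interface recognizable as a unit; the flattened form is easier to use in simple query protocols but encodes the grouping only in the field names.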
24Proxies proposed in UBIF/SDD