Title: A Semantic Modelling Approach to Biological Parameter Interoperability
1. A Semantic Modelling Approach to Biological Parameter Interoperability
Ocean Biodiversity Informatics
Roy Lowry, Laura Bird (British Oceanographic Data Centre)
Pieter Haaring (RIKZ, Rijkswaterstaat, The Netherlands)
2. Presentation Overview
- The nature of the problem
- Dictionaries and data models
- The starting position
- Manual mapping
- Automation through semantic matching
- From dictionary to semantic model
- Mapping semantic models
- Semantic model applications
- Conclusions and lessons learned
3. The Nature of the Problem
- BODC and Rijkswaterstaat both have marine databases holding a wide range of physical, chemical and biological parameters
- Both were to be included in pan-European metadatabases (EDIOS and SEA-SEARCH CDI) using a common discovery vocabulary
- BODC set up the vocabulary and naturally included a mapping to the BODC Parameter Dictionary
- The problem arose of how to provide a similar mapping for Rijkswaterstaat
- If the Rijkswaterstaat data markup vocabulary could be mapped to the BODC Parameter Dictionary, then the BODC discovery vocabulary mapping could be used
4. Dictionaries and Data Models
- BODC systems have roots in the GF3 model, which means:
  - Data values are linked to a parameter code
  - The parameter code is defined in a Parameter Dictionary
  - The parameter code specifies more than one metadata item for the data value
  - For chemical and biological data, "more than one" becomes "a lot"
5. Dictionaries and Data Models
- Rijkswaterstaat uses data models (DONAR, becoming WADI)
- Measurements are accompanied by attributes containing specific atomic metadata items
- Each attribute is populated from a controlled vocabulary
- DONAR constrains attribute term combinations using a parameter dictionary concept
- WADI reduces maintenance overheads by allowing any combination
6. The Starting Position
- BODC
  - Parameter codes defined by two plain-text fields
  - Related semantic information not necessarily in the same field
  - Fields would not concatenate sensibly
  - OK for humans, but not for machines
- Rijkswaterstaat
  - Consistently located semantics
  - Metadata fields that concatenate sensibly in both Dutch and English
7. Manual Mapping
- Manual mapping protocol
  - For each entry in the Rijkswaterstaat dictionary spreadsheet
    - Look up the code with identical meaning using the BODC Dictionary search tools (Access Filter by Form)
    - If found
      - Copy the BODC code from Access and paste it into the spreadsheet
    - Else
      - Prepare a dictionary update record and submit it for QA and load
- Error prone, and 500 entries is pushing the limit of human endurance!
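The protocol above can be written out as a loop, which makes clear why it scales so badly when a person performs every look-up by hand. This is only a sketch: the function name, the dictionary standing in for the Access search, and the sample code `CODE0001` are all hypothetical and do not come from the BODC or Rijkswaterstaat systems.

```python
def map_entries(rws_entries, bodc_lookup):
    """Apply the manual protocol: record a BODC code or queue a dictionary update."""
    mapped = {}
    update_requests = []
    for entry in rws_entries:
        # Stands in for the human search with Access "Filter by Form".
        code = bodc_lookup.get(entry)
        if code is not None:
            mapped[entry] = code           # "copy and paste" the BODC code
        else:
            update_requests.append(entry)  # new record to go through QA and load
    return mapped, update_requests


# Hypothetical data: one entry already in the dictionary, one not.
bodc_lookup = {"Abundance of Calanus": "CODE0001"}
rws_entries = ["Abundance of Calanus", "Biomass of Calanus"]

mapped, update_requests = map_entries(rws_entries, bodc_lookup)
print(mapped)
print(update_requests)
```

In the real protocol the look-up step is a judgement about identical meaning, not a key match, which is where both the effort and the errors come from.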
8. Semantic Matching
- When code lists run into thousands, automation is required
- Rijkswaterstaat developed a semantic matching tool to pull matching terms (preferably one) from the BODC dictionary
- Defeated by the lack of standardisation in the BODC plain-text fields, e.g.
  - Calanus abundance
  - Abundance of Calanus
  - Calanus count
  - Number of Calanus
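The four Calanus phrasings illustrate the failure mode. The slides do not describe how the Rijkswaterstaat matching tool worked, so the sketch below uses plain case-insensitive string equality as a stand-in; `exact_matches` is a hypothetical helper, not the actual tool.

```python
# The four equivalent descriptions quoted on the slide.
descriptions = [
    "Calanus abundance",
    "Abundance of Calanus",
    "Calanus count",
    "Number of Calanus",
]


def exact_matches(term, candidates):
    """Return candidates equal to the term after trimming and case-folding."""
    key = term.strip().casefold()
    return [c for c in candidates if c.strip().casefold() == key]


hits = exact_matches("Abundance of Calanus", descriptions)
print(hits)  # ['Abundance of Calanus']
```

Only one of the four descriptions is found, although a human reader would accept all of them as the same parameter.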
9. Dictionary to Semantic Model
- It became apparent that the BODC Dictionary required significant improvement if it was to support mapping automation
- The development strategy was to model the parameter code in the same way that DONAR models a measurement
- A semantic model was developed to cover all codes in the BODC Dictionary
10. Dictionary to Semantic Model
- Semantic model developed from DONAR with an increased semantic element count to overcome shoe-horning
- The principle that semantic elements may be combined automatically to produce text descriptions is maintained
- Currently implemented as three sub-models
- Element superset will ultimately be created as a single model
11. Dictionary to Semantic Model
- Biological sub-model semantic elements
  - Parameter (Abundance, Biomass)
  - Taxon_code (ITIS code)
  - Taxon_name
  - Taxon_subgroup (gender, size, stage)
  - Parameter_compartment_relationship (per unit volume of the, per unit area of the)
  - Compartment (water column, bed, sediment)
  - Sample_preparation
  - Analysis
  - Data_processing
- Needs further refinement, e.g. subdivide Taxon_subgroup
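The element list above, together with the principle that elements combine automatically into a text description, can be sketched as a small record type. The field names follow the slide; the concatenation order, the bracketed subgroup, and the placeholder taxon code are assumptions made for illustration, not the BODC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiologicalParameter:
    """One parameter code expressed as the biological sub-model elements."""
    parameter: str
    taxon_code: str
    taxon_name: str
    taxon_subgroup: str = ""
    parameter_compartment_relationship: str = ""
    compartment: str = ""
    sample_preparation: str = ""
    analysis: str = ""
    data_processing: str = ""

    def describe(self) -> str:
        """Concatenate the populated elements into a readable description."""
        taxon = self.taxon_name
        if self.taxon_subgroup:
            taxon += f" ({self.taxon_subgroup})"
        parts = [
            f"{self.parameter} of {taxon}",
            self.parameter_compartment_relationship,
            self.compartment,
            self.sample_preparation,
            self.analysis,
            self.data_processing,
        ]
        # Empty elements are skipped so the text still reads sensibly.
        return " ".join(p for p in parts if p)


example = BiologicalParameter(
    parameter="Abundance",
    taxon_code="PLACEHOLDER",  # hypothetical; a real record would hold an ITIS code
    taxon_name="Calanus",
    taxon_subgroup="female",
    parameter_compartment_relationship="per unit volume of the",
    compartment="water column",
)
print(example.describe())
# Abundance of Calanus (female) per unit volume of the water column
```

Because the description is generated from the elements, every code that shares the same element values gets the same wording, which is what the plain-text fields lacked.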
12. Mapping Semantic Models
- Two stage process
  - First map the semantic elements
    - DONAR Parameter -> BODC Parameter + Parameter_compartment_relationship
    - DONAR Compartment -> BODC Compartment
  - Then map vocabularies for mapped elements
    - Surface water -> water column
- Relational database designers will recognise this as normalisation
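The two stages can be sketched with two look-up tables: one that says which BODC elements each DONAR element feeds, and one that translates individual vocabulary terms. The element names and the "Surface water" example are from the slide; the DONAR parameter term "Abundance per volume" and the function `translate` are hypothetical.

```python
# Stage 1: element map (DONAR element -> BODC elements it populates).
ELEMENT_MAP = {
    "Parameter": ("Parameter", "Parameter_compartment_relationship"),
    "Compartment": ("Compartment",),
}

# Stage 2: vocabulary map, keyed by (DONAR element, case-folded DONAR term).
VOCAB_MAP = {
    ("Compartment", "surface water"): {"Compartment": "water column"},
    # Hypothetical DONAR term that carries both BODC elements at once.
    ("Parameter", "abundance per volume"): {
        "Parameter": "Abundance",
        "Parameter_compartment_relationship": "per unit volume of the",
    },
}


def translate(donar_record):
    """Translate a DONAR-style record into BODC semantic elements."""
    bodc = {}
    for element, term in donar_record.items():
        mapped_terms = VOCAB_MAP[(element, term.casefold())]
        for target in ELEMENT_MAP[element]:
            bodc[target] = mapped_terms[target]
    return bodc


result = translate({"Parameter": "Abundance per volume", "Compartment": "Surface water"})
print(result)
```

Each vocabulary term is mapped once and reused by every parameter that contains it, which is the normalisation the slide refers to.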
13. Mapping Semantic Models
- The number of look-ups required is reduced by an order of magnitude
- Vocabulary elements have simple semantics, so automation is possible
- Approximately 90% of the Rijkswaterstaat to BODC mapping was accomplished by a single SQL statement
- Straightforward extension of the vocabulary maps (different names for the same thing) sorted out most of the rest
- Thesauri could help reduce the need for this
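The slides do not show the SQL statement or the table layout, so the following is only a sketch of how a single join could produce the bulk of the mapping once both dictionaries are expressed as semantic elements. All table names, column names, codes and rows are hypothetical, and only two elements are used to keep it short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE rws_param  (rws_code TEXT, parameter TEXT, compartment TEXT);
CREATE TABLE bodc_param (bodc_code TEXT, parameter TEXT, compartment TEXT);
CREATE TABLE vocab_map  (element TEXT, rws_term TEXT, bodc_term TEXT);

INSERT INTO rws_param VALUES
    ('RWS1', 'Abundance', 'Surface water'),
    ('RWS2', 'Biomass',   'Bed');
INSERT INTO bodc_param VALUES
    ('BODC1', 'Abundance', 'water column');
INSERT INTO vocab_map VALUES
    ('Parameter',   'Abundance',     'Abundance'),
    ('Parameter',   'Biomass',       'Biomass'),
    ('Compartment', 'Surface water', 'water column'),
    ('Compartment', 'Bed',           'bed');
""")

# One statement: translate each element through the vocabulary map,
# then join on the translated elements to find the BODC code.
rows = con.execute("""
    SELECT r.rws_code, b.bodc_code
    FROM rws_param AS r
    JOIN vocab_map AS pm
      ON pm.element = 'Parameter' AND pm.rws_term = r.parameter
    JOIN vocab_map AS cm
      ON cm.element = 'Compartment' AND cm.rws_term = r.compartment
    LEFT JOIN bodc_param AS b
      ON b.parameter = pm.bodc_term AND b.compartment = cm.bodc_term
    ORDER BY r.rws_code
""").fetchall()
print(rows)  # [('RWS1', 'BODC1'), ('RWS2', None)]
```

Rows that come back with no BODC code are the candidates for new dictionary entries, while source terms missing from the vocabulary map drop out of the inner joins and show where the vocabulary maps need extending.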
14. Mapping Semantic Models
- "Hard core" problems required manual resolution
  - Unclear or ambiguous semantics in Rijkswaterstaat element vocabularies (residual beta)
  - Problems with Dutch to English translation
- Some mapping errors were detected
  - Caused by homonyms (Branchiura)
  - Emphasises the need for more than just a name for a taxon (reference or ITIS code)
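The homonym problem can be shown with a small example. The name Branchiura is used both for a group of crustaceans and for a genus of annelid worms; which two meanings collided in the actual mapping is not stated on the slide, and the codes below are placeholders, not real ITIS codes.

```python
# Hypothetical target vocabulary containing two taxa that share one name.
bodc_taxa = [
    {"name": "Branchiura", "code": "PLACEHOLDER-A", "note": "crustacean group"},
    {"name": "Branchiura", "code": "PLACEHOLDER-B", "note": "annelid genus"},
]


def match_by_name(name, candidates):
    """Match on the taxon name only."""
    return [c for c in candidates if c["name"] == name]


def match_by_code(code, candidates):
    """Match on the taxon code, which identifies a single taxon."""
    return [c for c in candidates if c["code"] == code]


by_name = match_by_name("Branchiura", bodc_taxa)
by_code = match_by_code("PLACEHOLDER-B", bodc_taxa)
print(len(by_name), len(by_code))  # 2 1
```

A name-only match returns both taxa, and an automated mapping that takes the first hit will be wrong for one of them; matching on a code or reference removes the ambiguity.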
15. Semantic Model Applications
- Semantic modelling is a lowest common denominator approach to metadata
  - This is what makes it good for mapping
- The approach also offers the basis for user-controlled data discovery and interoperability
  - User chooses the semantic element subset
  - User data selection interaction based on the subset vocabulary
  - Automated interoperability requires more sophistication (thesauri, ontologies)
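User-controlled discovery as outlined above might look like the following sketch: the user picks which semantic elements matter, the interface offers the vocabulary of just those elements, and the selection is a filter on the chosen terms. The catalogue rows, codes and function names are hypothetical.

```python
# Hypothetical catalogue of parameter codes expressed as semantic elements.
catalogue = [
    {"code": "P1", "Parameter": "Abundance", "Taxon_name": "Calanus", "Compartment": "water column"},
    {"code": "P2", "Parameter": "Biomass", "Taxon_name": "Calanus", "Compartment": "water column"},
    {"code": "P3", "Parameter": "Abundance", "Taxon_name": "Branchiura", "Compartment": "sediment"},
]


def subset_vocabulary(records, elements):
    """List the distinct terms available for each element the user chose."""
    return {e: sorted({r[e] for r in records}) for e in elements}


def select(records, choices):
    """Return the codes whose elements match every term the user picked."""
    return [r["code"] for r in records if all(r[e] == v for e, v in choices.items())]


# The user is interested only in Parameter and Compartment.
options = subset_vocabulary(catalogue, ["Parameter", "Compartment"])
selected = select(catalogue, {"Parameter": "Abundance", "Compartment": "water column"})
print(options)
print(selected)  # ['P1']
```

Elements outside the chosen subset, such as Taxon_name here, simply do not constrain the result, so the same catalogue can serve coarse and fine-grained searches.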
16. Conclusions
- Don't even think about manual mapping of large parameter dictionaries
- 99% of a map is completed in the first 10% of the time
- More standardisation means fewer errors and problems
- Semantic model vocabularies need ontologies and thesauri to achieve their full interoperability potential
17. Conclusions
- Semantic modelling works for mappings between dictionaries and data models
- It also has great potential for parameter discovery and interoperability