Title: Distributed Data, Distributed Governance, Distributed Vocabularies: The NERC DataGrid
1Distributed Data, Distributed Governance,
Distributed Vocabularies The NERC DataGrid
Bryan Lawrence (on behalf of a big team, and note
also a substantial piece of work with specific
authorship included herein)
BADC, BODC, CCLRC, PML and SOC
2Outline
- Motivation
- Standards
- Feature Types
- Taxonomy
- Overall Architecture
- NDG Products
- Discovery Portal
- Data Extractor
- MOLES (NumSim relationship with NMM)
- CSML
- CSML
- Description
- Prototyping in MarineXML
- Round-Tripping
- Vocabulary Issues in NDG (Hughes, Kondapalli, Lowry)
- NDG Timeline
3Complexity + Volume + Remote Access = Grid Challenge
British Atmospheric Data Centre
http://ndg.nerc.ac.uk
British Oceanographic Data Centre
4Integration semantics
- Want interdisciplinary semantic access to information, not abstract data
  - getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000)
  - not getData(era40.nc, PTMP, 20:50, 300:340, 190:200)
  - or even worse
    - for j=1990:2000
      - getData(era40_j.nc, PTMP, 20:50, 300:340)
- Lossy is OK!
  - Care less about completeness of representation than semantic unification
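The contrast on this slide can be sketched in Python. Everything below is hypothetical and is not NDG code: the `CATALOGUE`, `PARAMETERS` and `REGIONS` tables stand in for metadata and vocabulary services, `resolve` is an invented helper, and the index ranges (including the time origin of 1800) are chosen only so that the result reproduces the numbers on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRequest:
    """Low-level request: file name, variable code and raw index ranges."""
    filename: str
    variable: str
    lat: tuple
    lon: tuple
    time: tuple


# Hypothetical lookup tables standing in for catalogue and vocabulary services.
CATALOGUE = {"ERA-40": {"file": "era40.nc", "time_origin": 1800}}
PARAMETERS = {"potential temperature": "PTMP"}
REGIONS = {"North Atlantic": {"lat": (20, 50), "lon": (300, 340)}}


def resolve(parameter, dataset, region, years):
    """Translate a semantic request into the index-based form the user should not need to know."""
    entry = CATALOGUE[dataset]
    box = REGIONS[region]
    origin = entry["time_origin"]
    return IndexRequest(
        filename=entry["file"],
        variable=PARAMETERS[parameter],
        lat=box["lat"],
        lon=box["lon"],
        time=(years[0] - origin, years[1] - origin),
    )


request = resolve("potential temperature", "ERA-40", "North Atlantic", (1990, 2000))
print(request)
```

The point of the sketch is only that the file name, variable code and index ranges are looked up on the user's behalf rather than typed in by the user.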
5Standards
- ISO 19101 Geographic information: Reference model
6Standards
- Geographic features
- "abstraction of real world phenomena" (ISO 19101)
- Type or instance
- Encapsulate important semantics in universe of discourse
- Something you can name
- Application schema
- Defines semantic content and logical structure
- ISO standards provide toolkit
- spatial/temporal referencing
- geometry (1-, 2-, 3-D)
- topology
- dictionaries (phenomena, units, etc.)
- GML canonical encoding
From ISO 19109 Geographic information: Rules for application schema
7Architecture NDG Metadata Taxonomy
not one schema, not one solution!
8Architecture Deployment
9Architecture Deployment
10Architecture Deployment
11Architecture Deployment
12Current Status
13Discovery Service
NDG Products Discovery Portal
http://ndg.nerc.ac.uk/discovery
NB Web Service Interface (you can do the search from your own site and format and present the results there!)
16NDG Products MOLES
Ugly as sin! A hint of things to come
17MOLES implementation
- Core linking concept is the deployment of a Data Production Tool at an Observation Station on behalf of an Activity that produces a Data Entity
- Diagram entities: Activity, DataProductionTool, ObservationStation, Deployment, Data Entity
- Deployment links the metadata records into a structure that can be turned into a navigable structure
- Each of the main metadata objects has security data attached to it, so security can be applied to queries on the metadata
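The linking idea on this slide can be sketched as a small data structure. This is an illustration only, not the MOLES schema: the class and field names, the role-based `allowed_roles` security data, the rule that an empty role set means public, and the example records are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataObject:
    """Base for the main metadata objects, each carrying its own security data."""
    name: str
    allowed_roles: frozenset = field(default_factory=frozenset)

    def visible_to(self, roles):
        # Assumption of this sketch: no roles listed means publicly visible.
        return not self.allowed_roles or bool(self.allowed_roles & set(roles))


class Activity(MetadataObject):
    """A project or campaign on whose behalf data are produced."""


class DataProductionTool(MetadataObject):
    """An instrument (or simulator) that produces data."""


class ObservationStation(MetadataObject):
    """The place or platform where the tool is deployed."""


class DataEntity(MetadataObject):
    """The data set that results from a deployment."""


@dataclass
class Deployment:
    """Links the four record types so they can be navigated together."""
    activity: Activity
    tool: DataProductionTool
    station: ObservationStation
    data_entity: DataEntity


def data_entities_for(deployments, activity_name, roles):
    """Navigate from an activity to its data entities, honouring security data."""
    return [
        d.data_entity.name
        for d in deployments
        if d.activity.name == activity_name and d.data_entity.visible_to(roles)
    ]


campaign = Activity("Example campaign")
profiler = DataProductionTool("wind profiler")
site = ObservationStation("Example site")
deployments = [
    Deployment(campaign, profiler, site, DataEntity("public profiles")),
    Deployment(campaign, profiler, site,
               DataEntity("restricted profiles", frozenset({"project-member"}))),
]
print(data_entities_for(deployments, "Example campaign", roles=[]))
```

Because the security data sit on the metadata objects themselves, the same query returns different results for different callers without any change to the linking structure.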
18Simulators as data production tools NumSim
NDG Products NumSim
19NumSim Example
21NDG Products DataExtractor
23NDG Products GEOSPLAT
24- ERA40
- All driven from one CDML file: 9 TB of online spherical harmonics, looking like 40 TB of virtual gridded data!
25NDG-A Climate Science Modelling Language
- Aims
  - provide semantic integration mechanism for NDG data
  - explore new standards-based interoperability framework
  - emphasise content, not container
- Design principles
  - offload semantics onto parameter type (phenomenon, observable, measurand)
    - e.g. wind-profiler, balloon temperature sounding
  - offload semantics onto CRS
    - e.g. scanning radar, sounding radar
  - sensible plotting as discriminant
    - in-principle unsupervised portrayal
  - explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit)
26Climate Science Modelling Language
- CSML feature types
  - defined on basis of geometric and topological structure
CSML feature type | Description | Examples
TrajectoryFeature | Discrete path in time and space of a platform or instrument. | ship's cruise track, aircraft's flight path
PointFeature | Single point measurement. | raingauge measurement
ProfileFeature | Single profile of some parameter along a directed line in space. | wind sounding, XBT, CTD, radiosonde
GridFeature | Single time-snapshot of a gridded field. | gridded analysis field
PointSeriesFeature | Series of single datum measurements. | tidegauge, rainfall timeseries
ProfileSeriesFeature | Series of profile-type measurements. | vertical or scanning radar, shipborne ADCP, thermistor chain timeseries
GridSeriesFeature | Timeseries of gridded parameter fields. | numerical weather prediction model, ocean general circulation model
27Climate Science Modelling Language
- CSML feature types
- examples...
28Climate Science Modelling Language
- Numerical array descriptors
  - provides wrapper architecture for legacy data files
  - Connected to data model numerical content through xlink:href
  - Three subtypes
    - InlineArray
    - ArrayGenerator
    - FileExtract (NASAAmes, NetCDF, GRIB)
  - Composite design pattern for aggregation
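A minimal sketch of how a composite design pattern lets aggregated arrays be treated like single ones. The class names echo the subtypes listed above, but the interface (`values()`), the `AggregatedArray` name and the inclusive start/step/stop generator are assumptions made for illustration; file-backed extracts are left out because they would need real data files.

```python
from abc import ABC, abstractmethod


class ArrayDescriptor(ABC):
    """Common interface: every descriptor can yield its numerical content."""

    @abstractmethod
    def values(self):
        """Return the described numbers as a flat list."""


class InlineArray(ArrayDescriptor):
    """Leaf: the numbers are stored directly in the descriptor."""

    def __init__(self, numbers):
        self._numbers = list(numbers)

    def values(self):
        return list(self._numbers)


class ArrayGenerator(ArrayDescriptor):
    """Leaf: the numbers are generated from start, step and inclusive stop."""

    def __init__(self, start, step, stop):
        self._range = range(start, stop + 1, step)

    def values(self):
        return list(self._range)


class AggregatedArray(ArrayDescriptor):
    """Composite: concatenates the content of its child descriptors."""

    def __init__(self, parts):
        self._parts = list(parts)

    def values(self):
        combined = []
        for part in self._parts:
            combined.extend(part.values())
        return combined


inner = AggregatedArray([InlineArray([1, 2]), ArrayGenerator(0, 5, 10)])
outer = AggregatedArray([inner, InlineArray([99])])
print(outer.values())
```

Client code calls `values()` without needing to know whether it holds a leaf or an aggregate, which is the property the composite pattern is chosen for.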
29Climate Science Modelling Language
- Inline array
  <NDGInlineArray>
    <arraySize>5 2</arraySize>
    <uom>udunits.xml#degreeC</uom>
    <numericType>float</numericType>
    <regExpTransform>s/10/9/ge</regExpTransform>
    <numericTransform>5</numericTransform>
    <values>1 2 3 4 5 6 7 8 9 10</values>
  </NDGInlineArray>
- Array generator
  <NDGArrayGenerator>
    <arraySize>10001</arraySize>
    <uom>udunits.xml#minute</uom>
    <numericType>float</numericType>
    <expression>0:5:50000</expression>
  </NDGArrayGenerator>
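A rough sketch of reading the two descriptors with Python's standard `xml.etree.ElementTree`. It assumes the generator expression reads `0:5:50000` and means start:step:stop with an inclusive stop, which is consistent with the stated array size of 10001 but is not confirmed by the slide. The `uom`, `regExpTransform` and `numericTransform` elements are omitted because their exact semantics are not given here, and the function names are invented.

```python
import xml.etree.ElementTree as ET
from math import prod

INLINE = """<NDGInlineArray>
  <arraySize>5 2</arraySize>
  <numericType>float</numericType>
  <values>1 2 3 4 5 6 7 8 9 10</values>
</NDGInlineArray>"""

GENERATOR = """<NDGArrayGenerator>
  <arraySize>10001</arraySize>
  <numericType>float</numericType>
  <expression>0:5:50000</expression>
</NDGArrayGenerator>"""


def expected_size(element):
    """Total element count implied by arraySize (product of its dimensions)."""
    return prod(int(n) for n in element.findtext("arraySize").split())


def check_size(element, values):
    """Reject a descriptor whose content disagrees with its declared size."""
    if len(values) != expected_size(element):
        raise ValueError("arraySize does not match the number of values")
    return values


def read_inline(xml_text):
    """Read the values stored directly in an inline array descriptor."""
    element = ET.fromstring(xml_text)
    values = [float(v) for v in element.findtext("values").split()]
    return check_size(element, values)


def read_generator(xml_text):
    """Expand an assumed start:step:stop expression into explicit values."""
    element = ET.fromstring(xml_text)
    start, step, stop = (int(p) for p in element.findtext("expression").split(":"))
    values = [float(v) for v in range(start, stop + 1, step)]
    return check_size(element, values)


inline_values = read_inline(INLINE)
generated_values = read_generator(GENERATOR)
print(len(inline_values), len(generated_values))
```

The generator form is what makes the descriptor compact: three numbers in the XML stand in for 10001 values.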
30Climate Science Modelling Language
<NDGNASAAmesExtract>
  <arraySize>526</arraySize>
  <numericType>double</numericType>
  <fileName>/data/BADC/macehead/mh960606.cf1</fileName>
  <variableName>CFC-12</variableName>
</NDGNASAAmesExtract>

<NDGNetCDFExtract gml:id="feat04azimuth">
  <arraySize>10000</arraySize>
  <fileName>radar_data.nc</fileName>
  <variableName>az</variableName>
</NDGNetCDFExtract>

<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
31MarineXML Testbed
- Data from different parts of the marine community conforming to a variety of schemas (XSD)
  - Sources shown: Biological Species, Chl-a from Satellite, Measured Hydrodynamics, Modelled Hydrodynamics
- For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FTs and XSLT are maintained in a MarineXML registry
- Phenomena in the XSD must have an associated portrayal
- Features in the source XSD must be present in the data dictionary
- The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features
- The FTs can then be translated to equivalent FTs for display in the ECDIS system
- ECDIS acts as an example client for the data
- Features described using the S-57 v3.1 Application Schema can be imported and are equivalent to the same features in CSML
- Diagram labels: XSD, XML, XSLT, XML Parser, Data Dictionary, S52 Portrayal Library, MarineGML (NDG) Feature Types, S-57v3 GML, SeeMyDENC, SENC
Slide adapted from Kieran Millard (AUKEGGS, 2005)
32MarineXML Testbed
- Biological sampling stations with attributes for the species sampled at each
- Grid of Chl-a from the MERIS instrument on ENVISAT
- Predicted and measured wave climate timeseries (height, direction and period)
- Vectors of currents from instruments
Slide adapted from Kieran Millard (AUKEGGS, 2005)
33The Concept of re-using Features
- Here structured XML is converted to plain ASCII text in the form required for a numerical model
- HTML warning service pages are generated on the fly
- Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts
- XML can also be converted to SVG to display data graphically
Slide adapted from Kieran Millard (AUKEGGS, 2005)
34CSML Round Tripping - 1
Managing semantics
35CSML Round Tripping - 2
Managing data - 1
36Managing Data 2
scanner
XSLT
PUBLISH
ISO19115
37Architecture Deployment
38Vocabulary Management for NERC DataGrid
- Michael Hughes, V. Siva Kondapalli and Roy Lowry
39Vocabulary Presentation Outline
- Problem and Solution
- NERC DataGrid Vocabulary Model
- Vocabulary Technical Governance
- Vocabulary Content Governance
- Mappings and Thesaurus Server
- Potential Role of Local Mappings
40The Problem
- NERC DataGrid cannot function operationally without metadata and data semantic interoperability
- This will never be achieved without
  - Readily accessible standard terms whose meaning is clearly understood
  - Readily accessible semantic maps both within and between lists of standard terms
  - Semantic maps between local terms and standard terms
41The Solution?
- Implementation of a Vocabulary Server
- Building OWL ontologies mapping between domain-relevant de-facto standard vocabularies
- Deploying the ontologies through a Web Service thesaurus server
- Making tools available for users to build and deploy local ontologies
42NDG Vocabulary Model
43NDG Vocabulary Model
- The vocabulary resource is built from Entries
  - The representation of a single object in the real world comprising
    - Key: A bit pattern that represents an entity. It must be unique, permanent and free from semantics.
    - Term: Text used to label the entity to facilitate human recognition.
    - Abbreviation: A shortened version of the term for use where space is tight. Target size is 20-30 bytes.
    - Definition: Text that unambiguously specifies the entity.
- Entries are aggregated into Lists (entity class or subclass e.g. UK post towns)
- Lists are aggregated into Constraints (entity class e.g. post towns of the world)
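The Entry, List and Constraint aggregation can be sketched with plain dataclasses. This is an illustration rather than the NDG implementation: the class names, the 30-byte ceiling on abbreviations (taken from the upper end of the stated target), the key format and the example entries are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One real-world object: an opaque key plus human-readable text."""
    key: str
    term: str
    abbreviation: str
    definition: str

    def __post_init__(self):
        # Assumption: treat the upper end of the 20-30 byte target as a hard limit.
        if len(self.abbreviation.encode("utf-8")) > 30:
            raise ValueError("abbreviation longer than 30 bytes")


@dataclass
class VocabularyList:
    """A list of entries for one entity class or subclass."""
    name: str
    entries: dict = field(default_factory=dict)

    def add(self, entry):
        if entry.key in self.entries:
            raise ValueError("keys must be unique within a list")
        self.entries[entry.key] = entry


@dataclass
class Constraint:
    """An aggregation of lists covering a whole entity class."""
    name: str
    lists: list = field(default_factory=list)

    def find(self, key):
        """Return the entry with this key from any member list, or None."""
        for vocabulary in self.lists:
            if key in vocabulary.entries:
                return vocabulary.entries[key]
        return None


uk = VocabularyList("UK post towns")
uk.add(Entry("PT0001", "Liverpool", "Liverpool", "Post town in north-west England."))
uk.add(Entry("PT0002", "Leeds", "Leeds", "Post town in West Yorkshire."))
world = Constraint("Post towns of the world", [uk])
print(world.find("PT0002").term)
```

The keys carry no meaning of their own, so a term or definition can be reworded later without breaking anything that stored the key.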
44Vocabulary Technical Governance
- The story so far
  - Lists are available as flat ASCII files or XML documents as URLs e.g.
    - http://www.cgd.ucar.edu/cms/eaton/cf-metadata/standard_name.xml
    - ftp://ftp.pol.ac.uk/pub/bodc/jgofs/datadict/new/parameter_group.csv
    - http://www.sea-search.net/cdi_documentation/cdi_sampling_codes.csv
    - http://gcmd.nasa.gov/Resources/valids//gcmd_parameters.html
  - Some (BODC, SEA-SEARCH) include keys
  - Some (CF, BODC) include definitions
  - None are properly versioned
45Vocabulary Technical Governance
- Versioning should
  - Provide a unique label for each instantiation of the list
  - Enable any previous instantiation of the list to be recreated
  - Provide timestamp information for creation and modification of every object in the vocabulary system
- Delivery should
  - Be from the master, not a copy
  - Be accessible to software agents to allow automated synchronisation of local copies
  - Have a hotline to content governance
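One simple way to meet the versioning requirements above is to keep a full, timestamped snapshot for every published version. The sketch below is an assumption-laden illustration, not the NDG Vocabulary Server design: the `VersionedList` class and its methods are invented, and it records one timestamp per version rather than per object.

```python
from datetime import datetime, timezone


class VersionedList:
    """Keeps every published instantiation of a vocabulary list."""

    def __init__(self):
        self._history = []  # one (version, timestamp, snapshot) tuple per publication

    def publish(self, terms, when=None):
        """Store a snapshot of the key-to-term mapping and return its unique version label."""
        version = len(self._history) + 1
        stamp = when or datetime.now(timezone.utc)
        self._history.append((version, stamp, dict(terms)))
        return version

    def recreate(self, version):
        """Return the list exactly as it stood at the given version."""
        for number, _, snapshot in self._history:
            if number == version:
                return dict(snapshot)
        raise KeyError(version)

    def timestamp(self, version):
        """Return when the given version was published."""
        for number, stamp, _ in self._history:
            if number == version:
                return stamp
        raise KeyError(version)

    @property
    def current_version(self):
        return len(self._history)


vocabulary = VersionedList()
first = vocabulary.publish({"K1": "chlorophyll"})
second = vocabulary.publish({"K1": "chlorophyll", "K2": "carotenoids"})
print(vocabulary.recreate(first), vocabulary.current_version)
```

A software agent synchronising a local copy would only need to compare its stored version label with `current_version` and fetch the newer snapshot when they differ.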
46Vocabulary Technical Governance
- NERC DataGrid Vocabulary Server
  - Back End
    - Fully automated record archive, timestamps and version numbering. Live April 2006.
    - 47 (of 115) lists publicly accessible.
  - Front End
    - Web Service API. Live June 2006.
    - XML list downloads from website (July 2006?).
    - Web-form tools (August 2006?).
47Vocabulary Content Governance
- Standard lists need to respond to ever expanding user requirements
- Change needs to be rapid or users lose interest
- Standard lists need to maintain information quality and internal consistency
- Content governance has to resolve these conflicting requirements
48 Vocabulary Content Governance
- Content governance in oceanographic and atmospheric domains is based on
  - Moderated e-mail discussion lists
  - Benign Dictator and well-meaning volunteers
- Variable success depending on right people having spare time at the right moments
- More formalism underpinned by more resources required
- But need to be careful about going too far or levels of service become unacceptable
49Mappings and Thesaurus Server
- There will never be a single list for a given topic
- Term mapping therefore an essential part of semantic interoperability
- Marine Metadata Interoperability (http://marinemetadata.org) have developed tooling and trialled mappings in the measurement phenomena arena
50Mappings and Thesaurus Server
- MMI approach
  - Harmonise lists to be mapped in OWL (Voc2OWL tool)
  - Map on basis of "same as", "broader than" and "narrower than" relationships (VINE tool)
  - Place a Web Service API over the map to implement a term or thesaurus server
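The three mapping relationships can be sketched as a small in-memory store that a thesaurus server might sit on top of. This does not use or reproduce the Voc2OWL or VINE tools; the `ThesaurusMap` class, its methods and the `listA:`/`listB:` term labels are hypothetical.

```python
from collections import defaultdict

# Each relationship implies its inverse in the opposite direction.
INVERSE = {
    "same as": "same as",
    "broader than": "narrower than",
    "narrower than": "broader than",
}


class ThesaurusMap:
    """Stores term-to-term mappings between vocabulary lists."""

    def __init__(self):
        self._relations = defaultdict(set)

    def add(self, subject, relation, other):
        """Record 'subject relation other' together with the implied inverse."""
        if relation not in INVERSE:
            raise ValueError(f"unknown relation: {relation}")
        self._relations[(subject, relation)].add(other)
        self._relations[(other, INVERSE[relation])].add(subject)

    def related(self, term, relation):
        """Return the terms standing in the given relation to term, sorted."""
        return sorted(self._relations.get((term, relation), set()))

    def expand_search(self, term):
        """Terms a search should also match: equivalents and narrower terms."""
        matches = {term}
        matches.update(self.related(term, "same as"))
        matches.update(self.related(term, "broader than"))
        return sorted(matches)


mapping = ThesaurusMap()
mapping.add("listA:pigments", "broader than", "listB:chlorophyll")
mapping.add("listA:chl", "same as", "listB:chlorophyll")
print(mapping.expand_search("listA:pigments"))
```

Storing the inverse at insertion time keeps lookups symmetrical: a mapping entered from one list's point of view can be queried from the other's.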
51Mappings and Thesaurus Server
- NERC DataGrid Plans
  - Use MMI technology plus domain expertise available in BODC, BADC and their user communities to build a complete map between
    - BODC Parameter Discovery Vocabulary (300 terms)
    - CF Standard Names (500-600 terms)
    - GCMD Parameter Valids (200-300 relevant terms)
  - Incorporate this map into the NDG Discovery Service to facilitate smart searching (e.g. a search for "pigments" finds a dataset labelled "chlorophyll") through MMI Web Service
  - Integrate ontology maintenance into source list maintenance
52Role of Local Mappings
- There will always be local terms and understanding
- Pigment data sets could mean
  - Chlorophyll OR carotenoids OR phaeopigments
  - Chlorophyll AND carotenoids AND phaeopigments
- Depends on point of view
53Role of Local Mappings
- Possible solution to this
  - User builds an ontology reflecting local perception of the mapping between local terms and standard terms
  - Discovery or data integration tools use ontology as a plug-in allowing user to operate with local terminology
  - Tools (e.g. VINE) could be made available to facilitate this
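A sketch of how a user-supplied local mapping might plug into a discovery tool, using the pigment example from the previous slide. The function, the `mode` switch between the OR and AND readings, and the dataset labels are all hypothetical; a real plug-in would presumably be an ontology rather than a Python dictionary.

```python
def find_datasets(datasets, local_term, local_map, mode="any"):
    """Return names of datasets matching a local term under the user's own mapping.

    mode="any" is the OR reading (at least one standard term present);
    mode="all" is the AND reading (every standard term present).
    """
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    standard_terms = local_map[local_term]
    combine = any if mode == "any" else all
    return [
        name
        for name, labels in datasets.items()
        if combine(term in labels for term in standard_terms)
    ]


# Hypothetical local mapping: what "pigments" means to this particular user.
local_map = {"pigments": ["chlorophyll", "carotenoids", "phaeopigments"]}

# Hypothetical datasets labelled with standard terms.
datasets = {
    "cruise_1": {"chlorophyll"},
    "cruise_2": {"chlorophyll", "carotenoids", "phaeopigments"},
    "cruise_3": {"salinity"},
}

print(find_datasets(datasets, "pigments", local_map, mode="any"))
print(find_datasets(datasets, "pigments", local_map, mode="all"))
```

The discovery code stays the same for every user; only the mapping and the chosen reading change, which is the sense in which the local ontology acts as a plug-in.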
54NDG Timeline
- NDG2 runs until September 2007
- NDG-Alpha (June 2006)
  - Not all components in place (particularly delivery broker)
  - Not many (maybe only DX) products will be deployable by non-NDG participants
    - (too much hard work installing things that haven't been optimised for installation)
  - Discovery portal will be (is now) usable, linking to NCAR data etc, but isn't very user friendly (options not obvious etc).
- NDG-Beta (Feb 2007)
  - Most components should work, but deployment of software may still be difficult for non-participants
- NDG-Prod (Jun 2007)
  - Should be deployable and far more user friendly (spending from Feb-June working on deployment and friendliness, no new functionality)
- Last few months working on sustainability etc
http://proj.badc.rl.ac.uk/trac/roadmap