Title: ESG Publication Tools
1 ESG Publication Tools
PCMDI Software Team
ESG All Hands Meeting, Boulder, Colorado, April 29, 2008
LLNL-PRES-403079
2 Overview
- Publication is the process of generating metadata about ESG datasets and making that information available to ESG services
  - Search, browse, download, and server-side processing rely on published metadata
  - Eventually will tie into a notification service
- Unit of work is a dataset
  - Question: is there a need to publish individual files directly?
- Publication deals with files and aggregations as first-class objects
- Persons responsible for publication are the data publishers
3 Goals
- Publisher can read metadata in a collection of files, and
  - Add new metadata
  - Modify existing metadata
  - Add, update, or delete a dataset
- Flexibility to add new projects
  - Static configuration where possible (minimize coding)
  - Logic can be encapsulated in project-specific handlers
  - Metadata fields of interest are defined by the configuration
  - Different projects may have different metadata items
- CF-1 support
  - Standard names
  - Spatio-temporal coordinates
- Standard configuration
  - .ini style
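As an illustration of how project-specific logic could sit behind a common interface, here is a minimal Python sketch. The class names, the `HANDLERS` registry, and the lower-casing rule are all hypothetical and chosen only for the example; the slides do not describe the actual handler API.

```python
class ProjectHandler:
    """Default behaviour: keep only the metadata fields named in the configuration."""

    def __init__(self, project, fields):
        self.project = project
        self.fields = list(fields)  # fields of interest, as the configuration would define them

    def read_metadata(self, file_attributes):
        # Copy each configured field that is actually present in the file.
        return {name: file_attributes[name]
                for name in self.fields if name in file_attributes}


class NormalizingHandler(ProjectHandler):
    """Hypothetical project-specific handler that adds one extra rule."""

    def read_metadata(self, file_attributes):
        metadata = super().read_metadata(file_attributes)
        # Assumed rule, for illustration only: experiment names are lower-cased.
        if "experiment" in metadata:
            metadata["experiment"] = metadata["experiment"].strip().lower()
        return metadata


# Hypothetical registry: projects without an entry fall back to the default handler.
HANDLERS = {"IPCC_AR4": NormalizingHandler}


def get_handler(project, fields):
    handler_class = HANDLERS.get(project, ProjectHandler)
    return handler_class(project, fields)


handler = get_handler("IPCC_AR4", ["model", "experiment"])
metadata = handler.read_metadata(
    {"model": "ccsm3", "experiment": " 20C3M ", "history": "created 2008"}
)
print(metadata)
```

With this shape, adding a project that needs no special logic would only require new configuration; a subclass is needed only when a project has rules of its own.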
4 Goals
- GUI, but publishing is also scriptable
- Quality control checks for
  - Duplication of data
  - Validity of coordinate metadata (e.g., monotonicity of the time dimension)
  - Validity of standard names
- Generation of THREDDS catalogs to support LAS, harvesting
- Generation of data aggregations
- Ability to publish both online and offline (tertiary storage) datasets
  - For offline data, requires a list of paths / file sizes
- Support for Dublin Core
  - Some CF fields map to DC
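The quality control checks listed above could look roughly like the following Python sketch. The function names are hypothetical, "monotonic" is assumed here to mean strictly increasing, duplicates are detected by path only, and `VALID_STANDARD_NAMES` is a tiny illustrative subset rather than the full CF standard name table.

```python
from collections import Counter

# Tiny illustrative subset; a real check would load the full CF standard name table.
VALID_STANDARD_NAMES = {"air_temperature", "precipitation_flux"}


def time_is_monotonic(times):
    """Return True if the time values are strictly increasing."""
    return all(earlier < later for earlier, later in zip(times, times[1:]))


def duplicated_paths(paths):
    """Return the sorted list of paths that appear more than once."""
    counts = Counter(paths)
    return sorted(path for path, count in counts.items() if count > 1)


def standard_name_is_valid(name, valid_names=VALID_STANDARD_NAMES):
    """Return True if the name is in the supplied set of valid standard names."""
    return name in valid_names
```

For example, `time_is_monotonic([0.0, 1.0, 1.0])` is false under the strict reading, and `duplicated_paths(["a.nc", "b.nc", "a.nc"])` reports `a.nc`.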
5 The Process
- Specify
  - Project (IPCC_AR4, C-LAMP, NARCCAP, ...)
  - Dataset
  - Metadata may be read from a self-describing dataset, or input by the user
- Options for specifying a dataset
  - Read paths from a file
  - Regular expression template for paths
  - Directory name and file filter
- Generate dataset metadata by
  - Scanning self-describing dataset to extract metadata
  - Aggregate variables
  - Create/replace/update/delete
- Publish
  - Generate THREDDS catalog. The form of the catalog may depend on whether the dataset is
    - Aggregated
    - Non-aggregated
    - Offline
  - Release data for harvesting
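The three options for specifying a dataset can be sketched in Python as follows. The function names, the default `*.nc` filter, and the example pattern are assumptions made for illustration; they are not the publisher's actual interface.

```python
import fnmatch
import os
import re


def paths_from_listing(listing_file):
    """Option 1: read one path per line from a text file, skipping blank lines."""
    with open(listing_file) as listing:
        return [line.strip() for line in listing if line.strip()]


def paths_from_regex(candidates, pattern):
    """Option 2: keep the candidate paths that fully match a regular expression."""
    compiled = re.compile(pattern)
    return [path for path in candidates if compiled.fullmatch(path)]


def paths_from_directory(directory, file_filter="*.nc"):
    """Option 3: walk a directory tree and keep file names matching a glob filter."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in fnmatch.filter(files, file_filter):
            found.append(os.path.join(root, name))
    return sorted(found)


# Hypothetical path template with named groups for fields of interest.
PATTERN = r"/data/(?P<model>\w+)/(?P<variable>\w+)_\d{4}\.nc"
selected = paths_from_regex(
    ["/data/ccsm/tas_2000.nc", "/data/ccsm/readme.txt"], PATTERN
)
```

Named groups in the template are one possible way to pull metadata such as the model name out of the path itself, in addition to selecting files.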
6 Dataset publishing on an ESG node: Metadata specification
- Dataset pane
  - shows metadata in a file, allows modification
  - is project-specific
  - metadata is extracted from the first file in the list
- Output pane
  - displays logged results
  - log level is configurable
Expansion buttons in left pane correspond to publication steps.
- Status bar
  - shows scan progress
7 Data scan
1. Dataset is created or updated based on input metadata. Required fields are highlighted. Selecting an extraction option starts the dataset scan. Options are:
   - create a new dataset, or replace the dataset if it exists
   - append or update: the files are added to an existing dataset
2. Files are scanned and internal database tables populated.
3. If an aggregation dimension is found or specified, variables are aggregated.
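The two extraction options in step 1 can be modelled with a small in-memory sketch. The real tool populates database tables; here a plain dictionary stands in for them, and the mode names and the error raised for a missing dataset are assumptions.

```python
def scan_dataset(datasets, name, files, mode):
    """Apply a scan to an in-memory store.

    datasets: dict mapping dataset name -> dict of {path: size in bytes}
    files:    dict of {path: size in bytes} found by the scan
    mode:     "create_or_replace" or "append_or_update" (hypothetical names)
    """
    if mode == "create_or_replace":
        # Any existing file list for this dataset is discarded.
        datasets[name] = dict(files)
    elif mode == "append_or_update":
        if name not in datasets:
            raise KeyError(f"dataset {name!r} does not exist")
        # New paths are added; paths already present get their entry refreshed.
        datasets[name].update(files)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return datasets[name]


store = {}
scan_dataset(store, "demo", {"a.nc": 10}, "create_or_replace")
scan_dataset(store, "demo", {"b.nc": 20, "a.nc": 15}, "append_or_update")
```

After these two calls the `demo` dataset holds both files, with the size of `a.nc` refreshed by the second scan.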
8 Data aggregation and publication
- Publication step
  - Generate THREDDS catalog for harvesting, server-side configuration
  - Release data for harvesting
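To give a sense of what catalog generation involves, the following Python sketch builds a very small THREDDS-style inventory catalog with the standard library. It is a simplified illustration, not the publisher's output: the function name, the service name, and the service base path are assumptions, and the result has not been validated against the THREDDS catalog schema.

```python
import posixpath
import xml.etree.ElementTree as ET

# Namespace of the THREDDS InvCatalog 1.0 specification.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"


def build_catalog(dataset_id, files, service_base="/thredds/fileServer/"):
    """Return catalog XML for a dataset given (path, size_in_bytes) pairs."""
    ET.register_namespace("", THREDDS_NS)

    def tag(name):
        return f"{{{THREDDS_NS}}}{name}"

    catalog = ET.Element(tag("catalog"), {"name": dataset_id})
    # Hypothetical HTTP download service; the base path is an assumption.
    ET.SubElement(catalog, tag("service"),
                  {"name": "http", "serviceType": "HTTPServer", "base": service_base})
    top = ET.SubElement(catalog, tag("dataset"), {"name": dataset_id, "ID": dataset_id})
    for path, size in files:
        entry = ET.SubElement(top, tag("dataset"), {
            "name": posixpath.basename(path),
            "urlPath": path,
            "serviceName": "http",
        })
        data_size = ET.SubElement(entry, tag("dataSize"), {"units": "bytes"})
        data_size.text = str(size)
    return ET.tostring(catalog, encoding="unicode")


catalog_xml = build_catalog("demo.dataset", [("demo/tas_2000.nc", 1024),
                                             ("demo/tas_2001.nc", 2048)])
```

An aggregated or offline dataset would need a different catalog form, as the slides note; this sketch covers only the simple case of online, non-aggregated files.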
9 Configuration
- INI style
  - Named section for each application, project
  - Each section contains options for that section
  - Expands %(option)s interpolations
- Per-project specification of models, experiments, standard names
  - Enumerate valid values for fields
- Per-project handler encapsulates logic for reading / generating metadata
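A configuration of this kind can be read with Python's standard `configparser` module, which expands `%(option)s` references by default. The section name, option names, and values below are hypothetical and only illustrate the layout described above.

```python
import configparser

# Hypothetical configuration text; section and option names are illustrative only.
CONFIG_TEXT = """
[DEFAULT]
root = /esg/data

[project:ipcc_ar4]
dataset_dir = %(root)s/ipcc_ar4
model_options = ccsm3, pcm1
experiment_options = 20c3m, sresa1b
"""


def valid_values(config, section, option):
    """Return the enumerated valid values of a comma-separated option."""
    raw = config.get(section, option)
    return [value.strip() for value in raw.split(",") if value.strip()]


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# %(root)s is expanded from the DEFAULT section when the option is read.
dataset_dir = config.get("project:ipcc_ar4", "dataset_dir")
experiments = valid_values(config, "project:ipcc_ar4", "experiment_options")
```

Enumerating valid values in the configuration lets a check such as `"20c3m" in experiments` reject unknown experiment names without any project-specific code.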
10 Status
- Publication GUI is pre-alpha
  - Implemented in Python, Tcl/Tk
  - Metadata DB is MySQL, but flexibility to use PostgreSQL
- ESG-specific data (in addition to THREDDS) needs to be defined
- Method of data release depends on harvesting infrastructure
- Still to do
  - Dataset management
    - Display existing datasets
    - Delete dataset(s)
    - Individual file deletion
  - Handling multiple datasets
  - Handle non-CF compliant netCDF
  - Improved handling of preferences
  - Interface to backup systems?
  - Interface to authn/authz
  - Checksums?