Title: LEAD Data Subsystem: Overview, Current Approach to Integration, and Challenges
1 LEAD Data Subsystem: Overview, Current Approach to Integration, and Challenges
Beth Plale, Indiana University
DIALOGUE workshop, 01 August 2005
2 LEAD vision
- Cyberinfrastructure that allows mesoscale meteorology forecasters to
  - Dynamically and adaptively respond to weather patterns to produce better faster-than-real-time forecasts, and
  - Run larger multi-model simulations than can be done today.
3 Definitions
- Mesoscale meteorology: regional scale, severe storm forecasting
  - Tornadoes, flash floods, severe storms
- Multi-model simulations: ensemble runs
  - 20-100-500 versions of a forecast model with physics tweaked slightly differently.
  - Results analyzed akin to a distributed consensus scheme (i.e., voting), looking for regions of uncertainty
- Faster-than-real-time
  - Forecast that precedes the storm.
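The voting idea in the ensemble definition can be illustrated with a minimal sketch. It assumes each ensemble member is reduced to a yes/no storm prediction per grid cell; the function names, thresholds, and sample data are hypothetical and are not taken from LEAD.

```python
from typing import List


def agreement_fraction(member_forecasts: List[List[int]]) -> List[float]:
    """Fraction of ensemble members that predict the event (1) in each grid cell."""
    n_members = len(member_forecasts)
    n_cells = len(member_forecasts[0])
    return [
        sum(member[cell] for member in member_forecasts) / n_members
        for cell in range(n_cells)
    ]


def uncertain_cells(fractions: List[float], low: float = 0.25, high: float = 0.75) -> List[int]:
    """Indices of cells where the vote is split (strictly between low and high)."""
    return [i for i, f in enumerate(fractions) if low < f < high]


# Four hypothetical ensemble members, each a flattened grid of yes/no predictions.
members = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
fractions = agreement_fraction(members)
split = uncertain_cells(fractions)
```

Cells where nearly all members agree are treated as settled; cells with a split vote are the regions of uncertainty that an adaptive system might target with further observation.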
4 Problem: Conventional Numerical Weather Prediction
- OBSERVATIONS
- Radar Data
- Mobile Mesonets
- Surface Observations
- Upper-Air Balloons
- Commercial Aircraft
- Geostationary and Polar Orbiting Satellite
- Wind Profilers
- GPS Satellites
9 Problem: Conventional Numerical Weather Prediction
- OBSERVATIONS
  - Radar Data
  - Mobile Mesonets
  - Surface Observations
  - Upper-Air Balloons
  - Commercial Aircraft
  - Geostationary and Polar Orbiting Satellite
  - Wind Profilers
  - GPS Satellites
- Analysis/Assimilation
  - Quality Control
  - Retrieval of Unobserved Quantities
  - Creation of Gridded Fields
- Prediction: PCs to Teraflop Systems
- Product Generation, Display, Dissemination
- End Users
  - National Weather Service (NWS)
  - Private Companies
  - Students
The process is entirely serial and pre-scheduled: no response to weather!
10 Goal: Adaptive Forecast
[Figure: timeline of an adaptive forecast ahead of afternoon storms. Steps shown include: NexRad radar ingest; convert to format suitable for assimilation; assimilate into 3D grid; plan 20-run ensemble; fetch data products; forecast model execution (20 versions); analyze final results of each run; request to NetRad radar control system. Time axis marks: 0600, 0800, 1000, 1200, 1300, 1400, 2100. Annotations: collect data; 3 hr forecast; 6 hr forecast; 12 hr forecast.]
11 Selected data gathering possible with NEXRAD
Moore OK tornado, 3 May 1999. Steerable, 90° sweep.
12 Data Subsystem
- Philosophy of service oriented architecture
  - Motivate understanding of remainder of talk.
- Data subsystem
  - Architecture overview
  - Select component detail
- Significant subsystem accomplishments
- Ongoing deployment and research work
13 Categories of data products
[Figure: taxonomy of LEAD resources, each category labeled with its visibility.]
- Resources (public, personal)
  - Geo-data products: observational data, model generated data, derived data, data analysis results, collections
  - Workflow scripts (public, personal)
  - Compute resources, storage resource (public, personal)
  - External products (public)
  - Services (public, personal)
  - Model input resources
- Personal resources: user's experiment products, personal collections, scripts, input config params.
- Public products: data gathered and made accessible by external data providers.
- External products: data not known to resource catalog.
14 Data Subsystem Architecture
- Access interfaces
  - Personal workspace browser
  - Viz client (IDV)
  - Geospatial Query GUI
  - Ask ontology
- Access services
  - Resource Catalog: LEAD public products and services
  - Query Service: query mediation
  - myLEAD: user's own information catalog
  - Noesis Ontology: concepts and vocabulary
- Resource services
  - Name Service: single global naming system
  - Automated metadata generation: a capability
  - Stream Service: from LDM to user's app
  - THREDDS Catalogs: web browser, metadata
- Resources
  - Grid storage repository
  - Unidata data dissemination client (LDM)
  - Steerable instruments: CASA
  - OPeNDAP data server
15 Brief Tour of Capabilities and Component Functionality
- LEAD metadata schema
- myLEAD personal information service
- Geospatial query service (GEO)
- Noesis Ontology service
- Common Data Model (CDM)
[Figure: control and data paths among the GEO Query GUI, query service, Noesis ontology, resource catalog, myLEAD, and grid storage repository. Labels: control; data; metadata in LEAD schema; data in binary (often); THREDDS, OPeNDAP, LDM; THREDDS, OPeNDAP, CDM.]
16 Geospatial Query GUI
- Uses Minnesota Map Server
- Bounding box selection
- Depicts state boundaries, rivers, counties.
- Bounding box generates lat/lon coordinates for Query Service
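As a rough illustration of how a bounding box drawn on a map image could yield lat/lon coordinates for a query, the sketch below assumes an equirectangular map image with a known geographic extent and pixel rows that grow downward. The function, the image size, and the extent are hypothetical; this is not the GUI's actual code.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


def pixel_box_to_latlon(x0: float, y0: float, x1: float, y1: float,
                        width: int, height: int, extent: BoundingBox) -> BoundingBox:
    """Map a pixel rectangle to lat/lon, assuming a linear (equirectangular) image."""
    def lon(x: float) -> float:
        return extent.west + (x / width) * (extent.east - extent.west)

    def lat(y: float) -> float:
        # Pixel y grows downward, latitude grows upward.
        return extent.north - (y / height) * (extent.north - extent.south)

    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return BoundingBox(west=lon(left), south=lat(bottom), east=lon(right), north=lat(top))


# Hypothetical 800x400 image covering longitude -110..-90 and latitude 30..40.
extent = BoundingBox(west=-110.0, south=30.0, east=-90.0, north=40.0)
box = pixel_box_to_latlon(200, 100, 400, 300, 800, 400, extent)
```

Sorting the corner coordinates lets the user drag the selection in any direction and still get a well-formed box.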
17 LEAD metadata schema
- Needed standard XML description of data products
- Needed compliance with standard
- None suited exactly
- Federal Geographic Data Committee (FGDC) compliant
- LEAD profile extends FGDC schema
18 myLEAD personal catalog
- User's information workspace.
- Stores and serves metadata about products used in and generated during experimental investigation
- Data products themselves reside in grid storage repository
- User sees tree view of holdings
- The user shown has 2 experiments. The right-hand pane shows metadata for the highlighted collection.
19 myLEAD research goals: transparent structure (through agent), privacy and sharing
[Figure: Bob's workspace at three points in time (Dec 04, Feb 05, Mar 05). Each snapshot holds the collections Hurricane Ivan, SE OK quadrant, and Vortice study 98-00. Experiments Experim-Dec04 and Experim-Feb05 appear over time, with contents including input data sets, workflow templates, WRF output files (001.nc . . . 150.nc), and published results. The metadata catalog references physical data storage, e.g. ftp//storageserver.org/file1998o768.]
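The separation between the metadata catalog and physical storage can be sketched as a tree of catalog entries in which only leaf entries carry a storage reference. This is an illustrative data structure only; the class, field names, and the example URL are assumptions, not the myLEAD schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    storage_url: Optional[str] = None  # set for files; collections have none
    children: List["CatalogEntry"] = field(default_factory=list)

    def add(self, child: "CatalogEntry") -> "CatalogEntry":
        """Attach a child entry and return it so calls can be chained."""
        self.children.append(child)
        return child

    def tree(self, depth: int = 0) -> List[str]:
        """Indented tree view of this entry and everything below it."""
        lines = ["  " * depth + self.name]
        for child in self.children:
            lines.extend(child.tree(depth + 1))
        return lines


workspace = CatalogEntry("Bob's workspace")
experiment = workspace.add(CatalogEntry("Experim-Feb05"))
output = experiment.add(CatalogEntry("WRF output files"))
# The URL below is a made-up placeholder for a file held in the storage repository.
output.add(CatalogEntry("001.nc", {"model": "WRF"}, "ftp://storage.example.org/placeholder-001"))
view = workspace.tree()
```

The catalog holds only names, metadata, and references, so the tree view can be served without touching the data files themselves.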
20 Noesis: Ontological Smart Search
Stores relationships between domain-specific concepts and terms
21 Noesis: Ontological Smart Search as a service
[Figure: a "temperature?" query passes from the GEO Query GUI to the query service, which consults the Noesis ontology for related terms (surface temp, upper air temp, radiosonde, METAR) before querying the resource catalog and myLEAD.]
Request fragment shown: <request> <name> upper air temp <value> METAR />
Query fragment shown: SELECT WHERE attrib.name surface temp attrib.value radiosonde
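One way to picture ontology-assisted search is as term expansion followed by a catalog query. The sketch below is a simplified assumption of that flow: the ontology fragment, the `attrib` table name, and both functions are hypothetical and do not reflect the Noesis or query service interfaces.

```python
from typing import Dict, List, Tuple

# Hypothetical ontology fragment: high-level concept -> related catalog terms.
ONTOLOGY: Dict[str, List[str]] = {
    "temperature": ["surface temp", "upper air temp"],
}


def expand(concept: str) -> List[str]:
    """Return catalog terms related to a concept, or the concept itself if unknown."""
    return ONTOLOGY.get(concept, [concept])


def build_query(concept: str) -> Tuple[str, List[str]]:
    """Build a parameterized query over attribute names for the expanded terms."""
    terms = expand(concept)
    placeholders = ", ".join("?" for _ in terms)
    sql = f"SELECT * FROM attrib WHERE attrib.name IN ({placeholders})"
    return sql, terms


sql, params = build_query("temperature")
```

The user asks in domain vocabulary ("temperature") and the expansion step supplies the concrete attribute names that catalogs actually store.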
22 Subsystem-wide Drivers
- Data and query access transparency
- Extensibility
23 Query and Data Access Transparency
- Hide differences in data representation and in the way resources are accessed by users.
- Query on high-level application domain concepts, retrieve results across heterogeneous data products and servers.
- Human and component integration required
  - GEO GUI and Query service: IU
  - Common vocabulary: Millersville, UAH, OU
  - Noesis ontology: UAH
  - myLEAD, Resource Catalog: IU
  - Automated metadata generation: Unidata
  - LEAD metadata schema: UAH, IU, Unidata, NCSA
  - Common data model: Unidata
24 Common Data Model (CDM)
[Figure: NetCDF-Java v 2.2 architecture, providing data transparency for the client. Components shown: Client; NetcdfDataset; NetcdfFile; ADDE; OpenDAP; HDF5; I/O service provider with NetCDF-3, NetCDF-4, GRIB, NIDS, GINI, Nexrad, DMSP.]
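The I/O service provider idea, in which one client-facing interface sits on top of per-format readers, can be sketched as a small plug-in registry. This is a Python illustration of the general pattern, not the NetCDF-Java API; the class and function names are hypothetical, and only the file signatures ("CDF" for classic NetCDF, "GRIB" for GRIB) are real.

```python
from abc import ABC, abstractmethod
from typing import List


class IOServiceProvider(ABC):
    """One provider per file format; the client never sees which one is used."""

    @abstractmethod
    def is_valid(self, header: bytes) -> bool:
        """Return True if this provider recognizes the file from its first bytes."""

    @abstractmethod
    def format_name(self) -> str:
        """Human-readable name of the format this provider handles."""


class NetCDF3Provider(IOServiceProvider):
    def is_valid(self, header: bytes) -> bool:
        return header.startswith(b"CDF")  # classic NetCDF files start with "CDF"

    def format_name(self) -> str:
        return "NetCDF-3"


class GribProvider(IOServiceProvider):
    def is_valid(self, header: bytes) -> bool:
        return header.startswith(b"GRIB")  # GRIB messages start with "GRIB"

    def format_name(self) -> str:
        return "GRIB"


PROVIDERS: List[IOServiceProvider] = [NetCDF3Provider(), GribProvider()]


def detect_format(header: bytes) -> str:
    """Ask each registered provider in turn whether it can read the file."""
    for provider in PROVIDERS:
        if provider.is_valid(header):
            return provider.format_name()
    raise ValueError("no I/O service provider recognizes this file")
```

Supporting a new format then means registering one more provider, with no change to client code.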
25 Conceptual support of expandable architecture: sandbox and crosswalk
- Metadata crosswalk: mapping between the native interface schema supported by an external collection and the LEAD metadata schema.
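A crosswalk of this kind can be thought of as a field-to-field mapping table. In the sketch below, every field name on both sides is invented for illustration; none is an actual element of the LEAD or FGDC schemas.

```python
from typing import Dict, List, Tuple

# Hypothetical crosswalk: native field name -> LEAD-style field name.
CROSSWALK: Dict[str, str] = {
    "title": "lead:title",
    "bbox": "lead:spatialExtent",
    "time": "lead:temporalExtent",
}


def apply_crosswalk(native: Dict[str, str],
                    crosswalk: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Translate a native record; also report fields the crosswalk does not cover."""
    mapped = {crosswalk[key]: value for key, value in native.items() if key in crosswalk}
    unmapped = sorted(key for key in native if key not in crosswalk)
    return mapped, unmapped


record = {"title": "Radar sweep", "bbox": "-105,32.5,-100,37.5", "station": "KTLX"}
mapped, unmapped = apply_crosswalk(record, CROSSWALK)
```

Reporting unmapped fields makes gaps in the crosswalk visible instead of silently dropping metadata.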
26 Adding new catalogs: current vs. future schemes
[Figure: two panels. Current scheme: THREDDS catalogs (A, B) reach the LEAD Resource Catalog through THREDDS plugins. Future scheme: a newly minted catalog X (schema xyz) registers itself with the LEAD Resource Catalog, supplying an X service description and a crosswalk to the LEAD schema (Plugin X), alongside catalogs A, B, and C. A client asks "Who serves precipitation data?" and the answer is "X does".]
The current scheme assumes either
- all catalogs are THREDDS catalogs, or
- we modify the code base of the Resource Catalog for every new catalog.
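The future scheme, in which a new catalog registers itself instead of requiring changes to the Resource Catalog code, might look roughly like the following. The class and method names are hypothetical and the service description is reduced to a list of product names.

```python
from typing import Dict, Iterable, List, Set


class ResourceCatalog:
    """Toy registry: external catalogs announce what they serve."""

    def __init__(self) -> None:
        self._services: Dict[str, Set[str]] = {}

    def register(self, catalog_name: str, served_products: Iterable[str]) -> None:
        """Called by an external catalog to describe itself."""
        self._services[catalog_name] = set(served_products)

    def who_serves(self, product: str) -> List[str]:
        """Answer a client question such as 'Who serves precipitation data?'."""
        return sorted(name for name, products in self._services.items()
                      if product in products)


catalog = ResourceCatalog()
catalog.register("A", ["radar"])
catalog.register("X", ["precipitation", "radar"])
providers = catalog.who_serves("precipitation")
```

Because registration is a call made by the new catalog, adding catalog X requires no edit to the registry itself.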
27 Service-enabled approach to handling streaming data
- Defn: Streaming data: data from CASA radars (and other streaming sources) arrives in the Unidata LDM feed and is written as individual files to a directory managed by the local file system.
- Current service-enabled tools that expect streaming data
  - WDSS II: integrate into 3D grid, visualize and data mine CASA data
  - ADAS: assimilate CASA data onto 3D grid
- Challenges
  - Capabilities not currently in LEAD to process streaming data
  - End-to-end (soft) real-time constraints must be identified
  - File system used by LDM could delay writes
  - Unidata LDM keeps files open for a long time before closing them (should not be a problem with CASA radar data: numerous but small files)
28 What are your plans for tackling the challenges of developing service-enabled tools that can handle streaming data?
- Plans
  - CASA data comes in through LDM as netCDF
  - Assume MDA pulls directly out of LDM
  - Assume CASA data is stored for a minimum of 4 days in LDM
  - Stream service that interoperates between binary stream and notification stream
  - Stream service hosts a long-lived dynamic MDA algorithm that generates trigger metadata about mesoscale characteristics
[Figure: CASA data enters through LDM as a binary stream to the stream service (mining service). Event notifications pass through the event broker to workflow listeners, which invoke the Ensemble Kalman Filtering model.]
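The bridge between a binary data stream and a notification stream can be sketched with a small publish/subscribe broker. Everything here is an assumption made for illustration: the class names, the topic name, the record fields, and especially the detection rule, which is a plain threshold standing in for the MDA algorithm.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]


class EventBroker:
    """Minimal publish/subscribe broker."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, listener: Callable[[Event], None]) -> None:
        self._listeners[topic].append(listener)

    def publish(self, topic: str, event: Event) -> None:
        for listener in self._listeners[topic]:
            listener(event)


class StreamService:
    """Consumes decoded stream records and emits notifications for notable ones."""

    def __init__(self, broker: EventBroker, threshold_dbz: float) -> None:
        self.broker = broker
        self.threshold_dbz = threshold_dbz

    def on_record(self, record: Event) -> None:
        # Placeholder detection rule (simple threshold), not the real MDA algorithm.
        if record["reflectivity_dbz"] >= self.threshold_dbz:
            self.broker.publish("mesoscale-event",
                                {"time": record["time"], "kind": "high-reflectivity"})


triggered: List[Event] = []
broker = EventBroker()
broker.subscribe("mesoscale-event", triggered.append)  # stands in for a workflow listener
service = StreamService(broker, threshold_dbz=50.0)
service.on_record({"time": "12:00", "reflectivity_dbz": 35.0})
service.on_record({"time": "12:05", "reflectivity_dbz": 55.0})
```

Workflow listeners see only the small notification events, while the bulk binary data stays on the data path.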
29 Subsystem Accomplishments
- LEAD metadata schema
  - 12 month highly cooperative effort (3 group-level F2F meetings, agreed upon standards compliance, agreed upon content)
  - V1 released Summer 05
- Subsystem level requirements document
  - Spring 05 effort
  - V1 released June 05
- Interoperability and integration
  - myLEAD, resource catalog integrated with workflow orchestration
  - Portal, query service, ontology, resource catalog and myLEAD: high (concept) level query access transparency
  - Metadata generation: leveraging Unidata Common Data Model
- myLEAD v0.3alpha publicly released open source May 2005
30 Year 3 Deployment Goals
[Figure: the data subsystem architecture annotated with deployment goals.]
- Integration of 4 components
  - Resource Catalog: LEAD public products and services
  - Query Service: query mediation
  - myLEAD: user's own experiment products
  - Noesis Ontology: concepts and vocabulary
- Name Service: unique ID for all products
- Automated metadata generation: a capability
- Stream service deployment: response to weather, stream to app
- THREDDS Catalogs: web browser, metadata
- Grid storage repository
- Unidata data dissemination client (LDM)
- Steerable instruments: CASA
- OPeNDAP data server (deployment)
31 Ongoing Research Goals
- myLEAD
  - Sharing with peers,
  - Versioning experiments through time,
  - Publishing experiment products as LEAD public resource
- Automated metadata generation
  - How much can be accomplished (attribute names only or values as well?) and at what cost?
  - Leverage Common Data Model (CDM) for tool support?
- Provenance: capturing provenance on the fly.
- Noesis: Yellow Pages to data catalogs
  - Semantic mediation
  - Ontology browsing
- Performance scalability of myLEAD and LEAD resource catalog
32 Questions?
33 CASA/LEAD Interaction
[Figure: data exchange between CASA and LEAD. Labels shown: NEXRAD radars; NETRAD data; processed data; experimental numerical weather prediction models; model output; algorithm detections; commands.]
34 Why is quality needed
- For Data User,
  - Decide which data most optimally fits the user's application needs: a choice may be available, or it may prevent surprises later if inappropriate data is used
- For Data Creator,
  - Feedback to improve quality of data creation process
- For Data Provider,
  - Feedback to improve quality of data (error correction, curation)
  - Archive poor quality data
35 Factors affecting quality for user
- Availability of data, quality of service providing data
- Access restrictions (e.g. ACARS or private data of other users)
- Based on error margins of parameters in dataset (precision, accuracy)
- Based on who created it, who is using it (reputation, trust)
- How recently it was created (timeliness)
- Whether all data in collection is available (consistency, completeness), e.g. if LDM had an outage and some timesteps are missing
- Subjective vs. Objective (e.g. reputation vs. precision)
- Intrinsic vs. Extrinsic (e.g. accuracy vs. availability)
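The completeness factor (missing timesteps after an outage) lends itself to a simple check. The sketch below assumes a collection is expected to hold one product per fixed interval; the function names and the sample times are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def missing_timesteps(received: List[datetime], start: datetime,
                      end: datetime, step: timedelta) -> List[datetime]:
    """Expected timesteps in [start, end] that are absent from the collection."""
    have = set(received)
    missing = []
    t = start
    while t <= end:
        if t not in have:
            missing.append(t)
        t += step
    return missing


def completeness(received: List[datetime], start: datetime,
                 end: datetime, step: timedelta) -> float:
    """Fraction of expected timesteps that are present (1.0 means complete)."""
    expected = int((end - start) / step) + 1
    gaps = missing_timesteps(received, start, end, step)
    return (expected - len(gaps)) / expected


start = datetime(2005, 8, 1, 12, 0)
end = datetime(2005, 8, 1, 12, 20)
step = timedelta(minutes=5)
# Hypothetical collection with two timesteps lost to an outage.
received = [datetime(2005, 8, 1, 12, 0), datetime(2005, 8, 1, 12, 5),
            datetime(2005, 8, 1, 12, 20)]
gaps = missing_timesteps(received, start, end, step)
score = completeness(received, start, end, step)
```

A score like this is an objective, intrinsic measure, in contrast to subjective factors such as reputation.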
36 How can we record factors affecting quality
- Using historical resource information
  - NWS (bandwidth)
  - Resource Broker tracking history of storage repository, hosts (availability, performance, failures)
- Using creation process
  - Checking provenance log (creator, creating process/workflow, input data and variables used)
- Cracking open files
  - Intrinsic value: looking at parameters, error margins
- Measuring statistical performance over time (e.g. FSL's MADIS Quality Control flags for dataset)
- Based on access policies (myLEAD, resource catalog)
- Estimates of other users (if other users are experts, e.g. students guided by a meteorologist's rating of dataset)
- User feedback, peer rating (collaborative filtering, community driven)
- Frequency of use
38 [Figure: LEAD system architecture in layers.]
- User Workspace and Resources: Portal Gateway (authorization, experiment builder, experiment GUI); My Workspace (personal metadata catalogue, search tools, myExperiment space); My Tools (IDV client); Geo GUI
- Middleware Services and Resources: ontology service; query service; LEAD Resource Catalogue; workflow orchestration engine; monitoring/control service
- Registered with catalogue: ensemble initial condition generation; WRF forecast model/data assimilation; ADAM data mining; mesoscale event detection; datasets
- Computational Layer
- Other labels: used in experiment; publish results to catalogue; response
39 Tenets of SOA by Rich Turner
- Service Boundaries are Explicit
  - Services expressed at their boundary
  - Only ONE way to access information and/or functionality of a service: through capabilities exposed at the Service Boundary.
- Services are Autonomous
  - Services are isolated, independent and interchangeable. Otherwise you end up with a closely coupled system that is fragile and overly complex.
- Services expose Schema and Contract (not Class and Type)
  - Services expose their interfaces and shared information interchange structures through well understood schema, rather than a particular language or platform's representation of class or type.
- Services negotiate using Policy
  - Services must comply with one another's policy requirements in order to interoperate. If you offer a secure, reliable, transacted service, the caller must also support the necessary protocols etc. This protocol negotiation should be performed dynamically.
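The policy tenet can be illustrated with a toy negotiation in which a service states the capabilities it requires and a caller states those it supports. The function is a hypothetical sketch; real policy negotiation involves concrete protocol specifications that are not modeled here.

```python
from typing import List, Set, Tuple


def negotiate(required: Set[str], supported: Set[str]) -> Tuple[bool, List[str]]:
    """Return whether the caller can interoperate, plus any capabilities it lacks."""
    missing = required - supported
    return (not missing, sorted(missing))


# The capability names echo the slide; the caller's capability set is made up.
service_policy = {"secure", "reliable", "transacted"}
caller_capabilities = {"secure", "reliable"}
ok, lacking = negotiate(service_policy, caller_capabilities)
```

Running the check at call time rather than at build time is what makes the negotiation dynamic.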