Title: Information Technology
1Information Technology Systems Center
- Sara J. Graves
- Director, Information Technology and Systems Center
- University Professor, Computer Science Department
- University of Alabama in Huntsville
- Director, Information Technology Research Center
- National Space Science and Technology Center
- 256-824-6064
- sgraves_at_itsc.uah.edu
http://www.itsc.uah.edu
2Invasive Species Data Service IT vision
- Networked, distributed system to integrate NASA ESE and NBII data sources
- Customized, easily accessible data products
- Aggregated, thematic, interdisciplinary
- Web-based service interoperability
- User interface tailored to user community
3IT Components
- Data catalogs
- To provide transparent access to ESE, NBII, and user-supplied data resources
- Service catalogs
- Service metadata, semantics
- Service Integration Function
- Loosely coupled, dynamically bound services (read routines, georegistration, mining modules, subsetting, reprojection, aggregation)
- Standard service chaining protocols (SOAP, OGC, etc.)
- To support
- Basic pre-processing steps performed dynamically with minimal intervention by the user
- Creation of aggregated, thematic, interdisciplinary data products
4ISDS Services for Customized Data Products
[Diagram. Labels shown: Scientists and Policy Makers; ISDS User Interface (Display); Service Integrator; Data and Service Catalogs (NBII, ECHO, GCMD, ESML registry, local and distributed ontologies); custom data processing and delivery service chain (ESML data reader, Subsetter, Re-grid, Miner, Aggregation); Data; Services.]
5ITSC Relevant Skills and Technologies
- Accessing and using heterogeneous, distributed data
- ESML Interchange Technologies
- Subsetting spatial data
- HEW Subsetting Engine
- ESML-based Subsetting
- Analysis tools for data mining and image processing
- ADaM
- Distributed Science Data and Information Management
- EOSDIS node development and operation (GHRC, LIS SCF)
- ESIP Federation (GHRC, PM-ESIP, Interoperability and Technology Committee, GIS Cluster, Federation web site)
- Distributed processing and delivery (AMSR-E SIPS)
6Data Usability Success Builds on the Integration
of Domain Science and Information Technology
- Collaborations
- Accelerate research process
- Maximize knowledge discovery
- Minimize data handling
- Contribute to both fields
7Improving Data Usability
- Advanced Applications Development
- Data organization and management for archival and analysis
- Data mining in real time and for post-run analysis
- Interchange Technologies for improved data exploitation
- Semantics to transform data exploitation via intelligent automated processing
- Infrastructure Development
- Grid technologies for seamless access to multiple computational and data resources combined into a virtual computing environment
- Cluster technologies for high-speed parallel computation, multiple-agent computations, and other applications
- High-performance networking for advanced applications development and high-speed connectivity
- Next-generation technologies in videoconferencing and electronic collaboration
8Heterogeneity Leads to Data Usability Problems
- Earth Science Data Characteristics
- Many different formats, types and structures (18 and counting for atmospheric science alone!)
- Different states of processing (raw, calibrated, derived, modeled or interpreted)
- Enormous volumes
9Interoperability: Accessing Heterogeneous Data
[Diagram. The Problem: applications reach DATA FORMAT 1, 2, and 3 through a FORMAT CONVERTER or per-format readers (READER 1, READER 2). The Solution: each data format is paired with an ESML FILE, and the application reads all of them through the ESML LIBRARY.]
- One approach: Enforce a standard data format, but
- Difficult to implement and enforce
- Can't anticipate all needs
- Some data can't be modeled or is lost in translation
- Converting legacy data is costly
- A better approach: Interchange Technologies
- Earth Science Markup Language
10What is ESML?
- It is a specialized markup language for Earth Science metadata based on XML, NOT another data format.
- It is a machine-readable and -interpretable representation of the structure, semantics and content of any data file, regardless of data format
- ESML description files contain external metadata that can be generated by either data producer or data consumer (at collection, data set, and/or granule level)
- ESML provides the benefits of a standard, self-describing data format (like HDF, HDF-EOS, netCDF, geoTIFF) without the cost of data conversion
- ESML is the basis for core Interchange Technology that allows data/application interoperability
- ESML complements and extends data catalogs such as FGDC and GCMD by providing the use/access information those directories lack.
- http://esml.itsc.uah.edu
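The idea of an external description file can be illustrated with a minimal Python sketch. The XML vocabulary below (`Description`, `Field`, and the `byteorder`, `name`, `type`, `count` attributes) is hypothetical and is not the ESML schema. The sketch only shows how one generic reader might decode a binary record when its layout is stated in a separate machine-readable description rather than in the data file itself.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description vocabulary (NOT the real ESML schema).
DESCRIPTION_A = """
<Description byteorder="little">
  <Field name="latitude" type="float32" count="1"/>
  <Field name="longitude" type="float32" count="1"/>
  <Field name="skin_temp" type="int16" count="3"/>
</Description>
"""

# struct format codes for the few types this sketch supports.
TYPE_CODES = {"int16": "h", "int32": "i", "float32": "f", "float64": "d"}


def read_record(description_xml, raw_bytes):
    """Decode one binary record using only its external description."""
    root = ET.fromstring(description_xml)
    prefix = "<" if root.get("byteorder", "little") == "little" else ">"
    record, offset = {}, 0
    for field in root.findall("Field"):
        count = int(field.get("count", "1"))
        fmt = prefix + str(count) + TYPE_CODES[field.get("type")]
        values = struct.unpack_from(fmt, raw_bytes, offset)
        offset += struct.calcsize(fmt)
        record[field.get("name")] = values[0] if count == 1 else list(values)
    return record


# Build a sample record in the layout that DESCRIPTION_A describes.
sample = struct.pack("<ff3h", 34.5, -86.5, 290, 291, 292)
print(read_record(DESCRIPTION_A, sample))
```

Supporting a differently laid-out file in this sketch means writing a new description, not a new reader or a converted copy of the data.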
11ESML Components
12ESML v3.0 Library Layered Design
- The core ESML library provides the basic functionality of reading structural metadata from the ESML file and returning data to the user
- Intuitive user API based on the analogy of file access in a directory structure
- Plug-in modules for each individual format allow flexible packaging of libraries
- Simple Plug-in API for easy addition of new formats
- Additional software can be easily added to provide other functions
- Versions Available
- C for Windows and Linux
- Python (pyESML)
[Diagram. Layer labels shown: Level 1 API; Semantic Parser; User Level 0 API; DOM Tree; Plug-in API; format plug-ins including Binary and HDF-EOS.]
13ESML IN ACTION: Ingest surface skin temperature data in Numerical Models
- Purpose
- Use ESML to incorporate observational data into the numerical models for simulation
- Skin temperatures come in a variety of data formats
- GOES - McIDAS
- Reanalysis Data - GRIB
- MM5 Model - Binary
- AVHRR - HDF
- MODIS - EOS-HDF
[Diagram. Labels shown: Reanalysis GRIB files, MM5, and GOES data, each with an ESML file.]
- Scientists can
- Select remote files across the network
- Select different observational data to increase the model prediction accuracy
http://vortex.nsstc.uah.edu/sud/web/default.htm
14ESML IN ACTION: Collocation Algorithm
[Diagram. Labels shown: MODIS, CERES, and MISR/Others data, each with an ESML file; ESML Library; Collocation Algorithm; Analysis.]
- Purpose
- To study the relationship between shortwave flux and cloud or aerosol properties
- Important for climate change studies
http://vortex.nsstc.uah.edu/seth/multiplot.htm
15ESML-Enabled Generic Subsetter
[Diagram. Labels shown: HDF-EOS, Binary/ASCII, and Other Formats, each with an ESML file; Network; ESML Library; Subsetting Algorithm; Subsetted Data.]
For HDF-EOS data not formatted for subsetting with the HDF-EOS library, an ESML file can be used to correct the semantic tag required to subset the data, without the need to recreate the data file.
16Smart Applications/Services using ESML and
Ontologies
- The ESML Schema's focus is on providing structural data interoperability between data and applications
- However, ESML allows embedding semantic terms for data fields in the Description File to provide a complete structural and semantic description of the data
- Various science communities can create their own ontologies (for example, SWEET) and link them with ESML Description Files for their data
- Application developers can add semantic parsers on top of the core ESML Library to build smart applications or services
[Diagram. Labels shown: ESML Schema (Structural Information); Ontologies (Semantic Information).]
17Prototype Smart Subsetter
- To demonstrate a smart application using ESML and ontologies, a subsetting prototype is being developed
- Subsetting is a frequent preprocessing step used by scientists to reduce the size and complexity of the data
- The subsetting prototype parses the semantic tags embedded in the ESML Description File
- The subsetting prototype then uses the linked ontologies to decipher the meaning of these tags to make useful decisions
- Components of this Prototype
- Simple ontologies describing Subsetting and Dataset
- ESML Description Files
- Reasoning System used as an inference engine
[Figures: Dataset Ontology; Subsetting Ontology]
18Current Status
- ESML data formats
- Currently supported
- ASCII, Binary, HDF-EOS, netCDF, GRIB, HDF5, BUFR, NEXRAD Level II
- ESML Library
- Currently available
- C for Windows and Linux, Python plugin, IDL plugin
- ESML DODS Server
- ESML Editor application
- ESML Data Browser
- http://esml.itsc.uah.edu
19Subsetting
- Goal: to provide a science data user with only the data they request as quickly as possible.
- Benefits science data users and data centers
- reduces analysis time by reducing amount of data
- reduces time for data delivery
- reduces resources (network, personnel, media, etc.)
- Steps
- locate spatial / temporal / spectral area of interest
- extract
- re-assemble for distribution
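The three steps above (locate, extract, re-assemble) can be sketched for a simple latitude/longitude grid. This is a minimal illustration in plain Python, assuming a regular 2-D grid held as nested lists. The function names and the `bbox` ordering are hypothetical and do not reflect any ITSC tool's interface.

```python
def locate(axis, low, high):
    """Step 1: return indices of axis values that fall inside [low, high]."""
    return [i for i, value in enumerate(axis) if low <= value <= high]


def subset_grid(lats, lons, grid, bbox):
    """Subset a 2-D grid (rows follow lats, columns follow lons).

    bbox is (south, north, west, east) in the same units as the axes.
    """
    south, north, west, east = bbox
    rows = locate(lats, south, north)
    cols = locate(lons, west, east)
    # Step 2: extract only the requested cells.
    data = [[grid[r][c] for c in cols] for r in rows]
    # Step 3: re-assemble a smaller product with matching coordinate axes.
    return {
        "lats": [lats[r] for r in rows],
        "lons": [lons[c] for c in cols],
        "data": data,
    }


# Toy 4 x 5 grid; each value encodes its position as row * 10 + column.
lats = [30.0, 31.0, 32.0, 33.0]
lons = [-90.0, -89.0, -88.0, -87.0, -86.0]
grid = [[r * 10 + c for c in range(len(lons))] for r in range(len(lats))]

result = subset_grid(lats, lons, grid, bbox=(31.0, 32.5, -89.5, -87.0))
print(result)
```

Returning the trimmed axes together with the data keeps the subset self-describing, which matters when the result is repackaged for distribution.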
20ITSC Subsetting Tools
- HDF-EOS Subsetting Engine (HSE)
- Dataset-independent subsetting service for HDF-EOS data
- Callable function for integration into other applications
- Available as stand-alone executable
- Integrated into data ordering and delivery components of EOSDIS Core System
- Planned web service available through ECHO and other service brokers
- Specialized subsetting and data aggregation tools for MODIS Land team
- modland subsetter for MODIS gridded data
- stitcher pieces together 2 or 4 contiguous MODIS tiles
- ESML-based subsetting and related data services
21HSE: HDF-EOS Subsetting Engine
- Callable function can be integrated into other applications
- Uses HDF-EOS (and HDF) library
- Handles Swath and/or Grid objects
- Unix (SGI, Sun) available (Linux planned)
- Optionally updates metadata in output files to contain
- StructMetadata (HDF-EOS)
- ArchiveMetadata
- ProductMetadata (added by HEW subset request)
- CoreMetadata (w/ modified bounding box and time info)
- optionally placed in .met file
- if present in parent file
22HSE Subsettable datasets
- EOS DATASETS
- Terra
- MODIS
- MOPITT
- ASTER
- Aqua
- AMSR-E
- AIRS
- MODIS
- Aura
- HIRDLS
- OTHERS
- TRMM
- TMI
- NOAA-15, 16, 17
- AMSU-A
- Any other HDF-EOS datasets written with HDF-EOS
calls in mind
23HDF-EOS Web-based (HEW) Subsetter
[Diagram. Labels shown: User's Browser (HTML); User Interface (CGI); Input file; Subsetting API (ODL); HEW; HSE; Output file.]
1. The User Interface CGI checks the HDF-EOS file and presents the attributes to the user.
2. The user interacts with the browser to specify the subsetting criteria.
3. The User Interface CGI creates the ODL file with the subsetting criteria.
4. The Subsetter uses the ODL file and the HDF-EOS file to create the subset HDF-EOS file.
24SPOT
- Subsettability checker
- Displays content/structure of HDF-EOS files
- Examines files for subsettability by HSE
- Simple command-line interface
- Stand-alone operation
- Available for SGI and Sun at subset.org
25HEW integration with ECS
[Diagram. Numbered flow (steps 1 to 7) among the end user, the EDG, ECS, and the HDF-EOS Subsetting Appliance (HSE subsetting system). Labels shown: order submission (HTML); data order and reply; subset ODL and reply; input data; output data; output data (reingested).]
26ECS integration status
- EDG v3.5.1 has basic subsetting options
- Operational at NSIDC
- Testing at LPDAAC (EDC)
- Testing to begin at GDAAC soon
- Further enhancements for DAACs
27Subsetting as a Web Service
[Diagram. Labels shown: Subsetting Center (HSE); ECHO; Archive; subset request; URL to data on Archive; subsetted data.]
28Subsetting web site: subset.org
- The subsetting portal is being created for everyone involved in subsetting
- Advertising
- Forums
- Data
- Software
- Glossary
- Tutorials
- Links to specialized subsetters
29Distributed Data Integration
Merged data product for on-demand visualization
[Diagram. Labels shown: Countries; Coastlines; Cyclone Events; MCS Events; AMSU-A Channel 01; Knowledge Base; AMSU-A; ITSC; GLOBE.]
AMSU-A data overlaid with MCS and Cyclone events,
merged with world boundaries from GLOBE.
30Chained Image Processing Services
Service chaining integrates modules or services developed on distributed platforms and in different languages into a single processing solution.
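A chain of this kind can be sketched in a few lines of Python. The three "services" below are hypothetical local stand-ins. In a real deployment each step might be a remote call (for example over SOAP or HTTP) to a module written in another language, but the composition pattern is the same: each service takes a payload and returns an updated one.

```python
from functools import reduce


def chain(*services):
    """Compose services left to right into one callable."""
    def run(payload):
        return reduce(lambda data, service: service(data), services, payload)
    return run


# Hypothetical stand-ins for distributed services.
def read_service(payload):
    """Pretend to read a data granule."""
    return {**payload, "values": [3, 8, 15, 4, 23, 42]}


def subset_service(payload):
    """Keep only the requested index window."""
    start, stop = payload["window"]
    return {**payload, "values": payload["values"][start:stop]}


def threshold_service(payload):
    """Drop values below the requested minimum."""
    kept = [v for v in payload["values"] if v >= payload["minimum"]]
    return {**payload, "values": kept}


pipeline = chain(read_service, subset_service, threshold_service)
result = pipeline({"window": (1, 5), "minimum": 10})
print(result["values"])
```

Because every step shares one payload convention, services can be reordered or swapped without changing the others, which is the property loosely coupled chaining relies on.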
31Data Mining
- Automated discovery of patterns and anomalies from vast observational data sets
- Derived knowledge for decision making, predictions and disaster response
- ADaM: Algorithm Development and Mining System
- http://datamining.itsc.uah.edu
32Data Mining: Types of Mining
- Association Rule Mining
- Initially developed for market basket analysis
- Goal is to discover relationships between attributes
- Uses include decision support, classification and clustering
- Classification and Prediction (Supervised Learning)
- Classifiers are created using labeled training samples
- Training samples created by ground truth / experts
- Classifier later used to classify unknown samples
- Clustering (Unsupervised Learning)
- Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes
- Discover overall distribution patterns and relationships between attributes
- Other Types of Mining
- Outlier Analysis
- Concept / Class Description
- Time Series Analysis
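As a small illustration of the association rule mining listed above, the following Python sketch computes the two standard measures of a rule, support and confidence, over toy market-basket data. The data are invented for illustration and the code is not taken from ADaM.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Toy market-basket data, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule such as "bread implies milk" is usually kept only when both measures exceed user-chosen minimums.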
33ADaM System Overview
- Developed by the Information Technology and Systems Center at the University of Alabama in Huntsville
- Consists of over 75 interoperable mining and image processing components
- Each component is provided with a C application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
- ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
- ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
- Components include Python wrappers; web service interfaces are planned
34ADaM 4.0 Components
35Data Mining in Action
- Grid Mining
- NASA Information Power Grid
- NSF TeraGrid
BioInformatics: Genome Patterns
- Earth Science
- Mining Model Data (Ames, Goddard, SWA)
- Satellite Observations
- Radar Observations
Space Science: Polar Cap Boundary in Auroras
36Classification of Tabular Data
- Wisconsin breast cancer data in ARFF format, from the University of California Irvine (UCI) Machine Learning Database
- http://www.ics.uci.edu/mlearn/MLRepository.html
- The Naïve Bayes classifier will be trained to distinguish malignant vs. benign tumors based on nine characteristics
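The classification step can be illustrated with a minimal Gaussian Naïve Bayes classifier in plain Python. This is a sketch under stated assumptions: it uses two invented features rather than the nine characteristics of the UCI records, it assumes each feature is normally distributed within a class, and it is not the ADaM implementation.

```python
import math
from collections import defaultdict


def train(samples, labels):
    """Estimate per-class priors and per-feature mean and variance."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids zero variance
        model[label] = (len(rows) / len(samples), stats)
    return model


def predict(model, features):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Toy two-feature data, invented for illustration (not the UCI records).
X = [[1, 1], [2, 1], [1, 2], [8, 7], [9, 8], [7, 9]]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
model = train(X, y)
print(predict(model, [2, 2]), predict(model, [8, 8]))
```

The "naïve" part is the loop in `log_posterior`: each feature contributes independently, so training only needs one mean and one variance per feature per class.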
37Cumulus Cloud Classification
- Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
- ADaM allows comparison of many different classification techniques based on accuracy of detection and amount of time required to classify
- Best algorithm can be used to create cloud mask product
[Figure panels: Original; GLRL; Association Rules; GLCM]
38Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
- Mining Plan
- Water cover mask to eliminate land
- Laplacian filter to compute temperature gradients
- Science Algorithm to estimate wind speed
- Contiguous regions with wind speeds above a desired threshold identified
- Additional test to eliminate false positives
- Maximum wind speed and location produced
[Diagram. Labels shown: Data Archive; Calibration / Limb Correction / Converted to Tb; Mining Environment; Result (Hurricane Floyd); Knowledge Base; Further Analysis.]
Results are placed on the web, made available to the National Hurricane Center and Joint Typhoon Warning Center, and stored for further analysis.
http://pm-esip.msfc.nasa.gov/cyclone
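The mining plan above can be sketched end to end on a toy grid. This is an illustration only: `estimate_wind` is a made-up linear scaling standing in for the science algorithm, the `min_cells` check is a crude stand-in for the false-positive test, and the temperatures are invented. None of it reproduces the operational AMSU-A processing.

```python
from collections import deque


def laplacian(field):
    """4-neighbour discrete Laplacian; border cells are left at 0."""
    rows, cols = len(field), len(field[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (field[r - 1][c] + field[r + 1][c] + field[r][c - 1]
                         + field[r][c + 1] - 4 * field[r][c])
    return out


def estimate_wind(gradient):
    """Placeholder for the science algorithm: a made-up linear scaling."""
    return 10.0 * abs(gradient)


def regions_above(wind, water, threshold):
    """Group 4-connected water cells whose wind speed exceeds the threshold."""
    rows, cols = len(wind), len(wind[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not water[r][c] or wind[r][c] <= threshold:
                continue
            queue, region = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and water[ny][nx] and wind[ny][nx] > threshold):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def detect(temps, water, threshold, min_cells=1):
    """Return (maximum wind speed, location) for each accepted region."""
    lap = laplacian(temps)
    wind = [[estimate_wind(g) for g in row] for row in lap]
    results = []
    for region in regions_above(wind, water, threshold):
        if len(region) < min_cells:  # stand-in for the false-positive test
            continue
        peak = max(region, key=lambda cell: wind[cell[0]][cell[1]])
        results.append((wind[peak[0]][peak[1]], peak))
    return results


# Toy 5 x 5 brightness temperatures with one warm cell; column 0 is land.
temps = [[250] * 5 for _ in range(5)]
temps[2][2] = 255
water = [[c != 0 for c in range(5)] for _ in range(5)]
print(detect(temps, water, threshold=40.0))
```

Each function maps to one line of the plan, so an individual step (for example the wind estimate) can be replaced without touching the region search.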
39Mining Model Data
- To advance its capacity in information extraction
from models, the Global Modeling and Assimilation
Office at GSFC, ITSC and Simpson Weather
Associates propose to apply data mining
frameworks for the analysis and information
extraction of numerical model output data
generated or archived at the GMAO. This will be
done by conducting experiments focusing on the
automated detection and mining of atmospheric
phenomena relationships within the model data.
- Tropical Cyclone Identification
- The heuristic procedure considered all tropical ocean pixels and accepted those that
- Had surface pressure below a certain threshold (990)
- Had vorticity above a certain threshold (15)
- As an alternative to the heuristic procedure, a clustering algorithm was used to derive the signature of the cyclones
- Using pressure, vorticity
- Using pressure, vorticity, temperature, cloud total
- Using pressure, vorticity, cloud low
Sea Level Pressure Global Map
Wind Vector Overlay - Detail
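The heuristic procedure above reduces to two threshold tests per pixel. The Python sketch below applies them to invented pixel records. The thresholds 990 and 15 are the values given on the slide, whose units are not stated there. The record layout and the precomputed `tropical_ocean` flag are assumptions made for illustration.

```python
PRESSURE_MAX = 990.0   # threshold from the slide; units are not stated there
VORTICITY_MIN = 15.0   # threshold from the slide; units are not stated there


def is_cyclone_candidate(pixel):
    """Apply the two-threshold heuristic to one model pixel."""
    return (pixel["tropical_ocean"]
            and pixel["pressure"] < PRESSURE_MAX
            and pixel["vorticity"] > VORTICITY_MIN)


# Invented pixel records; the field names are hypothetical.
pixels = [
    {"id": "a", "tropical_ocean": True, "pressure": 985.0, "vorticity": 20.0},
    {"id": "b", "tropical_ocean": True, "pressure": 1005.0, "vorticity": 25.0},
    {"id": "c", "tropical_ocean": True, "pressure": 980.0, "vorticity": 5.0},
    {"id": "d", "tropical_ocean": False, "pressure": 975.0, "vorticity": 30.0},
]
candidates = [p["id"] for p in pixels if is_cyclone_candidate(p)]
print(candidates)
```

The clustering alternative described above replaces these fixed thresholds with a signature learned from the same variables.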
40Global Hydrology Resource Center Data Systems
and Services for Earth Science
- Practical application of information technology research through the Global Hydrology Resource Center
- Provide data and advanced information science applications to the science community, thereby enabling research and discovery
- LIS SCF (1997-)
- Passive Microwave ESIP (1998-2004)
- AMSR-E SIPS (1999-)
- MSFC DAAC (1992-1997)
- ESIP Federation Web Site (1999-)
- CAMEX 3, 4 (Fall 1998, Fall 2001)
- ACES (Aug/Sept 2002)
- SERVIR (2003-2008)
- DISCOVER (2003-2008)
41Event-driven Data Delivery Example
Event detection triggers dynamic packaging of a related suite of data products, delivered immediately to subscribers and made available to other users on the web.
[Diagram. Labels shown: Satellite Data (AMSR-E, SSM/I, TMI, QuikSCAT, Others); Calibration / Limb Correction / Converted to Tb; Near-Real-Time Mining for Events; Result; Notification; Subset and Packaging Services; Delivery; Science User; Linked from Event Page.]
Results are placed on the web, made available to the National Hurricane Center and Joint Typhoon Warning Center, and stored for further analysis.
http://pm-esip.nsstc.nasa.gov
42Chained ED3 Distributed Services
[Diagram. Labels shown: Data Streams; Data Archive; Data Files; Knowledge Base; Event Listener; Trigger User Subscription; Reader (Windows) with ESML Lib and ESML; Reformat (Linux); Subset (Linux); Package (Linux); Data Delivery To Science User.]
43Related Projects
- SERVIR (NASA REASoN CAN)
- Providing NASA ESE data products and technology to produce and distribute accurate and timely decision support and environmental monitoring data products for Central America and the Mesoamerican Biological Corridor
- ITSC role: Data and information system development, generation of decision support products, environmental monitoring, 3-D visualization of data for national leaders and training.
- http://servir.nsstc.nasa.gov/
- LEAD (NSF Large ITR)
- An integrated, scalable framework for identifying, accessing, preparing, assimilating, predicting, managing, analyzing, mining and visualizing a broad array of meteorological data and model output, independent of format and physical location
- ITSC role: Mining tools and analyses, ESML interchange technologies, data and service semantics
- http://lead.ou.edu/
44Website Home Page
45SIAM SERVIR
[Diagram. Labels shown: SERVIR; Sponsors; Regional Coordination Standards; Training Capacity Building; Scientific Support; Env. Monitor Decision Sup. Applications; Climate Change Modeling; Data Archive Distribution Visualization; NASA/WB/CCAD Project; NASA/USAID Cambio Climatico; NASA Funded Collaborators; Externally Funded Collaborators. Numeric markers (1, 2, 3) in the diagram are not recoverable.]
46
- Partners: Oklahoma - K Droegemeier, UAH - S Graves, Colorado State - V Chandrasekar, Illinois/NCSA - R Wilhelmson, Indiana - D Gannon, UCAR/Unidata - M Ramamurthy, Howard - E Joseph, Millersville - R Clark
- ITSC Contributions
[Diagram. Labels shown: MyLEAD Portal; MyLEAD Virtual Environment; Interchange Technologies; Workflow Orchestration; Semantics for data and services; Personal Data Space; Application Services (Visualization, Data Mining, Models, tools, Others); Middleware (Data Management, Workflow Management, Monitoring); Grid and Web infrastructure (Resource Allocation, Scheduling, Security, Others).]
47On-Board Real-Time Processing Sensor
Control/Targeting
EVE Environment for On-board Processing
- Anomaly detection
- Data Mining
- Autonomous Decision Making
- Immediate response
- Direct satellite to Earth delivery of results
www.itsc.uah.edu/eve
48A Reconfigurable Web of Interacting Sensors
[Diagram. Labels shown: Satellite Constellations (Communications, Weather, Military); Ground Network (three instances).]
49Example Application of EVE Technology: Lightning Detection During Tornadic Activity
1) The user creates a mining plan using the EVE editor
2) The Ground Station uploads the plan to multiple on-board platforms
3) On-board Platform 1 uses its sensor to watch for lightning events
4) Platform 1 notifies Platform 2 of the event
5) Platform 2 requests subsetting web services from an NSSTC server
6) The results are sent back to Platform 1 for display and further processing
50Background Slides
51ISDS in a Nutshell
Success will ultimately require a far more
comprehensive and sophisticated integration of
data from the earth and life sciences than is
currently possible, and will also require that
the multidisciplinary teams who deal with
invasive species issues have far better access to
heterogeneous data resources than is currently
available. That is the key problem that we hope
to address in building the Invasive Species Data
Service.
52How it fits together
The NASA Office of Earth Science and the US
Geological Survey are developing a National
Invasive Species Forecasting System for the
management and control of invasive species on all
Department of Interior and adjacent lands
(Schnase et al., 2002a,b). The forecasting system
will be the first major client of the Invasive
Species Data Service, and the heterogeneous data
ingest need of the system is the principal design
driver for the ISDS. To clarify the relationship
between these components and the relevance of the
proposed work, we first describe the Invasive
Species Forecasting System then show how its data
ingest requirements define a new class of data
services that we intend to introduce by creating
the Invasive Species Data Service.
53Relationship to ISFS
Context diagram for the Invasive
Species Forecasting System. The proposed Invasive
Species Data Service would complete the data
resource connection shown in the lower left oval.
ISDS