Title: Discovery Systems: Accelerating Scientific Discovery at NASA
1. Discovery Systems: Accelerating Scientific Discovery at NASA
- Barney Pell, Ph.D.
- NASA Ames Research Center
- Barney.D.Pell _at_ nasa.gov
- Presentation at IAAI-04 panel on The Broader Role
of Artificial Intelligence in Large-Scale
Scientific Research
2. Outline of Talk
- Trends and challenges affecting scientific discovery at NASA
- Distributed Data Search, Access, and Analysis
- Machine-Assisted Model Discovery and Refinement
- Exploratory Environments and Collaboration
- Vision for the future and summary of AI technologies
- Closing remarks
3. Science Discovery Acceleration
- NASA conducts missions to take measurements that produce large amounts of data to support ambitious science goals
- - In-situ observation of deep space for the origin and evolution of life
- - Earth-orbiting satellites for global cause-and-effect relationships
- - Biological experiments to support life in space
- Too much work and expertise are required to perform each of the many steps in a discovery cycle to understand this data
- - Detailed knowledge of the heritage of data and models
- - Hard to invert through a complex processing pipeline
- - Constant reprocessing and reanalyzing as new information becomes available
- The specialized expertise slows the process and also restricts the set of users and scientists using NASA products
4. Discovery Steps and Architectures
- Examples of discovery steps
- - finding and organizing distributed data
- - assessing, filtering, cleaning, and post-processing the data
- - reconciling the differences across diverse data
- - exploring the data sets to discover regularities
- - using the regularities to formulate and evaluate hypotheses
- - testing the hypotheses and comparing alternate hypotheses against each other
- - integrating the data into models
- - linking separate models together
- - running simulations to generate predictive data to compare against observations
- Current technology programs address the difficulties of individual steps, typically in isolation
- - E.g., machine-learning algorithms detect regularities in underlying phenomena but also artifacts of the data collection/processing system.
- - ML algorithms are developed without consideration of the deeper processes by which the data is generated, distributed, and used
- - Data systems are put together without characterizing the data stream to enable new users to analyze the data in unanticipated ways.
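One way to read the steps on this slide is as a chain in which every step records what it did, so that a later user can trace the heritage of a result instead of relying on specialist knowledge. The following is a minimal sketch under stated assumptions: the step names (`clean`, `find_regularity`), the sentinel value, and the toy readings are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List


@dataclass
class Dataset:
    """Values plus a provenance log of every step applied to them."""
    values: List[float]
    provenance: List[str] = field(default_factory=list)


def apply_step(data: Dataset, name: str,
               fn: Callable[[List[float]], List[float]]) -> Dataset:
    """Run one discovery step and append its name to the provenance log."""
    return Dataset(fn(data.values), data.provenance + [name])


def clean(values: List[float]) -> List[float]:
    # Hypothetical filter: drop a sentinel value that marks missing readings.
    return [v for v in values if v != -999.0]


def find_regularity(values: List[float]) -> List[float]:
    # Hypothetical stand-in for "discovering a regularity": the mean reading.
    return [mean(values)]


raw = Dataset([1.0, -999.0, 3.0, 5.0], provenance=["raw instrument readings"])
cleaned = apply_step(raw, "clean", clean)
result = apply_step(cleaned, "find_regularity", find_regularity)
print(result.values)
print(result.provenance)
```

Keeping the log next to the values is only one possible design; it is chosen here because it makes the processing history travel with the data rather than living in a separate system.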
5. Trends affecting NASA
- Improvements in sensors, communications, and computing
- - orders of magnitude more data, in more varieties, and at higher rates than ever before.
- NASA's science questions are becoming increasingly large-scale and interdisciplinary.
- - forming and evaluating theories across a wide variety of data
- - integrating a complex set of models produced by diverse communities of scientists
- - virtual projects comprising distributed teams
- Socioeconomic demands are requiring increased quality
- - E.g., many customers for weather and climate model predictions
- - Need characterization of confidence in data, models, results
- Faster feedback loops in observing/simulation systems
- - make it possible to gather more precise data, often in real time, if only we could understand the existing data quickly enough.
- NASA is required to enable public access to, and benefit from, the data to the same extent as the mission science team
6. Distributed Search, Access and Analysis
- Objective
- - Develop and demonstrate technologies to enable investigating interdisciplinary science questions by finding, integrating, and composing models and data from distributed archives, pipelines, running simulations, and running instruments.
- - Support interactive and complex query formulation, with constraints and goals in the queries, and resource-efficient intelligent execution of these tasks in a resource-constrained environment.
- Milestone: Enable novel what-if and predictive question answering
- - Across NASA's complex and heterogeneous data and simulations
- - By non-data-specialists
- - Use world knowledge and metadata
- - Support query formulation and resource discovery
- Example query: Within 20%, what will be the water runoff in the creeks of the Comanche National Grassland if we seed the clouds over southern Colorado in July and August next year?
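The example query combines a goal (runoff), a what-if intervention (cloud seeding), and a constraint (an accuracy bound). A minimal sketch of how such a query and a metadata-based resource lookup might be represented follows. Everything here is an assumption made for illustration: the class names, the tags, the catalog entries, and the reading of "within 20" as a 20 percent relative tolerance.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class WhatIfQuery:
    quantity: str        # what the user wants predicted
    region: str
    intervention: str    # the "what-if" part of the question
    period: str
    tolerance: float     # assumed meaning of "within 20": 0.20 relative error


@dataclass(frozen=True)
class Resource:
    name: str
    provides: FrozenSet[str]   # metadata tags describing what the resource offers


def discover(needed: FrozenSet[str], catalog: List[Resource]) -> List[str]:
    """Return names of resources whose metadata covers at least one needed tag."""
    return [r.name for r in catalog if r.provides & needed]


query = WhatIfQuery(
    quantity="water runoff",
    region="creeks of the Comanche National Grassland",
    intervention="cloud seeding over southern Colorado",
    period="July and August next year",
    tolerance=0.20,
)

# Hypothetical catalog; a real one would be populated from archive metadata.
catalog = [
    Resource("hydrology_model", frozenset({"runoff", "snow melt"})),
    Resource("rainfall_archive", frozenset({"precipitation"})),
    Resource("ocean_height_archive", frozenset({"sea surface height"})),
]
needed = frozenset({"runoff", "precipitation"})
print(discover(needed, catalog))
```

Tag matching is the simplest possible form of resource discovery; the slide's mention of world knowledge suggests richer matching, which this sketch does not attempt.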
7. Terrestrial Biogeoscience Involves Many Complex Processes and Data
[Figure: coupled atmosphere and land processes, grouped by time scale]
- Atmosphere: chemistry (CO2, CH4, N2O, ozone, aerosols); climate (temperature, precipitation, radiation, humidity, wind)
- Exchanges: heat, moisture, momentum; CO2, CH4, N2O, VOCs, dust
- Minutes-to-hours: biogeophysics (energy, water, aerodynamics); biogeochemistry (carbon assimilation, decomposition, mineralization); microclimate and canopy physiology
- Days-to-weeks: hydrology (intercepted water, soil water, snow); phenology (bud break, leaf senescence); evaporation, transpiration, snow melt, infiltration, runoff; gross primary production, plant respiration, microbial respiration, nutrient availability
- Years-to-centuries: ecosystems (species composition, ecosystem structure); watersheds (surface water, subsurface water, geomorphology); disturbance (fires, hurricanes, ice storms, windthrows); vegetation dynamics; hydrologic cycle
(Courtesy Tim Killeen and Gordon Bonan, NCAR)
8. Solution Construction via Composing Models
[Figure: models connected through service interfaces. Diagram labels: modeled phenomenon (three instances), parameterized phenomenon, climate model, rainfall, snow melt, surface water community, Nat. Weather Service, USGS, binary data streams, metadata]
- Service interface: required inputs, provided outputs, data descriptions, events
- Each model typically has a community of experts that deals with the complexity of the model and its environment
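The slide describes each model as exposing a service interface with required inputs and provided outputs. One way to compose such models is to match each required input to the model that provides it and then run the models in dependency order. The sketch below assumes hypothetical interface contents (the input and output names are invented for illustration) and uses the standard-library `graphlib.TopologicalSorter` for the ordering.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical service interfaces: required inputs and provided outputs.
MODELS: Dict[str, Dict[str, Set[str]]] = {
    "climate_model": {"requires": set(), "provides": {"temperature"}},
    "rainfall": {"requires": set(), "provides": {"precipitation"}},
    "snow_melt": {"requires": {"temperature", "precipitation"},
                  "provides": {"melt_water"}},
    "surface_water": {"requires": {"melt_water", "precipitation"},
                      "provides": {"runoff"}},
}


def execution_order(models: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """Order models so every required input is produced before it is consumed."""
    # Map each output name to the model that provides it.
    producer = {out: name for name, m in models.items() for out in m["provides"]}
    graph: Dict[str, Set[str]] = {}
    for name, m in models.items():
        missing = m["requires"] - set(producer)
        if missing:
            raise ValueError(f"{name} has no provider for {sorted(missing)}")
        # A model depends on the producers of all of its required inputs.
        graph[name] = {producer[i] for i in m["requires"]}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(MODELS)
print(order)
```

The sketch assumes each output has exactly one provider. Where several communities offer competing models of the same phenomenon, choosing among them would need the data descriptions and metadata that the slide also lists as part of the interface.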
9. Virtual Data Grid Example
Application: Three data types of interest: ? is derived from ?, ? is derived from ?, which is primary data (interactions and operations proceed left to right)
[Diagram; the data-type symbols did not survive extraction and appear as "?"]
- Components
- - Abstract Planner (for materializing data)
- - Concrete Planner (generates workflow)
- - Metadata Catalogue
- - Virtual Data Catalogue (how to generate ? and ?)
- - Materialized Data Catalogue
- - Grid workflow engine
- - Grid compute resources
- - Grid storage resources
- - Data Grid replica services
- Messages exchanged
- - Need ?
- - Have ?
- - Request ?
- - Proceed?
- - How to generate ? (? is at ?LFN)
- - Estimate for generating ?
- - ? is known. Contact Materialized Data Catalogue.
- - Exact steps to generate ?
- - Resolve ?LFN
- - ?PFN
- - Materialize ? with ?PERS
- - ? is materialized at ?LFN
- - Need to materialize ?
- - Inform that ? is materialized
- - Store an archival copy, if so requested. Record existence of cached copies.
- LFN: logical file name; PFN: physical file name; PERS: prescription for generating unmaterialized data
- As illustrated, it is easy to deadlock without QoS and SLAs.
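The core idea of the virtual data example is that a request for a logical file name (LFN) is answered from the materialized catalogue when possible and otherwise generated from a prescription (PERS), recursively, down to primary data. The sketch below is a single-process illustration under stated assumptions: the names `alpha`, `beta`, and `gamma` are hypothetical stand-ins for the three data types, the catalogues are plain dictionaries, and strings stand in for physical files.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical names: "alpha" is primary data, "beta" is derived from "alpha",
# and "gamma" is derived from "beta".
Prescription = Tuple[str, Callable[[str], str]]   # (input LFN, transformation)

virtual_catalogue: Dict[str, Prescription] = {
    "beta": ("alpha", lambda d: f"f({d})"),
    "gamma": ("beta", lambda d: f"g({d})"),
}
# LFN -> stored content; the stored string stands in for a physical file.
materialized: Dict[str, str] = {"alpha": "raw"}
log: List[str] = []


def materialize(lfn: str) -> str:
    """Return data for an LFN, generating it from its prescription if needed."""
    if lfn in materialized:
        log.append(f"have {lfn}")
        return materialized[lfn]
    if lfn not in virtual_catalogue:
        raise KeyError(f"no prescription for {lfn}")
    source, transform = virtual_catalogue[lfn]
    log.append(f"need {source} to generate {lfn}")
    data = transform(materialize(source))
    materialized[lfn] = data            # record the cached copy
    log.append(f"{lfn} is materialized")
    return data


result = materialize("gamma")
print(result)
print(log)
```

Because everything runs in one process, this sketch cannot deadlock. In the distributed setting on the slide, each of these steps waits on separate planners, catalogues, and compute and storage resources, which is where the slide's concern about QoS and SLAs applies.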
10. Machine-Assisted Model Discovery and Refinement
- Develop and demonstrate methods to
- - assist discovery of, and fit, physically descriptive models with quantifiable uncertainty for estimation and prediction
- - improve the use of observational or experimental data for simulation and assimilation applied to distributed instrument systems (e.g., sensor web)
- - integrate instrument models with physical domain modeling and with other instruments (fusion) to quantify error, correct for noise, and improve estimates and instrument performance.
- E.g., metrics
- - 50% reduction in scientist time forming models
- - 10% reduction in uncertainty in parameter estimates or a 10% reduction in effort to achieve current accuracies
- - 10% reduction in computational costs associated with a forward model
- - ability to process data on the order of 1000s of dimensions
- - ability to estimate parameters from tera-scale data.
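To make "quantifiable uncertainty" in parameter estimates concrete, the following sketch fits a straight line by ordinary least squares and reports a standard error for each parameter. It is a textbook illustration chosen for brevity, not a method described in the talk; the function name `fit_line` and the sample data are hypothetical.

```python
import math
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float, float, float]:
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, se_a, se_b), where se_a and se_b are the standard
    errors of the intercept and slope estimates.
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 points to estimate uncertainty")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # Residual variance with n - 2 degrees of freedom.
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return a, b, se_a, se_b


# Hypothetical noisy observations of a roughly linear process.
a, b, se_a, se_b = fit_line([0.0, 1.0, 2.0, 3.0], [1.1, 2.9, 5.2, 6.8])
print(f"intercept = {a:.3f} +/- {se_a:.3f}, slope = {b:.3f} +/- {se_b:.3f}")
```

A metric such as "reduction in uncertainty in parameter estimates" could then be tracked as a change in these standard errors, although physically descriptive models of the kind the slide has in mind would generally be nonlinear and need other estimation methods.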
11. Prediction of the 97/98 El Niño
[Figure: JFM 1998 predicted precipitation; time labels 1997 and 1999]
A reasonable 15-month prediction of the 97/98 El Niño is achieved when ocean height, temperature, and surface wind data are combined to initialize the model.
12. Observing System of the Future
- Partners
- NASA
- DoD
- Other Govt
- Commercial
- International
- Information Synthesis
- Access to Knowledge
User Community
Information
13. Exploratory Environments and Collaboration
- Objective
- - Develop exploratory environments in which interdisciplinary and/or distributed teams visualize and interact with intelligently combined and presented data from such sources as distributed archives, pipelines, simulations, and instruments in networked environments.
- - Demonstrate that these environments measurably improve scientists' capability to answer questions, evaluate models, and formulate follow-on questions and predictions.
14. Multi-parameter Explorations
16. Vision for future science
(Each technical area is listed with the state of practice today and the vision for tomorrow)
- Distributed Data Search, Access, and Analysis
- - Today: Answering queries requires specialized knowledge of content, location, and configuration of all relevant data and model resources. Solution construction is manual.
- - Tomorrow: Search queries based on high-level requirements. Solution construction is mostly automated and accessible to users who aren't specialists in all elements.
- Machine integration of data / QA
- - Today: Publishing a new resource takes 1-3 years. Assembling a consistent heterogeneous dataset takes 1-3 years. Automated data quality assessment by limits and rules.
- - Tomorrow: Publishing a new resource takes 1 week. Assembling a consistent heterogeneous dataset in real time. Automated data quality assessment by world models and cross-validation.
- Machine-Assisted Model Discovery and Refinement
- - Today: Physical models have hidden assumptions and legacy restrictions. Machine learning algorithms are separate from simulations, instrument models, and data manipulation codes.
- - Tomorrow: Prediction and estimation systems integrate models of the data collection instruments, simulation models, and observational data formatting and conditioning capabilities. Predictions and estimates with known certainties.
- Exploratory environments and collaboration
- - Today: Co-located interdisciplinary teams jointly visualize multi-dimensional preprocessed data or ensembles of running simulations on wall-sized matrixed displays.
- - Tomorrow: Distributed teams visualize and interact with intelligently combined and presented data from such sources as distributed archives, pipelines, simulations, and instruments in networked environments.
17. Discovery Systems AI Technology Elements
- Distributed data search, access and analysis
- - Grid-based computing and services
- - Information retrieval
- - Databases
- - Planning, execution, agent architecture, multi-agent systems
- - Knowledge representation and ontologies
- Machine-assisted model discovery and refinement
- - Information and data fusion
- - Data mining and machine learning
- - Modeling and simulation languages
- Exploratory environments and collaboration
- - Visualization
- - Human-computer interaction
- - Computer-supported collaborative work
- - Cognitive models of science
18. Closing remarks
- NASA science is challenging
- - Need to improve existing capabilities and address emerging trends
- AI technologies have a crucial role for future science
- - Distributed Data Search, Access, and Analysis
- - Machine-Assisted Model Discovery and Refinement
- - Exploratory Environments and Collaboration
- Many of these themes are shared with science (or research) at large