Title: Data-Driven Discovery through e-Science Technologies
1. Data-Driven Discovery through e-Science Technologies
George Mason University and QSS Group Inc., NASA-Goddard Space Flight Center
kborne_at_gmu.edu or kirk.borne_at_gsfc.nasa.gov
http://rings.gsfc.nasa.gov/nvo_datamining.html
7/19/2006
2. Looking to NASA's Future: Constellations of Spacecraft and Sensor Webs
- Working Definition of a Constellation or a Sensor Web (Sensor Network): Spatially distributed network of individual vehicles, or assets, acting collaboratively as a single collective unit, exhibiting a common system-wide capability.
- Constellation and Sensor Web operations include homogeneous sensor clusters, heterogeneous distributed sensors, and multiple missions (space and ground).
- Constellation missions and Sensor Webs/Networks will provide opportunities for:
  - Application of on-board data processing functions.
  - Application of parallel data processing techniques.
  - Application of autonomous intelligent systems for mission operations.
  - Application of client-server architecture for inter-sensor communications, for constellation/network operations, and for active mission participation in a Virtual Observatory.
  - Application of interoperable systems techniques within the constellation, and with other sensors (space-based, ground-based, and/or virtual).
  - Application of data reduction (filtering) and code-shipping techniques.
  - Application of data mining for target of opportunity, event, and anomaly detection.
4. Data Challenges for Space Mission Sciencecraft of the Future
- Data and Information Explosion: massive volumes from single missions and from spacecraft suites, constellations, and armadas (sciencecraft)
- Data Discovery within distributed data systems for science decision support
- Transparent Access to Data across heterogeneous mission environments
- Interoperability of systems, metadata, data, information, and knowledge
- New Information Technology Infusion across multiple distributed systems
- Data Fusion and Information Integration from multiple sources (spacecraft / missions / sciencecraft) across various scales and modalities
- Information and Knowledge Sharing among cooperating science nodes
- Intelligence within the sensor, measurement, and data systems
- Knowledge Representation and Ontology reconciliation across discipline-specific instruments
- Semantic Knowledge Extraction and Retrieval from multiple mission science data streams
- Knowledge Management, Sharing, and Reuse: lessons learned and remembered!
5. The New Face of Science (1)
- Big Data (usually distributed across systems)
  - High-Energy Particle Physics
  - Astronomy, Planetary, and Space Physics
  - Earth Observing System (Remote Sensing)
  - Human Genome and Bioinformatics
  - Numerical Simulations of any kind
  - Digital Libraries (electronic publication repositories)
- e-Science
  - Built on Web Services (e-Gov, e-Biz) paradigm
  - Distributed heterogeneous data are the norm
  - Data integration across projects and institutions
  - One-stop shopping: "The right data, right now."
6. The New Face of Science (2)
- Data and data systems are central to science.
- Databases enable scientific discovery:
  - Data Handling and Archiving (management of massive data resources)
  - Data Discovery (finding data wherever they exist)
  - Data Access (HTTP-Database interfaces)
  - Data/Metadata Browsing (serendipity)
  - Data Sharing and Reuse (within project teams and by other missions and programs; scientific validation)
  - Data Fusion (across multiple modalities and domains)
  - Data Integration (from multiple sources)
  - Knowledge Sharing and Reuse (through ontologies and semantic representation in knowledgebases)
  - Data Mining (KDD = Knowledge Discovery in Databases)
7. e-Science
- Key Technologies
  - Data Mining
    - KDD = Knowledge Discovery in Databases
    - Machine Learning
  - Distributed Data Discovery and Access
    - VxOs = Virtual (any science) Observatories
    - Web Services
    - Grid Computing
    - The Semantic Web (ontologies)
- Key Benefits
  - Provides seamless uniform discovery, access, mining, and analysis of distributed heterogeneous data sources ...
    - ... here, there, and beyond.
    - Find the right data, right now
  - Enables ... Data and Information integration and fusion ...
    - ... across multiple distributed heterogeneous data collections ...
    - ... to enable scientific knowledge discovery ...
    - ... and decision support ...
    - ... (with minimal human assistance, or autonomously).
  - Provides intelligence within the data system.
8. Modeling of a business process, using an event-driven process chain
[Figure: event-driven process chain for an Intelligent Database (event-driven). Diagram labels: Science data ordered; valid request; invalid request; check data availability; data not available; data available; Design and implement data processing plan; Collect new data from sensors; Science data shipped; generate data products; product finished; Any more Requests?; Order completed; Happy User!]
Steps in this process chain are business-related, but you can easily transform these steps into scientific decision support within a space mission data system.
Reference: EPML = Event-driven Process chain Markup Language, http://xml.coverpages.org/ni2003-11-21-a.html
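As a rough illustration of how such a chain could drive a science data request, the sketch below walks one request through a few of the events named in the diagram. The order of the steps, the function name `process_request`, the `archive` dictionary, and the sensor stub are all assumptions made for this example; the slide does not specify them.

```python
def process_request(request, archive, collect_from_sensors):
    """Walk one science-data request through the chain and return the event log."""
    events = ["science data ordered"]
    target = request.get("target")
    if not target:
        events.append("invalid request")
        return events
    events.append("valid request")
    if target in archive:
        events.append("data available")
    else:
        events.append("data not available")
        # Hypothetical step: collect new observations on demand.
        archive[target] = collect_from_sensors(target)
    events.append("product finished")
    events.append("science data shipped")
    return events


def fake_sensor(target):
    """Stand-in for a real instrument; returns placeholder measurements."""
    return [0.0, 0.0, 0.0]


# Hypothetical archive contents, for illustration only.
archive = {"region-A": [0.2, 0.4, 0.1]}

log_cached = process_request({"target": "region-A"}, archive, fake_sensor)
log_new = process_request({"target": "region-B"}, archive, fake_sensor)
log_bad = process_request({}, archive, fake_sensor)
```

Each event in the returned log is a point where a data system could trigger the next function automatically instead of waiting for a person.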
9. Data Mining is the killer app for e-Science
- Data Mining is Knowledge Discovery in Databases (KDD).
- Data Mining is defined as an information extraction activity whose goal is to discover hidden facts contained in (large) databases.
- Data Mining is the application of Machine Learning to large databases.
- Machine Learning is defined as the application of computer algorithms that improve automatically through experience.
10. Data Mining Technique Example: Decision Tree = Rule Induction
- Decision Tree (DT) Construction is a form of data mining called Rule Induction.
- The DT algorithm learns the rules from the database of historical records.
- It picks predictors and their splitting values on the basis of an information gain metric.
- Information gain is the difference between the amount of information needed to make the correct prediction before the split and the amount needed after the split.
- If the amount of information required is much lower after the split is made, then the split is said to have decreased the disorder (entropy) of the original data. This is good (i.e., the more ordered the data, the more certain our final classification).
- Similar to the game 20 Questions: good questions provide good DT splits. For example, an adult asks "Is it alive?", but a child asks "Is it my daddy?".
- After the rules are learned (induction), they are applied to new events (new records or new data entries) to make predictions on unseen data. This can be applied in real time in deep space missions.
- Some databases have automated rule-learning algorithms built in. These are called Inductive Databases, and they are data-driven.
- Reference: http://www.mli.gmu.edu/projects/idb.html
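The information gain criterion described on this slide can be sketched in a few lines. This is a minimal illustration, assuming Shannon entropy in bits and a single numeric predictor split at a threshold; the feature values, class names, and function names are hypothetical and not taken from the presentation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels, threshold):
    """Entropy before the split minus the weighted entropy after it."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - after


def best_split(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, information_gain(values, labels, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical historical records: one measured feature and a known class.
reflectance = [0.10, 0.15, 0.20, 0.60, 0.70, 0.80]
rock_class = ["basalt", "basalt", "basalt", "carbonate", "carbonate", "carbonate"]

threshold, gain = best_split(reflectance, rock_class)
```

In this toy data set the two classes separate cleanly, so the best threshold removes all disorder. A full decision tree would repeat the same search recursively on each side of the split.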
11. Data Mining Technique Example: Bayes Inference Engine
- The application of Bayes' Theorem is a form of Inference (the act or process of deriving a conclusion based solely on what one already knows).
- The Bayes Inference Engine (BIE) learns the likelihood of possible hypotheses (models, or classifications) being correct, for a given set of observational measurements, based upon the database of historical records.
- The historical knowledgebase for a space mission is used to populate the data history (including known outcomes from prior experience) with particular measurement values of the observed features that correspond to particular events.
- As new data arrive, the BIE estimates the most likely model (outcome, class, hypothesis) to describe what the spacecraft sees.
- Following the interpretation of the new events, which are then properly labeled, these events (class labels) and their corresponding measured data values (evidence, features) add to the knowledgebase for the next application of the BIE.
- Bayes Inference: data-driven, learns as it goes, incremental learning, self-correcting, can test multiple hypotheses, unbiased by prior prejudice, model-independent inference, yields probabilistically ranked predictions, popular standard machine learning tool for optimal knowledge-based decision-making.
- References
  - http://www.astro.cornell.edu/staff/loredo/bayes/
  - http://aisrp.nasa.gov/projects/5665da55.html
Notation: H = C = hypothesis or class; E = F = evidence or feature.
P(H|E) = probability of hypothesis H being correct, given the evidence E.
P(C|{Fk}) = probability of classification C, given the set of measured features {Fk}.
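One simple way to estimate P(C|{Fk}) from historical records is a naive Bayes classifier, which applies Bayes' Theorem as P(C|{Fk}) ∝ P(C) × Π P(Fk|C). The sketch below is not the BIE itself. It assumes discrete feature values, conditional independence of the features given the class, and add-one (Laplace) smoothing; the class name `BayesInferenceSketch` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


class BayesInferenceSketch:
    """Toy naive Bayes classifier over discrete features (assumed, not the BIE)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # [cls][(k, value)] -> count
        self.values_seen = defaultdict(set)         # [k] -> distinct values

    def learn(self, features, cls):
        """Add one labeled record to the knowledgebase (incremental learning)."""
        self.class_counts[cls] += 1
        for k, value in enumerate(features):
            self.feature_counts[cls][(k, value)] += 1
            self.values_seen[k].add(value)

    def posterior(self, features):
        """Return P(C | features) for every known class, normalized to sum to 1."""
        total = sum(self.class_counts.values())
        scores = {}
        for cls, n_cls in self.class_counts.items():
            score = n_cls / total  # prior P(C)
            for k, value in enumerate(features):
                seen = self.feature_counts[cls][(k, value)]
                # Add-one smoothing keeps unseen values from zeroing the product.
                score *= (seen + 1) / (n_cls + len(self.values_seen[k]))
            scores[cls] = score
        evidence = sum(scores.values())
        return {cls: s / evidence for cls, s in scores.items()}


# Hypothetical history: (particle flux level, trend) -> observed outcome.
history = [
    (("high", "rising"), "storm"),
    (("high", "rising"), "storm"),
    (("high", "steady"), "storm"),
    (("low", "steady"), "quiet"),
    (("low", "steady"), "quiet"),
    (("low", "rising"), "quiet"),
]
engine = BayesInferenceSketch()
for features, outcome in history:
    engine.learn(features, outcome)

ranked = engine.posterior(("high", "rising"))
best = max(ranked, key=ranked.get)

# The newly labeled event is fed back into the knowledgebase.
engine.learn(("high", "rising"), best)
```

The `posterior` dictionary gives the probabilistically ranked predictions mentioned above, and the final `learn` call mirrors the step in which labeled events are added to the knowledgebase for the next application.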
12. Existing Space Science Data Infrastructure
- The Recent Past: many independent distributed heterogeneous data archives
- Today: VxOs = Virtual Observatories
  - Web Services-enabled e-Science paradigm (middleware, standards, protocols)
  - Provides seamless uniform access to distributed heterogeneous data sources
  - Find the right data, right now
  - One-stop shopping for all of your data needs
- Emerging environment consists of many VxOs, for example:
  - NVO = National Virtual Observatory (precursor to VAO = Virtual Astronomical Observatory)
  - VSO = Virtual Solar Observatory
  - VSPO = Virtual Space Physics Observatory
  - NVAO = National Virtual Aeronomy Observatory
  - VITMO = Virtual Ionospheric, Thermospheric, Magnetospheric Observatory
  - VHO = Virtual Heliospheric Observatory
  - VMO = Virtual Magnetospheric Observatory
- Standards for data formats, data/metadata exchange, data models, registries, Web Services, VO queries, query results, semantics
- And of course: The Grid, Web Services, Semantic Web, etc.
13. Sun-Earth Space Environment: Rich Source of Heliophysical Phenomena
14. Multi-point Observations and Models of Space Plasmas Deliver a Deluge of Physical Measurements
15. Data-Driven Knowledge Discovery
16. Space Weather Example: Early Warning System for Astronauts in Space
CME = Coronal Mass Ejection; SEP = Solar Energetic Particle
17. Data Mining in Action
- Data Mining facilitates Intelligent Data Understanding (IDU).
- Data Mining enables Decision Support and Active Control Systems.
- IDU refers to the application of techniques for transforming data into scientific understanding.
  - Web reference: http://is.arc.nasa.gov/IDU/index.html
- IDU specifically refers to automating the following techniques for machine-assisted science data analysis:
  - Data Mining (e.g., http://is.arc.nasa.gov/IDU/tasks/NVODDM.html)
  - Knowledge Discovery
  - Machine Learning
18. Case Study: Mars Rovers
19. e-Science Data Mining Applications on Mars Rover (1)
- Rove around the surface of Mars and take samples of rocks (mass spectroscopy = a data histogram)
- Supervised Learning (search for rocks with known compositions)
- Unsupervised Learning (discover what types of rocks are present, without preconceived biases)
- Association Mining (find unusual associations)
- Clustering (find the set of unique classes of rocks)
- Classification (assign rocks to known classes)
- Deviation/Outlier Detection (one-of-a-kind = interesting?)
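Two of the tasks listed here, classification and outlier detection, can be combined in a small sketch: assign a new spectrum to the nearest known class, and flag it as an outlier when no class is close enough. The three-bin "spectra", the class names, the Euclidean distance measure, and the threshold value are assumptions chosen for illustration, not mission data or the rover's actual method.

```python
import math


def centroid(spectra):
    """Component-wise mean of a list of equal-length spectra."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def classify(spectrum, centroids, outlier_distance):
    """Return (label, distance); the label is 'outlier' beyond the threshold."""
    distances = ((name, math.dist(spectrum, c)) for name, c in centroids.items())
    label, dist = min(distances, key=lambda pair: pair[1])
    if dist > outlier_distance:
        return "outlier", dist
    return label, dist


# Hypothetical training spectra (normalized histograms) for two known classes.
known = {
    "basalt": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "carbonate": [[0.1, 0.2, 0.7], [0.0, 0.3, 0.7]],
}
centroids = {name: centroid(samples) for name, samples in known.items()}

label_a, dist_a = classify([0.85, 0.15, 0.0], centroids, outlier_distance=0.3)
label_b, dist_b = classify([0.3, 0.4, 0.3], centroids, outlier_distance=0.3)
```

The first sample lands on a known class, while the second is far from both centroids and is flagged, which is the "one-of-a-kind" case that might deserve a closer look.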
20. e-Science Data Mining Applications on Mars Rover (2)
- On-board Intelligent Data Understanding and Decision Support Systems (Fuzzy Logic, Decision Trees, Case-Based Reasoning): Science Goal Monitoring
  - stay here and do more, or else move on to another rock
  - send results to Earth immediately, or send results later
- Learn as it goes (Machine Learning, Neural Nets)
- Relate the results to other factors, such as dust storms (XML, Information Retrieval, Information Fusion with other data from orbiting satellite / mother ship)
- Predict where to go in order to find interesting rocks (Logistic Regression, Case-Based Reasoning)
21. Mars Rover as an e-Science Data System
- Decisions are based on data mined, prior experience, new knowledge, and the set of learned rules.
- Rover acts autonomously, without human intervention, in Deep Space environment.
- Actions are driven by mining actionable data from all sensors.
http://www.samsi.info/200506/astro/presentations/tut1loredo-7.pdf
22. Autonomous Mineral Detectors for Mars Rovers and Landers
- NASA / AISRP PI: Martha Gilmore, Wesleyan University
Objective: Design and develop software to enable rovers to autonomously analyze spectral data and identify data indicating geologically important signatures.
Motivation: Both rover and orbital missions can collect more data than can be returned due to downlink restrictions.
Results: Software is designed to allow onboard processing of Vis/NIR spectra to identify and select spectra that contain minerals of geologic interest autonomously.
[Figure labels: Non-carbonates; Carbonates]
Credit: M. Gilmore
23. A Neural Map View of Planetary Spectral Images for Precision Data Mining and Rapid Resource Identification
- NASA / AISRP PI: Erzsébet Merényi, Rice University
Uses advanced variants of the self-organized machine learning paradigm, the Self-Organizing Map (SOM), applied to spectral imagery. They detected orthopyroxene- and clinopyroxene-dominated mineral subclasses within a rare undifferentiated mineral type nicknamed "black rock" by geologists. SOM by eye!
Credit: E. Merényi
24. Application of Machine Learning Technology to Martian Geology
- NASA / AISRP PI: Ruye Wang, Harvey Mudd College
- Machine Learning algorithms have been applied to the analysis of THEMIS (Thermal Emission Imaging System) image data of Mars, for the purpose of studying mountain ranges on Mars (the Thaumasia Highlands and Coprates rise).
- Specifically, various clustering and classification algorithms (e.g., K-means, competitive neural network, support vector machine, Independent Component Analysis) have been applied to the THEMIS image data covering certain areas in the Thaumasia Highlands.
- Objectives
  - Develop an intelligent system for robust detection and accurate classification in multispectral remote sensing image data
  - Demonstrate system in context of Martian geology application
Credit: R. Wang
25. Some e-Science Projects
- International Virtual Observatory Alliance: http://www.ivoa.net/
- The Thinking Telescope Project: http://www.thinkingtelescopes.lanl.gov/
- Open Science Grid (OSG): http://www.opensciencegrid.org/
- Biomedical Informatics Research Network (BIRN): http://www.nbirn.net/
- Network for Earthquake Engineering Simulation (NEES): http://www.nees.org/
- PlanetLab computer science testbed: http://www.planet-lab.org/
- Earth System Grid (ESG): https://www.earthsystemgrid.org/
- Department of Energy's Fusion Collaboratory: http://www.fusiongrid.org/
- UK myGrid project: http://www.mygrid.org.uk/
- Enabling Grids for eScience in Europe (EGEE): http://public.eu-egee.org/
- e-Framework for Education and Research: http://www.e-framework.org/
- Global Earth Observation System of Systems (GEOSS): http://www.epa.gov/geoss/
26. Astronomy Example: The Thinking Telescope
Machine Learning Applications:
- Automated Feature Extraction: Real-time identification of artifacts and transients in direct and difference images.
- Classifiers: Automated classification of celestial objects based on temporal and spectral properties.
- Anomaly Detection: Real-time recognition of important deviations from normal behavior for persistent sources.
Credit: http://www.thinkingtelescopes.lanl.gov/
27. Sample e-Science Data Mining Use Cases
- Discover data stored in distributed heterogeneous systems.
- Search huge databases for trends and correlations in high-dimensional parameter spaces; identify new properties or new classes of scientific objects.
- Discover new linkages and associations among data parameters.
- Search for rare, one-of-a-kind, and exotic objects in huge databases.
- Identify repeating patterns of temporal variations from millions or billions of observations.
- Identify moving objects in huge image databases.
- Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / instrumental / engineering data streams).
- Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of objects or other observables in arbitrary parameter spaces.
- Serendipitously explore huge scientific databases through access to distributed, autonomous, federated, heterogeneous, multi-experiment, multi-mission science data systems.
28. Applications of e-Science Data Mining Techniques to the Space Mission I.T. Data System Environment
- Archival research applications
  - Cross-links between archives and active mission data will offer improved data analysis, calibration, anomaly detection, and scientific discovery with active missions.
- Decision support for selecting interesting targets for observation.
- Identification of interesting events for rapid follow-up observation planning.
- Real-time on-board decision-support functions, such as:
  - the rapid analysis of large volumes of time-series data (engineering, telemetry, and science streams) in order to make decisions (about operations, maneuvers, and science observations) in deep space without human intervention.
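As one possible shape for such an on-board function, the sketch below scans a telemetry stream and flags readings that deviate strongly from a short running baseline. The window size, the three-sigma threshold, the function name `detect_anomalies`, and the sample values are assumptions for illustration only; a flight system would need far more careful tuning and validation.

```python
import statistics
from collections import deque


def detect_anomalies(stream, window=5, threshold=3.0):
    """Yield (index, value) for readings far from the recent running mean."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) == window:
            mean = statistics.fmean(recent)
            spread = statistics.pstdev(recent)
            if spread > 0 and abs(value - mean) > threshold * spread:
                yield i, value
                continue  # keep the anomaly out of the baseline window
        recent.append(value)


# Hypothetical temperature telemetry with one spike.
telemetry = [20.0, 20.1, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0, 19.8]
flagged = list(detect_anomalies(telemetry))
```

Because the detector keeps only a fixed-size window in memory, it processes each reading as it arrives, which suits a stream that cannot be stored or downlinked in full.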