Title: Data Integration for Homeland Security
1Creating a Data Mining Environment for Geosciences
Interface 2002, Montreal, April 18, 2002
Sara J. Graves
Director, Information Technology and Systems Center
Professor, Computer Science Department, University of Alabama in Huntsville
Director, Information Technology and Research Center, National Space Science and Technology Center
256-824-6064
sgraves_at_itsc.uah.edu
www.itsc.uah.edu
3Characteristics of Science Data
- Varied kinds of data
- Raster images
- With structure and geometry
- Multispectral
- Time series and sequence data
- Numerical model outputs
- Multiple resolutions/multiple scales
- Variability of data formats
- Granularity of data
- Includes spatial and temporal dimensions
- Physical basis/domain knowledge needed before applying algorithms
- Typically requires domain-specific algorithms
4Scientific Analysis
- Harnesses human analysis capabilities
- Highly creative
- Based on theory and hypothesis formulation
- Physical basis is normally used for algorithms
- Drawing insights about the underlying phenomena
- Rapidly widening gap between data collection capabilities and the ability to analyze data
- Potential for vast amounts of data to go unused
5Data Mining
- Provides automation of the analysis process
- Can be used for dimensionality reduction when manual examination of data is impossible
- Can have limitations
- May not utilize domain knowledge
- May be difficult to prove validity of the results
- There may not be a physical basis
- Should be viewed as a complementary tool and not a replacement for scientific analysis
6Reasons for Mining Science Data
- Powerful tool for research and analysis given the volume of science data
- Necessity when manual examination of data is impossible
- Can allow scientists to refine/add more layers to the knowledge bases
- Can minimize scientists' data handling to allow them to maximize research time
- Can reduce reinventing the wheel
- Can fully exploit reusable knowledge bases for different problems
- Can be integrated into a Next Generation Information System to provide additional services such as
- Custom Order Processing
- Subsetting/Formatting/Gridding
- Event/Relationship Searching
7Similarity between Data Mining and Scientific Analysis Process
8Data Challenge
[Diagram labels: Data Sets; Search and Access Data; Data Preparation for Mining/Analysis (Data Integration, Data Transformation, Data Reduction); Data Mining / Science Analysis; Results]
9Typical Data Preparation Operations
- Data Cleaning
- Clean data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
- Fairly well handled
- Data Integration
- Integration of multiple data files
- Data Transformations
- Normalization and aggregation
- Data Reduction
- Obtain a reduced representation of the data set, which produces the same analytical results
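The preparation operations listed above can be sketched in a few lines of Python. This is a minimal illustration, not code from ADaM: the function names, the mean-fill strategy, min-max scaling, and block averaging are all assumptions chosen for brevity, and data integration (merging multiple files) is left out.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def normalize(values):
    """Data transformation: min-max scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate(values, block):
    """Data reduction: average consecutive blocks of the given size."""
    return [mean(values[i:i + block]) for i in range(0, len(values), block)]


# Hypothetical readings with one missing value.
readings = [10.0, None, 14.0, 12.0, 20.0, 16.0]
cleaned = fill_missing(readings)
scaled = normalize(cleaned)
reduced = aggregate(scaled, 2)
```

Block averaging is only one possible reduction; whether a reduced representation preserves the analytical results depends on the analysis that follows.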
10User Perspective and Data Perspective of the Data Mining Process
[Diagram labels: User Perspective; Data Perspective; Data Stores; Dataset; Data; Calibration/Navigation; Preprocessing; Dataset Specific Algorithms; Domain Specific Algorithms; Transformation; Information; Knowledge; Analysis; Decision; Volume; Value]
11Scientific Data Mining Environment Stakeholders
End Users
Scientists
12Scientists' Perspective
- Define the experiment
- Reduce data volume
- Create reusable Knowledge Base
- Iterate over experiment to refine the knowledge base
- Minimize data handling/Maximize research
- Add more layers to the knowledge base
- Allow different levels of knowledge discovery
- Shallow knowledge
- Hidden
- Deep
13End Users' Perspective
- End users can be
- Students
- Public
- Decision makers
- Other Scientists
- Access to data
- Access to knowledge base
- End products
14NASA Workshop on Issues of Application of Data Mining to Scientific Data
- Held on October 19-21, 1999 at University of Alabama in Huntsville
- Domain Focus
- Global Change
- Natural Hazard
- Terrestrial Ecology
- Key Recommendations
- Need to create a data mining environment for facilitation, scalability and automation of scientific analysis for large scale data streams
- Need to formulate critical partnerships between physical scientists, computer scientists and statisticians for an effective integration of analysis processes, scientific algorithms, statistical approaches and enabling computer architectures
15Reasons for Building a Data Mining Environment
- Provide the capabilities and flexibility of creative scientific analysis
- Provide an infrastructure of mining algorithms and knowledge bases for creative analysis to reduce reinventing the wheel
- Provide capabilities to add science algorithms to the framework
- Support a spectrum of heterogeneous participants, data sources and technological approaches
- Provide a framework with components and suitable management of the interfaces between them
- Allow scientists to refine/add domain information to the mining environment
- Minimize scientists' data handling to allow them to maximize research time
- Incorporate relevance feedback mechanism to learn methodologies for multiple domains
- Integrate data mining functionality into other distributed systems
16Mining Environment: When, Where, Who and Why?
- WHERE
- User Workstation
- Data Mining Center
- GRID
- WHEN
- Real Time
- On-Ingest
- On-Demand
- Repeatedly
- WHO
- End Users
- Domain Experts
- Mining Experts
- WHY
- Event
- Relationship
- Association
- Corroboration
- Collaboration
Data Mining
17ADaM History
- Algorithm Development and Mining (ADaM) System
- ADaM system developed under NASA HQ research grant
- The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
- It contains over 120 different operations to be performed on the input data stream.
- Operations vary from specialized atmospheric science data-set specific algorithms to different digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms.
- Developed an Event/Relationship Search System
18ADaM Engine Architecture
[Diagram labels: Data; Translated Data; Preprocessed Data; Patterns/Models; Results; Processing; Preprocessing; Analysis]
Preprocessing
- Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
- Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
- Image Processing: Cropping, Inversion, Thresholding
- Others...
Analysis
- Clustering: K Means, Isodata, Maximum
- Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
- Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
- Genetic Algorithms
- Neural Networks
- Others...
19Extensibility of ADaM
ADaM Mining Engine
Analysis Modules
Input Modules
Output Modules
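One way to picture an engine that is extended through input, analysis and output modules is a small registry, sketched below. The class name `MiningEngine`, its methods, and the registered module names are hypothetical and are not ADaM's actual interface; the sketch only illustrates the plug-in idea.

```python
class MiningEngine:
    """Toy engine with registries for input, analysis and output modules."""

    def __init__(self):
        self._modules = {"input": {}, "analysis": {}, "output": {}}

    def register(self, kind, name, func):
        """Add a module of the given kind under a lookup name."""
        if kind not in self._modules:
            raise ValueError(f"unknown module kind: {kind}")
        self._modules[kind][name] = func

    def run(self, source, reader, steps, writer):
        """Read the source, apply analysis steps in order, format the result."""
        data = self._modules["input"][reader](source)
        for step in steps:
            data = self._modules["analysis"][step](data)
        return self._modules["output"][writer](data)


engine = MiningEngine()
# Hypothetical modules: a text reader, two analysis steps, one summary writer.
engine.register("input", "csv_line", lambda text: [float(x) for x in text.split(",")])
engine.register("analysis", "threshold", lambda xs: [x for x in xs if x > 2.0])
engine.register("analysis", "invert", lambda xs: [-x for x in xs])
engine.register("output", "summary", lambda xs: {"count": len(xs), "values": xs})

result = engine.run("1.0,2.5,4.0", "csv_line", ["threshold", "invert"], "summary")
```

New operations are added by registering another function; the engine itself does not change.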
20 Data Mining Environment
Data Mining Server
Mining Results
Event/ Relationship Search System
21Event/Relationship Search System
- Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
- Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
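A coincidence search of this kind can be sketched as a filter over mined events by region and time window. The `Event` and `Region` types, the bounding-box test, and the sample values below are illustrative assumptions, not the search system's real data model; a bounding box also ignores regions that cross the date line.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    name: str
    lat: float
    lon: float
    time: datetime


@dataclass(frozen=True)
class Region:
    name: str
    south: float
    north: float
    west: float
    east: float

    def contains(self, event):
        """True when the event lies inside this latitude/longitude box."""
        return (self.south <= event.lat <= self.north
                and self.west <= event.lon <= self.east)


def coincidence_search(events, region, start, end):
    """Return events that fall inside the region during [start, end]."""
    return [e for e in events if region.contains(e) and start <= e.time <= end]


# Hypothetical mined events and a hypothetical named region.
events = [
    Event("MCS-1", 25.0, -90.0, datetime(1999, 9, 14, 6)),
    Event("MCS-2", 45.0, -90.0, datetime(1999, 9, 14, 6)),
    Event("MCS-3", 26.0, -88.0, datetime(1999, 10, 1, 0)),
]
gulf = Region("Gulf box", south=18.0, north=31.0, west=-98.0, east=-81.0)
hits = coincidence_search(events, gulf, datetime(1999, 9, 1), datetime(1999, 9, 30))
hit_names = [e.name for e in hits]
```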
22An Environment for On-board Processing: EVE
- Real-time on-board data mining can provide unique capabilities
- Anomaly detection
- Autonomous control and decision making
- Immediate response
- Direct satellite to Earth delivery of results
- The Sensor Web is expanding and more processing is available on-orbit
23Major EVE Components
EVE Software Architecture
24Interchange Technology: Earth Science Markup Language (ESML)
- Facilitate effective utilization of distributed, heterogeneous data products
- Enable interchangeable tools and services
Data
Applications
Integration
25Interoperability: Accessing Heterogeneous Data
- Earth science data comes in
- Different formats, types and structures
- Different states of processing (raw, calibrated, derived, modeled or interpreted)
- Enormous volumes
- One approach: Standard data formats
- Difficult to implement and enforce
- Can't anticipate all needs
- Some data can't be modeled or is lost in translation
- The cost of converting legacy data
- A better approach: Interchange technologies
- Earth Science Markup Language
HDF-EOS
HDF
netCDF
ASCII
GRIB
Binary
26Data Usability
[Diagram labels: DATA FORMAT 1; DATA FORMAT 2; DATA FORMAT 3; FORMAT CONVERTER; READER 1; READER 2; APPLICATION]
The Problem
- Specialized code for every format
- Difficult to assimilate new data types
- Expensive to convert legacy data
The Solution
- Data interoperability
- Define Once, Use Anywhere
27Earth Science Markup Language
- Specialized markup language for Earth Science information based on XML
- Goes beyond traditional metadata to include both structural and semantic information needed to effect a practical runtime interpretation of a data set
- Benefits of ESML
- Enables independently developed applications and services to effectively utilize distributed, heterogeneous data products
- Allows the end-user to integrate data sets of differing structures to aid in data fusion and analysis without having to write a special reader for each data set
- Is simple enough that end-users can create their own ESML for on-hand datasets (new or legacy)
28Example ESML file for an Image
- Hurricane Mitch (981027)
- Binary format (32 bits/pixel)
- Image dimensions shown: 512 x 300
- Mitch.esml
    <ESML>
      <SyntacticMetaData>
        <Binary GeoInfo="NoGeoInfo">
          <Array name="Track" occurs="300" DimName="Y">
            <Field name="Channel 1" type="int" size="32" occurs="512" DataType="Data" DimName="X"/>
            <Data/>
          </Array>
        </Binary>
      </SyntacticMetaData>
    </ESML>
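To illustrate how a description like the one above could drive a generic reader, here is a minimal Python sketch. It is not the ESML library: the function `read_binary_grid` is hypothetical, the description is shrunk to 2 rows by 3 columns, and the big-endian byte order is an assumption, since the slide does not state one.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical description in the style of Mitch.esml, reduced to a 2 x 3 grid.
DESCRIPTION = """
<ESML>
  <SyntacticMetaData>
    <Binary GeoInfo="NoGeoInfo">
      <Array name="Track" occurs="2" DimName="Y">
        <Field name="Channel 1" type="int" size="32" occurs="3"
               DataType="Data" DimName="X"/>
      </Array>
    </Binary>
  </SyntacticMetaData>
</ESML>
"""


def read_binary_grid(description, raw, byte_order=">"):
    """Decode raw bytes into rows of integers using the description.

    byte_order is an assumption (">" means big-endian).
    """
    root = ET.fromstring(description)
    array = root.find("./SyntacticMetaData/Binary/Array")
    field = array.find("Field")
    rows = int(array.get("occurs"))
    cols = int(field.get("occurs"))
    bits = int(field.get("size"))
    if field.get("type") != "int" or bits != 32:
        raise ValueError("this sketch only handles 32-bit integers")
    expected = rows * cols * 4
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    values = struct.unpack(f"{byte_order}{rows * cols}i", raw)
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]


# Synthetic payload standing in for the binary image file.
raw_bytes = struct.pack(">6i", 1, 2, 3, 4, 5, 6)
grid = read_binary_grid(DESCRIPTION, raw_bytes)
```

The reader contains no format-specific constants: the grid shape and element size come from the description, which is the point of describing a file once and reusing that description across applications.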
29Atmospheric Science Mining Applications
- Lightning Detection
- Rainfall Identification Estimation Study
- Rainfall Accumulation Study
- Tropical Cyclone Detection and Wind Speed Estimation
- GOES Cumulus Cloud Classification
- Mesoscale Convective System Detection
- Detection of Jet Streams in Numerical Model Data
30Tropical Cyclone Detection: Estimating Maximum Wind Speed
Advanced Microwave Sounding Unit (AMSU-A) Data
- Water cover mask to eliminate land
- Laplacian filter to compute temperature gradients
- Science Algorithm to estimate wind speed
- Contiguous regions with wind speeds above a desired threshold identified
- Additional test to eliminate false positives
- Maximum wind speed and location produced
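The steps above can be sketched on a small grid in plain Python. The wind estimate here is a placeholder (a constant times the absolute Laplacian), not the science algorithm used in the study, and the false-positive test is omitted; all function names, the `scale` constant, and the sample grid are assumptions for illustration.

```python
from collections import deque


def laplacian(grid, r, c):
    """4-neighbour discrete Laplacian at an interior cell."""
    return (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1]
            + grid[r][c + 1] - 4 * grid[r][c])


def estimate_wind(tb, water, scale=10.0):
    """Placeholder wind estimate: scale * |Laplacian| over water, 0 elsewhere.

    NOT the science algorithm from the slide; scale is an arbitrary constant.
    Border cells are left at 0 because the filter needs four neighbours.
    """
    rows, cols = len(tb), len(tb[0])
    wind = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if water[r][c]:  # water cover mask eliminates land cells
                wind[r][c] = scale * abs(laplacian(tb, r, c))
    return wind


def find_regions(wind, threshold):
    """Return (max_wind, (row, col)) for each 4-connected region above threshold."""
    rows, cols = len(wind), len(wind[0])
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or wind[r0][c0] <= threshold:
                continue
            best = (wind[r0][c0], (r0, c0))
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:  # breadth-first flood fill of one region
                r, c = queue.popleft()
                best = max(best, (wind[r][c], (r, c)))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and wind[nr][nc] > threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            results.append(best)
    return results


# Synthetic brightness temperatures with one warm anomaly at the centre.
tb = [[250.0] * 5 for _ in range(5)]
tb[2][2] = 256.0
water = [[True] * 5 for _ in range(5)]
water[3][2] = False  # hypothetical land cell

wind = estimate_wind(tb, water)
regions = find_regions(wind, threshold=50.0)
```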
Calibration/ Limb Correction/ Converted to Tb
Hurricane Floyd
Data Archive
Mining Environment
Result
Results are placed on the web and made available to the National Hurricane Center and the Joint Typhoon Warning Center
31Data Fusion and Mining: From Global Information to Local Knowledge
Emergency Response
Precision Agriculture
Urban Environments
Weather Prediction
32Mining as a Web Service
33Mining on Information Power Grid (IPG) using ADaM
IPG Processor
Mining LDAP Server
34Earth Science Example of Developing a Knowledge Network: Collaborative Research in Mesoscale Convective Systems
Data Sets: SSM/I (F13, F14)
ADaM System
- Add algorithm to detect MCSs
- Generate end products while mining
Knowledge Base
- Information about MCSs detected
Visualization: Eureka Interface
Eureka Spatial
- Database
- location
- size
- intensity, etc.
Pose questions and get answers from the Knowledge Repository (such as coincidence search, relationship testing)
Anyone can access the knowledge base via the web
Scientists/Researchers can ask questions such as
- What is the latitudinal distribution of MCSs?
- Which continent has more MCSs?
- What is the seasonal distribution of MCSs?
- What is the relationship between the number of MCSs and their intensity?
End Users
- Generate information useful to the general public (students, researchers, policy makers, etc.)
- Images
- Forecast aids
- General Science information
- Answer the practical side of the problem
35Challenges
- Develop and document common/standard interfaces for interoperability of data and services
- Design new data models for handling
- real-time/streaming input
- data fusion/integration
- Design and develop distributed standardized catalog capabilities
- Develop advanced resource allocation and load balancing techniques
- Exploit distributed infrastructure for enhanced data mining functionality (grids, web services, etc.)
- Develop more intelligent and intuitive user interfaces
- Develop ontologies of scientific data, processes and data mining techniques for multiple domains
- Support language and system independent components
- Incorporate data mining into scientific curricula