Data Integration for Homeland Security


Transcript and Presenter's Notes

Title: Data Integration for Homeland Security

1
Creating a Data Mining Environment for Geosciences
Interface 2002, Montreal, April 18, 2002
Sara J. Graves
Director, Information Technology and Systems Center
Professor, Computer Science Department
University of Alabama in Huntsville
Director, Information Technology and Research Center
National Space Science and Technology Center
256-824-6064
sgraves_at_itsc.uah.edu
www.itsc.uah.edu
2
(No Transcript)
3
Characteristics of Science Data

Varied kinds of data
Raster images
With structure and geometry
Multispectral
Time series and sequence data
Numerical model outputs
Multiple resolutions/multiple scales
Variability of data formats
Granularity of data
Includes spatial and temporal dimensions
Physical basis/domain knowledge needed before applying algorithms
Typically requires domain-specific algorithms

4
Scientific Analysis

Harnesses human analysis capabilities
Highly creative
Based on theory and hypothesis formulation
Physical basis is normally used for algorithms
Drawing insights about the underlying phenomena
Rapidly widening gap between data collection capabilities and the ability to analyze data
Potential for vast amounts of data to go unused

5
Data Mining

Provides automation of the analysis process
Can be used for dimensionality reduction when manual examination of data is impossible
Can have limitations
May not utilize domain knowledge
May be difficult to prove validity of the results
There may not be a physical basis
Should be viewed as a complementary tool, not a replacement for scientific analysis

6
Reasons for Mining Science Data

Powerful tool for research and analysis given the volume of science data
Necessity when manual examination of data is impossible
Can allow scientists to refine/add more layers to the knowledge bases
Can minimize scientists' data handling to allow them to maximize research time
Can reduce reinventing the wheel
Can fully exploit reusable knowledge bases for different problems
Can be integrated into a Next Generation Information System to provide additional services such as
Custom Order Processing
Subsetting/Formatting/Gridding
Event/Relationship Searching

7
Similarity between Data Mining and Scientific Analysis Process
8
Data Challenge
[Diagram; labels: Data Sets; Search and Access Data; Data Preparation for Mining/Analysis (Data Integration, Data Transformation, Data Reduction); Data Mining; Science Analysis; Results]
9
Typical Data Preparation Operations

Data Cleaning
Clean data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Fairly well handled
Data Integration
Integration of multiple data files
Data Transformations
Normalization and aggregation
Data Reduction
Obtain a reduced representation of the data set, which produces the same analytical results
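
The preparation operations listed above can be illustrated with a minimal Python sketch. This is not ADaM code: the function names (`fill_missing`, `remove_outliers`, `min_max_normalize`, `aggregate`), the outlier rule (median absolute deviation) and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean, median

def fill_missing(values):
    """Data cleaning: replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=3.0):
    """Data cleaning: drop values more than k median absolute deviations from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate(values, block):
    """Data reduction: replace each run of `block` values with its mean."""
    return [mean(values[i:i + block]) for i in range(0, len(values), block)]

# Toy samples (made up): one missing value and one obvious outlier.
raw = [10.0, 12.0, None, 11.0, 250.0, 9.0, 13.0, 11.0]
cleaned = remove_outliers(fill_missing(raw))
normalized = min_max_normalize(cleaned)
reduced = aggregate(normalized, block=2)
```

Whether a reduced representation really "produces the same analytical results" depends on the analysis; block averaging as used here preserves the overall mean only when all blocks have equal size.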

10
User Perspective and Data Perspective of the Data Mining Process
[Diagram; labels: User Perspective; Data Perspective; Data; Information; Knowledge; Decision; Volume; Value; Preprocessing; Transformation; Analysis; Calibration; Navigation; Dataset Specific Algorithms; Domain Specific Algorithms; Dataset; Data Stores]
11
Scientific Data Mining Environment Stakeholders
End Users
Scientists
12
Scientists' Perspective

Define the experiment
Reduce data volume
Create reusable Knowledge Base
Iterate over experiment to refine the knowledge base
Minimize data handling/Maximize research
Add more layers to the knowledge base
Allow different levels of knowledge discovery
Shallow knowledge
Hidden
Deep

13
End Users' Perspective

End users can be
Students
Public
Decision makers
Other Scientists
Access to data
Access to knowledge base
End products

14
NASA Workshop on Issues of Application of Data Mining to Scientific Data

Held October 19-21, 1999, at the University of Alabama in Huntsville
Domain Focus
Global Change
Natural Hazard
Terrestrial Ecology
Key Recommendations
Need to create a data mining environment for facilitation, scalability and automation of scientific analysis for large scale data streams
Need to formulate critical partnerships between physical scientists, computer scientists and statisticians for an effective integration of analysis processes, scientific algorithms, statistical approaches and enabling computer architectures

15
Reasons for Building a Data Mining Environment

Provide the capabilities and flexibility of creative scientific analysis
Provide an infrastructure of mining algorithms and knowledge bases for creative analysis to reduce reinventing the wheel
Provide capabilities to add science algorithms to the framework
Support a spectrum of heterogeneous participants, data sources and technological approaches
Provide a framework with components and suitable management of the interfaces between them
Allow scientists to refine/add domain information to the mining environment
Minimize scientists' data handling to allow them to maximize research time
Incorporate relevance feedback mechanism to learn methodologies for multiple domains
Integrate data mining functionality into other distributed systems

16
Mining Environment: When, Where, Who and Why?

WHERE
User Workstation
Data Mining Center
GRID

WHEN
Real Time
On-Ingest
On-Demand
Repeatedly

WHO
End Users
Domain Experts
Mining Experts

WHY
Event
Relationship
Association
Corroboration
Collaboration

Data Mining
17
ADaM History

Algorithm Development and Mining (ADaM) System
ADaM system developed under NASA HQ research grant
The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
It contains over 120 different operations to be performed on the input data stream.
Operations vary from specialized atmospheric science data-set specific algorithms to different digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks and genetic algorithms.
Developed an Event/Relationship Search System

18
ADaM Engine Architecture
[Diagram; labels: Data; Translated Data; Preprocessed Data; Patterns/Models; Results; Processing]

Preprocessing
Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
Image Processing: Cropping, Inversion, Thresholding
Others...

Analysis
Clustering: K Means, Isodata, Maximum
Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
Genetic Algorithms
Neural Networks
Others...
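
As an illustration of one operation named above, here is a minimal Python sketch of a minimum-distance classifier: each class is represented by the mean of its training samples, and a new sample receives the label of the nearest mean. The code is a generic textbook version, not ADaM's implementation; the function names, feature values and class labels are made up.

```python
import math

def class_means(samples):
    """Compute the mean feature vector of each class from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(features, means):
    """Return the label whose class mean is closest to the given features."""
    return min(means, key=lambda label: euclidean(features, means[label]))

# Toy two-feature training set (values invented for illustration).
training = [
    ((1.0, 1.0), "clear"),
    ((1.2, 0.8), "clear"),
    ((5.0, 5.0), "cloud"),
    ((4.8, 5.2), "cloud"),
]
means = class_means(training)
label = classify((4.5, 4.9), means)
```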
19
Extensibility of ADaM
ADaM Mining Engine
Analysis Modules
Input Modules
Output Modules
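
The slide only names the three module types, so the following Python sketch shows just one way an engine with pluggable input, analysis and output modules could be organized. The `MiningEngine` class, its registries and the module names are hypothetical and do not reflect ADaM's actual interfaces.

```python
from typing import Callable, Dict, Iterable, List

class MiningEngine:
    """Toy engine: one input module, a chain of analysis modules, one output module."""

    def __init__(self) -> None:
        self.inputs: Dict[str, Callable[[str], List[float]]] = {}
        self.analyses: Dict[str, Callable[[List[float]], List[float]]] = {}
        self.outputs: Dict[str, Callable[[List[float]], str]] = {}

    def run(self, source: str, input_name: str,
            analysis_names: Iterable[str], output_name: str) -> str:
        # Read the source with the chosen input module.
        data = self.inputs[input_name](source)
        # Apply the analysis modules in the order requested.
        for name in analysis_names:
            data = self.analyses[name](data)
        # Format the result with the chosen output module.
        return self.outputs[output_name](data)

engine = MiningEngine()
# New modules are added by registering a function under a name.
engine.inputs["csv_text"] = lambda text: [float(x) for x in text.split(",")]
engine.analyses["threshold"] = lambda xs: [x for x in xs if x >= 2.0]
engine.analyses["scale"] = lambda xs: [x * 10.0 for x in xs]
engine.outputs["text"] = lambda xs: " ".join(f"{x:g}" for x in xs)

report = engine.run("1.5,2.0,3.5", "csv_text", ["threshold", "scale"], "text")
```

Keeping the three registries separate means a new data format or a new science algorithm can be added without touching the engine itself.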
20
Data Mining Environment
Data Mining Server
Mining Results
Event/ Relationship Search System
21
Event/Relationship Search System

Allows users to conduct coincidence searches and relationship tests between mined phenomena and a variety of parameters
Parameters include geographic regions, political boundaries, or other named phenomena for a specific time period
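
A coincidence search of the kind described above can be sketched in Python as a filter over mined events by region and time window. The `Event` and `Region` types, the bounding-box simplification and all sample values are assumptions for illustration; the actual search system is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    name: str
    lat: float
    lon: float
    time: datetime

@dataclass
class Region:
    name: str
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the bounding box (no date-line handling)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east

def coincidence_search(events, region, start, end):
    """Return the events located in the region during [start, end]."""
    return [e for e in events
            if region.contains(e.lat, e.lon) and start <= e.time <= end]

# Invented sample data: a rough bounding box and three mined events.
box = Region("sample box", south=18.0, north=31.0, west=-98.0, east=-80.0)
events = [
    Event("cyclone A", 25.0, -90.0, datetime(1999, 9, 10)),
    Event("cyclone B", 25.0, -90.0, datetime(1999, 11, 1)),
    Event("cyclone C", 40.0, -60.0, datetime(1999, 9, 12)),
]
hits = coincidence_search(events, box, datetime(1999, 9, 1), datetime(1999, 9, 30))
```

A political boundary would need a polygon test rather than a bounding box; the filter structure stays the same.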

22
An Environment for On-board Processing: EVE

Real-time on-board data mining can provide unique capabilities
Anomaly detection
Autonomous control and decision making
Immediate response
Direct satellite to Earth delivery of results
The Sensor Web is expanding and more processing is available on-orbit

23
Major EVE Components
EVE Software Architecture
24
Interchange Technology: Earth Science Markup Language (ESML)

Facilitate effective utilization of distributed, heterogeneous data products
Enable interchangeable tools and services

Data
Applications
Integration
25
Interoperability: Accessing Heterogeneous Data

Earth science data comes in
Different formats, types and structures
Different states of processing (raw, calibrated, derived, modeled or interpreted)
Enormous volumes
One approach: standard data formats
Difficult to implement and enforce
Can't anticipate all needs
Some data can't be modeled or is lost in translation
The cost of converting legacy data
A better approach: interchange technologies
Earth Science Markup Language

HDF-EOS
HDF
netCDF
ASCII
GRIB
Binary
26
Data Usability
The Problem
[Diagram; labels: DATA FORMAT 1; DATA FORMAT 2; DATA FORMAT 3; FORMAT CONVERTER; READER 1; READER 2; APPLICATION]
Specialized code for every format
Difficult to assimilate new data types
Expensive to convert legacy data

The Solution
Data interoperability
Define Once, Use Anywhere

27
Earth Science Markup Language

Specialized markup language for Earth Science information based on XML
Beyond traditional metadata to include both structural and semantic information needed to effect a practical runtime interpretation of a data set
Benefits of ESML
Enables independently developed applications and services to effectively utilize distributed, heterogeneous data products
Allows the end-user to integrate data sets of differing structures to aid in data fusion and analysis without having to write a special reader for each data set
Is simple enough that end-users can create their own ESML for on-hand datasets (new or legacy)

28
Example ESML file for an Image

Hurricane Mitch (981027)
Binary format (32 bits/pixel)
Image dimensions: 512 x 300

Mitch.esml
<ESML>
  <SyntacticMetaData>
    <Binary GeoInfo="NoGeoInfo">
      <Array name="Track" occurs="300" DimName="Y">
        <Field name="Channel 1" type="int" size="32" occurs="512" DataType="Data" DimName="X"/>
        <Data/>
      </Array>
    </Binary>
  </SyntacticMetaData>
</ESML>
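
To make the "define once, use anywhere" idea concrete, the following Python sketch reads a binary array using only an ESML-like description instead of a hand-written reader. It is not the ESML library: the function `read_with_description` is hypothetical, big-endian byte order is an assumption (the slide does not state it), and a 2 x 3 array stands in for the 300 x 512 image.

```python
import struct
import xml.etree.ElementTree as ET

# ESML-like description modeled on the slide, shrunk to 2 rows of 3 values.
DESCRIPTION = """
<ESML>
  <SyntacticMetaData>
    <Binary GeoInfo="NoGeoInfo">
      <Array name="Track" occurs="2" DimName="Y">
        <Field name="Channel 1" type="int" size="32" occurs="3" DataType="Data" DimName="X"/>
      </Array>
    </Binary>
  </SyntacticMetaData>
</ESML>
"""

def read_with_description(description, payload, byte_order=">"):
    """Decode `payload` into rows of integers as laid out by the description."""
    root = ET.fromstring(description)
    array = root.find("./SyntacticMetaData/Binary/Array")
    field = array.find("Field")
    rows = int(array.get("occurs"))   # number of repetitions along Y
    cols = int(field.get("occurs"))   # number of values along X
    if field.get("type") != "int" or field.get("size") != "32":
        raise ValueError("this sketch only handles 32-bit integers")
    # "i" is a 4-byte signed integer when an explicit byte order is given.
    values = struct.unpack(f"{byte_order}{rows * cols}i", payload)
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]

# Build a small in-memory payload in place of a real data file.
payload = struct.pack(">6i", 1, 2, 3, 4, 5, 6)
grid = read_with_description(DESCRIPTION, payload)
```

Supporting another layout then means writing a new description, not a new reader.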
29
Atmospheric Science Mining Applications

Lightning Detection
Rainfall Identification Estimation Study
Rainfall Accumulation Study
Tropical Cyclone Detection and Wind Speed Estimation
GOES Cumulus Cloud Classification
Mesoscale Convective System Detection
Detection of Jet Streams in Numerical Model Data

30
Tropical Cyclone Detection: Estimating Maximum Wind Speed
Advanced Microwave Sounding Unit (AMSU-A) Data

Water cover mask to eliminate land
Laplacian filter to compute temperature gradients
Science algorithm to estimate wind speed
Contiguous regions with wind speeds above a desired threshold identified
Additional test to eliminate false positives
Maximum wind speed and location produced
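
The processing chain above can be sketched in Python on a small grid. This is only an outline under stated assumptions: `estimate_wind` is a hypothetical linear stand-in for the science algorithm (which is not given here), the false-positive test is omitted, and the brightness-temperature values are invented.

```python
def laplacian(grid):
    """4-neighbor discrete Laplacian on interior cells; border cells stay 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1]
                         + grid[r][c + 1] - 4.0 * grid[r][c])
    return out

def estimate_wind(lap_value, scale=5.0):
    """HYPOTHETICAL placeholder for the science algorithm, not the real relation."""
    return scale * abs(lap_value)

def regions_above(wind, water, threshold):
    """Group 4-connected water cells whose wind speed exceeds the threshold."""
    rows, cols = len(wind), len(wind[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not water[r][c] or wind[r][c] <= threshold:
                continue
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and water[ny][nx] and wind[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append(region)
    return regions

def detect(tb, water, threshold):
    """Return (maximum wind speed, (row, col)) for each region over water."""
    wind = [[estimate_wind(v) for v in row] for row in laplacian(tb)]
    results = []
    for region in regions_above(wind, water, threshold):
        peak = max(region, key=lambda cell: wind[cell[0]][cell[1]])
        results.append((wind[peak[0]][peak[1]], peak))
    return results

# Invented 5 x 5 brightness-temperature field with a warm core at the center.
tb = [[250.0] * 5 for _ in range(5)]
tb[2][2] = 256.0
water = [[True] * 5 for _ in range(5)]
water[3][2] = False  # one land cell, removed by the water mask
detections = detect(tb, water, threshold=20.0)
```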

Calibration / Limb Correction / Converted to Tb
Hurricane Floyd
Data Archive
Mining Environment
Result
Results are placed on the web and made available to the National Hurricane Center and the Joint Typhoon Warning Center
31
Data Fusion and Mining: From Global Information to Local Knowledge
Emergency Response
Precision Agriculture
Urban Environments
Weather Prediction
32
Mining as a Web Service
33
Mining on Information Power Grid (IPG) using ADaM
IPG Processor
Mining LDAP Server
34
Earth Science Example of Developing a Knowledge Network: Collaborative Research in Mesoscale Convective Systems
Data Sets: SSM/I (F13, F14)
ADaM System
Add algorithm to detect MCSs
Generate end products while mining
Knowledge Base
Information about MCSs detected
Database: location, size, intensity, etc.
Visualization: Eureka Interface, Eureka Spatial
Pose questions and get answers from the Knowledge Repository (such as coincidence search, relationship testing)
End Users
Anyone can access the knowledge base via the web
Scientists/Researchers can ask questions such as

What is the latitudinal distribution of MCSs?
Which continent has more MCSs?
What is the seasonal distribution of MCSs?
What is the relationship between the number of MCSs and their intensity?
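
Questions like these map naturally onto queries against the knowledge base. The Python sketch below uses an in-memory SQLite table as a stand-in; the table name `mcs`, its columns (based on the location, size and intensity attributes mentioned in the slide) and every row are invented for illustration and are not results from the study.

```python
import math
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mcs (id INTEGER PRIMARY KEY, lat REAL, lon REAL, "
    "month INTEGER, size_km2 REAL, intensity REAL)"
)
# Invented rows: (lat, lon, month, size_km2, intensity).
rows = [
    (-12.5, 25.0, 1, 60000.0, 0.8),
    (-8.0, -60.0, 1, 45000.0, 0.6),
    (5.0, 10.0, 7, 80000.0, 0.9),
    (35.0, -95.0, 7, 70000.0, 0.7),
]
conn.executemany(
    "INSERT INTO mcs (lat, lon, month, size_km2, intensity) VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Latitudinal distribution: count detections per 10-degree latitude band.
# Banding is done in Python so negative latitudes round down correctly.
lat_counts = Counter(
    10 * math.floor(lat / 10.0) for (lat,) in conn.execute("SELECT lat FROM mcs")
)

# Seasonal distribution: count detections per month.
month_counts = dict(conn.execute("SELECT month, COUNT(*) FROM mcs GROUP BY month"))
```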

Generate information useful to the general public (students, researchers, policy makers, etc.)
Images
Forecast aids
General Science information
Answer the practical side of the problem

35
Challenges

Develop and document common/standard interfaces for interoperability of data and services
Design new data models for handling
real-time/streaming input
data fusion/integration
Design and develop distributed standardized catalog capabilities
Develop advanced resource allocation and load balancing techniques
Exploit distributed infrastructure for enhanced data mining functionality (grids, web services, etc.)
Develop more intelligent and intuitive user interfaces
Develop ontologies of scientific data, processes and data mining techniques for multiple domains
Support language- and system-independent components
Incorporate data mining into scientific curricula