Title: Data Mining Research and Applications
1 Data Mining Research and Applications
- Workshop on Cyberinfrastructure for Environmental Research and Education
- October 31, 2002
- Steve Tanner
- Information Technology and Systems Center
- University of Alabama in Huntsville
- stanner_at_itsc.uah.edu
- 256.824.5143
- www.itsc.uah.edu
2 Key Questions
- What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
- What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
- How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
- How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
3 Data Mining
- Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others
- Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
- Derived knowledge for decision making, predictions and disaster response
- ADaM: Algorithm Development and Mining System
- datamining.itsc.uah.edu
4 Techniques used for Data Mining
- Clustering Techniques
- K Means
- Isodata
- Maximum
- Pattern Recognition
- Bayes Classifier
- Minimum Distribution Classifier
- Image Analysis
- Boundary Detection
- Cooccurrence Matrix
- Dilation and Erosion
- Histogram Operations
- Polygon Circumscript
- Spatial Filtering
- Texture Operations
- Genetic Algorithms
- Neural Networks
- Etc.
Data Mining systems usually involve a toolbox of many different techniques and a means for combining them.
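As an illustration of one technique from the list above, here is a minimal K Means sketch in plain Python. It is not ADaM code: the function name `k_means`, the 2-D point representation, and the choice to seed the centroids with the first `k` points are all assumptions made for this example.

```python
def k_means(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    # Assumption: seed centroids with the first k points so runs are repeatable.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels
```

In a toolbox setting, the returned labels would typically feed a later step such as a classifier or a visualization.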
5 Typical Everyday Encounters with Data Mining
- Google
- Complex algorithm sequence to decide order
- Amazon.Com
- Additional purchase suggestions
- Credit Card Fraud
- Event notification of odd usage
Most current Data Mining applications are text-based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.
6 User Perspective and Data Perspective of the Data Mining Process
Diagram labels: Analysis, Decision, Volume, Value, Transformation, Knowledge, Preprocessing, Information, Dataset Specific Algorithms, Domain Specific Algorithms, Data, Calibration Navigation, Data Stores, Dataset, User Perspective, Data Perspective
7 Scientific Analysis
Data Mining
- Harnesses human analysis capabilities
- Highly creative
- Based on theory and hypothesis formulation
- Physical basis is normally used for algorithms
- Drawing insights about the underlying phenomena
- Rapidly widening gap between data collection capabilities and the ability to analyze data
- Potential of vast amounts of data to be unused
- Provides automation of the analysis process
- Can be used for dimensionality reduction when manual examination of data is impossible
- Can have limitations
- May not utilize domain knowledge
- May be difficult to prove validity of the results
- There may not be a physical basis
- Should be viewed as a complementary tool and not a replacement for scientific analysis
8 Similarity between Data Mining and Scientific Analysis Process
9 Mining Environments
- Mining Framework (ADaM)
- Complete System (Client and Engine)
- Mining Engine (User provides its own client)
- Application Specific Mining Systems
- Operations Tool Kit
- Stand Alone Mining Algorithms
- Data Fusion
- Distributed/Federated Mining
- Distributed services
- Distributed data
- Chaining using Interchange Technologies
- On-board Mining (EVE)
- Real time and distributed mining
- Processing environment constraints
10 Using the Mining Framework: Focusing on the information in data
11 The ADaM Processing Model
Diagram labels: Raw Data, Translated Data, Preprocessed Data, Patterns/Models, Results
12 Iterative Nature of the Data Mining Process
Diagram labels: DATA, PREPROCESSING, CLEANING And INTEGRATION, SELECTION And TRANSFORMATION, MINING, DISCOVERY, EVALUATION And PRESENTATION, KNOWLEDGE
13 Distributed/Federated Mining: Meshing data and algorithms to generate knowledge
14 ADaM Mining Environment for Scientific Data
- The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
- Contains over 120 different operations
- Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others
15 Classification Based on Texture Features and Edge Density
- Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
- Comparison based on
- Accuracy of detection
- Amount of time required to classify
- Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
16 Parallel Version of Cloud Extraction
- GOES images can be used to recognize cumulus cloud fields
- Cumulus clouds are small and do not show up well in 4km resolution IR channels
- Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
Diagram labels: Master, Slave 1, Slave 2, Slave 3, GOES Image, Laplacian Filter, Sobel Horizontal Filter, Sobel Vertical Filter, Energy Computation (x4), Classifier, Cloud Image, GOES Image, Cumulus Cloud Mask
- Three edge detection filters are used together to detect cumulus clouds, which lends itself to implementation on a parallel cluster
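The filter-then-energy-then-classify flow on this slide can be sketched roughly as follows. This is an illustrative sketch, not the ADaM implementation: the 3x3 kernels are the standard textbook Laplacian and Sobel masks, the energy measure (mean squared response) and the single-threshold classifier are assumptions, and a thread pool stands in for the nodes of a parallel cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 kernels for the three filters named on the slide.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def filter_energy(image, kernel):
    """Mean squared filter response over the interior pixels of `image`."""
    rows, cols = len(image), len(image[0])
    total, count = 0.0, 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
            total += response * response
            count += 1
    return total / count


def classify_tile(image, threshold=1.0):
    """Label a tile 'cumulus' if its summed edge energy exceeds `threshold`.

    The threshold rule is a hypothetical stand-in for the slide's classifier.
    """
    # Each filter runs independently, so the three can be computed in parallel.
    with ThreadPoolExecutor(max_workers=3) as pool:
        energies = list(
            pool.map(lambda k: filter_energy(image, k), (LAPLACIAN, SOBEL_H, SOBEL_V))
        )
    label = "cumulus" if sum(energies) > threshold else "clear"
    return label, energies
```

Because each filter needs only the input tile, the work divides cleanly: one worker per filter, with the classifier combining the resulting energies at the end.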
17 Automated Data Analysis for Boundary Detection and Quantification
- Analysis of polar cap auroras in large volumes of spacecraft UV images
- Science Rationale: Indicators to predict geomagnetic storms, which can
- Damage satellites
- Disrupt radio connections
- Developing different mining algorithms to detect and quantify the polar cap boundary
Polar Cap Boundary
18 Detecting Signatures
- Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
- Developing an algorithm based on wind velocity shear signatures
- Improve accuracy and reduce false alarm rates
19 Genetic Subtyping Using Hierarchical Clustering
- Biologists are interested in comparing DNA sequences to see how closely related they are to one another
- Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
- Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
- This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
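The clustering step described above can be sketched in a few lines of plain Python. The slide does not say which genetic distance or linkage rule was used, so this example assumes the simplest choices: the proportion of differing sites between pre-aligned, equal-length sequences, and average linkage. The names `p_distance`, `leaves` and `build_tree` are made up for the illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Return the sequence names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def build_tree(sequences):
    """Average-linkage agglomerative clustering.

    `sequences` maps a name to an aligned sequence string. The result is a
    nested tuple such as (("A", "B"), ("C", "D")).
    """
    clusters = list(sequences)  # every sequence starts as its own cluster

    def linkage(t1, t2):
        pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
        total = sum(p_distance(sequences[a], sequences[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        t1, t2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (t1, t2)]
        clusters.append((t1, t2))
    return clusters[0]
```

Sequences that end up under the same inner tuple are the ones the tree treats as most closely related.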
20 Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
- Mining Plan
- Water cover mask to eliminate land
- Laplacian filter to compute temperature gradients
- Science Algorithm to estimate wind speed
- Contiguous regions with wind speeds above a desired threshold identified
- Additional test to eliminate false positives
- Maximum wind speed and location produced
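The mining plan above can be sketched as a small pipeline. This is only an outline of the control flow under loud assumptions: the wind estimate (a scaled Laplacian magnitude) is a placeholder and not the science algorithm, the false-positive test (a minimum region size) is hypothetical, and `detect_cyclones` and its parameters are names invented for this example.

```python
def laplacian(grid, r, c):
    """4-neighbor Laplacian at (r, c); out-of-range neighbors are clamped."""
    rows, cols = len(grid), len(grid[0])

    def at(i, j):
        return grid[min(max(i, 0), rows - 1)][min(max(j, 0), cols - 1)]

    return at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1) - 4 * at(r, c)


def detect_cyclones(tb, water, wind_threshold, min_cells=2, scale=1.0):
    """Return one dict per detected region with its peak wind and location."""
    rows, cols = len(tb), len(tb[0])
    # Mask land, take the temperature gradient, estimate wind (placeholder formula).
    wind = [
        [scale * abs(laplacian(tb, r, c)) if water[r][c] else 0.0 for c in range(cols)]
        for r in range(rows)
    ]
    seen, events = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or wind[r][c] < wind_threshold:
                continue
            # Flood fill to collect one contiguous above-threshold region.
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                i, j = stack.pop()
                region.append((i, j))
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    inside = 0 <= ni < rows and 0 <= nj < cols
                    if inside and (ni, nj) not in seen and wind[ni][nj] >= wind_threshold:
                        seen.add((ni, nj))
                        stack.append((ni, nj))
            # Hypothetical false-positive test: discard very small regions.
            if len(region) < min_cells:
                continue
            peak = max(region, key=lambda p: wind[p[0]][p[1]])
            events.append(
                {"max_wind": wind[peak[0]][peak[1]], "location": peak, "cells": len(region)}
            )
    return events
```

Here `tb` is a grid of brightness temperatures and `water` is a same-shaped grid of booleans; both are stand-ins for the calibrated AMSU-A data and the water cover mask.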
Diagram labels: Further Analysis, Calibration/Limb Correction/Converted to Tb, Knowledge Base, Data Archive, Hurricane Floyd, Mining Environment, Result
Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis
pm-esip.msfc.nasa.gov/
21 Multiple Mining Environments: Passive Microwave ESIP Information System
Diagram labels: AMSU Product Generation, ADaM-based Processing, Order Staging, PM-ESIP Catalog, Custom Processing, AMSU-A Ingest, TMI, AMSU-A, SSM/I, SSM/T2, TMI Ingest and Product Generation, Distributed Data Stores, Data Ingest Processing
22 Interoperability: Accessing Heterogeneous Data
- Science data comes in
- Different formats, types and structures
- Different states of processing (raw, calibrated, derived, modeled or interpreted)
- Enormous volumes
- Heterogeneity leads to data usability problems
- One approach: Standard data formats
- Difficult to implement and enforce
- Can't anticipate all needs
- Some data can't be modeled or is lost in translation
- The cost of converting legacy data
- A better approach: Interchange Technologies
- Earth Science Markup Language
The Problem (diagram labels): DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3, FORMAT CONVERTER, READER 1, READER 2, APPLICATION
The Solution (diagram labels): DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3, ESML FILE (x3), ESML LIBRARY, APPLICATION
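The idea behind the interchange approach, one generic reader driven by an external description of each format, can be illustrated with a short sketch. It does not use ESML syntax or the ESML library API. Plain Python dictionaries stand in for the description files, and the format names, field names and `read_records` function are invented for the example.

```python
import struct

# Hypothetical stand-ins for external description files: each one states the
# byte order and the ordered (name, struct type code) fields of one format.
DESCRIPTIONS = {
    "format_1": {
        "byte_order": "<",
        "fields": [("lat", "f"), ("lon", "f"), ("tb", "h")],
    },
    "format_2": {
        "byte_order": ">",
        "fields": [("tb", "h"), ("lon", "d"), ("lat", "d")],
    },
}


def read_records(raw, description):
    """Decode fixed-size binary records using only the given description."""
    layout = description["byte_order"] + "".join(
        code for _, code in description["fields"]
    )
    names = [name for name, _ in description["fields"]]
    return [dict(zip(names, values)) for values in struct.iter_unpack(layout, raw)]
```

The application calls the same `read_records` function for every format. Supporting a new format means writing a new description, not a new reader or converter.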
23 Chained Image Processing Services
Service Chaining is used to integrate modules or services developed on distributed platforms and in different languages into a single processing solution.
Diagram labels: WMS (Java/Windows), Format (Perl/Linux), Resample (Perl/C Linux), GeoCrop (Perl/Linux), Chained Services, Draw Image (PERL/C Linux), Data Streams, Data, Reader (Java/C Windows), Data Files, ESML, ESML Lib, Knowledge Base, Data Files
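A minimal way to picture service chaining is function composition, where each service consumes the previous one's output. In the sketch below, local Python functions stand in for the remote services. The stage names echo the diagram, but their signatures and behavior are assumptions made for this illustration.

```python
from functools import reduce


def chain(*services):
    """Compose services left to right into a single callable."""

    def run(data):
        return reduce(lambda result, service: service(result), services, data)

    return run


def geo_crop(row0, row1, col0, col1):
    """Return a service that keeps only the given rows and columns of a grid."""
    return lambda grid: [row[col0:col1] for row in grid[row0:row1]]


def resample(step):
    """Return a service that keeps every `step`-th row and column."""
    return lambda grid: [row[::step] for row in grid[::step]]


def draw_image(grid):
    """Render a grid as text, marking positive cells with '#'."""
    return "\n".join("".join("#" if v > 0 else "." for v in row) for row in grid)
```

A chain is then built once and reused, for example `chain(geo_crop(0, 4, 0, 4), resample(2), draw_image)`. Swapping or reordering stages does not require changing any individual service.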
24 Data Integration using Web Mapping Services
Diagram labels: Countries, Cyclone Events, AMSU-A Channel 01, MCS Events, Coastlines, Knowledge Base, AMSU-A, ITSC, Globe
AMSU-A data overlaid with MCS and Cyclone events
for September 2000, merged with world boundaries
from Globe.
25 Fused Displays from Multiple Servers
Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
26 MULTI-LEVEL MINING
Diagram labels: CONCEPT MINING, DECISION SUPPORT, EVENT A, EVENT B, CONCEPTUAL LEVEL, FEATURE SET I, FEATURE I, FEATURE II, FEATURE III, FEATURE X, FEATURE Y, Model and Observation Data, DATA FILE LEVEL
Concept Hierarchy for Data Mining and Fusion
27 On-Board Real-Time Processing: Sensor Control/Targeting
EVE: Environment for On-board Processing
- Anomaly detection
- Data Mining
- Autonomous Decision Making
- Immediate response
- Direct satellite to Earth delivery of results
www.itsc.uah.edu/eve
28 A Reconfigurable Web of Interacting Sensors
Diagram labels: Communications, Weather, Satellite Constellations, Military, Ground Network (x3)
29 Example Plan: Threshold events in AMSU-A Streaming Data
EVE
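Detecting threshold events in a data stream can be sketched with a generator that sees one sample at a time and keeps only a small amount of state, which suits a constrained on-board environment. This is an illustrative sketch, not EVE code. The function name, the event tuple layout, and the "strictly above the threshold" rule are assumptions.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak) for each run of samples above threshold."""
    start = peak = None
    index = -1  # stays -1 if the stream is empty
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value  # a new event begins
            else:
                peak = max(peak, value)  # the event continues
        elif start is not None:
            yield (start, index - 1, peak)  # the event ended on the previous sample
            start = peak = None
    if start is not None:
        yield (start, index, peak)  # the stream ended during an event
```

Because events are yielded as soon as they close, a downstream step can react immediately without waiting for the whole stream.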
30 Data Integration and Mining: From Global Information to Local Knowledge
Emergency Response
Precision Agriculture
Urban Environments
Weather Prediction
31 Key Questions
- What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
- What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
- How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
- How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?