Data Mining Research and Applications - PowerPoint PPT Presentation

Title: Data Mining Research and Applications


1
Data Mining Research and Applications
  • Workshop on Cyberinfrastructure
  • For Environmental Research and Education
  • October 31, 2002
  • Steve Tanner
  • Information Technology and Systems Center
  • University of Alabama in Huntsville
  • stanner_at_itsc.uah.edu
  • 256.824.5143
  • www.itsc.uah.edu

2
Key Questions
  • What is the most effective approach to developing
    an integrated framework and plan for an
    interdisciplinary environmental
    cyberinfrastructure?
  • What organizational structure is needed to
    provide long-term support for data storage,
    access, model development, and services for a
    global clientele of researchers, educators,
    policy makers, and citizens?
  • How will effective interagency and public-private
    partnerships be formed to provide financial
    support for such an extensive and costly system?
  • How can communication and coordination among
    computer scientists and environmental researchers
    and educators be enhanced to develop this
    innovative, powerful, and accessible
    infrastructure?

3
Data Mining
  • Data Mining is an interdisciplinary field drawing
    from areas such as statistics, machine learning,
    pattern recognition and others
  • Automated discovery of patterns, anomalies, etc.
    from vast observational and model data sets
  • Derived knowledge for decision making,
    predictions and disaster response
  • ADaM Algorithm Development and Mining System
  • datamining.itsc.uah.edu

4
Techniques used for Data Mining
  • Clustering Techniques
      • K Means
      • Isodata
      • Maximum
  • Pattern Recognition
      • Bayes Classifier
      • Minimum Distribution Classifier
  • Image Analysis
      • Boundary Detection
      • Cooccurrence Matrix
      • Dilation and Erosion
      • Histogram Operations
      • Polygon Circumscript
      • Spatial Filtering
      • Texture Operations
  • Genetic Algorithms
  • Neural Networks
  • Etc.

Data Mining systems usually involve a toolbox of
many different techniques and a means for
combining them
5
Typical Everyday Encounters with Data Mining
  • Google
      • Complex algorithm sequence to decide order
  • Amazon.com
      • Additional purchase suggestions
  • Credit Card Fraud
      • Event notification of odd usage

Most current Data Mining applications are text
based. Text provides an easily readable source
of heterogeneous data. Mining of scientific data
sets is more complex.
6
User Perspective and Data Perspective of the Data
Mining Process
[Figure: diagram contrasting the User Perspective and
the Data Perspective. Labels: Data Stores, Dataset,
Data, Calibration/Navigation, Preprocessing, Dataset
Specific Algorithms, Information, Transformation,
Domain Specific Algorithms, Knowledge, Analysis,
Decision, Volume, Value]
7
Scientific Analysis
  • Harnesses human analysis capabilities
  • Highly creative
  • Based on theory and hypothesis formulation
  • Physical basis is normally used for algorithms
  • Drawing insights about the underlying phenomena
  • Rapidly widening gap between data collection
    capabilities and the ability to analyze data
  • Potential for vast amounts of data to go unused
Data Mining
  • Provides automation of the analysis process
  • Can be used for dimensionality reduction when
    manual examination of data is impossible
  • Can have limitations
      • May not utilize domain knowledge
      • May be difficult to prove validity of the results
      • There may not be a physical basis
  • Should be viewed as a complementary tool and not a
    replacement for scientific analysis

8
Similarity between Data Mining and Scientific
Analysis Process
9
Mining Environments
  • Mining Framework (ADaM)
  • Complete System (Client and Engine)
  • Mining Engine (User provides its own client)
  • Application Specific Mining Systems
  • Operations Tool Kit
  • Stand Alone Mining Algorithms
  • Data Fusion
  • Distributed/Federated Mining
  • Distributed services
  • Distributed data
  • Chaining using Interchange Technologies
  • On-board Mining (EVE)
  • Real time and distributed mining
  • Processing environment constraints

10
Using the Mining Framework: Focusing on the
Information in Data
11
The ADaM Processing Model
[Figure: processing stages. Labels: Raw Data, Translated
Data, Preprocessed Data, Patterns/Models, Results]
12
Iterative Nature of the Data Mining Process
[Figure: iterative cycle. Labels: DATA, PREPROCESSING,
CLEANING and INTEGRATION, SELECTION and TRANSFORMATION,
MINING, DISCOVERY, EVALUATION and PRESENTATION,
KNOWLEDGE]
13
Distributed/Federated Mining: Meshing Data and
Algorithms to Generate Knowledge
14
ADaM Mining Environment for Scientific Data
  • The system provides knowledge discovery, feature
    detection and content-based searching for data
    values, as well as for metadata.
  • Contains over 120 different operations
  • Operations vary from specialized science data-set
    specific algorithms to various digital image
    processing techniques, processing modules for
    automatic pattern recognition, machine
    perception, neural networks, genetic algorithms
    and others

15
Classification Based on Texture Features and Edge
Density
  • Science Rationale: Man-made changes to land use
    cause changes in weather patterns, especially
    cumulus clouds
  • Comparison based on
      • Accuracy of detection
      • Amount of time required to classify
  • Cumulus cloud fields have a very characteristic
    texture signature in the GOES visible imagery

16
Parallel Version of Cloud Extraction
  • GOES images can be used to recognize cumulus
    cloud fields
  • Cumulus clouds are small and do not show up well
    in 4km resolution IR channels
  • Detection of cumulus cloud fields in GOES can be
    accomplished by using texture features or edge
    detectors

[Figure: parallel cloud extraction diagram. Labels:
Master; Slave 1, Slave 2, Slave 3; GOES Image; Laplacian
Filter; Sobel Horizontal Filter; Sobel Vertical Filter;
Energy Computation (x4); Classifier; Cloud Image;
Cumulus Cloud Mask]
  • Three edge detection filters are used together to
    detect cumulus clouds, an approach that lends
    itself to implementation on a parallel cluster
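
The filter-then-energy-then-classify flow described on this slide can be sketched roughly as follows. This is only an illustration, not the ADaM implementation: the kernels are the standard 3x3 Laplacian and Sobel masks, while the energy measure (mean squared response), the `threshold` value, the `is_cumulus` rule and the toy tiles are all assumptions made for the example.

```python
# Illustrative sketch only. The kernels are the standard 3x3 masks; the
# energy measure, threshold and decision rule are assumptions, not ADaM's.

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges


def filter3x3(image, kernel):
    """Apply a 3x3 kernel to the interior pixels; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def energy(filtered):
    """Mean squared filter response over the whole tile."""
    n = len(filtered) * len(filtered[0])
    return sum(v * v for row in filtered for v in row) / n


def edge_energies(tile):
    """One energy value per filter; each filter could run on its own node."""
    return [energy(filter3x3(tile, k)) for k in (LAPLACIAN, SOBEL_H, SOBEL_V)]


def is_cumulus(tile, threshold=1.0):
    """Hypothetical classifier: flag a tile when every edge energy is high."""
    return all(e > threshold for e in edge_energies(tile))


# Toy tiles: a uniform tile and a tile with one bright pixel in the middle.
smooth = [[5] * 5 for _ in range(5)]
spot = [[10 if (r, c) == (2, 2) else 0 for c in range(5)] for r in range(5)]
print(is_cumulus(smooth), is_cumulus(spot))
```

Because the three filters are independent of one another, the three `filter3x3` calls are the natural unit of work to hand to separate nodes, with only the scalar energies returned to the classifier.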

17
Automated Data Analysis for Boundary Detection
and Quantification
  • Analysis of polar cap auroras in large volumes
    of spacecraft UV images
  • Science Rationale: Indicators for predicting
    geomagnetic storms, which can
      • Damage satellites
      • Disrupt radio connections
  • Developing different mining algorithms to detect
    and quantify polar cap boundary

[Figure: Polar Cap Boundary]
18
Detecting Signatures
  • Science Rationale: Mesocyclone signatures in
    radar data are indicators of tornadic activity
  • Developing an algorithm based on wind velocity
    shear signatures
  • Improve accuracy and reduce false alarm rates

19
Genetic Subtyping Using Hierarchical Clustering
  • Biologists are interested in comparing DNA
    sequences to see how closely related they are to
    one another
  • Phylogenetic trees are constructed by performing
    hierarchical clustering on DNA sequences using
    genetic distance as a distance measure
  • Such trees show which organisms are most likely
    to share common ancestors, and may provide
    information about how various subtypes of
    organisms evolved
  • This information is useful when studying
    disease-causing organisms such as viruses and
    bacteria, because genetically similar types
    should behave in similar ways
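
As a rough illustration of building a tree by hierarchical clustering with a genetic distance, the sketch below uses average-linkage agglomerative clustering. Several things here are assumptions not stated in the slides: the distance is the simple proportion of differing sites between aligned, equal-length sequences, the linkage rule is average linkage, and the four sequences are made-up toy data.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites that differ (assumes equal-length sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2, seqs):
    """Average linkage: mean pairwise distance between members of two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(p_distance(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)


def build_tree(seqs):
    """Agglomerative clustering; returns a nested tuple of sequence names."""
    # Each cluster is a pair: (tree so far, names of its members).
    clusters = [(name, (name,)) for name in seqs]
    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda p: cluster_distance(p[0][1], p[1][1], seqs),
        )
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Made-up toy sequences, for illustration only.
sequences = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGTTCGAAA",
    "D": "TGCATGCATG",
}
tree = build_tree(sequences)
print(tree)
```

The nesting of the returned tuple mirrors the tree: sequences merged early sit deepest, and the last merge joins the most distant group.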

20
Mining on Data Ingest: Tropical Cyclone Detection in
Advanced Microwave Sounding Unit (AMSU-A) Data
  • Mining Plan
  • Water cover mask to eliminate land
  • Laplacian filter to compute temperature gradients
  • Science Algorithm to estimate wind speed
  • Contiguous regions with wind speeds above a
    desired threshold identified
  • Additional test to eliminate false positives
  • Maximum wind speed and location produced
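
The mining plan above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the operational algorithm: `estimate_wind` is a placeholder that simply scales the gradient magnitude (the actual science algorithm is not given in the slides), the false-positive test is assumed to be a minimum region size, and `threshold`, `min_cells` and the toy grid are hypothetical.

```python
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def laplacian(grid):
    """3x3 Laplacian on interior cells; border cells are left at 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(LAPLACIAN[i][j] * grid[r + i - 1][c + j - 1]
                            for i in range(3) for j in range(3))
    return out


def estimate_wind(gradient, scale=10.0):
    """Placeholder for the science algorithm: wind = scale * |gradient|."""
    return [[scale * abs(g) for g in row] for row in gradient]


def regions_above(wind, water, threshold):
    """Group 4-connected water cells whose wind exceeds the threshold."""
    rows, cols = len(wind), len(wind[0])
    seen, regions = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not water[r][c] or wind[r][c] <= threshold:
                continue
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and water[ny][nx] and wind[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append(region)
    return regions


def detect_cyclones(tb, water, threshold=30.0, min_cells=2):
    """Return (max_wind, (row, col)) for each region that passes the size test."""
    wind = estimate_wind(laplacian(tb))
    results = []
    for region in regions_above(wind, water, threshold):
        if len(region) < min_cells:      # assumed false-positive test
            continue
        peak = max(region, key=lambda p: wind[p[0]][p[1]])
        results.append((wind[peak[0]][peak[1]], peak))
    return results


# Toy brightness-temperature grid with one warm cell in the centre.
tb = [[260 if (r, c) == (2, 2) else 250 for c in range(5)] for r in range(5)]
all_water = [[True] * 5 for _ in range(5)]
all_land = [[False] * 5 for _ in range(5)]
print(detect_cyclones(tb, all_water), detect_cyclones(tb, all_land))
```

Applying the water mask inside the region search, rather than blanking the input, keeps land values from creating artificial gradients at coastlines; that ordering is a design choice of this sketch.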

[Figure: Hurricane Floyd example. Labels: Data Archive;
Calibration/Limb Correction/Converted to Tb; Mining
Environment; Result; Knowledge Base; Further Analysis]
Results are placed on the web, made available to
the National Hurricane Center and the Joint Typhoon
Warning Center, and stored for further analysis
pm-esip.msfc.nasa.gov/
21
Multiple Mining Environments: Passive Microwave
ESIP Information System
[Figure: system diagram. Labels: Data Ingest Processing;
AMSU-A Ingest; TMI Ingest and Product Generation; AMSU
Product Generation; ADaM-based Processing; Custom
Processing; Order Staging; PM-ESIP Catalog; Distributed
Data Stores; TMI; AMSU-A; SSM/I; SSM/T2]
22
Interoperability: Accessing Heterogeneous Data
  • Science data comes in
      • Different formats, types and structures
      • Different states of processing (raw, calibrated,
        derived, modeled or interpreted)
      • Enormous volumes
  • Heterogeneity leads to data usability problems
  • One approach: standard data formats
      • Difficult to implement and enforce
      • Can't anticipate all needs
      • Some data can't be modeled or is lost in
        translation
      • The cost of converting legacy data
  • A better approach: interchange technologies
      • Earth Science Markup Language (ESML)

[Figure, "The Problem": Data Format 1, Data Format 2,
Data Format 3; Reader 1, Reader 2, Format Converter;
Application]
[Figure, "The Solution": Data Format 1, Data Format 2,
Data Format 3, each with an ESML File; ESML Library;
Application]
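
The idea behind an interchange technology, where an external description tells one generic library how to read each format, can be illustrated with a toy example. The sketch below is not ESML syntax or the ESML library API; the `DESCRIPTIONS` dictionary, the format names and the field layouts are all hypothetical, and Python's standard `struct` module stands in for the reader.

```python
import struct

# Hypothetical descriptions: one per data format, kept outside the application.
# This mimics the idea of an external description file; it is NOT ESML syntax.
DESCRIPTIONS = {
    "format1": {"byte_order": ">",
                "fields": [("lat", "f"), ("lon", "f"), ("temp", "h")]},
    "format2": {"byte_order": "<",
                "fields": [("temp", "d"), ("lat", "d"), ("lon", "d")]},
}


def read_records(raw, description):
    """Decode fixed-size binary records into dicts using only the description."""
    layout = description["byte_order"] + "".join(
        code for _, code in description["fields"])
    names = [name for name, _ in description["fields"]]
    return [dict(zip(names, values))
            for values in struct.iter_unpack(layout, raw)]


# Two "files" holding the same observation in different binary layouts.
file1 = struct.pack(">ffh", 34.5, -86.5, 285)
file2 = struct.pack("<ddd", 285.0, 34.5, -86.5)

rec1 = read_records(file1, DESCRIPTIONS["format1"])[0]
rec2 = read_records(file2, DESCRIPTIONS["format2"])[0]
print(rec1["lat"] == rec2["lat"], rec1["temp"] == rec2["temp"])
```

The application calls the same `read_records` function for both files; supporting a new format means writing a new description rather than a new reader or a conversion step.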
23
Chained Image Processing Services
Service chaining integrates modules or services
developed on distributed platforms and in different
languages into a single processing solution.
[Figure: chained services and their platforms. Labels:
Reader (Java/C Windows); Format (Perl/Linux); Resample
(Perl/C Linux); GeoCrop (Perl/Linux); Draw Image
(Perl/C Linux); WMS (Java/Windows); Chained Services;
Data Files; ESML; ESML Lib; Knowledge Base; Data;
Data Streams]
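
A minimal sketch of service chaining, assuming each service can be modeled as a function that takes data and returns data. In the system on this slide the services run on different platforms and in different languages; here in-process Python functions stand in for them, and the `geocrop`, `resample` and `chain` helpers, their parameters and the toy points are all hypothetical.

```python
from functools import reduce


def geocrop(bounds):
    """Return a service keeping points inside (min_lat, max_lat, min_lon, max_lon)."""
    lo_lat, hi_lat, lo_lon, hi_lon = bounds

    def service(points):
        return [p for p in points
                if lo_lat <= p[0] <= hi_lat and lo_lon <= p[1] <= hi_lon]
    return service


def resample(step):
    """Return a service that keeps every step-th point."""
    def service(points):
        return points[::step]
    return service


def chain(*services):
    """Compose services so the output of each one feeds the next."""
    return lambda data: reduce(lambda acc, svc: svc(acc), services, data)


# Toy (lat, lon, value) points on a 10-degree grid.
points = [(lat, lon, lat + lon)
          for lat in range(0, 50, 10) for lon in range(0, 50, 10)]
pipeline = chain(geocrop((10, 30, 0, 20)), resample(2))
result = pipeline(points)
print(result)
```

Because every service shares one calling convention, services can be reordered or swapped without changing the others, which is the property that makes chaining across platforms workable.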
24
Data Integration using Web Mapping Services
[Figure: layered map display. Layer labels: Countries,
Coastlines, Cyclone Events, MCS Events, AMSU-A Channel
01. Other labels: Knowledge Base, AMSU-A, ITSC, Globe]
AMSU-A data overlaid with MCS and cyclone events
for September 2000, merged with world boundaries
from Globe.
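
For illustration, layers like these are typically requested from a Web Map Service with a GetMap call. The sketch below builds such a request URL using the standard WMS 1.1.1 GetMap parameters; the server address and the layer names are hypothetical and are not the servers used in the presentation.

```python
from urllib.parse import urlencode


def getmap_url(base_url, layers, bbox, width=800, height=400):
    """Build a WMS 1.1.1 GetMap request URL for the given layers."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),       # drawn in the order listed
        "STYLES": "",                     # empty means default styles
        "SRS": "EPSG:4326",               # plain latitude/longitude
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        "TRANSPARENT": "TRUE",            # lets images from servers be stacked
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer names, for illustration only.
url = getmap_url(
    "http://example.org/wms",
    ["coastlines", "cyclone_events"],
    (-180, -90, 180, 90),
)
print(url)
```

Requesting the same bounding box, size and reference system from several servers, with transparent backgrounds, is what allows a client to overlay the returned images into one fused display.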
25
Fused Displays from Multiple Servers
Analysis: Correlate MCSs and cyclones with
atmospheric temperatures for September 2000.
26
Multi-Level Mining
[Figure: "Concept Hierarchy for Data Mining and Fusion".
Labels: Decision Support; Concept Mining; Conceptual
Level; Event A, Event B; Feature Set I; Feature I,
Feature II, Feature III, Feature X, Feature Y; Data File
Level; Model and Observation Data]
27
On-Board Real-Time Processing: Sensor
Control/Targeting
EVE: Environment for On-board Processing
  • Anomaly detection
  • Data Mining
  • Autonomous Decision Making
  • Immediate response
  • Direct satellite to Earth delivery of results

www.itsc.uah.edu/eve
28
A Reconfigurable Web of Interacting Sensors
[Figure: sensor web diagram. Labels: Satellite
Constellations; Communications; Weather; Military;
Ground Network (x3)]
29
Example Plan: Threshold Events in AMSU-A
Streaming Data
[Figure: EVE]
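
Detecting threshold events in a data stream can be sketched with a generator that reports each run of samples above a threshold. This is an assumption-laden illustration rather than the EVE plan itself: the function name, the event tuple layout and the sample values are hypothetical.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak) for each run above the threshold."""
    start = peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value      # a new event begins
            else:
                peak = max(peak, value)         # the event continues
        elif start is not None:
            yield (start, index - 1, peak)      # the event just ended
            start = peak = None
    if start is not None:                       # stream ended during an event
        yield (start, index, peak)


# Made-up sample values, for illustration only.
samples = [1, 5, 7, 2, 9]
events = list(threshold_events(samples, 4))
print(events)
```

The generator holds only the current event in memory and reports results as soon as each event ends, which suits the processing constraints and immediate-response goals of on-board mining.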
30
Data Integration and Mining: From Global
Information to Local Knowledge
  • Emergency Response
  • Precision Agriculture
  • Urban Environments
  • Weather Prediction
31
Key Questions
  • What is the most effective approach to developing
    an integrated framework and plan for an
    interdisciplinary environmental
    cyberinfrastructure?
  • What organizational structure is needed to
    provide long-term support for data storage,
    access, model development, and services for a
    global clientele of researchers, educators,
    policy makers, and citizens?
  • How will effective interagency and public-private
    partnerships be formed to provide financial
    support for such an extensive and costly system?
  • How can communication and coordination among
    computer scientists and environmental researchers
    and educators be enhanced to develop this
    innovative, powerful, and accessible
    infrastructure?