Title: ADaM Tutorial (Dr. John Rushing)
1. ADaM Tutorial
Dr. John Rushing
- Information Technology and Systems Center
- University of Alabama in Huntsville
- National Space Science and Technology Center
- 256-824-6064
- dhardin@itsc.uah.edu
- http://www.itsc.uah.edu/
- http://datamining.itsc.uah.edu/
2. Data Mining: What it is
- Extracting knowledge from large amounts of data
- Motivation:
  - Our ability to collect data has expanded rapidly
  - It is impossible to analyze all of the data manually
  - Data contains valuable information that can aid in decision making
- Uses techniques from:
  - Pattern Recognition
  - Machine Learning
  - Statistics
  - High Performance Database Systems
  - On-Line Analytical Processing (OLAP)
  - Plus other techniques unique to data mining (association rules)
- Data mining methods must be efficient and scalable
3. Data Mining: What it isn't
- Small Scale
  - Data mining methods are designed for large data sets
  - Scale is one of the characteristics that distinguishes data mining applications from traditional machine learning applications
- Foolproof
  - Data mining techniques will discover patterns in any data
  - The patterns discovered may be meaningless
  - It is up to the user to determine how to interpret the results
- Magic
  - Data mining techniques cannot generate information that is not present in the data
  - They can only find the patterns that are already there
4. Data Mining: Types of Mining
- Association Rule Mining
  - Initially developed for market basket analysis
  - Goal is to discover relationships between attributes
  - Uses include decision support, classification and clustering
- Classification and Prediction (Supervised Learning)
  - Classifiers are created using labeled training samples
  - Training samples created by ground truth / experts
  - Classifier later used to classify unknown samples
- Clustering (Unsupervised Learning)
  - Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes
  - Discover overall distribution patterns and relationships between attributes
- Other Types of Mining
  - Outlier Analysis
  - Concept / Class Description
  - Time Series Analysis
5. ADaM System Overview
- Developed by the Information Technology and Systems Center at the University of Alabama in Huntsville
- Consists of over 75 interoperable mining and image processing components
- Each component is provided with a C application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
- ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
- ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
- Future versions will include Python wrappers and possibly web service interfaces
6. ADaM 4.0 Components
7. ADaM Classification
- Definition: Classification is the development, from a known set of training data, of a procedure for identifying some phenomenon of interest that can later be used to find similar patterns in similar data sets.
- Method:
  - Define or categorize classes of data objects.
  - Create a model represented by training samples, which taken together form a training data set.
  - Represent the model mathematically in the form of classification rules, decision trees or formulae.
  - Use the model to categorize similar data sets.
8. ADaM Classification - Process
- Identify potential features which may characterize the phenomenon of interest
- Generate a set of training instances, where each instance consists of a set of feature values and the corresponding class label
- Describe the instances using the ARFF file format
- Preprocess the data as necessary (normalize, sample, etc.)
- Split the data into training / test set(s) as appropriate
- Train the classifier using the training set
- Evaluate classifier performance using the test set
- K-fold cross validation, leave-one-out, or other more sophisticated methods may also be used for evaluating classifier performance (see the sketch below)
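As an illustration of that last evaluation step, here is a minimal pure-Python sketch of k-fold index splitting; it is illustrative only and not an ADaM component:

    import random

    def k_fold_indices(n_instances, k=5, seed=42):
        """Yield (train, test) index lists for k-fold cross validation."""
        indices = list(range(n_instances))
        random.Random(seed).shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j in range(k) if j != i for idx in folds[j]]
            yield train, test

    # Example: 10 instances, 5 folds -> five 8/2 train/test splits
    for train_idx, test_idx in k_fold_indices(10):
        print(len(train_idx), len(test_idx))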
9. Naïve Bayes Classification - 1
- Classification problem with m classes C1, C2, ..., Cm
- Given an unknown sample X, the goal is to choose the class that is most likely based on statistics from the training data
- P(Ci | X) can be computed using Bayes' theorem [1]:

  $P(C_i \mid X) = \dfrac{P(X \mid C_i)\,P(C_i)}{P(X)}$

[1] Equations from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
10. Naïve Bayes Classification - 2
- P(X) is constant for all classes, so finding the most likely class amounts to maximizing P(X | Ci) P(Ci)
- P(Ci) is the prior probability of class i. If the prior probabilities are not known, equal probabilities can be assumed.
- Assuming the attributes are conditionally independent given the class [1]:

  $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

- P(xk | Ci) is the probability density function for attribute k

[1] Equation from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
11. Naïve Bayes Classification - 3
- P(xk | Ci) is estimated from the training samples
- Categorical attributes (non-numeric attributes)
  - Estimate P(xk | Ci) as the percentage of samples of class i with value xk
  - Training involves counting the percentage of occurrence of each possible value for each class
- Numeric attributes
  - Also use statistics of the sample data to estimate P(xk | Ci)
  - The actual form of the density function is generally not known, so a Gaussian density is often assumed
  - Training involves computation of the mean and variance for each attribute for each class
12. Naïve Bayes Classification - 4
- Gaussian distribution for numeric attributes [1]:

  $P(x_k \mid C_i) = \dfrac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\!\left(-\dfrac{(x_k - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$

- where $\mu_{ik}$ is the mean of attribute k observed in samples of class Ci
- and $\sigma_{ik}$ is the standard deviation of attribute k observed in samples of class Ci

[1] Equation from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
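To make slides 9-12 concrete, here is a minimal NumPy sketch of Gaussian Naïve Bayes training and classification; it illustrates the math above and is not ADaM's actual implementation:

    import numpy as np

    def train_naive_bayes(X, y):
        """For each class: prior P(Ci), per-attribute means and standard deviations."""
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            model[c] = (len(Xc) / len(X),        # prior P(Ci)
                        Xc.mean(axis=0),          # mu_ik for each attribute k
                        Xc.std(axis=0) + 1e-9)    # sigma_ik (epsilon avoids divide-by-zero)
        return model

    def classify(model, x):
        """Pick the class maximizing log P(Ci) + sum_k log P(xk | Ci)."""
        def log_posterior(c):
            prior, mu, sigma = model[c]
            return (np.log(prior)
                    - np.sum(np.log(np.sqrt(2 * np.pi) * sigma))
                    - np.sum((x - mu) ** 2 / (2 * sigma ** 2)))
        return max(model, key=log_posterior)

    # Tiny worked example: two classes in a 2-attribute space
    X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.8, 9.2]])
    y = np.array([0, 0, 1, 1])
    print(classify(train_naive_bayes(X, y), np.array([7.5, 8.9])))  # -> 1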
13. ADaM Classification - Example
- Starting with an ARFF file, the ADaM system will be used to create a Naïve Bayes classifier and evaluate it
- The source data will be an ARFF version of the Wisconsin breast cancer data from the University of California Irvine (UCI) Machine Learning Database
- http://www.ics.uci.edu/~mlearn/MLRepository.html
- The Naïve Bayes classifier will be trained to distinguish malignant vs. benign tumors based on nine characteristics
14. Sample Data Set: ARFF Format
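The ARFF listing on the original slide did not survive conversion. As a hedged sketch, the header for this data set might look roughly like the following; the attribute names here are assumptions based on the UCI documentation of the Wisconsin breast cancer data:

    @relation breast-cancer-wisconsin

    @attribute clump_thickness numeric
    @attribute uniformity_of_cell_size numeric
    @attribute uniformity_of_cell_shape numeric
    @attribute marginal_adhesion numeric
    @attribute single_epithelial_cell_size numeric
    @attribute bare_nuclei numeric
    @attribute bland_chromatin numeric
    @attribute normal_nucleoli numeric
    @attribute mitoses numeric
    @attribute class {benign, malignant}

    @data
    5,1,1,1,2,1,3,1,1,benign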
15. Preparing the Training Samples
- ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiers
- The simplest is ITSC_Sample, which splits the source data set into two disjoint subsets
16. Splitting the Samples
- We will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns
- ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66
- The -i argument specifies the input file name
- The -o and -t arguments specify the names of the two output files (-o output one, -t output two)
- The -p argument specifies the portion of the data that goes into output one (trn.arff); the remainder goes to output two (tst.arff)
- The -c argument tells the sample program which attribute is the class attribute
17. Training the Classifier - 1
- Training involves computation of the mean vector and covariance matrix for each class
- The mean of an attribute for a class is the sum of the attribute values divided by the number of instances
- The mean vector is the vector of attribute means
- The variance of an attribute is the sum of the squared deviations from the mean for that attribute over the patterns in the class
- The covariance of a pair of attributes is the sum, over the class patterns, of the product of the deviation from the mean of attribute one and the deviation from the mean of attribute two (see the sketch below)
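These statistics map directly onto a few NumPy calls. A short sketch, assuming the patterns of one class sit in a 2-D array (rows = instances, columns = attributes); the data values are hypothetical:

    import numpy as np

    # Hypothetical patterns for one class (3 instances, 2 attributes)
    Xc = np.array([[5.0, 1.0], [4.0, 2.0], [6.0, 3.0]])

    mean_vector = Xc.mean(axis=0)           # vector of attribute means
    variances = Xc.var(axis=0)              # per-attribute variances
    covariance = np.cov(Xc, rowvar=False)   # covariance matrix over attribute pairs
    print(mean_vector, variances, covariance, sep="\n")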
18. Training the Classifier - 2
- ADaM has several different types of classifiers
- Each classifier has a training method and an application method
- Here we show one of them, the Naïve Bayes classifier, with the following syntax:
19. Training the Classifier - 3
- Training the Naïve Bayes classifier:
- ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt
- The -i argument specifies the input file name
- The -c argument specifies the name of the class attribute
- The -b argument specifies the name of the classifier file
- The output of the ITSC_NaiveBayesTrain module is shown here.
20. Applying the Classifier - 1
- Use the mean and covariance to compute P(X | Ci), the probability that pattern X occurs in class Ci, for all classes
- The prior probabilities P(Ci) are already known
- Assign the pattern X to the class for which P(X | Ci) P(Ci) is maximal
- Note that P(X | Ci) is high when X is close to the mean of the class.
- The variance is used to adjust for how much patterns within the class typically vary.
- A pattern may be assigned to a class whose mean is further away than the mean of some other class!
21. Applying the Classifier - 2
- Once trained, the Naïve Bayes classifier can be used to classify unknown instances
- The syntax for ADaM's Naïve Bayes classifier is as follows:
22. Applying the Classifier - 3
- The classifier is run as follows:
- ITSC_NaiveBayesApply -c class -i trn.arff -b bayes.txt -o res_trn.arff
- The -i argument specifies the input file name
- The -c argument specifies the name of the class attribute
- The -b argument specifies the name of the classifier file
- The -o argument specifies the name of the result file
23. Evaluating Classifier Performance
- By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual output.
- The ITSC_Accuracy utility performs this function
24. Evaluating Classifier Performance
- ITSC_Accuracy is run as follows:
- ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt
- The -c argument provides the name of the class attribute
- The -t argument indicates the input classified data set file
- The -v argument specifies the name of a valid input file (the file with the known correct classes)
- The -o argument gives the output file name that holds the results of the comparison
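The comparison ITSC_Accuracy performs can be sketched in a few lines of Python, assuming the predicted and true class labels have already been extracted into two lists (the label values below are illustrative):

    from collections import Counter

    def accuracy_report(predicted, actual):
        """Return overall accuracy plus a (true, predicted) -> count confusion table."""
        confusion = Counter(zip(actual, predicted))
        correct = sum(n for (t, p), n in confusion.items() if t == p)
        return correct / len(actual), confusion

    acc, confusion = accuracy_report(
        predicted=["benign", "benign", "malignant"],
        actual=["benign", "malignant", "malignant"])
    print(f"accuracy = {acc:.2%}")   # accuracy = 66.67%
    print(confusion)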
25. Evaluating Classifier Performance
26. Python Script for Workflow
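The script shown on the original slide did not survive conversion. The following is a minimal reconstruction that chains the commands from slides 16-24; driving the executables via subprocess is an assumption about usage, and the apply step targets the test set so that slide 24's accuracy check lines up:

    import subprocess

    def run(cmd):
        """Run one ADaM command-line component, failing loudly on error."""
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    # Split the data 2/3 train, 1/3 test (slide 16)
    run(["ITSC_Sample", "-c", "class", "-i", "bcw.arff",
         "-o", "trn.arff", "-t", "tst.arff", "-p", "0.66"])
    # Train the Naive Bayes classifier (slide 19)
    run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "trn.arff", "-b", "bayes.txt"])
    # Apply it to the held-out test set (slide 22)
    run(["ITSC_NaiveBayesApply", "-c", "class", "-i", "tst.arff",
         "-b", "bayes.txt", "-o", "res_tst.arff"])
    # Compare predicted vs. known classes (slide 24)
    run(["ITSC_Accuracy", "-c", "class", "-t", "res_tst.arff",
         "-v", "tst.arff", "-o", "acc_tst.txt"])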
27. ADaM Image Classification
- Classification of image data is a bit more involved, as there is an additional set of steps that must be performed to extract useful features from the images before classification can be performed.
- In addition, it is also useful to transform the data back into image format for visualization purposes.
- As an example problem, we will consider detection of cumulus cloud fields in GOES satellite images
- GOES satellites produce a 5-channel image every 15 minutes
- The classifier must label each pixel as either belonging to a cumulus cloud field or not, based on the GOES data
- Algorithms based on spectral properties often miss cumulus clouds because of the low resolution of the IR channels and the small size of the clouds
- Texture features computed from the GOES visible image provide a means to detect cumulus cloud fields.
28. GOES Images - Preprocessing
- Segmentation is based only on the high resolution (1 km) visible channel.
- In order to remove the effects of the light reflected from the Earth's surface, a visible reference background image is constructed for each time of the day.
- The reference image is subtracted from the visible image before it is segmented.
- GOES image patches containing cumulus cloud regions, other cloud regions, and background were selected
- Independent experts labeled each pixel of the selected image patches as cumulus cloud or not
- The expert labels were combined to form a single truth image for each of the original image patches. In cases where the experts disagreed, the truth image was given a "don't know" value
29. GOES Images - Example
[Figure: GOES image patch with corresponding expert labels]
30. Image Quantization
- Some texture features perform better when the image is quantized to a small number of levels before the features are computed.
- ITSC_RelLevel performs local image quantization
31. Image Quantization
- For this demo, we will reduce the number of levels from 256 to just three using local image statistics (a rough sketch of the idea follows below)
- ITSC_RelLevel -d -s 30 -i src.bin -o q4.bin -k
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -d argument tells the program to use the standard deviation to set the cutoffs instead of a fixed value
- The -k option tells the program to keep values in the range 0, 1, 2 rather than normalizing to 0..1
- The -s argument indicates the size of the local area used to compute statistics
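For intuition, here is a rough NumPy/SciPy sketch of local three-level quantization using mean plus or minus standard-deviation cutoffs. This is a guess at the flavor of what ITSC_RelLevel -d computes, not its actual rule:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def quantize_local(image, size=30, d=1.0):
        """Map each pixel to 0/1/2 relative to its local mean and std deviation."""
        img = image.astype(float)
        mean = uniform_filter(img, size)
        sq_mean = uniform_filter(img ** 2, size)
        std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
        out = np.ones_like(img, dtype=np.uint8)   # level 1: near the local mean
        out[img < mean - d * std] = 0             # level 0: well below
        out[img > mean + d * std] = 2             # level 2: well above
        return out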
32. Computing Texture Features
- ADaM is currently able to compute five different types of texture features: gray level co-occurrence, gray level run length, association rules, Gabor filters, and MRF models
- The syntax for gray level run length computation is:
33. Computing Texture Features
- For this demo, we will compute gray level run length features using a tile size of 25 (a sketch of the underlying feature computation follows below)
- ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -l argument tells the program the number of levels in the input image
- The -B option tells the program to write a binary version of the ARFF file (the default is ASCII)
- The -t argument indicates the size of the tiles used to compute the gray level run length features
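As a rough illustration of this feature family, the sketch below builds a horizontal run-length matrix for one tile and derives two classic statistics from it; ADaM's actual feature definitions and run directions may differ:

    import numpy as np
    from itertools import groupby

    def run_length_matrix(tile, levels=3):
        """rlm[g, r-1] = number of horizontal runs of gray level g with length r."""
        rlm = np.zeros((levels, tile.shape[1]))
        for row in tile:
            for level, run in groupby(row):
                rlm[level, len(list(run)) - 1] += 1
        return rlm

    def glrl_features(tile, levels=3):
        """Short and long run emphasis from the run-length matrix."""
        rlm = run_length_matrix(tile, levels)
        r = np.arange(1, rlm.shape[1] + 1)
        n_runs = rlm.sum()
        sre = (rlm / r ** 2).sum() / n_runs   # short run emphasis
        lre = (rlm * r ** 2).sum() / n_runs   # long run emphasis
        return sre, lre

    tile = np.array([[0, 0, 1, 2, 2], [1, 1, 1, 0, 2]])
    print(glrl_features(tile))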
34. Converting the Label Images
- Since the labels are in the form of images, it is necessary to convert them to vector form
- ITSC_CvtImageToArff will do this
35. Converting the Label Images
- The labels can be converted to vector form using:
- ITSC_CvtImageToArff -i lbl.bin -o lbl.arff -B
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
36. Labeling the Patterns
- Once the labels are in vector form, they can be appended to the patterns produced by ITSC_Glrl
- ITSC_LabelPatterns will do this
37. Labeling the Patterns
- The labels are assigned to patterns as follows:
- ITSC_LabelPatterns -i glrl.arff -c class -l lbl.bin -L lbl.arff -o all.arff -B
- The -i argument specifies the input file name (patterns)
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute in the pattern set
- The -l argument specifies the name of the label attribute in the label set
- The -L argument specifies the name of the input label file
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
38. Eliminating "Don't Know" Patterns
- Some of the original pixels were classified differently by different experts and marked as "don't know"
- The corresponding patterns can be removed from the training set using ITSC_Subset
39. Eliminating "Don't Know" Patterns
- ITSC_Subset is used to remove patterns with unclear class assignment. The subset is generated based on the value of the class attribute
- ITSC_Subset -i all.arff -o subset.arff -a class -r 0 1 -B
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -a argument tells which attribute to test
- The -r argument gives the legal range of the attribute
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
40. Selecting Random Samples
- Random samples are selected from the original training data using the same ITSC_Sample program shown in the previous demo
- The program is used in a slightly different way:
- ITSC_Sample -i subset.arff -c class -o s1.arff -n 2000
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute
- The -n option tells the program to select an equal number of random samples (in this case 2000) from each class.
41. Python Script for Sample Creation
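Again, the slide's script is missing; the following is a plausible reconstruction chaining the commands from slides 31-40 for a single image. The subprocess invocation and the function wrapper are assumptions; the file names and flags are those given on the slides:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    def make_samples(sample_out="s1.arff"):
        """Quantize, extract texture, attach labels, drop don't-knows, then sample."""
        run(["ITSC_RelLevel", "-d", "-s", "30", "-i", "src.bin", "-o", "q4.bin", "-k"])
        run(["ITSC_Glrl", "-i", "q4.bin", "-o", "glrl.arff", "-l", "3", "-B", "-t", "25"])
        run(["ITSC_CvtImageToArff", "-i", "lbl.bin", "-o", "lbl.arff", "-B"])
        run(["ITSC_LabelPatterns", "-i", "glrl.arff", "-c", "class", "-l", "lbl.bin",
             "-L", "lbl.arff", "-o", "all.arff", "-B"])
        run(["ITSC_Subset", "-i", "all.arff", "-o", "subset.arff", "-a", "class",
             "-r", "0", "1", "-B"])
        run(["ITSC_Sample", "-i", "subset.arff", "-c", "class", "-o", sample_out,
             "-n", "2000"])

    make_samples()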
42. Merging Samples / Multiple Images
- The procedure up to this point has created a random subset of points from a particular image. Subsets from multiple images can be combined using ITSC_MergePatterns
43. Merging Samples / Multiple Images
- Multiple pattern sets are merged using the following command:
- ITSC_MergePatterns -c class -o merged.arff -i s1.arff s2.arff
- The -i argument specifies the input file names
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute
44. Python Script for Training
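A minimal reconstruction of the missing training script, merging the per-image sample sets (slide 43) and then training the classifier (slide 19); the subprocess wrapper is an assumption:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Merge the per-image sample sets, then train Naive Bayes on the result.
    run(["ITSC_MergePatterns", "-c", "class", "-o", "merged.arff",
         "-i", "s1.arff", "s2.arff"])
    run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "merged.arff", "-b", "bayes.txt"])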
45. Results of Classifier Evaluation
- The results of running this procedure using five sample images of size 500x500 are as follows:
46. Applying the Classifier to Images
- Once the classifier is trained, it can be applied to segment images. One further program is required at the end to convert the classified patterns back into an image
47. Python Function for Segmentation
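The slide's function body is missing; this sketch applies the trained classifier to a new image and converts the result back to image form. ITSC_CvtArffToImage is a guessed name for the final conversion component the previous slide mentions; the real component name may differ:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    def segment(image, classifier="bayes.txt", out_image="seg.bin"):
        """Quantize, extract texture features, classify tiles, rebuild an image."""
        run(["ITSC_RelLevel", "-d", "-s", "30", "-i", image, "-o", "q4.bin", "-k"])
        run(["ITSC_Glrl", "-i", "q4.bin", "-o", "glrl.arff", "-l", "3", "-B", "-t", "25"])
        run(["ITSC_NaiveBayesApply", "-c", "class", "-i", "glrl.arff",
             "-b", classifier, "-o", "res.arff"])
        run(["ITSC_CvtArffToImage", "-i", "res.arff", "-o", out_image])  # hypothetical name

    segment("goes_visible.bin")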
48. Sample Image Results
[Figure: sample GOES image and its segmentation result]
49. Remarks
- The procedure illustrated here is one specific example of ADaM's capabilities
- There are many other classifiers, texture features and other tools that could be used for this problem
- Since all of the algorithms of a particular type work in more or less the same way, the same general procedure could be used with other tools
50. Data Mining Examples From ITSC
51. Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
- Mining Plan:
  - Water cover mask to eliminate land
  - Laplacian filter to compute temperature gradients
  - Science algorithm to estimate wind speed
  - Contiguous regions with wind speeds above a desired threshold identified
  - Additional test to eliminate false positives
  - Maximum wind speed and location produced
[Diagram: AMSU-A data is calibrated, limb corrected, and converted to brightness temperature (Tb), then processed in the ADaM mining environment; results feed a knowledge base and a data archive. Example result shown: Hurricane Floyd.]
Results are placed on the web, made available to the National Hurricane Center and Joint Typhoon Warning Center, and stored for further analysis:
http://pm-esip.msfc.nasa.gov/cyclone
52. Mesocyclone Signature Detection Using ADaM
- Problem: Detecting mesocyclone signatures in radar data
- Science Rationale: Improved accuracy and reduced false alarm rate for indicators of tornadic activity
- Technique: Developing an algorithm based on wind velocity shear signatures
53. Data Mining / Earth Science Collaboration
Classification Based on Texture Features
- Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
- Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
- Comparison based on:
  - Accuracy of detection
  - Amount of time required to classify
54. Data Mining / Space Science Collaboration
Boundary Detection and Quantification
- Analysis of polar cap auroras in large volumes of spacecraft UV images
- Scientific Rationale: indicators to predict geomagnetic storms, which can:
  - Damage satellites
  - Disrupt radio communication
- Developing different mining algorithms to detect and quantify the polar cap boundary
[Figure: UV image with detected polar cap boundary]
55. Data Mining / BioInformatics Collaboration
Genome Patterns
- Text Pattern Recognition: used to search for text patterns in bioscience data as well as other text documents.
[Diagram: scientists query a genome database; mining results are returned]
56. Additional Information
- Website:
  - www.itsc.uah.edu
- Dr. Sara Graves
  - sgraves@itsc.uah.edu
- Danny Hardin
  - dhardin@itsc.uah.edu