Title: ADaM Tutorial (Dr. John Rushing)
1. ADaM Tutorial
Dr. John Rushing
- Information Technology and Systems Center
- University of Alabama in Huntsville
- National Space Science and Technology Center
- 256-824-6064
- dhardin@itsc.uah.edu
- http://www.itsc.uah.edu/
- http://datamining.itsc.uah.edu/
2. Data Mining: What it is
- Extracting knowledge from large amounts of data
- Motivation:
  - Our ability to collect data has expanded rapidly
  - It is impossible to analyze all of the data manually
  - Data contains valuable information that can aid in decision making
- Uses techniques from:
  - Pattern Recognition
  - Machine Learning
  - Statistics
  - High Performance Database Systems
  - On-Line Analytical Processing (OLAP)
  - Plus other techniques unique to data mining (association rules)
- Data mining methods must be efficient and scalable
3. Data Mining: What it isn't
- Small Scale
  - Data mining methods are designed for large data sets
  - Scale is one of the characteristics that distinguishes data mining applications from traditional machine learning applications
- Foolproof
  - Data mining techniques will discover patterns in any data
  - The patterns discovered may be meaningless
  - It is up to the user to determine how to interpret the results
- Magic
  - Data mining techniques cannot generate information that is not present in the data
  - They can only find the patterns that are already there
4. Data Mining: Types of Mining
- Association Rule Mining
  - Initially developed for market basket analysis
  - Goal is to discover relationships between attributes
  - Uses include decision support, classification and clustering
- Classification and Prediction (Supervised Learning)
  - Classifiers are created using labeled training samples
  - Training samples created by ground truth / experts
  - Classifier later used to classify unknown samples
- Clustering (Unsupervised Learning)
  - Grouping objects into classes so that similar objects are in the same class and dissimilar objects are in different classes
  - Discover overall distribution patterns and relationships between attributes
- Other Types of Mining
  - Outlier Analysis
  - Concept / Class Description
  - Time Series Analysis
5. ADaM System Overview
- Developed by the Information Technology and Systems Center at the University of Alabama in Huntsville
- Consists of over 75 interoperable mining and image processing components
- Each component is provided with a C application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
- ADaM components are lightweight and autonomous, and have been used successfully in a grid environment
- ADaM has several translation components that provide data-level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
- Future versions will include Python wrappers and possibly web service interfaces
6. ADaM 4.0 Components
7. ADaM Classification
- Definition: Classification is the development, from a known set of training data, of a procedure for identifying some phenomenon of interest that can later be used to find similar patterns in similar data sets.
- Method:
  - Define or categorize classes of data objects.
  - Create a model represented by training samples, which taken together form a training data set.
  - Represent the model mathematically in the form of classification rules, decision trees or formulae.
  - Use the model to categorize similar data sets.
8. ADaM Classification - Process
- Identify potential features which may characterize the phenomenon of interest
- Generate a set of training instances, where each instance consists of a set of feature values and the corresponding class label
- Describe the instances using the ARFF file format
- Preprocess the data as necessary (normalize, sample, etc.)
- Split the data into training / test set(s) as appropriate
- Train the classifier using the training set
- Evaluate classifier performance using the test set
- K-fold cross validation, leave-one-out, or other more sophisticated methods may also be used for evaluating classifier performance (see the sketch below)
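As an illustration of that last evaluation step, here is a minimal pure-Python sketch of k-fold index splitting; it is illustrative only and not an ADaM component:

    import random

    def k_fold_indices(n_instances, k=5, seed=42):
        """Yield (train, test) index lists for k-fold cross validation."""
        indices = list(range(n_instances))
        random.Random(seed).shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [idx for j in range(k) if j != i for idx in folds[j]]
            yield train, test

    # Example: 10 instances, 5 folds -> five 8/2 train/test splits
    for train_idx, test_idx in k_fold_indices(10):
        print(len(train_idx), len(test_idx))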
9. Naïve Bayes Classification - 1
- Classification problem with m classes C1, C2, ..., Cm
- Given an unknown sample X, the goal is to choose the class that is most likely based on statistics from the training data
- P(Ci | X) can be computed using Bayes' theorem [1]:

  $P(C_i \mid X) = \dfrac{P(X \mid C_i)\,P(C_i)}{P(X)}$

[1] Equations from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
10. Naïve Bayes Classification - 2
- P(X) is constant for all classes, so finding the most likely class amounts to maximizing P(X | Ci) P(Ci)
- P(Ci) is the prior probability of class i. If the prior probabilities are not known, equal probabilities can be assumed.
- Assuming the attributes are conditionally independent given the class [1]:

  $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$

- P(xk | Ci) is the probability density function for attribute k

[1] Equation from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
11. Naïve Bayes Classification - 3
- P(xk | Ci) is estimated from the training samples
- Categorical attributes (non-numeric attributes)
  - Estimate P(xk | Ci) as the percentage of samples of class i with value xk
  - Training involves counting the percentage of occurrence of each possible value for each class
- Numeric attributes
  - Also use statistics of the sample data to estimate P(xk | Ci)
  - The actual form of the density function is generally not known, so a Gaussian density is often assumed
  - Training involves computation of the mean and variance for each attribute for each class
12. Naïve Bayes Classification - 4
- Gaussian distribution for numeric attributes [1]:

  $P(x_k \mid C_i) = \dfrac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\!\left(-\dfrac{(x_k - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$

- where $\mu_{ik}$ is the mean of attribute k observed in samples of class Ci
- and $\sigma_{ik}$ is the standard deviation of attribute k observed in samples of class Ci

[1] Equation from J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
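To make slides 9-12 concrete, here is a minimal NumPy sketch of Gaussian Naïve Bayes training and classification; it illustrates the math above and is not ADaM's actual implementation:

    import numpy as np

    def train_naive_bayes(X, y):
        """For each class: prior P(Ci), per-attribute means and standard deviations."""
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            model[c] = (len(Xc) / len(X),        # prior P(Ci)
                        Xc.mean(axis=0),          # mu_ik for each attribute k
                        Xc.std(axis=0) + 1e-9)    # sigma_ik (epsilon avoids divide-by-zero)
        return model

    def classify(model, x):
        """Pick the class maximizing log P(Ci) + sum_k log P(xk | Ci)."""
        def log_posterior(c):
            prior, mu, sigma = model[c]
            return (np.log(prior)
                    - np.sum(np.log(np.sqrt(2 * np.pi) * sigma))
                    - np.sum((x - mu) ** 2 / (2 * sigma ** 2)))
        return max(model, key=log_posterior)

    # Tiny worked example: two classes in a 2-attribute space
    X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.8, 9.2]])
    y = np.array([0, 0, 1, 1])
    print(classify(train_naive_bayes(X, y), np.array([7.5, 8.9])))  # -> 1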
13. ADaM Classification - Example
- Starting with an ARFF file, the ADaM system will be used to create a Naïve Bayes classifier and evaluate it
- The source data will be an ARFF version of the Wisconsin breast cancer data from the University of California Irvine (UCI) Machine Learning Database
- http://www.ics.uci.edu/~mlearn/MLRepository.html
- The Naïve Bayes classifier will be trained to distinguish malignant vs. benign tumors based on nine characteristics
14. Sample Data Set: ARFF Format
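The ARFF listing on the original slide did not survive conversion. As a hedged sketch, the header for this data set might look roughly like the following; the attribute names here are assumptions based on the UCI documentation of the Wisconsin breast cancer data:

    @relation breast-cancer-wisconsin

    @attribute clump_thickness numeric
    @attribute uniformity_of_cell_size numeric
    @attribute uniformity_of_cell_shape numeric
    @attribute marginal_adhesion numeric
    @attribute single_epithelial_cell_size numeric
    @attribute bare_nuclei numeric
    @attribute bland_chromatin numeric
    @attribute normal_nucleoli numeric
    @attribute mitoses numeric
    @attribute class {benign, malignant}

    @data
    5,1,1,1,2,1,3,1,1,benign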
15. Preparing the Training Samples
- ADaM has utilities for splitting data sets into disjoint groups for training and testing classifiers
- The simplest is ITSC_Sample, which splits the source data set into two disjoint subsets
16. Splitting the Samples
- We will split the breast cancer data set into two groups, one with 2/3 of the patterns and another with 1/3 of the patterns
- ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66
- The -i argument specifies the input file name
- The -o and -t arguments specify the names of the two output files (-o output one, -t output two)
- The -p argument specifies the portion of the data that goes into output one (trn.arff); the remainder goes to output two (tst.arff)
- The -c argument tells the sample program which attribute is the class attribute
17. Training the Classifier - 1
- Training involves computation of the mean vector and covariance matrix for each class
- The mean of an attribute for a class is the sum of the attribute values divided by the number of instances
- The mean vector is the vector of attribute means
- The variance of an attribute is the sum of the squared deviations from the mean for that attribute over the patterns in the class
- The covariance of a pair of attributes is the sum, over the class patterns, of the product of the deviation from the mean of attribute one and the deviation from the mean of attribute two (see the sketch below)
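These statistics map directly onto a few NumPy calls. A short sketch, assuming the patterns of one class sit in a 2-D array (rows = instances, columns = attributes); the data values are hypothetical:

    import numpy as np

    # Hypothetical patterns for one class (3 instances, 2 attributes)
    Xc = np.array([[5.0, 1.0], [4.0, 2.0], [6.0, 3.0]])

    mean_vector = Xc.mean(axis=0)           # vector of attribute means
    variances = Xc.var(axis=0)              # per-attribute variances
    covariance = np.cov(Xc, rowvar=False)   # covariance matrix over attribute pairs
    print(mean_vector, variances, covariance, sep="\n")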
18. Training the Classifier - 2
- ADaM has several different types of classifiers
- Each classifier has a training method and an application method
- Here we show one of them, the Naïve Bayes classifier, with the following syntax:
19. Training the Classifier - 3
- Training the Naïve Bayes classifier:
- ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt
- The -i argument specifies the input file name
- The -c argument specifies the name of the class attribute
- The -b argument specifies the name of the classifier file
- The output of the ITSC_NaiveBayesTrain module is shown here.
20. Applying the Classifier - 1
- Use the mean and covariance to compute P(X | Ci), the probability that pattern X occurs in class Ci, for all classes
- The prior probabilities P(Ci) are already known
- Assign the pattern X to the class for which P(X | Ci) P(Ci) is maximal
- Note that P(X | Ci) is high when X is close to the mean of the class.
- The variance is used to adjust for how much patterns within the class typically vary.
- A pattern may be assigned to a class whose mean is further away than the mean of some other class!
21. Applying the Classifier - 2
- Once trained, the Naïve Bayes classifier can be used to classify unknown instances
- The syntax for ADaM's Naïve Bayes classifier is as follows:
22. Applying the Classifier - 3
- The classifier is run as follows:
- ITSC_NaiveBayesApply -c class -i trn.arff -b bayes.txt -o res_trn.arff
- The -i argument specifies the input file name
- The -c argument specifies the name of the class attribute
- The -b argument specifies the name of the classifier file
- The -o argument specifies the name of the result file
23. Evaluating Classifier Performance
- By applying the classifier to a test set where the correct class is known in advance, it is possible to compare the expected output to the actual output.
- The ITSC_Accuracy utility performs this function
24. Evaluating Classifier Performance
- ITSC_Accuracy is run as follows:
- ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt
- The -c argument provides the name of the class attribute
- The -t argument indicates the input classified data set file
- The -v argument specifies the name of a valid input file (the file with the known correct classes)
- The -o argument gives the output file name that holds the results of the comparison
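The comparison ITSC_Accuracy performs can be sketched in a few lines of Python, assuming the predicted and true class labels have already been extracted into two lists (the label values below are illustrative):

    from collections import Counter

    def accuracy_report(predicted, actual):
        """Return overall accuracy plus a (true, predicted) -> count confusion table."""
        confusion = Counter(zip(actual, predicted))
        correct = sum(n for (t, p), n in confusion.items() if t == p)
        return correct / len(actual), confusion

    acc, confusion = accuracy_report(
        predicted=["benign", "benign", "malignant"],
        actual=["benign", "malignant", "malignant"])
    print(f"accuracy = {acc:.2%}")   # accuracy = 66.67%
    print(confusion)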
25. Evaluating Classifier Performance
26. Python Script for Workflow
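The script shown on the original slide did not survive conversion. The following is a minimal reconstruction that chains the commands from slides 16-24; driving the executables via subprocess is an assumption about usage, and the apply step targets the test set so that slide 24's accuracy check lines up:

    import subprocess

    def run(cmd):
        """Run one ADaM command-line component, failing loudly on error."""
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    # Split the data 2/3 train, 1/3 test (slide 16)
    run(["ITSC_Sample", "-c", "class", "-i", "bcw.arff",
         "-o", "trn.arff", "-t", "tst.arff", "-p", "0.66"])
    # Train the Naive Bayes classifier (slide 19)
    run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "trn.arff", "-b", "bayes.txt"])
    # Apply it to the held-out test set (slide 22)
    run(["ITSC_NaiveBayesApply", "-c", "class", "-i", "tst.arff",
         "-b", "bayes.txt", "-o", "res_tst.arff"])
    # Compare predicted vs. known classes (slide 24)
    run(["ITSC_Accuracy", "-c", "class", "-t", "res_tst.arff",
         "-v", "tst.arff", "-o", "acc_tst.txt"])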
27. ADaM Image Classification
- Classification of image data is a bit more involved, as there is an additional set of steps that must be performed to extract useful features from the images before classification can be performed.
- In addition, it is also useful to transform the data back into image format for visualization purposes.
- As an example problem, we will consider detection of cumulus cloud fields in GOES satellite images
- GOES satellites produce a 5-channel image every 15 minutes
- The classifier must label each pixel as either belonging to a cumulus cloud field or not, based on the GOES data
- Algorithms based on spectral properties often miss cumulus clouds because of the low resolution of the IR channels and the small size of the clouds
- Texture features computed from the GOES visible image provide a means to detect cumulus cloud fields.
28. GOES Images - Preprocessing
- Segmentation is based only on the high resolution (1 km) visible channel.
- In order to remove the effects of the light reflected from the Earth's surface, a visible reference background image is constructed for each time of the day.
- The reference image is subtracted from the visible image before it is segmented.
- GOES image patches containing cumulus cloud regions, other cloud regions, and background were selected
- Independent experts labeled each pixel of the selected image patches as cumulus cloud or not
- The expert labels were combined to form a single truth image for each of the original image patches. In cases where the experts disagreed, the truth image was given a "don't know" value
29. GOES Images - Example
[Figure: GOES image patch with corresponding expert labels]
30. Image Quantization
- Some texture features perform better when the image is quantized to a small number of levels before the features are computed.
- ITSC_RelLevel performs local image quantization
31. Image Quantization
- For this demo, we will reduce the number of levels from 256 to just three using local image statistics (a rough sketch of the idea follows below)
- ITSC_RelLevel -d -s 30 -i src.bin -o q4.bin -k
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -d argument tells the program to use the standard deviation to set the cutoffs instead of a fixed value
- The -k option tells the program to keep values in the range 0, 1, 2 rather than normalizing to 0..1
- The -s argument indicates the size of the local area used to compute statistics
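For intuition, here is a rough NumPy/SciPy sketch of local three-level quantization using mean plus or minus standard-deviation cutoffs. This is a guess at the flavor of what ITSC_RelLevel -d computes, not its actual rule:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def quantize_local(image, size=30, d=1.0):
        """Map each pixel to 0/1/2 relative to its local mean and std deviation."""
        img = image.astype(float)
        mean = uniform_filter(img, size)
        sq_mean = uniform_filter(img ** 2, size)
        std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
        out = np.ones_like(img, dtype=np.uint8)   # level 1: near the local mean
        out[img < mean - d * std] = 0             # level 0: well below
        out[img > mean + d * std] = 2             # level 2: well above
        return out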
32. Computing Texture Features
- ADaM is currently able to compute five different types of texture features: gray level co-occurrence, gray level run length, association rules, Gabor filters, and MRF models
- The syntax for gray level run length computation is:
33. Computing Texture Features
- For this demo, we will compute gray level run length features using a tile size of 25 (a sketch of the underlying feature computation follows below)
- ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -l argument tells the program the number of levels in the input image
- The -B option tells the program to write a binary version of the ARFF file (the default is ASCII)
- The -t argument indicates the size of the tiles used to compute the gray level run length features
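As a rough illustration of this feature family, the sketch below builds a horizontal run-length matrix for one tile and derives two classic statistics from it; ADaM's actual feature definitions and run directions may differ:

    import numpy as np
    from itertools import groupby

    def run_length_matrix(tile, levels=3):
        """rlm[g, r-1] = number of horizontal runs of gray level g with length r."""
        rlm = np.zeros((levels, tile.shape[1]))
        for row in tile:
            for level, run in groupby(row):
                rlm[level, len(list(run)) - 1] += 1
        return rlm

    def glrl_features(tile, levels=3):
        """Short and long run emphasis from the run-length matrix."""
        rlm = run_length_matrix(tile, levels)
        r = np.arange(1, rlm.shape[1] + 1)
        n_runs = rlm.sum()
        sre = (rlm / r ** 2).sum() / n_runs   # short run emphasis
        lre = (rlm * r ** 2).sum() / n_runs   # long run emphasis
        return sre, lre

    tile = np.array([[0, 0, 1, 2, 2], [1, 1, 1, 0, 2]])
    print(glrl_features(tile))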
34. Converting the Label Images
- Since the labels are in the form of images, it is necessary to convert them to vector form
- ITSC_CvtImageToArff will do this
35. Converting the Label Images
- The labels can be converted to vector form using:
- ITSC_CvtImageToArff -i lbl.bin -o lbl.arff -B
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
36. Labeling the Patterns
- Once the labels are in vector form, they can be appended to the patterns produced by ITSC_Glrl
- ITSC_LabelPatterns will do this
37. Labeling the Patterns
- The labels are assigned to patterns as follows:
- ITSC_LabelPatterns -i glrl.arff -c class -l lbl.bin -L lbl.arff -o all.arff -B
- The -i argument specifies the input file name (patterns)
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute in the pattern set
- The -l argument specifies the name of the label attribute in the label set
- The -L argument specifies the name of the input label file
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
38. Eliminating "Don't Know" Patterns
- Some of the original pixels were classified differently by different experts and marked as "don't know"
- The corresponding patterns can be removed from the training set using ITSC_Subset
39. Eliminating "Don't Know" Patterns
- ITSC_Subset is used to remove patterns with unclear class assignment. The subset is generated based on the value of the class attribute
- ITSC_Subset -i all.arff -o subset.arff -a class -r 0 1 -B
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -a argument tells which attribute to test
- The -r argument gives the legal range of the attribute
- The -B argument tells the program to write the output file in binary form (the default is ASCII)
40. Selecting Random Samples
- Random samples are selected from the original training data using the same ITSC_Sample program shown in the previous demo
- The program is used in a slightly different way:
- ITSC_Sample -i subset.arff -c class -o s1.arff -n 2000
- The -i argument specifies the input file name
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute
- The -n option tells the program to select an equal number of random samples (in this case 2000) from each class.
41. Python Script for Sample Creation
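Again, the slide's script is missing; the following is a plausible reconstruction chaining the commands from slides 31-40 for a single image. The subprocess invocation and the function wrapper are assumptions; the file names and flags are those given on the slides:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    def make_samples(sample_out="s1.arff"):
        """Quantize, extract texture, attach labels, drop don't-knows, then sample."""
        run(["ITSC_RelLevel", "-d", "-s", "30", "-i", "src.bin", "-o", "q4.bin", "-k"])
        run(["ITSC_Glrl", "-i", "q4.bin", "-o", "glrl.arff", "-l", "3", "-B", "-t", "25"])
        run(["ITSC_CvtImageToArff", "-i", "lbl.bin", "-o", "lbl.arff", "-B"])
        run(["ITSC_LabelPatterns", "-i", "glrl.arff", "-c", "class", "-l", "lbl.bin",
             "-L", "lbl.arff", "-o", "all.arff", "-B"])
        run(["ITSC_Subset", "-i", "all.arff", "-o", "subset.arff", "-a", "class",
             "-r", "0", "1", "-B"])
        run(["ITSC_Sample", "-i", "subset.arff", "-c", "class", "-o", sample_out,
             "-n", "2000"])

    make_samples()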
42. Merging Samples / Multiple Images
- The procedure up to this point has created a random subset of points from a particular image. Subsets from multiple images can be combined using ITSC_MergePatterns
43. Merging Samples / Multiple Images
- Multiple pattern sets are merged using the following command:
- ITSC_MergePatterns -c class -o merged.arff -i s1.arff s2.arff
- The -i argument specifies the input file names
- The -o argument specifies the output file name
- The -c argument specifies the name of the class attribute
44. Python Script for Training
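A minimal reconstruction of the missing training script, merging the per-image sample sets (slide 43) and then training the classifier (slide 19); the subprocess wrapper is an assumption:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # Merge the per-image sample sets, then train Naive Bayes on the result.
    run(["ITSC_MergePatterns", "-c", "class", "-o", "merged.arff",
         "-i", "s1.arff", "s2.arff"])
    run(["ITSC_NaiveBayesTrain", "-c", "class", "-i", "merged.arff", "-b", "bayes.txt"])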
45. Results of Classifier Evaluation
- The results of running this procedure using five sample images of size 500x500 are as follows:
46. Applying the Classifier to Images
- Once the classifier is trained, it can be applied to segment images. One further program is required at the end to convert the classified patterns back into an image
47. Python Function for Segmentation
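The slide's function body is missing; this sketch applies the trained classifier to a new image and converts the result back to image form. ITSC_CvtArffToImage is a guessed name for the final conversion component the previous slide mentions; the real component name may differ:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    def segment(image, classifier="bayes.txt", out_image="seg.bin"):
        """Quantize, extract texture features, classify tiles, rebuild an image."""
        run(["ITSC_RelLevel", "-d", "-s", "30", "-i", image, "-o", "q4.bin", "-k"])
        run(["ITSC_Glrl", "-i", "q4.bin", "-o", "glrl.arff", "-l", "3", "-B", "-t", "25"])
        run(["ITSC_NaiveBayesApply", "-c", "class", "-i", "glrl.arff",
             "-b", classifier, "-o", "res.arff"])
        run(["ITSC_CvtArffToImage", "-i", "res.arff", "-o", out_image])  # hypothetical name

    segment("goes_visible.bin")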
48. Sample Image Results
[Figure: sample GOES image and its segmentation result]
49. Remarks
- The procedure illustrated here is one specific example of ADaM's capabilities
- There are many other classifiers, texture features and other tools that could be used for this problem
- Since all of the algorithms of a particular type work in more or less the same way, the same general procedure could be used with other tools
50. Data Mining Examples From ITSC
51. Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
- Mining Plan:
  - Water cover mask to eliminate land
  - Laplacian filter to compute temperature gradients
  - Science algorithm to estimate wind speed
  - Contiguous regions with wind speeds above a desired threshold identified
  - Additional test to eliminate false positives
  - Maximum wind speed and location produced
[Diagram: AMSU-A data is calibrated, limb corrected, and converted to brightness temperature (Tb), then processed in the ADaM mining environment; results feed a knowledge base and a data archive. Example result shown: Hurricane Floyd.]
Results are placed on the web, made available to the National Hurricane Center and Joint Typhoon Warning Center, and stored for further analysis:
http://pm-esip.msfc.nasa.gov/cyclone
52. Mesocyclone Signature Detection Using ADaM
- Problem: Detecting mesocyclone signatures in radar data
- Science Rationale: Improved accuracy and reduced false alarm rate for indicators of tornadic activity
- Technique: Developing an algorithm based on wind velocity shear signatures
53. Data Mining / Earth Science Collaboration
Classification Based on Texture Features
- Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
- Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
- Comparison based on:
  - Accuracy of detection
  - Amount of time required to classify
54. Data Mining / Space Science Collaboration
Boundary Detection and Quantification
- Analysis of polar cap auroras in large volumes of spacecraft UV images
- Scientific Rationale: indicators to predict geomagnetic storms, which can:
  - Damage satellites
  - Disrupt radio communication
- Developing different mining algorithms to detect and quantify the polar cap boundary
[Figure: UV image with detected polar cap boundary]
55. Data Mining / BioInformatics Collaboration
Genome Patterns
- Text Pattern Recognition: used to search for text patterns in bioscience data as well as other text documents.
[Diagram: scientists query a genome database; mining results are returned]
56. Additional Information
- Website:
  - www.itsc.uah.edu
- Dr. Sara Graves
  - sgraves@itsc.uah.edu
- Danny Hardin
  - dhardin@itsc.uah.edu