ADaM Tutorial, Dr. John Rushing (PowerPoint transcript, 56 slides)
1
ADaM Tutorial
Dr. John Rushing
  • Information Technology and Systems Center
  • University of Alabama in Huntsville
  • National Space Science and Technology Center
  • 256-824-6064
  • dhardin@itsc.uah.edu
  • http://www.itsc.uah.edu/
  • http://datamining.itsc.uah.edu/

2
Data Mining: What It Is
  • Extracting knowledge from large amounts of data
  • Motivation
  • Our ability to collect data has expanded rapidly
  • It is impossible to analyze all of the data
    manually
  • Data contains valuable information that can aid
    in decision making
  • Uses techniques from
  • Pattern Recognition
  • Machine Learning
  • Statistics
  • High Performance Database Systems
  • On-Line Analytical Processing (OLAP)
  • Plus other techniques unique to data mining
    (Association rules)
  • Data mining methods must be efficient and scalable

3
Data Mining: What It Isn't
  • Small Scale
  • Data mining methods are designed for large data
    sets
  • Scale is one of the characteristics that
    distinguishes data mining applications from
    traditional machine learning applications
  • Foolproof
  • Data mining techniques will discover patterns in
    any data
  • The patterns discovered may be meaningless
  • It is up to the user to determine how to
    interpret the results
  • Magic
  • Data mining techniques cannot generate
    information that is not present in the data
  • They can only find the patterns that are already
    there

4
Data Mining: Types of Mining
  • Association Rule Mining
  • Initially developed for market basket analysis
  • Goal is to discover relationships between
    attributes
  • Uses include decision support, classification and
    clustering
  • Classification and Prediction (Supervised
    Learning)
  • Classifiers are created using labeled training
    samples
  • Training samples created by ground truth /
    experts
  • Classifier later used to classify unknown samples
  • Clustering (Unsupervised Learning)
  • Grouping objects into classes so that similar
    objects are in the same class and dissimilar
    objects are in different classes
  • Discover overall distribution patterns and
    relationships between attributes
  • Other Types of Mining
  • Outlier Analysis
  • Concept / Class Description
  • Time Series Analysis

5
ADaM System Overview
  • Developed by the Information Technology and
    Systems Center at the University of Alabama in
    Huntsville
  • Consists of over 75 interoperable mining and
    image processing components
  • Each component is provided with a C application
    programming interface (API) and an executable in
    support of scripting tools (e.g. Perl, Python,
    Tcl, Shell)
  • ADaM components are lightweight and autonomous,
    and have been used successfully in a grid
    environment
  • ADaM has several translation components that
    provide data level interoperability with other
    mining systems (such as WEKA and Orange), and
    point tools (such as libSVM and svmLight)
  • Future versions will include Python wrappers and
    possible web service interfaces

6
ADaM 4.0 Components
7
ADaM Classification
  • Definition: Classification is the process of
    developing, from a known set of training data, a
    procedure for identifying some phenomenon of
    interest that can later be used to find similar
    patterns in similar data sets.
  • Method
  • Define or categorize classes of data objects.
  • Create a model represented by training samples
    which taken together form a training data set.
  • Represent the model mathematically in the form of
    classification rules, decision trees or formulae.
  • Use the model to categorize similar data sets.

8
ADaM Classification - Process
  • Identify potential features which may
    characterize the phenomenon of interest
  • Generate a set of training instances where each
    instance consists of a set of feature values and
    the corresponding class label
  • Describe the instances using ARFF file format
  • Preprocess the data as necessary (normalize,
    sample etc.)
  • Split the data into training / test set(s) as
    appropriate
  • Train the classifier using the training set
  • Evaluate classifier performance using test set
  • K-fold cross-validation, leave-one-out, or other
    more sophisticated methods may also be used for
    evaluating classifier performance

9
Naïve Bayes Classification - 1
  • Classification problem with m classes C1, C2,
    ..., Cm
  • Given an unknown sample X, the goal is to choose
    a class that is most likely based on statistics
    from training data
  • P(Ci | X) can be computed using Bayes' Theorem
    (shown below)
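The equation itself appears only as an image in the original slide; it is presumably the standard Bayes rule:

  P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}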

¹ Equations from J. Han and M. Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
10
Naïve Bayes Classification - 2
  • P(X) is constant for all classes, so finding the
    most likely class amounts to maximizing P(X | Ci)
    P(Ci)
  • P(Ci) is the prior probability of class i. If
    the probabilities are not known, equal
    probabilities can be assumed.
  • Assuming attributes are conditionally independent
    (see the equation below)
  • P(xk | Ci) is the probability density function
    for attribute k
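The independence assumption appears only as an image in the original slide; it presumably reads:

  P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)

where n is the number of attributes.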

¹ Equation from J. Han and M. Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
11
Naïve Bayes Classification - 3
  • P(xk | Ci) is estimated from the training samples
  • Categorical attributes (non-numeric attributes)
  • Estimate P(xk | Ci) as the percentage of samples
    of class i with value xk
  • Training involves counting the percentage of
    occurrence of each possible value for each class
    (see the sketch after this list)
  • Numeric attributes
  • Also use statistics of the sample data to
    estimate P(xk | Ci)
  • Actual form of density function is generally not
    known, so Gaussian density is often assumed
  • Training involves computation of mean and
    variance for each attribute for each class
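A minimal sketch of the categorical case, using hypothetical data (not from the deck):

  # Minimal sketch: estimate P(xk | Ci) for a categorical attribute by
  # counting value frequencies within each class. Data are hypothetical.
  from collections import Counter, defaultdict

  samples = [('sunny', 'yes'), ('rainy', 'no'), ('sunny', 'yes'),
             ('overcast', 'yes'), ('rainy', 'no')]  # (value, class)

  counts = defaultdict(Counter)
  for value, cls in samples:
      counts[cls][value] += 1

  def p_value_given_class(value, cls):
      # Percentage of samples of the class having this value
      return counts[cls][value] / sum(counts[cls].values())

  print(p_value_given_class('sunny', 'yes'))  # 2/3 = 0.666...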

12
Naïve Bayes Classification - 4
  • Gaussian distribution for numeric attributes
    (shown below)
  • where μ_{Ci,k} is the mean of attribute k observed
    in samples of class Ci
  • and σ_{Ci,k} is the standard deviation of attribute
    k observed in samples of class Ci
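The density itself is an image in the original slide; the standard univariate Gaussian it presumably shows is:

  P(x_k \mid C_i) = \frac{1}{\sqrt{2\pi}\, \sigma_{C_i,k}} \exp\left( -\frac{(x_k - \mu_{C_i,k})^2}{2\sigma_{C_i,k}^2} \right)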

¹ Equation from J. Han and M. Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann, 2001.
13
ADaM Classification - Example
  • Starting with an ARFF file, the ADaM system will
    be used to create a Naïve Bayes classifier and
    evaluate it
  • The source data will be an ARFF version of the
    Wisconsin breast cancer data from the University
    of California Irvine (UCI) Machine Learning
    Database
  • http://www.ics.uci.edu/~mlearn/MLRepository.html
  • The Naïve Bayes classifier will be trained to
    distinguish malignant vs. benign tumors based on
    nine characteristics

14
Sample Data Set: ARFF Format
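The listing itself is an image in the original deck. A short sketch of what an ARFF version of this data set might look like (attribute names are illustrative, not copied from the slide):

  @relation breast-cancer-wisconsin

  % Nine numeric cell characteristics precede the class label;
  % only two are shown in this sketch.
  @attribute clump_thickness numeric
  @attribute cell_size_uniformity numeric
  @attribute class {benign, malignant}

  @data
  5,1,benign
  8,10,malignant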
15
Preparing the Training Samples
  • ADaM has utilities for splitting data sets into
    disjoint groups for training and testing
    classifiers
  • The simplest is ITSC_Sample, which splits the
    source data set into two disjoint subsets

16
Splitting the Samples
  • We will split the breast cancer data set into two
    groups, one with 2/3 of the patterns and another
    with 1/3 of the patterns
  • ITSC_Sample -c class -i bcw.arff -o
    trn.arff -t tst.arff -p 0.66
  • The -i argument specifies the input file name
  • The -o and -t arguments specify the names of the
    two output files (-o output one, -t output
    two)
  • The -p argument specifies the portion of data
    that goes into output one (trn.arff); the
    remainder goes to output two (tst.arff)
  • The -c argument tells the sample program which
    attribute is the class attribute

17
Training the Classifier - 1
  • Training involves computation of the mean vector
    and covariance matrix for each class
  • The mean of an attribute for a class is the sum
    of the attribute values divided by number of
    instances
  • The mean vector is the vector of attribute means
  • The variance of an attribute is the sum of the
    squared deviations from the mean for that
    attribute over patterns in the class
  • The covariance of a pair of attributes is the sum
    of the deviation from the mean of attribute one
    times the deviation from the mean of attribute two
    over class patterns (a sketch of these
    computations follows)
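A sketch of these computations (NumPy is used here for illustration only; it is not part of ADaM):

  # Sketch: per-class mean vector and covariance matrix from a pattern
  # matrix X (rows = instances, columns = attributes) and labels y.
  import numpy as np

  X = np.array([[2.0, 1.0], [3.0, 2.0], [8.0, 9.0], [9.0, 7.0]])
  y = np.array([0, 0, 1, 1])

  for c in np.unique(y):
      Xc = X[y == c]                       # patterns belonging to class c
      mean_vec = Xc.mean(axis=0)           # attribute means
      cov_mat = np.cov(Xc, rowvar=False)   # attribute covariances
      print(c, mean_vec, cov_mat)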

18
Training the Classifier - 2
  • ADaM has several different types of classifiers
  • Each classifier has a training method and an
    application method
  • Here we show one, the Naïve Bayes classifier,
    with the following syntax

19
Training the Classifier - 3
  • Training the Naïve Bayes classifier
  • ITSC_NaiveBayesTrain -c class -i trn.arff
    -b bayes.txt
  • The -i argument specifies the input file name
  • The -c argument specifies the name of the class
    attribute
  • The -b argument specifies the name of the
    classifier file
  • The output of the ITSC_NaiveBayesTrain module is
    shown here.

20
Applying the Classifier - 1
  • Use mean and covariance to compute P(X | Ci), the
    probability that pattern X occurs in class Ci, for
    all classes
  • The prior probabilities P(Ci) are already known
  • Assign the pattern X to the class for which
    P(X | Ci) P(Ci) is maximal (a sketch follows this
    list)
  • Note that P(X | Ci) is high when X is close to
    the mean of the class.
  • The variance is used to adjust for how much
    patterns within the class typically vary.
  • A pattern may be assigned to a class for which
    the mean is further away than the mean of some
    other class!
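A minimal sketch of this decision rule for Gaussian naïve Bayes, with hypothetical per-class statistics (not taken from the deck):

  # Sketch: assign X to the class maximizing P(X | Ci) P(Ci), computed in
  # log space to avoid underflow. Class statistics below are hypothetical.
  import math

  stats = {
      'benign':    {'mean': [2.9, 1.3], 'var': [2.0, 0.6], 'prior': 0.65},
      'malignant': {'mean': [7.2, 6.6], 'var': [2.4, 2.6], 'prior': 0.35},
  }

  def classify(x):
      def log_score(s):
          ll = math.log(s['prior'])
          for xi, m, v in zip(x, s['mean'], s['var']):
              # log of the Gaussian density for attribute value xi
              ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
          return ll
      return max(stats, key=lambda c: log_score(stats[c]))

  print(classify([3.0, 2.0]))  # close to the benign mean -> 'benign'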

21
Applying the Classifier - 2
  • Once trained, the Naïve Bayes classifier can be
    used to classify unknown instances
  • The syntax for ADaM's Naïve Bayes classifier is
    as follows

22
Applying the Classifier - 3
  • The classifier is run as follows
  • ITSC_NaiveBayesApply -c class -i trn.arff -b
    bayes.txt -o res_trn.arff
  • The -i argument specifies the input file name
  • The -c argument specifies the name of the class
    attribute
  • The -b argument specifies the name of the
    classifier file
  • The -o argument specifies the name of the result
    file

23
Evaluating Classifier Performance
  • By applying the classifier to a test set where
    the correct class is known in advance, it is
    possible to compare the expected output to the
    actual output.
  • The ITSC_Accuracy utility performs this function

24
Evaluating Classifier Performance
  • ITSC_Accuracy is run as follows
  • ITSC_Accuracy -c class -t res_tst.arff -v
    tst.arff -o acc_tst.txt
  • -c provides the name of the class attribute
  • -t indicates the input classified data set file
  • -v specifies the name of the validation (truth)
    input file
  • -o is the output file name that holds the results
    of the comparison

25
Evaluating Classifier Performance
  • Output from example

26
Python Script for Workflow
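The script itself appears only as an image in the original deck. Below is a minimal sketch of what such a workflow script might look like, assuming the ADaM executables are on the PATH and accept the flags shown on the preceding slides:

  # Hypothetical reconstruction of the workflow script (the original is an
  # image). Chains the components from slides 16-24 via subprocess.
  import subprocess

  def run(cmd):
      print(' '.join(cmd))
      subprocess.run(cmd, check=True)   # fail loudly on a nonzero exit

  # 1. Split the data: 2/3 training, 1/3 test.
  run(['ITSC_Sample', '-c', 'class', '-i', 'bcw.arff',
       '-o', 'trn.arff', '-t', 'tst.arff', '-p', '0.66'])
  # 2. Train the Naive Bayes classifier.
  run(['ITSC_NaiveBayesTrain', '-c', 'class', '-i', 'trn.arff',
       '-b', 'bayes.txt'])
  # 3. Apply it to the held-out test set.
  run(['ITSC_NaiveBayesApply', '-c', 'class', '-i', 'tst.arff',
       '-b', 'bayes.txt', '-o', 'res_tst.arff'])
  # 4. Compare predictions against the known labels.
  run(['ITSC_Accuracy', '-c', 'class', '-t', 'res_tst.arff',
       '-v', 'tst.arff', '-o', 'acc_tst.txt'])

The sketch applies the classifier to tst.arff so that the ITSC_Accuracy step has res_tst.arff to compare, matching slide 24.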
27
ADaM Image Classification
  • Classification of image data is a bit more
    involved, as there is an additional set of steps
    that must be performed to extract useful features
    from the images before classification can be
    performed.
  • In addition, it is also useful to transform the
    data back into image format for visualization
    purposes.
  • As an example problem, we will consider detection
    of cumulus cloud fields in GOES satellite images
  • GOES satellites produce a 5-channel image every
    15 minutes
  • The classifier must label each pixel as either
    belonging to a cumulus cloud field or not based
    on the GOES data
  • Algorithms based on spectral properties often
    miss cumulus clouds because of the low resolution
    of the IR channels and the small size of clouds
  • Texture features computed from the GOES visible
    image provide a means to detect cumulus cloud
    fields.

28
GOES Images - Preprocessing
  • Segmentation is based only on the high resolution
    (1km) visible channel.
  • In order to remove the effects of the light
    reflected from the Earth's surface, a visible
    reference background image is constructed for
    each time of the day.
  • The reference image is subtracted from the
    visible image before it is segmented.
  • GOES image patches containing cumulus cloud
    regions, other cloud regions, and background were
    selected
  • Independent experts labeled each pixel of the
    selected image patches as cumulus cloud or not
  • The expert labels were combined to form a single
    truth image for each of the original image
    patches. In cases where the experts disagreed,
    the truth image was given a "don't know" value

29
GOES Images - Example
[Images: GOES visible image and corresponding expert labels]

30
Image Quantization
  • Some texture features perform better when the
    image is quantized to some small number of levels
    before the features are computed.
  • ITSC_RelLevel performs local image quantization

31
Image Quantization
  • For this demo, we will reduce the number of
    levels from 256 to just three using local image
    statistics
  • ITSC_RelLevel -d -s 30 -i src.bin -o
    q4.bin -k
  • The -i argument specifies the input file name
  • The -o argument specifies the output file name
  • The -d argument tells the program to use standard
    deviation to set the cutoffs instead of a fixed
    value
  • The -k option tells the program to keep values in
    the range {0, 1, 2} rather than normalizing to
    [0, 1]
  • The -s argument indicates the size of the local
    area used to compute statistics

32
Computing Texture Features
  • ADaM is currently able to compute five different
    types of texture features: gray level
    co-occurrence, gray level run length, association
    rules, Gabor filters, and MRF models
  • The syntax for gray level run length computation
    is

33
Computing Texture Features
  • For this demo, we will compute gray level run
    length features using a tile size of 25
  • ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B
    -t 25
  • The -i argument specifies the input file name
  • The -o argument specifies the output file name
  • The -l argument tells the program the number of
    levels in the input image
  • The -B option tells the program to write a binary
    version of the ARFF file (default is ASCII)
  • The -t argument indicates the size of the tiles
    used to compute the gray level run length features

34
Converting the Label Images
  • Since the labels are in the form of images, it is
    necessary to convert them to vector form
  • ITSC_CvtImageToArff will do this

35
Converting the Label Images
  • The labels can be converted to vector form using
  • ITSC_CvtImageToArff -i lbl.bin -o
    lbl.arff -B
  • The -i argument specifies the input file name
  • The -o argument specifies the output file name
  • The -B argument tells the program to write the
    output file in binary form (default is ASCII)

36
Labeling the Patterns
  • Once the labels are in vector form, they can be
    appended to the patterns produced by ITSC_Glrl
  • ITSC_LabelPatterns will do this

37
Labeling the Patterns
  • The labels are assigned to patterns as follows
  • ITSC_LabelPatterns -i glrl.arff -c class -l
    lbl.bin -L lbl.arff -o all.arff -B
  • The -i argument specifies the input file name
    (patterns)
  • The -o argument specifies the output file name
  • The -c argument specifies the name of the class
    attribute in the pattern set
  • The -l argument specifies the name of the label
    attribute in the label set
  • The -L argument specifies the name of the input
    label file
  • The -B argument tells the program to write the
    output file in binary form (default is ASCII)

38
Eliminating "Don't Know" Patterns
  • Some of the original pixels were classified
    differently by different experts and marked as
    "don't know"
  • The corresponding patterns can be removed from
    the training set using ITSC_Subset

39
Eliminating "Don't Know" Patterns
  • ITSC_Subset is used to remove patterns with
    unclear class assignment. The subset is generated
    based on the value of the class attribute
  • ITSC_Subset -i all.arff -o subset.arff -a
    class -r 0 1 -B
  • The -i argument specifies the input file name
  • The -o argument specifies the output file name
  • The -a argument tells which attribute to test
  • The -r argument tells the legal range of the
    attribute
  • The -B argument tells the program to write the
    output file in binary form (default is ASCII)

40
Selecting Random Samples
  • Random samples are selected from the original
    training data using the same ITSC_Sample program
    shown in the previous demo
  • The program is used in a slightly different way
  • ITSC_Sample -i subset.arff -c class -o
    s1.arff -n 2000
  • The -i argument specifies the input file name
  • The -o argument specifies the output file name
  • The -c argument specifies the name of the class
    attribute
  • The -n option tells the program to select an
    equal number of random samples (in this case
    2000) from each class.

41
Python Script for Sample Creation
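The script is an image in the original deck; a minimal sketch of the sample-creation pipeline, under the same assumptions as the workflow sketch on slide 26:

  # Hypothetical reconstruction: chains the components from slides 31-40.
  import subprocess

  def run(cmd):
      subprocess.run(cmd, check=True)

  # Quantize the visible image to three levels using local statistics.
  run(['ITSC_RelLevel', '-d', '-s', '30', '-i', 'src.bin',
       '-o', 'q4.bin', '-k'])
  # Compute gray level run length texture features on 25x25 tiles.
  run(['ITSC_Glrl', '-i', 'q4.bin', '-o', 'glrl.arff',
       '-l', '3', '-B', '-t', '25'])
  # Convert the expert label image to vector (ARFF) form.
  run(['ITSC_CvtImageToArff', '-i', 'lbl.bin', '-o', 'lbl.arff', '-B'])
  # Attach the labels to the texture patterns.
  run(['ITSC_LabelPatterns', '-i', 'glrl.arff', '-c', 'class',
       '-l', 'lbl.bin', '-L', 'lbl.arff', '-o', 'all.arff', '-B'])
  # Remove the "don't know" patterns, keeping class values 0 and 1.
  run(['ITSC_Subset', '-i', 'all.arff', '-o', 'subset.arff',
       '-a', 'class', '-r', '0', '1', '-B'])
  # Draw 2000 random samples per class.
  run(['ITSC_Sample', '-i', 'subset.arff', '-c', 'class',
       '-o', 's1.arff', '-n', '2000'])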
42
Merging Samples / Multiple Images
  • The procedure up to this point has created a
    random subset of points from a particular image.
    Subsets from multiple images can be combined
    using ITSC_MergePatterns

43
Merging Samples / Multiple Images
  • Multiple pattern sets are merged using the
    following command
  • ITSC_MergePatterns -c class -o
    merged.arff -i s1.arff s2.arff
  • The -i argument specifies the input file names
  • The -o argument specifies the output file name
  • The -c argument specifies the name of the class
    attribute

44
Python Script for Training
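Again, the original is an image; a minimal sketch assuming the same CLI conventions:

  # Hypothetical reconstruction: merge per-image samples, then train.
  import subprocess

  def run(cmd):
      subprocess.run(cmd, check=True)

  # Combine the per-image sample sets into one training set.
  run(['ITSC_MergePatterns', '-c', 'class', '-o', 'merged.arff',
       '-i', 's1.arff', 's2.arff'])
  # Train the Naive Bayes classifier on the merged samples.
  run(['ITSC_NaiveBayesTrain', '-c', 'class', '-i', 'merged.arff',
       '-b', 'bayes.txt'])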
45
Results of Classifier Evaluation
  • The results of running this procedure using five
    sample images of size 500x500 are as follows

46
Applying the Classifier to Images
  • Once the classifier is trained, it can be applied
    to segment images. One further program is
    required at the end to convert the classified
    patterns back into an image

47
Python Function for Segmentation
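The function is an image in the original deck. A sketch follows; note that ITSC_CvtArffToImage is a hypothetical name, since the deck only says that one further program converts classified patterns back into an image:

  # Hypothetical reconstruction of the segmentation function.
  import subprocess

  def run(cmd):
      subprocess.run(cmd, check=True)

  def segment(image, classifier='bayes.txt', result='seg.bin'):
      # Quantize, extract texture features, classify, convert back.
      run(['ITSC_RelLevel', '-d', '-s', '30', '-i', image,
           '-o', 'q4.bin', '-k'])
      run(['ITSC_Glrl', '-i', 'q4.bin', '-o', 'glrl.arff',
           '-l', '3', '-B', '-t', '25'])
      run(['ITSC_NaiveBayesApply', '-c', 'class', '-i', 'glrl.arff',
           '-b', classifier, '-o', 'res.arff'])
      # ITSC_CvtArffToImage is a guessed name for the reverse converter.
      run(['ITSC_CvtArffToImage', '-i', 'res.arff', '-o', result])
      return result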
48
Sample Image Results
[Images: segmentation result and expert labels]

49
Remarks
  • The procedure illustrated here is one specific
    example of ADaM's capabilities
  • There are many other classifiers, texture
    features and other tools that could be used for
    this problem
  • Since all of the algorithms of a particular type
    work in more or less the same way, the same
    general procedure could be used with other tools

50
Data Mining Examples From ITSC
51
Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
  • Mining Plan
  • Water cover mask to eliminate land
  • Laplacian filter to compute temperature gradients
  • Science Algorithm to estimate wind speed
  • Contiguous regions with wind speeds above a
    desired threshold are identified
  • Additional test to eliminate false positives
  • Maximum wind speed and location produced

[Diagram: AMSU-A data from the data archive is calibrated, limb
corrected, and converted to brightness temperature (Tb), then mined in
the ADaM environment (Hurricane Floyd shown as an example); results
feed a knowledge base and further analysis]
Results are placed on the web, made available to
the National Hurricane Center and Joint Typhoon
Warning Center, and stored for further analysis
http://pm-esip.msfc.nasa.gov/cyclone
52
Mesocyclone Signature Detection Using ADaM
  • Problem: Detecting mesocyclone signatures in
    radar data
  • Science Rationale: Improved accuracy and reduced
    false alarm rate for indicators of tornadic
    activity
  • Technique: Developing an algorithm based on wind
    velocity shear signatures

53
Data Mining / Earth Science Collaboration
Classification Based on Texture Features
Cumulus cloud fields have a very characteristic
texture signature in the GOES visible imagery
  • Science Rationale: Man-made changes to land use
    cause changes in weather patterns, especially
    cumulus clouds
  • Comparison based on
  • Accuracy of detection
  • Amount of time required to classify

54
Data Mining / Space Science Collaboration
Boundary Detection and Quantification
  • Analysis of polar cap auroras in large volumes
    of spacecraft UV images
  • Scientific Rationale: indicators to predict
    geomagnetic storms, which can
  • Damage satellites
  • Disrupt radio communication
  • Developing different mining algorithms to detect
    and quantify the polar cap boundary

[Image: polar cap boundary in a spacecraft UV image]
55
Data Mining / BioInformatics Collaboration
Genome Patterns
Text Pattern Recognition: used to search for text
patterns in bioscience data as well as other
text documents.
[Diagram: scientists, mining results (MCSs), and the genome database]
56
Additional Information
  • Website
  • www.itsc.uah.edu
  • Dr. Sara Graves
  • sgraves@itsc.uah.edu
  • Danny Hardin
  • dhardin@itsc.uah.edu