Title: Chapter 7 Preparing Scientific and Engineering Data for Mining

1
Chapter 7 - Preparing Scientific and Engineering
Data for Mining

Chandrika Kamath
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
http://www.llnl.gov/casc/people/kamath

UCRL-PRES-145087 The work of Chandrika Kamath
in Chapters 5, 6, and 7 was performed under the
auspices of the U.S. Department of Energy by the
University of California Lawrence Livermore
National Laboratory under contract No.
W-7405-Eng-48.
2
The input data cannot be directly input to the
pattern recognition algorithms
[Diagram: input data -> ? -> features for each data item -> pattern recognition algorithms]
Input data: images/meshes, time-dependent, multi-sensor, compressed, spatio-temporal, massive, 2, 3, or 4 dimensions
Pattern recognition algorithms: classification, clustering, regression, ...
=> The input data must be processed to make it
suitable for the pattern recognition algorithms.
3
Science and engineering data are available in
different formats

Different storage formats
FITS, AIPS in astronomy
netCDF, GRIB (grid in binary) in climate
Different ways of generating output
sea surface temps for each month in a file
sea surface temps for each year in a file
Depending on the problem, data can be
one-dimensional, usually time series, from
sensors or processing of other data
two-dimensional (spatial) + time
three-dimensional (spatial) + time

4
Two-dimensional scientific data is available as
images or as meshes
[Figure: MACHO image]

Can have spatial and temporal aspects
Images
pixel values: gray-scale or real
a scene obtained using different sensors, at
different times, at different resolutions
can be noisy, with noise varying between images
and within an image
Mesh
values at a mesh point are real
cell centered, node centered or edge
centered

[Figure: asteroid image]
5
Three dimensional scientific data comes from
modeling objects in 3-D

Values at a mesh point are real
Can be cell centered, node centered, edge
centered, or face centered
Often have a series of meshes in time: spatial
and temporal aspects

6
The complexity of meshes makes it difficult to
extract features
[Figure: example meshes: Cartesian, structured, and unstructured]
7
The distribution of mesh points can change with
time - need feature tracking
Composed Unstructured mesh
Hierarchy of regular meshes
Composite meshes - locally structured, globally
unstructured
8
Science data is not often in a form ready for
pattern recognition

Data available as pixels or values at mesh points
But, the patterns of interest are at a higher
level

The raw data must be transformed into features
before we can apply pattern recognition.
Extracting features that are robust, relevant to the
problem, and invariant to scaling, rotation, and
translation is non-trivial and time consuming,
but essential to the success of the pattern
recognition algorithm.
9
Most of the work in data mining focuses on
pattern recognition, BUT ...

it is the data pre-processing which is
more influential and time consuming
domain specific and therefore less general
"perhaps as little as 10% effort was spent on
classification aspects of the problem." (Burl
98)
Langley/Simon 95: "much of the power comes not
from the specific induction method, but from
proper formulation of the problems and from
crafting the representation to make learning
tractable."
Brodley/Smyth 95: "in practical applications,
it is often the data and human issues which
ultimately dictate success or failure of a
project rather than algorithmic and model issues."

10
The Sapphire view of the end-to-end data mining
process (Kamath 01)
[Diagram: Raw Data -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge]
Data Preprocessing
Raw Data to Target Data: data fusion, sampling, multi-resolution analysis
Target Data to Preprocessed Data: de-noising, object identification, feature extraction, normalization
Preprocessed Data to Transformed Data: dimension reduction
Pattern Recognition
Transformed Data to Patterns: classification, clustering, regression, ...
Interpreting Results
Patterns to Knowledge: visualization, validation
11
Let's make a few simple assumptions in our
discussion of data preparation ...

We understand the problem and the data
We have formulated a solution approach
We have relatively easy access to the data
We have the software to read, write, and display
the data
We have the software to bring the data into a
consistent format

=> To satisfy these simple assumptions may
require far more time than you expect!
12
Data fusion may be necessary when data from many
sources is available

Combining information from more than one source
to make a more accurate and better informed
decision
Exploit complementary information from different
sensors, at different wavelengths, from different
viewpoints, ...

[Figure: X-ray, infrared, optical, and radio images of the Crab Nebula, from chandra.harvard.edu]
13
Data registration is an important part of data
fusion

Registration: align images to relate information
in one image to information in another image
Used in data fusion and change detection

[Figure: example transformations: translation, rigid body, rotation, horizontal shear]
=> Obtain a global or local transformation to
match the input data to the reference data.
14
There are four major components of data
registration (Brown 92)

Feature space: what features do we use for
matching?
pixels, edges, contours, corners, ...
Search space: what transformation to use to
establish correspondence between input and
reference data?
translations, rotations, scaling, ...
Search strategy: which transformations are
computed and evaluated?
exhaustive search, multiresolution methods
Similarity metric: how to evaluate the match
between input and reference data?
mean square error, sum of absolute differences
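
The four components can be illustrated with a minimal sketch. It assumes the simplest choice for each: raw pixels as the feature space, integer translations as the search space, exhaustive search as the strategy, and mean square error as the similarity metric. The function names and the wrap-around shifting via np.roll are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def mean_square_error(a, b):
    """Similarity metric: mean square error between two equal-sized arrays."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def register_translation(reference, image, max_shift=3):
    """Exhaustively try integer (dy, dx) shifts and return the best one.

    Feature space: raw pixel values. Search space: translations only.
    Assumption: np.roll wraps pixels around the border, which keeps the
    sketch short but is not how real image borders behave.
    """
    best_shift, best_err = (0, 0), float("inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
            err = mean_square_error(reference, shifted)
            if err < best_err:
                best_shift, best_err = (dy, dx), err
    return best_shift, best_err

# Usage: displace a random reference image, then recover the shift that undoes it.
rng = np.random.default_rng(0)
reference = rng.random((32, 32))
moved = np.roll(reference, shift=(2, -1), axis=(0, 1))
shift, err = register_translation(reference, moved)
```

Exhaustive search grows quickly with the size of the search space, which is one motivation for the multiresolution strategies listed above.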

15
Some recent work in image registration

An excellent survey: Brown 92.
Use of wavelet-based multi-resolution techniques
(Le Moigne 94)
Using evolutionary algorithms as a search
strategy (Mandava 89)
Using the Levenberg-Marquardt optimization
strategy (Thevenaz 98).

16
The data may need to be de-noised to better
identify the objects

Noise in the data can be due to the data
acquisition process or natural phenomena such as
atmospheric turbulence
De-noising is difficult as we cannot always tell
what is the signal and what is the noise
A simple approach: thresholding
drop all values below a threshold
how do we calculate the threshold?
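
One possible answer to the threshold question, as a hedged sketch only: estimate the background level and its spread from the data, and keep values that rise well above it. The rule used here (median plus k times a robust spread estimate from the median absolute deviation) and the default k = 3 are assumptions chosen for illustration; the slides do not prescribe a rule.

```python
import numpy as np

def threshold_denoise(data, k=3.0):
    """Zero out all values below a data-driven threshold.

    Assumption: threshold = median + k * sigma, where sigma is a robust
    spread estimate (1.4826 * median absolute deviation).
    """
    data = np.asarray(data, dtype=float)
    background = np.median(data)
    sigma = 1.4826 * np.median(np.abs(data - background))
    threshold = background + k * sigma
    cleaned = np.where(data >= threshold, data, 0.0)
    return cleaned, threshold

# Usage: a mostly flat background with two bright values.
values = np.array([1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 9.0, 1.05, 0.95, 12.0])
cleaned, threshold = threshold_denoise(values)
```

A global rule like this fails when the noise level varies within an image, which the earlier slides note can happen.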

17
Simple filters can be used to smooth the data
and minimize the noise effects

Convolve the image with a filter

[Figure: an image, the non-zero locations of a 3 by 3 filter, and the convolution of filter f with image I]
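
The formula from this slide did not survive in the transcript. For reference, the standard discrete 2-D convolution of a 3 by 3 filter f with an image I can be written as follows; indexing the filter from -1 to 1 around its center is an assumed convention.

```latex
(f * I)(x, y) \;=\; \sum_{i=-1}^{1} \sum_{j=-1}^{1} f(i, j)\, I(x - i,\; y - j)
```

Each output pixel is a weighted sum of the input pixel and its eight neighbors, with the weights given by the filter.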
18
Examples of some simple filters

Filters can vary in width - a wide filter gives
better noise reduction, but smooths the edges
Mean filters
Gaussian filters
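
A minimal sketch of mean and Gaussian filters applied by direct convolution. The helper names, the edge-replicated border handling, and the Gaussian sigma are assumptions made for illustration.

```python
import numpy as np

def mean_kernel(width=3):
    """Mean filter: every weight is 1 / width^2."""
    return np.full((width, width), 1.0 / width**2)

def gaussian_kernel(width=3, sigma=1.0):
    """Gaussian filter sampled on a width x width grid, normalized to sum to 1."""
    half = width // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def convolve2d(image, kernel):
    """Direct 2-D convolution for odd-width kernels.

    Assumption: borders are handled by replicating the edge pixels.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + rows, j:j + cols]
    return out

# Usage: smooth a noisy constant image with a narrow and a wide mean filter.
rng = np.random.default_rng(0)
noisy = 5.0 + rng.normal(0.0, 1.0, size=(64, 64))
smooth3 = convolve2d(noisy, mean_kernel(3))
smooth5 = convolve2d(noisy, mean_kernel(5))
```

Comparing noisy.var(), smooth3.var(), and smooth5.var() shows the trade-off stated above: the wider filter removes more noise, and on an image with edges it would also blur them more.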

19
Multi-resolution analysis using wavelets

Using appropriate filters, decompose a signal into a
high frequency part (detail coefficients)
low frequency part (smooth coefficients)
In 2-dimensions, apply first along one dimension
and then the other
Choice of wavelets, transforms, boundary
conditions, number of levels

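[Worked example: smooth coefficients to the left of "|", detail coefficients to the right]
Signal:  64 48 16 32 56 56 48 24
Level 1: 56 24 56 36 | 8 -8 0 12
Level 2: 40 46 | 16 10 8 -8 0 12
Level 3: 43 | -3 16 10 8 -8 0 12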
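
The eight-value example on this slide can be reproduced with pairwise averaging (smooth) and pairwise half-differences (detail). This is a minimal sketch; the function names are illustrative, and the unnormalized average/difference convention is inferred from the slide's numbers.

```python
def haar_step(values):
    """One level: pairwise averages (smooth) and half-differences (detail)."""
    pairs = list(zip(values[0::2], values[1::2]))
    smooth = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return smooth, detail

def haar_decompose(signal, levels):
    """Repeatedly split the smooth part; return smooth followed by details."""
    smooth, details = list(signal), []
    for _ in range(levels):
        smooth, detail = haar_step(smooth)
        details = detail + details  # coarser details go in front
    return smooth + details

signal = [64, 48, 16, 32, 56, 56, 48, 24]
level1 = haar_decompose(signal, 1)
level2 = haar_decompose(signal, 2)
level3 = haar_decompose(signal, 3)
```

After three levels, the first value is the overall mean of the signal and the remaining seven values are details at increasingly fine scales.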
20
Wavelet multi-resolution analysis: Haar wavelet,
periodic boundary conditions
[Figure: 2-level decomposition, with horizontal, vertical, and diagonal detail sub-bands]
21
Wavelets can be used for removing noise from data

Useful when data is available compressed using
wavelet transforms
Basic idea - drop detail coefficients below a
threshold
Extensive study (Fodor/Kamath 01)
several wavelets, boundary treatments
several shrinkage rules, shrinkage functions
compare with linear and non-linear spatial
filters

[Diagram: noisy image -> forward wavelet transform -> calculate threshold -> apply threshold -> inverse wavelet transform -> denoised image]
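
The four steps (forward transform, calculate threshold, apply threshold, inverse transform) can be sketched in one dimension. The choices here are assumptions for illustration, not the configuration studied in Fodor/Kamath 01: an orthonormal Haar transform, the universal threshold sigma * sqrt(2 ln n) with sigma estimated from the finest detail coefficients, and soft thresholding.

```python
import numpy as np

def haar_forward(signal, levels):
    """Orthonormal Haar transform; returns (smooth, details ordered coarse to fine)."""
    smooth = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = smooth[0::2], smooth[1::2]
        details.insert(0, (even - odd) / np.sqrt(2.0))
        smooth = (even + odd) / np.sqrt(2.0)
    return smooth, details

def haar_inverse(smooth, details):
    """Invert haar_forward, rebuilding from the coarsest level up."""
    for detail in details:
        even = (smooth + detail) / np.sqrt(2.0)
        odd = (smooth - detail) / np.sqrt(2.0)
        smooth = np.empty(2 * len(detail))
        smooth[0::2], smooth[1::2] = even, odd
    return smooth

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; values within [-t, t] become zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

def denoise(noisy, levels):
    smooth, details = haar_forward(noisy, levels)
    # Assumption: noise level estimated from the finest details (MAD / 0.6745).
    sigma = np.median(np.abs(details[-1])) / 0.6745
    t = sigma * np.sqrt(2.0 * np.log(len(noisy)))  # universal threshold
    return haar_inverse(smooth, [soft_threshold(d, t) for d in details])

# Usage: a step signal with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(128), np.full(128, 10.0)])
noisy = clean + rng.normal(0.0, 1.0, size=clean.size)
denoised = denoise(noisy, levels=5)
mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_denoised = float(np.mean((denoised - clean) ** 2))
```

Only the detail coefficients are thresholded; the smooth coefficients are left untouched, so any noise in them carries through to the result.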
22
Comparison of denoising: wavelets (symmlet12) vs.
spatial filters
[Figure panels: Original; Noisy, 20, MSE = 398; MMSE Gaussian, MSE = 65; SURE, MSE = 69]
23
Results of study on wavelet-based statistical
techniques for denoising data

Results independent of choice of wavelets
Soft thresholding better than hard or semi-soft
SURE and Bayes rules are consistently better
Wavelets preserve edges better, but introduce
artifacts
Wavelets are not good at structured noise
Combination of spatial filters may give smaller
MSE
Spatial filters often blur edges
Other approaches - diffusion-based methods, level
set methods, ENO and TVD schemes, non-decimated
transforms, curvelets, ...

24
De-noising techniques applied to the FIRST images
[Figure panels: Observed, SureShrink, Universal, HypTest, Unsharp mask, Simple threshold]
25
Wavelets can be a useful tool in several aspects
of data mining (Fodor/Kamath 00)

Very effective in compression
astronomy, simulations, FBI fingerprints
JPEG 2000, MPEG-4 standards
progressive transmission of data
Mining compressed data: visualization approaches
(Machiraju 01)
Feature extraction (at different scales)
texture analysis (Ma/Manjunath 95)
Image registration (Le Moigne 94)
Caveat: recent work (Candes/Donoho 00) indicates
that wavelets might not be good for > 1D data

26
Once the data has been de-noised, we need to
identify the objects in it

Identifying the objects is non-trivial
tremendous variability of object shapes: man-made
vs. natural objects
denoising may have smoothed the edges
variations in image quality (noise, boundary
gaps)

27
Identifying objects in data is difficult, both in
2 and 3-D images and meshes

Challenges in traditional image algorithms
need many parameters for optimal performance
interactions between parameters are complex and
non-linear
no universally accepted measure of quality of the
segmented image
no single method can handle variations between
images
Identifying objects in mesh data
mesh may move/change over time
in two/three spatial dimensions + time
irregular meshes
objects may split or merge

28
Several techniques are being used in the image
processing community

Histogram the image, and threshold it based on
the histogram to separate the foreground from the
background
Segmentation techniques
split and merge (top-down)
region growing (bottom-up)
Edge detection: use a filter to identify an edge

[Figure: Laplacian and Sobel filters]
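
As an illustration of filter-based edge detection, the sketch below applies the standard 3 by 3 Sobel kernels and combines the horizontal and vertical responses into a gradient magnitude. The helper names and the edge-replicated border handling are assumptions.

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Slide a 3x3 kernel over the image (correlation form).

    Assumption: borders are handled by replicating the edge pixels.
    """
    padded = np.pad(image.astype(float), 1, mode="edge")
    rows, cols = image.shape
    out = np.zeros((rows, cols), dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return out

def sobel_magnitude(image):
    """Gradient magnitude from the two Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Usage: a vertical step edge between columns 3 and 4.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
edges = sobel_magnitude(image)
```

The response is non-zero only in the two columns next to the step. Turning such a response map into object boundaries still requires a threshold, which is one of the parameters the previous slide warns about.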
29
Examples of some simple edge detection
[Figure panels: Original, Sobel, Canny]
30
More sophisticated techniques for object
identification

Combine traditional techniques with evolutionary
algorithms to make them more adaptive (Bhanu/Lee
95, Cagnoni 97)
Deformable models for segmentation
parametric approach: snakes or active contours
(Kass et al. 87)
geometric approach: level set methods (Malladi
and Sethian 96)
Non-linear diffusion filters based on PDEs
smooth images while enhancing edges (Weickert et
al. 98)

=> PDE-based techniques are gaining popularity -
they are robust, but expensive.
31
Once the objects have been identified, the
features must be extracted

Features dependent on the problem
identifying relevant features
extracting robust features
extracting features invariant to scale, rotation,
and translation
Features may include
distances, angles, areas
histograms
Fourier or wavelet coefficients
various moments
...
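
As one example of a moment-based feature, the sketch below computes the first Hu moment invariant of an object mask. It is built from normalized central moments, so it does not change when the object is translated, and the sum of the two second-order terms is unchanged by rotation. The choice of this particular feature is an assumption for illustration; the slides only mention "various moments".

```python
import numpy as np

def hu_first_invariant(image):
    """First Hu invariant: eta_20 + eta_02 of a gray-scale or binary image."""
    img = np.asarray(image, dtype=float)
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()
    xbar = (xs * img).sum() / m00  # centroid removes translation
    ybar = (ys * img).sum() / m00
    mu20 = ((xs - xbar) ** 2 * img).sum()
    mu02 = ((ys - ybar) ** 2 * img).sum()
    # Normalized central moments: eta_pq = mu_pq / m00^(1 + (p + q) / 2);
    # for p + q = 2 the exponent is 2.
    return (mu20 + mu02) / m00**2

# Usage: the same rectangle, translated and rotated by 90 degrees.
shape = np.zeros((32, 32))
shape[4:10, 6:18] = 1.0
shifted = np.roll(shape, shift=(10, 7), axis=(0, 1))
rotated = np.rot90(shape)
phi = hu_first_invariant(shape)
```

On a pixel grid the normalization makes the value only approximately invariant to scaling, with the error shrinking as the object covers more pixels.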

32
May need to reduce the dimension or the number
of features
[Diagram: raw data -> object recognition and feature extraction -> features for each data item -> dimension reduction -> features for each data item -> pattern recognition -> information]
33
There are several reasons why dimension reduction
may be helpful

Fewer features may make pattern recognition
algorithms computationally tractable
Less time is spent in extracting features
Can minimize correlations between features, which
may be a requirement of some algorithms (e.g.
GLMs)

34
In the FIRST data, we need to reduce the 103
features for 3-entry sources

Input from domain experts
EDA techniques: parallel plots and box plots
Wrapper approach

35
There are also more complex techniques for
dimension reduction

Principal component analysis
transform the features to be mutually
uncorrelated
focus on directions that maximize the variance

36
Principal component analysis algorithm

N data items in d dimensions
find the d-dimensional mean vector
obtain the d x d covariance matrix
obtain the d eigenvalues and eigenvectors of the
covariance matrix
keep the eigenvectors of the k largest eigenvalues (k << d)
project the (original data - mean) into the space
spanned by these vectors

=> The eigenvectors or principal components (PCs)
are mutually orthogonal and the original
data is a linear combination of these PCs
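
The steps above can be sketched with NumPy as follows. The function name and return values are illustrative choices; np.linalg.eigh is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca(data, k):
    """Project N x d data onto the k principal components with largest variance."""
    mean = data.mean(axis=0)                # d-dimensional mean vector
    centered = data - mean
    cov = np.cov(centered, rowvar=False)    # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
    components = eigvecs[:, order]          # d x k, orthonormal columns
    projected = centered @ components       # N x k
    return projected, components, eigvals[order], mean

# Usage: 200 items in 5 dimensions with very different variances per feature.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
projected, components, variances, mean = pca(data, k=2)
explained = variances.sum() / np.trace(np.cov(data, rowvar=False))
```

The value of explained is the fraction of the total variance captured by the k retained components, which is the quantity quoted when choosing k.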
37
We applied PCA to the problem of bent-double
classification

The first 20 PCs explained about 90% of the
variance
Eliminate unimportant variables
eliminate variable with largest coefficient in
e-vector corresponding to smallest e-value
repeat with the e-vector for the next smallest
e-value
continue till left with 20 variables

=> Using the 31 features found through EDA and
PCA lowers the error from 11.1% to 9.5%
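
The variable-elimination loop can be sketched as below. Several details are assumptions, since the slide does not state them: the eigenvectors come from the covariance matrix of the raw features, they are computed once rather than recomputed after each removal, and a variable that has already been removed is skipped in favor of the largest remaining coefficient.

```python
import numpy as np

def eliminate_variables(data, n_keep):
    """Drop variables tied to the smallest-variance directions until n_keep remain."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending: smallest eigenvalue first
    remaining = list(range(data.shape[1]))
    for col in range(data.shape[1]):        # walk eigenvectors, smallest e-value first
        if len(remaining) <= n_keep:
            break
        # Assumption: only variables still in play are candidates for removal.
        weights = np.abs(eigvecs[remaining, col])
        remaining.remove(remaining[int(np.argmax(weights))])
    return remaining

# Usage: three independent variables plus a near-copy of the first one.
rng = np.random.default_rng(1)
base = rng.normal(size=(500, 3))
near_copy = base[:, 0] + 1e-3 * rng.normal(size=500)
data = np.column_stack([base, near_copy])
kept = eliminate_variables(data, n_keep=3)
```

In this example the smallest-variance direction is the difference between the first variable and its near-copy, so exactly one of the two is removed.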
38
Need more appropriate techniques for dimension
reduction

PCA may not always be appropriate
linear
orthogonal
Other options
independent component analysis
blind source separation
non-linear PCA
genetic algorithms
Need incremental techniques which are applied as
the data is being collected (Kargupta 00)

39
It is difficult to find labeled data in science
and engineering applications

Training set usually generated manually, not
historically
Not all scientists may agree on a label
Labeled data vs. "interesting" data
Often ground truth is unavailable, or difficult
to find
Approach to labeling may be ad-hoc
the "yellow-sticky-pad" approach to identifying
bent doubles

[Figure: examples of a non-bent double and a bent double]
40
Sapphire experiences with a flexible system
design for data mining

We address the needs of a diverse set of
applications
Not all problems require the entire process
Not all algorithms are suitable for a problem
Algorithms typically depend on several parameters
Intermediate data must be handled properly
Domain dependent and independent parts must be
clearly identified
Should be able to accommodate a growing data set

41
The Sapphire approach: a flexible, portable,
scalable system architecture
[Diagram: system architecture with user input and feedback; components linked by Python]
42
Other pointers that discuss system architecture
issues

Data mining specific projects
ADAM, JARTool, Diamond Eye
Workshops of more general interest
mining scientific datasets (http://www.ahpcrc.umn.edu/conferences)
interfaces to scientific data archives (http://www.cacr.caltech.edu/SDA)
large scientific databases (http://www.cacr.caltech.edu/euus/)
issues in the application of data mining to scientific data (http://www.cs.uah.edu/NASA_Mining)
data fusion and data mining (http://ic-www.arc.nasa.edu/ic/data99-workshop)

43
Challenges in mining science and engineering data
sets

Feature extraction is non-trivial
Labeled data is difficult to obtain
Data can be high dimensional
Need techniques to handle spatial and temporal
aspects
System infrastructure issues are important
Data fusion and registration are required in some
cases
Data may be compressed
May need to mine data as it is being generated
...

44
Acknowledgements

The Sapphire project team: Erick Cantú-Paz, Imola
K. Fodor, and Nu Ai Tang
Sisira Weeratunga (LLNL) for insights on
simulations and PDEs
FIRST scientists: Bob Becker, Michael Gregg,
Sally Laurent-Muehleisen, and Rick White

http://www.llnl.gov/casc/sapphire
45
Chapter 7 - References

Credits for images used in Chapter 7 (if not
provided with the image)
MACHO web page: http://wwwmacho.anu.edu.au/
3D meshes: http://www.llnl.gov/casc/overture
Structured and unstructured mesh around the front of an aircraft: http://www.nas.nasa.gov/Pubs/Docs/FAST/chp_16.serferu.html
3D unstructured mesh with heterogeneous elements: http://cox.iwr.uni-heidelbeg.de/ug/Images/benchmark_grid.gif
Composite grid, SAMRAI project: http://www.llnl.gov/casc/samrai
Wavelet images generated by Sapphire software: http://www.llnl.gov/casc/sapphire

46
Chapter 7 - References

Burl, M., L. Asker, P. Smyth, U. Fayyad, P.
Perona, L. Crumpler, and J. Aubele, Learning to
recognize volcanoes on Venus, Machine Learning,
Volume 30, pages 165-195, 1998.
Langley, P. and H. A. Simon, Applications of
machine learning and rule induction,
Communications of the ACM, Volume 38, Number 11,
pages 55-64, 1995.
Brodley, C. and P. Smyth, The process of
applying machine learning algorithms, Workshop
on applying machine learning in practice, IMLC
1995 (http://citeseer.nj.nec.com/722.html).
Kamath, C., E. Cantú-Paz, I. K. Fodor, and N.
Tang, Searching for bent-double galaxies in the
FIRST survey, in Data Mining for Scientific and
Engineering Applications, R. Grossman, C. Kamath,
W. Kegelmeyer, V. Kumar, and R. Namburu (eds.),
Kluwer 2001.

47
Chapter 7 - References

Brown, L. A Survey of Image Registration
Techniques. ACM Computing Surveys, Vol. 24,
Number 4, December 1992.
Le Moigne, J., Parallel Registration of
Multi-sensor remotely sensed imagery using
wavelet coefficients, Proc. SPIE Wavelet
Applications Conference, Orlando, 1994, pages
423-443.
Mandava, V., Fitzpatrick, J., and Pickens, D.
(1989). Adaptive search space scaling in digital
image registration. IEEE Transactions on Medical
Imaging, 8, 251-262.
Thevenaz, P., Ruttimann, U., Unser, M., A
Pyramid Approach to Sub-pixel Registration based
on intensity, IEEE Transactions on Image
Processing, Vol 7, Number 1, January 1998.
Fodor, I.K. and C. Kamath, On denoising images
using wavelet-based statistical techniques,
submitted for publication. See the Sapphire web
page for details.

48
Chapter 7 - References

Fodor, I.K. and C. Kamath, The role of
multi-resolution in mining massive image
datasets, Proceedings of the YES2000 Symposium
on Advanced Multiscale and Multi-resolution
Methods, Lecture Notes in Computational Science
and Engineering, Springer-Verlag, 2001.
Machiraju, R. and J. Fowler, D. Thompson, W.
Schroeder, and B. Soni, EVITA - A Prototype
System for Efficient Visualization and
Interrogation of Terascale Datasets, to appear
in Data Mining for Scientific and Engineering
Applications, R. Grossman, C. Kamath, W.
Kegelmeyer, V. Kumar, and R. Namburu (eds.),
Kluwer 2001.
Ma, W. Y. and B. S. Manjunath, A comparison of
wavelet transform features for texture image
annotation, Proc. Second International
Conference on Image Processing, ICIP 95, pages
256-259.

49
Chapter 7 - References

Candes, E. and Donoho, D. , Curvelets,
multiresolution representation, and scaling
laws, Proc. Wavelet Applications in Signal and
Image Processing VIII, SPIE 2000, vol. 4119.
Bhanu, B. and S. Lee, Adaptive image segmentation
using a genetic algorithm, IEEE Transactions on
Systems, Man, and Cybernetics, 25, pages
1543-1567, 1995.
Cagnoni, S., A. Dobrzeniecki, R. Poli, and J. Yanch,
Segmentation of 3D medical images through
genetically optimized contour tracking
algorithms, U. Birmingham School of Computer
Science Technical Report, CSRP-97-28, 1997.
Kass, M., A. Witkin, and D. Terzopoulos, Snakes:
active contour models, Intl. J. Computer Vision,
Volume 1, No. 4, pages 321-331, 1987.

50
Chapter 7 - References

Malladi, R. and J. Sethian, A unified approach
to noise removal, image enhancement, and shape
recovery, IEEE Transactions on Image Processing,
Volume 5, 1996, pages 1154-1168.
Weickert, J, B. ter Haar Romeny, and M.
Viergever, Efficient and Reliable Schemes for
Nonlinear Diffusion Filtering, IEEE Transactions
on Image Processing, Volume 7, Number 3, March
1998.
Jolliffe, I., Principal Component Analysis,
Springer Verlag, 1986.
Kargupta, H, W. Huang, S. Krishnamoorthy, and E.
Johnson, Distributed clustering using collective
principal component analysis, ACM SigKDD
Workshop on Distributed and Parallel Knowledge
Discovery, 2000.
