Title: Chapter 7 Preparing Scientific and Engineering Data for Mining
1. Chapter 7 - Preparing Scientific and Engineering Data for Mining
- Chandrika Kamath
- Center for Applied Scientific Computing
- Lawrence Livermore National Laboratory
- http://www.llnl.gov/casc/people/kamath
UCRL-PRES-145087 The work of Chandrika Kamath
in Chapters 5, 6, and 7 was performed under the
auspices of the U.S. Department of Energy by the
University of California Lawrence Livermore
National Laboratory under contract No.
W-7405-Eng-48.
2. The input data cannot be fed directly to the pattern recognition algorithms
[Diagram: input data (images/meshes, time-dependent, multi-sensor, compressed, spatio-temporal, massive, 2-, 3-, or 4-dimensional) -> ? -> table of data items by features -> pattern recognition algorithms (classification, clustering, regression, ...)]
The input data must be processed to make it suitable for the pattern recognition algorithms.
3. Science and engineering data are available in different formats
- Different storage formats
  - FITS, AIPS in astronomy
  - netCDF, GRIB (grid in binary) in climate
- Different ways of generating output
  - sea surface temps for each month in a file
  - sea surface temps for each year in a file
- Depending on the problem, data can be
  - one-dimensional, usually time series, from sensors or processing of other data
  - two-dimensional (spatial) + time
  - three-dimensional (spatial) + time
4. Two-dimensional scientific data is available as images or as meshes
[Figures: MACHO; Asteroid]
- Can have spatial and temporal aspects
- Images
  - pixel values: gray-scale or real
  - a scene obtained using different sensors, at different times, at different resolutions
  - can be noisy, with noise varying between images and within an image
- Mesh
  - values at a mesh point are real
  - cell centered, node centered, or edge centered
5. Three-dimensional scientific data comes from modeling objects in 3-D
- Values at a mesh point are real
- Can be cell centered, node centered, edge centered, or face centered
- Often have a series of meshes in time: spatial and temporal aspect
6. The complexity of meshes makes it difficult to extract features
[Figure: example meshes labeled Cartesian, Structured, and Unstructured]
7. The distribution of mesh points can change with time - need feature tracking
[Figure labels: composed unstructured mesh; hierarchy of regular meshes; composite meshes (locally structured, globally unstructured)]
8. Science data is often not in a form ready for pattern recognition
- Data available as pixels or values at mesh points
- But, the patterns of interest are at a higher level
The raw data must be transformed into features before we can apply pattern recognition. Extracting features that are robust, relevant to the problem, and invariant to scaling, rotation, and translation is non-trivial and time consuming, but essential to the success of the pattern recognition algorithm.
9. Most of the work in data mining focuses on pattern recognition, BUT ...
- it is the data pre-processing which is
  - more influential and time consuming
  - domain specific and therefore less general
- perhaps as little as 10% effort was spent on classification aspects of the problem. (Burl 98)
- Langley/Simon 95: much of the power comes not from the specific induction method, but from proper formulation of the problems and from crafting the representation to make learning tractable.
- Brodley/Smyth 95: in practical applications, it is often the data and human issues which ultimately dictate success or failure of a project rather than algorithmic and model issues.
10. The Sapphire view of the end-to-end data mining process (Kamath 01)
[Diagram: Raw Data -> Target Data -> Preprocessed Data -> Transformed Data -> Patterns -> Knowledge]
- Data Preprocessing
  - data fusion, sampling, multi-resolution analysis
  - de-noising, object identification, feature extraction, normalization
  - dimension reduction
- Pattern Recognition
  - classification, clustering, regression, ...
- Interpreting Results
  - visualization, validation
11. Let's make a few simple assumptions in our discussion of data preparation ...
- We understand the problem and the data
- We have formulated a solution approach
- We have relatively easy access to the data
- We have the software to read, write, and display the data
- We have the software to bring the data into a consistent format
Satisfying these simple assumptions may require far more time than you expect!
12. Data fusion may be necessary when data from many sources is available
- Combining information from more than one source to make a more accurate and better informed decision
- Exploit complementary information from different sensors, at different wavelengths, from different viewpoints, ...
[Figure: images of the Crab Nebula in X-ray, infrared, optical, and radio, from chandra.harvard.edu]
13. Data registration is an important part of data fusion
- Registration: align images to relate information in one image to information in another image
- Used in data fusion and change detection
[Figure: example transformations - translation, rigid body, rotation, horizontal shear]
Obtain a global or local transformation to match the input data to the reference data.
14. There are four major components of data registration (Brown 92)
- Feature space: what features do we use for matching?
  - pixels, edges, contours, corners, ...
- Search space: what transformation to use to establish correspondence between input and reference data?
  - translations, rotations, scaling, ...
- Search strategy: which transformations are computed and evaluated?
  - exhaustive search, multiresolution methods
- Similarity metric: how to evaluate the match between input and reference data?
  - mean square error, sum of absolute differences
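The four components above can be illustrated with a minimal sketch, assuming the simplest choice for each: raw pixels as the feature space, integer translations as the search space, exhaustive search as the strategy, and mean square error as the similarity metric. The function name `register_translation` and the wrap-around border handling are assumptions made for this illustration only.

```python
import numpy as np

def register_translation(reference, image, max_shift=3):
    """Exhaustive search over integer translations, scored by mean square error."""
    best_shift, best_mse = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # np.roll wraps pixels around the border (a simplifying assumption).
            candidate = np.roll(image, shift=(dy, dx), axis=(0, 1))
            mse = np.mean((candidate - reference) ** 2)
            if mse < best_mse:
                best_shift, best_mse = (dy, dx), mse
    return best_shift, best_mse

rng = np.random.default_rng(0)
reference = rng.random((32, 32))
# Input image: the reference moved 2 rows down and 1 column left.
shifted = np.roll(reference, shift=(2, -1), axis=(0, 1))
shift, mse = register_translation(reference, shifted)
```

The returned shift is the translation that maps the input back onto the reference. Exhaustive search grows quickly with the size of the search space, which is one motivation for the multiresolution strategies listed above.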
15. Some recent work in image registration
- An excellent survey: Brown 92.
- Use of wavelet-based multi-resolution techniques (Le Moigne 94)
- Using evolutionary algorithms as a search strategy (Mandava 89)
- Using the Levenberg-Marquardt optimization strategy (Thevenaz 98).
16. The data may need to be de-noised to better identify the objects
- Noise in the data can be due to the data acquisition process or natural phenomena such as atmospheric turbulence
- De-noising is difficult as we cannot always tell what is the signal and what is the noise
- A simple approach: thresholding
  - drop all values below a threshold
  - how do we calculate the threshold?
17. Simple filters can be used to smooth the data and minimize the noise effects
- Convolve the image with a filter
[Figure: an image, the non-zero locations of a 3 by 3 filter, and the convolution of filter f with image I]
18. Examples of some simple filters
- Filters can vary in width - a wide filter gives better noise reduction, but smooths the edges
- Mean filters
- Gaussian filters
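As a rough illustration of smoothing by convolution, the sketch below builds mean and Gaussian filters of a chosen width and applies them to a noisy step image using NumPy only. The helper names (`convolve2d`, `mean_kernel`, `gaussian_kernel`), the edge-replicating border treatment, and the test image are assumptions for this example, not part of the original material.

```python
import numpy as np

def convolve2d(image, kernel):
    """Convolve a 2-D image with a small odd-sized kernel (borders replicated)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + h, j:j + w]
    return out

def mean_kernel(width):
    """Mean filter: every weight equal, summing to one."""
    return np.full((width, width), 1.0 / width**2)

def gaussian_kernel(width, sigma):
    """Gaussian filter: separable weights, normalized to sum to one."""
    ax = np.arange(width) - width // 2
    g = np.exp(-(ax**2) / (2.0 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[:, 32:] = 1.0  # vertical step edge
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
smooth3 = convolve2d(noisy, mean_kernel(3))
smooth7 = convolve2d(noisy, mean_kernel(7))
gauss5 = convolve2d(noisy, gaussian_kernel(5, 1.0))
```

Comparing `smooth3` and `smooth7` away from the step shows the trade-off noted above: the wider filter suppresses more noise, at the cost of spreading the step over more pixels.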
19. Multi-resolution analysis using wavelets
- Using appropriate filters, decompose a signal into a
  - high frequency part (detail coefficients)
  - low frequency part (smooth coefficients)
- In 2-dimensions, apply first along one dimension and then the other
- Choice of wavelets, transforms, boundary conditions, number of levels
Example: repeated pairwise averages (smooth, left) and half-differences (detail, right)
  Signal:  64 48 16 32 56 56 48 24
  Level 1: 56 24 56 36 | 8 -8 0 12
  Level 2: 40 46 | 16 10 | 8 -8 0 12
  Level 3: 43 | -3 | 16 10 | 8 -8 0 12
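The numeric example on this slide can be reproduced with a short sketch, assuming the unnormalized form of the Haar transform in which the smooth part is the pairwise average and the detail part is half the pairwise difference. The function names are illustrative only.

```python
def haar_step(values):
    """One level: pairwise averages (smooth) and half-differences (detail)."""
    pairs = list(zip(values[0::2], values[1::2]))
    smooth = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return smooth, detail

def haar_decompose(signal):
    """Full decomposition of a signal whose length is a power of two."""
    smooth, details = list(signal), []
    while len(smooth) > 1:
        smooth, detail = haar_step(smooth)
        details = detail + details  # coarser details are placed in front
    return smooth + details

coeffs = haar_decompose([64, 48, 16, 32, 56, 56, 48, 24])
print(coeffs)  # [43.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]
```

The first entry is the overall average of the signal; the remaining entries are detail coefficients ordered from coarsest to finest. An orthonormal Haar transform would scale by the square root of two instead of dividing by two.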
20. Wavelet multi-resolution analysis: Haar wavelet, periodic boundary conditions
[Figure: 2-level decomposition, with horizontal, vertical, and diagonal detail panels]
21. Wavelets can be used for removing noise from data
- Useful when data is available compressed using wavelet transforms
- Basic idea - drop detail coefficients below a threshold
- Extensive study (Fodor/Kamath 01)
  - several wavelets, boundary treatments
  - several shrinkage rules, shrinkage functions
  - compare with linear and non-linear spatial filters
[Diagram: Noisy Image -> Forward wavelet transform -> Calculate threshold -> Apply threshold -> Inverse wavelet transform -> Denoised Image]
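The forward transform, threshold, inverse transform pipeline can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the procedure used in the cited study: it uses an orthonormal Haar wavelet, estimates the noise level from the median absolute deviation of the finest detail coefficients, and applies soft thresholding with the universal threshold sigma * sqrt(2 ln n). All function names are hypothetical.

```python
import numpy as np

def haar_forward(x, levels):
    """Orthonormal Haar transform: final smooth part plus details (finest first)."""
    smooth, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        even, odd = smooth[0::2], smooth[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        smooth = (even + odd) / np.sqrt(2.0)
    return smooth, details

def haar_inverse(smooth, details):
    """Invert haar_forward, rebuilding from the coarsest level up."""
    for detail in reversed(details):
        out = np.empty(2 * smooth.size)
        out[0::2] = (smooth + detail) / np.sqrt(2.0)
        out[1::2] = (smooth - detail) / np.sqrt(2.0)
        smooth = out
    return smooth

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; those below t become zero."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

def denoise(x, levels=3):
    smooth, details = haar_forward(x, levels)
    # Assumed noise estimate: median absolute deviation of the finest details.
    sigma = np.median(np.abs(details[0])) / 0.6745
    t = sigma * np.sqrt(2.0 * np.log(len(x)))  # universal threshold
    details = [soft_threshold(d, t) for d in details]
    return haar_inverse(smooth, details)

rng = np.random.default_rng(0)
clean = np.repeat([0.0, 4.0, 1.0, 3.0], 64)  # piecewise-constant signal
noisy = clean + 0.5 * rng.standard_normal(clean.size)
denoised = denoise(noisy)
mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
```

On this synthetic signal the denoised MSE is lower than the noisy MSE; other wavelets, boundary treatments, and shrinkage rules would replace the corresponding pieces of this sketch.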
22. Comparison of denoising: wavelets (symmlet12) vs. spatial filters
[Figure panels: Original; Noisy, 20 (MSE = 398); MMSE Gaussian (MSE = 65); SURE (MSE = 69)]
23. Results of study on wavelet-based statistical techniques for denoising data
- Results independent of choice of wavelets
- Soft thresholding better than hard or semi-soft
- SURE and Bayes rules are consistently better
- Wavelets preserve edges better, but introduce artifacts
- Wavelets are not good at structured noise
- Combination of spatial filters may give smaller MSE
- Spatial filters often blur edges
- Other approaches - diffusion-based methods, level set methods, ENO and TVD schemes, non-decimated transforms, curvelets, ...
24. De-noising techniques applied to the FIRST images
[Figure panels: Observed, SureShrink, Universal, HypTest, Unsharp mask, Simple threshold]
25. Wavelets can be a useful tool in several aspects of data mining (Fodor/Kamath 00)
- Very effective in compression
  - astronomy, simulations, FBI fingerprints
  - JPEG 2000, MPEG-4 standards
  - progressive transmission of data
- Mining compressed data: visualization approaches (Machiraju 01)
- Feature extraction (at different scales)
  - texture analysis (Ma/Manjunath 95)
- Image registration (Le Moigne 94)
- Caveat: recent work (Candes/Donoho 00) indicates that wavelets might not be good for > 1D data
26. Once the data has been de-noised, we need to identify the objects in it
- Identifying the objects is non-trivial
  - tremendous variability of object shapes (man-made vs. natural objects)
  - denoising may have smoothed the edges
  - variations in image quality (noise, boundary gaps)
27. Identifying objects in data is difficult, both in 2- and 3-D images and meshes
- Challenges in traditional image algorithms
  - need many parameters for optimal performance
  - interactions between parameters are complex and non-linear
  - no universally accepted measure of quality of the segmented image
  - no single method can handle variations between images
- Identifying objects in mesh data
  - mesh may move/change over time
  - in two/three spatial dimensions + time
  - irregular meshes
  - objects may split or merge
28. Several techniques are being used in the image processing community
- Histogram the image, and threshold it based on the histogram: separate the foreground from the background
- Segmentation techniques
  - split and merge (top-down)
  - region growing (bottom-up)
- Edge detection: use a filter to identify an edge
[Figure: Laplacian and Sobel filters]
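Edge detection with a filter can be sketched as below, assuming the standard 3 by 3 Sobel kernels and a user-chosen threshold on the gradient magnitude. The helper names and the synthetic step image are assumptions for this illustration.

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Correlate an image with a 3x3 kernel (borders replicated)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_edges(image, threshold):
    """Mark pixels whose gradient magnitude exceeds the threshold."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy) > threshold

image = np.zeros((8, 8))
image[:, 4:] = 1.0  # vertical step edge between columns 3 and 4
edges = sobel_edges(image, threshold=1.0)
```

For this step image, only the two columns adjacent to the step are marked as edges. On real data the threshold is one of the parameters that typically needs tuning per image.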
29. Examples of some simple edge detection
[Figure panels: Original, Sobel, Canny]
30. More sophisticated techniques for object identification
- Combine traditional techniques with evolutionary algorithms to make them more adaptive (Bhanu/Lee 95, Cagnoni 97)
- Deformable models for segmentation
  - parametric approach: snakes or active contours (Kass et al. 87)
  - geometric approach: level set methods (Malladi and Sethian 96)
- Non-linear diffusion filters based on PDEs
  - smooth images while enhancing edges (Weickert et al. 98)
PDE-based techniques are gaining popularity - they are robust, but expensive.
31. Once the objects have been identified, the features must be extracted
- Features dependent on the problem
  - identifying relevant features
  - extracting robust features
  - extracting features invariant to scale, rotation, and translation
- Features may include
  - distances, angles, areas
  - histograms
  - Fourier or wavelet coefficients
  - various moments
  - ...
32. May need to reduce the dimension or the number of features
[Diagram: Raw Data -> Object Recognition and Feature Extraction -> table of data items by features -> Dimension Reduction -> table of data items by features -> Pattern Recognition -> Information]
33. There are several reasons why dimension reduction may be helpful
- Fewer features may make pattern recognition algorithms computationally tractable
- Less time is spent in extracting features
- Can minimize correlations between features, which may be a requirement of some algorithms (e.g. GLMs)
34. In the FIRST data, we need to reduce the 103 features for 3-entry sources
- Input from domain experts
- EDA techniques: parallel plots and box plots
- Wrapper approach
35. There are also more complex techniques for dimension reduction
- Principal component analysis
  - transform the features to be mutually uncorrelated
  - focus on directions that maximize the variance
36. Principal component analysis algorithm
- N data items in d dimensions
- find the d-dimensional mean vector
- obtain the d x d covariance matrix
- obtain the d eigenvalues and eigenvectors of the covariance matrix
- keep the k eigenvectors with the largest eigenvalues (k << d)
- project the (original data - mean) into the space spanned by these vectors
The eigenvectors or principal components (PCs) are mutually orthogonal, and the original data is a linear combination of these PCs.
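The steps listed above map directly onto a few lines of NumPy. The sketch below is a minimal illustration; the function name `pca`, the synthetic data, and the returned explained-variance fraction are assumptions added for the example.

```python
import numpy as np

def pca(data, k):
    """Project N x d data onto the k eigenvectors with the largest eigenvalues."""
    mean = data.mean(axis=0)                 # d-dimensional mean vector
    centered = data - mean                   # original data - mean
    cov = np.cov(centered, rowvar=False)     # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    components = eigvecs[:, order]
    projected = centered @ components
    explained = eigvals[order].sum() / eigvals.sum()
    return projected, components, explained

# Synthetic data: 5 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 5))
data = latent @ mixing + 0.01 * rng.standard_normal((200, 5))
projected, components, explained = pca(data, k=2)
```

Here `explained` is the fraction of total variance captured by the retained components, and the columns of `projected` are mutually uncorrelated.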
37. We applied PCA to the problem of bent-double classification
- The first 20 PCs explained about 90% of the variance
- Eliminate unimportant variables
  - eliminate the variable with the largest coefficient in the e-vector corresponding to the smallest e-value
  - repeat with the e-vector for the next smallest e-value
  - continue till left with 20 variables
Using the 31 features found through EDA and PCA lowers the error from 11.1% to 9.5%
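The variable-elimination procedure described above might be sketched as follows. Two points are assumptions rather than details given in the source: the eigen-decomposition is computed once on the covariance matrix (not recomputed after each elimination), and the largest coefficient is taken in absolute value among the variables not yet discarded. The function name `select_variables` is hypothetical.

```python
import numpy as np

def select_variables(data, n_keep):
    """Discard variables tied to the smallest-eigenvalue eigenvectors."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    remaining = list(range(data.shape[1]))
    for idx in range(data.shape[1] - n_keep):
        weights = np.abs(eigvecs[:, idx])    # e-vector for the next smallest e-value
        drop = max(remaining, key=lambda v: weights[v])
        remaining.remove(drop)
    return remaining

# Four variables, where the last is nearly a copy of the first (redundant).
rng = np.random.default_rng(0)
base = rng.standard_normal((500, 3))
copy = base[:, 0] + 1e-3 * rng.standard_normal(500)
data = np.column_stack([base, copy])
kept = select_variables(data, n_keep=3)
```

In this toy example one of the two nearly identical variables is discarded, since the eigenvector with the smallest eigenvalue is dominated by their difference.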
38. Need more appropriate techniques for dimension reduction
- PCA may not always be appropriate
  - linear
  - orthogonal
- Other options
  - independent component analysis
  - blind source separation
  - non-linear PCA
  - genetic algorithms
- Need incremental techniques which are applied as the data is being collected (Kargupta 00)
39. It is difficult to find labeled data in science and engineering applications
- Training set usually generated manually, not historically
- Not all scientists may agree on a label
- Labeled data vs. interesting data
- Often ground truth is unavailable, or difficult to find
- Approach to labeling may be ad-hoc
  - the yellow-sticky-pad approach to identifying bent doubles
[Figure: examples of a non-bent double and a bent double]
40. Sapphire experiences with a flexible system design for data mining
- We address the needs of a diverse set of applications
- Not all problems require the entire process
- Not all algorithms are suitable for a problem
- Algorithms typically depend on several parameters
- Intermediate data must be handled properly
- Domain dependent and independent parts must be clearly identified
- Should be able to accommodate a growing data set
41. The Sapphire approach: a flexible, portable, scalable system architecture
[Diagram: components linked by Python, with user input and feedback]
42. Other pointers that discuss system architecture issues
- Data mining specific projects
  - ADAM, JARTool, Diamond Eye
- Workshops of more general interest
  - mining scientific datasets (http://www.ahpcrc.umn.edu/conferences)
  - interfaces to scientific data archives (http://www.cacr.caltech.edu/SDA)
  - large scientific databases (http://www.cacr.caltech.edu/euus/)
  - issues in the application of data mining to scientific data (http://www.cs.uah.edu/NASA_Mining)
  - data fusion and data mining (http://ic-www.arc.nasa.edu/ic/data99-workshop)
43. Challenges in mining science and engineering data sets
- Feature extraction is non-trivial
- Labeled data is difficult to obtain
- Data can be high dimensional
- Need techniques to handle spatial and temporal aspects
- System infrastructure issues are important
- Data fusion and registration are required in some cases
- Data may be compressed
- May need to mine data as it is being generated
- ...
44. Acknowledgements
- The Sapphire project team: Erick Cantú-Paz, Imola K. Fodor, and Nu Ai Tang
- Sisira Weeratunga (LLNL) for insights on simulations and PDEs
- FIRST scientists: Bob Becker, Michael Gregg, Sally Laurent-Muehleisen, and Rick White
http://www.llnl.gov/casc/sapphire
45. Chapter 7 - References
- Credits for images used in Chapter 7 (if not provided with the image)
  - MACHO web page: http://wwwmacho.anu.edu.au/
  - 3D meshes: http://www.llnl.gov/casc/overture
  - Structured and unstructured mesh around the front of an aircraft: http://www.nas.nasa.gov/Pubs/Docs/FAST/chp_16.serferu.html
  - 3D unstructured mesh with heterogeneous elements: http://cox.iwr.uni-heidelbeg.de/ug/Images/benchmark_grid.gif
  - Composite grid: SAMRAI project - http://www.llnl.gov/casc/samrai
  - Wavelet images generated by Sapphire software - http://www.llnl.gov/casc/sapphire
46. Chapter 7 - References
- Burl, M., L. Asker, P. Smyth, U. Fayyad, P. Perona, L. Crumpler, and J. Aubele, Learning to recognize volcanoes on Venus, Machine Learning, Volume 30, pages 165-195, 1998.
- Langley, P. and H. A. Simon, Applications of machine learning and rule induction, Communications of the ACM, Volume 38, Number 11, pages 55-64, 1995.
- Brodley, C. and P. Smyth, The process of applying machine learning algorithms, Workshop on applying machine learning in practice, IMLC 1995 (http://citeseer.nj.nec.com/722.html)
- Kamath, C., E. Cantú-Paz, I. K. Fodor, and N. Tang, Searching for bent-double galaxies in the FIRST survey, in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, W. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Kluwer 2001.
47. Chapter 7 - References
- Brown, L., A Survey of Image Registration Techniques, ACM Computing Surveys, Volume 24, Number 4, December 1992.
- Le Moigne, J., Parallel Registration of Multi-sensor remotely sensed imagery using wavelet coefficients, Proc. SPIE Wavelet Applications Conference, Orlando, 1994, pages 423-443.
- Mandava, V., J. Fitzpatrick, and D. Pickens, Adaptive search space scaling in digital image registration, IEEE Transactions on Medical Imaging, Volume 8, pages 251-262, 1989.
- Thevenaz, P., U. Ruttimann, and M. Unser, A Pyramid Approach to Sub-pixel Registration based on intensity, IEEE Transactions on Image Processing, Volume 7, Number 1, January 1998.
- Fodor, I. K. and C. Kamath, On denoising images using wavelet-based statistical techniques, submitted for publication. See the Sapphire web page for details.
48. Chapter 7 - References
- Fodor, I. K. and C. Kamath, The role of multi-resolution in mining massive image datasets, Proceedings of the YES2000 Symposium on Advanced Multiscale and Multi-resolution Methods, Lecture Notes in Computational Science and Engineering, Springer-Verlag, 2001.
- Machiraju, R., J. Fowler, D. Thompson, W. Schroeder, and B. Soni, EVITA - A Prototype System for Efficient Visualization and Interrogation of Terascale Datasets, to appear in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, W. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Kluwer 2001.
- Ma, W. Y. and B. S. Manjunath, A comparison of wavelet transform features for texture image annotation, Proc. Second International Conference on Image Processing, ICIP 95, pages 256-259.
49. Chapter 7 - References
- Candes, E. and D. Donoho, Curvelets, multiresolution representation, and scaling laws, Proc. Wavelet Applications in Signal and Image Processing VIII, SPIE 2000, vol. 4119.
- Bhanu, B. and S. Lee, Adaptive image segmentation using a genetic algorithm, IEEE Transactions on Systems, Man, and Cybernetics, Volume 25, pages 1543-1567, 1995.
- Cagnoni, S., A. Dobrzeniecki, R. Poli, and J. Yanch, Segmentation of 3D medical images through genetically optimized contour tracking algorithms, U. Birmingham School of Computer Science Technical Report, CSRP-97-28, 1997.
- Kass, M., A. Witkin, and D. Terzopoulos, Snakes: active contour models, Intl J. Computer Vision, Volume 1, No. 4, pages 321-331, 1987.
50. Chapter 7 - References
- Malladi, R. and J. Sethian, A unified approach to noise removal, image enhancement, and shape recovery, IEEE Transactions on Image Processing, Volume 5, 1996, pages 1154-1168.
- Weickert, J., B. ter Haar Romeny, and M. Viergever, Efficient and Reliable Schemes for Nonlinear Diffusion Filtering, IEEE Transactions on Image Processing, Volume 7, Number 3, March 1998.
- Jolliffe, I., Principal Component Analysis, Springer Verlag, 1986.
- Kargupta, H., W. Huang, S. Krishnamoorthy, and E. Johnson, Distributed clustering using collective principal component analysis, ACM SigKDD Workshop on Distributed and Parallel Knowledge Discovery, 2000.