Mining Earth Science Information - PowerPoint PPT Presentation

Slides: 24

Provided by: seokkyu

1
Mining Earth Science Information

Group Leaders
Seokkyung Chung
Jongeun Jun (JJ)
Group Members
Weicheng Chu
Jeff Wei-shinn Ku
Domnic Poravanthattil
Haojun Wang

2
Introduction

Present data mining toolkits to discover
interesting patterns from earth science datasets
Enable scientists to adapt to new earth science
phenomena
Complement the existing analysis methods in earth
science

3
Why Data Preprocessing?

What do we have?
2429 weather stations
Each station has data recorded on a monthly basis
ranged from Jan. 1987 to Dec. 2002
Each station should have 192 values (16 x 12 =
192)
But

1. Data Preprocessing 2. Segmentation
3. Feature Extraction 4. Clustering
4
Data Preprocessing (cont.)

We cannot simply omit those stations which have
missing values
Only 148 stations (6%) have complete data
The result might not be representative
Our method
Omit those stations which do not have enough
values
For the stations with incomplete data, use
interpolated value to substitute the missing
value

5
Data Preprocessing (cont.)

Step 1
Omit those stations with fewer than 144 values
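
The filtering step can be sketched as follows. This is a minimal illustration, assuming each station is stored as a list of 192 monthly values with None marking a missing month; the station names and data are hypothetical, and only the threshold of 144 comes from the slide.

```python
from typing import Dict, List, Optional

EXPECTED_VALUES = 16 * 12  # 192 monthly values, Jan. 1987 to Dec. 2002
MIN_VALUES = 144           # threshold stated on the slide


def keep_stations(
    stations: Dict[str, List[Optional[float]]],
    min_values: int = MIN_VALUES,
) -> Dict[str, List[Optional[float]]]:
    """Keep only stations with at least `min_values` recorded (non-None) values."""
    return {
        name: series
        for name, series in stations.items()
        if sum(v is not None for v in series) >= min_values
    }


# Hypothetical example data, not taken from the real dataset.
stations = {
    "A": [1.0] * 192,                 # complete
    "B": [1.0] * 144 + [None] * 48,   # exactly at the threshold, kept
    "C": [1.0] * 143 + [None] * 49,   # one value short, dropped
}
kept = keep_stations(stations)
print(sorted(kept))
```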

6
Data Preprocessing

Step 2 (Handling missing values)
For a single missing value
Use the preceding and the succeeding values to
generate the interpolated value
For consecutive missing values
Use the values from the same month in the
previous year and the next year to generate the
interpolated value
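
A minimal sketch of the two interpolation rules, under stated assumptions: the slide does not say how the neighbouring values are combined, so a plain average is assumed here, and "previous year" and "next year" are taken to mean the same month 12 positions earlier and later. The function name `fill_missing` and the sample series are hypothetical.

```python
from typing import List, Optional


def fill_missing(series: List[Optional[float]], period: int = 12) -> List[Optional[float]]:
    """Fill gaps marked as None; a gap that cannot be filled stays None."""
    n = len(series)
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        prev_ok = i > 0 and series[i - 1] is not None
        next_ok = i + 1 < n and series[i + 1] is not None
        if prev_ok and next_ok:
            # Isolated gap: average of the preceding and succeeding months.
            filled[i] = (series[i - 1] + series[i + 1]) / 2
            continue
        # Run of gaps (or a gap at the edge): average whatever is available
        # from the same month one year earlier and one year later.
        candidates = [
            series[j]
            for j in (i - period, i + period)
            if 0 <= j < n and series[j] is not None
        ]
        if candidates:
            filled[i] = sum(candidates) / len(candidates)
    return filled


# Hypothetical 3-year series with one isolated gap and one run of two gaps.
series: List[Optional[float]] = [float(i) for i in range(36)]
series[5] = None
series[20] = None
series[21] = None
filled = fill_missing(series)
print(filled[5], filled[20], filled[21])
```

Only original (non-interpolated) values are used as inputs, so an interpolated value never feeds into another one.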

7
Segmentation

Purpose
Raw time series data cannot be efficiently
handled for data mining
Time series is segmented into non-overlapping,
internally homogeneous data sequences (segments)
Approach
Change point detection
Approximation of the data region between adjacent
change points by piecewise linear segments
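
The slides do not specify how change points are detected, so the following is only one possible sketch: a point is treated as a change point when the slope on its right differs from the slope on its left by more than a threshold. The functions `change_points` and `piecewise_linear`, the threshold value, and the sample series are all assumptions made for illustration.

```python
from typing import List


def change_points(series: List[float], threshold: float = 0.5) -> List[int]:
    """Indices where the local slope changes by more than `threshold`.

    The first and last indices are always included so that the segments
    cover the whole series. Assumes len(series) >= 2.
    """
    points = [0]
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]
        right = series[i + 1] - series[i]
        if abs(right - left) > threshold:
            points.append(i)
    points.append(len(series) - 1)
    return points


def piecewise_linear(series: List[float], points: List[int]) -> List[float]:
    """Replace the data between adjacent change points by a straight line."""
    approx: List[float] = []
    for start, end in zip(points, points[1:]):
        slope = (series[end] - series[start]) / (end - start)
        approx.extend(series[start] + slope * k for k in range(end - start))
    approx.append(series[points[-1]])
    return approx


# Hypothetical series: rises, falls, then stays flat.
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
points = change_points(series)
print(points)
print(piecewise_linear(series, points))
```

The resulting segments are non-overlapping, and each one is described by just its two endpoints.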

8
Change Point Detection Results

[Figure: Original Data and Data after Change Point Detection]

9
Windowing

Motivation: to eliminate noise inherent in the
data or introduced by interpolation

10
Observations

Time series data sets vary in their shape, data
values, variation in the data, etc.
The factors that are affected:
The definition of a change point for the data set
Thresholds (should be sensitive to the particular
data set)
Approximation of data regions

11
Observations (cont.)

Windowing
Best applied to time series data sets that are
more susceptible to noise, e.g., sensor data
streams
Again, it should be tailored to the data set
being considered: window size, threshold for
change within a window
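
The slides name the two tuning parameters (window size and change threshold) but not the exact procedure. One possible reading, sketched below as an assumption, is to average the data over fixed-size, non-overlapping windows and report a change only when consecutive window means differ by more than the threshold, so that single noisy readings are ignored. The function name and sample data are hypothetical.

```python
from typing import List


def windowed_change_points(
    series: List[float], window: int = 3, threshold: float = 1.0
) -> List[int]:
    """Start indices of windows whose mean differs from the previous
    window's mean by more than `threshold`."""
    means = []
    for start in range(0, len(series), window):
        chunk = series[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return [
        i * window
        for i in range(1, len(means))
        if abs(means[i] - means[i - 1]) > threshold
    ]


# Hypothetical noisy series with one real level shift at index 6.
series = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
print(windowed_change_points(series, window=3, threshold=1.0))
```

A larger window suppresses more noise but locates the change less precisely, which is why both parameters need tuning per data set.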

12
Feature Extraction

Indexing is needed for raw data to support
efficient retrieval and matching of time series.
Dimensionality Reduction
Completeness and effectiveness
Noise Reduction
Discrete Fourier Transform (DFT) and Discrete
Wavelet Transform (DWT) are used

13
Discrete Fourier Transform

One of the most commonly used transforms for
analyzing the frequency components of a
stationary signal
Powerful for processing signals composed of some
combination of sine and cosine signals
Less useful for analyzing signals with no
repetition
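
To illustrate why the DFT suits periodic data, the sketch below computes the transform directly from its definition and keeps the magnitudes of the first few coefficients as features. Keeping the first k coefficients is an assumption for illustration; the slides do not state which coefficients were retained. The test signal is synthetic.

```python
import cmath
import math
from typing import List


def dft(series: List[float]) -> List[complex]:
    """Discrete Fourier Transform computed from the definition, O(n^2)."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def dft_features(series: List[float], k: int = 4) -> List[float]:
    """Magnitudes of the first k DFT coefficients (hypothetical feature choice)."""
    return [abs(c) for c in dft(series)[:k]]


# Synthetic signal: two years of a pure annual cosine, sampled monthly.
signal = [math.cos(2 * math.pi * t / 12) for t in range(24)]
features = dft_features(signal, k=4)
print([round(f, 6) for f in features])
```

For this signal all of the energy in the retained coefficients sits at index 2 (two cycles in 24 samples), so a 24-value series is summarized by a single dominant number. A signal with no repetition would spread its energy across many coefficients.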

14
Discrete Wavelet Transform

Wavelet transform overcomes the drawback of DFT
Provides a multi-resolution representation of
signals
The number of data elements must be a power of
two (e.g., 256 elements = 2^8), thus data
pre-processing might be needed
Many wavelets exist; we use the Haar wavelet
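
A minimal sketch of a full Haar decomposition, assuming the orthonormal form in which each pair (a, b) is replaced by (a + b)/sqrt(2) and (a - b)/sqrt(2). The slides do not say which normalization or how many levels were used, so both are assumptions here.

```python
import math
from typing import List


def haar_dwt(series: List[float]) -> List[float]:
    """Full orthonormal Haar decomposition.

    Returns [overall approximation, coarsest detail, ..., finest details].
    """
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(series)
    details: List[float] = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = level_detail + details  # coarser levels go in front
    return approx + details


coeffs = haar_dwt([4.0, 6.0, 10.0, 12.0])
print([round(c, 4) for c in coeffs])
```

Each pass halves the resolution, which is where the multi-resolution representation comes from and why the input length has to be a power of two.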

15
What we have done

Applied DFT and DWT to the pre-processed raw data
Each station has 192 data elements, so we padded
each series with zero-valued elements to bring
the total to 256 for DWT processing
Applied DFT to the segmented raw data
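
The padding step can be sketched as below. Appending the zeros at the end of the series is an assumption; the slide only says that zero-valued elements were added to reach 256. The function name is hypothetical.

```python
from typing import List


def pad_to_power_of_two(series: List[float], fill: float = 0.0) -> List[float]:
    """Append `fill` values until the length is the next power of two."""
    target = 1
    while target < len(series):
        target *= 2
    return list(series) + [fill] * (target - len(series))


# Hypothetical station series with the 192 monthly values described above.
station = [1.0] * 192
padded = pad_to_power_of_two(station)
print(len(padded))
```

Note that padding with zeros can introduce an artificial jump at the boundary between real data and padding when the data are not close to zero there, which may show up as large wavelet detail coefficients.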

16
Clustering

Purpose
Clustering results let us compare raw data, DFT
data, DWT data, and segmentation data
Characteristics of each data set can be
discovered from the clustering results
Approach
K-means clustering algorithm
Heuristic strategies for finding center points
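
A minimal k-means sketch in plain Python. The slides do not say which heuristic was used to find center points, so a farthest-point initialization is swapped in here purely as an example of such a heuristic; the function names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points: List[List[float]], k: int) -> List[List[float]]:
    """Assumed heuristic: start from the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(
    points: List[List[float]], k: int, iterations: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Alternate assignment and center update until labels stop changing."""
    centers = farthest_point_init(points, k)
    labels: List[int] = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical feature vectors forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
print(labels)
```

The same routine can be run on raw values, DFT coefficients, DWT coefficients, or segment features, since it only needs fixed-length numeric vectors.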

17
Raw Data Clustering Graph

18
DFT Clustering Graph

19
DWT Clustering Graph

20
Clustering Visualization

21
Segmentation Data Clustering

Change Point Detection
Numbering each segment
Clustering segment pieces
Results
We obtained some preliminary results
The differences between clusters can be compared
using the slope and length of their segments
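
The slope and length used to compare clusters can be derived directly from the change points. The sketch below assumes a segment is the straight line between two adjacent change points; the function name and sample data are hypothetical.

```python
from typing import List, Tuple


def segment_features(series: List[float], points: List[int]) -> List[Tuple[float, int]]:
    """Return (slope, length) for each segment between adjacent change points."""
    return [
        ((series[end] - series[start]) / (end - start), end - start)
        for start, end in zip(points, points[1:])
    ]


# Hypothetical series and change points: rise, fall, flat.
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
points = [0, 3, 6, 8]
features = segment_features(series, points)
print(features)
```

Each (slope, length) pair is a two-dimensional feature vector, so the segments can be numbered and passed to a clustering algorithm.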

22
Conclusions and Future Work

Current achievements
Raw data preprocessing
Raw data segmentation/windowing
Feature extraction using DFT and DWT
transformations
Clustering of raw data, DFT data, DWT data, and
segmented data (DFT/DWT), using CPD and CPD
with windowing
Future work
Clustered data symbolization
Novelty detection