Title: Mining Earth Science Information
1 Mining Earth Science Information
- Group Leaders
- Seokkyung Chung
- Jongeun Jun (JJ)
- Group Members
- Weicheng Chu
- Jeff Wei-shinn Ku
- Domnic Poravanthattil
- Haojun Wang
2 Introduction
- Present data mining toolkits to discover interesting patterns in earth science datasets
- Enable scientists to adapt to new earth science phenomena
- Complement the existing analysis methods in earth science
3 Why Data Preprocessing?
- What do we have?
- 2429 weather stations
- Each station has data recorded on a monthly basis, ranging from Jan. 1987 to Dec. 2002
- Each station should have 192 values (16 x 12 = 192)
- But not every station has all 192 values
1. Data Preprocessing 2. Segmentation
3. Feature Extraction 4. Clustering
4 Data Preprocessing (cont.)
- We cannot simply omit the stations that have missing values
- Only 148 (6%) of the stations have complete data
- The result might not be representative
- Our method
- Omit the stations that do not have enough values
- For the stations with incomplete data, substitute an interpolated value for each missing value
5 Data Preprocessing (cont.)
- Step 1
- Omit the stations with fewer than 144 of the 192 expected values
6 Data Preprocessing
- Step 2 (Handling missing values)
- For a single missing value
- Use the preceding and the succeeding values to generate the interpolated value
- For consecutive missing values
- Use the values in the previous year and the next year to generate the interpolated value
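Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the group's actual code: it assumes missing readings are stored as `None`, that "interpolated value" means the plain average of the two reference values, and that "previous year" and "next year" mean the same month 12 positions away. The function names `has_enough_values` and `fill_missing` are hypothetical.

```python
def has_enough_values(values, minimum=144):
    """Step 1: keep a station only if it has at least `minimum` recorded values."""
    return sum(v is not None for v in values) >= minimum


def fill_missing(values, period=12):
    """Step 2: return a copy of `values` with None entries interpolated where possible.

    Assumption: the interpolated value is the average of the reference values.
    """
    n = len(values)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        has_prev = i > 0 and values[i - 1] is not None
        has_next = i + 1 < n and values[i + 1] is not None
        if has_prev and has_next:
            # Single missing value: use the preceding and succeeding months.
            filled[i] = (values[i - 1] + values[i + 1]) / 2
        else:
            # Part of a run of missing values: use the same month in the
            # previous and next year, whichever of the two are available.
            same_month = [values[j] for j in (i - period, i + period)
                          if 0 <= j < n and values[j] is not None]
            if same_month:
                filled[i] = sum(same_month) / len(same_month)
    return filled
```

A value stays `None` when neither rule has a reference value to work from, so the caller can see which gaps could not be filled.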
7 Segmentation
- Purpose
- Raw time series data cannot be handled efficiently for data mining
- The time series is segmented into non-overlapping, internally homogeneous data sequences (segments)
- Approach
- Change point detection
- Approximation of the data region between adjacent change points by piecewise linear segments
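The two-stage approach above can be sketched as follows. The slides do not state the change point criterion, so this sketch assumes a simple one: a point is a change point when the step-to-step slope changes by more than a threshold. Each region is then approximated by the straight line joining its two end points. The names `detect_change_points` and `linear_segments` are hypothetical.

```python
def detect_change_points(series, threshold):
    """Return indices of change points, always including both end points.

    Assumed criterion: the slope after point i differs from the slope
    before it by more than `threshold`. Requires len(series) >= 2.
    """
    points = [0]
    for i in range(1, len(series) - 1):
        slope_before = series[i] - series[i - 1]
        slope_after = series[i + 1] - series[i]
        if abs(slope_after - slope_before) > threshold:
            points.append(i)
    points.append(len(series) - 1)
    return points


def linear_segments(series, points):
    """Approximate each region between adjacent change points by a line.

    Returns (start_index, end_index, slope) tuples; neighbouring segments
    share only their boundary change point.
    """
    segments = []
    for start, end in zip(points, points[1:]):
        slope = (series[end] - series[start]) / (end - start)
        segments.append((start, end, slope))
    return segments
```

For example, a series that rises, falls, and then stays flat yields three segments with positive, negative, and zero slope.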
8 Change Point Detection Results
- [Figure: original data compared with the data after change point detection]
9 Windowing
- Motivation: to eliminate noise inherent in the data or introduced by interpolation
10 Observations
- Time series data sets vary in their shape, data values, variation in the data, etc.
- The factors that are affected
- The definition of a change point for the data set
- Thresholds (should be sensitive to the particular data set)
- Approximation of data regions
11 Observations (cont.)
- Windowing
- Best applied to time series data sets that are more susceptible to noise, e.g., sensor data streams
- Again, it should be tailored to the data set under consideration: window size, threshold for change within the window
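The slides name two windowing parameters, window size and threshold for change within the window, but not the exact rule. One plausible reading, sketched here as an assumption, is to compare the mean of the window before a point with the mean of the window starting at it, and report a change point only when the two means differ by more than the threshold. The function name `windowed_change_points` is hypothetical.

```python
def windowed_change_points(series, window, threshold):
    """Return indices where the mean level shifts by more than `threshold`.

    Assumed rule: compare the mean of the `window` values before index i
    with the mean of the `window` values starting at i. Averaging over a
    window damps isolated spikes, such as noise or interpolation artifacts.
    """
    points = []
    for i in range(window, len(series) - window + 1):
        mean_before = sum(series[i - window:i]) / window
        mean_after = sum(series[i:i + window]) / window
        if abs(mean_after - mean_before) > threshold:
            points.append(i)
    return points
```

With this rule a one-sample spike is averaged away, while a sustained level shift is still reported. A larger window suppresses more noise but locates the change less precisely, which is why both parameters need tuning per data set.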
12 Feature Extraction
- Indexing is needed for raw data to support efficient retrieval and matching of time series
- Dimensionality Reduction
- Completeness and effectiveness
- Noise Reduction
- Discrete Fourier Transform (DFT) and Discrete Wavelet Transform (DWT) are used
13 Discrete Fourier Transform
- One of the most commonly used transforms for analyzing the components of a stationary signal
- Powerful for processing signals composed of some combination of sine and cosine signals
- Less useful for analyzing signals with no repetition
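As a small illustration of why the DFT suits periodic signals, the sketch below computes the transform directly from its definition and keeps the magnitudes of the first few coefficients as a reduced feature vector. Using the leading magnitudes as features is an assumption made for illustration; the slides do not say which coefficients were kept. The names `dft` and `dft_features` are hypothetical.

```python
import cmath


def dft(series):
    """Discrete Fourier Transform computed from the definition, O(n^2)."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(series))
        for k in range(n)
    ]


def dft_features(series, k):
    """Magnitudes of the first k DFT coefficients (assumed feature choice)."""
    return [abs(c) for c in dft(series)[:k]]
```

For a pure cosine with a 12-month period sampled over 24 months, nearly all of the energy lands in the single coefficient for two cycles per record, so a handful of coefficients describes the whole series.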
14 Discrete Wavelet Transform
- The wavelet transform overcomes the drawback of the DFT
- Provides a multi-resolution representation of signals
- The number of data elements must be a power of two (e.g., 256 elements = 2^8), so data pre-processing might be needed
- Many wavelets exist; we use the Haar wavelet
15 What we have done
- Applied DFT and DWT to the pre-processed raw data. Each station has 192 data elements; for DWT processing we appended elements with value 0 to bring the total to 256
- Applied DFT to the segmented raw data
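The padding and Haar transform steps can be sketched as follows. This is an illustrative version only: it assumes the zeros are appended at the end of the series and uses the orthonormal Haar convention (division by the square root of 2), neither of which the slides specify. The names `pad_to_power_of_two` and `haar_dwt` are hypothetical.

```python
import math


def pad_to_power_of_two(series, fill=0.0):
    """Append `fill` values until the length is a power of two (192 -> 256)."""
    size = 1
    while size < len(series):
        size *= 2
    return list(series) + [fill] * (size - len(series))


def haar_dwt(series):
    """Full Haar decomposition of a series whose length is a power of two.

    Output layout: overall approximation first, then detail coefficients
    from the coarsest level to the finest.
    """
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(series)
    length = n
    root2 = math.sqrt(2)
    while length > 1:
        half = length // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = averages + details
        length = half
    return out
```

Note that zero padding introduces an artificial jump at position 192 unless the series happens to end near zero, which shows up in the detail coefficients.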
16 Clustering
- Purpose
- Clustering results let us compare raw data, DFT data, DWT data, and segmentation data
- Clustering results can reveal the characteristics of each data set
- Approach
- K-means clustering algorithm
- Heuristic strategies for finding center points
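A minimal k-means sketch is given below. The slides do not say which heuristic was used to find the initial center points, so a farthest-first choice is used here purely as a stand-in. Each point is a feature vector, for example raw values or DFT/DWT coefficients. The names `sq_dist`, `farthest_first_centers`, and `kmeans` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Stand-in heuristic: start at the first point, then repeatedly pick
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and center update steps."""
    centers = [list(c) for c in farthest_first_centers(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Running the same function on raw, DFT, and DWT feature vectors gives label lists that can be compared station by station.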
17 Raw Data Clustering Graph
18 DFT Clustering Graph
19 DWT Clustering Graph
20 Clustering Visualization
21 Segmentation Data Clustering
- Change Point Detection
- Numbering each segment
- Clustering segment pieces
- Results
- We obtained some preliminary results
- The differences between clusters can be compared by the slope and length of their segments
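One way to make the slope and length comparison concrete is to summarize each cluster by the mean slope and mean length of its segments. The sketch below assumes each segment is already reduced to a `(slope, length)` pair and that a cluster label per segment is available; the function name `cluster_summary` is hypothetical.

```python
from collections import defaultdict


def cluster_summary(segments, labels):
    """Map each cluster label to (mean slope, mean length) of its segments.

    `segments` is a list of (slope, length) pairs and `labels` gives the
    cluster assigned to each segment, in the same order.
    """
    groups = defaultdict(list)
    for segment, label in zip(segments, labels):
        groups[label].append(segment)
    return {
        label: (
            sum(slope for slope, _ in members) / len(members),
            sum(length for _, length in members) / len(members),
        )
        for label, members in groups.items()
    }
```

Clusters whose summaries differ mainly in slope separate rising from falling trends, while differences in length separate short-lived from long-lived trends.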
22 Conclusions and Future Work
- Current achievement
- Raw data preprocessing
- Raw data segmentation/windowing
- Feature selection using DFT and DWT transformations
- Clustering of raw data, DFT data, DWT data, and segmented data (DFT/DWT), using CPD and CPD with windowing
- Future work
- Clustered data symbolization
- Novelty detection
23 System Architecture
- Data Interpolation
- Segmentation
- Feature Extraction
- Clustering
- Symbolization
- A Set of Symbolized Sequences