Mining Earth Science Information - PowerPoint PPT Presentation

Provided by: seokkyu
Transcript and Presenter's Notes

Title: Mining Earth Science Information


1
Mining Earth Science Information
  • Group Leaders
  • Seokkyung Chung
  • Jongeun Jun (JJ)
  • Group Members
  • Weicheng Chu
  • Jeff Wei-shinn Ku
  • Domnic Poravanthattil
  • Haojun Wang

2
Introduction
  • Present data mining toolkits to discover
    interesting patterns from earth science datasets
  • Enable scientists to adapt to new earth science
    phenomena
  • Complement the existing analysis methods in earth
    science

3
Why Data Preprocessing?
  • What do we have?
  • 2429 weather stations
  • Each station has data recorded on a monthly basis,
    ranging from Jan. 1987 to Dec. 2002
  • Each station should have 192 values (16 years x
    12 months = 192)
  • But

1. Data Preprocessing 2. Segmentation
3. Feature Extraction 4. Clustering
4
Data Preprocessing (cont.)
  • We cannot simply omit those stations which have
    missing values
  • Only 148 (6%) stations have complete data
  • The result might not be representative
  • Our method
  • Omit those stations which do not have enough
    values
  • For the stations with incomplete data, substitute
    an interpolated value for each missing value


5
Data Preprocessing (cont.)
  • Step 1
  • Omit those stations with fewer than 144 values
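
The filtering step can be sketched as follows. This is a minimal illustration, not the group's actual code: it assumes each station is a list of 192 monthly readings with `None` marking a missing month, and the names `MIN_VALUES`, `keep_station`, and `filter_stations` are hypothetical.

```python
MIN_VALUES = 144  # threshold from the slide (144 of 192 months)


def keep_station(series):
    """Return True if the series has at least MIN_VALUES recorded values."""
    present = sum(1 for value in series if value is not None)
    return present >= MIN_VALUES


def filter_stations(stations):
    """Keep only the stations that pass the threshold.

    `stations` maps a station id to its list of monthly values,
    where None marks a missing month (an assumed representation).
    """
    return {sid: s for sid, s in stations.items() if keep_station(s)}


# Usage: station "b" has only 143 recorded values and is dropped.
stations = {
    "a": [1.0] * 192,
    "b": [1.0] * 143 + [None] * 49,
    "c": [1.0] * 144 + [None] * 48,
}
kept = filter_stations(stations)
```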

6
Data Preprocessing
  • Step 2 (Handling missing values)
  • For a single missing value
  • Use the preceding and the succeeding values to
    generate the interpolated value
  • For consecutive missing values
  • Use the values in the previous year and the next
    year to generate the interpolated value
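
One way to read the two rules above is sketched below. The slides do not say how the neighboring values are combined, so this sketch assumes a plain average, assumes `None` marks a missing month, and falls back to whichever neighbor exists when only one is available. The function name `fill_missing` and the `period` parameter are hypothetical.

```python
def fill_missing(series, period=12):
    """Return a copy of a monthly series with gaps (None) filled.

    A gap of length 1 uses the average of the preceding and succeeding
    values. A longer gap uses, for each month, the average of the same
    month in the previous and the next year. Reference values are always
    taken from the original series, never from already filled entries.
    """
    n = len(series)
    out = list(series)
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        j = i
        while j < n and series[j] is None:
            j += 1  # j ends one past the last missing index of this gap
        step = 1 if j - i == 1 else period
        for k in range(i, j):
            refs = [series[m] for m in (k - step, k + step)
                    if 0 <= m < n and series[m] is not None]
            if refs:  # leave the gap unfilled if no reference exists
                out[k] = sum(refs) / len(refs)
        i = j
    return out


# Usage on a 36-month ramp with one single gap and one 3-month gap.
data = [float(v) for v in range(36)]
data[5] = None
data[14] = data[15] = data[16] = None
filled = fill_missing(data)
```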

7
Segmentation
  • Purpose
  • Raw time series data cannot be efficiently
    handled for data mining
  • Time series is segmented into non-overlapping,
    internally homogeneous data sequences (segments)
  • Approach
  • Change point detection
  • Approximation of the data region between adjacent
    change points by piecewise linear segments
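
The slides do not give the change point criterion, so the sketch below uses an assumed one: a point is a change point when the slope entering it and the slope leaving it differ by more than a threshold. Each region between adjacent change points is then approximated by the straight line joining its endpoints. The names `change_points` and `linear_segments` are hypothetical.

```python
def change_points(series, threshold):
    """Indices where the local slope changes by more than `threshold`.

    The first and last indices are always included so that the
    segments cover the whole series.
    """
    cps = [0]
    for i in range(1, len(series) - 1):
        slope_in = series[i] - series[i - 1]
        slope_out = series[i + 1] - series[i]
        if abs(slope_out - slope_in) > threshold:
            cps.append(i)
    cps.append(len(series) - 1)
    return cps


def linear_segments(series, cps):
    """Approximate each region between adjacent change points by a line.

    Returns (start, end, slope) tuples; adjacent segments share only
    their boundary point.
    """
    return [(a, b, (series[b] - series[a]) / (b - a))
            for a, b in zip(cps, cps[1:])]


# Usage: a rise followed by a fall gives one interior change point.
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
cps = change_points(series, threshold=0.5)
segments = linear_segments(series, cps)
```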

8
Change Point Detection Results
  • Data after Change Point Detection

Original Data
9
Windowing
  • Motivation: to eliminate noise inherent in the
    data or introduced by interpolation

10
Observations
  • Time series data sets vary in their shape, data
    values, variations in data, etc.
  • The factors that are affected
  • The definition of a change point for the data set
  • Thresholds (should be sensitive to the particular
    data set)
  • Approximation of data regions

11
Observations (cont.)
  • Windowing
  • Best suited to time series data sets that are
    more susceptible to noise, e.g., sensor data
    streams
  • Again, it should be tailored to the data set
    being considered: window size, threshold for
    change within window
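
The slides name the two tuning parameters (window size and threshold for change within a window) but not the exact procedure. The sketch below is one plausible reading, offered as an assumption: smooth the series with a centered moving average, then flag the non-overlapping windows whose value range exceeds the threshold. The function names are hypothetical.

```python
def moving_average(series, window):
    """Centered moving average; windows are truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - half):min(len(series), i + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out


def windows_with_change(series, window, threshold):
    """Start indices of non-overlapping windows whose range
    (max minus min) exceeds `threshold`."""
    flagged = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        if max(chunk) - min(chunk) > threshold:
            flagged.append(start)
    return flagged


# Usage: a single spike is spread out by smoothing, and only the
# window containing the level shift is flagged.
smoothed = moving_average([0.0, 0.0, 9.0, 0.0, 0.0], window=3)
flagged = windows_with_change(
    [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0], window=3, threshold=2.0)
```

Both parameters would need tuning per data set, as the slide notes: a larger window suppresses more noise but blurs short-lived changes.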

12
Feature Extraction
  • Indexing is needed for raw data to support
    efficient retrieval and matching of time series.
  • Dimensionality Reduction
  • Completeness and effectiveness
  • Noise Reduction
  • Discrete Fourier Transform (DFT) and Discrete
    Wavelet Transform (DWT) are used

13
Discrete Fourier Transform
  • One of the most commonly used transforms for
    analyzing the components of a stationary signal
  • Powerful for processing signals composed of some
    combination of sine and cosine signals
  • Less useful for analyzing signals with no
    repetition
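
To illustrate why the DFT suits periodic signals, the sketch below computes the transform directly from its definition and keeps the magnitudes of the first few coefficients as features. Keeping only the leading coefficients is a common dimensionality reduction choice and is an assumption here; the slides do not say how many coefficients were retained. The names `dft` and `dft_features` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier Transform computed from the definition, O(n^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def dft_features(x, k):
    """Magnitudes of the first k DFT coefficients."""
    return [abs(c) for c in dft(x)[:k]]


# Usage: a pure cosine with 2 cycles over 16 samples puts all of its
# energy into coefficient 2 (and its mirror image, not kept here).
signal = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
mags = dft_features(signal, 4)
```

For a 192-point series the quadratic cost is negligible; a fast Fourier transform would be the usual choice for longer inputs.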

14
Discrete Wavelet Transform
  • The wavelet transform overcomes the drawback of
    DFT
  • Provides a multi-resolution representation of
    signals
  • The number of data elements must be a power of
    two (e.g., 256 elements = 2^8), so data
    pre-processing might be needed
  • Many wavelets exist; we use the Haar wavelet
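
A minimal sketch of the Haar transform follows, showing where the power-of-two requirement comes from: each level pairs up neighboring elements, so the length must halve cleanly down to 1. The orthonormal scaling (division by the square root of 2) is an assumption; the slides do not state which normalization was used, and `haar_dwt` is a hypothetical name.

```python
import math


def haar_dwt(x):
    """Full Haar wavelet decomposition of a power-of-two-length list.

    Output layout: [overall approximation, coarsest detail, ...,
    finest details]. Uses the orthonormal scaling, so the sum of
    squares of the output equals that of the input.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in x]
    length = n
    root2 = math.sqrt(2.0)
    while length > 1:
        half = length // 2
        approx = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = approx + detail  # details stay fixed from now on
        length = half
    return out


# Usage: a constant signal has no detail at any resolution.
flat = haar_dwt([1.0, 1.0, 1.0, 1.0])
coeffs = haar_dwt([4.0, 2.0, 5.0, 5.0])
```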

15
What we have done
  • Applied DFT and DWT to the pre-processed raw data.
    Each station has 192 data elements, so we added
    data elements with value 0 to bring the total to
    256 for DWT processing
  • Applied DFT to the segmented raw data
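
The padding step amounts to the following sketch. It assumes the zeros are appended at the end of the series, which the slide does not specify; `pad_to_power_of_two` is a hypothetical helper name.

```python
def pad_to_power_of_two(x, fill=0.0):
    """Append `fill` values until the length is the next power of two."""
    target = 1
    while target < len(x):
        target *= 2
    return list(x) + [fill] * (target - len(x))


# Usage: 192 monthly values become 256, with 64 trailing zeros.
padded = pad_to_power_of_two([1.0] * 192)
```

Zero padding introduces an artificial jump at the boundary between real data and padding, which can show up in the detail coefficients.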

16
Clustering
  • Purpose
  • Compare raw data, DFT data, DWT data, and
    segmentation data through their clustering
    results
  • Characteristics of each data set can be
    discovered from the clustering results
  • Approach
  • K-means clustering algorithm
  • Heuristic strategies for finding center points
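
A compact K-means sketch is given below. The slides do not say which heuristic was used to find the initial center points, so this example substitutes farthest-first selection as one common option; that choice, and all function names, are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def farthest_first_centers(points, k):
    """Start from the first point, then repeatedly add the point that
    is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center update
    until the assignment stops changing."""
    centers = [list(c) for c in farthest_first_centers(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers


# Usage: two well-separated groups of 2-D feature vectors.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(points, k=2)
```

The same routine applies unchanged to raw series, DFT features, or DWT coefficients, since each is just a fixed-length vector per station.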

17
Raw Data Clustering Graph
18
DFT Clustering Graph
19
DWT Clustering Graph
20
Clustering Visualization
21
Segmentation Data Clustering
  • Change Point Detection
  • Numbering each segment
  • Clustering segment pieces
  • Results
  • We obtained some preliminary results
  • The differences between clusters can be
    compared using the slope and length of their
    segments
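
Comparing clusters by slope and length could look like the sketch below. It assumes each segment has already been reduced to a (slope, length) pair and assigned a cluster label; the summary statistic (per-cluster means) and the name `cluster_summary` are hypothetical choices, not taken from the slides.

```python
def cluster_summary(segments, labels):
    """Mean slope, mean length, and size of each cluster.

    `segments` is a list of (slope, length) pairs and `labels` gives
    the cluster id of each segment, in the same order.
    Returns {cluster_id: (mean_slope, mean_length, count)}.
    """
    totals = {}
    for (slope, length), lab in zip(segments, labels):
        s, l, c = totals.get(lab, (0.0, 0, 0))
        totals[lab] = (s + slope, l + length, c + 1)
    return {lab: (s / c, l / c, c) for lab, (s, l, c) in totals.items()}


# Usage: cluster 0 holds rising segments, cluster 1 a falling one.
segments = [(1.0, 3), (2.0, 5), (-1.0, 4)]
summary = cluster_summary(segments, labels=[0, 0, 1])
```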

22
Conclusions and Future Work
  • Current achievement
  • Raw data preprocessing
  • Raw data segmentation/windowing
  • Feature selection using DFT and DWT
    transformations
  • Raw data, DFT data, DWT data, segmented data
    (DFT/DWT) clustering (CPD, CPD with windowing)
  • Future work
  • Clustered data symbolization
  • Novelty detection

23
System Architecture
  • Diagram components: Data Interpolation, Feature
    Extraction, Segmentation, Clustering, A Set of
    Symbolized Sequences, Symbolization