Discovery of Patterns in the Global Climate System using Data Mining

1
Discovery of Patterns in the Global Climate
System using Data Mining
  • Vipin Kumar
  • Army High Performance Computing Research Center
  • Department of Computer Science
  • University of Minnesota,
    http://www.cs.umn.edu/~kumar
  • Research sponsored by AHPCRC/ARL, DOE, NASA, and
    NSF

2
What is Data Mining?
  • Many Definitions
  • Non-trivial extraction of implicit, previously
    unknown and potentially useful information from
    data
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of
    data in order to discover meaningful patterns

3
What is (not) Data Mining?
  • What is not Data Mining?
  • Look up phone number in phone directory
  • Query a Web search engine for information about
    Amazon
  • What is Data Mining?
  • Certain names are more prevalent in certain US
    locations (O'Brien, O'Rourke in the Boston area)
  • Group together similar documents returned by
    search engine according to their context (Amazon
    rainforest, Amazon.com, etc.)

4
Why Mine Data? Commercial Viewpoint
  • Lots of data is being collected and warehoused
  • Web data
  • Yahoo! collects ~10 GB/hour
  • purchases at department/grocery stores
  • Walmart records ~20 million transactions
    per day
  • Bank/Credit Card transactions
  • Computers have become cheaper and more powerful
  • Competitive Pressure is Strong
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

5
Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • NASA EOSDIS archives over 1 petabyte of Earth
    Science data per year
  • telescopes scanning the skies
  • Sky survey data
  • gene expression data
  • scientific simulations
  • terabytes of data generated in a few hours
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in automated analysis of massive data sets
  • in hypothesis formation

6
Mining Large Data Sets - Motivation
  • There is often information hidden in the data
    that is not readily evident
  • Human analysts may take too long to discover
    useful information
  • Much of the data is never analyzed at all

7
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

Figure: Data Mining at the intersection of
Statistics/AI, Machine Learning/Pattern Recognition,
and Database Systems
8
Role of Parallel and Distributed Computing
  • High Performance Computing (HPC) is often
    critical for scalability to large data sets
  • Many algorithms use more than O(n) computation
    time
  • Sequential computers have limited memory, thus
    requiring multiple, expensive I/O passes over
    data
  • Distributed computing is needed because data is
    distributed
  • due to privacy reasons
  • physically dispersed over many different
    geographic locations

9
Data Mining Tasks...
Figure: a sample data set linked to four tasks:
Clustering, Predictive Modeling, Anomaly Detection,
and Association Rules
10
Predictive Modeling
  • Find a model for class attribute as a function
    of the values of other attributes

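Figure: Model for predicting tax evasion. A
classifier is learned from records with two
categorical attributes, one continuous attribute,
and a class label. The resulting decision tree
tests Married (Yes/No) and Income (split points at
80K and 100K) and ends in YES or NO leaves.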
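
To make "find a model for the class attribute" concrete, here is a minimal sketch of a decision-tree learner that uses Gini impurity. The records are hypothetical toy values invented for illustration (they are not the data behind the slide's figure), and the function names are made up for this example.

```python
from collections import Counter

# Hypothetical toy records: (married as 0/1, income in thousands).
# Values and labels are invented for illustration only.
ROWS = [(0, 125), (1, 100), (0, 70), (1, 120), (0, 95),
        (1, 60), (0, 220), (0, 85), (1, 75), (0, 90)]
LABELS = ["NO", "NO", "NO", "NO", "YES",
          "NO", "NO", "YES", "NO", "YES"]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (impurity, feature, threshold) of the best binary split, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a small tree; leaves are class labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no attribute can separate the remaining rows
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


tree = build_tree(ROWS, LABELS)
print(predict(tree, (0, 90)))  # unmarried, income 90K
```

The tree is only as good as the toy data; a real application would hold out test records to estimate accuracy.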
11
Predictive Modeling Applications
  • Targeted Marketing
  • Customer Attrition/Churn
  • Classifying Galaxies
  • Class: stages of formation (Early,
    Intermediate, Late)
  • Attributes: image features, characteristics of
    light waves received, etc.
  • Sky Survey Data Size
  • 72 million stars, 20 million galaxies
  • Object Catalog: 9 GB
  • Image Database: 150 GB

Courtesy: http://aps.umn.edu
12
Clustering
  • Given a set of data points, find groupings such
    that
  • Data points in one cluster are more similar to
    one another
  • Data points in separate clusters are less similar
    to one another
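
One common way to find such groupings is k-means, sketched below. It is offered only as an illustration of the clustering objective and is not necessarily the method used later in this talk. The points are hypothetical, and the first k points are used as the initial centroids to keep the run deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: assignments no longer change
            break
        centroids = updated
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
          (8.2, 7.9), (0.9, 1.1), (7.8, 8.3)]
labels, centroids = kmeans(POINTS, k=2)
print(labels)
```

Results depend on the initial centroids; practical implementations usually try several random starts.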

13
Clustering Applications
  • Market Segmentation
  • Gene expression clustering
  • Document Clustering

14
Association Rule Discovery
  • Given a set of records, find dependency rules
    which will predict occurrence of an item based on
    occurrences of other items in the record
  • Applications
  • Marketing and Sales Promotion
  • Supermarket shelf management
  • Inventory Management

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
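
Rules like these are usually scored by support and confidence. The sketch below computes both measures for the two rules over a handful of hypothetical market baskets; the baskets and function names are invented for illustration.

```python
# Hypothetical market baskets, invented for illustration.
BASKETS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    sup = support(BASKETS, lhs | rhs)
    conf = confidence(BASKETS, lhs, rhs)
    print(sorted(lhs), "-->", sorted(rhs), f"sup={sup:.2f} conf={conf:.2f}")
```

Rule discovery algorithms search for all rules whose support and confidence exceed user-chosen thresholds; this sketch only scores rules that are already given.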
15
Deviation/Anomaly Detection
  • Detect significant deviations from normal
    behavior
  • Applications
  • Credit Card Fraud Detection
  • Network Intrusion Detection

Typical network traffic at University
level may reach over 100 million connections per
day
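
A very simple form of deviation detection flags values that lie far from the bulk of the data. The sketch below uses a modified z-score based on the median and the median absolute deviation (MAD). The daily counts are hypothetical, and the 3.5 cutoff is a conventional but assumed choice.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:  # no spread around the median: nothing can be scored
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - center) / mad > threshold]


# Hypothetical daily connection counts, in millions.
DAILY_CONNECTIONS = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]
print(mad_anomalies(DAILY_CONNECTIONS))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the standard deviation and can mask itself in a short series.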
16
Discovery of Patterns in the Earth Science Data
  • NASA ESE questions
  • How is the global Earth system changing?
  • What are the primary forcings?
  • How does the Earth system respond to natural
    and human-induced changes?
  • What are the consequences of changes in the Earth
    system?
  • How well can we predict future changes?
  • Global snapshots of values for a number of
    variables on land surfaces or water
  • Data sources
  • weather observation stations
  • earth orbiting satellites (since 1981)
  • model-based data

17
Climate Indices: Connecting the Ocean/Atmosphere
and the Land
  • A climate index is a time series of sea surface
    temperature or sea level pressure
  • Climate indices capture teleconnections
  • The simultaneous variation in climate and related
    processes over widely separated points on the
    Earth

Figure: El Nino events and the NINO 1+2 index
18
Discovery of Climate Indices Using Clustering
A novel clustering technique was developed to
identify regions of uniform behavior in
spatio-temporal data. The use of clustering for
discovering climate indices is driven by the
intuition that a climate phenomenon is expected
to involve a significant region of the ocean or
atmosphere where the behavior is relatively
uniform over the entire area. A cluster-based
approach for discovering climate indices provides
better physical interpretation than approaches
based on the SVD/EOF paradigm, and provides
candidate indices with better predictive power
than known indices for some land areas.
Some SST clusters reproduce well-known climate
indices. In particular, we were able to replicate
the four El Nino SST-based indices: cluster 94
corresponds to NINO 1+2, 67 to NINO 3, 78 to
NINO 3.4, and 75 to NINO 4. The correlations of
these clusters to their corresponding indices are
higher than 0.9.
Some SST clusters, e.g., cluster 29, are
significantly different from known indices, but
provide better correlation with land climate
variables than known indices for many parts of
the globe. The bottom figure shows the
difference in correlation to land temperature
between cluster 29 and the El Nino indices. Areas
in yellow indicate where cluster 29 has higher
correlation.
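
As a rough illustration of how a cluster can be compared with a known index, the sketch below averages the time series of a cluster's grid points into a centroid and computes its Pearson correlation with an index series. All numbers are made-up toy values and the function names are hypothetical; this is not the clustering technique from the talk, only the comparison step.

```python
import math


def centroid(series_list):
    """Point-wise mean of several equally long time series."""
    return [sum(vals) / len(vals) for vals in zip(*series_list)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up SST anomaly series for three grid points in one cluster.
GRID_POINTS = [
    [0.2, 0.6, 1.0, 0.9, -0.2, -1.0, -0.5, 0.1],
    [0.0, 0.4, 1.3, 0.7, -0.4, -0.8, -0.3, 0.3],
    [0.4, 0.2, 1.0, 1.0, -0.1, -0.7, -0.6, 0.0],
]
# Made-up values standing in for a known climate index.
KNOWN_INDEX = [0.1, 0.5, 1.2, 0.8, -0.3, -0.9, -0.4, 0.2]

r = pearson(centroid(GRID_POINTS), KNOWN_INDEX)
print(f"correlation with known index: {r:.3f}")
```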
19
Mining the Climate Data: Clustering
  • Grid points: 67K land, 40K ocean
  • Current data size range: 20 - 400 MB
  • Monthly data over a range of 17 to 50 years
Figure: El Nino regions defined by Earth scientists
Figure: clusters of SST that have high impact on
land temperature
20
SST Cluster Moderately Correlated to Known Indices
Ref: Steinbach et al. 2002/2003 (KDD 2003)

21
Correlation of Known Indices with SST Cluster
Centroids and SVD Components
22
SLP Clusters
Figure: SLP clusters, with labels marking those
that correspond to AO, NAO, SOI (two clusters),
and DMI
23
Pair of SLP Clusters that Correspond to SOI
Centroids of SLP clusters 13 and 20
Cluster centroid 20 - 13 versus SOI
Correlation = 0.75
24
Finding New Patterns: Indian Monsoon Dipole Mode
Index
  • Recently a new index, the Indian Ocean Dipole
    Mode index (DMI), has been discovered.
  • DMI is defined as the difference in SST anomaly
    between the region 5S-5N, 55E-75E and the region
    0-10S, 85E-95E.
  • DMI is an indicator of a weak monsoon over
    the Indian subcontinent and heavy rainfall over
    East Africa.
  • We can reproduce this index as a difference of
    pressure indices of clusters 16 and 22.

Plot of cluster 16 - cluster 22 versus the Indian
Ocean Dipole Mode index. (Indices smoothed using
a 12-month moving average.)
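
The two operations on this slide, taking the difference of two regional series and smoothing with a 12-month moving average, can be sketched as follows. The function names are hypothetical and the input series are synthetic placeholders, not real SST or pressure data.

```python
def dipole_index(west, east):
    """Month-by-month difference between two regional anomaly series."""
    return [w - e for w, e in zip(west, east)]


def moving_average(series, window=12):
    """Mean over each run of `window` consecutive values (no padding)."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Synthetic two-year monthly anomaly series for two regions (placeholders).
WEST = [0.1 * (m % 6) for m in range(24)]
EAST = [0.05 * (m % 4) for m in range(24)]

raw = dipole_index(WEST, EAST)
smoothed = moving_average(raw, window=12)
print(len(raw), len(smoothed))
```

Because the average is taken only over complete windows, the smoothed series is shorter than the raw one by window - 1 values.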
25
Mining the Climate Data: Associations
Ref: Tan et al. 2001
FPAR-Hi --> NPP-Hi (sup = 5.9%, conf = 55.7%)
Grassland/Shrubland areas
The association rule is interesting because it
appears mainly in regions with grassland/shrubland
vegetation type
26
Detection of Ecosystem Disturbances
Detection of sudden changes in greenness over
extensive areas from these large global satellite
data sets required development of automated
techniques that take into account the timing,
location, and magnitude of such changes. An
algorithm was designed to identify any
significant and sustained declines in FPAR during
an 18 year time period. This algorithm transforms
a non-stationary time series to a sequence of
disturbance events. Techniques were also
developed to discover associations between
ecosystem disturbance regimes and historical
climate anomalies.
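
The idea of turning a time series into a sequence of disturbance events can be illustrated with a simplified stand-in. The sketch below is not the algorithm described above. It assumes the FPAR series has already been deseasonalized, and the z-score cutoff and minimum duration are assumed parameters chosen for the toy data.

```python
from statistics import mean, pstdev


def sustained_declines(series, z_threshold=-1.0, min_months=3):
    """Return (start, end) index pairs of runs that stay below z_threshold
    for at least min_months consecutive values."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    events, start = [], None
    for i, value in enumerate(series):
        low = (value - mu) / sigma < z_threshold
        if low and start is None:
            start = i  # a candidate decline begins here
        elif not low and start is not None:
            if i - start >= min_months:
                events.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(series) - start >= min_months:
        events.append((start, len(series) - 1))
    return events


# Hypothetical deseasonalized monthly FPAR values.
FPAR = [0.50, 0.52, 0.49, 0.51, 0.50, 0.20, 0.18, 0.22,
        0.21, 0.50, 0.51, 0.49, 0.20, 0.50, 0.52, 0.50]
print(sustained_declines(FPAR))
```

In the toy series the four-month drop is reported as one event, while the single low month is ignored because it is not sustained.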
These algorithms and techniques have allowed
Earth Science researchers to gain a deeper
insight into the interplay among natural
disasters, human activities and the rise of
carbon dioxide in Earth's atmosphere during two
recent decades.
Release 03-51AR: NASA DATA MINING REVEALS A NEW
HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed
global picture of the interplay among natural
disasters, human activities and the rise of
carbon dioxide in the Earth's atmosphere during
the past 20 years.
http://amesnews.arc.nasa.gov/releases/2003/03_51AR.html
27
Understanding Global Teleconnections of Climate
to Regional Model Estimates of Amazon Ecosystem
Carbon Fluxes
Discovered, using correlation analysis, a strong
connection between the rainfall patterns
generated by the South American monsoon system
and terrestrial greenness over a large section of
the southern Amazon region. This is the first
direct evidence of large-scale effects of the
Atlantic Ocean rainfall systems on yearly
greenness changes in the Amazon region, and the
finding has important implications for the
impacts of "slash and burn" deforestation on this
crucial ecosystem of the world.
28
High Resolution EOS Data
  • EOS satellites provide high resolution
    measurements
  • Finer spatial grids
  • 8 km × 8 km grid produces 10,848,672 data points
  • 1 km × 1 km grid produces 694,315,008 data points
  • More frequent measurements
  • Multiple instruments
  • Generates terabytes of data per day
  • High resolution data allows us to answer more
    detailed questions
  • Detecting patterns such as trajectories, fronts,
    and movements of regions with uniform properties
  • Finding relationships between leaf area index
    (LAI) and topography of a river drainage basin
  • Finding relationships between fire frequency and
    elevation as well as topographic position

Earth Observing System (e.g., Terra and Aqua
satellites)
http://www.crh.noaa.gov/lmk/soo/docu/basicwx.htm
29
Discovery of Changes from the Global Carbon Cycle
and Climate System Using Data Mining:
Journal Publications
  • Potter, C., Tan, P., Steinbach, M., Klooster,
    S., Kumar, V., Myneni, R., Genovese, V., 2003.
    Major disturbance events in terrestrial
    ecosystems detected using global satellite data
    sets. Global Change Biology, July, 2003.
  • Potter, C., Klooster, S. A., Myneni, R.,
    Genovese, V., Tan, P., Kumar, V. 2003. Continental
    scale comparisons of terrestrial carbon sinks
    estimated from satellite data and ecosystem
    modeling 1982-98. Global and Planetary Change (in
    press)
  • Potter, C., Klooster, S. A., Steinbach, M., Tan,
    P., Kumar, V., Shekhar, S., Nemani, R., Myneni,
    R., 2003. Global teleconnections of climate to
    terrestrial carbon flux. J. Geophys. Res. -
    Atmospheres (in press).
  • Potter, C., Klooster, S., Steinbach, M., Tan, P.,
    Kumar, V., Myneni, R., Genovese, V., 2003.
    Variability in Terrestrial Carbon Sinks Over Two
    Decades. Part 1: North America. Geophysical
    Research Letters (in press)
  • Potter, C., Klooster, S., Steinbach, M., Tan, P.,
    Kumar, V., Shekhar, S. and C. Carvalho, 2002.
    Understanding Global Teleconnections of Climate
    to Regional Model Estimates of Amazon Ecosystem
    Carbon Fluxes. Global Change Biology (in press)
  • Potter, C., Zhang, P., Shekhar, S., Kumar, V.,
    Klooster, S., and Genovese, V., 2002.
    Understanding the Controls of Historical River
    Discharge Data on Largest River Basins. (in
    preparation)

30
Discovery of Changes from the Global Carbon Cycle
and Climate System Using Data Mining:
Conference/Workshop Publications
  • Steinbach, M., Tan, P., Kumar, V., Potter, C. and
    Klooster, S., 2003. Discovery of Climate Indices
    Using Clustering, KDD 2003, Washington, D.C.,
    August 24-27, 2003.
  • Zhang, P., Huang, Y., Shekhar, S., and Kumar, V.,
    2003. Exploiting Spatial Autocorrelation to
    Efficiently Process Correlation-Based Similarity
    Queries, Proc. of the 8th Intl. Symp. on Spatial
    and Temporal Databases (SSTD '03)
  • Zhang, P., Huang, Y., Shekhar, S., and Kumar, V.,
    2003. Correlation Analysis of Spatial Time Series
    Datasets: A Filter-And-Refine Approach, Proc. of
    the Seventh Pacific-Asia Conference on Knowledge
    Discovery and Data Mining (PAKDD '03)
  • Ertoz, L., Steinbach, M., and Kumar, V., 2003.
    Finding Clusters of Different Sizes, Shapes, and
    Densities in Noisy, High Dimensional Data, Proc.
    of Third SIAM International Conference on Data
    Mining.
  • Tan, P., Steinbach, M., Kumar, V., Potter, C.,
    Klooster, S., and Torregrosa, A., 2001. Finding
    Spatio-Temporal Patterns in Earth Science Data,
    KDD 2001 Workshop on Temporal Data Mining, San
    Francisco
  • Kumar, V., Steinbach, M., Tan, P., Klooster, S.,
    Potter, C., and Torregrosa, A., 2001. Mining
    Scientific Data: Discovery of Patterns in the
    Global Climate System, Proc. of the 2001 Joint
    Statistical Meeting, Atlanta