Karsten Steinhaeuser University of Minnesota joint work with Nitesh Chawla - PowerPoint PPT Presentation

About This Presentation

Title:

Karsten Steinhaeuser University of Minnesota joint work with Nitesh Chawla

Description:

Comparing Predictive Power in Climate Data: Clustering Matters. Karsten Steinhaeuser, University of Minnesota; joint work with Nitesh Chawla & Auroop Ganguly

Transcript and Presenter's Notes

1
Comparing Predictive Power in Climate Data
Clustering Matters

Karsten Steinhaeuser, University of Minnesota
Joint work with Nitesh Chawla and Auroop Ganguly
12th International Symposium on Spatial and Temporal Databases
Minneapolis, MN, August 24, 2011

2
Outline

Motivation
Networks Primer
From Data to Networks
Motivating Networks in Climate Science
Descriptive Analysis and Predictive Modeling
Empirical Evaluation Comparison
Conclusions

3
Mining Complex Data

Complex spatio-temporal data pose unique challenges
Tobler's First Law of Geography: Everything is related, but near things more than distant.
But are all near things equally related?
Are there phenomena explained by interactions among distant things? (teleconnections)

4
Networks Primer

What is a Network?
Oxford English Dictionary: network, n. Any netlike or complex system or collection of interrelated things, as topographical features, lines of transportation, or telecommunications routes (esp. telephone lines).
My working definition: Any set of items that are connected or related to each other. (Items and connections can be concrete or abstract.)

5
Networks Primer

Community Detection in Networks
Identify groups of nodes that are relatively more tightly connected to each other than to other nodes in the network
Computationally challenging problem for real-world networks

6
From Data to Networks

Networks are pervasive in social science, technology, and nature
Many datasets explicitly define network structure
But networks can also represent other types of data, providing a framework for identifying relationships, patterns, etc.

7
Motivating Networks in Climate

Uncertainty derives from many known and often many more unknown sources

8
Motivating Networks in Climate

Projections of climate rely on many factors:
Understanding of the physical processes
Ability to implement this understanding in computational models
Assumptions about the future

Source: IPCC SRES and AR-4
9
Motivating Networks in Climate

Some processes well-understood and modeled,
others much less credible
Comparison to observations shows varying skills

10
Motivating Networks in Climate

Models cannot capture some features/processes
Comparison to observations illustrates severe
geographic variability, topographic bias

11
Motivating Networks in Climate

Research Question
Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
Answer: Stay tuned

12
Knowledge Discovery for Climate
13
Historical Climate Data

NCEP/NCAR Reanalysis (proxy for observation)
Monthly for 60 years (1948-2007) on a 5° x 5° grid
Seven variables:
Sea surface temperature (SST)
Sea level pressure (SLP)
Geopotential height (GH)
Precipitable water (PW)
Relative humidity (RH)
Horizontal wind speed (HWS)
Vertical wind speed (VWS)

Raw Data -> De-Seasonalize -> Anomaly Series
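
The slide does not say how the data were de-seasonalized. A minimal sketch of one common approach, assumed here purely for illustration, subtracts each calendar month's long-term mean from the monthly series. The helper name `monthly_anomalies` and the toy numbers are hypothetical.

```python
from statistics import mean

def monthly_anomalies(series, period=12):
    """Remove the mean seasonal cycle from a monthly series (hypothetical helper)."""
    # Climatology: long-term mean of each calendar month (all Januaries, all Februaries, ...).
    climatology = [mean(series[m::period]) for m in range(period)]
    # Anomaly: departure of each value from its own calendar month's mean.
    return [x - climatology[i % period] for i, x in enumerate(series)]

# Toy input (made-up numbers): two years sharing one seasonal cycle,
# with the second year uniformly 1.0 warmer than the first.
cycle = [10.0, 12.0, 15.0, 19.0, 23.0, 27.0, 29.0, 28.0, 24.0, 19.0, 14.0, 11.0]
raw = cycle + [v + 1.0 for v in cycle]
anomalies = monthly_anomalies(raw)
```

On the toy input the seasonal cycle disappears and only the year-to-year offset remains: -0.5 throughout year one and +0.5 throughout year two.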
14
Network Construction

View global climate system as a collection of interacting oscillators (Tsonis & Roebber, 2004)
Vertices represent locations in space
Edges denote correlation in variability
Link strength estimated by correlation, low-weight edges are pruned from the network
Construct networks only for locations over the oceans (relatively better captured by models)
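
The construction above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: link strength is taken as the absolute Pearson correlation between two locations' anomaly series, and the pruning threshold of 0.5 is an arbitrary placeholder. The location names and series are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def build_network(series_by_location, threshold=0.5):
    """Return {(loc_a, loc_b): weight}, keeping only edges with |r| >= threshold.

    The threshold value and the use of |r| are assumptions for this sketch.
    """
    names = sorted(series_by_location)
    edges = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(series_by_location[names[i]], series_by_location[names[j]])
            if abs(r) >= threshold:  # prune low-weight edges
                edges[(names[i], names[j])] = abs(r)
    return edges

# Toy anomaly series at three hypothetical ocean grid cells.
cells = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 11.0],
    "C": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = build_network(cells)
```

Only the A-B edge survives pruning here, because C is only weakly correlated with the other two cells.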

15
Geographic Properties

Examine network structure in spatial context
Link lengths computed as great-circle distance
Compare autocorrelation / de-correlation lengths for different variables, interpret within the domain
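
For reference, a great-circle link length can be computed with the haversine formula. The slide does not state which formula was used, so this is only one standard option. The Earth radius constant is an assumed mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees (haversine)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

# A quarter of the equator: from (0N, 0E) to (0N, 90E).
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
```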

Panels: Sea Level Pressure, Precipitable Water, Vertical Wind Speed
16
Clustering Climate Networks

Apply community detection to partition networks
Use Walktrap algorithm (Pons & Latapy, 2006)
Efficient and works well for dense networks
Visualize spatial pattern using GIS tools
Cluster structure suggests relationships within the climate system

Panels: Sea Level Pressure, Precipitable Water
17
Update!

Research Question
Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
Revised Answer: Yes, but that's not all.

18
Descriptive → Predictive

Network representation is able to capture
interactions, reveal patterns in climate
Validate existing assumptions / knowledge
Suggest potentially new insights or hypotheses for climate science
Want to extract the relationships between atmospheric dynamics over ocean and land, i.e., learn physical phenomena from the data

19
Predictive Modeling

Use network clusters as candidate predictors
Create response variables for target regions around the globe (illustrated below)
Build regression model relating ocean clusters to land climate
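
A minimal sketch of this setup, with assumptions labeled: each ocean cluster is reduced to one predictor by averaging its members' anomaly series (the slide does not say how predictors were derived), and an ordinary least-squares model is fitted through the normal equations. All names and numbers below are hypothetical.

```python
def cluster_means(series_by_cell, clusters):
    """One predictor per cluster: the mean anomaly series of its member cells (assumed)."""
    predictors = []
    for cells in clusters:
        length = len(series_by_cell[cells[0]])
        predictors.append(
            [sum(series_by_cell[c][t] for c in cells) / len(cells) for t in range(length)]
        )
    return predictors

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(predictors, response):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    T = len(response)
    X = [[1.0] + [p[t] for p in predictors] for t in range(T)]
    k = len(X[0])
    XtX = [[sum(X[t][i] * X[t][j] for t in range(T)) for j in range(k)] for i in range(k)]
    Xty = [sum(X[t][i] * response[t] for t in range(T)) for i in range(k)]
    return solve(XtX, Xty)

# Toy data: ocean cells a and b form cluster 0, cell c forms cluster 1.
ocean = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [3.0, 2.0, 5.0, 4.0, 7.0, 6.0],
    "c": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
}
predictors = cluster_means(ocean, [["a", "b"], ["c"]])
# Synthetic land response built from known coefficients, so the fit can be checked.
land = [0.5 + 2.0 * p0 - 1.0 * p1 for p0, p1 in zip(*predictors)]
coefficients = fit_ols(predictors, land)
```

Because the synthetic response is an exact linear function of the two cluster predictors, the fit recovers the coefficients 0.5, 2.0 and -1.0 up to rounding.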

20
Illustrative Example

Predictive model for air temperature in Peru
Long-term variability highly predictable due to well-documented relation to El Niño
Small number of clusters have majority of skill
Feature selection (blue line) improves predictions

Legend: Raw Data, All Clusters, Feature Selection
21
Update!

Research Question
Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
Revised Answer: Yes and Yes, but wait, there's more.

22
Results on Train/Test
23
Predictive Skill
24
Update!

Research Question
Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
Revised Answer: Yes, Yes, and Yes.

25
Variations / Extensions

Compare network approach to traditional
clustering methods
k-means, k-medoids, spectral, EM, etc.
Compare different types of predictive models
(linear) regression, regression trees, neural
nets, support vector regression

26
Compare Clustering Methods
27
Compare Predictive Models
28
Refining Model Projections
29
Conclusions

Networks capture behavior of the climate system
Clusters (or communities) derived from these
networks have useful predictive skill
Statistically significantly better than
predictors based on clusters derived using
traditional methods
Potential for advancing climate science
Understanding of physical processes
Complement climate model simulations

30
Upcoming Events

First International Workshop on Climate Informatics, New York, NY, August 26, 2011, http://www.nyas.org/climateinformatics
NASA Conference on Intelligent Data Understanding (CIDU), Mountain View, CA, Oct 19-21, 2011, https://c3.ndc.nasa.gov/dashlink/projects/43/
IEEE ICDM Workshop on Knowledge Discovery from Climate Data, Vancouver, Canada, December 10, 2011, http://www.nd.edu/dial/climkd11/

31
Thanks! Questions?

Contact: ksteinha_at_umn.edu
Personal Homepage: http://www.nd.edu/ksteinha
NSF Expeditions on Understanding Climate Change: http://climatechange.cs.umn.edu
This work was supported in part by the National Science Foundation under Grants OCI-1029584 and BCS-0826958. This research was also funded in part by the project entitled "Uncertainty Assessment and Reduction for Climate Extremes and Climate Change Impacts" under the initiative "Understanding Climate Change Impact: Energy, Carbon, and Water Initiative" within the Laboratory Directed Research and Development (LDRD) Program of the Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract DE-AC05-00OR22725.

32
K-Means

Lloyd's Algorithm
1. Randomly select k points as initial means
2. Associate each data point with the closest mean (closest by some distance, e.g., Euclidean)
3. Calculate the new means to be the centroid of the data points associated with each cluster
4. Repeat steps 2 and 3 until there is no change in the cluster centers (or the change is less than some ε)
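
The steps above can be sketched in a few lines. This is an illustrative version using squared Euclidean distance and exact convergence (no ε tolerance). The seed, the iteration cap and the toy points are arbitrary choices for the example.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # step 1: random initial means
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest mean.
        labels = [min(range(k), key=lambda c: dist2(p, means[c])) for p in points]
        # Step 3: recompute each mean as the centroid of its assigned points.
        new_means = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_means.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_means.append(means[c])  # leave an empty cluster's mean unchanged
        # Step 4: stop once the means no longer move.
        if new_means == means:
            break
        means = new_means
    return labels, means

# Two well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means = kmeans(points, k=2)
```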

33
K-Medoids

PAM Algorithm
1. Randomly select k of the n data points as the medoids
2. Associate each data point to the closest medoid (closest by some distance, e.g., Euclidean)
3. For each medoid m and each non-medoid data point o, swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
5. Repeat steps 2 to 4 until there is no change in the medoids
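
A compact sketch of the swap search described above. It assumes the total cost of a configuration is the sum of each point's distance to its nearest medoid, and it applies the single best swap per pass. The names and toy data are hypothetical.

```python
import random
from math import dist as euclidean

def total_cost(points, medoids, distance):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(distance(p, m) for m in medoids) for p in points)

def pam(points, k, distance=euclidean, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # step 1: random initial medoids
    best = total_cost(points, medoids, distance)
    while True:
        best_swap = None
        # Step 3: try swapping every medoid with every non-medoid point.
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate, distance)
                if cost < best:
                    best, best_swap = cost, candidate
        # Steps 4 and 5: keep the cheapest configuration, stop when nothing improves.
        if best_swap is None:
            break
        medoids = best_swap
    # Step 2 (final assignment): label each point by its closest medoid.
    labels = [min(range(k), key=lambda c: distance(p, medoids[c])) for p in points]
    return labels, medoids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, medoids = pam(points, k=2)
```

Unlike k-means, the cluster centers returned here are always actual data points, which is why PAM also works with distances for which a centroid is not defined.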

34
Spectral

Given the similarity matrix S, spectral clustering relies on the spectrum (eigenvalues and eigenvectors) of S to cluster the data in lower-dimensional space.
1. Compute the eigenvectors for the k smallest eigenvalues of the normalized graph Laplacian $L = I - D^{-1/2} S D^{-1/2}$, where D is the vertex degree matrix
2. Project each data point into the new space
3. Apply simple clustering (e.g., k-means)
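
As a rough illustration for the special case of two clusters, the sketch below builds the normalized Laplacian and finds the eigenvector of its second-smallest eigenvalue by power iteration on 2I - L with the trivial eigenvector removed. Splitting by the sign of that vector stands in for the k-means step, which is a simplification chosen here and not part of the slide. The similarity matrix is made up.

```python
from math import sqrt

def normalized_laplacian(S):
    """L = I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix S."""
    n = len(S)
    deg = [sum(row) for row in S]
    return [[(1.0 if i == j else 0.0) - S[i][j] / sqrt(deg[i] * deg[j])
             for j in range(n)] for i in range(n)]

def second_eigenvector(S, iters=300):
    """Eigenvector of L for its second-smallest eigenvalue (power iteration sketch)."""
    n = len(S)
    L = normalized_laplacian(S)
    deg = [sum(row) for row in S]
    norm = sqrt(sum(deg))
    trivial = [sqrt(d) / norm for d in deg]  # eigenvector of L with eigenvalue 0
    v = [float(i + 1) for i in range(n)]     # arbitrary deterministic start
    for _ in range(iters):
        # Remove the component along the trivial eigenvector (deflation).
        dot = sum(a * b for a, b in zip(v, trivial))
        v = [a - dot * b for a, b in zip(v, trivial)]
        # Multiply by (2I - L); its dominant direction is L's smallest remaining one.
        v = [2.0 * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return v

# Toy similarity matrix: two tight triangles joined by one weak link (2-3).
S = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
embedding = second_eigenvector(S)
labels = [x > 0 for x in embedding]  # sign split as a stand-in for k-means with k = 2
```

For more than two clusters one would keep k eigenvectors and run a real clustering step on the projected points. A linear algebra library would normally supply the eigen-decomposition.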

35
EM

Given a set of points X, latent variables Z, and a statistical model (e.g., Gaussian) with corresponding set of unknown parameters T
E-Step: Compute the expected value of the log-likelihood with respect to the conditional distribution of Z given X and the current parameter estimates (i.e., the probability of each datum belonging to a given component of the mixture)
M-Step: Find the parameters T that maximize this quantity (i.e., tweak the parameters to maximize the probabilities)
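
To make the two steps concrete, here is a minimal sketch for a one-dimensional Gaussian mixture. The initial parameters, the fixed iteration count and the variance floor are arbitrary choices for the example, and the data are made up.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, var):
    return exp(-((x - mu) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)

def em_gmm_1d(xs, mus, variances, weights, iters=50):
    """EM for a 1-D Gaussian mixture; returns fitted parameters and responsibilities."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    resp = []
    for _ in range(iters):
        # E-step: probability of each datum belonging to each mixture component.
        resp = []
        for x in xs:
            joint = [weights[c] * gaussian_pdf(x, mus[c], variances[c]) for c in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the weighted data.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = nc / len(xs)
            mus[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nc
            var = sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs)) / nc
            variances[c] = max(var, 1e-6)  # floor keeps the density well-defined
    return mus, variances, weights, resp

# Toy data drawn by hand around 0 and around 5.
xs = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
fit_mus, fit_vars, fit_weights, resp = em_gmm_1d(
    xs, mus=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

With groups this well separated, the responsibilities become effectively hard assignments and the fitted means settle at the two group averages, 0.05 and 5.05.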

36
Walktrap

1. Find the similarity between each node and its neighbors using random walks
2. Merge the two closest nodes (or communities)
3. Update the similarities (only requires a computation for the merged communities)
4. Repeat steps 2 and 3 until all nodes are in a single, giant community (hierarchical clustering)
5. Decide where to cut the hierarchy (dendrogram) using modularity or other criteria (cluster size, a priori knowledge, etc.)
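
The procedure can be sketched as below. This is a simplified illustration, not the reference implementation: it assumes a connected graph, uses t-step random-walk probabilities with the degree-weighted distance from Pons & Latapy, recomputes all pairwise distances on every pass for brevity (the real algorithm only updates the merged community), and cuts the hierarchy at maximum modularity.

```python
def modularity(adj, communities):
    """Newman modularity of a partition of a graph given by its adjacency matrix."""
    two_m = sum(sum(row) for row in adj)
    q = 0.0
    for c in communities:
        inside = sum(adj[i][j] for i in c for j in c)
        degree = sum(sum(adj[i]) for i in c)
        q += inside / two_m - (degree / two_m) ** 2
    return q

def walktrap(adj, t=4):
    """Return (best partition, its modularity) from random-walk-based merging."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Step 1: t-step random-walk probabilities from every node.
    P = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    Pt = [row[:] for row in P]
    for _ in range(t - 1):
        Pt = [[sum(Pt[i][m] * P[m][j] for m in range(n)) for j in range(n)]
              for i in range(n)]
    comms = {i: [i] for i in range(n)}
    prob = {i: Pt[i][:] for i in range(n)}
    best_q = modularity(adj, comms.values())
    best_part = [c[:] for c in comms.values()]
    while len(comms) > 1:
        # Step 2: find the closest pair of adjacent communities.
        best = None
        ids = sorted(comms)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                if not any(adj[i][j] for i in comms[a] for j in comms[b]):
                    continue  # only communities sharing an edge may merge
                r2 = sum((prob[a][k] - prob[b][k]) ** 2 / deg[k] for k in range(n))
                na, nb = len(comms[a]), len(comms[b])
                delta = na * nb / (na + nb) * r2 / n
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        if best is None:
            break  # no adjacent pair left (disconnected graph)
        _, a, b = best
        # Step 3: merge and update the merged community's walk probabilities.
        na, nb = len(comms[a]), len(comms[b])
        prob[a] = [(na * pa + nb * pb) / (na + nb) for pa, pb in zip(prob[a], prob[b])]
        comms[a] += comms.pop(b)
        del prob[b]
        # Step 5: remember the level of the hierarchy with the highest modularity.
        q = modularity(adj, comms.values())
        if q > best_q:
            best_q, best_part = q, [c[:] for c in comms.values()]
    return best_part, best_q

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
partition, q = walktrap(adj)
```

On the toy graph the modularity cut recovers the two triangles as communities.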