Title: Karsten Steinhaeuser University of Minnesota joint work with Nitesh Chawla
1. Comparing Predictive Power in Climate Data: Clustering Matters
- Karsten Steinhaeuser, University of Minnesota
- Joint work with Nitesh Chawla and Auroop Ganguly
- 12th International Symposium on Spatial and Temporal Databases, Minneapolis, MN, August 24, 2011
2. Outline
- Motivation
- Networks Primer
- From Data to Networks
- Motivating Networks in Climate Science
- Descriptive Analysis and Predictive Modeling
- Empirical Evaluation and Comparison
- Conclusions
3. Mining Complex Data
- Complex spatio-temporal data pose unique challenges
- Tobler's First Law of Geography: "Everything is related, but near things more than distant."
- But are all near things equally related?
- Are there phenomena explained by interactions among distant things? (teleconnections)
4. Networks Primer
- What is a Network?
- Oxford English Dictionary: network, n. Any netlike or complex system or collection of interrelated things, as topographical features, lines of transportation, or telecommunications routes (esp. telephone lines).
- My working definition: Any set of items that are connected or related to each other. (Items and connections can be concrete or abstract.)
5. Networks Primer
- Community Detection in Networks
- Identify groups of nodes that are relatively more tightly connected to each other than to other nodes in the network
- Computationally challenging problem for real-world networks
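The idea of groups that are "more tightly connected to each other than to other nodes" can be made concrete with a quality score. The slide does not name one here; the sketch below assumes Newman's modularity Q on an undirected, unweighted graph. The function name `modularity` and the toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)    # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    # Q = sum over communities c of [ e_c / m - (d_c / 2m)^2 ]
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
bad = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}
q_good = modularity(edges, good)
q_bad = modularity(edges, bad)
```

The partition that follows the two triangles scores higher than the one that mixes them, which is the behavior a community detection method tries to exploit.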
6. From Data to Networks
- Networks are pervasive in social science, technology, and nature
- Many datasets explicitly define network structure
- But networks can also represent other types of data, providing a framework for identifying relationships, patterns, etc.
7. Motivating Networks in Climate
- Uncertainty derives from many known and often many more unknown sources
8. Motivating Networks in Climate
- Projections of climate rely on many factors
- Understanding of the physical processes
- Ability to implement this understanding in computational models
- Assumptions about the future
Source: IPCC SRES and AR-4
9. Motivating Networks in Climate
- Some processes are well understood and modeled, others are much less credible
- Comparison to observations shows varying skill
10. Motivating Networks in Climate
- Models cannot capture some features/processes
- Comparison to observations illustrates severe geographic variability, topographic bias
11. Motivating Networks in Climate
- Research Question
- Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
- Answer: Stay Tuned
12. Knowledge Discovery for Climate
13. Historical Climate Data
- NCEP/NCAR Reanalysis (proxy for observation)
- Monthly for 60 years (1948-2007) on 5°x5° grid
- Seven variables: Sea surface temperature (SST), Sea level pressure (SLP), Geopotential height (GH), Precipitable water (PW), Relative humidity (RH), Horizontal wind speed (HWS), Vertical wind speed (VWS)
Figure: Raw Data, De-Seasonalize, Anomaly Series
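The de-seasonalizing step can be sketched as follows. The slides do not spell out the exact procedure; this assumes the common approach of subtracting each calendar month's long-term mean. The function name `monthly_anomalies` and the synthetic numbers are made up for illustration.

```python
def monthly_anomalies(series):
    """Remove the mean seasonal cycle from a monthly time series.

    series: list of floats starting in January, covering whole years.
    Returns each value minus the long-term mean of its calendar month.
    """
    if len(series) % 12 != 0:
        raise ValueError("expected whole years of monthly data")
    n_years = len(series) // 12
    # Long-term mean for each calendar month (the "climatology").
    climatology = [sum(series[m::12]) / n_years for m in range(12)]
    return [x - climatology[i % 12] for i, x in enumerate(series)]

# Synthetic data (made-up numbers): a fixed seasonal cycle plus a
# year-to-year offset that should survive de-seasonalizing.
cycle = [0, 1, 3, 6, 9, 11, 12, 11, 9, 6, 3, 1]
offsets = [-0.5, 0.0, 0.5]  # three years
raw = [c + off for off in offsets for c in cycle]
anom = monthly_anomalies(raw)
```

In this toy case the seasonal cycle cancels and only the yearly offsets remain in `anom`.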
14. Network Construction
- View global climate system as a collection of interacting oscillators [Tsonis & Roebber, 2004]
- Vertices represent locations in space
- Edges denote correlation in variability
- Link strength estimated by correlation, low-weight edges are pruned from the network
- Construct networks only for locations over the oceans
- Relatively better captured by models
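A minimal sketch of the construction described above: one vertex per location, an edge weighted by the correlation of the anomaly series, and pruning of weak edges. Using the absolute Pearson correlation and the threshold value 0.5 are assumptions of this sketch, as are the names `pearson` and `correlation_network` and the toy series.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(anomalies, threshold):
    """Build a weighted edge dict {(u, v): weight} from anomaly series.

    anomalies: dict mapping location -> anomaly series.
    Edges with |correlation| below the threshold are pruned.
    """
    nodes = sorted(anomalies)
    edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            w = abs(pearson(anomalies[u], anomalies[v]))
            if w >= threshold:
                edges[(u, v)] = w
    return edges

# Toy anomaly series for three hypothetical locations.
series = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [1, -1, 1, -1, 1],
}
edges = correlation_network(series, threshold=0.5)
```

Only the strongly correlated pair A and B keeps its edge; the weak links to C are pruned.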
15. Geographic Properties
- Examine network structure in spatial context
- Link lengths computed as great-circle distance
- Compare autocorrelation / de-correlation lengths for different variables, interpret within the domain
Figure panels: Sea Level Pressure, Precipitable Water, Vertical Wind Speed
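Great-circle link lengths can be computed with the haversine formula. This is a sketch assuming a spherical Earth with mean radius 6371 km; the slides do not say which formula or radius was used, and the function name `great_circle_km` is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp against rounding error before taking the arcsine.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# A quarter of the equator: from (0N, 0E) to (0N, 90E).
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
```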
16. Clustering Climate Networks
- Apply community detection to partition networks
- Use Walktrap algorithm [Pons & Latapy, 2006]
- Efficient and works well for dense networks
- Visualize spatial pattern using GIS tools
- Cluster structure suggests relationships within the climate system
Figure panels: Sea Level Pressure, Precipitable Water
17. Update!
- Research Question
- Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
- Revised Answer: Yes, but that's not all.
18. Descriptive → Predictive
- Network representation is able to capture interactions, reveal patterns in climate
- Validate existing assumptions / knowledge
- Suggest potentially new insights or hypotheses for climate science
- Want to extract the relationships between atmospheric dynamics over ocean and land
- i.e., learn physical phenomena from the data
19. Predictive Modeling
- Use network clusters as candidate predictors
- Create response variables for target regions around the globe (illustrated below)
- Build regression model relating ocean clusters to land climate
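The regression step can be sketched with ordinary least squares, using one averaged series per ocean cluster as predictors and a land-region series as the response. The slides only say "regression model", so plain linear regression is an assumption here; the helper names (`solve`, `fit_linear`, `predict`) and all numbers are made up for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear(predictors, response):
    """Least-squares fit with intercept via the normal equations.

    predictors: one row per time step, one column per cluster series.
    Returns [intercept, coef_1, ..., coef_p].
    """
    rows = [[1.0] + list(r) for r in predictors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(p)]
    return solve(xtx, xty)

def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))

# Made-up data: two cluster series; response = 1 + 2*c1 - 0.5*c2 exactly.
clusters = [[0, 1], [1, 0], [2, 3], [3, 1], [4, 4], [5, 2]]
land = [0.5, 3.0, 3.5, 6.5, 7.0, 10.0]
coef = fit_linear(clusters, land)
forecast = predict(coef, [6, 0])
```

The normal equations are fine for a handful of predictors; with many correlated cluster series a numerically sturdier solver from a linear algebra library would be the safer choice.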
20. Illustrative Example
- Predictive model for air temperature in Peru
- Long-term variability highly predictable due to well-documented relation to El Niño
- Small number of clusters have majority of skill
- Feature selection (blue line) improves predictions
Figure legend: Raw Data, All Clusters, Feature Selection
21. Update!
- Research Question
- Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
- Revised Answer: Yes and Yes, but wait, there's more.
22. Results on Train/Test
23. Predictive Skill
24. Update!
- Research Question
- Can we characterize the credible variables, identify relationships to the relatively less credible variables, and leverage them to improve or refine our understanding?
- Revised Answer: Yes, Yes, and Yes.
25. Variations / Extensions
- Compare network approach to traditional clustering methods
- k-means, k-medoids, spectral, EM, etc.
- Compare different types of predictive models
- (linear) regression, regression trees, neural nets, support vector regression
26. Compare Clustering Methods
27. Compare Predictive Models
28. Refining Model Projections
29. Conclusions
- Networks capture behavior of the climate system
- Clusters (or communities) derived from these networks have useful predictive skill
- Statistically significantly better than predictors based on clusters derived using traditional methods
- Potential for advancing climate science
- Understanding of physical processes
- Complement climate model simulations
30. Upcoming Events
- First International Workshop on Climate Informatics, New York, NY, August 26, 2011, http://www.nyas.org/climateinformatics
- NASA Conference on Intelligent Data Understanding (CIDU), Mountain View, CA, Oct 19-21, 2011, https://c3.ndc.nasa.gov/dashlink/projects/43/
- IEEE ICDM Workshop on Knowledge Discovery from Climate Data, Vancouver, Canada, December 10, 2011, http://www.nd.edu/dial/climkd11/
31. Thanks & Questions
- Contact
- ksteinha_at_umn.edu
- Personal Homepage
- http://www.nd.edu/ksteinha
- NSF Expeditions on Understanding Climate Change
- http://climatechange.cs.umn.edu
- This work was supported in part by the National Science Foundation under Grants OCI-1029584 and BCS-0826958. This research was also funded in part by the project entitled "Uncertainty Assessment and Reduction for Climate Extremes and Climate Change Impacts" under the initiative "Understanding Climate Change Impact: Energy, Carbon, and Water Initiative" within the Laboratory Directed Research and Development (LDRD) Program of the Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract DE-AC05-00OR22725.
32. K-Means
- Lloyd's Algorithm
- 1. Randomly select k points as initial means
- 2. Associate each data point with the closest mean (closest by some distance, e.g., Euclidean)
- 3. Calculate the new means to be the centroid of the data points associated with each cluster
- 4. Repeat steps 2 and 3 until there is no change in the cluster centers (or less than some ε)
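The steps above can be sketched in plain Python as follows. The fixed random seed, the iteration cap, and the handling of empty clusters are assumptions of this sketch; the names `sq_dist` and `kmeans` and the toy points are illustrative.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial means.
    means = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, means[c]))
            clusters[j].append(p)
        # Step 3: recompute each mean as the centroid of its cluster.
        new_means = []
        for j, members in enumerate(clusters):
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_means.append(means[j])  # keep the mean of an empty cluster
        # Step 4: stop once the means move by no more than the tolerance.
        shift = max(sq_dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if shift <= tol:
            break
    labels = [min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points]
    return means, labels

# Toy data (made up): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, labels = kmeans(points, k=2)
```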
33. K-Medoids
- PAM Algorithm
- 1. Randomly select k of the n data points as the medoids
- 2. Associate each data point to the closest medoid (closest by some distance, e.g., Euclidean)
- 3. For each medoid m, for each non-medoid data point o:
- 4. Swap m and o and compute the total cost of the configuration
- 5. Select the configuration with the lowest cost
- 6. Repeat steps 2 to 5 until there is no change in the medoids
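A compact sketch of the swap procedure described above. It evaluates every single medoid/non-medoid swap per round and keeps the cheapest one, stopping when no swap lowers the total cost. The seed, the names `total_cost` and `pam`, and the toy points are assumptions made for illustration.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def total_cost(points, medoids):
    """Sum of distances from each point to its closest medoid (by index)."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # step 1: random medoids
    cost = total_cost(points, medoids)           # step 2: implicit assignment
    improved = True
    while improved:                              # step 6: repeat until stable
        improved = False
        best_cost, best_medoids = cost, medoids
        for i in range(k):                       # step 3: each medoid m ...
            for o in range(len(points)):         # ... and each non-medoid o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]  # step 4: swap
                c = total_cost(points, candidate)
                if c < best_cost:                # step 5: keep the cheapest
                    best_cost, best_medoids = c, candidate
        if best_cost < cost:
            cost, medoids = best_cost, best_medoids
            improved = True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Toy data (made up): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = pam(points, k=2)
```

Unlike k-means, the cluster centers returned here are always actual data points (given as indices into `points`).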
34. Spectral
- Given the similarity matrix S, spectral clustering relies on the spectrum (eigenvalues and eigenvectors) of S to cluster the data in lower-dimensional space.
- 1. Compute the k leading eigenvectors (those with the smallest eigenvalues) of the graph Laplacian $L = I - D^{-1/2} S D^{-1/2}$
- where D is the vertex degree matrix
- 2. Project each data point into the new space
- 3. Apply simple clustering (e.g., k-means)
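As a small illustration of step 1, the normalized graph Laplacian can be built directly from the similarity matrix. This sketch covers only that construction and assumes a symmetric S with positive row sums; the eigendecomposition and the k-means step would normally be handed to a linear algebra library and are omitted. The function name and the toy matrix are illustrative.

```python
import math

def normalized_laplacian(s):
    """Return L = I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix.

    s: list of lists; D is diagonal with D[i][i] = sum_j S[i][j].
    """
    n = len(s)
    deg = [sum(row) for row in s]
    if any(d <= 0 for d in deg):
        raise ValueError("every vertex needs positive degree")
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * s[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

# Toy similarity matrix (hypothetical): a 3-node path graph 0 - 1 - 2.
S = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = normalized_laplacian(S)

# Sanity check: the vector D^{1/2} * 1 lies in the null space of L.
v = [math.sqrt(sum(row)) for row in S]
Lv = [sum(L[i][j] * v[j] for j in range(3)) for i in range(3)]
```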
35. EM
- Given a set of points X, latent variables Z, and a statistical model (e.g., Gaussian) with corresponding set of unknown parameters T
- E-Step: Compute the expected value of the log-likelihood with respect to the distribution of Z given X and the current T (i.e., probability of each datum belonging to a given component of the mixture)
- M-Step: Find the parameters T that maximize this quantity (i.e., tweak the parameters to maximize the probabilities)
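For a concrete instance of the two steps, here is a sketch of EM for a one-dimensional mixture of two Gaussians, where the parameter set T consists of the component means, variances, and mixing weights. The fixed iteration count, the variance floor, the initial guesses, and the data are all assumptions of this sketch.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, mu, var=(1.0, 1.0), weight=(0.5, 0.5), n_iter=50):
    """EM for a 1-D mixture of two Gaussians; returns (means, variances, weights)."""
    mu, var, weight = list(mu), list(var), list(weight)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, 1e-6)  # floor keeps the variance positive
    return mu, var, weight

# Made-up data: five points near -2 and five points near 3.
xs = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.2, 2.8, 3.1, 2.9]
mu, var, weight = em_two_gaussians(xs, mu=(-1.0, 1.0))
```

A production version would monitor the log-likelihood for convergence instead of running a fixed number of iterations.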
36. Walktrap
- 1. Find the similarity between each node and its neighbors using random walks
- 2. Merge the two closest nodes (or communities)
- 3. Update the similarities (only requires a computation for the merged communities)
- 4. Repeat steps 2 and 3 until all nodes are in a single, giant community (hierarchical clustering)
- 5. Decide where to cut the hierarchy (dendrogram) using modularity or other criteria (cluster size, a priori knowledge, etc.)
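Step 1 can be illustrated with a random-walk distance between nodes. The sketch below assumes the degree-normalized distance between t-step walk distributions, r(i, j) = sqrt(sum_k (P^t[i][k] - P^t[j][k])^2 / d(k)), on an unweighted graph with a self-loop added to every node. It covers only the distance computation, not the merging or the dendrogram cut; the walk length t = 3, the function names, and the toy graph are assumptions.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def walk_distances(adj, t=3):
    """Pairwise random-walk distances from an adjacency matrix (list of lists)."""
    n = len(adj)
    # Assumption of this sketch: add a self-loop to every node.
    adj = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in adj]
    # One-step transition probabilities P[i][j] = A[i][j] / d(i).
    p = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    pt = p
    for _ in range(t - 1):  # t-step transition probabilities P^t
        pt = mat_mul(pt, p)

    def r(i, j):
        return math.sqrt(sum((pt[i][k] - pt[j][k]) ** 2 / deg[k]
                             for k in range(n)))

    return [[r(i, j) for j in range(n)] for i in range(n)]

# Toy graph (hypothetical): two triangles joined by the edge 2 - 3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
d = walk_distances(A)
```

Nodes inside the same triangle end up much closer to each other than nodes in different triangles, which is what drives the merge order in step 2.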