Karsten Steinhaeuser University of Minnesota joint work with Nitesh Chawla - PowerPoint PPT Presentation

Comparing Predictive Power in Climate Data: Clustering Matters. Karsten Steinhaeuser, University of Minnesota; joint work with Nitesh Chawla & Auroop Ganguly.

Transcript and Presenter's Notes

1
Comparing Predictive Power in Climate Data
Clustering Matters
  • Karsten Steinhaeuser, University of Minnesota;
    joint work with Nitesh Chawla & Auroop Ganguly
  • 12th International Symposium on Spatial and
    Temporal Databases, Minneapolis, MN, August 24,
    2011

2
Outline
  • Motivation
  • Networks Primer
  • From Data to Networks
  • Motivating Networks in Climate Science
  • Descriptive Analysis and Predictive Modeling
  • Empirical Evaluation & Comparison
  • Conclusions

3
Mining Complex Data
  • Complex spatio-temporal data pose unique
    challenges
  • Tobler's First Law of Geography: Everything is
    related, but near things more than distant.
  • But are all near things equally related?
  • Are there phenomena explained by interactions
    among distant things? (teleconnections)

4
Networks Primer
  • What is a Network?
  • Oxford English Dictionary: network, n. Any
    netlike or complex system or collection of
    interrelated things, as topographical
    features, lines of transportation, or
    telecommunications routes (esp. telephone
    lines).
  • My working definition: Any set of items that are
    connected or related to each other. (Items and
    connections can be concrete or abstract.)

5
Networks Primer
  • Community Detection in Networks
  • Identify groups of nodes that are relatively more
    tightly connected to each other than to other
    nodes in the network
  • Computationally challenging problem for
    real-world networks

6
From Data to Networks
  • Networks are pervasive in social science,
    technology, and nature
  • Many datasets explicitly define network
    structure
  • But networks can also represent other types of
    data, serving as a framework for identifying
    relationships, patterns, etc.

7
Motivating Networks in Climate
  • Uncertainty derives from many known and often
    many more unknown sources

8
Motivating Networks in Climate
  • Projections of climate rely on many factors
  • Understanding of the physical processes
  • Ability to implement this understanding
    in computational models
  • Assumptions about the future

Source: IPCC SRES and AR-4
9
Motivating Networks in Climate
  • Some processes well-understood and modeled,
    others much less credible
  • Comparison to observations shows varying skill

10
Motivating Networks in Climate
  • Models cannot capture some features/processes
  • Comparison to observations illustrates severe
    geographic variability, topographic bias

11
Motivating Networks in Climate
  • Research Question
  • Can we characterize the credible variables,
    identify relationships to the relatively less
    credible variables, and leverage them to improve
    or refine our understanding?
  • Answer: Stay tuned.

12
Knowledge Discovery for Climate
13
Historical Climate Data
  • NCEP/NCAR Reanalysis (proxy for observation)
  • Monthly for 60 years (1948-2007) on 5° x 5° grid
  • Seven variables: sea surface temperature (SST),
    sea level pressure (SLP), geopotential height
    (GH), precipitable water (PW), relative humidity
    (RH), horizontal wind speed (HWS), vertical wind
    speed (VWS)

Raw Data → De-Seasonalize → Anomaly Series
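
The slide does not say exactly how the series are de-seasonalized. A minimal sketch, assuming the common approach of subtracting the long-term mean of each calendar month; the function name `monthly_anomalies` and the toy data are hypothetical:

```python
def monthly_anomalies(series):
    """Remove the mean seasonal cycle from a monthly series.

    Assumes `series` starts in January and covers at least 12 months.
    """
    # Long-term mean for each of the 12 calendar months.
    month_means = []
    for m in range(12):
        values = series[m::12]
        month_means.append(sum(values) / len(values))
    # Anomaly = observation minus the long-term mean of its calendar month.
    return [x - month_means[i % 12] for i, x in enumerate(series)]


# Toy data: two years with the same seasonal cycle, year two shifted by +1.0.
cycle = [10, 11, 13, 16, 20, 24, 26, 25, 22, 18, 14, 11]
raw = [float(c) for c in cycle] + [c + 1.0 for c in cycle]
anomalies = monthly_anomalies(raw)
```

In the toy data the seasonal cycle cancels, leaving only the year-to-year shift in the anomaly series.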
14
Network Construction
  • View global climate system as a collection of
    interacting oscillators [Tsonis & Roebber, 2004]
  • Vertices represent locations in space
  • Edges denote correlation in variability
  • Link strength estimated by correlation,
    low-weight edges are pruned from the network
  • Construct networks only for locations over the
    oceans
  • Relatively better captured by models
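
The construction above can be sketched as follows. This is an illustration only: the use of absolute Pearson correlation as the edge weight, the threshold value 0.5, and the names `pearson` and `build_network` are assumptions, not details taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_network(series_by_location, threshold=0.5):
    """Return {(u, v): weight} for location pairs whose |correlation| >= threshold.

    The threshold is a hypothetical value; low-weight edges are pruned.
    """
    nodes = sorted(series_by_location)
    edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            w = abs(pearson(series_by_location[u], series_by_location[v]))
            if w >= threshold:
                edges[(u, v)] = w
    return edges


# Toy anomaly series for three hypothetical grid cells.
series = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],   # moves with "a"
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # essentially unrelated to "a" and "b"
}
edges = build_network(series)
```

Only the strongly correlated pair survives pruning, so the toy network has a single edge between "a" and "b".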

15
Geographic Properties
  • Examine network structure in spatial context
  • Link lengths computed as great-circle distance
  • Compare autocorrelation / de-correlation lengths
    for different variables, interpret within the
    domain
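
Great-circle link lengths can be computed with the haversine formula. A minimal sketch, assuming a spherical Earth with mean radius 6371 km; the function name `great_circle_km` is hypothetical:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator: two points 90 degrees of longitude apart.
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
```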

Panels: Sea Level Pressure, Precipitable Water, Vertical Wind Speed
16
Clustering Climate Networks
  • Apply community detection to partition networks
  • Use Walktrap algorithm [Pons & Latapy, 2006]
  • Efficient and works well for dense networks
  • Visualize spatial pattern using GIS tools
  • Cluster structure suggests relationships within
    the climate system

Panels: Sea Level Pressure, Precipitable Water
17
Update!
  • Research Question
  • Can we characterize the credible variables,
    identify relationships to the relatively less
    credible variables, and leverage them to improve
    or refine our understanding?
  • Revised Answer: Yes, but that's not all.

18
Descriptive → Predictive
  • Network representation is able to capture
    interactions, reveal patterns in climate
  • Validate existing assumptions / knowledge
  • Suggest potentially new insights or
    hypotheses for climate science
  • Want to extract the relationships between
    atmospheric dynamics over ocean and land
  • i.e., Learn physical phenomena from the data

19
Predictive Modeling
  • Use network clusters as candidate predictors
  • Create response variables for target regions
    around the globe (illustrated below)
  • Build regression model relating ocean clusters
    to land climate
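
The slide does not specify the regression setup. As a rough sketch, assume each ocean cluster is summarized by one predictor series (for example its mean anomaly) and the land response is fit by ordinary least squares through the normal equations. The names `solve_linear_system` and `fit_linear_regression` and the toy numbers are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_regression(predictors, response):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k].

    `predictors` holds one row per time step and one value per cluster.
    """
    rows = [[1.0] + list(p) for p in predictors]  # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data: two cluster predictors, response = 1 + 2*x1 - 0.5*x2.
cluster_series = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
land_response = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in cluster_series]
coefficients = fit_linear_regression(cluster_series, land_response)
```

Solving the normal equations directly is fine for a handful of predictors; with many correlated clusters a more numerically stable solver would be preferable.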

20
Illustrative Example
  • Predictive model for air temperature in Peru
  • Long-term variability highly predictable due
    to well-documented relation to El Niño
  • A small number of clusters account for the
    majority of skill
  • Feature selection (blue line) improves predictions

Legend: Raw Data, All Clusters, Feature Selection
21
Update!
  • Research Question
  • Can we characterize the credible variables,
    identify relationships to the relatively less
    credible variables, and leverage them to improve
    or refine our understanding?
  • Revised Answer: Yes and Yes, but wait, there's
    more.

22
Results on Train/Test
23
Predictive Skill
24
Update!
  • Research Question
  • Can we characterize the credible variables,
    identify relationships to the relatively less
    credible variables, and leverage them to improve
    or refine our understanding?
  • Revised Answer: Yes, Yes, and Yes.

25
Variations / Extensions
  • Compare network approach to traditional
    clustering methods
  • k-means, k-medoids, spectral, EM, etc.
  • Compare different types of predictive models
  • (linear) regression, regression trees, neural
    nets, support vector regression

26
Compare Clustering Methods
27
Compare Predictive Models
28
Refining Model Projections
29
Conclusions
  • Networks capture behavior of the climate system
  • Clusters (or communities) derived from these
    networks have useful predictive skill
  • Statistically significantly better than
    predictors based on clusters derived using
    traditional methods
  • Potential for advancing climate science
  • Understanding of physical processes
  • Complement climate model simulations

30
Upcoming Events
  1. First International Workshop on Climate
    Informatics, New York, NY, August 26, 2011
    http://www.nyas.org/climateinformatics
  2. NASA Conference on Intelligent Data Understanding
    (CIDU), Mountain View, CA, Oct 19-21, 2011
    https://c3.ndc.nasa.gov/dashlink/projects/43/
  3. IEEE ICDM Workshop on Knowledge Discovery from
    Climate Data, Vancouver, Canada, December 10, 2011
    http://www.nd.edu/dial/climkd11/

31
Thanks! Questions?
  • Contact
  • ksteinha_at_umn.edu
  • Personal Homepage
  • http://www.nd.edu/ksteinha
  • NSF Expeditions on Understanding Climate Change
  • http://climatechange.cs.umn.edu
  • This work was supported in part by the National
    Science Foundation under Grants OCI-1029584 and
    BCS-0826958. This research was also funded in
    part by the project entitled "Uncertainty
    Assessment and Reduction for Climate Extremes and
    Climate Change Impacts" under the initiative
    "Understanding Climate Change Impact: Energy,
    Carbon, and Water Initiative" within the
    Laboratory Directed Research and Development
    (LDRD) Program of the Oak Ridge National
    Laboratory, managed by UT-Battelle, LLC for the
    U.S. Department of Energy under Contract
    DE-AC05-00OR22725.

32
K-Means
  • Lloyd's Algorithm
  1. Randomly select k points as initial means
  2. Associate each data point with the closest
    mean (closest by some distance, e.g.,
    Euclidean)
  3. Calculate the new means to be the centroid of
    the data points associated with each cluster
  4. Repeat steps 2 and 3 until there is no change in
    the cluster centers (or the change is less than
    some tolerance ε)
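
The steps of Lloyd's algorithm can be sketched in plain Python. This is an illustrative version using squared Euclidean distance; the names `kmeans`, `sq_dist`, `centroid`, the parameters `max_iter`, `tol`, `seed`, and the toy points are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iter=100, tol=1e-12, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # random initial means
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, means[c]))
            clusters[nearest].append(p)
        # Recompute each mean as the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        new_means = [centroid(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        shift = max(sq_dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if shift <= tol:  # stop once the centers no longer move
            break
    return means, clusters


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 5.0), (11.0, 5.0), (12.0, 5.0)]
means, clusters = kmeans(points, k=2)
```

Lloyd's algorithm only finds a local optimum, so results can depend on the random initialization; the toy points here are separated well enough that the two groups are recovered.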

33
K-Medoids
  • PAM Algorithm
  1. Randomly select k of the n data points as the
    medoids
  2. Associate each data point with the closest
    medoid (closest by some distance, e.g.,
    Euclidean)
  3. For each medoid m and each non-medoid data
    point o, swap m and o and compute the total
    cost of the configuration
  4. Select the configuration with the lowest cost
  5. Repeat steps 2 to 4 until there is no change in
    the medoids
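
A compact sketch of the PAM swap loop, assuming the total cost is the sum of distances from each point to its nearest medoid. The names `total_cost` and `pam`, the `seed` parameter, and the one-dimensional toy data are hypothetical.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of distances from each point to its closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)  # random initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        best_swap = None
        # Try swapping every medoid with every non-medoid point.
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:
                    best, best_swap = cost, candidate
        # Keep the lowest-cost configuration; stop when no swap helps.
        if best_swap is not None:
            medoids, improved = best_swap, True
    return medoids, best


values = [1, 2, 3, 10, 11, 12]
medoids, cost = pam(values, k=2, dist=lambda a, b: abs(a - b))
```

Because the cost strictly decreases with every accepted swap, the loop is guaranteed to terminate.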

34
Spectral
  • Given the similarity matrix S, spectral
    clustering relies on the spectrum (set of
    eigenvectors) of S to cluster the data in
    lower-dimensional space.
  1. Compute the k leading eigenvectors of the graph
    Laplacian $L = I - D^{-1/2} S D^{-1/2}$,
    where D is the vertex degree matrix
  2. Project each data point into the new space
  3. Apply simple clustering (e.g., k-means)
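
Written out element by element, the normalized Laplacian on this slide is as follows. The two-node example is only an illustration, not taken from the talk.

```latex
D_{ii} = \sum_{j} S_{ij}, \qquad
L = I - D^{-1/2} S D^{-1/2}, \qquad
L_{ij} = \delta_{ij} - \frac{S_{ij}}{\sqrt{D_{ii} D_{jj}}}

% Example: two nodes joined by one edge of weight 1
S = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
L = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \quad
\text{eigenvalues } 0 \text{ and } 2
```

Since $L v = \lambda v$ is equivalent to $D^{-1/2} S D^{-1/2} v = (1 - \lambda) v$, the eigenvectors of L with the smallest eigenvalues are the eigenvectors of the normalized similarity matrix with the largest eigenvalues.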

35
EM
  • Given a set of points X, latent variables Z, and
    a statistical model (e.g., Gaussian) with
    corresponding set of unknown parameters θ
  • E-Step: Compute expected value of the
    log-likelihood of Z with respect to X (i.e.,
    probability of each datum belonging to a given
    component of the mixture)
  • M-Step: Find the parameters θ that maximize this
    quantity (i.e., tweak the parameters to maximize
    the probabilities)
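
The two steps can be illustrated for a one-dimensional Gaussian mixture. This is a sketch under stated assumptions: the function names, the fixed iteration count, the variance floor, and the toy data and starting values are all hypothetical choices.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: probability of each datum belonging to each component.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing component
    return means, variances, weights


data = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
fit_means, fit_vars, fit_weights = em_gmm_1d(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

With two well-separated groups the fitted means settle close to the group averages and the mixture weights close to one half each.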

36
Walktrap
  1. Find the similarity between each node and its
    neighbors using random walks
  2. Merge the two closest nodes (or communities)
  3. Update the similarities (only requires a
    computation for the merged communities)
  4. Repeat steps 2 and 3 until all nodes are in a
    single, giant community (hierarchical clustering)
  5. Decide where to cut the hierarchy
    (dendrogram) using modularity or other criteria
    (cluster size, a priori knowledge, etc.)
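
A simplified sketch of the procedure above, not the optimized algorithm of Pons & Latapy. It makes several assumptions: the graph is unweighted, every node gets a self-loop, walks have length t = 4, all distances are recomputed after each merge rather than updated incrementally, and merging simply stops at a requested number of communities instead of cutting the dendrogram by modularity. All names are hypothetical.

```python
def walktrap(adj, t=4, n_communities=2):
    """adj: dict mapping each node to the set of its neighbours (undirected)."""
    nodes = sorted(adj)
    n = len(nodes)
    nbrs = {v: set(adj[v]) | {v} for v in nodes}  # add self-loops (assumption)
    deg = {v: len(nbrs[v]) for v in nodes}

    def walk_distribution(start):
        # Probability of being at each node after t random-walk steps.
        prob = {start: 1.0}
        for _ in range(t):
            nxt = {}
            for v, p in prob.items():
                share = p / deg[v]
                for u in nbrs[v]:
                    nxt[u] = nxt.get(u, 0.0) + share
            prob = nxt
        return prob

    # Each community stores (members, average walk distribution of members).
    comms = {i: ([v], walk_distribution(v)) for i, v in enumerate(nodes)}

    def delta_sigma(a, b):
        # Ward-style merge cost based on the random-walk distance.
        (ma, pa), (mb, pb) = comms[a], comms[b]
        r2 = sum((pa.get(v, 0.0) - pb.get(v, 0.0)) ** 2 / deg[v] for v in nodes)
        return len(ma) * len(mb) / (len(ma) + len(mb)) * r2 / n

    def adjacent(a, b):
        members_b = set(comms[b][0])
        return any(nbrs[v] & members_b for v in comms[a][0])

    next_id = n
    while len(comms) > n_communities:
        pairs = [(delta_sigma(a, b), a, b)
                 for a in comms for b in comms if a < b and adjacent(a, b)]
        if not pairs:  # remaining communities are disconnected
            break
        _, a, b = min(pairs)  # merge the two closest adjacent communities
        (ma, pa), (mb, pb) = comms.pop(a), comms.pop(b)
        size = len(ma) + len(mb)
        merged = {v: (len(ma) * pa.get(v, 0.0) + len(mb) * pb.get(v, 0.0)) / size
                  for v in nodes}
        comms[next_id] = (ma + mb, merged)
        next_id += 1
    return sorted(sorted(members) for members, _ in comms.values())


# Two triangles joined by a single edge between nodes 2 and 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = walktrap(graph)
```

On this toy graph the merges recover the two triangles as separate communities.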