Discovery of Patterns in the Global Climate System using Data Mining

1
Discovery of Patterns in the Global Climate
System using Data Mining
  • Vipin Kumar
  • Army High Performance Computing Research Center
  • Department of Computer Science
  • University of Minnesota http://www.cs.umn.edu/~kumar
  • Collaborators
  • G. Karypis, S. Shekhar, M. Steinbach, P.N. Tan
    (AHPCRC),
  • C. Potter (NASA Ames Research Center),
  • S. Klooster (California State University,
    Monterey Bay).
  • This work was partially funded by NASA and Army
    High Performance Computing Center

2
Research Goals
  • Find global climate patterns of interest to
    Earth Scientists

A key interest is finding connections between the
ocean and the land.
  • Global snapshots of values for a number of
    variables on land surfaces or water.
  • Monthly over a range of 10 to 50 years.

3
Sources of Earth Science Data
  • Before 1950, very sparse, unreliable data.
  • Since 1950, reliable global data.
  • Ocean temperature and pressure are based on data
    from ships.
  • Most land data (solar, precipitation,
    temperature and pressure) comes from weather
    stations.
  • Since 1981, data has been available from earth
    orbiting satellites.
  • FPAR, a measure related to plants and greenness
  • Since 1999 TERRA, the flagship of the NASA earth
    observing system, is providing much more detailed
    data.

4
Importance of Global Climate Patterns
  • The climate of the Earth's land surface is
    strongly influenced by the behavior of the
    Earth's oceans.
  • El Nino is the anomalous warming of the eastern
    tropical region of the Pacific.
  • Associated with droughts in Australia and
    Southern Africa and heavy rainfall along the
    western coast of South America.

Figure: El Nino events, shown as sea surface temperature anomalies off Peru (ANOM 1+2)
5
Importance of Global Climate Patterns and NPP
  • Net Primary Production (NPP) is the net
    assimilation of atmospheric carbon dioxide (CO2)
    into organic matter by plants.
  • NPP is driven by solar radiation and can be
    constrained by precipitation and temperature.
  • Keeping track of NPP is important because it
    includes the food source of humans and all other
    organisms.
  • Sudden changes in the NPP of a region can have a
    direct impact on the regional ecology.
  • NPP is impacted by global climate patterns.
  • Precipitation and temperature are directly
    affected by global climate patterns such as El
    Nino.
  • Solar radiation is affected indirectly by
    cloudiness.

6
Role of Statistics and Data Mining
  • Previously Earth scientists have relied on
    statistical techniques.
  • Hypothesize-and-test paradigm is extremely
    labor-intensive.
  • Data mining provides Earth scientists with tools
    that allow them to spend more time choosing and
    exploring interesting families of hypotheses.
  • By applying the proposed data mining techniques,
    some of the steps of hypothesis generation and
    evaluation will be automated, facilitated and
    improved.
  • However, statistics is needed to provide methods
    for determining the statistical significance of
    results.

7
Patterns of Interest
  • Zone Formation
  • Find regions of the land or ocean which have
    similar behavior.
  • Teleconnections
  • Teleconnections are the simultaneous variation in
    climate and related processes over widely
    separated points on the Earth.
  • Associations
  • Find relations between climate events and land
    cover.
  • River Discharge
  • Relationship between water discharged from a
    river and precipitation, climate, and man.

8
Clustering for Zone Formation
  • Interested in relationships between regions, not
    points.
  • For ocean, clustering based on SST (Sea Surface
    Temperature) or SLP (Sea Level Pressure).
  • For land, clustering based on NPP or other
    variables, e.g., precipitation, temperature.
  • Typically we work with the points.
  • When raw NPP and SST are used, clustering can
    find seasonal patterns.
  • Anomalous regions have plant growth patterns
    that are reversed from those typically observed
    in the hemisphere in which they reside, and are
    easy to spot.

9
K-Means Clustering of Raw NPP and Raw SST (Num
clusters = 2)
10
K-Means Clustering of Raw NPP and Raw SST (Num
clusters = 2)
Land Cluster Cohesion: North = 0.78, South = 0.59
Ocean Cluster Cohesion: North = 0.77, South = 0.80
11
K-Means Clustering of Raw NPP and Raw SST (Num
clusters = 6)
12
Preprocessing
  • Time series preprocessing issues
  • Need to remove seasonality
  • Earth scientists are mostly interested in anomalies
  • Need to remove most of the autocorrelation
  • Statistical tests are affected
  • Need to remove trends
  • Normally want to detect patterns and trends
    separately
  • Normally interested in similarity once
    differences in means and scale have been
    considered.
  • Pearson's correlation coefficient has this
    property
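
The last point can be checked numerically. The following is a minimal sketch in plain Python; the `pearson` helper and the sample values are illustrative assumptions, not part of the original slides. It shows that shifting the mean of a series and rescaling it by a positive factor leaves Pearson's correlation unchanged.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up sample values, for illustration only.
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 2.5, 3.5, 6.0, 4.5]

# Change the mean (+100) and the scale (x10) of y.
y_shifted_scaled = [10.0 * v + 100.0 for v in y]

r_original = pearson(x, y)
r_transformed = pearson(x, y_shifted_scaled)
```

Here `r_original` and `r_transformed` agree up to floating-point rounding.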

13
Sample NPP Time Series
Correlations between time series
14
Seasonality Accounts for Much Correlation
Normalized using monthly Z score: subtract off the
monthly mean and divide by the monthly standard
deviation.
Correlations between time series
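
The monthly Z score described on this slide can be sketched as follows. This is a minimal illustration, assuming the series starts in January, has at least one value for every calendar month, and that the population standard deviation is used (the slides do not say which variant was applied). The function name `monthly_z_score` is hypothetical.

```python
import math

def monthly_z_score(series):
    """Remove seasonality from a monthly time series.

    series: list of monthly values; index 0 is assumed to be a January.
    Each value is replaced by (value - mean of its calendar month)
    divided by the standard deviation of that calendar month.
    """
    z = [0.0] * len(series)
    for month in range(12):
        values = series[month::12]  # all Januaries, all Februaries, ...
        if not values:
            continue
        mean = sum(values) / len(values)
        # Population standard deviation (an assumption).
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        for k, v in enumerate(values):
            z[month + 12 * k] = (v - mean) / std if std > 0 else 0.0
    return z

# Synthetic example: a seasonal cycle plus a different offset each year.
series = [10.0 * math.sin(2 * math.pi * m / 12) + year
          for year in range(5) for m in range(12)]
z = monthly_z_score(series)
```

After normalization the seasonal cycle is gone: all months of the same year in this synthetic series get the same Z score.
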
15
Removing Seasonality Removes Most Autocorrelation
16
Preprocessing: Removing Trends
A slight linear trend added to two random time
series increases their correlation dramatically,
from 0.01 to 0.17.
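
One common way to remove a linear trend is to subtract a least-squares straight line fitted against the time index. The sketch below assumes that approach; the slides do not state which detrending method was used, and `detrend` is a hypothetical helper name.

```python
def detrend(series):
    """Subtract the least-squares straight line fitted against the time index.

    Requires at least two values.
    """
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]

# A pure linear trend is removed completely.
residual_line = detrend([3.0 + 2.0 * t for t in range(10)])

# A trend plus an alternating signal keeps only the zero-mean signal.
residual_mixed = detrend([0.5 * t + (1.0 if t % 2 else -1.0) for t in range(100)])
```
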
17
Ocean Climate Indices: Connecting the Ocean and
the Land
  • An OCI is a time series of temperature or
    pressure
  • Based on Sea Surface Temperature (SST) or Sea
    Level Pressure (SLP)
  • OCIs are important because
  • They distill climate variability at a regional or
    global scale into a single time series.
  • They are well-accepted by Earth scientists.
  • They are related to well-known climate phenomena
    such as El Niño.

18
Ocean Climate Indices: ANOM 1+2
  • ANOM 1+2 is associated with El Niño and La Niña.
  • Defined as the Sea Surface Temperature (SST)
    anomalies in a region off the coast of Peru
  • El Nino is associated with
  • Droughts in Australia and Southern Africa
  • Heavy rainfall along the western coast of South
    America
  • Milder winters in the Midwest

El Nino Events
19
Connection of ANOM 1+2 to Land Temp
  • OCIs capture teleconnections, i.e., the
    simultaneous variation in climate and related
    processes over widely separated points on the
    Earth.

20
Ocean Climate Indices - NAO
  • The North Atlantic Oscillation (NAO) is
    associated with climate variation in Europe and
    North America.
  • Normalized pressure differences between Ponta
    Delgada, Azores and Stykkisholmur, Iceland.
  • Associated with warm and wet winters in Europe
    and with cold and dry winters in northern Canada
    and Greenland
  • The eastern US experiences mild and wet winter
    conditions.

Map labels: Iceland, Azores
21
Connection of NAO to Land Temp
22
Influence of OCI on Land: Area Weighted
Correlation
  • Correlation of an OCI with a land variable is a
    standard way to evaluate its influence.
  • Correlation does not imply causality.
  • Temperature and precipitation are the typical
    land variables.
  • If relatively many land points have a relatively
    high correlation, then an OCI is influential.
  • To evaluate whether clusters (or pairs) are
    potential OCIs we compute their area weighted
    correlation.
  • Weighted average of the correlation with land
    points, where weight is based on area.
  • May exclude points whose correlation is low and
    then calculate area weighted correlation.
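
A minimal sketch of an area weighted correlation follows. Several details are assumptions not stated on the slide: the weight of a land point is taken as the cosine of its latitude (appropriate for a regular latitude/longitude grid), the absolute value of the correlation is averaged, and points below the threshold contribute zero while still counting toward the total area. The function name is hypothetical.

```python
import math

def area_weighted_correlation(correlations, latitudes_deg, threshold=0.0):
    """Area weighted average of |correlation| over land points.

    correlations[i]: correlation of land point i with the OCI.
    latitudes_deg[i]: latitude of land point i in degrees.
    Points with |correlation| below `threshold` contribute zero but
    still count toward the total area (an assumption).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, lat in zip(correlations, latitudes_deg):
        weight = math.cos(math.radians(lat))  # assumed area weight
        total_weight += weight
        if abs(r) >= threshold:
            weighted_sum += weight * abs(r)
    return weighted_sum / total_weight if total_weight > 0 else 0.0

all_points = area_weighted_correlation([0.8, -0.2], [0.0, 0.0])
strong_only = area_weighted_correlation([0.8, -0.2], [0.0, 0.0], threshold=0.5)
```

With this convention, raising the threshold can only lower the result, because excluded points stop contributing while the total area stays fixed.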

23
Evaluation of Known OCIs via Area Weighted
Correlation
Area Weighted Correlation of Known OCIs to Land
Temp Overlapping, threshold 0
24
Evaluation of Known OCIs via Area Weighted
Correlation
Area weighted correlation declines as we consider
only land points whose temperature correlates
with the OCI above a given threshold.
25
Discovering OCIs via Data Mining
  • Earth scientists have discovered the currently
    known OCIs by means of
  • Observation
  • Eigenvalue techniques such as Principal
    Components Analysis (PCA) and Singular Value
    Decomposition (SVD).
  • Clustering provides an alternative approach.
  • Clusters represent ocean regions with relatively
    homogeneous behavior.
  • The centroids of these clusters are time series
    that summarize the behavior of these ocean areas,
    and thus, represent potential OCIs.
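
As a sketch of the centroid idea, the function below averages the time series of all points in each cluster, giving one candidate index per cluster. The cluster labels are assumed to come from some earlier clustering step, and `cluster_centroids` is a hypothetical name.

```python
def cluster_centroids(series_by_point, labels):
    """Average the time series of the points in each cluster.

    series_by_point: list of equal-length time series, one per ocean point.
    labels: cluster label of each point (output of any clustering step).
    Returns {label: centroid time series}.
    """
    sums, counts = {}, {}
    for series, label in zip(series_by_point, labels):
        if label not in sums:
            sums[label] = [0.0] * len(series)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], series)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

# Toy data: three points, two time steps, two clusters.
centroids = cluster_centroids([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]],
                              ["a", "a", "b"])
```
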

26
Finding Influential Ocean Regions
  • Not all points on the ocean correlate well with
    land variables such as temperature and
    precipitation.
  • The best points are those which have a high density
  • Dense points are relatively homogeneous with
    respect to their neighboring points.

27
Discovery of Ocean Climate Indices
  • Use clustering to find areas of the oceans that
    have high density, i.e., relatively homogeneous
    behavior.
  • Cluster centroids are potential OCIs.
  • For SLP, pairs of cluster centroids are potential
    OCIs.
  • Evaluate the influence of potential OCIs on
    land points.
  • Determine if the potential OCI matches a known
    OCI.
  • For potential OCIs that are not well-known,
    conduct further evaluation.
  • Are there land points that have higher
    correlation for the potential OCI than for known
    indices?

28
SST Clusters
29
Evaluating Cluster Centroids as Potential OCIs
  • Evaluation will be based on area weighted
    correlation
  • Ignore clusters whose area weighted correlation is
    low.
  • Three cases
  • Clusters are highly similar to known OCIs (corr >
    0.4)
  • May represent a known OCI
  • Clusters may be better, i.e., higher coverage
  • Clusters may cover different area, i.e., some
    points for which the new OCI is a better
    predictor
  • Clusters are moderately similar to known OCIs
    (0.25 < corr < 0.4)
  • Again, new OCIs may be better predictors for some
    points.
  • Clusters are not similar to known OCIs (corr <
    0.25)
  • These clusters may represent as yet undiscovered
    Earth Science phenomena.
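
The three cases can be written as a small classification step. This is an illustrative sketch; the function name is hypothetical, and both the use of the absolute correlation and the treatment of values exactly equal to 0.25 or 0.4 are assumptions, since the slide gives only strict inequalities.

```python
def classify_cluster(best_corr_with_known_ocis, high=0.4, moderate=0.25):
    """Classify a cluster centroid by its strongest correlation with any known OCI."""
    c = abs(best_corr_with_known_ocis)
    if c > high:
        return "highly similar"
    if c > moderate:
        return "moderately similar"
    return "not similar"

cases = [classify_cluster(c) for c in (0.93, 0.3, -0.1)]
```
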

30
SST Clusters Highly Correlated to Known Indices
Area Weighted Correlation of Cluster Centroids to
Land Temp Overlapping, threshold 0
31
SST Clusters Highly Correlated to Known Indices
32
SST Clusters that Correspond to El Nino Climate
Indices
Figure: El Nino regions defined by Earth scientists, and SNN clusters of SST (labeled 75, 78, 67, 94) that are highly correlated with El Nino indices (0.93 correlation).
33
SST Clusters Highly Correlated to Known Indices
  • Examples of some SST clusters that are
    highly correlated to known OCIs and have high
    area weighted correlation with land temperature.
    These indices have a significant correlation with
    El Nino indices.


34
SST Clusters Highly Correlated to Known Indices
  • However, there are areas (yellow) where these
    clusters correlate better.



35
SST Clusters Highly Correlated to Known Indices


36
SST Cluster Moderately Correlated to Known Indices

37
Comments from our NASA collaborators
  • Ocean cluster results based on SST correlations
    with land surface temperature suggest that
  • New areas of the ocean may be identified that
    were not previously known to be highly
    representative of the El Nino Southern
    Oscillation (ENSO) and the Arctic Oscillation (AO).
  • New predictive indices for land climate over the
    past 40 years can be identified that will improve
    upon predictions using any known ocean climate
    index to date, including SOI and AO.

38
Issues in Mining Associations from Earth Science
Data
  • Data is continuous rather than discrete.
  • Data has spatial and temporal components.
  • Data can be multilevel
  • time and spatial granularities.
  • Observations are not i.i.d. due to spatial and
    temporal autocorrelations.
  • Data may contain noise, missing information and
    measurement errors
  • Historical SST data from 1856 to 1941 was measured
    using wooden buckets.
  • Data may come from heterogeneous sources
  • Calibration issues.

39
Mining Associations in Earth Science Data:
Challenges
  • How to transform Earth Science data into
    transactions?
  • What are the baskets?
  • What are the items?
  • How to define support?

40
Mining Association Patterns in Earth Science
Data: Challenges
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI => NPP-HI (support count = 933, confidence = 99.3%)
3. FPAR-HI PET-HI PREC-HI => NPP-HI (support count = 1655, confidence = 98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI => NPP-HI (support count = 268, confidence = 98.2%)
  • How to efficiently discover spatio-temporal
    associations?
  • Use existing algorithms.
  • Develop new algorithms.

41
Event Definition
  • Items are events abstracted from time series.
  • Events of interest include
  • Temporal events
  • Anomalous temporal events such as warmer winters
    and droughts.
  • Changes in the periodic behavior such as longer
    growing seasons or earlier month of onset of
    greenup.
  • Spatial events
  • Large percentage of land areas in a certain
    region having below-average precipitation.
  • Spatio-temporal events
  • Changes in circulation or trajectory of
    jet-streams.

42
Example of Anomalous Event Definition
If the threshold for Z is 1.5, on average, there are
20 events per time series.
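
A thresholded event definition of this kind might be sketched as follows. The comparison direction, the symmetric treatment of high and low anomalies, and the label format (for example `Temp-Hi`) are assumptions; the function name is hypothetical.

```python
def anomalous_events(z_scores, variable, threshold=1.5):
    """Turn a Z-score time series into event labels.

    Returns one entry per time step: '<variable>-Hi' when z > threshold,
    '<variable>-Lo' when z < -threshold, otherwise None (no event).
    """
    events = []
    for z in z_scores:
        if z > threshold:
            events.append(variable + "-Hi")
        elif z < -threshold:
            events.append(variable + "-Lo")
        else:
            events.append(None)
    return events

temp_events = anomalous_events([0.2, 1.8, -2.0, 1.5], "Temp")
```
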
43
Transaction and Support Definitions
  • Convert the time series into a sequence of events
    for each spatial location.

44
Examples of Association Patterns
  • min support = 0.001, min confidence = 10%

1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI TEMP-HI => NPP-HI (support count = 933, confidence = 99.3%)
3. FPAR-HI PET-HI PREC-HI => NPP-HI (support count = 1655, confidence = 98.8%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI => NPP-HI (support count = 268, confidence = 98.2%)
5. FPAR-HI PET-HI PREC-HI SOLAR-LO TEMP-HI => NPP-HI (support count = 44, confidence = 97.8%)
6. FPAR-LO PET-LO PREC-LO SOLAR-LO => NPP-LO (support count = 216, confidence = 96.9%)
7. FPAR-LO PREC-LO SOLAR-LO TEMP-HI => NPP-LO (support count = 152, confidence = 96.2%)
8. FPAR-LO PET-LO PREC-LO SOLAR-LO TEMP-LO => NPP-LO (support count = 47, confidence = 95.9%)
9. FPAR-LO PREC-LO SOLAR-LO TEMP-LO => NPP-LO (support count = 49, confidence = 94.2%)
10. FPAR-LO PREC-LO SOLAR-LO => NPP-LO (support count = 595, confidence = 93.7%)
75. FPAR-HI => NPP-HI (support count = 216924, confidence = 55.7%)
Figure: diagram relating NPP to Solar, FPAR, Temperature and Moisture
45
Example of Interesting Association Patterns
FPAR-Hi => NPP-Hi (sup = 5.9%, conf = 55.7%)
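
For reference, support and confidence of a rule such as FPAR-Hi => NPP-Hi follow the standard association-rule definitions: support is the fraction of transactions containing both sides, and confidence is the fraction of transactions containing the left side that also contain the right side. The sketch below uses made-up baskets and a hypothetical function name.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support count, support, confidence) of antecedent => consequent.

    transactions: iterable of sets of items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = n_antecedent = n_both = 0
    for basket in transactions:
        n_total += 1
        if antecedent <= basket:
            n_antecedent += 1
            if consequent <= basket:
                n_both += 1
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both, support, confidence

# Made-up baskets, for illustration only.
baskets = [{"FPAR-Hi", "NPP-Hi"}, {"FPAR-Hi"}, {"NPP-Hi"}, {"Prec-Lo"}]
count, support, confidence = support_and_confidence(
    baskets, {"FPAR-Hi"}, {"NPP-Hi"})
```
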
46
Land Cover Types
47
Using Land Cover as Additional Features
1. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI => NPP-HI (support count = 145, confidence = 100%)
2. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI GRASSLAND => NPP-HI (support count = 145, confidence = 100%)
3. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI FOREST => NPP-HI (support count = 44, confidence = 100%)
4. FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND => NPP-HI (support count = 44, confidence = 100%)
5. FPAR-HI PET-HI PREC-HI SOLAR-HI FOREST => NPP-HI (support count = 75, confidence = 100%)
6. FPAR-HI PET-HI PREC-HI SOLAR-HI CROPLAND => NPP-HI (support count = 81, confidence = 100%)
7. FPAR-HI PREC-HI SOLAR-HI TEMP-HI CROPLAND => NPP-HI (support count = 58, confidence = 100%)
8. FPAR-HI PET-HI PREC-HI TEMP-HI GRASSLAND => NPP-HI (support count = 376, confidence = 99.5%)
9. FPAR-HI PET-HI PREC-HI TEMP-HI CROPLAND => NPP-HI (support count = 170, confidence = 99.4%)
10. FPAR-HI PET-HI PREC-HI CROPLAND => NPP-HI (support count = 277, confidence = 99.3%)
...
  • Produce multiple rules that have the same form
  • A => B; A, Grassland => B; A, Cropland => B;
    etc.
  • Some of the support counts could be missing if
    itemsets fall below the minimum support threshold.

48
Finding Interesting Earth-Science Patterns
  • A pattern is interesting if it occurs relatively
    more frequently in some homogeneous regions.
  • If the relative frequency of a pattern is similar
    in all groups of land areas, then it is less
    interesting.
  • If the pattern occurs mostly in a certain group
    of land areas, then it is potentially interesting.

49
Filtering Patterns using Land Cover Types
  • For each pattern p
  • Actual coverage for land cover type i: s_i / S
  • Expected coverage for land cover type i: n_i / N
  • Ratio of actual to expected coverage for land
    cover type i:
  • e_i = (s_i / S) / (n_i / N) = s_i N / (n_i S)
  • Interest Measure
  • If pattern occurs in arbitrary regions, interest
    measure will be low.
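
The ratio of actual to expected coverage can be computed as in the sketch below. The reading of the symbols is an assumption, since the slide does not define them: s_i is taken as the number of occurrences of the pattern in land cover type i, S as the total over all types, n_i as the number of locations of type i, and N as the total number of locations. The interest measure itself is not reproduced here because its formula does not appear in the transcript.

```python
def coverage_ratios(pattern_counts, location_counts):
    """Ratio of actual to expected coverage, e_i = (s_i / S) / (n_i / N).

    pattern_counts: {land cover type: s_i}, occurrences of the pattern.
    location_counts: {land cover type: n_i}, number of locations of that type.
    """
    S = sum(pattern_counts.values())
    N = sum(location_counts.values())
    if S == 0:
        return {cover: 0.0 for cover in location_counts}
    return {cover: (pattern_counts.get(cover, 0) * N) / (n_i * S)
            for cover, n_i in location_counts.items() if n_i > 0}

# Made-up counts, for illustration only.
ratios = coverage_ratios({"grassland": 30, "forest": 10},
                         {"grassland": 100, "forest": 300})
```

A ratio above 1 means the pattern occurs in that land cover type more often than its share of locations would suggest; a ratio near 1 for every type corresponds to a pattern that occurs in arbitrary regions.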

50
Interesting Spatial Association Pattern
51
Interesting Spatial Association Pattern
Land Cover
  • Prec-Hi => NPP-Hi tends to occur in grassland and
    cropland regions.

52
Other Interesting Spatial Association Patterns
Support Count
Land Cover
  • Temp-Hi => NPP-Hi tends to occur in the forest
    and cropland regions in the northern hemisphere
    (Forests (33.5), Grassland (8.7),
    Cropland (24.5), Desert (0.4))

53
Global River Discharge Data
  • 30 rivers, 0.5 degree resolution
  • Two measurement stations: mouth and source of
    river system/basin
  • Minimum of ten continuous years of monthly
    station discharge records
  • Interesting associations
  • e.g., Amazon discharge is highly correlated with
    ANOM 3.4 (r = -0.5)

54
Relationship Between River Basin PREC and OCI
Amazon
Parana
  • Correlation between PREC aggregated over river
    basins and OCIs is shown in the left figure
  • Interesting Observations
  • Amazon and Parana are nearby; however, their
    signals with respect to the OCIs are almost reversed

55
Discharge Data: Amazon vs. Parana
56
Relationship Between River Basin PREC and OCI ..
Petchora
  • Interesting Observations
  • Petchora and Pacific Decadal Oscillation (PDO)
    are highly correlated.

57
Correlation between PREC and DISCHARGE
58
Correlation between OCI and DIS (r 0.3)
Amu-Darya
Brahmaputra
Amazon
Columbia
Colorado
59
Proposed Framework of River Analysis
60
Conclusions
  • Association rules can uncover interesting
    patterns for Earth Scientists to investigate.
  • Challenges arise due to spatio-temporal nature of
    the data.
  • Need to incorporate domain knowledge to prune out
    uninteresting patterns.
  • By using clustering we have made some progress
    towards automatically finding climate patterns
    that display interesting connections between the
    ocean and the land.
  • Need to further evaluate candidates for new
    climate indices.
  • Correlation analysis on river discharge data can
    be used to evaluate the effects of climate and
    man.

61
Case Studies: Earth Science Data
  • Michael Steinbach, Pang-Ning Tan, Vipin Kumar,
    Chris Potter, Steven Klooster, Alicia Torregrosa,
    Clustering Earth Science Data: Goals, Issues and
    Results, Workshop on Mining Scientific Data, KDD
    2001, San Francisco, CA, 2001.
  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar,
    Steven Klooster, Christopher Potter, Alicia
    Torregrosa, Finding Spatio-Temporal Patterns in
    Earth Science Data: Goals, Issues and Results,
    Temporal Data Mining Workshop, KDD 2001, San
    Francisco, CA, 2001.
  • Vipin Kumar, Michael Steinbach, Pang-Ning Tan,
    Steven Klooster, Chris Potter, Alicia Torregrosa,
    Mining Scientific Data: Discovery of Patterns in
    the Global Climate System, Joint Statistical
    Meetings, Atlanta, GA, 2001.
  • Michael Steinbach, Pang-Ning Tan, Vipin Kumar,
    Chris Potter, Steven Klooster, Data Mining for
    the Discovery of Ocean Climate Indices,
    Workshop on Mining Scientific Data, SDM 2002.

62
Statistical Issues
  • Temporal Autocorrelation
  • Makes it difficult to calculate degrees of
    freedom and determine significance levels for
    tests, e.g., non-zero correlation.
  • Moving average is nice for smoothing and seeing
    the overall behavior, but introduces additional
    autocorrelation.
  • Removal of seasonality removes much of the
    autocorrelation (as long as not performed via the
    moving average).
  • Measures of time series similarity
  • Detecting non-linear connections
  • Detecting connections that only exist at certain
    times.
  • Sometimes only extreme events have an effect.
  • Automatically detecting appropriate time lags.
  • Statistical tests for more sophisticated measures.
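
The point about moving averages introducing autocorrelation can be illustrated with a small simulation. This sketch is not from the original slides: it uses synthetic white noise, a 12-point moving average, and a simple lag-1 autocorrelation estimate, all chosen for illustration.

```python
import random

def lag1_autocorrelation(series):
    """Sample autocorrelation of a series at lag 1."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

def moving_average(series, window=12):
    """Simple moving average over `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(600)]

raw_ac = lag1_autocorrelation(noise)
smoothed_ac = lag1_autocorrelation(moving_average(noise))
```

The white noise has lag-1 autocorrelation near zero, while the smoothed series is strongly autocorrelated because consecutive windows share 11 of their 12 values.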

63
Statistical Issues
  • Detecting spurious connections.
  • We are performing many correlation calculations
    and there is a chance of spurious correlations.
  • Given that we have 100,000 locations on the
    Earth for which we have time series, how many
    spuriously high correlations will we get when we
    calculate the correlation between these locations
    and a climate index?
  • Because of spatial autocorrelation, these
    correlations are not independent.
  • Again we have trouble calculating the degrees of
    freedom.
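
The question of how many spuriously high correlations to expect can be explored with a simulation. The sketch below is an illustration under simplifying assumptions: the location series are independent white noise (so it ignores the spatial and temporal autocorrelation discussed above), the series length of 120 months, the 2,000 locations, and the cutoff of 0.18 are arbitrary choices, and the `pearson` helper is defined here for self-containment.

```python
import math
import random

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(1)  # fixed seed so the run is repeatable
n_months, n_locations, cutoff = 120, 2000, 0.18

# A random "climate index" that is unrelated to every location by construction.
index = [rng.gauss(0.0, 1.0) for _ in range(n_months)]

spurious = 0
for _ in range(n_locations):
    series = [rng.gauss(0.0, 1.0) for _ in range(n_months)]
    if abs(pearson(index, series)) > cutoff:
        spurious += 1

fraction = spurious / n_locations
```

Even though no location is truly related to the index, a small but non-zero fraction of them exceed the cutoff purely by chance; with real data the dependence between neighboring locations makes this count harder to predict.
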