Title: Chandrika Kamath and Imola K. Fodor Center for Applied Scientific Computing Lawrence Livermore National Laboratory Gatlinburg, TN March 26-27, 2002
1. Chandrika Kamath and Imola K. Fodor
Center for Applied Scientific Computing
Lawrence Livermore National Laboratory
Gatlinburg, TN, March 26-27, 2002
Dimension Reduction and Sampling First SDM ISIC
All-Hands Meeting
UCRL. This work was performed under the auspices
of the U.S. Department of Energy by University of
California Lawrence Livermore National Laboratory
under contract W-7405-Eng-48.
2. The SDM ISIC aims to minimize the effort researchers spend in managing their data
- LLNL is participating in several of the tasks, including
  - data mining to improve the management of data
- Problem: data from simulations and experiments is high dimensional (i.e., many features)
  - Querying the features can help in understanding the data
    - but searching in a high-dimensional space is difficult
  - May want to cluster similar objects for efficient access
    - but clustering is expensive in high dimensions
We plan to address the problem of high dimensionality using techniques for dimension reduction and sampling originally developed in data mining.
3. Our work on dimension reduction will help both data management and mining
- Reducing the dimensions will improve
  - searching (task 3.1, LBNL)
  - clustering (task 2.1, ORNL)
- Dimension reduction is expensive if there are many data items
  - use a sample of the data items
  - techniques for sampling in the presence of rare events
- We will focus on climate and high-energy-physics data
  - complements work at ORNL (climate), LBNL (HEP)
  - but the techniques are applicable to other data as well
We only report the 0.8 FTE work funded under SciDAC; however, our data mining research is more extensive. See www.llnl.gov/casc/sapphire
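As one illustration of sampling when rare events matter, the sketch below keeps every item flagged as rare and draws a uniform random sample of the remaining items. The slides do not say which sampling technique is planned, so the function name `sample_with_rare_events`, the `is_rare` predicate, and the 10% rate are all assumptions made for this example.

```python
import random

def sample_with_rare_events(items, is_rare, rate, seed=0):
    """Keep all rare items; keep a uniform random fraction `rate` of the rest."""
    rng = random.Random(seed)          # seeded so the sample is reproducible
    rare = [x for x in items if is_rare(x)]
    common = [x for x in items if not is_rare(x)]
    k = round(rate * len(common))      # number of common items to keep
    return rare + rng.sample(common, k)

# Hypothetical data: 1000 values, of which those above 990 count as rare.
data = list(range(1000))
subset = sample_with_rare_events(data, lambda v: v > 990, rate=0.1)
print(len(subset))
```

Rare items are over-represented in such a sample, so any statistics computed from it (for example a covariance used for dimension reduction) may need per-item weights.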
4. There are two different ways in which we can view dimension reduction
- Reduce the number of features representing a data item
- Reduce the number of basis vectors used to describe the data
  - if some of the are small, they can be ignored
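The second view can be sketched for 2-D points: find the direction of largest variance, describe each point by a single coefficient along that direction, and check how little is lost. This uses the principal axis as the basis, which is only one possible choice; the function names and the sample points are hypothetical.

```python
import math

def principal_axis(points):
    """Return (mean, unit vector) of the direction of largest variance for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n            # var(x)
    d = sum((p[1] - my) ** 2 for p in points) / n            # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n   # cov(x, y)
    # Angle of the leading eigenvector of the covariance matrix [[a, b], [b, d]].
    phi = 0.5 * math.atan2(2 * b, a - d)
    return (mx, my), (math.cos(phi), math.sin(phi))

def reduce_and_reconstruct(points):
    """Describe each point by one coefficient, then rebuild an approximation."""
    (mx, my), (ux, uy) = principal_axis(points)
    coeffs = [(p[0] - mx) * ux + (p[1] - my) * uy for p in points]
    approx = [(mx + c * ux, my + c * uy) for c in coeffs]
    return coeffs, approx

# Hypothetical points lying close to the line y = 2x.
offsets = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0]
pts = [(float(x), 2.0 * x + e) for x, e in zip(range(10), offsets)]
coeffs, approx = reduce_and_reconstruct(pts)
max_err = max(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(pts, approx))
print(max_err)
```

Each point is now stored as one number instead of two; the ignored second direction carries only the small offsets.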
5. Our work on climate data focuses on reducing the number of basis vectors
- Domain expert: Dr. Benjamin Santer (LLNL climate)
- Climate scientists are interested in understanding the change in the earth's surface temperature
- Simulated and observed data are mixtures of volcano, El Niño, and other effects
- Our goal is to separate the signals corresponding to different effects
  - traditional approaches such as principal component analysis (PCA) have not worked
  - separation is difficult as the El Chichón and Pinatubo volcano eruptions coincided with El Niño events
  - our approach is to use independent component analysis (ICA)
Dimension reduction supporting scientific discovery
6. The raw data is given as monthly temperatures on a 144x73 spatial grid at 17 vertical levels
Figure: January 1979 raw temperatures (Kelvin) on the 144x73 longitude by latitude grid at the 1000hPa pressure level. Data from NCEP.
7. Initially, we applied ICA to global monthly mean anomaly temperatures
17 vertical levels: level 1 = 1000hPa (lowest altitude), level 17 = 10hPa (highest altitude)
Figure: Time series of global monthly mean anomalies, Jan 1979 - Dec 2000
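The slides do not state how the global monthly mean anomalies were computed. A minimal sketch of one common approach follows, assuming a cos(latitude) area weighting for the global mean and an anomaly defined as the departure from the long-term mean of the same calendar month. The function names and the test data are hypothetical.

```python
import math

def global_mean(field, lats_deg):
    """Area-weighted mean of field[lat][lon], weighting each row by cos(latitude)."""
    num = den = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den

def monthly_anomalies(series):
    """Subtract from each value the mean of its calendar month (series starts in January)."""
    clim = [sum(series[m::12]) / len(series[m::12]) for m in range(12)]
    return [v - clim[i % 12] for i, v in enumerate(series)]

# Latitudes of a 73-row grid at 2.5 degree spacing, from 90N to 90S.
lats = [90.0 - 2.5 * i for i in range(73)]
# Hypothetical uniform 250 K field on 73 x 144 points.
field = [[250.0] * 144 for _ in lats]
gm = global_mean(field, lats)

# Hypothetical 2-year series: a seasonal cycle, shifted up by 0.5 K in year two.
series = [280 + 10 * math.sin(2 * math.pi * m / 12) for m in range(12)]
series += [v + 0.5 for v in series]
anoms = monthly_anomalies(series)
print(gm, anoms[0], anoms[12])
```

Removing the per-month climatology strips the seasonal cycle, so only the year-to-year departures remain as input for ICA.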
8. Next, we ran experiments with simulated data to understand the behavior of ICA
Figure: (i) Two original sources -> mix -> (ii) Two mixed signals from the original sources -> ICA -> (iii) Sources (ICs) recovered from (ii)
ICA correctly estimates the shapes of the two independent components (ICs). With additional processing, we can also estimate the relative contributions of the two ICs in the two mixed signals.
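The mix-then-unmix experiment can be imitated with a small two-signal ICA. The sketch below is not the authors' software: it whitens the two mixed signals and then searches for the rotation that maximizes the summed absolute excess kurtosis, one simple non-Gaussianity criterion among several used for ICA. The source shapes, mixing weights, and function names are assumptions.

```python
import math

def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def whiten(x1, x2):
    """Rotate and rescale two centered signals so they are uncorrelated with unit variance."""
    n = len(x1)
    a = sum(u * u for u in x1) / n
    d = sum(v * v for v in x2) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    phi = 0.5 * math.atan2(2 * b, a - d)        # eigenvector angle of [[a, b], [b, d]]
    c, s = math.cos(phi), math.sin(phi)
    l1 = a * c * c + 2 * b * c * s + d * s * s  # variances along the two eigenvectors
    l2 = a * s * s - 2 * b * c * s + d * c * c
    z1 = [(c * u + s * v) / math.sqrt(l1) for u, v in zip(x1, x2)]
    z2 = [(-s * u + c * v) / math.sqrt(l2) for u, v in zip(x1, x2)]
    return z1, z2

def excess_kurtosis(y):
    """Excess kurtosis of a zero-mean signal."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def ica_two_signals(x1, x2, steps=180):
    """Search rotations of the whitened signals for the most non-Gaussian pair."""
    z1, z2 = whiten(center(x1), center(x2))
    best_score, best = -1.0, None
    for k in range(steps):
        t = 0.5 * math.pi * k / steps
        c, s = math.cos(t), math.sin(t)
        y1 = [c * u + s * v for u, v in zip(z1, z2)]
        y2 = [-s * u + c * v for u, v in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if score > best_score:
            best_score, best = score, (y1, y2)
    return best

def abs_corr(x, y):
    """Absolute correlation, since ICA recovers sources only up to sign and scale."""
    xc, yc = center(x), center(y)
    num = sum(u * v for u, v in zip(xc, yc))
    return abs(num) / math.sqrt(sum(u * u for u in xc) * sum(v * v for v in yc))

# Hypothetical sources: a sine wave and a "volcano" (sudden cooling, slow recovery).
n = 1000
sine = [math.sin(2 * math.pi * t / 20) for t in range(n)]
volcano = [-math.exp(-(t - 305) / 30) if t >= 305 else 0.0 for t in range(n)]
# Hypothetical mixing weights.
mixed1 = [1.0 * a + 0.6 * b for a, b in zip(sine, volcano)]
mixed2 = [0.5 * a + 1.0 * b for a, b in zip(sine, volcano)]

ic1, ic2 = ica_two_signals(mixed1, mixed2)
for name, src in (("sine", sine), ("volcano", volcano)):
    print(name, round(max(abs_corr(src, ic1), abs_corr(src, ic2)), 3))
```

The recovered components come back in arbitrary order, with arbitrary sign and unit variance, which is why the shapes are recovered directly while the amplitudes need a separate step.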
9. Original decomposition of the two mixed signals (-) into sine (--) and volcano (-.)
Figure panels: (i) Signal 1, (ii) Signal 2
10. ICA decomposition of the two mixed signals (-) into sine (--) and volcano (-.)
- After proper post-processing, ICA estimates remarkably well the underlying independent components and their appropriate contributions in the mixed signals
Figure panels: (i) Signal 1, (ii) Signal 2
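The slides do not describe the post-processing step. One plausible way to estimate how much each IC contributes to a mixed signal is an ordinary least-squares fit of the signal on the two ICs, sketched below. The function name `contributions` and the test vectors are hypothetical, and this should not be read as the authors' procedure.

```python
def contributions(mixed, ic1, ic2):
    """Least-squares weights (w1, w2) such that mixed ~ const + w1*ic1 + w2*ic2."""
    n = len(mixed)

    def centered(x):
        m = sum(x) / n
        return [v - m for v in x]

    y, u, v = centered(mixed), centered(ic1), centered(ic2)
    s11 = sum(a * a for a in u)
    s22 = sum(b * b for b in v)
    s12 = sum(a * b for a, b in zip(u, v))
    r1 = sum(a * c for a, c in zip(u, y))
    r2 = sum(b * c for b, c in zip(v, y))
    det = s11 * s22 - s12 * s12     # zero only if the two ICs are collinear
    # Solve the 2x2 normal equations [[s11, s12], [s12, s22]] [w1, w2] = [r1, r2].
    w1 = (r1 * s22 - r2 * s12) / det
    w2 = (r2 * s11 - r1 * s12) / det
    return w1, w2

# Hypothetical check: build a signal from known weights and recover them.
ic_a = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
ic_b = [0.0, 0.0, -3.0, -2.0, -1.0, -0.5, 0.0, 0.0]
signal = [2.0 * a - 0.5 * b + 3.0 for a, b in zip(ic_a, ic_b)]
w1, w2 = contributions(signal, ic_a, ic_b)
print(w1, w2)
```

Multiplying each centered IC by its weight gives that component's contribution as a time series, which can be overlaid on the mixed signal.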
11. ICA can also separate noise used as an extra component in the mixing
Figure: 3 original sources -> mix -> 3 mixed signals -> ICA -> 3 estimated ICs
12. Original decomposition of 3 mixed signals (-) into El Niño (--), volcano (-.), and noise (..)
Figure panels: (i) Signal 1, (ii) Signal 2, (iii) Signal 3
Cooling in the global series at the arrow is in fact a combination of an ENSO warming and a volcano cooling. Without the volcano eruption, the El Niño warming would dominate, resulting in warmer global temperatures.
13. ICA decomposition of 3 mixed signals (-) into El Niño (--), volcano (-.), and noise (..)
Figure panels: (i) Signal 1, (ii) Signal 2, (iii) Signal 3
Although not perfect in terms of the exact amplitudes, ICA clearly separates the cooling effect of the volcano from the warming effect of El Niño.
14. Our future plans include work with HEP data and collaborators at ORNL and LBNL
- Complete the work on the climate problem
  - our results with artificial data are encouraging
  - identify appropriate ICA model for climate data
- Make the ICA software accessible to SciDAC scientists
- Try ICA and other dimension reduction techniques in the context of the STAR high-energy-physics data
  - reduce number of features
  - investigate sampling to reduce computation
  - collaborate with LBNL (data, searching)
- Investigate incremental PCA
  - monitor climate simulations using indices based on the principal components
  - collaborate with ORNL (data, clustering)
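The slides do not say which incremental PCA algorithm is to be investigated. As a minimal sketch of the general idea, the class below updates the mean and covariance of 2-D items one at a time (a Welford-style update), so the leading principal direction can be recomputed at any point without revisiting earlier data. The class name and the sample stream are hypothetical.

```python
import math

class RunningCovariance2D:
    """Running mean and covariance of 2-D items, updated one item at a time."""

    def __init__(self):
        self.n = 0
        self.mean = [0.0, 0.0]
        self.m = [0.0, 0.0, 0.0]   # running sums of centered products: xx, xy, yy

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean[0]      # deviations from the old mean
        dy = y - self.mean[1]
        self.mean[0] += dx / self.n
        self.mean[1] += dy / self.n
        ex = x - self.mean[0]      # deviations from the new mean
        ey = y - self.mean[1]
        self.m[0] += dx * ex
        self.m[1] += dx * ey
        self.m[2] += dy * ey

    def leading_component(self):
        """Unit vector along the direction of largest variance seen so far."""
        a, b, d = (v / self.n for v in self.m)
        phi = 0.5 * math.atan2(2 * b, a - d)
        return math.cos(phi), math.sin(phi)

# Hypothetical stream of items lying close to the line y = 2x.
rc = RunningCovariance2D()
for x, y in [(1, 2), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8)]:
    rc.update(x, y)
ux, uy = rc.leading_component()
print(rc.mean, uy / ux)
```

Projecting each new item onto the current leading component gives a scalar index that can be tracked while a simulation runs.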