Title: ASPECT: Adaptable Simulation Product Exploration and Control Toolkit
1. ASPECT: Adaptable Simulation Product Exploration and Control Toolkit
Nagiza Samatova, George Ostrouchov
Computer Science and Mathematics Division, Oak Ridge National Laboratory
http://www.csm.ornl.gov/
SDM All-Hands Meeting, September 11-13, 2002
2. Our Team
- Students
  - Abu-Khzam, Faisal, Ph.D., University of Tennessee, Knoxville
  - Bauer, David, B.S., Georgia Institute of Technology
  - Hespen, Jennifer, Ph.D., University of Tennessee, Knoxville
  - Nair, Rajeet, M.S., University of Illinois, Chicago
- Postdocs
  - Park, Hooney, Ph.D.
- Staff
  - Ostrouchov, George, Ph.D., Principal Investigator
  - Reed, Joel, M.S.
  - Samatova, Nagiza, Ph.D., Principal Investigator
  - Watkins, Ian, B.S.
3. Our Collaborators
- Application
  - David Erickson, Climate, ORNL
  - John Drake, ORNL
  - Tony Mezzacappa, Astrophysics, ORNL
- Linear Algebra & Graph Theory
  - Gene Golub, Stanford University
  - Mike Langston, UTK
- Data Mining and Data Management
  - Rob Grossman, UIC
- High Performance Computing
  - Alok Choudhary, Wei-keng Liao, NWU
  - Bill Gropp, Rob Ross, Rajeev Thakur, ANL
- Hardware and Software Infrastructure
  - Dan Million, ORNL
  - Randy Burris, ORNL
4. Typical Simulation Exploration Scenarios (driven by limitations of existing technologies)
- Post-processing Scenario
  - Submit a long-running simulation job (weeks to months)
  - Periodically check the status (run the tail -f command on each machine)
  - Analyze the large simulation data set
- Real-time Scenario
  - Instrument the simulation code to visualize one or more fields
  - While the simulation job is running:
    1. Monitor the selected field(s)
    2. If monitoring is not possible, either stop the job, or continue running without monitoring and without the ability to view later what was skipped
    3. If the set of fields to monitor changes, go to step 1
5. Analysis & Visualization of Simulation Products: State of the Art
- Post-processing data analysis tools (like PCMDI)
  - Scientists must wait for the simulation to complete
  - Can use many CPU cycles on long-running simulations
  - Can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
- Real-time simulation monitoring tools (like CUMULVS)
  - Need simulation code instrumentation (e.g., calls to visualization libraries)
  - Interference with the simulation run: taking a snapshot of the data can pause the simulation
  - Computationally intensive data analysis tasks become part of the simulation
  - Synchronous view of the data and the simulation run
  - More control over the simulation
6. Some More Limitations
- Post-processing data analysis tools
  - Application-specific (PyClimate, mtaCDF, PCMDI tools, ncview)
    - tools written for one application cannot be used for another
    - usually written by experts in the application, not in data analysis
  - Not user-friendly; usually script-driven (Python, IDL, GrADS)
  - Support no more than a dozen simple data analysis algorithms
  - Do not exist for some applications (astrophysics vs. climate)
  - Are not designed as distributed systems
    - distributed data sets must be centralized
    - tools must be installed where the data is
- Real-time simulation monitoring tools
  - Provide even simpler data analysis (usually focused on rendering the data)
  - Require good familiarity with the simulation code to make changes
    - NCAR develops the climate simulation codes (PCM, CCSM) used world-wide
7. Improvements through ASPECT: A Data Stream (not Simulation) Monitoring Tool
- ASPECT's advantages
  - No simulation code instrumentation
  - Single copy of the data, multiple views of the data
  - No interference with the simulation
  - Decoupled from the simulation
8. Run-and-Render Simulation Cycle in SciDAC: Our Vision
Goal: to develop ASPECT (Adaptable Simulation Product Exploration and Control Toolkit)
Benefits:
- Enable effective and efficient monitoring of data generated by long-running simulations through a GUI interface to a rich set of pluggable data analysis modules
- Potentially lead to new scientific discoveries
- Allow very efficient utilization of human and computer resources
9. Approaching the Goal through a Collaborative Set of Activities
10. Building a Workflow Environment
11. 80 > 20 Paradigm in Probes Research: An Application-driven Environment
From frustrations:
- Very limited resources
- General-purpose software only
- Lack of interface with HPSS
- Homogeneous platform (e.g., Linux only)
To smooth operation:
- Hardware Infrastructure
  - RS6000 S80, 6 processors, 2 GB memory, 1 TB IDE FibreChannel RAID
  - 4-processor (1.4 GHz Xeon), 8 GB, 573 GB, FibreChannel HBA and GigE
  - two 2-processor (2.4 GHz Xeon), 2 GB, 573 GB, GigE, FibreChannel HBA
- Software Infrastructure
  - Compilers (Fortran, C, Java)
  - Data Analysis (R, Java-R, GGobi)
  - Visualization (ncview, GrADS)
  - Data Formats (netCDF, HDF)
  - Data Storage & Transfer (HPSS, hsi, pftp, GridFTP, MPI-IO, PVFS)
12. ASPECT Design and Implementation
13. ASPECT Infrastructure: A Distributed End-to-End System
14. ASPECT GUI Infrastructure
- Functionality
- Instantiate Modules
- Link Modules
- Synchronous Control
- Add Modules by XML
- XML-based Request Builder
15. ASPECT Back-End Engine Overview
The GUI passes a string indicating the script to run, the variables to pass to the script, the names of the files (or groups of files) where those variables can be found, and other optional parameters. The engine parses the string, reads all of the data into R-compatible objects (in memory), and then calls the script through R. When R returns, the single returned object is broken up into its respective variables and written to a NetCDF file.
Pipeline: GUI -> Engine Front End (takes the request from the GUI and reads input into memory) -> R Script (translates input into an R function call) -> R (performs the calculations) -> Engine Back End (converts R's output to a NetCDF file)
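The request flow above can be sketched in a few lines. This is a hypothetical pure-Python mock, not the real ASPECT engine: `parse_request`, `run_request`, and the request-string syntax are illustrative assumptions, with in-memory dicts standing in for data files and the R side.

```python
# Hypothetical sketch of the engine's request flow (names and request
# syntax are illustrative, not the real ASPECT API): parse a request
# string, load the named variables, call the script, flatten the result.

def parse_request(request):
    """Parse 'script;var1=file1,var2=file2' into (script, {var: file})."""
    script, _, pairs = request.partition(";")
    bindings = dict(p.split("=") for p in pairs.split(",") if p)
    return script, bindings

def run_request(request, file_store, scripts):
    script_name, bindings = parse_request(request)
    # Front end: read every named variable into memory.
    env = {var: file_store[fname] for var, fname in bindings.items()}
    # Call the analysis script; it returns a single result object ...
    result = scripts[script_name](**env)
    # ... which the back end breaks into per-variable outputs.
    return dict(result)

# Toy stand-ins for the data files and the R-side wrapper script.
file_store = {"run1.nc": [4.0, 8.0, 6.0]}
scripts = {"wmean": lambda data: {"Mean": sum(data) / len(data)}}

out = run_request("wmean;data=run1.nc", file_store, scripts)
print(out)  # {'Mean': 6.0}
```

In the real engine the "script" step is an R wrapper function and the output step writes NetCDF, but the parse / read / call / split shape is the same.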
16. Interfacing with R: ASPECT provides a rich set of data analysis modules through R
- Status
  - Released under the GPL on SourceForge, September 2002
  - Includes about 30 algorithms
  - A dozen more can be added in a matter of a week
  - Requested by DataSpace, UIC
  - Joint effort with DataSpace
http://www.r-project.org/
The open source R statistical package provides the generic computational back-end for the ASPECT engine. While R was designed mostly as a stand-alone program, it does provide internal hooks in its libraries. Using the same functions, macros, and syntax available to internal R code, the ASPECT engine creates R objects from the input data directly. These objects are then installed in the namespace of the R engine and used by the R wrapper scripts as if they were running in an ordinary R environment.
17. Scripts
Using R script wrappers around the R functions allows a great deal of flexibility. Users can easily add their own functions without having to know the internals of the ASPECT engine. Most of the scripts, like the one below, simply translate the C input into the equivalent R function call.
wsample <- function(x1, x2, v1, v2, n1, n2, c1, c2) {
    a <- if (n2 != 0) TRUE else FALSE
    q <- if (!is.null(v2)) {
        if (n1 != 0) sample(v1, size = n1, replace = a, prob = v2)
        else sample(v1, replace = a, prob = v2)
    } else {
        if (n1 != 0) sample(v1, size = n1, replace = a)
        else sample(v1, replace = a)
    }
    list(Sample = q)
}
The scripts can be as complicated or as simple as they need to be. The script below is perfectly valid.

whello <- function(x1, x2, v1, v2, n1, n2, c1, c2)
    print("Hello World")
18. XML-based Description of Algorithms and Visualization Interfaces
<name> wsort </name>
<displayName> Sort </displayName>
<input>
  <variable>
    <type> vector </type>
    <name> data </name>
    <description> The input data </description>
  </variable>
  <variable>
  ....
- Dynamically loaded XML descriptions of functions and menus provide user-expandable configuration details.
- Users can add comments, change default values, add multiple interfaces to a single function, and add interfaces for their own functions.
19. NetCDF/HDF Input/Output: ASPECT understands and uses standard scientific file formats
http://www.unidata.ucar.edu/packages/netcdf/
The open source NetCDF format is widely used to hold self-describing data. The output from the R engine is a single R object; given the recursively defined list nature of R objects, this is no limitation. To save a dynamic R object into a flat NetCDF file, the object must be carefully unwound while preserving as much of the metadata (such as dimension names, the original source of the data, etc.) as possible in the NetCDF file. Once the output file is written, it is ready to be used either for visualization or as the input to another function.
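The "unwinding" step can be illustrated with a tiny sketch. This is not ASPECT's actual code: it flattens a nested, R-list-like structure into flat, path-named variables of the kind that map naturally onto NetCDF variables and attributes.

```python
# Illustrative sketch (not ASPECT's implementation) of unwinding a
# nested, R-style result into flat {name: value} pairs suitable for a
# flat NetCDF file, with dotted names preserving the nesting structure.

def unwind(obj, prefix="", out=None):
    """Flatten nested dicts into {dotted_name: value} leaf pairs."""
    if out is None:
        out = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            unwind(val, f"{prefix}.{key}" if prefix else key, out)
    else:
        out[prefix] = obj
    return out

result = {
    "Sample": {"values": [3.1, 2.7, 5.9], "dims": ["record"]},
    "Source": "run1.nc",
}
flat = unwind(result)
print(flat)
# {'Sample.values': [3.1, 2.7, 5.9], 'Sample.dims': ['record'], 'Source': 'run1.nc'}
```

A real writer would then emit `Sample.values` as a NetCDF variable over the `record` dimension and keep `Source` as a global attribute.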
20. MPI-IO & NetCDF: ASPECT supports parallel I/O with various data access patterns (Collaboration with ANL (Bill Gropp, Rob Ross, Rajeev Thakur) and NWU (Alok Choudhary, Wei-keng Liao))
- Concatenate multiple files into a single file for a given set of variables
- Analyze multiple files with different data distribution patterns among processors (by blocks, by strided patterns, by entire files)
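The block and strided distribution patterns come down to index arithmetic; a sketch of the two partitions (index math only, assumed function names, with the resulting index sets ultimately fed to MPI-IO in a real reader):

```python
# Sketch of the two distribution patterns named above: N records split
# among P processors by contiguous blocks or by a strided pattern.

def block_partition(n, p, rank):
    """Contiguous chunk owned by `rank` (remainder spread over low ranks)."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    return list(range(start, start + base + (1 if rank < extra else 0)))

def strided_partition(n, p, rank):
    """Every p-th record, starting at offset `rank`."""
    return list(range(rank, n, p))

print(block_partition(10, 3, 0))    # [0, 1, 2, 3]
print(strided_partition(10, 3, 0))  # [0, 3, 6, 9]
```

Both partitions cover every record exactly once across the ranks; the choice affects only I/O locality.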
21. Data Sampling: ASPECT handles large data sets
Types of subsampling:
- Random subsampling
- Decimation
- Blocks
- Striding
Implementations
- Standard netCDF
- MPI-IO netCDF
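The four subsampling types can be written out over a flat record index. This is an illustrative sketch only; ASPECT applies the equivalent selections through its netCDF readers, and the block boundaries and stride values here are arbitrary.

```python
# Sketch of the four subsampling types listed above, on a toy record index.
import random

data = list(range(100))

random.seed(0)
rand_sub = random.sample(data, 10)   # random subsampling: 10 records at random
decimated = data[::10]               # decimation: keep every 10th record
blocks = data[20:30] + data[60:70]   # blocks: chosen contiguous ranges
strided = data[5:50:7]               # striding: start=5, stride=7, stop=50
```

Decimation is the special case of striding with start 0 and no stop; blocks preserve spatial locality, which matters for the MPI-IO implementation.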
22. Interfacing with DataSpace: ASPECT provides hooks to a Web of Scientific Data (Collaboration with Bob Grossman at UIC)
The web today provides an infrastructure for working with distributed multimedia documents. DataSpace is an infrastructure for creating a web of data instead of documents.
- Very high throughput for moving data through DataSpace's parallel network transport protocols (PSockets (TCP), SABUL (TCP, UDP))
- Ability to do comparative/correlation analysis between simulation and archived data
[Figure: DataSpace Web of Data via PSockets/SABUL. Measured UIC-to-Amsterdam throughput: SABUL 540 Mb/s, PSockets 180 Mb/s, plain sockets 10 Mb/s.]
http://www.dataspaceweb.net
23. Summary of ASPECT's Design & Implementation
- ASPECT is a data stream monitoring tool
- ASPECT has features for efficient and effective simulation data analysis:
  - GUI interface to a rich set of pluggable data analysis modules
  - Uses the open source R statistical data analysis package as a computational back-end
  - Understands and uses the NetCDF/HDF scientific file formats
  - Uses dynamically loaded R scripts and XML descriptors for flexibility
  - Handles large data sets through support for block selection, striding, sampling, data reduction, and distributed algorithms
  - Provides efficient I/O through the MPI-IO interface to NetCDF and HDF
  - Moves data efficiently through PSockets/SABUL
  - Supports a dataset view of the simulation, not just a collection of files
24. Distributed and Streaming Data Analysis Research
25. Simulation Data Sets are Massive and Growing Fast
[Figure: astrophysics data per run]
26. Most of this Data will NEVER Be Touched, given current trends in technology
- The amount of data stored online quadruples every 18 months, while processing power only doubles every 18 months.
- Unless the number of processors increases unrealistically rapidly, most of this data will never be touched.
- Storage device capacity doubles every 9 months, while memory capacity doubles every 18 months (Moore's law).
- Even if these growth rates eventually converge, memory latency is and will remain the rate-limiting step in data-intensive computations.
- Operating systems struggle to handle files larger than a few GB.
- OS constraints and memory capacity determine data set file size and fragmentation.
27. Massive Data Sets are Naturally Distributed BUT Effectively Immovable (Skillicorn, 2001)
- Bandwidth is increasing, but not at the same rate as stored data
- Some parts of the world have high available bandwidth, BUT there are enough bottlenecks that high effective bandwidth is unachievable across heterogeneous networks
- Latency for transmission at global distances is significant
  - Most of this latency is time-of-flight and so will not be reduced by technology
- Data has a property similar to inertia
  - It is cheap to store and cheap to keep moving, but the transitions between these two states are expensive in time and hardware.
- Legal and political restrictions
- Social restrictions
  - Data owners may allow access to their data, but only while retaining control of it
Computations MUST move to the data, rather than data to the computations
28. Simulation Data Sets are Dynamically Changing
- Scientific simulations (e.g., climate modeling and supernova explosion) typically run for at least one month and produce data sets on the order of one to ten terabytes per simulation.
- Effectively and efficiently analyzing these streams of data is a challenge
- Most existing methods work with static datasets; any change requires complete re-computation.
Computations MUST be able to analyze streams of data efficiently while they are being produced, rather than wait until they are complete
29. Algorithms Fail for a Few Gigabytes of Data
- Algorithmic Complexity
  - Calculate means: O(n)
  - Calculate FFT: O(n log n)
  - Calculate SVD: O(r * c)
  - Clustering algorithms: O(n^2)
For illustration, the chart assumes a calculation time of 10^-12 sec per data point
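The chart's point follows from simple arithmetic; a back-of-envelope check at the stated 10^-12 sec per step, taking n = 10^9 points as "a few gigabytes":

```python
# At 1e-12 s per elementary step, linear algorithms stay interactive on
# a billion points while O(n^2) clustering becomes infeasible.
import math

SEC_PER_OP = 1e-12
n = 10**9  # roughly a few-gigabyte data set

t_mean = n * SEC_PER_OP                    # O(n)
t_fft = n * math.log2(n) * SEC_PER_OP      # O(n log n)
t_cluster = n**2 * SEC_PER_OP              # O(n^2)

print(f"means:      {t_mean:.3f} s")             # 0.001 s
print(f"FFT:        {t_fft:.3f} s")              # 0.030 s
print(f"clustering: {t_cluster / 86400:.1f} days")  # 11.6 days
```

Even with this optimistic per-step cost, quadratic algorithms take days on data that linear ones finish in milliseconds, which is why the distributed and reduced-data methods in the following slides matter.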
30. RACHET: A High-Performance Framework for Distributed Cluster Analysis
Strategy: perform data mining in a distributed fashion with reasonable data transfer overheads
Key idea:
- Compute local analyses using distributed agents
- Merge the minimum information into a global analysis via peer-to-peer agent collaboration and negotiation
Benefits: NO need to centralize data; linear scalability with data size and with data dimensionality
31. Linear-Time Dimension Reduction for Streaming Distributed Data
- Status
  - C, MPI, and MPI-IO based implementation of the package
  - Both one-time and iterative communication
  - Integration into ASPECT is in progress
  - Requested by DataSpace (UIC), the P3 project (Ekow), and LBL
- Features
  - One-time communication
  - Linear time for each chunk
  - 10% deviation from the centralized version
  - Based on FastMap
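A minimal FastMap-style projection, sketched for one output dimension (this is the standard textbook form of the method named above, not the package's code; the pivot heuristic and point set are illustrative). Chunked use stays linear per chunk because the pivots are chosen once and then reused.

```python
# FastMap in one dimension: pick a far-apart pivot pair (a, b), then
# project every point onto the a-b axis using only pairwise distances.
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def choose_pivots(points, seed_idx=0):
    """Heuristic: farthest point from a seed, then farthest from that."""
    a = max(points, key=lambda p: dist(points[seed_idx], p))
    b = max(points, key=lambda p: dist(a, p))
    return a, b

def fastmap_coord(p, a, b):
    """FastMap 1-d coordinate of p relative to pivots a and b."""
    dab = dist(a, b)
    return (dist(a, p) ** 2 + dab ** 2 - dist(b, p) ** 2) / (2 * dab)

points = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (3.0, 0.1)]
a, b = choose_pivots(points)
coords = [fastmap_coord(p, a, b) for p in points]
```

Each coordinate costs two distance evaluations, so projecting a chunk of m points is O(m) once the pivots are fixed, which is what makes the one-pass, one-time-communication scheme possible.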
32. Distributed Principal Components (PCA): Merging Information Rather Than Raw Data
- Global Principal Components
  - transmit information, not data
- Dynamic Principal Components
  - no need to keep all data
Method: merge a few local PCs and local means
- Benefits
  - Little loss of information
  - Much lower transmission costs
    - Centralized: O(np)
    - DPCA: O(sp), s << n
  - Computation cost: O(kp^2) vs. O(np^2)
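The merge step can be sketched as follows. This is an illustration of the general idea, not the paper's exact algorithm: each site ships only its size, mean, and top-k principal axes with their variances, and the coordinator rebuilds an approximate global covariance from the within-site and between-site scatter.

```python
# Sketch of merging local PCA summaries (size, mean, top-k components)
# into a global covariance, instead of shipping the raw data.
import numpy as np

def local_summary(X, k):
    n, mean = len(X), X.mean(axis=0)
    # Top-k principal axes and variances of the centered local block.
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return n, mean, Vt[:k], (s[:k] ** 2) / (n - 1)

def merge(summaries):
    N = sum(n for n, _, _, _ in summaries)
    gmean = sum(n * m for n, m, _, _ in summaries) / N
    S = np.zeros((len(gmean), len(gmean)))
    for n, m, Vt, var in summaries:
        S += (n - 1) * (Vt.T * var) @ Vt   # within-site scatter (top-k only)
        d = (m - gmean)[:, None]
        S += n * (d @ d.T)                 # between-site scatter of the means
    return gmean, S / (N - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
halves = [local_summary(X[:100], k=3), local_summary(X[100:], k=3)]
gmean, cov = merge(halves)
```

With k equal to the full dimension (no truncation, as in this toy run) the merged covariance matches the centralized `np.cov` exactly; truncating to k < p is what cuts transmission from O(np) to O(sp) at the cost of a little information.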
33. Data Understanding for Scientific Discovery
34. Data Analysis for Monitoring Simulations
- What do we monitor?
- Contrast between supernova and climate simulation data analysis
- Highlights from astrophysics
- Wider implications for simulation data
- Data reduction, and monitoring from reduced data
35. What Do We Monitor?
[Figure: entropy of a 2-d (axisymmetric) supernova simulation]
- General concepts
- Application-specific:
  - comparative displays driven by data mining and exploratory data analysis
  - visual comparison in time is less effective than side-by-side comparison (The Visual Display of Quantitative Information, Tufte)
36. Evolving Display Shows Entropy Progression over Time
[Figure: entropy vs. radius over time; reduction with the median]
37. Specific Aspects of the Simulation Can be Monitored: Entropy Instability (Range) over Time
[Figure: entropy range vs. radius over time; reduction with the range (max - min)]
38. Shorten the Experimental Cycle with Run-and-Render Comparative Monitoring
[Figure: side-by-side displays over radius]
39. Concise Views of a Supernova Simulation
- Displays must be application-specific, but some general concepts apply
- Need general data mining capability for flexibility in building displays
40. Data Reduction for Multigrid Simulation
- Based on PCA of contiguous field blocks
- Exploits spatial correlation and adapts to the complexity of the spatial field
- A parameter controls the selected variation
- Field restoration with a single matrix multiply
- Astrophysics supernova simulation:
  - 16 to 200 times reduction per time step
  - Outperforms subsampling by a factor of 3 for comparable MSE over all time steps
[Figure: Timestep 390]
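The reduction idea can be sketched in a few lines. This is an assumption-laden illustration, not the paper's code: PCA on contiguous blocks of the field, keeping enough components to cover a target fraction of the variance, with restoration as the promised single matrix multiply. The block size, variance threshold, and test field are all arbitrary.

```python
# Sketch of block-PCA field reduction: treat contiguous blocks of the
# field as rows, keep the top components covering `frac` of the
# variance, and restore the field with one matrix multiply.
import numpy as np

def compress_blocks(field, block, frac=0.99):
    """field: 2-d array whose rows group into contiguous blocks."""
    rows = field.reshape(-1, block * field.shape[1])  # one row per block
    mean = rows.mean(axis=0)
    U, s, Vt = np.linalg.svd(rows - mean, full_matrices=False)
    var = s ** 2
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), frac)) + 1
    scores = (rows - mean) @ Vt[:k].T      # what gets stored per block
    return scores, Vt[:k], mean

def restore(scores, basis, mean, shape):
    return (scores @ basis + mean).reshape(shape)  # single matrix multiply

rng = np.random.default_rng(1)
field = rng.normal(size=(64, 8))
scores, basis, mean = compress_blocks(field, block=8)
approx = restore(scores, basis, mean, field.shape)
```

On a real simulation field, spatial correlation within blocks is what lets a small k cover most of the variance, giving the 16x to 200x reductions quoted above; the random test field here compresses far less well.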
41. Spherical Symmetry: Medians Conserved under PC Compression
[Figure: medians over time, original data vs. 30x compressed data]
42. Spherical Symmetry: Instability Ranges Conserved under PC Compression
[Figure: ranges vs. radius over time, original data vs. 30x compressed data]
43. Publications & Presentations
44. Conference
- Co-sponsored the Statistical Data Mining Conference, June 22-25, 2002, in Knoxville, jointly with the University of Tennessee Department of Statistics
- Organized an invited session on Distributed Data Mining at the conference
45. Publications FY 2002
- Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G. A. Geist (2002). Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets. Workshop on High Performance Data Mining at the Second SIAM International Conference on Data Mining, pp. 4-9.
- N. F. Samatova, G. Ostrouchov, A. Geist, and A. Melechko (2002). RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Special Issue on Parallel and Distributed Data Mining, Distributed and Parallel Databases: An International Journal, Vol. 11, No. 2, March 2002.
- F. N. Abu-Khzam, N. F. Samatova, G. Ostrouchov, M. A. Langston, and A. G. Geist (2002). Distributed Dimension Reduction Algorithms for Widely Dispersed Data. Fourteenth IASTED International Conference on Parallel and Distributed Computing and Systems. Accepted.
- G. Ostrouchov and N. F. Samatova (2002). On FastMap and the Convex Hull of Multivariate Data. In preparation.
- J. Hespen, G. Ostrouchov, N. F. Samatova, and A. Mezzacappa (2002). Adaptive Data Reduction for Multigrid Simulation Output. In preparation.
46. Presentations FY 2002
- Invited
  - G. Ostrouchov and N. F. Samatova. Multivariate Analysis of Massive Distributed Data Sets. Spring Research Conference on Statistics in Industry and Technology, May 20-22, 2002, Ann Arbor, Michigan.
  - G. Ostrouchov and N. F. Samatova. Combining Distributed Local Principal Component Analyses into a Global Analysis. C. Warren Neel Conference on Statistical Data Mining and Knowledge Discovery, June 22-25, 2002, Knoxville, Tennessee.
  - N. Samatova, G. A. Geist, and G. Ostrouchov. RACHET: Petascale Distributed Data Analysis Suite. SPEEDUP Workshop on Distributed Supercomputing: Data Intensive Computing, March 4-6, 2002, Badehotel Bristol, Leukerbad, Valais, Switzerland.
- Contributed
  - Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G. A. Geist. Principal Component Analysis for Dimension Reduction in Massive Distributed Data Sets. Workshop on High Performance Data Mining at the Second SIAM International Conference on Data Mining, April 11-13, 2002, Washington, DC.
- Local
  - N. Samatova and G. Ostrouchov. Large-Scale Analysis of Distributed Scientific Data. ORNL Weinberg Auditorium, July 11, 2002.
47. Thank You!