ASPECT: Adaptable Simulation Product Exploration and Control Toolkit


1
ASPECT: Adaptable Simulation Product
Exploration and Control Toolkit
Nagiza Samatova, George Ostrouchov
Computer Science and Mathematics Division, Oak Ridge National Laboratory
http://www.csm.ornl.gov/
SDM All-Hands Meeting, September 11-13, 2002
2
Our Team
  • Students
  • Abu-Khzam, Faisal, Ph.D. University of
    Tennessee, Knoxville
  • Bauer, David, B.S. Georgia Institute of
    Technology
  • Hespen, Jennifer, Ph.D. University of
    Tennessee, Knoxville
  • Nair, Rajeet, M.S. University of Illinois,
    Chicago
  • Postdocs
  • Park, Hooney, Ph.D.
  • Staff
  • Ostrouchov, George, Ph.D. Principal
    Investigator
  • Reed, Joel, M.S.
  • Samatova, Nagiza, Ph.D. Principal Investigator
  • Watkins, Ian, B.S.

3
Our Collaborators
  • Application
  • David Erickson, Climate, ORNL
  • John Drake, ORNL
  • Tony Mezzacappa, Astrophysics, ORNL
  • Linear Algebra & Graph Theory
  • Gene Golub, Stanford University
  • Mike Langston, UTK
  • Data Mining and Data Management
  • Rob Grossman, UIC
  • High Performance Computing
  • Alok Choudhary, Wei-keng Liao, NWU
  • Bill Gropp, Rob Ross, Rajeev Thakur, ANL
  • Hardware and Software Infrastructure
  • Dan Million, ORNL
  • Randy Burris, ORNL

4
Typical Simulation Exploration Scenarios: Driven
by limitations of existing technologies
  • Post-processing Scenario
  • Submit a long-running simulation job (weeks to
    months)
  • Periodically check the status (run the tail -f
    command on each machine)
  • Analyze large simulation data set
  • Real-time Scenario
  • Instrument a simulation code to visualize a
    field(s)
  • While running a simulation job
  • Monitor the selected field(s)
  • If monitoring is not possible, then either stop
    the job or continue running without monitoring
    and the ability to view later what was skipped
  • If changing the set of fields to monitor, then go
    to step 1

5
Analysis Visualization of Simulation Product
State of the Art
  • Post-processing data analysis tools (like
    PCMDI)
  • Scientists must wait for the simulation
    completion
  • Can use lots of CPU cycles on long-running
    simulations
  • Can use up to 50% more storage and require
    unnecessary data transfer for data-intensive
    simulations
  • Real-time Simulation monitoring tools (like
    Cumulvs)
  • Need simulation code instrumentation (e.g., call
    to vis. libraries)
  • Interference with simulation run: snapshot of
    data → can pause simulation
  • Computationally intensive data analysis task
    becomes part of simulation
  • Synchronous view of data and simulation run
  • More control over simulation

6
Some More Limitations
  • Post-processing data analysis tools
  • Application specific (PyClimate, mtaCDF, PCMDI
    tools, ncview)
  • tools written for one application cannot be
    used for another
  • usually written by experts in the application
    not data analysis field
  • Not user friendly, usually script-driven (Python,
    IDL, GrADS)
  • Support no more than a dozen simple data
    analysis algorithms
  • Do not exist for some applications (astrophysics
    vs. climate)
  • Are not designed as distributed systems
  • distributed data sets must be centralized
  • tools must be installed where the data is
  • Real-time Simulation monitoring tools
  • Provide even simpler data analysis (usually
    focused on rendering of the data)
  • Require good familiarity with the simulation
    code to make changes
  • NCAR folks develop climate simulation codes
    (PCM, CCSM) used world-wide

7
Improvements through ASPECT: Data stream (not
simulation) → monitoring tool
  • ASPECT's advantages
  • No simulation code instrumentation
  • Single data, multiple views of the data
  • No interference w/ simulation
  • Decoupled from the simulation

8
Run and Render Simulation Cycle in SciDAC: Our
vision
Goal: To develop ASPECT (Adaptable Simulation
Product Exploration and Control Toolkit)
Benefits:
  • Enable effective and efficient monitoring of
    data generated by long running simulations
    through the GUI interface to a rich set of
    pluggable data analysis modules
  • Potentially lead to new scientific discoveries
  • Allow very efficient utilization of human and
    computer resources

9
Approaching the Goal through a Collaborative Set
of Activities
10
Building a Workflow Environment
11
80 > 20 Paradigm in Probes Research
Application driven Environment
From frustrations
To smooth operation
  • Very limited resources
  • General purpose software only
  • Lack of interface with HPSS
  • Homogenous platform (e.g., Linux only)
  • Hardware Infrastructure
  • RS6000 S80, 6 processors
  • 2 GB memory, 1 TB IDE FibreChannel RAID
  • 4-processor (1.4 GHz Xeon), 8 GB, 573 GB,
    FibreChannel HBA and GigE
  • two 2-processor (2.4 GHz Xeon), 2 GB, 573 GB,
    GigE, FibreChannel HBA
  • Software Infrastructure
  • Compilers (Fortran, C, Java)
  • Data Analysis (R, Java-R, Ggobi)
  • Visualization (ncview, GrADS)
  • Data Formats (netCDF, HDF)
  • Data Storage & Transfer (HPSS, hsi, pftp,
    GridFTP, MPI-IO, PVFS)

12
ASPECT Design and Implementation
13
ASPECT Infrastructure: Distributed End-to-End
System
14
ASPECT GUI Infrastructure
  • Functionality
  • Instantiate Modules
  • Link Modules
  • Synchronous Control
  • Add Modules by XML
  • XML-based Request Builder

15
ASPECT Back-End Engine Overview
The GUI passes a string indicating the script to
run, the variables to pass to the script, the
names of the files (or groups of files) where
those variables can be found, and other optional
parameters. The engine parses the string, reads
all of the data into R-compatible objects (in
memory), and then calls the script through
R. When R returns, the single returned object is
broken up into respective variables, and written
to a NetCDF file.
Engine Front End (Takes Request from GUI, reads
input into memory)
R Script (Translates input to R function call)
GUI
R (Performs calculations)
Engine Back End (Converts R's output to a NetCDF
file)
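The request flow described above can be sketched in Python. This is illustrative only: the request format and all function names here are invented stand-ins, not ASPECT's actual interface.

```python
# Hypothetical sketch of the ASPECT engine's request flow.  The GUI sends
# a string naming a script, its variables, and the files holding them;
# the engine loads the data, runs the script, and writes each field of
# the returned object back out.

def parse_request(request):
    """Split 'script;var1=file1,var2=file2' into script name and bindings."""
    script, _, bindings = request.partition(";")
    varmap = dict(pair.split("=") for pair in bindings.split(",") if pair)
    return script, varmap

def run_request(request, read_var, call_script, write_var):
    """read_var/call_script/write_var stand in for the NetCDF and R layers."""
    script, varmap = parse_request(request)
    env = {name: read_var(path) for name, path in varmap.items()}   # front end
    result = call_script(script, env)                               # run in R
    for name, value in result.items():                              # back end
        write_var(name, value)
```

With toy stand-ins for the three layers, `run_request("wsort;data=run1.nc", ...)` would read the variable, call the script, and write each field of the result under its own name.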
16
Interfacing with R: ASPECT provides a rich set of
data analysis modules through R
  • Status
  • Released under the GPL on SourceForge, September
    2002
  • Includes about 30 algorithms
  • A dozen can be added in a matter of a week
  • Requested by DataSpace, UIC
  • Joint effort w/ DataSpace

http://www.r-project.org/
The open source R statistical package provides
the generic computational backend for the ASPECT
engine. While R was designed to be mostly a
stand-alone program, it does provide for internal
hooks in its libraries. Using the same functions,
macros, and syntax available to internal R code,
the ASPECT engine creates R objects from the
input data directly. These objects are then
installed in the namespace of the R engine, and
used by the R wrapper scripts as if they were
running in an ordinary R environment.
17
Scripts
Using R script wrappers to the R functions allows
for an incredible amount of flexibility. Users
can easily add their own functions, without
having to know the internals of the ASPECT
engine. Most of the scripts, like the one below,
simply translate the C input into the equivalent
R function call.
wsample <- function(x1, x2, v1, v2, n1, n2, c1, c2) {
  a <- if (n2 != 0) TRUE else FALSE
  q <- if (!is.null(v2)) {
    if (n1 != 0) sample(v1, size = n1, replace = a, prob = v2)
    else sample(v1, replace = a, prob = v2)
  } else {
    if (n1 != 0) sample(v1, size = n1, replace = a)
    else sample(v1, replace = a)
  }
  list(Sample = q)
}
The scripts can be as complicated or as simple as
they need to be. The script below is perfectly
valid.
whello <- function(x1, x2, v1, v2, n1, n2, c1, c2)
  print("Hello World")
18
XML-based Description of Algorithms and
Visualization Interfaces
<name> wsort </name>
<displayName> Sort </displayName>
<input>
  <variable>
    <type> vector </type>
    <name> data </name>
    <description> The input data </description>
  </variable>
  <variable> ....
  • Dynamically loaded XML descriptions of functions
    and menus provide user expandable configuration
    details.
  • Users can add comments, change default values,
    add multiple interfaces to a single function, and
    add interfaces for their own functions.
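As a sketch, a descriptor like the fragment above could be loaded with Python's standard library. The enclosing `<function>` element and the exact schema are assumptions for illustration, not ASPECT's actual format.

```python
# Sketch (assumed schema) of loading one XML algorithm descriptor.
import xml.etree.ElementTree as ET

DESCRIPTOR = """
<function>
  <name> wsort </name>
  <displayName> Sort </displayName>
  <input>
    <variable>
      <type> vector </type>
      <name> data </name>
      <description> The input data </description>
    </variable>
  </input>
</function>
"""

def load_descriptor(xml_text):
    """Turn a descriptor into a dict the GUI could build a menu from."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name").strip(),
        "displayName": root.findtext("displayName").strip(),
        "inputs": [
            {"type": v.findtext("type").strip(),
             "name": v.findtext("name").strip(),
             "description": v.findtext("description").strip()}
            for v in root.iter("variable")
        ],
    }
```

Because the descriptors are plain data files, a user can add a new function to the GUI by dropping in a new XML file, with no change to the engine.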

19
NetCDF/HDF Input/Output: ASPECT understands and
uses scientific standard file formats
http://www.unidata.ucar.edu/packages/netcdf/
The open source NetCDF format is widely used to
hold self-describing data. The output from the R
engine is a single R object. Given the
recursively defined list nature of R objects,
this is no limitation. In order to save a dynamic
R object into a flat NetCDF file, the object must
be carefully unwound, while preserving as much of
the metadata (such as dimension names, the
original source of the data, etc) as possible
into the NetCDF file. Once the output file is
written, it is ready to be used by the user
either for visualization, or as the input to
another function.
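The "unwinding" step can be illustrated by flattening a nested structure (standing in for a recursive R list) into flat, name-qualified variables, roughly the shape a flat NetCDF file needs. This is a hypothetical sketch, not the ASPECT back end.

```python
# Flatten a nested dict (a stand-in for an R list) into {'a.b': value}
# pairs, preserving the path to each leaf as its variable name.

def unwind(obj, prefix=""):
    """Flatten nested dicts into {'outer.inner': leaf_value} pairs."""
    flat = {}
    if isinstance(obj, dict):
        for key, val in obj.items():
            name = f"{prefix}.{key}" if prefix else key
            flat.update(unwind(val, name))
    else:
        flat[prefix] = obj  # leaf: an array or scalar
    return flat
```

The dotted names play the role of the preserved metadata: the structure of the original object can still be read off the flat variable names.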
20
MPI-IO NetCDF: ASPECT supports parallel I/O w/
various data access patterns (Collaboration with
ANL (Bill Gropp, Rob Ross, Rajeev Thakur) and
NWU (Alok Choudhary, Wei-keng Liao))
  • Concatenate multiple files into a single file
    for a given set of variables
  • Analyze multiple files with different data
    distribution patterns among processors (by
    blocks, by strided patterns, by entire files)
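The three distribution patterns listed above can be modeled as index assignments per processor rank. This is an illustration of the decomposition only, not the actual MPI-IO NetCDF interface.

```python
# Model of the three data-distribution patterns among processor ranks.

def block_indices(n, rank, nprocs):
    """Contiguous block of indices per rank."""
    size = -(-n // nprocs)  # ceiling division
    return list(range(rank * size, min((rank + 1) * size, n)))

def strided_indices(n, rank, nprocs):
    """Round-robin (strided) pattern across ranks."""
    return list(range(rank, n, nprocs))

def file_indices(files, rank, nprocs):
    """Entire files assigned round-robin to ranks."""
    return [f for i, f in enumerate(files) if i % nprocs == rank]
```

For example, with 10 records and 3 ranks, rank 0 reads records 0-3 under the block pattern but records 0, 3, 6, 9 under the strided pattern; the choice matters because it determines how well reads coalesce into large contiguous I/O requests.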

21
Data Sampling: ASPECT handles large data sets
Types of Subsampling
  • Random subsampling
  • Decimation
  • Blocks
  • Striding

Implementations
  • Standard netCDF
  • MPI-IO netCDF
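The four subsampling types can be sketched on an index range. This is illustrative only: the real implementations operate on netCDF variables, and the parameter names here are invented.

```python
# Sketch of the four subsampling types over n data points,
# each returning the indices of the points that are kept.
import random

def subsample(n, method, k, seed=0):
    idx = list(range(n))
    if method == "random":
        return sorted(random.Random(seed).sample(idx, k))
    if method == "decimation":     # keep every k-th point
        return idx[::k]
    if method == "blocks":         # one contiguous block of k points
        return idx[:k]
    if method == "striding":       # k evenly spaced points
        step = max(1, n // k)
        return idx[::step][:k]
    raise ValueError(method)
```

Decimation and striding preserve coverage of the whole range at reduced resolution, while a block keeps full resolution over a sub-range; random subsampling gives unbiased summaries at the cost of irregular access.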

22
Interfacing with DataSpace: ASPECT provides
hooks to a Web of Scientific Data (Collaboration
with Rob Grossman at UIC)
The web today provides an infrastructure for
working with distributed multimedia documents.
DataSpace is an infrastructure for creating a web
of data instead of documents.
  • Very high throughput for moving data through
    DataSpace's parallel network transport protocols
    (PSockets (TCP), Sabul (TCP, UDP))
  • Ability to do comparative/correlation analysis
    between simulation and archived data

DataSpace Web of Data
PSockets/Sabul
UIC to Amsterdam: Sabul 540 Mb/s, PSockets 180
Mb/s, Sockets 10 Mb/s
http://www.dataspaceweb.net
23
Summary of ASPECT's Design & Implementation
  • ASPECT is a Data Stream Monitoring Tool
  • ASPECT has very nice features for efficient and
    effective simulation data analysis
  • GUI interface to a rich set of pluggable data
    analysis modules.
  • Uses the open source R statistical data analysis
    package as a computational back-end.
  • Understands and uses the NetCDF/HDF scientific
    file format.
  • Uses dynamically loaded R scripts and XML
    descriptors for flexibility.
  • Handles large sets of data through the support
    for block selection, striding, sampling, data
    reduction, and distributed algorithms.
  • Provides efficient I/O through an MPI-IO
    interface to NetCDF and HDF
  • Moves data efficiently through PSockets/Sabul
  • Supports a dataset view of the simulation, not
    only a collection of files

24
Distributed and Streamline Data Analysis Research
25
Simulation Data Sets are Massive & Growing Fast
Astrophysics Data per Run
26
Most of this Data will NEVER Be Touched with the
current trends in technology
  • The amount of data stored online quadruples every
    18 months, while processing power only doubles
    every 18 months.
  • Unless the number of processors increases
    unrealistically rapidly, most of this data will
    never be touched.
  • Storage device capacity doubles every 9 months,
    while memory capacity doubles every 18 months
    (Moores law).
  • Even if these growth rates eventually converge,
    memory latency is and will remain the
    rate-limiting step in data-intensive computations
  • Operating systems struggle to handle files larger
    than a few GB.
  • OS constraints and memory capacity determine data
    set file size and fragmentation

27
Massive Data Sets are Naturally Distributed BUT
Effectively Immoveable (Skillicorn, 2001)
  • Bandwidth is increasing but not at the same rate
    as stored data
  • There are some parts of the world with high
    available bandwidth BUT there are enough
    bottlenecks that high effective bandwidth is
    unachievable across heterogeneous networks
  • Latency for transmission at global distances is
    significant
  • Most of this latency is time-of-flight and so
    will not be reduced by technology
  • Data has a property similar to inertia
  • It is cheap to store and cheap to keep moving,
    but the transitions between these two states are
    expensive in time and hardware.
  • Legal and political restrictions
  • Social restrictions
  • Data owners may allow access to their data
  • but only by retaining control of it

Computations MUST move to data, rather than data
to computations
28
Simulation Data Sets are Dynamically Changing
  • Scientific simulations (e.g., climate modeling
    and supernova explosion) typically run for at
    least one month and produce data sets in the
    order of one to ten terabytes per simulation.
  • Effectively and efficiently analyzing these
    streams of data is a challenge
  • Most existing methods work with static datasets.
    Any changes require complete re-computation.

Computations MUST be able to efficiently analyze
streams of data while they are being produced,
rather than wait until they are produced
29
Algorithms Fail for Data of a Few Gigabytes
  • Algorithmic Complexity
  • Calculate means: O(n)
  • Calculate FFT: O(n log(n))
  • Calculate SVD: O(r · c)
  • Clustering algorithms: O(n^2)

For illustration, the chart assumes a 10^-12 sec
calculation time per data point
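Taking the assumed cost of 10^-12 seconds per data point, the scales work out roughly as follows for n = 10^9 points (a few gigabytes of doubles); the choice of n is an illustration, not from the slide.

```python
# Back-of-the-envelope times implied by an assumed cost of
# 1e-12 s per data point, for n = 1e9 points.
import math

T = 1e-12   # seconds per data point (assumed)
n = 1e9     # number of data points (assumed)

mean_s = T * n                   # O(n)       -> 0.001 s
fft_s = T * n * math.log2(n)     # O(n log n) -> ~0.03 s
cluster_s = T * n**2             # O(n^2)     -> 1e6 s
print(cluster_s / 86400)         # ~11.6 days
```

The point of the chart follows directly: linear and near-linear algorithms stay interactive at this scale, but a quadratic clustering algorithm already takes on the order of weeks.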
30
RACHET High Performance Framework for Distributed
Cluster Analysis
Strategy:
Perform data mining in a distributed fashion
with reasonable data transfer overheads
Key idea:
Compute local analyses using distributed agents;
merge minimum info into a global analysis via
peer-to-peer agent collaboration and negotiation
Benefits:
NO need to centralize data; linear scalability
with data size and with data dimensionality
31
Linear Time Dimension Reduction for Streamline
Distributed Data
  • Status
  • C, MPI, MPI-IO based implementation of package
  • Both one time and iterative communication
  • Integration into ASPECT is in progress
  • Requested by DataSpace, UIC P3 project (Ekow),
    LBL
  • Features
  • One time communication
  • Linear time for each chunk
  • 10% deviation from the central version
  • Based on FastMap

32
Distributed Principal Components (PCA) Merging
Information Rather Than Raw Data
  • Global Principal Components
  • transmit information, not data
  • Dynamic Principal Components
  • no need to keep all data

Method: Merge a few local PCs and local means
  • Benefits
  • Little loss of information
  • Much lower transmission costs
  • Centralized: O(np)
  • DPCA: O(sp), s << n
  • Computation cost
  • O(kp^2) vs O(np^2)
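A minimal numpy sketch of the merging idea (not the authors' exact DPCA algorithm): each site transmits only its mean and its top-k scaled principal directions, and an approximate global analysis is recovered from those summaries alone.

```python
# Sketch: merge local PCA summaries instead of centralizing raw data.
import numpy as np

def local_summary(X, k):
    """Return (n, mean, top-k scaled components) for one site's n x p data."""
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return len(X), mu, S[:k, None] * Vt[:k]    # k x p transmitted, k*p << n*p

def merge_pca(summaries):
    """Approximate global mean and principal directions from summaries."""
    n_tot = sum(n for n, _, _ in summaries)
    gmean = sum(n * mu for n, mu, _ in summaries) / n_tot
    # Stack each site's scaled components plus its mean-shift contribution.
    rows = [np.vstack([C, np.sqrt(n) * (mu - gmean)])
            for n, mu, C in summaries]
    _, S, Vt = np.linalg.svd(np.vstack(rows), full_matrices=False)
    return gmean, Vt
```

Only O(kp) numbers per site cross the network instead of the O(np) raw data, which is exactly the transmission saving the slide claims, at the cost of discarding variation outside the top k local components.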

33
Data Understanding for Scientific Discovery
34
Data Analysis for Monitoring Simulations
  • What do we monitor?
  • Contrast between Supernova and Climate simulation
    data analysis
  • Highlights from Astrophysics
  • Wider implications on simulation data
  • Data reduction and monitoring from reduced data

35
What Do We Monitor?
Entropy of 2-d (axisymmetric) Supernova
Simulation
  • General Concepts
  • Application-specific
  • comparative displays driven by data mining and
    exploratory data analysis
  • "Visual comparison in time is less effective
    than comparison side-by-side" (The Visual Display
    of Quantitative Information, Tufte)

36
Evolving Display Shows Entropy Progression over
Time
[Figure: entropy vs. radius over time, reduced with the median]
37
Specific Aspects of Simulation Can be Monitored
Entropy instability (range) over time
[Figure: entropy instability vs. radius over time, reduced with the range (max − min)]
38
Shorten the Experimental Cycle with
Run-and-Render Comparative Monitoring
[Figure: side-by-side run-and-render comparative displays over radius]
39
Concise Views of a Supernova Simulation
  • Displays must be application-specific, but some
    general concepts apply
  • Need general data mining capability for
    flexibility in building displays

40
Data Reduction for Multigrid Simulation
  • Based on PCA of contiguous field blocks
  • Exploits spatial correlation and adapts to
    complexity of spatial field
  • Parameter controls selected variation
  • Field restoration with single matrix multiply
  • Astrophysics supernova simulation
  • 16 to 200 times reduction per time step
  • Outperforms subsampling by 3x for comparable
    MSE over all time steps
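The block-PCA idea can be sketched as follows (illustrative, not the ASPECT implementation): split the field into contiguous blocks, keep only the top components of the block matrix, and restore the field with a single matrix multiply, as the slide states.

```python
# Sketch of block-PCA field reduction: store per-block scores plus a
# small shared basis; restoration is one matrix multiply.
import numpy as np

def compress_field(field, block, k):
    """field: 1-D array whose length is a multiple of `block`."""
    B = field.reshape(-1, block)          # rows = contiguous blocks
    mu = B.mean(axis=0)
    U, S, Vt = np.linalg.svd(B - mu, full_matrices=False)
    scores = U[:, :k] * S[:k]             # per-block coefficients
    return scores, Vt[:k], mu             # stored representation

def restore_field(scores, Vt, mu):
    return (scores @ Vt + mu).ravel()     # single matrix multiply
```

Spatially correlated fields make the block matrix close to low rank, so a few components per block reproduce the field accurately while storing far fewer numbers than the original.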

Timestep 390
41
Spherical Symmetry: Medians Conserved under PC
Compression
[Figure: original data vs. 30x compressed data over time]
42
Spherical Symmetry: Instability Ranges Conserved
under PC Compression
[Figure: original data vs. 30x compressed data, over radius and time]
43
Publications & Presentations
44
Conference
  • Co-sponsored the Statistical Data Mining
    Conference, June 22-25, 2002, in Knoxville,
    jointly with the University of Tennessee
    Department of Statistics
  • Organized an invited session on Distributed Data
    Mining at the conference.

45
Publications FY 2002
  • Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G.
    A. Geist (2002). Principal Component Analysis for
    Dimension Reduction in Massive Distributed Data
    Sets. Workshop on High Performance Data Mining
    at the Second SIAM International Conference on
    Data Mining, p.4-9.
  • N.F. Samatova, G. Ostrouchov, A. Geist, A.
    Melechko. RACHET: An Efficient Cover-Based
    Merging of Clustering Hierarchies from
    Distributed Datasets. Special Issue on Parallel
    and Distributed Data Mining, Distributed and
    Parallel Databases: An International Journal,
    Volume 11, No. 2, March 2002.
  • F. N. Abu-Khzam, N. F. Samatova, G. Ostrouchov,
    M. A. Langston, and A. G. Geist (2002).
    Distributed Dimension Reduction Algorithms for
    Widely Dispersed Data, Fourteenth IASTED
    International Conference on Parallel and
    Distributed Computing and Systems. Accepted.
  • G. Ostrouchov and N. F. Samatova (2002). On
    FastMap and the Convex Hull of Multivariate Data.
    In preparation.
  • J. Hespen, G. Ostrouchov, N. F. Samatova, and A.
    Mezzacappa (2002). Adaptive Data Reduction for
    Multigrid Simulation Output. In preparation.

46
Presentations FY 2002
  • Invited
  • G. Ostrouchov and N. F. Samatova. Multivariate
    Analysis of Massive Distributed Data Sets.
    Spring Research Conference on Statistics in
    Industry and Technology May 20-22, 2002, Ann
    Arbor, Michigan.
  • G. Ostrouchov and N. F. Samatova. Combining
    Distributed Local Principal Component Analyses
    into a Global Analysis, C. Warren Neel Conference
    on Statistical Data Mining and Knowledge
    Discovery, June 22-25, 2002, Knoxville,
    Tennessee.
  • N. Samatova, G. A. Geist, and G. Ostrouchov.
    RACHET: Petascale Distributed Data Analysis
    Suite. SPEEDUP Workshop on Distributed
    Supercomputing: Data Intensive Computing, March
    4-6, 2002, Badehotel Bristol, Leukerbad, Valais,
    Switzerland.
  • Contributed
  • Y. M. Qu, G. Ostrouchov, N. F. Samatova, and G.
    A. Geist. Principal Component Analysis for
    Dimension Reduction in Massive Distributed Data
    Sets. Workshop on High Performance Data Mining
    at the Second SIAM International Conference on
    Data Mining, April 11-13, 2002, Washington, DC.
  • Local
  • N. Samatova and G. Ostrouchov. Large-Scale
    Analysis of Distributed Scientific Data. ORNL
    Weinberg Auditorium, July 11, 2002.

47
Thank You!