1. Scientific Data Management Center (ISIC)
http://sdmcenter.lbl.gov contains an extensive publication list
2. Scientific Data Management Center: Participating Institutions
- Center PI: Arie Shoshani (LBNL)
- DOE laboratory co-PIs:
  - Bill Gropp, Rob Ross (ANL)
  - Arie Shoshani, Doron Rotem (LBNL)
  - Terence Critchlow, Chandrika Kamath (LLNL)
  - Nagiza Samatova, Andy White (ORNL)
- University co-PIs:
  - Mladen Vouk (North Carolina State)
  - Alok Choudhary (Northwestern)
  - Reagan Moore, Bertram Ludaescher (UC San Diego / SDSC)
  - Calton Pu (Georgia Tech)
  - Steve Parker (U of Utah, future)
3. Phases of Scientific Exploration
- Data Generation
  - From large-scale simulations or experiments
  - Data volumes grow rapidly with computational power
  - Examples:
    - HENP: 100 teraops and 10 petabytes by 2006
    - Climate: spatial resolution T42 (280 km) -> T85 (140 km) -> T170 (70 km); T42 yields about 1 TB per 100-year run, so a factor of 10-20 more data
- Problems
  - Can't dump the data to storage fast enough (waste of compute resources)
  - Can't move terabytes of data over the WAN robustly (waste of scientists' time)
  - Can't steer the simulation (waste of time and resources)
  - Need to reorganize and transform data (large data-intensive tasks slow progress)
4. Phases of Scientific Exploration
- Data Analysis
  - Analysis of large data volumes
  - Can't fit all of the data in memory
- Problems
  - Finding the relevant data requires efficient indexing
  - Cluster analysis requires linear scaling
  - Feature selection requires efficient high-dimensional analysis
  - Data heterogeneity: combining data from diverse sources
  - Streamlining analysis steps: the output of one step must match the input of the next
5. Example Data Flow in TSI
(Figure: TSI data flow over a Logistical Network; courtesy John Blondin)
6. Goal: Reduce the Data Management Overhead
- Efficiency
  - Examples: parallel I/O, indexing, matching storage structures to the application
- Effectiveness
  - Examples: access data by attributes rather than files, facilitate massive data movement
- New algorithms
  - Example: specialized PCA techniques to separate signals or to achieve better spatial data compression
- Enabling ad-hoc exploration of data
  - Example: an exploratory run-and-render capability to analyze and visualize simulation output while the code is running
7. Approach
- Use an integrated framework that:
  - Provides a scientific workflow capability
  - Supports data mining and analysis tools
  - Accelerates storage and access to data
- Simplify data management tasks for the scientist:
  - Hide details of the underlying parallel and indexing technology
  - Permit assembly of modules using a simple graphical workflow description tool
(Figure: the SDM framework: the Scientific Process Automation, Data Mining & Analysis, and Storage Efficient Access layers sit between the scientific application and scientific understanding)
8. Technology Details by Layer
9. Accomplishments: Storage Efficient Access (SEA)
- Shared memory communication
- Parallel Virtual File System (PVFS) enhancements and deployment
- Developed Parallel netCDF (a minimal usage sketch follows this slide)
  - Enables high-performance parallel I/O to netCDF datasets
  - Achieves up to a 10-fold performance improvement over HDF5
- Enhanced ROMIO
  - Provides MPI-IO access to PVFS
  - Advanced parallel file system interfaces for more efficient access
- Developed PVFS2
  - Adds Myrinet GM and InfiniBand support
  - Improved fault tolerance
  - Asynchronous I/O
  - Offered by Dell and HP for clusters
- Deployed an HPSS Storage Resource Manager (SRM) with PVFS
  - Automatic access of HPSS files from PVFS through the MPI-IO library
  - SRM is a middleware component
(Figure: FLASH I/O benchmark performance, 8x8x8 block sizes, before and after the enhancements)
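The kind of collective parallel write enabled by the Parallel netCDF work can be sketched with the PnetCDF C API as below; the file name, variable name, and dimension sizes are illustrative choices, not part of the center's deliverables.

    /* Minimal PnetCDF sketch: each MPI rank writes its own slab of a
     * global 2D field collectively into one netCDF file. */
    #include <mpi.h>
    #include <pnetcdf.h>
    #include <stdlib.h>

    #define NY 100          /* rows owned by each rank (illustrative) */
    #define NX 100          /* columns of the global field            */

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, dimids[2], varid, i;
        MPI_Offset start[2], count[2];
        float *slab;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Collectively create one netCDF file for all ranks */
        ncmpi_create(MPI_COMM_WORLD, "field.nc", NC_CLOBBER | NC_64BIT_OFFSET,
                     MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "y", (MPI_Offset)NY * nprocs, &dimids[0]);
        ncmpi_def_dim(ncid, "x", NX, &dimids[1]);
        ncmpi_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Each rank fills and writes its contiguous block of rows */
        slab = malloc(NY * NX * sizeof(float));
        for (i = 0; i < NY * NX; i++) slab[i] = (float)rank;
        start[0] = (MPI_Offset)rank * NY;  start[1] = 0;
        count[0] = NY;                     count[1] = NX;
        ncmpi_put_vara_float_all(ncid, varid, start, count, slab);

        ncmpi_close(ncid);
        free(slab);
        MPI_Finalize();
        return 0;
    }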
10. Robust Multi-file Replication
- Problem: move thousands of files robustly
  - Takes many hours
  - Needs error recovery
    - Mass storage system failures
    - Network failures
  - Solution: use Storage Resource Managers (SRMs)
- Problem: too slow (a concurrency-and-retry sketch follows this slide)
  - Use parallel streams
  - Use concurrent transfers
  - Use large FTP windows
  - Pre-stage files from MSS
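A minimal sketch of the concurrency-and-retry pattern behind robust multi-file replication is shown below; transfer_file() is a hypothetical placeholder for the real SRM/GridFTP transfer call (here it only simulates transient failures), and the file names, worker count, and retry limit are illustrative.

    /* Sketch of concurrent, fault-tolerant multi-file replication. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NFILES      8     /* illustrative */
    #define NWORKERS    4     /* concurrent transfers */
    #define MAX_RETRIES 5

    static const char *files[NFILES] = {
        "run01.nc", "run02.nc", "run03.nc", "run04.nc",
        "run05.nc", "run06.nc", "run07.nc", "run08.nc"
    };
    static int next_file = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Hypothetical placeholder for the real SRM/GridFTP transfer:
     * fails ~25% of the time to mimic transient MSS/network errors. */
    static int transfer_file(const char *name)
    {
        if (rand() % 4 == 0) return -1;
        printf("transferred %s\n", name);
        return 0;
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            int idx, attempt;

            /* Pull the next file off the shared work queue */
            pthread_mutex_lock(&lock);
            idx = (next_file < NFILES) ? next_file++ : -1;
            pthread_mutex_unlock(&lock);
            if (idx < 0) break;

            /* Retry transient failures with a simple backoff */
            for (attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                if (transfer_file(files[idx]) == 0) break;
                fprintf(stderr, "retrying %s (attempt %d)\n", files[idx], attempt);
                sleep(attempt);
            }
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        int i;
        for (i = 0; i < NWORKERS; i++) pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NWORKERS; i++) pthread_join(tid[i], NULL);
        return 0;
    }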
11. File tracking helps to identify bottlenecks
(Figure: file tracking shows that archiving is the bottleneck)
12. File tracking shows recovery from transient failures
(Figure: a total of 45 GB transferred)
13. Accomplishments: Data Mining and Analysis (DMA)
- Developed Parallel-VTK
  - Efficient 2D/3D parallel scientific visualization for netCDF and HDF files
  - Built on top of PnetCDF
- Developed a region-tracking tool
  - For exploring 2D/3D scientific databases
  - Uses bitmap technology to identify regions based on multi-attribute conditions (a bitmap-selection sketch follows this slide)
- Implemented an Independent Component Analysis (ICA) module
  - Used for accurate signal separation
  - Used for discovering key parameters that correlate with observed data
- Developed highly effective data reduction
  - Achieves a 15-fold reduction with a high level of accuracy
  - Uses parallel Principal Component Analysis (PCA) technology (a PCA sketch follows this slide)
- Developed ASPECT
  - A framework that supports a rich set of pluggable data analysis tools
  - Includes all of the tools above
(Figures: combustion region tracking; the El Nino signal (red) and its estimate (blue) closely match)
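A minimal illustration of the bitmap-based selection idea behind the region-tracking tool: one bitmap per range condition, combined with a bitwise AND to pick out cells that satisfy a multi-attribute condition. The field names, thresholds, and grid size are illustrative, and no bitmap compression is shown.

    /* Sketch of bitmap-index selection over a small grid. */
    #include <stdint.h>
    #include <stdio.h>

    #define NCELLS 64                     /* illustrative grid size */
    #define NWORDS ((NCELLS + 63) / 64)

    /* Set bit i if lo <= field[i] < hi */
    static void build_bitmap(const double *field, double lo, double hi,
                             uint64_t *bm)
    {
        int i;
        for (i = 0; i < NWORDS; i++) bm[i] = 0;
        for (i = 0; i < NCELLS; i++)
            if (field[i] >= lo && field[i] < hi)
                bm[i / 64] |= (uint64_t)1 << (i % 64);
    }

    int main(void)
    {
        double temp[NCELLS], hr[NCELLS];   /* temperature, heat release */
        uint64_t bm_temp[NWORDS], bm_hr[NWORDS], region[NWORDS];
        int i;

        /* Illustrative data */
        for (i = 0; i < NCELLS; i++) { temp[i] = i; hr[i] = NCELLS - i; }

        /* One bitmap per range condition */
        build_bitmap(temp, 20.0, 50.0, bm_temp);   /* 20 <= T  < 50 */
        build_bitmap(hr,   30.0, 60.0, bm_hr);     /* 30 <= HR < 60 */

        /* Cells in the "flame region" satisfy both conditions */
        for (i = 0; i < NWORDS; i++) region[i] = bm_temp[i] & bm_hr[i];

        for (i = 0; i < NCELLS; i++)
            if (region[i / 64] >> (i % 64) & 1)
                printf("cell %d selected\n", i);
        return 0;
    }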
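A serial illustration of PCA-based data reduction: power iteration finds the leading principal component and each sample is reduced to its projection onto it. The data, sizes, and iteration count are illustrative; the center's parallel PCA implementation is not reproduced here.

    /* Sketch of PCA data reduction via power iteration. */
    #include <math.h>
    #include <stdio.h>

    #define N 8     /* number of samples (illustrative)     */
    #define D 3     /* dimensions per sample (illustrative) */

    int main(void)
    {
        double x[N][D] = {
            {2.0, 1.9, 4.1}, {0.5, 0.6, 1.1}, {1.5, 1.4, 3.0}, {3.0, 3.1, 6.2},
            {2.5, 2.4, 5.0}, {1.0, 1.1, 2.1}, {0.2, 0.1, 0.4}, {2.8, 2.9, 5.6}
        };
        double mean[D] = {0}, w[D] = {1, 0, 0}, wn[D], norm;
        int i, j, it;

        /* Center the data */
        for (j = 0; j < D; j++) {
            for (i = 0; i < N; i++) mean[j] += x[i][j] / N;
            for (i = 0; i < N; i++) x[i][j] -= mean[j];
        }

        /* Power iteration on X^T X to find the leading principal component */
        for (it = 0; it < 100; it++) {
            for (j = 0; j < D; j++) wn[j] = 0.0;
            for (i = 0; i < N; i++) {
                double proj = 0.0;
                for (j = 0; j < D; j++) proj += x[i][j] * w[j];
                for (j = 0; j < D; j++) wn[j] += proj * x[i][j];
            }
            norm = 0.0;
            for (j = 0; j < D; j++) norm += wn[j] * wn[j];
            norm = sqrt(norm);
            for (j = 0; j < D; j++) w[j] = wn[j] / norm;
        }

        /* Reduced representation: one coefficient per sample instead of D */
        for (i = 0; i < N; i++) {
            double c = 0.0;
            for (j = 0; j < D; j++) c += x[i][j] * w[j];
            printf("sample %d -> %.3f\n", i, c);
        }
        return 0;
    }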
14. ASPECT Analysis Environment
(Figure: in the Data Mining & Analysis layer, tools such as pVTK and an R analysis tool carry out steps like "select data" and "take sample", exchanging named data buffers through read/write calls; the Storage Efficient Access layer serves requests such as "get variables (var-names, ranges)" and "use bitmap (condition)" via bitmap index selection, PVFS, and Parallel netCDF, on top of the hardware, OS, and mass storage system (HPSS))
15. Accomplishments: Scientific Process Automation (SPA)
- Unique requirements of scientific workflows
  - Moving large data volumes between modules
  - Tightly coupled, efficient data movement
  - Specification of granularity-based iteration
    - e.g., in spatio-temporal simulations a time step is a granule
  - Support for data transformation
    - Complex data types, including file formats such as netCDF and HDF
  - Dynamic steering of the workflow by the user
  - Dynamic user examination of results
- Developed a working scientific workflow system
  - Automates microarray analysis
  - Uses web-wrapping tools developed by the center
  - Uses the Kepler workflow engine
    - Kepler is an adaptation of Ptolemy, the UC Berkeley tool
(Figures: workflow steps defined graphically; workflow results presented to the user)
16. GUI for setting up and running workflows
17. Re-applying Technology
SDM technology developed for one application can be effectively targeted at many other applications:
- Parallel netCDF: initial application Astrophysics; new application Climate
- Parallel VTK: initial application Astrophysics; new application Climate
- Compressed bitmaps: initial application HENP; new applications Combustion, Astrophysics
- Storage Resource Managers: initial application HENP; new application Astrophysics
- Feature selection: initial application Climate; new application Fusion
- Scientific workflow: initial application Biology; new application Astrophysics (planned)
18. Broad Impact of the SDM Center
- Astrophysics
  - High-speed storage technology, Parallel netCDF, parallel VTK, and the ASPECT integration software are used for the Terascale Supernova Initiative (TSI) and FLASH simulations
  - Tony Mezzacappa (ORNL), John Blondin (NCSU), Mike Zingale (U of Chicago), Mike Papka (ANL)
- Climate
  - High-speed storage technology, Parallel netCDF, and ICA technology are used for climate modeling projects
  - Ben Santer (LLNL), John Drake (ORNL), John Michalakes (NCAR)
- Combustion
  - Compressed bitmap indexing is used for fast generation of flame regions and tracking of their progress over time
  - Wendy Koegler, Jacqueline Chen (Sandia National Laboratories)
(Figures: ASCI FLASH with Parallel netCDF; dimensionality reduction; region growing)
19. Broad Impact (cont.)
- Biology
  - The Kepler workflow system and web-wrapping technology are used for executing complex, highly repetitive workflow tasks that process microarray data
  - Matt Coleman (LLNL)
- High Energy Physics
  - Compressed bitmap indexing and Storage Resource Managers are used for locating desired subsets of data (events) and automatically retrieving the data from HPSS
  - Doug Olson (LBNL), Eric Hjort (LBNL), Jerome Lauret (BNL)
- Fusion
  - A combination of PCA and ICA technology is used to identify the key parameters that are relevant to the presence of edge harmonic oscillations in a tokamak
  - Keith Burrell (General Atomics)
(Figures: building a scientific workflow; dynamic monitoring of HPSS file transfers; identifying key parameters for the DIII-D tokamak)
20. Goals for Years 4-5
- Fully develop the integrated SDM framework
  - Implement the three-layer framework on the SDM center facility
  - Provide a way to select only the components needed
- Develop self-guiding web pages on the use of SDM components
  - Use existing successful examples as guides
- Generalize components for reuse
  - Develop general interfaces between components in the layers
    - Support loosely coupled WSDL interfaces
    - Support tightly coupled components for efficient dataflow
- Integrate the operation of components in the framework
  - Hide details from the user: automate parallel access and indexing
- Develop a reusable library of components that can be selected for use in the workflow system