End to End Scientific Data Management Framework for Petascale Science

Slides: 49

Provided by: ScottK90

1
End to End Scientific Data Management Framework
for Petascale Science

ESMF
9/23/2008
Scott Klasky, Jay Lofstead, Mladen Vouk
ORNL, Georgia Tech, NCSU

2
Outline

EFFIS (Klasky)
ADIOS.
ADIOS Overview (Klasky)
ADIOS Advanced Topics (Lofstead)
Workflow. (Vouk)
Dashboard. (Vouk)
Conclusions. (Klasky)

3
Supercomputers creating a hurricane of data.

Some simulations are starting to produce 100 TB/day on the 270 TF Cray XT at ORNL.
The old way of "run now, look at the results later" has problems.
Data will eventually be archived on tape.
Lots of files from one run, shared by multiple users, give us a data management headache.
Need to keep track of data across multiple systems.
Extracting information from files needs to be easy.
Example: finding the min/max of 100 GB arrays needs to be almost instant.

4
Vision

Problem: managing the data from a petascale simulation, debugging the simulation, and extracting the science involves:
Tracking the codes: simulation, analysis.
Tracking the input files/parameters.
Tracking the output files from the simulation and then from the analysis programs.
Tracking the machines and environment the codes ran on.
Gluing everything together.
Visualizing and analyzing the results without requiring users to know all of the file names.
Fast I/O which can be easily tracked.

5
Vision

Workflow automation, to automate all of the mundane tasks.
Analyzing the results without knowing all of the file locations/names.
Moving data from the simulation side to remote locations without knowledge of filename(s)/locations.
Monitoring results in real time.
Requirements:
Want technologies integrated together, so they can easily talk to one another.
Want to make the system scalable in I/O, workflow, analysis, visualization, and data management.

6
Outline

EFFIS
ADIOS.
ADIOS Overview
BP format, and compatibility with hdf5/netcdf.
Workflow.
Dashboard.
Conclusions.

7
ADIOS Motivation

Those fine fort. files!
Multiple HPC architectures
BlueGene, Cray, IB-based clusters
Multiple Parallel Filesystems
Lustre, PVFS2, GPFS, Panasas, PNFS
Many different APIs
MPI-IO, POSIX, HDF5, netCDF
GTC (fusion) has changed IO routines 8 times so
far based on performance when moving to different
platforms.
Different IO patterns
Restarts, analysis, diagnostics
Different combinations provide different levels
of IO performance
Compensate for inefficiencies in the current IO
infrastructures to improve overall performance

8
ADIOS Overview

Allows plug-ins for different I/O
implementations.
Abstracts the API from the method used for I/O.
Simple API, almost as easy as F90 write
statement.
Best-practice, optimized IO routines for all supported transports "for free".
Componentization.
Thin API
XML file
data groupings with annotation
IO method selection
buffer sizes
Common tools
Buffering
Scheduling
Pluggable IO routines

9
ADIOS Overview

ADIOS is an IO componentization, which allows us to:
Abstract the API from the IO implementation.
Switch from synchronous to asynchronous IO at runtime.
Change from real-time visualization to fast IO at runtime.
Combines:
Fast I/O routines.
Easy to use.
Scalable architecture (100s of cores to millions of procs).
QoS.
Metadata rich output.
Visualization applied during simulations.
Analysis, compression techniques applied during
simulations.
Provenance tracking.

10
ADIOS Philosophy (End User)

Simple API very similar to standard Fortran or C
POSIX IO calls.
As close to identical as possible for C and
Fortran API
open, read/write, close is the core
set_path, end_iteration, begin/end_computation,
init/finalize are the auxiliaries
No changes in the API for different transport
methods.
Metadata and configuration defined in an external
XML file parsed once on startup.
Describe the various IO grouping including
attributes and hierarchical path structures for
elements as an adios-group
Define the transport method used for each
adios-group and give parameters for
communication/writing/reading
Change on a per element basis what is written
Change on a per adios-group basis how the IO is
handled

11
Design Goals

ADIOS Fortran and C based API almost as simple as
standard POSIX IO
External configuration to describe metadata and
control IO settings
Take advantage of existing IO techniques (no new
native IO methods)
Fast, simple-to-write, efficient IO for multiple
platforms without changing the source code

12
Architecture

Data groupings
logical groups of related items written at the
same time.
Not necessarily one group per writing event
IO Methods
Choose what works best for each grouping
Vetted, improved, and/or written by experts for
each
POSIX (Wei-keng Liao, Northwestern)
MPI-IO (Steve Hodson, ORNL)
MPI-IO Collective (Wei-keng Liao, Northwestern)
NULL (Jay Lofstead, GT)
Ga Tech DataTap Asynchronous (Hasan Abbasi, GT)
phdf5
others (pnetcdf on the way).
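
The idea of choosing an IO method per data grouping, without touching the calling code, can be sketched as a small registry. This is a hypothetical Python illustration and not the ADIOS implementation: the names METHODS, write_group, posix_like and null_method are made up for the example, and the dictionary stands in for the external XML configuration.

```python
import io

# Hypothetical transport methods: each takes a sink and a dict of named values.
def posix_like(sink, data):
    for name, value in data.items():
        sink.write(f"{name}={value}\n")

def null_method(sink, data):
    # Discards the output; handy for timing the computation without IO.
    pass

METHODS = {"POSIX": posix_like, "NULL": null_method}

# Stand-in for the external configuration: which method each group uses.
config = {"restart": "POSIX", "diagnostics": "NULL"}

def write_group(group, data, sink):
    """Dispatch to whichever method is configured for this group."""
    METHODS[config[group]](sink, data)

restart_out, diag_out = io.StringIO(), io.StringIO()
write_group("restart", {"g_NX": 1000, "g_NY": 800}, restart_out)
write_group("diagnostics", {"step": 7}, diag_out)
```

Switching the diagnostics group to another method is a one-line change to the configuration; the calls to write_group stay the same.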

13
Related Work

Specialty APIs
HDF-5: complex API
Parallel netCDF: no structure
File system aware middleware
MPI ADIO layer: file system connection, complex API
Parallel file systems
Lustre: metadata server issues
PVFS2: client complexity
LWFS: client complexity
GPFS, pNFS, Panasas: may have other issues

14
Supported Features

Platforms tested
Cray CNL (ORNL Jaguar)
Cray Catamount (SNL Redstorm)
Linux Infiniband/Gigabit (ORNL Ewok)
BlueGene P now being tested/debugged.
Looking for future OSX support.
Native IO Methods
MPI-IO independent, MPI-IO collective, POSIX,
NULL, Ga Tech DataTap asynchronous, Rutgers DART
asynchronous, Posix-NxM, phdf5, pnetcdf,
kepler-db

15
Initial ADIOS performance.

MPI-IO method.
GTC and GTS codes have achieved over 20 GB/sec on
Cray XT at ORNL.
30GB diagnostic files every 3 minutes, 1.2 TB
restart files every 30 minutes, 300MB other
diagnostic files every 3 minutes.
DART: <2% overhead for writing 2 TB/hour with the XGC code.
DataTap vs. POSIX:
1 file per process (POSIX).
5 seconds for GTC computation.
25 seconds for POSIX IO.
4 seconds with DataTap.
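
To put the GTC/GTS output schedule quoted above in perspective (30 GB and 300 MB of diagnostics every 3 minutes, 1.2 TB of restart data every 30 minutes), the hourly volume can be worked out directly. This is only back-of-the-envelope arithmetic; decimal units (1 TB = 1,000,000 MB) and a sustained rate of exactly 20 GB/sec are assumptions made for the sketch.

```python
# Output schedule quoted on the slide, in MB (decimal units assumed).
schedule = [
    ("diagnostics", 30_000, 3),        # 30 GB every 3 minutes
    ("restart", 1_200_000, 30),        # 1.2 TB every 30 minutes
    ("other diagnostics", 300, 3),     # 300 MB every 3 minutes
]

def mb_per_hour(schedule):
    """Sum size * (number of writes per hour) over all output streams."""
    return sum(size_mb * (60 // period_min) for _, size_mb, period_min in schedule)

total_mb = mb_per_hour(schedule)
total_tb = total_mb / 1_000_000
io_seconds = total_mb / 20_000         # assumed sustained 20 GB/sec
```

Under these assumptions the schedule produces about 3 TB per hour, which takes roughly 150 seconds of writing per hour at 20 GB/sec.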

16
Codes Performance

June 7, 2008: 24 hour GTC run on Jaguar at ORNL.
93% of machine (28,672 cores).
MPI-OpenMP mixed model on quad-core nodes (7168 MPI procs).
Three interruptions total (simple node failure), with two 10 hour runs.
Wrote 65 TB of data at >20 GB/sec (25 TB for post analysis).
IO overhead: 3% of wall clock time.
Mixed IO methods of synchronous MPI-IO and POSIX IO, configured in the XML file.

17
Chimera IO Performance (Supernova code)
2x scaling

Plot minimum value from 5 runs with 9
restarts/run
Error bars show maximum time for the method.

18
Chimera Benchmark Results

Why is ADIOS better than pHDF5?
ADIOS_MPI_IO vs. pHDF5 w/ MPI Indep. IO driver

Use 512 cores, 5 restart dumps.
Conversion time on 1 processor for the 2048 core job: 3.6s (read) 5.6s (write) 6.9 (other) 18.8 s
Numbers above are sums over all PEs (parallelism not shown).
19
ADIOS Advanced Topics

J. Lofstead

20
ADIOS API Fortran Example

XML configuration file
<adios-config>
  <adios-group name="output" coordination-communicator="group_comm">
    <var name="group_comm" type="integer"/>
    <var name="g_NX" type="integer"/>
    <var name="g_NY" type="integer"/>
    <var name="lo_x" type="integer"/>
    <var name="lo_y" type="integer"/>
    <var name="l_NX" type="integer"/>
    <var name="l_NY" type="integer"/>
    <global-bounds dimensions="g_NX,g_NY" offsets="lo_x,lo_y">
      <var name="temperature" dimensions="l_NX,l_NY"/>
    </global-bounds>
    <attribute name="units" path="/temperature" value="K"/>
  </adios-group>
  <!-- declare additional adios-groups -->
  <method method="MPI" group="output"/>
  <!-- add more methods -->
</adios-config>

Fortran 90 code
! initialize the system, loading the configuration file
call adios_init ("config.xml", err)
! open a write path for that type
call adios_open (h1, "output", "restart.n1", "w", err)
call adios_group_size (h1, size, total_size, comm, err)
! write the data items
call adios_write (h1, "g_NX", 1000, err)
call adios_write (h1, "g_NY", 800, err)
call adios_write (h1, "lo_x", x_offset, err)
call adios_write (h1, "lo_y", y_offset, err)
call adios_write (h1, "l_NX", x_size, err)
call adios_write (h1, "l_NY", y_size, err)
call adios_write (h1, "temperature", u, err)
! commit the writes for asynchronous transmission
call adios_close (h1, err)
! do more work
! shutdown the system at the end of my run
call adios_finalize (mype, err)
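
The roles of the global dimensions (g_NX, g_NY), the local dimensions (l_NX, l_NY) and the offsets (lo_x, lo_y) in this example can be illustrated with a small sketch. It is plain Python with no ADIOS calls; place_block is a hypothetical helper, and the 4 x 4 grid split into two slabs is an assumption chosen to keep the example short.

```python
def place_block(global_grid, block, lo_x, lo_y):
    """Copy a local l_NX x l_NY block into the global array at its offsets."""
    for i, row in enumerate(block):
        for j, value in enumerate(row):
            global_grid[lo_x + i][lo_y + j] = value

g_NX, g_NY = 4, 4
grid = [[0] * g_NY for _ in range(g_NX)]

# Two processes, each owning a 2 x 4 slab; (lo_x, lo_y) is the slab's origin.
place_block(grid, [[1, 1, 1, 1], [1, 1, 1, 1]], lo_x=0, lo_y=0)
place_block(grid, [[2, 2, 2, 2], [2, 2, 2, 2]], lo_x=2, lo_y=0)
```

Each process only needs to know its own block and where it starts; the global picture follows from the offsets.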
21
ADIOS API C Example

C code
// parse the XML file and determine buffer sizes
adios_init ("config.xml");
// open and write the retrieved type
adios_open (h1, "restart", "restart.n1", "w");
adios_group_size (h1, size, total_size, comm);
adios_write (h1, "n", n);       // int n
adios_write (h1, "mi", mi);     // int mi
adios_write (h1, "zion", zion); // float zion[10][20][30][40]
// write more variables
...
// commit the writes for synchronous transmission, or
// generally initiate the write for asynchronous transmission
adios_close (h1);
// do more work
...
// shutdown the system at the end of my run
adios_finalize (mype);

XML configuration file
<adios-config host-language="C">
  <adios-group name="restart">
    <var name="n" path="/" type="integer"/>
    <var name="mi" path="/param" type="integer"/>
    <!-- declare more data elements -->
    <var name="zion" type="real" dimensions="n,4,2,mi"/>
    <attribute name="units" path="/param" value="m/s"/>
  </adios-group>
  <!-- declare additional adios-groups -->
  <method method="MPI" group="restart"/>
  <method priority="2" method="DATATAP" iterations="1" type="diagnosis">srv=ewok001.ccs.ornl.gov</method>
  <!-- add more methods -->
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>
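
Because the metadata and method selection live in an external XML file, a tool can inspect the configuration without touching the simulation code. The sketch below reads a trimmed copy of the configuration above with Python's standard xml.etree.ElementTree module. It shows the idea only; it is not how ADIOS itself parses the file, and the variable names groups, methods and buffer_mb are made up for the example.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<adios-config host-language="C">
  <adios-group name="restart">
    <var name="n" path="/" type="integer"/>
    <var name="mi" path="/param" type="integer"/>
    <var name="zion" type="real" dimensions="n,4,2,mi"/>
    <attribute name="units" path="/param" value="m/s"/>
  </adios-group>
  <method method="MPI" group="restart"/>
  <buffer size-MB="100" allocate-time="now"/>
</adios-config>
"""

root = ET.fromstring(CONFIG)

# Map each group name to the list of variables it declares.
groups = {
    g.get("name"): [v.get("name") for v in g.findall("var")]
    for g in root.findall("adios-group")
}
# Map each group name to the transport method selected for it.
methods = {m.get("group"): m.get("method") for m in root.findall("method")}
buffer_mb = int(root.find("buffer").get("size-MB"))
```

Changing method="MPI" to another method in the file changes how the group is written, with no change to the calling code.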
22
BP File Format

netCDF and HDF-5 are excellent, mature file
formats
APIs can have trouble scaling to petascale and
beyond
metadata operations bottleneck at MDS
coordination among all processes takes time
MPI Collective writes/reads add additional
coordination
Non-stripe-sized writes impact performance
Read/write mode is slower than write only
Replicate some metadata for resilience

23
BP File Format

Solution: use an intermediate API and format
ADIOS API and BP format
API natively writes BP format (netCDF coming)
converters to netCDF and HDF-5 available
Convert files at speeds limited by the
performance of disk and the netCDF/HDF-5 API

24
BP File Format

File organization
Move the header to the end
The last 28 bytes are 3 index locations plus a version and endian-ness flag
Each process writes completely independently
The first part of the file is a series of Process Groups, each the output from a single process for a single IO grouping
Coordinate only twice
Once at start for writing location
Once at end for metadata collection to process 0
and writing by process 0 only
Replicate some metadata
Each Process Group is fully self-contained with
all related meta-data
Indexes contain copies of highlights of the
metadata
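
A reader can locate all three indexes by looking only at the end of the file. The sketch below shows one way a 28-byte trailer could be packed and read back with Python's struct module. The exact field layout is an assumption made for illustration (three unsigned 64-bit offsets followed by a 32-bit version/endian-ness word, which adds up to 28 bytes); the slide states only the total size and the contents.

```python
import struct

# Assumed layout: three unsigned 64-bit index offsets, then a 32-bit
# version/endian-ness word. "<" selects little-endian with no padding.
FOOTER = struct.Struct("<QQQI")

def write_footer(pg_index, vars_index, attrs_index, version):
    """Pack the three index locations and the version word."""
    return FOOTER.pack(pg_index, vars_index, attrs_index, version)

def read_footer(file_bytes):
    """Recover the index locations from the last 28 bytes of the file."""
    return FOOTER.unpack(file_bytes[-FOOTER.size:])

payload = b"\x00" * 100    # stand-in for the process groups and indexes
blob = payload + write_footer(40, 60, 80, 1)
pg_off, vars_off, attrs_off, version = read_footer(blob)
```

Putting this information at the end means nothing written earlier has to be revisited once the final sizes are known.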

25
BP File Format

Index Structure
Process Group Index
ADIOS group, process ID, timestep, offset in file
Vars Index
Set of unique vars listing group, name, path,
datatype, characteristics (see next slide)
Uniqueness based on group name, var name, var
path
Attributes Index
Set of unique attributes listing group, name,
path, datatype, characteristics (see next slide)
Uniqueness based on group name, attribute name,
attribute path

26
BP File Format

Data Characteristics
Idea: collect information about the var/attribute for quickly characterizing the data
Examples:
Offset in file
Value (only for small data)
Minimum
Maximum
Instance array dimensions
Structure set up for adding more without changing the file format
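
Storing the minimum and maximum with each index entry is what lets a query such as "min/max of a large array" be answered without reading the array. A minimal sketch of that idea follows, assuming a simple in-memory record; the Characteristics class and both helper functions are hypothetical and do not reflect the actual BP index encoding.

```python
from dataclasses import dataclass

@dataclass
class Characteristics:
    offset: int        # where the data block starts in the file
    dims: tuple        # dimensions of this instance of the array
    minimum: float
    maximum: float

def characterize(values, dims, offset):
    """Computed once at write time, while the data is still in memory."""
    return Characteristics(offset, dims, min(values), max(values))

def global_min_max(index_entries):
    """Answer a min/max query from the index alone, without reading data."""
    return (min(c.minimum for c in index_entries),
            max(c.maximum for c in index_entries))

# Two processes each wrote a piece of the same variable.
index = [
    characterize([3.0, 1.5, 9.0], dims=(3,), offset=0),
    characterize([4.0, -2.0, 7.5], dims=(3,), offset=24),
]
lo, hi = global_min_max(index)
```

The cost of the query depends on the number of index entries, not on the size of the arrays.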

27
BP File Format

Write operation (n processes)
Gather data sizes to process 0
Process 0 generates offset to write for each
process
Scatter offsets back to processes
Every process writes its data independently
Gather the local index from each process to process 0
Merge all indices together
Process 0 writes the indices at the end of the file
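
The write steps above can be simulated in a single process to show where the two coordination points fall. This sketch uses plain Python lists in place of MPI gather/scatter calls, so it illustrates the bookkeeping only; plan_offsets and write_file are hypothetical names.

```python
from itertools import accumulate

def plan_offsets(sizes):
    """Process 0's job: an exclusive prefix sum of the gathered sizes."""
    return [0] + list(accumulate(sizes))[:-1]

def write_file(buffers):
    sizes = [len(b) for b in buffers]      # gather data sizes to process 0
    offsets = plan_offsets(sizes)          # process 0 generates the offsets
    out = bytearray(sum(sizes))
    local_indices = []
    for rank, (off, buf) in enumerate(zip(offsets, buffers)):
        out[off:off + len(buf)] = buf      # each process writes independently
        local_indices.append({"rank": rank, "offset": off, "size": len(buf)})
    # Gather the local indices and merge them into one index.
    merged = sorted(local_indices, key=lambda e: e["offset"])
    return bytes(out), merged              # process 0 would append the index

data, index = write_file([b"aaaa", b"bb", b"cccccc"])
```

Only the size gather and the index gather require communication; the writes in the loop body do not depend on one another.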

28
BP File Format

Compromises using BP Format
Each Process Group can have different variables
defined and written (also an advantage)

29
BP File Format

Advantages using BP Format
Each process writes independently
Limited coordination
File organization more natural for striping
Rich index contents
Append operations do not require moving data
Indices are read by process 0 on start and used as the base index
The first new Process Group overwrites the old indices
Index corruption does not necessarily destroy the entire file
Process Group corruption is isolated: the rest of the process groups remain accessible (via indices)

30
Outline

EFFIS
ADIOS.
ADIOS Overview
BP format, and compatibility with hdf5/netcdf.
Workflow.
Dashboard.
Conclusions.

31
Scientific Workflow

Capture how a scientist works with data and
analytical tools
data access, transformation, analysis,
visualization
possible worldview: dataflow-oriented (cf. signal processing)?
Scientific workflows start where script-based data-management solutions leave off.
Scientific workflow (wf) benefits (vs. script-based approaches)
wf automation
wf component reuse, sharing, adaptation,
archiving
wf design, documentation
built-in (model) concurrency
(task-, pipeline-parallelism)
built-in provenance support
distributed and parallel exec
Grid and cluster support
wf fault-tolerance, reliability
Other

Why a W/F System?
Higher-level language vs. assembly-language
nature of scripts
32
Two typical types of Workflows for SC

Real-time Monitoring (Server Side Workflows)
Job submission.
File movement.
Launch Analysis Services.
Launch Visualization Services.
Launch Automatic Archiving.
Post Processing (Desktop Workflows).
Read in Files from different locations.
File movement.
Launch Analysis Services.
Launch Visualization Services.
Connect to Databases.
Obviously there are other types of workflows.
Parameter study/sensitivity analysis workflows.

33
Workflow Provenance

Process provenance.
the steps performed in the workflow, the progress
through the workflow control flow, etc.
Data provenance.
history and lineage of each data item associated
with the actual simulation (inputs, outputs,
intermediate states, etc.)
Workflow provenance.
history of the workflow evolution and structure
System provenance.
All external (environment) information relevant
to a complete run.
Compilation history of the codes.
Information about the libraries.
Source of the codes.
Run-time environment settings.
Machine information
etc.

Dashboard displays provenance information for
Data lineage.
Source Code for a simulation, analysis.
Performance Data from PAPI.
Workflow Provenance to determine if something
went wrong with the workflow.
Other

34
Modular Framework
Diagram labels: Storage; Supercomputers and Analytics Nodes; Kepler; Data Store; Rec API; Disp API; Dash; Management API; Orchestration.
Meta-Data about Processes, Data, Workflows, System, Apps and Environment.
ADIOS is being modified to send the IO (and coupling) metadata to Kepler (e.g., file path, variables, control commands, etc.).
35
So what are the requirements?

Reliability (autonomics)
Usability (Must be EASY to use and functional)
Good user support, and long-term DOE support.
Universality and Reuse - The workflow should work for all of my workflows. (NOT just for the petascale computers: multiple platforms.)
Integration - Must be easy to incorporate my own
services into the workflow.
Customization and adaptability - Must be
customizable by the users.
Users need to easily change the workflow to work
with the way users work.
Other - You tell us!

36
Kepler Scientific Workflow System
http://www.kepler-project.org

Kepler is a cross-project collaboration
Latest release available from the website
Builds upon the open-source Ptolemy II framework
Vergil is the GUI, but Kepler also runs in
non-GUI and batch modes.

37
Vergil is the GUI for Kepler
but Kepler can also run in batch mode as a
command-line engine.
Data Search
Actor Search

Actor ontology and semantic search for actors
Search -> Drag and drop -> Link via ports
Metadata-based search for datasets

38
Actor-Oriented Modeling

Actors
single component or task
well-defined interface (signature)
generally a passive entity: given input data, it produces output data

Ports
each actor has a set of input and output ports
denote the actor's signature
produce/consume data (a.k.a. tokens)
parameters are special static ports

39
Actor-Oriented Modeling

Dataflow Connections
actor communication channels
Directed edges
connect output ports with input ports

40
Actor-Oriented Modeling

Sub-workflows / Composite Actors
composite actors wrap sub-workflows
like actors, have signatures (i/o ports of
sub-workflow)
hierarchical workflows (arbitrary nesting levels)

41
Actor-Oriented Modeling

Directors
define the execution semantics of workflow graphs
execute the workflow graph (following some schedule)
sub-workflows may have different directors
enables reusability

42
Some Directors

Directed Acyclic Graph (DAG)
Common among Grid workflows: no loops, each actor fires at most once (no streaming / pipeline parallelism).
Example: DAGMan
Synchronous Dataflow (SDF)
Connections have queues for sending/receiving
fixed numbers of tokens at each firing. Schedule
is statically predetermined. SDF models are
highly analyzable and used often in SWFs.
Process Networks (PN)
Generalize SDF. Actors execute as a separate
thread/process, with queues of unbounded size.
Related to Kahn/MacQueen semantics. The workflow
is executed in parallel and pipeline parallel
fashion.
Continuous Time (CT)
Connections represent the value of a continuous
time signal at some point in time ... Often used
to model physical processes.
Discrete Event (DE)
Actors communicate through a queue of events in
time. Used for instantaneous reactions in
physical systems.
Dynamic Dataflow (DDF)
Connections have queues for sending/receiving
arbitrary numbers of tokens at each firing.
Schedule is dynamically calculated. DDF models
enable branching and looping (conditionals). The
workflow is sequential.
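
The simplest of these directors, the DAG style, can be sketched in a few lines: actors are functions, connections say which outputs feed which inputs, and the director fires each actor once in dependency order. This is a toy illustration in Python (3.9 or later, for the standard graphlib module) and not Kepler or Ptolemy II code; the actor names and the run_dag function are hypothetical.

```python
from graphlib import TopologicalSorter

# Actors: name -> function of its input tokens (hypothetical examples).
actors = {
    "read":    lambda: [1.0, 4.0, 2.5],
    "analyze": lambda xs: max(xs),
    "plot":    lambda xs, peak: f"{len(xs)} points, peak {peak}",
}
# Dataflow connections: actor -> upstream actors, in input-port order.
inputs = {"read": [], "analyze": ["read"], "plot": ["read", "analyze"]}

def run_dag(actors, inputs):
    """DAG-style director: fire each actor once, in dependency order."""
    tokens = {}
    for name in TopologicalSorter(inputs).static_order():
        tokens[name] = actors[name](*(tokens[up] for up in inputs[name]))
    return tokens

result = run_dag(actors, inputs)
```

Swapping run_dag for a different scheduling function would change the execution semantics without changing the actors, which is the separation the director concept is meant to provide.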

43
Types

tokens, ports have types
available types
int, float (double precision), complex, string,
boolean, object
array, record, matrix (2D only)
type resolution at workflow start-up; actors can support different types
e.g., Count, Sleep, Delay work on any type
a type lattice is pre-defined to determine
relationships among types (casting)

string and int tokens are added as strings
int tokens are added as ints
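
The behaviour in these two captions follows from resolving both operands to their least upper bound in the type lattice before adding. The sketch below uses a deliberately small, made-up lattice (int converts to float and string, float to complex and string, complex to string); it is an assumption for illustration and not the actual lattice used by Kepler or Ptolemy II.

```python
# Toy lattice: each type lists the types it converts to without loss.
UP = {
    "int": {"float", "string"},
    "float": {"complex", "string"},
    "complex": {"string"},
    "string": set(),
}
CONVERT = {"int": int, "float": float, "complex": complex, "string": str}

def ancestors(t):
    """Every type reachable from t in the lattice, including t itself."""
    seen, stack = {t}, [t]
    while stack:
        for nxt in UP[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def resolve(a, b):
    """Least upper bound: the common ancestor that sits lowest in the lattice."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=lambda t: len(ancestors(t)))

def add(x, x_type, y, y_type):
    """Convert both tokens to the resolved type, then add them."""
    t = resolve(x_type, y_type)
    return CONVERT[t](x) + CONVERT[t](y), t
```

With this lattice, add("run", "string", 7, "int") concatenates the two tokens as strings, while add(2, "int", 3, "int") stays in int.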
44
Dashboard
45
Machine monitoring.

Allow for secure logins with OTP.
Allow for job submission.
Allow for killing jobs.
Search old jobs.
See collaborators' jobs.

46
Analysis Collaborative Features

Base analysis, which will work on both the portable dashboard and the mother-dashboard, will feature:
Calculator for simple math, done in Python.
Hooks into R for pre-set functions.
Ability to save the analysis into a new function, available to other users.
Calculator will create new movies that are viewable on the dashboard.
First version will work with xy (t) plots.
Second version will work with x,y,z (t) plots.
Advanced analysis will contain:
Parallel backend to VisIT server, VisTrails, Parallel R, and custom mpi/c/f90 code.
We will allow users to place executable code into the dashboard. (Still working this out: how to execute, ...)

47
Conclusions

ADIOS is an IO componentization.
ADIOS is being integrated into Kepler.
Achieved over 20 GB/sec for several codes on
Jaguar.
Used daily by CPES researchers.
Can change IO implementations at runtime.
Metadata is contained in XML file.
Kepler is used daily for:
Monitoring CPES simulations on Jaguar/Franklin/ewok.
Runs with 24 hour jobs, on large numbers of processors.
Dashboard uses enterprise (LAMP) technology.
Linux, Apache, MySQL, PHP

48
EFFIS

From SDM center
Workflow engine Kepler
Provenance support
Wide-area data movement
From universities
Code coupling (Rutgers)
Visualization (Rutgers)
Newly developed technologies
Adaptable I/O (ADIOS)(with Georgia Tech)
Dashboard (with SDM center)

Foundation Technologies
Enabling Technologies
Approach: place highly annotated, fast, easy-to-use I/O methods in the code, which can be monitored and controlled; have a workflow engine record all of the information; visualize this on a dashboard; move desired data to the user's site; and have everything reported to a database.
