Title: Workflow automation for processing plasma fusion simulation data
1. Workflow automation for processing plasma fusion simulation data
Norbert Podhorszki, Bertram Ludäscher (University of California, Davis)
Scott A. Klasky (Scientific Computing Group, Oak Ridge National Laboratory)
GPSC
2. Center for Plasma Edge Simulation
- Focus on the edge of the plasma in the tokamak
- Multi-scale, multi-physics simulation
- Edge turbulence in NSTX (at 100,000 frames/s)
- Diverted magnetic field
3. Images plasma physicists adore
- Electric potential
- Parallel flow and particle positions
4. Monitoring the simulation means
5. Multi-physics → many codes
6. XGC simulation output
- Desired size of the simulation (to be run on the petascale machine):
  - 100K time steps
  - 100 billion particles
  - 10 attributes (double precision) per particle
  - 8 TB of data per time step
  - Save (and process) 1K-10K time steps
  - about a 5-day run on the petascale machine
7. XGC simulation output
- Proprietary binary files (BP)
  - 3D variables, a separate file per timestep
- NetCDF files containing
  - 2D variables, all timesteps in one file
- M3D coupling data
  - to compute a new equilibrium with an external code (loose coupling)
  - to check the linear stability of XGC externally
8. What to do with those outputs?
- Proprietary binary files (BP)
  - Transfer to the end-to-end system using bbcp
  - Convert to HDF5 format (with a C program)
  - Generate images using AVS/Express (running as a service)
  - Archive HDF5 files in large chunks to HPSS
- NetCDF files
  - Transfer to the end-to-end system (updating as new timesteps are written into the files)
  - Generate images using the grace library
  - Archive NetCDF files at the end of the simulation
- M3D coupling data
  - Transfer to the end-to-end system
  - Execute M3D to compute the new equilibrium
  - Transfer the new equilibrium back to XGC
  - Execute ELITE to compute the growth rate and test linear stability
  - Execute M3D-MPP to study unstable states (ELM crash)
9-11. Schematic view of components
[Figure: data flows between ORNL (Cray XT4, HPSS, 40 GB/s), Seaborg at NERSC (pulling data), and the command and control site.]
12. Kepler workflow
- to accomplish all these tasks
- 1239 (Java) actors
- 4 levels of hierarchy
13. Workflow: Java → remote script → remote program
14. Kepler actors for CPES
- Permanent SSH connection to perform tasks on a remote machine
- Generalized actors (sub-workflows) for specified tasks:
  - Watch a remote directory for simulation timesteps
  - Execute an external command on a remote machine
  - Tar and archive data in large chunks to HPSS
  - Transfer a remote image file and display it on screen
  - Control a running SCIRun server remotely
  - Job submission and control for various resource managers
- The above actors do logging/checkpointing
  - so the final workflow can be stopped / restarted
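The "execute an external command on a remote machine" actor might be sketched as below. This is only an illustration, not the actual Java actor: the host name is made up, and the sketch spawns one `ssh` process per task where the real actor reuses a single permanent SSH session; a local `fake` runner stands in for the remote side so the sketch runs anywhere.

```python
import subprocess

def remote_exec(host, command, run=subprocess.run):
    """Run a command on a remote machine and return its stdout.
    (The real Kepler actor keeps one permanent SSH session instead of
    spawning a new ssh process per task, as done here.)"""
    result = run(["ssh", host, command], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"remote command failed: {result.stderr}")
    return result.stdout

# Local stand-in for the remote machine, so the sketch runs anywhere:
fake = lambda argv, **kw: subprocess.run(["echo", argv[2]], **kw)
output = remote_exec("login.example.org", "ls /sim/output", run=fake)
print(output)   # the stand-in just echoes the command string back
```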
15. What Kepler features are used in CPES?
- Different models of computation
  - PN for parallelism and pipeline processing
  - DDF for sequential workflows with if-then-else and while-loop structures
  - SDF for efficient (static-schedule) sequential execution of simple sub-workflows
- Stateful actors in stream processing of files
- SSH for remote operations
  - keeps the connection alive
- Command-line execution of the workflow
  - from a script (at deployment, no GUI)
  - reading workflow parameters from a file
16. FileWatcher: a data-dependent loop
- The SSH Directory Listing Java actor returns the new files in a directory (each file reported once)
- This is a do-while loop whose termination condition is whether the listing contains a specific element (which indicates the end of the simulation)
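The do-while loop can be sketched as a generator. Here `list_files` is a hypothetical callable standing in for the SSH Directory Listing actor, and the stop-marker file name is an assumption for illustration:

```python
import time

def watch_directory(list_files, stop_marker, poll_interval=0.0):
    """Yield each new file in a (remote) directory exactly once, until
    the stop-marker file shows up in the listing. Do-while shape: the
    body runs at least once before the termination condition is tested."""
    seen = set()
    while True:
        listing = list_files()            # one directory listing per poll
        for name in sorted(listing):
            if name not in seen and name != stop_marker:
                seen.add(name)
                yield name                # a new timestep file to process
        if stop_marker in listing:        # termination condition
            return
        time.sleep(poll_interval)

# Simulated listings: two ordinary polls, then the stop file appears.
polls = iter([["t0001.bp"],
              ["t0001.bp", "t0002.bp"],
              ["t0001.bp", "t0002.bp", "STOP"]])
new_files = list(watch_directory(lambda: next(polls), "STOP"))
print(new_files)   # each timestep file is reported exactly once
```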
17. Modeling problem: stopping and finishing
- You finally create working pipelines. Fine. But:
  - How do you stop them?
  - How do you let intermediate actors know that they will not receive more tokens?
  - How do you perform something after the processing?
- We use a special token flowing through the pipelines
  - It is always the last item in the pipeline.
  - Actors are implemented (extra work) to skip this token.
- A stop file created by the simulation is used
  - to stop the task-generator actors in the workflow (the FileWatchers)
  - to notify (stateful) actors in the pipeline that they should finalize (Archiver, Stop_AVS/Express)
  - to synchronize two independent pipelines (NetCDF/HDF5 → archive images at the end)
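A minimal sketch of the special-token scheme, assuming generator-based actors (the actual Kepler actors are Java classes): each stage processes normal tokens, and on seeing the stop token it finalizes itself and forwards the token so the next stage can finalize too.

```python
STOP = object()   # the special token; always the last item in the stream

def actor(process, finalize, stream):
    """Process normal tokens; on the stop token, run the finalize step
    and forward the token so the downstream actor can finalize too."""
    for token in stream:
        if token is STOP:
            finalize()
            yield STOP
            return
        yield process(token)

finalized = []
source    = iter([1, 2, 3, STOP])
converted = actor(lambda t: t * 10, lambda: finalized.append("convert"), source)
archived  = actor(lambda t: t,      lambda: finalized.append("archive"), converted)
results   = [t for t in archived if t is not STOP]
print(results, finalized)   # [10, 20, 30] ['convert', 'archive']
```

Note that finalization happens in pipeline order: the upstream stage finalizes first, then passes the token on, which is exactly the "extra work after the end" behavior the stop file triggers.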
18. Role of stop file
19. Role of stop file: extra work after the end
20. Problem: how to restart this workflow?
- Kepler has no system-level checkpoint/restart mechanism
  - it seems to be difficult for large Java applications
  - not to mention the status of external (and remote) things
- Pipeline execution
  - each actor is processing a different step
21. Our solution: user-level logging/restart
- We record the successful operations at each (heavy) actor
- Those actors are implemented to check, before doing something, whether it has already been done
- When the workflow is restarted
  - it starts from the very beginning, but the actors simply skip operations (files, tokens) that have already been done
- We do not worry about repeating small (control-related) actions within the workflow
  - it is the external operations that matter here
22. ProcessFile core: check-perform-record
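The check-perform-record core might look like the following. The plain-text log format and the names are assumptions for illustration, not the actual actor code:

```python
import os, tempfile

def process_file(logfile, key, operation):
    """check-perform-record: skip the operation if the log already
    records it, otherwise perform it and append it to the log."""
    done = set()
    if os.path.exists(logfile):                 # check the restart log
        with open(logfile) as f:
            done = {line.strip() for line in f}
    if key in done:
        return "skip"
    operation(key)                              # perform the heavy, external step
    with open(logfile, "a") as f:               # record success for the next restart
        f.write(key + "\n")
    return "done"

log = os.path.join(tempfile.mkdtemp(), "processed.log")
transferred = []
first  = process_file(log, "timestep-0001", transferred.append)
second = process_file(log, "timestep-0001", transferred.append)  # as after a restart
print(first, second, transferred)   # done skip ['timestep-0001']
```

On restart the workflow reruns from the beginning, but the second call shows why that is cheap: the external operation is performed only once.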
23. Problem: failed operations
- What if an operation fails, e.g. one timestep cannot be transferred? Options:
  - a) trust that downstream actors fail silently on the missing data
  - b) notify everybody below in the pipeline (to skip that step)
  - c) avoid giving tasks to them for the erroneous step
- Retrying and processing that step later is important, but
- keeping up with the simulation on the next steps is even more important.
24. Our approach for failed operations
- ProcessFile, and thus the workflow, handles failures by discarding the tokens related to failed operations from the stream
- Advantage
  - actors need not care about failures
  - an incoming token is a task to be done
- Disadvantage
  - the rate of token production varies
  - this can upset Kepler's models of computation
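Discard-on-failure can be sketched with generator stages (illustrative only; the `fails` set marks which tokens' operations fail, and the stage names are made up to match the transfer/convert/archive pipeline):

```python
def stage(name, stream, fails=(), trace=None):
    """Perform one operation per incoming token; on failure, discard
    the token so downstream stages never receive it."""
    for token in stream:
        if token in fails:
            if trace is not None:
                trace.append(f"{name} {token} failed")
            continue                 # token discarded from the stream
        if trace is not None:
            trace.append(f"{name} {token}")
        yield token

trace = []
p = stage("transfer", iter([1, 2, 3]), fails={2}, trace=trace)
p = stage("convert", p, trace=trace)
p = stage("arch", p, trace=trace)
out = list(p)
print(out)     # [1, 3]: step 2 never reaches convert or arch
print(trace)
```

Because the stages are lazy, the trace interleaves per timestep (transfer 1, convert 1, arch 1, transfer 2 failed, transfer 3, ...), which is the behavior the next slide illustrates.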
25. Discarding tokens on failure
Incoming tokens 1, 2, 3: transfer 1, convert 1, arch 1; the transfer of 2 fails, so token 2 is discarded; transfer 3, convert 3, arch 3.
26. After a restart
Incoming tokens 1, 2, 3 again: skip 1 / skip 1 / skip 1 (already done); transfer 2, convert 2, arch 2 (the previously failed step); skip 3 / skip 3 / skip 3.
27. Future plans
- Provenance management
  - one main reason to use a scientific workflow system, e.g. in bioinformatics workflows
  - needed for debugging runs, interpreting results, repeating experiments, generating documentation, comparing runs, etc.
  - the CPES workflow has been selected as one use case for the ongoing Kepler provenance work
- New actors in CPES for controlling asynchronous I/O from the petascale computer towards the processing cluster
28. Thank You
29. Disadvantage of discarding (1/7)
[Figure: tokens 1-6 split by a Distributor onto two parallel actors and merged again by a Commutator.]
- The Distributor splits the stream and distributes it to the two actors evenly
- The Commutator keeps the original order of tokens
30. Disadvantage of discarding (2/7)
- T2 is waiting at the Commutator for T1 to be finished
- T4 is started by the lower actor
31. Disadvantage of discarding (3/7)
- T4 is also finished, but waits in the lower actor for T2 to go through the Commutator
- The lower actor becomes idle (this comes from the Commutator's behavior)
32. Disadvantage of discarding (4/7)
33. Disadvantage of discarding (5/7)
- T3 can finally be started
- The lower actor is still idle (this comes from discarding T1!)
34. Disadvantage of discarding (6/7)
- T3 is finished and sent out; the Commutator sends it out first
- Then T2 is sent out
35. Disadvantage of discarding (7/7)
- The lower actor can start working on T6
- T4 will be sent out only after T5
- Order of the outgoing stream: 3, 2, 5, 4
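The seven-step trace can be reproduced with a small simulation, assuming a round-robin Distributor and a Commutator that reads the branches strictly in turn (both assumptions match the slides, not actual Kepler source):

```python
from collections import deque

def distribute(tokens, n=2):
    """Round-robin Distributor: token i goes to branch i % n."""
    branches = [deque() for _ in range(n)]
    for i, t in enumerate(tokens):
        branches[i % n].append(t)
    return branches

def commutate(branches):
    """Commutator: read the branches strictly in turn to restore the
    original order; it stalls when the expected branch is empty."""
    out, i = [], 0
    while any(branches):
        branch = branches[i % len(branches)]
        if not branch:       # expected token is missing: Commutator stalls
            break
        out.append(branch.popleft())
        i += 1
    return out

branches = distribute([1, 2, 3, 4, 5, 6])   # branch 0: 1,3,5  branch 1: 2,4,6
branches[0].remove(1)                        # the actor discarded failed token 1
order = commutate(branches)
print(order)   # [3, 2, 5, 4]: the wrong order, and token 6 is stuck
```

With token 1 discarded, the strict alternation pairs the wrong tokens together, yielding exactly the out-of-order stream 3, 2, 5, 4 from the trace, with T6 never emitted.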
36. Checkpointing
Token streams: ..., f4, f3, f2, f1; ..., g4, g3, g2, g1; ..., h4, h3, h2, h1; ..., list2, list1
UNFINISHED