Workflow automation for processing plasma fusion simulation data

1
Workflow automation for processing plasma fusion
simulation data
Norbert Podhorszki, Bertram Ludäscher
University of California, Davis
Scott A. Klasky
Scientific Computing Group, Oak Ridge National
Laboratory
GPSC
2
Center for Plasma Edge Simulation
  • Focus on the edge of the plasma in the tokamak
  • Multi-scale, multi-physics simulation

Edge turbulence in NSTX (at 100,000 frames/s)
Diverted magnetic field
3
Images plasma physicists adore
Electric potential
Parallel flow and particle positions
4
Monitoring the simulation means
5
Multi-physics → many codes
6
XGC simulation output
  • Desired size of simulation (to be run on the
    petascale machine):
    • 100K time steps
    • 100 billion particles
    • 10 attributes (double precision) per particle
    • 8 TB of data per time step
  • Save (and process) 1K-10K time steps
  • About a 5-day run on the petascale machine
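(For scale: 10^11 particles × 10 attributes × 8 bytes per double-precision value = 8 × 10^12 bytes, i.e. the 8 TB per time step quoted above; saving 1K-10K steps therefore means roughly 8-80 PB of output to process.)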

7
XGC simulation output
  • Proprietary binary files (BP):
    • 3D variables, a separate file per timestep
  • NetCDF files containing:
    • 2D variables, all timesteps in one file
  • M3D coupling data:
    • to compute a new equilibrium with an external
      code (loose coupling)
    • to check the linear stability of XGC externally

8
What to do with these outputs?
  • Proprietary binary files (BP) (sequence sketched
    after this list):
    • Transfer to the end-to-end system using bbcp
    • Convert to HDF5 format (with a C program)
    • Generate images using AVS/Express (running as
      a service)
    • Archive HDF5 files in large chunks to HPSS
  • NetCDF files:
    • Transfer to the end-to-end system (updating as
      new timesteps are written into the files)
    • Generate images using the grace library
    • Archive NetCDF files at the end of the
      simulation
  • M3D coupling data:
    • Transfer to the end-to-end system
    • Execute M3D to compute the new equilibrium
    • Transfer the new equilibrium back to XGC
    • Execute ELITE to compute the growth rate and
      test linear stability
    • Execute M3D-MPP to study unstable states (ELM
      crash)
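The BP branch above is, in effect, a fixed sequence of external operations per timestep. Purely as an illustration (the real steps are Kepler actors, and the file names, paths and converter/render commands below are hypothetical), the sequence could be driven like this:

```java
// Illustrative sketch of the per-timestep BP branch (transfer -> convert ->
// render). File names, paths and the bp2h5/render commands are hypothetical;
// archiving to HPSS happens separately, in large chunks.
public class BpTimestepPipeline {

    // Run one external command; true means it exited with code 0.
    static boolean run(String... cmd) throws Exception {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor() == 0;
    }

    static boolean processTimestep(int step) throws Exception {
        String bp   = String.format("xgc.%05d.bp", step);   // assumed naming
        String hdf5 = String.format("xgc.%05d.h5", step);
        return run("bbcp", "simhost:/sim/out/" + bp, "/e2e/data/" + bp)      // transfer
            && run("/e2e/bin/bp2h5", "/e2e/data/" + bp, "/e2e/data/" + hdf5) // convert (C program)
            && run("/e2e/bin/render_avs.sh", "/e2e/data/" + hdf5);           // images via AVS/Express
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timestep 1 processed: " + processTimestep(1));
    }
}
```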

9
Schematic view of components
(Diagram: ORNL site with HPSS, a 40 GB/s link, and the command and control site.)
10
Schematic view of components
(Diagram: ORNL site with HPSS and the 40 GB/s link.)
11
Schematic view of components
(Diagram: Cray XT4 and HPSS at ORNL, 40 GB/s link; data pulled from Seaborg at NERSC; command and control site.)
12
  • Kepler workflow to accomplish all these tasks:
    • 1239 (Java) actors
    • 4 levels of hierarchy

13
Workflow (Java) → remote script → remote program
14
Kepler actors for CPES
  • Permanent SSH connection to perform tasks on a
    remote machine (sketched after this list)
  • Generalized actors (sub-workflows) for specific
    tasks:
    • Watch a remote directory for simulation
      timesteps
    • Execute an external command on a remote machine
    • Tar and archive data in large chunks to HPSS
    • Transfer a remote image file and display it on
      screen
    • Control a running SCIRun server remotely
    • Job submission and control for various resource
      managers
  • The above actors do logging/checkpointing:
    • the final workflow can be stopped / restarted
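The slides do not say which SSH library the actors build on; as a minimal sketch of the "permanent SSH connection" idea only, here is a reusable session with the JSch library (host, user and key path are placeholders):

```java
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Sketch of a persistent SSH session that is reused to run remote commands,
// in the spirit of the CPES SSH actors. JSch is used only for illustration;
// user, host and key path are placeholders.
public class RemoteExec {
    private final Session session;

    public RemoteExec(String user, String host, String keyPath) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity(keyPath);
        session = jsch.getSession(user, host, 22);
        session.setConfig("StrictHostKeyChecking", "no"); // for brevity only
        session.connect();            // kept open for the whole workflow run
    }

    // Execute one command over the existing session and return its stdout.
    public String exec(String command) throws Exception {
        ChannelExec ch = (ChannelExec) session.openChannel("exec");
        ch.setCommand(command);
        InputStream out = ch.getInputStream();
        ch.connect();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] b = new byte[4096];
        for (int n; (n = out.read(b)) != -1; ) buf.write(b, 0, n);
        ch.disconnect();
        return buf.toString();
    }

    public void close() { session.disconnect(); }
}
```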

15
What Kepler features are used in CPES?
  • Different computational models (director choice
    sketched after this list):
    • PN for parallelism and pipeline processing
    • DDF for sequential workflows with if-then-else
      and while-loop structures
    • SDF for efficient (statically scheduled)
      sequential execution of simple sub-workflows
  • Stateful actors in stream processing of files
  • SSH for remote operations:
    • keeps the connection alive
  • Command-line execution of the workflow:
    • from a script (at deployment, no GUI)
    • reading workflow parameters from a file
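As a side note on the director choice: in Kepler/Ptolemy II each opaque composite gets its own director, so different models of computation can be mixed in one workflow. A minimal programmatic sketch, not taken from the CPES workflow:

```java
import ptolemy.actor.TypedCompositeActor;
import ptolemy.domains.pn.kernel.PNDirector;
import ptolemy.domains.sdf.kernel.SDFDirector;

// Minimal sketch (not the CPES workflow itself): the top-level model runs
// under PN for pipeline parallelism, while a simple sub-workflow runs under
// SDF with a static schedule. Actor and relation wiring is omitted.
public class DirectorSketch {
    public static void main(String[] args) throws Exception {
        TypedCompositeActor top = new TypedCompositeActor();
        top.setName("cpesLikeTop");
        new PNDirector(top, "PN Director");           // parallel pipelines

        TypedCompositeActor sub = new TypedCompositeActor(top, "simpleSubWorkflow");
        new SDFDirector(sub, "SDF Director");         // static sequential schedule
    }
}
```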

16
FileWatcher: data-dependent loop
  • The SSH Directory Listing Java actor reports the
    new files in a directory (each file only once)
  • This is a do-while loop whose termination
    condition is whether the listing contains a
    specific element (which indicates the end of the
    simulation); see the sketch below
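A minimal sketch of that do-while structure, assuming a hypothetical listDirectory() helper in place of the SSH Directory Listing actor and an assumed stop-file name:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the FileWatcher loop: keep asking the remote side for files not
// seen yet, emit them downstream, and terminate when the listing contains the
// stop file written at the end of the simulation. listDirectory() stands in
// for the SSH Directory Listing actor; the "xgc.stop" sentinel is assumed.
public class FileWatcher {
    private final Set<String> seen = new HashSet<>();

    // Placeholder for the remote listing over SSH.
    List<String> listDirectory(String remoteDir) { return new ArrayList<>(); }

    // Placeholder for sending one file name to the downstream pipeline.
    void emit(String file) { System.out.println("new file: " + file); }

    public void watch(String remoteDir, String stopFile) throws InterruptedException {
        boolean finished;
        do {
            List<String> current = listDirectory(remoteDir);
            for (String f : current) {
                if (seen.add(f) && !f.equals(stopFile)) {
                    emit(f);                           // each file reported once
                }
            }
            finished = current.contains(stopFile);     // termination condition
            if (!finished) Thread.sleep(30_000);       // poll interval (assumed)
        } while (!finished);
    }
}
```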

17
Modeling problem: stopping and finishing
  • Eventually you create working pipelines. Fine.
  • How do you stop them?
  • How do you let intermediate actors know that they
    will not receive more tokens?
  • How do you perform something after the
    processing?
  • We use a special token flowing through the
    pipelines (sketched after this list):
    • It is always the last item in the pipeline.
    • Actors are implemented (extra work) to skip this
      token.
  • A stop file created by the simulation is used:
    • to stop the task-generator actors in the
      workflow (FileWatchers)
    • to notify (stateful) actors in the pipeline that
      they should finalize (Archiver, Stop_AVS/Express)
    • to synchronize two independent pipelines
      (NetCDF and HDF5 → archive images at the end)
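A minimal sketch of that end-of-stream token convention (names are hypothetical, not the actual CPES actor code): each actor does its finalization when it sees the token and forwards it unchanged, so the next actor can do the same.

```java
// Sketch of the "special last token" convention described above. A stateful
// actor processes normal tokens; when the STOP token arrives it finalizes
// (e.g. flushes the current archive chunk) and passes the token downstream.
// Names are hypothetical.
public class StopTokenExample {
    static final String STOP = "__STOP__";   // assumed sentinel value

    interface Actor { String process(String token); }

    static class Archiver implements Actor {
        @Override public String process(String token) {
            if (STOP.equals(token)) {
                System.out.println("Archiver: flushing last chunk to HPSS");
                return token;                 // forward the stop token downstream
            }
            System.out.println("Archiver: adding " + token + " to current chunk");
            return token;
        }
    }

    public static void main(String[] args) {
        Actor archiver = new Archiver();
        for (String t : new String[] { "f1.h5", "f2.h5", STOP }) {
            archiver.process(t);
        }
    }
}
```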

18
Role of stop file
19
Role of stop file
Extra work after the end
20
Problem: how to restart this workflow?
  • Kepler has no system-level checkpoint/restart
    mechanism:
    • this seems to be difficult for large Java
      applications
    • not to mention the status of external (and
      remote) things
  • Pipeline execution:
    • each actor is processing a different step

21
Our solution: user-level logging/restart
  • We record the successful operations at each
    (heavy) actor
  • Those actors are implemented to check, before
    doing something, whether it has already been done
  • When the workflow is restarted, it starts from the
    very beginning, but the actors simply skip
    operations (files, tokens) that have already been
    done
  • We do not worry about repeating small
    (control-related) actions within the workflow:
    • it is the external operations that matter here

22
ProcessFile core: check-perform-record
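The figure for this slide is not reproduced in the transcript. As an illustration of the check-perform-record idea, a minimal sketch (the log format and names are assumptions, not the actual ProcessFile implementation):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Sketch of check-perform-record: before an operation runs, the actor checks a
// per-actor log of completed keys; after a successful run it appends the key.
// On restart the log is reloaded, so already-done operations are skipped.
public class CheckPerformRecord {
    private final Path log;
    private final Set<String> done = new HashSet<>();

    public CheckPerformRecord(String logFile) throws IOException {
        log = Paths.get(logFile);
        if (Files.exists(log)) done.addAll(Files.readAllLines(log, StandardCharsets.UTF_8));
    }

    // Returns true if the operation ran (or had already run) successfully.
    public boolean perform(String key, Operation op) throws IOException {
        if (done.contains(key)) return true;        // check: already done, skip
        if (!op.run()) return false;                // perform: failed, do not record
        Files.write(log, (key + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        done.add(key);                              // record the success
        return true;
    }

    public interface Operation { boolean run(); }
}
```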
23
Problem: failed operations
  • What if an operation fails, e.g. one timestep
    cannot be transferred? Options:
    • a) trust that downstream actors fail silently
      on missing data
    • b) notify everybody on the pipeline below (to
      skip)
    • c) avoid giving tasks to them for the erroneous
      step
  • Retrying and processing that step later is
    important, but keeping up with the simulation on
    the next steps is even more important.

24
Our approach for failed operations
  • ProcessFile, and thus the workflow, handles
    failures by discarding tokens related to failed
    operations from the stream (see the sketch after
    this list)
  • Advantage:
    • actors need not care about failures
    • an incoming token is a task to be done
  • Disadvantage:
    • the rate of token production varies
    • this can upset Kepler's models of computation
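A minimal sketch of the discard-on-failure behaviour (hypothetical names; in the real workflow this sits inside the ProcessFile sub-workflow): a failed operation simply produces no output token, so downstream actors only ever see tasks that can be done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch of discarding tokens on failure: a stage returns an empty Optional
// when its operation fails, and only the successful tokens flow downstream.
// Downstream actors never have to reason about failures, at the cost of a
// varying token rate. Names are hypothetical.
public class DiscardOnFailure {

    // Stands in for transfer/convert/etc.; here step 2 is made to fail.
    static Optional<Integer> transfer(int step) {
        boolean ok = (step != 2);
        return ok ? Optional.of(step) : Optional.empty();   // failed -> discarded
    }

    public static void main(String[] args) {
        List<Integer> downstream = new ArrayList<>();
        for (int step : new int[] { 1, 2, 3 }) {
            transfer(step).ifPresent(downstream::add);       // only successes flow on
        }
        System.out.println("tokens reaching the converter: " + downstream); // [1, 3]
    }
}
```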

25
Discarding tokens on failure
(Diagram: tokens 3, 2, 1 enter the pipeline; timestep 1 is transferred, converted and archived; timestep 2 fails and is discarded; timestep 3 is transferred, converted and archived.)
26
After a restart
(Diagram: after the restart, tokens 3, 2, 1 re-enter the pipeline; timestep 1 is skipped by every actor; timestep 2 is transferred, converted and archived; timestep 3 is skipped by every actor.)
27
Future Plans
  • Provenance management:
    • one of the main reasons to use a scientific
      workflow system, e.g. in bioinformatics
      workflows
    • needed for debugging runs, interpreting results,
      repeating experiments, generating documentation,
      comparing runs, etc.
    • the CPES workflow has been selected as a use
      case for the ongoing Kepler provenance work
  • New actors in CPES for controlling asynchronous
    I/O from the petascale computer towards the
    processing cluster

28
Thank You
  • Questions?

29
Disadvantage of discarding 1/7
(Diagram: tokens 1-6 flow through a Distributor, two parallel actors and a Commutator.)
  • The Distributor splits the stream and distributes
    it to the two actors evenly
  • The Commutator keeps the original order of tokens

30
Disadvantage of discarding 2/7
  • T2 is waiting at the Commutator for T1 to be
    finished
  • T4 is started by the lower actor

31
Disadvantage of discarding 3/7
  • T4 is also finished but waits in the lower actor
    until T2 has gone through the Commutator
  • The lower actor becomes idle (this comes from the
    Commutator's behavior)

32
Disadvantage of discarding 4/7
  • Suppose T1 is discarded

33
Disadvantage of discarding 5/7
  • T3 can finally be started
  • The lower actor is still idle (this comes from
    discarding T1!)

34
Disadvantage of discarding 6/7
  • T3 is finished and sent out. The Commutator sends
    it out first.
  • Then T2 is sent out

35
Disadvantage of discarding 7/7
  • The lower actor can start working on T6
  • T4 will be sent out only after T5
  • Order of the outgoing stream: 3, 2, 5, 4
    (reproduced by the sketch below)
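The ordering effect shown on the seven slides above can be reproduced with a tiny simulation (a sketch of the assumed Distributor/Commutator round-robin behaviour, not Kepler code): with T1 discarded, reading the two branches alternately, upper branch first, yields 3, 2, 5, 4, ...

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch (not Kepler code) of why discarding upsets the Distributor/Commutator
// pattern: tokens are dealt round-robin to two branches, T1 is discarded in the
// upper branch, and the Commutator still reads the branches alternately, so the
// output order becomes 3, 2, 5, 4, ... instead of 2, 3, 4, 5, ...
public class CommutatorOrder {
    public static void main(String[] args) {
        Deque<Integer> upper = new ArrayDeque<>();   // receives 1, 3, 5, ...
        Deque<Integer> lower = new ArrayDeque<>();   // receives 2, 4, 6, ...
        for (int t = 1; t <= 6; t++) {
            if (t == 1) continue;                    // T1 failed and was discarded
            if (t % 2 == 1) upper.add(t); else lower.add(t);
        }
        List<Integer> out = new ArrayList<>();
        boolean readUpper = true;                    // Commutator alternates, upper first
        while (!upper.isEmpty() || !lower.isEmpty()) {
            Deque<Integer> q = readUpper ? upper : lower;
            if (!q.isEmpty()) out.add(q.poll());
            readUpper = !readUpper;
        }
        System.out.println(out);                     // [3, 2, 5, 4, 6]
    }
}
```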

36
Checkpointing
(Diagram: token streams ..., f4, f3, f2, f1; ..., g4, g3, g2, g1; ..., h4, h3, h2, h1; ..., list2, list1. Slide marked UNFINISHED.)