Title: Scientific Workflows
1 Introduction to Scientific Workflow Management and the Kepler System
Ilkay Altintas (1), Roselyne Barreto (4), Paul Breimyer (5), Terence Critchlow (2), Daniel Crawl (1), Ayla Khan (3), David Koop (3), Scott Klasky (4), Jeff Ligon (5), Bertram Ludaescher (6), Pierre Mouallem (5), Meiyappan Nagappan (5), Steve Parker (3), Norbert Podhorszki (6), Claudio Silva (3), Mladen Vouk (5)
1. San Diego Supercomputer Center  2. Pacific Northwest National Laboratory  3. University of Utah  4. Oak Ridge National Laboratory  5. North Carolina State University  6. University of California, Davis
2Tutorial Overview
- 9:00-9:30am Introduction to Scientific Workflows (Bertram)
- 9:30-10:00am Workflow Demos (Daniel)
- 10:00-10:30am BREAK
- 10:30-11:00am Kepler Basics (Bertram)
- 11:00am-12:30pm (install Kepler) Hands-On 1 (Daniel)
- 12:30-1:30pm LUNCH
- 1:30-2:00pm Advanced Features (Bertram)
- 2:00-3:00pm Hands-On 2 (Daniel)
- 3:00-3:30pm BREAK
- 3:30-4:10pm Workflow Provenance (Bertram)
- 4:10-4:40pm Provenance Demo (Daniel)
- 4:40-5:00pm Q&A and Open Discussion
3Introduction to Scientific Workflows
- Motivating Examples
- Ecological Niche Modeling
- Processing Sensor Data Streams
- Ecology, Oceanography use cases
- Fusion Simulation
- Requirements Features
4Scientific Workflows Cyberinfrastructure
UPPER-WARE
5 Scientific Workflow
- Capture how a scientist works with data and analytical tools
  - data access, transformation, analysis, visualization
  - possible worldview: dataflow-oriented (cf. signal processing)
- Scientific workflow (wf) benefits (vs. script-based approaches)
  - wf automation
  - wf component reuse
  - wf design, documentation
  - wf archival, sharing
  - built-in concurrency (task-, pipeline-parallelism)
  - built-in provenance support
  - distributed, parallel execution
  - Grid, cluster support
6Kepler science domains
- Ecology
- SEEK: Ecological Niche Modeling and climate change
- REAP: Modeling parasite invasions in grasslands using sensor networks
- NEON: Ecological sensor networks
- COMET: environmental science
- Geosciences
- GEON LiDAR data processing
- GEON Geological data integration
- Molecular biology
- SDM Gene promoter identification
- ChIP-chip genome-scale research
- CAMERA metagenomics
- Physics
- CPES Plasma fusion simulation
- FermiLab particle physics
- Oceanography
- REAP SST data processing
- LOOKING ocean observing CI
- NORIA ocean observing CI
- Phylogenetics
- ATOL/pPOD Processing Phylodata
- CIPRES phylogenetic tools
- Chemistry
- Resurgence Computational chemistry
- DART (X-Ray crystallography)
- Library science
- DIGARCH Digital preservation
- Cheshire digital library archival
- Conservation biology
- SanParks Thresholds of Potential Concerns
Slide Matt Jones
7Simple Kepler workflow using R (a statistics
package)
8Ecological Niche Modeling
Temperature layer
Many other layers
Slide from D. Pennington
9 Managing Complexity
- Scientific workflows use hierarchy to manage complexity
- Top-level workflows can be a conceptual representation of the science process that is easy to comprehend at a glance
- Drilling down into sub-workflows reveals increasing levels of detail
- Composing models using hierarchy promotes the development of re-usable components that can be shared with other scientists
10Partial ENM Workflow
Slide Matt Jones
11 Workflow features required by the ENM use case
- Design phase
  - Access to distributed data: specimens and climate
  - Streamline, automate labor-intensive data preparation
  - Workflow GUI environment
    - communication about complex models
    - experimentation and rapid modification of models
    - re-usable, sharable components
  - Software environment
    - Multi-platform (Mac/Windows/Linux), open, extensible
Slide Matt Jones
12 Workflow features required by the ENM use case
- Execution phase
  - Execution using multiple analytical environments
    - Java, C, R, Matlab, GDAL, web services, ...
  - Integration of multiple computing environments into a single environment (glue-ware)
  - High-throughput distributed execution
    - Iterate across many species with many model runs
    - Assume no prior knowledge of distributed computing technologies
    - Thread-safe components, no back-channel communication
  - Archiving products in community repositories
    - Provenance and metadata for derived products
Slide Matt Jones
13 NSF/CEOP REAP (Real-time Analysis Pipelines): Ecology, Oceanography case studies
- Terrestrial Ecology
  - Predictive Modeling to Examine the Role of an Insect-Vectored Pathogen in Exotic Plant Invasion
  - temperature, precipitation, light interception @ 7 core research areas
  - integrate Metacat-archived data with these sensors in analyses implemented in Kepler
- Oceanography
  - Integrated Framework for Hybrid, Adaptive Ocean Modeling
  - Sea Surface Temperature (SST) fields from OPeNDAP servers
  - Kepler workflows to quantitatively evaluate SST data sets
Slide Matt Jones
14 REAP Project Goals
- For scientists
  - capabilities for designing and executing complex analytical models over near real-time and archived data sources
- For data-grid engineers
  - monitoring and management capabilities for the underlying sensor networks
- For outside users
  - access to observatory data and results of models, approachable to non-scientists
15 Key (sensor network deployment diagram)
(Diagram components: Internet; radio with antenna; RBNB; data logger; Internet Point of Presence (IPP); sensor; battery; relay station about 1 km away; OPeNDAP, EcoGrid, Metacat; linear light probes (A, B); reflectometers (A, B); rain gage; anemometer; RH/temp probe; quantum point light sensor; CR800 datalogger; public website; vegetation plots)
Slide Matt Jones
16Ring Buffered Network Bus (RBNB) streaming data
component
- Scientists can discover data streams
- accessing streams requires little IT knowledge
- Can easily assimilate streams
- into existing or new workflow models
Slide Matt Jones
17 Modeling Disease Effects on Competition
- Discrete Time Model
  - Survival between seasons
  - Reset of system, loss of disease
- Continuous Time Model
  - Growing (Winter Rainy) Season
  - Ongoing infection processes (SI model)
  - Competition (Lotka-Volterra)
- Integro-Difference Equations
  - Parameterized with data from field experiments
- Can utilize coupled models (aka hybrid models)
  - Continuous time model that is coupled to a discrete time model
  - Each model developed independently, joined via the workflow engine
Slide Matt Jones
18 Models of Computation (Directors) in Kepler
- Continuous time
  - Lotka-Volterra predator-prey dynamics (written out after this slide)
  - Synchronize on a global clock
- Synchronous Data Flow
  - Sensor data access, analysis
  - Static dependency analysis, fixed data flow rate
The Director controls the Model of Computation (MoC)
Slide Matt Jones
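For reference, the continuous-time predator-prey dynamics referred to above are conventionally written as the Lotka-Volterra equations (x = prey density, y = predator density; the Greek rate constants are the standard textbook names, not taken from the slides):

  \frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y

A continuous-time director advances such coupled ODEs numerically on a global clock, which is why the CT model of computation fits this example.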
19 Requirements of the REAP use cases
- All features from the ENM use case, plus
- Design phase
  - Access to sensor data streams via a catalog
  - RBNB and Antelope support
  - Bi-directional communication, monitor and control sensors
- Execution phase
  - Support hybrid models
    - Population and community dynamics mix discrete and continuous time models
- Provenance
  - archive modeling scenarios
  - support exploratory modeling
Slide Matt Jones
20 Discovery and Streaming Workflows
- Typical analytical models are complex and difficult to comprehend and maintain
- The use cases described here are only two of many overlapping cases
- Scientific workflows provide
  - An intuitive visual model
  - Structure and efficiency (user time) in modeling and analysis
  - Abstractions to help deal with complexity
  - Direct access to data
  - Means to publish and share models
Slide Matt Jones
21 Plumbing Workflows: Fusion Simulation (SDM CPES)
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command & Control site
22 Plumbing Workflows: Archive Migration
Stage data files from NERSC HPSS to local disk, transfer to ORNL disk, store at ORNL HPSS.
Moved 10TB of data from the NERSC archive to the ORNL archive in 11 days (network issues, bugs, and more).
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
23 Plumbing workflow
- to accomplish all these tasks
- 50 composite actors (subworkflows)
- 4 levels of hierarchy
- 1000 atomic (Java) actors
Norbert Podhorszki UC Davis, soon ORNL
24 Summary: a broad range of workflow types
- Desktop / discovery workflows
  - analysis/method-intensive: R, Matlab, custom algorithms
  - e.g. bioinformatics, ecoinformatics, genomics, phylogenetics
  - exploratory workflows, rapidly evolving
  - need data and workflow provenance
- Streaming workflows
  - (near) real-time processing and data analysis
  - distributed setting
- Plumbing workflows
  - data-intensive, e.g. moving TBs from ORNL (compute) to LBL/NERSC (archive)
  - production workflows: reliable, fault-tolerant, high-throughput, runtime monitoring
- HPC workflows
  - cpu-intensive, need to utilize a local cluster or distributed Grid, e.g. Ecological Niche Modeling, parameter studies
  - parallel/distributed workflows
25Workflow Demos
26Bioinformatics Web Service
- Retrieve a genetic sequence from the DNA Data Bank of Japan (DDBJ).
- Data transformations via XSLT and XPath.
27Bioinformatics Web Service Access
28REAP Data Streaming
30Transfer-Convert-Archive-Image-Workflow
31Basic Kepler Features
32Kepler is a Scientific Workflow System
http://www.kepler-project.org
- Kepler is a cross-project collaboration
- Latest release available from the website
- Builds upon the open-source Ptolemy II framework
33Kepler Communities Collaboration
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from
- SEEK
- SciDAC SDM
- Ptolemy
- GEON
- ROADNet
- Resurgence
- AToL CIPRES, POD
- ...
- Goals
- Create powerful analytical tools that are useful across disciplines
- Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, ...
Ptolemy II
34Vergil is the GUI for Kepler
but Kepler can also run in batch mode as a
command-line engine.
(Screenshot callouts: data search; actor search; actor ontology (semantic search); drag & drop, link via ports; metadata-based search for datasets)
35Actor-Oriented Modeling Design
- Actor
- single component or task
- well-defined interface (signature)
- given input data, produces output data
36Actor-Oriented Modeling Ports
- Ports
- each actor has a set of input and output ports
- denote the actor's signature
- produce/consume data (a.k.a. tokens)
- Parameters
- (visible after double-click) can be seen as
special static ports
37Actor-Oriented Modeling Connections / Channels
- Dataflow Connections
- actor communication channels
- directed (hyper) edges
- connect output ports with input ports
- can fork (cloning tokens) at relation nodes
(little diamonds)
38Actor-Oriented Modeling Subworkflows
- Sub-workflows / Composite Actors
- composite actors wrap sub-workflows
- like actors, have signatures (i/o ports of the sub-workflow)
- hierarchical workflows (arbitrary nesting levels)
39Actor-Oriented Modeling Directors
- Directors
- define the Model of Computation (MoC) of workflow graphs
- execute the workflow graph (according to some schedule)
- sub-workflows may have different directors
- facilitates actor and (sub-)workflow reusability
40 Models of Computation
- Directors separate the concerns of workflow orchestration from actor execution
- Synchronous Dataflow (SDF)
  - Connections have queues for sending/receiving fixed numbers of tokens at each firing. The schedule is statically predetermined. SDF models are highly analyzable and often used in SWFs.
  - Downside: need to know token consumption/production rates ahead of time
- Process Networks (PN)
  - Generalizes SDF. Each actor executes as a separate thread/process, with queues of (in principle) unbounded size. Closely related to Kahn/MacQueen semantics.
- Continuous Time (CT)
  - Connections represent the value of a continuous-time signal at some point in time. Often used to model physical processes.
- Discrete Event (DE)
  - Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
41Searching Components (Actors)
- Kepler Actor Ontology (tag hierarchy)
  - Used in searching actors and creating conceptual views (virtual folders)
  - currently > 370 actors
42 Searching and Binding Data
- Kepler DataGrid
  - Discovery of data resources through local and remote services
    - SRB,
    - Grid and Web Services,
    - DB connections
  - Registration of datasets on the fly using workflows
43Hands-On Exercises 1
44Opening and Running a Workflow
- Start Kepler
- Open HelloWorld.xml under the demos/sc07 directory in your local Kepler folder
- Two options to run a workflow
  - the PLAY BUTTON in the toolbar
  - the RUNTIME WINDOW from the Run menu
45 Modifying an Existing Workflow and Saving It
- GOAL
  - Modify the HelloWorld workflow to display a parameter-based message
- Step-by-step instructions
  - Open the HelloWorld workflow as before
  - From the actor search tab, search for Parameter
  - Drag and drop the parameter onto the workflow canvas on the right
  - Double click the parameter and type your name
  - Right click the parameter, select "Customize Name", and type in "name"
  - Double click the Constant actor and type the following: "Hello " + name
  - Save
  - Run the workflow
46 Creating a HelloWorld! Workflow
- Open a new blank workflow canvas
  - From the toolbar: File > New Workflow > Blank
- In the Components tab, search for Constant and select the Constant actor
- Drag the Constant actor onto the workflow canvas
- Configure the Constant actor
  - Right-click the actor and select Configure Actor from the menu
  - Or, double click the actor
  - Type "Hello World" in the value field and click Commit
- In the Components and Data Access area, search for Display and select the Display actor found under Textual Output
- Drag the Display actor onto the workflow canvas
- Connect the output port of the Constant actor to the input port of the Display actor
- In the Components and Data Access area, select the Components tab, then navigate to the /Components/Director/ directory
- Drag the SDF Director to the top of the workflow canvas
- Run the model
48 Using Various Displays
- GOAL: Use different graphical output actors.
- Step-by-step instructions
  - Open "03-ImageDisplay.xml" under the demos/getting-started directory in your local Kepler folder.
  - Run the workflow.
  - Search for "browser" in the Components tab.
  - Drag and drop "Browser Display" onto the canvas.
  - Replace "ImageJ" with "Browser Display" (connect the Image Converter output to the "Browser Display" inputURL port).
  - Run the workflow again.
  - Replace "Browser Display" with a textual "Display" actor.
  - Run the workflow.
49Advanced Kepler Features
50 Process Networks
- The partial (or total linear) order implied by a DAG gives us a schedule for workflows for one-time tasks (jobs)
- What about pipelined workflows on token streams?
- Communicating processes with directed token flow
- Dataflow Process Networks
  - communication: a token stream between two processes
  - process: operations on tokens
  - host language: process description
  - coordination language: network description
(Figure: two processes connected by a channel carrying a token stream)
51 Kahn Process Networks (1974)
- special class of process networks
- stream is a FIFO with unbounded capacity
- process
  - destructive read (consumption) at process start,
  - non-destructive write (production) at process end,
  - blocking read: a process only executes if data is available,
  - non-blocking write
EXAMPLE
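To make the blocking-read / non-blocking-write semantics concrete, here is a minimal stand-alone sketch in Python (illustrative only; this is not how Kepler's PN director is implemented): each process runs as a thread, channels are unbounded FIFO queues, reads block until a token arrives, and writes always succeed.

```python
import threading, queue

def source(out, n=5):
    # non-blocking writes onto an unbounded FIFO channel
    for i in range(n):
        out.put(i)
    out.put(None)                     # end-of-stream marker (our own convention)

def scale(inp, out, factor=2):
    while True:
        token = inp.get()             # blocking, destructive read
        if token is None:
            out.put(None)
            break
        out.put(token * factor)       # non-blocking write

def sink(inp):
    while True:
        token = inp.get()
        if token is None:
            break
        print("received", token)

# channels are unbounded FIFO queues
a, b = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=source, args=(a,)),
           threading.Thread(target=scale, args=(a, b)),
           threading.Thread(target=sink, args=(b,))]
for t in threads: t.start()
for t in threads: t.join()
```

Because writes never block, the only way a process stalls is by waiting for input, which is exactly the Kahn condition that makes the network's output independent of scheduling.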
52 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
53 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
54 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
55 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
56 Problems with Process Networks
- How to run/schedule a process network without accumulating arbitrarily many tokens?
  - Difficult to schedule because of the need to balance relative process rates
  - The system inherently gives the scheduler few hints about appropriate rates
- Tom Parks' algorithm
  - runs in bounded memory whenever possible
  - (the bounded-memory condition is undecidable)
- Synchronous Dataflow (SDF)
  - Edward Lee and David Messerschmitt, Berkeley, 1987
  - Restricts Kahn Process Networks to allow compile-time scheduling
  - Basic idea: each process reads and writes a fixed number of tokens each time it fires. Example:
    - Loop forever
      - read 2 tokens from A, 3 tokens from B
      - compute
      - write 1 token to C, write 2 tokens to D
57 Synchronous Dataflow (SDF): Fixed Production/Consumption Rates
- Balance equations (one for each channel)
- Schedulable statically
- Decidable:
  - buffer memory requirements
  - deadlock
(Figure: actor A fires and produces N tokens on a channel; actor B fires and consumes M tokens; the balance equation relates the number of tokens produced and consumed to the number of firings per iteration)
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
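In standard SDF notation (not verbatim from the slide), the balance equation for the pictured channel is

  q_A \cdot N = q_B \cdot M

where q_A and q_B are the numbers of firings of A and B per iteration. Writing one such equation per channel and solving for the smallest positive integer solution gives the repetitions vector, from which the static schedule, buffer sizes, and deadlock freedom can be determined at compile time.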
58 Parallel Scheduling of SDF Models
Many scheduling optimization problems can be formulated. Some can be solved, too!
SDF is suitable for automated mapping onto parallel processors and synthesis of parallel circuits.
(Figure: a four-actor SDF graph (A, B, C, D) shown with a sequential schedule and a parallel schedule)
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
59 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
60 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
61 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
62 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
63 Selected Generalizations
- Multidimensional Synchronous Dataflow (1993)
  - Arcs carry multidimensional streams
  - One balance equation per dimension per arc
- Cyclo-Static Dataflow (Lauwereins, et al., 1994)
  - Periodically varying production/consumption rates
- Boolean and Integer Dataflow (1993/4)
  - Balance equations are solved symbolically
  - Permits data-dependent routing of tokens
  - Heuristic-based scheduling (undecidable)
- Dynamic Dataflow (1981-)
  - Firings scheduled at run time
  - Challenge: maintain bounded memory, deadlock freedom, liveness
  - Demand-driven, data-driven, and fair policies all fail
- Kahn Process Networks (1974-)
  - Replace discrete firings with process suspension
  - Challenge: maintain bounded memory, deadlock freedom, liveness
- Heterochronous Dataflow (1997)
  - Combines state machines with SDF graphs
  - Very expressive, yet decidable
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
64(Internal) Workflow Format MoML
65 Sharing Kepler Workflows -- Use Cases
- UC-1) Facilitate transport of workflows to grid/distributed/server/P2P systems
- UC-2) Preserve an analysis to allow replication
- UC-3) Allow the development and distribution of components (actors/directors) which can be released on a schedule independent of Kepler itself
66Kepler Archive File (KAR)
67 KAR File Functional Requirements
- FR-1) Mechanism to package the resources required to implement a component in a Kepler system
  - FR-1a) must be able to contain Java class files
  - FR-1b) must be able to contain native binary executable files
  - FR-1c) must be able to contain native library files
  - FR-1d) must be able to contain MoML and other XML-based text
  - FR-1e) must be able to contain data in binary and ASCII formats, including zipped data
- FR-2) Must describe the contained components so they can be utilized in a Kepler system
  - FR-2a) each component must have a unique LSID identifier which is tied to the specific implementation of the component
  - FR-2b) must contain an OWL document with semantic ordering for the contained objects
68 The Need for Plumbing Workflows: Tales from the Life of a Simulation Scientist
69 A few days in the life of Sim Scientist: Day 1, morning
- 8:00AM: Get coffee, check to see if jobs are running.
  - ssh into jaguar.ccs.ornl.gov (job 1)
  - ssh into seaborg.nersc.gov (job 2) (this one is running, yea!)
  - Run gnuplot to see if the run is going OK on seaborg. This looks OK.
- 9:00AM: Look at data from an old run for post processing.
  - Legacy code (IDL, Matlab) to analyze most data.
  - Visualize some of the data to see if there is anything interesting.
  - Is my job running on jaguar? I submitted this 4K-processor job 2 days ago!
- 10:00AM: scp some files from seaborg to my local cluster.
  - Luckily I only have 10 files (which are only 1 GB/file).
- 10:30AM: The first file appears on my local machine for analysis.
  - Visualize data with Matlab. Seems to be OK.
- 11:30AM: See that the second file had trouble coming over.
  - scp the files over again. Dohhh!
Slide Scott Klasky
70 A few days in the life of Sim Scientist: Day 1, evening
- 1:00PM: Look at the output from the second file.
  - Oops, I had a mistake in my input parameters.
  - ssh into seaborg, kill the job. Emacs the input, submit the job.
  - ssh into jaguar, check status. Cool, it's running.
  - bbcp 2 files over to my local machine (8 GB/file).
  - gnuplot the data. This looks OK too, but I still need to see more information.
- 1:30PM: Files are on my cluster.
  - Run Matlab on the HDF5 output files. Looks good.
  - Write down some information in my notebook about the run.
  - Visualize some of the data. All looks good.
  - Go to meetings.
- 4:00PM: Return from meetings.
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. My job still isn't running.
- 8:00PM: Are my jobs running?
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. Cool. My job is running. Run gnuplot. Looks good this time!
Slide Scott Klasky
71 And later...
- 4:00AM: (yawn) Is my job on jaguar done?
  - ssh into jaguar. Cool. Job is finished.
  - Start bbcp'ing files over to my work machine (2 TB of data).
- 8:00AM: bbcp is having troubles.
  - Resubmit some of my bbcp transfers from jaguar to my local cluster.
- 8:00AM (next day):
  - Still need to get the rest of my 200GB of data over to my machine.
- 3:00PM: My data is finally here!
  - Run Matlab. Run Ensight. Oops. Something's wrong!!! Where did that instability come from?
- 6:00PM: Finish screaming!
Slide Scott Klasky
72 And 2 years from now...
- Simulations/computers are getting larger and more expensive to operate.
  - In Fusion, large runs will be using > 50K cores for 100 wallclock hours, to understand turbulent transport in ITER-size reactors.
  - The cost of a simulation approaches $0.6M (power, cooling, system cost averaged over 5 years).
- Data sizes are getting larger.
  - Large simulations produce 2 TB/simulation today, 100 TB/simulation (per week) in the future.
- Demand for real-time monitoring/analysis of simulations.
- Demand for fast, reliable data movement to local machines for post processing.
- Demand to keep data provenance at 1 location.
Slide Scott Klasky
73 Workflows to the rescue!
- In our demo section (SC07 tutorial only) you will see us automate this process.
  - Job submission starts services on the ORNL IB cluster (ewok).
  - Files are automatically moved from the Cray XT3 to the ORNL IB cluster.
  - Files are converted from binary to HDF5 files.
  - Files accumulate until they are > 6GB; then they are tarred (see the sketch after this slide).
  - hsi commands place the tar files into HPSS (an XML file describes which files are in which tar files).
  - Each HDF5 file is read into a SCIRun service which creates a jpeg.
  - Jpeg files are assembled into an mpeg file via an mpeg service.
  - Jpeg and mpeg files are moved to the web portal.
  - HDF5 files are archived to PPPL.
- And of course we will keep track of the provenance of the workflow in a database!
- And we can monitor this on our dashboard.
Slide Scott Klasky
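A rough sketch of the accumulate-until-6GB, tar, and HPSS-archive step in Python (the staging directory, threshold handling, and hsi invocation here are illustrative assumptions, not the actual CPES workflow code):

```python
import os
import subprocess
import tarfile

STAGE_DIR = "/tmp/stage"             # hypothetical staging directory on the analysis cluster
THRESHOLD = 6 * 1024**3              # accumulate roughly 6 GB before tarring

def archive_batch(files, batch_id):
    """Tar one batch of converted HDF5 files and push the tarball into HPSS via hsi."""
    tar_path = os.path.join(STAGE_DIR, f"batch_{batch_id:04d}.tar")
    with tarfile.open(tar_path, "w") as tar:
        for f in files:
            tar.add(f, arcname=os.path.basename(f))
    # 'hsi put <file>' stores the file in HPSS; exact options are site-specific
    subprocess.run(["hsi", "put", tar_path], check=True)

def accumulate_and_archive(hdf5_files):
    batch, size, batch_id = [], 0, 0
    for f in sorted(hdf5_files):
        batch.append(f)
        size += os.path.getsize(f)
        if size > THRESHOLD:         # enough data accumulated: tar and archive this batch
            archive_batch(batch, batch_id)
            batch, size, batch_id = [], 0, batch_id + 1
    if batch:                        # archive whatever is left over
        archive_batch(batch, batch_id)
```

In the Kepler workflow these steps are separate actors running as a pipeline; the sketch only shows the logic of the accumulate-and-archive stage.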
74 Why do Pflop computing scientists care?
- Typical situation for Sim Scientist:
  - We run on 1-60K processors, producing lots of data.
- Typical method of work:
  - Prepare input data for smaller simulations.
  - Iterate until we come up with the correct parameters for the large run.
  - Run the large simulation at only a handful of locations (usually < 4).
  - Must archive results. Must be of the correct size for archives on HPSS.
  - Must move some data over to our local clusters for analysis after the simulation.
  - Did we make a mistake with the input parameters? Is something going wrong? Fix the code/input, start the run over again.
  - Wow, I just wasted 100K CPU hours because I missed a sign. Duhh.
  - Where are all of my files? I want to look at the temperature in the 200th time slice; where is it on HPSS?
Slide Scott Klasky
75 Post Processing Workflow: A day in the life of Sim Scientist
- 9:00AM: Get Diet Coke, decide which runs/experimental data to analyze.
- 9:30AM: Start to download files from HPSS at NERSC and ORNL.
- 10:00AM: Move files from NERSC/ORNL to the local desktop machine. Smallish data (10GB/location).
- 11:00AM: Start IDL, and compute various post processing quantities.
- 11:30AM: Look at the data from the simulations, and grab data from a database which has experimental data.
- 1:00PM: Move some more data from ORNL to the local desktop to compare to more experimental data.
  - Save the plot from Matlab to Postscript to include in a paper.
  - Write down results into the notebook, copy the figure into the notebook.
- 2:00PM: Think about results, and decide on new analysis routines to write in the future.
- 4:00PM: Start moving more data from NERSC to the local desktop.
Slide Scott Klasky
76 What's changing in his life?
- Collaboration.
- More clusters, more simulations.
- Just analyze the data where we run. Don't move the data.
- But...
  - What if the network goes down (I'm on a plane, ...)?
  - What if the resource is not available for my late-breaking analysis before the BIG conference?
- OK, but what about the large data?
  - OK. Large data will be server-side analysis. Not DESKTOP. But we can run the workflow on a server.
- Data from multiple resources
  - V&V data from multiple simulations/experiments.
- But can't we just run VisIt/SCIRun?
  - Yes. But if you need to orchestrate the data movement from different sources, track the provenance, and perhaps use multiple analysis/visualization packages, then a workflow system can help.
Slide Scott Klasky
77 How do we help this scientist?
- The workflow is the glue for the scientists.
- The scientist hooks up all of the analysis routines.
- The director makes sure that the data movement occurs, and is reliable and secure.
- All of the tedious portions (ssh in, start this program, ...) are removed by the workflow automation.
- The workflow will be able to keep the provenance information, which allows the user to understand how they processed the dataset.
  - This enables the scientist to compare new data with old data.
Slide Scott Klasky
78 So what are the requirements?
- Must be EASY to use.
  - If you need a manual, then FORGET IT!
  - Good user support, and long-term DOE support.
- The workflow system should work for all of my workflows.
  - NOT just for the Petascale computers.
  - And on multiple platforms!
- Must be easy to incorporate my own services into the workflow.
- Must be customizable by the users.
  - Users need to easily change the workflow to work with the way users work.
- Long-term requirements (NOT being worked on yet):
  - Autonomics / user adaptivity.
  - Faster data movement in the workflow. A high-quality front-end for end-user interaction.
  - You tell us!
Slide Scott Klasky
79 SWF Systems Requirements
- Design tools, especially for non-expert users
- Ease of use: a fairly simple user interface, with more complex features hidden in the background
- Reusable generic features
  - Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences); customizable
- Extensibility for the expert user
- Registration, publication and provenance of data products and process products (workflows)
- Dynamic plug-in of data and processes from registries/repositories
- Distributed WF execution (e.g. Web and Grid awareness)
- Semantics awareness
- WF deployment
  - as a web site, as a web service, power apps, ...
Slide Scott Klasky
80 The Big Picture: Supporting the Scientist
From napkin drawings to executable workflows
(Figure: a conceptual SWF is refined into an executable SWF)
Here: John Blondin, NC State; Astrophysics Terascale Supernova Initiative, SciDAC, DOE
Slide M. Vouk
81 CPES Fusion Simulation Workflow
- Fusion simulation codes: (a) GTC, (b) XGC with M3D
  - e.g. (a) currently 4,800 (soon 9,600) nodes on a Cray XT3, 9.6TB RAM, 1.5TB simulation data/run
- GOAL
  - automate remote simulation job submission
  - continuous file movement to a secondary analysis cluster for dynamic visualization and simulation control
  - with runtime-configurable observables
(Workflow steps shown: Select JobMgr, Submit Simulation Job, Submit FileMover Job, Execution Log (=> Data Provenance))
Overall architect (and prototypical user): Scott Klasky (ORNL); WF design and implementation: Norbert Podhorszki (UC Davis)
82 CPES Analysis Workflow
- Concurrent analysis pipeline (@ Analysis Cluster)
  - convert, analyze, copy to Web portal
  - easy configuration, re-purposing
(Workflow features highlighted: reusable actor class, specialized actor instances, pipelined execution model, inline documentation, inline display, easy-to-edit parameter settings)
Overall architect (and prototypical user): Scott Klasky (ORNL); WF design and implementation: Norbert Podhorszki (UC Davis)
83 Dashboard integration with Kepler
- The dashboard presents information created from the workflow.
- We have been developing a dashboard for Kepler workflows, using:
  - AJAX
  - Flash
  - PHP
  - MySQL
Slide SDM/SPA, Klasky,Vouk, et al
84 Machine Monitoring
- DOE machine monitoring
  - Which machines are up?
  - Which machines have long queues, which are idle?
  - Where can I run my job?
  - Where am I running jobs?
  - Where are my running jobs, and can I look at my old runs?
  - Can I monitor a new job, and compare it to an old job?
Slide SDM/SPA, Klasky,Vouk, et al
85 Dashboards for Simulation Monitoring
- Back end: shell scripts, Python scripts and PHP
  - Machine queue commands
  - Users' personal information
  - Services to display and manipulate data before display
- Dynamic front end
  - Machine monitoring: standard web technology (Ajax)
  - Simulation monitoring: Flash
- Storage: MySQL (queue info, min-max data, user notes)
Slide SDM/SPA, Klasky,Vouk, et al
86 Scientific Workflow Systems
- Combination of
  - data management, integration, analysis, and visualization steps
  - a larger, automated "scientific process"
- Mission of scientific workflow systems
  - Promote scientific discovery by providing tools and methods to generate scientific workflows
  - Provide an extensible and customizable graphical user interface for scientists from different scientific domains
  - Support workflow design, execution, sharing, reuse and provenance
  - Design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources
  - Make the technology useful through the user's computer!!!
87 Two typical types of Workflows for SC
- Real-time Monitoring (Server-Side Workflows)
  - Job submission
  - File movement
  - Launch Analysis Services
  - Launch Visualization Services
  - Launch Automatic Archiving
- Post Processing (Desktop Workflows)
  - Read in Files from different locations
  - File movement
  - Launch Analysis Services
  - Launch Visualization Services
  - Connect to Databases
- Obviously there are other types of workflows
  - What is your type of workflow?
88Plumbing Workflow using Kepler
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command & Control site
89 Plumbing Workflow for Archive Migration
Stage from NERSC HPSS to local disk, transfer to ORNL disk, store at ORNL HPSS.
Moved 10TB of data from the NERSC archive to the ORNL archive in 11 days (network issues, bugs, and more).
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
90Pipeline and parallel processing
Norbert Podhorszki (UC Davis)
91 Plumbing workflow
- to accomplish all these tasks
- 50 composite actors (subworkflows)
- 4 levels of hierarchy
- 1000 atomic (Java) actors
Norbert Podhorszki UC Davis, soon ORNL
92 Distributed Execution: Many ways to skin a cat
- Do it all in Kepler (white-box)
  - Single machine: single-threaded and/or multi-threaded
  - Multiple nodes (cluster)
  - Distributed Kepler, Kepler/HPC
- Medium/tightly coupled (grey-box)
  - use remote commands
  - and their exit status
- Loosely coupled (black-box -- Norbert's workflows)
  - Launch remote scripts
  - Inquire about their status, e.g. via ls -1
  - Minimalist approach
  - works even with tough ORNL constraints!
93 Authoring Distributed Workflows
(Figure: a normal workflow vs. a distributed workflow)
- Place the wf in a DistributedCompositeActor (DCA).
- At runtime, the contents of the DCA are packaged up and shipped to the remote nodes.
- The workflow is executed and the output is returned to the master Kepler node to be viewed/further processed.
Slide from C. Berkley
94Node Discovery and Remote Management
Slide from C. Berkley
95 Efficient Data Transfer
- Large datasets need special handling
- Inefficient data transfer could wipe out the time savings of distributed computation
(Figure: a master with slaves where Slave1 depends on Slave0 and Slave2 depends on Slave1; one arrangement moves the large dataset through the master between every step (inefficient, 6 possible transfers), while the other passes the large dataset directly along the slave chain and returns only results to the master (more efficient, 4 possible transfers))
Slide from C. Berkley
96 A Hierarchical View of the Architecture
- Control Plane (light data flows)
- Provenance, Tracking and Meta-Data (DBs and Portals)
- Execution Plane (heavy-lifting computations and flows)
- Synchronous or asynchronous?
97 Scientific Workflow Automation (e.g., Astrophysics)
In conjunction with John Blondin, NC State University: automate data acquisition, transfer and visualization of a large-scale simulation at ORNL.
(Diagram components: highly parallel compute resource running VH1; output of 500x500 files; aggregate to 500 files (< 50GB each); input data; Logistic Network L-Bone or bbcp; Depot; local mass storage (14TB); HPSS archive; local 44-processor data cluster where data sits on local nodes for weeks; provenance; web; viz software; viz wall; viz client)
98 Scientific Workflow Modeling and Design
And that's why our scientific workflows are much easier to develop, understand, reuse and maintain!
99Behold the Beauty of Scientific Workflow Design
Author Kristian Stevens, UC Davis
100 Shimology Part 2 the ugly truth inside
Author Kristian Stevens, UC Davis
101But how do we get from messy to neat reusable
designs?
102The Problem Evolving Workflows
Daniel Zinn (UC Davis)
103What we want Simple Analysis Pipelines
Author Tim McPhillips, UC Davis
104 The Answer (YMMV)
- Collection-Oriented Modeling and Design (COMAD)
  - embrace the assembly line metaphor fully
    - Virtual Assembly Lines (VALs)
    - cf. Flow-based Programming (J. Morrison)
  - data: tagged nested collections
  - pipelined (XML) token streams
  - passing the buck on what's not in your scope
Timothy McPhillips (UC Davis)
105 Conventional vs. Assembly Line Delta-XML Thinking
Daniel Zinn (UC Davis)
106More secret sauce User vs. Optimized Dataflow
Daniel Zinn (UC Davis)
107What we got Simple Change-Resilient Pipelines
Author Tim McPhillips, UC Davis
108 Result: Change-Resilience (WF graph)
(Figure: an original workflow graph with nodes A, B, C and scopes S, R, W; after an actor X is inserted, an automatic configuration step infers the configuration of X)
Daniel Zinn (UC Davis)
109Related Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)
110 Scientific Workflow Modeling and Design Paradigms
- Vanilla Process Network
- Functional Programming Dataflow Network
- XML Transformation Network
- Collection-oriented Modeling and Design framework (COMAD)
  - Look Ma, No Shims!
- also: running DAGs, Petri Nets, easyBPEL, ...
111Hands-On Exercises 2
112 Using R in Kepler
- GOAL: Use the R actor to generate a histogram plot.
- Step-by-step instructions
  - In the demos/getting-started directory, open 05-LinearRegression.xml.
  - Run the workflow to view the linear regression.
  - Add another RExpression actor to the canvas.
  - Double-click on the new R actor and enter the following for the R function or script:
    - Mean <- mean(Values)
    - hist(Values)
  - Right-click on the new R actor and select Configure Ports:
    - Add an input called Values
    - Add an output called Mean
  - Control-click on the canvas to create a new Relation diamond.
  - Connect the T_AIR port from Datos to the diamond.
  - Connect the T_AIR port from R_linear_regression to the diamond.
  - Connect the Values port from the new R actor to the diamond.
  - Place a second ImageJ actor on the canvas (you can copy and paste the existing one).
  - Connect the R actor's graphicsFileName port to the second ImageJ actor's input.
  - Run the workflow.
113 Creating Web Service Workflows
- GOAL: Execute a Web Service using the generic Web Service client.
- Step-by-step instructions
  - In the Components and Data Access area, select the Components tab.
  - Search for "web service".
  - Drag "Web Service Actor" onto the canvas.
  - Double click the actor, enter http://xml.nig.ac.jp/wsdl/DDBJ.wsdl, commit.
  - Double click the actor again, select "getXMLEntry" as the method name, commit.
  - Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
  - Double click the "String Constant", set AA045112 as its value, commit.
  - Connect the "String Constant" output with the "Web Service Actor" input.
  - Add a "Display" and connect its input with the "Web Service Actor" "Result" output.
  - Add the SDF director.
  - Run the workflow.
114 SSH Actor and Including Existing Scripts in a Workflow
- GOAL: Use the SSH actor to execute a command on a remote host.
- Step-by-step instructions
  - Search for "ssh" in the Components tab in the left pane.
  - Drag "SSH To Execute" onto the canvas.
  - Double click the actor:
    - Type in a remote host you have access to.
    - Type in your username.
  - Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
  - Double click the "String Constant", type "ls" and commit.
  - Connect the "String Constant" output with the "SSH To Execute" command input (lowest).
  - Add a "Display" and connect its input with the "SSH To Execute" stdout output (top).
  - Add the SDF director.
  - Run the workflow.
  - If you have a script deployed on the server, you can replace the "ls" command to invoke the script, e.g., perl tmp.pl
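For comparison, the same remote "ls" can be run from a stand-alone Python script with paramiko; this only illustrates what the actor does conceptually, not Kepler's actual implementation (the host and username below are placeholders):

```python
import paramiko

host, user = "remote.example.org", "myuser"   # placeholders: use a host you have access to

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # accept unknown host keys (demo only)
client.connect(host, username=user)           # assumes key-based auth; add password=... otherwise

stdin, stdout, stderr = client.exec_command("ls")   # the same command the String Constant provides
print(stdout.read().decode())                 # corresponds to the actor's stdout output port
client.close()
```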
115 Using Relational Databases
- GOAL: Access a geoscience database using a generic database actor.
- Step-by-step instructions
  - In the Components and Data Access area, select the Components tab.
  - Search for "database".
  - Drag "Open Database Connection" and "Database Query" onto the canvas.
  - Configure "Open Database Connection" with the following parameters:
    - Database format: PostgreSQL
    - Database URL: jdbc:postgresql://geon17.sdsc.edu:5432/igneous
    - Username: readonly
    - Password: read0n1y
  - Connect the output of "Open Database Connection" with the dbcon input port of "Database Query".
  - Double-click to customize the actor:
    - Query: SELECT * FROM IGROCKS.ModalData WHERE SSID = 227 (i.e., 227 for the ssID)
  - Add a Display actor (from the Components tab), connect ports, add an SDF director (as in the previous example).
  - Run the workflow.
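If you want to sanity-check the connection parameters outside Kepler, an equivalent query can be issued with psycopg2 (a sketch only; the geon17.sdsc.edu server and read-only credentials are the ones listed above and may no longer be reachable):

```python
import psycopg2

# connection parameters taken from the exercise above
conn = psycopg2.connect(host="geon17.sdsc.edu", port=5432,
                        dbname="igneous", user="readonly", password="read0n1y")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM IGROCKS.ModalData WHERE SSID = %s", (227,))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```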
116 Provenance
- Two different takes on it
  - The Scientist's View (discovery workflows)
  - The Engineer's View (plumbing workflows)
117 A Scientific Publication (the final provenance frontier ...)
- Title (Statement, Theorem)
- Abstract (1st-level expansion)
- Main Text (2nd-level expansion)
- Some metadata: Nature 443, 167-172 (14 September 2006); doi:10.1038/nature05113; Received 27 June 2006; Accepted 25 July 2006; Published online 16 August 2006
118 More Evidence
(Figure callouts: data reference, type of evidence, tool reference, "trust me on this one")
- provenance / data lineage show the history and evidence
  - related to proof trees
- unlike with scripts, a SWF system can keep track of what happened
- In the future: deposit your data and workflows in a repository
119Pipelined workflow for inferring phylogenetic
trees
Author Tim McPhillips, UC Davis
120 Different Dependency Graphs
"A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows", Shawn Bowers, Timothy McPhillips, Bertram Ludäscher, Shirley Cohen, Susan B. Davidson. International Provenance and Annotation Workshop (IPAW'06), Chicago, May 3-5, 2006.
121 Scientific Provenance: Questions we can ask
- What DNA sequences were input to the workflow?
- What phylogenetic trees were output by the workflow?
- What DNA sequences input to the workflow does this consensus tree depend on?
- What input sequences were not used to derive any output consensus trees?
- What was the sequence alignment (key intermediate data) used in the process of inferring this tree?
- plus the usual: smart re-run, VCR-style replay, ...
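Most of these questions reduce to reachability queries over a recorded data dependency graph. A toy sketch in Python (the graph and item names are hypothetical, purely for illustration):

```python
# maps each derived data item to the items it was directly derived from
depends_on = {
    "consensus_tree_1": ["alignment_1"],
    "alignment_1": ["seq_A", "seq_B", "seq_C"],
}

def lineage(item):
    """All upstream data items a given product transitively depends on."""
    seen, stack = set(), list(depends_on.get(item, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(depends_on.get(d, []))
    return seen

inputs = {"seq_A", "seq_B", "seq_C", "seq_D"}
used = lineage("consensus_tree_1")
print("tree depends on:", used & inputs)       # which input sequences the tree depends on
print("unused inputs:", inputs - used)         # inputs not used to derive the tree
```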
122Provenance in the COMAD Framework
Without Provenance
With Provenance
123 The Answer (YMMV)
- Collection-Oriented Modeling and Design (COMAD)
  - embrace the assembly line metaphor fully
    - Virtual Assembly Lines (VALs)
    - cf. Flow-based Programming (J. Morrison)
  - data: tagged nested collections
  - pipelined (XML) token streams
  - passing the buck on what's not in your scope
Timothy McPhillips (UC Davis)
124 Provenance for the WF Engineer / Plumber
- A Workflow Engineer's View
  - Monitor, benchmark, and optimize workflow performance
  - Record resource usage for a workflow execution
  - Smart re-run of (variants of) previous executions
  - Checkpointing and restart (e.g. for crash recovery, load balancing)
  - Debug or troubleshoot a workflow run
  - Explain when, where, and why a workflow crashed
125 Provenance for Domain Scientists!
- Query the lineage of a data product
  - from what data was this computed? (real dependencies, please!)
- Evaluate the results of a workflow
  - do I like how this result was computed?
- Reuse data products of one workflow run in another
  - (re-)attach prior data products to a new workflow
- Archive scientific results in a repository
- Replicate the results reported by another researcher
- Discover all results derived from a given dataset
  - i.e. across all runs
- Explain unexpected results
  - via parameter, dataset, and object dependencies, in the scientist's terms (yes, you may think "ontology" here)
126 Observables
- Model of Computation (MoC) M
  - specification/algorithm to compute o = M(W, P, i)
  - a director or scheduler implements M
  - gives rise to formal notions of
    - computation (aka run) R: typically tree models
- Model of Provenance (MoP) M'
  - approximation M' of M
  - a trace T approximates a run R by inclusion/exclusion of observables
  - T = R minus the ignored observables, i.e. T records only the model observables
- Observables (of a MoC M)
  - functional observables (may influence the output o)
    - token rate, notions of firing, ...
  - non-functional observables (not part of M, do not influence o)
    - token timestamp, size, ... (unless the MoC cares about those)
- What is a good model of provenance?
- What is a good provenance schema?
127 Provenance in the General Architecture (SDM/SPA View)
(Architecture diagram components: Analytics; Computations; Control Panel (Dashboard) and Display; local and/or remote communications (networks); Orchestration (Kepler); Data, Databases, Provenance, Storage)
128 What is Provenance? (SDM/SPA view)
- Provenance is about meta-data (data about data): the history (lineage) of data, code execution and conditions applied to a workflow run.
- Run-time monitoring may be part of the provenance meta-data, but it also may require collection of additional information and display of that information in a user-friendly format, for example on a dashboard, so that run-time tracking, problem determination, computational steering, and other workflow-related feedback may take place.
129 Why Provenance?
- Recreate results and rebuild workflows using the evolution information
- Associate the workflow with the results it produced
- Create links between generated data in different runs, and compare different runs
- Checkpoint a workflow and recover from a system failure
- Debug and explain results (via lineage tracing, ...)
- Smart re-runs
- Other...
130 Types of Provenance
- Process provenance: dynamics of control flows and their progression, execution, etc.
- Data provenance: dynamics of data and data flows, file locations, application input/output information, etc.
- Workflow provenance: structure, form, evolution, ...
- System (or environment) provenance: system information, O/S, compiler versions, etc.
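As a toy illustration of how these four kinds of provenance might sit side by side in one record, consider the following sketch (a hypothetical layout for illustration only, not Kepler's actual provenance schema):

```python
provenance_record = {
    "process": {            # process provenance: control-flow progression and execution
        "workflow_run_id": "run-0042",
        "actor_firings": [{"actor": "FileMover", "start": "2007-11-12T08:00:00", "status": "ok"}],
    },
    "data": {               # data provenance: data flows, file locations, inputs/outputs
        "inputs": ["/archive/run42/input.bp"],
        "outputs": ["/portal/run42/slice_0200.h5"],
    },
    "workflow": {           # workflow provenance: structure, form, and evolution
        "moml_version": "fusion-monitor-v3.xml",
    },
    "system": {             # system/environment provenance: OS, compiler versions, etc.
        "host": "ewok.ccs.ornl.gov",
        "os": "Linux 2.6",
    },
}
```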
131 Other Data Views and Concepts
- Raw data
- Application/simulation monitoring (input, output, configuration, intermediate states, ...)
- Data history and location
- Machine monitoring
- Shelf-life of data
- Auditability
- Error and execution logs
- Analytics and data information summation (visual, formulas, smoothing, etc.)
- ...
132 Framework
(Diagram components: Supercomputer, Analytics, Storage, Kepler (orchestration), Dashboard; meta-data about processes, data, workflows, and the system environment)
133 A Hierarchical View of the Architecture
- Control Plane (light data flows)
- Provenance, Tracking and Meta-Data (DBs and Portals)
- Execution Plane (heavy-lifting computations and flows)
- Synchronous or asynchronous?
134 Implementation
- Kepler + Linux + Apache + MySQL + PHP (K-LAMP)
- Windows-based solutions
- Communications: sockets, xmlrpc, http, files, NFS; synchronous, asynchronous, etc.
- Single node vs. distributed solutions
- Service-based solutions
- Which information?
135Data Model
136Kepler Provenance Framework
137A Key