Title: Scientific Workflows
1 Introduction to Scientific Workflow Management and the Kepler System
Ilkay Altintas (1), Roselyne Barreto (4), Paul Breimyer (5), Terence Critchlow (2), Daniel Crawl (1), Ayla Khan (3), David Koop (3), Scott Klasky (4), Jeff Ligon (5), Bertram Ludaescher (6), Pierre Mouallem (5), Meiyappan Nagappan (5), Steve Parker (3), Norbert Podhorszki (6), Claudio Silva (3), Mladen Vouk (5)
1. San Diego Supercomputer Center  2. Pacific Northwest National Laboratory  3. University of Utah  4. Oak Ridge National Laboratory  5. North Carolina State University  6. University of California, Davis
2Tutorial Overview
- 9:00-9:30am Introduction to Scientific Workflows (Bertram)
- 9:30-10:00am Workflow Demos (Daniel)
- 10:00-10:30am BREAK
- 10:30-11:00am Kepler Basics (Bertram)
- 11:00am-12:30pm (install Kepler) Hands-On 1 (Daniel)
- 12:30-1:30pm LUNCH
- 1:30-2:00pm Advanced Features (Bertram)
- 2:00-3:00pm Hands-On 2 (Daniel)
- 3:00-3:30pm BREAK
- 3:30-4:10pm Workflow Provenance (Bertram)
- 4:10-4:40pm Provenance Demo (Daniel)
- 4:40-5:00pm Q&A and Open Discussion
3Introduction to Scientific Workflows
- Motivating Examples
- Ecological Niche Modeling
- Processing Sensor Data Streams
- Ecology, Oceanography use cases
- Fusion Simulation
- Requirements Features
4Scientific Workflows Cyberinfrastructure
UPPER-WARE
5 Scientific Workflow
- Capture how a scientist works with data and analytical tools
  - data access, transformation, analysis, visualization
  - possible worldview: dataflow-oriented (cf. signal processing)
- Scientific workflow (wf) benefits (vs. script-based approaches)
  - wf automation
  - wf component reuse
  - wf design, documentation
  - wf archival, sharing
  - built-in concurrency (task-, pipeline-parallelism)
  - built-in provenance support
  - distributed, parallel execution
  - Grid, cluster support
6Kepler science domains
- Ecology
- SEEK: Ecological Niche Modeling and climate change
- REAP: Modeling parasite invasions in grasslands using sensor networks
- NEON: Ecological sensor networks
- COMET: environmental science
- Geosciences
- GEON LiDAR data processing
- GEON Geological data integration
- Molecular biology
- SDM Gene promoter identification
- ChIP-chip genome-scale research
- CAMERA metagenomics
- Physics
- CPES Plasma fusion simulation
- FermiLab particle physics
- Oceanography
- REAP SST data processing
- LOOKING ocean observing CI
- NORIA ocean observing CI
- Phylogenetics
- ATOL/pPOD Processing Phylodata
- CIPRES phylogenetic tools
- Chemistry
- Resurgence Computational chemistry
- DART (X-Ray crystallography)
- Library science
- DIGARCH Digital preservation
- Cheshire digital library archival
- Conservation biology
- SanParks Thresholds of Potential Concerns
Slide Matt Jones
7Simple Kepler workflow using R (a statistics
package)
8Ecological Niche Modeling
Temperature layer
Many other layers
Slide from D. Pennington
9 Managing Complexity
- Scientific workflows use hierarchy to manage complexity
- Top-level workflows can be a conceptual representation of the science process that is easy to comprehend at a glance
- Drilling down into sub-workflows reveals increasing levels of detail
- Composing models using hierarchy promotes the development of re-usable components that can be shared with other scientists
10Partial ENM Workflow
Slide Matt Jones
11 Workflow features required by the ENM use case
- Design phase
  - Access to distributed data: specimens and climate
  - Streamline, automate labor-intensive data preparation
  - Workflow GUI environment
    - communication about complex models
    - experimentation and rapid modification of models
    - re-usable, sharable components
  - Software environment
    - Multi-platform (Mac/Windows/Linux), open, extensible
Slide Matt Jones
12 Workflow features required by the ENM use case
- Execution phase
  - Execution using multiple analytical environments
    - Java, C, R, Matlab, GDAL, web services, ...
  - Integration of multiple computing environments into a single environment (glue-ware)
  - High-throughput distributed execution
    - Iterate across many species with many model runs
    - Assume no prior knowledge of distributed computing technologies
    - Thread-safe components, no back-channel communication
  - Archiving products in community repositories
    - Provenance and metadata for derived products
Slide Matt Jones
13 NSF/CEOP REAP (Real-time Analysis Pipelines): Ecology, Oceanography case studies
- Terrestrial Ecology
  - Predictive Modeling to Examine the Role of an Insect-Vectored Pathogen in Exotic Plant Invasion
  - temperature, precipitation, light interception @ 7 core research areas
  - integrate Metacat-archived data with these sensors in analyses implemented in Kepler
- Oceanography
  - Integrated Framework for Hybrid, Adaptive Ocean Modeling
  - Sea Surface Temperature (SST) fields from OPeNDAP servers
  - Kepler workflows to quantitatively evaluate SST data sets
Slide Matt Jones
14 REAP Project Goals
- For scientists
  - capabilities for designing and executing complex analytical models over near real-time and archived data sources
- For data-grid engineers
  - monitoring and management capabilities for the underlying sensor networks
- For outside users
  - access to observatory data and results of models, approachable to non-scientists
15 Key (sensor network deployment diagram)
(Diagram components: Internet; radio with antenna; RBNB; data logger; Internet Point of Presence (IPP); sensor; battery; relay station about 1 km away; OPeNDAP, EcoGrid, Metacat; linear light probes (A, B); reflectometers (A, B); rain gage; anemometer; RH/temp probe; quantum point light sensor; CR800 datalogger; public website; vegetation plots)
Slide Matt Jones
16Ring Buffered Network Bus (RBNB) streaming data
component
- Scientists can discover data streams
- accessing streams requires little IT knowledge
- Can easily assimilate streams
- into existing or new workflow models
Slide Matt Jones
17 Modeling Disease Effects on Competition
- Discrete Time Model
  - Survival between seasons
  - Reset of system, loss of disease
- Continuous Time Model
  - Growing (Winter Rainy) Season
  - Ongoing infection processes (SI model)
  - Competition (Lotka-Volterra)
- Integro-Difference Equations
  - Parameterized with data from field experiments
- Can utilize coupled models (aka hybrid models)
  - Continuous time model that is coupled to a discrete time model
  - Each model developed independently, joined via the workflow engine
Slide Matt Jones
18 Models of Computation (Directors) in Kepler
- Continuous time
  - Lotka-Volterra predator-prey dynamics (written out after this slide)
  - Synchronize on a global clock
- Synchronous Data Flow
  - Sensor data access, analysis
  - Static dependency analysis, fixed data flow rate
The Director controls the Model of Computation (MoC)
Slide Matt Jones
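For reference, the continuous-time predator-prey dynamics referred to above are conventionally written as the Lotka-Volterra equations (x = prey density, y = predator density; the Greek rate constants are the standard textbook names, not taken from the slides):

  \frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y

A continuous-time director advances such coupled ODEs numerically on a global clock, which is why the CT model of computation fits this example.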
19 Requirements of the REAP use cases
- All features from the ENM use case, plus
- Design phase
  - Access to sensor data streams via a catalog
  - RBNB and Antelope support
  - Bi-directional communication, monitor and control sensors
- Execution phase
  - Support hybrid models
    - Population and community dynamics mix discrete and continuous time models
- Provenance
  - archive modeling scenarios
  - support exploratory modeling
Slide Matt Jones
20 Discovery and Streaming Workflows
- Typical analytical models are complex and difficult to comprehend and maintain
- The use cases described here are only two of many overlapping cases
- Scientific workflows provide
  - An intuitive visual model
  - Structure and efficiency (user time) in modeling and analysis
  - Abstractions to help deal with complexity
  - Direct access to data
  - Means to publish and share models
Slide Matt Jones
21 Plumbing Workflows: Fusion Simulation (SDM CPES)
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command & Control site
22 Plumbing Workflows: Archive Migration
Stage data files from NERSC HPSS to local disk, transfer to ORNL disk, store at ORNL HPSS.
Moved 10TB of data from the NERSC archive to the ORNL archive in 11 days (network issues, bugs, and more).
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
23 Plumbing workflow
- to accomplish all these tasks
- 50 composite actors (subworkflows)
- 4 levels of hierarchy
- 1000 atomic (Java) actors
Norbert Podhorszki UC Davis, soon ORNL
24 Summary: a broad range of workflow types
- Desktop / discovery workflows
  - analysis/method-intensive: R, Matlab, custom algorithms
  - e.g. bioinformatics, ecoinformatics, genomics, phylogenetics
  - exploratory workflows, rapidly evolving
  - need data and workflow provenance
- Streaming workflows
  - (near) real-time processing and data analysis
  - distributed setting
- Plumbing workflows
  - data-intensive, e.g. moving TBs from ORNL (compute) to LBL/NERSC (archive)
  - production workflows: reliable, fault-tolerant, high-throughput, runtime monitoring
- HPC workflows
  - cpu-intensive, need to utilize a local cluster or distributed Grid, e.g. Ecological Niche Modeling, parameter studies
  - parallel/distributed workflows
25Workflow Demos
26Bioinformatics Web Service
- Retrieve a genetic sequence from the DNA Data Bank of Japan (DDBJ).
- Data transformations via XSLT and XPath.
27Bioinformatics Web Service Access
28REAP Data Streaming
30Transfer-Convert-Archive-Image-Workflow
31Basic Kepler Features
32Kepler is a Scientific Workflow System
http://www.kepler-project.org
- Kepler is a cross-project collaboration
- Latest release available from the website
- Builds upon the open-source Ptolemy II framework
33Kepler Communities Collaboration
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from
- SEEK
- SciDAC SDM
- Ptolemy
- GEON
- ROADNet
- Resurgence
- AToL CIPRES, POD
- ...
- Goals
- Create powerful analytical tools that are useful across disciplines
- Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, ...
Ptolemy II
34Vergil is the GUI for Kepler
but Kepler can also run in batch mode as a
command-line engine.
(Screenshot callouts: data search; actor search; actor ontology (semantic search); drag & drop, link via ports; metadata-based search for datasets)
35Actor-Oriented Modeling Design
- Actor
- single component or task
- well-defined interface (signature)
- given input data, produces output data
36Actor-Oriented Modeling Ports
- Ports
- each actor has a set of input and output ports
- denote the actor's signature
- produce/consume data (a.k.a. tokens)
- Parameters
- (visible after double-click) can be seen as
special static ports
37Actor-Oriented Modeling Connections / Channels
- Dataflow Connections
- actor communication channels
- directed (hyper) edges
- connect output ports with input ports
- can fork (cloning tokens) at relation nodes
(little diamonds)
38Actor-Oriented Modeling Subworkflows
- Sub-workflows / Composite Actors
- composite actors wrap sub-workflows
- like actors, have signatures (i/o ports of the sub-workflow)
- hierarchical workflows (arbitrary nesting levels)
39Actor-Oriented Modeling Directors
- Directors
- define the Model of Computation (MoC) of workflow graphs
- execute the workflow graph (according to some schedule)
- sub-workflows may have different directors
- facilitates actor and (sub-)workflow reusability
40 Models of Computation
- Directors separate the concerns of workflow orchestration from actor execution
- Synchronous Dataflow (SDF)
  - Connections have queues for sending/receiving fixed numbers of tokens at each firing. The schedule is statically predetermined. SDF models are highly analyzable and often used in SWFs.
  - Downside: need to know token consumption/production rates ahead of time
- Process Networks (PN)
  - Generalizes SDF. Each actor executes as a separate thread/process, with queues of (in principle) unbounded size. Closely related to Kahn/MacQueen semantics.
- Continuous Time (CT)
  - Connections represent the value of a continuous-time signal at some point in time. Often used to model physical processes.
- Discrete Event (DE)
  - Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
41Searching Components (Actors)
- Kepler Actor Ontology (tag hierarchy)
  - Used in searching actors and creating conceptual views (virtual folders)
  - currently > 370 actors
42 Searching and Binding Data
- Kepler DataGrid
  - Discovery of data resources through local and remote services
    - SRB,
    - Grid and Web Services,
    - DB connections
  - Registration of datasets on the fly using workflows
43Hands-On Exercises 1
44Opening and Running a Workflow
- Start Kepler
- Open HelloWorld.xml under the demos/sc07 directory in your local Kepler folder
- Two options to run a workflow
  - the PLAY BUTTON in the toolbar
  - the RUNTIME WINDOW from the Run menu
45 Modifying an Existing Workflow and Saving It
- GOAL
  - Modify the HelloWorld workflow to display a parameter-based message
- Step-by-step instructions
  - Open the HelloWorld workflow as before
  - From the actor search tab, search for Parameter
  - Drag and drop the parameter onto the workflow canvas on the right
  - Double click the parameter and type your name
  - Right click the parameter, select "Customize Name", and type in "name"
  - Double click the Constant actor and type the following: "Hello " + name
  - Save
  - Run the workflow
46 Creating a HelloWorld! Workflow
- Open a new blank workflow canvas
  - From the toolbar: File > New Workflow > Blank
- In the Components tab, search for Constant and select the Constant actor
- Drag the Constant actor onto the workflow canvas
- Configure the Constant actor
  - Right-click the actor and select Configure Actor from the menu
  - Or, double click the actor
  - Type "Hello World" in the value field and click Commit
- In the Components and Data Access area, search for Display and select the Display actor found under Textual Output
- Drag the Display actor onto the workflow canvas
- Connect the output port of the Constant actor to the input port of the Display actor
- In the Components and Data Access area, select the Components tab, then navigate to the /Components/Director/ directory
- Drag the SDF Director to the top of the workflow canvas
- Run the model
48 Using Various Displays
- GOAL: Use different graphical output actors.
- Step-by-step instructions
  - Open "03-ImageDisplay.xml" under the demos/getting-started directory in your local Kepler folder.
  - Run the workflow.
  - Search for "browser" in the Components tab.
  - Drag and drop "Browser Display" onto the canvas.
  - Replace "ImageJ" with "Browser Display" (connect the Image Converter output to the "Browser Display" inputURL port).
  - Run the workflow again.
  - Replace "Browser Display" with a textual "Display" actor.
  - Run the workflow.
49Advanced Kepler Features
50 Process Networks
- The partial (or total linear) order implied by a DAG gives us a schedule for workflows for one-time tasks (jobs)
- What about pipelined workflows on token streams?
- Communicating processes with directed token flow
- Dataflow Process Networks
  - communication: a token stream between two processes
  - process: operations on tokens
  - host language: process description
  - coordination language: network description
(Figure: two processes connected by a channel carrying a token stream)
51 Kahn Process Networks (1974)
- special class of process networks
- stream is a FIFO with unbounded capacity
- process
  - destructive read (consumption) at process start,
  - non-destructive write (production) at process end,
  - blocking read: a process only executes if data is available,
  - non-blocking write
EXAMPLE
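To make the blocking-read / non-blocking-write semantics concrete, here is a minimal stand-alone sketch in Python (illustrative only; this is not how Kepler's PN director is implemented): each process runs as a thread, channels are unbounded FIFO queues, reads block until a token arrives, and writes always succeed.

```python
import threading, queue

def source(out, n=5):
    # non-blocking writes onto an unbounded FIFO channel
    for i in range(n):
        out.put(i)
    out.put(None)                     # end-of-stream marker (our own convention)

def scale(inp, out, factor=2):
    while True:
        token = inp.get()             # blocking, destructive read
        if token is None:
            out.put(None)
            break
        out.put(token * factor)       # non-blocking write

def sink(inp):
    while True:
        token = inp.get()
        if token is None:
            break
        print("received", token)

# channels are unbounded FIFO queues
a, b = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=source, args=(a,)),
           threading.Thread(target=scale, args=(a, b)),
           threading.Thread(target=sink, args=(b,))]
for t in threads: t.start()
for t in threads: t.join()
```

Because writes never block, the only way a process stalls is by waiting for input, which is exactly the Kahn condition that makes the network's output independent of scheduling.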
52 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
53 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
54 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
55 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
56 Problems with Process Networks
- How to run/schedule a process network without accumulating arbitrarily many tokens?
  - Difficult to schedule because of the need to balance relative process rates
  - The system inherently gives the scheduler few hints about appropriate rates
- Tom Parks' algorithm
  - runs in bounded memory whenever possible
  - (the bounded-memory condition is undecidable)
- Synchronous Dataflow (SDF)
  - Edward Lee and David Messerschmitt, Berkeley, 1987
  - Restricts Kahn Process Networks to allow compile-time scheduling
  - Basic idea: each process reads and writes a fixed number of tokens each time it fires. Example:
    - Loop forever
      - read 2 tokens from A, 3 tokens from B
      - compute
      - write 1 token to C, write 2 tokens to D
57 Synchronous Dataflow (SDF): Fixed Production/Consumption Rates
- Balance equations (one for each channel)
- Schedulable statically
- Decidable:
  - buffer memory requirements
  - deadlock
(Figure: actor A fires and produces N tokens on a channel; actor B fires and consumes M tokens; the balance equation relates the number of tokens produced and consumed to the number of firings per iteration)
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
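In standard SDF notation (not verbatim from the slide), the balance equation for the pictured channel is

  q_A \cdot N = q_B \cdot M

where q_A and q_B are the numbers of firings of A and B per iteration. Writing one such equation per channel and solving for the smallest positive integer solution gives the repetitions vector, from which the static schedule, buffer sizes, and deadlock freedom can be determined at compile time.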
58 Parallel Scheduling of SDF Models
Many scheduling optimization problems can be formulated. Some can be solved, too!
SDF is suitable for automated mapping onto parallel processors and synthesis of parallel circuits.
(Figure: a four-actor SDF graph (A, B, C, D) shown with a sequential schedule and a parallel schedule)
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
59 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
60 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
61 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
62 Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
63 Selected Generalizations
- Multidimensional Synchronous Dataflow (1993)
  - Arcs carry multidimensional streams
  - One balance equation per dimension per arc
- Cyclo-Static Dataflow (Lauwereins, et al., 1994)
  - Periodically varying production/consumption rates
- Boolean and Integer Dataflow (1993/4)
  - Balance equations are solved symbolically
  - Permits data-dependent routing of tokens
  - Heuristic-based scheduling (undecidable)
- Dynamic Dataflow (1981-)
  - Firings scheduled at run time
  - Challenge: maintain bounded memory, deadlock freedom, liveness
  - Demand-driven, data-driven, and fair policies all fail
- Kahn Process Networks (1974-)
  - Replace discrete firings with process suspension
  - Challenge: maintain bounded memory, deadlock freedom, liveness
- Heterochronous Dataflow (1997)
  - Combines state machines with SDF graphs
  - Very expressive, yet decidable
Source: Edward Lee, http://ptolemy.eecs.berkeley.edu/
64(Internal) Workflow Format MoML
65 Sharing Kepler Workflows -- Use Cases
- UC-1) Facilitate transport of workflows to grid/distributed/server/P2P systems
- UC-2) Preserve an analysis to allow replication
- UC-3) Allow the development and distribution of components (actors/directors) which can be released on a schedule independent of Kepler itself
66Kepler Archive File (KAR)
67 KAR File Functional Requirements
- FR-1) Mechanism to package the resources required to implement a component in a Kepler system
  - FR-1a) must be able to contain Java class files
  - FR-1b) must be able to contain native binary executable files
  - FR-1c) must be able to contain native library files
  - FR-1d) must be able to contain MoML and other XML-based text
  - FR-1e) must be able to contain data in binary and ASCII formats, including zipped data
- FR-2) Must describe the contained components so they can be utilized in a Kepler system
  - FR-2a) each component must have a unique LSID identifier which is tied to the specific implementation of the component
  - FR-2b) must contain an OWL document with semantic ordering for the contained objects
68 The Need for Plumbing Workflows: Tales from the Life of a Simulation Scientist
69 A few days in the life of Sim Scientist: Day 1, morning
- 8:00AM: Get coffee, check to see if jobs are running.
  - ssh into jaguar.ccs.ornl.gov (job 1)
  - ssh into seaborg.nersc.gov (job 2) (this one is running, yea!)
  - Run gnuplot to see if the run is going OK on seaborg. This looks OK.
- 9:00AM: Look at data from an old run for post processing.
  - Legacy code (IDL, Matlab) to analyze most data.
  - Visualize some of the data to see if there is anything interesting.
  - Is my job running on jaguar? I submitted this 4K-processor job 2 days ago!
- 10:00AM: scp some files from seaborg to my local cluster.
  - Luckily I only have 10 files (which are only 1 GB/file).
- 10:30AM: The first file appears on my local machine for analysis.
  - Visualize data with Matlab. Seems to be OK.
- 11:30AM: See that the second file had trouble coming over.
  - scp the files over again. Dohhh!
Slide Scott Klasky
70 A few days in the life of Sim Scientist: Day 1, evening
- 1:00PM: Look at the output from the second file.
  - Oops, I had a mistake in my input parameters.
  - ssh into seaborg, kill the job. Emacs the input, submit the job.
  - ssh into jaguar, check status. Cool, it's running.
  - bbcp 2 files over to my local machine (8 GB/file).
  - gnuplot the data. This looks OK too, but I still need to see more information.
- 1:30PM: Files are on my cluster.
  - Run Matlab on the HDF5 output files. Looks good.
  - Write down some information in my notebook about the run.
  - Visualize some of the data. All looks good.
  - Go to meetings.
- 4:00PM: Return from meetings.
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. My job still isn't running.
- 8:00PM: Are my jobs running?
  - ssh into jaguar. Run gnuplot. Still looks good.
  - ssh into seaborg. Cool. My job is running. Run gnuplot. Looks good this time!
Slide Scott Klasky
71 And later...
- 4:00AM: (yawn) Is my job on jaguar done?
  - ssh into jaguar. Cool. Job is finished.
  - Start bbcp'ing files over to my work machine (2 TB of data).
- 8:00AM: bbcp is having troubles.
  - Resubmit some of my bbcp transfers from jaguar to my local cluster.
- 8:00AM (next day):
  - Still need to get the rest of my 200GB of data over to my machine.
- 3:00PM: My data is finally here!
  - Run Matlab. Run Ensight. Oops. Something's wrong!!! Where did that instability come from?
- 6:00PM: Finish screaming!
Slide Scott Klasky
72 And 2 years from now...
- Simulations/computers are getting larger and more expensive to operate.
  - In Fusion, large runs will be using > 50K cores for 100 wallclock hours, to understand turbulent transport in ITER-size reactors.
  - The cost of a simulation approaches $0.6M (power, cooling, system cost averaged over 5 years).
- Data sizes are getting larger.
  - Large simulations produce 2 TB/simulation today, 100 TB/simulation (per week) in the future.
- Demand for real-time monitoring/analysis of simulations.
- Demand for fast, reliable data movement to local machines for post processing.
- Demand to keep data provenance at 1 location.
Slide Scott Klasky
73 Workflows to the rescue!
- In our demo section (SC07 tutorial only) you will see us automate this process.
  - Job submission starts services on the ORNL IB cluster (ewok).
  - Files are automatically moved from the Cray XT3 to the ORNL IB cluster.
  - Files are converted from binary to HDF5 files.
  - Files accumulate until they are > 6GB; then they are tarred (see the sketch after this slide).
  - hsi commands place the tar files into HPSS (an XML file describes which files are in which tar files).
  - Each HDF5 file is read into a SCIRun service which creates a jpeg.
  - Jpeg files are assembled into an mpeg file via an mpeg service.
  - Jpeg and mpeg files are moved to the web portal.
  - HDF5 files are archived to PPPL.
- And of course we will keep track of the provenance of the workflow in a database!
- And we can monitor this on our dashboard.
Slide Scott Klasky
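A rough sketch of the accumulate-until-6GB, tar, and HPSS-archive step in Python (the staging directory, threshold handling, and hsi invocation here are illustrative assumptions, not the actual CPES workflow code):

```python
import os
import subprocess
import tarfile

STAGE_DIR = "/tmp/stage"             # hypothetical staging directory on the analysis cluster
THRESHOLD = 6 * 1024**3              # accumulate roughly 6 GB before tarring

def archive_batch(files, batch_id):
    """Tar one batch of converted HDF5 files and push the tarball into HPSS via hsi."""
    tar_path = os.path.join(STAGE_DIR, f"batch_{batch_id:04d}.tar")
    with tarfile.open(tar_path, "w") as tar:
        for f in files:
            tar.add(f, arcname=os.path.basename(f))
    # 'hsi put <file>' stores the file in HPSS; exact options are site-specific
    subprocess.run(["hsi", "put", tar_path], check=True)

def accumulate_and_archive(hdf5_files):
    batch, size, batch_id = [], 0, 0
    for f in sorted(hdf5_files):
        batch.append(f)
        size += os.path.getsize(f)
        if size > THRESHOLD:         # enough data accumulated: tar and archive this batch
            archive_batch(batch, batch_id)
            batch, size, batch_id = [], 0, batch_id + 1
    if batch:                        # archive whatever is left over
        archive_batch(batch, batch_id)
```

In the Kepler workflow these steps are separate actors running as a pipeline; the sketch only shows the logic of the accumulate-and-archive stage.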
74 Why do Pflop computing scientists care?
- Typical situation for Sim Scientist:
  - We run on 1-60K processors, producing lots of data.
- Typical method of work:
  - Prepare input data for smaller simulations.
  - Iterate until we come up with the correct parameters for the large run.
  - Run the large simulation at only a handful of locations (usually < 4).
  - Must archive results. Must be of the correct size for archives on HPSS.
  - Must move some data over to our local clusters for analysis after the simulation.
  - Did we make a mistake with the input parameters? Is something going wrong? Fix the code/input, start the run over again.
  - Wow, I just wasted 100K CPU hours because I missed a sign. Duhh.
  - Where are all of my files? I want to look at the temperature in the 200th time slice; where is it on HPSS?
Slide Scott Klasky
75 Post Processing Workflow: A day in the life of Sim Scientist
- 9:00AM: Get Diet Coke, decide which runs/experimental data to analyze.
- 9:30AM: Start to download files from HPSS at NERSC and ORNL.
- 10:00AM: Move files from NERSC/ORNL to the local desktop machine. Smallish data (10GB/location).
- 11:00AM: Start IDL, and compute various post processing quantities.
- 11:30AM: Look at the data from the simulations, and grab data from a database which has experimental data.
- 1:00PM: Move some more data from ORNL to the local desktop to compare to more experimental data.
  - Save the plot from Matlab to Postscript to include in a paper.
  - Write down results into the notebook, copy the figure into the notebook.
- 2:00PM: Think about results, and decide on new analysis routines to write in the future.
- 4:00PM: Start moving more data from NERSC to the local desktop.
Slide Scott Klasky
76 What's changing in his life?
- Collaboration.
- More clusters, more simulations.
- Just analyze the data where we run. Don't move the data.
- But...
  - What if the network goes down (I'm on a plane, ...)?
  - What if the resource is not available for my late-breaking analysis before the BIG conference?
- OK, but what about the large data?
  - OK. Large data will be server-side analysis. Not DESKTOP. But we can run the workflow on a server.
- Data from multiple resources
  - V&V data from multiple simulations/experiments.
- But can't we just run VisIt/SCIRun?
  - Yes. But if you need to orchestrate the data movement from different sources, track the provenance, and perhaps use multiple analysis/visualization packages, then a workflow system can help.
Slide Scott Klasky
77 How do we help this scientist?
- The workflow is the glue for the scientists.
- The scientist hooks up all of the analysis routines.
- The director makes sure that the data movement occurs, and is reliable and secure.
- All of the tedious portions (ssh in, start this program, ...) are removed by the workflow automation.
- The workflow will be able to keep the provenance information, which allows the user to understand how they processed the dataset.
  - This enables the scientist to compare new data with old data.
Slide Scott Klasky
78 So what are the requirements?
- Must be EASY to use.
  - If you need a manual, then FORGET IT!
  - Good user support, and long-term DOE support.
- The workflow system should work for all of my workflows.
  - NOT just for the Petascale computers.
  - And on multiple platforms!
- Must be easy to incorporate my own services into the workflow.
- Must be customizable by the users.
  - Users need to easily change the workflow to work with the way users work.
- Long-term requirements (NOT being worked on yet):
  - Autonomics / user adaptivity.
  - Faster data movement in the workflow. A high-quality front-end for end-user interaction.
  - You tell us!
Slide Scott Klasky
79 SWF Systems Requirements
- Design tools, especially for non-expert users
- Ease of use: a fairly simple user interface, with more complex features hidden in the background
- Reusable generic features
  - Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences); customizable
- Extensibility for the expert user
- Registration, publication and provenance of data products and process products (workflows)
- Dynamic plug-in of data and processes from registries/repositories
- Distributed WF execution (e.g. Web and Grid awareness)
- Semantics awareness
- WF deployment
  - as a web site, as a web service, power apps, ...
Slide Scott Klasky
80 The Big Picture: Supporting the Scientist
From napkin drawings to executable workflows
(Figure: a conceptual SWF is refined into an executable SWF)
Here: John Blondin, NC State; Astrophysics Terascale Supernova Initiative, SciDAC, DOE
Slide M. Vouk
81 CPES Fusion Simulation Workflow
- Fusion simulation codes: (a) GTC, (b) XGC with M3D
  - e.g. (a) currently 4,800 (soon 9,600) nodes on a Cray XT3, 9.6TB RAM, 1.5TB simulation data/run
- GOAL
  - automate remote simulation job submission
  - continuous file movement to a secondary analysis cluster for dynamic visualization and simulation control
  - with runtime-configurable observables
(Workflow steps shown: Select JobMgr, Submit Simulation Job, Submit FileMover Job, Execution Log (=> Data Provenance))
Overall architect (and prototypical user): Scott Klasky (ORNL); WF design and implementation: Norbert Podhorszki (UC Davis)
82 CPES Analysis Workflow
- Concurrent analysis pipeline (@ Analysis Cluster)
  - convert, analyze, copy to Web portal
  - easy configuration, re-purposing
(Workflow features highlighted: reusable actor class, specialized actor instances, pipelined execution model, inline documentation, inline display, easy-to-edit parameter settings)
Overall architect (and prototypical user): Scott Klasky (ORNL); WF design and implementation: Norbert Podhorszki (UC Davis)
83 Dashboard integration with Kepler
- The dashboard presents information created from the workflow.
- We have been developing a dashboard for Kepler workflows, using:
  - AJAX
  - Flash
  - PHP
  - MySQL
Slide SDM/SPA, Klasky,Vouk, et al
84 Machine Monitoring
- DOE machine monitoring
  - Which machines are up?
  - Which machines have long queues, which are idle?
  - Where can I run my job?
  - Where am I running jobs?
  - Where are my running jobs, and can I look at my old runs?
  - Can I monitor a new job, and compare it to an old job?
Slide SDM/SPA, Klasky,Vouk, et al
85 Dashboards for Simulation Monitoring
- Back end: shell scripts, Python scripts and PHP
  - Machine queue commands
  - Users' personal information
  - Services to display and manipulate data before display
- Dynamic front end
  - Machine monitoring: standard web technology (Ajax)
  - Simulation monitoring: Flash
- Storage: MySQL (queue info, min-max data, user notes)
Slide SDM/SPA, Klasky,Vouk, et al
86 Scientific Workflow Systems
- Combination of
  - data management, integration, analysis, and visualization steps
  - a larger, automated "scientific process"
- Mission of scientific workflow systems
  - Promote scientific discovery by providing tools and methods to generate scientific workflows
  - Provide an extensible and customizable graphical user interface for scientists from different scientific domains
  - Support workflow design, execution, sharing, reuse and provenance
  - Design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources
  - Make the technology useful through the user's computer!!!
87 Two typical types of Workflows for SC
- Real-time Monitoring (Server-Side Workflows)
  - Job submission
  - File movement
  - Launch Analysis Services
  - Launch Visualization Services
  - Launch Automatic Archiving
- Post Processing (Desktop Workflows)
  - Read in Files from different locations
  - File movement
  - Launch Analysis Services
  - Launch Visualization Services
  - Connect to Databases
- Obviously there are other types of workflows
  - What is your type of workflow?
88Plumbing Workflow using Kepler
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command & Control site
89 Plumbing Workflow for Archive Migration
Stage from NERSC HPSS to local disk, transfer to ORNL disk, store at ORNL HPSS.
Moved 10TB of data from the NERSC archive to the ORNL archive in 11 days (network issues, bugs, and more).
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
90Pipeline and parallel processing
Norbert Podhorszki (UC Davis)
91 Plumbing workflow
- to accomplish all these tasks
- 50 composite actors (subworkflows)
- 4 levels of hierarchy
- 1000 atomic (Java) actors
Norbert Podhorszki UC Davis, soon ORNL
92 Distributed Execution: Many ways to skin a cat
- Do it all in Kepler (white-box)
  - Single machine: single-threaded and/or multi-threaded
  - Multiple nodes (cluster)
  - Distributed Kepler, Kepler/HPC
- Medium/tightly coupled (grey-box)
  - use remote commands
  - and their exit status
- Loosely coupled (black-box -- Norbert's workflows)
  - Launch remote scripts
  - Inquire about their status, e.g. via ls -1
  - Minimalist approach
  - works even with tough ORNL constraints!
93 Authoring Distributed Workflows
(Figure: a normal workflow vs. a distributed workflow)
- Place the wf in a DistributedCompositeActor (DCA).
- At runtime, the contents of the DCA are packaged up and shipped to the remote nodes.
- The workflow is executed and the output is returned to the master Kepler node to be viewed/further processed.
Slide from C. Berkley
94Node Discovery and Remote Management
Slide from C. Berkley
95 Efficient Data Transfer
- Large datasets need special handling
- Inefficient data transfer could wipe out the time savings of distributed computation
(Figure: a master with slaves where Slave1 depends on Slave0 and Slave2 depends on Slave1; one arrangement moves the large dataset through the master between every step (inefficient, 6 possible transfers), while the other passes the large dataset directly along the slave chain and returns only results to the master (more efficient, 4 possible transfers))
Slide from C. Berkley
96 A Hierarchical View of the Architecture
- Control Plane (light data flows)
- Provenance, Tracking and Meta-Data (DBs and Portals)
- Execution Plane (heavy-lifting computations and flows)
- Synchronous or asynchronous?
97 Scientific Workflow Automation (e.g., Astrophysics)
In conjunction with John Blondin, NC State University: automate data acquisition, transfer and visualization of a large-scale simulation at ORNL.
(Diagram components: highly parallel compute resource running VH1; output of 500x500 files; aggregate to 500 files (< 50GB each); input data; Logistic Network L-Bone or bbcp; Depot; local mass storage (14TB); HPSS archive; local 44-processor data cluster where data sits on local nodes for weeks; provenance; web; viz software; viz wall; viz client)
98 Scientific Workflow Modeling and Design
And that's why our scientific workflows are much easier to develop, understand, reuse and maintain!
99Behold the Beauty of Scientific Workflow Design
Author Kristian Stevens, UC Davis
100 Shimology Part 2 the ugly truth inside
Author Kristian Stevens, UC Davis
101But how do we get from messy to neat reusable
designs?
102The Problem Evolving Workflows
Daniel Zinn (UC Davis)
103What we want Simple Analysis Pipelines
Author Tim McPhillips, UC Davis
104 The Answer (YMMV)
- Collection-Oriented Modeling and Design (COMAD)
  - embrace the assembly line metaphor fully
    - Virtual Assembly Lines (VALs)
    - cf. Flow-based Programming (J. Morrison)
  - data: tagged nested collections
  - pipelined (XML) token streams
  - passing the buck on what's not in your scope
Timothy McPhillips (UC Davis)
105 Conventional vs. Assembly Line Delta-XML Thinking
Daniel Zinn (UC Davis)
106More secret sauce User vs. Optimized Dataflow
Daniel Zinn (UC Davis)
107What we got Simple Change-Resilient Pipelines
Author Tim McPhillips, UC Davis
108 Result: Change-Resilience (WF graph)
(Figure: an original workflow graph with nodes A, B, C and scopes S, R, W; after an actor X is inserted, an automatic configuration step infers the configuration of X)
Daniel Zinn (UC Davis)
109Related Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)
110 Scientific Workflow Modeling and Design Paradigms
- Vanilla Process Network
- Functional Programming Dataflow Network
- XML Transformation Network
- Collection-oriented Modeling and Design framework (COMAD)
  - Look Ma, No Shims!
- also: running DAGs, Petri Nets, easyBPEL, ...
111Hands-On Exercises 2
112 Using R in Kepler
- GOAL: Use the R actor to generate a histogram plot.
- Step-by-step instructions
  - In the demos/getting-started directory, open 05-LinearRegression.xml.
  - Run the workflow to view the linear regression.
  - Add another RExpression actor to the canvas.
  - Double-click on the new R actor and enter the following for the R function or script:
    - Mean <- mean(Values)
    - hist(Values)
  - Right-click on the new R actor and select Configure Ports:
    - Add an input called Values
    - Add an output called Mean
  - Control-click on the canvas to create a new Relation diamond.
  - Connect the T_AIR port from Datos to the diamond.
  - Connect the T_AIR port from R_linear_regression to the diamond.
  - Connect the Values port from the new R actor to the diamond.
  - Place a second ImageJ actor on the canvas (you can copy and paste the existing one).
  - Connect the R actor's graphicsFileName port to the second ImageJ actor's input.
  - Run the workflow.
113 Creating Web Service Workflows
- GOAL: Execute a Web Service using the generic Web Service client.
- Step-by-step instructions
  - In the Components and Data Access area, select the Components tab.
  - Search for "web service".
  - Drag "Web Service Actor" onto the canvas.
  - Double click the actor, enter http://xml.nig.ac.jp/wsdl/DDBJ.wsdl, commit.
  - Double click the actor again, select "getXMLEntry" as the method name, commit.
  - Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
  - Double click the "String Constant", set AA045112 as its value, commit.
  - Connect the "String Constant" output with the "Web Service Actor" input.
  - Add a "Display" and connect its input with the "Web Service Actor" "Result" output.
  - Add the SDF director.
  - Run the workflow.
114 SSH Actor and Including Existing Scripts in a Workflow
- GOAL: Use the SSH actor to execute a command on a remote host.
- Step-by-step instructions
  - Search for "ssh" in the Components tab in the left pane.
  - Drag "SSH To Execute" onto the canvas.
  - Double click the actor:
    - Type in a remote host you have access to.
    - Type in your username.
  - Search for "String Constant" in the Components tab. Drag and drop "String Constant" onto the workflow canvas.
  - Double click the "String Constant", type "ls" and commit.
  - Connect the "String Constant" output with the "SSH To Execute" command input (lowest).
  - Add a "Display" and connect its input with the "SSH To Execute" stdout output (top).
  - Add the SDF director.
  - Run the workflow.
  - If you have a script deployed on the server, you can replace the "ls" command to invoke the script, e.g., perl tmp.pl
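For comparison, the same remote "ls" can be run from a stand-alone Python script with paramiko; this only illustrates what the actor does conceptually, not Kepler's actual implementation (the host and username below are placeholders):

```python
import paramiko

host, user = "remote.example.org", "myuser"   # placeholders: use a host you have access to

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # accept unknown host keys (demo only)
client.connect(host, username=user)           # assumes key-based auth; add password=... otherwise

stdin, stdout, stderr = client.exec_command("ls")   # the same command the String Constant provides
print(stdout.read().decode())                 # corresponds to the actor's stdout output port
client.close()
```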
115 Using Relational Databases
- GOAL: Access a geoscience database using a generic database actor.
- Step-by-step instructions
  - In the Components and Data Access area, select the Components tab.
  - Search for "database".
  - Drag "Open Database Connection" and "Database Query" onto the canvas.
  - Configure "Open Database Connection" with the following parameters:
    - Database format: PostgreSQL
    - Database URL: jdbc:postgresql://geon17.sdsc.edu:5432/igneous
    - Username: readonly
    - Password: read0n1y
  - Connect the output of "Open Database Connection" with the dbcon input port of "Database Query".
  - Double-click to customize the actor:
    - Query: SELECT * FROM IGROCKS.ModalData WHERE SSID = 227 (i.e., 227 for the ssID)
  - Add a Display actor (from the Components tab), connect ports, add an SDF director (as in the previous example).
  - Run the workflow.
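If you want to sanity-check the connection parameters outside Kepler, an equivalent query can be issued with psycopg2 (a sketch only; the geon17.sdsc.edu server and read-only credentials are the ones listed above and may no longer be reachable):

```python
import psycopg2

# connection parameters taken from the exercise above
conn = psycopg2.connect(host="geon17.sdsc.edu", port=5432,
                        dbname="igneous", user="readonly", password="read0n1y")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM IGROCKS.ModalData WHERE SSID = %s", (227,))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```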
116 Provenance
- Two different takes on it
  - The Scientist's View (discovery workflows)
  - The Engineer's View (plumbing workflows)
117 A Scientific Publication (the final provenance frontier ...)
- Title (Statement, Theorem)
- Abstract (1st-level expansion)
- Main Text (2nd-level expansion)
- Some metadata: Nature 443, 167-172 (14 September 2006); doi:10.1038/nature05113; Received 27 June 2006; Accepted 25 July 2006; Published online 16 August 2006
118 More Evidence
(Figure callouts: data reference, type of evidence, tool reference, "trust me on this one")
- provenance / data lineage show the history and evidence
  - related to proof trees
- unlike with scripts, a SWF system can keep track of what happened
- In the future: deposit your data and workflows in a repository
119Pipelined workflow for inferring phylogenetic
trees
Author Tim McPhillips, UC Davis
120 Different Dependency Graphs
"A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows", Shawn Bowers, Timothy McPhillips, Bertram Ludäscher, Shirley Cohen, Susan B. Davidson. International Provenance and Annotation Workshop (IPAW'06), Chicago, May 3-5, 2006.
121 Scientific Provenance: Questions we can ask
- What DNA sequences were input to the workflow?
- What phylogenetic trees were output by the workflow?
- What DNA sequences input to the workflow does this consensus tree depend on?
- What input sequences were not used to derive any output consensus trees?
- What was the sequence alignment (key intermediate data) used in the process of inferring this tree?
- plus the usual: smart re-run, VCR-style replay, ...
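Most of these questions reduce to reachability queries over a recorded data dependency graph. A toy sketch in Python (the graph and item names are hypothetical, purely for illustration):

```python
# maps each derived data item to the items it was directly derived from
depends_on = {
    "consensus_tree_1": ["alignment_1"],
    "alignment_1": ["seq_A", "seq_B", "seq_C"],
}

def lineage(item):
    """All upstream data items a given product transitively depends on."""
    seen, stack = set(), list(depends_on.get(item, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(depends_on.get(d, []))
    return seen

inputs = {"seq_A", "seq_B", "seq_C", "seq_D"}
used = lineage("consensus_tree_1")
print("tree depends on:", used & inputs)       # which input sequences the tree depends on
print("unused inputs:", inputs - used)         # inputs not used to derive the tree
```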
122Provenance in the COMAD Framework
Without Provenance
With Provenance
123 The Answer (YMMV)
- Collection-Oriented Modeling and Design (COMAD)
  - embrace the assembly line metaphor fully
    - Virtual Assembly Lines (VALs)
    - cf. Flow-based Programming (J. Morrison)
  - data: tagged nested collections
  - pipelined (XML) token streams
  - passing the buck on what's not in your scope
Timothy McPhillips (UC Davis)
124 Provenance for the WF Engineer / Plumber
- A Workflow Engineer's View
  - Monitor, benchmark, and optimize workflow performance
  - Record resource usage for a workflow execution
  - Smart re-run of (variants of) previous executions
  - Checkpointing and restart (e.g. for crash recovery, load balancing)
  - Debug or troubleshoot a workflow run
  - Explain when, where, and why a workflow crashed
125 Provenance for Domain Scientists!
- Query the lineage of a data product
  - from what data was this computed? (real dependencies, please!)
- Evaluate the results of a workflow
  - do I like how this result was computed?
- Reuse data products of one workflow run in another
  - (re-)attach prior data products to a new workflow
- Archive scientific results in a repository
- Replicate the results reported by another researcher
- Discover all results derived from a given dataset
  - i.e. across all runs
- Explain unexpected results
  - via parameter, dataset, and object dependencies, in the scientist's terms (yes, you may think "ontology" here)
126 Observables
- Model of Computation (MoC) M
  - specification/algorithm to compute o = M(W, P, i)
  - a director or scheduler implements M
  - gives rise to formal notions of
    - computation (aka run) R: typically tree models
- Model of Provenance (MoP) M'
  - approximation M' of M
  - a trace T approximates a run R by inclusion/exclusion of observables
  - T = R minus the ignored observables, i.e. T records only the model observables
- Observables (of a MoC M)
  - functional observables (may influence the output o)
    - token rate, notions of firing, ...
  - non-functional observables (not part of M, do not influence o)
    - token timestamp, size, ... (unless the MoC cares about those)
- What is a good model of provenance?
- What is a good provenance schema?
127 Provenance in the General Architecture (SDM/SPA View)
(Architecture diagram components: Analytics; Computations; Control Panel (Dashboard) and Display; local and/or remote communications (networks); Orchestration (Kepler); Data, Databases, Provenance, Storage)
128 What is Provenance? (SDM/SPA view)
- Provenance is about meta-data (data about data): the history (lineage) of data, code execution and conditions applied to a workflow run.
- Run-time monitoring may be part of the provenance meta-data, but it also may require collection of additional information and display of that information in a user-friendly format, for example on a dashboard, so that run-time tracking, problem determination, computational steering, and other workflow-related feedback may take place.
129 Why Provenance?
- Recreate results and rebuild workflows using the evolution information
- Associate the workflow with the results it produced
- Create links between generated data in different runs, and compare different runs
- Checkpoint a workflow and recover from a system failure
- Debug and explain results (via lineage tracing, ...)
- Smart re-runs
- Other...
130 Types of Provenance
- Process provenance: dynamics of control flows and their progression, execution, etc.
- Data provenance: dynamics of data and data flows, file locations, application input/output information, etc.
- Workflow provenance: structure, form, evolution, ...
- System (or environment) provenance: system information, O/S, compiler versions, etc.
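As a toy illustration of how these four kinds of provenance might sit side by side in one record, consider the following sketch (a hypothetical layout for illustration only, not Kepler's actual provenance schema):

```python
provenance_record = {
    "process": {            # process provenance: control-flow progression and execution
        "workflow_run_id": "run-0042",
        "actor_firings": [{"actor": "FileMover", "start": "2007-11-12T08:00:00", "status": "ok"}],
    },
    "data": {               # data provenance: data flows, file locations, inputs/outputs
        "inputs": ["/archive/run42/input.bp"],
        "outputs": ["/portal/run42/slice_0200.h5"],
    },
    "workflow": {           # workflow provenance: structure, form, and evolution
        "moml_version": "fusion-monitor-v3.xml",
    },
    "system": {             # system/environment provenance: OS, compiler versions, etc.
        "host": "ewok.ccs.ornl.gov",
        "os": "Linux 2.6",
    },
}
```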
131 Other Data Views and Concepts
- Raw data
- Application/simulation monitoring (input, output, configuration, intermediate states, ...)
- Data history and location
- Machine monitoring
- Shelf-life of data
- Auditability
- Error and execution logs
- Analytics and data information summation (visual, formulas, smoothing, etc.)
- ...
132 Framework
(Diagram components: Supercomputer, Analytics, Storage, Kepler (orchestration), Dashboard; meta-data about processes, data, workflows, and the system environment)
133 A Hierarchical View of the Architecture
- Control Plane (light data flows)
- Provenance, Tracking and Meta-Data (DBs and Portals)
- Execution Plane (heavy-lifting computations and flows)
- Synchronous or asynchronous?
134 Implementation
- Kepler + Linux + Apache + MySQL + PHP (K-LAMP)
- Windows-based solutions
- Communications: sockets, xmlrpc, http, files, NFS; synchronous, asynchronous, etc.
- Single node vs. distributed solutions
- Service-based solutions
- Which information?
135Data Model
136Kepler Provenance Framework
137A Key