Title: The KEPLER Scientific Workflow System
1. The KEPLER Scientific Workflow System
Bertram Ludäscher, Ilkay Altintas, and the Kepler Team
San Diego Supercomputer Center
University of California, San Diego
SDM Center AHM, LBL, August 3-5, 2004
2. Outline
- Project Overview
  - from Ptolemy II to Kepler
- Workflow Modeling Issues
  - from Dataflow to Control-flow (CCA et al.)
- Current Kepler Features
  - from plumbing to distributed execution
- Example Workflows
  - from bioinformatics to geoinformatics
- Future Plans
  - from today to tomorrow :-)
3. What is a Scientific Workflow (SWF)?
- Goals
  - automate a scientist's repetitive steps (data analysis, data transformation, computational steps, ...)
  - can encompass data generation, aggregation, analysis, visualization (WF granularity)
  - design, test, share, deploy, execute, reuse, ... SWFs
- Typical requirements/characteristics
  - data-intensive and/or compute-intensive
  - plumbing-intensive
  - dataflow-oriented
  - distribution (data, processing)
  - user interaction in the middle, ...
    - ... vs. (C-z bg fg)-ing (detach and reconnect)
  - advanced programming constructs (map(f), zip, takewhile, ...)
  - logging, provenance, registering back (intermediate) products
- easy to recognize a SWF when you see one!
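The "advanced programming constructs" bullet above can be illustrated with a minimal Python sketch over lazy token streams. This is not Kepler code: the `readings` source and the `calibrate` step are hypothetical stand-ins for a data source and a processing step.

```python
from itertools import takewhile

def readings():
    """Hypothetical unbounded token source (0, 1, 2, ...)."""
    n = 0
    while True:
        yield n
        n += 1

def calibrate(x):
    """Hypothetical per-token processing step."""
    return x * 0.5

# map(f): apply one step to every token of the stream (lazily)
calibrated = map(calibrate, readings())

# zip: pair each token with a token from a second stream
labels = (f"sample-{i}" for i in readings())
paired = zip(labels, calibrated)

# takewhile: stop consuming the stream once the condition fails
result = list(takewhile(lambda pair: pair[1] < 2.0, paired))
```

Because all three constructs are lazy here, the unbounded source is only read as far as `takewhile` needs, which is the streaming behavior a pipelined workflow relies on.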
4. Promoter Identification Workflow
Source: Matt Coleman (LLNL)
5. Source: NIH BIRN (Jeffrey Grethe, UCSD)
6. Ecology: GARP Analysis Pipeline for Invasive Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
8. Starting Point for SDM-Center/SPA and SEEK: Ptolemy II
see!
read!
try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
9. An Early Example: Promoter Identification (SSDBM, AD 2003)
- Scientist models application as a workflow of connected components (actors)
- If all components exist, the workflow can be automated/executed
- Different directors can be used to pick an appropriate execution model (often pipelined execution: PN director)
10. Why Ptolemy II (and thus KEPLER)?
- Ptolemy II Objective
  - "The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation."
- Dataflow Process Networks w/ natural pipelining/streaming support
- User-Orientation
  - Workflow design and exec console (Vergil GUI)
  - Application/Glue-Ware
    - excellent modeling and design support
    - run-time support, monitoring, ...
    - not a middle-/underware (we use someone else's, e.g. Globus, SRB, ...)
    - but middle-/underware is conveniently accessible through actors!
- PRAGMATICS
  - Ptolemy II is mature, continuously extended and improved, well-documented (500pp)
  - open source system
  - Ptolemy II folks actively participate in KEPLER
11. KEPLER: An Open Collaboration
- Founding projects
  - DOE SDM/SPA and NSF SEEK
- Open Source (BSD-style license)
- Intensive Communications
  - Web-archived mailing lists
  - IRC (!)
- Co-development
  - via shared CVS repository
  - joining as a new co-developer (currently)
    - get a CVS account (read-only)
    - local development and contribution via existing KEPLER member
    - be voted in as a member/co-developer
- Software and social engineering
  - How to better accommodate new groups/communities?
  - How to better accommodate different usage/contribution models (core dev, special-purpose extender, user)?
12. KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons :-)
- Ilkay Altintas: SDM, Resurgence
- Kim Baldridge: Resurgence, NMI
- Chad Berkley: SEEK
- Shawn Bowers: SEEK
- Terence Critchlow: SDM
- Tobin Fricke: ROADNet
- Jeffrey Grethe: BIRN
- Christopher H. Brooks: Ptolemy II
- Zhengang Cheng: SDM
- Dan Higgins: SEEK
- Efrat Jaeger: GEON
- Matt Jones: SEEK
- Werner Krebs: EOL
- Edward A. Lee: Ptolemy II
- Kai Lin: GEON
- Bertram Ludaescher: BIRN, SDM, SEEK, GEON
- Mark Miller: EOL
- Steve Mock: NMI
- Steve Neuendorffer: Ptolemy II
Ptolemy II
13. History
- Gabriel (1986-1991)
  - Written in Lisp
  - Aimed at signal processing
  - Synchronous dataflow (SDF) block diagrams
  - Parallel schedulers
  - Code generators for DSPs
  - Hardware/software co-simulators
- Ptolemy Classic (1990-1997)
  - Written in C++
  - Multiple models of computation
  - Hierarchical heterogeneity
  - Dataflow variants: BDF, DDF, PN
  - C/VHDL/DSP code generators
  - Optimizing SDF schedulers
  - Higher-order components
- Ptolemy II (1996-2022)
  - Written in Java
  - Domain polymorphism
  - Multithreaded
- PtPlot (1997-??)
  - Java plotting package
- Tycho (1996-1998)
  - Itcl/Tk GUI framework
- Diva (1998-2000)
  - Java GUI framework
- Copernicus (code generator)
- KEPLER (2003-2028)
  - scientific workflow extensions
Source (Ptolemy): Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
14. KEPLER then ...
15. ... and KEPLER today
What's a polymorphic actor?
What's a scientific workflow?
What is HPC?
BTW: Kepler is NOT a GUI (Vergil is)
16. The KEPLER/Ptolemy II GUI (Vergil)
Directors define the component interaction and execution semantics
Large, polymorphic component (Actors) and Directors libraries (drag and drop)
17. Actor-Oriented Design
- Object orientation
  [Diagram: class name, data, methods; call, return]
  What flows through an object is sequential control (cf. CCA, MPI)
- Actor/Dataflow orientation
  [Diagram: actor name, data (state), parameters, ports; input data, output data]
  What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!)
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
18. Object-Oriented vs. Actor-Oriented Interfaces
Object Oriented
OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.
Actor Oriented
AO interface definition says "Give me text and I'll give you speech"
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
19. Ptolemy II: Actor-Oriented Modeling
- Component (actor) interaction semantics not hard-wired inside components, but factored out in a director
- Different directors for different modeling and execution needs (can even be combined!)
- Better abstraction, modeling, component reuse, ...
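A minimal Python sketch of this factoring, under stated assumptions: the classes `Actor`, `SequentialDirector`, and `ThreadedDirector` are hypothetical illustrations and not the Ptolemy II or Kepler API. The same two actors are run unchanged under two different execution models.

```python
import queue
import threading

class Actor:
    """An actor maps one input token to one output token; it knows nothing about scheduling."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def fire(self, token):
        return self.fn(token)

class SequentialDirector:
    """Fires the actors of a linear pipeline one after another for each token."""
    def run(self, actors, tokens):
        out = []
        for tok in tokens:
            for actor in actors:
                tok = actor.fire(tok)
            out.append(tok)
        return out

class ThreadedDirector:
    """Runs each actor in its own thread, connected by FIFO queues (process-network flavor)."""
    _END = object()  # sentinel marking the end of the stream

    def run(self, actors, tokens):
        queues = [queue.Queue() for _ in range(len(actors) + 1)]

        def worker(actor, inbox, outbox):
            while True:
                tok = inbox.get()  # blocks while the input queue is empty
                if tok is self._END:
                    outbox.put(self._END)
                    return
                outbox.put(actor.fire(tok))

        threads = [
            threading.Thread(target=worker, args=(a, queues[i], queues[i + 1]))
            for i, a in enumerate(actors)
        ]
        for t in threads:
            t.start()
        for tok in tokens:
            queues[0].put(tok)
        queues[0].put(self._END)

        out = []
        while True:
            tok = queues[-1].get()
            if tok is self._END:
                break
            out.append(tok)
        for t in threads:
            t.join()
        return out

# The same actors, unchanged, under two different directors.
pipeline = [Actor("scale", lambda x: x * 2), Actor("offset", lambda x: x + 1)]
seq_result = SequentialDirector().run(pipeline, [1, 2, 3])
pn_result = ThreadedDirector().run(pipeline, [1, 2, 3])
```

The actors contain no scheduling or communication logic, so swapping the director changes how they execute (one thread versus a pipelined thread per actor) without touching the components.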
20. Behavioral Polymorphism in Ptolemy
These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, GBPL4??, ...!)
Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.
[Diagram: producer actor, IOPort, Receiver, consumer actor]
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
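The idea that the receiver is supplied from outside the component can be sketched in Python as follows. All class and method names here (`FifoReceiver`, `LatestValueReceiver`, `Port`, `send`, `drain`) are assumptions made for illustration; they are not the Ptolemy II classes.

```python
from collections import deque

class FifoReceiver:
    """Queue semantics: every token is kept until it is consumed."""
    def __init__(self):
        self._q = deque()

    def put(self, token):
        self._q.append(token)

    def get(self):
        return self._q.popleft()

    def has_token(self):
        return bool(self._q)

class LatestValueReceiver:
    """Capacity-one semantics: a new token overwrites an unread one."""
    def __init__(self):
        self._slot = []

    def put(self, token):
        self._slot = [token]

    def get(self):
        return self._slot.pop()

    def has_token(self):
        return bool(self._slot)

class Port:
    """A port delegates to whatever receiver was installed from outside the actor."""
    def __init__(self, receiver):
        self.receiver = receiver

    def send(self, token):
        self.receiver.put(token)

    def drain(self):
        out = []
        while self.receiver.has_token():
            out.append(self.receiver.get())
        return out

def producer(port):
    """The producing component is identical in both cases; it only calls send()."""
    for tok in ("a", "b", "c"):
        port.send(tok)

fifo_port = Port(FifoReceiver())
latest_port = Port(LatestValueReceiver())
producer(fifo_port)
producer(latest_port)
fifo_seen = fifo_port.drain()
latest_seen = latest_port.drain()
```

The producer code is the same in both runs; only the installed receiver differs, which is what lets one component operate under several communication semantics.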
21. Component Composition and Interaction
- Components linked via ports
- Dataflow (and msg/ctl-flow)
- Where is the component interaction semantics defined??
  - each component is its own director!
- But still useful for special applications, e.g. parallel programs (MPI, ...)
Source: GRIST/SC4DEVO workshop, July 2004, Caltech
22. Data/Control-Flow Spectrum
message passing, control flow
clean data(ctl)-flow
special tokens flow
- Data (tokens) flow
  - (almost) no other side effects
  - WYSIWYG (usually)
- References flow
  - token reference type may be http-get, ftp-get, hsi put
  - generic handling still possible
- Application-specific tokens flow
  - e.g. current Nimrod job management in Resurgence
  - invisible contract between components
  - Director is unaware of what's going on (sounds familiar? :-)
- Specific message passing protocols (e.g., CSP, MPI)
  - for systems of tightly coupled components
23. CCA via special ("look the other way") Director(s)?
- Dataflow in CCA
  - a CCA convention can be used to accommodate actor-oriented/dataflow modeling
- CCA/Message Passing in KEPLER
  - Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)
24. Domains and Directors: Semantics for Component Interaction
- CI: Push/pull component interaction
- CSP: concurrent threads with rendezvous
- CT: continuous-time modeling
- DE: discrete-event systems
- DDE: distributed discrete events
- FSM: finite state machines
- DT: discrete time (cycle driven)
- Giotto: synchronous periodic
- GR: 2-D and 3-D graphics
- PN: process networks
- SDF: synchronous dataflow
- SR: synchronous/reactive
- TM: timed multitasking
For (finer-grained) concurrent jobs!?
For (coarse-grained) Scientific Workflows!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
25. Polymorphic Actor Components: Working Across Data Types and Domains
- Actor Data Polymorphism
  - Add numbers (int, float, double, Complex)
  - Add strings (concatenation)
  - Add complex types (arrays, records, matrices)
  - Add user-defined types
- Actor Behavioral Polymorphism
  - In dataflow, add when all connected inputs have data
  - In a time-triggered model, add when the clock ticks
  - In discrete-event, add when any connected input has data, and add in zero time
  - In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
  - In CSP, execute an infinite loop that performs rendezvous on input or output
  - In push/pull, ports are push or pull (declared or inferred) and behave accordingly
  - In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
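A small Python sketch of the two kinds of polymorphism listed above. `AddActor`, `ready_all`, and `ready_any` are hypothetical names; Python's own `+` stands in for the type-dependent addition, so this illustrates the idea rather than Ptolemy II's type system.

```python
from functools import reduce
import operator

class AddActor:
    """Hypothetical 'Add' actor: the meaning of + is chosen by the token type."""
    def fire(self, inputs):
        return reduce(operator.add, inputs)

def ready_all(ports):
    """Dataflow-style firing rule: fire only when every connected input has a token."""
    return all(len(p) > 0 for p in ports)

def ready_any(ports):
    """Discrete-event-style firing rule: fire as soon as any connected input has a token."""
    return any(len(p) > 0 for p in ports)

add = AddActor()

# Data polymorphism: one actor definition, several token types.
numbers = add.fire([1, 2.5, complex(0, 1)])
text = add.fire(["data", "flow"])

# Behavioral polymorphism: the firing rule is decided outside the actor.
ports = [[7], []]  # first input holds a token, second is empty
fires_in_dataflow = ready_all(ports)
fires_in_discrete_event = ready_any(ports)
```

The actor body never mentions a concrete type or a firing rule, which is the reuse gain the slide refers to; the open question of whether it behaves sensibly under every rule is left to the surrounding framework.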
26. Directors and Combining Different Component Interaction Semantics
- Possible app. in SWF
- time-series aware
- parameter-sweep aware
- XY aware
- execution models
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
27. A Few Specific Kepler Features and Example Workflows
28. Web Services -> Actors (WS Harvester)
[Figure: numbered steps 1-4]
- -> "Minute-made" (MM) WS-based application integration
- Similarly: MM workflow design and sharing w/o implemented components
29. Recent Actor Additions
30. Digression: Who are the clients?
- Domain scientists
  - C/Perl/Python/Java/WS/DB-enabled ones
  - others (the rest of us?)
- Goal: make life better for both categories!
  - Workflow automation
  - Plumbing support
  - Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)
31. GEON Mineral Classification Workflow
32. ... inside the Classifier
BrowserUI actor w/ SVG client display
33. GEON Dataset Generation and Registration (and co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al. (Ptolemy)
34. GEON Data Registration UI
35. GEON Data Registration in KEPLER
36. Registered Resources show up in Vergil (joint SEEK, SPA, GEON, ... Registry!?)
37. Data Analysis: Biodiversity Indices
38. Traffic info for a list of highways: uses the iterate (higher-order map) actor to access a highway info web service repeatedly, sending out one email per highway.
41. Re-engineered PIW w/ Iteration Constructs (AD 2004)
map(GenbankWS)
Input: NM_001924, NM020375
Output: CAGTAATATGAC, GGGGACAA AGA
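The higher-order map construct used in these workflows can be sketched in Python as below. `map_actor` and `fake_lookup` are hypothetical: the lookup is a local stand-in that returns a placeholder string, not a call to any real web service and not real sequence data.

```python
def map_actor(service, tokens):
    """Higher-order 'map' actor: apply `service` once per input token, keeping order."""
    return [service(tok) for tok in tokens]

def fake_lookup(accession):
    """Stand-in for a remote lookup; returns a placeholder, not real sequence data."""
    return f"<sequence for {accession}>"

# Accession identifiers as they appear on the slide.
accessions = ["NM_001924", "NM020375"]
sequences = map_actor(fake_lookup, accessions)
```

The workflow designer wires the service in once; the map actor takes care of invoking it for each element of the input list, which is what replaces hand-built iteration plumbing.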
42. Streaming Real-time Data
Straightforward Example
Laser Strainmeter Channels in; Scientific Workflow; Earth-tide signal out
Seismic Waveforms
44. Job Management (here: NIMROD)
- Job management infrastructure in place
- Results database under development
- Goal: 1000s of GAMESS jobs (quantum mechanics)
Fall/Winter '04
45. KEPLER Today
- Support for SWF life cycle
  - Design, share, prototype, run, monitor, deploy, ...
- Coarse-grained scientific workflows, e.g.,
  - web service actors, grid actors, command-line actors, ...
- Fine-grained workflows and simulations, e.g.,
  - Database access, XSLT transformations, ...
- Kepler Extensions
  - SDM Center/SPA: support for data- and compute-intensive workflows!
  - real-time data streaming (ROADNet)
  - other special and generic extensions (e.g. GEON, SEEK)
- Status
  - first release (alpha) was in May 2004
  - nightly builds w/ version tests
  - "Link-Up Sister Project" w/ other SWF systems (UK Taverna, Triana, ...)
  - Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, ...)
46. KEPLER Tomorrow
- Application-driven extensions
  - access to/integration with other IDMAF components
    - SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, ...
  - support for execution of new SWF domains
    - Astrophysics: TSI/Blondin (SPA/NCSU)
    - Nuclear Physics: Swesty (SPA/LLNL)
    - ...
- Generic extensions
  - addtl. support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, ...)
  - (C-z bg fg)-ing (detach and reconnect)
  - workflow deployment models
  - Additional domain awareness (e.g. via new directors)
    - time series, parameter sweeps, job scheduling, ...
  - hybrid type system with semantic types
- Consolidation
  - More installers, regular releases, improved documentation, ...
47. KEPLER SPA
- First alpha releases since May 2004
https://www-casc.llnl.gov/sdm/
http://kepler.ecoinformatics.org
48. Breaking into the Parallel (MPI) and Stream Computing Worlds!?
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
- Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens), e.g. mapS f . mapS g => mapS (f . g)
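The map-fusion rewrite named above can be checked on a small example. In this Python sketch `map_s` is a hypothetical list-based stand-in for the stream map `mapS`, and `f` and `g` are arbitrary side-effect-free functions chosen for illustration.

```python
def compose(f, g):
    """Function composition: (f . g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def map_s(f, stream):
    """List-based stand-in for mapS: apply f to every element of the stream."""
    return [f(x) for x in stream]

f = lambda x: x + 1
g = lambda x: x * 10
stream = [1, 2, 3]

# Left-hand side: two passes over the stream (mapS f . mapS g).
two_passes = map_s(f, map_s(g, stream))

# Right-hand side: one fused pass (mapS (f . g)).
one_pass = map_s(compose(f, g), stream)
```

The rewrite is only valid when `f` and `g` have no side effects, which is why clean functional (dataflow) semantics matters: a workflow engine could then merge two pipeline stages into one without changing the result.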
49. Hybrid Types (Structure and Semantics)
- Services can be semantically compatible, but structurally incompatible
[Diagram: Source Service with port Ps, Target Service with port Pt, and the desired connection from Ps to Pt. Ontologies (OWL): SemanticType Ps and SemanticType Pt are compatible. StructuralType Ps and StructuralType Pt are incompatible.]
Source: Bowers-Ludaescher, DILS'04