Title: Using explicit control processes in distributed workflows to gather provenance
1Using explicit control processes in distributed
workflows to gather provenance
- Sergio M. S. Cruz
- Fernando Seabra Chirigati
- Rafael Dahis
- Maria Luiza M. Campos
- Marta Mattoso
- Federal University of Rio de Janeiro, Brazil
UFRJ
2Agenda
- Introduction
- Motivation
- Control flow in data centric workflows
- Objective
- Provenance Gathering in Distributed Workflows with Explicit Control Flows
- Use Case
- Control Flow on VisTrails
- Conclusion
3Distribution Heterogeneity in Workflows
- Scientific workflows (Wf) enable data-intensive analyses
- Use of grids vs. remote parallel machines
- Use of different WfMS
- Different provenance capture mechanisms
- Use of centralized vs. distributed WfMS
- These often offer disjoint sets of capabilities
How can we obtain a homogeneous provenance representation and capture mechanism?
4Control flow matters in data centric workflows
- Scientific workflows also need control structures to specify how the data flow should be directed
- Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow
- Bowers et al. [5] say that modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule
- Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution
- Presented a set of generic control structures and proposed the use of a monitoring middleware
5A real example: the OrthoSearch workflow
Detects distant homologies in five parasites associated with neglected tropical diseases
6OrthoSearch specification in Kepler
MAFFT/HMMER packages
Best Hits Finder
FormatDB
BLAST
InterPRO
Time consuming tasks
- Some lightweight tasks can run locally
- Suppose we need to execute MAFFT/HMMER in a High Performance Environment
- Just send it to a grid!
7OrthoSearch - loops, choice,
MAFFT/HMMER packages
How to map this to the grid language ?
Best Hits Finder
FormatDB
BLAST
InterPRO
8OrthoSearch - loops, choice,
MAFFT/HMMER packages
Alternatively, send one job at a time to execute
remotely
Best Hits Finder
FormatDB
LOCAL BLAST
InterPRO
Can be very inefficient !
9OrthoSearch - loops, choice,
Rewrite this in the grid language, e.g. Triana, which supports loops!
But, how to bring provenance data back to Kepler ?
How to register loop iterations ?
10OrthoSearch - loops, choice other issues
What if my available grid supports another WfMS ?
What if the grid WfMS does not support loops ?
What if my available grid does not have a WfMS ?
Generic control flow modules with remote
provenance gathering!
11Motivation
- Workflow design
- Different WfMS present their own control structures, parallel execution models, etc.
- Expose different modeling semantics to the users!
- Provenance gathering
- WfMS register provenance in their own schema
- Often encompassing specific grid features
- Based on application domain attributes
Many challenges in changing WfMS for the same
workflow
A lot of mappings and conversions!
12Objective
- Reduce the dependence of the workflow definition on the WfMS
- Decouple the provenance gathering system from the WfMS
- Keep some control flow of execution independent of the WfMS workflow specification language
- Plug control flow and provenance gathering modules in alongside the workflow's original tasks
- The workflow specification can then be executed almost independently of the current WfMS
- Provenance can be gathered uniformly
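One way to picture provenance gathering that is decoupled from the WfMS is a thin wrapper around each task that writes a uniform record, whatever engine runs the task. The sketch below is only an illustration under that assumption: `ProvenanceRecord`, `ProvenanceStore` and the `align` task are hypothetical names, not part of any WfMS or of the authors' implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ProvenanceRecord:
    """One uniform record per task execution (hypothetical schema)."""
    task: str
    inputs: Dict[str, Any]
    output: Any
    started: float
    finished: float


@dataclass
class ProvenanceStore:
    """In-memory stand-in for a provenance repository."""
    records: List[ProvenanceRecord] = field(default_factory=list)

    def wrap(self, task: str, func: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of func that logs every call to this store."""
        def wrapped(**inputs: Any) -> Any:
            started = time.time()
            output = func(**inputs)
            self.records.append(
                ProvenanceRecord(task, dict(inputs), output, started, time.time())
            )
            return output
        return wrapped


store = ProvenanceStore()
# "align" is a placeholder for a real task such as MAFFT.
align = store.wrap("align", lambda sequences: sorted(sequences))
align(sequences=["b", "a"])
print(store.records[0].task, store.records[0].output)
```

Because the record is produced by the wrapper rather than by the engine, the same schema would apply whether the task runs locally or remotely.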
13Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
14Scientific Workflow Control Flows
COGs DB
MAFFT
hmmbuild
HMMER
Implicit DECISION
hmmcalibrate
Ptn DB
hmmsearch
hmmpfam
ReciprocalsBest Hits Finder
BLAST
fastacmd
formatdb
Reannotated genes
InterPRO
15Scientific Workflow with Explicit Control Flows
Initial condition
MUX
MAFFT
hmmbuild
HMMER
hmmcalibrate
Explicit LOOP
Explicit DECISION
IF
T
F
hmmsearch
hmmpfam
Meta-Workflow → eases migration of a Wf from one WfMS to another!
- All these modules can be sent to execute in any HPC environment
- Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
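The explicit LOOP in this slide can be read as a MUX that selects between the initial condition and a fed-back value, followed by an IF that decides whether to iterate again. The sketch below illustrates that reading and records one provenance entry per iteration. The function name `explicit_loop`, its parameters, and the choice that a false condition exits the loop are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def explicit_loop(
    initial: T,
    body: Callable[[T], T],
    keep_going: Callable[[T], bool],
    max_iterations: int = 100,
) -> Tuple[T, List[dict]]:
    """Run body repeatedly, logging one provenance entry per iteration."""
    provenance: List[dict] = []
    value = initial  # MUX: the first pass takes the initial condition
    for iteration in range(max_iterations):
        result = body(value)
        provenance.append(
            {"iteration": iteration, "input": value, "output": result}
        )
        # IF: a false condition leaves the loop (assumed; the slide does not
        # say which branch exits).
        if not keep_going(result):
            return result, provenance
        value = result  # MUX: later passes take the fed-back value
    raise RuntimeError("loop did not terminate within max_iterations")


# Toy body standing in for the MAFFT/HMMER steps.
final, log = explicit_loop(1, lambda x: x * 2, lambda x: x < 20)
print(final, len(log))
```

Since the loop lives in ordinary modules rather than in the WfMS language, each iteration can be registered even when the engine itself has no loop construct.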
16Control flow modules on VisTrails
- All these control flow modules were made available on VisTrails
- More explicit control is now available
- Remote execution can keep specified control
- Remote execution can bring provenance data back to VisTrails with compatible structure
Advantages
17OrthoSearch on VisTrails
External LOOP (parameter exploration)
- All these inner modules (sub-workflow) can be sent to execute in a grid or HPC environment
- Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
- In VisTrails the loop could not be implemented because it is a DAG-based WfMS
18Scientific Workflow - Heterogeneity
COGs DB
MAFFT
hmmbuild
HMMER
hmmcalibrate
Ptn DB
hmmsearch
Time consuming
hmmpfam
ReciprocalsBest Hits Finder
BLAST
fastacmd
formatdb
Reannotated genes
InterPRO
19OrthoSearch on VisTrails
- BLAST modules should be sent to execute in a PC cluster
- Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment
- In VisTrails this can be achieved using the MidMon modules
20MidMon on VisTrails
Implementation
- Monitoring tool that checks scientific processes running on distributed environments
- Message exchange-based tool
- Decoupled, with a modular infrastructure
- Support for legacy applications on distributed resources
Control Modules
Data Modules
BLAST
21Concluding
- We share the same motivation as Bowers et al., Goderis et al. and Tudruj et al.
- And the same as Groth et al.
- We propose a set of generic control-flow structures independent of WfMS
- Our implementation has shown that:
- Control-flow structures can allow generic sub-workflow remote execution
- Remote process provenance can be captured in the same representation as the Wf
- Workflow refactoring is facilitated
- Control-flow structures can be coupled to monitoring middleware
Using explicit control flow
Provenance independent of a WfMS
22Conclusion
- Distribution and heterogeneity are inevitable in scientific workflows
- Adding control-flow modules to the scientific workflow specification can help the execution by heterogeneous WfMS running on distributed environments
- Acts as documentation of the execution control workflow
- Allows evaluating and monitoring the activities of the workflow
- Helps to gather provenance from heterogeneous and independent environments with low programming effort
- MidMon on top of VisTrails
- Enables scientists to monitor the status of submitted jobs on their desktops
- Preserves the workflow's original features
23Future work
- Use workflow views, e.g. ZOOM
- Our solution makes the workflow very verbose
- Use software component reuse and refactoring techniques to help the automatic incorporation of these modules
- Using Provenance to Improve Workflow Design, Tosta et al.
- Work with other workflows from bioinformatics and the oil industry
24Using explicit control processes in distributed
workflows to gather provenance
Thanks !
- Sergio M. S. da Cruz
- Fernando Seabra Chirigati
- Rafael Dahis
- Maria Luiza M. Campos
- Marta Mattoso
- Federal University of Rio de Janeiro, Brazil
26Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
MUX: Describes a convergence between two or more input ports, resulting in just one branch
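As a rough illustration of the MUX behaviour described above, the sketch below forwards whichever input port carries data onto a single outgoing branch. The `UNSET` sentinel and the rule that the first populated port wins are assumptions made for this example, not details taken from the slides.

```python
from typing import Any

UNSET = object()  # hypothetical marker for an input port with no data


def mux(*ports: Any) -> Any:
    """Forward the value of the first input port that carries data."""
    for port in ports:
        if port is not UNSET:
            return port
    raise ValueError("no input port carries data")


# First pass of a loop: only the initial-condition port is populated.
print(mux("initial", UNSET))
# Later pass: only the feedback port is populated.
print(mux(UNSET, "feedback"))
```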
27Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
DEMUX: Represents an incoming branch that diverges into two or more parts. Just one of the outgoing branches is enabled, depending on an associated condition
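The DEMUX description can be sketched as a function that evaluates a condition on the incoming value and enables exactly one outgoing branch. The names `demux`, `selector` and `branches` are hypothetical, and the branch functions are toy stand-ins for workflow tasks.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


def demux(
    value: Any,
    selector: Callable[[Any], Hashable],
    branches: Dict[Hashable, Callable[[Any], Any]],
) -> Tuple[Hashable, Any]:
    """Enable exactly one outgoing branch, chosen by the selector."""
    key = selector(value)
    if key not in branches:
        raise KeyError(f"no outgoing branch named {key!r}")
    # Only the selected branch runs; the others stay disabled.
    return key, branches[key](value)


chosen, result = demux(
    5,
    lambda v: "big" if v > 3 else "small",
    {"big": lambda v: v * 10, "small": lambda v: v},
)
print(chosen, result)
```

Returning the chosen branch name alongside the result gives a provenance module something to record about which path was taken.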
28Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
STRING CONTROL: The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn
29Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
NUMBER CONTROL: All output data are originated simultaneously
30Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
NUMBER COMPARE: Two or more incoming branches become one outgoing branch, which is enabled only after all the input data have been fully activated.
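The behaviour described for NUMBER COMPARE resembles a synchronizing join: the outgoing branch fires only once every incoming branch has delivered its data. The sketch below shows that idea under this interpretation. The class name `SyncJoin` and its `deliver` method are hypothetical, and the port names are only examples borrowed from the OrthoSearch steps.

```python
from typing import Any, Dict, Iterable, Optional


class SyncJoin:
    """Hold inputs until every expected port has delivered a value."""

    def __init__(self, port_names: Iterable[str]) -> None:
        self._expected = set(port_names)
        self._received: Dict[str, Any] = {}

    def deliver(self, port: str, value: Any) -> Optional[Dict[str, Any]]:
        """Store one input; return all inputs once the set is complete."""
        if port not in self._expected:
            raise KeyError(f"unknown input port {port!r}")
        self._received[port] = value
        if set(self._received) == self._expected:
            return dict(self._received)  # outgoing branch is enabled
        return None  # still waiting for other branches


join = SyncJoin(["hmmsearch", "hmmpfam"])
first = join.deliver("hmmsearch", "hits-A")
second = join.deliver("hmmpfam", "hits-B")
print(first, second)
```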
31Scientific Workflow Control Flows
- A small set of generic workflow-level control modules
- Based on workflow patterns by Van der Aalst et al.
IF: Same pattern as the DEMUX, with two differences: IF has only two input ports, and it has a logical expression in which scientists can create any condition they need.
32MidMon
- Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments
- Message exchange-based, two-layered modular infrastructure
- Decoupled and lightweight, crossing different network boundaries
- Easy to deploy and manage
- Support for legacy applications on distributed resources
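To make the message exchange idea concrete, the sketch below has a remote task post status messages to a channel that a monitor drains to obtain the latest state of each job. It is not MidMon's actual protocol: the JSON message shape, the function names, and the in-process `queue.Queue` standing in for a network channel are all assumptions.

```python
import json
import queue
from typing import Dict


def report_status(channel: "queue.Queue[str]", job_id: str, state: str) -> None:
    """Called on the execution side: post one status message."""
    channel.put(json.dumps({"job": job_id, "state": state}))


def collect_status(channel: "queue.Queue[str]") -> Dict[str, str]:
    """Called on the monitoring side: drain messages, keep the latest state."""
    latest: Dict[str, str] = {}
    while True:
        try:
            message = channel.get_nowait()
        except queue.Empty:
            break
        event = json.loads(message)
        latest[event["job"]] = event["state"]
    return latest


channel: "queue.Queue[str]" = queue.Queue()
report_status(channel, "blast-1", "RUNNING")
report_status(channel, "blast-2", "RUNNING")
report_status(channel, "blast-1", "DONE")
status = collect_status(channel)
print(status)
```

Keeping the task and the monitor coupled only through messages is what would let a legacy application be monitored without modifying the WfMS.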
33MidMon Monitoring Data
- It may be possible to monitor task state data
- It may be possible to monitor the state of the environment
- It may be possible to monitor service availability
34MidMon State Data
- List of task state data that it may be possible to monitor
- Progress of a service: relies on checkpoints within the service, or a service may be able to provide an estimate of its progress
- Completion of a service: this could be a simple event that indicates that a service has produced its complete output file
- Data consumption rate of a service: a measure of the rate at which the service is consuming data from its input file
- Data production rate of a service: a measure of the rate at which the service is generating data for its output file
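Two of the measures listed above reduce to simple arithmetic, sketched below: progress as the fraction of checkpoints reached, and a data rate as bytes moved per second between two observations. The function names and the use of byte counts are assumptions for illustration, and the same rate formula applies to consumption (input file) and production (output file).

```python
def progress_from_checkpoints(reached: int, total: int) -> float:
    """Fraction of checkpoints a service has passed, between 0.0 and 1.0."""
    if total <= 0 or not 0 <= reached <= total:
        raise ValueError("need 0 <= reached <= total and total > 0")
    return reached / total


def data_rate(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Bytes consumed or produced per second between two observations."""
    if seconds <= 0:
        raise ValueError("observation interval must be positive")
    return (bytes_after - bytes_before) / seconds


print(progress_from_checkpoints(3, 12))
print(data_rate(0, 1000, 4.0))
```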
35MidMon State of the environment
- A list of useful data that it may be possible to monitor about the state of the environment:
- Available execution nodes: this could be a list of changes in the available execution nodes in the environment
- Load on an execution node: a measure of the load on an execution node. It could be a single value, a tuple, or a composite, e.g., the CPU load, the number of processes, and the free resources of the execution node
- Load on a network link: a measure of the usage of a network link, in terms of the available bandwidth
- Memory usage on an execution node: a measure of the memory usage on an execution node
36MidMon Service availability
- The following is a list of useful data that it may be possible to monitor about service availability:
- Available services: this could be a list of the services available as mapping targets for tasks in a workflow. The data could also include, e.g., the status of services currently deployed
- Available data resources: this could be a list of the data resources available as mapping targets for inputs and outputs in a workflow
37OrthoSearch SSH version
- Without Control-Flow modules
38hmmSearch
hmmPFam
OrthoSearch on Kepler 1/3
39FormatDB
OrthoSearch on Kepler 2/3
FastaCmd
40InterPro
OrthoSearch on Kepler 3/3