Using explicit control processes in distributed workflows to gather provenance

1
Using explicit control processes in distributed
workflows to gather provenance
  • Sergio M. S. Cruz
  • Fernando Seabra Chirigati
  • Rafael Dahis
  • Maria Luiza M. Campos
  • Marta Mattoso
  • Federal University of Rio de Janeiro, Brazil

UFRJ
2
Agenda
  • Introduction
  • Motivation
  • Control flow in data centric workflows
  • Objective
  • Provenance Gathering in Distributed Workflows
    with Explicit Control Flows
  • Case of Use
  • Control Flow on VisTrails
  • Conclusion

3
Distribution and Heterogeneity in Workflows
  • Scientific workflows enable data-intensive analyses
  • Use of grids vs. remote parallel machines
  • Use of different WfMS
  • Different provenance capture mechanisms
  • Use of centralized vs. distributed WfMS
  • These often offer disjoint sets of capabilities

How can we obtain a homogeneous provenance
representation and capture mechanism?
4
Control flow matters in data centric workflows
  • Scientific workflows also need control structures
    to specify how the data flow should be directed
  • Goderis et al. [6] stress the importance of
    combining different models of computation in one
    scientific workflow
  • Bowers et al. [5] say that modeling control-flow
    using only dataflow constructs can quickly lead
    to overly complex workflows that are hard to
    understand, reuse, reconfigure, maintain, and
    schedule
  • Tudruj et al. [7] state the importance of general
    dynamic control flow, but focus on
    synchronization of parallel execution
  • Presented a set of generic control structures and
    proposed the use of a monitoring middleware

5
A real example: the OrthoSearch workflow

Detects distant homologies in five parasites
associated with neglected tropical diseases
6
OrthoSearch specification in Kepler
[Figure: Kepler workflow. Labels: MAFFT/HMMER
packages, Best Hits Finder, FormatDB, BLAST,
InterPRO, Time consuming tasks]
  • Some lightweight tasks can run locally
  • Suppose we need to execute MAFFT/HMMER in a High
    Performance Environment
  • Just send it to a grid!

7
OrthoSearch - loops, choice,
How to map this to the grid language?
[Figure: OrthoSearch workflow. Labels: MAFFT/HMMER
packages, Best Hits Finder, FormatDB, BLAST,
InterPRO]
8
OrthoSearch - loops, choice,
Alternatively, send one job at a time to execute
remotely
Can be very inefficient!
[Figure: OrthoSearch workflow. Labels: MAFFT/HMMER
packages, Best Hits Finder, FormatDB, LOCAL BLAST,
InterPRO]
9
OrthoSearch - loops, choice,
Rewrite this in the grid language, e.g. Triana,
which supports loops!
But how do we bring provenance data back to Kepler?
How do we register loop iterations?
10
OrthoSearch - loops, choice, other issues
What if my available grid supports another WfMS?
What if the grid WfMS does not support loops?
What if my available grid does not have a WfMS?
Generic control flow modules with remote
provenance gathering!
11
Motivation
  • Workflow design
  • Different WfMS present their own control
    structures, parallel execution models, etc.
  • Expose different modeling semantics to the users!
  • Provenance gathering
  • WfMS register provenance in their own schema
  • Often encompassing specific grid features
  • Based on application domain attributes

Many challenges in changing WfMS for the same
workflow
A lot of mappings and conversions!
12
Objective
  • Reduce the dependence of the workflow
    definition on the WfMS
  • decoupling the provenance gathering system from
    the WfMS
  • keeping some control flow of execution independent
    of the WfMS workflow specification language
  • Plug control flow and provenance gathering
    modules in alongside the workflow's original tasks
  • the workflow specification can be executed almost
    independently of the current WfMS
  • provenance can be gathered uniformly
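
The decoupling idea can be illustrated with a minimal sketch. It assumes a hypothetical `ProvenanceStore` and a hypothetical `with_provenance` wrapper (neither name comes from the presentation): each task is wrapped by a module that records what ran, on which inputs and with which output, without relying on the provenance schema of any particular WfMS.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ProvenanceStore:
    """Hypothetical WfMS-independent store: a plain list of records."""
    records: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, record: Dict[str, Any]) -> None:
        self.records.append(record)


def with_provenance(store: ProvenanceStore, task_name: str,
                    task: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a task so that every call is recorded in the same representation."""
    def wrapped(*args: Any) -> Any:
        started = time.time()
        result = task(*args)
        store.add({
            "task": task_name,
            "inputs": list(args),
            "output": result,
            "started": started,
            "finished": time.time(),
        })
        return result
    return wrapped


# Illustrative stand-in for a real workflow task (not part of OrthoSearch).
def count_sequences(fasta_text: str) -> int:
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


store = ProvenanceStore()
tracked = with_provenance(store, "count_sequences", count_sequences)
n_sequences = tracked(">seq1\nACGT\n>seq2\nTTGA\n")
```

Because the wrapper only needs a callable, the same record layout could in principle be produced whichever WfMS schedules the task.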

13
Scientific Workflow Control Flows
  • A small set of generic workflow-level control
    modules
  • Based on workflow patterns by Van der Aalst et al.

14
Scientific Workflow Control Flows
[Figure: OrthoSearch workflow. Labels: COGs DB, MAFFT,
hmmbuild, HMMER, Implicit DECISION, hmmcalibrate,
Ptn DB, hmmsearch, hmmpfam, Reciprocals Best Hits
Finder, BLAST, fastacmd, formatdb, Reannotated genes,
InterPRO]
15
Scientific Workflow with Explicit Control Flows
[Figure: workflow with explicit control modules.
Labels: Initial condition, MUX, MAFFT, hmmbuild, HMMER,
hmmcalibrate, Explicit LOOP, Explicit DECISION,
IF (T / F), hmmsearch, hmmpfam]
Meta-Workflow: eases migration of a workflow from
one WfMS to another!
  • All these modules can be sent to execute in any
    HPC environment
  • Provenance gathering mechanisms can be inserted
    in the control flow modules or other specific
    modules
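
A minimal sketch of how an explicit LOOP built from a MUX and an IF could register each iteration as provenance. The function name `explicit_loop`, the record fields, and the convention that the T branch feeds back into the loop are assumptions made for illustration; the slide does not specify them.

```python
from typing import Any, Callable, Dict, List, Tuple


def explicit_loop(initial: Any,
                  body: Callable[[Any], Any],
                  keep_going: Callable[[Any], bool],
                  max_iterations: int = 100) -> Tuple[Any, List[Dict[str, Any]]]:
    """Run `body` repeatedly; MUX picks the input, IF decides whether to loop."""
    provenance: List[Dict[str, Any]] = []
    feedback = None
    for iteration in range(max_iterations):
        # MUX: the first iteration takes the initial condition,
        # later iterations take the value fed back from the IF.
        value = initial if iteration == 0 else feedback
        result = body(value)
        # IF: evaluate the logical expression on the result.
        loop_again = keep_going(result)
        provenance.append({"iteration": iteration, "input": value,
                           "output": result, "loop_again": loop_again})
        if not loop_again:
            return result, provenance   # F branch (assumed): leave the loop
        feedback = result               # T branch (assumed): back to the MUX
    raise RuntimeError("loop did not terminate within max_iterations")


# Toy body and condition, standing in for the real workflow tasks.
final, log = explicit_loop(3, body=lambda x: x * 2, keep_going=lambda r: r < 20)
```

Since the iteration records are produced by the control module itself, they are available even when the underlying WfMS has no native loop construct.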

16
Control flow modules on VisTrails
  • All these control flow modules were made
    available on VisTrails
  • More explicit control is now available
  • Remote execution can keep the specified control
  • Remote execution can bring provenance data back
    to VisTrails with a compatible structure

Advantages
17
OrthoSearch on VisTrails
External LOOP (parameter exploration)
  • All these inner modules (sub-workflow) can be
    sent to execute in a grid or HPC environment
  • Provenance gathering mechanisms can be inserted
    in the control flow modules or other specific
    modules
  • In VisTrails the loop could not be implemented
    because it is a DAG-based WfMS
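
A sketch of the external-loop idea, under stated assumptions: the acyclic sub-workflow is represented by a plain function, and the explored parameter is a hypothetical e-value cutoff. Neither `run_subworkflow` nor `explore` is a VisTrails call; they only illustrate that the loop lives outside the DAG and launches one complete run per parameter value.

```python
from typing import Any, Dict, Iterable, List


def run_subworkflow(e_value: float, hits: List[float]) -> List[float]:
    """Stand-in for the inner, acyclic sub-workflow: keep hits under a cutoff."""
    return [h for h in hits if h <= e_value]


def explore(parameters: Iterable[float], hits: List[float]) -> List[Dict[str, Any]]:
    """External loop: one complete sub-workflow run per parameter value."""
    runs = []
    for run_id, e_value in enumerate(parameters):
        kept = run_subworkflow(e_value, hits)
        # One record per run, so each parameter setting can be traced later.
        runs.append({"run": run_id, "e_value": e_value, "kept": kept})
    return runs


runs = explore([1e-10, 1e-5, 1e-1], hits=[1e-12, 1e-6, 1e-3])
```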

18
Scientific Workflow - Heterogeneity
[Figure: OrthoSearch workflow. Labels: COGs DB, MAFFT,
hmmbuild, HMMER, hmmcalibrate, Ptn DB, hmmsearch,
Time consuming, hmmpfam, Reciprocals Best Hits Finder,
BLAST, fastacmd, formatdb, Reannotated genes,
InterPRO]
19
OrthoSearch on VisTrails
  • BLAST modules should be sent to execute on a PC
    cluster
  • Provenance gathering mechanisms can be inserted
    in the control flow modules to be sent to the
    parallel environment
  • In VisTrails this can be achieved using the
    MidMon modules

20
MidMon on VisTrails
Implementation
  • Monitoring tool that checks scientific processes
    running on distributed environments
  • Message exchange-based tool
  • Decoupled, with a modular infrastructure
  • Support for legacy applications on distributed
    resources
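
The message-exchange idea can be sketched as follows. This is not MidMon's API: an in-process `queue.Queue` stands in for whatever transport the tool uses, and the function names and status strings are assumptions. A wrapped job posts status messages, and a monitor on the scientist's side collects them until the job reports completion.

```python
import queue
import threading
from typing import Dict, List


def remote_job(job_id: str, outbox: "queue.Queue[Dict[str, str]]") -> None:
    """Stand-in for a legacy application wrapped so that it reports its status."""
    outbox.put({"job": job_id, "status": "started"})
    outbox.put({"job": job_id, "status": "running"})
    outbox.put({"job": job_id, "status": "finished"})


def monitor(inbox: "queue.Queue[Dict[str, str]]", timeout: float = 5.0) -> List[str]:
    """Collect status messages until the job reports that it has finished."""
    seen: List[str] = []
    while True:
        message = inbox.get(timeout=timeout)  # raises queue.Empty if the job goes silent
        seen.append(message["status"])
        if message["status"] == "finished":
            return seen


channel: "queue.Queue[Dict[str, str]]" = queue.Queue()
worker = threading.Thread(target=remote_job, args=("blast-1", channel))
worker.start()
statuses = monitor(channel)
worker.join()
```

The job and the monitor share only the message format, which is what keeps the monitored application decoupled from the tool that watches it.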

Control Modules
Data Modules
BLAST
21
Concluding
  • We share the same motivation as Bowers et al.,
    Goderis et al. and Tudruj et al.
  • And the same as Groth et al.
  • We propose
  • A set of generic control-flow structures
    independent of WfMS
  • Our implementation has shown that
  • Control-flow structures can allow generic
    sub-workflow remote execution
  • Remote process provenance can be captured in the
    same representation as the workflow
  • Workflow refactoring is facilitated
  • Control-flow structures can be coupled to
    monitoring middleware

Using explicit control flow
Provenance independent of a WfMS
22
Conclusion
  • Distribution and heterogeneity are inevitable in
    scientific workflows
  • Adding control-flow modules to the scientific
    workflow specification can help execution by
    heterogeneous WfMS running on distributed
    environments
  • Acts as documentation of the execution control
    workflow
  • Allows the activities of the workflow to be
    evaluated and monitored
  • Helps to gather provenance from heterogeneous and
    independent environments with low programming
    effort
  • MidMon on top of VisTrails
  • Enables scientists to monitor the status of
    submitted jobs on their desktops
  • Preserves the workflow's original features

23
Future work
  • Use workflow views, e.g. ZOOM
  • Our solution makes the workflow very verbose
  • Use software component reuse and refactoring
    techniques to help the automatic incorporation of
    these modules
  • "Using Provenance to Improve Workflow Design"
    (Tosta et al.)
  • Work with other workflows from bioinformatics and
    oil industry

24
Using explicit control processes in distributed
workflows to gather provenance
Thanks!
  • Sergio M. S. da Cruz
  • Fernando Seabra Chirigati
  • Rafael Dahis
  • Maria Luiza M. Campos
  • Marta Mattoso
  • Federal University of Rio de Janeiro, Brazil

26
Scientific Workflow Control Flows
  • A small set of generic workflow-level control
    modules
  • Based on workflow patterns by Van der Aalst et al.

MUX: Describes a convergence of two or more
input ports into a single branch.
27
Scientific Workflow Control Flows
DEMUX: Represents an incoming branch that diverges
into two or more parts. Just one of the outgoing
branches is enabled, depending on an associated
condition.
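
A minimal sketch of the MUX and DEMUX behaviour described on these slides. The representation is an assumption: `None` marks a disabled port, the DEMUX condition is a selector function returning a port index, and the MUX is taken to expect data on exactly one input port, which the slides do not state.

```python
from typing import Any, Callable, List, Optional


def mux(inputs: List[Optional[Any]]) -> Any:
    """Converge several input ports into one branch: forward the one carrying data."""
    available = [value for value in inputs if value is not None]
    if len(available) != 1:
        raise ValueError("expected exactly one input port to carry data")
    return available[0]


def demux(value: Any, selector: Callable[[Any], int],
          n_outputs: int) -> List[Optional[Any]]:
    """Diverge one branch into n_outputs ports; only the selected port is enabled."""
    chosen = selector(value)
    if not 0 <= chosen < n_outputs:
        raise ValueError("selector returned an unknown output port")
    return [value if port == chosen else None for port in range(n_outputs)]


# Route even numbers to port 0 and odd numbers to port 1, then merge again.
outputs = demux(42, selector=lambda v: 0 if v % 2 == 0 else 1, n_outputs=2)
merged = mux(outputs)
```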
28
Scientific Workflow Control Flows
STRING CONTROL: The workflow is divided into two or
more branches, and just one of them can be
enabled; the other outgoing branches are withdrawn.
29
Scientific Workflow Control Flows
NUMBER CONTROL: All output data are
produced simultaneously.
30
Scientific Workflow Control Flows
NUMBER COMPARE: Two or more incoming
branches become one outgoing branch, which is
enabled only after all the input data have
arrived.
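
The synchronisation behaviour described above can be sketched as a small join object. The class name `Synchronizer`, its `arrive` method, and the file names in the usage lines are hypothetical; only the rule "enable the output once every incoming branch has delivered its data" is taken from the slide.

```python
from typing import Any, Dict, List, Optional


class Synchronizer:
    """Join several incoming branches; the output is enabled once all have arrived."""

    def __init__(self, branch_names: List[str]) -> None:
        self._pending = set(branch_names)
        self._data: Dict[str, Any] = {}

    def arrive(self, branch: str, value: Any) -> Optional[Dict[str, Any]]:
        """Register data from one branch; return all inputs when the last one arrives."""
        if branch not in self._pending:
            raise KeyError(f"unexpected or repeated branch: {branch}")
        self._pending.remove(branch)
        self._data[branch] = value
        return dict(self._data) if not self._pending else None


join = Synchronizer(["hmmsearch", "hmmpfam"])
first = join.arrive("hmmsearch", "search.out")   # still waiting for hmmpfam
second = join.arrive("hmmpfam", "pfam.out")      # all inputs present
```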
31
Scientific Workflow Control Flows
IF: Same pattern as the DEMUX, with two
differences: IF has only two input ports, and it
has a logical expression in which scientists can
create any condition they need.
32
MidMon
  • Offers a generic and lightweight monitoring tool
    that checks scientific processes running on
    distributed environments
  • Message exchange-based, two-layered modular
    infrastructure
  • Decoupled and lightweight, crossing different
    network boundaries
  • Easy to deploy and manage
  • Support for legacy applications on distributed
    resources

33
MidMon Monitoring Data
  • Task state data that it may be possible to monitor
  • Data that it may be possible to monitor about the
    state of the environment
  • Data that it may be possible to monitor about
    service availability

34
MidMon State Data
  • List of task state data that it may be possible
    to monitor
  • Progress of a service - Relies on checkpoints
    within the service, or the service may be able to
    provide an estimate of its progress
  • Completion of a service - This could be a simple
    event indicating that the service has produced
    its complete output file
  • Data consumption rate of a service - A
    measure of the rate at which the service is
    consuming data from its input file
  • Data production rate of a service - A
    measure of the rate at which the service is
    generating data for its output file
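
As a worked illustration of the state data listed above, the sketch below derives consumption rate, production rate and a checkpoint-based progress estimate from two samples of a running service. The `ServiceSample` fields and the sample values are assumptions; the presentation does not say how MidMon represents these measurements.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ServiceSample:
    """One observation of a running service (all field names are illustrative)."""
    timestamp: float        # seconds since the service started
    bytes_read: int         # cumulative bytes consumed from the input file
    bytes_written: int      # cumulative bytes produced in the output file
    checkpoints_done: int   # checkpoints passed so far


def rates(previous: ServiceSample, current: ServiceSample) -> Tuple[float, float]:
    """Return (consumption, production) rates in bytes per second."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return ((current.bytes_read - previous.bytes_read) / elapsed,
            (current.bytes_written - previous.bytes_written) / elapsed)


def progress(sample: ServiceSample, total_checkpoints: int) -> float:
    """Fraction of checkpoints passed, used as a progress estimate."""
    return sample.checkpoints_done / total_checkpoints


s0 = ServiceSample(timestamp=0.0, bytes_read=0, bytes_written=0, checkpoints_done=0)
s1 = ServiceSample(timestamp=10.0, bytes_read=5000, bytes_written=1000, checkpoints_done=2)
consumption_rate, production_rate = rates(s0, s1)
fraction_done = progress(s1, total_checkpoints=4)
```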

35
MidMon State of the environment
  • Useful data that it may be possible to monitor
    about the state of the environment:
  • Available execution nodes - This could be a list
    of changes in the available execution nodes in
    the environment
  • Load on an execution node - A measure of
    the load on an execution node. It could be a
    single measure, a tuple, or a composite of, e.g.,
    the CPU load, the number of processes, and the
    free resources of the execution node
  • Load on a network link - A measure of
    the usage of a network link, in terms of the
    available bandwidth
  • Memory usage on an execution node - A
    measure of the memory usage on an execution
    node

36
MidMon Service availability
  • Useful data that it may be possible to monitor
    about service availability:
  • Available services - This could be a list of the
    services available as mapping targets for tasks
    in a workflow. The data could also include, e.g.,
    the status of services currently deployed
  • Available data resources - This could be a list of
    the data resources available as mapping targets
    for inputs and outputs in a workflow

37
OrthoSearch SSH version
  • Without Control-Flow modules

38
OrthoSearch on Kepler 1/3
[Figure labels: hmmSearch, hmmPFam]
39
OrthoSearch on Kepler 2/3
[Figure labels: FormatDB, FastaCmd]
40
OrthoSearch on Kepler 3/3
[Figure label: InterPro]