Scientific Workflows as Configurable, Change-Resilient Data Transducers

1
Scientific Workflows as Configurable,
Change-Resilient Data Transducers
UC Davis Team: Dr. Shawn Bowers, Dr. Tim
McPhillips, Dr. Norbert Podhorszki, Dr. Carlos
Rueda, Manish Anand, Saumen Dey, Dave Thau, Daniel
Zinn
  • Bertram Ludäscher
  • Dept. of Computer Science
  • Genome Center
  • University of California, Davis
  • ludaesch_at_ucdavis.edu

2
SUMMARY
  • Déjà Vu
  • Scientific workflow: the CI upperware of
    eScience
  • Scientific workflows: why?
  • Diversity of scientific workflow (one size fits
    all?)
  • eScience Collaborations using
  • d-WS, a-WS, w-WS!
  • is this the W3C Gone Wild?? (W3C-GW)
  • Scientific Workflows are Forever (YMMV)
  • Facilitating sharing, design through evolution
  • The challenge
  • The Complex, the Brittle, the Unsustainable
    (CBUs) wfs
  • GOTO still considered harmful
  • A possible solution
  • Optimize human time (it's about us, isn't it?)
  • Change-resilience
  • Employ data coherence
  • VALs (Virtual Assembly Lines) COMAD
  • Optimizing dataflow (cpu-time)
  • Kepler/CORE

3
Scientific Workflows: Cyberinfrastructure
UPPER-WARE
4
Why Scientific Workflow?
  • Capture how a scientist works with data and
    analytical tools
  • data access, transformation, analysis,
    visualization
  • possible worldview: dataflow-oriented (cf.
    signal-processing)
  • Scientific workflow (wf) benefits (compare w/
    script-based approaches)
  • wf automation
  • wf component reuse
  • wf design, documentation
  • wf archival, sharing
  • built-in concurrency
  • (task-, pipeline-parallelism)
  • built-in provenance support
  • distributed parallel exec
  • Grid & cluster support
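
The pipeline parallelism listed above can be illustrated with a minimal Python sketch. This is not Kepler code: threads connected by queues stand in for actors connected by channels, and the names `stage`, `run_pipeline`, and `DONE` are hypothetical.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker


def stage(func, inbox, outbox):
    """Run one actor: apply func to each token and forward the result."""
    while True:
        token = inbox.get()
        if token is DONE:
            outbox.put(DONE)  # propagate end-of-stream downstream
            return
        outbox.put(func(token))


def run_pipeline(funcs, tokens):
    """Chain the functions as concurrently running stages."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, channels[i], channels[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for token in tokens:
        channels[0].put(token)
    channels[0].put(DONE)
    results = []
    while True:
        out = channels[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Each stage can start on the next token while downstream stages are still busy with earlier ones, which is the point of pipelining; with a script-based approach this wiring would have to be written by hand.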

5
Kepler Data Access via the EcoGrid
Data QuickSearch Tab
Metadata Keyword Search
Access Multiple EcoGrid Sources
Return Data Sets as Actors to Drag-Drop to
Canvas
6
Kepler Actor Semantic-Type Annotation
  • Actor input/output port annotation
  • Each port can be annotated with multiple classes
    from multiple ontologies
  • Annotations are stored with actor metadata (MOML)
  • Actors can be discovered, validated, etc., via
    their semantic types
  • Citations here
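
As a rough illustration of how semantic port annotations could support validation, here is a small Python sketch. The ontology, the `Port` class, and the compatibility rule are all assumptions made for this example; they are not taken from Kepler or MOML.

```python
from dataclasses import dataclass, field

# Hypothetical toy ontology: each class maps to its direct superclasses.
ONTOLOGY = {
    "BacterialAbundance": {"Abundance"},
    "Abundance": {"Measurement"},
    "Measurement": set(),
    "Temperature": {"Measurement"},
}


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or reaches it via superclass links."""
    if cls == ancestor:
        return True
    return any(is_subclass(parent, ancestor) for parent in ONTOLOGY.get(cls, ()))


@dataclass
class Port:
    name: str
    semantic_types: set = field(default_factory=set)


def compatible(output_port, input_port):
    """Assumed rule: every class required by the input must be satisfied
    by some class annotated on the output."""
    return all(
        any(is_subclass(out_cls, in_cls) for out_cls in output_port.semantic_types)
        for in_cls in input_port.semantic_types
    )
```

Under this rule, an output annotated `BacterialAbundance` may feed an input that expects any `Measurement`, but not the other way around.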

7
Kepler Actor Library
  • Actor Annotations for Indexing and Classification
  • New actors can be annotated and indexed into the
    component library (e.g., specializing generic
    actors)
  • Existing components can also be revised,
    annotated, and indexed (hiding previous versions)
  • Quick search leverages metadata, including
    annotations & ontologies

8
Building a simple workflow in Kepler
  • Select actors from Kepler actor library
  • Local or remote actors
  • View actor metadata/documentation (not shown)
  • Drag desired actor to canvas
  • Connect actor ports

other actor examples
9
Building a simple workflow in Kepler
  • Select input data
  • Shown here is an EcoGrid for bacterial
    abundance
  • Connect data actors to workflow inputs

many ways to import data
10
Building a simple workflow in Kepler
  • Using EcoGrid data sources
  • Metadata (EML) can be displayed
  • Data can be queried via SQL/QBE interface
  • Data set here is a tab-delimited file

11
Building a simple workflow in Kepler
  • Run the workflow
  • Also set parameters, select and configure the
    director, run window, etc.

12
Scientific workflows are CI upper-ware, i.e.
the scientist's way to harness
cyberinfrastructure
  • Domain Scientist's View
  • Q: When is CI (middle-ware, under-ware) good?
  • A: When I can't see it!
  • Q: When is a scientific workflow tool (CI
    upper-ware) good?
  • A: When I can get more, new, faster, better
    science done!
  • Workflow Engineer's View
  • How can I (help the scientist) design & implement
    the desired wfs?
  • How does wf make my life easier? Is there life
    beyond Perl & Python?
  • Choice of platforms, standards, reuse of existing
    tools, semantic extensions, scheduling on the
    Grid?
  • How do I make all of this robust, fault-tolerant,
    etc.?
  • Computer Scientist's View
  • workflow modeling & design, static analysis,
    optimization, theoretical limits: what can /
    can't be done
  • The quest for the right models & languages:
    Workflow Thinking
  • The holy grail of eScience: Join the Quest!

13
Rough taxonomy of (overlapping) workflow types
  • Desktop / discovery workflows
  • analysis/method-intensive, R, Matlab, custom
    algorithms
  • e.g. bioinformatics, genomics, phylogenetics
  • exploratory workflow, rapidly evolving
  • need data & workflow provenance
  • Plumbing workflows
  • data-intensive, e.g. moving TBs from
    ORNL (compute) to LBL/NERSC (archive)
  • Production workflow: reliable, fault-tolerant,
    high-throughput, runtime monitoring
  • HPC workflows
  • cpu-intensive, need to utilize a local cluster
    or distributed Grid, e.g. Ecological Niche
    Modeling, parameter studies
  • Parallel/distributed workflow
  • Streaming workflows
  • (near) real-time processing and data analysis
  • distributed setting

14
Simple Kepler workflow using R (reuse, don't
reinvent)
15
Discovery Workflow: Ecological Niche Modeling
Slide: Matt Jones
16
Ex: SEEK Ecological Niche Modeling Pipeline
  • Scientific Workflow paradigm
  • Reusable components (actors): a scientist's
    verbs/actions
  • Top-level workflows: conceptual representation
    of the science process, sentences in the
    scientist's language
  • Sub-workflows: increasing levels of detail
  • Separation of concerns
  • actors: what to do
  • parameters: configurable behavior
  • channels: dataflow, pipeline composition
  • directors: fix execution model, scheduling
  • semantic types: smart discovery, linking

D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
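
The separation of concerns on this slide (actors, parameters, channels, directors) can be sketched in a few lines of Python. This is a toy model, not the Kepler or Ptolemy API; `Actor`, `SequentialDirector`, `scale`, and `offset` are hypothetical names.

```python
class Actor:
    """What to do: wraps a function; parameters configure its behavior."""

    def __init__(self, func, **parameters):
        self.func = func
        self.parameters = parameters

    def fire(self, token):
        return self.func(token, **self.parameters)


class SequentialDirector:
    """How to execute: fires each actor in order, one token at a time."""

    def run(self, actors, tokens):
        for token in tokens:
            for actor in actors:  # the list order plays the role of channels
                token = actor.fire(token)
            yield token


def scale(x, factor=1.0):
    return x * factor


def offset(x, delta=0.0):
    return x + delta


workflow = [Actor(scale, factor=2.0), Actor(offset, delta=1.0)]
results = list(SequentialDirector().run(workflow, [1.0, 2.0]))
```

Because the actors know nothing about scheduling, a different director (for example a threaded one) could execute the same `workflow` list unchanged.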
17
Inferring a phylogenetic tree from disparate data
[Figure: workflow diagram. Labels: Aligned DNA sequences; Maximum likelihood tree (DNA); Discrete morphological data; Maximum parsimony tree; Continuous characters; Maximum likelihood tree (continuous characters); Integrate; Consensus Tree(s). Legend: Actors, Datasets.]
18
Plumbing (1/2): Archive migration workflow
Stage from NERSC HPSS to local disk, transfer
to ORNL disk, store at ORNL HPSS
Moved 10 TB of data from NERSC archive to ORNL
archive in 11 days (network issues, bugs, and
more)
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
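
For a sense of scale, the average rate implied by "10 TB in 11 days" can be worked out as follows. The sketch assumes decimal units (1 TB = 10**12 bytes) and a continuous 11-day transfer, neither of which the slide states.

```python
def average_throughput_mb_per_s(terabytes, days):
    """Average rate in MB/s, assuming decimal units (1 TB = 10**12 bytes)."""
    total_bytes = terabytes * 10**12
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 10**6


rate = average_throughput_mb_per_s(10, 11)  # roughly 10.5 MB/s
```

That average is far below what the network hardware could sustain, which is consistent with the slide's remark about network issues and bugs.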
19
Under the hood: pipeline-parallel processing
Norbert Podhorszki (UC Davis)
20
Plumbing (2/2): SDM/CPES (fusion simulation)
ORNL
40 GB/s
HPSS
Norbert Podhorszki (UC Davis), Scott Klasky
(ORNL)
Command Control site
21
  • Plumbing workflow
  • to accomplish all these tasks
  • 50 composite actors (subworkflows)
  • 4 levels of hierarchy
  • 1000 atomic (Java) actors

Norbert Podhorszki UC Davis, soon ORNL
22
Enabling e-Science Collaboration
  • Workflows: It's about sharing
  • standing on the shoulders of giants
  • of course after you got your paper, grant, etc. in
  • myExperiment ? Kepler repository

23
Déjà vu (MS eSW06): Workflow component
repositories: myExperiment, your library, Our
Workflow Repository!
  • Taverna Repository, Kepler Repository

Need W-WS! Workflows Worth Sharing
24
Enabling e-Science Collaboration
  • d-WS, a-WS, w-WS (W3C gone wild??)
  • Data Worth Sharing
  • Actors (components) Worth Sharing
  • Workflows Worth Sharing
  • Scientific Workflows are Forever (YMMV)
  • Facilitating sharing, design through evolution
  • The challenge
  • The Complex, the Brittle, the Unsustainable
    (CBUs) wfs
  • GOTO still considered harmful

25
Behold the Beauty of Scientific Workflows
Author: Kristian Stevens, UC Davis
26
the ugly truth inside (CBUs)
Author: Kristian Stevens, UC Davis
27
But how do we get from messy to neat reusable
designs?
28
Scientific Workflow Modeling & Design
And that's why our scientific workflows are
much easier to develop, understand, reuse and
maintain!
29
The Joy of Exa-Scale Cyberinfrastructure
  • Are we working at the right level of abstraction?
  • Are we optimizing the right thing?
  • Optimize human cycles, not just CPU cycles!
  • cf. John McCarthy (of AI/LISP fame)
  • → Make data & scientific workflows effectively
    (re-)usable for scientists
  • Make workflows first-class, shareable knowledge
    artifacts
  • cf. myExperiment!
  • Importance of user-oriented workflow design!
    (and provenance)

30
A Problem: Evolving Workflows
Daniel Zinn (UC Davis)
31
What we want: Simple Analysis Pipelines
Author: Tim McPhillips, UC Davis
32
Ford Assembly Line (2 x 2 - DX Pipe)
  • Actors move along the buffet (data)
  • pick up data
  • may put data
  • task & pipeline parallel (if very hungry, also
    data parallel possible)
  • actors are configurable
  • passing the buck on irrelevant data
  • → Resilient to change!
  • Equivalent view
  • Who moves?
  • line up actors
  • roll buffet (data) past actors
  • → COMAD/VAL model

33
The Answer (YMMV)
  • Virtual Assembly Lines Paradigm (VALs)
  • Embrace the assembly line metaphor fully
  • → cf. Flow-based Programming (J. Morrison)
  • Collection-Oriented Modeling & Design (COMAD)
  • Data: tagged nested collections
  • pipelined (XML) token streams
  • passing the buck on what's not in your scope
Timothy McPhillips UC Davis
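
One way to picture "tagged nested collections" flowing as a token stream, with each actor handling only its own scope, is the following Python sketch. The token format and the names `flatten` and `scoped_actor` are assumptions for illustration; they are not the COMAD implementation.

```python
def flatten(tag, items):
    """Turn a nested (tag, [items]) collection into a flat token stream."""
    yield ("open", tag)
    for item in items:
        if isinstance(item, tuple):  # nested collection: (tag, [items])
            yield from flatten(*item)
        else:
            yield ("data", item)
    yield ("close", tag)


def scoped_actor(stream, scope, func):
    """Apply func to the data of each `scope` collection and append the
    result inside it; pass all other tokens through untouched."""
    stack, buffers = [], []
    for kind, value in stream:
        if kind == "open":
            stack.append(value)
            if value == scope:
                buffers.append([])
        elif kind == "data" and stack and stack[-1] == scope:
            buffers[-1].append(value)
        elif kind == "close":
            if value == scope:
                yield ("data", func(buffers.pop()))
            stack.pop()
        yield (kind, value)


stream = flatten("project", [("seqs", [3, 4]), ("notes", ["x"]), ("seqs", [10])])
out = list(scoped_actor(stream, "seqs", sum))

# Wrapping the same data in an extra collection level does not break the actor.
wrapped = flatten("lab", [("project", [("seqs", [3, 4])])])
out_wrapped = list(scoped_actor(wrapped, "seqs", sum))
```

The actor never inspects `notes` or the enclosing collections; it simply forwards them. That is the sense in which "passing the buck" makes a pipeline tolerant of changes to the surrounding data structure.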
34
Virtual Assembly Lines (VAL/COMAD)
Daniel Zinn (UC Davis)
35
COMAD / VAL, hints at the secret sauce
  • Scope your work and pass the buck!
  • Let go! (often stateless actors)
  • To maximize concurrency / minimize latency
  • futures & promises (holes as placeholders)
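
The slide does not say how holes are implemented. As a loose analogy only, Python's standard futures show the general idea: a placeholder is handed downstream at once and filled in later. The names `slow_square`, `emit_with_holes`, and `fill_holes` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an expensive computation."""
    return x * x


def emit_with_holes(values, executor):
    """Return immediately with futures standing in for unfinished results."""
    return [executor.submit(slow_square, v) for v in values]


def fill_holes(tokens):
    """Resolve each placeholder only when its value is actually needed."""
    return [t.result() for t in tokens]


with ThreadPoolExecutor(max_workers=4) as pool:
    holes = emit_with_holes([1, 2, 3], pool)
    filled = fill_holes(holes)
```

Downstream code can keep forwarding the placeholders without waiting, which keeps latency low until some consumer really needs the value.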

36
Conventional vs Assembly Line Delta-XML
Thinking
Daniel Zinn (UC Davis)
37
Conceptual Pipeline w/ Scopes & Types
Daniel Zinn (UC Davis)
38
What we got: Simple Change-Resilient Pipelines
Author: Tim McPhillips, UC Davis
Look Ma, No Shims!
39
Result: Change-Resilience (Wf graph)
[Figure: workflow graphs with node labels A, B, C, X, S, R, W; panels "Original" and "Automatic Configuration"; annotation "Infer Configuration X of X".]
Daniel Zinn (UC Davis)
40
Input: Change-Resilience (nested data types)
S. Bowers, Daniel Zinn (UC Davis)
41
Optimizing VAL/COMAD: User vs. System View
Daniel Zinn (UC Davis)
42
X-CSR (XML Scissor): Cut-Ship-Reassemble
submitted for publication: Daniel Zinn, Shawn
Bowers, Bertram Ludaescher (UC Davis)
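
The slide gives only the name "Cut-Ship-Reassemble", so the following Python sketch is one plausible reading rather than the authors' method: cut out the fragments an actor needs, leave numbered holes behind, send only the fragments for processing, and splice the results back. All function names and the `<hole>` element are assumptions.

```python
import xml.etree.ElementTree as ET


def cut(root, tag):
    """Replace each direct child named `tag` with a numbered <hole/> and
    return the cut-out fragments keyed by hole id."""
    fragments = {}
    for index, child in enumerate(list(root)):
        if child.tag == tag:
            hole_id = str(len(fragments))
            fragments[hole_id] = child
            root.remove(child)
            root.insert(index, ET.Element("hole", {"id": hole_id}))
    return fragments


def ship(fragments, func):
    """Stand-in for remote processing: serialize, process, return."""
    processed = {}
    for hole_id, fragment in fragments.items():
        wire = ET.tostring(fragment)  # what would cross the network
        processed[hole_id] = func(ET.fromstring(wire))
    return processed


def reassemble(root, processed):
    """Put each processed fragment back where its hole was left."""
    for index, child in enumerate(list(root)):
        if child.tag == "hole":
            root.remove(child)
            root.insert(index, processed[child.get("id")])
    return root


def upper_text(elem):
    """Example actor: upper-case the text of a fragment."""
    elem.text = (elem.text or "").upper()
    return elem


doc = ET.fromstring("<run><meta>keep</meta><seq>acgt</seq><seq>ttga</seq></run>")
result = reassemble(doc, ship(cut(doc, "seq"), upper_text))
result_xml = ET.tostring(result, encoding="unicode")
```

Only the `seq` fragments are serialized; the `meta` element never leaves the local document, which is where the potential saving in shipped data comes from.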
43
Language Abstractions Modeling Design
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-oriented Modeling & Design framework
    (COMAD)

"The limitations of my modeling language are the
limitations of my design world." (BL)
44
Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
  • new development at the intersection of computer
    science and the sciences a leap from the
    application of computing to support scientists to
    do science (i.e. computational science) to
    the integration of computer science concepts,
    tools and theorems into the very fabric of
    science. We believe this development
    represents the foundations of a new revolution in
    science
  • we believe computer science is poised to become
    as fundamental to biology as mathematics has
    become to physics
  • to understand cells and cellular systems
    requires viewing them as information processing
    systems, as evidenced by the fundamental
    similarity between molecular machines of the
    living cell and computational automata, and by
    the natural fit between computer process algebras
    and biological signalling and between
    computational logical circuits and regulatory
    systems in the cell
  • We highlight that an immediate and important
    challenge is that of end-to-end scientific data
    management, from data acquisition and data
    integration, to data treatment, provenance and
    persistence.
  • dramatic in its impact, will be the integration
    of new conceptual and technological tools from
    computer science into the sciences.

45
Consilience: The Unity of Knowledge (E. O. Wilson)
  • "Literally a jumping together of knowledge by the
    linking of facts and fact-based theory across
    disciplines to create a common groundwork for
    explanation." E.O.Wilson
  • eScience, Cyberinfrastructure, CS mechanisms
  • to make progress
  • Scientific Workflows
  • crucial elements to get the most mileage out of
    CI to fuel eScience, accelerating knowledge
    discovery
  • Need good workflow repositories!
  • Workflows Worth Sharing (Workflows are
    Forever)
  • Importance of Workflow Design, Reuse through
    Evolution, Change-Resilience
  • Wir müssen wissen, wir werden wissen!
  • We must know, we will know! -- D. Hilbert

46
New NSF/SDCI Project: Kepler/CORE
Phylogenetics
Astronomy
Library Science
Ecology
Conservation Biology
Oceanography
Geosciences
Molecular Biology
Chemistry
Particle Physics
47
Thank You!
ludaesch_at_ucdavis.edu
  • Invitation: Join the Workflow Community (e.g.
    become a Kepler member)
  • New NSF/SDCI Kepler/CORE grant
  • Refactoring the software to make extensions,
    customization, deployment easy
  • Open process, joint ownership
  • Kepler users, developers (core vs. extensions),
    stakeholders

48
Related References
  • Scientific Workflows: More e-Science Mileage from
    Cyberinfrastructure, Bertram Ludäscher, Shawn
    Bowers, Timothy McPhillips, Norbert Podhorszki.
    Workshop on Scientific Workflows and Business
    workflow standards in e-Science at eScience'06,
    Amsterdam, December, 2006.
  • Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data,
    Timothy McPhillips, Shawn Bowers, Bertram
    Ludäscher. 3rd International Workshop on Data
    Integration in the Life Sciences (DILS'06),
    European Bioinformatics Institute (EBI), Hinxton,
    UK, July 20-22, 2006.
  • Project Histories: Managing Data Provenance
    Across Collection-Oriented Scientific Workflow
    Runs, Shawn Bowers, Timothy McPhillips, Martin
    Wu, Bertram Ludäscher. 4th Intl. Workshop on Data
    Integration in the Life Sciences (DILS'07),
    University of Pennsylvania, Philadelphia, June
    27-29, 2007.
  • Actor-Oriented Design of Scientific Workflows,
    Shawn Bowers and Bertram Ludäscher, 24th Intl.
    Conference on Conceptual Modeling (ER'05),
    Klagenfurt, Austria, LNCS, Springer, 2005
  • D. Zinn, S. Bowers, B. Ludäscher, Dataflow
    Optimization for Distributed XML Stream
    Processors submitted for publication
  • Provenance in Collection-Oriented Scientific
    Workflows, Shawn Bowers, Timothy McPhillips,
    Bertram Ludäscher. Concurrency and Computation:
    Practice & Experience, special issue on the First
    Provenance Challenge, 2007, in press.
  • Workflow Automation for Processing Plasma Fusion
    Simulation Data, Norbert Podhorszki, Bertram
    Ludäscher, Scott Klasky. 2nd Workshop on
    Workflows in Support of Large-Scale Science
    (WORKS'07), Monterey Bay, California, June 25,
    2007.
  • Scientific Workflow Management and the Kepler
    System, B. Ludäscher, I. Altintas, C. Berkley, D.
    Higgins, E. Jaeger-Frank, M. Jones, E. Lee, J.
    Tao, Y. Zhao, Concurrency and Computation:
    Practice & Experience, 18(10), pp. 1039-1065,
    2006.

49
More References
  • Semantic Type Annotation
  • S Bowers, B Ludaescher. A Calculus for
    Propagating Semantic Annotations through
    Scientific Workflow Queries. ICDE Workshop on
    Query Languages and Query Processing (QLQP),
    LNCS, 2006.
  • S Bowers, B Ludaescher. Towards Automatic
    Generation of Semantic Types in Scientific
    Workflows. International Workshop on Scalable
    Semantic Web Knowledge Base Systems (SSWS), WISE
    2005 Workshop Proceedings, LNCS, 2005.
  • C Berkley, S Bowers, M Jones, B Ludaescher, M
    Schildhauer, J Tao. Incorporating Semantics in
    Scientific Workflow Authoring. SSDBM, 2005.
  • B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B
    Brodaric, C Baru. Managing Scientific Data: From
    Data Integration to Scientific Workflows. GSA
    Today, Special Issue on Geoinformatics, 2006.
  • S Bowers, D Thau, R Williams, B Ludaescher. Data
    Procurement for Enabling Scientific Workflows: On
    Exploring Inter-Ant Parasitism. VLDB Workshop on
    Semantic Web and Databases (SWDB), 2004.
  • S Bowers, K Lin, B Ludaescher. On Integrating
    Scientific Resources through Semantic
    Registration. SSDBM, 2004.
  • S Bowers, B Ludaescher. An Ontology-Driven
    Framework for Data Transformation in Scientific
    Workflows. International Workshop on Data
    Integration in the Life Sciences (DILS), LNCS,
    2004.
  • S Bowers, B Ludaescher. Towards a Generic
    Framework for Semantic Registration of Scientific
    Data. International Semantic Web Conference
    Workshop on Semantic Web Technologies for
    Searching and Retrieving Scientific Data, 2003.
  • Workflow Design and Modeling
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), LNCS, 2006.
  • S Bowers, T McPhillips, B Ludaescher, S Cohen, SB
    Davidson. A Model for User-Oriented Data
    Provenance in Pipelined Scientific Workflows.
    International Provenance and Annotation Workshop
    (IPAW), LNCS, 2006.
  • S Bowers, B Ludaescher, AHH Ngu, T Critchlow.
    Enabling Scientific Workflow Reuse through
    Structured Composition of Dataflow and
    Control-Flow. IEEE Workshop on Workflow and Data
    Flow for Scientific Applications (SciFlow), 2006.
  • S Bowers, B Ludaescher. Actor-Oriented Design of
    Scientific Workflows. International Conference on
    Conceptual Modeling (ER), LNCS, 2005.
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • Kepler
  • D Pennington, D Higgins, AT Peterson, M Jones, B
    Ludaescher, S Bowers. Ecological Niche Modeling
    using the Kepler Workflow System. Workflows for
    e-Science, Springer-Verlag, to appear.
  • W Michener, J Beach, S Bowers, L Downey, M Jones,
    B Ludaescher, D Pennington, A Rajasekar, S
    Romanello, M Schildhauer, D Vieglais, J Zhang.
    SEEK: Data Integration and Workflow Solutions for
    Ecology. Workshop on Data Integration in the Life
    Sciences (DILS), LNCS, 2005.
  • S Romanello, W Michener, J Beach, M Jones, B
    Ludaescher, A Rajasekar, M Schildhauer, S Bowers,
    D Pennington. Creating and Providing Data
    Management Services for the Biological and
    Ecological Sciences: Science Environment for
    Ecological Knowledge. SSDBM, 2005.