Bertram Ludäscher - PowerPoint PPT Presentation

1
Scientific Workflows: Towards a New Synthesis
for Information Integration
  • Bertram Ludäscher
  • Dept. of Computer Science & Genome Center
  • University of California, Davis
  • ludaesch@ucdavis.edu

2
Outline
  • Demystifying eScience, Cyberinfrastructure (CI), ...
  • Scientific Workflows and CI
  • SWF Examples
  • Scientific Workflow Design
  • Actor-oriented Modeling & Design (AMAD)
  • Semantic types
  • Collection-oriented Modeling & Design (COMAD)
  • SWF Provenance
  • or: "Show me the evidence!"

3
The Diversity & Unity of Science
Natural Sciences: Earth Sciences, Life Sciences, Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses, Understanding, Prediction
in vivo, in vitro, in situ, in silico, ...
compute-intensive
structurally & semantics-intensive
data-intensive
metadata-intensive
4
How we do Science (Src: C. Goble, ISWC'05)
5
e-Science (UK) and Cyberinfrastructure (US)
  • "e-Science is about global collaboration in key
    areas of science and the next generation of
    computing infrastructure that will enable it."
  • Sir John Taylor, Director Office of Science and
    Technology, UK
  • "Cyberinfrastructure is the coordinated aggregate
    of software, hardware and other technologies, as
    well as human expertise, required to support
    current and future discoveries in science and
    engineering. The challenge of Cyberinfrastructure
    is to integrate relevant and often disparate
    resources to provide a useful, usable, and
    enabling framework for research and discovery
    characterized by broad access and 'end-to-end'
    coordination."
  • Fran Berman, San Diego Supercomputer Center, UCSD

6
Example: Current NSF CI Solicitation
Src: Chris Greer, NSF Program Director
7
Integrated Cyberinfrastructure System meeting
the needs of multiple communities (Source: Dr.
Deborah Crawford, Chair, NSF CyberInfrastructure
Working Group)
  • Applications
  • Environmental Science
  • High Energy Physics
  • Biomedical Informatics
  • Geoscience

Development Tools & Libraries
Education and Training
Discovery & Innovation
Grid Services & Middleware
Hardware
8
CI Tools: Attempt at a simplified definition
  • CI = IT that scientists use to get their
    (science) job done
  • Hardware: dedicated cluster, supercomputer time,
    the Grid, ...
  • Pubmed, Citeseer, GoogleScholar, ... (digital
    libraries)
  • email, skype, ... (personal communication)
  • Google, Wikipedia, ... (search & browse)
  • MySQL, PostgreSQL, Oracle, SQL Server, DB2, ...
    (databases)
  • R, MatLab, Statistica, LabView, ... (statistics,
    data mining, ...)
  • Web Portals, Workspaces, Workbenches, ...
  • Virtual instruments, virtual experiments, ...
  • Community aspect
  • shared/standardized file formats, data,
    metadata, ontologies, ...
  • scientific workflows! (PSEs, LIMS → Scientific
    Workflow Systems)

9
A Scientist's view on Cyberinfrastructure
  • In the beginning there was a lot of middleware
  • compute/data-grids, web services, WS-foo,
    SOA-bar, ...
  • running on a widely distributed "underware"
  • i.e., distributed, heterogeneous compute
    platforms on clusters, over LANs & WANs,
    database-backends, etc.
  • only topped by the all-important "Upperware"!
  • i.e., the applications that users/scientists use
    to get stuff done
  • Good CI at the middleware level and below:
  • Thou shalt be invisible!
  • unless you are a workflow engineer
  • Good CI at the upperware level:
  • Thou shalt help me get my job done!
  • → That is what scientific workflow systems shoot
    for

10
Scientific Workflows = User Applications?
  • Q: So is this scientific workflow system
    "upperware" then a fancy name for tools or user
    apps?
  • A: "Tool", "application", etc. are very generic
    terms. They do not convey what SWF systems are
    about, i.e., gluing/plumbing existing data
    management, analysis, and visualization
    components together.
  • Q: Doh! And I heard that "Ontologies: it's the
    GLUE, STUPID!" So what is now the glue??
  • A: Both! Ontologies are about gluing (defining)
    your vocabulary, annotating (and thus
    interlinking) data, etc.
  • Scientific workflows are about gluing components
    (web services, legacy scripts, external tools,
    e.g. R, ...).
  • → SWFs are about application (and data)
    integration; ontologies are about sharing
    meaning via controlled vocabularies & logic
    constraints (capturing some of the semantics);
    this helps, e.g., in data integration.

11
Data Integration: So much data, so little time
Integration across (Sub-)Disciplines, Scales, Time, ...
[Figure labels: Tectonics, Earthquakes, Geology, Aquifers, Moho depth,
Topography, Gravity, Faults, Mines, Focal Mechanisms, Magnetics,
Sediment thickness]
Src: Dogan Seber (SDSC), Krishna Sinha (VT)
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
12
Different Types of Information Integration
  • Spatial (co-)registration/overlay of different
    data
  • from 2D, 3D, 4D (x,y,z,t), (4+n)D
  • → GIS!
  • Conventional (DB-oriented) integration
  • schema-based
  • view-based
  • at the data-level
  • Extended DI approaches using ontologies
  • ontologies? controlled vocabularies?
    metadata/annotations?
  • Application/process integration
  • → scientific workflows
  • can include all the others and
  • statistics, data mining, visualization, ...

13
Interoperability & Integration Challenges
  • System aspects: "Grid" Middleware
  • distributed data & computing, SOA
  • resource discovery, authentication, authorization
  • web services, WSDL/SOAP, WSRF, OGSA, ...
  • (re-)sources = services, files, data sets, nodes, ...
  • Syntax & Structure:
  • (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources = (XML) databases
  • Semantics:
  • Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description
    logics (RDF(S), OWL, ...)
  • sources = knowledge bases (DB+CMs+ICs)
  • Synthesis: Scientific Workflow Design & Execution
  • Composition of declarative and procedural
    components into larger workflows
  • (re)sources = services, processes, actors, ...
  • reconciling S5 heterogeneities
  • gluing together resources
  • bridging information and knowledge gaps
    computationally

14
Information Integration Challenges: S4
Heterogeneities
  • System aspects
  • platforms, devices, data & service distribution,
    APIs, protocols, ...
  • → Grid middleware technologies
  • e.g. single sign-on, platform independence,
    transparent use of remote resources, ...
  • Syntax & Structure
  • heterogeneous data formats (one for each tool
    ...)
  • heterogeneous data models (RDBs, ORDBs, OODBs,
    XMLDBs, flat files, ...)
  • heterogeneous schemas (one for each DB ...)
  • → Database mediation technologies
  • XML-based data exchange, integrated views,
    transparent query rewriting, ...
  • Semantics
  • descriptive metadata, different terminologies,
    "hidden" semantics (context), implicit
    assumptions, ...
  • → Knowledge representation & semantic mediation
    technologies
  • "smart" data discovery & integration
  • e.g. ask about X ("mafic"), find data about Y
    ("diorite"), be happy anyways!

15
Information Integration Challenges: S5
Heterogeneities
  • Synthesis of applications, analysis tools, data
    & query components, ... into "scientific workflows"
  • How to put together components to solve a
    scientist's problem?
  • → Scientific Problem Solving Environments (PSEs)
  • Portals, Workbench (scientist's view)
  • ontology-enhanced data registration, discovery,
    manipulation
  • creation and registration of new data products
    from existing ones, ...
  • Scientific Workflow System (engineer's view &
    scientist's view)
  • for designing, re-engineering, deploying
    analysis pipelines and scientific workflows: a
    tool to make new tools
  • e.g., creation of new data products from
    existing ones, dataset registration, ...

16
Motivation: Scientific Workflows,
Pre-Cyberinfrastructure
  • Data Federation & Grid Plumbing
  • access, move, replicate, query data (Data-Grid)
  • authenticate; SRB Sget/Sput; OPeNDAP,
    Antelope/ORBs
  • schedule, launch, monitor jobs (Compute-Grid)
  • Globus, Condor, Nimrod, APST, ...
  • Data Integration
  • Conceptual querying & integration, structure &
    semantics, e.g. mediation w/ SQL, XQuery + OWL
    (Semantics-enabled Mediator) → SOQL
  • Data Analysis, Mining, Knowledge Discovery
  • manual/scripts, Excel, R, simulations, ...
  • Visualization
  • 3-D (volume), 4-D (spatio-temporal), n-D
    (conceptual views), ...
  • one-of-a-kind custom apps., detached (island)
    solutions
  • workflows are hard to reproduce, maintain
  • no/little workflow design, automation, reuse,
    documentation
  • need for an integrated scientific workflow
    environment

17
Scientific Workflow (SWF)
  • A model of the way a scientist works with their
    data and tools
  • Mentally coordinate data export, import,
    analysis via software systems
  • Emphasize dataflow (similar to, but different
    from, business workflows)
  • Metadata: automatic data ingestion, analysis,
    provenance tracking
  • Goals
  • SWF automation
  • SWF & component reuse
  • SWF design & documentation
  • make scientific data analysis
  • and management tasks easier
  • for the scientist!

18
Types of Scientific Workflows (overlapping)
  • What do we use scientific workflow systems for?
  • Short answer
  • nearly everything
  • Types of Scientific Workflows
  • Modeling & Design: Capture or reverse-engineer
    processes and information flows at all levels
  • Knowledge discovery: Automate repetitive data
    access, retrieval, custom analysis (e.g. Blast),
    generic steps (PCA, cluster analysis, ...), ...
  • Ex: PIW, Motif analysis, NDDP, ...
  • Plumbing: Stage files, submit batch jobs,
    monitor progress, move files off XT3 to analysis
    and viz cluster, archive, steer computation, ...
  • Ex: Fusion simulation, Astrophysics (supernova
    simulation)
  • (Real-time) analysis pipelines: processing of
    environmental and earth science data from sensor
    networks
  • (→ NEON, ORION, Earthscope, ...)

19
Promoter Identification Workflow (PIW)
... or: from a napkin
drawing ...
Source: Matt Coleman (LLNL)
20
... to an executable workflow (here: in KEPLER)
21
Retrieving gene sequences via web services
This entire workflow can be wrapped as a
re-usable component so that the details of
extracting sequence data are hidden unless
needed.
22
Managing complexity: Actor-oriented Modeling &
Design
  • Scientific workflows use hierarchy to hide
    complexity
  • Top level workflows can be a conceptual
    representation of the science process that is
    easy to comprehend at a glance
  • Drilling down into sub-workflows reveals
    increasing levels of detail
  • Composing models using hierarchy promotes the
    development of re-usable components that can be
    shared with other scientists

23
Ex: A Happy Fusion Simulation Workflow
Subspecies/Variety: Plumbing WF (flux-laboris
plumbiensis)
  • Implements concurrent analysis pipeline (@2ndary
    cluster)
  • Tasks: convert, analyze, copy-to-Web-portal
    (makes scientists really happy!)
  • easy configuration, reuse, ...
  • pipeline parallelism!

Reusable Actor "Class"
Pipelined Execution Model
Inline Documentation
Inline Display
Easy-to-edit Parameter Settings
Checkpointing for (semi-smart) restart
Overall architecture/simulation (physicist):
Scott Klasky (ORNL); Workflow design &
development: Norbert Podhorszki (UC Davis)
25
Commercial & Open Source Scientific Workflow
Systems
26
KEPLER: http://www.kepler-project.org
  • The Kepler Scientific Workflow System
  • Extends Ptolemy II (Berkeley), developed by an
    EECS community (design and simulation of complex
    systems)
  • Open-source, Java
  • Computation Models, Nested WFs, Loops
  • Graphical Workflow Interface
  • Workflow Execution
  • Extensible Architecture
  • Component Libraries
  • Metadata, Discovery, Archival
  • The Kepler Vision
  • End-to-end scientific workflow design and
    execution environment
  • Data- and compute-intensive workflows
  • Comprehensive component libraries for a wide
    range of scientific domains
  • Enable collaboration, sharing across disciplines
    (synergy)

Natural Diversity Discovery Project (NDDP)
Science Environment for Ecological Knowledge (SEEK)
Real-time Observatories, Applications and Data
Management Network (ROADNet)
Ptolemy II
Encyclopedia of Life
KEPLER
NDDP: www.nddp.org; GEON: www.geongrid.org; Ptolemy II:
ptolemy.eecs.berkeley.edu/ptolemyII; ROADNet:
roadnet.ucsd.edu; SEEK: seek.ecoinformatics.org; SciDAC:
www-casc.llnl.gov/sdm
27
GEON Dataset Generation & Registration (a
co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al.(Ptolemy)
28
Web Services → Actors (WS Harvester)
  • → "Minute-made" (MM) WS-based application
    integration
  • Similarly: MM workflow design & sharing w/o
    implemented components

29
GEON Mineral Classification Workflow
An early example: Classification for naming
Igneous Rocks
30
Ex: Pipelined workflow for inferring phylogenetic
trees
Src: Timothy McPhillips (UC Davis)
31
Kepler SWF using remote datasets, 3rd-party
software
32
Summary: What are Scientific Workflows?
  • Maybe the single-most important concept you hear
    about @CSIG06
  • Attempt at a Definition
  • SWFs: System designs and/or executable
    programs/scripts
  • aiming to solve complex scientific data
    integration, analysis, management, visualization
    tasks
  • in plainer English: doing hard and/or messy stuff
  • while doing it in a scientist-friendly way
  • that is: making it look easy
  • with the ultimate goal to
  • do new, more, and better (e-)Science,
  • and faster! (and "reproducibler"! :-)
  • In short: SWFs are nothing less than MIRACLE-IT
    to make scientists (biologists, physicists,
    geoscientists, ... ) happy.
  • Attempto-DL Definition

33
What about scripts instead of SWFs?
  • If WF automation and "gluing" is so important,
    why not just use
  • MIRACLE-Perl
  • or MIRACLE-Python
  • or MIRACLE-BPEL4WS ???
  • indeed: Perl/Python in the hands of a gifted
    programmer are hard to beat
  • but (MIRACLE-) Scientific Workflows may offer
    new features:
  • built-in task parallelism and
    pipeline-parallelism
  • data parallelism available upon request :-)
  • built-in distributed execution (Grid/cluster)
  • modeling & design of workflows
  • actor (component)-oriented workflow design
    [Bowers-Ludaescher-ER05]
  • component and workflow reuse & repurposing
  • semantic extensions ("smart" search/link/...)
    [Bowers-Ludaescher-QLQP06]
  • parameter configuration, parameter studies
  • comprehensibility, documentation
  • data (and workflow) provenance support (→
    IPAW06 papers)
  • explain data dependencies/lineage, debug
    strange results, "smart" rerun, ...
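
To make "built-in pipeline parallelism" concrete, here is a minimal Python sketch of what a script author would have to hand-code: each stage runs in its own thread and the stages are connected by FIFO queues, so stage 2 can work on token 1 while stage 1 already processes token 2. The stage functions (convert, analyze, publish) and all helper names are hypothetical, not part of any workflow system.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the token stream


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each token until the stream ends."""
    while True:
        token = inbox.get()
        if token is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream downstream
            return
        outbox.put(func(token))


def run_pipeline(funcs, tokens):
    """Connect the stages with FIFO queues and run each in its own thread."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for token in tokens:
        queues[0].put(token)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical three-step pipeline (convert, analyze, publish).
convert = lambda x: x * 2
analyze = lambda x: x + 1
publish = lambda x: f"result={x}"
outputs = run_pipeline([convert, analyze, publish], [1, 2, 3])
```

A workflow system would provide this wiring (plus scheduling, monitoring, and restart) from the workflow graph itself, which is the point of the comparison above.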

34
Some KEPLER Actors (oh, the good old days ...)
35
So ...
  • a question: If scientific workflows are so
    great, why haven't they taken over the world??
  • A1: just wait ...
  • A2: they already have ...
  • A3: The problem of creating flexible, reusable,
    comprehensible, efficient, ... workflows
  • is akin to the problem of creating modular,
    reusable, maintainable, ... software!
  • it's complex systems engineering (as in
    "difficult")
  • and using UML, XML, WS-foo, SOA-bar, and
    BPEL-baz is no substitute for solving your
    modeling & design problem!
  • Google "evolution of language": Descartes,
    Church, McCarthy, W3C
  • Tony Hoare (Turing Award winner): "The Emperor's
    Old Clothes"

36
Modeling & Design of Scientific workflows
Src: Kristian Stevens, UC Davis
37
... the inside may be less pretty
Src: Kristian Stevens, UC Davis
38
Actor-Oriented Modeling
  • Ports
  • each actor has a set of input and output ports
  • denote the actor's signature
  • produce/consume data (a.k.a. tokens)
  • parameters are special, static ports

39
Actor-Oriented Modeling
  • Dataflow Connections
  • actor communication channels
  • directed (hyper) edges
  • connect output ports with input ports
  • merge step, distribute step

40
Actor-Oriented Modeling
  • Sub-workflows / Composite Actors
  • composite actors wrap sub-workflows
  • like actors, have signatures (i/o ports of
    sub-workflow)
  • hierarchical workflows (arbitrary nesting levels)

41
Actor-Oriented Modeling
  • Directors
  • define the execution semantics of workflow graphs
  • execute the workflow graph (according to some schedule)
  • sub-workflows may have different directors
  • enables reusability
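
The four ingredients above (ports, connections, composite actors, directors) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (each actor fires once, actors are listed in a valid firing order), not Kepler's or Ptolemy II's actual API; all class and function names are hypothetical.

```python
class Actor:
    """An actor with named input/output ports and a fire() method."""

    def __init__(self, name, inputs, outputs, func):
        self.name = name
        self.inputs = list(inputs)    # input port names
        self.outputs = list(outputs)  # output port names
        self.func = func              # maps input values to a tuple of outputs

    def fire(self, tokens):
        """Consume one token per input port, produce one per output port."""
        results = self.func(*(tokens[p] for p in self.inputs))
        return dict(zip(self.outputs, results))


class Workflow:
    """A graph of actors plus connections (output port -> input port)."""

    def __init__(self):
        self.actors = []       # assumed to be added in a valid firing order
        self.connections = {}  # (actor, in_port) -> (actor, out_port)

    def add(self, actor):
        self.actors.append(actor)
        return actor

    def connect(self, src, out_port, dst, in_port):
        self.connections[(dst.name, in_port)] = (src.name, out_port)


class SequentialDirector:
    """Hypothetical director: fires each actor once, in insertion order."""

    def run(self, workflow, external_inputs):
        produced = {}  # (actor, out_port) -> token
        for actor in workflow.actors:
            tokens = {}
            for port in actor.inputs:
                key = (actor.name, port)
                if key in workflow.connections:
                    tokens[port] = produced[workflow.connections[key]]
                else:
                    tokens[port] = external_inputs[key]
            for port, value in actor.fire(tokens).items():
                produced[(actor.name, port)] = value
        return produced


def composite(name, workflow, director, inner_in, inner_out):
    """Wrap a sub-workflow as an actor with ports 'x' (in) and 'out'."""
    def func(value):
        produced = director.run(workflow, {inner_in: value})
        return (produced[inner_out],)
    return Actor(name, ["x"], ["out"], func)


# Sub-workflow: double, then increment.
inner = Workflow()
double = inner.add(Actor("double", ["x"], ["y"], lambda x: (2 * x,)))
inc = inner.add(Actor("inc", ["x"], ["y"], lambda x: (x + 1,)))
inner.connect(double, "y", inc, "x")

director = SequentialDirector()
sub = composite("double_then_inc", inner, director, ("double", "x"), ("inc", "y"))

# Top-level workflow: the composite actor feeds a squaring actor.
outer = Workflow()
outer.add(sub)
square = outer.add(Actor("square", ["x"], ["y"], lambda x: (x * x,)))
outer.connect(sub, "out", square, "x")
result = director.run(outer, {("double_then_inc", "x"): 3})[("square", "y")]
```

Because the director is a separate object, the same actors could be run under a different execution model without being rewritten, which is the reusability argument of this slide.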

42
Models of Computation
  • Directors separate the concerns of WF
    orchestration from Actor execution
  • Synchronous Dataflow (SDF)
  • Connections have queues for sending/receiving
    fixed numbers of tokens at each firing. Schedule
    is statically predetermined. SDF models are
    highly analyzable and used often in SWFs.
  • Process Networks (PN)
  • Generalizes SDF. Each actor executes as a separate
    thread/process, with queues of unbounded size.
    Related to Kahn/MacQueen semantics.
  • Continuous Time (CT)
  • Connections represent the value of a continuous
    time signal at some point in time ... Often used
    to model physical processes.
  • Discrete Event (DE)
  • Actors communicate through a queue of events in
    time. Used for instantaneous reactions in
    physical systems.
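
Why SDF schedules can be "statically predetermined": with fixed token rates per firing, the number of firings of each actor per schedule iteration follows from the balance equations q[src] * produced = q[dst] * consumed on every connection. The Python sketch below solves them for a connected graph; the function name and edge encoding are assumptions made for this illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * produced == q[dst] * consumed.

    edges: list of (src, produced_per_firing, dst, consumed_per_firing).
    Returns the smallest positive integer firing counts, or raises
    ValueError if the rates are inconsistent. Assumes a connected graph.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent token rates")
    # Scale the rational solution to the smallest integer vector.
    denom = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * denom) for a, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
schedule_counts = repetition_vector(
    ["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)]
)
```

Here A must fire 3 times, B twice, and C once per iteration, so that every queue returns to its initial size. Inconsistent rates (no such vector exists) are detected before anything runs, which is one sense in which SDF models are analyzable.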

43
Polymorphic Actors: Components Working Across
Data Types and Domains
  • Actor Data Polymorphism
  • Add numbers (int, float, double, Complex)
  • Add strings (concatenation)
  • Add complex types (arrays, records, matrices)
  • Add user-defined types
  • Actor Behavioral Polymorphism
  • In dataflow, add when all connected inputs have
    data
  • In a time-triggered model, add when the clock
    ticks
  • In discrete-event, add when any connected input
    has data, and add in zero time
  • In process networks, execute an infinite loop in
    a thread that blocks when reading empty inputs
  • In CSP, execute an infinite loop that performs
    rendezvous on input or output
  • In push/pull, ports are push or pull (declared or
    inferred) and behave accordingly
  • In real-time CORBA, priorities are associated
    with ports and a dispatcher determines when to
    add
  • hey, Ptolemy has been out for long!

By not choosing among these when defining the
component, we get a huge increment in component
re-usability. But how do we ensure that the
component will work in all these circumstances?
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/
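
Both kinds of polymorphism can be illustrated with a toy "Add" actor in Python. The actor itself only assumes that its inputs support "+" (data polymorphism); when it fires is decided by a separate firing rule (behavioral polymorphism). The two rules below loosely mimic the dataflow and discrete-event cases from the list; they are simplified stand-ins, not Ptolemy II code.

```python
from functools import reduce
import operator


def add_actor(*inputs):
    """Data-polymorphic 'Add': works on any type that defines '+'."""
    return reduce(operator.add, inputs)


def fire_when_all_ready(actor, ports):
    """Dataflow-style rule: fire only if every input queue has a token."""
    if all(ports):  # an empty queue (list) is falsy
        return actor(*(q.pop(0) for q in ports))
    return None


def fire_when_any_ready(actor, ports):
    """Discrete-event-style rule: fire on whichever inputs have tokens."""
    ready = [q.pop(0) for q in ports if q]
    return actor(*ready) if ready else None


numbers = add_actor(1, 2.5)            # int + float
text = add_actor("ab", "cd")           # string concatenation
arrays = add_actor([1], [2, 3])        # list concatenation
dataflow = fire_when_all_ready(add_actor, [[1], []])   # not ready: no firing
event = fire_when_any_ready(add_actor, [[1], []])      # fires on one input
```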
44
Supporting WF Design via Semantic Types
  • Src: NSF/ITR SEEK proposal

45
Workflows
SMS
Ontologies
Data
46
Semantic Types: Motivation
  • Scientific Workflow Life-cycle
  • Resource Discovery
  • discover relevant datasets
  • discover relevant actors or workflow templates
  • Workflow Design and Configuration
  • data ↔ actor (data binding)
  • data ↔ data (data integration / merging /
    interlinking)
  • actor ↔ actor (actor / workflow
    composition)
  • Challenge: do all this in the presence of
  • 100s of workflows and templates
  • 1000s of actors (e.g. actors for web services,
    data analytics, ...)
  • 10,000s of datasets
  • 1,000,000s of data items
  • highly complex, heterogeneous data

price to pay for these resources: (lots)
scientist's time wasted: priceless!
47
Approach & SMS Capabilities
  • Employ semantic extensions (ontologies) for ...
  • "Smart" Search (→ Resource Discovery)
  • "Smart" Attach (→ Data Binding)
  • "Smart" Integration (→ Transform/Merge Data)
  • "Smart" Links (→ Actor Composition, WF Design)
  • by "smart" we mean: these services are informed
    by metadata and ontology information
  • Characteristics of SMS work:
  • big chunk of basic computer science research (→
    references)
  • but also implement this (→ link to Kepler)
  • driven by real-world use cases (→ link to BEAM)
  • on top of community ontologies (→ link to KR
    team)

48
Example: Semantic Type Annotation
[Figure: "Ontology Land" (classes MeasContext, Observation,
LifeStage Property, Abundance Count, Number Value, AccuracyQualifier;
properties hasContext, appliesTo, itemMeasured, hasCount, hasValue,
obsProperty) linked via semType(P2) and semType(P3) to "Workflow Land"
(actors S1 (life stage property) and S2 (mortality rate for period),
ports P1 to P5)]
49
Hybrid Types: Semantic + Structural Typing
50
Semantic Type Annotation in Kepler
  • Component input and output port annotation
  • Each port can be annotated with multiple classes
    from multiple ontologies
  • Annotations are stored within the component
    metadata

51
Component Annotation and Indexing
  • Component Annotations
  • New components can be annotated and indexed into
    the component library (e.g., specializing generic
    actors)
  • Existing components can also be revised,
    annotated, and indexed (hiding previous versions)

52
Smart Search
  • Find a component (here an actor) in different
    locations (categories)
  • based on the semantic annotation of the
    component (or its ports)
  • → needs one (or more) ontologies to register
    against (→ KR)
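
A minimal sketch of the idea behind smart search, in Python: a query for a concept also returns components annotated with any subclass of that concept. The toy ontology, the actor names, and the function names are all hypothetical; a real system would query registered ontologies rather than a dictionary.

```python
# Hypothetical toy ontology: child class -> parent class (is-a).
IS_A = {
    "PhylogeneticTreeInference": "PhylogeneticAnalysis",
    "SequenceAlignment": "PhylogeneticAnalysis",
    "PhylogeneticAnalysis": "Analysis",
    "Statistics": "Analysis",
}

# Hypothetical component library: actor name -> semantic annotations.
ANNOTATIONS = {
    "AlignActor": {"SequenceAlignment"},
    "TreeBuilder": {"PhylogeneticTreeInference"},
    "Histogram": {"Statistics"},
}


def ancestors(cls):
    """Return cls together with all of its superclasses."""
    result = {cls}
    while cls in IS_A:
        cls = IS_A[cls]
        result.add(cls)
    return result


def smart_search(concept):
    """Find actors annotated with the concept or any of its subclasses."""
    return sorted(
        name
        for name, classes in ANNOTATIONS.items()
        if any(concept in ancestors(c) for c in classes)
    )


phylo_actors = smart_search("PhylogeneticAnalysis")
all_analysis = smart_search("Analysis")
```

A plain keyword search for "PhylogeneticAnalysis" would find neither actor here, since no component carries that exact annotation.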

53
Smart (Data) Integration: Merge
  • Discover data of interest
  • connect to merge actor
  • compute merge
  • align attributes via annotations
  • open dialog for user refinement
  • store merge mapping in MOML
  • enjoy!
  • your merged dataset

54
Under the hood of Smart Merge
  • Exploits semantic type annotations and ontology
    definitions to find mappings between sources
  • Executing the merge actor results in an
    integrated data product (via outer union)

[Figure: Merge actor; attribute labels a1, a3, a4, a6, a8 and aligned
pairs a1a8, a3a6]
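
The merge step can be sketched as follows, assuming each source column has already been annotated with one ontology concept: columns sharing a concept are aligned, and the result is the outer union of all rows. The function name, the data, and the concept names are hypothetical; how Kepler's merge actor does this internally is not shown in the slides.

```python
def smart_merge(sources):
    """Outer union of sources whose columns carry semantic annotations.

    sources: list of (rows, annotation) pairs, where rows is a list of dicts
    and annotation maps each local column name to an ontology concept.
    Columns annotated with the same concept are aligned; missing values
    become None.
    """
    concepts = []
    for _, annotation in sources:
        for concept in annotation.values():
            if concept not in concepts:
                concepts.append(concept)
    merged = []
    for rows, annotation in sources:
        for row in rows:
            out = dict.fromkeys(concepts)  # every concept, default None
            for column, value in row.items():
                out[annotation[column]] = value
            merged.append(out)
    return merged


# Two hypothetical datasets with different column names.
source1 = ([{"sp": "Quercus", "cnt": 4}],
           {"sp": "Species", "cnt": "AbundanceCount"})
source2 = ([{"taxon": "Pinus", "site": "A"}],
           {"taxon": "Species", "site": "Location"})
merged = smart_merge([source1, source2])
```

The columns "sp" and "taxon" end up in one merged column because both are annotated as "Species"; the user-refinement dialog mentioned on the previous slide would let a scientist correct such alignments.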
55
Smart Linking (Workflow Design)
  • Statically perform semantic and structural type
    checking
  • Navigate errors and warnings within the workflow
  • Search for and insert adapters to fix
    (structural and semantic) errors
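
A static check of one link might look like the Python sketch below: an output port may feed an input port if the structural types match and the output's semantic type is a subclass of what the input expects. The class hierarchy, port encoding, and function names are assumptions for illustration only.

```python
# Hypothetical is-a hierarchy used for semantic types.
SUPERCLASS = {"DNASequence": "Sequence", "ProteinSequence": "Sequence"}


def is_subclass(cls, target):
    """True if cls equals target or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUPERCLASS.get(cls)
    return False


def check_link(out_port, in_port):
    """Return a list of problems for connecting out_port to in_port.

    Each port is a dict with a structural type ('struct') and a semantic
    type ('sem'). An empty list means the link is type correct.
    """
    problems = []
    if out_port["struct"] != in_port["struct"]:
        problems.append(
            f"structural mismatch: {out_port['struct']} vs {in_port['struct']}"
        )
    if not is_subclass(out_port["sem"], in_port["sem"]):
        problems.append(
            f"semantic mismatch: {out_port['sem']} is not a {in_port['sem']}"
        )
    return problems


ok = check_link({"struct": "string", "sem": "DNASequence"},
                {"struct": "string", "sem": "Sequence"})
bad = check_link({"struct": "list[string]", "sem": "DNASequence"},
                 {"struct": "string", "sem": "ProteinSequence"})
```

Each reported problem is a candidate place for an adapter: a structural adapter for the first kind of error, a semantic conversion for the second.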

56
Smart Linking (Data-Actor, Actor-Actor)
Source: Bowers-Ludaescher, DILS'04
57
SEEK SMS Summary
  • Employ semantic extensions (ontologies) for ...
  • "Smart" Search (→ Resource Discovery)
  • "Smart" Attach (→ Data Binding)
  • "Smart" Integration (→ Merge Actor)
  • "Smart" Links (→ Actor Composition)

58
Scientific Workflow Design Challenges
And that's why our scientific workflows are
much easier to develop, understand and maintain!
59
A Simple Motivating Example
  • Take the services (actors, components) in (a)
  • and chain them together in a scientist-friendly
    form a la (b)
  • considering the following signatures (cf.
    Haskell, ML, ...)
  • (c) BLAST :: DNA → [DNA]
  • (d) MotifSearch :: DNA → Motif
  • (e) MotifSearch o BLAST = \x.
    MotifSearch(BLAST(x))
  • oops: (e) is not type correct; note the
    signatures of (c) and (d)!
  • a neat solution: implicit or explicit iteration /
    map(f)[x1, ..., xn]
  • cf. Kepler and Taverna, Kepler solutions
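
The explicit-iteration fix can be sketched in Python with type hints, assuming (as the "not type correct" remark suggests) that BLAST returns a list of sequences while MotifSearch expects a single one. The two service functions below are trivial stand-ins, not real BLAST or motif searches.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def blast(dna: str) -> List[str]:
    """Stand-in for BLAST :: DNA -> [DNA]; NOT a real similarity search."""
    return [dna, dna[::-1]]


def motif_search(dna: str) -> str:
    """Stand-in for MotifSearch :: DNA -> Motif; returns the first 3 bases."""
    return dna[:3]


def lift(f: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """map(f): turn a function on items into a function on lists."""
    return lambda xs: [f(x) for x in xs]


def compose(g, f):
    """Function composition (g o f)."""
    return lambda x: g(f(x))


# compose(motif_search, blast) is ill-typed with respect to the declared
# signatures: blast returns a list, motif_search expects one DNA string.
# Lifting motif_search with map repairs the mismatch.
pipeline = compose(lift(motif_search), blast)
motifs = pipeline("ACGTTG")
```

Implicit iteration means the workflow system inserts the lift automatically when it sees the list/item mismatch on a connection.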

60
Extended Example: Workflow Evolution
  • (a) => (b): replace A :: a → b with A :: a → [b]
  • need to call B iteratively, i.e. wrap B inside a
    component or add control-flow
  • (b) => (c): upstream produces [a], [a], ...
    instead of a, a, ...
  • (d): need to bypass data components since B
    can't handle d's
  • This gets messy quickly ...

61
So how to get from messy to clean & reusable
designs?
62
Answer: Collection-Oriented Modeling & Design
  • Collection-Oriented Modeling & Design (COMAD)
  • starting point: dataflow / actor-oriented
    modeling & design
  • embrace the assembly line metaphor fully
  • → Flow-based Programming (J. Morrison)
  • data = tagged nested collections
  • e.g. represented as flattened, pipelined
  • (XML) token streams

→ Multi-level Pipeline Parallelism!
63
How does COMAD work?
  • Some COMAD principles
  • data = tagged, flattened, nested collections
    (token streams)
  • data tokens
  • metadata tokens
  • inherited downwards into (sub)collections
  • define an actor's read scope via an (X)Path-like
    expression
  • default actor behavior
  • "not mine"?
  • → don't do anything; just pass the buck!
  • stuff within my scope?
  • → add-only to it (default)
  • or: consume scope & write out result
  • (but remember the bypass!)
  • iteration scope is a query involving group-by and
    further refines the granularity/subtrees that
    constitute the tokens consumed by an actor firing
  • has aspects of implicit iteration (a la Taverna)
  • default iteration level to fix signature
    mismatches
  • but also:
  • granularity/grouping is definable
  • works on anything (assuming scope is matched
    correctly)
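
The default "pass the buck, add-only" behavior can be sketched in Python over a flattened token stream of open/close/data tokens. The read scope is reduced here to a single collection tag (a stand-in for an XPath-like expression); the token encoding and all names are assumptions for this illustration, not the COMAD implementation.

```python
OPEN, CLOSE, DATA = "open", "close", "data"


def comad_actor(stream, scope, func, result_tag):
    """Process a flattened nested-collection stream in an add-only fashion.

    stream: iterable of (kind, tag, value) tokens.
    scope: tag of the collections this actor reads.
    Tokens outside the scope are passed through untouched; for each
    collection in scope, func(data values) is appended as a new data token
    just before the collection closes.
    """
    depth_in_scope = 0
    collected = []
    for kind, tag, value in stream:
        if kind == OPEN and tag == scope and depth_in_scope == 0:
            depth_in_scope = 1
            collected = []
        elif kind == OPEN and depth_in_scope:
            depth_in_scope += 1
        elif kind == DATA and depth_in_scope:
            collected.append(value)
        elif kind == CLOSE and depth_in_scope:
            depth_in_scope -= 1
            if depth_in_scope == 0:
                yield (DATA, result_tag, func(collected))  # add-only
        yield (kind, tag, value)  # every input token is passed downstream


stream = [
    (OPEN, "project", None),
    (DATA, "note", "unrelated"),
    (OPEN, "sequences", None),
    (DATA, "dna", "ACGT"),
    (DATA, "dna", "GG"),
    (CLOSE, "sequences", None),
    (CLOSE, "project", None),
]
out = list(comad_actor(stream, "sequences",
                       lambda values: sum(len(v) for v in values),
                       "total_length"))
```

The actor never sees or touches the "note" token, so upstream changes to the surrounding collection structure do not break it; changing its behavior for a different workflow only requires a different scope.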

64
COMAD: What we gained
  • from fragile, messy workflow designs ...
  • ... to more reusable actors
  • just change the scopes
  • sometimes not even that is needed
  • ... and cleaner workflow design
  • Crux: keep the nesting structure of data (pass
    through, add-only)
  • and let it drive the (semi-)implicit iteration

65
Provenance & Scientific Workflows
66
A Scientific Publication
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006);
doi:10.1038/nature05113; Received 27 June 2006;
Accepted 25 July 2006; Published online 16 August
2006
some metadata
67
More "Evidence"
data reference
type of evidence
tool reference
"trust me on this one"
  • provenance/lineage: show the history and evidence
  • related to proof trees
  • unlike w/ scripts, a SWF system can keep track of
    what happened
  • In the future: deposit your data & workflows in a
    repository!?

68
Provenance for the WF Engineer / Plumber
  • A Workflow Engineer's View
  • Monitor, benchmark, and optimize workflow
    performance
  • Record resource usage for a workflow execution
  • "Smart" Re-run of (variants of) previous
    executions
  • Checkpointing & restart (e.g. for crash recovery,
    load balancing)
  • Debug or troubleshoot a workflow run
  • Explain when, where, why a workflow crashed

69
Provenance for Domain Scientists!
  • Query the lineage of a data product
  • from what data was this computed? (real
    dependencies please!)
  • Evaluate the results of a workflow
  • do I like how this result was computed?
  • Reuse data products of one workflow run in
    another
  • (re-)attach prior data products to a new workflow
  • Archive scientific results in a repository
  • Replicate the results reported by another
    researcher
  • Discover all results derived from a given dataset
  • i.e. across all runs
  • Explain unexpected results
  • via parameter-, dataset-, object-dependencies
    in the scientist's terms (yes, you may think
    "ontology" here ...)

70
Observables
  • Model of Computation (MoC) M
  • specification/algorithm to compute o = M(W,P,i)
  • a director or scheduler implements M
  • gives rise to formal notions of
  • computation (aka run) R; typically tree models
  • Model of Provenance (MoP) M'
  • approximation M' of M
  • a trace T approximates a run R by
    inclusion/exclusion of observables
  • T = R - Ignored-observables +
    Model-observables
  • Observables (of a MoC M)
  • functional observables (may influence output o)
  • token rate, notions of firing, ...
  • non-functional observables (not part of M, do not
    influence o)
  • token timestamp, size, ... (unless the MoC cares
    about those)
  • What is a good model of provenance? What is a
    good provenance schema?

71
Pipelined workflow for inferring phylogenetic
trees
72
Scientific provenance questions we can ask about
a run of this workflow
  • What DNA sequences were input to the workflow
    (this run)?
  • What phylogenetic trees were output by the
    workflow?
  • What phylogenetic trees were created
    (intermediate or final) by the workflow?
  • What actor created this phylogenetic tree?
  • What sequences input to the workflow does this
    consensus tree depend on?
  • What input sequences were not used to derive any
    output consensus trees?
  • What was the sequence alignment (key intermediate
    data) used in the process of inferring this tree?
  • Which actors were involved in creating this tree?
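
Several of these questions reduce to reachability queries over a recorded dependency graph. The Python sketch below assumes a trace that stores, for each data item, the actor that created it and the items it was computed from; the trace contents and function names are made up for this illustration and do not reflect any particular provenance schema.

```python
# Hypothetical trace: each data item -> (actor that created it, its inputs).
TRACE = {
    "alignment1": ("Align", ["seq1", "seq2"]),
    "tree1": ("InferTree", ["alignment1"]),
    "tree2": ("InferTree", ["alignment1"]),
    "consensus": ("Consensus", ["tree1", "tree2"]),
}
WORKFLOW_INPUTS = {"seq1", "seq2", "seq3"}


def lineage(item):
    """All data items that the given item (transitively) depends on."""
    deps = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in TRACE.get(current, (None, []))[1]:
            if parent not in deps:
                deps.add(parent)
                stack.append(parent)
    return deps


def input_dependencies(item):
    """Which workflow inputs does this item depend on?"""
    return lineage(item) & WORKFLOW_INPUTS


def actors_involved(item):
    """Which actors took part in creating this item?"""
    return {TRACE[d][0] for d in lineage(item) | {item} if d in TRACE}


def unused_inputs(outputs):
    """Which workflow inputs were not used to derive any of the outputs?"""
    used = set()
    for out in outputs:
        used |= input_dependencies(out)
    return WORKFLOW_INPUTS - used


creator = TRACE["tree1"][0]                 # what actor created this tree?
depends_on = input_dependencies("consensus")
involved = actors_involved("consensus")
unused = unused_inputs(["consensus"])
```

The answers are only as good as the recorded dependencies: if the trace lists every input an actor merely saw rather than those it really used, lineage becomes uselessly large, which is why the earlier slide asks for "real dependencies".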

73
Provenance in the COMAD Framework
Without Provenance
With Provenance
74
Workflow Design Paradigms
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-oriented Modeling & Design framework

75
Big Picture Summary
  • How did we get here?
  • universe started 13.7Gyr (billion years) ago
  • considerable chaos (physicists focusing on very
    early universe)
  • earth formed 4.5 Gyr ago; first oceans condensed
    4.4 Gyr ago
  • life arose around 3.8 Gyr ago (carbon isotope
    data provide evidence for CO2 fixation in
    sedimentary rocks at that time)
  • but to make a rather long and evolved story
    short,
  • a story-telling animal emerged, asking, mostly in
    its spare time, increasingly complex questions
    about the environment, itself and its buddies,
    eventually even contemplating theories which
    considered the story-telling animals themselves
    to be some sort of adaptive, complex systems,
    maybe even just the survival machines or fruiting
    bodies of their own internal somewhat selfish
    operating system codes (aka genes).
  • This of course after having first invented the
    printing press (thanks Mr. Gutenberg),
  • and then the internet shortly thereafter (thanks
    Al Gore ;-),
  • thereby effectively wiring their brains together
    to do complex, distributed, concurrent
    computations to reveal more and increasingly
    complex answers to the same old questions
  • e-Science accelerates the production of
    scientific knowledge using IT
  • basic nutrients are raw data, observations,
    measurements (labeled ordered trees/nested
    collections); processed foods: rules, equations,
    theories, ...
  • data & information metabolism, i.e., a
    description of the knowledge discovery "plan":
    well-defined experiment protocols, aka scientific
    workflows
  • (in silico experiments)
  • pushing the limits of the latter is thus likely
    the next big jump in evolution.

76
Acknowledgements and Q&A
  • NSF/ITR Science Environment for Ecological
    Knowledge (SEEK)
  • NSF/ITR Geosciences Network (GEON)
  • DOE/SciDAC Scientific Data Management Center
    (SDM)
  • Data and Knowledge Systems Lab
  • Drs. Shawn Bowers, Timothy McPhillips, Norbert
    Podhorszki
  • Dave Thau, Daniel Zinn, Alex Chen
  • many Kepler collaborators

77
  • Contact, addtl. references etc.
  • LUDAESCH@UCDAVIS.EDU
  • daks.ucdavis.edu

78
Some Related Publications
  • Semantic Type Annotation
  • S Bowers, B Ludaescher. A Calculus for
    Propagating Semantic Annotations through
    Scientific Workflow Queries. ICDE Workshop on
    Query Languages and Query Processing (QLQP),
    LNCS, 2006.
  • S Bowers, B Ludaescher. Towards Automatic
    Generation of Semantic Types in Scientific
    Workflows. International Workshop on Scalable
    Semantic Web Knowledge Base Systems (SSWS), WISE
    2005 Workshop Proceedings, LNCS, 2005.
  • C Berkley, S Bowers, M Jones, B Ludaescher, M
    Schildhauer, J Tao. Incorporating Semantics in
    Scientific Workflow Authoring. SSDBM, 2005.
  • B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B
    Brodaric, C Baru. Managing Scientific Data: From
    Data Integration to Scientific Workflows. GSA
    Today, Special Issue on Geoinformatics, 2006.
  • S Bowers, D Thau, R Williams, B Ludaescher. Data
    Procurement for Enabling Scientific Workflows: On
    Exploring Inter-Ant Parasitism. VLDB Workshop on
    Semantic Web and Databases (SWDB), 2004.
  • S Bowers, K Lin, B Ludaescher. On Integrating
    Scientific Resources through Semantic
    Registration. SSDBM, 2004.
  • S Bowers, B Ludaescher. An Ontology-Driven
    Framework for Data Transformation in Scientific
    Workflows. International Workshop on Data
    Integration in the Life Sciences (DILS), LNCS,
    2004.
  • S Bowers, B Ludaescher. Towards a Generic
    Framework for Semantic Registration of Scientific
    Data. International Semantic Web Conference
    Workshop on Semantic Web Technologies for
    Searching and Retrieving Scientific Data, 2003.
  • Workflow Design and Modeling
  • T McPhillips, S Bowers, B Ludaescher.
    Collection-Oriented Scientific Workflows for
    Integrating and Analyzing Biological Data.
    Workshop on Data Integration in the Life Sciences
    (DILS), 2006, to appear.
  • S Bowers, T McPhillips, B Ludaescher, S Cohen, SB
    Davidson. A Model for User-Oriented Data
    Provenance in Pipelined Scientific Workflows.
    International Provenance and Annotation Workshop
    (IPAW), LNCS, 2006.
  • S Bowers, B Ludaescher, AHH Ngu, T Critchlow.
    Enabling Scientific Workflow Reuse through
    Structured Composition of Dataflow and
    Control-Flow. IEEE Workshop on Workflow and Data
    Flow for Scientific Applications (SciFlow), 2006.
  • S Bowers, B Ludaescher. Actor-Oriented Design of
    Scientific Workflows. International Conference on
    Conceptual Modeling (ER), LNCS, 2005.
  • T McPhillips, S Bowers. Pipelining Nested Data
    Collections in Scientific Workflows. SIGMOD
    Record, 2005.
  • Kepler
  • D Pennington, D Higgins, AT Peterson, M Jones, B
    Ludaescher, S Bowers. Ecological Niche Modeling
    using the Kepler Workflow System. Workflows for
    e-Science, Springer-Verlag, to appear.
  • W Michener, J Beach, S Bowers, L Downey, M Jones,
    B Ludaescher, D Pennington, A Rajasekar, S
    Romanello, M Schildhauer, D Vieglais, J Zhang.
    SEEK: Data Integration and Workflow Solutions for
    Ecology. Workshop on Data Integration in the Life
    Sciences (DILS), LNCS, 2005.
  • S Romanello, W Michener, J Beach, M Jones, B
    Ludaescher, A Rajasekar, M Schildhauer, S Bowers,
    D Pennington. Creating and Providing Data
    Management Services for the Biological and
    Ecological Sciences: Science Environment for
    Ecological Knowledge. SSDBM, 2005.