1
Workflows
  • Jeffrey L. Tilson
  • Research Scientist
  • Renaissance Computing Institute

2
Outline
  • Example Workflows
  • Two examples of workflows and what they do
  • Different technical domains
  • Biology
  • MotifNetwork environment
  • Genome-wide analysis environment.
  • HPC
  • Performance harness
  • Facilitated software performance sweeps

3
Workflows
  • Wikipedia: the movement of documents and/or tasks
    through a work process
  • Workflows are of two types
  • Control-flow (Business oriented, BPEL)
  • Data-flow (Application chaining)
  • Science is typically Data-flow oriented
  • So our plan is to
  • Develop useful Data-flow workflows
  • Exploit the RENCI Science gateway

4
Technologies
  • Leverage Taverna for orchestrating workflows
  • A centralized, dataflow, non-DAG environment
  • A nice development platform to construct
    workflows
  • Run as a batch application to enact workflows
  • Leverage Service Technologies to build workflow
    processors
  • BioMoby, GST (grid services), etc.
  • Utilizes Globus GT4 (pre-ws and ws)
  • Leverage the RENCI Bioportal / TeraGrid Science
    Gateway
  • Authorization, job management, audits, common
    look and feel, etc.
  • E.g., Gene2Life.

5
Taverna
  • Orchestration Enactor
  • Taverna
  • Collaboration of the European Bioinformatics
    Institute (EBI) and several Universities. It is
    funded via the Open Middleware Infrastructure
    Institute (OMII-UK)
  • Substantial biological/chemical support available
  • Taverna can use many kinds of services (BioMoby,
    WSDL, JDBC.)
  • Assisted creation of workflows.
  • Enactment of workflows

6
Service Technologies
  • Service Technologies
  • BioMoby
  • Seqhound
  • Soaplab
  • WSDL
  • Custom
  • Etc.
  • Taverna can support all of these (and more)
  • Service Selections
  • Support/Development of semantically well-formed
    access
  • Vs
  • Ad-hoc data acquisition
  • Must strike a balanced approach
  • But support the users!

7
TeraGrid Implementation: Bioportal extensions
  • Leverages Science Gateway technology and adds
    support for:
  • Community accounts
  • More scalable process for managing O(1K) users
  • Increased audit capability
  • Makes the use of community accounts safer
  • Collective (limited) administration
  • Some tasks are group oriented. E.g., an RP
    suspending a gateway user account.
  • Adds further capability:
  • Further use of transparent gridFTP
  • Auditing / new mysql schema
  • Technology (direct use of)
  • Java / mysql / PISE

Lavanya Ramakrishnan, Mark S.C. Reed, Jeffrey L.
Tilson, Daniel A. Reed, "Grid Portals for
Bioinformatics," Second International Workshop on
Grid Computing Environments (GCE), Workshop at
SC06, November 2006, Tampa, Florida

8
Workflow examples
  • Two examples of workflows
  • Biological
  • genome-wide analysis
  • alternative functional genomics
  • non-Biological
  • Workflows to assist in HPC
  • performance studies

9
Workflow Example 1
Jeffrey L. Tilson, Gloria Rendon, Mao-Feng Ger,
and Eric Jakobsson, "MotifNetwork: A Grid-enabled
Workflow for High-throughput Domain Analysis of
Biological Sequences"

10
Workflow environment
  • The MotifNetwork is a suite of workflows
  • Generation of new data and techniques
  • Highly collaborative process (NCSA/UIUC)
  • Biological users co-verify results
  • Generation of reports/manuscripts
  • Basic Service Types
  • GST, JDBC, Globus GT4, Java
  • Applications
  • Ion-channels
  • HoneyBee genome

11
MotifNetwork: Protein-Probe MotifNetwork workflow
  • One of several workflows.
  • Input: an amino acid (a.a.) sequence.
  • Identify homologues using psiBlast
  • Identify all known domains using Interpro
  • Construct webs

12
P-P MotifNetwork: The Real Workflow
13
MotifNetwork Data Product 1
  • Data file
  • Protein-Motif links
  • Interpro scores
  • E-Score
  • Sequence locations
  • (start and end bps)
  • Matrix oriented analysis
  • Import by Excel

14
MotifNetwork Data Product 2
The Protein-Motif Web: Cytoscape display of Data
Product 1. This indicates connections between
identified proteins and domains. Generally a few
domains are found in many proteins and the
interconnection fabric is dense.
15
MotifNetwork Data Product 3
16
MotifNetwork Data Product 4
The Protein-Protein Web: Cytoscape display of
Data Product 3. Inverse of the Motif-Motif web.
This is an optional output as the data set is
very large and the graphical network is
complicated.
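
The relationship between the Protein-Motif links and the derived webs can be pictured with a small sketch. It rests on an assumption that the slides do not spell out: that two motifs are linked in the Motif-Motif web when they occur in the same protein, and that two proteins are linked in the Protein-Protein web when they share a motif. The function name `project` and the protein and motif identifiers are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def project(links, group_index, item_index):
    """Link two items whenever they belong to the same group.

    links: iterable of (protein, motif) pairs.
    group_index / item_index: which tuple position is the grouping key
    and which is the item being linked (0 = protein, 1 = motif).
    Returns a sorted list of unordered item pairs.
    """
    groups = defaultdict(set)
    for link in links:
        groups[link[group_index]].add(link[item_index])
    edges = set()
    for items in groups.values():
        # every pair of items in one group becomes an edge
        edges.update(combinations(sorted(items), 2))
    return sorted(edges)

# Made-up Protein-Motif links (not real data).
links = [("P1", "M1"), ("P1", "M2"), ("P2", "M2"), ("P3", "M2"), ("P3", "M3")]

motif_motif = project(links, 0, 1)      # motifs that co-occur in a protein
protein_protein = project(links, 1, 0)  # proteins that share a motif

print(motif_motif)      # [('M1', 'M2'), ('M2', 'M3')]
print(protein_protein)  # [('P1', 'P2'), ('P1', 'P3'), ('P2', 'P3')]
```

Even in this toy case a single common motif (M2) links every protein to every other, which is consistent with the remark that the Protein-Protein web grows large quickly.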
17
P-P MotifNetwork: Typical Analysis
  • Compare two distinct MotifNetwork runs
  • Generate domain webs for both.
  • (psiBlast) homologous series
  • No a priori relationship imposed
  • Data Union
  • Identifies shared domains
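
A minimal sketch of this comparison, assuming each run yields a list of (protein, domain) links: the links from both runs are pooled, and a domain counts as shared when it appears in every run. The function name, run names, and identifiers are hypothetical.

```python
from collections import defaultdict

def shared_domains(runs):
    """Return the sorted domains that appear in every run.

    runs: dict mapping a run name to an iterable of (protein, domain) links.
    """
    seen_in = defaultdict(set)  # domain -> names of the runs that contain it
    for run_name, links in runs.items():
        for _protein, domain in links:
            seen_in[domain].add(run_name)
    return sorted(d for d, names in seen_in.items() if len(names) == len(runs))

# Made-up links for two independent runs (not real data).
runs = {
    "run_a": [("protA1", "D1"), ("protA2", "D2"), ("protA2", "D3")],
    "run_b": [("protB1", "D3"), ("protB2", "D4"), ("protB3", "D1")],
}

print(shared_domains(runs))  # ['D1', 'D3']
```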

18
P-P MotifNetwork workflow: Performance
  • Scaling
  • Two significant computational steps: psiBlast
    and InterProScan
  • Scales well to 64 processors (psiBlast is the bottleneck)
  • 20 min runtime
  • per input sequence

19
P-P MotifNetwork workflow: results
  • NCSA/RENCI actively using this workflow
  • Discovery of microbial associations
  • that are hard to find using trees.
  • Useful at the desktop
  • 20 min runtime (per input seq) using MPPs
  • Comprehensive Analysis
  • Ongoing large-scale Genome-wide comprehensive
    analysis on the recent honey bee genome

20
Workflow Example 2
  • Performance Analysis
  • Determine optimum build and run parameters.
  • Many compilers and compiler options
  • gcc, icc, pgcc, etc.
  • -O2, -O3, -xT, unroll level, etc.
  • Many link options
  • Math libs, new intrinsics, debugging, profile,
    etc
  • Many runtime options
  • Multicore, network, etc.
  • Consistently store results
  • Basic Service Types
  • GST, JDBC, Java, csh, GT4
  • Example applications
  • GAMESS
  • ClustalW-MPI

21
Performance Campaigns
  • Campaign Descriptor
  • A specification of the
  • Compiler options
  • Linker options
  • Runtime options
  • For a single performance run
  • Campaign
  • The full list of all compiles/benchmarks/mysql
    updates that are to be executed with a single
    workflow enactment
  • The List of Campaign Descriptors

22
Example GAMESS Workflow
23
The real workflow
24
Invoke Performance Workflow
  • Define the Campaign
  • Four sets of compiler options
  • Each with a specified name and description
  • Five node sets
  • Three Input files
  • Single launch option
  • 60 total runs. (4x5x3x1)
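
The run count follows from taking every combination of the four axes. The sketch below builds such a list of campaign descriptors; the option names, node counts, and file names are placeholders chosen for illustration, not values from the talk, and `build_campaign` is a hypothetical helper.

```python
from itertools import product

# Placeholder campaign inputs (assumed values, not from the talk).
compiler_option_sets = ["opts-a", "opts-b", "opts-c", "opts-d"]  # four sets
node_sets = [1, 2, 4, 8, 16]                                     # five node sets
input_files = ["input1", "input2", "input3"]                     # three inputs
launch_options = ["default"]                                     # one launch option

def build_campaign(compilers, nodes, inputs, launches):
    """Return one descriptor (a dict) per combination of the four axes."""
    return [
        {"compiler": c, "nodes": n, "input": i, "launch": l}
        for c, n, i, l in product(compilers, nodes, inputs, launches)
    ]

campaign = build_campaign(
    compiler_option_sets, node_sets, input_files, launch_options
)
print(len(campaign))  # 60
```

Each dictionary plays the role of one campaign descriptor, and the list as a whole is the campaign handed to a single workflow enactment.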

25
Enact The Campaign: Compile Step
26
Enact The Campaign: Benchmark Step
27
Enact The Campaign: Database Step
28
Performance Workflows: results
  • The 60-descriptor campaign took 2 hours
  • But only 10 minutes to set up and initiate!
  • All data are stored in a consistent permanent
    way
  • Results can be interrogated at any time.
  • This test can be re-enacted at will and used for
    comparisons such as
  • Validate performance of the code
  • Compiler updates, OS updates, etc.
  • Validate state of the machine. (use code as a
    probe)
  • New hardware, problem discovery.
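
A rough sketch of how stored results might be interrogated to compare two enactments of the same campaign. The talk mentions mysql; this example swaps in Python's built-in sqlite3 so that it runs standalone. The table layout, campaign names, and timings are all invented for illustration and are not measured results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per benchmark run.
conn.execute(
    "CREATE TABLE runs (campaign TEXT, compiler_opts TEXT, nodes INTEGER, "
    "input_file TEXT, seconds REAL)"
)
# Invented timings, for illustration only.
rows = [
    ("before-update", "-O2", 4, "input1", 120.0),
    ("before-update", "-O3", 4, "input1", 100.0),
    ("after-update", "-O2", 4, "input1", 118.0),
    ("after-update", "-O3", 4, "input1", 125.0),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?)", rows)

def regressions(conn, old, new, tolerance=1.10):
    """Return runs whose time in campaign `new` exceeds the matching run
    in campaign `old` by more than the tolerance factor."""
    query = """
        SELECT a.compiler_opts, a.nodes, a.input_file, a.seconds, b.seconds
        FROM runs a JOIN runs b
          ON a.compiler_opts = b.compiler_opts
         AND a.nodes = b.nodes
         AND a.input_file = b.input_file
        WHERE a.campaign = ? AND b.campaign = ? AND b.seconds > a.seconds * ?
        ORDER BY a.compiler_opts
    """
    return conn.execute(query, (old, new, tolerance)).fetchall()

print(regressions(conn, "before-update", "after-update"))
# [('-O3', 4, 'input1', 100.0, 125.0)]
```

Because every run is keyed by the same descriptor fields, re-enacting the campaign after a compiler or OS update gives rows that join directly against the earlier ones.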

29
Final Slide
  • These were just 2 examples of workflows
  • Several more exist or are under development
  • Acknowledgements
  • NSF TeraGrid
  • NSF Grant to Eric Jakobsson.
  • North Carolina Bioportal
  • Funded by UNC Office of the President
  • Renaissance Computing Institute (RENCI)