Title: Workflows
1Workflows
- Jeffrey L. Tilson
- Research Scientist
- Renaissance Computing Institute
2Outline
- Example Workflows
- Two examples of workflows and what they do
- Different technical domains
- Biology
- MotifNetwork environment
- Genome-wide analysis environment.
- HPC
- Performance harness
- Facilitated software performance sweeps
3Workflows
- Wikipedia: "the movement of documents and/or tasks through a work process"
- Workflows are of two types
- Control-flow (Business oriented, BPEL)
- Data-flow (Application chaining)
- Science is typically Data-flow oriented
- So our plan is to
- Develop useful Data-flow workflows
- Exploit the RENCI Science gateway
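As a minimal sketch of the data-flow style (application chaining), the output of each processor simply becomes the input of the next. The processor names `to_upper` and `count_residues` and the helper `run_dataflow` are hypothetical illustrations, not part of Taverna or the RENCI gateway.

```python
from typing import Any, Callable, Dict, List


def to_upper(seq: str) -> str:
    """Hypothetical processor: normalize a sequence to upper case."""
    return seq.upper()


def count_residues(seq: str) -> Dict[str, int]:
    """Hypothetical processor: count how often each residue letter occurs."""
    counts: Dict[str, int] = {}
    for ch in seq:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def run_dataflow(value: Any, processors: List[Callable[[Any], Any]]) -> Any:
    """Chain processors: each output is passed on as the next input."""
    for proc in processors:
        value = proc(value)
    return value


result = run_dataflow("mkvl", [to_upper, count_residues])
print(result)
```

In a control-flow workflow the ordering of tasks would be stated explicitly; here the ordering follows only from which data each step needs.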
4Technologies
- Leverage Taverna for orchestrating workflows
- A centralized, dataflow, non-DAG environment
- A nice development platform to construct workflows
- Run as a batch application to enact workflows
- Leverage Service Technologies to build workflow processors
- BioMoby, GST (grid-services), etc.
- Utilizes Globus GT4 (pre-WS and WS)
- Leverage the RENCI Bioportal / TeraGrid Science Gateway
- Authorization, job management, audits, common look and feel, etc.
- E.g., Gene2Life.
5Taverna
- Orchestration Enactor
- Taverna
- Collaboration of the European Bioinformatics Institute (EBI) and several universities. It is funded via the Open Middleware Infrastructure Institute (OMII-UK)
- Substantial available biological/chemical support
- Taverna can use many kinds of services (BioMoby, WSDL, JDBC, etc.)
- Assisted creation of workflows.
- Enactment of workflows
6Service Technologies
- Service Technologies
- BioMoby
- Seqhound
- Soaplab
- WSDL
- Custom
- Etc.
- Taverna can support all of these (and more)
- Service Selections
- Support/Development of semantically well-formed access
- Vs.
- Ad-hoc data acquisition
- Must strike a balanced approach
- But support the users!
7TeraGrid Implementation: Bioportal extensions
- Leverages Science Gateway technology; adds additional support for:
- Community accounts
- More scalable process for managing O(1K) users
- Increased audit capability
- To make the use of community accounts safer!
- Collective (limited) administration
- Some tasks are group oriented. E.g., an RP suspending a gateway user account.
- Adds additional capability:
- Further use of transparent gridFTP
- Auditing / new mysql schema
- Technology (direct use of)
- Java / mysql / PISE
Lavanya Ramakrishnan, Mark S.C. Reed, Jeffrey L.
Tilson, Daniel A. Reed, "Grid Portals for
Bioinformatics," Second International Workshop on
Grid Computing Environments (GCE), Workshop at
SC06, November 2006, Tampa, Florida.
8Workflow examples
- Two examples of workflows
- Biological
- genome-wide analysis
- alternative functional genomics
- non-Biological
- Workflows to assist in HPC
- performance studies
9Workflow Example 1
Jeffrey L. Tilson, Gloria Rendon, Mao-Feng Ger,
and Eric Jakobsson, "MotifNetwork: A Grid-enabled
Workflow for High-throughput Domain Analysis of
Biological Sequences"
10Workflow environment
- The MotifNetwork is a suite of workflows
- Generation of new data and techniques
- Highly collaborative process (NCSA/UIUC)
- Biological users co-verify results
- Generation of reports/manuscripts
- Basic Service Types
- GST, JDBC, Globus GT4, Java
- Applications
- Ion-channels
- HoneyBee genome
11MotifNetwork: Protein-Probe MotifNetwork workflow
- One of several workflows.
- Input: an amino acid (a.a.) sequence.
- Identify homologues using psiBlast
- Identify all known domains using Interpro
- Construct webs
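The steps above can be sketched as a small pipeline. This is only an illustration of the data flow: the lookup tables and the functions `find_homologues`, `find_domains`, and `build_protein_motif_links` are hypothetical stand-ins, and no real psiBlast or InterPro call is made.

```python
from typing import Dict, List, Set, Tuple

# Toy tables standing in for the real psiBlast and InterPro searches (assumed data).
TOY_HOMOLOGUES: Dict[str, List[str]] = {"probe": ["protA", "protB"]}
TOY_DOMAINS: Dict[str, List[str]] = {
    "probe": ["dom1"],
    "protA": ["dom1", "dom2"],
    "protB": ["dom2"],
}


def find_homologues(probe_id: str) -> List[str]:
    """Stand-in for the psiBlast step: the probe plus its homologues."""
    return [probe_id] + TOY_HOMOLOGUES.get(probe_id, [])


def find_domains(protein_id: str) -> List[str]:
    """Stand-in for the InterPro step: known domains of one protein."""
    return TOY_DOMAINS.get(protein_id, [])


def build_protein_motif_links(probe_id: str) -> Set[Tuple[str, str]]:
    """Construct the web as a set of (protein, domain) links."""
    links: Set[Tuple[str, str]] = set()
    for protein in find_homologues(probe_id):
        for domain in find_domains(protein):
            links.add((protein, domain))
    return links


links = build_protein_motif_links("probe")
print(sorted(links))
```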
12P-P MotifNetwork: The Real Workflow
13MotifNetwork Data Product 1
- Data file
- Protein-Motif links
- Interpro scores
- E-Score
- Sequence locations
- (start and end bps)
- Matrix oriented analysis
- Import by Excel
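One way to picture the matrix-oriented view is a binary protein-by-motif matrix written out as CSV, which Excel can import. The link records below are made-up placeholders, and the column layout is an assumption, not the actual MotifNetwork file format.

```python
import csv
import io

# Placeholder (protein, motif) links; real data would come from the workflow output.
links = [("protA", "dom1"), ("protA", "dom2"), ("protB", "dom2")]

proteins = sorted({protein for protein, _ in links})
motifs = sorted({motif for _, motif in links})
link_set = set(links)

# matrix[i][j] is 1 when protein i carries motif j, else 0.
matrix = [[1 if (p, m) in link_set else 0 for m in motifs] for p in proteins]

# Write a header row of motif names, then one row per protein.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["protein"] + motifs)
for protein, row in zip(proteins, matrix):
    writer.writerow([protein] + row)

csv_text = buffer.getvalue()
print(csv_text)
```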
14MotifNetwork Data Product 2
The Protein-Motif Web: Cytoscape display of Data
Product 1. This indicates connections between
identified proteins and domains. Generally a few
domains are found in many proteins and the
interconnection fabric is dense.
15MotifNetwork Data Product 3
16MotifNetwork Data Product 4
The Protein-Protein Web: Cytoscape display of
Data Product 3. Inverse of the Motif-Motif web.
This is an optional output as the data set is
very large and the graphical network is
complicated.
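A possible reading of how the two webs relate, assuming each is a projection of the protein-motif links: motifs are joined when they occur in the same protein, and proteins are joined when they share a motif. The slides do not state the construction, so the `project` helper and the toy links below are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def project(links: Iterable[Tuple[str, str]], key_index: int) -> Set[Tuple[str, str]]:
    """Group links by one side, then join every pair on the other side."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for link in links:
        groups[link[key_index]].add(link[1 - key_index])
    edges: Set[Tuple[str, str]] = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


# Toy (protein, motif) links: one motif shared by three proteins.
links = [("protA", "dom1"), ("protA", "dom2"), ("protB", "dom2"), ("protC", "dom2")]

motif_motif = project(links, key_index=0)      # group by protein, join motifs
protein_protein = project(links, key_index=1)  # group by motif, join proteins
print(motif_motif, protein_protein)
```

Under this assumption, a single motif found in n proteins contributes n(n-1)/2 protein-protein edges, which is consistent with the protein-protein web being very large.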
17P-P MotifNetworkTypical Analysis
- Compare two distinct MotifNetwork runs
- Generate domain webs for both.
- (psiBlast) homologous series
- No a priori relationship imposed
- Data Union
- Identifies shared domains
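The comparison above can be sketched with set operations: take the domains found in each run, form the union, and read off the domains present in both. The link lists and the `domains_in` helper are hypothetical placeholders.

```python
from typing import Iterable, Set, Tuple


def domains_in(links: Iterable[Tuple[str, str]]) -> Set[str]:
    """Collect the domain side of a list of (protein, domain) links."""
    return {domain for _protein, domain in links}


# Two independent runs, each started from a different probe (placeholder data).
run_1_links = [("protA", "dom1"), ("protB", "dom2")]
run_2_links = [("protX", "dom2"), ("protY", "dom3")]

all_domains = domains_in(run_1_links) | domains_in(run_2_links)     # data union
shared_domains = domains_in(run_1_links) & domains_in(run_2_links)  # in both runs
print(sorted(all_domains), sorted(shared_domains))
```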
18P-P MotifNetwork workflow: Performance
- Scaling
- Two significant computational steps: psiBlast and InterProScan
- Scales well to 64 processors (psiBlast is the bottleneck)
- 20 min runtime
- per input sequence
19P-P MotifNetwork workflow: results
- NCSA/RENCI actively using this workflow
- Discovery of microbial associations.
- Hard to find using trees.
- Useful at the desktop
- 20 min runtime (per input seq) using MPPs
- Comprehensive Analysis
- Ongoing large-scale Genome-wide comprehensive
analysis on the recent honey bee genome
20Workflow Example 2
- Performance Analysis
- Determine optimum build and run parameters.
- Many compilers and compiler options
- gcc, icc, pgcc, etc.
- -O2, -O3, -xT, unroll level, etc.
- Many link options
- Math libs, new intrinsics, debugging, profile, etc.
- Many runtime options
- Multicore, network, etc.
- Consistently store results
- Basic Service Types
- GST, JDBC, Java, csh, GT4
- Example applications
- GAMESS
- ClustalW-MPI
21Performance Campaigns
- Campaign Descriptor
- A specification of the
- Compiler options
- Linker options
- Runtime options
- For a single performance run
- Campaign
- The full list of all compiles/benchmarks/MySQL updates that are to be executed with a single workflow enactment
- The list of Campaign Descriptors
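As a minimal sketch of these two definitions, a descriptor can be modeled as a small record and a campaign as a list of such records. The class name, field names, and option strings are illustrative assumptions, not the schema used by the actual workflow.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CampaignDescriptor:
    """Compiler, linker, and runtime options for a single performance run."""
    compiler_options: str
    linker_options: str
    runtime_options: str


# A campaign is the list of descriptors executed by one workflow enactment.
campaign: List[CampaignDescriptor] = [
    CampaignDescriptor("-O2", "-lm", "nodes=1"),  # option strings are placeholders
    CampaignDescriptor("-O3", "-lm", "nodes=1"),
]

for descriptor in campaign:
    print(descriptor)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: a descriptor stored alongside a result should not change after the run.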
22Example GAMESS Workflow
23The real workflow
24Invoke Performance Workflow
- Define the Campaign
- Four sets of compiler options
- Each with a specified name and description
- Five node sets
- Three Input files
- Single launch option
- 60 total runs. (4x5x3x1)
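The run count is the Cartesian product of the four option lists. The sketch below reproduces the 4 x 5 x 3 x 1 = 60 arithmetic; the option names, node counts, and file names are placeholders, since the slides give only the list sizes.

```python
from itertools import product

# Placeholder values; only the list lengths (4, 5, 3, 1) come from the slide.
compiler_option_sets = ["opts_a", "opts_b", "opts_c", "opts_d"]
node_sets = [1, 2, 4, 8, 16]
input_files = ["input_1", "input_2", "input_3"]
launch_options = ["default"]

# One tuple per performance run: every combination of the four choices.
runs = list(product(compiler_option_sets, node_sets, input_files, launch_options))
print(len(runs))
```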
25Enact The Campaign: Compile Step
26Enact The Campaign: Benchmark Step
27Enact The Campaign: Database Step
28Performance Workflows: results
- The 60-descriptor campaign took 2 hours
- But only 10 minutes to set up and initiate!
- All data are stored in a consistent, permanent way
- Results can be interrogated at any time.
- This test can be re-enacted at will and used for comparisons such as
- Validate performance of the code
- Compiler updates, OS updates, etc.
- Validate state of the machine (use code as a probe)
- New hardware, problem discovery.
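To illustrate interrogating stored results and comparing two enactments, the sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the MySQL store. The table layout, campaign names, and timings are invented placeholders, not measured data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL results database
conn.execute(
    "CREATE TABLE runs ("
    "campaign TEXT, compiler_options TEXT, nodes INTEGER, "
    "input_file TEXT, wall_seconds REAL)"
)

# Placeholder rows: a baseline enactment and a re-enactment after an update.
rows = [
    ("baseline", "-O2", 4, "input_1", 120.0),
    ("baseline", "-O3", 4, "input_1", 100.0),
    ("after_update", "-O3", 4, "input_1", 110.0),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Interrogate: which compiler options gave the fastest baseline run?
best = conn.execute(
    "SELECT compiler_options, wall_seconds FROM runs "
    "WHERE campaign = ? ORDER BY wall_seconds ASC LIMIT 1",
    ("baseline",),
).fetchone()

# Compare: how did the same configuration fare after the update?
after = conn.execute(
    "SELECT wall_seconds FROM runs WHERE campaign = ? AND compiler_options = ?",
    ("after_update", best[0]),
).fetchone()

slowdown = after[0] / best[1]  # ratio above 1 means the re-enactment was slower
print(best, slowdown)
conn.close()
```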
29Final Slide
- These were just 2 examples of workflows
- Several more exist or are under development
- Acknowledgements
- NSF TeraGrid
- NSF Grant to Eric Jakobsson.
- North Carolina Bioportal
- Funded by UNC Office of the President
- Renaissance Computing Institute (RENCI)