Title: SCEC Workflows on the Grid
1 SCEC Workflows on the Grid
- Gaurang Mehta
- Center for Grid Technologies
- USC Information Sciences Institute
2 Acknowledgements
- Ewa Deelman, Sridhar Gullapalli, Carl Kesselman, Gurmeet Singh, Mei-Hui Su, Karan Vahi (Center for Grid Technologies, ISI)
- James Blythe, Yolanda Gil (Intelligent Systems Division, ISI)
- Phil Maechling, Vipin Gupta (SCEC)
- http://pegasus.isi.edu
- Research funded as part of the NSF GriPhyN, NVO, and SCEC projects and the EU-funded GridLab project
3 Scientific Applications and the Need for Workflows
- Increasing level of complexity
  - Use of individual application components
  - Reuse of individual intermediate data products (files)
  - Description of data products using metadata attributes
- Execution environment is complex and very dynamic
  - Resources come and go
  - Data is replicated
  - Components can be found at various locations or staged in on demand
- Separation between
  - the application description
  - the actual execution description
4 Workflow Definitions
- Workflow Template: shows the main steps in the scientific analysis and their dependencies, without specifying particular data products
- Abstract Workflow / Workflow Instance: depicts the scientific analysis, including the data used and generated, but does not include information about the resources needed for execution
- Concrete Workflow / Executable Workflow: a workflow that includes details of the execution environment (the three levels are sketched below)
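To make the three levels concrete, here is a minimal sketch of how they differ in the information they carry. The class and field names are illustrative only and are not the actual Pegasus or CAT data model.

    # Illustrative only: three levels of workflow description.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TemplateStep:
        name: str                          # e.g. "synthesize_seismogram"
        depends_on: List[str] = field(default_factory=list)

    @dataclass
    class WorkflowTemplate:
        # Steps and dependencies only; no data products are named.
        steps: List[TemplateStep]

    @dataclass
    class AbstractJob:
        name: str
        transformation: str                # logical name of the component
        inputs: List[str]                  # logical file names
        outputs: List[str]
        depends_on: List[str] = field(default_factory=list)

    @dataclass
    class WorkflowInstance:
        # Data products are named, but no resources are chosen yet.
        jobs: List[AbstractJob]

    @dataclass
    class ConcreteJob:
        name: str
        executable_path: str               # physical location of the code
        site: str                          # chosen execution resource
        input_urls: List[str]              # physical file locations
        output_urls: List[str]
        depends_on: List[str] = field(default_factory=list)

    @dataclass
    class ExecutableWorkflow:
        # Fully bound to the execution environment.
        jobs: List[ConcreteJob]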
5 INTEGRATED WORKFLOW ARCHITECTURE
[Architecture diagram: a Workflow Template Editor (CAT) queries the Component Library and a Domain Ontology for components and produces a Workflow Template (WT), stored in a Workflow Library. Data Selection, driven by the Conceptual Data Query Engine (DataFinder) querying a Metadata Catalog for data given metadata, turns the template into a Workflow Instance (WI). Workflow Mapping (Pegasus) adds execution requirements using Grid information services and generates an Executable Workflow that runs on the Grid. Components carry I/O data descriptions. Contributors annotated on the diagram: J. Zechar (USC), D. Okaya (USC), L. Hearn (UBC), K. Olsen (SDSU).]
6 Concrete Workflow Generation and Mapping
[Diagram separating the application-dependent part of the workflow (the jobs themselves and their dependencies) from the application-independent mapping machinery.]
7 Pegasus: Planning for Execution in Grids
- Maps from abstract to concrete workflow
- Algorithmic and AI-based techniques
- Automatically locates physical locations for both workflow components and data
- Finds appropriate resources to execute
- Reuses existing data products where applicable (see the data-reuse sketch below)
- Publishes newly derived data products
- Provides provenance information
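The data-reuse point can be illustrated with a small sketch: if every output of a job is already registered in the replica catalog, the job can be dropped and its outputs staged in instead. This is a simplified stand-in, not the actual Pegasus reduction algorithm; the job and catalog structures are invented for the example.

    # Illustrative workflow reduction based on existing data products.
    def reduce_workflow(jobs, replica_catalog):
        """jobs: topologically ordered dicts with 'name', 'outputs', 'depends_on'.
        replica_catalog: dict mapping logical file name -> list of physical URLs."""
        kept = []
        satisfied = set()
        for job in jobs:
            if all(lfn in replica_catalog for lfn in job["outputs"]):
                satisfied.add(job["name"])      # outputs already exist; stage them in later
            else:
                kept.append(job)
        # Dependencies on removed jobs become data stage-ins handled elsewhere.
        for job in kept:
            job["depends_on"] = [d for d in job["depends_on"] if d not in satisfied]
        return kept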
8 Generating a Concrete Workflow
- Information
  - location of files and component instances
  - state of the Grid resources
- Select specific
  - resources
  - files
- Add jobs required to form a concrete workflow that can be executed in the Grid environment (the added jobs are sketched below)
  - data movement
  - data registration
- Each component in the abstract workflow is turned into an executable job
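The following sketch shows the augmentation step described above: each abstract job becomes an executable job, with stage-in, stage-out, and registration jobs added around it. The function, the job dictionaries, and the storage host name are hypothetical and only illustrate the idea; they are not Pegasus APIs.

    # Hedged sketch of turning abstract jobs into a concrete workflow.
    def make_concrete(abstract_jobs, site, transformation_catalog, replica_catalog):
        concrete = []
        for job in abstract_jobs:
            # 1. Stage in each input from one of its physical replicas.
            for lfn in job["inputs"]:
                src = replica_catalog[lfn][0]
                concrete.append({"type": "transfer", "from": src,
                                 "to": f"gsiftp://{site}/scratch/{lfn}"})
            # 2. The compute job itself, bound to an executable at the site.
            exe = transformation_catalog[(job["transformation"], site)]
            concrete.append({"type": "compute", "executable": exe,
                             "site": site, "args": job["inputs"]})
            # 3. Stage out and register the newly derived data products.
            for lfn in job["outputs"]:
                concrete.append({"type": "transfer",
                                 "from": f"gsiftp://{site}/scratch/{lfn}",
                                 "to": f"gsiftp://storage.example.org/{lfn}"})
                concrete.append({"type": "register", "lfn": lfn,
                                 "pfn": f"gsiftp://storage.example.org/{lfn}"})
        return concrete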
9 Information Components used by Pegasus
- Globus Monitoring and Discovery Service (MDS)
  - Locates available resources
  - Finds resource properties
    - Dynamic: load, queue length
    - Static: location of GridFTP server, RLS, etc.
- Globus Replica Location Service (RLS)
  - Stores mappings of logical files to their physical instances
  - Locates data that may be replicated
  - Registers new data products
- Transformation Catalog
  - Stores information about transformations (executables) on remote resources, in either installed or stageable form (toy examples of these catalogs follow below)
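To show what Pegasus asks each of these information sources during planning, here are toy in-memory stand-ins. The site names, files, paths, and the site-selection rule are invented for illustration; the real services are queried over the Grid.

    # Toy stand-ins for MDS, RLS, and the Transformation Catalog.
    # MDS-style resource information (dynamic + static properties).
    site_info = {
        "hpc.usc.edu": {"load": 0.7, "queue_length": 12,
                        "gridftp": "gsiftp://hpc.usc.edu"},
        "tg.sdsc.edu": {"load": 0.2, "queue_length": 3,
                        "gridftp": "gsiftp://tg.sdsc.edu"},
    }

    # RLS-style mapping: logical file name -> physical replicas.
    replica_catalog = {
        "rupture_0042.srf": ["gsiftp://hpc.usc.edu/data/rupture_0042.srf"],
    }

    # Transformation Catalog: (logical transformation, site) -> executable path.
    transformation_catalog = {
        ("seismogram_synthesis", "hpc.usc.edu"): "/opt/scec/bin/synth",
        ("seismogram_synthesis", "tg.sdsc.edu"): "/usr/local/scec/bin/synth",
    }

    def pick_site(transformation):
        """Choose the least-loaded site where the executable is available."""
        candidates = [site for (t, site) in transformation_catalog if t == transformation]
        return min(candidates, key=lambda s: site_info[s]["load"])

    print(pick_site("seismogram_synthesis"))   # -> "tg.sdsc.edu" in this toy example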
10 Data Management Components
- GridFTP: a Grid extension to the regular FTP protocol that allows third-party transfers, parallel streams, and striping
- SRB/GridFTP DSI: a data service interface that provides GridFTP-protocol access for storing and retrieving files from SRB (Storage Resource Broker)
- Reliable File Transfer (RFT): builds on the GridFTP server and provides reliable transfers through restarts, retries, and other capabilities (a simple retry sketch follows below)
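A simplified illustration of the reliable-transfer idea: retry a GridFTP transfer a few times before giving up. Real RFT is a persistent service with richer recovery; this sketch just shells out to globus-url-copy in its plain two-argument form, and the URLs in the usage note are placeholders.

    # Simplified retry wrapper, not the actual RFT service.
    import subprocess
    import time

    def reliable_copy(src_url, dst_url, retries=3, wait_seconds=30):
        for attempt in range(1, retries + 1):
            result = subprocess.run(["globus-url-copy", src_url, dst_url])
            if result.returncode == 0:
                return True
            time.sleep(wait_seconds)        # back off before retrying
        return False

    # Example (placeholder URLs):
    # reliable_copy("gsiftp://hpc.usc.edu/data/seis_0042.grm",
    #               "gsiftp://tg.sdsc.edu/scratch/seis_0042.grm")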
11 Benefits of the workflow / Pegasus approach
- The workflow exposes
  - the structure of the application
  - the maximum parallelism of the application
- Pegasus can take advantage of this structure to
  - set a planning horizon (how far into the workflow to plan)
  - cluster a set of workflow nodes to be executed as one (for performance; a clustering sketch follows below)
  - cluster a set of workflow nodes to be executed on the same site (to reduce data transfers)
- Pegasus shields the user from Grid details
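One simple way to picture node clustering is to group jobs that sit at the same depth in the workflow and are mapped to the same site into a single clustered submission, so the Grid sees fewer, larger jobs. This is only a sketch under that assumption; the actual Pegasus clustering options are richer than this.

    # Sketch of level-based clustering of workflow nodes.
    from collections import defaultdict

    def cluster_by_level(jobs):
        """jobs: dicts with 'name', 'site', and 'depends_on' (list of job names)."""
        by_name = {j["name"]: j for j in jobs}
        depth = {}

        def level(name):
            if name not in depth:
                parents = by_name[name]["depends_on"]
                depth[name] = 0 if not parents else 1 + max(level(p) for p in parents)
            return depth[name]

        clusters = defaultdict(list)
        for j in jobs:
            clusters[(level(j["name"]), j["site"])].append(j["name"])
        return dict(clusters)   # each value would become one clustered job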
12 Benefits of the workflow / Pegasus approach
- Pegasus can run the workflow on a variety of resources
- Pegasus can run a single workflow across multiple resources
- Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)
- Pegasus can take advantage of pre-existing intermediate data products
- Pegasus can improve the performance of the application
13 Nagios Monitoring
14 CyberShake Workflow
15 CyberShake Workflow
- Tests done with ruptures from a 50 km region around USC
- Approx. 2,350 ruptures with about 415,000 points
- Tests done on multiple sites, including the HPC cluster at USC and TeraGrid at SDSC
- The system uses Pegasus to generate and plan the workflows to run on the grid
- The peak acceleration values are used to construct a final hazard curve (see the sketch below)
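As a rough illustration of how peak acceleration values feed a hazard curve: for each ground-motion level, sum the annual rates of the ruptures whose peak acceleration exceeds that level. The rupture rates and thresholds here are invented, and the actual CyberShake probabilistic calculation is considerably more detailed than this sketch.

    # Highly simplified hazard-curve sketch (not the CyberShake method).
    def hazard_curve(ruptures, acceleration_levels):
        """ruptures: list of (annual_rate, peak_acceleration_g) tuples."""
        curve = []
        for x in acceleration_levels:
            rate_of_exceedance = sum(rate for rate, pa in ruptures if pa > x)
            curve.append((x, rate_of_exceedance))
        return curve

    # Example with made-up numbers:
    # hazard_curve([(0.004, 0.12), (0.001, 0.35), (0.0005, 0.6)],
    #              [0.1, 0.2, 0.4])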
16 Future Goal
- Goal: run a CyberShake analysis on ruptures within 200 km of USC
- Generate a hazard curve using seismogram and peak acceleration values from 45,000 ruptures, or 300,000 ruptures with moments
- QUESTIONS?