BPEL4Job: a Fault-handling Design for Job Flow Management


1
BPEL4Job: a Fault-handling Design for Job Flow Management

Wei Tan1, Liana Fong2, Norman Bobroff2
1 Dept. Automation, Tsinghua University, Beijing, China
2 IBM T. J. Watson Research Center, Hawthorne, USA
tanwei_at_mails.tsinghua.edu.cn
llfong_at_us.ibm.com, bobroff_at_us.ibm.com

2
Agenda

1 Introduction
2 BPEL4Job: a fault-handling design for job flow management
3 Integrating fault-handling policies with job flow modeling
4 Fault-handling at the flow execution layer
5 Implementation and sample application
6 Conclusion and ongoing and future work

3
1 Introduction: Motivation

Job flow is especially relevant in orchestrating batch jobs
Enforce job execution sequence
Manage job execution trace
Handle run-time faults at the flow level
Various languages and systems have been devised
DAGMan/Condor, Taverna/myGrid, Job Stream/Tivoli-Workload Scheduler, JobCommand/Tivoli-LoadLeveler
BPEL-based job flow management is attracting more attention
Resources and applications are becoming service-oriented
Requirement to combine business processes (including human tasks) with back-end batch jobs
BPEL provides a framework for flow orchestration, data manipulation, and fault handling, and can be extended or enhanced
BPEL is supported by industry and the open source community

4
1 Introduction: Challenges

The use of BPEL for job flow is not without technical challenges
Defining a job entity
BPEL does not support using JSDL or other job specification languages
Supporting data flow and dependencies
Data staging in/out
Incorporating the asynchronous interaction with schedulers
Job schedulers usually report job status asynchronously
Incorporating fault tolerance and recovery strategies in job flow
Job flows have special fault-handling requirements, such as re-try and re-submit
Supporting dynamic changes of flow instances
For cases where the flow execution logic cannot be fully anticipated in advance

5
1 Introduction: BPEL4Job

The goal of BPEL4Job
A BPEL-based job flow system with fault-handling capability
Challenges addressed
How to communicate with job schedulers?
A generic job proxy to facilitate asynchronous job submission and job status notification
How to model a job flow with fault-handling capability?
A policy-based, two-stage approach
How to enforce various fault-handling policies at run-time?
A set of fundamental fault-handling schemes, including, in particular, instance migration between flow engines

6
2 BPEL4Job: a fault-handling design for job flow management

Flow modeling layer
Stage 1: define the base flow, the job definitions, and the fault-handling policies.
Stage 2: generate the expanded flow.
Flow execution layer
Flow engine
Job proxy
Fault-handling service
Job scheduling layer
Job schedulers

7
3 Integrating fault-handling policies with job flow modeling

BPEL4Job considers three kinds of policies
Cleanup
Generate a fault report and delete the instance data in the flow engine.
Re-try
Re-execute the job in the same engine.
Re-submit
Export the flow instance state
Restore the flow instance in a different engine, so that the flow can resume from the failed job
More policies could be defined and implemented based on the three fundamental policies
Rollback, alternate job, etc.
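
The three policies above can be sketched as a small dispatch function. This is an illustrative Python sketch only, not the BPEL4Job implementation: the `Policy` enum, the `handle_fault` function, and the plain dictionaries standing in for flow engines are all hypothetical names introduced here.

```python
from enum import Enum


class Policy(Enum):
    CLEANUP = "cleanup"
    RETRY = "re-try"
    RESUBMIT = "re-submit"


def handle_fault(policy, job, engine, other_engine):
    """Apply one fault-handling policy to a failed job.

    `engine` and `other_engine` are plain dicts standing in for flow
    engines (an assumption of this sketch):
    {"name": str, "instances": {job_name: state}}.
    Returns a short description of the action taken.
    """
    if policy is Policy.CLEANUP:
        # Generate a fault report and delete the instance data.
        state = engine["instances"].pop(job, None)
        return f"report: {job} ended as {state} in {engine['name']}; instance deleted"
    if policy is Policy.RETRY:
        # Re-execute the job in the same engine.
        engine["instances"][job] = "running"
        return f"{job} re-executed in {engine['name']}"
    if policy is Policy.RESUBMIT:
        # Export the instance state, then restore it in a different engine.
        engine["instances"].pop(job, None)
        other_engine["instances"][job] = "running"
        return f"{job} migrated from {engine['name']} to {other_engine['name']}"
    raise ValueError(f"unknown policy: {policy}")
```

Derived policies such as rollback or alternate job would, under this sketch, be additional enum members whose handlers combine the three basic actions.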

8
3 Integrating fault-handling policies with job flow modeling
Figure: the base flow with policies embedded, showing the re-try policy and the re-submit policy
9
3 Integrating fault-handling policies with job flow modeling
Figure: the transformation from base flow to expanded flow that implements the re-try policy of Job1
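
The base-to-expanded transformation for a re-try policy can be illustrated in a few lines. This is a hedged Python sketch rather than the actual BPEL transformation: the list-of-names flow format, the `{job: max_attempts}` policy format, and the functions `expand_flow` and `run_flow` are assumptions made for illustration.

```python
def expand_flow(base_flow, retry_policies):
    """Stage 2: turn a base flow into an expanded flow.

    base_flow: ordered list of job names.
    retry_policies: {job_name: max_attempts} (hypothetical format).
    Each job with a re-try policy is wrapped in a retry-loop node;
    other jobs are kept as plain invoke nodes.
    """
    expanded = []
    for job in base_flow:
        if job in retry_policies:
            expanded.append({"type": "retry_loop", "job": job,
                             "max_attempts": retry_policies[job]})
        else:
            expanded.append({"type": "invoke", "job": job})
    return expanded


def run_flow(expanded, run_job):
    """Execute an expanded flow; run_job(job) returns True on success.

    Returns (flow_succeeded, log), where log holds one
    (job, attempt_number, succeeded) tuple per execution.
    """
    log = []
    for node in expanded:
        attempts = node.get("max_attempts", 1)
        for attempt in range(1, attempts + 1):
            ok = run_job(node["job"])
            log.append((node["job"], attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed: stop the flow here.
            return False, log
    return True, log
```

The flow author only writes the base flow and the policy table; the retry loop appears solely in the generated, expanded flow.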
10
4 Fault-handling at the flow execution layer

We leverage
BPEL fault-handling constructs: Catch, CatchAll
We add
Specific capabilities to recognize job failures and to handle faults according to defined policies.
Components in this layer
The generic job proxy for job submission and job status notification
The fault-handling service to enforce the policies defined in the flow modeling layer

11
The generic job proxy

Generic job proxy
Receives a job submission request.
Forwards the request to a scheduler, and starts to listen for job state notifications from it.
For a notification indicating job success or failure, forwards it to the flow engine and returns; otherwise, continues listening.
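
The proxy behavior described above can be sketched as a listen loop. This Python sketch is illustrative only: the state names in `TERMINAL_STATES`, the `FakeScheduler` stand-in, and the `job_proxy` and `notify_engine` names are assumptions, not part of BPEL4Job or of any real scheduler API.

```python
import queue

# Assumed names for the states that end the listening loop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


class FakeScheduler:
    """Stand-in scheduler that emits a scripted sequence of job states.

    A real scheduler would post notifications asynchronously, typically
    from another thread or process; here they are queued at submit time.
    """

    def __init__(self, states):
        self._states = list(states)

    def submit(self, job, notifications):
        for state in self._states:
            notifications.put((job, state))


def job_proxy(job, scheduler, notify_engine):
    """Submit a job, then listen until a success/failure notification."""
    notifications = queue.Queue()
    scheduler.submit(job, notifications)
    while True:
        # Block until the next state notification arrives.
        job_id, state = notifications.get(timeout=5)
        if state in TERMINAL_STATES:
            # Forward the final state to the flow engine and return.
            notify_engine(job_id, state)
            return state
        # Intermediate states (e.g. queued, running): keep listening.
```

Usage: `job_proxy("Job1", FakeScheduler(["QUEUED", "RUNNING", "FAILED"]), print)` forwards only the final `FAILED` notification to the callback.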

12
Fault-handling schemes in flow execution
13
Flow re-submission and instance migration

Extract all the information related to a BPEL instance.
Re-shape the instance data and migrate it into another WPS engine.
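
The extract, re-shape, and migrate steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the WPS mechanism: engines are modeled as plain dictionaries, and the instance fields (`engine`, `activities`, `completed`, `resume_from`) and function names are hypothetical.

```python
import copy
import json


def export_instance(engine, instance_id):
    """Extract all information related to one flow instance."""
    return copy.deepcopy(engine["instances"][instance_id])


def reshape_instance(instance, target_engine_name, resume_from):
    """Adapt exported data for the target engine (hypothetical fields)."""
    reshaped = copy.deepcopy(instance)
    reshaped["engine"] = target_engine_name
    reshaped["resume_from"] = resume_from
    # Only activities before the resume point stay marked as completed;
    # the resume point and everything after it will run again.
    idx = reshaped["activities"].index(resume_from)
    before = reshaped["activities"][:idx]
    reshaped["completed"] = [a for a in reshaped["completed"] if a in before]
    return reshaped


def migrate(source, target, instance_id, resume_from):
    """Move one instance from the source engine to the target engine."""
    exported = export_instance(source, instance_id)
    reshaped = reshape_instance(exported, target["name"], resume_from)
    # Round-trip through JSON to mimic a transfer between engines.
    target["instances"][instance_id] = json.loads(json.dumps(reshaped))
    del source["instances"][instance_id]
    return target["instances"][instance_id]
```

Flow variables travel with the instance unchanged, so the target engine can resume from the chosen activity with the data produced by the earlier ones.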

14
Implementation
WebSphere Integration Developer (WID)
WebSphere Process Server (WPS)
IBM Tivoli Dynamic Workload Broker (ITDWB)
15
Sample Montage Job Flow

Montage: a toolkit for assembling raw astronomy images into custom mosaics.
Developed by NASA and the California Institute of Technology.
The assembling process is usually expressed as a job flow.

Figure: job flow steps applied to the raw images: generate image table, image projection in parallel, generate image table, generate mosaic, transform to JPEG
16
Montage job flow and the re-start policy
Figure: base flow and expanded flow (partial)
Policy: re-submit from mImgtbl1 when mAdd1 fails
17
Instance migration from saba10 to weitan

(a) Montage initiated and failed at saba10
(b) Montage migrated to weitan
(c) Montage re-started and completed at weitan
18
Conclusion

BPEL4Job: an exploration of using BPEL as a job flow language
A two-stage approach for job flow modeling with fault-handling policies
A generic job proxy to facilitate the asynchronous nature of job submission and job status notification
A set of fundamental fault-handling schemes, including instance migration between flow engines
Future work
Support more complicated fault-handling policies
Involving human tasks, expressed as business rules, etc.
Apply the instance migration technique to
Load balancing between flow engines
Instance migration to a newer version