Title: BPEL4Job: a Fault-handling Design for Job Flow Management
- Wei Tan (1), Liana Fong (2), Norman Bobroff (2)
- (1) Dept. of Automation, Tsinghua University, Beijing, China
- (2) IBM T. J. Watson Research Center, Hawthorne, USA
- tanwei_at_mails.tsinghua.edu.cn
- llfong_at_us.ibm.com, bobroff_at_us.ibm.com
Agenda
- 1 Introduction
- 2 BPEL4Job: a fault-handling design for job flow management
- 3 Integrating fault-handling policies with job flow modeling
- 4 Fault-handling at the flow execution layer
- 5 Implementation and sample application
- 6 Conclusion and ongoing future work
1 Introduction: Motivation
- Job flow is especially relevant in orchestrating batch jobs
  - Enforce job execution sequence
  - Manage job execution trace
  - Handle run-time faults at the flow level
- Various languages and systems have been devised
  - DAGMan/Condor, Taverna/myGrid, Job Stream/Tivoli-Workload Scheduler, JobCommand/Tivoli-LoadLeveler
- BPEL-based job flow management is attracting more attention
  - Resources and applications are becoming service-oriented
  - Requirement to combine business processes (including human tasks) with back-end batch jobs
  - BPEL provides a framework for flow orchestration, data manipulation, and fault handling, and can be extended or enhanced
  - BPEL is supported by industry and the open source community
1 Introduction: Challenges
- The use of BPEL for job flow is not without technical challenges
- Defining a job entity
  - BPEL does not support JSDL or other job specification languages
- Supporting data flow and dependencies
  - Data staging in/out
- Incorporating asynchronous interaction with schedulers
  - Job schedulers usually report job status asynchronously
- Incorporating fault tolerance and recovery strategies in job flow
  - Job flow has special requirements for fault handling, such as re-try and re-submit
- Supporting dynamic changes of flow instances
  - For cases where the flow execution logic cannot be fully anticipated in advance
1 Introduction: BPEL4Job
- The goal of BPEL4Job
  - A BPEL-based job flow system with fault-handling capability
- Challenges addressed
  - How to communicate with job schedulers?
    - A generic job proxy to facilitate asynchronous job submission and job status notification
  - How to model a job flow with fault-handling capability?
    - A policy-based, two-stage approach
  - How to enforce various fault-handling policies at run-time?
    - A set of fundamental fault-handling schemes, notably including instance migration between flow engines
2 BPEL4Job: a fault-handling design for job flow management
- Flow modeling layer
  - Stage 1: define the base flow, the job definitions, and the fault-handling policies.
  - Stage 2: generate the expanded flow.
- Flow execution layer
  - Flow engine
  - Job proxy
  - Fault-handling service
- Job scheduling layer
  - Job schedulers
3 Integrating fault-handling policies with job flow modeling
- BPEL4Job considers three kinds of policies
  - Cleanup
    - Generate a fault report and delete the instance data in the flow engine.
  - Re-try
    - Re-execute the job in the same engine.
  - Re-submit
    - Export the flow instance state
    - Restore the flow instance in a different engine, so that the flow can resume from the failed job
- More policies could be defined and implemented based on the three fundamental policies
  - Rollback, alternate job, etc.
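The three policies can be pictured as a small data model. The following is a minimal sketch in Python, not the BPEL4Job implementation: the names `Policy`, `FaultPolicy`, `plan_actions`, and the `max_attempts` field are assumptions introduced for illustration, and the action strings simply restate the bullets above.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    CLEANUP = "cleanup"      # report the fault, delete instance data
    RETRY = "re-try"         # re-execute the job in the same engine
    RESUBMIT = "re-submit"   # export state, restore in a different engine


@dataclass
class FaultPolicy:
    job: str                 # job the policy is attached to
    policy: Policy
    max_attempts: int = 1    # hypothetical parameter, not from the slides


def plan_actions(fp: FaultPolicy) -> list[str]:
    """Return the ordered actions a fault-handling service might take."""
    if fp.policy is Policy.CLEANUP:
        return ["generate fault report", "delete instance data"]
    if fp.policy is Policy.RETRY:
        return [f"re-execute {fp.job} in the same engine"]
    if fp.policy is Policy.RESUBMIT:
        return [
            "export flow instance state",
            "restore flow instance in a different engine",
            f"resume from {fp.job}",
        ]
    raise ValueError(f"unknown policy: {fp.policy}")
```

Composite policies such as rollback or alternate job could then be expressed as sequences built from these three building blocks.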
3 Integrating fault-handling policies with job flow modeling
[Figure: the base flow with policies embedded; callouts: the re-try policy, the re-submit policy]
3 Integrating fault-handling policies with job flow modeling
[Figure: the transformation from base flow to expanded flow that implements the re-try policy of Job1]
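The slide's figure is not recoverable here, so the following Python sketch only illustrates the general idea of the stage-2 transformation: a job that carries a re-try policy is rewritten into a scope that re-invokes it on failure. The list/dict flow representation, the functions `expand_flow` and `run`, and the job name `Job2` are assumptions; only `Job1` appears in the original.

```python
def expand_flow(base_flow, retry_policies):
    """Stage 2 (sketch): wrap each job that has a re-try policy in a retry scope."""
    expanded = []
    for job in base_flow:
        attempts = retry_policies.get(job)
        if attempts is None:
            expanded.append({"invoke": job})
        else:
            expanded.append({"retry_scope": {"invoke": job, "max_attempts": attempts}})
    return expanded


def run(expanded, execute):
    """Run the expanded flow; `execute(job)` returns True on success."""
    for step in expanded:
        scope = step.get("retry_scope")
        job = scope["invoke"] if scope else step["invoke"]
        attempts = scope["max_attempts"] if scope else 1
        for _ in range(attempts):
            if execute(job):
                break            # job succeeded, move to the next step
        else:
            return f"{job} failed"   # all attempts exhausted
    return "completed"


# Base flow with a re-try policy on Job1 (up to 3 attempts, an assumed value).
expanded = expand_flow(["Job1", "Job2"], {"Job1": 3})
```

The flow author only writes the base flow and the policy; the retry logic appears in the expanded flow alone.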
4 Fault-handling at the flow execution layer
- We leverage
  - BPEL fault-handling constructs: Catch, CatchAll
- We enhance
  - Specific capabilities to recognize job failures and to handle faults according to defined policies.
- Components in this layer
  - The generic job proxy for job submission and job status notification
  - The fault-handling service to enforce the policies defined in the flow modeling layer
The generic job proxy
- Generic job proxy
  - Receives a job submission request.
  - Forwards the request to a scheduler, and starts to listen for job state notifications from it.
  - For a notification indicating job success or failure, forwards it to the flow engine and returns; otherwise continues listening.
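The proxy behaviour described above can be sketched as a simple listen loop. This is an illustrative Python sketch under stated assumptions: `FakeScheduler`, `job_proxy`, the `notify_engine` callback, and the state names `QUEUED`, `RUNNING`, `SUCCEEDED`, and `FAILED` are hypothetical and do not come from any real scheduler API.

```python
import queue

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}   # assumed state names


class FakeScheduler:
    """Stand-in scheduler that emits a scripted sequence of job states."""

    def __init__(self, states):
        self.states = states

    def submit(self, job, notifications):
        # A real scheduler would notify asynchronously; here the states
        # are simply queued up front.
        for state in self.states:
            notifications.put((job, state))


def job_proxy(job, scheduler, notify_engine):
    """Submit a job, then listen until a success/failure notification arrives."""
    notifications = queue.Queue()
    scheduler.submit(job, notifications)        # forward the request
    while True:                                 # listen for state notifications
        _, state = notifications.get(timeout=5)
        if state in TERMINAL_STATES:
            notify_engine(job, state)           # forward to the flow engine
            return state
        # non-terminal state (e.g. QUEUED, RUNNING): keep listening
```

Because only terminal states are forwarded, the flow engine sees a single reply per job regardless of how many intermediate states the scheduler reports.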
Fault-handling schemes in flow execution
Flow re-submission and instance migration
- Extract all the information related to a BPEL instance.
- Re-shape the instance data and migrate it into another WPS engine.
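The two steps above (extract, then re-shape and migrate) can be sketched as an export/restore pair. This Python sketch is purely illustrative: it does not use any WPS API, and the dictionary layout, the functions `export_instance` and `restore_instance`, and the engine names `engineA` and `engineB` are assumptions.

```python
import json


def export_instance(instance):
    """Extract the instance information needed to resume elsewhere."""
    return json.dumps({
        "flow": instance["flow"],
        "completed": instance["completed"],
        "variables": instance["variables"],
    })


def restore_instance(snapshot, target_engine):
    """Re-shape the exported data for the target engine and find the resume point."""
    data = json.loads(snapshot)
    remaining = [job for job in data["flow"] if job not in data["completed"]]
    return {
        "engine": target_engine,
        "variables": data["variables"],
        "remaining": remaining,      # the flow resumes from the first entry
    }


# Job2 failed on engineA after Job1 completed; migrate the instance to engineB.
instance = {
    "engine": "engineA",
    "flow": ["Job1", "Job2", "Job3"],
    "completed": ["Job1"],
    "variables": {"input": "raw.tbl"},
}
restored = restore_instance(export_instance(instance), "engineB")
```

Serializing to a neutral format is one way to decouple the two engines; the actual re-shaping in BPEL4Job may differ.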
Implementation
- WebSphere Integration Developer (WID)
- WebSphere Process Server (WPS)
- Tivoli Dynamic Workload Broker (ITDWB)
Sample: Montage Job Flow
- Montage: a toolkit for assembling raw astronomy images into custom mosaics.
  - Developed by NASA and the California Institute of Technology.
  - The assembling process is usually expressed as a job flow.
[Figure: Montage job flow: raw images, generate image table, image projection in parallel, generate image table, generate mosaic, transform to JPEG]
Montage job flow and the re-start policy
[Figure: base flow and expanded flow (partial)]
- Policy: re-submit from mImgtbl1 when mAdd1 fails
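The policy on this slide names a restart point that differs from the failed job. A minimal Python sketch of how such a mapping could select the jobs to re-run is given below. Only `mImgtbl1` and `mAdd1` are named in the original; the remaining job names, their order, and the function `jobs_to_rerun` are assumptions based on the flow steps listed on the previous slide.

```python
# Assumed linearised Montage flow; only mImgtbl1 and mAdd1 appear on the slide.
MONTAGE_FLOW = ["mImgtbl1", "mProject", "mImgtbl2", "mAdd1", "mJPEG"]

# Policy: when the key job fails, re-submit starting from the value job.
RESUBMIT_FROM = {"mAdd1": "mImgtbl1"}


def jobs_to_rerun(flow, failed_job, resubmit_from):
    """Return the jobs to execute after a failure, starting at the restart point."""
    restart = resubmit_from.get(failed_job, failed_job)  # default: the failed job itself
    return flow[flow.index(restart):]


rerun = jobs_to_rerun(MONTAGE_FLOW, "mAdd1", RESUBMIT_FROM)
```

A job with no entry in the mapping simply restarts from itself, which matches the default re-submit behaviour of resuming from the failed job.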
Instance migration from saba10 to weitan
(a) Montage initiated and failed at saba10
(b) Montage migrated to weitan
(c) Montage re-started and completed at weitan
Conclusion
- BPEL4Job: an exploration of using BPEL as a job flow language
  - A two-stage approach for job flow modeling with fault-handling policies
  - A generic job proxy to handle the asynchronous nature of job submission and job status notification
  - A set of fundamental fault-handling schemes, including instance migration between flow engines
- Future work
  - Support more complicated fault-handling policies
    - Involving human tasks, expressed as business rules, etc.
  - Apply the instance migration technique to
    - Load balancing between flow engines
    - Instance migration to a newer version
Future work
Thank you for your attention.
- Please contact me at
  - Dept. of Automation, Tsinghua Univ., Beijing, China
  - http://twtanwei.googlepages.com
  - twtanwei_at_gmail.com