Title: Using Stork Barcelona, 2006
1Using Stork
Barcelona, 2006
2Meet Friedrich
- Friedrich is a scientist with a BIG problem.
Frieda's twin brother
3I have a lot of data to process.
4Friedrich's problem
- Friedrich has many large data sets to process.
- For each data set:
- stage the data in from a remote server
- run a job to process the data
- stage the data out to a remote server
5The Classic Data Transfer Job
#!/bin/sh
globus-url-copy source dest
Scripts often work fine for short, simple data transfers, but...
6Many things can go wrong!
- These errors are more likely with large data sets:
- The network is down.
- The data server is unavailable.
- The transferred data is corrupt.
- The workflow does not know that the data was bad.
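The last two failure modes can be made concrete with a minimal sketch. It assumes a local file copy (shutil.copyfile) as a stand-in for a real transfer tool, and the helper names sha256_of and copy_and_verify are hypothetical. The point is only that a workflow has to check the result explicitly, or it will not know that the data was bad.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, dest: Path) -> bool:
    """Copy source to dest and report whether the copy matches the source.

    Assumption: shutil.copyfile stands in for a real transfer tool.
    """
    try:
        shutil.copyfile(source, dest)
    except OSError:
        # The copy itself failed (missing source, unreachable target, ...).
        return False
    # The copy "succeeded"; confirm the bytes actually match.
    return sha256_of(source) == sha256_of(dest)
```

A plain shell script that only runs the copy command performs neither check unless the author adds it by hand.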
7Stork Solves Problems
- Creates the concept of the data placement job
- Data placement jobs are managed and scheduled the same as any Condor job
- Friedrich's jobs benefit from built-in fault tolerance
8Supported Data Transfer Protocols
- local file system
- GridFTP
- FTP
- HTTP
- SRB
- NeST
- SRM
- and, it is extensible to other protocols
9Fault Tolerance
- Retries failed jobs
- Can also retry a failed data transfer job using an alternate protocol.
- For example, first try GridFTP, then try FTP
- Retry stuck jobs
- Configurable fault responses
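The retry-then-fall-back idea can be sketched in a few lines. This is an illustration of the control flow only, not Stork's implementation: transfer_with_fallback, fake_attempt, and the max_retry default are hypothetical, and the "transfer" is a stand-in function that simply reports success or failure.

```python
from typing import Callable, Dict, List, Optional


def transfer_with_fallback(
    protocols: List[str],
    attempt: Callable[[str], bool],
    max_retry: int = 3,
) -> Optional[str]:
    """Try each protocol in order, up to max_retry attempts each.

    Returns the protocol that succeeded, or None if every attempt failed.
    """
    for protocol in protocols:
        for _ in range(max_retry):
            if attempt(protocol):
                return protocol
    return None


# Hypothetical stand-in for a real transfer: GridFTP always fails,
# FTP succeeds on its second attempt.
calls: Dict[str, int] = {}


def fake_attempt(protocol: str) -> bool:
    calls[protocol] = calls.get(protocol, 0) + 1
    return protocol == "ftp" and calls[protocol] >= 2


winner = transfer_with_fallback(["gsiftp", "ftp"], fake_attempt)
print(winner)  # ftp
```

Here GridFTP is tried three times and fails, then FTP is tried and succeeds on its second attempt, so the job as a whole completes.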
10Getting Stork
- Stork is part of Condor, so get Condor...
- Available as a free download from http://www.cs.wisc.edu/condor
- Currently available for Linux platforms
11Personal Condor works well with Stork
- This is Condor/Stork on your own workstation: no root access required, no system administrator intervention needed
- After installation, Friedrich submits his jobs to his Personal Stork
12Friedrich's Personal Condor
[Diagram: Friedrich's workstation, with components labeled Central Mgr., Master, StartD, SchedD, and Stork; a DAG; data jobs; CPU jobs; N compute elements; external data servers]
13Stork will ...
- Keep an eye on data placement jobs and keep you posted on their progress
- Throttle the maximum number of jobs running
- Keep a log of job activities
- Add fault tolerance to all jobs
- Detect and retry failed data placement jobs
14The Submit Description File
- Just like the rest of Condor, a plain ASCII text file, but with a different format
- Written in the new ClassAd language
- Neither Stork nor Condor cares about file name extensions
- The contents of the file tell Stork about jobs:
- data placement type, source/destination location/protocol, proxy location, alternate protocols to try
15Simple Submit File
// c style comment lines
// file name is stage-in.stork
[
    dap_type = "transfer";
    src_url = "http://server/path";
    dest_url = "file:///dir/file";
    log = "stage-in.log";
]
Note different format from Condor submit files
16Another Simple Submit File
// c style comment lines
// file name is stage-in.stork
[
    dap_type = "transfer";
    src_url = "gsiftp://server/path";
    dest_url = "file:///dir/file";
    x509proxy = "default";
    log = "stage-in.log";
]
Note different format from Condor submit files
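When many data sets need the same kind of transfer, the submit descriptions can be generated rather than written by hand. The sketch below assumes the bracketed ClassAd record layout with name = "value"; pairs. The function render_stork_submit is a hypothetical helper, not part of Stork, and it handles string-valued fields only.

```python
from typing import Dict


def render_stork_submit(fields: Dict[str, str]) -> str:
    """Render string-valued fields as a bracketed ClassAd-style record."""
    lines = ["["]
    for name, value in fields.items():
        # Escape backslashes and double quotes inside the quoted value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'    {name} = "{escaped}";')
    lines.append("]")
    return "\n".join(lines) + "\n"


text = render_stork_submit({
    "dap_type": "transfer",
    "src_url": "gsiftp://server/path",
    "dest_url": "file:///dir/file",
    "x509proxy": "default",
    "log": "stage-in.log",
})
print(text)
```

Writing the returned text to a file such as stage-in.stork gives a description that can be checked against the hand-written examples above.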
17Running stork_submit
- Give stork_submit the name of the submit file
- stork_submit stage-in.stork
- stork_submit parses the submit file, checks it for errors, and sends the job to the Stork server.
- stork_submit returns the created job id (a job handle)
18Sample stork_submit
stork_submit stage-in.stork
Sending request:
    [
        dest_url = "file:///dir/file";
        src_url = "http://server/path";
        dap_type = "transfer";
        log = "path/stage-in.log"
    ]
Request assigned id: 1    (the job id)
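A script that submits jobs usually wants the job id back so it can query or remove the job later. The sketch below pulls the id out of captured stork_submit output. It assumes the "Request assigned id" line shown in the sample, accepts the line with or without a colon, and the name parse_job_id is hypothetical.

```python
import re
from typing import Optional

# Matches "Request assigned id: 1" as well as "Request assigned id 1".
_ID_PATTERN = re.compile(r"Request assigned id:?\s*(\d+)")


def parse_job_id(output: str) -> Optional[int]:
    """Return the job id found in stork_submit output, or None if absent."""
    match = _ID_PATTERN.search(output)
    return int(match.group(1)) if match else None
```

Returning None rather than raising lets the caller decide how to treat a submission that produced no id.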
19The Job Queue
- stork_submit sends the job to the Stork server
- The Stork server manages the local job queue
- View the queue with stork_q, or stork_status
20Job Status
- stork_q queries all active jobs
- stork_q
- stork_status queries the given job id, which may be active or complete
- stork_status 12
21Removing jobs
- To remove a data placement job from the queue, use stork_rm
- You may only remove jobs that you own
- (Unix root may remove anyone's jobs)
- Give a specific job ID
- stork_rm 21 removes a single job
22Use Log Files
// c style comment lines
[
    dap_type = "transfer";
    src_url = "gsiftp://server/path";
    dest_url = "file:///dir/file";
    x509proxy = "default";
    log = "stage-in.log";
]
23Sample Stork User Log
000 (001.-01.-01) 04/17 19:30:00 Job submitted from host: <128.105.121.53:54027>
...
001 (001.-01.-01) 04/17 19:30:01 Job executing on host: <128.105.121.53:9621>
...
008 (001.-01.-01) 04/17 19:30:01 job type transfer
...
008 (001.-01.-01) 04/17 19:30:01 src_url gsiftp://server/path
...
008 (001.-01.-01) 04/17 19:30:01 dest_url file:///dir/file
...
005 (001.-01.-01) 04/17 19:30:02 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
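Because the user log is plain text, a short script can check whether a transfer finished cleanly. The sketch below assumes event header lines of the form "code (job) MM/DD HH:MM:SS message", as in the sample log. The names LogEvent, parse_events, and finished_ok are hypothetical, and the SAMPLE text is an abbreviated copy of the log above.

```python
import re
from typing import List, NamedTuple


class LogEvent(NamedTuple):
    code: str     # three-digit event code, e.g. "005"
    job: str      # job identifier inside the parentheses
    date: str     # MM/DD
    time: str     # HH:MM:SS
    message: str  # rest of the header line


_HEADER = re.compile(
    r"^(\d{3}) \(([^)]*)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(log_text: str) -> List[LogEvent]:
    """Collect every event header line; detail and '...' lines are skipped."""
    events = []
    for line in log_text.splitlines():
        match = _HEADER.match(line)
        if match:
            events.append(LogEvent(*match.groups()))
    return events


def finished_ok(log_text: str) -> bool:
    """True if a 005 (terminated) event is present with return value 0."""
    terminated = any(event.code == "005" for event in parse_events(log_text))
    return terminated and "Normal termination (return value 0)" in log_text


SAMPLE = """\
000 (001.-01.-01) 04/17 19:30:00 Job submitted from host: <128.105.121.53:54027>
...
001 (001.-01.-01) 04/17 19:30:01 Job executing on host: <128.105.121.53:9621>
...
005 (001.-01.-01) 04/17 19:30:02 Job terminated.
    (1) Normal termination (return value 0)
...
"""

print([event.code for event in parse_events(SAMPLE)])
```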
24Stork and DAGMan
- Data placement jobs are integrated with Condor's DAGMan, and Friedrich benefits
25Defining Friedrich's DAG
26Friedrich's DAG
[Diagram: input1 and input2 feed crunch; crunch feeds result]
27The DAG Input File
# file name is friedrich.dag
DATA input1 input1.stork
DATA input2 input2.stork
JOB crunch process.submit
DATA result result.stork
PARENT input1 input2 CHILD crunch
PARENT crunch CHILD result
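To see what the PARENT ... CHILD lines imply, the DAG file can be read into a dependency map and sorted. The sketch below is a simplified reader written for this example (parse_dag is a hypothetical helper, not DAGMan's parser). It handles only the JOB, DATA, and PARENT ... CHILD keywords used here and relies on graphlib from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, Set, Tuple

DAG_TEXT = """\
# file name is friedrich.dag
DATA input1 input1.stork
DATA input2 input2.stork
JOB crunch process.submit
DATA result result.stork
PARENT input1 input2 CHILD crunch
PARENT crunch CHILD result
"""


def parse_dag(text: str) -> Tuple[Dict[str, Tuple[str, str]], Dict[str, Set[str]]]:
    """Return (nodes, parents): node -> (kind, submit file), node -> parents."""
    nodes: Dict[str, Tuple[str, str]] = {}
    parents: Dict[str, Set[str]] = {}
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # blank line or comment
        keyword = words[0].upper()
        if keyword in ("JOB", "DATA"):
            nodes[words[1]] = (keyword, words[2])
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [word.upper() for word in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return nodes, parents


nodes, parents = parse_dag(DAG_TEXT)
# static_order() yields each node only after all of its parents.
order = list(TopologicalSorter(parents).static_order())
print(order)
```

Any valid order places input1 and input2 before crunch, and crunch before result. The two stage-in jobs have no dependency on each other, so they may run in either order or concurrently.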
28One of the Stork Submit Files
// file name is input1.stork
[
    dap_type = "transfer";
    src_url = "http://north.cs.wisc.edu/friedrich/data1";
    dest_url = "file:///home/friedrich/in1";
    log = "in1.log";
]
29Condor Submit Description File
# file name is process.submit
universe = vanilla
executable = process
input = in1
output = crunch.result
error = crunch.err
log = crunch.log
queue
30Stork Submit File
// file name is result.stork
[
    dap_type = "transfer";
    src_url = "file:///home/friedrich/crunch.result";
    dest_url = "http://north.cs.wisc.edu/friedrich/final.results";
    log = "result.log";
]
31Friedrich Submits the DAG
- While Friedrich's current working directory is /home/friedrich:
- condor_submit_dag friedrich.dag
32In Review
- With Stork, Friedrich can now submit data processing jobs and go home! Because:
- Stork manages the data transfers, including fault detection and retry
- Condor DAGMan manages dependencies
33Additional Resources
- http://www.cs.wisc.edu/condor/stork/
- Condor Manual, Stork section
- stork-announce_at_cs.wisc.edu list
- stork-discuss_at_cs.wisc.edu list
34Additional Slides
35Important Parameters
- STORK_MAX_NUM_JOBS limits the number of active jobs
- STORK_MAX_RETRY limits job attempts before a job is marked as failed
- STORK_MAXDELAY_INMINUTES specifies the hung job threshold
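One way to read these three limits is as three separate decisions a scheduler makes. The sketch below is an interpretation for illustration only: the Limits class, the function names, the example values, and the exact comparisons (for instance, whether a job counts as hung at or only beyond the threshold) are all assumptions, not Stork's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    max_num_jobs: int       # stands in for STORK_MAX_NUM_JOBS
    max_retry: int          # stands in for STORK_MAX_RETRY
    max_delay_minutes: int  # stands in for STORK_MAXDELAY_INMINUTES


def may_start(active_jobs: int, limits: Limits) -> bool:
    """Throttle: start another job only while below the active-job cap."""
    return active_jobs < limits.max_num_jobs


def is_hung(minutes_running: float, limits: Limits) -> bool:
    """Treat a job as stuck once it runs longer than the threshold."""
    return minutes_running > limits.max_delay_minutes


def after_attempt(attempts: int, succeeded: bool, limits: Limits) -> str:
    """Decide what happens to a job after its latest attempt."""
    if succeeded:
        return "done"
    return "failed" if attempts >= limits.max_retry else "retry"


limits = Limits(max_num_jobs=2, max_retry=3, max_delay_minutes=10)
print(may_start(1, limits), is_hung(11, limits), after_attempt(3, False, limits))
```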
36Current Restrictions
- Currently, best suited for Personal Stork mode
- Local file paths must be valid on the Stork server, including the submit directory.
- To share data, successive jobs in a DAG must use a shared filesystem
37Future Work
- Enhance multi-user fair share
- Enhance support for DAGs without a shared file system
- Enhance scheduling with configurable job requirements and rank
- Add DAP job matchmaking
- Additional platform ports