Workload Management - PowerPoint PPT Presentation

1
Workload Management
David Colling Imperial College London
2
  • Release 2 is not based on release 1
  • Whole new architecture (pretty much described in
    D1.4)
  • More modular
  • I have little practical experience of this new
    architecture (yet).

3
So what is the new architecture?
See D1.4 for details
4
The architecture
User Interface: Although there have been several changes to the architecture, the commands available at the user end are (almost) the same, now edg-job-submit etc. There are also now APIs.
Network Server: The Network Server is a generic network daemon, responsible for accepting incoming requests from the UI (e.g. job submission, job removal), which, if valid, are then passed to the Workload Manager.
5
The architecture
Workload Manager: The Workload Manager is the core component of the Workload Management System. Given a valid request, it has to take the appropriate actions to satisfy it. To do so, it may need support from other components, which are specific to the different request types.
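The hand-off described on these two slides, where the Network Server checks a request and the Workload Manager routes it to a helper specific to the request type, can be sketched roughly as follows. This is an illustration only, not the WP 1 code: the request dictionary, the `KNOWN_REQUESTS` set and the handler names are all assumptions made for the example.

```python
# Illustrative sketch only: request shape and handler names are hypothetical.
KNOWN_REQUESTS = {"submit", "remove"}

def network_server_accept(request):
    """Return True if the request looks valid enough to pass to the Workload Manager."""
    return request.get("type") in KNOWN_REQUESTS and bool(request.get("job_id"))

def handle_submit(request):
    return f"scheduling {request['job_id']}"

def handle_remove(request):
    return f"cancelling {request['job_id']}"

# The Workload Manager picks the helper that matches the request type.
HANDLERS = {"submit": handle_submit, "remove": handle_remove}

def workload_manager(request):
    if not network_server_accept(request):
        raise ValueError("invalid request rejected by the Network Server")
    return HANDLERS[request["type"]](request)

print(workload_manager({"type": "submit", "job_id": "job-1"}))
```

Keeping the handlers in a table mirrors the idea that each request type has its own supporting component.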
6
The architecture
  • Resource Broker
  • This has been turned into one of the modules that
    help the Workload Manager; actually 3
    sub-modules:
  • Matchmaking
  • Ranking
  • Scheduling
  • Job Adapter
  • The Job Adapter puts the finishing touches to the
    job's JDL and creates the job wrapper.
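One way to picture the three broker sub-modules is as a small pipeline: filter the CEs, order the survivors, pick one. The sketch below is a guess at that division of labour, not the actual implementation; the CE records are made-up dictionaries that borrow Glue-style attribute names, and `broker` is a hypothetical helper.

```python
def matchmaking(ces, requirements):
    """Keep only the CEs that satisfy the job's requirements predicate."""
    return [ce for ce in ces if requirements(ce)]

def ranking(ces, rank):
    """Order the matching CEs from highest to lowest rank value."""
    return sorted(ces, key=rank, reverse=True)

def scheduling(ranked):
    """Pick the top-ranked CE, or None when nothing matched."""
    return ranked[0] if ranked else None

def broker(ces, requirements, rank):
    return scheduling(ranking(matchmaking(ces, requirements), rank))

# Made-up CE records for illustration.
ces = [
    {"id": "ce-a", "GlueHostOperatingSystemName": "linux", "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b", "GlueHostOperatingSystemName": "linux", "GlueCEStateFreeCPUs": 12},
    {"id": "ce-c", "GlueHostOperatingSystemName": "solaris", "GlueCEStateFreeCPUs": 40},
]
best = broker(
    ces,
    requirements=lambda ce: ce["GlueHostOperatingSystemName"] == "linux",
    rank=lambda ce: ce["GlueCEStateFreeCPUs"],
)
print(best["id"])  # ce-b
```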

7
The architecture
Job Controller and CondorG: Actually submit the job to the resources and track progress.
So how does this all work?
8
Job submission example (for a simple job)
[Diagram components: RB node, Replica Catalog, Network Server, Workload Manager, Inform. Service, Job Contr. - CondorG, CE characteristics and status, SE characteristics and status, Computing Element, Storage Element]
9
Job submission
  • edg-job-submit myjob.jdl
  • myjob.jdl:
  • JobType = "Normal";
  • Executable = "$(CMS)/exe/sum.exe";
  • InputData = "LF:testbed0-00019";
  • ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test
    Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
  • DataAccessProtocol = "gridftp";
  • InputSandbox = {"/home/user/WP1testC", "/home/file",
    "/home/user/DATA/"};
  • OutputSandbox = {"sim.err", "test.out", "sim.log"};
  • Requirements = other.GlueHostOperatingSystemName == "linux"
    && other.GlueHostOperatingSystemRelease == "Red Hat 6.2"
    && other.GlueCEPolicyMaxWallClockTime > 10000;
  • Rank = other.GlueCEStateFreeCPUs;

Job Status: submitted
Job Description Language (JDL): used to specify job characteristics and requirements.
UI: allows users to access the functionalities of the WMS.
[Diagram: job submission figure with the components of slide 8.]
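A job description of this kind is a flat list of `Attribute = value;` pairs, so it is easy to generate from a script. The sketch below assumes the usual ClassAd-style conventions (strings quoted, lists in braces, expressions left bare); `jdl_value` and `render_jdl` are hypothetical helper names, not part of any EDG tool.

```python
def jdl_value(value):
    """Format a Python value as a JDL literal (strings quoted, lists in braces)."""
    if isinstance(value, str):
        return f'"{value}"'
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(v) for v in value) + "}"
    return str(value)

def render_jdl(attributes, expressions=None):
    """Build JDL text; `expressions` holds raw expressions that must stay unquoted."""
    lines = [f"{name} = {jdl_value(value)};" for name, value in attributes.items()]
    lines += [f"{name} = {expr};" for name, expr in (expressions or {}).items()]
    return "\n".join(lines)

jdl = render_jdl(
    {"JobType": "Normal", "OutputSandbox": ["sim.err", "test.out", "sim.log"]},
    {"Rank": "other.GlueCEStateFreeCPUs"},
)
print(jdl)
```

The resulting text could be written to a file and handed to `edg-job-submit`.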
10
NS: network daemon responsible for accepting incoming requests.
[Diagram: job submission figure with the components of slide 8 plus RB storage; labels shown: Job Status, Job, Input Sandbox files.]
11
WM: responsible for taking the appropriate actions to satisfy the request.
[Diagram: job submission figure with the components of slide 8 plus RB storage; labels shown: Job Status, Job.]
12
Where must this job be executed?
[Diagram: job submission figure with the components of slide 8 plus RB storage and the Matchmaker.]
13
Matchmaker: responsible for finding the best CE to which to submit a job.
[Diagram: job submission figure with the components of slide 8 plus RB storage and the Matchmaker/Broker.]
14
Where are the needed data (on which SEs)?
What is the status of the Grid?
[Diagram: job submission figure with the components of slide 8 plus RB storage and the Matchmaker/Broker.]
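The two questions on this slide suggest a two-step lookup: find the SEs that hold the input data, then find CEs that are usable and near one of them. The sketch below mocks both services with plain dictionaries; the field names `free_cpus` and `close_ses`, the file name and the CE/SE identifiers are invented for the example.

```python
def ses_holding(replica_catalog, logical_file):
    """Ask the (mock) Replica Catalog which SEs hold a replica of the file."""
    return set(replica_catalog.get(logical_file, ()))

def candidate_ces(info_service, ses):
    """Keep CEs that have free CPUs and are close to an SE holding the data."""
    return [
        ce for ce, status in info_service.items()
        if status["free_cpus"] > 0 and ses & set(status["close_ses"])
    ]

# Mock answers from the Replica Catalog and the Information Service.
replica_catalog = {"lfn-example": ["se1", "se3"]}
info_service = {
    "ce-a": {"free_cpus": 0, "close_ses": ["se1"]},
    "ce-b": {"free_cpus": 8, "close_ses": ["se3"]},
    "ce-c": {"free_cpus": 2, "close_ses": ["se2"]},
}
ses = ses_holding(replica_catalog, "lfn-example")
print(candidate_ces(info_service, ses))  # ['ce-b']
```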
15
[Diagram: job submission figure with the components of slide 8 plus RB storage and the Matchmaker; label shown: CE choice.]
16
JA: responsible for the final touches to the job before performing submission (e.g. creation of wrapper script, etc.).
[Diagram: job submission figure with the components of slide 8 plus RB storage and the Job Adapter.]
17
JC: responsible for the actual job management operations (done via CondorG).
[Diagram: job submission figure with the components of slide 8 plus RB storage; label shown: Job.]
18
[Diagram: job submission figure with the components of slide 8 plus RB storage; labels shown: Input Sandbox files, Job.]
19
[Diagram: job submission figure with the components of slide 8 plus RB storage; labels shown: Input Sandbox, Grid enabled data transfers/accesses.]
20
[Diagram: job submission figure with the components of slide 8 plus RB storage; label shown: Output Sandbox files.]
21
Job submission
edg-job-get-output <dg-job-id>
[Diagram: job submission figure with the components of slide 8 plus RB storage; label shown: Output Sandbox.]
22
Job Status: submitted, waiting, ready, scheduled, running, done, cleared
[Diagram: job submission figure with the components of slide 8 plus RB storage; label shown: Output Sandbox files.]
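The job states shown on this slide can be read as a simple linear life cycle. The sketch below encodes them in that order; treating the sequence as strictly linear is an assumption (failures and resubmission are ignored here), and `next_state` is a hypothetical helper.

```python
from enum import Enum

class JobState(Enum):
    SUBMITTED = "submitted"
    WAITING = "waiting"
    READY = "ready"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    CLEARED = "cleared"

# Enum members iterate in definition order, which matches the slide.
ORDER = list(JobState)

def next_state(state):
    """Return the state that follows `state`, or None after the final one."""
    i = ORDER.index(state)
    return ORDER[i + 1] if i + 1 < len(ORDER) else None

state = JobState.SUBMITTED
while state is not None:
    print(state.value)
    state = next_state(state)
```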
23
Logging and bookkeeping.
edg-job-status <dg-job-id>
LB: receives and stores job events; processes corresponding job status.
LM: parses the CondorG log file (where CondorG logs info about jobs) and notifies LB.
[Diagram labels: Job status, Logging Bookkeeping, Log Monitor, Log of job events]
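The split between the Log Monitor and Logging and Bookkeeping can be illustrated with a toy event store. Everything here is assumed for the sake of the example: the `'<job_id> <event>'` line format is made up (it is not the CondorG log format), and the status is simply taken to be the latest event.

```python
from collections import defaultdict

class LoggingBookkeeping:
    """Stores job events and derives the current status from the latest one."""
    def __init__(self):
        self.events = defaultdict(list)

    def log_event(self, job_id, event):
        self.events[job_id].append(event)

    def status(self, job_id):
        history = self.events.get(job_id)
        return history[-1] if history else "unknown"

def log_monitor(lb, log_lines):
    """Parse '<job_id> <event>' lines (a made-up format) and notify the LB."""
    for line in log_lines:
        job_id, event = line.split(maxsplit=1)
        lb.log_event(job_id, event.strip())

lb = LoggingBookkeeping()
log_monitor(lb, ["job-1 scheduled", "job-2 scheduled", "job-1 running"])
print(lb.status("job-1"))  # running
```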
24
New functionality
  • Release 2 of WP 1 software
  • New functionality includes
  • MPI job submission
  • User APIs
  • Accounting infrastructure (Management have
    decided not to deploy this for testbed 2)
  • Interactive job support
  • Job logical checkpointing

25
New functionality
All these are implemented. Specify which sort of job using the JobType classad, e.g. JobType = "Checkpointable". However, only tested on the WP 1 testbed as yet.
Don't have time to go through all of these, so will just go through checkpointing.
26
Job checkpointing scenario
[Diagram components: RB node, Network Server, Workload Manager, Logging Bookkeeping Server, Job Contr. - CondorG]
27
  • edg-job-submit jobchkpt.jdl
  • jobchkpt.jdl:
  • JobType = "Checkpointable";
  • Executable = "hsum.exe";
  • StdOutput = "Outfile";
  • InputSandbox = {"/home/user/hsum.exe"};
  • OutputSandbox = {"Outfile"};
  • Requirements = member("ROOT",
    other.GlueHostApplicationSoftwareRunTimeEnvironment)
    && member("CHKPT",
    other.GlueHostApplicationSoftwareRunTimeEnvironment);
  • Rank = -other.GlueCEStateEstimatedResponseTime;

Job Status: submitted
Job Description Language (JDL): used to specify job characteristics and requirements.
UI: allows users to access the functionalities of the WMS.
[Diagram: checkpointing figure with the components of slide 26 plus the Replica Catalog.]
28
[Diagram: checkpointing figure, submission steps 1 to 6, with the components of slide 26 plus Matchmaker, RB storage and Job Adapter; labels shown: Job Status, Job, Input Sandbox files.]
29
<save intermediate files>
State.saveValue(var1, value1)
...
State.saveValue(varn, valuen)
State.saveState()
From time to time the user's job asks to save the intermediate state.
[Diagram: checkpointing figure with the components of slide 26 plus RB storage.]
30
Saving of intermediate files.
Saving of job state.
[Diagram: checkpointing figure with the components of slide 26 plus RB storage.]
31
Job fails (e.g. for a CE problem).
[Diagram: checkpointing figure with the components of slide 26 plus RB storage, Computing Element X and Computing Element Y.]
32
Where must this job be executed? Possibly on a different CE from the one where the job was previously submitted.
Reschedule and resubmit job.
[Diagram: checkpointing figure with the components of slide 26 plus Matchmaker and RB storage; label shown: Job.]
33
CE choice: CEy
[Diagram: checkpointing figure with the components of slide 26 plus Matchmaker and RB storage.]
34
[Diagram: checkpointing figure with the components of slide 26 plus RB storage and Job Adapter; labels shown: Job, CE characteristics and status.]
35
[Diagram: checkpointing figure with the components of slide 26 plus RB storage; labels shown: Input Sandbox files, Job.]
36
Job Status: scheduled, done (failed), waiting, ready, scheduled
Retrieval of last saved state when job starts.
Retrieval of intermediate files (previously saved).
[Diagram: checkpointing figure with the components of slide 26 plus RB storage.]
37
Job Status: scheduled, done (failed), waiting, ready, scheduled
Job keeps running, starting from the point corresponding to the retrieved state (doesn't need to start from the beginning).
[Diagram: checkpointing figure with the components of slide 26 plus RB storage; label shown: Job.]
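The checkpointing scenario walked through above (save values from time to time, fail, restart elsewhere from the last saved state) can be mimicked in a few lines. The `State` class below is a toy stand-in whose method names echo the `saveValue`/`saveState` calls on the earlier slide; it is not the real API, and the dictionary `store` merely plays the role of wherever the state is persisted.

```python
class State:
    """Toy stand-in for the checkpointing state object; not the real API."""
    def __init__(self, store):
        self._store = store          # plays the role of persistent state storage
        self._values = dict(store)   # start from the last saved state, if any

    def save_value(self, name, value):
        self._values[name] = value

    def get_value(self, name, default=None):
        return self._values.get(name, default)

    def save_state(self):
        self._store.clear()
        self._store.update(self._values)

def run_sum(numbers, store, fail_at=None):
    """Sum `numbers`, checkpointing after each step; optionally fail midway."""
    state = State(store)
    start = state.get_value("next_index", 0)
    total = state.get_value("total", 0)
    for i in range(start, len(numbers)):
        if i == fail_at:
            raise RuntimeError("simulated CE failure")
        total += numbers[i]
        state.save_value("next_index", i + 1)
        state.save_value("total", total)
        state.save_state()
    return total

store = {}
try:
    run_sum([1, 2, 3, 4], store, fail_at=2)   # first attempt fails on "CE X"
except RuntimeError:
    pass
result = run_sum([1, 2, 3, 4], store)          # resubmission resumes at index 2
print(result)  # 10
```

The second run never recomputes the first two steps, which is the point of the scenario.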
38
Further additional functionality
The order of implementation is not up to WP 1 people.
Dependent jobs, using Condor DAGMan. For example:
39
Further additional functionality
A = [ Executable = "A.sh"; PreScript = "PreA.sh"; PreScriptArguments = { "1" }; Children = { "B", "C" } ]
B = [ Executable = "B.sh"; PostScript = "PostA.sh"; PostScriptArguments = { "RETURN" }; Children = { "D" } ]
C = [ Executable = "C.sh"; Children = { "D" } ]
D = [ Executable = "D.sh"; PreScript = "PreD.sh"; PostScript = "PostD.sh"; PostScriptArguments = { "1", "a" } ]

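The Children attributes in this example define a small dependency graph: A must finish before B and C, and both must finish before D. A valid run order can be derived with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) and is only meant to show the ordering constraint, not how DAGMan itself schedules jobs.

```python
from graphlib import TopologicalSorter

# Children lists copied from the example nodes A, B, C and D.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def execution_order(children):
    """Return one valid run order in which every parent precedes its children."""
    parents = {node: set() for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(parents).static_order())

order = execution_order(children)
print(order)  # A first, D last; B and C may appear in either order
```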
40
Further additional functionality
Job partitioning will be similar to
checkpointing, with the jobs being partitioned
according to some variable. Partitioned jobs
will also have a pre-job and aggregator e.g.
41
Further additional functionality
JobType = "Partitionable"; Executable = ...; JobSteps = ...; StepWeight = ...; Requirements = ...; ...
Prejob = [ Executable = ...; Requirements = ...; ... ];
Aggregator = [ Executable = ...; Requirements = ...; ... ];
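The partitioning idea (a pre-job, a set of sub-jobs that each cover part of the steps, then an aggregator) might look like this in miniature. The even split of steps is an assumption for illustration; the slides do not say how JobSteps and StepWeight are actually used, and `partition` and `run_partitioned` are hypothetical names.

```python
def partition(job_steps, parts):
    """Split step indices 0..job_steps-1 into `parts` contiguous chunks."""
    size, extra = divmod(job_steps, parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(range(start, end))
        start = end
    return chunks

def run_partitioned(job_steps, parts, prejob, step, aggregator):
    """Run the pre-job, then each sub-job over its chunk, then the aggregator."""
    prejob()
    partials = [[step(i) for i in chunk] for chunk in partition(job_steps, parts)]
    return aggregator(partials)

result = run_partitioned(
    job_steps=10,
    parts=3,
    prejob=lambda: None,                       # stand-in for the pre-job
    step=lambda i: i * i,                      # work done for one step
    aggregator=lambda parts: sum(sum(p) for p in parts),
)
print(result)  # 285
```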

42
Further additional functionality
Also planned are advance reservation of resources and co-location, plus much more monitoring and performance quantification.
43
  • Summary
  • New architecture has been implemented
  • Lots of new functionality but not stress
    tested
  • Further functionality and performance
    quantification implemented by testbed 3.

44
Further into the future
EDG will not use OGSA; however, the future is in the OGSA grid world. Work is being done at LeSC (see Steven Newhouse's talk tomorrow) to wrap the WP 1 components.
  • Communication via JDML and LBML
  • Virtualisation of RB through OGSA factory
  • Use virtualisation to load balance
  • Increase interoperability