Title: Computing and Brokering
1Computing and Brokering
- Grid Middleware 5
- David Groep, lecture series 2005-2006
2Outline
- Classes of computing services
- MPP SHMEM
- Clusters with high-speed interconnect
- Conveniently parallel jobs
- Through the hourglass: basic functionalities
- Representing computing services
- resource availability, RunTimeEnvironment
- Software installation and ESIA
- Jobs as resources, or ?
- Brokering
- brokering models: central view, per-user broker, neighbourhood P2P brokering
- job farming and DAGs: Condor-G, gLite WMS, Nimrod-G, DAGMan
- resource selection: ERT, freeCPUs, ... prediction techniques and challenges
- co-locating jobs and data, input/output sandboxes, LogicalFiles
- Specialties
- Supporting interactivity
3Computing Service
- resource variability and the hourglass model
4The Famous Hourglass Model
5Types of systems
- Very different models and pricing; suitability depends on application
- shared memory MPP systems
- vector systems
- cluster computing with high-speed interconnect
- can perform like MPP, except for the single
memory image - e.g. Myrinet, Infiniband
- coarse-grained compute clusters
- conveniently parallel applications without IPC
- can be built of commodity components
- specialty systems
- visualisation, systems with dedicated
co-processors,
6Quick, cheap, or both how to run an app?
- Task: how to run your application
- the fastest, or
- the most cost-effective (this argument usually wins)
- Two choices to speed up an application
- Use the fastest processor available
- but this gives only a small factor over modest
(PC) processors - Use many processors, doing many tasks in parallel
- and since quite fast processors are inexpensive
we can think of using very many processors in
parallel - but the problem must first be decomposed
fast, cheap, good: pick any two
7High Performance or High Throughput?
- Key question: max. granularity of decomposition
- Have you got one big problem or a bunch of little
ones? - To what extent can the problem be decomposed
into sort-of-independent parts (grains) that
can all be processed in parallel? - Granularity
- fine-grained parallelism: the independent bits are small, need to exchange information, synchronize often
- coarse-grained: the problem can be decomposed into large chunks that can be processed independently
- Practical limits on the degree of parallelism
- how many grains can be processed in parallel?
- degree of parallelism v. grain size
- grain size limited by the efficiency of the
system at synchronising grains
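Not on the original slide, but a standard way to make this limit concrete is Amdahl's law: if a fraction p of the work can be decomposed into grains processed on N processors, the achievable speed-up is bounded by

$$ S(N) \;=\; \frac{1}{(1-p) + p/N} $$

so no matter how many grains run in parallel, the speed-up never exceeds 1/(1-p); finer-grained decomposition only pays off if synchronisation overhead does not eat into p.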
8High Performance v. High Throughput?
- fine-grained problems need a high performance
system - that enables rapid synchronization between the
bits that can be processed in parallel - and runs the bits that are difficult to
parallelize as fast as possible - coarse-grained problems can use a high throughput
system - that maximizes the number of parts processed per
minute - High Throughput Systems use a large number of
inexpensive processors, inexpensively
interconnected - High Performance Systems use a smaller number of
more expensive processors expensively
interconnected
9High Performance v. High Throughput?
- There is nothing fundamental here; it is just a question of financial trade-offs, like
- how much more expensive is a fast computer than
a bunch of slower ones? - how much is it worth to get the answer more
quickly? - how much investment is necessary to improve the
degree of parallelization of the algorithm? - But the target is moving -
- Since the cost chasm first opened between fast
and slower computers 12-15 years ago an enormous
effort has gone into finding parallelism in big
problems - Inexorably decreasing computer costs and
de-regulation of the wide area network
infrastructure have opened the door to ever
larger computing facilities clusters ?
fabrics ? (inter)national gridsdemanding
ever-greater degrees of parallelism
10But the fact is
the food chain has been reversed, and
supercomputer vendors are struggling to make a
living.
Graphic: Network of Workstations, Berkeley; IEEE Micro, Feb. 1995; Thomas E. Anderson, David E. Culler, David A. Patterson
11Using these systems
- As clusters and capability systems are both expensive (i.e. not on your desktop), they are resources that need to be scheduled
- interface for scheduled access is a batch queue
- job submit, cancel, status, suspend
- sometimes checkpoint-restart in OS, e.g. on SGI
IRIX - allocate processors (and amount of memory,
these may be linked!) as part of the job request
- systems usually also have a smaller interactive partition
- not intended for running production jobs
12Cluster batch system model
13Some batch systems
- Batch systems and schedulers
- Torque (OpenPBS, PBS Pro)
- Sun Grid Engine (that's not a Grid!)
- Condor
- LoadLeveller
- Load Share Facility (LSF)
- Dedicated schedulers: MAUI
- can drive scheduling for Torque/PBS, SGE, LSF, ...
- support advanced scheduling features, like reservation, fair-shares, accounts/banking, QoS
- head node or UI system can usually be used for
test jobs
14Torque/PBS job description
- PBS batch job script:
    #PBS -l walltime=360000
    #PBS -l cput=300000
    #PBS -l vmem=1gb
    #PBS -q qlong
- Executing user job:
    UTCDATE=`date -u +'%Y%m%d%H%M%SZ'`
    echo "Execution started on $UTCDATE"
    echo ""
    printenv
    date
    sleep 3
    date
    id
    hostname
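Such a script would typically be handed to Torque/PBS with qsub; the file name and job id below are illustrative only.

    qsub testjob.pbs      # submit; prints a job id, e.g. 827391.tbn20.nikhef.nl
    qstat 827391          # query its status
    qdel 827391           # cancel it if needed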
15PBS queue
$ qstat -an1 | head -10
tbn20.nikhef.nl
                                                               Req'd    Req'd       Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory  Time   S   Time
-------------------- -------- -------- ---------- ------ --- --- ------ ------ - ------
823302.tbn20.nikhef. biome034 qlong    STDIN       20253   1  --     --  60:00 R  20:58  node15-11
824289.tbn20.nikhef. biome034 qlong    STDIN        6775   1  --     --  60:00 R  15:25  node15-5
824372.tbn20.nikhef. biome034 qlong    STDIN       10495   1  --     --  60:00 R  15:10  node16-21
824373.tbn20.nikhef. biome034 qlong    STDIN        3422   1  --     --  60:00 R  14:40  node16-32
...
827388.tbn20.nikhef. lhcb031  qlong    STDIN          --   1  --     --  60:00 Q     --  --
827389.tbn20.nikhef. lhcb031  qlong    STDIN          --   1  --     --  60:00 Q     --  --
827390.tbn20.nikhef. lhcb002  qlong    STDIN          --   1  --     --  60:00 Q     --  --
16Example Condor clusters of idle workstations
The Condor Project, Miron Livny et al. University
of Wisconsin, Madison. See http://www.cs.wisc.edu/condor/
17Condor example
- Write a submit file:
    Executable = dowork
    Input      = dowork.in
    Output     = dowork.out
    Arguments  = 1 alpha beta
    Universe   = vanilla
    Log        = dowork.log
    Queue
- Give it to Condor:
    condor_submit <submit-file>
- Watch it run: condor_q
Files on shared fs
in a cluster at least, for other options see later
From Alan Roy, I/O Access in Condor and Grid, UW Madison. See http://www.cs.wisc.edu/condor/
18Matching jobs to resources
- For homogeneous clusters: mainly policy-based
- FIFO
- credential-based policy
- fair-share
- queue wait time
- banks accounts
- QoS specific
- For heterogeneous clusters (like condor pools)
- matchmaking based on resource and job characteristics
- see later in grid matchmaking
19Example scheduling policies - MAUI
- RMTYPE[0]              PBS
- RMHOST[0]              tbn20.nikhef.nl
- ...
- NODEACCESSPOLICY       SHARED
- NODEAVAILABILITYPOLICY DEDICATED:PROCS
- NODELOADPOLICY         ADJUSTPROCS
- FEATUREPROCSPEEDHEADER xps
- BACKFILLPOLICY         ON
- BACKFILLTYPE           FIRSTFIT
- NODEALLOCATIONPOLICY   FASTEST
- FSPOLICY               DEDICATEDPES
- FSDEPTH                24
- FSINTERVAL             24:00:00
- FSDECAY                0.99
- GROUPCFG[users]        FSTARGET=1 PRIORITY=10 MAXPROC=50
- GROUPCFG[dteam]        FSTARGET=2 PRIORITY=5000 MAXPROC=32

MAUI is an open source product from Cluster Resources, Inc., http://www.supercluster.org/
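For completeness (an operator-side sketch, not from the slides): the effect of such a configuration can be inspected with Maui's own client commands.

    showq            # jobs as Maui sees them: active, idle (eligible), blocked
    diagnose -f      # fair-share usage compared to the FSTARGET values above
    diagnose -p      # per-job priority breakdown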
20Grid Interface to Computing
21Grid Interfaces to the compute services
- Need common interface for job management
- for test jobs in interactive mode: fork
- like the interactive partition in clusters and
supers - batch system interface
- executable
- arguments
- processors
- memory
- environment
- stdin/out/err
- Note
- batch system usually doesn't manage local file space
- assumes executable is just there, because of
shared FS or JIT copying of the files to the
worker node in job prologue - local file space management needs to be exposed
as part of the grid service and then implemented
separately
22Expectations?
- What can a user expect from a compute service?
- Different user scenarios are all valid
- paratrooper mode: come in, take all your equipment (files, executables etc.) with you, do your thing and go away
- you're supposed to clean up, but the system will likely do that for you if you forget. In all cases, garbage left behind is likely to be removed
- two-stage: prepare and run
- extra services to pre-install environment and
later request it - see later on such Community Software Area
services
- don't think, just do it
- blindly assume the grid is like your local system
- expect all software to be there
- expect your results to be retained indefinitely
- realism of this scenario is quite low for
production grids, as it does not scale to
larger numbers of users
23Basic Operations
- Direct run/submit
- useless unless you have an environment already
set up - Cancel
- Signal
- Suspend
- Resume
- List jobs/status
- Purge (remove garbage)
- retrieve output first
- Other useful functions
- Assess submission (eligibility, ERT)
- Register Start (needed if you have sandboxes)
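As a concrete (site-local, non-grid) illustration: on a Torque/PBS cluster most of these basic operations map onto standard client commands. This is a sketch; exact options vary per installation.

    qsub job.pbs                # run/submit
    qstat -u $USER              # list jobs / status
    qdel <jobid>                # cancel (retrieve output first!)
    qsig -s SIGUSR1 <jobid>     # signal
    qhold <jobid>               # suspend (hold)
    qrls <jobid>                # resume (release)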
24A job submission diagram for a single CE
- Example
- explicit interactions
diagram from DJRA1.1 EGEE Middleware Architecture
25WS-GRAM Job management using WS-RF
- same functionality, modelled with jobs represented as resources
- for the input sandbox: leverages an existing (GT4) data movement service
- exploits re-usable components
26GT4 WS GRAM Architecture
[Architecture diagram: service host(s) and compute element(s). A GT4 Java container hosts the GRAM services, the Delegation service and RFT File Transfer; a GRAM adapter invokes local job control via sudo and the local scheduler on the compute element; the SEG reports job events; GridFTP (FTP control/data) moves data to and from remote storage element(s); the client delegates credentials and issues transfer requests; the user job runs on the compute element.]

diagram from Carl Kesselman, ISI, ISOC/GFNL masterclass 2006
27GT2 GRAM
- Informational / historical
- so don't blame the current Globus for this
single job submission flow chart
28GRAM GT2 Protocol
- RSL over http-g
- target to a single specific resource
- http-g is like https
- modified protocol (by one byte) to specify delegation
- no longer interoperable with standard https
- delegation implicit in job submission
- RSL: Resource Specification Language
- Used in the GRAM protocol to describe the job
- required some (detailed) knowledge about target
system
29GT2 RSL
&(executable="/bin/echo")
 (arguments="12345")
 (stdout=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stdout anExtraTag)
 (stderr=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stderr anExtraTag)
 (queue=qshort)
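For context, a sketch of how such an RSL file would be sent to a specific GT2 gatekeeper with the globusrun client; the resource contact below is made up.

    globusrun -r ce.example.org/jobmanager-pbs -f myjob.rsl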
30GT2 Job Manager interface
- One job manager per running or queued job
- provide control interface: cancel, suspend, status
- GASS: Grid Access to Secondary Storage
- stdin, stdout, stderr
- selected input/output files
- listens on a specific TCP port on the Gatekeeper
host - Some issues
- protocol does not provide two-phase commit
- no way to know if the job really made it
- too many open ports
- one process for each queued job, i.e. too many
processes - Workaround
- don't submit a job, but instead a grid-manager process
31Performance ?
- Time to submit a basic GRAM job
- Pre-WS GRAM: < 1 second
- WS GRAM (in Java): 2 seconds
- so GT2-style GRAM did have one significant advantage
- Concurrent jobs
- Pre-WS GRAM: 300 jobs
- WS GRAM: 32,000 jobs
32Scaling scheduling
- load on the CE head node per VO cannot be
controlled with a single common job manager - with many VOs
- might need to resolve inter-VO resource
contention - different VOs may want different policies
- make the CE pluggable
- and provide a common CE interface, irrespective
of the site-specific job submission mechanism - as long as the site supports a fork JM
33gLite job submission model
site
one grid CEMON per VO or user
34Unicore CE
- Other design and concept
- eats JSDL (GGF standard) as a description
- described job requirements in detail
- security model cannot support dynamic VOs yet
- grid-wide coordinated UID space
- (or shared group accounts for all grid users)
- no VO management tools (DEISA added a directory
for that) - intra-site communication not secured
- one big plus job management uses only 1 port for
all communications (including file transfer), and
is thus firewall-friendly
35Unicore CE Architecture
Graphic from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
36Unicore programming model
- Abstract Job Object
- Collection of classes representing Grid functions
- Encoded as Java objects (XML encoding possible)
- Where to build AJOs
- Pallas client GUI - The user's view
- Client plugins - Grid deployer
- Arcon client tool kit - Hard core
- What can't the AJO do
- Application level Meta-computing
- ???
from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
37Interfacing to the local system
- Incarnation Data Base
- Maps abstract representation to concrete jobs
- Includes resource description
- Prototype auto-generation from MDS
- Target System Interface
- Perl interface to host platform
- Very small system specific module for easy
porting
- Current: NQS (several versions), PBS, LoadLeveler, UNICOS, Linux, Solaris, MacOSX, PlayStation-2
- Condor: under development (probably done by now)
from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
38Resource Representation
- CE attributes
- obtaining metrics
- GLUE CE
39Describing a CE
- Balance between completeness and timeliness
- Some useful metrics almost impossible to obtain
- "when will this job of mine be finished if I submit now?" cannot be answered!
- depends on system load
- need to predict runtime for already running and queued jobs
- simultaneous submission in a non-FIFO scheduling model (e.g. fair share, priorities, pre-emption &c)
40GlueCE a resource description viewpoint
From the GLUE Information Model version 1.2, see
document for details
41Through the Glue Schema Cluster Info
- Performance info: SI2k, SF2k
- Max wall time, CPU time: seconds
- together these determine if a job completes in
time - but clusters are not homogeneous
- solve at the local end (scale max CPU and wall time on each node to the system speed) - CAVEAT: when doing cross-cluster grid-wide scheduling, this can make you choose the wrong resource entirely!
- solve (i.e. multiply) at the broker end - but now you need a way to determine on which subcluster your job will run ... oops.
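A hedged numerical illustration of this caveat (numbers invented, not from the slides): if the published CPU-time limit refers to a reference node, the effective limit on a node of speed s scales roughly as

$$ t_{\mathrm{effective}} \;=\; t_{\mathrm{published}} \times \frac{s_{\mathrm{node}}}{s_{\mathrm{reference}}} $$

so a 2000 SI2k node lets a job use twice the work of a 1000 SI2k reference node within the same published limit; a broker that only sees the published number may pick a subcluster where the job cannot finish.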
42Cluster Info total, free and max JobSlots
- FreeJobSlots is the wrong metric to use for scheduling (a good cluster is always 100% full)
- these metrics may be VO, user and job dependent
- if a cluster has free CPUs, that does not mean
that you can use them - even if there are thousands of waiting jobs, you
might get to the front of the queue because of
your prio or fair-share
43Cluster info ERT and WRT
- Estimated/worst response time
- when will my job start to run if I submit now
- Impossible to pre-determine in case of
simultaneous submissions
- The best one can do is estimate
- Possible approaches
- simulation: good but very, very slow (Predicting Job Start Times on Clusters, Hui Li et al. 2004)
- historical comparisons
- template approach: need to discover the proper template - look for similar system states in the past
- learning approach: adapt the estimation algorithm to the actual load and learn the best approach
- see the many other papers by Hui, bundle on
Blackboard!
44Brokering
45Brokering models
- All current grid broker systems use global
brokering - consider all known resources when matching
requests - brokering takes longer as the system grows
- Models
- Bubble-to-the-top-information-system based
- current Condor-G, gLite WMS
- Ask the world for bids
- Unicore Broker
46Some grid brokers
- Condor-G
- uses Condor schedd (matchmaker) to match
resources - a Condor submitter has a number of backends to
talk to different CEs (GT2, GT4-GRAM, Condor
(flocking)) - supports DAG workflows
- schedd is close to the user
- gLite WMS
- separation between broker (based on Condor-G) and
the UI - additional Logging and Bookkeeping (generic, but
actually only used for the WMS) - does job-data co-location scheduling
47Grid brokers (contd.)
- Nimrod-G
- parameter sweep engine
- cycles through static list of resources
- automatically inspects the job output and uses
that to drive automatic job submission - minimisation methods like simulated annealing
built in - Unicore broker
- based on a pricing model
- asks for bids from resources
- no large information system needed full of
useless resources, but instead ask bids from all
resources for every job - shifts, but does nothing to resolve, the
info-system explosion
48Alternative brokering
- Alternatives could be P2P-style brokering
- look in the neighbourhood for reasonable
matches, if none found, give the task to a peer
super-scheduler - scheduler only considers close resources (has
no global knowledge) - job submission pattern may or may not follow
brokering pattern - if it does, it needs recursive delegation for job
submission, which opens the door for worms and
trojans - trust is not very transitive(this is not a
problem in sharing public files, such as in the
popular P2P file sharing applications)
49Broker detailed example gLite WMS
- Job services in the gLite architecture
- Computing Element (just discussed)
- Workload Management System (brokering, submission
control) - Accounting (for EGEE comes in two flavours site
or user) - Job Provenance (to be done)
- Package management (to be done)
- continuous matchmaking solution
- persistent list of pending jobs, waiting for
matching resources - WMS task akin to what the resources did in
Unicore
50Architecture Overview
Resource Broker Node (Workload Manager, WM)
Job status
Storage Element
51WMSs Architecture
52WMSs Architecture
Job management requests (submission,
cancellation) expressed via a Job
Description Language (JDL)
53WMSs Architecture
Keeps submission requests. Requests are kept for a while if no matching resources are available
54WMSs Architecture
Repository of resource information available to the matchmaker. Updated via notifications and/or active polling of sources
55WMSs Architecture
Finds an appropriate CE for each submission
request, taking into account job requests and
preferences, Grid status, utilization policies
on resources
56WMSs Architecture
Performs the actual job submission and
monitoring
57The Information Supermarket
- ISM represents one of the most notable
improvements in the WM as inherited from the EU
DataGrid (EDG) project - decoupling between the collection of information
concerning resources and its use - allows flexible application of different policies
- The ISM basically consists of a repository of
resource information that is available in read
only mode to the matchmaking engine - the update is the result of
- the arrival of notifications
- active polling of resources
- some arbitrary combination of both
- can be configured so that certain notifications
can trigger the matchmaking engine - improve the modularity of the software
- support the implementation of lazy scheduling
policies
58The Task Queue
- The Task Queue represents the second most notable
improvement in the WM internal design - possibility to keep a submission request for a
while if no resources are immediately available
that match the job requirements - technique used by the AliEn and Condor systems
- Non-matching requests
- will be retried either periodically
- eager scheduling approach
- or as soon as notifications of available
resources appear in the ISM - lazy scheduling approach
59Job Logging Bookkeeping
- LB tracks jobs in terms of events
- important points of job life
- submission, finding a matching CE, starting
execution etc - gathered from various WMS components
- The events are passed to a physically close
component of the LB infrastructure - locallogger
- avoid network problems
- stores them in a local disk file and takes over
the responsibility to deliver them further
- The destination of an event is one of the bookkeeping servers
- assigned statically to a job upon its submission
- processes the incoming events to give a higher
level view on the job states - Submitted, Running, Done
- various recorded attributes
- JDL, destination CE name, job exit code
- Retrieval of both job states and raw events is
available via legacy (EDG) and WS querying
interfaces - user may also register for receiving
notifications on particular job state changes
60Job Submission Services
- WMS components handling the job during its
lifetime and performing the submission - Job Adapter
- is responsible for
- making the final touches to the JDL expression
for a job, before it is passed to CondorC for the
actual submission - creating the job wrapper script that creates the
appropriate execution environment in the CE
worker node - transfer of the input and of the output sandboxes
- CondorC
- responsible for
- performing the actual job management operations
- job submission, job removal
- DAGMan
- meta-scheduler
- purpose is to navigate the graph
- determine which nodes are free of dependencies
- follow the execution of the corresponding jobs.
- instance is spawned by CondorC for each handled
DAG - Log Monitor
- is responsible for
61Job Preparation
- Information to be specified when a job has to be
submitted - Job characteristics
- Job requirements and preferences on the computing
resources - Also including software dependencies
- Job data requirements
- Information specified using a Job Description
Language (JDL) - Based upon Condors CLASSified ADvertisement
language (ClassAd) - Fully extensible language
- A ClassAd
- Constructed with the classad construction
operator - It is a sequence of attributes separated by
semi-colons. - An attribute is a pair (key, value), where value
can be a Boolean, an Integer, a list of strings,
- ltattributegt ltvaluegt
62ClassAds matchmaking
- Brokering based on advertisements by both jobs
and resources
63ClassAds matchmaking
- Allow customers to provide requirements and preferences on the resources
- Allow resources to impose constraints on the
customers they wish to service. - Separation between matchmaking and claiming.
- The matchmaker is stateless and thus can scale to
very large systems without complex failure
recovery.
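A small illustration of the resource side of this (standard Condor tooling; the constraint values are invented): machine ads can be queried with the same expression language that jobs use in their Requirements.

    # list slots whose advertised ClassAd attributes satisfy a job-style constraint
    condor_status -constraint 'Memory >= 1024 && Arch == "INTEL"'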
64Job Description Language (JDL)
- The supported attributes are grouped into two
categories - Job Attributes
- Define the job itself
- Resources
- Taken into account by the Workload Manager for
carrying out the matchmaking algorithm (to choose
the best resource where to submit the job) - Computing Resource
- Used to build expressions of Requirements and/or
Rank attributes by the user - Have to be prefixed with other.
- Data and Storage resources
- Input data to process, Storage Element (SE) where
to store output data, protocols spoken by
application when accessing SEs
65JDL Relevant Attributes (1)
- JobType
- Normal (simple, sequential job), DAG,
Interactive, MPICH, Checkpointable - Executable (mandatory)
- The command name
- Arguments (optional)
- Job command line arguments
- StdInput, StdOutput, StdError (optional)
- Standard input/output/error of the job
- Environment
- List of environment settings
- InputSandbox (optional)
- List of files on the UI's local disk needed by
the job for running - The listed files will be staged automatically to
the remote resource - OutputSandbox (optional)
- List of files, generated by the job, which have
to be retrieved
66JDL Relevant Attributes (2)
- Requirements
- Job requirements on computing resources
- Specified using attributes of resources published
in the Information Service - If not specified, default value defined in UI
configuration file is considered - Default other.GlueCEStateStatus "Production"
(the resource has to be able to accept jobs and
dispatch them on WNs) - Rank
- Expresses preference (how to rank resources that
have already met the Requirements expression) - Specified using attributes of resources published
in the Information Service - If not specified, default value defined in the UI
configuration file is considered
- Default: -other.GlueCEStateEstimatedResponseTime (the lowest estimated traversal time)
- Default: other.GlueCEStateFreeCPUs (the highest number of free CPUs) for parallel jobs (see later)
67JDL Relevant Attributes (3)
- InputData
- Refers to data used as input by the job; these data are published in the Replica Catalog and stored in the Storage Elements
- LFNs and/or GUIDs
- InputSandbox
- Executable, files etc. that are sent to the job
- DataAccessProtocol (mandatory if InputData has
been specified) - The protocol or the list of protocols which the
application is able to speak with for accessing
InputData on a given Storage Element - OutputSE
- The Uniform Resource Identifier of the output
Storage Element - RB uses it to choose a Computing Element that is
compatible with the job and is close to Storage
Element
Details in Data Management lecture
68Example of JDL File
[
  JobType = "Normal";
  Executable = "gridTest";
  StdError = "stderr.log";
  StdOutput = "stdout.log";
  InputSandbox = {"/home/mydir/test/gridTest"};
  OutputSandbox = {"stderr.log", "stdout.log"};
  InputData = {"lfn:/glite/myvo/mylfn"};
  DataAccessProtocol = "gridftp";
  Requirements = other.GlueHostOperatingSystemName == "LINUX" && other.GlueCEStateFreeCPUs > 4;
  Rank = other.GlueCEPolicyMaxCPUTime;
]
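Such a file would then be submitted and followed with the gLite commands shown on a later slide (file names here are illustrative):

    glite-job-submit -o jobids.txt gridTest.jdl    # submit, saving the job identifier
    glite-job-status -i jobids.txt                 # follow the job state
    glite-job-output -i jobids.txt                 # retrieve the output sandbox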
69Jobs State Machine (1/9)
- Submitted: job is entered by the user to the User
Interface but not yet transferred to Network
Server for processing
70Jobs State Machine (2/9)
- Waiting: job accepted by NS and waiting for
Workload Manager processing or being processed by
WMHelper modules.
71Jobs State Machine (3/9)
- Ready: job processed by WM and its Helper modules (CE found) but not yet transferred to the CE (local batch system queue) via JC and CondorC. This state does not exist for a DAG as it is not subject to matchmaking (the nodes are) but is passed directly to DAGMan.
72Jobs State Machine (4/9)
Scheduled: job waiting in the queue on the CE. This state also does not exist for a DAG as it is not directly sent to a CE (the nodes are).
73Jobs State Machine (5/9)
Running: job is running. For a DAG this means that DAGMan has started processing it.
74Jobs State Machine (6/9)
Done: job exited or considered to be in a
terminal state by CondorC (e.g., submission to CE
has failed in an unrecoverable way).
75Jobs State Machine (7/9)
Aborted: job processing was aborted by the WMS
(waiting in the WM queue or CE for too long,
over-use of quotas, expiration of user
credentials).
76Jobs State Machine (8/9)
Cancelled: job has been successfully cancelled on
user request.
77Jobs State Machine (9/9)
Cleared: output sandbox was transferred to the
user or removed due to the timeout.
78Directed Acyclic Graphs (DAGs)
- A DAG represents a set of jobs
- Nodes = Jobs, Edges = Dependencies
NodeA
NodeB
NodeC
NodeD
NodeE
79DAG JDL Structure
- Type = "dag"
- VirtualOrganisation = "yourVO"
- Max_Nodes_Running = <int > 0>
- MyProxyServer = "..."
- Requirements = ...
- Rank = ...
- InputSandbox = ...   (more later!)
- OutputSandbox = ...
- Nodes = [ nodeX ... ]   (more later!)
- Dependencies = ...   (more later!)

(Mandatory, Mandatory, Optional, Optional, Optional, Optional, Optional, Mandatory, Mandatory)
80Attribute Nodes
- The Nodes attribute is the core of the DAG
description
Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = ...;
];

nodefilename1 = [ file = "foo.jdl"; ];
nodefilename2 = [ file = "/home/vardizzo/test.jdl"; retry = 2; ];

nodefilename1 = [
  description = [
    JobType = "Normal";
    Executable = "abc.exe";
    Arguments = "1 2 3";
    OutputSandbox = {...};
    InputSandbox = {...};
    ...
  ];
  retry = 2;
];
81Attribute Dependencies
- It is a list of lists representing the
dependencies between the nodes of the DAG.
Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = { nodefilename1, nodefilename2 };
];

MANDATORY: YES!

dependencies examples:
  { nodefilename1, nodefilename2 }
  { {nodefilename1, nodefilename2}, nodefilename3 }
  { {nodefilename1, nodefilename2, nodefilename3}, nodefilename4 }
82InputSandbox Inheritance
- All nodes inherit the value of the attributes
from the one specified for the DAG.
NodeA = [
  description = [
    JobType = "Normal";
    Executable = "abc.exe";
    OutputSandbox = {"myout.txt"};
    InputSandbox = {"/home/vardizzo/myfile.txt", root.InputSandbox};
  ];
];

[
  Type = "dag";
  VirtualOrganisation = "yourVO";
  Max_Nodes_Running = <int > 0>;
  MyProxyServer = ...;
  Requirements = ...;
  Rank = ...;
  InputSandbox = ...;
  Nodes = [ nodefilename = ...; ... dependencies = ...; ];
]

- Nodes without any InputSandbox values have to contain in their description an empty list:
  InputSandbox = {};
83Interactive Jobs
- It is a job whose standard streams are forwarded
to the submitting client. - The DISPLAY environment variable has to be set
correctly, because an X window may be opened.
Listener Process
X window or std no-gui
84Interactive Jobs
- Specified by setting JobType = "Interactive" in the JDL
- When an interactive job is executed, a window for the stdin, stdout, stderr streams is opened
- Possibility to send the stdin to the job
- Possibility to have the stderr and stdout of the job while it is running
- Possibility to start a window for the standard streams of a previously submitted interactive job with the command glite-job-attach
85Interactive Jobs JDL Structure
Mandatory Mandatory Mandatory Optional
Optional Optional Mandatory Mandatory
- Type = "job"
- JobType = "interactive"
- Executable = "..."
- Arguments = "..."
- ListenerPort = <int > 0>
- OutputSandbox = ...
- Requirements = ...
- Rank = ...

gLite Commands: glite-job-attach [options] <jobID>
86gLite Commands
- JDL Submission
    glite-job-submit -o guidfile jobCheck.jdl
- JDL Status
    glite-job-status -i guidfile
- JDL Output
    glite-job-output -i guidfile
- Get Latest Job State
    glite-job-get-chkpt -o statefile -i guidfile
- Submit a JDL from a state
    glite-job-submit --chkpt statefile -o guidfile jobCheck.jdl
- See also the options, by typing --help after the commands.
87Economy based brokering
88Unicore Broker
- Distributed brokering
- Sites Know the State of their Resources Best
- Sites Can Conceal their Resource Configuration
- Different VOs Need Different Selection Algorithms
- Preferred site sets will vary
- Different applications have different performance
characteristics - Uses an economic model
- cost-based evaluation, like in the real world
- broker developed by University of Manchester, UK
Unicore is an open source product coordinated by
the Unicore Forum, see www.unicore.org
89Unicore Broker
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
90Job description ontology
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
91Unicore Broker hierarchy
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
92Unicore Broker in the system
Resource Database
Resource Broker
NQS
Network Job Supervisor
Unicore Gateway
Unicore Client
Condor
GT
Alternative Client
Multiple firewall layouts possible
User Database
Ext. Auth Service
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
93Unicore Broker
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
94VO Schedulers
- Pilot jobs and overlay networks
95Towards a multi-scheduler world
- expressing scheduling policies (priorities and
usage shares) for multiple complex VOs in a
single scheduler is proving difficult
- the resource owner does not want to know about VO internal structure, but to assign the VO just a single share
- the VO wants to set fine-grained intra-VO shares
- local schedulers (such as MAUI) are not geared towards non-admin defined policies; there is no grid-aware scheduler
- possible solutions
- develop an interface to manage the local
scheduling policies - stack the schedulers, i.e. introduce a per-VO
scheduler
96traditional job submission models
- There are three traditional deployment models
- direct per-user job submission to a gatekeeper
running with root privileges (GT2GK, today's model)
- a non-privileged dedicated CE or scheduler,
accepting authenticated user jobs and submitting
to the batch system - on-demand CE, submitted by VO or user to a
front-end system, that then receives user jobs
and submits these to the batch system - in order to not have complex schedulers run as
root, a sudo-component, glexec, is introduced
97What is glexec?
- glexec
- a thin layer to change unix credentials based on grid identity and attribute information
- you can think of it as
- a replacement for the gatekeeper
- a griddy version of Apache's suexec(8)
- a program wrapper around LCAS, LCMAPS or GUMS
98What glexec does
- Input
- a certificate chain, possibly with VOMS
extensions - a user program name arguments to run
- Action
- check authorization (LCAS, GUMS)
- user credentials, proper VOMS attributes,
executable name - acquire local credentials
- local (uid, gid) pair, possibly across a cluster
- enforce the local credential on the process
- Result
- user program is run with the mapped credentials
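A minimal sketch of how a pilot might invoke glexec on a worker node; the environment variable names and installation path follow a common gLite setup and are assumptions, not taken from these slides.

    # the payload user's delegated proxy tells glexec whose identity to switch to
    export GLEXEC_CLIENT_CERT=/tmp/payload-proxy.pem     # assumed variable name
    export GLEXEC_SOURCE_PROXY=/tmp/payload-proxy.pem    # assumed variable name
    /opt/glite/sbin/glexec /path/to/payload.sh arg1      # assumed install path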
99Jobs submission today (GT2 GK)
- Deployment model without glexec (mode GT2GK)
- jobs are submitted with an identity (hopefully the original user's one) to the site Gatekeeper
running as root - one job manager is run for each user on the head
node - with the users (uid,gid) as set by the gatekeeper
100Glexec in a one-per-site mode
- Deployment model with a CE service
- running in a non-privileged account or
- with a CE run (maybe one per VO) on a single
front-end per site
- examples
- CREAM
- GT4 WS-GRAM
101glexec with an on-demand CE
- Deployment model with on-demand CEs (mode
on-demand CEs) - The user or the VO start their own scheduler on a
front-end system - All these on-demand schedulers are
resource-limited by a site-managed master
scheduler (via a GT2GK or Condor) - the on-demand schedulers eat jobs for their VO or
user - and set the proper identity before the job gets
submitted to the site batch system
102glexec with on-demand CE
- Deployment model with on-demand CEs (mode
on-demand for VOs with native interface)
103Traditional model summary
- In all three models, the submission of the user
job to the batch system is done with the original job owner's mapped (uid, gid) identity
- grid-to-local identity mapping is done only on
the front-end system (CE) - batch system accounting provides per-user records
- inspection of Unix processes on worker nodes is per-user
104Pilot jobs
- A pilot job is basically just
- a small script which downloads a real job
- from a repository once it starts executing, hence
- it is not committed to any particular task, or
perhaps even a particular user, until that point.
- If there are no tasks waiting the pilot job exits
immediately. - In principle, if the time limits on the queue are
long enough a single pilot job could run more
than one real job, although I'm not sure if
anyone is actually doing that at the moment.
105From the VO side
- Background: some large VOs develop and prefer to use their own scheduling and job management framework
- late binding of jobs to job slots
- first establishing an overlay network
- subsequent scheduling and starting of jobs is
faster - hide details between the various grid flavours
- implement VO priorities
- full use of allocated slots, up to max wall clock
time - but these VOs will need their own scheduler
- some of them do have it already,
- but then others don't and most never will, so the
use of pilots should not be the only option (or
even the default) way of things
106Situation today
- VO-type pilot jobs submitted as if regular user
jobs - run with the identity of one or a few individuals
from a VO - obtain jobs from any user (within the VO) and run
that payload on the WN allocated - site sees only a single identity, not the true
owner of the workload - no effective mechanisms today can deny this use
model - note that this does not apply to the regular
per-user pilot jobs
107Issues
- Issues that drove the original glexec-on-WN
scenario - VO supplied pilot jobs must observe and honour
- the same policies the site uses for normal job
execution - preferably
- without requiring alternate mechanisms to
describe the policies - be continuously in synch with the site policies
- again, per-user pilot jobs satisfy these rules
by design
108Pieces of a solution
- Three pieces that go together
- glexec on the worker-node deployment
- mechanism for pilot jobs to submit themselves and
their payload to site policy control - give incontrovertible evidence of who is running
on which node at any one time - needed at selected sites for regulatory
compliance - ability to nail individual culprits
- by requiring the VO to present a valid delegation
from each user - VO should want this
- to keep user jobs from interfering with each
other - honouring site ban lists for individuals may help
in not banning the entire VO in case of an
incident
109Pieces of the solution
- glexec on the worker-node deployment
- way to keep the pilot job submitters to their
word - system-level auditing of the pilot jobs, to see
they are not doing the user job by themselves or
evading the controls - relies on advanced auditing features of the OS
(from EAL3) - but auditing data on the WN is useful for
incident investigations only - internal accounting should be done by the VO
- the regular site accounting mechanisms are via
the batch system, and will see the pilot job
identity - the site can easily show from those logs the
usage by the pilot job(for which wall-clock-time
accounting should be used) - making a site do accounting based glexec jobs is
non-standard, requires effort, may be intrusive,
and messes up normal accounting - a VO capable of writing their own submission
framework, ought to be able to write their own
accounting system as well
110glexec on WN deployment model
- VO submits a pilot job to the batch system
- the VO pilot job submitter is responsible for
the pilot behaviour - this might be a specific role in the VO, or a
locally registered badged user at each site - Pilot job is subject to normal site policies for
jobs - Pilot job obtains the true user job, and
presents the user credentials and the job
(executable name) to the site (glexec) to
request a decision
111VO pilot job on the node
- On success, the site will set the uid/gid of the new user's job
- On failure, glexec will return with an error, and
pilot job can terminate or obtain other job
Note: proper uid change by Gatekeeper or
Condor-C/BLAHP on head node should remain default
112What is needed in this model?
- Agreement on the three ingredients
- deployment of glexec on the WN to do setuid
- detailed auditing on the head node and the WNs
- site accounting done at the VO (i.e. pilot job)
level - glexec
- needs feature enhancements compared to single-CE
version - see status of glexec on the next slide
- Inspection of the audit logs
- detect abuse patterns in the system-call auditing
logs - Grid job logging capabilities
- glexec will log (uid, user/system/real time
usage) via syslog - credential mapping framework (LCMAPS) will log
mapping (also via syslog) - centralisation of glexec mappings, e.g. via
JobRepository
113Notes and alternatives
- glexec, like any site-managed ingress point,
trusts the submitter not to have mixed up the
user credentials and the jobs - we trust the RB today do this correctly, and RBs
are unknown quantities to the receiving site - a longer term solution is to have the job request
singed by the submitting user - since the description is modified by
intermediaries (brokers), the signature can only
be to the original content, and the site would
have to evaluate whether the job received matches
the signed JDL - or use an inheritance model for the job
description, and treat the job like you would,
e.g., a CIM entity
114Summary
- Realize that some VOs are doing pilot jobs today
- there is no effective enforcement against this
- some sites may just not care yet, whilst
others have hard requirements on auditability and
regulatory compliance - The glexec-on-WN model gives the VOs tools to
comply with site requirements - at least makes it better than it is today
- but you, as a site, will miss that warm and fuzzy
feeling of trust - a glexec-on-WN is always replaceable by the
null operation for sites that dont care or
want it - but realize this is for just one of the glexec
deployment models