Transcript and Presenter's Notes

Title: Computing and Brokering


1
Computing and Brokering
  • Grid Middleware 5
  • David Groep, lecture series 2005-2006

2
Outline
  • Classes of computing services
  • MPP, SHMEM
  • Clusters with high-speed interconnect
  • Conveniently parallel jobs
  • Through the hourglass: basic functionalities
  • Representing computing services
  • resource availability, RunTimeEnvironment
  • Software installation and ESIA
  • Jobs as resources, or ...?
  • Brokering
  • brokering models: central view, per-user broker,
    neighbourhood P2P brokering
  • job farming and DAGs: Condor-G, gLite WMS,
    Nimrod-G, DAGMan
  • resource selection: ERT, freeCPUs, ...; prediction
    techniques and challenges
  • colocating jobs and data, input/output
    sandboxes, LogicalFiles
  • Specialties
  • Supporting interactivity

3
Computing Service
  • resource variability and the hourglass model

4
The Famous Hourglass Model
5
Types of systems
  • Very different models and pricing; suitability
    depends on application
  • shared memory MPP systems
  • vector systems
  • cluster computing with high-speed interconnect
  • can perform like MPP, except for the single
    memory image
  • e.g. Myrinet, Infiniband
  • coarse-grained compute clusters
  • conveniently parallel applications without IPC
  • can be built of commodity components
  • specialty systems
  • visualisation, systems with dedicated
    co-processors, ...

6
Quick, cheap, or both: how to run an app?
  • Task: how to run your application
  • the fastest, or
  • the most cost-effective (this argument usually
    wins)
  • Two choices to speed up an application
  • Use the fastest processor available
  • but this gives only a small factor over modest
    (PC) processors
  • Use many processors, doing many tasks in parallel
  • and since quite fast processors are inexpensive
    we can think of using very many processors in
    parallel
  • but the problem must first be decomposed

fast, cheap, good: pick any two
7
High Performance or High Throughput?
  • Key question: max. granularity of decomposition
  • Have you got one big problem or a bunch of little
    ones?
  • To what extent can the problem be decomposed
    into sort-of-independent parts (grains) that
    can all be processed in parallel?
  • Granularity
  • fine-grained parallelism: the independent bits
    are small, need to exchange information,
    synchronize often
  • coarse-grained: the problem can be decomposed
    into large chunks that can be processed
    independently
  • Practical limits on the degree of parallelism
  • how many grains can be processed in parallel?
  • degree of parallelism v. grain size
  • grain size limited by the efficiency of the
    system at synchronising grains

8
High Performance v. High Throughput?
  • fine-grained problems need a high performance
    system
  • that enables rapid synchronization between the
    bits that can be processed in parallel
  • and runs the bits that are difficult to
    parallelize as fast as possible
  • coarse-grained problems can use a high throughput
    system
  • that maximizes the number of parts processed per
    minute
  • High Throughput Systems use a large number of
    inexpensive processors, inexpensively
    interconnected
  • High Performance Systems use a smaller number of
    more expensive processors, expensively
    interconnected

9
High Performance v. High Throughput?
  • There is nothing fundamental here; it is just
    a question of financial trade-offs, like
  • how much more expensive is a fast computer than
    a bunch of slower ones?
  • how much is it worth to get the answer more
    quickly?
  • how much investment is necessary to improve the
    degree of parallelization of the algorithm?
  • But the target is moving -
  • Since the cost chasm first opened between fast
    and slower computers 12-15 years ago an enormous
    effort has gone into finding parallelism in big
    problems
  • Inexorably decreasing computer costs and
    de-regulation of the wide area network
    infrastructure have opened the door to ever
    larger computing facilities: clusters ->
    fabrics -> (inter)national grids, demanding
    ever-greater degrees of parallelism

10
But the fact is
the food chain has been reversed, and
supercomputer vendors are struggling to make a
living.
Graphic: Networks of Workstations, Berkeley; IEEE
Micro, Feb. 1995, Thomas E. Anderson, David E.
Culler, David A. Patterson
11
Using these systems
  • As clusters and capability systems are both
    expensive (i.e. not on your desktop), they are
    resources that need to be scheduled
  • interface for scheduled access is a batch queue
  • job submit, cancel, status, suspend
  • sometimes checkpoint-restart in OS, e.g. on SGI
    IRIX
  • allocate processors (and amount of memory,
    these may be linked!) as part of the job request
  • systems usually also have smaller interactive
    partition
  • not intended for running production jobs

12
Cluster batch system model
13
Some batch systems
  • Batch systems and schedulers
  • Torque (OpenPBS, PBS Pro)
  • Sun Grid Engine (that's not a Grid)
  • Condor
  • LoadLeveler
  • Load Share Facility (LSF)
  • Dedicated schedulers: MAUI
  • can drive scheduling for Torque/PBS, SGE, LSF, ...
  • support advanced scheduling features,
    like reservation, fair-shares, accounts/banking,
    QoS
  • head node or UI system can usually be used for
    test jobs

14
Torque/PBS job description
  • PBS batch job script
  • #PBS -l walltime=36:00:00
  • #PBS -l cput=30:00:00
  • #PBS -l vmem=1gb
  • #PBS -q qlong
  • Executing user job
  • UTCDATE=`date -u +'%Y%m%d%H%M%SZ'`
  • echo "Execution started on $UTCDATE"
  • echo ""
  • printenv
  • date
  • sleep 3
  • date
  • id
  • hostname

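To make this concrete: assuming the script above is saved as testjob.pbs, the scheduled-access operations listed earlier (submit, status, suspend, cancel) map onto the standard Torque/PBS client commands; the job identifier shown is only an example.

  # submit the job script; qsub prints the job identifier
  qsub testjob.pbs                      # e.g. 823302.tbn20.nikhef.nl
  qstat 823302.tbn20.nikhef.nl          # status of this job
  qhold 823302.tbn20.nikhef.nl          # hold (suspend) a queued job
  qrls  823302.tbn20.nikhef.nl          # release (resume) it
  qdel  823302.tbn20.nikhef.nl          # cancel the job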
15
PBS queue
  • bosui:/tmp:1010> qstat -an1 | head -10
  • tbn20.nikhef.nl:

                                                                Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname   SessID NDS TSK Memory Time  S Time
    -------------------- -------- -------- --------- ------ --- --- ------ ----- - -----
    823302.tbn20.nikhef. biome034 qlong    STDIN      20253   1  --     -- 60:00 R 20:58  node15-11
    824289.tbn20.nikhef. biome034 qlong    STDIN       6775   1  --     -- 60:00 R 15:25  node15-5
    824372.tbn20.nikhef. biome034 qlong    STDIN      10495   1  --     -- 60:00 R 15:10  node16-21
    824373.tbn20.nikhef. biome034 qlong    STDIN       3422   1  --     -- 60:00 R 14:40  node16-32
    ...
    827388.tbn20.nikhef. lhcb031  qlong    STDIN         --   1  --     -- 60:00 Q    --  --
    827389.tbn20.nikhef. lhcb031  qlong    STDIN         --   1  --     -- 60:00 Q    --  --
    827390.tbn20.nikhef. lhcb002  qlong    STDIN         --   1  --     -- 60:00 Q    --  --

16
Example Condor clusters of idle workstations
The Condor Project, Miron Livny et al., University
of Wisconsin, Madison. See
http://www.cs.wisc.edu/condor/
17
Condor example
  • Write a submit file
  • Executable = dowork
  • Input = dowork.in
  • Output = dowork.out
  • Arguments = 1 alpha beta
  • Universe = vanilla
  • Log = dowork.log
  • Queue
  • Give it to Condor
  • condor_submit <submit-file>
  • Watch it run: condor_q

Files on shared fs
(in a cluster at least; for other options see later)
From Alan Roy, I/O Access in Condor and Grid, UW
Madison. See http://www.cs.wisc.edu/condor/
18
Matching jobs to resources
  • For homogeneous clusters mainly policy-based
  • FIFO
  • credential-based policy
  • fair-share
  • queue wait time
  • banks accounts
  • QoS specific
  • For heterogeneous clusters (like condor pools)
  • matchmaking based on resource job
    characteristics
  • see later in grid matchmaking

19
Example scheduling policies - MAUI
  • RMTYPE[0] PBS
  • RMHOST[0] tbn20.nikhef.nl
  • ...
  • NODEACCESSPOLICY SHARED
  • NODEAVAILABILITYPOLICY DEDICATEDPROCS
  • NODELOADPOLICY ADJUSTPROCS
  • FEATUREPROCSPEEDHEADER xps
  • BACKFILLPOLICY ON
  • BACKFILLTYPE FIRSTFIT
  • NODEALLOCATIONPOLICY FASTEST
  • FSPOLICY DEDICATEDPES
  • FSDEPTH 24
  • FSINTERVAL 24:00:00
  • FSDECAY 0.99
  • GROUPCFG[users] FSTARGET=1 PRIORITY=10
    MAXPROC=50
  • GROUPCFG[dteam] FSTARGET=2
    PRIORITY=5000 MAXPROC=32

MAUI is an open source product from
Cluster Resources, Inc.
http://www.supercluster.org/
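As a rough illustration of the fair-share settings above (not MAUI code; the per-day usage numbers are invented): with FSDEPTH 24, FSINTERVAL 24:00:00, FSDECAY 0.99 and FSPOLICY DEDICATEDPES, the scheduler compares a group's FSTARGET against its dedicated processor-seconds accumulated over 24 one-day windows, with older windows weighted down by a factor 0.99 per window, roughly like this:

  #!/bin/bash
  # toy fair-share bookkeeping: decayed usage over past windows (made-up numbers)
  usage=(4000 2500 0 1200 800)   # hypothetical processor-seconds per day, newest first
  decay=0.99
  effective=0
  for i in "${!usage[@]}"; do
    effective=$(echo "$effective + ${usage[$i]} * ($decay ^ $i)" | bc -l)
  done
  echo "decayed fair-share usage: $effective processor-seconds"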
20
Grid Interface to Computing
21
Grid Interfaces to the compute services
  • Need common interface for job management
  • for test jobs in interactive mode fork
  • like the interactive partition in clusters and
    supers
  • batch system interface
  • executable
  • arguments
  • processors
  • memory
  • environment
  • stdin/out/err
  • Note
  • batch system usually doesn't manage local file
    space
  • assumes executable is just there, because of
    shared FS or JIT copying of the files to the
    worker node in job prologue
  • local file space management needs to be exposed
    as part of the grid service and then implemented
    separately

22
Expectations?
  • What can a user expect from a compute service?
  • Different user scenarios are all valid
  • paratrooper mode: come in, take all your
    equipment (files, executables etc.) with you, do
    your thing and go away
  • you're supposed to clean up, but the system will
    likely do that for you if you forget. In all
    cases, garbage left behind is likely to be
    removed
  • two-stage prepare and run
  • extra services to pre-install the environment and
    later request it
  • see later on such Community Software Area
    services
  • don't think, just do it
  • blindly assume the grid is like your local system
  • expect all software to be there
  • expect your results to be retained indefinitely
  • realism of this scenario is quite low for
    production grids, as it does not scale to
    larger numbers of users

23
Basic Operations
  • Direct run/submit
  • useless unless you have an environment already
    set up
  • Cancel
  • Signal
  • Suspend
  • Resume
  • List jobs/status
  • Purge (remove garbage)
  • retrieve output first
  • Other useful functions
  • Assess submission (eligibility, ERT)
  • Register Start (needed if you have sandboxes)

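As one concrete mapping of these basic operations onto client commands (a GT2-era sketch; other middleware stacks use different tools, and the gatekeeper contact below is only illustrative):

  # submit a batch job and keep the returned job contact string
  CONTACT=$(globus-job-submit tbn20.nikhef.nl/jobmanager-pbs /bin/hostname)
  globus-job-status "$CONTACT"        # list status
  globus-job-get-output "$CONTACT"    # retrieve stdout/stderr when done
  globus-job-cancel "$CONTACT"        # cancel
  globus-job-clean "$CONTACT"         # purge leftover files (remove garbage)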
24
A job submission diagram for a single CE
  • Example
  • explicit interactions

diagram from DJRA1.1 EGEE Middleware Architecture
25
WS-GRAM: Job management using WS-RF
  • same functionality, modelled with jobs represented
    as resources
  • for the input sandbox, leverages an existing (GT4)
    data movement service
  • exploit re-useable components

26
GT4 WS GRAM Architecture
(architecture diagram: a client delegates credentials and sends job and
transfer requests to the GRAM and Delegation services running in the GT4
Java container on the service host; GRAM performs local job control through
sudo and a GRAM adapter to the local scheduler on the compute element, where
the user job runs; the SEG feeds job events back to the GRAM services; RFT
handles file transfer requests, driving GridFTP (FTP control/data) between
the service host and remote storage element(s))
diagram from Carl Kesselman, ISI, ISOC/GFNL
masterclass 2006
27
GT2 GRAM
  • Informational / historical
  • so don't blame the current Globus for this

single job submission flow chart
28
GRAM GT2 Protocol
  • RSL over http-g
  • targeted at a single specific resource
  • http-g is like https
  • modified protocol (by one byte) to specify
    delegation
  • no longer interoperable with standard https
  • delegation implicit in job submission
  • RSL: Resource Specification Language
  • used in the GRAM protocol to describe the job
  • required some (detailed) knowledge about the
    target system

29
GT2 RSL
  • (executable="/bin/echo")
  • (arguments="12345")
  • (stdout=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)
    stdout anExtraTag)
  • (stderr=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)
    stderr anExtraTag)
  • (queue=qshort)

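For reference, a sketch of how such an RSL string was handed to a GT2 gatekeeper from the command line; the gatekeeper contact reuses the host from the qstat example earlier, and the jobmanager name is an assumption about a typical PBS-backed setup:

  globusrun -r tbn20.nikhef.nl/jobmanager-pbs \
    '&(executable="/bin/echo")(arguments="12345")(queue=qshort)'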
30
GT2 Job Manager interface
  • One job manager per running or queued job
  • provide control interface cancel, suspend,
    status
  • GASS Grid Access to Secondary Storage
  • stdin, stdout, stderr
  • selected input/output files
  • listens on a specific TCP port on the Gatekeeper
    host
  • Some issues
  • protocol does not provide two-phase commit
  • no way to know if the job really made it
  • too many open ports
  • one process for each queued job, i.e. too many
    processes
  • Workaround
  • don't submit a job, but instead a grid-manager
    process

31
Performance ?
  • Time to submit a basic GRAM job
  • Pre-WS GRAM: < 1 second
  • WS GRAM (in Java): 2 seconds
  • so GT2-style GRAM did have one significant
    advantage
  • Concurrent jobs
  • Pre-WS GRAM: 300 jobs
  • WS GRAM: 32,000 jobs

32
Scaling scheduling
  • load on the CE head node per VO cannot be
    controlled with a single common job manager
  • with many VOs
  • might need to resolve inter-VO resource
    contention
  • different VOs may want different policies
  • make the CE pluggable
  • and provide a common CE interface, irrespective
    of the site-specific job submission mechanism
  • as long as the site supports a fork JM

33
gLite job submission model
(diagram: a site with one grid CE / CEMon per VO or user)
34
Unicore CE
  • Different design and concept
  • eats JSDL (a GGF standard) as a description
  • describes job requirements in detail
  • security model cannot support dynamic VOs yet
  • grid-wide coordinated UID space
  • (or shared group accounts for all grid users)
  • no VO management tools (DEISA added a directory
    for that)
  • intra-site communication not secured
  • one big plus: job management uses only 1 port for
    all communications (including file transfer), and
    is thus firewall-friendly

35
Unicore CE Architecture
Graphic from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
36
Unicore programming model
  • Abstract Job Object
  • Collection of classes representing Grid functions
  • Encoded as Java objects (XML encoding possible)
  • Where to build AJOs
  • Pallas client GUI - the user's view
  • Client plugins - Grid deployer
  • Arcon client tool kit - hard core
  • What can't the AJO do
  • Application level Meta-computing
  • ???

from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
37
Interfacing to the local system
  • Incarnation Data Base
  • Maps abstract representation to concrete jobs
  • Includes resource description
  • Prototype auto-generation from MDS
  • Target System Interface
  • Perl interface to host platform
  • Very small system specific module for easy
    porting
  • Current: NQS (several versions), PBS,
    LoadLeveler, UNICOS, Linux, Solaris, MacOSX,
    PlayStation-2
  • Condor: under development (probably done by now)

from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
38
Resource Representation
  • CE attributes
  • obtaining metrics
  • GLUE CE

39
Describing a CE
  • Balance between completeness and timeliness
  • Some useful metrics are almost impossible to obtain
  • "when will this job of mine be finished if I
    submit now?" cannot be answered!
  • depends on system load
  • need to predict runtime for already running and
    queued jobs
  • simultaneous submission in a non-FIFO scheduling
    model (e.g. fair share, priorities, pre-emption
    etc.)

40
GlueCE a resource description viewpoint
From the GLUE Information Model version 1.2, see
document for details
41
Through the Glue Schema Cluster Info
  • Performance info: SI2k, SF2k
  • Max wall time, CPU time: seconds
  • together these determine if a job completes in
    time
  • but clusters are not homogeneous
  • solve at the local end (scale max CPU and wall time
    on each node to the system speed). CAVEAT: when
    doing cross-cluster grid-wide scheduling, this
    can make you choose the wrong resource entirely!
  • solve (i.e. multiply) at the broker end, but now
    you need a way to determine on which subcluster
    your job will run ... oops.

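A toy calculation for the "solve at the local end" option (all numbers invented): the CPU-time limit published for a reference-speed node is shrunk on a faster node so that the same amount of work still fits, which is exactly the per-node scaling a remote broker cannot see:

  #!/bin/bash
  # scale a published CPU-time limit to the local node speed (toy numbers)
  REF_SI2K=1000          # speed the published limit refers to (assumption)
  NODE_SI2K=1800         # this worker node's SI2k rating (assumption)
  PUBLISHED_CPUT=108000  # advertised limit: 30 h of CPU, in seconds
  LOCAL_CPUT=$(( PUBLISHED_CPUT * REF_SI2K / NODE_SI2K ))
  echo "enforce cput=${LOCAL_CPUT}s on this node"   # 60000 s on the faster node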
42
Cluster Info total, free and max JobSlots
  • FreeJobSlots is the wrong metric to use for
    scheduling (a good cluster is always 100% full)
  • these metrics may be VO, user and job dependent
  • if a cluster has free CPUs, that does not mean
    that you can use them
  • even if there are thousands of waiting jobs, you
    might still get to the front of the queue because
    of your priority or fair-share

43
Cluster info ERT and WRT
  • Estimated/worst-case response time
  • when will my job start to run if I submit now?
  • impossible to pre-determine in case of
    simultaneous submissions
  • the best one can do is estimate
  • Possible approaches
  • simulation: good but very, very slow (Predicting
    Job Start Times on Clusters, Hui Li et al., 2004)
  • historical comparisons
  • template approach: need to discover the proper
    template
  • look for similar system states in the past
  • learning approach: adapt the estimation
    algorithm to the actual load and learn the best
    approach
  • see the many other papers by Hui Li, bundle on
    Blackboard!

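A deliberately naive sketch of the historical-comparison idea (the log file name and format are invented): average the queue-wait times of recently started jobs in the same queue and report that as the ERT; the template and learning approaches above refine this by choosing which historical jobs to average over.

  #!/bin/bash
  # starts.log (hypothetical): one line per started job, "<queue> <wait_seconds>"
  QUEUE=qlong
  awk -v q="$QUEUE" '$1 == q { sum += $2; n++ }
       END { if (n) printf "ERT(%s) ~ %d s, averaged over %d jobs\n", q, sum/n, n }' starts.log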
44
Brokering
45
Brokering models
  • All current grid broker systems use global
    brokering
  • consider all known resources when matching
    requests
  • brokering takes longer as the system grows
  • Models
  • Bubble-to-the-top-information-system based
  • current Condor-G, gLite WMS
  • Ask the world for bids
  • Unicore Broker

46
Some grid brokers
  • Condor-G
  • uses Condor schedd (matchmaker) to match
    resources
  • a Condor submitter has a number of backends to
    talk to different CEs (GT2, GT4-GRAM, Condor
    (flocking))
  • supports DAG workflows
  • schedd is close to the user
  • gLite WMS
  • separation between broker (based on Condor-G) and
    the UI
  • additional Logging and Bookkeeping (generic, but
    actually only used for the WMS)
  • does job-data co-location scheduling

47
Grid brokers (contd.)
  • Nimrod-G
  • parameter sweep engine
  • cycles through static list of resources
  • automatically inspects the job output and uses
    that to drive automatic job submission
  • minimisation methods like simulated annealing
    built in
  • Unicore broker
  • based on a pricing model
  • asks for bids from resources
  • no need for a large information system full of
    (mostly useless) resource data; instead, bids are
    asked from all resources for every job
  • shifts, but does nothing to resolve, the
    info-system explosion

48
Alternative brokering
  • Alternatives could be P2P-style brokering
  • look in the neighbourhood for reasonable
    matches, if none found, give the task to a peer
    super-scheduler
  • scheduler only considers close resources (has
    no global knowledge)
  • job submission pattern may or may not follow
    brokering pattern
  • if it does, it needs recursive delegation for job
    submission, which opens the door for worms and
    trojans
  • trust is not very transitive (this is not a
    problem in sharing public files, such as in the
    popular P2P file sharing applications)

49
Broker detailed example gLite WMS
  • Job services in the gLite architecture
  • Computing Element (just discussed)
  • Workload Management System (brokering, submission
    control)
  • Accounting (for EGEE this comes in two flavours:
    site or user)
  • Job Provenance (to be done)
  • Package management (to be done)
  • continuous matchmaking solution
  • persistent list of pending jobs, waiting for
    matching resources
  • WMS task akin to what the resources did in
    Unicore

50
Architecture Overview
(diagram: Resource Broker node (Workload Manager, WM), job status flow,
Storage Element)
51
WMS's Architecture
52
WMS's Architecture
Job management requests (submission,
cancellation) expressed via a Job
Description Language (JDL)
53
WMS's Architecture
Keeps submission requests. Requests are kept
for a while if no matching resources are available
54
WMS's Architecture
Repository of resource information available to
matchmaker Updated via notifications and/or
active polling on sources
55
WMS's Architecture
Finds an appropriate CE for each submission
request, taking into account job requests and
preferences, Grid status, utilization policies
on resources
56
WMS's Architecture
Performs the actual job submission and
monitoring
57
The Information Supermarket
  • ISM represents one of the most notable
    improvements in the WM as inherited from the EU
    DataGrid (EDG) project
  • decoupling between the collection of information
    concerning resources and its use
  • allows flexible application of different policies
  • The ISM basically consists of a repository of
    resource information that is available in read
    only mode to the matchmaking engine
  • the update is the result of
  • the arrival of notifications
  • active polling of resources
  • some arbitrary combination of both
  • can be configured so that certain notifications
    can trigger the matchmaking engine
  • improve the modularity of the software
  • support the implementation of lazy scheduling
    policies

58
The Task Queue
  • The Task Queue represents the second most notable
    improvement in the WM internal design
  • possibility to keep a submission request for a
    while if no resources are immediately available
    that match the job requirements
  • technique used by the AliEn and Condor systems
  • Non-matching requests
  • will be retried either periodically
  • eager scheduling approach
  • or as soon as notifications of available
    resources appear in the ISM
  • lazy scheduling approach

59
Job Logging Bookkeeping
  • LB tracks jobs in terms of events
  • important points of job life
  • submission, finding a matching CE, starting
    execution etc
  • gathered from various WMS components
  • The events are passed to a physically close
    component of the LB infrastructure
  • locallogger
  • avoid network problems
  • stores them in a local disk file and takes over
    the responsibility to deliver them further
  • The destination of an event is one of the
    bookkeeping servers
  • assigned statically to a job upon its submission
  • processes the incoming events to give a higher
    level view on the job states
  • Submitted, Running, Done
  • various recorded attributes
  • JDL, destination CE name, job exit code
  • Retrieval of both job states and raw events is
    available via legacy (EDG) and WS querying
    interfaces
  • user may also register for receiving
    notifications on particular job state changes

60
Job Submission Services
  • WMS components handling the job during its
    lifetime and performing the submission
  • Job Adapter
  • is responsible for
  • making the final touches to the JDL expression
    for a job, before it is passed to CondorC for the
    actual submission
  • creating the job wrapper script that creates the
    appropriate execution environment in the CE
    worker node
  • transfer of the input and of the output sandboxes
  • CondorC
  • responsible for
  • performing the actual job management operations
  • job submission, job removal
  • DAGMan
  • meta-scheduler
  • purpose is to navigate the graph
  • determine which nodes are free of dependencies
  • follow the execution of the corresponding jobs.
  • instance is spawned by CondorC for each handled
    DAG
  • Log Monitor
  • is responsible for
  • watching the CondorC log file, intercepting
    events concerning active jobs

61
Job Preparation
  • Information to be specified when a job has to be
    submitted
  • Job characteristics
  • Job requirements and preferences on the computing
    resources
  • Also including software dependencies
  • Job data requirements
  • Information specified using a Job Description
    Language (JDL)
  • Based upon Condor's CLASSified ADvertisement
    language (ClassAd)
  • Fully extensible language
  • A ClassAd
  • Constructed with the classad construction
    operator [ ]
  • It is a sequence of attributes separated by
    semi-colons.
  • An attribute is a pair (key, value), where value
    can be a Boolean, an Integer, a list of strings, ...
  • <attribute> = <value>

62
ClassAds matchmaking
  • Brokering based on advertisements by both jobs
    and resources

63
ClassAds matchmaking
  • Allow customers to specify requirements and
    preferences on the resources
  • Allow resources to impose constraints on the
    customers they wish to service.
  • Separation between matchmaking and claiming.
  • The matchmaker is stateless and thus can scale to
    very large systems without complex failure
    recovery.

64
Job Description Language (JDL)
  • The supported attributes are grouped into two
    categories
  • Job Attributes
  • Define the job itself
  • Resources
  • Taken into account by the Workload Manager for
    carrying out the matchmaking algorithm (to choose
    the best resource where to submit the job)
  • Computing Resource
  • Used to build expressions of Requirements and/or
    Rank attributes by the user
  • Have to be prefixed with other.
  • Data and Storage resources
  • Input data to process, Storage Element (SE) where
    to store output data, protocols spoken by
    application when accessing SEs

65
JDL Relevant Attributes (1)
  • JobType
  • Normal (simple, sequential job), DAG,
    Interactive, MPICH, Checkpointable
  • Executable (mandatory)
  • The command name
  • Arguments (optional)
  • Job command line arguments
  • StdInput, StdOutput, StdError (optional)
  • Standard input/output/error of the job
  • Environment
  • List of environment settings
  • InputSandbox (optional)
  • List of files on the UI's local disk needed by
    the job for running
  • The listed files will be staged automatically to
    the remote resource
  • OutputSandbox (optional)
  • List of files, generated by the job, which have
    to be retrieved

66
JDL Relevant Attributes (2)
  • Requirements
  • Job requirements on computing resources
  • Specified using attributes of resources published
    in the Information Service
  • If not specified, a default value defined in the UI
    configuration file is considered
  • Default: other.GlueCEStateStatus == "Production"
    (the resource has to be able to accept jobs and
    dispatch them on WNs)
  • Rank
  • Expresses preference (how to rank resources that
    have already met the Requirements expression)
  • Specified using attributes of resources published
    in the Information Service
  • If not specified, a default value defined in the UI
    configuration file is considered
  • Default: -other.GlueCEStateEstimatedResponseTime
    (the lowest estimated traversal time)
  • Default: other.GlueCEStateFreeCPUs (the highest
    number of free CPUs) for parallel jobs (see later)

67
JDL Relevant Attributes (3)
  • InputData
  • Refers to data used as input by the job; these
    data are published in the Replica Catalog and
    stored in the Storage Elements
  • LFNs and/or GUIDs
  • InputSandbox
  • Executable, files etc. that are sent to the job
  • DataAccessProtocol (mandatory if InputData has
    been specified)
  • The protocol or the list of protocols which the
    application is able to speak for accessing
    InputData on a given Storage Element
  • OutputSE
  • The Uniform Resource Identifier of the output
    Storage Element
  • RB uses it to choose a Computing Element that is
    compatible with the job and is close to the
    Storage Element

Details in Data Management lecture
68
Example of JDL File
  • JobType = "Normal";
  • Executable = "gridTest";
  • StdError = "stderr.log";
  • StdOutput = "stdout.log";
  • InputSandbox = {"/home/mydir/test/gridTest"};
  • OutputSandbox = {"stderr.log", "stdout.log"};
  • InputData = {"lfn:/glite/myvo/mylfn"};
  • DataAccessProtocol = {"gridftp"};
  • Requirements = other.GlueHostOperatingSystemName
    == "LINUX" && other.GlueCEStateFreeCPUs > 4;
  • Rank = other.GlueCEPolicyMaxCPUTime;

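Assuming the JDL above is saved as gridTest.jdl, a typical session with the gLite commands introduced later in this lecture looks roughly like this (exact command names vary slightly between gLite releases):

  glite-job-submit -o gridtest.jobids gridTest.jdl   # submit, store the job ID
  glite-job-status -i gridtest.jobids                # follow Submitted/Waiting/.../Done
  glite-job-output -i gridtest.jobids                # fetch the OutputSandbox when Done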
69
Jobs State Machine (1/9)
  • Submitted: the job has been entered by the user at
    the User Interface but not yet transferred to the
    Network Server for processing

70
Jobs State Machine (2/9)
  • Waiting: job accepted by the NS and waiting for
    Workload Manager processing, or being processed by
    WMHelper modules.

71
Jobs State Machine (3/9)
  • Ready: job processed by the WM and its Helper modules
    (CE found) but not yet transferred to the CE
    (local batch system queue) via JC and CondorC.
    This state does not exist for a DAG as it is not
    subjected to matchmaking (its nodes are) but is
    passed directly to DAGMan.

72
Jobs State Machine (4/9)
Scheduled: job waiting in the queue on the CE.
This state also does not exist for a DAG as it
is not directly sent to a CE (its nodes are).
73
Jobs State Machine (5/9)
Running: the job is running. For a DAG this means
that DAGMan has started processing it.
74
Jobs State Machine (6/9)
Done: job exited or considered to be in a
terminal state by CondorC (e.g., submission to the CE
has failed in an unrecoverable way).
75
Jobs State Machine (7/9)
Aborted: job processing was aborted by the WMS
(waiting in the WM queue or on the CE for too long,
over-use of quotas, expiration of user
credentials).
76
Jobs State Machine (8/9)
Cancelled: the job has been successfully cancelled on
the user's request.
77
Jobs State Machine (9/9)
Cleared: the output sandbox was transferred to the
user or removed due to a timeout.
78
Directed Acyclic Graphs (DAGs)
  • A DAG represents a set of jobs
  • Nodes Jobs Edges Dependencies

NodeA
NodeB
NodeC
NodeD
NodeE
79
DAG JDL Structure
  • Type = "dag";
  • VirtualOrganisation = "yourVO";
  • Max_Nodes_Running = <int> > 0;
  • MyProxyServer = ...;
  • Requirements = ...;
  • Rank = ...;
  • InputSandbox = ...;  (more later!)
  • OutputSandbox = ...;
  • Nodes = [ nodeX = ...; ... ];  (more later!)
  • Dependencies = ...;  (more later!)

(Type, VirtualOrganisation, Nodes and Dependencies are mandatory;
the other attributes are optional)
80
Attribute Nodes
  • The Nodes attribute is the core of the DAG
    description

...
Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = ...;
];

nodefilename1 = [ file = "foo.jdl"; ];
nodefilename2 = [ file = "/home/vardizzo/test.jdl";
                  retry = 2; ];

nodefilename1 = [ description = [ JobType = "Normal";
                                  Executable = "abc.exe";
                                  Arguments = "1 2 3";
                                  OutputSandbox = {...};
                                  InputSandbox = {...}; ];
                  retry = 2; ];
81
Attribute Dependencies
  • It is a list of lists representing the
    dependencies between the nodes of the DAG.

...
Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = ...;
];

(MANDATORY: YES!)

dependencies = { { nodefilename1, nodefilename2 } };

dependencies = { { nodefilename1, nodefilename2 },
                 { { nodefilename1, nodefilename2 }, nodefilename3 } };

dependencies = { { { nodefilename1, nodefilename2, nodefilename3 },
                   nodefilename4 } };
82
InputSandbox Inheritance
  • All nodes inherit the value of the attributes
    from the one specified for the DAG.

NodeA = [ description = [ JobType = "Normal";
                          Executable = "abc.exe";
                          OutputSandbox = {"myout.txt"};
                          InputSandbox = {"/home/vardizzo/myfile.txt",
                                          root.InputSandbox}; ]; ];

Type = "dag";
VirtualOrganisation = "yourVO";
Max_Nodes_Running = <int> > 0;
MyProxyServer = ...;
Requirements = ...;
Rank = ...;
InputSandbox = {...};
Nodes = [ nodefilename = ...; ... ];
dependencies = ...;

  • Nodes without any InputSandbox values have to
    contain an empty list in their description:
  • InputSandbox = {};

83
Interactive Jobs
  • It is a job whose standard streams are forwarded
    to the submitting client.
  • The DISPLAY environment variable has to be set
    correctly, because an X window may be opened.

(diagram: listener process on the submitting client; X window or standard
non-GUI streams)
84
Interactive Jobs
  • Specified by setting JobType = Interactive in the JDL
  • When an interactive job is executed, a window for
    the stdin, stdout, stderr streams is opened
  • Possibility to send stdin to the job
  • Possibility to have the stderr and stdout of the
    job while it is running
  • Possibility to start a window for the standard
    streams of a previously submitted interactive job
    with the command glite-job-attach

85
Interactive Jobs JDL Structure
  • Type = "job";                (Mandatory)
  • JobType = "interactive";     (Mandatory)
  • Executable = ...;            (Mandatory)
  • Arguments = ...;             (Optional)
  • ListenerPort = <int> > 0;    (Optional)
  • OutputSandbox = ...;         (Optional)
  • Requirements = ...;          (Mandatory)
  • Rank = ...;                  (Mandatory)

gLite command: glite-job-attach [options] <jobID>
86
gLite Commands
  • JDL Submission
  • glite-job-submit -o guidfile jobCheck.jdl
  • JDL Status
  • glite-job-status -i guidfile
  • JDL Output
  • glite-job-output -i guidfile
  • Get Latest Job State
  • glite-job-get-chkpt -o statefile -i guidfile
  • Submit a JDL from a state
  • glite-job-submit -chkpt statefile -o guidfile
    jobCheck.jdl
  • See also the options, by typing --help after the
    commands.

87
Economy based brokering
  • Unicore

88
Unicore Broker
  • Distributed brokering
  • Sites Know the State of their Resources Best
  • Sites Can Conceal their Resource Configuration
  • Different VOs Need Different Selection Algorithms
  • Preferred site sets will vary
  • Different applications have different performance
    characteristics
  • Uses an economic model
  • cost-based evaluation, like in the real world
  • broker developed by University of Manchester, UK

Unicore is an open source product coordinated by
the Unicore Forum, see www.unicore.org
89
Unicore Broker
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
90
Job description ontology
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
91
Unicore Broker hierarchy
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
92
Unicore Broker in the system
(diagram: a Unicore client or alternative client talks through the Unicore
Gateway to the Network Job Supervisor, which consults the Resource Broker and
its Resource Database, the User Database and an external authorisation
service, and hands jobs to NQS, Condor or GT back-ends; multiple firewall
layouts are possible)
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
93
Unicore Broker
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
94
VO Schedulers
  • Pilot jobs and overlay networks

95
Towards a multi-scheduler world
  • expressing scheduling policies (priorities and
    usage shares) for multiple complex VOs in a
    single scheduler is proving difficult
  • resource owner does not want to know about VO
    internal structure, but assign the VO just a
    single share
  • VO wants to set fine-grained intra-VO shares
  • local schedulers (such as MAUI) are not geared
    towards non-admin defined policies; there is no
    grid-aware scheduler
  • possible solutions
  • develop an interface to manage the local
    scheduling policies
  • stack the schedulers, i.e. introduce a per-VO
    scheduler

96
traditional job submission models
  • There are three traditional deployment models
  • direct per-user job submission to a gatekeeper
    running with root privileges (GT2 GK, today's
    model)
  • a non-privileged dedicated CE or scheduler,
    accepting authenticated user jobs and submitting
    to the batch system
  • an on-demand CE, submitted by a VO or user to a
    front-end system, that then receives user jobs
    and submits these to the batch system
  • in order not to have complex schedulers run as
    root, a sudo-like component, glexec, is introduced

97
What is glexec?
  • glexec
  • a thin layer to change unix credentials based on
    grid identity and attribute information
  • you can think of it as
  • a replacement for the gatekeeper
  • a griddy version of Apache's suexec(8)
  • a program wrapper around LCAS, LCMAPS or GUMS

98
What glexec does
  • Input
  • a certificate chain, possibly with VOMS
    extensions
  • a user program name and arguments to run
  • Action
  • check authorization (LCAS, GUMS)
  • user credentials, proper VOMS attributes,
    executable name
  • acquire local credentials
  • local (uid, gid) pair, possibly across a cluster
  • enforce the local credential on the process
  • Result
  • user program is run with the mapped credentials

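A rough sketch of how a pilot framework on a worker node calls glexec (environment variable names as used by the gLite glexec; the installation path, proxy location and payload script are assumptions):

  #!/bin/bash
  # the pilot has just downloaded the payload owner's proxy and job
  export GLEXEC_CLIENT_CERT=/tmp/payload_proxy.pem   # proxy used for authorization
  export GLEXEC_SOURCE_PROXY=/tmp/payload_proxy.pem  # proxy handed to the target account
  /opt/glite/sbin/glexec /bin/sh ./payload_job.sh    # runs under the mapped (uid, gid)
  echo "glexec exit status: $?"                      # non-zero: authorization/mapping failed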
99
Jobs submission today (GT2 GK)
  • Deployment model without glexec (mode GT2GK)
  • jobs are submitted with an identity (hopefully
    the original user's one) to the site gatekeeper,
    running as root
  • one job manager is run for each user on the head
    node
  • with the user's (uid, gid) as set by the gatekeeper

100
Glexec in a one-per-site mode
  • Deployment model with a CE service
  • running in a non-privileged account or
  • with a CE run (maybe one per VO) on a single
    front-end per site
  • examples
  • CREAM
  • GT4 WS-GRAM

101
glexec with an on-demand CE
  • Deployment model with on-demand CEs (mode
    on-demand CEs)
  • The user or the VO starts their own scheduler on a
    front-end system
  • All these on-demand schedulers are
    resource-limited by a site-managed master
    scheduler (via a GT2GK or Condor)
  • the on-demand schedulers eat jobs for their VO or
    user
  • and set the proper identity before the job gets
    submitted to the site batch system

102
glexec with on-demand CE
  • Deployment model with on-demand CEs (mode
    on-demand for VOs with native interface)

103
Traditional model summary
  • In all three models, the submission of the user
    job to the batch system is done with the original
    job owner's mapped (uid, gid) identity
  • grid-to-local identity mapping is done only on
    the front-end system (CE)
  • batch system accounting provides per-user records
  • inspection of Unix processes on worker nodes is
    per-user

104
Pilot jobs
  • A pilot job is basically just
  • a small script which downloads a real job
  • from a repository once it starts executing, hence
  • it is not committed to any particular task, or
    perhaps even a particular user, until that point.
  • If there are no tasks waiting the pilot job exits
    immediately.
  • In principle, if the time limits on the queue are
    long enough a single pilot job could run more
    than one real job, although I'm not sure if
    anyone is actually doing that at the moment.

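In shell form, a minimal pilot could look like the sketch below; everything here is hypothetical (the VO task-queue URL, the payload packaging, and the one-task-per-pilot policy):

  #!/bin/bash
  # minimal pilot: fetch one real job from the VO's task queue and run it
  TASKQUEUE=https://taskqueue.example.vo/getjob      # hypothetical VO service
  WORKDIR=$(mktemp -d) && cd "$WORKDIR"
  if ! curl -sf "$TASKQUEUE" -o task.tar.gz; then
      echo "no waiting tasks, exiting"               # pilot exits immediately
      exit 0
  fi
  tar xzf task.tar.gz                                # unpack payload: run.sh plus inputs
  ./run.sh                                           # execute the real user job
  # a long-lived pilot could loop here until the queue's wall-time limit approaches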
105
From the VO side
  • Background: some large VOs develop and prefer to
    use their own scheduling and job management
    framework
  • late binding of jobs to job slots
  • first establishing an overlay network
  • subsequent scheduling and starting of jobs is
    faster
  • hide the differences between the various grid
    flavours
  • implement VO priorities
  • full use of allocated slots, up to the max wall
    clock time
  • but these VOs will need their own scheduler
  • some of them have it already,
  • but others don't and most never will, so the
    use of pilots should not be the only (or even the
    default) way of doing things

106
Situation today
  • VO-type pilot jobs are submitted as if they were
    regular user jobs
  • they run with the identity of one or a few
    individuals from a VO
  • they obtain jobs from any user (within the VO) and
    run that payload on the allocated WN
  • the site sees only a single identity, not the true
    owner of the workload
  • no effective mechanism today can deny this usage
    model
  • note that this does not apply to regular
    per-user pilot jobs

107
Issues
  • Issues that drove the original glexec-on-WN
    scenario
  • VO supplied pilot jobs must observe and honour
  • the same policies the site uses for normal job
    execution
  • preferably
  • without requiring alternate mechanisms to
    describe the policies
  • be continuously in synch with the site policies
  • again, per-user pilot jobs satisfy these rules
    by design

108
Pieces of a solution
  • Three pieces that go together
  • glexec on the worker-node deployment
  • a mechanism for pilot jobs to submit themselves and
    their payload to site policy control
  • give incontrovertible evidence of who is running
    on which node at any one time
  • needed at selected sites for regulatory
    compliance
  • ability to nail individual culprits
  • by requiring the VO to present a valid delegation
    from each user
  • VO should want this
  • to keep user jobs from interfering with each
    other
  • honouring site ban lists for individuals may help
    in not banning the entire VO in case of an
    incident

109
Pieces of the solution
  • glexec on the worker-node deployment
  • a way to keep the pilot job submitters to their
    word
  • system-level auditing of the pilot jobs, to see
    that they are not doing the user job themselves or
    evading the controls
  • relies on advanced auditing features of the OS
    (from EAL3)
  • but auditing data on the WN is useful for
    incident investigations only
  • internal accounting should be done by the VO
  • the regular site accounting mechanisms are via
    the batch system, and will see the pilot job
    identity
  • the site can easily show from those logs the
    usage by the pilot job (for which wall-clock-time
    accounting should be used)
  • making a site do accounting based on glexec jobs is
    non-standard, requires effort, may be intrusive,
    and messes up normal accounting
  • a VO capable of writing its own submission
    framework ought to be able to write its own
    accounting system as well

110
glexec on WN deployment model
  • VO submits a pilot job to the batch system
  • the VO pilot job submitter is responsible for
    the pilot behaviour
  • this might be a specific role in the VO, or a
    locally registered badged user at each site
  • Pilot job is subject to normal site policies for
    jobs
  • Pilot job obtains the true user job, and
    presents the user credentials and the job
    (executable name) to the site (glexec) to
    request a decision

111
VO pilot job on the node
  • On success, the site will set the uid/gid of the
    new user's job
  • On failure, glexec returns an error, and the
    pilot job can terminate or obtain another job

Note: the proper uid change by the Gatekeeper or
Condor-C/BLAHP on the head node should remain the default
112
What is needed in this model?
  • Agreement on the three ingredients
  • deployment of glexec on the WN to do setuid
  • detailed auditing on the head node and the WNs
  • site accounting done at the VO (i.e. pilot job)
    level
  • glexec
  • needs feature enhancements compared to single-CE
    version
  • see status of glexec on the next slide
  • Inspection of the audit logs
  • detect abuse patterns in the system-call auditing
    logs
  • Grid job logging capabilities
  • glexec will log (uid, user/system/real time
    usage) via syslog
  • credential mapping framework (LCMAPS) will log
    mapping (also via syslog)
  • centralisation of glexec mappings, e.g. via
    JobRepository

113
Notes and alternatives
  • glexec, like any site-managed ingress point,
    trusts the submitter not to have mixed up the
    user credentials and the jobs
  • we trust the RB today to do this correctly, and RBs
    are unknown quantities to the receiving site
  • a longer term solution is to have the job request
    signed by the submitting user
  • since the description is modified by
    intermediaries (brokers), the signature can only
    apply to the original content, and the site would
    have to evaluate whether the job received matches
    the signed JDL
  • or use an inheritance model for the job
    description, and treat the job like you would,
    e.g., a CIM entity

114
Summary
  • Realize that some VOs are doing pilot
    jobs today
  • there is no effective enforcement against this
  • some sites may just not care yet, whilst
    others have hard requirements on auditability and
    regulatory compliance
  • The glexec-on-WN model gives the VOs tools to
    comply with site requirements
  • at least it makes things better than they are today
  • but you, as a site, will miss that warm and fuzzy
    feeling of trust
  • glexec-on-WN can always be replaced by a
    null operation at sites that don't care or
    don't want it
  • but realize this is for just one of the glexec
    deployment models