Title: An overview of grid middleware and gLite
1An overview of grid middleware and gLite
2Outline
- An overview of grid middleware
- Introduction of gLite
- Job management services of gLite
3A Grid
- Grid
- Many machines
- Across many locations and administrative domains
- Grid middleware runs on each machine
- High Performance Computing
- High capacity Storage
- Meet the need of scientific computing
- Grid trust VOs
- Users join VOs
- Virtual organisation negotiates access and contributes resources
- Additional services also enable the grid
- Operation
- Dissemination
A Virtual Organization (VO) is an entity that corresponds
to an organization or group of people who wish to
share computing, data or software resources.
4Authentication, Authorisation (AA)
Users in many locations and organisations
Access services (user interface): log on,
upload credentials, run middleware commands
Built on the Grid Security Infrastructure: encryption
and data integrity, authentication and
authorization
Gatekeeping: authenticate users and grant
permissions
Resources in many locations and organisations
System software
- Operating system
- Local scheduler (PBS, Condor, LSF, ...)
- File system (NFS, ...)
Hardware
- Computing clusters
- Network resources
- Data storage (HPSS, CASTOR)
5Basic job Management
Users
- Tools for
- Submit jobs to a CE
- Monitor jobs
- Get outputs
- Transfer files to CE
- Transfer files between CE and SE
How do I run a job on a compute element (CE)?
(CE: batch queue)
Resources
Compute elements
Data storage
Network resources
6Information service (IS)
Users
- Information Service (IS)
- Resources such as CEs and SEs report their status to the IS
- Grid services query the IS before running jobs
How do I know which CE could run my job? Which is
free?
Resources
Compute elements
Data storage
Network resources
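The report-then-query pattern on this slide can be sketched in a few lines of Python. This is only a toy model, not the gLite information system: the class `InformationService`, its methods, the status fields `kind` and `free_slots`, and the host names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InformationService:
    """Toy in-memory information service (hypothetical, not a gLite API)."""
    # Maps a resource name to the latest status it reported.
    status: dict = field(default_factory=dict)

    def report(self, name, kind, free_slots):
        """Called by a resource (CE or SE) to publish its current status."""
        self.status[name] = {"kind": kind, "free_slots": free_slots}

    def free_ces(self):
        """Return the names of CEs that currently report free slots."""
        return sorted(
            name for name, s in self.status.items()
            if s["kind"] == "CE" and s["free_slots"] > 0
        )


info = InformationService()
info.report("ce01.example.org", "CE", free_slots=12)
info.report("ce02.example.org", "CE", free_slots=0)
info.report("se01.example.org", "SE", free_slots=0)
print(info.free_ces())
```

A later `report` call from the same resource simply overwrites its previous entry, so queries always see the most recent status.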
7File management
Users
Storage, transfer and replication management
We've terabytes of data in files.
My data are in files, and I've terabytes.
Resources
Compute elements
Data storage
Network resources
8Main components
User Interface (UI): the place where users access
the Grid
Computing Element (CE): a batch queue on a site's
computers where the user's job is executed
Storage Element (SE): provides (large-scale)
storage for files
9Current production middleware
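[Diagram: job flow through the current production middleware]
- Components: User Interface, Resource Broker (Workload Manager), Information Service, Replica Catalogue, Authorization and Authentication, Logging and Bookkeeping, Computing Element
- Flows: input sandbox and broker info, output sandbox, job status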
10gLite 3.0 the current middleware
- Being deployed on EGEE production Grid now
- Runs on various Linux releases
- Scientific Linux most common
- Ports to other Operating Systems in progress
- History
- Over the last 2 years, new services were created in releases of new middleware, up to gLite 1.5, which has been in pre-production use
- A subset of these is deployed together with some of the previous middleware (LCG 2.7)
- All components already in LCG 2.7.0, plus upgrades
- this already includes new versions of VOMS, R-GMA and FTS
- The Workload Management System (with LB, CE, UI) of gLite 1.5.0
11gLite Grid Middleware Services
[Diagram: gLite service groups]
- Access: API, CLI
- Security Services: Authentication, Authorization, Auditing
- Information & Monitoring Services: Information & Monitoring, Application Monitoring
- Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
- Workload Mgmt Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management
- Connectivity
12http://gridportal.hep.ph.ic.ac.uk/rtm
14:00 on 17 Jan 2007
13gLite Job Management Services
14gLite Job Management Services
15WMS's Architecture
Job management requests (submission,
cancellation) are expressed via a Job
Description Language (JDL).
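A JDL job description is essentially a list of `attribute = value;` pairs. The sketch below renders a Python dictionary in that style. The helper `to_jdl` and the file names are hypothetical; `Executable`, `StdOutput`, `StdError` and `OutputSandbox` are commonly used JDL attributes, but the exact set accepted by a given WMS release should be checked against its documentation. Only string and list values are handled here.

```python
def to_jdl(attrs):
    """Render a dict as JDL-style text (illustrative helper, not a gLite API)."""
    def fmt(value):
        # Lists become {"a", "b"}; everything else is a double-quoted string.
        if isinstance(value, (list, tuple)):
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return '"' + str(value) + '"'

    return "\n".join(f"{key} = {fmt(value)};" for key, value in attrs.items())


job = {
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
}
print(to_jdl(job))
```

Expressions such as requirements or ranking formulas are not quoted strings in JDL, so they would need separate handling that this sketch leaves out.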
16WMS's Architecture
Keeps submission requests.
Requests are kept for a while, waiting to be
dispatched, if no matching resource is available.
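The holding behaviour on this slide (keep a request for a while, dispatch it once a resource matches) can be sketched as a small queue. Everything here is an assumption made for illustration: the class name `TaskQueue`, the `max_age` expiry rule, and the `find_resource` callback are hypothetical and do not reflect the actual WMS implementation.

```python
from collections import deque


class TaskQueue:
    """Toy queue that holds requests until a resource matches (hypothetical)."""

    def __init__(self, max_age):
        self.max_age = max_age   # how long a request may wait, in seconds
        self.pending = deque()   # (submit_time, request) pairs

    def submit(self, request, now):
        self.pending.append((now, request))

    def dispatch(self, find_resource, now):
        """Try every pending request once.

        Returns (dispatched, expired); unmatched requests stay queued.
        """
        dispatched, expired = [], []
        for _ in range(len(self.pending)):
            submitted, request = self.pending.popleft()
            if now - submitted > self.max_age:
                expired.append(request)          # waited too long
                continue
            resource = find_resource(request)
            if resource is None:
                self.pending.append((submitted, request))  # keep waiting
            else:
                dispatched.append((request, resource))
        return dispatched, expired


queue = TaskQueue(max_age=60)
queue.submit("job-1", now=0)
# No resource matches yet, so the request stays in the queue.
print(queue.dispatch(lambda req: None, now=10))
# A resource has become available, so the request is dispatched.
print(queue.dispatch(lambda req: "ce01.example.org", now=20))
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.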
17WMS's Architecture
Information Supermarket (ISM): repository of resource information.
Updated via notifications and/or active polling of
sources.
Provides the matchmaker with the information needed to
decide the best resources for a request.
18WMS's Architecture
Matchmaker: finds an appropriate CE or resource for a job
request according to the information from the
ISM, taking into account job preferences,
resource status and policies on resources.
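Matchmaking as described here has two steps: filter out resources that do not satisfy the job, then rank the rest. The sketch below shows that filter-then-rank idea only. The function `matchmake`, the resource fields `free_cpus` and `max_wall_minutes`, and the host names are hypothetical; a real matchmaker evaluates JDL expressions against published resource information and also applies site policies.

```python
def matchmake(job, resources):
    """Return the name of the best-ranked matching resource, or None."""
    # Step 1: keep only resources that satisfy the job's requirements.
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    # Step 2: choose the candidate with the highest rank value.
    return max(candidates, key=job["rank"])["name"]


resources = [
    {"name": "ce01.example.org", "free_cpus": 4, "max_wall_minutes": 720},
    {"name": "ce02.example.org", "free_cpus": 20, "max_wall_minutes": 60},
    {"name": "ce03.example.org", "free_cpus": 9, "max_wall_minutes": 1440},
]
job = {
    # The job needs at least 10 hours of wall-clock time.
    "requirements": lambda r: r["max_wall_minutes"] >= 600,
    # Prefer the resource with the most free CPUs.
    "rank": lambda r: r["free_cpus"],
}
print(matchmake(job, resources))
```

Here `ce02` has the most free CPUs but is filtered out by the requirement, so the choice falls to `ce03`.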
19WMS's Architecture
Performs the actual job submission and monitoring.
Normally this is Condor.
20WMS's Architecture
The Computing Element is the place where your jobs run.
21WMS components (1)
- WMS components handling the job during its lifetime and performing the submission
- Network Server (NS) is responsible for:
  - Accepting incoming requests from the UI
  - Authenticating the user
  - Obtaining a delegated full proxy from the user proxy
  - Enqueuing the job to the Workload Manager
- Workload Manager (WM) is responsible for:
  - Calling the Matchmaker to find the resource which best matches the job requirements
  - Interacting with the Information System and File Catalog
  - Calculating the ranking of all the matched resources
- Information Supermarket (ISM)
  - Basically consists of a repository of resource information that is available in read-only mode to the matchmaking engine
22WMS components (2)
- WMS components handling the job during its lifetime and performing the submission
- Job Adapter is responsible for:
  - Making the final touches to the JDL expression for a job, before it is passed to CondorC for the actual submission
  - Creating the job wrapper script that creates the appropriate execution environment on the CE worker node
  - Transfer of the input and output sandboxes
- Job Controller (JC) is responsible for:
  - Converting the Condor submit file into a ClassAd
  - Handing over the job to CondorC
- CondorC is responsible for:
  - Performing the actual job management operations: job submission, job removal
- Log Monitor is responsible for:
  - Watching the CondorC log file
  - Intercepting interesting events concerning active jobs, i.e. events affecting the job state machine
23CE's Architecture
A Computing Element is built on a homogeneous farm
of computing nodes (called Worker Nodes).
There are also many components inside a CE, such as
the gatekeeper, globus-jobmanager, ...
24CE's Architecture
Gatekeeper: grants access to the CE and maps the grid
user to a local user ID.
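The mapping step can be illustrated with a lookup table from certificate subject to local account. Globus-based sites commonly keep such a table in a grid-mapfile; the sketch assumes a simplified form of it, with one `"<certificate subject>" <local account>` entry per line. The subjects, account names and the helpers `parse_gridmap` and `map_user` are invented for the example.

```python
import shlex


def parse_gridmap(text):
    """Parse grid-mapfile-style lines into a {subject: local account} dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # shlex keeps the quoted subject (which contains spaces) as one token.
        subject, account = shlex.split(line)
        mapping[subject] = account
    return mapping


def map_user(mapping, subject):
    """Return the local account for a subject, or refuse access."""
    if subject not in mapping:
        raise PermissionError(f"no local account for {subject}")
    return mapping[subject]


GRIDMAP = """
# certificate subject, then local account
"/C=UK/O=Example/CN=Jane Doe" jdoe
"/C=IT/O=Example/CN=Mario Rossi" mrossi
"""
mapping = parse_gridmap(GRIDMAP)
print(map_user(mapping, "/C=UK/O=Example/CN=Jane Doe"))
```

A user whose subject is not listed gets a `PermissionError`, which corresponds to the gatekeeper refusing access to the CE.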
25CE's Architecture
Batch System: a cluster of compute nodes
controlled by a head node; it handles the job
execution. Examples: Torque (formerly OpenPBS), PBS.
26A typical case of a gLite-enabled grid
- Many CEs in a gLite-enabled grid
- A few WMSs coordinate the CEs and broker jobs to
suitable CEs.
27Computing Element (Components)
- Gatekeeper
  - Grants access to the CE: authenticates users and maps them to local accounts
  - Forks the globus-jobmanager
- globus-jobmanager
  - Forks Condor-C (in the CE) to help submit jobs to batch systems
- BLAHPD (Batch Local ASCII Helper Protocol Daemon)
  - Offers a single interface for Condor-C (in the CE) to submit jobs to different batch systems
  - BLAHPD commands are used by Condor-C (in the CE) to submit jobs to the batch system
- Batch System
  - Handles the job execution on the available local worker nodes
  - The batch system consists of:
    - Torque (formerly known as OpenPBS) resource manager
    - Maui job scheduler
- A cluster MUST be homogeneous.
28Job State Machine
29Job State Machine (1/9)
Submitted: job has been entered by the user at the
User Interface but not yet transferred to the Network
Server for processing.
30Job State Machine (2/9)
Waiting: job was accepted by the NS and is waiting
for Workload Manager processing or is being
processed by WM Helper modules.
31Job State Machine (3/9)
Ready: job has been processed by the WM and its Helper
modules (CE found) but not yet transferred to the
CE (local batch system queue) via JC and CondorC.
32Job State Machine (4/9)
Scheduled: job is waiting in the queue on the CE.
33Job State Machine (5/9)
Running: job is running on the CE's queuing system
(inside one of the worker nodes).
34Job State Machine (6/9)
Done: job has exited or is considered to be in a
terminal state by CondorC (e.g., submission to the CE
has failed in an unrecoverable way).
35Job State Machine (7/9)
Aborted: job processing was aborted by the WMS
(waiting in the WM queue or CE for too long,
over-use of quotas, expiration of user
credentials).
36Job State Machine (8/9)
Cancelled: job has been successfully cancelled on
user request.
37Job State Machine (9/9)
Cleared: output sandbox was transferred to the
user or removed due to the timeout.
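The nine states above can be written down as an enumeration with a table of allowed transitions. The state names come from the slides; the transition table is an assumption inferred from the state descriptions and is deliberately simplified (for example, it ignores resubmission of failed jobs). The names `JobState`, `TRANSITIONS` and `advance` are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    READY = "Ready"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    DONE = "Done"
    ABORTED = "Aborted"
    CANCELLED = "Cancelled"
    CLEARED = "Cleared"


S = JobState
# Assumed transitions, inferred from the state descriptions (simplified).
TRANSITIONS = {
    S.SUBMITTED: {S.WAITING, S.ABORTED, S.CANCELLED},
    S.WAITING: {S.READY, S.ABORTED, S.CANCELLED},
    S.READY: {S.SCHEDULED, S.DONE, S.ABORTED, S.CANCELLED},
    S.SCHEDULED: {S.RUNNING, S.ABORTED, S.CANCELLED},
    S.RUNNING: {S.DONE, S.ABORTED, S.CANCELLED},
    S.DONE: {S.CLEARED},
    S.ABORTED: set(),
    S.CANCELLED: set(),
    S.CLEARED: set(),
}


def advance(current, new):
    """Return `new` if the transition is allowed, otherwise raise ValueError."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


# Walk a job through the normal path from submission to cleared output.
state = S.SUBMITTED
for nxt in (S.WAITING, S.READY, S.SCHEDULED, S.RUNNING, S.DONE, S.CLEARED):
    state = advance(state, nxt)
print(state.value)
```

Aborted, Cancelled and Cleared have no outgoing transitions in this sketch, which marks them as final states.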
38Further information
- EGEE www.eu-egee.org
- gLite http://www.glite.org/
- LCG http://lcg.web.cern.ch/LCG/
- Open Grid Forum http://www.gridforum.org/
- Globus Alliance http://www.globus.org/
- VDT http://www.cs.wisc.edu/vdt/