An overview of grid middleware and gLite - PowerPoint PPT Presentation

Provided by: mikem181
1
An overview of grid middleware and gLite
2
Outline
  • An overview of grid middleware
  • Introduction of gLite
  • Job management services of gLite

3
A Grid
  • Grid
  • Many machines
  • Across many locations and administrative domains
  • Grid middleware runs on each machine
  • High Performance Computing
  • High capacity Storage
  • Meet the need of scientific computing
  • The grid trusts VOs
  • Users join VOs
  • A virtual organisation contributes resources and
    negotiates access
  • Additional services also enable the grid
  • Operation
  • Dissemination

A Virtual Organization (VO) is an entity that corresponds
to an organization or group of people who wish to
share computing, data or software resources.
4
Authentication, Authorisation (AA)
Users in many locations and organisations
Access services (user interface): log on,
upload credentials, run middleware commands
Built on the Grid Security Infrastructure: encryption
and data integrity, authentication and
authorization
Gatekeeping: authenticate users and give
permissions
Resources in many locations and organisations
System software: operating system, local scheduler (PBS, Condor, LSF, ...), file system (NFS, ...)
Hardware: computing clusters, network resources, data storage (HPSS, CASTOR)
5
Basic job Management
Users
  • Tools for
  • Submit jobs to a CE
  • Monitor jobs
  • Get outputs
  • Transfer files to CE
  • Transfer files between CE and SE

How do I run a job on a compute element (CE)?
(CE batch queue)
Resources
Compute elements
Data storage
Network resources
6
Information service (IS)
Users
  • Information Service (IS)
  • Resources such as CE and SE report their status
    to IS
  • Grid services query IS before running jobs

How do I know which CE could run my job? Which is
free?
Resources
Compute elements
Data storage
Network resources
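The report-then-query pattern on this slide can be sketched in Python. This is a toy in-memory model, not the gLite information system: the class `InformationService`, its methods, and the fields `kind` and `free_slots` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InformationService:
    """Toy in-memory registry; all names here are hypothetical."""
    status: dict = field(default_factory=dict)

    def report(self, resource_id, kind, free_slots):
        # A CE or SE publishes (or refreshes) its current status.
        self.status[resource_id] = {"kind": kind, "free_slots": free_slots}

    def free_ces(self):
        # Grid services ask which CEs currently have free job slots.
        return sorted(
            rid for rid, s in self.status.items()
            if s["kind"] == "CE" and s["free_slots"] > 0
        )


info = InformationService()
info.report("ce01.example.org", "CE", free_slots=12)
info.report("ce02.example.org", "CE", free_slots=0)
info.report("se01.example.org", "SE", free_slots=0)
print(info.free_ces())
```

A resource that reports again simply overwrites its earlier entry, so a query always sees the latest status each resource published.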
7
File management
Users
Storage, transfer, replication management
We've terabytes of data in files.
My data are in files, and I've terabytes.
Our data are in files, and we've terabytes.
Resources
Compute elements
Data storage
Network resources
8
Main components
User Interface (UI): the place where users access
the Grid
Computing Element (CE): a batch queue on a site's
computers where
the user's job is executed
Storage Element (SE): provides (large-scale)
storage for files
9
Current production middleware
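Diagram of the current production middleware. Components: User Interface, Resource Broker (Workload Manager), Information Service, Replica Catalogue, Authorization and Authentication, Logging & Bookkeeping, Computing Element. Labelled flows: input sandbox and broker info, output sandbox, job status.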
10
gLite 3.0: the current middleware
  • Being deployed on EGEE production Grid now
  • Runs on various Linux releases
  • Scientific Linux most common
  • Ports to other Operating Systems in progress
  • History
  • During the last 2 years, new services were
    created in releases of new middleware, up to
    gLite 1.5, which has been in pre-production use
  • A subset of these is deployed with some of the
    previous middleware (LCG 2.7)
  • All components already in LCG 2.7.0 plus upgrades
  • this already includes new versions of VOMS, R-GMA
    and FTS
  • The Workload Management System (with LB, CE, UI)
    of gLite 1.5.0

11
gLite Grid Middleware Services
Access: API, CLI
Security Services: Authentication, Authorization, Auditing
Information & Monitoring Services: Information & Monitoring, Application Monitoring
Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
Workload Mgmt Services: Computing Element, Workload Management, Job Provenance, Package Manager, Accounting
Connectivity
12
http://gridportal.hep.ph.ic.ac.uk/rtm
14:00 on 17 Jan 2007
13
gLite Job Management Services
14
gLite Job Management Services
15
WMS's Architecture
Job management requests (submission,
cancellation) are expressed via a Job
Description Language (JDL)
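As an illustration of what such a request looks like, the sketch below renders a small job description as `Name = value;` lines in the JDL style. The helper `to_jdl` is hypothetical and only handles strings and lists of strings; expression-valued attributes (for example requirements) are not covered, and the file names are made up.

```python
def to_jdl(job):
    """Render a dict as JDL-style 'Name = value;' lines (strings and lists only)."""
    def fmt(value):
        if isinstance(value, (list, tuple)):
            # Lists are written in braces, e.g. {"a", "b"}.
            return "{" + ", ".join(fmt(v) for v in value) + "}"
        return '"' + str(value) + '"'

    return "\n".join(f"{name} = {fmt(value)};" for name, value in job.items())


job = {
    "Executable": "/bin/hostname",
    "StdOutput": "std.out",
    "StdError": "std.err",
    "OutputSandbox": ["std.out", "std.err"],
}
print(to_jdl(job))
```

The printed text is what a user would save to a file and hand to the submission tools on the UI.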
16
WMS's Architecture
Keeps submission requests. Requests are kept
for a while, waiting to be dispatched, if
there is no matching resource available.
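A minimal sketch of this "keep until a resource matches" behaviour follows. The class `TaskQueue`, the `cpus` field, and the matching rule are assumptions made for illustration; expiry of requests after "a while" is not modelled.

```python
from collections import deque


class TaskQueue:
    """Holds requests that could not be dispatched yet (hypothetical model)."""

    def __init__(self):
        self.pending = deque()

    def submit(self, request, free_resources):
        # Dispatch immediately if possible, otherwise keep the request.
        if not self._dispatch(request, free_resources):
            self.pending.append(request)

    def retry(self, free_resources):
        # Re-examine every kept request once, preserving arrival order.
        dispatched = []
        for _ in range(len(self.pending)):
            request = self.pending.popleft()
            if self._dispatch(request, free_resources):
                dispatched.append(request["id"])
            else:
                self.pending.append(request)
        return dispatched

    @staticmethod
    def _dispatch(request, free_resources):
        # Assumed rule: a request matches when some resource offers enough CPUs.
        return any(cpus >= request["cpus"] for cpus in free_resources)


queue = TaskQueue()
queue.submit({"id": "job-1", "cpus": 8}, free_resources=[2, 4])
queue.submit({"id": "job-2", "cpus": 2}, free_resources=[2, 4])
print(len(queue.pending))
print(queue.retry([16]))
```

Here `job-1` stays queued because no resource is large enough at submission time, and is released by the later `retry` call once a 16-CPU resource appears.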
17
WMS's Architecture
Repository of resource information. Updated via
notifications and/or active polling on
sources. Provides the matchmaker with information to
decide the best resources for a request.
18
WMS's Architecture
Finds an appropriate CE or resource for a job
request according to the information from the
ISM, taking into account job preferences,
resource status, and policies on resources.
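Matchmaking of this kind can be sketched as "filter by requirements, then order by rank". The function `match`, the resource fields `free_cpus` and `max_wall_minutes`, and the CE names are invented for this example and are not the gLite matchmaker's actual data model.

```python
def match(job, resources):
    """Return names of resources that satisfy the job, best rank first."""
    # Step 1: keep only resources that meet the job's requirements.
    candidates = [r for r in resources if job["requirements"](r)]
    # Step 2: order the survivors by the job's rank expression, highest first.
    candidates.sort(key=job["rank"], reverse=True)
    return [r["name"] for r in candidates]


resources = [
    {"name": "ce-a", "free_cpus": 4, "max_wall_minutes": 720},
    {"name": "ce-b", "free_cpus": 32, "max_wall_minutes": 60},
    {"name": "ce-c", "free_cpus": 16, "max_wall_minutes": 1440},
]
job = {
    "requirements": lambda r: r["max_wall_minutes"] >= 600,
    "rank": lambda r: r["free_cpus"],
}
print(match(job, resources))
```

`ce-b` has the most free CPUs but is excluded because its queue is too short for the job, which is why requirements are applied before ranking.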
19
WMS's Architecture
Performs the actual job submission and monitoring.
Normally it is Condor.
20
WMS's Architecture
The Computing Element is the place where your jobs run
21
WMS components (1)
  • WMS components handling the job during its
    lifetime and performing the submission
  • Network Server (NS) is responsible for
    • accepting incoming requests from the UI
    • authenticating the user
    • obtaining a delegated full proxy from the user
      proxy
    • enqueuing the job to the Workload Manager
  • Workload Manager (WM) is responsible for
    • calling the Matchmaker to find the resource that best
      matches the job requirements
    • interacting with the Information System and the File
      Catalog
    • calculating the ranking of all the matched
      resources
  • Information Supermarket (ISM)
    • basically consists of a repository of resource
      information that is available in read-only mode
      to the matchmaking engine

22
WMS components (2)
  • WMS components handling the job during its
    lifetime and performing the submission
  • Job Adapter is responsible for
    • making the final touches to the JDL expression
      for a job, before it is passed to CondorC for the
      actual submission
    • creating the job wrapper script that creates the
      appropriate execution environment on the CE
      worker node
    • transfer of the input and output sandboxes
  • Job Controller (JC) is responsible for
    • converting the Condor submit file into a ClassAd
    • handing the job over to CondorC
  • CondorC is responsible for
    • performing the actual job management operations:
      job submission, job removal
  • Log Monitor is responsible for
    • watching the CondorC log file
    • intercepting interesting events concerning active
      jobs, i.e. events affecting the job state machine
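The Log Monitor's role can be illustrated with a small sketch that reads a log and keeps only the events that change a job's state. The log format (`<job id> <EVENT>` per line), the event names, and the event-to-state table are all hypothetical; a real CondorC log uses its own format.

```python
import io

# Hypothetical event names mapped to the job states they would trigger.
INTERESTING = {
    "SUBMITTED": "Scheduled",
    "EXECUTING": "Running",
    "TERMINATED": "Done",
}


def watch(log, states):
    """Read lines from an open log and update the per-job state table."""
    for line in log:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        job_id, event = parts
        if event in INTERESTING:
            # Only events that affect the state machine are recorded.
            states[job_id] = INTERESTING[event]
    return states


log = io.StringIO(
    "job-1 SUBMITTED\n"
    "job-1 EXECUTING\n"
    "job-2 SUBMITTED\n"
    "job-1 HEARTBEAT\n"
)
print(watch(log, {}))
```

The `HEARTBEAT` line is ignored because it is not in the table of interesting events, so `job-1` remains in the state set by its last relevant event.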

23
CE's Architecture
A Computing Element is built on a homogeneous farm
of computing nodes (called Worker Nodes). There are
also many components inside the CE, such as the
gatekeeper, globus-jobmanager, ...
24
CE's Architecture
Gatekeeper: grants access to the CE and maps the grid
user to a local user id.
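The mapping step can be sketched as a table lookup from a certificate subject to a local account. The table format below (quoted subject followed by an account name) is modelled on the Globus grid-mapfile convention; that this particular CE uses such a file is an assumption, and the subjects, account names, and helper functions are invented for the example.

```python
import shlex


def parse_gridmap(text):
    """Parse lines of the form: "<certificate subject>" <local account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, account = shlex.split(line)
        mapping[subject] = account
    return mapping


def map_user(subject, mapping):
    # Unknown subjects are refused rather than mapped to a default account.
    if subject not in mapping:
        raise PermissionError(f"no local account for {subject}")
    return mapping[subject]


GRIDMAP = '''
# example entries, not real users
"/C=XX/O=Example/CN=Alice Example" atlas001
"/C=XX/O=Example/CN=Bob Example" cms002
'''
mapping = parse_gridmap(GRIDMAP)
print(map_user("/C=XX/O=Example/CN=Alice Example", mapping))
```

Refusing unknown subjects keeps the two gatekeeper duties together: a user who cannot be mapped is also not granted access.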
25
CE's Architecture
Batch System: a cluster of compute nodes
controlled by a head node. It handles the job
execution. Examples: Torque (OpenPBS), PBS
26
A typical case of glite-enabled grid
  • Many CEs in a gLite-enabled grid
  • A few WMSs coordinate the CEs and broker jobs to
    the proper CEs.

27
Computing Element Components
  • Gatekeeper
    • grants access to the CE: authenticates users and
      maps them to local accounts
    • forks the globus-jobmanager
  • globus-jobmanager
    • forks Condor-C (in the CE) to help submit jobs to
      batch systems
  • BLAHPD (Batch Local ASCII Helper Protocol Daemon)
    • offers a single interface for Condor-C (in the CE)
      to submit jobs to different batch systems
    • BLAHPD commands are used by Condor-C (in the CE) to
      submit jobs to the batch system
  • Batch System
    • handles the job execution on the available local
      worker nodes
    • consists of
      - the Torque resource manager (derived from OpenPBS)
      - the Maui job scheduler
  • A cluster MUST be homogeneous.
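The idea of one interface in front of several batch systems can be sketched as a small dispatch table. This is not the helper protocol itself: the function `build_submit` and the table are hypothetical, only two batch systems are shown, and the commands are built as lists without being executed.

```python
# Illustrative table: one common request, translated per batch system.
SUBMIT_COMMANDS = {
    "pbs": lambda script: ["qsub", script],
    "condor": lambda script: ["condor_submit", script],
}


def build_submit(batch_system, script):
    """Return the native command line for submitting `script` (not executed)."""
    try:
        builder = SUBMIT_COMMANDS[batch_system]
    except KeyError:
        raise ValueError(f"unsupported batch system: {batch_system}") from None
    return builder(script)


print(build_submit("pbs", "job.sh"))
print(build_submit("condor", "job.sub"))
```

The caller only names the batch system and the script; supporting another batch system means adding one entry to the table, which is the benefit the slide attributes to a common interface.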

28
Job State Machine
29
Job State Machine (1/9)
Submitted: the job has been entered by the user to the
User Interface but not yet transferred to the Network
Server for processing.
30
Job State Machine (2/9)
Waiting: the job was accepted by the NS and is waiting
for Workload Manager processing or is being
processed by WM Helper modules.
31
Job State Machine (3/9)
Ready: the job has been processed by the WM and its Helper
modules (a CE was found) but not yet transferred to the
CE (local batch system queue) via the JC and CondorC.
32
Job State Machine (4/9)
Scheduled: the job is waiting in the queue on the CE.
33
Job State Machine (5/9)
Running: the job is running on the CE's queuing system
(inside one of the worker nodes).
34
Job State Machine (6/9)
Done: the job exited or is considered to be in a
terminal state by CondorC (e.g., submission to the CE
has failed in an unrecoverable way).
35
Job State Machine (7/9)
Aborted: job processing was aborted by the WMS
(waiting in the WM queue or CE for too long,
over-use of quotas, expiration of user
credentials).
36
Job State Machine (8/9)
Cancelled: the job has been successfully cancelled on
user request.
37
Job State Machine (9/9)
Cleared: the output sandbox was transferred to the
user or removed due to the timeout.
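The nine states can be collected into a small Python sketch. The state names come from the slides; the transition table is an assumption inferred from the state descriptions, since the state diagram itself is not part of this transcript. In particular, allowing Aborted and Cancelled from every state between Waiting and Running, and Done directly from Ready or Scheduled, is a guess.

```python
from enum import Enum


class JobState(Enum):
    SUBMITTED = "Submitted"
    WAITING = "Waiting"
    READY = "Ready"
    SCHEDULED = "Scheduled"
    RUNNING = "Running"
    DONE = "Done"
    ABORTED = "Aborted"
    CANCELLED = "Cancelled"
    CLEARED = "Cleared"


S = JobState
# Assumed normal path of a successful job, in slide order.
NORMAL_PATH = [S.SUBMITTED, S.WAITING, S.READY, S.SCHEDULED,
               S.RUNNING, S.DONE, S.CLEARED]
# Assumed set of states from which a job can be aborted or cancelled.
IN_PROGRESS = {S.WAITING, S.READY, S.SCHEDULED, S.RUNNING}

TRANSITIONS = {state: set() for state in JobState}
for current, following in zip(NORMAL_PATH, NORMAL_PATH[1:]):
    TRANSITIONS[current].add(following)
for state in IN_PROGRESS:
    TRANSITIONS[state] |= {S.ABORTED, S.CANCELLED}
# Done also covers unrecoverable submission failures (see the Done slide).
for state in (S.READY, S.SCHEDULED):
    TRANSITIONS[state].add(S.DONE)


def can_move(current, target):
    """Return True if the assumed table allows current -> target."""
    return target in TRANSITIONS[current]


print(can_move(S.RUNNING, S.DONE))
print(can_move(S.DONE, S.RUNNING))
```

Keeping the table as data makes it easy to correct individual transitions against the real diagram without touching `can_move`.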
38
Further information
  • EGEE www.eu-egee.org
  • gLite http://www.glite.org/
  • LCG http://lcg.web.cern.ch/LCG/
  • Open Grid Forum http://www.gridforum.org/
  • Globus Alliance http://www.globus.org/
  • VDT http://www.cs.wisc.edu/vdt/