Title: gLite Lecture 1
1 gLite Lecture 1
- Peter Kunszt
- EGEE Middleware Activity (JRA1), Data Management Cluster Leader
2 Outline
- gLite: why? What is it? An overview and motivation.
- General differences to currently deployed middleware
- Overall security concepts
- Workload Management
- This is a differential presentation that tries not to repeat what has been said earlier. The focus is on the differences to LCG, as presented in the lectures on Monday and Tuesday.
3 Chapter Overview
- OVERVIEW
- Quick EGEE / JRA1 intro
- Motivation of gLite
- Overview of Concepts and Services
4 EGEE
- EU 6th Framework project
- See http://eu-egee.org for details on size, number of partners, etc.
- Focus on Grid deployment
- > 120 sites currently participating
- Project is organized into 11 Activities
- Service Activities 1-2
- SA1: Deployment (48% of EGEE)
- SA2: Network provisioning
- Joint Research Activities 1-4
- JRA1: Middleware development
- JRA2: Quality Assurance
- JRA3: Security
- JRA4: Networking
- Networking Activities 1-5: management, administration, dissemination, tutorials, project relations, website, ...
5 JRA1 management
[Organization chart. Names shown: F. Hemmer (deputy: E. Laure), M. Barroso, E. Laure, LCG SPI, A. Di Meglio, F. Prelz, J. Hahkala, P. Kunszt, S. Fisher]
6 Why another Middleware?
- EU DataGrid, Globus, AliEn, NorduGrid and others have successfully provided a first working stack of Grid middleware.
- EDG stack in use by LCG / EGEE SA1
- Improvements provided by SA1 in the course of last year
- However, some issues cannot be fixed by simple incremental improvements
- Scalability issues: 100s of sites, billions of files
- Functionality issues: interactive jobs, checkpointing, filesystem-like view of data, DAG jobs, managed transfer and replication
- EGEE JRA1: Middleware re-engineering
- Build on existing experience and middleware: the initial decision was to take AliEn, not EDG, as the starting point (LCG-ARDA); this was changed later
- Hardening and quality
- Baseline services focus: expect applications/others to build some of the high-level services
7 EGEE Middleware: gLite
- Aim: improve on LCG
- Name: a permanent name in a world of projects
- EDG called its services "edg", which could not be taken up by EGEE for obvious reasons
- gLite is just a cool name; the middleware aspires to be LIGHTWEIGHT in USAGE, but it is not a slim, lightweight middleware (yet).
- Re-engineering: exploit experience and existing components from AliEn, VDT (Condor-G, Globus), EDG/LCG, NorduGrid and others
- Develop a stack of generic middleware useful to EGEE applications (HEP and Biomedical)
- Should eventually deploy dynamically (e.g. as a Globus job)
- Pluggable components: cater for different implementations
- Ease of use
- Standards: build on available standards
8 Guiding Principles
- Service Oriented Architecture
- Interoperability
- Portability
- Building on existing components in a lightweight manner: AliEn, LCG, Condor, Globus, SRM, ...
- Web Services
- Modularity
- Scalability
9 Web Services
- Principles
- (Almost) every service has a public Web Service interface described by a WSDL file
- WS-I compliance (getting there)
- Auto-generated clients in many languages
- Uniform programming model
- Every service adheres to the same security model
- Not WS-Security yet, because the supporting tooling is not mature enough
- Transport-level security (HTTPS, PKI, GSI)
- Additional attributes in the certificate (VOMS)
- Modularity
- Easy to build new services that make use of existing ones
- Easy to replace services by custom ones if the WSDL is identical
- Federation of services as opposed to a monolithic stack
10 gLite Middleware Services
[Diagram: gLite middleware services, with the available gLite implementation marked]
- Access: API, CLI
- Security Services: Authorization, Auditing, Authentication
- Information & Monitoring Services: Information & Monitoring, Application Monitoring, Service Discovery
- Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
- Workload Mgmt Services: Job Provenance, Package Manager, Accounting, Computing Element, Workload Management
- Connectivity
11 Security Concepts
- Current issues with LCG
- No service-level security (any service may be used by anyone)
- Insecure CE
- Insecure storage (Castor)
- No fine-grained authorization for files
- No possibility for VOs to assign capabilities to their members (e.g. groups)
- No possibility to apply and enforce VO policies (preferred users)
- No possibility to distinguish between VOs and apply inter-VO policies (which VO is preferred?)
- gLite additions and improvements
- VO Management Service (VOMS) for proxy management
- Use voms-proxy-init instead of grid-proxy-init
- File Authorization Service for fine-grained file security semantics
- Services may act on VOMS roles and groups
- Users need to authenticate with all services
- Users may delegate their certificate to a service
- More consistent usage of LCAS/LCMAPS
12 WMS Concepts
- Scalability problems with LCG
- Necessitates updated information from all sites at all times
- Push model implies that all the information on which the Resource Broker bases its decision to schedule a job at a given site is up to date
- Delays may cause the information to be stale, and jobs may be sent to unsuitable sites
- gLite: adapt concepts from AliEn
- Introduce a pull model where the CE notifies the Resource Broker of its ability to run jobs
- Provide a Task Queue that may be reordered based on policies
- Have a local proxy for data (sandboxes)
- gLite: introduce new functionality
- Interactive jobs
- DAG support
- Accounting
- Monitoring
- Security on CE level
- Slottable: allow VO-managed CEs to run on the head node
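The pull model and the reorderable Task Queue can be illustrated with a small sketch. This is not gLite code: the class names `Task` and `TaskQueue`, the numeric priority, and the rule that a CE only accepts tasks from the VOs it supports are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                        # lower value is served first (assumed policy)
    job_id: str = field(compare=False)
    vo: str = field(compare=False)


class TaskQueue:
    """Hypothetical queue that holds tasks until a CE asks for work."""

    def __init__(self):
        self._heap = []

    def submit(self, task):
        heapq.heappush(self._heap, task)

    def reorder(self, policy):
        """Recompute every priority with a new policy function and rebuild the queue."""
        self._heap = [Task(policy(t), t.job_id, t.vo) for t in self._heap]
        heapq.heapify(self._heap)

    def pull(self, supported_vos):
        """Called by a CE with a free slot; returns the best matching task or None."""
        skipped = []
        found = None
        while self._heap:
            task = heapq.heappop(self._heap)
            if task.vo in supported_vos:
                found = task
                break
            skipped.append(task)
        # Tasks the CE could not take go back into the queue unchanged.
        for task in skipped:
            heapq.heappush(self._heap, task)
        return found


queue = TaskQueue()
queue.submit(Task(2, "job-a", "cms"))
queue.submit(Task(1, "job-b", "atlas"))
next_task = queue.pull({"cms"})   # a CE serving only the cms VO asks for work
```

The point of the design is that the scheduler never needs a fresh global view of all sites: a task leaves the queue only when a CE declares that it can run something.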
13 DM Concepts
- Known problems with LCG (2004)
- RLS: scalability, single central instance (RLIs were never deployed)
- RLS/RMC duality with poor performance, no bulk operations
- No filesystem hierarchy (flat)
- Hardlink issue (LFN to GUID is an N:1 relation)
- No replica management service
- No secure data access and no coherent security
- gLite: learn from AliEn
- Hierarchical File Catalog and Metadata
- File Transfer service with queuing
- Shell-like view of the system
- gLite: added functionality
- File catalog with hierarchy, ACL control and metadata
- POSIX-like I/O for direct file access
- Bulk operations everywhere for performance
- Catalogs built to be distributed (no single central instance necessary)
- However, we were told not to introduce distributed catalogs in gLite 1
- Security concept completely new: everything is secured homogeneously
- File Transfer and File Placement service; a high-level Data Scheduler is also foreseen
- Meanwhile, LCG/SA1 has
- Evolved the DM stack themselves.
- Introduction of
- GFAL
- LFC
- lcg-utils
14 Information System Concepts
- Issues with LCG (2004)
- Information system scalability and stability
- Glue Schema deficiencies
- gLite: R-GMA
- Evolve from EDG
- Monitoring and information provisioning
- Distributed messaging infrastructure
- gLite: Service Discovery
- Instead of a schema, standardize on an API
- Back-end can be anything (LDAP/BDII, R-GMA, plain file, ...)
- Meanwhile: LCG/SA1 improvements
- BDII scalability and stability
- Glue Schema extensions
15 Processes
- Experience from EDG and other projects was
- Software development processes were insufficient
- Release cycle was too long
- Bug fixing took too long
- Software quality assurance was weak
- Installation and configuration were very difficult and cumbersome
- gLite: introduce the necessary processes
- Developers' guidelines for homogeneous code development
- Automatic, homogeneous build system
- Integration and deployment modules for easy configuration
- Dependency management
- Still a lot of room for improvement!
- Meanwhile, LCG introduced
- YAIM
- Savannah
16 Chapter Security
- SECURITY
- Difference to LCG
- VOMS
- Service Security Model (overview)
17 Security: Differences to LCG
- VO Management Service
- Each person has to be a member of a VO
- Services are categorized into Site and VO services
- VO services will only accept certificates that are signed by the VOMS
- Computing Element Authz
- Secure scheduling and user mapping through LCMAPS
- MyProxy renewal for long-running jobs
- Renewal of VOMS attributes
- Fine-Grained File Authorization
- The File Catalog contains access control bits for each file
- This is enforced through the Grid services (see the Data Management talk later for the details)
- Secure Info System
18 VOMS
- VO Membership / Management Service
- Every VO needs to have one
- Extends the proxy certificate with attributes (X.509 allows for optional attributes in the certificates)
- Specifying the VO
- The roles/groups the user requests and is actually a member of
- Administrator
- Reader
- Writer
- Etc.
- Signs the VOMS attributes
- The signature itself has a lifetime of 12 hours
- MyProxy does not extend this, and does not even return VOMS attributes, so after a MyProxy retrieval VOMS needs to be contacted again (done automatically by the WMS)
- Attribute format is described in http://cern.ch/edg-wp2/security/voms/edg-voms-credential.pdf
19 VOMS Admin
- Web interface to administer VOMS
- Every user needs to sign up through this website
- The certificate needs to be loaded into the browser
- Group memberships can be managed through this interface by the administrators
20 Delegation
- Services need to perform tasks on behalf of the user
- Contact other services AS the user
- Need to delegate rights/proxy of the user to the given service
- WMS
- Transfer services
- glite-IO
- Currently available delegation mechanisms are insufficient
- Globus CoG httpg
- JRA3: Web Service method to do the delegation step
- Also allow for restricted delegation: allow only certain operations to be performed on the user's behalf
21 Security Model
- Some Services are managed by the Site
- Resource services
- Local batch system
- Storage Element
- Transfer Channel
- Mapping of users into local accounts
- Usage of LCAS/LCMAPS
- VOMS-generated maps for each VO
- Site allow/deny lists managed through LCMAPS
- Apply Site policies: VO granularity (fair share/configurable share among VOs)
- Some Services are managed by the VO
- VO services
- CE
- File Placement
- Catalogs
- Apply VO policies based on VOMS groups and roles
- More on security in Data Management lecture
22 Chapter WMS
- WORKLOAD MANAGEMENT
- Difference to LCG
- Some internals
- The new and old commands
- DAGs
- Interactive Jobs
23 LCG - Architecture Overview
[Diagram labels: Resource Broker Node (Workload Manager, WM); Job status; Storage Element]
24 gLite - Architecture Overview
[Diagram labels: Resource Broker Node (Workload Manager, WM); Job status; Storage Element]
25 Architecture
26 Architecture (2)
Job management requests (submission, cancellation) expressed via a Job Description Language (JDL)
27 Architecture (3)
Keeps submission requests (tasks). Requests are kept for a while if no matching resources are available.
28 Architecture (4)
Repository of resource information available to the Matchmaker. Updated via notifications and/or active polling on resources.
29 Architecture (5)
Finds an appropriate Computing Element for each submission request.
30 Architecture (6)
Performs the actual job submission and monitoring
31 Possible Job States
[Figure: job states]
32 Job Submission Command Line Interface
- glite-job-submit -r <res_id> -c <config file> --vo <VO> -o <output file> <job.jdl>
- -r: the job is submitted directly to the computing element identified by <res_id>
- -c: the configuration file <config file> is used by the UI instead of the standard configuration file
- --vo: the Virtual Organisation (if the user is not happy with the one specified in the UI configuration file)
- -o: the generated edg_jobId is written to the <output file>
- Useful for other commands, e.g.
- glite-job-status -i <input file> (or jobId)
33 Job Resubmission
- If something goes wrong, the WMS tries to reschedule and resubmit the job (possibly on a different resource satisfying all the requirements)
- Maximum number of resubmissions: min(RetryCount, MaxRetryCount)
- RetryCount: JDL attribute
- MaxRetryCount: attribute in the RB configuration file
- One can disable job resubmission for a particular job: RetryCount=0 in the JDL file
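The resubmission limit can be sketched in a few lines. The rule min(RetryCount, MaxRetryCount) is taken from the slide; the function names and the assumption that one initial submission is followed by at most that many resubmissions are illustrative only, not the WMS implementation.

```python
from typing import Callable, Tuple


def resubmission_limit(retry_count: int, max_retry_count: int) -> int:
    """Maximum number of resubmissions: min(RetryCount, MaxRetryCount)."""
    return min(retry_count, max_retry_count)


def run_job(attempt: Callable[[int], bool],
            retry_count: int,
            max_retry_count: int) -> Tuple[bool, int]:
    """Run `attempt` until it succeeds or the resubmission limit is exhausted.

    `attempt` is a hypothetical callable that receives the attempt number and
    returns True on success. Returns (succeeded, number_of_attempts).
    """
    limit = max(0, resubmission_limit(retry_count, max_retry_count))
    attempts = 0
    # Assumption: one initial submission plus at most `limit` resubmissions.
    for attempt_no in range(1 + limit):
        attempts += 1
        if attempt(attempt_no):
            return True, attempts
    return False, attempts
```

With RetryCount set to 0 in the JDL the loop runs exactly once, which matches the slide's statement that resubmission is disabled for that job.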
34 Other (most relevant) UI commands
- glite-job-list-match
- Lists resources matching a job description
- Performs the matchmaking without submitting the job
- glite-job-cancel
- Cancels a given job
- glite-job-status
- Displays the status of the job
- glite-job-output
- Returns the job output (the OutputSandbox files) to the user
- glite-job-logging-info
- Displays logging information about submitted jobs (all the events pushed by the various components of the WMS)
- Very useful for debugging purposes
35 WMS Matchmaking
- The RB (Matchmaker) has to find the most suitable Computing Element (CE) where the job will be executed
- It interacts with data management services and Information Services
- They provide all the information required for the resolution of the matches
- The CE chosen by the RB has to match the job requirements (e.g. runtime environment, data access requirements, and so on)
- If FuzzyRank=False (default)
- If two or more CEs satisfy all the requirements, the one with the best Rank is chosen
- If there are two or more CEs with the same best rank, the choice among them is made at random
- If FuzzyRank=True in the JDL
- Fuzziness in CE choice: the CE with the highest rank has the highest probability of being chosen
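The two selection modes can be sketched as follows. The input is assumed to be a mapping from CE identifier to rank for the CEs that already satisfy the job requirements. The function name `choose_ce` is hypothetical, and the fuzzy weighting used here (rank shifted to be positive, then used as a sampling weight) is an assumption: the slide only states that a higher rank means a higher probability, not which formula the WMS uses.

```python
import random


def choose_ce(candidates, fuzzy_rank=False, rng=random):
    """Pick one CE from {ce_id: rank}; a higher rank is better."""
    if not candidates:
        raise ValueError("no CE satisfies the job requirements")

    if not fuzzy_rank:
        # FuzzyRank=False: take the best rank, break ties at random.
        best = max(candidates.values())
        tied = sorted(ce for ce, rank in candidates.items() if rank == best)
        return rng.choice(tied)

    # FuzzyRank=True: every CE can be chosen, better-ranked ones more often.
    # Assumed weighting: shift ranks so the lowest one gets weight 1.
    lowest = min(candidates.values())
    ces = sorted(candidates)
    weights = [candidates[ce] - lowest + 1 for ce in ces]
    return rng.choices(ces, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when experimenting with the two modes.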
36 WMS Matchmaking Scenarios
- Possible scenarios for matchmaking
- Direct job submission
- glite-job-submit -r <CEId>
- Corresponds to job submission with Globus clients (globus-job-submit)
- Job submission with computational requirements only
- Neither InputData nor OutputSE specified in the JDL
- Job submission with data access requirements
- InputData and/or OutputSE specified in the JDL
- Details will be given in the Data Management lecture
37 Example of Job Submission (1)
- User logs in on the UI (User Interface) machine
- User issues a voms-proxy-init, enters her certificate's password and gets a valid Grid proxy
- User sets up her JDL file
- Example of Hello World JDL file
- [
- Executable = "/bin/echo";
- Arguments = "Hello World";
- StdOutput = "Message.txt";
- StdError = "stderr.log";
- OutputSandbox = {"Message.txt","stderr.log"};
- ]
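A JDL file like the Hello World example is a bracketed list of `Name = value;` attributes in ClassAd style, so it is easy to generate from a script. The helper below is a minimal sketch under that assumption; `render_value` and `to_jdl` are hypothetical names, and only strings, integers, booleans and lists are handled.

```python
def render_value(value):
    """Render a Python value in ClassAd-style syntax (strings are quoted)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(render_value(v) for v in value) + "}"
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def to_jdl(attributes):
    """Turn a dict of attributes into the text of a JDL file."""
    body = [f"  {name} = {render_value(value)};" for name, value in attributes.items()]
    return "\n".join(["[", *body, "]"])


hello_world = to_jdl({
    "Executable": "/bin/echo",
    "Arguments": "Hello World",
    "StdOutput": "Message.txt",
    "StdError": "stderr.log",
    "OutputSandbox": ["Message.txt", "stderr.log"],
})
```

Writing `hello_world` to a file such as HelloWorld.jdl would give the input for the glite-job-submit command used in the walkthrough.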
38 Example of Job Submission (2)
- User issues a glite-job-submit HelloWorld.jdl
- and gets back a unique Job Identifier (JobId)
- User issues a glite-job-status JobId
- to get logging information about the current status of her job
- When the Output status is reached, the user can issue a glite-job-output JobId
- and the system returns the name of the temporary directory where the job output can be found on the UI machine.
39 Example of Job Submission (3)
- glite-job-submit HelloWorld.jdl
- Selected Virtual Organisation name (from --vo option): cms
- Connecting to host egee-rb-01.mi.infn.it, port 7772
- Logging to host egee-rb-01.mi.infn.it, port 9002
- JOB SUBMIT OUTCOME
- The job has been successfully submitted to the Network Server.
- Use glite-job-status command to check job current status. Your job identifier is:
- https://egee-rb-01.mi.infn.it:9000/LYVQ8JEjZVVqBgbbQw6KMw
40 Example of Job Submission (4)
- glite-job-status https://egee-rb-01.mi.infn.it:9000/LYVQ8JEjZVVqBgbbQw6KMw
- BOOKKEEPING INFORMATION
- Status info for the Job: https://egee-rb-01.mi.infn.it:9000/LYVQ8JEjZVVqBgbbQw6KMw
- Current Status: Done (Success)
- Exit code: 0
- Status Reason: Job terminated successfully
- Destination: ce3.egee.unile.it:2119/jobmanager-lcgpbs-cms
- Submitted: Mon Jul 4 11:18:55 2005 CEST
41 Example of Job Submission (5)
- glite-job-output --dir /tmp/mydir https://egee-rb-01.mi.infn.it:9000/LYVQ8JEjZVVqBgbbQw6KMw
- JOB OUTPUT OUTCOME
- Output sandbox files for the job
- https://egee-rb-01.mi.infn.it:9000/LYVQ8JEjZVVqBgbbQw6KMw
- have been successfully retrieved and stored in the directory
- /tmp/mydir/KoBA-IgxZyVpLKhANfrhHw
- more /tmp/mydir/KoBA-IgxZyVpLKhANfrhHw/Message.txt
- Hello World
- more /tmp/mydir/KoBA-IgxZyVpLKhANfrhHw/stderr.log
42 Job Dependencies
Condor's DAGMan allows for job dependencies
DAG: Directed Acyclic Graph
43 DAG JDL Structure
- JobType = "DAG"
- VirtualOrganisation = "yourVO"
- Max_Nodes_Running = int > 0
- MyProxyServer
- Requirements
- Rank
- InputSandbox
- OutSandbox
- Nodes: nodeX (more later)
- Dependencies (more later)
44 Attribute Nodes
- The Nodes attribute is the core of the DAG description
45 Attribute Dependencies
- It is a list of lists representing the dependencies between the nodes of the DAG. Mandatory: YES!
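How a list of lists encodes dependencies can be shown with a short sketch. It assumes that each inner list is a [parent, child] pair, meaning the child may only start after the parent has finished; that reading, the node names and the function name `execution_order` are illustrative assumptions and not the WMS or DAGMan implementation. The standard-library `graphlib` module (Python 3.9+) does the ordering.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(nodes, dependencies):
    """Return one valid run order for the DAG nodes.

    nodes:        iterable of node names
    dependencies: list of [parent, child] pairs (assumed encoding)
    Raises CycleError if the dependencies are not acyclic.
    """
    known = set(nodes)
    sorter = TopologicalSorter({node: set() for node in known})
    for parent, child in dependencies:
        if parent not in known or child not in known:
            raise ValueError(f"dependency refers to unknown node: {parent!r} -> {child!r}")
        sorter.add(child, parent)   # child depends on parent
    return list(sorter.static_order())


# Diamond-shaped example: nodeA first, nodeB and nodeC in any order, nodeD last.
order = execution_order(
    ["nodeA", "nodeB", "nodeC", "nodeD"],
    [["nodeA", "nodeB"], ["nodeA", "nodeC"], ["nodeB", "nodeD"], ["nodeC", "nodeD"]],
)
```

A cyclic list such as [["nodeA", "nodeB"], ["nodeB", "nodeA"]] raises `CycleError`, which is the reason the graph has to be acyclic: no valid run order exists otherwise.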
46 Interactive Jobs
- Specified by setting JobType = "Interactive" in the JDL
- When an interactive job is executed, a window for the stdin, stdout and stderr streams is opened
- Possibility to send the stdin to the job
- Possibility to have the stderr and stdout of the job while it is running
- Possibility to start a window for the standard streams of a previously submitted interactive job with the command glite-job-attach
47 Further information
- Workload Management
- http://egee-jra1-wm.mi.infn.it/egee-jra1-wm/
- In particular, the WMS User and Admin Guide and the JDL docs
- Condor ClassAd
- http://www.cs.wisc.edu/condor/classad
- Condor DAGMan
- http://www.cs.wisc.edu/condor/dagman/
48 Abbreviations
- JDL: Job Description Language
- RB: Resource Broker
- WMS: Workload Management System
- CE: Computing Element
- SE: Storage Element
- UI: User Interface
- Acknowledgements: slides authored by the following people were used
- Erwin Laure (CERN) and Heinz Stockinger (U of Vienna)
- Salvo Monforte, Marco Pappalardo, Valeria Ardizzone (INFN Catania)
49 More Information
- The EGEE Project
- http://www.eu-egee.org
- The LCG Project
- http://cern.ch/lcg
- The gLite middleware
- http://www.glite.org
- The Condor Project
- http://www.cs.wisc.edu/condor
- The Globus Project
- http://www.globus.org