Title: EGEE Middleware. Robin Middleton, with much (most) material from Bob Jones, Frederic Hemmer and Erwin Laure
1 EGEE Middleware
Robin Middleton, with much (most) material from Bob Jones, Frederic Hemmer and Erwin Laure
GridPP EB-TB Meeting, 13th May 2004
www.eu-egee.org
EGEE is a project funded by the European Union
under contract IST-2003-508833
2 Contents
- Introduction
- EGEE structure, activities
- Middleware JRA1
- Organisation
- Design Team
- Service Oriented Architecture
- Initial Components
- EGEE Middleware Prototype
- Integration, Testing & SCM
- JRA1 External Links
3 Introduction
4 Overview
- 70 partners (funded), many unfunded contributions
- 11 Federations
- 32M EU funds → 60M in total
- 2 years (initially)
- Started 1st April 2004
- The EGEE Vision
- To deliver production-level Grid services, the essential elements of which are manageability, robustness, resilience to failure, and a consistent security model, as well as the scalability needed to rapidly absorb new resources as these become available, while ensuring the long-term viability of the infrastructure.
- To carry out a professional Grid middleware re-engineering activity in support of the production services. This will support and continuously upgrade a suite of software tools capable of providing production-level Grid services to a base of users which is anticipated to rapidly grow and diversify.
- To ensure an outreach and training effort which can proactively market Grid services to new research communities in academia and industry, capture new e-Science requirements for the middleware and service activities, and provide the necessary education to enable new users to benefit from the Grid infrastructure.
5 EGEE Implementation
- From day 1 (1st April 2004)
- Production grid service based on the LCG infrastructure, running LCG-2 grid middleware (SA)
- LCG-2 will be maintained until the new generation has proven itself (fallback solution)
- In parallel, develop a next-generation grid facility (JRA)
- Produce a new set of grid services according to evolving standards (Web Services)
- Run a development service providing early access for evaluation purposes
- Will replace LCG-2 on the production facility in 2005
6 EGEE Activities
7 Orientation
[Diagram: equivalent EDG Work Packages/Groups (WP1-12, QAG, Security Group) mapped onto the EGEE activities listed below]
- EGEE includes 11 activities
- Services
- SA1 Grid Operations, Support and Management
- SA2 Network Resource Provision
- Joint Research
- JRA1 Middleware Engineering and Integration
- JRA2 Quality Assurance
- JRA3 Security
- JRA4 Network Services Development
- Networking
- NA1 Management
- NA2 Dissemination and Outreach
- NA3 User Training and Education
- NA4 Application Identification and Support
- NA5 Policy and International Cooperation
8 Services Activities
- SA1 Grid Operations, Support and Management
- Objectives: create and operate a production-quality infrastructure
- 48 partners, approx. 45% of the total project budget
- Regional structure
- Builds on the existing LCG infrastructure to provide an expanded grid facility for many application domains
- SA2 Network Resource Provision
- Objectives: ensure EGEE access to network services provided by GEANT and the NRENs to link users, resources and operational management
- 3 partners, approx. 1.5% of the total project budget
- Most work will be associated with defining SLRs/SLAs
9 Joint Research Activities
- JRA1 Middleware Engineering and Integration
- Objectives
- Provide robust, supportable middleware components
- Integrate grid services to provide a consistent functional basis for the EGEE grid infrastructure
- Verify that the middleware forms a dependable and scalable infrastructure that meets the needs of a large, diverse eScience user community
- 5 partners, approx. 16% of the total project budget
- Middleware design team active
- Core software team has been working quickly to produce the design of an initial prototype
- Taking input from the HEP ARDA project as well as final requirements/assessments from the EDG project
- Initial prototype foreseen at end of April
- Not all services implemented; not for general distribution
- EDG testbed infrastructure being reused for JRA1 clusters
10 Joint Research Activities (II)
- JRA4 Network Services Development
- Objectives: network-oriented joint research to provide end-to-end services
- Network reservation, performance monitoring and diagnostics tools
- Explore links to how Grid resources are organised/allocated
- Investigation of the potential impact of IPv6 on grids
- 5 partners, approx. 2.5% of the total project budget
- Tight collaboration with DANTE and the NRENs, especially through the future GN2 project and potential network-oriented FP6 projects
- JRA3 Security
- Objectives: enable secure European Grid infrastructure operation
- Overall security architecture and framework
- Policies to be adopted by other EGEE activities (middleware, operations etc.)
- 5 partners, approx. 3% of the total project budget
- JRA2 Quality Assurance
- Objectives: foster production and delivery of quality Grid software and operations
- 2 partners, approx. 2% of the total project budget
- Many procedures and guidelines already defined
11 Networking Activities
- NA4 Application Identification and Support
- Objectives: identify and support a broad range of applications from diverse domains, starting with the pilot domains HEP and Biomedical
- 20 partners, approx. 12.5% of the total project budget
- The ARDA project is the interface with the HEP applications
- Initial BMI applications identified
- Industrial forum set up in a self-financing mode
- NA3 User Training and Induction
- Objectives: develop a training programme addressing beginners and advanced users, plus internal EGEE induction courses
- 22 partners, approx. 4% of the total project budget
- Plans for initial training courses well advanced
- Will be able to offer training in the summer on a dedicated infrastructure
- NA2 Dissemination and Outreach
- Objectives: disseminate the benefits of the EGEE infrastructure to new user communities
- 20 partners, approx. 5% of the total project budget
12 Who's who
13 JRA1: Middleware (Re-)Engineering & Integration
14 JRA1 Organisation
15 Software Clusters
- Tools, Testing & Integration (CERN) clusters
- Development clusters
- UK
- CERN
- IT/CZ
- Nordic
- Clusters have a reasonably sized (distributed) development testbed
- Taken over from EDG
- Nordic cluster to be finalized
- Link with Integration & Tools clusters established
- Clusters up and running!
- Nordic (security) cluster → JRA3
16 Design Team
- Formed in December 2003
- Current members
- UK: Steve Fisher
- IT/CZ: Francesco Prelz
- Nordic: David Groep
- VDT: Miron Livny
- CERN: Predrag Buncic, Peter Kunszt, Frederic Hemmer, Erwin Laure
- Started service design based on the component breakdown defined by the LCG ARDA RTAG
- Leverage experiences and existing components from AliEn, VDT, and EDG
- A working document
- Overall design & APIs
- https://edms.cern.ch/document/458972
- Basis for the architecture (DJRA1.1) and design (DJRA1.2) documents
17 Guiding Principles
- Lightweight (existing) services
- Easily and quickly deployable
- Interoperability
- Allow for multiple implementations
- Resilience and Fault Tolerance
- Co-existence with deployed infrastructure
- Run as an application
- Service-oriented approach
- Follow WSRF standardization
- No mature WSRF implementations exist to date, hence start with plain WS; WSRF compliance is not an immediate goal
- Review the situation at end 2004
18 High Level Service Decomposition
- Taken from the ARDA blueprint
- Some services have no clear attribution to a cluster (according to the TA)
- Some services involve collaboration of multiple clusters
19 Initial Focus
- Data management
- Storage Element
- SRM-based; allows POSIX-like access
- Workload management
- Computing Element
- Allow pull and push mode
- Information and monitoring
- Security
- Need to integrate components with quite different security models
- Start with a minimalist approach based on VOMS and myProxy
20 Storage Element
- Strategic SE
- High QoS: reliable, safe, ...
- Usually has an MSS
- Place to keep important data
- Needs people to keep it running
- Heavyweight
- Tactical SE
- Volatile, lightweight space
- Enables sites to participate in an opportunistic manner
- Lower QoS
21 Storage Element Interfaces
- SRM interface
- Management and control
- SRM (with possible evolution)
- POSIX-like File I/O
- File Access
- Open, read, write
- Not real POSIX (similar to rfio)
[Diagram: the user sees a management interface (SRM) and a POSIX-like File I/O API, layered over access protocols such as rfio, dcap, chirp and aio, in front of storage systems such as dCache, NeST, Castor and plain disk]
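To make the layering concrete, here is a minimal sketch of a POSIX-like file I/O facade sitting above a storage back-end. The names (`SEBackend`, `GridFile`) are illustrative only, not the EGEE API, and the back-end is an in-memory stand-in for a protocol driver such as rfio or dcap:

```python
# Hypothetical sketch: POSIX-style open/read/write/seek/close over an SE
# back-end, without full POSIX semantics (as with rfio).

class SEBackend:
    """Stand-in for a protocol driver such as rfio or dcap (in-memory here)."""
    def __init__(self):
        self.store = {}
    def get(self, lfn):
        return self.store.get(lfn, b"")
    def put(self, lfn, data):
        self.store[lfn] = data

class GridFile:
    """A file handle in the POSIX style: buffered locally, flushed on close."""
    def __init__(self, backend, lfn):
        self.backend, self.lfn = backend, lfn
        self.buf = bytearray(backend.get(lfn))
        self.pos = 0
    def read(self, n=-1):
        data = bytes(self.buf[self.pos:] if n < 0 else self.buf[self.pos:self.pos + n])
        self.pos += len(data)
        return data
    def write(self, data):
        self.buf[self.pos:self.pos + len(data)] = data
        self.pos += len(data)
        return len(data)
    def seek(self, pos):
        self.pos = pos
    def close(self):
        # flush the buffer back through the SE back-end
        self.backend.put(self.lfn, bytes(self.buf))

se = SEBackend()
f = GridFile(se, "lfn:/grid/demo.txt")
f.write(b"hello grid")
f.close()
```

A real implementation would of course stream rather than buffer whole files; the point is only the shape of the interface presented to applications.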
22 Catalogs
- File Catalog
- Filesystem-like view on logical file names
- Replica Catalog
- Keep track of replicas of the same file
- (Meta Data Catalog)
- Attributes of files on the logical level
- Boundary between generic middleware and
application layer
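The split between the three catalogs can be sketched as three mappings: logical file name to file identifier, identifier to replica locations, and identifier to application-level attributes. The class and field names below are hypothetical, chosen only to illustrate the boundaries:

```python
# Illustrative sketch of the file/replica/metadata catalog split.
import uuid

class FileCatalog:
    """Filesystem-like view: logical file name (LFN) -> file identifier."""
    def __init__(self):
        self.lfns = {}
    def register(self, lfn):
        guid = str(uuid.uuid4())
        self.lfns[lfn] = guid
        return guid
    def lookup(self, lfn):
        return self.lfns[lfn]

class ReplicaCatalog:
    """Tracks replicas of the same file: identifier -> storage URLs."""
    def __init__(self):
        self.replicas = {}
    def add_replica(self, guid, surl):
        self.replicas.setdefault(guid, []).append(surl)
    def list_replicas(self, guid):
        return self.replicas.get(guid, [])

class MetadataCatalog:
    """Application-level attributes on the logical file: the middleware
    stores them but does not interpret them (the application-layer boundary)."""
    def __init__(self):
        self.attrs = {}
    def set(self, guid, key, value):
        self.attrs.setdefault(guid, {})[key] = value
```

Keeping the replica catalog keyed by identifier rather than by name is what lets a file be renamed in the logical view without touching any replica records.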
23 Files and Catalogs Scenario
24 Computing Element
- Layered service interfacing
- various batch systems (LSF, PBS, Condor)
- Grid systems like GT2, GT3, and Unicore
- CondorG as queuing system on the CE
- Allows the CE to be used in push and pull mode
- Call-out module to change job ownership (security)
- Lightweight service
- Should be possible to dynamically install, e.g. within an existing Globus gatekeeper
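The push/pull distinction above can be sketched in a few lines: in push mode a broker hands a job to the CE, while in pull mode the CE fetches work from a queue when it has capacity. `TaskQueue` and `ComputingElement` here are illustrative stand-ins, not the CondorG interfaces:

```python
# Hedged sketch of push vs. pull job dispatch on a CE.
from collections import deque

class TaskQueue:
    def __init__(self):
        self.jobs = deque()
    def submit(self, job):          # push mode: broker places work here
        self.jobs.append(job)
    def fetch(self):                # pull mode: CE asks for work
        return self.jobs.popleft() if self.jobs else None

class ComputingElement:
    def __init__(self):
        self.done = []
    def run(self, job):             # stands in for handing off to LSF/PBS/Condor
        self.done.append(job)
    def pull_from(self, queue):     # the "JobFetcher" behaviour
        job = queue.fetch()
        if job is not None:
            self.run(job)
        return job

tq = TaskQueue()
ce = ComputingElement()
tq.submit("job-1")                  # broker pushes into the queue
ce.pull_from(tq)                    # CE pulls when it has free slots
```

The value of supporting both modes is that the same CE can serve a central broker (push) and opportunistic, self-scheduling sites (pull).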
25 Information Service
- Adopt a common approach to the information and monitoring infrastructure
- There may be a need for specialised information services
- e.g. accounting, package management, grid information, monitoring, provenance, logging
- These may be built on an underlying information service
- A range of visualisation tools may be used
- Using R-GMA
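R-GMA presents information and monitoring data as a virtual relational database, with producers publishing tuples into named tables and consumers querying them. The toy in-memory registry below only illustrates that producer/consumer pattern; it is not the R-GMA API:

```python
# Toy sketch of the producer/consumer pattern behind a relational
# information service; Registry/Producer/Consumer are illustrative names.

class Registry:
    """Shared view of the published tables (in-memory stand-in)."""
    def __init__(self):
        self.tables = {}

class Producer:
    def __init__(self, registry, table):
        self.rows = registry.tables.setdefault(table, [])
    def insert(self, row):
        self.rows.append(row)

class Consumer:
    def __init__(self, registry, table):
        self.rows = registry.tables.get(table, [])
    def query(self, predicate):
        # a real system would accept SQL-like selects; a predicate
        # function plays that role here
        return [r for r in self.rows if predicate(r)]

reg = Registry()
p = Producer(reg, "ServiceStatus")
p.insert({"service": "SE-CERN", "status": "up"})
p.insert({"service": "CE-RAL", "status": "down"})
c = Consumer(reg, "ServiceStatus")
down = c.query(lambda r: r["status"] == "down")
```

The appeal of the relational model for the specialised services listed above (accounting, logging, provenance) is that each becomes just another set of tables over the same infrastructure.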
26 Authentication/Authorization
- Different models and mechanisms
- Authentication based on Globus/GSI, AFS, SSH, X509, tokens
- Authorization
- AliEn exploits the mechanisms of its RDBMS backend
- EDG: gridmap file, VOMS credentials and LCAS/LCMAPS
- VDT: gridmap file, CAS, VOMS (client)
- Security and protection at a level acceptable to fabric managers and end users needs to be discussed and blessed in advance
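The gridmap file that EDG and VDT both rely on is a plain-text mapping from a quoted certificate DN to a local account (or, with a leading dot, a pool/group account). A minimal parser shows the format; the DNs below are invented examples:

```python
# Minimal parser for the gridmap-file format: one quoted DN and one
# local account (or ".group" pool account) per line; "#" starts a comment.
import shlex

def parse_gridmap(text):
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        dn, account = shlex.split(line)   # shlex handles the quoted DN
        mapping[dn] = account
    return mapping

gridmap = '''
# DN -> local account (example entries)
"/C=CH/O=CERN/OU=IT/CN=Jane Grid" .atlas
"/C=UK/O=eScience/CN=John Doe" jdoe
'''
accounts = parse_gridmap(gridmap)
```

The flat-file approach is exactly what VOMS/LCAS/LCMAPS were introduced to improve on: the mapping carries no VO role or group information and has to be redistributed whenever membership changes.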
27 A minimalist approach to security
- Need to integrate components with quite different security models
- Start with a minimalist approach
- Based on VOMS (proxy issuing) and myProxy (proxy store)
- User stores a proxy in myProxy, from where it can be retrieved by access services and sent to other services
- Credential chain needs to be preserved
- Allows a service to authenticate the client
- Local authorization could be done via LCAS if required
- User is mapped to group accounts, or components like LCMAPS are used to assign a local user identity
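The flow described above can be sketched end to end: VOMS issues a proxy carrying VO attributes, the user stores it in a proxy store (the myProxy role), and an access service retrieves it and extends the credential chain when acting on the user's behalf. All classes are conceptual stand-ins; real proxies are X.509 credential chains, not dictionaries:

```python
# Conceptual sketch only: the minimalist VOMS + myProxy flow.

class VomsServer:
    """Issues a proxy carrying the user's VO attributes."""
    def issue_proxy(self, user, vo):
        return {"subject": user, "vo": vo, "chain": [user]}

class ProxyStore:
    """Plays the myProxy role: store and retrieve proxies by passphrase."""
    def __init__(self):
        self.proxies = {}
    def store(self, user, proxy, passphrase):
        self.proxies[(user, passphrase)] = proxy
    def retrieve(self, user, passphrase):
        return self.proxies.get((user, passphrase))

class AccessService:
    """Retrieves the proxy and delegates to further services."""
    def act_for(self, proxy, service):
        # extend (never replace) the chain, so the end service can still
        # authenticate the original client -- the chain is preserved
        return dict(proxy, chain=proxy["chain"] + [service])

voms = VomsServer()
store = ProxyStore()
proxy = voms.issue_proxy("/CN=Jane Grid", "atlas")
store.store("/CN=Jane Grid", proxy, "secret")
svc = AccessService()
delegated = svc.act_for(store.retrieve("/CN=Jane Grid", "secret"), "file-catalog")
```

The key invariant is the preserved chain: every downstream service can trace the request back to the original user identity, which is what makes the group-account mapping (LCAS/LCMAPS) auditable.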
28 Towards a prototype
- Focus on the key services discussed; exploit existing components
- Initially an ad-hoc installation at CERN and Wisconsin
- Aim to have the first instance ready by end of April
- Open only to a small user community
- Expect frequent changes (also API changes) based on user feedback and integration of further services
- Enter a rapid feedback cycle
- Continue with the design of the remaining services
- Enrich/harden existing services based on early user feedback
This is not a release! It's purely an ad-hoc installation.
29 Planning
- Evolution of the prototype
- Envisaged status at end of 2004
- Key services need to fulfil all requirements (application, operation, quality, security, ...) and form a deployable release
- Remaining services available as prototypes
- Need to develop a roadmap
- Incremental changes to the prototype (where possible)
- Early user feedback through ARDA and early deployment on the SA1 pre-production service
- Detailed release plan being drawn up
- Converge prototype work with integration & testing activities
- Need to get rolling now!
- First components will start using SCM in May
30 Integration
- A master Software Configuration Plan is being finalized now
- It contains basic principles and rules about the various areas of SCM and Integration (version control, release management, build systems, bug tracking, etc.)
- Compliant with internationally agreed standards (ISO 10007-2003 E, IEEE SCM Guidelines series)
- Most EGEE stakeholders have already been involved in the process, to make sure everybody is aware of, contributes to and uses the plan
- An EGEE JRA1 Developer's Guide will follow shortly, in collaboration with JRA2 (Quality Assurance), based on the SCM Plan
- It is of paramount importance to deliver the plan and guide as early as possible in the project lifetime
31 Testing
- The 3 initial testing sites are CERN, NIKHEF and RAL
- More sites can join the testing activity at a later stage
- Must fulfil site requirements
- Testing activities will be driven by the test plan document
- Test plan being developed based on user requirements documents
- Application requirements from NA4: HEPCAL III, AWG documents, bio-informatics requirements documents from EDG
- Deployment requirements being discussed with SA1
- ARDA working document for core Grid services
- Security: work with JRA3 to design and plan security testing
- The test plan is a living document; it will evolve to remain consistent with the evolution of the software
- Coordination with NA4 testing and external groups (e.g. Globus) established
- Solid steps towards MJRA1.3 (PM5)
32 Convergence with Integration & Testing
- Development clusters need to get used to SCM
- During May, the initial components of the prototype need to follow SCM
- Proposed components
- R-GMA
- VOMS
- RLS
- GFAL (is this 3rd party?)
- SRM (will there be an EGEE implementation, or just 3rd party?)
- New developments need to follow SCM from the beginning
- ISSUE: Perl modules seem not to fit well
33 Convergence with Integration & Testing II
- IT/CZ
- Put EDG code under SCM for training purposes, and be prepared to move components to EGEE when needed
- VOMS in May
- UK
- Full R-GMA under SCM in May
- CERN/DM
- RLS in May
- GFAL ?
34 Development Roadmap
- Prototype work as starting point
- Priorities need to be adjusted based on user feedback
- Incremental, frequent releases
- All discussions and decisions take place in the design team
- Project-wide body being formed to oversee this activity
- PTF: Project Technical Forum
- Boundary conditions
- Architecture document due end of Month 3 (June)
- Design document due end of Month 5 (August)
35 JRA1/SA1 - Process description
- No official delivery of requirements from SA1 to JRA1 is stated in the TA
- The definition, discussion and agreement of the requirements has already started, through dedicated meetings
- This is an ongoing process
- Not all the requirements are defined yet
- A set of requirements has been agreed; a basic agreement is needed to start working! But it can be reviewed at any time there is a valid reason for it
36 JRA1/SA1 - Requirements
- Middleware delivery to SA1
- Release management
- Deployment scenarios
- Middleware configuration
- JRA1 will provide a standard set of configuration files and documentation with examples that SA1 can use to design tools. The format is to be agreed between SA1 and JRA1
- It is the responsibility of SA1 to provide configuration tools to the sites
- Enforcement of the procedures
- Platforms to support
- Primary platform: Red Hat Enterprise 3.0, gcc 3.2.3 and icc8 compilers (both 32- and 64-bit)
- Secondary platform: Windows (XP/2003), vc 7.1 compiler (both 32- and 64-bit)
- Versions for compilers, libraries, third-party software
- Programming languages
- Packaging and software distribution
- Others
- Sites must be allowed to organize the network as they wish: internal or external connectivity, NAT, firewall, etc. must all be possible, with no special constraints. WNs must not require outgoing IP connectivity, nor inbound connectivity.
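Since the configuration-file format was still to be agreed between SA1 and JRA1, here is only a hypothetical illustration of what a JRA1-supplied standard configuration plus documented example might look like; every section and key name below is invented:

```python
# Illustrative sketch only: a possible shape for a standard middleware
# configuration file that SA1 site tools could parse and rewrite.
import configparser
import io

EXAMPLE_CONF = """
[site]
name = example-site
# WNs need neither outgoing nor inbound IP connectivity
wn_outbound = false
wn_inbound = false

[platform]
os = RHEL-3.0
compilers = gcc-3.2.3, icc8
"""

conf = configparser.ConfigParser()
conf.read_file(io.StringIO(EXAMPLE_CONF))
```

A machine-readable format of this kind is what makes the split of responsibilities workable: JRA1 ships the files and documentation, and SA1 builds the site-configuration tools on top.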
37 JRA1/JRA3
- A lot of progress has been achieved here
- Security Group formed, JRA1 members identified
- First meeting scheduled for May 5-6, 2004
- GAP analysis planned by then
- VOMS administration support clarified
- Handled by JRA3
- Issue: VOMS effort reporting
38 JRA1/JRA4
- SCM plan presented and discussed
- More discussions on which components of JRA4 will
be required in the overall architecture/design
need to take place
39 American Involvement in JRA1
- UWisc
- Miron Livny part of the Design Team
- Condor Team actively involved in re-engineering resource access
- In collaboration with the Italian cluster
- ISI
- Identification of potential contributions started (e.g. RLS)
- Focused discussions being planned
- Argonne
- Collaboration on testing started
- Support for key Globus components and enhancements being discussed
40 JRA1 and other activities
- NA4
- HEP: ARDA project started; ensures close relations between HEP and middleware
- Bio: activities with a similar spirit needed; a focused meeting is tentatively being planned for May
- SA1
- Revision of requirements (platforms)
- JRA2
- QAG started
- Monthly meeting established
- JRA3
- Necessary structures established
- Focused meeting in May
- JRA4
- Architectural components required need to be clarified
- Other projects
- Potential drain of resources for dissemination activities
41 The End
43 UK Cluster
- R-GMA
- Interface to various graphical tools
- Monitoring largely driven by application and infrastructure needs
- Information system: clarify role in job-submission/data-management cycles (e.g. role of GLUE)
- Interface to other monitoring systems (e.g. Grid3)
- Understand R-GMA's role in
- Accounting
- Job provenance
- Logging & bookkeeping
- ...
44 IT/CZ Cluster
- Resource Access (aka CE)
- Interface to various batch systems
- Starting with the integration of CondorG
- JobFetcher (implementing pull model)
- Site policy mgmt, enforcement, and advertisement
- WMS
- High level optimizer components at TaskQueue
- Matchmaking
- Job adjustment
- VO policy management and enforcement
- TaskQueue interactions
45 IT/CZ Cluster II
- Accounting
- The LCG accounting system (usage records) has to be considered
- Role of DGAS needs to be understood
- LB
- Assessment of its role in
- Accounting
- Job provenance
- ...
- Relationship to R-GMA
- VOMS
- Relationship to JRA3
- Integration into the Access Service
46 CERN/DM Cluster
- SE
- POSIX-like file I/O
- GFAL/aio relationship
- SRM interface
- Will EGEE provide an implementation? Will we ship an implementation, or just make it a requirement?
- Space reservation not in v1.1; migration path to v2.1?
- File Catalog
- Schema evolution/customization for different user groups
- Server implementation
- Metadata catalog interaction
47 CERN/DM Cluster II
- Replica Catalog
- Deployment model (wrt the File Catalog)
- Schema evolution
- Distributed catalog
- Metadata Catalog
- Mostly in the application domain
- File Transfer Service
- Overlaps with WMS/CE
- Local and global (VO) policy enforcement
- Error handling and recovery; transaction handling and boundaries; load-balancing and fail-over modes
- Upgrade resilience
- Data subscription service
- GDMP functionality
- How does it relate to the FTS?
- Integration into the Access Service