Title: The EGEE Project Status
1 The EGEE Project Status
- Ian Bird
- EGEE Operations Manager
- CERN
- Geneva, Switzerland
ISGC, Taipei, 27th April 2005
2 Contents
- The EGEE Project
- Overview and Structure
- Grid Operations
- Middleware
- Networking Activities
- Applications
- HEP
- Biomedical
- Summary
3 EGEE goals
- Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day
- The project concentrates on
- building a consistent, robust and secure Grid network that will attract additional computing resources
- continuously improving and maintaining the middleware in order to deliver a reliable service to users
- attracting new users from industry as well as science and ensuring they receive the high standard of training and support they need
4 EGEE
- EGEE is the largest Grid infrastructure project in Europe
- 70 leading institutions in 27 countries, federated in regional Grids
- Leveraging national and regional grid activities
- 32 M Euros EU funding initially for 2 years, starting 1st April 2004
- EU review, February 2005: successful
- Preparing 2nd phase of the project: proposal to EU Grid call, September 2005
- Promoting scientific partnership outside the EU
5 EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
6 EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
7 Computing Resources, April 2005
This greatly exceeds the project expectations for the number of sites. It shows that the main issue of complexity is the number of sites.
8 SA1 Operations Structure
- Operations Management Centre (OMC)
- At CERN: coordination etc.
- Core Infrastructure Centres (CIC)
- Manage daily grid operations: oversight, troubleshooting
- Run essential infrastructure services
- Provide 2nd level support to ROCs
- UK/I, Fr, It, CERN, Russia (M12)
- Taipei will also run a CIC
- Regional Operations Centres (ROC)
- Act as front-line support for user and operations issues
- Provide local knowledge and adaptations
- One in each region, many distributed
- User Support Centre (GGUS)
- At FZK: manage PTS, provide single point of contact (service desk)
- Not foreseen as such in the TA, but the need is clear
9 Grid Operations
- The grid is flat, but
- there is a hierarchy of responsibility
- essential to scale the operation
- CICs act as a single Operations Centre
- Operational oversight (grid operator) responsibility rotates weekly between CICs (see the rotation sketch below)
- Report problems to ROC/RC
- ROC is responsible for ensuring the problem is resolved
- ROC oversees regional RCs
- ROCs are responsible for organising the operations in a region
- Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
RC - Resource Centre; ROC - Regional Operations Centre; CIC - Core Infrastructure Centre
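The weekly rotation of the grid-operator role between CICs can be pictured with a small scheduling sketch. This is a minimal illustration in Python; the CIC list and the start date are assumptions for the example, not official project configuration.

```python
from datetime import date

# Illustrative only: participating CICs and the rotation start date are
# assumptions for this sketch, not EGEE operational configuration.
CICS = ["CERN", "UK/I", "France", "Italy", "Russia"]
ROTATION_START = date(2005, 1, 3)  # an arbitrary Monday used as week zero

def cic_on_duty(today: date) -> str:
    """Return the CIC acting as grid operator for the week containing `today`."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return CICS[weeks_elapsed % len(CICS)]

if __name__ == "__main__":
    print(cic_on_duty(date(2005, 4, 27)))  # CIC on duty during ISGC 2005
```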
10 Grid monitoring
- Operation of Production Service: real-time display of grid operations
- Accounting information
- Selection of monitoring tools:
- GIIS Monitor + Monitor Graphs
- Site Functional Tests (see the alarm sketch below)
- GOC Data Base
- Scheduled Downtimes
- Live Job Monitor
- GridIce: VO + fabric view
- Certificate Lifetime Monitor
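As a rough illustration of how Site Functional Test results could feed the monitoring-to-alarms automation mentioned on the next slide, the sketch below flags sites with repeated failures. The result format and the two-failure threshold are assumptions made for this example, not the actual SFT tooling.

```python
# Sketch of turning site functional test (SFT) results into alarms.
# The result structure and the failure threshold are assumptions for
# illustration, not the real SFT / CIC-on-duty machinery.
from typing import Dict, List

def sites_to_flag(sft_results: Dict[str, List[bool]], max_failures: int = 2) -> List[str]:
    """Return sites whose recent SFT runs contain more than `max_failures` failures."""
    flagged = []
    for site, runs in sft_results.items():
        failures = sum(1 for passed in runs if not passed)
        if failures > max_failures:
            flagged.append(site)
    return flagged

# Example: three hypothetical sites with the outcome of their last five test runs.
results = {
    "site-a.example.org": [True, True, True, True, True],
    "site-b.example.org": [True, False, False, True, False],
    "site-c.example.org": [False, True, True, True, True],
}
print(sites_to_flag(results))  # ['site-b.example.org']
```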
11 Operations focus
- Main focus of activities now:
- Improving the operational reliability and application efficiency
- Automating monitoring → alarms
- Ensuring a 24x7 service
- Removing sites that fail functional tests
- Operations interoperability with OSG and others
- Improving user support
- Demonstrate to users a reliable and trusted support infrastructure
- Deployment of gLite components
- Testing, certification → pre-production service
- Migration planning and deployment while maintaining/growing interoperability
- Further developments now have to be driven by experience in real use
12 EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
13 gLite middleware
- The 1st release of gLite (v1.0) was made at the end of March 2005
- http://glite.web.cern.ch/glite/packages/R1.0/R20050331
- http://glite.web.cern.ch/glite/documentation
- Lightweight services
- Interoperability and co-existence with the deployed infrastructure
- Performance and fault tolerance
- Portable
- Service oriented approach
- Site autonomy
- Open source license
14 gLite Release 1.0
- Job Management Services
- Workload Management (see the submission sketch below)
- Computing Element
- Logging and Bookkeeping
- Data Management Services
- File and Replica Catalog
- File Transfer and Placement Services
- gLite I/O
- Information Services
- R-GMA
- Service Discovery
- Security
- Deployment Modules
- Distribution available as RPMs, Binary Tarballs, Source Tarballs and an APT cache
Serious testing and certification is just starting
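To make the job-management chain above concrete, here is a minimal sketch of describing a job in JDL and handing it to the Workload Management System from a Python wrapper. The JDL attributes and the `glite-job-submit` client name are assumptions based on the gLite 1.0 generation of tools; substitute whatever client your UI actually provides.

```python
import subprocess
import tempfile

# A minimal JDL description: executable plus output sandbox files.
# Attribute names follow the EDG/gLite JDL style; treat them as illustrative.
JDL = """
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

def submit(jdl_text: str) -> None:
    """Write the JDL to a temporary file and pass it to the WMS client.

    'glite-job-submit' is assumed here to be the submission command of the
    gLite 1.0 WMS; adapt the command name to your installation.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl_text)
        path = f.name
    subprocess.run(["glite-job-submit", path], check=True)

if __name__ == "__main__":
    submit(JDL)
```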
15 gLite Services for Release 1.0: Components Summary and Origin
- Computing Element
- Gatekeeper, WSS (Globus)
- Condor-C (Condor)
- CE Monitor (EGEE)
- Local batch system (PBS, LSF, Condor)
- Workload Management
- WMS (EDG)
- Logging and bookkeeping (EDG)
- Condor-C (Condor)
- Storage Element
- File Transfer/Placement (EGEE)
- glite-I/O (AliEn)
- GridFTP (Globus)
- SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
- Catalog
- File and Replica Catalog (EGEE)
- Metadata Catalog (EGEE)
- Information and Monitoring
- R-GMA (EDG)
- Service Discovery (EGEE)
- Security
- VOMS (DataTAG, EDG)
- GSI (Globus)
- Authentication for C and Java based (web) services (EDG)
16 Main Differences to LCG-2
- Workload Management System works in push and pull mode (illustrated below)
- Computing Element moving towards a VO-based scheduler guarding the jobs of the VO (reduces load on GRAM)
- Re-factored file and replica catalogs
- Secure catalogs (based on user DN; VOMS certificates being integrated)
- Scheduled data transfers
- SRM-based storage
- Information Services: R-GMA with improved API, Service Discovery and registry replication
- Move towards Web Services
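The push and pull modes of the Workload Management System can be pictured with a small conceptual toy: in push mode the WMS matches a queued job to a site with free slots and sends it there; in pull mode a site asks the WMS for work when it has capacity. The classes below are purely illustrative, not gLite code.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Site:
    name: str
    free_slots: int

@dataclass
class WMS:
    """Toy workload manager illustrating push vs. pull dispatch (conceptual only)."""
    queue: deque = field(default_factory=deque)

    def submit(self, job: str) -> None:
        self.queue.append(job)

    def push(self, sites: List[Site]) -> None:
        # Push mode: the WMS picks a site with capacity and sends the job there.
        while self.queue:
            target = next((s for s in sites if s.free_slots > 0), None)
            if target is None:
                break
            job = self.queue.popleft()
            target.free_slots -= 1
            print(f"pushed {job} to {target.name}")

    def pull(self, site: Site) -> Optional[str]:
        # Pull mode: a site with free slots asks the WMS for the next job.
        if site.free_slots > 0 and self.queue:
            site.free_slots -= 1
            return self.queue.popleft()
        return None

wms = WMS()
for i in range(3):
    wms.submit(f"job-{i}")
wms.push([Site("site-a", 1)])       # one job pushed to site-a
print(wms.pull(Site("site-b", 2)))  # site-b pulls the next queued job
```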
17 EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
18 Outreach and Training
- Public and technical websites constantly evolving to expand the information available and keep it up to date
- 2 conferences organised
- 300 @ Cork, 400 @ Den Haag
- Athens: 3rd project conference, 18-22 April 2005
- http://public.eu-egee.org/conferences/3rd/
- Pisa: 4th project conference, 24-28 October 2005
- More than 70 training events (including the GGF grid school) across many countries
- 1000 people trained
- induction, application developer, advanced, retreats
- Material archive with more than 100 presentations
- Strong links with the GILDA testbed and GENIUS portal developed in EU DataGrid
19 Deployment of applications
- Pilot applications
- High Energy Physics
- Biomed applications
- http://egee-na4.ct.infn.it/biomed/applications.html
- Generic applications: deployment under way
- Computational Chemistry
- Earth science research
- EGEODE: first industrial application
- Astrophysics
- With interest from
- Hydrology
- Seismology
- Grid search engines
- Stock market simulators
- Digital video, etc.
- Industry (provider, user, supplier)
- Many users
- broad range of needs
- different communities with different backgrounds and internal organization
20 High Energy Physics
- Very experienced and large international user community
- Involvement in many projects worldwide and users of several grids (e.g. all LHC experiments use multiple grids at the same time for their data challenges)
- Non-LHC experiments: ZEUS, D0, CDF, H1, BaBar
- Production infrastructure (LCG/EGEE)
- Intensive usage during 2004 data challenges
- LHCb: 3500 concurrent jobs for long periods
- Many issues of functionality and performance were exposed
- Data challenges were also the first real use of LCG-2; only limited testing had been done in advance
- Major issue was reliability: badly configured and unstable sites
- Nevertheless significant work was done:
- >1 M SI2K years of CPU time (1000 CPU years; see the arithmetic sketch below)
- 400 TB of data generated, moved and stored
- 4000-5000 simultaneous jobs (4 times CERN grid capacity)
- ARDA: role in application development and middleware testing
- Helping the evolution of the experiments' specific middleware towards analysis usage
- Large effort on the 4 LHC experiment prototypes
- CMS prototype migrated to gLite version 1 and exposed to several users
- Improved reliability has been achieved by selecting well maintained sites
- Efficiencies of better than 90% have been possible (D0, CMS, ATLAS) in well controlled conditions
- This remains the main area of focus for improvement, due in large part to the number of sites in the infrastructure
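A back-of-the-envelope check of the data challenge numbers quoted above, assuming a typical CPU of that era rates at about 1 kSI2K (the conversion factor and the mid-point job count are assumptions, not figures from the slide):

```python
# Rough sanity check of the 2004 data challenge scale quoted on the slide.
# Assumption: one contemporary CPU delivers about 1000 SI2K (1 kSI2K).
total_si2k_years = 1_000_000   # >1 M SI2K-years of CPU time (from the slide)
si2k_per_cpu = 1_000           # assumed benchmark rating of a single CPU

cpu_years = total_si2k_years / si2k_per_cpu
print(f"{cpu_years:.0f} CPU-years")  # ~1000 CPU-years, matching the slide

# With 4000-5000 simultaneous jobs, that workload corresponds to roughly:
concurrent_jobs = 4500  # assumed mid-point of the quoted range
print(f"~{12 * cpu_years / concurrent_jobs:.1f} months of continuous running")
```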
21 Recent ATLAS work
[Plot: number of ATLAS jobs per day, with up to 10,000 concurrent jobs in the system]
- ATLAS jobs in EGEE/LCG-2 in 2005
- In the latest period, up to 8K jobs/day
- Used a combination of RB and Condor-G submissions
22 ZEUS on LCG-2
23 LCG Deployment Schedule
- LHC starts in 2007
- Ramp-up with a series of service challenges to ensure key services and infrastructure are in place
- Extremely aggressive timescale
24 Introduction: The MAGIC Telescope
- Ground based Air Cerenkov Telescope
- Gamma rays: 30 GeV - TeV
- La Palma, Canary Islands (28° North, 18° West)
- 17 m diameter
- In operation since autumn 2003 (still in commissioning)
- Collaborators: IFAE Barcelona, UAB Barcelona, Humboldt U. Berlin, UC Davis, U. Lodz, UC Madrid, MPI München, INFN / U. Padova, U. Potchefstroom, INFN / U. Siena, Tuorla Observatory, INFN / U. Udine, U. Würzburg, Yerevan Physics Inst., ETH Zürich
- Physics goals: origin of VHE gamma rays from Active Galactic Nuclei, Supernova Remnants, unidentified EGRET sources, Gamma Ray Bursts
25 Introduction: ground-based γ-ray astronomy
26 MAGIC: Hadron rejection
- Based on extensive Monte Carlo simulation
- air shower simulation program: CORSIKA
- Simulation of the hadronic background is very CPU consuming
- to simulate the background of one night, 70 CPUs (P4 2GHz) need to run 19200 days
- to simulate the gamma events of one night for a Crab-like source takes 288 days
- At higher energies (> 70 GeV) observations are already possible with the On-Off method (this reduces the On-time by a factor of two)
- Lowering the threshold of the MAGIC telescope requires new methods based on Monte Carlo simulations
27 Experiences
- Data challenge Grid-1
- 12M hadron events
- 12000 jobs needed
- started March 2005
- up to now 4000 jobs
- 170/3780 jobs failed → 4.5% failure rate
- Job successful: output file registered at PIC
- First tests with manual GUI submission
- Reasons for failure:
- Network problems
- RB problems
- Queue problems
- Diagnostics:
- no tools found
- complex and time consuming
- → use a metadata base, log the failure, resubmit and don't care (see the sketch below)
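The failure handling described above (log the failure, resubmit and don't care) amounts to a simple retry loop over the job list; the sketch below also reproduces the quoted failure rate. The function names, log format and retry limit are invented for illustration.

```python
import logging
import random

# Failure rate quoted on the slide: 170 failed jobs out of 3780 submitted.
failed, submitted = 170, 3780
print(f"failure rate: {100 * failed / submitted:.1f}%")  # ~4.5%

logging.basicConfig(level=logging.INFO)

def run_job(job_id: int) -> bool:
    """Stand-in for a grid submission; fails randomly about 4.5% of the time."""
    return random.random() > 0.045

def submit_with_retries(job_ids, max_attempts: int = 3) -> None:
    """Log each failure and blindly resubmit, as described on the slide."""
    for job_id in job_ids:
        for attempt in range(1, max_attempts + 1):
            if run_job(job_id):
                break
            logging.info("job %d failed (attempt %d), resubmitting", job_id, attempt)

submit_with_retries(range(10))
```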
28 Biomed applications
- Loosely coupled community
- Had to go the long way of getting up to speed:
- VO creation and core services installation
- Setting up a task force of experts
- Recently joined the user support at application level
- Applications
- See list and descriptions on the web site
- http://egee-na4.ct.infn.it/biomed/applications.html
- 12 applications running today
- New applications emerging
- medical imaging, bioinformatics, phylogenetics, molecule structures and drug discovery...
- Grown to a significant infrastructure usage
- 29k CPU hours and 24k jobs reported in January
29 Bioinformatics
- GPS@: Grid Protein Sequence Analysis
- NPSA is a web portal offering protein databases and sequence analysis algorithms to bioinformaticians (3000 hits per day)
- GPS@ is a gridified version with increased computing power
- Need for large databases and a big number of short jobs (see the splitting sketch below)
- xmipp_MLrefine
- 3D structure analysis of macromolecules from (very noisy) electron microscopy images
- Maximum likelihood approach for finding the optimal model
- Very compute intensive
- Drug discovery
- Health related area with high performance computation needs
- An application currently being ported in Germany (Fraunhofer institute)
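The GPS@ workload pattern noted above (large databases, a big number of short jobs) is typically handled by chopping the input into chunks and submitting one short job per chunk. A minimal sketch, with an invented input size and job count:

```python
from typing import Iterable, List

def chunk(sequences: List[str], jobs: int) -> Iterable[List[str]]:
    """Split a list of protein sequences into roughly equal groups, one per grid job."""
    size = max(1, len(sequences) // jobs)
    for start in range(0, len(sequences), size):
        yield sequences[start:start + size]

# Illustrative input: pretend we have 10,000 sequences to analyse with 200 short jobs.
sequences = [f"seq-{i}" for i in range(10_000)]
job_inputs = list(chunk(sequences, jobs=200))
print(len(job_inputs), "jobs of about", len(job_inputs[0]), "sequences each")
```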
30 Medical imaging
- GATE
- Radiotherapy planning
- Improvement of precision by Monte Carlo simulation
- Processing of DICOM medical images
- Objective: very short computation time compatible with clinical practice
- Status: development and performance testing
- CDSS
- Clinical Decision Support System
- assembling knowledge databases
- making image classification engines widely available
- Objective: access to knowledge databases from hospitals
- Status: from development to deployment, some medical end users
31 Medical imaging
- SiMRI3D
- 3D Magnetic Resonance Image Simulator
- MRI physics simulation, parallel implementation
- Very compute intensive
- Objective: offering an image simulator service to the research community
- Status: parallelized and now running on LCG-2 resources
- gPTM3D
- Interactive tool for medical image segmentation and analysis
- A non-gridified version is distributed in several hospitals
- Need for very fast scheduling of interactive tasks
- Objective: shorten computation time using the grid
- Status: development of the gridified version being finalized
32 Evolution of biomedical applications
- Growing interest of the biomedical community
- Partners involved are proposing new applications
- New application proposals (in various health-related areas)
- Enlargement of the biomedical community (drug discovery)
- Growing scale of the applications
- Progressive migration from prototypes to pre-production services for some applications
- Increase in scale (volume of data and number of CPU hours)
33 EGEE Geographical Extensions
- EGEE is a truly international undertaking
- Collaborations with other existing European projects, in particular:
- GÉANT, DEISA, SEE-GRID
- Relations to other projects/proposals:
- OSG: Open Science Grid (USA)
- Asia: Korea, Taiwan, EU-ChinaGrid
- BalticGrid: Lithuania, Latvia, Estonia
- EELA: Latin America
- EUMedGrid: Mediterranean Area
- Expansion of the EGEE infrastructure in these regions is a key element for the future of the project and for international science
34 Summary
- EGEE is a first attempt to build a worldwide Grid infrastructure for data intensive applications from many scientific domains
- A large-scale production grid service is already deployed and being used for HEP and BioMed applications, with new applications being ported
- Resources and user groups are expanding
- A process is in place for migrating new applications to the EGEE infrastructure
- A training programme has started, with many events already held
- Next generation middleware is being tested (gLite)
- First project review by the EU successfully passed in February 2005
- Plans for a follow-on project are being prepared