Title: The EGEE Project Status
1 The EGEE Project Status
- Ian Bird
- EGEE Operations Manager
- Geneva, Switzerland
ISGC, Taipei, 27th April 2005
- The EGEE Project
- Overview and Structure
- Grid Operations
- Middleware
- Networking Activities
- Applications
- Biomedical
- Summary
3 EGEE goals
- Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day
- The project concentrates on:
- building a consistent, robust and secure Grid network that will attract additional computing resources
- continuously improving and maintaining the middleware in order to deliver a reliable service to users
- attracting new users from industry as well as science and ensuring they receive the high standard of training and support they need
- EGEE is the largest Grid infrastructure project in Europe
- 70 leading institutions in 27 countries, federated in regional Grids
- Leveraging national and regional grid activities
- 32 M Euros EU funding for initially 2 years, starting 1st April 2004
- EU review, February 2005: successful
- Preparing 2nd phase of the project: proposal to EU Grid call September 2005
- Promoting scientific partnership outside EU
5 EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
7 Computing Resources April 2005
This greatly exceeds the project expectations for numbers of sites.
Shows that the main issue of complexity is the number of sites.
8 SA1 Operations Structure
- Operations Management Centre (OMC)
- At CERN: coordination etc.
- Core Infrastructure Centres (CIC)
- Manage daily grid operations: oversight, troubleshooting
- Run essential infrastructure services
- Provide 2nd level support to ROCs
- UK/I, Fr, It, CERN, Russia (M12)
- Taipei will also run a CIC
- Regional Operations Centres (ROC)
- Act as front-line support for user and operations issues
- Provide local knowledge and adaptations
- One in each region: many distributed
- User Support Centre (GGUS)
- In FZK: manage PTS, provide single point of contact (service desk)
- Not foreseen as such in TA, but need is clear
9 Grid Operations
- The grid is flat, but
- Hierarchy of responsibility
- Essential to scale the operation
- CICs act as a single Operations Centre
- Operational oversight (grid operator) responsibility rotates weekly between CICs
- Report problems to ROC/RC
- ROC is responsible for ensuring problem is resolved
- ROC oversees regional RCs
- ROCs responsible for organising the operations in a region
- Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
RC = Resource Centre; ROC = Regional Operations Centre; CIC = Core Infrastructure Centre
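The weekly rotation of the grid-operator role between CICs can be illustrated with a small sketch. This is not the project's actual rota tool: the ordering of the CICs, the start date, and the function name `cic_on_duty` are assumptions made only for illustration.

```python
from datetime import date

# CIC names as listed on the SA1 slide. Their order here and the start
# date below are illustrative assumptions, not the real EGEE rota.
CICS = ["UK/I", "Fr", "It", "CERN", "Russia"]
ROTA_START = date(2005, 1, 3)  # hypothetical Monday on which the rota begins


def cic_on_duty(day, cics=CICS, start=ROTA_START):
    """Return the CIC holding the grid-operator role in the week containing `day`."""
    if day < start:
        raise ValueError("date precedes the start of the rota")
    weeks_elapsed = (day - start).days // 7
    # Responsibility moves to the next CIC every 7 days and wraps around.
    return cics[weeks_elapsed % len(cics)]


if __name__ == "__main__":
    print(cic_on_duty(date(2005, 4, 27)))
```

A fixed modular schedule like this keeps the hand-over predictable; a real rota would also need to handle holidays and swaps.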
10 Grid monitoring
- Operation of Production Service: real-time display of grid operations
- Accounting information
- Selection of Monitoring tools
- GIIS Monitor and Monitor Graphs
- Sites Functional Tests
- GOC Data Base
- Scheduled Downtimes
- Live Job Monitor
- GridIce: VO and fabric view
- Certificate Lifetime Monitor
11 Operations focus
- Main focus of activities now
- Improving the operational reliability and application efficiency
- Automating monitoring → alarms
- Ensuring a 24x7 service
- Removing sites that fail functional tests
- Operations interoperability with OSG and others
- Improving user support
- Demonstrate to users a reliable and trusted support infrastructure
- Deployment of gLite components
- Testing, certification → pre-production service
- Migration planning and deployment while maintaining/growing interoperability
- Further developments now have to be driven by experience in real use
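The policy of removing sites that fail functional tests can be sketched as a simple filter. The data layout (a mapping from site name to a list of pass/fail outcomes), the threshold parameter and the site names are hypothetical; the actual Sites Functional Tests framework is not described in these slides.

```python
def select_production_sites(test_results, max_failures=0):
    """Split sites into (kept, removed) according to failed functional tests.

    test_results maps a site name to a list of booleans (True = test passed).
    A site is removed when it has more than `max_failures` failed tests.
    """
    kept, removed = [], []
    for site, outcomes in sorted(test_results.items()):
        failures = sum(1 for ok in outcomes if not ok)
        if failures <= max_failures:
            kept.append(site)
        else:
            removed.append(site)
    return kept, removed


if __name__ == "__main__":
    # Hypothetical site names and results, for illustration only.
    results = {
        "site-a": [True, True, True],
        "site-b": [True, False, True],
        "site-c": [False, False, True],
    }
    print(select_production_sites(results))
    print(select_production_sites(results, max_failures=1))
```

Raising `max_failures` trades strictness for capacity: fewer sites are excluded, at the cost of admitting less reliable ones.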
13 gLite middleware
- The 1st release of gLite (v1.0) made end of March 05
- http://glite.web.cern.ch/glite/packages/R1.0/R20050331
- http://glite.web.cern.ch/glite/documentation
- Lightweight services
- Interoperability and co-existence with deployed infrastructure
- Performance and fault tolerance
- Portable
- Service oriented approach
- Site autonomy
- Open source license
14 gLite Release 1.0
- Job management Services
- Workload Management
- Computing Element
- Logging and Bookkeeping
- Data management Services
- File and Replica catalog
- File Transfer and Placement Services
- gLite I/O
- Information Services
- Service Discovery
- Security
- Deployment Modules
- Distribution available as RPMs, Binary Tarballs, Source Tarballs and APT cache
Serious testing and certification is just starting
15 gLite Services for Release 1.0: Components Summary and Origin
- Computing Element
- Gatekeeper, WSS (Globus)
- Condor-C (Condor)
- CE Monitor (EGEE)
- Local batch system (PBS, LSF, Condor)
- Workload Management
- Logging and bookkeeping (EDG)
- Condor-C (Condor)
- Storage Element
- File Transfer/Placement (EGEE)
- glite-I/O (AliEn)
- GridFTP (Globus)
- SRM: Castor (CERN), dCache (FNAL, DESY), other
- Catalog
- File and Replica Catalog (EGEE)
- Metadata Catalog (EGEE)
- Information and Monitoring
- Service Discovery (EGEE)
- Security
- GSI (Globus)
- Authentication for C and Java based (web) services (EDG)
16 Main Differences to LCG-2
- Workload Management System works in push and pull mode
- Computing Element moving towards a VO based scheduler guarding the jobs of the VO (reduces load on GRAM)
- Re-factored file and replica catalogs
- Secure catalogs (based on user DN; VOMS certificates being integrated)
- Scheduled data transfers
- SRM based storage
- Information Services: R-GMA with improved API, Service Discovery and registry replication
- Move towards Web Services
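The difference between push and pull mode in workload management can be shown with a toy model. This is a conceptual sketch only and does not reflect the gLite WMS interfaces: the class `ToyBroker`, its methods and the CE names are invented for illustration. In push mode the broker picks a computing element from its own view of free slots; in pull mode a computing element asks for work when it has capacity.

```python
from collections import deque


class ToyBroker:
    """Toy job broker illustrating push versus pull dispatch (hypothetical)."""

    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def push(self, free_slots):
        """Push mode: the broker assigns queued jobs to the CE with most free slots.

        free_slots maps CE name -> number of free slots; it is updated in place.
        """
        assignments = []
        while self.queue and free_slots:
            ce = max(free_slots, key=free_slots.get)
            if free_slots[ce] == 0:
                break  # no capacity anywhere; remaining jobs stay queued
            free_slots[ce] -= 1
            assignments.append((self.queue.popleft(), ce))
        return assignments

    def pull(self, ce, n=1):
        """Pull mode: a CE with free capacity requests up to n jobs."""
        taken = []
        while self.queue and len(taken) < n:
            taken.append((self.queue.popleft(), ce))
        return taken


if __name__ == "__main__":
    broker = ToyBroker()
    for job in ("j1", "j2", "j3", "j4"):
        broker.submit(job)
    print(broker.push({"ce1": 1, "ce2": 1}))  # broker decides placement
    print(broker.pull("ce3", n=5))            # CE asks for the remaining work
```

Push depends on the broker having an accurate view of each site, whereas pull lets a site take work only when it is actually ready to run it.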
18 Outreach and Training
- Public and technical websites constantly evolving to expand information available and keep it up to date
- 2 conferences organised
- 300 at Cork, 400 at Den Haag
- Athens 3rd project conference 18-22 April 05
- http://public.eu-egee.org/conferences/3rd/
- Pisa 4th project conference 24-28 October 05
- More than 70 training events (including the GGF grid school) across many countries
- 1000 people trained
- induction, application developer, advanced, retreats
- Material archive with more than 100 presentations
- Strong links with GILDA testbed and GENIUS portal developed in EU DataGrid
19 Deployment of applications
- Pilot applications
- High Energy Physics
- Biomed applications
- http://egee-na4.ct.infn.it/biomed/applications.html
- Generic applications: deployment under way
- Computational Chemistry
- Earth science research
- EGEODE: first industrial application
- Astrophysics
- With interest from
- Hydrology
- Seismology
- Grid search engines
- Stock market simulators
- Digital video etc.
- Industry (provider, user, supplier)
- Many users
- broad range of needs
- different communities with different background and internal organization
20 High Energy Physics
- Very experienced and large international user community
- Involvement in many projects worldwide and users of several grids (e.g. all LHC experiments use multiple grids at the same time for their data challenges)
- LG experiments ZEUS, D0, CDF, H1, Babar
- Production infrastructure (LCG/EGEE)
- Intensive usage during 2004 data challenges
- LHCb: 3500 concurrent jobs for long periods
- Many issues of functionality and performance were exposed
- Data challenges were also the first real use of LCG-2; only limited testing had been done in advance
- Major issue was reliability: badly configured and unstable sites
- Nevertheless significant work was done
- >1 M SI2K years of cpu time (1000 cpu years)
- 400 TB of data generated, moved and stored
- 4000-5000 simultaneous jobs (4 times CERN grid capacity)
- ARDA role in application development and middleware testing
- Helping the evolution of the experiments' specific middleware towards analysis usage
- Large effort on the 4 LHC experiments' prototypes
- CMS prototype migrated to gLite version 1 and exposed to several users
- Improved reliability has been achieved by selecting well maintained sites
- Efficiencies of better than 90% have been possible (D0, CMS, ATLAS) in well controlled conditions
- This remains the main area of focus for improvement, due in large part to the number of sites in the
21 Recent ATLAS work
10,000 concurrent jobs in the system
Number of jobs/day
- ATLAS jobs in EGEE/LCG-2 in 2005
- In latest period up to 8K jobs/day
- Used a combination of RB and Condor-G submissions
22 ZEUS on LCG-2
23 LCG Deployment Schedule
- LHC starts in 2007
- Ramp-up with series of service challenges to ensure key services and infrastructure are in place
- Extremely aggressive timescale
24 Introduction: The MAGIC Telescope
- Ground based Air Cerenkov Telescope
- Gamma ray: 30 GeV - TeV
- La Palma, Canary Islands (28° North, 18° West)
- 17 m diameter
- operation since autumn 2003 (still in commissioning)
- Collaborators: IFAE Barcelona, UAB Barcelona, Humboldt U. Berlin, UC Davis, U. Lodz, UC Madrid, MPI München, INFN / U. Padova, U. Potchefstroom, INFN / U. Siena, Tuorla Observatory, INFN / U. Udine, U. Würzburg, Yerevan Physics Inst., ETH Zürich
- Physics Goals: Origin of VHE Gamma rays, Active Galactic Nuclei, Supernova Remnants, Unidentified EGRET sources, Gamma Ray Burst
25 Introduction: ground γ-ray astronomy
26 MAGIC Hadron rejection
- Based on extensive Monte Carlo Simulation
- air shower simulation program CORSIKA
- Simulation of hadronic background is very CPU consuming
- to simulate the background of one night, 70 CPUs (P4 2GHz) need to run 19200 days
- to simulate the gamma events of one night for a Crab like source takes 288 days
- At higher energies (> 70 GeV) observations are possible already by On-Off method (this reduces the On-time by a factor of two)
- Lowering the threshold of the MAGIC telescope requires new methods based on Monte Carlo
- Data challenge Grid-1
- 12M hadron events
- 12000 jobs needed
- started March 2005
- up to now 4000 jobs
170/3780 jobs failed → 4.5% failure
Job successful: output file registered at PIC
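The data-challenge figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are taken directly from the slide, and the assumption that events are spread evenly over jobs is ours.

```python
# Figures quoted on the slide for data challenge Grid-1.
events_total = 12_000_000   # "12M hadron events"
jobs_needed = 12_000        # "12000 jobs needed"
jobs_so_far = 4_000         # "up to now 4000 jobs"
jobs_counted = 3_780        # denominator of the quoted failure statistic
jobs_failed = 170

# Assumes events are divided evenly between jobs (not stated in the slides).
events_per_job = events_total // jobs_needed
progress = jobs_so_far / jobs_needed
failure_rate = jobs_failed / jobs_counted

print(f"events per job: {events_per_job}")
print(f"progress:       {progress:.0%}")
print(f"failure rate:   {failure_rate:.1%}")
```

The computed failure rate rounds to the 4.5% quoted on the slide.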
- First tests
- with manual GUI submission
- Reasons for failure
- Network problems
- RB problems
- Queue problems
- Diagnostic
- no tools found
- complex and time consuming
- → use metadata base, log the failure, resubmit and don't care
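The "log the failure, resubmit and don't care" strategy can be sketched as a retry loop that records each failure instead of diagnosing it. This is a minimal illustration, not the MAGIC production scripts: the function name, the `submit` callable, the retry limit and the in-memory list standing in for the metadata base are all assumptions.

```python
def run_with_resubmission(job_ids, submit, max_attempts=3):
    """Submit each job, logging failures and resubmitting without diagnosis.

    `submit` is a caller-supplied function taking a job id and returning
    (ok, reason). Returns (completed_jobs, failure_log); the failure log is
    a plain list standing in for the metadata base mentioned on the slide.
    """
    completed, failure_log = [], []
    for job_id in job_ids:
        for attempt in range(1, max_attempts + 1):
            ok, reason = submit(job_id)
            if ok:
                completed.append(job_id)
                break
            # Record the failure and simply try again.
            failure_log.append({"job": job_id, "attempt": attempt, "reason": reason})
    return completed, failure_log


if __name__ == "__main__":
    # Scripted outcomes: job 2 fails once with a hypothetical reason.
    outcomes = iter([(True, None), (False, "RB problem"), (True, None), (True, None)])
    done, log = run_with_resubmission([1, 2, 3], lambda job_id: next(outcomes))
    print(done)
    print(log)
```

Capping the number of attempts keeps a persistently failing job from blocking the rest of the batch, while the log preserves enough information to study failure causes later.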
28 Biomed applications
- Loosely coupled community
- Had to go the long way of getting up to speed
- VO creation and core services installation
- Setting up a task force of experts
- Recently joined the user support at application level
- Applications
- See list and description from web site
- http://egee-na4.ct.infn.it/biomed/applications.html
- 12 applications running today
- New applications emerging
- medical imaging, bioinformatics, phylogenetics, molecule structures and drug discovery...
- Grown to a significant infrastructure usage
- 29 kCPU hours and 24k jobs reported in January
- GPS@: Grid Protein Sequence Analysis
- NPSA is a web portal offering protein databases and sequence analysis algorithms to bioinformaticians (3000 hits per day)
- GPS@ is a gridified version with increased computing power
- Need for large databases and a large number of short jobs
- xmipp_MLrefine
- 3D structure analysis of macromolecules from (very noisy) electron microscopy images
- Maximum likelihood approach for finding the optimal model
- Very compute intensive
- Drug discovery
- Health related area with high performance computation need
- An application currently being ported in Germany (Fraunhofer institute)
30 Medical imaging
- Radiotherapy planning
- Improvement of precision by Monte Carlo simulation
- Processing of DICOM medical images
- Objective: very short computation time compatible with clinical practice
- Status: development and performance testing
- Clinical Decision Support System
- knowledge databases assembling
- image classification engines widespreading
- Objective: access to knowledge databases from hospitals
- Status: from development to deployment, some medical end users
31 Medical imaging
- 3D Magnetic Resonance Image Simulator
- MRI physics simulation, parallel implementation
- Very compute intensive
- Objective: offering an image simulator service to the research community
- Status: parallelized and now running on LCG2 resources
- gPTM3D
- Interactive tool for medical images segmentation and analysis
- A non gridified version is distributed in several hospitals
- Need for very fast scheduling of interactive tasks
- Objective: shorten computation time using the grid
- Status: development of the gridified version being finalized
32 Evolution of biomedical applications
- Growing interest of the biomedical community
- Partners involved proposing new applications
- New application proposals (in various health-related areas)
- Enlargement of the biomedical community (drug discovery)
- Growing scale of the applications
- Progressive migration from prototypes to pre-production services for some applications
- Increase in scale (volume of data and number of CPU hours)
33 EGEE Geographical Extensions
- EGEE is a truly international undertaking
- Collaborations with other existing European projects, in particular:
- GÉANT, DEISA, SEE-GRID
- Relations to other projects/proposals
- OSG: OpenScienceGrid (USA)
- Asia: Korea, Taiwan, EU-ChinaGrid
- BalticGrid: Lithuania, Latvia, Estonia
- EELA: Latin America
- EUMedGrid: Mediterranean Area
- Expansion of EGEE infrastructure in these regions is a key element for the future of the project and international science
- EGEE is a first attempt to build a worldwide Grid infrastructure for data intensive applications from many scientific domains
- A large-scale production grid service is already deployed and being used for HEP and BioMed applications, with new applications being ported
- Resources and user groups are expanding
- A process is in place for migrating new applications to the EGEE infrastructure
- A training programme has started with many events already held
- Next generation middleware is being tested (gLite)
- First project review by the EU successfully passed in Feb 05
- Plans for a follow-on project are being prepared