Title: LCG Status and Plans
1. LCG Status and Plans
- GridPP13
- Durham, UK
- 4th July 2005
- Ian Bird
- IT/GD, CERN
2. Overview
- Introduction
- Project goals and overview
- Status
- Applications area
- Fabric
- Deployment and Operations
- Baseline Services
- Service Challenges
- Summary
3. LCG Goals
- The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
- Two phases:
  - Phase 1 (2002-2005)
    - Build a service prototype, based on existing grid middleware
    - Gain experience in running a production grid service
    - Produce the TDR for the final system
  - Phase 2 (2006-2008)
    - Build and commission the initial LHC computing environment
- The LCG and experiment Computing TDRs were completed and presented to the LHCC last week
4. Project Areas and Management
- Project Leader: Les Robertson; Resource Manager: Chris Eck; Planning Officer: Jürgen Knobloch; Administration: Fabienne Baud-Lavigne
- Distributed Analysis - ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology; joint with EGEE
- Applications Area (Pere Mato): development environment, joint projects, data management, distributed analysis
- Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management, data recording, cluster technology, networking, computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the grid service - middleware certification, security, operations, registration, authorisation, accounting
5. Applications Area
6. Application Area Focus
- Deliver the common physics applications software
- Organized to ensure focus on real experiment needs:
  - Experiment-driven requirements and monitoring
  - Architects in management and execution
  - Open information flow and decision making
  - Participation of experiment developers
  - Frequent releases enabling iterative feedback
- Success defined by experiment validation:
  - Integration, evaluation, successful deployment
7. Validation Highlights
- POOL successfully used in large-scale production in the ATLAS, CMS and LHCb data challenges in 2004
  - 400 TB of POOL data produced
  - The objective of a quickly developed persistency hybrid leveraging ROOT I/O and RDBMSes has been fulfilled (sketched below)
- Geant4 firmly established as the baseline simulation in successful ATLAS, CMS and LHCb production
  - EM and hadronic physics validated
  - Highly stable: 1 G4-related crash per O(1M) events
- SEAL components underpin POOL's success, in particular the dictionary system
  - Now entering a second generation with Reflex
- SPI's Savannah project portal and external software service are accepted standards inside and outside the project
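To make the "hybrid" model concrete: bulk event data are streamed through ROOT I/O, while catalogues and conditions are kept in relational databases. The fragment below illustrates only the ROOT I/O side, using plain ROOT calls - it is not POOL's own storage-manager API, and the file, tree and branch names are invented for the example.

    // Minimal sketch of the ROOT I/O half of the hybrid persistency model.
    // POOL wraps this behind its storage manager; catalogue and conditions
    // data live in an RDBMS, not in this file.
    #include "TFile.h"
    #include "TTree.h"

    int main() {
      TFile file("events.root", "RECREATE");      // object data go into ROOT files
      TTree tree("Events", "example event payload");
      int   run = 1, event = 0;
      float energy = 0.f;
      tree.Branch("run",    &run,    "run/I");
      tree.Branch("event",  &event,  "event/I");
      tree.Branch("energy", &energy, "energy/F");
      for (event = 0; event < 1000; ++event) {    // fill a few dummy events
        energy = 0.1f * event;
        tree.Fill();
      }
      tree.Write();
      file.Close();
      return 0;
    }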
8. Current AA Projects
- SPI - Software Process Infrastructure (A. Aimar)
  - Software and development services: external libraries, Savannah, software distribution, support for build, test, QA, etc.
- ROOT - Core Libraries and Services (R. Brun)
  - Foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, etc.
- POOL - Persistency Framework (D. Duellmann)
  - Storage manager, file catalogues, event collections, relational access layer, conditions database, etc.
- SIMU - Simulation project (G. Cosmo)
  - Simulation framework, physics validation studies, MC event generators, participation in Geant4, Fluka
9. SEAL and ROOT Merge
- The major change in the AA has been the merge of the SEAL project with the ROOT project
- Details of the merge are being discussed, following a process defined by the AF:
  - Breakdown into a number of topics
  - Proposals discussed with the experiments
  - Public presentations
  - Final decisions by the AF
- Current status:
  - Dictionary plans approved
  - MathCore and Vector library proposals have been approved
  - First development release of ROOT including these new libraries
10. Ongoing Work
- SPI
  - Porting LCG-AA software to amd64 (gcc 3.4.4)
  - Finalizing software distribution based on Pacman
  - QA tools: test coverage and Savannah reports
- ROOT
  - Development version v5.02 released last week
  - Includes the new libraries mathcore, reflex, cintex, roofit
- POOL
  - Version 2.1 released, including new file catalogue implementations: LFCCatalog (LFC), GliteCatalog (gLite Fireman), GTCatalog (Globus Toolkit)
  - New version of the conditions database (COOL) 1.2
  - Adapting POOL to the new dictionaries (Reflex)
- SIMU
  - New Geant4 public minor release 7.1 is being prepared
  - Public release of Fluka expected by the end of July
  - Intense activity on combined calorimeter physics validation with ATLAS; report in September
  - New MC generators (CASCADE, CHARYBDIS, etc.) being added to the already long list of generators provided
  - Prototyping persistency of Geant4 geometry with ROOT
11. Fabric Area
12-13. CERN Fabric
- Fabric automation has seen very good progress
  - The new systems for managing large farms are in production at CERN
- New CASTOR mass storage system
  - Deployed first on the high-throughput cluster for the recent ALICE data recording computing challenge
- Agreement on collaboration with Fermilab on the Linux distribution
  - Scientific Linux, based on Red Hat Enterprise 3
  - Improves uniformity between the HEP sites serving the LHC and Run 2 experiments
- CERN computer centre preparations
  - Power upgrade to 2.5 MW
  - Computer centre refurbishment well under way
  - Acquisition process started
14. Preparing for 7,000 boxes in 2008
15. High Throughput Prototype (openlab/LCG)
- Experience with likely ingredients in LCG:
  - 64-bit programming
  - Next-generation I/O (10 Gb Ethernet, InfiniBand, etc.)
- High-performance cluster used for evaluations, and for data challenges with the experiments
- Flexible configuration: components moved in and out of the production environment
- Co-funded by industry and CERN
16. ALICE Data Recording Challenge
- Target: one week sustained at 450 MB/sec
- Used the new version of the CASTOR mass storage system
- Note the smooth degradation and recovery after equipment failure
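- For scale (simple arithmetic, not a figure from the challenge itself): one week sustained at 450 MB/s is 450 MB/s x 604,800 s, i.e. roughly 270 TB written to mass storage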
17. Deployment and Operations
18. Computing Resources - June 2005
- The number of sites is already at the scale expected for LHC, and demonstrates the full complexity of operations
- [Map of participating countries: countries providing resources and countries anticipating joining]
- In LCG-2:
  - 139 sites, 32 countries
  - 14,000 CPUs
  - 5 PB storage
- Includes non-EGEE sites:
  - 9 countries
  - 18 sites
19. Operations Structure
- Operations Management Centre (OMC)
  - At CERN: coordination, etc.
- Core Infrastructure Centres (CIC)
  - Manage daily grid operations: oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to the ROCs
  - UK/I, Fr, It, CERN, Russia (M12)
  - Hope to get non-European centres
- Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region, many distributed
- User Support Centre (GGUS)
  - At FZK: support portal providing a single point of contact (service desk)
20. Grid Operations
- The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
- The CICs act as a single operations centre
  - Operational oversight (the grid operator role) rotates weekly between the CICs
  - Problems are reported to the ROC/RC; the ROC is responsible for ensuring the problem is resolved
  - The ROC oversees the regional RCs (RC = Resource Centre)
- The ROCs are responsible for organising the operations in their region
  - Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
- It is in setting up this operational infrastructure that we have really benefited from EGEE funding
21. Grid Monitoring
- Operation of the production service: real-time display of grid operations
- Accounting information
- Selection of monitoring tools:
  - GIIS Monitor and Monitor Graphs
  - Site Functional Tests
  - GOC Data Base
  - Scheduled Downtimes
  - Live Job Monitor
  - GridIce (VO and fabric views)
  - Certificate Lifetime Monitor
22. Operations Focus
- Main focus of activities now:
  - Improving operational reliability and application efficiency
    - Automating monitoring → alarms
    - Ensuring a 24x7 service
    - Removing sites that fail functional tests
    - Operations interoperability with OSG and others
  - Improving user support
    - Demonstrate to users a reliable and trusted support infrastructure
  - Deployment of gLite components
    - Testing, certification → pre-production service
    - Migration planning and deployment while maintaining/growing interoperability
- Further developments now have to be driven by experience in real use
23. Recent ATLAS Work
- [Plot: number of ATLAS jobs/day in EGEE/LCG-2 during 2005, with around 10,000 concurrent jobs in the system]
- ATLAS jobs in EGEE/LCG-2 in 2005
  - In the latest period, up to 8K jobs/day
  - Several times the current capacity for ATLAS at CERN alone - this shows the reality of the grid solution
24. Baseline Services and Service Challenges
25. Baseline Services - Goals
- Experiments and regional centres agree on baseline services
  - These support the computing models for the initial period of LHC, and thus must be in operation by September 2006
  - Expose experiment plans and ideas
- Timescales:
  - For the TDR: now
  - For SC3: testing and verification (not all components)
  - For SC4: must have the complete set
- Define services with targets for functionality and scalability/performance metrics
- Very much driven by the experiments' needs
  - But try to understand site and other constraints - not done yet
26. Baseline Services
- Nothing really surprising here, but a lot was clarified in terms of requirements, implementations, deployment, security, etc.
- VO management services
  - Clear need for VOMS: roles, groups, subgroups
- POSIX-like I/O service
  - Local files, including links to catalogues
- Grid monitoring tools and services
  - Focussed on job monitoring
- VO agent framework
- Applications software installation service
- Reliable messaging service
- Information system
- Storage management services
  - Based on SRM as the interface
- Basic transfer services
  - gridFTP, srmCopy
- Reliable file transfer service (see the sketch after this list)
- Grid catalogue services
- Catalogue and data management tools
- Database services
  - Required at Tier-1 and Tier-2 sites
- Compute resource services
- Workload management
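To illustrate the layering between the basic transfer services and the reliable file transfer service listed above, the sketch below wraps a plain gridFTP copy in retries with back-off. It is only a toy: the host names and paths are invented, and the actual baseline service (the gLite File Transfer Service used in the service challenges) does this server-side, with transfer queues and channel management rather than a client-side loop.

    // Toy sketch: "reliable" transfer = basic transfer tool + bookkeeping + retries.
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <unistd.h>

    // Delegate the actual copy to the basic transfer tool (gridFTP client).
    bool basicCopy(const std::string& src, const std::string& dst) {
      const std::string cmd = "globus-url-copy " + src + " " + dst;
      return std::system(cmd.c_str()) == 0;
    }

    int main() {
      // Hypothetical source and destination storage elements.
      const std::string src = "gsiftp://se.tier0.example.org/data/run00123.root";
      const std::string dst = "gsiftp://se.tier1.example.org/data/run00123.root";
      for (int attempt = 1; attempt <= 5; ++attempt) {
        if (basicCopy(src, dst)) {
          std::cout << "transfer succeeded on attempt " << attempt << "\n";
          return 0;
        }
        std::cerr << "attempt " << attempt << " failed, backing off\n";
        sleep(30 * attempt);                 // simple linear back-off between retries
      }
      std::cerr << "giving up after repeated failures\n";
      return 1;
    }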
27. Preliminary Priorities
- A = high priority, mandatory service
- B = standard solutions required, but experiments could select different implementations
- C = common solutions desirable, but not essential

  Service                              ALICE  ATLAS  CMS  LHCb
  Storage Element                        A      A     A     A
  Basic transfer tools                   A      A     A     A
  Reliable file transfer service         A      A    A/B    A
  Catalogue services                     B      B     B     B
  Catalogue and data management tools    C      C     C     C
  Compute Element                        A      A     A     A
  Workload management                    B      A     A     C
  VO agents                              A      A     A     A
  VOMS                                   A      A     A     A
  Database services                      A      A     A     A
  POSIX-like I/O                         C      C     C     C
  Application software installation      C      C     C     C
  Job monitoring tools                   C      C     C     C
  Reliable messaging service             C      C     C     C
  Information system                     A      A     A     A
28. Service Challenges - ramp up to the LHC start-up service
- Jun 05: Technical Design Report
- Sep 05: SC3 service phase
- May 06: SC4 service phase
- Sep 06: initial LHC service in stable operation
- Apr 07: LHC service commissioned
- See Jamie's talk for more details
- SC2: reliable data transfer (disk - network - disk); 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput of 500 MB/sec including mass storage (25% of the nominal final throughput for the proton period)
- SC4: all Tier-1s and major Tier-2s capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput
- LHC service in operation from September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
29. Baseline Services, Service Challenges, Production Service, Pre-production Service, gLite Deployment, ...
30. Services
- Baseline services
  - The set of essential services that the experiments need to be in production by September 2006
  - Components verified in SC3 and SC4
- Service challenges
  - The ramp-up of the LHC computing environment, building up the production service based on the results and lessons of the service challenges
- Production service
  - The evolving service, putting in place new components prototyped in SC3 and SC4
  - No big-bang changes, but many releases!!!
- gLite deployment
  - As new components are certified, they will be added to the production service releases, either in parallel with or replacing existing services
- Pre-production service
  - Should be literally a preview of the production service
  - But at the moment it is a demonstration of gLite services - this has been forced on us by many other constraints (urgency to deploy gLite, need for reasonable-scale testing, ...)
31. Releases and Distributions
- We intend to maintain a single line of production middleware distributions
  - Middleware releases come from JRA1, VDT, LCG, ...
  - Middleware distributions for deployment come from GDA/SA1
  - Remember: the announcement of a release is months away from a deployable distribution (based on the last two years' experience)
- Distributions are still labelled LCG-2.x.x
  - Would like to change to something less specific, to avoid LCG/EGEE confusion
- Frequent updates for Service Challenge sites
  - But only needed for SC sites
- Frequent updates as gLite is deployed
  - Not clear whether all sites will deploy all gLite components immediately
- This is unavoidable
  - A strong request from the LHC experiment spokesmen to the LCG POB: early, gradual and frequent releases of the baseline services are essential, rather than waiting for a complete set
- Throughout all this, we must maintain a reliable production service, which gradually improves in reliability and performance
32. Summary
- We are at the end of LCG Phase 1
  - A good time to step back and look at achievements and issues
- LCG Phase 2 has really started
  - Consolidation of the AA projects
  - Baseline services
  - Service challenges and experiment data challenges
  - Acquisition process starting
- No new developments → make what we have work absolutely reliably, and make it scalable and performant
- The timescale is extremely tight
  - We must ensure that we have appropriate levels of effort committed