1
LCG Status and Plans
  • GridPP13
  • Durham, UK
  • 4th July 2005
  • Ian Bird
  • IT/GD, CERN

2
Overview
  • Introduction
  • Project goals and overview
  • Status
  • Applications area
  • Fabric
  • Deployment and Operations
  • Baseline Services
  • Service Challenges
  • Summary

3
LCG Goals
  • The goal of the LCG project is to prototype and
    deploy the computing environment for the LHC
    experiments
  • Two phases
  • Phase 1: 2002–2005
  • Build a service prototype, based on existing grid
    middleware
  • Gain experience in running a production grid
    service
  • Produce the TDR for the final system
  • Phase 2: 2006–2008
  • Build and commission the initial LHC computing
    environment

LCG and Experiment Computing TDRs completed and
presented to the LHCC last week
4
Project Areas and Management
Distributed Analysis - ARDA (Massimo Lamanna):
prototyping of distributed end-user analysis using
grid technology
Project Leader: Les Robertson; Resource Manager:
Chris Eck; Planning Officer: Jürgen Knobloch;
Administration: Fabienne Baud-Lavigne
Joint with EGEE
Applications Area (Pere Mato): development
environment, joint projects, data management,
distributed analysis
Middleware Area (Frédéric Hemmer): provision of a
base set of grid middleware (acquisition,
development, integration); testing, maintenance,
support
CERN Fabric Area (Bernd Panzer): large cluster
management, data recording, cluster technology,
networking, computing service at CERN
Grid Deployment Area (Ian Bird): establishing and
managing the Grid Service - middleware
certification, security, operations, registration,
authorisation, accounting
5
Applications Area
6
Application Area Focus
  • Deliver the common physics applications software
  • Organized to ensure focus on real experiment
    needs
  • Experiment-driven requirements and monitoring
  • Architects in management and execution
  • Open information flow and decision making
  • Participation of experiment developers
  • Frequent releases enabling iterative feedback
  • Success defined by experiment validation
  • Integration, evaluation, successful deployment

7
Validation Highlights
  • POOL successfully used in large scale production
    in ATLAS, CMS, LHCb data challenges in 2004
  • 400 TB of POOL data produced
  • Objective of a quickly-developed persistency
    hybrid leveraging ROOT I/O and RDBMSes has been
    fulfilled
  • Geant4 firmly established as baseline simulation
    in successful ATLAS, CMS, LHCb production
  • EM and hadronic physics validated
  • Highly stable: 1 G4-related crash per O(1M)
    events
  • SEAL components underpin POOL's success, in
    particular the dictionary system
  • Now entering a second generation with Reflex
  • SPI's Savannah project portal and external
    software service are accepted standards inside
    and outside the project

8
Current AA Projects
  • SPI - Software process infrastructure (A. Aimar)
  • Software and development services: external
    libraries, Savannah, software distribution,
    support for build, test, QA, etc.
  • ROOT - Core Libraries and Services (R. Brun)
  • Foundation class libraries, math libraries,
    framework services, dictionaries, scripting, GUI,
    graphics, etc.
  • POOL - Persistency Framework (D. Duellmann)
  • Storage manager, file catalogs, event
    collections, relational access layer, conditions
    database, etc.
  • SIMU - Simulation project (G. Cosmo)
  • Simulation framework, physics validation studies,
    MC event generators, participation in Geant4,
    Fluka.

9
SEAL and ROOT Merge
  • The major change in the AA has been the merge of
    the SEAL project with the ROOT project
  • Details of the merge are being discussed
    following a process defined by the AF
  • Breakdown into a number of topics
  • Proposals discussed with the experiments
  • Public presentations
  • Final decisions by the AF
  • Current status
  • Dictionary plans approved
  • MathCore and Vector libraries proposals have been
    approved
  • First development release of ROOT including these
    new libraries

10
Ongoing work
  • SPI
  • Porting LCG-AA software to amd64 (gcc 3.4.4)
  • Finalizing software distribution based on Pacman
  • QA tools: test coverage and Savannah reports
  • ROOT
  • Development version v5.02 released last week
  • Including new libraries: mathcore, reflex,
    cintex, roofit
  • POOL
  • Version 2.1 released, including new file catalog
    implementations: LFCCatalog (LFC), GliteCatalog
    (gLite Fireman), GTCatalog (Globus Toolkit)
  • New version of Conditions DB (COOL) 1.2
  • Adapting POOL to new dictionaries (Reflex)
  • SIMU
  • New Geant4 public minor release 7.1 is being
    prepared
  • Public release of Fluka expected by end July
  • Intense activity in the combined calorimeter
    physics validation with ATLAS, report in
    September.
  • New MC generators being added (CASCADE,
    CHARYBDIS, etc.) to the already long list of
    generators provided
  • Prototyping persistency of Geant4 geometry with
    ROOT

11
Fabric Area
12
CERN Fabric
  • Fabric automation has seen very good progress
  • The new systems for managing large farms are in
    production at CERN

13
CERN Fabric
  • Fabric automation has seen very good progress
  • The new systems for managing large farms are in
    production at CERN
  • New CASTOR Mass Storage System
  • Was deployed first on the high throughput cluster
    for the recent ALICE data recording computing
    challenge
  • Agreement on collaboration with Fermilab on Linux
    distribution
  • Scientific Linux based on Red Hat Enterprise 3
  • Improves uniformity between the HEP sites serving
    LHC and Run 2 experiments
  • CERN computer centre preparations
  • Power upgrade to 2.5 MW
  • Computer centre refurbishment well under way
  • Acquisition process started

14
Preparing for 7,000 boxes in 2008
15
High Throughput Prototype openlab/LCG
  • Experience with likely ingredients in LCG
  • 64-bit programming
  • next generation I/O (10 Gb Ethernet,
    Infiniband, etc.)
  • High performance cluster used for evaluations,
    and for data challenges with experiments
  • Flexible configuration
  • components moved in and out of production
    environment
  • Co-funded by industry and CERN

16
Alice Data Recording Challenge
  • Target: one week sustained at 450 MB/s (see the
    back-of-envelope check below)
  • Used the new version of the CASTOR mass storage
    system
  • Note the smooth degradation and recovery after
    equipment failure
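As a rough sanity check of the data volume this target implies, here is a minimal calculation; it assumes exactly 7 x 24 hours of uninterrupted running at the quoted rate and uses decimal units (both are assumptions, not figures from the slide):

    # Back-of-envelope data volume for the ALICE challenge target
    # (assumes 7 x 24 h of uninterrupted running at 450 MB/s, decimal units).
    rate_mb_per_s = 450
    seconds_per_week = 7 * 24 * 3600               # 604,800 s
    total_tb = rate_mb_per_s * seconds_per_week / 1_000_000   # MB -> TB
    print(f"~{total_tb:.0f} TB written in one week")           # ~272 TB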

17
Deployment and Operations
18
Computing Resources - June 2005
Number of sites is already at the scale expected
for LHC - demonstrates the full complexity of
operations
  • (Map legend: countries providing resources,
    countries anticipating joining)
  • In LCG-2:
  • 139 sites, 32 countries
  • 14,000 CPUs
  • 5 PB storage
  • Includes non-EGEE sites:
  • 18 sites in 9 countries

19
Operations Structure
  • Operations Management Centre (OMC)
  • At CERN: coordination, etc.
  • Core Infrastructure Centres (CIC)
  • Manage daily grid operations: oversight,
    troubleshooting
  • Run essential infrastructure services
  • Provide 2nd level support to ROCs
  • UK/I, Fr, It, CERN, Russia (M12)
  • Hope to get non-European centres
  • Regional Operations Centres (ROC)
  • Act as front-line support for user and operations
    issues
  • Provide local knowledge and adaptations
  • One in each region; many are distributed
  • User Support Centre (GGUS)
  • At FZK: support portal providing a single point
    of contact (service desk)

20
Grid Operations
  • The grid is flat, but...
  • Hierarchy of responsibility
  • Essential to scale the operation
  • CICs act as a single Operations Centre
  • Operational oversight (grid operator)
    responsibility rotates weekly between CICs (see
    the rota sketch below)
  • Report problems to ROC/RC
  • ROC is responsible for ensuring problem is
    resolved
  • ROC oversees regional RCs
  • ROCs responsible for organising the operations in
    a region
  • Coordinate deployment of middleware, etc
  • CERN coordinates sites not associated with a ROC

RC = Resource Centre
It is in setting up this operational
infrastructure that we have really benefited
from EGEE funding
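To illustrate the weekly rotation of operator duty mentioned above, here is a minimal sketch; the rota order and the use of ISO week numbers are assumptions for illustration, not the actual CIC schedule:

    # Hypothetical weekly grid-operator rota between CICs
    # (the ordering and week-number scheme are illustrative assumptions).
    import datetime

    CICS = ["CERN", "UK/I", "France", "Italy", "Russia"]

    def cic_on_duty(day: datetime.date) -> str:
        """Return the CIC holding operational oversight in the week containing `day`."""
        iso_week = day.isocalendar()[1]       # ISO week number of the year
        return CICS[iso_week % len(CICS)]

    print(cic_on_duty(datetime.date(2005, 7, 4)))   # e.g. the week of this talk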
21
Grid monitoring
  • Operation of Production Service: real-time
    display of grid operations
  • Accounting information
  • Selection of monitoring tools:
  • GIIS Monitor and Monitor Graphs
  • Site Functional Tests
  • GOC Data Base
  • Scheduled Downtimes
  • Live Job Monitor
  • GridIce: VO and fabric views
  • Certificate Lifetime Monitor

22
Operations focus
  • Main focus of activities now
  • Improving the operational reliability and
    application efficiency
  • Automating monitoring → alarms
  • Ensuring a 24x7 service
  • Removing sites that fail functional tests (a
    sketch of this policy follows the list)
  • Operations interoperability with OSG and others
  • Improving user support
  • Demonstrate to users a reliable and trusted
    support infrastructure
  • Deployment of gLite components
  • Testing, certification → pre-production service
  • Migration planning and deployment while
    maintaining/growing interoperability
  • Further developments now have to be driven by
    experience in real use
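To make the "alarms and removal" policy above concrete, here is a hypothetical sketch; the test names, result format and decision logic are invented for illustration and are not the actual Site Functional Tests tooling:

    # Hypothetical evaluation of site functional test (SFT) results
    # (test names and the pass/fail policy are illustrative, not the real SFT suite).
    CRITICAL_TESTS = {"job-submission", "replica-management", "ca-certificates"}

    def evaluate_site(site, results):
        """Map a site's critical-test results to an operational action."""
        failed = sorted(t for t in CRITICAL_TESTS if not results.get(t, False))
        if not failed:
            return f"{site}: OK"
        # A failed critical test raises an alarm for the operator on duty;
        # persistent failures make the site a candidate for removal.
        return f"{site}: ALARM - candidate for removal (failed: {', '.join(failed)})"

    print(evaluate_site("site-a.example.org",
                        {"job-submission": True, "replica-management": True,
                         "ca-certificates": True}))
    print(evaluate_site("site-b.example.org",
                        {"job-submission": False, "replica-management": True}))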

23
Recent ATLAS work
(Plot: number of jobs/day; 10,000 concurrent jobs in the system)
  • ATLAS jobs in EGEE/LCG-2 in 2005
  • In the latest period, up to 8K jobs/day
  • Several times the current capacity for ATLAS at
    CERN alone - shows the reality of the grid
    solution

24
Baseline Services and Service Challenges
25
Baseline Services Goals
  • Experiments and regional centres agree on
    baseline services
  • Support the computing models for the initial
    period of LHC
  • Thus must be in operation by September 2006.
  • Expose experiment plans and ideas
  • Timescales:
  • For the TDR: now
  • For SC3: testing, verification; not all
    components
  • For SC4: must have the complete set
  • Define services with targets for functionality
    and scalability/performance metrics
  • Very much driven by the experiments' needs
  • But try to understand site and other constraints

Not done yet
26
Baseline services
  • Nothing really surprising here, but a lot was
    clarified in terms of requirements,
    implementations, deployment, security, etc.
  • VO management services
  • Clear need for VOMS roles, groups, subgroups (an
    FQAN example follows this list)
  • POSIX-like I/O service
  • Local files, including links to catalogues
  • Grid monitoring tools and services
  • Focussed on job monitoring
  • VO agent framework
  • Applications software installation service
  • Reliable messaging service
  • Information system
  • Storage management services
  • Based on SRM as the interface
  • Basic transfer services
  • gridFTP, srmCopy
  • Reliable file transfer service
  • Grid catalogue services
  • Catalogue and data management tools
  • Database services
  • Required at Tier-1 and Tier-2
  • Compute Resource Services
  • Workload management
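As an illustration of how VOMS expresses groups, subgroups and roles, here is a minimal sketch that parses a Fully Qualified Attribute Name (FQAN) of the form /vo/group/.../Role=<role>; the example string is hypothetical and this is not the VOMS client API:

    # Minimal FQAN parser (illustrative; not the VOMS API).
    def parse_fqan(fqan):
        parts = [p for p in fqan.split("/") if p]
        groups = [p for p in parts if "=" not in p]
        attrs = dict(p.split("=", 1) for p in parts if "=" in p)
        return {
            "vo": groups[0] if groups else None,
            "group_path": "/" + "/".join(groups),
            "role": attrs.get("Role", "NULL"),
            "capability": attrs.get("Capability", "NULL"),
        }

    # Hypothetical example: a production role in an experiment VO
    print(parse_fqan("/atlas/production/Role=lcgadmin"))
    # {'vo': 'atlas', 'group_path': '/atlas/production',
    #  'role': 'lcgadmin', 'capability': 'NULL'}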

27
Preliminary Priorities
A = High priority, mandatory service
B = Standard solutions required; experiments could
select different implementations
C = Common solutions desirable, but not essential
Service ALICE ATLAS CMS LHCb
Storage Element A A A A
Basic transfer tools A A A A
Reliable file transfer service A A A/B A
Catalogue services B B B B
Catalogue and data management tools C C C C
Compute Element A A A A
Workload Management B A A C
VO agents A A A A
VOMS A A A A
Database services A A A A
Posix-I/O C C C C
Application software installation C C C C
Job monitoring tools C C C C
Reliable messaging service C C C C
Information system A A A A
28
Service Challenges - ramp up to LHC start-up
service
  • Jun 2005 - Technical Design Report
  • Sep 2005 - SC3 service phase
  • May 2006 - SC4 service phase
  • Sep 2006 - Initial LHC service in stable operation
  • Apr 2007 - LHC service commissioned
See Jamie's talk for more details
  • SC2 - Reliable data transfer (disk - network -
    disk): 5 Tier-1s, aggregate 500 MB/s sustained at
    CERN
  • SC3 - Reliable base service: most Tier-1s, some
    Tier-2s; basic experiment software chain; grid
    data throughput 500 MB/s, including mass storage
    (25% of the nominal final throughput for the
    proton period; see the quick calculation after
    this list)
  • SC4 - All Tier-1s, major Tier-2s; capable of
    supporting the full experiment software chain,
    including analysis; sustain the nominal final
    grid data throughput
  • LHC Service in Operation - September 2006; ramp
    up to full operational capacity by April 2007;
    capable of handling twice the nominal data
    throughput
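A quick calculation using only the figures quoted on this slide (an implication of those numbers, not an additional requirement): if 500 MB/s is 25% of the nominal final throughput for the proton period, then:

    # Implied rates, derived only from the numbers quoted above.
    sc3_rate_mb_s = 500                       # SC3 target, including mass storage
    nominal_mb_s = sc3_rate_mb_s / 0.25       # 500 MB/s is 25% of nominal
    print(nominal_mb_s)                       # 2000.0 MB/s nominal, proton period
    print(2 * nominal_mb_s)                   # 4000.0 MB/s = twice nominal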

29
Baseline Services, Service Challenges, Production
Service, Pre-production service, gLite
deployment, ...
  • Confused?

30
Services
  • Baseline services
  • Are the set of essential services that the
    experiments need to be in production by September
    2006
  • Verify components in SC3, SC4
  • Service challenges
  • The ramp-up of the LHC computing environment,
    building up the production service based on the
    results and lessons of the service challenges
  • Production service
  • The evolving service, putting in place new
    components prototyped in SC3 and SC4
  • No big-bang changes, but many releases!!!
  • gLite deployment
  • As new components are certified, will be added to
    the production service releases, either in
    parallel with or replacing existing services
  • Pre-production service
  • Should literally be a preview of the production
    service
  • But at the moment it is a demonstration of gLite
    services - this has been forced on us by many
    other constraints (urgency to deploy gLite, need
    for reasonable-scale testing, ...)

31
Releases and Distributions
  • We intend to maintain a single line of production
    middleware distributions
  • Middleware releases from JRA1, VDT, LCG, ...
  • Middleware distributions for deployment from
    GDA/SA1
  • Remember: the announcement of a release is months
    away from a deployable distribution (based on the
    last 2 years' experience)
  • Distributions still labelled LCG-2.x.x
  • Would like to change to something less specific
    to avoid LCG/EGEE confusion
  • Frequent updates for Service challenge sites
  • But only needed for SC sites
  • Frequent updates as gLite is deployed
  • Not clear if all sites will deploy all gLite
    components immediately
  • This is unavoidable
  • A strong request from LHC experiment spokesmen to
    the LCG POB
  • early, gradual and frequent releases of the
    baseline services are essential, rather than
    waiting for a complete set

Throughout all this, we must maintain a reliable
production service, which gradually improves in
reliability and performance
32
Summary
  • We are at the end of LCG Phase 1
  • Good time to step back and look at achievements
    and issues
  • LCG Phase 2 has really started
  • Consolidation of AA projects
  • Baseline services
  • Service challenges and experiment data challenges
  • Acquisitions process starting
  • No new developments → make what we have work
    absolutely reliably, and be scalable and
    performant
  • Timescale is extremely tight
  • Must ensure that we have appropriate levels of
    effort committed