1
  • LHC Computing Grid Project LCG
  • Ian Bird LCG Deployment Manager
  • IT Department, CERN
  • Geneva, Switzerland
  • BNL
  • March 2005
  • ian.bird@cern.ch

2
Overview
  • LCG Project Overview
  • Overview of main project areas
  • Deployment and Operations
  • Current LCG-2 Status
  • Operations and issues
  • Plans for migration to gLite
  • Service Challenges
  • Interoperability
  • Outlook and Summary

3
LHC Computing Grid Project
  • Aim of the project
  • To prepare, deploy and operate the computing
    environment
  • for the experiments to analyse the data from the
    LHC detectors

Applications – development environment, common
tools and frameworks
Build and operate the LHC computing service
The Grid is just a tool towards achieving this
goal
4
Project Areas and Management

Project Leader: Les Robertson
Resource Manager: Chris Eck
Planning Officer: Jürgen Knobloch
Administration: Fabienne Baud-Lavigne
Joint with EGEE

Distributed Analysis – ARDA (Massimo Lamanna):
prototyping of distributed end-user analysis using
grid technology
Applications Area (Torre Wenaus; Pere Mato from
1 March 05): development environment, joint projects,
data management, distributed analysis
Middleware Area (Frédéric Hemmer): provision of a
base set of grid middleware (acquisition,
development, integration); testing, maintenance,
support
CERN Fabric Area (Bernd Panzer): large cluster
management, data recording, cluster technology,
networking, computing service at CERN
Grid Deployment Area (Ian Bird): establishing and
managing the Grid Service – middleware
certification, security, operations, registration,
authorisation, accounting
5
Relation with EGEE
6
Applications Area
  • All Applications Area projects have software
    deployed in production by the experiments
  • POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA,
    Savannah
  • 400 TB of POOL data produced in 2004
  • Pre-release of Conditions Database (COOL)
  • 3D project will help POOL and COOL in terms of
    scalability
  • 3D = Distributed Deployment of Databases
  • Geant4 successfully used in ATLAS, CMS and LHCb
    Data Challenges with excellent reliability
  • GENSER MC generator library in production
  • Progress on integrating ROOT with other
    Applications Area components
  • Improved I/O package used by POOL; common
    dictionary and maths library with SEAL
  • Pere Mato (CERN, LHCb) has taken over from Torre
    Wenaus (BNL, ATLAS) as Applications Area Manager
  • Plan for next phase of the applications area
    being developed for internal review at end of
    March

7
The ARDA project
  • ARDA is an LCG project
  • main activity is to enable LHC analysis on the
    grid
  • ARDA is contributing to EGEE NA4
  • uses the entire CERN NA4-HEP resource
  • Interface with the new EGEE middleware (gLite)
  • By construction, use the new middleware
  • Use the grid software as it matures
  • Verify the components in an analysis environment
    (users!)
  • Provide early and continuous feedback
  • Support the experiments in the evolution of their
    analysis systems
  • Forum for activity within LCG/EGEE and with other
    projects/initiatives

8
ARDA activity with the experiments
  • The complexity of the field requires great care
    during the phase of middleware evolution and
    delivery
  • Complex (evolving) requirements
  • New use cases to be explored (for HEP
    large-scale analysis)
  • Different communities in the loop - LHC
    experiments, middleware experts from the
    experiments and other communities providing large
    middleware stacks (CMS GEOD, US OSG, LHCb Dirac,
    etc)
  • The complexity of the experiment-specific part is
    comparable to (and often larger than) the general one
  • The experiments do require seamless access to a
    set of sites (computing resources), but the real
    usage (and therefore the benefit for the LHC
    scientific programme) will come from exploiting the
    possibility to build their computing systems on a
    flexible and dependable infrastructure
  • How to progress?
  • Build end-to-end prototype systems for the
    experiments to allow end users to perform
    analysis tasks

9
LHC prototype overview
10
LHC experiments prototypes (ARDA)
All prototypes have been demoed within the
corresponding user communities
11
CERN Fabric
12
CERN Fabric
  • Fabric automation has seen very good progress
  • The new systems for managing large farms have been
    in production at CERN since January
  • New CASTOR Mass Storage System
  • Being deployed first on the high throughput
    cluster for the ongoing ALICE data recording
    computing challenge
  • Agreement on collaboration with Fermilab on Linux
    distribution
  • Scientific Linux based on Red Hat Enterprise 3
  • Improves uniformity between the HEP sites serving
    LHC and Run 2 experiments
  • CERN computer centre preparations
  • Power upgrade to 2.5 MW
  • Computer centre refurbishment well under way
  • Acquisition process started

13
Preparing for 7,000 boxes in 2008
14
High Throughput Prototype (openlab LCG
prototype)
  • Experience with likely ingredients in LCG
  • 64-bit programming
  • next generation I/O (10 Gb Ethernet, Infiniband,
    etc.)
  • High performance cluster used for evaluations,
    and for data challenges with experiments
  • Flexible configuration – components moved in and
    out of production environment
  • Co-funded by industry and CERN

Prototype hardware and connectivity (from the slide
diagram): 2 × 50 Itanium 2 (dual 1.3/1.5 GHz, 2 GB
memory); 80 IA32 CPU servers (dual 2.4 GHz P4, 1 GB
memory); 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB
memory); 80 IA32 CPU servers (dual 2.8 GHz P4, 2 GB
memory); 36 disk servers (dual P4, IDE disks, 1 TB
each); 24 disk servers (P4, SATA disks, 2 TB each);
28 TB IBM StorageTank; 12 tape servers (STK 9940B);
4 Enterasys N7 10 GE switches and 2 Enterasys
X-Series; 4 GE connections to the backbone; 10 GE WAN
connection; 1–10 GE per node.
15
Alice Data Recording Challenge
  • Target one week sustained at 450 MB/sec
  • Used the new version of Castor mass storage
    system
  • Note smooth degradation and recovery after
    equipment failure
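
A rough cross-check of the scale implied by that target (simple arithmetic, assuming the quoted rate is held for the full week):

\[ 450\ \text{MB/s} \times 7 \times 86\,400\ \text{s} \approx 2.7 \times 10^{14}\ \text{bytes} \approx 270\ \text{TB in the week} \]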

16
Deployment and Operations
17
  • LHC Computing Model (simplified!!)
  • Tier-0 the accelerator centre
  • Filter → raw data → reconstruction →
    event summary data (ESD)
  • Record the master copy of raw and ESD
  • Tier-1
  • Managed Mass Storage – permanent storage of raw data,
    ESD, calibration data, meta-data, analysis data
    and databases → grid-enabled data service
  • Data-heavy (ESD-based) analysis
  • Re-processing of raw data
  • National, regional support
  • Online to the data acquisition process – high
    availability, long-term commitment
  • Tier-2
  • Well-managed, grid-enabled disk storage
  • End-user analysis – batch and interactive
  • Simulation

18
Computing Resources March 2005
  • Country providing resources
  • Country anticipating joining
  • In LCG-2
  • 121 sites, 32 countries
  • >12,000 CPUs
  • 5 PB storage
  • Includes non-EGEE sites
  • 9 countries
  • 18 sites

19
Infrastructure metrics
Countries, sites, and CPU available in LCG-2
production service
EGEE partner regions
Other collaborating sites
20
Service Usage
  • VOs and users on the production service
  • Active HEP experiments
  • 4 LHC, D0, CDF, Zeus, Babar
  • Active other VO
  • Biomed, ESR (Earth Sciences), Compchem, Magic
    (Astronomy), EGEODE (Geo-Physics)
  • 6 disciplines
  • Registered users in these VOs: 500
  • In addition to these there are many VOs that are
    local to a region, supported by their ROCs, but
    not yet visible across EGEE
  • Scale of work performed
  • LHC Data challenges 2004
  • >1 M SI2K-years of CPU time (1,000 CPU-years)
  • 400 TB of data generated, moved and stored
  • One VO achieved 4,000 simultaneous jobs (4 times the
    CERN grid capacity)
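
A hedged reading of how the two figures in the CPU-time bullet relate, assuming a typical 2004-era processor delivered roughly 1 kSI2K (an assumption, not stated on the slide):

\[ \frac{10^{6}\ \text{SI2K-years}}{\approx 10^{3}\ \text{SI2K per CPU}} \approx 10^{3}\ \text{CPU-years} \]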

Number of jobs processed/month
21
Current production software (LCG-2)
  • Evolution through 2003/2004
  • Focus has been on making these reliable and
    robust
  • rather than additional functionality
  • Respond to needs of users, admins, operators
  • The software stack is the following
  • Virtual Data Toolkit
  • Globus (2.4.x), Condor, etc
  • EU DataGrid project developed higher-level
    components
  • Workload management (RB, LB, etc) – see the
    job-submission sketch after this list
  • Replica Location Service (single central
    catalog), replica management tools
  • R-GMA as accounting and monitoring framework
  • VOMS being deployed now
  • Operations team re-worked components
  • Information system: MDS GRIS/GIIS → LCG-BDII
  • edg-rm tools replaced and augmented as lcg-utils
  • Developments on
  • Disk pool managers (dCache, DPM)
  • Not addressed by JRA1
  • Other tools as required
  • Maintenance agreements with
  • VDT team (inc Globus support)
  • DESY/FNAL - dCache
  • EGEE/LCG teams
  • WLM, VOMS, R-GMA, Data Management
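
To make the workload management entry above concrete, here is a minimal sketch of submitting a job through the EDG-derived Resource Broker from an LCG-2 User Interface. The JDL attributes used are standard ones, but the helper script itself, the file names, and the assumptions that edg-job-submit is on the PATH and that a valid grid proxy exists are illustrative, not taken from the slides:

```python
import subprocess
from pathlib import Path

# Minimal JDL (Job Description Language) for the EDG/LCG-2 workload management
# system: run /bin/hostname on a worker node and bring back stdout/stderr.
jdl = """\
Executable    = "/bin/hostname";
StdOutput     = "hostname.out";
StdError      = "hostname.err";
OutputSandbox = {"hostname.out", "hostname.err"};
"""

def submit(jdl_text: str, jdl_path: str = "hostname.jdl") -> None:
    """Write the JDL and hand it to the Resource Broker via edg-job-submit.

    Assumes an LCG-2 UI node with a valid grid proxy; the RB matches the job
    against the information system (BDII) and forwards it to a CE gatekeeper.
    """
    Path(jdl_path).write_text(jdl_text)
    # A '--vo <name>' option would normally select the virtual organisation;
    # it is omitted here because the VO name is site/experiment specific.
    subprocess.run(["edg-job-submit", jdl_path], check=True)

if __name__ == "__main__":
    submit(jdl)
```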

22
Software 2
  • Platform support
  • Was an issue – limited to Red Hat 7.3
  • Now ported to Scientific Linux (RHEL), Fedora,
    IA64, AIX, SGI
  • Another problem was the heaviness of installation
  • Now much improved and simpler, with simple
    installation tools that allow integration with
    existing fabric management tools
  • Much lighter installation on worker nodes – at the
    user level

23
Overall status
  • The production grid service is quite stable
  • The services are quite reliable
  • Remaining instabilities in the IS are being
    addressed
  • Sensitivity to site management
  • Problems in underlying services must be addressed
  • Work on stop-gap solutions (e.g. RB maintains
    state, Globus gridftp → reliable file transfer
    service)
  • The biggest problem is stability of sites
  • Configuration problems due to complexity of the
    middleware
  • Fabric management at less experienced sites
  • Job efficiency is not high, unless
  • Operations/Applications select stable sites (the BDII
    allows an application-specific view)
  • In large tests, selecting stable sites, we achieve
    >>90% efficiency
  • Operations workshop last November to address this
  • Fabric management working group – to write a fabric
    management cookbook
  • Tighten operational control of the grid –
    escalation procedures, removing bad sites
  • Complexity is in the number of sites, not the number
    of CPUs

24
Operations Structure
  • Operations Management Centre (OMC)
  • At CERN – coordination etc.
  • Core Infrastructure Centres (CIC)
  • Manage daily grid operations – oversight,
    troubleshooting
  • Run essential infrastructure services
  • Provide 2nd level support to ROCs
  • UK/I, Fr, It, CERN, Russia (M12)
  • Taipei also runs a CIC
  • Regional Operations Centres (ROC)
  • Act as front-line support for user and operations
    issues
  • Provide local knowledge and adaptations
  • One in each region – many are distributed
  • User Support Centre (GGUS)
  • At FZK – manages the problem tracking system (PTS),
    provides a single point of contact (service desk)
  • Not foreseen as such in the TA, but the need is clear

25
Grid Operations
  • The grid is flat, but
  • Hierarchy of responsibility
  • Essential to scale the operation
  • CICs act as a single Operations Centre
  • Operational oversight (grid operator)
    responsibility
  • rotates weekly between CICs
  • Report problems to ROC/RC
  • ROC is responsible for ensuring problem is
    resolved
  • ROC oversees regional RCs
  • ROCs responsible for organising the operations in
    a region
  • Coordinate deployment of middleware, etc
  • CERN coordinates sites not associated with a ROC

RC = Resource Centre
26
SLAs and 24x7
  • Start with service level definitions
  • What a site supports (apps, software, MPI,
    compilers, etc)
  • Levels of support (number of admins, hrs/day, on-call,
    operators)
  • Response time to problems
  • Define metrics to measure compliance (see the
    sketch after this list)
  • Publish metrics – performance of sites relative
    to their commitments
  • Remote monitoring/management of services
  • Can be considered for small sites
  • Middleware/services
  • Should cope with bad sites
  • Clarify what 24x7 means
  • Service should be available 24x7
  • Does not mean all sites must be available 24x7
  • Specific crucial services that justify cost
  • Classify services according to level of support
    required
  • Operations tools need to become more and more
    automated
  • Having an operating production infrastructure
    should not mean having staff on shift everywhere
  • best-effort support
  • The infrastructure (and applications) must adapt
    to failures
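
As a toy illustration of "define metrics to measure compliance", the sketch below computes a site's availability from periodic test results and compares it with the level the site committed to. Everything here – the data layout, the 95% target and the function names – is hypothetical, not something defined by LCG:

```python
from dataclasses import dataclass

@dataclass
class SiteReport:
    name: str
    committed_availability: float  # fraction the site signed up to, e.g. 0.95
    probes_ok: int                 # periodic test jobs that succeeded
    probes_total: int              # periodic test jobs submitted

def measured_availability(report: SiteReport) -> float:
    """Fraction of monitoring probes that succeeded in the reporting period."""
    return report.probes_ok / report.probes_total if report.probes_total else 0.0

def compliance(report: SiteReport) -> str:
    """Compare measured availability against the site's own commitment."""
    measured = measured_availability(report)
    status = "OK" if measured >= report.committed_availability else "BELOW SLA"
    return (f"{report.name}: measured {measured:.1%} vs "
            f"committed {report.committed_availability:.0%} -> {status}")

if __name__ == "__main__":
    # Hypothetical weekly numbers for two sites.
    for site in (SiteReport("SiteA", 0.95, 660, 672),
                 SiteReport("SiteB", 0.95, 512, 672)):
        print(compliance(site))
```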

27
Operational Security
  • Operational Security team in place
  • EGEE security officer, ROC security contacts
  • Concentrate on 3 activities
  • Incident response
  • Best practice advice for Grid Admins – creating a
    dedicated web site
  • Security Service Monitoring – evaluation
  • Incident Response
  • JSPG agreement on IR in collaboration with OSG
  • Update existing policy: "To guide the development
    of common capability for handling and response to
    cyber security incidents on Grids"
  • Basic framework for incident definition and
    handling
  • Site registration process in draft
  • Part of basic SLA
  • CA Operations
  • EUGridPMA – best practice, minimum standards,
    etc.
  • More and more CAs appearing
  • The security group and its work started in LCG and
    was from the start a cross-grid activity
  • Much was already in place at the start of EGEE –
    usage policy, registration process and infrastructure,
    etc.
  • We regard it as crucial that this activity
    remains broader than just EGEE

28
Policy – Joint Security Group
Incident Response
Certification Authorities
Audit Requirements
Usage Rules
Security Availability Policy
Application Development Network Admin Guide
User Registration
http://cern.ch/proj-lcg-security/documents.html
29
User Support
We have found that user support has 2 distinct
aspects
  • User support
  • Call centre/helpdesk
  • Coordinated through GGUS
  • ROCs as front-line
  • Task force in place to improve the service
  • VO Support
  • Was an oversight in the project and is not really
    provisioned
  • In LCG there is a team (5 FTE)
  • Help apps integrate with m/w
  • Direct 1:1 support
  • Understanding of needs
  • Act as advocate for app
  • This is really missing for the other apps –
    adaptation to the grid environment takes expertise

Global Grid User Support (GGUS) – single point of
contact, coordination of user support
Deployment Support – middleware problems
Operations Centres (CIC / ROC) – operations problems
Resource Centres (RC) – hardware problems
Application-specific user support – VO-specific
problems
User communities: LHC experiments, non-LHC
experiments, other communities (e.g. Biomed)
30
Certification process
  • Process was decisive to improve the middleware
  • The process is time consuming (5 releases in 2004)
  • Many sequential steps
  • Many different site layouts have to be tested
  • Format of internal and external releases differ
  • Multiple packaging formats (tool based, generic)
  • All components are treated equally
  • the same level of testing for non-vital and core
    components
  • new tools and tools in use by other projects are
    tested to the same level
  • Process to include new components is not
    transparent
  • Timing for releases is difficult
  • Users want releases now; sites want them scheduled
  • Upgrades need a long time to cover all sites
  • Some sites had problems becoming functional
    after an upgrade

31
Additional Input
  • Data Challenges
  • client libs need fast and frequent updates
  • core services need fast patches
    (functional/fixes)
  • applications need a transparent release
    preparation
  • many problems only become visible during full
    scale production
  • Configuration is a major problem on smaller sites
  • Operations Workshop
  • smaller sites can handle major upgrades only
    every 3 months
  • sites need to give input in the selection of new
    packages
  • resolve conflicts with local policies

32
Changes I
  • Simple Installation/Configuration Scripts
  • YAIM (Yet Another Install Method)
  • semi-automatic simple configuration management
  • based on scripts (easy to integrate into other
    frameworks)
  • all configuration for a site is kept in one file
    (see the sketch after this list)
  • APT (Advanced Package Tool) based installation of
    middleware RPMs
  • simple dependency management
  • updates (automatic or on demand)
  • no OS installation
  • Client libs packaged in addition as user space
    tar-ball
  • can be installed like application software
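
To illustrate the "one configuration file per site" idea, here is a small sketch that reads a shell-style key=value file (in the spirit of YAIM's site-info.def) and checks that the entries a node type needs are present. The file name, variable names and node-type mapping below are illustrative assumptions, not YAIM's actual contents:

```python
import re
from pathlib import Path

# Variables each node type is assumed to need (illustrative, not YAIM's real list).
REQUIRED = {
    "WN": ["CE_HOST", "SE_HOST", "WN_LIST"],
    "CE": ["CE_HOST", "BDII_HOST", "USERS_CONF"],
}

def load_site_config(path: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring comments and blank lines."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"^([A-Z_][A-Z0-9_]*)=(.*)$", line)
        if match:
            config[match.group(1)] = match.group(2).strip('"')
    return config

def check_node(config: dict[str, str], node_type: str) -> list[str]:
    """Return the required variables that are missing for this node type."""
    return [key for key in REQUIRED[node_type] if key not in config]

if __name__ == "__main__":
    cfg = load_site_config("site-info.def")   # the single per-site file
    missing = check_node(cfg, "WN")
    print("configuration complete" if not missing else f"missing: {missing}")
```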

33
Changes II
  • Different frequency of separate release types
  • client libs (UI, WN)
  • services (CE, SE)
  • core services (RB, BDII,..)
  • major releases (configuration changes, RPMs, new
    services)
  • updates (bug fixes) added any time to specific
    releases
  • non-critical components will be made available
    with reduced testing
  • Fixed release dates for major releases (allows
    planning)
  • every 3 months, sites have to upgrade within 3
    weeks
  • Minor releases every month
  • based on ranked components available at a
    specific date in the month
  • not mandatory for smaller RCs to follow
  • client libs will be installed as application
    level software
  • early access to pre-releases of new software for
    applications
  • client libs. will be made available on selected
    sites
  • services with functional changes are installed on
    EIS-Applications testbed
  • early feedback from applications

34
Certification Process
(Flattened process diagram. Recoverable content:
components are ready at a cutoff; integration and
first tests; internal releases and an internal client
release; full deployment on test clusters;
functional/stress tests for 1 week; bugs, patches and
tasks are tracked in Savannah; the GDB assigns and
updates cost; actors include the Developers, CT, EIS,
GIS, CICs, RCs, Applications and the Head of
Deployment.)
35
Deployment Process
(Flattened process diagram. Recoverable content:
certification is run daily; major releases every 3
months on fixed dates; minor releases every month;
sites upgrade at their own pace; EIS/GIS update the
release notes, installation guides and user guides
for each release; installation and configuration via
YAIM.)
36
Operations Procedures
  • Driven by experience during 2004 Data Challenges
  • Reflecting the outcome of the November Operations
    Workshop
  • Operations Procedures
  • roles of CICs - ROCs - RCs
  • weekly rotation of operations centre duties
    (CIC-on-duty)
  • daily tasks of the operations shift
  • monitoring (tools, frequency)
  • problem reporting
  • problem tracking system
  • communication with ROCs and RCs
  • escalation of unresolved problems
  • handing over the service to the next CIC

37
Implementation
  • Evolutionary Development
  • Procedures
  • documented (constantly adapted)
  • available at the CIC portal http://cic.in2p3.fr/
  • in use by the shift crews
  • Portal: http://cic.in2p3.fr
  • access to tools and process documentation
  • repository for logs and FAQs
  • provides means of efficient communication
  • provides condensed monitoring information
  • Problem tracking system
  • currently based on Savannah at CERN
  • is moving to the GGUS at FZK
  • exports/imports tickets to local systems used by
    the ROCs
  • Weekly Phone Conferences and Quarterly Meetings

38
Grid operator dashboard
CIC-on-duty Dashboard:
https://cic.in2p3.fr/pages/cic/framedashboard.html
39
Operator procedure
(Flattened escalation diagram. Recoverable content:
numbered steps (1)–(6) running from incident
detection through severity assessment, in-depth
diagnosis, testing help, reporting and follow-up to
escalation and closure; the CIC-on-duty uses
monitoring tools (GIIS monitor pages, GridIce, the
GOC monitoring tool), a wiki, Savannah and mailing
lists, and works with the ROC and RC.)
40
Selection of monitoring tools
41
Middleware
42
Architecture Design
  • A design team including representatives from the
    middleware providers (AliEn, Condor, EDG, Globus, ...),
    including US partners, produced the middleware
    architecture and design
  • Takes into account input and experiences from
    applications, operations, and related projects
  • DJRA1.1 EGEE Middleware Architecture (June
    2004)
  • https://edms.cern.ch/document/476451/
  • DJRA1.2 EGEE Middleware Design (August 2004)
  • https://edms.cern.ch/document/487871/
  • Much feedback from within the project (operations
    and applications) and from related projects
  • Being used and actively discussed by OSG,
    GridLab, etc. Input to various GGF groups

43
gLite Services and Responsible Clusters
(Architecture diagram: gLite services and the
responsible clusters (CERN, UK, IT/CZ, JRA3).
Access Services: Grid Access Service, API
Security Services: Authentication, Authorization,
Auditing
Information & Monitoring Services: Information &
Monitoring, Application Monitoring
Data Services: Metadata Catalog, File & Replica
Catalog, Storage Element, Data Management
Job Management Services: Accounting, Job Provenance,
Package Manager, Computing Element, Workload
Management, Site Proxy)
44
gLite Services for Release 1
(Same service architecture diagram as the previous
slide, highlighting the focus on key services
according to the gLite management taskforce.)
45
gLite Services for Release 1 – Software stack and
origin (simplified)
  • Computing Element
  • Gatekeeper (Globus)
  • Condor-C (Condor)
  • CE Monitor (EGEE)
  • Local batch system (PBS, LSF, Condor)
  • Workload Management
  • WMS (EDG)
  • Logging and bookkeeping (EDG)
  • Condor-C (Condor)
  • Storage Element
  • File Transfer/Placement (EGEE)
  • glite-I/O (AliEn)
  • GridFTP (Globus)
  • SRM: Castor (CERN), dCache (FNAL, DESY), other
    SRMs
  • Catalog
  • File and Replica Catalog (EGEE)
  • Metadata Catalog (EGEE)
  • Information and Monitoring
  • R-GMA (EDG)
  • Security
  • VOMS (DataTAG, EDG)
  • GSI (Globus)
  • Authentication for C and Java based (web)
    services (EDG)

46
Summary
  • WMS
  • Task Queue, Pull mode, Data management interface
  • Available in the prototype
  • Used in the testing testbed
  • Now working on the certification testbed
  • Submission to LCG-2 demonstrated
  • Catalog
  • MySQL and Oracle
  • Available in the prototype
  • Used in the testing testbed
  • Delivered to SA1
  • But not tested yet
  • gLite I/O
  • Available in the prototype
  • Used in the testing testbed
  • Basic functionality and stress test available
  • Delivered to SA1
  • But not tested yet
  • FTS
  • FTS is being evolved with LCG
  • Milestone on March 15, 2005
  • Stress tests in service challenges
  • UI
  • Available in the prototype
  • Includes data management
  • Not yet formally tested
  • R-GMA
  • Available in the prototype
  • Testing has shown deployment problems
  • VOMS
  • Available in the prototype
  • No tests available

47
Schedule
  • All of the Services are available now on the
    development testbed
  • User documentation currently being added
  • On a limited scale testbed
  • Most of the Services are being deployed on the
    LCG Preproduction Service
  • Initially at CERN, more sites once
    tested/validated
  • Scheduled in April-May
  • Schedule for deployment at major sites by the end
    of May
  • In time to be included in the LCG service
    challenge that must demonstrate full capability
    in July, prior to operating as a stable service in
    2H2005

48
Migration Strategy
  • Certify gLite components on existing LCG-2
    service
  • Deploy components in parallel – replacing with the
    new service once stability and functionality are
    demonstrated
  • WN tools and libs must co-exist on same cluster
    nodes
  • As far as possible must have a smooth transition

49
Service Challenges
50
Problem Statement
  • Robust File Transfer Service often seen as the
    goal of the LCG Service Challenges
  • Whilst it is clearly essential that we ramp up at
    CERN and the T1/T2 sites to meet the required
    data rates well in advance of LHC data taking,
    this is only one aspect
  • Getting all sites to acquire and run the
    infrastructure is non-trivial (managed disk
    storage, tape storage, agreed interfaces, 24 x
    365 service aspect, including during conferences,
    vacation, illnesses etc.)
  • Need to understand networking requirements and
    plan early
  • But transferring dummy files is not enough
  • Still have to show that basic infrastructure
    works reliably and efficiently
  • Need to test the experiments' Use Cases
  • Check for bottlenecks and limits in s/w, disk and
    other caches etc.
  • We can presumably write some test scripts to
    mock up the experiments' Computing Models (see the
    sketch after this list)
  • But the real test will be to run your s/w
  • Which requires strong involvement from production
    teams
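
A toy sketch of such a mock-up: drive repeated file transfers with whatever transfer command the site actually uses and report the sustained rate. The placeholder command, file size and threshold are all hypothetical (the 100 MB/s figure echoes the per-T1 disk target in the schedule a few slides below); a real script would call the experiment's own data management tools:

```python
import subprocess
import time

# Placeholder transfer command; a real test would invoke the site's actual
# transfer tool (e.g. a gridftp or SRM client) with real source/destination URLs.
TRANSFER_CMD = ["my-transfer-tool", "source-url", "destination-url"]
FILE_SIZE_MB = 1024          # assumed size of each test file
TARGET_RATE_MB_S = 100.0     # per-site disk target from the SC schedule

def run_transfers(n_files: int) -> float:
    """Run n_files transfers back to back and return the sustained MB/s."""
    start = time.time()
    for _ in range(n_files):
        subprocess.run(TRANSFER_CMD, check=True)
    elapsed = time.time() - start
    return n_files * FILE_SIZE_MB / elapsed

if __name__ == "__main__":
    rate = run_transfers(n_files=50)
    verdict = "meets" if rate >= TARGET_RATE_MB_S else "below"
    print(f"sustained {rate:.1f} MB/s ({verdict} the {TARGET_RATE_MB_S} MB/s target)")
```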

51
LCG Service Challenges - Overview
  • LHC will enter production (physics) in April 2007
  • Will generate an enormous volume of data
  • Will require huge amount of processing power
  • LCG solution is a world-wide Grid
  • Many components understood, deployed, tested..
  • But
  • Unprecedented scale
  • Humungous challenge of getting large numbers of
    institutes and individuals, all with existing,
    sometimes conflicting commitments, to work
    together
  • LCG must be ready at full production capacity,
    functionality and reliability in less than 2
    years from now
  • Issues include h/w acquisition, personnel hiring
    and training, vendor rollout schedules etc.
  • Should not limit the ability of physicists to exploit
    the performance of the detectors, nor the LHC's
    physics potential
  • Whilst being stable, reliable and easy to use

52
Key Principles
  • The service challenges result in a series of
    services that exist in parallel with the baseline
    production service
  • Rapidly and successively approach production
    needs of LHC
  • Initial focus: core (data management) services
  • Swiftly expand out to cover full spectrum of
    production and analysis chain
  • Must be as realistic as possible, including
    end-to-end testing of key experiment use-cases over
    extended periods with recovery from glitches and
    longer-term outages
  • Necessary resources and commitment are a
    pre-requisite to success!
  • This should not be under-estimated!

53
Initial Schedule (evolving)
  • Q1/Q2: up to 5 T1s, writing to disk at 100 MB/s
    per T1 (no experiments)
  • Q3/Q4: include two experiments, tape, and a few
    selected T2s
  • 2006: progressively add more T2s, more
    experiments, ramp up to twice the nominal data rate
  • 2006: production usage by all experiments at
    reduced rates (cosmics); validation of computing
    models
  • 2007: delivery and contingency
  • N.B. there is more detail in Dec / Jan / Feb GDB
    presentations

54
Key dates for Service Preparation
Jun 05 – Technical Design Report
Sep 05 – SC3 Service Phase
May 06 – SC4 Service Phase
Sep 06 – Initial LHC Service in stable operation
Apr 07 – LHC Service commissioned

SC2 – Reliable data transfer (disk–network–disk):
5 Tier-1s, aggregate 500 MB/sec sustained at CERN
SC3 – Reliable base service: most Tier-1s, some
Tier-2s; basic experiment software chain; grid data
throughput 500 MB/sec, including mass storage (25%
of the nominal final throughput for the proton
period)
SC4 – All Tier-1s, major Tier-2s capable of
supporting the full experiment software chain incl.
analysis; sustain nominal final grid data throughput
LHC Service in Operation – September 2006: ramp up
to full operational capacity by April 2007; capable
of handling twice the nominal data throughput
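
Read at face value, the SC3 line implies (an inference from the figures above, not a number stated on the slide):

\[ 500\ \text{MB/s} = 25\% \times R_{\text{nominal}} \;\Rightarrow\; R_{\text{nominal}} \approx 2\ \text{GB/s for the proton period} \]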

55
FermiLab Dec 04/Jan 05
  • FermiLab demonstrated 500 MB/s for 3 days in
    November

56
FTS stability
M not K !!!
57
Interoperability
58
Introduction – grid flavours
  • LCG-2 vs Grid3
  • Both use same VDT version
  • Globus 2.4.x
  • LCG-2 has components for WLM, IS, R-GMA, etc
  • Both use same information schema (GLUE)
  • The Grid3 schema is not all GLUE
  • Some small extensions by each
  • Both use MDS (BDII)
  • LCG-2 vs NorduGrid
  • NorduGrid uses modified version of Globus 2.x
  • Does not use the gatekeeper – different interface
  • Very different information schema
  • but does use MDS
  • Work done
  • With Grid3/OSG – strong contacts, many points of
    collaboration, etc.
  • With NorduGrid – discussions have started
  • Canada
  • Gateway into GridCanada and WestGrid (Globus
    based) in production
  • Catalogues
  • LCG-2: EDG-derived catalogue (for POOL)
  • Grid3 and NorduGrid: Globus RLS

59
Common areas (with Grid3/OSG)
  • Interoperation
  • Align Information Systems
  • Run jobs between LCG-2 and Grid3/NorduGrid
  • Storage interfaces – SRM
  • Reliable file transfer
  • Service challenges
  • Infrastructure
  • Security
  • Security policy – JSPG
  • Operational security
  • Both are explicitly common activities across all
    sites
  • Monitoring
  • Job monitoring
  • Grid monitoring
  • Accounting
  • Grid Operations
  • Common operations policies
  • Problem tracking

60
Interoperation
  • LCG-2 jobs on Grid3
  • A Grid3 site runs the LCG-developed generic info
    provider – fills their site GIIS with the missing
    info (GLUE schema)
  • From the LCG-2 BDII we can see Grid3 sites
  • Running a job on a Grid3 site needed:
  • G3 installs full set of LCG CAs
  • Added users into VOMS
  • WN installation (very lightweight now) installs
    on the fly
  • Grid3 jobs on LCG-2
  • Added Grid3 VO to our configuration
  • They point directly to the site (do not use IS
    for job submission)
  • Job submission LCG-2 → Grid3 has been
    demonstrated
  • NorduGrid can run generic info provider at a
    site
  • But requires work to use the NG clusters

61
Storage and file transfer
  • Storage interfaces
  • LCG-2, gLite, Open Science Grid all agree on SRM
    as basic interface to storage
  • SRM collaboration for >2 years, group in GGF
  • SRM interoperability has been demonstrated
  • LHCb use SRM in their stripping phase
  • Reliable file transfer
  • Work ongoing with Tier 1s (inc. FNAL, BNL,
    Triumf) in service challenges.
  • Agree that the interface is SRM, with srmcopy or
    gridftp as the transfer protocol
  • Reliable transfer software will run at all sites –
    already in place as part of the service challenges

62
Operations
  • Several points where collaboration will happen
  • Started from LCG and OSG operations workshops
  • Operational security/incident response
  • Common site charter/service definitions possible?
  • Collaboration on operations centres (CIC-on-duty)
    ?
  • Operations monitoring
  • Common schema for problem description/views –
    allow tools to understand both?
  • Common metrics for performance and reliability
  • Common site and application validation suites
    (for LHC apps)
  • Accounting
  • Grid3 and LCG-2 use GGF schema
  • Agree to publish into common tool (NG could too)
  • Job monitoring
  • LCG-2 Logging and Bookkeeping – well-defined set
    of states
  • Agreeing a common set will allow common tools to
    view job states in any grid
  • Need good high-level (web) tools for display –
    users could track jobs easily across grids

63
  • Outlook and Summary
  • LHC startup is very close. Services have to be in
    place 6 months earlier
  • The Service Challenge program is the ramp-up
    process
  • All aspects are really challenging!
  • Now that the experiment computing models have been
    published, we are trying to clarify what services LCG
    must provide and what their interfaces need to
    be
  • Baseline services working group
  • An enormous amount of work has been done by the
    various grid projects
  • Already at full complexity and scale foreseen for
    startup
  • But there are still significant problems to address –
    functionality and stability of operations
  • Time to bring these efforts together to build the
    solution for LHC

Thank you for your attention !