Title: LHC Computing Grid Project LCG
1- LHC Computing Grid Project LCG
- Ian Bird LCG Deployment Manager
- IT Department, CERN
- Geneva, Switzerland
- BNL
- March 2005
- ian.bird@cern.ch
2 - Overview
- LCG Project overview
- Overview of main project areas
- Deployment and operations
  - Current LCG-2 status
  - Operations and issues
  - Plans for migration to gLite
- Service challenges
- Interoperability
- Outlook and summary
3 - LHC Computing Grid Project
- Aim of the project
  - To prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors
- Applications development environment, common tools and frameworks
- Build and operate the LHC computing service
- The Grid is just a tool towards achieving this goal
4 - Project Areas and Management
- Project Leader: Les Robertson; Resource Manager: Chris Eck; Planning Officer: Jürgen Knobloch; Administration: Fabienne Baud-Lavigne
- Applications Area (Torre Wenaus; Pere Mato from 1 March 05): development environment, joint projects, data management, distributed analysis
- Distributed Analysis - ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology
- Middleware Area (Frédéric Hemmer, joint with EGEE): provision of a base set of grid middleware (acquisition, development, integration), testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management, data recording, cluster technology, networking, computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the Grid Service - middleware certification, security, operations, registration, authorisation, accounting
5 - Relation with EGEE
6 - Applications Area
- All Applications Area projects have software deployed in production by the experiments
  - POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah
  - 400 TB of POOL data produced in 2004
- Pre-release of the Conditions Database (COOL)
- 3D project will help POOL and COOL in terms of scalability
  - 3D = Distributed Deployment of Databases
- Geant4 successfully used in ATLAS, CMS and LHCb Data Challenges with excellent reliability
- GENSER MC generator library in production
- Progress on integrating ROOT with other Applications Area components
  - Improved I/O package used by POOL; common dictionary and maths library with SEAL
- Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager
- Plan for the next phase of the applications area being developed for internal review at the end of March
7 - The ARDA project
- ARDA is an LCG project
  - Its main activity is to enable LHC analysis on the grid
- ARDA is contributing to EGEE NA4
  - Uses the entire CERN NA4-HEP resource
- Interface with the new EGEE middleware (gLite)
  - By construction, use the new middleware
  - Use the grid software as it matures
  - Verify the components in an analysis environment (users!)
  - Provide early and continuous feedback
- Support the experiments in the evolution of their analysis systems
- Forum for activity within LCG/EGEE and with other projects/initiatives
8 - ARDA activity with the experiments
- The complexity of the field requires great care in the phase of middleware evolution and delivery
  - Complex (evolving) requirements
  - New use cases to be explored (for HEP large-scale analysis)
- Different communities in the loop: LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc.)
- The complexity of the experiment-specific part is comparable to (often larger than) the general one
- The experiments do require seamless access to a set of sites (computing resources), but the real usage (and therefore the benefit for the LHC scientific programme) will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure
- How to progress?
  - Build end-to-end prototype systems for the experiments to allow end users to perform analysis tasks
9 - LHC prototype overview
10 - LHC experiment prototypes (ARDA)
All prototypes have been demoed within the corresponding user communities
11 - CERN Fabric
12 - CERN Fabric
- Fabric automation has seen very good progress
  - The new systems for managing large farms have been in production at CERN since January
- New CASTOR Mass Storage System
  - Being deployed first on the high-throughput cluster for the ongoing ALICE data recording computing challenge
- Agreement on collaboration with Fermilab on Linux distribution
  - Scientific Linux, based on Red Hat Enterprise 3
  - Improves uniformity between the HEP sites serving LHC and Run 2 experiments
- CERN computer centre preparations
  - Power upgrade to 2.5 MW
  - Computer centre refurbishment well under way
  - Acquisition process started
13 - Preparing for 7,000 boxes in 2008
14 - High Throughput Prototype (openlab LCG prototype)
- Experience with likely ingredients in LCG
  - 64-bit programming
  - next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
- High-performance cluster used for evaluations, and for data challenges with experiments
- Flexible configuration - components moved in and out of the production environment
- Co-funded by industry and CERN
[Diagram: cluster layout - 4 GE connections to the backbone; 10 GE WAN connection; 4 Enterasys N7 10 GE switches and 2 Enterasys X-Series; ~50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory); 80 + 80 IA32 CPU servers (dual 2.4/2.8 GHz P4, 1-2 GB memory) and 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory); 24 disk servers (P4, SATA disks, 2 TB each) and 36 disk servers (dual P4, IDE disks, 1 TB each); 28 TB IBM StorageTank; 12 STK 9940B tape servers; 10 GE or 1 GE per node]
15 - ALICE Data Recording Challenge
- Target: one week sustained at 450 MB/sec
- Used the new version of the CASTOR mass storage system
- Note smooth degradation and recovery after equipment failure
16 - Deployment and Operations
17 - LHC Computing Model (simplified!!)
- Tier-0: the accelerator centre
  - Filter → raw data → reconstruction → event summary data (ESD)
  - Record the master copy of raw and ESD
- Tier-1
  - Managed mass storage - permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
  - Data-heavy (ESD-based) analysis
  - Re-processing of raw data
  - National, regional support
  - Online to the data acquisition process: high availability, long-term commitment
- Tier-2
  - Well-managed, grid-enabled disk storage
  - End-user analysis - batch and interactive
  - Simulation
18 - Computing Resources - March 2005
- In LCG-2: 121 sites, 32 countries; >12,000 CPUs; 5 PB storage
- Includes non-EGEE sites: 9 countries, 18 sites
- (Map legend: country providing resources / country anticipating joining)
19 - Infrastructure metrics
Countries, sites, and CPU available in the LCG-2 production service
(Chart legend: EGEE partner regions / other collaborating sites)
20 - Service Usage
- VOs and users on the production service
  - Active HEP experiments: the 4 LHC experiments, D0, CDF, Zeus, Babar
  - Other active VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics) - 6 disciplines
  - Registered users in these VOs: 500
  - In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
- Scale of work performed
  - LHC data challenges 2004: >1 M SI2K-years of CPU time (~1000 CPU-years)
  - 400 TB of data generated, moved and stored
  - 1 VO achieved 4000 simultaneous jobs (4 times CERN grid capacity)
(Chart: number of jobs processed per month)
21 - Current production software (LCG-2)
- Evolution through 2003/2004
  - Focus has been on making these reliable and robust, rather than on additional functionality
  - Respond to needs of users, admins, operators
- The software stack is the following
  - Virtual Data Toolkit: Globus (2.4.x), Condor, etc.
  - Higher-level components developed by the EU DataGrid project: workload management (RB, LB, etc.); Replica Location Service (single central catalog) and replica management tools; R-GMA as accounting and monitoring framework; VOMS (being deployed now)
  - Components re-worked by the operations team: information system MDS GRIS/GIIS → LCG-BDII; edg-rm tools replaced and augmented as lcg-utils
  - Developments on disk pool managers (dCache, DPM) - not addressed by JRA1 - and other tools as required
- Maintenance agreements with
  - the VDT team (incl. Globus support)
  - DESY/FNAL - dCache
  - EGEE/LCG teams - WLM, VOMS, R-GMA, data management
22 - Software (2)
- Platform support
  - Was an issue: limited to RedHat 7.3
  - Now ported to Scientific Linux (RHEL), Fedora, IA64, AIX, SGI
- Another problem was the heaviness of installation
  - Now much improved: simple installation tools allow integration with existing fabric management tools
  - Much lighter installation on worker nodes - user level
23 - Overall status
- The production grid service is quite stable
  - The services are quite reliable
  - Remaining instabilities in the IS are being addressed
  - Sensitivity to site management
- Problems in underlying services must be addressed
  - Work on stop-gap solutions (e.g. RB maintains state; Globus gridftp → reliable file transfer service)
- The biggest problem is the stability of sites
  - Configuration problems due to the complexity of the middleware
  - Fabric management at less experienced sites
- Job efficiency is not high unless operations/applications select stable sites (the BDII allows an application-specific view)
  - In large tests, selecting stable sites, >>90% efficiency is achieved
- Operations workshop last November to address this
  - Fabric management working group - write a fabric management cookbook
  - Tighten operations control of the grid - escalation procedures, removing bad sites
- Complexity is in the number of sites, not the number of CPUs
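The site-selection idea above (an application-specific view restricted to stable sites) can be sketched as a filter over per-site job statistics. The site names, counts and threshold below are invented for illustration:

```python
# Sketch: select stable sites by job success efficiency, mimicking an
# application-specific view of the information system.

def efficiency(done, submitted):
    """Fraction of submitted jobs that finished successfully."""
    return done / submitted if submitted else 0.0

def stable_sites(stats, threshold=0.90):
    """Return sites whose job efficiency meets the threshold."""
    return sorted(site for site, (done, sub) in stats.items()
                  if efficiency(done, sub) >= threshold)

job_stats = {           # site -> (done jobs, submitted jobs); made-up numbers
    "site-a": (980, 1000),
    "site-b": (400, 1000),   # a badly configured site
    "site-c": (950, 1000),
}

print(stable_sites(job_stats))   # only the well-behaved sites remain
```

Submitting only to the surviving sites is what pushes the observed job efficiency above 90% in large tests.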
24 - Operations Structure
- Operations Management Centre (OMC)
  - At CERN - coordination etc.
- Core Infrastructure Centres (CIC)
  - Manage daily grid operations - oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, Fr, It, CERN, Russia (M12)
  - Taipei also runs a CIC
- Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region - many distributed
- User Support Centre (GGUS)
  - At FZK - manages the PTS - provides a single point of contact (service desk)
  - Not foreseen as such in the TA, but the need is clear
25 - Grid Operations
- The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
- CICs act as a single Operations Centre
  - Operational oversight (grid operator) responsibility rotates weekly between CICs
  - Report problems to ROC/RC
  - The ROC is responsible for ensuring the problem is resolved
  - The ROC oversees regional RCs
- ROCs are responsible for organising the operations in a region
  - Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
(RC = Resource Centre)
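The weekly CIC-on-duty rotation can be sketched with simple date arithmetic. The centre list follows the slides; the rotation start date and ordering are invented assumptions for illustration:

```python
# Sketch: weekly rotation of the CIC-on-duty role among the Core
# Infrastructure Centres.  Start date and order are assumptions.
from datetime import date

CICS = ["UK/I", "France", "Italy", "CERN", "Russia", "Taipei"]
ROTATION_START = date(2005, 1, 3)   # assumed first Monday of a rotation

def cic_on_duty(day):
    """Return the centre holding operational oversight in the week of `day`."""
    weeks = (day - ROTATION_START).days // 7
    return CICS[weeks % len(CICS)]

print(cic_on_duty(date(2005, 1, 3)))    # first week of the rotation
```

A real rota would also encode the hand-over procedure to the next CIC at the end of each week.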
26 - SLAs and 24x7
- Start with service level definitions
  - What a site supports (apps, software, MPI, compilers, etc.)
  - Levels of support (# admins, hrs/day, on-call, operators)
  - Response time to problems
- Define metrics to measure compliance
  - Publish metrics - performance of sites relative to their commitments
- Remote monitoring/management of services can be considered for small sites
- Middleware/services should cope with bad sites
- Clarify what 24x7 means
  - The service should be available 24x7
  - This does not mean all sites must be available 24x7
  - Specific crucial services justify the cost
  - Classify services according to the level of support required
- Operations tools need to become more and more automated
- Having an operating production infrastructure should not mean having staff on shift everywhere
  - best-effort support
  - The infrastructure (and applications) must adapt to failures
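The "publish metrics relative to commitments" step above can be sketched as a small compliance report. The availability figures and the 95% commitment are invented examples, not real SLA values:

```python
# Sketch: measure site availability against an SLA commitment and
# publish a compliance flag per site.  All numbers are illustrative.

def availability(up_hours, total_hours):
    """Fraction of the period during which the site was up."""
    return up_hours / total_hours

def compliance_report(sites, commitment=0.95):
    """Map site -> (measured availability, met commitment?)."""
    return {name: (round(availability(up, tot), 3),
                   availability(up, tot) >= commitment)
            for name, (up, tot) in sites.items()}

month = {"site-a": (720, 744), "site-b": (600, 744)}  # hours up / in month
print(compliance_report(month))
```

Publishing such a table regularly is what turns the service level definitions into something sites can be held to.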
27 - Operational Security
- Operational Security team in place
  - EGEE security officer, ROC security contacts
- Concentrate on 3 activities
  - Incident response
  - Best practice advice for grid admins - creating a dedicated web site
  - Security service monitoring - evaluation
- Incident response
  - JSPG agreement on IR, in collaboration with OSG
  - Update existing policy to guide the development of a common capability for handling and responding to cyber-security incidents on grids
  - Basic framework for incident definition and handling
- Site registration process in draft
  - Part of the basic SLA
- CA operations
  - EUGridPMA - best practice, minimum standards, etc.
  - More and more CAs appearing
- The security group and work started in LCG was from the start a cross-grid activity
  - Much was already in place at the start of EGEE: usage policy, registration process and infrastructure, etc.
  - We regard it as crucial that this activity remains broader than just EGEE
28 - Policy - Joint Security Group
- Incident Response
- Certification Authorities
- Audit Requirements
- Usage Rules
- Security and Availability Policy
- Application Development and Network Admin Guide
- User Registration
http://cern.ch/proj-lcg-security/documents.html
29 - User Support
We have found that user support has 2 distinct aspects:
- User support
  - Call centre/helpdesk
  - Coordinated through GGUS
  - ROCs as front line
  - Task force in place to improve the service
- VO support
  - Was an oversight in the project and is not really provisioned
  - In LCG there is a team (5 FTE): help apps integrate with m/w, direct 1:1 support, understanding of needs, acting as advocate for the app
  - This is really missing for the other apps - adaptation to the grid environment takes expertise
[Diagram: support flow - Global Grid User Support (GGUS): single point of contact, coordination of user support; Deployment Support: middleware problems; Operations Centres (CIC/ROC): operations problems; Resource Centres (RC): hardware problems; application-specific user support: VO-specific problems; serving the LHC experiments, non-LHC experiments, and other communities, e.g. Biomed]
30 - Certification process
- The process was decisive in improving the middleware
- But the process is time-consuming (5 releases in 2004)
  - Many sequential steps
  - Many different site layouts have to be tested
  - The formats of internal and external releases differ
  - Multiple packaging formats (tool-based, generic)
- All components are treated equally
  - same level of testing for non-vital and core components
  - new tools and tools in use by other projects are tested to the same level
- The process to include new components is not transparent
- Timing of releases is difficult
  - Users want releases now; sites want them scheduled
  - Upgrades need a long time to cover all sites
  - Some sites had problems becoming functional again after an upgrade
31 - Additional Input
- Data challenges
  - client libs need fast and frequent updates
  - core services need fast patches (functional/fixes)
  - applications need transparent release preparation
  - many problems only become visible during full-scale production
  - configuration is a major problem at smaller sites
- Operations workshop
  - smaller sites can handle major upgrades only every 3 months
  - sites need to give input on the selection of new packages
  - resolve conflicts with local policies
32 - Changes I
- Simple installation/configuration scripts
  - YAIM (Yet Another Install Method)
  - semi-automatic, simple configuration management
  - based on scripts (easy to integrate into other frameworks)
  - all configuration for a site is kept in one file
- APT (Advanced Package Tool) based installation of middleware RPMs
  - simple dependency management
  - updates (automatic or on demand)
  - no OS installation
- Client libs additionally packaged as a user-space tar-ball
  - can be installed like application software
33 - Changes II
- Different frequencies for separate release types
  - client libs (UI, WN)
  - services (CE, SE)
  - core services (RB, BDII, ...)
  - major releases (configuration changes, RPMs, new services)
  - updates (bug fixes) added at any time to specific releases
  - non-critical components will be made available with reduced testing
- Fixed release dates for major releases (allows planning)
  - every 3 months; sites have to upgrade within 3 weeks
- Minor releases every month
  - based on ranked components available at a specific date in the month
  - not mandatory for smaller RCs to follow
  - client libs will be installed as application-level software
- Early access to pre-releases of new software for applications
  - client libs will be made available on selected sites
  - services with functional changes are installed on the EIS applications testbed
  - early feedback from applications
34 - Certification Process
[Diagram: certification workflow - components ready at cutoff are delivered by developers; CT performs integration and first tests; internal releases (including an internal client release) go to EIS/GIS; full deployment on test clusters, then functional/stress tests (~1 week) involving the CICs; bugs, patches and tasks are tracked in Savannah; the GDB and the Head of Deployment assign and update costs; applications and RCs feed back into the cycle]
35 - Deployment Process
[Diagram: deployment workflow - YAIM-based major releases every 3 months on fixed dates; EIS/GIS update the release notes, installation guides and user guides; minor releases every month; certification is run daily; sites upgrade at their own pace]
36 - Operations Procedures
- Driven by experience during the 2004 data challenges
- Reflecting the outcome of the November operations workshop
- Operations procedures
  - roles of CICs - ROCs - RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
  - daily tasks of the operations shift
  - monitoring (tools, frequency)
  - problem reporting
  - problem tracking system
  - communication with ROCs and RCs
  - escalation of unresolved problems
  - handing over the service to the next CIC
37 - Implementation
- Evolutionary development
- Procedures
  - documented (constantly adapted)
  - available at the CIC portal http://cic.in2p3.fr/
  - in use by the shift crews
- Portal http://cic.in2p3.fr
  - access to tools and process documentation
  - repository for logs and FAQs
  - provides means of efficient communication
  - provides condensed monitoring information
- Problem tracking system
  - currently based on Savannah at CERN
  - moving to GGUS at FZK
  - exports/imports tickets to/from the local systems used by the ROCs
- Weekly phone conferences and quarterly meetings
38 - Grid operator dashboard
CIC-on-duty dashboard: https://cic.in2p3.fr/pages/cic/framedashboard.html
39 - Operator procedure
Escalation
40 - Selection of monitoring tools
41 - Middleware
42 - Architecture and Design
- A design team including representatives from middleware providers (AliEn, Condor, EDG, Globus, ...), including US partners, produced the middleware architecture and design
- Takes into account input and experiences from applications, operations, and related projects
- DJRA1.1 EGEE Middleware Architecture (June 2004)
  - https://edms.cern.ch/document/476451/
- DJRA1.2 EGEE Middleware Design (August 2004)
  - https://edms.cern.ch/document/487871/
- Much feedback from within the project (operations and applications) and from related projects
- Being used and actively discussed by OSG, GridLab, etc.; input to various GGF groups
43 - gLite Services and Responsible Clusters
[Diagram: gLite service decomposition with responsible clusters (JRA3, UK, CERN, IT/CZ) - Access Services: Grid Access Service, API; Security Services: Authentication, Authorization, Auditing; Information and Monitoring Services: Information and Monitoring, Application Monitoring; Data Services: Metadata Catalog, File and Replica Catalog, Storage Element, Data Management; Job Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management; Site Proxy]
44 - gLite Services for Release 1
[Diagram: the same service decomposition as the previous slide, with the focus on key services according to the gLite management taskforce]
45 - gLite Services for Release 1 - Software stack and origin (simplified)
- Computing Element
  - Gatekeeper (Globus)
  - Condor-C (Condor)
  - CE Monitor (EGEE)
  - Local batch system (PBS, LSF, Condor)
- Workload Management
  - WMS (EDG)
  - Logging and bookkeeping (EDG)
  - Condor-C (Condor)
- Storage Element
  - File Transfer/Placement (EGEE)
  - glite-I/O (AliEn)
  - GridFTP (Globus)
  - SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
- Catalog
  - File and Replica Catalog (EGEE)
  - Metadata Catalog (EGEE)
- Information and Monitoring
  - R-GMA (EDG)
- Security
  - VOMS (DataTAG, EDG)
  - GSI (Globus)
  - Authentication for C- and Java-based (web) services (EDG)
46 - Summary
- WMS
  - Task queue, pull mode, data management interface
  - Available in the prototype; used in the testing testbed
  - Now working on the certification testbed
  - Submission to LCG-2 demonstrated
- Catalog
  - MySQL and Oracle
  - Available in the prototype; used in the testing testbed
  - Delivered to SA1, but not tested yet
- gLite I/O
  - Available in the prototype; used in the testing testbed
  - Basic functionality and stress tests available
  - Delivered to SA1, but not tested yet
- FTS
  - FTS is being evolved with LCG
  - Milestone on March 15, 2005
  - Stress tests in service challenges
- UI
  - Available in the prototype
  - Includes data management
  - Not yet formally tested
- R-GMA
  - Available in the prototype
  - Testing has shown deployment problems
- VOMS
  - Available in the prototype
  - No tests available
47 - Schedule
- All of the services are available now on the development testbed
  - User documentation currently being added
  - On a limited-scale testbed
- Most of the services are being deployed on the LCG pre-production service
  - Initially at CERN; more sites once tested/validated
  - Scheduled in April-May
- Deployment at major sites is scheduled by the end of May
  - In time to be included in the LCG service challenge, which must demonstrate full capability in July before operating as a stable service in 2H2005
48 - Migration Strategy
- Certify gLite components on the existing LCG-2 service
- Deploy components in parallel, replacing the old with the new service once stability and functionality are demonstrated
  - WN tools and libs must co-exist on the same cluster nodes
- As far as possible there must be a smooth transition
49 - Service Challenges
50 - Problem Statement
- A robust file transfer service is often seen as the goal of the LCG Service Challenges
  - Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect
  - Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, 24 x 365 service aspect, including during conferences, vacation, illness, etc.)
  - Need to understand networking requirements and plan early
- But transferring dummy files is not enough
  - Still have to show that the basic infrastructure works reliably and efficiently
  - Need to test the experiments' use cases
  - Check for bottlenecks and limits in s/w, disk and other caches, etc.
  - We can presumably write some test scripts to mock up the experiments' computing models
  - But the real test will be to run your s/w
  - Which requires strong involvement from the production teams
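The "recovery from glitches" requirement amounts to retry logic around each transfer. This sketch uses a stand-in transfer function rather than a real FTS or gridftp call:

```python
# Sketch: a bounded retry loop for a flaky transfer, illustrating
# recovery from transient glitches.  `flaky` is a stand-in, not a
# real transfer client.
import time

def reliable_transfer(do_transfer, attempts=3, backoff=0.0):
    """Retry a flaky transfer a bounded number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return do_transfer()
        except IOError as err:        # glitch: back off, then retry
            last_error = err
            time.sleep(backoff)
    raise last_error

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient network glitch")
    return "transfer complete"

print(reliable_transfer(flaky))   # succeeds on the third attempt
```

A production service additionally has to survive longer-term outages, which means persisting the transfer queue rather than just retrying in memory.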
51 - LCG Service Challenges - Overview
- LHC will enter production (physics) in April 2007
  - Will generate an enormous volume of data
  - Will require a huge amount of processing power
- The LCG solution is a world-wide Grid
  - Many components understood, deployed, tested...
- But
  - Unprecedented scale
  - Humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together
- LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now
  - Issues include h/w acquisition, personnel hiring and training, vendor rollout schedules, etc.
- Should not limit the ability of physicists to exploit the performance of the detectors nor the LHC's physics potential
  - Whilst being stable, reliable and easy to use
52 - Key Principles
- The service challenges result in a series of services that exist in parallel with the baseline production service
- Rapidly and successively approach the production needs of LHC
- Initial focus: core (data management) services
- Swiftly expand to cover the full spectrum of the production and analysis chain
- Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods, with recovery from glitches and longer-term outages
- The necessary resources and commitment are a pre-requisite to success!
- Should not be under-estimated!
53 - Initial Schedule (evolving)
- Q1/Q2: up to 5 T1s, writing to disk at 100 MB/s per T1 (no experiments)
- Q3/Q4: include two experiments, tape, and a few selected T2s
- 2006: progressively add more T2s and more experiments; ramp up to twice the nominal data rate
- 2006: production usage by all experiments at reduced rates (cosmics); validation of computing models
- 2007: delivery and contingency
- N.B. there is more detail in the Dec/Jan/Feb GDB presentations
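A back-of-the-envelope check of the Q1/Q2 target above: five Tier-1s writing to disk at 100 MB/s each, sustained for a week, is on the order of 300 TB.

```python
# Sanity check: total data volume for 5 Tier-1s at 100 MB/s each,
# sustained for one week (decimal megabytes assumed).

MB = 10**6

def sustained_volume(rate_mb_s, days):
    """Total bytes moved at a sustained rate over `days` days."""
    return rate_mb_s * MB * days * 86400

aggregate = 5 * 100                      # 5 T1s at 100 MB/s each
total_tb = sustained_volume(aggregate, 7) / 10**12
print(f"{total_tb:.0f} TB in one week")
```

Numbers like this are why the ramp-up has to start years before data taking: even the no-experiment phase moves hundreds of terabytes.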
54 - Key dates for Service Preparation
- Jun 05 - Technical Design Report
- Sep 05 - SC3 service phase
- May 06 - SC4 service phase
- Sep 06 - initial LHC service in stable operation
- Apr 07 - LHC service commissioned
- SC2: reliable data transfer (disk-network-disk); 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (25% of the nominal final throughput for the proton period)
- SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain incl. analysis; sustain nominal final grid data throughput
- LHC service in operation from September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
55 - FermiLab, Dec 04/Jan 05
- FermiLab demonstrated 500 MB/s for 3 days in November
56 - FTS stability
M not K !!!
57 - Interoperability
58 - Introduction: grid flavours
- LCG-2 vs Grid3
  - Both use the same VDT version: Globus 2.4.x
  - LCG-2 has additional components for WLM, IS, R-GMA, etc.
  - Both use the same information schema (GLUE)
  - The Grid3 schema is not all GLUE; some small extensions by each
  - Both use MDS (BDII)
- LCG-2 vs NorduGrid
  - NorduGrid uses a modified version of Globus 2.x
  - Does not use the gatekeeper - different interface
  - Very different information schema, but does use MDS
- Work done
  - With Grid3/OSG: strong contacts, many points of collaboration, etc.
  - With NorduGrid: discussions have started
- Canada
  - Gateway into GridCanada and WestGrid (Globus-based) in production
- Catalogues
  - LCG-2: EDG-derived catalogue (for POOL)
  - Grid3 and NorduGrid: Globus RLS
59 - Common areas (with Grid3/OSG)
- Interoperation
  - Align information systems
  - Run jobs between LCG-2 and Grid3/NorduGrid
  - Storage interfaces: SRM
  - Reliable file transfer
  - Service challenges
  - Infrastructure
- Security
  - Security policy: JSPG
  - Operational security
  - Both are explicitly common activities across all sites
- Monitoring
  - Job monitoring
  - Grid monitoring
  - Accounting
- Grid operations
  - Common operations policies
  - Problem tracking
60 - Interoperation
- LCG-2 jobs on Grid3
  - A G3 site runs the LCG-developed generic info provider, filling its site GIIS with the missing GLUE-schema info
  - From the LCG-2 BDII one can then see G3 sites
  - Running a job on a Grid3 site needed: G3 installing the full set of LCG CAs; adding users into VOMS; WN installation (very lightweight now) installed on the fly
- Grid3 jobs on LCG-2
  - Added the Grid3 VO to our configuration
  - They point directly to the site (do not use the IS for job submission)
- Job submission between LCG-2 and Grid3 has been demonstrated
- NorduGrid can run the generic info provider at a site
  - But work is required to use the NG clusters
61 - Storage and file transfer
- Storage interfaces
  - LCG-2, gLite and Open Science Grid all agree on SRM as the basic interface to storage
  - SRM collaboration for >2 years; group in GGF
  - SRM interoperability has been demonstrated
  - LHCb use SRM in their stripping phase
- Reliable file transfer
  - Work ongoing with Tier-1s (incl. FNAL, BNL, Triumf) in the service challenges
  - Agreed that the interface is SRM, with srmcopy or gridftp as transfer protocol
  - Reliable transfer software will run at all sites - already in place as part of the service challenges
62 - Operations
- Several points where collaboration will happen
  - Started from the LCG and OSG operations workshops
- Operational security/incident response
- Common site charter/service definitions possible?
- Collaboration on operations centres (CIC-on-duty)?
- Operations monitoring
  - Common schema for problem description/views - allow tools to understand both?
  - Common metrics for performance and reliability
  - Common site and application validation suites (for LHC apps)
- Accounting
  - Grid3 and LCG-2 use the GGF schema
  - Agree to publish into a common tool (NG could too)
- Job monitoring
  - LCG-2 logging and bookkeeping has a well-defined set of states
  - Agreeing a common set will allow common tools to view job states in any grid
  - Need good high-level (web) tools for display, so a user could track jobs easily across grids
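The "agree a common set of job states" idea can be sketched as a translation table per grid. A few LCG-2 logging-and-bookkeeping state names are used below; the Grid3 state names and the common vocabulary itself are invented for illustration:

```python
# Sketch: map grid-specific job states onto an agreed common set so one
# tool can display jobs from either grid.  Grid3 names are hypothetical.

COMMON_STATES = {"queued", "running", "done", "failed"}

LCG2_MAP = {"Scheduled": "queued", "Running": "running",
            "Done": "done", "Aborted": "failed"}
GRID3_MAP = {"pending": "queued", "active": "running",   # invented names
             "finished": "done", "error": "failed"}

def common_state(grid, state):
    """Translate a native job state into the common vocabulary."""
    mapping = {"lcg2": LCG2_MAP, "grid3": GRID3_MAP}[grid]
    common = mapping[state]
    assert common in COMMON_STATES
    return common

print(common_state("lcg2", "Running"), common_state("grid3", "finished"))
```

With such a table in place, a cross-grid web tool only ever has to render the common vocabulary.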
63 - Outlook and Summary
- LHC startup is very close; services have to be in place 6 months earlier
  - The Service Challenge programme is the ramp-up process
  - All aspects are really challenging!
- Now that the experiment computing models have been published, we are trying to clarify what services LCG must provide and what their interfaces need to be
  - Baseline services working group
- An enormous amount of work has been done by the various grid projects
  - Already at the full complexity and scale foreseen for startup
  - But there are still significant problems to address in functionality and stability of operations
- Time to bring these efforts together to build the solution for LHC
Thank you for your attention!