Title: LHC Computing Grid Project LCG
1- LHC Computing Grid Project LCG
- Ian Bird LCG Deployment Manager
- IT Department, CERN
- Geneva, Switzerland
- BNL
- March 2005
- ian.bird@cern.ch
2 - Overview
- LCG Project overview
- Overview of main project areas
- Deployment and operations
  - Current LCG-2 status
  - Operations and issues
  - Plans for migration to gLite
- Service challenges
- Interoperability
- Outlook and summary
3 - LHC Computing Grid Project
- Aim of the project
  - To prepare, deploy and operate the computing environment for the experiments to analyse the data from the LHC detectors
- Applications development environment, common tools and frameworks
- Build and operate the LHC computing service
- The Grid is just a tool towards achieving this goal
4 - Project Areas and Management
- Project Leader: Les Robertson; Resource Manager: Chris Eck; Planning Officer: Jürgen Knobloch; Administration: Fabienne Baud-Lavigne
- Applications Area (Torre Wenaus; Pere Mato from 1 March 05): development environment, joint projects, data management, distributed analysis
- Distributed Analysis - ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology
- Middleware Area (Frédéric Hemmer, joint with EGEE): provision of a base set of grid middleware (acquisition, development, integration), testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management, data recording, cluster technology, networking, computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the Grid Service - middleware certification, security, operations, registration, authorisation, accounting
5 - Relation with EGEE
6 - Applications Area
- All Applications Area projects have software deployed in production by the experiments
  - POOL, SEAL, ROOT, Geant4, GENSER, PI/AIDA, Savannah
  - 400 TB of POOL data produced in 2004
- Pre-release of the Conditions Database (COOL)
- 3D project will help POOL and COOL in terms of scalability
  - 3D = Distributed Deployment of Databases
- Geant4 successfully used in ATLAS, CMS and LHCb Data Challenges with excellent reliability
- GENSER MC generator library in production
- Progress on integrating ROOT with other Applications Area components
  - Improved I/O package used by POOL; common dictionary and maths library with SEAL
- Pere Mato (CERN, LHCb) has taken over from Torre Wenaus (BNL, ATLAS) as Applications Area Manager
- Plan for the next phase of the applications area being developed for internal review at the end of March
7 - The ARDA project
- ARDA is an LCG project
  - Its main activity is to enable LHC analysis on the grid
- ARDA is contributing to EGEE NA4
  - Uses the entire CERN NA4-HEP resource
- Interface with the new EGEE middleware (gLite)
  - By construction, use the new middleware
  - Use the grid software as it matures
  - Verify the components in an analysis environment (users!)
  - Provide early and continuous feedback
- Support the experiments in the evolution of their analysis systems
- Forum for activity within LCG/EGEE and with other projects/initiatives
8 - ARDA activity with the experiments
- The complexity of the field requires great care in the phase of middleware evolution and delivery
  - Complex (evolving) requirements
  - New use cases to be explored (for HEP large-scale analysis)
- Different communities in the loop: LHC experiments, middleware experts from the experiments, and other communities providing large middleware stacks (CMS GEOD, US OSG, LHCb Dirac, etc.)
- The complexity of the experiment-specific part is comparable to (often larger than) the general one
- The experiments do require seamless access to a set of sites (computing resources), but the real usage (and therefore the benefit for the LHC scientific programme) will come from exploiting the possibility to build their computing systems on a flexible and dependable infrastructure
- How to progress?
  - Build end-to-end prototype systems for the experiments to allow end users to perform analysis tasks
9 - LHC prototype overview
10 - LHC experiment prototypes (ARDA)
All prototypes have been demoed within the corresponding user communities
11 - CERN Fabric
12 - CERN Fabric
- Fabric automation has seen very good progress
  - The new systems for managing large farms have been in production at CERN since January
- New CASTOR Mass Storage System
  - Being deployed first on the high-throughput cluster for the ongoing ALICE data recording computing challenge
- Agreement on collaboration with Fermilab on Linux distribution
  - Scientific Linux, based on Red Hat Enterprise 3
  - Improves uniformity between the HEP sites serving LHC and Run 2 experiments
- CERN computer centre preparations
  - Power upgrade to 2.5 MW
  - Computer centre refurbishment well under way
  - Acquisition process started
13 - Preparing for 7,000 boxes in 2008
14 - High Throughput Prototype (openlab LCG prototype)
- Experience with likely ingredients in LCG
  - 64-bit programming
  - next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
- High-performance cluster used for evaluations, and for data challenges with experiments
- Flexible configuration - components moved in and out of the production environment
- Co-funded by industry and CERN
[Diagram: cluster layout - 4 GE connections to the backbone; 10 GE WAN connection; 4 Enterasys N7 10 GE switches and 2 Enterasys X-Series; ~50 Itanium 2 nodes (dual 1.3/1.5 GHz, 2 GB memory); 80 + 80 IA32 CPU servers (dual 2.4/2.8 GHz P4, 1-2 GB memory) and 40 IA32 CPU servers (dual 2.4 GHz P4, 1 GB memory); 24 disk servers (P4, SATA disks, 2 TB each) and 36 disk servers (dual P4, IDE disks, 1 TB each); 28 TB IBM StorageTank; 12 STK 9940B tape servers; 10 GE or 1 GE per node]
15 - ALICE Data Recording Challenge
- Target: one week sustained at 450 MB/sec
- Used the new version of the CASTOR mass storage system
- Note smooth degradation and recovery after equipment failure
16 - Deployment and Operations
17 - LHC Computing Model (simplified!!)
- Tier-0: the accelerator centre
  - Filter → raw data → reconstruction → event summary data (ESD)
  - Record the master copy of raw and ESD
- Tier-1
  - Managed mass storage - permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases → grid-enabled data service
  - Data-heavy (ESD-based) analysis
  - Re-processing of raw data
  - National, regional support
  - Online to the data acquisition process: high availability, long-term commitment
- Tier-2
  - Well-managed, grid-enabled disk storage
  - End-user analysis - batch and interactive
  - Simulation
18 - Computing Resources - March 2005
- In LCG-2: 121 sites, 32 countries; >12,000 CPUs; 5 PB storage
- Includes non-EGEE sites: 9 countries, 18 sites
- (Map legend: country providing resources / country anticipating joining)
19 - Infrastructure metrics
Countries, sites, and CPU available in the LCG-2 production service
(Chart legend: EGEE partner regions / other collaborating sites)
20 - Service Usage
- VOs and users on the production service
  - Active HEP experiments: the 4 LHC experiments, D0, CDF, Zeus, Babar
  - Other active VOs: Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geo-Physics) - 6 disciplines
  - Registered users in these VOs: 500
  - In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
- Scale of work performed
  - LHC data challenges 2004: >1 M SI2K-years of CPU time (~1000 CPU-years)
  - 400 TB of data generated, moved and stored
  - 1 VO achieved 4000 simultaneous jobs (4 times CERN grid capacity)
(Chart: number of jobs processed per month)
21 - Current production software (LCG-2)
- Evolution through 2003/2004
  - Focus has been on making these reliable and robust, rather than on additional functionality
  - Respond to needs of users, admins, operators
- The software stack is the following
  - Virtual Data Toolkit: Globus (2.4.x), Condor, etc.
  - Higher-level components developed by the EU DataGrid project: workload management (RB, LB, etc.); Replica Location Service (single central catalog) and replica management tools; R-GMA as accounting and monitoring framework; VOMS (being deployed now)
  - Components re-worked by the operations team: information system MDS GRIS/GIIS → LCG-BDII; edg-rm tools replaced and augmented as lcg-utils
  - Developments on disk pool managers (dCache, DPM) - not addressed by JRA1 - and other tools as required
- Maintenance agreements with
  - the VDT team (incl. Globus support)
  - DESY/FNAL - dCache
  - EGEE/LCG teams - WLM, VOMS, R-GMA, data management
22 - Software (2)
- Platform support
  - Was an issue: limited to RedHat 7.3
  - Now ported to Scientific Linux (RHEL), Fedora, IA64, AIX, SGI
- Another problem was the heaviness of installation
  - Now much improved: simple installation tools allow integration with existing fabric management tools
  - Much lighter installation on worker nodes - user level
23 - Overall status
- The production grid service is quite stable
  - The services are quite reliable
  - Remaining instabilities in the IS are being addressed
  - Sensitivity to site management
- Problems in underlying services must be addressed
  - Work on stop-gap solutions (e.g. RB maintains state; Globus gridftp → reliable file transfer service)
- The biggest problem is the stability of sites
  - Configuration problems due to the complexity of the middleware
  - Fabric management at less experienced sites
- Job efficiency is not high unless operations/applications select stable sites (the BDII allows an application-specific view)
  - In large tests, selecting stable sites, >>90% efficiency is achieved
- Operations workshop last November to address this
  - Fabric management working group - write a fabric management cookbook
  - Tighten operations control of the grid - escalation procedures, removing bad sites
- Complexity is in the number of sites, not the number of CPUs
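The site-selection idea above (an application-specific view restricted to stable sites) can be sketched as a filter over per-site job statistics. The site names, counts and threshold below are invented for illustration:

```python
# Sketch: select stable sites by job success efficiency, mimicking an
# application-specific view of the information system.

def efficiency(done, submitted):
    """Fraction of submitted jobs that finished successfully."""
    return done / submitted if submitted else 0.0

def stable_sites(stats, threshold=0.90):
    """Return sites whose job efficiency meets the threshold."""
    return sorted(site for site, (done, sub) in stats.items()
                  if efficiency(done, sub) >= threshold)

job_stats = {           # site -> (done jobs, submitted jobs); made-up numbers
    "site-a": (980, 1000),
    "site-b": (400, 1000),   # a badly configured site
    "site-c": (950, 1000),
}

print(stable_sites(job_stats))   # only the well-behaved sites remain
```

Submitting only to the surviving sites is what pushes the observed job efficiency above 90% in large tests.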
24 - Operations Structure
- Operations Management Centre (OMC)
  - At CERN - coordination etc.
- Core Infrastructure Centres (CIC)
  - Manage daily grid operations - oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, Fr, It, CERN, Russia (M12)
  - Taipei also runs a CIC
- Regional Operations Centres (ROC)
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region - many distributed
- User Support Centre (GGUS)
  - At FZK - manages the PTS - provides a single point of contact (service desk)
  - Not foreseen as such in the TA, but the need is clear
25 - Grid Operations
- The grid is flat, but there is a hierarchy of responsibility
  - Essential to scale the operation
- CICs act as a single Operations Centre
  - Operational oversight (grid operator) responsibility rotates weekly between CICs
  - Report problems to ROC/RC
  - The ROC is responsible for ensuring the problem is resolved
  - The ROC oversees regional RCs
- ROCs are responsible for organising the operations in a region
  - Coordinate deployment of middleware, etc.
- CERN coordinates sites not associated with a ROC
(RC = Resource Centre)
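The weekly CIC-on-duty rotation can be sketched with simple date arithmetic. The centre list follows the slides; the rotation start date and ordering are invented assumptions for illustration:

```python
# Sketch: weekly rotation of the CIC-on-duty role among the Core
# Infrastructure Centres.  Start date and order are assumptions.
from datetime import date

CICS = ["UK/I", "France", "Italy", "CERN", "Russia", "Taipei"]
ROTATION_START = date(2005, 1, 3)   # assumed first Monday of a rotation

def cic_on_duty(day):
    """Return the centre holding operational oversight in the week of `day`."""
    weeks = (day - ROTATION_START).days // 7
    return CICS[weeks % len(CICS)]

print(cic_on_duty(date(2005, 1, 3)))    # first week of the rotation
```

A real rota would also encode the hand-over procedure to the next CIC at the end of each week.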
26 - SLAs and 24x7
- Start with service level definitions
  - What a site supports (apps, software, MPI, compilers, etc.)
  - Levels of support (# admins, hrs/day, on-call, operators)
  - Response time to problems
- Define metrics to measure compliance
  - Publish metrics - performance of sites relative to their commitments
- Remote monitoring/management of services can be considered for small sites
- Middleware/services should cope with bad sites
- Clarify what 24x7 means
  - The service should be available 24x7
  - This does not mean all sites must be available 24x7
  - Specific crucial services justify the cost
  - Classify services according to the level of support required
- Operations tools need to become more and more automated
- Having an operating production infrastructure should not mean having staff on shift everywhere
  - best-effort support
  - The infrastructure (and applications) must adapt to failures
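The "publish metrics relative to commitments" step above can be sketched as a small compliance report. The availability figures and the 95% commitment are invented examples, not real SLA values:

```python
# Sketch: measure site availability against an SLA commitment and
# publish a compliance flag per site.  All numbers are illustrative.

def availability(up_hours, total_hours):
    """Fraction of the period during which the site was up."""
    return up_hours / total_hours

def compliance_report(sites, commitment=0.95):
    """Map site -> (measured availability, met commitment?)."""
    return {name: (round(availability(up, tot), 3),
                   availability(up, tot) >= commitment)
            for name, (up, tot) in sites.items()}

month = {"site-a": (720, 744), "site-b": (600, 744)}  # hours up / in month
print(compliance_report(month))
```

Publishing such a table regularly is what turns the service level definitions into something sites can be held to.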
27 - Operational Security
- Operational Security team in place
  - EGEE security officer, ROC security contacts
- Concentrate on 3 activities
  - Incident response
  - Best practice advice for grid admins - creating a dedicated web site
  - Security service monitoring - evaluation
- Incident response
  - JSPG agreement on IR, in collaboration with OSG
  - Update existing policy to guide the development of a common capability for handling and responding to cyber-security incidents on grids
  - Basic framework for incident definition and handling
- Site registration process in draft
  - Part of the basic SLA
- CA operations
  - EUGridPMA - best practice, minimum standards, etc.
  - More and more CAs appearing
- The security group and work started in LCG was from the start a cross-grid activity
  - Much was already in place at the start of EGEE: usage policy, registration process and infrastructure, etc.
  - We regard it as crucial that this activity remains broader than just EGEE
28 - Policy - Joint Security Group
- Incident Response
- Certification Authorities
- Audit Requirements
- Usage Rules
- Security and Availability Policy
- Application Development and Network Admin Guide
- User Registration
http://cern.ch/proj-lcg-security/documents.html
29 - User Support
We have found that user support has 2 distinct aspects:
- User support
  - Call centre/helpdesk
  - Coordinated through GGUS
  - ROCs as front line
  - Task force in place to improve the service
- VO support
  - Was an oversight in the project and is not really provisioned
  - In LCG there is a team (5 FTE): help apps integrate with m/w, direct 1:1 support, understanding of needs, acting as advocate for the app
  - This is really missing for the other apps - adaptation to the grid environment takes expertise
[Diagram: support flow - Global Grid User Support (GGUS): single point of contact, coordination of user support; Deployment Support: middleware problems; Operations Centres (CIC/ROC): operations problems; Resource Centres (RC): hardware problems; application-specific user support: VO-specific problems; serving the LHC experiments, non-LHC experiments, and other communities, e.g. Biomed]
30 - Certification process
- The process was decisive in improving the middleware
- But the process is time-consuming (5 releases in 2004)
  - Many sequential steps
  - Many different site layouts have to be tested
  - The formats of internal and external releases differ
  - Multiple packaging formats (tool-based, generic)
- All components are treated equally
  - same level of testing for non-vital and core components
  - new tools and tools in use by other projects are tested to the same level
- The process to include new components is not transparent
- Timing of releases is difficult
  - Users want releases now; sites want them scheduled
  - Upgrades need a long time to cover all sites
  - Some sites had problems becoming functional again after an upgrade
31 - Additional Input
- Data challenges
  - client libs need fast and frequent updates
  - core services need fast patches (functional/fixes)
  - applications need transparent release preparation
  - many problems only become visible during full-scale production
  - configuration is a major problem at smaller sites
- Operations workshop
  - smaller sites can handle major upgrades only every 3 months
  - sites need to give input on the selection of new packages
  - resolve conflicts with local policies
32 - Changes I
- Simple installation/configuration scripts
  - YAIM (Yet Another Install Method)
  - semi-automatic, simple configuration management
  - based on scripts (easy to integrate into other frameworks)
  - all configuration for a site is kept in one file
- APT (Advanced Package Tool) based installation of middleware RPMs
  - simple dependency management
  - updates (automatic or on demand)
  - no OS installation
- Client libs additionally packaged as a user-space tar-ball
  - can be installed like application software
33 - Changes II
- Different frequencies for separate release types
  - client libs (UI, WN)
  - services (CE, SE)
  - core services (RB, BDII, ...)
  - major releases (configuration changes, RPMs, new services)
  - updates (bug fixes) added at any time to specific releases
  - non-critical components will be made available with reduced testing
- Fixed release dates for major releases (allows planning)
  - every 3 months; sites have to upgrade within 3 weeks
- Minor releases every month
  - based on ranked components available at a specific date in the month
  - not mandatory for smaller RCs to follow
  - client libs will be installed as application-level software
- Early access to pre-releases of new software for applications
  - client libs will be made available on selected sites
  - services with functional changes are installed on the EIS applications testbed
  - early feedback from applications
34 - Certification Process
[Diagram: certification workflow - components ready at cutoff are delivered by developers; CT performs integration and first tests; internal releases (including an internal client release) go to EIS/GIS; full deployment on test clusters, then functional/stress tests (~1 week) involving the CICs; bugs, patches and tasks are tracked in Savannah; the GDB and the Head of Deployment assign and update costs; applications and RCs feed back into the cycle]
35 - Deployment Process
[Diagram: deployment workflow - YAIM-based major releases every 3 months on fixed dates; EIS/GIS update the release notes, installation guides and user guides; minor releases every month; certification is run daily; sites upgrade at their own pace]
36 - Operations Procedures
- Driven by experience during the 2004 data challenges
- Reflecting the outcome of the November operations workshop
- Operations procedures
  - roles of CICs - ROCs - RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
  - daily tasks of the operations shift
  - monitoring (tools, frequency)
  - problem reporting
  - problem tracking system
  - communication with ROCs and RCs
  - escalation of unresolved problems
  - handing over the service to the next CIC
37 - Implementation
- Evolutionary development
- Procedures
  - documented (constantly adapted)
  - available at the CIC portal http://cic.in2p3.fr/
  - in use by the shift crews
- Portal http://cic.in2p3.fr
  - access to tools and process documentation
  - repository for logs and FAQs
  - provides means of efficient communication
  - provides condensed monitoring information
- Problem tracking system
  - currently based on Savannah at CERN
  - moving to GGUS at FZK
  - exports/imports tickets to/from the local systems used by the ROCs
- Weekly phone conferences and quarterly meetings
38 - Grid operator dashboard
CIC-on-duty dashboard: https://cic.in2p3.fr/pages/cic/framedashboard.html
39 - Operator procedure
Escalation
40 - Selection of monitoring tools
41 - Middleware
42 - Architecture and Design
- A design team including representatives from middleware providers (AliEn, Condor, EDG, Globus, ...), including US partners, produced the middleware architecture and design
- Takes into account input and experiences from applications, operations, and related projects
- DJRA1.1 EGEE Middleware Architecture (June 2004)
  - https://edms.cern.ch/document/476451/
- DJRA1.2 EGEE Middleware Design (August 2004)
  - https://edms.cern.ch/document/487871/
- Much feedback from within the project (operations and applications) and from related projects
- Being used and actively discussed by OSG, GridLab, etc.; input to various GGF groups
43 - gLite Services and Responsible Clusters
[Diagram: gLite service decomposition with responsible clusters (JRA3, UK, CERN, IT/CZ) - Access Services: Grid Access Service, API; Security Services: Authentication, Authorization, Auditing; Information and Monitoring Services: Information and Monitoring, Application Monitoring; Data Services: Metadata Catalog, File and Replica Catalog, Storage Element, Data Management; Job Management Services: Accounting, Job Provenance, Package Manager, Computing Element, Workload Management; Site Proxy]
44 - gLite Services for Release 1
[Diagram: the same service decomposition as the previous slide, with the focus on key services according to the gLite management taskforce]
45 - gLite Services for Release 1 - Software stack and origin (simplified)
- Computing Element
  - Gatekeeper (Globus)
  - Condor-C (Condor)
  - CE Monitor (EGEE)
  - Local batch system (PBS, LSF, Condor)
- Workload Management
  - WMS (EDG)
  - Logging and bookkeeping (EDG)
  - Condor-C (Condor)
- Storage Element
  - File Transfer/Placement (EGEE)
  - glite-I/O (AliEn)
  - GridFTP (Globus)
  - SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
- Catalog
  - File and Replica Catalog (EGEE)
  - Metadata Catalog (EGEE)
- Information and Monitoring
  - R-GMA (EDG)
- Security
  - VOMS (DataTAG, EDG)
  - GSI (Globus)
  - Authentication for C- and Java-based (web) services (EDG)
46 - Summary
- WMS
  - Task queue, pull mode, data management interface
  - Available in the prototype; used in the testing testbed
  - Now working on the certification testbed
  - Submission to LCG-2 demonstrated
- Catalog
  - MySQL and Oracle
  - Available in the prototype; used in the testing testbed
  - Delivered to SA1, but not tested yet
- gLite I/O
  - Available in the prototype; used in the testing testbed
  - Basic functionality and stress tests available
  - Delivered to SA1, but not tested yet
- FTS
  - FTS is being evolved with LCG
  - Milestone on March 15, 2005
  - Stress tests in service challenges
- UI
  - Available in the prototype
  - Includes data management
  - Not yet formally tested
- R-GMA
  - Available in the prototype
  - Testing has shown deployment problems
- VOMS
  - Available in the prototype
  - No tests available
47 - Schedule
- All of the services are available now on the development testbed
  - User documentation currently being added
  - On a limited-scale testbed
- Most of the services are being deployed on the LCG pre-production service
  - Initially at CERN; more sites once tested/validated
  - Scheduled in April-May
- Deployment at major sites is scheduled by the end of May
  - In time to be included in the LCG service challenge, which must demonstrate full capability in July before operating as a stable service in 2H2005
48 - Migration Strategy
- Certify gLite components on the existing LCG-2 service
- Deploy components in parallel, replacing the old with the new service once stability and functionality are demonstrated
  - WN tools and libs must co-exist on the same cluster nodes
- As far as possible there must be a smooth transition
49 - Service Challenges
50 - Problem Statement
- A robust file transfer service is often seen as the goal of the LCG Service Challenges
  - Whilst it is clearly essential that we ramp up at CERN and the T1/T2 sites to meet the required data rates well in advance of LHC data taking, this is only one aspect
  - Getting all sites to acquire and run the infrastructure is non-trivial (managed disk storage, tape storage, agreed interfaces, 24 x 365 service aspect, including during conferences, vacation, illness, etc.)
  - Need to understand networking requirements and plan early
- But transferring dummy files is not enough
  - Still have to show that the basic infrastructure works reliably and efficiently
  - Need to test the experiments' use cases
  - Check for bottlenecks and limits in s/w, disk and other caches, etc.
  - We can presumably write some test scripts to mock up the experiments' computing models
  - But the real test will be to run your s/w
  - Which requires strong involvement from the production teams
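The "recovery from glitches" requirement amounts to retry logic around each transfer. This sketch uses a stand-in transfer function rather than a real FTS or gridftp call:

```python
# Sketch: a bounded retry loop for a flaky transfer, illustrating
# recovery from transient glitches.  `flaky` is a stand-in, not a
# real transfer client.
import time

def reliable_transfer(do_transfer, attempts=3, backoff=0.0):
    """Retry a flaky transfer a bounded number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return do_transfer()
        except IOError as err:        # glitch: back off, then retry
            last_error = err
            time.sleep(backoff)
    raise last_error

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient network glitch")
    return "transfer complete"

print(reliable_transfer(flaky))   # succeeds on the third attempt
```

A production service additionally has to survive longer-term outages, which means persisting the transfer queue rather than just retrying in memory.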
51 - LCG Service Challenges - Overview
- LHC will enter production (physics) in April 2007
  - Will generate an enormous volume of data
  - Will require a huge amount of processing power
- The LCG solution is a world-wide Grid
  - Many components understood, deployed, tested...
- But
  - Unprecedented scale
  - Humungous challenge of getting large numbers of institutes and individuals, all with existing, sometimes conflicting commitments, to work together
- LCG must be ready at full production capacity, functionality and reliability in less than 2 years from now
  - Issues include h/w acquisition, personnel hiring and training, vendor rollout schedules, etc.
- Should not limit the ability of physicists to exploit the performance of the detectors nor the LHC's physics potential
  - Whilst being stable, reliable and easy to use
52 - Key Principles
- The service challenges result in a series of services that exist in parallel with the baseline production service
- Rapidly and successively approach the production needs of LHC
- Initial focus: core (data management) services
- Swiftly expand to cover the full spectrum of the production and analysis chain
- Must be as realistic as possible, including end-to-end testing of key experiment use-cases over extended periods, with recovery from glitches and longer-term outages
- The necessary resources and commitment are a pre-requisite to success!
- Should not be under-estimated!
53 - Initial Schedule (evolving)
- Q1/Q2: up to 5 T1s, writing to disk at 100 MB/s per T1 (no experiments)
- Q3/Q4: include two experiments, tape, and a few selected T2s
- 2006: progressively add more T2s and more experiments; ramp up to twice the nominal data rate
- 2006: production usage by all experiments at reduced rates (cosmics); validation of computing models
- 2007: delivery and contingency
- N.B. there is more detail in the Dec/Jan/Feb GDB presentations
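A back-of-the-envelope check of the Q1/Q2 target above: five Tier-1s writing to disk at 100 MB/s each, sustained for a week, is on the order of 300 TB.

```python
# Sanity check: total data volume for 5 Tier-1s at 100 MB/s each,
# sustained for one week (decimal megabytes assumed).

MB = 10**6

def sustained_volume(rate_mb_s, days):
    """Total bytes moved at a sustained rate over `days` days."""
    return rate_mb_s * MB * days * 86400

aggregate = 5 * 100                      # 5 T1s at 100 MB/s each
total_tb = sustained_volume(aggregate, 7) / 10**12
print(f"{total_tb:.0f} TB in one week")
```

Numbers like this are why the ramp-up has to start years before data taking: even the no-experiment phase moves hundreds of terabytes.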
54 - Key dates for Service Preparation
- Jun 05 - Technical Design Report
- Sep 05 - SC3 service phase
- May 06 - SC4 service phase
- Sep 06 - initial LHC service in stable operation
- Apr 07 - LHC service commissioned
- SC2: reliable data transfer (disk-network-disk); 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3: reliable base service; most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (25% of the nominal final throughput for the proton period)
- SC4: all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain incl. analysis; sustain nominal final grid data throughput
- LHC service in operation from September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
55 - FermiLab, Dec 04/Jan 05
- FermiLab demonstrated 500 MB/s for 3 days in November
56 - FTS stability
M not K !!!
57 - Interoperability
58 - Introduction: grid flavours
- LCG-2 vs Grid3
  - Both use the same VDT version: Globus 2.4.x
  - LCG-2 has additional components for WLM, IS, R-GMA, etc.
  - Both use the same information schema (GLUE)
  - The Grid3 schema is not all GLUE; some small extensions by each
  - Both use MDS (BDII)
- LCG-2 vs NorduGrid
  - NorduGrid uses a modified version of Globus 2.x
  - Does not use the gatekeeper - different interface
  - Very different information schema, but does use MDS
- Work done
  - With Grid3/OSG: strong contacts, many points of collaboration, etc.
  - With NorduGrid: discussions have started
- Canada
  - Gateway into GridCanada and WestGrid (Globus-based) in production
- Catalogues
  - LCG-2: EDG-derived catalogue (for POOL)
  - Grid3 and NorduGrid: Globus RLS
59 - Common areas (with Grid3/OSG)
- Interoperation
  - Align information systems
  - Run jobs between LCG-2 and Grid3/NorduGrid
  - Storage interfaces: SRM
  - Reliable file transfer
  - Service challenges
  - Infrastructure
- Security
  - Security policy: JSPG
  - Operational security
  - Both are explicitly common activities across all sites
- Monitoring
  - Job monitoring
  - Grid monitoring
  - Accounting
- Grid operations
  - Common operations policies
  - Problem tracking
60 - Interoperation
- LCG-2 jobs on Grid3
  - A G3 site runs the LCG-developed generic info provider, filling its site GIIS with the missing GLUE-schema info
  - From the LCG-2 BDII one can then see G3 sites
  - Running a job on a Grid3 site needed: G3 installing the full set of LCG CAs; adding users into VOMS; WN installation (very lightweight now) installed on the fly
- Grid3 jobs on LCG-2
  - Added the Grid3 VO to our configuration
  - They point directly to the site (do not use the IS for job submission)
- Job submission between LCG-2 and Grid3 has been demonstrated
- NorduGrid can run the generic info provider at a site
  - But work is required to use the NG clusters
61 - Storage and file transfer
- Storage interfaces
  - LCG-2, gLite and Open Science Grid all agree on SRM as the basic interface to storage
  - SRM collaboration for >2 years; group in GGF
  - SRM interoperability has been demonstrated
  - LHCb use SRM in their stripping phase
- Reliable file transfer
  - Work ongoing with Tier-1s (incl. FNAL, BNL, Triumf) in the service challenges
  - Agreed that the interface is SRM, with srmcopy or gridftp as transfer protocol
  - Reliable transfer software will run at all sites - already in place as part of the service challenges
62 - Operations
- Several points where collaboration will happen
  - Started from the LCG and OSG operations workshops
- Operational security/incident response
- Common site charter/service definitions possible?
- Collaboration on operations centres (CIC-on-duty)?
- Operations monitoring
  - Common schema for problem description/views - allow tools to understand both?
  - Common metrics for performance and reliability
  - Common site and application validation suites (for LHC apps)
- Accounting
  - Grid3 and LCG-2 use the GGF schema
  - Agree to publish into a common tool (NG could too)
- Job monitoring
  - LCG-2 logging and bookkeeping has a well-defined set of states
  - Agreeing a common set will allow common tools to view job states in any grid
  - Need good high-level (web) tools for display, so a user could track jobs easily across grids
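The "agree a common set of job states" idea can be sketched as a translation table per grid. A few LCG-2 logging-and-bookkeeping state names are used below; the Grid3 state names and the common vocabulary itself are invented for illustration:

```python
# Sketch: map grid-specific job states onto an agreed common set so one
# tool can display jobs from either grid.  Grid3 names are hypothetical.

COMMON_STATES = {"queued", "running", "done", "failed"}

LCG2_MAP = {"Scheduled": "queued", "Running": "running",
            "Done": "done", "Aborted": "failed"}
GRID3_MAP = {"pending": "queued", "active": "running",   # invented names
             "finished": "done", "error": "failed"}

def common_state(grid, state):
    """Translate a native job state into the common vocabulary."""
    mapping = {"lcg2": LCG2_MAP, "grid3": GRID3_MAP}[grid]
    common = mapping[state]
    assert common in COMMON_STATES
    return common

print(common_state("lcg2", "Running"), common_state("grid3", "finished"))
```

With such a table in place, a cross-grid web tool only ever has to render the common vocabulary.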
63 - Outlook and Summary
- LHC startup is very close; services have to be in place 6 months earlier
  - The Service Challenge programme is the ramp-up process
  - All aspects are really challenging!
- Now that the experiment computing models have been published, we are trying to clarify what services LCG must provide and what their interfaces need to be
  - Baseline services working group
- An enormous amount of work has been done by the various grid projects
  - Already at the full complexity and scale foreseen for startup
  - But there are still significant problems to address in functionality and stability of operations
- Time to bring these efforts together to build the solution for LHC
Thank you for your attention!