Title: HENP Grid Testbeds, Applications and Demonstrations
1. HENP Grid Testbeds, Applications and Demonstrations
Ruth Pordes, Fermilab
- Rob Gardner
- University of Chicago
- CHEP03
- March 29, 2003
2. Overview
- High-altitude survey of contributions
  - group, application, testbed, services/tools
- Discuss common and recurring issues
  - grid building, services development, use
- Concluding thoughts
- Acknowledgement to all the speakers who gave fine presentations, and my apologies in advance for providing only this very limited sampling
3. Testbeds, applications, and development of tools and services
- Testbeds
  - AliEn grids
  - BaBar Grid
  - CrossGrid
  - DataTAG
  - EDG Testbed(s)
  - Grid Canada
  - IGT Testbed (US CMS)
  - Korean DataGrid
  - NorduGrid(s)
  - SAMGrid
  - US ATLAS Testbed
  - WorldGrid
- Evaluations
  - EDG testbed evaluations and experience in multiple experiments
  - Testbed management experience
- Applications
  - ALICE production
  - ATLAS production
  - BaBar analysis, file replication
  - CDF/D0 analysis
  - CMS production
  - LHCb production
  - Medical applications in Italy
  - PHENIX
  - Sloan Digital Sky Survey
- Tools development
  - Use cases (HEPCAL)
  - PROOF/Grid analysis
  - LCG POOL and grid catalogs
  - SRM, Magda
  - Clarens, Ganga, Genius, Grappa, JAS
4. EDG Testbed History (Emanuele Leonardi)
- Early running (ATLAS Phase 1 start)
  - Successes: matchmaking/job management; basic data management
  - Known problems: high-rate submissions; long FTP transfers; GASS cache coherency; race conditions in the gatekeeper; unstable MDS
- CMS stress test, Nov 30 - Dec 20 (CMS, ATLAS, LHCb, ALICE)
  - Successes: improved MDS stability; FTP transfers OK
  - Known problems: interactions with the RC
  - Intense use by applications!
  - Limitations: resource exhaustion; size of logical collections
5. Resumé of experiment DC use of EDG (see experiment talks elsewhere at CHEP) - Stephen Burke
- ATLAS were first, in August 2002. The aim was to repeat part of the Data Challenge. They found two serious problems, which were fixed in 1.3.
- The CMS stress-test production (Nov-Dec 2002) found more problems in the areas of job submission and RC handling, leading to 1.4.x.
- ALICE started production on Mar 4: 5,000 central Pb-Pb events, 9 TB, 40,000 output files, 120k CPU hours
  - Progressing with efficiency levels similar to CMS
  - About 5% done by Mar 14
  - Pull architecture
- LHCb started in mid-February
  - 70K events for physics
  - Like ALICE, using a pull architecture
- BaBar/D0
  - Have so far done small-scale tests
  - Larger scale planned with EDG 2
6. CMS Data Challenge 2002 on Grid (C. Grande)
- Two official CMS productions on the grid in 2002
- CMS-EDG Stress Test on the EDG testbed and CMS sites
  - 260K events, CMKIN and CMSIM steps
  - Top-down approach: more functionality but less robust, large manpower needed
- US-CMS IGT Production in the US
  - 1M events, Ntuple-only (full chain in a single job)
  - 500K up to CMSIM (two steps in a single job)
  - Bottom-up approach: less functionality but more stable, little manpower needed
- See talk by P. Capiluppi
7. CMS production components interfaced to EDG
- Four submitting UIs: Bologna/CNAF (IT), Ecole Polytechnique (FR), Imperial College (UK), Padova/INFN (IT)
- Several Resource Brokers (WMS), both CMS-dedicated and shared with other applications; one RB for each CMS UI, plus a backup
- Replica Catalog at CNAF; MDS (and II) at CERN and CNAF; VO server at NIKHEF
- CMS ProdTools on the UI (a minimal submission sketch follows below)
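The submission path just described can be sketched as a thin wrapper on the UI that hands a JDL file to dg-job-submit and falls back to a backup Resource Broker if the primary refuses the job. The function name, the configuration file names, and the use of a per-RB UI configuration file via --config are assumptions for illustration, not the actual ProdTools implementation.

```python
# Hypothetical sketch: submit a JDL file from a UI, retrying against a
# backup Resource Broker. The per-RB UI configuration files and the
# --config option are assumptions for illustration, not a documented recipe.
import subprocess

# One dedicated RB per CMS UI plus a backup (file names are made up).
RB_CONFIGS = ["ui_rb_primary.conf", "ui_rb_backup.conf"]

def submit_jdl(jdl_path: str) -> str:
    """Return the grid job identifier printed by dg-job-submit on success."""
    for config in RB_CONFIGS:
        result = subprocess.run(
            ["dg-job-submit", "--config", config, jdl_path],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return result.stdout.strip()
    raise RuntimeError(f"all Resource Brokers rejected {jdl_path}")

if __name__ == "__main__":
    print(submit_jdl("cmkin_assignment_00001.jdl"))
```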
8. CMS/EDG Production
- 260K events produced; 7 sec/event average; 2.5 sec/event peak (12-14 Dec)
[Plot: events produced vs. time, 30 Nov - 20 Dec (CMS Week); annotated "Upgrade of MW" and "Hit some limit of implementation"]
- See P. Capiluppi's talk
9. US-CMS IGT Production
- > 1M events
- 4.7 sec/event average; 2.5 sec/event peak (14-20 Dec 2002)
- Sustained efficiency about 44%
[Plot: events produced vs. time, 25 Oct - 28 Dec]
- See P. Capiluppi's talk (a back-of-the-envelope check of these rates follows below)
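As a quick consistency check, the quoted average rates roughly account for the length of each campaign. The sketch below assumes the sec/event figures are wall-clock averages over the whole running period (an assumption, not stated explicitly on the slides).

```python
# Back-of-the-envelope check of the quoted production rates, assuming the
# sec/event figures are wall-clock averages over the whole campaign.
def campaign_days(events: int, sec_per_event: float) -> float:
    """Wall-clock days needed at a given average rate."""
    return events * sec_per_event / 86_400

# CMS/EDG stress test: 260K events at ~7 s/event -> about 21 days,
# consistent with the 30 Nov - 20 Dec window.
print(f"CMS/EDG stress test: {campaign_days(260_000, 7.0):.1f} days")

# US-CMS IGT: ~1M events at ~4.7 s/event -> about 54 days, the same order
# as the 25 Oct - 28 Dec running period.
print(f"US-CMS IGT:          {campaign_days(1_000_000, 4.7):.1f} days")
```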
10. Grid in ATLAS DC1 (G. Poulard)
- Grids used in DC1: US-ATLAS testbed, EDG testbed, NorduGrid
- Roles ranged from reproducing part of Phase 1, through production of Phase 1 data and several tests, to full Phase 2 production
- See other ATLAS talks for more details
11. ATLAS DC1 Phase 1, July-August 2002 (G. Poulard)
- 3200 CPUs, 110 kSI95, 71,000 CPU-days
- 39 institutes in 18 countries:
  - Australia
  - Austria
  - Canada
  - CERN
  - Czech Republic
  - France
  - Germany
  - Israel
  - Italy
  - Japan
  - Nordic
  - Russia
  - Spain
  - Taiwan
  - UK
  - USA
- Grid tools used at 11 sites
- 5x10^7 events generated, 1x10^7 events simulated, 3x10^7 single particles, 30 TBytes, 35,000 files
12. Meta Systems (G. Graham)
- MCRunJob approach by the CMS production team
- A framework for dealing with multiple grid resources and testbeds (EDG, IGT); see the plug-in sketch below
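The "meta system" idea, one framework dispatching the same production request to different grid or testbed back-ends, can be sketched as a small plug-in registry. The class and function names here are illustrative only, not the actual MCRunJob interfaces.

```python
# Illustrative sketch of a meta-system dispatching one production request
# to different back-ends (names are hypothetical, not MCRunJob's own API).
from typing import Callable, Dict

BACKENDS: Dict[str, Callable[[dict], None]] = {}

def backend(name: str):
    """Register a submission function under a back-end name."""
    def register(func: Callable[[dict], None]):
        BACKENDS[name] = func
        return func
    return register

@backend("edg")
def submit_edg(request: dict) -> None:
    print(f"writing JDL and submitting {request['dataset']} to an EDG RB")

@backend("igt")
def submit_igt(request: dict) -> None:
    print(f"building a DAG and submitting {request['dataset']} via DAGMan/MOP")

def run(request: dict, target: str) -> None:
    BACKENDS[target](request)   # same request, different grid/testbed

run({"dataset": "jet_17", "events": 500_000}, target="igt")
```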
13. Hybrid production model (C. Grande)
[Diagram: a physics group asks for an official dataset; the Production Manager defines assignments in RefDB; a Site Manager starts an assignment at their site (or a user starts a private production). MCRunJob then drives the resources through one of several routes: shell scripts for a Local Batch Manager, JDL for the EDG Scheduler (LCG-1 testbed), DAGMan (MOP), or Chimera VDL with a Virtual Data Catalogue and Planner.]
14. Interoperability glue
15. Integrated Grid Systems
- Two examples of integrating advanced production and analysis across multiple grids:
  - SamGrid
  - AliEn
16. SamGrid Map
- CDF
  - Kyungpook National University, Korea
  - Rutgers State University, New Jersey, US
  - Rutherford Appleton Laboratory, UK
  - Texas Tech, Texas, US
  - University of Toronto, Canada
- DØ
  - Imperial College, London, UK
  - Michigan State University, Michigan, US
  - University of Michigan, Michigan, US
  - University of Texas at Arlington, Texas, US
17. Physics with SAM-Grid (S. Stonjek)
- Standard CDF analysis job submitted via SAM-Grid and executed somewhere
[Plot: z0(µ1) and z0(µ2) distributions for J/ψ → µ+µ- candidates]
18. The BaBar Grid as of March 2003 (D. Boutigny)
[Diagram: an RB, VO and RC serving multiple sites, each with a CE, SE and WNs]
- Special challenges faced by a running experiment with heterogeneous data requirements (ROOT, Objectivity)
19. Grid Applications, Interfaces, Portals
- Clarens
- Ganga
- Genius
- Grappa
- JAS-Grid
- Magda
- Proof-Grid
- ... and higher-level services
  - Storage Resource Manager (SRM)
  - Magda data management
  - POOL-Grid interface
20. PROOF and Data Grids (Fons Rademakers)
- Many grid services are a good fit:
  - Authentication
  - File catalog, replication services
  - Resource brokers
  - Monitoring
  - → Use abstract interfaces (see the sketch below)
- Phased integration
  - Static configuration
  - Use of one or multiple Grid services
  - Driven by Grid infrastructure
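A minimal sketch of the "abstract interfaces" point: the analysis layer depends only on abstract service interfaces, and a concrete binding (here a trivial static one, matching the phased-integration idea of starting from a static configuration) is plugged in behind them. All class names are illustrative, not PROOF or AliEn APIs.

```python
# Minimal sketch of the "abstract interfaces" idea: the analysis layer
# depends only on abstract services; concrete grid bindings are plugged in.
# All class names here are illustrative, not PROOF or AliEn APIs.
from abc import ABC, abstractmethod
from typing import List

class FileCatalog(ABC):
    @abstractmethod
    def replicas(self, lfn: str) -> List[str]:
        """Return physical locations for a logical file name."""

class ResourceBroker(ABC):
    @abstractmethod
    def pick_site(self, lfn: str) -> str:
        """Choose a site close to the data."""

class StaticCatalog(FileCatalog):
    """Phase-1 binding: a static config table instead of a real grid catalog."""
    def __init__(self, table: dict):
        self.table = table
    def replicas(self, lfn: str) -> List[str]:
        return self.table.get(lfn, [])

class NearestReplicaBroker(ResourceBroker):
    def __init__(self, catalog: FileCatalog):
        self.catalog = catalog
    def pick_site(self, lfn: str) -> str:
        sites = self.catalog.replicas(lfn)
        return sites[0] if sites else "cern.ch"   # trivial placement policy

catalog = StaticCatalog({"/alice/run42.root": ["bologna.infn.it", "nikhef.nl"]})
print(NearestReplicaBroker(catalog).pick_site("/alice/run42.root"))
```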
21. Different PROOF-Grid Scenarios (Fons Rademakers)
- Static, stand-alone
  - Current version; static config file; pre-installed
- Dynamic, PROOF in control
  - Using the grid file catalog and resource broker; pre-installed
- Dynamic, AliEn in control
  - Idem, but installed and started on the fly by AliEn
- Dynamic, Condor in control
  - Idem, but additionally allowing slave migration within a Condor pool
22. WorldGrid (see the WorldGrid poster at this conference)
Executable = "/usr/bin/env";
Arguments = "zsh prod.dc1_wrc 00001";
VirtualOrganization = "datatag";
Requirements = Member(other.GlueHostApplicationSoftwareRunTimeEnvironment, "ATLAS-3.2.1");
Rank = other.GlueCEStateFreeCPUs;
InputSandbox = {"prod.dc1_wrc", "rc.conf", "plot.kumac"};
OutputSandbox = {"dc1.002000.test.00001.hlt.pythia_jet_17.log", "dc1.002000.test.00001.hlt.pythia_jet_17.his", "dc1.002000.test.00001.hlt.pythia_jet_17.err", "plot.kumac"};
ReplicaCatalog = "ldap://dell04.cnaf.infn.it:9211/lc=ATLAS,rc=GLUE,dc=dell04,dc=cnaf,dc=infn,dc=it";
InputData = "LF:dc1.002000.evgen.0001.hlt.pythia_jet_17.root";
StdOutput = "dc1.002000.test.00001.hlt.pythia_jet_17.log";
StdError = "dc1.002000.test.00001.hlt.pythia_jet_17.err";
DataAccessProtocol = "file";
JDL: GLUE-aware files
[Diagram: a GLUE-aware JDL job is submitted to the RB/JSS, which uses the GLUE-Schema based Information System (II, TOP GIIS) and the Replica Catalog for input data location; the job runs on a CE/WN with the ATLAS software installed and registers its output data.]
23. Ganga: ATLAS and LHCb (C. Tull)
24. Ganga EDG Grid Interface (C. Tull)
[Diagram: Ganga components (Job class, Job Handler class, JobsRegistry class) mapped onto EDG UI services:
- Job submission: dg-job-list-match, dg-job-submit, dg-job-cancel
- Job monitoring: dg-job-status, dg-job-get-logging-info, GRM/PROVE
- Security service: grid-proxy-init, MyProxy
- Data management service: edg-replica-manager, dg-job-get-output, globus-url-copy, GDMP]
(A rough sketch of such a wrapper layer follows below.)
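The mapping in the diagram, job-handling classes shelling out to the EDG UI commands listed above, can be sketched roughly as follows. The class and method names are illustrative only and do not reproduce the real Ganga interfaces.

```python
# Rough sketch of a job-handler layer over the EDG UI commands listed above.
# Class/method names are illustrative only, not the actual Ganga classes.
import subprocess
from typing import List

def _run(*cmd: str) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

class EDGJobHandler:
    """Submission, monitoring and output retrieval via the EDG UI."""
    def submit(self, jdl_path: str) -> str:
        return _run("dg-job-submit", jdl_path).strip()   # returns the job id
    def status(self, job_id: str) -> str:
        return _run("dg-job-status", job_id)
    def get_output(self, job_id: str) -> str:
        return _run("dg-job-get-output", job_id)

class JobsRegistry:
    """Keep track of submitted jobs so they can be monitored later."""
    def __init__(self):
        self.jobs: List[str] = []
    def add(self, job_id: str) -> None:
        self.jobs.append(job_id)

handler, registry = EDGJobHandler(), JobsRegistry()
# registry.add(handler.submit("analysis.jdl"))   # requires a working EDG UI
```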
25. Comment: Building Grid Applications
- P is a dynamic configuration script
  - Turns an abstract bundle into a concrete one
- Challenges
  - Building integrated systems
  - Distributed developers and support
[Diagram: script P takes attributes, user info, and grid info as input; a minimal sketch follows below]
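A minimal sketch of what such a dynamic configuration step might look like: merging package attributes with user and grid information to turn an abstract bundle into a concrete, site-specific one. All field names here are hypothetical.

```python
# Minimal sketch (hypothetical field names): a configuration script that
# turns an abstract application bundle into a concrete, site-specific one.
def concretize(bundle: dict, user_info: dict, grid_info: dict) -> dict:
    """Fill the abstract bundle's unresolved attributes from user/grid info."""
    concrete = dict(bundle)
    concrete["install_dir"] = grid_info["app_area"] + "/" + bundle["name"]
    concrete["proxy"] = user_info["proxy_file"]
    concrete["runtime_env"] = grid_info["runtime_tags"]
    return concrete

abstract_bundle = {"name": "atlas-3.2.1", "steps": ["untar", "configure", "run"]}
print(concretize(
    abstract_bundle,
    user_info={"proxy_file": "/tmp/x509up_u501"},
    grid_info={"app_area": "/opt/exp_soft", "runtime_tags": ["ATLAS-3.2.1"]},
))
```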
26. In summary: common issues
- Installation and configuration of middleware
- Application packaging, run-time environments
- Authentication mechanisms
- Policies differing among sites
- Private networks, firewalls, ports
- Fragility of services and of the job-submission chain
- Inaccuracies and poor performance of information services
- Monitoring at several levels
- Debugging, site cleanup
27. Conclusions
- Progress in the past 18 months has been dramatic!
  - Lots of experience gained in building integrated grid systems
  - Demonstrated functionality with large-scale production
  - More attention now being given to analysis
- Many pitfalls exposed and areas for improvement identified
  - Some of these are in core middleware → feedback given to the technology providers
- Policy issues remain: use of shared resources, authorization
  - Operation of production services
  - User interactions and support models still to be developed
- Many thanks to the contributors to this session