1 ATLAS Data Challenge 2: A Massive Monte Carlo Production on the Grid
- Santiago González de la Hoz (Santiago.Gonzalez_at_ific.uv.es) - on behalf of the ATLAS DC2 Collaboration
- EGC 2005
- Amsterdam, 14/02/2005
2 Overview
- Introduction
- ATLAS experiment
- Data Challenge program
- ATLAS production system
- DC2 production phases
- The 3 Grid flavours (LCG, GRID3 and NorduGrid)
- ATLAS DC2 production
- Distributed analysis system
- Conclusions
3 Introduction: LHC (CERN)
[Photograph of the LHC site at CERN near Geneva, with Mont Blanc (4810 m) in the background]
4 The challenge of LHC computing
- Storage: raw recording rate of 0.1-1 GBytes/sec, accumulating at 5-8 PetaBytes/year, with about 10 PetaBytes of disk.
- Processing: the equivalent of 200,000 of today's fastest PCs.
5 Introduction: ATLAS
- A detector for the study of high-energy proton-proton collisions.
- The offline computing will have to deal with an output event rate of 100 Hz, i.e. 10^9 events per year with an average event size of 1 MByte.
- The researchers are spread all over the world.
6 Introduction: Data Challenges
- Scope and goals
  - In 2002 ATLAS computing planned a first series of Data Challenges (DCs) in order to validate its
    - Computing Model
    - Software
    - Data Model
- The major features of DC1 were
  - the development and deployment of the software required for the production of large event samples
  - the production of those samples, involving institutions worldwide.
- The ATLAS collaboration decided to perform DC2, and in the future DC3, using the Grid middleware developed in several Grid projects (Grid flavours), namely
  - the LHC Computing Grid project (LCG), to which CERN is committed
  - GRID3
  - NorduGrid
7 ATLAS production system
- In order to handle the task of ATLAS DC2, an automated production system was designed.
- The ATLAS production system consists of 4 components:
  - The production database, which contains abstract job definitions.
  - The Windmill supervisor, which reads the production database for job definitions and presents them to the different Grid executors in an easy-to-parse XML format.
  - The executors, one for each Grid flavour, which receive the job definitions in XML format and convert them to the job description language of that particular Grid (a toy sketch of this hand-off follows this list).
  - Don Quijote, the ATLAS Data Management System, which moves files from their temporary output locations to their final destination on some Storage Element and registers the files in the Replica Location Service of that Grid.
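The supervisor-to-executor hand-off can be made concrete with a toy example. In the sketch below the XML tags, the transformation name and the JDL attributes are all invented for illustration; they are not the actual Windmill/DC2 schema or the exact LCG JDL, just a plausible shape for "XML job definition in, Grid-specific job description out".

```python
# Illustrative only: the XML tags and the JDL fields below are invented for
# this sketch and are not the actual Windmill/DC2 schema.
import xml.etree.ElementTree as ET

JOB_XML = """
<job id="dc2.003456">
  <transformation>AtlasG4_trf</transformation>
  <inputfile>evgen.pool.root</inputfile>
  <outputfile>simul.pool.root</outputfile>
  <cputime>86400</cputime>
</job>
"""

def xml_to_lcg_jdl(job_xml: str) -> str:
    """Convert a supervisor-style XML job definition into a toy JDL string."""
    job = ET.fromstring(job_xml)
    exe = job.findtext("transformation")
    infile = job.findtext("inputfile")
    outfile = job.findtext("outputfile")
    cpu = job.findtext("cputime")
    return (
        "[\n"
        f'  Executable = "{exe}";\n'
        f'  InputSandbox = {{"{infile}"}};\n'
        f'  OutputSandbox = {{"{outfile}"}};\n'
        f"  MaxCPUTime = {cpu};\n"
        "]"
    )

if __name__ == "__main__":
    print(xml_to_lcg_jdl(JOB_XML))
```

Each executor implements only this translation plus submission for its own Grid, which is what keeps the production system independent of any single middleware.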
8 DC2 production phases
[Task-flow diagram for DC2 data: event generation (Pythia -> HepMC events), detector simulation (Geant4 -> Hits + MCTruth), pile-up and digitization (-> Digits (RDO) + MCTruth), event mixing and byte-stream production (-> bytestream raw digits), and reconstruction (-> ESD), with persistency via Athena-POOL. The indicated data volumes for 10^7 events range from 5 TB to 30 TB per sample (physics events, min. bias events, piled-up events, mixed events, and mixed events with pile-up).]
The ordering of these stages is restated in the sketch below.
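The sketch below only writes down the flow recovered from the diagram; stage and product names are taken from the figure, while the function name is invented here.

```python
# A minimal, purely descriptive sketch of the DC2 task flow. The stage and
# product names come from the diagram above; everything else is illustrative.
DC2_PIPELINE = [
    ("Event generation (Pythia)",    "HepMC events"),
    ("Detector simulation (Geant4)", "Hits + MCTruth"),
    ("Pile-up / Digitization",       "Digits (RDO) + MCTruth"),
    ("Event mixing / Byte-stream",   "Bytestream raw digits"),
    ("Reconstruction",               "ESD"),
]

def describe(pipeline):
    """Print each stage together with the data product it hands to the next stage."""
    for stage, product in pipeline:
        print(f"{stage:32s} -> {product}")

if __name__ == "__main__":
    describe(DC2_PIPELINE)
```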
9 DC2 production phases

Process                      No. of events   Event size (MB)   CPU power (kSI2k-s)   Volume of data (TB)
Event generation             10^7            0.06              156                   -
Simulation                   10^7            1.9               504                   30
Pile-up / Digitization       10^7            3.3 / 1.9         144 / 16              35
Event mixing / Byte-stream   10^7            2.0               5.4                   20
- ATLAS DC2 started in July 2004 and finished the simulation part at the end of September 2004.
- 10 million events (100,000 jobs) were generated and simulated using the three Grid flavours (a back-of-envelope reading of the table above is sketched after this list).
- The Grid technologies have provided the tools to generate a large Monte Carlo simulation sample.
- The digitization and pile-up part was completed in December; the pile-up was done on a sub-sample of 2 M events.
- The event mixing and byte-stream production are ongoing.
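As a rough cross-check, the table can be turned into total CPU and data-volume estimates. The sketch below assumes the kSI2k-s column is per event (the slide does not say so explicitly) and uses a 30-day month for the conversion; everything else is taken directly from the table.

```python
# Back-of-envelope totals from the DC2 table above.
# Assumption (not stated on the slide): the kSI2k-s column is per event.
EVENTS = 1e7
PHASES = {
    # name: (CPU per event in kSI2k-s, output volume in TB)
    "Event generation":           (156.0, 0.0),
    "Simulation":                 (504.0, 30.0),
    "Pile-up / Digitization":     (144.0 + 16.0, 35.0),
    "Event mixing / Byte-stream": (5.4, 20.0),
}

SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6e6 s, a 30-day month

total_cpu_si2k_s = sum(cpu * 1e3 for cpu, _ in PHASES.values()) * EVENTS
total_volume_tb = sum(vol for _, vol in PHASES.values())

print(f"Total CPU:         {total_cpu_si2k_s / SECONDS_PER_MONTH:.2e} SI2k-months")
print(f"Total data volume: {total_volume_tb:.0f} TB")
```

Under that per-event assumption, the simulation row alone works out to roughly 2 million SI2k-months, the same order of magnitude as the ~1.5 million SI2k-months quoted for the G4 simulation phase in the conclusions.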
10 The 3 Grid flavours
- LCG (http://lcg.web.cern.ch/LCG/)
  - The job of the LHC Computing Grid Project (LCG) is to prepare the computing infrastructure for the simulation, processing and analysis of LHC data for all four of the LHC collaborations. This includes both the common infrastructure of libraries, tools and frameworks required to support the physics application software, and the development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in the LHC.
- NorduGrid (http://www.nordugrid.org/)
  - The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software.
- Grid3 (http://www.ivdgl.org/grid2003/)
  - The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the U.S. Grid projects iVDGL, GriPhyN and PPDG, and the U.S. participants in the LHC experiments ATLAS and CMS.
- Both Grid3 and NorduGrid take similar approaches, using the same foundations (Globus) as LCG but with slightly different middleware.
11 The 3 Grid flavours: LCG
- This infrastructure has been operating since 2003.
- The resources used (computational and storage) are installed at a large number of Regional Computing Centers, interconnected by fast networks.
- 82 sites, 22 countries (this number is evolving very fast)
- 6558 TB of storage
- 7269 CPUs (shared)
12 The 3 Grid flavours: NorduGrid
- NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries.
- It contributed a significant part of DC1 (using the Grid, in 2002).
- It supports production on non-RedHat 7.3 platforms.
- 11 countries, 40 sites, 4000 CPUs
- 30 TB of storage
13 The 3 Grid flavours: GRID3
- As of September 2004:
  - 30 sites, multi-VO
  - shared resources
  - 3000 CPUs (shared)
- The deployed infrastructure has been in operation since November 2003.
- At this moment it is running 3 HEP and 2 biological applications.
- Over 100 users are authorized to run on GRID3.
14 ATLAS DC2 production on LCG, GRID3 and NorduGrid
[Plot: number of validated jobs per day, per Grid flavour and in total]
15 Typical job distribution on LCG, GRID3 and NorduGrid
16 Distributed Analysis system: ADA
- The physicists want to use the Grid to perform the analysis of the data too.
- The ADA (ATLAS Distributed Analysis) project aims at putting together all the software components needed to facilitate end-user analysis.
- DIAL: defines the job components (dataset, task, application, etc.); together with LSF or Condor it provides interactivity (a low response time). A toy sketch of these components follows this list.
- ATPROD: the production system, to be used for small-scale production.
- ARDA: analysis system to be interfaced to the EGEE middleware.
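To illustrate what "job components" means here, the sketch below mirrors the DIAL vocabulary on this slide (dataset, task, application); the class and field names are invented for illustration and are not the actual DIAL API.

```python
# Illustrative only: class and field names follow the slide's vocabulary
# (dataset, task, application) but are not the real DIAL interfaces.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str            # logical name of the input data sample
    files: list[str]     # logical file names of the events to analyse

@dataclass
class Task:
    application: str     # analysis application to run on each file
    task_script: str     # user code/configuration applied by that application

@dataclass
class AnalysisJob:
    dataset: Dataset
    task: Task
    scheduler: str = "LSF"   # LSF or Condor for an interactive, low-latency response

job = AnalysisJob(
    dataset=Dataset("dc2.mixed.sample", ["file1.pool.root", "file2.pool.root"]),
    task=Task(application="AnalysisSkeleton", task_script="fill_histograms.py"),
)
print(job.scheduler, len(job.dataset.files))
```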
17 Lessons learned from DC2
- Main problems
  - The production system was still under development during the DC2 phase.
  - The beta status of the Grid services caused trouble while the system was in operation; for example, the Globus RLS, the Resource Broker and the information system were unstable in the initial phase.
  - Especially on LCG, the lack of a uniform monitoring system.
  - Mis-configuration of sites and site-stability problems.
- Main achievements
  - An automatic production system making use of the Grid infrastructure.
  - 6 TB (out of 30 TB) of data were moved among the different Grid flavours using the Don Quijote servers.
  - 235,000 jobs were submitted by the production system.
  - 250,000 logical files were produced, with 2500-3500 jobs per day distributed over the three Grid flavours.
18 Conclusions
- The generation and simulation of events for ATLAS DC2 have been completed using the 3 flavours of Grid technology.
- They have been proven to be usable in a coherent way for a real production, and this is a major achievement.
- This exercise has taught us that all the elements involved (Grid middleware, production system, deployment and monitoring tools) need improvements.
- Between the start of DC2 in July 2004 and the end of September 2004 (corresponding to the G4-simulation phase), the automatic production system submitted 235,000 jobs, which consumed about 1.5 million SI2k-months of CPU and produced more than 30 TB of physics data.
- ATLAS is also pursuing a model for distributed analysis that would improve the productivity of end users by profiting from the available Grid resources.
19 Backup Slides
20 Supervisor-Executors
[Diagram: the Windmill supervisor communicates with the executors (1. lexor, 2. dulcinea, 3. capone, 4. legacy) over a Jabber communication pathway, using the messages numJobsWanted, executeJobs, getExecutorData, getStatus, fixJob and killJob; the supervisor also talks to the production database (jobs database) and to Don Quijote (file catalog).]
A toy rendering of the executor-side message interface is sketched below.
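The method names in the sketch below come from the diagram above; the signatures, types and docstrings are assumptions made for illustration and are not the actual Windmill/executor code.

```python
# Minimal sketch of the supervisor-facing executor interface.
# Method names follow the diagram; signatures and behaviour are illustrative only.
from abc import ABC, abstractmethod

class GridExecutor(ABC):
    """One concrete executor exists per Grid flavour (lexor, dulcinea, capone, legacy)."""

    @abstractmethod
    def numJobsWanted(self) -> int:
        """How many new job definitions this executor can accept right now."""

    @abstractmethod
    def executeJobs(self, jobs_xml: list[str]) -> None:
        """Translate the XML job definitions and submit them to this Grid."""

    @abstractmethod
    def getExecutorData(self) -> dict:
        """Return executor-specific bookkeeping data for the supervisor."""

    @abstractmethod
    def getStatus(self, job_ids: list[str]) -> dict:
        """Report the current state of the given jobs."""

    @abstractmethod
    def fixJob(self, job_id: str) -> None:
        """Apply corrective post-processing to a finished or stuck job."""

    @abstractmethod
    def killJob(self, job_id: str) -> None:
        """Cancel a running job on this Grid."""
```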
21 NorduGrid ARC features
- ARC is based on the Globus Toolkit, with the core services replaced.
  - It currently uses Globus Toolkit 2.
- Alternative/extended Grid services:
  - A Grid Manager that
    - checks user credentials and authorization,
    - handles jobs locally on clusters (interfaces to the LRMS),
    - does stage-in and stage-out of files.
  - A lightweight User Interface with a built-in resource broker.
  - An Information System based on MDS with a NorduGrid schema.
  - The xRSL job description language (extended Globus RSL); an illustrative example follows this list.
  - A Grid Monitor.
- Simple, stable and non-invasive.
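For a rough idea of what an xRSL job description looks like, the snippet below composes one as a string; the attribute names are quoted from memory of ARC's xRSL and the values are invented, so both should be checked against the NorduGrid documentation rather than taken as authoritative.

```python
# A rough illustration of an xRSL (extended Globus RSL) job description,
# held as a Python string; attributes and values are illustrative only.
XRSL_JOB = """
&(executable="run_simulation.sh")
 (arguments="dc2.003456")
 (inputFiles=("evgen.pool.root" ""))
 (outputFiles=("simul.pool.root" ""))
 (stdout="job.out")(stderr="job.err")
 (cpuTime="24 hours")
 (jobName="dc2-simul-003456")
"""

print(XRSL_JOB.strip())
```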
22 LCG software
- LCG-2 core packages:
  - VDT (Globus 2, Condor)
  - EDG WP1 (Resource Broker, job submission tools)
  - EDG WP2 (Replica Management tools) and lcg tools
    - One central RMC and LRC for each VO, located at CERN, with an ORACLE backend
  - Several bits from other WPs (Config objects, InfoProviders, Packaging)
  - GLUE 1.1 (information schema) with a few essential LCG extensions
  - MDS-based Information System with significant LCG enhancements (replacements, simplified; see poster)
  - Mechanism for application (experiment) software distribution
- Almost all components have gone through some re-engineering:
  - robustness
  - scalability
  - efficiency
  - adaptation to local fabrics
- The services are now quite stable, and the performance and scalability have been significantly improved (within the limits of the current architecture).
23 Grid3 software
- A Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT):
  - GRAM, GridFTP, MDS, RLS, VDS
- Equipped with VO and multi-VO security, monitoring, and operations services.
- Allowing federation with other Grids where possible, e.g. the CERN LHC Computing Grid (LCG):
  - USATLAS GriPhyN VDS execution on LCG sites
  - USCMS storage element interoperability (SRM/dCache)
- Delivering the US LHC Data Challenges.
24 ATLAS DC2 (CPU)
25 Typical job distribution on LCG
26 Typical job distribution on Grid3
27 Job distribution on NorduGrid