1
ATLAS Data Challenge 2: A massive Monte Carlo
Production on the Grid
  • Santiago González de la Hoz
    (Santiago.Gonzalez@ific.uv.es)
  • on behalf of the ATLAS DC2 Collaboration
  • EGC 2005
  • Amsterdam, 14/02/2005

2
Overview
  • Introduction
  • ATLAS experiment
  • Data Challenge program
  • ATLAS production system
  • DC2 production phases
  • The 3 Grid flavours (LCG, GRID3 and NorduGrid)
  • ATLAS DC2 production
  • Distributed analysis system
  • Conclusions

3
Introduction: LHC/CERN
[Aerial photo of the LHC ring at CERN near Geneva,
with Mont Blanc (4810 m) in the background]
4
The challenge of LHC computing
  • Storage: raw recording rate of 0.1-1 GBytes/sec,
    accumulating at 5-8 PetaBytes/year, with
    10 PetaBytes of disk
  • Processing: the equivalent of 200,000 of today's
    fastest PCs
5
Introduction: ATLAS
  • Detector for the study of high-energy
    proton-proton collisions.
  • The offline computing will have to deal with an
    output event rate of 100 Hz, i.e. about 10^9
    events per year with an average event size of
    1 MByte (a rough numerical cross-check follows
    below).
  • Researchers are spread all over the world.
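A minimal back-of-envelope sketch of these figures: assuming roughly
10^7 seconds of effective data taking per year (an assumption made for
this estimate, not a number from the slides), the quoted rate and event
size reproduce the order of 10^9 events per year and about 0.1 GB/s of
raw data.

```python
# Back-of-envelope check of the ATLAS event-rate and data-volume figures.
# The ~1e7 s of effective data taking per year is an assumed ballpark,
# not a number taken from these slides.
EVENT_RATE_HZ = 100           # offline output event rate
EVENT_SIZE_MB = 1.0           # average raw event size
LIVE_SECONDS_PER_YEAR = 1e7   # assumed effective data-taking time

events_per_year = EVENT_RATE_HZ * LIVE_SECONDS_PER_YEAR        # ~1e9 events
raw_rate_gb_per_s = EVENT_RATE_HZ * EVENT_SIZE_MB / 1000.0     # ~0.1 GB/s
raw_volume_pb = events_per_year * EVENT_SIZE_MB / 1e9          # ~1 PB/year

print(f"{events_per_year:.1e} events/year, {raw_rate_gb_per_s:.1f} GB/s, "
      f"~{raw_volume_pb:.0f} PB/year of raw physics data")
```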

6
Introduction: Data Challenges
  • Scope and goals
  • In 2002 ATLAS computing planned a first series of
    Data Challenges (DCs) in order to validate its
    • Computing Model
    • Software
    • Data Model
  • The major features of DC1 were
    • the development and deployment of the software
      required for the production of large event
      samples
    • the production of those samples, involving
      institutions worldwide.
  • The ATLAS collaboration decided to perform DC2,
    and in the future DC3, using the Grid middleware
    developed in several Grid projects (Grid
    flavours), namely
    • the LHC Computing Grid project (LCG), to which
      CERN is committed
    • GRID3
    • NorduGRID

7
ATLAS production system
  • The production database, which contains abstract
    job definitions
  • The windmill supervisor that reads the production
    database for job definitions and present them to
    the different GRID executors in an easy-to-parse
    XML format
  • The Executors, one for each GRID flavor, that
    receive the job-definitions in XML format and
    convert them to the job description language of
    that particular GRID
  • Don Quijote, the Atlas Data Management System,
    moves files from their temporary output locations
    to their final destination on some Storage
    Element and registers the files in the Replica
    Location Service of that GRID.
  • In order to handle the task of ATLAS DC2 an
    automated production system was designed
  • The ATLAS production system consists of 4
    components
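A purely illustrative sketch of the executor's translation step: it
takes a supervisor job definition in XML and turns it into a
description for one particular Grid flavour. The XML tag names, file
names and the JDL-like output format are assumptions made for this
example, not the actual Windmill/Lexor schema.

```python
# Illustrative sketch of the executor role in the DC2 production system:
# convert an abstract XML job definition into the job description
# language of one particular Grid. Tags, names and output format are
# assumptions made for this example.
import xml.etree.ElementTree as ET


def xml_to_lcg_style_jdl(job_xml: str) -> str:
    """Translate an abstract job definition into an LCG-style JDL string."""
    job = ET.fromstring(job_xml)
    executable = job.findtext("transformation")
    arguments = job.findtext("arguments", default="")
    inputs = [f.text for f in job.findall("inputfiles/file")]
    input_list = ", ".join('"%s"' % name for name in inputs)
    return "\n".join([
        "[",
        '  Executable = "%s";' % executable,
        '  Arguments  = "%s";' % arguments,
        "  InputData  = {%s};" % input_list,
        "]",
    ])


# Hypothetical job definition, only to exercise the translation above.
example_job = """
<job id="example.simul.00001">
  <transformation>g4sim.trf</transformation>
  <arguments>--events 50</arguments>
  <inputfiles><file>example.evgen._00001.pool.root</file></inputfiles>
</job>
"""
print(xml_to_lcg_style_jdl(example_job))
```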

8
DC2 production phases
[Task-flow diagram for DC2 data: event generation (Pythia, events in
HepMC format) → detector simulation (Geant4, hits + MCTruth) →
digitization and pile-up (digits (RDO) + MCTruth) → event mixing and
byte-stream production (byte-stream raw digits) → reconstruction (ESD).
Persistency is provided by Athena-POOL. The indicated data volumes for
10^7 events range from about 5 TB to 30 TB per stage.]
9
DC2 production phases
Process                      No. of events   Event size (MB)   CPU power (kSI2k-s)   Volume of data (TB)
Event generation             10^7            0.06              156                   -
Simulation                   10^7            1.9               504                   30
Pile-up / Digitization       10^7            3.3 / 1.9         144 / 16              35
Event mixing + Byte-stream   10^7            2.0               5.4                   20
  • ATLAS DC2, which started in July 2004, finished
    the simulation part at the end of September 2004.
  • 10 million events (100,000 jobs) were generated
    and simulated using the three Grid flavours (a
    rough cross-check of the CPU numbers is sketched
    below).
  • The Grid technologies have provided the tools to
    generate large Monte Carlo simulation samples.
  • The digitization and pile-up part was completed
    in December. The pile-up was done on a sub-sample
    of 2 million events.
  • The event mixing and byte-stream production are
    still ongoing.
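As a rough cross-check of the table above (a minimal back-of-envelope
sketch; the seconds-per-month constant used for the unit conversion is
an assumption, not a number from the slides), the per-event Geant4 cost
alone already corresponds to CPU time of order a million SI2k-months,
the same order of magnitude as the figure quoted in the conclusions.

```python
# Back-of-envelope conversion of the per-event CPU cost in the table
# above into total CPU time for the Geant4 simulation. The conversion
# constant (seconds per month) is an assumption made for this estimate.
N_EVENTS = 1e7                        # events generated and simulated
SIM_COST_KSI2K_S = 504.0              # per-event simulation cost (table above)
SECONDS_PER_MONTH = 30 * 24 * 3600.0  # ~2.6e6 s, assumed

total_ksi2k_s = N_EVENTS * SIM_COST_KSI2K_S
total_si2k_months = total_ksi2k_s * 1000.0 / SECONDS_PER_MONTH

# Roughly 2e6 SI2k-months, i.e. the same order of magnitude as the
# ~1.5 million SI2k-months quoted in the conclusions for the
# July-September G4-simulation phase.
print(f"G4 simulation: ~{total_si2k_months:.1e} SI2k-months of CPU")
```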

10
The 3 Grid flavors
  • LCG (http://lcg.web.cern.ch/LCG/)
    • The job of the LHC Computing Grid project (LCG)
      is to prepare the computing infrastructure for
      the simulation, processing and analysis of LHC
      data for all four of the LHC collaborations.
      This includes both the common infrastructure of
      libraries, tools and frameworks required to
      support the physics application software, and
      the development and deployment of the computing
      services needed to store and process the data,
      providing batch and interactive facilities for
      the worldwide community of physicists involved
      in the LHC.
  • NorduGrid (http://www.nordugrid.org/)
    • The aim of the NorduGrid collaboration is to
      deliver a robust, scalable, portable and fully
      featured solution for a global computational and
      data Grid system. NorduGrid develops and deploys
      a set of tools and services, the so-called ARC
      middleware, which is free software.
  • Grid3 (http://www.ivdgl.org/grid2003/)
    • The Grid3 collaboration has deployed an
      international Data Grid with dozens of sites and
      thousands of processors. The facility is operated
      jointly by the U.S. Grid projects iVDGL, GriPhyN
      and PPDG, and by the U.S. participants in the LHC
      experiments ATLAS and CMS.
  • Both Grid3 and NorduGrid take a similar approach,
    using the same foundations (GLOBUS) as LCG but
    with slightly different middleware.

11
The 3 Grid flavors: LCG
  • This infrastructure has been operating since
    2003.
  • The resources used (computational and storage)
    are installed at a large number of Regional
    Computing Centers, interconnected by fast
    networks.
  • 82 sites, 22 countries (This number is evolving
    very fast)
  • 6558 TB
  • 7269 CPUs (shared)

12
The 3 Grid flavors: NorduGrid
  • NorduGrid is a research collaboration established
    mainly across the Nordic countries, but it also
    includes sites from other countries.
  • It contributed to a significant part of DC1
    (already using the Grid in 2002).
  • It supports production on non-RedHat 7.3
    platforms.
  • 11 countries, 40 sites, 4000 CPUs,
  • 30 TB storage

13
The 3 Grid flavors: GRID3
  • Status as of September 2004:
    • 30 sites, multi-VO
    • shared resources
    • 3000 CPUs (shared)
  • The deployed infrastructure has been in operation
    since November 2003.
  • At this moment 3 HEP and 2 biology applications
    are running.
  • Over 100 users are authorized to run on GRID3.

14
ATLAS DC2 production on LCG, GRID3 and
NorduGrid
  • G4 simulation

[Plot: number of validated G4-simulation jobs per day on LCG, GRID3
and NorduGrid, together with the total]
15
Typical job distribution on LCG, GRID3 and
NorduGrid
16
Distributed Analysis system: ADA
  • The physicists want to use the Grid to perform
    the analysis of the data as well.
  • The ADA (ATLAS Distributed Analysis) project aims
    at putting together all software components to
    facilitate end-user analysis.
  • DIAL: it defines the job components (dataset,
    task, application, etc.). Together with LSF or
    Condor it provides interactivity (a low response
    time). A purely illustrative sketch of this job
    decomposition is given after this list.
  • ATPROD: the production system, to be used at
    small scale.
  • ARDA: analysis system to be interfaced to the
    EGEE middleware.
  • The ADA architecture [diagram]
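As a purely illustrative sketch of the DIAL-style job decomposition
mentioned above (the class and field names are assumptions for this
example, not the actual DIAL interfaces), an analysis job can be
thought of as a dataset plus a task plus an application, which the
system can split into sub-jobs.

```python
# Illustrative sketch of a DIAL-style analysis job split into its
# components (dataset, task, application). Class and field names are
# assumptions for this example, not the actual DIAL interfaces.
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    logical_files: list = field(default_factory=list)


@dataclass
class Task:
    algorithm: str                       # user analysis code to run
    outputs: list = field(default_factory=list)


@dataclass
class AnalysisJob:
    dataset: Dataset
    task: Task
    application: str                     # e.g. the software release to use

    def split(self, files_per_subjob: int):
        """Yield sub-jobs, each covering a slice of the input dataset."""
        files = self.dataset.logical_files
        for i in range(0, len(files), files_per_subjob):
            part = Dataset(self.dataset.name, files[i:i + files_per_subjob])
            yield AnalysisJob(part, self.task, self.application)
```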

17
Lessons learned from DC2
  • Main problems
    • The production system was still under development
      during the DC2 phase.
    • The beta status of the Grid services caused
      trouble while the system was in operation.
    • For example, the Globus RLS, the Resource Broker
      and the information system were unstable in the
      initial phase.
    • Especially on LCG, there was a lack of a uniform
      monitoring system.
    • Mis-configuration of sites and site-stability
      related problems.
  • Main achievements
    • An automatic production system making use of the
      Grid infrastructure.
    • 6 TB (out of 30 TB) of data have been moved among
      the different Grid flavours using the Don Quijote
      servers.
    • 235,000 jobs were submitted by the production
      system.
    • 250,000 logical files were produced, and
      2,500-3,500 jobs per day were distributed over
      the three Grid flavours.

18
Conclusions
  • The generation and simulation of events for ATLAS
    DC2 have been completed using 3 flavours of Grid
    technology.
  • They have been proven to be usable in a coherent
    way for a real production, and this is a major
    achievement.
  • This exercise has taught us that all the involved
    elements (Grid middleware, production system,
    deployment and monitoring tools) need
    improvements.
  • Between the start of DC2 in July 2004 and the end
    of September 2004 (corresponding to the
    G4-simulation phase), the automatic production
    system submitted 235,000 jobs, which consumed
    1.5 million SI2k-months of CPU and produced more
    than 30 TB of physics data.
  • ATLAS is also pursuing a model for distributed
    analysis which would improve the productivity of
    end users by profiting from the available Grid
    resources.

19
Backup Slides
20
Supervisor-Executors
[Diagram: the Windmill supervisor communicates with the executors
(1. lexor, 2. dulcinea, 3. capone, 4. legacy) over a Jabber
communication pathway, using the messages numJobsWanted, executeJobs,
getExecutorData, getStatus, fixJob and killJob; the supervisor also
interacts with the production database (Prod DB, the jobs database)
and with Don Quijote (the file catalog). A simplified sketch of this
message cycle follows.]
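A minimal sketch of that message cycle, assuming simplified in-process
method calls in place of the real Jabber transport (the method names
follow the messages in the diagram; return values and everything else
are made up for this example).

```python
# Simplified sketch of the Windmill supervisor / executor message cycle
# shown above. The in-process method calls and return values are
# assumptions; the real system exchanges these messages over Jabber.


class Executor:
    """One executor per Grid flavour (lexor, dulcinea, capone, legacy)."""

    def num_jobs_wanted(self) -> int:
        """Report how many new jobs this Grid can currently absorb."""
        return 10

    def execute_jobs(self, job_definitions: list) -> list:
        """Submit the given job definitions; return Grid job identifiers."""
        return [f"grid-job-{i}" for i, _ in enumerate(job_definitions)]

    def get_status(self, grid_job_id: str) -> str:
        """Return e.g. 'running', 'finished' or 'failed' for a job."""
        return "finished"


def supervisor_pass(executor: Executor, production_db: list) -> None:
    """One supervisor pass: hand out jobs to an executor, then poll them."""
    wanted = executor.num_jobs_wanted()
    batch = production_db[:wanted]          # abstract job definitions
    for job_id in executor.execute_jobs(batch):
        status = executor.get_status(job_id)
        print(job_id, status)               # finished output files would then
                                            # be handled by Don Quijote


supervisor_pass(Executor(), production_db=[{"id": 1}, {"id": 2}])
```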
21
NorduGRID ARC features
  • ARC is based on the Globus Toolkit with core
    services replaced
  • Currently uses Globus Toolkit 2
  • Alternative/extended Grid services
    • A Grid Manager that
      • checks user credentials and authorization
      • handles jobs locally on clusters (interfaces
        to the LRMS)
      • does stage-in and stage-out of files
    • A lightweight User Interface with a built-in
      resource broker
    • An Information System based on MDS with a
      NorduGrid schema
    • The xRSL job description language (extended
      Globus RSL)
    • A Grid Monitor
  • Simple, stable and non-invasive

22
LCG software
  • LCG-2 core packages:
    • VDT (Globus2, Condor)
    • EDG WP1 (Resource Broker, job submission tools)
    • EDG WP2 (Replica Management tools) + lcg tools
      • one central RMC and LRC for each VO, located
        at CERN, with an ORACLE backend
    • Several bits from other WPs (Config objects,
      InfoProviders, Packaging)
    • GLUE 1.1 (information schema) plus a few
      essential LCG extensions
    • MDS-based Information System with significant
      LCG enhancements (replacements, simplified; see
      poster)
    • Mechanism for application (experiment) software
      distribution
  • Almost all components have gone through some
    re-engineering for
    • robustness
    • scalability
    • efficiency
    • adaptation to local fabrics
  • The services are now quite stable, and the
    performance and scalability have been
    significantly improved (within the limits of the
    current architecture).

23
Grid3 software
  • Grid environment built from core Globus and
    Condor middleware, as delivered through the
    Virtual Data Toolkit (VDT)
    • GRAM, GridFTP, MDS, RLS, VDS
  • Equipped with VO and multi-VO security,
    monitoring, and operations services
  • Allowing federation with other Grids where
    possible, e.g. the CERN LHC Computing Grid (LCG)
    • US ATLAS: GriPhyN VDS execution on LCG sites
    • US CMS: storage element interoperability
      (SRM/dCache)
  • Delivering the US LHC Data Challenges

24
ATLAS DC2 (CPU)
25
Typical job distribution on LCG
26
Typical Job distribution on Grid3
27
Jobs distribution on NorduGrid