Title: ATLAS Data Challenges
1 ATLAS Data Challenges
- Oxana Smirnova, Lund University, 2005-08-16; based on the slides of G. Poulard and P. Nevski
2 Overview
- Introduction
- ATLAS experiment
- ATLAS Data Challenges program
- Data Challenge 2
- The 3 Grid flavors (LCG, Grid3/OSG and NorduGrid/ARC)
- ATLAS production system
- ATLAS DC2 production
- Conclusions
3 Large Hadron Collider (LHC) at CERN
(Aerial view of the LHC site near Geneva, with Mont Blanc, 4810 m, in the background)
4 The ATLAS Experiment
- ATLAS collaboration: 2000 collaborators, 150 institutes, 34 countries
- Event rate: 2×10^9 events per year (200 Hz), with an average event size of 1.6 MByte
- Detector: diameter 25 m, barrel toroid length 26 m, end-cap end-wall chamber span 46 m, overall weight 7000 tons
5 Challenge of the LHC computing
- Storage: raw recording rate of 0.1–1 GByte/sec, accumulating 5–8 PetaBytes/year; ~10 PetaBytes of disk space
- Processing: the equivalent of 200,000 of today's fastest PCs
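To see how these figures follow from the event rate and size quoted on the previous slide, here is a back-of-the-envelope estimate (a minimal sketch; the assumed 10^7 seconds of effective data taking per year is not from the slides):

```python
# Rough estimate of ATLAS raw-data volumes from the numbers on the slides.
# The 1e7 s/year of effective data taking is an assumption, not a quoted figure.

EVENT_RATE_HZ = 200          # events recorded per second
EVENT_SIZE_MB = 1.6          # average raw event size in MBytes
LIVE_SECONDS_PER_YEAR = 1e7  # assumed effective running time per year

raw_rate_mb_s = EVENT_RATE_HZ * EVENT_SIZE_MB             # ~320 MB/s
events_per_year = EVENT_RATE_HZ * LIVE_SECONDS_PER_YEAR   # ~2e9 events
raw_pb_per_year = events_per_year * EVENT_SIZE_MB / 1e9   # MB -> PB

print(f"raw recording rate: {raw_rate_mb_s / 1000:.2f} GB/s")  # 0.32 GB/s
print(f"events per year:    {events_per_year:.1e}")            # 2.0e+09
print(f"raw data per year:  {raw_pb_per_year:.1f} PB")         # 3.2 PB
```

The raw stream alone gives roughly 3 PB/year; the 5–8 PB/year quoted on the slide presumably also counts simulated and derived data.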
6 Full chain of HEP data processing
(Diagram of the processing chain; slide adapted from Ch. Collins-Tooth and J. R. Catmore)
7 ATLAS Data Challenges
- Scope and Goals of the Data Challenges (DCs)
- Validate
- Computing Model
- Software
- Data Model
- DC1 (2002-2003)
- Put in place the full software chain
- Simulation of the data, digitization, pile-up
- Reconstruction
- Production system
- Tools (bookkeeping, monitoring, ...)
- Intensive use of Grid
- Build the ATLAS DC community
- DC2 (2004)
- A similar exercise to DC1, BUT
- Use of the Grid middleware developed in several projects:
- LHC Computing Grid project (LCG), to which CERN is committed
- Grid3/OSG in the US
- NorduGrid/ARC in the Nordic countries and elsewhere
8 DC2 production flow
9 DC2 production phases
- ATLAS DC2 started in July 2004
- The simulation part was finished by the end of September, and the pile-up and digitization parts by the end of November
- 10 Mevents were generated, simulated and digitized, and 2 Mevents were piled up
- Event mixing and reconstruction were done for 2.4 Mevents in December
- Grid technology provided the means to perform this worldwide mass production (the chain of processing phases is sketched below)
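A minimal sketch of that chain as an ordered list of stages (the stage names follow the slides; the function and dataset names are purely illustrative, not the real ATLAS tooling):

```python
# Illustrative sketch of the DC2 processing chain. Stage names follow the
# slides; the job-running logic below is a placeholder.

DC2_STAGES = [
    "event generation",     # physics generators produce particle-level events
    "detector simulation",  # full simulation of the detector response
    "pile-up",              # overlay of minimum-bias events
    "digitization",         # conversion into a raw-data-like format
    "event mixing",         # mixing of different physics samples
    "reconstruction",       # reconstruction of physics objects
]

def run_chain(dataset: str) -> None:
    """Run every DC2 stage in order on one dataset (placeholder logic)."""
    for stage in DC2_STAGES:
        print(f"[{dataset}] running stage: {stage}")

run_chain("example-dataset")
```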
10 The 3 Grid flavors
- LCG (http://cern.ch/LCG/)
- The job of the LHC Computing Grid Project (LCG) is to prepare the computing infrastructure for simulation, processing and analysis of the LHC data for all four of the LHC collaborations. This includes:
- a common infrastructure of libraries, tools and frameworks required to support the physics application software
- development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in LHC
- Grid3 (http://www.ivdgl.org/grid2003/)
- The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the US Grid projects and the US participants in the LHC experiments ATLAS and CMS
- NorduGrid (http://www.nordugrid.org/)
- The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software
- LCG, Grid3 and NorduGrid take similar approaches, using the same foundations (the Globus Toolkit), but so far they are not fully interoperable
11 The 3 Grid flavors: LCG
The number of sites and resources is evolving rapidly
12 The 3 Grid flavors: NorduGrid/ARC
- A Grid based on the ARC middleware
- Driven (so far) mostly by the needs of the LHC experiments
- One of the world's largest production-level Grids
- Contributed significantly to DC1 (using the Grid already in 2002)
- Supports production on several operating systems (non-CERN platforms)
- Contribution to DC2:
- 22 sites in 7 countries
- ~3000 CPUs (~700 dedicated)
- 7 storage services with ~12 TB
- ~1 FTE in charge of the production
13 The 3 Grid flavors: Grid3
- As of September 2004:
- 30 sites, multi-VO
- shared resources
- ~3000 CPUs (shared)
- The deployed infrastructure has been in operation since November 2003
- Currently running 3 HEP and 2 biological applications
- Over 100 users are authorized to run on Grid3
14 ATLAS Production System
- A thin application-specific layer on top of the Grid and legacy systems
- Don Quijote is the data management system, interfacing to the Grid data indexing services (RLS)
- The Production Database (ProdDB) holds job definitions and status records
- Windmill, the supervisor, mediates between the ProdDB and the executors
- It can re-submit a job as many times as required
- Executors use a Grid-specific API to schedule and manipulate the jobs (a sketch of this supervisor/executor pattern follows the list):
- Capone: Grid3
- Dulcinea: ARC
- Lexor: LCG-2
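A minimal sketch of the supervisor/executor pattern just described (component names come from the slides; the classes, methods and round-robin retry policy are hypothetical, and the real Windmill-executor exchange uses a messaging protocol not modelled here):

```python
# Illustrative sketch of the DC2 supervisor/executor architecture.
# Component names (Windmill, Capone, Dulcinea, Lexor, ProdDB) come from the
# slides; everything else is hypothetical.

from dataclasses import dataclass

@dataclass
class JobDefinition:
    job_id: int
    transformation: str   # e.g. "simulation", "digitization", ...
    attempts: int = 0
    done: bool = False

class Executor:
    """Grid-flavor-specific back end (Capone, Dulcinea or Lexor)."""
    def __init__(self, name: str):
        self.name = name

    def submit(self, job: JobDefinition) -> bool:
        # A real executor would call its Grid-specific API here.
        print(f"{self.name}: submitting job {job.job_id} ({job.transformation})")
        return True  # pretend the job succeeded

class Supervisor:
    """Windmill-like supervisor: pulls jobs from the ProdDB and hands them to executors."""
    def __init__(self, executors: list[Executor], max_attempts: int = 3):
        self.executors = executors
        self.max_attempts = max_attempts

    def run(self, prod_db: list[JobDefinition]) -> None:
        for job in prod_db:
            # Re-submit until the job succeeds or the attempt limit is reached.
            while not job.done and job.attempts < self.max_attempts:
                executor = self.executors[job.attempts % len(self.executors)]
                job.attempts += 1
                job.done = executor.submit(job)

prod_db = [JobDefinition(1, "simulation"), JobDefinition(2, "digitization")]
Supervisor([Executor("Capone"), Executor("Dulcinea"), Executor("Lexor")]).run(prod_db)
```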
15 Emerging Hyperinfrastructure
16 ATLAS DC2 countries and sites
- Australia (3)
- Austria (1)
- Canada (4)
- CERN (1)
- Czech Republic (2)
- Denmark (4)
- France (1)
- Germany (12)
- Italy (7)
- Japan (1)
- Netherlands (1)
- Norway (4)
- Poland (1)
- Slovenia (1)
- Spain (3)
- Sweden (7)
- Switzerland (1)
- Taiwan (1)
- UK (7)
- USA (19)
19 countries, 72 sites
12 countries, 31 sites
7 countries, 22 sites
17 ATLAS DC2 production
(Plot: accumulated total number of jobs as of November 30, 2004)
18 Job distribution
As of 30 November 2004: 19 countries, 72 sites, ~260,000 jobs, ~2 MSi2k·months
19 Production Rate Growth
Expected rate: 4000 jobs/day
20 GRID Job statistics
ATLAS production in 2004-2005
- 516450 jobs done
- 60259 jobs NOT done
- 75872 jobs had no input
- 36085 jobs aborted (bad definitions)
- Not that bad!
- Approximate share of jobs per flavor: LCG ~40%, CondorG ~30%, Grid3 ~20%, NorduGrid ~10%
- Average number of submission attempts per job: ⟨Nattempt⟩ ≈ 1.7
- CondorG (CG) refers to direct job submission to LCG resources, circumventing the workload management and accounting services. It has been used by ATLAS since March 2005 and is planned to be extended to all Grid flavors
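A quick sanity check of the totals quoted above (a small sketch; treating the "no input" and "aborted" jobs as part of the overall sample is an assumption):

```python
# Sanity check of the 2004-2005 production statistics quoted on this slide.
jobs_done     = 516_450
jobs_not_done =  60_259   # failed jobs
jobs_no_input =  75_872   # jobs that had no input data
jobs_aborted  =  36_085   # aborted because of bad job definitions

total = jobs_done + jobs_not_done + jobs_no_input + jobs_aborted
print(f"total jobs handled: {total}")                  # 688666
print(f"success fraction:   {jobs_done / total:.1%}")  # ~75%
```

Note that this counts jobs, not submission attempts; with ⟨Nattempt⟩ ≈ 1.7, each finished job needed on average almost two Grid submissions.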
21 Production efficiency
- Why such differences?
- Human factor
- Maximum efficiencies were reached with the best-qualified operators
- Middleware issues
- ATLAS software: on-demand installation problems
- Database issues
- Data movement issues
(Efficiency plots for Grid3, NorduGrid and LCG; graphs by A. Vanyashin)
22 DC2 lessons: the problems
- The production system was still in development during DC2
- The beta status of the Grid services caused trouble while the system was in operation
- For example, the Globus RLS, the Resource Broker and the information system of LCG were unstable in the initial phase
- Especially on LCG, there was a lack of a monitoring system
- Misconfiguration of sites and site instabilities
- But also:
- Human factor (expired credentials, typos, lack of qualification, exhaustion)
- Network problems (connections lost between two processes)
- Data Management System problems (e.g. connections with mass storage systems, data movement and registration failures)
23 DC2 lessons: the achievements
- Have run a large-scale production on the Grid ONLY, using 3 Grid flavors
- Have an automatic production system making use of the Grid infrastructures
- A few tens of TB of data were moved among the different Grid flavors using the Don Quijote (ATLAS Data Management) servers
- ~260,000 jobs were submitted by the production system
- ~260,000 logical files were produced, and ~2500 jobs were run per day
24 Conclusion
- ATLAS DC2 proved that a Virtual Organization can efficiently use a variety of Grids
- 3 Grids might be too many, but that is better than 72 individual sites
- General lesson: production needs better planning
- software readiness, input generation, QA, etc.
- Job submission, control and monitoring have to be significantly improved
- Data management became critical and needs more effort
- 25% to 50% of all job failures are due to data management issues
- ATLAS databases: good progress, but still a lot of work ahead
- Software installation can and should be improved
- Better communication with resource providers is vital