Title: ATLAS Data Challenges
1 ATLAS Data Challenges
- Oxana Smirnova, Lund University, 2005-08-16; based on the slides of G. Poulard and P. Nevski
2 Overview
- Introduction
- ATLAS experiment
- ATLAS Data Challenges program
- Data Challenge 2
- The 3 Grid flavors (LCG, Grid3/OSG and NorduGrid/ARC)
- ATLAS production system
- ATLAS DC2 production
- Conclusions
3 Large Hadron Collider (LHC) at CERN
(Aerial view of the LHC site near Geneva, with Mont Blanc, 4810 m, in the background)
4 The ATLAS Experiment
- ATLAS collaboration: 2000 collaborators, 150 institutes, 34 countries
- Event rate: 2×10^9 events per year (200 Hz), with an average event size of 1.6 MByte
- Detector: diameter 25 m, barrel toroid length 26 m, end-cap end-wall chamber span 46 m, overall weight 7000 tons
5 Challenge of the LHC computing
- Storage: raw recording rate of 0.1–1 GByte/sec, accumulating 5–8 PetaBytes/year; ~10 PetaBytes of disk space
- Processing: the equivalent of 200,000 of today's fastest PCs
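To see how these figures follow from the event rate and size quoted on the previous slide, here is a back-of-the-envelope estimate (a minimal sketch; the assumed 10^7 seconds of effective data taking per year is not from the slides):

```python
# Rough estimate of ATLAS raw-data volumes from the numbers on the slides.
# The 1e7 s/year of effective data taking is an assumption, not a quoted figure.

EVENT_RATE_HZ = 200          # events recorded per second
EVENT_SIZE_MB = 1.6          # average raw event size in MBytes
LIVE_SECONDS_PER_YEAR = 1e7  # assumed effective running time per year

raw_rate_mb_s = EVENT_RATE_HZ * EVENT_SIZE_MB             # ~320 MB/s
events_per_year = EVENT_RATE_HZ * LIVE_SECONDS_PER_YEAR   # ~2e9 events
raw_pb_per_year = events_per_year * EVENT_SIZE_MB / 1e9   # MB -> PB

print(f"raw recording rate: {raw_rate_mb_s / 1000:.2f} GB/s")  # 0.32 GB/s
print(f"events per year:    {events_per_year:.1e}")            # 2.0e+09
print(f"raw data per year:  {raw_pb_per_year:.1f} PB")         # 3.2 PB
```

The raw stream alone gives roughly 3 PB/year; the 5–8 PB/year quoted on the slide presumably also counts simulated and derived data.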
6 Full chain of HEP data processing
(Diagram of the processing chain; slide adapted from Ch. Collins-Tooth and J. R. Catmore)
7 ATLAS Data Challenges
- Scope and Goals of the Data Challenges (DCs)
- Validate
- Computing Model
- Software
- Data Model
- DC1 (2002-2003)
- Put in place the full software chain
- Simulation of the data, digitization, pile-up
- Reconstruction
- Production system
- Tools (bookkeeping, monitoring, ...)
- Intensive use of Grid
- Build the ATLAS DC community
- DC2 (2004)
- A similar exercise to DC1, BUT
- Use of the Grid middleware developed in several projects:
- LHC Computing Grid project (LCG), to which CERN is committed
- Grid3/OSG in the US
- NorduGrid/ARC in the Nordic countries and elsewhere
8 DC2 production flow
9 DC2 production phases
- ATLAS DC2 started in July 2004
- The simulation part was finished by the end of September, and the pile-up and digitization parts by the end of November
- 10 Mevents were generated, simulated and digitized, and 2 Mevents were piled up
- Event mixing and reconstruction were done for 2.4 Mevents in December
- Grid technology provided the means to perform this worldwide mass production (the chain of processing phases is sketched below)
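A minimal sketch of that chain as an ordered list of stages (the stage names follow the slides; the function and dataset names are purely illustrative, not the real ATLAS tooling):

```python
# Illustrative sketch of the DC2 processing chain. Stage names follow the
# slides; the job-running logic below is a placeholder.

DC2_STAGES = [
    "event generation",     # physics generators produce particle-level events
    "detector simulation",  # full simulation of the detector response
    "pile-up",              # overlay of minimum-bias events
    "digitization",         # conversion into a raw-data-like format
    "event mixing",         # mixing of different physics samples
    "reconstruction",       # reconstruction of physics objects
]

def run_chain(dataset: str) -> None:
    """Run every DC2 stage in order on one dataset (placeholder logic)."""
    for stage in DC2_STAGES:
        print(f"[{dataset}] running stage: {stage}")

run_chain("example-dataset")
```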
10 The 3 Grid flavors
- LCG (http://cern.ch/LCG/)
- The job of the LHC Computing Grid Project (LCG) is to prepare the computing infrastructure for simulation, processing and analysis of the LHC data for all four of the LHC collaborations. This includes:
- a common infrastructure of libraries, tools and frameworks required to support the physics application software
- development and deployment of the computing services needed to store and process the data, providing batch and interactive facilities for the worldwide community of physicists involved in LHC
- Grid3 (http://www.ivdgl.org/grid2003/)
- The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the US Grid projects and the US participants in the LHC experiments ATLAS and CMS
- NorduGrid (http://www.nordugrid.org/)
- The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software
- LCG, Grid3 and NorduGrid take similar approaches, using the same foundations (the Globus Toolkit), but so far they are not fully interoperable
11 The 3 Grid flavors: LCG
The number of sites and resources is evolving rapidly
12 The 3 Grid flavors: NorduGrid/ARC
- A Grid based on the ARC middleware
- Driven (so far) mostly by the needs of the LHC experiments
- One of the world's largest production-level Grids
- Contributed significantly to DC1 (using the Grid already in 2002)
- Supports production on several operating systems (non-CERN platforms)
- Contribution to DC2:
- 22 sites in 7 countries
- ~3000 CPUs (~700 dedicated)
- 7 storage services with ~12 TB
- ~1 FTE in charge of the production
13 The 3 Grid flavors: Grid3
- As of September 2004:
- 30 sites, multi-VO
- shared resources
- ~3000 CPUs (shared)
- The deployed infrastructure has been in operation since November 2003
- Currently running 3 HEP and 2 biological applications
- Over 100 users are authorized to run on Grid3
14 ATLAS Production System
- A thin application-specific layer on top of the Grid and legacy systems
- Don Quijote is the data management system, interfacing to the Grid data indexing services (RLS)
- The Production Database (ProdDB) holds job definitions and status records
- Windmill, the supervisor, mediates between the ProdDB and the executors
- It can re-submit a job as many times as required
- Executors use a Grid-specific API to schedule and manipulate the jobs (a sketch of this supervisor/executor pattern follows the list):
- Capone: Grid3
- Dulcinea: ARC
- Lexor: LCG-2
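A minimal sketch of the supervisor/executor pattern just described (component names come from the slides; the classes, methods and round-robin retry policy are hypothetical, and the real Windmill-executor exchange uses a messaging protocol not modelled here):

```python
# Illustrative sketch of the DC2 supervisor/executor architecture.
# Component names (Windmill, Capone, Dulcinea, Lexor, ProdDB) come from the
# slides; everything else is hypothetical.

from dataclasses import dataclass

@dataclass
class JobDefinition:
    job_id: int
    transformation: str   # e.g. "simulation", "digitization", ...
    attempts: int = 0
    done: bool = False

class Executor:
    """Grid-flavor-specific back end (Capone, Dulcinea or Lexor)."""
    def __init__(self, name: str):
        self.name = name

    def submit(self, job: JobDefinition) -> bool:
        # A real executor would call its Grid-specific API here.
        print(f"{self.name}: submitting job {job.job_id} ({job.transformation})")
        return True  # pretend the job succeeded

class Supervisor:
    """Windmill-like supervisor: pulls jobs from the ProdDB and hands them to executors."""
    def __init__(self, executors: list[Executor], max_attempts: int = 3):
        self.executors = executors
        self.max_attempts = max_attempts

    def run(self, prod_db: list[JobDefinition]) -> None:
        for job in prod_db:
            # Re-submit until the job succeeds or the attempt limit is reached.
            while not job.done and job.attempts < self.max_attempts:
                executor = self.executors[job.attempts % len(self.executors)]
                job.attempts += 1
                job.done = executor.submit(job)

prod_db = [JobDefinition(1, "simulation"), JobDefinition(2, "digitization")]
Supervisor([Executor("Capone"), Executor("Dulcinea"), Executor("Lexor")]).run(prod_db)
```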
15 Emerging Hyperinfrastructure
16 ATLAS DC2 countries and sites
- Australia (3)
- Austria (1)
- Canada (4)
- CERN (1)
- Czech Republic (2)
- Denmark (4)
- France (1)
- Germany (12)
- Italy (7)
- Japan (1)
- Netherlands (1)
- Norway (4)
- Poland (1)
- Slovenia (1)
- Spain (3)
- Sweden (7)
- Switzerland (1)
- Taiwan (1)
- UK (7)
- USA (19)
19 countries, 72 sites
12 countries, 31 sites
7 countries, 22 sites
17 ATLAS DC2 production
(Plot: accumulated total number of jobs as of November 30, 2004)
18 Job distribution
As of 30 November 2004: 19 countries, 72 sites, ~260,000 jobs, ~2 MSi2k·months
19 Production Rate Growth
Expected rate: 4000 jobs/day
20 GRID Job statistics
ATLAS production in 2004-2005
- 516450 jobs done
- 60259 jobs NOT done
- 75872 jobs had no input
- 36085 jobs aborted (bad definitions)
- Not that bad!
- Approximate share of jobs per flavor: LCG ~40%, CondorG ~30%, Grid3 ~20%, NorduGrid ~10%
- Average number of submission attempts per job: ⟨Nattempt⟩ ≈ 1.7
- CondorG (CG) refers to direct job submission to LCG resources, circumventing the workload management and accounting services. It has been used by ATLAS since March 2005 and is planned to be extended to all Grid flavors
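A quick sanity check of the totals quoted above (a small sketch; treating the "no input" and "aborted" jobs as part of the overall sample is an assumption):

```python
# Sanity check of the 2004-2005 production statistics quoted on this slide.
jobs_done     = 516_450
jobs_not_done =  60_259   # failed jobs
jobs_no_input =  75_872   # jobs that had no input data
jobs_aborted  =  36_085   # aborted because of bad job definitions

total = jobs_done + jobs_not_done + jobs_no_input + jobs_aborted
print(f"total jobs handled: {total}")                  # 688666
print(f"success fraction:   {jobs_done / total:.1%}")  # ~75%
```

Note that this counts jobs, not submission attempts; with ⟨Nattempt⟩ ≈ 1.7, each finished job needed on average almost two Grid submissions.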
21 Production efficiency
- Why such differences?
- Human factor
- Maximum efficiencies were reached with the best-qualified operators
- Middleware issues
- ATLAS software: on-demand installation problems
- Database issues
- Data movement issues
(Efficiency plots for Grid3, NorduGrid and LCG; graphs by A. Vanyashin)
22 DC2 lessons: the problems
- The production system was still in development during DC2
- The beta status of the Grid services caused trouble while the system was in operation
- For example, the Globus RLS, the Resource Broker and the information system of LCG were unstable in the initial phase
- Especially on LCG, there was a lack of a monitoring system
- Misconfiguration of sites and site instabilities
- But also:
- Human factor (expired credentials, typos, lack of qualification, exhaustion)
- Network problems (connections lost between two processes)
- Data Management System problems (e.g. connections with mass storage systems, data movement and registration failures)
23 DC2 lessons: the achievements
- Have run a large-scale production on the Grid ONLY, using 3 Grid flavors
- Have an automatic production system making use of the Grid infrastructures
- A few tens of TB of data were moved among the different Grid flavors using the Don Quijote (ATLAS Data Management) servers
- ~260,000 jobs were submitted by the production system
- ~260,000 logical files were produced, and ~2500 jobs were run per day
24 Conclusion
- ATLAS DC2 proved that a Virtual Organization can efficiently use a variety of Grids
- 3 Grids might be too many, but that is better than 72 individual sites
- General lesson: production needs better planning
- software readiness, input generation, QA, etc.
- Job submission, control and monitoring have to be significantly improved
- Data management became critical and needs more effort
- 25% to 50% of all job failures are due to data management issues
- ATLAS databases: good progress, but still a lot of work ahead
- Software installation can and should be improved
- Better communication with resource providers is vital