ATLAS Data Challenges

Transcript and Presenter's Notes
1
ATLAS Data Challenges
  • Oxana Smirnova, Lund University, 2005-08-16, based on the slides of G. Poulard and P. Nevski

2
Overview
  • Introduction
  • ATLAS experiment
  • ATLAS Data Challenges program
  • Data Challenge 2
  • The 3 Grid flavors (LCG, Grid3/OSG and NorduGrid/ARC)
  • ATLAS production system
  • ATLAS DC2 production
  • Conclusions

3
Large Hadron Collider (LHC) at CERN
(Aerial view of the LHC site near Geneva, with Mont Blanc, 4810 m, in the background)
4
The ATLAS Experiment
ATLAS collaboration: 2000 collaborators, 150 institutes, 34 countries
Event rate: 2x10^9 events per year (200 Hz), with an average event size of 1.6 MB
Detector: diameter 25 m, barrel toroid length 26 m, end-cap end-wall chamber span 46 m, overall weight 7000 tons
5
Challenge of the LHC computing
Storage: raw recording rate of 0.1-1 GB/s, accumulating at 5-8 PB/year; about 10 PB of disk space
Processing: equivalent to 200,000 of today's fastest PCs
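A back-of-the-envelope check of these storage numbers, using only the event rate and event size quoted on the previous slide; the script below is an illustrative sketch (the variable names and unit conversions are mine, not from the presentation).

```python
# Rough cross-check of the LHC storage figures, assuming the ATLAS numbers
# from the previous slide: 200 Hz recording rate, 1.6 MB/event, 2x10^9 events/year.
EVENT_RATE_HZ = 200
EVENT_SIZE_MB = 1.6
EVENTS_PER_YEAR = 2e9

raw_rate_gb_s = EVENT_RATE_HZ * EVENT_SIZE_MB / 1024            # ~0.3 GB/s
raw_volume_pb_year = EVENTS_PER_YEAR * EVENT_SIZE_MB / 1024**3  # ~3 PB/year of raw data

print(f"raw recording rate: {raw_rate_gb_s:.2f} GB/s (slide quotes 0.1-1 GB/s)")
print(f"raw data per year : {raw_volume_pb_year:.1f} PB (slide quotes 5-8 PB/year accumulated in total)")
```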
6
Full chain of HEP data processing
Slide adapted from Ch. Collins-Tooth and J. R. Catmore
7
ATLAS Data Challenges
  • Scope and Goals of the Data Challenges (DCs)
  • Validate
  • Computing Model
  • Software
  • Data Model
  • DC1 (2002-2003)
  • Put in place the full software chain
  • Simulation of the data, digitization, pile-up
  • Reconstruction
  • Production system
  • Tools (bookkeeping, monitoring, ...)
  • Intensive use of Grid
  • Build the ATLAS DC community
  • DC2 (2004)
  • Similar exercise to DC1, BUT
  • Use of the Grid middleware developed in several
    projects
  • LHC Computing Grid project (LCG), to which CERN is committed
  • Grid3/OSG in the US
  • NorduGrid/ARC in the Nordic countries and elsewhere

8
DC2 production flow
9
DC2 production phases
  • ATLAS DC2 started in July 2004
  • The simulation part was finished by the end of
    September and the pile-up and digitization parts
    by the end of November
  • 10 Mevents were generated, simulated and digitized, and 2 Mevents were piled up
  • Event mixing and reconstruction were done for 2.4 Mevents in December
  • Grid technology provided the means to perform this worldwide mass production (the chain of phases is summarized in the sketch below)
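As a compact restatement of the production flow above, here is an illustrative summary in code; the stage names and layout are mine, while the event counts and dates come from this slide.

```python
# DC2 processing chain (July-December 2004), restated from the slide above.
# Stage names and structure are illustrative; counts and dates are as quoted.
dc2_phases = [
    ("generation + simulation",          10_000_000, "finished end of September 2004"),
    ("pile-up",                            2_000_000, "finished end of November 2004"),
    ("digitization",                      10_000_000, "finished end of November 2004"),
    ("event mixing + reconstruction",      2_400_000, "done in December 2004"),
]

for stage, events, status in dc2_phases:
    print(f"{stage:<32} {events:>12,} events  ({status})")
```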

10
The 3 Grid flavors
  • LCG (http://cern.ch/LCG/)
  • The job of the LHC Computing Grid Project (LCG) is to prepare the computing infrastructure for simulation, processing and analysis of the LHC data for all four LHC collaborations. This includes:
  • common infrastructure of libraries, tools and
    frameworks required to support the physics
    application software
  • development and deployment of the computing
    services needed to store and process the data,
    providing batch and interactive facilities for
    the worldwide community of physicists involved in
    LHC
  • Grid3 (http://www.ivdgl.org/grid2003/)
  • The Grid3 collaboration has deployed an international Data Grid with dozens of sites and thousands of processors. The facility is operated jointly by the US Grid projects and the US participants in the LHC experiments ATLAS and CMS
  • NorduGrid (http://www.nordugrid.org/)
  • The aim of the NorduGrid collaboration is to deliver a robust, scalable, portable and fully featured solution for a global computational and data Grid system. NorduGrid develops and deploys a set of tools and services, the so-called ARC middleware, which is free software
  • LCG, Grid3 and NorduGrid have similar approaches
    using the same foundations (the Globus Toolkit),
    but so far are not fully interoperable

11
The 3 Grid flavors: LCG
Number of sites and resources is evolving rapidly
12
The 3 Grid flavors: NorduGrid/ARC
  • A Grid based on ARC middleware
  • Driven (so far) mostly by the needs of the LHC
    experiments
  • One of the world's largest production-level Grids
  • Contributed significantly to the DC1 (using the
    Grid already in 2002)
  • Supports production on several operating systems
    (non-CERN platforms)
  • Contribution to DC2
  • 22 sites in 7 countries
  • 3000 CPUs (700 dedicated)
  • 7 storage services of 12 TB
  • 1 FTE in charge of the production

13
The 3 Grid flavors: Grid3
  • September 2004:
  • 30 sites, multi-VO
  • shared resources
  • 3000 CPUs (shared)
  • The deployed infrastructure has been in operation
    since November 2003
  • Currently running 3 HEP and 2 biological applications
  • Over 100 users are authorized to run on Grid3

14
ATLAS Production System
  • Thin application-specific layer on top of the Grid and legacy systems
  • Don Quijote is a data management system, interfacing to Grid data indexing services (RLS)
  • The Production Database (ProdDB) holds job definitions and status records
  • Windmill, the supervisor, mediates between the ProdDB and the executors
  • It can re-submit a job as many times as required
  • Executors use Grid-specific APIs to schedule and manipulate jobs (a minimal sketch of this supervisor/executor pattern follows below)
  • Capone (Grid3)
  • Dulcinea (ARC)
  • Lexor (LCG2)
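A minimal sketch of the supervisor/executor pattern described above, assuming a hypothetical interface; the class and function names below (Job, Executor, run_supervisor) are illustrative and do not reflect the real Windmill, Capone, Dulcinea or Lexor code.

```python
# Illustrative sketch of the supervisor/executor pattern: a supervisor takes job
# definitions (as Windmill takes them from the ProdDB), hands them to a
# Grid-specific executor, and re-submits failed jobs. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Job:
    job_id: int
    definition: dict = field(default_factory=dict)
    attempts: int = 0
    state: str = "defined"          # defined -> done / failed

class Executor:
    """Stand-in for the Grid-specific executors (Capone, Dulcinea, Lexor)."""
    def submit(self, job: Job) -> bool:
        raise NotImplementedError

class ToyExecutor(Executor):
    def submit(self, job: Job) -> bool:
        # A real executor would call its Grid's submission API here;
        # this toy version pretends the first attempt always fails.
        return job.attempts >= 2

def run_supervisor(jobs, executor: Executor, max_attempts: int = 3) -> None:
    """Dispatch jobs and re-submit failures, as the supervisor layer does."""
    for job in jobs:
        while job.state != "done" and job.attempts < max_attempts:
            job.attempts += 1
            job.state = "done" if executor.submit(job) else "failed"
        print(f"job {job.job_id}: {job.state} after {job.attempts} attempt(s)")

run_supervisor([Job(1), Job(2)], ToyExecutor())
```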

15
Emerging Hyperinfrastructure
16
ATLAS DC2 countries and sites
  • Australia (3)
  • Austria (1)
  • Canada (4)
  • CERN (1)
  • Czech Republic (2)
  • Denmark (4)
  • France (1)
  • Germany (12)
  • Italy (7)
  • Japan (1)
  • Netherlands (1)
  • Norway (4)
  • Poland (1)
  • Slovenia (1)
  • Spain (3)
  • Sweden (7)
  • Switzerland (1)
  • Taiwan (1)
  • UK (7)
  • USA (19)

19 countries, 72 sites
12 countries, 31 sites
7 countries, 22 sites
17
ATLAS DC2 production
Accumulated number of jobs (total) as of 30 November 2004
18
Job distribution
As of 30 November 2004:
19 countries, 72 sites, 260,000 jobs, 2 MSi2k-months
19
Production Rate Growth
Expected rate: 4000 jobs/day
20
GRID Job statistics
ATLAS production in 2004-2005
  • 516,450 jobs done
  • 60,259 jobs NOT done
  • 75,872 jobs had no input
  • 36,085 jobs aborted (bad definitions)
  • Not that bad! (see the quick arithmetic check below)
  • Share of jobs per Grid flavor: LCG / CG / Grid3 / NG, roughly 40% / 30% / 20% / 10%
  • Average number of submission attempts per job: ⟨N_attempt⟩ ≈ 1.7
  • CondorG (CG) refers to direct job submission to LCG resources, circumventing the workload management and accounting services. Used by ATLAS since March 2005, and planned to be expanded to all the Grid flavors
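A quick arithmetic check of the totals above; the success fraction and the submission estimate are derived here for illustration, not quoted on the slide.

```python
# Success fraction for the 2004-2005 production, from the counts on this slide.
done, not_done = 516_450, 60_259
print(f"overall success fraction: {done / (done + not_done):.1%}")   # ~89.6%

# With ~1.7 submission attempts per job on average, the number of Grid-level
# submissions was substantially larger than the number of jobs finally done.
avg_attempts = 1.7
print(f"approximate Grid submissions: {done * avg_attempts:,.0f}")
```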

21
Production efficiency
  • Why such differences?
  • Human factor
  • Maximum efficiencies were reached with the best-qualified operators
  • Middleware issues
  • ATLAS software on-demand installation problems
  • Database issues
  • Data movement issues

(Efficiency plots for Grid3, NorduGrid and LCG; graphs by A. Vanyashin)
22
DC2 lessons: the problems
  • The production system was in development during
    DC2
  • The beta status of the Grid services caused trouble while the system was in operation
  • For example, the Globus RLS, the Resource Broker and the LCG information system were unstable in the initial phase
  • Especially on LCG, lack of a monitoring system
  • Misconfiguration of sites and site instabilities
  • But also
  • Human factor (expired credentials, typos, lack of
    qualification, exhaustion)
  • Network problems (connection lost between two
    processes)
  • Data Management System problems (e.g. connections to mass storage systems, data movement and registration failures)

23
DC2 lessons: the achievements
  • Ran a large-scale production on the Grid ONLY, using 3 Grid flavors
  • Built an automatic production system making use of the Grid infrastructures
  • A few tens of TB of data were moved among the different Grid flavors using the Don Quijote (ATLAS Data Management) servers
  • About 260,000 jobs were submitted by the production system
  • About 260,000 logical files were produced, and about 2,500 jobs were run per day

24
Conclusion
  • ATLAS DC2 proved that a Virtual Organization can
    efficiently use a variety of Grids
  • 3 Grids might be too many, but that is better than 72 individual sites
  • General lesson: production needs better planning
  • software readiness, input generation, QA, etc.
  • Job submission, control and monitoring have to
    be significantly improved
  • Data management became critical and needs more effort
  • 25% to 50% of all job failures are due to data management issues
  • ATLAS databases: good progress, but still a lot of work ahead
  • Software installation can and should be
    improved
  • Better communication with resource providers is vital