Title: José SALT
EGEE-II and its implications for the LHC Computing GRID
- José SALT
- 22 June 2006
- Postgraduate Course on GRID and e-Science / Kick-off meeting of Int.eu.grid
- Instituto de Física de Cantabria
Overview
- 1. EGEE-II: a brief description of the project
- 2. GRID Computing in the LHC Experiments
- 3. The EGEE vision and its relationship with the ATLAS TIER-2
- 4. Conclusions and Perspectives
1. EGEE-II: a brief description of the project
- EGEE brings together scientists and engineers from 90 institutions
- in over 30 countries worldwide
- to provide a seamless GRID infrastructure for e-Science, available 24 hours/day, 7 days/week
- Funded by the EU (European Commission)
- Two original scientific fields, HEP and Life Sciences, but it now integrates many other fields, from Geology to Computational Chemistry
- Infrastructure: about 30,000 CPUs and 5 PB of storage
- Sustains 10,000 concurrent jobs on average
EGEE-II Activity Packages
- SA (Specific Service Activities)
  - SA1 GRID Operations, Support and Management (IFIC-IFCA)
  - SA2 Networking Support
  - SA3 Integration, Testing and Certification (IFIC)
- Networking Activities
  - NA1 Management of the Project
  - NA2 Dissemination, Outreach and Communication (IFIC-IFCA)
  - NA3 Training and Induction (IFIC-IFCA)
  - NA4 Application Identification and Support (CNB)
  - NA5 Policy and International Cooperation
- Joint Research Activities
  - JRA1 Middleware Re-Engineering
  - JRA2 Quality Assurance
EGEE-II activities
- Operation of the GRID Infrastructure (ROC Manager)
- SA1 Infrastructure Operations: European GRID support, operation and management; includes tasks such as GRID monitoring and control, and resource and user support
- Regional Operations Centre (ROC): activities are coordinated in federations; SWE (South West Europe) comprises LIP, IFIC, IFCA and PIC
- SA3 Integration, Testing and Certification: manages the process of building deployable and documented middleware distributions, starting from the integration of middleware packages and components from a variety of sources
- NA2/NA3 Dissemination and Training
- NA4 Applications (HEP, Biomed)
[Slide showing participating Spanish centres, including BIFI (Zaragoza) and CIEMAT (Madrid)]
Why so many GRID e-Science projects?
- GRID has different approaches and perspectives: a) from the infrastructure point of view, b) from the GRID development and deployment point of view, c) from the applications point of view
- Antecedents: during the last 6 years IFIC (and IFCA)/CSIC has participated in GRID projects within the EU Framework Programme: DATAGRID (2001-2004), CROSSGRID (2002-2005) and, now, EGEE (phases I and II)
- DATAGRID tried to cover all these aspects; CROSSGRID invested more effort in incorporating new applications
- The aim is to join efforts to establish e-Science in Spain
- EGEE and LCG (LHC Computing GRID) are strongly coupled, but they offer complementary visions of a given problem (e-Science)
- EGEE continues the effort on the 3 fronts, but the complexity of the different aspects generates related projects
- The HEP community has had a leading role in several GRID and e-Science initiatives
EGEE as an incubator of GRID and e-Science projects
- Int.eu.grid
2. GRID Computing in the LHC Experiments
- In High Energy Physics there is a list of experiments (accelerator and non-accelerator) with different scientific objectives within the field of Elementary Particle Physics
  - Accelerator-based: Fermilab, SLAC, CERN, etc.
  - Non-accelerator: astroparticle experiments (AMS, ANTARES, K2K, MAGIC, ...)
- Computing problems: computing power, data storage and access to data
- The main challenge lies in the LHC experiments
Applications in High Energy Physics
- The LHC Computing Grid: a global computing facility for physics
- Where? CERN. Name of the accelerator: LHC (Large Hadron Collider)
- Less than 2 years left until the first collisions in the LHC
The 4 LHC experiments: ALICE, ATLAS, CMS and LHCb
- Detector: study of p-p collisions at high energies
- Start of data taking: Spring 2007
- Level-3 trigger: 200 events/s, with an event size of 1.6 MB/event
- Data volume: 2 PB/year during 10 years (a back-of-the-envelope check follows below)
- Estimated CPU needed to process the LHC data: 100,000 PCs
- This generates 3 problems:
  - Data storage
  - Processing
  - Users scattered worldwide
- Solution: GRID TECHNOLOGIES
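As a back-of-the-envelope cross-check of these figures (a sketch only: the ~1e7 live seconds of data taking per year is an assumed value, not a number quoted on the slide):

# Back-of-the-envelope check of the raw-data volume quoted above.
# Assumption (not from the slide): ~1e7 live seconds of data taking per year.
event_rate_hz = 200          # events/s after the Level-3 trigger
event_size_mb = 1.6          # MB per event
live_seconds_per_year = 1e7  # assumed effective data-taking time

throughput_mb_s = event_rate_hz * event_size_mb              # ~320 MB/s
raw_volume_pb_per_year = throughput_mb_s * live_seconds_per_year / 1e9
print(f"~{throughput_mb_s:.0f} MB/s, ~{raw_volume_pb_per_year:.1f} PB/year of raw data")
# -> ~320 MB/s and ~3.2 PB/year of raw data, i.e. a few PB/year,
#    consistent with the order of magnitude of the 2 PB/year quoted on the slide.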
The ATLAS Computing Model
- It covers a wide range of activities, from the storage of raw data up to providing the possibility of performing data analysis in a university department (member of the ATLAS Collaboration)
- The data undergo several transformations in order to reduce their size and extract the relevant information: the data reduction chain (an illustrative sketch follows below)
- The analysing physicist will navigate the data across their different formats in order to extract the needed information; this activity will have a big influence on the fine tuning of the Computing Model
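To make the data reduction chain concrete, the sketch below uses indicative per-event sizes for the main ATLAS formats of that era (RAW about 1.6 MB, ESD about 0.5 MB, AOD about 0.1 MB, TAG about 1 kB); these figures are approximate assumptions for illustration, not numbers taken from the slide.

# Indicative per-event sizes along the ATLAS data reduction chain
# (approximate design-era figures, for illustration only).
reduction_chain = [
    ("RAW", 1600.0),   # raw detector data, kB/event
    ("ESD", 500.0),    # Event Summary Data: output of reconstruction
    ("AOD", 100.0),    # Analysis Object Data: reduced, analysis-oriented
    ("TAG", 1.0),      # event-level metadata for fast selection
]

events_per_year = 200 * 1e7   # trigger rate x assumed live seconds
for name, kb_per_event in reduction_chain:
    volume_tb = events_per_year * kb_per_event / 1e9   # kB -> TB
    print(f"{name:>4}: {kb_per_event:7.1f} kB/event -> ~{volume_tb:8.1f} TB/year")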
The LHC Computing Model (in a nutshell!)
[Diagram: Tier-0 (CERN) feeding the Tier-1 centres (RAL, IN2P3, FNAL, CNAF, FZK, PIC, ICEPP, BNL), which in turn serve the Tier-2s, the Tier-3 centres and individual PCs/laptops]
- Tier-0: the CERN centre
  - Filters the raw data
  - Reconstruction → Event Summary Data (ESD)
  - Registration of raw data and ESD
  - Distribution of raw data and ESD to the Tier-1s
- Tier-1
  - Permanent storage and organization of raw data, ESD, calibration data, metadata, analysis data and databases → data services provided using GRID
  - Massive data analysis
  - Reprocessing of raw data → ESD
  - Support centre at the national/regional level
  - → high availability for online data acquisition, mass storage and managed data for long-term commitments
- Tier-2
  - Disk-based data storage services provided via GRID
  - Production of simulated data on demand of the experiment
  - Provision of analysis capacity to the physics groups: operation of a data analysis facility (20 analysis lines working in parallel)
  - Provision of the network services for the data exchange with the Tier-1
(A toy sketch of this tiered data flow follows below.)
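The tiered flow just described can be pictured with a toy sketch; the Tier-1 names come from the slide, but the round-robin placement and the dataset names are invented for illustration and do not reflect the real ATLAS distribution policy.

# Illustrative sketch of the tiered model described above: Tier-0 keeps the
# primary RAW+ESD copy and distributes data to the Tier-1s, which in turn
# serve analysis and simulation capacity to their associated Tier-2s.
TIER1_CENTRES = ["RAL", "IN2P3", "FNAL", "CNAF", "FZK", "PIC", "ICEPP", "BNL"]

def distribute_from_tier0(datasets, tier1s):
    """Toy round-robin placement of dataset replicas from Tier-0 onto Tier-1s."""
    placement = {t1: [] for t1 in tier1s}
    for i, dataset in enumerate(datasets):
        placement[tier1s[i % len(tier1s)]].append(dataset)
    return placement

if __name__ == "__main__":
    raw_datasets = [f"raw_run_{run:05d}" for run in range(16)]
    for site, held in distribute_from_tier0(raw_datasets, TIER1_CENTRES).items():
        print(f"Tier-1 {site:6s} holds {len(held)} RAW dataset(s)")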
High Energy Physics e-Infrastructures for the LHC: the Tiers
- TIER-1: PIC (CIEMAT, IFAE)
- CMS Tier-2: CIEMAT, IFCA
- ATLAS Tier-2: UAM, IFAE and IFIC
- LHCb Tier-2: USC, UB
  (the underlined centres in the original slide are the coordinators)
- TIER-3: university departments, research centres, etc.
- Recent funding by the Spanish HEP Programme (2005-2007)
IFIC
- 230 PCs (172 IFIC + 58 ICMOL)
  - 96 Athlon 1.2 GHz, 1 GB SDRAM
  - 96 Athlon 1.4 GHz, 1 GB DDR
  - 30 PE850 (DELL) nodes, Dual Core @ 3.2 GHz
  - Local disk: 40 GB / 160 GB
  - Fast Ethernet, aggregated with Gigabit Ethernet
- Manpower: 6 FTE
- STK L700e700 tape robot, up to 134 TB capacity
- 4 disk servers (5 TB), 2 tape servers
UAM / IFAE
- The computing nodes and the disks will be hosted in racks in the PIC computer room
- A location is reserved for the ATLAS Tier-2
- Disk servers: 4.5 TB
3. The EGEE vision and its relationship with the ATLAS TIER-2
- An important issue: a GRID-based Distributed Analysis System
- Open issues from the LHC experiments
An important issue: a GRID-based Distributed Analysis System
- Distributed Analysis (DA) has to be performed in parallel to the ATLAS production (using up to 50% of the ATLAS resources)
- Differences:
  - Production jobs are typically long simulation jobs, CPU-dominated and with large memory requirements
  - DA jobs are much more I/O-oriented, with considerably smaller memory requirements
- Plans according to the 3 GRID flavours:
  - LCG plans to use the gLite Resource Broker and Condor-G to submit jobs; sites support DA by providing a special CE for analysis or short jobs (see the job-description sketch below)
- Prototype of a DA system at the TIER-2
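To make the job-submission side more concrete, here is a sketch that writes a minimal JDL description for a short analysis job. The attribute names follow common EDG/gLite JDL usage, but the exact attributes, requirements and the submission command depend on the middleware release, so treat this as an assumption-laden illustration rather than the ATLAS DA system itself.

# Sketch: generate a minimal JDL file for a short analysis job.
# Attribute names follow common EDG/gLite JDL usage; exact syntax and the
# submission command (an RB/WMS client) depend on the middleware release.
analysis_jdl = """\
Executable    = "run_analysis.sh";
Arguments     = "AOD.pool.root";
StdOutput     = "analysis.out";
StdError      = "analysis.err";
InputSandbox  = {"run_analysis.sh"};
OutputSandbox = {"analysis.out", "analysis.err", "histos.root"};
VirtualOrganisation = "atlas";
"""

with open("analysis_job.jdl", "w") as f:
    f.write(analysis_jdl)
print("Wrote analysis_job.jdl; submit it with your site's RB/WMS client.")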
The need for Distributed Data Management in ATLAS
- GRID provides services and tools for Distributed Data Management: low-level file catalogues, storage and transfer services
- ATLAS uses different GRID flavours (LCG, OSG, NorduGrid), and each one has its own version of these services
- It is therefore necessary to implement a specific layer over the GRID middleware
- What is the objective? To manage the ATLAS data flow according to the computing model, providing a single entry point for all the distributed data of ATLAS
- The DDM (Distributed Data Management) system achieves this objective by means of a piece of software called Don Quijote (DQ)
Don Quijote
- The first version of DQ simply provided an interface to the different catalogues of the 3 GRID flavours, in order to locate the data, together with a simple file transfer system (a conceptual sketch follows below)
- DQ was tested in the Data Challenge 2 (DC2) programme of ATLAS, whose purpose was to validate the software and the data model of the experiment
- Due to (a) scalability problems and (b) the progress in GRID middleware, DQ had to be re-engineered into DQ2
[Diagram: DQ queries the replica catalogues of the three flavours: LCG (EDG RLS), OSG (Globus RLS), NorduGrid (Globus RLS)]
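A conceptual sketch of the "single entry point over several catalogues" idea; the class and method names below are hypothetical and the catalogue contents are invented, so this is an illustration of the pattern rather than the real DQ code.

# Conceptual sketch of DQ's role: one query interface that delegates the
# lookup of a logical file name to the replica catalogue of each GRID flavour.
class ReplicaCatalogue:
    def __init__(self, flavour, entries):
        self.flavour = flavour          # e.g. "LCG (EDG RLS)"
        self._entries = entries         # {logical file name: [replica URLs]}

    def lookup(self, lfn):
        return self._entries.get(lfn, [])

class DonQuijoteLike:
    """Single entry point that queries all flavour catalogues."""
    def __init__(self, catalogues):
        self.catalogues = catalogues

    def find_replicas(self, lfn):
        replicas = {}
        for cat in self.catalogues:
            found = cat.lookup(lfn)
            if found:
                replicas[cat.flavour] = found
        return replicas

dq = DonQuijoteLike([
    ReplicaCatalogue("LCG (EDG RLS)", {"dc2.aod.0001": ["srm://se.example.lcg/dc2/aod.0001"]}),
    ReplicaCatalogue("OSG (Globus RLS)", {}),
    ReplicaCatalogue("NorduGrid (Globus RLS)", {"dc2.aod.0001": ["gsiftp://se.example.ng/aod.0001"]}),
])
print(dq.find_replicas("dc2.aod.0001"))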
Open issues from the LHC experiments
- Security, authorization, authentication
  - VOMS available and stable (Prio: high)
  - VOMS groups and roles used by all middleware (Prio: high)
- Information System
  - Stable access to static information (Prio: medium)
  - Access to the static information
- Storage Management
  - SRM interface provided by all Storage Element services (done)
  - Support for disk quota management (Prio: low): disk quota management, both at group and user level, should be offered by all storage services
  - Checking of the integrity/validity after the creation of a new replica (Prio: critical) (see the checksum sketch below)
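The integrity/validity check after replica creation can be pictured as a simple checksum comparison between the source copy and the new replica; the sketch below is a generic illustration with made-up paths, not the actual middleware mechanism.

import hashlib

# Generic illustration of an integrity check after creating a new replica:
# compare a checksum of the source copy with the freshly written replica.
def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def replica_is_valid(source_path, replica_path):
    """True if the new replica matches the source checksum."""
    return file_checksum(source_path) == file_checksum(replica_path)

# Hypothetical usage after a transfer tool has written the replica locally:
# if not replica_is_valid("/data/source/file.root", "/data/replica/file.root"):
#     raise RuntimeError("replica corrupted: re-transfer or invalidate the copy")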
- Data Management
  - FTS improvements and feature requests as specified in the FTS workshop (Prio: critical)
  - Central entry point for all transfers: FTS should provide a single central entry point for all the required transfer channels, including T0-T1, T1-T1 and T1-T2/T2-T1 transfers, and for the T2 sites running analysis tasks (Prio: critical)
  - Support for priorities, with the possibility of late reshuffling (Prio: low)
  - POSIX file access based on the LFN
  - File access API (GFAL library) using multiple instances of LFC (Prio: high) (see the conceptual sketch below)
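The "POSIX file access based on the LFN" item can be read as: resolve the logical file name in an LFC-like catalogue to a physical replica, then open that replica through a POSIX-like layer. The sketch below uses hypothetical function names and an in-memory catalogue; it is not the real GFAL/LFC API.

# Conceptual sketch of POSIX-style access by logical file name (LFN):
# 1) ask a catalogue (LFC-like) for the physical replicas of the LFN,
# 2) open the chosen replica through a POSIX-like access layer.
# All names here are hypothetical placeholders, not the real GFAL/LFC calls.
CATALOGUE = {
    "/grid/atlas/user/example/aod_0001.root": [
        "srm://se.example.site1/atlas/aod_0001.root",
        "srm://se.example.site2/atlas/aod_0001.root",
    ],
}

def resolve_lfn(lfn):
    """Return the list of replica URLs registered for this LFN."""
    replicas = CATALOGUE.get(lfn)
    if not replicas:
        raise FileNotFoundError(f"no replica registered for {lfn}")
    return replicas

def open_by_lfn(lfn):
    """Pick a replica and 'open' it (here just returning the chosen URL)."""
    replica = resolve_lfn(lfn)[0]   # a real client would pick the closest replica
    return replica                  # stand-in for a POSIX-like open() call

print(open_by_lfn("/grid/atlas/user/example/aod_0001.root"))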
- Workload Management
  - Capability of handling 1 million short jobs (30 min) per day with the RB service: feature needed for SC4; the final short-job number is evaluated to be 1 million (Prio: high)
  - Efficient use of the information system in the matchmaking: capability of sending the jobs to the sites where the input files are present and where there are enough free CPU slots (Prio: high)
  - Support for different priorities based on VOMS groups/roles (Prio: high)
  - The RB should reschedule the jobs in its internal task queue using a prioritization system → this feature is already available in the gLite RB (Prio: high)
  (A toy matchmaking sketch follows below.)
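A toy sketch of the two requirements above: ordering the internal task queue by VOMS group priority and matching each job to a site that already holds its input data and has free CPU slots. Site names, priorities and datasets are invented; this is not the gLite Resource Broker algorithm.

# Toy sketch: (1) order the task queue by VOMS group/role priority,
# (2) match each job to a site that holds its input data and has free slots.
SITES = {
    "IFIC": {"free_slots": 2, "datasets": {"aod_0001", "aod_0002"}},
    "IFCA": {"free_slots": 1, "datasets": {"aod_0003"}},
    "PIC":  {"free_slots": 0, "datasets": {"aod_0001"}},
}
GROUP_PRIORITY = {"/atlas/production": 2, "/atlas/analysis": 1}

def match_site(job):
    """Pick a site holding the input dataset with at least one free slot."""
    for name, info in SITES.items():
        if job["input"] in info["datasets"] and info["free_slots"] > 0:
            info["free_slots"] -= 1
            return name
    return None   # no match: the job stays in the task queue

task_queue = [
    {"id": 1, "input": "aod_0001", "group": "/atlas/analysis"},
    {"id": 2, "input": "aod_0002", "group": "/atlas/production"},
    {"id": 3, "input": "aod_0003", "group": "/atlas/analysis"},
]
# late reshuffling: higher-priority groups are scheduled first
task_queue.sort(key=lambda j: GROUP_PRIORITY.get(j["group"], 0), reverse=True)
for job in task_queue:
    print(f"job {job['id']} ({job['group']}) -> {match_site(job)}")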
- Workload Management (cont.)
  - CE service directly accessible by services/clients other than the RB
  - Allow changing the identity of a job running on the WN
- Monitoring Tools and Accounting
  - A scalable tool to collect VO-specific information
  - Publish/subscribe to logging and bookkeeping and local batch system events for all jobs in the VO
  - Support for accounting, with site, user and group granularity (DGAS or equivalent)
  - Possibility to aggregate by a VO-(user-)specified tag
  (A small aggregation sketch follows below.)
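The accounting requirement (site/user/group granularity plus aggregation by a VO-specified tag) can be pictured as a simple grouping of usage records; the records and field names below are made up and do not follow the DGAS schema.

from collections import defaultdict

# Sketch of accounting aggregation at site/user/group granularity and by a
# VO-specified tag. The records are invented; this is not the DGAS schema.
records = [
    {"site": "IFIC", "user": "user1", "group": "/atlas/analysis",   "tag": "DC2",  "cpu_h": 12.0},
    {"site": "IFIC", "user": "user1", "group": "/atlas/analysis",   "tag": "Rome", "cpu_h": 3.5},
    {"site": "IFCA", "user": "user2", "group": "/atlas/production", "tag": "DC2",  "cpu_h": 40.0},
]

def aggregate(records, key):
    """Sum CPU hours grouped by the given record field (site, user, group, tag)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cpu_h"]
    return dict(totals)

for key in ("site", "user", "group", "tag"):
    print(key, aggregate(records, key))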
LCG Deployment Schedule
4. Conclusions and Perspectives
- EGEE-II
  - Provides the middleware, the general GRID framework, etc.
  - User support, dissemination and training
  - High level of synergy
- TIER-2
  - The progress achieved until now has been very important: a production system is working in an acceptable way
  - Next problem to be solved: to have access to a powerful GRID Distributed Data Analysis system
- The success of GRID in HEP (LHC) will be very important for the e-Science programmes
- FINAL OBJECTIVE: every physicist of any ATLAS centre should be able to do her/his analysis from his/her home institute in an effective and fast way
- EGEE-II and TIER-2
  - Very good relationship with the TIER-2 operation: collaborative framework, user support, GRID middleware progress, GRID Distributed Analysis
  - To go beyond the ATLAS TIER-2 vision: a GRID for High Energy and Nuclear Physics (theoretical and experimental)
  - To extend it to the industrial partners and the national GRID initiatives
Backup Slides
Don Quijote 2
- Due to (a) scalability problems and (b) the progress in GRID middleware, DQ had to be re-engineered into DQ2
- The DQ2 architecture consists of datasets, central catalogues and site services
- DQ2 is based on the concept of dataset versions
  - A dataset is defined as a collection of files or of other datasets
- DQ2 relies on the ATLAS central catalogues (global catalogue), which define the datasets and their locations
- The dataset is also the unit of data movement
- To enable the movement of data, site services have been deployed at the sites; they use a subscription mechanism to move data from one place to another (see the sketch below)
- More information: https://uimon.cern.ch/twiki/bin/view/Atlas/DDM
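A minimal sketch of the dataset/subscription idea described above; the class names are hypothetical, and the real DQ2 central catalogues and site services are considerably more elaborate.

# Minimal sketch of the DQ2 concepts above: a dataset is a named collection of
# files, a central catalogue records which sites hold it, and a site
# "subscribes" to a dataset so its site service fetches the missing files.
class CentralCatalogue:
    def __init__(self):
        self.datasets = {}     # dataset name -> set of files
        self.locations = {}    # dataset name -> set of sites holding a copy

    def define(self, name, files):
        self.datasets[name] = set(files)
        self.locations.setdefault(name, set())

class SiteService:
    def __init__(self, site, catalogue):
        self.site, self.catalogue = site, catalogue
        self.local_files = set()

    def subscribe(self, dataset_name):
        """Fetch whatever files of the dataset are not yet held locally."""
        wanted = self.catalogue.datasets[dataset_name]
        missing = wanted - self.local_files
        self.local_files |= missing                      # stand-in for real transfers
        self.catalogue.locations[dataset_name].add(self.site)
        return missing

catalogue = CentralCatalogue()
catalogue.define("dc2.aod.v1", ["aod_0001.root", "aod_0002.root"])
site = SiteService("IFIC", catalogue)
print("transferred:", sorted(site.subscribe("dc2.aod.v1")))
print("locations:", catalogue.locations["dc2.aod.v1"])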
Example of ATLAS GRID activity (July 2004 - March 2005)
SCIENTIFIC APPLICATIONS ON THE GRID
- 660K jobs in total (LCG, NorduGrid, US Grid3)
- 400 kSI2k·years of CPU
- In the latest period, an average of 7K jobs/day, with 5K in LCG
- Job mix: preparation for Rome, DC2 (short-jobs period), DC2 (long-jobs period)